Domain experts tackling AI's hardest problems in housing and healthcare
EliseAI is the leading AI startup automating complex housing and healthcare systems, which are two of the most operationally demanding domains in the economy. Its platform handles patient scheduling, intake, and front-desk operations for medical practices, and leasing, maintenance, and resident engagement for property managers. These use cases require deep domain expertise in regulated industries where accuracy, latency, and reliability directly impact patient care and resident experience.
EliseAI processes millions of conversations every month across housing and healthcare. Over 80 percent of the country’s largest property management firms use EliseAI’s agents to speed up repetitive admin work like tracking maintenance requests, scheduling apartment tours, and renewing leases. These requests must be handled in real time, which means sub-second latency, high extraction accuracy, and the flexibility to adapt as the product changes weekly.
When their closed-source APIs hit a ceiling on cost, speed, and controllability, EliseAI set out to build their own custom models. They partnered with the Baseten research team to get there.
The cost of renting intelligence
As EliseAI grew rapidly, latency and cost made the need to own their models and SLAs clear. On one critical task, latency sat at 2.2 seconds at p90 with closed-source providers. These models sit on the critical path of real-time conversations between residents and property managers. For EliseAI users, 2.2 seconds versus 250 milliseconds is the difference between an AI that feels responsive and one that feels broken.
What extraction actually looks like
Here's what EliseAI's models do thousands of times per hour. A resident sends a series of messages:
Resident: Hi, I'm looking for a 2-bed near downtown. My move-in is flexible, sometime in March works.
Agent: We have a few options! Any preference on layout?
Resident: Something with a balcony would be great. Oh and my budget is around $2,200.
The model reads this conversation and outputs structured JSON:
{
  "layout_preferences": "2-bedroom with balcony",
  "budget_max": 2200,
  "move_in_date": "2026-03",
  "unit_numbers": null,
  "floorplan_names": null
}
On the surface, the task looks simple, but different fields start to interact with each other: does mentioning a unit number override previously extracted layout preferences? Temporal references are ambiguous: is "ASAP" a date or null? And the line between an expressed preference and a casual question is often unclear.
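Conventions like these can be made explicit in a post-processing step that merges each turn's extraction into a running state. A minimal sketch, using the field names from the JSON above; the override rule and helper names are illustrative assumptions, not EliseAI's actual logic:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionResult:
    layout_preferences: Optional[str] = None
    budget_max: Optional[int] = None
    move_in_date: Optional[str] = None
    unit_numbers: Optional[list] = None
    floorplan_names: Optional[list] = None

def merge_turn(prev: ExtractionResult, new: ExtractionResult) -> ExtractionResult:
    """Merge a newly extracted snapshot into the running state.

    Illustrative convention: naming a specific unit overrides any
    previously extracted layout preference, since the resident has
    moved from a general preference to a concrete choice.
    """
    merged = ExtractionResult(
        layout_preferences=new.layout_preferences or prev.layout_preferences,
        budget_max=new.budget_max or prev.budget_max,
        move_in_date=new.move_in_date or prev.move_in_date,
        unit_numbers=new.unit_numbers or prev.unit_numbers,
        floorplan_names=new.floorplan_names or prev.floorplan_names,
    )
    if new.unit_numbers:
        # Concrete unit wins over a general layout preference.
        merged.layout_preferences = None
    return merged
```

Encoding each convention as a testable rule is what turns "often unclear" cases into something a fine-tuned model can be evaluated against.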
EliseAI runs dozens of these extraction tasks across housing and healthcare. They range from straightforward (parsing a tour time from a confirmation message) to complex and challenging (inferring a full preference snapshot from a multi-turn conversation with implicit signals).
General-purpose models required increasingly elaborate prompting to handle these cases. Even then, accuracy plateaued on the most complex conversations.
One of the most important capabilities in EliseAI's leasing platform was a voice agent feature that had to respond quickly. Closed-source models took around 2 seconds on average, but the voice pipeline had a hard latency ceiling of 1.3 seconds. The team couldn't ship the feature without a much faster model. Fine-tuning a smaller open-source model to match and surpass that accuracy under the latency ceiling had immediate product impact.
Training on Baseten
EliseAI runs all fine-tuning through Baseten Training. The team kicks off jobs across tasks and configurations without managing GPU infrastructure and deploys the resulting models directly to Baseten's inference platform.
The default recipe is supervised fine-tuning on Qwen-4B. The 4B parameter size is a hard constraint from EliseAI's latency budget: models need to produce structured JSON fast enough to sit on the critical path of a real-time conversation. On their earliest production tasks, SFT on clean data matched the teacher model's accuracy at a fraction of the latency.
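Supervised fine-tuning distills the teacher's behavior into the smaller model through input/output pairs: the raw conversation as the prompt, the teacher's structured output as the label. A minimal sketch of what one chat-format training record might look like (the schema and system prompt here are illustrative, not Baseten's actual API):

```python
import json

def make_sft_record(conversation: str, teacher_json: dict) -> dict:
    """Build one chat-format SFT example: the prompt is the raw
    conversation, the target is the teacher model's structured output."""
    return {
        "messages": [
            {"role": "system", "content": "Extract resident preferences as JSON."},
            {"role": "user", "content": conversation},
            # The assistant turn is the label the 4B model learns to emit.
            {"role": "assistant", "content": json.dumps(teacher_json)},
        ]
    }

record = make_sft_record(
    "Resident: Something with a balcony would be great. Budget is $2,200.",
    {"layout_preferences": "balcony", "budget_max": 2200},
)
```

"Clean data" in this setup means the teacher outputs have been checked before they become labels, which is exactly why the ground-truth errors described below matter so much.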
After onboarding, EliseAI's team was launching training jobs independently. The tight coupling between training and serving meant they could go from "training complete" to "deployed in production inference" without additional integration work.
Nearly every time we found the fine-tuned model made a mistake in production, the root cause was the same: the ground truth data contained the error. Not the model, not the fine-tuning. The data.
With Baseten, EliseAI achieved
99% accuracy on one of the most important features in the leasing platform
250ms p90 latency, down from 2.2 seconds on closed-source APIs
~60% reduction in inference costs
Frontier-level accuracy from a 4B parameter model, matching or exceeding models 25x its size on production benchmarks
Independent training operations with EliseAI's team launching jobs on their own after initial onboarding
Research support
For straightforward extraction tasks, SFT on clean data was enough. But some tasks are harder. When ambiguity is high and "correct" depends on conventions not fully captured in training data, SFT hits a ceiling.
EliseAI worked with our research team to push past it. The Baseten post-training research team, with backgrounds from Oxford, Cambridge, and MATS, brought hands-on experience with advanced post-training techniques across production workloads.
Together, they layered in On-Policy Self-Distillation (OPSD) and Iterative SFT for tasks where standard fine-tuning plateaued. OPSD outperformed pure SFT across every field EliseAI tested. Iterative SFT targeted specific failure modes the model was making in production, generating corrective training signal from the model's own mistakes.
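The core of Iterative SFT can be sketched as a simple loop: run the current model over held-out conversations, keep the cases it gets wrong, and turn them into corrective training examples for the next round. A minimal illustration; the helper name and record shape are assumptions for the sketch, not Baseten's actual pipeline:

```python
def corrective_examples(conversations, predictions, gold_labels):
    """Keep only the conversations the current model got wrong and
    pair them with the correct label, producing targeted SFT data
    that concentrates training signal on observed failure modes."""
    new_data = []
    for convo, pred, gold in zip(conversations, predictions, gold_labels):
        if pred != gold:  # model mistake: add a corrective example
            new_data.append({"input": convo, "target": gold})
    return new_data

# One iteration: two of three predictions disagree with the gold labels,
# so two corrective examples are added to the next training round.
batch = corrective_examples(
    ["convo_a", "convo_b", "convo_c"],
    [{"budget_max": 2000}, {"budget_max": 2200}, {"budget_max": None}],
    [{"budget_max": 2200}, {"budget_max": 2200}, {"budget_max": 1800}],
)
```

Because the examples come from the model's own mistakes rather than a generic dataset, each round of training addresses the specific failure modes seen in production.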
Our team recommended which techniques to try first, helped diagnose where SFT was plateauing, and structured the training pipeline for EliseAI's extraction tasks specifically.
The researchers we worked with at Baseten are among the smartest people you can find anywhere.
Continuous improvement
Fine-tuning isn't a one-time project. EliseAI's product requirements change weekly, and the AI space moves fast. Models need to be maintained and improved continuously to stay ahead of what general-purpose APIs can do out of the box.
Baseten Training supports this by making iteration cheap. EliseAI can test new training configurations, evaluate results against a stable test set, and deploy updates to production, all within the same platform. For a team shipping model updates weekly, that speed is what makes continuous improvement practical rather than aspirational.
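Evaluating every new configuration against the same stable test set is what makes those weekly updates comparable. A minimal sketch of a per-field exact-match scorer, assuming extraction outputs shaped like the JSON example earlier (the function is illustrative, not part of Baseten's platform):

```python
def field_accuracy(predictions, gold_labels):
    """Per-field exact-match accuracy over a fixed test set, so each
    new training configuration is scored against the same baseline."""
    totals, correct = {}, {}
    for pred, gold in zip(predictions, gold_labels):
        for field, gold_value in gold.items():
            totals[field] = totals.get(field, 0) + 1
            if pred.get(field) == gold_value:
                correct[field] = correct.get(field, 0) + 1
    return {f: correct.get(f, 0) / totals[f] for f in totals}

scores = field_accuracy(
    [{"budget_max": 2200, "move_in_date": "2026-03"},
     {"budget_max": 1800, "move_in_date": None}],
    [{"budget_max": 2200, "move_in_date": "2026-03"},
     {"budget_max": 1900, "move_in_date": None}],
)
```

Field-level scores make regressions visible: a configuration that improves one field but quietly degrades another fails the comparison before it ships.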
You can't treat fine-tuning as a one-and-done check item. This space moves so fast that you have to continuously keep up.
Results
“Once you know your task deeply and have the expertise to design the right system around it, fine-tuned open-source models can offer major advantages. They often provide stronger control, lower latency, and better efficiency for specialized, latency-sensitive workflows. In many cases, a model like Qwen-4B can be a particularly compelling option.”
Powered by custom, fine-tuned models and Baseten's Training Product and research support, EliseAI owns the intelligence behind its platform — and continues to push the boundaries of what AI automation can deliver for the hardest problems in housing and healthcare.