What Building a Voice AI Product Taught Me About Token Limits


The hard lessons from shipping an ambient AI product where 30-minute recordings broke everything


It Worked in the Demo

Here's how it usually goes with AI products: you build something, you demo it, everyone's impressed, and then a real user does something you didn't plan for and the whole thing breaks in a way you didn't know was possible.

That's exactly what happened to me.

I was working at a company that needed an ambient voice AI product — something that listens to real-time conversations, transcribes them, and turns them into structured documentation. Think clinical notes, charts, summaries. The kind of thing that saves professionals hours of documentation time every day.

I built the MVP by myself over the course of months. This was a high-visibility project — it was outside the company's core product but everyone from leadership down knew it was the future. All eyes were on it. People were betting their roadmap on it. During the process, the team around me changed — people got let go, roles shifted. The pressure was real.

But we got it working. Real-time transcription flowing into an LLM that structured everything into the right format. We had users testing it in real workflows. And the work opened doors — leadership saw that the same underlying technology could be applied to another product with high profit potential. The voice AI wasn't just a feature anymore. It was a platform play.

Then someone recorded a 30-minute session. And the response came back hallucinated garbage.


The Wall Nobody Warns You About

Here's the thing about token limits: you don't understand them until they break your product.

I know that sounds obvious. Everyone in AI engineering knows models have context windows. Everyone knows there's an input limit and an output limit. But knowing it as a concept and hitting it in production with a user waiting for their documentation are two completely different experiences.

What was happening: a 30-minute session generates a lot of transcription text. When that text exceeded the model's context window, the response would either get truncated — just cut off mid-sentence — or the model would start hallucinating. It would make up things that didn't happen. It would confuse details. It would generate documentation that looked perfectly structured and was completely wrong.
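The failure mode above comes down to budgeting: prompt plus transcript plus the room the model needs for its answer must fit inside the window. A minimal sketch of that check, using the common four-characters-per-token heuristic for English text (a real tokenizer gives exact counts; all the numbers here are illustrative):

```python
# Rough token-budget check before calling the model. The 4-chars-per-token
# ratio is only a heuristic average for English prose, not an exact count.
CHARS_PER_TOKEN = 4

def estimate_tokens(text: str) -> int:
    """Crude token estimate from character count."""
    return max(1, len(text) // CHARS_PER_TOKEN)

def fits_in_context(transcript: str, prompt: str,
                    context_window: int, reserved_for_output: int) -> bool:
    """True if prompt + transcript still leave room for the model's response."""
    used = estimate_tokens(prompt) + estimate_tokens(transcript)
    return used + reserved_for_output <= context_window

# A 30-minute session at ~150 spoken words per minute is ~4,500 words,
# which is already thousands of tokens before the prompt is added.
```

The point of the check is the part the model never does for you: failing loudly before the call, instead of silently truncating or hallucinating after it.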

The dangerous part wasn't the obvious failures. It was the subtle ones. A hallucinated document that looks plausible — correct format, correct terminology, wrong content — is worse than an error message. The user might not catch it. That's a liability issue, not just a bug.

And some users wanted hour-long sessions. An hour of ambient audio transcribed into text blows past every reasonable context window. This wasn't an edge case. This was the product.

I didn't know any of this going in. I kept getting responses that were cut off or wildly inaccurate and I couldn't figure out why. The transcription looked fine. The prompt looked fine. The model just... broke. It took real debugging to trace it back to token limits on specific models. There's no error message that says "you sent too many tokens and now I'm making things up." The model just confidently generates nonsense. I've written about context windows before — but this was where I learned the lesson the hard way.

That's what makes token limits different from most engineering constraints. A database tells you when a query is too large. An API returns a 413. But an LLM just does its best with whatever it has, and its best might be fiction.


The Fix: Chunking, Summarizing, and Picking Your Battles

Once I understood the problem, the solution was straightforward in concept but tricky in practice.

We chunked the transcriptions. Instead of sending the entire 30- or 60-minute transcript to the model at once, we broke it into segments and processed each one. Then we summarized — compacting the transcript while preserving the important information. The domain-specific content needed to survive the compression. The small talk did not.
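The chunk-then-summarize approach can be sketched like this. The chunk sizes and boundaries are illustrative, and `summarize_chunk` is a hypothetical stand-in for an LLM call with a "compress, but keep the domain-specific details" instruction:

```python
from typing import Callable

def chunk_transcript(transcript: str, max_chars: int, overlap: int = 100) -> list[str]:
    """Split on sentence-ish boundaries so no chunk exceeds max_chars.
    A small overlap preserves context that straddles a boundary."""
    chunks = []
    start = 0
    while start < len(transcript):
        end = min(start + max_chars, len(transcript))
        # back up to the last sentence break inside the window, if any
        cut = transcript.rfind(". ", start, end)
        if cut > start and end < len(transcript):
            end = cut + 1
        chunks.append(transcript[start:end].strip())
        if end >= len(transcript):
            break
        start = max(end - overlap, start + 1)  # always move forward
    return chunks

def summarize_transcript(transcript: str,
                         summarize_chunk: Callable[[str], str],
                         max_chars: int = 8000, overlap: int = 100) -> str:
    """Run each chunk through the (hypothetical) LLM summarizer, then join."""
    return "\n".join(summarize_chunk(c)
                     for c in chunk_transcript(transcript, max_chars, overlap))
```

In practice the interesting tuning lives inside `summarize_chunk`: how hard you compress, and what the prompt tells the model it must never drop.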

We also moved transcription in-house instead of relying on a third-party API. That gave us control over the pipeline and helped with costs. The in-house transcription wasn't as accurate as something like AWS Transcribe, but here's what surprised us: the LLM was really good at cleaning up bad transcription. It could take a messy transcript with misheard words and missing punctuation and correct most of it during the structuring step. Not perfect — but good enough that a professional could review the final output and make minor corrections before submission. That "good enough to review" bar turned out to be the right target. You don't need perfect transcription if the LLM can bridge the gap.

There's always a tradeoff between compression and fidelity, and we were constantly tuning that dial.


The Part That Kept Me Up at Night: The Money

Here's what nobody in AI Twitter talks about: cost modeling.

Building an AI product that works is one thing. Building one that makes money is a completely different problem. And there was a contract on the line. Real money. Leadership had eyes on the numbers. The math had to work before anyone signed anything.

Every token costs money. Every API call costs money. Every minute of audio transcription costs money. When your product processes 30 to 60 minutes of audio per session, and a business runs dozens of sessions a day, the numbers add up fast. If the per-session cost is too high, the product can't be priced competitively. If you cut costs too aggressively, the quality drops and users won't trust it.

I was building systems to test the cost models before we committed to anything. Running scenarios: what happens if the average session length is 20 minutes vs 40? What happens if we use a cheaper model for the summarization step but keep the expensive model for the structuring step? What's the break-even point? What do we have to sacrifice?
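Scenario-testing like that can start as a toy spreadsheet in code. Every price, rate, and ratio below is a made-up placeholder, not our real numbers — the point is that each lever is explicit so you can re-run the math when a provider changes pricing:

```python
def session_cost(audio_minutes: float,
                 transcription_per_min: float = 0.006,   # placeholder $/min
                 words_per_min: float = 150.0,           # typical speech rate
                 tokens_per_word: float = 1.3,           # rough English ratio
                 summarize_price_per_1k: float = 0.0005, # cheap model, placeholder
                 structure_price_per_1k: float = 0.01    # expensive model, placeholder
                 ) -> float:
    """Toy per-session cost: transcription + cheap summarization pass
    + expensive structuring pass over the compressed transcript."""
    tokens = audio_minutes * words_per_min * tokens_per_word
    transcription = audio_minutes * transcription_per_min
    summarize = tokens / 1000 * summarize_price_per_1k
    # assume summarization compresses the transcript ~4x before structuring
    structure = (tokens / 4) / 1000 * structure_price_per_1k
    return transcription + summarize + structure
```

Multiply the result by sessions per day per customer and the margin conversation gets very concrete, very quickly.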

We used a lower-tier model for the summarization step — not because it was better, but because cost forced our hand. The lower-tier models were tricky. They'd miss things, lose nuance, make odd choices. If we could have used the best model for every step, we would have. But when you're modeling cost per session across thousands of daily users, the difference between a $0.01 and $0.10 step matters. You don't always get to pick the best tool. You pick the one the math allows.

I think this is what a lot of AI companies are dealing with right now. The technology works. The demos are impressive. But when you sit down and model the actual cost of running it at scale for real users with real usage patterns, the margins get thin. The companies that figure out how to deliver quality while managing token economics are the ones that survive. The ones that assume "we'll optimize later" burn through their runway.


The Ground Keeps Moving

There's another thing nobody warns you about: models aren't interchangeable. You can't just swap in a newer, cheaper model and expect everything to work. Every model handles instructions differently. The guardrails you built for one model might make a different model worse: it interprets your constraints differently, follows your formatting rules differently, hallucinates in different ways.

That means either rewriting your prompts specifically for each model or trying to write a generic prompt that works across any model you might use. Both options are tricky. Model-specific prompts are more accurate but expensive to maintain. Generic prompts are portable but lose precision. We hit this every time a new model dropped and we wanted to take advantage of the better pricing or larger context window.

Here's the wild part: within a month or two, the models got better and cheaper. Context windows expanded. Costs dropped. The problem we spent weeks architecting around became significantly easier to solve with the next generation of models. That's the reality of building AI products right now — the ground shifts under your feet while you're building on it. The solutions you engineer today might be unnecessary tomorrow. But you can't wait for tomorrow when you have users today. The move right now is to get a working product out — even if it's an MVP with tradeoffs — because the tech will catch up to you faster than you think. Way faster than any previous generation of technology. Ship something decent, learn from real users, and the models will close the gap behind you.

This raises a question I don't have an answer to yet. By the time we perfected what we were building, the tech had moved far enough that we could have rebuilt it better from scratch. And that's going to keep happening. So do we keep maintaining legacy codebases? Or can AI take an old codebase and rewrite it better on the second pass, or the third? Do we even need to hold onto code the way we used to? I genuinely don't know. But it's a question every AI engineering team is going to have to answer soon.

After the MVP proved the concept, I got moved onto the API team — from a standalone application to serverless architecture and cloud services, integrating the same functionality into the broader platform. Different constraints, different architecture, different headaches. But the core lessons from the MVP phase — chunk your transcripts, compress intelligently, watch your token budget — carried straight through.


What You Don't Know Until You Build It

The biggest takeaway from this project is this: token limits are one of those problems you don't truly understand until you've shipped a product that hits them.

You can read every blog post about context windows. You can memorize the token counts for every model. You can nod along when someone explains that longer inputs degrade output quality. But until you're staring at hallucinated documentation from a 30-minute recording and trying to figure out why your perfectly good pipeline is generating fiction — you don't really get it.

It's the same pattern I've seen across my career. The hardest problems aren't the ones you plan for. They're the ones that don't announce themselves. The model doesn't throw an error. It doesn't return a 500. It just quietly gets worse, and you have to be the one who notices.

If you're building AI products right now — especially anything that processes long-form content like voice recordings, documents, or multi-turn conversations — budget for this. Not just in tokens. In time. In architecture. In the hard conversations about what quality tradeoffs you're willing to make and what the economics actually look like at scale.

The demo always works. The question is what happens when a real user hits record and talks for an hour.

— Bill John Tran

© 2026 Bill John Tran. All rights reserved.
