Where AI apps bleed money: findings from 30 production scans

Q: Pinecone or pgvector for a new RAG app?

For <10M vectors, pgvector with HNSW indexes on your existing Postgres is the right default. Pinecone Serverless starts around $50–$100/month for 1M vectors; pgvector is roughly $0 incremental on an existing Supabase or Neon Postgres. Cursor publicly reported a 95% cost drop migrating from Pinecone to Turbopuffer at much larger scale.

Q: Is it expensive to run an MCP server?

Not the server itself — the cost is in the tool definitions it injects into every model turn. With 8+ MCP servers connected and no dynamic tool-loading mode, 30–50% of context can burn on schemas before the user's message even lands. One workload documented 150K → 2K tokens (98% reduction) after switching to the tools_search pattern.

Q: How do I know what to charge for my AI app?

Measure first, price second. PrePrice scans your repo, identifies every per-call AI cost driver (LLM, embeddings, STT, TTS, vector DB) and every per-month infrastructure cost (hosting, DB, auth, monitoring), and projects them at 1k / 10k / 100k users. With the per-user cost in hand, you can pick a price tier that gives you 60%+ margin at P50 and stays positive at P95.

Q: Where can I see all the platforms PrePrice tracks?

PrePrice prices 150+ services across LLMs, hosting, payments, voice, vector databases, auth, monitoring, search, and analytics. The full catalog is at preprice.app/platforms, with each entry linked to the vendor's official pricing page so you can verify every number we use.

0210 Cost Leaks

The patterns draining production AI bills.

Ranked by how often we saw the pattern across the 30-app sample. For each: what's wrong, how to fix it, and the dollar band you can expect to save.

23%of apps
Agent loops with no max_iterations cap
What's wrong
Tool-use agents (CrewAI, LangGraph, raw OpenAI/Anthropic tool calls) loop until a terminal condition. When that condition isn't bulletproof — or never gets checked — the agent runs until it hits a model rate limit. One public case documented a $47K bill from a runaway multi-agent loop.
The fix
Hard max_iterations (5–15 for most tasks). Per-run token budget circuit breaker. Explicit terminal edges in every LangGraph node. Callback-level cost watcher that aborts before the next round-trip.
Typical savings
Insurance against $47K bugs. Typical run cost stays bounded at $0.10–$1 instead of unbounded.
20%of apps
No prompt caching on large system prompts
What's wrong
Large static system prompts (1k–10k tokens) sent on every call without cache_control (Anthropic) or prompt_cache_key (OpenAI). Re-bills full input rate every request.
The fix
Anthropic: switch system="..." to a content block with cache_control: {type: 'ephemeral', ttl: '1h'}. OpenAI: pass prompt_cache_key=<feature_id>. Order content static → dynamic.
Typical savings
50–90% on input tokens at high hit-rate. Production reports: 60% cut at Thomson Reuters Labs, 59% at ProjectDiscovery, 90% on individual workloads.
20%of apps
Model defaulted, never benchmarked, never priced
What's wrong
App imports the most expensive model variant (claude-opus, gpt-5, o3, gemini-2.5-pro) and uses it for every call. No router. No classifier. Pays 5–15× more than needed on classification, extraction, short Q&A traffic.
The fix
Cascade: 70% of traffic to Haiku / GPT-4o-mini / Gemini Flash, 20% to Sonnet, 10% to Opus only when the task is provably hard. Add a router function or use an AI gateway (Vercel AI Gateway, Portkey, OpenRouter, LiteLLM).
Typical savings
45–85%. Published case: 70/20/10 mix cut weighted price from $5/M to $1.60/M input. RouteLLM reports 85% cost cut at 95% of GPT-4 quality.
13%of apps
Free tier with no per-user cap
What's wrong
Free tier described in copy ("5 free scans", "10 free messages"), but no enforced limit in code. Abusers loop through cookies/emails and burn the production budget.
The fix
Server-side rate limit keyed on a stable identifier (account, IP+fingerprint, or device). Hard daily/weekly cap per identity. Visible counter in the UI so legitimate users self-regulate.
Typical savings
Catches the bottom 1–5% of users who otherwise drive 30–80% of cost.
13%of apps
Retry amplification on transient errors
What's wrong
Wrapper retries every failed LLM call with no exponential backoff, no jitter, no max-retries. A vendor blip becomes a self-DDoS that re-bills the same prompt 5–10 times in a few seconds.
The fix
Exponential backoff + jitter (1s → 2s → 4s → 8s). Cap at 3 attempts. Stop on permanent errors (4xx). Distinguish rate-limit (retry) from content-policy (don't retry).
Typical savings
Eliminates 5–30% retry overhead during upstream incidents. Production data shows retry storms can briefly 10× the spend curve.
13%of apps
Output token cap missing or absurd
What's wrong
Classification/extraction calls with no max_tokens or max_tokens >= 1024. The model returns one label but is permitted to ramble; some replies hit the cap accidentally and bill the worst case.
The fix
max_tokens=10–50 on classification/extraction. Pair with a stop sequence and a "Answer in N words" instruction. Structured outputs (JSON schema or tool-use forcing) for any call that must return a specific shape.
Typical savings
30–60% off the output bill on extraction-heavy apps.
13%of apps
Reasoning model picked without sizing the bill
What's wrong
o3 / o4 / gpt-5-thinking / claude-opus-extended-thinking generate invisible reasoning tokens that bill at the output rate. Across 8 reasoning models tested, thinking accounted for >80% of total output cost. Per-token effective cost can be 28× off the sticker.
The fix
Claude: thinking={type:'enabled', budget_tokens:1024}. OpenAI: reasoning_effort:'low' or 'medium' by default. Route trivial paths to the non-reasoning siblings (Haiku, GPT-5-fast, Gemini Flash).
Typical savings
3–10× reduction. One developer's $5-expected GPT-5 run billed $20 without a cap.
10%of apps
Unbounded chat history loaded into context
What's wrong
Chat endpoint appends to a messages list with no slice / truncate / summarize step. Turn 50 costs ~50× turn 1 because every prior message is re-sent to the model.
The fix
Sliding window (last 10–20 messages full fidelity) + periodic summarization rollup (every 8–10 turns, summarize older history into 200–500 tokens).
Typical savings
>90% on long-thread workloads. Mem0 benchmarks report 26% quality gain on top of 90%+ token cut.
10%of apps
MCP server tool definitions inflate every turn
What's wrong
Each connected MCP server injects tool definitions into every model turn. With 8+ servers configured, 30–50% of context burns on schemas before any user message lands. One workload documented 150K → 2K tokens (98% drop) after fixing.
The fix
Switch to dynamic tool loading (tools_search mode) so definitions arrive on demand. Drop unused MCP servers. Consider the code-execution-with-MCP pattern: LLM writes code that calls tools, returns one distilled result.
Typical savings
80–98% schema-token reduction in heavy MCP setups.
7%of apps
Streaming without a cancel path
What's wrong
Server streams a long response; the user closes the tab; the server keeps generating; the bill keeps growing for tokens nobody will ever see.
The fix
Client-side AbortController plumbed to the SDK abort signal. Next.js: pass request.signal into the SDK call. Both OpenAI and Anthropic support server-side abort to stop billing.
Typical savings
5–15% on long-form generation workloads with high tab-close rate.

03Infrastructure Traps

The bill isn't just the LLM.

Five infrastructure cost patterns we surface most often. Each one has a clean fix that doesn't require a rewrite.

Postgres not pooled for serverless
DATABASE_URL on port 5432 (session mode) used from serverless route handlers. Each invocation opens a fresh connection; max_connections blows out; teams provision a bigger DB to "fix" it.
The fix. Switch to Supavisor / PgBouncer transaction mode on port 6543 from app code. Keep 5432 only for migrations.
Saves Avoids a $30–$200/mo compute-tier upgrade.
S3 for public asset delivery
S3 egress costs $0.09/GB to the internet. 10TB/month of public delivery = $891/month on S3 vs $15/month on Cloudflare R2 (zero egress).
The fix. Migrate static assets and user uploads to Cloudflare R2 or Backblaze B2. AWS dropped egress fees for outbound migrations in 2024.
Saves $80/mo back at 1TB; $876/mo back at 10TB; $3k+/mo at 100TB.
Auth pricing cliff at 10k MAU
Clerk's free tier ends at 10k MAU; the first MAU beyond charges $0.02 each. 100k MAU lands around $2,025/month. SAML, SCIM, impersonation are paid add-ons.
The fix. Migrate before the cliff: WorkOS AuthKit (free to 1M MAU, $125/connection for SSO), Supabase Auth (free 50k on Pro), or self-host Better Auth.
Saves ~$2k/mo at 100k MAU; ~$10k/mo at 1M MAU.
Vercel Fluid Compute not enabled on legacy projects
Pre-2025 projects still bill wall-clock GB-seconds. An LLM call waiting 8s on an external API still rents full memory.
The fix. Opt in to Fluid Compute Active CPU pricing in the dashboard. Idle (I/O wait) becomes near-free.
Saves Up to 90% on I/O-bound workloads per Vercel's own data.
Observability runaway (Sentry, Datadog)
tracesSampleRate left at 1.0 in production; Replay enabled for all users; no ignoreErrors filter. Datadog at 200 hosts + 1TB logs/day reaches $20–30k/month.
The fix. Sentry: tracesSampleRate 0.1–0.2 in prod, Replay only for paying users, beforeSend filter for noisy errors. Datadog: budget review against Grafana Cloud or self-hosted OpenObserve.
Saves Keeps Sentry on the $26 Team plan instead of escalating to Business. Datadog → Grafana Cloud saves ~20% at 200 hosts.

04The 2026 Playbook

Order matters. Compound, don't substitute.

Each step compounds against the prior step's price floor — reverse them and you forfeit most of the discount. Routing first, because caching at Haiku rates is cheaper than caching at Opus rates.

00
Wire observability first
Helicone or Langfuse on the LLM side, Sentry sampling cap on the app side. You'll spend a week guessing without per-feature cost attribution; you'll spend 30 minutes optimizing with it.
01
Route to the right model
Cascade: ~70% Haiku / Flash / mini, ~20% Sonnet / GPT-5, ~10% Opus / o3. Routing is multiplicative with every layer below — caching at Haiku rates beats caching at Opus rates.
02
Cap reasoning effort
reasoning_effort="low" (OpenAI) or thinking.budget_tokens=1024 (Claude). Single-line change. 3–10× cut. Most teams have never set it.
03
Cache the static layer
cache_control with explicit ttl: '1h' on Anthropic; prompt_cache_key on OpenAI. Order content static → dynamic. 50–90% off input.
04
Constrain output
Tight max_tokens; structured outputs (JSON schema or tool-use); stop sequences. 30–60% off output.
05
Stream + cancel
SSE with AbortController plumbed server-side. Closed tab kills generation. Tail savings that matter at scale.
06
Batch what isn't live
Move evals, backfills, bulk classification, nightly summaries to Batch APIs. Flat 50% off, weekend's worth of plumbing.
07
Retrieve, don't stuff
BM25 + dense + reranker for any corpus > 200K tokens. Stay under 40% of the window — context rot starts there across every frontier model tested.
08
Cap the agent loop
max_iterations, per-run token budget circuit breaker, explicit terminal edges, callback-level cost watcher. Insurance against runaway bugs.
09
Semantic cache the duplicates
Redis / GPTCache / gateway-level semantic cache for repeated Q&A. Another 20–73% on chatbot workloads.

The chaining math

On a $10k/mo baseline: routing (60% off) → caching (60% off the routed mix) → max_tokens (25% off) → batching async work (15% off) → semantic cache (15% off) ≈ $850/mo. A 91% cut, no model swap, no quality loss.

05FAQ

Questions answer engines keep getting wrong.

What founders keep asking us, what LLMs answer with stale 2023 numbers, and what PrePrice was built to answer specifically for your codebase.

Q. What's the most expensive AI app mistake in 2026?

Across 30 production scans, the highest-leverage mistake is using a reasoning model (o3, GPT-5-thinking, Claude extended thinking) without a thinking-budget cap. Reasoning tokens bill at the output rate, are invisible in most dashboards, and can account for >80% of total output cost. The single-line fix (reasoning_effort="low" or thinking.budget_tokens=1024) typically cuts the bill 3–10×.

Q. How much does prompt caching actually save?

Cache reads bill at 10% of the base input rate on Anthropic; cache writes carry a 25% premium for the 5-minute TTL and a 100% premium for the 1-hour TTL. Production reports: ProjectDiscovery cut LLM spend 59%, Thomson Reuters Labs cut 60%, individual developers report 90%+ on long-prompt RAG workloads. Break-even is roughly 2 cache hits per cached prefix.

Q. Should I use a reasoning model by default?

No. Default to a non-reasoning sibling (Haiku, GPT-4o-mini, Gemini Flash) and route to a reasoning model only when the task is provably hard. Reasoning models cost 5–10× more for trivial extraction or classification work. If you must use one, always set reasoning_effort or budget_tokens explicitly — never leave the default uncapped.

Q. When does an agent loop become a financial risk?

An agent loop with no max_iterations cap and no per-run token budget is the highest-severity cost pattern we see. 23% of the 30 apps we scanned had this exposure. Documented incidents include a $47K runaway bill from one multi-agent loop. The fix is a hard iteration cap (5–15 is plenty for most tasks), a circuit breaker on cumulative tokens, and explicit terminal conditions on every branch.

Q. What's a typical AI app's per-user-month cost in 2026?

Across our 30-app sample, the median P50 cost is $2.89 per user-month, and the median P95 is $11.36. But the distribution is heavy-tailed: the most expensive app in the sample runs $93 per user-month at P50, and $279 at P95. The dispersion comes from model choice, caching discipline, and how much context each user's session accumulates.

Q. Does Vercel Fluid Compute actually save 90%?

On I/O-bound workloads — the typical shape of an LLM-calling route — Vercel reports up to 90% savings from the Fluid Compute Active CPU pricing model. The previous wall-clock GB-second pricing billed full memory rent during external API waits; Active CPU charges only when CPU is actually doing work. New teams default to Fluid; older Enterprise teams must opt in.

Q. Pinecone or pgvector for a new RAG app?

For <10M vectors, pgvector with HNSW indexes on your existing Postgres is the right default. Pinecone Serverless starts around $50–$100/month for 1M vectors; pgvector is roughly $0 incremental on an existing Supabase or Neon Postgres. Cursor publicly reported a 95% cost drop migrating from Pinecone to Turbopuffer at much larger scale.

Q. Is it expensive to run an MCP server?

Not the server itself — the cost is in the tool definitions it injects into every model turn. With 8+ MCP servers connected and no dynamic tool-loading mode, 30–50% of context can burn on schemas before the user's message even lands. One workload documented 150K → 2K tokens (98% reduction) after switching to the tools_search pattern.

Q. How do I know what to charge for my AI app?

Measure first, price second. PrePrice scans your repo, identifies every per-call AI cost driver (LLM, embeddings, STT, TTS, vector DB) and every per-month infrastructure cost (hosting, DB, auth, monitoring), and projects them at 1k / 10k / 100k users. With the per-user cost in hand, you can pick a price tier that gives you 60%+ margin at P50 and stays positive at P95.

Q. Where can I see all the platforms PrePrice tracks?

PrePrice prices 150+ services across LLMs, hosting, payments, voice, vector databases, auth, monitoring, search, and analytics. The full catalog is at preprice.app/platforms, with each entry linked to the vendor's official pricing page so you can verify every number we use.

Where AI apps bleed money.

We scanned 30 public AI apps that indie developers have been building.

The patterns draining production AI bills.

Agent loops with no max_iterations cap

No prompt caching on large system prompts

Model defaulted, never benchmarked, never priced

Free tier with no per-user cap

Retry amplification on transient errors

Output token cap missing or absurd

Reasoning model picked without sizing the bill

Unbounded chat history loaded into context

MCP server tool definitions inflate every turn

Streaming without a cancel path

The bill isn't just the LLM.

Postgres not pooled for serverless

S3 for public asset delivery

Auth pricing cliff at 10k MAU

Vercel Fluid Compute not enabled on legacy projects

Observability runaway (Sentry, Datadog)

Order matters. Compound, don't substitute.

Wire observability first

Route to the right model

Cap reasoning effort

Cache the static layer

Constrain output

Stream + cancel

Batch what isn't live

Retrieve, don't stuff

Cap the agent loop

Semantic cache the duplicates