A working log.

Where AI apps bleed money.

Patterns we keep finding when we scan AI apps in the wild. This is a working log, not a one-time report — we add new findings as we scan more codebases, so check back.

01The Sample

We scanned 30 public AI apps that indie developers have been building.

A mix of consumer products and agentic systems — skill packs, MCP servers, multi-agent crews. All public repos, all anonymized below. The findings on this page are what showed up across the codebases, not a profile of any individual project.

0210 Cost Leaks

The patterns draining production AI bills.

Ranked by how often we saw the pattern across the 30-app sample. For each: what's wrong, how to fix it, and the dollar band you can expect to save.

  1. 23%of apps

    Agent loops with no max_iterations cap

    What's wrong

    Tool-use agents (CrewAI, LangGraph, raw OpenAI/Anthropic tool calls) loop until a terminal condition. When that condition isn't bulletproof — or never gets checked — the agent runs until it hits a model rate limit. One public case documented a $47K bill from a runaway multi-agent loop.

    The fix

    Hard max_iterations (5–15 for most tasks). Per-run token budget circuit breaker. Explicit terminal edges in every LangGraph node. Callback-level cost watcher that aborts before the next round-trip.

    Typical savings

    Insurance against $47K bugs. Typical run cost stays bounded at $0.10–$1 instead of unbounded.

  2. 20%of apps

    No prompt caching on large system prompts

    What's wrong

    Large static system prompts (1k–10k tokens) sent on every call without cache_control (Anthropic) or prompt_cache_key (OpenAI). Re-bills full input rate every request.

    The fix

    Anthropic: switch system="..." to a content block with cache_control: {type: 'ephemeral', ttl: '1h'}. OpenAI: pass prompt_cache_key=<feature_id>. Order content static → dynamic.

    Typical savings

    50–90% on input tokens at high hit-rate. Production reports: 60% cut at Thomson Reuters Labs, 59% at ProjectDiscovery, 90% on individual workloads.

  3. 20%of apps

    Model defaulted, never benchmarked, never priced

    What's wrong

    App imports the most expensive model variant (claude-opus, gpt-5, o3, gemini-2.5-pro) and uses it for every call. No router. No classifier. Pays 5–15× more than needed on classification, extraction, short Q&A traffic.

    The fix

    Cascade: 70% of traffic to Haiku / GPT-4o-mini / Gemini Flash, 20% to Sonnet, 10% to Opus only when the task is provably hard. Add a router function or use an AI gateway (Vercel AI Gateway, Portkey, OpenRouter, LiteLLM).

    Typical savings

    45–85%. Published case: 70/20/10 mix cut weighted price from $5/M to $1.60/M input. RouteLLM reports 85% cost cut at 95% of GPT-4 quality.

  4. 13%of apps

    Free tier with no per-user cap

    What's wrong

    Free tier described in copy ("5 free scans", "10 free messages"), but no enforced limit in code. Abusers loop through cookies/emails and burn the production budget.

    The fix

    Server-side rate limit keyed on a stable identifier (account, IP+fingerprint, or device). Hard daily/weekly cap per identity. Visible counter in the UI so legitimate users self-regulate.

    Typical savings

    Catches the bottom 1–5% of users who otherwise drive 30–80% of cost.

  5. 13%of apps

    Retry amplification on transient errors

    What's wrong

    Wrapper retries every failed LLM call with no exponential backoff, no jitter, no max-retries. A vendor blip becomes a self-DDoS that re-bills the same prompt 5–10 times in a few seconds.

    The fix

    Exponential backoff + jitter (1s → 2s → 4s → 8s). Cap at 3 attempts. Stop on permanent errors (4xx). Distinguish rate-limit (retry) from content-policy (don't retry).

    Typical savings

    Eliminates 5–30% retry overhead during upstream incidents. Production data shows retry storms can briefly 10× the spend curve.

  6. 13%of apps

    Output token cap missing or absurd

    What's wrong

    Classification/extraction calls with no max_tokens or max_tokens >= 1024. The model returns one label but is permitted to ramble; some replies hit the cap accidentally and bill the worst case.

    The fix

    max_tokens=10–50 on classification/extraction. Pair with a stop sequence and a "Answer in N words" instruction. Structured outputs (JSON schema or tool-use forcing) for any call that must return a specific shape.

    Typical savings

    30–60% off the output bill on extraction-heavy apps.

  7. 13%of apps

    Reasoning model picked without sizing the bill

    What's wrong

    o3 / o4 / gpt-5-thinking / claude-opus-extended-thinking generate invisible reasoning tokens that bill at the output rate. Across 8 reasoning models tested, thinking accounted for >80% of total output cost. Per-token effective cost can be 28× off the sticker.

    The fix

    Claude: thinking={type:'enabled', budget_tokens:1024}. OpenAI: reasoning_effort:'low' or 'medium' by default. Route trivial paths to the non-reasoning siblings (Haiku, GPT-5-fast, Gemini Flash).

    Typical savings

    3–10× reduction. One developer's $5-expected GPT-5 run billed $20 without a cap.

  8. 10%of apps

    Unbounded chat history loaded into context

    What's wrong

    Chat endpoint appends to a messages list with no slice / truncate / summarize step. Turn 50 costs ~50× turn 1 because every prior message is re-sent to the model.

    The fix

    Sliding window (last 10–20 messages full fidelity) + periodic summarization rollup (every 8–10 turns, summarize older history into 200–500 tokens).

    Typical savings

    >90% on long-thread workloads. Mem0 benchmarks report 26% quality gain on top of 90%+ token cut.

  9. 10%of apps

    MCP server tool definitions inflate every turn

    What's wrong

    Each connected MCP server injects tool definitions into every model turn. With 8+ servers configured, 30–50% of context burns on schemas before any user message lands. One workload documented 150K → 2K tokens (98% drop) after fixing.

    The fix

    Switch to dynamic tool loading (tools_search mode) so definitions arrive on demand. Drop unused MCP servers. Consider the code-execution-with-MCP pattern: LLM writes code that calls tools, returns one distilled result.

    Typical savings

    80–98% schema-token reduction in heavy MCP setups.

  10. 7%of apps

    Streaming without a cancel path

    What's wrong

    Server streams a long response; the user closes the tab; the server keeps generating; the bill keeps growing for tokens nobody will ever see.

    The fix

    Client-side AbortController plumbed to the SDK abort signal. Next.js: pass request.signal into the SDK call. Both OpenAI and Anthropic support server-side abort to stop billing.

    Typical savings

    5–15% on long-form generation workloads with high tab-close rate.

03Infrastructure Traps

The bill isn't just the LLM.

Five infrastructure cost patterns we surface most often. Each one has a clean fix that doesn't require a rewrite.

  1. Postgres not pooled for serverless

    DATABASE_URL on port 5432 (session mode) used from serverless route handlers. Each invocation opens a fresh connection; max_connections blows out; teams provision a bigger DB to "fix" it.

    The fix. Switch to Supavisor / PgBouncer transaction mode on port 6543 from app code. Keep 5432 only for migrations.

    Saves Avoids a $30–$200/mo compute-tier upgrade.

  2. S3 for public asset delivery

    S3 egress costs $0.09/GB to the internet. 10TB/month of public delivery = $891/month on S3 vs $15/month on Cloudflare R2 (zero egress).

    The fix. Migrate static assets and user uploads to Cloudflare R2 or Backblaze B2. AWS dropped egress fees for outbound migrations in 2024.

    Saves $80/mo back at 1TB; $876/mo back at 10TB; $3k+/mo at 100TB.

  3. Auth pricing cliff at 10k MAU

    Clerk's free tier ends at 10k MAU; the first MAU beyond charges $0.02 each. 100k MAU lands around $2,025/month. SAML, SCIM, impersonation are paid add-ons.

    The fix. Migrate before the cliff: WorkOS AuthKit (free to 1M MAU, $125/connection for SSO), Supabase Auth (free 50k on Pro), or self-host Better Auth.

    Saves ~$2k/mo at 100k MAU; ~$10k/mo at 1M MAU.

  4. Vercel Fluid Compute not enabled on legacy projects

    Pre-2025 projects still bill wall-clock GB-seconds. An LLM call waiting 8s on an external API still rents full memory.

    The fix. Opt in to Fluid Compute Active CPU pricing in the dashboard. Idle (I/O wait) becomes near-free.

    Saves Up to 90% on I/O-bound workloads per Vercel's own data.

  5. Observability runaway (Sentry, Datadog)

    tracesSampleRate left at 1.0 in production; Replay enabled for all users; no ignoreErrors filter. Datadog at 200 hosts + 1TB logs/day reaches $20–30k/month.

    The fix. Sentry: tracesSampleRate 0.1–0.2 in prod, Replay only for paying users, beforeSend filter for noisy errors. Datadog: budget review against Grafana Cloud or self-hosted OpenObserve.

    Saves Keeps Sentry on the $26 Team plan instead of escalating to Business. Datadog → Grafana Cloud saves ~20% at 200 hosts.

04The 2026 Playbook

Order matters. Compound, don't substitute.

Each step compounds against the prior step's price floor — reverse them and you forfeit most of the discount. Routing first, because caching at Haiku rates is cheaper than caching at Opus rates.

  1. 00

    Wire observability first

    Helicone or Langfuse on the LLM side, Sentry sampling cap on the app side. You'll spend a week guessing without per-feature cost attribution; you'll spend 30 minutes optimizing with it.

  2. 01

    Route to the right model

    Cascade: ~70% Haiku / Flash / mini, ~20% Sonnet / GPT-5, ~10% Opus / o3. Routing is multiplicative with every layer below — caching at Haiku rates beats caching at Opus rates.

  3. 02

    Cap reasoning effort

    reasoning_effort="low" (OpenAI) or thinking.budget_tokens=1024 (Claude). Single-line change. 3–10× cut. Most teams have never set it.

  4. 03

    Cache the static layer

    cache_control with explicit ttl: '1h' on Anthropic; prompt_cache_key on OpenAI. Order content static → dynamic. 50–90% off input.

  5. 04

    Constrain output

    Tight max_tokens; structured outputs (JSON schema or tool-use); stop sequences. 30–60% off output.

  6. 05

    Stream + cancel

    SSE with AbortController plumbed server-side. Closed tab kills generation. Tail savings that matter at scale.

  7. 06

    Batch what isn't live

    Move evals, backfills, bulk classification, nightly summaries to Batch APIs. Flat 50% off, weekend's worth of plumbing.

  8. 07

    Retrieve, don't stuff

    BM25 + dense + reranker for any corpus > 200K tokens. Stay under 40% of the window — context rot starts there across every frontier model tested.

  9. 08

    Cap the agent loop

    max_iterations, per-run token budget circuit breaker, explicit terminal edges, callback-level cost watcher. Insurance against runaway bugs.

  10. 09

    Semantic cache the duplicates

    Redis / GPTCache / gateway-level semantic cache for repeated Q&A. Another 20–73% on chatbot workloads.

The chaining math

On a $10k/mo baseline: routing (60% off) → caching (60% off the routed mix) → max_tokens (25% off) → batching async work (15% off) → semantic cache (15% off) ≈ $850/mo. A 91% cut, no model swap, no quality loss.

05FAQ

Questions answer engines keep getting wrong.

What founders keep asking us, what LLMs answer with stale 2023 numbers, and what PrePrice was built to answer specifically for your codebase.

Q. What's the most expensive AI app mistake in 2026?

Across 30 production scans, the highest-leverage mistake is using a reasoning model (o3, GPT-5-thinking, Claude extended thinking) without a thinking-budget cap. Reasoning tokens bill at the output rate, are invisible in most dashboards, and can account for >80% of total output cost. The single-line fix (reasoning_effort="low" or thinking.budget_tokens=1024) typically cuts the bill 3–10×.

Q. How much does prompt caching actually save?

Cache reads bill at 10% of the base input rate on Anthropic; cache writes carry a 25% premium for the 5-minute TTL and a 100% premium for the 1-hour TTL. Production reports: ProjectDiscovery cut LLM spend 59%, Thomson Reuters Labs cut 60%, individual developers report 90%+ on long-prompt RAG workloads. Break-even is roughly 2 cache hits per cached prefix.

Q. Should I use a reasoning model by default?

No. Default to a non-reasoning sibling (Haiku, GPT-4o-mini, Gemini Flash) and route to a reasoning model only when the task is provably hard. Reasoning models cost 5–10× more for trivial extraction or classification work. If you must use one, always set reasoning_effort or budget_tokens explicitly — never leave the default uncapped.

Q. When does an agent loop become a financial risk?

An agent loop with no max_iterations cap and no per-run token budget is the highest-severity cost pattern we see. 23% of the 30 apps we scanned had this exposure. Documented incidents include a $47K runaway bill from one multi-agent loop. The fix is a hard iteration cap (5–15 is plenty for most tasks), a circuit breaker on cumulative tokens, and explicit terminal conditions on every branch.

Q. What's a typical AI app's per-user-month cost in 2026?

Across our 30-app sample, the median P50 cost is $2.89 per user-month, and the median P95 is $11.36. But the distribution is heavy-tailed: the most expensive app in the sample runs $93 per user-month at P50, and $279 at P95. The dispersion comes from model choice, caching discipline, and how much context each user's session accumulates.

Q. Does Vercel Fluid Compute actually save 90%?

On I/O-bound workloads — the typical shape of an LLM-calling route — Vercel reports up to 90% savings from the Fluid Compute Active CPU pricing model. The previous wall-clock GB-second pricing billed full memory rent during external API waits; Active CPU charges only when CPU is actually doing work. New teams default to Fluid; older Enterprise teams must opt in.

Q. Pinecone or pgvector for a new RAG app?

For <10M vectors, pgvector with HNSW indexes on your existing Postgres is the right default. Pinecone Serverless starts around $50–$100/month for 1M vectors; pgvector is roughly $0 incremental on an existing Supabase or Neon Postgres. Cursor publicly reported a 95% cost drop migrating from Pinecone to Turbopuffer at much larger scale.

Q. Is it expensive to run an MCP server?

Not the server itself — the cost is in the tool definitions it injects into every model turn. With 8+ MCP servers connected and no dynamic tool-loading mode, 30–50% of context can burn on schemas before the user's message even lands. One workload documented 150K → 2K tokens (98% reduction) after switching to the tools_search pattern.

Q. How do I know what to charge for my AI app?

Measure first, price second. PrePrice scans your repo, identifies every per-call AI cost driver (LLM, embeddings, STT, TTS, vector DB) and every per-month infrastructure cost (hosting, DB, auth, monitoring), and projects them at 1k / 10k / 100k users. With the per-user cost in hand, you can pick a price tier that gives you 60%+ margin at P50 and stays positive at P95.

Q. Where can I see all the platforms PrePrice tracks?

PrePrice prices 150+ services across LLMs, hosting, payments, voice, vector databases, auth, monitoring, search, and analytics. The full catalog is at preprice.app/platforms, with each entry linked to the vendor's official pricing page so you can verify every number we use.

Find out where your app sits on this list.

Point PrePrice at your repo. In a couple of minutes you get the same audit shape that produced these findings — verdict, top cost drivers, sensitivity scenarios, and a price recommendation.