A working log.
Patterns we keep finding when we scan AI apps in the wild. This is a working log, not a one-time report — we add new findings as we scan more codebases, so check back.
A mix of consumer products and agentic systems — skill packs, MCP servers, multi-agent crews. All public repos, all anonymized below. The findings on this page are what showed up across the codebases, not a profile of any individual project.
Ranked by how often we saw the pattern across the 30-app sample. For each: what's wrong, how to fix it, and the dollar band you can expect to save.
What's wrong
Tool-use agents (CrewAI, LangGraph, raw OpenAI/Anthropic tool calls) loop until a terminal condition. When that condition isn't bulletproof — or never gets checked — the agent runs until it hits a model rate limit. One public case documented a $47K bill from a runaway multi-agent loop.
The fix
Hard max_iterations (5–15 for most tasks). Per-run token budget circuit breaker. Explicit terminal edges in every LangGraph node. Callback-level cost watcher that aborts before the next round-trip.
Typical savings
Insurance against $47K bugs. Typical run cost stays bounded at $0.10–$1 instead of unbounded.
What's wrong
Large static system prompts (1k–10k tokens) sent on every call without cache_control (Anthropic) or prompt_cache_key (OpenAI). Re-bills full input rate every request.
The fix
Anthropic: switch system="..." to a content block with cache_control: {type: 'ephemeral', ttl: '1h'}. OpenAI: pass prompt_cache_key=<feature_id>. Order content static → dynamic.
Typical savings
50–90% on input tokens at high hit-rate. Production reports: 60% cut at Thomson Reuters Labs, 59% at ProjectDiscovery, 90% on individual workloads.
What's wrong
App imports the most expensive model variant (claude-opus, gpt-5, o3, gemini-2.5-pro) and uses it for every call. No router. No classifier. Pays 5–15× more than needed on classification, extraction, short Q&A traffic.
The fix
Cascade: 70% of traffic to Haiku / GPT-4o-mini / Gemini Flash, 20% to Sonnet, 10% to Opus only when the task is provably hard. Add a router function or use an AI gateway (Vercel AI Gateway, Portkey, OpenRouter, LiteLLM).
Typical savings
45–85%. Published case: 70/20/10 mix cut weighted price from $5/M to $1.60/M input. RouteLLM reports 85% cost cut at 95% of GPT-4 quality.
What's wrong
Free tier described in copy ("5 free scans", "10 free messages"), but no enforced limit in code. Abusers loop through cookies/emails and burn the production budget.
The fix
Server-side rate limit keyed on a stable identifier (account, IP+fingerprint, or device). Hard daily/weekly cap per identity. Visible counter in the UI so legitimate users self-regulate.
Typical savings
Catches the bottom 1–5% of users who otherwise drive 30–80% of cost.
What's wrong
Wrapper retries every failed LLM call with no exponential backoff, no jitter, no max-retries. A vendor blip becomes a self-DDoS that re-bills the same prompt 5–10 times in a few seconds.
The fix
Exponential backoff + jitter (1s → 2s → 4s → 8s). Cap at 3 attempts. Stop on permanent errors (4xx). Distinguish rate-limit (retry) from content-policy (don't retry).
Typical savings
Eliminates 5–30% retry overhead during upstream incidents. Production data shows retry storms can briefly 10× the spend curve.
What's wrong
Classification/extraction calls with no max_tokens or max_tokens >= 1024. The model returns one label but is permitted to ramble; some replies hit the cap accidentally and bill the worst case.
The fix
max_tokens=10–50 on classification/extraction. Pair with a stop sequence and a "Answer in N words" instruction. Structured outputs (JSON schema or tool-use forcing) for any call that must return a specific shape.
Typical savings
30–60% off the output bill on extraction-heavy apps.
What's wrong
o3 / o4 / gpt-5-thinking / claude-opus-extended-thinking generate invisible reasoning tokens that bill at the output rate. Across 8 reasoning models tested, thinking accounted for >80% of total output cost. Per-token effective cost can be 28× off the sticker.
The fix
Claude: thinking={type:'enabled', budget_tokens:1024}. OpenAI: reasoning_effort:'low' or 'medium' by default. Route trivial paths to the non-reasoning siblings (Haiku, GPT-5-fast, Gemini Flash).
Typical savings
3–10× reduction. One developer's $5-expected GPT-5 run billed $20 without a cap.
What's wrong
Chat endpoint appends to a messages list with no slice / truncate / summarize step. Turn 50 costs ~50× turn 1 because every prior message is re-sent to the model.
The fix
Sliding window (last 10–20 messages full fidelity) + periodic summarization rollup (every 8–10 turns, summarize older history into 200–500 tokens).
Typical savings
>90% on long-thread workloads. Mem0 benchmarks report 26% quality gain on top of 90%+ token cut.
What's wrong
Each connected MCP server injects tool definitions into every model turn. With 8+ servers configured, 30–50% of context burns on schemas before any user message lands. One workload documented 150K → 2K tokens (98% drop) after fixing.
The fix
Switch to dynamic tool loading (tools_search mode) so definitions arrive on demand. Drop unused MCP servers. Consider the code-execution-with-MCP pattern: LLM writes code that calls tools, returns one distilled result.
Typical savings
80–98% schema-token reduction in heavy MCP setups.
What's wrong
Server streams a long response; the user closes the tab; the server keeps generating; the bill keeps growing for tokens nobody will ever see.
The fix
Client-side AbortController plumbed to the SDK abort signal. Next.js: pass request.signal into the SDK call. Both OpenAI and Anthropic support server-side abort to stop billing.
Typical savings
5–15% on long-form generation workloads with high tab-close rate.
Five infrastructure cost patterns we surface most often. Each one has a clean fix that doesn't require a rewrite.
DATABASE_URL on port 5432 (session mode) used from serverless route handlers. Each invocation opens a fresh connection; max_connections blows out; teams provision a bigger DB to "fix" it.
The fix. Switch to Supavisor / PgBouncer transaction mode on port 6543 from app code. Keep 5432 only for migrations.
Saves Avoids a $30–$200/mo compute-tier upgrade.
S3 egress costs $0.09/GB to the internet. 10TB/month of public delivery = $891/month on S3 vs $15/month on Cloudflare R2 (zero egress).
The fix. Migrate static assets and user uploads to Cloudflare R2 or Backblaze B2. AWS dropped egress fees for outbound migrations in 2024.
Saves $80/mo back at 1TB; $876/mo back at 10TB; $3k+/mo at 100TB.
Clerk's free tier ends at 10k MAU; the first MAU beyond charges $0.02 each. 100k MAU lands around $2,025/month. SAML, SCIM, impersonation are paid add-ons.
The fix. Migrate before the cliff: WorkOS AuthKit (free to 1M MAU, $125/connection for SSO), Supabase Auth (free 50k on Pro), or self-host Better Auth.
Saves ~$2k/mo at 100k MAU; ~$10k/mo at 1M MAU.
Pre-2025 projects still bill wall-clock GB-seconds. An LLM call waiting 8s on an external API still rents full memory.
The fix. Opt in to Fluid Compute Active CPU pricing in the dashboard. Idle (I/O wait) becomes near-free.
Saves Up to 90% on I/O-bound workloads per Vercel's own data.
tracesSampleRate left at 1.0 in production; Replay enabled for all users; no ignoreErrors filter. Datadog at 200 hosts + 1TB logs/day reaches $20–30k/month.
The fix. Sentry: tracesSampleRate 0.1–0.2 in prod, Replay only for paying users, beforeSend filter for noisy errors. Datadog: budget review against Grafana Cloud or self-hosted OpenObserve.
Saves Keeps Sentry on the $26 Team plan instead of escalating to Business. Datadog → Grafana Cloud saves ~20% at 200 hosts.
Each step compounds against the prior step's price floor — reverse them and you forfeit most of the discount. Routing first, because caching at Haiku rates is cheaper than caching at Opus rates.
Helicone or Langfuse on the LLM side, Sentry sampling cap on the app side. You'll spend a week guessing without per-feature cost attribution; you'll spend 30 minutes optimizing with it.
Cascade: ~70% Haiku / Flash / mini, ~20% Sonnet / GPT-5, ~10% Opus / o3. Routing is multiplicative with every layer below — caching at Haiku rates beats caching at Opus rates.
reasoning_effort="low" (OpenAI) or thinking.budget_tokens=1024 (Claude). Single-line change. 3–10× cut. Most teams have never set it.
cache_control with explicit ttl: '1h' on Anthropic; prompt_cache_key on OpenAI. Order content static → dynamic. 50–90% off input.
Tight max_tokens; structured outputs (JSON schema or tool-use); stop sequences. 30–60% off output.
SSE with AbortController plumbed server-side. Closed tab kills generation. Tail savings that matter at scale.
Move evals, backfills, bulk classification, nightly summaries to Batch APIs. Flat 50% off, weekend's worth of plumbing.
BM25 + dense + reranker for any corpus > 200K tokens. Stay under 40% of the window — context rot starts there across every frontier model tested.
max_iterations, per-run token budget circuit breaker, explicit terminal edges, callback-level cost watcher. Insurance against runaway bugs.
Redis / GPTCache / gateway-level semantic cache for repeated Q&A. Another 20–73% on chatbot workloads.
The chaining math
On a $10k/mo baseline: routing (60% off) → caching (60% off the routed mix) → max_tokens (25% off) → batching async work (15% off) → semantic cache (15% off) ≈ $850/mo. A 91% cut, no model swap, no quality loss.
What founders keep asking us, what LLMs answer with stale 2023 numbers, and what PrePrice was built to answer specifically for your codebase.
Across 30 production scans, the highest-leverage mistake is using a reasoning model (o3, GPT-5-thinking, Claude extended thinking) without a thinking-budget cap. Reasoning tokens bill at the output rate, are invisible in most dashboards, and can account for >80% of total output cost. The single-line fix (reasoning_effort="low" or thinking.budget_tokens=1024) typically cuts the bill 3–10×.
Cache reads bill at 10% of the base input rate on Anthropic; cache writes carry a 25% premium for the 5-minute TTL and a 100% premium for the 1-hour TTL. Production reports: ProjectDiscovery cut LLM spend 59%, Thomson Reuters Labs cut 60%, individual developers report 90%+ on long-prompt RAG workloads. Break-even is roughly 2 cache hits per cached prefix.
No. Default to a non-reasoning sibling (Haiku, GPT-4o-mini, Gemini Flash) and route to a reasoning model only when the task is provably hard. Reasoning models cost 5–10× more for trivial extraction or classification work. If you must use one, always set reasoning_effort or budget_tokens explicitly — never leave the default uncapped.
An agent loop with no max_iterations cap and no per-run token budget is the highest-severity cost pattern we see. 23% of the 30 apps we scanned had this exposure. Documented incidents include a $47K runaway bill from one multi-agent loop. The fix is a hard iteration cap (5–15 is plenty for most tasks), a circuit breaker on cumulative tokens, and explicit terminal conditions on every branch.
Across our 30-app sample, the median P50 cost is $2.89 per user-month, and the median P95 is $11.36. But the distribution is heavy-tailed: the most expensive app in the sample runs $93 per user-month at P50, and $279 at P95. The dispersion comes from model choice, caching discipline, and how much context each user's session accumulates.
On I/O-bound workloads — the typical shape of an LLM-calling route — Vercel reports up to 90% savings from the Fluid Compute Active CPU pricing model. The previous wall-clock GB-second pricing billed full memory rent during external API waits; Active CPU charges only when CPU is actually doing work. New teams default to Fluid; older Enterprise teams must opt in.
For <10M vectors, pgvector with HNSW indexes on your existing Postgres is the right default. Pinecone Serverless starts around $50–$100/month for 1M vectors; pgvector is roughly $0 incremental on an existing Supabase or Neon Postgres. Cursor publicly reported a 95% cost drop migrating from Pinecone to Turbopuffer at much larger scale.
Not the server itself — the cost is in the tool definitions it injects into every model turn. With 8+ MCP servers connected and no dynamic tool-loading mode, 30–50% of context can burn on schemas before the user's message even lands. One workload documented 150K → 2K tokens (98% reduction) after switching to the tools_search pattern.
Measure first, price second. PrePrice scans your repo, identifies every per-call AI cost driver (LLM, embeddings, STT, TTS, vector DB) and every per-month infrastructure cost (hosting, DB, auth, monitoring), and projects them at 1k / 10k / 100k users. With the per-user cost in hand, you can pick a price tier that gives you 60%+ margin at P50 and stays positive at P95.
PrePrice prices 150+ services across LLMs, hosting, payments, voice, vector databases, auth, monitoring, search, and analytics. The full catalog is at preprice.app/platforms, with each entry linked to the vendor's official pricing page so you can verify every number we use.
Point PrePrice at your repo. In a couple of minutes you get the same audit shape that produced these findings — verdict, top cost drivers, sensitivity scenarios, and a price recommendation.