LLM Token Optimization Playbook
You don’t have a token problem. You have a model allocation problem.
Most teams try to reduce tokens with prompt tricks. But the real cost driver is which model you use for each task.
The same workflow can cost 10-100x more depending on model selection.
This playbook shows you how to:
route tasks across model tiers
reduce token waste at the system level
scale AI usage without exploding cost
Most teams start with chat and stay well within limits, then move to coding tools and hit limits unexpectedly fast.
Why it matters: this is usually a workflow design issue, not just "too much usage."
Example: load a large repo, ask for a broad fix, retry three times, and resend most of the same context each cycle.
Takeaway: token control starts with workflow control.
Warning: Repeated broad-context retries are the fastest path to token exhaustion.
Token usage = Context Size x Iteration Count.
This formula is the operational model behind nearly every limit incident.
Context size = how much data you pass per request.
Iteration count = how many times you rerun the loop.
Numerical example: Loading a full repo (200k tokens) and iterating 10 times burns ~2M tokens.
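The formula above can be sketched as a quick budget check. A minimal sketch, assuming per-request token counts are known or roughly estimated:

```python
def estimated_token_spend(context_tokens: int, iterations: int) -> int:
    """Token usage = context size x iteration count (rough upper bound)."""
    return context_tokens * iterations

# Broad workflow: full repo resent on every retry
print(estimated_token_spend(200_000, 10))  # 2000000

# Scoped workflow: only impacted files, two passes
print(estimated_token_spend(30_000, 2))    # 60000
```

Running both cases side by side makes the point concrete: the scoped workflow spends roughly 3% of the broad one for the same problem slice.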
Token Explosion Loop
Large Context → Model Call → Retry → More Context → More Calls
Repeated loops plus large context create compounding token growth.
Why this matters: Most users optimize wording. The bigger win is reducing context volume and loop count.
Choose the right mode before you spend tokens: use chat for thinking, code tools for execution.
Use chat to explore ambiguity, frame the problem, and design solution options before execution.
When to use
unclear requirements
comparing multiple approaches
designing architecture or prompts
Example: "Design a contract extraction pipeline with fallback logic"
Why it matters: planning in a low-cost mode prevents expensive execution retries.
Insight: Keep chat broad for thinking and tradeoffs. Then hand off a scoped plan to execution mode.
Use code mode for precise execution, targeted changes, and validating outcomes against clear criteria.
When to use
implementing specific changes
debugging known issues
running targeted updates
Example: "Update M2 classification logic to include FX fee detection"
Why it matters: code tools multiply cost when prompts are broad or iterative.
Insight: Code mode needs precise instructions up front. It should produce deterministic output with minimal retries.
Chat (Fast Path)
User → Model → Response
Coding (Full Execution Path)
User → Model → Load Files → Analyze Code → Generate Patch → Run Tests → Process Logs → Response
Dimension | Chat | Code
Primary intent | Explore and plan | Execute changes
Typical token pattern | Lower, shorter turns | High token usage with loops
Best input style | Open framing | Scoped, explicit constraints
Takeaway: Chat helps you choose the right path. Code mode should run that path with tight scope and fewer retries.
Token burn is usually caused by a few repeated workflow patterns, not one bad prompt.
Full file reloads
What: large file sets are resent each turn.
Why: default workflows attach broad context automatically.
Impact: higher base cost before useful reasoning even starts.

Repeated loops
What: full retries after narrow failures.
Why: teams rerun full flows instead of targeted stages.
Impact: duplicate calls and duplicate context spend.

Tool/log payload bloat
What: raw outputs are pasted into prompts.
Why: results are not filtered to decision-relevant slices.
Impact: context windows fill with low-signal noise.

Mixed-mode sessions
What: planning and implementation happen in one long coding loop.
Why: phase boundaries are unclear.
Impact: more exploratory retries in expensive execution mode.
Takeaway: reduce repeated large-input turns first; prompt micro-edits come second.
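One way to address the payload-bloat pattern above is to filter raw tool output down to decision-relevant slices before it enters the prompt. A minimal sketch; the keyword list and line cap are illustrative assumptions, not a prescribed filter:

```python
def filter_log_payload(raw_log: str,
                       keywords=("ERROR", "FAIL", "Traceback"),
                       max_lines: int = 50) -> str:
    """Keep only high-signal lines instead of pasting the full log."""
    hits = [line for line in raw_log.splitlines()
            if any(k in line for k in keywords)]
    return "\n".join(hits[:max_lines])

raw = "INFO start\nERROR db timeout\nINFO retry\nFAIL case_42\n"
print(filter_log_payload(raw))
```

Two lines of signal reach the model instead of the whole log; the same idea scales to test output, stack traces, and tool results.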
These mistakes are common because they feel productive in the moment.
Thinking in code mode first
Why people do it: it feels faster to start patching.
Why it backfires: unclear scope triggers retries.
Better alternative: plan first in chat, then execute once.

Loading full repos for local edits
Why people do it: fear of missing dependencies.
Why it backfires: massive context cost for small fixes.
Better alternative: include only impacted files and interfaces.

Rerunning full workflows after small changes
Why people do it: convenience and habit.
Why it backfires: repeated duplicate context each cycle.
Better alternative: rerun only changed stages first.

Vague execution prompts
Why people do it: they expect the model to infer intent.
Why it backfires: ambiguous outputs create retries.
Better alternative: define scope, constraints, and done criteria explicitly.
Takeaway: every mistake above increases either context size or iteration count.
This is the recommended operating model for teams that want predictable cost and speed.
Why it matters: each step directly cuts either baseline context cost or retry count.
Separate thinking and execution. Design the solution path first. This avoids expensive exploratory loops during implementation.
Scope context tightly. Include only required files, logs, and constraints. This reduces base token cost per call.
Define done criteria early. Clear success criteria reduce uncertainty and retry churn.
Sequence tools intentionally. Chat for decisions, code tool for edits, local systems for validation. This prevents repeated mixed-mode overhead.
Example: define target files and acceptance tests in chat, then run one scoped execution pass in the code tool.
Tip: Aim for one focused execution pass per problem slice.
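The scoped execution pass can be made mechanical by assembling the prompt from explicit parts. A hypothetical helper, not a prescribed API; the file names and criteria below are illustrative:

```python
def build_execution_prompt(task: str, target_files: list[str],
                           done_criteria: list[str]) -> str:
    """Assemble a scoped execution prompt: task, impacted files, done criteria.
    Listing files explicitly prevents a default full-repo context load."""
    files = "\n".join(f"- {f}" for f in target_files)
    criteria = "\n".join(f"- {c}" for c in done_criteria)
    return (f"Task: {task}\n"
            f"Only touch these files:\n{files}\n"
            f"Done when:\n{criteria}")

prompt = build_execution_prompt(
    "Add FX fee detection to M2 classification",
    ["m2/classify.py", "m2/fees.py"],
    ["existing tests pass", "new FX fee test passes"],
)
print(prompt)
```

Because scope, constraints, and done criteria travel with every execution call, the model has less room to drift and you have fewer reasons to retry.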
A structured before/after view makes the optimization practical and repeatable.
Before
Inefficient workflow
load full repo context before scoping
explore and patch in the same execution loop
rerun full workflow after each partial fix
Token usage: ~200k per cycle
After
Optimized workflow
plan scope in chat first
send only target files and interfaces
rerun only changed stages and validations
Token usage: ~15k-40k per cycle
Phase | Before | After
Context scope | full repo loaded by default | only impacted files + interfaces
Execution style | explore + patch in one loop | plan first, then scoped execution
Token usage | ~200k per cycle | ~15k-40k per cycle
What changed: context was narrowed, retries were reduced, and validation became stage-specific.
Takeaway: most savings come from removing duplicate context-heavy loops, not from shorter wording.
Most teams route every task through one powerful model. That feels simple, but it is expensive and inefficient at scale.
High-performance workflows separate thinking, execution, and bulk processing. Use intelligence where it matters. Use cheaper models everywhere else.
Tier 1 - Think / Plan / Architect
High intelligence, use sparingly.
Purpose: high-quality reasoning and decision framing.
When to use: problem framing, architecture design, prompt design, complex reasoning, output review.
Example: "Design a multi-stage contract intelligence pipeline with fallback extraction rules, confidence thresholds, and validation gates."
Model examples:
Opus 4.6
Sonnet 4.6
GPT-5.4
Gemini 3.1 Pro
Grok 4
Note: used sparingly, but critically important.
Tier 2 - Execute / Transform / Implement
Primary execution layer.
Purpose: the main working layer for production tasks.
When to use: applying changes, structured extraction, transformations, coding tasks, deterministic workflows.
Example: "Extract pricing clauses, normalize fee fields into a JSON schema, and generate deterministic patch updates for downstream services."
Model examples:
Haiku 4.5
GPT-5.4-mini
Qwen 3
GLM-4.5
Kimi K2
Note: this is where most tokens should be spent.
Tier 3 - Bulk / Filter / Local / Agent Loops
Low-cost scaling layer.
Purpose: high-volume, low-risk throughput work.
When to use: filtering documents, preprocessing, repetitive agent steps, retrieval prep, batch processing.
Example: "Filter 50,000 incoming contracts, drop irrelevant records, and route only high-signal documents into the Tier 2 extraction workflow."
Model examples:
Gemini 2.0 Flash
Mistral Small
Llama 3.1 70B
DeepSeek V3.2
Nemotron 3
Note: built for scale, not deep intelligence.
Tiered Workflow Flow
Tier 1: Design Workflow → Tier 2: Execute Tasks → Tier 3: Scale Bulk Steps
Step 1: design with Tier 1. Step 2: execute with Tier 2. Step 3: scale repetitive work with Tier 3.
Why this reduces tokens:
expensive models are used less frequently
execution is handled by cheaper, balanced models
bulk work is offloaded to low-cost or local layers
high-cost iteration loops are reduced
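The three-tier routing above can be sketched as a simple lookup before any model call. The model identifiers and task-type keys are illustrative assumptions drawn from the examples in this playbook, not a fixed API:

```python
# Hypothetical tier router: pick a model tier from the task type
# before spending any tokens.
TIER_MODELS = {
    1: "opus-4.6",         # think / plan / architect (use sparingly)
    2: "haiku-4.5",        # execute / transform / implement
    3: "gemini-2.0-flash", # bulk / filter / agent loops
}

TASK_TIERS = {
    "design": 1, "architecture": 1, "review": 1,
    "implement": 2, "extract": 2, "refactor": 2,
    "filter": 3, "batch": 3, "preprocess": 3,
}

def route(task_type: str) -> str:
    """Unknown tasks default to Tier 2, the primary execution layer."""
    tier = TASK_TIERS.get(task_type, 2)
    return TIER_MODELS[tier]

print(route("design"))  # opus-4.6
print(route("batch"))   # gemini-2.0-flash
```

Defaulting unknown work to Tier 2 reflects the playbook's allocation: most tokens belong in the execution layer, with Tier 1 reserved for deliberate planning calls.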
Bad Pattern
Use Tier 1 for everything
planning, execution, and bulk all on one expensive model
high-cost retries across every step
no cost-aware routing
Good Pattern
Use a tiered workflow
Tier 1 for planning
Tier 2 for implementation
Tier 3 for scale and repetition
Final takeaway: Token optimization is not about using fewer tokens. It is about using the right model at the right stage.
This section is your practical toolkit for reducing repetitive prompt load in Claude workflows.
Why it matters: persistent instructions and targeted tool access reduce both prompt size and retries across sessions.
CLAUDE.md
What it is: persistent instruction file for stable project behavior.
Why it reduces repetition: shared rules are not restated each session.
How to structure: rules, scope boundaries, constraints, output expectations.
What goes inside: coding standards, review format, validation expectations.
Example use case: compliance-heavy repo with strict output format and test gates.
Takeaway: place stable project rules in CLAUDE.md, not in every prompt.
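A minimal CLAUDE.md skeleton following the structure above. The section names and rules are illustrative, not a required schema:

```markdown
# Project rules
- Python 3.11, type hints required, formatter must pass

# Scope boundaries
- Never modify files under vendor/ or migrations/

# Output expectations
- Return unified diffs only; no full-file rewrites
- Every change must name the test that validates it
```

Once these rules live in the file, every session inherits them for free instead of re-spending prompt tokens to restate them.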
Skills
What they are: reusable capabilities for repeated workflows.
Why they matter: avoid re-explaining workflow logic in every prompt.
When to use: extraction, review, refactor, summarization patterns.
Example use case: a review skill that always returns findings by severity with remediation steps.
Takeaway: use Skills for repeated task patterns that need consistent structure.
MCP
What it is: tool integration layer for external context access.
How it reduces manual context passing: fetches targeted data directly from tools.
Why it matters: prevents huge pasted context blocks in tool-based workflows.
Example use case: pull only relevant error traces and schema rows for the current task.
Takeaway: use MCP when data access is the bottleneck, not prompt writing.
Info: Use Skills to standardize behavior; use MCP to standardize access.
Dimension | Skills | MCP
Purpose | reusable task logic | external access
Token impact | reduces repetition | reduces context
Best for | repeated tasks | large data
Not for | data retrieval | task logic
Skills define HOW Claude works. MCP defines WHAT Claude can access.
Use this before long sessions to keep token spend predictable.
Plan in chat first: reduce ambiguity before execution.
Pass only required context: lower baseline token cost.
Avoid full repo loads for local edits: scope context to affected files.
Use precise prompts: reduce retries and output drift.
Rerun only changed stages: avoid duplicate full-loop costs.
Cap retries and recursion: prevent compounding overhead.
Use CLAUDE.md, Skills, and MCP: reduce repetition and context bloat.
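The retry-cap item in the checklist can be enforced in code rather than by discipline. A minimal sketch; `call_model` is a hypothetical stand-in for any model call returning a success flag and output:

```python
def run_with_retry_cap(call_model, prompt: str, max_retries: int = 2):
    """Cap retries so a failing loop cannot compound token spend.
    `call_model` is any function returning (ok: bool, output: str)."""
    for attempt in range(1 + max_retries):
        ok, output = call_model(prompt)
        if ok:
            return output
    raise RuntimeError(
        f"Gave up after {1 + max_retries} attempts; rescope the task."
    )

# Stub model call that succeeds on the second attempt.
attempts = []
def fake_model(prompt):
    attempts.append(prompt)
    return (len(attempts) >= 2, "patched")

print(run_with_retry_cap(fake_model, "fix tests"))  # patched
```

Hitting the cap is a signal to go back to chat mode and rescope, not to raise the cap: an uncapped loop is exactly the compounding pattern the formula at the top of this playbook warns about.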
LLM efficiency is mostly workflow design, not wordsmithing.
High-performing teams spend less time trimming prompts and more time controlling context scope, iteration count, and execution sequencing.