LLM Token Optimization Playbook

You don’t have a token problem. You have a model allocation problem.

Most teams try to reduce tokens with prompt tricks. But the real cost driver is which model you use for each task.

The same workflow can cost 10-100x more depending on model selection.

This playbook shows you how to:

  • route tasks across model tiers
  • reduce token waste at the system level
  • scale AI usage without exploding cost

The Problem

Most teams start with chat and stay well within limits, then move to coding tools and hit limits unexpectedly fast.

Why it matters: this is usually a workflow design issue, not just "too much usage."

Example: load a large repo, ask for a broad fix, retry three times, and resend most of the same context each cycle.

Takeaway: token control starts with workflow control.

Warning: Repeated broad-context retries are the fastest path to token exhaustion.

The Real Reason You're Hitting Limits

Token usage = Context Size × Iteration Count.

This formula is the operational model behind nearly every limit incident.

  • Context size = how much data you pass per request.
  • Iteration count = how many times you rerun the loop.

Numerical example: Loading a full repo (200k tokens) and iterating 10 times burns ~2M tokens.
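
As a quick sketch in Python (the numbers are illustrative, assuming the full context is resent on every iteration):

```python
# Illustrative sketch: total spend when the same context is resent each loop.
def tokens_burned(context_tokens: int, iterations: int) -> int:
    return context_tokens * iterations

print(tokens_burned(200_000, 10))  # 2,000,000 tokens for a full-repo loop
print(tokens_burned(20_000, 3))    # 60,000 tokens for a scoped loop
```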

Token Explosion Loop

Large Context → Model Call → Retry → More Context → More Calls → (repeat)

Repeated loops plus large context create compounding token growth.

Why this matters: Most users optimize wording. The bigger win is reducing context volume and loop count.

Chat vs Code

Choose the right mode before you spend tokens: use chat for thinking, code tools for execution.

Chat (Exploration Mode)

Use chat to explore ambiguity, frame the problem, and design solution options before execution.

When to use

  • unclear requirements
  • comparing multiple approaches
  • designing architecture or prompts

Example: "Design a contract extraction pipeline with fallback logic"

Why it matters: planning in a low-cost mode prevents expensive execution retries.

Insight: Keep chat broad for thinking and tradeoffs. Then hand off a scoped plan to execution mode.

Code (Execution Mode)

Use code mode for precise execution, targeted changes, and validating outcomes against clear criteria.

When to use

  • implementing specific changes
  • debugging known issues
  • running targeted updates

Example: "Update M2 classification logic to include FX fee detection"

Why it matters: code tools multiply cost when prompts are broad or iterative.

Insight: Code mode needs precise instructions up front. It should produce deterministic output with minimal retries.

Chat (Fast Path)

User → Model → Response

Coding (Full Execution Path)

User → Model → Load Files → Analyze Code → Generate Patch → Run Tests → Process Logs → Response

| Dimension | Chat | Code |
| --- | --- | --- |
| Primary intent | Explore and plan | Execute changes |
| Typical token pattern | Lower, shorter turns | High token usage with loops |
| Best input style | Open framing | Scoped, explicit constraints |

Takeaway: Chat helps you choose the right path. Code mode should run that path with tight scope and fewer retries.
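
One way to operationalize the handoff is a pre-flight check before any execution call. A minimal sketch, where both flags are assumptions about how you triage work:

```python
# Sketch of a pre-flight router: stay in cheap exploration mode until scoped.
def pick_mode(requirements_clear: bool, approach_chosen: bool) -> str:
    if requirements_clear and approach_chosen:
        return "code"  # one scoped, deterministic execution pass
    return "chat"      # keep exploring and planning first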

What’s Actually Burning Your Tokens

Token burn is usually caused by a few repeated workflow patterns, not one bad prompt.

  • Full file reloads
    What: large file sets are resent each turn.
    Why: default workflows attach broad context by default.
    Impact: higher base cost before useful reasoning even starts.
  • Repeated loops
    What: full retries after narrow failures.
    Why: teams rerun full flows instead of targeted stages.
    Impact: duplicate calls and duplicate context spend.
  • Tool/log payload bloat
    What: raw outputs are pasted into prompts.
    Why: results are not filtered to decision-relevant slices.
    Impact: context windows fill with low-signal noise (a filtering sketch follows this list).
  • Mixed-mode sessions
    What: planning and implementation happen in one long coding loop.
    Why: phase boundaries are unclear.
    Impact: more exploratory retries in expensive execution mode.

Takeaway: reduce repeated large-input turns first; prompt micro-edits come second.
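
For the log-bloat pattern, a minimal filtering sketch; the marker list is an assumption to tune for your own stack:

```python
# Keep only decision-relevant log lines instead of pasting raw payloads.
MARKERS = ("ERROR", "FATAL", "Traceback", "AssertionError")

def trim_logs(raw_log: str, max_lines: int = 40) -> str:
    hits = [ln for ln in raw_log.splitlines() if any(m in ln for m in MARKERS)]
    return "\n".join(hits[:max_lines])
```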

Biggest Mistakes

These mistakes are common because they feel productive in the moment.

  • Thinking in code mode first
    Why people do it: it feels faster to start patching.
    Why it backfires: unclear scope triggers retries.
    Better alternative: plan first in chat, then execute once.
  • Loading full repos for local edits
    Why people do it: fear of missing dependencies.
    Why it backfires: massive context cost for small fixes.
    Better alternative: include only impacted files and interfaces.
  • Rerunning full workflows after small changes
    Why people do it: convenience and habit.
    Why it backfires: repeated duplicate context each cycle.
    Better alternative: rerun only changed stages first.
  • Vague execution prompts
    Why people do it: they expect the model to infer intent.
    Why it backfires: ambiguous outputs create retries.
    Better alternative: define scope, constraints, and done criteria explicitly.

Takeaway: every mistake above increases either context size or iteration count.

The Fix — Practical Playbook

This is the recommended operating model for teams that want predictable cost and speed.

Why it matters: each step directly cuts either baseline context cost or retry count.

  1. Separate thinking and execution. Design the solution path first. This avoids expensive exploratory loops during implementation.
  2. Scope context tightly. Include only required files, logs, and constraints. This reduces base token cost per call (a sketch follows below).
  3. Define done criteria early. Clear success criteria reduce uncertainty and retry churn.
  4. Sequence tools intentionally. Chat for decisions, code tool for edits, local systems for validation. This prevents repeated mixed-mode overhead.

Example: define target files and acceptance tests in chat, then run one scoped execution pass in the code tool.

Tip: Aim for one focused execution pass per problem slice.
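
A sketch of step 2, assuming a hypothetical list of impacted files; the point is the explicit allowlist, not the helper itself:

```python
from pathlib import Path

# Hypothetical file list: only what the change actually touches.
IMPACTED = [
    "services/classify/m2_rules.py",
    "services/classify/interfaces.py",
    "tests/test_m2_rules.py",
]

def build_context(repo_root: str) -> str:
    root = Path(repo_root)
    return "\n\n".join(
        f"### {rel}\n{(root / rel).read_text()}" for rel in IMPACTED
    )
```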

Real Example: Before vs After

A structured before/after view makes the optimization practical and repeatable.

Before

Inefficient workflow

  • load full repo context before scoping
  • explore and patch in the same execution loop
  • rerun full workflow after each partial fix

Token usage: ~200k per cycle

After

Optimized workflow

  • plan scope in chat first
  • send only target files and interfaces
  • rerun only changed stages and validations

Token usage: ~15k-40k per cycle

| Phase | Before | After |
| --- | --- | --- |
| Context scope | full repo loaded by default | only impacted files + interfaces |
| Execution style | explore + patch in one loop | plan first, then scoped execution |
| Token usage | ~200k per cycle | ~15k-40k per cycle |

What changed: context was narrowed, retries were reduced, and validation became stage-specific.
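
Concretely, assuming five cycles in both cases: before, 200k × 5 = 1M tokens; after, ~30k × 5 = 150k tokens, roughly a 7x reduction before counting the retries you no longer run.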

Takeaway: most savings come from removing duplicate context-heavy loops, not from shorter wording.

Advanced Optimization: Model Tiering Strategy

Most teams route every task through one powerful model. That feels simple, but it is expensive and inefficient at scale.

High-performance workflows separate thinking, execution, and bulk processing. Use intelligence where it matters. Use cheaper models everywhere else.

Tier 1 - Think / Plan / Architect

High intelligence, use sparingly.

Purpose: high-quality reasoning and decision framing.

When to use: problem framing, architecture design, prompt design, complex reasoning, output review.

Example: "Design a multi-stage contract intelligence pipeline with fallback extraction rules, confidence thresholds, and validation gates."

Model examples:

Opus 4.6, Sonnet 4.6, GPT-5.4, Gemini 3.1 Pro, Grok 4

Note: used sparingly, but critically important.

Tier 2 - Execute / Transform / Implement

Primary execution layer.

Purpose: the main working layer for production tasks.

When to use: applying changes, structured extraction, transformations, coding tasks, deterministic workflows.

Example: "Extract pricing clauses, normalize fee fields into a JSON schema, and generate deterministic patch updates for downstream services."

Model examples:

Haiku 4.5, GPT-5.4-mini, Qwen 3, GLM-4.5, Kimi K2

Note: this is where most tokens should be spent.

Tier 3 - Bulk / Filter / Local / Agent Loops

Low-cost scaling layer.

Purpose: high-volume, low-risk throughput work.

When to use: filtering documents, preprocessing, repetitive agent steps, retrieval prep, batch processing.

Example: "Filter 50,000 incoming contracts, drop irrelevant records, and route only high-signal documents into the Tier 2 extraction workflow."

Model examples:

Gemini 2.0 Flash, Mistral Small, Llama 3.1 70B, DeepSeek V3.2, Nemotron 3

Note: built for scale, not deep intelligence.

Tiered Workflow Flow

Tier 1: Design Workflow → Tier 2: Execute Tasks → Tier 3: Scale Bulk Steps

Step 1: design with Tier 1. Step 2: execute with Tier 2. Step 3: scale repetitive work with Tier 3.

Why this reduces tokens (a routing sketch follows this list):
  • expensive models are used less frequently
  • execution is handled by cheaper, balanced models
  • bulk work is offloaded to low-cost or local layers
  • high-cost iteration loops are reduced
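
A minimal routing sketch; the model names come from the tier lists above, and the task categories are assumptions about how you label work:

```python
# Route each task to the cheapest tier that can handle it.
TIER_MODELS = {1: "Opus 4.6", 2: "Haiku 4.5", 3: "Gemini 2.0 Flash"}

def route(task_kind: str) -> str:
    if task_kind in {"architecture", "prompt_design", "complex_reasoning"}:
        return TIER_MODELS[1]   # think / plan / architect
    if task_kind in {"extraction", "patching", "transformation"}:
        return TIER_MODELS[2]   # execute / transform / implement
    return TIER_MODELS[3]       # bulk / filter / agent loops
```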

Bad Pattern

Use Tier 1 for everything

  • planning, execution, and bulk all on one expensive model
  • high-cost retries across every step
  • no cost-aware routing

Good Pattern

Use a tiered workflow

  • Tier 1 for planning
  • Tier 2 for implementation
  • Tier 3 for scale and repetition
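
To see why routing matters, a back-of-the-envelope comparison; the per-million-token rates below are placeholders, not real list prices:

```python
# Hypothetical $/1M-token rates, purely for illustration.
PRICE = {1: 15.00, 2: 1.00, 3: 0.10}

single_model = 3.0 * PRICE[1]                              # all 3M tokens on Tier 1
tiered = 0.2 * PRICE[1] + 1.8 * PRICE[2] + 1.0 * PRICE[3]  # same 3M tokens, routed
print(f"single-model: ${single_model:.2f}, tiered: ${tiered:.2f}")
# single-model: $45.00, tiered: $4.90
```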

Final takeaway: Token optimization is not just about using fewer tokens. It is about using the right model at the right stage.

Claude-Specific Optimization

This section is your practical toolkit for reducing repetitive prompt load in Claude workflows.

Why it matters: persistent instructions and targeted tool access reduce both prompt size and retries across sessions.

CLAUDE.md

  • What it is: persistent instruction file for stable project behavior.
  • Why it reduces repetition: shared rules are not restated each session.
  • How to structure: rules, scope boundaries, constraints, output expectations.
  • What goes inside: coding standards, review format, validation expectations.

Example use case: compliance-heavy repo with strict output format and test gates.
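
A minimal CLAUDE.md sketch; every rule below is illustrative, not a prescribed format:

```
# CLAUDE.md
## Scope
- Modify only services/ and tests/; never touch migrations/.
## Standards
- Python 3.11, type hints required, formatted with black.
## Output
- Return unified diffs, not full-file rewrites.
## Validation
- Every change includes or updates a test; run pytest before finishing.
```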

Takeaway: place stable project rules in CLAUDE.md, not in every prompt.

Skills

  • What they are: reusable capabilities for repeated workflows.
  • Why they matter: avoid re-explaining workflow logic in every prompt.
  • When to use: extraction, review, refactor, summarization patterns.

Example use case: a review skill that always returns findings by severity with remediation steps.

Takeaway: use Skills for repeated task patterns that need consistent structure.

MCP

  • What it is: tool integration layer for external context access.
  • How it reduces manual context passing: fetches targeted data directly from tools.
  • Why it matters: prevents huge pasted context blocks in tool-based workflows.

Example use case: pull only relevant error traces and schema rows for the current task.

Takeaway: use MCP when data access is the bottleneck, not prompt writing.

Info: Use Skills to standardize behavior; use MCP to standardize access.

Skills vs MCP

| Dimension | Skills | MCP |
| --- | --- | --- |
| Purpose | reusable task logic | external access |
| Token impact | reduces repetition | reduces context |
| Best for | repeated tasks | large data |
| Not for | data retrieval | task logic |

Skills define HOW Claude works. MCP defines WHAT Claude can access.

Execution Checklist

Use this before long sessions to keep token spend predictable.

  • Plan in chat first: reduce ambiguity before execution.
  • Pass only required context: lower baseline token cost.
  • Avoid full repo loads for local edits: scope context to affected files.
  • Use precise prompts: reduce retries and output drift.
  • Rerun only changed stages: avoid duplicate full-loop costs.
  • Cap retries and recursion: prevent compounding overhead (see the sketch after this list).
  • Use CLAUDE.md, Skills, and MCP: reduce repetition and context bloat.
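
A retry cap can live in the harness itself. A sketch, where run_stage is a hypothetical callable for one pipeline stage:

```python
# Fail fast instead of compounding token spend across blind retries.
MAX_RETRIES = 2

def run_with_cap(run_stage, *args):
    # run_stage is assumed to return an object with an .ok attribute.
    for _ in range(1 + MAX_RETRIES):
        result = run_stage(*args)
        if result.ok:
            return result
    raise RuntimeError("Retry budget exhausted; rescope the task instead.")
```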

Core Principle

LLM efficiency is mostly workflow design, not wordsmithing.

High-performing teams spend less time trimming prompts and more time controlling context scope, iteration count, and execution sequencing.