The Claude 4.6 features you should ship today
1M context, extended thinking, the memory tool, code execution, 1-hour prompt caching, computer use. A tour of what's production-ready in Claude 4.6 and how to wire each one into a real app.
Claude 4.6 isn't one feature — it's a cluster of them, and most teams are using 20% of what's available. Here's the honest tour, with where each feature earns its keep and where it's a waste of tokens.
1M context — good, but only with caching
1M-token windows on Opus 4.6 and Sonnet 4.6 unlock workflows that chunked RAG can't: full-codebase refactors, cross-document synthesis, holistic review. Enable it with the context-1m-2025-08-07 beta header. Above 200K tokens, pricing shifts to a long-context tier.
The catch: 1M input is expensive per call. It becomes affordable when you pair it with prompt caching at 1-hour TTL. Cache the static portion (the codebase, the document set) once, pay ~10% of input cost on every subsequent question. Break-even hits after about 2 questions — so if you're asking many questions against the same corpus, this is the workflow.
When to skip it: when your corpus is > 1M tokens (use RAG), or when you only ask one question (pay-per-call without caching is brutal).
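The caching pairing above can be sketched as a request builder. This is a minimal sketch, not a definitive implementation: the model ID is an assumption, and `build_corpus_request` is an illustrative helper; the beta header and `cache_control` shape follow the Anthropic Messages API.

```python
# Sketch: a long-context request with the 1M beta header and a 1-hour
# cache breakpoint on the static corpus, so repeat questions hit the
# cache instead of paying fresh input cost.

def build_corpus_request(corpus_text: str, question: str) -> tuple[dict, dict]:
    """Return (extra_headers, payload) for a cached long-context call."""
    headers = {"anthropic-beta": "context-1m-2025-08-07"}
    payload = {
        "model": "claude-sonnet-4-6",  # assumed model ID, check your docs
        "max_tokens": 2048,
        "system": [
            {
                "type": "text",
                "text": "You answer questions about the attached codebase.",
            },
            {
                "type": "text",
                "text": corpus_text,  # the large static portion
                # 1-hour TTL: survives across sessions, reads at ~10% cost
                "cache_control": {"type": "ephemeral", "ttl": "1h"},
            },
        ],
        "messages": [{"role": "user", "content": question}],
    }
    return headers, payload

headers, payload = build_corpus_request("...corpus...", "Where is auth handled?")
```

Only the corpus block carries the breakpoint; the question varies per call and sits after the cached prefix, so it never invalidates it.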
Extended thinking — surgical, not universal
Extended thinking gives the model a scratchpad before answering. Configured via thinking: { type: "enabled", budget_tokens: 10000 }. Thinking tokens bill as output tokens, so a reasoning-heavy call effectively doubles or triples in cost.
The rule: gate thinking on task complexity. Simple classification, extraction, and formatting don't benefit. Complex planning, math, multi-step code generation, and agent loops with > 3 tools do. Start at a 5K budget, measure quality-vs-cost, and size from there.
For agents, the highest-leverage variant is interleaved thinking (the interleaved-thinking-2025-05-14 beta). The model thinks between tool calls — reassessing after each result — instead of thinking once at the start. Almost always worth enabling for multi-tool agents.
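The gating rule above can be expressed as a small config helper. A sketch under stated assumptions: the complexity heuristic and the `thinking_config` helper are illustrative, while the `thinking` block shape and the interleaved-thinking beta header match the values quoted in this section.

```python
# Sketch: gate extended thinking on task complexity, and turn on
# interleaved thinking only for multi-tool agent loops.

def thinking_config(task_type: str, tool_count: int) -> tuple[dict, dict]:
    """Return (extra_headers, thinking_param) for a Messages API call."""
    simple = {"classification", "extraction", "formatting"}
    if task_type in simple:
        return {}, {"type": "disabled"}  # don't pay for a scratchpad

    headers = {}
    if tool_count > 3:  # agent loop: think between tool calls, not just once
        headers["anthropic-beta"] = "interleaved-thinking-2025-05-14"

    # Start at a 5K budget; measure quality-vs-cost and size from there.
    return headers, {"type": "enabled", "budget_tokens": 5000}
```

The point is that thinking is a per-call decision, not a global switch: the same app can spend nothing on extraction calls and 5K+ tokens on planning calls.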
Memory tool — persistent state without RAG
The memory tool (memory_20250818) lets Claude read and write files in a directory your app manages. Think of it as giving the agent a filesystem. You execute the operations; the model decides what's worth writing.
Where this wins over alternatives:
- Over "stuff everything into system prompt": state persists cheaply across sessions
- Over RAG: the agent curates what matters, not a retrieval system blindly returning top-k
- Over session history: structured memory survives compaction
Minimum viable implementation: per-user directory, path sandboxing, size caps, audit log. Seed a directory layout in the system prompt (/memory/user_profile.md, /memory/decisions/, etc.) so the agent uses consistent paths.
Biggest mistake we see: sharing a single memory root across users. Instant privacy incident.
Code execution — the analysis tool
code_execution_20250522 runs Python in a sandboxed container as part of the model response. Preinstalled: pandas, numpy, matplotlib, scipy, sklearn, pillow. No network by default. Ephemeral filesystem unless you pass a container ID for persistence.
Ship it when:
- You need the model to actually compute, not describe (data analysis, math verification)
- You want matplotlib plots returned as images via the Files API
- You want the model to run and iterate on code it just wrote
Skip it for pure text generation — you're paying for a sandbox you won't use.
A subtle win: pair code execution with interleaved thinking. The model plans, writes, runs, reads output, reflects, iterates. That's a mini data scientist you can drop into a session.
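That pairing can be sketched as a single request payload. Hedged accordingly: the model ID is an assumption, `analysis_request` is an illustrative helper, and the `code-execution-2025-05-22` beta header is the one commonly paired with this tool version; verify both against current docs before shipping.

```python
# Sketch: one payload combining the code execution tool with
# interleaved thinking, for plan -> write -> run -> reflect loops.

def analysis_request(question: str) -> tuple[dict, dict]:
    """Return (extra_headers, payload) for an analysis-style call."""
    headers = {
        "anthropic-beta": (
            "code-execution-2025-05-22,interleaved-thinking-2025-05-14"
        )
    }
    payload = {
        "model": "claude-sonnet-4-6",  # assumed model ID
        "max_tokens": 4096,
        "thinking": {"type": "enabled", "budget_tokens": 5000},
        "tools": [
            {"type": "code_execution_20250522", "name": "code_execution"}
        ],
        "messages": [{"role": "user", "content": question}],
    }
    return headers, payload
```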
Prompt caching 1-hour TTL — the biggest cost lever
Caching is the highest-leverage feature for teams with large repeated context. Two TTLs: 5 minutes (ephemeral, cheap to write) and 1 hour (more expensive to write, persists across sessions).
Cache reads cost ~10% of fresh input. On a workload where 90% of your input is stable (system prompt, knowledge base, tool definitions), caching drops spend by 80% once the cache is warm.
The common mistakes that silently disable caching:
- A timestamp or UUID in the system prompt (moves every call, kills the cache)
- Reordering tools between calls (byte-exact prefix match required)
- Forgetting cache_control (no breakpoint = no caching)
- Prefix below the minimum cacheable size (1,024 tokens on most models; 2,048 on Haiku)
Instrumentation matters. Log cache_read_input_tokens and cache_creation_input_tokens on every call. Target > 80% hit rate. If you see consistent creation on every call, your prefix is shifting — diff two consecutive requests byte by byte.
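The hit-rate target above reduces to a few lines over the usage blocks you're already logging. The `cache_hit_rate` helper and the sample numbers are illustrative; the usage field names are the ones named in this section.

```python
# Sketch: compute cache hit rate from logged per-call usage dicts.

def cache_hit_rate(usage_log: list[dict]) -> float:
    """Fraction of input tokens served from cache across logged calls."""
    read = sum(u.get("cache_read_input_tokens", 0) for u in usage_log)
    created = sum(u.get("cache_creation_input_tokens", 0) for u in usage_log)
    fresh = sum(u.get("input_tokens", 0) for u in usage_log)
    total = read + created + fresh
    return read / total if total else 0.0

calls = [
    {"cache_creation_input_tokens": 90_000, "input_tokens": 500},  # warm-up
    {"cache_read_input_tokens": 90_000, "input_tokens": 450},
    {"cache_read_input_tokens": 90_000, "input_tokens": 520},
]
rate = cache_hit_rate(calls)
# Creation on every call (not just the first) means your prefix is shifting.
```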
We have a dedicated skill on prompt caching TTLs that walks through the four breakpoints you get per request and how to place them.
Computer use — last resort, not first
Computer Use lets Claude take screenshots, click, type, and control a virtual desktop. Enable with computer_20250124 plus the beta header.
Cases where it's the right call:
- Legacy apps with no API
- Cross-app workflows nothing else can automate
- QA/UI testing where you want the agent to drive a real browser
Cases where you should run screaming:
- Any task that has an API (Computer Use is 10-100× more expensive and slower)
- Unsandboxed environments with real user data
- High-volume repetitive actions
- Destructive operations without human-in-loop
The dominant threat model is screen-injection: the agent reads untrusted text on a page, the page says "now click delete", the agent does. Isolated VM, network allow-list, human confirmation for destructive actions, explicit prompt instructions to ignore page-embedded instructions — all four, or skip it.
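The human-confirmation piece of that checklist can be sketched as a gate in your action-execution loop. Everything here is illustrative: the action schema, the `DESTRUCTIVE` word list, and the `confirm` hook are assumptions about how your app is wired, not part of the API.

```python
# Sketch: block destructive computer-use actions pending human approval.

DESTRUCTIVE = {"delete", "submit", "purchase", "send"}

def requires_confirmation(action: dict) -> bool:
    """Flag clicks whose target text matches a destructive intent."""
    text = (action.get("target_text") or "").lower()
    return any(word in text for word in DESTRUCTIVE)

def execute(action: dict, confirm) -> bool:
    # confirm() is your UI hook; never auto-approve in production.
    if requires_confirmation(action) and not confirm(action):
        return False  # blocked pending human approval
    # ... perform the click / keystroke inside the isolated VM ...
    return True
```

A word list is the crudest possible classifier; the structural point is that the gate sits in your loop, outside the model, so page-embedded instructions can't talk their way past it.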
The combination that actually ships
If you're building a serious agent on Claude 4.6 in 2026, here's a stack that holds together:
- Sonnet 4.6 as default model
- Prompt caching at 1h TTL on the system prompt and any large static context
- Extended thinking with interleaved mode for any tool-use loop with > 3 tools
- Memory tool for per-user persistent state
- Code execution for any analytical or verification step
- Opus 4.6 as the escalation model for hardest queries (route by difficulty)
- Computer use nowhere, unless you've ruled out every other option
The features are composable — you can turn all of them on in a single request. The interesting engineering work in 2026 isn't building any single feature; it's choosing the right combination for each user request and routing accordingly.
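The routing decision above can be sketched as one function. A sketch under stated assumptions: the difficulty score, thresholds, model IDs, and `route` helper are all illustrative, chosen only to show the shape of the per-request decision.

```python
# Sketch: choose model, thinking, and tools per user request.

def route(task: dict) -> dict:
    """Map one request's traits onto the feature stack."""
    hard = task.get("difficulty", 0) > 0.8  # assumed scoring, 0..1
    config = {
        "model": "claude-opus-4-6" if hard else "claude-sonnet-4-6",  # assumed IDs
        "thinking": None,
        "tools": [],
        "cache_ttl": "1h",  # always cache the static prefix
    }
    if task.get("tool_count", 0) > 3:  # multi-tool loop: interleaved thinking
        config["thinking"] = {"type": "enabled", "budget_tokens": 5000}
    if task.get("analytical"):
        config["tools"].append("code_execution")
    if task.get("per_user_state"):
        config["tools"].append("memory")
    return config
```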
Start by adding prompt caching. That one change typically pays for everything else.