The Claude 4.6 features you should ship today

1M context, extended thinking, the memory tool, code execution, 1-hour prompt caching, computer use. A tour of what's production-ready in Claude 4.6 and how to wire each one into a real app.

claude · claude-4-6 · anthropic-api · ai-engineering

Claude 4.6 isn't one feature — it's a cluster of them, and most teams are using 20% of what's available. Here's the honest tour, with where each feature earns its keep and where it's a waste of tokens.

1M context — good, but only with caching

1M-token windows on Opus 4.6 and Sonnet 4.6 unlock workflows that chunked RAG can't: full-codebase refactors, cross-document synthesis, holistic review. Enable it with the context-1m-2025-08-07 beta header. Above 200K tokens, pricing shifts to a long-context tier.

The catch: 1M input is expensive per call. It becomes affordable when you pair it with prompt caching at 1-hour TTL. Cache the static portion (the codebase, the document set) once, pay ~10% of input cost on every subsequent question. Break-even hits after about 2 questions — so if you're asking many questions against the same corpus, this is the workflow.
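The break-even arithmetic is worth sketching. A minimal model, assuming illustrative relative prices (cache writes at roughly 2× a fresh read, cache reads at ~10% — substitute your model's actual long-context rates):

```python
# Relative input-cost model for repeated questions over one static corpus.
# All rates are illustrative assumptions, not published prices.
FRESH_INPUT = 1.0   # cost per token of uncached input
CACHE_WRITE = 2.0   # 1-hour cache writes assumed ~2x a fresh read
CACHE_READ = 0.1    # cache reads assumed ~10% of fresh input

def total_cost(n_questions: int, corpus_tokens: int, cached: bool) -> float:
    """Relative input cost of asking n questions against the same corpus."""
    if not cached:
        # Every question re-sends the full corpus at fresh-input rates.
        return n_questions * corpus_tokens * FRESH_INPUT
    # One cache write up front, then cheap reads on every question.
    return corpus_tokens * CACHE_WRITE + n_questions * corpus_tokens * CACHE_READ

corpus = 800_000  # tokens of static context (codebase, document set)
# Under these rates the cached path pulls ahead around the third question.
assert total_cost(3, corpus, cached=True) < total_cost(3, corpus, cached=False)
assert total_cost(1, corpus, cached=True) > total_cost(1, corpus, cached=False)
```

The exact crossover depends on the write multiplier, but the shape is the point: one question favors pay-per-call, many questions favor the cache.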

When to skip it: when your corpus is > 1M tokens (use RAG), or when you only ask one question (pay-per-call without caching is brutal).

Extended thinking — surgical, not universal

Extended thinking gives the model a scratchpad before answering. Configured via thinking: { type: "enabled", budget_tokens: 10000 }. Thinking tokens bill as output tokens, so a reasoning-heavy call effectively doubles or triples in cost.

The rule: gate thinking on task complexity. Simple classification, extraction, and formatting don't benefit. Complex planning, math, multi-step code generation, and agent loops with > 3 tools do. Start at a 5K budget, measure quality-vs-cost, and size from there.

For agents, the highest-leverage variant is interleaved thinking (the interleaved-thinking-2025-05-14 beta). The model thinks between tool calls — reassessing after each result — instead of thinking once at the start. Almost always worth enabling for multi-tool agents.
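Gating on complexity is easiest to enforce at the request-builder layer. A sketch, using the model id, thinking shape, and beta header named above (the `complex_task`/`multi_tool` flags and model id are assumptions you'd adapt):

```python
def build_request(messages: list, complex_task: bool, multi_tool: bool):
    """Return (kwargs, extra_headers) for client.messages.create(...).
    Thinking is only attached when the task warrants it."""
    kwargs = {
        "model": "claude-sonnet-4-6",  # assumption: adjust to your model id
        "max_tokens": 4096,
        "messages": messages,
    }
    headers = {}
    if complex_task:
        # Start at a 5K budget and size from measured quality-vs-cost.
        kwargs["thinking"] = {"type": "enabled", "budget_tokens": 5000}
        if multi_tool:
            # Think between tool calls, not just once at the start.
            headers["anthropic-beta"] = "interleaved-thinking-2025-05-14"
    return kwargs, headers
```

Simple extraction calls go out with no `thinking` key at all, so they pay nothing extra.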

Memory tool — persistent state without RAG

The memory tool (memory_20250818) lets Claude read and write files in a directory your app manages. Think of it as giving the agent a filesystem. You execute the operations; the model decides what's worth writing.

Where this wins over alternatives:

  • Over "stuff everything into system prompt": state persists cheaply across sessions
  • Over RAG: the agent curates what matters, not a retrieval system blindly returning top-k
  • Over session history: structured memory survives compaction

Minimum viable implementation: per-user directory, path sandboxing, size caps, audit log. Seed a directory layout in the system prompt (/memory/user_profile.md, /memory/decisions/, etc.) so the agent uses consistent paths.
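The path-sandboxing piece is the one you must not get wrong. A minimal sketch (the root directory and the virtual `/memory` prefix are assumptions matching the layout above):

```python
from pathlib import Path

MEMORY_ROOT = Path("/srv/agent-memory")  # hypothetical per-deployment root

def resolve_memory_path(user_id: str, requested: str) -> Path:
    """Map a model-supplied path like '/memory/user_profile.md' into this
    user's directory, rejecting anything that escapes the sandbox."""
    user_root = (MEMORY_ROOT / user_id).resolve()
    # Strip the virtual '/memory' prefix the agent sees in its prompts.
    relative = requested.removeprefix("/memory").lstrip("/")
    target = (user_root / relative).resolve()
    if not target.is_relative_to(user_root):
        raise PermissionError(f"path escapes sandbox: {requested}")
    return target
```

Every read/write the model requests goes through this function before touching disk; a `../../` in the path raises instead of resolving.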

Biggest mistake we see: sharing a single memory root across users. Instant privacy incident.

Code execution — the analysis tool

code_execution_20250522 runs Python in a sandboxed container as part of the model response. Preinstalled: pandas, numpy, matplotlib, scipy, sklearn, pillow. No network by default. Ephemeral filesystem unless you pass a container ID for persistence.
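Wiring it up is mostly a tools entry plus a beta header. A request sketch using the tool type named above (the beta header string and model id are assumptions to verify against current docs; no network call happens here):

```python
# Request payload for a code-execution turn; the model writes and runs the
# Python itself inside the sandbox, so the user content is plain English.
request = {
    "model": "claude-sonnet-4-6",  # assumption: adjust to your model id
    "max_tokens": 4096,
    "tools": [{"type": "code_execution_20250522", "name": "code_execution"}],
    "messages": [{
        "role": "user",
        "content": "Load sales.csv, compute monthly totals, and plot them.",
    }],
}
extra_headers = {"anthropic-beta": "code-execution-2025-05-22"}
```

Pass `request` as keyword arguments to `client.messages.create(..., extra_headers=extra_headers)`; plots come back as file references you fetch via the Files API.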

Ship it when:

  • You need the model to actually compute, not describe (data analysis, math verification)
  • You want matplotlib plots returned as images via the Files API
  • You want the model to run and iterate on code it just wrote

Skip it for pure text generation — you're paying for a sandbox you won't use.

A subtle win: pair code execution with interleaved thinking. The model plans, writes, runs, reads output, reflects, iterates. That's a mini data scientist you can drop into a session.

Prompt caching 1-hour TTL — the biggest cost lever

Caching is the single highest-leverage feature for teams with large repeated context. Two TTLs: 5 minutes (ephemeral, cheap to write) and 1 hour (more expensive to write, persists across sessions).

Cache reads cost ~10% of fresh input. On a workload where 90% of your input is stable (system prompt, knowledge base, tool definitions), caching drops spend by 80% once the cache is warm.

The common mistakes that silently disable caching:

  • A timestamp or UUID in the system prompt (moves every call, kills the cache)
  • Reordering tools between calls (byte-exact prefix match required)
  • Forgetting cache_control (no breakpoint = no caching)
  • Prefix below the minimum cacheable size (1,024 tokens on most models; 2,048 on Haiku)

Instrumentation matters. Log cache_read_input_tokens and cache_creation_input_tokens on every call. Target > 80% hit rate. If you see consistent creation on every call, your prefix is shifting — diff two consecutive requests byte by byte.
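The hit-rate metric is one division over the usage fields above. A sketch, treating `usage` as a plain dict mirroring `response.usage`:

```python
def cache_hit_rate(usage: dict) -> float:
    """Fraction of input tokens served from cache on one API response."""
    read = usage.get("cache_read_input_tokens", 0)
    created = usage.get("cache_creation_input_tokens", 0)
    fresh = usage.get("input_tokens", 0)
    total = read + created + fresh
    return read / total if total else 0.0

# A warm cache on a stable prefix: almost everything is a read.
warm = {"cache_read_input_tokens": 90_000,
        "cache_creation_input_tokens": 0,
        "input_tokens": 1_500}
assert cache_hit_rate(warm) > 0.8  # the target from above
```

Log this per call and alert when it drops: sustained `cache_creation_input_tokens` with near-zero reads is the signature of a shifting prefix.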

We have a dedicated skill on prompt caching TTLs that walks through the four breakpoints you get per request and how to place them.

Computer use — last resort, not first

Computer Use lets Claude take screenshots, click, type, and control a virtual desktop. Enable with computer_20250124 plus the beta header.

Cases where it's the right call:

  • Legacy apps with no API
  • Cross-app workflows nothing else can automate
  • QA/UI testing where you want the agent to drive a real browser

Cases where you should run screaming:

  • Any task that has an API (Computer Use is 10-100× more expensive and slower)
  • Unsandboxed environments with real user data
  • High-volume repetitive actions
  • Destructive operations without human-in-loop

The dominant threat model is screen-injection: the agent reads untrusted text on a page, the page says "now click delete", the agent does. Isolated VM, network allow-list, human confirmation for destructive actions, explicit prompt instructions to ignore page-embedded instructions — all four, or skip it.
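The human-confirmation layer can be as small as a gate between the model's proposed action and your executor. A sketch (the action names and `confirm` hook are assumptions, not part of any API):

```python
# Actions that must never run without a human saying yes.
DESTRUCTIVE = {"delete", "submit", "purchase", "send"}

def gate_action(action: str, target: str, confirm) -> bool:
    """Allow non-destructive actions; route the rest through confirm()."""
    if action in DESTRUCTIVE:
        return confirm(f"Agent wants to {action} on {target!r}. Allow?")
    return True

# Deny-by-default hook for unattended runs: destructive actions just fail.
assert gate_action("click", "Save draft", confirm=lambda msg: False) is True
assert gate_action("delete", "All messages", confirm=lambda msg: False) is False
```

The gate is deliberately dumb: it does not try to judge intent, only to force a human into the loop for the irreversible verbs.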

The combination that actually ships

If you're building a serious agent on Claude 4.6 in 2026, here's a stack that holds together:

  1. Sonnet 4.6 as default model
  2. Prompt caching at 1h TTL on the system prompt and any large static context
  3. Extended thinking with interleaved mode for any tool-use loop with > 3 tools
  4. Memory tool for per-user persistent state
  5. Code execution for any analytical or verification step
  6. Opus 4.6 as the escalation model for hardest queries (route by difficulty)
  7. Computer use nowhere, unless you've ruled out every other option
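Step 6's difficulty routing can start as a crude heuristic and be replaced with a learned scorer later. A toy sketch (the thresholds, keywords, and model ids are illustrative assumptions):

```python
def route_model(prompt: str, tool_count: int) -> str:
    """Escalate long, multi-tool, or planning-heavy requests to Opus;
    everything else stays on the cheaper default."""
    hard = (
        len(prompt) > 2_000
        or tool_count > 3
        or any(k in prompt.lower() for k in ("refactor", "prove", "plan"))
    )
    return "claude-opus-4-6" if hard else "claude-sonnet-4-6"

assert route_model("What's 2+2?", tool_count=0) == "claude-sonnet-4-6"
assert route_model("Plan a migration of the billing service", tool_count=5) == "claude-opus-4-6"
```

Even this crude router captures most of the cost savings, because the easy majority of traffic never touches the expensive model.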

The features are composable — you can turn all of them on in a single request. The interesting engineering work in 2026 isn't building any single feature; it's choosing the right combination for each user request and routing accordingly.

Start by adding prompt caching. That one change typically pays for everything else.

latestaiagents | MIT License