Anthropic has released two significant model upgrades: Claude Opus 4.6 and Claude Sonnet 4.6. These releases mark a major step forward in agentic coding, computer use, and long-context reasoning, with both models now offering 1M token context windows in beta.
Claude Opus 4.6: The New Frontier Model
Claude Opus 4.6 is Anthropic’s most capable model to date, delivering state-of-the-art performance across multiple domains. The model achieves the highest score on Terminal-Bench 2.0, an agentic coding evaluation, and leads all frontier models on Humanity’s Last Exam, a complex multidisciplinary reasoning test.
Breakthrough Performance Metrics
On GDPval-AA, an evaluation measuring performance on economically valuable knowledge work tasks in finance, legal, and other domains, Opus 4.6 outperforms OpenAI’s GPT-5.2 by approximately 144 Elo points and its predecessor Claude Opus 4.5 by 190 points. This translates to Opus 4.6 obtaining a higher score than GPT-5.2 roughly 70% of the time.
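The Elo-to-win-rate conversion above can be sanity-checked with the standard Elo expected-score formula (a back-of-the-envelope check, not necessarily Anthropic's exact methodology):

```python
# Standard Elo expected-score formula: the probability that a player
# rated `gap` points higher outscores its opponent.
def elo_win_probability(gap: float) -> float:
    return 1.0 / (1.0 + 10 ** (-gap / 400))

# A ~144-point Elo lead corresponds to winning roughly 70% of the time.
print(round(elo_win_probability(144), 3))  # → 0.696
```

The same formula puts the 190-point gap over Opus 4.5 at roughly a 75% win rate, consistent with the scale of improvement described.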
The model also excels at BrowseComp, which measures a model’s ability to locate hard-to-find information online, outperforming all other models in this critical capability.
Enhanced Coding and Agentic Capabilities
Early access partners report that Opus 4.6 brings unprecedented focus to challenging tasks, handles ambiguous problems with better judgment, and stays productive over longer sessions. The model demonstrates:
- Improved planning: Breaking complex tasks into independent subtasks and running tools in parallel
- Better codebase navigation: Handling multi-million-line codebases like a senior engineer
- Enhanced debugging: Identifying edge cases that other models miss
- Autonomous execution: Completing tasks without constant hand-holding
One particularly impressive demonstration came from a partner who reported that Opus 4.6 “autonomously closed 13 issues and assigned 12 issues to the right team members in a single day, managing a ~50-person organization across 6 repositories.”
Long-Context Mastery
Opus 4.6 represents a qualitative shift in long-context performance. On the 8-needle 1M variant of MRCR v2—a needle-in-a-haystack benchmark—Opus 4.6 scores 76%, compared to just 18.5% for Sonnet 4.5. This dramatic improvement addresses the common complaint of “context rot,” where model performance degrades as conversations grow longer, even within the stated context window.
The 1M token context window is sufficient to hold entire codebases, lengthy contracts, or dozens of research papers in a single request, and Opus 4.6 can reason effectively across all that context.
Claude Sonnet 4.6: Frontier Performance at Scale
Claude Sonnet 4.6 brings what Anthropic calls “frontier-level reasoning in a smaller and more cost-effective form factor.” The model approaches Opus-level intelligence at a price point that makes it practical for far more tasks.
Computer Use Capabilities
Anthropic was the first to introduce a general-purpose computer-using model in October 2024. Sonnet 4.6 shows remarkable progress on OSWorld, the standard benchmark for AI computer use, which presents hundreds of tasks across real software like Chrome, LibreOffice, and VS Code.
Early users report human-level capability in tasks like navigating complex spreadsheets, filling out multi-step web forms, and coordinating work across multiple browser tabs. The model has also shown significant improvements in resistance to prompt injection attacks, a critical security concern for computer use applications.
Coding Excellence
In Claude Code, early testing found that users preferred Sonnet 4.6 over Sonnet 4.5 roughly 70% of the time. Users reported that it:
- More effectively reads context before modifying code
- Consolidates shared logic rather than duplicating it
- Shows less “laziness” and overengineering
- Demonstrates better instruction following
- Produces fewer hallucinations and false claims of success
Remarkably, users even preferred Sonnet 4.6 to Opus 4.5 (the previous frontier model from November) 59% of the time, particularly praising its consistency and follow-through on multi-step tasks.
Strategic Reasoning
Sonnet 4.6 demonstrated impressive strategic thinking in the Vending-Bench Arena evaluation, which tests how well a model can run a simulated business over time. The model developed a novel strategy: investing heavily in capacity for the first ten simulated months while spending significantly more than competitors, then pivoting sharply to focus on profitability in the final stretch. This timing helped it finish well ahead of the competition.
Safety and Alignment
Both models show strong safety profiles. For Opus 4.6, Anthropic conducted its most comprehensive set of safety evaluations for any model to date, including new evaluations for user wellbeing, complex refusal tests, and updated evaluations of the model’s ability to surreptitiously perform harmful actions.
On automated behavioral audits, Opus 4.6 showed low rates of misaligned behaviors such as deception, sycophancy, and cooperation with misuse. It also shows the lowest rate of over-refusals—where the model fails to answer benign queries—of any recent Claude model.
For cybersecurity, where Opus 4.6 shows particular strengths, Anthropic developed six new cybersecurity probes to help track different forms of potential misuse. The company is also accelerating cyberdefensive uses of the model, using it to help find and patch vulnerabilities in open-source software.
New Features and Capabilities
Both models come with significant product updates:
API Features
- Adaptive thinking: Models can now decide when deeper reasoning would be helpful, rather than requiring a binary on/off choice
- Effort controls: Four effort levels (low, medium, high, max) let developers balance intelligence, speed, and cost
- Context compaction: Automatically summarizes older context as conversations approach limits, enabling longer-running tasks
- 128k output tokens: Supports larger outputs without breaking tasks into multiple requests
- US-only inference: Available for workloads requiring US-based processing
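As a sketch of how these controls might look in a request, here is a hypothetical payload. The field names `effort` and `thinking`, the model identifier, and the beta header value are illustrative assumptions, not confirmed API parameters; check the official API reference for the real names:

```python
# Hypothetical request payload illustrating the features listed above.
# NOTE: `effort`, `thinking`, the model ID, and the beta header value are
# assumptions for illustration; consult Anthropic's API docs for actuals.
payload = {
    "model": "claude-opus-4-6",        # assumed model identifier
    "max_tokens": 128_000,             # new 128k output-token ceiling
    "effort": "high",                  # low / medium / high / max (assumed field)
    "thinking": {"type": "adaptive"},  # model decides when to reason deeply (assumed)
    "messages": [
        {"role": "user", "content": "Refactor the attached module..."}
    ],
}

# A long-context request would additionally opt into the 1M-token beta,
# e.g. via a beta header (header value is a placeholder):
headers = {"anthropic-beta": "context-1m"}
```

The practical upshot is that intelligence, latency, and cost become per-request dials rather than a one-time model choice.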
Product Updates
- Agent teams in Claude Code: Multiple agents can work in parallel and coordinate autonomously
- Claude in Excel: Improved performance with MCP connector support for tools like S&P Global, LSEG, and FactSet
- Claude in PowerPoint: New research preview for creating presentations while maintaining brand consistency
Implications for AI Productivity Tools
The release of Claude Opus 4.6 and Sonnet 4.6 has significant implications for AI-powered productivity tools. The models’ improved long-context reasoning and computer use capabilities make them particularly well-suited for complex knowledge work.
For users of tools like ChatGPT to Notion, these advancements suggest a future where AI assistants can handle increasingly sophisticated workflows—from processing lengthy documents to coordinating multi-step tasks across different applications. The ability to maintain context over 1M tokens means AI can now work with entire project histories, comprehensive documentation sets, or extensive research materials without losing track of important details.
The improved computer use capabilities also point toward a future where AI can more seamlessly integrate with existing productivity tools, potentially automating complex workflows that previously required significant manual intervention.
Availability and Pricing
Both models are available now on claude.ai, the Claude API, and all major cloud platforms. Pricing remains unchanged:
- Opus 4.6: $5/$25 per million input/output tokens
- Sonnet 4.6: $3/$15 per million input/output tokens
The 1M token context window is currently available in beta on the Claude Developer Platform only, with premium pricing ($10/$37.50 per million input/output tokens) for prompts exceeding 200k tokens.
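The tiered pricing can be illustrated with a small cost estimator. This sketch assumes, as one plausible reading, that the premium rate applies to the whole request once the prompt exceeds 200k tokens; verify the exact tier mechanics against the official pricing page:

```python
def opus_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimated Opus 4.6 cost: $5/$25 per million input/output tokens
    at standard rates, $10/$37.50 when the prompt exceeds 200k tokens.
    (How the premium tier is applied is an assumption here.)"""
    if input_tokens > 200_000:
        in_rate, out_rate = 10.0, 37.50   # premium long-context rates
    else:
        in_rate, out_rate = 5.0, 25.0     # standard rates
    return input_tokens / 1e6 * in_rate + output_tokens / 1e6 * out_rate

# 100k-token prompt, 4k-token response at standard rates:
print(f"${opus_cost_usd(100_000, 4_000):.2f}")  # → $0.60
# 500k-token prompt, 4k-token response at premium rates:
print(f"${opus_cost_usd(500_000, 4_000):.2f}")  # → $5.15
```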
Anthropic has also upgraded its free tier to Sonnet 4.6 by default, now including file creation, connectors, skills, and compaction.
Conclusion
The release of Claude Opus 4.6 and Sonnet 4.6 represents a significant milestone in AI development. The models’ combination of enhanced reasoning capabilities, improved coding skills, and robust safety measures sets a new standard for frontier AI systems.
Early feedback from partners suggests these models are already changing how teams approach complex technical and knowledge work. As one partner noted, “It feels less like a tool and more like a capable collaborator.”
With these releases, Anthropic continues to push the boundaries of what’s possible with AI while maintaining a strong focus on safety and alignment—a balance that will be crucial as AI systems take on increasingly sophisticated tasks in professional environments.