2025: The Year in LLMs - A Comprehensive Review
As 2025 draws to a close, it’s time to reflect on a year that fundamentally shifted the AI landscape. While 2024 introduced many concepts, 2025 was the year they matured and became practical. Drawing from Simon Willison’s excellent annual review, here are the defining trends that shaped the LLM world this year.
Reasoning Models Changed Everything
OpenAI’s o-series models introduced inference-time scaling: spending extra compute at inference so the model can break a problem into intermediate reasoning steps before answering. What started as an experiment became standard practice across all major labs. This approach fundamentally changed how models tackle complex tasks, particularly tool-use scenarios where step-by-step planning matters.
The impact on practical applications was immediate. Models that could reason through problems achieved gold-medal-level results at July’s International Math Olympiad and September’s International Collegiate Programming Contest, solving novel problems rather than reciting memorized solutions.
Agents Finally Arrived
After years of hype, AI agents that run tools in loops to achieve goals finally materialized in 2025. The “gullibility problem”—where models would blindly execute whatever they were told—was partially solved through improved reasoning capabilities.
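The core pattern behind these agents is simple to sketch: a loop in which the model picks a tool, the harness runs it, and the result is fed back until the model produces a final answer. The sketch below is illustrative only; the stubbed model, tool registry, and message format are invented for this example and do not reflect any particular lab’s API.

```python
# Minimal sketch of an "agent loop": the model repeatedly chooses a tool,
# we run it, and feed the result back until it produces a final answer.
# fake_model is a stand-in; a real agent would call an LLM API here.

def fake_model(messages):
    """Stub LLM: decides the next action from the transcript so far."""
    if not any(m["role"] == "tool" for m in messages):
        return {"action": "tool", "name": "add", "args": {"a": 2, "b": 3}}
    return {"action": "final", "content": "The sum is 5."}

# Hypothetical tool registry: names the model may call, mapped to functions.
TOOLS = {"add": lambda a, b: a + b}

def run_agent(goal, model=fake_model, max_steps=10):
    messages = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        step = model(messages)
        if step["action"] == "final":
            return step["content"]
        result = TOOLS[step["name"]](**step["args"])  # execute the chosen tool
        messages.append({"role": "tool", "content": str(result)})
    raise RuntimeError("agent did not finish within max_steps")

print(run_agent("What is 2 + 3?"))  # → The sum is 5.
```

The “gullibility problem” lives in that loop: whatever text comes back from a tool goes straight into the transcript the model reasons over, which is exactly where prompt injection strikes.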
The most impactful development was Claude Code’s February release, which quickly became a phenomenon. By December, Anthropic credited Claude Code with contributing to a $1 billion run-rate revenue—remarkable for a command-line tool.
Major labs rushed to release competing CLI coding agents, while asynchronous variants (Claude Code for web, OpenAI’s Codex cloud, Google’s Jules) let agents work on code in sandboxed cloud environments, avoiding the security risks of local execution.
Chinese Labs Seized the Crown
DeepSeek’s January R1 release triggered a $593 billion drop in NVIDIA’s market cap in a single day. But that was just the beginning. By December, Chinese models—GLM-4.7, Kimi K2, DeepSeek V3.2, MiniMax-M2.1—dominated the top ranks of open-weight benchmarks.
This represented a dramatic reversal from 2024. Meanwhile, Meta’s Llama 4 stumbled with oversized models (109B minimum) that alienated users accustomed to laptop-runnable versions, effectively ceding open-weight leadership to Chinese competitors.
OpenAI’s Changing Position
OpenAI maintained consumer dominance through ChatGPT but faced unprecedented competition across categories:
- Image generation: Google’s Nano Banana Pro overtook OpenAI’s image models
- Coding: Claude Opus 4.5 took the lead
- Open-weight: Chinese models dominated
ChatGPT’s image editing feature generated 100 million signups in a single week in March, proving the consumer product remains strong. But the technical leadership that seemed unassailable in 2024 became contested territory.
Google Found Its Footing
Google’s Gemini line (2.0, 2.5, 3.0) proved genuinely competitive, with 1M+ token context windows becoming standard. Nano Banana Pro emerged as the leader for text-heavy image generation, excelling at infographics and documents—previously a weak spot for AI image models.
The integration of AI into Chrome and Google’s broader ecosystem raised both possibilities and concerns about browser security.
The Command-Line Renaissance
Perhaps no trend was more surprising than the mainstream adoption of terminal-based AI tools. LLM CLI tools achieved widespread use through coding agents, proving that the command line, long dismissed as too niche, could carry a mainstream AI interface.
This renaissance changed how developers interact with AI: from chat windows to terminals and editors where AI operates as a genuine collaborator rather than a separate tool to consult.
Vibe Coding Entered the Lexicon
Andrej Karpathy’s February coinage captured a new development style: “forget that the code even exists.” Vibe coding meant prompting without reading diffs, trusting the AI to handle implementation details.
This approach proved controversial. Proponents argued it unlocked new levels of productivity; critics worried about code quality and maintainability. The debate will likely continue into 2026.
Security Concerns Intensified
The year brought serious security considerations to the forefront:
The Lethal Trifecta: Simon Willison’s term for the dangerous combination of access to private data, exposure to untrusted content, and the ability to communicate externally. An agent that holds all three can be tricked by prompt injection into exfiltrating data.
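One way to internalize the trifecta is as a simple capability audit: list what an agent can do, and check whether all three legs are present. The checker and capability names below are invented for illustration, not part of any real security tooling.

```python
# Toy audit for the "lethal trifecta": an agent combining all three
# capabilities below is exposed to data exfiltration via prompt injection.
TRIFECTA = {"private_data", "untrusted_content", "external_comms"}

def trifecta_risk(capabilities):
    """Return (trifecta capabilities present, whether all three are present)."""
    present = TRIFECTA & set(capabilities)
    return present, present == TRIFECTA

# Hypothetical agent config: it reads private repos, browses the web,
# and can send HTTP requests, so all three legs are present.
caps = {"private_data", "untrusted_content", "external_comms", "shell"}
present, dangerous = trifecta_risk(caps)
print(dangerous)  # → True
```

The practical mitigation follows directly from the check: removing any one leg (most often external communication) breaks the exfiltration chain.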
Browser Integration Risks: ChatGPT Atlas, Claude in Chrome, and Gemini in Chrome raised concerns about prompt injection attacks accessing sensitive browser data. Labs acknowledged this as a “frontier, unsolved” problem.
Normalization of Deviance: Security researcher Johann Rehberger warned that repeated risky behavior without consequences (like running YOLO mode agents) echoed the dynamics that led to the Challenger disaster.
The Rise of Long Tasks
METR’s research showed models doubling their task-completion duration every 7 months. By year-end, frontier models tackled 5-hour human tasks. This extension of capability opened new possibilities for autonomous work while raising questions about oversight and validation.
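METR’s doubling claim turns into a back-of-the-envelope projection with one line of arithmetic. The 5-hour horizon and 7-month doubling period come from the paragraph above; the one-year extrapolation is simple math, not a forecast.

```python
# Project the METR task-horizon trend: task duration doubles every 7 months.
def projected_horizon_hours(start_hours, months_ahead, doubling_months=7):
    """Exponential extrapolation of how long a task frontier models handle."""
    return start_hours * 2 ** (months_ahead / doubling_months)

# Starting from a 5-hour horizon, one year out:
print(round(projected_horizon_hours(5, 12), 1))  # → 16.4
```

Even this crude extrapolation shows why oversight is the pressing question: a year of the same trend pushes autonomous work from an afternoon to multiple working days.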
New Pricing Tiers Emerged
Claude Max 20x ($200/month) and ChatGPT Pro established new premium tiers. The justification? Massive token consumption from agentic workflows. When a coding agent burns through context windows across multi-hour tasks, the economics require different pricing models.
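The economics are easy to see with rough numbers. Every figure below, prices and token counts alike, is a made-up illustration rather than any lab’s actual rates.

```python
# Why agentic workloads strain flat-rate plans: a rough cost model.
# All figures are hypothetical illustrations, not real pricing.
PRICE_PER_M_INPUT = 3.00    # $ per million input tokens (assumed)
PRICE_PER_M_OUTPUT = 15.00  # $ per million output tokens (assumed)

def session_cost(input_tokens, output_tokens):
    """API-style cost of one session at the assumed per-token rates."""
    return (input_tokens / 1e6) * PRICE_PER_M_INPUT + \
           (output_tokens / 1e6) * PRICE_PER_M_OUTPUT

# A multi-hour coding-agent session that re-reads a large repo many times
# might plausibly consume tens of millions of input tokens:
agent = session_cost(40_000_000, 2_000_000)  # → 150.0 (dollars)
chat = session_cost(50_000, 5_000)           # well under a dollar
print(f"${agent:.2f} vs ${chat:.2f}")
```

At these assumed rates a single heavy agent session costs more than a month of casual chat usage, which is the gap the $200/month tiers exist to cover.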
Local Models Hit a Sweet Spot
Models in the 20-32B parameter range, like Mistral Small 3, achieved GPT-4-class performance on consumer hardware. While frontier cloud models remained superior for agentic work, the local option became viable for many use cases—important for privacy-conscious applications and cost-sensitive workflows.
MCP’s Uneven Year
Model Context Protocol adoption exploded across labs in early 2025. However, coding agents’ shell access may have made it less critical than expected—why use MCP when the agent can just run commands? Anthropic’s simpler “Skills” format gained traction as an alternative.
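The shell-access argument is easy to demonstrate: one generic “run a command” tool covers much of what a collection of bespoke MCP servers would. The sketch below assumes a POSIX shell and is an illustration of the argument, not a recommendation (unsandboxed shell access is exactly the risk discussed above).

```python
# Why shell access undercut the case for per-service tools: a single
# generic command runner lets the agent compose existing CLIs (git,
# grep, curl, ...) instead of needing a dedicated tool for each.
import subprocess

def run_command(cmd, timeout=30):
    """Generic shell tool: run a command and return its captured stdout."""
    result = subprocess.run(cmd, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout.strip()

print(run_command("echo hello"))  # → hello
```

MCP’s structured tool definitions still matter where sandboxing, typed inputs, or non-shell environments are required; the point is that coding agents often no longer need them.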
Data Center Backlash
Over 200 environmental groups demanded halts to new U.S. data center construction in December. Local opposition to AI infrastructure surged throughout the year. The sustainability of AI’s growth trajectory became a mainstream concern rather than a niche issue.
What 2025 Taught Us
The year consolidated rather than invented paradigms:
- Reasoning models moved from experimental to essential
- Agents became practical tools rather than demos
- Chinese labs proved they could compete at the frontier
- LLM integration into daily tools normalized
The fundamental question shifted from “Can LLMs do X?” to “How do we safely deploy LLMs doing X at scale?”
Looking Ahead
2025 was a year of maturation. The wild frontier of 2024 gave way to practical deployments, real revenue, and genuine integration into software development workflows. The tools that seemed like experiments became standard practice.
For developers, the message is clear: AI-assisted development isn’t a future possibility—it’s the present reality. The question isn’t whether to adopt these tools, but how to use them effectively and safely.
As we enter 2026, the foundations laid this year will determine what becomes possible next. The agents are here, the reasoning works, and the integration is happening. What we build on this foundation is up to us.
This post summarizes themes from Simon Willison’s comprehensive 2025 year-in-review, which covers 24 trends in significantly more detail.