2025: The Year in LLMs - A Comprehensive Review


As 2025 draws to a close, it’s time to reflect on a year that fundamentally shifted the AI landscape. While 2024 introduced many concepts, 2025 was the year they matured and became practical. Drawing from Simon Willison’s excellent annual review, here are the defining trends that shaped the LLM world this year.

Reasoning Models Changed Everything

OpenAI’s o-series models introduced inference-time scaling: the ability for LLMs to spend extra tokens breaking a problem into intermediate reasoning steps before answering. What started as an experiment became standard practice across every major lab. This approach fundamentally changed how models tackle complex tasks, particularly tool-use scenarios where step-by-step planning matters.

The impact on practical applications was immediate. Models that could reason through problems achieved gold-medal performances at July’s International Mathematical Olympiad and September’s International Collegiate Programming Contest, solving novel problems rather than reciting memorized solutions.

Agents Finally Arrived

After years of hype, AI agents that run tools in loops to achieve goals finally materialized in 2025. The “gullibility problem”—where models would blindly execute whatever they were told—was partially solved through improved reasoning capabilities.
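The "tools in a loop" pattern itself is simple. Here is a minimal sketch of it, with a stubbed-out model and a stubbed tool standing in for a real LLM API; the tool name, stop condition, and return values are all illustrative, not any particular lab's implementation:

```python
# Minimal sketch of the agent pattern: a model proposes tool calls in a loop
# until it declares the goal met. stub_model stands in for a real LLM API.

def stub_model(history):
    """Pretend model: requests one tool call, then finishes."""
    if not any(step[0] == "tool_result" for step in history):
        return {"action": "call_tool", "tool": "list_files", "args": {"path": "."}}
    return {"action": "finish", "answer": "Done: inspected the directory."}

TOOLS = {
    "list_files": lambda path=".": ["README.md", "main.py"],  # stubbed tool
}

def run_agent(goal, model, max_steps=10):
    history = [("goal", goal)]
    for _ in range(max_steps):
        decision = model(history)
        if decision["action"] == "finish":
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])
        history.append(("tool_result", result))  # feed the result back in
    return None  # loop budget exhausted without finishing

print(run_agent("inspect the project", stub_model))
```

The gullibility problem lives in that loop: whatever ends up in `history`, including text fetched from untrusted sources, can steer the next tool call.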

The most impactful development was Claude Code’s February release, which quickly became a phenomenon. By December, Anthropic credited Claude Code with driving a $1 billion annualized revenue run-rate, a remarkable figure for a command-line tool.

Major labs rushed out competing CLI coding agents, and asynchronous, sandboxed versions such as Claude Code for web, OpenAI Codex web, and Google Jules let agents work on code remotely without the security risks of running them on a local machine.

Chinese Labs Seized the Crown

DeepSeek’s January R1 release triggered a $593 billion drop in NVIDIA’s market cap in a single day. But that was just the beginning. By December, Chinese models—GLM-4.7, Kimi K2, DeepSeek V3.2, MiniMax-M2.1—dominated the top ranks of open-weight benchmarks.

This represented a dramatic reversal from 2024. Meanwhile, Meta’s Llama 4 stumbled with oversized models (109B minimum) that alienated users accustomed to laptop-runnable versions, effectively ceding open-weight leadership to Chinese competitors.

OpenAI’s Changing Position

OpenAI maintained consumer dominance through ChatGPT but faced unprecedented competition across categories:

  • Image generation: Google’s Nano Banana Pro outperformed OpenAI’s image models
  • Coding: Claude Opus 4.5 took the lead
  • Open-weight: Chinese models dominated

ChatGPT’s image editing feature generated 100 million signups in a single week in March, proving the consumer product remains strong. But the technical leadership that seemed unassailable in 2024 became contested territory.

Google Found Its Footing

Google’s Gemini line (2.0, 2.5, 3.0) proved genuinely competitive, with 1M+ token context windows becoming standard. Nano Banana Pro emerged as the leader for text-heavy image generation, excelling at infographics and documents—previously a weak spot for AI image models.

The integration of AI into Chrome and Google’s broader ecosystem raised both possibilities and concerns about browser security.

The Command-Line Renaissance

Perhaps no trend was more surprising than the mainstream adoption of terminal-based AI tools. Coding agents brought LLM CLI tools into widespread use, proving the command line was never too niche to serve as an AI interface.

This renaissance changed how developers interact with AI: from standalone chat windows to working environments where the AI operates as a genuine collaborator rather than a separate tool to consult.

Vibe Coding Entered the Lexicon

Andrej Karpathy’s February coinage captured a new development style: “forget that the code even exists.” Vibe coding meant prompting without reading diffs, trusting the AI to handle implementation details.

This approach proved controversial. Proponents argued it unlocked new levels of productivity; critics worried about code quality and maintainability. The debate will likely continue into 2026.

Security Concerns Intensified

The year brought serious security considerations to the forefront:

The Lethal Trifecta: Simon Willison’s term for the dangerous combination of private data access, external communication, and exposure to untrusted content; when all three are present, a prompt injection attack can exfiltrate data.
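The point of the trifecta is that each capability is safe on its own; the risk appears only when all three co-occur. A toy audit function makes the rule concrete (the field names here are illustrative, not from any real framework):

```python
# Sketch of a "lethal trifecta" audit: an agent deployment is high-risk only
# when all three risk factors are present at once. Argument names are
# illustrative, not part of any real security tool.

def lethal_trifecta(private_data_access, external_communication, untrusted_content):
    """Return True when all three risk factors co-occur."""
    return private_data_access and external_communication and untrusted_content

# A browsing agent that can read email AND fetch arbitrary web pages:
print(lethal_trifecta(True, True, True))   # → True
# Removing any one leg (e.g. no external communication) breaks the trifecta:
print(lethal_trifecta(True, False, True))  # → False
```

In practice the mitigation is exactly what the function suggests: deliberately remove one leg, such as blocking outbound network access for agents that handle private data.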

Browser Integration Risks: ChatGPT Atlas, Claude in Chrome, and Gemini in Chrome raised concerns about prompt injection attacks accessing sensitive browser data. Labs acknowledged this as a “frontier, unsolved” problem.

Normalization of Deviance: Security researcher Johann Rehberger warned that repeated risky behavior without consequences (like running YOLO mode agents) echoed the dynamics that led to the Challenger disaster.

The Rise of Long Tasks

METR’s research showed the length of tasks models can complete autonomously doubling roughly every 7 months. By year-end, frontier models could tackle tasks that take humans around 5 hours. This extension of capability opened new possibilities for autonomous work while raising questions about oversight and validation.
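The arithmetic behind that trend is just exponential growth. The sketch below extrapolates from the figures in the text (a 5-hour horizon at year-end 2025, doubling every 7 months); it is a back-of-envelope projection, not a prediction:

```python
# Back-of-envelope projection of METR's trend: task horizon doubles roughly
# every 7 months. The starting point comes from the text above; the rest is
# plain extrapolation.
import math

DOUBLING_MONTHS = 7
H0_HOURS = 5  # frontier task horizon at end of 2025, per the text

def horizon_after(months, h0=H0_HOURS, doubling=DOUBLING_MONTHS):
    """Projected task horizon (hours) some number of months out."""
    return h0 * 2 ** (months / doubling)

def months_until(target_hours, h0=H0_HOURS, doubling=DOUBLING_MONTHS):
    """Months until the horizon reaches a target length."""
    return doubling * math.log2(target_hours / h0)

print(round(horizon_after(12), 1))  # → 16.4  (one year out, in hours)
print(round(months_until(40), 1))   # → 21.0  (months to a 40-hour work-week task)
```

If the trend holds, a full work-week horizon would be less than two years away, which is exactly why the oversight question is urgent.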

New Pricing Tiers Emerged

Claude Max 20x ($200/month) and ChatGPT Pro ($200/month) established new premium tiers. The justification? Massive token consumption from agentic workflows. When a coding agent burns through context windows across multi-hour tasks, the economics demand different pricing models.
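To see why agentic workloads break flat-rate pricing, a rough cost estimate helps. Neither the per-token rates nor the session sizes below come from the post; they are assumed round numbers in the ballpark of 2025 frontier-model API pricing:

```python
# Back-of-envelope cost of one agent session. Both the per-token rates and
# the token counts are assumptions for illustration, not published figures.
INPUT_PER_MTOK = 3.00    # $ per million input tokens (assumed)
OUTPUT_PER_MTOK = 15.00  # $ per million output tokens (assumed)

def session_cost(input_tokens, output_tokens):
    """Dollar cost of a session at the assumed rates."""
    return (input_tokens / 1e6) * INPUT_PER_MTOK + (output_tokens / 1e6) * OUTPUT_PER_MTOK

# A multi-hour agent run that repeatedly re-reads a large codebase can easily
# consume tens of millions of input tokens:
print(round(session_cost(20_000_000, 500_000), 2))  # → 67.5
```

At those assumed rates, a handful of heavy sessions exceeds a $200 monthly subscription on its own, which is the economics the premium tiers exist to absorb.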

Local Models Hit a Sweet Spot

Models in the 20-32B parameter range, like Mistral Small 3, achieved GPT-4-class performance on consumer hardware. While frontier cloud models remained superior for agentic work, the local option became viable for many use cases—important for privacy-conscious applications and cost-sensitive workflows.

MCP’s Uneven Year

Model Context Protocol adoption exploded across labs in early 2025. However, coding agents’ shell access may have made it less critical than expected—why use MCP when the agent can just run commands? Anthropic’s simpler “Skills” format gained traction as an alternative.

Data Center Backlash

Over 200 environmental groups demanded halts to new U.S. data center construction in December. Local opposition to AI infrastructure surged throughout the year. The sustainability of AI’s growth trajectory became a mainstream concern rather than a niche issue.

What 2025 Taught Us

The year consolidated rather than invented paradigms:

  • Reasoning models moved from experimental to essential
  • Agents became practical tools rather than demos
  • Chinese labs proved they could compete at the frontier
  • LLM integration into daily tools normalized

The fundamental question shifted from “Can LLMs do X?” to “How do we safely deploy LLMs doing X at scale?”

Looking Ahead

2025 was a year of maturation. The wild frontier of 2024 gave way to practical deployments, real revenue, and genuine integration into software development workflows. The tools that seemed like experiments became standard practice.

For developers, the message is clear: AI-assisted development isn’t a future possibility—it’s the present reality. The question isn’t whether to adopt these tools, but how to use them effectively and safely.

As we enter 2026, the foundations laid this year will determine what becomes possible next. The agents are here, the reasoning works, and the integration is happening. What we build on this foundation is up to us.


This post summarizes themes from Simon Willison’s comprehensive 2025 year-in-review, which covers 24 trends in significantly more detail.