AI Milestones & Benchmark Tracker

Name: Frontier AI Model Benchmark Scores
Creator: Latentmachine

216 tracked milestones · 10 companies · 2022-2026 · Updated February 25, 2026

This page is a complete, search-engine-readable archive of all data on Latentmachine. For the interactive experience with timeline, feed, grid, and stats views, visit the timeline tracker.

Companies Tracked

Latentmachine tracks AI milestones across 10 frontier AI labs:

OpenAI - 65 milestones tracked
Anthropic - 41 milestones tracked
Google - 37 milestones tracked
Meta AI - 16 milestones tracked
NVIDIA - 13 milestones tracked
Microsoft - 10 milestones tracked
DeepSeek - 10 milestones tracked
Mistral AI - 9 milestones tracked
xAI - 9 milestones tracked
Moonshot AI - 6 milestones tracked
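
As a quick consistency check, the per-company counts above should add up to the 216 tracked milestones stated in the page header. The sketch below is illustrative only; the dictionary name is an assumption and is not part of Latentmachine.

```python
# Per-company milestone counts copied from the list above.
# `milestones_by_company` is a hypothetical name used only for this check.
milestones_by_company = {
    "OpenAI": 65,
    "Anthropic": 41,
    "Google": 37,
    "Meta AI": 16,
    "NVIDIA": 13,
    "Microsoft": 10,
    "DeepSeek": 10,
    "Mistral AI": 9,
    "xAI": 9,
    "Moonshot AI": 6,
}

total = sum(milestones_by_company.values())
company_count = len(milestones_by_company)
print(f"{total} milestones across {company_count} companies")
```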

Frontier Model Benchmark Comparison

Last updated: 2026-02-20. Sources: provider publications and independent evaluations.

Benchmarks Tracked

GPQA Diamond: Graduate-level reasoning — biology, physics, chemistry. PhD experts score 65-74%.
SWE-Bench Verified: Agentic coding — resolving real GitHub issues in production codebases.
AIME 2025: American Invitational Mathematics Examination, a competition-level high school math exam.
MATH 500: Diverse mathematical problem-solving across difficulty levels.
MMMLU: Multilingual Massive Multitask Language Understanding, covering 57 categories in 14 languages.
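Humanity's Last Exam: Expert-level multi-domain benchmark, the hardest tracked here. Most frontier models score below 50%; the highest scores listed use tools or scaled inference-time compute.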

Model Scores

Company	Model	GPQA Diamond	SWE-Bench Verified	AIME 2025	MATH 500	MMMLU	Humanity's Last Exam	Notes
OpenAI	GPT-5.2	92.4	80	100	—	—	35.4
OpenAI	GPT-5.3 Codex	—	—	—	—	—	—	Coding agent model — excels on SWE-Bench Pro (56.8%), Terminal-Bench (77.3%), OSWorld (64.7%). Not tested on standard reasoning benchmarks.
OpenAI	GPT-5.1	88.1	76.3	94.6	—	—	23.7
OpenAI	GPT-5	87.3	74.9	94	—	—	25.3
OpenAI	OpenAI o3	83.3	69.1	98.4	—	—	20.3
OpenAI	OpenAI o4-mini	81.4	68.1	93.4	—	—	19.4
Anthropic	Claude Opus 4.6	91.3	80.8	99.8	—	—	36.7
Anthropic	Claude Sonnet 4.6	89.9	79.6	95.6	—	89.3	33.2
Anthropic	Claude Opus 4.5	87	80.9	—	—	90.8	25.2
Anthropic	Claude Sonnet 4.5	83.4	77.2	—	—	89.1	—
Anthropic	Claude Opus 4.1	80.9	74.5	—	—	89.5	—
Anthropic	Claude Opus 4	79.6	72.5	—	—	—	—
Google	Gemini 3 Pro	91.9	76.2	100	—	91.8	37.5
Google	Gemini 3.1 Pro	—	80.6	—	—	—	—	Released Feb 19, 2026. 80.6% on SWE-Bench Verified. Also 77.1% on ARC-AGI-2 (not tracked in this table).
Google	Gemini 3 Deep Think	93.8	—	—	—	—	48.4	Reasoning mode (Feb 2026 upgrade), not a separate model. Uses scaled inference-time compute. Also: 84.6% ARC-AGI-2, 3455 Elo Codeforces, gold-medal IPhO/IChO 2025.
Google	Gemini 2.5 Pro	—	—	—	—	89.2	21.6
Google	Gemini 2.5 Flash	78.3	—	88	—	—	—
xAI	Grok 4	87.5	75	91.7	—	—	25.4
Meta AI	Llama 4 Maverick	69.8	—	—	—	84.6	—
Meta AI	Llama 4 Behemoth	73.7	—	—	—	85.8	—	Still training — preliminary scores
Meta AI	Llama 4 Scout	57.2	—	—	—	—	—
DeepSeek	DeepSeek-R1-0528	81	57.6	87.5	—	—	17.7
DeepSeek	DeepSeek V3.1	—	66	—	—	—	—
DeepSeek	DeepSeek-R1	71.5	49.2	79.8	97.3	—	—
Mistral AI	Mistral Large 3	—	—	—	—	—	—	Top open-source on LMArena coding
Moonshot AI	Kimi K2.5	87.6	76.8	96.1	—	—	50.2	HLE score with tools (search, code, browsing). Open-source SOTA.
Moonshot AI	Kimi K2 Thinking	84.5	71.3	—	—	—	44.9	HLE score with tools. First open model to rival GPT-5 on agentic tasks.
Moonshot AI	Kimi K2	75.1	65.8	49.5	97.4	—	—
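
For readers who want to work with the table programmatically, here is a minimal sketch of how its rows might be parsed and ranked. It assumes the rows are tab-separated, that the dash marks a missing score, and that rows without a note simply omit the last column. The function names (`parse_score`, `parse_row`, `rank`) are hypothetical and not part of any Latentmachine tooling.

```python
from typing import Dict, List, Optional

# Column order copied from the header row of the Model Scores table.
COLUMNS = ["Company", "Model", "GPQA Diamond", "SWE-Bench Verified",
           "AIME 2025", "MATH 500", "MMMLU", "Humanity's Last Exam", "Notes"]
SCORE_COLUMNS = COLUMNS[2:8]
MISSING = "\u2014"  # the dash the table uses for "no score reported"


def parse_score(cell: str) -> Optional[float]:
    """Convert one score cell to a float, or None when no score is reported."""
    cell = cell.strip()
    return None if cell in ("", MISSING) else float(cell)


def parse_row(line: str) -> Dict[str, object]:
    """Split one tab-separated row into a dict keyed by column name."""
    cells = line.rstrip("\n").split("\t")
    cells += [""] * (len(COLUMNS) - len(cells))  # rows without Notes are short
    row: Dict[str, object] = dict(zip(COLUMNS, cells))
    for name in SCORE_COLUMNS:
        row[name] = parse_score(str(row[name]))
    return row


def rank(rows: List[Dict[str, object]], benchmark: str) -> List[Dict[str, object]]:
    """Return rows that report `benchmark`, highest score first."""
    scored = [r for r in rows if r[benchmark] is not None]
    return sorted(scored, key=lambda r: r[benchmark], reverse=True)


# Three rows copied from the table above (their Notes column is empty).
SAMPLE_LINES = [
    "\t".join(["OpenAI", "GPT-5.2", "92.4", "80", "100", MISSING, MISSING, "35.4"]),
    "\t".join(["Anthropic", "Claude Opus 4.6", "91.3", "80.8", "99.8", MISSING, MISSING, "36.7"]),
    "\t".join(["Google", "Gemini 2.5 Pro", MISSING, MISSING, MISSING, MISSING, "89.2", "21.6"]),
]

rows = [parse_row(line) for line in SAMPLE_LINES]
gpqa_ranking = [r["Model"] for r in rank(rows, "GPQA Diamond")]
print(gpqa_ranking)
```

Treating a missing score as `None` rather than 0 keeps unreported benchmarks out of the ranking instead of placing those models last.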

Benchmark Crown History

Who held #1 on each benchmark, and when they were dethroned.

GPQA Diamond

Model	Company	Score	From	To
GPT-4	OpenAI	39	January 1, 2024	March 4, 2024
Claude 3 Opus	Anthropic	60.4	March 4, 2024	September 12, 2024
OpenAI o1	OpenAI	77.3	September 12, 2024	January 31, 2025
OpenAI o3	OpenAI	83.3	January 31, 2025	June 25, 2025
Gemini 2.5 Pro	Google	84	June 25, 2025	July 10, 2025
Grok 4	xAI	87.5	July 10, 2025	September 29, 2025
GPT-5.1	OpenAI	88.1	September 29, 2025	November 18, 2025
Gemini 3 Pro	Google	91.9	November 18, 2025	December 11, 2025
GPT-5.2	OpenAI	92.4	December 11, 2025	Current

SWE-Bench Verified

Model	Company	Score	From	To
SWE-agent + GPT-4	OpenAI	18	April 2, 2024	June 20, 2024
Claude 3.5 Sonnet	Anthropic	33.4	June 20, 2024	October 22, 2024
Claude 3.5 Sonnet (Oct)	Anthropic	49	October 22, 2024	February 24, 2025
Claude 3.7 Sonnet	Anthropic	62.3	February 24, 2025	May 22, 2025
Claude Opus 4	Anthropic	72.5	May 22, 2025	June 4, 2025
Claude Sonnet 4	Anthropic	72.7	June 4, 2025	September 29, 2025
Claude Sonnet 4.5	Anthropic	77.2	September 29, 2025	February 5, 2026
Claude Opus 4.6	Anthropic	80.8	February 5, 2026	Current

AIME 2025

Model	Company	Score	From	To
OpenAI o1	OpenAI	83.3	September 12, 2024	January 31, 2025
OpenAI o3	OpenAI	98.4	January 31, 2025	July 10, 2025
Grok 4	xAI	91.7	July 10, 2025	November 18, 2025
Gemini 3 Pro	Google	96.7	November 18, 2025	December 11, 2025
GPT-5.2	OpenAI	100	December 11, 2025	Current

Humanity's Last Exam

Model	Company	Score	From	To
GPT-5	OpenAI	25.3	August 7, 2025	November 18, 2025
Gemini 3 Pro	Google	37.5	November 18, 2025	February 12, 2026
Gemini 3 Deep Think	Google	48.4	February 12, 2026	Current
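
One way to read the crown tables is as a running maximum over dated results: a model takes the crown only if it strictly beats the reigning score, and the previous reign ends on that date. The sketch below illustrates that reading under stated assumptions. It treats each "From" date as the date the score became available, uses the last four GPQA Diamond crown rows plus Claude Sonnet 4.6 (89.9, dated February 17, 2026 in the context window table) as a non-winning entry, and the names `Result`, `Reign`, and `crown_history` are hypothetical. It is not a description of how Latentmachine actually builds these tables.

```python
from datetime import date
from typing import List, NamedTuple, Optional


class Result(NamedTuple):
    model: str
    score: float
    released: date


class Reign(NamedTuple):
    model: str
    score: float
    start: date
    end: Optional[date]  # None means "Current"


def crown_history(results: List[Result]) -> List[Reign]:
    """Walk results in date order and record each strict improvement."""
    reigns: List[Reign] = []
    for r in sorted(results, key=lambda r: r.released):
        if reigns and r.score <= reigns[-1].score:
            continue  # did not beat the reigning score
        if reigns:
            # Close the previous reign on the day it was dethroned.
            reigns[-1] = reigns[-1]._replace(end=r.released)
        reigns.append(Reign(r.model, r.score, r.released, None))
    return reigns


# GPQA Diamond scores and dates taken from the tables on this page.
gpqa_results = [
    Result("Grok 4", 87.5, date(2025, 7, 10)),
    Result("GPT-5.1", 88.1, date(2025, 9, 29)),
    Result("Gemini 3 Pro", 91.9, date(2025, 11, 18)),
    Result("GPT-5.2", 92.4, date(2025, 12, 11)),
    Result("Claude Sonnet 4.6", 89.9, date(2026, 2, 17)),  # below the record
]

reigns = crown_history(gpqa_results)
for reign in reigns:
    until = reign.end.isoformat() if reign.end else "Current"
    print(f"{reign.model}\t{reign.score}\t{reign.start.isoformat()}\t{until}")
```

Under this rule a later release with a lower score never appears in the history, which is why Claude Sonnet 4.6 is skipped here.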

Infrastructure & Training Scale

Training Compute Scale

Model	Company	Date	Parameters	Architecture	Est. Cost
GPT-3	OpenAI	June 11, 2020	175B	Dense	~$5M
GPT-4	OpenAI	March 14, 2023	~1.8T (est.)	MoE (est.)	~$78M
Llama 2 70B	Meta AI	July 18, 2023	70B	Dense	~$2M
Gemini Ultra	Google	December 6, 2023	Undisclosed	Dense	~$191M
Llama 3.1 405B	Meta AI	July 23, 2024	405B	Dense	~$60M (est.)
DeepSeek V3	DeepSeek	December 26, 2024	671B (37B active)	MoE	~$5.6M (reported)
Llama 4 Maverick	Meta AI	April 5, 2025	400B (17B active)	MoE	Undisclosed
Llama 4 Scout	Meta AI	April 5, 2025	109B (17B active)	MoE	Undisclosed
DeepSeek V3.1	DeepSeek	August 1, 2025	671B (37B active)	MoE	Undisclosed
DeepSeek V3.2	DeepSeek	October 1, 2025	671B (37B active)	MoE	Undisclosed
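
The Parameters column mixes several notations: plain counts ("175B"), estimates ("~1.8T (est.)"), MoE entries with an active count ("671B (37B active)"), and "Undisclosed". A rough sketch of how such cells could be normalized is shown below. The helper names are hypothetical, and the parsing rules are assumptions inferred from the rows above rather than a documented format.

```python
import re
from typing import Optional, Tuple

_UNITS = {"B": 1e9, "T": 1e12}
_SIZE = re.compile(r"(\d+(?:\.\d+)?)([BT])")


def parse_parameters(cell: str) -> Tuple[Optional[float], Optional[float]]:
    """Return (total, active) parameter counts.

    `active` is None unless the cell states it explicitly, and both values
    are None when the cell reports no count (for example "Undisclosed").
    """
    sizes = [float(number) * _UNITS[unit] for number, unit in _SIZE.findall(cell)]
    if not sizes:
        return None, None
    total = sizes[0]
    active = sizes[1] if len(sizes) > 1 and "active" in cell else None
    return total, active


def active_fraction(cell: str) -> Optional[float]:
    """Share of parameters used per token, when the cell reports both counts."""
    total, active = parse_parameters(cell)
    if total is None or active is None:
        return None
    return active / total


deepseek_v3 = parse_parameters("671B (37B active)")
deepseek_share = active_fraction("671B (37B active)")
print(deepseek_v3, round(deepseek_share, 3))
```

For the DeepSeek V3 row this gives an active share of roughly 5.5%, which is the ratio 37B / 671B taken directly from the table.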

Context Window Evolution

Model	Company	Date	Max Tokens
GPT-3.5	OpenAI	November 30, 2022	4,096
Claude 1	Anthropic	March 14, 2023	9,000
GPT-4	OpenAI	March 14, 2023	8,192
Claude 2	Anthropic	July 11, 2023	100,000
GPT-4 Turbo	OpenAI	November 6, 2023	128,000
Gemini 1.5 Pro	Google	February 15, 2024	1,000,000
Claude 3.5 Sonnet	Anthropic	June 20, 2024	200,000
Llama 3.1	Meta AI	July 23, 2024	128,000
Gemini 2.5 Pro	Google	March 25, 2025	1,000,000
Llama 4 Scout	Meta AI	April 5, 2025	10,000,000
Claude Sonnet 4	Anthropic	June 4, 2025	200,000
GPT-5	OpenAI	August 7, 2025	400,000
Gemini 3 Pro	Google	November 18, 2025	1,000,000
Claude Sonnet 4.6	Anthropic	February 17, 2026	1,000,000

Continued NVIDIA hardware dominance. But Chinese efficiency advances raised questions about necessity of cutting-edge hardware. Inference optimization became focus.

Created GPT economy but monetization unclear. Quality curation challenge. Platform lock-in strategy. Developer enthusiasm high initially but sustainability questioned.

Set new standard for LLM capability. Triggered enterprise AI adoption wave. Established OpenAI market leadership. Multimodal foundation laid.

Tags: openai, model-release, multimodal, enterprise

AI Milestones — 2022

ChatGPT Public Launch

OpenAI · Applications & Products · November 30, 2022

The Narrative

Research preview of conversational AI assistant. Free to use. Optimized for dialogue using RLHF.

Source: OpenAI Blog

Reality Check

Reached 100M users in 2 months. Became the fastest-growing consumer app in history at that time. Triggered industry-wide AI race.

Implication

Defined the generative AI era. Made AI accessible to general public. Sparked massive investment wave across industry. Changed technology landscape permanently.

Tags: openai, consumer, paradigm-shift