AgMoDB by @mistakeknot

AI Evals & Benchmarks

What each benchmark measures, why it matters, and how to interpret scores.

AgMoBench

Transparent composite indices computed by AgMoDB using percentile-rank normalization across core domain benchmarks, with both full and confidence-oriented variants.
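As a rough illustration of percentile-rank normalization, the sketch below converts raw scores on a single benchmark into 0-100 percentile ranks. The midrank handling of ties and the function name `percentile_ranks` are assumptions made for this example; the exact formula AgMoDB uses is not stated here.

```python
def percentile_ranks(scores):
    """Map {model: raw_score} to {model: percentile rank in [0, 100]}.

    Assumed definition: share of models scoring strictly lower, plus half
    of the models tied at the same score (the model itself included).
    """
    values = list(scores.values())
    n = len(values)
    ranks = {}
    for model, score in scores.items():
        below = sum(1 for v in values if v < score)
        tied = sum(1 for v in values if v == score)
        ranks[model] = 100.0 * (below + 0.5 * tied) / n
    return ranks


# Hypothetical raw scores on a single benchmark.
raw = {"model_a": 71.0, "model_b": 64.5, "model_c": 64.5, "model_d": 30.0}
print(percentile_ranks(raw))
```

Because ranks depend only on ordering, benchmarks with very different raw scales (accuracy percentages, ELO points) become comparable before they are averaged.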

AgMoBench Overall
AgMoDB

Weighted blend of Reasoning (22%), Coding (22%), Math (14%), Agentic (14%), Robustness (18%), and Document Intelligence (10%) domain indices using observed benchmark data only (BenchPress predictions excluded). Uses percentile-rank normalization across the full model population.

Components

AgMoBench Reasoning, AgMoBench Coding, AgMoBench Math, AgMoBench Agentic, AgMoBench Robustness, AgMoBench Document Intelligence
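The Overall weights listed above can be applied as a plain weighted sum of the six domain indices. This is a minimal sketch: the dictionary keys, the `weighted_blend` helper, and the choice to raise an error when a domain is missing are all assumptions, not AgMoDB's documented behavior.

```python
# Weights as listed for AgMoBench Overall.
OVERALL_WEIGHTS = {
    "reasoning": 0.22,
    "coding": 0.22,
    "math": 0.14,
    "agentic": 0.14,
    "robustness": 0.18,
    "document_intelligence": 0.10,
}


def weighted_blend(domain_scores, weights):
    """Weighted sum of domain indices; every weighted domain must be present."""
    missing = set(weights) - set(domain_scores)
    if missing:
        raise ValueError(f"missing domain scores: {sorted(missing)}")
    total_weight = sum(weights.values())
    return sum(weights[d] * domain_scores[d] for d in weights) / total_weight


# Hypothetical domain indices on a 0-100 percentile scale.
example = {
    "reasoning": 80.0,
    "coding": 70.0,
    "math": 60.0,
    "agentic": 50.0,
    "robustness": 90.0,
    "document_intelligence": 40.0,
}
print(round(weighted_blend(example, OVERALL_WEIGHTS), 2))
```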
AgMoBench Trust
AgMoDB

Trust-weighted blend of domain scores that emphasizes reasoning and coding signal. Uses prediction-inclusive data. Weights: Reasoning (27%), Coding (27%), Math (14%), Agentic (9%), Robustness (13%), and Document Intelligence (10%).

Components

AgMoBench Reasoning, AgMoBench Coding, AgMoBench Math, AgMoBench Agentic, AgMoBench Robustness, AgMoBench Document Intelligence
AgMoBench Predicted
AgMoDB

Prediction-inclusive variant of AgMoBench that includes BenchPress ML-predicted benchmark cells alongside observed data. Uses confidence adjustment, shrinking each domain's weight toward a 25% floor when that domain is prediction-heavy.

Components

AgMoBench Reasoning, AgMoBench Coding, AgMoBench Math, AgMoBench Agentic, AgMoBench Robustness, AgMoBench Document Intelligence
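The description above does not give the confidence-adjustment formula, so the following is only one plausible reading. It assumes each domain's weight is multiplied by a factor that falls linearly from 1.0 (all cells observed) to the 25% floor (all cells predicted), after which the weights are renormalized. The helper name, the linear shape, and the renormalization step are all hypothetical.

```python
FLOOR = 0.25  # the 25% floor named in the text


def confidence_adjusted_weights(weights, predicted_fraction, floor=FLOOR):
    """Shrink each domain's weight as its share of predicted cells grows.

    Assumption: the multiplier moves linearly from 1.0 (all cells observed)
    to `floor` (all cells predicted); weights are then renormalized to 1.
    """
    shrunk = {}
    for domain, weight in weights.items():
        frac = predicted_fraction.get(domain, 0.0)
        if not 0.0 <= frac <= 1.0:
            raise ValueError(f"predicted fraction out of range for {domain}")
        shrunk[domain] = weight * (1.0 - (1.0 - floor) * frac)
    total = sum(shrunk.values())
    return {domain: w / total for domain, w in shrunk.items()}


# Hypothetical two-domain case: "math" is entirely predicted.
weights = {"coding": 0.5, "math": 0.5}
adjusted = confidence_adjusted_weights(weights, {"coding": 0.0, "math": 1.0})
print(adjusted)
```

Under these assumptions a fully predicted domain keeps a quarter of its nominal influence rather than dropping out, so predictions inform the index without dominating it.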
AgMoBench Reasoning
AgMoDB

Composite reasoning score averaging percentile ranks across knowledge and reasoning benchmarks.

Components

MMLU Pro, GPQA Diamond, HLE, LiveBench Overall, IFBench, ARC-AGI-2, LCR, Arena ELO
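A domain index of this kind can be sketched as the mean of a model's percentile ranks over the component benchmarks. Few models report every benchmark, so this example skips missing ones and returns `None` below a minimum coverage; both behaviors, and the `domain_index` and `min_coverage` names, are assumptions for illustration.

```python
from statistics import mean


def domain_index(model_ranks, benchmarks, min_coverage=1):
    """Average a model's percentile ranks over the benchmarks it has data for.

    Returns None when fewer than `min_coverage` benchmarks have a rank.
    Skipping missing benchmarks and `min_coverage` are assumptions.
    """
    available = [model_ranks[b] for b in benchmarks if model_ranks.get(b) is not None]
    if len(available) < min_coverage:
        return None
    return mean(available)


REASONING_BENCHMARKS = ["MMLU Pro", "GPQA Diamond", "HLE", "LiveBench Overall",
                        "IFBench", "ARC-AGI-2", "LCR", "Arena ELO"]

# Hypothetical percentile ranks; benchmarks without data are simply absent.
ranks = {"MMLU Pro": 90.0, "GPQA Diamond": 80.0, "HLE": 70.0}
print(domain_index(ranks, REASONING_BENCHMARKS))
```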
AgMoBench Coding
AgMoDB

Composite coding score averaging percentile ranks across code generation and software engineering benchmarks.

Components

LiveCodeBench, SciCode, Terminal-Bench Hard, Aider Polyglot, BigCodeBench Complete, Arena ELO: Coding
AgMoBench Math
AgMoDB

Composite math score averaging percentile ranks across mathematical reasoning benchmarks.

Components

AIME 2025, MATH-500, FrontierMath
AgMoBench Agentic
AgMoDB

Composite agentic score averaging percentile ranks across tool use, web browsing, and autonomous task benchmarks.

Components

SWE-bench Verified, GAIA, TauBench Airline, WebArena, MLEBench, BFCL, BrowseComp
AgMoBench Robustness
AgMoDB

Composite factual accuracy score averaging percentile ranks across factuality and critical thinking benchmarks — how well models resist nonsense, avoid fabrication, and know what they don't know.

Components

BullshitBench, TruthfulQA, SimpleQA, AA-Omniscience Index
AgMoBench Document Intelligence
AgMoDB

Composite document processing score averaging percentile ranks across OCR, parsing, table understanding, and document retrieval benchmarks — how well models handle enterprise document workflows.

Components

IDP Overall, IDP OlmOCR, IDP OmniDoc, IDP Core, OmniDocBench, OCRBench v2, MMTU, ViDoRe v3

Artificial Analysis Indices

Proprietary composite scores from Artificial Analysis. Methodology is not publicly disclosed.

Intelligence Index
Artificial Analysis

Artificial Analysis's proprietary composite intelligence score aggregating multiple benchmarks. Methodology is not publicly disclosed.

Coding Index
Artificial Analysis

Artificial Analysis's proprietary composite coding ability score.

Math Index
Artificial Analysis

Artificial Analysis's proprietary composite mathematical reasoning score.

Other Aggregate Benchmarks

Third-party composite scores and leaderboard aggregates.

AA Intelligence Index (Matrix)
benchmark_matrix

Artificial Analysis's composite Intelligence Index as reported in the LLM Benchmark Matrix. Aggregates multiple evals into a single capability score.

Apex Agents
epoch_ai

Apex Agents benchmark scores from Epoch AI's benchmark data collection.

ARC-AGI-2
epoch_ai

ARC-AGI-2 benchmark scores from Epoch AI's benchmark data collection.

Chatbot Arena ELO
chatbot_arena

Chatbot Arena ELO measures relative user preference from blind head-to-head votes in LMArena. It is a live human-judgment signal for conversational quality under real prompts.
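To interpret a rating gap, the classic Elo expected-score formula is a useful approximation. The sketch below assumes the standard logistic curve with a 400-point scale and ignores ties; the leaderboard's actual rating fit may differ, and `expected_win_rate` is a hypothetical helper.

```python
def expected_win_rate(rating_a, rating_b, scale=400.0):
    """Probability that A is preferred over B under the classic Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


# Hypothetical ratings 100 points apart.
print(round(expected_win_rate(1300, 1200), 3))  # 0.64
```

Under this model a 100-point lead corresponds to the higher-rated model being preferred in roughly 64% of head-to-head votes, so small rating differences imply close to even preference.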

Epoch Capabilities Index
epoch_ai

Epoch Capabilities Index benchmark scores from Epoch AI's benchmark data collection.

GDP-Val AA
benchmark_matrix

Artificial Analysis's GDP-Val metric — an aggregate measure combining quality and value. Specific methodology not publicly documented.

GDPval
epoch_ai

GDPval benchmark scores from Epoch AI's benchmark data collection.

HLE
epoch_ai

HLE (Humanity's Last Exam) benchmark scores from Epoch AI's benchmark data collection.

LiveBench Overall
livebench

LiveBench Overall tracks broad model performance on recently refreshed questions with verifiable answers. It is used as a lower-contamination snapshot across major capability categories.

Open LLM Average
open_llm_leaderboard

Average score across the six benchmarks of Hugging Face's Open LLM Leaderboard (IFEval, BBH, MATH Level 5, GPQA, MuSR, MMLU-Pro), the community standard for open-source model evaluation.

Parameter Count
epoch_ai

Total number of trainable parameters in the model, as catalogued by Epoch AI's frontier model database.

PostTrainBench
epoch_ai

PostTrainBench benchmark scores from Epoch AI's benchmark data collection.

Training Compute
epoch_ai

Total floating-point operations (FLOP) used during model training, as estimated by Epoch AI from public disclosures, hardware counts, and training duration.
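A hardware-based estimate of this kind is commonly computed as chip count times peak throughput times utilization times training time. The sketch below shows that arithmetic with entirely hypothetical inputs; the `training_flop` helper and the utilization figure are assumptions, not values from Epoch AI.

```python
def training_flop(num_chips, peak_flop_per_s, utilization, training_days):
    """Hardware-based estimate: chips x peak throughput x utilization x time."""
    if not 0.0 < utilization <= 1.0:
        raise ValueError("utilization must be in (0, 1]")
    seconds = training_days * 24 * 60 * 60
    return num_chips * peak_flop_per_s * utilization * seconds


# Hypothetical run: 1,000 chips at 1e15 peak FLOP/s, 40% utilization, 30 days.
estimate = training_flop(1_000, 1e15, 0.4, 30)
print(f"{estimate:.2e} FLOP")
```

The result is most sensitive to the utilization assumption, which is rarely disclosed, so estimates of this type are best read as order-of-magnitude figures.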

Training Cost (USD)
epoch_ai

Estimated total cost to train the model in 2024 US dollars, including compute hardware rental or amortization.