
Preliminary Data Notice

This dashboard is currently running on a limited sample size. To draw definitive conclusions, we need to scale our testing to thousands of runs across hundreds of days.

High-reasoning models are volatile. To get an accurate score, we must run 10 attempts per model to account for variance.

We are currently exploring ways to scale our testing. If you're interested in helping, please consider sponsoring.

🙇 A Note from the Developer

Natory is built by a developer, for developers and engineering decision makers. While I treat this data seriously, I am a software developer, not a professional statistician. This benchmark is an evolving experiment in quantifying AI reasoning.

If you spot a flaw in the methodology, visualization, or have any suggestions, please let me know. I want this to be the most accurate tool possible, and I need your help to get the math right.

Scoring Algorithm Version: v5

Natory AI Benchmark: Constraint-Based Mathematical Search

Quantifying LLM reasoning capabilities beyond simple arithmetic: A study on planning, search efficiency, and reliability.

Is the model calculating, or is it reasoning? Standard benchmarks test if an AI can find the answer. Natory tests if it can find the path. We treat this game as a Hard Search Problem: models must optimize a mathematical expression under strict rule constraints, similar to code generation or program synthesis. This dashboard tracks the Price-Performance Ratio of modern LLMs, revealing which models can navigate complex logical search spaces without breaking the bank or the API contract.

1. Task Specification

The Natory Benchmark evaluates a model's ability to perform Constrained Number Construction. Unlike open-ended generation, the model must reconstruct a specific integer target using a strictly defined set of source digits and mathematical operators. Success requires solving a non-differentiable optimization problem: maximizing a scalar Reward Function (Score) within a hard search space.

The Objective

For a given Target Number (T) and a set of allowed digits (D), the model must synthesize a mathematical expression (E) such that:
eval(E) == T AND digits(E) ⊆ D.
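As an illustrative sketch (not the dashboard's actual validator), the two constraints can be checked mechanically. Python's `**` stands in for the game's `^` operator, and `eval` is used only because the input is a trusted example:

```python
import re

def satisfies_constraints(expr: str, target: int, allowed: set[str]) -> bool:
    """Check eval(E) == T and digits(E) ⊆ D for a plain arithmetic expression."""
    used_digits = set(re.findall(r"\d", expr))   # digits(E)
    if not used_digits <= allowed:               # digit-set constraint
        return False
    return eval(expr) == target                  # value constraint (trusted input only)

# Target 13 with allowed digits {1, 3}: 3^(1+1) + (3+1) = 9 + 4 = 13
print(satisfies_constraints("3**(1+1) + (3+1)", 13, {"1", "3"}))  # True
```

A real scorer would use a proper expression parser for the game's operators (`sopfr`, `p_n`, `!`, `mod`) instead of `eval`; the point here is only the two-part acceptance test.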

The Reward Function (Scoring)

The scoring algorithm functions as a multi-objective optimization function (f(x)), balancing three competing variables:

  • Constraint Adherence (Hard vs. Soft): The environment enforces a Hard Constraint on digit validity; using forbidden digits results in immediate zero-reward termination. However, it applies a Soft Exponential Penalty (10 × 2^extra) for Frequency Violations (overusing allowed digits beyond a tolerance threshold) to discourage redundancy.
  • Bounded Sophistication: The model is rewarded for Operator Diversity and the use of high-cost functions (e.g., sopfr, p_n). However, this is bounded by a Diminishing Returns curve: nesting depth > 2 or operator frequency > 3 yields zero marginal utility.
  • Semantic Efficiency (Anti-Triviality): Unlike simple length penalties, the scorer performs an AST Simplification pass (e.g., resolving x + 0 or x × 1). If the simplified expression yields a lower score than the raw output, the model is penalized for "Reward Hacking" (padding complexity without functional value).
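A minimal sketch of two of the mechanisms above, assuming a tolerance of 3 extra digits and the per-use bounties quoted later in the system prompt (the real scorer is more involved):

```python
def digit_hoarding_penalty(extra_digits: int, tolerance: int = 3) -> int:
    """Soft exponential penalty: 10 * 2^extra once usage exceeds the tolerance."""
    return 10 * 2 ** extra_digits if extra_digits > tolerance else 0

def operator_bonus(op_counts: dict[str, int], bounty: dict[str, int]) -> int:
    """Diminishing returns: only the first 3 uses of each operator pay out."""
    return sum(bounty.get(op, 0) * min(n, 3) for op, n in op_counts.items())

print(digit_hoarding_penalty(4))                 # 160
print(operator_bonus({"sopfr": 4, "p_n": 1},
                     {"sopfr": 18, "p_n": 14}))  # 3*18 + 1*14 = 68
```

Note how the fourth `sopfr` contributes nothing: past the spam cap, marginal utility is zero, which is exactly what pushes models toward "shallow and wide" expressions.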

Reference Example

Target (T): 13 | Allowed (D): {1, 3}

Model Output (E): 3^(1 + 1) + (3 + 1)

Analysis: The model successfully synthesized the target using only set D (3^2 + 4 = 13). It maximized the score by utilizing exponentiation (Depth Reward) rather than simple addition (e.g., 3 + 3 + 3 + 3 + 1), demonstrating search efficiency.

2. Experimental Protocol

To evaluate reasoning depth versus stability, we distinguish between Search Depth (Turns) and Independent Trials (Attempts):

  • Turn (Search Depth): A single reasoning step where the model generates a thought trace and executes a tool call.
  • Attempt (Trial): A full optimization session where the model is allocated a budget of 5 Turns to converge on an optimal solution.
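The Turn/Attempt relationship can be sketched as a simple loop. The agent and environment below are toy stand-ins (the real agent is an LLM and the real environment is the scorer); the structure is what matters: five turns, auto-saving the best valid score.

```python
def run_attempt(propose, evaluate, budget=5):
    """One Attempt: a budget of `budget` Turns; every valid expression is
    auto-saved, and the best score across Turns is the Attempt's result."""
    best = 0.0
    for turn in range(1, budget + 1):
        expr = propose(turn)       # one reasoning step -> one expression
        result = evaluate(expr)    # environment feedback informs the next turn
        if result["valid"]:
            best = max(best, result["score"])
    return best

# Toy stand-ins, purely illustrative (scores are made up):
proposals = {1: "127+0", 2: "p_n(31)", 3: "p_n(31)", 4: "p_n(31)", 5: "p_n(31)"}
verdicts = {"127+0":  {"valid": True, "score": 0.0},   # trivial -> scored 0
            "p_n(31)": {"valid": True, "score": 16.0}}
print(run_attempt(proposals.__getitem__, verdicts.__getitem__))  # 16.0
```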

Execution Pipeline:

  1. Initialization: The agent receives a target integer T (e.g., 127) along with a set of number-theoretic properties (metadata) to seed its search heuristic.
    System Prompt (Target: 127):
    You are a competitive expert player of the game "Natory".
    Your objective is to achieve the HIGHEST POSSIBLE SCORE by constructing a mathematical expression that evaluates EXACTLY to the target number.
    
    **THE TARGET:** 127
    **PROPERTIES:**
    ### Properties of 127
    - **Factorization**: 127^1
    - **Divisors**: 1, 127
    - **Prime Status**: Prime (#31)
    - **Prime Index**: 31
    - **Parity**: odd
    - **Special Sets**: **superPrime**: Prime index: 31, **friedman**: Yes
    
    
    **RULES:**
    1. **ALLOWED DIGITS:** {1, 2, 7}
       - You may REUSE these digits as many times as you like.
       - Example: If allowed is {1, 2}, "1+1+1" is valid.
    2. **VALIDITY:** The expression must evaluate exactly to 127.
    3. **OPERATOR RULES:**
       - You must use at least one operator.
    
    **ALLOWED OPERATORS:**
    1. Basic: +, -, *, /
    2. Exponent: ^ (Use ^{} for multi-char exponents, e.g., 2^{10})
    3. Factorial: !
    4. Square Root: sqrt(x)
    5. Prime Index: p_n(x) -> The x-th prime number.
       - Example: p_n(1)=2, p_n(5)=11.
    6. Sum of prime factors: sopfr(x) -> Sum of prime factors with multiplicity.
       - Example: sopfr(12) = sopfr(2*2*3) = 2+2+3 = 7
    7. Modulo: a mod b (Returns remainder of a/b. e.g. 10 mod 3 = 1)
      - **WARNING:** Do NOT use the "%" symbol, as it breaks LaTeX formatting. Use "a mod b".
    
    **SCORING (Objective: MAXIMIZE Total Score):**
    
    1. **High-Value Targets (The "Gold Standard"):**
      - **Exact Digits Bonus (+80 pts):** The ultimate goal. Use each digit exactly as many times as it appears in the target. (Overrides the All Digits bonus).
      - **All Digits Bonus (+20 pts):** If you can't be exact, at least use every allowed digit once.
      - **Sum of Powers Bonus (+20 pts):** Structure your expression as a sum of powers (e.g., x^a + y^b + \dots).
    
    2. **Operator Bounties (Double Points Active):**
      - **sopfr(x):** **+18 pts** per use.
      - **p_n(x):** **+14 pts** per use.
      - **Power (^) / Factorial (!):** **+12 pts** per use.
      - **Sqrt / Mod:** **+10 pts** per use.
      - *Diversity Bonus:* +2 pts for every unique operator type used.
    
    3. **CRITICAL LIMITS (Don't waste turns!):**
      - **Nesting Cap:** Bonuses are **ONLY** awarded for the first 2 layers of depth. Operators nested deeper than level 2 (e.g., p_n(p_n(p_n(...)))) yield **0 POINTS**. Keep it shallow and wide.
      - **Spam Cap:** You only get bonuses for the first **3 uses** of any specific operator. The 4th sopfr is worth 0 points.
    
    4. **Penalties (The "Soft Walls"):**
      - **Redundant Operations:** Usage of operators that do not change the value or are trivial (e.g., sopfr(5), 1!, sqrt(1), x^1, x mod 1) will incur severe penalties (-20pts per instance) and receive NO bonuses.
      - **Digit Hoarding (Exponential):** Using >3 extra digits triggers a massive penalty (10 \times 2^{extra}).
      - **Complexity:** Expressions with >15 total items suffer a quadratic penalty.
      - **Triviality:** If your expression simplifies to the target (e.g. 127 + 0) without meaningful transformations, score is 0.
    
    **STRATEGY GUIDE:**
    
    - **The "Shallow & Wide" Strategy:** Since nesting deeper than 2 stops paying bonuses, prefer structures like sopfr(A) + sopfr(B) (Wide) over sopfr(sopfr(A)) (Deep).
    - **Hunt for 'sopfr':** It pays the most (18 pts). Check if your target or its components have convenient prime factors.
    - **Respect the "Gem":** If properties say "Prime Index: 31", use p_n(31).
    - **Avoid "Padding":** Do not wrap numbers in 1! or sopfr(prime) to gain depth. The system detects this as "Redundant" and will deduct -20 points instead of giving you a bonus.
    
    
    You have access to the following tools. To use a tool, output valid JSON with a "tool" field.
    
    1. **evaluate**: Test an expression to see its value, validity, and potential score.
       - Usage: { "tool": "evaluate", "expression": "sopfr(10) + p_n(1)" }
    
    2. **p_n**: Get the nth prime number.
       - Usage: { "tool": "p_n", "n": 10 }
    
    3. **sopfr**: Get the Sum of Prime Factors (with multiplicity) for a number.
       - Usage: { "tool": "sopfr", "n": 10 }
    
    
    **INSTRUCTIONS:**
    1. **Turn Budget:** You have exactly **5 TURNS** to interact.
       - **Turn 1:** Secure a "Safety Score" (valid but simple).
       - **Turns 2-4:** **Optimize.** Use the tool output to see which Bonuses/Penalties you triggered.
       - **Turn 5:** Try one final variation if you think you can beat your high score.
    2. **Auto-Save:** Every valid expression you evaluate is recorded. We keep the best one. You do not need to "submit" at the end.
    3. **Learn from Feedback:** The `evaluate` tool returns a `scoreBreakdown`. **READ IT.**
       - If you see `trivialExpressionPenalty`, you are using forbidden loops (remove `1!`, `sopfr(prime)`).
       - If you see `repetitionPenalty`, you are spamming operators (cap usage at 3).
    4. **Format:** Output the JSON block at the very end.
    
    **EXAMPLE:**
    Target: 10, Allowed: {1, 0}
    Reasoning: 2 times 5 is 10. I can use p_n(3) which is 5. But 3 is not a digit in 10. Maybe I can use p_n(1) and build a chain of p_n to reach 5.
    ```json
    {
      "tool": "evaluate",
      "expression": "p_n(p_n(p_n(1))) * p_n(1)"
    }
    ```
    
  2. Tool Environment: The model interacts with a deterministic environment via a strict API:
    • evaluate: Validates expressions and returns the scalar reward.
    • p_n(x): Returns the x-th Prime Number (Search Tool).
    • sopfr(x): Sum of Prime Factors (Decomposition Tool).
  3. Iterative Refinement: Unlike "Pass/Fail" benchmarks, Natory supports iterative optimization. Upon submitting a valid expression via evaluate, the environment returns a Score Breakdown vector. The model can analyze this feedback to optimize for higher complexity or brevity in remaining turns.
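The two number-theoretic tools are standard functions; reference implementations (a sketch, not the benchmark's own code) look like:

```python
def sopfr(n: int) -> int:
    """Sum of prime factors with multiplicity: sopfr(12) = 2+2+3 = 7."""
    total, d = 0, 2
    while d * d <= n:
        while n % d == 0:   # strip every factor d, counting multiplicity
            total += d
            n //= d
        d += 1
    return total + (n if n > 1 else 0)  # leftover n is a prime factor

def p_n(x: int) -> int:
    """The x-th prime, 1-indexed: p_n(1) = 2, p_n(5) = 11."""
    count, cand = 0, 1
    while count < x:
        cand += 1
        if all(cand % d for d in range(2, int(cand ** 0.5) + 1)):
            count += 1
    return cand

print(sopfr(12), p_n(5), p_n(31))  # 7 11 127
```

Trial division is fine here because game targets are small (four digits or so); the `evaluate` tool itself is the only component that needs the full scoring logic.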

Handling Randomness & Sample Size

LLMs can be unpredictable. A model might get lucky on one attempt and fail the next.

To account for this, we run multiple attempts (aiming for N=10) per target number. Since running these models is expensive, we release new numbers in two stages:

  • Partial Run (N=2): The default for new numbers. Quick but susceptible to luck.
  • Full Run (N=10): The goal for high-precision ranking. These runs are prioritized based on community sponsorship.

The 95% Confidence Interval (the bar width) visualizes uncertainty. A wide bar means either the model is fundamentally inconsistent, or we simply haven't raised the funds to run enough tests to be sure.
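The interval is presumably the standard normal-approximation CI over per-attempt scores (mean ± 1.96 × standard error); the dashboard's exact method isn't documented, so treat this as a sketch:

```python
import math
import statistics

def ci95(scores: list[float]) -> tuple[float, float]:
    """95% confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(scores)
    half = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - half, mean + half

# Hypothetical per-attempt scores; the bar narrows as N grows.
lo, hi = ci95([160, 175, 190, 150, 200, 170, 185, 165, 180, 171])
print(f"[{lo:.1f} - {hi:.1f}]")
```

Because the half-width shrinks with the square root of N, going from a Partial Run (N=2) to a Full Run (N=10) cuts the interval width by roughly a factor of √5 ≈ 2.2 for the same variance.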

Optimization Logic: The scoring function (f(x)) acts as the objective. It positively weights operator depth (Complexity) and negatively weights token count (Inefficiency) and redundancy.

3. Main Performance Results

| Model | Runs | Mean Score [95% CI] | Peak Score | Avg Turns to Peak¹ | Pass Rate | Pass@1² | Avg Cost ($) | Avg Duration (s) |
|---|---|---|---|---|---|---|---|---|
| Baseline (Greedy Search) | 10 | 25.7 | 257 | — | 20.0% | — | $0.000 | — |
| Gemini 3 Pro | 22 | 174.6 (+55%) [159.1 – 190.1] | 242 | 3.82 | 100.0% | 63.6% | 0.701 | 1027.18 |
| Gemini 3 Flash (High) | 22 | 136.0 (+107%) [119.6 – 152.4] | 217 | 3.55 | 100.0% | 77.3% | 0.359 | 1272.28 |
| GPT 5 (High) | 22 | 120.4 (+121%) [99.6 – 141.2] | 188 | 2.91 | 100.0% | 77.3% | 0.420 | 3156.23 |
| Gemini 3 Flash (Medium) | 24 | 111.6 (+491%) [94.1 – 129.1] | 221 | 3.21 | 100.0% | 62.5% | 0.116 | 408.74 |
| Gemini 2.5 Pro | 22 | 109.2 (+518%) [87.3 – 131.0] | 232 | 3.68 | 100.0% | 81.8% | 0.299 | 427.31 |
| GPT 5.1 (High) | 22 | 106.3 (+855%) [80.1 – 132.5] | 232 | 3.05 | 100.0% | 68.2% | 0.489 | 2793.53 |
| GPT 5.2 (High) | 22 | 93.5 (+47%) [75.4 – 111.7] | 186 | 3.18 | 100.0% | 72.7% | 0.358 | 1193.16 |
| Claude Opus 4.5 | 22 | 84.3 (+305%) [69.6 – 99.0] | 152 | 4.09 | 100.0% | 86.4% | 0.281 | 279.66 |
| GPT 5.1 Mini (High) | 22 | 70.5 (+417%) [54.5 – 86.6] | 152 | 3.41 | 100.0% | 86.4% | 0.101 | 1645.39 |
| Gemini 2.5 Flash (High) | 20 | 70.0 (+368%) [48.6 – 91.4] | 155 | 3.35 | 100.0% | 60.0% | 0.103 | 366.84 |
| Claude Sonnet 4.5 | 22 | 55.0 (+395%) [35.1 – 75.0] | 152 | 2.41 | 100.0% | 50.0% | 0.162 | 285.04 |
| GPT 5 Nano (Medium) | 22 | 52.5 (+31%) [41.7 – 63.3] | 96 | 2.86 | 100.0% | 77.3% | 0.019 | 866.83 |
| Claude Haiku 4.5 | 21 | 39.6 (+227%) [25.3 – 53.9] | 122 | 2.05 | 100.0% | 57.1% | 0.091 | 226.44 |
| Gemini 2.5 Flash Lite | 22 | 11.0 (+132%) [5.0 – 17.1] | — | 1.68 | 54.5% | 4.5% | 0.002 | 31.38 |

Table 1: Comparative performance across runs. Ranges represent the 95% Confidence Interval.
¹ Avg Turns to Peak: the average number of turns required to find the peak-scoring solution.
² Pass@1: percentage of attempts where a valid solution was found on Turn 1.

4. Statistical Analysis

Figure 1 illustrates score variability. The boxes show the typical range of scores for each model, distinguishing between models that perform consistently and those that are unpredictable.

Figure 1: Performance Consistency by Model. [Box plot: score distributions (0–250) per model, ordered Gemini 3 Pro, Gemini 3 Flash (High), GPT 5 (High), GPT 5.1 (High), Gemini 3 Flash (Medium), Gemini 2.5 Pro, GPT 5.2 (High), Claude Opus 4.5, GPT 5.1 Mini (High), Gemini 2.5 Flash (High), GPT 5 Nano (Medium), Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 2.5 Flash Lite, with the Greedy Baseline for reference.]

Figure 2 illustrates the efficiency frontier, mapping mean score against latency (log scale) and cost (bubble size).

Figure 2: Efficiency Frontier (Score vs. Time vs. Cost). [Scatter: mean score vs. duration on a log scale (10 s – 3000 s), with the Greedy Baseline and an "Ideal Zone (Fast & Smart)" marked.]
Legend: Bubble Size ∝ Cost • Color ∝ Stability (Red: Volatile → Blue: Consistent)
Figure 3: Operator Usage Frequency. [Heatmap: average uses per run of Exponentiation, Sqrt, p_n, and sopfr for each model; intensity scale: None / Low / Medium / High.]

Figure 4 shows how average score, runtime, and cost vary by the day a benchmark attempt was recorded.

Figure 4: Daily Benchmark Trends (Score, Time, Cost). [Line chart, 16 Dec – 24 Jan: Claude Haiku 4.5 score, time (s), and cost ($) plotted against the all-model averages.]

5. Benchmarked Target Numbers

| Target | Released | Last Updated | Status | Pass % | Baseline | Best Score (Model) |
|---|---|---|---|---|---|---|
| 6174 | Jan 15, 26 | Jan 15, 26 | 2/10 | 88.9% | FAIL | GPT 5.1 (High) |
| 5676 | Jan 14, 26 | Jan 15, 26 | 1/10 | 89.3% | FAIL | Gemini 3 Pro |
| 2966 | Dec 17, 25 | Dec 20, 25 | 2/10 | 93.1% | FAIL | Gemini 3 Flash (High) |
| 6256 | Dec 16, 25 | Dec 20, 25 | 2/10 | 93.1% | FAIL | Gemini 3 Pro |
| 346 | Dec 19, 25 | Dec 19, 25 | 2/10 | 96.2% | FAIL | Gemini 3 Pro |
| 5623 | Jan 19, 26 | Jan 19, 26 | 2/10 | 96.4% | 257 | Gemini 3 Pro |
| 3189 | Dec 15, 25 | Dec 20, 25 | 2/10 | 96.6% | FAIL | GPT 5 (High) |
| 8166 | Mar 29, 25 | Dec 20, 25 | 2/10 | 96.7% | FAIL | Gemini 3 Pro |
| 4673 | Aug 11, 25 | Dec 20, 25 | 2/10 | 100.0% | FAIL | Gemini 3 Flash (Medium) |
| 5809 | Dec 18, 25 | Dec 20, 25 | 2/10 | 100.0% | FAIL | Gemini 3 Pro |
| 6011 | Dec 13, 25 | Dec 20, 25 | 2/10 | 100.0% | FAIL | Gemini 3 Pro |
| 6322 | Dec 14, 25 | Dec 20, 25 | 2/10 | 100.0% | FAIL | Gemini 3 Flash (Medium) |
| 6482 | Jan 28, 26 | Jan 28, 26 | 2/10 | 100.0% | FAIL | Gemini 3 Pro |

Table 2: Numbers used as targets for the benchmark. Click a number to explore the details of its benchmark runs.

6. All Numbers

Status of all valid numbers in the Natory game. Hover or click a number to check the status of its runs.

Legend: Released · Planned · Partial · Full Run

7. Successful Models

Models with a success rate of 80% or higher.

| Model | Badge | Runs | Total Score | Avg Score | Highest Score |
|---|---|---|---|---|---|
| Gemini 3 Pro | Mathemagician | 22 | 3841 | 174.59 | 242 |
| Gemini 3 Flash (High) | Mathemagician | 22 | 2992 | 136.00 | 217 |
| GPT 5 (High) | Mathemagician | 22 | 2648 | 120.36 | 188 |
| Gemini 3 Flash (Medium) | Mathemagician | 24 | 2679 | 111.63 | 221 |
| Gemini 2.5 Pro | Mathemagician | 22 | 2402 | 109.18 | 232 |
| GPT 5.1 (High) | Mathemagician | 22 | 2338 | 106.27 | 232 |
| GPT 5.2 (High) | Mathemagician | 22 | 2058 | 93.55 | 186 |
| Claude Opus 4.5 | Mathemagician | 22 | 1855 | 84.32 | 152 |
| GPT 5.1 Mini (High) | 🧑‍🔬 Efficiency Guru | 22 | 1552 | 70.55 | 152 |
| Gemini 2.5 Flash (High) | 🧑‍🔬 Efficiency Guru | 20 | 1400 | 70.00 | 155 |
| Claude Sonnet 4.5 | 🕵️ Pattern Seeker | 22 | 1211 | 55.05 | 152 |
| GPT 5 Nano (Medium) | 🕵️ Pattern Seeker | 22 | 1155 | 52.50 | 96 |
| Claude Haiku 4.5 | 👶 Baby Builder | 21 | 831 | 39.57 | 122 |

8. Model Configuration

Current configuration settings for all models in the benchmark.

| Model | Minimal Run Count | Max Retry | Context Window | Reasoning Effort | Input Cost ($/1M) | Output Cost ($/1M) |
|---|---|---|---|---|---|---|
| gpt-5.2 | 2 | 1 | 400,000 | high | $1.75 | $14.00 |
| gpt-5 | 2 | 1 | 128,000 | high | $1.25 | $10.00 |
| gemini-2.5-pro | 2 | 1 | 2,000,000 | Default | $1.25 | $10.00 |
| gemini-3-pro-preview | 2 | 1 | 2,000,000 | Default | $2.00 | $12.00 |
| gpt-5.1 | 2 | 1 | 128,000 | high | $1.25 | $10.00 |
| gpt-5-mini | 2 | 1 | 128,000 | high | $0.25 | $2.00 |
| claude-opus-4-5 | 2 | 1 | 200,000 | Default | $5.00 | $25.00 |
| claude-sonnet-4-5 | 2 | 1 | 200,000 | Default | $3.00 | $15.00 |
| gpt-5-nano | 2 | 1 | 128,000 | medium | $0.05 | $0.40 |
| gemini-2.5-flash | 2 | 1 | 1,000,000 | high | $0.30 | $2.50 |
| gemini-3-flash-preview | 2 | 1 | 1,000,000 | high | $0.50 | $3.00 |
| gemini-3-flash-preview-medium | 2 | 1 | 1,000,000 | medium | $0.50 | $3.00 |
| claude-haiku-4-5 | 2 | 1 | 200,000 | Default | $1.00 | $5.00 |
| gemini-2.5-flash-lite | 2 | 1 | 1,000,000 | Default | $0.10 | $0.40 |
