
Preliminary Data Notice

This dashboard is currently running on a limited sample size. To draw definitive conclusions, we need to scale our testing to thousands of runs across hundreds of days.

High-reasoning models are volatile. To get an accurate score, we must run 10 attempts per model to account for variance.

We are currently exploring ways to scale our testing. If you're interested in helping, please consider sponsoring.

🙇 A Note from the Developer

Natory is built by a developer, for developers and engineering decision makers. While I treat this data seriously, I am a software developer, not a professional statistician. This benchmark is an evolving experiment in quantifying AI reasoning.

If you spot a flaw in the methodology, visualization, or have any suggestions, please let me know. I want this to be the most accurate tool possible, and I need your help to get the math right.

Scoring Algorithm Version: v5

Natory AI Benchmark: Constraint-Based Mathematical Search

Quantifying LLM reasoning capabilities beyond simple arithmetic: A study on planning, search efficiency, and reliability.

Is the model calculating, or is it reasoning? Standard benchmarks test if an AI can find the answer. Natory tests if it can find the path. We treat this game as a Hard Search Problem: models must optimize a mathematical expression under strict rule constraints, similar to code generation or program synthesis. This dashboard tracks the Price-Performance Ratio of modern LLMs, revealing which models can navigate complex logical search spaces without breaking the bank or the API contract.

1. Task Specification

The Natory Benchmark evaluates a model's ability to perform Constrained Number Construction. Unlike open-ended generation, the model must reconstruct a specific integer target using a strictly defined set of source digits and mathematical operators. Success requires solving a non-differentiable optimization problem: maximizing a scalar Reward Function (Score) within a hard search space.

The Objective

For a given Target Number (T) and a set of allowed digits (D), the model must synthesize a mathematical expression (E) such that:
eval(E) == T AND digits(E) ⊆ D.
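As an illustrative sketch (not the dashboard's actual validator), the two constraints can be checked mechanically. Python's `**` stands in for the game's `^` operator, and `eval` is used only because the input is a trusted example:

```python
import re

def satisfies_constraints(expr: str, target: int, allowed: set[str]) -> bool:
    """Check eval(E) == T and digits(E) ⊆ D for a plain arithmetic expression."""
    used_digits = set(re.findall(r"\d", expr))   # digits(E)
    if not used_digits <= allowed:               # digit-set constraint
        return False
    return eval(expr) == target                  # value constraint (trusted input only)

# Target 13 with allowed digits {1, 3}: 3^(1+1) + (3+1) = 9 + 4 = 13
print(satisfies_constraints("3**(1+1) + (3+1)", 13, {"1", "3"}))  # True
```

A real scorer would use a proper expression parser for the game's operators (`sopfr`, `p_n`, `!`, `mod`) instead of `eval`; the point here is only the two-part acceptance test.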

The Reward Function (Scoring)

The scoring algorithm functions as a multi-objective optimization function (f(x)), balancing three competing variables:

  • Constraint Adherence (Hard vs. Soft): The environment enforces a Hard Constraint on digit validity; using forbidden digits results in immediate zero-reward termination. However, it applies a Soft Exponential Penalty (10 × 2^extra) for Frequency Violations (overusing allowed digits beyond a tolerance threshold) to discourage redundancy.
  • Bounded Sophistication: The model is rewarded for Operator Diversity and the use of high-cost functions (e.g., sopfr, p_n). However, this is bounded by a Diminishing Returns curve: nesting depth > 2 or operator frequency > 3 yields zero marginal utility.
  • Semantic Efficiency (Anti-Triviality): Unlike simple length penalties, the scorer performs an AST Simplification pass (e.g., resolving x + 0 or x × 1). If the simplified expression yields a lower score than the raw output, the model is penalized for "Reward Hacking" (padding complexity without functional value).
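A minimal sketch of two of the mechanisms above, assuming a tolerance of 3 extra digits and the per-use bounties quoted later in the system prompt (the real scorer is more involved):

```python
def digit_hoarding_penalty(extra_digits: int, tolerance: int = 3) -> int:
    """Soft exponential penalty: 10 * 2^extra once usage exceeds the tolerance."""
    return 10 * 2 ** extra_digits if extra_digits > tolerance else 0

def operator_bonus(op_counts: dict[str, int], bounty: dict[str, int]) -> int:
    """Diminishing returns: only the first 3 uses of each operator pay out."""
    return sum(bounty.get(op, 0) * min(n, 3) for op, n in op_counts.items())

print(digit_hoarding_penalty(4))                 # 160
print(operator_bonus({"sopfr": 4, "p_n": 1},
                     {"sopfr": 18, "p_n": 14}))  # 3*18 + 1*14 = 68
```

Note how the fourth `sopfr` contributes nothing: past the spam cap, marginal utility is zero, which is exactly what pushes models toward "shallow and wide" expressions.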

Reference Example

Target (T): 13 | Allowed (D): {1, 3}

Model Output (E): 3^(1 + 1) + (3 + 1)

Analysis: The model successfully synthesized the target using only set D (3^2 + 4 = 13). It maximized the score by utilizing exponentiation (Depth Reward) rather than simple addition (e.g., 3 + 3 + 3 + 3 + 1), demonstrating search efficiency.

2. Experimental Protocol

To evaluate reasoning depth versus stability, we distinguish between Search Depth (Turns) and Independent Trials (Attempts):

  • Turn (Search Depth): A single reasoning step where the model generates a thought trace and executes a tool call.
  • Attempt (Trial): A full optimization session where the model is allocated a budget of 5 Turns to converge on an optimal solution.
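The Turn/Attempt relationship can be sketched as a simple loop. The agent and environment below are toy stand-ins (the real agent is an LLM and the real environment is the scorer); the structure is what matters: five turns, auto-saving the best valid score.

```python
def run_attempt(propose, evaluate, budget=5):
    """One Attempt: a budget of `budget` Turns; every valid expression is
    auto-saved, and the best score across Turns is the Attempt's result."""
    best = 0.0
    for turn in range(1, budget + 1):
        expr = propose(turn)       # one reasoning step -> one expression
        result = evaluate(expr)    # environment feedback informs the next turn
        if result["valid"]:
            best = max(best, result["score"])
    return best

# Toy stand-ins, purely illustrative (scores are made up):
proposals = {1: "127+0", 2: "p_n(31)", 3: "p_n(31)", 4: "p_n(31)", 5: "p_n(31)"}
verdicts = {"127+0":  {"valid": True, "score": 0.0},   # trivial -> scored 0
            "p_n(31)": {"valid": True, "score": 16.0}}
print(run_attempt(proposals.__getitem__, verdicts.__getitem__))  # 16.0
```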

Execution Pipeline:

  1. Initialization: The agent receives a target integer T (e.g., 127) along with a set of number-theoretic properties (metadata) to seed its search heuristic.
    System Prompt (Target: 127):
    You are a competitive expert player of the game "Natory".
    Your objective is to achieve the HIGHEST POSSIBLE SCORE by constructing a mathematical expression that evaluates EXACTLY to the target number.
    
    **THE TARGET:** 127
    **PROPERTIES:**
    ### Properties of 127
    - **Factorization**: 127^1
    - **Divisors**: 1, 127
    - **Prime Status**: Prime (#31)
    - **Prime Index**: 31
    - **Parity**: odd
    - **Special Sets**: **superPrime**: Prime index: 31, **friedman**: Yes
    
    
    **RULES:**
    1. **ALLOWED DIGITS:** {1, 2, 7}
       - You may REUSE these digits as many times as you like.
       - Example: If allowed is {1, 2}, "1+1+1" is valid.
    2. **VALIDITY:** The expression must evaluate exactly to 127.
    3. **OPERATOR RULES:**
       - You must use at least one operator.
    
    **ALLOWED OPERATORS:**
    1. Basic: +, -, *, /
    2. Exponent: ^ (Use ^{} for multi-char exponents, e.g., 2^{10})
    3. Factorial: !
    4. Square Root: sqrt(x)
    5. Prime Index: p_n(x) -> The x-th prime number.
       - Example: p_n(1)=2, p_n(5)=11.
    6. Sum of prime factors: sopfr(x) -> Sum of prime factors with multiplicity.
       - Example: sopfr(12) = sopfr(2*2*3) = 2+2+3 = 7
    7. Modulo: a mod b (Returns remainder of a/b. e.g. 10 mod 3 = 1)
      - **WARNING:** Do NOT use the "%" symbol, as it breaks LaTeX formatting. Use "a mod b".
    
    **SCORING (Objective: MAXIMIZE Total Score):**
    
    1. **High-Value Targets (The "Gold Standard"):**
      - **Exact Digits Bonus (+80 pts):** The ultimate goal. Use each digit exactly as many times as it appears in the target. (Overrides the All Digits bonus).
      - **All Digits Bonus (+20 pts):** If you can't be exact, at least use every allowed digit once.
      - **Sum of Powers Bonus (+20 pts):** Structure your expression as a sum of powers (e.g., x^a + y^b + \dots).
    
    2. **Operator Bounties (Double Points Active):**
      - **sopfr(x):** **+18 pts** per use.
      - **p_n(x):** **+14 pts** per use.
      - **Power (^) / Factorial (!):** **+12 pts** per use.
      - **Sqrt / Mod:** **+10 pts** per use.
      - *Diversity Bonus:* +2 pts for every unique operator type used.
    
    3. **CRITICAL LIMITS (Don't waste turns!):**
      - **Nesting Cap:** Bonuses are **ONLY** awarded for the first 2 layers of depth. Operators nested deeper than level 2 (e.g., p_n(p_n(p_n(...)))) yield **0 POINTS**. Keep it shallow and wide.
      - **Spam Cap:** You only get bonuses for the first **3 uses** of any specific operator. The 4th sopfr is worth 0 points.
    
    4. **Penalties (The "Soft Walls"):**
      - **Redundant Operations:** Usage of operators that do not change the value or are trivial (e.g., sopfr(5), 1!, sqrt(1), x^1, x mod 1) will incur severe penalties (-20pts per instance) and receive NO bonuses.
      - **Digit Hoarding (Exponential):** Using >3 extra digits triggers a massive penalty (10 \times 2^{extra}).
      - **Complexity:** Expressions with >15 total items suffer a quadratic penalty.
      - **Triviality:** If your expression simplifies to the target (e.g. 127 + 0) without meaningful transformations, score is 0.
    
    **STRATEGY GUIDE:**
    
    - **The "Shallow & Wide" Strategy:** Since nesting deeper than 2 stops paying bonuses, prefer structures like sopfr(A) + sopfr(B) (Wide) over sopfr(sopfr(A)) (Deep).
    - **Hunt for 'sopfr':** It pays the most (18 pts). Check if your target or its components have convenient prime factors.
    - **Respect the "Gem":** If properties say "Prime Index: 31", use p_n(31).
    - **Avoid "Padding":** Do not wrap numbers in 1! or sopfr(prime) to gain depth. The system detects this as "Redundant" and will deduct -20 points instead of giving you a bonus.
    
    
    You have access to the following tools. To use a tool, output valid JSON with a "tool" field.
    
    1. **evaluate**: Test an expression to see its value, validity, and potential score.
       - Usage: { "tool": "evaluate", "expression": "sopfr(10) + p_n(1)" }
    
    2. **p_n**: Get the nth prime number.
       - Usage: { "tool": "p_n", "n": 10 }
    
    3. **sopfr**: Get the Sum of Prime Factors (with multiplicity) for a number.
       - Usage: { "tool": "sopfr", "n": 10 }
    
    
    **INSTRUCTIONS:**
    1. **Turn Budget:** You have exactly **5 TURNS** to interact.
       - **Turn 1:** Secure a "Safety Score" (valid but simple).
       - **Turns 2-4:** **Optimize.** Use the tool output to see which Bonuses/Penalties you triggered.
       - **Turn 5:** Try one final variation if you think you can beat your high score.
    2. **Auto-Save:** Every valid expression you evaluate is recorded. We keep the best one. You do not need to "submit" at the end.
    3. **Learn from Feedback:** The `evaluate` tool returns a `scoreBreakdown`. **READ IT.**
       - If you see `trivialExpressionPenalty`, you are using forbidden loops (remove `1!`, `sopfr(prime)`).
       - If you see `repetitionPenalty`, you are spamming operators (cap usage at 3).
    4. **Format:** Output the JSON block at the very end.
    
    **EXAMPLE:**
    Target: 10, Allowed: {1, 0}
    Reasoning: 2 times 5 is 10. I can use p_n(3) which is 5. But 3 is not a digit in 10. Maybe I can use p_n(1) and build a chain of p_n to reach 5.
    ```json
    {
      "tool": "evaluate",
      "expression": "p_n(p_n(p_n(1))) * p_n(1)"
    }
    ```
    
  2. Tool Environment: The model interacts with a deterministic environment via a strict API:
    • evaluate: Validates expressions and returns the scalar reward.
    • p_n(x): Returns the x-th Prime Number (Search Tool).
    • sopfr(x): Sum of Prime Factors (Decomposition Tool).
  3. Iterative Refinement: Unlike "Pass/Fail" benchmarks, Natory supports iterative optimization. Upon submitting a valid expression via evaluate, the environment returns a Score Breakdown vector. The model can analyze this feedback to optimize for higher complexity or brevity in remaining turns.
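The two number-theoretic tools are standard functions; reference implementations (a sketch, not the benchmark's own code) look like:

```python
def sopfr(n: int) -> int:
    """Sum of prime factors with multiplicity: sopfr(12) = 2+2+3 = 7."""
    total, d = 0, 2
    while d * d <= n:
        while n % d == 0:   # strip every factor d, counting multiplicity
            total += d
            n //= d
        d += 1
    return total + (n if n > 1 else 0)  # leftover n is a prime factor

def p_n(x: int) -> int:
    """The x-th prime, 1-indexed: p_n(1) = 2, p_n(5) = 11."""
    count, cand = 0, 1
    while count < x:
        cand += 1
        if all(cand % d for d in range(2, int(cand ** 0.5) + 1)):
            count += 1
    return cand

print(sopfr(12), p_n(5), p_n(31))  # 7 11 127
```

Trial division is fine here because game targets are small (four digits or so); the `evaluate` tool itself is the only component that needs the full scoring logic.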

Handling Randomness & Sample Size

LLMs can be unpredictable. A model might get lucky on one attempt and fail the next.

To account for this, we run multiple attempts (aiming for N=10) per target number. Since running these models is expensive, we release new numbers in two stages:

  • Partial Run (N=2): The default for new numbers. Quick but susceptible to luck.
  • Full Run (N=10): The goal for high-precision ranking. These runs are prioritized based on community sponsorship.

The 95% Confidence Interval (the bar width) visualizes uncertainty. A wide bar means either the model is fundamentally inconsistent, or we simply haven't raised the funds to run enough tests to be sure.
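The interval is presumably the standard normal-approximation CI over per-attempt scores (mean ± 1.96 × standard error); the dashboard's exact method isn't documented, so treat this as a sketch:

```python
import math
import statistics

def ci95(scores: list[float]) -> tuple[float, float]:
    """95% confidence interval for the mean (normal approximation)."""
    mean = statistics.mean(scores)
    half = 1.96 * statistics.stdev(scores) / math.sqrt(len(scores))
    return mean - half, mean + half

# Hypothetical per-attempt scores; the bar narrows as N grows.
lo, hi = ci95([160, 175, 190, 150, 200, 170, 185, 165, 180, 171])
print(f"[{lo:.1f} - {hi:.1f}]")
```

Because the half-width shrinks with the square root of N, going from a Partial Run (N=2) to a Full Run (N=10) cuts the interval width by roughly a factor of √5 ≈ 2.2 for the same variance.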

Optimization Logic: The scoring function (f(x)) acts as the objective. It positively weights operator depth (Complexity) and negatively weights token count (Inefficiency) and redundancy.

3. Main Performance Results

| Model | Runs | Mean Score [95% CI] | Peak Score | Avg Turns to Peak¹ | Pass Rate | Pass@1² | Avg Cost ($) | Avg Duration (s) |
|---|---|---|---|---|---|---|---|---|
| Baseline (Greedy Search) | 10 | 25.7 | 257 | — | 20.0% | — | $0.000 | — |
| Gemini 3 Pro | 22 | 174.6 (+55%) [159.1 – 190.1] | 242 | 3.82 | 100.0% | 63.6% | 0.701 | 1027.18 |
| Gemini 3 Flash (High) | 22 | 136.0 (+107%) [119.6 – 152.4] | 217 | 3.55 | 100.0% | 77.3% | 0.359 | 1272.28 |
| GPT 5 (High) | 22 | 120.4 (+121%) [99.6 – 141.2] | 188 | 2.91 | 100.0% | 77.3% | 0.420 | 3156.23 |
| Gemini 3 Flash (Medium) | 24 | 111.6 (+491%) [94.1 – 129.1] | 221 | 3.21 | 100.0% | 62.5% | 0.116 | 408.74 |
| Gemini 2.5 Pro | 22 | 109.2 (+518%) [87.3 – 131.0] | 232 | 3.68 | 100.0% | 81.8% | 0.299 | 427.31 |
| GPT 5.1 (High) | 22 | 106.3 (+855%) [80.1 – 132.5] | 232 | 3.05 | 100.0% | 68.2% | 0.489 | 2793.53 |
| GPT 5.2 (High) | 22 | 93.5 (+47%) [75.4 – 111.7] | 186 | 3.18 | 100.0% | 72.7% | 0.358 | 1193.16 |
| Claude Opus 4.5 | 22 | 84.3 (+305%) [69.6 – 99.0] | 152 | 4.09 | 100.0% | 86.4% | 0.281 | 279.66 |
| GPT 5.1 Mini (High) | 22 | 70.5 (+417%) [54.5 – 86.6] | 152 | 3.41 | 100.0% | 86.4% | 0.101 | 1645.39 |
| Gemini 2.5 Flash (High) | 20 | 70.0 (+368%) [48.6 – 91.4] | 155 | 3.35 | 100.0% | 60.0% | 0.103 | 366.84 |
| Claude Sonnet 4.5 | 22 | 55.0 (+395%) [35.1 – 75.0] | 152 | 2.41 | 100.0% | 50.0% | 0.162 | 285.04 |
| GPT 5 Nano (Medium) | 22 | 52.5 (+31%) [41.7 – 63.3] | 96 | 2.86 | 100.0% | 77.3% | 0.019 | 866.83 |
| Claude Haiku 4.5 | 21 | 39.6 (+227%) [25.3 – 53.9] | 122 | 2.05 | 100.0% | 57.1% | 0.091 | 226.44 |
| Gemini 2.5 Flash Lite | 22 | 11.0 (+132%) [5.0 – 17.1] | — | 1.68 | 54.5% | 4.5% | 0.002 | 31.38 |

Table 1: Comparative performance across runs. Ranges represent the 95% Confidence Interval.
¹ Avg Turns to Peak: the average number of turns required to find the peak-scoring solution.
² Pass@1: percentage of attempts where a valid solution was found on Turn 1.

4. Statistical Analysis

Figure 1 illustrates score variability. The boxes show the typical range of scores for each model, distinguishing between models that perform consistently and those that are unpredictable.

Figure 1: Performance Consistency by Model. [Box plot: score distributions (0–250) per model, ordered Gemini 3 Pro, Gemini 3 Flash (High), GPT 5 (High), GPT 5.1 (High), Gemini 3 Flash (Medium), Gemini 2.5 Pro, GPT 5.2 (High), Claude Opus 4.5, GPT 5.1 Mini (High), Gemini 2.5 Flash (High), GPT 5 Nano (Medium), Claude Sonnet 4.5, Claude Haiku 4.5, Gemini 2.5 Flash Lite, with the Greedy Baseline for reference.]

Figure 2 illustrates the efficiency frontier, mapping mean score against latency (log scale) and cost (bubble size).

Figure 2: Efficiency Frontier (Score vs. Time vs. Cost). [Scatter: mean score vs. duration on a log scale (10 s – 3000 s), with the Greedy Baseline and an "Ideal Zone (Fast & Smart)" marked.]
Legend: Bubble Size ∝ Cost • Color ∝ Stability (Red: Volatile → Blue: Consistent)
Figure 3: Operator Usage Frequency. [Heatmap: average uses per run of Exponentiation, Sqrt, p_n, and sopfr for each model; intensity scale: None / Low / Medium / High.]

Figure 4 shows how average score, runtime, and cost vary by the day a benchmark attempt was recorded.

Figure 4: Daily Benchmark Trends (Score, Time, Cost). [Line chart, 16 Dec – 24 Jan: Claude Haiku 4.5 score, time (s), and cost ($) plotted against the all-model averages.]

5. Benchmarked Target Numbers

| Target | Released | Last Updated | Status | Pass % | Baseline | Best Score (Model) |
|---|---|---|---|---|---|---|
| 6174 | Jan 15, 26 | Jan 15, 26 | 2/10 | 88.9% | FAIL | GPT 5.1 (High) |
| 5676 | Jan 14, 26 | Jan 15, 26 | 1/10 | 89.3% | FAIL | Gemini 3 Pro |
| 2966 | Dec 17, 25 | Dec 20, 25 | 2/10 | 93.1% | FAIL | Gemini 3 Flash (High) |
| 6256 | Dec 16, 25 | Dec 20, 25 | 2/10 | 93.1% | FAIL | Gemini 3 Pro |
| 346 | Dec 19, 25 | Dec 19, 25 | 2/10 | 96.2% | FAIL | Gemini 3 Pro |
| 5623 | Jan 19, 26 | Jan 19, 26 | 2/10 | 96.4% | 257 | Gemini 3 Pro |
| 3189 | Dec 15, 25 | Dec 20, 25 | 2/10 | 96.6% | FAIL | GPT 5 (High) |
| 8166 | Mar 29, 25 | Dec 20, 25 | 2/10 | 96.7% | FAIL | Gemini 3 Pro |
| 4673 | Aug 11, 25 | Dec 20, 25 | 2/10 | 100.0% | FAIL | Gemini 3 Flash (Medium) |
| 5809 | Dec 18, 25 | Dec 20, 25 | 2/10 | 100.0% | FAIL | Gemini 3 Pro |
| 6011 | Dec 13, 25 | Dec 20, 25 | 2/10 | 100.0% | FAIL | Gemini 3 Pro |
| 6322 | Dec 14, 25 | Dec 20, 25 | 2/10 | 100.0% | FAIL | Gemini 3 Flash (Medium) |
| 6482 | Jan 28, 26 | Jan 28, 26 | 2/10 | 100.0% | FAIL | Gemini 3 Pro |

Table 2: Numbers used as targets for the benchmark. Click a number to explore the details of its benchmark runs.

6. All Numbers

Status of all valid numbers in the Natory game. Hover or click a number to check the status of its runs.

Legend: Released · Planned · Partial · Full Run

7. Successful Models

Models with a success rate of 80% or higher.

| Model | Badge | Runs | Total Score | Avg Score | Highest Score |
|---|---|---|---|---|---|
| Gemini 3 Pro | Mathemagician | 22 | 3841 | 174.59 | 242 |
| Gemini 3 Flash (High) | Mathemagician | 22 | 2992 | 136.00 | 217 |
| GPT 5 (High) | Mathemagician | 22 | 2648 | 120.36 | 188 |
| Gemini 3 Flash (Medium) | Mathemagician | 24 | 2679 | 111.63 | 221 |
| Gemini 2.5 Pro | Mathemagician | 22 | 2402 | 109.18 | 232 |
| GPT 5.1 (High) | Mathemagician | 22 | 2338 | 106.27 | 232 |
| GPT 5.2 (High) | Mathemagician | 22 | 2058 | 93.55 | 186 |
| Claude Opus 4.5 | Mathemagician | 22 | 1855 | 84.32 | 152 |
| GPT 5.1 Mini (High) | 🧑‍🔬 Efficiency Guru | 22 | 1552 | 70.55 | 152 |
| Gemini 2.5 Flash (High) | 🧑‍🔬 Efficiency Guru | 20 | 1400 | 70.00 | 155 |
| Claude Sonnet 4.5 | 🕵️ Pattern Seeker | 22 | 1211 | 55.05 | 152 |
| GPT 5 Nano (Medium) | 🕵️ Pattern Seeker | 22 | 1155 | 52.50 | 96 |
| Claude Haiku 4.5 | 👶 Baby Builder | 21 | 831 | 39.57 | 122 |

8. Model Configuration

Current configuration settings for all models in the benchmark.

| Model | Minimal Run Count | Max Retry | Context Window | Reasoning Effort | Input Cost ($/1M) | Output Cost ($/1M) |
|---|---|---|---|---|---|---|
| gpt-5.2 | 2 | 1 | 400,000 | high | $1.75 | $14.00 |
| gpt-5 | 2 | 1 | 128,000 | high | $1.25 | $10.00 |
| gemini-2.5-pro | 2 | 1 | 2,000,000 | Default | $1.25 | $10.00 |
| gemini-3-pro-preview | 2 | 1 | 2,000,000 | Default | $2.00 | $12.00 |
| gpt-5.1 | 2 | 1 | 128,000 | high | $1.25 | $10.00 |
| gpt-5-mini | 2 | 1 | 128,000 | high | $0.25 | $2.00 |
| claude-opus-4-5 | 2 | 1 | 200,000 | Default | $5.00 | $25.00 |
| claude-sonnet-4-5 | 2 | 1 | 200,000 | Default | $3.00 | $15.00 |
| gpt-5-nano | 2 | 1 | 128,000 | medium | $0.05 | $0.40 |
| gemini-2.5-flash | 2 | 1 | 1,000,000 | high | $0.30 | $2.50 |
| gemini-3-flash-preview | 2 | 1 | 1,000,000 | high | $0.50 | $3.00 |
| gemini-3-flash-preview-medium | 2 | 1 | 1,000,000 | medium | $0.50 | $3.00 |
| claude-haiku-4-5 | 2 | 1 | 200,000 | Default | $1.00 | $5.00 |
| gemini-2.5-flash-lite | 2 | 1 | 1,000,000 | Default | $0.10 | $0.40 |
