3. Main Performance Results
| Model | Runs | Mean [95% CI] | Peak Score | Turns to Peak | Pass Rate | 1-Shot | Avg Cost ($) | Avg Duration (s) | Avg Input Tokens per Attempt | Avg Output Tokens per Attempt | Output Tokens per Point | Error Rate (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Baseline (Greedy Search) | 10 | 25.7 | 257 | - | 20.0% | - | 0.000 | - | - | - | - | - |
| | 22 | 174.6 (+55%) [159.1 - 190.1] | | 3.82 | 100.0% | 63.6% | 0.701 | 1027.18 | 8817.6 | 56913.6 | 325.98 | 29.1 |
| | 22 | 136.0 (+107%) [119.6 - 152.4] | | 3.55 | 100.0% | 77.3% | 0.359 | 1272.28 | 10056.7 | 118023.5 | 867.82 | 30.0 |
| | 22 | 120.4 (+121%) [99.6 - 141.2] | | 2.91 | 100.0% | 77.3% | 0.420 | 3156.23 | 6685.5 | 41124.3 | 341.67 | 22.7 |
| | 24 | 111.6 (+491%) [94.1 - 129.1] | | 3.21 | 100.0% | 62.5% | 0.116 | 408.74 | 9850.0 | 36888.5 | 330.47 | 45.0 |
| | 22 | 109.2 (+518%) [87.3 - 131.0] | | 3.68 | 100.0% | 81.8% | 0.299 | 427.31 | 8621.0 | 28843.7 | 264.18 | 23.6 |
| | 22 | 106.3 (+855%) [80.1 - 132.5] | | 3.05 | 100.0% | 68.2% | 0.489 | 2793.53 | 8004.6 | 47925.4 | 450.97 | 28.2 |
| | 22 | 93.5 (+47%) [75.4 - 111.7] | | 3.18 | 100.0% | 72.7% | 0.358 | 1193.16 | 8815.5 | 24459.0 | 261.47 | 20.0 |
| | 22 | 84.3 (+305%) [69.6 - 99.0] | | 4.09 | 100.0% | 86.4% | 0.281 | 279.66 | 10588.3 | 9131.4 | 108.30 | 23.6 |
| | 22 | 70.5 (+417%) [54.5 - 86.6] | | 3.41 | 100.0% | 86.4% | 0.101 | 1645.39 | 7991.3 | 49627.1 | 703.48 | 25.5 |
| | 20 | 70.0 (+368%) [48.6 - 91.4] | | 3.35 | 100.0% | 60.0% | 0.103 | 366.84 | 8527.0 | 40076.6 | 572.52 | 37.0 |
| | 22 | 55.0 (+395%) [35.1 - 75.0] | | 2.41 | 100.0% | 50.0% | 0.162 | 285.04 | 10999.9 | 8611.2 | 156.44 | 62.7 |
| | 22 | 52.5 (+31%) [41.7 - 63.3] | | 2.86 | 100.0% | 77.3% | 0.019 | 866.83 | 8173.1 | 46875.7 | 892.87 | 30.0 |
| | 21 | 39.6 (+227%) [25.3 - 53.9] | | 2.05 | 100.0% | 57.1% | 0.091 | 226.44 | 12281.1 | 15672.8 | 396.06 | 55.2 |
| | 22 | 11.0 (+132%) [5.0 - 17.1] | | 1.68 | 54.5% | 4.5% | 0.002 | 31.38 | 9665.2 | 2266.4 | 205.19 | 84.5 |

Error rate (%): fraction of turns with error messages (syntax/parsing/other failures).
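
The "Output tokens per point" figures appear to match the average output tokens per attempt divided by the mean score. The source does not state this definition, so it is an inference from the reported numbers. The sketch below checks it against three rows of the table; the helper name `output_tokens_per_point` is hypothetical and not part of any benchmark tooling.

```python
def output_tokens_per_point(avg_output_tokens: float, mean_score: float) -> float:
    """Hypothetical helper: avg output tokens per attempt divided by mean score.

    Assumed definition, inferred from the table rather than stated in it.
    """
    if mean_score <= 0:
        raise ValueError("mean_score must be positive")
    return avg_output_tokens / mean_score


# (mean score, avg output tokens per attempt, reported output tokens per point)
# Values copied from three rows of the table above.
rows = [
    (174.6, 56913.6, 325.98),
    (136.0, 118023.5, 867.82),
    (84.3, 9131.4, 108.30),
]

for mean_score, avg_out, reported in rows:
    computed = output_tokens_per_point(avg_out, mean_score)
    # Small gaps are expected because the table shows rounded means.
    print(f"computed={computed:.2f} reported={reported:.2f}")
```

The computed values agree with the reported ones to within rounding, which suggests (but does not prove) that the metric was derived this way from unrounded means.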