🚀 Real-time evaluation platform for mathematical reasoning models, featuring immediate results on AIME 2025 (released Feb 14, 2025).
The reported results for AIME 2025 represent the average performance across multiple temperature settings (0.0, 0.3, and 0.6). For detailed configuration parameters, please refer to the Hyperparameter Configuration section.
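To make the reported metric concrete, here is a minimal sketch (not the repository's actual code) of that averaging: per-question accuracy is computed over the 8 samples, averaged over questions for each temperature, and then averaged across the three temperatures. The `results` structure below is hypothetical.

```python
# Hedged sketch of the temperature-averaged scoring described above.
# `results[temp]` is a hypothetical list with one entry per question,
# each entry holding n_sampling (8) boolean correctness flags.
from statistics import mean

results = {
    0.0: [[True, True, False, True, True, True, False, True]],  # one question shown per temperature
    0.3: [[True, False, True, True, True, True, True, False]],
    0.6: [[True, True, True, False, True, True, True, True]],
}

def accuracy_at_temperature(per_question):
    # Average pass rate over the 8 samples of each question, then over questions.
    return mean(mean(samples) for samples in per_question)

reported_score = mean(accuracy_at_temperature(q) for q in results.values())
print(f"Temperature-averaged accuracy: {reported_score:.1%}")
```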
- AIME 2024
- AIME 2025 Part 1
- AIME 2025 Part 2
For detailed instructions and implementation details, please refer to eval/README.md.
- DeepSeek Series
  - DeepSeek-R1
  - DeepSeek-R1-Distill-Qwen (1.5B, 7B, 14B, 32B)
  - DeepSeek-R1-Distill-Llama (8B, 70B)
- O Series
  - o1-preview
  - o1-mini
  - o3-mini (low/medium/high)
- Others
  - gemini-2.0-flash-thinking
  - s1
  - limo
  - QwQ
For the O Series, DeepSeek-R1, and Gemini models, we use their default API configurations without modification and sample 8 times per question.
For all other models, which are evaluated locally, we keep the hyperparameters consistent across all evaluations:
```python
{
    "temperature": [0.0, 0.3, 0.6],  # 0.3 used for AIME 2024;
                                     # average of all three used for AIME I 2025
    "n_sampling": 8,                 # Samples per question
    "max_tokens": 32768,             # Maximum response length
    "seed": 0,                       # Fixed seed for reproducibility
    "top_p": 0.95                    # Nucleus sampling parameter
}
```
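As an illustration only, the sketch below shows how these hyperparameters could be applied with a vLLM backend. The repository's actual evaluation scripts (see eval/README.md) may differ; the model name and prompt are placeholders, and a recent vLLM version with per-request `seed` support is assumed.

```python
# Minimal sketch, not the repo's evaluation code: apply the hyperparameters
# above when sampling answers with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")  # placeholder model

for temperature in (0.0, 0.3, 0.6):  # results averaged over these for AIME I 2025
    params = SamplingParams(
        n=8,                 # n_sampling: samples per question
        temperature=temperature,
        top_p=0.95,          # nucleus sampling
        max_tokens=32768,    # maximum response length
        seed=0,              # fixed seed for reproducibility
    )
    outputs = llm.generate(["<AIME problem statement>"], params)  # placeholder prompt
```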
We conducted a comprehensive analysis of model performance across different temperature settings (0.0, 0.3, and 0.6) for AIME 2025. The results for AIME I 2025 are shown below:
Key findings include:
- **Large Model Stability**
  - DeepSeek-R1-Distill-Llama-70B showed the highest average performance (51.4%) but exhibited significant variance across temperatures (60.0%, 45.8%, 48.3%)
  - DeepSeek-R1-Distill-Qwen-14B and 32B maintained relatively stable performance across all temperatures, with averages of 46.7% and 46.1% respectively
- **Medium-Size Model Behavior**
  - DeepSeek-R1-Distill-Qwen-7B showed interesting temperature scaling, with performance improving as temperature increased (33.3% → 37.5% → 40.0%)
  - QwQ demonstrated optimal performance at temperature 0.3 (40.8%), with lower scores at both extremes
- **Smaller Model Characteristics**
  - DeepSeek-R1-Distill-Qwen-1.5B and s1 showed similar patterns, performing best at temperature 0.0
  - DeepSeek-R1-Distill-Llama-8B uniquely performed best at temperature 0.3 (28.3%)
- Optimal Temperature Varies: No single temperature setting was universally optimal across all models
- Size Correlation: Larger models generally showed more stability across temperature variations
- Performance-Stability Tradeoff: Models with higher average performance often showed greater sensitivity to temperature changes
Based on these findings, we recommend:
- Model-Specific Tuning: Consider individual temperature tuning for each model rather than using a fixed setting
- Ensemble Approach: For critical applications, consider averaging results across multiple temperature settings
- Size Considerations: For larger models (>14B parameters), temperature settings have less impact on final performance
This analysis has been incorporated into our evaluation protocol for future benchmarks.
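As a small illustration of the Model-Specific Tuning recommendation above, the hypothetical sketch below picks the best-performing temperature per model from a grid of accuracies. The data layout is assumed, and only models whose per-temperature scores appear in the findings above are included.

```python
# Illustrative only: choose the best temperature per model from grid results.
# Accuracy values are the per-temperature scores quoted in the findings above.
accuracy = {
    "DeepSeek-R1-Distill-Llama-70B": {0.0: 0.600, 0.3: 0.458, 0.6: 0.483},
    "DeepSeek-R1-Distill-Qwen-7B":   {0.0: 0.333, 0.3: 0.375, 0.6: 0.400},
}

best_temperature = {model: max(scores, key=scores.get) for model, scores in accuracy.items()}
print(best_temperature)
# {'DeepSeek-R1-Distill-Llama-70B': 0.0, 'DeepSeek-R1-Distill-Qwen-7B': 0.6}
```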
Yixin Ye, Yang Xiao, Tiantian Mi, Pengfei Liu
```bibtex
@misc{ye2025aimepreview,
  title        = {AIME-Preview: A Rigorous and Immediate Evaluation Framework for Advanced Mathematical Reasoning},
  author       = {Yixin Ye and Yang Xiao and Tiantian Mi and Pengfei Liu},
  year         = {2025},
  howpublished = {\url{https://github.com/GAIR-NLP/AIME-Preview}},
  note         = {GitHub repository}
}
```