🚀 Real-time evaluation platform for mathematical reasoning models, featuring immediate results on AIME 2025 (released Feb 14, 2025).
The reported results for AIME 2025 represent the average performance across multiple temperature settings (0.0, 0.3, and 0.6). For detailed configuration parameters, please refer to the Hyperparameter Configuration section.
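To make the reported metric concrete, here is a minimal sketch (not the repository's actual code) of that averaging: per-question accuracy is computed over the 8 samples, averaged over questions for each temperature, and then averaged across the three temperatures. The `results` structure below is hypothetical.

```python
# Hedged sketch of the temperature-averaged scoring described above.
# `results[temp]` is a hypothetical list with one entry per question,
# each entry holding n_sampling (8) boolean correctness flags.
from statistics import mean

results = {
    0.0: [[True, True, False, True, True, True, False, True]],  # one question shown per temperature
    0.3: [[True, False, True, True, True, True, True, False]],
    0.6: [[True, True, True, False, True, True, True, True]],
}

def accuracy_at_temperature(per_question):
    # Average pass rate over the 8 samples of each question, then over questions.
    return mean(mean(samples) for samples in per_question)

reported_score = mean(accuracy_at_temperature(q) for q in results.values())
print(f"Temperature-averaged accuracy: {reported_score:.1%}")
```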
- AIME 2024
- AIME 2025 Part 1
- AIME 2025 Part 2
For detailed instructions and implementation details, please refer to eval/README.md.
- DeepSeek Series
  - DeepSeek-R1
  - DeepSeek-R1-Distill-Qwen (1.5B, 7B, 14B, 32B)
  - DeepSeek-R1-Distill-Llama (8B, 70B)
- O Series
  - o1-preview
  - o1-mini
  - o3-mini (low/medium/high)
- Others
  - gemini-2.0-flash-thinking
  - s1
  - limo
  - QwQ
For the O Series, DeepSeek-R1, and Gemini models, we use their default API configurations without modification and sample 8 times per question.
For all other models, which are evaluated locally, we keep the hyperparameters consistent across all evaluations:
```python
{
    "temperature": [0.0, 0.3, 0.6],  # 0.3 used for AIME 2024;
                                     # average of all three used for AIME I 2025
    "n_sampling": 8,                 # Samples per question
    "max_tokens": 32768,             # Maximum response length
    "seed": 0,                       # Fixed seed for reproducibility
    "top_p": 0.95                    # Nucleus sampling parameter
}
```
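As an illustration only, the sketch below shows how these hyperparameters could be applied with a vLLM backend. The repository's actual evaluation scripts (see eval/README.md) may differ; the model name and prompt are placeholders, and a recent vLLM version with per-request `seed` support is assumed.

```python
# Minimal sketch, not the repo's evaluation code: apply the hyperparameters
# above when sampling answers with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="deepseek-ai/DeepSeek-R1-Distill-Qwen-7B")  # placeholder model

for temperature in (0.0, 0.3, 0.6):  # results averaged over these for AIME I 2025
    params = SamplingParams(
        n=8,                 # n_sampling: samples per question
        temperature=temperature,
        top_p=0.95,          # nucleus sampling
        max_tokens=32768,    # maximum response length
        seed=0,              # fixed seed for reproducibility
    )
    outputs = llm.generate(["<AIME problem statement>"], params)  # placeholder prompt
```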
We conducted a comprehensive analysis of model performance across different temperature settings (0.0, 0.3, and 0.6) for AIME 2025. The results for AIME I 2025 are shown below:
Key findings include:
- **Large Model Stability**
  - DeepSeek-R1-Distill-Llama-70B showed the highest average performance (51.4%) but exhibited significant variance across temperatures (60.0%, 45.8%, 48.3%)
  - DeepSeek-R1-Distill-Qwen-14B and 32B maintained relatively stable performance across all temperatures, with averages of 46.7% and 46.1% respectively
- **Medium-Size Model Behavior**
  - DeepSeek-R1-Distill-Qwen-7B showed interesting temperature scaling, with performance improving as temperature increased (33.3% → 37.5% → 40.0%)
  - QwQ demonstrated optimal performance at temperature 0.3 (40.8%), with lower scores at both extremes
- **Smaller Model Characteristics**
  - DeepSeek-R1-Distill-Qwen-1.5B and s1 showed similar patterns, performing best at temperature 0.0
  - DeepSeek-R1-Distill-Llama-8B uniquely performed best at temperature 0.3 (28.3%)
- Optimal Temperature Varies: No single temperature setting was universally optimal across all models
- Size Correlation: Larger models generally showed more stability across temperature variations
- Performance-Stability Tradeoff: Models with higher average performance often showed greater sensitivity to temperature changes
Based on these findings, we recommend:
- Model-Specific Tuning: Consider individual temperature tuning for each model rather than using a fixed setting
- Ensemble Approach: For critical applications, consider averaging results across multiple temperature settings
- Size Considerations: For larger models (>14B parameters), temperature settings have less impact on final performance
This analysis has been incorporated into our evaluation protocol for future benchmarks.
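As a small illustration of the Model-Specific Tuning recommendation above, the hypothetical sketch below picks the best-performing temperature per model from a grid of accuracies. The data layout is assumed, and only models whose per-temperature scores appear in the findings above are included.

```python
# Illustrative only: choose the best temperature per model from grid results.
# Accuracy values are the per-temperature scores quoted in the findings above.
accuracy = {
    "DeepSeek-R1-Distill-Llama-70B": {0.0: 0.600, 0.3: 0.458, 0.6: 0.483},
    "DeepSeek-R1-Distill-Qwen-7B":   {0.0: 0.333, 0.3: 0.375, 0.6: 0.400},
}

best_temperature = {model: max(scores, key=scores.get) for model, scores in accuracy.items()}
print(best_temperature)
# {'DeepSeek-R1-Distill-Llama-70B': 0.0, 'DeepSeek-R1-Distill-Qwen-7B': 0.6}
```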
Yixin Ye, Yang Xiao, Tiantian Mi, Pengfei Liu
```bibtex
@misc{ye2025aimepreview,
  title        = {AIME-Preview: A Rigorous and Immediate Evaluation Framework for Advanced Mathematical Reasoning},
  author       = {Yixin Ye and Yang Xiao and Tiantian Mi and Pengfei Liu},
  year         = {2025},
  howpublished = {\url{https://github.com/GAIR-NLP/AIME-Preview}},
  note         = {GitHub repository}
}
```