# Reliable and Efficient Amortized Model-based Evaluation
Reliable and Efficient Amortized Model-based Evaluation (Reeval) is an extension of the HELM framework that uses Computerized Adaptive Testing (CAT), grounded in Item Response Theory (IRT), to adaptively evaluate Large Language Models (LLMs). At each step, the approach selects the question whose difficulty is closest to the current estimate of the model's ability, thereby eliciting that ability reliably and efficiently. The question difficulties are provided on Hugging Face at [`stair-lab/reeval-difficulty-for-helm`](https://huggingface.co/datasets/stair-lab/reeval-difficulty-for-helm), which currently covers 22 HELM scenarios. The paper's authors will release a Python package for computing these difficulties and will support more scenarios in the future.
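
If you want to inspect the per-question difficulties directly, the dataset can be downloaded with the Hugging Face CLI. The following is a minimal sketch, assuming `huggingface_hub` is installed; the local directory name is illustrative:

```bash
# Install the Hugging Face Hub client if it is not already available
pip install -U huggingface_hub

# Download the reeval difficulty dataset for local inspection
huggingface-cli download stair-lab/reeval-difficulty-for-helm \
    --repo-type dataset \
    --local-dir ./reeval-difficulty-for-helm
```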
# References
[Paper](https://arxiv.org/abs/2503.13335)
# Getting Started
Use Git to clone the `stanford-crfm/helm` repository and set up a developer environment as described in [Developer Setup](https://crfm-helm.readthedocs.io/en/latest/developer_setup/):
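
A condensed sketch of those steps is shown below; the virtual-environment name is illustrative, and the Developer Setup guide remains the authoritative reference:

```bash
# Clone the HELM repository
git clone https://github.com/stanford-crfm/helm.git
cd helm

# Create an isolated environment and install HELM in editable mode
python3 -m venv venv
source venv/bin/activate
pip install -e .
```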
The following is an example of adaptively evaluating OpenAI GPT-2 on the MMLU scenario using 50 instances. The argument `--model-ability` sets the initial ability estimate of the model for the reeval evaluation, `--max-samples` sets the maximum number of samples to evaluate in reeval mode, and `--metric-name` specifies the scenario's main metric. Note that reeval mode does not support the argument `--max-eval-instances`, because it could contradict `--max-samples`. All other arguments are the same as in HELM.
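
The command below is a hedged sketch of such a run; the run-entry spec (MMLU subject and model name), suite name, and metric name are illustrative and may need to be adapted to your setup:

```bash
helm-run \
    --run-entries mmlu:subject=anatomy,model=openai/gpt2 \
    --suite reeval-gpt2-mmlu \
    --model-ability 0.0 \
    --max-samples 50 \
    --metric-name exact_match
```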