Commit a18536a: update homepage content

SeungoneKim committed Dec 6, 2024 · 1 parent 1f681c1

Showing 3 changed files with 28 additions and 43 deletions.
Binary file added docs/assets/img/agorabench_results.png
Binary file added docs/assets/img/cmu.png
71 changes: 28 additions & 43 deletions docs/index.markdown
layout: default
---



## Motivation
{: .sys-img}
![Motivation of AgoraBench.](/assets/img/motivation.png)

As shown in the Figure above, this makes it difficult to directly compare how good each LM is as a data generator.
<br>
To answer these kinds of questions, we need a more systematic approach to evaluate how good an LM is as a data generator. More specifically, we need a unified experimental setting where only the data generator varies and all other components are fixed. In this work, we introduce AgoraBench, a benchmark that serves this purpose by providing 9 experimental settings.

## Data Generation Methods
{: .sys-img}
![Data Generation Methods covered in AgoraBench.](/assets/img/methods.png)

In AgoraBench, we cover the following data generation methods (a minimal code sketch follows the list):
* <b>Instance Generation</b>: Similar to Self-Instruct, the data generator is conditioned on in-context demonstrations and generates new instances.
* <b>Response Generation</b>: Given a fixed set of instructions, the data generator generates responses for each instruction.
* <b>Quality Enhancement</b>: Given large amounts of low-quality data, the data generator enhances the quality of the data.
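
To make the three settings concrete, here is a minimal sketch of how each one can be framed as a prompt to the data generator. The templates and helper names are illustrative placeholders we made up for this page, not the actual AgoraBench prompts or pipeline.

```python
# Illustrative sketch of the three data generation settings (placeholder prompts,
# not the actual AgoraBench implementation).

def instance_generation_prompt(seed_examples: list[str]) -> str:
    # Condition the generator on in-context demonstrations and ask for a new instance.
    demos = "\n\n".join(seed_examples)
    return f"Here are some example problems:\n\n{demos}\n\nWrite one new, distinct problem."

def response_generation_prompt(instruction: str) -> str:
    # The instruction set is fixed; the generator only writes the response.
    return f"Instruction:\n{instruction}\n\nWrite a high-quality response."

def quality_enhancement_prompt(instruction: str, draft_response: str) -> str:
    # Start from an existing low-quality pair and rewrite it into a better one.
    return (f"Instruction:\n{instruction}\n\nDraft response:\n{draft_response}\n\n"
            "Rewrite the response so it is more accurate, complete, and clear.")

if __name__ == "__main__":
    print(instance_generation_prompt(["What is 2 + 2?", "Solve x + 3 = 7."]))
```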


## Metrics
{: .sys-img}
![Performance Gap Recovered (PGR) metric used in AgoraBench.](/assets/img/metrics.png)


To quantify the quality of the generated data (i.e., data generation capability), we propose a new metric called **Performance Gap Recovered (PGR)**. At a high level, PGR measures how much a student model trained on the synthetic data improves over its base model, compared to the improvement achieved by a reference model that shares the same base model.
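
Written as a formula (this is simply a restatement of the description above, with $s(\cdot)$ denoting performance on the evaluation benchmarks; the ratio is typically reported as a percentage), PGR is the fraction of the base-to-reference performance gap that the student recovers:

$$
\mathrm{PGR} = \frac{s(\text{student trained on generated data}) - s(\text{base model})}{s(\text{reference model}) - s(\text{base model})}
$$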

<br>
Specifically, we use Llama-3.1-8B as our base model and Llama-3.1-8B-Instruct as the reference model. This captures how much of the improvement achieved by Meta's post-training process, which produced Llama-3.1-8B-Instruct from Llama-3.1-8B, is recovered by training on the synthetic data.
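
As a concrete (hypothetical) example, the PGR computation reduces to a few lines; the benchmark scores below are placeholder numbers, not results from our experiments.

```python
def performance_gap_recovered(base: float, trained: float, reference: float) -> float:
    """Percentage of the base-to-reference performance gap recovered by
    training the student on the generator's synthetic data."""
    return 100.0 * (trained - base) / (reference - base)

# Placeholder scores (NOT real results): benchmark accuracy of Llama-3.1-8B (base),
# a student fine-tuned on synthetic data, and Llama-3.1-8B-Instruct (reference).
print(performance_gap_recovered(base=40.0, trained=47.0, reference=50.0))  # -> 70.0
```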


## AgoraBench Results
{: .sys-img}
![AgoraBench results.](/assets/img/agorabench_results.png)

<br>
We find that different models have distinct strengths and weaknesses in each data generation method. For instance, while GPT-4o excels in instance generation, Claude-3.5-Sonnet shows stronger performance in quality enhancement.

------

## Conclusion
For more information about our work, please check out our paper! We also plan to continually update this project based on your feedback, so feel free to reach out to us via email or Twitter!

## BibTeX

<pre>
@misc{kim2024evaluating,
  title={Evaluating Language Models as Synthetic Data Generators},
  author={Seungone Kim and Juyoung Suk and Xiang Yue and Vijay Viswanathan and Seongyun Lee and Yizhong Wang and Kiril Gashteovski and Carolin Lawrence and Sean Welleck and Graham Neubig},
  year={2024},
  eprint={2412.03679},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2412.03679},
}
</pre>

------

{: .logos}
[![Logo of LKLab](/assets/img/lklab_logo.png)](https://lklab.kaist.ac.kr/)
[![Logo of KAIST](/assets/img/kaist_logo.png)](https://kaist.ac.kr)
[![Logo of NAVER](/assets/img/naver_ai_lab_logo.png)](https://www.facebook.com/NAVERAILAB)
[![Logo of CMU](/assets/img/cmu.png)](https://www.lti.cs.cmu.edu/)


{: .center .acknowledgement}
This research was supported by the **NEC Student Research Fellowship Program**.
