Update 0.5.0 report with results for Gemma 2 model family #244

Merged (1 commit) on Jul 5, 2024
30 changes: 30 additions & 0 deletions docs/reports/v0.5.0/gemma-2-27b-it/README.md
@@ -0,0 +1,30 @@
# Evaluation from 2024-07-05 06:55:06

![Bar chart that categorizes all evaluated models.](./categories.svg)

This report was generated by [DevQualityEval benchmark](https://github.com/symflower/eval-dev-quality) in `version 0.5.0`.

**REMARK: `gemma-2-9b-it` and `gemma-2-27b-it` were originally evaluated together, and the results were then split into separate folders. Therefore, some logs might contain entries from "the other" Gemma model.**

## Results

> Keep in mind that LLMs are nondeterministic. The following results just reflect a current snapshot.

The results of all models have been divided into the following categories:

- category unknown: Models in this category could not be categorized.
- response error: Models in this category encountered an error.
- no code: Models in this category produced no code.
- invalid code: Models in this category produced invalid code.
- executable code: Models in this category produced executable code.
- statement coverage reached: Models in this category produced code that reached full statement coverage.
- no excess response: Models in this category did not respond with more content than requested.
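As a reading aid, the category list above can be sketched as an ordered enumeration. This is only an illustration: the class name `Category`, the member names, and the helper `label` are hypothetical, and treating the list order as a ranking is an assumption rather than something the report states.

```python
from enum import IntEnum


class Category(IntEnum):
    """Result categories in the order the report lists them.

    The names and numeric values are hypothetical; only the order
    and the labels come from the report.
    """

    CATEGORY_UNKNOWN = 0
    RESPONSE_ERROR = 1
    NO_CODE = 2
    INVALID_CODE = 3
    EXECUTABLE_CODE = 4
    STATEMENT_COVERAGE_REACHED = 5
    NO_EXCESS_RESPONSE = 6


def label(category: Category) -> str:
    """Return the human-readable label used in the report."""
    return category.name.lower().replace("_", " ")


for category in Category:
    print(int(category), label(category))
```

Under this assumed ordering, a comparison such as `Category.EXECUTABLE_CODE > Category.INVALID_CODE` expresses that one category sits later in the list than another.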

The following sections list all models with their categories. The complete log of the evaluation with all outputs can be found [here](./evaluation.log). Detailed scoring can be found [here](./evaluation.csv).

### Result category "category unknown"

Models in this category could not be categorized.

- [`custom-nvidia/google/gemma-2-27b-it`](./custom-nvidia_google_gemma-2-27b-it/)
- [`custom-nvidia/google/gemma-2-9b-it`](./custom-nvidia_google_gemma-2-9b-it/)
5 changes: 5 additions & 0 deletions docs/reports/v0.5.0/gemma-2-27b-it/evaluation.csv
@@ -0,0 +1,5 @@
model-id,model-name,cost,language,repository,task,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
custom-nvidia/google/gemma-2-27b-it,gemma-2-27b-it,0,golang,golang/light,write-tests,4095,3650,100,83660,950489,85270,115,115,115
custom-nvidia/google/gemma-2-27b-it,gemma-2-27b-it,0,golang,golang/plain,write-tests,70,50,5,370,7029,440,5,5,5
custom-nvidia/google/gemma-2-27b-it,gemma-2-27b-it,0,java,java/light,write-tests,13812,13360,107,125031,1023420,126411,115,115,115
custom-nvidia/google/gemma-2-27b-it,gemma-2-27b-it,0,java,java/plain,write-tests,70,50,5,940,9753,1000,5,5,5
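The per-repository rows above can be aggregated per language with a few lines of Python. The sketch below is not the benchmark's own tooling: the function name `sum_by_language` and the choice of columns to sum are assumptions made for illustration, with the CSV content copied from the diff.

```python
import csv
import io
from collections import defaultdict

# Rows copied from evaluation.csv in this diff.
EVALUATION_CSV = """\
model-id,model-name,cost,language,repository,task,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
custom-nvidia/google/gemma-2-27b-it,gemma-2-27b-it,0,golang,golang/light,write-tests,4095,3650,100,83660,950489,85270,115,115,115
custom-nvidia/google/gemma-2-27b-it,gemma-2-27b-it,0,golang,golang/plain,write-tests,70,50,5,370,7029,440,5,5,5
custom-nvidia/google/gemma-2-27b-it,gemma-2-27b-it,0,java,java/light,write-tests,13812,13360,107,125031,1023420,126411,115,115,115
custom-nvidia/google/gemma-2-27b-it,gemma-2-27b-it,0,java,java/plain,write-tests,70,50,5,940,9753,1000,5,5,5
"""

# Assumption: these are the integer metric columns worth summing.
NUMERIC_COLUMNS = [
    "score",
    "coverage",
    "files-executed",
    "generate-tests-for-file-character-count",
    "processing-time",
    "response-character-count",
    "response-no-error",
    "response-no-excess",
    "response-with-code",
]


def sum_by_language(csv_text):
    """Sum every numeric column per (model-id, language) pair."""
    totals = defaultdict(lambda: defaultdict(int))
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = (row["model-id"], row["language"])
        for column in NUMERIC_COLUMNS:
            totals[key][column] += int(row[column])
    return {key: dict(values) for key, values in totals.items()}


totals = sum_by_language(EVALUATION_CSV)
golang = totals[("custom-nvidia/google/gemma-2-27b-it", "golang")]
print(golang["score"], golang["coverage"])  # prints "4165 3700"
```

For this data, the sums per language equal the rows in `golang-summed.csv` and `java-summed.csv` from the same pull request.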
97,515 changes: 97,515 additions & 0 deletions docs/reports/v0.5.0/gemma-2-27b-it/evaluation.log

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions docs/reports/v0.5.0/gemma-2-27b-it/golang-summed.csv
@@ -0,0 +1,2 @@
model,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
custom-nvidia/google/gemma-2-27b-it,4165,3700,105,84030,957518,85710,120,120,120
2 changes: 2 additions & 0 deletions docs/reports/v0.5.0/gemma-2-27b-it/java-summed.csv
@@ -0,0 +1,2 @@
model,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
custom-nvidia/google/gemma-2-27b-it,13882,13410,112,125971,1033173,127411,120,120,120
2 changes: 2 additions & 0 deletions docs/reports/v0.5.0/gemma-2-27b-it/models-summed.csv
@@ -0,0 +1,2 @@
model,score,coverage,files-executed,generate-tests-for-file-character-count,processing-time,response-character-count,response-no-error,response-no-excess,response-with-code
custom-nvidia/google/gemma-2-27b-it,18047,17110,217,210001,1990691,213121,240,240,240
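A quick consistency check for the three summed files is to add the Go and Java rows column by column and compare the result with the overall row. The sketch below does this with values copied from the diff; the helper names `parse_row` and `add_rows` are hypothetical, and the assumption that `models-summed.csv` is the column-wise sum of the per-language files is inferred from the numbers, not documented here.

```python
# Data rows copied from golang-summed.csv, java-summed.csv and models-summed.csv.
GOLANG_ROW = "custom-nvidia/google/gemma-2-27b-it,4165,3700,105,84030,957518,85710,120,120,120"
JAVA_ROW = "custom-nvidia/google/gemma-2-27b-it,13882,13410,112,125971,1033173,127411,120,120,120"
MODELS_ROW = "custom-nvidia/google/gemma-2-27b-it,18047,17110,217,210001,1990691,213121,240,240,240"


def parse_row(line):
    """Split a summed-CSV data row into the model ID and its integer metrics."""
    model, *values = line.split(",")
    return model, [int(value) for value in values]


def add_rows(first, second):
    """Add two metric lists column by column."""
    return [a + b for a, b in zip(first, second)]


model, golang = parse_row(GOLANG_ROW)
_, java = parse_row(JAVA_ROW)
_, overall = parse_row(MODELS_ROW)

combined = add_rows(golang, java)
print(combined == overall)  # prints "True"
```

If such a check fails for another report, the mismatch is a hint that a per-language file and the overall file come from different evaluation runs.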