Merge pull request #196 from symflower/issue-template-roadmap
Roadmap issue template and README section about releases
zimmski authored Jul 4, 2024
2 parents 1ba5dd9 + b58d925 commit fa63a8b
Showing 2 changed files with 127 additions and 20 deletions.
78 changes: 78 additions & 0 deletions .github/ISSUE_TEMPLATE/roadmap.md
@@ -0,0 +1,78 @@
---
name: Roadmap issue
about: Use this template for tracking release roadmaps.
title: "Roadmap for vXXXXX"
labels: roadmap
assignees: zimmski
---

Tasks/Goals:

- [ ] Development & Management 🛠️
  - [ ] TODO what and why as goal
- [ ] Documentation 📚
  - [ ] TODO what and why as goal
- [ ] Evaluation ⏱️
  - [ ] TODO what and why as goal
- [ ] Models 🤖
  - [ ] TODO what and why as goal
- [ ] Reports & Metrics 🗒️
  - [ ] TODO what and why as goal
- [ ] Operating Systems 🖥️
  - [ ] TODO what and why as goal
- [ ] Tools 🧰
  - [ ] TODO what and why as goal
- [ ] Tasks 🔢
  - [ ] TODO what and why as goal
- [ ] Closed PR / not-implemented issue 🚫
  - [ ] TODO what and why with reason

Release version of this roadmap issue:

> ❓ When should a release happen? Check the [`README`](../../README.md#when-and-how-to-release)!
- [ ] Do a full evaluation with the version
  - [ ] Exclude certain OpenRouter models by default
    - [ ] `nitro` because they are just faster
    - [ ] `extended` because longer context windows don't matter for our tasks
    - [ ] `free` and `auto` because these are just "aliases" for existing models
  - [ ] Exclude special-purpose models
    - [ ] Vision models
    - [ ] Roleplay and creative writing models
    - [ ] Classification models
    - [ ] Models with internet access (usually denoted by `-online` suffix)
    - [ ] Models with extended context windows (usually denoted by `-1234K` suffix)
  - [ ] Always prefer fine-tuned (`-instruct`, `-chat`) models over a plain base model
- [ ] Tag version (tag can be moved in case important merges happen afterwards)
- [ ] For all issues of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) with the users that worked on it
  - Fixed bugs should always be sorted into their relevant categories and not into a generic "Bugs" category!
- [ ] For all PRs of the current milestone, one by one, add them to the roadmap tasks (it is ok if a task has multiple issues) with the users that worked on it
  - Fixed bugs should always be sorted into their relevant categories and not into a generic "Bugs" category!
- [ ] Search all issues for ...
  - [ ] Unassigned issues that are closed, and assign them to someone
  - [ ] Issues without a milestone, and assign them a milestone
  - [ ] Issues without a label, and assign them at least one label
- [ ] Write the release notes:
  - [ ] Use the tasks that are already there for the release note outline
  - [ ] Add highlighted features based on the done tasks, sorted by how many users would use the feature
- [ ] Do the release
  - [ ] With the release notes
  - [ ] Set as latest release
- [ ] Prepare the next roadmap
  - [ ] Create a milestone for the next release
  - [ ] Create a new roadmap issue for the next release
  - [ ] Move all open tasks/TODOs from this roadmap issue to the next roadmap issue.
  - [ ] Move every comment of this roadmap issue as a TODO to the next roadmap issue. Mark when done with a :rocket: emoji.
- [ ] Blog post containing evaluation results, new features and learnings
  - [ ] Update README with blog post link and new header image
  - [ ] Update repository link with blog post link
- [ ] https://github.com/symflower/eval-dev-quality/discussions
  - [ ] Remove the previous announcements
  - [ ] Add a "Deep dive: $blog-post-title" announcement for the blog post
  - [ ] Add a "v$version: $summary-of-highlights" announcement for the release
- [ ] Announce release
- [ ] Eat cake 🎂

TODO sort and sort out:

- [ ] TODO
69 changes: 49 additions & 20 deletions README.md
@@ -167,8 +167,8 @@ Please check the [Kubernetes](./docs/kubernetes/README.md) documentation.

With `DevQualityEval` we answer the following questions:

- Which LLMs can solve software development tasks?
- How good is the quality of their results?

Programming is a non-trivial profession. Even writing tests for an empty function requires substantial knowledge of the used programming language and its conventions. We already investigated this challenge and how many LLMs failed at it in our [first `DevQualityEval` report](https://symflower.com/en/company/blog/2024/can-ai-test-a-go-function-that-does-nothing/#why-evaluate-an-empty-function). This highlights the need for a **benchmarking framework for evaluating AI performance on software development task solving**.

@@ -188,9 +188,7 @@ Each repository can contain a configuration file `repository.json` in its root d

```json
{
  "tasks": ["write-tests"]
}
```
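
As a rough illustration of how such a configuration could be consumed, the following Python sketch reads the `tasks` list from the text of a `repository.json` file. The helper name `load_tasks` is hypothetical and not part of `DevQualityEval`, and treating a missing `tasks` key as "no tasks listed" is an assumption of this sketch, not documented behavior.

```python
import json


def load_tasks(config_text: str) -> list[str]:
    """Return the task identifiers listed in a `repository.json` document.

    Hypothetical helper for illustration only. A missing "tasks" key is
    treated as an empty list, which is an assumption of this sketch.
    """
    config = json.loads(config_text)
    tasks = config.get("tasks", [])
    # Reject anything that is not a list of strings so that typos in the
    # configuration surface early instead of being silently ignored.
    if not isinstance(tasks, list) or not all(isinstance(t, str) for t in tasks):
        raise ValueError('"tasks" must be a list of task identifier strings')
    return tasks


if __name__ == "__main__":
    print(load_tasks('{"tasks": ["write-tests"]}'))
```

Validating the shape of the list up front keeps a malformed configuration from being mistaken for a repository without tasks.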

@@ -211,21 +209,21 @@ On a high level, `DevQualityEval` asks the model to produce tests for an example

Currently, the following points are awarded for this task:

- `response-no-error`: `+1` if the response did not encounter an error
- `response-not-empty`: `+1` if the response is not empty
- `response-with-code`: `+1` if the response contained source code
- `compiled`: `+1` if the source code compiled
- `statement-coverage-reached`: `+10` if the generated tests reach 100% coverage
- `no-excess`: `+1` if the response did not contain more content than requested
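
To make the weighting concrete, here is a minimal Python sketch that adds up the points listed above. The function name `score` and the idea of passing the achieved keys as a set are assumptions made for illustration; the actual evaluation may compute and aggregate scores differently.

```python
# Point values copied from the list above for the `write-tests` task.
POINTS = {
    "response-no-error": 1,
    "response-not-empty": 1,
    "response-with-code": 1,
    "compiled": 1,
    "statement-coverage-reached": 10,
    "no-excess": 1,
}


def score(achieved: set[str]) -> int:
    """Sum the points for every scoring key a response achieved.

    Hypothetical helper: simple summation is an assumption of this sketch.
    """
    unknown = achieved - set(POINTS)
    if unknown:
        raise ValueError(f"unknown scoring keys: {sorted(unknown)}")
    return sum(POINTS[key] for key in achieved)


if __name__ == "__main__":
    # A response that compiled but did not reach full statement coverage.
    partial = {"response-no-error", "response-not-empty", "response-with-code", "compiled"}
    print(score(partial))
    print(score(set(POINTS)))
```

Under this reading, reaching 100% statement coverage outweighs all other criteria combined, so a response that merely compiles stays far below one whose tests fully cover the code.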

#### Cases

Currently, the following cases are available for this task:

- Java
  - `plain/src/main/java/plain.java`: An empty function that does nothing.
- Go
  - `plain/plain.go`: An empty function that does nothing.

## Results

@@ -243,10 +241,41 @@ To add new tasks to the benchmark, add features, or fix bugs, you'll need a deve

First of all, thank you for thinking about contributing! There are multiple ways to contribute:

- Add more files to existing language repositories.
- Add more repositories to languages.
- Implement another language and add repositories for it.
- Implement new tasks for existing languages and repositories.
- Add more features and fix bugs in the evaluation, development environment, or CI: [best to have a look at the list of issues](https://github.com/symflower/eval-dev-quality/issues).

If you want to contribute but are unsure how: [create a discussion](https://github.com/symflower/eval-dev-quality/discussions) or write us directly at [markus.zimmermann@symflower.com](mailto:markus.zimmermann@symflower.com).

# When and how to release?

## Publishing Content

Releasing a new version of `DevQualityEval` and publishing content about it are two different things!
However, we plan development and releases to be "content-driven", i.e. we work on and add features that are interesting to write about (see "Release Roadmap" below).
Hence, for every release we also publish a deep dive blog post with our latest features and findings.
Our published content is aimed at giving insight into our work and educating the reader.

- new features that add value (i.e. change / add to the scoring / ranking) are always publish-worthy / release-worthy
- new tools / models are very often not release-worthy but still publish-worthy (i.e. how is new model `XYZ` doing in the benchmark)
- insights, learnings, problems, surprises, achieved goals and experiments are always publish-worthy
  - they need to be documented for later publishing (in a deep-dive blog post)
  - they can also be published right away already (depending on the nature of the finding) as a small report / post

❗ Always publish / release / journal early:

- if something works, but is not merged yet: publish about it already
- if some feature is merged that is release-worthy, but we were planning on adding other things to that release: release anyway
- if something else is publish-worthy: at least write down a few bullet points immediately on why it is interesting, including examples

## Release Roadmap

The `main` branch is always stable and could theoretically be used to form a new release at any given time.
To avoid having hundreds of releases for every merge to `main`, we perform releases only when a (group of) publish-worthy feature(s) or important bugfix(es) is merged (see "Publishing Content" above).

Therefore, we plan releases in special `Roadmap for vX.Y.Z` issues.
Such an issue contains a current list of publish-worthy goals that must be met for that release, a `TODO` section with items planned not for the current release but for a future one, and instructions on how issues / PRs / tasks need to be groomed and what needs to be done for a release to happen.

The issue template for roadmap issues can be found at [`.github/ISSUE_TEMPLATE/roadmap.md`](.github/ISSUE_TEMPLATE/roadmap.md).
