
Releases: symflower/eval-dev-quality

v0.4.0

26 Apr 12:42
8a38762

0.4.0 (2024-04-26)

Deep dive into this version's evaluation in the accompanying blog post: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

This release's major additions are:

  • Java as a new language,
  • automatic Markdown report with an SVG chart,
  • and lots of automation and testing to make the evaluation benchmark super reliable.

Features

  • Java language adapter with “java/plain” repository #62
  • Scoring through metric points and ranking of models #42 (a scoring sketch follows this list)
  • Automatic categorization of models depending on their worst result #36 #39 #48
  • Full logging of results per model and repository #25 #53
  • Migrate to “symflower test” instead of reimplementing test execution logic #62
  • Automatic installation of Symflower for RAG and general source code analytics, to avoid reinventing the wheel #50
  • Generate test file paths through language adapters #60
  • Generate import / package paths through language adapters #63
  • Generate test framework name through language adapters #63 (a language-adapter sketch follows this list)
  • Human-readable categories with descriptions #57
  • Summary report as Markdown file with links to results #57 #77
  • Summary bar chart for overall results of categories as SVG in Markdown file #57
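
As a rough illustration of how scoring through metric points and ranking can work, here is a minimal Go sketch. The Assessment type, the metric names, and the point values are hypothetical examples, not the benchmark's actual scoring code.

```go
package main

import (
	"fmt"
	"sort"
)

// Assessment counts how many points a model earned per metric.
// The metric names and their weighting are hypothetical examples.
type Assessment map[string]uint

// Score sums all metric points into a single value used for ranking.
func (a Assessment) Score() uint {
	var score uint
	for _, points := range a {
		score += points
	}
	return score
}

func main() {
	// Hypothetical per-model assessments.
	results := map[string]Assessment{
		"model-a": {"response-no-error": 1, "files-executed": 1, "coverage-statement": 10},
		"model-b": {"response-no-error": 1},
	}

	// Rank models by their total score, highest first.
	models := make([]string, 0, len(results))
	for model := range results {
		models = append(models, model)
	}
	sort.Slice(models, func(i, j int) bool {
		return results[models[i]].Score() > results[models[j]].Score()
	})

	for rank, model := range models {
		fmt.Printf("%d. %s: %d points\n", rank+1, model, results[model].Score())
	}
}
```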
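
The language-adapter items above describe one adapter per language that knows how to produce test file paths, import/package paths, and the test framework name. The following Go interface is a hedged sketch of that idea with assumed method names and a hypothetical Go implementation; it is not the repository's actual API.

```go
package language

import (
	"path/filepath"
	"strings"
)

// Language sketches a per-language adapter interface. The method set
// mirrors the feature list above; the names are illustrative only.
type Language interface {
	// Name returns the language identifier, e.g. "golang" or "java".
	Name() string
	// TestFilePath returns the path of the test file that belongs to
	// the given source file, relative to the repository root.
	TestFilePath(repositoryPath string, filePath string) string
	// ImportPath returns the import or package path of the given file.
	ImportPath(projectRootPath string, filePath string) string
	// TestFramework returns the name of the test framework a model
	// should target, e.g. "JUnit 5" for Java.
	TestFramework() string
}

// golang is a hypothetical adapter implementation for Go.
type golang struct{}

func (golang) Name() string { return "golang" }

func (golang) TestFilePath(repositoryPath string, filePath string) string {
	return strings.TrimSuffix(filePath, ".go") + "_test.go"
}

func (golang) ImportPath(projectRootPath string, filePath string) string {
	return filepath.Join(projectRootPath, filepath.Dir(filePath))
}

func (golang) TestFramework() string {
	return "testing" // Go's standard library testing package.
}
```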

Bug fixes

  • More reliable parsing of code fences #70 #69 (illustrated by the sketch below)
  • Do not exit the process but panic instead, for reliable testing and stack traces #69
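
For context on the code-fence fix: model responses usually wrap the generated test in a Markdown code fence, and the evaluation has to extract the fenced content before it can write and compile the file. The sketch below shows the general technique with a regular expression; it is illustrative only and not the parser used in the repository.

````go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// codeFence matches a fenced Markdown code block with an optional
// language tag. The capture group is non-greedy so only the first
// fenced block is taken.
var codeFence = regexp.MustCompile("(?s)```[a-zA-Z]*\\n(.*?)```")

// extractCode returns the content of the first code fence, or the
// whole response if no fence is found.
func extractCode(response string) string {
	if match := codeFence.FindStringSubmatch(response); match != nil {
		return strings.TrimSpace(match[1])
	}
	return strings.TrimSpace(response)
}

func main() {
	response := "Here is the test:\n```go\npackage plain\n\nimport \"testing\"\n\nfunc TestPlain(t *testing.T) {\n\tplain()\n}\n```\nDone."
	fmt.Println(extractCode(response))
}
````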

v0.3.0

04 Apr 12:50
a0f48bc

The first README of "DevQualityEval" (our final name for the benchmark) is online: https://github.com/symflower/eval-dev-quality. We are looking for feedback on how to make it more direct, less fluffy, and more interesting for developers 🚨🔦 Please help! We are currently sifting through the results of the first benchmark run and writing a report.

v0.2.0

03 Apr 12:42
df10ae2

This release makes the following tasks possible:

  • Add providers, models, and languages easily by implementing a common interface
  • Evaluate with any model that openrouter.ai offers and with Symflower's symbolic execution
  • Add repositories that should be evaluated, with Go as their language
  • Run tests of Go repositories and query their coverage as the first evaluation benchmark task (see the sketch after this list)
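
To make the first benchmark task more concrete: running a Go repository's tests and querying their coverage can be done by invoking "go test -cover" and parsing the coverage summary it prints. The helper below is an illustrative sketch under that assumption (including the hypothetical repository path), not the evaluation's actual implementation.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"regexp"
	"strconv"
)

// coverageLine matches the summary that "go test -cover" prints,
// e.g. "coverage: 85.7% of statements".
var coverageLine = regexp.MustCompile(`coverage: (\d+(?:\.\d+)?)% of statements`)

// testCoverage runs the tests of a Go repository and returns the first
// statement-coverage figure that "go test" reports. This is an
// illustrative helper, not the benchmark's actual task implementation.
func testCoverage(repositoryPath string) (float64, error) {
	cmd := exec.Command("go", "test", "-cover", "./...")
	cmd.Dir = repositoryPath

	output, err := cmd.CombinedOutput()
	if err != nil {
		return 0.0, fmt.Errorf("tests failed: %w\n%s", err, output)
	}

	match := coverageLine.FindSubmatch(output)
	if match == nil {
		return 0.0, fmt.Errorf("no coverage reported:\n%s", output)
	}

	return strconv.ParseFloat(string(match[1]), 64)
}

func main() {
	// Hypothetical path to an evaluation repository.
	coverage, err := testCoverage("testdata/golang/plain")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("statement coverage: %.1f%%\n", coverage)
}
```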

More to come. If you want to contribute, let us know.

v0.1.0

29 Mar 20:00
b0c59b4

This release includes all the basic components needed to move forward with creating an evaluation benchmark for LLMs and friends, to compare and evolve the code quality of code generation. The only big exceptions are a well-documented README, an interface to a generic LLM API service, and tasks so that people who want to contribute can help. These will follow soon.