
Releases: symflower/eval-dev-quality

v0.4.0

26 Apr 12:42
8a38762

0.4.0 (2024-04-26)

Deep dive into this version's evaluation in the accompanying blog post: https://symflower.com/en/company/blog/2024/dev-quality-eval-v0.4.0-is-llama-3-better-than-gpt-4-for-generating-tests/

This release's major additions are:

  • Java as a new language,
  • automatic Markdown report with an SVG chart,
  • and lots of automation and testing to make the evaluation benchmark super reliable.

Features

  • Java language adapter with “java/plain” repository #62
  • Scoring through metric points and ranking of models #42 (a scoring sketch follows this list)
  • Automatic categorization of models depending on their worst result #36 #39 #48
  • Full logging of results per model and repository #25 #53
  • Migrate to “symflower test” instead of reimplementing test execution logic #62
  • Automatic installation of Symflower for RAG and general source code analytics, to avoid reinventing the wheel #50
  • Generate test file paths through language adapters #60
  • Generate import / package paths through language adapters #63
  • Generate test framework name through language adapters #63 (a language-adapter sketch follows this list)
  • Human-readable categories with descriptions #57
  • Summary report as Markdown file with links to results #57 #77
  • Summary bar chart for overall results of categories as SVG in Markdown file #57
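
As a rough illustration of how scoring through metric points and ranking can work, here is a minimal Go sketch. The Assessment type, the metric names, and the point values are hypothetical examples, not the benchmark's actual scoring code.

```go
package main

import (
	"fmt"
	"sort"
)

// Assessment counts how many points a model earned per metric.
// The metric names and their weighting are hypothetical examples.
type Assessment map[string]uint

// Score sums all metric points into a single value used for ranking.
func (a Assessment) Score() uint {
	var score uint
	for _, points := range a {
		score += points
	}
	return score
}

func main() {
	// Hypothetical per-model assessments.
	results := map[string]Assessment{
		"model-a": {"response-no-error": 1, "files-executed": 1, "coverage-statement": 10},
		"model-b": {"response-no-error": 1},
	}

	// Rank models by their total score, highest first.
	models := make([]string, 0, len(results))
	for model := range results {
		models = append(models, model)
	}
	sort.Slice(models, func(i, j int) bool {
		return results[models[i]].Score() > results[models[j]].Score()
	})

	for rank, model := range models {
		fmt.Printf("%d. %s: %d points\n", rank+1, model, results[model].Score())
	}
}
```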
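
The language-adapter items above describe one adapter per language that knows how to produce test file paths, import/package paths, and the test framework name. The following Go interface is a hedged sketch of that idea with assumed method names and a hypothetical Go implementation; it is not the repository's actual API.

```go
package language

import (
	"path/filepath"
	"strings"
)

// Language sketches a per-language adapter interface. The method set
// mirrors the feature list above; the names are illustrative only.
type Language interface {
	// Name returns the language identifier, e.g. "golang" or "java".
	Name() string
	// TestFilePath returns the path of the test file that belongs to
	// the given source file, relative to the repository root.
	TestFilePath(repositoryPath string, filePath string) string
	// ImportPath returns the import or package path of the given file.
	ImportPath(projectRootPath string, filePath string) string
	// TestFramework returns the name of the test framework a model
	// should target, e.g. "JUnit 5" for Java.
	TestFramework() string
}

// golang is a hypothetical adapter implementation for Go.
type golang struct{}

func (golang) Name() string { return "golang" }

func (golang) TestFilePath(repositoryPath string, filePath string) string {
	return strings.TrimSuffix(filePath, ".go") + "_test.go"
}

func (golang) ImportPath(projectRootPath string, filePath string) string {
	return filepath.Join(projectRootPath, filepath.Dir(filePath))
}

func (golang) TestFramework() string {
	return "testing" // Go's standard library testing package.
}
```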

Bug fixes

  • More reliable parsing of code fences #70 #69 (illustrated by the sketch below)
  • Do not exit the process but panic instead, for reliable testing and stack traces #69
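
For context on the code-fence fix: model responses usually wrap the generated test in a Markdown code fence, and the evaluation has to extract the fenced content before it can write and compile the file. The sketch below shows the general technique with a regular expression; it is illustrative only and not the parser used in the repository.

````go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// codeFence matches a fenced Markdown code block with an optional
// language tag. The capture group is non-greedy so only the first
// fenced block is taken.
var codeFence = regexp.MustCompile("(?s)```[a-zA-Z]*\\n(.*?)```")

// extractCode returns the content of the first code fence, or the
// whole response if no fence is found.
func extractCode(response string) string {
	if match := codeFence.FindStringSubmatch(response); match != nil {
		return strings.TrimSpace(match[1])
	}
	return strings.TrimSpace(response)
}

func main() {
	response := "Here is the test:\n```go\npackage plain\n\nimport \"testing\"\n\nfunc TestPlain(t *testing.T) {\n\tplain()\n}\n```\nDone."
	fmt.Println(extractCode(response))
}
````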

v0.3.0

04 Apr 12:50
a0f48bc

The first README of "DevQualityEval" (our final name for the benchmark) is online: https://github.com/symflower/eval-dev-quality. We are looking for feedback on how to make it more direct, less fluffy, and more interesting for developers 🚨🔦 Please help! We are currently sifting through the results of the first benchmark run and writing a report.

v0.2.0

03 Apr 12:42
df10ae2

This release makes the following tasks possible:

  • Add providers, models, and languages easily by implementing a common interface
  • Evaluate with any model that openrouter.ai offers and with Symflower's symbolic execution
  • Add repositories that should be evaluated, with Go as their language
  • Run tests of Go repositories and query their coverage as the first evaluation benchmark task (see the sketch after this list)
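
To make the first benchmark task more concrete: running a Go repository's tests and querying their coverage can be done by invoking "go test -cover" and parsing the coverage summary it prints. The helper below is an illustrative sketch under that assumption (including the hypothetical repository path), not the evaluation's actual implementation.

```go
package main

import (
	"fmt"
	"log"
	"os/exec"
	"regexp"
	"strconv"
)

// coverageLine matches the summary that "go test -cover" prints,
// e.g. "coverage: 85.7% of statements".
var coverageLine = regexp.MustCompile(`coverage: (\d+(?:\.\d+)?)% of statements`)

// testCoverage runs the tests of a Go repository and returns the first
// statement-coverage figure that "go test" reports. This is an
// illustrative helper, not the benchmark's actual task implementation.
func testCoverage(repositoryPath string) (float64, error) {
	cmd := exec.Command("go", "test", "-cover", "./...")
	cmd.Dir = repositoryPath

	output, err := cmd.CombinedOutput()
	if err != nil {
		return 0.0, fmt.Errorf("tests failed: %w\n%s", err, output)
	}

	match := coverageLine.FindSubmatch(output)
	if match == nil {
		return 0.0, fmt.Errorf("no coverage reported:\n%s", output)
	}

	return strconv.ParseFloat(string(match[1]), 64)
}

func main() {
	// Hypothetical path to an evaluation repository.
	coverage, err := testCoverage("testdata/golang/plain")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("statement coverage: %.1f%%\n", coverage)
}
```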

More to come. If you want to contribute, let us know.

v0.1.0

29 Mar 20:00
b0c59b4

This release includes all the basic components needed to move forward with creating an evaluation benchmark for LLMs and friends, to compare and evolve the code quality of code generation. The only big exceptions are a well-documented README, an interface to a generic LLM API service, and tasks so that people who want to contribute can help. These will follow soon.