PullRequestBenchmark

Fellow software engineers, we are genuinely on the cusp of making ourselves obsolete through the power of automation! Can't wait to get there.

PullRequestBenchmark evaluates AI by eliciting binary verdicts on existing, complex PRs, addressing the subjective challenges inherent in assessing generated code. Achieving a 100% success rate would realistically mean the end of traditional coding jobs.

This effort might seem similar to SWE-bench at first glance, but there are key differences.

Expert PR reviewers are likely expert PR creators

The journey towards automating the programming profession as we know it today is both exciting and vitally important. This is the heart of our motivation for leveraging Large Language Models (LLMs) to review Pull Requests (PRs) as a representative step in this significant transition.

  1. Assertion: Expertise in PR Review Capabilities Equates to Expertise in PR Creation Capability

    a. Intuitive Argument (Competence Argument): If LLMs can expertly review and render decisions on complex PRs (major refactorings, architecture redesigns, large feature additions, intricate bug fixes, advanced security measures), it intuitively follows that they can also create complex, high-quality PRs. The understanding, analysis, and judgment required to review such PRs would necessarily entail the ability to create them.

    b. Logical Argument (Bootstrap Argument): By leveraging LLMs' expert review skills, even initially basic PRs could be iteratively generated and refined. This process of self-evaluation and improvement, akin to bootstrapping, could over time yield high-quality, human-level PRs; a minimal sketch of such a loop follows this list.

  2. Assertion: PR Review Skill Evaluation is Easier Than PR Creation Skill Assessment

    Assessing LLMs' PR review skills simplifies the evaluation process because every benchmark PR already carries a recorded human verdict: the model's binary decision can be compared directly to the one expert reviewers actually made, while still probing the nuanced decision-making involved in software development.

  3. Argument: Code Review Benchmarks Can Indicate Progress Toward Traditional Programming Role Obsolescence

    If we accept the validity of both the first and second assertions, measuring the code review capabilities of LLMs using benchmarks could help us track advancements towards automating traditional programming roles. The ability of LLMs to generate complex pull requests autonomously would then imply a potential obsolescence of traditional programmer roles, signaling a transformation in the software development industry.
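
As a concrete illustration of the Bootstrap Argument (1b), here is a minimal sketch of a generate-review-refine loop in Python. All three helper functions are hypothetical placeholders (nothing in the benchmark prescribes them); each stands in for a call to the same underlying LLM:

```python
def generate_pr(task: str) -> str:
    ...  # placeholder: ask the LLM for an initial draft PR (description + diff)

def review_pr(pr: str) -> tuple[str, str]:
    ...  # placeholder: ask the LLM for a binary verdict plus written feedback

def refine_pr(pr: str, feedback: str) -> str:
    ...  # placeholder: ask the LLM to revise the draft using the feedback

def bootstrap_pr(task: str, max_rounds: int = 5) -> str:
    """Generate a draft PR, then refine it until the model's own review approves it."""
    pr = generate_pr(task)
    for _ in range(max_rounds):
        verdict, feedback = review_pr(pr)
        if verdict == "Approved":
            return pr  # the review skill has signed off on the draft
        pr = refine_pr(pr, feedback)  # fold the reviewer feedback back in
    return pr  # best effort after max_rounds refinement passes
```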

Benchmark Format

Input to the Model

Models are furnished with inputs akin to those a developer would consider, including:

  • Entire Git History: Offering a lens into the project's evolution and coding standards.
  • Pull Request Title and Description: Providing context and specifics of the proposed changes.
  • Changeset: Detailing the exact additions, deletions, and modifications proposed.
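
For illustration, one data point's input could be bundled as follows; the field names here are hypothetical, not a schema the benchmark mandates:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkInput:
    """One data point as presented to the model (field names are hypothetical)."""
    git_history: str     # e.g. the output of `git log -p`: the project's full evolution
    pr_title: str        # title of the pull request under review
    pr_description: str  # the author's context and specifics for the change
    changeset: str       # unified diff of the proposed additions, deletions, modifications
```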

Expected Output from the Model

The output is a binary decision, with the model delivering a verdict akin to a human reviewer's judgment:

  • Approved: Signifies the PR aligns with project standards.
  • Rejected: Indicates the PR falls short, with reasons for rejection articulated through specific feedback.
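
Correspondingly, the verdict could be captured as below; again a sketch under assumed names, not a prescribed output format:

```python
from dataclasses import dataclass
from typing import Literal

@dataclass
class ReviewVerdict:
    """Binary decision mirroring a human reviewer's judgment (hypothetical schema)."""
    decision: Literal["Approved", "Rejected"]
    feedback: str = ""   # specific reasons, expected when the decision is "Rejected"
```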

How to Contribute

Contributions that extend PullRequestBenchmark are greatly appreciated and play a vital role in this effort. Here's how you can help:

  1. Identify Suitable Repositories: Start by locating repositories that align with the scope of our benchmark.
  2. Find PRs: Within these repositories, search for approved or rejected Pull Requests (PRs) whose outcome demanded genuine review judgment.
  3. Format and Add Evaluation: Follow our guidelines to document the evaluation data point correctly and add it to our benchmark suite.
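
For step 2, one way to pull a candidate PR's title, description, and diff is the GitHub REST API. The endpoints below are real; the owner, repository, and PR number are placeholders:

```python
import requests

# Placeholders: substitute a repository and PR you have identified.
OWNER, REPO, PR_NUMBER = "octocat", "Hello-World", 1

url = f"https://api.github.com/repos/{OWNER}/{REPO}/pulls/{PR_NUMBER}"
meta = requests.get(url, headers={"Accept": "application/vnd.github+json"}).json()
diff = requests.get(url, headers={"Accept": "application/vnd.github.diff"}).text

print(meta["title"])                     # PR title
print(meta["body"])                      # PR description
print(meta["state"], meta["merged_at"])  # "closed" with a merged_at timestamp means accepted
print(diff[:500])                        # start of the changeset
```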

For detailed contribution guidelines, please refer to CONTRIBUTING.md.

Prompt Size and Distribution

The PullRequestBenchmark encompasses a wide range of prompt sizes, reflecting the diversity of real-world software development scenarios. Prompt sizes, measured in tokens, vary by several orders of magnitude, from roughly 125,506 tokens at the smallest to over 49,015,233 at the largest. This spread underscores the benchmark's comprehensive approach to evaluating LLMs on everything from minor code tweaks to substantial feature additions and optimizations.
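
To see where a given prompt falls in this distribution, token counts can be measured with, for example, the tiktoken library; the choice of the cl100k_base encoding is an assumption here, and counts will differ across tokenizers:

```python
import tiktoken

def prompt_tokens(git_history: str, pr_title: str, pr_description: str, changeset: str) -> int:
    """Count tokens in an assembled prompt (cl100k_base is an assumed encoding)."""
    enc = tiktoken.get_encoding("cl100k_base")
    prompt = "\n\n".join([git_history, pr_title, pr_description, changeset])
    return len(enc.encode(prompt))
```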

License

Licensed under the MIT License, PullRequestBenchmark encourages academic and research use, promoting advancements in AI for software development.


PullRequestBenchmark vs. SWE-bench

  • Focus: PullRequestBenchmark evaluates LLMs on reviewing pull requests to ensure they align with project standards. SWE-bench tests LLMs' ability to solve real-world GitHub issues by generating fixes that pass unit tests.

  • Objective: PullRequestBenchmark aims at decision-making in PR reviews, emphasizing quality assurance. SWE-bench focuses on problem-solving, specifically generating solutions for issues.

  • Evaluation: PullRequestBenchmark assesses binary decision-making (approve or reject) on PRs. SWE-bench measures success by whether generated solutions resolve issues, validated through unit tests.

  • Inputs: PullRequestBenchmark evaluates the reviewing of PRs, incorporating the full Git history and PR details to support these assessments. SWE-bench, focusing on code creation, considers only the current state of the repository through Issue-PR pairs.
