87 repos · 6 langs
Interactive summary of experimental results from the paper appendix. Use the toggles to switch tracks and click table headers to sort.
| # | Model | Pass | IFR | Alignment | Precision |
|---|-------|------|-----|-----------|-----------|
CodeTaste instances are mined from real, large-scale refactoring commits. For each instance, we synthesize a detailed refactoring task (Instructed Track) and an underspecified focus-area variant (Open Track), discover OpenGrep static rules that capture the refactoring intent, and build a reproducible execution environment to run the repository test suite, static analysis checks, and agent inference.
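The pipeline above can be pictured as a simple per-instance record. This is an illustrative sketch only; the field names (`repo`, `base_commit`, `instructed_task`, etc.) are assumptions, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Instance:
    """One CodeTaste instance mined from a real refactoring commit.

    Field names are illustrative; see the benchmark repo for the
    actual data format.
    """
    repo: str               # source repository
    base_commit: str        # commit the agent's patch is applied to
    instructed_task: str    # detailed refactoring instruction (Instructed Track)
    open_task: str          # underspecified focus-area variant (Open Track)
    additive_rules: list = field(default_factory=list)   # patterns the golden patch introduces
    reductive_rules: list = field(default_factory=list)  # patterns the golden patch removes

# Hypothetical example instance
inst = Instance(
    repo="example/repo",
    base_commit="abc123",
    instructed_task="Extract the duplicated validation logic into a helper.",
    open_task="Focus area: input validation.",
)
print(inst.repo, len(inst.additive_rules))
```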
At evaluation time, we apply a model-generated patch $\hat{X}$ to the base commit $R$ and score it with both the test suite and the discovered static rules:

- **Correctness ($\textsc{PASS}$):** we rerun the repository test suite after applying $\hat{X}$ and treat $\textsc{PASS}$ as a gate.
- **Instruction Following Rate ($\textsc{IFR}$):** using the discovered OpenGrep rules, we evaluate whether additive patterns (introduced by the golden patch) are present and whether reductive patterns (removed by the golden patch) are absent.
- **Precision ($\textsc{PREC}$):** we measure how much of the patch's added/removed lines are covered by these rules, approximating the fraction of edits related to the intended refactoring.
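The scoring logic above can be sketched as a small function. This is a minimal illustration of the gating and ratio computations, not the official scorer; the argument names and the exact aggregation (e.g. averaging rule checks for $\textsc{IFR}$) are assumptions:

```python
def score_patch(tests_pass, additive_hits, reductive_hits,
                changed_lines, rule_covered_lines):
    """Illustrative scorer (not the official CodeTaste implementation).

    additive_hits / reductive_hits: per-rule booleans -- whether each
    additive pattern is present / each reductive pattern is absent
    after the patch is applied.
    changed_lines / rule_covered_lines: counts of added+removed lines,
    and how many of them the discovered rules cover.
    """
    if not tests_pass:  # PASS acts as a gate: a failing patch scores zero
        return {"pass": False, "ifr": 0.0, "prec": 0.0}
    checks = list(additive_hits) + list(reductive_hits)
    ifr = sum(checks) / len(checks) if checks else 1.0
    prec = rule_covered_lines / changed_lines if changed_lines else 1.0
    return {"pass": True, "ifr": ifr, "prec": prec}

# Hypothetical patch: 3 of 4 rule checks satisfied, 8 of 10 changed
# lines covered by the discovered rules.
result = score_patch(True, [True, True, False], [True], 10, 8)
print(result)  # {'pass': True, 'ifr': 0.75, 'prec': 0.8}
```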
Get started: To evaluate your model/agent on CodeTaste, use the refactoring-benchmark repo and follow the step-by-step guide in docs/benchmarking-your-agent.md. Download the runtime images from TOBEDETERMINED, provide an agent implementation, and run the provided inference + evaluation orchestration to produce leaderboard metrics.
We welcome community contributions to advance the benchmark. Enhancements to instance construction, test suites, and rule discovery & analysis, as well as general feedback, are all welcome. Please visit our GitHub repository for details (issues and pull requests are encouraged).
@article{codetaste2026,
  title={CodeTaste: Can LLMs Generate Human-Level Code Refactorings?},
  author={anonymous},
  year={2026},
  eprint={TBD},
  archivePrefix={arXiv}
}