CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

TL;DR: We introduce CodeTaste, a benchmark for evaluating whether LLM coding agents can execute and discover human-like refactorings in real codebases.

Key Takeaways

  • Discovery Gap: While models reliably execute well-specified instructions (achieving up to 69.6% alignment), they fail to autonomously identify human-aligned refactorings when given only a vague focus area. In these scenarios, even the best agentic systems score as low as 7.7%.
  • Architectural Bottleneck: In the absence of detailed instructions, agents often fail to identify human-aligned, large-scale refactorings. Instead, they resort to lazy shortcuts, underscoring a significant deficit in autonomous architectural judgment.
  • Planning Benefits: Using a "propose-then-implement" strategy can improve architectural thinking and nearly double alignment scores for top models.

CodeTaste Leaderboard

The leaderboard below provides an interactive summary of our experimental results. Use the toggles to switch tracks and click table headers to sort.

Tracks
  • Instructed Track gives the agent a detailed refactoring specification to execute.
  • Open Track gives the agent only a focus area, so the agent must decide what refactoring to do (and implement it).
Open Track Modes
  • Direct: the agent implements the refactoring immediately.
  • Plan: the agent proposes a plan, then implements it in a second run.
  • Multiplan: the agent proposes multiple plans, and an oracle selects the one best aligned with the human refactoring; the agent then implements the selected plan in a second run.
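As a rough illustration of the Multiplan mode, the oracle step can be thought of as picking the proposed plan that best matches the human refactoring. The sketch below uses a hypothetical token-overlap (Jaccard) similarity as the selection criterion; the actual oracle used by the benchmark may differ.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two plan descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def oracle_select(plans: list[str], human_plan: str) -> str:
    """Pick the proposed plan most similar to the human refactoring plan."""
    return max(plans, key=lambda p: jaccard_similarity(p, human_plan))

plans = [
    "rename helper functions for clarity",
    "extract the parsing logic into a dedicated module",
    "delete dead code paths",
]
human = "move parsing logic out into its own module"
best = oracle_select(plans, human)  # the second plan shares the most tokens
```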
Leaderboard columns: # · Model · $\mathcal{A}$ · $\textsc{Pass}$ · $\textsc{IFR}$ · $\textsc{Prec}$
Values represent the mean performance on different metrics as percentages; ± shows the 95% confidence interval.

What is CodeTaste?

Overview

CodeTaste instances are mined from real, large-scale refactoring commits. For each instance, we synthesize a detailed refactoring task (Instructed Track) and an underspecified focus-area variant (Open Track), discover OpenGrep static analysis rules that capture the refactoring intent, and build a reproducible execution environment to run the repository test suite, static analysis checks, and agent inference.

At evaluation time, we apply a model-generated patch to the base commit and score it with both the test suite and the discovered static analysis rules.
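At a high level, this evaluation step can be sketched as below. The function names (`apply_patch`, `run_tests`, `run_rules`) are hypothetical stand-ins for the actual harness, which runs the repository test suite and OpenGrep in the reproducible environment.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    tests_passed: bool      # Pass: repository test suite outcome
    rules_satisfied: float  # IFR: fraction of static analysis checks met

def evaluate_patch(
    base_commit: str,
    patch: str,
    apply_patch: Callable[[str, str], str],  # returns a working-tree path
    run_tests: Callable[[str], bool],
    run_rules: Callable[[str], float],
) -> EvalResult:
    """Apply a model-generated patch to the base commit, then score the
    result with the test suite and the discovered static analysis rules."""
    worktree = apply_patch(base_commit, patch)
    return EvalResult(
        tests_passed=run_tests(worktree),
        rules_satisfied=run_rules(worktree),
    )
```

In the real harness the callables would shell out to `git apply`, the test runner, and OpenGrep; injecting them keeps the scoring logic testable.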

CodeTaste at a glance

  • 100 instances (87 repos · 6 langs)
  • Median files edited: 73
  • Median LoC edited: 1,515
  • Average # rules: 93
  • Average # tests: 1,638

How are tasks evaluated?

Functional Correctness $(\textsc{Pass})$: Checks whether the model's patch preserves functional integrity, using the repository's test suite.

Instruction Following Rate $(\textsc{IFR})$: Measures whether the patch follows the intended refactoring using static analysis checks, covering both patterns that must be introduced and patterns that must be removed.

Change Precision $(\textsc{Prec})$: Measures how well the patch avoids unrelated changes outside the intended refactoring scope. The golden commit reference solutions achieve 57.5%.

Alignment $(\mathcal{A})$: A combined score that only rewards rule compliance when tests are valid (i.e., you should not “follow instructions” by breaking correctness).
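The gating behind $\mathcal{A}$ can be sketched as follows. This is a simplified interpretation (the precise formula is defined in the paper): each instance contributes its rule-compliance score only when its test suite passes, and scores zero otherwise.

```python
def alignment(tests_passed: list[bool], ifr: list[float]) -> float:
    """Per-instance IFR counts toward alignment only when the
    repository test suite passes; failing instances score 0."""
    scores = [i if p else 0.0 for p, i in zip(tests_passed, ifr)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `alignment([True, False, True], [0.9, 1.0, 0.5])` averages 0.9, 0.0, and 0.5: the middle instance follows the rules perfectly but breaks the tests, so it earns nothing.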

How can I evaluate my model on CodeTaste?

To run your own agent on CodeTaste and evaluate its performance, please consult the repository's README for comprehensive instructions.

How can I contribute to CodeTaste?

We welcome community contributions to advance the benchmark. Enhancements to instance construction, test suites, and rule discovery & analysis, as well as general feedback, are all welcome. Please visit our GitHub repository for details (issues and pull requests are encouraged).

Citation

@misc{codetaste2026,
      title={CodeTaste: Can LLMs Generate Human-Level Code Refactorings?}, 
      author={Alex Thillen and Niels Mündler and Veselin Raychev and Martin Vechev},
      year={2026},
      eprint={2603.04177},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2603.04177}, 
}