CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

TL;DR: We introduce CodeTaste, a benchmark for evaluating whether LLM coding agents can execute and discover human-like refactorings in real codebases.

Key Takeaways

  • Discovery Gap: While models reliably execute well-specified instructions (achieving up to 69.6% alignment), they fail to autonomously identify human-aligned refactorings when given only a vague focus area. In these scenarios, even the best agentic systems score as low as 7.7%.
  • Architectural Bottleneck: In the absence of detailed instructions, agents often fail to identify human-aligned, large-scale refactorings. Instead, they resort to lazy shortcuts, underscoring a significant deficit in autonomous architectural judgment.
  • Planning Benefits: Using a "propose-then-implement" strategy can improve architectural thinking and nearly double alignment scores for top models.

CodeTaste Leaderboard

The leaderboard below provides an interactive summary of our experimental results. Use the toggles to switch tracks and click table headers to sort.

Tracks
  • Instructed Track gives the agent a detailed refactoring specification to execute.
  • Open Track gives the agent only a focus area, so the agent must decide what refactoring to do (and implement it).
Open Track Modes
  • Direct: the agent implements the refactoring immediately.
  • Plan: the agent proposes a plan, then implements it in a second run.
  • Multiplan: the agent proposes multiple plans, and an oracle selects the one best aligned with the human refactoring; the agent then implements the selected plan in a second run.
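As a rough illustration of the Multiplan mode, the oracle step can be thought of as picking the proposed plan that best matches the human refactoring. The sketch below uses a hypothetical token-overlap (Jaccard) similarity as the selection criterion; the actual oracle used by the benchmark may differ.

```python
def jaccard_similarity(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two plan descriptions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def oracle_select(plans: list[str], human_plan: str) -> str:
    """Pick the proposed plan most similar to the human refactoring plan."""
    return max(plans, key=lambda p: jaccard_similarity(p, human_plan))

plans = [
    "rename helper functions for clarity",
    "extract the parsing logic into a dedicated module",
    "delete dead code paths",
]
human = "move parsing logic out into its own module"
best = oracle_select(plans, human)  # the second plan shares the most tokens
```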
Leaderboard columns: # · Model · $\mathcal{A}$ · $\textsc{Pass}$ · $\textsc{IFR}$ · $\textsc{Prec}$
Values represent the mean performance on different metrics as percentages; ± shows the 95% confidence interval.

What is CodeTaste?

Overview

CodeTaste instances are mined from real, large-scale refactoring commits. For each instance, we synthesize a detailed refactoring task (Instructed Track) and an underspecified focus-area variant (Open Track), discover OpenGrep static analysis rules that capture the refactoring intent, and build a reproducible execution environment to run the repository test suite, static analysis checks, and agent inference.

At evaluation time, we apply a model-generated patch to the base commit and score it with both the test suite and the discovered static analysis rules.
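At a high level, this evaluation step can be sketched as below. The function names (`apply_patch`, `run_tests`, `run_rules`) are hypothetical stand-ins for the actual harness, which runs the repository test suite and OpenGrep in the reproducible environment.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    tests_passed: bool      # Pass: repository test suite outcome
    rules_satisfied: float  # IFR: fraction of static analysis checks met

def evaluate_patch(
    base_commit: str,
    patch: str,
    apply_patch: Callable[[str, str], str],  # returns a working-tree path
    run_tests: Callable[[str], bool],
    run_rules: Callable[[str], float],
) -> EvalResult:
    """Apply a model-generated patch to the base commit, then score the
    result with the test suite and the discovered static analysis rules."""
    worktree = apply_patch(base_commit, patch)
    return EvalResult(
        tests_passed=run_tests(worktree),
        rules_satisfied=run_rules(worktree),
    )
```

In the real harness the callables would shell out to `git apply`, the test runner, and OpenGrep; injecting them keeps the scoring logic testable.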

CodeTaste at a glance

  • 100 instances (87 repos · 6 langs)
  • Median files edited: 73
  • Median LoC edited: 1,515
  • Average # rules: 93
  • Average # tests: 1,638

How are tasks evaluated?

Functional Correctness $(\textsc{Pass})$: Checks whether the model's patch preserves functional integrity, using the repository's test suite.

Instruction Following Rate $(\textsc{IFR})$: Measures whether the patch follows the intended refactoring using static analysis checks, covering both patterns that must be introduced and patterns that must be removed.

Change Precision $(\textsc{Prec})$: Measures how well the patch avoids unrelated changes outside the intended refactoring scope. The golden commit reference solutions achieve 57.5%.

Alignment $(\mathcal{A})$: A combined score that only rewards rule compliance when tests are valid (i.e., you should not “follow instructions” by breaking correctness).
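The gating behind $\mathcal{A}$ can be sketched as follows. This is a simplified interpretation (the precise formula is defined in the paper): each instance contributes its rule-compliance score only when its test suite passes, and scores zero otherwise.

```python
def alignment(tests_passed: list[bool], ifr: list[float]) -> float:
    """Per-instance IFR counts toward alignment only when the
    repository test suite passes; failing instances score 0."""
    scores = [i if p else 0.0 for p, i in zip(tests_passed, ifr)]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, `alignment([True, False, True], [0.9, 1.0, 0.5])` averages 0.9, 0.0, and 0.5: the middle instance follows the rules perfectly but breaks the tests, so it earns nothing.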

How can I evaluate my model on CodeTaste?

To run your own agent on CodeTaste and evaluate its performance, please consult the repository's README for comprehensive instructions.

How can I contribute to CodeTaste?

We welcome community contributions to advance the benchmark. Enhancements to instance construction, test suites, and rule discovery & analysis, as well as general feedback, are all welcome. Please visit our GitHub repository for details (issues and pull requests are encouraged).

Citation

@misc{codetaste2026,
      title={CodeTaste: Can LLMs Generate Human-Level Code Refactorings?}, 
      author={Alex Thillen and Niels Mündler and Veselin Raychev and Martin Vechev},
      year={2026},
      eprint={2603.04177},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2603.04177}, 
}