CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Anonymous Author(s), Anonymous Institution
TL;DR: We introduce CodeTaste, a benchmark for evaluating whether LLM coding agents can execute and discover human-like refactorings in real codebases.

Key Takeaways

  • Discovery Gap: Models execute well-specified instructions (up to 70.1% alignment) but fail to autonomously identify human-preferred refactorings when given only a vague focus area, scoring as low as 7.8%.
  • Planning Benefits: Using a "propose-then-implement" strategy can nearly double alignment scores for top models.
  • Architectural Bottleneck: In the absence of detailed instructions, agents often fail to identify human-aligned, large-scale refactorings. Instead, they resort to lazy shortcuts, underscoring a significant deficit in autonomous architectural judgment.

CodeTaste Leaderboard

Summary of experimental results from the paper appendix, reported separately for the Instructed and Open tracks.

Leaderboard columns: Model, Pass, IFR, Alignment, Precision. Values are percentages; brackets show 95% confidence intervals when available.

What is CodeTaste?

Figure: Overview of the CodeTaste benchmark pipeline: (1) commit discovery, (2) task generation, (3) rule generation, (4) build environment, (5) inference, and (6) evaluation with tests + static rules.

CodeTaste instances are mined from real, large-scale refactoring commits. For each instance, we synthesize a detailed refactoring task (Instructed Track) and an underspecified focus-area variant (Open Track), discover OpenGrep static rules that capture the refactoring intent, and build a reproducible execution environment to run the repository test suite, static analysis checks, and agent inference.
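For intuition, each discovered rule pins down one concrete pattern of the refactoring. The sketch below shows a Semgrep-style rule (the schema OpenGrep consumes), expressed as a Python dict; the rule id, message, and pattern are invented for illustration and are not taken from the benchmark.

# Illustrative only: a rule flagging a call that the golden patch removed,
# i.e. a "reductive" pattern that should be absent after the refactoring.
rule = {
    "id": "no-legacy-logger",              # invented id
    "languages": ["python"],
    "severity": "WARNING",
    "message": "use the new structured logger instead of legacy_logger",
    "pattern": "legacy_logger.log(...)",   # Semgrep pattern syntax
}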

At evaluation time, we apply a model-generated patch $\hat{X}$ to the base commit $R$ and score it with both the test suite and the discovered static rules.
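Concretely, this step looks roughly like the sketch below. It is a minimal illustration, not the benchmark harness: the test command, the opengrep invocation, and all paths are assumptions.

import subprocess

def evaluate_patch(repo_dir: str, patch_path: str, rules_path: str) -> dict:
    # Apply the candidate patch X_hat on top of the base commit R.
    subprocess.run(["git", "apply", patch_path], cwd=repo_dir, check=True)

    # PASS: rerun the repository test suite on the patched tree.
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    passed = tests.returncode == 0

    # Run the discovered static rules over the patched tree
    # (the opengrep flags here are an assumption).
    scan = subprocess.run(
        ["opengrep", "scan", "--config", rules_path, "--json", "."],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return {"pass": passed, "scan_output": scan.stdout}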

How are tasks evaluated?

Figure: Sketch of the evaluation metrics: functional correctness ($\textsc{PASS}$), rule-based instruction following ($\textsc{IFR}$), change precision ($\textsc{PREC}$), and the combined alignment score $\mathcal{A} = \textsc{PASS} \times \textsc{IFR}$.

Correctness ($\textsc{PASS}$): we rerun the repository test suite after applying $\hat{X}$; $\textsc{PASS}$ acts as a gate, so a patch that breaks the tests receives zero alignment.

Instruction Following Rate ($\textsc{IFR}$): we evaluate whether additive patterns (introduced by the golden patch) are present and whether reductive patterns (removed by the golden patch) are absent using discovered OpenGrep rules.
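A minimal sketch of how this could be scored, assuming each rule is tagged additive or reductive and that we know which rule ids fired on the patched tree; the uniform weighting over rules is our assumption.

def instruction_following_rate(fired: set[str],
                               additive: set[str],
                               reductive: set[str]) -> float:
    # Additive patterns should now be present; reductive ones absent.
    satisfied = sum(r in fired for r in additive) \
              + sum(r not in fired for r in reductive)
    total = len(additive) + len(reductive)
    return satisfied / total if total else 1.0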

Precision ($\textsc{PREC}$): we measure what fraction of the patch’s added/removed lines is covered by these rules, approximating the share of edits that are related to the intended refactoring.
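Putting the pieces together, a sketch of the remaining scores; the inputs are illustrative (line counts would come from the diff and the rule matches).

def precision(rule_covered_lines: int, patch_lines: int) -> float:
    # PREC: fraction of the patch's added/removed lines covered by rules.
    return rule_covered_lines / patch_lines if patch_lines else 1.0

def alignment(passed: bool, ifr: float) -> float:
    # A = PASS x IFR; PASS gates the score, so failing tests yields 0.
    return float(passed) * ifr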

CodeTaste at a glance

  • 100 instances (87 repos, 6 languages)
  • Median files edited: 73
  • Median LoC edited: 1,515
  • Average # rules: 93
  • Average # tests: 1,638

How can I use CodeTaste?

Get started: To evaluate your model/agent on CodeTaste, use the refactoring-benchmark repo and follow the step-by-step guide in docs/benchmarking-your-agent.md. Download the runtime images from TOBEDETERMINED, provide an agent implementation, and run the provided inference + evaluation orchestration to produce leaderboard metrics.
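For orientation only, an agent implementation is roughly a callable that receives a task and a repository checkout and produces edits. The stub below is hypothetical; the real interface is specified in docs/benchmarking-your-agent.md.

class MyAgent:
    # Hypothetical stub; the class name, method name, and signature
    # are assumptions, not the benchmark's actual interface.
    def run(self, task: str, repo_dir: str) -> str:
        # Read the task (detailed instructions on the Instructed Track,
        # or only a vague focus area on the Open Track), edit files in
        # repo_dir, and return the resulting patch as a unified diff.
        raise NotImplementedError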

How can I contribute to CodeTaste?

We welcome community contributions to advance the benchmark: enhancements to instance construction, test suites, and rule discovery & analysis, as well as general feedback. Please visit our GitHub repository for details (issues and pull requests are encouraged).

Citation

@article{codetaste2026,
  title={CodeTaste: Can LLMs Generate Human-Level Code Refactorings?},
  author={anonymous},
  year={2026},
  eprint={TBD},
  archivePrefix={arXiv}
}