CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

Anonymous Author(s), Anonymous Institution
TL;DR: We introduce CodeTaste, a benchmark for evaluating whether LLM coding agents can execute and discover human-like refactorings in real codebases.

Key Takeaways

  • Discovery Gap: Models execute well-specified instructions (up to 70.1% alignment) but fail to autonomously identify human-preferred refactorings when given only a vague focus area, scoring as low as 7.8%.
  • Planning Benefits: Using a "propose-then-implement" strategy can nearly double alignment scores for top models.
  • Architectural Bottleneck: In the absence of detailed instructions, agents often fail to identify human-aligned, large-scale refactorings. Instead, they resort to lazy shortcuts, underscoring a significant deficit in autonomous architectural judgment.

CodeTaste Leaderboard

Summary of experimental results from the paper appendix, reported separately for the Instructed and Open tracks.

Leaderboard columns: Model, Pass, IFR, Alignment, Precision. Values are percentages; brackets show 95% confidence intervals when available.

What is CodeTaste?

Figure: Overview of the CodeTaste benchmark pipeline: (1) commit discovery, (2) task generation, (3) rule generation, (4) build environment, (5) inference, and (6) evaluation with tests + static rules.

CodeTaste instances are mined from real, large-scale refactoring commits. For each instance, we synthesize a detailed refactoring task (Instructed Track) and an underspecified focus-area variant (Open Track), discover OpenGrep static rules that capture the refactoring intent, and build a reproducible execution environment to run the repository test suite, static analysis checks, and agent inference.
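For intuition, each discovered rule pins down one concrete pattern of the refactoring. The sketch below shows a Semgrep-style rule (the schema OpenGrep consumes), expressed as a Python dict; the rule id, message, and pattern are invented for illustration and are not taken from the benchmark.

# Illustrative only: a rule flagging a call that the golden patch removed,
# i.e. a "reductive" pattern that should be absent after the refactoring.
rule = {
    "id": "no-legacy-logger",              # invented id
    "languages": ["python"],
    "severity": "WARNING",
    "message": "use the new structured logger instead of legacy_logger",
    "pattern": "legacy_logger.log(...)",   # Semgrep pattern syntax
}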

At evaluation time, we apply a model-generated patch $\hat{X}$ to the base commit $R$ and score it with both the test suite and the discovered static rules.
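Concretely, this step looks roughly like the sketch below. It is a minimal illustration, not the benchmark harness: the test command, the opengrep invocation, and all paths are assumptions.

import subprocess

def evaluate_patch(repo_dir: str, patch_path: str, rules_path: str) -> dict:
    # Apply the candidate patch X_hat on top of the base commit R.
    subprocess.run(["git", "apply", patch_path], cwd=repo_dir, check=True)

    # PASS: rerun the repository test suite on the patched tree.
    tests = subprocess.run(["pytest", "-q"], cwd=repo_dir)
    passed = tests.returncode == 0

    # Run the discovered static rules over the patched tree
    # (the opengrep flags here are an assumption).
    scan = subprocess.run(
        ["opengrep", "scan", "--config", rules_path, "--json", "."],
        cwd=repo_dir, capture_output=True, text=True,
    )
    return {"pass": passed, "scan_output": scan.stdout}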

How are tasks evaluated?

Figure: Sketch of the evaluation metrics: functional correctness ($\textsc{PASS}$), rule-based instruction following ($\textsc{IFR}$), change precision ($\textsc{PREC}$), and the combined alignment score $\mathcal{A} = \textsc{PASS} \times \textsc{IFR}$.

Correctness ($\textsc{PASS}$): we rerun the repository test suite after applying $\hat{X}$; $\textsc{PASS}$ acts as a gate, so a patch that breaks the tests receives zero alignment.

Instruction Following Rate ($\textsc{IFR}$): we evaluate whether additive patterns (introduced by the golden patch) are present and whether reductive patterns (removed by the golden patch) are absent using discovered OpenGrep rules.
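A minimal sketch of how this could be scored, assuming each rule is tagged additive or reductive and that we know which rule ids fired on the patched tree; the uniform weighting over rules is our assumption.

def instruction_following_rate(fired: set[str],
                               additive: set[str],
                               reductive: set[str]) -> float:
    # Additive patterns should now be present; reductive ones absent.
    satisfied = sum(r in fired for r in additive) \
              + sum(r not in fired for r in reductive)
    total = len(additive) + len(reductive)
    return satisfied / total if total else 1.0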

Precision ($\textsc{PREC}$): we measure what fraction of the patch’s added/removed lines is covered by these rules, approximating the share of edits that are related to the intended refactoring.
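Putting the pieces together, a sketch of the remaining scores; the inputs are illustrative (line counts would come from the diff and the rule matches).

def precision(rule_covered_lines: int, patch_lines: int) -> float:
    # PREC: fraction of the patch's added/removed lines covered by rules.
    return rule_covered_lines / patch_lines if patch_lines else 1.0

def alignment(passed: bool, ifr: float) -> float:
    # A = PASS x IFR; PASS gates the score, so failing tests yields 0.
    return float(passed) * ifr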

CodeTaste at a glance

  • 100 instances (87 repos, 6 languages)
  • Median files edited: 73
  • Median LoC edited: 1,515
  • Average # rules: 93
  • Average # tests: 1,638

How can I use CodeTaste?

Get started: To evaluate your model/agent on CodeTaste, use the refactoring-benchmark repo and follow the step-by-step guide in docs/benchmarking-your-agent.md. Download the runtime images from TOBEDETERMINED, provide an agent implementation, and run the provided inference + evaluation orchestration to produce leaderboard metrics.
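For orientation only, an agent implementation is roughly a callable that receives a task and a repository checkout and produces edits. The stub below is hypothetical; the real interface is specified in docs/benchmarking-your-agent.md.

class MyAgent:
    # Hypothetical stub; the class name, method name, and signature
    # are assumptions, not the benchmark's actual interface.
    def run(self, task: str, repo_dir: str) -> str:
        # Read the task (detailed instructions on the Instructed Track,
        # or only a vague focus area on the Open Track), edit files in
        # repo_dir, and return the resulting patch as a unified diff.
        raise NotImplementedError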

How can I contribute to CodeTaste?

We welcome community contributions to advance the benchmark: enhancements to instance construction, test suites, and rule discovery & analysis, as well as general feedback. Please visit our GitHub repository for details (issues and pull requests are encouraged).

Citation

@article{codetaste2026,
  title={CodeTaste: Can LLMs Generate Human-Level Code Refactorings?},
  author={anonymous},
  year={2026},
  eprint={TBD},
  archivePrefix={arXiv}
}