RepoZero Leaderboard

Evaluating LLM Agents on Repository-Level Generation from Scratch (OpenHands-bash Baseline)

📦 Download the RepoZero dataset to reproduce results or run your own evaluations.

📊 Performance Metrics (Pass Rate %)

Model              Average  Easy   Medium  Hard
Claude-4.6-Sonnet  51.46    59.69  49.31   45.67
DeepSeek V4 Pro    39.78    48.84  38.19   32.28
GLM-5.1            34.19    46.51  38.89   16.54
Kimi-K2.6          33.25    47.29  27.78   25.20
Kimi-K2.5          31.30    36.36  33.02   24.74
DeepSeek V4 Flash  30.74    38.76  32.64   20.47
DeepSeek V3.2      29.01    37.21  31.25   18.11
GLM-5              27.46    37.21  22.05   23.61
Minimax-M2.7       27.15    34.88  29.17   16.54
DeepSeek V3.1      26.08    34.11  25.69   18.11
Minimax-M2.5       22.72    29.97  22.23   18.89
Ernie-5.0          19.46    27.91  18.75   11.81

Model              Average  Easy   Medium  Hard
Claude-4.6-Sonnet  44.60    53.33  45.71   36.23
Kimi-K2.6          38.51    50.82  30.00   36.23
DeepSeek V4 Pro    35.53    40.98  32.86   33.33
DeepSeek V4 Flash  35.41    40.98  34.92   31.88
GLM-5.1            35.05    45.90  31.43   28.99
DeepSeek V3.2      33.55    48.33  28.57   26.09
Kimi-K2.5          33.00    40.98  30.00   28.99
DeepSeek V3.1      32.49    43.33  30.00   26.09
Minimax-M2.7       32.12    37.70  25.71   33.33
Minimax-M2.5       28.48    36.07  18.57   31.88
GLM-5              23.47    31.67  20.00   20.29
Ernie-5.0          13.97    22.95  12.86   4.35

About RepoZero

RepoZero is the first benchmark enabling fully automated, execution-based verification of repository-level generation from scratch. Unlike existing benchmarks that focus on patch editing or rely on subjective LLM-as-judge metrics, RepoZero reformulates generation as repository reproduction. Agents must re-implement a target repository based solely on API specifications to match the original behavior, verified through output equivalence. (Well, the first, unless you count the concurrent work ProgramBench, in which case we are a very proud second. 🥈)
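To make "verified through output equivalence" concrete, here is a minimal Python sketch of the general idea: run a reference program and a re-implementation on the same inputs, compare what they print, and report the share of matching cases as a pass rate. This is an illustration under stated assumptions, not the RepoZero harness. The names `run_program`, `check_equivalence`, `pass_rate` and `CaseResult` are hypothetical, and comparing exit code plus stdout is an assumed notion of equivalence.

```python
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class CaseResult:
    args: list[str]   # command-line arguments used for this test case
    passed: bool      # True if candidate output matched the reference


def run_program(command: list[str], args: list[str], timeout: float = 10.0) -> tuple[int, str]:
    """Run one program and return (exit code, captured stdout)."""
    proc = subprocess.run(
        command + args,
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout


def check_equivalence(
    reference_cmd: list[str],
    candidate_cmd: list[str],
    cases: list[list[str]],
) -> list[CaseResult]:
    """Compare candidate and reference behavior case by case (assumed criterion)."""
    results = []
    for args in cases:
        expected = run_program(reference_cmd, args)
        try:
            actual = run_program(candidate_cmd, args)
        except subprocess.TimeoutExpired:
            actual = None  # a hanging candidate counts as a failure
        results.append(CaseResult(args=args, passed=(actual == expected)))
    return results


def pass_rate(results: list[CaseResult]) -> float:
    """Percentage of cases where the candidate matched the reference."""
    if not results:
        return 0.0
    return 100.0 * sum(r.passed for r in results) / len(results)


if __name__ == "__main__":
    # Toy stand-ins for a target repository and an agent's re-implementation.
    reference = [sys.executable, "-c", "import sys; print(int(sys.argv[1]) * 2)"]
    candidate = [sys.executable, "-c", "import sys; n = int(sys.argv[1]); print(n + n)"]
    outcome = check_equivalence(reference, candidate, [["3"], ["0"], ["7"]])
    print(f"pass rate: {pass_rate(outcome):.2f}%")
```

The candidate never sees the reference source here, only its observable behavior, which is the property that allows execution-based checking without an LLM judge.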

Technical Q&A

1. How is cheating prevented?

2. Is test reliability guaranteed?

3. How does the benchmark scale?

4. Difference from ProgramBench?

5. How is data contamination mitigated?

6. How far behind ProgramBench are we?

📎 Citation

@misc{zhang2026repozerollmsgeneratecode,
  title={RepoZero: Can LLMs Generate a Code Repository from Scratch?},
  author={Zhaoxi Zhang and Yiming Xu and Weikang Li and Jiahui Liang and Yunfang Wu},
  year={2026},
  eprint={2605.07122},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2605.07122},
}