RepoZero Leaderboard

Evaluating LLM Agents on Repository-Level Generation from Scratch (OpenHands-bash Baseline)

📊 Performance Metrics (Pass Rate %)

Model	Average	Easy	Medium	Hard
Claude-4.6-Sonnet	51.46	59.69	49.31	45.67
DeepSeek V4 Pro	39.78	48.84	38.19	32.28
GLM-5.1	34.19	46.51	38.89	16.54
Kimi-K2.6	33.25	47.29	27.78	25.20
Kimi-K2.5	31.30	36.36	33.02	24.74
DeepSeek V4 Flash	30.74	38.76	32.64	20.47
DeepSeek V3.2	29.01	37.21	31.25	18.11
GLM-5	27.46	37.21	22.05	23.61
Minimax-M2.7	27.15	34.88	29.17	16.54
DeepSeek V3.1	26.08	34.11	25.69	18.11
Minimax-M2.5	22.72	29.97	22.23	18.89
Ernie-5.0	19.46	27.91	18.75	11.81

Model	Average	Easy	Medium	Hard
Claude-4.6-Sonnet	44.60	53.33	45.71	36.23
Kimi-K2.6	38.51	50.82	30.00	36.23
DeepSeek V4 Pro	35.53	40.98	32.86	33.33
DeepSeek V4 Flash	35.41	40.98	34.92	31.88
GLM-5.1	35.05	45.90	31.43	28.99
DeepSeek V3.2	33.55	48.33	28.57	26.09
Kimi-K2.5	33.00	40.98	30.00	28.99
DeepSeek V3.1	32.49	43.33	30.00	26.09
Minimax-M2.7	32.12	37.70	25.71	33.33
Minimax-M2.5	28.48	36.07	18.57	31.88
GLM-5	23.47	31.67	20.00	20.29
Ernie-5.0	13.97	22.95	12.86	4.35

📝 About RepoZero

RepoZero is the first benchmark enabling fully automated, execution-based verification of repository-level generation from scratch. Unlike existing benchmarks that focus on patch editing or rely on subjective LLM-as-judge metrics, RepoZero reformulates generation as repository reproduction. Agents must re-implement a target repository based solely on its API specifications, and the result is verified by checking that its output is equivalent to the original's. (Well, the first, unless you count the concurrent work ProgramBench, in which case we are a very proud second. 🥈)

🔍 Technical Q&A

1. How is cheating prevented?

Cheating is mitigated through a multi-layered protocol: environment isolation within restricted Docker containers, strict limits on system-level commands, and a ban on importing external site packages. Regex-based checks enforce cross-language consistency by prohibiting agents from delegating work through shell wrappers.
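
As a rough illustration of what a regex-based delegation check could look like, here is a minimal sketch for Python submissions. The pattern list and the name `find_delegation` are hypothetical and are not taken from the RepoZero harness; a real check would need patterns for each target language.

```python
import re

# Hypothetical patterns (assumptions, not RepoZero's actual rules): each one
# flags a common way for Python code to hand work to an external process.
DELEGATION_PATTERNS = [
    re.compile(r"^\s*(?:import|from)\s+subprocess\b", re.MULTILINE),
    re.compile(r"\bos\.(?:system|popen)\s*\("),
]


def find_delegation(source: str) -> list[str]:
    """Return the matched text of every construct that looks like delegation."""
    hits = []
    for pattern in DELEGATION_PATTERNS:
        hits.extend(m.group(0).strip() for m in pattern.finditer(source))
    return hits


if __name__ == "__main__":
    wrapped = "import subprocess\nsubprocess.run(['node', 'original.js'])\n"
    print(find_delegation(wrapped))
```

A scan like this is only a heuristic: it catches obvious wrappers, which is presumably why it is paired with container isolation and command restrictions.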

2. Is test reliability guaranteed?

Yes. RepoZero employs a black-box evaluation framework. Only four white-box test cases are provided for guidance, while the full evaluation suite is hidden from the agent's file system and configured as read-only. Success requires byte-for-byte output matching with the deterministic source repository.
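
Byte-for-byte matching can be sketched as follows, assuming each test case is a command line run once against the source repository and once against the agent's re-implementation. The function name `outputs_match`, the choice to compare only stdout, and the timeout value are illustrative assumptions rather than details of the actual evaluator.

```python
import subprocess
import sys


def outputs_match(reference_cmd, candidate_cmd, stdin_data=b"", timeout=10.0):
    """Run both commands on the same input and compare raw stdout bytes.

    Assumption: only stdout is compared; exit codes and stderr are ignored.
    """
    ref = subprocess.run(reference_cmd, input=stdin_data,
                         capture_output=True, timeout=timeout)
    cand = subprocess.run(candidate_cmd, input=stdin_data,
                          capture_output=True, timeout=timeout)
    # No normalization of whitespace or line endings: any differing byte fails.
    return ref.stdout == cand.stdout


if __name__ == "__main__":
    cmd = [sys.executable, "-c", "print('hello')"]
    print(outputs_match(cmd, cmd))
```

Comparing raw bytes rather than decoded text means a single extra trailing space counts as a failure, which is the strictness the answer above describes.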

3. How does the benchmark scale?

Scalability is a core feature. Human intervention is limited to the initial curation of open-source repositories based on criteria such as determinism and complexity. Once the repositories are selected, test-file generation, test-case generation, and ground-truth filtering are fully automated.
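
One plausible form of automated ground-truth filtering is to run the reference implementation several times per test case and keep only cases whose output never changes. The sketch below assumes that approach; `filter_deterministic`, `run_reference`, and the default of three repeats are hypothetical names and values, not the documented pipeline.

```python
from typing import Callable, Iterable


def filter_deterministic(
    cases: Iterable[str],
    run_reference: Callable[[str], bytes],
    repeats: int = 3,
) -> dict[str, bytes]:
    """Keep test cases whose reference output is identical on every run.

    `run_reference` is a hypothetical hook that executes the source
    repository on one test case and returns its raw output.
    """
    kept = {}
    for case in cases:
        outputs = {run_reference(case) for _ in range(repeats)}
        if len(outputs) == 1:  # same bytes every time, so usable as ground truth
            kept[case] = outputs.pop()
    return kept
```

Cases that print timestamps, random values, or memory addresses would be dropped by such a filter, since no re-implementation could match them byte-for-byte.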

4. Difference from ProgramBench?

While benchmarks like ProgramBench might target high algorithmic difficulty, RepoZero focuses on the structural complexity of full repositories. By testing cross-file modular reasoning and architectural coherence, it discriminates clearly between models and exposes a significant gap between current model capabilities and real-world needs.

5. How is data contamination mitigated?

Data contamination is mitigated by evaluating agents in a different target programming language than the one used in training-time source repositories. Since the agent must re-implement functionality in a language that is distinct from the original, memorized source code cannot be directly applied. This cross-language reformulation substantially reduces the risk of leakage while preserving the difficulty of the task.

6. How far behind ProgramBench are we?

Exactly one day. We submitted to arXiv on May 5th — just 24 hours after ProgramBench. The good news: in the grand race to benchmark repository-level code generation, China is only trailing the US by a single day. The bad news: it's still a loss. We choose to interpret this as remarkable synchronicity rather than being scooped, and we stand by that interpretation. 🌏➡️🌎 (T-24h)

📎 Citation

@misc{zhang2026repozerollmsgeneratecode,
  title={RepoZero: Can LLMs Generate a Code Repository from Scratch?},
  author={Zhaoxi Zhang and Yiming Xu and Weikang Li and Jiahui Liang and Yunfang Wu},
  year={2026},
  eprint={2605.07122},
  archivePrefix={arXiv},
  primaryClass={cs.SE},
  url={https://arxiv.org/abs/2605.07122},
}
