Evaluating LLM Agents on Repository-Level Generation from Scratch (OpenHands-bash Baseline)
| Model | Average | Easy | Medium | Hard |
|---|---|---|---|---|
| Claude-4.6-Sonnet | 51.46 | 59.69 | 49.31 | 45.67 |
| DeepSeek V4 Pro | 39.78 | 48.84 | 38.19 | 32.28 |
| GLM-5.1 | 34.19 | 46.51 | 38.89 | 16.54 |
| Kimi-K2.6 | 33.25 | 47.29 | 27.78 | 25.20 |
| Kimi-K2.5 | 31.30 | 36.36 | 33.02 | 24.74 |
| DeepSeek V4 Flash | 30.74 | 38.76 | 32.64 | 20.47 |
| DeepSeek V3.2 | 29.01 | 37.21 | 31.25 | 18.11 |
| GLM-5 | 27.46 | 37.21 | 22.05 | 23.61 |
| Minimax-M2.7 | 27.15 | 34.88 | 29.17 | 16.54 |
| DeepSeek V3.1 | 26.08 | 34.11 | 25.69 | 18.11 |
| Minimax-M2.5 | 22.72 | 29.97 | 22.23 | 18.89 |
| Ernie-5.0 | 19.46 | 27.91 | 18.75 | 11.81 |

| Model | Average | Easy | Medium | Hard |
|---|---|---|---|---|
| Claude-4.6-Sonnet | 44.60 | 53.33 | 45.71 | 36.23 |
| Kimi-K2.6 | 38.51 | 50.82 | 30.00 | 36.23 |
| DeepSeek V4 Pro | 35.53 | 40.98 | 32.86 | 33.33 |
| DeepSeek V4 Flash | 35.41 | 40.98 | 34.92 | 31.88 |
| GLM-5.1 | 35.05 | 45.90 | 31.43 | 28.99 |
| DeepSeek V3.2 | 33.55 | 48.33 | 28.57 | 26.09 |
| Kimi-K2.5 | 33.00 | 40.98 | 30.00 | 28.99 |
| DeepSeek V3.1 | 32.49 | 43.33 | 30.00 | 26.09 |
| Minimax-M2.7 | 32.12 | 37.70 | 25.71 | 33.33 |
| Minimax-M2.5 | 28.48 | 36.07 | 18.57 | 31.88 |
| GLM-5 | 23.47 | 31.67 | 20.00 | 20.29 |
| Ernie-5.0 | 13.97 | 22.95 | 12.86 | 4.35 |
Cheating is mitigated through a multi-layered protocol: environment isolation within restricted Docker containers, stringent system-level command limitations, and a strict ban on importing external site packages. Cross-language consistency is enforced with regex-based checks that prohibit delegation through shell wrappers.
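A regex-based delegation check of this kind could look roughly like the sketch below. The pattern list and the name `find_delegation` are illustrative assumptions; the actual RepoZero rule set is not described here.

```python
import re

# Hypothetical patterns for calls that spawn another process or interpreter.
# These are illustrative only, not the benchmark's real rules.
DELEGATION_PATTERNS = [
    re.compile(r"\bsubprocess\.(run|Popen|call|check_output)\b"),
    re.compile(r"\bos\.(system|popen)\b"),
    re.compile(r"\bchild_process\b"),
    re.compile(r"\bRuntime\.getRuntime\(\)\.exec\b"),
]


def find_delegation(source: str) -> list[str]:
    """Return the stripped lines of `source` that match any delegation pattern."""
    hits = []
    for line in source.splitlines():
        if any(pattern.search(line) for pattern in DELEGATION_PATTERNS):
            hits.append(line.strip())
    return hits
```

A submission would be flagged when `find_delegation` returns a non-empty list for any of its files. Plain regex matching is easy to evade with string concatenation or aliasing, which is presumably why it is only one layer alongside container and command restrictions.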
Yes. RepoZero employs a black-box evaluation framework. Only four white-box test cases are provided for guidance, while the full evaluation suite is hidden from the agent's file system and configured as read-only. A test case passes only if the agent's output matches the output of the deterministic source repository byte for byte.
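As a rough illustration of byte-for-byte matching, the sketch below runs a reference command and a candidate command on the same input and compares raw stdout without any normalization. The function names and the choice to compare only stdout are assumptions made for this example, not the benchmark's actual harness.

```python
import subprocess
import sys


def run_case(cmd: list[str], stdin_data: bytes = b"", timeout: float = 30.0) -> bytes:
    """Run `cmd` with the given stdin and return its raw stdout bytes."""
    result = subprocess.run(
        cmd, input=stdin_data, capture_output=True, timeout=timeout
    )
    return result.stdout


def matches_reference(reference_cmd: list[str], candidate_cmd: list[str],
                      stdin_data: bytes = b"") -> bool:
    """True only if both commands produce identical stdout bytes."""
    # No whitespace stripping or decoding: a single extra byte is a failure.
    return run_case(reference_cmd, stdin_data) == run_case(candidate_cmd, stdin_data)
```

For example, `matches_reference([sys.executable, "ref.py"], ["node", "impl.js"])` would compare a Python reference against a JavaScript reimplementation; both file names are hypothetical.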
Scalability is a core feature. Human intervention is limited to the initial curation of open-source repositories based on criteria like determinism and complexity. Once repositories are selected, the generation of test files, test cases, and the filtering of ground truth are fully automated.
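One plausible way to automate ground-truth filtering is to run each candidate test input against the source repository several times and keep only the inputs whose output never changes. The sketch below assumes that approach; `filter_deterministic` and the `run` callable are hypothetical names, and the real pipeline may filter on other criteria.

```python
from typing import Callable, Hashable, Iterable


def filter_deterministic(
    inputs: Iterable[Hashable],
    run: Callable[[Hashable], bytes],
    repeats: int = 3,
) -> list[tuple[Hashable, bytes]]:
    """Keep (input, output) pairs whose output is identical across `repeats` runs."""
    kept = []
    for case in inputs:
        # Collect the distinct outputs seen for this input.
        outputs = {run(case) for _ in range(repeats)}
        if len(outputs) == 1:
            kept.append((case, outputs.pop()))
    return kept
```

Here `run` stands in for whatever executes the source repository on one input, so the same filter works regardless of the repository's language.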
While benchmarks like ProgramBench might target high algorithmic difficulty, RepoZero focuses on the structural complexity of full repositories. It provides clear differentiation (discriminative power) by testing cross-file modular reasoning and architectural coherence, showing a significant gap between current model capabilities and real-world needs.
Data contamination is mitigated by evaluating agents in a different target programming language than the one used in training-time source repositories. Since the agent must re-implement functionality in a language that is distinct from the original, memorized source code cannot be directly applied. This cross-language reformulation substantially reduces the risk of leakage while preserving the difficulty of the task.
Exactly one day. We submitted to arXiv on May 5th, just 24 hours after ProgramBench. The good news: in the grand race to benchmark repository-level code generation, China is only trailing the US by a single day. The bad news: it's still a loss. We choose to interpret this as remarkable synchronicity rather than being scooped, and we stand by that interpretation. (T-24h)