The first benchmark that measures how well LLMs improve the development experience, by testing an LLM's ability to do something we all hate: writing tests.
LLMs claim they can code, but can they really?
Existing benchmarks test algorithmic puzzles and toy problems. But real development is about understanding business logic, writing maintainable tests, and creating code that actually works in production.
DXBench tests what matters: Can your LLM generate comprehensive, working unit tests for real functions? Can it understand edge cases? Does it write code that passes in a real environment?
Test LLM-generated code safely in isolated Docker containers. No risk to your system, complete isolation from the real world.
Tests real-world scenarios: importing modules, handling edge cases, writing maintainable test code that developers actually need.
A simple Bot interface works with any LLM API. Get started in minutes and integrate effortlessly with your existing ML pipelines.
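To give a concrete picture of what such an adapter might look like, here is a minimal sketch. The class and method names (`MyBot`, `generate_tests`) are illustrative assumptions, not DXBench's actual API, and the OpenAI SDK is used only as one example backend.

```python
# Hypothetical Bot adapter: class and method names are illustrative,
# not DXBench's actual interface. Uses the OpenAI SDK as one example backend.
from openai import OpenAI


class MyBot:
    """Wraps an LLM API behind a single method that returns test code."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate_tests(self, function_source: str) -> str:
        """Given a Python function's source, return a pytest test module."""
        prompt = (
            "Write pytest unit tests for the following function. "
            "Include all imports and cover edge cases.\n\n" + function_source
        )
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```

Any other provider works the same way: the only contract is source code in, test code out.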
DXBench gives your LLM real Python functions to analyze and understand.
Your LLM writes comprehensive unit tests with proper imports and edge cases.
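For illustration, a task of that shape might pair a small function under test with the kind of pytest module an LLM is expected to produce. Both snippets below are hypothetical examples, not actual DXBench tasks.

```python
# Hypothetical function under test (illustrative, not an actual DXBench task).
def safe_divide(a: float, b: float) -> float:
    """Divide a by b, raising ValueError on a zero divisor."""
    if b == 0:
        raise ValueError("division by zero")
    return a / b


# The kind of test module an LLM should produce: correct imports,
# normal cases, and edge cases. (In a real run the tests would import
# the target function from its module.)
import pytest


def test_safe_divide_basic():
    assert safe_divide(10, 2) == 5


def test_safe_divide_negative():
    assert safe_divide(-9, 3) == -3


def test_safe_divide_zero_divisor_raises():
    with pytest.raises(ValueError):
        safe_divide(1, 0)
```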
Tests run safely in isolated Docker containers with timeout protection.
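Conceptually, that step amounts to running the generated tests in a throwaway container with a hard time limit. Below is a rough sketch of the idea, not the actual DXBench harness, assuming a prebuilt image (here called `dxbench-runner`, a hypothetical name) with pytest preinstalled.

```python
# Rough sketch of sandboxed test execution, not the actual DXBench harness.
# Assumes "dxbench-runner" is a locally built image with pytest preinstalled.
import subprocess


def run_tests_in_docker(workdir: str, timeout_seconds: int = 60) -> bool:
    """Run pytest in a disposable container; kill it if it exceeds the timeout."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",        # no network access from the sandbox
        "-v", f"{workdir}:/app:ro",  # mount the test code read-only
        "-w", "/app",
        "dxbench-runner",            # hypothetical image name
        "python", "-m", "pytest", "-q", "-p", "no:cacheprovider",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```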
Get detailed metrics on accuracy, pass rates, and specific failure modes.
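The shape of such a report could look something like the following; the field names and numbers are purely illustrative placeholders, not DXBench's actual output or real results.

```python
# Illustrative shape of a benchmark report; fields and values are hypothetical,
# not DXBench's actual output format.
report = {
    "tasks": 120,
    "tests_generated": 118,
    "pass_rate": 0.74,           # fraction of generated test suites that pass
    "failure_modes": {
        "import_error": 9,       # wrong or missing imports
        "assertion_error": 14,   # tests that contradict the function's behavior
        "syntax_error": 3,
        "timeout": 5,
    },
}
```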
Stop guessing. Start measuring. See how well your LLM really codes.