The first benchmark that measures how well LLMs improve the development experience, by testing an LLM's ability to do something we all hate: writing tests.
LLMs claim they can code, but can they really?
Existing benchmarks test algorithmic puzzles and toy problems. But real development is about understanding business logic, writing maintainable tests, and creating code that actually works in production.
DXBench tests what matters: Can your LLM generate comprehensive, working unit tests for real functions? Can it understand edge cases? Does it write code that passes in a real environment?
Test LLM-generated code safely in isolated Docker containers. No risk to your system, complete isolation from the real world.
Tests real-world scenarios: importing modules, handling edge cases, writing maintainable test code that developers actually need.
A simple Bot interface works with any LLM API. Get started in minutes and integrate effortlessly with your existing ML pipelines.
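To give a concrete picture of what such an adapter might look like, here is a minimal sketch. The class and method names (`MyBot`, `generate_tests`) are illustrative assumptions, not DXBench's actual API, and the OpenAI SDK is used only as one example backend.

```python
# Hypothetical Bot adapter: class and method names are illustrative,
# not DXBench's actual interface. Uses the OpenAI SDK as one example backend.
from openai import OpenAI


class MyBot:
    """Wraps an LLM API behind a single method that returns test code."""

    def __init__(self, model: str = "gpt-4o-mini"):
        self.client = OpenAI()  # reads OPENAI_API_KEY from the environment
        self.model = model

    def generate_tests(self, function_source: str) -> str:
        """Given a Python function's source, return a pytest test module."""
        prompt = (
            "Write pytest unit tests for the following function. "
            "Include all imports and cover edge cases.\n\n" + function_source
        )
        response = self.client.chat.completions.create(
            model=self.model,
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
```

Any other provider works the same way: the only contract is source code in, test code out.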
DXBench gives your LLM real Python functions to analyze and understand.
Your LLM writes comprehensive unit tests with proper imports and edge cases.
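For illustration, a task of that shape might pair a small function under test with the kind of pytest module an LLM is expected to produce. Both snippets below are hypothetical examples, not actual DXBench tasks.

```python
# Hypothetical function under test (illustrative, not an actual DXBench task).
def safe_divide(a: float, b: float) -> float:
    """Divide a by b, raising ValueError on a zero divisor."""
    if b == 0:
        raise ValueError("division by zero")
    return a / b


# The kind of test module an LLM should produce: correct imports,
# normal cases, and edge cases. (In a real run the tests would import
# the target function from its module.)
import pytest


def test_safe_divide_basic():
    assert safe_divide(10, 2) == 5


def test_safe_divide_negative():
    assert safe_divide(-9, 3) == -3


def test_safe_divide_zero_divisor_raises():
    with pytest.raises(ValueError):
        safe_divide(1, 0)
```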
Tests run safely in isolated Docker containers with timeout protection.
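Conceptually, that step amounts to running the generated tests in a throwaway container with a hard time limit. Below is a rough sketch of the idea, not the actual DXBench harness, assuming a prebuilt image (here called `dxbench-runner`, a hypothetical name) with pytest preinstalled.

```python
# Rough sketch of sandboxed test execution, not the actual DXBench harness.
# Assumes "dxbench-runner" is a locally built image with pytest preinstalled.
import subprocess


def run_tests_in_docker(workdir: str, timeout_seconds: int = 60) -> bool:
    """Run pytest in a disposable container; kill it if it exceeds the timeout."""
    cmd = [
        "docker", "run", "--rm",
        "--network", "none",        # no network access from the sandbox
        "-v", f"{workdir}:/app:ro",  # mount the test code read-only
        "-w", "/app",
        "dxbench-runner",            # hypothetical image name
        "python", "-m", "pytest", "-q", "-p", "no:cacheprovider",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_seconds)
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0
```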
Get detailed metrics on accuracy, pass rates, and specific failure modes.
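The shape of such a report could look something like the following; the field names and numbers are purely illustrative placeholders, not DXBench's actual output or real results.

```python
# Illustrative shape of a benchmark report; fields and values are hypothetical,
# not DXBench's actual output format.
report = {
    "tasks": 120,
    "tests_generated": 118,
    "pass_rate": 0.74,           # fraction of generated test suites that pass
    "failure_modes": {
        "import_error": 9,       # wrong or missing imports
        "assertion_error": 14,   # tests that contradict the function's behavior
        "syntax_error": 3,
        "timeout": 5,
    },
}
```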
Stop guessing. Start measuring. See how well your LLM really codes.