OpenAI

Why we no longer evaluate SWE-bench Verified

SWE-bench Verified is increasingly contaminated and mismeasures frontier coding progress. Our analysis shows flawed tests and training leakage. We recommend SWE-bench Pro instead.

OpenAI

Our First Proof submissions

We share our AI model’s proof attempts for the First Proof math challenge, testing research-grade reasoning on expert-level problems.