Posts about benchmarks
-
Correctness is model-determined
I benchmarked four AI coding harnesses on 12 tasks using the same model. The harness barely matters for correctness — it's all about the model.
I benchmarked four AI coding harnesses on 12 tasks using the same model. The harness barely matters for correctness — it's all about the model.