skip to content
Topic

benchmarks

1 essay on this topic.

  1. Correctness is model-determined

    I benchmarked four AI coding harnesses on 12 tasks using the same model. The harness barely matters for correctness — it's all about the model.