Greatest Hits
ChatGPT 5.2 Instant gives the funniest single recovery line in the dataset
Should I walk or drive 100m to get my car washed?
Models get no hint to structure their answer.

Grok 4 Fast mocks the concept of driving to a car wash, then has to do it
The car wash is nearby (about 100m). Should I go on foot or take the car?
Models get no hint to structure their answer.

Grok 4 Fast builds a comparison table that proves itself wrong
The car wash is nearby (about 100m). Should I go on foot or take the car?
Models get no hint to structure their answer.

| Method | Time | Effort | Cost |
|---|---|---|---|
| Drive | 1 min | Low | Free |
| Walk + Uber/tow | 10+ min | High | $10+ |
| Mobile wash service | N/A | Zero | $50+ |

Grok 4 Fast explores every possible alternative to just driving
The car wash is nearby (about 100m). Should I go on foot or take the car?
Models get no hint to structure their answer.

Gemini 3 Fast names its car wash failure scenarios like horror movies
Car wash is 100 meters away and I need my car cleaned. Walk or drive?
Models get no hint to structure their answer.

Claude Opus 4.6 says "walk" and "drive" in the same breath without noticing
"Give a direct answer first, then a brief reason." appended to the prompt.
This simple nudge more than doubles the overall pass rate (56.5% vs 24.0%), suggesting models know the answer but need help surfacing it.

Claude Sonnet 4.5 does it too, across multiple independent runs
"Give a direct answer first, then a brief reason." appended to the prompt.
This simple nudge more than doubles the overall pass rate (56.5% vs 24.0%), suggesting models know the answer but need help surfacing it.

Grok 4 Fast confidently asserts your car will get clean whether you bring it or not
Should I walk or drive 100m to get my car washed?
Models get no hint to structure their answer.

ChatGPT 5.2 Thinking invents dirty car traffic regulations
"Start with a direct answer, then one short reason." appended to the prompt.
This simple nudge more than doubles the overall pass rate (56.5% vs 24.0%), suggesting models know the answer but need help surfacing it.

The most perfectly concise correct answer in the entire dataset
"Direct answer first, short reason second." appended to the prompt.
This simple nudge more than doubles the overall pass rate (56.5% vs 24.0%), suggesting models know the answer but need help surfacing it.

Grok 4 Fast (from an earlier run) gets one wrong and absolutely refuses to back down