We don’t have a religious allegiance to any model. We run whatever wins in production for a given job.
After six months of A/B testing both Claude (Sonnet and Opus) and GPT-4 across real client workloads, here’s what shook out.
Where Claude wins in production
Classification with nuance. Invoice categorisation, email triage, ticket routing. When the task is “read this and tell me which of these 12 buckets it belongs to, with a confidence score and a short reason”, Claude is consistently better calibrated. It knows when it’s unsure. GPT-4 tends to commit to an answer even when it shouldn’t.
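To make that concrete, here’s roughly what one of those calls looks like: a minimal sketch using the Anthropic Python SDK, where the bucket list, model name, and JSON shape are placeholders rather than our production taxonomy.

```python
import json

import anthropic  # pip install anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

BUCKETS = ["billing", "technical", "sales", "spam"]  # placeholder taxonomy

def build_prompt(message: str) -> str:
    return (
        f"Classify this message into exactly one of: {', '.join(BUCKETS)}.\n"
        'Reply with JSON only: {"bucket": "...", "confidence": 0.0, "reason": "one sentence"}.\n'
        "If you are unsure, say so with a low confidence score rather than guessing.\n\n"
        f"Message:\n{message}"
    )

def classify(message: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # placeholder: whichever Sonnet you run
        max_tokens=200,
        messages=[{"role": "user", "content": build_prompt(message)}],
    )
    return json.loads(response.content[0].text)
```

The “say so if unsure” line is doing real work: it’s what makes the confidence scores usable downstream.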
Long-context RAG. For anything over 50 pages of grounded context (policies, regulations, internal docs), Sonnet holds attention across the whole window better than GPT-4 Turbo. The difference shows up in answers that reference multiple disconnected sections of the document: Claude does this naturally, while GPT-4 often just quotes the nearest paragraph.
Structured JSON output. Both models can produce JSON. Claude produces it correctly on the first try more often. For n8n workflows that break on malformed shapes, that’s the entire ball game.
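If malformed shapes have bitten you, the cheap insurance is a hard validation gate between the model and the workflow. A rough sketch, where the expected keys and the `call_model` helper are stand-ins for your own:

```python
import json

# Expected output shape; anything else should fail loudly, not reach n8n.
REQUIRED_KEYS = {"bucket": str, "confidence": (int, float), "reason": str}

def parse_or_fail(raw: str) -> dict:
    data = json.loads(raw)
    for key, expected in REQUIRED_KEYS.items():
        if not isinstance(data.get(key), expected):
            raise ValueError(f"bad or missing field: {key!r}")
    return data

def classify_with_retry(message: str, attempts: int = 2) -> dict:
    last_error = None
    for _ in range(attempts):
        try:
            return parse_or_fail(call_model(message))  # call_model: your wrapped LLM call
        except ValueError as err:  # json.JSONDecodeError is a ValueError
            last_error = err
    raise RuntimeError(f"no valid JSON after {attempts} attempts: {last_error}")
```

With a model that gets the shape right first try, the retry path almost never fires, which is exactly the difference we’re pointing at.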
Where GPT-4 wins
Image understanding at speed. GPT-4V is still faster for basic OCR and receipt reading on bulk jobs. Not always more accurate — but when you’re processing 500 images an hour, “fast enough” matters.
Tool use breadth. OpenAI’s function-calling ecosystem has more off-the-shelf tooling. If your workflow chains 5+ function calls and you want the path of least resistance, GPT-4 plus a popular agent framework gets you there in less code.
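For flavour, here’s the minimal shape of a function-calling request with the OpenAI Python SDK; the `route_ticket` tool and the model name are illustrative, not a real integration.

```python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

tools = [{
    "type": "function",
    "function": {
        "name": "route_ticket",  # illustrative tool, not a real integration
        "description": "Send a support ticket to the named queue.",
        "parameters": {
            "type": "object",
            "properties": {
                "queue": {"type": "string", "enum": ["billing", "technical", "sales"]},
                "priority": {"type": "string", "enum": ["low", "normal", "urgent"]},
            },
            "required": ["queue", "priority"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder: any tool-capable model
    messages=[{"role": "user", "content": "Customer can't log in and is furious."}],
    tools=tools,
)

# Assumes the model chose to call the tool; tool_calls is None if it
# answered in plain text instead.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```

Chaining five of these is where the off-the-shelf agent frameworks earn their keep.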
Cheap at tiny context. For quick classifications under 200 tokens (e.g., is this email spam, yes or no), GPT-4o-mini is cheaper per call than Claude Haiku. Pennies add up when you’re running millions of these.
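The arithmetic is worth doing explicitly. Prices move often, so the numbers below are illustrative placeholders, not current list prices; plug in whatever the pricing pages say today.

```python
# Back-of-the-envelope monthly cost for a tiny-context classifier.
# All prices are ILLUSTRATIVE placeholders in USD per 1M tokens.
CALLS_PER_MONTH = 3_000_000
INPUT_TOKENS_PER_CALL = 200
OUTPUT_TOKENS_PER_CALL = 5  # "yes"/"no" plus a little slack

def monthly_cost(input_price_per_m: float, output_price_per_m: float) -> float:
    input_cost = CALLS_PER_MONTH * INPUT_TOKENS_PER_CALL / 1e6 * input_price_per_m
    output_cost = CALLS_PER_MONTH * OUTPUT_TOKENS_PER_CALL / 1e6 * output_price_per_m
    return input_cost + output_cost

print(monthly_cost(0.15, 0.60))  # hypothetical model A: $99.00/month
print(monthly_cost(0.25, 1.25))  # hypothetical model B: $168.75/month
```

A $0.10-per-million difference on input alone is $60 a month at this volume: invisible per call, very visible on the invoice.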
Where it’s a coin flip
Creative writing. Summarisation under 2000 words. Code generation for common tasks. Both are good enough that the differences are in the noise. Pick whichever your team already knows.
What we tell clients
For anything client-facing (responses, explanations, content), we default to Claude. The tone calibration is better and it hallucinates less on the kind of details that would embarrass the client.
For anything internal-facing at massive throughput (classification, routing, tagging), we benchmark both. We’ve shipped builds on each. Whatever wins the eval wins production.
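The eval itself doesn’t need to be fancy. A sketch of the shape, where `call_claude` and `call_gpt` are stand-ins for your own wrapped classifiers and `labelled_examples` is a held-out set of (text, label) pairs:

```python
def evaluate(classify, labelled_examples) -> float:
    """Fraction of examples where the model's label matches ground truth."""
    hits = sum(classify(text) == label for text, label in labelled_examples)
    return hits / len(labelled_examples)

def pick_winner(labelled_examples) -> str:
    scores = {
        "claude": evaluate(call_claude, labelled_examples),  # stand-in
        "gpt": evaluate(call_gpt, labelled_examples),        # stand-in
    }
    print(scores)  # keep the numbers; "it felt better" is not an eval
    return max(scores, key=scores.get)
```

Run it on a few hundred labelled examples per task and the answer is usually unambiguous.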
The boring truth
The choice between models is the least important decision in most AI builds. Architecture, prompt design, classification taxonomy, and feedback loops matter ten times more than which LLM you pick.
If you’re still on GPT-3.5 for production work, that’s a real problem. Beyond that, pick one, measure it, and stop worrying.
Get in touch if you want a second opinion on the model choice in your own stack.