We Ran 23 AI Models on the Same Article Scope—and Got 23 Different Answers
What a restoration contractor can learn from testing 23 models, one prompt, and a whole lot of variables.

🧪 We Ran 23 AI Models on the Same Article Scope. Here’s What It Taught Us.
Let’s be honest.
Getting good results from AI can feel random.
Sometimes it flows. Sometimes it falls flat.
And sometimes it gives you something that reads like it was written by a committee of confused robots.
So we ran an experiment.
🌪️ Same Input. Different Results. Why?
We tested 23 AI models using the exact same article scope:
“Kitchen Fire Cleanup: Expert Restoration and Prevention Tips.”
We didn’t tweak the prompt for each one.
We didn’t optimize for quirks.
We just asked them all to do the same job.
And the results were wildly different.
That’s when it hit us:
Even with everything being equal, the models performed differently. And if we, after years of building prompts and systems, felt surprised... what does that say for the average user?
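If you want to run this kind of head-to-head yourself, the core loop is dead simple: one fixed prompt, a list of models, and the outputs saved side by side. Here’s a minimal Python sketch, assuming an OpenAI-compatible API endpoint (many multi-model platforms expose one); the base URL, key, and model names are placeholders, not our exact setup.

```python
# Head-to-head test: one fixed prompt sent to several models, outputs saved
# side by side for scoring. Assumes an OpenAI-compatible endpoint; the
# base_url, api_key, and model IDs below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="https://api.example.com/v1", api_key="YOUR_KEY")

PROMPT = "Write an article: Kitchen Fire Cleanup: Expert Restoration and Prevention Tips."
MODELS = ["claude-sonnet-4", "gpt-4.1", "deepseek-v3"]  # placeholder model IDs

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],  # same prompt, zero per-model tweaks
    )
    draft = response.choices[0].message.content
    with open(f"{model}.md", "w", encoding="utf-8") as f:
        f.write(draft)  # one file per model, ready for side-by-side review
```

Even a crude loop like this makes the point fast: identical instructions in, very different drafts out.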
🧠 There’s a Lot More Going On Than You Think
You might think you're doing something wrong when your AI output feels off.
But here’s what we learned: it’s not always you.
It could be:
- The model’s architecture
- The way it interprets your tone
- How it handles structure and logic
- Your own habits or assumptions while prompting
It’s like managing a team—some folks need bullet points. Others need a whiteboard. Some just need to be trusted to go build.
AI models are the same way.
🤖 You Don’t Have to Be an Expert
You just need to be curious.
And just like you learned how to set a desiccant dehu to dry out a gym floor, once you know how to “set up” the right AI tool—it does the work for you.
That’s why we recommend platforms like You.com, which let you try multiple models quickly (no affiliate link, just an honest recommendation).
It’s not magic. It’s muscle.
And with a little effort, you’ll start to recognize the personalities of the tools you’re using.
🏆 The 5 Models That Stood Out
After a second, refined round of testing, here’s who topped the list and why these models matter.
| Model | Enhanced Score | Notes |
|---|---|---|
| Claude Sonnet 4 (Extended) | 5.0 | Calm, professional, field-ready tone |
| GPT-4.1 | 5.0 | Sharp structure, publishable polish |
| Claude 4 Opus (Reg) | 5.0 | Great flow and clarity |
| Claude Sonnet 4 | 5.0 | Excellent default performance |
| o4 Mini High Effort | 5.0 | Lightweight but surprisingly strong |
These were the ones we’d trust with:
- Full-length articles
- Email updates
- Client education pieces
💡 Most Improved Models
| Model | Before ➝ After |
|---|---|
| Claude 3.5 | 4.2 ➝ 4.75 |
| o3 Mini High Effort | 4.5 ➝ 4.8 |
| LLaMA 4 Maverick | 3.5 ➝ 4.6 |
These models benefited from second-round testing, better scope alignment, and clearer expectations.
A great reminder: even machines do better when they’re understood.
🧰 Full Rankings (Useful in the Right Context)
Some of these models excelled at specific things—like short-form content, SEO-friendly language, or metaphorical voice. Others missed the mark for restoration tone but offered insight into how they “think.”
| Model | Final Score |
|---|---|
| Claude 3.7 Extended | 4.9 |
| Gemini 2.5 Flash | 4.9 |
| Gemini 2.5 Pro | 4.9 |
| Grok 3 Mini High Effort | 4.9 |
| GPT-4o | 4.85 |
| GPT o3 | 4.85 |
| Qwen 2.5 72B | 4.85 |
| DeepSeek V3 | 4.85 |
| Qwen3 235B | 4.85 |
| Grok 3 Beta | 4.8 |
| Auto | 4.8 |
| GPT-4.1 Mini | 4.6 |
| 4o Mini | 4.6 |
| Mistral Large 2 | 4.5 |
| LLaMA 4 Scout | 4.1 |
🔄 What We Built From This
This wasn’t just a ranking exercise. We used the insights to:
- Power our subscriber tools with the best-performing models
- Match model types to job-specific tasks (emails vs. SOPs vs. longform)
- Refine our prompt stacks to work better for you, not just under ideal conditions
✅ What You Can Do Next
- Try your own testing inside You.com (Claude, Grok, DeepSeek—all in one tab)
- Learn the personalities, like you do your field crew
- Use our subscriber tools as “preset dehus” — just flip them on and let them run
And if you ever wonder whether AI is broken or if you’re just “doing it wrong”…
Remember this test.
It’s not about perfection. It’s about alignment.