
We Ran 23 AI Models on the Same Article Scope—and Got 23 Different Answers

What a restoration contractor can learn from testing 23 models, one prompt, and a whole lot of variables.

🧪 We Ran 23 AI Models on the Same Article Scope. Here’s What It Taught Us.

Let’s be honest.
Getting good results from AI can feel random.

Sometimes it flows. Sometimes it falls flat.
And sometimes it gives you something that reads like it was written by a committee of confused robots.

So we ran an experiment.

🌪️ Same Input. Different Results. Why?

We tested 23 AI models using the exact same article scope:

“Kitchen Fire Cleanup: Expert Restoration and Prevention Tips.”

We didn’t tweak the prompt for each one.
We didn’t optimize for quirks.
We just asked them all to do the same job.

And the results were wildly different.

That’s when it hit us:
Even with everything held equal, the models performed differently. And if we, after years of building prompts and systems, felt surprised... what does that say for the average user?
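
For the technically curious, here's roughly the shape of the setup. This is a minimal sketch, not our actual harness: the generate() helper is a hypothetical stand-in for whatever API or platform you use to reach each model, and the model list shows only a few of the 23.

```python
# Minimal sketch of the experiment: one scope, one prompt, many models.
# generate() is a hypothetical stand-in for whatever API or platform you use
# to reach each model; wire it to your provider of choice.

SCOPE = "Kitchen Fire Cleanup: Expert Restoration and Prevention Tips."

PROMPT = (
    "Write a full-length article for a restoration contractor audience.\n"
    f"Scope: {SCOPE}\n"
    "Keep the tone calm, professional, and field-ready."
)

MODELS = [
    "Claude Sonnet 4 (Extended)",
    "GPT-4.1",
    "Gemini 2.5 Pro",
    "Grok 3 Beta",
    # ...and the rest of the 23 models under test
]

def generate(model: str, prompt: str) -> str:
    """Hypothetical helper: send the prompt to one model, return its draft."""
    raise NotImplementedError("Connect this to your model provider.")

def run_experiment() -> dict:
    drafts = {}
    for model in MODELS:
        # Same prompt for every model: no per-model tweaks, no optimizing for quirks.
        drafts[model] = generate(model, PROMPT)
    return drafts  # Each draft then gets a human read-through and a score.
```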

🧠 There’s a Lot More Going On Than You Think

You might think you're doing something wrong when your AI output feels off.

But here’s what we learned: it’s not always you.
It could be:

  • The model’s architecture

  • The way it interprets your tone

  • How it handles structure and logic

  • Your own habits or assumptions while prompting

It’s like managing a team—some folks need bullet points. Others need a whiteboard. Some just need to be trusted to go build.
AI models are the same way.

🤖 You Don’t Have to Be an Expert

You just need to be curious.

And just like you learned how to set a desiccant dehu to dry out a gym floor, once you know how to “set up” the right AI tool—it does the work for you.

That’s why we recommend platforms like You.com, which let you try multiple models quickly. (No affiliate link, just an honest recommendation.)
It’s not magic. It’s muscle.
And with a little effort, you’ll start to recognize the personalities of the tools you’re using.

🏆 The 5 Models That Stood Out

After enhanced testing, here’s who topped the list—and why they matter.

Model                        Enhanced Score   Notes
Claude Sonnet 4 (Extended)   5.0              Calm, professional, field-ready tone
GPT-4.1                      5.0              Sharp structure, publishable polish
Claude 4 Opus (Reg)          5.0              Great flow and clarity
Claude Sonnet 4              5.0              Excellent default performance
o4 Mini High Effort          5.0              Lightweight but surprisingly strong

These were the ones we’d trust with:

  • Full-length articles

  • Email updates

  • Client education pieces

💡 Most Improved Models

Model                 Before ➝ After
Claude 3.5            4.2 ➝ 4.75
o3 Mini High Effort   4.5 ➝ 4.8
LLaMA 4 Maverick      3.5 ➝ 4.6
These models benefited from second-round testing, better scope alignment, and clearer expectations.
A great reminder: even machines do better when they’re understood.

🧰 Full Rankings (Useful in the Right Context)

Some of these models excelled at specific things—like short-form content, SEO-friendly language, or metaphorical voice. Others missed the mark for restoration tone but offered insight into how they “think.”

Model                     Final Score
Claude 3.7 Extended       4.9
Gemini 2.5 Flash          4.9
Gemini 2.5 Pro            4.9
Grok 3 Mini High Effort   4.9
GPT-4o                    4.85
GPT o3                    4.85
Qwen 2.5 72B              4.85
DeepSeek V3               4.85
Qwen3 235B                4.85
Grok 3 Beta               4.8
Auto                      4.8
GPT-4.1 Mini              4.6
4o Mini                   4.6
Mistral Large 2           4.5
LLaMA 4 Scout             4.1

🔄 What We Built From This

This wasn’t just a ranking exercise. We used the insights to:

  • Power our subscriber tools with the best-performing models

  • Match model types to job-specific tasks (emails vs. SOPs vs. long-form; a rough sketch follows this list)

  • Refine our prompt stacks to work better for you, not just under ideal conditions
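
If you like seeing it written down, here's a rough sketch of what "matching model types to job-specific tasks" can look like. The pairings below are illustrative picks from the rankings above, not the actual routing inside our subscriber tools.

```python
# Illustrative only: a simple task-to-model routing table built from a few of
# the top performers above. The real pairings in our subscriber tools may differ.

TASK_MODEL_MAP = {
    "full_article": "Claude Sonnet 4 (Extended)",   # calm, professional, field-ready tone
    "email_update": "GPT-4.1",                       # sharp structure, publishable polish
    "client_education": "Claude 4 Opus (Reg)",       # great flow and clarity
    "sop_draft": "o4 Mini High Effort",              # lightweight but surprisingly strong
}

def pick_model(task: str) -> str:
    """Return the model assigned to a task type, with a sensible default."""
    return TASK_MODEL_MAP.get(task, "Claude Sonnet 4")  # excellent default performance

# Example: pick_model("email_update") -> "GPT-4.1"
```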

✅ What You Can Do Next

  • Try your own testing inside You.com (Claude, Grok, DeepSeek—all in one tab)

  • Learn the personalities, like you do your field crew

  • Use our subscriber tools as “preset dehus” — just flip them on and let them run

And if you ever wonder whether AI is broken or if you’re just “doing it wrong”…
Remember this test.

It’s not about perfection. It’s about alignment.