Turns Out AI Models Actually Have Bad Days. Here’s the Proof.

When I found out that AI models actually have good days and bad days — measurably, with data — I knew I had to write about it.

I see people on Reddit daily asking “when did [model] get nerfed?” or “is Claude worse today or am I losing it?” And I’ve always scrolled past. I’m not big into conspiracy theories. I figured these were the same people who think every update is a secret downgrade designed to upsell them to a more expensive tier. That’s not analysis, that’s just complaining for upvotes.

Turns out, there’s truth to the complaint. If you haven’t heard about this before, keep reading. I’ll tell you how I stumbled onto this and where you can go check the performance of your LLM yourself — at least for Claude and Codex.


How I Found Out

I was catching up on YouTube the other day (I wrote about how I actually use YouTube for AI news here) and landed on a ThePrimeagen reaction video. He was reacting to PewDiePie’s latest project — and yes, PewDiePie is deep into AI model building now. Like, deep deep. Ten GPUs, 424GB of VRAM, months of fine-tuning an open-source model on coding benchmarks, dealing with data contamination, melting power cables, the whole thing. His video is called “I Trained My Own AI… It Beat ChatGPT” and it’s worth a watch regardless of how you feel about the guy.

But that’s not what this article is about. Somewhere in ThePrimeagen’s reaction, a website popped up on screen that I’d never seen before: marginlab.ai.

I clicked through. And that’s where the skeptic in me got fact-checked by a chart.


The Data

[Chart: MarginLab’s Claude Code tracker, showing benchmark performance swinging as much as 15% from one day to the next]

MarginLab is an independent third party — no affiliation with Anthropic, OpenAI, or anyone with skin in the game. Every day, they run 50 coding tasks from SWE-Bench-Pro through Claude Code and measure the pass rate. Same tests. Same setup. Every single day.

The baseline pass rate for Claude Code (currently Opus 4.6) is 56%. Sounds stable enough. Then you look at the daily chart. That orange line doesn’t gently hover around 56%. It swings from the mid-40s to the mid-60s depending on the day. On any given day, Claude Code’s performance on real coding tasks can vary by 10-15 percentage points.
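To be fair to the models, some of that wobble is baked into the measurement itself: with only 50 tasks per day, pure sampling noise produces sizable swings even if the underlying model never changes. A quick back-of-the-envelope sketch (the 50 tasks and 56% baseline come from the article; the rest is standard binomial arithmetic):

```python
import math

# Daily pass rate on a fixed task set behaves like a binomial sample.
n_daily = 50   # tasks MarginLab runs per day
p = 0.56       # long-run baseline pass rate

# Standard deviation of a single day's pass rate under pure sampling noise
sd = math.sqrt(p * (1 - p) / n_daily)

print(f"1 sigma: {sd:.1%}")                                # 1 sigma: 7.0%
print(f"2 sigma band: {p - 2*sd:.0%} to {p + 2*sd:.0%}")   # 2 sigma band: 42% to 70%
```

In other words, a single mid-40s or mid-60s day could just be luck of the draw, which is exactly why the sustained multi-week trends are the part worth paying attention to.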

And it’s not just daily noise. MarginLab has detected statistically significant performance drops — a confirmed 4.1% decline over 30 days using 655 evaluations. That crossed the p < 0.05 threshold. Statistically significant. Documented. Tracked.
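MarginLab doesn’t spell out its exact methodology here, but a standard way to test a drop like that is a one-sample proportion z-test of the recent pass rate against the long-run baseline. A hedged sketch with illustrative numbers (the 56% baseline and 655 evaluations come from the article; the 340 passes, implying a recent rate around 51.9%, are my assumption to make the arithmetic concrete):

```python
import math

baseline = 0.56   # long-run pass rate, treated as a known reference
n = 655           # evaluations in the 30-day window
passes = 340      # hypothetical count giving a ~51.9% recent pass rate
p_hat = passes / n

# Standard error of the pass rate under the null hypothesis (no change)
se = math.sqrt(baseline * (1 - baseline) / n)
z = (p_hat - baseline) / se

# Two-sided p-value from the normal distribution, via the complementary
# error function (avoids needing scipy)
p_value = math.erfc(abs(z) / math.sqrt(2))

print(f"z = {z:.2f}, p = {p_value:.3f}")  # roughly z = -2.1, p below 0.05
```

Note that the sample size is doing the heavy lifting: the same 4-point gap measured over a single day of 50 tasks would be nowhere near significant. Pooling hundreds of evaluations is what separates a real decline from noise.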

So those people on Reddit saying “Claude was better last week”? Some of them were probably right.

For what it’s worth — Anthropic themselves published a postmortem in September 2025 describing three overlapping infrastructure bugs that degraded Claude’s responses for weeks. Routing errors, TPU misconfigurations, compiler bugs. At peak, one bug alone was affecting 16% of requests. They were transparent about it, and they’ve said they never intentionally reduce model quality. But the effect on users was the same: some days were just worse than others, and there was no way to know unless you had the data.


Myth Confirmed

So here’s where I’ve revised my position: the complainy people on the internet weren’t just complaining to get upvotes on Reddit. AI models actually have measurable performance variation, day to day. Sometimes it degrades over weeks. An independent tracker now exists to quantify it.

I don’t personally experience dramatic good days and bad days with Claude — though, actually, today has been particularly good. But I’m a casual user compared to someone shipping production code with Claude Code eight hours a day. For those people, a 10-15 point swing in coding task accuracy on a Tuesday versus a Thursday is a real thing that costs real time.

Good to know. Bookmark MarginLab, I guess, and check it if you want to see how Opus or ChatGPT’s Codex is performing before you start a big coding session.

If you’re reading this, by the way, you’re three steps removed from the original content. Ouch. PewDiePie made a video. ThePrimeagen reacted to it. I watched the reaction, found a website, and wrote this — with AI help, naturally, because of course I did. At least I’m typing this out for real and asking AI to sharpen my rambling thoughts. You’re welcome. 🙂


Where to Track AI Performance

If this is new to you, here are the resources worth knowing about:

For model quality (is my AI performing well today?):

  • MarginLab — Daily benchmarks for Claude Code on SWE-Bench-Pro. Independent, statistically rigorous, updated daily. They also track Codex. This is the one that started this article.
  • Chatbot Arena / LMArena — Crowdsourced head-to-head model comparisons with Elo ratings. Not daily quality tracking, but useful for seeing how models rank against each other over time based on real user votes.
  • Artificial Analysis — Tracks speed, latency, and pricing across providers. More about throughput than output quality, but useful if you’re evaluating which API to use.
  • LiveBench — Continuously updated benchmark using new questions to avoid data contamination. Good for tracking whether models are actually improving over time.

For uptime and outages (is my AI even working right now?):

  • status.openai.com — OpenAI’s official status page.
  • status.claude.com — Anthropic’s official status page.
  • AI Status — Monitors uptime across ChatGPT, Claude, Gemini, DeepSeek, and others in one place.
  • AI Checker Hub — Independent monitoring with latency tracking and reliability scoring across major providers.

There’s an important distinction between these two categories. Status pages tell you if the service is up. MarginLab tells you if the service is good. Those are two very different questions, and until recently, nobody was tracking the second one.


This article was written by a human, sharpened by AI, about AI performance inconsistency, inspired by a reaction video about a video about training AI. If that doesn’t capture where we are in 2026, nothing does.
