Why AI Projects Die in Testing (And How We Fixed It)


This summer, while building an MCP integration for a consulting client, I hit a wall that felt all too familiar. Our modest evaluation suite (about 20 test questions) took nearly 8 minutes to run. Every system prompt tweak and every tool description change meant either waiting out a long feedback loop or running only a subset of tests and hoping for the best. Attempts to speed up the tests ran into rate limits and time spent digging through caching details in the Anthropic docs.

What we were building was simple, but shipping something we knew worked was incredibly tedious. I knew there had to be a better way, so on the side I started tinkering with ideas for abstracting the testing process into a simpler configuration. At first it was just something to keep me busy: I focused on the user experience and on adding basic MCP support for models beyond Anthropic's.
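To make "a simpler configuration" concrete, here is a minimal sketch of the kind of declarative suite I had in mind. None of these names come from the actual tool; they are hypothetical stand-ins for the idea of declaring tests instead of scripting them.

```python
from dataclasses import dataclass, field


# Hypothetical config shapes, not the tool's real API: the point is that a
# suite becomes data (model, MCP servers, cases) rather than bespoke code.
@dataclass
class EvalCase:
    question: str
    expect_contains: list[str] = field(default_factory=list)


@dataclass
class SuiteConfig:
    model: str              # any provider's model, not just Anthropic's
    mcp_servers: list[str]  # MCP servers the assistant is allowed to call
    cases: list[EvalCase]


suite = SuiteConfig(
    model="some-open-weights-model",  # placeholder identifier
    mcp_servers=["tickets"],
    cases=[
        EvalCase(
            question="Summarize the open tickets",
            expect_contains=["ticket"],
        ),
    ],
)
```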

Then I became obsessed. This was a problem I had to solve, and I caught a glimpse of doing it far better than I had imagined. I worked on running tests in parallel, standardizing retry logic, and adding more eval checks. Then came a breakthrough: the same suite that had taken 5 minutes now ran in under 30 seconds. Tests that cost about $0.35 per run on Anthropic's API cost just $0.01 on open-weights models. I felt like I was onto something.
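For the curious, the parallelism-plus-retry combination is roughly this pattern. This is a minimal sketch, not the tool's implementation; `run_eval` and the question strings are hypothetical placeholders.

```python
import asyncio
import random


# Hypothetical stand-in: a real version would call the model under test
# and score its response against the eval checks.
async def run_eval(question: str) -> bool:
    await asyncio.sleep(0.1)  # placeholder for a real model call
    return True


async def run_with_retry(question: str, attempts: int = 3) -> bool:
    for attempt in range(attempts):
        try:
            return await run_eval(question)
        except Exception:
            if attempt == attempts - 1:
                raise
            # Exponential backoff with jitter to ride out rate limits.
            await asyncio.sleep(2 ** attempt + random.random())


async def run_suite(questions: list[str], max_concurrency: int = 8) -> list[bool]:
    # Cap concurrency so the parallelism itself doesn't trigger rate limits.
    sem = asyncio.Semaphore(max_concurrency)

    async def guarded(question: str) -> bool:
        async with sem:
            return await run_with_retry(question)

    return await asyncio.gather(*(guarded(q) for q in questions))


if __name__ == "__main__":
    results = asyncio.run(run_suite([f"question {i}" for i in range(20)]))
    print(f"{sum(results)}/{len(results)} checks passed")
```

With 20 questions running 8 at a time instead of one after another, a minutes-long serial run collapsing to tens of seconds is exactly what you would expect.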

Same quality checks, 10x faster, 30x cheaper.

The urgency increased. I knew I had to share what I had made. Since then I have been heads-down preparing what I am calling a pop-up event: an invite-only, temporary release of vibe check. We're treating it like production, just for a limited time.