From Test Automation Architect to AI Systems: What 25 Years of Engineering Taught Me About AI

My Take

The most dangerous person in any AI project is the engineer who is impressed by the demo. I’ve been in rooms where a GenAI prototype gets a standing ovation — and I’m the one quietly thinking about what happens when the context window fills up, when the latency spikes under load, or when the model confidently returns the wrong answer. That instinct didn’t come from AI courses. It came from 25 years of building and breaking software — most of it spent making sure systems fail gracefully rather than catastrophically.


The Demo That Changed How I Think About AI

A few years ago I watched a GenAI demo that stopped me cold — not because it was impressive, but because I could see the engineering underneath and it was fragile.

The demo was slick. The model answered questions fluently, retrieved documents accurately, summarised complex content in seconds. The audience was sold. I was watching the prompts, the latency, the way the system handled edge cases — and I kept thinking: this breaks the moment someone asks it something slightly outside the training distribution. This breaks at scale. This breaks when the data changes.

I wasn’t being cynical. I was being a test engineer.

That moment clarified something I hadn’t fully articulated before: the skills that make someone good at test automation, quality engineering, and production validation are exactly the skills that AI engineering is desperately short of. Not the ML mathematics — that’s learnable. The discipline of thinking about failure modes before they happen. The habit of asking “what does this system do when it’s wrong?” before asking “what does it do when it’s right?”

For someone with a QA background, that question is instinct. For most AI teams, it’s an afterthought.


What 25 Years of Testing Actually Teaches You

I started my career in system and network programming, but the majority of my engineering life has been spent in testing and automation — which is a far more interesting discipline than its reputation suggests.

Test automation at scale forces you to think like an adversary. Your job is to find the ways a system breaks before users do. You build frameworks that must themselves be reliable. You write code that validates other code, which means you’re constantly reasoning about edge cases, boundary conditions, and failure modes that the original developers didn’t consider.

From there I moved through Python automation, cloud infrastructure, and storage systems — always with a QA lens. What are the assumptions baked into this system? Where does it degrade under load? How does it behave when the data is slightly wrong, the network is slightly slow, the input is slightly unexpected?

These questions are not glamorous. But they build something that no amount of ML coursework can shortcut: a systematic instinct for where systems fail.

When I now look at AI systems with that same lens, I see the same categories of failure I’ve seen throughout my career — just wearing different clothes:

  • Boundary condition failures: the model handles common inputs well and breaks on rare ones
  • State management failures: multi-step agent pipelines where errors in step 2 compound silently into step 5
  • Data quality assumptions: training data that doesn’t represent production distribution
  • Observability gaps: no way to know the system is degrading until a user complains
  • Graceful degradation gaps: systems that fail catastrophically instead of falling back safely

Every one of these is a testing problem. Every one of these is something a test automation background directly prepares you to address.


The Most Expensive Mistake in AI Engineering

In my experience, the most costly mistake teams make when building AI systems is trusting the model without understanding its failure modes.

This manifests in several ways. Teams test the happy path exhaustively — the cases where the model performs well — and ship without systematically exploring where it breaks. They treat model outputs as facts rather than probabilistic estimates. They build downstream systems that assume the model is always right, which means a single model failure cascades into a system failure.

The underlying cause is almost always the same: no one on the team has thought rigorously about what “wrong” looks like for this system, how often it happens, and what the system does when it does.

A good test engineer never makes this mistake. They define failure modes before writing test cases. They build adversarial inputs deliberately. They measure not just whether the system works but how it fails and how often.

That discipline — define failure before you define success — is what AI engineering currently lacks at scale. And it’s precisely what a QA background provides.


Why I’m Moving into AI Systems

I’m not moving into AI because it’s fashionable. I’m moving into AI because after 25 years of building and testing software systems, I can see clearly that the hard unsolved problems in AI engineering are fundamentally quality and reliability problems.

How do you evaluate an AI system that produces non-deterministic outputs? How do you build a test suite for a RAG pipeline? How do you detect when a model is degrading silently in production? How do you design an AI agent architecture that fails gracefully rather than catastrophically?

These are questions I find genuinely interesting. They sit at the intersection of my existing expertise and the emerging needs of AI engineering. And they’re questions that most teams are currently answering poorly — not because they lack AI knowledge, but because they lack testing discipline.

I’m currently building deep expertise in AI architecture, LLM internals, and agent systems — not at the surface level of “I can use the API” but at the level of understanding model behaviour well enough to design evaluation frameworks, failure detection systems, and quality gates for AI pipelines.

The notes on this site document that learning process. They’re not tutorials — there are enough of those. They’re the working notes of someone applying 25 years of quality engineering instinct to AI systems, asking at every step: how does this break, how would I know, and what does the system do when it does?


What This Means for Teams Building AI

If you’re building AI systems and finding that production behaviour doesn’t match demo behaviour — that your pipeline is inconsistent, that your model returns unexpected outputs under load, that you have no reliable way to measure quality — you almost certainly have a testing and evaluation problem, not a model problem.

The model is doing what models do. The engineering discipline around the model is what needs attention.

If you’d like to discuss your specific situation — AI agent architecture, evaluation framework design, or quality engineering for GenAI pipelines — I’m available for a free 30-minute discovery call.

About Kaushik Sarkar → Contact via the form at intellinotebook.com


Kaushik Sarkar is a Test Automation Architect with 25 years of engineering experience across system testing, cloud infrastructure, Python automation, and QA frameworks. He is currently building expertise in AI systems and GenAI architecture. He writes at intellinotebook.com and is based in Bangalore.

0 0 votes
Article Rating
Subscribe
Notify of
guest

0 Comments
Oldest
Newest Most Voted
Search

Table of Contents

You may also like to read

0
Would love your thoughts, please comment.x
()
x