Test Automation and Refactoring: Key Strategies for Improving Generative AI Applications
Hey, I am Klaus Haeuptle! Welcome to this edition of the Engineering Ecosystem newsletter, in which I write about a variety of software engineering and architecture topics such as clean code, test automation, decision-making, technical debt, large-scale refactoring, culture, sustainability, cost and performance, generative AI and more. In this edition I motivate the importance of test automation for generative AI applications, discuss its limitations and provide some resources to learn more about the topic.
Why Testing Generative AI Applications is Critical
The same reasons for test automation apply to generative AI applications as to any other application. Additionally, generative AI applications have some unique challenges that make test automation even more critical. Compared to classical applications, building an application based on generative AI requires many iterations and refactoring steps. Therefore, a test automation safety net is critical for making changes with more confidence. At the same time, test automation for generative AI applications comes with certain limitations.
Test Automation for Generative AI Applications
In a recent article on martinfowler.com, David Tan and Jessie Wang write about the importance and challenges of testing applications based on generative AI. They argue that testing generative AI models can be challenging: unlike standard software code, AI models are probabilistic (i.e., their behaviour is not deterministic), their inner workings are generally a "black box", and their complex output can be hard to evaluate (e.g. long generated text). In the article they describe several approaches to improve the testability of generative AI applications, such as structured output, property-based testing and auto-evaluators that leverage generative AI for testing.
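To make two of these ideas more concrete, here is a minimal sketch of how structured output and property-based testing could be combined. It is not the article's code: `generate_summary`, its JSON schema and the stubbed model call are hypothetical stand-ins for whatever LLM call an application actually makes. The key idea is that, because the exact wording of the model output is non-deterministic, the test asserts invariants (properties) of the parsed, structured result rather than an exact string.

```python
# Minimal sketch (hypothetical names): testing an LLM-backed function
# via structured output plus property-based testing with Hypothesis.

import json
from dataclasses import dataclass

from hypothesis import given, settings, strategies as st


@dataclass
class SummaryResult:
    summary: str
    keywords: list[str]


def generate_summary(text: str) -> SummaryResult:
    """Hypothetical wrapper around an LLM that is prompted to return JSON.

    In a real application this would call a model and parse its response;
    here the model call is stubbed so the example stays runnable.
    """
    raw = json.dumps({"summary": text[:100], "keywords": text.split()[:3]})
    data = json.loads(raw)  # structured output: parse and validate the schema
    return SummaryResult(summary=data["summary"], keywords=data["keywords"])


# Property-based test: instead of asserting one exact (non-deterministic)
# output, assert invariants that should hold for any input.
@settings(max_examples=25, deadline=None)
@given(st.text(min_size=1, max_size=500))
def test_summary_invariants(text: str) -> None:
    result = generate_summary(text)
    assert isinstance(result.summary, str)
    assert len(result.summary) <= len(text)
    assert all(isinstance(keyword, str) for keyword in result.keywords)
```

An auto-evaluator would follow the same pattern, except that the assertion itself is delegated to a second model call that scores the output, for example for relevance or tone.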
Limitations of Test Automation for Generative AI Applications
The article also describes the limitations of test automation for generative AI applications. Automated tests alone are not sufficient for testing generative-AI-based applications; you still need to find the appropriate boundary between the responsibilities of the AI system and how to involve humans to address the risk of issues such as hallucination. This is even more the case for generative AI applications than for classical, deterministic applications. For example, your product design can involve a "staging pattern" where you ask users to review and edit the generated output for factual accuracy and tone, rather than directly using the AI-generated output without human intervention.
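As a rough illustration, the staging pattern can be as simple as holding generated output in a draft state until a human has reviewed and possibly edited it. The following sketch uses hypothetical names (`Draft`, `stage_generated_output`, `apply_review`) and is only one way such a flow could be modelled.

```python
# Hedged sketch of a "staging pattern": AI-generated output is staged as a
# draft and only becomes usable after an explicit human review step.

from dataclasses import dataclass
from enum import Enum


class ReviewStatus(Enum):
    PENDING_REVIEW = "pending_review"
    APPROVED = "approved"
    REJECTED = "rejected"


@dataclass
class Draft:
    generated_text: str
    status: ReviewStatus = ReviewStatus.PENDING_REVIEW
    reviewer_notes: str = ""


def stage_generated_output(generated_text: str) -> Draft:
    """Place AI-generated output into a draft that a human must review."""
    return Draft(generated_text=generated_text)


def apply_review(draft: Draft, approved: bool, edited_text: str, notes: str = "") -> Draft:
    """Record the reviewer's decision; only approved, human-edited text moves on."""
    draft.generated_text = edited_text
    draft.status = ReviewStatus.APPROVED if approved else ReviewStatus.REJECTED
    draft.reviewer_notes = notes
    return draft
```

The design choice here is that publication is gated on `ReviewStatus.APPROVED`, so hallucinated or off-tone output can be caught and corrected before it reaches end users.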
Further Resources on Testing Generative AI Applications
Overall, testing generative AI applications is a complex and evolving problem space that requires more attention. In this edition I have only highlighted a few important aspects. If you want to learn more about testing generative AI applications, you can refer to the following resources:
The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction
How to Secure Your Large Language Model Applications - at least partially: OWASP Top 10 for LLM
Learnings from classical software engineering and testing practices also apply
If you’re finding this newsletter valuable, share it with friends and co-workers.
Also, if you have feedback on how I can make the newsletter better, let me know via your preferred channel: on LinkedIn, on Mastodon or by leaving a comment on this newsletter edition 🙏
Thanks for reading Software Engineering Ecosystem!