Every now and then, I hear about "X testing", where X might be API, mobile, embedded software or just about anything that isn't a website. I normally scoff at the relevance of such distinctions, since they can usually be narrowed down to "understand your system, understand how to interact with it, and the rest of the differences are nothing you haven't seen before". Still, it is true that the good content will help you notice parameters you might not have considered, such as battery consumption for mobile, or heat and power fluctuation for embedded systems.
I got to listen to a recent episode of "The Testing Show" on AI testing, which was, sadly, another one of those infomercial-like episodes. It wasn't as terrible as the ones where they bring in people from Qualitest, but it was definitely not one of their better ones. The topic was "AI/ML testing", and it was mainly an iteration of the common pattern of "it's something completely new, look at all the challenges around here, also ML is really tricky".
This has prompted me to write this post and try to lay out the basics of testing ML-related stuff, at least from my perspective.
The first thing you need to know is what you are testing - are you testing an ML engine? A product using (or even based on) an ML solution? The two can be quite different.
For the past decade, I've been testing products that were based on some sort of machine learning - first a Bayesian risk engine detecting credit-card fraud, and then an endpoint protection solution based on a deep neural net that detects (and then blocks) malware. The systems are quite different from each other, but they share one aspect - the ML component was a complete black box, no different from any other 3rd party library included in the product. True, that particular 3rd party component was developed in-house and was the key differentiator between us and the competition, but even when you have a perfect answer to the relevant question (is this transaction fraudulent? is this file malicious?), there's still a lot of work to do in order to turn that into a product. For the endpoint protection product that meant hooking into the filesystem to identify file-writes, identifying the file type, quarantining malicious files and reporting the attack, all of which should be done smoothly without taking up too many resources on the endpoint itself, not to mention the challenge of supporting the various OSes and versions deployed in the field. None of which has any connection to the ML engine that powers everything. If you find yourself in a position similar to this (and for most products, this is exactly the case) - you are not testing ML, you are at most integrating with it, and you can treat the actual ML component as a black box and even replace it with a simulator in most test scenarios.
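To make the "black box" point concrete, here's a minimal sketch in Python. The interface and names are made up for illustration, not taken from any real product: the idea is simply that the product code only ever sees a narrow classification interface, so in most test scenarios the real engine can be swapped for a simulator that returns canned verdicts.

```python
# Minimal sketch, assuming a hypothetical narrow interface to the ML engine.
# The product depends only on this interface, so tests can plug in a simulator.
import hashlib
from typing import Protocol


class VerdictEngine(Protocol):
    def classify(self, file_bytes: bytes) -> str:
        """Return 'malicious' or 'benign'."""
        ...


class SimulatedEngine:
    """Test double: returns canned verdicts keyed by the file's SHA-256 hash."""

    def __init__(self, canned_verdicts: dict[str, str]):
        self._canned = canned_verdicts

    def classify(self, file_bytes: bytes) -> str:
        digest = hashlib.sha256(file_bytes).hexdigest()
        # Anything we didn't explicitly mark is treated as benign.
        return self._canned.get(digest, "benign")
```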
There are cases, however, where one might find themselves actually testing the ML engine itself, in which case start by admitting that you don't have the necessary qualifications to do so (unless, of course, you do). Following that, we need to distinguish again between two kinds of algorithms - straightforward and opaque.
Straightforward algorithms are not necessarily simple, but a human can understand and predict their outcome given a specific input. For instance, in my first workplace the ML was a Bayesian model with a few dozen parameters. The team testing the risk engine was using a lot of synthetic data - given specific weights for each "bucket" and a given input, verify the output is exactly X. In such cases, each step can be verified by regular functional tests; they might require some math, but if a test fails we can see exactly where the failure happened. In the Bayesian case, calculating a weighted score, normalizing it, recalculating "buckets" and assigning new weights are all separate, understandable steps that can be verified. If your algorithm is a straightforward one, "regular" testing is just what you need. You might need a lot of test data, but in order to verify the engine's correctness, you just need to understand the rules by which it functions.
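As a toy illustration (not the actual risk engine - the features, weights and numbers below are invented), a straightforward weighted-score model can be pinned down with an exact, step-by-step assertion:

```python
# Toy illustration of a "straightforward" model: a weighted score that can be
# verified step by step with synthetic data. All weights and features are made up.
def weighted_risk_score(features: dict[str, float], weights: dict[str, float]) -> float:
    raw = sum(weights[name] * value for name, value in features.items())
    total_weight = sum(weights.values())
    return raw / total_weight  # normalize by the total weight


def test_known_input_gives_known_score():
    weights = {"amount": 0.5, "country_mismatch": 0.3, "night_time": 0.2}
    features = {"amount": 0.8, "country_mismatch": 1.0, "night_time": 0.0}
    # 0.5*0.8 + 0.3*1.0 + 0.2*0.0 = 0.7, divided by a total weight of 1.0 -> 0.7
    assert abs(weighted_risk_score(features, weights) - 0.7) < 1e-9
```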
Opaque ML systems are a different creature. While it is possible to define the expected output of the algorithm given the state it's in (unless it has a random element as well), there's little use in actually doing so, since it would not help us understand why a specific answer was given. The notorious example is deep neural networks, which are nothing short of magic. We can explain the algorithm for transitioning between layers, the exact nature of the back-propagation function we use and the connections between "neurons", but even if we spot a mistake, there isn't much we can do besides feeding it through the back-propagation function and moving on to the next data point. In fact, this is exactly what is being done on a massive scale while "training" the neural net. With opaque systems, testing is basically the way they are created, so accepting them as fault-free is the best we can do.
That being said, ML algorithms are rarely fault-free, and this brings us back to the point we mentioned before - most of our product is about integrating ML component(s) into our system, and we should focus on that. The first thing is to see whether we can untangle the ML component from the rest of our system. We can either mock the response we expect, or use input that is known to produce a certain result, and see that our system works as expected given a specific result from the ML component.
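A product-level test might then look roughly like this sketch (the names are hypothetical - handle_file_write stands in for whatever our product does once the engine answers): the verdict is stubbed out, and the assertions are only about the product's reaction to it.

```python
# Hypothetical sketch: the product logic receives the engine as a dependency,
# so the test stubs the verdict and asserts only on the product's reaction.
from unittest.mock import Mock


def handle_file_write(file_bytes, engine, quarantine, reporter):
    """Product-side logic: what to do once the (black-box) engine answers."""
    if engine.classify(file_bytes) == "malicious":
        quarantine.move_to_quarantine(file_bytes)
        reporter.report_attack()


def test_malicious_file_is_quarantined_and_reported():
    engine = Mock()
    engine.classify.return_value = "malicious"  # known result, injected
    quarantine, reporter = Mock(), Mock()

    handle_file_write(b"not a real malware sample", engine, quarantine, reporter)

    quarantine.move_to_quarantine.assert_called_once()
    reporter.report_attack.assert_called_once()
```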
Clever, just assume that correctness is someone else's problem, and we're all peachy. Right?
Well, not exactly. Even though in most cases there will be a team of data scientists (or is it data engineers now?) building and tuning the model, there are cases where we need to fill some gaps ourselves and figure out whether the model is actually as good as we think it is. Maybe we've purchased the ML component and take the vendor's claims with a grain of salt.
A lot of the potential faults can be spotted when widening the scope from pure functionality to the wider world. Depending on what our ML solution does, there's a plethora of risks - from your chatbot turning Nazi, to describing black people as apes, to being gender-biased in hiring, and that's before considering deliberate attacks that will drive your self-driving car into a ditch. To avoid stupid mistakes that make us all look bad in retrospect, I like to go through a list of sanity questions - ones that have probably been addressed by the experts who built the system, but just in case they got tunnel vision and forgot something. The list is quite short, and I'm probably missing a few key questions, but here's what I have in mind:
- Does it even make sense? Some claims (such as "identifying criminals by facial features") are absurd even before we dive deeper into the data to find the problems that make the results superficially convincing.
- The "Tay" question: Does the software continues to learn once in production? If it does - how trustworthy is the data? what kind of effort would be needed to subvert the learning ?
- The Gorillas question: Where did we get the training data from? Is it representative of what we're expecting to see in production?
- Our world sucks question: Is there a real-world bias that we might be encoding into our software? Teaching software to learn from biased human decisions will only give that bias a stamp of algorithmic approval.
- Pick on the poor question: Will this software create a misleading feedback loop? This idea came from Cathy O'Neil's "Weapons of Math Destruction" - predictive policing algorithms meant that cops were sent to crime-ridden neighborhoods, which is great. But now that there are more cops there, they will find more crimes - from drunk driving to petty crime or jaywalking. So even if such a neighborhood returns to a normal crime rate, it will still get more attention from the police, which will make the lives of its residents more difficult.
- "Shh.. don't tell" question: Is the model using data that is unlawful to use? Is it using proxy measures to infer it? Imagine an alternative ML based credit score calculator. It helps those who don't have a traditional credit score to get better conditions for their credit. Can it factor in their sexual preference? And if they agree to disclose their social profiles for analysis, can we stop the algorithm from inferring their sexual preferences?
After asking these questions (which really should be asked before starting to build the solution, and every now and then afterwards), and understanding our model a bit better, we can come back and try to imagine risks to our system. In security testing (and more specifically, threat modeling) there's a method of identifying risks called "movie plotting", where we assemble a diverse team and ask them to plot attacks as if for a movie like "Mission: Impossible". This idea can work well for identifying risks in incorporating ML components into our business, with the only change being that the movie plot will be inspired by "Terminator" or "The Matrix".
And yet, the problem still remains: validating ML solutions is difficult and requires different training than most software testers have (or need). There are two tricks that I think can be useful.
- Find an imperfect oracle: it could be that you have competitors that provide a similar service, or that there's human feedback on the expected outcome (you could even employ the Mechanical Turk). Select new (or recent) data points and compare your system's results with the oracle's - the oracle should be chosen such that every mismatch is not necessarily a problem, but is something worth investigating. Keep track of the percentage of differences (see the sketch after this list); if it changes drastically, something is likely wrong on your side. Investigate a few differences to see if your oracle is good enough. In our case, we try a bunch of files that enough of our competitors claim to be malicious, and if we differ from that consensus, we assume it's a bug until proven otherwise.
- Visualize. Sometimes, finding an oracle is not feasible. On the other hand, it could be that people can easily spot problems. Imagine that Google had placed screens in several offices, projecting crawled images alongside what the ML thinks they are. It's quite possible that an employee would have seen it classifying people as apes well before real people were offended by it.
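Here's a rough sketch of the imperfect-oracle bookkeeping mentioned above. All names, sample verdicts and the threshold are made up for illustration; in practice the threshold would come from your own historical mismatch rate.

```python
# Rough sketch of the "imperfect oracle" trick: compare our verdicts against a
# consensus of other vendors and watch the mismatch rate, not individual diffs.
def mismatch_rate(our_verdicts: dict[str, str], oracle_verdicts: dict[str, str]) -> float:
    shared = set(our_verdicts) & set(oracle_verdicts)
    if not shared:
        return 0.0
    mismatches = [sample for sample in shared if our_verdicts[sample] != oracle_verdicts[sample]]
    return len(mismatches) / len(shared)


def test_mismatch_rate_is_within_expected_band():
    ours = {"file_a": "malicious", "file_b": "benign", "file_c": "benign"}
    consensus = {"file_a": "malicious", "file_b": "malicious", "file_c": "benign"}
    rate = mismatch_rate(ours, consensus)
    # 1 mismatch out of 3 shared samples; each mismatch is worth investigating,
    # but the alarm here is a drastic change in the rate, not any single difference.
    assert rate <= 0.35  # made-up threshold, tune from historical data
```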
With that being said, I want to circle back to where I started: while I hope you've gained an idea or two, testing is testing is testing. With any sort of system you test, you should start by understanding just enough about it and how it interacts with your business needs, then figure out the risks you care about and find the proper ways to address those risks.