Tuesday, June 18, 2024

Testing a cloud-native product



In the past several months the company I work for has been building its first cloud-native product, and I was tasked with figuring out how to test it, given the limitations that developing in the cloud poses for our kind of application. This added a few challenges that made everything that much more complicated, and while there is plenty of material online about tools that can help with testing in the cloud, and specifically on AWS (which was my focus), I couldn't find a holistic overview of how to approach testing such a project. I hope this post will help remediate that gap, and I hope even more that it will annoy enough people into pointing me to articles that already exist and I simply missed, just to prove me wrong.

I had noted the following difficulties, which may or may not be relevant to your project as well:

  1. Our product deals with AWS accounts as its basic unit of operation - we deploy a service that protects the account's data. As such, each environment needs its own account, so deploying multiple environments for testing (and development) is not really feasible - it's expensive, it's slow, and deleted accounts stay live for 90 days or so.
  2. The product is designed as several services communicating with each other, maintained by several different teams and spread across several repositories.

So, how can we bake feedback into our SDLC? It didn't help that we hadn't yet defined said SDLC, but we're not here because it's easy, are we? Besides, this is also an opportunity to define that process in a way that enables feedback instead of being an obstacle.

The solution that made sense in my mind was the one I found in Dave Farley's book "Modern Software Engineering" (I wrote about it here), which basically says that testing the entire product before it is deployed to production is something you don't get to do in a services architecture. It's a really cool concept - having multiple, individually deployable components, each reaching production in its own time. There was only one caveat to this approach - there's no way in hell I'll manage to convince the organization it's the right thing to do, and even less of a chance of getting the necessary changes in our processes and thinking in place. However, I might be able to use the constraints we do have to push at least partially in that direction.

The plan, if I'm honest, assumes the old and familiar test pyramid, only this time presented as "well, we can't do the ice-cream cone we're used to, we don't have enough environments."

So, the vision I'm trying to "sell" to the organization is as follows:

  1. Yes, you can have your nightly "end-to-end", system-wide regression suite. However, unlike other products, where the bulk of testing actually happens at this level, the nightly run will focus on answering the question "is it really true that all of the parts can work together?" For this purpose, we want a small number of happy-flow tests that run through the various features we have.
  2. The bulk of testing should happen at the unit-test level. While our organization still has a lot to learn about using unit tests properly (starting with "let's have them as a standard"), at least for the small services it's really easy to cover all of the edge cases we can think of in unit tests. It also happens that for AWS there's an awesome mocking library called "moto", which provides a pretty decent and super easy to use mock of a lot of AWS functionality without needing to change the existing code (there's a short sketch of such a test right after this list). In this layer we'll verify our logic, error handling, and anything else we can think of that can be tested at this level.
  3. Still, not everything can be unit tested, and on top of our logic, we want to check that things work on the actual cloud before merging. Therefore, we are attempting to build a suite of component tests that will deploy parts of our system and trigger their sub-flows. For example, we have a component that scans a file and sends the verdict (malicious or benign) to another component that executes the relevant action - so we can deploy only the action-executing lambda alongside a bucket and start sending it instructions to see that it works (see the component-test sketch below). One benefit we get, apart from knowing that our lambda code can actually run on the cloud and isn't relying on dependencies it doesn't have, is that we must keep our deployment scripts modular as well (I did mention small steps, right?)
  4. We have a lot of async communication, and a lot of modules depend on one another to send data in a specific format, otherwise they break. To expose those kinds of problems early, we are introducing Pact for contract testing - the Python version of the tool is a bit limited at the moment, but it has most of what we need (a Pact sketch closes the examples below).
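
To make item 2 a bit more concrete, here's a minimal sketch of what a moto-based unit test can look like. The bucket name and the tag_if_malicious() function are made up for illustration - the point is that plain boto3 code runs unchanged against moto's in-memory fake of S3, with no real account or credentials involved.

```python
import boto3
from moto import mock_aws  # on moto 4.x and older this would be mock_s3


def tag_if_malicious(bucket: str, key: str, verdict: str) -> None:
    """Toy stand-in for the code under test: tag a scanned object with its verdict."""
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "scan-verdict", "Value": verdict}]},
    )


@mock_aws
def test_malicious_file_gets_tagged():
    # Everything below runs against moto's fake S3 - no real AWS account is touched.
    s3 = boto3.client("s3", region_name="us-east-1")
    s3.create_bucket(Bucket="scanned-files")
    s3.put_object(Bucket="scanned-files", Key="report.pdf", Body=b"dummy")

    tag_if_malicious("scanned-files", "report.pdf", "malicious")

    tags = s3.get_object_tagging(Bucket="scanned-files", Key="report.pdf")["TagSet"]
    assert {"Key": "scan-verdict", "Value": "malicious"} in tags
```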
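
For item 3, a component test is mostly ordinary test code pointed at a partial deployment. The sketch below assumes the action-executing lambda and a test bucket were already deployed for this run; the function name, bucket name, event shape, and the "move to quarantine" behaviour are all illustrative assumptions rather than our actual interfaces.

```python
import json

import boto3

# Hypothetical names - whatever the partial deployment created for this test run.
LAMBDA_NAME = "action-executor-component-test"
BUCKET = "component-test-scanned-files"


def test_malicious_verdict_quarantines_the_file():
    s3 = boto3.client("s3")
    s3.put_object(Bucket=BUCKET, Key="incoming/report.pdf", Body=b"dummy")

    # Trigger the sub-flow directly instead of going through the whole system.
    event = {"bucket": BUCKET, "key": "incoming/report.pdf", "verdict": "malicious"}
    response = boto3.client("lambda").invoke(
        FunctionName=LAMBDA_NAME,
        Payload=json.dumps(event).encode(),
    )
    assert response["StatusCode"] == 200

    # In this made-up flow, a malicious verdict means the object gets moved to quarantine.
    listing = s3.list_objects_v2(Bucket=BUCKET, Prefix="quarantine/")
    keys = [obj["Key"] for obj in listing.get("Contents", [])]
    assert "quarantine/report.pdf" in keys
```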

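And for item 4, a consumer-side message pact might look roughly like this, assuming pact-python's message pact support; the service names, event payload, and handle_verdict handler are placeholders, not our real contract. Running the test produces a pact file under pacts/ that the producing service can later verify against, which is what catches the "you changed the format and broke me" class of bugs before the nightly run does.

```python
from pact import MessageConsumer, Provider

# Placeholder service names - the real contract belongs to the two teams involved.
pact = MessageConsumer("ActionExecutor").has_pact_with(
    Provider("FileScanner"), pact_dir="pacts"
)


def handle_verdict(event: dict) -> str:
    """Toy stand-in for the real consumer-side handler."""
    assert event["verdict"] in ("malicious", "benign")
    return f"handled {event['file_id']}"


def test_scan_verdict_contract():
    expected_event = {"file_id": "abc-123", "verdict": "malicious"}

    (
        pact.given("a file has been scanned")
        .expects_to_receive("a scan verdict event")
        .with_content(expected_event)
        .with_metadata({"contentType": "application/json"})
    )

    with pact:
        # If the handler blows up, the pact file isn't written and the test fails;
        # otherwise the generated pact is ready for the provider side to verify.
        handle_verdict(expected_event)
```
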
So, to make it short - during a pull request we will run unit, component, and contract tests, and once a night we'll run our system tests. I have yet to tackle some other kinds of testing needs, such as performance, usability, or security, but we already have a worst-case solution: spin up a dedicated environment for that kind of test and get the feedback just slightly slower than the rest of the tests. I'm hoping that we'll be able to move at least some of those into the smaller layers of testing (such as having a performance test for each component separately), but only time will tell.

I expect that we'll leave some gaps in the products that already exist and are used in this new project, but I hope those gaps will be managed well enough.

One question that popped into my head as I was writing - is this the approach I would choose for similar applications that don't have the major constraint on environments? From where I'm standing, the answer is yes - even if it's easy and cheap to spin up a lot of test environments, testing each component in isolation and focusing on pushing tests downwards simply provides a lot of power and enables kinds of tests that would be really difficult in a system-wide context.

So, those are my thoughts on testing cloud-native, service-oriented applications. What do you think?
