אשרי אדם מפחד תמיד Happy is the man who always fears

Monday, October 7, 2024

Pay to stay?

So, I stumbled upon (or rather, saw it in Alan Page's five for friday) this blog post explaining why companies should "pay their people to stay" that is - offer the people who are already working and having good impact on the company a competitive salary. I read the argument there, and it's convincing at first glance, at least up until the point you start thinking about it. For those who don't want to read the entire post, I'll sum it up - People currently working for the company are becoming more valuable as they stay, since they gain domain knowledge and become more effective in the specific thing the company does. Therefore, the argument goes, it's worth paying those individuals above market value for those skills.

So, what's wrong in this argument? Oh so many things.

First of all, the assumption that if you pay well people won't leave is not realistic. People leave for a lot of different reasons - they might look for a position that is not available, they might move to live elsewhere, or even just feel that they want to change domain or learn a new area of technology. That's assuming they don't run away from a bad boss. So, let's just put this on the table. You might be paying well, only to retain 10% of the people you want (or it might be 90%, but I doubt it).

Second, let's assume this strategy works - if we're doing it, other will too. In this case, let's look on the options our star employee with 5 years of tenure. Before we implemented the suggested change, their salary would get a nice 20% bump if they would be hired for their current level. Now, we've implemented a tenure-based salary increase that compensates for that, and their salary is 10% above market. They interviewed with foo-corp, which finds them a perfect match for what they are looking for. They can keep on looking, knowing that many other suitable candidates might have the same conditions, or they can offer above market rates, resulting in the market value of the job rising and our formula will need to be updated. Since companies do have some limits on their budget, this race to the impossible salary will stop somewhere, and on the way might have other side effects (such as "no raises for the first three years" or just underpaying juniors and deepening the imbalance between higher and lower paid workers). Another possible side effect if such practices are common, people might be hesitant to interview people who have been in their current workplace for over 5 years, pushing people who care about their career to start looking sooner just to avoid being labeled this way if they do need to change jobs in the future.

So, the financial model is not really convincing. But let's assume for the sake of the argument that it does, that we've found a magic formula that actually helps retaining people in significant numbers in a financially reasonable way. Do we actually want this? The post assumes that tenured people are more valuable, but that's far from being universally true. There are several costs to having people, even good ones, stay. Here are some disadvantages:

Reducing our bus factor. If we nurture those super-valuable people and they stay forever, they become more and more central to our organization, despite honest attempts they will gather more knowledge and expertise, and more people will rely on them, when they finally leave, or retire, it will be ten times more painful than if they left after only five years.
They prevent change. The people who are being retained are central to the current way of working. The learn the organization and grow accustomed to it. When they say "we've always done it this way", or "we've tried it and it didn't work" it will carry even more gravitas.
Less mobility means less ideas travel in the industry - yes, we might meet in conventions and meetups, but nothing can compare to working in a different setting and seeing your previous ideas about what's right just burn to the ground in a new context.
Slower growth to the others: When people who are central to the organization leave, there's some sort of vacuum left behind them in all the things they were just the go-to person. When they leave, this vacuum is filled by one or more people, who now need to fill boots bigger than they wore before, and most of the time they grow to do this well, and give it their own spin. Thus the industry gains more skilled people, the company gets to have the same tasks benefit from multiple viewpoints and the people themselves are gaining skills and sometimes a promotion. Yes, there are growing pains, and the right person leaving in the wrong time might mean those pains can topple a company, but by having multiple, smaller, holes to fill, we avoid the giant crater.

One final point - I'm not advocating to push your senior people out (I keep this recommendation to cases where you're starting to think "Uh, oh - if this person leaves we're all doomed"). If you're able to create a workplace that has a good balance between compensation, culture and professional interest people will stay and you will benefit from that extra amount of domain knowledge they learn, knowing that other companies will have a hard time matching you on such a fine balance. But don't fret if they leave. The graveyards, after all, are filled with people who couldn't be replaced.

Thursday, September 26, 2024

Wiring the winning organization - Book review

Book cover

Or, in its full name: Wring the Winning Organization: Liberating Our Collective greatness through Slowification, Simplification and Amplification. By Gene Kim and Steven J. Spear.

It's a bit of a mouthful, but this matches perfectly the ambition of the book's authors: To devise a theory of everything. Or, at least, of business operational success - any business. Naturally, in order to do that, they are abstracting quite a bit, but despite everything they manage to create something that sounds convincing and powerful. It's a great book, but it suffers from a major drawback for me - it's a managers oriented book, which fits their theory (short version - it's all about management) but is not as actionable for me as I would like it to be. I can still use the insights from it to reason with my managers, and to set some goals I will want to advocate for, but it puts one heck of a nail in the coffin of "driving a change bottom-up" idea.

So, to the theory of everything.

The authors define 3 layers of "stuff" that happens in just about any business:

Layer one: "Technical Object". Or, in other words, the doing. This is what the skilled people are doing - be it coding, chiseling a piece of rock or selling timeshares to elderly people.

Layer two: The tools. Those are the tools used to do the things in layer one, the hammers and jigsaws, the build systems and expensive electronic microscope.

Layer three: The social circuitry. This is how your organization is structured, how data flows, how dependencies are managed and how decisions are being made.

The radical claim the authors make is that while there are a lot of differences between organizations in their layers 1&2 needs, in layer 3 it's pretty much all the same. Therefore, it stands to reason that in this layer, good practices can be transferred between industries, which we can see happening in some places (the one I remember is the Toyota manufacturing approach being an inspiration to devops movement in software development). The book's main message is that in order to increase productivity & quality while decreasing costs and making employees happy and motivated, you "only" need to do three things. They name them "slowification, simplification & amplification". Do those, and you're on the right path. Note - there's a distinction between "simple" and "easy". Each of these words is packing a whole lot of work within it.

The first term is a word they made up to carry the meaning they were looking for of "slowing down to make things better" (they say that in German there is "Verbesserung", which my limited knowledge just says means "improve", but I might miss some context). The idea is that since when under pressure we revert to our habits and instincts instead of thinking (in Daniel Kahnman's terms that would be "system 1(fast) thinking"), we need to slow down to engage our brains (system 2, or "slow" thinking). This can happen at many levels - some phases, such as planning\design are slow by nature, and might only need some small nudges to be more conscious, we can use a testing environment that is low-risk and we can control to a greater degree, and we can find ways to slow even the performance environment (that's the name they use for "production", as it's more broadly applicable). Let's take for example a basketball game - before a game, the coach might go over some tactics with the team, draw some sketches on the board, so that players won't have to communicate the plan while they are running after the ball. This way they have time to suggest adjustments and deliberate alternatives. Then, when in practice, they might stop everything to perfect the way they pass the ball around, and during the game, a player can take a couple of seconds dribbling to re-orient and plan the next move. Slowing down is not always possible - a player jumping for the rebound ball can't pause to think, so in this instance they will have to rely on the instincts they've built during practice.

Simplification, thankfully, retains its dictionary meaning. in order to succeed, we want to solve easier problems. They mention three techniques to achieve this: Modularization is the way we partition a large problem to smaller yet coherent problems. Incrementalization is their way of saying "baby steps", where we establish some solid known basis (through standards, automation, documentation and so on), and invest efforts in learning the next new thing and linearization is creating a sequence where actions are depending at one another.

Most of the book is alternating between explaining the three concepts (or is it principles?) and analyzing success and failure examples according to them. They show, for instance, how the Wright brothers were able to get an airplane flying in a fraction of the cost of the (failing) competition because instead of trying to build one complete plane after another, they did a lot of experiments on various parts such as testing only the wing design elevation power, without attaching it to an actual plane, or using a kite in order to test the ability to steer by curving the wings. Those examples make a pretty good case for the main theory, but the one thing that was probably the reason for the book being that convincing was the opening "vignette" which is available in google books for free (link, I hope), they do a much better job of describing it than I could, but the main idea is describing how poor management and social circuitry are making a mess of something relatively simple such as managing moving furniture out of hotel rooms and painting them.
In between all of those, there are a few insights that are worth mentioning, such as the realization that the social circuitry should match the way the technical layers are organized (which, if we think about it, is a take on Conway's law that reverses the causal direction. It's not that "since our org structure is X, all of our systems will be X compatible", but rather "since we want to produce X, we need to structure our org in a manner that will support it"), or a motivational explanation of why the framework presented in the book works (in short: simpler, more isolated problems enables more minds to engage with the problems at hand). The most depressing insight they share, which I have already mentioned, is that those things don't come bottom-up, to really have an impact, one must have executive power. I'm guessing that as an individual, I can use some of these ideas to structure the work around me to some limited extent, but the truth is - simplifying most of the problems I deal with will require change in multiple other teams, so I can nudge, convince, and haggle - but I can't really control what's going over there, so my effect will be limited.

So, all in all - great book, especially if you're somewhere in the C-suite. I wonder if it should be taught in management courses.

Wednesday, July 3, 2024

I have released an open source library

עברית

In my past two jobs we always talked about collecting some tests metadata to analyze later, but never had time to collect it. Well, I had a few moments between jobs to create an easy to extend (I hope) library for pytest. It might or might not help me in future workplace, but I'm hoping it will help someone.

Taking the perspective of building a library for unknown users was an interesting experience - I obviously needed to put more emphasis on documentation, but more importantly, I needed to think "how can a user customize this for their needs?" One result is that I chose to omit functionality - for instance, I'm not providing a database storage utility, but leaving that for the user who knows which DB and schema are they using. I still consider creating an example project using this library, but we'll see if I have time and attention for that.

Anyway, its free to use, modify and so on. If you need help, ping me (or better - open an issue)

PyPi: https://pypi.org/project/pytest-stats/

Github: https://github.com/amitwer/pytest-stats

יצרתי ספריית קוד פתוח

English

בשני מקומות העבודה האחרונים שלי נורא רצינו לאסוף נתונים על המבדקים שאנחנו מריצים כדי שנוכל לעשות אנליזה רוחבית אחר כך, אבל אף פעם לא היה לנו את הזמן לבנות את התשתית לזה. ובכן, היה לי קצת זמן בין העבודות, אז בניתי תשתית איסוף נתונים עבור pytest. האם זה יעזור לי בעתיד? אלוהים גדול. אני מקווה שזה יעזור למישהו.

לבנות ספרייה עבור משתמשים שאין לי מושג מי הם זו חוייה מעניינת. מעבר לצורך הברור בתיעוד טוב יותר, פתאום מצאתי את עצמי חושב על "איך המשתמש יכול להתאים את זה לצרכיו?" אחת התוצאות היא שזרקתי החוצה חלק מהפונקציונליות. למשל - לא כתבתי מימוש מחדלי לשום מסד נתונים, כי אין לי מושג איך המשתמש ירצה שדברים ייראו אצלו, באיזה סוג מסד נתונים הוא משתמש או אילו נתונים הוא ירצה לשמור. תופעת לוואי של זה הייתה פרוייקט פשוט יותר למימוש - אני דואג לאסוף את הנתונים, והכתיבה היא בעיה של מישהו אחר. אולי בעתיד אצור פרוייקט לדוגמה שישמור נתונים איפשהו רק כדי להדגים איך עושים את זה, אבל נראה אם יהיה לי זמן או קשב כדי לעשות את זה.

בכל מקרה, מוזמנים להשתמש בעצמכם. אם אתם צריכים עזרה, צרו קשר (או, עדיף, פתחו באג\שאלה בגיטהאב).

PyPi: https://pypi.org/project/pytest-stats/

Github: https://github.com/amitwer/pytest-stats

Tuesday, July 2, 2024

closing 5 years, retrospect

After just a bit more than five years, today was my last day at Deep Instinct, and that's a great time for some reflection. I've seen some good, I've seen some bad, and I managed to learn from both. It's a bit daunting to try and pack five whole years into a single post, so I'll just paint an image in wide brush strokes. It will be inaccurate, and I'll miss a lot, but it is what it is.

Also, it will be long, bear with me.

Year 1

This was a year of growing, and of adjusting expectations. it was my first time working in a start-up, with only one previous workplace, and I was up for a surprise. There were no working procedures, minimal infrastructure, and the very strange part - it seemed like everyone were ok with that. In this year I learned how does it look to have a project that can make or break the company, I learned that building communication channels takes a lot longer than I anticipated, and that what I believed were industry standards were not necessarily common.

Our main focus on this year has been to build a team and to create the necessary infrastructure for system testing our products. So we spent some time hiring (note for readers - if you aim above the standard skill-set in your area, you have a long and arduous process ahead of you), and I found out that with more than 5 junior people to shuffle around (I wouldn't really call this mentoring) I was overburdened. I came with an idea on how should our framework look like, and was lucky enough to have someone in the team come up with a better suited idea. There was a lot of work done on the technical side, and not a lot done on integrating with the other teams or the product work. We simply were not yet ready. by the end of this period we had a working system test framework, a team of ~12 people, and we've proven our value to the organization.

Years 2 & 3

Those are not actually full calendar years, but it's a nice title for the period after we've made enough catching up and focused on integrating with our environment. Some of this effort was being done during the first period, but this was more in the way of laying some foundations for the future. Now we've started turning up the notch on talking with other groups. This is where we've started feeling the missing procedures and culture around us - getting invited to a design review is quite easy, but getting invited to a design review that does not happen is a different thing altogether. We've hacked around and compromised a lot - starting with just asking for someone to give us some sort of a handover in lieu of the design review would be one such example. In other places, where we did join the table, we made sure to provide value so that we will be invited again, and we volunteered to help other groups use our infrastructure for their end, partially to gain a reputation boost (but mostly because it was needed for the company). We had a great opportunity when a new product was starting from scratch (ish) and we provided them with a dedicated person. We then used this team as a model to say "this is how we think we should work", thus initiating the third phase of our 4 steps plan (worded a bit differently, I wrote about it here. Roughly labeling, the steps are stabilize, grow, disperse and disband) - we didn't actually complete the growing part: we didn't have a good enough infrastructure to share, and most of the people we had were not skilled enough in testing yet, but reality is messy, you don't get to complete things cleanly before moving on.

We also struggled with our own growth - when our team size neared 20, and the number of products grew, we started feeling the pain of stepping on each other's toes and trying to focus on too many things. So we split into two teams, which in turn required us to split our code-base to match - a realization we got to a few months after the team split, since we now had people focusing on one task and we didn't want us distracted by noises from the other side of the split. Another aspect of this split was that I found out I'm a difficult employee - as part of the split my then manager became a group lead, and a team lead was recruited. At the point where she has arrived, all of the team were people I either recruited or welcomed to the team, and I was a nexus of knowledge about most of the things happening around our team. That is to say - while the authority was hers, I had way more social power. It took us a while to sync, and there was this one time where I used my excessive power and clashed with her in front of the entire team - I knew I was wrong a few minutes after doing so, but the damage was done. I got a good lesson on the difficulty of apologizing publicly, and later we've both synced and balanced our power in a more suitable way, so I'm guessing it all worked out in the end, but I hope I will remember this lesson for the future.
We have also faced another problem - A lot of people were leaving us to pursue other opportunities. There have been a lot of factors to this - we have hired mostly juniors out of university, we were a small start-up with a lot of changes, but there were two factors that bothered me. First, there was the feeling that in order to progress in one's career, one had to get out of testing. The second was that in our company's culture, testers are seen as a second class employees - I'm pretty sure no one will admit to this even to themselves, but it can be seen in all of the tiny decisions - who gets invited to early discussions, how often do people feel comfortable telling you how to do your job, who gets credit for work done, and so on. It ties back nicely to the first problem, but it was noticeable enough to deserve its own place. It took me a while, but the first problem has led me down the path that led to building an in-testing career growth model. The second part of the problem was one I did not address at all, and when I look back on, there were probably some actions I could have attempted. The way I see it, there are 3 main reasons for testers to be on the bottom of the food chain: First is the reputation of the profession in the outside world, which is beyond me to change. Second is the fact that in most places, testers who write code are less capable than "proper developers" (which leads to a vicious cycle of hiring with low standards and reinforcing the conception of testers having inferior skills) and the third is that testing have no explicit contribution: outcomes such as risk reduction, placing safeguards, and maintaining test code are part of other roles as well, especially when actively pushing towards full integration of testing into the teams. I'm still bashing my head occasionally against this question, with recent thoughts influenced by "Wiring the winning organization" (A book review will come soon, I hope).

All in all, there were a lot of challenges during this period, but it was a good time.

Year 4

Year 4 has started on January 2nd, 2022. those with a keen eye and access to our HR system might notice that there are 2 months missing between my actual 4th anniversary on April 2nd and this date, but as I mentioned, the previous years have been so only in the most general way. In that date, we had a new VP engineering (it's formally named VP R&D, but we do have a separate research department) join us and replace the person who was filling this role at the time.

I learned one thing during this time - While it's true that Rome wasn't built in a day, it didn't take that long to burn it down. At first, I thought it's just a style of communication and trying to prove that he's the boss, but soon enough there were so many examples of bad leadership that I couldn't push down the fact that I was dealing with a bully. A few red flags for the future, and perhaps for the readers as well:

Everyone who were here before the bully are stupid, as is every decision made.
People that are still here need a firm sheriff to teach them how to do their work.
All decisions should go through the bully, if there was a decision made without him, it will be reversed.
Listening is for other people.

I could see the damage really fast - the number of crises that required people to stay up late and to work on weekends skyrocketed, communication between teams was reduced, and everyone tried to make sure that when something goes wrong, they will have someone to point at and say "I did my part, so it must be those people over there". Add to that a replacement of 2 of the 3 group leaders within 3 months (one quit, the other was fired in the most insensitive manner), and it's no surprise that things were a mess. About a month after the VP has joined, he announced on two major projects - a complete rewrite of our SaaS product, and a shift to scrum. Both failed spectacularly.
Let's start with the technical project - As it is with many companies, there are more urgent tasks than there is time or people, so shifting people to work on the project would mean slowing down on the roadmap, which can feel fatal for a startup. Also, do recall point #2 - people other than the VP are incompetent by definition. We also have no real documentation of the system we are aiming to rebuild and no experience is creating good requirements or specification even for smaller projects. The solution? To hire an external company to build part of our core business components. Yes, it is stupid as it sounds. Yes, it was said in some circles at the time. Saying it to the VP would have been hopeless, and an invitation to be bulldozed.

So, while this project is happening, the other project can't happen without the people in the company, right? Moving to Scrum is a people and processes task. Well, this has botched as well, and I wrote about it in some length already here. Since that time, the project has concluded and I got to talk with one of the consultants and asked him why some very basic things didn't happen, or at least mentioned as goals and his answer was very respectful and general, as he was masking the fact that the VP who hired them also blocked most of their initiatives (I did ask if I understood correctly, and got the "I can't answer this directly" kind of answer).

So, why did I stay? At first, I was mostly shielded by my skip-level manager, the one remaining group leader. In July I approached him to tell that I have a job offer, but if he can promise me that he won't break in 6 months, I'm staying. His response was that while everything is dynamic and he can't make promises, at the moment he has no thoughts of leaving and is not looking around for alternatives. A month later he called me to tell that he got a job offer and is accepting it. Well, that's life, and I wished him well. Then I took my time - things were not terrible for me personally, so I thought to choose a place I'd be happy to. In January 10th 2023 I got an offer from a place that I thought could be good enough, but I took some time to deliberate. The 4th year has ended at January 15th 2023, when we were told that the VP has chosen to pursue other opportunities and will be leaving effective immediately. The next day I told the place that offered me a position that I'm staying.

Naturally, there was a lot more going this year. For instance, I got to do some close mentoring with some people, which was very interesting - I worked formally with two people, one who joined us in order to stretch his abilities, and was very receptive to my attempts in teaching. I found out that I need to learn how to map the skills and gaps of my colleagues and then find proper way to teach those. For instance, I did manage to see that some of my mentees were struggling with a top-down code design, and then found that I'm not sure how to teach that (tried to force TDD, I think it worked to some extent), other skills such as modeling or communicating confidence - I don't think I managed to teach as much.

It was also the year where I could put my knowledge in testing theory to use, even if a bit violently. The story was as follows: After my skip-level manager has skedaddled, another was brought in his place, though without either his skills, his care or his work ethics (though he does hold the record of being the fastest person to be fired that I've seen). Shortly after that, there was a crisis (did I mention there were a lot of those?) a product that has been developed for an entire version without any sort of testing or attention of testers (all of the testers were busy on the previous crisis) and without mitigating this problem by telling the people working "ok, you don't have testers, do your best", was now blocked by all of the tickets in a "ready for testing" state. The new group lead, eager to show... something, has stated (or, perhaps, iterated the bully) "no compromises on testing, I want full test plans", which, given the deadlines we were facing was simply not possible, so I did some rough estimation and got to somewhere between 13 and 40 days just to write the test plans for the 42 features, let alone executing them, to which I got the lovely response "we need to be more agile". I'm still quite proud of not responding snarkily "agile doesn't mean cutting corners and doing a shoddy work", instead I suggested that we'd use SBTM, which won't solve our problems of not having enough time, but will mean that we'll start working and finding problems faster. I might have exaggerated its benefits a bit, but I can now say that I got to win an argument only because I was more educated on testing than the other side and could slap some important looking documents at them (I have James Bach to thank for creating this document) , so yay me. Oh, also there was the part where we didn't waste a ton of time.

Year 5

So, the bully has gone. There were some rumors around the cause of him getting fired, and sadly, "someone noticed he's menacing the entire organization" was not one of the options (in fact, another rumor had it that they actually had another VP coaching him on the bullying part, which, if true, saddens me greatly because it means his behavior was known). Now we had a chance to start a healing process and make up for the year of moving backwards (my feeling was that our culture went back to a state which is roughly equivalent to what was a year before the bully, but that's just a guess). We had several challenges to overcome - we had a beaten down department, we had lost a huge number of people (while I can't tell exactly the numbers, today, about half of our engineering department were hired in the year and a half after he was fired), the group leaders we had were hired specifically to match his management style - to relay his decisions and to have as little agency as possible and make no decisions themselves. Personally they were nice, but didn't have the skills to lead a healing process, and finally is the person who has replaced the bully, that I think is a good manager, but took on an impossible task - managing two departments (he was, and still is, our VP of research) with over 100 people at the time, and having the style of a good manager, which is to delegate a lot to his direct reports, which would have been great had we the right people and the right structure in place. Add to that the market pressure as funds of the last round were slowly but decisively dwindling, and the result is that we had an absent VP. On top of all that, since last October we got a war in Israel, with many people, the new VP included, were called for months of reserve duty, risking their very lives defending the citizens here.

So, not the ideal conditions for healing. But, there were opportunities as well - before the bully left, he declared that we're moving to (finally) disband the QA department, sure - as always, it was done for the wrong reasons and seemed like he was going to do that the wrong way, but once he left we talked to the group leaders and understood he did not share with them more details than he had shared with us (not a lot, in case you've wondered, only a general statement), so we started discussing how to do this transition safely, and where should the responsibilities that were currently held by the test teams be. This task proved too much for our group leaders, so they decided to leave the status quo more or less as is, with some cosmetic changes perhaps, this left my manager to start and dedicate more of our team to specific products, which is a smaller step then what I aimed for, but at least it was in the right direction. What we both did not know at the time, was that the bully, after clashing with my manager who was defending the team quite admirably, has started the process of firing her, which while halted due to him being let go, did paint her as a troublemaker in front of HR and upper management. The difficult discussions around the re-org were the last straw, and despite the team rallying up behind her, she was let go.

this was the point where I had to assume another role - the team's anchor. Several people approached me with concerns and wondered whether they should start looking for another place. I spent some time talking with them and trying to help them figure out what they need and how to get that here. I was feeling very ambiguous about this - on one hand, I wanted to join their rant and tell them I agree, on the other, I was the face of the system, and had a clear incentive of convincing them to stay. I was very careful not to lie, but there were some questions I had to dance around. There were some nicer aspects to this, though. Once I realized this is part of my role, I looked on the 2nd test team - we had a major problem there, a few months before my skip-level manager left, their team lead made a lateral move to another department, and the team was left without a manager. As long as my skip level was there, he was able to provide some guidance while looking for another team lead, but this effort stopped when he left, and the team of mostly juniors was left without a leader, and the few seniors there left or detached shortly after. So, now we have an orphaned team for over a year. People were leaving. I did a quick assessment of the people still there, and found that while most of the people there are ok, there is one that I really didn't want leaving, so I approached her and asked how she was feeling and what she was missing, which has led to us meeting weekly (ish) and me trying to be the professional authority she could learn from or consult with. That ended up being really fun.

On the work side, with a new VP, we finally started to work on the shadow project declared by the bully, only to find that, quite unsurprisingly, the contracting company who were now getting paid for over a year has managed to produce some lengthy architecture documents that were not really suited to what we needed, and that the bulk of work still needs doing. Surprised? I wasn't. So, teams started scrambling in order to get up to speed, and the added difficulty of coordinating with a contracting team made it that much more difficult. I was brought in to assess the "testing" they have done, just to see some demo robot-framework tests done against a mockup, with no thought on the SDLC, and no one on the contracting team who even understood my basic questions. So it was up to me to devise a testing strategy for this new product, and for the modifications of the other parts that needed to change in order to support this rewrite.

At any rate, reality was not waiting on us to sort our mess, with post-covid funding being harder to get and some major customers backing out of deals in the last moment, our company figured out that in order to stay in business, it is better to update the income forecast to be more on the pessimistic side, which meant a reduction in force - one of my mentees, whom I just told a week earlier that probably had a month to improve before termination was an option was the first to be named, and another, more appreciated employee was laid off as well. Again, I found myself as the go-to person for people disturbed by the change, and this time I could only tell them "the company claims that after this reduction we should be clear for at least a year, so if the only reason is job stability, keep an eye, but this need not be the fatal sign. As part of this, the company has pivoted and changed its market focus, so the entire rewrite project was called off, a year and a half too late. We returned to the previous initiative of a gradual upgrade of the current system, because, well, it made more sense. so new strategy and cool buzzword technologies? not anymore. At least this time when I did damage control I could be honest and say that my only problem with aborting this project was that it happened too late.

Another thing that happened is that management declared that "we have a quality problem", and while they also said "quality is everyone's responsibility", the way they behaved was more in the way of "we have a QA group, so they should patch quality in the end". We found out about this when we were told a "QA expert" consultant was hired and will be joining us. Yes, it was a slap in our face and ego, but we decided to let it pass and try to use this opportunity to further push our goals. This consultant, unlike the agile transition ones, was quite good - She did her research, did some coalition building, and started to look for ways to push for a change. The main difficulty? her hands were tied by the limited scope - fixing only the test team and not the surrounding problems we have could make a minor improvement at best. To make the problem worse, resource and time allocation were not really made (there were promises for time, but they were not scheduled or considered by the other projects) and we had a new test team lead (for the orphan team) that, at the very least, wasn't aligned with the rest of the company - but we still needed to push through his incessant resistance.

I've learned a lot from the consultant. I learned how to take a vague goal and make it concrete and presentable to management, I learned from our failure on the importance of setting clear milestones, and of saying "I don't need 10% of the department time as a constant, I need 5% now, in the form of one specific person, and in 30 days I'll need 20% of all hands on deck", not doing so meant that we had people bored while we were preparing, and schedules tight when we actually needed the people. One thing we tried to do but didn't get to an agreement was on our definition of "quality" (which is a term I really hate for its vagueness) and following that, on the goals we could theoretically achieve.

However, not all slaps in the face were as successful as this one, there were a few others, and two that had a significant impact, both were a case of a problematic hire. The first was a team leader for the orphaned team - it just so happened that the hiring manager, a developer in his past, had no idea how to assess testing skills, and thus, for most candidates in this position, he asked the other test team lead we had to interview, and due to the nature of the market - there were a lot of candidates that fit on paper, but had no relevant skills, and they were failing the simple skill checks. Then, there was this one candidate who did not get interviewed by said team lead - the benign explanation is that the team lead was under a time crunch and needed some space, and the hiring manager decided to skip this phase despite the obvious gaps in his filtering. The candidate that was hired was a very wrong fit - he didn't have the skills to help his team improve (they needed a strong coder) and he wasn't able to find someone else to provide this sort of service. He tried to bring a solution that worked for him in a completely different place without understanding the situation and without talking to his team, and he had the wrong mindset for what we were trying to achieve. All of this is because someone assumed they can hire a testing professional without at least having a conversation with the local testing experts (there were two of us, at the time, so I know it didn't happen) to understand what should they be looking for, and how to check for that. Failing to do that, the first person who knows to parrot some stuff that sound like testing will have an easy was to slip in.

The second slap is pretty recent - a new QA group lead was brought along. This is a double slap in our face - first, because bringing an outsider without letting someone from within to apply for the position is showing a lack of confidence in the people you have (this was also the case with the team lead position). Second, because bringing a group lead was the opposite of what we were doing here for the past few years, and finally, because no one talked to us, again. Now, it's complicated to have one interview their future manager, so it makes a lot of sense to leave us out of the hiring loop, but after so many bad choices, why not ask for our input? maybe we'll be able to convince you you don't need this function, maybe we won't and you'll explain to us what is it that you're trying to achieve, maybe we'll agree that in our current state we need to find a team lead that will have the explicit goal of disbanding this group within one year and then build a community of practice instead - but this didn't happen, and I found myself facing another manager that I don't appreciate professionally. In this specific case, it took the manager less than a month to make some unforgivable mistakes that were a clear signal for me that on top of what might be just professional differences between us and not sheer incompetence on his side, there are also a severe issue with his management style and skills that I can't accept. To be fair - this particular problem is something that is difficult to interview for, and I'm not sure it could have been avoided. It is still not a pleasant experience.

The one last thing that sticks in my mind is the most recent project I was involved in - a new, AWS native product that is supposed to be a major revenue boost for the company. It started while I was busy putting off some fires, or as we call it here "releasing a version". When I tried to get involved in the discussions with AWS, I got a pushback, and since I was busy with the fires, didn't insist. It turned out to be quite a mistake, since by the time I was ready to join, the initial architecture was already decided, and aside from being quite complex for reasons it took me a while to learn, there was zero thought on how to manage this coding project or how to test it, and I had to spend perhaps a month just wrapping my head around all of the new things I had to learn just to come up with a strategy, then break it down to a plan, and do this while everyone is asking "how soon can you start writing system tests?" and answer that we need about a month to change our infrastructure, and that it is more difficult to do because the product has no deployment scheme of the latest code, so each step is just taking five times the time and effort it should. Dealing with the constraints I had was tough, and I wrote about it in my last blog post. I'm quite happy with the result, but sadly I won't see how well it works, or be there to help push for the strategy to be fulfilled - there are a lot of ideas that are new to the company (sadly, one of them is unit tests), and there is surely some tinkering required with the new mechanisms (such as contract tests and managing the various contracts), but that's someone else's task now.

I'm off to find new challenges, and will be starting tomorrow in a new and exciting place.

Tuesday, June 18, 2024

Testing a cloud-native product

עברית

In the past several months the company I work for has been building its first cloud native product, and I was tasked with figuring out how to test this, given the limitations developing in the cloud poses for our kind of application. This posed a few challenges that made everything that much more complicated, and while there's a lot of materials online about tools that can help with testing stuff on the cloud, and specifically on AWS (which was my focus) I couldn't find a holistic overview of how to approach testing such a project. I hope this post will help remediating this gap, and hope even more that I'll annoy enough people who will point me to articles that already exist and I missed just to prove me wrong.

I had noted the following difficulties, that might, or might not be relevant to your project as well:

Our product deals with AWS accounts as its basic unit of operation - we deploy a service that protects the account's data. As such, each environment needs to have its own account, so deploying multiple environments for testing (and development) is not really feasible - it's expensive, long, and deleted accounts stay live for 90 days or so.
The product is designed as several services communicating with each other, maintained by several different teams, and using several repositories.

So, how can we bake feedback into our SDLC? It didn't help that we didn't yet define said SDLC, but we're not here to help, are we? Besides, this is also an opportunity to define said behavior in a way that will enable feedback instead of being an obstacle.

The solution that made sense in my mind was the one I found in Dave Farley's book "Modern Software Engineering" (I wrote about it here) that basically said testing the entire product before it is deployed to production is something you don't get to do in a services architecture. Really cool concept - having multiple, individually deployable components and have each of them reach production in its own time. There was only one caveat to this approach - there's no way in hell I'll manage to convince the organization its the right thing to do, and even less of a chance to get the necessary changes in our processes and thinking in place. However, I might be able to use the constraints we do have to help me push at least partially in that direction.

The plan, if I'm honest, is assuming the old and familiar test pyramid, only this time presented as "well, we can't do it in the ice-cream cone we're used to, we don't have enough environments."

So, the vision I'm trying to "sell" to the organization is as follows:

Yes, you can have your nightly "end-to-end" system wide regression test suite. However, unlike other products where the bulk of testing actually happens on this level, the nightly will be focused on answering the question "is it really true that all of the parts really can work together?" For this purpose, we'd want a small number of happy-flow tests to run through the various features we have.
The bulk of testing should be done in the unit test level. While our organization still has a lot to learn on how to use unit tests properly (starting with "let's have them as a standard"), at least for the small services it's really easy to cover all of the edge cases we can think of in unit tests. It also happens that for AWS there's an awesome mocking library called "moto" which provides a pretty decent and super easy to use mock for a lot of AWS functionality without needing to change the already existing code. In this layer we'll verify our logic, error handling, and anything else we can think of and can test at this level.
Still, not everything can be unit tested, and on top of our logic, we want to check that things work on the actual cloud before merging. Therefore, we are attempting to build a suite of component tests that will deploy parts of our system and trigger their sub-flows. For example, we have a component that scans a file and sends the verdict (malicious or benign) to another component to execute the relevant action - so we can deploy only the action executing lambda alongside a bucket and start sending it instructions to see that it works.One benefit we get, apart from being able to tell that our lambda code can actually run on the cloud, and is not relying on dependencies it doesn't have, is that we must keep our deployment scripts modular as well (I did mention some small steps, right?)
We have a lot of async communication, and a lot of modules are depending one on another to send data in specific format otherwise they break. To expose those kind of problems early, we are introducing Pact for contract testing - the python version of the tool is a bit limited at the moment, but it has most of what we need.

So, to make it short - during a pull request, we will run unit, component and contract tests, and once a night we'll run our system tests. I have yet to tackle some other kinds of testing needs such as performance, usability, or security, but we already have a worst-case solution: spin a dedicated environment for that kind of test and get the feedback just slightly slower than the rest of tests. I'm hoping that we'll be able to move at least some of those into the smaller parts of testing (such as having a performance test for each component separately), but only time will tell.

I expect that we'll leave some gaps in the products that already exist and are used in this new projects, but hoping this gap will be managed in a good enough manner.

One question that popped up in my head as I was writing - is this approach what I would choose for similar applications that don't have the major constraint on environments? From where I'm standing, the answer is yes - even if it's easy and cheap to spin a lot of test environments, testing each component in isolation and focusing on pushing tests downwards simply provides a lot of power and enables kinds of tests that would be really difficult on a system-wide context.

So, those are my thoughts on testing cloud native, service oriented applications. What do you think?

בדיקות מוצר מבוסס ענן

English

בחודשים האחרונים אנחנו בונים מוצר שהוא, כפי שמקובל לומר, קלאוד-נייטיב, יעני, בנוי בארכיטקטורה שמנצלת את הכלים השונים בענן מראש. אני קיבלתי את המשימה "בוא תבין איך אנחנו הולכים לבדוק את המוצר הזה, עם כל המגבלות שפיתוח מוצר ענן מכיל עבור מוצר כמו שלנו". האמת? היו פה כמה עניינים מורכבים למדי, ובעוד שיש לא מעט חומר בחוץ על כלים שיכולים לעזור עם בעיות ספציפיות, לא הצלחתי למצוא בשום מקום הצעות לגישה הוליסטית לבדיקת מוצרי ענן, משהו שאפשר לבנות בעזרתו אסטרטגיה. אני מקווה שהפוסט הזה יהיה הצעה אחת כזו. אני אפילו יותר מקווה שאצליח להרגיז מספיק מהקוראים כדי שחלקם יכוונו אותי למאמרים קיימים שפספסתי רק כדי להראות לי שאני טועה.

הקשיים בהם נתקלתי, שאולי יהיו רלוונטיים גם לפרוייקט שלכם היו:

המוצר שלנו מתייחס ל"חשבון AWS" בתור יחידת העבודה הבסיסית שלו, כי המוצר שלנו מגן על הנתונים שבתוך חשבון מסויים. כלומר, כל פריסה עצמאית של המוצר צריכה שיהיה לה חשבון AWS משלה, מה שגם לוקח לא מעט זמן לפרוש, וגם משאיר "שאריות" (חשבונות מחוקים נשארים לאיזה 90 יום), כלומר, היכולת להרים סביבות בדיקה מלאות על ימין ועל שמאל לא ממש ריאלית.
המוצר הוא מוצר ענן - יש בו כמה רכיבים שמתקשרים אחד עם השני דרך הודעות טקסטואליות, הרכיבים האלה יכולים להיות מפותחים על ידי צוותים שונים ונמצאים בכמה מאגרי גיט נפרדים (איך לתרגם repository?), מה שעלול להוביל לכך שעבודה על רכיב אחד תשבור דברים בהמשך הדרך.

אז, איך אפשר לדאוג למשוב אפקטיבי בתוך תהליך הפיתוח? זה כמובן לא מאוד עוזר שעדיין לא הגדרנו את תהליך הפיתוח, אבל אנחנו לא פה כדי לעשות לעצמנו חיים קלים, נכון? מעבר לזה, זו גם הזדמנות להגדיר את תהליך הפיתוח בצורה שתאפשר את המשוב האפקטיבי.

הגישה שהכי מצאה חן בעיני היא זו שמצאתי בספר של דייב פארלי (כתבתי עליו כאן) שאומר ש"אנחנו לא מקבלים לבדוק את המוצר כולו לפני שהוא נפרש לשטח". המערכת שלנו מחולקת לרכיבים ניתנים לפרישה עצמאית (כלומר, אפשר לשחרר גרסה של כל אחד מהם בלי להתחשב ברכיבים האחרים) וכך אפשר לעבוד מהר ובצורה בטוחה.

יש לגישה הזו רק חיסרון אחד קטן - אין שום סיכוי בעולם שאני אצליח לשכנע את הארגון שזו הדרך הנכונה, ויש אפילו פחות סיכוי שאצליח ליצור את השינויים התשתיתיים, המחשבתיים והתרבותיים כדי שיהיה אפשר להשתמש בגישה כזו. אבל, אולי אני יכול להשתמש במגבלות הקיימות בפרוייקט כדי לעשות צעד מסויים בכיוון הזה.

אחרי כל ההקדמה הזו, איך אני חושב שנכון לגשת לבדוק מוצר כזה?

אם אני הגון, אני חייב להודות שאין שום דבר חדשני בגישה שלי - אלה בסך הכל כמה דגשים על פירמידת הבדיקות, עם העמדת הפנים "זה לא שאני לא רוצה לעבוד במודל הגלידה שאנחנו מפעילים בכל המקומות האחרים, זה שאין לי מספיק סביבות בדיקה בשביל זה".

בקיצור, קו ההגנה שאני מנסה למכור לקולגות שלי הוא זה:

אל תדאגו, תהיה ריצה של כל המערכת יחד. פעם בלילה, והיא תכסה רק תרחישים נטולי שגיאות, או כאלה שאי אפשר יהיה לבצע ברמות הנמוכות יותר. מטרת הריצה היא לענות על השאלה "אנחנו חושב שכל הרכיבים שלנו מנגנים יחד, האם יש משהו פטאלי שהחמצנו"?
קו ההגנה הראשון שלנו הן בדיקות היחידה. כשעובדים עם AWS יש ספריית mock אמינה יחסית בשם moto שמאפשרת לנו להריץ בדיקות יחידה מול "ענן" שרץ בזיכרון. הספרייה לא מספקת את מלוא היכולות, אבל היא עושה עבודה די מרשימה. כאן אנחנו נוודא שהלוגיקה שלנו נכונה, שיש טיפול הולם בשגיאות, וכל מה שרק נצליח לחשוב עליו.
מעל השכבה הזו נריץ בדיקות רכיבים. אנחנו אולי לא יכולים לפרוש את כל המערכת בכל פעם, אבל אנחנו לגמרי יכולים לפרוש כל מיני חלקים ממנה במקביל. כך נוכל לוודא שכל רכיב מצליח לתקשר עם תשתית הענן עצמה ולוודא שהתחליפים שהשתמשנו בהם בבדיקות היחידה אכן נאמנים למה שקורה בענן עצמו. נוכל גם לבדוק חלקים שלא לגמרי ניתנים לבדיקה ברמת היחידה, כמו למשל לוגיקה שנמצאת באופן בו אנחנו מגדירים את המערכת על AWS ולא ממש בקוד שלנו. כך גם נוכל לראות אם הקוד שאנחנו מנסים להריץ באיזושהי Lambda מנסה למשוך תלויות שאין לו. יתרון נוסף שאנחנו מקבלים הוא שזה מכריח אותנו לכתוב את המוצר באופן מודולרי, כך שכל רכיב יהיה ניתן להתקנה בפני עצמו (אמרתי צעדים קטנים בכיוון, לא?)
החלק האחרון בפאזל הוא בדיקות חוזה. כאמור, יש לנו כל מיני רכיבים שמתקשרים זה עם זה באופן אסינכרוני דרך הודעות טקסט - במקום לחכות עד שהם יהיו פרושים יחד כדי לראות מה שברנו בשינוי הקוד האחרון שלנו, אנחנו יכולים להריץ בדיקות חוזה בעזרת כלי בשם Pact. הגרסה הפייתונית של הכלי קצת מוגבלת כרגע, אבל כל מיני אנשים עובדים על זה במרץ, וממילא הכלי מכיל את רוב היכולות שאנחנו צריכים.

או, במבט ממעוף הציפור - בזמן pull request אנחנו רוצים להריץ בדיקות יחידה, בדיקות רכיבים ובדיקות חוזה. אחת ליום נריץ את בדיקות המערכת כדי לקבל אישור שכל החלקים אכן מתחברים אלה לאלה. יש עוד כל מיני פערים שצריך להתייחס אליהם כמו למשל מה המקום הנכון לבדוק בו את ביצועי המערכת או אספקטים שונים של אבטחת מידע. במקרה הגרוע - תהיה לנו סביבה ייעודית עבור הדברים האלה. אולי לאורך הדרך נמצא פתרונות טובים יותר, אבל לא חייבים לפתור את כל הבעיות מראש.

מחשבה אחת שצצה לה כשכתבתי את הפוסט הזה הייתה "מה הייתי אומר אילו היה קל, מהיר וזול ליצור סביבות בדיקה מלאות? האם הגישה שלי הייתה שונה באופן מהותי? אמנם קשה לענות בכנות על שאלות היפותטיות, אבל אני רוצה לחשוב שהגישה שאני מציג כאן נכונה גם במקרה כזה. נוסף על המהירות היחסית של הבדיקות הקטנות ביחס לבדיקות המערכת המלאות, בבדיקות הקטנות יותר יש לנו גישה ושליטה שקשה מאוד להשיג באופן אחר. חשבו למשל על מקרים של race condition - בבדיקות יחידה אני יכול פשוט לדמות את התחרות בכל אחת מהתוצאות, במקרים אחרים יהיה לי קל מאוד ליצור מצב בו אחד הרכיבים שולח הודעה לא צפויה אל הרכיב הבא בשרשרת (נניח, אם אנחנו מבקשים לסרוק קובץ, אבל עד שמגיעים לטפל בהודעה הקובץ כבר נמחק).

בכל אופן, אלו המחשבות שלי לגבי הדרך הנכונה לבדוק מוצרים מבוססי ענן, מה דעתכם?