אשרי אדם מפחד תמיד Happy is the man who always fears

Thursday, July 27, 2023

This too is a way to learn a lesson.

Recently we've engaged with a consulting company to help us build a product in our core business.

A terrible idea, right?

In our defense, there were some compelling(?) arguments to act this way. We knew the risks of employing a company under contract to build something we care about, and were willing to pay the price.

So, things have started moving. As the main reason we didn't do the work was that we were too busy with other stuff happening, the consulting company has started working pretty much unsupervised for a few months. When we finally joined the effort we could see a lot of design work that has been done, with some proof of concept code for various parts - it was not a lot, but it was as much as could be expected given the requirements we gave them and our availability until then. At least, this is what I thought until I examined their "tests". Starting with a review of the code, I found a couple of robot-framework scripts that are executing a browser against the UI defined in Figma, and only running in theory. But, we are starting to work on a new project and some work has been done by the contractor, so we should at least give it the benefit of the doubt, right?

So, we had a talk and I started asking the basic questions - Where in the SDLC are those tests expected to run? against which of the multiple repositories? how should the product be deployed? how often? who are the intended authors of future tests? who are the intended consumers of the results?
Based on the work I've seen, I didn't really expect them to have answers, but I could use this to start an internal discussion about how do we want to approach testing in the new product. While it was rather easy to name the gates that a piece of code should pass on its way to production, we found that some things needed to be defined better, and that we needed to choose a lot of new technologies, or at least review the existing ones to see if they fit.

We talked about what is a "component test", should we run full system tests on pull requests to some services, how to incorporate contract tests to our process, and so on.

As we are trying to answer those questions we find that the decisions we make are influencing the rest of the product - from our branching strategy to the way we architecture our product and deployment. There are some difficult questions that we need to answer in a short time, but it's quite interesting so far.

Key takeaways?

~~First, never fight a land war in Asia.~~

First, when you plan your testing strategy, take time to figure out your context - who should be doing what, what results are important, and what limitations will hinder your progress.

Second, Especially if you are a service provider, be very explicit about how you expect your work to integrate with other parts of the development process. Have a discussion about what you will need, what you will provide, and how will this impact the work of others. If you can't have a discussion, share your thoughts with the other people involved.

Third, your choices will change the way the entire team works, it will have an impact on both the processes you employ and the architecture of your product. In this sense, it's no different than choosing a tool, a programming language or the roles of people in the team. And just as is the case in those other examples, it is a two way street and choices make there will impact how you do testing.

גם זו דרך ללמוד לקח

ENGLISH

לפני זמן מה סגרנו חוזה בעבודה מול חברה חיצונית כדי שתוכל לפתח את אחד המוצרים שנמצאים בליבה העסקית שלנו.

רעיון ממש רע, נכון?

להגנתנו, היו כמה טיעונים טובים(?) למדי לעשות את זה. ידענו מה הם הסיכונים שבהעסקה של חברה חיצונית והיינו מוכנים לשלם את המחיר.

בקיצור, הדברים התחילו להתגלגל, וכיוון שהסיבה המרכזית בגללה פנינו לחברה חיצונית הייתה שהיינו עסוקים עד מעל הראש בפרוייקטים אחרים, החברה התחילה לעבוד עם מעט מאוד הכוונה או מעורבות שלנו למשך כמה חודשים. כשסוף כל סוף התחלנו להצטרף למאמץ יכולנו לראות לא מעט עבודת תכנון שנעשתה, יחד עם הוכחת היתכנות לגבי כמה מהחלקים. זה לא המון, אבל בהתחשב באיכות הדרישות ורמת הקשב שהקדשנו להם, זה לגמרי מסוג הדברים שאפשר היה לצפות להם. לפחות, ככה הנחתי עד שהגעתי לבחון את החלק לו הם קראו "בדיקות". הסתכלתי קצת על הקוד שלהם וראיתי כמה סקריפטים שכתובים בעזרת robot-framework ומיועדים להריץ כמה תרחישים בעזרת הדפדפן, והחלק הכי טוב? זה נכתב אל מול ההגדרות שהיו בfigma כי כמובן שאין שום דבר פועל שאפשר לרוץ מולו. האם זה יכול לרוץ? תיאורטית. מול מה? לך תדע. מצד שני, אנחנו רק נכנסים לפרוייקט והחברה החיצונית כבר עובדת כמה חודשים אז בטח יש להם כל מיני מחשבות ותובנות לגבי איך הכל מתחבר, לא?

קבענו שיחה, והתחלתי לשאול שאלות - איך זה אמור להתחבר לתהליך הפיתוח? איפה ומתי הסקריפטים האלה צריכים לרוץ? מי אמור לתחזק אותם? מי אמור לקרוא את הדו"חות שיצאו כתוצאה מזה? למה בחרנו את רובוט כתשתית להרצת הבדיקות?
אחרי שראיתי את הקוד, זה לא שבאמת ציפיתי לקבל תשובות אינטיליגנטיות, אבל זו הייתה הדרך שלי לשקף להם שהם בזבזו את הזמן לעצמם ולנו, ובנוסף יכולתי להשתמש ב"הכנות לשיחה" כדי להתחיל את הדיונים האלה אצלנו. בעוד שזה היה די פשוט להגדיר את השלבים השונים שקוד צריך לעבור כדי להגיע לתוך המוצר, אבל מה בדיוק כולל כל שלב, אילו טכנולוגיות מתאימות לו ודברים כאלה - זה כבר יותר קשה. דיברנו על איך כל אחד רואה את התחום המעורפל של "בדיקות רכיב" (component tests, שבניגוד למה שמי שקרא את המילון של ISTQB עשוי לחשוב, זה דבר שונה לגמרי מבדיקות יחידה), ואיך אנחנו רוצים להריץ בדיקות חוזה ובדיקות אינטגרציה ומתי ואיפה נרצה להריץ בדיקות מערכת מלאות.

אחד הדברים שבלטו מאוד בדיונים האלה היה עד כמה צריך לקחת בחשבון את הארכיטקטורה שאנחנו רוצים לעבוד בה, ועד כמה הבחירות שאנחנו עושים משפיעות על תהליך הפיתוח ועל העבודה של כל הצוותים המעורבים. נתקלנו גם בכמה מגבלות שנובעות מהארכיטקטורה שאנחנו רוצים שתהיה למוצר, וגילינו שיש דברים שנצטרך לשפר כדי להגיע לתהליך הבדיקות שאנחנו רוצים.

בקיצור, בזכות העבודה המחופפת של החברה החיצונית, קיבלנו הזדמנות לתכנן את תהליך הפיתוח שלנו באופן מודע ולא פשוט להתגלגל מדבר לדבר. שיעור מרתק.

אז, מה אפשר ללמוד מכאן?

~~קודם כל, לא להשתתף במלחמה יבשתית באסיה~~

קודם כל, כשמתכננים את הבדיקות למוצר, קחו בחשבון את ההקשר - מי אחראי על מה, אילו תוצאות חשובות ולמי ומה המגבלות שיפריעו לנו בהמשך?

שנית, במיוחד אם אתם מספקים שירותים לחברה אחרת, היו ברורים מאוד לגבי איך אתם רואים את התהליך. נהלו דיונים על הדברים שתזדקקו להם ועל אלו שתספקו כמו גם על ההשפעה שתהיה לכם על עבודתם של אחרים. אם אין לכם מול מי לנהל דיון כזה, הניחו הנחות ושתפו אותן. העבודה שלכם תימדד, בין היתר, על בסיס ההנחות האלה.

לבסוף, הבחירות שלכם ישפיעו על כל הצוות. הן ישפיעו על התהליכים ועל ארכיטקטורת המוצר. במובן הזה בחירת תהליך בדיקות אינה שונה מבחירת טכנולוגיה לעבוד איתה, או הגדרת התפקידים השונים בצוות. ובדיוק כמו במקרה של הדוגמאות האלה - זו דרך דו סטרית. כל הדברים עליהם תהליך הבדיקות משפיע, ישפיעו עליו בחזרה.

Sunday, March 12, 2023

Quality is a four letter word

עברית

A question that started in the ABTesting slack group and has found its way to LinkedIn was wondering about whether an unethical product, even if exquisitely built, can be referred to as "high-quality".

This question resonated with a biff I have with the term "quality" for quite a while now. Specifically, with how it is commonly used. Since many of us have the word "quality" in our role description, many of us and of our colleagues in other software specialization are using it to indicate the sum of all desired things we might want in a piece of software. As we do that, we are shooting ourselves in our metaphoric leg. How bad is it? well, as I was writing this post, I stumbled upon this episode of the testing peers where we can hear four professionals who care about their craft, spend half an hour talking nonsense. By the end of the episode titled "what is testing" they do recognize the fact that they have not actually defined it.It is of no surprise, this term is difficult to define and is pretty vague, and as I've written in the past, it does not conform with the idea of quality in other fields. When my manger says "we have a quality problem in our product" he means one thing that is probably not the same as the product manager saying "We need to have this shipped with perfect quality", and is surely different from the now-so-common definition of "value to some person who matters" or the approaches adopted by ISO - "fitness for use" or "customer satisfaction". What we are getting when we use such an ambiguous term is a feeling of consensus that will blow up in our faces once the actual differences will start to show.

Moreover, "improving quality" is not actionable. If you actually have such a problem, you have a lot of work to do just to clarify what you actually need, and you will need to override people instinctive understanding of this term. Some people would go as far as claiming that "quality" is just about technical mastery and skillful coding. Others will stress the visibility of the product and others yet will look at the extent in which it gets the job done. All of those are important, but we need to align in order to move forward

Another notion I've stumbled upon sometime during my career is that "quality" is just another word fro "property", which leads me to suggest talking about the specific qualities we are actually seeking in our software. It could end up being a long list, but it will enable us to align more easily, and to break the dichotomous view of "high\low overall quality" - Software (and life in general) is all about choosing the right tradeoffs, and I might go for something that looks nicer even if it is not as fast, or choose the product that preserves my privacy better over the one with more features and better integrations - we need to use a language that helps us notice these possible tradeoffs instead of concealing them under the very broad term "quality".

If "Quality" is a profane word that should not be used, what should we talk about instead? My preferred way is to break it down to different qualities, and preferably, to avoid confusion: Desired properties. What sort of desired properties are there of a software? I'll list a few here, and there are probably more that would be relevant in the various contexts one might find themselves in:

Correctness: This is the extent to which the software is doing what it is supposed to do. A tax calculator makes no tax mistakes given the proper input, a game is responding to the user commands in the expected way, etc.
Robustness: The product can withstand messy (and even malicious) input, recover from mistakes and does so for a prolonged time.
Scalability: The product can support the expected growth in a viable way.
Performance: Personally I lump here both the quickness of response and the resources consumed by the system, those can be separated if we are in a place to make internal trade-off (for instance, throw more hardware to get faster results without a redesign).
Privacy: How well is the product protecting human data that it handles (or generates).
Ease-of-use: How well can people achieve their goals with the product.
Attractiveness: How appealing is a product to the target audience, that would conform to what James Bach has named "charisma"
Explainability: The extent to which the behavior of the software is predictable or easy to explain.
visibility: can we figure out what happens in production? can we use that data to improve the product?
Safety: will the product kill someone? Harm them in any way? will releasing a new version put us in risk we can't tolerate?

This is a partial list that can be extended with a lot of other properties (an example can be found in the paper by Rikard Edgren, Henrik Emilsson and Martin Jansson extending James Bach's quality criteria), but one obvious thing is not on my list: Testability. Why is that? because in itself, it provides very little value. I think of testability as a secondary desired property that helps achieving the others or verify their state.Other properties in this category could be maintainability or extendability I also don't consider process properties in this list, even though they are often more desirable than almost every bullet in my list. Things like efficiency or pace of deployment: I had to draw the line somewhere and so I chose to stick with properties that are attached to the product and not to the teams building and maintaining it.

So, next time someone asks you how to improve the quality of your product, take the time to understand what do they really mean. Are they bothered by functional problems? do we have a security problem? There's no need to go out right and tell them "don't use this meaningless, confusing word", but try to drill down using guiding questions such as "what is the most significant current concrete pain that we feel now?" or "do we want to invest now our efforts in increasing our ability to spot problems in production before our customers, or do we have more pressing matters?" get to somewhere concrete and work from there. Ask such questions again after there has been some improvements in the top priority desired properties.

איכות זו מילה גסה

English

שאלה שהתחילה בערוץ הסלאק של הפודקאסט ABTesting התגלגלה לפוסט בלינקדאין. בקיצור נמרץ, השאלה היא "האם תוכנה לא מוסרית יכולה להיחשב כבעלת 'איכות גבוהה'?".

השאלה הזו מתכתבת עם נושא שמציק לי כבר כמה שנים טובות: בעולם התוכנה משתמשים במילה "איכות" בצורה חסרת אחריות. אולי זה כי לחלקנו יש תפקיד שהמילה "איכות" מופיעה בהגדרה שלו, או כי כולם סביבנו מדברים ככה, אבל המשמעות של המילה "איכות" באופן בו משתמשים בה היא פחות או יותר "כל מה שטוב ושום דבר ממה שרע". השימוש הזה עושה בעיקר נזקים, ובמקרים מסויימים אפילו אפשר לומר שאנחנו יורים לעצמנו ברגל. כמה המצב חמור? בעודי כותב את השורות האלה נתקלתי בפרק של Testing peers שמנסה להגדיר את המונח הזה. זו דוגמה מושלמת לאיך שארבעה אנשי מקצוע שמשקיעים לא מעט זמן בלשפר את ההבנה המקצועית שלהם פשוט מדברים שטויות. אחרי חצי שעה של קשקשת, הם עוצרים כדי להכיר בעובדה ש"בעצם, לא ממש הגדרנו מה זה איכות". תכל'ס? זה לא מפתיע - די קשה להגדיר "איכות תוכנה". זה מושג עם הגדרה מעורפלת שפשוט לא מתיישרת עם הגדרות מתחומים אחרים. התוצאה? כשמנהל אחד אומר "יש לנו בעיית איכות" הוא לא מדבר על אותו דבר כמו מנהל המוצר שרוצה להוציא "מוצר באיכות מעולה" ושתי ההגדרות האלה בטח לא מתיישבות עם ההגדרה הנפוצה "ערך עבור מישהו שאכפ לנו ממנו" (Value to someone who matters) כמו שאמר ג'יימס באך בעקבות ג'רי ויינברג, או ההגדרות שאפשר למצוא בתקן ISO "התאמה לשימוש" או "השבעת רצון הלקוחות". אם ננסה לעבוד כשיש לנו תחושה כוזבת של הסכמה, זה רק יגרום לחיכוכים אחר כך,

כל כאב הראש הזה אומר ש"שיפור האיכות במוצר" זו הכרזה ריקה מתוכן, כזו שאי אפשר לפעול לפיה. אם אכן יש רצון לשפר את "איכות" המוצר, יש בפנינו עבודה לא בהכרח קלה של הגדרת התחומים בהם נרצה להשתפר. תוך כדי התהליך הזה נגלה שחלק מהאנשים מבינים את המושג "איכות" כהפגנת מיומנות - שימוש בטכנולוגיות העדכניות באופן אלגנטי ויעיל, אחרים יסתכלו על תוצאות עסקיות, או על חווית המשתמש או על מרווח בין תקלות. כל הדברים האלה חשובים, אבל כדי לזוז קדימה, אנחנו צריכים לדבר על אותו דבר.

אבחנה מילונית נחמדה בה נתקלתי מתישהו לאורך חיי היא שמשמעות אחת למילה "איכות" היא פשוט שם אחר ל"תכונה", זה אמנם טבעי יותר באנגלית, אבל גם בעברית האמירה "יש לו מגוון איכויות" מובנת בלי להיות העתקה בוטה מדי מאנגלית. דווקא הריבוי הזה נראה לי כמו נקודת מוצא נוחה מהתסבוכת הזו. אם במקום לדבר על "איכות" נוכל לדבר על "תכונות", או אפיל על "תכונות רצויות", נוכל לשבור את הדיכוטומיה סביב "איכות גבוהה\ירודה" ולהתחיל להתמקד במה שחשוב לנו - בלבחור את האיזון שמתאים לנו.

ואם "איכות" היא כזו מילה מגונה, במה יהיה לנו נכון יותר להשתמש? כמו שרמזתי בפסקה הקודמת, אני מעדיף לדבר על מכלול התכונות הרצויות ואלו שאינן. ככה אפשר למנוע בלבול ובאשכרה להתקדם לקראת המטרה אליה אנחנו חותרים. כמובן, זה לא כזה פשוט - רשימת התכונות הזו אינה סגורה, אבל הנה כמה תכונות שנראות לי רלוונטיות במגוון מוצרים, וכל אחד יוכל להוסיף מתוך ההקשר שלו.

נכונות: עד כמה התוכנה עושה את מה שהיא אמורה לעשות. מחשבון מס לא מגיע לסכומים לא נכונים, משחק מגיב לפקודות מהשלט בצורה הצפויה ודברים כאלה.
עמידות: עד כמה התוכנה יכולה לתפקד לאורך זמן בעולם האמיתי, גם לנוכח משתמשים שמנסים להתקיף אותה
סילומיות: תכל'ס, חיפשתי באינטרנט את המילה הזו, אנשים נורמלים יקראו לזה סקיילבליות או משהו כזה. עד כמה התוכנה שלנו מסוגלת לגדול כדי להכיל את סדרי הגודל הצפויים בעתיד הקרוב והבינוני.
ביצועים:אישית אני דוחס לכאן גם את התשובה לשאלה "כמה מהר" וגם את המשאבים שהמערכת צורכת. אם זה רלוונטי, אפשר להפריד את שני אלה כדי שנוכל לשלם באחד כדי לקבל את השני.
פרטיות: עד כמה התוכנה מגנה על מידע אנושי שהיא נחשפת אליו.
נוחות וקלות שימוש: כמה טוב אנשים יכולים להשיג את המטרות שלהם בעזרת המוצר שלנו.
אטרקטיביות: כמה המוצר מוצא חן בעיני המשתמשים. ג'יימס באך משתמש במושג "כריזמה" כדי לתאר את זה.
נהירות:עד כמה קל לנו להסביר מה התוכנה שלנו עושה ועד כמה קל לנו לצפות את ההתנהגות שלה במצבים שונים.
נראות: עד כמה אנחנו יכולים לדעת מה קורה עם המוצר שלנו בעודו רץ? האם יש לנו מידע מהשטח שאנחנו יכולים להשתמש בו כדי לשפר את המוצר? כדי לאתר בעיות?

אפשר להרחיב או לשנות את הרשימה הזו, ויש כבר מי שעשו דברים כאלה (למשל, אדגרן, אמילסון ויאנסון הרחיבו כאן את העבודה של ג'יימס באך), אבל אני רוצה להפנות לרגע את תשומת הלב לתכונה אחת שאני לא כולל בתוך התכונות שאנשים רוצים: בדיקתיות. הסיבה לכך היא שיש מעט מאוד ערך בבדיקתיות בפני עצמה. הדרך בה אני חושב על בדיקתיות היא כעל יעד עזר: זה כנראה יהיה משהו שנרצה כדי להשיג מטרה אחרת, או שזו תהיה תוצאת לוואי של תכונה קיימת (למשל, אם שיפרנו את נהירות התוכנה, ממילא שיפרנו את היכולת שלנו לבדוק אותה, ויהיה לנו קשה מאוד לשמור על רמת נכונות טובה אם המוצר שלנו יהיה קשה מאוד לבדיקה). יש תכונות נוספות שאפשר להכניס לתוך הקטגוריה הזו - כמו תחזוקתיות או יכולת הרחבה, אבל זה לגמרי עניין של מי שמרכיב את הרשימה.
חוץ מזה, יש סוג אחר של פריטים שלא כללתי ברשימה - תכונות תהליך. דברים כמו עקביות, או צפיות (כמה אנחנו יודעים לומר מה ייקח לנו כמה זמן). זה לא שהם לא חשובים, אלא שהייתי צריך למתוח את הקו איפשהו, ולכן בחרתי להתמקד בתכונות של הקוד עצמו ולא של תהליך הפיתוח. זה גם מתיישר עם התפיסה של "מוצר איכותי" - זה לא משנה מה התהליך שהוביל ליצירת מוצר כלשהו. אם הוא טוב, אז הוא טוב.

לכן, בפעם הבאה בה מישהו שואל "איך אפשר לשפר את איכות המוצר", עצרו לרגע כדי לחשוב למה בדיוק הוא מתכוון. האם יש לנו תקלות פונקציונליות? האם יש לנו בעיות אבטחה? האם לקוחות מתלוננים בהמוצר שלנו זוחל? כמובן, אין צורך (או טעם) לריב על טרמינולוגיה ולהתחיל להסביר "אל תשתמש במילה 'איכות', היא חסרת משמעות ומבלבלת", אבל מאוד כדאי לצלול טיפה יותר לעומק ולשאול שאלות מנחות כמו "מה התסמין הכי בולט של בעיות שיש לנו כרגע?" או "האם כדאי לנו להשקיע מאמץ בניסיון למצוא טעויות לפני הלקוחות או שיש דברים בוערים יותר לעשות?" אחרי שיש לנו כמה תשובות כאלה, אפשר להתחיל לעבוד. על הדרך, בזמן רגוע יותר, אפשר להתחיל לדבר על התכונות שמרכיבות "איכות" מבחינתנו.

Monday, November 21, 2022

לאן מתקדם בודק תוכנה?

English version

קישור למודל

כשמתכנת מתחיל את דרכו המקצועית יש עשרים אלף אופציות בהן הקריירה יכולה להתפתח - מערכות משובצות, שרתי-אינטרנט, מערכות הפעלה,מובייל, ועוד, בנוסף, כל תחום כזה יכול לבוא יחד עם התמחות בשפות תכנות ספציפיות כמו ג'אווה, סי, פיית'ון ואחרות. מתכנת מתחיל ימצא את עצמו בדרך כלל בצוות עם עוד מתכנתים שיש להם קצת יותר ניסיון ויתחיל לשמוע על כל הדברים שחסרים לו - כתיבת קוד מאובטח, תבניות עיצוב או היכרות עם ספריה כזו או אחרת. בכל פעם בה מתכנת מסיים קטע עבודה הוא מקבל משוב על טיבה - מישהו אחר בצוות יקרא את הקוד שלו, יעיר הערות ושני הצדדים ילמדו. אחרי זמן לא ארוך במיוחד יהיו למתכנת הזדמנויות לחזור לקוד שהוא כתב בעבר ולראות כמה הוא כבר השתפר ואיך טעויות שהוא עשה אז מפריעות לו עכשיו. בסך הכל - סביבת למידה מדהימה, גם אם אנחנו לא לוקחים בחשבון את מאות הקורסים הזמינים שמלמדים תוכן רלוונטי ברמה המתאימה.

ומה המצב אצל בודקי תוכנה? אפשר לומר שהוא כמעט הפוך. לא מאוד נדיר שבודק תוכנה מתחיל ימצא את המשרה הראשונה שלו באיזה סטארט-אפ בו הוא בודק התוכנה היחיד, או היחיד בצוות אליו הוא גוייס, רוב מי שהוא ייתקל בו, גם בודקי תוכנה מנוסים יותר, לא יודעים הרבה על הצד התיאורטי של בדיקות תוכנה, ובלא מעט מקומות, בודקי תוכנה נמצאים אי שם בחלק התחתון של שרשרת המזון. כמעט כל מי שמדבר על "לאן מתקדמים" מצביע על דרך התקדמות אל מחוץ למקצוע - לפיתוח, דבאופס, הנהלת מוצר או "סתם" לתפקיד ניהולי. אילו כישורים נדרשים מאדם כדי להיות בודק תוכנה מוצלח? יש כאן הרבה פרשנות ומעט מאוד קונצנזוס. אילו קורסים זמינים למי שרוצה להשתפר? יש עשרות קורסים מסביב לתוכן של ISTQB, אבל התוכן הזה פשוט לא טוב, ובטח לא משפר את המקצועיות של מי שטורח ולומד אותו (למה אני אומר את זה? אפשר לקרוא כאן). קורסים טובים? צריך לחפור מתחת לסלע. זה לא שאין בכלל, קיבלתי המלצות על כמה קורסים, אבל זה לא טריוויאלי למצוא אותם. ומבחינת משוב - גם כאן אין כזה באמת. לפעמים משהו מתפוצץ בשטח ואנחנו אומרים "כן, יכולנו לעשות עבודה טובה יותר", אבל מה עם כל הפעמים בהן לא היינו בסדר ושום דבר דרמטי לא קרה? או כשהיינו זהירים מדי ותקענו מקלות בגלגלים של כל תהליך הפיתוח? בדרך כלל המשוב שנקבל יהיה ספורדי, באיחור ודי מעורפל. גם כאן, יש את יחידי הסגולה שמצאו דרך לקבל משוב מהיר מהשטח על איכות הפיתוח - הם בדרך כלל לא באמת צריכים בודקי תוכנה.
בקיצור, מנקודת המבט של בודקי תוכנה מתחילים, כל מסלולי הקידום מובילים אל מחוץ למקצוע הבדיקות.חלק קטן מהבודקים האלה ימצאו במקרה מסלולים בתוך עולם הבדיקות (וחלק מתוך אלה יתקדמו בהם עד לנקודה בה תיאורי תפקיד רשמיים לא משנים מאוד), אבל רוב העובדים שטובים מספיק כדי להתקדם יכוונו את הקריירה שלהם למקום אחר. מקומות עבודה שמחפשים בודקי תוכנה מנוסים יצטרכו להתאמץ מאוד כדי למצוא מישהו שגם פנוי לעבודה, גם מנוסה וגם מסוגל לעשות עבודה טובה.

במקום בו אני עובד התמודדנו עם הקושי הזה על ידי גיוס עובדים מתחילים, תחת ההנחה שקל יותר ללמד אנשים איך להיות בודקי תוכנה טובים אם לא צריך לשבור להם הרגלים רעים. כמובן, ידענו שיש כמה חסרונות לגישה הזו - נצטרך להשקיע יותר בחניכה ואנחנו מהמרים על כל עובד בלי המון יכולת לנחש את סיכויי ההצלחה שלו מראש. הנחנו שנוכל להתמודד עם האתגרים האלה ולבנות צוות שיתאים למה שהחברה צריכה. אחד האתגרים שלא לקחנו בחשבון היה שימור עובדים. זה לא שהתעלמנו לחלוטין מהבעיה הזו, רק הנחנו שהקושי לשמר עובד יהיה דומה לזה שכל החברה חווה. נכון, לבדיקות תוכנה יש מוניטין לא טוב במיוחד, אבל אנחנו בונים צוות חזק שמאפשר לאנשים ללמוד המון ולהשפיע על הסביבה שלהם - הם יראו את זה ויישארו. לא? ובכן, מסתבר שלא מספיק. בודקי תוכנה מתחילים הגיעו למקצוע ממגוון סיבות, חלק גדול יחסית פשוט מנצל את רף הכניסה הנמוך יותר כדי לצבור ניסיון לקראת המשרה אליה הוא באמת מכוון. לא מעט מאלו שעזבו אותנו עשו את זה כי מנקודת המבט שלהם - אין לאן להתקדם. הם ידעו לומר שהם מתכנתים טובים יותר, אבל כל שאר הכישורים שהם צברו היו שקופים להם, וחשוב יותר - הם לא שיפרו את היכולת שלהם למצוא משרדה טובה יותר בהמשך הקריירה. משרות לפיתוח תוכנה יהיו ספציפיות - מומחה מסדי נתונים, מומחה מערכות משובצות, קרנל, או עוד עשרות תחומים. משרות לבודקי תוכנה יהיו דומות להחריד, לא משנה אם הן לבעלי ניסיון של שנתיים או של עשר והיכולת לומר "זה אני, זה הערך שאני מביא איתי ואלה המשרות שמעניינות אותי" דורשת חשיבה מעמיקה שלא זמינה בתחילת הקריירה, בטח לא במקום שעדיין בונה את תהליכי העבודה והקידום שלו.

אני רוצה לבנות מפה, משהו שיתאר מסלולי התקדמות אפשריים בתוך בדיקות תוכנה. משהו שיאפשר לאנשים להסתכל ולומר "אלה תחומי ההתמחות שלי". כלי כזה יוכל לתת לעובדים מתחילים תחושת התקדמות והתפתחות - במקום לומר "קודם בדקתי, היום אני בודק, מחר אבדוק", הם יוכלו לומר "קודם ידעתי את הבסיס, היום אני יודע קצת על בדיקות נגישות, קצת יותר על בדיקות ביצועים ואני טוב יותר בניתוח סיכונים מאשר הייתי לפני שנה". יותר מזה, הם יוכלו לומר "כשאהיה גדול אני רוצה להיות מומחה ניטור, ולכן אני צריך ללמוד ויזואליציה, עבודה עם נתוני עתק ואסטרטגיות אחסון מידע". עם קצת מזל, זה יכול להיות כלי שישמש חברות כשהן בונות מסלולי התקדמות לעובדים או אפילו בתהליך הגיוס שלהן.

כדי לתמוך במטרות האלה, בניתי מודל שמכיל שלוש רמות של הגדרות:

1. בכירות.
חלוקה כללית לדרגות - מי מקבל יותר כסף ויותר אחריות. אם אלה הדברים שמעניינים אותך (ולא, זה לא בהכרח עניין של שטחיות או סתם חמדנות - יש אנשים שנמצאים בתפקיד אחר: משאבי אנוש, כספים, ואפילו המנכ"ל - לא צריך להיות להם אכפת מה בדיוק אומר התפקיד שלך, רק כמה הוא בכיר), אפשר לעצור כאן. אבל, גם לכולנו נוח מדי פעם להכיר את החלוקה לרמות, גם במובן של שכר וגם כדי לדעת מה סוג האחריות שאנחנו צפויים לה.
במודל השתמשתי בשלוש רמות - ג'וניור, סיניור וprinciple (ולא, לא היה לי הגיוני לכתוב את זה בעברית) כשההפרדה היא לפי מידת השפעה מרמת הוא שעושה את העבודה שלו ועד לזה שמשפיע על כל הארגון. ככל הנראה, ארגונים קטנים מאוד יצטרכו רק שתי קטגוריות, ובארגוני ענק יהיה צורך ברמות נוספות.

2. תפקידים: אלה כותרות כלליות לסוגי עבודה שונים שאפשר לכלול תחת המטרייה הרחבה של "בדיקות תוכנה". בארגוני ענק יכול להיות שיהיו צוותים שונים שממלאים את הפונקציות האלה, אבל במקומות קטנים יותר (ובארגונים גדולים שבחרו לארגן את העבודה כך), סביר שאדם אחד ימלא מספר תפקידים. למשל - ארגון לא חייב להחזיק מומחה נגישות במשרה מלאה, אבל כדאי שיהיה מישהו שמבין בתחום ויכול לעזור במשימות קשורות כאשר הן צצות.

3. כישורים ויכולות: כדי לבצע תפקיד בהצלחה, נדרשים כישורים מסויימים. יהיה קשה מאוד לתפקד כמומחה ביצועים בלי להכיר כלי או שניים שמייצרים עומסים או בלי לדעת לנתח צריכת משאבים. עבור כל כישור ניתן לבנות מסלולי הכשרה - ביצוע משימות קשורות, קורסים, הרצאות וכדומה, ולהגדיר דרכים למדידת המיומנות של עובד בכישור מסויים.

העבודה שלי, באופן טבעי, רחוקה מלהיות מושלמת. אני משתמש בשפה שמובנת בעיקר לי, ואין ספק שאני מחמיץ תחומים שלמים. לכן, אני מבקש מכם להוסיף הערות - הייתי שמח אם המודל הזה יהיה משותף לקהילה כולה ויוכל לשמש גם את מקומות העבודה השונים בהגדרת התפקידים שהם מחפשים ומסלול ההתקדמות וגם את בודקי התוכנה עצמם כשהם מנסים להגדיר לעצמם את מסלול ההתפתחות שהם רוצים לנסות. המודל שלי נמצא כאן ויש הרשאות הערה לכל מי שניגש אליו. ספרו לי מה עובד בשבילכם היטב, אילו הנחות אני מניח ולא מתאימות לכם, מה חסר. שתפו את המודל הזה עם קולגות וחברים כדי שגם הם ייתנו משוב. זו גם ההזדמנות להודות לPerze Ababa שהשקיע לא מעט מזמנו ועזר לי לשים לב לכמה פערים ולראות גם חלק מנקודת המבט של ארגונים גדולים באמת.

What's the career path of a tester?

עברית

Link to the model

When programmers set their foot on their first job, there are quite a lot of options they can see and choose to specialize in: Embedded systems or highly distributed cloud architectures, specific programming language such as Java, Python or c, optimization and debugging, and that's even before we consider business domains. They will often find themselves with more experienced developers who will give them feedback on their work during design & code reviews, who will talk about concepts they still need to master such as secure code design, efficiency or design philosophies. They will have plenty of opportunities to go back to a piece of code they wrote and experience their mistakes biting in their respective behinds, just so that they can see how much progress they have made. All in all, that's an amazing learning environment even before we consider the plethora of easy to find courses and online resources that teach relevant skills in the appropriate level.

How is it for software testers? Well, one might say it's the complete opposite. It's not uncommon for testers to find themselves the sole tester in a team, or even in the whole start-up that hired them. Most of the people they meet, including some more experienced testers) won't know a lot about testing and in many places testers are so down the pecking order that it can seem that any career progress is pointing outside of testing - to programming, devOps, product management, or just to being a manager. What skills are required to function well as a function tester? One can find as many opinions as there are people, and most of them won't be actionable. Courses? By a simple search one will stumble upon a lot of ISTQB content, but this content is the opposite of helpful as it provides misconceptions and bad thought patterns (more on that here). There is some good content out there (I've heard good recommendations on BBST and RST, but can't attest yet to their content personally), but if you can find it - you have already burrowed deep enough into the rabbit hole that is the very specific testing community where I've been around for a while now. As for feedback on our work - what feedback? With the exception of an occasional blaming finger of "how did this bug escaped QA" (which we try to avoid), there's very little feedback on testing done right or wrong. What about the times where we were slacking but by chance there wasn't any critical blowup that happened? Or when we were too careful and slowed down the entire development process? Yes, there are outliers that have managed to build a way to get good, reliable feedback on their product from the real world, where it's deployed. But guess what? They usually don't really need testers.
In short, taking the view of an inexperienced software tester, it looks as if all growth paths are leading outside of testing. A small part of those testers will manage to find their way inside the testing world (and some of those will make progress to the level where formal titles do not limit them and matter a lot less) but most capable employees will find a way out of testing. As a result, places looking for experienced, high-skilled testers will have a hard time finding them, and those testers will have a difficult time finding a workplace that will appreciate their skills.

Where I work, we decided to tackle this difficulty initially by looking for inexperienced employees, assuming that it will be easier to teach people good testing if there aren't any bad habits that needs to be broken. We knew that this plan has some drawbacks: We'll have to invest more in training, and we are taking a gamble on their ability to develop the skills we need - which is more difficult than assessing the existing skills of a person. Those were challenges we assumed we could overcome. Naturally, we did not foresee all of the challenges we would face. One of which was employee attrition. To be fair, it's not that we completely ignored this risk, we just assumed that it would be at the same rate as in other teams in the company. Sure, software testing has bad reputation, but we're building a strong team and we can show employees that they can grow significantly and stay.
Well... no. Inexperienced software testers apply for their first (or second) positions for many reasons, and the most prominent I've encountered is that Software testing has a lower entry bar to "high-tech" jobs, and they assumed (with some degree of truth in it) that it would provide them with a good stepping stone towards the position they actually aim for. Talking with people who left us raised a common pattern: They decided that they can't build their career in testing, and since they did gain some relevant skills, they were ready to make a transition. They also didn't feel they were improving as testers - they could say they are better coders, but the rest of the skills they improved were invisible to them. More importantly, they couldn't see how those skills would help them find their next, hopefully better, job. This brings me back to the job market: Developer jobs are specific (see the list at the top for examples). Testing jobs are almost always "tester\QA\QE\SDET" - nothing that will help differentiate between testers with different experience. It means that the ability to articulate one's added value or interests requires deep thought that is not easy to do at the beginning of one's career, especially if they work in a place that is still building its career ladders and work procedures.

I want to build a map. Something that will describe possible progress paths inside the role of an individual contributor software tester. It might have exit paths to other roles or to management , but the focus is on software testers and the various kinds of tasks they might perform. Such a map will enable people to look and say "Those are my areas of expertise, this is what I want to do next". For beginners, it can help visualizing their progress, they could look and say "A year ago I had some basic fundamental skills, since I've gained knowledge of accessibility testing, delved a bit deeper into load generation and log filtering and I'm better at risk assessment than I was then". They could then go and say something like "I enjoyed trying to make sense out of heaps of data, so I might want to look into production monitoring and deeper into performance tests, for both I will need skills in data visualization, working with big data and data storage and retention strategies.
With some luck, companies could use it when they define their career ladders, or even in their recruiting processes and ads which will lead to more diversity in hiring "tester".

To support those goals, I built a three layered model:

Seniority.
A generic, non specialized, catch-it-all division. People who only care about prestige, compensation and responsibility can stop at this level (no, they are not shallow or indifferent, they might be from other professions, such as finance, HR or even your CEO), for the rest of us, it would still be helpful to know which level is any role - both in terms of compensation and in terms of the type of skills that would be relevant to it.
In the model I use 3 such categories - Junior, Senior and Principle that represent ever growing sphere of influence - from just doing one's own job, to having a company-wide impact. Smaller companies might only need two such categories, while larger corporations might need more.
Roles.
Those are titles to the types of work people with the title of "tester" are often doing and are not considered a distinct role (so, doing part time customer support does not count). Again, the size of the organization matters. Larger ones can have teams that focus on one such role, while smaller places, or corporations that chose to organize their work differently, will have people wearing multiple hats. For instance, not every place needs or can afford an accessibility expert, but most places can benefit from someone who has some accessibility expertise and help the rest of the team(s).
Skills and capabilities.
In order to be successful in each role, certain skills are needed. Since those are the things one can learn intentionally breaking down the roles to specific key competencies will be helpful to people who try to make their way through a specific career path. At a later phase, each such skill can be augmented with resources that can help acquiring it, and companies can decide on ways to measure them to decide whether someone has achieved the necessary level of a skill for the step they want to make.

My work, quite naturally, is far from being perfect or complete, I'm using language and terminology that makes sense to me, and I'm most certainly missing entire fields. This is the point where I'm asking for your help: I would be happy if this could become a model that is developed by the community and will be usable by both companies defining their processes and individuals crafting their career.
My model can be found here and anyone with access can comment. Tell me what works for you, what does not, which assumptions that I make are not suitable for you and what you can see missing there.
This is also the place to thank Perze Ababa that contributed from his time and helped me notice some key insights I was ignoring, as well as see the perspective of the larger corporate for a while.

Friday, October 14, 2022

Testing Magic 101

עברית

Every now and then, I get to hear about "X testing", where X might be API, mobile, embedded software or just anything that isn't a website. While I normally scoff at the relevance of such distinctions that usually can be narrowed to "understand your system, understand how to interact with it, the rest of the differences are nothing you haven't seen before", it is true that the good content will help you notice parameters you possibly didn't consider such as battery consumption for mobile, or heat and power fluctuation for embedded systems.

I got to listen to a recent episode of "The testing show" on AI testing which was, sadly, another one of the infomercial-like episode. It wasn't as terrible as the ones they bring in people from Qualitest, but definitely not one of their better ones. The topic was "AI\ML testing", and was mainly an iteration of the common pattern of "it's something completely new, and look at the challenges around here, also ML is really tricky".

This has prompted me to write this post and try to lay the basics for testing stuff related to ML, at least from my perspective.

The first thing you need to know is what you are testing - are you testing an ML engine? A product using (or even based on) an ML solution? the two can be quite different.

For the past decade, I've been testing a product that was based on some sort of machine learning - first a Bayesian risk-engine detecting credit-card fraud, and then an endpoint protection solution based on a deep neural net to detect (and then block) malware. The systems are quite different from each other, but they do have one aspect that was shared to them - The ML component was a complete black-box, not different than any 3rd party library that was included in it. True, that particular 3rd part was developed in-house and was the key differentiator between us and the competition, but even when you have a perfect answer to the relevant question (is this transaction fraudulent? is this file malicious?), there's still a lot of work to do in order to make that into a product - For the endpoint protection product it would be hooking to the filesystem to identify file-writes, identifying the file type, quarantining malicious ones and reporting the attack, all of which should be done smoothly while not taking up too much resources from the endpoint itself, not to mention the challenge of supporting various OS and versions deployed in the field. All of which have zero connection to the ML engine that powers everything. If you find yourself in a position similar to this (and for most products, this is exactly the case) - you are not testing ML, you are at most integrating with one, and can treat the actual ML component as a black box and even replace it with a simulator in most test scenarios.

There are cases, however, that one might find themselves actually testing the ML engine itself, in which case start by admittin that you don't have the necessary qualifications to do so (unless, of course, you do). Following that, we need to distinguish again between two kinds of algorithms - straightforward and opaque.

Straightforward algorithms are not necessarily simple, but a human can understand and predict their outcome given a specific input. For instance, in the first place I've been the ML was a Bayesian model with a few dozen parameters. The team testing the risk engine was using a lot of synthetic data - given specific weights for each "bucket" and a given input, verify the output is exactly X. In such cases, each step can be verified by regular functional tests they might require some math, but if a test fails we can see exactly where did the failure happen. In a Bayesian case, calculating a weighted score, normalizing it, recalculating "buckets" and assigning new weights are all separate, understandable steps that can be verified. If your algorithm is a straightforward one, "regular" testing is just what you need. You might need a lot of test data, but in order to verify the engine correctness, you just need to understand the rules by which it functions.

Opaque ML systems are a different creature. While it is possible to define the expected output of the algorithm given the state it's in (unless it also has a random effect as well), there's use to actually finding them since it would not help us understand why something was a specific answer was given. Notoriously, there are the deep neural networks that are nothing short of magic. We can explain the algorithm of transitioning between layers, the exact nature of the back-propagation function we use and the connections between "neurons", but even if we spot a mistake, there isn't much we can do besides feeding it through the back-propagation function and move on to the next data point.In fact, this is exactly what is being done on a massive scale while "training" the neural net. With opaque systems testing is basically the way they are created, so accepting them as fault-free is the best we can do.

That being said, ML algorithms are rarely fault free, and this brings us to the point we mentioned before - most of our product is about integrating ML component(s) to our system and we should focus on that The first thing is to see whether we can untangle it from the rest of our system. We could either mock the response we expect or use input that is known to provide certain result and see that our system works as expected given a specific result from the ML component.

Clever, just assume that correctness is someone else's problem, and we're all peachy. Right?
Well, not exactly. Even though in most cases there will be a team of data scientists (or is it data engineers now?) who are building and tuning the model, there are cases where we actually need to cover some gaps in figuring out whether it's good enough - Is our model actually as good as we think it is? Maybe we've purchased the ML component and we take the vendor's claims with a grain of salt.

A lot of the potential faults can be spotted when widening the scope from pure functionality to the wider world. Depending on what our ML solution does, there's a plethora of risks - from your chatbot turning Nazi to describing black people as apes to being gender-biased in hiring, that's without considering deliberate attacks that will run your self-driving car to the ditch. To avoid stupid mistakes that make us all look bad in retrospect, I like to go through a list of sanity questions - ones that have probably been addressed by the experts who've built this system, but just in case they got tunnel vision and forgot something - The list is quite short, and I'm probably missing a few key questions, but here's what I have in mind:

Does it even makes sense? Some claims (such as "Identifying criminals by facial features") are absurd even before we dive deeper into the data to find the problems that make the results superficially convincing).
The "Tay" question: Does the software continues to learn once in production? If it does - how trustworthy is the data? what kind of effort would be needed to subvert the learning ?
The Gorillas question: Where did we get the training data from? is it representative of what we're expecting to see in production?
Our world sucks question: Is there a real world bias that we might be encoding into our software? Teaching software to learn from human biased decisions will only serve to give this bias a stamp of algorithmic approval.
Pick on the poor question: Will this software create a misleading feedback loop? This idea came from Cathy O'Neil's "Weapons of Math Destruction" - Predictive policing algorithms meant that cops were sent to crime ridden neighborhoods, which is great. But now that there are more cops there, they will find more crimes - from drunken driving to petty crimes or jaywalking. So even if those areas are back to normal rate of crime, it will still get more attention from the police and will make the life of the residents there more difficult.
"Shh.. don't tell" question: Is the model using data that is unlawful to use? Is it using proxy measures to infer it? Imagine an alternative ML based credit score calculator. It helps those who don't have a traditional credit score to get better conditions for their credit. Can it factor in their sexual preference? And if they agree to disclose their social profiles for analysis, can we stop the algorithm from inferring their sexual preferences?

After asking those questions (that really should be asked before starting to build the solution, and every now and then afterwards), and understanding our model a bit better, we can come back to try and imagine risks to our system. In security testing (and more specifically, threat modeling) there's a method of identifying risks called "movie plotting" where we assemble a diverse team and ask them to plot attacks in a movie like "mission impossible". This idea could work well to identify risks in incorporating ML components to our business, with the only change is that the movie plot will be inspired by to "Terminator" or "The Matrix"

And yet, the problem still remains: Validating ML solutions is difficult and requires a different training than what most software testers have (or need). There are two tricks that I think can be useful.

Find an imperfect oracle: it could be that you have competitors that provide similar service, or that there's a human feedback on the expected outcome (you could even employ the Mechanical Turk). Select new (or recent) data points and compare your system results with the oracle ones - The oracle should be chosen such that every mismatch is not necessarily a problem, but that it's something that is worth investigating. Keep track on the percentage of differences. If it changes drastically, something is likely wrong on your side. Investigate a few differences to see if your oracle is good enough. In our case, we try a bunch of files that enough of our competitors claim to be malicious, and if we differ from that consensus, we assume it's a bug until proven otherwise.
Visualize. Sometimes, finding an oracle is not feasible. On the other hand, it could be that people can easily spot problems. Imagine that Google were placing a screen in several offices and projecting crawled images with what the ML thinks it actually is. It is possible that an employee would have seen it classifies people as apes way before real people were offended by it.

With that being said, I want to circle back to where I've started: While I hope you've gained an idea or two, testing is testing is testing. With any sort of system that you test, you should start by understanding just enough about it and how it interacts with your business needs, then figure out the risks that you care about and find the proper ways to address those risks.