אשרי אדם מפחד תמיד Happy is the man who always fears: October 2015

Saturday, October 31, 2015

הרקדן האוטומטי
The automated dancer

עד כה, מיעוט ההגיגים שיצא לי לכתוב מתעסקים במשהו ספציפי שקרה לי וגרם לי לחשוב, או להבין משהו. הפעם, התמריץ המרכזי שלי הוא שמישהו טועה באינטרנט. בפרט, עוד מישהו הגיע ושאל "איך אני מתקדם לתפקיד של בודק אוטומציה" (המונח המוזר נמצא במקור, גם אם הציטטה אינה מדוייקת). והשאלה הזו מרגיזה אותי בכמה אופנים בבת אחת.
הדבר הראשון שמרגיז אותי הוא שאני נתקל הרבה בשאלה הזו, אבל מעט מאוד בשאלות על תחומי התמחות אחרים - עדיין לא ראיתי מישהו ששואל איך הוא מתמחה בבדיקות חוויית-משתמש, או בבדיקות מסדי-נתונים, או בשום סוג בדיקות אחר (למעט בדיקות עומסים). כלומר, אף אחד לא שואל איך להשתפר כבודק פונקציונלי, או כבודק מערכות משובצות מחשב (Embedded, בלע"ז), או איך למצוא תחום התמקצעות שאינו אוטומציה או עומסים (אני משאיר את בדיקות האבטחה בחוץ, כי רוב בודקי האבטחה בהם נתקלתי לא רואים בעצמם בודקים, אלא "יועצי אבטחה"). הסיבה לזה ברורה - כי על תכנות או עמוסים המעסיקים משלמים יותר, על התחומים האחרים - לא. כך או כך - זה מרגיז.
הסיבה שנייה בגללה השאלה הזו מרגיזה אותי, היא שהתייחסות לאוטומציה כאל "התקדמות"היא אמנם נכונה מבחינת תנאים ושכר, אבל אין מסלול התפתחות הגיוני מבודק תוכנה שאינו מתכנת, לבודק תוכנה שכותב אוטומציה. כן, אפשר לרכוש את הכישורים האלה, אבל אפשר גם לרכוש כישורים בכלכלה וחשבונאות. כדי לרכוש כישורים רלוונטיים, הבודק יצטרך להשקיע מאמץ מחוץ לשעות העבודה, בניגוד להתפתחות טבעית שכוללת השתתפות במטלות קשורות שמקנות כישורים מועילים (למשל, כדי להתמחות בבדיקות מסדי נתונים, אפשר להשקיע מאמץ בבדיקת הפיצ'רים שמתעסקים במסדי נתונים. עם תכנות, רף הכניסה גבוה מכדי פשוט להיכנס פנימה ולהתחיל לכתוב קוד).
סיבה שלישית, קצת קטנונית, היא שבניגוד להתקדמות מבודק תוכנה זוטר לבודק תוכנה בכיר, אי אפשר לומר שבודק שכותב אוטומציה הוא בודק תוכנה טוב יותר, בניגוד לבודקים ששואלים "איך אני יכול להתקדם לפיתוח תוכנה" ומקבלים ממני את התשובה "זו לא התקדמות, זו החלפת מקצוע", כתיבת אוטומציה היא עדיין בתחום בודקי התוכנה, וכיוון שמשלמים, בדרך כלל, יותר על העבודה הזו, קשה לי גם לומר בלב שלם שלא מדובר בהתקדמות. אבל באופן מהותי, אין שום דבר שהופך בודק תוכנה שיודע לכתוב קוד לעדיף על פני בודק תוכנה שהתמחה בתחום אחר.

אבל, הסיבה המרכזית בגללה אמירות כאלה מרגיזות אותי היא שמי שמשתמש במינוח כזה מקבע ראייה לא נכונה של הקשר בין תכנות לבדיקות. אין, ולא צריך להיות דבר כמו "בודק אוטומציה". כתיבת קוד היא כישור שיכול לעזור לבודק תוכנה, ואפילו כישור חשוב (לפחות לדעתי), אבל זה לא פרמטר שמצדיק קטגוריה בפני עצמו. אף אחד לא חושב לקרוא לעצמו "בודק SQL", למרות שחלק ניכר מהבודקים נעזרים בשאילתות SQL. באופן דומה, הבנה כלשהי בתכנות, או "בשפת סקריפטים" כמו שאוהבים לנסח את זה כל מיני אנשים שמנסים לשלם משכורת נמוכה מזו של מתכנת, כל כך נפוצה היום, וכמוה גם השתתפות בכתיבת מבחנים אוטומטיים, עד שההבחנה בין ידני\כותב אוטומציה מיטשטשת, במיוחד אם מתחילים להשתמש בכלים פשוטים וחזקים מאוד (SoapUI הוא דוגמה אחת, seleniumIDE הוא דוגמה אחרת, קצת פחות טובה) שמספקים לכל אחד יכולות בסיסיות של אוטומציה.

בקיצור, בפעם הבאה בה אתם רואים מישהו משתמש במינוח "בודק אוטומציה", קחו פטיש פלסטיק וחבטו לו בראש.

(For the non-Hebrew speakers in the audience: the title refers to a song from the 70's that somehow was not forgotten in the mists of time.)

So far, the few posts I have written were triggered by an actual experience that got me thinking about something, or helped me understand an interesting point. This time, however, my main motivation is that someone is wrong on the internet. Specifically, I have encountered the question (roughly translated): "How can I become an automation tester?" This question annoys me in several ways at once.

Probably the first thing that annoys me is that I encounter this question a lot. It's almost always "How can I advance to automation?" or "How can I acquire automation skills?". It is almost never "How can I become a better functional tester?", or better at embedded systems testing, user experience testing, etc. One exception to this is performance testing, which I see frequently enough (another might be security testing, but I'm leaving this out, since most security testers I've encountered defined themselves as "security consultants"). The reason for that, sadly, is obvious - employers pay more for those types of testing. For automation, since they are used to paying programmers higher salaries, and for performance testing, since it actually seems like something that requires expertise (and testing usually isn't perceived that way). I understand it, but this understanding annoys me even more.
Another thing that annoys me is that this question is usually phrased as "How can I advance my career to automation?". While this is an advancement in terms of pay and, sadly, also prestige, so is becoming a successful lawyer. The issue is that, unlike other fields of expertise in testing, there isn't really a path that can lead someone from being a "manual tester" to being hired as a tester that writes automation. For instance, I can catch up on security testing by involving myself in the security-related features of my product, practicing a bit, reading a bit and being useful most of the time. I can't really learn how to program properly without investing a lot of effort in it, in my spare time. Not that it can't be done - it can, and it is - but the amount of effort is not that far from learning a new profession. The bar for starting to contribute is simply too high for one to contribute while learning.
A third reason, which is a bit petty, is that unlike advancing from junior to senior tester, there is nothing in "advancing" from tester to automation writer that means that one is a better tester. A performance tester is better at testing the narrow field of performance (narrow only in comparison to the entire field of testing, that is; performance testing can be an astoundingly wide field), but a tester that can write automation is by no means better at testing anything. Programming is a skill that can assist greatly in testing, but so can SQL, or strong domain knowledge - possessing any of these skills doesn't make one a better tester.

But, finally, the main reason I find "automation tester" to be annoying is that using this term sets up a wrong connection between programming skills and testing. IMHO, there isn't, and shouldn't be, anything like an "automation tester". Writing code is a skill that can help a software tester, and I will even say I consider it to be an important skill for a tester, but possessing this skill isn't something that justifies creating a new category of testers. I have yet to hear anyone refer to themselves as an "SQL tester", even though some testers don't know SQL and don't use it in their work - this skill is too common to define a category. In a similar manner, some understanding of code (or "of a scripting language", as some recruiters who wish to pay less than a programmer's wage like to phrase it) is so common today, as is the requirement to participate in writing some automated checks, that the line between a tester that writes automation and one that doesn't is getting really blurry. And, if that isn't enough, there are some very strong tools out there (such as SoapUI, or Selenium IDE, which is a less than perfect example) that provide just about everyone with some amount of automation capabilities.

So, next time you hear anyone referring to an "automation tester", take a squeaky hammer and smack them on the head.

Saturday, October 17, 2015

OWASP are cool

השבוע הגעתי לכנס APPSEC-IL, כנס שמאורגן על ידי סניף OWASP המקומי ומתעסק, כמובן, באבטחת תוכנה. לכנס הזה היו כמה יתרונות: קודם כל, הוא נשמע מעניין. ברגע בו פורסמו ההרצאות השונות, יכולתי למצוא כמעט בכל נקודת זמן משהו שנראה מעניין (מה שאינו מובן מאליו כשבכל רגע יש לכל היותר שתי הרצאות). שנית: הוא היה קרוב למקום מגורי - בערך ארבעים דקות נסיעה, כולל פקקים . שלישית: הוא היה בחינם.

שלושת המאפיינים האלה אפשרו לי גם להפיץ קבל עם ועדה את קיום הכנס ולומר לאנשים "בואו" בלי לדאוג שאני משבש למנהל כזה או אחר את תקציב הפיתוח האישי שיש לו וגם לקוות שאנשים מהצוות באמת יגיעו אליו (ספויילר - זה לא קרה).

התחושות שלי מהכנס עצמו הן בסך הכל - טובות למדי. האזנתי לכמה הרצאות שהיו מחוץ למה שאני מתעסק בו בדרך כלל (למשל, ארז מטולה דיבר על אבטחה בעולם IOT) ולהרצאות אחרות שעסקו בתחום קרוב מאוד לתחום שלי, אבל הציגו רעיונות שהיו חדשים לי (בלטה לטובה ההרצאה של נתנאל רובין "גדול מכדי להיכשל - לשבור את קוד הליבה של וורדפרס" שהציגה מסלול מרשים של הסלמת הרשאות). אם נוסיף לזה את הרצאת הפתיחה המצויינת של הכנס בה ג'רמיה גרוסמן (מקים WhiteHat security שהגיע לארץ מהוואי הרחוקה) שהציג תמונה מעניינת על מצב ההשקעה באבטחת תוכנה למול ההשקעה בביטוח נזקים, ושאל שאלה פשוטה לכאורה - איך זה יכול להיות שכל שוק אבטחת התוכנה מסיר מעצמו כל אחריות למתקפות על הלקוחות שלו? למה חברות שמוכרות אבטחת תוכנה לא מצהירות שהן יחזירו את הכסף שקיבלו אם מישהו יצליח לפרוץ ללקוח שלהם?

במבט ראשון, השאלה נשמעת פשוטה -אם אתה לא יכול להבטיח תוצאות, למה שמישהו יקנה את המוצר שלך? אבל למעשה, הצהרת אחריות שכזו נכנסת לעולם שלם של ניואנסים ומגבלות, קצת כמו הפוליסה המסובכת להחריד שמקבל כל מי שקונה ביטוח. הסיבה לזה היא שיש כל כך הרבה דרכים להיכנס ולגרום נזק, ואף חברה (לפחות, אף חברה למיטב ידיעתי) לא מתעסקת בכל הקשת הרחבה של איומים. התקפות, למרבה הצער, לא יטרחו להגביל את עצמן לוקטור יחיד. האם חברה שמתעסקת עם חולשות של אתרי אינטרנט יכולה להבטיח החזר במקרה בו המתקיף היה אחד העובדים? או במקרה של פריצה פיזית לחדר בו יושב השרת? האם חברה כזו יכולה להבטיח למוצר שמשתמש בAWS שאף אחד לא יצליח לפרוץ לתשתית הענן של אמזון? ואם מישהו השתלט על פס הייצור של רכיב חומרה בו המערכת משתמשת? בקיצור - למרות הטענה של מר גרוסמן על כך שהחברה שלו מציעה לא רק החזר כספי במקרה של פריצה אלא גם כיסוי של חצי המיליון הראשון מהנזק שייגרם, אני מוצא את עצמי קצת סקפטי לגבי יכולתן של חברות אחרות לעשות אותו דבר בצורה אפקטיבית (ולמעשה, אני תוהה מה הן האותיות הקטנות במקרה של החברה הזו).

חוץ מזה, כמקובל בכל מיני כנסים, חילקו לנו כל מיני שמונצעס. קיבלתי חולצה, כדור ספוג וגם - ספר! ואפילו ספר שיש סיכוי טוב שאקרא (למעשה, עותק של הספר שוכב אצלנו במשרדים, והמעט שקראתי ממנו היה די מסודר ויעיל)

אלו הדברים המצויינים שהיו. מטבע הדברים, היו גם כמה נקודות שאהבתי פחות.

נתחיל עם המקום הפיזי - מצד אחד, האולמות היו נפלאים. מספיק מקום לקצת יותר מ550 איש באולם הראשי, כיסאות נוחים ושקעי חשמל למי שהגיע מוכן (השארתי את ספק הכוח של המחשב בדירה). ומצד שני - המסדרון. מחוץ לאולם הראשי היה מסדרון בו שכנו הדוכנים השונים, ובו הסתובבו האנשים. או אולי, מילה מתאימה יותר תהיה "נדחקו". לא באמת היה אפשר לזוז שם.

ההערה השניה שלי היא לגבי לו"ז - הכנס היה צפוף מאוד. עם רבע שעה בין סבב לסבב, ועם שמונה סבבי הרצאות, לא נשאר הרבה מאוד זמן לפגוש אנשים ולהכיר תחומי עיסוק קרובים יותר או פחות. ואם כבר בענייני היכרות עסקינן, הרצאות המוניות הן לא בדיוק מקום אידיאלי להכיר בו אנשים. אם לוקחים בחשבון שיצירת קהילה הייתה אחת המטרות של הכנס (כך, לפחות, על פי דברי יו"ר OWASP-IL, אבי דוגלן), אני חושב שהיה כאן פספוס משמעותי. מי שהכיר קודם אנשים, עצר להחליף איתם מילה או שתיים בין ההרצאות, מי שלא - לא ממש הספיק לגשת לאחרים, גם אם רצה. אני חושב שכמה סדנאות, או כמעט כל פעילות שאפשר לנהל בקבוצות של עד עשרים איש, היו יכולות לתרום לנושא הזה לא מעט.

ואם בדברים חסרים עסקינן - חסרה לי נוכחות של בודקי תוכנה בכנס. היו מפתחים, היו מנהלי מוצר,אנשי מחלקת אבטחת תוכנה ושלל "יועצים", אבל אם לא סופרים את אלו שאפשר לקרוא להם penTesters (ואני לא סופר אותם כי חלק גדול מהם מתקרא "יועץ"), הרגשתי שהייתי בודק התוכנה היחיד בכנס. אני בטוח שזה לא נכון, אבל זו הייתה התחושה: בפתיחת הרצאות שאלו כמה מהקהל מפתחים, כמה מנהלי מוצר, כמה אנשי אבטחה. לבודקי תוכנה, מסתבר, אין נראות בכנס. וזה חבל לי, בטח כשהחיבור בין אבטחת תוכנה לבדיקות הוא כל כך טבעי (ומי שלא מספיקה לו המילה שלי לגבי הקשר, מוזמן לקרוא את הספר המצויין של אדם שוסטק Threat modeling: designing for security ולקרוא שם הערה דומה באחד הפרקים המאוחרים יותר).

ותודה רבה לOWASP על מפגש שהיה באמת מצויין.

-----------------------------------------------------

This week I attended APPSEC-IL, a conference that was organized by the Israeli branch of OWASP and deals, naturally, with software security. From my standpoint, this conference had three main advantages: First of all - it was interesting. Once the talks were published, I was able to find something that seemed interesting for almost every time-slot (which is no mean feat, if you take into consideration that there were at most two talks per time-slot). Second - it was not far from where I live and work (around a 40-minute drive, traffic included). Third - the conference was free.

Those three properties made it really easy for me to invite my colleagues to participate (I could even say truthfully "come, there will be free food") and saved me from worrying that I would convince too many people to go and create a mess for our director, who has a limited "personal development" budget that probably can't send everyone to a conference. Also, those properties made it possible for me to hope that some of my team would attend (spoiler - they didn't).

My impressions from the conference are pretty good. I got to listen to some talks that exposed me to a world very different from my own (one of those was the talk Erez Metula gave on security in the world of IOT). Other talks, on subjects I was more familiar with, still had a lot to teach me about new things to worry about (one I particularly enjoyed was Netanel Rubin's talk, "Too big to fail - breaking WordPress core", which presented an impressive scenario of privilege escalation).

Add to that the great keynote talk presented by Jeremiah Grossman (Founder and CTO of WhiteHat Security), who came all the way from distant Hawaii to give his talk. He presented some interesting figures about the money invested in software security and the amounts invested in insurance against security breaches. He asked a simple question: Why are software security solutions being sold "as is"? Why aren't they providing some guarantees or a "return policy" in case a customer was breached despite using those solutions properly?

This question sounds very simple at first glance - if you can't guarantee that your product delivers, why should anyone pay for it? Just about any other product, software or otherwise, has some sort of a return policy. However, after giving it a little bit of thought, I'm not sure this question is nearly as simple as it sounds. In fact, I don't see a way that such a commitment won't turn out to be a long list of exceptions, limitations and nuances, a little bit like those nasty, long, incomprehensible terms and conditions you sign when buying insurance. The reason for that is that there are so many ways to get in and cause some havoc, and no company (at least, to my limited knowledge) covers all the types of security hazards that exist in the big bad world outside. Attackers, most unfortunately, won't stick to one vector but would rather use whatever they can. Could a company specializing in web application security commit that your product won't be hacked by one of your employees? Could it protect against someone breaking into the physical server room? Or from an attack that targets the manufacturing line of some silicon chips used by your system? If you are using AWS, could they protect you from a successful attack on Amazon's infrastructure? And what happens if an attack compromises something out of scope for that company in order to expose a defect that is covered (e.g.: taking down a physical encryption machine in order to force the application to use less secure encryption)? In short, despite Mr. Grossman's claim that his company is offering not only a refund, but also coverage of the first half-million dollars that the customer lost after a successful attack, I remain a bit skeptical about this practice being adopted widely and effectively (in fact, I am really curious about the fine print in WhiteHat's contracts).

Besides those, as is customary in some conferences, there were some nice giveaways - I got a T-shirt, a sponge ball, and a book! There is even a fair chance I'll read this book (In fact, there is a copy lying in our office, and the chapter I read from it was very informative and useful).

Those were the things I enjoyed. As things happen, there were some minor things I enjoyed a bit less. One of them was the location. On one hand, the conference rooms were really great, with enough room to comfortably accommodate the 550+ participants in the main auditorium, with comfy chairs and electricity sockets for those who came prepared (I left my power cable at home). On the other hand, the corridor, where all the sponsors had their booths and where people gathered during breaks, was very crowded. Moving in the corridor involved stepping on a toe or two, pushing a bit and being pushed.

Another thing for me was the schedule - with 8 talks per track, there was very little time left to meet new people. As one of the purposes of this conference was to strengthen the community (at least, this was declared by OWASP-Israel's chairman, Avi Douglen), I think the conference could have done a bit more in this field. I think that some workshops, or any other activity that can be conducted with up to 20 participants, could be more effective in getting people to know each other.

One other thing I sorely missed was tester visibility in the conference. There were product owners, developers, "security people" and all kinds of consultants. But, if you don't count the PenTesters (and I don't count them since they usually prefer the title "security consultant"), I felt as if I was the only tester around. I'm sure that's not the case, but it still felt this way. In some of the talks the audience was asked "How many here are developers? Product managers? Security people?" No one asked about testers or QA. This pains me a bit, especially since the connection between software security and testing is so natural (and you don't have to take my word for it: you can read the book "Threat modeling: designing for security" by Adam Shostack and find a similar remark in one of the later chapters).

So thank you, OWASP-Israel, for a great conference.

Tuesday, October 13, 2015

Most of the iceberg is underwater
רוב הקרחון נמצא מתחת למים

צורת העבודה שלנו, בלא מעט מובנים, היא סקראם בחצי כוח. אנחנו עובדים יחד - אבל יש לנו חלוקה ברורה מאוד של בודקים ומפתחים, אנחנו עובדים בספרינטים של שבועיים, אבל משחררים גרסה כל כמה חודשים. ולמרות כל מיני פינות שיש איפה להשתפר בהן, אני די מרוצה מצורת העבודה שלנו.

התוצאה המיידית של שחרור גרסה פעם בכמה חודשים היא שמצטברים לא מעט דברים שצריך לעשות כדי לעבור מגרסה לגרסה, ולא את כולם ניתן לעשות באופן אוטומטי (בעיקר, מגוון הגדרות סביבה כמו שמות של מכונות, סיסמאות ותהליכי החלפת מפתחות חתימה עם גורמים חיצוניים). כדי לוודא ששום דבר לא מתפספס, אנחנו מקדישים בצוות שבוע של בודק כדי לקחת את הגרסה הנוכחית ולהריץ עליה את תהליך השדרוג במכה, מא' ועד ת'. בפעמיים שלוש האחרונות לקחתי את המשימה הזו על עצמי, ועד כה בכל בדיקה גילינו צעדים ששכחנו, בעיות שהתעלמנו מהן ונובעות ממגבלות טכניות כאלה או אחרות או סתם נקודות שכדאי להדגיש בפני מי שמבצע את תהליך ההתקנה. היתרון המרכזי של ביצוע הבדיקות בפורמט הזה, מבחינתי, הוא שכך קל לי יותר לחבוש את כובע איש OPS

ביחד עם הבדיקות האלה, אנחנו גם בודקים תיקונים אחרונים לבעיות, אם יש כאלה. הפעם, בעקבות הדרכה שעשינו על אחד הפיצ'רים, ביקשו מאיתנו להקטין את קבצי הלוג. מצאנו כמה שורות שחוזרות על עצמן והעברנו אותן לרמת דיבאג. לכאורה - הכל אמור לעבוד כמו שצריך.

בפועל, תוך כדי שאני עובר על הקובץ אחרי התיקון, גיליתי שיש כמה שורות שחוזרות על עצמן שוב ושוב ושוב. ניגשתי עם השורות האלה למפתח, שהגיב (כמו שהוא מגיב כמעט תמיד) "לא יכול להיות". אחרי שנתתי לו להוציא קצת קיטור (לקח לי קצת זמן ללמוד את זה, אבל גיליתי שאם אני רק עומד בצד ונותן לו לבהות בבעיה, בסוף הוא מקבל את קיומה). הבנו שיש לנו בעיה קצת יותר גדולה מאשר גודל הלוג. בפרט, הבעיה עם השורות האלה היא שהן מייצגות אתחול של מגוון דברים שאמורים לעלות פעם אחת ודי, וזה שהן מופיעות כל שלוש שניות אומר שמתבצעים הרבה מאוד דברים שלא אמורים לקרות יחד איתן.

אחרי חקירה קצרה, גילינו את הסיבה והתקלה תוקנה. אבל אני למדתי שני דברים, שאצטרך להזכיר לעצמי מדי פעם.

קודם כל, בנוסף לדרישות הפונקציונליות, ישנן גם דרישות תפעוליות צדדיות, ובכל פעם אני נתקל בעוד דרישה כזו. הפעם זו הייתה דרישה שלא להציף את הלוגים, ושג'יגה ביום (לא כולל פעילות) זה קצת יותר מדי.

הדבר השני שאני צריך לזכור הוא שאני רואה בעיקר סימפטומים. אני יכול לנחש את הסיבה להם, אבל אם משהו קצת מוזר, זה בדרך אומר שהוא מחביא בעיה ברורה מתחת למכסה המנוע. אני עדיין לא בטוח איך, אבל זה משהו שכדאי לי להישאר עירני לגביו ולחפש גם להבא.

-----------------------------------------------------

Our work methodology is probably best described as "Scrum-but": we work side by side, but have a distinct separation between testers and developers; we target user stories by priority, but most of the time the developers will continue to the next user story while there are still testing tasks; and we work in two-week sprints, but release something to production only once every couple of months.

This latter "but" has one nice side effect - there are many tasks that need to be done during installation day. Sadly, some of them cannot be done automatically. It may be a new configuration parameter defining an environment property or a password, it can be creating a signing certificate, or contacting a 3rd party to provide us with its public key, or just something that should be done only once and was not worth automating. In order to make sure that we do not miss anything on installation day, we create a step-by-step installation guide and then invest a week in testing it. The last two or three times I took this task on myself, and each and every time I came across something we forgot, or that even needed some last-minute fix. The main advantage I gain in performing the upgrade tests is that it makes it easier for me to wear my OPS hat, and thus notice stuff we missed the first time we tested it, such as procedural limitations (e.g.: in order to keep all machines in sync, OPS keep all files in a source-control server, so we cannot set a requirement to keep the same file with different content, at least not without giving them enough time to come up with a solution). In this case, I noticed that even without actually doing any work, one of our logs swelled up and took a lot of disk space. Just to be certain, I asked around what a reasonable log size is, and whether 1GB per day could be acceptable for a while. They said they prefer their logs around 20 MB, so we got out our axe and went woodcutting.

What we did was simply move some repeating lines to the debug log level, which is normally turned off in production, thus shrinking the logs.
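To make that kind of change concrete, here is a minimal sketch using Python's standard `logging` module. It is only an illustration: the message texts, the `poll_once` function and the logger names are made up for the example and are not taken from our actual product.

```python
import io
import logging


def build_logger(level: int, stream: io.StringIO) -> logging.Logger:
    """Create an isolated logger that writes to an in-memory stream."""
    logger = logging.getLogger(f"demo.{level}")  # hypothetical logger name
    logger.handlers.clear()
    logger.setLevel(level)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger.addHandler(handler)
    return logger


def poll_once(logger: logging.Logger) -> None:
    # This repeating line used to be INFO; it was moved to DEBUG.
    logger.debug("polling for new work")
    # A line that is still worth seeing in production stays at INFO.
    logger.info("processed 1 item")


def lines_written(level: int, polls: int) -> int:
    """Count how many log lines a number of polls produces at a given level."""
    stream = io.StringIO()
    logger = build_logger(level, stream)
    for _ in range(polls):
        poll_once(logger)
    return len(stream.getvalue().splitlines())
```

With the logger set to `logging.INFO` (the assumed production setting), the demoted line is dropped and only the remaining message reaches the log; with `logging.DEBUG`, both are written.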
In theory, all should work fine. The change is not risky, and should solve our problem easily. Only it also did something else - it turned my focus to the log files, and to the number of times I saw the same line being printed. With the extra lines cleaned up, it was easier for me to see that we were still printing a lot of data from an almost idle process. When I checked this with the relevant developer he first responded (as he does almost always) "Impossible". I let him vent a bit of steam (it took me a while to learn this, but in his case, the best way I can make him accept a problem is to show it to him, and move aside while he stares at it) and then we understood that the problem was a bit more complex than we first assumed. The lines we were seeing were printed during the process initialization, where some data that should be loaded only once and then cached was being loaded repeatedly. Or, as the experts phrase that: "oops".
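The general shape of this kind of bug, and of one common fix, can be sketched in a few lines of Python. This is not our actual code: `ReferenceData` and both `handle_requests_*` functions are hypothetical names, and `functools.lru_cache` is just one of several ways to make sure a load happens only once.

```python
import functools


class ReferenceData:
    """Stand-in for data that should be loaded once and then cached."""

    def __init__(self) -> None:
        self.loads = 0  # how many times the expensive load actually ran

    def load(self) -> dict:
        self.loads += 1
        return {"setting": "value"}  # placeholder content


def handle_requests_buggy(source: ReferenceData, n: int) -> int:
    """Reloads the data on every cycle, which is the symptom seen in the log."""
    for _ in range(n):
        source.load()
    return source.loads


def handle_requests_fixed(source: ReferenceData, n: int) -> int:
    """Wraps the loader in a cache so that only the first call does real work."""
    cached_load = functools.lru_cache(maxsize=None)(source.load)
    for _ in range(n):
        cached_load()
    return source.loads
```

Counting the loads is the code-level equivalent of what I did by eye in the log: the same initialization line showing up once per cycle instead of once per process.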
After a short investigation we fixed the problem, but I learned two things that I will have to remind myself of from time to time.
The first is that in addition to the functional requirements I normally care for, there are always some operational requirements that I keep finding by chance (either when we get lucky and spot the requirement before the code goes live, or when it goes live and we get complaints and a request to fix stuff).
The second, which should be reiterated every now and then since it is so easy to ignore, is that I see mainly symptoms. Even though I have become really good at guessing the root cause and impact of various bugs, all that I have is what I see on the surface and a (good?) guess. Usually, when a bug is odd, or dodgy (and sometimes even when it's simple and straightforward), what I see is just the tip of the iceberg, and further digging is required to uncover the full impact of the actual bug.
I'm not sure yet how to do that, but this is something I want to remain aware of and keep looking for in the future.