December 20, 2021

Day 20 - To Deploy or Not to Deploy? That is the question.

By: Jessica DeVita (@ubergeekgirl)
Edited by: Jennifer Davis (@sigje)

Deployment Decision-Making during the holidays amid the COVID-19 Pandemic

A sneak peek into my forthcoming MSc thesis in Human Factors and Systems Safety at Lund University.

Web services that millions of us depend on for work and entertainment require vast compute resources (servers, nodes, networking) and interdependent software services, each configured in specialized ways. The experts who work on these distributed systems are under enormous pressure to deploy new features and keep the services running, so deployment decisions happen hundreds or thousands of times every day. While automated testing and deployment pipelines allow for frequent production changes, an engineer making a change wants confidence that the automated testing system is working. However, automating the testing pipeline makes the test-and-release process more opaque to the engineer, which makes it harder to troubleshoot.

When an incident occurs, the decisions preceding the event may be brought under a microscope, often concluding that “human error” was the cause. As society increasingly relies on web services, it is imperative to understand the tradeoffs and considerations engineers face when they decide to deploy a change into production. The themes uncovered through this research underscore the complexity of engineering work in production environments and highlight the role of relationships with co-workers and management on deployment decision-making.

There’s No Place Like Production

Many deployments are uneventful and proceed without issues, but unforeseen permissions issues, network latency, sudden increases in demand, and security vulnerabilities may only manifest in production. When asked to describe a recent deployment decision, engineers reported intense feelings of uncertainty as they could not predict how their change would interact with changes elsewhere in the system. More automation isn’t always the solution, as one engineer explains:

“I can’t promise that when it goes out to the entire production fleet that the timing won’t be wrong. It’s a giant Rube Goldberg of a race condition. It feels like a technical answer to a human problem. I’ve seen people set up Jenkins jobs with locks that prevent other jobs from running until it’s complete. How often does it blow up in your face and fail to release the lock? If a change is significant enough to worry about, there should be a human shepherding it. Know each other’s names. Just talk to each other; it’s not that hard.”
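
The lock-release failure the engineer describes is easy to reproduce. Below is a minimal sketch in Python (hypothetical file names and a hypothetical rollout.sh; the participant was talking about Jenkins jobs, not this code) of a deploy guarded by a lock file that is removed only when the process exits normally. If the runner is killed mid-deploy, the lock is never released and every later deploy is blocked until a human cleans it up.

```python
import os
import subprocess
import sys

LOCK_PATH = "/tmp/deploy.lock"  # hypothetical lock file shared by all deploy jobs


def acquire_lock() -> None:
    # O_CREAT | O_EXCL fails if the lock file already exists,
    # so only one deploy can hold the lock at a time.
    fd = os.open(LOCK_PATH, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    os.write(fd, str(os.getpid()).encode())
    os.close(fd)


def release_lock() -> None:
    os.remove(LOCK_PATH)


def deploy() -> None:
    # Stand-in for the real rollout; any failure raises CalledProcessError.
    subprocess.run(["./rollout.sh"], check=True)


if __name__ == "__main__":
    try:
        acquire_lock()
    except FileExistsError:
        sys.exit("another deploy holds the lock (or a dead one never released it)")
    try:
        deploy()
    finally:
        # Runs on success or on an exception, but not if the runner is killed
        # outright (agent disconnect, kill -9) -- the "fails to release the
        # lock" case the engineer describes.
        release_lock()
```

Even with the cleanup in place, an abrupt kill of the runner leaves the lock file behind, which is part of why this engineer argues for a human shepherding significant changes rather than another layer of machinery.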

Decision-making Under Pressure

“The effects of an action can be totally different, if performed too early or too late. But the right time is not clock time: it depends upon the precise state of the process evolution” (De Keyser, 1990).

Some engineers were under pressure to deploy fixes and features before the holidays, while others were constrained by a “code freeze”: periods of the year when they “can’t make significant production changes that aren’t trivial or that fix something”. One engineer felt that they could continue to deploy to their test and staging environments but warned, “... a lot of things in a staging environment waiting to go out can compound the risk of the deployments.”

Responding to an incident or outage at any time of the year is challenging, but even more so because of “oddities that happen around holidays” and additional pressures from management, customers, and the engineers themselves. Engineers often paired or worked together to increase their confidence in decision making. Pairing resulted in joint decisions, as engineers described actions and decisions with “we”: “So that was a late night. When I hit something like that, it involves a lot more point-by-point communications with my counterpart. For example, ‘I'm going to try this, do you agree this is a good thing? What are we going to type in?’”

Engineers often grappled with "clock time" and reported that they made certain sacrifices to “buy more time” to make further decisions. An engineer expressed that a change “couldn’t be decided under pressure in the moment” so they implemented a temporary measure. Fully aware of the potential for their change to trigger new and different problems, engineers wondered what they could do “without making it worse”.

When triaging unexpected complications, engineers sometimes “went down rabbit holes”, exemplifying a cognitive fixation known as a “failure to revise” (Woods & Cook, 1999). Additionally, having pertinent knowledge does not guarantee that engineers can apply it in a given situation. For example, one engineer recounted their experience during an incident on Christmas Eve:

“...what happens to all of these volumes in the meantime? And so then we're just thinking of the possible problems, and then [my co-worker] suggested resizing it. And I said, ‘Oh, can you do that to a root volume? ’Cause I hadn't done that before. I know you can do it to other volumes, but not the root.’”

Incidents were even more surprising in systems that rarely fail. For one engineer working on a safety critical system, responding to an incident was like a “third level of panic”.

Safety Practices

The ability to roll back a deployment was a critically important capability that for one engineer was only possible because they had “proper safety practices in place”. However, rollbacks were not guaranteed to work, as another engineer explained:

“It was a fairly catastrophic failure because the previous migration with a typo had partially applied and not rolled back properly when it failed. The update statement failed, but the migration tool didn’t record that it had attempted the migration, because it had failed. It did not roll back the addition, which I believed it would have done automatically”.
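
To make that failure mode concrete, here is a minimal sketch of a naive migration runner (hypothetical code and table names, not the participant's actual tool): statements run one by one with no surrounding transaction, and the migration is recorded only after all of them succeed. When the second statement fails on a typo, the first has already been applied, yet the runner has no record that the migration was ever attempted, so nothing gets rolled back.

```python
import sqlite3

# A hypothetical two-statement migration; the second statement has a typo
# ("SETT" instead of "SET"), mirroring the incident described above.
MIGRATION_ID = "0042_backfill_status"
STATEMENTS = [
    "ALTER TABLE orders ADD COLUMN status TEXT",
    "UPDATE orders SETT status = 'open'",  # typo: this statement fails
]


def run_migration(conn: sqlite3.Connection) -> None:
    # Naive runner: no transaction around the statements, and the migration
    # is only recorded after every statement succeeds.
    for statement in STATEMENTS:
        conn.execute(statement)  # the ALTER TABLE is applied...
    conn.execute(
        "INSERT INTO schema_migrations (id) VALUES (?)", (MIGRATION_ID,)
    )  # ...but this bookkeeping line is never reached


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE schema_migrations (id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY)")
    try:
        run_migration(conn)
    except sqlite3.OperationalError as err:
        print("migration failed:", err)
    # The new column survives even though the tool believes the migration
    # never happened -- the "partially applied" state the engineer describes.
    print([row[1] for row in conn.execute("PRAGMA table_info(orders)")])
    print(list(conn.execute("SELECT * FROM schema_migrations")))
```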

Sleep Matters

One engineer described how they felt that being woken up several times during the night was a direct cause of taking down production during their on-call shift:

“I didn't directly connect that what I had done to try to fix the page was what had caused the outage because of a specific symptom I was seeing… I think if I had more sleep it would have gotten fixed sooner”.

Despite needing “moral support”, engineers didn’t want to wake up their co-workers in different time zones: “You don't just have the stress of the company on your shoulders. You've got the stress of paying attention to what you're doing and the stress of having to do this late at night.” This was echoed in another engineer’s reluctance to page co-workers at night as they “thought they could try one more thing, but it’s hard to be self-aware in the middle of the night when things are broken, we’re stressed and tired”.

Engineers also talked about the impacts of a lack of sleep on their effectiveness at work as “not operating on all cylinders”, and no different than having 3 or 4 drinks: “It could happen in the middle of the night when you're already tired and a little delirious. It's a form of intoxication in my book.”

Blame Culture

“What's the mean time to innocence? How quickly can you show that it's not a problem with your system?”

Some engineers described feeling that management was blameful after incidents and untruthful about priorities. For example, an engineer described the aftermath of a difficult database migration: “Upper management was not straightforward with us. We compromised our technical integrity and our standards for ourselves because we were told we had to”.

Another engineer described a blameful culture during post-incident review meetings:

“It is a very nerve-wracking and fraught experience to be asked to come to a meeting with the directors and explain what happened and why your product broke. And because this is an interwoven system, everybody's dependent on us and if something happens, then it’s like ‘you need to explain what happened because it hurt us.’”

Engineers described their errors as “honest mistakes” as they made sense of these events after the fact. Some felt a strong sense of personal failure, and that their actions were the cause of the incident, as this engineer describes:

“We are supposed to follow a blameless process, but a lot of the time people self-blame. You can't really shut it down that much because frankly they are very causal events. I'm not the only one who can't really let go of it. I know it was because of what I did.”

Not all engineers felt they could take “interpersonal risks” or admit a lack of knowledge without fear of “being seen as incompetent”. Synthesizing theories of psychological safety with this study’s findings, it seems clear that environments of psychological safety may increase engineers’ confidence in decision making (Edmondson, 2002).

What Would They Change?

Engineers were asked “If you could wave a magic wand, what would you change about your current environment that would help you feel more confident or safe in your day-to-day deployment decisions?”

In addition to “faster CI and pre-deployments”, engineers above all spoke about needing better testing. One participant wanted a better way to test front-end code end-to-end: “I return to this space every few years and am a bit surprised that this still is so hard to get right”. Another engineer wanted “integration tests that exercise the subject component along with the graph of dependencies (other components, databases, etc.), using only public APIs. I.e., no ‘direct to database’ fixtures, no mocking”.
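
As a rough illustration of that last request, here is a minimal sketch of such an integration test in Python (hypothetical service, endpoints, and environment variable; the participants did not share code). The test drives the component only through its public HTTP API against a test environment where its real database and downstream dependencies are running, with no fixtures written straight into the database and no mocks.

```python
import os
import uuid

import requests  # assumes the service under test is reachable over HTTP

# Base URL of the component deployed in a test environment together with its
# real database and the downstream services it depends on.
BASE_URL = os.environ.get("ORDERS_API_URL", "http://localhost:8080")


def test_create_and_fetch_order_via_public_api():
    # Exercise the component end to end through its public API only.
    payload = {
        "customer_id": str(uuid.uuid4()),
        "items": [{"sku": "ABC-123", "qty": 2}],
    }

    created = requests.post(f"{BASE_URL}/orders", json=payload, timeout=5)
    assert created.status_code == 201
    order_id = created.json()["id"]

    # Read the order back through the same public API; whatever the component
    # did with its database and downstream services happened for real.
    fetched = requests.get(f"{BASE_URL}/orders/{order_id}", timeout=5)
    assert fetched.status_code == 200
    assert fetched.json()["items"] == payload["items"]
```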

Wrapping Up

Everything about engineers’ work was made more difficult in the face of a global pandemic. In the “before times”, engineers could “swivel their chair” to get a “second set of eyes” from co-workers before deploying. While some engineers in the study had sophisticated deployment automation, others spoke of manual workarounds with heroic scripts written “on the fly” to repair the system when it failed. Engineers grappled with the complexities of automation and with the risk and uncertainty associated with decisions to deploy. Most engineers who used tools to automate and manage configurations did not experience relief in their workload: they had to maintain their skills in manual intervention for when the automation did not work as expected or when they could not discern the machine’s state. Such experiences highlight the continued relevance of Lisanne Bainbridge’s (1983) research on the Ironies of Automation, which found that “the more advanced a control system is, the more crucial the role of the operator”.

This study revealed that deployment decisions cannot be understood independently from the social systems, rituals, and organizational structures in which they occurred (Pettersen, McDonald, & Engen, 2010). So when a deployment decision results in an incident or outage, instead of blaming the engineer, consider the words of James Reason (1990) who said “...operators tend to be the inheritors of system defects…adding the final garnish to a lethal brew whose ingredients have already been long in the cooking”. Engineers may bring their previous experiences to deployment decisions, but the tools and conditions of their work environment, historical events, power structures, and hierarchy are what “enables and sets the stage for all human action.” (Dekker & Nyce, 2014, p. 47).

____

This is an excerpt from Jessica’s forthcoming thesis. If you’re interested in learning more about this deployment decision-making study or would like to explore future research opportunities, send Jessica a message on Twitter.

References

Bainbridge, L. (1983). Ironies of automation. In G. Johannsen & J. E. Rijnsdorp (Eds.), Analysis, Design and Evaluation of Man–Machine Systems (pp. 129–135). Pergamon.

De Keyser, V. (1990). Temporal decision making in complex environments. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 327(1241), 569–576.

Dekker, S. W. A., & Nyce, J. M. (2014). There is safety in power, or power in safety. Safety Science, 67, 44–49.

Edmondson, A. C. (2002). Managing the risk of learning: Psychological safety in work teams. Citeseer.

Pettersen, K. A., McDonald, N., & Engen, O. A. (2010). Rethinking the role of social theory in socio-technical analysis: a critical realist approach to aircraft maintenance. Cognition, Technology & Work, 12(3), 181–191.

Reason, J. (1990). Human Error (pp. 173–216). Cambridge University Press.

Woods, D. D., & Cook, R. I. (1999). Perspectives on human error: Hindsight bias and local rationality. In F. Durso (Ed.), Handbook of Applied Cognitive Psychology. Retrieved 9 June 2021 from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.474.3161
