An operations team, almost by definition, focuses on the steady-state “run” of whatever that team is responsible for … well, operating. This could be an electrical plant, an HR department, or an IT operations team; anything which needs to function day-in and day-out, reliably and without fuss. Those things the rest of us assume “just work”.
But just because we don’t notice doesn’t mean there isn’t anything happening. Much like an iceberg, where 90% of the mass sits below the waterline, most of what the team does to make sure everything “just works” is invisible to anyone outside it. This, of course, is by design. Humans tend to specialize and focus on their specializations, trusting other groups to keep up their ends of the social contract. If everyone had to be an expert in everything, after all, then no-one would ever get anything done. This division of labor usually works great.
Except when it doesn’t. And the iceberg hits something. Then, everyone notices real quick.
Operations, as a rule, tends to be most visible in its absence.
I’d like to dive under the waterline and investigate the iceberg of IT operations. These are all the capabilities that a business expects will simply “Just Work”(tm). Each capability represents something a business needs or wants to have; something that provides or enables significant business value or prevents significant business loss.
There are dozens of ways to walk this maze. We will examine a tiny business which provides a single application and see all the places an operations team is required and why. Our intrepid operations team will initially just be Oscar.
The business itself
Before we discuss what the business does, we need to be able to do anything at all. This starts with an employee’s ability to do work. Since most work is done electronically these days, we immediately run into several expectations that need to be met.
- I control access to my work and all work with my name on it was done by me.
  - Otherwise, no-one trusts each other.
- If my laptop dies, my work isn’t lost.
  - Otherwise, the business has high risk.
- I have access to my work from anywhere.
  - Otherwise, I might not be able to do my work.
Everyone in the business, from Tammy the CEO to Bob in HR, has these expectations. It is not a stretch to say if these expectations are not met, the business folds.
These are difficult expectations to meet and supporting them requires significant coordination and planning. Email, office documents, and payroll processing necessitate several different client-server applications (like Microsoft Exchange) and each of those systems requires the same level of support we’ll see in our production environment below.
Luckily for Oscar, there are dozens of companies in 2016 which provide these capabilities, like Google’s G Suite and Microsoft’s Office 365, but even so, Oscar is still responsible for the laptops themselves, including security / virus protection, tech support (like passwords), and account management.
Email and Word documents aside, let’s go build something.
The development environment
Once we have people that can function within the business, we need to construct our product. That means developers need a place to work. Initially, we just have one developer (Bill), so he installs whatever software he needs on his laptop and builds a prototype. Because the prototype is the most important thing, Bill throws ideas against the wall until something sticks. Tammy uses his laptop to demo the prototype to investors and we get a bite. With that new cash, we hire another developer - Dave.
Dave joins and spends a week working out exactly what Bill did so he can get the prototype to work on his laptop. Once that works, we run into another problem - Bill and Dave need to coordinate what they’re working on - they need centralized source control. This is similar to sharing office documents between Tammy the CEO and Bob in HR, but without having to physically hand papers to someone else. For that, we can install something internally or use an external provider (like GitHub), and since our business doesn’t make money on source control, Oscar outsources the problem. He still has to manage accounts, authorizations, and our usage of the service, but the service itself is not on his plate.
After another week, Tammy (who is also managing the product) wants to see what Bill and Dave have accomplished together. Bill’s laptop is no longer enough, so we need a place to integrate their work and make it available for the rest of the company. Oscar takes an extra laptop on the wifi and installs what Dave remembers. After a lot of poking, something works, but everyone has to remember a complicated number to get to it. Promoting changes is also complicated and very manual, but it works. With two developers banging away, enough new awesomeness is added that our company can hire a third developer - Kristen.
Kristen gets her laptop from Oscar, but installing and configuring the additional developer software takes almost two weeks. Every time it looks like she’s getting close, something else doesn’t work right. Finally, she’s able to contribute. Except, about half the time, the work she finishes doesn’t work on either Bill’s laptop or the integration laptop. Oscar takes a week to track it down and figures out that everyone has a different version of the compiler installed. Specifically, Kristen’s is much newer, so she’s using features that aren’t available for anyone else. In the post-mortem, Oscar is given responsibility for installing and managing the additional software on the developer laptops. Quickly, a checklist is made and all the developer environments are synchronized, including the integration environment.
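A checklist like Oscar's can also be checked by a script instead of by eye. The sketch below is one hypothetical way to do that: it compares the versions a laptop reports against a pinned list and returns whatever has drifted. The tool names and version numbers are invented for illustration, since the story never says what Bill actually installed.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical pinned versions; the story does not name the real tools.
PINNED_VERSIONS: Dict[str, str] = {
    "compiler": "4.9.2",
    "build-tool": "3.3.9",
}


def find_version_drift(
    pinned: Dict[str, str], installed: Dict[str, str]
) -> List[Tuple[str, str, Optional[str]]]:
    """Return (tool, expected, actual) for every missing or mismatched tool."""
    drift = []
    for tool, expected in sorted(pinned.items()):
        actual = installed.get(tool)  # None means the tool is not installed
        if actual != expected:
            drift.append((tool, expected, actual))
    return drift


# Kristen's laptop, with a much newer compiler than everyone else's.
kristen = {"compiler": "6.1.0", "build-tool": "3.3.9"}
for tool, expected, actual in find_version_drift(PINNED_VERSIONS, kristen):
    print(f"{tool}: expected {expected}, found {actual}")
```

Run on every developer laptop and on the integration laptop, a check like this would have turned Oscar's week of detective work into a one-line report.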
The production environment
Two months later, Tammy pronounces the product ready for users. At the planning meeting, she points out that a laptop on the wifi with a cryptic address isn’t going to work, so Oscar goes to work finding an appropriate solution.
Ten years ago, Oscar would have had to purchase servers, find a colocation space, arrange a networking contract, and “rack’n’stack” everything by hand. Not only would this take 2–4 weeks of work, but the lead time could easily be 6–12 weeks from start to finish. This doesn’t count the money locked into the process, something most new companies don’t have readily available.
Luckily for Oscar, there are currently dozens of companies which provide these capabilities, like Amazon’s AWS, Microsoft’s Azure, and Google’s Compute Engine. Through the UI, Oscar is able to set up networking and servers. Using the checklist for the developer laptops, the production application is ready for traffic in less than a week.
Over the course of a few months, the application gains some traction and traffic ramps up. While most reactions are positive, some users complain about problems with the application, including slowness and flat-out failures. As the number of complaints rises, Tammy calls an all-hands meeting. After almost two hours of back and forth, everyone agrees that they should know about problems before users do, regardless of whether it’s a failure, a slowdown, or anything else. Oscar goes off and writes up a set of monitoring checks. For now, all the alerts will come to him and Bill.
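What might one of Oscar's monitoring checks look like? Here is a minimal sketch, assuming each check is just a function (a "probe") that exercises the application and raises an exception when something is broken. The function names and the two-second threshold are assumptions made for illustration, not anything the story specifies.

```python
import time
from typing import Callable


def classify(succeeded: bool, latency_s: float, slow_after_s: float = 2.0) -> str:
    """Map one probe result onto the outcomes the team cares about."""
    if not succeeded:
        return "failure"
    if latency_s > slow_after_s:
        return "slowdown"
    return "ok"


def run_check(probe: Callable[[], None], slow_after_s: float = 2.0) -> str:
    """Time a probe; any exception it raises counts as a failure."""
    started = time.monotonic()
    try:
        probe()
        succeeded = True
    except Exception:
        succeeded = False
    return classify(succeeded, time.monotonic() - started, slow_after_s)


def broken_login_page() -> None:
    # Stand-in for a real probe, e.g. one that requests the login page.
    raise RuntimeError("connection refused")


print(run_check(broken_login_page))  # failure
```

Anything other than "ok" would page Oscar or Bill. Keeping the classification separate from the timing makes the thresholds easy to test and to tune as complaints come in.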
A week after the monitoring checks are written, the hard drive in the integration laptop fails. While the company works together to restore the testing environment, Bob in HR wonders out loud over lunch how this would play out in production. All eyes turn to Oscar who looks down at his quinoa salad and mumbles something about putting a plan together.
Using the disaster in the integration laptop as a starting point, Oscar puts backups into place for the database. As he works through different possible failures, Oscar realizes there are more places for the application to fail. Pulling Dave aside as a rubber duck, the two of them come up with a list of over a dozen different single points of failure - places where if one component fails, something bad happens. They bring this list back to the company and Tammy sets up some priorities, keeping some production stability items in each release.
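The simplest version of the exercise Oscar and Dave went through is a count: any component with fewer than two running instances is a single point of failure. The sketch below captures only that rule; the component names and counts are hypothetical, and a real review would also consider things a count cannot show, like shared power or a shared network link.

```python
from typing import Dict, List


def single_points_of_failure(instances: Dict[str, int]) -> List[str]:
    """Return the components that have fewer than two running instances."""
    return sorted(name for name, count in instances.items() if count < 2)


# Hypothetical inventory of the production environment.
production = {
    "application server": 2,
    "database": 1,
    "dns provider": 1,
}
print(single_points_of_failure(production))  # ['database', 'dns provider']
```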
The unexpected costs
Unlike most new companies, our intrepid band makes it past the first year. They’ve delivered a solid product that users are loving and have started paying for. At an all-hands meeting, Tammy announces they’re well on their way to profitability and will be hiring more staff, including more developers, a project manager, and a new QA team. Jenny, the new project manager, comes from a larger company. She quickly demands applications for ticketing, a wiki, and automated builds. While Oscar builds nine new laptops and sets up new accounts in the four external services they already use, he also investigates adding the new services for Jenny.
Luckily for Oscar, there are dozens of companies in 2016 which provide such capabilities, like Atlassian’s suite and Microsoft’s TFS, both of which have hosted solutions. All three new services come up quickly, but Oscar is starting to feel overwhelmed. He starts joking about the operations team collecting projects like Pokémon.
With almost twenty employees (and their laptops) each using up to seven services, an integration laptop to maintain, and on-call duty for the production environment (now up to 4 application servers and 3 databases), Oscar has been averaging over 60 hours/week for a while.
After a cathartic lunch with Jenny, Oscar starts to track his time in the new ticketing system. Within a month, he realizes most of his time (over 40 hours/week) is spent on unplanned work, like laptop support and production issues. Even the expected work is draining, like deployments of the increasingly complicated application onto the integration laptop. (Which, frankly, has become another liability.) Without the time to improve how things are structured, Oscar cannot see how his life gets any better, let alone handle any future needs.
He has a meeting with Tammy, Bill, and Jenny and lays out what he’s learned. Tammy agrees the operations team needs both more people and time for structural improvements. The list of improvements is long, starting with replacing the integration server and automating the promotion process. Meanwhile, Oscar gets to work hiring two more people and learning about automation.
Around this time, Bill gets another job offer. After a two-week transition period and a great farewell lunch, Bill walks out the door. A week later, Dave calls Bill for help with a production issue. Bill logs in and solves the problem as a favor. Dave mentions this in a meeting the next morning and Jenny freaks out about unauthorized access. After the meeting, she goes to Tammy and they come to Oscar. “Drop everything and make sure Bill has no more access.” This first time, the offboarding takes hours: Oscar has to track down every service and document how to remove a user from each one.
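Most of those hours went into answering one question: where does Bill still have an account? If the company kept even a crude inventory of who has access to what, that answer becomes a lookup. The sketch below assumes such an inventory exists as a mapping from service name to usernames; the service names and accounts are made up for illustration.

```python
from typing import Dict, List, Set


def services_with_access(inventory: Dict[str, Set[str]], user: str) -> List[str]:
    """List every service where the given user still has an account."""
    return sorted(service for service, users in inventory.items() if user in users)


# Hypothetical access inventory; a real one would be exported from each service.
inventory = {
    "office suite": {"tammy", "bob", "oscar", "dave"},
    "source control": {"bill", "dave", "kristen", "oscar"},
    "cloud console": {"oscar", "bill"},
    "production ssh": {"oscar", "bill", "dave"},
}
print(services_with_access(inventory, "bill"))
# ['cloud console', 'production ssh', 'source control']
```

The list is only as good as the inventory behind it, which is why documenting every service the first time through is what makes the second offboarding take minutes instead of hours.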
Even after Bill’s accounts are removed, Jenny points out that security has never been a company focus. After a few months, she hires a third-party consultancy to do a security audit. The report comes back with findings, some good and some bad. Some really bad. After the initial remediation, Oscar’s team is tasked with reviewing the monthly security audits.
We can see how Oscar’s responsibilities grew over just two years. At first, it was just 6–8 laptops, office wifi, and a third-party office solution. Then it was a cobbled-together server. Then development environments. Within a year, it was 20 laptops, two application environments in the cloud, monitoring, alerting, and backups. After another 12 months, it was a dozen third-party services, 40 laptops, two team members, offboarding processes, and monthly security audits.
Over beers one evening, Oscar makes the comment to a teammate that even he didn’t understand just how profoundly the company depended on the operations team; how much impact his work had on everyone else’s ability to do their jobs. The teammate grumbles in response “Even if many of them don’t know it, huh?” Oscar nods at that, and responds “Maybe, but we’re still a well-oiled machine.” They clink their glasses and laugh.
And when someone asks him at a party what he does for a living, he always starts with, “Well, it’s an interesting story…”