December 12, 2014

Day 12 - Ops and Development Teams: Finding a Harmony

Written by: Nell Shamrell (@nellshamrell)
Edited by: Ben Cotton (@funnelfiasco)

I used to think the term "DevOps" meant infrastructure automation tools like Chef, Puppet, and Ansible. But DevOps is more than that, even more than technology itself. DevOps is a culture, an attitude, a way of doing things where Dev and Ops work together rather than against each other. When you work in IT it's easy to get lost in technical skills. They're what we recruit for and what certifications test for. However, I've found that in any professional environment - no matter what the technology - soft skills like communication and taking responsibility are just as important as hard skills like system administration or coding.

I’ve worked in software development for 8 years now. In that time I have worked on teams with the classic Operations/Development divide, where features and bug fixes were “thrown over the wall” to operations to deploy. I’ve also worked at a company where developers were responsible for their own hand-crafted QA, Staging, and Production systems (often crafted by developers without system administration expertise) with little help from Operations. Having worked at both of these extremes, I’ve found our projects were repeatedly delayed not by technical problems, but by failures to know our responsibilities, to actually take responsibility, and to communicate.

Not knowing who is responsible for what kills IT projects. Are the devs just supposed to focus on the application code? Is the Ops team the one who should be woken up when a deploy goes wrong? When we don’t know the answers - or don’t agree on the answers - our projects, our companies, and therefore we ourselves suffer for it. Responsibilities will vary from team to team and project to project, but I’d like to at least give you a place to start. Here are 10 general guidelines for what the Ops team, the Dev team, and both teams should be responsible for.

Ops Responsibilities

  1. Provide production-like environments for developers to test their code on. Why is this the Ops team’s responsibility, rather than the devs’? Because Ops knows the production system better than anyone else and is tasked with protecting the stability of the system. In order to protect it, you must replicate it so developers have a safe way to test their features and bug fixes before they go live. Then everyone will sleep better at night.
  2. Provide multiple production-like environments. There is little more frustrating to a developer than to have a feature ready to test on a production-like environment, but be blocked because someone else is already using the environment. One QA and one Staging server are not enough. Prevent the environment from becoming a bottleneck that delays features and bug fixes, which makes everyone unhappy. If resources are limited, at the very least provide an easily replicated virtual box (kept in sync with production!) that developers can run locally.
  3. Automate and document procedures for building production-like environments. This will prevent developers from needing to ask you to build a system for them whenever they need to test something. Empower developers to do this themselves by providing automated infrastructure and documentation they can use. Then the developers will be able to get bug fixes and features out to the customers faster without having to bother you in the process, and everybody wins.

Dev Responsibilities

  1. Test code in a production-like environment before ever declaring a feature or bug fix to be done. “Works on my machine!” is never the definition of done. Gene Kim altered the agile manifesto definition of done to illustrate this: "At the end of each sprint, we must have working and shippable code...demonstrated in an environment that resembles production."
  2. Take full responsibility for deployed code. If a deploy of developer code starts causing havoc in a system and preventing people from getting work done, it is the developer’s fault. Not Ops’, not QA’s: the ultimate responsibility for what code does in production belongs to the developer who wrote the code. If something goes wrong in the middle of the night, the developer should be woken up first, then take the responsibility to wake up further team members if needed.
  3. Read documentation first and try the procedures it suggests before asking the Ops team for help. Respect the Ops team’s time by using any resources they’ve provided first, then ask for help if the problem remains.

Both Teams

  1. Respect how the other team does things - although we have similar goals, often dev and ops have different ways of getting work done. If an Ops person needs to request something of the dev team - or to contribute some code - they should check for any contributing documents and follow them. The same goes for a Dev person who needs something in the Ops team’s domain. Avoid “just this once” exceptions - those often multiply and turn into cruft which will bring down a system.
  2. Write down procedures and who is responsible for what. As stated earlier, not knowing (or not caring) who is responsible for what aspects of an application will kill that application. Knowing how things get done or are intended to be done is too vital to be tribal knowledge. Write it down, follow it, and refer to it whenever needed. If someone repeatedly refuses to consult the documentation, default to sending them a link to the documentation (or even better the section of the documentation) where they will find their answer. This may seem like an aggressive stance, but time is too scarce in the IT world to solve the same problem over and over because someone refuses to look at available documentation.
  3. Never use the phrase “Why don’t you just-”. This comes across as extremely condescending. Teams must respect their teammates’ intelligence and realize that if it were “just” that simple, they probably would have already done it. “Have you considered...” is a good way of rephrasing.
  4. When you don’t know, help find the answer. When someone has a question and you don’t know the answer, it's ok to say "I don't know." But your responsibility doesn’t end there. In a professional IT environment, it is your duty to point the questioner to where they might find their answer - whether that's a person, a web resource, or just "Here, let’s try googling that together and let's see what we find." Help a questioner move forward, rather than stopping them in their tracks.

I am relieved to now work at a company that embraces these principles of taking responsibility and communication (I learned many of these from working there!). Projects are completed faster, both teams are happier, far fewer people are woken in the middle of the night, and the business benefits tremendously. The key to making this work has been clearly establishing who is responsible for what and, when any confusion or blockers come up, communicating immediately. We may have different areas of expertise, but everyone is equally accountable for communicating and taking responsibility for a project’s success. And it does work.

As IT professionals we shape how the world works now. It’s not just how people spend money - our work is now vital to how humans travel, how they communicate, how they access utilities like lighting and water, how laws are passed and implemented, and (as IT becomes more integrated into health care) how they physically survive and thrive. The stakes are much too high to let a lack of communication or failure to establish and take responsibility kill an IT project. The weight of the world rests on our shoulders now - we have that great power. It’s time to not only meet but embrace that responsibility!

2 comments:

Colin McD said...

I love the "Don't know" is not good enough. Having someone point people to the right resource is the best thing for your company and long term your team. Sadly a lot of management is still focused on the SILO approach and this leads to turf wars instead of solutions.

On this point.
" If something goes wrong in the middle of the night, the developer should be woken up first, then take the responsibility to wake up further team members if needed."

Totally agreed that the developer needs to be 100% responsible for their code.

However, I just don't see this working in practise. If the developer's code is faulty, either the customer, support or ops are the first to know.

I would suggest this:
Take full responsibility for deployed code. If a deploy of developer code starts causing havoc in a system and preventing people from getting work done, it is the developer’s fault. Not Ops, not QA’s, the ultimate responsibility for what code does in production belongs to the developer who wrote the code. If something goes wrong in the middle of the night, once the code fault is identified the issue is escalated to the development team. The development team have the primary responsibility of fixing their fault and then identifying a patch that will fix any errors that were created. If needed they can draw on Ops or support to assist with the fix.

Never whistle while you're pissing said...

Providing multiple production-like environments typically isn't an Ops decision, it's a Finance decision. The scale and cost of production replication usually gets redlined.

Harmony is definitely about communication which is why I'm trying to get Ops "moved to the left" in the SDLC. I just posted about this. Ops is more than just running environments. There's a wide array of knowledge that needs to be applied across the lifecycle of app delivery. Developers can't and shouldn't be expected to know everything.