sysadvent: agile


December 5, 2010

Day 5 - Why Aren't You Doing Code Reviews?

This article was written by Phil Hollenback, who is @philiph on Twitter.

I have to admit: I'm on the fence about the devops movement. However, I do think there are a lot of good ideas to be found in the 'devops culture'. One idea I particularly like is that sysadmins should think more like developers. I might be partly attracted to that philosophy because I have a CS degree. However, I also think it's logical because developers have spent a lot of time figuring out development workflows (duh). With the increasing automation in system administration, it's natural that we as sysadmins should follow the same process. Whether we like it or not, we will all be writing more code in the years to come as we increase our use of automation.

I hear that developers are all het up about Agile programming these days. I don't know a lot about that whole methodology either, but I have picked up some good ideas from skimming the literature. One idea that really excites me as a sysadmin is the notion of code review. Eric S. Raymond put it as "given enough eyeballs, all bugs are shallow", and that is exactly what code review accomplishes. My group has embraced code review as an operational concept and it has greatly increased the quality of our work. Thus, I'm here to spread the word: everyone should be doing code review on any scripts, programs, or text config files destined for production.

THE TOOLS

In my workplace our code review tools are Bugzilla, Subversion, and Review Board. There are other code review products out there but we already had Review Board (http://www.reviewboard.org) set up so it was an obvious choice. This process is tool-agnostic: you could use any combination of version control, bug tracker, and code review if you wanted.

The first step in the process is to create a tracking bug. In my organization we say "if there isn't a bugzilla, it doesn't exist". Don't waste my time with an email thread. The bug number goes into every Subversion commit message for the change and is referenced in the Review Board posting. To help enforce this, we use a Subversion pre-commit hook that rejects any log message without a bugzilla reference.
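Subversion can only reject a commit from a pre-commit hook (a post-commit hook runs after the revision already exists), so a minimal sketch of this kind of enforcement might look like the following. The bug-number pattern and the error wording are assumptions; adapt them to your own Bugzilla conventions.

```shell
#!/bin/sh
# Hypothetical Subversion pre-commit hook (hooks/pre-commit in the repository).
# Subversion calls it with two arguments: REPOS-PATH and TXN-NAME.
# A non-zero exit rejects the commit, and stderr is shown to the committer.

# Succeeds if the log message on stdin mentions a bug, e.g. "bug 12345" or "Bug #42".
# The pattern is an assumption; adjust it to match how your team cites bugs.
has_bug_ref() {
    grep -Eiq '(bug|bz)[ #:]*[0-9]+'
}

if [ "$#" -eq 2 ]; then
    REPOS=$1
    TXN=$2
    # svnlook reads the log message of the pending (not yet committed) transaction.
    if ! svnlook log -t "$TXN" "$REPOS" | has_bug_ref; then
        echo "Commit rejected: the log message must reference a bug, e.g. 'bug 12345'." >&2
        exit 1
    fi
fi
```

The check lives in a function so it can be exercised without a repository, for example `printf 'bug 12345: fix fencepost error\n' | has_bug_ref`.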

Once you have the bug/ticket/issue, check the existing files out of Subversion and make your changes. We also keep a lot of YAML-formatted text config files in Subversion, so changes to those go through code review too.
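For YAML config files, a cheap mechanical check before posting the review lets the human reviewer concentrate on the substance. A minimal sketch, assuming only a POSIX shell and awk: YAML does not allow tab characters in indentation, so this flags any line whose leading whitespace contains a tab. The function name is hypothetical.

```shell
#!/bin/sh
# Hypothetical pre-review check for YAML config files.
# YAML forbids tabs in indentation, so flag any line whose leading whitespace
# contains a tab. Prints "file:line: ..." for each offending line and
# returns 1 if any were found, 0 otherwise.
check_yaml_tabs() {
    awk 'BEGIN { bad = 0 }
         /^ *\t/ { print FILENAME ":" FNR ": tab in indentation"; bad = 1 }
         END { exit bad }' "$@"
}
```

Run it on the files you changed before posting the review, for example `check_yaml_tabs config/*.yaml` (the path is made up). It catches only one class of mistake; running each file through a real YAML parser would catch more, but this needs nothing beyond awk.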

Now you have a subversion diff you can submit to Review Board. You might want to run 'svn diff' to sanity check your diff before submitting. If you like what you see, submit your code review like this (from your subversion working directory):

$ post-review -p --bugs-closed 12345 --description "fix fencepost error" \
  --summary "system healthcheck scripts" --target-group sysadmins

The above will create your code review, publish it, and reference the bugzilla from the first step. It will also send an email out to the 'sysadmins' group which you've already configured in Review Board. That email contains a link to the review in Review Board. Here's what the main screen for a pending code review request looks like:

[Screenshot: review main screen]

The next step is up to the reviewers. Typically you will send a code review to a group of reviewers, and wait for any one of them to sign off on your review. Review Board allows reviewers to make general comments about all the diffs in a review, and/or comments on specific line numbers.

Viewing the diff for a change:

[Screenshot: viewing the diff]

Adding a comment to a specific line number:

[Screenshot: adding a comment]

Once a reviewer has finished commenting, they publish their review so you can see it. They either check the 'ship it!' box or leave it unchecked, depending on whether they feel your changes are acceptable. If the reviewer doesn't check 'ship it!', the expectation is that you will fix the problems and post an updated diff. Review Board supports revisions of a review request: run post-review with the '-r' option and the existing request's ID to update it, so you don't need to create multiple review requests.

[Screenshot: ship it]

Continue this iterative process until a reviewer signs off by checking 'ship it!' on their review. Congratulations, you now have a reviewed change! Go commit it to Subversion and prepare for deployment.

WHY DO ALL THIS WORK?

Preparing a code review is not a lot of work, but it does complicate your workflow. Why do this at all? One simple reason: it makes your code and config files better! There are several reasons for this:

If you work in an environment where code review is the norm, you unconsciously write better code because you know someone is going to look at it. Maybe it doesn't always work out that way, but I know I stop myself from doing sloppy things if I know someone is going to critique my work.
Code review serves as a final check to catch stupid mistakes. You are blind to dumb typos in your own code; other reviewers find them far more reliably. This is largely because they don't know the flow of your code, so they have to read through all of it carefully.
Code review serves as a barrier for, let's say, "less experienced" peers. You all know what I'm saying here. Often people are afraid to admit they don't know how to fix some problem assigned to them. Solution? Take a best-guess shot in the dark and pray it works. My classic example: I once asked a sysadmin to randomize the start time of his cronjob, which ran on many servers, so as not to overload a particular service. He did the work and we put the new script into production on several thousand machines. Then we started seeing regular overloads on the service in question at the same time every hour. When we inspected the randomized code, we found the sysadmin had picked a random time delay value and then used THE SAME VALUE on every host. If you don't do code review, you don't find problems like this.
A real benefit that should not be overlooked is that code review can be an incredible learning tool for both junior and senior sysadmins. Junior sysadmins need to learn about everything, so your feedback helps them immediately. The effect with senior sysadmins is more subtle. We are creatures of habit and develop our coding techniques early in our careers. Tools evolve and we often don't bother to follow up on advancements. If someone else reviews your code, that gives them an opportunity to suggest better approaches to problems. This can be a real eye-opener if you haven't read the manuals in a while.
Code and configuration style and quality can be enforced. If you have a style guide (you should!), you can use code review both to enforce style and to educate about it. You can enforce details like spacing and usage, as well as larger rules such as requiring tests for each change.
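The cron randomization story above is exactly the kind of bug a second pair of eyes catches. A minimal sketch of one fix, assuming a POSIX shell with `cksum` available: instead of picking one random delay and copying it to every host, derive each host's delay from a checksum of its hostname, so every host gets a different but stable offset. The one-hour window and the job path in the comment are hypothetical.

```shell
#!/bin/sh
# Hypothetical per-host splay for a cron job that runs on many servers.
# The buggy version picked one random delay and copied that same value to
# every host, so all hosts still hit the service at the same moment.
# Here each host derives its own delay from a checksum of its hostname.

MAX_SPLAY=${MAX_SPLAY:-3600}   # spread runs over one hour (assumed window)

# Print a stable delay in seconds, 0 .. MAX_SPLAY-1, for the given hostname.
host_splay() {
    # cksum prints "CRC BYTE-COUNT"; keep only the CRC field
    crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
    echo $((crc % MAX_SPLAY))
}

# In the cron job itself (the script path is hypothetical):
#   sleep "$(host_splay "$(hostname)")" && /usr/local/bin/healthcheck
```

Hashing the hostname rather than using `$RANDOM` is a deliberate choice: `$RANDOM` is a bash extension, not POSIX sh, and a stable per-host offset makes it easy to predict when a given machine will run. Whichever approach you pick, the thing a reviewer should check is that the value actually differs from host to host.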

One thing to keep in mind is that code review is actually easier and less time-consuming for sysadmins than for developers. Developers write code all the time, and they write a lot of code. Sysadmins typically perform many duties besides writing scripts, and thus the amount of review work is correspondingly reduced. In our experience, in a group of six people, code review creates minimal overhead.

CONCLUSION

I'm here to tell you that code review works, and it works particularly well for system administrators. Formalized code review is a rising tide that lifts all boats - we all write better code and configurations when others look at it. Review Board in particular provides a fairly simple and lightweight way to implement code review. Code review is your first line of defense in many ways. When a script breaks in production, the first thing I ask is, "Was it code reviewed?"

Everyone knows that question is going to be asked so we plan accordingly and write better code. I'm not talking about large coding projects here either - even the simplest of scripts should be code reviewed. In fact, small scripts can benefit the most because those are the ones you are most likely to write quickly and carelessly. So please take my recommendations to heart - implement a code review culture for system administration. It will have a measurable effect on your team's performance and will definitely reduce production outages.

December 19, 2009

Day 19 - Kanban for Sysadmins

This article was written by Stephen Nelson-Smith.

Unless you've been living in a remote cave for the last year, you've probably noticed that the world is changing. With the maturing of automation technologies like Puppet, the popular uptake of Cloud Computing, and the rise of Software as a Service, the walls between developers and sysadmins are starting to come down. Increasingly we're hearing phrases like 'Infrastructure is code', and terms like 'Devops'. This is all exciting. It also has an interesting knock-on effect. Most development environments these days are at least strongly influenced by, if not run entirely according to, 'Agile' principles. Scrum in particular has experienced tremendous success, and adoption by non-development teams has been seen in many cases. On the whole the headline objectives of the Agile movement are to be embraced, but the thorny question of how to apply them to operations work has yet to be answered satisfactorily.

I've been managing systems teams in an Agile environment for a number of years, and after thought and experimentation, I can recommend using an approach borrowed from Lean systems management, called Kanban.

Operations teams need to deliver business value

As a technical manager, my top priority is to ensure that my teams deliver business value. This is especially important for Web 2.0 companies - the infrastructure is the platform -- is the product -- is the revenue. Especially in tough economic times it's vital to make sure that as sysadmins we are adding value to the business.

In practice, this means improving throughput - we need to be fixing problems more quickly, delivering improvements in security, performance and reliability, and removing obstacles to enable us to ship product more quickly. It also means building trust with the business - improving the predictability and reliability of delivery times. And, of course, it means improving quality - the quality of the service we provide, the quality of the staff we train, and the quality of life that we all enjoy - remember - happy people make money.

The development side of the business has understood this for a long time. Aided by Agile principles (and implemented using such approaches as Extreme Programming or Scrum) developers organise their work into iterations, at the end of which they will deliver a minimum marketable feature, which will add value to the business.

The approach may be summarised as moving from the historic model of software development as a large team taking a long time to build a large system, towards small teams, spending a small amount of time, building the smallest thing that will add value to the business, but integrating frequently to see the big picture.

Systems teams starting to work alongside such development teams are often tempted to try the same approach.

The trouble is, for a systems team, committing to a two-week plan and setting aside time for planning and retrospective meetings, prioritisation and estimation sessions just doesn't fit. Sysadmin work is frequently interrupt-driven; demands on time are uneven, and tasks are often specialised and require concentrated focus. Radical shifts in prioritisation are normal. It's not even possible to commit to much shorter sprints of a day, as sysadmin work also includes project and investigation activities that can't be delivered in such a short space of time.

Dan Ackerson recently carried out a survey in which he asked sysadmins their opinions and experience of using agile approaches in systems work. The general feeling was that it helped encourage organisation, focus and coordination, but that it didn't seem to handle the reactive nature of systems work, and the prescription of regular meetings interrupted the flow of work. My own experience of sysadmins trying to work in iterations is that they frequently fail their iterations, because the world changed (sometimes several times) and the iteration no longer captured the most important things. A strict, iteration-based approach just doesn't work well for operations - we're solving different problems. When we contrast a highly interdependent systems team with a development team who work together for a focussed time, answering to themselves, it's clear that the same tools won't necessarily be appropriate.

What is Kanban, and how might it help?

Let's keep this really really simple. You might read other explanations making it much more complicated than necessary. A Kanban system is simply a system with two specific characteristics. Firstly, it is a pull-based system. Work is only ever pulled into the system, on the basis of some kind of signal. It is never pushed; it is accepted, when the time is right, and when there is capacity to do the work. Secondly, work in progress (WIP) is limited. At any given time there is a limit to the amount of work flowing through the system - once that limit is reached, no more work is pulled into the system. Once some of that work is complete, space becomes available and more work is pulled into the system.
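To make those two rules concrete, here is a minimal sketch that models a board as plain text files with one card per line. The directory layout and function names are hypothetical, and a physical board with index cards needs none of this. `pull_next` only moves work into progress when there is spare capacity (the pull signal), and `finish` frees a slot.

```shell
#!/bin/sh
# Minimal sketch of a pull-based, WIP-limited board kept as plain text files,
# one card per line (the layout and names are hypothetical):
#   $board/backlog.txt      prioritised, highest priority first
#   $board/in_progress.txt  cards being worked on right now
#   $board/done.txt         finished cards
WIP_LIMIT=${WIP_LIMIT:-3}

# Pull the top backlog card into progress, but only if there is capacity.
pull_next() {
    board=$1
    touch "$board/backlog.txt" "$board/in_progress.txt"
    wip=$(wc -l < "$board/in_progress.txt" | tr -d ' ')
    if [ "$wip" -ge "$WIP_LIMIT" ]; then
        echo "WIP limit ($WIP_LIMIT) reached: finish something first" >&2
        return 1
    fi
    card=$(head -n 1 "$board/backlog.txt")
    [ -n "$card" ] || return 1              # nothing waiting to be pulled
    printf '%s\n' "$card" >> "$board/in_progress.txt"
    sed '1d' "$board/backlog.txt" > "$board/backlog.tmp"
    mv "$board/backlog.tmp" "$board/backlog.txt"
    printf '%s\n' "$card"
}

# Finish a card: move it from in progress to done, freeing a slot.
finish() {
    board=$1
    card=$2
    grep -vxF "$card" "$board/in_progress.txt" > "$board/in_progress.tmp" || true
    mv "$board/in_progress.tmp" "$board/in_progress.txt"
    printf '%s\n' "$card" >> "$board/done.txt"
}
```

With a limit of two, a third `pull_next` fails until someone runs `finish` on one of the first two cards; nothing is ever pushed onto the in-progress list.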

Kanban as a system is all about managing flow - getting a constant and predictable stream of work through, whilst improving efficiency and quality. This maps perfectly onto systems work - rather than viewing our work as a series of projects, with annoying interruptions, we view our work as a constant stream of work of varying kinds.

As sysadmins we are not generally delivering product, in the sense that a development team are. We're supporting those who do, addressing technical debt in the systems, and looking for opportunities to improve resilience, reliability and performance.

Supporting tools

Kanban is usually associated with some tools to make it easy to implement the basic philosophy. Again, keeping it simple, all we need is a stack of index cards and a board.

[Photo: Stephen's (the author's) Kanban board]

The word Kanban itself means 'Signal Card' - and is a token which represents a piece of work which needs to be done. This maps conveniently onto the agile 'story card'. The board is a planning tool and an information radiator. Typically it is organised into the various stages on the journey that a piece of work goes through. This could be as simple as to-do, in-progress, and done, or could feature more intermediate steps.

The WIP limit controls the amount of work (or cards) that can be on any particular part of the board. The board makes visible exactly who is working on what, and how much capacity the team has. It provides information to the team, and to managers and other people, about the progress and priorities of the team.

Kanban teams abandon the concept of iterations altogether. As Andrew Shafer once said to me: "We will just work on the highest priority 'stuff', and kick-ass!"

How does Kanban help?

Kanban brings value to the business in three ways - it improves trust, it improves quality and it improves efficiency.

Trust is improved because very rapidly the team starts being able to deliver quickly on the highest priority work. There's no iteration overhead, it is absolutely transparent what the team is working on, and, because the responsibility for prioritising the work to be done lies outside the technical team, the business soon begins to feel that the team really is working for them.

Quality is improved because the WIP limit makes problems visible very quickly. Let's consider two examples - suppose we have a team of four sysadmins:

The team decides to set a WIP limit of one. This means that the team as a whole will only ever work on one piece of work at a time. While that work is being done, everything else has to wait. The effect of this will be that all four sysadmins will need to work on the same issue simultaneously. This will result in very high quality work, and the tasks themselves should get done fairly quickly, but it will also be wasteful. Work will start queueing up ahead of the 'in progress' section of the board, and the flow of work will be too slow. Also it won't always be possible for all four people to work on the same thing, so for some of the time the other sysadmins will be doing nothing. This will be very obvious to anyone looking at the board. Fairly soon it will become apparent that the WIP limit of one is too low.

Suppose we now decide to increase the WIP limit to ten. The sysadmins go their own ways, each starting work on a card of their own. The progress on each card will be slower, because there's only one person working on it, and the quality may not be as good, as individuals are more likely to make mistakes than pairs. The individual sysadmins also don't concentrate as well on their own, but work is still flowing through the system. However, fairly soon something will come up which makes progress difficult. At this stage a sysadmin will pick another card and work on that. Eventually two or three cards will be 'stuck' on the board, with no progress, while work flows around them owing to the large WIP limit. Eventually we might hit a big problem, system wide, that halts progress on all work, and perhaps even impacts other teams. It turns out that this problem was the reason why work stopped on the tasks earlier on. The problem gets fixed, but the impact on the team's productivity is significant, and the business has been impacted too. Had the WIP limit been lower, the team would have been forced to react sooner.

The board also makes it very clear to the team, and to anyone following the team, what kind of work patterns are building up. As an example, if the team's working cadence seems to be characterised by a large number of interrupts, especially for repeatable work, or to put out fires, that's a sign that the team is paying interest on technical debt. The team can then make a strong case for tackling that debt, and the WIP limit protects the team as they do so.

Efficiency is improved simply because this method of working has been shown to be the best way to get a lot of work through a system. Kanban has its origins in Toyota's lean processes, and has been explored and used in dozens of different kinds of work environment. Again, the effects of the WIP limit, and the visibility of their impact on the board makes it very easy to optimise the system, to reduce the cycle time - that is to reduce the time it takes to complete a piece of work once it enters the system.
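Reducing cycle time is easier if you measure it. If your board or ticketing system can export when each card entered the system and when it was done (an assumption; the tab-separated format below is made up for illustration), a short awk function gives the average cycle time.

```shell
#!/bin/sh
# Hypothetical export of finished cards, one per line, tab-separated:
#   card-id <TAB> started (Unix epoch seconds) <TAB> finished (epoch seconds)
# Cycle time for a card is finished - started; print the mean in hours.
mean_cycle_time_hours() {
    awk -F'\t' '
        NF == 3 { total += $3 - $2; n++ }
        END {
            if (n == 0) { print "no completed cards" | "cat 1>&2"; exit 1 }
            printf "%.1f\n", total / n / 3600
        }' "$@"
}
```

A mean is easily dragged up by a few long-running cards, such as the 'stuck' cards in the WIP-limit-of-ten example above. That is arguably what you want to notice, and sorting the export by `$3 - $2` to look at the slowest cards is a natural next step.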

Another benefit of Kanban boards is that they encourage self-management. At any time any team member can look at the board and see at once what is being worked on, what should be worked on next and, with a little experience, can see where the problems are. If there's one thing sysadmins hate, it's being micro-managed. As long as there is commitment to respect the board, a sysops team will self-organise very well around it. Happy teams produce better quality work, at a faster pace.

How do I get started?

If you think this sounds interesting, here are some suggestions for getting started.

Have a chat to the business - your manager and any internal stakeholders. Explain to them that you want to introduce some work practices that will improve quality and efficiency, but which will mean that you will be limiting the amount of work you do - i.e. you will have to start saying no. Try the puppy dog close: "Let's try this for a month - if you don't feel it's working out, we'll go back to the way we work now".
Get the team together, buy them pizza and beer, and try playing some Kanban games. There are a number of ways of doing this, but basically you need to come up with a scenario in which the team has to produce things, but the work is going to be limited and only accepted when there is capacity. Speak to me if you want some more detailed ideas - there are a few decent resources out there.
Get the team together for a white-board session. Try to get a sense of the kinds of phases your work goes through. How much emergency support work is there? How much general user support? How much project work? Draw up a first cut of a Kanban board, and imagine some scenarios. The key thing is to be creative. You can make work flow left to right, or top to bottom. You can use coloured cards or plain cards - it doesn't matter. The point of the board is to show what work is being done, by whom, and to make explicit what the WIP limits are.
Set up your Kanban board somewhere highly visible and easy to get to. You could use a whiteboard and magnets, a cork board and pins, or just stick cards to a wall with Blu-Tack. You can draw lines with a ruler, or you can use insulating tape to give bold, straight dividers between sections. Make it big, and clear.
Agree your WIP limit amongst yourselves - it doesn't matter what it is - just pick a sensible number, and be prepared to tweak it based on experience.
Gather your current work backlog together and put each piece of work on a card. If you can, sit with the various stakeholders for whom the work is being done, so you can get a good idea of what the acceptance criteria are, and their relative importance. You'll end up with a huge stack of cards - I keep them in a card box, next to the board.
Get your manager, and any stakeholders together, and have a prioritisation session. Explain that there's a work in progress limit, but that work will get done quickly. Your team will work on whatever is agreed is the highest priority. Then stick the highest priority cards to the left of (or above) the board. I like to have a 'Next Please' section on the board, with a WIP limit. Anyone can add cards to or remove cards from this section, and the team will pull from it when capacity becomes available.
Write up a team charter - decide on the rules. You might agree not to work on other people's cards without asking first. You might agree times of the day you'll work. I suggest two very important rules - once a card goes onto the in progress section of the board, it never comes off again, until it's done. And nobody works on anything that isn't on the board. Write the charter up, and get the team to sign it.
Have a daily standup meeting at the start of the day. At this meeting, unlike a traditional scrum or XP standup, we don't need to ask who is working on what, or what they're going to work on next - that's already on the board. Instead, talk about how much more is needed to complete the work, and discuss any problems or impediments that have come up. This is a good time for the team to write up cards for work they feel needs to be done to make their systems more reliable, or to make their lives easier. I recommend trying to get agreement from the business to always ensure one such card is in the 'Next Please' section.
Set up a ticketing system. I've used RT and Eventum. The idea is to reduce the amount of interrupts, and to make it easy to track whatever work is being carried out. We have a rule of thumb that everything needs a ticket. Work that can be carried out within about ten minutes can just be done, at the discretion of the sysadmin. Anything that's going to be longer needs to go on the board. We have a dedicated 'Support' section on our board, with a WIP limit. If there are more support requests than slots on the board, it's up to the requestors to agree amongst themselves which has the greatest business value (or cost).
Have a regular retrospective. I find fortnightly is enough. Set aside an hour or so, buy the team lunch, and talk about how the previous fortnight has been. Try to identify areas for improvement. I recommend using 'SWOT' (strengths, weaknesses, opportunities, threats) as a template for discussion. Also try to get into the habit of asking 'Five Whys' - keep asking why until you really get to the root cause. Also try to ensure you fix things 'Three ways'. These habits are part of a practice called 'Kaizen' - continuous improvement. They feed into your Kanban process, and make everyone's life easier, and improve the quality of the systems you're supporting.
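The ticketing rule of thumb from the list above is simple enough to write down as code. A minimal sketch, assuming effort estimates in whole minutes and a hypothetical Support section limit of two cards; in practice the ten-minute call stays at the sysadmin's discretion, and the function names are made up.

```shell
#!/bin/sh
# Sketch of the triage rule of thumb (the thresholds are assumptions):
# everything gets a ticket; work of about ten minutes or less can just be
# done, anything longer becomes a card on the board.
QUICK_FIX_MINUTES=${QUICK_FIX_MINUTES:-10}
SUPPORT_WIP_LIMIT=${SUPPORT_WIP_LIMIT:-2}

# $1: estimated effort in whole minutes.
triage() {
    if [ "$1" -le "$QUICK_FIX_MINUTES" ]; then
        echo "ticket + just do it"
    else
        echo "ticket + card on the board"
    fi
}

# Reads the cards currently in the Support section, one per line, on stdin.
# Succeeds only if there is a free Support slot; otherwise the requestors
# have to agree among themselves which request has the most business value.
support_has_room() {
    count=$(wc -l | tr -d ' ')
    [ "$count" -lt "$SUPPORT_WIP_LIMIT" ]
}
```

For example, `triage 45` prints `ticket + card on the board`, and that card then only goes into the Support section if `support_has_room` succeeds.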

The use of Kanban in development and operations teams is an exciting new development, but one which people are finding fits very well with a devops kind of approach to systems and development work. If you want to find out more, I recommend the following resources:

the home of Kanban for software development; A central place where ideas, resources and experiences are shared.
mailing list for people deploying Kanban in a software environment - full of very bright and experienced people
the nascent devops movement
agile web operations - excellent blog covering all aspects of agile operations from a devops perspective
agile sysadmin - This author's own blog - focussed around the practical application of technology and agile processes to deliver business value
