December 18, 2014

Day 18 - Adding Context to Alerts with nagios-herald

Written by: Katherine Daniels (@beerops)
Edited by: Jennifer Davis (@sigje)

3am Pages Suck!

As sysadmins, we all know the pain that comes from getting paged at 3am because some computery thing somewhere has caught on fire. It’s dark, you were having a perfectly pleasant dream about saving the world from killer robots, or cake, or something, when all of a sudden your phone starts making a noise like a car alarm. It’s your good friend Nagios, disturbing your slumber once again with word of a problem and very little else.

We might hate it for being the bearer of bad news, but Nagios is a well-known and time-tested monitoring and alerting tool. It does its job well: it runs the checks we tell it to, when we tell it to, and it dutifully whines when those checks fail. The problem with its whining, however, is that by default there is very little context around it.

Adding Context to 3am

As an example, let’s take a look at everyone’s favorite thing to get woken up by, the disk space check.

Disk Space Alert Without Context

We know that the disk space has just crossed the warning threshold. We know the amount and percentage of free space on this volume. We know what volume is having this issue, and what time the notification was sent. But this doesn’t tell us anything more. Was this volume gradually getting close to the threshold and just happened to go over it during the night? If so, we probably don’t care in the middle of the night - a nice slow increase means that it won’t explode during the night and can be fixed in the morning instead. On the other hand, was there a sudden drastic increase in disk usage? That’s another matter entirely, and something that someone probably should get out of bed for.

This kind of additional context provides really valuable information as to how actionable this alert is. And when we get disk space alerts, one of the first things we do is to check how quickly the disk has been filling up. But in the middle of the night, that’s asking an awful lot - getting out of bed to find a laptop, maybe arguing with a VPN, finding the right graphite or ganglia graph - who wants to do all that when what we really want to do is go back to sleep?

With nagios-herald, the computers can do all of that work for us.

Disk Space Alert With Context

Here we have a bunch of the most relevant context added into the alert for us. We start with a visual indicator of the problematic volume and how full it is, so eyes bleary from sleep can easily grok the severity of the situation. Next is a ganglia graph of the volume over the past day, to give an idea of how fast it has been filling up (and if there was a sudden jump, when it happened, which can often help in tracking down the source of a problem). The threshold is there as well, so we can tell if a critical alert is just barely over the threshold or OH HEY THIS IS REALLY SUPER SERIOUSLY CRITICAL GET UP AND PAY ATTENTION TO IT. Finally, we have alert frequency, to know easily if this is a box that frequently cries wolf or one that might require more attention.

Introducing Formatters

All this is done by way of formatters used by nagios-herald. nagios-herald is itself just a Nagios notification script, but these formatters can be used to do the heavy lifting of adding as much context to an alert as can be dreamt up (or at least automated). The Formatter::Base class defines a variety of methods that make up the core of nagios-herald’s formatting. More information on these methods can be found in their documentation, but to name a few:

  • add_text can be used to add any block of plain text to an alert - this could be used to add information such as which team to contact if this alert fires, whether or not the service is customer-impacting, or anything else that might assist the on-call person who receives the alert.

  • add_html can add any arbitrary HTML - this could be a link to a run-book with more detailed troubleshooting or resolution information, it could add an image (maybe a graph, or just a funny cat picture), or just turn the alert text different colors for added emphasis.

  • ack_info can be used to format information about who acknowledged the alert and when, which can be especially useful on larger or distributed teams where other people might be working on an issue (maybe that lets you know that somebody else is so on top of things that you can go back to sleep and wait until morning!)
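To make this a bit more concrete, here is a minimal sketch of what a custom formatter might look like. The class and module names are schematic (they vary with the nagios-herald version), and the contact details and runbook URL are placeholders, but the two-argument add_text/add_html calls follow the same form used in the real formatter shown later in this post.

# Schematic subclass of nagios-herald's formatter base class.
class MyTeamFormatter < NagiosHerald::Formatter
  # Append team-specific context to every alert this formatter handles.
  def additional_info
    section = __method__
    # Plain text for the short (pager/SMS) version of the alert.
    add_text(section, "Contact: #web-ops in chat; page the secondary if no ack in 15 minutes\n")
    # HTML for the rich (email) version of the same alert.
    add_html(section, %Q(<b>Runbook:</b> <a href="https://wiki.example.com/runbooks/disk">disk alerts</a><br>))
  end
end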

All of the methods in the formatter base class can be overridden in any subclass that inherits from it, so the only limit is your imagination. For example, we have several checks that look at graphite graphs and alert (or not) based on their value. Those checks use the check_graphite_graph formatter, which overrides the additional_info base formatter method to add the relevant graph to the Nagios alert:

def additional_info
    section = __method__
    output = get_nagios_var("NAGIOS_#{@state_type}OUTPUT")
    add_text(section, "Additional Info:\n #{unescape_text(output)}\n\n") if output
    output_match = output.match(/Current value: (?<current_value>[^,]*), warn threshold: (?<warn_threshold>[^,]*), crit threshold: (?<crit_threshold>[^,]*)/)
    if output_match
      add_html(section, "Current value: <b><font color='red'>#{output_match['current_value']}</font></b>, warn threshold: <b>#{output_match['warn_threshold']}</b>, crit threshold: <b><font color='red'>#{output_match['crit_threshold']}</font></b><br><br>")
    else
      add_html(section, "<b>Additional Info</b>:<br> #{output}<br><br>") if output
    end

    service_check_command = get_nagios_var("NAGIOS_SERVICECHECKCOMMAND")
    url = service_check_command.split(/!/)[-1].gsub(/'/, '')
    graphite_graphs = get_graphite_graphs(url)
    from_match = url.match(/from=(?<from>[^&]*)/)
    if from_match
      add_html(section, "<b>View from '#{from_match['from']}' ago</b><br>")
    else
      add_html(section, "<b>View from the time of the Nagios check</b><br>")
    end
    add_attachment graphite_graphs[0]    # The original graph.
    add_html(section, %Q(<img src="#{graphite_graphs[0]}" alt="graphite_graph" /><br><br>))
    add_html(section, '<b>24-hour View</b><br>')
    add_attachment graphite_graphs[1]    # The 24-hour graph.
    add_html(section, %Q(<img src="#{graphite_graphs[1]}" alt="graphite_graph" /><br><br>))
end

This method calls other methods from the base formatter class, such as add_html and add_attachment, to add all the relevant information we wanted for these graphite-based checks.

Now What?

If you’re using Nagios and wish its alerts were a little more helpful, go ahead and install nagios-herald and give it a try! From there, you can start customizing your own alerts by writing your own formatters - and we love feedback and pull requests. You’ll have to wrangle some Ruby, but it’s totally worth it for how much more useful your alerts will be. Getting paged in the middle of the night still won’t be particularly fun, but with nagios-herald, at least you can know that the computers are pulling their weight as well. And really, if they’re going to be so demanding and interrupt our sleep, shouldn’t they at least do a little bit of work for us when they do?

December 17, 2014

Day 17 - DevOps for Horses: Moving an Enterprise Application to the Cloud

Written by: Eric Shamow (@eshamow)
Edited by: Michelle Carroll (@miiiiiche)

As an engineer, when you first start thinking about on-demand provisioning, CD, containers, or any of the myriad techniques and technologies floating across the headlines, there is a point when you realize with a cold sweat that this is going to be a bigger job than you thought. As you watch folks talking at various conferences about the way they are deploying and scaling applications, you realize that your applications won’t work if you deployed them this way.

Most of the glamorous or really interesting, thought-provoking discussions around deployment methodologies work because the corresponding applications were built to be deployed into those environments in a true virtuous cycle between development and operations teams. Sometimes the lines between those teams disappear entirely.

In some cases, this is because Operations is outsourced entirely — consider PaaS environments like Heroku or Google App Engine, where applications can be deployed with tremendous ease, due to a very restricted set of conditions defining how code is structured and what features are available. Similarly, on-premises PaaS infrastructures, such as Cloud Foundry or OpenShift, allow for organizations to create a more flexible and customized environment while leveraging the same kind of automation and tight controls around application delivery.

If you can leverage these tools, you should. I advise teams to try and build out an internal PaaS capability — whether they are using Cloud Foundry or bootstrapping their own, or even several to allow for multiple application patterns. The Twelve-Factor App pattern is a good checklist of conditions to start with for understanding what’s necessary to get to a Heroku-like level of automation. If your app meets all these conditions, congratulations — you are probably ready to go PaaS.

My App Isn’t Ready For PaaS

Unless you’re a startup or have a well-funded team effort to move, your application won’t work as it stands in a PaaS. Perhaps you are ready for IaaS (or are evaluating IaaS) and wondering where to start. If you can’t do much with the application design, how can you begin to get ready for a cloud move with the legacy infrastructure and code you have?

Getting Your Bearings

Start by collecting data. A few critical pieces of information I like to gather before drawing up a strategy:

  • What are the components of the application? Can you draw a graph of their dependencies?

  • If the components are separated from one another, can they tolerate the partition or does the app crash or freeze? Are any components a single point of failure?

  • How long does it take for the application to recover from a failure?

  • Can the application recover from a typical failure automatically? If not, what manual intervention is involved?

  • How is the application deployed? If the server on which the application is running dies, what is the process/procedure for bringing it back to life?

  • Can you easily replicate the state of your app in any environment? Are your developers looking at code in an environment that looks as close as possible to production? Can your QA team adequately simulate the conditions of an outage when testing a new release?

  • How do you scale the application? Can you add additional worker systems and scale the system horizontally, or do you need to move the system to bigger and more powerful servers as the service grows?

  • What does the Development/QA cycle look like? Is Operations involved in deploying applications into QA? How long does it take for developers to get a new release into and through the testing cycle?

  • How does operations take delivery of code from development? What is the definition of a deliverable? Is it consistent, or does it change from version to version?

  • How do you know that your application was successfully installed?

I’m not going to tackle all of these questions; instead, I’ll focus on some of the key themes we’re looking for when examining our apps and environment.

Coupling

One of the key underpinnings of modern application design is the understanding that failure is inevitable — it’s not a question of if a component of your application will fail, but when. The critical metric for an application is not necessarily how often it fails (although an app that fails regularly is clearly a problem) but how well its components tolerate the failure of other components. As your app scales out — and particularly if you are planning to move to public cloud — you can expect that data will no longer flow evenly between components. This is not just a problem of high latency, but variable latency — sudden network congestion can cause traffic between components to be bursty.

If one component of your application depends on another component to be functional, or your app requires synchronous and low-latency communication at all times between components, you have identified tight coupling. These tight couplings are death for applications in the cloud (and they’re the services that make upgrades and migration to new locations the most difficult as well). Tight couplings are amongst the most difficult problems to address — often they relate to application design and are tightly tied to the business logic and implementation of the application. A good overview of the problem and some potential remedies can be found in Martin Fowler’s 2001 article “Reducing Coupling” (warning: PDF).

For now, we need to identify these tight couplings and pay extra attention to them — monitor heavily around communications, add checks to ensure that data is flowing smoothly, and in general treat these parts of our architecture as the fragile breakpoints that they are. If you cannot work around or eliminate these couplings, you may be able to automate processes for detection and remediation. Ultimately, the couplings between your apps will determine your pattern for upgrades, migrations and scaling — so understanding how your components communicate and which depend on each other is essential to building a working and automated process.
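As a small illustration of that kind of check (the service name and URL below are hypothetical), a probe with a short timeout turns a tight coupling’s failure into a fast, explicit signal instead of a stalled request:

require 'net/http'

# Probe a tightly coupled dependency with aggressive timeouts so a failure
# is detected and surfaced quickly rather than hanging the caller.
def dependency_healthy?(url = 'http://billing.internal.example.com/health')
  uri = URI(url)
  Net::HTTP.start(uri.host, uri.port, open_timeout: 2, read_timeout: 2) do |http|
    http.get(uri.path).is_a?(Net::HTTPSuccess)
  end
rescue StandardError
  false
end

warn 'billing dependency unreachable - alerting' unless dependency_healthy?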

Automation

If you can’t reinstall the app without human intervention, you have a problem. We can expect that a server will eventually fail and that application updates will happen on a regular cadence. Humans screw up things we do repetitively — repeat even a simple process often enough and you will eventually do it wrong. Computers are exceptionally good at repetitive tasks. If you have your sysadmins doing regular installs of your applications — or worse if your sysadmins have to call in developers and they must pair to slowly work through every install — you are not taking advantage of the computers. And you’re overtaxing humans who are much better at — and happier — doing other things.

Many organizations maintain either an installation wiki, a set of install scripts, or both. These sources of information frequently vary and operators need to hop from one to the other to assemble and install. With this type of ad-hoc assembly of a process, it’s likely that one administrator will not follow the process perfectly each time, but certain that different administrators will follow the process in different ways. Asking people to “fix the wiki” will not fix the discrepancy. The wiki will always lag the current state of your systems. Instead, treat your installation scripts like “executable documentation.” They should be the single source of truth for the process used to deploy the app.

While you will want your automation to use good, known frameworks, the reality is that a BASH script is a good start if you have nothing in place. Is BASH the way to go for your system automation? As a former employer put it, “SSH in a for loop isn’t enough” — and it’s not. But writing a script to deploy a system in a language you already know is a good way to identify if you can automate the deploy, as well as the decisions you need to make during the install. This information informs your later choice of automation framework, and enables you to identify which parts of your configuration change from install to install. As a bonus, you’ve taken a first pass at automating your process, which will speed up your deploys and help you select an automation framework that best fits your use case. For an exploration of this topic and an introduction to taking it a step further into early Configuration Management, check out my former colleague Mike Stahnke’s dead-on 2013 presentation “Getting Started With Puppet.”
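As a sketch of that first pass (the package name, release artifact, and paths below are hypothetical), even a short script in a language you already know turns the wiki page into executable documentation:

#!/usr/bin/env ruby
# Deploy steps captured as executable documentation: every step an operator
# would otherwise copy out of the wiki, run in order, failing loudly.
require 'fileutils'

def run!(cmd)
  puts "==> #{cmd}"
  system(cmd) or abort "FAILED: #{cmd}"
end

run! 'apt-get install -y myapp-deps'            # hypothetical dependency package
FileUtils.mkdir_p '/etc/myapp'
FileUtils.cp 'config/app.conf', '/etc/myapp/'   # config shipped alongside the release
run! 'tar -xzf myapp-1.2.3.tar.gz -C /opt'      # hypothetical release artifact
run! 'service myapp restart'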

Environment Parity and Configuration Management

We’ve all been on some side of the environment parity issue. Code makes it into production that didn’t take into account some critical element of the production environment — a firewall, different networking configuration, different system version, and so on. The invariable response from Operations is, “Developers don’t understand real operating environments.” The colloquial version of this is, “It works on my laptop!”

The more common truth is that Operations didn’t provide Development with an environment that looked anything like production, or even with the tools to know about or understand what the production environment looks like. As an Operations team, if you don’t offer Development a prod-like environment to deploy into and test with, you cede your right to complain about code they produce that doesn’t match prod.

Since it is often not possible to give developers an exact copy of production, it’s important for the Operations team to abstract away as many changes between environments as is possible. Dev, Prod, QA and all other teams should be running the same OS versions and patch sets, with the same dependencies and same system configuration across the board. The most sensible way to do this is with Configuration Management. Configure all of your environments using the same tools and — most critically — with the same configuration management scripts. The differences between your environments should be a set of variables that inform that code.
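A tiny sketch of what “differences reduced to variables” means in practice (the environment names, hostnames, and config path are hypothetical): the code that writes the configuration is identical everywhere, and only the per-environment data changes.

# The per-environment differences live in one small data structure...
ENVIRONMENTS = {
  'production' => { db_host: 'db1.prod.example.com', workers: 8 },
  'qa'         => { db_host: 'db1.qa.example.com',   workers: 2 },
}

env  = ENV.fetch('APP_ENV', 'qa')
vars = ENVIRONMENTS.fetch(env)

# ...while the code that renders and writes the config is the same everywhere.
config = "db_host = #{vars[:db_host]}\nworkers = #{vars[:workers]}\n"
File.write('/etc/myapp/app.conf', config)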

If you can’t reduce the differences between your environments to code informed by variables, you’ve identified some hard problems your developers and operations teams are going to have to bridge together. At the very least, if you can make your environments more similar, you can significantly reduce the number of factors that must be taken into account when an app fails in one environment when it succeeded in another.

Get Operations out of the Dev/QA Cycle

The notion of Operations being required to install applications into a QA/Testing environment always baffled me. I was in favor of Development not doing the install themselves, but I also understood that opening a ticket with Operations and waiting for an install is a time-intensive process, and that debugging/troubleshooting is a highly interactive one. These two needs are at odds. By slowing down the Dev/QA feedback loop, Operations not only causes Development to become less efficient, it also encourages developers to do larger chunks of work and submit them for testing less frequently.

The flip side of this is that allowing developers full root access on QA servers is potentially dangerous. Developers may inadvertently make changes that cause those servers to behave differently from production. Similarly, if developers are installing directly into QA, operations doesn’t get to look at the deliverable until it reaches production. When they install the application for the first time, it’s in the most critical environment.

There’s a three-part fix for this:

  • Developers are responsible for deliverables in a consistent format. Whether that’s a package, a tarball, or a tagged git checkout, the deliverable must look the same from release to release.

  • QA is managed via Configuration Management, and applications are installed into QA using the same automation tools/scripts used in production.

  • Operations’ SLA for QA is that it will flatten and re-provision the environment when needed. If a deployment screws up the server, Ops will provide a new, clean server.

Using these policies, the application is installed into QA and any subsequent environments with the same scripts. If we’ve learned anything from the Lean movement, it’s that accuracy can be improved by reducing batch sizes, increasing the speed of processing and baking QA into the process. With these changes, the deployment scripts and artifacts are tested dozens, hundreds or thousands of times before they are ever used in production. This can help find deployment problems and iron out scripts long before code ever reaches user-facing systems.

The benefits for both teams are clear: Development gets a fast turnaround time for QA, Operations gets a clean deliverable that can be deployed via its own scripts.

Functional Testing

While there will always be the need for manual testing of certain functionality, establishing an automated testing regimen can provide quick feedback about whether an app is functioning as intended.

While an overview of testing strategies is beyond the scope of this article (Chapter 4 of Jez Humble and David Farley’s book Continuous Delivery provides an excellent overview), I’d argue for prioritizing a combination of functional and integration tests. You want to confirm that the app does what is intended. Simple smoke tests that verify a server is configured properly and that an application is installed and running are a good first pass at a testing regimen.

Once you get comfortable writing tests, you should begin doing more involved testing of application and server behavior and performance. Every time you make a change that alters the behavior of the application or underlying system, add a test. Down the line you may want to consider TDD or BDD, but start small — having imperfect tests is better than having no tests at all.

At the application level, your development team likely has a testing language or suite for unit and integration tests. There are a number of frameworks you can use for doing this at the server/Configuration Management level. I have used both serverspec and Beaker with success in the past.
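For example, a few serverspec expectations like the following (the package, service, and port are placeholders for whatever your application actually runs) make a decent first smoke test for a freshly provisioned server:

# spec/webserver/smoke_spec.rb
require 'serverspec'
set :backend, :exec   # run the checks directly on the machine under test

describe package('nginx') do
  it { should be_installed }
end

describe service('nginx') do
  it { should be_enabled }
  it { should be_running }
end

describe port(80) do
  it { should be_listening }
end

Run these with rspec after a deploy or configuration change for a quick go/no-go signal.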

The first time you run a proposed configuration management change through tests and discover that it would break your application is a revelation. Similarly, the first time you prevent a regression by adding a check for something that “always” breaks will be the last time somebody accidentally breaks it.

Wrapping Up

We’ve just scratched the surface of what can be done with an existing environment, but as you can hopefully see, there’s plenty you can do right now to get your environment ready for IaaS (and eventually, PaaS) without touching your application’s code.

Remember that this process should be iterative — unless you have the budget to build a greenfield environment tomorrow, you are going to be tackling this one piece at a time. Don’t feel ashamed because your environments aren’t automated enough or you don’t have comprehensive enough tests for your application. Rather, focus on making things better. If you don’t have enough automation, build more. If there aren’t enough good tests, write just one. Then re-examine your environment, see what most needs improvement, and iterate there.

There’s no way to completely move an app without touching the code, but there’s plenty of work to do before you get there in preparation for scalable, loosely coupled code. Don’t wait for the perfect application to start doing the right thing.

December 16, 2014

Day 16 - How to Interview Systems Administrators

Written by: Corey Quinn (@quinnypig)
Edited by: Justin Garrison (@rothgar)

There are many blog posts, articles, and even books[0] written on how to effectively interview software engineers. Hiring systems administrators[1] is a bit more prickly of a topic, for a few reasons.

  • You generally hire fewer of them than you do developers[2].
  • A systems administrator likely has root in production. Mistakes will show more readily, and in many environments “peer review” is an aspiration rather than the current state of things.
  • It’s extremely easy to let your systems administration team become “the department of no.” This can have an echo effect that pumps toxicity into your organization. It’s important to hire someone who isn’t going to add overwhelming negativity.

Every job interview since the beginning of time is built around asking candidates three questions. They’ll take different forms, and you’ll dress them up differently each time, but they can be distilled down as follows.

  1. Can you do the job?
  2. Will you like doing the job?
  3. Can we stand working with you?

Doing the Job

This is where the barrage of technical questions comes in. Be careful when selecting what technical areas you want to cover, and how you cover them. Going into stupendous depth on SAN management when you don’t have centralized storage at all is something of a waste of time.

Additionally, many shops equate trivia with mastery of a subject. “Which format specifier to date(1) will spit out the seconds since the Unix epoch began?” The correct answer is of course “man date” unless they, for some reason, have %s memorized– but what does a right answer really tell you past a single bit of data? Being able to successfully memorize trivia doesn’t really speak to someone’s ability to successfully perform in an operational role.

Instead, it probably makes more sense for you to ask open ended questions about things you care about. “So, we have a lot of web servers here. What’s your experience with managing them? What other technology have you worked with in conjunction with serving data over http/https?” This gleans a lot more data than asking trivia questions about configuring virtual hosts in Apache’s httpd. Be aware that some folks will try to talk around the question; politely returning to specific scenarios can help refocus them.

Liking the Job

Hiring people, training them, and the rest of the onboarding process are expensive. Having to replace someone who left due to poor fit, a skills mismatch, or other reasons two months into the job is awful. It’s important to suss out whether or not the candidate is likely to enjoy their work. That said, it’s sometimes difficult to ascertain whether or not the candidate is just telling you what you want to hear. To that end, ask the candidate for specific stories regarding their current and past work. “Tell me about a time you had to deal with a difficult situation.” Push for specific details– you don’t want to hear “the right answer,” you want to know what actually happened.

This questioning technique leads well into the third question…

Not Being a Jerk

If you think back across your career, you can probably think of a systems administrator you’ve met who could easily be named Surly McBastard. You really, really, really don’t want to hire that person. It’s very easy for the sysadmin group to gain the reputation as “the department of no” just due to their job function alone– remember, their goal is stability above all else. Your engineering group (presuming a separate and distinct team from the operations group) is trying to roll new features out. This gives way to a natural tension in most organizations. There’s no need to exacerbate this by hiring someone who’s difficult to work with.

A key indicator here is fanaticism. We all have our favorite pet technologies, but most of us are able to put personal preferences aside in favor of the prevailing consensus. A subset of technologists are unable to do this. “You use Redis? Why?! It’s a steaming pile of crap!” is a great example of what you might not want to hear. A better way for a candidate to frame this sentiment might be “Oh, you’re a Redis shop? That’s interesting– I’ve run into some challenges with it in the past. I’d be very curious to hear how you’ve overcome some challenges…”

Remember, the successful candidate is going to have to deal with other groups of people, and that’s a very challenging thing to interview for. It also helps to remember that interviewing is an inexact science, and everyone approaches it with a number of biases.

For this reason, I strongly recommend having multiple interviewers speak to each candidate, and then compare notes afterwards. It’s entirely possible that one person will pick up on a red flag that others will miss.

Ultimately, interviewing is a challenge on both sides of the table. The best way to improve is to practice– take notes on what works, what doesn’t, and adjust accordingly. Remember that every hire you make shifts your team; ideally you want that to be trending upwards with each successive hire.

[0] I’m partial to http://smile.amazon.com/Recruit-Hire-Great-Software-Engineers/dp/143024917X?sa-no-redirect=1 myself.
[1] For purposes of this article, “systems administrators” can be expanded to include operations engineers, devops unicorns, network engineers, database wizards, storage gurus, infrastructure perverts, NOC technicians, and other similar roles.
[2] For purposes of this article, “developers” can be expanded to include… you get the idea.

December 15, 2014

Day 15 - Cook your own packages: Getting more out of fpm

Written by: Mathias Lafeldt (@mlafeldt)
Edited by: Joseph Kern (@josephkern)

Introduction

When it comes to building packages, there is one particular tool that has grown in popularity over the last few years: fpm. fpm’s honorable goal is to make it as simple as possible to create native packages for multiple platforms, all without having to learn the intricacies of each distribution’s packaging format (.deb, .rpm, etc.) and tooling.

With a single command, fpm can build packages from a variety of sources including Ruby gems, Python modules, tarballs, and plain directories. Here’s a quick example showing you how to use the tool to create a Debian package of the AWS SDK for Ruby:

$ fpm -s gem -t deb aws-sdk
Created package {:path=>"rubygem-aws-sdk_1.59.0_all.deb"}

It is this simplicity that makes fpm so popular. Developers are able to easily distribute their software via platform-native packages. Businesses can manage their infrastructure on their own terms, independent of upstream vendors and their policies. All of this has been possible before, but never with this little effort.

In practice, however, things are often more complicated than the one-liner shown above. While it is absolutely possible to provision production systems with packages created by fpm, it will take some work to get there. The tool can only take you so far.

In this post we’ll take a look at several best practices covering: dependency resolution, reproducible builds, and infrastructure as code. All examples will be specific to Debian and Ruby, but the same lessons apply to other platforms/languages as well.

Resolving dependencies

Let’s get back to the AWS SDK package from the introduction. With a single command, fpm converts the aws-sdk Ruby gem to a Debian package named rubygem-aws-sdk. This is what happens when we actually try to install the package on a Debian system:

$ sudo dpkg --install rubygem-aws-sdk_1.59.0_all.deb
...
dpkg: dependency problems prevent configuration of rubygem-aws-sdk:
 rubygem-aws-sdk depends on rubygem-aws-sdk-v1 (= 1.59.0); however:
  Package rubygem-aws-sdk-v1 is not installed.
...

As we can see, our package can’t be installed due to a missing dependency (rubygem-aws-sdk-v1). Let’s take a closer look at the generated .deb file:

$ dpkg --info rubygem-aws-sdk_1.59.0_all.deb
 ...
 Package: rubygem-aws-sdk
 Version: 1.59.0
 License: Apache 2.0
 Vendor: Amazon Web Services
 Architecture: all
 Maintainer: <vagrant@wheezy-buildbox>
 Installed-Size: 5
 Depends: rubygem-aws-sdk-v1 (= 1.59.0)
 Provides: rubygem-aws-sdk
 Section: Languages/Development/Ruby
 Priority: extra
 Homepage: http://aws.amazon.com/sdkforruby
 Description: Version 1 of the AWS SDK for Ruby. Available as both `aws-sdk` and `aws-sdk-v1`.
  Use `aws-sdk-v1` if you want to load v1 and v2 of the Ruby SDK in the same
  application.

fpm did a great job at populating metadata fields such as package name, version, license, and description. It also made sure that the Depends field contains all required dependencies that have to be installed for our package to work properly. Here, there’s only one direct dependency – the one we’re missing.

While fpm goes to great lengths to provide proper dependency information – and this is not limited to Ruby gems – it does not automatically build those dependencies. That’s our job. We need to find a set of compatible dependencies and then tell fpm to build them for us.

Let’s build the missing rubygem-aws-sdk-v1 package with the exact version required and then observe the next dependency in the chain:

$ fpm -s gem -t deb -v 1.59.0 aws-sdk-v1
Created package {:path=>"rubygem-aws-sdk-v1_1.59.0_all.deb"}

$ dpkg --info rubygem-aws-sdk-v1_1.59.0_all.deb | grep Depends
 Depends: rubygem-nokogiri (>= 1.4.4), rubygem-json (>= 1.4), rubygem-json (<< 2.0)

Two more packages to take care of: rubygem-nokogiri and rubygem-json. By now, it should be clear that resolving package dependencies like this is no fun. There must be a better way.

In the Ruby world, Bundler is the tool of choice for managing and resolving gem dependencies. So let’s ask Bundler for the dependencies we need. For this, we create a Gemfile with the following content:

# Gemfile
source "https://rubygems.org"
gem "aws-sdk", "= 1.59.0"
gem "nokogiri", "~> 1.5.0" # use older version of Nokogiri

We then instruct Bundler to resolve all dependencies and store the resulting .gem files into a local folder:

$ bundle package
...
Updating files in vendor/cache
  * json-1.8.1.gem
  * nokogiri-1.5.11.gem
  * aws-sdk-v1-1.59.0.gem
  * aws-sdk-1.59.0.gem

We specifically asked Bundler to create .gem files because fpm can convert them into Debian packages in a matter of seconds:

$ find vendor/cache -name '*.gem' | xargs -n1 fpm -s gem -t deb
Created package {:path=>"rubygem-aws-sdk-v1_1.59.0_all.deb"}
Created package {:path=>"rubygem-aws-sdk_1.59.0_all.deb"}
Created package {:path=>"rubygem-json_1.8.1_amd64.deb"}
Created package {:path=>"rubygem-nokogiri_1.5.11_amd64.deb"}

As a final test, let’s install those packages…

$ sudo dpkg -i *.deb
...
Setting up rubygem-json (1.8.1) ...
Setting up rubygem-nokogiri (1.5.11) ...
Setting up rubygem-aws-sdk-v1 (1.59.0) ...
Setting up rubygem-aws-sdk (1.59.0) ...

…and verify that the AWS SDK actually can be used by Ruby:

$ ruby -e "require 'aws-sdk'; puts AWS::VERSION"
1.59.0

Win!

The purpose of this little exercise was to demonstrate one effective approach to resolving package dependencies for fpm. By using Bundler – the best tool for the job – we get fine control over all dependencies, including transitive ones (like Nokogiri, see Gemfile). Other languages provide similar dependency tools. We should make use of language specific tools whenever we can.

Build infrastructure

After learning how to build all packages that make up a piece of software, let’s consider how to integrate fpm into our build infrastructure. These days, with the rise of the DevOps movement, many teams have started to manage their own infrastructure. Even though each team is likely to have unique requirements, it still makes sense to share a company-wide build infrastructure, as opposed to reinventing the wheel each time someone wants to automate packaging.

Packaging is often only a small step in a longer series of build steps. In many cases, we first have to build the software itself. While fpm supports multiple source formats, it doesn’t know how to build the source code or determine dependencies required by the package. Again, that’s our job.

Creating a consistent build and release process for different projects across multiple teams is hard. Fortunately, there’s another tool that does most of the work for us: fpm-cookery. fpm-cookery sits on top of fpm and provides the missing pieces to create a reusable build infrastructure. Inspired by projects like Homebrew, fpm-cookery builds packages based on simple recipes written in Ruby.

Let’s turn our attention back to the AWS SDK. Remember how we initially converted the gem to a Debian package? As a warm up, let’s do the same with fpm-cookery. First, we have to create a recipe.rb file:

# recipe.rb
class AwsSdkGem < FPM::Cookery::RubyGemRecipe
  name    "aws-sdk"
  version "1.59.0"
end

Next, we pass the recipe to fpm-cook, the command-line tool that comes with fpm-cookery, and let it build the package for us:

$ fpm-cook package recipe.rb
===> Starting package creation for aws-sdk-1.59.0 (debian, deb)
===>
===> Verifying build_depends and depends with Puppet
===> All build_depends and depends packages installed
===> [FPM] Trying to download {"gem":"aws-sdk","version":"1.59.0"}
...
===> Created package: /home/vagrant/pkg/rubygem-aws-sdk_1.59.0_all.deb

To complete the exercise, we also need to write a recipe for each remaining gem dependency. This is what the final recipes look like:

# recipe.rb
class AwsSdkGem < FPM::Cookery::RubyGemRecipe
  name       "aws-sdk"
  version    "1.59.0"
  maintainer "Mathias Lafeldt <mathias.lafeldt@gmail.com>"

  chain_package true
  chain_recipes ["aws-sdk-v1", "json", "nokogiri"]
end

# aws-sdk-v1.rb
class AwsSdkV1Gem < FPM::Cookery::RubyGemRecipe
  name       "aws-sdk-v1"
  version    "1.59.0"
  maintainer "Mathias Lafeldt <mathias.lafeldt@gmail.com>"
end

# json.rb
class JsonGem < FPM::Cookery::RubyGemRecipe
  name       "json"
  version    "1.8.1"
  maintainer "Mathias Lafeldt <mathias.lafeldt@gmail.com>"
end

# nokogiri.rb
class NokogiriGem < FPM::Cookery::RubyGemRecipe
  name       "nokogiri"
  version    "1.5.11"
  maintainer "Mathias Lafeldt <mathias.lafeldt@gmail.com>"

  build_depends ["libxml2-dev", "libxslt1-dev"]
  depends       ["libxml2", "libxslt1.1"]
end

Running fpm-cook again will produce Debian packages that can be added to an APT repository and are ready for use in production.

Three things worth highlighting:

  • fpm-cookery is able to build multiple dependent packages in a row (configured by chain_* attributes), allowing us to build everything with a single invocation of fpm-cook.
  • We can use the attributes build_depends and depends to specify a package’s build and runtime dependencies. When running fpm-cook as root, the tool will automatically install missing dependencies for us.
  • I deliberately set the maintainer attribute in all recipes. It’s important to take responsibility for the work that we do. We should make it as easy as possible for others to identify the person or team responsible for a package.

fpm-cookery provides many more attributes to configure all aspects of the build process. Among other things, it can download source code from GitHub before running custom build instructions (e.g. make install). The fpm-recipes repository is an excellent place to study some working examples. This final example, a recipe for chruby, is a foretaste of what fpm-cookery can actually do:

# recipe.rb
class Chruby < FPM::Cookery::Recipe
  description "Changes the current Ruby"

  name     "chruby"
  version  "0.3.8"
  homepage "https://github.com/postmodern/chruby"
  source   "https://github.com/postmodern/chruby/archive/v#{version}.tar.gz"
  sha256   "d980872cf2cd047bc9dba78c4b72684c046e246c0fca5ea6509cae7b1ada63be"

  maintainer "Jan Brauer <jan@jimdo.com>"

  section "development"

  config_files "/etc/profile.d/chruby.sh"

  def build
    # nothing to do here
  end

  def install
    make :install, "PREFIX" => prefix
    etc("profile.d").install workdir("chruby.sh")
  end
end

# chruby.sh
source /usr/share/chruby/chruby.sh

Wrapping up

fpm has changed the way we build packages. We can get even more out of fpm by using it in combination with other tools. Dedicated programs like Bundler can help us with resolving package dependencies, which is something fpm won’t do for us. fpm-cookery adds another missing piece: it allows us to describe our packages using simple recipes, which can be kept under version control, giving us the benefits of infrastructure as code: repeatability, automation, rollbacks, code reviews, etc.

Last but not least, it’s a good idea to pair fpm-cookery with Docker or Vagrant for fast, isolated package builds. This, however, is outside the scope of this article and left as an exercise for the reader.

Further reading

December 14, 2014

Day 14 - Using Chef Provisioning to Build Chef Server

Or, Yo Dawg, I heard you like Chef.

Written by: Joshua Timberman (@jtimberman)
Edited by: Paul Graydon (@twirrim)

This post is dedicated to Ezra Zygmuntowicz. Without Ezra, we wouldn’t have had Merb for the original Chef server, chef-solo, and maybe not even Chef itself. His contributions to the Ruby, Rails, and Chef communities are immense. Thanks, Ezra, RIP.

In this post, I will walk through a use case for Chef Provisioning used at Chef Software, Inc.: building a new Hosted Chef infrastructure with Chef Server 12 on Amazon EC2. This isn’t an in-depth how to guide, but I will illustrate the important components to discuss what is required to setup Chef Provisioning, with a real world example. Think of it as a whirlwind tour of Chef Provisioning and Chef Server 12.

Background

If you have used Chef for awhile, you may recall the wiki page “Bootstrap Chef RubyGems Installation” - the installation guide that uses cookbooks with chef-solo to install all the components required to run an open source Chef Server. This idea was a natural fit in the omnibus packages for Enterprise Chef (nee Private Chef) in the form of private-chef-ctl reconfigure: that command kicks off a chef-solo run that configures and starts all the Chef Server services.

It should be no surprise, that at CHEF we build Hosted Chef using Chef. Yes, it’s turtles and yo-dawg jokes all the way down. As the CHEF CTO Adam described when talking about one Chef Server codebase, we want to bring our internal deployment and development practices in line with what we’re shipping to customers, and we want to unify our approach so we can provide better support.

Chef Server 12

As announced recently, Chef Server 12 is generally available. For purposes of the example discussed below, we’ll provision three machines: one backend, one frontend (with Chef Manage and Chef Reporting), and one running Chef Analytics. While Chef Server 12 has the capability to install add-ons, we have a special cookbook with a resource to manage the installation of “Chef Server Ingredients.” This is so we can also install the chef-server-core package used by both the API frontend nodes and the backend nodes.

Chef Provisioning

Chef Provisioning is a new capability for Chef, where users can define “machines” as Chef resources in recipes, and then converge those recipes on a node. This means that new machines are created using a variety of possible providers (AWS, OpenStack, or Docker, to name a few), and they can have recipes applied from other cookbooks available on the Chef Server.
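A minimal sketch of the idea (the driver string, machine name, and recipe are illustrative, not taken from the chef-server-cluster cookbook):

require 'chef/provisioning'

# Pick a driver; credentials come from its own configuration
# (for AWS, the ~/.aws/config file described below).
with_driver 'fog:AWS'

machine 'web1' do
  recipe 'my_app::default'   # hypothetical cookbook run on the newly created instance
  action :converge
end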

Chef Provisioning “runs” on a provisioner node. This is often a local workstation, but it could be a specially designated node in a data center or cloud provider. It is simply a recipe run by chef-client (or chef-solo). When using chef-client, any Chef Server will do, including Hosted Chef. Of course, the idea here is we don’t have a Chef Server yet. In my examples in this post, I’ll use my OS X laptop as the provisioner, and Chef Zero as the server.

Assemble the Pieces

The cookbook that does the work using Chef Provisioning is chef-server-cluster. Note that this cookbook is under active development, and the code it contains may differ from the code in this post. As such, I’ll post relevant portions to show the use of Chef Provisioning, and the supporting local setup required to make it go. Refer to the README.md in the cookbook for the most recent information on how to use it.

Amazon Web Services EC2

The first thing we need is an AWS account for the EC2 instances. Once we have that, we need an IAM user that has privileges to manage EC2, and an SSH keypair to log into the instances. It is outside the scope of this post to provide details on how to assemble those pieces. However once those are acquired, do the following:

Put the access key and secret access key configuration in ~/.aws/config. This is automatically used by chef-provisioning’s AWS provider. The SSH keys will be used in a data bag item (JSON) that is described later. You will then want to choose an AWS region to use. For sake of example, my keypair is named hc-metal-provisioner in the us-west-2 region.
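The ~/.aws/config entry can be as small as this (the values are placeholders):

[default]
aws_access_key_id = AKIAXXXXXXXXXXXXXXXX
aws_secret_access_key = XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX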

Chef Provisioning needs to know about the SSH keys in three places:

  1. In the .chef/knife.rb, the private_keys and public_keys configuration settings.
  2. In the machine_options that is used to configure the (AWS) driver so it can connect to the machine instances.
  3. In a recipe.

This is described in more detail below.

Chef Repository

We use a Chef Repository to store all the pieces and parts for the Hosted Chef infrastructure. For example purposes I’ll use a brand new repository. I’ll use ChefDK’s chef generate command:

% chef generate repo sysadvent-chef-cluster

This repository will have a Policyfile.rb, a .chef/knife.rb config file, and a couple of data bags. The latest implementation specifics can be found in the chef-server-cluster cookbook’s README.md.

Chef Zero and Knife Config

As mentioned above, Chef Zero will be the Chef Server for this example, and it will run on a specific port (7799). I started it up in a separate terminal with:

% chef-zero -l debug -p 7799

The knife config file will serve two purposes. First, it will be used to load all the artifacts into Chef Zero. Second, it will provide essential configuration to use with chef-client. Let’s look at the required configuration.

This portion tells chef, knife, and chef-client to use the chef-zero instance started earlier.

chef_server_url 'http://localhost:7799'
node_name       'chef-provisioner'

In the next section, I’ll discuss the policyfile feature in more detail. These configuration settings tell chef-client to use policyfiles, and which deployment group the client should use.

use_policyfile   true
deployment_group 'sysadvent-demo-provisioner'

As mentioned above, these are the configuration options that tell Chef Provisioning where the keys are located. The key files must exist on the provisioning node somewhere.

First here’s the knife config:

private_keys     'hc-metal-provisioner' => '/tmp/ssh/id_rsa'
public_keys      'hc-metal-provisioner' => '/tmp/ssh/id_rsa.pub'

Then the recipe - this is from the current version of chef-server-cluster::setup-ssh-keys.

fog_key_pair node['chef-server-cluster']['aws']['machine_options']['bootstrap_options']['key_name'] do
  private_key_path '/tmp/ssh/id_rsa'
  public_key_path '/tmp/ssh/id_rsa.pub'
end

The attribute here is part of the driver options set using the with_machine_options method for Chef Provisioning in chef-server-cluster::setup-provisioner. For further reading about machine options, see Chef Provisioning configuration documentation. While the machine options will automatically use keys stored in ~/.chef/keys or ~/.ssh, we do this to avoid strange conflicts on local development systems used for test provisioning. An issue has been opened to revisit this.
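For reference, a with_machine_options call of roughly the shape used here might look like the following. The values mirror the instance settings visible in the provisioning output later in this post, and the option layout belongs to the fog driver, so treat this as a sketch rather than a copy of the cookbook’s attributes.

with_machine_options(
  bootstrap_options: {
    key_name:  'hc-metal-provisioner',
    image_id:  'ami-b99ed989',
    flavor_id: 'm3.medium'
  },
  ssh_username: 'ubuntu'
)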

Policyfile.rb

Beware, gentle reader! This is an experimental new feature that may (read: will) change. However, I wanted to try it out, as it made sense for the workflow when I was assembling this post. Read more about Policyfiles in the ChefDK repository. In particular, read the “Motivation and FAQ” section. Also, Chef (client) 12 is required, which is included in the ChefDK package I have installed on my provisioning system.

The general idea behind Policyfiles is to assemble a node’s run list as an artifact, including all the roles and recipes needed to fulfill its job in the infrastructure. Each Policyfile.rb contains at least the following.

  • name: the name of the policy
  • run_list: the run list for nodes that use this policy
  • default_source: the source where cookbooks should be downloaded (e.g., Supermarket)
  • cookbook: define the cookbooks required to fulfill this policy

As an example, here is the Policyfile.rb I’m using, at the top level of the repository:

name            'sysadvent-demo'
run_list        'chef-server-cluster::cluster-provision'
default_source  :community
cookbook        'chef-server-ingredient', '>= 0.0.0',
                :github => 'opscode-cookbooks/chef-server-ingredient'
cookbook        'chef-server-cluster', '>= 0.0.0',
                :github => 'opscode-cookbooks/chef-server-cluster'

Once the Policyfile.rb is written, it needs to be compiled to a lock file (Policyfile.lock.json) with chef install. Installing the policy does the following.

  • Build the policy
  • “Install” the cookbooks to the cookbook store (~/.chefdk/cache/cookbooks)
  • Write the lockfile

This doesn’t put the cookbooks (or the policy) on the Chef Server. We’ll do that in the upload section with chef push.

Data Bags

At CHEF, we prefer to move configurable data and secrets to data bags. For secrets, we generally use Chef Vault, though for the purpose of this example we’re going to skip that here. The chef-server-cluster cookbook has a few data bag items that are required before we can run Chef Client.

Under data_bags, I have these directories/files.

  • secrets/hc-metal-provisioner-chef-aws-us-west-2.json: the name hc-metal-provisioner-chef-aws-us-west-2 is an attribute in the chef-server-cluster::setup-ssh-keys recipe to load the correct item; the private and public SSH keys for the AWS keypair are written out to /tmp/ssh on the provisioner node
  • secrets/private-chef-secrets-_default.json: the complete set of secrets for the Chef Server systems, written to /etc/opscode/private-chef-secrets.json
  • chef_server/topology.json: the topology and configuration of the Chef Server. Currently this doesn’t do much but will be expanded in future to inform /etc/opscode/chef-server.rb with more configuration options

See the chef-server-cluster cookbook README.md for the latest details about the data bag items required. Note: At this time, chef-vault is not used for secrets, but that will change in the future.

Upload the Repository

Now that we’ve assembled all the required components to converge the provisioner node and start up the Chef Server cluster, let’s get everything loaded on the Chef Server.

Ensure the policyfile is compiled and installed, then push it as the provisioner deployment group. The group name is combined with the policy name in the config that we saw earlier in knife.rb. The chef push command uploads the cookbooks, and also creates a data bag item that stores the policyfile’s rendered JSON.

% chef install
% chef push provisioner

Next, upload the data bags.

% knife upload data_bags

We can now use knife to confirm that everything we need is on the Chef Server:

% knife data bag list
chef_server
policyfiles
secrets
% knife cookbook list
apt                      11131342171167261.63923027125258247.235168191861173
chef-server-cluster      2285060862094129.64629594500995644.198889591798187
chef-server-ingredient   37684361341419357.41541897591682737.246865540583454
chef-vault               11505292086701548.4466613666701158.13536425383812

What’s with those crazy versions? That is what the policyfile feature does. The human-readable versions are no longer used; cookbook versions are locked using unique, automatically generated version strings, so for any given policy we know the precise cookbook dependency graph. When Chef runs on the provisioner node, it will use the versions in its policy. When Chef runs on the machine instances, since they’re not using Policyfiles, it will use the latest version. In the future we’ll have policies for each of the nodes that are managed with Chef Provisioning.

Checkpoint

At this point, we have:

  • ChefDK installed on the local provisioning node (laptop) with Chef client version 12
  • AWS IAM user credentials in ~/.aws/config for managing EC2 instances
  • A running Chef Server using chef-zero on the local node
  • The chef-server-cluster cookbook and its dependencies
  • The data bag items required to use chef-server-cluster’s recipes, including the SSH keys Chef Provisioning will use to log into the EC2 instances
  • A knife.rb config file that will point chef-client at the chef-zero server, and tells it to use policyfiles

Chef Client

Finally, the moment (or several moments…) we have been waiting for! It’s time to run chef-client on the provisioning node.

% chef-client -c .chef/knife.rb

While that runs, let’s talk about what’s going on here.

Normally when chef-client runs, it reads configuration from /etc/chef/client.rb. As I mentioned, I’m using my laptop, which has its own run list and configuration, so I need to specify the knife.rb discussed earlier. This will use the chef-zero Chef Server running on port 7799, and the policyfile deployment group.

In the output, we’ll see Chef get its run list from the policy file, which looks like this:

resolving cookbooks for run list: ["chef-server-cluster::cluster-provision@0.0.7 (081e403)"]
Synchronizing Cookbooks:
  - chef-server-ingredient
  - chef-server-cluster
  - apt
  - chef-vault

The rest of the output should be familiar to Chef users, but let’s talk about some of the things Chef Provisioning is doing. First, the following resource is in the chef-server-cluster::cluster-provision recipe:

machine 'bootstrap-backend' do
  recipe 'chef-server-cluster::bootstrap'
  ohai_hints 'ec2' => '{}'
  action :converge
  converge true
end

The first system that we build in a Chef Server cluster is a backend node that “bootstraps” the data store that will be used by the other nodes. This includes the postgresql database, the RabbitMQ queues, etc. Here’s the output of Chef Provisioning creating this machine resource.

Recipe: chef-server-cluster::cluster-provision
  * machine[bootstrap-backend] action converge
    - creating machine bootstrap-backend on fog:AWS:862552916454:us-west-2
    -   key_name: "hc-metal-provisioner"
    -   image_id: "ami-b99ed989"
    -   flavor_id: "m3.medium"
    - machine bootstrap-backend created as i-14dec01b on fog:AWS:862552916454:us-west-2
    - Update tags for bootstrap-backend on fog:AWS:862552916454:us-west-2
    -   Add Name = "bootstrap-backend"
    -   Add BootstrapId = "http://localhost:7799/nodes/bootstrap-backend"
    -   Add BootstrapHost = "champagne.local"
    -   Add BootstrapUser = "jtimberman"
    - create node bootstrap-backend at http://localhost:7799
    -   add normal.tags = nil
    -   add normal.chef_provisioning = {"location"=>{"driver_url"=>"fog:AWS:XXXXXXXXXXXX:us-west-2", "driver_version"=>"0.11", "server_id"=>"i-14dec01b", "creator"=>"user/IAMUSERNAME, "allocated_at"=>1417385355, "key_name"=>"hc-metal-provisioner", "ssh_username"=>"ubuntu"}}
    -   update run_list from [] to ["recipe[chef-server-cluster::bootstrap]"]
    - waiting for bootstrap-backend (i-14dec01b on fog:AWS:XXXXXXXXXXXX:us-west-2) to be ready ...
    - bootstrap-backend is now ready
    - waiting for bootstrap-backend (i-14dec01b on fog:AWS:XXXXXXXXXXXX:us-west-2) to be connectable (transport up and running) ...
    - bootstrap-backend is now connectable
    - generate private key (2048 bits)
    - create directory /etc/chef on bootstrap-backend
    - write file /etc/chef/client.pem on bootstrap-backend
    - create client bootstrap-backend at clients
    -   add public_key = "-----BEGIN PUBLIC KEY-----\n..."
    - create directory /etc/chef/ohai/hints on bootstrap-backend
    - write file /etc/chef/ohai/hints/ec2.json on bootstrap-backend
    - write file /etc/chef/client.rb on bootstrap-backend
    - write file /tmp/chef-install.sh on bootstrap-backend
    - run 'bash -c ' bash /tmp/chef-install.sh'' on bootstrap-backend

From here, Chef Provisioning kicks off a chef-client run on the machine it just created. This install.sh script is the one that uses CHEF’s omnitruck service. It will install the current released version of Chef, which is 11.16.4 at the time of writing. Note that this is not version 12, so that’s another reason we can’t use Policyfiles on the machines. The chef-client run is started on the backend instance using the run list specified in the machine resource.

Starting Chef Client, version 11.16.4
 resolving cookbooks for run list: ["chef-server-cluster::bootstrap"]
 Synchronizing Cookbooks:
   - chef-server-cluster
   - chef-server-ingredient
   - chef-vault
   - apt

In the output, we see this recipe and resource:

Recipe: chef-server-cluster::default
  * chef_server_ingredient[chef-server-core] action reconfigure
    * execute[chef-server-core-reconfigure] action run
      - execute chef-server-ctl reconfigure

An “ingredient” is a Chef Server component, either the core package (above), or one of the Chef Server add-ons like Chef Manage or Chef Reporting. In normal installation instructions for each of the add-ons, their appropriate ctl reconfigure is run, which is all handled by the chef_server_ingredient resource. The reconfigure actually runs Chef Solo, so we’re running chef-solo in a chef-client run started inside a chef-client run.
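Going by the chef-client output above and below, using the resource for an add-on looks roughly like this; the action names are taken from that output, so treat this as a sketch rather than a copy of the chef-server-cluster recipes.

# Install the add-on package, then run its ctl reconfigure (which is a
# chef-solo run under the hood).
chef_server_ingredient 'opscode-manage' do
  action [:install, :reconfigure]
end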

The bootstrap-backend node generates some files that we need on other nodes. To make those available using Chef Provisioning, we use machine_file resources.

%w{ actions-source.json webui_priv.pem }.each do |analytics_file|
  machine_file "/etc/opscode-analytics/#{analytics_file}" do
    local_path "/tmp/stash/#{analytics_file}"
    machine 'bootstrap-backend'
    action :download
  end
end

machine_file '/etc/opscode/webui_pub.pem' do
  local_path '/tmp/stash/webui_pub.pem'
  machine 'bootstrap-backend'
  action :download
end

These are “stashed” on the local node - the provisioner. They’re used by the Chef Manage web UI and the Chef Analytics node. When the recipe runs on the provisioner, we see this output:

  * machine_file[/etc/opscode-analytics/actions-source.json] action download
    - download file /etc/opscode-analytics/actions-source.json on bootstrap-backend to /tmp/stash/actions-source.json
  * machine_file[/etc/opscode-analytics/webui_priv.pem] action download
    - download file /etc/opscode-analytics/webui_priv.pem on bootstrap-backend to /tmp/stash/webui_priv.pem
  * machine_file[/etc/opscode/webui_pub.pem] action download
    - download file /etc/opscode/webui_pub.pem on bootstrap-backend to /tmp/stash/webui_pub.pem

They are uploaded to the frontend and analytics machines with the machine resource’s files attribute. Files are specified as a hash: each key is the target file to write on the machine, and its value is the source file on the provisioning node.

machine 'frontend' do
  recipe 'chef-server-cluster::frontend'
  files(
        '/etc/opscode/webui_priv.pem' => '/tmp/stash/webui_priv.pem',
        '/etc/opscode/webui_pub.pem' => '/tmp/stash/webui_pub.pem'
       )
end

machine 'analytics' do
  recipe 'chef-server-cluster::analytics'
  files(
        '/etc/opscode-analytics/actions-source.json' => '/tmp/stash/actions-source.json',
        '/etc/opscode-analytics/webui_priv.pem' => '/tmp/stash/webui_priv.pem'
       )
end

Note: These files are transferred using SSH, so they’re not passed around in the clear.

The provisioner will converge the frontend next, followed by the analytics node. We’ll skip the bulk of the output since we saw it earlier with the backend.

  * machine[frontend] action converge
  ... SNIP
    - upload file /tmp/stash/webui_priv.pem to /etc/opscode/webui_priv.pem on frontend
    - upload file /tmp/stash/webui_pub.pem to /etc/opscode/webui_pub.pem on frontend

Here is where the files are uploaded to the frontend so the webui will work (the webui is itself an API client, like knife or chef-client).

When the frontend runs chef-client, not only does it install chef-server-core and run chef-server-ctl reconfigure via the ingredient resource, but it also installs the Manage and Reporting add-ons:

* chef_server_ingredient[opscode-manage] action install
  * package[opscode-manage] action install
    - install version 1.6.2-1 of package opscode-manage
* chef_server_ingredient[opscode-reporting] action install
  * package[opscode-reporting] action install
    - install version 1.2.1-1 of package opscode-reporting
Recipe: chef-server-cluster::frontend
  * chef_server_ingredient[opscode-manage] action reconfigure
    * execute[opscode-manage-reconfigure] action run
      - execute opscode-manage-ctl reconfigure
  * chef_server_ingredient[opscode-reporting] action reconfigure
    * execute[opscode-reporting-reconfigure] action run
      - execute opscode-reporting-ctl reconfigure

As with the frontend above, the analytics node is created as an EC2 instance, and we see the files uploaded:

    - upload file /tmp/stash/actions-source.json to /etc/opscode-analytics/actions-source.json on analytics
    - upload file /tmp/stash/webui_priv.pem to /etc/opscode-analytics/webui_priv.pem on analytics

Then, the analytics package is installed as an ingredient, and reconfigured:

* chef_server_ingredient[opscode-analytics] action install
  * package[opscode-analytics] action install
    - install version 1.0.4-1 of package opscode-analytics
* chef_server_ingredient[opscode-analytics] action reconfigure
  * execute[opscode-analytics-reconfigure] action run
    - execute opscode-analytics-ctl reconfigure
...
Chef Client finished, 10/15 resources updated in 1108.3078 seconds

This will be the last thing in the chef-client run on the provisioner, so let’s take a look at what we have.

Results and Verification

We now have three nodes running as EC2 instances for the backend, frontend, and analytics systems in the Chef Server. We can view the node objects on our chef-zero server:

% knife node list
analytics
bootstrap-backend
chef-provisioner
frontend

We can use search:

% knife search node 'ec2:*' -r
3 items found

analytics:
  run_list: recipe[chef-server-cluster::analytics]

bootstrap-backend:
  run_list: recipe[chef-server-cluster::bootstrap]

frontend:
  run_list: recipe[chef-server-cluster::frontend]

% knife search node 'ec2:*' -a ipaddress
3 items found

analytics:
  ipaddress: 172.31.13.203

bootstrap-backend:
  ipaddress: 172.31.1.60

frontend:
  ipaddress: 172.31.1.120

If we navigate to the frontend IP, we can sign up using the Chef Server management console, then download a starter kit and use that to bootstrap new nodes against the freshly built Chef Server.

% unzip chef-starter.zip
Archive:  chef-starter.zip
...
  inflating: chef-repo/.chef/sysadvent-demo.pem
  inflating: chef-repo/.chef/sysadvent-demo-validator.pem
% cd chef-repo
% knife client list
sysadvent-demo-validator
% knife node create sysadvent-node1 -d
Created node[sysadvent-node1]

If we navigate to the analytics IP, we can sign in with the user we just created, and view the events from downloading the starter kit: the validator client key was regenerated, and the node was created.

Next Steps

For those following along at home, this is now a fully functional Chef Server. It does include the premium features (Manage, Reporting, Analytics), but those are free up to 25 nodes. We can also destroy the cluster using the cluster-clean recipe. To apply it, disable Policyfile in .chef/knife.rb and run chef-client with the cleanup recipe in the override run list:

% grep policyfile .chef/knife.rb
# use_policyfile   true
% chef-client -c .chef/knife.rb -o chef-server-cluster::cluster-clean
Recipe: chef-server-cluster::cluster-clean
  * machine[analytics] action destroy
    - destroy machine analytics (i-5cdac453 at fog:AWS:XXXXXXXXXXXX:us-west-2)
    - delete node analytics at http://localhost:7799
    - delete client analytics at clients
  * machine[frontend] action destroy
    - destroy machine frontend (i-68dfc167 at fog:AWS:XXXXXXXXXXXX:us-west-2)
    - delete node frontend at http://localhost:7799
    - delete client frontend at clients
  * machine[bootstrap-backend] action destroy
    - destroy machine bootstrap-backend (i-14dec01b at fog:AWS:XXXXXXXXXXXXX:us-west-2)
    - delete node bootstrap-backend at http://localhost:7799
    - delete client bootstrap-backend at clients
  * directory[/tmp/ssh] action delete
    - delete existing directory /tmp/ssh
  * directory[/tmp/stash] action delete
    - delete existing directory /tmp/stash
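
For reference, the cluster-clean recipe that produces this output is roughly equivalent to the following sketch (inferred from the output; the actual recipe in the cookbook may differ).

# Rough sketch inferred from the output above, not the cookbook's exact code.
%w{ analytics frontend bootstrap-backend }.each do |name|
  machine name do
    action :destroy
  end
end

%w{ /tmp/ssh /tmp/stash }.each do |dir|
  directory dir do
    recursive true
    action :delete
  end
end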

As you can see, Chef Provisioning is powerful and gives us a lot of flexibility for running a Chef Server 12 cluster. Over time, as we rebuild Hosted Chef with it, we’ll add more capability to the cookbook, including HA, scaled-out frontends, and splitting frontend services onto separate nodes.

December 13, 2014

Day 13 - Managing Repositories with Pulp

Written by: Justin Garrison (@rothgar)
Edited by: Corey Quinn (@quinnypig)

Your infrastructure and build systems shouldn’t rely on Red Hat, Ubuntu, or Docker’s repositories being available. Your work doesn’t stop when they have scheduled maintenance, and you have no control over when they make updates available. Have you ever run an update and crossed your fingers that unknown package versions wouldn’t break production? How do you manage content and repos between environments? Are you still running reposync or wget in a cron job? It’s time to take control of your system dependencies and use something that is scalable, flexible, and repeatable.

Pulp

That’s where Pulp comes in. Pulp is a platform for managing repositories of content and pushing that content out to large numbers of consumers. Pulp can sync and manage more than just RPMs. Do you want to create your own Docker repository? How about syncing Puppet modules from the forge, or an easy place to host your installation media? Would you like to sync Debian* repositories or have a local mirror of pip*? Pulp can do all of that and is built to scale to your needs.

*Note that some importers are still a work in progress and not fully functional. Pull requests welcome.

How Pulp Works

The first step is to use an importer to get content into a Pulp repository. Importers are plugins, which makes them extremely flexible when dealing with content; with a little bit of work you can build an importer for any content source. Want local gem, Maven, or CPAN repositories? You can write your own importer and have Pulp repos working for just about anything. Content can be synced from external sources, uploaded from local files, or imported between repos. The importer validates content, applies metadata, and removes old content when it syncs. No more guessing whether your packages synced or your cron job failed.

After you have content in a repo, you use a distributor to publish it. Like importers, distributors are pluggable, and content can be published to multiple locations: a single repo can publish over http(s), to an ISO, via rsync, or with any other available distributor. Publishing and syncing can also be scheduled, either as a one-off or on a recurring basis, so you don’t have to worry about your content getting out of date.

Scaling Pulp

Pulp has different components that can be scaled according to your needs. The main components can be broken up into:

  • httpd - frontend for the API and http(s)-published repos
  • pulp_workers - processes for long-running tasks like repo syncing and publishing
  • pulp_celerybeat - maintains workers and handles task cancellation
  • pulp_resource_manager - assigns tasks to workers
  • mongodb - data store for repo and content metadata
  • Qpid/RabbitMQ - message bus for task assignment
  • pulp-admin - CLI tool for managing content and consumers
  • consumer - optional agent installed on nodes to subscribe to repos/content

Here’s how the components interact for a Pulp server.

Pulp components

Because of this modular layout, you can scale out each component individually. Need more resources for large files hosted over http? You can scale httpd easily with a load balancer and a couple of shared folders. If your syncs and publishes are taking a long time, you can add more pulp_workers to get more concurrent tasks running inside Pulp.

If you have multiple datacenters, you can mirror your Pulp repos to child nodes, each of which can replicate all or part of the parent server.

Node topologies

Getting Started

The best-architected software in the world is pointless if you’re not going to use it, so hopefully you can find a few use cases for Pulp. Let’s say you just want better control over which repositories your servers pull content from. If you’re using Puppet, you can quickly set up a server using the provided Puppet manifests and then mirror EPEL with just a few lines.

class pulp::repo::epel_6 {
  pulp_repo { 'epel-6-x86_64':
    feed       => 'http://download.fedoraproject.org/pub/epel/6/x86_64',
    serve_http => true,
  }
}

Want to set up dedicated repos for dev, test, and prod? Just create a repo for each environment and schedule content syncing between them; you’ll finally have control over what content gets pushed to each environment. And because Pulp is intelligent with its storage, a package that appears in several repos is only ever stored once.
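
For example, a hypothetical set of per-environment repos could reuse the pulp_repo type from the manifest above, with test and prod feeding from the previously published environment instead of the upstream mirror. The repo IDs, the pulp.example.com hostname, and the published-repo URLs below are made up for illustration; adjust them to match where your Pulp server actually publishes.

# Hypothetical per-environment repos: dev pulls from upstream EPEL, while
# test and prod pull from the previous environment's published URL.
pulp_repo { 'epel-6-x86_64-dev':
  feed       => 'http://download.fedoraproject.org/pub/epel/6/x86_64',
  serve_http => true,
}

pulp_repo { 'epel-6-x86_64-test':
  feed       => 'http://pulp.example.com/pulp/repos/epel-6-x86_64-dev',
  serve_http => true,
}

pulp_repo { 'epel-6-x86_64-prod':
  feed       => 'http://pulp.example.com/pulp/repos/epel-6-x86_64-test',
  serve_http => true,
}

Syncs between the environments can then be run or scheduled (for example, with pulp-admin) as part of your promotion process.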

Want to create an internal Docker registry? How about hosting it in Pulp, itself deployed with Docker containers? You can deploy it with a single line of bash. Check out the infrastructure diagram below and learn how to do it in the quickstart documentation.

Container components

Getting content to consumers can be as easy as letting the system’s normal tools pull from the http-published repos, or you can install the consumer agent to get real-time status about what is installed on each node, push content immediately when it becomes available, and roll back managed content if you find a broken package.

Conclusion

Not only can managing your own repositories greatly improve your control and visibility into your systems, but moving data closer to the nodes can speed up your deployments and simplify your infrastructure. If I haven’t convinced you yet that you should manage the content that goes onto your servers, you must either be very trusting or have one doozy of a cron job!

December 12, 2014

Day 12 - Ops and Development Teams: Finding a Harmony

Written by: Nell Shamrell (@nellshamrell)
Edited by: Ben Cotton (@funnelfiasco)

I used to think the term "DevOps" meant infrastructure automation tools like Chef, Puppet, and Ansible. But DevOps is more than that, even more than technology itself. DevOps is a culture, an attitude, a way of doing things where Dev and Ops work together rather than against each other. When you work in IT, it's easy to get lost in technical skills - they're what we recruit for and what certifications test for. However, I've found that in any professional environment - no matter what the technology - soft skills like communication and taking responsibility are just as important as hard skills like systems administration or coding.

I’ve worked in software development for 8 years now. In that time I have worked on teams with the classic Operations/Development divide - features and bug fixes were “thrown over the wall” to Operations to deploy. I’ve also worked at a company where developers were responsible for their own hand-crafted QA, Staging, and Production systems (often built by developers without sysadmin expertise) with little help from Operations. Working in both of these extremes, I found our projects were repeatedly delayed not by technical problems, but by failures to know responsibilities, to actually take responsibility, and to communicate.

Not knowing who is responsible for what kills IT projects. Are the devs just supposed to focus on the application code? Is the Ops team the one who should be woken up when a deploy goes wrong? When we don’t know the answers - or don’t agree on the answers - our project, our company, and ultimately we ourselves suffer for it. Responsibilities will vary from team to team and project to project, but I’d like to at least give you a place to start. Here are 10 general guidelines for what the Ops team, the Dev team, and both teams should be responsible for.

Ops Responsibilities

  1. Provide production-like environments for developers to test their code on. Why is this the Ops team’s responsibility, rather than the devs’? Because Ops knows the production system better than anyone else and is tasked with protecting its stability. In order to protect it, you must replicate it so developers have a safe place to test their features and bug fixes before they go live. Then everyone will sleep better at night.
  2. Provide multiple production-like environments. There is little more frustrating to a developer than having a feature ready to test on a production-like environment but being blocked because someone else is already using it. One QA and one Staging server are not enough. Prevent these environments from becoming a bottleneck that delays features and bug fixes and makes everyone unhappy. If resources are limited, at the very least provide an easily replicated virtual machine (kept in sync with production!) that developers can run locally.
  3. Automate and document procedures for building production-like environments. This will prevent developers from needing to ask you to build a system for them whenever they need to test something. Empower developers to do this themselves by providing automated infrastructure and documentation they can use. Developers will be able to get bug fixes and features out to customers faster without bothering you in the process, and everybody wins.

Dev Responsibilities

  1. Test code in a production-like environment before ever declaring a feature or bug fix to be done. “Works on my machine!” is never the definition of done. Gene Kim altered the agile manifesto definition of done to illustrate this: "At the end of each sprint, we must have working and shippable code...demonstrated in an environment that resembles production."
  2. Take full responsibility for deployed code. If a deploy of developer code starts causing havoc in a system and preventing people from getting work done, it is the developer’s fault. Not Ops’, not QA’s - the ultimate responsibility for what code does in production belongs to the developer who wrote it. If something goes wrong in the middle of the night, the developer should be woken up first and should then take responsibility for waking up further team members if needed.
  3. Read documentation first and try the procedures it suggests before asking the Ops team for help. Respect the Ops team’s time by using any resources they’ve provided first, then ask for help if the problem remains.

Both Teams

  1. Respect how the other team does things - although we have similar goals, dev and ops often have different ways of getting work done. If an Ops person needs to request something of the dev team - or to contribute some code - they should check for any contributing documents and follow them. The same goes for a Dev person who needs something in the Ops team’s domain. Avoid “just this once” exceptions - those often multiply and turn into cruft that will bring down a system.
  2. Write down procedures and who is responsible for what. As stated earlier, not knowing (or not caring) who is responsible for which aspects of an application will kill that application. Knowing how things get done, or are intended to be done, is too vital to remain tribal knowledge. Write it down, follow it, and refer to it whenever needed. If someone repeatedly refuses to consult the documentation, default to sending them a link to the documentation (or, even better, the specific section) where they will find their answer. This may seem like an aggressive stance, but time is too scarce in the IT world to solve the same problem over and over because someone refuses to look at available documentation.
  3. Never use the phrase “Why don’t you just-”. This comes across as extremely condescending. Teams must respect their teammates’ intelligence and realize that if it were “just” that simple, they probably would have already done it. “Have you considered...” is a good way of rephrasing.
  4. When you don’t know, help find the answer. When someone has a question and you don’t know the answer, it’s OK to say "I don't know." But your responsibility doesn’t end there. In a professional IT environment, it is your duty to point the questioner to where they might find their answer - whether that’s a person, a web resource, or just "Here, let’s try googling that together and see what we find." Help the questioner move forward rather than stopping them in their tracks.

I am relieved to now work at a company that embraces these principles of taking responsibility and communicating (I learned many of them from working there!). Projects are completed faster, both teams are happier, far fewer people are woken in the middle of the night, and the business benefits tremendously. The key to making this work has been clearly establishing who is responsible for what and, when any confusion or blockers come up, communicating immediately. We may have different areas of expertise, but everyone is equally accountable for communicating and taking responsibility for a project’s success. And it does work.

As IT professionals we shape how the world works now. It’s not just how people spend money - our work is now vital to how humans travel, how they communicate, how they access utilities like lighting and water, how laws are passed and implemented, and (as IT becomes more integrated into health care) how they physically survive and thrive. The stakes are much too high to let a lack of communication or failure to establish and take responsibility kill an IT project. The weight of the world rests on our shoulders now - we have that great power. It’s time to not only meet but embrace that responsibility!