December 14, 2010

Day 14 - FreeBSD Jails

This article was written by Wesley Shields

The first day of SysAdvent talked about Linux Containers (LXC), and how they are an "operating system level virtualization", as opposed to "platform virtualization" choices like Xen or VMware. Today, I'll focus on jails in FreeBSD and how they achieve a similar goal.

Background

If you think of a traditional OS it looks something like this:

Among other things, the kernel controls access to hardware, makes sure processes are not stomping all over each other's memory, does the necessary access control checks for actions, and also ensures that packets land at the appropriate sockets. However, even if everything is perfect, a sufficiently privileged process can cause a lot of havoc if it misbehaves.

Let's say you have a process running as root that gets compromised and is now running arbitrary code of the attacker's choosing. Typically, this comes in the form of a shell listening on a socket, or a connect-back shell, both of which are very bad. Through this arbitrary execution of code, the attacker can do whatever the access controls in place allow. In most cases (where things like mandatory access controls are not in place) this is effectively game over for the system administrator.

Before I go any further, I should probably explain how jails are different from what most people are familiar with. To be consistent with the earlier article I'll call it "platform virtualization." That solution looks something like this:

There are different approaches but it is essentially the same goal. Insert a small "virtual machine monitor" (VMM) layer - often called a hypervisor - that brokers access to the real hardware and emulates whatever hardware the systems administrator wants to the OS running on top of it. Modern chips have support for helping do this in hardware (AMD calls it "AMD-V" and Intel calls it "VT-x"). Just about every chip shipping now has this built in.

"Platform virtualization" has many benefits to it. You can choose what hardware to provide to the guest operating system. The VMM is hopefully small enough that it can be properly secured and verified. Finally, you can decide which operating system to run as the guest. The fact that it is virtualized should be transparent - with the exception of needing driver support for whatever hardware is emulated, which every major OS has support for.

There are some drawbacks to this approach. It can be very resource intensive as the more virtual machines you spin up the more hardware and state has to be kept in memory. With modern hardware this is becoming less of a problem, but, for some environments, it may still hold true.

Enter Jails

Jails are best thought of as a means to contain and isolate processes from each other, even if those processes are privileged.

In this case, we are running multiple processes, but the kernel has been modified to limit the resources that each process can affect or view. This is the concept upon which jails are built: The name of the game is process isolation and containment, not virtualization.

The Details

I'm going to skip over the details of how jails are created and what that means from a data structure standpoint and skip straight to how to set up a jail and how to use it. My examples will be from a fairly recent development snapshot ("current," if you are familiar with FreeBSD terminology) that is not yet a finished release, so some of the things I will describe are not completely accurate to all versions of FreeBSD but the concepts are the important part.

Jails have been around in FreeBSD for a long time now. They were first introduced in FreeBSD 4.0 (10 years ago). Since that release, jails have been refined and extended to support many of the things people want from them. Recent releases of FreeBSD include support for IPv6, hierarchical jails, resource utilization limits, and even virtual network stacks (which are out of scope for this article).

Setup

For the purposes of this article, I'm going to use the term host to indicate the FreeBSD base operating system upon which the jails will run.

A jail only requires a handful of things in order to operate. The most important is a working userland. Usually, people run the same version of the userland inside a jail as the one that is running on the host, but it doesn't have to be this way. If you want to run an older userland in a jail, it will likely work because backwards compatibility in the kernel is usually preserved.

Actually getting a working userland is outside of the scope of this article. There are many ways to pick from: building your own, using your existing install, or installing the binaries straight from release media. The means of getting the binaries on disk is up to you. You also don't need a full world (FreeBSD's term for the base OS); if you know exactly what you are doing, you can populate the jail with just what you need. There are also other tricks you can do involving null mounting in other paths. For the purposes of this article I've installed an exact copy of my host to /jails/test (minus any package installations).

Starting Jails

With a world installed to /jails/test, I don't need anything else installed in order to start a jail. Everything you need is provided by the base FreeBSD install. Starting a jail manually is done using the jail(8) command.

wxs@ack wxs % sudo jail /jails/test test 192.168.1.100 /bin/sh 
# id
uid=0(root) gid=0(wheel) groups=0(wheel),5(operator)
# 

The arguments to the jail command are pretty straightforward. It takes a path where the root of the jail should live, a hostname, an IP address, and a command to run inside the jail. Once I'm inside the jail you can see that I am automatically the root user.

From outside of the jail, on the host, you can use the jls(8) command to list existing jails.

wxs@ack wxs % jls
   JID  IP Address      Hostname                      Path
     4  192.168.1.100   test                          /jails/test
wxs@ack wxs % 

The one catch is that while the jail says it has an IP address, the host OS knows nothing about that IP address. In order to have your jail respond to an IP address you must add it to a network interface by adding an alias:

wxs@ack wxs % sudo jail -r 4 # Kill existing jail, so it can get the new IP
wxs@ack wxs % sudo ifconfig bge0 alias 192.168.1.100 netmask 255.255.255.255
wxs@ack wxs % sudo jail /jails/test/ test 192.168.1.100 /bin/sh
#

And now inside our jail we can see that we have an IP address.

# ifconfig bge0 | grep inet
        inet 192.168.1.100 netmask 0xffffffff broadcast 192.168.1.100
# 

So, at this point, we just need a working devfs inside our jail and we should have a normal, contained, system.

wxs@ack wxs % sudo mount -t devfs devfs /jails/test/dev
wxs@ack wxs % mount | grep jails
data/jails on /jails (zfs, local, noatime)
devfs on /jails/7/dev (devfs, local, multilabel)
devfs on /jails/test/dev (devfs, local, multilabel)
wxs@ack wxs % 

You might want your jail to be accessible from the network, so let's run an SSH daemon! I'll permit root logins for this jail, just to make my life easier, but you can add users inside jails as normal. Also, be sure to set your root password before you expose any network services. From the root shell you have after running the jail(8) command, do this:

# sed -i '.bak' -e 's/^#PermitRootLogin no/PermitRootLogin yes/' /etc/ssh/sshd_config
# /etc/rc.d/sshd onestart
[... Lots of output about host key generation ...]
# sockstat -4l | grep 22
root     sshd       61052 3  tcp4   192.168.1.100:22      *:*
# 

And outside the jail:

wxs@ack wxs % ssh root@192.168.1.100
The authenticity of host '192.168.1.100 (192.168.1.100)' can't be
established.
RSA key fingerprint is f9:87:e1:41:3c:27:56:fd:5a:0e:c9:0b:c5:9a:d5:15.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added '192.168.1.100' (RSA) to the list of known hosts.
Password:
Last login: Mon Dec  6 01:43:39 2010 from 192.168.1.100
FreeBSD ?.?.?  (UNKNOWN)

Welcome to FreeBSD!

[... MOTD ...]
test#

A keen eye would spot the weirdness '?.?.?' in the MOTD above. Normally that is cleared up when the computer boots, but since we didn't really "boot" this jail, that step never happened. Let's explore what it takes to get a jail to start automatically upon boot.

Booting Automatically

One thing you must do when starting a jail is make sure the host services are set to listen on only IP addresses that belong to the host, and not the jail. Failure to do this will cause your host services to listen on IP addresses that should be for the jail. This can have unintended consequences such as exposing host services to places they shouldn't be. For now I'm assuming you know how to do that.
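
As a rough sketch of what that can look like, the fragment below restricts a few common host daemons in the host's /etc/rc.conf, following the pattern described in jail(8). The host address 192.168.1.10 is an assumption made up for this example, and the set of daemons you need to restrict depends on what your host actually runs.

```shell
# Host /etc/rc.conf fragment (sketch; 192.168.1.10 is a hypothetical host address)
inetd_flags="-wW -a 192.168.1.10"            # inetd binds only to the host address
rpcbind_enable="NO"                          # rpcbind listens on every address; leave it off unless needed
syslogd_flags="-ss"                          # syslogd opens no network sockets at all
sshd_flags="-o ListenAddress=192.168.1.10"   # sshd listens only on the host address
```

After restarting the affected services, running sockstat -4l on the host should show nothing bound to the wildcard address or to the jail's address.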

Like most things in FreeBSD, jails are controlled with settings in /etc/rc.conf. There are actually a whole bunch of settings available, but here's the simple set I'm using:

wxs@ack head % fgrep test /etc/rc.conf  
jail_list="test"
jail_test_rootdir="/jails/test"
jail_test_hostname="test"
jail_test_interface="bge0"
jail_test_ip="192.168.1.100"
jail_test_devfs_enable="YES"
wxs@ack head % 

Using this configuration, the 'test' jail will boot and start automatically.

Booting Manually

You can use the /etc/rc.d/jail script to boot a jail manually provided that the appropriate settings are set in /etc/rc.conf. Another option is to set the IP address alias manually and mount devfs manually then call /etc/rc yourself:

wxs@ack wxs % sudo jail -r 2 # Kill existing jail...
wxs@ack wxs % sudo jail /jails/test test 192.168.1.100 /bin/sh /etc/rc
/etc/rc: WARNING: $hostname is not set -- see rc.conf(5).
Creating and/or trimming log files.
Starting syslogd.
ELF ldconfig path: /lib /usr/lib /usr/lib/compat
32-bit compatibility ldconfig path: /usr/lib32
Clearing /tmp (X related).
Updating motd:.
Starting cron.

Sun Dec 12 16:16:37 UTC 2010
wxs@ack wxs % jls
   JID  IP Address      Hostname                      Path
     3  192.168.1.100   test                          /jails/test
wxs@ack wxs % 
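
The manual steps can be collected into one small script. The sketch below reuses the interface, address, and paths from the examples in this article; the script itself, the run helper, and the DRY_RUN variable are illustrative and not part of FreeBSD. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: the manual boot steps from this article in one place.
# Values come from the examples above; DRY_RUN and run() are illustrative.
JAIL_ROOT="/jails/test"
JAIL_HOSTNAME="test"
JAIL_IP="192.168.1.100"
JAIL_IF="bge0"
DRY_RUN="${DRY_RUN:-1}"   # default: print commands instead of running them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"         # show what would be executed
    else
        "$@"              # execute it (needs root on a FreeBSD host)
    fi
}

boot_jail() {
    # 1. Make the host answer for the jail's IP address.
    run ifconfig "$JAIL_IF" alias "$JAIL_IP" netmask 255.255.255.255
    # 2. Give the jail a working /dev.
    run mount -t devfs devfs "$JAIL_ROOT/dev"
    # 3. Start the jail and run its startup scripts.
    run jail "$JAIL_ROOT" "$JAIL_HOSTNAME" "$JAIL_IP" /bin/sh /etc/rc
}

boot_jail
```

Setting DRY_RUN=0 and running the script as root would execute the same three commands shown in the transcripts above.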

Restrictions

So if jails are all about isolation and containment, what can and what can't you do inside a jail? The general rule of thumb is that if it affects the host or other jails it is restricted by default. There are knobs you can turn to allow these things, but in the interests of not breaking the security model, they are turned off by default. Exactly what is restricted and what knobs are available is highly dependent upon the version of FreeBSD you are running. As more and more things are being designed to work better with jails, the set of restricted operations is shrinking. For example, in earlier releases root inside a jail was not allowed to change any network stack configuration information. With the addition of virtualized network stacks in newer releases of FreeBSD this restriction is gone, provided the jail is using a virtualized stack.

For more information on this it is best to read the documentation available.

Trade-offs

Jails are a great way of getting "operating system level virtualization" on FreeBSD, but like anything else, they come with a series of trade-offs which must be considered prior to implementation.

A kernel level compromise does break the isolation provided by jails. In a "platform virtualization" solution it would require a bug in the hypervisor for that to happen. Jails are not necessarily any more or less secure than a "platform virtualization" solution as it is going to come down to implementation details. Bugs do happen in both worlds.

Another trade-off is that a jail cannot emulate arbitrary hardware like a VMM can. If you want to add a new virtual disk to your VM in a "platform virtualization" solution it is a simple operation - the physical disk doesn't really exist as it is just a file on the filesystem of the host. A jail has no equivalent, because there is no emulated hardware layer underneath it.

As a jail is really just isolating processes from each other, it is important to realize that root on the host has complete control over every jail. This is an important thing to keep in mind when setting up a jail environment. Root on the host should be trusted and controlled far more than root in any of the jails.

Lastly, a user on the host can get access to things inside one of the jails if the UIDs are the same. For example, on one of my hosts my UID is 1001, and inside one of the jails a different user (user1) has UID 1001. From the viewpoint of user1, only he has access to his files inside the jail. From my viewpoint, outside of the jail, the files are owned by me. The host is going to use its copy of /etc/passwd to determine ownership of files, which means there can be overlapping information. This is another important consideration to keep in mind when setting up a jail environment.
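
To make the overlap concrete, here is a small sketch that resolves the same numeric UID against two different passwd files. The two files are made-up stand-ins for the host's and the jail's /etc/passwd, and name_for_uid is a helper invented for this example.

```shell
#!/bin/sh
# Sketch: one numeric UID, two different names, depending on which passwd
# file is consulted. Both files below are invented stand-ins.
tmp=$(mktemp -d)
printf 'wxs:*:1001:1001:Host user:/home/wxs:/bin/sh\n' > "$tmp/host_passwd"
printf 'user1:*:1001:1001:Jail user:/home/user1:/bin/sh\n' > "$tmp/jail_passwd"

# name_for_uid FILE UID: print the login name FILE maps UID to.
name_for_uid() {
    awk -F: -v uid="$2" '$3 == uid { print $1 }' "$1"
}

host_owner=$(name_for_uid "$tmp/host_passwd" 1001)
jail_owner=$(name_for_uid "$tmp/jail_passwd" 1001)
echo "host view: uid 1001 is $host_owner"
echo "jail view: uid 1001 is $jail_owner"
rm -rf "$tmp"
```

On a real system, ls -ln shows the numeric owner of a file, which is the only thing actually stored on disk; the name is looked up afterwards.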

Uses

There are many uses for jails. Lots of places use them to isolate web hosting environments from each other, to provide root inside a contained system for a customer, and to isolate developers from each other. I personally use them to isolate test environments from each other. As a developer who spends most of his time up in userland, this is a great solution to my need to be able to quickly set up a clean test environment. As you spend more time with jails, you begin to see different opportunities for application.

As more parts of FreeBSD become friendlier to jails you can start to build very interesting things. Virtual network stacks, ZFS, multi-IP jails, hierarchical jails and many other things are fertile areas for exploration as a systems administrator. As is often the case, the best way to get familiar with jails is to dive in head first!

December 13, 2010

Day 13 - Don't be a Human Keyboard

Written by Jordan Sissel (@jordansissel)

There are few things that quite irritate me more than when I accidentally take part in building a habit (or a culture) that treats people like human keyboards.

Humans are not meant to be buttons for pushing or functions for calling.

What does it mean to not be a human keyboard? Imagine for the moment that instead of typing on a keyboard or using a mouse, you had to speak to a person to perform any task.

"Go to my email."

"Show me that first message."

"Scroll down a bit."

"Scroll down a bit more."

"I want to reply to this email."

Imagine the terrible impact on productivity and agility of using such a computer interface! Further, your "check email" flow now requires two people (you + the human keyboard) - that means double the cost, double the typos/communication problems, etc! Yet, despite the silliness of this example, very similar situations occur quite often in business when employees make requests of each other.

In 2008, SysAdvent day 22 reported a subset of this problem - where people may tell you to implement a solution rather than telling you what problem they need solved. When Frank (fictional person) says "install Postgres on that server," he is treating you like a human keyboard. He isn't presenting you with a problem that needs solving; he is telling you to implement a chosen solution (that may or may not work). You have a function being called and a result is expected. What result? You don't know the problem and lack context!

Let's set aside, for now, the cases where it may be correct to accept a request like "install this software" at face value.

The social problem with being treated like a keyboard is interesting. In the great acronym, PEBKAC, who is the user? I observe that you are the keyboard, here, and Frank (above) is the user. What happens if "install Postgres on that server" is obeyed correctly, but doesn't solve Frank's problem? Even though, in this instance, "Frank" is between the chair and keyboard (you), you may be blamed for his errors. That sucks, and you waste energy in a blame game rather than getting work done.

Another example of this is if a coworker asks you, rather than asking that well-publicized dashboard your team maintains, "is mysql down?" If you answer the question by checking the dashboard on their behalf and replying with the answer, you are only training the requester that you are the answer robot for mysql status. Instead, you should say "Check out the status here on the dashboard" and provide a URL. If you don't have a dashboard, maybe you have a FAQ page, or you dig and ask "What's the problem?"

Humans are creatures of habit. If you reinforce a behavior, it will persist, and even spread to other coworkers or users. You don't want to become the company-wide interface to answering the "is mysql down" question.

You don't want to become a human keyboard.

You have a few options for avoiding turning yourself into a human keyboard, and solutions will vary by situation and audience. First, you can simply document common questions and answers. Second, you could (assuming skill/time/energy) automate answers that can't easily be documented - like creating a dashboard to answer service health status questions. Third, if you can't automate it, put requests like these through your ticketing system. Fourth, you can try to train the user to answer the question without invoking you.

Regarding the ticket system, if common requests require high effort, you can use features of your ticket system to track the number of such requests and the time spent on them. This will help inform the business about the energy output of your group versus the energy requirements and can help steer hiring and other goals.

There are usually some red flags that tell me (even if only subconsciously) that a request is a "human keyboard" one: annoying requests, simple requests from technical coworkers, strange solution-based requests when you don't have context, etc. Personally, my signal is usually that a request is annoying.

Do you have users or coworkers who might treat you like a human keyboard? Go write that FAQ, make that dashboard, or add that ticket flow. You'll end up with less stress and happier coworkers and users. The business will end up with higher quality support, faster turn-around times for simple requests, and better ideas about the kinds of things asked of you and your team.

Further reading:

  • Stashboard - Open Source status/dashboard
  • MediaWiki - a wiki for documenting things (like that FAQ you could write)

December 12, 2010

Day 12 - Scaling Operability with Truth

Written by Jordan Sissel (@jordansissel)

This article is about separating your configuration model from your configuration inputs. You might have the same service (like frontend, loadbalancer, etc) in multiple environments, but versions, accounts, members, etc, may change between them. The 'scale' part, here, is that having inputs separate from models helps to scale operability and sanity. I use Puppet features to show how this is done, but this idea is applicable to any config management tool.

Puppet and other configuration tools help you describe your infrastructure. Puppet's model is resources: files, users, packages, etc. Each resource has attributes like package version, file path, file contents, service restart command, package provider (gem, apt, yum), user uid, etc.

Many of my puppet manifests used to look a bit like this:

class nginx::server {
  package {
    "nginx": ensure => "0.8.52";
  }
  ...
}

To upgrade versions, I would just edit the manifest and change the version, then push out the change.

Static manifests don't scale very well. Puppet has support for environments so you can use different manifests (say, a different nginx::server class) in different situations. You could also do it all in one like this:

class nginx::server {
  package {
    "nginx":
      ensure => $environment ? {
        "production" => "0.8.52",
        "testing" => "0.8.53",
        "dev" => "latest",
      };
  }
}

The variable, $environment, comes from what puppet calls a 'fact.' Facts such as cpu speed, ip addresses, chassis make/model, virtualization source, etc, are provided to help you conditionally tune your infrastructure. Facts come from a puppet tool called Facter, which means facts are essentially an external input for puppet.

This model of external inputs is a good one. Using facts, you could conditionally install things like Dell OpenManage, or template service configs based on ram or cpu counts. Combining the inputs (facts) with your model (puppet resources) helps you steer puppet into doing the correct thing for each machine it is run on. Very useful!

Back to the nginx manifest above. The problem with the above puppet example is that it doesn't scale very well. Each time you add a new environment (a new testing environment, some one-off dev cluster, etc), you have to edit the manifest (your infrastructure model). For each 'environment' condition, you have to update it with the new environment-specific value, which could mean editing just about every file in your puppet manifests - and that's sucky and error-prone work.

The solution here is to follow the facts example and allow yourself a way to specify your own inputs - what I would call "truth." Truth is very similar to facts. It is just more human-driven, and I only use a different, but similar, term for the sake of identifying source. I use "truth" to refer to human-originated inputs (such as "what package version to install"), and I use "facts" to refer to machine-originated information (such as "number of cpus on this system").

For the above puppet manifest, you could write truth in a separate manifest that defines $nginx_version based on $environment:

# truth.pp
case $environment {
  "production": { $nginx_version = "0.8.52" }
  "testing": { $nginx_version = "0.8.53" }
  "dev": { $nginx_version =  "latest" }
}

# modules/nginx/manifests/server.pp
class nginx::server {
  package {
    "nginx": ensure => $nginx_version;
  }
}

This is better, but ultimately, setting variables with conditionals (case statements, etc) is more code than you need. Puppet 2.6.1 includes a solution to this problem: extlookup(). (It existed before then, but it now officially ships with puppet.)

This new function lets you separate your configuration inputs from your infrastructure model. An example of the above manifest using extlookup:

class nginx::server {
  package {
    "nginx": ensure => extlookup("package/nginx");
  }
}

You configure extlookup in puppet by setting some variables. Here's an example:

$extlookup_datadir = "/etc/puppet/manifests/extdata"
$extlookup_precedence = ["nodes/%{fqdn}", "environments/%{environment}", "common"]

The values in %{ ... } are evaluated with each node's facts. This lets you put all production environment data in /etc/puppet/manifests/extdata/environments/production.csv.

With the above configuration, I can easily add per-node and per-environment package versions for nginx. If all else fails, I can provide a default package version in the 'common' setting above. You can also easily extend the configuration to include more sources without modifying your infrastructure model.

The data format extlookup currently uses is csv files. It's basically a bunch of key-value pairs you can query with a precedence order. For example, the above config would look in nodes/somehostname.csv first, which would allow you to override environment-specific configuration values per host. If the nodes/somehostname.csv file does not exist, or if the requested value is not present in that file, extlookup will fall through to the next file in the list. You can also specify default values in extlookup per call, but that is outside the scope of this article.

Building on the examples above, here is what your extlookup files would look like:

# /etc/puppet/manifests/extdata/environments/production.csv
package/nginx,0.8.52

# /etc/puppet/manifests/extdata/environments/testing.csv
package/nginx,0.8.53

# /etc/puppet/manifests/extdata/environments/dev.csv
package/nginx,latest
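
The precedence behaviour described above can be sketched in a few lines of shell. This only mimics the idea of "first file that defines the key wins"; it is not how Puppet implements extlookup. The temporary directory, the host name web1.example.com, the "staging" environment, and the contents of common.csv are all made up for the example.

```shell
#!/bin/sh
# Sketch of extlookup-style precedence: return the value for a key from the
# first CSV file, in precedence order, that defines it.
datadir=$(mktemp -d)
mkdir -p "$datadir/nodes" "$datadir/environments"
printf 'package/nginx,0.8.52\n' > "$datadir/environments/production.csv"
printf 'package/nginx,0.8.53\n' > "$datadir/environments/testing.csv"
printf 'package/nginx,latest\n' > "$datadir/common.csv"   # hypothetical default

# lookup KEY FQDN ENVIRONMENT
lookup() {
    for src in "nodes/$2" "environments/$3" "common"; do
        file="$datadir/$src.csv"
        [ -f "$file" ] || continue      # missing file: fall through to the next
        value=$(awk -F, -v key="$1" '$1 == key { print $2; exit }' "$file")
        if [ -n "$value" ]; then
            echo "$value"
            return 0
        fi
    done
    return 1                            # key not defined anywhere
}

lookup package/nginx web1.example.com production   # prints 0.8.52
lookup package/nginx web1.example.com staging      # no staging.csv: prints latest
```

Dropping a nodes/web1.example.com.csv file into the data directory would override both results for that one host, without touching the lookup code or any manifest.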

This scales in meaningful ways. Common cases:

  • You want to clone production:

    % cp production.csv productionclone.csv
    # Or, accept 'testing' as good, and ship it to production:
    % cp testing.csv production.csv
    
  • A security vulnerability makes you want to know what nginx versions you are installing:

    % grep 'package/nginx' */*.csv
    
  • Tune settings per machine, per environment, etc.

  • All of your truth inputs are concentrated in one place. This has an additional benefit of allowing non-puppet users (in this example) to upgrade or tune values without needing puppet knowledge.
  • If you author puppet modules, you can release modules with extlookup support and document the values used. This allows users of your module to be happily ignorant of the internals of your module.

Point is, once you separate your model from your inputs (facts, truth, etc), you become more flexible and agile. I use extlookup for:

  • package versions. Of note, for developer systems, I set some package values to "absent" so developer activity won't be overwritten by puppet trying to deploy our internal apps.
  • application flags and tuning. For production, we want to do backups to Amazon S3. For random development systems, we don't. In general, I use this for helping generate production application configs.
  • syslog configuration (what remote syslog host to ship logs to, per deployment)
  • database information (database name, credentials, etc)
  • some dns automation (what certain internal CNAME records point to)
  • nagios contact configuration (example: notify pager in prod 24/7, only during daytime for staging, but never for development)

In closing, separating your model from your inputs increases the testability and maintainability of your infrastructure automation. Further, combining human input sources (truth) with machine inputs (facts) can help you tune your infrastructure for any situation. Having a separate truth input allows you to abstract tunables in your infrastructure and opens up maintenance possibilities to non-experts (like allowing developers or QA folks to manage their own clusters), which can help improve non-sysadmin throughput by allowing others to manage tunables you define. That helps you stay sane and uninterrupted, and it helps the business.

December 11, 2010

Day 11 - A Journey to NoSQL

Written by Michael Stahnke (@stahnma)

The N00b

When I was first learning about being a Unix Admin, I just wanted to know what systems my team supported, so that when I got called at 2 AM, I could either make some weak attempt at getting online and fixing a problem (I was new...very new), or promptly help that application analyst find the correct support team pager number. It was the week before I first went into our pager rotation that I realized something was very wrong. I had no idea what systems we actually supported. I wasn't the only one.

There had recently been some form of reorganization right before I hired in at this company. What was once four teams (IBM AIX, HP-UX, Sun Solaris and Red Hat Linux) was becoming three teams (Capacity Planning, Systems Implementation and Systems Operations). However, there were still other server teams at other sites, plus Unix workstation support, and some IRIX somewhere out there. The fundamental problem, though, was, "Do I have the ability to help the person who has paged my team?"

A solution...sort of

I found this state to be extremely non-desired, so I started writing a Unix server tracking system. It started out as a basic web application utilizing a MySQL back-end. It worked great. The teams loved it. They knew what we supported and what we didn't. Then, the requests for enhancement came in. I needed to add MAC addresses, world wide port names, cluster licensing terms, customer information, out-of-band management URLs, etc. This quickly grew, but I was still happy with it. We designed several workflow automations through the tool as well. However, as the tool grew larger, and less maintainable, I was starting to get extremely frustrated with it.

While problems for this application were abundant, there were two issues that made it less of an operational platform than I desired. The first problem was that in order to do any type of CRUD actions, you had to have database drivers on the client. This was a big challenge. We had an extremely heterogeneous environment, multiple firewalls, and some ancient operating systems that probably couldn't have had a MySQL driver loaded on them without sacrificing some type of domesticated animal and praying to a deity that was anything but righteous.

The other problem was flexibility of schema. Each time we added a new piece of data to track, it had to be analyzed, and then added into the schema. Normalization was great for one:many and many:many relationships, but then made the SQL queries much more complex with joins or sub-queries, especially for Unix admins with little or no SQL background. In short, the relational portion of the RDBMS was in the way.

Another solution...getting warmer

I left that shop before that problem was really solved, but since I had an opportunity again at my next assignment to solve a similar problem, I decided I would try some things in a different way. My first thoughts were around putting some form of web-services infrastructure in front of a basic RDBMS backed web application. I thought that speaking HTTP would be easier than MySQL, Oracle or even DBI for most clients. I toyed with it and did some mock-ups, but I still felt like the data model was complicated and required many calls and client-side parsing to really get the data into usable formats for automation, updates, or to generate Nagios configuration, etc. It was time for something completely different.

NoSQL. It was obvious. Of course, at this time (2006) I had never heard of the term NoSQL, but looking back on it, that was the epiphany I had. If relationships are difficult to model and manage, maybe some other model would work. Then it hit me: LDAP. The LDAP container is designed for easy replication, extremely granular security controls, and availability. On top of that, those features were all there out of the box. Schemas could be programmatically deployed, and many of the data model questions were things like 'should this be single-valued or multi-valued'. Those questions were quite simple when compared to joining 17 tables to see a complete system configuration in the old RDBMS I had authored. As an added bonus, using LDAP didn't introduce a new source of truth for the environment since it was in use for account management.

LDAP also had a good solution for the driver problem. We were using LDAP for user authentication, so our systems already had LDAP client libraries loaded. Even for the few that didn't, the client-side libraries were readily available, even on my less-than-favorite flavors of Unix.

We modified schema, populated data by hand, and then with some simple scripts. Life was good...at least for a while. After a couple years operating in this mode, the schema became a bit more problematic. Extending schema at will was not the greatest idea I've ever had. We also had a problem where some admins would make new objectClasses rather than extend one, or inherit from one. This led to conflicts in schema and some data integrity issues. None of it was absolutely horrible, but in the end it smelled like a chilli dog left in a desk drawer overnight.

The search continues

I had a lot of discussion about this problem with a group of my friends (and eventual business partners). We spent hours going back and forth on how to model host information and metadata and expose that information to our configuration management, monitoring, accounting, chargeback, and provisioning systems. It always came back to a discussion on discrete math: use Set Theory. The best, and possibly only sane way, to keep this data organized was to use set theory.

Luckily, we had a greenfield to play with as we were forming a new company. We tried it out. We tried not to extend or customize schema for host information beyond loading in well-known IANA-referenced schemas. The basic premise, obviously, is that everything can be grouped into sets. We created an OU=Sets at the top level of our LDAP directory. Under OU=Sets, we created an entry named after each set. For example, dn: cn=centos5,ou=Sets,dc=websages,dc=com is an entry in our directory. It is set up as a groupOfUniqueNames and contains the DN of each host that is in fact a CentOS 5 host. The nice thing about OU=Sets is you can just keep adding things into it, without extending schema.

It may seem a bit backward at first to have the attribute as the set name and then the host dn as the entry, but it seems to work. LDAP also allows groups within groups, so nesting works perfectly. As an example, if cn=ldap_servers,ou=sets exists, it may contain cn=ldap_write_servers,ou=sets and cn=ldap_replicas,ou=sets. Grouping in this manner allows one change to cascade through the directory.
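With hosts grouped this way, ordinary set operations answer most questions about the environment. Here is a minimal sketch in Ruby that uses plain Set objects in place of directory lookups. The SETS hash, the host names, and the helper names in_all and in_any are made up for illustration and are not part of our directory or of any library.

```ruby
require 'set'

# Hypothetical stand-in for ou=Sets: set name => DNs of the member hosts.
SETS = {
  'centos5'      => Set['cn=web01', 'cn=ldap01', 'cn=ldap02'],
  'ldap_servers' => Set['cn=ldap01', 'cn=ldap02', 'cn=ldap03'],
  'production'   => Set['cn=web01', 'cn=ldap01']
}

# Hosts that belong to every one of the named sets (intersection).
def in_all(*names)
  names.map { |n| SETS.fetch(n) }.reduce(:&)
end

# Hosts that belong to at least one of the named sets (union).
def in_any(*names)
  names.map { |n| SETS.fetch(n) }.reduce(:|)
end

# CentOS 5 hosts that are also LDAP servers
puts in_all('centos5', 'ldap_servers').to_a.sort

# LDAP servers that are not in production (difference)
puts((SETS['ldap_servers'] - SETS['production']).to_a.sort)
```

Intersection, union, and difference cover questions like "which CentOS 5 hosts are LDAP servers" without any schema changes, which is the appeal of modeling everything as a set.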

Of course, with every good solution, there are more problems to be solved. In this case it's recursion. OpenLDAP and 389/RHDS/Fedora-DS/SunOne/iPlanet/et al. don't seem to automatically recurse nested groups, though I have heard that some LDAP implementations do. Luckily, it's not that big of a problem.

Recursion

In this example, I'll be looking for all LDAP servers. Our directory information tree is set up such that we have three groups:

  • ldap_write_servers
  • ldap_replicas
  • ldap_servers

The ldap_servers entry is a groupOfUniqueNames whose uniqueMembers are the other two groups. To traverse this, we'll need some recursion.

Sample Code

In my code, I most often use ruby. When working with LDAP, I've used the classic ldap bindings heavily, but recently I've really taken a liking to activeldap. Activeldap borrows heavily from the Active Record design pattern and applies it to LDAP. It is not a perfect translation of active record, but it is quite nice for most operations on a directory server.

Activeldap requires some minimal setup to be useful. You can install it with gems or your favorite package manager.

require 'rubygems'
require 'active_ldap'

class Entry < ActiveLdap::Base
end

ActiveLdap::Base.setup_connection(
  :host => 'ldap.websages.com', :port => 636, :method => :ssl,
  :base => 'dc=websages,dc=com',
  :bind_dn => "uid=stahnma,ou=people,dc=websages,dc=com",
  :password => ENV['LDAP_PASSWORD'], :allow_anonymous => false)

This is a simple setup section for some code using activeldap. Require the library (and rubygems, unless your environment loads it for you or you installed activeldap by some other method). Then run setup_connection. The Websages directory server requires SSL and does not allow anonymous bind, so a few more parameters are used than you might see on a clear-text, anonymous setup.

From there, it's really not very difficult to recurse through groups and find the entries.

# Returns the members of an LDAP groupOfUniqueNames, following nested groups
def find_members(search, members = [])
  Entry.find(:all, search).each do |ent|
    if ent.classes.include?('groupOfUniqueNames')
      # The entry is a group: recurse into each member, which may itself
      # be a group. The recursive call appends to the shared members array.
      ent.uniqueMember.each do |dn|
        find_members(dn, members)
      end
    else
      # Not a group: record the entry that was searched for
      members << search
    end
  end
  # Remove duplicates before returning
  members.uniq
end

The above code will find all members of a groupOfUniqueNames including entries of groups within groups.

My calling function is just:

puts find_members('cn=ldap_servers')
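One thing the function above does not guard against is a group that contains itself, directly or through another group. In that case the recursion never ends. The following is a minimal sketch of the same traversal with a guard against loops. It runs against a plain Hash instead of a directory server, so the GROUPS table, its contents, and the expand helper are all hypothetical stand-ins.

```ruby
require 'set'

# Hypothetical directory: group DN => member DNs. Anything that is not
# a key in this hash is treated as a host (a leaf).
GROUPS = {
  'cn=ldap_servers'       => ['cn=ldap_write_servers', 'cn=ldap_replicas'],
  'cn=ldap_write_servers' => ['cn=ldap01'],
  # The last member deliberately points back at the parent group.
  'cn=ldap_replicas'      => ['cn=ldap02', 'cn=ldap03', 'cn=ldap_servers']
}

# Depth-first expansion of a group into its leaf members.
# `seen` records groups already expanded, so a loop ends instead of
# recursing forever.
def expand(dn, seen = Set.new)
  return [dn] unless GROUPS.key?(dn)  # leaf: a host entry
  return [] if seen.include?(dn)      # group already visited
  seen << dn
  GROUPS[dn].flat_map { |member| expand(member, seen) }.uniq
end

puts expand('cn=ldap_servers')
```

The same idea should carry over to the activeldap version by passing a set of already-visited group DNs through the recursive calls.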

Another excellent feature of activeldap is that if you simply puts an activeldap object, the LDIF text for the object is displayed on standard out.

Entry.find(:all , "cn=ldap_servers").each do |h|
  puts h
end

Produces a simple LDIF output:

version: 1
dn: cn=ldap_servers,ou=Sets,dc=websages,dc=com
cn: ldap_servers
description: Hosts acting as LDAP Servers
objectClass: groupOfUniqueNames
objectClass: top
uniqueMember: cn=ldap_replicas,ou=sets,dc=websages,dc=com
uniqueMember: cn=ldap_write_servers,ou=sets,dc=websages,dc=com

LDAP is a good answer

Now I can basically apply set theory for system management of meta data and configuration information. At Websages, we use our LDAP directory for nearly everything and integrate it into our fact generation for puppet, our backup schedules, our controlling IRC bots, and our broadcast SMSing while acting like idiots at the bar.

So next time you're faced with storing a bunch of host information or meta-data, you might turn back to a technology that is non-relational, scales horizontally, offers extensive ACL options, and is lightweight and fast. LDAP was NoSQL before the term was coined and often loses out on today's NoSQL discussions, but its track record is proven.

When I see the term NoSQL, I am reminded of a classic Dilbert, "I assure you, it has a totally different name."

Dilbert

December 10, 2010

Day 10 - Basic Sniffing with tcpdump

This article was written by Evan Anderson

The How and Why of Sniffing

It starts with something like an innocuous-sounding error message during setup of a new piece of software: "Fatal Error: Can't connect to server." You peruse the configuration files again and can't see anything wrong. You check the documentation and note that it isn't exactly clear on the syntax (FQDN of the server or the DN of the server object in the LDAP directory, etc). You know the story-- the documentation is unclear and you're not sure if you've got things configured right. You try it both ways and it still doesn't work. You scratch your head, check for working name resolution on your machine again, and try to think about what you might've missed.

This story plays out over and over with different OS's and applications but the frustration is the same. "Why isn't this thing working?" "What do you mean 'Cannot connect' !?!" Wouldn't it be nice to see what's actually being sent back and forth on the wire instead of poking around in the dark?

I'm regularly surprised at how long it takes sysadmins to reach for the sniffer. Many times I've found that sniffing traffic early in the process of troubleshooting, often while taking stock of the issue and documenting the symptoms, ends up revealing the root cause of issues. Sniffers aren't just for "network guys", and as a sysadmin you'll do well to become familiar with a sniffer for your operating system of choice.

The particular challenges related to bringing a sniffer to bear on a problem typically come from (a) getting the sniffer installed, (b) deciding how to capture the traffic you're looking for, and (c) filtering out extraneous traffic so that you end up with a sample small enough to make sense of.

A piece of terminology worth mentioning as you get into using sniffers is the phrase promiscuous mode. Typically an Ethernet network interface card (NIC) and/or its driver will only forward broadcast frames or frames explicitly addressed to the NIC's physical address up to the operating system. If the NIC receives frames destined for other hosts (as would be the case in "old school" shared-media Ethernet) they are simply ignored. In promiscuous mode all frames received by the NIC are forwarded up to the operating system. Most operating systems restrict switching an interface from normal operation to promiscuous mode to only privileged users (root, Administrator, etc).
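The decision a NIC makes for each frame can be written down in a few lines. This is a simplified sketch, not driver code: the function name passed_up? and the MAC addresses are made up, and real hardware also accepts frames for multicast groups it has joined.

```ruby
BROADCAST = 'ff:ff:ff:ff:ff:ff'

# Decide whether a NIC passes a received frame up to the operating system.
# Simplified model: only broadcast and exact unicast matches are considered.
def passed_up?(frame_dst, nic_mac, promiscuous: false)
  return true if promiscuous  # promiscuous mode: everything goes up
  dst = frame_dst.downcase
  dst == BROADCAST || dst == nic_mac.downcase
end

my_mac = '00:11:22:33:44:55'  # hypothetical address of our NIC

puts passed_up?('00:11:22:33:44:55', my_mac)                     # addressed to us
puts passed_up?('aa:bb:cc:dd:ee:ff', my_mac)                     # someone else's frame
puts passed_up?('aa:bb:cc:dd:ee:ff', my_mac, promiscuous: true)  # same frame, promiscuous
```

This is why a sniffer on a switched network sees so little without help: the switch rarely delivers other hosts' frames to your port in the first place, so promiscuous mode alone is not enough.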

It also bears mention that the architecture of the Windows networking stack doesn't permit sniffing of traffic to 127.0.0.1 in any reasonably easy manner. You can do this on most *nix operating systems, but the Windows networking stack architecture isn't conducive to this type of capture.

Finally, this article talks in broad terms but, generally, is oriented toward traffic capture on wired Ethernet networks. Capturing traffic on wireless Ethernet networks, in particular, comes with its own set of concerns, and is really a topic unto itself.

tcpdump and WinDump

On *nix operating systems tcpdump reigns supreme as the sniffer of choice. Most Linux distributions, including many tiny space-conscious embedded distributions, have a binary available. The pervasive availability is a testament to its utility. In the Windows world, a port of tcpdump called WinDump is available and mimics the functionality of tcpdump. The WinPcap driver is required to facilitate the low-level packet capture.

The learning-curve for tcpdump isn't particularly steep assuming you have a comfort-level with the command-line. The manual page for tcpdump is the canonical reference. Some common command-line arguments that I use include:

  • -X -- Display packets in hex and ASCII
  • -n -- Don't resolve addresses to hostnames
  • -s N -- Capture N bytes from each packet (defaults to 68). Set to 0 to capture entire packets.
  • -i x -- Capture from interface x, or any (on most *nix operating systems) to capture from all interfaces (useful to see if you're getting anything, but captures from any aren't performed in promiscuous mode so only traffic to / from the host will be captured)
  • -D -- Dumps a list of available interfaces (very useful in Windows since the interface names aren't easy things to type like eth0)

Filtering traffic

Capturing everything on the wire isn't tremendously useful since you're likely to be deluged with more information than you can handle. Turning off reverse DNS lookups with the -n option is a nice first step (because the reverse lookups themselves will end up generating more traffic), but filtering the traffic is essential to pinpointing exactly what you're looking for. The tcpdump filter syntax is very human-readable and reasonably intuitive so it's pretty easy to get started.

Suppose you're interested in seeing a hex dump of packets on your eth0 interface, without resolving addresses to hostnames, for the LDAP protocol (or, at least, the standard TCP port LDAP runs on). You can do that with the command: tcpdump -i eth0 -n -X tcp port 389. The filter portion of this command, tcp port 389 is fairly easy to understand. Other example filters include:

  • host 10.1.1.1 -- Captures only traffic to / from the host 10.1.1.1
  • src 10.1.1.1 -- Captures only traffic sourced by 10.1.1.1
  • tcp dst port 22 -- Captures only traffic destined for TCP port 22

Bear in mind that filters that mention "src" and "dst" only capture one direction of a conversation because they only match against the source or destination address or port number.

Using boolean operators allows you to get fancier with your capture. Suppose you're connected to a host with SSH (or RDP, in the Windows world-- just substitute 3389 into the example) and you want to capture all traffic except the traffic moving between the remote host and your machine (hypothetically at 10.1.1.2). You can use the boolean not operator to exclude that traffic with the very simplistic filter: not host 10.1.1.2

While that filter will work, it will also exclude all other traffic between your computer and the remote host. With some clever use of parentheses and the boolean and operator you can construct a more complex filter that allows any non-SSH traffic between your computer and the remote host to be captured: not (tcp port 22 and host 10.1.1.2) (Beware that using parens on a *nix OS may require you to surround the filter expression with quotes since your shell is likely going to see the parentheses as shell metacharacters!)
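If you launch tcpdump from a script, the quoting problem can be handled for you. The sketch below uses Ruby's standard shellwords library. The helper name tcpdump_command and the interface name are assumptions for the example.

```ruby
require 'shellwords'

# Build a tcpdump argument list. The whole filter travels as one argument,
# so it must be escaped if the command is ever handed to a shell.
def tcpdump_command(interface, filter)
  ['tcpdump', '-i', interface, '-n', filter]
end

argv = tcpdump_command('eth0', 'not (tcp port 22 and host 10.1.1.2)')

# Shell-safe string, suitable for pasting into a terminal or an ssh command
puts Shellwords.join(argv)

# Passing the array form to system() bypasses the shell entirely,
# so no escaping is needed at all:
# system(*argv)
```

Keeping the filter as a single array element means the parentheses never reach the shell unescaped, whichever of the two ways you run it.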

As a general filtering strategy, if I know what the traffic is that I'm trying to capture I write a filter that includes only the traffic I'm looking for. If I'm unsure about what I'm looking for I begin by capturing everything for a short period of time, reviewing the captured data, and creating a progressively longer filter excluding more types of traffic that aren't what I want. So, a filter that might start as something really simple like ip might become ip and not (tcp port 443 or icmp or tcp port 80 or tcp port 22 or udp port 53) as I find traffic that I don't want to see in my capture.
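That grow-as-you-go approach is easy to script. Here is a small sketch, with a made-up helper name (excluding), that rebuilds the filter string each time another kind of unwanted traffic is added to the list.

```ruby
# Build "base and not (a or b or ...)" from a list of exclusions.
# With no exclusions the base filter is returned unchanged.
def excluding(base, exclusions)
  return base if exclusions.empty?
  "#{base} and not (#{exclusions.join(' or ')})"
end

noise = []
puts excluding('ip', noise)        # first pass: capture everything

noise += ['tcp port 443', 'icmp']  # after reviewing the first capture
puts excluding('ip', noise)

noise += ['tcp port 80', 'tcp port 22', 'udp port 53']
puts excluding('ip', noise)
```

Keeping the exclusions in a list also gives you a record of what you have already ruled out while troubleshooting.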

The best way to get familiar with tcpdump filters is to start using them. Capture known traffic and attempt to filter it. The manual page provides some much more detailed examples (performing bitwise comparisons of TCP header fields to catch packets with particular flags set, etc) to get you started.

Saving captures

tcpdump has the handy feature of allowing you to save captures to disk and replay them. The replay functionality is particularly handy if you want to capture traffic on one host then ship it to another host (or even another program, like Wireshark) for analysis.

The -w file argument specifies that packets should be written to the specified file rather than decoded and displayed. You can also use the optional -C argument to limit the size of capture files and the -W argument to specify the number of capture files to maintain. For long-term traffic analysis the latter two options can be used to create a "ring buffer" of capture files that will be re-used ad infinitum.
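To make the ring buffer concrete, here is a sketch that assembles the arguments and estimates the disk space involved. The helper names, the interface, and the file name are assumptions. The arithmetic relies on the tcpdump manual page's statement that -C counts in units of 1,000,000 bytes, and the total is approximate because tcpdump checks the file size before writing each packet.

```ruby
# Arguments for a rotating capture: `files` savefiles of roughly
# `megabytes` million bytes each.
def ring_buffer_args(interface, basename, megabytes, files)
  ['tcpdump', '-i', interface, '-n', '-s', '0',
   '-w', basename, '-C', megabytes.to_s, '-W', files.to_s]
end

# Approximate disk space the ring buffer can occupy, in bytes.
def ring_buffer_bytes(megabytes, files)
  megabytes * 1_000_000 * files
end

puts ring_buffer_args('eth0', 'ldap.pcap', 100, 10).join(' ')
puts ring_buffer_bytes(100, 10)
```

Ten files of 100 million bytes each keep about a gigabyte of the most recent traffic on disk, which is a reasonable starting point for a capture you plan to leave running.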

Once you've captured traffic you can "play it back" with tcpdump by using the -r file argument to read the captured traffic from file. Traffic read from a file "acts" like traffic being captured from an interface so all the command-line arguments to manipulate the output behaviour, and to filter the traffic stream, are applied in the same manner as traffic being captured from an interface.

Getting to the traffic

As we've already seen with tcpdump's ability to write captures to disk, running tcpdump remotely is one way to capture traffic that doesn't flow to or from your computer. If you need to monitor traffic between hosts where you can't run tcpdump you'll have to think of some way to bring the traffic to you. Some options to get to the traffic include:

  • Switch-based monitoring - Different Ethernet switch manufacturers call this mechanism by different names (SPAN, port-mirroring, monitor ports), but the functionality is the same. The Ethernet switch can "tee" the traffic in one or both directions from one or more ports into a dedicated "monitoring port" where your sniffer is connected. Most of the time the monitor port won't function as a normal network connection while monitoring is enabled so be wary that you may need a second interface in your sniffer computer, attached to a non-monitor network port, if you want communication with the network as-normal while you are monitoring. This is typically the least disruptive method for getting your sniffer in-line with the traffic to be captured, but requires a switch with monitoring capability and the cooperation of whoever manages the switch.

  • Insert a shared medium - It's a less common technique today with gigabit Ethernet being more pervasive, but an older technique for monitoring traffic included attaching a shared medium, like an Ethernet hub, between the host to be monitored and the LAN switch. Traffic moving through an Ethernet hub appears on all ports of the device, unlike an Ethernet switch. Attaching the LAN to one port of the hub, the host to be monitored to another port on the hub, and the sniffer to a third port permitted monitoring of the traffic between the host and the LAN. Since there aren't gigabit Ethernet hubs this method has become less commonplace as gigabit Ethernet has become more common.

  • Physically intercept the connection - Most modern Linux distributions and Windows versions allow you to create a "network bridge" between two (or more) network interfaces. In cases where the traffic flow between the host to be monitored and the LAN is not so great as to overwhelm the CPU in your sniffer computer you can opt to create a network bridge and physically insert the sniffer computer between the host to be monitored and the LAN. Your sniffer software can be configured to capture traffic on the virtual "bridge interface". This method works for low bandwidth captures, but high traffic hosts can overwhelm the CPU of the sniffer computer with traffic.

  • Redirect the traffic flow - You can redirect the traffic flow between the host and the LAN through various methods. ARP cache poisoning, using tools like Ettercap, can "trick" hosts into forwarding traffic through your sniffer machine. The specific details of using this tool are beyond this post but many guides are out there. Be warned that ARP cache poisoning may be detected by an intrusion detection system (IDS) as an attack. If you're going to use ARP cache poisoning on someone else's LAN be sure you've cleared it with the network or system administrators, lest you send them into a panic when their IDS starts sounding alarms.

    Another method for traffic flow redirection is to use a TCP proxy (a port redirector), such as rinetd, to redirect traffic flows through your sniffer computer. With this method you would configure your sniffer computer to answer for the protocol to be captured and to forward those incoming connections to the host where the "real" server software is running. Client computers will need to be reconfigured to use your sniffer computer's IP address as their "server" unless you change the server computer's IP address and assume its address with your sniffer computer. This is the method that I use when I can't get access to a switch monitor port and don't want to run the risk of scaring somebody with ARP cache poisoning.

From here?

The next time you're faced with an opaque and unhelpful "Can't connect to server"-type message take a moment and fire up tcpdump to see what's happening on the wire before you start double-checking configuration files or stopping and restarting service programs. Mis-specified host names, malformed configuration parameters, and a whole host of other maladies that could be hiding behind that opaque error message can become visible immediately once you look at the conversation actually taking place on the wire. Mysterious "delays" and "timeouts" are great candidates for troubleshooting with a sniffer. And, finally, even if you're not troubleshooting a problem just using a sniffer to analyze protocols can provide you with valuable details that will help your understanding.

Obviously, this article is just a brief introduction. tcpdump has capabilities that I haven't mentioned, and there is a wealth of other tools out there that can do more. Get out there and sniff some traffic!

December 9, 2010

Day 9 - Automated Deployments with LittleChef

Written by Grig Gheorghiu (@griggheo)

Prologue

Sysadmin #1 (to Sysadmin #2): So...I need you to tell me....what is Devops?

Sysadmin #2: Devops? I don't know....I didn't expect a kind of Devops Inquisition!

[JARRING CHORD]

[The door flies open and Senior Devops admin Ximinez of Spain enters, flanked by two junior Devops admins]

Ximinez: NOBODY expects the Devops Inquisition! Our chief weapon is collaboration...collaboration and automated deployments... Our two weapons are collaboration and automated deployments...and ruthless procrastination.... Our three weapons are collaboration, automated deployments, and ruthless procrastination...and an almost fanatical devotion to vi.... Our four...no... Amongst our weapons.... Amongst our weaponry...are such elements as collaboration, automated deployments.... I'll come in again.

Automated deployment systems: push vs. pull

I don't need the Devops Inquisition to tell me that I need to use automated deployment/configuration management systems. They are indeed a critical part of a self-respecting sysadmin's weaponry. Without them, the activities of deploying and configuring servers become haphazard and fraught with errors.

I like to differentiate between two types of automated deployment tools. One type is what can be called a 'push' system: a centralized server 'pushes' deployments by running commands via ssh on remote nodes. Examples of push systems are Fabric and Capistrano. The other type is what can be called a 'pull' system: remote nodes 'pull' commands and configurations from a centralized 'master' server. Examples of pull systems are Chef and Puppet.

To me, the main advantage of a push system is that you're under total control of what gets deployed where and at what time. Its main disadvantage is that it doesn't scale well when you have to control hundreds of servers, although you could use parallel ssh tools to help you there. In contrast, a pull system is inherently more scalable, because the nodes contact the 'master' asynchronously and independent of each other (see this blog post of mine for more details and comments on this topic.) However, with a pull system you can lose some of the control on the exact targets and timing of the deployments.

My personal philosophy on using push vs. pull systems is to use a pull system when a node boots up (as an EC2 instance for example, or bootstrapped with cobbler as another example) and have the node create required users, mount file systems, install monitoring tools, install and configure pre-requisite packages -- basically everything up to the level of the actual application. I prefer to then use a 'push' system to actually deploy the application and its configuration files. Why? Because I usually do it in small batches of servers at a time, and I would be nervous to know that a bad configuration change can propagate to all nodes if I were to deploy it via a 'pull' system. Note however that I manage tens and not hundreds of servers, so in this latter case I would definitely at least reconsider my options.

Introducing LittleChef

One drawback of 'pull' systems is their complexity. You usually need to configure the 'master' server, then have nodes authenticate against it and ask it for tasks to run. If you've ever configured Chef Server for example, you know it's not exactly trivial. This is one reason why Chef provides a variant called Chef Solo which runs in isolation on a given node without requiring communication with a master server. The nice thing about Chef Solo is that it still uses all the other concepts of Chef (and Puppet for that matter): cookbooks, recipes, attributes, roles, etc.

The question becomes: how do you get Chef Solo installed on a node? Well, one way of doing it is by using LittleChef, a package created by Miquel Torres, who is also the author of Overmind. LittleChef is written in Python and uses Fabric as the underlying mechanism of getting Chef Solo installed on a remote node, then sending cookbooks and roles to that node for further processing via Chef Solo. It's a bit ironic that a Python-based package deals with configuring and running a Ruby-based tool such as Chef Solo, but it also shows flexibility in the design of Chef, since many of the configuration files inspected by Chef are in JSON format, easy to interpret in any modern language.

Essentially, LittleChef is a push system, so it brings simplicity to the table. The fact that deployments are done on the remote nodes via Chef Solo also brings into play the full repertory of a solid configuration management tool. It should be relatively easy to migrate the remote nodes from Chef Solo to a full-blown Chef Client / Chef Server setup down the road.

Another nice thing about LittleChef is that it allows you to easily debug your cookbooks and roles. In a typical Chef Client/Server setup, once you modify a cookbook or a role, you need to upload to the Chef Server via the knife utility, then wait for the Chef Client on the remote node to contact the server, then inspect the Chef log file for any errors. With LittleChef, the process is much simpler: you modify a cookbook and/or a role, then you push it to the remote node and you see any errors in real time. Read on for details on how exactly you do this.

Working with LittleChef

Here I am using Ubuntu Lucid 10.04 as my OS. A similar procedure can be followed for RedHat-based systems.

First install some pre-requisites:

# apt-get install build-essential python-dev python-pip python-setuptools git-core

Then install Fabric and LittleChef via pip (or you can use easy_install if you prefer):

# pip install fabric
(this also installs pycrypto and paramiko)

# pip install littlechef
(this will install the latest version of LittleChef at the time of this
writing, which is 0.4.0)

Initial LittleChef Configuration

Create a directory where your cookbooks, roles and node information will be located. I like to call this directory 'kitchen', as Miquel suggests in his documentation:

$ mkdir kitchen; cd kitchen

Now run the main LittleChef utility script, appropriately named 'cook', with new_deployment as an argument. It will create some sub-directories and a file called auth.cfg:

$ cook new_deployment
nodes/ directory created...
cookbooks/ directory created...
roles/ directory created...
auth.cfg created...

Edit auth.cfg and specify a user name and a password used for ssh-ing to the remote nodes (make sure this user has sudo access on the remote node.)

If you want to use some of the cookbooks already published by Opscode, you can replace the cookbooks directory with a git clone of the Opscode cookbook repository from GitHub:

$ rm -rf cookbooks
$ git clone https://github.com/opscode/cookbooks.git

Installing Chef Solo on the remote node

In the following examples, I'll use a server named app1 as my target remote node. All the 'cook' commands in these examples have 'kitchen' as the current working directory. If you try to run the 'cook' command outside of the 'kitchen', you'll get an error message:

Fatal error: You are executing 'cook' outside of a deployment directory To create a new deployment in the current directory type 'cook new_deployment'

The first step in actually doing deployments via LittleChef is to install Chef Solo on the remote node. This is very easy with the 'cook' command again, this time with the 'deploy_chef' argument:

$ cook node:app1 deploy_chef

The LittleChef documentation talks about other ways of installing Chef Solo on the remote node, for example via gems, or without asking for a confirmation. See the README for more details.

Deploying an Opscode cookbook on the remote node

As a first example, let's deploy memcached on the remote node. There already is a cookbook for memcached in the Opscode repository. If you ran the git clone command above, you're all set to go. Assuming you are OK with the default values for memcached (which are set in the file kitchen/cookbooks/memcached/attributes/default.rb), all you need to do is run:

$ cook node:app1 recipe:memcached

== Executing recipe memcached on node app1 ==
Uploading cookbooks... (memcached, runit)
Uploading roles...
Uploading node.json...

== Cooking... ==

[app1] out: [Tue, 07 Dec 2010 11:14:21 -0800] INFO: Setting the run_list to
["recipe[memcached]"] from JSON
[app1] out: [Tue, 07 Dec 2010 11:14:21 -0800] INFO: Starting Chef Run (Version
0.9.8)
[app1] out: [Tue, 07 Dec 2010 11:14:22 -0800] INFO: Upgrading
package[memcached] version from uninstalled to 1.4.2-1ubuntu3
[app1] out: [Tue, 07 Dec 2010 11:14:27 -0800] INFO: Upgrading
package[libmemcache-dev] version from uninstalled to 1.4.0.rc2-1
[app1] out: [Tue, 07 Dec 2010 11:14:30 -0800] INFO: Writing updated content for
template[/etc/memcached.conf] to /etc/memcached.conf
[app1] out: [Tue, 07 Dec 2010 11:14:30 -0800] INFO: Backing up
template[/etc/memcached.conf] to
/var/chef/backup/etc/memcached.conf.chef-20101207111430
[app1] out: [Tue, 07 Dec 2010 11:14:30 -0800] INFO:
template[/etc/memcached.conf] sending restart action to service[memcached]
(immediate)
[app1] out: [Tue, 07 Dec 2010 11:14:31 -0800] INFO: service[memcached]:
restarted successfully
[app1] out: [Tue, 07 Dec 2010 11:14:31 -0800] INFO: Chef Run complete in
9.569211 seconds
[app1] out: [Tue, 07 Dec 2010 11:14:31 -0800] INFO: Running report handlers
[app1] out: [Tue, 07 Dec 2010 11:14:31 -0800] INFO: Report handlers complete

SUCCESS: Node correctly configured

Done.
Disconnecting from app1... done.

The cook command above will also create a file called app1.json in the kitchen/nodes directory. Here is the file:

$ cat nodes/app1.json
{
   "run_list": [
       "recipe[memcached]"
   ]
}

Now assume you want to override some of the default memcached attributes. Let's say you want to use 1 GB of RAM for memcached rather than the default 64 MB. You can just edit app1.json and add this stanza to override that attribute:

 "memcached": {
   "memory": "1024"
 }

(if you add it at the end of the JSON file, make sure you put a comma after the previous “run_list” stanza, otherwise the JSON syntax is invalid; however, even if you have an invalid JSON syntax, the cook command will let you know exactly what the syntax error is, which is a nice touch)
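If you would rather not hand-edit the file, the same change can be scripted so the comma takes care of itself. This is a sketch using Ruby's standard json library. The helper name with_attributes is made up, and the example works on a string so it can be run without a kitchen directory.

```ruby
require 'json'

# Add (or extend) a top-level attribute stanza in a node's JSON text.
# JSON.parse raises JSON::ParserError on invalid input, so a syntax
# error surfaces before anything is sent to the node.
def with_attributes(node_json, cookbook, attrs)
  node = JSON.parse(node_json)
  node[cookbook] = (node[cookbook] || {}).merge(attrs)
  JSON.pretty_generate(node)
end

original = '{ "run_list": [ "recipe[memcached]" ] }'
puts with_attributes(original, 'memcached', { 'memory' => '1024' })
```

To apply it to a real node you would read nodes/app1.json with File.read and write the result back with File.write.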

Now run the cook command, this time telling it to just configure the node. This will use whatever you specified in app1.json, so there is no need to specify the recipe name again on the command line:

$ cook node:app1 configure

== Configuring app1 ==
Uploading cookbooks... (memcached, runit)
Uploading roles...
Uploading node.json...

== Cooking... ==

[app1] out: [Tue, 07 Dec 2010 11:45:36 -0800] INFO: Setting the run_list to
["recipe[memcached]"] from JSON
[app1] out: [Tue, 07 Dec 2010 11:45:36 -0800] INFO: Starting Chef Run (Version
0.9.8)
[app1] out: [Tue, 07 Dec 2010 11:45:37 -0800] INFO: Writing updated content for
template[/etc/memcached.conf] to /etc/memcached.conf
[app1] out: [Tue, 07 Dec 2010 11:45:37 -0800] INFO: Backing up
template[/etc/memcached.conf] to
/var/chef/backup/etc/memcached.conf.chef-20101207114537
[app1] out: [Tue, 07 Dec 2010 11:45:37 -0800] INFO:
template[/etc/memcached.conf] sending restart action to service[memcached]
(immediate)
[app1] out: [Tue, 07 Dec 2010 11:45:38 -0800] INFO: service[memcached]:
restarted successfully
[app1] out: [Tue, 07 Dec 2010 11:45:38 -0800] INFO: Chef Run complete in
1.757141 seconds
[app1] out: [Tue, 07 Dec 2010 11:45:38 -0800] INFO: Running report handlers
[app1] out: [Tue, 07 Dec 2010 11:45:38 -0800] INFO: Report handlers complete

SUCCESS: Node correctly configured

Done.
Disconnecting from app1... done.

At this point, the /etc/memcached.conf file on node app1 will contain the new memory value 1024.

Adding your own cookbooks and recipes

I won't go into much detail on what exactly you need to do in order to create your own cookbook. It's a good idea to start from an existing one, and add your own recipes. Assuming your company is ACME Corporation, it's a good idea to create a cookbook called 'acme', so I will use that in my examples below.

First off, LittleChef expects the cookbook metadata file to be in JSON format. Here's an example:

$ cat cookbooks/acme/metadata.json 
{
   "suggestions": {
   },
   "maintainer_email": "admin@acme.com",
   "description": "Installs required packages for ACME applications",
   "recipes": {
     "acme": "Installs required packages for ACME applications"
   },
   "conflicting": {
   },
   "attributes": {
   },
   "providing": {
   },
   "dependencies": {
     "screen": [
     ],
     "python": [
     ],
     "git": [
     ],
     "build-essential": [
     ],
     "ntp": [
     ]
   },
   "replacing": {
   },
   "platforms": {
     "debian": [
     ],
     "centos": [
     ],
     "fedora": [
     ],
     "ubuntu": [
     ],
     "redhat": [
     ]
   },
   "version": "0.1.0",
   "groupings": {
   },
   "license": "Apache 2.0",
   "long_description": "",
   "name": "acme",
   "recommendations": {
   },
   "maintainer": "ACME"
 }

Note that the metadata file above specifies that the acme cookbook has dependencies on some other cookbooks: screen, python, git, build-essential and ntp. These are all part of the Opscode cookbook repository. LittleChef will figure out these dependencies and will transfer the main cookbook (acme) along with all the cookbooks it depends on to the remote node, so they can be processed by Chef Solo.
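Working out which cookbooks to send is a small graph walk. The sketch below shows the idea and is not LittleChef's actual code: the DEPENDENCIES table stands in for the "dependencies" keys of each cookbook's metadata.json, and the nested entries under python and git are invented for illustration.

```ruby
# Hypothetical dependency table: cookbook name => cookbooks it depends on.
DEPENDENCIES = {
  'acme'   => ['screen', 'python', 'git', 'build-essential', 'ntp'],
  'python' => ['build-essential'],  # illustrative, not checked against Opscode
  'git'    => ['build-essential']   # illustrative, not checked against Opscode
}

# Every cookbook that must be uploaded for `name`: the cookbook itself
# plus everything it depends on, directly or indirectly.
def cookbooks_to_upload(name, found = [])
  return found if found.include?(name)  # already collected
  found << name
  DEPENDENCIES.fetch(name, []).each { |dep| cookbooks_to_upload(dep, found) }
  found
end

puts cookbooks_to_upload('acme').sort
```

Each cookbook appears once in the result no matter how many others depend on it, so shared dependencies such as build-essential are only transferred once.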

Here's an example of a recipe I use, located in cookbooks/acme/recipes/nginx.rb:

$ cat cookbooks/acme/recipes/nginx.rb
# install nginx from source
nginx = "nginx-#{node[:acme][:nginx_version]}"
nginx_pkg = "#{nginx}.tar.gz"
nginx_upstream = "nginx-upstream-fair.tar.gz"

downloads = [nginx_pkg, nginx_upstream]

downloads.each do |file|
   remote_file "/tmp/#{file}" do
       source "http://#{node[:acme][:web_server]}/download/nginx/#{file}"
   end
end

script "install_nginx_from_src" do
 interpreter "bash"
 user "root"
 cwd "/tmp"
 not_if "test -f /usr/local/nginx/conf/nginx.conf"
 code <<-EOH
   tar xvfz #{nginx_upstream}
   cd /tmp
   tar xvfz #{nginx_pkg}
   cd #{nginx}
   ./configure --with-http_ssl_module --with-http_stub_status_module --add-module=/tmp/nginx-upstream-fair/
   make
   make install
   EOH
end

This recipe downloads an nginx tarball and installs it via make install. Note that the recipe uses two attributes, referenced as node[:acme][:nginx_version] and node[:acme][:web_server]. These attributes have default values set in cookbooks/acme/attributes/default.rb:

default[:acme][:web_server] = 'mywebserver.acme.com'
default[:acme][:nginx_version] = '0.8.20'

I find it a good practice to use an attribute wherever I am tempted to use a hardcoded value for a variable in my recipes. This makes it easy to override the attribute at the node level or at the role level.
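To make the override idea concrete, here is a minimal sketch of how layered attributes could be merged. It is a simplification and not Chef's actual implementation: it assumes only three layers (cookbook defaults, role default_attributes, node attributes) applied in that order, and the staging hostname is made up for illustration.

```ruby
# Recursively merge two attribute hashes; values in `extra` win.
def deep_merge(base, extra)
  base.merge(extra) do |_key, old, new|
    old.is_a?(Hash) && new.is_a?(Hash) ? deep_merge(old, new) : new
  end
end

# Simplified precedence (an assumption for this sketch):
# cookbook defaults < role default_attributes < node attributes.
def effective_attributes(cookbook_defaults, role_defaults, node_attributes)
  [cookbook_defaults, role_defaults, node_attributes].reduce({}) do |merged, layer|
    deep_merge(merged, layer)
  end
end

# Values from cookbooks/acme/attributes/default.rb
COOKBOOK_DEFAULTS = {
  "acme" => { "web_server" => "mywebserver.acme.com", "nginx_version" => "0.8.20" }
}.freeze

# Hypothetical role-level override, for illustration only
ROLE_DEFAULTS = { "acme" => { "web_server" => "staging.acme.example" } }.freeze

puts effective_attributes(COOKBOOK_DEFAULTS, ROLE_DEFAULTS, {}).inspect
```

Because the recipe only ever reads node[:acme][:web_server], it does not care which layer supplied the value.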

My 'acme' cookbook also has a recipe called default.rb, where I do things such as installing more pre-requisite Python packages, adding more users, etc.

Configuring roles

Instead of configuring nodes based on individual recipes (like we did when we configured node app1 with the memcached recipe), a better way is to define roles that nodes can belong to. Roles are in my view the most critical concept of any good deployment/configuration management system. In Chef, roles are the glue that ties nodes to cookbooks and recipes. A given node can belong to multiple roles. If you change any recipe that a given role includes, all nodes belonging to that role will automatically get the updated recipe.

Let's define a role called 'appserver'. This is done via a JSON file located in kitchen/roles:

$ cat roles/appserver.json 
{
   "name": "appserver",
   "json_class": "Chef::Role",
   "run_list": [
     "recipe[acme]",
     "recipe[acme::nginx]",
     "recipe[memcached]"
   ],
   "description": "Installs required packages and applications for an ACME app server",
   "chef_type": "role",
   "default_attributes": {
     "memcached": {
       "memory": "1024"
     }
   },
   "override_attributes": {
   }
 }

Several things to note in this role definition:

  • the run_list stanza specifies the recipes that need to be executed by nodes belonging to this role (the order of the recipes is also preserved, which is nice); in this example, the recipe defined in default.rb in the acme cookbook will be executed, followed by the recipe defined in nginx.rb in the acme cookbook, followed by the recipe defined in default.rb in the memcached cookbook
  • because the acme cookbook depends on other cookbooks listed in its metadata file (screen, python, git, build-essential and ntp), the default recipes in those cookbooks will be executed before the acme cookbook's default recipe gets executed
  • we use the default_attributes stanza to override the memcached memory from the default value of 64 MB (set as we mentioned before in cookbooks/memcached/attributes/default.rb) to a value of 1024 MB; I refer the reader to the "Attribute Type and Precedence" wiki page from the Opscode documentation for more details on how attributes can be set and overridden at various levels
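The run_list expansion described above can be sketched in a few lines. This is an illustration of the idea only, not Chef's or LittleChef's code; the method name and the EXAMPLE_ROLES hash are assumptions made for the example.

```ruby
# Roles known to this sketch, keyed by name; mirrors roles/appserver.json.
EXAMPLE_ROLES = {
  "appserver" => ["recipe[acme]", "recipe[acme::nginx]", "recipe[memcached]"]
}.freeze

# Expand a run_list into an ordered list of "cookbook::recipe" names.
# "recipe[acme]" is shorthand for the default recipe, i.e. "acme::default".
def expand_run_list(run_list, roles)
  run_list.flat_map do |item|
    match = item.match(/\A(recipe|role)\[(.+)\]\z/)
    raise ArgumentError, "unrecognized run_list item: #{item}" unless match

    kind, name = match.captures
    if kind == "role"
      expand_run_list(roles.fetch(name), roles)
    else
      name.include?("::") ? name : "#{name}::default"
    end
  end
end

puts expand_run_list(["role[appserver]"], EXAMPLE_ROLES).inspect
```

The order of the resulting list is the order in which the recipes would run.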

Now we have to associate the node app1 with the role appserver. We can just modify the file nodes/app1.json like this:

$ cat nodes/app1.json
{
   "run_list": [
       "role[appserver]"
   ]
}

Finally, we can run the cook command and configure the node with its new role:

  $ cook node:app1 configure

  == Configuring app1 ==
  Uploading cookbooks... (acme, memcached, python, build-essential, screen, git,
  ntp, runit)
  Uploading roles...
  Uploading node.json...

  == Cooking... ==

  […... etc (more output here) ]

Note how LittleChef figured out all the dependencies and uploaded the respective cookbooks to the remote node.
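Working out which cookbooks to upload amounts to following the dependency lists from each metadata file until nothing new turns up. Here is a rough sketch of that walk; it is not LittleChef's actual code, and the memcached-to-runit dependency is an assumption inferred from the upload list above.

```ruby
# Dependencies as they would appear in each cookbook's metadata.json.
# The acme entry comes from the metadata shown earlier; the memcached
# entry is assumed for this example.
EXAMPLE_DEPENDENCIES = {
  "acme"      => %w[screen python git build-essential ntp],
  "memcached" => %w[runit]
}.freeze

# Return every cookbook needed for the requested ones, following
# dependencies transitively and visiting each cookbook only once.
def cookbooks_to_upload(requested, dependencies)
  needed = []
  queue = requested.dup
  until queue.empty?
    cookbook = queue.shift
    next if needed.include?(cookbook)

    needed << cookbook
    queue.concat(dependencies.fetch(cookbook, []))
  end
  needed
end

puts cookbooks_to_upload(%w[acme memcached], EXAMPLE_DEPENDENCIES).join(", ")
```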

Querying for nodes, roles and recipes

A nice feature of LittleChef is that it allows you to query its inventory (defined by the information contained in the cookbooks, roles and nodes sub-directories of the 'kitchen' directory) for things such as 'nodes pertaining to a given role', 'nodes that apply a given recipe', 'list of all roles' or 'list of all recipes'. You can run 'cook -l' to see the available command-line options:

$ cook -l
LittleChef: Configuration Management using Chef without a Chef Server

Available commands:

   configure               Configure node using existing config file
   debug                   Sets logging level to debug
   deploy_chef             Install Chef-solo on a node
   list_nodes              List all nodes
   list_nodes_with_recipe  Show all nodes which have asigned a given recipe
   list_nodes_with_role    Show all nodes which have asigned a given recipe
   list_recipes            Show all available recipes
   list_roles              Show all roles
   new_deployment          Create LittleChef directory structure (Kitchen)
   node                    Select a node
   recipe                  Execute the given recipe,ignores existing config
   role                    Execute the given role, ignores existing config

Here is an example of querying for all nodes pertaining to the 'appserver' role:

$ cook list_nodes_with_role:appserver

app1
  Role: appserver
    default_attributes:
      memcached: {'memory': '1024'}
    override_attributes:
  Node attributes:
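Since the inventory is just JSON files on disk, a query like this boils down to reading each node file and inspecting its run_list. The sketch below shows the idea; it is not how LittleChef (a Python tool) implements it, and the method name is made up.

```ruby
require "json"

# List the nodes whose run_list directly references the given role.
# Expects one <node>.json file per node in nodes_dir, as in the kitchen layout.
def nodes_with_role(nodes_dir, role)
  matching = Dir.glob(File.join(nodes_dir, "*.json")).sort.select do |path|
    run_list = JSON.parse(File.read(path)).fetch("run_list", [])
    run_list.include?("role[#{role}]")
  end
  matching.map { |path| File.basename(path, ".json") }
end

# Example call, assuming you run it from the kitchen directory:
#   nodes_with_role("nodes", "appserver")
```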

Epilogue

Ximinez [with a cruel leer]: Now -- you will stay in the Comfy Aeron Chair until lunch time, with only a cup of coffee at eleven. [aside, to Biggles] Is that really all it is?

Biggles: Yes, lord.

Ximinez: I see. I suppose we make it worse by shouting a lot, do we? Confess, Sysadmin. Confess! Confess! Confess! Confess!

Biggles: I confess!

Ximinez: Not you!

So, Sysadmin, if you don't have an automated deployment/configuration management strategy yet, and if you want to avoid torture in the Comfy Aeron Chair, then roll up your sleeves and roll out a deployment system like LittleChef for your infrastructure TODAY!

Further reading

  1. Opscode Chef wiki
  2. Some blog posts of mine on "Chef Installation and Minimal Configuration", "Working with Chef Cookbooks and Roles", "Bootstrapping EC2 Instances with Chef", "Working with Chef Attributes"
  3. A four-part blog post series on “Building a Django App Server with Chef” by Eric Holscher (Part 1, Part 2, Part 3, Part 4)
  4. Collaboration, ruthless procrastination and an almost fanatical devotion to vi are exercises left to the reader

December 8, 2010

Day 8 - Everything is a DNS Problem

Written by Kris Buytaert (@KrisBuytaert)

Systems break. Whether you like it or not, one day, they will break. Either when they are up and running or when you are building new stuff, you will one day run into problems. Sometimes the error messages will guide you to the solution quickly, but sometimes they give you no pointers at all, and sometimes there are no error messages - just weird behavior.

When that happens, it's time to pull out your troubleshooting skills. And so, you read logfiles, you google, but you find nothing; you lie awake at night trying to figure out what parameter in which config file you forgot.

In the next couple of examples, I'll walk you through some issues I ran into over the past decade. The list is far from exhaustive, but it might give you an idea.

Let's start with some trivial stuff.

Who hasn't heard the "When I log on to server X, it takes a while" complaint from a user? However, when you log on to the box, it goes lightning fast. At first, you think it was a temporary glitch, but those 5 users keep complaining. You go to their workstation and, indeed, from their desk things do go a lot slower. Turns out, they are in a newer part of the building that is on the newest subnets in your organization, and for those networks, there are no reverse mappings yet. When the users log in, the server first tries to figure out who they are with a reverse lookup, and that lookup sometimes takes longer than is acceptable before it completes or times out.

Reverse DNS lookups causing performance problems are amongst the most common problems around. They happen with databases, regular logins, etc. Sometimes people doing performance comparisons of MySQL vs NoSQL tools fall into the trap, and they end up testing a failing DNS lookup.

The quick fix, adding the hosts to your /etc/hosts file, is sometimes the only alternative, as you don't always have control over the reverse DNS mapping.

Luckily, lots of daemons also let you disable the feature, for example with the skip_name_resolve entry in your my.cnf or the UseDNS no setting in your sshd_config.
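If you want to see for yourself how long a reverse lookup takes from a given machine, a small timing wrapper is enough. The sketch below is only an illustration: the method names are made up, and the lookup is passed in as a callable so you can plug in whatever resolver you like (for example Resolv.method(:getname) from Ruby's standard library).

```ruby
require "ipaddr"

# The name a PTR query for this address asks for,
# e.g. "2.0.16.172.in-addr.arpa" for 172.16.0.2.
def ptr_query_name(ip)
  IPAddr.new(ip).reverse
end

# Run the given lookup (any callable that takes an IP string and returns
# a hostname) and report how long it took. A failed lookup yields a nil name.
def timed_reverse_lookup(ip, lookup)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  name = begin
    lookup.call(ip)
  rescue StandardError
    nil
  end
  elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { ip: ip, name: name, seconds: elapsed }
end

# Example against the system resolver (needs network access):
#   require "resolv"
#   p timed_reverse_lookup("172.16.0.2", Resolv.method(:getname))
```

A lookup that takes several seconds and comes back with no name is a strong hint that the reverse zone for that subnet is missing.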

Don't think it's limited to services like MySQL and sshd. There are even web applications that start performing slowly because of DNS problems, as one WordPress user figured out. Some applications are slow when DNS is misconfigured, but plenty of applications simply refuse to launch when they can't figure out where they are running; for example, an old DRBD issue caused drbdadm to crash. The easy cases to detect are the ones where the application actually tells you it can't look up localhost or the node you are starting it on. Performance issues are usually also a good pointer, but plenty of times the problem just doesn't show.

I've seen DNS cause problems across the board: Xen, GFS, DRBD, Oracle and many others. Apart from applications that have problems with misconfigured DNS setups, there are also people who parse the output of dig to find out which nameserver was used by grepping for "SERVER", because the comment section of the dig output notes the nameserver it queried. Now imagine the output of dig containing any of the root nameservers, such as A.ROOT-SERVERS.NET. Indeed, the detection will fail.
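Here is a small sketch of why that grep is fragile. The sample text is a made-up, shortened imitation of dig output (192.0.2.53 is a documentation address), and both helper methods are hypothetical.

```ruby
# Illustrative, abbreviated dig-style output; not captured from a real run.
SAMPLE_DIG_OUTPUT = <<~OUT
  ;; ANSWER SECTION:
  .    518400  IN  NS  A.ROOT-SERVERS.NET.
  .    518400  IN  NS  B.ROOT-SERVERS.NET.

  ;; Query time: 12 msec
  ;; SERVER: 192.0.2.53#53(192.0.2.53)
OUT

# The fragile approach: every line containing "SERVER" matches,
# including answers that mention ROOT-SERVERS.
def naive_server_lines(dig_output)
  dig_output.lines.map(&:chomp).grep(/SERVER/)
end

# A stricter approach: only accept the ";; SERVER:" comment line
# and take the address in front of the "#port" part.
def dig_server(dig_output)
  match = dig_output.match(/^;; SERVER: ([^#\s]+)/)
  match && match[1]
end

puts naive_server_lines(SAMPLE_DIG_OUTPUT).length
puts dig_server(SAMPLE_DIG_OUTPUT)
```

Anchoring the pattern to the start of the comment line is what keeps the answer section from polluting the result.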

DNS problems can creep up on you in unexpected ways in every part of your infrastructure, so what can you do to prevent them?

The first and most important problem to solve is to ensure that, for every part of your network, you have a correct reverse mapping. RFC 1912 clearly points out that "Every Internet-reachable host should have a name." and "Make sure your PTR and A records match. For every IP address, there should be a matching PTR record in the in-addr.arpa domain. If a host is multi-homed, (more than one IP address) make sure that all IP addresses have a corresponding PTR record (not just the first one)."

So, if you have a 172.16 RFC 1918 subnet in your network you want to have a reverse zone that looks like:

more 172.16.0.db
$TTL 604800
$ORIGIN 0.16.172.in-addr.arpa.
@       IN      SOA     ns1.yournetwork.org. root.yournetwork.org. (
                        2010101501      ; Serial
                        3600            ; Refresh
                        3600            ; Retry
                        2419200         ; Expire
                        604800 )        ; Negative Cache TTL
;
        IN      NS      ns1.yournetwork.org.

1       IN      PTR     ns1.yournetwork.org.
2       IN      PTR     zion.yournetwork.org.
3       IN      PTR     matrix.yournetwork.org.
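The "PTR and A records match" rule from RFC 1912 is easy to check mechanically once you have both sets of records in hand. The sketch below works on plain hashes rather than live DNS; the method name is made up, and the newhost entry is a hypothetical host added to show a missing PTR.

```ruby
# Forward (A) records: hostname => IP address.
EXAMPLE_A_RECORDS = {
  "ns1.yournetwork.org"     => "172.16.0.1",
  "zion.yournetwork.org"    => "172.16.0.2",
  "matrix.yournetwork.org"  => "172.16.0.3",
  "newhost.yournetwork.org" => "172.16.0.4" # hypothetical, no PTR yet
}.freeze

# Reverse (PTR) records: IP address => hostname, as in the zone above.
EXAMPLE_PTR_RECORDS = {
  "172.16.0.1" => "ns1.yournetwork.org",
  "172.16.0.2" => "zion.yournetwork.org",
  "172.16.0.3" => "matrix.yournetwork.org"
}.freeze

# Report every A record whose address has no PTR, or whose PTR
# points back at a different name.
def mismatched_reverse_records(a_records, ptr_records)
  problems = []
  a_records.each do |name, ip|
    ptr = ptr_records[ip]
    if ptr.nil?
      problems << "#{ip} (#{name}) has no PTR record"
    elsif ptr != name
      problems << "#{ip} maps back to #{ptr}, not #{name}"
    end
  end
  problems
end

puts mismatched_reverse_records(EXAMPLE_A_RECORDS, EXAMPLE_PTR_RECORDS)
```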

For public network addresses, you sometimes have to talk to your upstream vendor for reverse mappings that match your domain. Sometimes they already have a reverse-map.customerof.theirdomain.com, but they usually are happy to make the updates or to delegate the administration whenever possible.

The second problem relates to updating DNS zonefiles: people often forget to update the serial number of their zonefile. After hours and hours, the newly added host still isn't known on the network or the Internet, even though the zonefile on the primary nameserver shows it correctly. The nameserver and its slaves simply haven't realized there is a new zonefile around yet. Failing to update the serial number of the zonefile is the classic mistake that everybody makes once in a while. A related trap: what if you are using a YYYYMMDDID serial and you accidentally put in a YYYYDDMMID value? Chances are you need to wait a whole year before you can continue to use your old scheme, or you can add 2147483647 to the now-incorrect value as documented here.
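The reason adding 2147483647 works is that zone serials are compared with 32-bit wrap-around arithmetic (RFC 1982), so a serial that is numerically smaller can still count as "newer". A minimal sketch of that comparison follows; the method names and the example serials are made up for illustration.

```ruby
SERIAL_MODULUS = 2**32
MAX_SERIAL_INCREMENT = 2**31 - 1 # 2147483647, the largest allowed jump

# True if `candidate` counts as newer than `current` under
# wrap-around serial arithmetic.
def serial_newer?(candidate, current)
  diff = (candidate - current) % SERIAL_MODULUS
  diff != 0 && diff < 2**31
end

# Intermediate serial to publish first when the current one is too high.
def serial_rollover_step(bad_serial)
  (bad_serial + MAX_SERIAL_INCREMENT) % SERIAL_MODULUS
end

bad     = 2010150801 # month and day swapped by accident
wanted  = 2010081502 # the serial you actually want to use next
interim = serial_rollover_step(bad)

puts serial_newer?(wanted, bad)     # the direct fix is rejected by the slaves
puts serial_newer?(interim, bad)    # the big jump is accepted
puts serial_newer?(wanted, interim) # and from there the wanted serial is newer
```

In practice you would publish the intermediate serial, wait until every slave has picked it up, and only then switch to the serial you wanted.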

Before I let you guys go, I do have to point you to a tool you can't live without: http://intodns.com/. This is an online service that will check your public DNS config and point out improvements you can make. Try it! It's worth your time.

By now, you must realize that everything is a funky DNS problem, and as @patrickdebois realized, DNS stands for Devops Need Sushi, but that's a different post :)