December 26, 2011

Day 25 - Learning from Other Industries

This was written by Jordan Sissel (semicomplete.com).

This post is late because, two days ago (December 24th), I went to the emergency room with appendicitis. Pretty sweet timing, eh?

After surgery, I was bored and took to observing nursing shift rotations - hand-offs of patients, documentation maintenance, etc. The entire process seemed to last about an hour for each shift change. I thought to myself, "Self, you have never been involved in any operations project, task, or fire hand-off that went as smoothly." Pretty sad realization.

Further, throughout my hospital stay (emergency room, surgery, and recovery), not once did I observe any two individuals debating passionately about how best to treat my illness. Data was available at every decision point. There was clear specialization: surgeons for the surgical tasks; nurses to help stabilize, comfort, and feed me (no small task); and administrative staff to handle the bureaucracy.

I'm not expecting systems administration to follow exactly what the medical industry does, just pointing out that it's nice to see a well-orchestrated system in action.

And the best part? I never once heard anyone say, "hey! We should try this new cool thing I heard about."

Further, I never once heard anyone suggesting that we remove my appendix using node.js and mongodb.

Happy New Year, everyone :)

Further Reading

December 24, 2011

Day 24 - Implementing Configuration Management in Legacy Environments

This was written by Dean Wilson (www.unixdaemon.net).

Implementing configuration management is perfect for green-field projects, where you have the freedom to choose technical solutions free from most existing technical debt. There are no old inconsistencies, and you have a window to build, deploy, and test all the manifests and recipes without risking user-visible outages or interruptions.

Unfortunately, most of us don't have the luxury of starting from scratch in a pristine environment. We have to deal with the oddities, the battle scars, and the need to maintain a good quality of service while evolving and maintaining the platform.

In this article, I'll discuss some of the key points to consider as you begin the journey of implementing a configuration management tool in an existing environment. I happen to use puppet and mcollective, but any config management tool will do - find one that works for you.

Discovery

You'll need to start with some discovery (manual, automated, whatever) to learn how your current systems are configured. How many differences are there across the sshd_config files? Where do we use that backup user's key? What's the log retention setting of the apache servers? Is it identical between the hosts we upgraded from apache 1 and the fresh apache 2 installs?
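Before any tooling is in place, even a plain ssh loop will answer the first of those questions. A minimal sketch (hostnames are placeholders):

# Count how many hosts share each sshd_config checksum
for host in web01 web02 db01; do
  ssh "$host" 'md5sum /etc/ssh/sshd_config'
done | sort | uniq -c -w32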

When I first started to bring puppet into my systems, I wrote custom scripts, ssh loops, and inventory-reporting cronjobs to find the information I needed. Now, there is an excellent solution for gathering the information you need in real time: MCollective.

With the appropriate mcollective agents you can compare, peer into, and probe nearly every aspect of your infrastructure. The FileMD5er agent, for example, will show you groups of hosts that have identical config files. Using this, you can partition the environment and the work, section by section. This will help you find smaller amounts of work to build into puppet.

One of my favourite current tricks is to query config file settings using Augeas, via the AugeasQuery agent.

# Does your ssh permit root logins?
$ mco rpc augeasquery query query='/files/etc/ssh/sshd_config/PermitRootLogin' --with-agent augeasquery -v
...

test.example.com : OK
    {:matched=>["no"]}

test2.example.com: OK
    {:matched=>["yes"]}

No more crazy regexes to find config settings!

You can learn more about how the Blueprint tool can help you with this configuration discovery process in "Reverse-engineer Servers with Blueprint" (SysAdvent 2011, Day 12).

Short Cycles

One of the key factors to consider when starting large scale infrastructure refactoring projects is that you'll need information and you'll need it quickly.

You should always be the first to know about things you've broken. Your monitoring system, centralised logging, trend graphing, and your config management system's reporting server (like Puppet Dashboard or Foreman) will become your watchful allies.

The speed at which you can gather information, write the manifest, deploy, and run your tests is more important than you'd think. Keeping this cycle short will encourage you to work in smaller sections, reducing the gap between making a change and testing it and keeping the possible problem scope small. Tools like MCollective, mc-nrpe, and nrpe runner will enable you to rapidly verify your changes.
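For example, assuming you have the MCollective NRPE agent plugin deployed (the check and class names here are just examples), re-running a Nagios check across every host in a class is a one-liner:

# Re-run the disk check on every host carrying the webserver class
$ mco nrpe check_disk --with-class webserver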

Version Control

The short version is: version control everything.

There's no reason, whether you are in a team or working alone (especially when working alone!), to not keep your work in a version control system (such as subversion or git). "That change didn't work, so I've rolled it back". Really? By hand? Are you sure you didn't forget to remove one of the additions? Why would you shoulder the mental burden of knowing all the things you changed recently when there are so many excellent tools that will do it for you and allow easier, auditable rollbacks to old revisions?
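Even working alone, the workflow can be as small as this sketch (file names and commit messages are made up):

# Record a manifest change, then roll it back cleanly if it misbehaves
$ git add manifests/sshd.pp
$ git commit -m "Disable root logins over ssh"
$ git revert HEAD                        # undo via a new, auditable commit
$ git log --oneline manifests/sshd.pp    # who changed this file, and when?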

Beyond the benefits of providing an ever-present safety net, using a VCS provides a nice way to tie your exact change into incident, audit, or change control systems - a perfect place to do smaller code reviews and a basic, but permanent, collection of institutional knowledge. Being able to search through the puppet logs and reports when debugging an issue to find exactly which resources changed and when, then pull up the related changeset, the actual reason for it, and who you should ask for help, changes your process and reports from best guesses based on people's memories to quick fixes backed by supporting evidence.

On a positive note, it's amazing how much you can learn from reading through commits from things like DBAs doing performance tuning. Seeing which settings were changed and why can save you and your team from forgetting something or making the same mistake or redoing parts of the work again in the future.

Use Packages

Many older environments have fallen into the habit of pushing tarred applications around (and sometimes compiling software on the hosts as needed). This isn't the best way to deploy in general, and it will make your manifests ugly, complicated, sprawling messes. In most cases, it's not that hard to produce native packages, and tools like fpm make the process even easier.
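As a rough sketch of how little work this can be, here's fpm turning a directory tree into an RPM and publishing it to a local yum repository (the package name, version, and paths are made up):

# Build an RPM from a directory tree instead of shipping a tarball
$ fpm -s dir -t rpm -n myapp -v 1.2.3 --prefix /opt/myapp -C ./build .
# Publish it to a local yum repository
$ cp myapp-1.2.3-*.rpm /srv/repo/ && createrepo /srv/repo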

By actually packaging software and running local repositories you can use native tools, such as yum and apt, in a more powerful way. Processes become shorter and have simpler moving parts, more meta information becomes available, and it's possible to trace the origin of files. You can then use other tools, such as puppet and MCollective to ease upgrades, additions, and reporting.

You can learn more about the world of software packages from "A Guide to Package Systems" (SysAdvent 2011, Day 4).

As an aside, there are differing opinions on whether packages should handle related tasks, such as deploying users. However, once you've reached the point where you can have this argument, you've surpassed most of your peers and will have an environment that gives you the freedom to do whichever you choose.

You can learn more about overlaps in functionality of packaging and config management systems in "Packaging vs Config Management" (SysAdvent 2010, Day 23).

Baked-in Goodness

As you bring more of your system under puppet control, you'll begin to experience a continuously improving system baseline. Every improvement you make is felt throughout the infrastructure, and regressions become infrequent, easy to spot, and quick to remedy. For example, you should never wonder if you've deployed nginx checks to a new server, it should be impossible to have a mysql server without your graphing applied, and every wiki you deploy should be added to the backup system without you intervening.

By having infrastructure manifests, such as a monitoring module, and writing some generic "add monitoring check" definitions, you can sprinkle reuse throughout other, more functionality-focused modules. In my test manifests, for example, nearly every class has nagios checks associated with it, and every service declares the ports it listens on. This might not sound like much, but it means it's impossible for me to deploy a server without monitoring, and every time I add a new nagios check, all the hosts with that role receive it. Further, generating security policies, firewall rule sets, and audit documents is done automatically for me and requires no manual data gathering. Being able to supply people with links to live documentation is a wonderful way to remove awkward manual steps from reports, audits and inventories.

Involve Everyone

You don't have to agree with the DevOps movement and ideas to use configuration management, but some of its principles (such as communication, openness, shared ownership and responsibility) are worthwhile and have a low overhead once you start to config manage everything. Look for simple requests for information from other teams. Do they want to know which cronjobs run and when? What modules the apache servers have configured? Which rubygems you've deployed on the database servers? (from packages!)

By sharing configuration modules, you help others develop a balanced understanding of the platform. By sharing commit logs, audit reports, error reports, and so on, you let them reclaim the time they would otherwise waste raising support tickets, without it feeling like your coworkers or customers are wrestling your time away.

At $WORK, a large percentage of our developers are comfortable reading puppet manifests, running MCollective commands, and using custom Foreman pages and cgi-scripts to investigate issues and answer their own queries without the sysadmins being involved. This has reduced our workload, given them quicker answers, and allowed them to ask targeted questions after doing their own research.

Such things increase efficiency, communication, and happiness.

It's not all about developers either. With a custom puppet define and a little reporting wrapper, we can expose the list of ports required by every machine to our network teams and auditors along with accurate timestamps showing when information was gathered. Using MCollective and custom database facts, our DBAs have dashboards showing deployed packages and services, current running configuration, and the ability to gather ad-hoc real time information from any of our systems: production, development or QA and even compare the differences between them.

Pick your fights

While you're growing accustomed to and skilled with your new tools, you need to be aware of their strengths. As the person pushing this change, you have to prove your way is at least as good as (and hopefully better than) the current practice. Because of this, some tasks are bad places to prove your point, and you need to be wary of them. Try to find some low-hanging fruit instead.

Large monolithic deployments or complex configuration scripts can be handled if you break them down into manageable components. Unfortunately, tight coupling and intertwined code make these situations bad places to start. It's never a good opening move to explain that it took three hours to pick apart a shell script that seems to work fine. Vaguely-defined processes are the other common tar pit. Bringing attention to processes that people seem to do slightly differently is another great way to unite people against what you are trying to accomplish.

You want to avoid resistance caused by confusion. Find and tackle tasks that make sense to put into configuration management, first.

User accounts, for example, are a complex area to automate. While it starts with a simple "deploy the admin accounts, add them to wheel, and push the SSH keys," it can easily become a sprawl of teams needing different sets of access on a semi-rotating basis.

It's worth noting that your config management efforts will often be considered slower, especially when starting out. Writing a ten-line throwaway shell script or hand-hacking something will be quicker in the short term than writing clean manifests that include monitoring, metadata, and tests, but the comparison is unfair. Remember, you're building flexible, reproducible systems that support future change - not focusing on a single server that just needs to work now and only now.

Roles, not hosts - No more snowflakes

As you build your collection of configuration management techniques, you'll find special snowflake hosts: machines performing several small but important tasks; ones hand-constructed under time constraints (and needing that one extra special config option) - boxes you'll "only ever have one of," right? Don't fall into the trap of assuming any machine is special or that you'll only ever have one of it. Building a second machine for disaster recovery, a staging version, or even an instance for upgrade tests means there will be times when you'll want more than one. If you find yourself adding special cases to your config management ("if the hostname is 'foo.example.com'..."), then it's time to stop, take a step back, and consider alternatives. Any list, especially of resource names, that you maintain by hand is a burden and will eventually be wrong.

A sign that things are starting to click is when you stop thinking in terms of hosts and start thinking in roles. Once you begin assigning roles to hosts you will find the natural level of granularity you should be writing your modules at. It's common to revisit previously written modules and extract aspects of them as you discover more potential reuse points.

You can learn more about configuration in terms of roles instead of machines in "Host vs Service" (SysAdvent 2008, Day 7).

Conclusion

With a little care and consideration it's possible to integrate configuration management into even the largest legacy infrastructures while enjoying the same benefits as newer, less encumbered projects.

Take small steps, master your tools and good luck.

Further Reading

December 23, 2011

Day 23 - All The Metrics! Or How You Too Can Graph Everything.

This was written by Corry Haines.

As your company grows, you may find that your existing metric collection system(s) cannot keep up. Alternatively, you may find that the interface used to read those metrics does not work with everything that you want to collect metrics for.

The problems that I have seen with existing solutions that I have used are:

  • Munin: Fails to scale in all respects. Collection is more complicated than it should be, and graphs are pre-rendered for every defined time window. Needless to say, this does not scale well and cannot be used dynamically.
  • Collectd: While this system is excellent at collecting data, the project does not officially support any frontend. This has led to a proliferation of frontend projects that, if taken together, have all of the features you need, but no one frontend does everything.
  • XYMon: It's been some time since I used this, and I have not used it on a large set of systems. My guess is that it would suffer from some of Munin's issues.

Enter Graphite

Graphite is a collection of services that can replace or enhance your existing metric collection setup. Yes, it's written in python... but I like python.

The major components are:

  • Whisper: Replaces RRD with a vastly simpler storage-only system. Unlike RRD, whisper cannot graph data directly. Also unlike RRD, you can actually read and understand the entire codebase in less than an hour (only 725 lines of well-commented python).
  • Carbon: Takes data on a variety of interfaces (TCP, UDP, Pickle, AMQP) and stores the data into whisper.
  • Graphite-webapp: Graphs data from whisper or RRD files.

The best thing about the components being independent is that you can run graphite on your existing RRD data with no hassle. While there are advantages to using whisper, it is not required to get the power of graphite.
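If you do feed Carbon directly, getting data in is trivial: its plaintext listener (TCP port 2003 by default) accepts simple "metric value timestamp" lines from anything that can open a socket. A minimal sketch using netcat (the hostname and metric path are examples):

# Send a single datapoint to carbon's plaintext port
echo "webservers.web1.load.longterm 0.42 $(date +%s)" | nc graphite.example.com 2003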

The only negatives that I currently hold against graphite are:

  • The documentation is still a bit lacking, though they are working to improve this. You can invoke the community (mailing lists, etc) as a workaround.
  • The learning curve can be a bit steep. While there is an interface to see all of the functions, you still need to learn how they are applied. This is offset by the ability to save named graphs for all users to see.
  • Feedback is a bit lacking. After a graph is requested, it is difficult to tell whether it is still being rendered or has simply failed in the backend.
  • They use Launchpad, and thus Bazaar, for their project management and source control. In a post-github world, this is starting to get a bit painful.

The Power of Filters and Functions

As wonderful as whisper and carbon are (and they really are worth using), the true power of graphite lies in its web interface. Unlike some other interfaces, graphite treats each metric as an independent data series. So long as you have an understanding of the system, you can apply functions (think sum, avg, stddev, etc.) to the metrics either by themselves, or more often, in aggregate.

In addition, you can use wildcards to select multiple machines quickly. While you could do a sum operation like this: sumSeries(host1.load,host2.load,etc), you could more easily type sumSeries(*.load).
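The same expressions work outside the browser too: the webapp's render API takes a target parameter containing any function chain, which makes it easy to script graph generation or pull raw data. A rough sketch, assuming the webapp is reachable at graphite.example.com:

# Render a PNG of summed webserver load for the last day
curl -o load.png 'http://graphite.example.com/render?target=sumSeries(webservers.*.load.longterm)&from=-1d'
# Fetch the same datapoints as CSV instead of an image
curl 'http://graphite.example.com/render?target=sumSeries(webservers.*.load.longterm)&from=-1d&format=csv'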

Filter Example

As an example, if I wanted to find overloaded webservers, I could construct a query like highestAverage(webservers.*.load.longterm, 3), producing:

[Graph: the three webservers with the highest average load]

Stacking example

Another example: graphing the amount of unused memory on the webservers (time for more memcached if there's plenty free!) with movingAverage(webservers.*.memory.free, 10), producing:

[Graph: movingAverage(webservers.*.memory.free, 10), with the series stacked]

Note that I am also creating a moving average over 10 datapoints here. Also, the series are stacked to produce a sum while still showing which server is responsible for each slice.

Functions Are the Best!

And this is only a small selection of the functions available to you. Moreover, you can write your own! And easily too! Here is an example function in graphite:

# A function to scale up all datapoints by a given factor
def scale(requestContext, seriesList, factor):
  for series in seriesList:
    # Rename the series so the graph legend shows what was done to it
    series.name = "scale(%s,%.1f)" % (series.name, float(factor))
    for i, value in enumerate(series):
      # safeMul is an existing Graphite helper that multiplies values
      # while tolerating None (missing) datapoints
      series[i] = safeMul(value, factor)
  return seriesList

Graphite in Production

We are currently collecting >93,000 metrics every 10 seconds. Most of the data is gathered on machines using collectd and then passed to a proxy written by sysadvent's editor. The proxy then ships all of the data, via TCP, to our central Carbon node.

All of the data is consumed by carbon and stored on a single machine with six 10k SAS drives in a RAID 10 array. Although this disk setup is not enough to write every datapoint in real time, carbon batches up the data and writes whole sets at once. It only needs to use about 300 MB of RAM for caching.

In reality, this hardware is probably overkill for our current workload. While testing, I was running about 50,000 metrics on four 7.2k SATA drives in a RAID 10 and the machine was doing just fine. It was using several GB of RAM to cache the data, but it was still able to keep up.

In Closing

If you are considering the installation of a metric gathering system, I would absolutely recommend Graphite. If you are using Collectd or Munin already, you can try the graphite web interface without changing how you collect metrics. It only takes a few minutes to set up and might give you better insight into your systems.

Further Reading

December 22, 2011

Day 22 - Load Balancing Solutions on EC2

This was written by Grig Gheorghiu.

Before Amazon introduced the Elastic Load Balancing (ELB) service, the only way to do load balancing in EC2 was to use one of the software-based solutions such as HAProxy or Pound.

Having just one EC2 instance running a software-based load balancer would obviously be a single point of failure, so a popular technique was to do DNS Round-Robin and have the domain name corresponding to your Web site point to several IP addresses via separate A records. Each IP address would be an Elastic IP associated with an EC2 instance running the load balancer software. This was still not perfect, because if one of these instances went down, users directed to it via DNS Round-Robin would still get errors until a replacement instance was launched.

Another issue that comes up all the time in the context of load balancing is SSL termination. Ideally you would like the load balancer to act as an SSL end-point, in order to offload the SSL computations from your Web servers, and also for easier management of the SSL certificates. HAProxy does not support SSL termination, but Pound does (note that you can still pass SSL traffic through HAProxy by using its TCP mode; you just cannot terminate SSL traffic there).

In short, if Elastic Load Balancing weren’t available, you could still cobble together a load balancing solution in EC2. Now that you can use the ELB service, however, there is no reason to ‘roll your own’ anymore. Note that HAProxy is still the king of load balancers when it comes to the different algorithms you can use (and a myriad of other features), so if you want the best of both worlds, you can have an ELB up front, pointing to one or more EC2 instances running HAProxy, which in turn delegate traffic to your Web server farm.
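As a sketch of that setup with the ELB command-line tools, creating the load balancer and registering a couple of EC2 instances running HAProxy behind it looks roughly like this (names, zones, and instance IDs are examples):

    # Create a load balancer listening on port 80
    $ elb-create-lb haproxy-lb --availability-zones us-east-1a --listener "protocol=http,lb-port=80,instance-port=80"
    # Register the EC2 instances running HAProxy behind it
    $ elb-register-instances-with-lb haproxy-lb --instances i-1234abcd,i-5678efgh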

Elastic Load Balancing and the DNS Root Domain

One other issue that comes up all the time is that an ELB is only available as a CNAME (this is because Amazon needs to scale the ELB service in the background depending on the traffic that hits it, so they cannot simply provide an IP address). A CNAME is fine if you want to load balance traffic to www.yourdomain.com, since that name can be mapped to a CNAME. However, the root or apex of your DNS zone, yourdomain.com, can only be mapped to an A record, so in theory you could not use an ELB for yourdomain.com. In practice, however, there are DNS providers that allow you to specify an alias for your root domain (I know Dynect does this, as does Amazon’s own Route 53 DNS service).

Elastic Load Balancing and SSL

The AWS console makes it easy to associate an SSL certificate with an ELB instance, at ELB creation time. You do need to add an SSL line to the HTTP protocol table when you create the ELB. Note that even though you terminate the SSL traffic at the ELB, you have a choice of using either unencrypted HTTP traffic or encrypted SSL traffic between the ELB and the Web servers behind it. If you want to offload the SSL processing from your Web servers, you can choose HTTP between the ELB and the Web server instances.

If however you want to associate an existing ELB instance with a different SSL certificate (say for instance you initially associated it with a self-signed SSL cert, and now you want to use a real SSL cert), you can’t do that with the AWS console anymore. You need to use command-line tools. Here’s how.

Before you install the command-line tools, a caveat: you need Java 1.6. If you use Java 1.5 you will most likely get errors such as java.lang.NoClassDefFoundError when trying to run the tools.

  1. Install and configure the AWS Elastic Load Balancing command-line tools

    • download ElasticLoadBalancing.zip
    • unzip ElasticLoadBalancing.zip; this will create a directory named ElasticLoadBalancing-version (latest version at the time of this writing is 1.0.15.1)
    • set environment variable AWS_ELB_HOME=/path/to/ElasticLoadBalancing-1.0.15.1 (in .bashrc)
    • add $AWS_ELB_HOME/bin to your $PATH (in .bashrc)
  2. Install and configure the AWS Identity and Access Management (IAMCli) tools

    • download IAMCli.zip
    • unzip IAMCli.zip; this will create a directory named IAMCli-version (latest version at the time of this writing is 1.3.0)
    • set environment variable AWS_IAM_HOME=/path/to/IAMCli-1.3.0 (in .bashrc)
    • add $AWS_IAM_HOME/bin to your $PATH (in .bashrc)
  3. Create AWS credentials file

    • create a file with the following content, one entry per line: AWSAccessKeyId=your_aws_access_key and AWSSecretKey=your_aws_secret_key
    • if you named this file aws_credentials, set environment variable AWS_CREDENTIAL_FILE=/path/to/aws_credentials (in .bashrc)
  4. Get DNS name for ELB instance you want to modify

    We will use the ElasticLoadBalancing tool called elb-describe-lbs:

    # elb-describe-lbs
    LOAD_BALANCER  mysite-prod  mysite-prod-2639879155.us-east-1.elb.amazonaws.com  2011-05-24T22:38:31.690Z
    LOAD_BALANCER  mysite-stage   mysite-stage-714225413.us-east-1.elb.amazonaws.com    2011-09-16T18:01:16.180Z
    

    In our case, we will modify the ELB instance named mysite-stage.

  5. Upload SSL certificate to AWS

    I assume you have 3 files:

    • the SSL private key in a file called stage.mysite.com.key
    • the SSL certificate in a file called stage.mysite.com.crt
    • an intermediate certificate from the SSL vendor, in a file called stage.mysite.com.intermediate.crt

    We will use the IAMCli tool called iam-servercertupload:

    # iam-servercertupload -b stage.mysite.com.crt -c stage.mysite.com.intermediate.crt -k stage.mysite.com.key -s stage.mysite.com
    
  6. List the SSL certificates you have uploaded to AWS

    We will use the IAMCli tool called iam-servercertlistbypath:

    # iam-servercertlistbypath
    arn:aws:iam::YOUR_IAM_ID:server-certificate/stage.mysite.com
    arn:aws:iam::YOUR_IAM_ID:server-certificate/www.mysite.com
    
  7. Associate the ELB instance with the desired SSL certificate

    We will use the ElasticLoadBalancing tool called elb-set-lb-listener-ssl-cert:

    # elb-set-lb-listener-ssl-cert mysite-stage --lb-port 443 --cert-id arn:aws:iam::YOUR_IAM_ID:server-certificate/stage.mysite.com
    OK-Setting SSL Certificate
    

That's it! At this point, the SSL certificate for stage.mysite.com will be associated with the ELB instance handling HTTP and SSL traffic for stage.mysite.com. Not rocket science, but not trivial to put together all these bits of information either.

Further Reading

December 21, 2011

Day 21 - Automating Web Monitoring

This article is written by Brandon Burton, who can mostly be found posting lolcats and retweeting @solarce, though he occasionally posts interesting links to things sysadmin, devops, and unix.

As systems administrators, we all know that it's not in production until it's monitored, but this isn't always as simple a rule to live by as it may sound. Not all web applications, for example, are easily monitored through traditional monitoring solutions such as Nagios, Zenoss, or various commercial tools. These tools tend to take a "curl | grep" approach to monitoring, or they may support somewhat more complex POSTing of XML or JSON data and validation of the returned data. But often the key parts of an application being deployed into production involve complex browser interactions and behaviors - AJAX, or some other session or transaction that traditional monitoring frameworks don't have an easy way to accommodate.

Enter Selenium. Selenium is a mature and robust framework for doing complex interactions with web applications. It originated as a tool at the consulting company ThoughtWorks as a way to do testing against web applications by driving a web browser. Since its release, it has seen the development of numerous tools, including browser plugins to make it easy to develop Selenium tests quickly and easily, language bindings to write tests in pretty much every major language, and tools to run many browsers across many operating systems, in parallel.

Additionally, services, such as BrowserMob and Sauce Labs, have grown around the Selenium ecosystem to help you do testing and monitoring in a scalable and offsite fashion. It is these services that we'll focus on utilizing in this blog post.

So what does all this mean? It means that we have a mature and robust toolset that we can utilize and perform testing and monitoring of the complex web applications that we are deploying into production.

Getting started

So how do we get started? My preferred method is to begin by developing tests locally. You can use the Selenium IDE, but for this example I'll show a Firefox extension called Sauce Builder which makes it a snap to build and run your first test locally.

To get started you'll need Firefox installed, then go to the Sauce Builder download page and walk through getting the extension installed.

Once you've got the Sauce Builder extension installed, it is time to build our first test.

I'm going to walk you through building a test to search for jelly beans on Amazon.

  1. Open Firefox
  2. Click on Tools -> Sauce Builder
  3. Enter "amazon.com" in the Start Record prompt and click Go
  4. Enter "jelly beans" for the search term
  5. Click Go
  6. Click on the first search result; for me this was "Kirkland Signature Jelly Belly Jelly Beans 49 Flavors (4 Lbs)"
  7. Go back to the Sauce Builder window and click Stop recording.
  8. Now that we've recorded a test, we should save it for safe keeping. Click File -> Save or Export -> Choose HTML as the format and name it, then click Save.

As you can see from the test we've recorded, the test is composed of a series of actions, and each action has one or more options associated with it.

Here is a short video of recording your first test

Digging into how to modify and adapt tests is beyond the scope of what I want to cover in this post, but the following links are some good places to go deeper:

Now that we've recorded our first test, it is time to run it.

  1. Click on Run and choose Run test locally.
  2. The test will begin running in the currently selected tab in Firefox.
  3. Obviously this is a pretty simple test and you could do a lot more with it, including going through adding the item to a cart, checking out, and buying the order. But for the purposes of getting started, it's a good place to stop.

Here is a video of running your first test

The next thing we want to do, since our focus is on monitoring, is add some verification steps to each page load. This step is crucial in making our test do the same kind of checking that your traditional curl URL | grep STRING style monitoring did, but now it's integrated into our browser-driven mode of execution.

  1. Go to the Sauce Builder window
  2. Mouse over the second step and choose New step below
  3. Select the new step
  4. Choose edit action
  5. Select the assertion option
  6. Choose page content
  7. Choose assertText
  8. Click Ok
  9. Choose locator and enter "link=Your Amazon.com"
  10. Click Ok
  11. Choose equal to and enter the string "Your Amazon.com"
  12. Click Ok
  13. Click on Run and Run test locally

The test should run successfully. If it does not, you may want to click on the locator, choose Find a different Target, and use the tool to select the element whose text you're asserting.

This is a critical step as the assertions are somewhat brittle and must be maintained as your application changes over time. For more details, see help on choosing good locators.

Here is a video of adding the assertion to your test and running it locally

Using Sauce Labs for Testing

Now that you've gotten your test running locally and you've added some assertions to make the test useful for monitoring, it is a good idea to run the test externally. As previously mentioned, the Sauce Labs folks run a service that executes your tests in the cloud, and they are nice enough to offer a free plan that gives you 200 "execution" minutes per month and the ability to run your tests under multiple browsers and operating systems with ease. Plus, your jobs are stored, with logs, screenshots, and a video recording of the whole test kept for later review and analysis. So now you're thinking, "where do I sign up?!"

To sign up for the free plan, do the following.

  1. Go to https://saucelabs.com/signup
  2. Enter a username
  3. Enter your email address
  4. Enter a password
  5. Click Sign Me Up

Now configure your Sauce Builder installation to use your free account

  1. Login to https://saucelabs.com/ and click on View My API key
  2. Copy your API key
  3. In your test, choose Run -> Run on Sauce OnDemand
  4. Leave the default Linux - Firefox 3.0
  5. Click Run
  6. When prompted if you have a Sauce Labs account, choose Yes
  7. Enter your username and API key
  8. Choose Save
  9. Your test will start running. Grab a snickers.
  10. You'll end up with a Job URL that looks something like https://saucelabs.com/jobs/6f4629f04dad85cd7803d8049ec00888 (which I've made public, since there is nothing private in it.)
  11. Review the details of the test. As you can see, you get the following for each test:
    • Platform
    • Start and End Times
    • Duration
    • Status
    • Breakdown of each Selenium command that's executed
    • Screenshot of the final page of the test
    • Video recording of the whole test run.

At this point you've successfully executed a test on Sauce Labs. I recommend you review the following to get a full idea of Sauce Labs' features, which include being able to use the service programmatically from various languages - something beyond the scope of what I'm covering in this post.

Using BrowserMob for monitoring

So you've succeeded in getting your test run locally, you've run it externally in the "Cloud", and now you're thinking "wasn't I promised I could use this for monitoring?". Yes, you were, and that's where BrowserMob comes in.

While BrowserMob's primary product is focused on load testing, they've also built a great monitoring product, and that's what we'll be using to get our monitoring up and running.

BrowserMob is kind enough to offer a free plan, so let's start with getting signed up.

To sign up for the free plan, do the following.

  1. Go to https://browsermob.com/website-monitoring-load-testing-signup
  2. Enter all the required info.
  3. Click Sign Up
  4. Complete the email verification.
  5. You're done.

Now upload and verify your first test.

  1. Go to https://browsermob.com/account/overview
  2. Click on Scripts
  3. Click on Upload Selenium Browser Script
  4. Give it a Name
  5. Click Browse and locate the test file you saved from Sauce Builder
  6. Click on Upload
  7. It should automatically validate.
  8. If it passes validation, you should then see Revalidate, View Log, and Screenshot links
  9. Check out the log and screenshot to get an idea of what will be recorded for each monitoring test run.

Here is a short video showing uploading and verifying your first test

Let's configure an email address for notifications

  1. Click on Monitoring
  2. Click on Notifications
  3. Click on create one
  4. Enter a name
  5. Confirm the contact name and email, it will default to what you registered with
  6. Click Create

Now let's set up a monitoring job.

  1. Click on Monitoring
  2. Click on Schedule
  3. Give the job a name
  4. Select a Frequency. (With the free account, you can run a simple test every 12 hours; for higher frequency or more complicated tests, you'll need to purchase a paid account.)
  5. If you want to do an alert, click create
  6. Select a location
  7. Select your notification preference
  8. Click Activate Now
  9. The job will be scheduled and will run at the next interval after the minute the job was created.
  10. Since you just signed up for a trial, you can get the test to run a bit sooner, but only a couple times, so we'll do that now, so we can see what it looks like.
  11. Click Edit next to the test
  12. Change Frequency to 10 minutes
  13. Click Save and Activate
  14. Set a timer for 12 minutes and wait; once it is done, we'll review what things look like. (Once you're done with this, you may want to revert to every 12 hours so that when your trial expires you won't be over your credits, or just pause/delete the monitoring job.)

Here is a video of creating the monitoring job

So now that that test has run, let's take a look at what it looks like.

  1. Click on Dashboard
  2. Mouse over the name of the job and click on the URL; it should look something like this: https://browsermob.com/monitoring/view/{some_id_here}
  3. You should see a chart that defaults to 1 day and shows you each test, with a bar showing each data point, based on the overall time it took to run the test. (This gives you some quick insight into how performance, as measured by execution time, is doing over time.)
  4. You can drill into each data point, and you'll get a waterfall-style breakdown of each test run: how long each element of the page took to load, etc.

Below is a screenshot of a test that has run for a few days.

[Screenshot: the BrowserMob "View Monitoring Job" page]

Here are a couple of tips on using custom bits of BrowserMob's API to make your tests that much more effective.

Setting variables.

Since BrowserMob scripts are written in JavaScript, declaring a variable is as simple as var zipcode = '90210'

Getting back data from a webpage.

I've only ever used this to get back the whole response from a page and use it as is, so you'd need to break out a bit of your own JS-fu if you want to use part of a response, but here's how I did it. The code below also shows using a previously declared variable in your request.

// Request an id for the zipcode we declared earlier; 'c' is the script's HTTP client object
var response = c.get("http://api.example.com:8080/id?"+zipcode)
// getBody() returns the response body as a string
var testid = response.getBody()

At this point the testid variable contains the string returned in the response from the request to http://api.example.com:8080/id?90210

Extra Logging

BrowserMob's JS API has a nice function called browserMob.log() which lets you log arbitrary data and it will show up in the raw logs that BrowserMob keeps for each test run. An example of this is

browserMob.beginStep("Step 2");
selenium.waitForPageToLoad(60000);
selenium.type("id=twotabsearchtextbox", "jelly beans");
browserMob.log('searched for jelly beans')
browserMob.endStep();

For more info on these and more functions, check out the BrowserMob API Documentation

What Next?

At this point you've successfully built a test, run it locally, run it in the "cloud", and deployed it to monitor every 12 hours with email alerts, and you're wondering what's next.

Well, among the things you could do next are:

  1. Load Testing through BrowserMob
  2. Get called or paged by sending your email alerts into PagerDuty
  3. Interact with your own web services by using the "getting data back" example from above

I've made a github repository with my Amazon.com example.

As a challenge and a way to motivate people to contribute and give feedback: for the 5 most interesting tests that people submit as pull requests on Github, I will send the submitters a package of stickers, including SysAdvent, Github, Riak, and more!

I hope you've found this post to be informative and would love feedback via email or Twitter on how you do end up using any or all of the services in this post.

December 20, 2011

Day 20 - Thoughts On Load Testing

This was written by Adam Fletcher (@adamfblahblah).

One task that often falls on the lonely sysadmin is load testing. In this article I'm going to talk about some philosophies and processes I used when doing load testing in my past roles.

I'm going to focus on testing the server side. There are a lot of articles on how to optimize the client-side experience, and it is very important that you are aware of both the client-side and server-side tuning changes so that you can give the customer the best experience.

Use Science

Science

(credit to XKCD. Buy the shirt!)

What I mean by that is that load testing is really experimentation. You're testing a hypothesis: is setup A better than setup B? Develop your hypothesis, design your experiment and measurements, and draw conclusions based on data, not feelings. Don't forget that you need to control as many variables as possible. Don't test on your VMs or on the staging server that the customer is also using.

Have Defined Targets

You can't determine if your architecture scales/is fast enough/can handle traffic during a disaster/won't crash when you launch/etc. without having something measurable that determines success. For example, instead of saying "make each page load in 400 milliseconds", it is better to say something like "Every page load must have all resources loaded within an average of 400 milliseconds, with a standard deviation of 25 milliseconds, with 1000 clients performing actions every X seconds on Y series of pages." You will then know when you are done load testing because all the measurements you are taking show you have achieved success.

Scale First

If the Y axis is latency and the X axis is number of workers, then scaling is keeping Y constant while you increase X to infinity. This is much harder than keeping the number of workers constant and lowering latency. The first thing to look at during load testing is the shape of your latency curve as load increases. Keep that curve flat and you're most of the way there.

Understand Your Traffic

If you're already suffering a load problem in production, great! Track the pages being requested and the path of the requests through your system. Use a tool like Google Analytics to get a picture of the diversity of the pages hit and the flow your users take through the product. You'll want to be able to model those flows in your load generation software.

For example, an online store may have a few different paths users commonly take through the system: arrive at the home page organically, search for a product, add to a cart, and check out; arrive at a special landing page for a sale, add to cart, checkout; arrive at the home page organically and use the customer services features; etc. If you viewed this store's traffic at a single point in time you could divide your simultaneous traffic up by percentage of pages hit: at a point in time, 30% of the requests are to /search, 20% are to /checkout, 40% are to /, and finally 10% are to /customer-service. This is the model you should use for load generation.

Your traffic also follows patterns that depend on the time of day. With the advent of (somewhat) elastic capacity allocation, you can model these usage patterns and adjust your capacity to fit the patterns of usage.

Furthermore, you need to be aware of client-side changes such as allowing your users to use HTTP pipelining, or inlining Javascript versus loading the Javascript in another HTTP request. Making the server scale requires understanding the client.

Know Your Resource Limits

You don't have infinite computing power. You don't have infinite money. Most importantly, you don't have infinite time. Be smart about how you use that time.

It's expensive to have people do all the load testing science we're talking about in this article. With a little thought, you can probably guess where your bottleneck is - I'm going to guess it is something related to your data storage. Use your systems knowledge to make your first hypothesis "If I remove the obvious bottleneck, the system will be faster".

Also, to paraphrase Artur Bergman, don't be a backwards idiot - buy some SSDs. They are expensive per GB but they are dead cheap per IOP/S. They're also cheaper than the time you are spending doing the load testing. You'll want to use these SSDs in the machines that have the highest IO load (and you know which machines those are, because you're measuring IO load, right?).

Graphs Lie

There was an excellent talk at Velocity this year about the dangers of trusting your graphs, given by John Rauser, entitled Look at Your Data. He pointed out that you have to be careful of the trap of representing many points of data at the same X as the mean of those points, because this representation hides the distribution of that data. This most commonly occurs when measuring latency during load testing, when you have many requests at time X that, when averaged, come to Z milliseconds. Plot Z for many Xs and you miss the distribution of the latencies at each X.

John's video explains it better, but if you look at this graph:

Request Latency Over Time

You'd think from this graph that everything is great - your latency went down!

But if we look at the distribution of our data at each sampling point:

Request Latency Over Time With Sample Distribution

We see that some of our users are having a really bad experience on our site.

A good example of a tool that doesn't have this problem is Smokeping. Here's an example of Smokeping telling me that my home internet connection has some jitter in latency:

Comcast ICMP Ping Latency

I've also put a gist up with the R code used to generate the graphs above here.

Measure Time and Resources Spent in Each Component

If you aren't instrumenting each piece of software in your stack you should start doing so. Instrument the entry point to your software and the exit point and graph this data over time. Combine this with even simple data from sar, iostat, other *stat tools, etc, and you can learn a lot about your code without ever firing up a profiler.

Learn And Use The Right Tools

Good tools will allow you to export the raw data in such a way that you can then do analysis on it. Tools that expose your system resource consumption metrics are critical, and it probably doesn't matter what you use as long as you are storing and graphing roughly what iostat, sar, vmstat, netstat, and top give you. Learn what each metric really means - do you know why your software is context switching 4,000 times a second? Do you know if that is bad (hint: probably)? How would that manifest itself in top?
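If you're not sure where to look for those answers, the stock procps and sysstat tools cover most of it - a few examples:

# System-wide context switches (the "cs" column) and run queue, every 5 seconds
vmstat 5
# Per-process voluntary/involuntary context switches (from the sysstat package)
pidstat -w 5
# Extended per-device IO statistics
iostat -x 5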

Learn to use the profiler that comes with your product's implementation language. Profilers are amazing things. If you can't use a language-specific profiler try a system-wide profiler such as oprofile or similar.

When you have all this data, use a real data analysis tool to look at it. Instead of Excel or a clone, consider learning a numerical computing language such as R or NumPy/SciPy. For example, in R or NumPy you can write a script that takes all of your raw resource consumption data (CPU, RAM, IOPS, etc.) and runs correlation tests against the latency data. Try to do that in Excel! Oh, and you can then use that script in your monitoring.

People often call load testing an art, but all that really means is that they're not doing science. Load testing can be challenging, but hopefully this article has given you some things to think about to make your load testing easier and more effective.

Further Reading

  • Learning R - a blog covering lots of cool visualizations and techniques in R.

December 19, 2011

Day 19 - Why Use Configuration Management?

This was written by Aleksey Tsalolikhin (http://verticalsysadmin.com/blog/). Illustrations by Joseph Kern

If you ask Wikipedia, "Configuration management (CM) is a field of management that focuses on establishing and maintaining consistency of a system."

Configuration management tools increase sysadmin efficiency and make sysadmin life better. As our systems grow larger and more complex, we need better tools to help us increase control and reliability over the ever-growing quantity and complexity in computing. Examples of such tools include Bcfg2, Cfengine, Chef, and Puppet - all of which are open source!

Configuring systems manually in interactive sessions is error-prone and extremely labor-intensive. Even with mostly-automated scripts, such as the typical "ssh and a for-loop" solution, pushing ad-hoc changes is still error-prone. For example, if a system is down for maintenance while a change is being pushed out over ssh, it will miss that change, and "state drift" will occur between it and other systems in the same class.
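The classic version of that approach looks something like the sketch below - note that any host that happens to be down or unreachable simply misses the change, silently (hostnames and the change itself are placeholders):

# Push an ad-hoc change to every host in a list; failures are easy to miss
for host in $(cat hosts.txt); do
  ssh "$host" 'echo "PermitRootLogin no" >> /etc/ssh/sshd_config'
done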

You want a tool that helps keep actual and desired state the same.

System imaging is a common strategy for dealing with the complexities of config management - make a copy of a system image, label it "gold master", and clone it to make new systems. While this approach helps crank out identically configured systems, it has the weakness that updating the master image can be a pain, and it does nothing to maintain the systems after their initial deploy. It is also not very auditable (what changed between golden image v1 and v2?).

Many sysadmins still configure systems with more traditional manual, ad-hoc, and hard-to-audit methods. In some cases, sysadmin teams build home-grown tools to solve these problems. An example of this is Ticketmaster, who released their own config management, "ssh and for loop" tool, and provisioning systems.

Why do we care to do this? Well, why do we administer systems? Correct configuration helps keep computer systems in use by human civilization.

CM tools free a sysadmin's time for more challenging and creative system engineering and architecture work, and for taking the naps which power such work.

Minimize Manual Effort

Minimize manual effort by automatically configuring new systems. This works well because repeatable work is best left to computers; they don't get bored, and they don't forget steps.

"Go away or I will replace you with a very small shell script" - you've probably seen this shirt before, right? How about hearing someone recommend "automating yourself out of a job"? Building systems and fighting fires without any tools is a slow task that is difficult to repeat accurately, and with many sysadmin skills being software-related, it is in your interest to automate system turn up, maintenance, and repair. Automation helps reduce time spent in corrective actions, reduces mental energy consumed, reduces stress, and increases business value and agility. Winning!

In using a config management system, you are implicitly documenting the system's "desired state" - Why is the system configured this way? What are its dependencies? Who cares about the system? This documenting capability helps protect against knowledge loss by moving configuration knowledge out of your brains and into a version control system. This helps defend against knowledge lost through forgetfulness or staff changes, and it also facilitates alignment of efforts on a multi-sysadmin team.

In general, configuration management is in the realm of "Infrastructure as Code". Once your infrastructure is represented in code, you can think about applying release engineering and other tools - tag a new policy as "unstable", test it, then move the new policy into the "stable" branch where servers will apply it.
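As a sketch of what that can look like with git (the branch names are just a convention, not a requirement of any particular tool):

# Develop and test a policy change on the "unstable" branch
git checkout unstable
git commit -am "Tighten sshd configuration policy"
# ...test it on a canary host...
# Promote the tested policy; servers pull their configuration from "stable"
git checkout stable
git merge unstable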

A Visualization

Sys Admin configures a server manually, ad hoc, and hands-on.

Sys Admin writes a configuration management tool program to configure a server. Then the CM tool (like a little sysadmin robot) configures the server.

Sys Admin takes a nap, while the CM tool configures more servers, and keeps checking and re-configuring the servers (as needed) to keep them in compliance with the program.

Getting Started

To encourage sysadmins to start using configuration management, the following is a rough guide to doing some small tasks in a few different open source configuration management tools, demonstrating what policies look like in each. Bourne shell examples are provided to aid understanding.

Using these examples

  • Bourne shell: Can be run on the command line or via cron
  • CFengine: Follow the quick start guide. In a nutshell, put the example into a promise bundle inside a policy file (example.cf) and run it from the command line with "cf-agent -f example.cf -b $bundlename"; or integrate it into the default policy set in promises.cf in the CFEngine work directory, often found in /var/cfengine/inputs.
  • Chef: Follow the Chef Fast Start guide
  • Puppet: Follow the Getting Started guide. For quick testing of these examples, you can write them to a file 'foo.pp' and execute them with puppet apply foo.pp. Puppet also supports a client-server model that is more common for production deployments.

Set Permissions on a File

  • Bourne shell

    chmod 600 /tmp/testfile
    
  • CFengine

    files:
        "/tmp/testfile"
             perms   => m("600");
    
  • Chef

    file "/tmp/testfile" do
      mode "0600"          
    end                  
    
  • Puppet

    file { "/tmp/testfile":
       mode => 0600;
    }
    

Create with some content

  • Bourne shell

    echo 'Server will be down for maintenance 2 AM - 4 AM' > /etc/nologin
    
  • CFengine

    files:
       "/etc/nologin"
            create     => "true",
            edit_line  =>  insert_lines("Server will be down for maintenance 2 AM - 4 AM");
    
  • Chef

    file "/etc/nologin" do
      content 'Server will be down for maintenance 2 AM - 4 AM' 
    end 
    
  • Puppet

    file { "/etc/nologin":
      ensure => present,
      content => "Server will be down for maintenance 2 AM - 4 AM";
    }
    

Install a package

  • Bourne shell

    yum -y install httpd
    
  • CFengine

    packages:  
        "httpd"
            package_policy => "add",
            package_method => yum;
    
  • Chef

    package "httpd" 
    
  • Puppet

    package { "httpd":
      ensure => present;
    }
    

Make sure a service daemon is running

  • Bourne shell

    ps -ef | grep -v grep | grep httpd >/dev/null

    if [ $? -ne 0 ]
      then /etc/init.d/httpd start
    fi
    
  • CFengine

    processes:
       "httpd"
            restart_class => "restart_httpd";
    
    commands:
     restart_httpd::
       "/etc/init.d/httpd start";
    
  • Chef

    service "http" do 
      action :start   
    end             
    
  • Puppet

    service { "httpd":
      ensure => running;
    }
    

Final Thoughts

There's going to be a learning curve to any config management system, but I have found that the benefits in being able to audit, repeat, test, and share "desired state" in code far outweigh any time spent learning the config management tools.

Further Reading

December 18, 2011

Day 18 - Why Businesses Do Things

This was written by Joseph Kern (www.semafour.net).

Imagine going your whole professional career as a sysadmin without ever understanding the OSI model - those seven simple layers that allow you to build an effective internal framework of network communications. Without this model, how would you even begin to understand larger and more complex systems, or the complex interactions between multiple systems?

You might get by for a time; hard work and dedication can take you a long way. But you would never be able to progress beyond a certain point. The problem space becomes too complex to brute force.

Now imagine that successfully managing and running a business is at least as complex as managing a network. Managing 1000 computers is much easier than managing a business of 1000 people. I'd like to take you into the shallow end of business management and show you how the services that we sysadmins maintain are viewed from a business perspective.

Fortunately this framework is simple and avoids any hand waving. We just need this three word phrase, "differentiation and neutralization."

Differentiation builds services that create competitive advantage. Neutralization builds services that seek to maintain competitive equilibrium. This interplay is the heart of what drives business and directs the supporting activities of enterprise level IT. Working DNS is needed for almost all aspects of modern enterprise IT infrastructure, and it will serve as a technical example for this discussion.

Building services that neutralize competitive advantage usually involves buying solutions; these are often disguised as "industry best practices", and have many accompanying white papers offered as proof. More often than not this niche is filled by Microsoft or other large software vendors. You can buy your way out of a problem.

When building a DNS service it is most often thought of as a way to neutralize the advantage of other organizations. Seldom is thought given to how an organization might run DNS 30% "better" (not necessarily faster or any particular quality) than its competitors. In most cases, "better" would not matter at all.

"Better" does not create an advantage with a neutralizing service. Instead, it in fact creates a disadvantage. Time, attention, and resources are being funneled into a project that creates no value from a business standpoint. The business does not (and should not) care about innovating in services that they consider neutralizing. (See footnote #1)

Building services that create competitive differentiation is very different from neutralization, as most of these services are built rather than bought. They tend to be highly customized to the environment. The prime consideration for these services is adaptability: you must be able to extend the software providing the service, because that is what lets you outmaneuver your competition. You are able to think your way out of a problem.

Turning Neutralizing into Differentiating

OpenDNS took a service that was neutralizing and rebuilt it from the ground up, adding services such as anti-phishing, content filtering (based on domain), and reporting. These services created differentiation in their business model and offered something new to the market. OpenDNS created a reason to build "better" DNS services, because this is their core business model and their competitive advantage.

As it turns out, applying security and content filtering at the DNS level works equally well across all devices, all the time, and requires no client installation. Now other businesses must move to neutralize the differential advantage by creating their own services to match. Norton, for example, has followed suit with their Norton Everywhere product, offering DNS services that largely mirror OpenDNS. (See footnote #2)

OpenDNS must now continue to differentiate their services from their competitors. OpenDNS recently started offering DNSCrypt, which creates an encrypted channel for DNS queries between the client and the DNS server. Consider it to be SSL for DNS. No doubt, there will be other service providers that follow suit, creating their own DNSCrypt implementations. (See footnote #3)

Why do businesses seemingly chase the tails of their competitors? Because if an organization declines the opportunity to neutralize the advantage of its competition, it will be excluded from further innovation in that field and may be locked out of the market entirely. A technical term for this is a "feature". As the differentiation of services increases, the cost to enter the market (the table stakes) increases accordingly.

Why Should You Care?

Senior sysadmins and engineers need to understand not only how to build a service, but also why we are building it and what the business requires from the deployment. Understanding the complete picture, we will know what technology is required, how it needs to be implemented, and how much effort we should put into the project.

Both the engineer and the business get something valuable from this understanding - keeping time and attention focused on important projects. The next time you are asked to deploy a new service ask yourself (and your management) one simple question:

"Is this a service that neutralizes or differentiates?"

Knowing this helps you set your own expectations. If the service is differentiating and you want to spend energy improving it, that knowledge helps you make the case to your team and managers that you should be working on it. If it is neutralizing, that knowledge tells you not to spend time and energy improving something the business gains nothing from when improved, and it can keep you from burning out optimizing things that are effectively unimportant.

Footnotes

  1. Why do you think SharePoint is so popular? It's not because it does everything well ...

  2. In the light of Windows 8 coming preloaded with Anti-Virus software, Norton is facing an almost complete lockout of their traditional market.

  3. The great thing about standards is that there are so many to choose from.

Further Reading

December 17, 2011

Day 17 - Speaking the Same Language

This was written by Jordan Sissel (semicomplete.com).

Language is important.

You've probably had disagreements or confusions that, during or after, you realized were caused by miscommunication or misinterpretation. Something got lost in translation, right?

Language is important, so how do you resolve these issues? Lawyers do it by dedicating great lengths of text to defining terms in order to eliminate confusion. What was the last legal document you read? Perhaps the constantly-changing iTunes EULA? Did you read it? Did you skip reading it because it was 70 pages long? Was it readable? Was it plain English?

What was the last legal document you saw that seemed approachable? Can you even read this document at the default font size?

Legal documents are long for many reasons, but the main reason, I believe, is to reduce loopholes and confusion by defining and using a common language and vocabulary. Vocabulary is important. Most of these documents have embedded definitions of just about every major term used. Look at your current employment contract - how much of it is simply defining words?

Centrally defining all words allows two parties who speak different languages to meet on common ground. Look at the GPL3 and Apache 2 licenses compared to the MIT license. Both the GPL3 and Apache 2 licenses specify more requirements and allowances than the MIT one, but my point is that much of the raw text in both GPL3 and Apache 2 is definitions. Compare this to the MIT license, which has practically no definitions embedded.

Verbal Communication

In your average day, it is unlikely you have the time or energy to spend defining terms in the middle of every conversation. Most of the time you might assume the other person (or people) know what you mean when you say it.

I propose that the likelihood of one person understanding your words is inversely related to their distance from your context, job role, and other factors you may (or may not) have in common.

Explaining something job-related to a fellow sysadmin will require a different set of terminology, a different language, than doing the same to your manager, someone outside your engineering group, someone in marketing, etc. The knee-jerk reaction is often to assume the other person is stupid. They aren't understanding you! It sucks. It increases tension and distrust.

Like I said, you probably don't have time to work with each person (or group) to design a common terminology. You have stuff to do. What other options do you have?

My best recommendation here is to study everyone. Watch what they say, how they say it, and what you think they mean when they say something. Study their reactions when you say things. If they appear confused, ask and clarify. Don't treat them like they're an idiot. Speaking loudly and slowly doesn't help anyone; ask for what is confusing, and ask what needs clarification and definition.

Know your audience.

Avoid analogies. Analogies are you translating your words into other words you hope the audience will understand. Bad analogies are very easy to make, and you can accidentally increase confusion and distrust like the famous "the internet is not a truck" failure. If you have bidirectional communication with a person, instead of making an analogy, why not ask for clarification and definition?

Visual Communication

Pictures can help, too. Bought anything from Ikea recently? If not, check them out, or just look at this table assembly manual. All the Ikea furniture assembly manuals I've seen are pretty effective with only pictures. No words.

Study your audience.

If you study and interact with a given set of folks frequently enough, you should learn to more easily speak a language they understand. Marketing and PR folks probably don't care about disks being full, but they care about the external impact of that problem. Try to understand what a person knows and what they care about and frame your words accordingly.

Software Similarity and Language Problems

Language problems affect even groups with small distances between them. Look at logging tools: You'll see terms like facility, severity, log level, debug level, source, and more all referring to pretty much the same things. What about monitoring? Nagios calls it a 'service'; Zabbix calls it an 'item' and sometimes a 'check'; Xymon calls it a 'service'; Sensu calls it a 'plugin'.

All of the monitoring term examples above mean pretty much the same thing, and before you disagree, consider that I say they mean the same thing because they look like the same thing from a distance. Knowing Nagios and learning a new monitoring tool requires learning new term definitions as well as new software, and overloaded terms (like 'service') can have different meanings in different projects. It trips up the brain!

Look at features provided by similar tools, and each feature is likely to have a different name for the same thing. Intentional or otherwise, this is a language-equivalent of vendor lock-in, and it sucks.

Puppet calls it a module and manifest; Chef calls it a cookbook and recipe. Ruby and Python use the word "yield" to mean different things, and this causes much confusion for people trying to learn both languages. You can reduce learning curves if you use a common language in your systems and projects.

When to Define Terms

It is worth defining terms if you need to have a long-lived common ground. You want the common terms and features of a project to stay defined throughout the project's life cycle.

As an example, there's a small group of folks who say, "Monitoring sucks," myself included. We got together to discuss ways to solve crappy monitoring problems, and one of the tasks was to define a common set of terms - we went the route lawyers go because having a common terminology ground would strengthen the #monitoringsucks movement. Agree or disagree with the definitions we came up with, the point was to lay a path for discussion that could avoid religious wars over confused terminology. The common terms were also chosen to help steer any new projects to use the same terms and reduce the learning curve as a result.

Parting Thoughts

Much of technical writing education focuses on knowing the audience. While 'technical writing' leans towards one-way communication (writer communicating with readers), the ideas are important in general communication.

Who are you talking to? What are their interests? What are the boundaries of their knowledge?

Further Reading

  • This page gives a reasonable overview of audience analysis.

December 16, 2011

Day 16 - Shipping Some Logs

This was written by Jordan Sissel (semicomplete.com).

Logging is messy. Ever have logs fill up your disk and crash services as a result? Me too, and that sucks.

You can solve the disk-filling problem by rotating logs and expiring old ones, but there's a better solution that solves more problems: ship your logs somewhere else. Shipping logs to a central place helps you more quickly access them later when you need to debug or do analytics.

There are plenty of tools in this area to help you solve log transport problems. Common syslog servers like rsyslog and syslog-ng are useful if syslog is your preferred transport. Other tools like Apache Flume, Facebook's Scribe, and logstash provide an infrastructure for reliable and robust log transport. Many tools that help solve log transportation problems also solve other problems, for example, rsyslog can do more than simply moving a log event to another server.

Starting with Log Files

For all of these systems, one pleasant feature is that in most cases, you don't need to make any application-level changes to start shipping your logs elsewhere: If you already log to files, these tools can read those files and ship them out to your central log repository.

Files are a great common ground. Can you 'tail -F' to read your logs? Perfect.

Even rsyslog and syslog-ng, while generally thought of as syslog servers, can both follow files and stream out log events as they are written to disk. In rsyslog, you use the imfile module. In syslog-ng, you use the file() driver. In Flume, you use the tail() source. In logstash, you use the file input plugin.
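
For illustration, here is a minimal sketch of a logstash configuration that follows a file with the file input. The path and the "apache-access" type are made-up values, and the stdout output is just a stand-in for whatever transport feeds your central log repository:

    input {
      file {
        # follow this file much like 'tail -F' would
        path => "/var/log/apache2/access.log"
        type => "apache-access"
      }
    }

    output {
      # print events for a quick test; in a real deployment this would be
      # one of logstash's network outputs pointed at your central log store
      stdout { }
    }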

Filtering Logs

Most of the tools mentioned here support some kind of filtering, whether it's dropping certain logs or modifying them in-flight.

Logstash, for example, supports dropping events matched by a certain pattern, parsing events into a structured piece of data like JSON, normalizing timestamps, and figuring out which events are single-line and which are multi-line (like Java stack traces). Flume lets you do similar filtering in decorators.
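
As a rough sketch of what two such filters look like in a logstash config (the "apache-access" and "java-app" types are assumed to have been set on the inputs, and the pattern names come from logstash's bundled grok patterns):

    filter {
      # parse apache access logs into structured fields
      grok {
        type    => "apache-access"
        pattern => "%{COMBINEDAPACHELOG}"
      }

      # glue stack trace continuation lines (leading whitespace)
      # onto the event that preceded them
      multiline {
        type    => "java-app"
        pattern => "^\s"
        what    => "previous"
      }
    }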

In rsyslog, you can use filter conditions and templates to selectively drop and modify events before they are output. Similarly, in syslog-ng, filters let you drop events and templates let you reshape the output event.

Final Destination

Where are you putting logs?

You could put them on a large disk server for backups and archival, but logs have valuable data in them and are worth mining.

Recall Sysadvent Day 10, which covered how to analyze logs stored in S3 using Pig on Amazon EC2. "Logs stored in S3" - how do you get your logs into S3? Flume supports S3 out of the box, allowing you to ship your logs up to Amazon for later processing. Check out this blog post for an example of doing exactly this.

If you're looking for awesome log analytics and debugging, there are a few tools out there to help you do that without steep learning curves. On the open source side, Graylog2 and logstash are both popular and have active communities. Hadoop's Hive and Pig can help, but may have slightly steeper learning curves. If you're looking for a hosted log searching service, there's papertrail. Hosted options also vary in features and scope; for example, Airbrake (previously called 'hoptoad') focuses on helping you analyze logged errors.

And then?

Companies like Splunk have figured out that there is money to be made from your logs, and web advertising companies log everything because logs are money. So don't just treat your logs like a painful artifact that can only be managed with aggressive log rotation policies.

Centralize your logs somewhere and build some tools around them. You'll get faster at debugging problems and be able to better answer business and operations questions.

Further Reading

  • Log4j has cool features called MDC and NDC that let you log more than just a text message.
  • logrotate

December 15, 2011

Day 15 - Automating WordPress with CFEngine

System administration is a relatively new profession. Without a standard curriculum, practitioners have different philosophies and practices. This is challenging for new sysadmins because every organization implements things differently: the how and why of system setup, of maintenance, and of disaster recovery and growth.

A software tool can respond faster than a human sysadmin to a deviation from configuration policy (something being broken). The corrective action can be automated, so chaos is kept to a minimum while not requiring human action.

Why WordPress? Installing WordPress involves coordinating multiple system components into a harmonious whole. It is a great demonstration of the power of automated configuration management. It involves copying and editing files, installing packages, and starting and restarting services.

Manually installing WordPress often takes tens of minutes. An automated install under CFEngine greatly shortens the time required and, most importantly, provides a repeatable and auditable process.

Lastly, an introduction to CFEngine is out of scope for this post, but you can learn more here, here, and here.

Automating WordPress Installation

The two main parts of infrastructure involved in making WordPress work are a web server and a database. In this example, we'll use Apache httpd and MySQL as well as assume a Red Hat (or derivative) system.

The most up-to-date version of the CFEngine policy from this post can be found here: https://github.com/cfengine/contrib/raw/master/wordpress_installer.cf

You can run this policy with:

cf-agent -f /var/cfengine/inputs/wordpress_installer.cf

The rest of this post covers the manual steps you might do to install WordPress and also the equivalent implementation in CFengine.

Ordering Things

Below is the control promise, which controls the behavior of cf-agent, including which files it should import (the standard CFEngine library) and in what sequence to examine and keep bundles (collections) of promises.

body common control 
{

        bundlesequence => {
                                "packages_installed",
                                "services_up",
                                "wordpress_tarball_is_present",
                                "wordpress_tarball_is_unrolled",
                                "configuration_of_mysql_db_for_wordpress",
                                "wpconfig_exists",
                                "wpconfig_is_properly_configured",
                                "allow_http_inbound",
                          };

        inputs =>        { "/var/cfengine/inputs/cfengine_stdlib.cf" };
}

Get the Right Packages

With that order given above, let's start by ensuring we have all the necessary packages. We will use the "yum" package_method since we are using a Red Hat derivative.

The packages_installed bundle below promises to restart httpd if any packages are added. This covers the case where httpd is already up and running, but "php" and "php-mysql" are missing and CFEngine installs them.

bundle agent packages_installed
{

vars: "desired_package" slist => {
          "httpd",
          "php",
          "php-mysql",
          "mysql-server",
         };

packages: "$(desired_package)"
    package_policy => "add",
    package_method => yum,
    classes => if_repaired("packages_added");

commands:
  packages_added::
  "/sbin/service httpd graceful"
    comment => "Restarting httpd so it can pick up new modules.";
}

Apache and MySQL

Now let's make sure httpd and mysqld are running with the services_up bundle shown below:

bundle agent services_up {
processes:
  "^mysqld" restart_class => "start_mysqld";
  "^httpd"  restart_class => "start_httpd";

commands:
  start_mysqld::
    "/sbin/service mysqld start";
  start_httpd::
    "/sbin/service httpd start";
}

The "restart_class" is used to scan the "ps" output for the named string, and if not found, the right hand side class will be set. We can then use that to launch a command to start the server.

Downloading WordPress

The next section shows the wordpress_tarball_is_present bundle where we make sure we have a copy of WordPress in an arbitrary location - let's say in /root. We'll need it later to install WordPress under the httpd document root.

We test using CFEngine's built-in function "fileexists()". If the file exists, the "wordpress_tarball_is_present" class gets defined. (A class is CFEngine's implicit if/then test: if it is defined, the test passes; if it is not, it doesn't. In other words, defined = true, not defined = false.)

If the file does not exist, the "wordpress_tarball_is_present" class will not be defined and the commands promise will download it. If the file does exist, no action will be taken.

bundle agent wordpress_tarball_is_present
{
classes:
  "wordpress_tarball_is_present" expression =>
    fileexists("/root/wordpress-latest.tar.gz");

reports:
  wordpress_tarball_is_present::
    "WordPress tarball is on disk.";

commands:
  !wordpress_tarball_is_present::
    "/usr/bin/wget -q -O /root/wordpress-latest.tar.gz
http://wordpress.org/latest.tar.gz"
    comment => "Downloading WordPress.";
}

Unpacking WordPress

Next, we test if the WordPress directory exists under the document root (assumed to be "/var/www/html").

If it doesn't, we'll extract our WordPress tarball to the docroot using "tar".

Note that the "tar" extract promise depends on the earlier promise that the tar ball is on disk. Because Cfengine does three passes through the promises when it runs: on the first pass, the tar ball will be downloaded if necessary; on the second pass, Cfengine will extract it. This is an example of convergence to desired state, part of the basic philosophy of Cfengine.

Because CFEngine is convergent in its operation, the above cf-agent command can be run multiple times, and the system will always stay at or approach the desired state, never getting further away from it. It can fight entropy and system state drift.

bundle agent wordpress_tarball_is_unrolled
{
classes:
  "wordpress_directory_is_present" expression =>
    fileexists("/var/www/html/wordpress/");
reports:
  wordpress_directory_is_present::
    "WordPress directory is present.";
commands:
  !wordpress_directory_is_present::
    "/bin/tar -C /var/www/html -xvzf /root/wordpress-latest.tar.gz"
      comment => "Unrolling wordpress tarball to /var/www/html/.";
}

Configuring MySQL

Next, we use the "mysql" command to create the database for the application data store as well as credentials to access it:

bundle agent configuration_of_mysql_db_for_wordpress
{
commands:
  "/usr/bin/mysql -u root -e \"
    CREATE DATABASE IF NOT EXISTS wordpress;
    GRANT ALL PRIVILEGES ON wordpress.*
    TO 'wordpress'@localhost
    IDENTIFIED BY 'lopsa10linux';
    FLUSH PRIVILEGES;\"
  ";
}

Please note the above command (like all these promise bundles) is convergent to desired state - it will either get us to the desired state if we are not there, or keep us there if we are there already.

The desired state is a "wordpress" database that can be accessed via a "wordpress" user with the password "lopsa10linux".

Adding the WordPress Config

Let's copy the sample config file WordPress ships with to wp-config.php if it doesn't exist.

First, we check if wp-config.php exists using the built-in "fileexists()" function. If wp-config.php exists, this will set a "wordpress_config_file_exists" class.

This class will be used to control what happens next: if the class is set, no changes will be made to the system; we'll just report that wp-config.php is there. If the class is not defined, we'll report that wp-config.php is not there, and then put it there by copying it from wp-config-sample.php.

bundle agent wpconfig_exists
{
classes:
  "wordpress_config_file_exists"
  expression => fileexists("/var/www/html/wordpress/wp-config.php");
reports:
  wordpress_config_file_exists::
    "WordPress config file /var/www/html/wordpress/wp-config.php is present";
commands:
  !wordpress_config_file_exists::
  "/bin/cp -p /var/www/html/wordpress/wp-config-sample.php \
    /var/www/html/wordpress/wp-config.php"
    comment => "Creating wp-config.php from wp-config-sample.php";
}

Here is the wp-config-sample.php sample config:

// ** MySQL settings - You can get this info from your web host ** //
/** The name of the database for WordPress */
define('DB_NAME', 'database_name_here');

/** MySQL database username */
define('DB_USER', 'username_here');

/** MySQL database password */
define('DB_PASSWORD', 'password_here');

Taking the sample config above, we can use the "replace_patterns" in cfengine_stdlib.cf to replace database_name_here with our database name, and so on. Just like using a template, we replace placeholders with actual values.

bundle agent wpconfig_is_properly_configured
{
files:
  "/var/www/html/wordpress/wp-config.php"
    edit_line => replace_default_wordpress_config_with_ours;
}

bundle edit_line replace_default_wordpress_config_with_ours
{
replace_patterns:
  "database_name_here" replace_with => value("wordpress");

replace_patterns:
  "username_here" replace_with => value("wordpress");

replace_patterns:
  "password_here" replace_with => value("lopsa10linux");
}

Configure IPTables

As a finishing touch, let's make sure our host firewall allows inbound connections on TCP port 80.

This is our most complicated promise bundle, with three levels of abstraction: a "files" promise edits a file using an "edit_line" bundle; that bundle uses "insert_lines" (from cfengine_stdlib.cf); and the insertion has a "location" attribute, defined in a separate body, that places the new rule before the iptables rule accepting established TCP connections.

Incidentally, this promise bundle will also restart iptables if it edits the iptables config file.

Abstracting the details allows the sysadmin to see at a high level what's going on without being blinded by too many details at once, yet the details are accessible to examination if needed.

bundle agent allow_http_inbound
{
files:
  redhat::  # tested on RHEL only, file location may vary on other OSs
  "/etc/sysconfig/iptables"
    edit_line => insert_HTTP_allow_rule_before_the_accept_established_tcp_conns_rule,
    comment => "insert HTTP allow rule into /etc/sysconfig/iptables",
    classes => if_repaired("iptables_edited");
commands:
  iptables_edited::
  "/sbin/service iptables restart"
    comment => "Restarting iptables to load new config";
}

bundle edit_line insert_HTTP_allow_rule_before_the_accept_established_tcp_conns_rule
{
vars:
  "http_rule" string => "-A INPUT -p tcp -m tcp --dport 80 -j ACCEPT";
insert_lines: "$(http_rule)",
  location => before_the_accept_established_tcp_conns_rule;
}

body location before_the_accept_established_tcp_conns_rule
{
before_after => "before";
first_last => "first";
select_line_matching => "^-A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT.*";
}

To summarize, here is what the policy can do:

  • Install the Web server and required httpd modules (php, php-mysql)
  • Install Web app in httpd docroot
  • Install and configure the database for the Web app
  • Configure the Web app to use the database
  • Configure the host firewall

There is a more sophisticated version of this automated WordPress installer in Aleksey's "CFEngine 3 Examples Collection" ( http://www.verticalsysadmin.com/cfengine/cfengine_examples.tar ), see 2030_More_Examples._EC2._Example102_wordpress_installation.cf

Further Reading

Aleksey has been a UNIX/Linux system administrator for 13 years, and will share his knowledge during the "Time Management for System Administrators" session at the So Cal Linux Expo on 20 Jan 2012 (http://www.socallinuxexpo.org/scale10x/events/scale-university) and in "Automating System Administration using CFEngine 3", a 3-day hands-on course, in the Bay Area on 25-27 January 2012 and in Los Angeles on 20-22 February 2012.