December 26, 2011

Day 25 - Learning from Other Industries

This was written by Jordan Sissel (semicomplete.com).

This post is late because, two days ago (December 24th), I went to the emergency room with appendicitis. Pretty sweet timing, eh?

After surgery, I was bored and took to observing nursing shift rotations - hand-offs of patients, documentation maintenance, etc. The entire process seemed to last about an hour for each shift change. I thought to myself, "Self, you have never been involved in any operations project, task, or fire hand-off that went as smoothly." Pretty sad realization.

Further, throughout my hospital stay (emergency room, surgery, and recovery), not once did I observe any two individuals debating passionately about how best to treat my illness. Data was available at every decision point. There were specializations: surgeons for the surgical tasks; nurses to help stabilize, comfort, and feed me (no small task); and administrative staff to handle the bureaucracy.

I'm not expecting systems administration to follow exactly what the medical industry does, just pointing out that it's nice to see a well-orchestrated system in action.

And the best part? I never once heard anyone say, "hey! We should try this new cool thing I heard about."

Further, I never once heard anyone suggesting that we remove my appendix using node.js and mongodb.

Happy New Year, everyone :)


December 24, 2011

Day 24 - Implementing Configuration Management in Legacy Environments

This was written by Dean Wilson (www.unixdaemon.net).

Implementing configuration management is perfect for green field projects where you can have freedom to choose technical solutions free from most existing technical debt. It's a time where there are no old inconsistencies and you have a window to build, deploy and test all the manifests and recipes without risking user visible outages or interruptions.

Unfortunately, most of us don't have the luxury of starting from scratch in a pristine environment. We have to deal with the oddities, the battle scars, and the need to maintain a good quality of service while evolving and maintaining the platform.

In this article, I'll discuss some of the key points to consider as you begin the journey implementing a configuration management tool in an existing environment. I happen to use puppet and mcollective, but any config management tools will do - find one that works for you.

Discovery

You'll need to start with some discovery (manual, automated, whatever) to learn how your current systems are configured. How many differences are there across the sshd_config files? Where do we use that backup user's key? What's the log retention setting of the apache servers? Is it identical between those we upgraded from apache 1 and the fresh apache 2?

When I first started to bring puppet into my systems, I wrote custom scripts, ssh loops, and inventory-reporting cronjobs to find the information I needed. Now, there is an excellent solution to gather the information you need in real time: MCollective.

With the appropriate mcollective agents you can compare, peer, and probe into nearly every aspect of your infrastructure. The FileMD5er agent, for example, will show you groups of hosts that have identical config files. Using this, you can partition the environment and the work, section by section. This will help you find smaller amounts of work to build into puppet.
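To make the grouping idea concrete, here is a minimal Python sketch of what "hosts with identical config files" means. It is not the FileMD5er agent itself; the function name and the sample hosts are made up for illustration, and it assumes the file contents have already been collected somehow.

```python
import hashlib
from collections import defaultdict


def group_hosts_by_digest(configs):
    """Group hostnames by the MD5 digest of their config file contents.

    `configs` maps hostname -> file contents as bytes (hypothetical input;
    in practice a tool such as MCollective would collect this for you).
    """
    groups = defaultdict(list)
    for host, content in configs.items():
        digest = hashlib.md5(content).hexdigest()
        groups[digest].append(host)
    # Sort hostnames so the report is stable between runs.
    return {digest: sorted(hosts) for digest, hosts in groups.items()}


if __name__ == "__main__":
    # Made-up sample data: two hosts agree, one differs.
    sample = {
        "web1.example.com": b"PermitRootLogin no\n",
        "web2.example.com": b"PermitRootLogin no\n",
        "db1.example.com": b"PermitRootLogin yes\n",
    }
    for digest, hosts in group_hosts_by_digest(sample).items():
        print(digest, ", ".join(hosts))
```

Each resulting group is a candidate "section" of work: hosts in the same group can share one manifest for that file.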

One of my favourite current tricks is to query config file settings using Augeas in the AugeasQuery agent.

# Does your ssh permit root logins?
$ mco rpc augeasquery query query='/files/etc/ssh/sshd_config/PermitRootLogin' --with-agent augeasquery -v
...

test.example.com : OK
    {:matched=>["no"]}

test2.example.com: OK
    {:matched=>["yes"]}

No more crazy regexes to find config settings!
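For comparison, the same question can be asked of a single file without regexes. The sketch below is a rough local stand-in, not Augeas: the function name is hypothetical, and it ignores Match blocks and the optional "=" separator.

```python
def sshd_setting(config_text, keyword):
    """Return the first value for `keyword` in sshd_config-style text, or None.

    sshd keywords are case-insensitive and the first occurrence wins;
    Match blocks are not handled in this sketch.
    """
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = line.split(None, 1)
        if parts[0].lower() == keyword.lower():
            return parts[1].strip() if len(parts) > 1 else ""
    return None
```

The value of a tool like the AugeasQuery agent is that it asks this of every host at once and already understands the file format.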

You can learn more about how the Blueprint tool can help you with this configuration discovery process in "Reverse-engineer Servers with Blueprint" (SysAdvent 2011, Day 12).

Short Cycles

One of the key factors to consider when starting large scale infrastructure refactoring projects is that you'll need information and you'll need it quickly.

You should always be the first to know about things you've broken. Your monitoring system, centralised logging, trend graphing, and config management system's reporting server (like Puppet Dashboard and Foreman) will become your watchful allies.

The speed at which you can gather information, write the manifest, deploy, and run your tests is more important than you'd think. Keeping this cycle short will encourage you to work in smaller sections, reducing the time between making a change and actually testing it, and keeping the possible problem scope small. Tools like MCollective, mc-nrpe and nrpe runner will enable you to rapidly verify your changes.

Version Control

The short version is: version control everything.

There's no reason, whether you are in a team or working alone (especially when working alone!), to not keep your work in a version control system (such as subversion or git). "That change didn't work, so I've rolled it back". Really? By hand? Are you sure you didn't forget to remove one of the additions? Why would you shoulder the mental burden of knowing all the things you changed recently when there are so many excellent tools that will do it for you and allow easier, auditable rollbacks to old revisions?

Beyond the benefits of providing an ever-present safety net, using a VCS provides a nice way to tie your exact change into incident, audit, or change control systems - a perfect place to do smaller code reviews and a basic, but permanent, collection of institutional knowledge. Being able to search through the puppet logs / reports when debugging an issue to find exactly which resources changed and when, pull up the related changeset and the actual reason for it (and who you should ask for help) changes your process and reports from best guesses based on people's memories to quick fixes backed by supporting evidence.

On a positive note, it's amazing how much you can learn from reading through commits from things like DBAs doing performance tuning. Seeing which settings were changed and why can save you and your team from forgetting something or making the same mistake or redoing parts of the work again in the future.

Use Packages

Many older environments have fallen into the habit of pushing tarred applications around (and sometimes compiling software on the hosts as needed). This isn't the best way to deploy in general, and it will make your manifests ugly, complicated, sprawling messes. In most cases, it's not that hard to produce native packages, and tools like fpm make the process even easier.

By actually packaging software and running local repositories you can use native tools, such as yum and apt, in a more powerful way. Processes become shorter and have simpler moving parts, more meta information becomes available, and it's possible to trace the origin of files. You can then use other tools, such as puppet and MCollective, to ease upgrades, additions, and reporting.

You can learn more about the world of software packages from "A Guide to Package Systems" (SysAdvent 2011, Day 4).

As an aside, there are differing opinions on whether packages should handle related tasks, such as deploying users. However, once you've reached the point where you can have this argument, you've surpassed most of your peers and will have an environment that gives you the freedom to do whichever you choose.

You can learn more about overlaps in functionality of packaging and config management systems in "Packaging vs Config Management" (SysAdvent 2010, Day 23).

Baked-in Goodness

As you bring more of your system under puppet control, you'll begin to experience the continuously improving system baseline. Every improvement you make is felt throughout the infrastructure, and regressions become infrequent, easy to spot, and quick to remedy. For example, you should never wonder if you've deployed nginx checks to a new server, it should be impossible to have a mysql server without your graphing applied, and every wiki you deploy should be added to the backup system without you intervening.

By having infrastructure manifests such as a monitoring module and writing some generic "add monitoring check" definitions, you can sprinkle reuse throughout other, more functionality-focused modules. In my test manifests, for example, nearly every class has nagios checks associated with it, and every service declares the ports it listens to. This might not sound like much, but it's impossible for me to deploy a server without monitoring, and every time I add a new nagios check, all the hosts with that role receive it. Further, generating security policies, firewall rule sets, and audit documents is done automatically for me and requires no manual data gathering. Being able to supply people with links to live documentation is a wonderful way to remove awkward manual steps from reports, audits and inventories.
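The idea of declaring ports once and deriving everything else from them can be sketched outside puppet. The Python below is only an illustration, not the manifests described above; the role names, services, and helper functions are all hypothetical.

```python
# Hypothetical role definitions: each role lists the services it runs and
# the TCP ports those services listen on.
ROLES = {
    "webserver": {"nginx": [80, 443]},
    "database": {"mysql": [3306]},
}


def nagios_checks(role):
    """Return one (description, command) pair per declared port."""
    checks = []
    for service, ports in sorted(ROLES[role].items()):
        for port in ports:
            checks.append((f"{service} port {port}", f"check_tcp -p {port}"))
    return checks


def firewall_rules(role):
    """Return iptables-style accept rules for every declared port."""
    ports = sorted(p for ports in ROLES[role].values() for p in ports)
    return [f"-A INPUT -p tcp --dport {p} -j ACCEPT" for p in ports]


if __name__ == "__main__":
    for description, command in nagios_checks("webserver"):
        print(description, "->", command)
    print("\n".join(firewall_rules("webserver")))
```

Because both outputs come from the same declaration, adding a port to a role updates the monitoring and the firewall report together, with no separate list to keep in sync.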

Involve Everyone

You don't have to agree with the DevOps movement and ideas to use configuration management, but some of its principles (such as communication, openness, shared ownership and responsibility) are worthwhile and have a low overhead once you start to config manage everything. Look for simple requests for information from other teams. Do they want to know which cronjobs run and when? What modules the apache servers have configured? Which rubygems you've deployed on the database servers? (from packages!)

By sharing configuration modules, you help others develop a balanced understanding of the platform. You enable them to regain time wasted raising support tickets through sharing commit logs, audit and error reports, etc, without feeling like your coworkers or customers are wrestling your time away.

At $WORK, a large percentage of our developers are comfortable reading puppet manifests, running MCollective commands, and using custom Foreman pages and cgi-scripts to investigate issues and answer their own queries without the sysadmins being involved. This has dropped our workload, given them quicker answers and allowed them to ask targeted questions after doing their own research.

Such things increase efficiency, communication, and happiness.

It's not all about developers either. With a custom puppet define and a little reporting wrapper, we can expose the list of ports required by every machine to our network teams and auditors along with accurate timestamps showing when information was gathered. Using MCollective and custom database facts, our DBAs have dashboards showing deployed packages and services, current running configuration, and the ability to gather ad-hoc real time information from any of our systems: production, development or QA and even compare the differences between them.

Pick your fights

While you're growing accustomed and skilled with your new tools, you need to be aware of their strengths. As the person pushing this change, you have to prove your way is at least as good as (and hopefully better than) the current practice. Because of this, some tasks are bad places to prove your point and you need to be wary of them. Try to find some low-hanging fruit.

Large monolithic deployments or complex configuration scripts can be handled if you break them down to manageable components. Unfortunately, tight coupling and intertwined code make these situations bad places to start. It's never a good opening move to start explaining that it took three hours to pick apart a shell script that seems to work fine. Vaguely-defined processes are the other common tar pit. Bringing attention to processes that people seem to do slightly differently is another great way to unite people against what you are trying to accomplish.

You want to avoid resistance caused by confusion. Find and tackle tasks that make sense to put into configuration management, first.

User accounts, for example, are a complex area to automate. While it starts with a simple "deploy the admin accounts, add them to wheel, and push the SSH keys," it can easily become a sprawl of teams needing different sets of access on a semi-rotating basis.

It's worth noting that your config management efforts will often be considered slower, especially when starting out. Writing a ten line throwaway shell script or hand-hacking something will be quicker in the short term than writing clean manifests that include monitoring, metadata and tests, but the comparison is unfair. Remember, you're building flexible, reproducible systems that support future change - not focusing on a single server that just needs to work now and only now.

Roles, not hosts - No more snowflakes

As you build your collection of configuration management techniques, you'll find special snowflake hosts: Machines performing several small but important tasks; ones hand-constructed under time constraints (and needing that one extra special config option) - boxes you'll "only ever have one of," right? Don't fall into the trap of assuming any machine is special or that you'll only ever have one of them. Building a second machine for disaster recovery, a staging version, or even an instance for upgrade tests means there will be times when you'll want more than one of them. If you find yourself adding special cases to your config management ("If the hostname is 'foo.example.com'") then it's time to stop, take a step back, and consider alternatives. Any list, especially of resource names, that you hand-maintain is a burden and will eventually be wrong.

A sign that things are starting to click is when you stop thinking in terms of hosts and start thinking in roles. Once you begin assigning roles to hosts you will find the natural level of granularity you should be writing your modules at. It's common to revisit previously written modules and extract aspects of them as you discover more potential reuse points.
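The shift from hostname checks to roles can be shown in a few lines of Python. This is a sketch with made-up hosts, roles, and package names, not a puppet feature: the point is that nothing in the lookup ever compares against a hostname.

```python
# Hypothetical data: which roles each host carries, and what each role needs.
HOST_ROLES = {
    "foo.example.com": ["webserver", "backup_client"],
    "foo-dr.example.com": ["webserver", "backup_client"],
    "db1.example.com": ["database", "backup_client"],
}
ROLE_PACKAGES = {
    "webserver": ["nginx"],
    "database": ["mysql-server"],
    "backup_client": ["rsync"],
}


def packages_for(host):
    """Resolve a host's packages purely from its roles, never its name."""
    wanted = set()
    for role in HOST_ROLES.get(host, []):
        wanted.update(ROLE_PACKAGES[role])
    return sorted(wanted)


if __name__ == "__main__":
    for host in sorted(HOST_ROLES):
        print(host, packages_for(host))
```

Building the disaster recovery twin of foo.example.com then means assigning it the same roles, rather than hunting for every place the original hostname was special-cased.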

You can learn more about configuration in terms of roles instead of machines in "Host vs Service" (SysAdvent 2008, Day 7).

Conclusion

With a little care and consideration it's possible to integrate configuration management into even the largest legacy infrastructures while enjoying the same benefits as the newer, less encumbered projects.

Take small steps, master your tools, and good luck.


December 23, 2011

Day 23 - All The Metrics! Or How You Too Can Graph Everything.

This was written by Corry Haines.

As your company grows, you may find that your existing metric collection system(s) cannot keep up. Alternately, you may find that the interface used to read those metrics does not work with everything that you want to collect metrics for.

The problems that I have seen with existing solutions that I have used are:

  • Munin: Fails to scale in all respects. Collection is more complicated than it should be, and graphs are pre-rendered for every defined time window. Needless to say, this does not scale well and cannot be used dynamically.
  • Collectd: While this system is excellent at collecting data, the project does not officially support any frontend. This has led to a proliferation of frontend projects that, if taken together, have all of the features you need, but no one frontend does everything.
  • XYMon: It's been some time since I used this, and I have not used it on a large set of systems. My guess is that it would suffer from some of Munin's issues.

Enter Graphite

Graphite is a collection of services that can replace or enhance your existing metric collection setup. Yes, it's written in python... but I like python.

The major components are:

  • Whisper: Replaces RRD with a vastly simpler storage-only system. Unlike RRD, whisper cannot graph data directly. Also unlike RRD, you can actually read and understand the entire codebase in less than an hour (only 725 lines of well-commented python).
  • Carbon: Takes data on a variety of interfaces (TCP, UDP, Pickle, AMQP) and stores the data into whisper.
  • Graphite-webapp: Graphs data from whisper or RRD files.
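As an illustration of how little is needed to feed Carbon, here is a minimal Python sketch of its plaintext TCP interface, where each datapoint is one line of "path value timestamp". The function names are hypothetical, and the host and port (2003 is Carbon's usual plaintext port) are assumptions about your setup.

```python
import socket
import time


def carbon_line(path, value, timestamp=None):
    """Format one datapoint for Carbon's plaintext protocol.

    Each line is "<metric path> <value> <unix timestamp>" ending in a newline.
    """
    if timestamp is None:
        timestamp = int(time.time())
    return f"{path} {value} {int(timestamp)}\n"


def send_to_carbon(lines, host="localhost", port=2003):
    """Send pre-formatted lines over TCP (host and port are assumptions)."""
    with socket.create_connection((host, port), timeout=5) as sock:
        sock.sendall("".join(lines).encode("ascii"))


if __name__ == "__main__":
    # Only prints the line; call send_to_carbon([...]) against a real Carbon.
    print(carbon_line("webservers.web1.load.longterm", 0.42), end="")
```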

The best thing about the components being independent is that you can run graphite on your existing RRD data with no hassle. While there are advantages to using whisper, it is not required to get the power of graphite.

The only negatives that I currently hold against graphite are:

  • The documentation is still a bit lacking, though they are working to improve this. You can invoke the community (mailing lists, etc) as a workaround.
  • The learning curve can be a bit steep. While there is an interface to see all of the functions, you still need to learn how they are applied. This is offset by the ability to save named graphs for all users to see.
  • Feedback is a bit lacking. After a graph is requested it is difficult to tell if it is being rendered, or simply failed in the backend.
  • They use launchpad, and thus bazaar, for their project management and source control. In a post-github world, this is starting to get a bit painful.

The Power of Filters and Functions

As wonderful as whisper and carbon are (and they really are worth using), the true power of graphite lies in its web interface. Unlike some other interfaces, graphite treats each metric as an independent data series. So long as you have an understanding of the system, you can apply functions (think sum, avg, stddev, etc.) to the metrics either by themselves, or more often, in aggregate.

In addition, you can use wildcards to select multiple machines quickly. While you could do a sum operation like this: sumSeries(host1.load,host2.load,etc) you could more easily type sumSeries(*.load).
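To show what a wildcard plus an aggregate function does to the data, here is a small Python sketch. It is not Graphite's implementation: the function names are hypothetical, and it matches each dot-separated component separately on the assumption that a wildcard should not cross a dot.

```python
from fnmatch import fnmatchcase


def matches(name, pattern):
    """Match a dotted metric name against a glob, one component at a time."""
    name_parts = name.split(".")
    pattern_parts = pattern.split(".")
    return len(name_parts) == len(pattern_parts) and all(
        fnmatchcase(n, p) for n, p in zip(name_parts, pattern_parts)
    )


def select_series(metrics, pattern):
    """Return the series whose name matches the pattern."""
    return {name: pts for name, pts in metrics.items() if matches(name, pattern)}


def sum_series(series):
    """Add series together point by point, skipping missing (None) values."""
    totals = []
    for column in zip(*series.values()):
        present = [v for v in column if v is not None]
        totals.append(sum(present) if present else None)
    return totals
```

With this, sum_series(select_series(metrics, "*.load")) plays the role of sumSeries(*.load): new hosts are included as soon as they report, without editing the query.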

Filter Example

As an example, if I wanted to find overloaded webservers I could construct a query like highestAverage(webservers.*.load.longterm, 3) producing:

[Figure: highestAverage graph]
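The filter itself is easy to reason about once written out. The following Python sketch shows the idea behind a "top N by average" filter; it is an approximation for illustration, with hypothetical names, and not Graphite's own code.

```python
def average(points):
    """Mean of the non-missing values, or None if every point is missing."""
    present = [v for v in points if v is not None]
    return sum(present) / len(present) if present else None


def highest_average(series, n):
    """Return the names of the n series with the highest average value."""
    scored = [(average(pts), name) for name, pts in series.items()]
    # Series with no data at all cannot be ranked, so drop them.
    scored = [(avg, name) for avg, name in scored if avg is not None]
    scored.sort(reverse=True)
    return [name for _avg, name in scored[:n]]
```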

Stacking example

Another example: graphing the amount of unused memory on the webservers (time for more memcached if so!) with movingAverage(webservers.*.memory.free, 10) produces:

[Figure: memory movingAverage graph]

Note that I am also creating a moving average over 10 datapoints here. Also, the series are stacked to produce a sum while still showing the responsible server.
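Both ideas, smoothing and stacking, are simple transformations of the series. The Python below is a rough sketch under stated assumptions: the moving average is a trailing window that yields None until the window is full (Graphite's exact edge behaviour may differ), and the stacking helper assumes series of equal length with no missing values.

```python
def moving_average(points, window):
    """Trailing moving average; None until a full window is available.

    Missing (None) values inside a window are ignored.
    """
    result = []
    for i in range(len(points)):
        if i + 1 < window:
            result.append(None)
            continue
        chunk = [v for v in points[i + 1 - window:i + 1] if v is not None]
        result.append(sum(chunk) / len(chunk) if chunk else None)
    return result


def stacked(series_list):
    """Cumulative sums across series, as a stacked graph would draw them."""
    layers, running = [], None
    for pts in series_list:
        if running is None:
            running = list(pts)
        else:
            running = [a + b for a, b in zip(running, pts)]
        layers.append(running)
    return layers
```

The top layer of stacked(...) is the total across all servers, while the gap between adjacent layers is each server's own contribution.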

Functions Are the Best!

And this is only a small selection of the functions available to you. Moreover, you can write your own! And easily too! Here is an example function in graphite:

# A function to scale up all datapoints by a given factor
def scale(requestContext, seriesList, factor):
  for series in seriesList:
    series.name = "scale(%s,%.1f)" % (series.name,float(factor))
    for i,value in enumerate(series):
      series[i] = safeMul(value,factor)
  return seriesList
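The function above depends on Graphite's own series objects and its safeMul helper, so it will not run on its own. A standalone approximation on plain lists might look like the sketch below; safe_mul here is an assumed stand-in for the helper's behaviour (pass missing datapoints through untouched), not a copy of it.

```python
def safe_mul(a, b):
    """Multiply, but propagate a missing datapoint (None) instead of raising."""
    if a is None or b is None:
        return None
    return a * b


def scale_points(points, factor):
    """Scale every datapoint in a plain list by `factor`."""
    return [safe_mul(value, factor) for value in points]


if __name__ == "__main__":
    print(scale_points([1, None, 2.5], 2))
```

The None handling matters because real series have gaps whenever a collector misses an interval.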

Graphite in Production

We are currently collecting >93,000 metrics every 10 seconds. Most of the data is gathered on machines using collectd and then passed to a proxy written by sysadvent's editor. The proxy then ships all of the data, via TCP, to our central Carbon node.
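For a sense of scale, the figures above can be turned into a sustained rate with a little arithmetic. The sketch uses 93,000 as a round lower bound, since the text only says "more than".

```python
METRICS = 93_000        # from the text; the real figure is ">93,000"
INTERVAL_SECONDS = 10   # collection interval from the text

datapoints_per_second = METRICS / INTERVAL_SECONDS
datapoints_per_day = METRICS * (86_400 // INTERVAL_SECONDS)

print(f"{datapoints_per_second:,.0f} datapoints/second")
print(f"{datapoints_per_day:,} datapoints/day")
```

That is at least 9,300 datapoints arriving every second, which helps explain why writes are batched rather than issued one at a time.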

All of the data is consumed by carbon and stored on a single machine with six 10k SAS drives in a RAID 10 array. Although this disk setup is not enough to write every datapoint in real time, carbon batches up the data and writes sets at once. It only needs to use about 300 MB of RAM for caching.

In reality, this hardware is probably overkill for our current workload. While testing, I was running about 50,000 metrics on four 7.2k SATA drives in a RAID 10 and the machine was doing just fine. It was using several GB of RAM to cache the data, but it was still able to keep up.

In Closing

If you are considering the installation of a metric gathering system, I would absolutely recommend Graphite. If you are using Collectd or Munin already, you can try the graphite web interface without changing how you collect metrics. It only takes a few minutes to set up and might give you better insight into your systems.


December 22, 2011

Day 22 - Load Balancing Solutions on EC2

This was written by Grig Gheorghiu.

Before Amazon introduced the Elastic Load Balancing (ELB) service, the only way to do load balancing in EC2 was to use one of the software-based solutions such as HAProxy or Pound.

Having just one EC2 instance running a software-based load balancer would obviously be a single point of failure, so a popular technique was to do DNS Round-Robin and have the domain name corresponding to your Web site point to several IP addresses via separate A records. Each IP address would be an Elastic IP associated to an EC2 instance running the load balancer software. This was still not perfect, because if one of these instances went down, users pointed to that instance via DNS Round-Robin would still get an error until another instance was launched.
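The weakness of plain DNS Round-Robin can be put in numbers with a toy model. The Python below is a simplified sketch: it assumes requests rotate evenly across the A records (real resolvers cache and reorder answers), and the addresses are documentation-range placeholders.

```python
import itertools


def round_robin(addresses):
    """Yield addresses in rotation, the way successive DNS answers rotate."""
    return itertools.cycle(addresses)


def failed_fraction(addresses, down, requests=1000):
    """Fraction of requests landing on a dead balancer under plain rotation."""
    rotation = round_robin(addresses)
    hits = sum(1 for _ in range(requests) if next(rotation) in down)
    return hits / requests


if __name__ == "__main__":
    balancers = ["203.0.113.1", "203.0.113.2", "203.0.113.3", "203.0.113.4"]
    print(failed_fraction(balancers, down={"203.0.113.1"}))
```

With four balancers and one down, roughly a quarter of requests fail until a replacement instance takes over that Elastic IP, because DNS has no notion of health.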

Another issue that comes up all the time in the context of load balancing is SSL termination. Ideally you would like the load balancer to act as an SSL end-point, in order to offload the SSL computations from your Web servers, and also for easier management of the SSL certificates. HAProxy does not support SSL termination, but Pound does (note that you can still pass SSL traffic through HAProxy by using its TCP mode; you just cannot terminate SSL traffic there).

In short, if Elastic Load Balancing weren’t available, you could still cobble together a load balancing solution in EC2. However, there is no reason to ‘roll your own’ anymore now that you can use the ELB service. Note that HAProxy is still the king of load balancers when it comes to the different algorithms you can use (and to a myriad of other features), so if you want the best of both worlds, you can have an ELB up front, pointing to one or more EC2 instances running HAProxy, which in turn delegate traffic to your Web server farm.

Elastic Load Balancing and the DNS Root Domain

One other issue that comes up all the time is that an ELB is only available as a CNAME (this is due to the fact that Amazon needs to scale the ELB service in the background depending on the traffic that hits it, so they cannot simply provide an IP address). A CNAME is fine if you want to load balance traffic to www.yourdomain.com, since that name can be mapped to a CNAME. However, the root or apex of your DNS zone, yourdomain.com, can only be mapped to an A record, so for yourdomain.com you could not use an ELB in theory. In practice, however, there are DNS providers that allow you to specify an alias for your root domain (I know Dynect does this, as does Amazon’s own Route 53 DNS service).

Elastic Load Balancing and SSL

The AWS console makes it easy to associate an SSL certificate with an ELB instance, at ELB creation time. You do need to add an SSL line to the HTTP protocol table when you create the ELB. Note that even though you terminate the SSL traffic at the ELB, you have a choice of using either unencrypted HTTP traffic or encrypted SSL traffic between the ELB and the Web servers behind it. If you want to offload the SSL processing from your Web servers, you can choose HTTP between the ELB and the Web server instances.

If however you want to associate an existing ELB instance with a different SSL certificate (say for instance you initially associated it with a self-signed SSL cert, and now you want to use a real SSL cert), you can’t do that with the AWS console anymore. You need to use command-line tools. Here’s how.

Before you install the command-line tools, a caveat: you need Java 1.6. If you use Java 1.5 you will most likely get errors such as java.lang.NoClassDefFoundError when trying to run the tools.

  1. Install and configure the AWS Elastic Load Balancing command-line tools

    • download ElasticLoadBalancing.zip
    • unzip ElasticLoadBalancing.zip; this will create a directory named ElasticLoadBalancing-version (latest version at the time of this writing is 1.0.15.1)
    • set environment variable AWS_ELB_HOME=/path/to/ElasticLoadBalancing-1.0.15.1 (in .bashrc)
    • add $AWS_ELB_HOME/bin to your $PATH (in .bashrc)
  2. Install and configure the AWS Identity and Access Management (IAMCli) tools

    • download IAMCli.zip
    • unzip IAMCli.zip; this will create a directory named IAMCli-version (latest version at the time of this writing is 1.3.0)
    • set environment variable AWS_IAM_HOME=/path/to/IAMCli-1.3.0 (in .bashrc)
    • add $AWS_IAM_HOME/bin to your $PATH (in .bashrc)
  3. Create AWS credentials file

    • create a file with the following content:

      AWSAccessKeyId=your_aws_access_key
      AWSSecretKey=your_aws_secret_key
    • if you named this file aws_credentials, set environment variable AWS_CREDENTIAL_FILE=/path/to/aws_credentials (in .bashrc)
  4. Get DNS name for ELB instance you want to modify

    We will use the ElasticLoadBalancing tool called elb-describe-lbs:

    # elb-describe-lbs
    LOAD_BALANCER  mysite-prod  mysite-prod-2639879155.us-east-1.elb.amazonaws.com  2011-05-24T22:38:31.690Z
    LOAD_BALANCER  mysite-stage   mysite-stage-714225413.us-east-1.elb.amazonaws.com    2011-09-16T18:01:16.180Z
    

    In our case, we will modify the ELB instance named mysite-stage.

  5. Upload SSL certificate to AWS

    I assume you have 3 files:

    • the SSL private key in a file called stage.mysite.com.key
    • the SSL certificate in a file called stage.mysite.com.crt
    • an intermediate certificate from the SSL vendor, in a file called stage.mysite.com.intermediate.crt

    We will use the IAMCli tool called iam-servercertupload:

    # iam-servercertupload -b stage.mysite.com.crt -c stage.mysite.com.intermediate.crt -k stage.mysite.com.key -s stage.mysite.com
    
  6. List the SSL certificates you have uploaded to AWS

    We will use the IAMCli tool called iam-servercertlistbypath:

    # iam-servercertlistbypath
    arn:aws:iam::YOUR_IAM_ID:server-certificate/stage.mysite.com
    arn:aws:iam::YOUR_IAM_ID:server-certificate/www.mysite.com
    
  7. Associate the ELB instance with the desired SSL certificate

    We will use the ElasticLoadBalancing tool called elb-set-lb-listener-ssl-cert:

    # elb-set-lb-listener-ssl-cert mysite-stage --lb-port 443 --cert-id arn:aws:iam::YOUR_IAM_ID:server-certificate/stage.mysite.com
    OK-Setting SSL Certificate
    

That's it! At this point, the SSL certificate for stage.mysite.com will be associated with the ELB instance handling HTTP and SSL traffic for stage.mysite.com. Not rocket science, but not trivial to put together all these bits of information either.
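If you repeat this for several sites, the steps above can be captured in a small script. The Python below is a sketch only: it builds the command lines and file contents shown in the steps, using the same tool names and flags, and leaves execution to you. The helper names are hypothetical, and it assumes the three certificate files are named after the site as in step 5.

```python
import shlex


def credentials_file(access_key, secret_key):
    """Contents of the AWS credentials file from step 3, one setting per line."""
    return f"AWSAccessKeyId={access_key}\nAWSSecretKey={secret_key}\n"


def parse_describe_lbs(output):
    """Map ELB name -> DNS name from elb-describe-lbs output (step 4)."""
    lbs = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[0] == "LOAD_BALANCER":
            lbs[fields[1]] = fields[2]
    return lbs


def upload_cert_cmd(name):
    """argv for step 5, assuming the three files are named after the site."""
    return ["iam-servercertupload",
            "-b", f"{name}.crt",
            "-c", f"{name}.intermediate.crt",
            "-k", f"{name}.key",
            "-s", name]


def set_cert_cmd(lb_name, cert_arn, port=443):
    """argv for step 7: attach an uploaded certificate to an ELB listener."""
    return ["elb-set-lb-listener-ssl-cert", lb_name,
            "--lb-port", str(port), "--cert-id", cert_arn]


if __name__ == "__main__":
    arn = "arn:aws:iam::YOUR_IAM_ID:server-certificate/stage.mysite.com"
    for argv in (upload_cert_cmd("stage.mysite.com"),
                 set_cert_cmd("mysite-stage", arn)):
        # Print only; pass argv to subprocess.run(argv, check=True) to execute.
        print(shlex.join(argv))
```

Building the argument lists separately from running them makes it easy to review exactly what will be sent to AWS before anything changes.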


December 21, 2011

Day 21 - Automating Web Monitoring

This article is written by Brandon Burton, who can mostly be found posting lolcats and retweeting @solarce, though he occasionally posts interesting links to things sysadmin, devops, and unix.

As systems administrators, we all know that it's not in production until it's monitored, but this isn't always as simple a rule to live by as it may sound. Not all web applications, for example, are easily monitored through traditional monitoring solutions such as Nagios, Zenoss, or various commercial tools. These tools tend to take a "curl | grep" style approach to monitoring, or they may support somewhat more complex POSTing of XML or JSON data and validation of the returned data. But often the most key parts of applications being deployed into production involve complex browser interactions and behaviors - AJAX, or some other session or transaction that traditional monitoring frameworks don't have an easy way to accommodate.
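For reference, the "curl | grep" style of check amounts to something like the following Python sketch. The function names are hypothetical and the timeout is an arbitrary choice; the point is how little of a page's behaviour such a check actually exercises.

```python
from urllib.request import urlopen


def body_contains(body, needle):
    """The "grep" half: does the response body contain the expected string?"""
    return needle in body


def check_url(url, needle, timeout=10):
    """The "curl" half plus the grep: fetch a page and look for a string."""
    with urlopen(url, timeout=timeout) as response:
        body = response.read().decode("utf-8", errors="replace")
        return response.status == 200 and body_contains(body, needle)
```

Nothing here runs JavaScript, keeps a session, or clicks anything, which is exactly the gap a browser-driven tool fills.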

Enter Selenium. Selenium is a mature and robust framework for doing complex interactions with web applications. It originated as a tool at the consulting company ThoughtWorks as a way to do testing against web applications by driving a web browser. Since its release, it has seen the development of numerous tools, including browser plugins to make it easy to develop Selenium tests quickly and easily, language bindings to write tests in pretty much every major language, and tools to run many browsers across many operating systems, in parallel.

Additionally, services, such as BrowserMob and Sauce Labs, have grown around the Selenium ecosystem to help you do testing and monitoring in a scalable and offsite fashion. It is these services that we'll focus on utilizing in this blog post.

So what does all this mean? It means that we have a mature and robust toolset that we can utilize to perform testing and monitoring of the complex web applications that we are deploying into production.

Getting started

So how do we get started? My preferred method is to begin by developing tests locally. You can use the Selenium IDE, but for this example I'll show a Firefox extension called Sauce Builder which makes it a snap to build and run your first test locally.

To get started you'll need Firefox installed, then go to the Sauce Builder download page and walk through getting the extension installed.

Once you've got the Sauce Builder extension installed, it is time to build our first test.

I'm going to walk you through building a test to search for jelly beans on Amazon.

  1. Open Firefox
  2. Click on Tools -> Sauce Builder
  3. Enter "amazon.com" in the Start Record prompt and click Go
  4. Enter "jelly beans" for the search term
  5. Click Go
  6. Click on the first search result, for me this was "Kirkland Signature Jelly Belly Jelly Beans 49 Flavors (4 Lbs)"
  7. Go back to the Sauce Builder window and click Stop recording.
  8. Now that we've recorded a test, we should save it for safe keeping. Click File -> Save or Export -> Choose HTML as the format and name it, then click Save.

As you can see from the test we've recorded, it is composed of a series of actions, and each action will have one or more options associated with it.

Here is a short video of recording your first test

Digging into how to modify and adapt tests is beyond the scope of what I want to cover in this post, but the following links are some good places to go deeper:

Now that we've recorded our first test, it is time to run it.

  1. Click on Run and choose Run test locally.
  2. The test will begin running in the currently selected tab in Firefox.
  3. Obviously this is a pretty simple test and you could do a lot more with it, including adding the item to a cart, checking out, and placing the order. But for the purposes of getting started, it's a good place to stop.

Here is a video of running your first test

The next thing we want to do, since our focus is on monitoring, is add some verification steps to each page load. This step is crucial in making our test do the same kind of checking that your traditional curl URL | grep STRING style monitoring did, but now it's integrated into our browser-driven mode of execution.

  1. Go to the Sauce Builder window
  2. Mouse over the second step and choose New step below
  3. Select the new step
  4. Choose edit action
  5. Select the assertion option
  6. Choose page content
  7. Choose assertText
  8. Click Ok
  9. Choose locator and enter "link=Your Amazon.com"
  10. Click Ok
  11. Choose equal to and enter the string "Your Amazon.com"
  12. Click Ok
  13. Click on Run and Run test locally

The test should run successfully. If it does not, you may want to click on locator, choose Find a different Target, and use the tool to select the element you're asserting text with.

This is a critical step as the assertions are somewhat brittle and must be maintained as your application changes over time. For more details, see help on choosing good locators.
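To see what the assertion is checking, and why it is brittle, here is a Python sketch that mimics a link-text locator plus an exact text match on static HTML. It is not Selenium and does not drive a browser; the class and function names are hypothetical, and it only uses the standard library's HTML parser.

```python
from html.parser import HTMLParser


class LinkTextCollector(HTMLParser):
    """Collect the visible text of every <a> element on a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current = None  # text pieces of the link being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = []

    def handle_data(self, data):
        if self._current is not None:
            self._current.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self.links.append("".join(self._current).strip())
            self._current = None


def assert_link_text(html, expected):
    """Fail unless some link's text equals `expected` exactly."""
    parser = LinkTextCollector()
    parser.feed(html)
    parser.close()
    assert expected in parser.links, f"no link with text {expected!r}"
```

Because the match is exact, a harmless copy change such as renaming the link breaks the check, which is why locators need maintenance as the application evolves.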

Here is a video of adding the assertion to your test and running it locally

Using Sauce Labs for Testing

Now that you've gotten your test running locally and you've added some assertions to make the test useful for monitoring, it is a good idea to run the test externally. As previously mentioned, the Sauce Labs folks run a service to run your tests in the Cloud, and they are nice enough to offer a free plan that gives you 200 "execution" minutes per month and the ability to run your tests under multiple browsers and operating systems with ease. Plus you'll get your jobs stored, logs, screenshots, and a video recorded of the whole test for later review and analysis. So now you're thinking, "where do I sign up?!"

To sign up for the free plan, do the following.

  1. Go to https://saucelabs.com/signup
  2. Enter a username
  3. Enter your email address
  4. Enter a password
  5. Click Sign Me Up

Now configure your Sauce Builder installation to use your free account

  1. Log in to https://saucelabs.com/ and click on View My API key
  2. Copy your API key
  3. In your test, choose Run -> Run on Sauce OnDemand
  4. Leave the default Linux - Firefox 3.0
  5. Click Run
  6. When prompted if you have a Sauce Labs account, choose Yes
  7. Enter your username and API key
  8. Choose Save
  9. Your test will start running. Grab a Snickers.
  10. You'll end up with a Job URL that looks something like https://saucelabs.com/jobs/6f4629f04dad85cd7803d8049ec00888 (which I've made public, since there is nothing private in it.)
  11. Review the details of the test. As you can see, you get the following for each test:
       • Platform
       • Start and End Times
       • Duration
       • Status
       • Breakdown of each Selenium command that's executed
       • Screenshot of the final page of the test
       • Video recording of the whole test run

At this point you've successfully executed a test on Sauce Labs. I recommend you review the following to get a full idea of Sauce Labs features, which include being able to use it programmatically from various languages, something that is beyond the scope of what I'm covering in this post.

Using BrowserMob for monitoring

So you've succeeded in getting your test to run locally, you've run it externally in the "Cloud", and now you're thinking "wasn't I promised I could use this for monitoring?". Yes, you were, and that's where BrowserMob comes in.

While BrowserMob's primary product is focused on load testing, they've also built a great monitoring product and that's what we'll be using to get our monitoring up and running.

BrowserMob is kind enough to offer a free plan, so let's start with getting signed up.

To sign up for the free plan, do the following.

  1. Go to https://browsermob.com/website-monitoring-load-testing-signup
  2. Enter all the required info.
  3. Click Sign Up
  4. Complete the email verification.
  5. You're done.

Now upload and verify your first test.

  1. Go to https://browsermob.com/account/overview
  2. Click on Scripts
  3. Click on Upload Selenium Browser Script
  4. Give it a Name
  5. Click Browse and locate the test file you saved from Sauce Builder
  6. Click on Upload
  7. It should automatically validate.
  8. If it passes validation, you should then see Revalidate, View Log, and Screenshot links
  9. Check out the log and screenshot to get an idea of what will be recorded for each monitoring test run.

Here is a short video showing uploading and verifying your first test

Let's configure an email address for notifications

  1. Click on Monitoring
  2. Click on Notifications
  3. Click on create one
  4. Enter a name
  5. Confirm the contact name and email, it will default to what you registered with
  6. Click Create

Now let's set up a monitoring job.

  1. Click on Monitoring
  2. Click on Schedule
  3. Give the job a name
  4. Select a Frequency
       • With the free account, you can run a simple test every 12 hours. For higher frequency or more complicated tests, you'll need to purchase a paid account.
  5. If you want to do an alert, click create
  6. Select a location
  7. Select your notification preference
  8. Click Activate Now
  9. The job will be scheduled and will run at the next interval after the minute the job was created.
  10. Since you just signed up for a trial, you can get the test to run a bit sooner, but only a couple times, so we'll do that now, so we can see what it looks like.
  11. Click Edit next to the test
  12. Change Frequency to 10 minutes
  13. Click Save and Activate
  14. Set a timer for 12 minutes and wait. Once it is done, we'll review what things look like.
       • Once you're done with this, you may want to revert to every 12 hours so that when your trial expires you won't be over your credits, or just pause/delete the monitoring job.

Here is a video of creating the monitoring job

So now that that test has run, let's take a look at what it looks like.

  1. Click on Dashboard
  2. Mouse over the name of the job and click on the URL. It should look something like this: https://browsermob.com/monitoring/view/{some_id_here}
  3. You should see a chart that defaults to 1 day and shows you each test, with a bar showing each data point, based on the overall time it took to run the test.
       • This gives you some quick insight into how performance (as measured by execution time) is doing over time.
  4. You can drill into each data point, and you'll get a waterfall style break down of each test run: how long each element of the page took to load, etc.

Below is a screenshot of a test that has run for a few days.

View Monitoring Job | BrowserMob


So a couple tips on how you can use custom stuff from BrowserMob's API to make your tests that much more effective.

Setting variables.

Since the BrowserMob scripts are written in JavaScript, declaring a variable is as simple as var zipcode = '90210'

Getting back data from a webpage.

I've only ever used this to get back the whole response from a page and use it as is, so you'd need to break out a bit of your own JS-fu if you want to use part of a response, but here's how I did it. The code below also shows using a previously declared variable in your request.

// Assumes c is an HTTP client created earlier in the script
var response = c.get("http://api.example.com:8080/id?" + zipcode);
var testid = response.getBody();

At this point the testid variable contains the string returned in the response from the request to http://api.example.com:8080/id?90210

Extra Logging

BrowserMob's JS API has a nice function called browserMob.log() which lets you log arbitrary data and it will show up in the raw logs that BrowserMob keeps for each test run. An example of this is

browserMob.beginStep("Step 2");
selenium.waitForPageToLoad(60000);
selenium.type("id=twotabsearchtextbox", "jelly beans");
browserMob.log('searched for jelly beans');
browserMob.endStep();

For more info on these and more functions, check out the BrowserMob API Documentation

What Next?

At this point you've successfully built a test, run it locally, run it in the "cloud", and deployed it to monitor every 12 hours with email alerts, and you're wondering what's next.

Well, among the things you could do would be

  1. Load Testing through BrowserMob
  2. Get called or paged by sending your email alerts into PagerDuty
  3. Interact with your own web services by using the "getting data back" example from above

I've made a github repository with my Amazon.com example.

As a challenge and a way to motivate people to contribute and give feedback, I will send a package of stickers, including SysAdvent, Github, Riak, and more, to the people who submit the 5 most interesting tests as pull requests on Github!

I hope you've found this post to be informative and would love feedback via email or Twitter on how you do end up using any or all of the services in this post.

December 20, 2011

Day 20 - Thoughts On Load Testing

This was written by Adam Fletcher (@adamfblahblah)

One task that often falls on the lonely sysadmin is load testing. In this article I'm going to talk about some philosophies and processes I used when doing load testing in my past roles.

I'm going to focus on testing the server side. There are a lot of articles on how to optimize the client side experience, and it is very important that you are aware of both the client side and server side tuning changes so that you can give the customer the best experience.

Use Science

Science

(credit to XKCD. Buy the shirt!)

What I mean by that is that load testing is really experimentation. You're testing a hypothesis: Is setup A better than setup B? Develop your hypothesis, experiment, and measurements, and make conclusions based on data, not feelings. Don't forget that you need to control as many variables as possible. Don't test on your VMs or your staging server that the customer is also using.

Have Defined Targets

You can't determine if your architecture scales/is fast enough/can handle traffic during a disaster/won't crash when you launch/etc. without having something measurable that determines success. For example, instead of saying "make each page load in 400 milliseconds", it is better to say something like "Every page load must have all resources loaded within an average of 400 milliseconds with a standard deviation of 25 milliseconds with 1000 clients performing actions every X seconds on Y series of pages." You will then know when you are done load testing because all the measurements you are taking show you have achieved success.
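
A target stated that way can be checked mechanically. Here is a minimal Python sketch, assuming you already have the per-page load times from a run in milliseconds; the function name and the sample numbers are hypothetical, and the 400 ms / 25 ms thresholds are the ones from the example target.

```python
from statistics import mean, pstdev


def meets_target(latencies_ms, max_mean_ms=400.0, max_stdev_ms=25.0):
    """Return (passed, mean, stdev) for one load test run."""
    avg = mean(latencies_ms)
    spread = pstdev(latencies_ms)
    passed = avg <= max_mean_ms and spread <= max_stdev_ms
    return passed, avg, spread


# Hypothetical page load times in milliseconds from two runs.
steady_run = [380, 390, 400, 410, 395, 385]
spiky_run = [200, 250, 300, 900, 350, 400]

for name, run in (("steady", steady_run), ("spiky", spiky_run)):
    passed, avg, spread = meets_target(run)
    print(f"{name}: mean={avg:.1f}ms stdev={spread:.1f}ms pass={passed}")
```

The spiky run is there on purpose: its mean still lands on the 400 ms target, and it is only the standard deviation clause that flags it.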

Scale First

If the Y axis is latency and the X axis is number of workers, then scaling is keeping Y constant while you increase X to infinity. This is much harder than keeping the number of workers constant and lowering latency. The first thing to look at during load testing is the shape of your latency curve as load increases. Keep that curve flat and you're most of the way there.
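
One rough way to put a number on "flat" is to fit a straight line through your latency measurements and look at the slope. The sketch below does a plain least-squares fit in Python; the worker counts and latencies are invented for illustration, and a straight line is only a crude summary of a curve that usually bends upward.

```python
def latency_slope(workers, latency_ms):
    """Least-squares slope of latency (ms) per additional worker."""
    n = len(workers)
    mean_x = sum(workers) / n
    mean_y = sum(latency_ms) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(workers, latency_ms))
    variance = sum((x - mean_x) ** 2 for x in workers)
    return covariance / variance


# Hypothetical measurements: concurrent workers vs. mean latency.
workers = [10, 50, 100, 200, 400]
flat = [120, 122, 121, 125, 124]
climbing = [120, 150, 210, 380, 900]

print(f"flat slope:     {latency_slope(workers, flat):.3f} ms/worker")
print(f"climbing slope: {latency_slope(workers, climbing):.3f} ms/worker")
```

A slope near zero means adding workers is not costing you latency yet; a slope that grows as you add higher worker counts tells you roughly where the curve starts to bend.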

Understand Your Traffic

If you're already suffering a load problem in production, great! Track the pages being requested and the path of the requests through your system. Use a tool like Google Analytics to get a picture of the diversity of the pages hit and the flow your users take through the product. You'll want to be able to model those flows in your load generation software.

For example, an online store may have a few different paths users commonly take through the system: arrive at the home page organically, search for a product, add to a cart, and check out; arrive at a special landing page for a sale, add to cart, checkout; arrive at the home page organically and use the customer services features; etc. If you viewed this store's traffic at a single point in time you could divide your simultaneous traffic up by percentage of pages hit: at a point in time, 30% of the requests are to /search, 20% are to /checkout, 40% are to /, and finally 10% are to /customer-service. This is the model you should use for load generation.
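
As a sketch of what that model looks like in code, here is a small Python example that picks request paths according to the percentages above. The function name and the seed are just for illustration; a real load generator would issue requests to these paths rather than print a tally.

```python
import random
from collections import Counter

# Request mix from the example above: path -> share of simultaneous traffic.
TRAFFIC_MIX = {
    "/": 0.40,
    "/search": 0.30,
    "/checkout": 0.20,
    "/customer-service": 0.10,
}


def pick_paths(count, mix=TRAFFIC_MIX, seed=None):
    """Choose `count` request paths at random, weighted by the traffic mix."""
    rng = random.Random(seed)
    paths = list(mix)
    weights = [mix[path] for path in paths]
    return rng.choices(paths, weights=weights, k=count)


sampled = pick_paths(10_000, seed=42)
for path, hits in Counter(sampled).most_common():
    print(f"{path:<20} {hits / len(sampled):.1%}")
```

Note that this only models the mix of pages at a point in time. Modeling full user flows (search, then add to cart, then check out) needs a sequence per simulated user, which is not shown here.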

Your traffic also follows patterns that depend on the time of day. With the advent of (somewhat) elastic capacity allocation, you can model these usage patterns and adjust your capacity to fit the patterns of usage.

Furthermore, you need to be aware of client side changes such as allowing your users to use HTTP pipelining or inlining JavaScript versus loading the JavaScript in another HTTP request. Making the server scale requires you to understand the client.

Know Your Resource Limits

You don't have infinite computing power. You don't have infinite money. Most importantly, you don't have infinite time. Be smart about how you use that time.

It's expensive to have people do all the load testing science we're talking about in this article. With a little thought, you can probably guess where your bottleneck is - I'm going to guess it is something related to your data storage. Use your systems knowledge to make your first hypothesis "If I remove the obvious bottleneck, the system will be faster".

Also, to paraphrase Artur Bergman, don't be a backwards idiot - buy some SSDs. They are expensive per GB but they are dead cheap per IOP/S. They're also cheaper than the time you are spending doing the load testing. You'll want to use these SSDs in the machines that have the highest IO load (and you know which machines those are, because you're measuring IO load, right?).

Graphs Lie

There was an excellent talk at Velocity this year by John Rauser, entitled Look at Your Data, about the dangers of trusting your graphs. He pointed out that you have to be careful of the trap of plotting many data points that share the same X as a single mean, because this representation hides the distribution of that data. This most commonly occurs when measuring latency and during load testing, when you have many requests at time X that, when averaged, come to Z milliseconds. Plot Z for many Xs and you miss the distribution of the latencies at each X.

John's video explains it better, but if you look at this graph:

Request Latency Over Time

You'd think from this graph that everything is great - your latency went down!

But if we look at the distribution of our data at each sampling point:

Request Latency Over Time With Sample Distribution

We see that some of our users are having a really bad experience on our site.
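
The same effect is easy to reproduce with a handful of numbers. This Python sketch uses two invented buckets of latency samples (not the data behind the graphs) and a simple nearest-rank percentile helper to show how a falling mean can sit alongside a terrible worst case.

```python
from math import ceil
from statistics import mean


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct from 0 to 100)."""
    ordered = sorted(values)
    rank = max(1, ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


# Hypothetical latency samples (ms) collected in two time buckets.
samples = {
    "10:00": [100, 105, 110, 95, 100, 102, 98, 101, 99, 100],
    "10:05": [40, 42, 38, 41, 39, 40, 43, 37, 40, 500],
}

for bucket, latencies in samples.items():
    print(
        f"{bucket}: mean={mean(latencies):.0f}ms "
        f"p50={percentile(latencies, 50)}ms "
        f"p95={percentile(latencies, 95)}ms "
        f"max={max(latencies)}ms"
    )
```

Plotting a high percentile or the max next to the mean is a cheap way to keep the distribution visible on the graph.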

A good example of a tool that doesn't have this problem is Smokeping. Here's an example of Smokeping telling me that my home internet connection has some jitter in latency:

Comcast ICMP Ping Latency

I've also put a gist up with the R code used to generate the graphs above here.

Measure Time and Resources Spent in Each Component

If you aren't instrumenting each piece of software in your stack you should start doing so. Instrument the entry point to your software and the exit point and graph this data over time. Combine this with even simple data from sar, iostat, other *stat tools, etc, and you can learn a lot about your code without ever firing up a profiler.
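
If your stack happens to include Python, entry/exit instrumentation can be as small as the sketch below. The component names and the sleeps are stand-ins, and in practice you would ship the timings to whatever graphing system you use instead of keeping them in a dictionary.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# component name -> list of elapsed times in seconds
timings = defaultdict(list)


@contextmanager
def timed(component):
    """Record wall-clock time between entering and leaving a component."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings[component].append(time.perf_counter() - start)


def handle_request():
    """Hypothetical request handler with two instrumented components."""
    with timed("database"):
        time.sleep(0.02)  # stand-in for a query
    with timed("render"):
        time.sleep(0.01)  # stand-in for template rendering


for _ in range(3):
    handle_request()

for component, samples in timings.items():
    print(f"{component}: n={len(samples)} avg={sum(samples) / len(samples) * 1000:.1f}ms")
```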

Learn And Use The Right Tools

Good tools will allow you to export the raw data in such a way that you can then do analysis on it. Tools that expose your system resource consumption metrics are critical, and it probably doesn't matter what you use as long as you are storing and graphing roughly what iostat, sar, vmstat, netstat and top give you. Learn what each metric really means - do you know why your software is context switching 4000 times a second? Do you know if that is bad (hint: probably)? How would that manifest itself in top?

Learn to use the profiler that comes with your product's implementation language. Profilers are amazing things. If you can't use a language-specific profiler try a system-wide profiler such as oprofile or similar.

When you have all this data, use a real data analysis tool to look at it. Instead of using Excel or a clone for data analysis, consider learning a numerical computing environment such as R or NumPy/SciPy. For example, in R or NumPy you can write a script that takes all of your raw resource consumption data (CPU, RAM, IOPS, etc.) and runs correlation tests against the latency data. Try to do that in Excel! Oh, and you can then use that script in your monitoring.
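
To give a flavor of such a script, here is a sketch in plain Python (no NumPy needed for something this small) that ranks resource metrics by how strongly they correlate with latency. All of the numbers and metric names are invented, and the helper will divide by zero if a series is constant, so treat it as a starting point only.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Hypothetical per-minute samples from one load test run.
latency_ms = [110, 120, 135, 160, 210, 300]
metrics = {
    "cpu_percent": [40, 42, 41, 43, 40, 42],
    "iops": [500, 650, 900, 1300, 2100, 3400],
    "free_ram_mb": [2048, 2040, 2051, 2045, 2049, 2043],
}

# Strongest relationship (positive or negative) first.
ranked = sorted(metrics, key=lambda name: abs(pearson(metrics[name], latency_ms)), reverse=True)
for name in ranked:
    print(f"{name:<12} r={pearson(metrics[name], latency_ms):+.2f}")
```

Correlation will not tell you which metric causes the latency, but it is a quick way to decide where to point the profiler first.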

People often call load testing an art, but all that really means is that they're not doing science. Load testing can be challenging, but hopefully this article has given you some things to think about to make your load testing easier and more effective.

Further Reading

  • Learning R - a blog covering lots of cool visualizations and techniques in R.

December 19, 2011

Day 19 - Why Use Configuration Management?

This was written by Aleksey Tsalolikhin (http://verticalsysadmin.com/blog/). Illustrations by Joseph Kern

If you ask Wikipedia, "Configuration management (CM) is a field of management that focuses on establishing and maintaining consistency of a system."

Configuration management tools increase sysadmin efficiency and make sysadmin life better. As our systems grow larger and more complex, we need better tools to help us increase control and reliability of ever growing quantities and complexities in computing. Examples of such tools include Bcfg2, Cfengine, Chef, and Puppet - all of which are open source!

Configuring systems manually in interactive sessions is error-prone and extremely labor-intensive. Even with mostly-automated scripts, such as the typical "ssh and a for-loop" solution, pushing ad-hoc changes is still error-prone. For example, if a system is down for maintenance while a change is being pushed out over ssh, it will miss that change, and "state drift" will occur between it and other systems in the same class.

You want a tool that helps keep actual and desired state the same.
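
The core idea is small enough to sketch in a few lines of Python: compare the actual state with the desired state and act only when they differ, so the same check can be run again and again. This is an illustration of the concept on a POSIX system, not how any particular CM tool is implemented, and the function name is made up.

```python
import os
import stat
import tempfile


def ensure_mode(path, desired_mode):
    """Bring a file's permission bits to desired_mode; return True if changed."""
    actual_mode = stat.S_IMODE(os.stat(path).st_mode)
    if actual_mode == desired_mode:
        return False  # already compliant, nothing to do
    os.chmod(path, desired_mode)
    return True


# Demonstrate on a temporary file so nothing real is touched.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    path = handle.name

os.chmod(path, 0o644)             # simulate a host that drifted
print(ensure_mode(path, 0o600))   # first run repairs the drift
print(ensure_mode(path, 0o600))   # second run finds nothing to change
os.remove(path)
```

A host that was down when a change went out simply gets repaired the next time the check runs, which is the property the "ssh and a for-loop" approach lacks.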

System imaging is a common strategy for dealing with complexities of config management - make a copy of a system image, label it "gold master", and clone it to make new systems. While this approach helps to crank out identically configured systems, it has the weakness that updating the master image can be a pain and it does nothing to maintain the systems configured after the initial deploy. It is also not very auditable (what changed between golden image v1 and v2?).

Many sysadmins still configure systems with more traditional manual, ad-hoc, and hard-to-audit methods. In some cases, sysadmin teams build home-grown tools to solve these problems. An example of this is Ticketmaster, who released their own config management, "ssh and for loop" tool, and provisioning systems.

Why do we care to do this? Well, why do we administer systems? Correct configuration helps keep computer systems in use by human civilization.

CM tools free sysadmins' time for more challenging and creative system engineering and architecture work, and for taking the naps which power such work.

Minimize Manual Effort

Minimize manual effort by automatically configuring new systems. This works well because repeatable work is best left to computers; they don't get bored, and they don't forget steps.

"Go away or I will replace you with a very small shell script" - you've probably seen this shirt before, right? How about hearing someone recommend "automating yourself out of a job"? Building systems and fighting fires without any tools is a slow task that is difficult to repeat accurately, and with many sysadmin skills being software-related, it is in your interest to automate system turn up, maintenance, and repair. Automation helps reduce time spent in corrective actions, reduces mental energy consumed, reduces stress, and increases business value and agility. Winning!

In using a config management system, you are implicitly documenting the system's "desired state" - Why is the system configured this way? What are its dependencies? Who cares about the system? This documenting capability helps protect against knowledge loss by moving configuration knowledge out of your brains and into a version control system. This helps defend against data lost through forgetfulness or staff changes, and it also facilitates alignment of efforts on a multi-sysadmin team.

In general, configuration management is in the realm of "Infrastructure as Code". Once your infrastructure is represented in code, you can think about applying release engineering and other tools - tag a new policy as "unstable", test it, then move the new policy into the "stable" branch where servers will apply it.

A Visualization

Sys Admin configures a server manually, ad hoc, and hands-on.

Sys Admin writes a configuration management tool program to configure a server. Then the CM tool (like a little sysadmin robot) configures the server.

Sys Admin takes a nap, while the CM tool configures more servers, and keeps checking and re-configuring the servers (as needed) to keep them in compliance with the program.

Getting Started

To encourage sysadmins to start using Configuration Management, the following is a rough guide to doing some small tasks in a few different open source configuration management tools, demonstrating what policies look like in each. Bourne shell examples are provided to aid understanding.

Using these examples

  • Bourne shell: Can be run on the command line or via cron
  • CFengine: Follow the quick start guide. In a nutshell, put the example into a promise bundle inside a policy file (example.cf) and run it from the command line with "cf-agent -f example.cf -b $bundlename"; or integrate it into the default policy set in promises.cf in the CFEngine work directory, often found in /var/cfengine/inputs.
  • Chef: Follow the Chef Fast Start guide
  • Puppet: Follow the Getting Started guide. For quick testing of these examples, you can write them to a file 'foo.pp' and execute them with puppet apply foo.pp. Puppet also supports a client-server model that is more common for production deployments.

Set Permissions on a File

  • Bourne shell

    chmod 600 /tmp/testfile
    
  • CFengine

    files:
        "/tmp/testfile"
             perms   => m("600");
    
  • Chef

    file "/tmp/testfile" do
      mode "0600"          
    end                  
    
  • Puppet

    file { "/tmp/testfile":
       mode => 0600;
    }
    

Create a file with some content

  • Bourne shell

    echo 'Server will be down for maintenance 2 AM - 4 AM' > /etc/nologin
    
  • CFengine

    files:
       "/etc/nologin"
            create     => "true",
            edit_line  =>  insert_lines("Server will be down for maintenance 2 AM - 4 AM");
    
  • Chef

    file "/etc/nologin" do
      content 'Server will be down for maintenance 2 AM - 4 AM' 
    end 
    
  • Puppet

    file { "/etc/nologin":
      ensure => present,
      content => "Server will be down for maintenance 2 AM - 4 AM";
    }
    

Install a package

  • Bourne shell

    yum -y install httpd
    
  • CFengine

    packages:  
        "httpd"
            package_policy => "add",
            package_method => yum;
    
  • Chef

    package "httpd" 
    
  • Puppet

    package { "httpd":
      ensure => present;
    }
    

Make sure a service daemon is running

  • Bourne shell

    # The [h] pattern keeps grep from matching its own entry in the ps output
    ps -ef | grep '[h]ttpd' >/dev/null

    if [ $? -ne 0 ]
      then /etc/init.d/httpd start
    fi
    
  • CFengine

    processes:
       "httpd"
            restart_class => "restart_httpd";
    
    commands:
     restart_httpd::
       "/etc/init.d/httpd start";
    
  • Chef

    service "httpd" do
      action :start   
    end             
    
  • Puppet

    service { "httpd":
      ensure => running;
    }
    

Final Thoughts

There's going to be a learning curve to any config management system, but I have found that the benefits in being able to audit, repeat, test, and share "desired state" in code far outweigh any time spent learning the config management tools.

Further Reading