December 15, 2013

Day 15 - Distributed Storage with Ceph

Written by: Kyle Bader (@mmgaggle)
Edited by: Michelle Carroll

Introduction

Ceph provides a scalable, distributed storage system based on an intelligent object storage layer called RADOS. RADOS handles replication and recovery, and uses the CRUSH algorithm to achieve a statistically even distribution of objects across the storage nodes participating in a Ceph cluster. This article will help you get a small Ceph cluster up and running with Vagrant, an extremely powerful tool for development and testing environments.

Architecture

Ceph Monitors

Ceph monitors maintain the authoritative maps of the cluster and achieve consistency using Paxos. Each member of a Ceph cluster connects to one of the monitors in the quorum to receive maps and report state. A quorum of Ceph monitors should contain an odd number of members to avoid ties during leader election -- typically 3 or 5. It's important to note that Ceph monitors are not in the data path. Ceph monitors also provide a means of authentication for storage daemons and clients.

Object Storage Device

Object storage devices, or OSDs, are daemons that abstract underlying storage devices and filesystems to provide the RADOS object storage interface.

Object Storage Device states

Object storage device daemons can be in various states, which fall into two groupings:

in/out

OSD daemons that are "in" are mapped into the cluster and will participate in placement groups. In the event that an OSD is marked "out" -- either by exceeding the amount of time the configuration allows it to be "down" or by operator intervention -- the placement groups in which it participated will be remapped to other OSDs and data will be backfilled from the surviving replicas.

up/down

OSD daemons will be marked "up" when they are running, able to successfully peer with the OSD daemons with which they share a placement group, and able to send heartbeats to a Ceph monitor. If an OSD is marked "down", it has failed to meet one of those conditions.

CRUSH

CRUSH is a deterministic, pseudo-random placement algorithm used by RADOS to map data according to placement rules (defined in what is known as a CRUSH map).
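
As a rough illustration, a decompiled CRUSH map contains rules that look something like the following sketch of a simple replicated rule that spreads replicas across hosts (names and numbers are illustrative):

rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}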

Placement Groups

Placement groups are portions of a pool that get distributed to OSDs based on their status and your CRUSH map. The replica count of a given pool determines how many OSDs will participate in a placement group. Each primary in a placement group receives writes and replicates them to its peers, acknowledging only after all replicas are consistent. Reads, on the other hand, can be serviced from any replica of a given object.

Placement Group States

The following are a few of the most relevant placement group states, along with a description of each; for an exhaustive list, see the official documentation.

peering,active

Placement groups are considered "peering" when the participating OSDs are still gossiping about cluster state. Once the OSDs in a placement group complete the peering process they are marked "active".

clean,degraded

Clean placement groups have sufficient replicas of each object to satisfy the current pool configuration. Degraded placement groups lack sufficient replicas to satisfy the durability the system has been configured for.

remapped

Remapped placement groups are no longer mapped to their ideal location as determined by the cluster's state and CRUSH map.

backfilling

Placement groups that are backfilling are copying replicas from OSDs that participated in the placement group before remapping.

incomplete

Placement groups are incomplete when not enough of their member OSDs are marked "up".

inconsistent

Replicas have been discovered that are inconsistent and require repair.

Clients

There are a variety of different clients that have been written to interact with a Ceph cluster, namely:

librados

librados is the native Ceph interface; a variety of language bindings are available should you choose to interact directly with Ceph from your own applications.
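
For example, a minimal sketch with the Python bindings might look like this (the pool and object names are illustrative, and a usable client keyring is assumed to be configured):

import rados

# Connect to the cluster described by ceph.conf
cluster = rados.Rados(conffile='/etc/ceph/ceph.conf')
cluster.connect()
ioctx = cluster.open_ioctx('data')               # open an I/O context on the 'data' pool
ioctx.write_full('hello-object', 'hello world')  # write (or overwrite) an object
data = ioctx.read('hello-object')                # read it back
ioctx.close()
cluster.shutdown()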

radosgw

radosgw provides a RESTful interface to a Ceph cluster that is compatible with both Amazon S3 and OpenStack Swift.

qemu rbd

Modern versions of qemu support rbd, or RADOS block device backed volumes. These rbd volumes are striped across many objects in a Ceph cluster.
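
For example, qemu-img can create an rbd-backed volume directly (the pool and image names here are illustrative):

qemu-img create -f rbd rbd:data/test-image 1G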

krbd

Similar to qemu rbd, krbd refers to RADOS block devices that are mapped by the Linux kernel and receive a major/minor device node.
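
A rough sketch of using the kernel client, assuming the rbd kernel module is available (the image name and mount point are illustrative):

sudo rbd create test-image --size 1024   # 1 GB image in the default 'rbd' pool
sudo rbd map test-image                  # appears as a device such as /dev/rbd0
sudo mkfs.ext4 /dev/rbd0
sudo mount /dev/rbd0 /mnt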

cephfs

cephfs is Ceph's distributed filesystem. Features of cephfs include dynamic subtree partitioning of metadata, recursive disk usage accounting, and snapshots.

Vagrant

To quickly get a Ceph cluster up and running to experiment with the software, I've created a Vagrantfile and corresponding box files so that a small virtual cluster can be provisioned on a modest machine (8 GB or more of system memory is suggested). The Vagrantfile is written in such a way that the first machine provisioned is an Open Source Chef Server, which will run a script to load a set of Chef cookbooks and set up the environment for configuring a Ceph cluster. If you don't have Vagrant installed already, you can follow the official getting started guide. Next, you will need to clone the vagrant-ceph repository from GitHub:

git clone https://github.com/mmgaggle/vagrant-ceph.git
cd vagrant-ceph
git checkout sysadvent
To bootstrap the Chef server and set up an initial monitor, simply run:

export VAGRANT_CEPH_NUM_OSDS=2
vagrant up chefserver cephmon
Both nodes should boot and converge to our desired state using Chef. After the cephmon node converges, you should wait a minute for the Chef server to index the bootstrap key material and make it available for search (tiny VMs are slow). After you have waited a minute you can start up the OSD nodes:
vagrant up cephstore1001 cephstore1002
The result should be two Ceph OSD nodes each running 1 Ceph OSD daemon. Now that the cluster is provisioned you can stop the Chef server to free up some resources on your machine:
vagrant halt chefserver

Ceph Basics

To examine cluster state you will need access to a CephX keyring with administrative permissions. The cephmon node we booted generated keyrings during convergence; run the following commands after establishing an SSH connection to cephmon:
vagrant ssh cephmon
The following are some commands to help you understand the current state of a Ceph cluster. Each of these commands should be run either as the root user or via sudo.
vagrant@cephmon:~$ sudo ceph health
HEALTH_OK
The ceph health command responds with a general health status for the cluster; the result will be either "HEALTH_OK" or a list of problematic placement groups and OSDs.
vagrant@cephmon:~$ sudo ceph -s
health HEALTH_OK
monmap e1: 1 mons at {cephmon=192.168.0.201:6789/0}, election epoch 2, quorum 0 cephmon
osdmap e9: 2 osds: 2 up, 2 in
pgmap v16: 192 pgs: 192 active+clean; 0 bytes data, 71120 KB used, 10628 MB / 10697 MB avail
mdsmap e1: 0/0/1 up
The -s flag is the abbreviation for status. Calling the ceph -s command will return the cluster health (as ceph health reported earlier) along with lines that detail the status of the Ceph monitors, OSDs, placement groups, and metadata servers.
vagrant@cephmon:~$ sudo ceph -w
cluster cf376172-55ef-410b-a1ad-b84d9445aaf1
health HEALTH_OK
monmap e1: 1 mons at {cephmon=192.168.0.201:6789/0}, election epoch 2, quorum 0 cephmon
osdmap e9: 2 osds: 2 up, 2 in
pgmap v16: 192 pgs: 192 active+clean; 0 bytes data, 71120 KB used, 10628 MB / 10697 MB avail
mdsmap e1: 0/0/1 up

2013-12-14 03:24:03.439146 mon.0 [INF] pgmap v16: 192 pgs: 192 active+clean; 0 bytes data, 71120 KB used, 10628 MB / 10697 MB avail
The -w flag is the abbreviation for watch. Calling the ceph -w command returns the same output as ceph -s, and then tails the cluster log, punctuating it with periodic status updates.

vagrant@cephmon:~$ sudo ceph osd tree
# id weight type name up/down reweight
-1 0.01999 root default
-2 0.009995  host cephstore1001
0 0.009995   osd.0 up 1
-3 0.009995  host cephstore1002
1 0.009995   osd.1 up 1
Shows a tree of your CRUSH map along with the weights and statuses of your OSDs.
vagrant@cephmon:~$ sudo ceph osd dump

epoch 9
fsid cf376172-55ef-410b-a1ad-b84d9445aaf1
created 2013-12-14 03:15:49.419751
modified 2013-12-14 03:21:57.738002
flags
pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0

max_osd 2
osd.0 up in weight 1 up_from 4 up_thru 8 down_at 0 last_clean_interval [0,0) 192.168.0.205:6800/3618 192.168.0.205:6801/3618 192.168.0.205:6802/3618 192.168.0.205:6803/3618 exists,up a2515c14-f2e4-44b2-9cf9-1db603d7306a
osd.1 up in weight 1 up_from 8 up_thru 8 down_at 0 last_clean_interval [0,0) 192.168.0.206:6800/3620 192.168.0.206:6801/3620 192.168.0.206:6802/3620 192.168.0.206:6803/3620 exists,up ea9896f2-7137-4527-a326-9909dfdfd226
Dumps a list of all osds and a wealth of information about them.
vagrant@cephmon:~$ sudo ceph pg dump
dumped all in format plain
version 16
stamp 2013-12-14 03:24:03.435845
last_osdmap_epoch 9
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.3d 0 0 0 0 0 0 0 active+clean 2013-12-14 03:21:58.042543 0'0 9:16 [0,1] [0,1] 0'0 2013-12-14 03:20:08.744088 0'0 2013-12-14 03:20:08.744088
1.3e 0 0 0 0 0 0 0 active+clean 2013-12-14 03:21:58.045611 0'0 9:16 [0,1] [0,1] 0'0 2013-12-14 03:20:08.601796 0'0 2013-12-14 03:20:08.601796
...
Dumps a list of placement groups.
vagrant@cephmon:~$ sudo rados df
pool name category KB objects clones degraded unfound rd rd KB wr wr KB
data - 0 0 0 0 0 0 0 0 0
metadata - 0 0 0 0 0 0 0 0 0
rbd - 0 0 0 0 0 0 0 0 0
total used 71120 0
total avail 10883592
total space 10954712
Print a list of pools and their usage statistics.

Fill it Up

First, create a pool named "vagrant" so that we have somewhere to write data to:
vagrant@cephmon:~$ sudo rados mkpool vagrant
successfully created pool vagrant
Next, use the 'rados bench' tool to write some data to the cluster so it's a bit more interesting:
vagrant@cephmon:~$ sudo rados bench -p vagrant 200 write --no-cleanup

Maintaining 16 concurrent writes of 4194304 bytes for up to 200 seconds or 0 objects
Object prefix: benchmark_data_cephmon_3715
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 18 2 7.98932 8 0.835571 0.82237
2 16 24 8 15.9871 24 1.84799 1.35787
...
201 15 1183 1168 23.2228 8 2.95785 2.72622
Total time run: 202.022774
Total writes made: 1183
Write size: 4194304
Bandwidth (MB/sec): 23.423

Stddev Bandwidth: 7.25419 Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency: 2.73042
Stddev Latency: 1.07016
Max latency: 5.34265
Min latency: 0.56332
You can use the commands you learned earlier to watch the cluster as it processes write requests from the benchmarking tool. Once you have written a bit of data to your virtual cluster you can run a read benchmark:
vagrant@cephmon:~$ sudo rados bench -p vagrant 200 seq
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 15 21 6 22.4134 24 0.728871 0.497643
2 16 33 17 32.8122 44 1.93878 1.00589

Starting and Stopping Ceph Daemons

At some point you will need to control the ceph daemons. This test cluster is built on Ubuntu Precise and uses Upstart for process monitoring.

With the commands you just learned you can inspect the cluster’s state, enabling you to experiment with stopping and starting Ceph monitors and OSDs.
vagrant@cephstore1001:~$ sudo stop ceph-osd id=0
ceph-osd stop/waiting
vagrant@cephstore1001:~$ sudo start ceph-osd id=0
ceph-osd (ceph/0) start/running, process 4422

Controlling Monitors

vagrant@cephmon:~$ sudo stop ceph-mon id=`hostname`
ceph-mon stop/waiting
vagrant@cephmon:~$ sudo start ceph-mon id=`hostname`
ceph-mon (ceph/cephmon) start/running, process 4106

Closing

I hope you find this a useful and interesting introduction to Ceph. The community is always interested in making it easier for people to experiment with the system. If you want to learn more about Ceph, you can dive into the excellent documentation at the Ceph homepage. I'm working on adding support to the Vagrant environment for larger clusters launched on EC2 and, later, OpenStack-compatible clouds. Pull requests and comments are welcome!

December 14, 2013

Day 14 - What is Packer?

Written By: Mike English (@gazoombo)
Edited By: Michelle Carroll

Packer is a new open source tool for building identical machine images. It's written in Go and maintained by HashiCorp, the creators of Vagrant.

Much like VeeWee (an earlier image-building tool by Patrick Debois from which Packer certainly drew some inspiration), Packer allows you to repeatedly create new VM images from build configurations defined in source code.

Packer takes advantage of an extensible plugin architecture to allow the same source templates to be used with multiple builders, provisioners, and post-processors to create artifacts. For example, a single JSON template and set of Chef cookbooks or Puppet modules could be used to create an AMI, a VirtualBox-based Vagrant basebox, and a VMWare image to be deployed to vSphere.
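
To give a sense of what a template looks like, here is a minimal sketch with a single amazon-ebs builder (the credentials, AMI ID, and AMI name are placeholders, not values from this article):

{
  "builders": [
    {
      "type": "amazon-ebs",
      "access_key": "YOUR_ACCESS_KEY",
      "secret_key": "YOUR_SECRET_KEY",
      "region": "us-east-1",
      "source_ami": "ami-xxxxxxxx",
      "instance_type": "t1.micro",
      "ssh_username": "ubuntu",
      "ami_name": "packer-example {{timestamp}}"
    }
  ]
}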

It allows for a great deal of code reuse in this regard, as well as portability to other environments. Packer also extends the ideas of infrastructure as code and automation down to the level of OS installation.

What can I use it for?

You can use Packer with as little as a template and a kickstart file (or equivalent) to build JeOS images for your organization. This is a great way to make your assumptions explicit about what constitutes a "minimal OS install" for your distribution of choice, including options like SELinux or the default language.

You can also use Packer along with higher level provisioning tools to build images with packages and configuration pre-installed. For example, you may have an application server built atop a JeOS image that isn't ready to accept an application deployment until it's been provisioned for the first time with a 45 minute Chef run. You could use Packer with the Chef-Solo provisioner to build pre-provisioned images so that the whole 45 minute Chef run doesn't need to occur for each individual node.
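
The provisioner section of such a template might look something like this sketch using Packer's chef-solo provisioner (the cookbook path and run list are placeholders):

"provisioners": [
  {
    "type": "chef-solo",
    "cookbook_paths": ["cookbooks"],
    "run_list": ["recipe[myapp::default]"]
  }
]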

Depending on how your application manages state, you may even be able to build fully-provisioned images ready to run services on boot. For example, having pre-built, fully-provisioned images could be very useful when auto-scaling quickly.

Traditional image-based deploys

Aren't images a big step backwards?

There are some upsides to the traditional approach to images:

  • Once you get a gold master image, you can keep using it... until your needs change (hint: never change your requirements).
  • Re-using images is better than configuring everything by hand every time.

But, here are some of the downsides often associated with the traditional approach:

  • It takes 3 weeks to get an image after filling out the paperwork to request one! Like most IT problems, this is partly a tooling problem, but mostly an organizational one.
  • Images are often created by hand, meaning they are error-prone, undocumented, and not easily repeatable. This can lead to discrepancies between Production and Development environments, or worse—discrepancies and idiosyncrasies between discrete Production systems.
  • These problems lead to an overprotective attitude toward the "gold master" image, landing us back in a situation where heavy-handed request processes and long turnaround times are unlikely to be challenged.

Packer is not your [old boss]'s approach to machine images!

Configuration Management

Yeah, so, configuration management tools saved us from all that, right? I thought we were freed from images by tools like Puppet or Chef!

Using tools like Puppet and Chef has drastically improved configuration management by making it repeatable and self-documenting, leading to parity between Production and Development environments and enabling rapid change.

Downsides?

  • It's not quite as easy to "ship" CM as it is to "ship" an image. (Take, for example, launching a new EC2 instance from an AMI vs. launching one with an AMI plus a bunch of CloudFormation scripts…)
  • It can take a really long time to get through the first provisioning run on a new system when you're configuring something especially complex.
  • OS-level configuration is not always well-addressed across platforms. Assumptions about the underlying OS installation are not made explicit in code, and documentation can fall through the cracks.

Images the Packer Way

Using configuration management tools with Packer, images can be repeatable, self-documenting, and portable across multiple platforms (thanks to the many builder plugins available). Packer also ensures production/development parity better than configuration management alone. With Packer, you're more likely to include OS-installation-level config in the source code (kickstart / preseed / etc.), and it forces at least a bare minimum of documentation of assumptions about the underlying OS installation.

Immutable Infrastructure and the Question of State

When building images for more than a simple well-defined JeOS base, the inevitable question arises of how you deal with state. This is good. We should all spend a lot more time thinking about this.

Taking Inspiration from Functional Programming

Finding better ways to manage state has led many software developers to adopt a functional programming paradigm. Some of these concepts can be applied to the way we think about our infrastructure as well.

Over the summer, Chad Fowler put forth the interesting idea of Immutable Deployments:

Many of us in the software industry are starting to take notice of the benefits of immutability in software architecture. We’ve seen an increased interest over the past few years in functional programming techniques with rising popularity of languages such as Erlang, Scala, Haskell, and Clojure. Functional languages offer immutable data structures and single assignment variables. The claim (which many of us believe based on informal empirical evidence) is that immutability leads to programs that are easier to reason about and harder to screw up.

So why not take this approach (where possible) with infrastructure? If you absolutely know a system has been created via automation and never changed since the moment of creation, most of the problems I describe above disappear. Need to upgrade? No problem. Build a new, upgraded system and throw the old one away. New app revision? Same thing. Build a server (or image) with a new revision and throw away the old ones.

Not There Yet

Kris Buytaert recently remarked that "...immutable applications are really the exception rather than the rule." That is, we should be careful about thinking that we can start deploying all of our applications as binary virtual appliances.

It's true, most useful applications need to persist a lot of state, and often do so in complex ways. That isn't to say that "Immutable Deployments" are impossible—it just means we have a lot of work to do.

Virtual Appliances

This past year, (but, unfortunately, before Packer was released), I worked on a project to build a Virtual Appliance. The deliverable was a process for creating an OVA containing several related applications suitable for deployment on the most commonly used hypervisors in the enterprise.

From early on in the project, we treated the question of state persistence as a primary concern. Our approach was to make sure we had a good import/export process. Newly launched appliance images were designed to be able to import from backups or go through a first-time configuration process. In this way, new artifacts can be deployed as images while maintaining continuity. Even though we had a good automated build process for our appliance images, we had to do the work to ensure we managed the application state appropriately.

Disaster Recovery and Phoenix Servers

Combining well-managed state with a good automated image build process also provides a great deal of value when it comes to disaster recovery. If you lose all of your production servers, but you have offsite backups that are easily imported to new nodes running your up-to-date image, you can get back online much more quickly.

In his definition of a Phoenix Server, Martin Fowler provides an apt thought experiment:

One day I had this fantasy of starting a certification service for operations. The certification assessment would consist of a colleague and I turning up at the corporate data center and setting about critical production servers with a baseball bat, a chainsaw, and a water pistol. The assessment would be based on how long it would take for the operations team to get all the applications up and running again.

This may be a daft fantasy, but there's a nugget of wisdom here. While you should forego the baseball bats, it is a good idea to virtually burn down your servers at regular intervals. A server should be like a phoenix, regularly rising from the ashes.

In order to prevent your new images from becoming the dreaded stale gold masters of old, consider taking advantage of the repeatability and automation these new tools provide.

In short

Packer is a tool for building identical machine images. It can help to provide repeatability, documentation, parity between production and development environments, portability across platforms, and faster deployments. But like any tool, it still requires a good and thoughtful workflow to be used most effectively.

For an example of how to use Packer, see the Getting Started section of the documentation.

December 13, 2013

Day 13 - Controlling a cluster of servers with Serf

Written By: Darron Froese (@darron)
Edited By: Shaun Mouton (@sdmouton)

First - Docker and Dependency Management

This year's introduction of Docker has been huge for sysadmins everywhere. Whether or not you already understand what Docker can do for you - let me assure you - it has the potential to change how we think, work and build services.

At nonfiction, we host a large number of web applications for customers. Some of those web applications were developed for a specific purpose, and because they're often not business critical, they don't get a lot of regular updates. Our customers don't have the budget or the desire to continue working on them year after year, upgrading as technology matures. As a result, we have a number of pretty old web applications that work pretty well but are not based on current technology.

From a sysadmin perspective, deploying these old applications can be pretty complicated - the dependencies can be pretty hairy and downright fickle. Although we often use Heroku for many applications, we still end up having:
  1. Ruby 1.8.x Server
  2. Ruby 1.9.x Server
  3. Specialized application Server for "that" project.
  4. Node.js 0.8.x Server
  5. Really old RHEL Server for that 11 year old PHP 4.x application. (Yes - really.)
That's not awesome at all.

More servers - especially servers that run a very low number of low-traffic apps - seem like a waste of resources, money and time. We are paying for too much capacity in order to get cleaner dependency management. For example, here's the CPU usage for an old Ruby 1.8.x application server:
[Graph: CPU usage for the last month on an app server]
What a waste.

Docker changes all of that.

With Docker you can run containers for any of your applications and run all of those applications (and more) on a single server. No crazy gem / Ruby version problems - everything lives in its own self-contained container. No more "I don't have Python 3.3 on that server" or "Sorry - can't compile that version of Node on that old box - gotta move it."
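
For example, once an application has been baked into an image, running it alongside everything else on a single host is one command (the image name and port mapping here are illustrative):

docker run -d -p 80:5000 sysadvent/harp-example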

We like Docker so much, that we're building a new service with it providing the backend infrastructure. Our backend is named octohost and is available on Github.

Enter - Serf

The Serf website bills it thusly:
Serf is a decentralized solution for service discovery
and orchestration that is lightweight, highly available,
and fault tolerant.

In short, Serf is a system built to pass messages around and trigger events from server to server - some examples are listed on the website. Instead of building your own messaging system or inventing a new daemon, you can connect a number of servers together using Serf and use it to trigger "events".

We're going to use it to connect some Docker servers together:
  1. Compile server - this server compiles the software into a Docker container and pushes it to the Registry server.
  2. Registry server - this server receives and stores the container. (We are cheating and using the regular Docker INDEX for this.)
  3. Web server - this server pulls the container once it's ready to download and makes it available on the web.
Serf has the concept of Roles where you can tell a particular member of the cluster that it's a "{insert-role-here}" and only the events that apply to that role will be executed.

We're going to create some roles for our servers:
  1. build
  2. serve
  3. master
Let's launch these servers:
ec2-run-instances --key dfroese-naw -g sg-1a3b0e2a --user-data-file user-data-file/master ami-38204508 --region us-west-2
Once we have the IP for that server, we'll launch the others and get them to join the serf cluster:

ec2-run-instances --key dfroese-naw -g sg-1a3b0e2a --user-data-file user-data-file/build ami-38204508 --region us-west-2

ec2-run-instances --key dfroese-naw -g sg-1a3b0e2a --user-data-file user-data-file/serve ami-38204508 --region us-west-2
We're using Amazon's User Data system to do the following (a sketch of the resulting Serf agent invocation appears after the list):
  1. Set the system's Serf role.
  2. Download the Serf event handlers.
  3. Activate those handlers.
  4. Join the cluster by connecting to the first 'master' system.
  5. Any additional setup for that role as needed. Take a look at the user-data-files here.
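
In practice, those user-data steps boil down to starting a Serf agent with the node's role and event handlers and then joining the master - roughly like the following sketch (the handler paths and join address are illustrative, and the "build" handler name is an assumption):

serf agent -node=$(hostname) -role=build \
  -event-handler="user:role-check=/etc/serf/handlers/role-check.sh" \
  -event-handler="user:build=/etc/serf/handlers/build.sh" &
serf join 10.250.69.116
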
Now that we've got the systems connected - let's send some test events.

serf event role-check

When that event is sent, each system executes /etc/serf/handlers/role-check.sh - this is some of the output:

2013/12/02 23:07:21 Requesting user event send: role-check. Coalesced: true. Payload: ""
2013/12/02 23:07:22 [INFO] agent: Received event: user-event: role-check
2013/12/02 23:07:22 [DEBUG] Event 'user' script output: ip-10-250-69-116 role is master
2013/12/02 23:07:22 [DEBUG] Event 'user' script output: ip-10-225-185-80 role is serve
2013/12/02 23:07:22 [DEBUG] Event 'user' script output: ip-10-227-14-222 role is build
You can also watch what's going on through the entire cluster:
serf monitor

2013/12/02 23:07:21 Requesting user event send: role-check. Coalesced: true. Payload: ""
2013/12/02 23:07:21 [DEBUG] serf-delegate: messageUserEventType: role-check
2013/12/02 23:07:21 [DEBUG] serf-delegate: messageUserEventType: role-check
2013/12/02 23:07:21 [DEBUG] serf-delegate: messageUserEventType: role-check
2013/12/02 23:07:21 [DEBUG] serf-delegate: messageUserEventType: role-check
2013/12/02 23:07:21 [DEBUG] serf-delegate: messageUserEventType: role-check
2013/12/02 23:07:22 [INFO] agent: Received event: user-event: role-check
2013/12/02 23:07:22 [DEBUG] Event 'user' script output: ip-10-250-69-116 role is master
2013/12/02 23:07:23 [INFO] serf: EventMemberFailed: ip-10-225-185-80 10.225.185.80
2013/12/02 23:07:24 [INFO] agent: Received event: member-failed
2013/12/02 23:07:27 [INFO] Responding to push/pull sync with: 10.250.65.99:33341
2013/12/02 23:07:27 [INFO] serf: EventMemberJoin: ip-10-225-185-80 10.225.185.80
2013/12/02 23:07:28 [INFO] agent: Received event: member-join
2013/12/02 23:07:51 [INFO] Initiating push/pull sync with: 10.225.185.80:7946

Now let's do something useful.

Let's tell the Docker server cluster to:
  1. Compile a git repo.
  2. Push it to the Docker INDEX
  3. Have another server pull that container.
  4. Then launch it so it's viewable from the web.
Here's how we kick it off:
serf event build https://github.com/darron/sysadvent-harp.git,sysadvent/harp-example
Which outputs:

Event 'build' dispatched! Coalescing enabled: true
2013/12/03 21:07:23 [INFO] Initiating push/pull sync with: 10.249.27.132:7946
2013/12/03 21:07:23 [DEBUG] serf-delegate: messageUserEventType: build
2013/12/03 21:07:23 [DEBUG] serf-delegate: messageUserEventType: build
2013/12/03 21:07:23 [DEBUG] serf-delegate: messageUserEventType: build
2013/12/03 21:07:23 [DEBUG] serf-delegate: messageUserEventType: build
2013/12/03 21:07:24 [INFO] agent: Received event: user-event: build
2013/12/03 21:08:51 Requesting user event send: pull. Coalesced: true. Payload: "sysadvent/harp-example"
2013/12/03 21:08:51 [DEBUG] Event 'user' script output: Build: https://github.com/darron/sysadvent-harp.git as sysadvent/harp-example in /tmp/tmp.WcrVsKOn1m
The build starts:

Cloning into '/tmp/tmp.WcrVsKOn1m'...
/usr/bin/docker build -t sysadvent/harp-example /tmp/tmp.WcrVsKOn1m
Uploading context 317440 bytes
Step 1 : FROM octohost/nodejs
---> 62108a2c615f
Step 2 : ADD . /srv/www
---> 5b01d85275dd
Step 3 : RUN cd /srv/www; npm install
---> Running in b136131fcb72
npm WARN package.json Harp@1.0.0 No repository field.
npm http GET https://registry.npmjs.org/harp
npm http 200 https://registry.npmjs.org/harp

### removed lots of npm output

harp@0.8.13 node_modules/harp
├── mime@1.2.9
├── async@0.2.9
├── mkdirp@0.3.4
├── commander@1.1.1 (keypress@0.1.0)
├── fs-extra@0.3.2 (jsonfile@0.0.1, ncp@0.2.7, rimraf@2.0.3)
├── jade@0.27.7 (commander@0.6.1, coffee-script@1.4.0)
├── less@1.3.1
├── connect@2.7.0 (fresh@0.1.0, cookie-signature@0.0.1, debug@0.7.4, pause@0.0.1, cookie@0.0.5, bytes@0.1.0, crc@0.2.0, formidable@1.0.11, qs@0.5.1, send@0.1.0)
└── terraform@0.4.12 (lru-cache@2.3.0, marked@0.2.8, ejs@0.8.4, coffee-script@1.6.3, jade@0.28.2, stylus@0.33.1, less@1.3.3)
---> af20e73caec4
Step 4 : EXPOSE 5000
---> Running in 2fce89dcbaa9
---> 688c52c7f926
Step 5 : CMD cd /srv/www; /usr/bin/node server.js
---> Running in 193e082814c4
---> 5a69fee1c103
Successfully built 5a69fee1c103
The build server pushes the built container to the registry and kicks off the pull:

Login Succeeded
The push refers to a repository [sysadvent/harp-example] (len: 1)
Sending image list
Pushing repository sysadvent/harp-example (1 tags)

### Lots of output removed.
Then the server in the 'serve' role pulls the container:

Event 'pull' dispatched! Coalescing enabled: true
2013/12/03 21:08:51 [DEBUG] serf-delegate: messageUserEventType: pull
2013/12/03 21:08:52 [DEBUG] serf-delegate: messageUserEventType: pull
2013/12/03 21:08:52 [INFO] agent: Received event: user-event: pull
2013/12/03 21:08:52 [DEBUG] Event 'user' script output: Pull: sysadvent/harp-example
Pulling repository sysadvent/harp-example
5a69fee1c103: Pulling image (latest) from sysadvent/harp-example5a69fee1c103: Pulling image (latest) from sysadvent/harp-example, endpoint: https://cdn-registry-1.docker.io/v1/5a69fee1c103: Pulling dependent layers

### More output removed.
So that it can run it:

2013/12/03 21:10:16 [DEBUG] serf-delegate: messageUserEventType: run
2013/12/03 21:10:16 [DEBUG] serf-delegate: messageUserEventType: run
2013/12/03 21:10:16 [DEBUG] serf-delegate: messageUserEventType: run
2013/12/03 21:10:17 [INFO] agent: Received event: user-event: run
2013/12/03 21:10:23 [INFO] Responding to push/pull sync with: 10.231.7.223:60001
2013/12/03 21:10:24 [WARN] Potential blocking operation. Last command took 46.701165ms
2013/12/03 21:10:29 [DEBUG] Event 'user' script output: Run: sysadvent/harp-example
At the end of this process, the site was available at:
http://harp-example.54.202.94.50.xip.io/

To sum up

Serf is a new tool that has been added to our toolboxes as sysadmins.

It's very powerful, simple to set up, and can be extended in almost limitless ways.

Give Serf a try. My example Serf handlers are all available here. You can even use the same AMI that I used for this article - ami-38204508.

Let me know if you've got any questions!

December 12, 2013

Day 12 - Upgrading Mongo for Fun and Profit

Written By: Ryn Daniels (@rynchantress)
Edited By: Shaun Mouton (@sdmouton)

So you're running MongoDB and you want to upgrade it? Maybe you want some of those nifty hashed shard keys in 2.4? This post will discuss my experiences upgrading a sharded replica set in production.

Preparation

To begin, make sure you're familiar with the official MongoDB upgrade notes. Specifically, pay attention to the warning section stating basically that you shouldn't do anything interesting with your databases during the upgrade process. Messing with sharding, creating or dropping collections, and creating or dropping databases are to be avoided if you value your data integrity.

At this point, get some coffee, and make any other preparations needed for your environment. This might include silencing any MMS/Sensu/Nagios/etc. alerts for the duration of the upgrade. In our environment, we have a service called Downtime Abbey that we enable during things like Mongo upgrades - it lets us redirect traffic to a downtime status page instead of displaying the 500 errors caused by mongos clients reconnecting, for a more pleasant and reliable end-user experience.

Upgrading!

The upgrade process starts by replacing the existing mongo binaries with the shiny new ones. If you're using some config management tool like Puppet or Chef, this can be as simple as bumping the mongodb[version] attribute to your desired version (2.4.8 is recommended as of this writing) and triggering that to run across your infrastructure as needed. (And if you aren't using config management, you really should be!)
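
For example, in a Chef role this bump might look something like the following (the attribute path is illustrative and depends on the cookbook you use):

# roles/mongodb.rb
default_attributes(
  'mongodb' => { 'version' => '2.4.8' }
)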

A Detour, Perhaps.

Since upgrading requires restarting all your mongod instances anyway, it can be a good excuse to do any tuning or make other adjustments that require such a restart. In our case, we realized that our disk readahead settings weren't optimal (MongoDB, Inc. recommends that EC2 users use a readahead value of 32), so we deployed a fix for that in Chef before continuing with the process, to cut down on the number of mongod restarts needed.
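
The readahead change itself is a one-liner per data volume, plus the matching change in Chef so it persists (the device name below is illustrative):

sudo blockdev --setra 32 /dev/xvdf   # set readahead to 32 512-byte sectors (16 KB)
sudo blockdev --getra /dev/xvdf      # verify the new value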

Also, if you are running the balancer, now is a good time to stop it.
mongos> sh.setBalancerState(false)
mongos> sh.getBalancerState()
false
mongos>
If the balancer is currently running a migration, it will wait for that chunk to finish before stopping. You may have to wait a few minutes for sh.getBalancerState() to return false before continuing. Also note that disabling (and later enabling) the balancer should only be done using a mongos connection as this will ensure that the changes are made using a two-phase commit, making sure that the changes are properly replicated to all config servers. Making this kind of change to the config database directly can lead to DifferConfig errors which can leave your cluster metadata read-only at best and invalid at worst!

Upgrading the Config Servers

Upgrading from Mongo 2.2 to 2.4 requires that you upgrade the config database servers, in order to update the metadata for the existing cluster. Before proceeding, make sure that all members of your cluster are running 2.2, not 2.0 or earlier. Watch out for 2.0 mongos processes as these will get in the way of upgrades as well. If you find any (hiding on a long-forgotten server in a corner of the data center somewhere), you'll need to wait at least 5 minutes after they have been stopped to continue. Once your cluster is entirely on 2.2, start a mongos 2.4 instance with the upgrade flag:
mongos --upgrade --configdb=config1.example.com,config2.example.com,config3.example.com
If that mongos starts without errors, you can restart the mongod processes on your config servers one at a time, in the reverse of the order in which they were listed in the above command. The order above should match the order used by the rest of your mongos processes. It's important to make sure each config database process has restarted completely before moving on to the next one, to keep the configuration data in sync across the three servers. You may want to tail -f the mongos logs during this process to keep an eye out for any errors.

Restarting ALL the Things!

Now it's time to restart all the mongod processes so they will start up using the fancy new 2.4 binaries, starting with the secondaries and arbiters.

Secondary Members and Arbiters

For each shard, if you aren't certain which members are primary or secondary, you can find out with the replica set status command:
shard2:SECONDARY> rs.status()
{
    "set" : "shard2",
    "date" : ISODate("2013-11-26T19:31:49Z"),
    "myState" : 2,
    "syncingTo" : "shard2-db3.example.com",
    "members" : [
        {
            "_id" : 21,
            "name" : "shard2-db3.example.com",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 28196,
            "optime" : Timestamp(1385494307, 3),
            "optimeDate" : ISODate("2013-11-26T19:31:47Z"),
            "lastHeartbeat" : ISODate("2013-11-26T19:31:48Z"),
            "lastHeartbeatRecv" : ISODate("2013-11-26T19:31:48Z"),
            "pingMs" : 1
        },
        {
            "_id" : 26,
            "name" : "shard2-dr1.example.com",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 28196,
            "optime" : Timestamp(1385494307, 3),
            "optimeDate" : ISODate("2013-11-26T19:31:47Z"),
            "lastHeartbeat" : ISODate("2013-11-26T19:31:48Z"),
            "lastHeartbeatRecv" : ISODate("2013-11-26T19:31:48Z"),
            "pingMs" : 0,
            "syncingTo" : "shard2-db3.example.com"
        },
        {
            "_id" : 27,
            "name" : "shard2-db1.example.com",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 98979,
            "optime" : Timestamp(1385494309, 1),
            "optimeDate" : ISODate("2013-11-26T19:31:49Z"),
            "self" : true
        }
    ],
    "ok" : 1
}
For each shard, the things to note are which replica set members are primary versus secondary, and if the replica set is currently "ok". If you are looking to avoid downtime, each shard and each member should be restarted one at a time; in our experience with a scheduled maintenance window we were able to restart all of our secondary members (we have a 3-member replica set across 4 shards, giving us 4 primary members and 8 secondaries) at once without issue. Arbiters, if you have them, can be restarted at this time in any order.

Primary Members

Now we come to the most fun part of the upgrade - restarting the primaries! The details of completing this step will depend somewhat on how much (if any) downtime you can tolerate. Because Mongo's secret webscale sauce allows it to be relatively fault-tolerant, you can simply restart the primary mongod processes and allow the normal failover process to happen. However, this method takes longer than running rs.stepDown() on the primary nodes (which is MongoDB, Inc.'s suggested method), so if you aren't operating in a scheduled maintenance window, you will want to use the step-down method to minimize downtime.
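
From the mongo shell, a step-down looks roughly like this (the 60-second value is illustrative; expect the shell connection to be dropped as the primary steps down):

shard2:PRIMARY> rs.stepDown(60)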

It should be noted that in version 2.2 and earlier, old mongos connections are only disabled after they have been tried (and failed). This can cause undesirable behavior such as intermittent errors showing up for end users as various mongos connections fail and get disabled. It is recommended to restart all the mongos connections before restarting the primary members of the replica set in order to avoid these issues. If you want to totally avoid downtime, these restarts must be done one at a time. If you have more than a few mongos connections in your infrastructure, it's best to do this upgrade during a scheduled maintenance window to restart them all at once and avoid that process.

Once you have finished restarting the primary replica set members, verify your success! The Mongo shell command rs.status() will show you which members are the new primaries as well as verifying that each shard has all replica set members and arbiters present and healthy, and running mongo --version will verify if you have, in fact, managed to upgrade to the version you wanted.

Final Steps

Once you've verified that the upgrade has been successful, be sure to take care of any loose ends related to the maintenance, such as re-enabling alerts, ending any downtime windows, and re-enabling the balancer (if applicable). Assuming you were properly caffeinated, followed these steps, and paid attention to any errors Mongo may have helpfully provided, you should be the proud owner of a 2.4 sharded replica set. And if something went terribly terribly wrong, you did have backups, right?

December 11, 2013

Day 11 - The Lazy SysAdmin's Guide to Test Driven Chef Cookbooks

Written By: Paul Czarkowski (@pczarkowski)
Edited By: Shaun Mouton (@sdmouton)

This article assumes you have some knowledge of Chef and the Berkshelf way of managing cookbooks. If you do not, then I highly recommend you watch the Chefconf talk on 'The Berkshelf Way' before reading further.

What even is?

The Lazy Sysadmin is a person who is not lazy in the regular sense, but is lazy in the sense that they don't want to do the same thing twice, or in this specific case they don't want to be woken up at two in the morning by PagerDuty (I hate that guy!) for an issue caused by a bug in a chef cookbook.

A lot of config management code is tested briefly in Vagrant (if at all) by simply checking that it ran without an error and the service that you wrote the cookbook for is running. Maybe it spends a little while in a staging environment, but will often head to production with that minimum of testing.

I don't always test my code...

We can borrow a methodology from the developer community called Test Driven Development which helps reduce the feedback loop to just a few seconds after writing code. This does mean some extra up front work but in my opinion the payoff makes it worth that investment.

Test Driven Development (TDD) has been around for a long time and is heavily embraced in the Ruby developer community. The basic idea is to provide feedback to the developer as early as possible as to whether their code is working, via unit tests. In TDD the person writing the code often writes the (failing) unit test first, and then writes the code to make that test pass. There is also the concept of README Driven Development (RDD), which is where you write the documentation even before that.

This article explores these two concepts and some tooling that helps bring this workflow of Document -> Test -> Code to chef cookbooks. This can feel very awkward at first, and chefspec especially can take a while to really grok (I'm not there yet, but it's getting easier). We'll explore a few ways to help make that transition.

The tools that I am using were decided for me by the stackforge chef cookbooks, which already had a unit-testing framework defined when I started working with them. There are likely other tools out there that do similar things (e.g. bats vs minitest vs chefspec), so do not feel constrained by the tools I talk about; instead, focus on the concepts.

It's important to mention here that this is a flexible framework and if you want to write the code, then the tests, then documentation that's fine as well, and in fact that's how I started off. As I got more comfortable with the ideas and tools I slowly moved to mostly following the workflow of Documentation -> Test -> Code, but there are still times where I write very loose documentation followed by some draft code then finally go back and document and write tests.

README Driven Development (RDD)

Every chef cookbook should have a README.md file in its root that acts as its documentation. This should be the first place you start, and by happy chance if you create a new cookbook with Berkshelf (use berks cookbook new_cookbook for this) it will create a very serviceable skeleton README.md file for you to use. I use a different format which you'll see later but that's just personal preference.

I start by writing the LICENSE file (I usually use the Apache 2.0 license). It's very important to let others know what license you're releasing the cookbook under as some companies have policies against using specific licenses (or code with no license specified) for legal protection.

Next I start filling out the README.md. For example I have a Rails application that uses sqlite3 for the development environment so my cookbook may start off having just a single recipe (not counting default) called application. There are some obvious dependencies and attributes to set, so I'll document those at this point as well.

Obviously you won't know beforehand about every tiny detail, so start by documenting loosely and then tighten it up as you go. For example my application uses docker, but I'll add that into the cookbook after I've got my base rails application installed, which means I can wait until I'm about to implement that before I write the documentation for it.

README.md


Requirements
============

Chef 0.11.0 or higher required (for Chef environment use).

Cookbooks
---------

The following cookbooks are dependencies:
* ruby
* git

Recipes
=======

ircaas::application
-------------------

* creates user `ircaas`
* includes recipes `git::default`, `ruby::default`
* Install IRCaaS Application code from `https://github.com/paulczar/ircaas`

Attributes
==========

ircaas['user'] - user to run application as
ircaas['git']['repo'] - repo containing IRCaaS code
ircaas['git']['branch'] - Branch to download
The final file to edit as part of RDD is metadata.rb, which has optional methods to document recipes, attributes, etc. as well. I try not to double up this information, so unless it is actually needed for the cookbook to run (such as dependencies) I leave it out. It's fine if you prefer to add it in both locations, or prefer to document in the metadata file; whichever you do, just try to leave breadcrumbs for the reader to follow (for example, write 'see metadata.rb under the appropriate sections' in the README.md).

metadata.rb


name 'ircaas'
maintainer 'Paul Czarkowski'
maintainer_email 'username.taken@gmail.com'
license 'All rights reserved, Paul Czarkowski' # see './LICENSE' 
description 'Installs/Configures ircaas'
long_description IO.read(File.join(File.dirname(__FILE__), 'README.md'))
version '0.1.0'

%w{ ubuntu }.each do |os|
 supports os
end

%w{ ruby git }.each do |dep|
 depends dep
end

Test Driven Development (TDD)

Now that we've documented our first stage of the cookbook we want to write some tests for it, but first we need to set up the tooling. Most of the tools in the testing framework that we're installing are rubygems and can quickly suck you into dependency hell. To help deal with that and provide myself with some consistency I have created a Git repository called meez that contains my (very opinionated) skeleton cookbook and has a Gemfile and Gemfile.lock to help deal with the dependencies. I also have a base Berksfile, Vagrantfile, and Strainerfile, and I use Ruby 1.9.3 as my default ruby environment.

With this skeleton in place I can get up to speed very quickly by cloning the repo and running bundler which will install all the tools I need for testing which are described below.

git clone https://github.com/paulczar/meez.git chef-ircaas
cd chef-ircaas
bundle install

Strainer

Strainer is a framework for testing chef cookbooks. It doesn't perform any tests itself, but instead calls a series of tools which are listed in a Strainerfile, as below:

# Strainerfile
tailor: bundle exec tailor
knife test: bundle exec knife cookbook test $COOKBOOK
foodcritic: bundle exec foodcritic -f any $SANDBOX/$COOKBOOK
chefspec: bundle exec rspec $SANDBOX/$COOKBOOK/spec

Tailor

Tailor reads Ruby files and measures them against some common ruby style guides. This is the tool that I'm least familiar with out of the set, but the framework I 'borrowed' from the stackforge cookbooks included it and I saw no reason to remove it.

knife cookbook test

Tests the cookbook for syntax errors. This uses the built-in Ruby syntax checking for files in the cookbook ending in .rb, and the ERB syntax check for files ending in .erb (templates). This comes free with knife.

foodcritic

Foodcritic is a linting tool for chef cookbooks. It parses your cookbook and comments on your styling as well as flagging known problems that would cause chef to break when converging. There is an excellent library of errors and Foodcritic will kindly provide an error code and often will even tell you how to fix it.
Example: FC002: Avoid string interpolation where not required

rubocop

I have just recently added this to my testing framework. It is a very verbose lint / style parser for Ruby. Prepare for it to yell at you a bunch when you start using it.

guard

I'm not using this yet, but it's a tool that watches files for changes and then runs commands against those files. This allows for real-time feedback on changes to files. Some potential uses for this that I plan to investigate are listed below (a rough Guardfile sketch follows the list):
  • watch Berksfile and metadata.rb to automatically download any new cookbook dependencies.
  • watch Gemfile to automatically install any new Gem dependencies.
  • watch *.rb files to automatically check syntax/linting.
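
As a sketch of what that last item might look like using the guard-shell plugin (both the guard and guard-shell gems are assumptions here, not part of my current Gemfile):

# Guardfile
guard :shell do
  watch(%r{.+\.rb$}) { |m| `bundle exec rubocop #{m[0]}` }  # lint any changed Ruby file
  watch('Berksfile') { `bundle exec berks install` }        # refresh cookbook dependencies
end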

chefspec

ChefSpec is a unit testing framework for Chef cookbooks. It is an extension of RSpec and the version I use (ChefSpec 3.x) requires Ruby 1.9+ and Chef 11+.

ChefSpec runs your cookbook locally with Chef Solo but doesn't actually converge. This means it's very fast and it doesn't mess up your system by actually installing packages or pushing templates. Chefspec uses Fauxhai to mock Ohai data, and thus the unit tests don't need to be run on the same operating system as your dev or production servers.

Unit Tests

I write my unit tests in Chefspec (the other tools mostly take care of themselves and don't require much care and feeding). Chefspec is an extension of the rspec framework; tests are written in the spec/ directory of your cookbook and are named by convention: <recipe-name>_spec.rb. In my Rails application example we would have the file spec/application_spec.rb. There is also a spec/spec_helper.rb which loads the chefspec modules and sets any common settings or Ohai data (chefspec uses Fauxhai to fake it).

The basic workflow of Chefspec is that you describe what you're testing in a describe block, which sets up the fake chef run along with any options or node attributes before the tests inside the block run. Tests are written as it blocks and are simply pseudocode explaining what the function you're testing should do, followed by the result that you expect to see from a successful run of that function. The testable resources are very well documented.

I know that my application recipe will need to include some recipes, create a user, and clone a git repository, so I will write the following tests:

spec/application_spec.rb


require_relative 'spec_helper'

describe 'ircaas::application' do 
 before do
   @chef_run = ::ChefSpec::Runner.new ::UBUNTU_OPTS do |node|
     node.set['ircaas'] = { # Sets custom node attributes
       user: 'ircaas', # so that we don't fail just because somebody went in
       path: '/opt/ircaas', # and changed a value in the default attributes file.
       git: { repo: 'ssh://git.path', branch: 'master' }
     }
   end
   @chef_run.converge 'ircaas::application' # Fake converges recipe.
 end

 it 'includes ruby::default recipe' do
   expect(@chef_run).to include_recipe 'ruby::default'
 end

 it 'includes git::default recipe' do
   expect(@chef_run).to include_recipe 'git::default'
 end

 it 'creates ircaas user' do
   expect(@chef_run).to create_user('ircaas')
 end

 it 'checkouts ircaas from repo' do
   expect(@chef_run).to checkout_git("/opt/ircaas").with(repository: 'ssh://git.path', branch: 'master')
 end

# a test without an expect will be marked as pending, useful
# if you don't know how to test a specific function, or you
# want to write the test later.
 it 'an example to help show pending tests' do
   pending "I don't know how to test this function yet"
 end

end
After writing the tests I will go ahead and write the recipe to pass them.

recipe/application.rb


# Cookbook Name:: ircaas
# Recipe:: application

include_recipe 'ruby::default'
include_recipe 'git::default'

user node['ircaas']['user'] do
 username node['ircaas']['user']
 comment "ircaas User"
 shell "/bin/bash"
 home "/home/ircaas"
 system true
end

git node['ircaas']['path'] do
 repository node['ircaas']['git']['repo']
 branch node['ircaas']['git']['branch']
 destination node['ircaas']['path']
 action :checkout
end
Finally I run berkshelf to fetch any required cookbooks and then run strainer to run through the tests.

bundle exec berks install
bundle exec strainer test

Integration Test

Of course unit tests are only part of the picture; I want to make sure that the whole cookbook functions correctly and will result in a node configured and working as defined in the cookbook. To perform an integration test I use Vagrant, for which I have a base Vagrantfile in my meez repo. I won't bore you with too much detail because chances are you're already familiar with Vagrant. Needless to say, I run vagrant up, and after it completes provisioning I SSH into the VM and verify that the node has converged and my Rails application is running.
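
The relevant part of that Vagrantfile looks roughly like the following sketch (the box name and attributes are illustrative, and the vagrant-berkshelf plugin is assumed to be installed):

# Vagrantfile (excerpt)
Vagrant.configure('2') do |config|
  config.vm.box = 'opscode-ubuntu-12.04'
  config.berkshelf.enabled = true   # provided by the vagrant-berkshelf plugin

  config.vm.provision :chef_solo do |chef|
    chef.run_list = ['recipe[ircaas::application]']
    chef.json = { 'ircaas' => { 'user' => 'ircaas' } }
  end
end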

Summary

We've explored the basics of doing README/Test Driven Development for Chef Cookbooks. This has been a very shallow look at a very deep topic so I encourage you to look around at other tools and frameworks and try to find something that works for you. It took me quite some time to really grok the whole process and longer again to be able to proudly declare that I'm somewhat competent at it.

If at first chefspec is really hard just write pending tests for everything, or skip it altogether. The rest of the tests provided by Strainer will still help catch issues without having to wait for a node to converge.

One of the unexpected advantages I've found when following this framework is that there are two places where I really have to think about what I want to achieve before I even write a single line of a recipe (in the README and when writing the tests). This gives me time to really formulate in my head what it is I need to do, and to break it down into small, easy chunks that can be written between coffee breaks when it comes time to actually write code.

Further reading

December 10, 2013

Day 10 - AWS Spot Instances - HowTo and Tweaks

Written By: Jesse Davis (@jessemdavis)
Edited By: Shaun Mouton (@sdmouton)

At AppNeta, we provide software instrumentation, open source libraries and network hardware to instrument your network and software to provide information to you about your Full Stack™. On the software side, we do this by emitting tracing information back to our application, where we process, analyze and consolidate it to provide you with precise information about the health of your stack. We're a big fan of AWS, and have used spot pricing and AutoScaling groups to make our trace processing pool efficient and, more importantly, inexpensive to run.

Problems We've Encountered

However, auto-scaling groups are not a panacea for scaling concerns. Spot instances can be in heavy demand during certain periods, and bidding for them can be fierce, to say the least. One strategy that we have seen other users of spot instances employ is heavily over-bidding on spot instances. For example, we've seen spot prices for m3.2xlarge instances go as high as $7.00 for a four-day period, while the normal on-demand price for this instance type is $1.00. Although over-bidding can ensure spot instance requests are fulfilled, it's economically infeasible.

Being blocked from fulfilling spot instance requests can cause other problems as well, depending on how your autoscaling group is configured. We recently modified our autoscaling groups to split across three availability zones (AZs). However, we noticed that in one AZ, the spot price had jumped well above our requested spot price. Since autoscaling groups, by default, attempt to balance instances evenly across the zones you have defined, this resulted in the autoscaling group terminating instances in one of the other AZs as well. This effectively decreased our processing capability to a third of its normal capacity. While this was technically enough capacity to handle our trace stream without problems, we would now have to manually manage our instance counts to handle possible incoming spikes.

Our solution

We knew that we wanted to keep a minimum number of instances up regardless of the availability of spot instances. However, we obviously want to favor spot instances for the cost savings. So, our implementation follows one basic rule: if a spot instance terminates, bring up an on-demand instance to cover it. We do this with a combination of SNS, SQS, and boto.

We define an SQS queue and an SNS topic to provide the bridge between our autoscaling group and the daemon that will monitor it.
ubuntu@host $ aws --region=us-east-1 sqs create-queue --queue-name spot-scale-demand
{
    "QueueUrl": "https://queue.amazonaws.com/1111111111/spot-scale-demand"
}
SNS_TOPIC='spot-autoscale-notifications'
SQS_QUEUE='spot-autoscale-notifications'

SNS_ARN=$(sns-create-topic $SNS_TOPIC)
SQS_ARN="arn:aws:sqs:us-east-1:1111111111:${SQS_QUEUE}"
sns-add-permission $SNS_ARN --label "ETLDemandBackupGroup" --action-name Publish,Subscribe,Receive --aws-account-id 11111111111 
sns-subscribe $SNS_ARN --protocol sqs --endpoint "$SQS_ARN"
We set up our autoscaling groups. We start off with the demand group empty, since we want to favor having our processing done on spot instances.
as-create-launch-config spot-lc --image-id ami-12345678 --instance-type m3.2xlarge \
    --group etl --key appneta --user-data-file=prod_data.txt \
    --spot-price 1.00
as-create-launch-config demand-lc --image-id ami-12345678 --instance-type m3.2xlarge \
    --group etl --key appneta --user-data-file=prod_data.txt
as-create-auto-scaling-group spot-etl-20131116 --launch-configuration spot-lc \
    --availability-zones us-east-1a,us-east-1b --default-cooldown 600 --min-size 1 --max-size 3
as-create-auto-scaling-group demand-etl-20131116 --launch-configuration demand-lc \
    --availability-zones us-east-1a,us-east-1b --default-cooldown 600 --min-size 0 --max-size 3
I'm leaving off scaling policies and metric alarms, since they'll be specific to your processing and workload.
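If you do want a starting point, here's a purely illustrative boto sketch of what a simple scale-up policy for the spot group might look like; the policy name, adjustment, and cooldown below are made-up placeholders, and in practice you would pair the policy with a CloudWatch alarm tuned to your own metrics:
import boto
from boto.ec2.autoscale import ScalingPolicy

autoscale = boto.connect_autoscale()

# Hypothetical example: add one instance to the spot group when the
# associated alarm fires, then wait 10 minutes before scaling again.
scale_up = ScalingPolicy(name='spot-etl-scale-up',
                         as_name='spot-etl-20131116',
                         adjustment_type='ChangeInCapacity',
                         scaling_adjustment=1,
                         cooldown=600)
autoscale.create_scaling_policy(scale_up)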

We define notification configurations for each autoscaling group, and tell the autoscaling group to send a message to the SNS topic we defined whenever an instance is launched or terminated in the group. We also have another notification configuration for sending an email to our operations team when these scaling actions happen so that we have a secondary history.
for group in spot-etl-20131116 demand-etl-20131116; do
    for notification in spot-autoscale-notifications ops-alerts-email ; do
        as-put-notification-configuration $group -t "arn:aws:sns:us-east-1:111111111:${notification}" \
            -n autoscaling:EC2_INSTANCE_LAUNCH,autoscaling:EC2_INSTANCE_TERMINATE
    done
done
For the last part of our setup, let's be nice to our operations team and turn on CloudWatch metrics on the autoscaling groups.
for group in spot-etl-20131116 demand-etl-20131116; do
    as-enable-metrics-collection $group -g 1Minute \
        -m GroupMinSize,GroupMaxSize,GroupPendingInstances,GroupInServiceInstances,GroupTotalInstances,GroupTerminatingInstances,GroupDesiredCapacity
done
Now we'll look at an overview of the daemon itself. We're primarily a Python shop, and rely heavily on the great boto library for most of our AWS interaction, just like Amazon does. Our basic program flow looks like:
while (true):
  read message from SQS queue
  if it's a TERMINATE message and from a spot group:
    find the corresponding demand group
    adjust the demand group instance count by 1
I won't go into how to set up your environment with the correct keys to allow boto access to your AWS account. Even better, create an IAM role and assign it to the instance you run the daemon on.
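As a quick, purely illustrative aside: when the daemon runs on an instance with an IAM role attached (or with credentials in the environment or in a boto config file such as ~/.boto), boto resolves credentials on its own, so nothing needs to be hard-coded:
import boto

# boto checks environment variables, boto config files, and the EC2 instance
# metadata service (IAM role credentials) -- no explicit keys are needed here.
sqs = boto.connect_sqs()
autoscale = boto.connect_autoscale()
If you go the IAM role route, the environment-variable check in update_groups_from_queue below becomes unnecessary.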

Let's read the message off the queue:
# imports used throughout the daemon
import json
import logging
import os
import re
import time

import boto
from boto.exception import SQSError
from boto.sqs.message import RawMessage

log = logging.getLogger(__name__)

MAX_RETRIES = 5

def update_groups_from_queue(queue=None):
    """ Connect to SQS and update config based on notifications """
    env = os.environ
    if ('AWS_ACCESS_KEY_ID' not in env or 'AWS_SECRET_ACCESS_KEY' not in env):
        raise Exception("Environment must have AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY set.")
    _process_queue(queue)

def _process_queue(queue):
    retries = 0
    delay = 5

    sqs = boto.connect_sqs()
    q = sqs.create_queue(queue)
    q.set_message_class(RawMessage)

    while True:
        try:
            rs = q.get_messages()
            for msg in rs:
                if _process_message(msg):
                    q.delete_message(msg)

            time.sleep(10)
            # if we make it here, we're processing correctly, we want to watch for consecutive
            # errors
            retries = 0
            delay = 5
        except SQSError as sqse:
            retries += 1
            log.info("Error in SQS, attempting to retry: retries: %d, exception: %r" % (retries, sqse))
            if retries >= MAX_RETRIES:
                log.error("Maximum SQS retries hit, giving up.")
                raise sqse
            else:
                time.sleep(delay)
                delay *= 1.2    # exponential backoff, similar to tracelon's @retry
There's nothing overly complicated here; it's just your basic processing loop. We pass in the queue name (in our case, spot-scale-demand) from the command line using argparse and feed it to update_groups_from_queue. We process each message and sleep for 10 seconds. We only delete a message if we successfully processed it, so we can investigate broken messages after the fact. If we encounter an error reading from the queue, we sleep with an exponential backoff and try again, up to 5 times. Past that, we assume our SQS endpoint is having problems, and it's likely we're already up and looking into things :)
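For completeness, a minimal sketch of that entry point might look like the following; the argument names and logging setup are illustrative assumptions, not the exact code from our daemon:
import argparse
import logging

def main():
    parser = argparse.ArgumentParser(
        description='Backfill on-demand instances when spot instances terminate.')
    parser.add_argument('queue', help='name of the SQS queue to poll, e.g. spot-scale-demand')
    parser.add_argument('--verbose', action='store_true', help='enable debug logging')
    args = parser.parse_args()

    logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)
    update_groups_from_queue(queue=args.queue)

if __name__ == '__main__':
    main()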

Now we process the message to determine if we need to scale up our demand instances.
def _process_message(msg):
    """ Process the SQS message. """
    try:
        m = json.loads(msg.get_body())
    except ValueError:
        # Unexpected message (non-json). Throw it away.
        log.error("Could not decode: %s" % (msg.get_body()))
        return True

    # There are 2 different message types in this queue: AWS generated (a result of autoscale activities)
    # and ones we send explicitly.

    msg_type = m['Type']
    if msg_type == 'Notification':
        # From AWS autoscale:
        payload = json.loads(m['Message'])
        spot_group = payload.get('AutoScalingGroupName', '')
        cause = payload.get('Cause', '')
        event = payload.get('Event', '')

        if not _is_spot_group(spot_group):
            log.info("Received AWS notification for non-spot group %s, ignoring." % spot_group)
            return True

        # usually we'll want to look for 'was taken out of service in response to a system health-check.'
        # if user initiated, will be 'user health-check', which can be triggered via
        # as-set-instance-health i-______ --status Unhealthy.
        # We could ignore this, but it makes testing much easier.
        if not re.search(r'was taken out of service in response to a (system|user) health-check.', cause):
            log.info("Received AWS notification for spot group %s, but not due to health check termination, ignoring." % spot_group)
            return True

        if event == 'autoscaling:EC2_INSTANCE_TERMINATE':
            _adjust_demand_group(spot_group, 1)
        else:
            log.info("Ignoring notification: %s", payload)
        # message handled; returning True lets the caller delete it
        return True
We grab the message body and load it into a JSON object. The main attributes we care about are:
  • the notification type
  • the name of the autoscaling group
  • the cause of the autoscaling event
We ignore the request if the event came from a group that's not a spot group (more on this next), and also ignore the event unless it's a health check event. This allows us to manually adjust the capacity of the spot group without firing off launch commands to the demand autoscaling group. Last, we don't do anything unless the event was an instance termination.

Now we can implement the logic for when to spin up a demand instance. Please don't laugh, because we have some very simple logic right now :)
def _is_spot_group(group_name):
    """ Returns true if the group's name refers to a spot group. """
    # pretty simple, all our spot group names start with 'spot-' (e.g. spot-etl-20131116)
    return group_name.startswith('spot-')
Yes, that's how we mark our autoscaling groups as spot groups. Pretty simple, huh?

We want to make sure that we only spin up instances in the demand group that corresponds to our spot group.
def _find_demand_scaling_group(spot_group):
    """ Find the corresponding on-demand group for the given spot group.
        For now, just rely on them being named similarly.
        Returns the boto autoscale group object, or None if no group found. """
    autoscale = boto.connect_autoscale()
    as_groups = autoscale.get_all_groups()

    demand_group = re.sub(r'spot-', r'', spot_group)
    demand_group = re.sub(r'-\d+$', r'', demand_group)
    log.debug("Given spot group %s, find demand group similar to %s" % (spot_group, demand_group))

    result = [ group for group in as_groups
               if demand_group in group.name and not _is_spot_group(group.name) ]
    return sorted(result, key=lambda group: group.name).pop() if result else None
Basically, we name the groups the same, prefix the spot autoscaling group's name with 'spot-', and append the date we deployed the groups to both names. This could be anything you want: serial numbers, build numbers, or version numbers (similar to how Netflix defines their groups in Asgard).
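Purely as a worked example with the group names used earlier in this article, the name munging above reduces the spot group name to a stem that matches its on-demand counterpart:
import re

spot_group = 'spot-etl-20131116'
stem = re.sub(r'-\d+$', '', re.sub(r'spot-', '', spot_group))
print(stem)                            # 'etl'
print(stem in 'demand-etl-20131116')   # True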

Finally, we adjust the capacity of the demand autoscaling group.
def _adjust_group(group, adjustment):
    """ Change the number of instances in the given boto group by the given amount. """
    try:
        current_capacity = group.desired_capacity
        desired_capacity = current_capacity + adjustment
        if desired_capacity < group.min_size or desired_capacity > group.max_size:
            log.info("Demand group count already at bound, adjust via AWS if necessary.")
            return
        group.desired_capacity = desired_capacity
        group.min_size = desired_capacity
        group.update()
        log.info("Adjusted instance count of ASG %s from %d to %d." %
                 (group.name, current_capacity, desired_capacity))
    except Exception as e:
        log.exception(e)

def _adjust_demand_group(spot_group, adjustment):
    """ Change the number of instances in on-demand ASG paired with the given spot group name
        by the given amount. """

    try:
        demand_group = _find_demand_scaling_group(spot_group)
        if demand_group:
            _adjust_group(demand_group, adjustment)
        else:
            log.error("No demand group found similar to %s." % spot_group)
    except Exception as e:
        log.exception(e)
Again, there's nothing very complicated here. We don't attempt to set the capacity of the demand group below its minimum or above its maximum.

And that's it! Throw all this into a Python script and run it. We're big fans of supervisor here, so we set up a supervisor configuration for the script to ensure the daemon stays up.

Improvements


So far, this daemon is working out pretty well for us, but we still plan on improving it.

  • We're now researching a move to an autoscaling group per availability zone for our processing nodes, to allow even more fine-grained control over how many instances are in each zone instead of being at the mercy of keeping the same count in each AZ as we are now.
  • The main logic of our algorithm is definitely very simple right now. The astute reader will notice that we never decrease the number of instances in our demand group. This is good for ensuring that our processing pool never loses a lot of capacity, but it also means our operations team eventually has to adjust the demand group back down to acceptable levels. My next steps will definitely involve checking the current spot instance requests and scaling the demand group back down once the requests have been fulfilled and the spot group instance count is stable; a rough sketch of what that might look like follows this list.
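The sketch below is only an assumption about how that scale-down check might work, reusing the helpers from earlier; it is not code we run today:
def _maybe_scale_down(spot_group_name):
    """ Sketch: if the spot group is back at full strength and no spot
        requests are still open, release one on-demand instance. """
    autoscale = boto.connect_autoscale()
    ec2 = boto.connect_ec2()

    spot_group = autoscale.get_all_groups(names=[spot_group_name])[0]
    open_requests = ec2.get_all_spot_instance_requests(filters={'state': 'open'})

    if not open_requests and len(spot_group.instances) >= spot_group.desired_capacity:
        # note: _adjust_group above also raises min_size when scaling up, so a
        # real implementation would lower min_size before shrinking the group
        _adjust_demand_group(spot_group_name, -1)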

December 9, 2013

Day 9 - Getting Pushy With Chef

Written By: Michael Ducy (@mfdii)
Edited By: Shaun Mouton (@sdmouton)

One of the long-standing issues with Chef has always been that changes we wanted to make to nodes weren't necessarily instant. Eventually your nodes would come into sync with the recipes on your Chef server, but depending on how complex your environment was, this might take a few runs.

There were ways around this, mainly by using knife ssh to force the nodes to update instantly, in the order you want. While this method worked, it had its own problems (needing SSH keys and sudo rights, for example). A few months ago Chef released an add-on to the enterprise version that allows users to initiate actions on a node without requiring SSH access; we call this feature Push Jobs. Right now, Push Jobs (formerly known as Pushy) is a feature of Enterprise Chef, but we are working towards open sourcing Push Jobs in early 2014 (think Q1).

Getting started


Getting ready for Push Jobs is fairly easy. There are two additional components that need to be installed: the Push Jobs server and the Push Jobs clients. The Push Jobs server sits alongside your Erchef server, either on the same machine or on a separate host. The Push Jobs clients can be installed using the push-jobs cookbook. There is a copious amount of documentation covering the installation, so I won’t cover that in detail. The two things you need to know are how to allow commands to be executed via Push Jobs, and how to start jobs.

First, the commands that can be executed are controlled by a “whitelist” attribute. The push-jobs cookbook sets the node['push_jobs']['whitelist'] attribute and writes a configuration file, /etc/chef/push-jobs-client.rb. The node['push_jobs']['whitelist'] attribute is used in this config file to determine what commands can be run on a node.

For example, if you want to add the ability to restart tomcat on nodes with the tomcat role, add this to the role:

"default_attributes": {
    "push_jobs": {
      "whitelist": {
        "chef-client": "chef-client",
        "apt-get-update": "apt-get update",
        "tomcat6_restart": "service tomcat6 restart"
      }

Second, you’ll need the ability to start jobs on nodes. This is accomplished by installing the knife-pushy plugin:

gem install knife-pushy-0.3.gem

This will give you the following new knife commands:

** JOB COMMANDS **
knife job list
knife job start <command> [<node> <node> ...]
knife job status <job id>

** NODE COMMANDS **
knife node status [<node> <node> ...]

Running knife node status will give you a list of nodes with state detail.

ricardoII:intro michael$ knife node status
1-lb-intro  available
1-tomcat-intro  available
1-tomcat2-intro available
ricardoII:intro michael$

In this case, all of my nodes are available. Let’s say I want to run chef-client on one of my tomcat nodes. It’s as simple as:

knife job start chef-client 1-tomcat-intro 

Maybe I don’t know all the jobs I can run on a node or group of nodes. I can search for those jobs by running:

ricardoII:intro michael$ knife search "name:*tomcat*" -a push_jobs.whitelist
2 items found

1-tomcat2-intro:
  push_jobs.whitelist:
    apt-get-update:  apt-get update
    chef-client:     chef-client
    tomcat6_restart: service tomcat6 restart

1-tomcat-intro:
  push_jobs.whitelist:
    apt-get-update:  apt-get update
    chef-client:     chef-client
    tomcat6_restart: service tomcat6 restart

I can see that I have the ability to run a tomcat restart on my tomcat nodes, as I set in my attributes earlier. But since I have multiple tomcat servers, listing them on the command line could be a pain. I can use search with the knife job start command to find all the nodes I want based on a search string:

ricardoII:intro michael$ knife job start tomcat6_restart --search "name:*tomcat*"
Started.  Job ID: 6e0b432e369904e76de6e95bac99c9e6
Running (1/2 in progress) ...
Complete.
command:     tomcat6_restart
created_at:  Fri, 06 Dec 2013 21:05:32 GMT
id:          6e0b432e369904e76de6e95bac99c9e6
nodes:
  succeeded:
    1-tomcat-intro
    1-tomcat2-intro
run_timeout: 3600
status:      complete
updated_at:  Fri, 06 Dec 2013 21:05:40 GMT

The nice thing about using Push Jobs is that I can use the same credentials I use to access Chef to fire off whitelisted commands on nodes with Push Jobs enabled. I don’t need to have SSH keys for the remote nodes as I do with knife ssh.

More Options


The other nice thing is that Push Jobs can be used inside recipes to orchestrate actions between machines. There is a really basic Lightweight Resource Provider that allows you to fire push jobs from other nodes. You can find the LWRP on github. Why would you want to do this? Say, for instance, your webapp hosts have autoscaled. Your chef-client run interval on your HAProxy node is 15 minutes, but you don’t want to wait (at most) 15 minutes for the new webapp to be in the pool. Your webapp recipe can fire off a push job to have the HAProxy node run chef-client.

pushy "chef-client" do
  action :run 
  nodes [ "1-lb-tomcat"]
end

This cross-node orchestration doesn’t have to be a full-fledged chef run as seen above. You could use it to simply restart a whitelisted service if needed. Let's look at a tomcat recipe for Ubuntu that allows us to find our HAProxy server and start a chef-client run.

include_recipe "java"

# define our packages to install
tomcat_pkgs = ["tomcat6", "tomcat6-admin"]

tomcat_pkgs.each do |pkg|
  package pkg do
    action :install
  end
end

#setup the service to run
service "tomcat" do
  service_name "tomcat6"
  supports :restart => true, :reload => true, :status => true
  action [:enable, :start]
end

#install templates for the configs
template "/etc/default/tomcat6" do
  source "default_tomcat6.erb"
  owner "root"
  group "root"
  mode "0644"
  notifies :restart, resources(:service => "tomcat")
end

template "/etc/tomcat6/server.xml" do
  source "server.xml.erb"
  owner "root"
  group "root"
  mode "0644"
  notifies :restart, resources(:service => "tomcat")
end

#search for our LB based on role and current environment, just return the name
pool_members = partial_search("node", "role:tomcat_fe_lb AND chef_environment:#{node.chef_environment}", :keys => {
               'name' => ['name']
               }) || []

pool_members.map! do |member|
  member['name']
end

#run chef-client on the LB, don't wait
pushy "chef-client-delay" do
  action :run
  wait false
  nodes pool_members.uniq
end

The real magic of this recipe is in the last several lines. First we search for our load balancer, then we use the query result to execute the chef-client-delay job on those nodes. The nodes attribute can be a single node or an array of nodes that you want the job to execute on.

If you run a knife job list you should see your job running on the load balancer.

ricardoII:intro michael$ knife job list
command:     chef-client-delay
created_at:  Mon, 09 Dec 2013 06:09:05 GMT
id:          6e0b432e3699018834ee0e199309e8a7
run_timeout: 3600
status:      running
updated_at:  Mon, 09 Dec 2013 06:09:05 GMT

You can take the job id and query the job status by running knife job status <id>, like so:

ricardoII:intro michael$ knife job status 6e0b432e3699018834ee0e199309e8a7
command:     chef-client-delay
created_at:  Mon, 09 Dec 2013 06:09:05 GMT
id:          6e0b432e3699018834ee0e199309e8a7
nodes:
  succeeded: 1-lb-intro
run_timeout: 3600
status:      complete
updated_at:  Mon, 09 Dec 2013 06:10:07 GMT

Of course, the Chef API can also be used to directly query the results of a job, get a listing of jobs, or start jobs. The API is fully documented, and accessible under your normal Chef API endpoint.

Hopefully this gives you a good idea of how to use Push Jobs, how it differs from knife ssh, and gets you excited for the upcoming open source release. At Chef our goal is to provide our community with the primitive resources they can use to make their jobs more delightful. Push Jobs is the first release of primitives to better orchestrate things for Continuous Delivery, Continuous Integration, and more complex use cases.