December 2, 2010

Day 2 - Going Parallel

This article was written by Brandon Burton, aka @solarce.

As system administrators, we are often faced with tasks that need to run against a number of things: files, users, servers, and so on. In most cases, we resort to a loop of some sort, often just a for loop in the shell. The drawback to this seemingly obvious approach is that it is serial, so the time it takes grows linearly with the number of things we are running the task against.
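For example, compressing a directory of log files one at a time looks like this (a trivial illustration, with hypothetical filenames):

for file in *.log; do
    gzip $file    # one file at a time
done

If each file takes a minute to compress, fifty files take fifty minutes.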

I am here to tell you there is a better way: the path of going parallel!

Tools for your shell scripts

The first place to start is with tools that can replace that for loop you usually use and add some parallelism to the task you are running.

The two best-known tools available are:

  1. xargs
  2. GNU parallel

xargs is a tool used to build and execute command lines from standard input, but one of its great features is that it can execute those command lines in parallel via its -P argument. A quick example of this is:

seq 10 20 | xargs -n 1 -P 5 sleep

This sends a sequence of numbers to xargs, which divides the input into chunks of one argument at a time (-n 1) and runs up to 5 parallel processes (-P 5) to execute them. You can see it in action:

$ ps -eaf | grep sleep
baron     5830  5482  0 11:12 pts/2    00:00:00 xargs -n 1 -P 5 sleep
baron     5831  5830  0 11:12 pts/2    00:00:00 sleep 10
baron     5832  5830  0 11:12 pts/2    00:00:00 sleep 11
baron     5833  5830  0 11:12 pts/2    00:00:00 sleep 12
baron     5834  5830  0 11:12 pts/2    00:00:00 sleep 13
baron     5835  5830  0 11:12 pts/2    00:00:00 sleep 14
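The same pattern applies to more practical tasks. For example, here is a sketch that compresses a tree of hypothetical log files four at a time (find's -print0 and xargs's -0 keep filenames with spaces intact):

find . -name '*.log' -print0 | xargs -0 -n 1 -P 4 gzip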

Some further reading on xargs is available in its man page (man xargs).

GNU parallel is a lesser-known tool that has been gaining popularity recently. It is written with a specific focus on executing processes in parallel. From the home page description: "GNU parallel is a shell tool for executing jobs in parallel locally or using remote machines. A job is typically a single command or a small script that has to be run for each of the lines in the input. The typical input is a list of files, a list of hosts, a list of users, a list of URLs, or a list of tables."

A quick example of using parallel is:

% cat offlineimap-cron5min.plist | parallel --max-procs=8 --group 'echo "Thing: {}"'
Thing:       <string>offlineimap-cron5min</string> 
Thing:     <key>Label</key> 
Thing:       <string>solarce</string> 
Thing:     <key>UserName</key> 
Thing:   <dict> 
Thing:     <key>ProgramArguments</key> 
Thing:       <string>admin</string> 
...

This plist file is XML, but the output above is unordered because each line of input is processed by one of the 8 workers, and output is emitted (--group) as each worker finishes its line, not necessarily in the order of the input.
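If you need the output in the same order as the input, parallel's -k (--keep-order) flag buffers each job's output and emits it in input order, at the cost of a little memory:

% cat offlineimap-cron5min.plist | parallel -k --max-procs=8 --group 'echo "Thing: {}"'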

Some further reading on parallel is available on the GNU parallel homepage and in its man page.

Additionally, there is a great screencast on it.

Tools for multiple machines

The next step in our journey is to progress from just running parallel processes to running our tasks in parallel on multiple machines.

A common approach to this is to use something like the following:

for server in $(cat list_of_servers.txt); do
    ssh $server command argument
done

While this approach is fine for small tasks on a small number of machines, the drawback is that it executes serially, so the total time the job takes is the time the task takes to finish multiplied by the number of machines you are executing it on. That means it could take a while, so you'd better get a Snickers.
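You can get naive parallelism from the shell itself by backgrounding each ssh and waiting for the whole batch, as in this sketch, but the output from every host interleaves and there is no cap on the number of simultaneous connections:

for server in $(cat list_of_servers.txt); do
    ssh $server command argument &    # run each ssh in the background
done
wait    # block until all background jobs finish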

Fortunately, people have recognized these problems, and a number of tools have been developed to solve them by running your SSH commands in parallel.

These include:

  1. pssh
  2. sshpt

I'll illustrate how these work with a few examples.

First, here is a basic example of pssh (on Ubuntu the package is 'pssh,' but the command is 'parallel-ssh'):

# cat hosts-file
p1
p2

# pssh -h hosts-file -l ben date
[1] 21:12:55 [SUCCESS] p2 22
[2] 21:12:55 [SUCCESS] p1 22

# pssh -h hosts-file -l ben -P date
p2: Thu Oct 16 21:14:02 EST 2008
p2: [1] 21:13:00 [SUCCESS] p2 22
p1: Thu Sep 25 15:44:36 EST 2008
p1: [2] 21:13:00 [SUCCESS] p1 22
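pssh can also bound its parallelism and time out slow hosts; this sketch limits it to 10 simultaneous connections with a 60-second per-host timeout (-p and -t, respectively):

# pssh -h hosts-file -l ben -p 10 -t 60 uptime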

Second, here is an example of using sshpt:

./sshpt -f ../testhosts.txt "echo foo" "echo bar"
Username: myuser
Password:
"devhost","SUCCESS","2009-02-20 16:20:10.997818","0: echo foo
1: echo bar","0: foo
1: bar"
"prodhost","SUCCESS","2009-02-20 16:20:11.990142","0: echo foo
1: echo bar","0: foo
1: bar"

As you can see, these tools simplify and parallelize your SSH commands, decreasing the execution time that your tasks take and improving your efficiency.


Smarter tools for multiple machines

Once you've adopted the mindset that your tasks can be done in parallel and you've started using one of the parallel SSH tools for ad-hoc commands, you may find yourself wishing you could execute tasks in parallel in a more repeatable, extensible, and organized fashion.

If you were thinking this, you are in luck. There is a class of tools commonly classified as Command and Control or Orchestration tools. These tools include:

  1. MCollective
  2. Func
  3. Fabric
  4. Capistrano

These tools are built to be frameworks within which you can build repeatable systems automation. MCollective and Capistrano are written in Ruby, and Func and Fabric are written in Python, so you have options for whichever language you prefer. Each has strengths and weaknesses. I'm a big fan of MCollective in particular, because it has the strength of being built on Puppet, and its primary author, R.I. Pienaar, has a vision for it to become an extremely versatile tool for the kinds of needs that fall within the realm of Command and Control or Orchestration.

As it's always easiest to grasp a tool by seeing it in action, here are basic examples of using each tool:

mcollective

% mc-package install zsh

 * [ ============================================================> ] 3 / 3

web2.my.net                      version = zsh-4.2.6-3.el5
web3.my.net                      version = zsh-4.2.6-3.el5
web1.my.net                      version = zsh-4.2.6-3.el5

---- package agent summary ----
           Nodes: 3 / 3
        Versions: 3 * 4.2.6-3.el5
    Elapsed Time: 16.33 s

func

% func client15.example.com call hardware info
{'client15.example.com': {'bogomips': '7187.63',
                          'cpuModel': 'Intel(R) Pentium(R) 4 CPU 3.60GHz',
                          'cpuSpeed': '3590',
                          'cpuVendor': 'GenuineIntel',
                          'defaultRunlevel': '3',
...
                          'systemSwap': '8191',
                          'systemVendor': 'Dell Inc.'}}

fabric

% fab -H localhost,linuxbox host_type
[localhost] run: uname -s
[localhost] out: Darwin
[linuxbox] run: uname -s
[linuxbox] out: Linux

Done.
Disconnecting from localhost... done.
Disconnecting from linuxbox... done.

capistrano

# cap invoke COMMAND="yum -y install zsh"
  * executing `invoke'
  * executing "yum -y install zsh"
    servers: ["web1", "web2", "web3"]
    [web2] executing command
    [web1] executing command
    [web3] executing command
    [out :: web3] Nothing to do
    [out :: web2] Nothing to do
    [out :: web1] Complete!
    command finished

As you can see from these brief examples, each of these tools accomplishes similar things. Each one has a unique ecosystem, available plugins, and strengths and weaknesses, a description of which is beyond the scope of this post.

Taking your own script(s) multithreaded

The kernel of this article was an article I recently wrote for my employer's blog, Taking your script multithreaded, in which I detailed how I wrote a Python script to make an rsync job multithreaded and cut the execution time of a task from approximately 6 hours down to 45 minutes.
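The full script is worth a read, but the core idea can be sketched with a tool from earlier in this post: split the job into independent chunks and fan them out across N workers. Here is a rough shell sketch (not the actual script; the paths and host are hypothetical) that runs one rsync per top-level directory, 8 at a time:

ls /data/ | xargs -P 8 -I{} rsync -a /data/{} user@backuphost:/backup/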

I've created a git repo out of the script, so you can take my code and poke at it. If you end up using the script and make improvements, feel free to send me patches!

With the help of David Grieser, there is also a Ruby port of the script up on GitHub.

These are two good examples of how you can easily implement a multithreaded version of your own scripts to help parallelize your tasks.

Conclusion

There are clearly many steps you can take along the path to going parallel. I've tried to highlight how you can begin with tools that execute commands in a more parallel fashion and progress to tools that help you execute ad-hoc and then repeatable tasks across many hosts, and finally I've given some examples of how to make your own scripts more parallel.

6 comments:

Anonymous said...

I wonder why Perl is less often mentioned here, even though Perl was originally born to be a system administration language.

I prefer Perl because it is more stable, it is everywhere in any Unix-like OS by default, and there are no broken backward-compatibility headaches between language versions.

On Perl's CPAN, there are many modules and apps for parallelization:
http://search.cpan.org/search?query=parallel
http://search.cpan.org/search?q=parallel+ssh
http://search.cpan.org/search?query=batch+ssh

Pick what you want and enjoy :)

GDR! said...

pdsh (http://sourceforge.net/projects/pdsh/) is a great parallel shell:
pdsh -w hostname1,hostname2 cmd - execute a command on the listed hosts
pdsh -g hosts cmd - execute a command on a group of hosts

JB said...

Question for Brandon and all of the readers too: Where do you folks hear about these tools originally? I am out of the loop and don't like it.

Please share your sources so I can keep up better.

Anonymous said...

Great writeup! FYI, Yahoo just open-sourced a pretty awesome parallel job execution tool called Pogo: https://github.com/nrh/pogo. It's used heavily inside Yahoo and works great, so I'm hoping it gets some wider adoption. Definitely a work in progress.

@philiph

Brandon Burton said...

@Anonymous, tbh, Perl's community kind of stagnated for a while; all the momentum and "coolness" seemed to be with the Ruby and Python communities, so those of us who came into things in the last 5-8 years went to one of those. I know some hardcore Perl people, and it seems like Perl is having a revival (see http://www.modernperlbooks.com/mt/index.html), but that's my take on it.

@GDR that is also a good one, there are many parallel SSH tools, too many for me to cover them all :)

@JB I know of these from a combination of following people on Twitter, following blogs, listening to podcasts, and IRC. There are two ways to find out about these kinds of things: 1) cast a wide net yourself, or 2) find a few good people who cast a wide net and follow them ;) I'm working on another post that will highlight a lot of what I follow/listen to, so stay tuned for that.

@Phil thanks for the kind words; it's what keeps a writer going. I think every medium-to-large web company has written their own or taken an existing one and morphed it into their own beast, so it's great to see them released. As the ecosystem grows and learns with these kinds of things, it truly is a "rising tide lifts all boats" kind of effect.

Anonymous said...

Unfortunately I'm not able to find the Ruby version of your multithreaded script; the GitHub link links back to this article.

Also, David's GitHub account only contains the Python version.