This article was written by Brandon Burton, aka @solarce.
As system administrators, we are often faced with tasks that need to run against a number of things, perhaps files, users, servers, etc. In most cases, we resort of a loop of some sort, often, just a for loop in the shell. The drawback to this seemingly obvious approach, is that we are constrained by the fact that this approach is serial and so the time it will take increases linearly with the number of things we are running the task against.
I am here to tell you there is a better way, it is the path of going parallel!
Tools for your shell scripts
The first place to start is with tools that can replace that for
loop you
usually use and add some parallelism to the task you are running.
The two most well known tools that are available are:
xargs
is a tool used to build and execute command lines from standard
input, but one of its great features is that it can execute those command lines
in parallel via its -P
argument. A quick example of this is:
seq 10 20 | xargs -n 1 -P 5 sleep
This will send a sequence of numbers to xargs and divide it into chunks of one
argument (-n 1
) at a time and fork off 5 parallel processes (-P 5
) to
execute each. You can see it in action:
$ ps -eaf | grep sleep
baron 5830 5482 0 11:12 pts/2 00:00:00 xargs -n 1 -P 5 sleep
baron 5831 5830 0 11:12 pts/2 00:00:00 sleep 10
baron 5832 5830 0 11:12 pts/2 00:00:00 sleep 11
baron 5833 5830 0 11:12 pts/2 00:00:00 sleep 12
baron 5834 5830 0 11:12 pts/2 00:00:00 sleep 13
baron 5835 5830 0 11:12 pts/2 00:00:00 sleep 14
Some further reading on xargs
is available at:
- http://www.semicomplete.com/blog/articles/week-of-unix-tools/day-5-xargs.html
- http://www.xaprb.com/blog/2009/05/01/an-easy-way-to-run-many-tasks-in-parallel/
- http://www.linuxask.com/questions/run-tasks-in-parallel-with-xargs
- http://stackoverflow.com/questions/3321738/shell-scripting-using-xargs-to-execute-parallel-instances-of-a-shell-function
gnu parallel
is a lesser known tool, but has been gaining popularity
recently. It is written with the specific focus on executing processes in
parallel. From the home page description: "GNU parallel is a shell tool for
executing jobs in parallel locally or using remote machines. A job is typically
a single command or a small script that has to be run for each of the lines in
the input. The typical input is a list of files, a list of hosts, a list of
users, a list of URLs, or a list of tables."
A quick example of using parallel
is:
% cat offlineimap-cron5min.plist | parallel --max-procs=8 --group 'echo "Thing: {}"'
Thing: <string>offlineimap-cron5min</string>
Thing: <key>Label</key>
Thing: <string>solarce</string>
Thing: <key>UserName</key>
Thing: <dict>
Thing: <key>ProgramArguments</key>
Thing: <string>admin</string>
...
This plist
file is xml, but the output of parallel is unordered above because
each line of input is processed by one of the 8 workers and output occurs
(--group
) as each worker finishes an input (line) and not necessarily in the
order of input.
Some further reading on parallel
is available at:
- http://www.gnu.org/software/parallel/man.html
- http://psung.blogspot.com/2010/08/gnu-parallel.html
- http://unethicalblogger.com/posts/2010/11/gnuparallelchangedmy_life
Additionally, there is a great screencast on it.
Tools for multiple machines
The next step in our journey is to progress from just running parallel processes to running our tasks in parallel on multiple machines.
A common approach to this is to use something like the following:
for server in $(cat list_of_servers.txt); do
ssh $server command argument
done
While this approach is fine for small tasks on a small number of machines, the drawback to it is that it is executed linearly, so the total time the job will take is as long as the task takes to finish multiplied by the number of machines you are executing it on, which means it could take a while, so you'd better get a Snickers.
Fortunately, people have recognized this problem and have developed a number of tools have been developed to solve this, by running your SSH commends in parallel.
These include:
I'll illustrate how these work with a few examples.
First, here is a basic example of pssh (on Ubuntu the package is 'pssh,' but the command is 'parallel-ssh'):
# cat hosts-file
p1
p2
# pssh -h hosts-file -l ben date
[1] 21:12:55 [SUCCESS] p2 22
[2] 21:12:55 [SUCCESS] p1 22
# pssh -h hosts-file -l ben -P date
p2: Thu Oct 16 21:14:02 EST 2008
p2: [1] 21:13:00 [SUCCESS] p2 22
p1: Thu Sep 25 15:44:36 EST 2008
p1: [2] 21:13:00 [SUCCESS] p1 22
Second, here is an example of using sshpt:
./sshpt -f ../testhosts.txt "echo foo" "echo bar"
Username: myuser
Password:
"devhost","SUCCESS","2009-02-20 16:20:10.997818","0: echo foo
1: echo bar","0: foo
1: bar"
"prodhost","SUCCESS","2009-02-20 16:20:11.990142","0: echo foo
1: echo bar","0: foo
1: bar"
As you can see, these tools simplify and parallelize your SSH commands, decreasing the execution time that your tasks take and improving your efficiency.
Some further reading on this includes:
- What is a good modern parallel SSH tool?
- Linux - Running The Same Command on Many Machines at Once
- Effective adhoc commands in clusters
Smarter tools for multiple machines
Once you've adopted the mindset your tasks can be done in parallel and you've started using one of the parallel ssh tools for executing ad-hoc commands in a parallel fashion, you may find yourself thinking that you'd like to be able to execute tasks in parallel, but in a more repeatable, extensible, and organized fashion.
If you were thinking this, you are in a luck. There is a class of tools commonly classified as Command and Control or Orchestration tools. These tools include:
These tools are built to be frameworks within which you can build repeatable systems automation. Mcollective and capistrano are written in Ruby, and Func and Fabric are written in Python. This gives you options for whichever language you prefer. Each has strengths and weaknesses. I'm a big fan of Mcollective in particular, because it has the strength of being built on Puppet and its primary author, R.I. Pienaar has a vision for it to become an extremely versatile tool for the kinds of needs that fall within the realm of Command and Control or Orchestration.
As it's always easiest to grasp a tool by seeing it in action, here are basic examples of using each tool:
mcollective
% mc-package install zsh
* [ ============================================================> ] 3 / 3
web2.my.net version = zsh-4.2.6-3.el5
web3.my.net version = zsh-4.2.6-3.el5
web1.my.net version = zsh-4.2.6-3.el5
---- package agent summary ----
Nodes: 3 / 3
Versions: 3 * 4.2.6-3.el5
Elapsed Time: 16.33 s
func
% func client15.example.com call hardware info
{'client15.example.com': {'bogomips': '7187.63',
'cpuModel': 'Intel(R) Pentium(R) 4 CPU 3.60GHz',
'cpuSpeed': '3590',
'cpuVendor': 'GenuineIntel',
'defaultRunlevel': '3',
...
'systemSwap': '8191',
'systemVendor': 'Dell Inc.'}}
fabric
% fab -H localhost,linuxbox host_type
[localhost] run: uname -s
[localhost] out: Darwin
[linuxbox] run: uname -s
[linuxbox] out: Linux
Done.
Disconnecting from localhost... done.
Disconnecting from linuxbox... done.
capistrano
# cap invoke COMMAND="yum -y install zsh"
* executing `invoke'
* executing "yum -y install zsh"
servers: ["web1", "web2", "web3"]
[web2] executing command
[web1] executing command
[web3] executing command
[out :: web3] Nothing to do
[out :: web2] Nothing to do
[out :: web1] Complete!
command finished
As you can see from these brief examples, each of these tools accomplishes similar things, each one has a unique ecosystem, plugins available, and strengths and weaknesses, a description of which, is beyond the scope of this post.
Taking your own script(s) multithreaded
The kernel of this article was an article I recently wrote for my employer's blog, Taking your script multithreaded, in which I detailed how I wrote a Python script to make an rsync job multithreaded and cut the execution time of a task from approximately 6 hours, down to 45 minutes.
I've created a git repo out of the script, so you can take my code and poke at it. If you end up using the script and make improvements, feel free to send me patches!
With the help of David Grieser, there is also a Ruby port of the script up on Github.
These are two good examples of how you can easily implement a multithreaded version of your own scripts to help parallelize your tasks.
Conclusion
There are clearly many steps you can take along the path to going parallel. I've tried to highlight how you can begin with using tools to execute commands in a more parallel fashion, progress to tools which help you execute ad-hoc and then repeatable tasks across many hosts, and finally, given some examples on how to make your own scripts more parallel.
I wonder why Perl is less often mentioned here. Even tough perl was originally born to be a system administration language.
ReplyDeleteI prefer Perl because it is more stable and is everywhere in any Unix-like os by default and no broken backward compatibility headaches between language version.
On Perl's CPAN, There are many module and app for parallelization.
http://search.cpan.org/search?query=parallel
http://search.cpan.org/search?q=parallel+ssh
http://search.cpan.org/search?query=batch+ssh
Pick what you want and enjoy :)
pdsh (http://sourceforge.net/projects/pdsh/) is a great parallel shell:
ReplyDeletepdsh -w hostname1,hostname2 cmd - execute command on one hosts
pdsh -g hosts cmd - execute command on a group of hosts
Question for Brandon and all of the readers too : Where do you folks hear about these tools originally? I am out of the loop and don't like it.
ReplyDeletePlease share your sources so I can keep up better?
Great writeup! Fyi Yahoo just open-sourced a pretty awesome parallel job execution tool called Pogo here https://github.com/nrh/pogo. It's used heavily inside Yahoo and works great so I'm hoping it gets some wider adoption. Definitely a work in progress.
ReplyDelete@philiph
@Anonymous, tbh, Perl's community kind of stagnated for a while, all the momentum and "coolness" seemed to be with the Ruby and Python communities, so those of us who came into things in the last 5-8 years went to one of those communities, I know some hardcore Perl people and it seems like Perl is having a revival, see http://www.modernperlbooks.com/mt/index.html, but that's my take on it.
ReplyDelete@GDR that is also a good one, there are many parallel SSH tools, too many for me to cover them all :)
@JB I know of these from a combination of following people on twitter, following blogs, listening to podcasts, and irc. There are two ways to find out about these kinds of things 1.) cast a wide net yourself 2.) find a few good people who cast a wide net and follow them ;) I'm working on another post that will highlight a lot of what I follow/listen to, so stay tuned for that.
@Phil thanks for the kind words, it's what keeps a writer going. I think every medium to large web company has written their own or taken an existing one and morphed into their own beast, it's great to see them be released, as the ecosystem grows and learns, with these kinds of things, it truly is a "rising tide lifts all boats" kind of effect.
Unfortunately I'm not able to find the ruby version of your multithread script, as well as the link to github links back to this article.
ReplyDeleteAlso David's github account only contains the python version