December 15, 2013

Day 15 - Distributed Storage with Ceph

Written by: Kyle Bader (@mmgaggle)
Edited by: Michelle Carroll


Ceph provides a scalable, distributed storage system based on an intelligent object storage layer called RADOS. RADOS handles replication and recovery, and uses the CRUSH algorithm to achieve a statistically even distribution of objects across the storage nodes participating in a Ceph cluster. This article will help you get a small Ceph cluster up and running with Vagrant, an extremely powerful tool for development and testing environments.


Ceph Monitors

Ceph monitors decide and maintain mappings of the cluster and achieve consistency using Paxos. Each member of a Ceph cluster connects to one of the monitors in the quorum to receive maps and report state. A quorum of Ceph monitors should have an odd number of members, typically 3 or 5, to avoid ties during leader election. It's important to note that Ceph monitors are not in the data path. Ceph monitors also provide a means of authentication for storage daemons and clients.
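
The odd-number guidance follows from majority arithmetic. As a small worked example (the function names are illustrative and not part of Ceph), these shell functions compute how many monitors must agree and how many can fail while a majority remains:

```shell
# Illustrative helpers, not part of Ceph: majority arithmetic for a monitor quorum.

# Number of monitors that must agree in a cluster of N monitors.
majority() {
    echo $(( $1 / 2 + 1 ))
}

# Number of monitors that can fail while a majority remains.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

for n in 3 4 5; do
    echo "$n monitors: majority $(majority "$n"), tolerates $(tolerated_failures "$n") failure(s)"
done
```

Four monitors tolerate no more failures than three, so an even count adds a machine without adding resilience.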

Object Storage Device

Object storage devices, or OSDs, are daemons that abstract underlying storage devices and filesystems to provide the RADOS object storage interface.

Object Storage Device states

Object storage device daemons can be in various states, which fall into two groupings:


OSD daemons that are "in" are mapped into the cluster and will participate in placement groups. In the event that an OSD is marked "out", either by exceeding the amount of time the configuration allows it to be "down" or by operator intervention, the placement groups in which it participated will be remapped to another OSD and data will be backfilled from surviving replicas.


OSD daemons will be marked "up" when they are running, able to successfully peer with the OSD daemons with which they share a placement group, and able to send heartbeats to a Ceph monitor. If an OSD is marked "down", it fails to meet at least one of the previously stated conditions.


CRUSH is a deterministic, pseudo-random placement algorithm used by RADOS to map data according to placement rules (defined in what is known as a CRUSH map).
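
To make "deterministic, pseudo-random" concrete, here is a minimal sketch. It is not the CRUSH algorithm: it uses highest-random-weight hashing as a stand-in, and the function name and arguments are hypothetical. It only shows the property that placement is computed from the inputs rather than looked up.

```shell
# NOT the CRUSH algorithm: a stand-in (highest-random-weight hashing) that only
# illustrates deterministic, pseudo-random placement. All names are hypothetical.

# place OBJECT REPLICAS OSD...
# Prints the REPLICAS OSDs chosen for OBJECT, one per line.
place() {
    object=$1
    replicas=$2
    shift 2
    for osd in "$@"; do
        # Score each (object, OSD) pair with a checksum; cksum is a POSIX tool.
        score=$(printf '%s' "$object:$osd" | cksum | cut -d' ' -f1)
        printf '%s %s\n' "$score" "$osd"
    done | sort -rn | head -n "$replicas" | cut -d' ' -f2
}

place benchmark_object 2 osd.0 osd.1 osd.2
```

Anyone running the same function with the same inputs gets the same answer, so no central lookup table is needed. Real CRUSH additionally takes the hierarchy and weights in the CRUSH map into account.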

Placement Groups

Placement groups are portions of a pool that get distributed to OSDs based on their status and your CRUSH map. The replica count of a given pool determines how many OSDs will participate in a placement group. The primary in each placement group receives writes and replicates them to its peers, acknowledging only after all replicas are consistent. Reads, on the other hand, can be serviced from any replica of a given object.
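
The way an object name lands in a placement group can be sketched as a hash followed by a modulo. This is a simplification under stated assumptions: Ceph uses its own hash (the pool listing later in this article shows object_hash rjenkins) and a more careful modulo, while the sketch substitutes cksum and plain %. The function name is hypothetical.

```shell
# Simplified sketch: cksum and plain % are stand-ins for Ceph's real hash and
# modulo, chosen only for illustration.

# pg_for_object POOL_ID PG_NUM OBJECT_NAME
# Prints a placement group id in the "pool.hex" form used in pg listings.
pg_for_object() {
    hash=$(printf '%s' "$3" | cksum | cut -d' ' -f1)
    printf '%d.%x\n' "$1" $(( hash % $2 ))
}

pg_for_object 2 64 my_object
```

Because the result depends only on the pool, its placement group count, and the object name, every client maps a given object to the same placement group.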

Placement Group States

The following are a few of the most relevant placement group states along with a description; for an exhaustive list, see the official documentation.


Placement groups are considered "peering" when the participating OSDs are still gossiping about cluster state. Once the OSDs in a placement group complete the peering process they are marked "active".


Degraded placement groups lack sufficient replicas to satisfy the durability the system has been configured for. Clean placement groups have sufficient replicas of each object to satisfy the current pool configuration.


Remapped placement groups are no longer mapped to their ideal location as determined by the cluster's state and CRUSH map.


Placement groups that are backfilling are copying replicas from OSDs that participated in the placement group before remapping.


Placement groups are incomplete when not enough member OSDs are marked up.


Replicas have been discovered that are inconsistent and require repair.


There are a variety of different clients that have been written to interact with a Ceph cluster, namely:


librados is the native Ceph interface. A variety of language bindings are available should you choose to interact directly with Ceph from your own applications.


radosgw provides a RESTful interface to a Ceph cluster that is compatible with both Amazon S3 and OpenStack Swift.

qemu rbd

Modern versions of qemu support rbd, or RADOS block device backed volumes. These rbd volumes are striped across many objects in a Ceph cluster.


Similar to qemu rbd, krbd are RADOS block devices that are mounted by the Linux kernel and receive a major/minor device node.


cephfs is the distributed filesystem. Features of cephfs include dynamic subtree partitioning of metadata, recursive disk usage accounting and snapshots.


To quickly get a Ceph server up and running to experiment with the software, I've created a Vagrantfile and corresponding box files so that a small virtual cluster can be provisioned on a modest machine (a system with 8GB or more of memory is suggested). The Vagrantfile is written in such a way that the first machine provisioned is an Open Source Chef Server, which will run a script to load a set of Chef cookbooks and set up the environment for configuring a Ceph cluster. If you don't have Vagrant installed already, you can follow the official getting started guide. Next, you will need to clone the vagrant-ceph repository from GitHub:

git clone
cd vagrant-ceph
git checkout sysadvent
To bootstrap the Chef server and set up an initial monitor, simply run:

vagrant up chefserver cephmon
Both nodes should boot and converge to our desired state using Chef. After the cephmon node converges, you should wait a minute for the Chef server to index the bootstrap key material and make it available for search (tiny VMs are slow). After you have waited a minute, you can start up the OSD nodes:
vagrant up cephstore1001 cephstore1002
The result should be two Ceph OSD nodes each running 1 Ceph OSD daemon. Now that the cluster is provisioned you can stop the Chef server to free up some resources on your machine:
vagrant halt chefserver
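
The provisioning steps above can be collected into one function. This is a sketch rather than part of the vagrant-ceph repository; VAGRANT and WAIT_SECONDS are hypothetical knobs added here so the sequence can be dry-run or slowed down on a sluggish host.

```shell
# Sketch of the provisioning steps above as one function. VAGRANT and
# WAIT_SECONDS are hypothetical knobs, not options of vagrant-ceph.
provision_cluster() {
    vagrant_cmd=${VAGRANT:-vagrant}
    wait_seconds=${WAIT_SECONDS:-60}

    # Chef server and the first monitor come up together.
    $vagrant_cmd up chefserver cephmon || return 1

    # Give the Chef server time to index the bootstrap key material.
    sleep "$wait_seconds"

    # Two OSD nodes, then stop the Chef server to free memory.
    $vagrant_cmd up cephstore1001 cephstore1002 || return 1
    $vagrant_cmd halt chefserver
}
```

Run provision_cluster from the vagrant-ceph checkout. Setting VAGRANT=echo and WAIT_SECONDS=0 first prints the three vagrant invocations without executing them.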

Ceph Basics

To examine cluster state you will need to have access to a CephX keyring with administrative permissions. The cephmon node we booted generated keyrings during convergence, so run the commands that follow after establishing an SSH connection to cephmon:
vagrant ssh cephmon
The following are some commands to help you understand the current state of a Ceph cluster. Each of these commands should be run either as the root user or via sudo.
vagrant@cephmon:~$ sudo ceph health
The ceph health command responds with a general health status for the cluster. The result will be either "HEALTH_OK" or a list of problematic placement groups and OSDs.
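
When scripting against the cluster it is often handy to block until it reports healthy. A minimal polling sketch follows; the function name and the HEALTH_CMD override are hypothetical, added so the loop can be exercised without a cluster.

```shell
# HEALTH_CMD is a hypothetical override; by default it runs the command
# shown above.
HEALTH_CMD=${HEALTH_CMD:-sudo ceph health}

# wait_for_health_ok [TIMEOUT_SECONDS]
# Returns 0 once the health output starts with HEALTH_OK, 1 on timeout.
wait_for_health_ok() {
    timeout=${1:-60}
    waited=0
    while [ "$waited" -lt "$timeout" ]; do
        case "$($HEALTH_CMD)" in
            HEALTH_OK*) return 0 ;;
        esac
        sleep 1
        waited=$(( waited + 1 ))
    done
    return 1
}
```

For example, wait_for_health_ok 120 && echo "cluster is healthy" waits up to two minutes before giving up.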
vagrant@cephmon:~$ sudo ceph -s
health HEALTH_OK
monmap e1: 1 mons at {cephmon=}, election epoch 2, quorum 0 cephmon
osdmap e9: 2 osds: 2 up, 2 in
pgmap v16: 192 pgs: 192 active+clean; 0 bytes data, 71120 KB used, 10628 MB / 10697 MB avail
mdsmap e1: 0/0/1 up
The -s flag is the abbreviation for status. Calling the ceph -s command will return the cluster health (as ceph health reported earlier) along with lines that detail the status of the Ceph monitors, OSDs, placement groups, and metadata servers.
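
The osdmap line is a quick way to check whether every OSD is both up and in. The helpers below are a sketch that assumes the exact line format shown above (other Ceph releases may print it differently); the function names are hypothetical.

```shell
# osd_counts: read `ceph -s` output on stdin and print "total up in".
# Assumes the osdmap line format shown above.
osd_counts() {
    sed -n 's/.*osdmap e[0-9]*: \([0-9]*\) osds: \([0-9]*\) up, \([0-9]*\) in.*/\1 \2 \3/p'
}

# all_osds_up_and_in: succeed only when every OSD is both up and in.
all_osds_up_and_in() {
    set -- $(osd_counts)
    [ "$#" -eq 3 ] && [ "$1" -eq "$2" ] && [ "$1" -eq "$3" ]
}

echo "osdmap e9: 2 osds: 2 up, 2 in" | osd_counts
```

On the cluster itself this would be used as sudo ceph -s | all_osds_up_and_in.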
vagrant@cephmon:~$ sudo ceph -w
cluster cf376172-55ef-410b-a1ad-b84d9445aaf1
health HEALTH_OK
monmap e1: 1 mons at {cephmon=}, election epoch 2, quorum 0 cephmon
osdmap e9: 2 osds: 2 up, 2 in
pgmap v16: 192 pgs: 192 active+clean; 0 bytes data, 71120 KB used, 10628 MB / 10697 MB avail
mdsmap e1: 0/0/1 up

2013-12-14 03:24:03.439146 mon.0 [INF] pgmap v16: 192 pgs: 192 active+clean; 0 bytes data, 71120 KB used, 10628 MB / 10697 MB avail
The -w flag is the abbreviation for watch. Calling ceph -w prints the same summary as ceph -s, then tails the cluster log and punctuates it with periodic status updates a la ceph -s.

vagrant@cephmon:~$ sudo ceph osd tree
# id weight type name up/down reweight
-1 0.01999 root default
-2 0.009995  host cephstore1001
0 0.009995   osd.0 up 1
-3 0.009995  host cephstore1002
1 0.009995   osd.1 up 1
Shows a tree of your CRUSH map along with the weights and statuses of your OSDs.
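
The tree output is easy to filter when you only care about OSDs that are down. This sketch assumes the column layout shown above; the function name is hypothetical, and the sample input is the article's output with osd.1 changed to down for illustration.

```shell
# down_osds: read `ceph osd tree` output on stdin, print OSDs marked down.
# Assumes the column layout shown above (id, weight, name, up/down, reweight).
down_osds() {
    awk '$3 ~ /^osd\./ && $4 == "down" { print $3 }'
}

# Sample input: the tree above with osd.1 edited to "down" for illustration.
down_osds <<'EOF'
# id weight type name up/down reweight
-1 0.01999 root default
-2 0.009995  host cephstore1001
0 0.009995   osd.0 up 1
-3 0.009995  host cephstore1002
1 0.009995   osd.1 down 1
EOF
```

On the cluster itself this would be used as sudo ceph osd tree | down_osds.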
vagrant@cephmon:~$ sudo ceph osd dump

epoch 9
fsid cf376172-55ef-410b-a1ad-b84d9445aaf1
created 2013-12-14 03:15:49.419751
modified 2013-12-14 03:21:57.738002
pool 0 'data' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0 crash_replay_interval 45
pool 1 'metadata' rep size 2 min_size 1 crush_ruleset 1 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0
pool 2 'rbd' rep size 2 min_size 1 crush_ruleset 2 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 owner 0

max_osd 2
osd.0 up in weight 1 up_from 4 up_thru 8 down_at 0 last_clean_interval [0,0) exists,up a2515c14-f2e4-44b2-9cf9-1db603d7306a
osd.1 up in weight 1 up_from 8 up_thru 8 down_at 0 last_clean_interval [0,0) exists,up ea9896f2-7137-4527-a326-9909dfdfd226
Dumps a list of all osds and a wealth of information about them.
vagrant@cephmon:~$ sudo ceph pg dump
dumped all in format plain
version 16
stamp 2013-12-14 03:24:03.435845
last_osdmap_epoch 9
last_pg_scan 1
full_ratio 0.95
nearfull_ratio 0.85
pg_stat objects mip degr unf bytes log disklog state state_stamp v reported up acting last_scrub scrub_stamp last_deep_scrub deep_scrub_stamp
2.3d 0 0 0 0 0 0 0 active+clean 2013-12-14 03:21:58.042543 0'0 9:16 [0,1] [0,1] 0'0 2013-12-14 03:20:08.744088 0'0 2013-12-14 03:20:08.744088
1.3e 0 0 0 0 0 0 0 active+clean 2013-12-14 03:21:58.045611 0'0 9:16 [0,1] [0,1] 0'0 2013-12-14 03:20:08.601796 0'0 2013-12-14 03:20:08.601796
Dumps a list of placement groups.
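
With hundreds of placement groups the raw dump is hard to read, so a per-state count can help. The sketch below assumes state is the ninth column, as in the header shown above; the function name is hypothetical.

```shell
# pg_state_counts: read `ceph pg dump` output on stdin and print how many
# placement groups are in each state. Assumes state is the 9th column, as in
# the header shown above.
pg_state_counts() {
    awk '$1 ~ /^[0-9]+\.[0-9a-f]+$/ { count[$9]++ }
         END { for (state in count) print count[state], state }'
}
```

On the cluster itself this would be used as sudo ceph pg dump | pg_state_counts. Only lines whose first field looks like a placement group id (such as 2.3d) are counted, so the header and summary lines are ignored.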
vagrant@cephmon:~$ sudo rados df
pool name category KB objects clones degraded unfound rd rd KB wr wr KB
data - 0 0 0 0 0 0 0 0 0
metadata - 0 0 0 0 0 0 0 0 0
rbd - 0 0 0 0 0 0 0 0 0
total used 71120 0
total avail 10883592
total space 10954712
Prints a list of pools and their usage statistics.
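
The totals at the bottom of that output can be turned into a usage percentage. A small sketch, assuming the "total used" and "total space" lines appear as shown above; the function name is hypothetical.

```shell
# percent_used: read `rados df` output on stdin and print the percentage of
# raw space in use, from the "total used" and "total space" lines.
percent_used() {
    awk '$1 == "total" && $2 == "used"  { used = $3 }
         $1 == "total" && $2 == "space" { space = $3 }
         END { if (space > 0) printf "%.2f\n", used * 100 / space }'
}

printf 'total used 71120 0\ntotal space 10954712\n' | percent_used
```

For the freshly provisioned cluster above this prints 0.65, meaning well under one percent of the raw space is in use.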

Fill it Up

First, create a pool named "vagrant" so that we have somewhere to write data to:
vagrant@cephmon:~$ sudo rados mkpool vagrant
successfully created pool vagrant
Next, use the 'rados bench' tool to write some data to the cluster so it's a bit more interesting:
vagrant@cephmon:~$ sudo rados bench -p vagrant 200 write --no-cleanup

Maintaining 16 concurrent writes of 4194304 bytes for up to 200 seconds or 0 objects
Object prefix: benchmark_data_cephmon_3715
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 16 18 2 7.98932 8 0.835571 0.82237
2 16 24 8 15.9871 24 1.84799 1.35787
201 15 1183 1168 23.2228 8 2.95785 2.72622
Total time run: 202.022774
Total writes made: 1183
Write size: 4194304
Bandwidth (MB/sec): 23.423

Stddev Bandwidth: 7.25419 Max bandwidth (MB/sec): 40
Min bandwidth (MB/sec): 0
Average Latency: 2.73042
Stddev Latency: 1.07016
Max latency: 5.34265
Min latency: 0.56332
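
The reported bandwidth can be reproduced from the other figures in the summary: total writes times write size, converted to MB (taken here as 1048576 bytes), divided by the run time. The function name below is illustrative.

```shell
# Reproduce the reported bandwidth from the other numbers in the summary:
# writes * write size (converted to MB) / total run time.
bench_bandwidth() {
    # $1 = total writes, $2 = write size in bytes, $3 = run time in seconds
    awk -v writes="$1" -v size="$2" -v secs="$3" \
        'BEGIN { printf "%.3f\n", writes * size / 1048576 / secs }'
}

bench_bandwidth 1183 4194304 202.022774
```

This prints 23.423, matching the Bandwidth (MB/sec) line in the output above.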
You can use the commands you learned earlier to watch the cluster as it processes write requests from the benchmarking tool. Once you have written a bit of data to your virtual cluster you can run a read benchmark:
vagrant@cephmon:~$ sudo rados bench -p vagrant 200 seq
sec Cur ops started finished avg MB/s cur MB/s last lat avg lat
0 0 0 0 0 0 - 0
1 15 21 6 22.4134 24 0.728871 0.497643
2 16 33 17 32.8122 44 1.93878 1.00589

Starting and Stopping Ceph Daemons

At some point you will need to control the ceph daemons. This test cluster is built on Ubuntu Precise and uses Upstart for process monitoring.

With the commands you just learned you can inspect the cluster’s state, enabling you to experiment with stopping and starting Ceph monitors and OSDs.
vagrant@cephstore1001:~$ sudo stop ceph-osd id=0
ceph-osd stop/waiting
vagrant@cephstore1001:~$ sudo start ceph-osd id=0
ceph-osd (ceph/0) start/running, process 4422

Controlling Monitors

vagrant@cephmon:~$ sudo stop ceph-mon id=`hostname`
ceph-mon stop/waiting
vagrant@cephmon:~$ sudo start ceph-mon id=`hostname`
ceph-mon (ceph/cephmon) start/running, process 4106


I hope you find this a useful and interesting introduction to Ceph. The community is always interested in making it easier for people to experiment with the system. If you want to learn more about Ceph, you can dive into the excellent documentation at the Ceph homepage. I am working on adding support to the Vagrant environment for large clusters launched on EC2 and, later, OpenStack-compatible clouds. Pull requests and comments are welcome!
