December 13, 2014

Day 13 - Managing Repositories with Pulp

Written by: Justin Garrison (@rothgar)
Edited by: Corey Quinn (@quinnypig)

Your infrastructure and build systems shouldn’t rely on Red Hat, Ubuntu, or Docker’s repositories being available. Your work doesn’t stop when they have scheduled maintenance. You also have no control over when they make updates available. Have you ever run an update and crossed your fingers that unknown package versions wouldn’t break production? How do you manage content and repos between environments? Are you still running reposync or wget in a cron job? It’s time to take control of your system dependencies and use something that is scalable, flexible, and repeatable.


That’s where Pulp comes in. Pulp is a platform for managing repositories of content and pushing that content out to large numbers of consumers. Pulp can sync and manage more than just RPMs. Do you want to create your own Docker repository? How about syncing Puppet modules from the forge, or an easy place to host your installation media? Would you like to sync Debian* repositories or have a local mirror of pip*? Pulp can do all of that and is built to scale to your needs.

*Note that some importers are still a work in progress and not fully functional. Pull requests welcome.

How Pulp Works

The first step is to use an importer to get content into a Pulp repository. Importers are plugins, which makes them extremely flexible when dealing with content. With a little bit of work you can build an importer for any content source. Want local gems, Maven, or CPAN repositories? You can write your own importers and have them working with Pulp repos for just about anything. The content can be synced from external sources, uploaded from local files, or imported between repos. The importer validates content, applies metadata, and removes old content when it syncs. No more guessing if your packages synced or if your cron job failed.
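To make the sync behavior described above concrete, here is a minimal sketch in Python of what "validate, add, and remove old content" can look like. It is a toy model, not Pulp's importer plugin API: the `Unit`, `Repo`, and `sync` names are hypothetical, and using a SHA-256 checksum as the validation step is an assumption for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Unit:
    """One piece of content (e.g. a package) offered by an upstream feed."""
    name: str
    data: bytes
    sha256: str  # checksum advertised by the feed


@dataclass
class Repo:
    """Hypothetical in-memory repository: unit name -> Unit."""
    repo_id: str
    units: dict = field(default_factory=dict)


def sync(repo, upstream):
    """Mirror `upstream` into `repo` and return a report of what changed."""
    report = {"added": [], "removed": [], "rejected": []}
    upstream_names = set()
    for unit in upstream:
        upstream_names.add(unit.name)
        # Validate: reject content whose bytes do not match the advertised checksum.
        if hashlib.sha256(unit.data).hexdigest() != unit.sha256:
            report["rejected"].append(unit.name)
            continue
        # Add new or changed units; unchanged units are left alone.
        if repo.units.get(unit.name) != unit:
            repo.units[unit.name] = unit
            report["added"].append(unit.name)
    # Remove units that are no longer offered upstream at all.
    for name in sorted(set(repo.units) - upstream_names):
        del repo.units[name]
        report["removed"].append(name)
    return report
```

The returned report is the point of the exercise: every sync says exactly what was added, removed, or rejected, which is the visibility a bare cron job does not give you.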

After you have content in a repo, you then use a distributor to publish the content. Like importers, distributors are pluggable, and content can be published to multiple locations. A single repo can publish to http(s), ISO, rsync, or any other distributor available. Publishing and syncing can also be scheduled to run once or multiple times so you don’t have to worry about your content getting out of date.
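The idea of one repo fanning out to several pluggable distributors can be sketched as follows. This is an illustration only, assuming a distributor is just a function that takes a repo id and its unit names and returns what it published. The function names, the `pulp.example.com` URL, and the ISO file name are all hypothetical and are not taken from Pulp.

```python
from typing import Callable, Dict, List

# In this sketch a distributor is a plain function:
# (repo id, unit names) -> list of published locations.
Distributor = Callable[[str, List[str]], List[str]]


def publish_http(repo_id: str, units: List[str]) -> List[str]:
    """Pretend to expose each unit under a web root (hypothetical URL)."""
    return [f"http://pulp.example.com/repos/{repo_id}/{name}" for name in units]


def publish_iso(repo_id: str, units: List[str]) -> List[str]:
    """Pretend to bundle every unit into a single ISO image (hypothetical name)."""
    return [f"{repo_id}.iso"] if units else []


def publish_all(
    repo_id: str, units: List[str], distributors: Dict[str, Distributor]
) -> Dict[str, List[str]]:
    """Run every distributor attached to the repo and collect what each produced."""
    return {name: fn(repo_id, units) for name, fn in distributors.items()}
```

Adding another output format in this model means writing one more function and adding it to the `distributors` mapping; the repo content itself does not change.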

Scaling Pulp

Pulp has different components that can be scaled according to your needs. The main components can be broken up into:

  • httpd - frontend for API and http(s) published repos
  • pulp_workers - process for long running tasks like repo syncing and publishing
  • pulp_celerybeat - maintains workers and task cancellation
  • pulp_resource_manager - job assigner for tasks
  • mongodb - repo and content metadata store
  • Qpid/RabbitMQ - message bus for job assigning
  • pulp-admin - cli tool for managing content and consumers
  • consumer - optional agent installed on node to subscribe to repos/content

Here’s how the components interact for a Pulp server.

[Figure: Pulp components]

Because of this modular layout, you can scale out each component individually. Need more resources for large files hosted on http? You can scale httpd easily with a load balancer and a couple of shared folders. If your syncs and publishes are taking a long time, you can add more pulp_workers to get more concurrent tasks running inside Pulp.
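As a rough back-of-the-envelope model of why extra workers help, the sketch below computes how long a queue of tasks takes when each task goes to the first worker that becomes free. This is a simplifying assumption made for illustration, not a description of how pulp_resource_manager actually assigns work, and the `makespan` function is hypothetical.

```python
import heapq


def makespan(durations, workers):
    """Total wall-clock time when each task goes to the first worker to become free."""
    if workers < 1:
        raise ValueError("need at least one worker")
    free_at = [0] * workers  # time at which each worker next becomes idle
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available worker
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Four 30-minute syncs queued at once:
syncs = [30, 30, 30, 30]
one_worker = makespan(syncs, 1)     # 120 minutes
two_workers = makespan(syncs, 2)    # 60 minutes
four_workers = makespan(syncs, 4)   # 30 minutes
eight_workers = makespan(syncs, 8)  # still 30 minutes
```

The last line shows the limit of this approach: once there are as many workers as queued tasks, adding more does nothing, and the longest single task sets the floor.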

If you have multiple datacenters you can mirror your Pulp repos to child nodes, which can replicate all or part of your parent server to each child.

[Figure: Node topologies]

Getting Started

The best-architected software in the world is worthless if you’re not going to use it. With Pulp, hopefully you can find a few use cases. Let’s say you just want better control over what repositories your servers pull content from. If you’re using Puppet, you can quickly set up a server using the provided Puppet manifests, and then you can mirror EPEL with just a few lines.

class pulp::repo::epel_6 {
  pulp_repo { 'epel-6-x86_64':
    # feed: URL of the upstream repository to sync from (left empty here)
    feed       => '',
    serve_http => true,
  }
}

Want to set up dedicated repos for dev, test, and prod? Just create a repo for each and schedule content syncing between the environment repos. You’ll finally take control over what content gets pushed to each environment. Because Pulp is intelligent with its storage, a package that appears in several repos only ever needs to be stored once.
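One common way to get "store a package once, reference it from many repos" is content-addressed storage, sketched below. This is an assumption about the general technique, not a description of Pulp's actual storage layout. The `Store` class and its `upload` and `promote` methods are hypothetical names.

```python
import hashlib


class Store:
    """Keeps each distinct blob once, keyed by its SHA-256 digest."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes
        self.repos = {}  # repo id -> {package name: digest}

    def upload(self, repo_id, name, data):
        """Add a package to a repo; the bytes are stored only if they are new."""
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)
        self.repos.setdefault(repo_id, {})[name] = digest
        return digest

    def promote(self, src, dst, names):
        """Copy references (not bytes) for `names` from one repo to another."""
        target = self.repos.setdefault(dst, {})
        for name in names:
            target[name] = self.repos[src][name]


store = Store()
store.upload("dev", "app-1.0.rpm", b"release 1.0")
store.upload("dev", "app-1.1.rpm", b"release 1.1")
# Only the vetted version moves on to test, and then to prod.
store.promote("dev", "test", ["app-1.0.rpm"])
store.promote("test", "prod", ["app-1.0.rpm"])
```

After the two promotions, `app-1.0.rpm` is visible in all three repos but its bytes exist once in `store.blobs`, and `app-1.1.rpm` stays in dev until someone promotes it.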

Want to create an internal Docker registry? How about hosting it in Pulp, deployed with Docker containers? You can deploy it with a single line in bash. Check out the infrastructure diagram below and learn how to do it in the quickstart documentation.

[Figure: Container components]

Getting content to consumers can be as easy as publishing over http and letting the system tools pull content the way they normally do. Alternatively, you can install the consumer agent to get real-time status about what is installed on each node, push content as soon as it is available, or roll back managed content if you find a broken package.


Not only can managing your own repositories greatly improve your control over and visibility into your systems, but moving data closer to the nodes can speed up your deployments and simplify your infrastructure. If I haven’t convinced you yet that you should manage the content that goes onto your servers, you must either be very trusting or have one doozy of a cron job!
