December 11, 2014

Day 11 - Turning off the Pacemaker: Load Balancing Across Layer 3

Written by: Jan Ivar Beddari (@beddari)
Edited by: Joseph Kern (@josephkern)


Traditional load balancing usually brings to mind a dedicated array of Layer 2 devices connected to a server farm, with all of the devices preferably coming from the same vendor. But the latest techniques in load balancing are being implemented as open source software and standards-driven Layer 3 protocols. Building new load balancing stacks away from the traditional (often vendor-controlled) Layer 2 technologies opens up the network edge and creates a flexible multi-vendor approach to systems design. Many small organizations are embracing this approach, while many larger organizations are left wondering why they should care.

Layer 2 is deprecated!

The data center network as a concept is changing. The traditional three layer design model - access, aggregation and core - is being challenged by simpler, more cost effective models where internet proven technology is reused inside the data center. Today, building or redesigning a data center network for modern workloads would likely include running Layer 3 and routing protocols all the way down to the top-of-rack (ToR) switches. Once that is done, there is no need for Layer 2 between more than two adjacent devices. As a result, each and every ToR interface would be a point-to-point IP subnet with its own minimal broadcast domain. Conceptually it could look something like this:

[Figure: Layer 3-only network]
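
To make the idea of one point-to-point subnet per ToR interface concrete, here is a minimal sketch using Python's standard ipaddress module. It carves /31 link networks out of a larger block. The 10.0.0.0/29 block and the point_to_point_links() helper are assumptions made up for illustration, not part of any real addressing plan.

```python
import ipaddress

def point_to_point_links(block, count):
    """Carve `count` /31 point-to-point subnets out of `block`."""
    network = ipaddress.ip_network(block)
    links = []
    for subnet in network.subnets(new_prefix=31):
        if len(links) == count:
            break
        # A /31 holds exactly two addresses, one for each end of the link.
        first_end, second_end = list(subnet)
        links.append((str(subnet), str(first_end), str(second_end)))
    return links

if __name__ == "__main__":
    # Hypothetical block; each line is one switch-to-server link.
    for subnet, switch_ip, server_ip in point_to_point_links("10.0.0.0/29", 4):
        print(subnet, switch_ip, server_ip)
```

Each link gets its own two-address subnet, so no broadcast domain spans more than two adjacent devices.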

Removing Layer 2 from the design process and accepting Layer 3 protocols and routing appears to be the future for networks and service design. This can be a hard transition if you work in a more traditional environment. Security, deployment practices, management and monitoring tools, and a lot of small details need to change when your design process removes Layer 2. One of those design details that needs special consideration is load balancing.

Debating the HAProxy single point of failure

Every team or project that has deployed HAProxy has had a conversation about load balancing and resiliency. These conversations often start with the high ideal of eliminating single points of failure (SPoF) and end with an odd feeling that we might be trading one SPoF for another. A note: I’m not a purist, and I tend to casually mix the concept of load balancing with that of achieving basic network resilience. Apologies in advance for my lack of formality, but practical experience suggests that dealing with these concepts separately does less to actually solve problems. How then do we deploy HAProxy for maximum value with the least amount of effort in this new Layer 3 environment?

The simplest solution, and possibly my favorite one, would be to not bother with any failover for HAProxy at all. HAProxy is an incredible software engineering effort and given stable hardware it will just run, run and run. HAProxy will work and there will be no magic happening: if you reboot the node where it runs or have any kind of downtime, your services will be down. As expected. That’s the point, you know what to expect and you will get exactly that. I think we sometimes underestimate the importance of making critical pieces of infrastructure as simple as possible. If you know why you are doing it and what the downtime will cost, just accepting that your HAProxy is and will be a SPoF can be your best bet.

Good design practice: Always question situations where a service must run without a transparent failover mechanism. Is this appropriate? Do I understand the risk? Have the people that depend on this service understood and accepted this risk?

But providing failover for a HAProxy service isn’t trivial. Is it even worth implementing? Maybe using Pacemaker or keepalived to cluster the HAProxy will work? Or might there be better alternatives that have been created while you are reading this post?

Let’s say that for the longest time you did run your HAProxy as a SPoF and it worked very well: it was never overloaded, and whatever downtime you experienced wasn’t related to the load balancer. But then someone decides that all parts and components in your service have to be designed with a realtime failover capability. With a background in development or operations, I think most people would default to building a solution on proven software like Pacemaker or keepalived. Pacemaker is a widely used cluster resource management tool that covers a wide array of use cases around running and operating software clusters. keepalived has a simpler design with fewer features, relying on the Virtual Router Redundancy Protocol (VRRP) for IP-based failover. Given how services are evolving towards Layer 3 mechanisms, using either of these tools might not be the best decision. Pacemaker and keepalived in their default configurations rely on moving a virtual IP address (VIP) inside a single subnet. They just will not work in a modern data center without providing legacy Layer 2 broadcast domains as exceptions to the Layer 3 design.
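
The single-subnet constraint can be illustrated with a small sketch. It treats "every node has an interface in a subnet containing the VIP" as a stand-in for "the nodes share a broadcast domain", which is a simplification. The vip_can_fail_over() helper and all addresses are hypothetical and only serve the illustration.

```python
import ipaddress

def vip_can_fail_over(vip, node_interfaces):
    """True if every node has an interface whose subnet contains the VIP.

    This is only a proxy for the real requirement (a shared Layer 2
    broadcast domain), assumed here to keep the sketch short.
    """
    address = ipaddress.ip_address(vip)
    return all(
        address in ipaddress.ip_interface(interface).network
        for interface in node_interfaces
    )

if __name__ == "__main__":
    # Traditional design: both load balancers sit in one /24.
    print(vip_can_fail_over("10.0.1.100", ["10.0.1.11/24", "10.0.1.12/24"]))
    # Layer 3 design: each node sits on its own /31 link network.
    print(vip_can_fail_over("10.0.1.100", ["10.0.0.0/31", "10.0.0.2/31"]))
```

In the second case there is no subnet that both nodes and the VIP have in common, which is why a moving VIP does not fit a design built from point-to-point links.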

But the Layer 2 broadcast domain requirement of Pacemaker and keepalived is a limitation that can be ignored or worked around. Ignoring it would involve doing things like placing VIP resources inside an “availability-stretched” overlay network, e.g. inside an OpenStack tenant network or a subnet inside a single Amazon availability zone. This is a horrible idea. Not only does this build a critical path on top of services not really designed for that, it would also not achieve anything beyond the capabilities of a single HAProxy instance. When it comes to workarounds, keepalived could allow VRRP unicast to be routed, thus “escaping” the single subnet limitation. Pacemaker has a VIPArip resource that allows management of IP aliases across different subnets. I don’t think these designs would make enough sense (i.e. be simple enough) to design a solution around. Working around a single broadcast domain limitation by definition involves modifying your existing Layer 3 routing, and better value exists elsewhere.

Solving the HAProxy SPoF problem

Now, if you have a little more than just basic networking skills, or are lucky enough to work with people that do, you might be aware of a solution that is both elegant and scalable. Using routing protocols it is possible to split the traffic to a VIP across upstream routers and have multiple HAProxy instances process the flows. The reason this can work is that most modern routers are able to do load balancing per flow so that each TCP session consistently gets the same route, which means it will also get the same HAProxy instance. This is not a new practice and has been done for years in organizations that operate large scale services. In the wider operations community though, there doesn’t seem to be much discussion. Technically, it is not hard or complicated, but it requires skills and experience that are less common.

Knowing the basics, there are multiple ways of accomplishing this. CloudFlare uses Anycast to solve this problem, and a blog post by Allan Feid at Shutterstock explains how you could run ExaBGP to announce or withdraw routes to a service. In short, if HAProxy is up serving connections, use ExaBGP to announce to the upstream router that the route is available. In case of failure, do the opposite: tell the router that this route is no longer available for traffic.
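
The announce-or-withdraw idea can be sketched as follows. ExaBGP can run a helper process and read route commands from its standard output. The command strings below follow ExaBGP's text API as I understand it, so check them against the version you run. The service prefix 192.0.2.10/32 and the function names are assumptions for illustration, and this is not the code from the Shutterstock post.

```python
import sys

SERVICE_ROUTE = "192.0.2.10/32"  # hypothetical service IP (documentation range)

def bgp_command(healthy, route=SERVICE_ROUTE):
    """Build the text command for the current health state."""
    verb = "announce" if healthy else "withdraw"
    return "{} route {} next-hop self".format(verb, route)

def emit_on_change(health_samples, out=sys.stdout):
    """Write a command only when the health state flips.

    `health_samples` is any iterable of booleans, e.g. the results of
    polling HAProxy. Returns the list of commands that were written.
    """
    previous = None
    sent = []
    for healthy in health_samples:
        if healthy != previous:
            command = bgp_command(healthy)
            out.write(command + "\n")
            out.flush()  # the parent process reads line by line
            sent.append(command)
            previous = healthy
    return sent

if __name__ == "__main__":
    # Simulated poll results: up, up, down, up again.
    emit_on_change([True, True, False, True])
```

Only state changes produce output, so the router sees one announcement when HAProxy comes up and one withdrawal when it fails.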

I’m going to describe a solution that is similar, but expand on it a bit more. I hope you begin to see your services and datacenter a little differently.

[Figure: HAProxy ECMP]

In this scenario there are two routers, r1 and r2, both announcing a route to the service IP across the network. This announcement is done using routing protocols like BGP or OSPF. It does not matter which one is used; for our use case, they are functionally very close. Depending on how the larger network around r1 and r2 is designed, they might not be receiving equal amounts of the traffic towards the service IP. If so, it is possible to have the routers balance the workload across n0 before routing the traffic to the service.

These routers (r1 and r2) are connected to both load balancers across different link networks (n1 through n4) and have two equal-cost routes set up to the service IP. They know they can reach the service IP across both links and must make a routing decision about which one to use.

[Figure: HAProxy ECMP hashing]

The routers then use a hashing algorithm on the packets in the flow to make that decision. A typical algorithm looks at Layer 3 and Layer 4 information as a tuple, e.g. source IP, destination IP, source port and destination port, and then calculates a hash. If configured correctly, both routers will calculate the same hash and consequently install the same route, routing traffic to the same load balancer instance. Configuring hashing algorithms on the routers is what I’d consider the hardest part of getting a working solution. Not all routers are able to do it and trying to find documentation about it is hard.
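
The per-flow decision can be sketched in a few lines. Real routers use vendor-specific hash functions, often implemented in hardware, so SHA-256 here is only a stand-in, and the seed parameter, the lb1/lb2 names and the addresses are assumptions for illustration.

```python
import hashlib

def pick_next_hop(flow, next_hops, seed=0):
    """Map a (src_ip, dst_ip, src_port, dst_port) flow onto one next hop.

    The same flow, seed and next-hop list always give the same answer,
    which is what keeps a TCP session on one load balancer.
    """
    key = "{}|{}|{}|{}|{}".format(seed, *flow).encode("ascii")
    digest = hashlib.sha256(key).digest()
    index = int.from_bytes(digest[:8], "big") % len(next_hops)
    return next_hops[index]

if __name__ == "__main__":
    load_balancers = ["lb1", "lb2"]  # hypothetical HAProxy instances
    flow = ("198.51.100.7", "192.0.2.10", 40312, 443)
    # r1 and r2 run the same function with the same seed, so they agree.
    print("r1 picks", pick_next_hop(flow, load_balancers, seed=0))
    print("r2 picks", pick_next_hop(flow, load_balancers, seed=0))
```

If the two routers used different seeds or different orderings of the next-hop list, the same flow could land on different load balancers depending on which router it passed through, which is why consistent hashing configuration matters.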

Another approach is not using hardware routers at all and relying only on the Linux kernel and a software routing daemon like BIRD or Quagga. These would then be serving as dedicated routing servers in the setup, replacing the r1 and r2 hardware devices.

Regardless of using hardware or software routers, what makes this setup effective is that you do not interrupt any traffic when network changes take place. If r1 is administratively taken offline, routing information in the network will be updated so that the peering routers only use r2 as a destination for traffic towards the service IP. As for HAProxy it does not need to know that this is happening. Existing sessions (flows) won’t be interrupted and will be drained off r1.

[Figure: HAProxy ECMP failure]

To minimize unplanned downtime, it is essential to optimize the configuration on r1 and r2 for fast convergence, i.e. quick recovery from an unknown state. Rather than adjusting routing protocol timers I’d recommend using other forms of convergence optimization, like Bidirectional Forwarding Detection (BFD). The BFD failure detection timers have much shorter time limits than the failure detection mechanisms in the routing protocols, so they provide faster detection. This means recovery can be fast, even sub-second, and data loss minimized.
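
To put a number on "sub-second", here is a small sketch of the detection time calculation for BFD asynchronous mode as described in RFC 5880: the remote detect multiplier times the negotiated receive interval. The timer values are made-up examples, not recommendations, and the function name is hypothetical.

```python
def bfd_detection_time_ms(local_required_min_rx_ms,
                          remote_desired_min_tx_ms,
                          remote_detect_mult):
    """Time without packets after which the local end declares the session down.

    The negotiated interval is the slower of what we are willing to receive
    and what the remote end wants to send.
    """
    negotiated_interval_ms = max(local_required_min_rx_ms, remote_desired_min_tx_ms)
    return remote_detect_mult * negotiated_interval_ms

if __name__ == "__main__":
    # Example values only: 300 ms intervals with a multiplier of 3.
    print(bfd_detection_time_ms(300, 300, 3), "ms")
```

With these example values a failure is detected after 900 ms, compared to the tens of seconds that default routing protocol dead or hold timers typically allow.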

Automated health checks (mis)using Bidirectional Forwarding Detection

Now we need to define how to communicate with the routers from the HAProxy instances. They need to be able to communicate with the routing layer to start or stop sending traffic in their direction. In practice that means signaling the routers to add or withdraw one of the routes to the service IP address. Again, there are multiple ways of achieving this but simplicity is our goal. For this solution I’ll again focus on BFD. My contacts over at UNINETT in Trondheim have had success using OpenBFDD, an open source implementation of BFD, to initiate these routing updates. BFD is a network protocol used to detect faults between devices connected by a link. Standards-wise, it is specified in RFC 5880. It is low-overhead and fast, so it’s perfect for our simple functional needs. While both Quagga and BIRD have support for BFD, OpenBFDD can be used as a standalone mechanism, removing the need for running a full routing daemon on your load balancer.

To set this up, you would run bfdd-beacon on your HAProxy nodes, and then send it commands from its control utility bfdd-control. Of course this is something you’d want to automate in your HAProxy health status checks. As an example, this is a simple Python daemon that will run in the background, check HAProxy status and interfaces every second, and signal the upstream routers about state changes:

#!/usr/bin/env python
import os.path
import requests
import time
import subprocess
import logging
import logging.handlers
import argparse
from daemonize import Daemonize

APP = "project_bfd_monitor"
DESCRIPTION = "Check that everything is ok, and signal link down via bfd otherwise"
ADMIN_DOWN_MARKER = "/etc/admin_down"
HAPROXY_CHECK_URL = "http://localhost:1936/haproxy_up"

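def check_state(interface):
    # HAProxy must answer its check URL within one second.
    try:
        response = requests.get(HAPROXY_CHECK_URL, timeout=1)
        response.raise_for_status()
    except requests.RequestException:
        return "down"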
    ifstate_filename = "/sys/class/net/{}/operstate".format(interface)
    if not os.path.exists(ifstate_filename) or open(ifstate_filename).read().strip() != "up":
        return "down"
    if os.path.exists(ADMIN_DOWN_MARKER):
        return "admin"
    return "up"

def set_state(new_state):
    if new_state not in ("up", "down", "admin"):
        raise ValueError("Invalid new state: {}".format(new_state))
    # Ask the local bfdd-beacon to signal the new state on all BFD sessions.
    subprocess.call(["/usr/local/bin/bfdd-control", "session", "all", "state", new_state])

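def main(logfile, interface):
    logging.basicConfig(level=logging.INFO,
                        format='%(asctime)s %(name)s %(levelname)s %(message)s')
    if logfile:
        handler = logging.handlers.RotatingFileHandler(logfile,
                                                       maxBytes=10*1024**3, backupCount=5)
        # Reuse the console format for the log file.
        handler.setFormatter(logging.getLogger("").handlers[0].formatter)
        logging.getLogger("").addHandler(handler)
    state = check_state(interface)
    set_state(state)
    logging.info("bfd-check starting, initial state: %s", state)
    while True:
        time.sleep(1)
        new_state = check_state(interface)
        if new_state != state:
            # Only signal the routers when something actually changed.
            set_state(new_state)
            logging.info("state changed from %s to %s", state, new_state)
            state = new_state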

def parse_args():
    parser = argparse.ArgumentParser(description=DESCRIPTION)
    parser.add_argument('-d', '--daemonize', default=False, action='store_true',
                        help="Run as daemon")
    parser.add_argument('--pidfile', type=str, default="/var/run/{}.pid".format(APP),
                        help="pidfile when run as daemon")
    parser.add_argument('--logfile', default='/var/log/{}.log'.format(APP),
                        help="logfile to use")
    parser.add_argument('--interface', help="Downstream interface to monitor status of")

    return parser.parse_args()

if __name__ == '__main__':
    args = parse_args()
    if args.daemonize:
        daemon_main = lambda: main(args.logfile, args.interface)
        daemon = Daemonize(app=APP, pid=args.pidfile, action=daemon_main)
        daemon.start()
    else:
        # Run in the foreground, logging to the console only.
        main(None, args.interface)

There are three main functions. check_state() requests the HAProxy stats URI and checks the status of the monitored network interface through sysfs. main() runs a while True loop that calls check_state() every second. If the state has changed, set_state() is called and a bfdd-control subprocess is run to signal the new state through the protocol to the listening routers. One interesting thing to note about the admin down state: it is actually part of the BFD protocol standard as defined by the RFC. As a consequence, taking a load balancer out of service is as simple as marking it admin down (here, by creating the /etc/admin_down marker file), then waiting for its sessions to drain.
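
The order of the checks matters: a failed HAProxy or interface check wins over the admin marker. One way to make that priority easy to test without a running HAProxy is to factor the decision into a pure function. This is a sketch of that idea, not part of the UNINETT code, and decide_state() is a hypothetical name.

```python
def decide_state(haproxy_ok, link_up, admin_marker_present):
    """Return the BFD state to signal, using the same priority as check_state().

    A real failure is reported as "down" even if the node is also
    marked for administrative shutdown.
    """
    if not haproxy_ok or not link_up:
        return "down"
    if admin_marker_present:
        return "admin"
    return "up"

if __name__ == "__main__":
    # Print the full truth table for the three inputs.
    for haproxy_ok in (True, False):
        for link_up in (True, False):
            for marker in (True, False):
                print(haproxy_ok, link_up, marker,
                      decide_state(haproxy_ok, link_up, marker))
```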

When designing the HAProxy health checks to run there is an important factor to remember. You don’t want complicated health checks, and there is no point in exposing any application or service-related checks through to the routing layer. As long as HAProxy is up serving connections, we want to continue receiving traffic and stay up.


Standards-driven network design with core services implemented using open source software is currently gaining acceptance. Traditionally, developer and operations teams have had little knowledge of and ownership of this part of the infrastructure. Shared knowledge of Layer 3 protocols and routing will be crucial for any organization building or operating IT services going forward. Load balancing is a problem space that needs multi-domain knowledge, and greatly benefits from integrated teams. Integrating knowledge of IP networks and routing with application service delivery allows the creation of flexible load balancing systems while staying vendor-neutral.

Interesting connections are most often made when people of diverse backgrounds meet and form a new relationship. If you know a network engineer, the next time you talk, ask about Bidirectional Forwarding Detection and convergence times. It could become an interesting conversation!

Thank you to Sigmund Augdal at UNINETT for sharing the Python code in this article. His current version is available on their GitLab service.
