December 12, 2017

Day 12 - Monitoring Postgres Replication Lag

By: Kathryn Exline (@kathryn_ex)
Edited By: Baron Schwartz (@xaprb)

Have you created replicas of your PostgreSQL databases? I am going to assume you are a good database steward and answered that question with a resounding “YES INDEEDLY DO!” Ned Flanders style. If not, I recommend taking the time to be kind to your future self and do so as soon as possible. We won’t talk about how to do that here, but you can find details on how to configure replication in the PostgreSQL documentation, the PostgreSQL wiki, and around the internet.

With your trusty replicas in place, make sure you take the time to properly monitor your clusters. One of the most important metrics of replication health, albeit seductively easy to over value, is “replication lag”. Before I show you a few simple queries to collect this value on your PostgreSQL clusters, let us briefly talk about replication in PostgreSQL. If you are already familiar with the concept of replication and how it is implemented in PostgreSQL, feel free to skip ahead to the “Why Care About Lag?” section.

The Basics: Replication and WAL

Replication is a mechanism where data from one database (a “primary”) is copied to another secondary database (a “replica” or “standby”), keeping it in sync. Most databases have built-in mechanisms to support this feature. After you configure your primary PostgreSQL database for your service, you should create one or more replica databases in case you lose your primary database or decide you want to offload some operations from the primary. Generally, you initialize a replica with a snapshot of the primary and then it stays up to date by fetching and replaying the primary’s transactions.

PostgreSQL implements replication via the Write-Ahead-Log or the “WAL” (pronounced “wall”, like the big icy thing in Game of Thrones). The notion of the WAL is not unique to PostgreSQL, and is similar to that of journaling in file systems. It ensures transactions are logged durably before they are committed, so updates can be recovered and replayed in the case of a crash. Aside from crash recovery, PostgreSQL leverages the WAL for internal performance gains and built-in replication support.

The WAL is a collection of 16MB binary files located in the pg_xlog directory of your data directory. Each time the database gets a transaction that requires changing any data, it appends a record of the transaction to the most recently created WAL segment file and assigns the record with a Log Sequence Number (LSN) to note its position in the WAL. I explicitly say position and not time because as the term Log Sequence Number suggests, the WAL files and their individual records are based on a sequence-based timeline. Why? Because if you are processing a high volume of transactions, timestamps may not be unique or granular enough to validate that your transactions are executed in the correct serial order. Not to mention time is full of nasty tricksies. This will be important later when we look at the queries you can use to find your primary’s and replicas’ position in the WAL.

PostgreSQL uses the WAL to make replicas of the primary in one of two ways. The latest and greatest is via streaming replication, where each WAL log record is sent to the replica as quickly as possible to be replayed. By default, this is done asynchronously so the replica can process the record without delaying the commit on the primary; however, PostgreSQL also supports synchronous replication where a transaction on the primary must wait until the WAL record is committed on both the primary and and the replica before considering the transaction successful.

The second and older option is via log-shipping where it ships one full WAL segment file (16MB) at a time from the primary to a replica. This generally results in higher replication lag since the replica will not receive the WAL records until the file is completely filled. Streaming replication is best for most use cases, but I recommend reading the PostgreSQL documentation around log-shipping standby servers for in-depth explanations of these two options.

Why Care About Lag?

Replication lag is the replica’s distance behind the primary in the sequential timeline. The time it takes to copy data from the primary to a replica, and apply the changes, can vary based on a number of factors including network time, replication configuration, and activity on both the primary and replicas. Unsurprisingly, I have seen replication lag spike on several occasions due to network issues. In another case, I saw replication lag spike on a replica that was not able to find and recover a WAL file from an archiving node, and it quietly fell out of date. The potential causes are widespread and I have found that replication lag is often an indicator that something is subtly failing or behaving unexpectedly.

Ultimately, it is safe to assume that there will be some amount of lag on any replica. But why do you need to know the replication lag in your clusters?

Disaster Recovery

In most scenarios where a primary database is lost, users want to promote the most up to date replica to ensure minimal data loss. You and your tooling can measure the lag to select the optimal replacement for the primary.

Service Strategies and Optimizations

If you connect all of your clients to the primary, you will eventually overload your database. When this happens, a common technique is to direct some read-only queries to replicas; however, If you don’t build your service to be aware of, and tolerate, replication lag, then your users will experience inconsistent behavior from your service. Knowing the typical replication lag of your replicas will help you strategize which services can still function in spite of potential lag.

Debugging and Observation

Just as measuring latency in an HTTP request can indicate an underlying issue, unusually high replication lag can indicate an issue with your databases. Unfortunately replication lag in isolation rarely informs users of the specific underlying problem, but it is a broad indicator of several issues and is another data point in your observability toolbelt.

How To Monitor Lag: Get Your “See Lags”

Now that we have a handle on the importance of monitoring your replication lag, let’s dive into two ways to measure replication lag.

By WAL Location

The most accurate way to determine the lag is to compare the current WAL location on the primary with the last WAL location received by the standby. To find the LSN value of the current WAL location in Postgres versions older than 10.x, run the following on the primary:

=# select pg_current_xlog_location();

In Postgres 10.x, you’ll need to use a newer function:

=# select pg_current_wal_lsn();

You should get an LSN value which looks like the following:

(1 row)

To find the LSN value of the last WAL location received and synced to disk by the standby, run the following on the replica. Once again, there’s a pre-10 syntax and a newer version for Postgres 10.x:

=# -- in Postgres 9.x
=# select pg_last_xlog_receive_location();

=# -- in Postgres 10.x
=# select pg_last_wal_receive_lsn();

You should get a similar record as the previous function:

(1 row)

Note that these functions denote what WAL position the replica has received from the primary, but not what it has applied to bring the replica’s copy into sync with the primary. There could be a difference between these two values. To find out what the replica has replayed, use the following functions on the replica:

=# -- in Postgres 9.x
=# select pg_last_xlog_replay_location();

=# -- in Postgres 10.x
=# select pg_last_wal_replay_lsn();

You can determine whether the replica is at the same point in the WAL as the primary by comparing the values of what’s been committed on the primary and what’s been received or replayed on the replica. The disadvantage of using WAL position is that, despite being an accurate representation of lag, it is difficult for humans to understand what an LSN difference really means. I have seen clever scripts that convert LSN’s to the byte position in the WAL and take the difference of these values, but there is an easier option that leverages another built-in function to approximate time lag.

By Time Difference

I told you earlier that time was tricky and the WAL is based on a sequence, but timestamps are more readable to humans and ingestible by external tools than the WAL location values. PostgreSQL can extract the timestamp of a given WAL location, allowing you to compare the timestamp of the last played transaction in the WAL with the current time using the following query on your replica:

=# -- Same in both Postgres 9.x and 10.x 
=# select now() - pg_last_xact_replay_timestamp();

This value needs to be read with additional context and taken with a grain of salt. It is meant to be an approximation of lag and should be treated as such. I find this query most useful to inject into my time-series observability metrics or to toss in a terminal pane when running operations that might affect lag. If you are selecting a replica to replace a failed primary, you should use the LSN instead of the approximate timestamp of the LSN.

Other Helpful Queries and Tools

You can run the following to determine whether you are interacting with the primary or a replica. A replica will return ‘t’ and the primary will return ‘f’:

=# select pg_is_in_recovery();

You can also translate the LSN value returned by the functions mentioned above to the name of the WAL file name within your pg_xlog directory using:

=# -- In Postgres 9.x
=# select pg_xlogfile_name(pg_last_xlog_receive_location());

=# -- In Postgres 10.x
=# select pg_walfile_name(pg_last_wal_receive_lsn());

If you are curious about what a WAL file actually looks like, PostgreSQL introduced the pg_xlogdump tool in version 9.3 to convert the contents of the binary WAL file into human readable form. Note this tool was renamed to pg_waldump in version 10.0 and is intended for educational purposes only.

Beyond the Queries

If your databases run on a cloud platform, your provider may already provide these metrics for you. For example, AWS Cloudwatch provides the ReplicaLag metric and GCP provides the replication metric. Finally, whether you use external tooling to monitor your replication lag or write your own monitoring plugins, you need to consider how you actually use these metrics.

As we discussed earlier, replication lag is a helpful metric and provides additional data points when making decisions about your services, but think long and carefully before alerting or paging around replication lag. You probably don’t want to. Replication lag is susceptible to a variety of factors, some of which are not actionable or inherently wrong, and it varies enough that you could find yourself bogged down in tuning alerting thresholds or developing complex anomaly detection. If you do choose to page on this value, make sure you embed plenty of headroom in your thresholds, provide context around potential lag causes in your alerting tools, and give your on-call rotation a few extra high-fives.


No comments :