December 8, 2021

Day 8 - D&D for SREs

By: Jennifer Davis (@sigje)

In a past life, I was a full-time SRE and a part-time dragonborn paladin named Lorarath. While at work, I supported thousands of systems in collaboration with a team of geeks. Evenings, I tried to survive imaginary disasters and save the world from the sorceress Morgana. I love collaborative games because they plug into some of the real-world emotional responses and social processes critical for successful, meaningful engagement. They provide a safe place to practice dealing with critical scenarios. When you know the stakes are purely imaginary, you're able to look at your efforts from a distance, gain understanding, and enjoy the process of learning and achieving goals together, even when failing. I want to share a couple of insights D&D has given me about my work and how they can help you.

Building your SRE Team … more than just a name.

SRE has many names: Operations, DevOps, Infrastructure engineering, System Admin. It's someone who deploys and runs a highly available, scalable, and secure service that meets business and partner requirements. But what does that mean? Generally, it means someone with a wide-ranging set of skills tackling different challenges at any point in time.

When you first start a campaign in Dungeons & Dragons, you choose a class to play. This class has specializations that you customize based on how you want to play. Next, you build out your character using a character sheet and create a backstory. The character sheet lists abilities and skills, and you have a pool of points to allocate among them, granting you additional chances to handle particular events successfully.

In gaming, you collaborate with your fellow players to ensure that you have a well-rounded party, often choosing roles that complement each other. You don't want a team of all "magic users" or hack-and-slashers. Often, we stop at identifying who we are with that single name, whether it's SRE or sysadmin. As an SRE, I depend on a diverse team with varied skills. I am not seeking people with the same expertise or abilities. I'm looking for people with complementary skills who can help accomplish the goals and visions of the team.

Developing your “character sheet”

There is no equivalent to a "character sheet" when it comes to your job. The closest might be a resume or LinkedIn profile. Still, these don't capture all of the experience you gain:

  • Submitting git pull requests.
  • Participating in hackathons.
  • Attending training or conferences.
  • The myriad of other day-to-day challenges you face.

Additionally, if you don't practice skills in real life, they languish. For example, I haven't touched Solaris in over a decade, and I no longer document it as a skill.

If SRE did have a character sheet, I think the three core abilities would be Communication, Collaboration, and Confidence. Let's take a closer look at these specializations and the value of spending energy on these areas.

Specialization: Communication

Communication is a fundamental building block to successful character building. As an SRE, I faced various scenarios that required expert communication.

  • The first specialty in communication is the frequency of messages. How often should I remind people about upcoming scheduled maintenance? How often should I reach out to my manager to make sure I'm working on the right thing? How often should my team get together to talk about team tasks?
  • The second specialty in communication is the quality of messages. Communication can be visual, written, or oral. Visuals can often convey more nuanced meaning than repeating the same information in text, and they are an underleveraged method.
  • The third specialty in communications is effectiveness. Effectiveness is the degree to which your words lead to the desired results. This specialty is the most advanced because effective communication requires an in-depth understanding of the audience and crafting your message as needed.

Specialization: Collaboration

The second core ability is collaboration. In any product or service you are working on, work needs to be understood, planned, and executed. It doesn't matter who does the work; it just matters that it gets done.

The role I take today doesn't define who I am. If I say, "I'm an SRE at Company," that is just one characteristic of my story and not my identity. Every day as you go into work and tackle your challenge, recognize your special value and what you bring to the team. Rather than adopting and marrying your identity to a specific role, realize some days you take on a role that may be quite different from what you are used to, and that's part of your character development.

There is a distinction between the members of your team and the roles they play. In gaming, you become comfortable speaking on behalf of your character while having a separate, sometimes meta-conversation with your teammates. Social environments tend towards homeostasis, and you may naturally ascribe a simplistic narrative to your co-workers' actions. Adopting the awareness that everyone is filling a role on the team, and that the role is not representative of everything about the individual, allows you to focus on the impactful work that needs to get done.

In other words, never say, "well, they are just the ROLENAME and can't do that," or "that's not my job."

Specialization: Confidence

The third core ability for your SRE character sheet is confidence. Confidence is the innate quality that drives you to take risks (or not).

In gaming, sometimes you take the wrong path, or you put your squishy players out front, and they get severely damaged. Mistakes happen. In the "real world," customers do something unexpected. There are bugs in the software, hardware fails, or someone from the team enters the wrong command on the wrong terminal in the production environment.

Collaborative games teach you to fail as a group and rise again while retaining the group cohesion necessary to succeed. Of course, if a teammate really caused you to be captured by a giant spider, you'd probably flip out. Still, across the game board, one has the emotional wiggle-room to behave in a manner that would be laudable in professional situations.

Playing teaches you about exploring challenges with imagination and a sense of play. You have to piece things together while continuing to take action, keeping in mind both the larger game goals and what's immediately on the board. In addition to this enormous world to explore, there are complex characters (non-player characters, or NPCs) to talk to and information to gather within each encounter. Be on the lookout for the helpful non-production engineers (NPEs) in your environment, too; while they may not maintain production, they may have valuable information to support you.

Wrapping Up

So, perhaps this article has inspired you to add some collaborative gaming to your team building, build out your team with complementary skills, or map the work of SRE or system administration to a character sheet. Great! Beyond the "character sheet," you need the appropriate visualization. By analyzing the particular work items an individual completes, you could increment a "skill" counter. Additional information from sources like git commits, package management, and incident management APIs could be gathered and glued together to create a way to look at progress over time. That way, you could make sure to spend time on the skills that will improve you in the direction of your choosing.
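
As a rough illustration of that idea (the author name and time window here are hypothetical), a one-line "skill counter" for code contributions could be seeded from git history:

    # Hypothetical sketch: count one person's commits over the last quarter
    # as a crude "code contribution" skill counter.
    git log --author="Lorarath" --since="3 months ago" --oneline | wc -l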

If you want to try out D&D, check out your local game stores or related groups. Beginner games often provide preconfigured characters that allow you to practice the gameplay without understanding all of the nuances of playing the game.

December 7, 2021

Day 7 - Baking Multi-architecture Docker Images

By: Joe Block (@curiousbiped)
Edited by: Martin Smith (@martinb3)

My home lab cluster has a mix of CPU architectures - several Odroid HC2s that are arm7, another bunch of Raspberry Pi 4s and Odroid HC4s that are arm64 and finally a repurposed MacBook Air that is amd64. To further complicate things, they're not even all running the same linux distribution - some run Raspberry Pi OS, one's still on Raspbian, some are running debian (a mix of buster and bullseye), and the MacBook Air runs Ubuntu.

To reduce complication, the services in the cluster are all running in docker or containerd - it's a homelab, so I'm deliberately running multiple options to learn different tooling. This meant that I had to do three separate builds (arm7, arm64, and amd64) on three different machines every time I updated one of my images, and my service startup scripts all had to determine what architecture they were running on and figure out which image tag to use.

Enter multi-architecture images

It used to be a hassle to create multi-architecture images. You'd have to create an image for each architecture, then upload them all separately from each build machine, then construct a manifest file that included references to all the different architecture images and then finally upload the manifest. This doesn't lead to easy rapid iteration.
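
For comparison, a rough sketch of that old workflow (image and tag names are illustrative) looked something like this:

    # On each build machine, build and push an architecture-specific tag
    docker build -t unixorn/debian-py3:arm64 .
    docker push unixorn/debian-py3:arm64
    # ...repeat for arm7 and amd64 on their respective machines, then
    # stitch the tags together into a manifest list and push it:
    docker manifest create unixorn/debian-py3:latest \
      unixorn/debian-py3:amd64 unixorn/debian-py3:arm7 unixorn/debian-py3:arm64
    docker manifest push unixorn/debian-py3:latest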

Now, thanks to docker buildx, you can create multi-architecture images as easily as docker build creates them for a single architecture.

Let's take a look with an example on my system. First, I can see what architectures are supported with docker buildx ls. As of 2021-12-03, Docker Desktop for macOS supports the following:


        NAME/NODE       DRIVER/ENDPOINT             STATUS  PLATFORMS
        multiarch *     docker-container
          multiarch0    unix:///var/run/docker.sock running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
        desktop-linux   docker
          desktop-linux desktop-linux               running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
        default         docker
          default       default                     running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
        

My home lab only has three architectures, so in these examples I'm going to build for arm7, arm64 and amd64.

Create a builder

I need to create a builder that supports multi-architecture builds. This only needs to be done once as Docker Desktop will reuse it for all of my buildx builds.


    docker buildx create --name multibuild --use

Building a multi-architecture image

Now, when I build an image with docker buildx, all I have to do is specify a comma-separated list of desired platforms with --platform. Behind the scenes, Docker Desktop will fire up QEMU virtual machines for each architecture I specified, run the image builds in parallel, then create the manifest and upload everything.

As an example, I have a Docker image, unixorn/debian-py3, that I use for my Python projects; it installs a minimal Python 3 onto debian:11-slim.

I build it with docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3 ., and the output below shows that it's building all three architectures.


        ❯ rake buildx
        Building unixorn/debian-py3
         docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3 .
        [+] Building 210.4s (17/17) FINISHED
         => [internal] load build definition from Dockerfile                                                                                                            0.0s
         => => transferring dockerfile: 571B                                                                                                                            0.0s
         => [internal] load .dockerignore                                                                                                                               0.0s
         => => transferring context: 2B                                                                                                                                 0.0s
         => [linux/arm64 internal] load metadata for docker.io/library/debian:11-slim                                                                                   3.7s
         => [linux/arm/v7 internal] load metadata for docker.io/library/debian:11-slim                                                                                  3.6s
         => [linux/amd64 internal] load metadata for docker.io/library/debian:11-slim                                                                                   3.6s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [linux/arm64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                             4.4s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680 30.06MB / 30.06MB                                                                2.0s
         => => extracting sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680                                                                       2.4s
         => [linux/amd64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                             4.0s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d 31.37MB / 31.37MB                                                                1.8s
         => => extracting sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d                                                                       2.2s
         => [linux/arm/v7 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                            4.3s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206 26.57MB / 26.57MB                                                                2.3s
         => => extracting sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206                                                                       2.0s
         => [linux/amd64 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install-r  22.3s
         => [linux/arm/v7 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install  176.9s
         => [linux/arm64 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install-  173.6s
         => exporting to image                                                                                                                                         25.4s
         => => exporting layers                                                                                                                                         6.7s
         => => exporting manifest sha256:ae5a5dcfe0028d32cba8d4e251cd7401c142023689a215c327de8bdbe8a4cba4                                                               0.0s
         => => exporting config sha256:48f97d6d8de3859a66625982c411f0aab062722a3611f18366ecff38ac4eafb9                                                                 0.0s
         => => exporting manifest sha256:fc7ad1e5f48da4fcb677d189dbc0abd3e155baf8f50eb09089968d1458fdcfb9                                                               0.0s
         => => exporting config sha256:60ced8a7d9dc49abbbcd02e7062268fdd2f14d9faedcb078b2980642ae959c3b                                                                 0.0s
         => => exporting manifest sha256:8f96f20d75502d5672f1be2d9646cbc5d5de3fcffd007289a688185714515189                                                               0.0s
         => => exporting config sha256:0c6e42f87110443450dbc539c97d99d3bfdd6dd78fb18cfdb0a1e3310f4c8615                                                                 0.0s
         => => exporting manifest list sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa                                                          0.0s
         => => pushing layers                                                                                                                                          17.2s
         => => pushing manifest for docker.io/unixorn/debian-py3:latest@sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa                         1.4s
         => [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io                                                                                          0.0s
         => [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io                                                                                          0.0s
         docker pull unixorn/debian-py3
        Using default tag: latest
        latest: Pulling from unixorn/debian-py3
        e5ae68f74026: Already exists
        86834dffc327: Pull complete
        Digest: sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa
        Status: Downloaded newer image for unixorn/debian-py3:latest
        docker.io/unixorn/debian-py3:latest
        1.60s user 1.05s system 1% cpu 3:36.49s total

One minor issue - docker buildx has a separate cache that it builds the images in, so when you build, the images won't be loaded in your local docker/containerd environment. If you want to have the image in your local docker environment, you need to run buildx with --load instead of --push.
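
For example, a minimal sketch of a local-only build (limited here to a single platform, since the local image store keeps one architecture per tag):

    # Build for this machine's architecture and load the result into the
    # local docker image store instead of pushing it to a registry.
    docker buildx build --platform linux/amd64 --load -t unixorn/debian-py3 .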

In this example, instead of running docker run unixorn/debian-py3:amd64, docker run unixorn/debian-py3:arm7 or docker run unixorn/debian-py3:arm64 based on what machine I'm on, now I can use the same image reference on all the machines -


        ❯ docker run unixorn/debian-py3 python3 --version
        Python 3.9.2
        ❯
        

Takeaway

If you're running a mix of architectures in your lab environment, docker buildx will simplify things considerably.

No more maintaining multiple architecture tags, no more having to build on multiple machines, no more accidentally forgetting to update one of the tags so that things are mysteriously different on just some of our machines, no more weird issues because we forgot to update service start scripts and docker-compose.yml files.

Simpler is always better, and buildx will simplify the environment for you.

December 5, 2021

Day 6 - More to come tomorrow!

We don't have any special system content for you today. We will have more tomorrow!

December 4, 2021

Day 5 - Least Privilege using strace

By: Shaun Mouton (@sdmouton)
Edited by: Jennifer Davis (@sigje)

Security in software development has been a hot-button issue for years. Increasing awareness of the threat posed by supply chain breaches has only increased the pressure on teams to improve security in all aspects of software delivery and operation. A key premise is least privilege: granting the minimum privileges necessary to accomplish a task, in order to prevent folks from accessing or altering things they shouldn't have rights to. Here's my thinking: we should help users apply the principles of least privilege when designing tools. When we find that our tooling does not enable least-privilege use, we can still address the problem using tracing tools which can be found in most Linux distribution package repositories. I would like to share my adventure of looking at an InSpec profile (using CINC Auditor) and a container I found on Docker Hub to demonstrate how to apply least privilege using strace for process access auditing.

At my prior job working at Chef, I fielded a request asking how to run an InSpec profile as a user other than root. InSpec allows you to write policies in code (called InSpec Profiles) to audit the state of a system. Most of the documentation and practice at the time had users inspecting the system as root or a root-equivalent user. At first glance, this makes a certain amount of sense: many tools in the "let's configure the entire system" and "let's audit the security of the entire system" spaces need access to whatever the user decides they want to check against. Users can write arbitrary profile code for InSpec (and the open source CINC Auditor), ship those profiles around, and scan their systems to determine whether or not they're in compliance.

I've experienced this pain of excessive privileges with utilities myself. I can't count the number of times we'd get a request to install some vendor tool nobody had ever heard of with root privileges. Nobody who asked could tell us what it'd be accessing, whether it would be able to make changes to the system, or how much network/cpu/disk it'd consume. The vendor and the security department or DBAs or whoever would file a request with the expectation that we should just trust their assertion that nothing would go wrong. So, being responsible system administrators, we'd say "no, absolutely not, tell us what it's going to be doing first" or "yes, we'll get that work scheduled" and then never schedule the work. This put us in the position of being gatekeepers rather than enablers of responsible behavior. While justified, it never sat right with me.

(Note: It is deeply strange that vendors often can't tell customers what their tools do when asked in good faith, as is the idea that there should be an assumption of trustworthiness in that lack of information.)

I've found some tools over the years which might be able to give a user output that can be used to help craft something like a set of required privileges to run an arbitrary program without root. Not too long ago I discussed "securing the supply chain": how to design an ingestion pipeline that lets folks run containers in a secure environment, with some assurance that a container built from code they didn't write isn't going to access things they aren't comfortable with. I thought about this old desire of limiting privileges when running an arbitrary command, and figured that I should do a little digging to see if something already existed. If not, maybe I could work towards a solution.

Now, I don't consider myself an expert developer but I have been writing or debugging code in one form or another since the '90s. I hope you consider this demo code with the expectation that someone wanting to do this in a production environment will re-implement what I've done far more elegantly. I hope that seeing my thinking and the work will help folks to understand a bit more about what's going on behind the scenes when you run arbitrary code, and to help you design better methods of securing your environment using that knowledge.

What I'll be showing here is the use of strace to build a picture of what is going on when you run code and how to approach crafting a baseline of expected system behavior using the information you can gather. I'll show two examples:

  • executing a relatively simple InSpec profile using the open source distribution's CINC Auditor
  • running a randomly selected container off Docker Hub (jjasghar/container_cobol)

Hopefully, seeing this work will help you solve a problem in your environment or avoid some compliance pain.

Parsing strace Output for a CINC Auditor (Chef InSpec) profile

There are other write-ups of strace functionality which go into broader and deeper detail on what's possible using it; I'll point to Julia Evans' work to get you started if you want to know more.

Strace is the venerable Linux syscall tracer, and a good tool to use when coming up against a "what's going on when this program runs" problem. However, its output can be decidedly unfriendly. Take a look in the strace-output directory in this repo for the files matching the pattern linux-baseline.* to see the output of the following command:


        root@trace1:~# strace --follow-forks --output-separately --trace=%file \
            -o /root/linux-baseline cinc-auditor exec linux-baseline

You can parse the output, however, if all you want to know is what files might need to be accessed (for an explanation of the command go here) you can do something similar to the following (maybe don't randomly sort the output and only show 10 lines):


awk -F '"' '{print $2}' linux-baseline/linux-baseline.108579 | sort -uR | head
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/minitest-5.14.4/lib/nokogiri.so
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/train-winrm-0.2.12/lib/psych/visitors.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/i18n-1.8.10/lib/rubygems/resolver/index_set.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-cognitoidentityprovider-1.53.0/lib/inspec/resources/command.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/jwt-2.3.0/lib/rubygems/package/tar_writer.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-codecommit-1.46.0/lib/pp.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/extensions/x86_64-linux/2.7.0/ffi-1.15.4/http/2.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/extensions/x86_64-linux/2.7.0/bcrypt_pbkdf-1.1.0/rubygems/package.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-databasemigrationservice-1.53.0/lib/inspec/resources/be_directory.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-ram-1.26.0/lib/rubygems/resolver/current_set.rb

You can start to build a picture of what all the user would need to be able to access in order to run a profile based on that output, but in order to go further I'll use a much simpler check:


        cinc-auditor exec linux-vsp/
    

Full results of that command are located in the strace-output directory with files matching the pattern linux-vsp.*, but to summarize what cinc-auditor/inspec is doing:

  • linux-vsp.109613 - this file shows all the omnibussed ruby files the cinc-auditor command tries to access in order to run its parent process
  • linux-vsp.109614 - why auditor is trying to run cmd.exe on a Linux system I don't yet know; you'll get used to seeing $PATH traversal very quickly
  • linux-vsp.109615 - I see a Get-WmiObject Win32_OperatingSys in there so we're checking to see if this is Windows
  • linux-vsp.109616 - more looking on the $PATH for Get-WmiObject so more Windows checking
  • linux-vsp.109617 - I am guessing that checking the $PATH for the Select command is more of the same
  • linux-vsp.109618 - Looking for and not finding ConvertTo-Json, this is a PowerShell cmdlet, right?
  • linux-vsp.109619 - Now we're getting somewhere on Linux, this running uname -s (with $PATH traversal info in there, see how used to this you are by now?)
  • linux-vsp.109620 - Now running uname -m
  • linux-vsp.109621 - Now running test -f /etc/debian_version
  • linux-vsp.109622 - Doing something with /etc/lsb-release but I didn't use the -v or -s strsize flags with strace so the command is truncated.
  • linux-vsp.109623 - Now we're just doing cat /etc/lsb-release using locale settings
  • linux-vsp.109624 - Checking for the inetd package
  • linux-vsp.109625 - Checking for the auditd package, its config directory /etc/dpkg/dpkg.cfg.d, and the config files /etc/dpkg/dpkg.cfg, and /root/.dpkg.cfg

Moving from that to getting an idea of what all a non-root user would need to be able to access, you can do something like this in the strace-output directory (explainshell here):


    find . -name "linux-vsp.10*" -exec awk -F '"' '{print $2}' {} \; | sort -u > \
        linux-vsp_files-accessed.txt

You can see the output of this command here, but you'll need to interpret some of the output from the perspective of the program being executed. For example, I see "Gemfile" in there without a preceding path. I expect that's Auditor looking in the ./linux-vsp directory where the profile being called exists, and the other entries without a preceding path are probably also relative to the command being executed.

Parsing strace output of a container execution

I said Docker earlier, but I've got podman installed on this machine so that's what the output will reflect. You can find the output of the following command in the strace-output directory in files matching the pattern container_cobol.*, and wow. Turns out running a full CentOS container produces a lot of output. When scanning through the files, you see what looks like podman doing podman things, and what looks like the COBOL Hello World application executing in the container. As I go through these files I will call out anything particularly interesting I see along the way:


        root@trace1:~# strace -ff --trace=%file -o /root/container_cobol podman run -it container_cobol
        Hello world!
        root@trace1:~# ls -1 container_cobol.* | wc -l
        146

I'm not going to go through 146 files individually as I did previously, but this is an interesting data point:


        root@trace1:strace-output# find . -name "container_cobol.1*" -exec awk -F '"' '{print $2}' {} \; | sort -u > container_cobol_files-accessed.txt

        root@trace1:strace-output# wc -l container_cobol_files-accessed.txt
        637 container_cobol_files-accessed.txt
        
        root@trace1:strace-output# wc -l linux-vsp_files-accessed.txt
        104754 linux-vsp_files-accessed.txt

So the full CentOS container running a little COBOL Hello World application needs access to six hundred thirty-seven files, and CINC Auditor running a 22-line profile directly on the OS needs to access over one hundred four thousand files. That doesn't directly mean that one is more or less of a security risk than the other, particularly given that a Hello World application can't report on the compliance state of your machines, containers, or applications, but it is fun to think about. One of the neatest things about debugging using tools which expose the underlying operations of a container exec is that you can reason about what containerization is actually doing. In this case, since we're only showing what files are accessed during the container exec, sorting the list, and removing duplicate entries, it's a cursory view, but still a useful one.

Let's say we're consuming a vendor application as a container. We can trace an execution (or sample a running instance of the container for a day, strace can attach to running processes), load the list of files into the pipeline we use to promote new versions of that vendor app to prod, and when we see a change in the files that the application is opening we can make a determination whether the behavior of the new version is appropriate for our production environment with all the PII and user financial data. Now, instead of trusting the vendor at their word that they've done their due diligence, we're actually observing the behavior of the application and using our own knowledge of our environment to say whether that application is suitable for use.
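
A sketch of that promotion gate (file names are illustrative) could be as simple as diffing the traced access list of the candidate release against the one we approved last time:

    # Fail the promotion if the new version touches files we haven't reviewed.
    if ! diff -u approved-files-accessed.txt candidate-files-accessed.txt; then
        echo "File access profile changed; review before promoting to prod" >&2
        exit 1
    fi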

But wait! Strace isn't just for files!

I used strace's file syscall filter as an example because it fit the example use case, but strace can snoop on other syscalls too! Do you need to know what IP addresses your process knows about? This example is using a container exec again, but you could snoop on an existing pid if you want, then run a similar search against the output (IPs have been modified in this output):


        strace -ff --trace=%network -o /root/yourcontainer-network -s 10241 podman run -it yourcontainer
        for file in $(ls -1 yourcontainer-network.*); do grep -oP 'inet_addr\("\K[^"]+' $file ; done
        127.0.0.1
        127.0.0.1
        693.18.119.36
        693.18.119.36
        693.18.131.255
        75.5117.0.5
        75.5117.0.5
        75.5117.255.255
        161.888.0.2
        161.888.0.2
        161.888.15.255
        832.71.40.1
        832.71.40.1
        832.71.255.255

Have I answered my original question?

With all that knowledge, can we address the original question: Can one use the list of files output by tracing a cinc-auditor run to provide a restricted set of permissions which will allow one to audit the system using CINC Auditor and the profile with a standard user?

Yes, with one caveat: My Very Simple Profile was too simple, and didn't require any additional privileges. I tried with a few other public profiles, but every one I tried ran successfully using a standard user created with useradd -m cincauditor. I looked through bug reports related to running profiles as a non-root user but couldn't replicate their issues - which is good, I suppose. It could be that the issue my customer was facing at the time was a bug in the program's behavior when run as a non-root user which has been fixed, or I just don't remember the use case they presented well enough to replicate it. So here's a manufactured case:



root@trace1:~# mkdir /tmp/foo
root@trace1:~# touch /tmp/foo/sixhundred
root@trace1:~# touch /tmp/foo/sevenhundred
root@trace1:~# chmod 700 /tmp/foo
root@trace1:~# chmod 600 /tmp/foo/sixhundred
root@trace1:~# chmod 700 /tmp/foo/sevenhundred
cincauditor@trace1:~$ cat << EOF > linux-vsp/controls/filetest.rb
> control "filetester" do
>   impact 1.0
>   title "Testing files"
>   desc "Ensure they're owned by root"
>   describe file('/tmp/foo/sixhundred') do
>     its('owner') { should eq 'root' }
>   end
>   describe file('/tmp/foo/sevenhundred') do
>     its('group') { should eq 'root'}
>   end
> end
> EOF
cincauditor@trace1:~$ cinc-auditor exec linux-vsp/

Profile: Very Simple Profile (linux-vsp)
Version: 0.1.0
Target:  local://

  ×  filetester: Testing files (2 failed)
     ×  File /tmp/foo/sixhundred owner is expected to eq "root"

     expected: "root"
          got: nil

     (compared using ==)

     ×  File /tmp/foo/sevenhundred group is expected to eq "root"

     expected: "root"
          got: nil

     (compared using ==)

  ✔  inetd: Do not install inetd
     ✔  System Package inetd is expected not to be installed
  ↺  auditd: Check auditd configuration (1 skipped)
     ✔  System Package auditd is expected to be installed
     ↺  Can't find file: /etc/audit/auditd.conf


Profile Summary: 1 successful control, 1 control failure, 1 control skipped
Test Summary: 2 successful, 2 failures, 1 skipped

cincauditor@trace1:~$ find . -name "linux-vsp.1*" -exec awk -F '"' '{print $2}' {} \; | sort -u > linux-vsp_files-accessed.txt

root@trace1:~# diff --suppress-common-lines -y linux-vsp_files-accessed.txt /home/cincauditor/linux-vsp_files-accessed.txt | grep -v /opt/cinc-auditor
							      >	/home
							      >	/home/cincauditor
							      >	/home/cincauditor/.dpkg.cfg
							      >	/home/cincauditor/.gem/ruby/2.7.0
							      >	/home/cincauditor/.gem/ruby/2.7.0/specifications
							      >	/home/cincauditor/.inspec
							      >	/home/cincauditor/.inspec/cache
							      >	/home/cincauditor/.inspec/config.json
							      >	/home/cincauditor/.inspec/gems/2.7.0/specifications
							      >	/home/cincauditor/.inspec/plugins
							      >	/home/cincauditor/.inspec/plugins.json
							      >	/home/cincauditor/linux-vsp
/root							      <
/root/.dpkg.cfg						      <
/root/.gem/ruby/2.7.0					      <
/root/.gem/ruby/2.7.0/specifications			      <
/root/.inspec						      <
/root/.inspec/cache					      <
/root/.inspec/config.json				      <
/root/.inspec/gems/2.7.0/specifications			      <
/root/.inspec/plugins					      <
/root/.inspec/plugins.json				      <
/root/linux-vsp						      <
							      >	/tmp/foo/sevenhundred
							      >	/tmp/foo/sixhundred
							      >	linux-vsp/controls/filetest.rb
root@trace1:~#

The end of that previous block's output shows compiling the list of files accessed when the cincauditor user runs the profile, in the same way we did for the root user, and then a diff of the two files. Looking at that output, it's fairly obvious that the profile is trying to access the newly created files, which are in a directory we made inaccessible to the cincauditor user (with chmod 700 /tmp/foo); when we give cinc-auditor access to that directory with chmod 750 /tmp/foo, the profile is able to check those files. A manufactured replication of the use case, but it does show that it's possible to use the output to accomplish the task. Whether chmod is the right way to give a least-privilege user access to the files is a question best left up to the implementer, their organization, and their auditors - the purpose of this exercise is to demonstrate the potential value of the strace debugger.
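
For completeness, a sketch of that permission change (this assumes the cincauditor user's primary group is granted access; adjust the group ownership to whatever fits your environment):

    # Let the cincauditor group traverse and read the directory without
    # opening it up to everyone else.
    chgrp cincauditor /tmp/foo
    chmod 750 /tmp/foo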

It is important to note that file permissions aren't the only reason why a program wouldn't run. If you're not able to use the information strace gives you to get an application to run as a user with restricted privileges, at least you can get more information about what is happening under the hood and can communicate about why a program is not suitable for your environment. If a program needs to run anyway, you can profile the application's behavior (perhaps a tool built on eBPF would be more suitable than strace for ongoing monitoring in a production environment) and notify when its behavior changes.
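
As one hedged example of such a tool, assuming the bcc tool collection is installed (the binary is named opensnoop-bpfcc on Debian/Ubuntu, plain opensnoop elsewhere), you can watch the files a running process opens:

    # Trace open() calls from an already-running process (PID 1234 is a
    # placeholder) using the eBPF-based opensnoop from bcc.
    sudo opensnoop-bpfcc -p 1234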

Closing thoughts

Over the past few years I've had a lot of thoughts about how to get things done in modern environments, and I've come to the conclusion that it's okay to write shell scripts to get something like this done. Since in this case I'm wrapping arbitrary tasks so I can extract information about what happens when they're running, and I won't be able to predict where I'll need it, I figured it was a good idea to use bash and awk, as those will be available via the package manager wherever I want to do this sort of thing.

You might not agree, and wish to see something like this implemented in something like Ruby, Python, or Rust (I have to admit that I thought about trying to do this using Rust so as to get better at it), and you're of course welcome to do so. Again, I chose shell since it's something many folks can easily run, look at, comprehend, modify, and re-implement in the way that suits them.

Lastly, thanks very much to Julia Evans. A note about the power of storytelling in one of her posts made me think "I should write a story about solving this problem so I can be sure I learned something from it", and I hope I've done a decent job of emulating her empathy towards folks learning these concepts for the first time. 

Day 4 - GWLB: Panacea for Cloud DMZ on AWS

By: Atif Siddiqui
Edited by: Jennifer Davis (@sigje)

Organizations aspire to apply the same security controls to ingress traffic in Cloud as they have on-premises, ideally taking advantage of Cloud value propositions to provide resiliency and scalability to traffic inspection appliances.

Within the AWS ecosystem, until last year, there wasn’t an elegant solution. Consequently, the most notable challenge this gap created, especially for regulated organizations, was designing the DMZ (demilitarized zone) pattern in AWS. It took two announcements to close the gap: VPC Ingress Routing and Gateway Load Balancer (GWLB).

Two years ago, AWS announced VPC Ingress routing. This provided the capability where ingress traffic could be directed to an Elastic Network interface (ENI). Last year, Amazon followed it up with a complementary announcement of GWLB.

GWLB is AWS's fourth load balancer offering, following the Classic, Application, and Network Load Balancers. Unlike the first three types, GWLB solves a niche problem and is specifically targeted towards partner appliances.

GWLB has a novel design with two distinct sides. The front end connects to a VPC endpoint service and its corresponding VPC endpoints, and acts as a Layer 3 gateway. The back end connects to third-party appliances and acts as a Layer 4 load balancer. An oversimplified diagram of the traffic flow is shown:

Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB → 3rd party appliance

So how do you provision a GWLB?

There are 4 key resources that need to be provisioned, in order:

  • Target Group
  • GWLB, using the above as its target group.
  • VPC endpoint service, using the above GWLB as its load balancer.
  • VPC endpoints bound to the above endpoint service.

Target Group

As part of this announcement, AWS implemented the GENEVE protocol and added this option to the UX for Target Groups. If you are unfamiliar with this protocol, it will be explained after going through the GWLB provisioning requirements.

To configure this as infrastructure as code (IaC), you could use a Terraform code snippet as follows:


    resource "aws_lb_target_group" "blog_gwlb_tgt_grp" {
      # Target group names may only contain alphanumerics and hyphens
      name     = "blog-gwlb-tgt-grp"
      port     = 6081
      protocol = "GENEVE"
      vpc_id   = aws_vpc.fw.id
    }
    

GWLB

As with Application Load Balancing, GWLB requires a target group to forward traffic; however, the target group must be created with the GENEVE protocol.

Health checks for TCP, HTTP and HTTPS are supported; however, it should be noted that health check packets are not GENEVE encapsulated.

An example of a terraform code snippet is as follows.


    resource "aws_lb" "blog_gwlb" {
      # Load balancer names may only contain alphanumerics and hyphens
      name               = "blog-gwlb"
      load_balancer_type = "gateway"
      # Private subnets created for the GWLB
      subnets            = blog-gwlb-subnet.pvt.*.id

      tags = {
        Name        = "blog-gwlb"
        Environment = "sandbox"
      }
    }
      

Endpoint Service

Prior to the GWLB announcement, if an endpoint service was being created, the only option offered was the Network Load Balancer (NLB). With GWLB’s availability, gateway is now the second option for load balancer type when creating an endpoint service. It should be noted that an endpoint service, whether it uses NLB or GWLB, relies on the underlying PrivateLink technology.

An example of a Terraform code snippet is as follows.


    resource "aws_vpc_endpoint_service" "blog-vpce-srvc" {
      acceptance_required        = false
      # Reference the GWLB created above
      gateway_load_balancer_arns = [aws_lb.blog_gwlb.arn]

      tags = {
        Name        = "blog-gwlb"
        Environment = "sandbox"
      }
    }
      

VPC endpoint

The last key piece of the set is provisioning the VPC endpoints, which bind to the endpoint service created in the prior step.


    resource "aws_vpc_endpoint" "blog_gwlbe" {
      # One endpoint per Availability Zone
      count             = length(var.az)
      # GWLB endpoints use the GatewayLoadBalancer endpoint type
      vpc_endpoint_type = "GatewayLoadBalancer"
      service_name      = aws_vpc_endpoint_service.blog-vpce-srvc.service_name
      subnet_ids        = [var.blog-gwlb-subnets[count.index]]
      vpc_id            = aws_vpc.fw.id

      tags = {
        Name        = "blog-gwlb"
        Environment = "sandbox"
      }
    }
    

GENEVE

This is an encapsulation protocol standardized by the Internet Engineering Task Force (IETF). GENEVE stands for Generic Network Virtualization Encapsulation and uses UDP (port 6081) for the transport layer. This encapsulation is what achieves the transparent routing of packets to third-party appliances from vendors such as F5 (BIG-IP), Palo Alto Networks, Aviatrix, etc.
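
If you ever need to confirm that encapsulated traffic is reaching an appliance, a quick hedged check (the interface name is illustrative) is to watch for GENEVE's UDP port:

    # GENEVE encapsulation rides on UDP port 6081
    sudo tcpdump -ni eth0 udp port 6081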

Special route table

The glue that binds the VPC Ingress Routing and GWLB features together is a special use of a route table.

Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB → 3rd party appliance, e.g., a marketplace subscription.

This route table does not have any explicit subnet associations. It does, however, have the Internet Gateway (IGW) specified as an edge association.

Within its routes, quad 0 (0.0.0.0/0) points to the network interfaces (ENIs) of the Gateway Load Balancer endpoints (GWLBe).

It is this routing rule that forces ingress traffic to the GWLBe, which in turn sends it through the endpoint service to the GWLB, which then routes it to the appliances.
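
A rough sketch of that wiring with the AWS CLI (all IDs are placeholders) might look like this:

    # Associate a dedicated route table with the Internet Gateway (an "edge association")
    aws ec2 associate-route-table --route-table-id rtb-0aaaabbbbccccdddd \
        --gateway-id igw-0123456789abcdef0

    # Route all ingress traffic (quad 0) to the GWLB endpoint
    aws ec2 create-route --route-table-id rtb-0aaaabbbbccccdddd \
        --destination-cidr-block 0.0.0.0/0 \
        --vpc-endpoint-id vpce-0123456789abcdef0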

Limitations

A target group using the GENEVE protocol does not support tags.

Cloud DMZ: Centralized Inspection Architecture

Conclusion

The pairing of VPC ingress routing and GWLB allows enterprises to have a much sought-after security posture where both ingress and egress traffic can undergo firewall inspection. This set of capabilities is especially notable when a Cloud DMZ architecture is being created.

Afterthought: AWS Network Firewall

It is always fascinating to me how AWS keeps vendors on their toes. There seems to be an aura of ineluctability where vendors strive to stay a step ahead of AWS’s offerings. While customers can use marketplace subscriptions (e.g. firewalls) with GWLB, there is a competing service from Amazon named AWS Network Firewall. This is essentially firewall as a service, where the VPC ingress routing primitive is used to point to AWS Network Firewall, which uses GWLB behind the scenes. It is easy to predict that AWS will push new products in this space that use GWLB under the hood.

Over time, choices will grow, whether with AWS products or with more vendors certifying their products against GWLB. This abundance will only benefit customers, giving them more choices in their pursuit of a secure network architecture.

December 3, 2021

Day 3 - Keeping Config Management Simple with Itamae

By: Paul Welch (@pwelch)
Edited by: Jennifer Davis (@sigje)

Our DevOps toolbox is filled with many tools, with configuration management being an often neglected and overloaded workhorse. While many resources today are deployed with containers, you still use configuration management tools to manage the underlying servers. Whether you use an image-based approach and configure your systems with Packer, or prefer configuring your systems manually after creation by something like Terraform, chances are you still want to continuously manage your hosts with infrastructure as code. To add to the list of potential tools to solve this, I’d like to introduce you to Itamae. Itamae is a simple tool that helps you manage your hosts with a straightforward DSL while also giving you access to the Ruby ecosystem. Inspired by Chef, Itamae has a similar DSL but does not require a server, complex attributes, or data bags.

Managing Resources

Itamae is designed to be lightweight; it comes with an essential set of resource types to bring your hosts to the expected state. These resource types focus on the core parts of a host we want to manage, like packages, templates, and services. The bundled `execute` resource can be used as an escape hatch to manage resources that might not have a builtin resource type. If you find yourself frequently managing something that does not have a built-in resource, you can build your own resources if you are comfortable with Ruby.

All Itamae resource types have common attributes that include: actions, guards, and triggers for other resources.

Actions

Actions are the activities that you want to have occur with the resource. Each bundled resource has predefined actions that can be taken. A `service` resource, for example, can have both an `:enable` and `:start` action which tells Itamae to enable the service to start on system boot and also start the service if it is not currently running.


    # enable and start the fail2ban service
    service "fail2ban" do
      action [:enable, :start]
    end
    

Guards

Guards ensure a resource is idempotent by only invoking the interpreted code if the conditions pass. The common attributes that are available to use within your infracode are `only_if` and `not_if`.


    # create an empty file only if it does not exist
    execute "create an empty file" do
      command "touch /tmp/file.txt"
      not_if "test -e /tmp/file.txt"
    end
    

Triggers

Triggers allow you to define event driven notifications to other resources.

The `notifies` and `subscribes` attributes allow you to trigger other resources only if there is a change such as restarting a service when a new template is rendered. These are synonymous with Chef & Puppet’s `notifies` and `subscribes` or Ansible’s `handlers`.


    # define nginx service
    service 'nginx' do
      action [:enable, :start]
    end
    
    # render template and restart nginx if there are changes
    template "/etc/nginx/sites-available/main" do
      source "templates/etc/nginx/sites-available/main.erb"
      mode   "0644"
      action :create
      notifies :restart, "service[nginx]", :delayed
    end

Itamae code is normally organized in “cookbooks”, much like Chef. You can include recipes to separate your code. Itamae also supports definitions to help you DRY up repeated resource patterns.

Example

Now that we have an initial overview of the Itamae basics, let’s build a basic Nginx configuration for a host. This example will install Nginx from a PPA on Ubuntu and render a basic configuration that will return the requestor’s IP address. The cookbook resources will be organized as follows:


    ├── default.rb
    └── templates
        └── etc
            └── nginx
                └── sites-available
                    └── main.erb

We will keep it simple with a single `default.rb` recipe and single `main.erb` Nginx site configuration template. The recipe and site configuration template content can be found below.


    # default.rb
    # Add Nginx PPA
    execute "add-apt-repository-ppa-nginx-stable" do
      command "add-apt-repository ppa:nginx/stable --yes"
      not_if "test -e /usr/sbin/nginx"
    end
    
    # Update apt cache
    execute "update-apt-cache" do
      command "apt-get update"
    end
    
    # install nginx stable
    package "nginx" do
      action :install
    end
    
    # enable nginx service
    service 'nginx' do
      action [:enable, :start]
    end
    
    # configure nginx
    template "/etc/nginx/sites-available/main" do
      source "templates/etc/nginx/sites-available/main.erb"
      mode   "0644"
      action :create
      notifies :restart, "service[nginx]", :delayed
      variables()
    end
    
    # enable example site
    link '/etc/nginx/sites-enabled/main'  do
      to "/etc/nginx/sites-available/main"
      notifies :restart, "service[nginx]", :delayed
      not_if "test -e /etc/nginx/sites-enabled/main"
    end
    
    # disable default site
    execute "disable-nginx-default-site" do
      command "rm /etc/nginx/sites-enabled/default"
      notifies :restart, "service[nginx]", :delayed
      only_if "test -e /etc/nginx/sites-enabled/default"
    end

    # templates/etc/nginx/sites-available/main.erb
    server {
      listen 80 default_server;
      listen [::]:80 default_server;

      server_name _;

      location / {
        # Return the requestor's IP as plain text
        default_type text/html;
        return 200 $remote_addr;
      }
    }

Deploying

To deploy the above example, it is assumed that you have a temporary VPS instance available.

There are 3 different ways you can deploy your configurations with Itamae:

  • `itamae ssh` via the itamae gem.
  • `itamae local` also via the itamae gem.
  • `mitamae` locally on the host.

Mitamae is an alternative implementation of Itamae built with mruby. This post focuses on Itamae in general, but the Mitamae implementation is a notable option if you want to deploy your configuration using prebuilt binaries instead of using SSH or requiring Ruby.
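
As a rough sketch (assuming you have copied the recipe and a prebuilt mitamae release binary onto the target host; the binary name below is illustrative):

    # Apply the recipe directly on the host, no SSH or system Ruby required
    chmod +x mitamae-x86_64-linux
    sudo ./mitamae-x86_64-linux local default.rb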

With your configuration ready, it’s just a single command to deploy over SSH. Itamae uses the Specinfra library, which is the same library that Serverspec uses to test hosts. You can also access a host’s inventory in Itamae much like you can with Chef & Ohai. To deploy your configuration, run:


    itamae ssh --key=/path/to/ssh_key --host=<IP> --user=<USER> \
        --log-level=DEBUG default.rb

Itamae will manage those packages and write out the template we specified, bringing the host to our desired state. Once the command is complete, you should be able to curl the host’s IP address and receive a response from Nginx.

Wrapping Up

Thank you for joining me in learning about this lightweight configuration management tool. Itamae gives you a set of bundled resource types to quickly configure your infrastructure in a repeatable and automated manner with three ways to deploy. Check out the Itamae Wiki for more information and best practices!

December 2, 2021

Day 2 - Reliability as a Product Feature

By: Martin Smith (@martinb3)
Edited by: Jennifer Davis (@sigje)

Abstract

SRE was born out of thinking about reliability as a product feature. However, all of the industry focus in the last few years on things like SLOs, Error Budgets, Production Engineering teams, and the other practices that constitute "doing SRE" sometimes means teams don’t take a product-centric approach to reliability these days, and they lose some of the advantages of doing so as a result. This post covers some project maturity levels, suggestions for thinking about reliability as an SRE engaged in those kinds of projects, and the kinds of collaboration that might be most successful in driving reliability-as-a-product-feature in each phase.

A brief history

Site Reliability Engineering, or SRE for short, was born in 2003 out of a need to improve service reliability at Google. Often described as, “an implementation of DevOps,” the practice of SRE aims to treat operations as a software problem that can be addressed through software engineering techniques.

And according to a survey by the DevOps Institute, SRE has truly taken off. This approach has been widely adopted, with 22% of organizations saying they have an SRE team in 2021. This shift can also be seen with the rise of conferences like USENIX’s SREcon which began in 2014, or the release of the popular, “Google SRE book,” a few years later in 2016.

Whether or not your organization has an SRE team that plans work using SLOs and Error Budgets, regularly reduces toil through automation, or has adopted one of the many SRE rules of thumb, the basic premise of what impact SRE can have sometimes gets lost -- that operations is a software problem. Or, shifting the focus back to the customer perspective, that reliability is a product feature that we build.

Having held DevOps Engineer and Site Reliability Engineer roles in the past, and having been a technical lead for SRE teams, I’ve had many opportunities to define the role, activities, and most importantly, the impact of an SRE team. In each case, I’ve found that focusing back on our customers’ experience of reliability has been the most useful framing when speaking to company leaders about an SRE team’s "why," instead of reciting a long, confusing list of things SREs might do in a quarter. I’ve also found that it’s an easy litmus test for myself to ensure I’m working on the right things at the right time. If I can’t explain how my work affects customer reliability, keeping in mind that reliability for operators usually leads to reliability for customers, it might be a sign that I need to work on something else.

Shifting focus back to product reliability

Shifting the focus from operations and software engineering to talking about reliability as a product feature has some major benefits. First, it helps our organizations better understand what reliability might mean for them and their product(s) -- whether that’s resilience (tolerant of failure), scalability (can function with large volumes of work), observability (understanding internal state from outputs), or security (trust of the system). These are all product capabilities that often aren’t well understood, but fundamentally all matter to customers.

Reliability benefits from product management support (communication with stakeholders, building roadmaps, helping with prioritization and decisions, etc). For example, do you know who your internal stakeholders are for the scalability of your product? What’s on the roadmap for observability over the next 6 months? 2 years? And importantly, what metrics will you collect to be sure you’ve accomplished those goals and delivered on that roadmap? How does it align with other features’ roadmaps? As a friend and former colleague of mine says, “reliability is a product feature whether you devote engineering time to it or not.” If you don’t explicitly plan for that, your customers will implicitly make their own assumptions about your reliability.

Reliability may start to sound like any other product feature, with both internal and external stakeholders, and that's by design. Making reliability an explicit part of your organizational planning also has many benefits. Thoughtworks' Technology Radar (Volume 25) from October of this year recommends adopting this kind of thinking: even internal teams should think of themselves as product teams. They also recommend using concepts from the popular Team Topologies book to figure out how to organize these internal teams. Reviewing examples of team structures from the book, I noticed that many organizations have also adopted Simon Wardley's Pioneer-Settler-Town Planner (or "PST") framework.

Let’s take a look at how one might apply these two ideas (reliability as a product feature, having a specific team profile) to improve the effectiveness of an SRE team.

  1. First, there’s no one-size-fits-all approach to improving reliability; different stages of a project will benefit from different kinds of SRE involvement. In this post, I’ll divide products/services into three levels of maturity: beginning, growing, and established.
  2. Then, I’ll describe what kinds of SRE work could be most effective at that maturity level, using the PST framework.

Here’s a graphic that explains the PST framework’s three kinds of roles/activities in more detail.

Team Profiles, from the blog post "Pioneers, Settlers and Town Planners" by Simon Wardley

Beginning phase (with Pioneer SREs)

In new projects, there's often uncertainty and unanswered questions. Small changes in direction could have large future benefits, but experimental work may be completely discarded, too. SREs can drive reliability at this stage by helping teams build prototypes, fail faster, and make agile decisions, all with reliability as a top-of-mind concern.

Have you ever had a project get close to production/release without anyone thinking about reliability or operational burdens? "Pioneer SREs" can help. They should be part of the team that's delivering new product development, evaluating vendors, building out proofs of concept, or making major architectural changes. At this stage of a project, any work needed to cover reliability gaps should be identified, and entire directions may change because of reliability concerns the team raises.

Embedding in the team building the new product or feature is a great way for SREs to drive reliability early in these kinds of projects. When SREs are only consulted briefly on reliability or operational concerns, the final output often doesn't adequately reflect customer or engineering expectations for the product's reliability or the operability of its internals.

The success of Pioneer SREs can be measured by looking at how quickly new products or features show up on the roadmap, how quickly vendor implementations happen, or how quickly a project moves from "exploration" to "concrete proposal."

The largest risk in this phase is having your SRE team end up as the owners of the system's reliability because they helped design it. Hiding the overall reliability of your system from the other developers behind an SRE team typically turns into a situation where the SRE team is treated as the operational team for any product/service problems. Well-scoped embedding engagements can help avoid this by emphasizing that embedded SREs are a training resource for the rest of the team to learn from, not coverage for the team once the embedding is over.

Growing phase (with Settler SREs)

In this phase, projects are often working to build production-quality infrastructure, launch to customers, or scale to the required audience. SREs can help turn the initial prototypes into mature, scalable components. They can also level up the engineering organization on how to prepare for new operational burdens by emphasizing best practices like automating away toil and choosing good SLOs.

Continuing to embed with teams is a great way for SREs to have a hand in the reliability of a nearly-launched product or feature, especially if SREs influence the team to build observability, scalability, and security into the product. Consulting with teams on production readiness, especially for brand new teams or brand new services, is another way SREs can ensure that everything reaching production meets the original reliability requirements of the product as well as operational best practices (e.g. automation instead of manual database migrations).

At this phase, it's especially important for SREs to build and maintain a shared definition of Production Readiness as a product or organization scales. This ensures a consistent approach to reliability across products or services and creates a minimum bar for reliability that must be satisfied. SREs at this stage may even build automation into a pipeline to guarantee minimum scale or ensure resilience to specific failures.
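As an illustration of the kind of pipeline automation I mean, here's a minimal sketch of a production readiness gate that could run in CI. The manifest format and the required fields (owning_team, runbook_url, slo, alerting, capacity_plan) are hypothetical; every organization defines its own minimum bar.

    # Minimal sketch of a production readiness gate for a CI pipeline.
    # The manifest format and required fields are hypothetical examples.

    import sys
    import yaml  # assumes PyYAML is available in the pipeline image

    REQUIRED_FIELDS = ["owning_team", "runbook_url", "slo", "alerting", "capacity_plan"]

    def check_readiness(manifest_path):
        """Return a list of readiness problems; an empty list means the bar is met."""
        with open(manifest_path) as f:
            manifest = yaml.safe_load(f) or {}
        problems = [f"missing required field: {field}"
                    for field in REQUIRED_FIELDS if field not in manifest]
        # Example of a stricter check: an SLO must name a target and a window.
        slo = manifest.get("slo", {})
        if slo and not ({"target", "window_days"} <= set(slo)):
            problems.append("slo must define both 'target' and 'window_days'")
        return problems

    if __name__ == "__main__":
        issues = check_readiness(sys.argv[1])
        for issue in issues:
            print(f"NOT READY: {issue}")
        sys.exit(1 if issues else 0)

Failing the build on an unmet checklist is one design choice; a gentler variant is to report the gaps and let the owning team decide, which tends to work better while the readiness bar is still being socialized.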

The success of Settler SREs can be measured by looking at how many new services and features are safely being launched into production, as well as examining things like ease of observability (e.g. effective logging, metrics, or monitoring). Success in this phase is also about establishing patterns that make projects successful (e.g. proposal templates). Project retrospectives are a great way to find those patterns as well as improve SRE engagement with the project.

Established phase (with Town Planner SREs)

In this most mature phase, products or services are usually already generally available, and systemic issues like overall architecture or developer tooling are the most likely to impact reliability.

SREs can influence reliability here by identifying and working to resolve systemic reliability issues (e.g. repeated incidents, poor SLO choices, lack of on-call process, etc). Driving continuous improvement is a very common way that SREs influence reliability at this phase.

In addition, SREs can often identify ways to reduce operational burdens or eliminate large-scale toil during this phase, whether through technical automation or architecture changes, or through helping teams build the processes, knowledge, skills, tools, and techniques they need for large-scale projects to be repeatedly successful and reliable.

This can be a phase where some SREs will feel there's a stigma associated with doing less technical work, but the impact of this work cannot be overstated -- it's where SRE can act as a true multiplier as more and more teams and products/services are launched. Examples include running an incident management program, an SLA program, an on-call program, Disaster Recovery/Business Continuity planning, or even a Chaos Engineering program. A strategy to address this concern is to pair SREs with a technical program management (TPM) function so that SREs can focus mostly on the technical aspects of improvement while TPMs help with the organizational changes needed to improve a process or execute a program.

Measuring the success of Town Planner SREs can be especially tricky. You might look for simple metric improvements like fewer incidents, reduced incident duration, reduced pages, improved SLO targets, or the number of DR tests -- but isolating SRE's impact on these kinds of metrics can be difficult. Qualitative feedback from an SRE team's internal customers is also frequently used to measure success at this stage. The most impactful SREs at this stage tend to cause paradigm shifts for the other development teams, and often even for their own SRE teammates.
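For the quantitative side, even a small script like the sketch below can track the direction of incident count and mean time to resolve over time; the incident record format here is made up for illustration, and real data would come from your incident tracker. Attributing any change to SRE work still takes the qualitative judgment described above.

    # Minimal sketch: incident count and mean time to resolve (MTTR) per quarter.
    # The record format is made up for illustration.

    from collections import defaultdict
    from datetime import datetime
    from statistics import mean

    incidents = [
        {"opened": "2021-04-03T10:00", "resolved": "2021-04-03T11:30"},
        {"opened": "2021-07-12T22:15", "resolved": "2021-07-13T00:05"},
        {"opened": "2021-08-30T09:00", "resolved": "2021-08-30T09:40"},
    ]

    by_quarter = defaultdict(list)
    for inc in incidents:
        opened = datetime.fromisoformat(inc["opened"])
        resolved = datetime.fromisoformat(inc["resolved"])
        quarter = f"{opened.year}-Q{(opened.month - 1) // 3 + 1}"
        by_quarter[quarter].append((resolved - opened).total_seconds() / 60)

    for quarter, durations in sorted(by_quarter.items()):
        print(f"{quarter}: {len(durations)} incidents, MTTR {mean(durations):.0f} min")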

Wrapping up

"[PST is] how you take a highly effective company and push it [...] towards a continuously adaptive system." -- @swardley, May 8th, 2020

I hope the grouping above is useful to readers for structuring work to drive reliability at various levels of product maturity. Reliability-as-a-product-feature isn't a magic bullet for an organization that doesn't understand where it fits in the market or what kind of value it delivers, nor will it make a large difference in an unhealthy product management practice that doesn't know how to develop a product and drive delivery of its features over time.

As mentioned earlier, there usually isn't a "one-size-fits-all" approach to driving reliability. You may still need to establish some best practices for your organization, such as "Limit toil to 50% of our work" or "Every product feature that goes live must have a reliability review." Combined with these kinds of rules of thumb, the divisions and strategies proposed above should help focus your team(s) on making the biggest improvement to reliability for your products and services.

In researching this post, it was helpful to review how various organizations and companies "do SRE"; continuous improvement was a clear shared trait among them. It's also worth reviewing the huge amount of content out there about how SRE can effectively collaborate with other teams (e.g. embedding SREs); a poor relationship or failed collaboration with another team can jeopardize all of your efforts.

I invite and encourage you to write about and share your own experiences, both good and bad, focusing on reliability as a first class product feature at your organization. Special thanks to my own SRE team for the many discussions and ideation sessions on how we can best work to drive reliability. And special thanks to Jennifer Davis, Michael Lumsden, David Nolan, Jordan Rinke, and Kerim Satirli for feedback and editing on this post.