December 7, 2021

Day 7 - Baking Multi-architecture Docker Images

By: Joe Block (@curiousbiped)
Edited by: Martin Smith (@martinb3)

My home lab cluster has a mix of CPU architectures - several Odroid HC2s that are arm7, a bunch of Raspberry Pi 4s and Odroid HC4s that are arm64, and finally a repurposed MacBook Air that is amd64. To further complicate things, they're not even all running the same Linux distribution - some run Raspberry Pi OS, one's still on Raspbian, some are running Debian (a mix of buster and bullseye), and the MacBook Air runs Ubuntu.

To reduce complication, the services in the cluster are all running in docker or containerd - it's a homelab, so I'm deliberately running multiple options to learn different tooling. This meant that every time I updated one of my images I had to do three separate builds (arm7, arm64 and amd64) on three different machines, and my service startup scripts all had to determine what architecture they were running on and figure out what image tag to use.
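The per-machine tag selection those startup scripts needed might have looked roughly like the sketch below. The helper name `image_tag_for` is hypothetical. The `uname -m` strings are the usual values on Linux but are an assumption about any particular board, and the tag names match the per-architecture tags mentioned in this post.

```shell
#!/bin/sh
# Map a machine hardware name (as printed by `uname -m`) to an image tag.
# image_tag_for is a hypothetical helper, not part of docker or any other tool.
image_tag_for() {
  case "$1" in
    x86_64|amd64)   echo "amd64" ;;
    aarch64|arm64)  echo "arm64" ;;
    armv7l|armv7)   echo "arm7" ;;
    *)
      echo "unsupported architecture: $1" >&2
      return 1
      ;;
  esac
}

# Usage: pick the tag for the current machine, then run the matching image.
#   tag="$(image_tag_for "$(uname -m)")" && docker run "unixorn/debian-py3:${tag}"
```

Every script that starts a container has to carry logic like this, which is the duplication that a single multi-architecture tag removes.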

Enter multi-architecture images

It used to be a hassle to create multi-architecture images. You'd have to create an image for each architecture, then upload them all separately from each build machine, then construct a manifest file that included references to all the different architecture images and then finally upload the manifest. This doesn't lead to easy rapid iteration.
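To make the old workflow concrete, here is a small sketch that only prints the commands involved. `manual_manifest_steps` is a hypothetical helper, and the per-architecture tag names are assumptions. The `docker manifest create` and `docker manifest push` subcommands are real, though they were still marked experimental in some Docker releases.

```shell
#!/bin/sh
# Print the manual steps for publishing one multi-architecture tag.
# manual_manifest_steps is a hypothetical helper; it prints commands
# instead of running them so the sequence can be reviewed safely.
manual_manifest_steps() {
  image="$1"; shift
  refs=""
  for tag in "$@"; do
    # Each architecture is built and pushed from its own build machine.
    echo "docker build -t ${image}:${tag} . && docker push ${image}:${tag}"
    refs="${refs} ${image}:${tag}"
  done
  # Then one machine stitches the per-architecture images together.
  echo "docker manifest create ${image}:latest${refs}"
  echo "docker manifest push ${image}:latest"
}

manual_manifest_steps unixorn/debian-py3 amd64 arm7 arm64
```

That is five commands spread over three machines for a single image update.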

Now, thanks to docker buildx, you can create multi-architecture images as easily as docker build creates single-architecture ones.

Let's take a look with an example on my system. First, I can see what architectures are supported with docker buildx ls. As of 2021-12-03, Docker Desktop for macOS supports the following:


        NAME/NODE       DRIVER/ENDPOINT             STATUS  PLATFORMS
        multiarch *     docker-container
          multiarch0    unix:///var/run/docker.sock running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
        desktop-linux   docker
          desktop-linux desktop-linux               running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
        default         docker
          default       default                     running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
        

My home lab only has three architectures, so in these examples I'm going to build for arm7, arm64 and amd64.

Create a builder

I need to create a builder that supports multi-architecture builds. This only needs to be done once as Docker Desktop will reuse it for all of my buildx builds.


    docker buildx create --name multibuild --use
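If the create command ends up in a script that runs more than once, a guard like the following sketch may help. `ensure_builder` is a hypothetical helper name, and the sketch assumes `docker buildx inspect` exits non-zero when the named builder does not exist.

```shell
#!/bin/sh
# Create the builder only if it does not exist yet, then select it.
# ensure_builder is a hypothetical helper name.
ensure_builder() {
  name="$1"
  if docker buildx inspect "$name" >/dev/null 2>&1; then
    # Builder already exists: just make it the active one.
    docker buildx use "$name"
  else
    docker buildx create --name "$name" --use
  fi
}

# Usage:
#   ensure_builder multibuild
```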

Building a multi-architecture image

Now, when I build an image with docker buildx, all I have to do is specify a comma-separated list of desired platforms with --platform. Behind the scenes, Docker Desktop will fire up QEMU virtual machines for each architecture I specified, run the image builds in parallel, then create the manifest and upload everything.
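As a small illustration of the comma-separated --platform value, this sketch assembles the command line from a list of platforms and prints it. `buildx_command` is a hypothetical helper, and it prints the command rather than running it.

```shell
#!/bin/sh
# Join platform names with commas and print the resulting buildx command.
# buildx_command is a hypothetical helper; it prints rather than runs.
buildx_command() {
  image="$1"; shift
  # Inside the subshell, "$*" joins the remaining arguments with the
  # first character of IFS, which is set to a comma here.
  platforms="$(IFS=,; echo "$*")"
  echo "docker buildx build --platform ${platforms} --push -t ${image} ."
}

buildx_command unixorn/debian-py3 linux/amd64 linux/arm/v7 linux/arm64
```

Keeping the platform list in one place makes it easy to add or drop an architecture later.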

As an example, I have a docker image, unixorn/debian-py3, that I use for my python projects. It installs a minimal Python 3 onto debian 11-slim.

I build it with docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3 . (note the trailing dot for the build context), resulting in the output below showing that it's building all three architectures.


        ❯ rake buildx
        Building unixorn/debian-py3
         docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3 .
        [+] Building 210.4s (17/17) FINISHED
         => [internal] load build definition from Dockerfile                                                                                                            0.0s
         => => transferring dockerfile: 571B                                                                                                                            0.0s
         => [internal] load .dockerignore                                                                                                                               0.0s
         => => transferring context: 2B                                                                                                                                 0.0s
         => [linux/arm64 internal] load metadata for docker.io/library/debian:11-slim                                                                                   3.7s
         => [linux/arm/v7 internal] load metadata for docker.io/library/debian:11-slim                                                                                  3.6s
         => [linux/amd64 internal] load metadata for docker.io/library/debian:11-slim                                                                                   3.6s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [linux/arm64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                             4.4s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680 30.06MB / 30.06MB                                                                2.0s
         => => extracting sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680                                                                       2.4s
         => [linux/amd64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                             4.0s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d 31.37MB / 31.37MB                                                                1.8s
         => => extracting sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d                                                                       2.2s
         => [linux/arm/v7 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                            4.3s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206 26.57MB / 26.57MB                                                                2.3s
         => => extracting sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206                                                                       2.0s
         => [linux/amd64 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install-r  22.3s
         => [linux/arm/v7 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install  176.9s
         => [linux/arm64 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install-  173.6s
         => exporting to image                                                                                                                                         25.4s
         => => exporting layers                                                                                                                                         6.7s
         => => exporting manifest sha256:ae5a5dcfe0028d32cba8d4e251cd7401c142023689a215c327de8bdbe8a4cba4                                                               0.0s
         => => exporting config sha256:48f97d6d8de3859a66625982c411f0aab062722a3611f18366ecff38ac4eafb9                                                                 0.0s
         => => exporting manifest sha256:fc7ad1e5f48da4fcb677d189dbc0abd3e155baf8f50eb09089968d1458fdcfb9                                                               0.0s
         => => exporting config sha256:60ced8a7d9dc49abbbcd02e7062268fdd2f14d9faedcb078b2980642ae959c3b                                                                 0.0s
         => => exporting manifest sha256:8f96f20d75502d5672f1be2d9646cbc5d5de3fcffd007289a688185714515189                                                               0.0s
         => => exporting config sha256:0c6e42f87110443450dbc539c97d99d3bfdd6dd78fb18cfdb0a1e3310f4c8615                                                                 0.0s
         => => exporting manifest list sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa                                                          0.0s
         => => pushing layers                                                                                                                                          17.2s
         => => pushing manifest for docker.io/unixorn/debian-py3:latest@sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa                         1.4s
         => [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io                                                                                          0.0s
         => [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io                                                                                          0.0s
         docker pull unixorn/debian-py3
        Using default tag: latest
        latest: Pulling from unixorn/debian-py3
        e5ae68f74026: Already exists
        86834dffc327: Pull complete
        Digest: sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa
        Status: Downloaded newer image for unixorn/debian-py3:latest
        docker.io/unixorn/debian-py3:latest
        1.60s user 1.05s system 1% cpu 3:36.49s total

One minor issue - docker buildx has a separate cache that it builds the images in, so when you build, the images won't be loaded in your local docker/containerd environment. If you want to have the image in your local docker environment, you need to run buildx with --load instead of --push.

In this example, instead of running docker run unixorn/debian-py3:amd64, docker run unixorn/debian-py3:arm7 or docker run unixorn/debian-py3:arm64 based on what machine I'm on, now I can use the same image reference on all the machines -


        ❯ docker run unixorn/debian-py3 python3 --version
        Python 3.9.2
        ❯
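To confirm which architectures a pushed tag actually contains, one option is `docker buildx imagetools inspect`. The sketch below wraps it in a hypothetical helper, `image_platforms`, and assumes the command's output includes lines of the form `Platform: linux/amd64`.

```shell
#!/bin/sh
# List the platforms recorded in an image's manifest list.
# image_platforms is a hypothetical helper; it assumes the inspect output
# contains "Platform: <os>/<arch>" lines, one per architecture.
image_platforms() {
  docker buildx imagetools inspect "$1" | awk '$1 == "Platform:" { print $2 }'
}

# Usage:
#   image_platforms unixorn/debian-py3
```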
        

Takeaway

If you're running a mix of architectures in your lab environment, docker buildx will simplify things considerably.

No more maintaining multiple architecture tags, no more having to build on multiple machines, no more accidentally forgetting to update one of the tags so that things are mysteriously different on just some of our machines, no more weird issues because we forgot to update service start scripts and docker-compose.yml files.

Simpler is always better, and buildx will simplify the environment for you.

December 5, 2021

Day 6 - More to come tomorrow!

We don't have any special system content for you today. We will have more tomorrow!

December 4, 2021

Day 5 - Least Privilege using strace

By: Shaun Mouton (@sdmouton)
Edited by: Jennifer Davis (@sigje)

Security in software development has been a hot-button issue for years. Increasing awareness of the threat posed by supply chain breaches has only increased the pressure on teams to improve security in all aspects of software delivery and operation. A key premise is least privilege: granting the minimum privileges necessary to accomplish a task, in order to prevent folks from accessing or altering things they shouldn't have rights to. Here's my thinking: we should help users apply the principle of least privilege when designing tools. When we find that we have designed security tooling which does not enable least-privilege use, we can still address the problem using tracing tools which can be found in most Linux distribution package repositories. I would like to share my adventure of looking at an InSpec profile (using CINC Auditor) and a container I found on Docker Hub to demonstrate how to apply least privilege using strace for process access auditing.

At my prior job working at Chef, I fielded a request asking how to run an InSpec profile as a user other than root. InSpec allows you to write policies in code (called InSpec Profiles) to audit the state of a system. Most of the documentation and practice at the time had users inspecting the system as root or a root-equivalent user. At first glance, this makes a certain amount of sense: many tools in the "let's configure the entire system" and "let's audit the security of the entire system" spaces need access to whatever the user decides they want to check against. Users can write arbitrary profile code for InSpec (and the open source CINC Auditor), ship those profiles around, and scan their systems to determine whether or not they're in compliance.

I've experienced this pain of excessive privileges with utilities myself. I can't count the number of times we'd get a request to install some vendor tool nobody had ever heard of with root privileges. Nobody who asked could tell us what it'd be accessing, whether it would be able to make changes to the system, or how much network/cpu/disk it'd consume. The vendor and the security department or DBAs or whoever would file a request with the expectation that we should just trust their assertion that nothing would go wrong. So, being responsible system administrators, we'd say "no, absolutely not, tell us what it's going to be doing first" or "yes, we'll get that work scheduled" and then never schedule the work. This put us in the position of being gatekeepers rather than enablers of responsible behavior. While justified, it never sat right with me.

(Note: It is deeply strange that vendors often can't tell customers what their tools do when asked in good faith, as is the idea that there should be an assumption of trustworthiness in that lack of information.)

I've found some tools over the years which might be able to give a user output which can be used to help craft something like a set of required privileges to run an arbitrary program with non-root privileges. Not too long ago I discussed "securing the supply chain" and how to design an ingestion pipeline that lets folks run containers in a secure environment, where they could be somewhat assured that a container using code they didn't write wasn't going to try to access things that they weren't comfortable with. I thought about this old desire of limiting privileges when running an arbitrary command, and figured that I should do a little digging to see if something already existed. If not, maybe I could work towards a solution.

Now, I don't consider myself an expert developer but I have been writing or debugging code in one form or another since the '90s. I hope you consider this demo code with the expectation that someone wanting to do this in a production environment will re-implement what I've done far more elegantly. I hope that seeing my thinking and the work will help folks to understand a bit more about what's going on behind the scenes when you run arbitrary code, and to help you design better methods of securing your environment using that knowledge.

What I'll be showing here is the use of strace to build a picture of what is going on when you run code and how to approach crafting a baseline of expected system behavior using the information you can gather. I'll show two examples:

  • executing a relatively simple InSpec profile using the open source distribution's CINC Auditor
  • running a randomly selected container off Docker Hub (jjasghar/container_cobol)

Hopefully, seeing this work will help you solve a problem in your environment or avoid some compliance pain.

Parsing strace Output for a CINC Auditor (Chef InSpec) profile

There are other write-ups of strace functionality which go into broader and deeper detail on what's possible using it; I'll point to Julia Evans' work to get you started if you want to know more.

Strace is the venerable Linux debugger, and a good tool to use when coming up against a "what's going on when this program runs" problem. However, its output can be decidedly unfriendly. Take a look in the strace-output directory in this repo for the files matching the pattern linux-baseline.* to see the output of the following command:


        root@trace1:~# strace --follow-forks --output-separately --trace=%file -o /root/linux-baseline cinc-auditor exec linux-baseline

You can parse the output, however. If all you want to know is what files might need to be accessed (for an explanation of the command go here), you can do something similar to the following (though you probably won't want to randomly sort the output and show only 10 lines):


awk -F '"' '{print $2}' linux-baseline/linux-baseline.108579 | sort -uR | head
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/minitest-5.14.4/lib/nokogiri.so
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/train-winrm-0.2.12/lib/psych/visitors.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/i18n-1.8.10/lib/rubygems/resolver/index_set.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-cognitoidentityprovider-1.53.0/lib/inspec/resources/command.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/jwt-2.3.0/lib/rubygems/package/tar_writer.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-codecommit-1.46.0/lib/pp.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/extensions/x86_64-linux/2.7.0/ffi-1.15.4/http/2.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/extensions/x86_64-linux/2.7.0/bcrypt_pbkdf-1.1.0/rubygems/package.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-databasemigrationservice-1.53.0/lib/inspec/resources/be_directory.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-ram-1.26.0/lib/rubygems/resolver/current_set.rb
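A slightly more defensive variant of that extraction is sketched below. `accessed_paths` is a hypothetical helper name. It uses the same awk field split on double quotes, but skips lines that contain no quoted argument (such as strace's exit notices) so they don't show up as empty entries, and sorts deterministically instead of randomly.

```shell
#!/bin/sh
# Print the unique quoted paths from strace output read on stdin.
# Lines without a quoted argument (for example "+++ exited with 0 +++")
# are skipped instead of producing empty entries.
# accessed_paths is a hypothetical helper name.
accessed_paths() {
  awk -F '"' 'NF >= 3 { print $2 }' | sort -u
}

# Usage:
#   cat linux-baseline/linux-baseline.* | accessed_paths
```

Like the original one-liner, this only takes the first quoted string on each line, so syscalls with two path arguments (rename, for instance) would need extra handling.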

You can start to build a picture of what all the user would need to be able to access in order to run a profile based on that output, but in order to go further I'll use a much simpler check:


        cinc-auditor exec linux-vsp/
    

Full results of that command are located in the strace-output directory with files matching the pattern linux-vsp.*, but to summarize what cinc-auditor/inspec is doing:

  • linux-vsp.109613 - this file shows all the omnibussed ruby files the cinc-auditor command tries to access in order to run its parent process
  • linux-vsp.109614 - why auditor is trying to run cmd.exe on a Linux system I don't yet know, you'll get used to seeing $PATH traversal very quickly
  • linux-vsp.109615 - I see a Get-WmiObject Win32_OperatingSys in there so we're checking to see if this is Windows
  • linux-vsp.109616 - more looking on the $PATH for Get-WmiObject so more Windows checking
  • linux-vsp.109617 - I am guessing that checking the $PATH for the Select command is more of the same
  • linux-vsp.109618 - Looking for and not finding ConvertTo-Json, this is a PowerShell cmdlet, right?
  • linux-vsp.109619 - Now we're getting somewhere on Linux, this is running uname -s (with $PATH traversal info in there, see how used to this you are by now?)
  • linux-vsp.109620 - Now running uname -m
  • linux-vsp.109621 - Now running test -f /etc/debian_version
  • linux-vsp.109622 - Doing something with /etc/lsb-release but I didn't use the -v or -s strsize flags with strace so the command is truncated.
  • linux-vsp.109623 - Now we're just doing cat /etc/lsb-release using locale settings
  • linux-vsp.109624 - Checking for the inetd package
  • linux-vsp.109625 - Checking for the auditd package, its config directory /etc/dpkg/dpkg.cfg.d, and the config files /etc/dpkg/dpkg.cfg, and /root/.dpkg.cfg

Moving from that to getting an idea of what all a non-root user would need to be able to access, you can do something like this in the strace-output directory (explainshell here):


    find . -name "linux-vsp.10*" -exec awk -F '"' '{print $2}' {} \; | sort -u > linux-vsp_files-accessed.txt

You can see the output of this command here, but you'll need to interpret some of the output from the perspective of the program being executed. For example, I see "Gemfile" in there without a preceding path. I expect that's Auditor looking in the ./linux-vsp directory where the profile being called exists, and the other entries without a preceding path are probably also relative to the command being executed.

Parsing strace output of a container execution

I said Docker earlier, but I've got podman installed on this machine so that's what the output will reflect. You can find the output of the following command in the strace-output directory in files matching the pattern container_cobol.*, and wow. Turns out running a full CentOS container produces a lot of output. When scanning through the files, you see what looks like podman doing podman things, and what looks like the COBOL Hello World application executing in the container. As I go through these files I will call out anything particularly interesting I see along the way:


        root@trace1:~# strace -ff --trace=%file -o /root/container_cobol podman run -it container_cobol
        Hello world!
        root@trace1:~# ls -1 container_cobol.* | wc -l
        146

I'm not going to go through 146 files individually as I did previously, but this is an interesting data point:


        root@trace1:strace-output# find . -name "container_cobol.1*" -exec awk -F '"' '{print $2}' {} \; | sort -u > container_cobol_files-accessed.txt

        root@trace1:strace-output# wc -l container_cobol_files-accessed.txt
        637 container_cobol_files-accessed.txt
        
        root@trace1:strace-output# wc -l linux-vsp_files-accessed.txt
        104754 linux-vsp_files-accessed.txt

So the full CentOS container running a little COBOL Hello World application needs access to six hundred thirty seven files, and CINC Auditor running a 22-line profile directly on the OS needs to access over one hundred four thousand files. That doesn't directly mean that one is more or less of a security risk than the other, particularly given that a Hello World application can't report on the compliance state of your machines, containers, or applications for example, but it is fun to think about. One of the neatest things about debugging using tools which expose the underlying operations of a container exec is that you can reason about what containerization is actually doing. In this case, since we're showing what files are accessed during the container exec, sorting the list, and removing duplicate entries it's a cursory view but still useful.

Let's say we're consuming a vendor application as a container. We can trace an execution (or sample a running instance of the container for a day, strace can attach to running processes), load the list of files into the pipeline we use to promote new versions of that vendor app to prod, and when we see a change in the files that the application is opening we can make a determination whether the behavior of the new version is appropriate for our production environment with all the PII and user financial data. Now, instead of trusting the vendor at their word that they've done their due diligence, we're actually observing the behavior of the application and using our own knowledge of our environment to say whether that application is suitable for use.
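A minimal sketch of that comparison step follows, assuming the baseline and the candidate lists are plain text files with one path per line (like the *_files-accessed.txt files built earlier). `new_paths` and the file names in the usage note are hypothetical.

```shell
#!/bin/sh
# Print paths that appear in the current trace but not in the baseline.
# new_paths is a hypothetical helper; both arguments are files with one
# path per line.
new_paths() {
  baseline="$1"
  current="$2"
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file
  grep -Fxv -f "$baseline" "$current"
}

# Usage: fail a promotion step when the new version touches unseen files.
#   if new_paths baseline.txt candidate.txt; then
#     echo "new file accesses found, review before promoting" >&2
#     exit 1
#   fi
```

grep exits with status 0 only when it printed at least one line, so the function's exit status doubles as a "something changed" signal.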

But wait! Strace isn't just for files!

I used strace's file syscall filter as an example because it fit the example use case, but strace can snoop on other syscalls too! Do you need to know what IP addresses your process knows about? This example is using a container exec again, but you could snoop on an existing pid if you want, then run a similar search against the output (IPs have been modified in this output):


        strace -ff --trace=%network -o /root/yourcontainer-network -s 10241 podman run -it yourcontainer
        for file in $(ls -1 yourcontainer-network.*); do grep -oP 'inet_addr\("\K[^"]+' $file ; done
        127.0.0.1
        127.0.0.1
        693.18.119.36
        693.18.119.36
        693.18.131.255
        75.5117.0.5
        75.5117.0.5
        75.5117.255.255
        161.888.0.2
        161.888.0.2
        161.888.15.255
        832.71.40.1
        832.71.40.1
        832.71.255.255
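Since the same addresses repeat, a summarized view can be easier to scan. The sketch below counts occurrences per address. `address_counts` is a hypothetical helper, and it uses sed rather than grep -P so that it also works where grep lacks PCRE support. It assumes at most one inet_addr("...") per strace line.

```shell
#!/bin/sh
# Count how often each IPv4 address appears in strace output on stdin,
# most frequent first. address_counts is a hypothetical helper.
address_counts() {
  sed -n 's/.*inet_addr("\([^"]*\)").*/\1/p' |
    sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Usage:
#   cat yourcontainer-network.* | address_counts
```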

Have I answered my original question?

With all that knowledge, can we address the original question: Can one use the list of files output by tracing a cinc-auditor run to provide a restricted set of permissions which will allow one to audit the system using CINC Auditor and the profile with a standard user?

Yes, with one caveat: My Very Simple Profile was too simple, and didn't require any additional privileges. I tried with a few other public profiles, but every one I tried ran successfully using a standard user created with useradd -m cincauditor. I looked through bug reports related to running profiles as a non-root user but couldn't replicate their issues - which is good, I suppose. It could be that the issue my customer was facing at the time was a bug in the program's behavior when run as a non-root user which has been fixed, or I just don't remember the use case they presented well enough to replicate it. So here's a manufactured case:



root@trace1:~# mkdir /tmp/foo
root@trace1:~# touch /tmp/foo/sixhundred
root@trace1:~# touch /tmp/foo/sevenhundred
root@trace1:~# chmod 700 /tmp/foo
root@trace1:~# chmod 600 /tmp/foo/sixhundred
root@trace1:~# chmod 700 /tmp/foo/sevenhundred
cincauditor@trace1:~$ cat << EOF > linux-vsp/controls/filetest.rb
> control "filetester" do
>   impact 1.0
>   title "Testing files"
>   desc "Ensure they're owned by root"
>   describe file('/tmp/foo/sixhundred') do
>     its('owner') { should eq 'root' }
>   end
>   describe file('/tmp/foo/sevenhundred') do
>     its('group') { should eq 'root'}
>   end
> end
> EOF
cincauditor@trace1:~$ cinc-auditor exec linux-vsp/

Profile: Very Simple Profile (linux-vsp)
Version: 0.1.0
Target:  local://

  ×  filetester: Testing files (2 failed)
     ×  File /tmp/foo/sixhundred owner is expected to eq "root"

     expected: "root"
          got: nil

     (compared using ==)

     ×  File /tmp/foo/sevenhundred group is expected to eq "root"

     expected: "root"
          got: nil

     (compared using ==)

  ✔  inetd: Do not install inetd
     ✔  System Package inetd is expected not to be installed
  ↺  auditd: Check auditd configuration (1 skipped)
     ✔  System Package auditd is expected to be installed
     ↺  Can't find file: /etc/audit/auditd.conf


Profile Summary: 1 successful control, 1 control failure, 1 control skipped
Test Summary: 2 successful, 2 failures, 1 skipped

cincauditor@trace1:~$ find . -name "linux-vsp.1*" -exec awk -F '"' '{print $2}' {} \; | sort -u > linux-vsp_files-accessed.txt

root@trace1:~# diff --suppress-common-lines -y linux-vsp_files-accessed.txt /home/cincauditor/linux-vsp_files-accessed.txt | grep -v /opt/cinc-auditor
							      >	/home
							      >	/home/cincauditor
							      >	/home/cincauditor/.dpkg.cfg
							      >	/home/cincauditor/.gem/ruby/2.7.0
							      >	/home/cincauditor/.gem/ruby/2.7.0/specifications
							      >	/home/cincauditor/.inspec
							      >	/home/cincauditor/.inspec/cache
							      >	/home/cincauditor/.inspec/config.json
							      >	/home/cincauditor/.inspec/gems/2.7.0/specifications
							      >	/home/cincauditor/.inspec/plugins
							      >	/home/cincauditor/.inspec/plugins.json
							      >	/home/cincauditor/linux-vsp
/root							      <
/root/.dpkg.cfg						      <
/root/.gem/ruby/2.7.0					      <
/root/.gem/ruby/2.7.0/specifications			      <
/root/.inspec						      <
/root/.inspec/cache					      <
/root/.inspec/config.json				      <
/root/.inspec/gems/2.7.0/specifications			      <
/root/.inspec/plugins					      <
/root/.inspec/plugins.json				      <
/root/linux-vsp						      <
							      >	/tmp/foo/sevenhundred
							      >	/tmp/foo/sixhundred
							      >	linux-vsp/controls/filetest.rb
root@trace1:~#

The end of that previous block's output shows compiling the list of files accessed when the cincauditor user runs the profile in the same way we did for the root user, then a diff of the two files. Looking at that output, it's fairly obvious that the profile is trying to access the newly created files which are in a directory we made inaccessible to the cincauditor user (with chmod 700 /tmp/foo), and when we give cinc-auditor access to that directory with chmod 750 /tmp/foo the profile is able to check those files. A manufactured replication of the use case, but it does show that it's possible to use the output to accomplish the task. Whether chmod is the right way to give a least-privilege user access to the files is a question best left up to the implementer, their organization, and their auditors - the purpose of this exercise is to demonstrate the potential value of the strace debugger.
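Another way to spot the blocked files, without diffing against a root run, is to look for syscalls that failed with a permission error. This sketch assumes strace's usual `= -1 EACCES (Permission denied)` result suffix. `denied_paths` is a hypothetical helper name.

```shell
#!/bin/sh
# Print the unique paths whose syscalls failed with EACCES in strace
# output read on stdin. denied_paths is a hypothetical helper.
denied_paths() {
  awk -F '"' '/= -1 EACCES/ && NF >= 3 { print $2 }' | sort -u
}

# Usage:
#   cat linux-vsp.* | denied_paths
```

The result is a short list of paths the restricted user was refused, which is usually a better starting point for a permissions discussion than the full list of everything the process touched.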

It is important to note that file permissions aren't the only reason why a program wouldn't run. If you're not able to use the information strace gives you to get an application to run as a user with restricted privileges, at least you can get more information about what is happening under the hood and can communicate about why a program is not suitable for your environment. If a program needs to run anyway, you can profile the application's behavior (perhaps a tool built on eBPF would be more suitable than strace for ongoing monitoring in a production environment) and notify when its behavior changes.

Closing thoughts

Over the past few years I've had a lot of thoughts about how to get things done in modern environments, and I've come to the conclusion that it's okay to write shell scripts to get something like this done. Since in this case I'm wrapping arbitrary tasks so I can extract information about what happens when they're running, and I won't be able to predict where I'll need it, I figured it was a good idea to use bash and awk, as those will be available via package manager wherever I want to do this sort of thing.

You might not agree, and wish to see something like this implemented in something like Ruby, Python, or Rust (I have to admit that I thought about trying to do this using Rust so as to get better at it), and you're of course welcome to do so. Again, I chose shell since it's something many folks can easily run, look at, comprehend, modify, and re-implement in the way that suits them.

Lastly, thanks very much to Julia Evans. A note about the power of storytelling in one of her posts made me think "I should write a story about solving this problem so I can be sure I learned something from it", and I hope I've done a decent job of emulating her empathy towards folks learning these concepts for the first time. 

Day 4 - GWLB: Panacea for Cloud DMZ on AWS

By: Atif Siddiqui
Edited by: Jennifer Davis (@sigje)

Organizations aspire to apply the same security controls to ingress traffic in Cloud as they have on-premises, ideally taking advantage of Cloud value propositions to provide resiliency and scalability to traffic inspection appliances.

Within the AWS ecosystem, until last year, there wasn’t an elegant solution. Consequently, the most notable challenge it created, especially for regulated organizations, was designing the DMZ (demilitarized zone) pattern in AWS. It took two announcements to close this gap: VPC Ingress routing and Gateway Load Balancer (GWLB).

Two years ago, AWS announced VPC Ingress routing. This provided the capability to direct ingress traffic to an Elastic Network Interface (ENI). Last year, Amazon followed it up with a complementary announcement of GWLB.

GWLB is AWS's fourth load balancer offering, following the Classic, Application and Network Load Balancers. Unlike the first three types, GWLB solves a niche problem and is specifically targeted towards partner appliances.

GWLB has a novel design with two distinct sides. The front end is connected to VPC endpoint service and corresponding VPC endpoints. This front end acts as a Layer 3 gateway. The backend is connected to third party appliances. This backend acts as a Layer 4 Load Balancer. An oversimplified diagram of the traffic flow is shown:

Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB → 3rd party appliance

So how do you provision a GWLB?

There are 4 key resources that need to be provisioned in order:

  • Target Group
  • GWLB using the above as the target group.
  • VPC endpoint service using above as the load balancer type.
  • VPC endpoints bound to the above endpoint service.

Target Group

As part of this announcement, AWS implemented the GENEVE protocol and added it as an option in the UX for Target Group. If you are unfamiliar with this protocol, it is explained below, after the GWLB provisioning requirements.

To configure this as infrastructure as code (IaC), you could use a Terraform snippet as follows:


    resource "aws_lb_target_group" "blog_gwlb_tgt_grp" {
      # Target group names may only contain alphanumeric characters and hyphens
      name     = "blog-gwlb-tgt-grp"
      port     = 6081 # GENEVE's UDP port
      protocol = "GENEVE"
      vpc_id   = aws_vpc.fw.id
    }
    

GWLB

As with Application Load Balancing, GWLB requires a target group to forward traffic; however, the target group must be created with the GENEVE protocol.

Health checks for TCP, HTTP and HTTPS are supported; however, it should be noted that health check packets are not GENEVE encapsulated.

An example of a terraform code snippet is as follows.


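    resource "aws_lb" "blog_gwlb" {
      # Load balancer names may only contain alphanumeric characters and hyphens
      name               = "blog-gwlb"
      load_balancer_type = "gateway"
      # Assumes the private subnets are declared as aws_subnet.blog_gwlb_subnet_pvt
      subnets            = aws_subnet.blog_gwlb_subnet_pvt[*].id

      tags = {
        Name        = "blog-gwlb"
        Environment = "sandbox"
      }
    }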
      

Endpoint Service

Prior to GWLB announcement, if an endpoint service was being created, the only option offered was Network Load Balancer (NLB). With GWLB’s availability, gateway is now the second option for load balancer type when creating an endpoint service. It should be noted that endpoint service whether it uses NLB or GWLB relies on the underlying PrivateLink technology.

An example of terraform code snippet is as follows.


    resource "aws_vpc_endpoint_service" "blog-vpce-srvc" {
      acceptance_required        = false
      gateway_load_balancer_arns = [aws_lb.blog_gwlb.arn]

      tags = {
        Name        = "blog-gwlb"
        Environment = "sandbox"
      }
    }
      

VPC endpoint

The last key piece of the set is provisioning the VPC endpoints, which bind to the endpoint service created in the prior step.


    resource "aws_vpc_endpoint" "blog_gwlbe" {
      count             = length(var.az)
      service_name      = aws_vpc_endpoint_service.blog-vpce-srvc.service_name
      subnet_ids        = [var.blog-gwlb-subnets[count.index]]
      # Defaults to "Gateway" if omitted; GWLB endpoints need this type
      vpc_endpoint_type = "GatewayLoadBalancer"
      vpc_id            = aws_vpc.fw.id

      tags = {
        Name        = "blog-gwlb"
        Environment = "sandbox"
      }
    }
    

GENEVE

This is an encapsulation protocol standardized by the Internet Engineering Task Force (IETF). GENEVE stands for Generic Network Virtualization Encapsulation and uses UDP as its transport. This encapsulation is what achieves the transparent routing of packets to third-party appliances from vendors such as F5 (BIG-IP), Palo Alto Networks, Aviatrix etc.
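To make the encapsulation a little more concrete, here is a minimal Ruby sketch of the fixed 8-byte GENEVE base header as laid out in RFC 8926. The helper names, the example VNI, and the choice of EtherType 0x0800 (IPv4) for the inner packet are assumptions made for illustration. The sketch ignores the variable-length options that can follow this header, which a real appliance behind a GWLB would also need to parse.

```ruby
# Sketch of the GENEVE base header (RFC 8926); helper names are hypothetical.
GENEVE_UDP_PORT = 6081 # IANA-assigned port, matching the target group above

# Builds the 8-byte base header:
#   byte 0    : version (2 bits, always 0) + option length in 4-byte words (6 bits)
#   byte 1    : O (control) flag, C (critical options) flag, 6 reserved bits
#   bytes 2-3 : protocol type (EtherType of the inner packet)
#   bytes 4-6 : Virtual Network Identifier (24 bits)
#   byte 7    : reserved
def geneve_base_header(vni:, protocol_type: 0x0800, opt_len_words: 0,
                       oam: false, critical: false)
  raise ArgumentError, "VNI must fit in 24 bits" unless (0...(1 << 24)).cover?(vni)
  raise ArgumentError, "option length must fit in 6 bits" unless (0...64).cover?(opt_len_words)

  byte0 = opt_len_words                      # version 0 occupies the top two bits
  byte1 = (oam ? 0x80 : 0) | (critical ? 0x40 : 0)
  [byte0, byte1, protocol_type, vni << 8].pack("CCnN")
end

# Reverses geneve_base_header for the first 8 bytes of a GENEVE payload.
def parse_geneve_base_header(bytes)
  byte0, byte1, protocol_type, vni_and_reserved = bytes.unpack("CCnN")
  {
    version: byte0 >> 6,
    opt_len_words: byte0 & 0x3f,
    oam: (byte1 & 0x80) != 0,
    critical: (byte1 & 0x40) != 0,
    protocol_type: protocol_type,
    vni: vni_and_reserved >> 8
  }
end

header = geneve_base_header(vni: 0xABCDEF)
puts header.unpack1("H*") # prints 00000800abcdef00
```

The original packet travels unmodified after this header (and any options), which is why the appliance can inspect traffic without appearing as a hop to either end.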

Special route table

The glue that binds VPC Ingress routing and GWLB together is a special use of a route table.

Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB → 3rd party appliance, e.g. a marketplace subscription.

This table does not have any explicit subnet association. It, however, has Internet Gateway (IGW) specified on the Edge associations.

Within routes, quad 0 points to Network interfaces (ENIs) of the Gateway Load Balancer endpoints (GWLBe).

It is this routing rule that forces ingress traffic to be routed to the GWLBe, which in turn sends it to the GWLB (via the endpoint service), from where it is routed to the appliances.

Limitations

Target group using the GENEVE protocol does not support tags. 

Cloud DMZ: Centralized Inspection Architecture

Conclusion

The pairing of VPC ingress routing and GWLB allows enterprises to achieve a much sought-after security posture in which both ingress and egress traffic can undergo firewall inspection. This set of capabilities is especially notable when a Cloud DMZ architecture is being created.

Afterthought: AWS Network Firewall

It is always fascinating to me how AWS keeps vendors on their toes. There seems to be an aura of ineluctability where vendors strive to stay a step ahead of AWS’s offering. While customers can use marketplace subscriptions (e.g. firewall) with GWLB, there is a competing service by Amazon named AWS Network Firewall. This is essentially Firewall as a Service where VPC ingress routing primitive will be used to point to AWS Network Firewall which uses GWLB behind the scenes. It is easy to predict that AWS will push for new products that will compete in this space that will use GWLB under the hood.

Over time, choices will grow, whether through AWS products or through more vendors certifying their products with GWLB. This abundance will only benefit customers in their pursuit of a secure network architecture.

December 3, 2021

Day 3 - Keeping Config Management Simple with Itamae

By: Paul Welch (@pwelch)
Edited by: Jennifer Davis (@sigje)

Our DevOps toolbox is filled with many tools, with configuration management being an often neglected and overloaded workhorse. While many resources today are deployed with containers, you still use configuration management tools to manage the underlying servers. Whether you use an image-based approach and configure your systems with Packer or prefer configuring your systems manually after creation by something like Terraform, chances are you still want to continuously manage your hosts with infrastructure as code. To add to the list of potential tools to solve this, I’d like to introduce you to Itamae. Itamae is a simple tool that helps you manage your hosts with a straightforward DSL while also giving you access to the Ruby ecosystem. Inspired by Chef, Itamae has a similar DSL but does not require a server, complex attributes, or data bags.

Managing Resources

Itamae is designed to be lightweight; it comes with an essential set of resource types to bring your hosts to the expected state. These resource types focus on the core parts of the host we want to manage, like packages, templates, and services. The bundled `execute` resource can be used as an escape hatch to manage things that do not have a built-in resource type. If you often find yourself wanting to manage something that does not have a built-in resource, you can build your own resources if you are comfortable with Ruby.

All Itamae resource types have common attributes that include: actions, guards, and triggers for other resources.

Actions

Actions are the activities that you want to have occur with the resource. Each bundled resource has predefined actions that can be taken. A `service` resource, for example, can have both an `:enable` and `:start` action which tells Itamae to enable the service to start on system boot and also start the service if it is not currently running.


    # enable and start the fail2ban service
    service "fail2ban" do
      action [:enable, :start]
    end
    

Guards

Guards ensure a resource is idempotent by only invoking the interpreted code if the conditions pass. The common attributes that are available to use within your infracode are `only_if` and `not_if`.


    # create an empty file only if it does not exist
    execute "create an empty file" do
      command "touch /tmp/file.txt"
      not_if "test -e /tmp/file.txt"
    end
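
Conceptually, a guard is just a shell command whose exit status decides whether the resource runs. The plain-Ruby sketch below mimics that behavior outside of Itamae. `run_guarded` is a hypothetical helper written for this illustration, not part of Itamae's API, and Itamae's real implementation may differ in its details.

```ruby
require "shellwords"
require "tmpdir"

# Hypothetical helper illustrating guard semantics; not Itamae's implementation.
# Runs `command` unless a guard says otherwise:
#   not_if  - skip when the guard command exits 0
#   only_if - skip when the guard command exits non-zero
def run_guarded(command, not_if: nil, only_if: nil)
  return :skipped if not_if && system(not_if)
  return :skipped if only_if && !system(only_if)

  system(command) ? :ran : :failed
end

Dir.mktmpdir do |dir|
  path = Shellwords.escape(File.join(dir, "file.txt"))

  # The first run creates the file; the second is skipped by the guard.
  puts run_guarded("touch #{path}", not_if: "test -e #{path}") # prints ran
  puts run_guarded("touch #{path}", not_if: "test -e #{path}") # prints skipped
end
```

Running the same resource twice and seeing the second run skipped is the idempotency the guard buys you.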
    

Triggers

Triggers allow you to define event driven notifications to other resources.

The `notifies` and `subscribes` attributes allow you to trigger other resources only if there is a change, such as restarting a service when a new template is rendered. These are analogous to Chef’s `notifies` and `subscribes`, Puppet’s `notify` and `subscribe`, or Ansible’s `handlers`.


    # define nginx service
    service 'nginx' do
      action [:enable, :start]
    end
    
    # render template and restart nginx if there are changes
    template "/etc/nginx/sites-available/main" do
      source "templates/etc/nginx/sites-available/main.erb"
      mode   "0644"
      action :create
      notifies :restart, "service[nginx]", :delayed
    end

Itamae code is normally organized in “cookbooks” much like Chef. You can include recipes to separate your code. Itamae also supports definitions to help DRY your code for resources.

Example

Now that we have an initial overview of the Itamae basics, let’s build a basic Nginx configuration for a host. This example will install Nginx from a PPA on Ubuntu and render a basic configuration that will return the requestor’s IP address. The cookbook resources will be organized as follows:


    ├── default.rb
    └── templates
        └── etc
            └── nginx
                └── sites-available
                    └── main.erb

We will keep it simple with a single `default.rb` recipe and single `main.erb` Nginx site configuration template. The recipe and site configuration template content can be found below.


    # default.rb
    # Add Nginx PPA
    execute "add-apt-repository-ppa-nginx-stable" do
      command "add-apt-repository ppa:nginx/stable --yes"
      not_if "test -e /usr/sbin/nginx"
    end
    
    # Update apt cache
    execute "update-apt-cache" do
      command "apt-get update"
    end
    
    # install nginx stable
    package "nginx" do
      action :install
    end
    
    # enable nginx service
    service 'nginx' do
      action [:enable, :start]
    end
    
    # configure nginx
    template "/etc/nginx/sites-available/main" do
      source "templates/etc/nginx/sites-available/main.erb"
      mode   "0644"
      action :create
      notifies :restart, "service[nginx]", :delayed
      variables()
    end
    
    # enable example site
    link '/etc/nginx/sites-enabled/main'  do
      to "/etc/nginx/sites-available/main"
      notifies :restart, "service[nginx]", :delayed
      not_if "test -e /etc/nginx/sites-enabled/main"
    end
    
    # disable default site
    execute "disable-nginx-default-site" do
      command "rm /etc/nginx/sites-enabled/default"
      notifies :restart, "service[nginx]", :delayed
      only_if "test -e /etc/nginx/sites-enabled/default"
    end

    # main.erb
    server {
      listen 80 default_server;
      listen [::]:80 default_server;

      server_name _;

      location / {
        # Return the requestor's IP as plain text
        default_type text/plain;
        return 200 $remote_addr;
      }
    }
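
The recipe above passes an empty `variables()` to the template, so the site configuration renders as static text. As a rough illustration of what template variables are for, the sketch below renders a small Nginx snippet with Ruby's standard `ERB` library. It uses `result_with_hash` and a made-up `listen_port` value purely for illustration. Itamae's own template resource does the rendering for you and may expose variables differently, so check its documentation before copying this pattern.

```ruby
require "erb"

# Hypothetical template text; `listen_port` is an illustrative variable name.
TEMPLATE = <<~NGINX
  server {
    listen <%= listen_port %> default_server;
    server_name _;
  }
NGINX

# Renders the template with the given port substituted in.
def render_site(listen_port:)
  ERB.new(TEMPLATE).result_with_hash(listen_port: listen_port)
end

puts render_site(listen_port: 8080)
```

Keeping values like the port in variables lets one template serve several hosts without duplicating the file.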

Deploying

Note: to deploy the above example, it is assumed that you have a temporary VPS instance available.

There are 3 different ways you can deploy your configurations with Itamae:

  • `itamae ssh` via the itamae gem.
  • `itamae local` also via the itamae gem.
  • `mitamae` locally on the host.

Mitamae is an alternative implementation of Itamae built with mruby. This post is focusing on Itamae in general but the Mitamae implementation is a notable option if you want to deploy your configuration using prebuilt binaries instead of using SSH or requiring Ruby.

With your configuration ready it’s just a single command to deploy over SSH. Itamae uses the SpecInfra library which is the same library that ServerSpec uses to test hosts. You can also access a host’s inventory in Itamae much like you can with Chef & Ohai. To deploy your configuration, run:


    itamae ssh --key=/path/to/ssh_key --host=<IP> --user=<USER> default.rb \
      --log-level=DEBUG

Itamae will manage those packages and write out the template we specified, bringing the host to our desired state. Once the command is complete, you should be able to curl the host’s IP address and receive a response from Nginx.
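
If you would rather script that check than run curl by hand, a minimal Ruby sketch might look like the following. The helper names are hypothetical, and the example assumes the host serves plain HTTP on port 80, as the site configuration in this post does.

```ruby
require "ipaddr"
require "net/http"
require "uri"

# True when the text (ignoring surrounding whitespace) parses as an IP address.
def ip_address?(text)
  IPAddr.new(text.strip)
  true
rescue IPAddr::Error
  false
end

# Fetches "/" from the host and checks that the body is an IP address,
# which is what the example Nginx site is expected to return.
def nginx_returns_ip?(host)
  response = Net::HTTP.get_response(URI("http://#{host}/"))
  response.is_a?(Net::HTTPSuccess) && ip_address?(response.body)
end

# Example (needs a reachable host; replace <IP> with your instance's address):
#   nginx_returns_ip?("<IP>")
```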

Wrapping Up

Thank you for joining me in learning about this lightweight configuration management tool. Itamae gives you a set of bundled resource types to quickly configure your infrastructure in a repeatable and automated manner with three ways to deploy. Check out the Itamae Wiki for more information and best practices!

December 2, 2021

Day 2 - Reliability as a Product Feature

By: Martin Smith (@martinb3)
Edited by: Jennifer Davis (@sigje)

Abstract

SRE was born out of thinking about reliability as a product feature. However, the industry's focus in the last few years on things like SLOs, Error Budgets, Production Engineering teams, and other practices that constitute "doing SRE" sometimes means teams don’t take a product-centric approach to reliability, and they lose some of its advantages as a result. This post covers some project maturity levels, some suggestions for thinking about reliability as an SRE engaged in those kinds of projects, as well as what kinds of collaboration might be most successful in driving reliability-as-product-feature in each phase.

A brief history

Site Reliability Engineering, or SRE for short, was born in 2003 out of a need to improve service reliability at Google. Often described as, “an implementation of DevOps,” the practice of SRE aims to treat operations as a software problem that can be addressed through software engineering techniques.

And according to a survey by the DevOps Institute, SRE has truly taken off. This approach has been widely adopted, with 22% of organizations saying they have an SRE team in 2021. This shift can also be seen with the rise of conferences like USENIX’s SREcon which began in 2014, or the release of the popular, “Google SRE book,” a few years later in 2016.

Whether or not your organization has an SRE team that plans work using SLOs and Error Budgets, regularly reduces toil through automation, or has adopted one of the many SRE rules of thumb, the basic premise of what impact SRE can have sometimes gets lost -- that operations is a software problem. Or, shifting the focus back to the customer perspective, that reliability is a product feature that we build.

Having held DevOps Engineer and Site Reliability Engineer roles in the past, and having been a technical lead for SRE teams, I’ve had many opportunities to define the role, activities, and most importantly, the impact of an SRE team. In each case, I’ve found that focusing back on our customers’ experience of reliability has been the most useful framing when speaking to company leaders about an SRE team’s “why,” instead of reciting a long, confusing list of things SREs might do in a quarter. I’ve also found that it’s an easy litmus test for myself to ensure I’m working on the right things at the right time. If I can’t explain how my work affects customer reliability, keeping in mind that reliability for operators usually leads to reliability for customers, it might be a sign that I need to work on something else.

Shifting focus back to product reliability

Shifting the focus from operations and software engineering to talking about reliability as a product feature has some major benefits. First, it helps our organizations better understand what reliability might mean for them and their product(s) -- whether that’s resilience (tolerant of failure), scalability (can function with large volumes of work), observability (understanding internal state from outputs), or security (trust of the system). These are all product capabilities that often aren’t well understood, but fundamentally all matter to customers.

Reliability benefits from product management support (communication with stakeholders, building roadmaps, helping with prioritization and decisions, etc). For example, do you know who your internal stakeholders are for the scalability of your product? What’s on the roadmap for observability over the next 6 months? 2 years? And importantly, what metrics will you collect to be sure you’ve accomplished those goals and delivered on that roadmap? How does it align with other features’ roadmaps? As a friend and former colleague of mine says, “reliability is a product feature whether you devote engineering time to it or not.” If you don’t explicitly plan for that, your customers will implicitly make their own assumptions about your reliability.

Reliability may start to sound like any other product feature, with both internal and external stakeholders, and that’s by design. Making reliability an explicit part of your organizational planning also has many benefits. Thoughtworks’ Technology Radar (Volume 25) from October of this year recommends adoption of this kind of thinking -- that even internal teams should think of themselves as product teams. They also recommend using concepts from the popular Team Topologies book to figure out how to organize these internal teams. In reviewing examples of team structures from the book, many organizations have adopted Simon Wardley’s Pioneer-Settler-Town Planner (or “PST”) framework, too.

Let’s take a look at how one might apply these two ideas (reliability as a product feature, having a specific team profile) to improve the effectiveness of an SRE team.

  1. First, there’s no one-size-fits-all approach to improving reliability; different stages of a project will benefit from different kinds of SRE involvement. In this post, I’ll divide products/services into three levels of maturity: beginning, growing, and established.
  2. Then, I’ll describe what kinds of SRE work could be most effective at that maturity level, using the PST framework.

Here’s a graphic that explains the PST framework’s three kinds of roles/activities in more detail.

Team Profiles, from blog post Pioneers, Settlers and Town Planners by Simon Wardley

Beginning phase (with Pioneer SREs)

In new projects, there’s often uncertainty and unanswered questions. Small changes in direction could have large future benefits, but experimental work may be completely discarded, too. SREs can drive reliability at this stage by helping teams build prototypes, fail faster, and make agile decisions, all with reliability as a top of mind concern.

Have you ever had a project get close to production/release without thinking about reliability or operational burdens? “Pioneer SREs” can help. They should be part of the team that’s working to deliver a new product development, evaluate vendors, build out proofs of concept, or make major architectural changes. At this stage of a project, any work to “cover” reliability gaps should be identified or entire directions could be changed due to reliability concerns raised by the team.

Embedding in a team building the new product or feature is a great way for SREs to drive reliability early on in these kinds of projects. When teams only consult briefly on reliability or operational concerns, often the final output doesn’t adequately reflect customer or engineering expectations of reliability of the product or operability of the internals.

The success of Pioneer SREs can be measured by looking at how quickly new products or features show up on the roadmap, how quickly vendor implementations happen, or how quickly a project moves from, “exploration,” to, “concrete proposal.”

The largest risk in this phase is having your SRE team end up owners of the system’s reliability, since they helped design it. Hiding the overall reliability of your system from the other developers, behind an SRE team, will typically turn into a situation where the SRE team ends up being treated as an operational team for any product/service problems. Well-scoped embedding engagements can help avoid this problem by emphasizing that embedded SREs are a training resource for the rest of the team to learn, not coverage for the team once the embedding is over.

Growing phase (with Settler SREs)

In this phase, projects are often working to build production-quality infrastructure, launch to customers, or scale to the required audience. SREs can help actually build mature and scalable components from the initial prototypes. They could also level up the engineering organization on how to prepare for any new operational burdens by emphasizing best practices like automating away toil or choosing good SLOs.

Continuing to embed with teams is a great way for SREs to have a hand in the reliability of a nearly-launched product or feature, especially if SREs influence the team to build observability, scalability, and security into the product. Consulting with teams on production readiness, especially for brand new teams or brand new services, is another way that SREs can ensure that everything reaching production will meet the original reliability requirements of the product, as well as operational best practices (e.g. automation instead of manual database migrations).

At this phase, having SREs build and maintain a definition of Production Readiness is especially important as a product or organization scales. This ensures a consistent approach to reliability across products or services, and creates a minimum bar for reliability that must be satisfied. SREs at this stage may even build automation into a pipeline to guarantee minimum scale or ensure resilience on specific failures.

The success of Settler SREs can be measured by looking at how many new services and features are safely being launched into production, as well as examining things like ease of observability (e.g. effective logging, metrics, or monitoring). Success in this phase is also about establishing patterns that make projects successful (e.g. proposal templates). Project retrospectives are a great way to find those patterns as well as improve SRE engagement with the project.

Established phase (with Town Planner SREs)

In this most mature phase, products or services are usually already generally available, and systemic issues like overall architecture or developer tooling are the most likely to impact reliability.

SREs can influence reliability here by identifying and working to resolve systemic reliability issues (e.g. repeated incidents, poor SLO choices, lack of on-call process, etc). Driving continuous improvement is a very common way that SREs influence reliability at this phase.

In addition, SREs can often identify ways to reduce operational burdens or eliminate large scale toil during this phase, whether through technical automation or architecture changes, or through helping teams build process, knowledge, skills, tools and techniques they need for large scale projects to be repeatedly successful and reliable.

This can be a phase where some SREs will feel there’s a stigma associated with doing less technical work, but the impact of this work cannot be overstated -- it’s where SRE can act as a true multiplier as more and more teams and products/services are launched. Examples include running an incident management program, SLA program, On-call Program, Disaster Recovery/Business Continuity planning, or even a Chaos Engineering program. A strategy to address this concern is to pair SREs with a technical program management function (TPM) so that SREs can focus most on the technical aspects of improvement while TPMs can help with the organizational changes needed to improve a process or execute a program.

Measuring the success of Town Planner SREs can be especially tricky. You might look for simple metric improvements like fewer incidents, reduced incident duration, reduced pages, improved SLO targets, or number of DR tests -- but isolating the SRE impact to these kinds of metrics can be difficult. Qualitative feedback from an SRE team’s internal customers is also frequently used to measure success at this stage. The most impactful SREs at this stage tend to cause paradigm shifts for the other development teams, and often even for their own SRE teammates.

Wrapping up

[PST is] how you take a highly effective company and push it [...] towards a continuously adaptive system. May 8th, 2020 @swardley

I hope that the grouping above is useful to readers for structuring work to drive reliability at various levels of product maturity. Reliability-as-a-product-feature isn’t a magic bullet for an organization that doesn’t understand where it fits in the market or what kind of value it delivers, nor will it make a large difference with an unhealthy product management practice that might not know how to develop and drive delivery of a product and its features over time.

As mentioned earlier, there usually isn’t a, “one-size fits all,” approach to driving reliability. You may still need to establish some best practices for your organization such as “Limit toil to 50% of our work” or “Every product feature that goes live must have a reliability review.” Combined with these kinds of rules of thumb, the proposed divisions and strategies above should help focus your team(s) to make the biggest improvement to reliability for your products and services.

In researching this post, it was helpful to review how various organizations and companies “do SRE.” Continuous improvement was a clear shared trait among them. It’s also worth reviewing the huge amount of content out there about how SRE can effectively collaborate with other teams (e.g. embedding SREs); a poor relationship or failed collaboration with another team can jeopardize all of your efforts.

I invite and encourage you to write about and share your own experiences, both good and bad, focusing on reliability as a first class product feature at your organization. Special thanks to my own SRE team for the many discussions and ideation sessions on how we can best work to drive reliability. And special thanks to Jennifer Davis, Michael Lumsden, David Nolan, Jordan Rinke, and Kerim Satirli for feedback and editing on this post.

December 1, 2021

Day 1 - The Myths and the Magic in My Search for Acquiring Software Engineering Skills

By: Annie Hedgpeth (@anniehedgie)
Edited by: Jennifer Davis (@sigje)

A happy SysAdvent to you, my dear elves. Whether you are an individual contributor (IC), manager, director, or something in between, my holiday wish is that my story spreads some holiday magic to your teams and roadmap.

“Then I traveled through the seven levels of the Candy Cane forest, past the sea of twirly-swirly gumdrops, and then I walked through the Lincoln Tunnel.” Buddy the Elf

I took an uncommon route into technology. With absolutely no experience of any kind in any sort of technological pursuit (save for video editing in college), I started my career in IT by learning configuration management and infrastructure as code first. Why? Because the opportunity presented itself, and I had a great in-house tutor. My husband, Michael, is the one who convinced me to pursue a career in technology and was the one who spent many late evenings teaching me how to “computer”. It was a bit of a trek through “the seven levels of the Candy Cane forest, through the sea of swirly twirly gumdrops” but with more tears and heartache.

I spent the first couple of years of my career just trying to learn enough of the different frameworks, like Chef, Terraform, PowerShell, Groovy, etc., to build stuff and configure it properly. Learning about how they should be built and configured came next, with a focus on solution architecture and a bit on systems administration. Looking forward, after five years of work focused on configuration management, infrastructure as code, and CI/CD pipelines, I’m now at the point where I want to grow in software engineering, and this is where our story of myths and magic begins today.

“Some call it ‘the show’ or ‘the big dance’; it’s the profession that every elf aspires to…” – Papa Elf

Grab some hot cocoa and curl up with a blanket while I share with you what I see as the common myths believed about acquiring software engineering skills and what I believe to be the actual magic of making that a reality in my life. We will start with the myths, but please remember, dear elves, that these are myths and magic as they pertain to me personally. For you or others, they may not be, and that’s okay. My hope is that sharing my own experiences will give you empathy for others on their unique journeys and/or compassion for yourself as you learn and grow in your own way.

“The best way to spread Christmas cheer is singing loud for all to hear.” – Buddy the Elf

Myth #1 - Just read a book

I am a huge fan of books, and I consume a pretty good amount of books per year. I think that learning through books is important in a way that is difficult to replicate through other modalities. I have gone through Head First Go, a book that is geared toward people with little to no programming experience, and I found it to be incredibly helpful. I did every exercise in the book, learned a lot, and highly recommend it. That said, the exercises alone were not enough to prepare me immediately for real life coding. Doing the exercises was good and necessary, but it was only one piece of the puzzle required to complete the picture of what it takes for me to be able to contribute in a meaningful way to my company’s Go codebase.

Perhaps my lack of any formal training, whether university or code camp, prevented me from grasping the higher level understanding that would have enabled me to contribute confidently sooner, but whatever it was, I was still lacking after simply going through a book. I liken this to studying through a first year French textbook as your only means of learning the language. You will gather the concepts and vocabulary, but you will likely not be able to speak the language without other mediums of instruction.

Myth #2 - Just do some exercises

I am a huge fan of Exercism. I think they are helping a lot of people learn coding languages, and they do it in such a way that brings out a spirit of giving back in its users. There is much to love about that. I have completed many Exercism exercises, and I do find them helpful, but in the same way that the book was only helpful to a certain point, I haven’t found that it helps me with the big picture. I have found it to be like learning French with only Duolingo. Sure it’s a great app, and I use it all the time. But again, one cannot use it in isolation in order to be a proficient French speaker.

Myth #3 - Solution Architecture skills are built upon coding skills

Working at a cloud consulting firm for 4 years, I got a great education in architecting solutions for clients. I really enjoyed learning about the process, and it all made a lot of sense to me. After seeing several of them, I started to see the patterns and practices that are used to create a good solution. And then, as the person often implementing someone else’s solution, I learned quickly what made a bad solution, as well.

To be good at architecting solutions, one must think through all of the choices required to form that solution while you’re still in the planning phase, before any of the solution is actually implemented. You can’t really “mess around and find out”, which is why solutions architecture is such a valuable skill; if you plan well, you do the necessary work, no less and no more.

However, not all solutions are equal. Architecting a solution to a cloud migration feels like more of a tactile experience to me; I can see where things are moving. I think it helps that you can actually hold a CPU in your hands, and an architectural diagram has a very structural feel to it, similar to a blueprint of physical structures. For me, at least, this makes it more accessible and the concepts easier to grasp.

Software architecture, by contrast, is more conceptual. You have to first understand all of the interfaces, levels of abstraction, and concepts before you can understand how to architect it. And if you don’t understand how to architect it, then you’re back at the Duolingo level of coding.

Myth #4 - The building blocks to starting a tech career are cloud, code editor, source control, and project management

Some people have suggested that huge barriers to moving into a software engineering role can be mastering the tooling - code editors and IDEs, source control, the cloud providers, and project management. This is possibly true of a certain type of person moving from a systems administration type of job into software development, but this was not true for me. But because Michael worried that these would be barriers for me, I learned them first. I created a website with GitHub Pages and used that as a way to learn source control and Visual Studio Code. I took some online classes on Agile Framework. I got a free Azure account and started playing with Terraform. These things were most definitely and obviously important, but again, they’re but one piece of the puzzle.

Myth #5 - It just takes a creativity / growth / problem-solving mindset

One of my husband’s main reasons for convincing me to pursue a career in tech was that I’m a pretty creative person who loves problem solving and that the desire to dig into a problem until it’s solved is one of the most necessary components for a career in tech. I completely agree that this is an important character trait in order to be successful as a technologist. I’m also decently creative and have a growth mindset, which are equally valuable for such a pursuit. You can probably see, by now, where I’m going with this, though.

These traits alone are great and will serve you well in just about any endeavor. Having them, however, does not automatically make a person good at tech. It’s like when you’re house-hunting and find a house that needs a ton of cosmetic remodeling, but you say, “It has good bones,” meaning you can easily make it the way you want it to look without having to overhaul anything structurally. Still, though, the cosmetic renovations are not insignificant. They are a lot of work.

The same is true with me. Yes, I have “good bones” - good traits that are great assets for a career in tech, like being creative, having a growth mindset, and being a good problem solver. But to let folks start a career in tech with the false hope that these traits will give them an unrealistic advantage is not helpful. Yes, those traits help me a lot, but, goodness me, it is still a lot of work learning and growing in tech, even with those traits.

Real barriers:

Truth #1 - People get pigeon-holed into certain work

I worked so hard to get the skills necessary to be valuable to my respective organizations, and while, yes, I found myself a bit pigeon-holed into “devops-y” roles, the other truth is that I didn’t feel as experienced as my peers because I didn’t have the formal training many of them had, so I felt behind in my learning. I wanted to catch up to the folks my age in this business, and that was nearly impossible, so the next best thing was to get really good at one thing, and just like that, I found myself pigeon-holed. This was honestly probably easier and less risky for the companies I worked for, as things were more predictable and steady when I was more focused on a smaller scope of expertise. And you might be thinking, ‘So what’s the problem with striving to become a subject matter expert at something? There’s immense value in that.’ And you’d be right. This is perfectly fine for some people. However, I personally like to have a range in my work. I find freedom in flexibility, as my hope is that it gives me more options in my future, ultimately decreasing the risk to my career.

“There’s room for everyone on the Nice list.” Buddy the Elf

To overcome the barrier of being pigeon-holed into a particular line of work, a bit of magic is required - the magic that happens when goals are set and people help other people. Setting goals and tracking them is extremely important to me, but part of tracking those goals is being held accountable for them by someone, whether it be a manager, a mentor, or a team lead. When my manager or team leads know my goals and I have milestones set for reaching those goals, then I am so much more likely to achieve them, and I’m giving them an opportunity to play an important role, which grows their leadership skills - a win-win.

Truth #2 - It’s an engineering problem for senior engineers to break down work to share work with juniors

My favorite type of senior engineer is one who can not only design a good solution but one who knows how to allow everyone on the team to contribute to the solution with their own strengths. Being able to communicate their vision for a solution to others and lead others effectively to carry out their vision is arguably the most valuable skill of a senior engineer. The whole team thrives when seniors lead in this way! Being able to do this is most definitely classified as a soft skill - one that is not easily measured by a test, and I have witnessed many ICs discount soft skills, thinking that only managers need worry themselves with growing such skills. I would argue, though, that this particular soft skill is also an engineering skill, one necessary to be an effective IC engineer.

“I mean, parents couldn’t do that all in one night.” Buddy the Elf

Conversely, how many times have you seen senior engineers go silent for two months and then emerge with an amazing something that solves a problem, but it resembles a coded version of a complicated Home Alone trap (like a Rube Goldberg machine)? This is actually not what we want from our senior engineers, dear elves. We want senior engineers who are able to thoughtfully and skillfully level up those at levels below them.

There is a common desire among engineers to remain ICs for as long as possible with no desire for the managerial track, and that is totally fine! However, being an IC does not mean that you work in a vacuum. No matter your level, every IC can have a positive influence on someone else on the team and can bring leadership and mentorship into their everyday roles. Seniors, however, have the responsibility to give others the opportunity to contribute to their vision. By considering the other people on their team and their strengths and goals, solutions can be designed so that everyone grows. Is it hard? Of course! But when it happens, it’s like magic.

I started my career in tech a few days before I turned 37, so with the amount of catch-up I have from being late to the game, I just need help sometimes. An hour of help from a human being, for me at least, is the absolute most supercharged way to learn. I am so grateful to have had people all throughout my time in tech who understand that investing in people by pairing on a problem is really an investment in the health and wellness of the team, product, and company. I would argue also that it makes them a better person, teacher, and leader.

I wholeheartedly believe that fostering this environment should be the number 1 priority of every engineering manager because it will solve a lot of other problems down the line naturally. We need not be islands unto ourselves but rather a rising tide that lifts all ships.

Truth #3 - A team needs dedicated time to grow

Getting time to grow at a consultancy was tough. It was usually relegated to times when I was on the bench, but that time wasn’t consistent. There were times when I would go an entire year or more with no bench time, so I had to use my personal time. I will take this time to remind you, dear elves, that making your employees use their personal time for growth and development is not an inclusive practice. It makes it harder for folks with families, disabilities, or just plain healthy boundaries to have the time and space to learn.

“I planned out our whole day. First we make snow angels for two hours, and then we’ll go ice skating, and then we’ll eat a whole roll of Tollhouse Cookie Dough as fast as we can, and then to finish, we’ll snuggle.” Buddy the Elf

I’m so incredibly grateful for my current manager and team who have deemed half a day on Fridays to be dedicated learning times. When we all have learning time at the same time, then no one feels guilty for not working on sprint work because, as a team, we’ve decided that learning is important enough to spend time on it. I’ve gotten a lot out of this; I finished the aforementioned Head First Go book, and I’ve worked on Exercism exercises. I’ve also used it to learn how to do things that were blocking me in my sprint work. But to make the most out of this time, my next step is to use Friday learning times to actually use the things I’ve learned in real world work. This, however, may exceed the bounds of half a day on Fridays, and it may mean that I take a bug fix ticket and spend a whole week on it. The magic required is that the team and manager buy into this investment of time and energy. I personally know that I would get that buy-in on my current team, but I know I’m a lucky one. They know that the payoff of me growing my skills is worth the investment of time.

Truth #4 - Insecurity looms with the lack of formal education through a coding school OR engineering degree which makes it feel more difficult to acquire certain skills

This might be an unpopular opinion, and I just stated it as a truth, but I do believe that this is true for me. There are certain coding exercises that I have tried that make me feel like I will never truly understand certain concepts. I do believe that I will know enough to be valuable, but knowing when that matters and when it doesn’t is a mind trip. It’s difficult to manage my own expectations of my own growth, learning, and knowledge. The constant nagging thought in the back of my head is that if I had had any sort of formal coding training, whether in university or code camp, something would have clicked in my brain so that I understood certain concepts more quickly, and I honestly don’t know if this is a valid concern for me or not.

I do know that magic happens when people step in. When I have brilliant developer people in my life telling me what matters and what doesn’t matter and helping me to grasp fundamental concepts, my growth and confidence are accelerated greatly. I go from focusing on my blockers to focusing on my trajectory.

“Oh, it’s not a costume. I’m an elf. Well, technically, I’m a human, but I was raised by elves.” Buddy the Elf

Truth #5 - Career planning related to skills is a bit more complicated

When you’re a career-changer and are late to the tech game, planning for the future can be a bit complicated. My current difficulty is that I have the soft skills required to be a really great manager, but managing a technical team requires a great depth of knowledge that only comes with experience. So what do I do with all of this leadership potential? For now, I’m doing nothing. I’m growing my depth and breadth, hunkering down and growing, and that’s so frustrating!

But again, therein lies the potential for magic. If a manager and a team are intentional about growing people to their own strengths and goals, then we can carve a path that matches my goals and strengths with the business’s needs, but it requires a bit of creativity and flexibility. It takes mature leadership to know how to turn each team member’s potential into something that benefits everyone.

“I just like smiling. Smiling’s my favorite.” Buddy the Elf

TL;DR

Did you note a common thread? The myths I outlined are discouraging blockers that kept me from thinking that I could achieve my goals, and I have a hunch that I’m not alone in these feelings. But the magic lies in people caring about and investing in each other’s growth. That’s it! This is not just the kind, empathetic, and right thing to do, but it also will affect the business’s bottom line because when people are more committed to growth and feel encouraged to do so, they are creating quality products and they are staying put in the same place longer because they feel supported. As you go about your holiday and new year, I encourage you to bring a little bit of magic to your own teams by either being the support someone needs or by allowing someone to be a support for you.

“Bye Buddy, hope you find your dad!” – Mr. Narwhal