December 18, 2019

Day 18 - Generating Compliance as Code for Terraform with InSpec-Iggy

By: Matt Ray (@mattray)
Edited by: Ninad Pundalik (@ni_nad)

Compliance as Code

By now you’re probably familiar with the idea of “Infrastructure as Code” where we define how our servers and related infrastructure are configured in code that can be versioned, stored in source control, and tested in CI/CD environments. Tools such as Chef, Ansible, Puppet, CloudFormation, and Terraform have popularized this concept. But how do we know if our applications and infrastructure are secure and compliant? How can we manage this as code with all the same benefits (versioned, tested, easy to extend)?

InSpec is an open source “Compliance as Code” framework for defining and auditing your infrastructure’s compliance and security in a human and machine-readable language. You define your requirements in a high-level language built on top of Ruby and run them as automated tests against your systems. InSpec works with servers, containers, databases, applications, cloud APIs, and can easily be extended to support new auditing targets. InSpec can scan systems remotely or locally as an agent and may be used in conjunction with “Infrastructure as Code” tools to test the correctness of their deployments.

Terraform & InSpec-Iggy

The open source project Terraform allows us to describe how our infrastructure is defined and provisioned across a wide variety of cloud platforms. InSpec-Iggy (InSpec Generate -> “IG” -> “Iggy”) is an InSpec plugin for generating compliance controls and profiles from Terraform .tfstate files and AWS CloudFormation templates. While both CloudFormation and Terraform are supported by Iggy, this post will focus on Terraform.

From Terraform’s source files, we know the intended infrastructure, and the .tfstate file provides the last known state of what was deployed. By generating InSpec coverage from the .tfstate file, Iggy allows us to test what we built with Terraform and check whether resources have been modified or have drifted from the initial deployment. Anytime we deploy new infrastructure with Terraform, we can create new tests to audit for fidelity.
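To make the idea concrete, here is a minimal sketch of turning state data into control text. It is not Iggy's actual implementation: the SAMPLE_STATE document is a hypothetical, heavily trimmed stand-in loosely modelled on Terraform 0.12's version 4 state layout, the controls_from_state helper is invented for illustration, and passing the ID straight to the resource is a simplification of what each inspec-aws resource really expects.

```ruby
require "json"

# Hypothetical, heavily trimmed state document (not a real terraform.tfstate).
SAMPLE_STATE = <<~JSON
  {
    "version": 4,
    "resources": [
      {
        "mode": "managed",
        "type": "aws_vpc",
        "name": "default",
        "instances": [
          { "attributes": { "id": "vpc-035d7e339ce59ad62", "cidr_block": "10.0.0.0/16" } }
        ]
      }
    ]
  }
JSON

# Render one control (as text) per managed resource instance found in the state.
def controls_from_state(state_json)
  state = JSON.parse(state_json)
  managed = state.fetch("resources", []).select { |r| r["mode"] == "managed" }
  managed.flat_map do |resource|
    resource.fetch("instances", []).map do |instance|
      attrs = instance.fetch("attributes", {})
      id = attrs["id"]
      # Every attribute except the ID becomes an `its(...) { should cmp ... }` check.
      checks = attrs.reject { |key, _| key == "id" }.map do |key, value|
        "    its(#{key.inspect}) { should cmp #{value.inspect} }"
      end
      [
        "control \"#{resource['type']}::#{id}\" do",
        "  describe #{resource['type']}(#{id.inspect}) do",
        "    it { should exist }",
        *checks,
        "  end",
        "end",
      ].join("\n")
    end
  end
end

puts controls_from_state(SAMPLE_STATE)
```

A real generator also has to map Terraform attribute names onto the properties each InSpec resource exposes, which this sketch skips entirely.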

Iggy & AWS

The InSpec-Iggy README covers installing the plugin with InSpec and documents the various subcommands, their options, and development and testing. The Terraform AWS Provider Basic Two-Tier AWS Architecture produces 9 AWS resources with the command:

$ terraform apply

The terraform.tfstate file is the JSON representing the state of the infrastructure created from the above command. InSpec-Iggy parses the file with the following command:

$ inspec terraform generate --name AWS01 -t terraform.tfstate --platform aws --resourcepath ~/ws/inspec-aws/

 ───────────────────────── InSpec Iggy Code Generator ─────────────────────────

Creating new profile at /Users/mattray/ws/inspec-iggy/AWS01
 • Creating file README.md
 • Creating directory controls
 • Creating file controls/generated.rb
 • Creating file inspec.yml

Additional options such as --title and --version may be passed to populate the README.md and inspec.yml as necessary. Looking at the AWS01 profile that has been generated, we see the controls/generated.rb contains the following content:

control "aws_elb::terraform-example-elb" do
  title "InSpec-Iggy aws_elb::terraform-example-elb"
  desc  "
    aws_elb::terraform-example-elb from the source file /Users/mattray/ws/inspec-iggy/terraform.tfstate
    Generated by InSpec-Iggy v0.7.0
  "
  impact 1.0
  describe aws_elb({:load_balancer_name=>"terraform-example-elb"}) do
    it { should exist }
    its("availability_zones") { should cmp ["us-west-2c"] }
    its("dns_name") { should cmp "terraform-example-elb-2051343015.us-west-2.elb.amazonaws.com" }
    its("load_balancer_name") { should cmp "terraform-example-elb" }
  end
end
...

The generated profile leverages the inspec-aws resource pack and running inspec exec AWS01 -t aws://us-west-2 with the proper AWS credentials produces:

Profile: InSpec Profile (AWS01)
Version: 0.1.0
Target:  aws://us-west-2

  ✔  aws_elb::terraform-example-elb: InSpec-Iggy aws_elb::terraform-example-elb
     ✔  AWS ELB terraform-example-elb should exist
     ✔  AWS ELB terraform-example-elb availability_zones should cmp == ["us-west-2c"]
     ✔  AWS ELB terraform-example-elb dns_name should cmp == "terraform-example-elb-2051343015.us-west-2.elb.amazonaws.com"
     ✔  AWS ELB terraform-example-elb load_balancer_name should cmp == "terraform-example-elb"
  ✔  aws_ec2_instance::i-05c6d20469a0a0ee9: InSpec-Iggy aws_ec2_instance::i-05c6d20469a0a0ee9
     ✔  EC2 Instance i-05c6d20469a0a0ee9 should exist
     ✔  EC2 Instance i-05c6d20469a0a0ee9 availability_zone should cmp == "us-west-2c"
     ✔  EC2 Instance i-05c6d20469a0a0ee9 subnet_id should cmp == "subnet-00f265d40d3d0a227"
  ✔  aws_security_group::sg-0eac4a147658285b7: InSpec-Iggy aws_security_group::sg-0eac4a147658285b7
     ✔  EC2 Security Group ID: sg-0eac4a147658285b7 Name: terraform_example VPC ID: vpc-035d7e339ce59ad62  should exist
     ✔  EC2 Security Group ID: sg-0eac4a147658285b7 Name: terraform_example VPC ID: vpc-035d7e339ce59ad62  description should cmp == "Used in the terraform"
     ✔  EC2 Security Group ID: sg-0eac4a147658285b7 Name: terraform_example VPC ID: vpc-035d7e339ce59ad62  group_name should cmp == "terraform_example"
     ✔  EC2 Security Group ID: sg-0eac4a147658285b7 Name: terraform_example VPC ID: vpc-035d7e339ce59ad62  vpc_id should cmp == "vpc-035d7e339ce59ad62"
  ✔  aws_security_group::sg-01199abc1619d7613: InSpec-Iggy aws_security_group::sg-01199abc1619d7613
     ✔  EC2 Security Group ID: sg-01199abc1619d7613 Name: terraform_example_elb VPC ID: vpc-035d7e339ce59ad62  should exist
     ✔  EC2 Security Group ID: sg-01199abc1619d7613 Name: terraform_example_elb VPC ID: vpc-035d7e339ce59ad62  description should cmp == "Used in the terraform"
     ✔  EC2 Security Group ID: sg-01199abc1619d7613 Name: terraform_example_elb VPC ID: vpc-035d7e339ce59ad62  group_name should cmp == "terraform_example_elb"
     ✔  EC2 Security Group ID: sg-01199abc1619d7613 Name: terraform_example_elb VPC ID: vpc-035d7e339ce59ad62  vpc_id should cmp == "vpc-035d7e339ce59ad62"
  ✔  aws_subnet::subnet-00f265d40d3d0a227: InSpec-Iggy aws_subnet::subnet-00f265d40d3d0a227
     ✔  VPC Subnet subnet-00f265d40d3d0a227 should exist
     ✔  VPC Subnet subnet-00f265d40d3d0a227 availability_zone should cmp == "us-west-2c"
     ✔  VPC Subnet subnet-00f265d40d3d0a227 cidr_block should cmp == "10.0.1.0/24"
     ✔  VPC Subnet subnet-00f265d40d3d0a227 vpc_id should cmp == "vpc-035d7e339ce59ad62"
  ✔  aws_vpc::vpc-035d7e339ce59ad62: InSpec-Iggy aws_vpc::vpc-035d7e339ce59ad62
     ✔  VPC vpc-035d7e339ce59ad62 should exist
     ✔  VPC vpc-035d7e339ce59ad62 cidr_block should cmp == "10.0.0.0/16"
     ✔  VPC vpc-035d7e339ce59ad62 dhcp_options_id should cmp == "dopt-8d3211eb"
     ✔  VPC vpc-035d7e339ce59ad62 instance_tenancy should cmp == "default"


Profile: Amazon Web Services  Resource Pack (inspec-aws)
Version: 1.4.2
Target:  aws://us-west-2

     No tests executed.

Profile Summary: 6 successful controls, 0 control failures, 0 controls skipped
Test Summary: 23 successful, 0 failures, 0 skipped

This test can be run periodically for drift detection and to validate that our infrastructure is still configured as deployed by Terraform.

Negative Testing

What if we wanted to see all of the infrastructure that is not managed by Terraform? This “negative testing” can be used to find anything that may have been added to a particular VPC that was not created by Terraform, which may be a security issue or unwanted manual change. inspec terraform negative will generate a profile that pulls the available cloud resources from the VPC and reports failures for those that were not provided by the terraform.tfstate.
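At its core, negative testing is a set difference: everything the cloud reports, minus everything the state file knows about. The sketch below only illustrates that idea and is not how the plugin is implemented. The two managed security group IDs come from the output earlier in this post, while "sg-0123456789abcdef0" and the unmanaged helper are made up for the example.

```ruby
require "set"

# IDs recorded in terraform.tfstate (taken from the profile output above).
MANAGED_IDS = Set["sg-0eac4a147658285b7", "sg-01199abc1619d7613"].freeze

# IDs a cloud API might report for the VPC; the last one is invented.
DISCOVERED_IDS = Set[
  "sg-0eac4a147658285b7",
  "sg-01199abc1619d7613",
  "sg-0123456789abcdef0",
].freeze

# Anything discovered in the VPC but absent from the state is unmanaged.
def unmanaged(discovered, managed)
  (discovered - managed).sort
end

unmanaged(DISCOVERED_IDS, MANAGED_IDS).each do |id|
  puts "FAIL: #{id} exists in the VPC but is not in terraform.tfstate"
end
```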

Visualized Reporting

InSpec allows you to send your audit reports directly to Automate for visualization and reporting.

Wrapping Up

Automatically creating InSpec tests for your Terraform-managed infrastructure provides auditing to ensure your infrastructure is always managed as code. InSpec-Iggy has been tested with AWS, Azure, and GCP, but other platforms could easily be supported. Iggy is still under development, and as new cloud resources are added to the corresponding resource packs, additional coverage will automatically be generated.

December 17, 2019

Day 17 - Automation in Everyday Life: It’s not always about velocity

By: Michael Stahnke (@stahnma)
Edited by: Joshua Zimmerman (@TheJewberwocky)

We automate technology for three main reasons:

  • Speed
  • Consistency
  • To get onto things that matter – solving new problems

In nearly all cases, people think automation is about speed. Speed is a benefit, but in most situations, speed is a very distant second place to the consistency you can arrive at when automation practices are in play. When you automate a bad process, you just get bad output, but faster. However, you will get the same type of bad output each time, meaning it is easier to make improvements and see results. Consistency matters.

I think in terms of automation. I think about repeatable process, speed of delivery, and how I can do more of the things I wish I was doing vs the things I have to do. Sometimes, I like to tinker with technology for automation. Sometimes, I have a plan.

Automation mindset

Recently, I had a friend point out that I live automation – beyond my career. This is embodied in my desire to cook meals. You see, I have a smoker and I like to smoke meats and other foods. For those a little less familiar with this style of cooking, it is usually lower temperatures (~225ºF / 107ºC) for a rather long time (6–14 hours of cook time). The results are tasty entrees and meals.

The important parts for me were to get good results and to have minimal hands-on time with the cooking. This is so I can cook on a work day, or on a day I want to spend out and about with my family. This desire for minimal touch and consistency got me thinking about the three principles of automation I had been using as a mental framework for years.

Automated smoked meat delivery is not about speed. In fact, speed is the least important part of the equation. I am interested in consistency, quality, and how much manual intervention is required by me to get an awesome meal.

The process

At its core this food delivery process has the following steps.

  1. Decide on what you want to make
  2. Gather materials
  3. Prepare food for cooking
  4. Prepare smoker for cooking
  5. Cook Meal
  6. Continuously validate meal is cooking properly
  7. Remove meal
  8. Eat
  9. Clean smoker.

During the process, I am most concerned about step 6. If the smoking lasts 6 to 14 hours, how do I know the cook is going ok? How much time do I have to course correct? How much supervision will this require?

Equipment

Enter my Green Mountain Grills (GMG) Daniel Boone. This is a Pellet Grill smoker that burns small wood pellets as fuel. The nice thing about this class of grill is that you can have a large hopper of fuel, set a temperature and the grill moderates the fuel and temperature using an auger for the fuel and fans for the flames. This allows me to set a desired state (temperature) and the grill will be the engine enforcing that. It also has WiFi, so I can adjust the temperatures via my mobile device.

Inputs

Like any automation project, this one starts with requirements and learning about market trends. Normally, I start by asking my spouse what she is in the mood for or, if cooking for a larger group, gauging interest in different dishes.

I break down the cook into smaller segments and inputs. These inputs are things like desired cooking temperature, desired food completion temperature, and type of wood to use for the cook. These are implementation details, much like selecting a programming language or database to use in software delivery. I also make an estimate at this phase for the duration of the cook. Much like software engineering, my estimates are often off by more than a little, but do improve over time.

After preparation, I begin the process of continuous validation of the food payload and adjust cooking parameters until the payload passes the validation suite (e.g. internal temperature and time frame constraints).

Begin the cooking

The food is on the grill. The grill is smoking. I am now continuously validating the state of the grill and the food. The grill comes with some instrumentation built in. It can display the current temperature of the grill and the temperature of the food where I have placed a food probe.

After a few cooks, I realized that something was not behaving as expected with this system. Even with my mobile device monitoring and sometimes adjusting the grill, the temperature of the food was not matching my expected duration of cook properly. I decided to opt for a way to monitor the system from a secondary source. Perhaps observing the system completely within the system was the problem.

Instrumentation

While my food wasn’t turning out horribly, it wasn’t quite right. Knowing that more insight into the cook process would help, I bought an external temperature probe set and placed one probe on the grill and one in the food I’m cooking (so now I have two data points for food and two for ambient grill temp). This told me that the grill’s own temperature readings were often much lower than the actual temperature, meaning I was cooking too fast. This also meant that I needed to adjust temperature a little more often than I had originally hoped for.

In the pipeline model, temperature being out of bounds is a failure. A failure means I need to take action to recover. Sometimes that can be simply to lower desired oven temperature, sometimes it means basting the food, opening the grill lid, or something else.

Continuous Improvement

Since most of these interventions required layer 1 connectivity to the grill (e.g. I had to touch the grill), I decided to optimize for the methods that explicitly did not require physical contact. This way, I could work my setup while I was working throughout the day. I had learned that the WiFi connections for the smoker used UPnP for discovery and communication. I whipped up a little CLI to send commands to the smoker and get output back. After a bit of fiddling around, I searched GitHub and found that somebody had written something more robust than I had. A web application existed and was easily deployable via a container. Cool. Now I could monitor the grill in my browser and didn’t need my phone out. This was helpful since I work from home and have a browser up nearly always.

Validation

Now that I could make easier adjustments to the meat delivery system, and had added monitoring and instrumentation, there were still a few other cases to handle. Think of our delivery pipeline as a series of tests, where each test is a sample of the current temperature compared to the desired temperature. The combination of all these tests into a single suite is the delivery pipeline, and you run it until they pass (or you fail completely, but that’s pretty rare).

While the meal is cooking you can get into all the traditional conditions of a validation pipeline. You have

  • Success – everything is operating as expected. (E.g. tests are running and passing)
  • Failure – Temp is not where you want it. Food is not cooking as expected. Perhaps wind is causing fuel consumption to be much higher than your capacity planning model showed.
  • Error – unable to get the right data. Perhaps out of fuel? Power outage? Battery is dead on temp probe? Dog has stolen the food?

The pipeline concludes when each test passes. Normally the grill temperature samples work fine and the food temperature is what we wait on. Once the food temperature is optimal the pipeline can be complete. It’s time to remove the payload.
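If I were to write this validation suite down as code, it might look something like the sketch below. This is purely illustrative and not what runs against my grill: the 10 degree tolerance, the 203ºF food target, and the classify and done? helpers are all assumptions made up for the example. Only the ~225ºF cooking temperature comes from earlier in this post.

```ruby
# Classify one grill temperature sample against the desired temperature,
# mirroring the three outcomes above. The tolerance is an assumed value.
def classify(sample, target, tolerance: 10)
  return :error if sample.nil? # no data: dead probe battery, power outage, ...
  (sample - target).abs <= tolerance ? :success : :failure
end

# The pipeline completes once the grill sample passes and the food probe
# has reached its target temperature.
def done?(grill_sample, grill_target, food_sample, food_target)
  classify(grill_sample, grill_target) == :success &&
    !food_sample.nil? && food_sample >= food_target
end

GRILL_TARGET = 225 # degrees F, from the cooking style described earlier
FOOD_TARGET  = 203 # degrees F, a hypothetical completion temperature

p classify(230, GRILL_TARGET)
p classify(260, GRILL_TARGET)
p classify(nil, GRILL_TARGET)
p done?(228, GRILL_TARGET, 204, FOOD_TARGET)
```

Treating a missing reading as an error rather than a failure matters here: a failure means adjust the cook, while an error means go find out what happened to the probe (or the dog).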

Deployment

After a rest period, we carve, stir, mix, pull, or unwrap the food, depending on the type. After the smoker has had time to cool down (usually while we’re eating), it can be cleaned. This is much like workspace cleaning in any continuous integration platform. You don’t want your next set of tests (cook) to be tainted by previous artifacts…like those bits of food that stick onto the grill tray.

Retry

My automation thought process started with the desire for a slow-cooked meal, and then went through basically all phases of a Software Delivery Lifecycle model with automation as the method to drive consistency in output. In short, automation provides tasty solutions to my problems.

In the end, I use my smoker a few times a month. I enjoy it, and I enjoy tinkering with the setup almost as much as the food itself. I also love that I can leave it alone for 8+ hours while playing with my son or doing other things and still be pretty confident that my evening meal will be awesome, and after all, what’s the point of automation if I can’t be confident in it?

If you want to see pictures of my meal creations (and all my other shenanigans online), you can follow me on twitter at @stahnma.

December 16, 2019

Day 16 - Evolution of CloudFormation

By: Atif Siddiqui
Edited by: Brad P. Adair (@bpadair)

Infrastructure as Code (IaC) is one of the salient DevOps practices. IaC is the management of infrastructure through code. This code is versioned, which in turn guarantees repeatability in the process of infrastructure provisioning.

In this article, I am going to talk about IaC through the prism of AWS. Focus will be on a few impressive features that AWS has released, in its IaC product, over the years.

Genesis

CloudFormation was released [1] in February 2011 as a service offering for Infrastructure as Code (IaC). This tool relies on declarative blueprint documents known as Templates. While templates could initially only be composed in JSON, support for YAML was added several years later, in 2016 [2].

Template Anatomy

Templates support a top-level structure using the following sections.

1. AWSTemplateFormatVersion

2. Description

3. Metadata

4. Parameters

5. Mappings

6. Conditions

7. Transform

8. Resources

9. Outputs

Among this list, Resources is the only mandatory section. After all, the goal is to provision AWS resources. The collection of resources provisioned by a given template is called a CloudFormation Stack and is treated as a single unit.
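As a small illustration of that rule, the sketch below builds a template containing nothing but a Resources section and checks it against the nine section names listed above. This is only a local sanity check under my own assumptions, not the validation CloudFormation itself performs. The ExampleBucket logical name and the template_problems helper are hypothetical.

```ruby
require "json"

# The nine top-level sections listed above.
KNOWN_SECTIONS = %w[
  AWSTemplateFormatVersion Description Metadata Parameters
  Mappings Conditions Transform Resources Outputs
].freeze

# Smallest useful template: only the mandatory Resources section.
# "ExampleBucket" is a made-up logical name.
MINIMAL_TEMPLATE = {
  "Resources" => {
    "ExampleBucket" => { "Type" => "AWS::S3::Bucket" }
  }
}.freeze

# Report a missing Resources section and any unrecognized top-level keys.
def template_problems(template)
  problems = []
  problems << "missing required section: Resources" unless template.key?("Resources")
  (template.keys - KNOWN_SECTIONS).each { |name| problems << "unknown section: #{name}" }
  problems
end

puts JSON.pretty_generate(MINIMAL_TEMPLATE)
p template_problems(MINIMAL_TEMPLATE)
p template_problems({ "Description" => "nothing to provision" })
```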

Public Coverage Roadmap

One of the polarizing themes of CloudFormation has been the support lag as AWS aggressively rolls out products and features. Given a minimum viable product (MVP) mindset, products and features often lack Day 1 CloudFormation support, which carries the risk of customer dissatisfaction.

In light of all the heat from customers, AWS announced [3] a public roadmap for CloudFormation. This is meant to provide visibility and a public forum for customer feedback. It should be noted that this is not a new idea in the AWS space. In fact, it is borrowed [4] from the AWS Container Service team, which launched it as an experimental idea last year [5].

While a roadmap is a step in the right direction, I have always felt that product teams should follow a well known timeline. For example, within three months of product or feature announcement, CloudFormation support must be made available.

Versatility

Evolution of CloudFormation, in my view, has been frustratingly slow. Only over the last couple of years have there been feature announcements that have added versatility to a tool with immense potential. In my humble opinion, the following are a few of the noteworthy features that have been added.

Drift Detection

This feature was announced [6] in November of last year. It provides the ability to detect if any of the underlying resources provisioned by CloudFormation have been modified outside this tool.

Drift Detection provides visibility across the entire stack (depicted above) as well as drift detail at the resource level.

In the example [7] shown below, the IP on the inbound rule of the security group had been modified (outside of CloudFormation) from a value of 0.0.0.0/0 to 72.202.202.202/32.

It should be noted that Drift Detection supports a subset of resources [8]. At this time, there are a total of 16 resources being supported.
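Conceptually, a drift report such as the security group case above boils down to comparing the properties a template declared with the properties that currently exist. The sketch below illustrates that comparison only and is not how the service is implemented. The CidrIp values come from the example above, while the FromPort value and the property_drift helper are assumptions added for illustration.

```ruby
# Properties as declared in the template (CidrIp is from the example above;
# the port value is assumed for illustration).
EXPECTED = { "CidrIp" => "0.0.0.0/0", "FromPort" => 22 }.freeze

# Properties as they currently exist after the out-of-band change.
ACTUAL = { "CidrIp" => "72.202.202.202/32", "FromPort" => 22 }.freeze

# Return only the properties whose expected and actual values differ.
def property_drift(expected, actual)
  (expected.keys | actual.keys).each_with_object({}) do |key, drift|
    next if expected[key] == actual[key]
    drift[key] = { expected: expected[key], actual: actual[key] }
  end
end

property_drift(EXPECTED, ACTUAL).each do |name, values|
  puts "#{name}: expected #{values[:expected]}, found #{values[:actual]}"
end
```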

StackSet

One of the limitations the Stack concept suffered from is the lack of scalability. A template would need to be run in every account and region where resources are required to be provisioned.

StackSet filled this gap by providing the capability to implement a stack across multiple accounts and regions in a repeatable fashion. Regions of interest could be selected, allowing for a model where an Administrator account has the ability to execute a CloudFormation template across multiple target accounts.
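One way to picture this fan-out is as a cross product of target accounts and regions, with one stack instance per pair. The sketch below shows only that arithmetic, not the StackSets API. The account numbers, the region choices, and the stack_instances helper are all hypothetical.

```ruby
# Hypothetical target accounts and regions of interest.
TARGET_ACCOUNTS = %w[111111111111 222222222222].freeze
TARGET_REGIONS  = %w[us-east-1 us-west-2].freeze

# One stack instance per (account, region) pair.
def stack_instances(accounts, regions)
  accounts.product(regions).map do |account, region|
    { account: account, region: region }
  end
end

stack_instances(TARGET_ACCOUNTS, TARGET_REGIONS).each do |instance|
  puts "deploy template to account #{instance[:account]} in #{instance[:region]}"
end
```

Two accounts and two regions already mean four stacks, which is the repetition a single-stack workflow would have required running by hand.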

Account numbers 123456789012 and 234567890123 are illustrations of target account numbers.

Last month, Drift Detection was announced [9] for StackSets, providing a portal view to manage it.

CloudFormation CLI

Just a few weeks ago, the CloudFormation CLI was released [10] as an open source project. As part of this announcement, the CloudFormation registry was announced, which is one of its salient value propositions.

With the CLI’s ability to build resource providers, it allows for extensibility through invocation of third party resources. The days of relying on user data (custom scripts) or Lambda functions to provision third party resources should be behind us.

The registry defines two types of resources: private and public. Native AWS resources (depicted above) are categorized as public while third party resources are categorized as private. The currently supported third party resources are listed below.

a. Atlassian

b. Datadog

c. Densify

d. Dynatrace

e. Fortinet

f. New Relic

g. Spotinst

Import Existing Resources

Last month, another notable CloudFormation feature was announced [11]. Through this feature, resources created manually can be imported and then managed through CloudFormation. In true MVP style, this feature is available in only select regions.

The following example depicts a security group being imported that had been manually created in the console.

The wizard prompts for the resource name after parsing through the template. The manually created security group with id of sg-0e84f5537086128b5 was specified as the identifier value.

The template mandates that resources being imported use the Termination protection via the attribute ‘DeletionPolicy: Retain’.

Wrap up

Interestingly, while there were several CloudFormation releases announced in the weeks prior to re:Invent, the conference itself was quiet on this product. Hopefully, AWS’s investment in IaC, for its customers, will continue to be prioritized as CloudFormation strives to achieve its potential.

References

[1] https://aws.amazon.com/about-aws/whats-new/2011/02/25/introducing-aws-cloudformation/

[2] https://aws.amazon.com/blogs/aws/aws-cloudformation-update-yaml-cross-stack-references-simplified-substitution/

[3] https://aws.amazon.com/blogs/aws/aws-cloudformation-update-public-coverage-roadmap-cdk-goodies/

[4] https://github.com/aws/containers-roadmap

[5] https://www.geekwire.com/2018/amazon-web-services-reveals-public-road-map-cloud-container-services/

[6] https://aws.amazon.com/blogs/aws/new-cloudformation-drift-detection/

[7] https://s3-us-west-2.amazonaws.com/cloudformation-templates-us-west-2/EC2_Instance_With_Ephemeral_Drives.template

[8] https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/using-cfn-stack-drift-resource-list.html?icmpid=docs_cfn_console

[9] https://aws.amazon.com/about-aws/whats-new/2019/11/cloudformation-announces-drift-detection-support-in-stackSets/

[10] https://aws.amazon.com/blogs/aws/cloudformation-update-cli-third-party-resource-support-registry/

[11] https://aws.amazon.com/blogs/aws/new-import-existing-resources-into-a-cloudformation-stack/

December 15, 2019

Day 15 - Moving a Monolithic Rails App to Kubernetes

By: Philip Brocoum (@stedwick)
Edited by: Ryan Hass (@_ryanhass_)

Here at Syncta we have a classic monolithic Ruby on Rails application with roughly 150 models. For the past six months, I have been working on reimagining our deployment on Kubernetes.

I began this project with two principles in mind:

1. Picking the best tool for each job.

2. Each tool should follow the Linux philosophy of doing one thing, and doing it well.

The first thing I asked myself was, "Can we containerize our Rails app?" The most difficult thing was the learning curve related to Docker, but the answer was, “Yes!” Our Gemfile is 200 lines long, and our Docker image weighs in around 1 GB, but I've seen bigger. Our Dockerfile starts with `FROM bitnami/ruby` and includes `nodejs libpq-dev file graphviz busybox vim` for compiling native gem extensions. I chose the Bitnami image because I tried Alpine first and found myself installing too many prerequisites. We push these images to Quay.io, which is a very user-friendly image repository. For example, they offer copy-paste credentials for installing the imagePullSecret into k8s.

Furthermore, I created a docker-compose.test.yml that runs all ~1000 of our tests completely self-contained; PostgreSQL, Redis, ElasticSearch, and ChromeDriver all run inside their respective containers for our test suite. The biggest change was that we used to simply install Chrome on our virtual machine, and use it locally for our tests. We now have to run our browser in a container using Browserless.io.

The next step was to get this Docker Composition running in a CI/CD pipeline, for which we used Bitbucket Pipelines. Running everything within Docker (with its overhead) on one CI instance used up the standard 4 GB in Bitbucket Pipelines. As a result we had to upgrade to the 2x memory build in order to run Chrome. Our test suite now automatically runs in about 15 minutes on a push to Bitbucket. Here's a screenshot of what our pipeline looked like previously in Semaphore, and now in Pipelines.

Now the real fun begins: we have a working, tested, Dockerized version of our monolithic Rails app, and the next step is to deploy it to Kubernetes. I like to keep things as simple as possible. With that in mind, we used Spotinst to spin up a k8s cluster on AWS EKS. Spotinst handles everything, including installation and cluster auto-scaling, and then we spun up a Rancher instance for an easy-to-use GUI and attached it to our k8s cluster using the provided Spotinst credentials. Rancher allows us to one-click-install things like Traefik & Grafana (see screenshot).

Finally, we launched a deployment inside the k8s cluster using the Docker image we created in step one. Wow! Right now, only our secondary staging server is running on k8s, from a git branch parallel to master, and Rancher is currently a single node install. There's still plenty of QA to do. We are planning a Canary rollout next year.

There are some good lessons to learn from this whole journey. First, it's far easier to develop/test/deploy when everything is Dockerized and guaranteed to be identical. However, the downside is that there is a huge learning curve with Docker and k8s that our devs will need to get up to speed with. In order to help mitigate the learning curve I have started a Makefile with many common commands that our developers use. My New Year's wish for SysAdvent is for a smooth rollout to production in 2020. :-)

help:
  @echo Syncta Main Rails Makefile
  @echo ...

dev:
  @kubectl run -n "smr-development" \
    --image="${KI}" \
    ...

Specs:
  @export COMPOSE_PROJECT_NAME="smrtest"; \
    config/devops/specs.sh

If any of this sounds interesting to you, and you'd like to join our team, please get in touch with our recruiter, Jimmy! We are a remote-first development team (headquartered in Portland, OR). We are looking for a Senior/Lead Full-Stack Developer with good experience in Ruby on Rails immediately. In 2020, we'll be hiring another Developer and a Project Manager. Syncta is a small startup working on backflow and other water-related technologies, and we were recently acquired by Watts Water Technologies. I'm Philip Brocoum, Head of Development at Syncta, and Happy SysAdvent to all!

December 14, 2019

Day 14 - Building Datacenters in Hell

By: Bryan Horstmann-Allen (@bdha)
Edited By: Wayne Werner

A small tribe of lost souls finds themselves embattled, after long wandering
the plains of Hell. They nurture a unique Praxis, a small thing. It is not the
making of fire, but it is theirs, with a history that crosses spans of mortal
time. It is precious to them.

They find themselves without the resources to thrive. Hell is harsh, and the
so-called walled heavens of the hyperscalers, closed and proprietary in nature,
consume all within reach.

In this state, a group of Powers acquires the tribe.

The Powers That Were tell them: Your will is ours. Your efforts ours. The
Powers demand global storage clouds. The tribe is ill-equipped for an
undertaking so large. Ramp up time will be required. A single datacenter, they
say – let us build a Proof Of Concept, to identify our gaps, to build our
tooling, to measure twice, and cut once.

There will be no practice, no prototype. Everything will be Production, it will
be done Quickly. The tribe will be oncall 24/7 for empty servers. Failure will
not be countenanced. The Powers claim to build airplanes in the air, change
tires while in motion.

The Powers’ promises fall like heavy rain: It will be a joint undertaking, the
tribe is told. We will be as One Team.

Eventually these clouds would house roughly an exabyte of data, streaming from
the Aether at 200Gbps, across upwards of 10,000 devices in 20 datacenters.

For the hyperscalers in their false heavens that’s a Tuesday.

For the small tribe, however, new layers of Hell are created from whole cloth.

No Maps For These Territories

The first phase of the project is targeted for an area of the umbral plains
redolent with the monoliths of dozens of datacenter facilities.

Extreme haste is required. The Powers That Were never explain why. Rocks fall
from the skies when progress seems lacking, crushing arbitrary engineers. Speed
is required over all other considerations.

Mistakes will not be made, because they should not be.

The tribe lacks an actual project plan and much needed automation. A
spreadsheet with a dozen lines, like “order hardware,” “install racks” and
“cable up servers,” is provided. It is lacking an attention to detail that
would perhaps prove useful.

The Powers are not Gods, but old, massive, well-funded and long in reach.

The tribe designates internal groups to handle various aspects of the work.
DCOPS, SYSOPS, NETOPS. They daub their foreheads with mud and gather up their
primitive tools.

If the reader is unfamiliar with datacenters, they are actively hostile to
human life. Souls don’t belong there; they are only for machines. Giant
windowless boxes full of smaller screaming sharp-edged boxes. They are loud,
hot and cold. A constant dry wind blows in your face. The lights harsh,
seemingly unending, but if you do not move often enough the entire facility
will be cast into darkness around you.

The longer a soul spends there, the more they lose of themselves. Higher
cognitive functions shut down, memories lost to the patterns of the blinking
lichen.

Mistakes are made, and quickly compounded.

Each team airlifted to the site attempts to complete their work all at once,
dependencies between teams are unclear and regularly consume the fingers of the
unwary. The domes of the cloud balloon above them.

Acolytes of the Powers That Were appear, underfoot, refusing to say why. In
truth, they are there to build their own cloud, adjacent but separate from the
tribe’s work. The tribe is given no insight, no access, just responsibility. The
Acolytes fail in their work, and the blame is passed on to the tribe.

The Acolytes argue the tribe’s attempts are untenable, they demand that
velocity increase, demand new directions seemingly at random. Progress stalls
as the context switches tear holes in the thin membrane of the local reality.

Screaming horrors slip through and must be contained.

Racks are built in place by Integrator Daemons: servers built on the raised
floor, racked, cabled, nominally tested. Each system trails five network
cables, two power cables, hungry umbilicals ready for sustenance. The USB keys
installed in the rear slots become too hot to touch after a few hours of
screaming operation.

Dead components, shedding their dried scales, are RMA’d through secret,
unknowable means. It can take weeks or months for parts to be replaced.

Souls become lost behind the stacks of discarded cardboard. Some vanish for
good, seemingly burnt away at the edges in the hot exhaust aisle. Those
remaining run out of food, water, sanity.

Engineers gnaw on packing peanuts, simply to make the emptiness inside less.

Each datacenter has to be recabled three times.

The world outside becomes a lie. There is only the facility.

The Director of NETOPS walks alone into the racks. The bear-sized raptors that
infest the high domes of the datacenters can be heard screaming in the voice of
human infants. Their greasy feathers litter the tiles. The Director is never
seen again.

One of the problems you get when dealing with server vendors, integrators,
VARs, whatever, is that the inventory and testing data they give you is never
in a format you want. It is often full of errors, sometimes very subtle ones.
We had scripts to parse PDFs to pull out system inventory data and burn-in
reports. It makes me tired just to think about.

Long after the work has begun, Projections are finally shared by the Powers.

The Powers had demanded 25PB cloud for this quarter. 25PB is built. They
actually needed 60PB. This is clearly the tribe’s failing, lacking both an
Oracle and having never been given any data from the Powers despite constantly
beseeching Them. All pending builds have similar problems; budgets approved,
hardware purchased. It will not be enough.

Months after the build is “done,” the lost tribe still finds systems with
missing drives, zpools misconfigured; lopsided, keening pathetically. A cascade
of avoidable work follows.

Lessons are learned; none especially technical in nature.

It comes as no surprise that when you give people no time to prepare, no time
to do research or build processes and automation in advance – what you end
up with is a bunch of exhausted people doing subpar work.

Everyone knew what was needed, but no time or resources were allowed to
build any of it properly.

Eventually a demoralized group emerges from under the cloud they had built with
their own bleeding hands. Exhausted, thinner, fewer than when they started.

The cloud runs out of storage.

Like Boulders Up a Hill

The Powers That Were are relentless. The next clouds would be birthed back to
back. The final cloud and the first expansion of the existing regions would be
done in parallel.

It is learned that those amongst the Powers who actually own the data do not
even wish to move it. The Powers That Were bicker amongst themselves. Budgets
are shredded, fluid hissing down from the skies, burning the earth. Timelines
torn. The unknowable Ancients in Finance lay down ultimatums. The clouds must
be built. Costs must be cut. Now.

The Powers settle, once again direct their gaze to the lost souls.

The tribe gets on with it. They know that before they begin the next builds,
they need two things:

  • A plan
  • Robots

Without a detailed project plan and automation to execute it, they would be
stripped of their nerd hoodies and left to the mercies of Hell, alone.

By now, they have directed the Integrator Daemons to build the racks off-site,
in the Daemon’s own facilities, test them, and only then drag the hundreds of
racks into the maw of the cloud. The Integrators will have spares, the
ability to swap out whole chassis, the cost of errors is pushed back on them,
and not onto DCOPS, who are busy contending with the hawkbears.

Information only flows in one direction. They must do their best with suspect
data rarely and only begrudgingly shared.

The first project plan is built in LiquidPlanner. The spreadsheets are set on
fire.

The leads from each of the operating groups sit and gibber at each other for
unknowable hours. The light outside never changes.

At the end, they have a plan with major and minor gates, consisting of some
1500 discrete tasks, all dependencies defined and linked, specific engineers
assigned with worst/best time estimates.

When the Plan needs to change, the timelines shift, realign. The Plan is
flexible. The Plan will be light in the darkness. The Plan will save them.

They cannot be saved. They are already lost.

It can be surprisingly difficult to convince a Project Manager to give up
whatever control they think they possess over the project. When you have
dozens of people working on some incredibly complicated effort, you need to
let the team leads own their work.

The major gates are eldritch concepts:

  • Design
  • Procurement
  • Physical Plant

Portals are built from dead PDU strips, runes engraved with multitools.

The minor gates are interleaved.

  • Preflight
  • Network I
  • Compute I
  • Network II
  • Compute II

SYSOPS can spin up servers once basic Layer 2 networking is in, and NETOPS has
already moved on to configuring the cross-AZ Layer 3 network. Each datacenter
can be built independently until the final stages.

This interleaving means no particular team is blocked from performing useful
work. The constant movement will keep them warm in the unending dusk.

The automation strategy they settle on is two-fold:

  • Validate that what they got is what they bought
  • Script every part of building the cloud itself

The validation piece is termed “preflight.” The Validator and the
Director of SYSOPS hunt the hawkbears infesting the cloud domes. They push back
on the Powers, usually fruitlessly.

The first version of preflight is a hacked up FAI
deployment, running on a single VM on each of the datacenter management
servers, on a dedicated VLAN.

Once the racks hit the datacenter and the NETOPS shamans bring L2 up, the
servers are PXE booted into a custom rolled live image which runs through a
dozen or so shell scripts.

The scripts dump system inventory to a text file on an NFS mount (this being
Hell), and run basic stress testing. The SYSOPS team provides firmware upgrade
and BIOS configuration tools, which are pushed onto each box.

Scripts are written to validate the system inventory. When errors are found
(like cabling being in the wrong ports) JIRA tickets are cast like bones for
DCOPS to breakfix.

Once a full rack passes preflight, NETOPS is asked to flip the TOR switches’
tagged VLAN to production. The servers in the rack are booted and added to a
spreadsheet for the SYSOPS team to provision.

None of this takes very long to implement, which is good because most of it has
to be done during the second cloud build.

There are no resources in the empty plains of Hell for a development lab.

An OPER leaves the tribe.

The Projectionist is cast away by the Powers.

The cloud runs out of storage.

The Operationalization Orb

The Validator is catching the majority of initial errors they’d missed on
the first cloud build, but the process is still manual, still tedious. Building
a new cloud takes four to five months.

Manually managing automation incurs errors.

Being exhausted all the time incurs far more errors.

The Validator knew the next version needed to be less 2002 and more modern.
They hunted and felled a database, an API, UI. They wanted the Daemons and
DCOPS and NETOPS and SYSOPS to be able to look at a page that told them exactly
what was wrong with a server:

  • disk 4 is missing
  • missing RAM in slot 2A
  • rabid sandtrout caught in thermal shield
  • NIC0 should be in port 1:2 but is in 1:3

and so on.
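
A rough sketch of what that kind of check might look like, assuming the purchase order and the preflight findings have already been parsed into simple records. The `ServerSpec` and `ServerReport` shapes and the `validate` function are hypothetical, invented here for illustration; they are not the tribe's actual schema or code.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class ServerSpec:
    """What was bought for one server (hypothetical shape)."""
    disks: int
    ram_slots: List[str]
    nic_ports: Dict[str, str]  # NIC name -> switch port it should be cabled to


@dataclass
class ServerReport:
    """What a preflight run found on the box (hypothetical shape)."""
    disks_present: Set[int]
    ram_slots_present: Set[str]
    nic_ports: Dict[str, str] = field(default_factory=dict)


def validate(spec: ServerSpec, report: ServerReport) -> List[str]:
    """Return human-readable problems; an empty list means the box passes."""
    problems = []
    for disk in range(spec.disks):
        if disk not in report.disks_present:
            problems.append(f"disk {disk} is missing")
    for slot in spec.ram_slots:
        if slot not in report.ram_slots_present:
            problems.append(f"missing RAM in slot {slot}")
    for nic, expected in spec.nic_ports.items():
        actual = report.nic_ports.get(nic)
        if actual is None:
            problems.append(f"{nic} is not cabled")
        elif actual != expected:
            problems.append(f"{nic} should be in port {expected} but is in {actual}")
    return problems


# Illustrative data only: six disks ordered, one absent, one DIMM absent,
# and one NIC cabled to the neighbouring port.
spec = ServerSpec(disks=6, ram_slots=["1A", "2A"], nic_ports={"NIC0": "1:2"})
report = ServerReport(
    disks_present={0, 1, 2, 3, 5},
    ram_slots_present={"1A"},
    nic_ports={"NIC0": "1:3"},
)
for problem in validate(spec, report):
    print(problem)
```

The point of returning plain sentences rather than raw diffs is that the same list can be shown to an integrator, a datacenter tech, or a sysadmin without translation.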

There will be no more JIRA tickets. No more spreadsheets. There is only the
truth they make themselves.

They use the sharpened bones of a hawkbear to carve a schema into the database.
Janky code is written in Perl’s Catalyst. An API and a (pure HTML tables,
natch) UI are birthed.

The builds progress. The Powers That Were can be heard screaming beyond the
veil, demanding more bits, more bandwidth, more flesh. They spew anger and
rage, and the work is made more difficult.

NETOPS run cables from the clouds of the Powers to the heavens. Dozens are
grafted into the domes. Bits stream in from the Aether upwards of 150Gbps. It
is not enough. It will never be enough.

Scaling problems dig their way up from the depths. Drive firmware under heavy
write loads ceases to service reads. This is deemed excellent behavior for the
databases running on them.

Systems reboot themselves, opaque boxes, lying to the tribe’s shamans. A month
passes. Hourly calls every day with the vendor stonewalling. The systems reboot
daily; the cloud chokes on the bits it must be fed. Finally the Powers ordain
they will stop purchasing globally from this vendor unless a solution is found.

A solution is quickly found.

Beastblades Magnetically Aligned

New strong souls are found in the darkness. Their minds are sharp, tools
unmarred. They rewrite preflight, twice. The second time in Mojolicious.

The tribe uses preflight as a source of truth: What is where, what does it do,
is it working, what is wrong with it.

A tool is created for the Command Line, it helps to bring more advocates to the
work. A new web UI is written.

The more hardware-oriented Validators automate PDU and switch configuration.
Improvements are made to testing the CPU, RAM, and disks. Systems are made to
power themselves off multiple times, to try and shock their components into
failing.

The SYSOPS tooling progresses apace. NETOPS gains a new Director and provides
production configs to be burned into the switches during preflight.

Turnkey cloud is near.

A deadline is nearly missed as a vendor fails to schedule air freight to
actually pick up dozens of completed racks. They sit on tarmac for a day.

The tribe is on their fourth Project Manager and second or third VP/Ops.
Everything blurs together. The PM lasts two weeks. The VP a few months.

The Validators devise artifacts that can be placed on top of a freshly squeezed
rack, spinning its servers and reporting back to the API from anywhere on Hell’s
plains. Progress of the rack builds can be viewed live. The servers can be
reached wherever they are, as they are being built, through arcane encrypted
tunnels.

Problems are fixed before the racks ever reach the datacenter. Confidence grows
in what is being delivered. The process is re-run once the racks are installed
in the datacenter. Problems incurred during shipment are quickly repaired. The
number of on-site RMAs plummets to nearly none. Wasted time is regained.

Building a cloud now takes two months. Expansion a few weeks.

The Powers That Were stop the flow of money. They say the cloud is not being
built quickly enough, does not consume quickly enough. They insist the work
continue as they stake the murdered budgets outside the gates of the tribe. The
Powers demand to know why deadlines are missed.

Eventually the money resumes. The tribe has long since learned that their own
actions have little to do with the arbitrary behavior of the Powers.

They are unknowable.

The Insurrection

A lone soul comes in from the darkness. They are admitted, though none of the
women or minority voices among the tribe will be heard. The Powers ignore their
objections, insist again that only Velocity matters. More hands will mean more
results.

This soul immediately argues that the clouds are broken, the tribe is broken.
It will be fine, they know better. They produce nothing but words, but they are
words the Powers want to hear. The tribe is failing purposely, to make the
Powers That Were into the Powers That Will Never Be, out of spite.

Soon the found soul pulls in others, also lost or disaffected from within the
tribe. They argue the Powers are not being appeased quickly enough, not being
fed enough. It is a trick, they wrap their own ambition in devotion.

They cast selfishness as piety. They claim they can build a cloud in a weekend.
They claim they have done it, but show no one. They whisper to anyone who will
listen the tribe is full of morons, the CEO and CTO are fools leading the tribe
to ruin.

These turncoats work in secret; finally discovered: copyright assigned to
themselves and not the tribe, not even the Powers they claim to serve. The
tribe believes these turncoats undone – they surely will be cast away, the
distractions finished.

The turncoats flee to where the Powers reside and a Validator is sent after
them. Irreparable damage is done to the tribe in short order. A duel is
required. The tribe’s Praxis against the turncoats’ vapor. It is all a sham. The
Powers flexing.

Deeply messed up stuff happens, just like really seriously amazingly dumb
things.

The Powers give the turncoats a place to work, hidden from the rest of the
tribe. The tribe is shocked. The last shreds of morale are found hiding under
a bean bag chair, mourned, set alight, the ashes buried.

Never hire anyone without interviewers of a mix of backgrounds talking to
them first. Listen to your people. Heartbreak can be avoided.

Periodically the priest of these turncoats is sent back into the tribe. He
tells them how great the True Work is going. The tribe is not allowed to see
this work; they are assured it is amazing; the existing clouds are already
obsolete; they should just give up already.

The priest is challenged during an incredibly uncomfortable all-hands. The
Powers intervene. The priest assures the tribe he will always protect them, and
then he leaves. This happens several times. The tribe is dubious. The work
continues.

The CTO is exiled to an island surrounded by clever, insane kappa. He is given
books and has no one to talk to.

The First Validator walks into the darkness, head bent, tools discarded in the
dust.

A year passes. The turncoats unsurprisingly fail miserably. The Powers exile
them. Some of them somehow manage to end up in even worse places, there being
an unending number of just terrible places among the industries of the umbral
plains.

The Powers are finally fed at 200Gbps. They are unsatiated.

The CEO follows so many others into the desert.

The CTO crafts a boat from discarded hopes and sails off.

The Final Form Shambles On

The clouds sing, bellies full of encrypted baubles from millions of the unseen.

Much of the original tribe is shattered.

The Powers announce they will take their data to another cloud, a better cloud,
a heavenly cloud, and that the tribe will dismantle what they’ve painfully
borne.

This does not occur: the false heavens cannot feed them, cannot sustain them.

The work continues.

Post-script

This post brought to you by: three years of 10–16 hour days, a no-notice trip
from MEL to ICN to SFO to MEL in ten days, exponential burnout, foxhole
buddies, and taking an entire year off to recover.

Shout out to all the awesome people who did simply amazing work under the worst
conditions. You know who you are.

We all deserved better.

December 13, 2019

Day 13 - A Year of DevOps Days Around the Globe

By: JJ Asghar (@jjasghar)
Edited By: Jon Topper (@jtopper)

Introduction

Over the last year, I had the privilege of travelling the world, representing IBM, and speaking at DevOpsDays around the globe. I saw different cultures and demographics share in the teachings of digital transformation and DevOps cultural change. I saw many successes and a few failures, and I’d like to share them here. Additionally, after many years of attending DevOpsDays, I became an Organizer at my local event. I hope to make an impact with all I’ve learned and experienced in order to make my local event the best possible.

I’m the type of human that eats his vegetables first, so let me highlight some friction points before the grand successes. I’m going to do my best not to call out any specific DevOpsDays because I know organizers work inexhaustibly to provide the best experience possible to attendees. Of course, the necessary disclaimer applies here: this is simply my perspective.

Don’t let a single voice dominate the conversation

“Cult of Personalities” is a term that has become more mainstream recently and I believe it runs rampant in our DevOpsDays culture. It’s good to see champions of this cultural movement but I started to see an almost “cult” reverence for specific humans.

I saw open spaces with 10-20 people and only that one personality would hold the conversation. I noticed that the ritual of the Open Spaces wasn’t enforced, and the saying “you have two ears and one mouth, listen 2x then speak” went ignored. In other words, we need to practice as we preach, be more inclusive, and encourage more participation from those that shy away from the spotlight. Easier said than done, I know. I’m the guy that had an ignite talk about surviving conferences as an introvert. Shameless self-promotion here: https://www.youtube.com/watch?v=JqwgmePMEw4

Delegate responsibilities

I saw the personality problem in the organizer space too. There was one event where every decision seemed to go through one human. This individual became the be-all and end-all for any decision, ranging from tactical issues with AV to signing the check for the catering. This individual was everywhere and didn’t seem to empower any “sub-team” to make decisions. You could see the exhaustion in this person’s eyes by the end of the day. Being a one-person team is not in the spirit of DevOpsDays. The moral of this observation is, where you can, empower the community and other organizers and lean on them: that’s why they volunteered.

Reiterate the format, and the advantages of Open Spaces

I mentioned the Open Space ritual earlier and I want to revisit it. I am a very strong proponent of the Open Spaces; years ago it’s what brought me into the DevOps world. One constant thing I saw was that if there wasn’t a clear explanation of what Open Spaces are, they would fall into some weird half Open Space, half presentation, half people sitting staring at each other. (Yes, I have three halves there; it was weird and not always all three.)

Daniel “phrawzty” Maher has an amazing slide deck on Open Spaces at https://github.com/phrawzty/open_spaces.deckset. It covers the rules and “law” of Open Spaces and offers some strong suggestions on how to make them successful. Many of us have seen this ritual multiple times, but just like your safety briefing on a plane, it’s important because you never know who hasn’t seen it before and it’s always good to have a refresher.

Something I saw at only one or two events was a volunteer moderator. (There were calls for note-takers too, but only once was this successful). The volunteer moderator was a great tool to make sure lesser voices were heard, but at the same time if they weren’t careful they could cause issues. Moderators are good, but they can cut both ways.

Some of the best run DevOpsDays were not the ones with the longest history, or the newest ones, but the ones that you could tell would iterate and pivot real-time. They trusted the teams or humans who were responsible for their sections, and empowered them to make decisions. They both preached and practiced the DevOps culture, and it paid off.

Provide a dedicated space for Speakers and Introverts

There were a couple of DevOpsDays that had intentionally provided on-call and quiet rooms. There was a saying I heard, “Using a laptop at a conference is like a virus, one person starts working and it gives permission for others to start working and conversations and interactions die out.” If you give people a room to work at, in a limited space, you can contain the spread and also make sure the people that do need to work on an emergency can.

Quiet rooms gave humans some space to decompress and recharge themselves. Having power squids, water and some tables and chairs also allowed humans to get their phones charged. There was one event that doubled up the quiet room as a mother’s room; it worked great and promoted inclusivity for an underrepresented demographic. It’s a good sign for next year. There was a third type of room for humans needing to respond to on-call situations. In addition there was a small extra room with a live stream of the event. This had a few round tables and water, and was off in the corner, but it was amazing. I actually spent a lot of time in that room, engaging with cohorts while being able to see the main stage and hack on some code. I had a game called Love Letter in my backpack and got quite a few rounds in and met some new gaming friends. Reference to my introverted side from above.

As a speaker at all of these events I spent a lot of time in the “green rooms”. I had an interesting observation about green rooms. First, they seem to be a US-centric thing: the EU events didn’t give us speakers space. As a speaker, I have a routine before I speak and I’d have to find a spot to do it. I strongly recommend giving your speakers a room; you never know if a speaker needs that space.

One thing I saw in the green room that I want to see at all the DevOpsDays was the “block of help.” I’ve seen the random volunteer sitting in the room approach, but that never seemed to work. Out of the few I saw, they always seemed bored, stuck there never knowing what to do, and that’s where this “block of help” comes into play. Every organizer and most volunteers have walkie talkies. At this event, in the green room, there was a walkie with big letters that said “HELP”. We were told if we needed anything, all we had to do was press the talk button and we had direct access to the team. Such a simple idea and it worked so well.

Acknowledge people have different ways of interacting

The final win I want to mention is board games. Like quiet rooms, board game rooms are becoming more and more mainstream. I went to a couple of DevOpsDays that had a boardgame portion to the evening event. Most were at bars, but some found a bar or event location that had a quiet place to have a beer and play a collection of games. This worked amazingly, it allowed for people not to feel pressured to drink, and also gave people who normally don’t frequent drinking establishments a place to enjoy socializing.

Conclusion

Thanks for taking the time to read this post about what I’ve learned visiting DevOpsDays around the world. I’m taking all of these lessons learned to my local DevOpsDays, and already had some amazing feedback on some of these influenced ideas. Hopefully, this has highlighted some easy things our community and organizers can iterate on. For instance, start involving underrepresented attendees in open spaces, allowing for more diverse voices.

I really hope that organizers engage more with their organizing teams and allow the group to take the brunt of the stress, instead of only a couple of people carrying most of the water. Finally, I really hope that organizers realize that speakers are not only attendees but do need some space to focus their energies, in order to put their best foot forward. Having this space, like the on-call room, giving people the space they need, can allow everyone to feel empowered.

I really see DevOpsDays only growing more and more. It’s a cultural movement that encourages people to bring their best, and collaborate.

December 12, 2019

Day 12 - Observability

By: Ramez Hanna (@informatiq)
Edited By: Kirstin Slevin (@andersonkirstin)

TL;DR Observability is about people and practices. You don’t need a dedicated team, you need people who care.

Bonus points This applies to many other things, not just observability.

Disclaimer

I do not take full credit for all that I am going to share.

This is the result of my learning from people, books and experience.

This is my view on the subject, hence you can disagree with me.

What is Observability?

According to Wikipedia, it is

“ A measure of how well internal states of a system can be inferred from knowledge of its external outputs. ”

When I read that it was so clear and yet so mysterious.

Trying to make sense of that definition in my context I came up with this simplification.

The act of exposing state, and being able to answer 3 questions:
-> what is the status of my system?
-> what is not working?
-> why is it not working?

Let’s inspect that definition closely, starting with “The act of exposing state”: this is the intent, the conscious action.

It’s about instrumenting the code to expose state and data about itself that will help in understanding it.

The goal is not to expose what we know we want to monitor (known unknowns); rather, the goal is to expose more data and add as much context as possible to enable the discovery of new failure modes (unknown unknowns).

This will enable us to answer the three questions.

The goal of observability is to get as close as possible to knowing the cause of the issues that impact the performance of systems, hence improving the response time and the MTTR (Mean Time To Recovery).
To make it more concrete, let’s look at this example:

The Universe company is using Graphite and Grafana for their metrics, and ELK stack for their logs.
Team Earth instrumented their code to expose the necessary metrics. They thought carefully about what metrics are important to their service, how to collect these metrics, and they carefully crafted their logs to have enough context.
They also put in place probes that will query their service and report status as perceived by clients instead of relying only on metrics exposed by the service.
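
To make the idea of a client-side probe concrete, here is a minimal sketch using only the Python standard library. The `probe` function, its field names, and the `opener` parameter are my own illustrative assumptions, not Team Earth's code or any particular tool's API; a real setup would more likely use an existing prober and ship the result to the metrics system.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, Dict, Union


def probe(url: str, timeout_s: float = 2.0,
          opener: Callable = urllib.request.urlopen
          ) -> Dict[str, Union[str, int, float, bool]]:
    """Fetch `url` the way a client would and report what the client saw."""
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout_s) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code  # the service answered, but with an error code
    except (urllib.error.URLError, OSError):
        status = 0         # no answer at all: DNS failure, refused, timeout
    latency_s = time.monotonic() - started
    return {
        "url": url,
        "status": status,
        "latency_s": round(latency_s, 3),
        "ok": 200 <= status < 400 and latency_s <= timeout_s,
    }
```

Run on a schedule from outside the service, something like `probe("https://example.com/health")` (a placeholder URL) reports success and latency as a client experiences them, which is exactly the signal that metrics emitted by the service itself cannot provide when the service is unreachable.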

On the other hand, team Mars only had the metrics exposed by the framework they use.
Their logs were verbose, unstructured text and they relied on the basic health checks, which are basically a ping check to their homepage.
Both teams use the same tools in an effort to observe their systems but the result is not the same.
Team Earth during an incident will be able to see how their service’s performance is perceived by clients, and be able to follow the metrics/signals through the different components until they identify a certain metric that is not within thresholds.
They will then look at the logs, where they will be able to see more details about the anomaly and work to fix it.

Team Mars can look at their metrics, but they won’t necessarily find a metric that is out of the norm, so they will go over to the logs and sift through all those blobs of text, scrambling to make sense out of them.

They end up finding a fix, but the effort and frustrations leave them demotivated.

This shows that observability is about what people do with the tools.
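
The difference between the two teams' logs can be sketched in a few lines. This is an illustration under my own assumptions: the `JsonFormatter` class, the `context` key, the service name, and the field names are all hypothetical, and it uses only Python's standard `logging` and `json` modules rather than whatever either team actually ran.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object so fields can be searched."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": round(record.created, 3),
            "level": record.levelname,
            "message": record.getMessage(),
        }
        # Anything passed as extra={"context": {...}} is attached to the record.
        payload.update(getattr(record, "context", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(name: str) -> logging.Logger:
    """Create a logger that writes structured lines to stderr."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("checkout")  # hypothetical service name

# Team Mars style: the facts are buried in free text.
log.warning("request to payments failed after 2.3s, retrying")

# Team Earth style: the same event, with the context as separate fields.
log.warning(
    "upstream call failed",
    extra={"context": {"upstream": "payments", "duration_s": 2.3,
                       "attempt": 1, "request_id": "abc123"}},
)
```

Both calls go through the same logger and the same tooling. Only the second produces a line that can be filtered by `upstream` or joined to a trace by `request_id`, and that choice was made by the engineer who wrote the log call.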

Who is it for, who will be implementing it really?

Observability is best implemented by the engineers that wrote the code, since they know their systems the best.

I cannot implement observability for all the engineers, but I can enable them to observe their services, showing them how to best observe, monitor, and understand their systems.

My users are the heart of observability, without their involvement and their cooperation I will not succeed at my mission.

Observability is about people.

It comes down to engineers following best practices, understanding what needs to be observed, how it should be observed and how to use that knowledge to improve the reliability of their services.

Observability is about people and practices.

How to implement Observability?

Before implementing observability, I must ask “WHY?”
Why would I want to implement observability?
Well to make our company better at what it does, right? That’s why I was hired in the first place.
Observability should help my company be better at reacting to outages or any issue for that matter.
Engineering will be better because of observability, if correctly implemented.
Keeping that in mind helps set the stage for the work involved in the implementation.
So my mission is to enable the engineering teams through the following:

  • Talk/advocate/train engineers about the principles
  • Provide support when they start applying this knowledge
  • Select the tools that are best suited for my company, whether self-hosted or SaaS
    • Understand the tools’ strengths and limitations and explain those to users

Observability in real life

At Criteo we have 600+ engineers and an Observability team of 5 engineers.
With that ratio, there is no way the Observability team can take the responsibility to implement everything.
The Observability team provides the necessary foundation to enable the teams to observe. This includes:

  • Develop and deploy tools to allow for exposing and visualizing state
  • Integrate the tools with the internal ecosystem
  • Provide support for using the tools
  • Write documentation
  • Drive the adoption of best practices by working closely with the different teams

The team deploys different tools and develops the glue to integrate them to have a coherent ecosystem. For example, this might look like:

  • BigGraphite as the long term storage for metrics
    • This is the main Metrics database, where we store metrics. It is also used as long term storage for Prometheus.
  • Prometheus for metrics collection, aggregation and alerting
  • Alertmanager to route alerts
  • Various other tools for tying it all together with sane defaults

Keeping our focus on user enablement, we always try to find ways to improve the experience of our users.
One successful Observability team initiative was to dedicate one member of the team for 3 days every sprint to work alongside another engineering team, to observe how they interact with the observability tools, how they define their service level objectives, and understand their alerting needs. Through this process, the member of the Observability team was able to spot areas that needed improvement and show how to fix them.
It was mutually beneficial, as the Observability team learned more about users and their needs, and the users improved their ability to observe their systems.

Final word on tools

Vendors will try to sell me observability, but these are tools. Some are good and some are bad, and some are average, but no one can sell me observability.
Observability is more about people and practices - no matter what tools you use, if you don’t know what you’re doing it won’t work.
People are creative and they will find ingenious ways of using the tools to fit their thinking instead of adapting their thinking to the tools.
So tools are crucial but they are not where the focus should be. Ultimately I should be careful to choose the tools that make it easier for my users to exercise the best practices and the principles of observability.