December 12, 2021

Day 12 - Terraform Refactoring

By: Bill O'Neill (@woneill)
Edited by: Kerim Satirli (@ksatirli)

Terraform is "Infrastructure as Code" and, like all code, it benefits from being reviewed and refactored to:

  • improve code readability and reduce complexity
  • improve the maintainability of the source code
  • create a simpler, cleaner, and more expressive internal architecture or object model to improve extensibility

This article outlines the approaches that have helped my teams when refactoring Terraform code bases.

Convert modules to independent Git repositories

If your Terraform Git repository has grown organically, you will likely have a monorepo structure complete with embedded modules, similar to this:

$ tree terraform-monorepo/
.
├── README.md
├── main.tf
├── variables.tf
├── outputs.tf
├── ...
├── modules/
│   ├── moduleA/
│   │   ├── README.md
│   │   ├── variables.tf
│   │   ├── main.tf
│   │   ├── outputs.tf
│   ├── moduleB/
│   ├── .../

Encapsulating resources within modules is a great step, but the monorepo structure makes it difficult to iterate on individual modules down the line.

Splitting the modules into independent Git repositories will:

  • Enable module development in an isolated manner
  • Support re-use of module logic in other Terraform code bases, across your organization
  • Enable publishing to public and private Terraform Registries

Here's a process that you can follow to make a module a stand-alone Git repository while preserving the historical log messages. The steps are examples of how to extract moduleA from the above file tree into its own git repository.

  1. Clone the Terraform Git repository to a new directory. I recommend naming the directory after the module you plan on converting.
    git clone <REMOTE_URL> moduleA
  2. Change into the new directory:
    cd moduleA
  3. Use git filter-branch to split out the module into a new repository:
    FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch --subdirectory-filter modules/moduleA -- --all

    Note that we're squelching the warning about filter-branch. See the filter-branch manual page for more details if you're interested.

  4. Now your directory will only contain the contents of the module itself, while still having access to the full Git history.

    You can run git log to confirm this.
  5. Create a new Git repository and obtain the remote URL for it, then update the origin in the filtered repository:
    git remote set-url origin <NEW_REMOTE_URL>
    git push -u origin main
    
  6. Tag the repo as v1.0.0 before making any changes:
    git tag v1.0.0
    git push --tags
    
  7. Now that the new repository is ready to be used, update the existing references to the module to use a source argument that points to the tag that you just created.

    The “Generic Git Repository” section in Terraform's Module Sources documentation has more details on the format.

    Replace lines such as

    source = "../modules/moduleA"


    with

    source = "git::<NEW_REMOTE_URL>?ref=v1.0.0"
    
  8. Alternatively, publishing your module to a Terraform registry is an option (but this is outside the scope of this article).
  9. Once all source arguments that previously pointed to the directory path have been replaced with references to Git repositories or Terraform registry references, delete the directory-based module in the original Terraform repository.
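
As a rough sketch of what step 7 produces, a calling configuration might end up with a module block like the one below. The module label module_a and the repository URL are hypothetical placeholders rather than values from the original repository; substitute your own remote URL and keep whatever input variables your module already takes.

```hcl
# Hypothetical caller of the extracted module; the label and URL are placeholders.
module "module_a" {
  # Before the split this was: source = "../modules/moduleA"
  # "git::" selects Terraform's generic Git source; "ref" pins the v1.0.0 tag.
  source = "git::https://example.com/your-org/moduleA.git?ref=v1.0.0"

  # Input variables for the module, if any, stay unchanged.
}
```

After changing a source argument, run terraform init again so that Terraform downloads the module from its new location.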

Update version constraints with tfupdate

Masayuki Morita's tfupdate utility can be used to recursively update version constraints of Terraform core, providers, and modules.

As you start refactoring modules and bumping their version tags, tfupdate becomes an invaluable tool to ensure all references have been updated.

Some examples of tfupdate usage, assuming the current directory is to be updated:

  • Updating the version of Terraform core:
    tfupdate terraform --version 1.0.11 --recursive .
  • Updating the version of the Google Terraform provider:
    tfupdate provider google --version 4.3.0 --recursive .
  • Updating the version references of Git-based module sources can be done with the module subcommand, for example:
    tfupdate module git::<REMOTE_URL> --version 1.0.1 --recursive .
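
To illustrate what the first two commands change, here is a sketch of a terraform block as it might look after they have run. It assumes the configuration already declared a required_version and a google entry in required_providers; the surrounding file is hypothetical.

```hcl
terraform {
  # Set by: tfupdate terraform --version 1.0.11 --recursive .
  required_version = "1.0.11"

  required_providers {
    google = {
      # version is set by: tfupdate provider google --version 4.3.0 --recursive .
      source  = "hashicorp/google"
      version = "4.3.0"
    }
  }
}
```

Reviewing the resulting diff with git diff before committing is a cheap way to confirm that only the intended constraints changed.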

Test state migrations with tfmigrate

Many Terraform users are hesitant to refactor their code base, since changes can require updates to the state configuration. Manually updating the state in a safe way involves duplicating the state, updating it locally, then copying it back in place.

In addition to tfupdate, Masayuki Morita has another excellent utility, tfmigrate, that can be used to apply Terraform state operations in a declarative way while validating the changes before committing them.

You can do a dry run migration where you simulate state operations with a temporary local state file and check that terraform plan shows no changes after the migration. This workflow is safe and non-disruptive, as it does not actually update the remote state.

If the dry run migration looks good, you can use tfmigrate to apply the state operations in a single transaction instead of multiple, individual changes.

Migrations are written in HCL and use the following format:

migration "state" "test" {
  dir = "."
  actions = [
    "mv google_storage_backup.stage-backups google_storage_backup.stage_backups",
    "mv google_storage_backup.prod-backups google_storage_backup.prod_backups",
  ]
}

Each action line is functionally identical to the command you’d run manually such as terraform state <action> …. A full list of possible actions is available on the tfmigrate website.

Quoting resources that have indexed keys can be tricky. The best approach appears to be using a single quote around the entire resource and then escaping the double quotes in the index. For example:

actions = [
    "mv docker_container.nginx 'docker_container.nginx[\"This is an example\"]'",
]

Testing the state migrations can be done via tfmigrate plan <filename>. The output will show you what terraform plan would look like if you had actually carried out the state changes.

Applying the migration to the actual state is done via tfmigrate apply <filename>. Note that by default, it will only apply the changes if the plan after the migration is clean, just as tfmigrate plan checks.

If you still want to apply changes to a “dirty” state, you can do so by adding a force = true line to the migration file.
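
For example, a sketch of the earlier migration with the force option added might look like this. It reuses the resource addresses from the example above; whether forcing is appropriate depends on why the plan is not clean in the first place.

```hcl
migration "state" "test" {
  dir = "."
  # Apply the state operations even if terraform plan still reports changes.
  force = true
  actions = [
    "mv google_storage_backup.stage-backups google_storage_backup.stage_backups",
    "mv google_storage_backup.prod-backups google_storage_backup.prod_backups",
  ]
}
```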

If you are using Terraform 1.1 or newer, there is now a built-in moved block that works similarly to these approaches. I haven’t tested it out yet but it looks like a useful feature! I can see it being especially useful for users who may not have direct access to state files, such as Terraform Cloud and Enterprise users or Atlantis users.

See the announcement in the 1.1 release as well as the HashiCorp Learn tutorial for more details.
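
As a minimal, untested sketch, the first rename from the tfmigrate example above could be expressed with a moved block roughly as follows. The resource addresses are reused from that example for illustration, and the block assumes the resource has already been renamed to stage_backups in the configuration.

```hcl
# Tells Terraform 1.1+ to treat the old address as the new one,
# so the rename is planned as a move instead of a destroy and create.
moved {
  from = google_storage_backup.stage-backups
  to   = google_storage_backup.stage_backups
}
```

Because the block lives in the configuration itself, the rename is reviewed and applied through the normal plan and apply workflow rather than through direct state manipulation.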

Ensure standards compliance with TFLint

According to its website, TFLint is a Terraform linter with a handful of key features:

  • Finding possible errors (like illegal instance types) for major Cloud providers (AWS/Azure/GCP)
  • Warning about deprecated syntax and unused declarations
  • Enforcing best practices and naming conventions

TFLint has a plugin system for including cloud provider-specific linting rules as well as updated Terraform rules. Setting up the list of rules can be done on the command line but it is recommended to use a config file to manage the extensive list of rules to apply to your codebase.

Here is a configuration file that enables all of the possible Terraform rules and includes AWS-specific rules. Save it in the root of your Git repository as .tflint.hcl, then initialize TFLint by running tflint --init. Now you can lint your codebase by running tflint.

config {
  module              = false
  disabled_by_default = true
}

plugin "aws" {
  enabled = true
  version = "0.10.1"
  source  = "github.com/terraform-linters/tflint-ruleset-aws"
}

rule "terraform_comment_syntax" {
  enabled = true
}

rule "terraform_deprecated_index" {
  enabled = true
}

rule "terraform_deprecated_interpolation" {
  enabled = true
}

rule "terraform_documented_outputs" {
  enabled = true
}

rule "terraform_documented_variables" {
  enabled = true
}

rule "terraform_module_pinned_source" {
  enabled = true
}

rule "terraform_module_version" {
  enabled = true
  exact = false # default
}

rule "terraform_naming_convention" {
  enabled = true
}

rule "terraform_required_providers" {
  enabled = true
}

rule "terraform_required_version" {
  enabled = true
}

rule "terraform_standard_module_structure" {
  enabled = true
}

rule "terraform_typed_variables" {
  enabled = true
}

rule "terraform_unused_declarations" {
  enabled = true
}

rule "terraform_unused_required_providers" {
  enabled = true
}

rule "terraform_workspace_remote" {
  enabled = true
}
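
To give a sense of what these rules catch, here is a hypothetical snippet showing a declaration that the terraform_naming_convention rule would be expected to flag, assuming its default snake_case format, next to one that passes. The variable names are made up for illustration.

```hcl
# Expected to be flagged by terraform_naming_convention:
# the hyphenated name is not snake_case.
variable "backup-bucket" {
  type        = string
  description = "Name of the bucket used for backups."
}

# Expected to pass: snake_case name, plus a type and a description to
# satisfy terraform_typed_variables and terraform_documented_variables.
variable "backup_bucket" {
  type        = string
  description = "Name of the bucket used for backups."
}
```

Note that terraform_unused_declarations would also report either variable if nothing in the module references it.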

pre-commit

Setting up git hooks with the pre-commit framework allows you to automatically run TFLint, as well as many other Terraform code checks, prior to any commit.

Here is a sample .pre-commit-config.yaml that combines Anton Babenko's excellent collection of Terraform-specific hooks with some out-of-the-box hooks for pre-commit. It ensures that your Terraform commits are:

  1. Following the canonical format and style per terraform fmt
  2. Syntactically valid and internally consistent per terraform validate
  3. Passing TFLint rules
  4. Ensuring that good practices are followed such as:
    • merge conflicts are resolved
    • private ssh keys aren't included
    • commits are done to a branch instead of directly to master or main

repos:
  - repo: https://github.com/antonbabenko/pre-commit-terraform
    rev: v1.59.0
    hooks:
      - id: terraform_fmt
      - id: terraform_validate
      - id: terraform_tflint
        args:
          - '--args=--config=__GIT_WORKING_DIR__/.tflint.hcl'
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.0.1
    hooks:
      - id: check-added-large-files
      - id: check-merge-conflict
      - id: check-vcs-permalinks
      - id: check-yaml
      - id: detect-private-key
      - id: end-of-file-fixer
      - id: no-commit-to-branch
      - id: trailing-whitespace

You can take advantage of this configuration by:

  • Installing the pre-commit framework per the instructions on the website.
  • Creating the above configuration in the root directory of your Git repository as .pre-commit-config.yaml
  • Creating a .tflint.hcl in the base directory of the repository
  • Initializing the pre-commit hooks by running pre-commit install

Now whenever you create a commit, the hooks will run against any changed files and report back issues.

Since the pre-commit framework normally only runs against changed files, it’s a good idea to start off by validating all files in the repository by running pre-commit run --all-files.

Conclusion

These approaches help make it easier and safer to refactor Terraform codebases, speeding up a team's "Infrastructure as Code" velocity.

These approaches helped my team gain confidence in making changes to our legacy modules and enabled greater reusability. Standardizing on formatting and validation checks also sped up code reviews, since we could focus on module logic instead of looking for typos or broken syntax.

December 11, 2021

Day 11 - Moving from Engineering Manager to IC

By: Brian Scott (@brainscott)
Edited by: Don O'Neill (@sntxrr)

Within the past month, I've made a radical change into a new role at my existing employer. For the past decade, I was an SRE Manager building teams and a Tech Executive. I hope to summarize my experience of moving into an IC role, including how it made me feel. The thoughts and ideas in this article are my own opinions, drawn from past experience.

For the past 6-8 years, I've been in an Engineering Manager/TechExec role, specifically in Systems Reliability Engineering. I was comfortable, happy, and engaged in this role, managing multiple SRE teams supporting a wide range of products & platforms in the Enterprise.

Before we dive in deeper, a little history about me: I've been playing with technology since I was in 5th grade. My English teacher at the time taught me everything he knew about repairing computers (primarily 286s and 386s), DOS, and the BASIC programming language.

As I transitioned into 8th grade, entering high school, my computer teacher approached me to ask if I wanted to help administer the school's network of 12 Windows NT servers running Active Directory, Exchange, and file services, with over 4000 workstations and printers. Apparently, my 5th-grade teacher had passed along a few tidbits about what I was doing in Computer Science in middle school.

Little did I know that accepting the position was where my journey began. A few startups (MySpace, remember that?) and mid-to-large corporations later, I ended up in Engineering Management, primarily focused on building teams that support large-scale applications both on-prem and in the cloud, with a focus on delivering solutions with a DevOps culture and an SRE mindset.

I've been used to building high-performing engineering teams and meeting new and amazing engineers while focusing on creating T-shaped teams. This is not necessarily a new concept, but it is one that worked well for my teams. During this time, we had an amazing leadership team that pushed us to go above and beyond, and every day we met new product teams across the company that needed our help in delivering great solutions. In certain organizations, highly technical roles can be treated as semi-management.

We introduced several new technologies and concepts to the company as a whole, developing many Communities of Practice around Config Management, Containers, CI/CD, and even Web Development with Go. With the vast coverage of different areas that the company was working in, I found myself slowly moving into a space for which the company had never had a role. More on this in just a bit.

Before moving into Management, I was a Staff SRE (Systems Reliability Engineering). You might be thinking, “Isn’t it Site Reliability Engineering?” Yes, but different companies tailor the meaning of SRE to meet the needs within their respective areas. In my case, we weren’t just managing sites and web applications but systems that handle a wide range of products in the Entertainment & Media space. Think rendering, control systems, and safety systems.

As a Manager, I started seeking and making new connections across the enterprise, assisting teams in onboarding the latest technology, whether that be LiDAR, Kubernetes, understanding GitOps and Docker, or new tools that were bursting with innovation in the Open Source space. While I was good at helping others and always saying “YES”, I quickly found myself spread quite thin managing 5 different SRE teams of roughly 3-5 members each, supporting over 3000 applications, some of which were centralized services for the entire enterprise to consume. It was also getting a little hard for me to stay current with the technology, which I loved.

Leadership quickly saw my success in evangelizing new technology and helping our business units move fast in adopting new methods of engineering. That meant not only introducing new technology but also ensuring our SREs had the proper tools and were aware of up-and-coming automation tools that could help them reduce toil and accelerate how we delivered value to our customers, internally and externally.

My leader called me into a meeting to discuss my interest in moving into an SRE role. Instead of a pure Engineering role, though, he wanted me to lead the company’s effort in evangelizing new technology. He went on to explain the value and vision behind the idea: it would allow me to expand my reach and support more teams by helping create an organization around Developer Advocacy, mentoring our entire Global SRE Organization to the next level, and inspiring others in areas such as Empathy Engineering, automation, best practices, and what’s next in driving technical leadership.

I was a bit taken aback but excited. There was also a bit of nervousness, of course, about how the move might affect my relationships with each of my engineers. Over the next few weeks, my teams and leadership were very supportive and believed that I was needed in this new role to make a bigger impact on the organization and the company as a whole.

Never be discouraged if you find yourself moving into an IC role; new opportunities have a great way of nudging you in the right direction. People often think that moving up the ladder means success, but we have all seen incredible people thrive in IC roles, such as Kelsey Hightower at Google or Jessie Frazelle of Oxide Computer. Humans do their best work when they are positioned to do the things they love and are given room to reach new heights.

December 10, 2021

Day 10 - Assembling Your Year In Review

By: Paige Bernier (@alpacatron3000)
Edited by: Jennifer Davis (@sigje) and Scott Murphy (@ovsage)

Intro

There are a few moments in my career when I have been struck by a story told with data. When I set out as a Site Reliability Engineer into the big wide world, I wanted to capture that data storytelling magic, and I have adapted it into a presentation I call the “Year in Review”.

My first company had a tradition of taking a moment to pause and review the year by the numbers. The showstopper was the chart showing the amount of data ingested year over year since the founding.

In a single glance that chart conveyed a story that would take hours to tell!

It communicated the incredible efforts the employees took to scale the system to handle ingesting, processing, publishing and storing an ever increasing mountain of data. It illustrated how far the company had come and we were confronted head on with the realization that “what got you here, won’t get you there”.

The biggest impact I have seen comes after the presentation. Discussions from Year in Reviews have sparked sweeping oncall management changes as well as minor, but important, changes in the way developers engage with the SRE team.

Before diving into implementation details, let’s look at why this type of data storytelling is such a powerful tool by examining the core purpose of SRE.

The Mission of SRE

The mission of an SRE team is to improve system reliability by facilitating change.

System reliability is the sum of hundreds of decisions humans make when developing, deploying, and maintaining software systems; it is not an intrinsic property [1] of the systems (Patrick O’Connor, 1998). SRE job descriptions tout phrases like “evangelize a DevOps culture” and “influence without authority”, acknowledging our roles as change agents.

And as often heard, “change is hard”. As change agents, we are often faced with conflicting priorities, multiple stakeholders internal and external, and fear of the new and unknown.

However, just as often we hear “change is the only constant”. Whether it’s hardware improvements, operating system upgrades, security vulnerability announcements, software dependencies, or the software that we manage as a service, we are constantly monitoring and implementing change.

Combine these two axioms, for extra difficulty:

Ask any engineer who has been forced into a major operating system upgrade when the version of software they’re running requires the previous OS.

As an SRE I often want to make changes across the entire engineering organization such as developing oncall onboarding, ensuring that we are monitoring the customer’s experience, clarifying the lines of responsibility between developers and operators and more!

Changes like these, which affect everyone, are difficult to implement effectively until the answer to both of these questions is yes:

  • Is there a shared understanding of the current state?
  • Is there agreement that the current state needs to change?

This does not mean there needs to be consensus on what changes need to be made!

Is there a shared understanding of the current state?

The answer to this can be a resounding “Yes!” after your Year in Review presentation. Here’s why:

Humans learn best from stories, feelings, senses, and opinions, commonly known as qualitative data. Focusing on these exclusively, you risk coming to broad conclusions without nuance or context.

Businesses claim to operate on data, facts, and figures, or quantitative data. Focusing purely on the numbers, you risk having too many details that lead down irrelevant rabbit holes.

In fact, the two seemingly disparate viewpoints aren’t at odds at all. You can even validate findings by using the other category of data.

Feel: “Our monitoring sucks, none of the last 5 pages I got were actionable”

Fact: The primary oncall was paged 5 times out of business hours last week

Finding: Team X is getting paged frequently for non-actionable reasons

Hosting a “Year in Review” means weaving a story using the quantitative data about what occurred in your systems with the qualitative “anec-data” from a human perspective to build a foundation to introduce change.

Is there agreement that the current state needs to change?

This is a more complex endeavor - identifying and implementing change is the hard work of collaborating across teams, roles and competing incentives, motives, and needs. Think of “Year in Review” as a springboard for driving discussion and debate to align on “do we agree something needs to change?”

What does this look like in practice?

At a previous company, I heard from engineers and managers alike that the oncall rotations were in need of a shake-up. This was an excellent starting place: everyone agreed that there was a problem, but they were having trouble implementing the necessary changes.

With the goal of identifying exactly what the oncall issues were, my team tailored a “Year in Review” focused mainly on oncall metrics such as alert noise, hours oncall per engineer, and pages received per engineer. Slides illustrated the deluge of alert storms, which no human could possibly investigate in a given shift and which were largely unactionable noise. The impact of not addressing this problem was clear: we were likely missing important signals in the noise, and oncall engineers weren’t able to effectively prioritize their time.

After reviewing the data as a group, my team facilitated a brainstorm to address the barriers to changing the rotations:

  • How to handle ownership when multiple teams contribute code?
  • What are the “hot potato” services no one feels comfortable owning?
  • What services are unofficially owned by a single engineer and need documentation?
  • What is the goal of a low urgency or warning alert?

Based on the main discussion and others in standups and sidebars, my team proposed new team-service ownership and rotations. Several weeks and a few rounds of revisions later, we merged the PR with our new Terraformed oncall rotations!

DIY “Year in Review”

So, how do you create a “Year in Review” for an SRE team? To start, I typically have a few things in mind about what I think happened and what the data will show. It is fascinating to see where your perception of the system and reality diverge. You can kick off your process by asking a couple of questions:

  • What story are you expecting the data to tell?
  • What changes do you think need to be made in the next year to improve reliability?

Then work through the following steps:

  1. Book a meeting with all parties (including engineers, managers, SRE, QA, ops, and product managers). If there is an existing meeting like an All-Hands or Demo Hour, sign up for a presentation slot.
  2. Kick off a brainstorming session and have participants list out possible changes to include, such as new features launched, infrastructure expansions to new regions, or even a doubling of the organization's size.
  3. Ask teams (including managers):
    1. What data they would find interesting
    2. What data they could contribute from their domain
  4. List the company-specific tooling for data sources like:
    1. Version Control
    2. CI/CD
    3. Monitoring
    4. Incident Management
    5. Ticket tracking system
    6. Documentation store
    7. Support ticket system
  5. Enlist the help of others to gather the interesting metrics over the past year or year over year. Some suggestions are:
    1. Noisiest alerts
    2. Number of environments
    3. Oncall engineers
    4. Number of services
    5. Ratio of oncall engineer to number of services oncall for
    6. Age of dependencies/libraries
    7. # of hours oncall per person
    8. Number of features launched
    9. # of after hour pages
    10. Ratio of warning alerts to pages
    11. Number of production deploys rolled up by day
    12. Number of open incident action items (AIs)
    13. Ingress traffic or other indicator of system load
    14. Most viewed documentation pages
    15. Most searched documentation terms
    16. Time to first PR
    17. ….and so much more!
  6. Slice and dice the data trying out top 10 lists, total sum, or segment by using whatever constructs your company has such as:
    1. Department
    2. Service
    3. Team
    4. Product Feature
  7. Group the data into themed areas such as “oncall”, “production”, and “onboarding”. If you have convinced folks to co-present with you, each person can be responsible for presenting a different theme.
  8. Assemble into a slide deck with one chart per slide to maximize impact
  9. Hold the meeting and present your findings.
  10. Discuss! In the meeting, after the meeting, and before the next Year in Review, compare how you interpreted the data with how others did.
  11. Publish the data and your queries so everyone can explore and answer their own questions

Parting Thoughts

SREs are uniquely suited to facilitate a Year in Review, bringing a system-wide perspective on the people, processes, and technology, along with a mission to improve reliability. Keep in mind that much like effecting change, hosting a Year in Review is not a solo effort!

Going solo means you will capture only YOUR thoughts, which would almost certainly be tempered by the unique vantage points of others. The more perspectives you invite, the fuller the story of your system will be.

Please share your favorite data storytelling moments or Year in Review stats with me on Twitter at @alpacatron3000

Citation

O’Connor, P. (1998) Standards in reliability and safety engineering [Article]. Elsevier Science Limited, 9 Dec. 2021.

https://www.sciencedirect.com/science/article/abs/pii/S095183209883010X

Notes


  1. Since the SRE field is still getting established outside of Google, I started to read perspectives from Reliability Engineering in other disciplines. A nugget from Patrick O’Connor’s “Standards in reliability and safety engineering” paper sparked a spicy but important revelation about reliability.

    “Those reliability standards which apply mathematical/quantitative methods are also based on the inappropriate application of “scientific” thinking. An engineered system or a component has no intrinsic property of reliability, expressible for example as a failure rate. Truly scientifically based properties of systems and components include mass, power output, etc., and these can therefore be predicted and measured with credibility. However, whether a missile or a microcircuit fails depends upon the quality of the design, production, maintenance and use applied to it. These are human contributions, not “scientific”.”

December 9, 2021

Day 9 - 3 things parenting taught me about system administration

By: Jennifer Davis (@sigje)

The last five years have been grounding for me as I became a beginner at parenting. In this article, I want to share three things I learned about being a better sysadmin from being a mom.

Prioritize your health

Of course, I've heard it so many times. But in the rush of trying to support the "system," sometimes, I lose track of the little things (getting enough sleep, eating meals, human engagement that isn't predicated on deliverables and action items). When it comes to parenting, I see the difference in how the necessities of the moment can gradually subsume the primary goals and real joy* (a secondary outcome of successful parenting that I tend to only enjoy in retrospect, after having assured myself that my internal parenting kanban board is as it should be–obsession, exhaustion, and then joy tends to be my experiential flow as a parent).

Prioritizing health - if I'm not ok, I'm not able to handle the "system" as well, regardless of its state.

Any parent of a child under five will tell you that 90 percent of the job is keeping the child alive. If they make it to the next day, smile and giggle the proper number of times per day, and if your friends, family, and parenting peers seem unaware that your parenting path bears a concerning resemblance to the plot of the movie Speed, then you're more or less gravy. You also learn that, while you can spend a great deal of time analyzing and conversing about your child and how they're faring, the main thing is to put them in the right places at the right time. Sunshine, exercise, the company of their peers, easily accessible bathrooms–these are the things that matter. If my son doesn't get direct sunlight within 90 minutes of waking, his mood takes a nosedive, and this isn't a mystery to me. Likewise, if he isn't let loose at the park to terrify small woodland creatures with his desire to befriend them, his attentional resources will be suboptimal when it's time for flashcards. Yet I (and I don't think I'm alone in this) will frequently wake, obtain caffeine, have a quick all-hands with my family, and proceed to sit in a small room staring at a screen for eight hours straight. As a result, my ability to practice self-care fails regularly.

Leverage the community

To prioritize my health, I have to ask for help. I've had the following experience again and again, professionally and as a parent, and I hope that at some point it won't astound me, which it does every time: I believe that I'm having a singular experience (which, of course, we all are) and that I am an outlier because obviously no one else is concerned about the state of affairs or struggling. And then someone else gives voice to the precise issue that I've devoted considerable resources to NOT sharing. Of course, other people are also concerned about the children pretending that the scissors are boomerangs. One of my primary errors is thinking that there is some kind of scorekeeping that tracks social currency and sorts discourse into the buckets of "I helped" and "I was helped." It's a binary that renders engagements as transactional when my actual community experience is almost always that I walk away feeling better regardless of who broached a topic.

You can't eliminate all Snowflakes

Within the community, we often talk about snowflakes as problems. Yet, as a parent, you discover that there are no handbooks for YOUR kid because every child is different in their own beautiful, hard, and surprising way. Likewise, while there is value in the community and sharing stories, every system will be different. You work with one system, you've learned about that system, and while there are useful things you'll learn from that system to apply to other systems, every system will be beautiful, different, and hard in its surprising ways.

Wrapping Up

Our industry is constantly evolving with the introduction of new technology, tools, and processes. It may feel overwhelming to try to understand everything. You have to accept some degree of the unknown. When I first became a parent, I realized that Operations had prepared me for the inevitable changes that occur every single day. No matter what tomorrow brings, the essential skills are learning to adapt to change and learning to learn fast.

Please make time for yourself, for connecting with the community, and for accepting what is different and unique about your systems and the environments they are running in.

December 8, 2021

Day 8 - D&D for SREs

By: Jennifer Davis (@sigje)

In a past life, I was a full-time SRE and a part-time dragonborn paladin named Lorarath. While at work, I supported thousands of systems in collaboration with a team of geeks. Evenings, I tried to survive imaginary disasters and save the world from the sorceress Morgana. I love collaborative games because they plug into some of the real-world emotional responses and social processes critical for successful, meaningful engagement. They provide a safe place to practice dealing with critical scenarios. When you know the stakes are purely imaginary, you're able to look at your efforts from a distance, to gain understanding and enjoy the process of learning and achieving goals together, even when failing. I want to share a couple of insights D&D has given me about my work and how they can help you.

Building your SRE Team … more than just a name.

SRE has many names: Operations, DevOps, Infrastructure engineering, System Admin. It's someone who deploys and runs a highly available, scalable, and secure service that meets business and partner requirements. But what does that mean? Generally, it means someone with a wide-ranging set of skills tackling different challenges at any point in time.

When you first start a campaign in Dungeons and Dragons, you choose a class to play. This class will then have specializations that you customize based on how you want to play. Next, you build out your character using a character sheet and create a backstory. This character sheet has several abilities and skills. You have several points to allocate to abilities and skills, which grant you additional chances to handle particular events successfully.

In gaming, you collaborate with your team to ensure that it is well-rounded, often choosing roles that complement each other. You don't want a team of all "magic users" or all "hack and slashers". Often, we stop at identifying who we are with that single name, whether it's SRE or sysadmin. As an SRE, I depend on a diverse team with varied skills. I am not seeking people with the same expertise or abilities. I'm looking for people with complementary skills who can help accomplish the goals and visions of the team.

Developing your “character sheet”

There is no equivalent to a "character sheet" when it comes to your job. The closest might be equating a resume or LinkedIn profile to a character sheet. Still, these don't align to all of the possible experiences you gain:

  • Submitting git pull requests.
  • Participating in hackathons.
  • Attending training or conferences.
  • The myriad of other day-to-day challenges you face.

Additionally, if you don't practice skills in real life, they languish. For example, I haven't touched Solaris in over a decade, and I no longer document it as a skill.

If SRE did have a character sheet, I think three core abilities would be:

Communication, Collaboration, and Confidence. Let's take a closer look at these specializations and the value of spending energy on these areas.

Specialization: Communication

Communication is a fundamental building block to successful character building. As an SRE, I faced various scenarios that required expert communication.

  • The first specialty in communication is the number of messages. How often should I remind people about upcoming scheduled maintenance? How often should I reach out to my manager about working on the right thing? How often should my team get together to talk about team tasks?
  • The second specialty in communication is the quality of messages. Communication can be visual, written, or oral. Visuals can often convey much more nuanced meaning than the same information in textual format, and they are an underleveraged method.
  • The third specialty in communications is effectiveness. Effectiveness is the degree to which your words lead to the desired results. This specialty is the most advanced because effective communication requires an in-depth understanding of the audience and crafting your message as needed.

Specialization: Collaboration

The second core ability is collaboration. In any product or service you are working on, work needs to be understood, planned, and executed. It doesn't matter who does the work; it just matters that it gets done.

The role I take today doesn't define who I am. If I say, "I'm an SRE at Company," that is just one characteristic of my story and not my identity. Every day as you go into work and tackle your challenges, recognize your special value and what you bring to the team. Rather than marrying your identity to a specific role, realize that some days you take on a role that may be quite different from what you are used to, and that's part of your character development.

There is a distinction between the members of your team and the roles they play. In gaming, you become comfortable speaking on behalf of your character while having a separate, sometimes meta-conversation with your teammates. Social environments seem to tend towards homeostasis, and you (may) naturally ascribe a simplistic narrative to your co-workers' actions. Being aware that everyone is filling a role on the team, and that the role does not represent everything about the individual, lets you focus on the impactful work that needs to get done.

In other words, never say, "well, they are just the ROLENAME and can't do that," or "that's not my job."

Specialization: Confidence

The third core ability for your SRE character sheet is confidence. Confidence is about the innate quality that drives you to take risks (or not).

In gaming, sometimes you take the wrong path, or you put your squishy players out front, and they get severely damaged. Mistakes happen. In the "real world," customers do something unexpected. There are bugs in the software, hardware fails, or someone from the team enters the wrong command on the wrong terminal in the production environment.

Collaborative games teach you to fail as a group and rise again while retaining the group cohesion necessary to succeed. Of course, if a teammate really caused you to be captured by a giant spider, you'd probably flip out. Still, across the game board, one has the emotional wiggle-room to behave in a manner that would be laudable in professional situations.

Playing teaches you about exploring challenges with imagination and a sense of play. You have to piece things together while continuing to take action, keeping in mind both the larger game goals and what's immediately on the board. In addition to this enormous world to explore, there are complex characters (non-player characters or NPCs) to talk to, and information to gather within each encounter. Be on the lookout for the helpful non-production engineers (NPEs) in your environment, too; while they may not maintain production, they may have valuable information to support you.

Wrapping Up

So, this article has inspired you to add some collaborative gaming to your team building, build out your team with complementary skills, or map out the work of the SRE or system administrator to a character sheet. Great! Beyond the "character sheet," you need the appropriate visualization. By analyzing the particular work items that an individual completed, you could increment a "skill" counter. Additional information from git commits, package management, and incident management APIs could be gathered and glued together to create a way to look at progress over time. That way, you could make sure to spend time on the skills that will improve you in the direction of your choosing.

If you want to try out D&D, check out your local game stores or related groups. Beginner games often provide preconfigured characters that allow you to practice the gameplay without understanding all of the nuances of playing the game.

December 7, 2021

Day 7 - Baking Multi-architecture Docker Images

By: Joe Block (@curiousbiped)
Edited by: Martin Smith (@martinb3)

My home lab cluster has a mix of CPU architectures: several Odroid HC2s that are arm7, another bunch of Raspberry Pi 4s and Odroid HC4s that are arm64, and finally a repurposed MacBook Air that is amd64. To further complicate things, they're not even all running the same Linux distribution: some run Raspberry Pi OS, one's still on Raspbian, some are running Debian (a mix of buster and bullseye), and the MacBook Air runs Ubuntu.

To reduce complication, the services in the cluster all run in Docker or containerd (it's a homelab, so I'm deliberately running multiple options to learn different tooling). Even so, the mix of architectures meant that I had to do three separate builds every time I updated one of my images (arm7, arm64 and amd64) on three different machines, and my service startup scripts all had to determine what architecture they were running on and figure out what image tag to use.
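To make that startup-script burden concrete, here is a minimal sketch of the kind of lookup each script needed. The function names arch_to_tag and image_for_host are hypothetical, and the machine names are the values uname -m commonly reports; the original scripts are not shown in this article, so treat this as an illustration rather than the actual code.

```shell
#!/bin/sh
# Hypothetical helper: map a `uname -m` machine name to the
# per-architecture image tag used in this article (arm7, arm64, amd64).
arch_to_tag() {
  case "$1" in
    x86_64|amd64)  echo "amd64" ;;
    aarch64|arm64) echo "arm64" ;;
    armv7l|armv7)  echo "arm7" ;;
    *)             echo "unsupported architecture: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: build the full image reference for the machine
# this script is running on, e.g. "unixorn/debian-py3:arm64".
image_for_host() {
  tag="$(arch_to_tag "$(uname -m)")" || return 1
  echo "$1:$tag"
}
```

A startup script would then run something like docker run "$(image_for_host unixorn/debian-py3)", and every new architecture in the cluster means another case branch to maintain.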

Enter multi-architecture images

It used to be a hassle to create multi-architecture images. You'd have to create an image for each architecture, then upload them all separately from each build machine, then construct a manifest file that included references to all the different architecture images and then finally upload the manifest. This doesn't lead to easy rapid iteration.
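As a rough sketch of that older workflow, the helper below prints (rather than runs) the commands involved, assuming the manifest is assembled with the docker manifest subcommands. The function name print_manifest_steps is hypothetical, and the article does not say which tooling was actually used for this step.

```shell
#!/bin/sh
# Hypothetical dry-run helper: print the manual steps for publishing one
# multi-architecture image. Usage: print_manifest_steps REPO TAG...
print_manifest_steps() {
  repo="$1"; shift
  refs=""
  for tag in "$@"; do
    # Each per-architecture image is pushed from its own build machine.
    echo "docker push $repo:$tag"
    refs="$refs $repo:$tag"
  done
  # One manifest list then references every per-architecture image.
  echo "docker manifest create $repo:latest$refs"
  echo "docker manifest push $repo:latest"
}

print_manifest_steps unixorn/debian-py3 amd64 arm7 arm64
```

Even in this simplified form, that is five commands spread over three machines for every image update.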

Now, thanks to docker buildx, you can create multi-architecture images as easily as docker build creates them for a single architecture.

Let's take a look with an example on my system. First, I can see what architectures are supported with docker buildx ls. As of 2021-12-03, Docker Desktop for macOS supports the following:


        NAME/NODE       DRIVER/ENDPOINT             STATUS  PLATFORMS
        multiarch *     docker-container
          multiarch0    unix:///var/run/docker.sock running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
        desktop-linux   docker
          desktop-linux desktop-linux               running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
        default         docker
          default       default                     running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
        

My home lab only has three architectures, so in these examples I'm going to build for arm7, arm64 and amd64.

Create a builder

I need to create a builder that supports multi-architecture builds. This only needs to be done once as Docker Desktop will reuse it for all of my buildx builds.


    docker buildx create --name multibuild --use

Building a multi-architecture image

Now, when I build an image with docker buildx, all I have to do is specify a comma-separated list of desired platforms with --platform. Behind the scenes, Docker Desktop will use QEMU emulation for each non-native architecture I specified, run the image builds in parallel, then create the manifest and upload everything.

As an example, I have a docker image, unixorn/debian-py3, that I use for my Python projects. It installs a minimal Python 3 on top of debian:11-slim.

I build it with docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3 . which results in the output below, showing that it's building all three architectures.


        ❯ rake buildx
        Building unixorn/debian-py3
         docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3 .
        [+] Building 210.4s (17/17) FINISHED
         => [internal] load build definition from Dockerfile                                                                                                            0.0s
         => => transferring dockerfile: 571B                                                                                                                            0.0s
         => [internal] load .dockerignore                                                                                                                               0.0s
         => => transferring context: 2B                                                                                                                                 0.0s
         => [linux/arm64 internal] load metadata for docker.io/library/debian:11-slim                                                                                   3.7s
         => [linux/arm/v7 internal] load metadata for docker.io/library/debian:11-slim                                                                                  3.6s
         => [linux/amd64 internal] load metadata for docker.io/library/debian:11-slim                                                                                   3.6s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [auth] library/debian:pull token for registry-1.docker.io                                                                                                   0.0s
         => [linux/arm64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                             4.4s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680 30.06MB / 30.06MB                                                                2.0s
         => => extracting sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680                                                                       2.4s
         => [linux/amd64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                             4.0s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d 31.37MB / 31.37MB                                                                1.8s
         => => extracting sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d                                                                       2.2s
         => [linux/arm/v7 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                            4.3s
         => => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c                                         0.0s
         => => sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206 26.57MB / 26.57MB                                                                2.3s
         => => extracting sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206                                                                       2.0s
         => [linux/amd64 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install-r  22.3s
         => [linux/arm/v7 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install  176.9s
         => [linux/arm64 2/2] RUN apt-get update &&     apt-get install -y apt-utils ca-certificates --no-install-recommends &&     apt-get upgrade -y --no-install-  173.6s
         => exporting to image                                                                                                                                         25.4s
         => => exporting layers                                                                                                                                         6.7s
         => => exporting manifest sha256:ae5a5dcfe0028d32cba8d4e251cd7401c142023689a215c327de8bdbe8a4cba4                                                               0.0s
         => => exporting config sha256:48f97d6d8de3859a66625982c411f0aab062722a3611f18366ecff38ac4eafb9                                                                 0.0s
         => => exporting manifest sha256:fc7ad1e5f48da4fcb677d189dbc0abd3e155baf8f50eb09089968d1458fdcfb9                                                               0.0s
         => => exporting config sha256:60ced8a7d9dc49abbbcd02e7062268fdd2f14d9faedcb078b2980642ae959c3b                                                                 0.0s
         => => exporting manifest sha256:8f96f20d75502d5672f1be2d9646cbc5d5de3fcffd007289a688185714515189                                                               0.0s
         => => exporting config sha256:0c6e42f87110443450dbc539c97d99d3bfdd6dd78fb18cfdb0a1e3310f4c8615                                                                 0.0s
         => => exporting manifest list sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa                                                          0.0s
         => => pushing layers                                                                                                                                          17.2s
         => => pushing manifest for docker.io/unixorn/debian-py3:latest@sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa                         1.4s
         => [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io                                                                                          0.0s
         => [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io                                                                                          0.0s
         docker pull unixorn/debian-py3
        Using default tag: latest
        latest: Pulling from unixorn/debian-py3
        e5ae68f74026: Already exists
        86834dffc327: Pull complete
        Digest: sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa
        Status: Downloaded newer image for unixorn/debian-py3:latest
        docker.io/unixorn/debian-py3:latest
        1.60s user 1.05s system 1% cpu 3:36.49s total

One minor issue - docker buildx has a separate cache that it builds the images in, so when you build, the images won't be loaded in your local docker/containerd environment. If you want to have the image in your local docker environment, you need to run buildx with --load instead of --push.
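Depending on your Docker version, --load may only accept a single platform at a time, so a local build usually targets just the machine you are on. The sketch below prints such a command; platform_for_machine and local_build_command are hypothetical helper names, and the machine names are the values uname -m commonly reports.

```shell
#!/bin/sh
# Hypothetical helper: translate a `uname -m` machine name into the
# platform string that --platform expects.
platform_for_machine() {
  case "$1" in
    x86_64|amd64)  echo "linux/amd64" ;;
    aarch64|arm64) echo "linux/arm64" ;;
    armv7l)        echo "linux/arm/v7" ;;
    *)             echo "unknown machine type: $1" >&2; return 1 ;;
  esac
}

# Hypothetical helper: print the command that builds for this machine
# only and loads the result into the local image store instead of
# pushing it to a registry.
local_build_command() {
  platform="$(platform_for_machine "$(uname -m)")" || return 1
  echo "docker buildx build --platform $platform --load -t $1 ."
}
```

Running local_build_command unixorn/debian-py3 prints the command to copy and run from the directory that holds the Dockerfile.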

In this example, instead of running docker run unixorn/debian-py3:amd64, docker run unixorn/debian-py3:arm7 or docker run unixorn/debian-py3:arm64 based on what machine I'm on, I can now use the same image reference on all the machines:


        ❯ docker run unixorn/debian-py3 python3 --version
        Python 3.9.2
        ❯
        

Takeaway

If you're running a mix of architectures in your lab environment, docker buildx will simplify things considerably.

No more maintaining multiple architecture tags, no more having to build on multiple machines, no more accidentally forgetting to update one of the tags so that things are mysteriously different on just some of your machines, no more weird issues because you forgot to update service start scripts and docker-compose.yml files.

Simpler is always better, and buildx will simplify the environment for you.

December 5, 2021

Day 6 - More to come tomorrow!

We don't have any special system content for you today. We will have more tomorrow!