December 25, 2019

Day 25 - The “Just” Basics

By: C.A. Corriere (@cacorriere)
Edited by: Michelle Carroll (@miiiiiche)

This year we celebrated the tenth anniversary of devopsdays in Ghent, Belgium, where the conference originated in 2009. I was lucky enough to have my talk “Cookies, Mapping, & Complexity” selected for the event. The feedback I received was mixed, but it aligned with a broader theme that emerged from the conference: given the impact technology has on our society in 2019, we can’t afford to ignore the complexity of our sociotechnical systems. The problem we’re now faced with is: how do we raise awareness around this complexity and make it more accessible to beginners?

If the answer to this question were obvious, I could list a few examples here. If it were just complicated, I could draw you a map or two. Sociotechnical problems like this one happen to be centered in a complex domain, where models are often helpful. This question is one of multiple safe-to-fail experiments with negative hypotheses that I am currently running, intended to serve as probes into a model I built of our communities. There’s a lot of specific jargon in this paragraph tied to complexity science, and to the Cynefin framework specifically.

I facilitated ninety minutes of open space workshops around mapping and complexity science in Ghent, but a workshop on complexity science alone can easily fill a week. Shorter workshops manifested at quite a few events I attended this year. While I prefer sitting through a day of lectures, 30-minute segments with more specific content seem to be a better fit for most people.

I’ve also noticed that using common examples, like baking cookies or making a cup of tea, helps folks connect the theory to an area of practice where they already have some experience. Even if you’ve never made tea or baked cookies, the barrier to entry is low enough that someone could try them for the sake of learning about complexity science and mapping.

I wouldn’t keep offering the workshops if people didn’t both show up and tell me they were useful, but I must admit I’ve covered the basics on these topics enough times that I worry I sound a bit like a broken record. I have been pulling a lot of this into a book, which I hope will be available in early 2020. For now, I am going to hope some folks can connect the dots between the language I’m using here and the picture of the framework provided. I’d encourage you to study this some on your own too, and you’re always welcome to ask questions on Twitter. If I can’t answer them, I probably know some clever person who can. What can we do to help make this type of content more accessible? Are you even convinced you need to learn it yet?

During the closing panel of Map Camp London, Cat Swetel referred to both Cynefin and Wardley Mapping as “tools of epistemic justice”. I understand this to mean that Cynefin and Wardley Mapping are tools that can help us know how we know (or don’t know) something, and why our beliefs are (or aren’t) justified. Personally, I like being able to check my work and to know when I’m wrong. It’s a humbling experience, but I do think it’s a pretty basic life lesson that’s easily justified.

What else counts as basic, introductory content in 2020? Is it installing an SDK and writing “Hello World!”? Do we start with a git repo and some yaml files? Maybe it’s a map of our application’s carbon footprint? Mapping and complexity science (among other tools) can help justify the answers to these questions, but I have no doubt those answers are context dependent. I would recommend learning to read a map before trying to draw one. This post on maturity mapping by Chris McDermott is based on Cynefin and Wardley Mapping, and serves as a solid example of the emergent justification I’m talking about. I’m looking forward to learning more about philosophy, epistemology, and tools that can help us change our minds and come to new understandings as the world shifts around us in the new year, but I really need to do a better job of pacing myself.

As if a month of travel and research abroad weren’t enough for this year, I also helped pull together three conferences at the Georgia Aquarium in Atlanta. I have organized devopsdays Atlanta for a few years. When we saw an opportunity to host the first Map Camp outside of the U.K. and the first ServerlessDays Atlanta alongside our conference, we decided it was worth the effort. Watching the ripples from that event since April has warmed my heart, but 2019 has also brought my attention back to one of my first principles:

I cannot take care of anything if I am not taking care of myself.

This year has been very global for me. My goal is to make 2020 much more local and regional by comparison, and I’m not alone. More and more presenters are refusing to fly for tech conferences given the growing concerns around global warming, which ended up being the main theme for Map Camp London this year. I think it’s important for our international communities to gather on a regular basis, but the cost of doing so should have little to no impact on our local communities, our planet, or our individual health. It must be done sustainably.

I doubt I’m leaving the country next year, but I’m thankful to be part of the vibrant tech community we have in Atlanta. I’ll be speaking at devnexus this February, we’re organizing a minimally viable devopsdays Atlanta this April (the same week as REFACTR.TECH), and it seems like there are a few meetups to choose from here every week.

If you aren’t participating in your local tech community then maybe 2020 is the year to try attending more events. If there aren’t any events, maybe you’d like to try organizing one. Maybe 2020 is the right time to visit some other cities (like Atlanta : ) or even a different country. Maybe you’ve been doing plenty of that, and like me you’re ready to tap the brakes and invest a little more energy in your own backyard. Please join me in using the days we have left this year to rest, reflect, and justify how we can co-create intentional futures during our next decade together, and for the ones that will follow afterwards.

December 24, 2019

Day 24 - Expanding on Infrastructure as Code

By: Wyatt Walter (@wyattwalter)
Edited by: Joshua Smith (@jcsmith)

Introduction

As operators thinking about Infrastructure as Code, we often think of infrastructure as just the stuff that runs inside our data centers or cloud providers. I recently worked on a project that expanded my view of what I consider “infrastructure” and what was within reach of being managed the same way I manage cloud resources. In this post I want to inspire you to expand your view of what infrastructure might be for your organization, and to give an example using Terraform to make that view more concrete.

First, the example I’ll use is a workflow for managing GitHub repositories at an organization. There are tons of other services Terraform can manage (“providers” in Terraform terms), but this example is a service that is free to recreate if you want to experiment. Then, we’ll dig into why you’d even want to go through the trouble of setting something like this up. Lastly, I’ll leave you with some inspiration on other services or ideas on where this can be applied.

The example and source code are very contrived, but available here (link: https://github.com/sysadventco-2019/sysadventco-terraform).

An example using GitHub

At SysAdventCo, developers use GitHub for source code management. The GitHub organization for the company is managed by a central IT team. While the IT team did grant a few individuals throughout the company permission to create repositories or teams, some actions were only accessible to administrators. So, even though teams could create or modify some settings, the IT team was often a bottleneck, because many individuals needed to see or modify settings they could not reach on their own.

So, the IT team imported the configuration for their organization into Terraform and allowed anyone in the organization to view it and submit pull requests to make changes. Their role has shifted from taking in tickets to modify settings (which often involved multiple rounds of back-and-forth to ensure correctness) and making the changes manually, to simply approving pull requests. In the pull requests, they can see exactly what is being asked for and receive validation through CI systems of exactly what the impact of that change would be.

A stripped down version of the configuration looks something like this:

# We define a couple of variables we can pass via environment variables.
variable "github_token" {
  type = string
}

variable "github_organization" {
  type = string
}

# Include the GitHub provider, set some basics
# for the example, set these with environment variables:
# TF_VAR_github_token=asdf TF_VAR_github_organization=sysadventco terraform plan
provider "github" {
  token        = var.github_token
  organization = var.github_organization
}

# This one is a bit meta: the definition for this repository
resource "github_repository" "sysadventco-terraform" {
  name               = "sysadventco-terraform"
  description        = "example Terraform source for managing the example-service repository"
  homepage_url       = "https://sysadvent.blogspot.com"
  gitignore_template = "Terraform"
}

SysAdventCo operates a number of services. The one we'll focus on is example-service. It's a Rails application, and has its own entry in the configuration:

resource "github_repository" "example-service" {
  name               = "example-service"
  description        = "the source code for example-service"
  homepage_url       = "https://sysadvent.blogspot.com/"
  gitignore_template = "Rails"
}

The team that builds and operates example-service wants to integrate a new tool into their testing processes that requires an additional webhook. In some organizations, a member of the team may have access to edit that directly. In others, maybe they have to find a GitHub administrator to ask them for help. In either case, only those who have access to change the settings can even see how the webhooks are configured. Luckily, things work a bit differently at SysAdventCo.

The developer working on example-service already has access to see what webhooks are configured for this repository. She is ready to start testing the new service, so she submits a small PR (link: https://github.com/sysadventco-2019/sysadventco-terraform/pull/2):

+
+resource "github_repository_webhook" "example-service-new-hook" {
+  repository = github_repository.example-service.name
+
+  configuration {
+    url          = "https://web.hook.com/"
+    content_type = "form"
+    insecure_ssl = false
+  }
+
+  active = false
+
+  events = ["issues"]
+}

The system then automatically creates a comment showing exactly what actions Terraform would take if the change were approved, so a member of the IT team can review it and collaborate with the developer requesting the change.

No one is stuck filling out a form or ticket, trying to explain in words what is needed so that someone else can translate it into manual actions. The developer simply updated the configuration herself; the change is automatically validated, and a comment is added with the exact details of the repository that would change as a result of the request. Once the pull request is approved and merged, it is automatically applied.
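
What that automation looks like will vary by CI system, so here is only a rough sketch of the commands such a pipeline might run (this is not part of the example repo, and the plan file name is an arbitrary choice; tools like Atlantis or Terraform Cloud package up this workflow for you):

# On every pull request: validate the change and post the plan output as a comment
terraform init -input=false
terraform fmt -check
terraform validate
terraform plan -input=false -out=plan.tfplan

# After the pull request is approved and merged: apply the saved plan
terraform apply -input=false plan.tfplan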

This seems like a lot of work, why bother?

What an astute observation, dear reader! Yes, there is a good deal of setup involved once you get past this simple example. And yes, managing more things automatically can often be more work. In addition, if your organization already exists but doesn’t use a method like this, you probably have a good deal of configuration to import into the tool of your choice. Still, I’d argue there are a number of reasons to consider using a tool like this to manage things that aren’t strictly servers, firewall rules, and so on.

First, and the first thing I reached for, is that we can track changes the same way we do for everything else in the delivery pipeline while also ensuring consistency. On my project, importing the configuration of a PagerDuty account into Terraform allowed me to see inconsistencies in the manually configured service. While the tool added value, a huge part of that value was the simple act of doing the import and having a tool that enforced consistency. I caught a number of things that, under the right conditions, could have misrouted alerts, before they became issues.

The next and most compelling reasons, to me, are freeing up administrative time and giving teams the freedom to effect changes directly without creating a free-for-all. You can restrict administrative access to a very small number of people (or just a bot) without creating a huge bottleneck. It also allows anyone without elevated privileges to confirm settings without having to ask someone else. I’d also argue this creates an excellent basis for a change control process in organizations that require one.

A further advantage is that, since none of these tools exist in isolation, using a method like this can give you an opportunity to reference configuration dynamically. This allows your team to spin up full environments to test configuration end-to-end.
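
As an illustration of that dynamic referencing (the team resources below are hypothetical and not part of the example repo), the GitHub provider lets one resource pull attributes from another, so granting a team access to example-service could look something like this:

# Hypothetical team, shown only to illustrate referencing other resources
resource "github_team" "example-service-devs" {
  name        = "example-service-developers"
  description = "Developers who build and operate example-service"
  privacy     = "closed"
}

# The repository name is read dynamically from the resource defined earlier
resource "github_team_repository" "example-service-devs" {
  team_id    = github_team.example-service-devs.id
  repository = github_repository.example-service.name
  permission = "push"
}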

But wait, there’s more!

Within the Terraform ecosystem, there’s an entire world of providers out there just waiting for you to explore! Imagine using the same tools you use to manage AWS or GCP resources to manage the other important services your team relies on:

  • Manage your on-call rotations, escalation paths, routing decisions, and more with the PagerDuty provider
  • Manage the application list and alerts in NewRelic
  • Add external HTTP monitoring using tools like Statuscake or Pingdom
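
For example, here is a rough sketch of what managing an on-call escalation path with the PagerDuty provider might look like. The user and policy are made up, and the attribute names follow my reading of the provider documentation, so double-check them against the current provider before relying on this:

resource "pagerduty_user" "oncall_engineer" {
  name  = "Example Engineer"
  email = "oncall@example.com"
}

resource "pagerduty_escalation_policy" "example_service" {
  name      = "example-service escalation policy"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "user_reference"
      id   = pagerduty_user.oncall_engineer.id
    }
  }
}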

December 23, 2019

Day 23 - Becoming a Database Administrator

By: Jaryd Remillard (@KarateDBA)
Edited by: Benjamin Marsteau (@bmarsteau)

Database is a term that is thrown around in meetings across all industries. The term is almost always used with a sense of urgency and importance, yet contains a vast mystery. It is a topic about which some feel overly confident, others have absolutely no knowledge, and still others picture only a single copy of a glorified Excel spreadsheet sitting on their desktop. In my short time as a database administrator, I have found that it is typically the confident ones who venture into this mystery with a full understanding of the business value and risk that come with being the database administrator. Like any area of science, technology, engineering, and math, acronyms are favored, so let it be known that the title of database administrator can be abbreviated as DBA.

Like any career path, one database administrator's path will not necessarily align with the direction you have to take. That is not to discredit the value of the journey someone purposely took, or perhaps accidentally stumbled into; there are specific points worth remembering, each of which could present an opportunity in your own journey. Rather, be aware that, just as there are theoretically an infinite number of ways to solve a problem with code, there are an endless number of directions you can take to reach your destination of becoming a DBA. All in all, I hope this reflection on the journey I took to become a database administrator will set you up for success.

Start with the basics

When I was 12 years old, I befriended a stranger online through a collective group of people who played an online video game. We were idling in our TeamSpeak server when I asked them what they were up to, and they replied that they were coding a website for our group. The concept immediately struck me with curiosity like a static shock; the idea of how to construct a website was so far-fetched that I just had to learn, so I could quench the burning desire. I naively asked if it was a drag-and-drop type of process. They laughed, began to talk about and teach me HTML, and showed me how to view the source of a website. The concept blew my mind: words typed in a specific manner can be translated into a structure that is displayed on my screen. It made me feel like anything was possible. I kept building websites with HTML, leveling up to CSS, JavaScript, learning Linux, and eventually PHP. Soon after, I was building login systems, registration systems, and user profiles, all in a LAMP stack that required knowledge of basic SQL, learning simple DML, DDL, DCL, and TCL statements; I wrote whatever worked. The experience and newfound knowledge I bashed together eventually turned into a charming but underwhelming social network that I named Express-It. Building the schemas in phpMyAdmin was accessible in the sense of, "I create a column, PHP writes to it, there we go.” However, as the social network grew to a whopping 100 people, primarily friends and family there for moral support, my website slowed to a crawl. What I did not understand were the deeper technical specifics of ints, unsigned bigints, varchars, indexing, and primary keys, or how performance suffered when various people queried similar things at the same time while the SQL scanned the entire database. I could not wrap my head around it, nor did I think there was anything to it beyond the query itself, because it did the job locally. Frankly, it also didn't occur to me that my schemas and queries were a DBA's nightmare. I shut down Express-It once my curiosity shifted from LAMP stacks to learning cybersecurity and doing basic IT jobs for friends and family, and because I was sick of the free hosting tier I was using.

School of Hard Knocks

As I shifted my focus from building to fracturing, SQL came up again in the form of learning its flaws: the various types of SQL injection, brute force, and DoSing a database. My knowledge expanded to be more aware of possible vulnerabilities and of the importance of a database, including the risk of losing data. This was exemplified when the flash drive holding the code from my Express-It website and hundreds of hours of various other projects, stored there as an interim measure between moving homes, was accidentally reformatted by a family member while transferring some photos. Losing all my work taught me how easily a large part of my life could disappear. I then realized the hard way that backups are a thing, and I became hyper-aware. I learned to keep at least two to three copies of whatever had importance on separate data stores; I learned that flash drives and hard drives can die without warning or get overwritten accidentally, that you can never have enough backups, and that corruption is a thing. My motto became "backup backup backup, correctly.” I often chuckle when reflecting on this time of my life because it reminds me of high school, where any time a big paper was due, someone always had the excuse that their file was gone, overwritten, or corrupted the night before, conveniently, perhaps honestly so. I could not help but blurt out with my smart mouth, "Should have backed up.” Unfortunately, the lesson I learned came back to haunt me in a different form.

Venturing Further and Beyond!

I went on a hiatus from technology for a bit to focus on school and sports. Eventually, when my interest in technology came back, an internship opportunity landed in my lap, which to this day I attribute to luck, as the news of the opportunity was shared with me by the CTO of the relevant company. Before starting, I was asked what I wanted to do during the internship; specifically, what direction I wanted my career to go. I thought it was software engineering; at the time, building and designing were shiny to my eyes. However, I was conflicted, as I still enjoyed living in the terminal; something about the rawness of text on a blank background, and specific commands that can be used to navigate the computer in ways a GUI cannot, still drew me in. Even when I was deep in an IDE, I had moments of barbarianism when I would code in vim. I knew programming was not what I wanted to do for a full eight hours a day, so I was conflicted and I shared my concerns. They mentioned DevOps, and it was perfect: I would get the complete balance of being in the terminal and writing code. I then embarked on the start of my career. As an intern, a lot of my tasks were simple: data entry, set up a local environment, break the local environment, finish some tickets, attend standup, and the like. But one task stood out to me: the need for an internal tool to show the difference between a system at one point in time and another, such as permissions and data in files, essentially a beefy diff. Like most ideas, it was a task that seemed easy at first but was exponentially more complicated than initially anticipated. As I dug into the task, the first tool I chose was Python, as it seemed easy to learn and it was all the rage. As I learned more about Python's data types, I naively figured it would be excellent to cache all the files on a system in a dictionary, which, unbeknownst to me beforehand, resulted in Python running out of memory. After consulting some of the engineers nearby on how to navigate this issue, it was recommended that I use a database. So naturally, I chose SQLite3. I moved to MySQL pretty soon after; SQLite3 was just not working out. I figured MySQL was perfect: it is a solid relational database, I would have the freedom to specify what kind of data to store, and it made storing md5 checksums easy. I was eventually able to get the program to work to some degree of success, but not without experiencing bottlenecks. Previously, the amount of data I had worked with was so little that there was little to no need for optimization. So when caching the majority of the file information on a system, that is when I started to see performance impacts on the database, particularly in the length of time it took to execute the program and the overall high usage of memory, in both primary and secondary storage. I figured the best direction to tackle this problem would be to take it to a deeper level and learn the internals of MySQL: the basics of how the client and server work together, and elementary query optimization. But with limited guidance, there was only so much I could dig into on my own. Eventually, I moved on to a local company as a system administrator. My new job exposed me to different types of databases: MongoDB and SQL Server, along with MySQL again. Although I spent a lot of my time at my new position on the front end and on web servers like Apache, Jekyll, and GruntJS workflows, as well as Active Directory, I still got to see the back end.
Naturally, it fascinated me more. In between tasks I learned how the databases were accessed by services and, as an administrator, how to view the permissions of users and how to query for what you wanted. Questions about the front end were easy to answer, but the back end had a lot of questions I could not find answers to, on topics such as internal functionality, maximum capabilities, dynamically managing users, and so on. Databases remained a mystery I wanted to solve, and I was ready to go Sherlock. I read the documentation and tinkered with the databases, and then I would go home and set one up for myself, just to see how I could break it. Unfortunately, my time became more consumed with school and the front-end work of my job at the time, although I knew I wanted to get back to databases in the future. Soon, an opportunity opened up to work as a student employee in the IT department at my university. This potential new position would save me an hour commute to school as well as an hour commute to work, and although it was not nearly as technical as what I was currently doing, and perhaps a step back, it was the right decision for me at the time. I had a feeling this job could turn into something more technical than what I was already doing.

Uh-oh

After completing three years of college and almost a year as a student employee, I was offered a full-time job in the IT department of the university I was attending. It was a difficult decision, and it took some time to weigh the pros and cons: take the risk and leap of faith into the field, or continue my education for another two years while piling up debt and then flow into the field. Ultimately, I chose to take the job to pay off the student loans I had accrued, which were almost the size of my salary and nearly twice my weight in stress. I also knew this was an opportunity to delve deeper into different technologies and learn what it means to take ownership and responsibility. The job was to be the system administrator for the STEM department, and with that came a lot of responsibility for managing various software, some cloud-based but much of it on premises. Much of the software I managed used SQL Server to manage logins, logs, barcode numbers, and the like. Little did I know that one SQL Server was actively in use until I got a call from a chemistry department head saying they were unable to log into their science rental equipment management software. I searched all over our wiki and could not find a single trace of its existence; for a moment I thought this was a prank. I asked my coworkers if they had heard of this software, if it even existed; I got nothing in response. I dug further, even going so far as to reach out to our previous systems engineer. It turns out this software had a SQL Server sitting in an undocumented virtual machine, lost to tribal knowledge. Unfortunately, there were no records of this software ever being provisioned. On the bright side, one of the science teachers' users in this database for some reason had super privileges, giving me the ability to log in and work some magic, and I thought this was the end of the immediate problem. But there was an itch in my brain, questions that stuck with me: What had happened? Why did it all of a sudden stop working? Why is it sitting on an undocumented VM? How can I prevent this in the future? Why is there no accountability and visibility for this database? It was either going to be forgotten in this state, or I had to work on keeping it reliable in all aspects, especially documentation. I made a page on everything I learned about this software, including representatives from the company, runbooks for the database, and how the client and backend work together. That is when my interest in reliability and uptime grew exponentially, especially around databases. Before I left, though, I made one big oops.

Be wary of drives

I was put in charge of the psychology department on top of the STEM department. I had a psychology professor come in as her laptop was due for a replacement, and she was hoping to speed up the process as the laptop was filling up; the replacement was a special order of a 512 GB Dell XPS, and her data consisted primarily of personal photos and research documentation. The first step was to back up her laptop to a hard drive we had, using some software that did it block by block. I had our student employees run this process overnight. I woke up to some great news: it kept failing with odd errors that warranted no response via a Google search. After consulting with my coworkers, we remembered that Office 365 comes with 1 TB of storage via OneDrive. We thought this was perfect; she could store all her valuable documents in OneDrive while we set up her new laptop, then download them back down. She preferred that I do it personally since I was in charge of the psychology department, and as that was our policy, I had to agree. I began the process of uploading her documents to her OneDrive, and it took days. Being new to Office 365, I had no idea why it was taking this long, but I shrugged it off as it eventually reported success. I began to download her files onto the new computer and started the RMA of her old one. Problems were immediate: permission issues, overly long file names, disappearing files, you name it. After hours and hours of work, going through shadow copies of our servers, looking at past backups we had, and recursively changing permissions on the files, it was exhausting. I was able to recover about 95% of what she had previously, but the 5% I lost was a good chunk of her research. It was a time of reflection where my motto rang in my head non-stop; I had missed one more backup somewhere. From then on, I was no longer just super aware. I was hyper-aware and vigilant about storing data. Everything made me skeptical and full of questions, and it was a mark of growth through failure in my career. I had a burning desire to learn more about storing data: how to do it robustly and safely, and how to ensure validity and integrity. I set out to fulfill that desire.

Where to begin?

Finding information on the reliability and internals of data storage is a difficult task when you do not have any reference or expert to guide you toward the correct path. The internet is filled with how-tos on writing read and write queries, but documentation on the internal workings is tricky to find, let alone comprehend. When I finally started to dip my toes in, I quickly learned it is difficult to summarize the paradox that is a SQL database: the simplicity of the query structure itself provides a façade and a false sense of understanding, while there is much behind the scenes that you cannot see. Knowing how to query to get what I needed from a database gave me the confidence that I knew what to do and how it worked. It wasn't until I started a personal project that caused the database to suffocate that I realized perhaps there was more to databases than my cute knowledge had previously thought.

Personal Projects

Being in the technology field will expose you to various situations that are hard to prepare for in personal studies or higher education: outages that are unpredictable due to customer behavior, or simply because you have no reference for a service's thresholds, primarily because you have never reached that point before. Scaling up is a term I had heard of but never understood, so natural curiosity decided I needed to seek out what it really meant. It is impossible to scale up unless you have a lot of data to utilize. Finding large data sets of false, made-up data was a tall task, so I had an itch to create a data generator to assist in learning how to scale. Yes, there are a few data generator websites; however, they seemed to cap out at a million rows at the time, which was not enough for me to really push the service to its limits. In creating this data generator, I made it spit the data out in SQL format, making it easy to load into MySQL right away. After some heavy work in Python, it was capable of generating eight-figure row counts, with columns for names, addresses, cars, age, and other data. I ran my data generator several more times to get up to 300 million rows, and I decided it was time to load up a MySQL server in a LAMP stack with this data for what would essentially be a country-sized simulation. With no visibility into the VM, the OS, or the database, my PHP queries to MySQL locally took ages or crashed the VM altogether. I knew it was the database, because even querying via phpMyAdmin was not returning results quickly or timed out, and I couldn't figure out how to better interact with the database. Thinking it lacked power, I kept upping the CPUs and RAM, which only led to crashing the host. I stepped back to think more about scaling: how could I, in this case, scale up if adding power wasn't the solution? Then the concept of how a CPU is designed rang in my head: distributing the job into smaller chunks. The CTO of the company where I interned once told me, "Any big problem is just a subset of a bunch of smaller problems. Iterate those small problems, and now you've solved the big problem."
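
For the curious, here is a stripped-down sketch of what such a generator could look like (the table and columns are illustrative, not the original code; a real run would loop this out to hundreds of millions of rows):

import random

FIRST_NAMES = ["Ada", "Grace", "Linus", "Jaryd"]
CITIES = ["Denver", "Atlanta", "Seattle"]
CARS = ["Civic", "Model 3", "F-150"]

def fake_row(person_id):
    # One made-up person: id, name, address, car, age
    return (
        person_id,
        random.choice(FIRST_NAMES),
        "{} Main St, {}".format(random.randint(1, 9999), random.choice(CITIES)),
        random.choice(CARS),
        random.randint(18, 90),
    )

def write_inserts(path, count, batch=1000):
    # Write batched INSERT statements so the output loads into MySQL quickly
    with open(path, "w") as f:
        for start in range(1, count + 1, batch):
            rows = [fake_row(i) for i in range(start, min(start + batch, count + 1))]
            values = ",".join("({}, '{}', '{}', '{}', {})".format(*r) for r in rows)
            f.write("INSERT INTO people (id, name, address, car, age) VALUES {};\n".format(values))

if __name__ == "__main__":
    write_inserts("people.sql", count=1000000)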

I got it! Let me split the database into smaller databases, each containing a maximum of 10 million rows. If I needed something beyond the unique ID, I could query the next database instead of having MySQL scan the entire thing. Distributing data across multiple MySQL server instances was a weak solution in this case, of course, as PHP now had to maintain 20 MySQL connections. Later I learned this moved the problem instead of solving it, and now I was stuck. I understood then that databases are complex and much more complicated than I had initially thought, and that fed my desire to learn more. I did not necessarily feel capable of being a database administrator, but I figured, what better way to learn than to dive in headfirst as a database administrator for a company?
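
To make the idea concrete, here is a rough sketch of what that ID-based routing boils down to (illustrative names, not the original code; it spreads the data out, but every client still has to know about every database):

ROWS_PER_DATABASE = 10_000_000

def database_for(person_id):
    # Naive split: IDs 1-10,000,000 live in people_00, the next
    # 10 million in people_01, and so on.
    return "people_{:02d}".format((person_id - 1) // ROWS_PER_DATABASE)

# e.g. database_for(123) -> "people_00"; database_for(25_000_000) -> "people_02"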

I am a person who tries not to be afraid to delve into the unknown or face rejection. Imposter syndrome is real, but I know it is something you can grow past, no matter what your mind tells you. I scoured the internet for DBA jobs and found myself stumbling upon an entry-level DBA posting at a competitor of the company I had interned at. It was perfect, and I applied despite it feeling like a moonshot, as the position was based in a different state.

Don't be afraid

Unexpectedly, I got a callback. I flew through the phone interview and the manager interview, and eventually hopped on a call for the technical interviews. I was as honest as I could be: I explained my attempts to scale, shared the little experience I had with MySQL, and explained why I wanted to be a DBA.

Simply put, I wanted to be a DBA because databases are fascinating to me. We rely on databases for everything, but hardly anyone delves deeper than simple restarts or querying for what they need. I had a difficult time finding resources that would help me learn about SQL at a deeper level rather than just how to write basic queries; I was hungry, or rather, I was famished to learn. I knew I lacked a lot of knowledge and was honest about it during my technical interviews, but I backed it up with what I had been trying to do with MySQL. In particular, I shared my attempts at scaling by distributing the workload, having no idea what the correct term was other than describing distributing "it." I later learned it's called sharding. I jumped up and down after finding out the correct term, as it unlocked a vast amount of new resources via Google searches and technical conversations with people in the industry. There was a DBA on my technical interview call, and it was the perfect opportunity to ask what resources I should read, so I jumped in as soon as I could. She recommended reading the Database Reliability Engineering book by Charity Majors and Laine Campbell. I immediately bought it off Amazon, practically during the interview, and was extremely eager to crack it open the second it arrived. I started reading and taking impeccable notes, absorbing as much as I could.

This was the direction I needed, the direction I wanted to go: to push my mind and widen my thought process, making me aware that there is much more than writing code and setting up software, such as service level objectives and agreements, automation, the need for metrics, and the like. I just could not put the book down. It almost felt like I hadn't had a bite to eat in days, and I essentially swallowed the book. By my second technical interview, I believe my famine showed. I talked about what I was learning and how I was applying it, and it raised the interviewer's eyebrow in a good way. I was flown to their headquarters for further interviewing.

Still much to learn

It is no secret this job was at SendGrid, and I am very fortunate to have found a job posting that was purposely looking to help the employee grow. I attribute a lot of that to luck, and to SendGrid's excellent mentality and awareness of the benefits of hiring and raising a junior employee. The distinctive culture included hunger, the hunger to learn, and I was viciously starving. I could not stop reading documentation, asking questions, and writing everything down in a spiral notebook. I was fortunate to have a senior DBA on the team to guide me through the processes of replication and basic troubleshooting of a MySQL server. Later I bought High Performance MySQL: Optimization, Backups, and Replication on Amazon, and soon after being hired I started going through the book, diligently taking notes and asking questions along the way. The path to learning about SQL did not stop when I was hired; in fact, it had just started.

Conclusion

Overall, my natural-born curiosity and love for challenges led me to take an opportunity where no one else dared to venture. I broke my façade of thinking SQL databases are easy just because I could query something, by trying to force a database to kneel. Finding out why it buckled was challenging, but that only led me to viciously seek out a solution and to not be afraid to apply for a DBA job. The key was realizing that I always gravitated toward, and asked myself the most questions about, databases; I wanted to conquer them. The two books mentioned are a great start to growing your knowledge beyond querying a database and into what it is and how it works. Another book to look at is Joe Celko's SQL for Smarties: Advanced SQL Programming; it does a good job of delving into how SQL works behind the scenes and makes you realize that your queries can be optimized greatly. While there are many paths to take, the real takeaway is that if you have the hunger to learn, you will succeed no matter which path you take.

December 22, 2019

Day 22 - Metadata Rich, Rule Based Object Store - Introduction to iRODS

By: John Constable (@kript)
Edited by: Jason Yee (@gitbisect)

iRODS? WhyRODS?

I’ve got lots of data where I work. Petabytes of the stuff. You’ve got lots too, maybe even more!

The problem, as you might know (or are just finding out!), is that unstructured data is hard to search, organise, and use to collaborate. As the volume of data grows, its characteristics will change over time as well, meaning that the organising principles you needed when you started are likely to be different from the ones you will want a few years and petabytes later.

Also, once you’ve got all the data under some kind of management, you’ll want some assurance that the software will be around for a while and that you can expand or build on it yourself if you need or want to.

In this article, I’m going to show you iRODS, the Integrated Rule-Oriented Data System. It’s been around for years, is currently on version 4.2.6, and there is active planning for future releases. Don’t take my word for it—check out their GitHub!

iRODS? I’ll get started!

Before we dive in, I need to define some terms and give you a bit of background.

Zones are the key bit of infrastructure. Each Zone can stand on its own and is an independent, uniquely named group of servers.

There are two types of servers: the Provider, which runs the Zone and connects to the database that holds the catalog, and the Consumer, which talks to the Provider and can serve up additional storage resources. You need at least one Provider, but can start with no Consumers. Both server roles can offer storage resources to be used by the rest of the Zone to store files, known as Objects.

Core competencies

The iRODS project (to which, I should add, I contribute bug reports and the odd bit of documentation, but which I am not employed by and do not claim to represent) likes to talk about the software in terms of four core competencies: data virtualisation, data discovery, workflow automation, and secure collaboration.

Data Virtualisation

Data in iRODS is stored as Collections that mimic UNIX directory structures, but can be spread over multiple filesystems, or even a mix of different filesystems and other back ends such as S3 buckets or Ceph datastores.

Objects can have multiple replicas stored in different locations, on different systems, or even on different storage types. Objects are usually accessed through the command line client, although APIs and other methods such as web and/or WebDAV interfaces, Cyberduck, and more are available. Storage locations can be queried, but users do not need to understand their architecture in order to access Objects. The location of Objects and the method for retrieving them is managed by iRODS. This provides a consistent experience for interacting with the system.

Data discovery

iRODS provides the usual metadata on Objects and Collections out of the box: filename, size, location of each replica, and so forth. You can add metadata manually or in more automated ways using workflows (Can’t wait? Skip forward a bit.) that can be triggered when adding the file or by other activities.

Once added, the metadata can be listed and searched.

The iRODS catalog itself can also be searched with a SQL-like syntax allowing identification of Objects or Collections with particular properties or locations, both in the catalog and on disk, or to provide information on the usage of the system. You can extend this with your own SQL queries if you want, but a lot is available before you need to.

Workflow Automation

iRODS has a Rule Engine that specifies actions to be taken when data is uploaded, downloaded, or accessed. These rules can be written in the existing rule language (a little baroque in my opinion) or you can write your own in Python.

As well as such tasks as changing ownership, setting checksums, or enforcing policy about where an object is stored, rules can extract data from Objects as they are uploaded and attach them as metadata for later searching.

Want the location of your photos automatically extracted and tagged? You can do that! Want to run an entire genomic pipeline from upload to analysis? You can do that too—although not out of the box, some assembly will be required. And for any BioInformaticians reading this, please tell your IT people first. We can help!

Secure Collaboration

Access Control is managed by ACLs on Objects and Collections, and governed by users and groups. In addition to the assorted password options, there are also Tickets, which can be granted for time-limited access.
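
As a quick sketch (from memory of the icommands, so check ichmod -h and iticket -h for the exact syntax; the user alice and the file are made up), granting someone read access via an ACL and issuing a time-limited ticket look roughly like this:

$ ichmod read alice some-file.txt
$ iticket create read some-file.txt
$ iticket mod <ticket-string> expire 2020-01-31.23:59:59

The iticket create command prints a ticket string, which the mod command then stamps with an expiry; anyone presenting that string can read the Object until it expires.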

There is also the ability to federate between iRODS Zones, so users of one Zone can be given access to another Zone, each with its own policies, ACLs, auditing, and authentication.

You can encrypt all communication with SSL if you so desire.

iRODS? MyRODS!

Enough talk! Let’s get a server running!

I’ve chosen the simplest setup to start: a Provider server to run our Zone, connecting to a Postgres database on the same server.

Installing the database back end

On our Ubuntu Xenial system (Ubuntu Bionic support, like winter, is coming), we first install Postgres (the other supported databases are MySQL and Oracle), then set up the iRODS user.

This isn’t a tutorial on packaging or databases, so I’ll just point you at the manual.
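
If you just want the gist, the database side boils down to something like the following (a sketch based on the manual's Postgres instructions; ICAT and irods are the conventional database and user names, and the password is a placeholder):

$ sudo apt-get install postgresql
$ sudo -u postgres psql -c "CREATE DATABASE \"ICAT\";"
$ sudo -u postgres psql -c "CREATE USER irods WITH PASSWORD 'changeme';"
$ sudo -u postgres psql -c "GRANT ALL PRIVILEGES ON DATABASE \"ICAT\" TO irods;"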

Now we’re ready to install the application itself.

Installing iRODS as a Provider

RENCI, the developers of iRODS, provide a package repository. Let’s add that, together with its public key:

$ wget -qO - https://packages.irods.org/irods-signing-key.asc | sudo apt-key add -
$ echo "deb [arch=amd64] https://packages.irods.org/apt/ $(lsb_release -sc) main" | sudo tee /etc/apt/sources.list.d/renci-irods.list
$ sudo apt-get update

Install the iRODS Provider server package along with the plugin for Postgres:

$ sudo apt-get install irods-server irods-database-plugin-postgres

Next, set up iRODS and point it at our database by running the setup script. Follow the prompts and provide the information as requested; there is more information available in the manual.

$ python /var/lib/irods/scripts/setup_irods.py

The setup script will ask for a Zone key, a negotiation key, and a control key. These are strings (up to 32 characters in length) used for inter- and intra-zone security. We touch on federation at the end of this post, but be sure to note your keys and read the federation documentation for more in-depth coverage.

Let’s become our irods user and run our first command!

$ sudo su - irods
irods@ubuntu-xenial:~$ ils
/tempZone/home/rods:

So far so good! We have an iRODS server, we’ve connected to it, and we can list the Collection we’re in (our home directory by default). We have not uploaded anything yet, so there’s not much to see. Let’s change that with some other commands!

iRODS? I’ll show you this in action!

irods@ubuntu-xenial:~$ iput my-file.txt
irods@ubuntu-xenial:~$ ils my-file.txt
  /tempZone/home/rods/my-file.txt
irods@ubuntu-xenial:~$ ils -l my-file.txt
  rods           0 demoResc        59 2019-11-17.22:11 & my-file.txt

Ok, so we have uploaded a text file, and using the long listing of the ils command, we see:

  1. The user who created the file
  2. The replica id (for when you have more than one copy of the same file, usually created by rules or special resource types)
  3. The name of the resource the file was uploaded to
  4. The size of the file
  5. The timestamp of when the file was uploaded
  6. Whether the replica is good—the ‘&’ is the moniker for a good replica
  7. The name of the file

There’s more to it than just an object store though!

iRODS? I, checksum!

From cosmic rays to bit flips in memory to RAID controller firmware issues, small changes can occur in files. It’s helpful to have a way to detect this and know if it’s your download that went wrong or if the file on the disk is affected in some way.

Let’s fight entropy by protecting against silent corruption with -K. K?

irods@ubuntu-xenial:~$ iput -K my-file.txt
irods@ubuntu-xenial:~$ ils -L my-file.txt
  rods           0 demoResc       158 2019-11-17.22:19 & my-file.txt
 sha2:Er3LKQ5YiO1+njYKQnBywhaxW/ajzavY9/qD++8znL0= generic /var/lib/irods/Vault/home/rods/my-file.txt

Now we have a new field: a checksum! The -K argument tells iput to checksum the file, then upload and verify it. The upload is only complete when this returns successfully.

Downloading Files: iget

iget is the command to download files, however you can also use this to verify the file as you download it.

Let’s simulate silent corruption by changing the definition.txt file we uploaded directly on disk, and then attempting to download it again:

irods@ubuntu-xenial:/tmp/test$ ils -L definition.txt
  rods           0 demoResc       158 2019-11-17.22:19 & definition.txt
 sha2:Er3LKQ5YiO1+njYKQnBywhaxW/ajzavY9/qD++8znL0= generic /var/lib/irods/Vault/home/rods/definition.txt
irods@ubuntu-xenial:/tmp/test$ echo "not the definition you were looking for" > /var/lib/irods/Vault/home/rods/definition.txt
irods@ubuntu-xenial:/tmp/test$ iget definition.txt
irods@ubuntu-xenial:/tmp/test$ cat definition.txt
not the definition you were looking for
irods@ubuntu-xenial:/tmp/test$ rm definition.txt
irods@ubuntu-xenial:/tmp/test$ iget -K definition.txt
remote addresses: 127.0.1.1 ERROR: rcDataObjGet: checksum mismatch error for ./definition.txt, status = -314000 status = -314000 USER_CHKSUM_MISMATCH
remote addresses: 127.0.1.1 ERROR: getUtil: get error for ./definition.txt status = -314000 USER_CHKSUM_MISMATCH

iRODS? iRule!

iRODS Rules automate data management tasks. You can automate entire workflows, or call rules at many stages of the object lifecycle—each stage is called a Policy Enforcement Point (PEP).

Example checksum rule

In our iput example earlier, we only got a checksum on the file when we uploaded it with the -K flag. However, we might not want our end users to have to remember to do this, but we still want a checksum, as having one is usually beneficial all round.

We’re going to use a built in rule to make this happen on every upload.

First we need to configure the Provider server to load the rule. iRODS configurations that aren’t held in the database are held in JSON files in the /etc/irods directory. The one you will mostly be working with is server_config.json.

The PEP for post processing uploaded files is called acPostProcForPut and we’re going to use an already present function for making checksums.

By convention, the default rules are left in place and any changes are added in a new file that is included before the defaults. This allows undefined behaviour in one file to fall through to the next one until it hits the defaults.

First, we’ll create a rules file to add a rule to the PEP that is called upon completion of an upload (or put) operation.

In the /etc/irods/customrules.re file we’ll add:

acPostProcForPut {
 msiDataObjChksum($objPath, "", *checksumOut);
}

Then we need to tell iRODS to use those rules before the defaults. Each new iRODS connection causes a new agent to be started, which reads the config from the files. So the change is live as soon as it’s made.

In our /etc/irods/server_config.json file we have the rule_engines stanza:

     "rule_engines": [
         {
             "instance_name": "irods_rule_engine_plugin-irods_rule_language-instance",
             "plugin_name": "irods_rule_engine_plugin-irods_rule_language",
             "plugin_specific_configuration": {
                     "re_data_variable_mapping_set": [
                         "core"
                     ],
                     "re_function_name_mapping_set": [
                         "core"
                     ],
                     "re_rulebase_set": [
                         "core"
                     ],
                     "regexes_for_supported_peps": [
                         "ac[^ ]*",
                         "msi[^ ]*",
                         "[^ ]*pep_[^ ]*_(pre|post|except)"
                     ]
             },
             "shared_memory_instance": "irods_rule_language_rule_engine"
         },

We want to change the re_rulebase_set to include our customrules.re file.

Note that the .re extension is left off in the configuration, but the file itself must have it for the server to find it in the directory.

                 "re_rulebase_set": [{
                             "filename" : "customrules"
                     },
                     {
                             "filename" : "core"
                     }
                 ],

Now let’s test that checksum rule. Note that the irods superuser does not have rules applied to tasks, so we’ll use another user, john, to test.

john@ubuntu-xenial:~$ iput my-file.txt
john@ubuntu-xenial:~$ ils -L
/tempZone/home/john:
  john           0 demoResc        29 2019-11-23.21:41 & my-file.txt
 sha2:PPV9kd8elf4mA0OGbrK+I7qENRhTtws2okP2RV2mbMc= generic /var/lib/irods/Vault/home/john/my-file.txt

You’ll notice that here we have not used the -K flag, but iRODS generates the checksum anyway, because of the msiDataObjChksum service call we added to the post-upload rule acPostProcForPut.

More information about the Rule Engine and the Dynamic Policy Enforcement Points can be found in the manual.

iRODS? I can find my data!

Let’s see how can we apply metadata to the files to find them later.

First, some files. I’m going to upload some books from Project Gutenberg:

Shaving Made Easy: What the Man Who Shaves Ought to Know by Anonymous

Shavings: A Novel by Joseph Crosby Lincoln

#upload the books
irods@ubuntu-xenial:~$ iput ShavingMadeEasy.mobi 
irods@ubuntu-xenial:~$ iput Shavings.mobi 

Now that we have the files, let’s add some metadata about them:

irods@ubuntu-xenial:~$ imeta add -d ShavingMadeEasy.mobi Author Anonymous
irods@ubuntu-xenial:~$ imeta add -d Shavings.mobi Author "Joseph Crosby Lincoln"

So what does it look like now that we’ve set it?

irods@ubuntu-xenial:~$ imeta ls -ld ShavingMadeEasy.mobi
AVUs defined for dataObj ShavingMadeEasy.mobi:
attribute: Author
value: Anonymous
units:
time set: 2019-11-27.21:29:53
----
attribute: Title
value: Shaving Made Easy
units:
time set: 2019-11-27.21:30:34

Now that we have some metadata, we can search on it. There is a query syntax which allows wildcard, string, and numeric searching. Be aware that the search is case sensitive.

Let’s find all the files that have a metadata field of ‘Author’ set, where the value starts with ‘A’:

irods@ubuntu-xenial:~$ imeta qu -d Author like A%
collection: /tempZone/home/rods
dataObj: ShavingMadeEasy.mobi

How about searching within the string, in this case for part of an Author’s name?

irods@ubuntu-xenial:~$ imeta qu -d Author like %Crosby%
collection: /tempZone/home/rods
dataObj: Shavings.mobi

Finally, let’s find all the files where the Author metadata has been set:

irods@ubuntu-xenial:~$ imeta qu -d Author like %
collection: /tempZone/home/rods
dataObj: ShavingMadeEasy.mobi
----
collection: /tempZone/home/rods
dataObj: Shavings.mobi

iRODS? I put it where?

I could write an entire article on the iquest command! This powerful command allows you to find files across the entire Zone, no matter which resource they are in, with a SQL-like query language.

For example, how about a one-line command to show you which users have uploaded files, how much data they use, and how it is distributed over which resources?

irods@ubuntu-xenial:~$ iquest "User %-9.9s uses %14.14s bytes in %8.8s files in '%s'" "SELECT USER_NAME, sum(DATA_SIZE),count(DATA_NAME),RESC_NAME"
'%s'" "SELECT USER_NAME,sum(DATA_SIZE),count(DATA_NAME),RESC_NAME"
User john   uses         261 bytes in     9 files in 'demoResc'
User rods   uses         217 bytes in     4 files in 'demoResc'

iRODS? I have more to say!

In addition to the above, here are some other things you might want to look into once you have your Zone up and running.

Federation

This is linking two or more Zones together, and allows users from one Zone to be granted access to Objects in another. One way to use this is a ‘hub and spoke’ design, where one Zone is used as a hub for authentication and users then connect on to other Zones—so authentication only needs to be handled in one place and differing policies, security models, and designs can be used on each sub-zone.

Capabilities

An iRODS Capability is a pre-built set of rules and configurations designed around a particular use case. Some examples are:

  • Automated Ingest Framework - watching a filesystem and automatically registering new files into iRODS, making them available to rule-based workflows, or simply visible for further metadata tagging or retrieval
  • Storage Tiering - rule-based migration between different resource types
  • Indexing and Publishing - indexing Collections into external search systems such as Elasticsearch

iRODS? Your RODS!

While quick to set up, iRODS provides powerful and flexible tools for automating your data management. Next time you’re shaving the yak of cataloging your files or S3 buckets, I hope you’ll give it a try!

December 21, 2019

Day 21 - Being kind to 3am you

By: Katie McLaughlin (@glasnt)
Edited by: Cody Wilbourn (@codywilbourn)

I like sleep.

Sleep is just the best. Waking up when my body is done, my batteries are 100% charged, and I can get going with the best possible start to my day. I can get up, do a bit of exercise, get my coffee, and start my day. I can operate to the best of my ability. I can be productive and happy and get things done.

But when I'm tired. Oh, when I'm tired...

I don't operate well when I'm tired. Being unable to focus or see properly or think straight really inhibits my ability to be productive. My brain doesn't work right, and I just can't even.

So when I get paged, especially at night, I'm tired. I'm not operating at my best. I'm not going to be 100% there. I'm not going to be as quick thinking.

But if I'm paged, stuff be broken, yo; so I have to get in and fix it.

So what I need to do is to set myself up to be the best I can be when -- not if, when -- the pager goes off at 3am.

By "being paged", I mean that my phone has decided to make loud noises in the middle of the night to tell me that something is wrong, because I have previously setup a monitoring system for my servers that tell me if they aren't responding to ping and are immediately offline. Or if they are at high disk space and are at risk of becoming offline. Or if my inbound traffic is triggering autoscaling that isn't able to handle the load.

Alerts that immediately require human intervention.

Even at 3am.

Oh, you have an on call rotation that's not "me, myself, and I"? Or you "follow the sun"? Good for you. Keep doing that. Having the people who are already awake being paged? That's great. A lot of places don't have that luxury.

And it might not be an actual "3am pager". It could be the apocryphal 3am pager: something is going to come up when you won't be at your best -- when you're sick, tired, or just not with it. These tips and tricks can help you when you're not 100%, and you can use the time when you are 100% to feed back into this system.

These are all based on my personal experiences of being the gal with the pager for years, across many environments, many roles, and many companies. From co-location, to web hosting, to machine learning pipelines, to platforms as a service. I have just a bit of experience at being awake at 3am in front of a laptop because an evil little app on my mobile has woken me up.

The Essentials.

Basically? Documentation. You should really write something down.

Documentation scales. Documentation is there when none of your coworkers are. Documentation is there when your senior database administrator is on leave. Documentation is there after your contractors' engagements have ended.

I mean, it's useful to have documentation, sure. But where to have that documentation is also important to consider.

You could go whole hog and do a full knowledge base and documentation management system, but that all requires buy-in and resources. Sometimes, you can't get that.

And what you really want is a good night's sleep. You just want to throw some notes down somewhere.

The tool you use to write things down could be anything: a Google Doc, OneNote, Emacs, Vim, VSCode... But consider where these docs live. Make sure that everyone on your on-call rotation can access your Google Drive or OneDrive. Or consider putting the docs closer to your working space: the wiki on your GitHub repo, or say under your username on your Confluence page. Or even a text file in a private project repo, in a pinch.

But, where you put these notes needs to have some important features.

Editable. It needs to be editable. Yes, sure, but editable means something important here. Wherever you're storing this stuff, it needs to be easy to edit. You need to be able to readily add new information and remove out-of-date information. This might seem obvious, but it's such an important feature. If not, you could be stuck with, say, a "documentation repo" that needs to have content approved before it can be merged, and that is a huge blocker.

Searchable. Paper doesn't work here. You need to have something you can Control-F at 3am, wherever you are in the world. This is especially useful for those road warriors, the sysadmins who travel while on call. What I've also found super useful is creating a custom search engine. Having a keyword I can throw into a new tab of my browser to search my notes wiki is so helpful. I personally have custom search keywords for a number of services like Wikipedia or Twitter, as well as a keyword for my company wiki, and another for the code store. If I come across a problem that’s not documented in either the wiki or code, then it’s probably something third party and I have to search the public internet, or worse: the knowledge is trapped inside someone's head. At 3am, information in someone else's head is useless.

Discoverable. This is where having a wiki excels. A web-based system means that your coworkers can see it too. They can also use your custom search engines to find your notes, and perhaps collaborate and help improve them!

Access control. Consider that you probably don't want your internal Terraform docs in your public GitHub repo. Your sysadmins who have the credentials to provision things should probably be able to see those docs, especially the docs describing where to find that magic SSH key that lets Ansible deploy. This is going to be highly context specific, but it's probably a good idea to have this within your company's authentication barrier (or "firewall", if you still have one of those).

But the question is, what do you write down? What do you want to be able to discover at 3am?

Well, this is really going to depend on your environment.

Are you working in a Docker shop? Kubernetes? Lots of networking? What is super useful for 3am is sharpened tools. Commands with all those strange flags, more esoteric actions, or inspection scripts. Things that aren't aliased (though having these in your docs in case you forget what you have around is super useful). Having a copy-pasteable command that does something like: show me all the load balancers that have high memory usage, display the Docker containers with high CPU, show me the pod balance across the region.
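A couple of hedged examples of what those sharpened tools can look like (the flags are real, but which commands matter, and which column the node name lands in, depend entirely on your environment and tool versions):

# Snapshot of Docker containers with their CPU and memory usage
docker stats --no-stream --format "table {{.Name}}\t{{.CPUPerc}}\t{{.MemUsage}}"

# Rough pod balance per node: count pods grouped by the NODE column of -o wide
kubectl get pods --all-namespaces -o wide --no-headers | awk '{print $8}' | sort | uniq -c | sort -rn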

Ensure these stored tools are easily usable. There's nothing worse than having leading dollar signs or non-obvious environment variables in your stored commands that mean you have to think about editing them before they can be used.

Make sure these commands are safe. Don't put any destructive commands in these caches unless they're clearly marked as such and have big red warning signs.

Especially avoid destructive chains of commands that start with basic search commands. For instance, show me all the docker containers... then delete them all.

You want to make sure that 3am you isn't blindly using a tool that's going to make things worse.


Stepping up.

So now that you have the initial basics, you need to think about their evolution. Stepping up this repository of useful hints, how can you make it work for you?

Again, your mileage may vary, but I can offer advice for what I've seen work. I'm a sysadmin, I'm not your sysadmin.

Integration. Integration is a big step. And so, so useful. Having your personal cache end up being moved into an "SRE Tips" page that appears on the home page for your on-call rotation information. Having it linked up in the channel topic of your firehouse chat channel. Making it readily available as well as useful.

Templates. Templates are great. When you have large repetitive tasks that also need custom care and attention to detail (be it new physical server deployment, or new client on-boarding), turn it into a template that you can copy each time. Even something as simple as making sure you link to the AWS EC2 search for the name of the server, and then any custom notes. Especially if one of these widgets has custom configurations outside of your provisioning automation that could be overwritten if you aren't careful (ask me how I know!).

Contextual Integration. Another big bonus is not just linking to the cache, but having it contextual. One fleet I maintained had a lot of different machines across different operating systems and virtualisation types. Physical machines, VMware, KVM, Xen; Linux, Windows, different versions of those in between. And depending on the service or the server that was having the issue there would be a link on the Nagios alert to the documentation for that particular service or server. This meant having a swap alert on a Linux box would immediately show the sysadmin on call a link to the basic debugging for that service. If there was a listing for the service specifically for that server, that would be shown instead. This was incredibly useful for those pesky machines that were notorious for having memory leaks or other bugbear issues.

Post mortems. When there is an issue, document some of the debugging steps that were used in the useful-tips doc. This could be something as simple as saving a copy of a sanitized bash history somewhere, but it is so very, very useful when a senior SRE can show a junior which of those sharpened tools they used.
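Even something this low-tech works as a starting point (a rough sketch; the redaction pattern and the destination path are made up, so adjust them to whatever secrets and layout your environment has):

# Grab the last 100 commands from the incident, scrub obvious secrets,
# and park the result next to the post-mortem notes
history | tail -n 100 \
  | sed -E 's/(password|token|secret)=[^ ]+/\1=REDACTED/g' \
  > ~/postmortems/$(date +%F)-disk-incident.txt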

Which, in itself, brings me to the third major point.


The Feedback Loop.

These caches don't just appear overnight. They evolve over time as they are used -- and are useful -- for those on call. Having a cache of the flags on an esoteric CLI is one thing, but having a well oiled bag of tricks is another.

That goes double for recurring issues.

Now, this is different from one-off things, and I want to focus on this for a moment.

In an ideal world, no issue should happen more than once, because, hey, all problems are immediately fixed by the on-call engineer and will never happen again, right?

For anyone who has ever worked in operations for any period of time (or dev, for that matter), you know there's always a compromise between workarounds and root cause fixes. That server that keeps alerting due to critical disk space usage? Is it easier to occasionally clean up the old web server logs, or to set up a scheduled task that archives logs older than a week?
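In the simplest case that scheduled task is just a pair of cron entries plus find (a rough sketch; the paths, retention windows, and schedule are all assumptions about your environment):

# /etc/cron.d/archive-old-weblogs (hypothetical paths and retention windows)
# Compress week-old logs nightly, then purge compressed logs older than 30 days
30 2 * * * root find /var/log/webserver -name "*.log" -mtime +7 -exec gzip {} \;
45 2 * * * root find /var/log/webserver -name "*.log.gz" -mtime +30 -delete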

The engineering time to create such a script that's appropriate for the environment is non-trivial if it doesn't already exist, especially when considerations like data retention or GDPR come into play and affect the implementation.

So, sometimes it's easier to, say, change monitoring to soft-alert at 10% free disk space so an engineer can clean things up during the day, as opposed to waking up the engineer at 3am with a critical 5% free alert that would result in the same action.
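If your checks are something like the classic check_disk plugin, that tuning is literally just the warning and critical thresholds; a hedged sketch, assuming percent-of-free-space thresholds on the log volume:

# Warn (daytime cleanup) below 10% free, page (3am critical) below 5% free
check_disk -w 10% -c 5% -p /var/log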

Tuning alerts and actions for recurring events is absolutely valid, even in cases where "Yes, we'll fix that Soon™️".

You can even start applying automation to these manual functions. Something simple like adding a for-loop to the start of a command to apply it to many servers. Or making that for-loop smarter by turning it into an Ansible playbook that can check for properties on the server before applying commands. Taking the commands in your bag of tricks and turning them into cron jobs, or somesuch.
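That first automation step really can be that small; a hedged sketch, assuming SSH access and a hypothetical naming scheme for the web tier:

# Run the cleanup from the bag of tricks across the whole web tier
for host in web-{1..6}.example.com; do
  ssh "$host" 'sudo find /var/log/webserver -name "*.log.gz" -mtime +30 -delete'
done

The Ansible-playbook version is mostly the same idea, with the added ability to check facts about each host before the command runs.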

Again, it's going to depend on your environment, both machine and people.

But in all of this, the biggest thing that I can suggest: turn this into a learning opportunity for the members of your team. This feeds back into the discoverability and feedback loop steps, but also makes sure that juniors or other team members "learn the weather".

If you have a junior that's starting to shadow in your on-call rotation, show them the iffy machines; give them a chance to debug things themselves, but work with them to solve issues in a timely manner. Make sure that any of those "We'll fix that soon" items are noted, or even better: the alerts are modified, for your junior's sake. This is so, so important so that people know what to expect.

And when you finally get to fixing things, please make sure you communicate this. Having one sysadmin, or heaven forbid, a BOFH, be the only one that knows the temperament of your system doesn't scale. There's a certain joy when your entire on-call rotation are able to be pseudo-meteorologists and can just sorta *tell* what might be going on.

While it might be great that things finally get fixed, that those alerts go away, I've been here long enough to know it's not always that simple. Infrastructure changes almost always cause other issues down the line. Make sure you communicate these with your team, and in your documentation cache.

Bonus points: if you're going to be personally making big changes that might have the slightest chance of raising alerts, take the pager. Please. Especially if it's during your normal working hours and you're not already on call.

Remember that you should be reviewing this data and deprecating when required. Having a cache of information that is out of date isn't just annoying, it can be catastrophic. Your on-call engineer that finds the old fix-everything command that used to be the silver bullet that is now the WORST THING TO RUN... they should not be finding that in their search results. Deprecating content could be as simple as moving into a cache that's *not* searchable in your main search keyword, but still keeping it around in a secondary system.

Or, once considered in the light of day, deleting it entirely.

Having a lack of information at 3am is bad. Having actively harmful information is worse.


Empathy.

Because at the end of the day, empathy is critical.

Showing empathy for your fellow engineer, who is going to be thankful for that full night's sleep, is paid back in kind.

Showing self-care by giving yourself the tools to help you get your job done so you can go back to counting sheep.

Making sure your junior or new on-call engineers don't freak out in the middle of the night because you left them a note about that upgrade, so those new errors they're seeing are totally okay (well, not okay, but not unexpected).

Thinking more about how fewer pages make everyone sleep easier, and what can be done to achieve that.

Working in a team is hard, but as soon as you start expecting work out of hours, especially when on-call is involved, practicing explicit empathy makes things so much easier for everyone involved.

Get started now.

You're not your best when you're tired, but you'll do your future self a favour by starting your bag of tricks today. A sanitised bash history, an odd command here and there, just start somewhere. Evolve it, and it'll help you on those early morning calls so you can get back to sleep.

December 20, 2019

Day 20 - Importing and manipulating your Terraform configuration

By: Paul Puschmann (@ppuschmann)
Edited by: Scott Murphy (@ovsage)

What is terraform?

Terraform is a tool for building, changing, and versioning infrastructure safely and efficiently. Terraform can manage existing and popular service providers as well as custom in-house solutions.

Configuration files describe to Terraform the components needed to run a single application or your entire datacenter. Terraform generates an execution plan describing what it will do to reach the desired state, and then executes it to build the described infrastructure. As the configuration changes, Terraform is able to determine what changed and create incremental execution plans which can be applied.

The infrastructure Terraform can manage includes low-level components such as compute instances, storage, and networking, as well as high-level components such as DNS entries, SaaS features, etc.

The key features of Terraform are:

  • Infrastructure as Code
  • Execution Plans
  • Resource Graph
  • Change Automation

Source: Introduction to Terraform

On December 5th, 2019, Kief Morris already pointed out on SysAdvent that you should
“break up your Terraform setup before it breaks you”
and referenced a HashiCorp video with Nicki Watt about the proposed way an e-commerce system should evolve: Evolving Your Infrastructure with Terraform

I like the approach and the presentation and would like to share some of our experiences:

We already had an elaborate Terraform setup that was complex and contained some duplicate code and definitions, with separate folders for our different staging environments.
This setup got converted to a Terragrunt configuration, which resulted in technically very “DRY” code.
But this also brought some additional burdens:

We had to keep up with the changes of another tool (Terragrunt) and had to deal with very complex code:
our code had modules nested into modules containing another layer of modules. The vast number of variable redirections created by this nesting of modules was slowing us down and prevented the further effective development of new features. For example, onboarding new colleagues to this
very DRY pile of code consumed a lot of time, and other colleagues were reluctant to make deeper changes at all.

With the changes in the upcoming Terraform 0.12 we decided to take a step back and give the code a fresh start while keeping the infrastructure itself up and running. Our goals were:

  • to have understandable code (with the compromise of having duplicate code)
  • to be able to make changes to the configuration
  • to actually own the code as a team and share the responsibility

After making up our minds we agreed on the following strategy:

  1. Create plain, unoptimized Terraform configuration that resembles the current state of our project
  2. Split the configuration into separate configuration domains
  3. Introduce Terraform modules to reduce the amount of duplicated code and definitions

The following examples will show you one way to get these tasks done. The third step, the creation of Terraform modules, is left out to keep the size of this article down.

Importing

We discussed the following approaches for the configuration rewrite:

  • Doing completely manual imports with terraform import.
  • Use terraformer to generate config-files and create new statefiles.
  • Try to apply an empty configuration and parse the verbose diff to actually create new configuration files.

We chose the third approach (because we can), which is just a more automatic version of the first approach.

The automated way: terraformer

Terraformer is a CLI tool that generates .tf and .tfstate files based on existing infrastructure (reverse Terraform).

Terraformer may not support all the components you use, but will perhaps cover a great deal of them.

Example

Inside an empty directory, create a .tf file with this content:

provider "google" {
}

Execute terraform init to download and initialize the required Terraform providers.

Execute terraformer with parameters to import your current live-configuration:

terraformer import google --regions=europe-west1 --projects=myexample-project-1 --resources=addresses,instances,disks,firewalls
2019/12/07 22:13:11 google importing project myexample-project-1 region europe-west1
2019/12/07 22:13:13 google importing... addresses
2019/12/07 22:13:14 Refreshing state... google_compute_address.tfer--ext-myexample-webserver
2019/12/07 22:13:16 google importing... instances
2019/12/07 22:13:17 Refreshing state... google_compute_instance.tfer--myexample-webserver
2019/12/07 22:13:19 google importing... disks
2019/12/07 22:13:21 Refreshing state... google_compute_disk.tfer--europe-west1-b--myexample-webserver-data
2019/12/07 22:13:21 Refreshing state... google_compute_disk.tfer--europe-west1-b--myexample-webserver
2019/12/07 22:13:22 google importing... firewalls
2019/12/07 22:13:23 Refreshing state... google_compute_firewall.tfer--default-allow-ssh
2019/12/07 22:13:23 Refreshing state... google_compute_firewall.tfer--fw-i-myexample-webserver-ssh
2019/12/07 22:13:23 Refreshing state... google_compute_firewall.tfer--fw-i-myexample-webserver-web
2019/12/07 22:13:23 Refreshing state... google_compute_firewall.tfer--default-allow-icmp
2019/12/07 22:13:23 Refreshing state... google_compute_firewall.tfer--default-allow-internal
2019/12/07 22:13:25 google Connecting....
2019/12/07 22:13:25 google save addresses
2019/12/07 22:13:25 google save tfstate for addresses
2019/12/07 22:13:25 google save instances
2019/12/07 22:13:25 google save tfstate for instances
2019/12/07 22:13:25 google save disks
2019/12/07 22:13:25 google save tfstate for disks
2019/12/07 22:13:25 google save firewalls
2019/12/07 22:13:25 google save tfstate for firewalls

The resulting tree in the filesystem looks like this:

.
├── generated
│   └── google
│       └── myexample-project-1
│           ├── addresses
│           │   └── europe-west1
│           │       ├── compute_address.tf
│           │       ├── outputs.tf
│           │       ├── provider.tf
│           │       └── terraform.tfstate
│           ├── disks
│           │   └── europe-west1
│           │       ├── compute_disk.tf
│           │       ├── outputs.tf
│           │       ├── provider.tf
│           │       └── terraform.tfstate
│           ├── firewalls
│           │   └── europe-west1
│           │       ├── compute_firewall.tf
│           │       ├── outputs.tf
│           │       ├── provider.tf
│           │       └── terraform.tfstate
│           └── instances
│               └── europe-west1
│                   ├── compute_instance.tf
│                   ├── outputs.tf
│                   ├── provider.tf
│                   └── terraform.tfstate
└── terraform.tf

Taking a look into the results:

Head of file generated/google/myexample-project-1/firewalls/europe-west1/compute_firewall.tf

resource "google_compute_firewall" "tfer--default-allow-icmp" {
  allow {
    protocol = "icmp"
  }

  description    = "Allow ICMP from anywhere"
  direction      = "INGRESS"
  disabled       = "false"
  enable_logging = "false"
  name           = "default-allow-icmp"
  network        = "https://www.googleapis.com/compute/v1/projects/myexample-project-1/global/networks/default"
  priority       = "65534"
  project        = "myexample-project-1"
  source_ranges  = ["0.0.0.0/0"]
}

Head of file generated/google/myexample-project-1/instances/europe-west1/compute_instance.tf

resource "google_compute_instance" "tfer--myexample-webserver" {
  attached_disk {
    device_name = "myexample-webserver-data"
    mode        = "READ_WRITE"
    source      = "https://www.googleapis.com/compute/v1/projects/myexample-project-1/zones/europe-west1-b/disks/myexample-webserver-data"
  }

  boot_disk {
    auto_delete = "true"
    device_name = "persistent-disk-0"

    initialize_params {
      image = "https://www.googleapis.com/compute/v1/projects/debian-cloud/global/images/debian-9-stretch-v20190326"
      size  = "20"
      type  = "pd-standard"
    }

    mode   = "READ_WRITE"
    source = "https://www.googleapis.com/compute/v1/projects/myexample-project-1/zones/europe-west1-b/disks/myexample-webserver"
  }

  can_ip_forward      = "false"
  deletion_protection = "false"
  enable_display      = "false"

  labels = {
    ansible-group = "webserver"
  }

  machine_type = "n1-standard-2"

  metadata = {
    enable-oslogin = "TRUE"
  }

  name = "myexample-webserver"

  network_interface {
    access_config {
      nat_ip       = "315.256.10.276"
      network_tier = "PREMIUM"
    }

    name               = "nic0"
    network            = "https://www.googleapis.com/compute/v1/projects/myexample-project-1/global/networks/default"
    network_ip         = "10.132.0.10"
    subnetwork         = "https://www.googleapis.com/compute/v1/projects/myexample-project-1/regions/europe-west1/subnetworks/default"
    subnetwork_project = "myexample-project-1"
  }

  project = "myexample-project-1"

  scheduling {
    automatic_restart   = "true"
    on_host_maintenance = "MIGRATE"
    preemptible         = "false"
  }

  service_account {
    email  = "998619246879-compute@developer.gserviceaccount.com"
    scopes = ["https://www.googleapis.com/auth/compute.readonly"]
  }

  tags = ["webserver", "test", "sysadvent"]
  zone = "europe-west1-b"
}

The results are impressive for the supported services.

After this you might spend some time rewriting the configuration and moving resources between statefiles.
At least now you have valid configuration you can edit and deploy from, as well as valid statefiles.

Terraformer will also generate configuration for resources with lifecycle attributes like deletion prevention, a bonus compared to the manual imports.

The hard way: manual imports

A different approach could be to start with a nearly empty Terraform-file:

provider "google" {
}

Executing terraform plan, you’d now get a list of resources that would be removed.
With some grep and sed magic you can recreate your resource definitions.
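As a rough illustration of that magic (a sketch only: the exact plan wording changes between Terraform versions, and resources inside modules or with meta-arguments will still need manual attention):

# Turn "will be destroyed" lines from the plan into empty resource stubs
terraform plan -no-color \
  | grep 'will be destroyed' \
  | sed -E 's/^ *# ([^.]+)\.([^ ]+) will be destroyed.*/resource "\1" "\2" {}/'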

But note: resources that use a lifecycle attribute to prevent deletion will not be mentioned in the diff created by terraform plan.

Example:

resource "google_compute_address" "ext_address_one" {
  name    = "ext-address-one"

  lifecycle {
    prevent_destroy = true
  }
}

Conclusion on imports

Imports can save time on recovery or when transforming configuration.

Independent of the method you use to import or recreate your configuration, limitations will apply.
The generated code or diff will not honour any Terraform modules that were originally used to create the resources,
but will create static resource definitions.
Values, for example external IP addresses, will get hardcoded into the resource definitions.

In short: you won’t get perfect Terraform configuration with an import, but at least you’ll be some steps ahead.

Working with statefiles

Especially when using Terraformer to import your configuration and generate code, you’ll find yourself with a set of configuration and statefiles,
one of each per resource type and region.
This generated code is functional, but far from a structure you want to work with.

The terraform state command provides a versatile set of subcommands to manipulate Terraform statefiles.

With the help of terraform state mv you can rename
resources or move resources to different statefiles.
This command also allows you to move resources in and out of modules.
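For example, renaming a resource in place or pulling a standalone resource into a module’s address space looks something like this (the resource and module names here are hypothetical):

# Rename a resource within the same statefile
terraform state mv google_compute_address.old_name google_compute_address.new_name

# Move a standalone resource into a module
terraform state mv google_compute_instance.web 'module.webservers.google_compute_instance.web'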

Other useful commands are terraform state pull and terraform state push to pull or push the statefile
from the configured storage backend.

Moving away from “terralith”

I’d like to show some methods on how to move between some of the different models Nicki Watt describes.

The name terralith is a synonym for a big Terraform configuration that contains items from various infrastructure domains,
possibly a complete project.
Ideally you’d like to change your Terraform configuration without breaking your application or environments.

For demonstration purposes I created a setup of two MySQL instances and six webservers, using modules.
Configuration of these instance types is bundled in a terralith, meaning there’s one statefile for the complete
setup of the project: firewall rules, GCE instances, NAT gateway, DNS setup and other configuration.

Of course using modules is one way to move away from terralith, but I’d like to show you a different way to go first:
Moving on by splitting configuration into multiple, domain-separated blocks without increasing the technical complexity at the same time.

The first cut of configuration in the following example will be the separation of the “server” related parts from the general parts of the configuration.
The current setup will stay in a directory main; the new, separated setup will be located in the directory servers right next to main.

The current resources created in the statefile of our main project:

$ terraform state list
data.google_compute_zones.available
data.google_project.project
data.terraform_remote_state.mydemo
google_compute_address.bastionhost
google_compute_address.nat_gateway
google_compute_firewall.ext_to_bastionhost
google_compute_firewall.intern_to_ext
google_compute_firewall.bastion_to_intern
google_compute_router.nat_gateway
google_compute_router_nat.nat_gateway
module.bastionhosts.data.google_compute_zones.available
module.bastionhosts.google_compute_disk.instances[0]
module.bastionhosts.google_compute_instance.instances[0]
module.bastionhosts.google_dns_record_set.instances[0]
module.bastionhosts.google_dns_record_set.instances-private[0]
module.mysql.data.google_compute_zones.available
module.mysql.google_compute_disk.additional_v2["mysql-1-varlibmysql"]
module.mysql.google_compute_disk.additional_v2["mysql-2-varlibmysql"]
module.mysql.google_compute_disk.instances_v2["mysql-1"]
module.mysql.google_compute_disk.instances_v2["mysql-2"]
module.mysql.google_compute_instance.instances_v2["mysql-1"]
module.mysql.google_compute_instance.instances_v2["mysql-2"]
module.mysql.google_dns_record_set.instances-private_v2["mysql-1"]
module.mysql.google_dns_record_set.instances-private_v2["mysql-2"]
module.webservers.data.google_compute_zones.available
module.webservers.google_compute_disk.instances_v2["web-1"]
module.webservers.google_compute_disk.instances_v2["web-2"]
module.webservers.google_compute_disk.instances_v2["web-3"]
module.webservers.google_compute_disk.instances_v2["web-4"]
module.webservers.google_compute_disk.instances_v2["web-5"]
module.webservers.google_compute_disk.instances_v2["web-6"]
module.webservers.google_compute_instance.instances_v2["web-1"]
module.webservers.google_compute_instance.instances_v2["web-2"]
module.webservers.google_compute_instance.instances_v2["web-3"]
module.webservers.google_compute_instance.instances_v2["web-4"]
module.webservers.google_compute_instance.instances_v2["web-5"]
module.webservers.google_compute_instance.instances_v2["web-6"]
module.webservers.google_dns_record_set.instances-private_v2["web-1"]
module.webservers.google_dns_record_set.instances-private_v2["web-2"]
module.webservers.google_dns_record_set.instances-private_v2["web-3"]
module.webservers.google_dns_record_set.instances-private_v2["web-4"]
module.webservers.google_dns_record_set.instances-private_v2["web-5"]
module.webservers.google_dns_record_set.instances-private_v2["web-6"]

The state is saved in a Google storage bucket.

terraform {
  backend "gcs" {
    bucket = "mydemoproject"
    prefix = "dev/main"
  }
}

Example of the configured module webservers:

module "webservers" {
  source = "git::ssh://github.com/<sorry-this-only-an-example>.git"

  instance_map = {
    "web-1" : { zone = "europe-west1-b" },
    "web-2" : { zone = "europe-west1-c" },
    "web-3" : { zone = "europe-west1-d" },
    "web-4" : { zone = "europe-west1-b" },
    "web-5" : { zone = "europe-west1-c" },
    "web-6" : { zone = "europe-west1-d" },
  }

  region              = var.region
  machine_type        = "n1-standard-4"
  disk_size           = "15"
  disk_image          = var.image
  subnetwork          = var.subnetwork
  subnetwork_project  = var.network_project
  tags                = ["webserver-dev"]
  label_ansible_group = "webserver"

  dns_domain_intern       = var.internal_dnsdomain
  dns_managed_zone_intern = var.internal_managedzone

  project         = var.project
  network_project = var.network_project
}

We first create a new directory for the new MySQL instance and webserver configuration, called servers, add the required
files for variables and providers, move over the module configuration, and point the state to a different file:

terraform {
  backend "gcs" {
    bucket = "mydemoproject"
    prefix = "dev/servers"
  }
}

After the initialization of this Terraform configuration with terraform init, you can execute terraform plan
to check what Terraform thinks needs to change. The plan will be to create two MySQL instances
and six webservers, because the statefile of this configuration is still empty and doesn’t yet contain the
actual state of the running instances.

Execute the following commands in the main directory to create a local copy of the remote statefile and
then move the states of the named modules from the local statefile to the new file ../servers/default.tfstate:

$ terraform state pull | tee default.tfstate
$ terraform state mv -state=default.tfstate -state-out=../servers/default.tfstate 'module.webservers' 'module.webservers'
Move "module.webservers" to "module.webservers"
Successfully moved 1 object(s).
$ terraform state mv -state=default.tfstate -state-out=../servers/default.tfstate 'module.mysql' 'module.mysql'
Move "module.mysql" to "module.mysql"
Successfully moved 1 object(s).

After the successful creation of a new statefile in your new configuration directory, do this:

In the servers directory, upload the statefile with terraform state push default.tfstate to your configured storage backend.
Then also move the configuration for the resources you just transferred to the new state into the servers directory.

Execute terraform plan to get this output:

No changes. Infrastructure is up-to-date.

This means that Terraform did not detect any differences between your
configuration and real physical resources that exist. As a result, no
actions need to be performed.

Move back to the directory main and push the changed statefile to remote: terraform state push default.tfstate

Delete your local statefiles for cleanup and you’re done.

Conclusion

Thank you for reading all this. I hope this helped you understand the first steps away from a terralith towards a more modular setup.
Unfortunately, a more detailed explanation of working with modules, as well as a list of the pros and cons, is beyond the scope of this article.

As you can see, working with Terraform can be more than just terraform init, terraform plan & terraform apply. Changes to the scope of the configuration do not automatically mean destroying and recreating everything.

December 19, 2019

Day 19 - SRE Practice: Error Budgets

By: Nathen Harvey (@nathenharvey)
Edited by: Paul Welch (@pwelch)

Site Reliability Engineering (SRE) is a set of principles, practices, and organizational constructs that seek to balance the reliability of a service with the need to continually deliver new features. It is highly likely that your organization utilizes many SRE principles and practices even without having an SRE team.

The Scenario

Let's look at an example of an online retailer. The customer's typical flow through the site is likely familiar to you. A customer browses the catalog, selects one or more items to add to the cart, views the shopping cart, enters payment information, and completes the transaction.

The operations team for this retailer meets regularly to review metrics, discuss incidents, plan work, and such. During a recent review, the team noticed a trend where the time it takes from submitting payment details to receiving a status update (e.g., successful payment or invalid card details) was gradually increasing. The team raised concerns that as this processing time continued to increase, revenue would drop off. In short, they were concerned that customers would feel this slowdown and take their business elsewhere.

The team worked together with the development team to diagnose the reasons behind this degradation and made the necessary changes to improve the speed and consistency of the payment processing. This required the teams to take joint ownership of the issues and work together to resolve them.

Fixing the issues required some heroics from the development and operations teams; namely, they worked day and night to get a fix in place. Meanwhile, some new features that the product owners were pushing to launch took longer than initially anticipated. In the end, the product teams were unhappy about feature velocity, and both the development and operations teams were showing some signs of burnout and had trouble understanding why the product owners were not prioritizing the work to hasten payment processing. In short, the issue was resolved but nobody was happy.

On further reflection and discussion of the scenario there were a few things that really stood out to everyone involved.

  • The outcome was good: payment processing was consistently fast and customers kept buying from the retailer.
  • The internal frustration was universal: product owners were frustrated with the pace of new development and development teams were frustrated with pressure to deliver new features while working to prevent an impending disaster.
  • Visibility was lacking
    • The product owners did not know latency was increasing
    • The work to resolve the latency issues was only visible to the people doing that work.
  • The product owners agreed that they would have prioritized this work if they had known about the issue (please join me in willingly suspending our belief in hindsight bias when considering this stance).

Error Budgets

Error Budgets provide teams a way to describe the reliability goals of a service, ways to spend that error budget, and the consequences of missing the reliability goals. SRE practices prescribe a data-driven approach. Error budgets are based on actual measures of the system's behavior.

Taking a bottom-up approach, we will define our Error Budget using Service Level Indicators (SLIs), Service Level Objectives (SLOs) and an Error Budget policy.

Service Level Indicators (SLIs)

We start with SLIs. An SLI is a metric that gives some indication about the level of service a customer of your service is experiencing. All systems have latent SLIs. The best way to discover the SLIs for your system is to consider the tasks a customer is trying to accomplish with the system. We might call these paths through the system Critical User Journeys (CUJs).

Working together with everyone who cares about our application, we may identify that purchasing items is a CUJ for our example application. We agree that the payment status page should load quickly. This is the first SLI we have identified. We know that customers will notice if the payment page is not loading quickly and this may lead to fewer sales. However, saying that the page should load quickly is not precise enough for our purposes, we have to answer a few more questions, such as:

  • How do we measure "quickly"?
  • When does the timer start and stop?

The best SLI measurements are taken as close to the customer as possible. For example, we may want to start the timer when the customer clicks the "buy" button and end the timer when the resulting page is fully rendered in the customer interface. However, this might require additional instrumentation in the application and consent from the customers to measure accurately. A sufficient proxy for this could be measured at the load balancer for our application. When a POST is made to a particular URL at the load balancer the timer will start, and when the full response is sent back to the customer the timer will end. This will clearly miss things like requests that never make it to the load balancer, responses that never make it back to the customer, and a whole host of things that can go wrong between the customer and the load balancer. But there are significant benefits to using this data, including the ease with which we can collect it. Starting simple and iterating for more precision is strongly recommended.
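As a sketch of how simple that starting point can be: assuming a hypothetical load balancer access log (lb-access.log) with the HTTP method, path, and response time in milliseconds as the first three fields, the SLI is just a ratio:

# Proportion of POSTs to /api/pay that were answered within 500 ms
awk '$1 == "POST" && $2 == "/api/pay" { total++; if ($3 < 500) good++ }
     END { if (total > 0) printf "SLI: %.3f%%\n", 100 * good / total }' lb-access.log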

An SLI should always be expressed as a percentage so we may further refine this SLI as follows:

The proportion of HTTP POSTs to /api/pay
that send their entire response 
within X ms measured at the load balancer.

With our metric in hand, we must now agree on a goal, or Service Level Objective, that we are trying to meet.

Service Level Objectives (SLOs)

An SLO is the goal that we set for a given SLI. Looking at our SLI, notice that we did not define how fast (X ms), nor did we define how many POSTs should be considered.

Getting everyone to agree to a reasonable goal can be difficult. Ask a typical product owner "how reliable do you want this feature to be?" and the answer is often "110%, I want the service to be 110% reliable." As awesome as that sounds, we all know that even targeting 100% reliability is not a worthwhile endeavor. The investments required to go from 99% to 99.9% to 99.99% grow exponentially. Meanwhile, a typical customer's ability to notice these improvements will disappear. And that's the key: we should set a target that keeps a typical customer happy with our service.

So, let's agree on some goals for our SLI. Each SLO will be expressed as a percentage over a period of time. So we may set an SLO such as: 99% of POSTs in the last 28 days have a response time < 500ms.

Every request that takes less than 500ms is said to be within our SLO. Those that take longer than that are outside our SLO and consume our Error Budget.

Error Budget Policy

An error budget represents the acceptable level of unreliability. Our sample SLO gives us an error budget that allows up to 1% of all purchase requests to take longer than 500ms to process.
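To make that concrete with made-up traffic numbers: if the site sees 2,000,000 payment POSTs in a 28-day window, the 99% SLO leaves an error budget of 20,000 requests that may be slower than 500ms. If 12,000 requests have already blown past that threshold halfway through the window, 60% of the budget is burned with 50% of the window remaining, which is exactly the kind of signal the policy discussion below is meant to act on.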

As a cross-functional team, before our error budget is exhausted, we must agree on a policy. This policy can help inform how to spend the error budget and what the consequences are of overspending the budget.

The current state of the error budget helps inform where engineering effort is focused. With the strictest of interpretations, without any remaining error budget, all engineering work focuses on things that make the system more reliable; no customer-facing features should be built or shipped. When budget is available we build and ship customer-facing features.

There are, of course, less drastic measures that you might agree to when there is no remaining error budget. For example, perhaps you will agree to prioritize some of the remediation items that were identified during your last audit, retrospective, or post-mortem. Maybe your team agrees to put engineering effort into better observability when there is no error budget remaining.

Having excess error budget available is not necessarily a good thing. Exceeding our targets has a number of potential downsides. First, we may inadvertently be resetting our customers' expectations; they will become dependent on this new level of service. Second, building up this excess is a signal from the system that we are not learning enough or shipping new features fast enough. We may use existing budget to focus energy on introducing risky features, paying down technical debt, or injecting latency to help validate our assumptions about what is "fast enough."

Our SLOs typically look back over a rolling time window. Our sample SLO, for example, looks at POSTs over the last 28 days. Doing this allows the recent past to have the most impact on the decisions we make.

You already do this

Applying the practice of Error Budgets, SLIs, and SLOs does not require adding anyone new to your team. In fact, you may already have these practices in place. Using these terms and language helps move the practice from implicit to explicit and allows teams to be more intentional about how to prioritize work.

Looking back at our example retailer, they had some SLIs, SLOs, and even an error budget in place. Let's look at what they were.

SLI

The time it takes to process payment.

SLO

The response times should look and feel similar to previous response times.

Error Budget Policy

When the error budget is consumed the team will work extra hours
to fix the system until it is once again meeting the service level
objectives.  This should take as long as necessary but should not 
interrupt the existing flow of new features to production.

There are a number of problems with these definitions though. They are not defined using actual metrics and data from the system, and a human's intuition is required to assess whether the objective is being met. The consequences of missing the objective or overspending the budget are not humane to anyone involved and are likely to introduce additional issues with both reliability and feature velocity. These also were not agreed to, discussed, or shared across the various teams responsible for the service.

Intentional, Iterative

Improving your team's practice with SLIs, SLOs, and error budgets requires an intentional, iterative approach. Gather the team of humans that care about the service and its customers, identify some CUJs, discover the SLIs, and have frank discussions about reliability goals and how the team will prioritize work when the budget is consumed and when there is budget available. Use past data as input to the discussions, agree on a future date to revisit the decisions, and build the measurements required to make data-driven decisions. Start simple using the best data you can gather now. Above all else, agree that whatever you put in place first will be wrong. With practice and experience you will identify and implement ways to improve.

Learn More

Google provides a number of freely available SRE Resources, including books and training materials, online at https://google.com/sre.