sysadvent - Systems Administration Advent Calendar

<h1>Day 23 - What is eBPF?</h1>
<p>
By: Ania Kapuścińska (<a href="https://twitter.com/lambdanis">@lambdanis</a>)<br />
Edited by: Shaun Mouton (<a href="https://twitter.com/sdmouton">@sdmouton</a>)
</p>
<p>
Like many engineers, for a long time I’ve thought of the Linux kernel as a black
box. I've been using Linux daily for many years - but my usage was mostly
limited to following the installation guide, interacting with the command line
interface and writing bash scripts.
</p>
<p>
Some time ago I heard about eBPF (extended BPF). The first thing I heard was
that it’s a programmable interface for the Linux kernel. Wait a second. Does
that mean I can now inject my code into Linux without fully understanding all
the internals and compiling the kernel? The answer turns out to be approximately
yes!
</p>
<p>
An eBPF (or BPF - these acronyms are used practically interchangeably) program
is written in a restricted version of C. Restricted, because a dedicated
verifier checks that the program is safe to run in a BPF VM - it can’t crash,
loop infinitely, or access arbitrary memory. If the program passes the check, it
can be attached to some kind of event in the Linux kernel, and run every time
this event happens.
</p>
<p>
A growing ecosystem makes it easier to create tools on top of BPF. One very
popular framework is <a href="https://github.com/iovisor/bcc">BCC</a> (BPF
Compiler Collection), containing a Python interface for writing BPF programs.
Python is a very popular scripting language, for good reason - simple syntax,
dynamic typing and a rich standard library make writing even complex scripts quick
and fun. On top of that, bcc provides easy compilation, event attachment and
output processing of BPF programs. That makes it the perfect tool to start
experimenting with writing BPF code.
</p>
<p>
To run code examples from this article, you will need a Linux machine with a
fairly recent kernel version (supporting eBPF). If you don’t have a Linux
machine available, you can experiment in a Vagrant box. You will also need to <a
href="https://github.com/iovisor/bcc/blob/master/INSTALL.md">install Python bcc
package</a>.
</p>
<h2>Very complicated hello</h2>
<p>
Let’s start in a very unoriginal way - with a “hello world” program. As I
mentioned before, BPF programs are written in (restricted) C. A BPF program
printing “Hello World!” can look like this:
</p>
<p>
hello.c
</p>
<pre
class="prettyprint">#define HELLO_LENGTH 13

BPF_PERF_OUTPUT(output);

struct message_t {
    char hello[HELLO_LENGTH];
};

static int strcp(char *src, char *dest) {
    for (int i = 0; src[i] != '\0'; i++) {
        dest[i] = src[i];
    }
    return 0;
};

int hello_world(struct pt_regs *ctx) {
    struct message_t message = {};
    strcp("Hello World!", message.hello);
    output.perf_submit(ctx, &message, sizeof(message));
    return 0;
}
</pre>
<p>
The main piece here is the hello_world function - later we will attach it to a
kernel event. We don’t have access to many common libraries, so we are
implementing strcp (string copy) functionality ourselves. Extra functions are
allowed in BPF code, but have to be defined as static. Loops are also allowed,
but the verifier will check that they are guaranteed to complete.
</p>
<p>
The way we output data might look unusual. First, we define a perf ring buffer
called “output” using the BPF_PERF_OUTPUT macro. Then we define a data structure
that we will put in this buffer - message_t. Finally, we write to the “output”
buffer using the perf_submit function.
</p>
<p>
Now it’s time to write some Python:
</p>
<p>
hello.py
</p>
<pre
class="prettyprint">from bcc import BPF

b = BPF(src_file="hello.c")
b.attach_kprobe(
    event=b.get_syscall_fnname("clone"),
    fn_name="hello_world"
)

def print_message(_cpu, data, _size):
    message = b["output"].event(data)
    print(message.hello)

b["output"].open_perf_buffer(print_message)

while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
</pre>
<p>
We import BPF from bcc - the BPF class is the core of the bcc package's Python
interface to eBPF. It loads our C program, compiles it, and gives us a Python
object to operate on. The program has to be attached to a Linux kernel event -
in this case it will be the clone system call, used to create a new process. The
attach_kprobe method hooks the hello_world C function to the start of a clone
system call.
</p>
<p>
The rest of the Python code reads and prints output. A great feature
provided by bcc is automatic translation of C structures (in this case the “output”
perf ring buffer) into Python objects. We access the buffer with a simple
b[“output”], and use the open_perf_buffer method to associate it with the
print_message function. In this function we read incoming messages with the
event method. The C structure we used to send them gets automatically converted
into a Python object, so we can read “Hello World!” by accessing the hello
attribute.
</p>
<p>
To see it in action, run the script with root privileges:
</p>
<div><pre><code class="language-none">
> sudo python hello.py
</code></pre></div>
<p>
In a different terminal window, run any command, e.g. ls. “Hello World!”
messages will start popping up.
</p>
<p>
Does it look awfully complicated for a “hello world” example? Yes, it does :)
But it covers a lot, and most of the complexity comes from the fact that we are
sending data to user space via a perf ring buffer.
</p>
<p>
In fact, similar functionality can be achieved with much simpler code. We can
get rid of the complex printing logic by using the bpf_trace_printk function to
write a message to the shared trace_pipe. Then, in the Python script we can read
from this pipe using the trace_print method. It’s not recommended for real-world
tools, as trace_pipe is global and the output format is limited - but for
experiments or debugging it’s perfectly fine.
</p>
<p>
Additionally, bcc allows us to write C code inline in the Python script. We can
also use a shortcut for attaching C functions to kernel events - if we name the
C function kprobe__<kernel function name>, it will get hooked to the desired
kernel function automatically. In this case we want to hook into the sys_clone
function.
</p>
<p>
So, hello world, the simplest version, can look like this:
</p>
<pre
class="prettyprint">from bcc import BPF
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello World!\\n"); return 0; }').trace_print()
</pre>
<p>
The output will be different, but what doesn’t change is that while the script
is running, custom code will run whenever a clone system call is starting.
</p>
<h2>What even is an event?</h2>
<p>
Code compilation and attaching functions to events are greatly simplified by the
bcc interface. But a lot of its power lies in the fact that we can glue many BPF
programs together with Python. Nothing prevents us from defining multiple C
functions in one Python script and attaching them to multiple different hook
points.
</p>
<p>
Let’s talk about these “hook points”. What we used in the “hello world” example
is a kprobe (kernel probe). It’s a way to dynamically run code at the beginning
of Linux kernel functions. We can also define a kretprobe to run code when a
kernel function returns. Similarly, for programs running in user space, there
are uprobes and uretprobes.
</p>
<p>
Probes are extremely useful for dynamic tracing use cases. They can be attached
almost anywhere, but that can cause stability problems - a function rename could
break our program. Better stability can be achieved by using predefined static
tracepoints wherever possible. The Linux kernel provides many of those, and for user
space tracing you can define them too (<a
href="https://lwn.net/Articles/753601/">user statically defined tracepoints</a>
- USDTs).
</p>
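<p>
For example, here is a minimal sketch of attaching to a static tracepoint with
bcc's TRACEPOINT_PROBE macro instead of a kprobe (assuming a kernel that exposes
the sched:sched_process_exec tracepoint; run it with root privileges):
</p>
<pre class="prettyprint">from bcc import BPF

program = """
TRACEPOINT_PROBE(sched, sched_process_exec) {
    // runs every time a process calls exec()
    bpf_trace_printk("new program exec'd\\n");
    return 0;
}
"""

BPF(text=program).trace_print()
</pre>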
<p>
Network events are very interesting hook points. BPF can be used to inspect,
filter and route packets, opening a whole sea of possibilities for very
performant networking and security tools. In this category, XDP (eXpress Data
Path) is a BPF framework that allows running BPF programs not only in the Linux
kernel, but also on supported network devices.
</p>
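<p>
As a rough sketch of what that looks like with bcc (assuming an interface named
eth0 whose driver supports XDP - and don't try this on a machine you are logged
into over that interface), a minimal XDP program that drops every incoming packet
can be loaded and attached like this:
</p>
<pre class="prettyprint">from time import sleep
from bcc import BPF

program = """
int xdp_drop_all(struct xdp_md *ctx) {
    return XDP_DROP;   // drop every packet that arrives on the interface
}
"""

b = BPF(text=program)
fn = b.load_func("xdp_drop_all", BPF.XDP)
b.attach_xdp("eth0", fn, 0)

sleep(10)                 # drop traffic on eth0 for ten seconds...

b.remove_xdp("eth0", 0)   # ...then detach the program again
</pre>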
<h2>We need to store data</h2>
<p>
So far I’ve mentioned functions attached to other functions many times. But
interesting computer programs generally have something more than functions -
state that can be shared between function calls. That can be a database or a
filesystem, and in the BPF world it’s BPF maps.
</p>
<p>
BPF maps are key/value pairs stored in the Linux kernel. They can be accessed by
both kernel and user space programs, allowing communication between them.
Usually BPF maps are defined with C macros, and read or modified with <a
href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html">BPF helpers</a>.
There are several different types of BPF maps, e.g.: hash tables, histograms,
arrays, queues and stacks. In newer kernel versions, some types of maps let you
protect concurrent access with spin locks.
</p>
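<p>
Here is a minimal sketch of a hash table map in action (the map and function
names are made up for illustration): the kernel side counts clone() calls per
process, and the Python side reads the same map afterwards:
</p>
<pre class="prettyprint">from time import sleep
from bcc import BPF

program = """
BPF_HASH(counts, u32, u64);

int count_clone(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="count_clone")

sleep(10)  # let the kernel side count clone() calls for a while

# the same map is visible from Python as a dictionary-like object
for pid, count in b["counts"].items():
    print("pid %d called clone() %d times" % (pid.value, count.value))
</pre>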
<p>
In fact, we’ve seen a BPF map in action already. The perf ring buffer we
created with the BPF_PERF_OUTPUT macro is nothing more than a BPF map of type
BPF_MAP_TYPE_PERF_EVENT_ARRAY. We also saw that it can be accessed from the Python
bcc script, including automatic translation of the items’ structure into Python
objects.
</p>
<p>
A good, but still simple example of using a hash table BPF map for communication
between different BPF programs can be found in <a
href="https://www.oreilly.com/library/view/linux-observability-with/9781492050193/">“Linux
Observability with BPF” book</a> (or in the <a
href="https://github.com/bpftools/linux-observability-with-bpf/blob/master/code/chapter-4/uretprobes/example.py">accompanying
repo</a>). It’s a script using a uprobe and a uretprobe to measure the duration of a Go
binary’s execution:
</p>
<pre
class="prettyprint">from bcc import BPF

bpf_source = """
BPF_HASH(cache, u64, u64);

int trace_start_time(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 start_time_ns = bpf_ktime_get_ns();
    cache.update(&pid, &start_time_ns);
    return 0;
}
"""

bpf_source += """
int print_duration(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 *start_time_ns = cache.lookup(&pid);
    if (start_time_ns == 0) {
        return 0;
    }
    u64 duration_ns = bpf_ktime_get_ns() - *start_time_ns;
    bpf_trace_printk("Function call duration: %d\\n", duration_ns);
    return 0;
}
"""

bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "./hello-bpf", sym = "main.main", fn_name = "trace_start_time")
bpf.attach_uretprobe(name = "./hello-bpf", sym = "main.main", fn_name = "print_duration")
bpf.trace_print()
</pre>
<p>
First, a hash table called “cache” is defined with the BPF_HASH macro. Then we
have two C functions: trace_start_time writing the process start time to the map
using cache.update(), and print_duration reading this value using
cache.lookup(). The former is attached to a uprobe, and the latter to a uretprobe
for the same function - main.main in the hello-bpf binary. That allows
print_duration to, well, print the duration of the Go program execution.
</p>
<h2>Sounds great! Now what?</h2>
<p>
To start using the bcc framework, visit its GitHub repo. There is a <a
href="https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md">developer
tutorial</a> and a <a
href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md">reference
guide</a>. Many tools have been built on the bcc framework - you can learn about them
from a <a
href="https://github.com/iovisor/bcc/blob/master/docs/tutorial.md">tutorial</a>
or check <a href="https://github.com/iovisor/bcc/tree/master/tools">their
code</a>. It’s great inspiration and a great way to learn - the code of a single
tool is usually not overly complicated.
</p>
<p>
Two goldmines of eBPF resources are <a href="https://ebpf.io/">ebpf.io</a> and
<a href="https://project-awesome.org/zoidbergwill/awesome-ebpf">eBPF awesome
list</a>. Start browsing any of those, and you have all your winter evenings
sorted :)
</p>
<p>
Have fun!
</p>

<h1>Day 22 - So, You're Incident Commander, Now What?</h1>
<p>
By: Joshua Timberman (<a href="https://twitter.com/jtimberman">@jtimberman</a>)<br />
</p>
<p>
You’re the SRE on call and, while working on a project, your phone buzzes with
an alert: “Elevated 500s from API.”
</p>
<p>
You’re a software developer, and your team lead posts in Slack: “Hey, the
library we use for our payment processing endpoint has a remote exploit.”
</p>
<p>
You work on the customer success team and, during a routine sync with a
high-profile customer, they install the new version of your client CLI. Then,
when they run <em>any</em> command, it exits with a non-zero return code.
</p>
<p>
An incident is <em>any</em> situation that disrupts the ability of customers to
use a system, service, or software product in a safe and secure manner. And in
each of the incidents above, the person who noticed the incident first will most
likely become the incident commander. So, now what?
</p>
<h2>What does it mean to be an incident commander?</h2>
<p>
Once an individual identifies an incident, one or more people will respond to
it. Their goal is to resolve the incident and return systems, services, or other
software back to a functional state. While an incident may have a few or many
responders—only one person is the incident commander. This role is not about
experience, seniority, or position on an org chart; it is to ensure that
progress is being made to resolve the incident. The incident commander must
think about the inputs from the incident, make decisions about what to do next,
and communicate with others about what is happening. The incident commander also
determines when the incident is resolved based on the information they have.
After the incident is over, the incident commander is also responsible for
conducting a post-incident analysis and review to summarize what happened, what
the team learned, and what will be done to mitigate the risk of a similar
incident happening in the future.
</p>
<p>
Having a single person—the incident commander—be responsible for handling the
incident, delegating responsibility to others, determining when the incident is
resolved, and conducting the post-incident review is one of the most effective
incident management strategies.
</p>
<h2>How do you become an incident commander?</h2>
<p>
Organizations vary on how a team member can become an incident commander. Some
call upon the first responder to an incident. Others require specific training
and have an on-call rotation of just incident commanders. However you find
yourself in the role of incident commander, you should be trusted and empowered
by your organization to lead the effort to resolve the incident.
</p>
<h2>Now what?</h2>
<p>
Now that you’re incident commander, follow your organization’s incident response
procedure for the specifics about what to do. But for more general questions,
we’ve got some guidance.
</p>
<h3>What are the best strategies for communication and coordination?</h3>
<p>
One of an incident commander’s primary tasks is to communicate with relevant
teams and stakeholders about the status of the incident and to coordinate with
other teams to ensure the right people are involved.
</p>
<p>
If your <a
href="https://allma.io/blog/effective-incident-management-using-slack">primary
communication tool is Slack</a>, use a separate channel for each incident.
Prefix any time-sensitive notes with “timeline” or “TL” so they are easy to find
later. If higher-bandwidth communication is required, use a video conference,
and keep the channel updated with important information and interactions. When
an incident affects external customers, be sure to update them as required by
your support teams and agreements with customers.
</p>
<p>
In the case of a security incident, there may be additional communications
requirements with your organization’s legal and/or marketing teams. Legal
considerations to communicate may include:
</p>
<ul>
<li>Statutory or regulatory reporting
<li>Contractual commitments and obligations to customers
<li>Insurance claims
</li>
</ul>
<p>
Marketing considerations to communicate may include:
</p>
<ul>
<li>Sensitive information from customer data exposure
<li>“Zero Day” exploits
<li>Special messaging requirements, e.g. for publicly traded companies
</li>
</ul>
<h3>When should you hand off a long-running incident?</h3>
<p>
During an extended outage or other long-running incident, you will likely need a
break. Whether you are feeling overwhelmed, would contribute better
by working on a solution for the incident itself, or need to eat, take
care of your family, or sleep—all are good reasons to hand off incident
command to someone else.
</p>
<p>
Coordinate with your other responders in the appropriate channel, whether that’s
in a Slack chat or in a Zoom meeting. If necessary, escalate by having someone
paged out to get help. Once someone else can take over, communicate with them
on the latest progress, the current steps being taken, and who else is involved
with the incident. Remember, we’re all human and we need breaks.
</p>
<h3>How should you approach post-incident analysis and review?</h3>
<p>
One of an incident commander’s most important jobs is to conduct a post-incident
analysis and review after the incident is resolved. This meeting must be
<em>blameless</em>: That is, the goal of the meeting is to learn what happened,
determine what contributing factors led to the incident, and take action to
mitigate the risk of such an incident happening in the future. It’s also to
establish a timeline of events, demonstrate an understanding of the problems,
and set up the organization for future success in mitigating that problem.
</p>
<p>
The sooner the incident analysis and review meeting occurs after the incident is
resolved, the better. You should ensure adequate rest time for yourself and
other responders, but the review meeting should happen within 24 hours—and
ideally not longer than two business days after the incident. The incident
commander (or commanders) must attend, as they have the most context on what
happened and what decisions were made. Any responders who performed significant
remediation steps or investigation must also attend so they can share what they
learned and what they did during the incident.
</p>
<p>
Because the systems that fail and cause incidents are complex, a good analysis
and review process is complex. Let’s break it down:
</p>
<h4>Describe the incident</h4>
<p>
The incident commander will describe the incident. This description should
detail the impact as well as its scope, i.e., whether the incident affected
internal or external users, how long it took to discover, how long it took to
recover, and what major steps were taken to resolve the incident.
</p>
<p>
“The platform was down” is not a good description.
</p>
<p>
“On its 5 minute check interval, our monitoring system alerted the on-call
engineer that the API service was non-responsive, which meant external customers
could not run their workflows for 15 minutes until we were able to restart the
message queue” is a good description.
</p>
<h4>Contributing factors</h4>
<p>
Successful incident analysis should identify the contributing factors and places
where improvements can be made to systems and software. Our world is complex,
and technology stacks have multiple moving parts and places where failures
occur. Not only can a contributing factor be something technical like “a
configuration change was made to an application,” it can be nontechnical like
“the organization didn’t budget for new hardware to improve performance.” In
reviewing the incident for contributing factors, incident commanders and
responders are looking for areas for improvement in order to identify potential
corrective actions.
</p>
<h4>Corrective action items</h4>
<p>
Finally, incident analysis should determine corrective action items. These must
be specific work items that are assigned to a person or a team accountable for
their completion, and they must be the primary work priority for that person or
team. These aren’t “nice to have,” these are “must do to ensure the safe and
reliable operation of the site or service.” Such tasks aren’t necessarily the
actions taken during the incident to stabilize or remediate a problem, which are
often temporary workarounds to restore service. A corrective action can be as
simple as adding new monitoring alerts or system metrics that weren’t
implemented before. It can also be as complex as rebuilding a database cluster
with a different high availability strategy or migrating to a different database
service entirely.
</p>
<h2>Conclusion</h2>
<p>
If you’ve recently been the incident commander for your first
incident—congratulations. You’ve worked to solve a hard problem that had a lot
of moving parts. You took on the role and communicated with the relevant teams
and stakeholders. Then, you got some much needed rest and conducted a successful
post-incident analysis. Your team identified corrective actions, and your site
or service is going to be more reliable for your customers.
</p>
<p>
Incident management is one of the most stressful aspects of operations work for
DevOps and SRE professionals. The first time you become an incident commander,
it may be confusing or upsetting. Don’t panic. You’re doing just fine, and
you’ll keep getting better.
</p>
<h2>Further reading</h2>
<p>
If you’re new to post incident analysis and review, check out <a
href="https://www.jeli.io/howie-the-post-incident-guide/">Howie: The
Post-Incident Guide</a> from Jeli.
</p>
<p>
PagerDuty also has extensive documentation on <a
href="https://response.pagerduty.com/">incident response</a> and <a
href="https://response.pagerduty.com/training/incident_commander/">incident
command</a>.
</p>

<h1>Day 20 - To Deploy or Not to Deploy? That is the question.</h1>
<p>
By: Jessica DeVita (<a href="https://twitter.com/ubergeekgirl">@ubergeekgirl</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Deployment Decision-Making during the holidays amid the COVID-19 Pandemic
</p>
<p>
<em>A sneak peek into my forthcoming MSc. thesis in Human Factors and Systems
Safety, Lund University</em>.
</p>
<p>
Web services that millions of us depend on for work and entertainment require
vast compute resources (servers, nodes, networking) and interdependent software
services, each configured in specialized ways. The experts who work on these
distributed systems are under <em>enormous</em> pressure to deploy new features,
<em>and</em> keep the services running, so deployment decisions are happening
hundreds or thousands of times every day. While automated testing and deployment
pipelines allow for frequent production changes, an engineer making a change
wants confidence that the automated testing system is working. However,
automating the testing pipeline makes the test-and-release process more opaque
to the engineer, making it difficult to troubleshoot.
</p>
<p>
</p>
<p>
When an incident occurs, the decisions preceding the event may be brought under
a microscope, often concluding that “human error” was the cause. As society
increasingly relies on web services, it is imperative to understand the
tradeoffs and considerations engineers face when they decide to deploy a change
into production. The themes uncovered through this research underscore the
complexity of engineering work in production environments and highlight the role
of relationships with co-workers and management on deployment decision-making.
</p>
<h2>There’s No Place Like Production</h2>
<p>
Many deployments are uneventful and proceed without issues, but unforeseen
permissions issues, network latency, sudden increases in demand, and security
vulnerabilities may only manifest in production. When asked to describe
a recent deployment decision, engineers reported intense feelings of uncertainty
as they could not predict how their change would interact with changes elsewhere
in the system. More automation isn’t always the solution, as one engineer
explains:
</p>
<p>
<em>“I can’t promise that when it goes out to the entire production fleet
that the timing won’t be wrong. It’s a giant Rube Goldberg of a race condition.
It feels like a technical answer to a human problem. I’ve seen people set up
Jenkins jobs with locks that prevent other jobs from running until it’s
complete. How often does it blow up in your face and fail to release the lock?
If a change is significant enough to worry about, there should be a human
shepherding it. Know each other’s names. Just talk to each other; it’s not that
hard.”</em>
</p>
<h2>Decision-making Under Pressure</h2>
<p>
<em>“The effects of an action can be totally different, if performed too
early or too late. But the right time is not clock time: it depends upon the
precise state of the process evolution”</em> (De Keyser, 1990).
</p>
<p>
Some engineers were under pressure to deploy fixes and features before the
holidays, while other engineers were constrained by a "<em>code freeze</em>",
when during certain times of the year, they “<em>can’t make significant
production changes that aren’t trivial or that fix something</em>”. One engineer
felt that they could continue to deploy to their test and staging environments
but warned, “... <em>a lot of things in a staging environment waiting to go out
can compound the risk of the deployments.”</em>
</p>
<p>
Responding to an incident or outage at any time of the year is challenging, but
even more so because of <em>“oddities that happen around holidays”</em> and
additional pressures from management, customers, and the engineers themselves.
Pairing or working together was often done as a means to increase confidence in
decision making. Pairing resulted in joint decisions, as engineers described
actions and decisions with “we”: <em>“So that was a late night. When I hit
something like that, it involves a lot more point-by-point communications with
my counterpart. For example, ‘I'm going to try this, do you agree this is a good
thing? What are we going to type in?’”</em>
</p>
<p>
Engineers often grappled with "<em>clock time</em>" and reported that they made
certain sacrifices to “<em>buy more time</em>” to make further decisions. An
engineer expressed that a change <em>“couldn’t be decided under pressure in the
moment”</em> so they implemented a temporary measure. Fully aware of
the potential for their change to trigger new and different problems, engineers
wondered what they could do <em>“without making it worse”</em>.
</p>
<p>
When triaging unexpected complications, engineers sometimes <em>“went down
rabbit holes”</em>, exemplifying a cognitive fixation known as a “failure to
revise” (Woods & Cook, 1999). Additionally, having pertinent knowledge does not
guarantee that engineers can apply it in a given situation. For example, one
engineer recounted their experience during an incident on Christmas Eve:
</p>
<p>
<em>“...what happens to all of these volumes in the meantime? And so then
we're just thinking of the possible problems, and then [my co-worker] suggested
resizing it. And I said, ‘Oh, can you do that to a root volume?’ ‘Cause I hadn't
done that before. I know you can do it to other volumes, but not the
root.’”</em>
</p>
<p>
Incidents were even <em>more</em> surprising in systems that rarely fail. For
one engineer working on a safety critical system, responding to an incident was
like a <em>“third level of panic”</em>.
</p>
<h3>Safety Practices</h3>
<p>
The ability to roll back a deployment was a critically important capability that
for one engineer was only possible because they had “<em>proper safety practices
in place”.</em> However, rollbacks were not guaranteed to work, as another
engineer explained:
</p>
<p>
<em>“It was a fairly catastrophic failure because the previous migration
with a typo had partially applied and not rolled back properly when it failed.
The update statement failed, but the migration tool didn’t record that it had
attempted the migration, because it had failed. It did not roll back the
addition, which I believed it would have done automatically”.</em>
</p>
<h3>Sleep Matters</h3>
<p>
One engineer described how they felt that being woken up several times during
the night was a direct cause of taking down production during their on-call
shift:
</p>
<p>
<em>“I didn't directly connect that what I had done to try to fix the page
was what had caused the outage because of a specific symptom I was seeing… I
think if I had more sleep it would have gotten fixed sooner”. </em>
</p>
<p>
Despite needing <em>“moral support”</em>, engineers didn’t want to wake up their
co-workers in different time zones: <em>“You don't just have the stress of the
company on your shoulders. You've got the stress of paying attention to what
you're doing and the stress of having to do this late at night.”</em> This was
echoed in another engineer’s reluctance to page co-workers at night as they
“<em>thought they could try one more thing, but it’s hard to be self-aware in
the middle of the night when things are broken, we’re stressed and tired”.</em>
</p>
<p>
Engineers also talked about the impacts of a lack of sleep on their
effectiveness at work as <em>“not operating on all cylinders”</em>, and no
different than having 3 or 4 drinks: “<em>It could happen in the middle of the
night when you're already tired and a little delirious. It's a form of
intoxication in my book.</em>”
</p>
<h3>Blame Culture</h3>
<p>
<em>“What's the mean time to innocence? How quickly can you show that it's not a
problem with your system?”</em>
</p>
<p>
Some engineers described feeling that management was blameful after incidents
and untruthful about priorities. For example, an engineer described the
aftermath of a difficult database migration: <em>“Upper management was not
straightforward with us. We compromised our technical integrity and our
standards for ourselves because we were told we had to”.</em>
</p>
<p>
Another engineer described a blameful culture during post-incident review
meetings:
</p>
<p>
<em>“It is a very nerve-wracking and fraught experience to be asked to come
to a meeting with the directors and explain what happened and why your product
broke. And because this is an interwoven system, everybody's dependent on us and
if something happens, then it’s like ‘you need to explain what happened because
it hurt us.”</em>
</p>
<p>
Engineers described their errors as <em>“honest mistakes”</em> as they made
sense of these events after the fact. Some felt a strong sense of personal
failure, and that their actions were the cause of the incident, as this engineer
describes:
</p>
<p>
<em>“We are supposed to follow a blameless process, but a lot of the time
people self-blame. You can't really shut it down that much because frankly they
are very causal events. I'm not the only one who can't really let go of it. I
know it was because of what I did.”</em>
</p>
<p>
Not all engineers felt they could take <em>“interpersonal risks”</em> or admit a
lack of knowledge without fear of <em>“being seen as incompetent”</em>.
Synthesizing theories of psychological safety with this study’s findings, it
seems clear that environments of psychological safety may increase engineers’
confidence in decision making (Edmondson, 2002).
</p>
<h2>What Would They Change?</h2>
<p>
Engineers were asked “If you could wave a magic wand, what would you change
about your current environment that would help you feel more confident or safe
in your day-to-day deployment decisions?
</p>
<p>
In addition to <em>“faster CI and pre-deployments”, </em>engineers overarchingly
spoke about needing better testing. One participant wanted a better way to test
front-end code end-to-end, <em>"I return to this space every few years and am a
bit surprised that this still is so hard to get right”. </em>In another mention
of improved testing, an engineer wanted “<em>integration tests that exercise the
subject component along with the graph of dependencies (other components,
databases, etc.), using only public APIs. I.e., no "direct to database"
fixtures, no mocking”.</em>
</p>
<h2>Wrapping Up</h2>
<p>
Everything about engineers’ work was made more difficult in the face of a global
pandemic. In the “before times” engineers could <em>“swivel their chair”</em> to
get a <em>“second set of eyes”</em> from co-workers before deploying. While
some engineers in the study had sophisticated deployment automation, others
spoke of manual workarounds with heroic scripts written ‘on the fly’ to repair
the system when it failed. Engineers grappled with the complexities of
automation, and the risk and uncertainty associated with decisions to deploy.
Most engineers using tools to automate and manage configurations did
<em>not</em> experience relief in their workload. They had to maintain skills in
manual interventions when the automation did not work as expected or when they
could not discern the machine’s state. Such experiences highlight the continued
relevance of Lisanne Bainbridge’s (1983) research on the Ironies of Automation
which found that “the more advanced a control system is, the more crucial the
role of the operator”.
</p>
<p>
This study revealed that deployment decisions <em>cannot</em> be understood
independently from the social systems, rituals, and organizational structures in
which they occurred (Pettersen, McDonald, & Engen, 2010). So when a deployment
decision results in an incident or outage, instead of blaming the engineer,
consider the words of James Reason (1990) who said “<em>...operators tend to be
the inheritors of system defects…adding the final garnish to a lethal brew whose
ingredients have already been long in the cooking”.</em> Engineers may bring
their previous experiences to deployment decisions, but the tools and conditions
of their work environment, historical events, power structures, and hierarchy
are what <em>“enables and sets the stage for all human action.”</em> (Dekker &
Nyce, 2014, p. 47).
</p>
<p>
____
</p>
<p>
This is an excerpt from Jessica’s forthcoming thesis. If you’re interested in
learning more about this deployment decision-making study or would like to
explore future research opportunities, send Jessica a message on <a
href="https://twitter.com/ubergeekgirl">Twitter</a>.
</p>
<p>
<strong>References</strong>
</p>
<p>
Bainbridge, L. (1983). Ironies of automation. In G. Johannsen & J. E.
Rijnsdorp (Eds.), <em>Analysis, Design and Evaluation of Man–Machine
Systems</em> (pp. 129–135). Pergamon.
</p>
<p>
De Keyser, V. (1990). Temporal decision making in complex environments.
<em>Philosophical Transactions of the Royal Society of London. Series B,
Biological Sciences</em>, 327(1241), 569–576.
</p>
<p>
Dekker, S. W. A., & Nyce, J. M. (2014). There is safety in power, or power
in safety. <em>Safety Science</em>, 67, 44–49.
</p>
<p>
Edmondson, A. C. (2002). <em>Managing the risk of learning: Psychological
safety in work teams</em>. Citeseer.
</p>
<p>
Pettersen, K. A., McDonald, N., & Engen, O. A. (2010). Rethinking the role
of social theory in socio-technical analysis: a critical realist approach to
aircraft maintenance. <em>Cognition, Technology & Work</em>, 12(3), 181–191.
</p>
<p>
Reason, J. (1990). <em>Human Error</em> (pp. 173–216). Cambridge University
Press.
</p>
<p>
Woods, D. D., & Cook, R. I. (1999). Perspectives on Human Error: Hindsight
Bias and Local Rationality. In <em>In F. Durso (Ed.) Handbook of Applied
Cognitive Psychology</em>. Retrieved 9 June 2021 from
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.474.3161
</p>

<h1>Day 19 - Into the World of Chaos Engineering</h1>
<p>
By: Julie Gunderson (<a href="https://twitter.com/Julie_Gund">@Julie_Gund</a>)<br />
Edited by: Kerim Satirli (<a href="https://twitter.com/ksatirli">@ksatirli</a>)
</p>
<h2>Intro</h2>
<p>
I recently left my role as a DevOps Advocate at PagerDuty to join the Gremlin
team as a Sr. Reliability Advocate. The past few months have been an immersive
experience into the world of Chaos Engineering and all things reliability. That
said, my foray into the world of Chaos Engineering started long before joining
the Gremlin team.
</p>
<p>
From my time as a lab researcher, to being a single parent, to dealing with
cancer, I have learned that the journey of unpredictability is everywhere. I
could never have imagined in college that I would end up doing what I do now. As
I reflect on the path I have taken to where I am today, I realize one thing:
chaos was always right there with me. My start in tech was as a recruiter and
let me tell you: there is no straight line that leads from recruiting to
advocacy. I experimented in my career, tried new things, failed more than a few
times, learned from my experiences and made tweaks. Being a parent is very
similar: most experiences you have along the way fall in one of two
camps: mistakes or learning. With cancer, there was, and is, a lot of
experimenting and learning. Even with the brightest of minds, every person’s
system handles treatments differently. Luckily, I had, and still have, others who
mentor me both professionally and personally, people who help me improve along
the way, and I learned that chaos is a part of life that can be harnessed for
positive change.
</p>
<p>
Technical systems have a lot of similarities to our life experiences: we think
we know how they are going to act, but suddenly a monkey wrench gets thrown into
the mix and poof, all bets are off. So what do we do? We experiment, we try new
things, we follow the data, we don’t let failure stop us in our tracks, and we
learn how to build resiliency in.
</p>
<p>
We can’t mitigate every possible issue out in the wild, but we should be proactive
in identifying potential failure modes. We need to prepare folks to handle
outages in a calm and efficient manner. We need to remember that there are users
on the other end of those ones and zeros. We need to keep our eye on the
reliability needle. Most of all, we need to have empathy for our co-workers, and
remember that we are all in this together, and that we don’t need to be afraid
of failure.
</p>
<h2>Talking about Chaos in the System</h2>
<p>
When a system or provider goes down (cough, cough us-east-1), people notice, and
they share their frustrations, widely. Long Twitter rants are one thing, the
media’s reaction is another: – outages make great headlines, and the old adage
of “all press is good press” doesn’t really hold up anymore. Brand awareness is
one thing, however, great SEO numbers based off of a headline in the New York
Times that calls out a company for being down might not be the best way to go
about it.
</p>
<h2>What is Chaos Engineering</h2>
<p>
So what is Chaos Engineering, and more importantly: why would you want to
engineer Chaos? Chaos Engineering is one of those things that is just
unfortunately named. After all, the practice has evolved a lot from the time
when Jesse Robbins coined the term GameDays, to the codified processes we have
in place today. The word “chaos” can still unfortunately lead to anxiety across
the management team(s) of a company. But, fear not, the practice of Chaos
Engineering helps us all create those highly reliable systems that the world
depends on, it builds a culture of learning, and teaches us all to embrace
failure and improve.
</p>
<p>
Chaos Engineering is the practice of proactively injecting failure into your
systems to identify weaknesses. In a world where everyone relies on digital
systems in some way, shape, or form, almost all of us have a focus on
reliability. After all: the cost of downtime can be astronomical!
</p>
<p>
My studies started at the University of Idaho in microbiology. I worked as a
researcher and studied the effects of carbon dioxide (CO2) on the short-term
storage success of Chinook salmon milt (spoiler alert: there is no advantage to
using CO2). That’s where I learned that <strong>effective</strong> research
requires the use of the scientific method:<br>
</p>
<ol>
<li>Observe the current state
<li>Create a hypothesis
<li>Run experiments in a controlled, consistent environment
<li>Analyze the data
<li>Repeat the experiments and data analysis
<li>Share the results
</li>
</ol>
<p>
In the research process, we focused on one thing at a time, we didn’t introduce
all the variables at once, we built on our experiments as we gathered and
analyzed the data. For example, we started off with the effects of CO2 and once
we had our data we introduced clove oil into the study. Once we understood the
effect on Chinook we moved to Sturgeon, and so on.
</p>
<p>
Similarly, you want to take a scientific approach when identifying weaknesses in
your systems with Chaos Engineering; a key difference is the system
under study: your technical and sociotechnical systems, vs. CO2 and
Chinook salmon milt (also, there are no cool white coats). With Chaos
Engineering you aren’t running around unplugging everything at once, or
introducing 100% memory consumption on all of your containers at the same time;
you take little steps, starting with a small blast radius and increasing that
blast radius so you can understand where the failure has impact.
</p>
<img height="134" src="https://lh3.googleusercontent.com/rvY18DoLrWsIahDlhnI4x-WGDNq7P0s5serUySR7OslmizjdRlkAvdcY2wQcCFu8UFs-AZJI5wEBqTCPlWTdBerob_3Cw7A9i1ymyPuvYNsO3CA3r0oWYWTLbTHyMRlKJws3eXkh" style="margin-left: 0px; margin-top: 0px;" width="203" />
<h2>How do we get there</h2>
<h3>Metrics</h3>
<p>
At PagerDuty, I focused on best practices around reducing <em>Mean Time to
Detection</em> (MTTD) and <em>Mean Time to Resolution</em> (MTTR) of incidents,
and then going beyond those metrics to learning and improvement. I often spoke
about Chaos Engineering and how through intentionally injecting failure into our
systems, we could not only build more reliable systems, we could build a culture
of blamelessness, learning, and continuous improvement.
</p>
<p>
In my time at Gremlin, I have seen a lot of folks get blocked at the very
beginning when it comes to metrics such as MTTD and MTTR. Some organizations may
not have the right monitoring tools in place, or are just at the beginning of the
journey into metric collection. It’s okay if everything isn’t perfect here; the
fact is you can just pick a place to start: one or two metrics to start
collecting that give you a baseline to measure improvement from. As far as
monitoring is concerned, you can use Chaos Engineering to validate what you do
have, and make improvements from there.
</p>
<h3>People</h3>
<p>
On the people-side of our systems, being prepared to handle incidents takes
practice. Waking up at 2am to a barbershop quartet alert singing “The Server is
on Fire” is a blood-pressure-raising experience; however, that stress can be
reduced through practice.
</p>
<p>
For folks who are on-call, it’s important to give them some time to learn the
ropes before tossing them into the proverbial fire. Give folks a chance to
practice incident response through Chaos Engineering, run GameDays and
FireDrills, where people can learn in a safe and controlled environment what the
company process looks like in action. This is also a great way to validate your
alerting mechanisms and response process. At PagerDuty we had a great joint
workshop with Gremlin where people could practice incident response with Chaos
Engineering to learn about the different roles and responsibilities and
participate in mock incident calls. As a piano player, I had to build the muscle
memory needed to memorize Beethoven’s Moonlight Sonata by practicing over, and
over, and over for months. Similar to learning a musical instrument, practicing
incident response builds the muscle memory needed to reduce the stress on those
2am calls. If I can stress (no pun intended) anything from my experiences in
life, it is that repetition and practice are essential elements to handling
surprises calmly.
</p>
<p>
Building a culture of accepting failure as a learning opportunity takes bravery
and doesn’t happen overnight. Culture takes practice, empathy, and patience, so
make sure to take the time to thank folks for finding bugs, for making mistakes,
for accepting feedback, and for the willingness to learn.
</p>
<h4>Speak the language</h4>
<p>
As I mentioned before, sometimes we just have things that are unfortunately
named. Many of us have the opportunity to attend conferences, read articles
and blogs, earn certifications, and so on. It’s important to remember that
leadership often doesn't have the time to do those things. We as individual
contributors, team leaders, engineers, whatever our title may be, need to be
well equipped to speak effectively to our audience; leaders need to understand
the message we are trying to convey. I have found that using the phrase “just
trust me” isn’t always an effective communication tool. I had to learn how to
talk to decision makers and leadership in the terms they used, such as business
objectives, business outcomes, Return on Investment (ROI). By communicating the
business case I was trying to solve, they could connect the dots to the ROI of
adopting and sponsoring new ways of working.
</p>
<h2>It’s a Wrap</h2>
<p>
To sum it up, chaos is part of our lives from the moment we are born, from
learning to walk to learning to code, and all of the messiness in between. We
don’t need to be afraid of experimentation, but we should be thoughtful with our
tests, and be open to learning. For me personally, this next year I plan on
learning to play Bohemian Rhapsody, and professionally, I plan on experimenting
with AWS and building a multi-regional application to test ways to be more
resilient in the face of outages. Wish me luck, I think I’ll need it on both
fronts.
</p>
<p>
Happy holidays, and may the chaos be with you.
</p>

<h1>Day 18 - Minimizing False Positive Monitoring Alerts with Checkmk</h1>
<p>
By: Elias Voelker (<a href="https://twitter.com/Elijah2807">@Elijah2807</a>) and Faye Tandog (<a href="https://twitter.com/fayetandog">@fayetandog</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Good IT monitoring stands and falls with its precision. Monitoring must inform you at
the right time when something is wrong. But similar to statistics, you also have
to deal with errors produced by your monitoring. In this post, I will talk about
two types of errors - false positives and false negatives. And similar again to
statistics, you can’t eliminate these errors completely in your monitoring. The
best you can do is manage them and optimize for an acceptable level of errors.
</p>
<p>
In this article, I share ways in which you can fine-tune
notifications from your monitoring system, to alleviate noisy
alerts and ideally receive only those alerts that are really relevant.
</p>
<p>
Fine-tuning notifications is one of the most important and
rewarding activities when it comes to configuring your monitoring system. The
impact of a well-defined notification setup is felt immediately. First and
foremost, your team will benefit from better focus due to less ‘noise’.
This ultimately results in better service levels and higher service level objective (SLO) attainment
across the board.
</p>
<p>
In this article, I talk about ‘alerts’ and ‘notifications’ using them interchangeably.
An ‘alert’ or ‘notification’ is your monitoring system letting you know that
something is supposedly wrong. Depending on your setup, this may be via email, text or a
trouble ticket in PagerDuty.
</p>
<p>
When I talk about a ‘monitoring system’, I’m referring to both ‘traditional’ IT
infrastructure and application monitoring tools such as Nagios, Zabbix, or
Solarwinds Orion, as well as cloud-based monitoring solutions such as
Prometheus, Datadog or Sensu.
</p>
<h2>Types of Alert Errors</h2>
<p>
Let’s start by examining two common alert errors: false positives and false
negatives.
</p>
<p>
A false positive would be your monitoring tool alerting about an issue when in
reality the monitored system is perfectly fine (or has recovered in the
meantime). It could be a server or service being shown as DOWN because there was a short
glitch in the network connection, or a specific service instance - for example, Apache - restarting to rotate its logs.
</p>
<p>
False negatives are when your monitoring system does not alert you, although
something really is wrong. If you're running an on-prem infrastructure and your
firewall is down, you want to know about it. If your monitoring system for some
reason does not alert you about this, your network may be exposed to all kinds
of threats, which can get you into real trouble, really quickly.
</p>
<p>
However, the cost of erroneous alerting can differ vastly. Hence, when IT Ops
teams try to determine the acceptable level of false positives versus an
acceptable level of false negatives, they will often deem false positives more
acceptable, because a false negative could be a mission-critical system down and
not alerting, whereas a false positive might just be one unnecessary notification that’s
quickly deleted from your inbox.
</p>
<p>
This is why they will err on the side of caution and notify, which is totally
understandable. The consequence, however, is that these teams get drowned in
meaningless alerts, which increases the risk of overlooking a critical one.
</p>
<p>
After all, notifications will only be of help when no — or only occasional — false
alarms are produced.
</p>
<p>
In this article, I use Checkmk to show examples of minimizing false positive
alerting. You can apply the same philosophy with other tools though they may vary in implementation and functionality.
</p>
<h2><strong>1. Don’t alert. </strong></h2>
<p>
My first tip to improve monitoring and reduce the noise of false notifications
is to simply not send notifications. Seriously!
</p>
<p>
In Checkmk, notifications are actually optional. The monitoring system can still
be used efficiently without them. Some large organizations have a sort of
control panel in which an ops team is constantly monitoring the Checkmk
interface. As they will be visually alerted, additional notifications are
unnecessary.
</p>
<p>
These are typically users that can’t risk any downtime of their IT at all, like
a stock exchange for example. They use the problem dashboards in Checkmk to
immediately see the issue and its detail. As the lists are mostly empty, it is
pretty clear when something red pops up on a big dashboard.
</p>
<p>
But in my opinion, this is rather the exception. Most people use some way of
notifying their ops and sysadmin teams, be it through email, SMS or notification
plugins for ITSM tools such as ServiceNow, PagerDuty or Splunk OnCall.
</p>
<h2><strong>2. Give it time</strong></h2>
<p>
So if you’ve decided you don’t want to go down the ‘no notifications’ route from
my previous point, you need to make sure that your notifications are finely
tuned to only notify people in case of real problems.
</p>
<p>
The first thing to tell your monitoring is: Give the system being monitored
time.
</p>
<p>
Some systems produce sporadic and short-lived errors. Of course, what you really
should do is investigate and eliminate the reason for these sporadic problems,
but you may not have the capacity to chase after all of them.
</p>
<p>
You can reduce alarms from systems like that in two ways:
</p>
<ul>
<li>You can simply delay notifications to be sent only after a specified time
AND if the system state hasn’t changed back to OK in the meantime.
<li>You can alert on the number of failed checks. For Checkmk this is the ‘<a
href="https://docs.checkmk.com/latest/en/intro_finetune.html#sporadic_errors">Maximum
number of check attempts for service</a>’ rule set. This will make the
monitoring system check for a defined number of times before triggering a
notification. By multiplying the number of check attempts with your defined
check interval you can determine how much time you want to give the system. The
default Checkmk interval is 1 minute, but you can configure this differently.
</li>
</ul>
<p>
The two options are slightly different in how they treat the monitored system.
By using the number of failed checks, you can be sure that the system has really
been re-checked. If you alert only based on time and you (or someone else)
changed the check interval to a longer timeframe you gain nothing. In Checkmk
specifically there are some other factors as well, but that’s out of scope for
this article. The essential effect is: By giving a system a bit of time to
‘recover’, you can avoid a bunch of unnecessary notifications.
</p>
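<p>
To make the idea concrete, here is a small Python sketch - not Checkmk code, just
an illustration of the principle - that only notifies after a number of
consecutive failed checks:
</p>
<pre class="prettyprint">MAX_ATTEMPTS = 3      # with a 1-minute check interval this gives ~3 minutes
failed_attempts = 0

def notify(message):
    print("ALERT:", message)          # stand-in for email/SMS/PagerDuty

def handle_check_result(state_is_ok):
    """Called once per check interval with the latest result."""
    global failed_attempts
    if state_is_ok:
        failed_attempts = 0           # recovered on its own - no notification
        return
    failed_attempts += 1
    if failed_attempts == MAX_ATTEMPTS:
        notify("service still failing after %d consecutive checks" % failed_attempts)

# a short glitch stays silent, a persistent problem eventually alerts
for result in [True, False, True, False, False, False]:
    handle_check_result(result)
</pre>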
<p>
This method also works great for ‘self-healing’ systems that should recover on
their own; for example, you wouldn’t want to get a notification for a cloud
provider killing an instance to upgrade it when your code will automatically
deploy a new container instance to handle requests
</p>
<p>
Of course, this is not an option for systems that are mission-critical with zero
downtime that require rapid remediation. For example, a hedge-fund that monitors
the network link to a derivative marketplace can't trade if it goes down. Every
second of downtime costs them dearly.
</p>
<h2><strong>3. On average, you don’t have a problem</strong></h2>
<p>
Notifications are often triggered by threshold values on utilization metrics
(e.g. CPU utilization) which might only exceed the threshold for a short time.
As a general rule, such brief peaks are not a problem and should not immediately
cause the monitoring system to start notifying people.
</p>
<p>
For this reason, many check plug-ins have the configuration option that their
metrics are averaged over a longer period (say, 15 minutes) before the
thresholds for alerting are applied. By using this option, temporary peaks are
ignored, and the metric will first be averaged over the defined time period and
only afterwards will the threshold values be applied to this<a
href="https://docs.checkmk.com/latest/en/intro_finetune.html#average_value">
average value</a>.
</p>
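<p>
Again as a purely illustrative Python sketch (this is not how Checkmk implements
it), averaging before applying the threshold suppresses short spikes while still
catching sustained load:
</p>
<pre class="prettyprint">from collections import deque

WINDOW = 15           # e.g. fifteen 1-minute samples
THRESHOLD = 90.0      # per cent CPU utilization

samples = deque(maxlen=WINDOW)

def handle_cpu_sample(value):
    samples.append(value)
    average = sum(samples) / len(samples)
    if average > THRESHOLD:
        print("ALERT: average CPU over %d samples is %.1f%%" % (len(samples), average))

# a single spike to 100% is ignored, sustained high load fires
for value in [20, 25, 100, 30, 20] + [95] * 15:
    handle_cpu_sample(value)
</pre>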
<h2><strong>4. Like parents, like children</strong></h2>
<p>
Imagine the following scenario: You are monitoring a remote data center. You
have hundreds of servers in that data center working well and being monitored by
your monitoring system. However, the connection to those servers goes through
the DC’s core switch (forget redundancy for a moment). Now that core switch goes
down, and all hell breaks loose. All of a sudden, hundreds of hosts are no
longer being reached by your monitoring system and are being shown as DOWN.
Hundreds of DOWN hosts mean a wave of hundreds of notifications…
</p>
<p>
But in reality, all those servers are (probably) doing just fine. Anyway, we
can’t tell, because we can’t connect to them while the core switch is
acting up. So what do you do about it?
</p>
<p>
Configure your monitoring system so that it knows this interdependency. So the
server checks are dependent on that core switch. You can do so in Checkmk by
using ‘parent-child-relationships’. By declaring host A the ‘Child’ of another
‘Parent’ host B, you tell your Checkmk system that A is dependent on host B.
Checkmk pauses notifications for the children if the parent is down.
</p>
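<p>
Conceptually, the logic looks something like this small Python sketch (an
illustration of the idea, not Checkmk internals): while a parent is DOWN, alerts
for its children are suppressed because they are merely unreachable:
</p>
<pre class="prettyprint">parents = {
    "server-01": ["core-switch"],
    "server-02": ["core-switch"],
}

host_state = {"core-switch": "DOWN", "server-01": "DOWN", "server-02": "DOWN"}

def should_notify(host):
    if host_state[host] != "DOWN":
        return False                  # nothing wrong, nothing to notify
    # if any parent is DOWN, the child is only unreachable - stay quiet
    return all(host_state[p] != "DOWN" for p in parents.get(host, []))

for host in host_state:
    print(host, "-> notify" if should_notify(host) else "-> suppress")
</pre>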
<h2><strong>5. Avoid alerts on systems that are supposed to be
down</strong></h2>
<p>
There are hundreds of reasons why a system should be down at times. Maybe some
systems need to be rebooted regularly, maybe you are doing some maintenance or
simply don’t need a system at certain times. What you don’t want is your
monitoring system going into panic mode during these times, alerting
who-knows-whom when a system is supposed to be down. To avoid that, you can use
‘Scheduled Downtimes’.
</p>
<p>
Scheduled downtimes work for entire hosts, but also for individual services. But
why would you send certain services into scheduled downtimes? More or less for
the same reason as hosts – when you know something will be going on that would
trigger an unnecessary notification. You still might want your monitoring to
keep an eye on the host as a whole, but you are expecting and accepting that
some services might go haywire and breach thresholds for some time. An example
could be a nightly cron job that syncs data to long term storage, causing the
disk I/O check to spike. But, if everything goes back to normal once the sync is
through, no need to lose sleep over it.
</p>
<p>
Moreover, you can extend scheduled downtimes to ‘Children’ of a ‘Parent’ host as
well.
</p>
<h2>Wrapping Up</h2>
<p>
I hope this short overview has given you some ideas about really simple ways
to cut down on the number of meaningless notifications your team
is getting from your monitoring system. There are other strategies to do this, but
this should get you started.
</p>
<h2>Additional Resources</h2>
<p>
If you want to learn more about how to manage notifications in Checkmk,
check out this<a href="https://docs.checkmk.com/latest/en/notifications.html">
docs article</a> or<a href="https://forum.checkmk.com"> post a question in the
forum</a>.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-390587273620638462021-12-17T00:00:00.010-05:002021-12-17T00:00:00.176-05:00Day 17 - Death to Localhost: The Benefits of Developing In A Cloud Native Environment<p>
By: Tyler Auerbeck (<a href="https://twitter.com/tylerauerbeck">@tylerauerbeck</a>) <br />
Edited by: Ben Cotton (<a href="https://twitter.com/funnelfiasco">@funnelfiasco</a>)
</p>
<p>
Thank you everyone for joining us today. We gather here to say our goodbyes to
our dear friend, Localhost. They’ve been there for us through the good times,
the bad times, and the “we should really be sleeping right now…but let me just
try one last thing” times. They’ve held our overly-complicated terminal
configurations and—in all likelihood—most of our secrets. But alas, it is time
to let our good friend ride into the sunset.
</p>
<h2>Saying Goodbye</h2>
<p>
But why?! We’ve all likely spent more time than we care to admit making these
machines feel like home. They’re part of the family! Well, as it turns out, that
can become part of the problem. We’ve all seen issues that are accompanied by
the line “well it works on my machine” and a round of laughs. The problem with
localhost is that it can be extremely difficult to ensure that a setup being
utilized by one developer actually matches what is being run by another. This
can happen for any number of reasons such as developer platform (Linux vs MacOS
vs Windows), IDE (VScode vs Jetbrains), or even just the installation method of
the tools you're using. The different combinations of these variables only
exacerbate the problem and likely lead to (at a minimum!) hundreds of hours of
lost productivity. All in the name of working locally. But what if there was a
better way?
</p>
<h2>My Machine is Your Machine</h2>
<p>
With everything becoming Cloud Native these days, why do we want to treat
development any differently? The common trend recently is to push a number of
our workloads into containers. Why? Because with containers we have the ability
to bundle our runtimes, tooling, and any additional dependencies via a
well-defined format. We can expect them to run almost anywhere, the same way,
each and every time. What if we took that same approach and instead of a web
application, we shipped our development environment?
</p>
<p>
Well, as it turns out, this is exactly what a few projects are starting to give
us the ability to do. Now instead of shipping complex Makefiles, multiple
install scripts, or having to ask our users to pipe our mystery scripts into
bash, we can simply just launch our development environments out into the cloud
of our choice. Currently, there are two main projects that offer us this
functionality. If you’re not interested in hosting anything yourself, <a
href="https://github.com/features/codespaces">GitHub Codespaces</a> is a hosted
solution that integrates directly with your codebase and allows you to easily
spin up a VScode instance to get to work. However, if you have more specific
restrictions or just prefer to run your own infrastructure, another project
offering this functionality is <a href="https://www.eclipse.org/che/">Eclipse
Che</a>. Whatever solution works best for your situation is fine. The more
important part of both of these offerings is _how_ they make these environments
available to you.
</p>
<h3>Development Environment Specs</h3>
<p>
Both of the above offerings allow you to specify the dev environment that you
want to make available to your users/developers. It’s important to note that
this is done on a per repository basis because there is never going to be a
single dev environment that works to run them all. This is exactly the mess that
we’re trying to get out of! We want to be able to define an environment that is
purpose-built for the specific project that we are working on!
</p>
<p>
To do this, these platforms give us configuration files: devcontainer.json
(GitHub Codespaces) and devfile (Eclipse Che). Although the specs differ between
the two formats, the underlying principles are the same. Within one well-defined
configuration file, I am able to specify the tooling that needs to be installed, an
image that should be used or built to run all of my development inside of, ports
that need to be exposed, storage that needs to be mounted, plugins to be used, etc.
Everything that I would usually need to configure by hand when getting started
with a project now _just happens_ whenever I launch my environments. So now not
only are we solving the _snowflake_ environment problem, but we are also saving
valuable time because the environment will be configured and ready as soon as we
click launch. It’s just what we’ve always wanted: push button and get to work!
</p>
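<p>
As a rough sketch only (the image, port, and command below are placeholder values,
not taken from any specific project; the field names come from the dev container
spec), a minimal devcontainer.json could be bootstrapped like this:
</p>
<div><pre><code class="language-bash">
# Write a minimal .devcontainer/devcontainer.json via a heredoc (values are illustrative)
mkdir -p .devcontainer
cat > .devcontainer/devcontainer.json <<'EOF'
{
  "name": "my-project",
  "image": "mcr.microsoft.com/devcontainers/base:ubuntu",
  "forwardPorts": [8080],
  "postCreateCommand": "make deps"
}
EOF
</code></pre></div>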
<h2>What Problems Are We Solving</h2>
<p>
This all sounds great, right? But you might be shaking your fist in the air and
screaming "Just let me use my laptop!" While this is absolutely something that I
can empathize with, and it may generally work on personal projects, there are real
problems that are being solved with this approach. I've seen this more
specifically in enterprise development shops where <em>your machine</em> isn't really
<em>your</em> machine. Which brings us to our first problem.
</p>
<h3>Permissions</h3>
<p>
Given the current security environment, most enterprise development shops aren’t
too keen on giving you the permissions to install any of the tooling that you
actually need. I have seen developers lose weeks waiting on a request to just
install their runtime on their machines before they’re ever even able to begin
contributing to their time. Multiply that by every tool and dependency that they
might need and you can imagine how much valuable and productive time is lost in
the name of security and process.
</p>
<p>
By moving to a cloud native development approach, your development environments
can be treated just like any other application that you run and scanned/approved
by your security teams. When a new developer comes on board, they can get right
to work! No more waiting on approvals/installation because this has already gone
through the necessary pipelines and is just ready whenever you are.
</p>
<h3>Develop In Production</h3>
<p>
Alright, so maybe we shouldn’t develop *in* production, but rather in an
environment that is _like_ production. By developing an application in a
location where it will ultimately be running, you get a better feel for
configurations and even failure modes that you otherwise may not experience by
developing solely on your local machine. Expecting certain ports to be
available? Need specific hardware? By ensuring your configuration files mirror
your environments you can determine these problems earlier on in your process
versus finding them once they’ve launched into a staging or production
environment. This ultimately helps you reduce downtime and speeds up your time
to resolving these problems as you may find them before they’re ever even
introduced.
</p>
<h2>Localhost: Still Slightly Alive</h2>
<p>
Realistically, this isn’t going to be a solution for everything or everyone.
There are workloads and development tasks that require specialized environments
or are potentially just not well suited to being done inside of a container
environment. And that’s okay! There are still other approaches to finding a way
off of your local machine and into the hearts of all of your developers without
having to have them sink their time into troubleshooting differences between
each of their machines. The heart of the problem still stands: developers want
to get to work and provide value. Being able to provide on-demand environments
that encapsulate all of the requirements of a project so that they can get
involved immediately helps drive this productivity for both your teams and your
communities, all without having to burn hours troubleshooting a personal
machine.
</p>
<p>
So for now, let us lay our dear friend Localhost to rest. They may no longer be
with us, but have no fear! Our localhost will always be with us up in the
cloud(s)!
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-88628481196970618202021-12-16T00:00:00.038-05:002021-12-16T00:00:00.196-05:00Day 16 - Setting up k3s in your home lab<p>
By: Joe Block (<a href="https://twitter.com/curiousbiped">@curiousbiped</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<h2><strong>Background</strong></h2>
<p>
Compute, even at home with consumer-grade hardware, has gotten ridiculously
cheap. You can get a quad-core ARM machine with 4GB of RAM, like a Raspberry Pi 4, for
under $150, including power supply and SD card for booting - and it'll idle at
less than 5 watts of power draw and be completely silent because it is fanless.
</p>
<h2><strong>What we're going to do</strong></h2>
<p>
In this post, I'll show you how to set up a Kubernetes cluster on a cheap ARM
board (or an x86 box if you prefer) using<a
href="https://rancher.com/products/k3s"> k3s</a> and k3sup so you can learn
Kubernetes without breaking an environment in use.
</p>
<p>
These instructions will also work on x86 machines, so you can repurpose that old
hardware instead of buying a new Raspberry Pi.
</p>
<h3><strong>Why k3s?</strong></h3>
<p>
k3s was created by<a href="https://rancher.com/docs/k3s/latest/en/"> Rancher</a>
as a lightweight, easy to install, and secure Kubernetes option.
</p>
<p>
It's packaged as a single ~40MB binary that reduces the dependencies needed to
get a cluster up and running. It even includes an embedded containerd, so you
don't need to install that or docker. The ARM64 and ARM7 architectures are fully
supported, so it's perfect for running on a Raspberry Pi in a home lab
environment.
</p>
<h3><strong>Why k3sup?</strong></h3>
<p>
Alex Ellis wrote<a href="https://github.com/alexellis/k3sup"> k3sup</a>, a great
tool for bringing up k3s clusters and we're going to use it in this post to
simplify setting up a brand new cluster. With k3sup, we'll have a running
kubernetes cluster in less than ten minutes.
</p>
<h2><strong>Let's get started!</strong></h2>
<h3><strong>Pre-requisites.</strong></h3>
<ul>
<li>A spare linux box. I'll be using a Raspberry Pi for my examples, but you can
follow along on an x86 linux box or VM if you prefer.
<li><a href="https://github.com/alexellis/k3sup">k3sup</a> - download the latest
release from<a href="https://github.com/alexellis/k3sup/releases">
k3sup/releases</a> into a directory in your $PATH.
</li>
</ul>
<h2><strong>Set up your cluster.</strong></h2>
<p>
In the following example, I'm assuming you've created a user (you can use the pi
user on a Raspberry Pi if you prefer) for configuring the cluster (I used borg below),
you've added your SSH public key to that user's ~/.ssh/authorized_keys, and that the user
has sudo privileges. I'm also assuming you've downloaded k3sup and put it into
/usr/local/bin, and that /usr/local/bin is in your $PATH.
</p>
<h3><strong>Create the leader node</strong></h3>
<p>
The first step is to create the leader node with the k3sup utility:
</p>
<div><pre><code class="language-bash">
k3sup install --host $HOSTNAME --user pi
</code></pre></div>
<p>
Below is the output when I ran it against my scratch rPi. In the scrollback
you'll see that I'm using my borg account instead of the pi user. After setting
up the rPi, the first step I took was to disable the known pi account. I also
specify the path to an SSH key that is in the borg account's authorized_keys,
and configure the borg account to allow passwordless sudo.
</p>
<p>
Notice that I don't have to specify an architecture - k3sup automagically
determines the architecture of the host and installs the correct binaries when
it connects to the machine. All I have to do is tell it what host to connect to,
what user to use, what ssh key, and whether I want to use the stable or latest
k3s channels or a specific version.
</p>
<div><pre><code class="language-bash">
❯ k3sup install --host cephalopod.example.com --user borg --ssh-key demo-key \
  --k3s-channel stable
</code></pre></div>
<div><pre><code class="language-bash">
k3sup install --host cephalopod.example.com --user borg --ssh-key demo-key --k3s-channel stable
Running: k3sup install
2021/12/13 16:30:49 cephalopod.example.com
Public IP: cephalopod.example.com
[INFO] Finding release for channel stable
[INFO] Using v1.21.7+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/sha256sum-arm64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/k3s-arm64
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
Result: [INFO] Finding release for channel stable
[INFO] Using v1.21.7+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/sha256sum-arm64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/k3s-arm64
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
[INFO] systemd: Starting k3s
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
Saving file to: /Users/jpb/democluster/kubeconfig
# Test your cluster with:
export KUBECONFIG=/Users/jpb/democluster/kubeconfig
kubectl config set-context default
kubectl get node -o wide
</code></pre></div>
<h3><strong>Test it out</strong></h3>
<p>
Per the directions output by k3sup, you can now test your brand new cluster by
setting the environment variable KUBECONFIG, and then run kubectl to work with
your new cluster.
</p>
<p>
My steps to verify my new cluster is up and running:
</p>
<ol>
<li>export KUBECONFIG=/Users/jpb/democluster/kubeconfig
<li>kubectl config set-context default
<li>kubectl get node -o wide
</li>
</ol>
<p>
And I see nice healthy output where the status shows Ready -
</p>
<div><pre><code class="language-bash">
NAME STATUS ROLES AGE VERSION INTERNAL-IP
EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cephalopod Ready control-plane,master 2m53s v1.21.7+k3s1 10.1.2.3
<none> Ubuntu 18.04.3 LTS 4.9.196-63 containerd://1.4.12-k3s1
</code></pre></div>
<p>
And I can also look at pods in the cluster
</p>
<div><pre><code class="language-bash">
❯ kubectl get pods -A
Alias tip: kc get pods -A
NAMESPACE NAME READY STATUS
RESTARTS AGE
kube-system coredns-7448499f4d-b2rdp 1/1 Running 0
9m29s
kube-system local-path-provisioner-5ff76fc89d-d9rrc 1/1 Running 0
9m29s
kube-system metrics-server-86cbb8457f-cqk6q 1/1 Running 0
9m29s
kube-system helm-install-traefik-crd-jgk2x 0/1 Completed 0
9m29s
kube-system helm-install-traefik-l2j96 0/1 Completed 2
9m29s
kube-system svclb-traefik-7tzzs 2/2 Running 0
8m38s
kube-system traefik-6b84f7cbc-92kkp 1/1 Running 0
8m38s
</code></pre></div>
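<p>
As an optional extra check (not part of the original walkthrough; the deployment
name is arbitrary), you can deploy a throwaway workload and confirm it schedules
and runs before moving on:
</p>
<div><pre><code class="language-bash">
# Create a test deployment, wait for it to roll out, then remove it
kubectl create deployment hello --image=nginx
kubectl rollout status deployment/hello
kubectl get pods -l app=hello
kubectl delete deployment hello
</code></pre></div>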
<h3><strong>Clean Up</strong></h3>
<p>
k3s is tidy and easy to uninstall, so you can stand up a cluster on a
machine, do some experimentation, then dispose of the cluster and have a
clean slate for your next experiment. This makes it great for continuous
integration!
</p>
<div><pre><code class="language-bash">
# shut down the node and delete /var/lib/rancher and the data stored there
sudo /usr/local/bin/k3s-uninstall.sh
</code></pre></div>
<h3><strong>Next Steps</strong></h3>
<p>
Learn kubernetes! Some interesting tutorials that I recommend -
</p>
<ul>
<li>The Kubernetes project has a set of tutorials to get you started at<a
href="https://kubernetes.io/docs/tutorials/">
https://kubernetes.io/docs/tutorials/</a>
<li>VMWare sponsors a free set of online Kubernetes courses at<a
href="https://kube.academy/courses"> https://kube.academy/courses</a>.
</li>
</ul>
<p>
Finally, now that you've set up a cluster the easy way, if you want to
understand everything k3sup did behind the scenes to get your Kubernetes cluster
up and running,<a
href="https://github.com/kelseyhightower/kubernetes-the-hard-way"> Kubernetes
the Hard Way</a> by Kelsey Hightower is a must-read.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-77427394017700538122021-12-15T00:00:00.036-05:002021-12-15T00:00:00.180-05:00Day 15 - Introduction to the PagerDuty API<p>
By: Mandi Walls (<a href="https://twitter.com/lnxchk">@lnxchk</a>) <br />
Edited by: Joe Block (<a href="https://twitter.com/curiousbiped">@curiousbiped</a>)
</p>
<p>
Keeping track of all the data generated by a distributed ecosystem is a daunting
task. When something goes wrong, or a service isn’t behaving properly, tracking
down the culprit and getting the right folks enabled to fix it is also
challenging. PagerDuty can help you with these challenges.
</p>
<p>
The PagerDuty platform integrates with over 600 other components to gather data,
add context, and process automation. Under the hood of all of these integrations
is the PagerDuty API, ready to help you programmatically interact with your
PagerDuty account.
</p>
<h3><strong>What’s Exposed Via the API</strong></h3>
<p>
The<a href="https://developer.pagerduty.com/docs/ZG9jOjQ2NDA2-introduction">
PagerDuty API</a> provides access to all the structural objects in your
PagerDuty account - users, teams, services, escalation policies, etc - and also
to the data objects including incidents, events, and change events.
</p>
<p>
For objects like users, teams, escalation policies, schedules, and services, you
may find using the<a
href="https://registry.terraform.io/providers/PagerDuty/pagerduty/latest/docs">
PagerDuty Terraform Provider</a> will help you maintain the state of your
account more efficiently without using the API directly.
</p>
<p>
The other object types in PagerDuty are more useful when we can send them
anytime from anywhere, including via the API from our own code. Let’s take a
look at three of them: incidents, events, and change events. If you’d like a
copy of the code for these examples, you can find them on<a
href="https://github.com/lnxchk/pdgarage-samples/tree/main/sysadvent-2021">
Github</a>.
</p>
<h3><strong>API Basics</strong></h3>
<p>
To write new information into PagerDuty via the API, you'll need some
authorization. You can use<a
href="https://developer.pagerduty.com/docs/ZG9jOjExMDI5NTcz-o-auth-2-0-functionality">
OAuth</a>, or create an<a
href="https://developer.pagerduty.com/docs/ZG9jOjExMDI5NTUx-authentication#api-token-authentication">
API key</a>. There are<a
href="https://support.pagerduty.com/docs/generating-api-keys#section-generating-a-general-access-rest-api-key">
account-level</a> and<a
href="https://support.pagerduty.com/docs/generating-api-keys#generating-a-personal-rest-api-key">
user-level</a> API keys available. You'll use an account-level key for the
rest of the examples here to keep things simple.
</p>
<p>
To create a key in your PagerDuty app, you'll need Admin, Global Admin, or
Account Owner access to your account. More on that<a
href="https://support.pagerduty.com/docs/user-roles"> here</a>.
</p>
<p>
In PagerDuty, navigate to <em>Integrations</em> and then choose <em>API Access
Keys</em>. Create a new key, give it a description, and save it somewhere safe.
The keys are strings that look like y_NbAkKc66ryYTWUXYEu.
</p>
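<p>
Before creating anything, you can sanity-check that the key works with a simple
read-only request - for example, listing the users on the account:
</p>
<div><pre><code class="language-bash">
curl --url https://api.pagerduty.com/users \
  --header 'Accept: application/vnd.pagerduty+json;version=2' \
  --header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu'
</code></pre></div>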
<p>
Now you’re ready to generate some incidents! These examples use curl, but there
are a number of<a
href="https://developer.pagerduty.com/docs/ZG9jOjExMDI5NTg2-api-client-libraries">
client libraries</a> for the API as well.
</p>
<h3><strong>Incidents</strong></h3>
<p>
Incidents are probably what you’re most familiar with in PagerDuty - they
represent a problem or issue that needs to be addressed and resolved. Sometimes
this includes alerting a human responder. Many of the integrations in the
PagerDuty ecosystem generate incidents from other systems and services to send
to PagerDuty.
</p>
<p>
In PagerDuty, incidents are assigned explicitly to services in your account, so
an incoming incident will register with only that service. If your database has
too many long-running queries, you want an incident to be assigned to the
PagerDuty service representing that database so responders have all the correct
context to fix the issue.
</p>
<p>
If you have a service that doesn’t have an integration out of the box, you can
still get information from that service into PagerDuty via the API, and you
don’t need anything special to do it. You can send an incident to the API via a
curl request to the https://api.pagerduty.com/incidents endpoint.
</p>
<p>
There are three required headers for these requests: Accept, Content-Type, and
From. The From header needs to be an email address associated with your account, for
attribution of the incident. Setting up the request will look something like:
</p>
<div><pre><code class="language-bash">
curl -X POST --header 'Content-Type: application/json' \
--url https://api.pagerduty.com/incidents \
--header 'Accept: application/vnd.pagerduty+json;version=2' \
--header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
--header 'From: system2@myemail.com' \
</code></pre></div>
<p>
Now you need the information bits of the incident. These will be passed as
--data in the curl request. There are just a few required pieces to set up the
format and a number of optional pieces that help add context to the incident.
</p>
<p>
The most important piece you'll need is the service ID. Every object in the
PagerDuty platform has a unique identifier. You can find the ID of a service in
its URL in the UI. It will be something like
https://myaccount.pagerduty.com/service-directory/SERVICEID.
</p>
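<p>
If you'd rather not dig through the UI, you can also look the ID up via the API;
the services endpoint accepts a query parameter that filters by name (the service
name below is just an example):
</p>
<div><pre><code class="language-bash">
curl --get --url https://api.pagerduty.com/services \
  --data-urlencode 'query=my-database' \
  --header 'Accept: application/vnd.pagerduty+json;version=2' \
  --header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu'
</code></pre></div>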
<p>
Now you can create the rest of the message with JSON:
</p>
<div><pre><code class="language-bash">
curl -X POST --header 'Content-Type: application/json' \
--url https://api.pagerduty.com/incidents \
--header 'Accept: application/vnd.pagerduty+json;version=2' \
--header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
--header 'From: system2@myemail.com' \
--data '{
"incident": {
"type": "incident",
"title": "Too many blocked requests",
"service": {
"id": "PWIXJZS",
"summary": null,
"type": "service_reference",
"self": null,
"html_url": null
},
"body": {
"type": "incident_body",
"details": "The service queue is full. Requests are no longer being fulfilled."
}
}
}'
</code></pre></div>
<p>
When you run this curl command, it will generate a new incident on the service
PWIXJZS with the title "Too many blocked requests", along with some context in
the "body" of the data to help our responders. You can add diagnostics or other
information here to help your team fix whatever is wrong.
</p>
<p>
What if there is information being generated that might not need an immediate
response? Instead of an incident, you can create an event.
</p>
<h3><strong>Events</strong></h3>
<p>
Events are non-alerting items sent to PagerDuty. They can be processed via<a
href="https://support.pagerduty.com/docs/rulesets"> Event Rules</a> to help
create context on incidents or provide information about the behavior of your
services. They utilize the<a href="https://support.pagerduty.com/docs/pd-cef">
PagerDuty Common Event Format</a> to make processing and collating more
effective.
</p>
<p>
Events are registered to a particular routing_key via an integration on a
particular service in your PagerDuty account. In your PagerDuty account, select
a service you'd like to send events to, or create a new one to practice with. On
the page for that service, select the <em>Integrations</em> tab and <em>Add an
Integration</em>. For this integration, select "Events API V2" and click
<em>Add</em>. You'll have a new integration on your service page. Click the gear
icon, and copy the <em>Integration Key</em>. For the full walkthrough of this
setup, see the<a
href="https://support.pagerduty.com/docs/services-and-integrations#add-integrations-to-an-existing-service">
docs</a>.
</p>
<p>
The next step is to set up the event. The request is a little different from the
incident request - the url is different, the From: header is not required, and
the authorization is completely handled in the routing_key instead of using an
API token.
</p>
<p>
The content of the request is more structured, based on the Common Event Format,
so that you can create event rules and take actions if necessary based on what
the events contain.
</p>
<div><pre><code class="language-bash">
curl --request POST \
--url https://events.pagerduty.com/v2/enqueue \
--header 'Content-Type: application/json' \
--data '{
"payload": {
"summary": "DISK at 99% on machine prod-datapipe03.example.com",
"timestamp": "2021-11-17T08:42:58.315+0000",
"severity": "critical",
"source": "prod-datapipe03.example.com",
"component": "mysql",
"group": "prod-datapipe",
"class": "disk",
"custom_details": {
"free space": "1%",
"ping time": "1500ms",
"load avg": 0.75
}
},
"event_action": "trigger",
"routing_key": "e93facc04764012d7bfb002500d5d1a6"
}'
</code></pre></div>
<h3><strong>Change Events</strong></h3>
<p>
A third type of contextual data you can send to the API is a<a
href="https://support.pagerduty.com/docs/change-events"> Change Event</a>.
Change events are non-alerting, and help add context to a service. They are
informational data about what's changing in your environment, and while they
don't generate an incident, they can inform responders about other activities in
the system that might have contributed to a running incident. Change events
might come from build and deploy services, infrastructure as code, security
updates, or other places that change is generated in your environment.
</p>
<p>
These events have a similar basic structure to the general events, and the setup
with the routing_key is the same, as you can see in the below example. The
custom_details can contain anything you want, like the build number, a link to
the build report, or the list of objects that were changed during an
Infrastructure as Code execution.
</p>
<p>
Change events have a time horizon. They expire after 90 days in the system, so
you aren't looking at old context based on past changes.
</p>
<div><pre><code class="language-bash">
curl --request POST \
--url https://events.pagerduty.com/v2/change/enqueue \
--header 'Content-Type: application/json' \
--data '{
"routing_key": "737ea619db564d41bd9824063e1f6b08",
"payload": {
"summary": "Build Success: Increase snapshot create timeout to 30 seconds",
"timestamp": "2021-11-17T09:42:58.315+0000",
"source": "prod-build-agent-i-0b148d1040d565540",
"custom_details": {
"build_state": "passed",
"build_number": "220",
"run_time": "1236s"
}
}
}'
</code></pre></div>
<h3><strong>Adding Notes</strong></h3>
<p>
One final fun bit of functionality you can leverage in PagerDuty's API is with
<em>notes</em>. Notes are short text entries added to the timeline of an
incident. In some integrations, like<a
href="https://www.pagerduty.com/integrations/slack/"> PagerDuty and Slack</a>,
notes will be sent to any Slack channel that is configured to receive updates
for an impacted service, making them helpful for responders to coordinate and
record activity across different teams.
</p>
<p>
Notes are associated with a specific incident, so when you are creating a note,
the url will include the incident ID. Incident IDs are similar to the other
object IDs in PagerDuty in that you can find them from the URL of the incident
in the UI. They are longer strings than other objects than the service ID in the
examples above.
</p>
<p>
The content of a note can be anything that might be interesting to the timeline
of the incident, like commands that have been run, notifications that have been
sent, or additional data and links for responders and stakeholders.
</p>
<div><pre><code class="language-bash">
curl --request POST \
--url https://api.pagerduty.com/incidents/{id}/notes \
--header 'Accept: application/vnd.pagerduty+json;version=2' \
--header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
--header 'Content-Type: application/json' \
--header 'From: responder2@myemail.com' \
--data '{
"note": {
"content": "Firefighters are on the scene."
}
}'
</code></pre></div>
<p>
Responders utilizing the UI will see notes in a widget on the incident page.
</p>
<h3><strong>Next Steps</strong></h3>
<p>
Using the API to create tooling where integrations don't yet exist, or for
internally-developed services, can help your team stay on top of all the moving
parts of your ecosystem when you have an incident. Learn more about the PagerDuty resources available
at <a href="https://developer.pagerduty.com/">
https://developer.pagerduty.com/</a>. Join the<a
href="https://community.pagerduty.com"> PagerDuty Community</a> to learn from
other folks working in PagerDuty, ask questions, and get answers.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com1tag:blogger.com,1999:blog-3615332969083650973.post-16812278314954267872021-12-14T00:00:00.001-05:002021-12-14T00:00:00.188-05:00Day 14 - What's in a job description (and who does it keep away)?
<p>
By: Daniel Medina <br />
Edited by: James Turnbull (<a href="https://twitter.com/kartar">@kartar</a>)
</p>
<p>
A colleague supporting our recruitment efforts asked hiring managers if their
"job descriptions are still partying like it's 1999?" The point was to revisit
old postings that had been copy-and-pasted down the years and create something
that would increase engagement with candidates. But reading the title made me
think about a job I applied for (and got) circa 1999. It was a systems
administrator role and included language like
</p>
<blockquote>The associate must regularly lift and/or move 20-35 pounds and occasionally
lift or pull 35-80 pounds.</blockquote>
<p>
No joke, those Sun Microsystems monitors were <i>heavy</i>. Checking a <a href="http://shrubbery.net/~heas/sun-feh-2_1/Devices/Monitor/documents/Monitor_JTF.pdf">fact sheet</a> confirms the "flat screen" (non-curved) 21-inch CRT from around that time was ~80 pounds.
</p>
<div class="separator" style="clear: both;"><a href="https://camo.githubusercontent.com/c29c9168b560838f8deeaaa8481e18fff2593581ecc3b4bd86e5ba6e7c28ef78/68747470733a2f2f692e726564642e69742f78366939343665317a716334312e6a7067" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="450" data-original-width="800" src="https://camo.githubusercontent.com/c29c9168b560838f8deeaaa8481e18fff2593581ecc3b4bd86e5ba6e7c28ef78/68747470733a2f2f692e726564642e69742f78366939343665317a716334312e6a7067"/></a>
<p>
Source: <a href="https://www.reddit.com/r/retrobattlestations/comments/etc3gv/not_x86_week_my_sun_microsystems_collection_ultra/">u/leaningtoweravenger on
Reddit</a>
</p>
</div>
<p>
Large network switches in the Cisco Catalyst 6500 family were easily twice that
weight and were definitely a two-person job. Best practice for racking servers
in the datacenter was to use a <a href="https://www.genielift.com/en/material-handling/material-lifts/gl-8">Genie
Lift</a>.
</p>
<p>
To this day, if I hear someone talking about a <i>strong developer</i> I might wonder
"but how much can they deadlift?" Most job descriptions for roles outside
physical datacenter management don't include this language anymore. This all
got me thinking, <b>what might be in job descriptions these days that could be
turning off candidates?</b>
</p>
<p>
"Education Level" might be one of those things we should re-think. Many
postings require a "Bachelor's Degree". Granted, we don't describe what that
degree is in and I've had colleagues with degrees in History, Library Sciences,
Geology, Economics, and more (even Computer Science!)
</p>
<p>
Sometimes the phrase "or equivalent experience" is added to these requirements.
It's unclear if this means something akin to a college experience, for example,
thirteen weeks reading <i>The Iliad</i> in your teenage years. I've had colleagues
who are Managing Directors and Distinguished Engineers with no college degrees;
so why bother asking for this in our requirements? Maybe it's cloned from an
existing description, or it's a required field in the system used to post the
description and the option "None" isn't pre-filled. At best it's a proxy that
means we're really looking for someone older than 21. At worst, we've dissuaded
some candidates from considering us.
</p>
<p>
Sometimes the HR systems used for creating job descriptions can add unexpected
data to your job descriptions. One job description posted in Montreal
automatically included "Knowledge of French and English is required". This
wasn't a Language Requirement that came from us! We were at a global firm using
English as a common language and would be happy to hire anyone who met Canadian
work requirements and had the skills we were looking for!
</p>
<p>
Other French-language oddities you may encounter are labels like "(H/F)" to
indicate "Homme / Femme", that the job description is intended to be
gender-neutral, despite pronouns and gendered language used throughout. This
isn't as awkward as some of the "s/he will..." references used in
English-language descriptions when the simpler "you", speaking directly to the
candidate, seems so much more natural!
</p>
<p>
Speaking of strange language, some descriptions include language that doesn't
make me think first of a technology role:
</p>
<blockquote>I'm hiring... a hacker that wants to work on the bleeding edge...</blockquote>
<blockquote>We spend a lot of time doing applied research...</blockquote>
<blockquote>You should be the type of person who likes to roll up their sleeves and get
their hands dirty.</blockquote>
<div class="separator" style="clear: both;"><a href="https://upload.wikimedia.org/wikipedia/en/6/6f/Dexter_season_2_DVD.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="320" data-original-height="361" data-original-width="275" src="https://upload.wikimedia.org/wikipedia/en/6/6f/Dexter_season_2_DVD.png"/></a>
Source: Wikipedia:<a href="https://en.wikipedia.org/wiki/Dexter_%28season_2%29"> Dexter (season
2)</a></div>
<p>
Your signal that you have an existing, tight-knit group:
</p>
<blockquote>You'll be part of a small team of like-minded individuals.
</blockquote><p>
might run counter to your efforts to advertise your goals of building a diverse
and inclusive environment, one where the candidate-turned-new-joiner might not
be able to provide their valuable outside perspective if it goes against the
current thinking.
</p>
<p>
We found that we were having trouble filling a "DevOps" role. Without
suggesting that "DevOps isn't a job title", candidates wanted clarification on
what that might mean in our environment. Reviewing some of the many open roles
across different teams showed they varied widely, leaving candidates to try to
figure out which of the <a href="https://web.devopstopologies.com/">DevOps Topologies</a>
they might be walking into (and was it a Pattern or Anti-Pattern?!)
</p>
<p>
These included:
</p>
<ul>
<li> <i>Cloud SecDevOps (Cyber)</i>: This wins keyword bingo</li>
<li> <i>Apply Now to The Wonderful World of DevOps</i>: Points for creative use of the
job title field</li>
<li><i> Devops Specialist - Private Cloud</i>: "providing L3 support... including
on-call"</li>
<li> <i>DevOps Developer</i>: "You are a developer who is not afraid of infrastructure.
You identify with the 'Dev' in DevOps way more than the 'Ops'"</li>
<li> <i>DevOps App Dev</i>: A "release engineer" role that sounded more like DevOps in
practice</li>
<li><i> DevOps Authentication Security L3 Engineer</i>: Okay...</li>
</ul>
<p>
Much of this has been about job descriptions that can lose candidates. What
should you include to gain credibility and interest? An honest declaration of
the mission of the group they’re joining always helps. Don't shy away from
describing a need to support existing legacy systems, even if the goal is to
modernize and move to a new platform. Describe the lifecycle of the team; is it
"newly formed", "fast-growing", or is this a chance to "join an established
team" and learn from established experts?
</p>
<p>
What's the topology of the team, distributed (participation from a range of
locations and timezones in an asynchronous arrangement), multi-site (people
working from two or perhaps three sites passing work off between each other or
operating in overlapping times), or fully co-located (in rough time or
location)? This can affect travel, working hours, and collaboration styles.
</p>
<p>
Basic details of work-life balance should be included. These might include
remote work arrangements (which will likely become a lasting legacy of the
pandemic era), on-call staffing strategies, night and weekend work requirements,
or travel requirements. We tend to advertise "flexible opportunities", which
may have some constraints (we may want individuals to reside in a specific
country but not care as much about sitting in an office).
</p>
<p>
Some of the most thoughtful job descriptions lay out a multi-month roadmap for
the role and growth. "Within three months we expect you to join our on-call
rotation in support of our production environment", "Within six months you will
obtain certification in at least one of our hosting platforms", "Within nine
months you will be doing my job and I will be riding off into the sunset", etc.
Having such a timeline is important to set expectations for performance during
any initial probation period that may be part of local labor law or new hire
contract. This also sets a pace for someone to ramp up in your environment,
ensuring enough time is set aside for required learning as opposed to "throwing
them in the deep end".
</p>
<p>
I've made all the mistakes described here but can take some solace that I've
created zero job postings seeking ninjas, rockstars, gurus, or wizards! Best of
luck to all the hiring managers out there looking for their unicorns!
</p>
<div class="separator" style="clear: both;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Kiss_Cracow_2019.jpg/640px-Kiss_Cracow_2019.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="360" data-original-width="640" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Kiss_Cracow_2019.jpg/640px-Kiss_Cracow_2019.jpg"/></a></div>
<p>
Source: <a href="https://commons.wikimedia.org/wiki/File:Kiss_Cracow_2019.jpg">Wikipedia: Kiss
(band)</a>
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-21729511673925282682021-12-13T00:00:00.000-05:002021-12-13T00:24:55.896-05:00Day 13 - Ephemeral PR Environments: Enabling automated testing at a rapid pace
<p>
By: Amar Sattaur <br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Recently, I've been thinking a lot about how to implement the concepts of least
privilege while also speeding up the feedback cycle in the developer workflow.
However, these two goals are not easily combined: tighter access tends to slow
developers down. To reconcile them, there needs to be underlying tooling and
visibility that gives developers the data they need for a successful PR merge.
</p>
<p>
A developer doesn't care about what those underlying tools are; they just want
access to a system where they can:
</p>
<ul>
<li>See the logs of the app that they're making a change for and the other
relevant apps
</li><li>See the metrics of their app so they can adequately gauge performance impact
</li>
</ul>
<p>
One way to achieve this is with ephemeral environments based on PRs. The idea
is that when a developer opens a PR, a new environment is automatically spun up
based on provided defaults, with the conditions that the environment is:
</p>
<ul>
<li>deployed in the same way that dev/stage/prod are deployed, just with a few
key elements different
</li><li>labeled correctly so that the NOC/Ops teams know the purpose of these
resources
</li><li>Integrated with logging/metrics and useful tags so that the engineer can
easily see metrics for this given PR build
</li>
</ul>
<p>
That sounds like a daunting task but through the use of Kubernetes, Helm, a CI
Platform (GitHub Actions in this tutorial) and ArgoCD, you can make this a
reality. Let's look at an example application leveraging all of this technology.
</p>
<h2><b>Example app</b></h2>
<p>
You can find all the code readily available in this <a href="https://github.com/jodybro/sysadvent2021">GitHub Repo</a>.
</p>
<h1><b>Pre-requisites Used in this Example</b></h1>
<table>
<tbody><tr>
<td><b>Tool</b>
</td>
<td><b>Version</b>
</td>
</tr>
<tr>
<td>kubectl
</td>
<td>v1.21
</td>
</tr>
<tr>
<td>Kubernetes Cluster
</td>
<td>v1.20.9
</td>
</tr>
<tr>
<td>Helm
</td>
<td>v3.6.3
</td>
</tr>
<tr>
<td>ArgoCD
</td>
<td>v2.0.5
</td>
</tr>
<tr>
<td>kube-prometheus-stack
</td>
<td>v0.50.0
</td>
</tr>
</tbody></table>
<p>
The example app that you’re going to deploy today is a Prometheus exporter that
exports a custom metric with an overridable label set:
</p>
<ul>
<li>The `version` of the deployed app
</li><li>The `branch` of the PR
</li><li>The PR ID
</li>
</ul>
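<p>
For illustration only - the metric and label names below are made up, not taken
from the example repo - a scrape of the exporter might look something like this
once the PR build is running:
</p>
<div><pre><code class="language-bash">
# Hit the exporter's metrics endpoint (port 8000, as used later in this article)
curl -s localhost:8000/metrics | grep build_info
# build_info{version="PR-1",branch="my-feature",name="sysadvent2021-pr-1"} 1
</code></pre></div>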
<h3><b>Pipeline</b></h3>
<p>
Now that I've defined the goal, let's go
a little more in-depth on how you'll get there. First, let's take a look at the PR
pipeline in .github/workflows/pull_requests.yml:
</p>
<div><pre><code class="language-yaml">
---
name: 'Build image and push PR image to ghcr'
on:
pull_request:
types: [assigned, opened, synchronize, reopened]
branches:
- main
jobs:
build:
name: Build
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Build image
uses: docker/build-push-action@v1
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GITHUB_TOKEN }}
tags: PR-${{ github.event.pull_request.number }}
</code></pre></div>
<p>
This pipeline runs on pull request events to the main branch. So, when you open a
PR, push a commit to an existing PR, reopen a closed PR, or assign it to a user,
this pipeline will get triggered. It defines two jobs, the first of which
is build. It's relatively straightforward: take the Dockerfile that lives in the
root of your repo and build a container image out of it and tag it for use with
GitHub Container Registry. The tag is the PR ID of the triggering pull
request.
</p>
<p>
The second job is the one where we deploy to ArgoCD:
</p>
<div><pre><code class="language-yaml">
deploy:
needs: build
container: ghcr.io/jodybro/argocd-cli:1.1.0
runs-on: ubuntu-latest
steps:
- name: Log into argocd
run: |
argocd login ${{ secrets.ARGOCD_GRPC_SERVER }} --username ${{ secrets.ARGOCD_USER }} --password ${{ secrets.ARGOCD_PASSWORD }}
- name: Deploy PR Build
run: |
argocd app create sysadvent2021-pr-${{ github.event.pull_request.number }} \
--repo https://github.com/jodybro/sysadvent2021.git \
--revision ${{ github.head_ref }} \
--path . \
--upsert \
--dest-namespace argocd \
--dest-server https://kubernetes.default.svc \
--sync-policy automated \
--values values.yaml \
--helm-set version="PR-${{ github.event.pull_request.number }}" \
--helm-set name="sysadvent2021-pr-${{ github.event.pull_request.number }}" \
--helm-set env[0].value="PR-${{ github.event.pull_request.number }}" \
--helm-set env[1].value="${{ github.head_ref }}" \
--helm-set env[2].value="sysadvent2021-pr-${{ github.event.pull_request.number }}"
</code></pre></div>
<p>
This workflow runs a custom image that I<a href="https://github.com/jodybro/argocd-cli"> wrote</a> that wraps the argocd
cli tool in a container and allows for arbitrary commands to be executed
against an authenticated ArgoCD instance.
</p>
<p>
It then creates a Kubernetes object of kind Application, a custom resource
(defined by a CRD that ArgoCD installs into your cluster) that defines where you
want to pull the application from and how to deploy it (Helm/Kustomize, etc.).
</p>
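<p>
If you want to see the result on the cluster side, you can inspect the generated
Application resource directly (this assumes ArgoCD is installed in the argocd
namespace and uses the naming from this example):
</p>
<div><pre><code class="language-bash">
kubectl --namespace argocd get application sysadvent2021-pr-1 --output yaml
</code></pre></div>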
<h2><b>Putting it all together</b></h2>
<p>
Now, let's see this pipeline in action. First, head to your repo and
create a PR against the main branch with some changes; it doesn't matter what
the changes are as all PR events will trigger the pipeline.
</p>
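<p>
If you prefer the command line, an empty commit is enough to exercise it (the
branch name is arbitrary; the PR itself can be opened in the GitHub UI or with
the gh CLI):
</p>
<div><pre><code class="language-bash">
git checkout -b pr-env-demo
git commit --allow-empty -m "Trigger the PR pipeline"
git push --set-upstream origin pr-env-demo
gh pr create --fill
</code></pre></div>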
<p>
You can see that my PR has triggered a pipeline which can be viewed <a href="https://github.com/jodybro/sysadvent2021/actions/runs/"> here</a>.
Furthermore, you can see that this pipeline was executed successfully, so if I
go to my ArgoCD instance, I would see an application with this PR ID.
</p>
<p>
So, if you are following along, you now have two deployments of this example app: one should show labels for
the main branch, and one should show labels for the PR branch.
</p>
<p>
Let's verify by port-forwarding to each and see what you get back.
</p>
<h3><b>Main branch</b></h3>
<p>
First, let's check out the main branch application:
</p>
<div><pre><code class="language-bash">
kubectl port-forward service/sysadvent2021-main 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
</code></pre></div>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/main-service.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="63" data-original-width="704" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/main-service.png"/></a></div>
<p>
As you can see, the branch is set to main with the correct version.
</p>
<p>
And if you check out the state of our Application in ArgoCD:</p>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-main.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="38" data-original-width="800" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-main.png"/></a></div>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-main-state.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="214" data-original-width="800" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-main-state.png"/></a></div>
<p>
Everything is healthy!
</p>
<h3><b>PR</b></h3>
<p>
Now let's check the PR deployment:
</p>
<div><pre><code class="language-bash">
kubectl port-forward service/sysadvent2021-pr-1 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
</code></pre></div>
<p>
This one's labels are showing the branch and the version from the PR.
</p>
<p>
This pod returns:
</p>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/pr-1-service.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="60" data-original-width="745" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/pr-1-service.png"/></a></div>
<p>
And in ArgoCD:
</p>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-PR-1.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="35" data-original-width="800" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-PR-1.png"/></a></div>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-PR-1-state.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="210" data-original-width="800" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-PR-1-state.png"/></a></div>
<h2><b>Final thoughts</b></h2>
<p>
It really is that easy to get PR environments running in your company!
</p>
<h1>Resources</h1>
* <a href="https://github.com/jodybro/sysadvent2021">Source Code Repo</a>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com1tag:blogger.com,1999:blog-3615332969083650973.post-53297490890693771682021-12-12T00:00:00.074-05:002021-12-12T00:00:00.179-05:00Day 12 - Terraform Refactoring<p>
By: Bill O'Neill (<a href="https://twitter.com/woneill">@woneill</a>)<br />
Edited by: Kerim Satirli (<a href="https://twitter.com/ksatirli">@ksatirli</a>)
</p>
<p>
Terraform is "Infrastructure as Code" and like all code, it is beneficial to
review and refactor to:
</p>
<ul>
<li>improve code readability and reduce complexity
</li><li>improve the maintainability of the source code
</li><li>create a simpler, cleaner, and more expressive internal architecture or
object model to improve extensibility
</li>
</ul>
<p>
This article outlines the approaches that have helped my teams when refactoring
Terraform code bases.
</p>
<h2>
<b>Convert modules to independent Git repositories</b></h2>
<p>
If your Terraform Git repository has grown organically, you will likely have a
monorepo structure complete with embedded modules, similar to this:
</p>
<pre class="prettyprint">$ tree terraform-monorepo/
.
├── README.md
├── main.tf
├── variables.tf
├── outputs.tf
├── ...
├── modules/
│ ├── moduleA/
│ │ ├── README.md
│ │ ├── variables.tf
│ │ ├── main.tf
│ │ ├── outputs.tf
│ ├── moduleB/
│ ├── .../
</pre>
<p>
Encapsulating resources within modules is a great step, but the monorepo
structure makes it difficult to iterate on individual module development, down
the line.
</p>
<p>
Splitting the modules into independent Git repositories will:
</p>
<ul>
<li>Enable module development in an isolated manner
</li><li>Support re-use of module logic in other Terraform code bases, across your
organization
</li><li>Enable publishing to public and private Terraform Registries
</li>
</ul>
<p>
Here's a process that you can follow to make a module a stand-alone Git
repository while preserving the historical log messages. The steps are examples
of how to extract <code>moduleA</code> from the above file tree into its own git
repository.
</p>
<ol>
<li>Clone the Terraform Git repository to a new directory. I recommend naming
the directory after the module you plan on converting.<br />
<code>git clone <REMOTE_URL> moduleA</code>
</li><li>Change into the new directory:<br /><code>cd moduleA</code>
</li>
<li>Use <code>git filter-branch</code> to split out the module into a new
repository.<br /><code>FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch
--subdirectory-filter modules/moduleA -- --all </code>
<p>
Note that we're squelching the warning about <code>filter-branch</code>. See
the filter-branch manual page for more details if you're interested.
</p>
</li><li>Now your directory will only contain the contents of the module itself,
while still having access to the full Git history. <br /><br />You can
run <code>git log</code> to confirm this.
</li>
<li>Create a new Git repository and obtain the remote URL for it, then update the
<code>origin</code> in the filtered repository:<br />
<pre class="prettyprint">git remote set-url origin <NEW_REMOTE_URL>
git push -u origin main
</pre>
</li>
<li>
<p>
Tag the repo as <code>v1.0.0</code> <i>before</i> making any
changes<br />
</p>
<pre class="prettyprint">
git tag v1.0.0
git push --tags
</pre>
</li>
<li>
<p>
Now that the new repository is ready to be used, update the existing references
to the module to use a <code>source</code> argument that points to the tag that
you just created.</p><p>The “<a href="https://www.terraform.io/docs/language/modules/sources.html#generic-git-repository">Generic
Git Repository</a>” section in Terraform's <a href="https://www.terraform.io/docs/language/modules/sources.html">Module
Sources</a> documentation has more details on the
format.</p><p>Replace lines such
as<br /><br /><code>source = "../modules/moduleA"</code>
</p>
<p>
with
</p>
<pre class="prettyprint">source = "git::<NEW_REMOTE_URL>?ref=v1.0.0"
</pre>
</li><li>Alternatively, publishing your module to a <a href="https://www.terraform.io/docs/language/modules/develop/publish.html">Terraform
registry</a> is an option (but this is outside the scope of this article).
</li><li>Once all <code>source</code> arguments that previously pointed to the
directory path have been replaced with references to Git repositories or
Terraform registry references, delete the directory-based module in the original
Terraform repository.
</li>
</ol>
<h2>
Update version constraints with <code>tfupdate</code></h2>
<p>
Masayuki Morita's <code><a href="https://github.com/minamijoyo/tfupdate">tfupdate</a></code> utility can be
used to recursively update version constraints of Terraform core, providers, and
modules. <br /><br />As you start refactoring modules and bumping
their version tags, <code>tfupdate</code> becomes an invaluable tool to ensure
all references have been updated.
</p><p>
Some examples of <code>tfupdate</code> usage, assuming the current directory is
to be updated:
</p>
<ul>
<li>Updating the version of Terraform core:<br /><code>tfupdate terraform
--version 1.0.11 --recursive .</code>
</li><li>Updating the version of the Google Terraform
provider:<br /><code>tfupdate provider google --version 4.3.0 --recursive
.</code>
</li><li>Updating the version references of Git-based module sources can be done with
the module subcommand, for example:<br /><code>tfupdate module
git::<REMOTE_URL> --version 1.0.1 --recursive .</code>
</li>
</ul>
<h2>
Test state migrations with <code>tfmigrate</code></h2>
<p>
Many Terraform users are hesitant to refactor their code base, since changes can
require updates to the state configuration. Manually updating the state in a
safe way involves duplicating the state, updating it locally, then copying it
back in place.
</p>
<p>
In addition to <code>tfupdate</code>, Masayuki Morita has another excellent
utility that can be used to apply Terraform state operations in a declarative
way while validating the changes, before committing them: <code><a href="https://github.com/minamijoyo/tfmigrate">tfmigrate</a></code>
</p><p>
You can do a dry run migration where you simulate state operations with a
temporary local state file and check to see if <code>terraform plan</code> has
no changes after the migration. This workflow is safe and non-disruptive, as it
does not <i>actually</i> update the remote state.
</p>
<p>
If the dry run migration looks good, you can use <code>tfmigrate</code> to apply
the state operations in a single transaction instead of multiple, individual
changes.
</p>
<p>
Migrations are written in HCL and use the following format:
</p>
<pre class="prettyprint">migration "state" "test" {
dir = "."
actions = [
"mv google_storage_backup.stage-backups google_storage_backup.stage_backups",
"mv google_storage_backup.prod-backups google_storage_backup.prod_backups",
]
}
</pre>
<p>
Each action line is functionally identical to the command you’d run
manually such as <code>terraform state <action> …</code>. A full list of
possible actions is available on the <a href="https://github.com/minamijoyo/tfmigrate#migration-block-state.">tfmigrate
website</a>.
</p>
<p>
Quoting resources that have indexed keys <a href="https://github.com/minamijoyo/tfmigrate/issues/5#issuecomment-722378933">can
be tricky</a>. The best approach appears to be using a single quote around the
entire resource and then escaping the double quotes in the index. For example:
</p>
<pre class="prettyprint">actions = [
"mv docker_container.nginx 'docker_container.nginx[\"This is an example\"]'",
]
</pre>
<p>
Testing the state migrations can be done via <code>tfmigrate plan
<filename></code>. The output will show you what <code>terraform plan
</code>would look like if you had actually carried out the state changes.
</p>
<p>
Applying the migration to the actual state is done via <code>tfmigrate apply
<filename></code>. Note that by default, it will only apply the changes if the
result from <code>tfmigrate plan</code> was a clean output.
<br /><br />If you still want to apply changes to a “dirty” state, you
can do so by adding a <code>force = true</code> line to the migration file.
</p>
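<p>
As a minimal sketch of that workflow (the migration filename <code>mv_backups.hcl</code> is hypothetical, chosen only for illustration):
</p>
<pre class="prettyprint"># mv_backups.hcl is a placeholder migration filename
tfmigrate plan mv_backups.hcl
tfmigrate apply mv_backups.hcl
</pre>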
<blockquote><p>
<span style="color: #38761d;">If you are using Terraform 1.1 or newer, there is now a built-in
<code>moved</code> statement that works similarly to these approaches. I haven’t
tested it out yet but it looks like a useful feature! I can see it being
especially useful for users who may not have direct access to state files such
as Terraform Cloud and Enterprise users or Atlantis users.
</span></p><span style="color: #38761d;">
</span><p><span style="color: #38761d;">
See the <a href="https://www.hashicorp.com/blog/terraform-1-1-improves-refactoring-and-the-cloud-cli-experience">announcement</a>
in the 1.1 release as well the <a href="https://learn.hashicorp.com/tutorials/terraform/move-config">HashiCorp
Learn tutorial</a> for more details.
</span></p></blockquote>
<h2>
<b>Ensure standards compliance with TFLint</b></h2>
<p>
According to its website, <a href="https://github.com/terraform-linters/tflint">TFLint</a> is a Terraform
linter with a handful of key features:
</p>
<ul>
<li>Finding possible errors (like illegal instance types) for major Cloud
providers (AWS/Azure/GCP)
</li><li>Warning about deprecated syntax and unused declarations
</li><li>Enforcing best practices and naming conventions
</li>
</ul>
<p>
TFLint has a plugin system for including cloud provider-specific linting rules
as well as updated Terraform rules. Setting up the list of rules can be done on
the command line but it is recommended to use a config file to manage the
extensive list of rules to apply to your codebase.
</p>
<p>
Here is a configuration file that enables all of the possible terraform rules as
well as AWS-specific rules. Save it in the root of your Git repository
as <code>.tflint.hcl</code>, then initialize TFLint by running <code>tflint --init</code>. Now
you can lint your codebase by running <code>tflint</code>.
</p>
<pre class="prettyprint">config {
module = false
disabled_by_default = true
}
plugin "aws" {
enabled = true
version = "0.10.1"
source = "github.com/terraform-linters/tflint-ruleset-aws"
}
rule "terraform_comment_syntax" {
enabled = true
}
rule "terraform_deprecated_index" {
enabled = true
}
rule "terraform_deprecated_interpolation" {
enabled = true
}
rule "terraform_documented_outputs" {
enabled = true
}
rule "terraform_documented_variables" {
enabled = true
}
rule "terraform_module_pinned_source" {
enabled = true
}
rule "terraform_module_version" {
enabled = true
exact = false # default
}
rule "terraform_naming_convention" {
enabled = true
}
rule "terraform_required_providers" {
enabled = true
}
rule "terraform_required_version" {
enabled = true
}
rule "terraform_standard_module_structure" {
enabled = true
}
rule "terraform_typed_variables" {
enabled = true
}
rule "terraform_unused_declarations" {
enabled = true
}
rule "terraform_unused_required_providers" {
enabled = true
}
rule "terraform_workspace_remote" {
enabled = true
}
</pre>
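<p>
With that configuration file in place, the workflow described above boils down to two commands:
</p>
<pre class="prettyprint">tflint --init
tflint
</pre>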
<h2>
<b>pre-commit</b></h2>
<p>
Setting up git hooks with the <a href="http://pre-commit.com/">pre-commit</a>
framework allows you to automatically run TFLint, as well as many other
Terraform code checks, prior to any commit.
</p>
<p>
Here is a sample <code>.pre-commit-config.yaml</code> that combines Anton
Babenko's excellent collection of Terraform specific hooks with some
out-of-the-box hooks for pre-commit. It ensures that your Terraform commits are:
</p>
<ol>
<li>Following the canonical format and style per <code>terraform fmt</code>
</li><li>Syntactically valid and internally consistent per <code>terraform
validate</code>
</li><li>Passing TFLint rules
</li><li>Following good practices, such as ensuring that:
<ul>
<li>merge conflicts are resolved
</li><li>private ssh keys aren't included
</li><li>commits are done to a branch instead of directly to <code>master</code> or
<code>main</code>
</li>
</ul>
</li>
</ol>
<pre class="prettyprint">repos:
- repo: git://github.com/antonbabenko/pre-commit-terraform
rev: v1.59.0
hooks:
- id: terraform_fmt
- id: terraform_validate
- id: terraform_tflint
args:
- '--args=--config=__GIT_WORKING_DIR__/.tflint.hcl'
- repo: git://github.com/pre-commit/pre-commit-hooks
rev: v4.0.1
hooks:
- id: check-added-large-files
- id: check-merge-conflict
- id: check-vcs-permalinks
- id: check-yaml
- id: detect-private-key
- id: end-of-file-fixer
- id: no-commit-to-branch
- id: trailing-whitespace
</pre>
<p>
You can take advantage of this configuration by:
</p>
<ul>
<li>Installing the pre-commit framework <a href="https://pre-commit.com/#install">per the instructions</a> on the website.
</li><li>Creating the above configuration in the root directory of your Git
repository as .pre-commit-config.yaml
</li><li>Creating a .tflint.hcl in the base directory of the repository
</li><li>Initializing the pre-commit hooks by running <code>pre-commit install</code>
</li>
</ul>
<p>
Now whenever you create a commit, the hooks will run against any changed files
and report back issues.
</p>
<p>Since the pre-commit framework normally only runs against changed files,
it’s a good idea to start off by validating all files in the repository by
running <code>pre-commit run --all-files</code>.</p>
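<p>
For reference, the whole setup can be sketched as a handful of commands. This assumes you install pre-commit with <code>pip</code>; any of the installation methods from the pre-commit website works equally well.
</p>
<pre class="prettyprint"># install pre-commit (pip shown here; other install methods work too)
pip install pre-commit
# with .pre-commit-config.yaml and .tflint.hcl in the repository root:
pre-commit install
pre-commit run --all-files
</pre>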
<h2><b>Conclusion</b></h2>
<p>
These approaches help make it easier and safer to refactor Terraform codebases,
speeding up a team's "Infrastructure as Code" velocity.
</p>
<p>
This helped my team gain confidence in making changes to our legacy modules and
enabled greater reusability. Standardizing on formatting and validation checks
also sped up code reviews. We could focus on module logic instead of looking for
typos or broken syntax.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-71216733960480770232021-12-11T00:00:00.001-05:002021-12-11T00:00:00.180-05:00Day 11 - Moving from Engineering Manager to IC<p>
By: Brian Scott (<a href="https://twitter.com/brainscott">@brainscott</a>)<br />
Edited by: Don O'Neill (<a href="https://twitter.com/sntxrr">@sntxrr</a>)
</p>
<p>
Within the past month, I've made a radical change into a new role with my
existing employer; for the past decade I was an SRE Manager building teams and a
Tech Executive. I hope to summarize my experience of moving into an IC role,
including how it made me feel. The thoughts and ideas in this article are my own
opinions, drawn from past experience.
</p>
<p>
For the past 6-8 years, I've been in an Engineering Manager/TechExec role,
specifically in Systems Reliability Engineering. I was comfortable, happy, and
engaged in this role, managing multiple SRE teams supporting a wide range of
products & platforms in the Enterprise.
</p>
<p>
Before we dive in deeper, a little history on myself: I've been playing with
technology since I was in 5th grade. My English teacher at the time taught me
everything he knew about repairing computers, primarily 286s & 386s, along with
DOS and the BASIC programming language.
</p>
<p>
As I moved from 8th grade into high school, my computer teacher approached me to
ask if I wanted to help administer the school's network of 12 Windows NT servers
running Active Directory, Exchange & file services, with over 4000 workstations
& printers. Apparently, my 5th-grade teacher had passed him a few tidbits about
what I was doing with computer science in middle school.
</p>
<p>
Little did I know that accepting the position was where my journey began. A few
startups (MySpace, remember that?) and mid-to-large corporations later, I ended
up in Engineering Management, primarily focused on building teams that support
large-scale applications both on-prem and in the cloud, delivering solutions
with a DevOps culture & SRE mindset.
</p>
<p>
I was used to building high-performing engineering teams and meeting new and
amazing engineers while focusing on creating <a
href="https://www.forbes.com/sites/lisabodell/2020/08/28/futurethink-forecasts-t-shaped-teams-are-the-future-of-work/">T-Shaped
teams</a> - not necessarily a new concept, but one that worked well for my
teams. During this time, we had an amazing leadership team that pushed us to go
above and beyond, while every day we met new product teams across the company
that needed our help in delivering great solutions. In certain organizations,
highly technical roles can be treated as semi-management.
</p>
<p>
We introduced several new technologies & concepts to the company as a whole,
developing many Communities of Practice around config management, containers,
CI/CD, and even web development with Go. With the vast coverage of different
areas the company was working in, I found myself slowly moving into a space that
the company had never had a role for - more on this in just a bit.
</p>
<p>
Before moving into management, I was a Staff SRE (Systems Reliability
Engineer). You might be thinking, isn’t it Site Reliability Engineering? Yes,
but different companies tailor the meaning of SRE to meet the needs of their
respective areas. In my case, we weren’t just managing sites & web applications
but systems that handle a wide range of products in the entertainment & media
space - think rendering, control systems, and safety systems.
</p>
<p>
As a manager, I started seeking out and making new connections across the
enterprise, assisting teams in onboarding the latest technology, whether that
was LiDAR, Kubernetes, GitOps & Docker, or new tools bursting with innovation in
the open source space. While being good at helping others and always saying
“YES”, I quickly found myself spread quite thin managing 5 different SRE teams
of roughly 3-5 members each, supporting over 3000 applications, some of which
were centralized services for the entire enterprise to consume. It was also
getting a little hard for me to stay current with the technology, which I loved.
</p>
<p>
Leadership quickly saw my success in evangelizing new technology and helping our
business units move fast in adopting new methods of engineering - not only
introducing new technology, but also ensuring our SREs had the proper tools and
were aware of up-and-coming automation that would help them reduce toil and
accelerate how we delivered value to our customers, internally and externally.
</p>
<p>
My leader called me into a meeting to discuss my interest in moving into an SRE
role - but instead of a pure engineering role, he wanted me to lead the
company’s effort in evangelizing new technology. He went on to explain his
vision of how this would allow me to expand my reach and support more teams:
helping create an organization around developer advocacy, mentoring our entire
global SRE organization to the next level, and inspiring others in areas such as
empathy engineering, automation, best practices, and what’s next in driving
technical leadership.
</p>
<p>
I was a bit taken aback but excited. There was also some nervousness, of course,
about how the move might affect my teams and my relationships with each of my
engineers. Over the next few weeks, my teams and leadership were very supportive
and believed that I was needed in this new role to make a bigger impact on the
organization and company as a whole.
</p>
<p>
Never be discouraged if you find yourself moving into an IC role; new
opportunities have a great way of nudging you in the right direction. People
often think that moving up the management ladder is the only measure of success,
but we have all seen incredible people in IC roles, such as Kelsey Hightower at
Google or Jessie Frazelle of Oxide Computer. Humans do their best work when
positioned to do things they love and given room to reach new heights.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-38841065792030941082021-12-10T00:00:00.003-05:002021-12-10T00:00:00.182-05:00Day 10 - Assembling Your Year In Review<p>
By: Paige Bernier (<a href="https://twitter.com/alpacatron3000">@alpacatron3000</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>) and Scott Murphy (<a href="https://twitter.com/ovsage">@ovsage</a>)
</p>
<h2>Intro</h2>
<p>
There are a few moments in my career when I have been struck by a story told
with data. When I set out as a Site Reliability Engineer into the big wide world,
I wanted to capture that data storytelling magic and have adapted a presentation
I call the “Year in Review”.
</p>
<p>
My first company had a tradition of taking a moment to pause and review the year
by the numbers. The showstopper was the chart showing the amount of data
ingested year over year since the founding.
</p>
<p>
In a single glance that chart conveyed a story that would take hours to tell!
</p>
<p>
It communicated the incredible efforts the employees took to scale the system to
handle ingesting, processing, publishing and storing an ever increasing mountain
of data. It illustrated how far the company had come and we were confronted head
on with the realization that “what got you here, won’t get you there”.
</p>
<p>
The biggest impact I have seen comes <em>after </em>the presentation.
Discussions from Year in Reviews have sparked sweeping oncall
management changes as well as minor, but important, changes in the way
developers engage with the SRE team.
</p>
<p>
Before diving into implementation details, let’s look at <em>why</em> this type
of data storytelling is such a powerful tool by examining the core purpose of
SRE
</p>
<h2>The Mission of SRE</h2>
<p>
The mission of an SRE team is to improve system reliability by facilitating
change.
</p>
<p>
System reliability is the sum of hundreds of decisions humans make when
developing, deploying, and maintaining software systems; it is <em>not</em> an
intrinsic property<sup id="fnref1"><a href="#fn1" rel="footnote">1</a></sup> of
the systems (Patrick O’Connor, 1998). SRE job descriptions tout phrases like
“evangelize a DevOps culture” and “influence without authority” acknowledging
our roles as change agents.
</p>
<p>
And as often heard, “change is hard”. As change agents, we are often faced with
conflicting priorities, multiple stakeholders internal and external, and fear of
the new and unknown.
</p>
<p>
However, just as often we hear “change is the only constant”. Whether it’s
hardware improvements, operating system upgrades, security vulnerability
announcements, software dependencies, or the software that we manage as a
service, we are constantly monitoring and implementing change.
</p>
<p>
Combine these two axioms, for extra difficulty:
</p>
<p>
Ask any engineer who has been forced into a major operating system upgrade when
the version of software they’re running requires the previous OS.
</p>
<p>
As an SRE I often want to make changes across the entire engineering
organization such as developing oncall onboarding, ensuring that we are
monitoring the customer’s experience, clarifying the lines of responsibility
between developers and operators and more!
</p>
<p>
These types of changes that affect everyone are difficult to implement
effectively until two things are true:
</p>
<ul>
<li>Is there a shared understanding of the current state?
<li>Is there agreement that the current state needs to change?
</li>
</ul>
<p>
This does not mean there needs to be consensus on what changes need to be made!
</p>
<h3>
Is there a shared understanding of the current state?
</h3>
<p>
The answer to this can be a resounding “Yes!” after your Year in Review
presentation. Here’s why:
</p>
<p>
Humans learn best from stories, feelings, senses, and opinions, commonly known as
<em>qualitative data</em>. Focusing on these exclusively, you risk coming to
broad conclusions without nuance or context.
</p>
<p>
Businesses claim to operate on data, facts and figures, or <em>quantitative
data</em>. Focusing purely on the numbers, you risk drowning in details and
heading down irrelevant rabbit holes.
</p>
<p>
In fact, the two seemingly disparate viewpoints aren’t at odds at all. You can
even validate findings by using the other category of data.
</p>
<p>
<strong>Feel: </strong>“Our monitoring sucks, none of the last 5 pages I got
were actionable”
</p>
<p>
<strong>Fact: </strong>The primary oncall was paged 5 times out of business
hours last week
</p>
<p>
<strong>Finding: </strong>Team X is getting paged frequently for non-actionable
reasons
</p>
<p>
Hosting a “Year in Review” means weaving a story using the quantitative data
about what occurred in your systems with the qualitative “anec-data” from a
human perspective to build a foundation to introduce change.
</p>
<h3>
Is there agreement that the current state needs to change?
</h3>
<p>
This is a more complex endeavor - identifying and implementing change is the
hard work of collaborating across teams, roles and competing incentives,
motives, and needs. Think of “Year in Review” as a springboard for driving
discussion and debate to align on “do we agree something needs to change?”
</p>
<p>
What does this look like in practice?
</p>
<p>
At a previous company I heard from engineers and managers alike that the oncall
rotations were in need of a shake-up. This was an excellent starting place:
everyone agreed that there was a problem, but we were having trouble
implementing the necessary changes.
</p>
<p>
With the goal of identifying exactly what the oncall issues were, my team
tailored a “Year in Review” focused mainly on oncall metrics such as alert
noise, hours oncall per engineer, and pages received per engineer. Slides
illustrated the deluge of alert storms - largely unactionable noise that no
human could possibly investigate in a given shift. The impact of <span
style="text-decoration:underline;">not</span> addressing this problem was clear:
we were likely missing important signals in the noise, and oncalls weren’t able
to effectively prioritize their time.
</p>
<p>
After reviewing the data as a group, my team facilitated a brainstorm to address
the barriers to changing the rotations:
</p>
<ul>
<li>How to handle ownership when multiple teams contribute code?
<li>What are the “hot potato” services no one feels comfortable owning?
<li>What services are unofficially owned by a single engineer that needs
documentation?
<li>What is the goal of a low urgency or warning alert?
</li>
</ul>
<p>
Based on the main discussion and others in standups and sidebars, my team
proposed new team-service ownership and rotations. Several weeks and a few rounds
of revisions later, we merged the PR with our new Terraformed oncall rotations!
</p>
<p>
</p>
<h2>
DIY “Year in Review”
</h2>
<p>
So, how do you create a “Year in Review” for an SRE team? To start, I typically
have a few things in mind about what I think happened and what the data will
show. It is fascinating to see where your perception of the system and reality
diverge. You can kick off your process by asking a couple of questions:
</p>
<ul>
<li>What story are you expecting the data to tell?
<li> What changes do you think need to be made in the next year to improve
reliability?
</li>
</ul>
<ol>
<li>Book a meeting with all parties (including engineers, managers, SRE, QA,
ops, and product managers). If there is an existing meeting like an All-Hands or
Demo Hour, sign up for a presentation slot.
<li>Kick off a brainstorming session and have participants list out possible
changes to include, such as new features launched, infrastructure expansions to
new regions, or even a doubling of the organization's size.
<li>Ask teams (including managers)
<ol>
<li>What data they would find interesting
<li>What data they could contribute from their domain
</li>
</ol>
<li>List the company-specific tooling for data sources like:
<ol>
<li>Version Control
<li>CI/CD
<li>Monitoring
<li>Incident Management
<li>Ticket tracking system
<li>Documentation store
<li>Support ticket system
</li>
</ol>
<li>Enlist the help of others to gather the interesting metrics over the past
year or year over year. Some suggestions are:
<ol>
<li>Noisiest alerts
<li>Number of environments
<li>Oncall engineers
<li>Number of services
<li>Ratio of oncall engineer to number of services oncall for
<li>Age of dependencies/libraries
<li># of hours oncall per person
<li>Number of features launched
<li># of after hour pages
<li>Ratio of warning alerts to pages
<li>Number of production deploys rolled up by day
<li>Number of open incident AIs
<li>Ingress traffic or other indicator of system load
<li>Most viewed documentation pages
<li>Most search documentation terms
<li>Time to first PR
<li>….and so much more!
</li>
</ol>
<li>Slice and dice the data, trying out top 10 lists, totals, or segmenting by
whatever constructs your company has, such as:
<ol>
<li>Department
<li>Service
<li>Team
<li>Product Feature
</li>
</ol>
<li>Group the data into themed areas: “oncall”, “production”, “onboarding”, etc.
If you have convinced folks to co-present with you, each person can be
responsible for presenting a different theme.
<li>Assemble the data into a slide deck with one chart per slide to maximize impact.
<li>Hold the meeting and present your findings.
<li>Discuss! In the meeting, after the meeting, and before the next Year in
Review, compare how you interpreted the data with how others did.
<li>Publish the data and your queries so everyone can explore and answer their
own questions
</li>
</ol>
<h2>
Parting Thoughts
</h2>
<p>
SREs are uniquely suited to facilitate a Year in Review, bringing a system-wide
perspective on the people, processes, and technology, along with a mission to
improve reliability. Keep in mind that, much like effecting change, hosting a Year in
Review is not a solo effort!
</p>
<p>
Going solo means you will only capture YOUR thoughts, which would almost certainly
benefit from being tempered by the unique vantage points of others. The more perspectives you
invite, the fuller the story of your system will be.
</p>
<p>
Please share your favorite data storytelling moments or Year in Review stats
with me on Twitter at @alpacatron3000
</p>
<h2>Citation</h2>
<p>
O’Connor, P. (1998) <em>Standards in reliability and safety engineering
</em>[Article]. Elsevier Science Limited, 9 Dec. 2021.
</p>
<p>
<a
href="https://www.sciencedirect.com/science/article/abs/pii/S095183209883010X">https://www.sciencedirect.com/science/article/abs/pii/S095183209883010X</a>
</p><!-- Footnotes themselves at the bottom. -->
<h2>Notes</h2>
<div class="footnotes">
<hr>
<ol><li id="fn1">
<p>
Since the SRE field is still getting established outside of Google, I started to
read perspectives from Reliability Engineering in other disciplines. A nugget from Patrick
O’Connor’s “Standards in reliability and safety engineering” paper sparked a spicy but important
revelation about reliability.
<p>
“Those reliability standards which apply mathematical/quantitative methods are
also based on the inappropriate application of “scientific” thinking. An engineered system or a
component has no intrinsic property of reliability, expressible for example as a failure rate. Truly
scientifically based properties of systems and components include mass, power output, etc., and
these can therefore be predicted and measured with credibility. However, whether a missile or a
microcircuit fails depends upon the quality of the design, production, maintenance and use applied to
it. These are human contributions, not “scientific”.” <a href="#fnref1"
rev="footnote">↩</a>
</ol></div>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-70927812865113701502021-12-09T00:16:00.007-05:002021-12-09T00:18:39.047-05:00Day 9 - 3 things parenting taught me about system administration<p>
By: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)<br />
</p>
<p>
The last five years have been grounding for me as I became a beginner at
parenting. In this article, I want to share three things I learned about being a
better sysadmin from being a mom.
</p>
<h2>Prioritize your health</h2>
<p>
Of course, I've heard it so many times. But in the rush of trying to support the
"system," sometimes, I lose track of the little things (getting enough sleep,
eating meals, human engagement that isn't predicated on deliverables and action
items). When it comes to parenting, I see the difference in how the necessities
of the moment can gradually subsume the primary goals and real joy* (a secondary
outcome of successful parenting that I tend to only enjoy in retrospect, after
having assured myself that my internal parenting kanban board is as it should
be–obsession, exhaustion, and then joy tends to be my experiential flow as a
parent).
</p>
<p>
Prioritizing health - if I'm not ok, I'm not able to handle the "system" as
well, regardless of its state.
</p>
<p>
Any parent of a child under five will tell you that 90 percent of the job is
keeping the child alive. If they make it to the next day, smile and giggle the
proper number of times per day, and if your friends, family, and parenting peers
seem unaware that your parenting path bears a concerning resemblance to the plot
of the movie Speed, then you're more or less gravy. You also learn that, while
you can spend a great deal of time analyzing and conversing about your child and
how they're faring, the main thing is to put them in the right places at the
right time. Sunshine, exercise, the company of their peers, easily accessible
bathrooms–these are the things that matter. If my son doesn't get direct
sunlight within 90 minutes of waking, his mood takes a nosedive, and this isn't
a mystery to me. Likewise, if he isn't let loose at the park to terrify small
woodland creatures with his desire to befriend them, his attentional resources
will be suboptimal when it's time for flashcards. Yet I (and I don't think I'm
alone in this) will frequently wake, obtain caffeine, have a quick all-hands
with my family, and proceed to sit in a small room staring at a screen for eight
hours straight. As a result, my ability to practice self-care fails regularly.
</p>
<h2>Leverage the community </h2>
<p>
To prioritize my health, I have to ask for help. I've had the following
experience again and again professionally, and as a parent, and at some point, I
hope that it won't astound me, which it does every time: I believe that I'm
having a singular experience (which, of course, we all are) and that I am an
outlier because obviously no one else is concerned about the state of affairs or
struggling. And then someone else gives voice to the precise issue that I've
devoted considerable resources to NOT sharing. Of course, other people are also
concerned about the children pretending that the scissors are boomerangs. One of
my primary errors is thinking that there is some scorekeeping going on -
tracking social currency and categorizing discourse into the buckets of "I
helped" and "I was helped." It's a binary that renders engagements as transactional, when my
actual community experience is almost always that I walk away feeling better
regardless of who broached a topic.
</p>
<h2>You can't eliminate all Snowflakes </h2>
<p>
Within the community, we often talk about snowflakes as problems. Yet, as a
parent, you discover that there are no handbooks for YOUR kid because every
child is different in their own beautiful, hard, and surprising way. Likewise,
while there is value in the community and sharing stories, every system will be
different. You work with one system, you've learned about that system, and while
there are useful things you'll learn from that system to apply to other systems,
every system will be beautiful, different, and hard in its surprising ways.
</p>
<h2>Wrapping Up</h2>
<p>
Our industry is constantly evolving with the introduction of new technology,
tools, and processes. It may feel overwhelming to try to understand everything.
You have to accept some degree of the unknown. When I first became a parent, I
realized that Operations had prepared me for the inevitable changes that occur
every single day. No matter what tomorrow brings, the essential skills are
learning to adapt to change and learning to learn fast.
<br><br>Please make time for yourself, connect with the
community, and accept what is different and unique about your systems and the
environments they are running in.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-88758232386916578422021-12-08T00:00:00.011-05:002021-12-08T00:08:50.606-05:00Day 8 - D&D for SREs
<p>
By: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)<br />
</p>
<p>
In a past life, I was a full-time SRE and a part-time dragonborn paladin named
Lorarath. While at work, I supported thousands of systems in collaboration with
a team of geeks. Evenings, I tried to survive imaginary disasters and save the
world from the sorceress Morgana. I love collaborative games because they plug
into some of the real-world emotional responses and social processes critical
for successful, meaningful engagement. They provide a safe place to practice
dealing with critical scenarios. When you know the stakes are purely
imaginary, you're able to look at your efforts from a distance, to gain
understanding and enjoy the process of learning and achieving goals together,
even when failing. I want to share a couple of insights D&D has given me about
my work and how this can help you.
</p>
<h2>Building your SRE Team … more than just a name.</h2>
<p>
SRE has many names: Operations, DevOps, Infrastructure engineering, System
Admin. It's someone who deploys and runs a highly available, scalable, and
secure service that meets business and partner requirements. But what does that
mean? Generally, it means someone with a wide-ranging set of skills tackling
different challenges at any point in time.
</p>
<p>
When you first start a campaign in Dungeons & Dragons, you choose a class to
play. This class will then have specializations that you customize based on how
you want to play. Next, you build out your character using a character sheet and
create a backstory. This character sheet has several abilities and skills. You
have several points to allocate to abilities and skills, which grants you
additional chances to handle particular events successfully.
</p>
<p>
In gaming, you collaborate with your teammates to ensure that you have a
well-rounded party, often choosing roles that complement each other. You don't want a team of all
"magic users" or hack and slashers. Often, we stop at identifying who we are
with that single name, whether it's SRE or sysadmin. As an SRE, I depend on a
diverse team with varied skills. I am not seeking people with the same expertise
or abilities. I'm looking for people with complementary skills who can help
accomplish the goals and visions of the team.
</p>
<h2>Developing your “character sheet”</h2>
<p>
There is no equivalent to a "character sheet" when it comes to your job. The
closest might be equating a resume or LinkedIn profile to a character sheet.
Still, these don't align to all of the possible experiences you gain:
</p>
<ul>
<li>Submitting git pull requests.
<li>Participating in hackathons.
<li>Attending training or conferences.
<li>The myriad of other day-to-day challenges you face.
</li>
</ul>
<p>
Additionally, if you don't practice skills in real life, they languish. For
example, I haven't touched Solaris in over a decade, and I no longer document it
as a skill.
</p>
<p>
If SRE did have a character sheet, I think the three core abilities would be
Communication, Collaboration, and Confidence. Let's take a closer look at these
specializations and the value of spending energy on these areas.
</p>
<h3>Specialization: Communication</h3>
<p>
<strong>Communication</strong> is a fundamental building block to successful
character building. As an SRE, I faced various scenarios that required expert
communication.
</p>
<ul>
<li>The first specialty in communication is the <strong>number of
messages</strong>. How often should I remind people about upcoming scheduled
maintenance? How often should I reach out to my manager about working on the
right thing? How often should my team get together to talk about team tasks?
</li>
</ul>
<ul>
<li>The second specialty in communication is the <strong>quality of
messages</strong>. Communication can be visual, written, or oral. Visuals can
often convey much more nuanced meaning than the same information repeated in
textual format, and they are an underleveraged method.
</li>
</ul>
<ul>
<li>The third specialty in communications is <strong>effectiveness.
</strong>Effectiveness is the degree to which your words lead to the desired
results. This specialty is the most advanced because effective communication
requires an in-depth understanding of the audience and crafting your message as
needed.
</li>
</ul>
<h3>Specialization: Collaboration</h3>
<p>
The second core ability is collaboration. In any product or service, you are
working on, work needs to be understood, planned, and executed. It doesn't
matter who does the work; it just matters that it gets done.
</p>
<p>
The role I take today doesn't define who I am. If I say, "I'm an SRE at
Company," that is just one characteristic of my story and not my identity. Every
day as you go into work and tackle your challenge, recognize <strong><em>your
special value</em></strong> and what <strong><em>you</em></strong> bring to the
team. Rather than adopting and marrying your identity to a specific role,
realize some days you take on a role that may be quite different from what you
are used to, and that's part of your character development.
</p>
<p>
There is a distinction between the members of your team and the roles they play.
In gaming, you become comfortable speaking on behalf of your character while
having a separate, sometimes meta-conversation with your teammates. Social
environments seem to tend towards homeostasis, and you (may) naturally ascribe a
simplistic narrative to your co-workers' actions. Adopting the awareness that
everyone is filling a role on the team, and that the role is not representative
of everything about the individual, allows you to approach and do the impactful
work that needs to get done.
</p>
<p>
In other words, never say, "well, they are just the ROLENAME and can't do that,"
or "that's not my job."
</p>
<h3>Specialization: Confidence</h3>
<p>
The third core ability for your SRE character sheet is
<strong>confidence</strong>. Confidence is about the innate quality that drives
you to take risks (or not).
</p>
<p>
In gaming, sometimes you take the wrong path, or you put your squishy players
out front, and they get severely damaged. Mistakes happen. In the "real world,"
customers do something unexpected. There are bugs in the software, hardware
fails, or someone from the team enters the wrong command on the wrong terminal
in the production environment.
</p>
<p>
Collaborative games teach you to fail as a group and rise again while retaining
the group cohesion necessary to succeed. Of course, if a teammate really caused
you to be captured by a giant spider, you'd probably flip out. Still, across the
game board, one has the emotional wiggle-room to behave in a manner that would
be laudable in professional situations.
</p>
<p>
Playing teaches you about exploring challenges with imagination and a sense of
play. You have to piece things together while continuing to take action, both
keeping in mind the larger game goals and what's immediately on the board at the
same time. In addition to this enormous world to explore, there are complex
characters (non-playing characters or NPCs) to talk to, and information gathered
within each encounter. Be on the lookout for the helpful non-production
engineers (NPEs) in your environment, too; while they may not maintain
production, they may have valuable information to support you.
</p>
<h2>Wrapping Up</h2>
<p>
So, perhaps this article has inspired you to add some collaborative gaming to your team
building, build out your team with complementary skills, or map the work of
SRE or system administration onto a character sheet. Great - beyond the
"character sheet," you need the appropriate visualization. By analyzing the
particular work items that an individual completed, there could be an
incremented "skill" counter. Additional information like git commits,
distribution of package management, and incident management APIs could be
gathered and glued together to create a way to look at progress over time. That
way, you could make sure to spend time on the skills that will improve you in
the direction of your choosing.
</p>
<p>
If you want to try out D&D, check out your local game stores or related groups.
Beginner games often provide preconfigured characters that allow you to practice
the gameplay without understanding all of the nuances of playing the game.
</p>
sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-87219054690099112672021-12-07T00:00:00.000-05:002021-12-07T00:34:09.197-05:00Day 7 - Baking Multi-architecture Docker Images<p>
By: Joe Block (<a href="https://twitter.com/curiousbiped">@curiousbiped</a>)<br />
Edited by: Martin Smith (<a href="https://twitter.com/martinb3">@martinb3</a>)
</p>
<p>
My home lab cluster has a mix of CPU architectures - several Odroid<a
href="https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/"> HC2s</a> that
are arm7, another bunch of Raspberry Pi 4s and Odroid<a
href="https://www.hardkernel.com/shop/odroid-hc4/"> HC4s</a> that are arm64 and
finally a repurposed MacBook Air that is amd64. To further complicate things,
they're not even all running the same linux distribution - some run Raspberry Pi
OS, one's still on Raspbian, some are running debian (a mix of buster and
bullseye), and the MacBook Air runs Ubuntu.
</p>
<p>
To reduce complication, the services in the cluster are all running in docker or
containerd - it's a homelab, so I'm deliberately running multiple options to
learn different tooling. This meant that I had to do three separate builds every
time I updated one of my images (arm7, arm64, and amd64) on three different
machines, and my service startup scripts all had to determine what architecture
they were running on and figure out what image tag to use.
</p>
<h2><strong>Enter multi-architecture images</strong></h2>
<p>
It used to be a hassle to create multi-architecture images. You'd have to create
an image for each architecture, then upload them all separately from each build
machine, then construct a manifest file that included references to all the
different architecture images and then finally upload the manifest. This doesn't
lead to easy rapid iteration.
</p>
<p>
Now, thanks to<a href="https://docs.docker.com/buildx/working-with-buildx/">
docker buildx</a>, you can create multi-architecture images as easily as docker
build creates them for a single architecture.
</p>
<p>
Let's take a look with an example on my system. First, I can see what
architectures are supported with docker buildx ls. As of 2021-12-03, Docker
Desktop for macOS supports the following:
</p>
<div><pre><code class="language-none">
NAME/NODE DRIVER/ENDPOINT STATUS PLATFORMS
multiarch * docker-container
multiarch0 unix:///var/run/docker.sock running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
desktop-linux docker
desktop-linux desktop-linux running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
default docker
default default running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
</code></pre></div>
<p>
My home lab only has three architectures, so in these examples I'm going to
build for arm7, arm64 and amd64.
</p>
<h3><strong>Create a builder</strong></h3>
<p>
I need to create a builder that supports multi-architecture builds. This only
needs to be done once as Docker Desktop will reuse it for all of my buildx
builds.
</p>
<div><pre><code class="language-none">
docker buildx create --name multibuild --use
</code></pre></div>
<h3><strong>Building a multi-architecture image</strong></h3>
<p>
Now, when I build an image with docker buildx, all I have to do is specify a
comma-separated list of desired platforms with --platform. Behind the scenes,
Docker Desktop will fire up QEMU virtual machines for each architecture I
specified, run the image builds in parallel, then create the manifest and upload
everything.
</p>
<p>
As an example, I have a docker image,<a
href="https://github.com/unixorn/unixorn-py3"> unixorn/unixorn-py3</a> that I
use for my python projects that installs a minimal Python 3 onto debian 11-slim.
</p>
<p>
I build it with docker buildx build --platform
linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3, resulting in
the output below showing that it's building all three architectures.
</p>
<div><pre><code class="language-none">
❯ rake buildx
Building unixorn/debian-py3
docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3 .
[+] Building 210.4s (17/17) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 571B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [linux/arm64 internal] load metadata for docker.io/library/debian:11-slim 3.7s
=> [linux/arm/v7 internal] load metadata for docker.io/library/debian:11-slim 3.6s
=> [linux/amd64 internal] load metadata for docker.io/library/debian:11-slim 3.6s
=> [auth] library/debian:pull token for registry-1.docker.io 0.0s
=> [auth] library/debian:pull token for registry-1.docker.io 0.0s
=> [auth] library/debian:pull token for registry-1.docker.io 0.0s
=> [linux/arm64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 4.4s
=> => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 0.0s
=> => sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680 30.06MB / 30.06MB 2.0s
=> => extracting sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680 2.4s
=> [linux/amd64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 4.0s
=> => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 0.0s
=> => sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d 31.37MB / 31.37MB 1.8s
=> => extracting sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d 2.2s
=> [linux/arm/v7 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 4.3s
=> => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 0.0s
=> => sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206 26.57MB / 26.57MB 2.3s
=> => extracting sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206 2.0s
=> [linux/amd64 2/2] RUN apt-get update && apt-get install -y apt-utils ca-certificates --no-install-recommends && apt-get upgrade -y --no-install-r 22.3s
=> [linux/arm/v7 2/2] RUN apt-get update && apt-get install -y apt-utils ca-certificates --no-install-recommends && apt-get upgrade -y --no-install 176.9s
=> [linux/arm64 2/2] RUN apt-get update && apt-get install -y apt-utils ca-certificates --no-install-recommends && apt-get upgrade -y --no-install- 173.6s
=> exporting to image 25.4s
=> => exporting layers 6.7s
=> => exporting manifest sha256:ae5a5dcfe0028d32cba8d4e251cd7401c142023689a215c327de8bdbe8a4cba4 0.0s
=> => exporting config sha256:48f97d6d8de3859a66625982c411f0aab062722a3611f18366ecff38ac4eafb9 0.0s
=> => exporting manifest sha256:fc7ad1e5f48da4fcb677d189dbc0abd3e155baf8f50eb09089968d1458fdcfb9 0.0s
=> => exporting config sha256:60ced8a7d9dc49abbbcd02e7062268fdd2f14d9faedcb078b2980642ae959c3b 0.0s
=> => exporting manifest sha256:8f96f20d75502d5672f1be2d9646cbc5d5de3fcffd007289a688185714515189 0.0s
=> => exporting config sha256:0c6e42f87110443450dbc539c97d99d3bfdd6dd78fb18cfdb0a1e3310f4c8615 0.0s
=> => exporting manifest list sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa 0.0s
=> => pushing layers 17.2s
=> => pushing manifest for docker.io/unixorn/debian-py3:latest@sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa 1.4s
=> [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io 0.0s
=> [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io 0.0s
docker pull unixorn/debian-py3
Using default tag: latest
latest: Pulling from unixorn/debian-py3
e5ae68f74026: Already exists
86834dffc327: Pull complete
Digest: sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa
Status: Downloaded newer image for unixorn/debian-py3:latest
docker.io/unixorn/debian-py3:latest
1.60s user 1.05s system 1% cpu 3:36.49s total
</code></pre></div>
<p>
One minor issue - docker buildx has a separate cache that it builds the images
in, so when you build, the images won't be loaded in your local
docker/containerd environment. If you want to have the image in your local
docker environment, you need to run buildx with --load instead of --push.
</p>
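<p>
For example, a quick local test build might look like the sketch below. Note that with the container builder, --load generally handles one target platform at a time, and the :local tag here is only an illustrative name.
</p>
<div><pre><code class="language-none">
# build a single-platform image and load it into the local docker image store
# (the :local tag is just an example name)
docker buildx build --platform linux/arm64 --load -t unixorn/debian-py3:local .
docker run unixorn/debian-py3:local python3 --version
</code></pre></div>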
<p>
In this example, instead of running docker run unixorn/debian-py3:amd64, docker
run unixorn/debian-py3:arm7 or docker run unixorn/debian-py3:arm64 based on what
machine I'm on, now I can use the same image reference on all the machines -
</p>
<div><pre><code class="language-none">
❯ docker run unixorn/debian-py3 python3 --version
Python 3.9.2
❯
</code></pre></div>
<h2><strong>Takeaway</strong></h2>
<p>
If you're running a mix of architectures in your lab environment, docker buildx
will simplify things considerably.
</p>
<p>
No more maintaining multiple architecture tags, no more having to build on
multiple machines, no more accidentally forgetting to update one of the tags so
that things are mysteriously different on just some of our machines, no more
weird issues because we forgot to update service start scripts and
docker-compose.yml files.
</p>
<p>
Simpler is always better, and buildx will simplify the environment for you.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-62319905521744634502021-12-05T22:04:00.001-05:002021-12-05T22:04:34.057-05:00Day 6 - More to come tomorrow!We don't have any special system content for you today. We will have more tomorrow! sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-53645165494375002552021-12-04T03:00:00.003-05:002021-12-05T11:19:48.722-05:00Day 5 - Least Privilege using strace<p>
By: Shaun Mouton (<a href="https://twitter.com/sdmouton">@sdmouton</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Security in software development has been a hot-button issue for years.
Increasing awareness of the threat posed by supply chain breaches has only
increased the pressure on teams to improve security in all aspects of software
delivery and operation. A key premise is least privilege: granting the
minimum privileges necessary to accomplish a task, in order to prevent folks
from accessing or altering things they shouldn't have rights to. Here's my
thinking: we should help users apply the principles of least privilege when designing tools. When we find that our security tooling does not enable least-privilege use, we can still address the problem using tracing tools, which can be found in most Linux distribution package repositories. I would like to share my adventure of
looking at an InSpec profile (using CINC Auditor) and a container I found on
Docker Hub to demonstrate how to apply least privilege using <a href="https://sysadvent.blogspot.com/2008/12/sysadmin-advent-day-1.html">strace</a> for process access
auditing.
</p>
<p>
At my prior job working at Chef, I fielded a request asking how to run an InSpec
profile as a user other than root. InSpec allows you to write policies in code
(called InSpec Profiles) to audit the state of a system. Most of the
documentation and practice at the time had users inspecting the system as root or a
root-equivalent user. At first glance, this makes a certain amount of sense:
many tools in the "let's configure the entire system" and
"let's audit the security of the entire system" spaces need access to whatever
the user decides they want to check against. Users can write arbitrary profile
code for InSpec (and the open source CINC Auditor), ship those profiles
around, and scan their systems to determine whether or not they're in
compliance.
</p>
<p>
I've experienced this pain of excessive privileges with utilities myself. I
can't count the number of times we'd get a request to install some vendor tool
nobody had ever heard of with root privileges. Nobody who asked could tell us
what it'd be accessing, whether it would be able to make changes to the system,
or how much network/cpu/disk it'd consume. The vendor and the security
department or DBAs or whoever would file a request with the expectation that we
should just trust their assertion that nothing would go wrong. So, being
responsible system administrators, we'd say "no, absolutely not, tell us what
it's going to be doing first" or "yes, we'll get that work scheduled" and then
never schedule the work. This put us in the position of being gatekeepers rather than enablers of responsible behavior. While justified, it never sat right with me.
</p>
<p>
(Note: It is deeply strange that vendors often can't tell customers what their
tools do when asked in good faith, as is the idea that there should be an
assumption of trustworthiness in that lack of information.)
</p>
<p>
I've found some tools over the years which might be able to give a user output
which can be used to help craft something like a set of required privileges to
run an arbitrary program with non-root privileges. Not too long ago I discussed
"securing the supply chain" on how to design an ingestion pipeline to enable
folks to run containers in a secure environment where they could be somewhat
assured that a container using code they didn't write wasn't going to try to
access things that they weren't comfortable with. I thought about this old
desire of limiting privileges when running an arbitrary command, and figured
that I should do a little digging to see if something already existed. If not,
maybe I could work towards a solution.
</p>
<p>
Now, I don't consider myself an expert developer but I have been writing or
debugging code in one form or another since the '90s. I hope you consider this
demo code with the expectation that someone wanting to do this in a production
environment will re-implement what I've done far more elegantly. I hope that
seeing my thinking and the work will help folks to understand a bit more about
what's going on behind the scenes when you run arbitrary code, and to help you
design better methods of securing your environment using that knowledge.
</p>
<p>
What I'll be showing here is the use of strace to build a picture of what is
going on when you run code and how to approach crafting a baseline of expected
system behavior using the information you can gather. I'll show two examples:
</p>
<ul>
<li>executing a relatively simple InSpec profile using the open source
distribution's CINC Auditor
</li><li>running a randomly selected container off Docker Hub
(jjasghar/container_cobol)
</li>
</ul>
<p>
Hopefully, seeing this work will help you solve a problem in your environment or
avoid some compliance pain.
</p>
<h2><b>Parsing strace Output for a CINC Auditor (Chef InSpec)
profile</b></h2>
<p>
There are other write-ups of strace functionality which go into broader and
deeper detail on what's possible using it,<a href="https://jvns.ca/categories/strace/"> I'll point to Julia Evans' work</a>
to get you started if you want to know more.
</p>
<p>
Strace is the venerable Linux debugger, and a good tool to use when coming up
against a "<a href="https://sysadvent.blogspot.com/2010/12/day-15-down-ls-rabbit-hole.html">what's going on when this program runs</a>" problem. However, its output
can be decidedly unfriendly. Take a look in the<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output">
strace-output directory in this repo</a> for the files matching the pattern
linux-baseline.* to see the output of the following command:
</p>
<div><pre><code class="language-none">
root@trace1:~# strace --follow-forks --output-separately --trace=%file -o
/root/linux-baseline cinc-auditor exec linux-baseline
</code></pre></div>
<p>
You can parse the output, however, if all you want to know is what files might
need to be accessed (<a href="https://explainshell.com/explain?cmd=awk+-F+%27%22%27+%27%7Bprint+%242%7D%27+linux-baseline%2Flinux-baseline.108579+%7C+sort+-uR+%7C+head">for
an explanation of the command go here</a>) you can do something similar to the
following (maybe don't randomly sort the output and only show 10 lines):
</p>
<div><pre><code class="language-none">
awk -F '"' '{print $2}' linux-baseline/linux-baseline.108579 | sort -uR | head
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/minitest-5.14.4/lib/nokogiri.so
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/train-winrm-0.2.12/lib/psych/visitors.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/i18n-1.8.10/lib/rubygems/resolver/index_set.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-cognitoidentityprovider-1.53.0/lib/inspec/resources/command.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/jwt-2.3.0/lib/rubygems/package/tar_writer.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-codecommit-1.46.0/lib/pp.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/extensions/x86_64-linux/2.7.0/ffi-1.15.4/http/2.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/extensions/x86_64-linux/2.7.0/bcrypt_pbkdf-1.1.0/rubygems/package.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-databasemigrationservice-1.53.0/lib/inspec/resources/be_directory.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-ram-1.26.0/lib/rubygems/resolver/current_set.rb
</code></pre></div>
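<p>
One optional refinement (a sketch; adjust to taste): strace records the files a process <i>tried</i> to open as well as the ones it actually opened, so failed lookups show up with ENOENT. If you only care about paths that were successfully accessed, drop the failures before extracting the filenames:
</p>
<div><pre><code class="language-none">
# filter out failed lookups (ENOENT) before pulling out the quoted paths
grep -v ENOENT linux-baseline/linux-baseline.108579 | awk -F '"' '{print $2}' | sort -u
</code></pre></div>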
<p>
You can start to build a picture of what all the user would need to be able to
access in order to run a profile based on that output, but in order to go
further I'll use a<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/linux-vsp">
much simpler check</a>:
</p>
<div><pre><code class="language-none">
cinc-auditor exec linux-vsp/
</code></pre></div>
<p>
Full results of that command are located in the<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output">
strace-output directory</a> with files matching the pattern linux-vsp.*, but to
summarize what cinc-auditor/inspec is doing:
</p>
<ul>
<li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109613">linux-vsp.109613</a>
- this file shows all the omnibussed Ruby files the cinc-auditor command (the
parent process) tries to access in order to run
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109614">linux-vsp.109614</a>
- why Auditor is trying to run cmd.exe on a Linux system I don't yet know;
you'll get used to seeing $PATH traversal very quickly
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109615">linux-vsp.109615</a>
- I see a Get-WmiObject Win32_OperatingSys in there so we're checking to see if
this is Windows
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109616">linux-vsp.109616</a>
- more looking on the $PATH for Get-WmiObject so more Windows checking
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109617">linux-vsp.109617</a>
- I am guessing that checking the $PATH for the Select command is more of the
same
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109618">linux-vsp.109618</a>
- Looking for and not finding ConvertTo-Json, this is a PowerShell cmdlet,
right?
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109619">linux-vsp.109619</a>
- Now we're getting somewhere on Linux: this is running uname -s (with $PATH
traversal info in there - see how used to this you are by now?)
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109620">linux-vsp.109620</a>
- Now running uname -m
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109621">linux-vsp.109621</a>
- Now running test -f /etc/debian_version
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109622">linux-vsp.109622</a>
- Doing something with /etc/lsb-release but I didn't use the -v or -s strsize
flags with strace so the command is truncated.
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109623">linux-vsp.109623</a>
- Now we're just doing cat /etc/lsb-release using locale settings
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109624">linux-vsp.109624</a>
- Checking for the inetd package
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109625">linux-vsp.109625</a>
- Checking for the auditd package, its config directory /etc/dpkg/dpkg.cfg.d,
and the config files /etc/dpkg/dpkg.cfg, and /root/.dpkg.cfg
</li>
</ul>
<p>
Moving from that to getting an idea of everything a non-root user would need to
be able to access, you can do something like this in the strace-output directory
(<a href="https://explainshell.com/explain?cmd=find+.+-name+%22linux-vsp.10*%22+-exec+awk+-F+%27%22%27+%27%7Bprint+%242%7D%27+%7B%7D+%5C%3B+%7C+sort+-u+%3E+linux-vsp_files-accessed.txt">explainshell
here</a>):
</p>
<div><pre><code class="language-none">
find . -name "linux-vsp.10*" -exec awk -F '"' '{print $2}' {} \; | sort -u >
linux-vsp_files-accessed.txt
</code></pre></div>
<p>
You can see the<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp_files-accessed.txt">
output of this command here</a>, but you'll need to interpret some of the output
from the perspective of the program being executed. For example, I see "Gemfile"
in there without a preceding path. I expect that's Auditor looking in the
./linux-vsp directory where the profile being called exists, and the other
entries without a preceding path are probably also relative to the command being
executed.
</p>
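<p>
If you want to pull out just those relative entries for a closer look, a quick filter
(a minimal sketch using the same output file) could be:
</p>
<div><pre><code class="language-none">
# list entries that don't start with a slash, i.e. paths relative to the working directory
grep -v '^/' linux-vsp_files-accessed.txt | sort -u | head
</code></pre></div>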
<h2><b>Parsing strace output of a container execution</b></h2>
<p>
I said Docker earlier, but I've got podman installed on this machine so that's
what the output will reflect. You can find the output of the following command
in the strace-output<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output">
directory</a> in files matching the pattern container_cobol.*, and wow. Turns
out running a full CentOS container produces a lot of output. When scanning
through the files, you see what looks like podman doing podman things, and what
looks like the COBOL Hello World application executing in the container. As I go
through these files I will call out anything particularly interesting I see
along the way:
</p>
<div><pre><code class="language-none">
root@trace1:~# strace -ff --trace=%file -o /root/container_cobol podman run -it container_cobol
Hello world!
root@trace1:~# ls -1 container_cobol.* | wc -l
146
</code></pre></div>
<p>
I'm not going to go through 146 files individually as I did previously, but this
is an interesting data point:
</p>
<div><pre><code class="language-none">
root@trace1:strace-output# find . -name "container_cobol.1*" -exec awk -F '"' '{print $2}' {} \; | sort -u > container_cobol_files-accessed.txt
root@trace1:strace-output# wc -l container_cobol_files-accessed.txt
637 container_cobol_files-accessed.txt
root@trace1:strace-output# wc -l linux-vsp_files-accessed.txt
104754 linux-vsp_files-accessed.txt
</code></pre></div>
<p>
So the full CentOS container running a little COBOL Hello World application
needs access to six hundred thirty-seven files, and CINC Auditor running
a 22-line profile directly on the OS needs to access over one hundred four
thousand files. That doesn't directly mean that one is more or less of a
security risk than the other - particularly given that a Hello World application
can't report on the compliance state of your machines, containers, or
applications - but it is fun to think about. One of the neatest
things about debugging with tools which expose the underlying operations of a
container exec is that you can reason about what containerization is actually
doing. In this case, since we're only showing what files are accessed during the
container exec, sorting the list, and removing duplicate entries, it's a cursory
view but still a useful one.
</p>
<p>
Let's say we're consuming a vendor application as a container. We can trace an
execution (or sample a running instance of the container for a day - strace can
attach to running processes), load the list of files into the pipeline we use to
promote new versions of that vendor app to prod, and, when we see a change in the
files that the application is opening, make a determination about whether the
behavior of the new version is appropriate for our production environment with
all its PII and user financial data. Now, instead of trusting the vendor at
their word that they've done their due diligence, we're actually observing the
behavior of the application and using our own knowledge of our environment to
say whether that application is suitable for use.
</p>
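<p>
As a rough sketch of what that pipeline step could look like (the image names and
file paths here are hypothetical, and you'd attach to an existing pid with strace -p
instead of wrapping podman if you're sampling a running instance):
</p>
<div><pre><code class="language-none">
# capture a file-access baseline for the current version
strace -ff --trace=%file -o /tmp/vendorapp-v1 podman run --rm vendorapp:1.0
find /tmp -name "vendorapp-v1.*" -exec awk -F '"' '{print $2}' {} \; | sort -u > vendorapp-v1-files.txt

# capture the same for the candidate version, then review what changed before promoting it
strace -ff --trace=%file -o /tmp/vendorapp-v2 podman run --rm vendorapp:2.0
find /tmp -name "vendorapp-v2.*" -exec awk -F '"' '{print $2}' {} \; | sort -u > vendorapp-v2-files.txt
diff vendorapp-v1-files.txt vendorapp-v2-files.txt
</code></pre></div>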
<h2><b>But wait! Strace isn't just for files!</b></h2>
<p>
I used strace's file syscall filter as an example because it fit the example use
case, but strace can snoop on other syscalls too! Do you need to know what IP
addresses your process knows about? This example uses a container exec
again, but you could snoop on an existing pid if you want, then run a similar
search against the output (IPs have been modified in this output):
</p>
<div><pre><code class="language-none">
strace -ff --trace=%network -o /root/yourcontainer-network -s 10241 podman run -it yourcontainer
for file in $(ls -1 yourcontainer-network.*); do grep -oP 'inet_addr\("\K[^"]+' $file ; done
127.0.0.1
127.0.0.1
693.18.119.36
693.18.119.36
693.18.131.255
75.5117.0.5
75.5117.0.5
75.5117.255.255
161.888.0.2
161.888.0.2
161.888.15.255
832.71.40.1
832.71.40.1
832.71.255.255
</code></pre></div>
<h2><b>Have I answered my original question?</b></h2>
<p>
With all that knowledge, can we address the original question: can the list of
files output by tracing a cinc-auditor run be used to build a restricted set of
permissions that allows a standard user to audit the system using CINC Auditor
and the profile?
</p>
<p>
Yes, with one caveat: My Very Simple Profile was too simple, and didn't require
any additional privileges. I tried with a few other public profiles, but every
one I tried ran successfully using a standard user created with useradd -m
cincauditor. I looked through bug reports related to running profiles as a
non-root user but couldn't replicate their issues - which is good, I suppose. It
could be that the issue my customer was facing at the time was a bug in the
program's behavior when run as a non-root user which has been fixed, or I just
don't remember the use case they presented well enough to replicate it. So
here's a manufactured case:
</p>
<div><pre><code class="language-none">
root@trace1:~# mkdir /tmp/foo
root@trace1:~# touch /tmp/foo/sixhundred
root@trace1:~# touch /tmp/foo/sevenhundred
root@trace1:~# chmod 700 /tmp/foo
root@trace1:~# chmod 600 /tmp/foo/sixhundred
root@trace1:~# chmod 700 /tmp/foo/sevenhundred
cincauditor@trace1:~$ cat << EOF > linux-vsp/controls/filetest.rb
> control "filetester" do
> impact 1.0
> title "Testing files"
> desc "Ensure they're owned by root"
> describe file('/tmp/foo/sixhundred') do
> its('owner') { should eq 'root' }
> end
> describe file('/tmp/foo/sevenhundred') do
> its('group') { should eq 'root'}
> end
> end
> EOF
cincauditor@trace1:~$ cinc-auditor exec linux-vsp/
Profile: Very Simple Profile (linux-vsp)
Version: 0.1.0
Target: local://
× filetester: Testing files (2 failed)
× File /tmp/foo/sixhundred owner is expected to eq "root"
expected: "root"
got: nil
(compared using ==)
× File /tmp/foo/sevenhundred group is expected to eq "root"
expected: "root"
got: nil
(compared using ==)
✔ inetd: Do not install inetd
✔ System Package inetd is expected not to be installed
↺ auditd: Check auditd configuration (1 skipped)
✔ System Package auditd is expected to be installed
↺ Can't find file: /etc/audit/auditd.conf
Profile Summary: 1 successful control, 1 control failure, 1 control skipped
Test Summary: 2 successful, 2 failures, 1 skipped
cincauditor@trace1:~$ find . -name "linux-vsp.1*" -exec awk -F '"' '{print $2}' {} \; | sort -u > linux-vsp_files-accessed.txt
root@trace1:~# diff --suppress-common-lines -y linux-vsp_files-accessed.txt /home/cincauditor/linux-vsp_files-accessed.txt | grep -v /opt/cinc-auditor
> /home
> /home/cincauditor
> /home/cincauditor/.dpkg.cfg
> /home/cincauditor/.gem/ruby/2.7.0
> /home/cincauditor/.gem/ruby/2.7.0/specifications
> /home/cincauditor/.inspec
> /home/cincauditor/.inspec/cache
> /home/cincauditor/.inspec/config.json
> /home/cincauditor/.inspec/gems/2.7.0/specifications
> /home/cincauditor/.inspec/plugins
> /home/cincauditor/.inspec/plugins.json
> /home/cincauditor/linux-vsp
/root <
/root/.dpkg.cfg <
/root/.gem/ruby/2.7.0 <
/root/.gem/ruby/2.7.0/specifications <
/root/.inspec <
/root/.inspec/cache <
/root/.inspec/config.json <
/root/.inspec/gems/2.7.0/specifications <
/root/.inspec/plugins <
/root/.inspec/plugins.json <
/root/linux-vsp <
> /tmp/foo/sevenhundred
> /tmp/foo/sixhundred
> linux-vsp/controls/filetest.rb
root@trace1:~#
</code></pre></div>
<p>
The end of that previous block's output shows compiling the list of files
accessed when the cincauditor user runs the profile in the same way we did for
the root user, then a diff of the two files. Looking at that output, it's fairly
obvious that the profile is trying to access the newly created files which are
in a directory we made inaccessible to the cincauditor user (with chmod 700
/tmp/foo), and when we give cinc-auditor access to that directory with chmod 750
/tmp/foo the profile is able to check those files. A manufactured replication of
the use case, but it does show that it's possible to use the output to
accomplish the task. Whether chmod is the right way to give a least-privilege
user access to the files is a question best left up to the implementer, their
organization, and their auditors - the purpose of this exercise is to
demonstrate the potential value of the strace debugger.</p><p>It is important to note that file permissions aren't the only reason why a program wouldn't run. If you're not able to use the information strace gives you to get an application to run as a user with restricted privileges, at least you can get more information about what is happening under the hood and can communicate about why a program is not suitable for your environment. If a program needs to run anyway, you can profile the application's behavior (perhaps a tool built on eBPF would be more suitable than strace for ongoing monitoring in a production environment) and notify when its behavior changes.<br /></p>
<h2><b>Closing thoughts</b></h2>
<p>
Over the past few years I've had a lot of thoughts about how to get things done in modern environments,
and I've come to the conclusion that it's okay to write shell scripts to get
something like this done. Since in this case I'm wrapping arbitrary tasks so I can
extract information about what happens when they're running, and I won't
be able to predict where I'll need it, I figured it was a good idea to use
bash and awk, as those will be available via the package manager wherever I want to do
this sort of thing.
</p>
<p>
You might not agree, and may wish to see something like this implemented in Ruby,
Python, or Rust (I have to admit that I thought about trying to do this using
Rust so as to get better at it), and you're of course welcome to do so. Again, I
chose shell since it's something many folks can easily run, look at, comprehend,
modify, and re-implement in the way that suits them.
</p>
<p>
Lastly, thanks very much to Julia Evans. A note about the power of storytelling
in one of her posts made me think "I should write a story about solving this
problem so I can be sure I learned something from it", and I hope I've done a decent job of emulating her empathy towards folks learning these concepts for the first time.
</p>Shaun Moutonhttp://www.blogger.com/profile/16550788803145425493noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-66081166472775468052021-12-04T00:00:00.087-05:002021-12-04T23:44:46.189-05:00Day 4 - GWLB: Panacea for Cloud DMZ on AWS<p>
By: Atif Siddiqui <br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Organizations aspire to apply the same security controls to ingress traffic in
Cloud as they have on-premises, ideally taking advantage of Cloud value
propositions to provide resiliency and scalability to traffic inspection
appliances.
</p>
<p>
Within the AWS ecosystem, until last year, there wasn’t an elegant solution.
Consequently, the most notable challenge it created, especially for regulated
organizations, was designing the DMZ (demilitarized zone) pattern in AWS. It
took two announcements to close this gap: VPC Ingress routing and Gateway Load
Balancer (GWLB).
</p>
<p>
Two years ago, AWS announced VPC Ingress routing. This provided the capability
where ingress traffic could be directed to an Elastic Network interface (ENI).
Last year, Amazon followed it up with a complementary announcement of GWLB.
</p>
<p>
GWLB is AWS's fourth load balancer offering following Classic, Application and
Network Load Balancer. Unlike the first three types, GWLB solves a niche problem
and is, specifically, targeted towards partner appliances.
</p>
<p>
GWLB has a novel design with two distinct sides. The front end is connected to
VPC endpoint service and corresponding VPC endpoints. This front end acts as a
Layer 3 gateway. The backend is connected to third party appliances. This
backend acts as a Layer 4 Load Balancer. An oversimplified diagram of the
traffic flow is shown:
</p>
<p>
<i>Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB →
3<sup>rd</sup> party appliance</i>
</p>
<h3>So how do you provision a GWLB? </h3>
<p>
There are 4 key resources that need to be provisioned in order:
</p>
<ul>
<li>Target Group</li>
<li>GWLB using the above as the target group.</li>
<li>VPC endpoint service using above as the load balancer type.</li>
<li>VPC endpoints bound to the above endpoint service.</li>
</ul>
<span style="text-decoration: underline;">Target Group</span>
<p>
As part of this announcement, AWS implemented the GENEVE protocol and added this
option to the UX for Target Group. If you are unfamiliar with this protocol it
will be explained after going through GWLB provisioning requirements.
</p>
<p>
To configure this as infrastructure code (IaC), you could use a terraform code
snippet as follows:
</p><div><pre><code class="language-none">
resource "aws_lb_target_group" "blog_gwlb_tgt_grp" {
name = "blog_gwlb_tgt_grp"
port = 6081
protocol = "GENEVE"
vpc_id = aws_vpc.fw.id
}
</code></pre></div>
<img src="https://lh3.googleusercontent.com/4dgtOgm2LlU5yq3kn0wG3TIfr_DwJFAfE3sjLlT4o0eXvDNYVfuh28Iae7Y1zHRWWi8is2zny15DqnDARgZV5KplwrC1JBxR7gLTBUuD6r_1KSyNDhBlaZF6Z6hygeaVRgjp0Mco" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>GWLB</h2>
<p>
As with Application Load Balancing, GWLB requires a target group to forward
traffic; however, the target group must be created with the GENEVE protocol.
</p>
<p>
Health checks for TCP, HTTP and HTTPS are supported; however, it should be noted
that health check packets are not GENEVE encapsulated.
</p>
<p>
An example of a terraform code snippet is as follows.
</p>
<div><pre><code class="language-none">
resource "aws_lb" "blog_gwlb" {
name = "blog_gwlb"
load_balancer_type = "gateway"
subnets = blog-gwlb-subnet.pvt.*.id
tags = {
Name = “blog-gwlb”,
Environment = "sandbox"
}
}
</code></pre></div>
<img src="https://lh5.googleusercontent.com/GMSIhrOtsTcGcRYq9JNnSLq57w7tipZhTdVzy3HPr9x0wo55wWG8qBBVA7-knV822FxPDVma9-P1M2fsJt6B3UgYJ9ds-gDbXd-zdDUxH3k4FXouKRpMYpqlAfxqNH0Uivlq0fA0" style="margin-left: 0px; margin-top: 0px;" width="492" />
<img src="https://lh5.googleusercontent.com/qPRzqZR5rPJR_GejuCR6JQYBcAKmBXARi5cAGVnYi6mqV7eeHfzwymoXJS0UBIwcntmYAmteuef_eVcgRJU-e9bFuou8uU34wLaSXKbIOgJ7mMBTJZmWWtWNDr-tt47jvb7lKDmt" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>Endpoint Service</h2>
<p>
Prior to the GWLB announcement, if an endpoint service was being created, the only
option offered was Network Load Balancer (NLB). With GWLB’s availability,
gateway is now the second option for load balancer type when creating an
endpoint service. It should be noted that an endpoint service, whether it uses NLB
or GWLB, relies on the underlying PrivateLink technology.
</p>
<p>
An example of terraform code snippet is as follows.
</p>
<div><pre><code class="language-none">
resource "aws_vpc_endpoint_service" "blog-vpce-srvc" {
acceptance_required = false
gateway_load_balancer_arns = [aws_lb.blog-gwlb.arn]
tags = {
Name = “blog-gwlb”,
Environment = "sandbox"
}
}
</code></pre></div>
<img src="https://lh3.googleusercontent.com/YhhmSmS5ul91FSI_0eO-_ewB_8ixJ4ZInPj0nL-lXMhl_Q-qj-urLbjYvrToo0ybsiVz5meLJCN33rMlT36fADe7k4t9rAG1wxBQPB4aAsNwcRdF8cSwFDT5cFiXMM_2kOc2Jo3y" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>VPC endpoint</h2>
<p>
The last key piece of the set is provisioning the VPC endpoints, which will bind
to the endpoint service created in the prior step.
</p>
<div><pre><code class="language-none">
resource "aws_vpc_endpoint" "blog_gwlbe" {
  count             = length(var.az)
  service_name      = aws_vpc_endpoint_service.blog-vpce-srvc.service_name
  # GWLB endpoints use the GatewayLoadBalancer endpoint type
  vpc_endpoint_type = "GatewayLoadBalancer"
  subnet_ids        = [var.blog-gwlb-subnets[count.index]]
  vpc_id            = aws_vpc.fw.id

  tags = {
    Name        = "blog-gwlb"
    Environment = "sandbox"
  }
}
</code></pre></div>
<img src="https://lh3.googleusercontent.com/VqdwLFjULJiuIoqS495-ZI4TzLzBtzf0lzeX_BxxiDJp4WZugZ-5Mvh_PC_tzsgRk4-0N3QXO1ZX4-hojX7j5F4qn0odsLsAJXPHECdgjmo2EiGDY0wZPSMMsnrIii_xNPHuZO66" style="margin-left: 0px; margin-top: 0px;" width="492" />
<img src="https://lh4.googleusercontent.com/FifVP4eog3KHCt-0HkGjk4yteHoCmWCY0aI9sXbQnei8vGPfTGqGMS7YE33AHgdkGGsGDk3JwFoXTvuFf6wliMei492waKqr8QuaFl1SfmMe7a1Z6yIxUt5WyI0PnZ_URObkiBTw" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>GENEVE</h2>
<p>
This is an encapsulation protocol created by the Internet Engineering Task Force
(IETF). GENEVE stands for Generic Network Virtualization Encapsulation and
leverages UDP for the transport layer. This encapsulation is what achieves the
transparent routing of packets to third party appliances from vendors such as
F5 (BIG-IP), Palo Alto Networks, Aviatrix, etc.
</p>
<p>
</p>
<p>
<span style="text-decoration: underline;">Special route table</span>
</p>
<p>
The glue that binds the VPC Ingress routing and GWLB features together is a
special use of a route table.
</p>
<p>
<i>Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB →
3<sup>rd</sup> party appliance, e.g. a marketplace subscription.</i>
</p>
<p>
This table does not have any explicit subnet association. It does, however, have
the Internet Gateway (IGW) specified as an edge association.
</p>
<p>
Within its routes, the quad-zero route (0.0.0.0/0) points to the network
interfaces (ENIs) of the Gateway Load Balancer endpoints (GWLBe).
</p>
<p>
It is this routing rule that forces ingress traffic to be routed to the GWLBe,
which in turn sends it through the endpoint service to the GWLB, which then
routes it to the appliances. </p>
<img src="https://lh6.googleusercontent.com/8P-AqpGjmZTuTDy2vdgqbc8tOw8NJbUSGqVQ7GqJMPUI8UdfJvGKxcLAaJO4833_tJut5_qgWLe3KIGFmtNfmZG17de7a26U6m05C0ToCVDzD2xbKK4fWDDpQZdlbwKcCVzY-6U0" style="margin-left: 0px; margin-top: 0px;" width="492" />
<img src="https://lh3.googleusercontent.com/-_Q-d8sEwHUPtyAPs8RWYIjBTfqQM1BU20rR_DyJWsIshS9Hnz5oQastOjY6Gr40RVhHHx1MV7GzCcJ4MFHQnH2ZoV2AFrlLn1KdblYqtPtSBqg-lNowTVXL3Tn-W65JThZNche3" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>Limitations</h2>
<p>
A target group using the GENEVE protocol does not support tags. </p>
<img src="https://lh4.googleusercontent.com/4Ivocf32Pme56tvvHo5O6xgx_OO17fH_T3Oy9koYDREXcGDyW8pDlFnh4d4EZAL0gxHPKcLX9zxfrxon13uvfpE3iY7fIZH0kirQdO3DwSZ9RXGYRbv9YZC9PbszxtddbiZ5l_VY" style="margin-left: 0px; margin-top: 0px;" width="492" />
Cloud DMZ: Centralized Inspection Architecture
<img src="https://lh4.googleusercontent.com/YkZX_lTaQW2NZp0_1DLPvGD9eRNSH34vS4wXlM1Qlb9IRE9fCOF21c7TaLY2vptWcoqOFOA82MPJZKqp1A2kN5CZyO0_F0K-mWnT_EA2SYxqe1WeVUgYQFVMiQzw1g0mZ35jWB0B" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>Conclusion</h2>
<p>
The pairing of VPC ingress routing and GWLB allows enterprises to have a much
sought-after security posture where both ingress and egress traffic can
undergo firewall inspection. This capability is especially notable when a
Cloud DMZ architecture is being created.
</p>
<h2>Afterthought: AWS Network
Firewall</h2>
<p>
It is always fascinating to me how AWS keeps vendors on their toes. There seems
to be an aura of ineluctability where vendors strive to stay a step ahead of
AWS’s offerings. While customers can use marketplace subscriptions (e.g. a
firewall) with GWLB, there is a competing service from Amazon named AWS Network
Firewall. This is essentially Firewall as a Service, where the VPC ingress routing
primitive is used to point to AWS Network Firewall, which uses GWLB behind
the scenes. It is easy to predict that AWS will push new products into this
space that use GWLB under the hood.
</p>
<p>
Over time, choices will grow, whether with AWS products or with more vendors
certifying their products against GWLB. This abundance will only benefit
customers in their pursuit of a secure network architecture.
</p>
<p>
</p>
<h2>References</h2>
<ul><li><a href="https://aws.amazon.com/blogs/aws/new-vpc-ingress-routing-simplifying-integration-of-third-party-appliances/">VPC
Ingress routing announcement</a></li>
<li><a href="https://aws.amazon.com/blogs/aws/introducing-aws-gateway-load-balancer-easy-deployment-scalability-and-high-availability-for-partner-appliances/">GWLB
announcement</a></li>
<li><a href="https://datatracker.ietf.org/doc/html/rfc8926">GENEVE RFC</a></li>
<li><a href="https://aws.amazon.com/network-firewall/">AWS Network Firewall</a></li>
<li><a href="https://www.redhat.com/en/blog/what-geneve">What is GENEVE</a></li>
</ul>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-22706026869505692782021-12-03T00:00:00.023-05:002021-12-04T23:44:37.075-05:00Day 3 - Keeping Config Management Simple with Itamae<p>
By: Paul Welch (<a href="https://twitter.com/pwelch">@pwelch</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Our DevOps toolbox is filled with many tools with Configuration Management being
an often neglected and overloaded workhorse. While many resources today are
deployed with containers, you still use configuration management tools to manage
the underlying servers. Whether you use an image-based approach and configure
your systems with <a href="https://www.packer.io">Packer</a> or prefer
configuring your systems manually after creation by something like <a
href="https://www.terraform.io/">Terraform</a>, chances are you still want to
continuously manage your hosts with infrastructure as code. To add to the list
of potential tools to solve this, I’d like to introduce you to <a
href="https://itamae.kitchen/">Itamae</a>. Itamae is a simple tool that helps
you manage your hosts with a straight-forward DSL while also giving you access
to the Ruby ecosystem. Inspired by <a
href="https://github.com/chef/chef">Chef</a>, Itamae has a similar DSL but does
not require a server, complex attributes, or data bags.
</p>
<h2>Managing Resources</h2>
<p>
Itamae is designed to be lightweight; it comes with an essential set of <a
href="https://github.com/itamae-kitchen/itamae/wiki/Resources#resource-type">resource
types</a> to bring your hosts to the expected state. These resource types focus
on the core parts of our host we want to manage like packages, templates, and
services. The bundled `execute` resource can be used as an escape hatch to
manage resources that might not have a builtin resource type. If you find
yourself wanting to manage something often that does not have a built in
resource, you can <a
href="https://github.com/itamae-kitchen/itamae/wiki/Resource-Plugins">build your
own resources</a> if you are comfortable with Ruby.
</p>
<p>
All Itamae resource types have <a
href="https://github.com/itamae-kitchen/itamae/wiki/Resources#resource-type">common
attributes</a> that include: actions, guards, and triggers for other resources.
</p>
<h3>Actions</h3>
<p>
Actions are the activities that you want to have occur with the resource. Each
bundled resource has predefined actions that can be taken. A `service`
resource, for example, can have both an `:enable` and `:start` action which
tells Itamae to enable the service to start on system boot and also start the
service if it is not currently running.
</p>
<div><pre><code class="language-none">
# enable and start the fail2ban service
service "fail2ban" do
  action [:enable, :start]
end
</code></pre></div>
<h3>Guards</h3>
<p>
Guards ensure a resource is idempotent by only invoking the interpreted code if
the conditions pass. The common attributes that are available to use within your
infracode are `only_if` and `not_if`.
</p>
<p>
</p>
<div><pre><code class="language-none">
# create an empty file only if it does not exist
execute "create an empty file" do
  command "touch /tmp/file.txt"
  not_if "test -e /tmp/file.txt"
end
</code></pre></div>
<h3>Triggers</h3>
<p>
Triggers allow you to define event driven notifications to other resources.
</p>
<p>
The `notifies` and `subscribes` attributes allow you to trigger other resources
only if there is a change such as restarting a service when a new template is
rendered. These are synonymous with Chef & Puppet’s `notifies` and `subscribes`
or Ansible’s `handlers`.
</p>
<div><pre><code class="language-none">
# define nginx service
service 'nginx' do
  action [:enable, :start]
end

# render template and restart nginx if there are changes
template "/etc/nginx/sites-available/main" do
  source "templates/etc/nginx/sites-available/main.erb"
  mode "0644"
  action :create
  notifies :restart, "service[nginx]", :delayed
end
</code></pre></div>
<p>
Itamae code is normally organized in “cookbooks” much like Chef. You can <a
href="https://github.com/itamae-kitchen/itamae/wiki/Including-Recipes">include
recipes</a> to separate your code. Itamae also supports <a
href="https://github.com/itamae-kitchen/itamae/wiki/Definitions">definitions</a>
to help <a href="https://en.wikipedia.org/wiki/Don't_repeat_yourself">DRY</a>
your code for resources.
</p>
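<p>
As a rough sketch (the recipe file and definition name here are hypothetical),
including a recipe and defining a reusable resource might look like this:
</p>
<div><pre><code class="language-none">
# default.rb - pull in another recipe from the same cookbook
include_recipe "nginx.rb"

# a definition wraps a common pattern so it can be reused like a resource
define :managed_config, source: nil, owner: "root" do
  template params[:name] do
    source params[:source]
    owner params[:owner]
    mode "0644"
    action :create
  end
end

# use the definition like any other resource
managed_config "/etc/motd" do
  source "templates/etc/motd.erb"
end
</code></pre></div>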
<h2>Example</h2>
<p>
Now that we have an initial overview of the Itamae basics, let’s build a basic
Nginx configuration for a host. This example will install Nginx from a PPA on
Ubuntu and render a basic configuration that will return the requestor’s IP
address. The cookbook resources will be organized as follows:
</p>
<div><pre><code class="language-none">
├── default.rb
└── templates
└── etc
└── nginx
└── sites-available
└── main.erb
</code></pre></div>
<p>
We will keep it simple with a single `default.rb` recipe and single `main.erb`
Nginx site configuration template. The recipe and site configuration template
content can be found below.
</p>
<div><pre><code class="language-none">
# default.rb

# Add Nginx PPA
execute "add-apt-repository-ppa-nginx-stable" do
  command "add-apt-repository ppa:nginx/stable --yes"
  not_if "test -e /usr/sbin/nginx"
end

# Update apt cache
execute "update-apt-cache" do
  command "apt-get update"
end

# install nginx stable
package "nginx" do
  action :install
end

# enable nginx service
service 'nginx' do
  action [:enable, :start]
end

# configure nginx
template "/etc/nginx/sites-available/main" do
  source "templates/etc/nginx/sites-available/main.erb"
  mode "0644"
  action :create
  notifies :restart, "service[nginx]", :delayed
  variables()
end

# enable example site
link '/etc/nginx/sites-enabled/main' do
  to "/etc/nginx/sites-available/main"
  notifies :restart, "service[nginx]", :delayed
  not_if "test -e /etc/nginx/sites-enabled/main"
end

# disable default site
execute "disable-nginx-default-site" do
  command "rm /etc/nginx/sites-enabled/default"
  notifies :restart, "service[nginx]", :delayed
  only_if "test -e /etc/nginx/sites-enabled/default"
end
</code></pre></div>
<div><pre><code class="language-none">
# main.erb
server {
  listen 80 default_server;
  listen [::]:80 default_server;
  server_name _;

  location / {
    # Return the requestor's IP as plain text
    default_type text/html;
    return 200 $remote_addr;
  }
}
</code></pre></div>
<h2>Deploying</h2>
<p>
<em>*To deploy the above example, it is assumed that you have a temporary VPS
instance available.</em>
</p>
<p>
There are 3 different ways you can deploy your configurations with Itamae:
</p>
<ul>
<li>`itamae ssh` via the itamae gem.
</li><li>`itamae local` also via the itamae gem.
</li><li>`mitamae` locally on the host.
</li>
</ul>
<p>
<a href="https://github.com/itamae-kitchen/mitamae">Mitamae</a> is an
alternative implementation of Itamae built with <a
href="https://mruby.org/">mruby</a>. This post is focusing on Itamae in general
but the Mitamae implementation is a notable option if you want to deploy your
configuration using <a
href="https://github.com/itamae-kitchen/mitamae/releases">prebuilt binaries</a>
instead of using SSH or requiring Ruby.
</p>
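<p>
For example, a rough sketch of the binary-based approach (the release URL, version, and
architecture below are placeholders you would adjust for your host) might look like:
</p>
<div><pre><code class="language-none">
# fetch a prebuilt mitamae binary onto the target host (placeholder version/arch)
curl -fsSL -o mitamae https://github.com/itamae-kitchen/mitamae/releases/download/vX.Y.Z/mitamae-x86_64-linux
chmod +x mitamae

# apply the recipe locally - no Ruby installation or SSH connection required
sudo ./mitamae local default.rb
</code></pre></div>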
<p>
With your configuration ready, it’s just a single command to deploy over SSH.
Itamae uses the SpecInfra library which is the same library that <a
href="https://serverspec.org">ServerSpec</a> uses to test hosts. You can also
access a <a href="https://serverspec.org/host_inventory.html">host’s
inventory</a> in Itamae much like you can with Chef & Ohai. To deploy your
configuration, run:
</p>
<div><pre><code class="language-none">
itamae ssh --key=/path/to/ssh_key --host=<IP> --user=<USER> default.rb --log-level=DEBUG
</code></pre></div>
<p>
Itamae will manage those packages and write out the template we specified,
bringing the host to our desired state. Once the command is complete, you should
be able to curl the host’s IP address and receive a response from Nginx.
</p>
<h2>Wrapping Up</h2>
<p>
Thank you for joining me in learning about this lightweight configuration
management tool. Itamae gives you a set of bundled resource types to quickly
configure your infrastructure in a repeatable and automated manner with three
ways to deploy. Check out the <a
href="https://github.com/itamae-kitchen/itamae/wiki">Itamae Wiki</a> for more
information and best practices!
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-34530133637658346942021-12-02T00:00:00.010-05:002021-12-04T23:44:27.631-05:00Day 2 - Reliability as a Product Feature<p>
By: Martin Smith (<a href="https://twitter.com/martinb3">@martinb3</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<h2>Abstract</h2>
<p>
SRE was born out of thinking about reliability as a product feature. However,
all of the industry focus in the last few years on things like SLOs, Error
Budgets, Production Engineering teams, and the other practices that constitute "doing
SRE" sometimes means teams don’t take advantage of a product-centric approach
to reliability these days. And they lose some of the advantages of doing so as a
result. This post covers some project maturity levels, some suggestions for
thinking about reliability as an SRE engaged in those kinds of projects, as well
as what kinds of collaboration might be most successful in driving
reliability-as-product-feature in each phase.
</p>
<h2>A brief history</h2>
<p>
Site Reliability Engineering, or SRE for short, was <a
href="https://www.oreilly.com/content/site-reliability-engineering-sre-a-simple-overview/">born
in 2003</a> out of a need to improve service reliability at <a
href="https://sre.google/">Google</a>. Often described as, “an implementation of
<a href="https://aws.amazon.com/devops/what-is-devops/">DevOps</a>,” the
practice of SRE aims to <strong>treat operations as a software problem</strong>
that can be addressed through software engineering techniques.
</p>
<p>
And according to <a
href="https://insights.devopsinstitute.com/hubfs/Automation%20Downloads/Upskilling%202021-Enterprise%20DevOps%20Skills%20Report.pdf">a
survey by the DevOps Institute</a>, SRE has truly taken off. This approach has
been widely adopted, with <strong>22% of organizations saying they have an SRE
team in 2021</strong>. This shift can also be seen with the rise of conferences
like USENIX’s SREcon which began in 2014, or the release of the popular, “Google
SRE book,” a few years later in 2016.
</p>
<p>
Whether or not your organization has an SRE team that plans work using SLOs and
Error Budgets, regularly reduces toil through automation, or has adopted one of
the <a
href="https://www.oreilly.com/content/site-reliability-engineering-sre-a-simple-overview/">many
SRE rules of thumb</a>, the basic premise of <strong>what impact SRE can have
sometimes gets lost</strong> -- that operations is a software problem. Or,
shifting the focus back to the customer perspective, that <strong>reliability is
a product feature</strong> that we build.
</p>
<p>
Having held <em>DevOps Engineer</em> and <em>Site Reliability Engineer</em>
roles in the past, and having been a technical lead for SRE teams, I’ve had many
opportunities to define the role, activities, and most importantly, the
<strong>impact of an SRE team</strong>. In each case, I’ve found that
<strong>focusing back on our customers’ experience of reliability</strong> has
been the most useful framing when speaking to company leaders about the SRE team’s
“<a href="https://simonsinek.com/product/find-your-why/">why</a>,” instead of
reciting a long, confusing list of things SREs might do in a quarter. I’ve also
found that it’s an easy litmus test for myself to ensure I’m working on the
right things at the right time. If I can’t explain how my work affects customer
reliability, keeping in mind that <strong>reliability for operators usually
leads to reliability for customers, </strong>it might be a sign that I need to
work on something else.
</p>
<p>
</p>
<h2>Shifting focus back to product reliability</h2>
<p>
Shifting the focus from operations and software engineering to talking about
<strong>reliability as a product feature</strong> has some major benefits.
First, it helps our organizations better understand what reliability might mean
for them and their product(s) -- whether that’s <strong>resilience</strong>
(tolerant of failure), <strong>scalability</strong> (can function with large
volumes of work), <strong>observability</strong> (understanding internal state
from outputs), or <strong>security</strong> (trust of the system). These are all
product capabilities that often aren’t well understood, but fundamentally all
matter to customers.
</p>
<p>
Reliability <strong>benefits from product management support</strong>
(communication with stakeholders, building roadmaps, helping with prioritization
and decisions, etc). For example, do you know who your internal stakeholders are
for the scalability of your product? What’s on the roadmap for observability
over the next 6 months? 2 years? And importantly, what metrics will you collect
to be sure you’ve accomplished those goals and delivered on that roadmap? How
does it align with other features’ roadmaps? As a friend and former colleague of
mine says, “reliability is a product feature whether you devote engineering time
to it or not.” If you don’t explicitly plan for that, your customers will
implicitly make their own assumptions about your reliability.
</p>
<p>
Reliability may start to sound like any other product feature, with both
internal and external stakeholders, and that’s by design. Making reliability an
explicit part of your organizational planning also has many benefits.
Thoughtworks’ <a
href="https://www.thoughtworks.com/content/dam/thoughtworks/documents/radar/2021/10/tr_technology_radar_vol_25_en.pdf">Technology
Radar (Volume 25)</a> from October of this year recommends adoption of this kind
of thinking -- that even <strong>internal teams should think of themselves as
product teams</strong>. They also recommend using concepts from the popular <a
href="https://teamtopologies.com/">Team Topologies</a> book to figure out how to
organize these internal teams. In reviewing examples of team structures from the
book, many organizations have adopted Simon Wardley’s <a
href="https://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html">Pioneer-Settler-Town
Planner</a> (or “PST”) framework, too.
</p>
<p>
Let’s take a look at how one might apply these two ideas (reliability as a
product feature, having a specific team profile) to improve the effectiveness of
an SRE team.
</p>
<ol>
<li>First, <strong>there’s no one-size-fits-all approach to improving
reliability</strong>; different stages of a project will benefit from different
kinds of SRE involvement. In this post, I’ll divide products/services into three
levels of maturity: <strong>beginning, growing, and established</strong>.
<li>Then, I’ll describe <strong>what kinds of SRE work could be most effective
at that maturity level</strong>, using the PST framework.
</li>
</ol>
<p>
Here’s a graphic that explains the <a
href="https://leadingedgeforum.com/insights/a-lesson-from-the-past-on-pioneering-organizational-structures/">PST
framework</a>’s three kinds of roles/activities in more detail.
</p>
<img width="492" src="https://lh4.googleusercontent.com/425sdbwBKobjYHRoNHZFVNag1SaP2kxrHT7vP82yphtuNUnJbge9PTz85RZjLMeL-Ut-xjbIvD24K1vdzIQm4SuZND44ZtOJSNSwmXPCQp_GQMsuZvfZpfvWZFKGq1GQFa3Xt8ri" style="margin-left: 0px; margin-top: 0px;" />
<p>
Team Profiles, from blog post <a
href="https://blog.gardeviance.org/2012/06/pioneers-settlers-and-town-planners.html">Pioneers,
Settlers and Town Planners</a> by Simon Wardley
</p>
<h2>Beginning phase (with Pioneer SREs)</h2>
<p>
In new projects, there’s <strong>often uncertainty and unanswered
questions</strong>. Small changes in direction could have large future benefits,
but experimental work may be completely discarded, too. SREs can drive
reliability at this stage by helping teams build prototypes, fail faster, and
make agile decisions, <strong>all with reliability as a top of mind
concern</strong>.
</p>
<p>
Have you ever had a project get close to production/release <strong>without
thinking about reliability or operational burdens</strong>? “Pioneer SREs” can
help. They should be part of the team that’s working to deliver a new product
development, evaluate vendors, build out proofs of concept, or make major
architectural changes. At this stage of a project, any work to “cover”
reliability gaps should be identified or entire directions could be changed due
to reliability concerns raised by the team.
</p>
<p>
<strong>Embedding</strong> in a team building the new product or feature is a
great way for SREs to drive reliability early on in these kinds of projects.
When teams only consult briefly on reliability or operational concerns, often
the final output doesn’t adequately reflect customer or engineering expectations
of reliability of the product or operability of the internals.
</p>
<p>
The <strong>success of Pioneer SREs</strong> can be measured by looking at how
quickly new products or features show up on the roadmap, how quickly vendor
implementations happen, or how quickly a project moves from, “exploration,” to,
“concrete proposal.”
</p>
<p>
The<strong> largest risk</strong> in this phase is having your SRE team end up
owners of the system’s reliability, since they helped design it. Hiding the
overall reliability of your system from the other developers, behind an SRE
team, will typically turn into a situation where the <strong>SRE team ends up
being treated as an operational team</strong> for any product/service problems.
Well-scoped embedding engagements can help avoid this problem by emphasizing
that embedded SREs are a <strong>training resource </strong>for the rest of the
team to learn, <strong>not coverage</strong> for the team once the embedding is
over.
</p>
<h2>Growing phase (with Settler SREs)</h2>
<p>
In this phase, projects are often working to build <strong>production-quality
infrastructure</strong>, <strong>launch</strong> to customers, or
<strong>scale</strong> to the required audience. SREs can help actually build
mature and scalable components from the initial prototypes. They could also
level up the engineering organization on how to prepare for any <strong>new
operational burdens</strong> by emphasizing best practices like automating away
toil or choosing good SLOs.
</p>
<p>
<strong>Continuing to embed</strong> with teams is a great way for SREs to have
a hand in the reliability of a nearly-launched product or feature, especially if
SREs influence the team to build for observability, scalability, and security
into the product. <strong>Consulting </strong>with teams on production
readiness, especially for brand new teams or brand new services, is another way
that SREs can ensure that everything reaching production will meet the original
reliability requirements of the product, as well as operational best practices
(e.g. automation instead of manual database migrations).
</p>
<p>
At this phase, SRE building and maintaining an idea of <strong>Production
Readiness</strong> is especially important as a product or organization scales.
This ensures a consistent approach to reliability across products or services,
as well as creates a minimum bar for reliability that must be satisfied. SREs at
this stage may even build automation into a pipeline to guarantee minimum scale
or ensure resilience on specific failures.
</p>
<p>
The <strong>success of Settler SREs</strong> can be measured by looking at how
many new services and features are safely being launched into production, as
well as examining things like ease of observability (e.g. effective logging,
metrics, or monitoring). Success in this phase is also about
<strong>establishing patterns</strong> that make projects successful (e.g.
proposal templates). <strong>Project retrospectives</strong> are a great way to
find those patterns as well as improve SRE engagement with the project.
</p>
<h2>Established phase (with Town Planner SREs)</h2>
<p>
In this most mature phase, products or services are usually already generally
available, and systemic issues like overall architecture or developer tooling
are the most likely to impact reliability.
</p>
<p>
SREs can influence reliability here by identifying and working to resolve
<strong>systemic reliability issues</strong> (e.g. repeated incidents, poor SLO
choices, lack of on-call process, etc). <strong>Driving</strong>
<strong>continuous improvement </strong>is a very common way that SREs influence
reliability at this phase.
</p>
<p>
In addition, SREs can often identify ways to <strong>reduce operational burdens
or eliminate large scale toil</strong> during this phase, whether through
technical automation or architecture changes, or through helping teams build
process, knowledge, skills, tools and techniques they need for large scale
projects to be repeatedly successful and reliable.
</p>
<p>
This can be a phase where <strong>some SREs will feel there’s a stigma
associated with doing less technical work</strong>, but the impact of this work
cannot be overstated -- it’s where SRE can act as a <strong>true
multiplier</strong> as more and more teams and products/services are launched.
<strong>Examples include</strong> running an incident management program, SLA
program, On-call Program, Disaster Recovery/Business Continuity planning, or
even a Chaos Engineering program. A strategy to address this concern is to pair
SREs with a <a
href="https://blog.tryexponent.com/what-is-the-role-of-a-technical-program-manager/">technical
program management function</a> (TPM) so that SREs can focus most on the
technical aspects of improvement while TPMs can help with the organizational
changes needed to improve a process or execute a program.
</p>
<p>
Measuring the <strong>success of Town Planner SREs</strong> can be especially
tricky. You might look for simple metric improvements like fewer incidents,
reduced incident duration, reduced pages, improved SLO targets, or number of DR
tests -- but isolating the SRE impact to these kinds of metrics can be
difficult. <strong>Qualitative feedback from an SRE team’s internal
customers</strong> is also frequently used to measure success at this stage. The
most impactful SREs at this stage tend to <strong>cause paradigm shifts for the
other development teams</strong>, and often even for their own SRE teammates.
</p>
<h2>Wrapping up</h2>
<p>
<em>[PST is] how you take a highly effective company and push it [...]
towards a continuously adaptive system. </em><a
href="https://twitter.com/swardley/status/1258690134725349376">May 8th, 2020</a>
<a href="https://twitter.com/swardley/">@swardley</a>
</p>
<p>
I hope that the grouping above is useful to readers for structuring work to
drive reliability at various levels of product maturity.
Reliability-as-a-product-feature isn’t a magic bullet to solve for an
organization that doesn’t understand where it fits in the market or what kind of
value it delivers, nor will it make a large difference with an unhealthy product
management practice that might not know how to develop and drive delivery of a
product and its features over time.
</p>
<p>
As mentioned earlier, there usually isn’t a, “one-size fits all,” approach to
driving reliability. You may still need to <strong>establish some best practices
for your organization</strong> such as “Limit toil to 50% of our work” or “Every
product feature that goes live must have a reliability review.” Combined with
these kinds of rules of thumb, the proposed divisions and strategies above
should help focus your team(s) to make the biggest improvement to reliability
for your products and services.
</p>
<p>
In researching this post, it was helpful to review <a
href="https://github.com/upgundecha/howtheysre">how organizations “do SRE”</a>
at various organizations and companies. <a
href="https://www.cnpatterns.org/organization-culture/sre-team">Continuous
improvement</a> was a clear shared trait among them. It’s also worth reviewing
the huge amount of content out there about how SRE can effectively collaborate
with other teams (e.g. <a
href="https://sre.google/sre-book/operational-overload/">embedding SREs</a>); a
poor relationship or failed collaboration with another team can jeopardize all
of your efforts.
</p>
<p>
I invite and encourage you to write about and share your own experiences, both
good and bad, focusing on reliability as a first class product feature at your
organization. Special thanks to my own SRE team for the many discussions and
ideation sessions on how we can best work to drive reliability. And special
thanks to Jennifer Davis, Michael Lumsden, David Nolan, Jordan Rinke, and Kerim
Satirli for feedback and editing on this post.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-21352210232594859602021-12-01T00:00:00.037-05:002021-12-04T23:44:17.160-05:00Day 1 - The Myths and the Magic in My Search for Acquiring Software Engineering Skills<p>
By: Annie Hedgpeth (<a href="https://twitter.com/anniehedgie">@anniehedgie</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
A happy SysAdvent to you, my dear elves. Whether you are an individual
contributor (IC), manager, director, or something in between, my holiday wish is
that my story spreads some holiday magic to your teams and roadmap.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“Then I traveled through the seven levels of the Candy Cane forest, past the
sea of twirly-swirly gumdrops, and then I walked through the Lincoln Tunnel.” <a href="https://g.co/kgs/iaSao7">Buddy the Elf</a></i></span>
</p>
<p>
I took an uncommon route into technology. With absolutely no experience of any
kind in any sort of technological pursuit (save for video editing in college), I
<a href="https://youtu.be/bNxc6Y8ZHsI">started my career</a> in IT by learning
configuration management and infrastructure as code first. Why? Because the
opportunity <a href="http://www.anniehedgie.com/leaning-in">presented
itself</a>, and I had a great in-house <a href="https://hedge-ops.com/">tutor</a>. My husband, <a href="https://twitter.com/michaelhedgpeth">Michael</a>, is the one who convinced
me to pursue a career in technology and was the one who spent many late evenings
teaching me how to “computer”. It was a bit of a trek through “<a href="https://g.co/kgs/iaSao7">the seven levels of the Candy Cane forest,
through the sea of swirly twirly gumdrops</a>” but with more tears and
heartache.
</p>
<p>
I spent the first couple of years of my career just trying to learn enough of
the different frameworks, like Chef, Terraform, PowerShell, Groovy, etc., to
build stuff and configure it properly. Learning about <i>how</i> they should
be built and configured came next with a focus on solution architecture and a
bit on systems administration. Looking forward, after five years of work focused
on configuration management, infrastructure as code, and CI/CD pipelines, I’m
now to the point where I want to grow in software engineering, and this is where
our story of myths and magic begins today.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“Some call it ‘the show’ or ‘the big dance’; it’s the profession that every
elf aspires to…” – <a href="https://g.co/kgs/iaSao7">Papa Elf</a></i></span>
</p>
<p>
Grab some hot cocoa and curl up with a blanket while I share with you what I see
as the common myths believed about acquiring software engineering skills and
what I believe to be the actual magic of making that a reality in <i>my</i>
life. We will start with the myths, but please remember, dear elves, that these
are myths and magic as they pertain to me personally. For you or others, they
may not be, and that’s okay. My hope is that sharing my own experiences will
give you empathy for others on their unique journeys and/or compassion for
yourself as you learn and grow in your own way.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761d;"><i>“The best way to spread Christmas cheer is singing loud for all to hear.” –
<a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<h3>Myth #1 - Just read a book</h3>
<p>
I am a huge fan of books, and I consume a pretty good amount of books per year.
I think that learning through books is important in a way that is difficult to
replicate through other modalities. I have gone through <a href="https://headfirstgo.com/">Head First Go</a>, a book that is geared toward
people with little to no programming experience, and I found it to be incredibly
helpful. I did every exercise in the book, learned a lot, and highly recommend
it. That said, the exercises alone were not enough to prepare me immediately for
real life coding. Doing the exercises was good and necessary, but it was only
one piece of the puzzle required to complete the picture of what it takes for me
to be able to contribute in a meaningful way to my company’s Go codebase.
</p>
<p>
Perhaps my lack of any formal training, whether university or code camp,
prevented me from grasping the higher level understanding that would have
enabled me to contribute confidently sooner, but whatever it was, I was still
lacking after simply going through a book. I liken this to studying through a
first year French textbook as your only means of learning the language. You will
gather the concepts and vocabulary, but you will likely not be able to speak the
language without other mediums of instruction.
</p><h3>Myth #2 - Just do some exercises</h3>
<p>
I am a huge fan of <a href="https://exercism.org/">Exercism</a>. I think they
are helping a lot of people learn coding languages, and they do it in such a way
that brings out a spirit of giving back in its users. There is much to love
about that. I have completed many Exercism exercises, and I do find them
helpful, but in the same way that the book was only helpful to a certain point,
I haven’t found that it helps me with the big picture. I have found it to be
like learning French with only Duolingo. Sure it’s a great app, and I use it all
the time. But again, one cannot use it in isolation in order to be a proficient
French speaker.
</p><h3>Myth #3 - Solution Architecture skills are built upon coding skills</h3>
<p>
Working at a cloud consulting firm for 4 years, I got a great education in
architecting solutions for clients. I really enjoyed learning about the process,
and it all made a lot of sense to me. After seeing several of them, I started to
see the patterns and practices that are used to create a good solution. And
then, as the person often implementing someone else’s solution, I learned
quickly what made a bad solution, as well.
</p>
<p>
To be good at architecting solutions, one must think through all of the choices
required to form that solution while you’re still in the planning phase, before
any of the solution is actually implemented. You can’t really “mess around and
find out”, which is why solutions architecture is such a valuable skill; if you
plan well, you do the necessary work, no less and no more.
</p>
<p>
However, not all solutions are equal. Architecting a solution to a cloud
migration feels like more of a tactile experience to me; I can see where things
are moving. I think it helps that you can actually hold a CPU in your hands, and
an architectural diagram has a very structural feel to it, similar to a
blueprint of physical structures. For me, at least, this makes it more
accessible and the concepts easier to grasp.
</p>
<p>
However, software architecture is more conceptual. You have to first understand
all of the interfaces, levels of abstraction, and concepts before you can
understand how to architect it. And if you don’t understand how to architect it,
then you’re back at the Duolingo level of coding.
</p><h3>Myth #4 - The building blocks to starting a tech career are cloud, code
editor, source control, and project management</h3>
<p>
Some people have suggested that huge barriers to moving into a software
engineering role can be mastering the tooling - code editors and IDEs, source
control, the cloud providers, and project management. This is possibly true of a
certain type of person moving from a systems administration type of job into
software development, but this was not true for me. But because Michael worried
that these would be barriers for me, I learned them first. I created a website
with GitHub Pages and used that as a way to learn source control and Visual
Studio Code. I took some online classes on Agile Framework. I got a free Azure
account and started playing with Terraform. These things were most definitely
and obviously important, but again, they’re but one piece of the puzzle.
</p><h3>Myth #5 - It just takes a creativity / growth / problem-solving mindset</h3>
<p>
One of my husband’s main reasons for convincing me to pursue a career in tech
was that I’m a pretty creative person who loves problem solving and that the
desire to dig into a problem until it’s solved is one of the most necessary
components for a career in tech. I completely agree that this is an important
character trait in order to be successful as a technologist. I’m also decently
creative and have a growth mindset, which are equally valuable for such a
pursuit. You can probably see, by now, where I’m going with this, though.
</p>
<p>
These traits alone are great and will serve you well in just about any endeavor.
Having these traits does not make a person automatically good at tech. It’s like
when you’re house-hunting and find a house that needs a ton of cosmetic
remodeling, but you say, “It has good bones,” meaning, you can easily make it
the way you want it to look without having to overhaul anything structurally.
Still, though, the cosmetic renovations are not insignificant. They are a lot of
work.
</p>
<p>
The same is true with me. Yes, I have “good bones” - good traits that are great
assets for a career in tech, like being creative, having a growth mindset, and
being a good problem solver. But to let folks start a career in tech with the
false hope that these traits will give them an unrealistic advantage is not
helpful. Yes, those traits help me a lot, but, goodness me, it is still a
<i>lot </i>of work learning and growing in tech, even with those traits.
</p><h2>Real barriers:</h2>
<h3>Truth #1 - People get pigeon-holed into certain work</h3>
<p>
I worked so hard to get the skills necessary to be valuable to my respective
organizations, and while, yes, I found myself a bit pigeon-holed into “devops-y”
roles, the other truth is that I didn’t <i>feel </i>as experienced as my peers
because I didn’t have the formal training many of them had, so I felt behind in
my learning. I wanted to catch up to the folks my age in this business, and that
was nearly impossible, so the next best thing was to get really good at one
thing, and just like that, I found myself pigeon-holed. Honestly, this was
probably easier and less risky for the companies I worked in, as things were more
predictable and steady when I was focused on a smaller scope of expertise.
And you might be thinking, ‘So what’s the problem with striving to become a
subject matter expert at something? There’s immense value in that.’ And you’d be
right. This is perfectly fine for some people. However, I personally like to
have a range in my work. I find freedom in flexibility as my hope is that it
gives me more options in my future, ultimately decreasing the risk to my career.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761d;"><i>“There’s room for everyone on the Nice list.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<p>
To overcome the barrier of being pigeon-holed into a particular line of work, a
bit of magic is required - the magic that happens when goals are set and people
help other people. Setting goals and tracking them is extremely important to me,
but part of tracking those goals is being accountable to them by someone,
whether it be a manager, a mentor, or a team lead. When my manager or team leads
know my goals and I have milestones set for reaching those goals, then I am so
much more likely to achieve them, and I’m giving them an opportunity to play an
important role, which grows their leadership skills - a win-win.
</p><h3>Truth #2 - It’s an engineering problem for senior engineers to break down
work to share work with juniors</h3>
<p>
My favorite type of senior engineer is one who can not only design a good
solution but one who knows how to allow everyone on the team to contribute to
the solution with their own strengths. Being able to communicate their vision
for a solution to others and lead others effectively to carry out their vision
is arguably the most valuable skill of a senior engineer. The whole team thrives
when seniors lead in this way! Being able to do this is most definitely
classified as a soft skill - one that is not easily measured by a test, and I
have witnessed many ICs discount soft skills, thinking that only managers need
worry themselves with growing such skills. I would argue, though, that this
particular soft skill is also an <i>engineering skill, </i>one necessary to be
an effective IC engineer.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“I mean, parents couldn’t do that all in one night.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<p>
Conversely, how many times have you seen senior engineers go silent for two
months and then emerge with an amazing <i>something</i> that solves a problem,
but it resembles a coded version of a complicated Home Alone trap (like a Rube
Goldberg machine)? This is actually not what we want from our senior engineers,
dear elves. We want senior engineers who are able to thoughtfully and skillfully
level up those at the levels below them.
</p>
<p>
There is a common desire among engineers to remain an IC for as long as possible
with no desire for the managerial track, and that is totally fine! However,
being an IC does not mean that you work within a vacuum. No matter your level,
every IC can have a positive influence on someone else on the team and can bring
leadership and mentorship into their everyday roles. Seniors, however, have the
<i>responsibility </i>to give others the opportunity to contribute to their
vision. By considering the other people on their team and their strengths and
goals, solutions can be designed so that everyone grows. Is it hard? Of course!
But when it happens, it’s like magic.
</p>
<p>
I started my career in tech a few days before I turned 37, so with the amount of
catch-up I have from being late to the game, I just need help sometimes. An hour
of help from a human being, for me at least, is the absolute most supercharged
way to learn. I am so grateful to have had people all throughout my time in tech
who understand that investing in people by pairing on a problem is really an
investment in the health and wellness of the team, product, and company. I would
argue also that it makes them a better person, teacher, and leader.
</p>
<p>
I wholeheartedly believe that fostering this environment should be the number 1
priority of every engineering manager because it will solve a lot of other
problems down the line naturally. We need not be islands unto ourselves but
rather a rising tide that lifts all ships.
</p><h3>Truth #3 - A team needs dedicated time to grow</h3>
<p>
Getting time to grow at a consultancy was tough. It was usually relegated to
times when I was on the bench, but that time wasn’t consistent. There were times
when I would go an entire year or more with no bench time, so I had to use my
personal time. I will take this time to remind you, dear elves, that making your
employees use their personal time for growth and development is not an inclusive
practice. It makes it harder for folks with families, disabilities, or just
plain healthy boundaries to have the time and space to learn.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“I planned out our whole day. First we make snow angels for two hours, and
then we’ll go ice skating, and then we’ll eat a whole roll of Tollhouse Cookie
Dough as fast as we can, and then to finish, we’ll snuggle.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<p>
I’m so incredibly grateful for my current manager and team who have deemed half
a day on Fridays to be dedicated learning times. When we all have learning time
at the same time, then no one feels guilty for not working on sprint work
because, as a team, we’ve decided that learning is important enough to spend
time on it. I’ve gotten a lot out of this; I finished the aforementioned <a href="https://headfirstgo.com/">Head First Go</a> book, and I’ve worked on <a href="https://exercism.org/">Exercism</a> exercises. I’ve also used it to learn
how to do things that were blocking me in my sprint work. But to make the most
out of this time, my next step is to use Friday learning times to actually apply
the things I’ve learned to real-world work. This, however, may exceed the bounds
of half a day on Fridays, and it may mean that I take a bug fix ticket and spend
a whole week on it. The magic required is that the team and manager buy into
this investment of time and energy. I personally know that I would get that
buy-in on my current team, but I know I’m a lucky one. They know that the payoff
of me growing my skills is worth the investment of time.
</p><h3>Truth #4 - Insecurity looms with the lack of formal education through a
coding school OR engineering degree, which makes it feel more difficult to
acquire certain skills</h3>
<p>
This might be an unpopular opinion, and I just stated it as a truth, but I do
believe that this is true for me. There are certain coding exercises that I have
tried that make me feel like I will never truly understand certain concepts. I
do believe that I will know enough to be valuable, but knowing when that matters
and when it doesn’t is a mind trip. It’s difficult to manage my own expectations
of my own growth, learning, and knowledge. The constant nagging thought in the
back of my head is that if I had had any sort of formal coding training,
whether in university or code camp, something would have clicked in my
brain so that I understood certain concepts more quickly, and I honestly don’t
know if this is a valid concern for me or not.
</p>
<p>
I do know that magic happens when people step in. When I have brilliant
developer people in my life telling me what matters and what doesn’t matter and
helping me to grasp fundamental concepts, my growth and confidence are
accelerated greatly. I go from focusing on my blockers to focusing on my
trajectory.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“Oh, it’s not a costume. I’m an elf. Well, technically, I’m a human, but I
was raised by elves.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<h3>Truth #5 - Career planning related to skills is a bit more complicated</h3>
<p>
When you’re a career-changer and are late to the tech game, planning for the
future can be a bit complicated. My current difficulty is that I have the soft
skills required to be a really great manager, but managing a technical team
requires a great depth of knowledge that only comes with experience. So what do
I do with all of this leadership potential? For now, I’m doing nothing. I’m
hunkering down and growing my depth and breadth, and that’s so frustrating!
</p>
<p>
But again, therein lies the potential for magic. If a manager and a team are
intentional about growing people to their own strengths and goals, then we can
carve a path that matches my goals and strengths with the business’s needs, but
it requires a bit of creativity and flexibility. It takes mature leadership to
know how to turn each team member’s potential into something that benefits
everyone.
</p><div style="margin-left: 40px; text-align: left;"><span style="color: #38761D;"><i>“I just like smiling. Smiling’s my favorite.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span></div>
<h3>TL;DR</h3>
<p>
Did you note a common thread? The myths I outlined are discouraging blockers
that kept me from thinking that I could achieve my goals, and I have a hunch
that I’m not alone in these feelings. But the magic lies in people caring about
and investing in each other’s growth. That’s it! This is not just the kind,
empathetic, and right thing to do, but it will also improve the business’s bottom
line: when people are committed to growth and feel encouraged in it, they create
quality products and they stay in the same place longer because they feel
supported. As you go about your holiday and new
year, I encourage you to bring a little bit of magic to your own teams by either
being the support someone needs or by allowing someone to be a support for you.
</p>
<p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“Bye Buddy, hope you find your dad!” – <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Mr.
Narwhal</a></i></span>
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-82986001179632510342019-12-25T00:00:00.000-05:002019-12-25T00:39:18.811-05:00Day 25 - The “Just” Basics<p>
By: C.A. Corriere (<a href="https://twitter.com/cacorriere">@cacorriere</a>)<br/>
Edited by: Michelle Carroll (<a href="https://twitter.com/miiiiiche">@miiiiiche</a>)
</p>
<p>
This year we celebrated <a href="https://devopsdays.org/events/2019-ghent/welcome/">the ten year anniversary of devopsdays</a> in Ghent, Belgium, where the conference originated in 2009. I was lucky enough to have my talk <a href="https://youtu.be/ctNzkslnrVE?t=4147">“Cookies, Mapping, & Complexity”</a> selected for the event. The feedback I received was mixed, but it was aligned with <a href="https://youtu.be/TwRxN7TohO8?t=27089">a broader theme</a> that emerged from the conference: given the impact technology has on our society in 2019, we can’t afford to ignore the complexity of our sociotechnical systems. The problem we’re now faced with is, how do we raise awareness around this complexity and make it more accessible to beginners?
</p>
<p>
If the answer to this question were obvious I could list a few examples here. If it were just complicated I could draw you a map or two. Sociotechnical problems, like this one, happen to be centered in a complex domain where models are often helpful. This question is one of multiple safe-to-fail experiments with negative hypotheses I am currently running, intended to serve as probes into a model of our communities I built. There’s a lot of specific jargon in this paragraph tied to complexity science, and the <a href="https://en.wikipedia.org/wiki/Cynefin_framework">cynefin framework</a> specifically.
</p>
<p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi80vbjcQwzN_kWYlYBcoyXDzoZPkUH5FkSbBjI4l7DXbo4rtIXnf8kIAUk6MNLowy7ZZANc7M-DSV8edBY-I1PAoEJ0hCz_8Dn2p2t-sbEcGMofEc6ihJbMLdx1DCuTY4YqLzIKiw1-MQ/s1600/Screen+Shot+2019-12-24+at+9.33.43+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi80vbjcQwzN_kWYlYBcoyXDzoZPkUH5FkSbBjI4l7DXbo4rtIXnf8kIAUk6MNLowy7ZZANc7M-DSV8edBY-I1PAoEJ0hCz_8Dn2p2t-sbEcGMofEc6ihJbMLdx1DCuTY4YqLzIKiw1-MQ/s1600/Screen+Shot+2019-12-24+at+9.33.43+PM.png" width="95%" /></a></div>
</p>
<p>
I facilitated ninety minutes of open space workshops around mapping and complexity science in Ghent, but a workshop on complexity science alone can easily fill a week. Shorter workshops manifested at quite a few events I attended this year. While I prefer sitting through a day of lectures, 30-minute segments with more specific content seem to be a better fit for most people.
</p>
<p>
I’ve also noticed that using common examples, like baking cookies or making a cup of tea, helps folks connect the theory to an area of practice where they already have some experience. Even if you’ve never made tea or baked cookies, the barrier to entry is low enough that someone could try them for the sake of learning about complexity science and mapping.
</p>
<p>
I wouldn’t keep offering the workshops if people didn’t both show up and tell me they were useful, but I must admit I’ve covered the basics on these topics enough times that I worry I sound a bit like a broken record. I have been pulling a lot of this into a book, which I hope will be available in early 2020. For now, I am going to hope some folks can connect the dots between the language I’m using here and the picture of the framework provided. I’d encourage you to study this some on your own too, and you’re always welcome to ask questions on twitter. If I can’t answer them I probably know some clever person that can. What can we do to help make this type of content more accessible? Are you even convinced you need to learn it yet?
</p>
<p>
During the closing panel of Map Camp London, <a href="https://twitter.com/CatSwetel">Cat Swetel</a> referred to both cynefin and <a href="https://en.wikipedia.org/wiki/Wardley_map">wardley mapping</a> as “tools of <a href="https://en.m.wikipedia.org/wiki/Theory_of_justification">epistemic justice</a>”. I understand this to mean cynefin and wardley mapping are tools that can help us know how we know (or don’t know) something, and why our beliefs are (or aren’t) justified. Personally, I like being able to check my work and knowing when I’m wrong. It’s a humbling experience, but I do think it’s a pretty basic life lesson that’s easily justified.
</p>
<p>
What else counts as basic, introductory content in 2020? Is it installing an SDK and writing “Hello World!”? Do we start with a git repo and some yaml files? Maybe it’s a map of our application’s carbon footprint? Mapping and complexity science (among other tools) can help justify the answers to these questions, but I have no doubt those answers are context dependent. I would recommend learning to read a map before trying to draw one. <a href="https://medium.com/@chrisvmcd/mapping-maturity-create-context-specific-maturity-models-with-wardley-maps-informed-by-cynefin-37ffcd1d315">This post on maturity mapping</a> by Chris McDermott is based on cynefin and wardley mapping and serves as a solid example of the emergent justification I’m talking about. I’m looking forward to learning more about philosophy, epistemology, and tools that can help us change our minds and come to new understandings as the world shifts around us in the new year, but I really need to do a better job of pacing myself.
</p>
<p>
If a month of travel and research abroad weren’t enough for this year, then it’s a good thing I helped pull together <a href="https://www.youtube.com/watch?v=bsbWfBdwpIk&list=PL5pdUnQbCX6tZUv75Hl82h_U2TPB0iTaw">three conferences at the Georgia Aquarium</a> in Atlanta too. I have organized <a href="https://devopsdays.org/events/2020-atlanta/welcome/">devopsdays Atlanta</a> for a few years. When we saw an opportunity to host the first <a href="https://www.map-camp.com/">Map Camp</a> outside of the U.K. and the first <a href="https://atlanta.serverlessdays.io/">ServerlessDays Atlanta</a> along with our conference we decided it was worth the effort. Watching the ripples from that event since April has warmed my heart, but 2019 has also brought my attention back to one of my <a href="https://en.wikipedia.org/wiki/First_principle">first principles</a>:
</p>
<p>
<strong>I cannot take care of anything if I am not taking care of myself.</strong>
</p>
<p>
This year has been very global for me. My goal is to make 2020 much more local and regional by comparison, and I’m not alone. More and more presenters are refusing to fly for tech conferences given the growing concerns around global warming, which ended up being the main theme for Map Camp London this year. I think it’s important for our international communities to gather on a regular basis, but the cost of doing so should have little to no impact on our local communities, our planet, or our individual health. It must be done sustainably.
</p>
<p>
I doubt I’m leaving the country next year, but I’m thankful to be part of the vibrant tech community we have in Atlanta. I’ll be speaking at <a href="https://devnexus.com/">devnexus</a> this February, we’re organizing a minimally viable <a href="https://devopsdays.org/events/2020-atlanta/welcome/">devopsdays Atlanta</a> this April (the same week as <a href="https://www.refactr.tech/">REFACTR.TECH</a>), and it seems like there are a few meetups to choose from here every week.
</p>
<p>
If you aren’t participating in your local tech community then maybe 2020 is the year to try attending more events. If there aren’t any events, maybe you’d like to try organizing one. Maybe 2020 is the right time to visit some other cities (like Atlanta : ) or even a different country. Maybe you’ve been doing plenty of that, and like me you’re ready to tap the brakes and invest a little more energy in your own backyard. Please join me in using the days we have left this year to rest, reflect, and justify how we can co-create intentional futures during our next decade together, and for the ones that will follow afterwards.
</p>cwebberhttp://www.blogger.com/profile/16516063226544348092noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-58883086197052909892019-12-24T00:00:00.000-05:002019-12-24T00:00:04.013-05:00Day 24 - Expanding on Infrastructure as Code<p>By: Wyatt Walter (<a href="https://twitter.com/wyattwalter">@wyattwalter</a>)<br/>
Edited by: Joshua Smith (<a href="https://twitter.com/jcsmith">@jcsmith</a>)</p>
<h2>Introduction</h2>
<p>
As operators thinking about Infrastructure as Code, we often think of infrastructure as just the stuff that runs inside our data centers or cloud providers. I recently worked on a project that expanded my view of what I consider “infrastructure” and what things were within reach of being managed similarly to the way I manage cloud resources. In this post I want to inspire you to expand your view of what infrastructure might be for your organization, and to give an example using Terraform that makes that idea more concrete.
</p>
<p>
First, the example I’ll use is a workflow for managing GitHub repositories at an organization. There are tons of other services Terraform can manage (“providers” in Terraform terms), but this example is a service that is free to recreate if you want to experiment. Then, we’ll dig into why you’d even want to go through the trouble of setting something like this up. Lastly, I’ll leave you with some inspiration on other services or ideas on where this can be applied.
</p>
<p>
The example and source code are very contrived, but available here (link: <a href="https://github.com/sysadventco-2019/sysadventco-terraform">https://github.com/sysadventco-2019/sysadventco-terraform</a>).
</p>
<h2>An example using GitHub</h2>
<p>
At SysAdventCo, developers use GitHub as a tool for source code management. The GitHub organization for the company is managed by a central IT team. While the IT team did grant a few individuals throughout the company permission to create repositories or teams, some actions were only accessible to administrators. So, even though teams could modify or create some settings, the IT team was often a bottleneck, because many individuals needed to see or modify settings that they could not reach on their own.
</p>
<p>
So, the IT team imported the configuration for their organization into Terraform and allowed anyone in the organization to view it and submit pull requests to make changes. Their role has shifted from taking in tickets to modify settings (which often had multiple rounds of back-and-forth to ensure correctness) and manually making changes to simply being able to approve pull requests. In the pull requests, they can see exactly what is being asked for and receive validation from CI systems about the exact impact that change would have.
</p>
<p>
A stripped down version of the configuration looks something like this:
</p>
<pre class="prettyprint"># We define a couple of variables we can pass via environment variables.
variable "github_token" {
type = string
}
variable "github_organization" {
type = string
}
# Include the GitHub provider, set some basics
# for the example, set these with environment variables:
# TF_github_token=asdf TF_github_organization=sysadventco terraform plan
provider "github" {
token = var.github_token
organization = var.github_organization
}
# This one is a bit meta: the definition for this repository
resource "github_repository" "sysadventco-terraform" {
name = "sysadventco-terraform"
description = "example Terraform source for managing the example-service repository"
homepage_url = "https://sysadvent.blogspot.com"
gitignore_template = "Terraform"
}
SysAdventCo operates a number of services. The one we'll focus on is example-service. It's a Rails application, and has its own entry in the configuration:
resource "github_repository" "example-service" {
name = "example-service"
description = "the source code for example-service"
homepage_url = "https://sysadvent.blogspot.com/"
gitignore_template = "Rails"
}
</pre>
<p>
The team that builds and operates example-service wants to integrate a new tool into their testing processes that requires an additional webhook. In some organizations, a member of the team may have access to edit that directly. In others, maybe they have to find a GitHub administrator to ask them for help. In either case, only those who have access to change the settings can even see how the webhooks are configured. Luckily, things work a bit differently at SysAdventCo.
</p>
<p>
The developer working on example-service already has access to see what webhooks are configured for this repository. She is ready to start testing the new service, so she submits a small PR (link: <a href="https://github.com/sysadventco-2019/sysadventco-terraform/pull/2">https://github.com/sysadventco-2019/sysadventco-terraform/pull/2</a>):
</p>
<pre class="prettyprint">+
+resource "github_repository_webhook" "example-service-new-hook" {
+ repository = github_repository.example-service.name
+
+ configuration {
+ url = "https://web.hook.com/"
+ content_type = "form"
+ insecure_ssl = false
+ }
+
+ active = false
+
+ events = ["issues"]
+}
</pre>
<p>
The system then automatically creates a comment with exactly what actions Terraform would take if this were approved, so that a member of the IT team can review it and collaborate with the developer requesting the change.
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2lvfvkk3_bN5_3wHaDfJdqNKdV11lHGR34EDcoRYKy924uii0-OqDcnPiClSEPcmBzTpxqCBDu-05eeFNpqx0nkMdJEKlyb_xSTzI3ZMiXIc_FQ8A2sFvvdNUymuENDPyHQVxAAA3S5I/s1600/Screen+Shot+2019-12-15+at+7.49.32+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2lvfvkk3_bN5_3wHaDfJdqNKdV11lHGR34EDcoRYKy924uii0-OqDcnPiClSEPcmBzTpxqCBDu-05eeFNpqx0nkMdJEKlyb_xSTzI3ZMiXIc_FQ8A2sFvvdNUymuENDPyHQVxAAA3S5I/s1600/Screen+Shot+2019-12-15+at+7.49.32+PM.png" width="95%" /></a></div>
</p>
<p>
No one is stuck filling out a form or ticket, trying to explain in words what is needed so that those words can be interpreted into manual actions. They have simply updated the configuration themselves; it is automatically validated, and a comment is added with the exact details of what in the repository would change as a result of the request. Once the pull request is approved and merged, it is automatically applied.
</p>
<h2>This seems like a lot of work, why bother?</h2>
<p>
What an astute observation, dear reader! Yes, there is a good deal of setup involved once you get past this simple example. And yes, managing more automatically can often be more work. In addition, if your organization already exists but doesn’t use something like this method already, you probably have a good deal of configuration to import into the tool of your choice. I’d argue that there are a number of reasons that you would want to consider using a tool like this to manage tools that aren’t strictly servers, firewall rules, etc.
</p>
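<p>
That import step is more mechanical than it sounds: for each existing resource you write the matching resource block, then pull the real object into Terraform state. Here is a rough sketch, using the example-service repository defined earlier (the GitHub provider accepts the repository name as the import ID):
</p>
<pre class="prettyprint"># Bring the existing repository under management, using its name as the import ID.
terraform import github_repository.example-service example-service

# See what, if anything, Terraform would change to reconcile reality with the config.
terraform plan
</pre>
<p>
With that out of the way, here are the reasons the effort pays off.
</p>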
<p>
First, and this is the first thing I reached for: we can track changes the same way we do for other things in the delivery pipeline while also ensuring consistency. For me on my project, importing the configuration of a PagerDuty account into management by Terraform allowed me to see inconsistencies in the manually configured service. While the tool added value, a huge part of the value was the simple act of doing the import and having a tool that enforced consistency. I caught a number of things that could've misrouted alerts, if conditions were right, before they became issues.
</p>
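<p>
To make that PagerDuty example slightly more concrete, here is a minimal, illustrative sketch of the kind of configuration involved. This is not the configuration from my project, and the resource attributes reflect my reading of the PagerDuty provider documentation, so treat it as a starting point and check the provider docs before relying on it:
</p>
<pre class="prettyprint">variable "pagerduty_token" {
  type = string
}

provider "pagerduty" {
  token = var.pagerduty_token
}

# One user, an escalation policy that pages that user, and a service
# routed to that policy.
resource "pagerduty_user" "oncall_engineer" {
  name  = "Example Engineer"
  email = "engineer@example.com"
}

resource "pagerduty_escalation_policy" "example" {
  name      = "example-service escalation policy"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "user_reference"
      id   = pagerduty_user.oncall_engineer.id
    }
  }
}

resource "pagerduty_service" "example" {
  name              = "example-service"
  escalation_policy = pagerduty_escalation_policy.example.id
}
</pre>
<p>
Once something like this is in place, a misrouted escalation shows up as a diff in a pull request rather than as a surprise during an incident.
</p>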
<p>
The next and most compelling reasons to me are in freeing up administrative time and giving teams the freedom to effect changes directly without creating a free-for-all situation. You can restrict administrative access to a very small number of people (or just a bot) without creating a huge bottleneck. It also allows anyone without elevated privileges to confirm settings without having to ask someone else. I'd also argue this creates an excellent model for the basis of a change control process for organizations that require or have them as well.
</p>
<p>
A further advantage is that, since none of these tools exist in isolation, using a method like this can give you an opportunity to reference configuration dynamically. This allows your team to spin up full environments to test configuration end-to-end.
</p>
<h2>But wait, there’s more!</h2>
<p>
Within the Terraform world, there’s an entire world of providers out there just waiting for you to explore! Imagine using the same tools you use to manage AWS or GCP resources that often are linked to other important things your team uses:
</p>
<ul>
<li>Manage your on-call rotations, escalation paths, routing decisions, and more with the PagerDuty provider</li>
<li>Manage the application list and alerts in NewRelic</li>
<li>Add external HTTP monitoring using tools like Statuscake or Pingdom</li>
</ul>cwebberhttp://www.blogger.com/profile/16516063226544348092noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-69022098277122125542019-12-23T00:00:00.000-05:002019-12-23T00:00:05.226-05:00Day 23 - Becoming a Database Administrator<p>By: Jaryd Remillard (<a href="https://twitter.com/KarateDBA">@KarateDBA</a>)<br/>
Edited by: Benjamin Marsteau (<a href="https://twitter.com/bmarsteau">@bmarsteau</a>)</p>
<p>Database is a term that is thrown around in meetings amongst all industries. The term is almost always used with a sense of urgency and importance yet contains a vast mystery. It can be a topic that some may feel too confident, one with absolute no knowledge and one that refers to a single copy of a glorified excel spreadsheet sitting on their desktop. In my short time as a database administrator, I have found that it is typically the confident ones that venture into this mystery with the full understanding of the business value and risk that come as the database administrator. Like any area of science, technology, engineering, and math, acronyms are favored, so let it be known that the title of database administrator can be abbreviated as DBA.
</p><p>
Like any career path, one database administrator's path will not necessarily align with the direction you have to take. This is not to discredit the value of the journey and course someone purposely took, or perhaps accidentally stumbled into; there are specific points to remember, which in themselves could present opportunities in your own journey. Instead, be aware that, just as there is theoretically an infinite number of ways of solving a problem with code, there is an endless number of directions you can take to reach your destination of becoming a DBA. All in all, I hope my reflection on the journey I took to become a database administrator will set you up for success.
</p>
<h3>Start with the basics</h3>
<p>
When I was 12 years old, I befriended a stranger online through a collective group of people who played an online video game. We were idling in our TeamSpeak server when I asked them what they were up to, and they replied saying they were coding a website for our group. The concept immediately struck me with curiosity like a static shock; the idea of how to construct a website was so far-fetched I just had to learn so I could quench the burning desire. I naively asked if it was a drag-and-drop type of process. They laughed and began to teach me HTML, showing me how to view the source of a website. The concept blew my mind; words typed in a specific manner could be translated into a structure that is displayed on my screen. It made me feel like anything was possible. I kept building websites with HTML, leveling up to using CSS, JavaScript, learning Linux, and eventually PHP. Soon after, I was building login systems, registration systems, user profiles, all in a LAMP stack that required knowledge of basic SQL. Learning simple DMLs, DDLs, DCLs and TCLs, I wrote whatever worked. The experience and newfound knowledge I bashed together eventually turned into a charming but underwhelming social network that I named Express-It. Building the schemas in phpMyAdmin was accessible in the sense of, "I create a column, PHP writes to it, there we go." However, as the social network grew to a whopping 100 people, who were primarily friends and family there for moral support, my website slowed to a crawl. What I did not understand was the more extensive technical specifics of ints, unsigned bigints, varchars, indexing, and primary keys, and how various people querying similar things at the same time, while the SQL scanned the entire table, affected performance. I could not wrap my head around it, nor did I think there was anything more to it than the query itself, because it did the job locally. Frankly, it also didn't occur to me that my schemas and queries were a DBA's nightmare. I shut down Express-It since my curiosity shifted from LAMP stacks to learning cybersecurity and doing basic IT jobs for friends and family, and I was sick of the free hosting tier I was using.
</p>
<h3>School of Hard Knocks</h3>
<p>
As I shifted my focus from building to fracturing, SQL came up in the form of learning its flaws: the various types of SQL injection, brute force, and DoSing a database. My knowledge expanded to be more aware of possible vulnerabilities and the importance of a database, including losing data. This lesson was exemplified when the code from my Express-It website, along with hundreds of hours of various other projects I had stored on a flash drive as an interim measure between moving homes, was accidentally reformatted by a family member while they were transferring some photos. Losing all my work taught me how easy it can be for a large part of my life to disappear. I then realized the hard way that backups are a thing, and I became hyper-aware. I learned to keep at least two to three copies of whatever was important on separate data stores; I learned that flash drives and hard drives can die without warning or get overwritten accidentally, that you can never have enough backups, and that corruption is a thing. My motto became "backup backup backup, correctly." I often chuckle when reflecting on this time of my life because it reminds me of high school, where any time a big paper was due, someone always had the excuse that their file was gone, overwritten, or corrupted the night before, conveniently, perhaps honestly so. I could not help but blurt out of my smart mouth, "Should have backed up."
</p>
<h3>Venturing Further and Beyond!</h3>
<p>
I went on a hiatus from technology for a bit to focus on school and sports. Eventually, when my interest in technology came back, an internship opportunity landed in my lap, which to this day I still attribute to luck, as the news of the opportunity was shared with me by the CTO of the relevant company. Before starting, I was asked what I wanted to do during the internship, specifically, what direction I wanted my career to go. I thought it was software engineering; at the time, building and designing were shiny to my eyes. However, I was conflicted, as I still enjoyed living in the terminal; something about the rawness of text on a blank background, where specific commands can be used to effectively navigate the computer in ways a GUI cannot, still drew me in even when I was deep in an IDE, and I had moments of barbarianism when I would code in vim. I knew programming was not what I wanted to do for a full eight hours a day; I was conflicted, and I shared my concerns. They mentioned DevOps, and it was perfect: I would get the complete balance of being in the terminal and writing code. I then embarked on the start of my career. As an intern, a lot of my tasks were simple: data entry, setting up a local environment, breaking the local environment, finishing some tickets, attending standup, and the like. But one task stood out to me: the need for an internal tool to show the difference between a system at one point in time and at another, covering things such as permissions and the data in each file, essentially a beefy diff. Like most ideas, it was a task that seemed easy at first but was exponentially more complicated than initially anticipated. As I dug into the task, the first tool I chose to program in was Python, as it seemed easy to learn and it was all the rage. As I learned more about Python's data types, I naively figured it would be an excellent idea to cache every file on the system in a dictionary, which, unbeknownst to me beforehand, resulted in Python running out of memory. After consulting some of the engineers nearby on how to navigate this issue, it was recommended that I use a database. So naturally, I chose SQLite3. I moved to MySQL pretty soon after; SQLite3 was just not working out. I figured MySQL was perfect: it is a solid relational database, I would have the freedom to specify what kind of data to store, and it made storing md5 checksums straightforward. I was eventually able to get the program to work to some degree of success, but not without experiencing bottlenecks. Previously, the amount of data I had worked with was so small that there was little to no need for optimization. So when caching the majority of the file information on a system, that is when I started to see performance impacts on the database, particularly in how long the program took to execute and in the overall high usage of both primary and secondary memory. I figured the best way to tackle this problem would be to take it to a deeper level and learn about the internals of MySQL: the basics of how the client and server work together, and elementary query optimization. But with limited guidance, there was only so much I could dig up on my own. Eventually, time moved on, and I moved to a local company as a system administrator. My new job exposed me to different types of databases: MongoDB and SQL Server, along with MySQL again. I spent a lot of my time in my new position on the front end and on web servers like Apache, on Jekyll and GruntJS workflows, as well as on Active Directory, but I still got to see the back end as well.
Naturally, it fascinated me more; in between tasks I learned how it was accessed by services, how to view the permissions of users as an administrator, and how to query for what I wanted. Questions about the front end were easy to answer, but the back end had a lot of unanswered questions I could not find the answers to, on various topics such as internal functionality, maximum capabilities, dynamically managing users, and so on. Databases remained a mystery I wanted to solve and I was ready to go Sherlock. I read the documentation and tinkered with the databases, and then I would go home to set one up for myself, just to see how I could break it. Unfortunately, my time became more consumed with school and the front-end side of my job, although I knew I wanted to get back to databases in the future. Soon, an opportunity opened up to work as a student employee in the IT department at my university. This potential new position would save me an hour commute to school as well as an hour commute to work. Although it was not nearly as technical as what I was currently doing, and perhaps a step back, it was the right decision for me at the time, and I had a feeling this job could turn into something more technical than what I was already doing.
</p>
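<p>
To give a flavor of the idea behind that internship diff tool, here is a small, present-day sketch of the snapshot half of it. This is not the code I wrote back then; it simply illustrates walking the filesystem and recording per-file metadata and md5 checksums in a database (SQLite here, for brevity) instead of holding everything in one giant dictionary:
</p>
<pre class="prettyprint">import hashlib
import os
import sqlite3
import stat

def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def snapshot(root, db_path):
    """Walk `root` and record the path, permissions, size and md5 of every file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, mode TEXT, size INTEGER, md5 TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
                row = (path, stat.filemode(info.st_mode), info.st_size, file_md5(path))
            except OSError:
                continue  # unreadable or vanished file; skip it
            conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)", row)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    snapshot("/etc", "snapshot.db")  # example root; point it anywhere
</pre>
<p>
Comparing two points in time is then a matter of joining two snapshot tables on path and reporting rows whose mode, size or checksum differ, which is exactly where a database starts to earn its keep over an in-memory dictionary.
</p>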
<h3>Uh-oh</h3>
<p>
After completing three years of college and almost a year as a student employee, I was offered a full-time job in the IT department of the university I was attending. It was a difficult decision, and it took some time to weigh the pros and cons: take the risk and leap of faith into the field, or continue my education for another two years while piling up debt before flowing into the field anyway. Ultimately, I chose to take the job to pay off the student loans that I had accrued, which were almost the size of my salary and nearly twice my weight in stress. Also, I knew this was an opportunity to delve deeper into different technologies and learn what it means to take ownership and responsibility. The job was to be the system administrator for the STEM department, and that came with a lot of responsibility for managing various software, some cloud-based but much of it on-premises. Much of the software I managed used a SQL Server to manage logins, logs, barcode numbers, etc. Little did I know that an SQL Server was actively in use until I got a call from a chemistry department head saying they were unable to log into their science rental equipment management software. I searched all over in our wiki and could not find a single trace of its existence; for a moment I thought this was a prank. I asked my coworkers if they had heard of this software, or if it even existed; I got nothing in response. I dug further, even going so far as reaching out to the system engineer who had worked there previously. It turns out this software had an SQL Server sitting in an undocumented virtual machine, lost in tribal knowledge. Unfortunately, there were no records of this software ever being provisioned. To look on the bright side, one of the science teachers' users in this database for some reason had super privileges, giving me the ability to log in and work some magic; I thought this was the end of the immediate problem. But there was an itch in my brain, questions that stuck with me: what had happened? Why did it all of a sudden stop working? Why was it sitting on an undocumented VM? How could I prevent this in the future? Why was there no accountability or visibility for this database? Either it was going to be forgotten in this state, or I had to work on keeping it reliable in all aspects, especially documentation. I made a page on everything I learned about this software, including representatives from the company, run books for the database, and how the client and backend work. That was the moment my interest in reliability and uptime grew exponentially, especially around databases. Before I left, I made one big oops.
</p>
<h3>Be wary of drives</h3>
<p>
I was put in charge of the psychology department on top of the STEM department. I had a psychology professor come in because her laptop was due for a replacement, and she was hoping to speed up the process, as she was filling up the 512 GB drive of her special-order Dell XPS, primarily with personal photos and research documentation. The first step was to back up her laptop to a hard drive we had, using some software that did it block by block. I had our student employees complete this process overnight. I woke up to some great news: it kept failing with odd errors that turned up nothing in a Google search. After consulting with my coworkers, we remembered that Office 365 comes with 1 TB of storage via OneDrive. We thought this was perfect: she could store all her valuable documents in OneDrive while we set up her new laptop, then download them back down. She preferred that I do it personally since I was in charge of the psychology department, and as that was our policy, I had to agree. I began the process of uploading her documents to her OneDrive and it took days. Being new to Office 365, I had no idea why it was taking this long, but I shrugged it off as it eventually reported success. I began to download her files onto the new computer and started the RMA of her old one. Problems were immediate: permission issues, overly long file names, disappearing files, you name it. After hours and hours of work, going through shadow copies of our servers, looking at past backups we had, recursively changing permissions of the files, it was exhausting. I was able to recover about 95% of what she had previously, but the 5% I lost was a good chunk of her research. It was a time of reflection when my motto rang in my head non-stop: I had missed one more backup somewhere. From then on, I was no longer super aware. I was hyper-aware and vigilant in storing data. Everything made me skeptical or made me ask questions, and it was a mark in my career of growth through failure. I had a burning desire to learn more about storing data: how to do it robustly and safely, and how to ensure validity and integrity. I set out to fulfill my desire.
</p>
<h3>Where to begin?</h3>
<p>
Finding information on reliability and the internals of data storage is a difficult task when you do not have any reference or expert to guide you towards the correct path. The internet is filled with how-tos on writing read and write queries, but documentation on the internals is tricky to even find, let alone comprehend. I finally started to dip my toes in and quickly learned it's difficult to summarize the paradox that is an SQL database: the simplicity of the query structure itself provides a façade and a false sense of understanding, when really there is much behind the scenes you cannot see. Knowing how to query a database for what I needed gave me the confidence that I knew what to do and how it worked. It wasn't until I started a personal project that was causing the database to suffocate that I realized perhaps there is more to databases than my cute little store of knowledge had previously assumed.
</p>
<h3>Personal Projects</h3>
<p>
Being in the technology field will expose you to various situations that are hard to prepare for in personal studies as well as higher education: outages that are unpredictable due to customer behavior, or simply having no reference for what the threshold of a service is, primarily because you have never reached that point. Scaling up was a term I had heard of but never understood, so natural curiosity decided I needed to seek out what it really meant. It is impossible to scale up unless you have a lot of data to utilize. Finding large data sets that contained false, made-up data was a tall task, so I had an itch to create a data generator to assist in learning how to scale. Yes, there are a few data generator websites. However, they seemed to cap out at a million rows at a time, which was not enough for me and the service to really push things to the limit. In creating this data generator, I made it spit out data in an SQL format, making it easy to slap into MySQL right away. Fortunately, after some heavy work in Python, it was capable of generating eight-figure numbers of rows with columns for names, addresses, cars, ages, and other data. I ran my data generators several more times to add up to 300 million rows, and I decided it was time to load up a MySQL server in a LAMP stack with this data to use in what would essentially be a country-sized simulation. With no visibility into the VM, the OS, or the database, my PHP queries to MySQL locally took ages or crashed the VM altogether. I knew it was the database because even querying via phpMyAdmin was not returning results quickly or timed out, and I couldn't figure out how to better interact with the database. Thinking it lacked power, I kept upping the CPUs and RAM, which only led to crashing the host. I stepped back to think more about scaling: how could I, in this case, scale up if upping power wasn't the solution? Then the concept of how a CPU is designed rang in my head: distributing the job into smaller chunks. A saying from the CTO of the company where I interned came back to me: "Any big problem is just a subset of a bunch of smaller problems. Iterate those small problems, and now you've solved the big problem."
</p>
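<p>
To give a sense of the shape of such a generator, here is a stripped-down sketch of the idea. It is not my original script, and the columns and value pools are made up for illustration; the point is only that fake rows are cheap to mass-produce and can be written out as SQL that loads straight into MySQL:
</p>
<pre class="prettyprint">import random

FIRST_NAMES = ["Ada", "Grace", "Linus", "Radia", "Ken"]
CITIES = ["Atlanta", "Denver", "Austin", "Portland", "Boise"]
CARS = ["Civic", "Model 3", "Outback", "F-150", "Leaf"]

def fake_row(person_id):
    """Build one tuple of made-up person data keyed by an integer ID."""
    return (
        person_id,
        random.choice(FIRST_NAMES),
        "{} Main St, {}".format(random.randint(1, 9999), random.choice(CITIES)),
        random.choice(CARS),
        random.randint(18, 90),
    )

def write_inserts(path, row_count, batch_size=10000):
    """Write multi-row INSERT statements so MySQL can load them in large batches."""
    with open(path, "w") as out:
        out.write(
            "CREATE TABLE IF NOT EXISTS people "
            "(id BIGINT UNSIGNED PRIMARY KEY, name VARCHAR(64), "
            "address VARCHAR(128), car VARCHAR(64), age INT);\n"
        )
        for start in range(1, row_count + 1, batch_size):
            end = min(start + batch_size, row_count + 1)
            values = ",\n".join(
                "({}, '{}', '{}', '{}', {})".format(*fake_row(i))
                for i in range(start, end)
            )
            out.write(
                "INSERT INTO people (id, name, address, car, age) VALUES\n"
                + values + ";\n"
            )

if __name__ == "__main__":
    write_inserts("people.sql", 1000000)  # scale the row count up as far as you dare
</pre>
<p>
A file like this loads with the stock mysql command-line client, and a few hundred million rows of it is enough to make a single test server genuinely suffer, which is where that advice about breaking big problems into smaller ones came back to me.
</p>
<p>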
I got it! Let me split the database into smaller databases, each containing a max of 10 million rows. If I needed something beyond the unique ID, I could query the next database instead of having MySQL scan the entire database. Distributing data through multiple instances of MySQL servers was a weak solution in this case, of course, as PHP now had to maintain 20 MySQL connections. Later I learned this moved the problem instead of solving it, and now I was stuck. By that point I understood that databases are far more complicated than I had initially thought, and that fed my desire to learn more. I did not necessarily feel capable of being a database administrator, but I figured, what is better than to dive in headfirst as a database administrator for a company?
</p>
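<p>
For what it is worth, the routing idea I had landed on was nothing clever. Here is a hedged sketch of it, simplified to Python with a placeholder connection helper rather than my original PHP; it also shows why the application ends up holding one connection per shard:
</p>
<pre class="prettyprint">ROWS_PER_SHARD = 10000000  # 10 million rows per database, as described above
SHARD_COUNT = 20           # one MySQL connection per shard

def shard_for(person_id):
    """Map a person ID to the shard (database) that holds its row."""
    return (person_id - 1) // ROWS_PER_SHARD

def fetch_person(connections, person_id):
    """Route a lookup to the right shard; `connections` holds one handle per shard."""
    conn = connections[shard_for(person_id)]
    cur = conn.cursor()
    cur.execute("SELECT name, address, car, age FROM people WHERE id = %s", (person_id,))
    return cur.fetchone()

# Hypothetical setup: mysql_connect() is a stand-in for whatever DB-API
# driver you use (PyMySQL, mysqlclient, ...), one connection per shard.
# connections = [mysql_connect(db="people_{}".format(i)) for i in range(SHARD_COUNT)]
# fetch_person(connections, 123456789)
</pre>
<p>
Anything that is not a straight lookup by ID, such as a range scan, a join, or an aggregate, now has to touch many shards at once, which is the part that moved the problem instead of solving it.
</p>
<p>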
I am a person who tries not to be afraid to delve into the unknown or face rejection. Imposter syndrome is real, but I know it is something you can grow past, no matter what your mind tells you otherwise. I scoured the internet for DBA jobs and found myself stumbling upon an entry-level DBA posting at a competitor of the company I had interned at. It was perfect, and I applied despite it feeling like a moonshot, as the position was based in a different state.
</p>
<h3>Don't be afraid</h3>
<p>
Unexpectedly, I got a callback. I flew through the phone interview and the manager interview, and eventually hopped on a call for the technical interviews. I was as honest as I could be: I explained my attempts to scale, shared the little experience I had with MySQL, and explained why I wanted to be a DBA.
</p><p>
Simply put, I wanted to be a DBA because databases are fascinating to me. We rely on databases for everything, but hardly anyone delves more in-depth than simple restarts or querying for what they need. I had a difficult time finding resources to help me learn about SQL at a deeper level than how to write basic queries. I was hungry; rather, I was famished to learn. I knew I lacked a lot of knowledge and was honest about it during my technical interviews, but I backed it up with what I was trying to do with MySQL. In particular, I shared my attempts at scaling by distributing the workload, having no idea what the correct term was other than describing distributing "it." I later learned it's called sharding. I jumped up and down after finding out the correct term, as it unlocked a vast amount of new resources via Google searches and technical conversations with people in the industry. During my technical interview, I had a DBA on the call; this was the perfect opportunity to ask what resources I should read, so I jumped in as soon as I could. She recommended reading the <a href="https://www.amazon.com/Database-Reliability-Engineering-Designing-Operating/dp/1491925949">Database Reliability Engineering book by Charity Majors and Laine Campbell</a>. I immediately bought it off Amazon, practically during the interview, and was extremely eager to crack it open the second it arrived. I started reading and taking impeccable notes, absorbing as much as I could.
</p><p>
This was the direction I needed, the direction I wanted to go: pushing my mind, widening my thought process, and making me aware that there is much more to the job than writing code and setting up software, such as service level objectives and agreements, automation, and the need for metrics and the like. I just could not put the book down. It almost felt like I hadn't had a bite to eat in days and was essentially swallowing the book whole. By my second technical interview, I believe my famine showed. I talked about what I was learning and how I was applying it, and it raised the interviewer's eyebrow in a good way. I was flown to their headquarters for further interviewing.
</p>
<h3>Still much to learn</h3>
<p>
It is no secret this job was at SendGrid, and I am very fortunate to have found a job posting that was purposely looking to help the employee grow. I attribute a lot of that to luck and to SendGrid's excellent mentality and awareness of the benefits of hiring and raising a junior employee. The distinctive culture included hunger, the hunger to learn, and I was viciously starving. I could not stop reading documentation, asking questions and writing everything down in a spiral notebook. I am fortunate to have a senior DBA on the team to guide me through the processes of replication and basic troubleshooting of a MySQL server. Later I bought the <a href="https://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/1449314287">High Performance MySQL: Optimization, Backups, and Replication</a> book on Amazon, and soon after being hired, I started going through it, diligently taking notes and asking questions along the way. The path to learning about SQL did not stop when I was hired; in fact, it had just started.
</p>
<h3>Conclusion</h3>
<p>
Overall, my natural-born curiosity and love for challenges led me to take an opportunity where no one else dared to venture. I broke through my façade of thinking SQL databases are easy just because I could query something, by trying to force a database to kneel. Finding out why was challenging, but that only led me to viciously seek out a solution and to not be afraid to apply for a DBA job. The key was realizing that I always gravitated toward databases and asked myself the most questions when dealing with one; I wanted to conquer databases. The two books mentioned are a great start to grow your knowledge beyond querying a database and to delve deeper into what it is and how to use it. Another book to look at is <a href="https://www.amazon.com/Joe-Celkos-SQL-Smarties-Programming/dp/0128007613/ref=sr_1_1?keywords=celkos+sql&qid=1576870200&s=books&sr=1-1">Joe Celko's SQL for Smarties: Advanced SQL Programming</a>; it does a good job of delving into how SQL works behind the scenes and makes you realize that your queries can be optimized greatly. While there are many paths to take, the real takeaway is that if you have the hunger to learn, you will succeed no matter what path you take.</p>cwebberhttp://www.blogger.com/profile/16516063226544348092noreply@blogger.com0