December 23, 2021

Day 23 - What is eBPF?

By: Ania Kapuścińska (@lambdanis)
Edited by: Shaun Mouton (@sdmouton)

Like many engineers, for a long time I’ve thought of the Linux kernel as a black box. I've been using Linux daily for many years - but my usage was mostly limited to following the installation guide, interacting with the command line interface and writing bash scripts.

Some time ago I heard about eBPF (extended BPF). The first thing I heard was that it’s a programmable interface for the Linux kernel. Wait a second. Does that mean I can now inject my code into Linux without fully understanding all the internals and compiling the kernel? The answer turns out to be approximately yes!

An eBPF (or BPF - these acronyms are used practically interchangeably) program is written in a restricted version of C. Restricted, because a dedicated verifier checks that the program is safe to run in a BPF VM - it can’t crash, loop infinitely, or access arbitrary memory. If the program passes the check, it can be attached to some kind of event in the Linux kernel, and run every time this event happens.

A growing ecosystem makes it easier to create tools on top of BPF. One very popular framework is BCC (BPF Compiler Collection), containing a Python interface for writing BPF programs. Python is a very popular scripting language, for a good reason - simple syntax, dynamic typing and a rich standard library make writing even complex scripts quick and fun. On top of that, bcc provides easy compilation, event attachment and output processing of BPF programs. That makes it the perfect tool to start experimenting with writing BPF code.

To run the code examples from this article, you will need a Linux machine with a fairly recent kernel version (one supporting eBPF). If you don’t have a Linux machine available, you can experiment in a Vagrant box. You will also need to install the Python bcc package.

Very complicated hello

Let’s start in a very unoriginal way - with a “hello world” program. As I mentioned before, BPF programs are written in (restricted) C. A BPF program printing “Hello World!” can look like this:

hello.c

#define HELLO_LENGTH 13

BPF_PERF_OUTPUT(output);

struct message_t {
   char hello[HELLO_LENGTH];
};

static int strcp(char *src, char *dest) {
   for (int i = 0; src[i] != '\0'; i++) {
       dest[i] = src[i];
   }
   return 0;
};

int hello_world(struct pt_regs *ctx) {
   struct message_t message = {};
   strcp("Hello World!", message.hello);
   output.perf_submit(ctx, &message, sizeof(message));
   return 0;
}

The main piece here is the hello_world function - later we will attach it to a kernel event. We don’t have access to many common libraries, so we are implementing strcp (string copy) functionality ourselves. Extra functions are allowed in BPF code, but have to be defined as static. Loops are also allowed, but the verifier will check that they are guaranteed to complete.

The way we output data might look unusual. First, we define a perf ring buffer called “output” using the BPF_PERF_OUTPUT macro. Then we define a data structure that we will put in this buffer - message_t. Finally, we write to the “output” buffer using the perf_submit function.

Now it’s time to write some Python:

hello.py

from bcc import BPF

b = BPF(src_file="hello.c")
b.attach_kprobe(
   event=b.get_syscall_fnname("clone"),
   fn_name="hello_world"
)

def print_message(_cpu, data, _size):
   message = b["output"].event(data)
   print(message.hello)

b["output"].open_perf_buffer(print_message)
while True:
   try:
       b.perf_buffer_poll()
   except KeyboardInterrupt:
       exit()

We import BPF from bcc, as the BPF class is the core of the bcc package's Python interface to eBPF. It loads our C program, compiles it, and gives us a Python object to operate on. The program has to be attached to a Linux kernel event - in this case it will be the clone system call, used to create a new process. The attach_kprobe method hooks the hello_world C function to the start of the clone system call.

The rest of the Python code reads and prints the output. A great piece of functionality provided by bcc is the automatic translation of C structures (in this case the “output” perf ring buffer) into Python objects. We access the buffer with a simple b["output"], and use the open_perf_buffer method to associate it with the print_message function. In this function we read incoming messages with the event method. The C structure we used to send them gets automatically converted into a Python object, so we can read “Hello World!” by accessing the hello attribute.

To see it in action, run the script with root privileges:


> sudo python hello.py

In a different terminal window, run any command, e.g. ls. “Hello World!” messages will start popping up.

Does it look awfully complicated for a “hello world” example? Yes, it does :) But it covers a lot, and most of the complexity comes from the fact that we are sending data to user space via a perf ring buffer.

In fact, similar functionality can be achieved with much simpler code. We can get rid of the complex printing logic by using the bpf_trace_printk function to write a message to the shared trace_pipe. Then, in the Python script we can read from this pipe using the trace_print method. It’s not recommended for real-world tools, as trace_pipe is global and the output format is limited - but for experiments or debugging it’s perfectly fine.

Additionally, bcc allows us to write C code inline in the Python script. We can also use a shortcut for attaching C functions to kernel events - if we name the C function kprobe__<kernel function name>, it will get hooked to the desired kernel function automatically. In this case we want to hook into the sys_clone function.

So, hello world, the simplest version, can look like this:

from bcc import BPF

BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello World!\\n"); return 0; }').trace_print()

The output will be different, but what doesn’t change is that while the script is running, custom code will run whenever a clone system call is starting.

What even is an event?

Code compilation and attaching functions to events are greatly simplified by the bcc interface. But a lot of its power lies in the fact that we can glue many BPF programs together with Python. Nothing prevents us from defining multiple C functions in one Python script and attaching them to multiple different hook points.

Let’s talk about these “hook points”. What we used in the “hello world” example is a kprobe (kernel probe). It’s a way to dynamically run code at the beginning of Linux kernel functions. We can also define a kretprobe to run code when a kernel function returns. Similarly, for programs running in user space, there are uprobes and uretprobes.
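
To make this concrete, here is a small bcc sketch (not from the article above - the C function names are mine) that attaches one BPF function to the entry of the clone syscall with attach_kprobe and another to its return with attach_kretprobe; PT_REGS_RC reads the probed function's return value:

from bcc import BPF

program = """
int on_clone_enter(struct pt_regs *ctx) {
    bpf_trace_printk("clone called\\n");
    return 0;
}

int on_clone_return(struct pt_regs *ctx) {
    // PT_REGS_RC(ctx) holds the return value of the probed function
    bpf_trace_printk("clone returned: %ld\\n", PT_REGS_RC(ctx));
    return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="on_clone_enter")
b.attach_kretprobe(event=b.get_syscall_fnname("clone"), fn_name="on_clone_return")
b.trace_print()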

Probes are extremely useful for dynamic tracing use cases. They can be attached almost anywhere, but that can cause stability problems - a function rename could break our program. Better stability can be achieved by using predefined static tracepoints wherever possible. Linux kernel provides many of those, and for user space tracing you can define them too (user statically defined tracepoints - USDTs).
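
As a sketch of the static alternative: bcc's TRACEPOINT_PROBE macro attaches a program to a kernel tracepoint automatically, so the earlier “hello” could be rewritten against the syscalls:sys_enter_clone tracepoint (assuming your kernel exposes it) roughly like this:

from bcc import BPF

b = BPF(text="""
TRACEPOINT_PROBE(syscalls, sys_enter_clone) {
    // runs at the stable tracepoint instead of a kprobe on a function name
    bpf_trace_printk("clone entered (tracepoint)\\n");
    return 0;
}
""")
b.trace_print()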

Network events are very interesting hook points. BPF can be used to inspect, filter and route packets, opening a whole sea of possibilities for very performant networking and security tools. In this category, XDP (eXpress Data Path) is a BPF framework that allows running BPF programs not only in Linux kernel, but also on supported network devices.
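
As a rough sketch only (not a production tool), attaching a minimal XDP program with bcc could look like the following; the interface name "eth0" is an assumption, so substitute one that exists on your machine and run it as root:

import time
from bcc import BPF

device = "eth0"  # assumption: replace with a real interface on your machine

b = BPF(text="""
#include <uapi/linux/bpf.h>

int xdp_pass(struct xdp_md *ctx) {
    // inspect, filter or rewrite packets here; XDP_PASS hands them on unchanged
    return XDP_PASS;
}
""")

fn = b.load_func("xdp_pass", BPF.XDP)
b.attach_xdp(device, fn, 0)  # pass BPF.XDP_FLAGS_SKB_MODE to force generic mode
print("XDP program attached to %s, Ctrl-C to detach" % device)
try:
    while True:
        time.sleep(1)
except KeyboardInterrupt:
    pass
finally:
    b.remove_xdp(device, 0)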

We need to store data

So far I’ve mentioned functions attached to other functions many times. But interesting computer programs generally have something more than functions - a state that can be shared between function calls. That can be a database or a filesystem, and in the BPF world that’s BPF maps.

BPF maps are key/value pairs stored in Linux kernel. They can be accessed by both kernel and user space programs, allowing communication between them. Usually BPF maps are defined with C macros, and read or modified with BPF helpers. There are several different types of BPF maps, e.g.: hash tables, histograms, arrays, queues and stacks. In newer kernel versions, some types of maps let you protect concurrent access with spin locks.

In fact, we’ve seen a BPF map in action already. The perf ring buffer we created with the BPF_PERF_OUTPUT macro is nothing more than a BPF map of type BPF_MAP_TYPE_PERF_EVENT_ARRAY. We also saw that it can be accessed from a Python bcc script, including the automatic translation of its items into Python objects.

A good, but still simple, example of using a hash table BPF map for communication between different BPF programs can be found in the “Linux Observability with BPF” book (or in the accompanying repo). It’s a script using a uprobe and a uretprobe to measure the execution duration of a Go binary:

from bcc import BPF

bpf_source = """
BPF_HASH(cache, u64, u64);
int trace_start_time(struct pt_regs *ctx) {
 u64 pid = bpf_get_current_pid_tgid();
 u64 start_time_ns = bpf_ktime_get_ns();
 cache.update(&pid, &start_time_ns);
 return 0;
}
"""

bpf_source += """
int print_duration(struct pt_regs *ctx) {
 u64 pid = bpf_get_current_pid_tgid();
 u64 *start_time_ns = cache.lookup(&pid);
 if (start_time_ns == 0) {
   return 0;
 }
 u64 duration_ns = bpf_ktime_get_ns() - *start_time_ns;
 bpf_trace_printk("Function call duration: %d\\n", duration_ns);
 return 0;
}
"""

bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "./hello-bpf", sym = "main.main", fn_name = "trace_start_time")
bpf.attach_uretprobe(name = "./hello-bpf", sym = "main.main", fn_name = "print_duration")
bpf.trace_print()

First, a hash table called “cache” is defined with the BPF_HASH macro. Then we have two C functions: trace_start_time, which writes the function's start time to the map using cache.update(), and print_duration, which reads this value back using cache.lookup(). The former is attached to a uprobe and the latter to a uretprobe for the same function - main.main in the hello-bpf binary. That allows print_duration to, well, print the duration of the Go function's execution.
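
As a small, optional extension of the script above: because "cache" is an ordinary BPF map, it can also be read directly from Python. Replacing the final bpf.trace_print() call with a polling loop along these lines would dump the table from user space (keys and values come back as ctypes objects, hence .value):

import time

while True:
    time.sleep(5)
    for key, start_time_ns in bpf["cache"].items():
        # key is the pid/tgid recorded by trace_start_time
        print("pid/tgid %d entered main.main at %d ns" % (key.value, start_time_ns.value))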

Sounds great! Now what?

To start using the bcc framework, visit its GitHub repo. There is a developer tutorial and a reference guide. Many tools have been built on the bcc framework - you can learn about them from the tutorial or read their code. It’s great inspiration and a great way to learn - the code of a single tool is usually not very complicated.

Two goldmines of eBPF resources are ebpf.io and the eBPF awesome list. Start browsing either of those, and you have all your winter evenings sorted :)

Have fun!

December 22, 2021

Day 22 - So, You're Incident Commander, Now What?

By: Joshua Timberman (@jtimberman)

You’re the SRE on call and, while working on a project, your phone buzzes with an alert: “Elevated 500s from API.”

You’re a software developer, and your team lead posts in Slack: “Hey, the library we use for our payment processing endpoint has a remote exploit.”

You work on the customer success team and, during a routine sync with a high-profile customer, they install the new version of your client CLI. Then, when they run any command, it exits with a non-zero return code.

An incident is any situation that disrupts the ability of customers to use a system, service, or software product in a safe and secure manner. And in each of the incidents above, the person who noticed the incident first will most likely become the incident commander. So, now what?

What does it mean to be an incident commander?

Once an individual identifies an incident, one or more people will respond to it. Their goal is to resolve the incident and return systems, services, or other software back to a functional state. While an incident may have a few or many responders, only one person is the incident commander. This role is not about experience, seniority, or position on an org chart; it exists to ensure that progress is being made to resolve the incident. The incident commander must think about the inputs from the incident, make decisions about what to do next, and communicate with others about what is happening. The incident commander also determines when the incident is resolved based on the information they have. After the incident is over, the incident commander is also responsible for conducting a post-incident analysis and review to summarize what happened, what the team learned, and what will be done to mitigate the risk of a similar incident happening in the future.

Having a single person—the incident commander—be responsible for handling the incident, delegating responsibility to others, determining when the incident is resolved, and conducting the post-incident review is one of the most effective incident management strategies.

How do you become an incident commander?

Organizations vary on how a team member can become an incident commander. Some call upon the first responder to an incident. Others require specific training and have an on-call rotation of just incident commanders. However you find yourself in the role of incident commander, you should be trusted and empowered by your organization to lead the effort to resolve the incident.

Now what?

Now that you’re incident commander, follow your organization’s incident response procedure for the specifics about what to do. But for more general questions, we’ve got some guidance.

What are the best strategies for communication and coordination?

One of an incident commander’s primary tasks is to communicate with relevant teams and stakeholders about the status of the incident and to coordinate with other teams to ensure the right people are involved.

If your primary communication tool is Slack, use a separate channel for each incident. Prefix any time-sensitive notes with “timeline” or “TL” so they are easy to find later. If higher-bandwidth communication is required, use a video conference, and keep the channel updated with important information and interactions. When an incident affects external customers, be sure to update them as required by your support teams and agreements with customers.

In the case of a security incident, there may be additional communications requirements with your organization’s legal and/or marketing teams. Legal considerations to communicate may include:

  • Statutory or regulatory reporting
  • Contractual commitments and obligations to customers
  • Insurance claims

Marketing considerations to communicate may include:

  • Sensitive information from customer data exposure
  • “Zero Day” exploits
  • Special messaging requirements, e.g. for publicly traded companies

When should you hand off a long-running incident?

During an extended outage or other long-running incident, you will likely need a break. Whether you are feeling overwhelmed, would contribute better by working on a solution to the incident itself, or need to eat, take care of your family, or sleep—all are good reasons to hand off incident command to someone else.

Coordinate with your other responders in the appropriate channel, whether that’s a Slack chat or a Zoom meeting. If necessary, escalate by paging someone else to help. Once someone else can take over, brief them on the latest progress, the current steps being taken, and who else is involved with the incident. Remember, we’re all human and we need breaks.

How should you approach post-incident analysis and review?

One of an incident commander’s most important jobs is to conduct a post-incident analysis and review after the incident is resolved. This meeting must be blameless: That is, the goal of the meeting is to learn what happened, determine what contributing factors led to the incident, and take action to mitigate the risk of such an incident happening in the future. It’s also to establish a timeline of events, demonstrate an understanding of the problems, and set up the organization for future success in mitigating that problem.

The sooner the incident analysis and review meeting occurs after the incident is resolved, the better. You should ensure adequate rest time for yourself and other responders, but the review meeting should ideally happen within 24 hours, and no later than two business days, after the incident. The incident commander (or commanders) must attend, as they have the most context on what happened and what decisions were made. Any responders who performed significant remediation steps or investigation must also attend so they can share what they learned and what they did during the incident.

Because the systems that fail and cause incidents are complex, a good analysis and review process is complex. Let’s break it down:

Describe the incident

The incident commander will describe the incident. This description should detail the impact as well as its scope, i.e., whether the incident affected internal or external users, how long it took to discover, how long it took to recover, and what major steps were taken to resolve the incident.

“The platform was down” is not a good description.

“On its 5 minute check interval, our monitoring system alerted the on-call engineer that the API service was non-responsive, which meant external customers could not run their workflows for 15 minutes until we were able to restart the message queue” is a good description.

Contributing factors

Successful incident analysis should identify the contributing factors and places where improvements can be made to systems and software. Our world is complex, and technology stacks have multiple moving parts and places where failures occur. Not only can a contributing factor be something technical like “a configuration change was made to an application,” it can be nontechnical like “the organization didn’t budget for new hardware to improve performance.” In reviewing the incident for contributing factors, incident commanders and responders are looking for areas for improvement in order to identify potential corrective actions.

Corrective action items

Finally, incident analysis should determine corrective action items. These must be specific work items that are assigned to a person or a team accountable for their completion, and they must be the primary work priority for that person or team. These aren’t “nice to have,” these are “must do to ensure the safe and reliable operation of the site or service.” Such tasks aren’t necessarily the actions taken during the incident to stabilize or remediate a problem, which are often temporary workarounds to restore service. A corrective action can be as simple as adding new monitoring alerts or system metrics that weren’t implemented before. It can also be as complex as rebuilding a database cluster with a different high availability strategy or migrating to a different database service entirely.

Conclusion

If you’ve recently been the incident commander for your first incident—congratulations. You’ve worked to solve a hard problem that had a lot of moving parts. You took on the role and communicated with the relevant teams and stakeholders. Then, you got some much needed rest and conducted a successful post-incident analysis. Your team identified corrective actions, and your site or service is going to be more reliable for your customers.

Incident management is one of the most stressful aspects of operations work for DevOps and SRE professionals. The first time you become an incident commander, it may be confusing or upsetting. Don’t panic. You’re doing just fine, and you’ll keep getting better.

Further reading

If you’re new to post incident analysis and review, check out Howie: The Post-Incident Guide from Jeli.

PagerDuty also has extensive documentation on incident response and incident command.

December 20, 2021

Day 20 - To Deploy or Not to Deploy? That is the question.

By: Jessica DeVita (@ubergeekgirl)
Edited by: Jennifer Davis (@sigje)

Deployment Decision-Making during the holidays amid the COVID19 Pandemic

A sneak peek into my forthcoming MSc. thesis in Human Factors and Systems Safety, Lund University.

Web services that millions of us depend on for work and entertainment require vast compute resources (servers, nodes, networking) and interdependent software services, each configured in specialized ways. The experts who work on these distributed systems are under enormous pressure to deploy new features, and keep the services running, so deployment decisions are happening hundreds or thousands of times every day. While automated testing and deployment pipelines allow for frequent production changes, an engineer making a change wants confidence that the automated testing system is working. However, automating the testing pipeline makes the test-and-release process more opaque to the engineer, making it difficult to troubleshoot.

When an incident occurs, the decisions preceding the event may be brought under a microscope, often concluding that “human error” was the cause. As society increasingly relies on web services, it is imperative to understand the tradeoffs and considerations engineers face when they decide to deploy a change into production. The themes uncovered through this research underscore the complexity of engineering work in production environments and highlight the role of relationships with co-workers and management on deployment decision-making.

There’s No Place Like Production

Many deployments are uneventful and proceed without issues, but unforeseen permissions issues, network latency, sudden increases in demand, and security vulnerabilities may only manifest in production. When asked to describe a recent deployment decision, engineers reported intense feelings of uncertainty as they could not predict how their change would interact with changes elsewhere in the system. More automation isn’t always the solution, as one engineer explains:

“I can’t promise that when it goes out to the entire production fleet that the timing won’t be wrong. It’s a giant Rube Goldberg of a race condition. It feels like a technical answer to a human problem. I’ve seen people set up Jenkins jobs with locks that prevent other jobs from running until it’s complete. How often does it blow up in your face and fail to release the lock? If a change is significant enough to worry about, there should be a human shepherding it. Know each other’s names. Just talk to each other; it’s not that hard.”

Decision-making Under Pressure

“The effects of an action can be totally different, if performed too early or too late. But the right time is not clock time: it depends upon the precise state of the process evolution” (De Keyser, 1990).

Some engineers were under pressure to deploy fixes and features before the holidays, while other engineers were constrained by a "code freeze" during certain times of the year, when they “can’t make significant production changes that aren’t trivial or that fix something”. One engineer felt that they could continue to deploy to their test and staging environments but warned, “... a lot of things in a staging environment waiting to go out can compound the risk of the deployments.”

Responding to an incident or outage at any time of the year is challenging, but even more so because of “oddities that happen around holidays” and additional pressures from management, customers, and the engineers themselves. Pairing or working together was often done as a means to increase confidence in decision making. Pairing resulted in joint decisions, as engineers described actions and decisions with “we”: “So that was a late night. When I hit something like that, it involves a lot more point-by-point communications with my counterpart. For example, ‘I'm going to try this, do you agree this is a good thing? What are we going to type in?’”

Engineers often grappled with "clock time" and reported that they made certain sacrifices to “buy more time” to make further decisions. An engineer expressed that a change “couldn’t be decided under pressure in the moment” so they implemented a temporary measure. Fully aware of the potential for their change to trigger new and different problems, engineers wondered what they could do “without making it worse”.

When triaging unexpected complications, engineers sometimes “went down rabbit holes”, exemplifying a cognitive fixation known as a “failure to revise” (Woods & Cook, 1999). Additionally, having pertinent knowledge does not guarantee that engineers can apply it in a given situation. For example, one engineer recounted their experience during an incident on Christmas Eve:

“...what happens to all of these volumes in the meantime? And so then we're just thinking of the possible problems, and then [my co-worker] suggested resizing it. And I said, ‘Oh, can you do that to a root volume?’ ‘Cause I hadn't done that before. I know you can do it to other volumes, but not the root.’”

Incidents were even more surprising in systems that rarely fail. For one engineer working on a safety critical system, responding to an incident was like a “third level of panic”.

Safety Practices

The ability to roll back a deployment was a critically important capability that for one engineer was only possible because they had “proper safety practices in place”. However, rollbacks were not guaranteed to work, as another engineer explained:

“It was a fairly catastrophic failure because the previous migration with a typo had partially applied and not rolled back properly when it failed. The update statement failed, but the migration tool didn’t record that it had attempted the migration, because it had failed. It did not roll back the addition, which I believed it would have done automatically”.

Sleep Matters

One engineer described how they felt that being woken up several times during the night was a direct cause of taking down production during their on-call shift:

“I didn't directly connect that what I had done to try to fix the page was what had caused the outage because of a specific symptom I was seeing… I think if I had more sleep it would have gotten fixed sooner”.

Despite needing “moral support”, engineers didn’t want to wake up their co-workers in different time zones: “You don't just have the stress of the company on your shoulders. You've got the stress of paying attention to what you're doing and the stress of having to do this late at night.” This was echoed in another engineer’s reluctance to page co-workers at night as they “thought they could try one more thing, but it’s hard to be self-aware in the middle of the night when things are broken, we’re stressed and tired”.

Engineers also talked about the impacts of a lack of sleep on their effectiveness at work as “not operating on all cylinders”, and no different than having 3 or 4 drinks: “It could happen in the middle of the night when you're already tired and a little delirious. It's a form of intoxication in my book.”

Blame Culture

“What's the mean time to innocence? How quickly can you show that it's not a problem with your system?”

Some engineers described feeling that management was blameful after incidents and untruthful about priorities. For example, an engineer described the aftermath of a difficult database migration: “Upper management was not straightforward with us. We compromised our technical integrity and our standards for ourselves because we were told we had to”.

Another engineer described a blameful culture during post-incident review meetings:

“It is a very nerve-wracking and fraught experience to be asked to come to a meeting with the directors and explain what happened and why your product broke. And because this is an interwoven system, everybody's dependent on us and if something happens, then it’s like ‘you need to explain what happened because it hurt us.’”

Engineers described their errors as “honest mistakes” as they made sense of these events after the fact. Some felt a strong sense of personal failure, and that their actions were the cause of the incident, as this engineer describes:

“We are supposed to follow a blameless process, but a lot of the time people self-blame. You can't really shut it down that much because frankly they are very causal events. I'm not the only one who can't really let go of it. I know it was because of what I did.”

Not all engineers felt they could take “interpersonal risks” or admit a lack of knowledge without fear of “being seen as incompetent”. Synthesizing theories of psychological safety with this study’s findings, it seems clear that environments of psychological safety may increase engineers’ confidence in decision making (Edmondson, 2002).

What Would They Change?

Engineers were asked: “If you could wave a magic wand, what would you change about your current environment that would help you feel more confident or safe in your day-to-day deployment decisions?”

In addition to “faster CI and pre-deployments”, engineers overarchingly spoke about needing better testing. One participant wanted a better way to test front-end code end-to-end: “I return to this space every few years and am a bit surprised that this still is so hard to get right”. In another mention of improved testing, an engineer wanted “integration tests that exercise the subject component along with the graph of dependencies (other components, databases, etc.), using only public APIs. I.e., no ‘direct to database’ fixtures, no mocking”.

Wrapping Up

Everything about engineers’ work was made more difficult in the face of a global pandemic. In the “before times” engineers could “swivel their chair” to get a “second set of eyes” from co-workers before deploying. While some engineers in the study had sophisticated deployment automation, others spoke of manual workarounds with heroic scripts written ‘on the fly’ to repair the system when it failed. Engineers grappled with the complexities of automation, and the risk and uncertainty associated with decisions to deploy. Most engineers using tools to automate and manage configurations did not experience relief in their workload. They had to maintain skills in manual interventions for when the automation did not work as expected or when they could not discern the machine’s state. Such experiences highlight the continued relevance of Lisanne Bainbridge’s (1983) research on the Ironies of Automation, which found that “the more advanced a control system is, the more crucial the role of the operator”.

This study revealed that deployment decisions cannot be understood independently from the social systems, rituals, and organizational structures in which they occurred (Pettersen, McDonald, & Engen, 2010). So when a deployment decision results in an incident or outage, instead of blaming the engineer, consider the words of James Reason (1990) who said “...operators tend to be the inheritors of system defects…adding the final garnish to a lethal brew whose ingredients have already been long in the cooking”. Engineers may bring their previous experiences to deployment decisions, but the tools and conditions of their work environment, historical events, power structures, and hierarchy are what “enables and sets the stage for all human action.” (Dekker & Nyce, 2014, p. 47).

____

This is an excerpt from Jessica’s forthcoming thesis. If you’re interested in learning more about this deployment decision-making study or would like to explore future research opportunities, send Jessica a message on Twitter.

References

Bainbridge, L. (1983). Ironies of automation. In G. Johannsen & J. E. Rijnsdorp (Eds.), Analysis, Design and Evaluation of Man–Machine Systems (pp. 129–135). Pergamon.

De Keyser, V. (1990). Temporal decision making in complex environments. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 327(1241), 569–576.

Dekker, S. W. A., & Nyce, J. M. (2014). There is safety in power, or power in safety. Safety Science, 67, 44–49.

Edmondson, A. C. (2002). Managing the risk of learning: Psychological safety in work teams. Citeseer.

Pettersen, K. A., McDonald, N., & Engen, O. A. (2010). Rethinking the role of social theory in socio-technical analysis: a critical realist approach to aircraft maintenance. Cognition, Technology & Work, 12(3), 181–191.

Reason, J. (1990). Human Error (pp. 173–216). Cambridge University Press.

Woods, D. D., & Cook, R. I. (1999). Perspectives on Human Error: Hindsight Bias and Local Rationality. In F. Durso (Ed.), Handbook of Applied Cognitive Psychology. Retrieved 9 June 2021 from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.474.3161

December 19, 2021

Day 19 - Into the World of Chaos Engineering

By: Julie Gunderson (@Julie_Gund)
Edited by: Kerim Satirli (@ksatirli)

Intro

I recently left my role as a DevOps Advocate at PagerDuty to join the Gremlin team as a Sr. Reliability Advocate. The past few months have been an immersive experience into the world of Chaos Engineering and all things reliability. That said, my foray into the world of Chaos Engineering started long before joining the Gremlin team.

From my time as a lab researcher, to being a single parent, to dealing with cancer, I have learned that the journey of unpredictability is everywhere. I could never have imagined in college that I would end up doing what I do now. As I reflect on the path I have taken to where I am today, I realize one thing: chaos was always right there with me. My start in tech was as a recruiter and let me tell you: there is no straight line that leads from recruiting to advocacy. I experimented in my career, tried new things, failed more than a few times, learned from my experiences and made tweaks. Being a parent is very similar: most experiences you make along the way fall into one of two camps: mistakes or learning. With cancer, there was, and is, a lot of experimenting and learning - even with the brightest of minds, every person’s system handles treatments differently. Luckily, I had, and still have, others who mentor me both professionally and personally, people who help me improve along the way, and I learned that chaos is a part of life that can be harnessed for positive change.

Technical systems have a lot of similarities to our life experiences: we think we know how they are going to act, but suddenly a monkey wrench gets thrown into the mix and poof, all bets are off. So what do we do? We experiment, we try new things, we follow the data, we don’t let failure stop us in our tracks, and we learn how to build resiliency in.

We can’t mitigate every possible issue out in the wild, but we should be proactive in identifying potential failure modes. We need to prepare folks to handle outages in a calm and efficient manner. We need to remember that there are users on the other end of those ones and zeros. We need to keep our eye on the reliability needle. Most of all, we need to have empathy for our co-workers, and remember that we are all in this together and that we don’t need to be afraid of failure.

Talking about Chaos in the System

When a system or provider goes down (cough, cough, us-east-1), people notice, and they share their frustrations widely. Long Twitter rants are one thing, the media’s reaction is another - outages make great headlines, and the old adage of “all press is good press” doesn’t really hold up anymore. Brand awareness is one thing, but great SEO numbers based on a headline in the New York Times that calls out your company for being down is probably not the way you want to get it.

What is Chaos Engineering

So what is Chaos Engineering, and more importantly: why would you want to engineer chaos? Chaos Engineering is one of those things that is just unfortunately named. After all, the practice has evolved a lot from the time when Jesse Robbins coined the term GameDays, to the codified processes we have in place today. The word “chaos” can still, unfortunately, lead to anxiety across the management team(s) of a company. But fear not: the practice of Chaos Engineering helps us all create the highly reliable systems that the world depends on, builds a culture of learning, and teaches us all to embrace failure and improve.

Chaos Engineering is the practice of proactively injecting failure into your systems to identify weaknesses. In a world where everyone relies on digital systems in some way, shape, or form, almost all of us have a focus on reliability. After all: the cost of downtime can be astronomical!

My studies started at the University of Idaho in microbiology. I worked as a researcher and studied the effects of carbon dioxide (CO2) on the short-term storage success of Chinook salmon milt (spoiler alert: there is no advantage to using CO2). That’s where I learned that effective research requires the use of the scientific method:

  1. Observe the current state
  2. Create a hypothesis
  3. Run experiments in a controlled, consistent environment
  4. Analyze the data
  5. Repeat the experiments and data analysis
  6. Share the results

In the research process, we focused on one thing at a time, we didn’t introduce all the variables at once, we built on our experiments as we gathered and analyzed the data. For example, we started off with the effects of CO2 and once we had our data we introduced clove oil into the study. Once we understood the effect on Chinook we moved to Sturgeon, and so on.

Similarly, you want to take a scientific approach when identifying weaknesses in your systems with Chaos Engineering; a key difference is the system under study: your technical and sociotechnical systems, vs. CO2 and Chinook salmon milt (also, there are no cool white coats). With Chaos Engineering you aren’t running around unplugging everything at once, or introducing 100% memory consumption on all of your containers at the same time; you take little steps, starting with a small blast radius and increasing that blast radius so you can understand where the failure has impact.

How do we get there

Metrics

At PagerDuty, I focused on best practices around reducing Mean Time to Detection (MTTD) and Mean Time to Resolution (MTTR) of incidents, and then going beyond those metrics to learning and improvement. I often spoke about Chaos Engineering and how, through intentionally injecting failure into our systems, we could not only build more reliable systems but also build a culture of blamelessness, learning, and continuous improvement.

In my time at Gremlin, I have seen a lot of folks get blocked at the very beginning when it comes to metrics such as MTTD and MTTR. Some organizations may not have the right monitoring tools in place, or are just at the beginning of their journey into metric collection. It’s okay if everything isn’t perfect here; the fact is you can just pick a place to start: one or two metrics to collect that give you a baseline to measure improvement against. As far as monitoring is concerned, you can use Chaos Engineering to validate what you do have, and make improvements from there.

People

On the people side of our systems, being prepared to handle incidents takes practice. Waking up at 2am to a barbershop quartet alert singing “The Server is on Fire” is a blood-pressure-raising experience; however, that stress can be reduced through practice.

For folks who are on-call, it’s important to give them some time to learn the ropes before tossing them into the proverbial fire. Give folks a chance to practice incident response through Chaos Engineering: run GameDays and FireDrills, where people can learn in a safe and controlled environment what the company process looks like in action. This is also a great way to validate your alerting mechanisms and response process. At PagerDuty we had a great joint workshop with Gremlin where people could practice incident response with Chaos Engineering to learn about the different roles and responsibilities and participate in mock incident calls. As a piano player, I had to build the muscle memory needed to memorize Beethoven’s Moonlight Sonata by practicing over, and over, and over for months. Similar to learning a musical instrument, practicing incident response builds the muscle memory needed to reduce the stress of those 2am calls. If I can stress (no pun intended) anything from my experiences in life, it is that repetition and practice are essential elements to handling surprises calmly.

Building a culture of accepting failure as a learning opportunity takes bravery and doesn’t happen overnight. Culture takes practice, empathy, and patience, so make sure to take the time to thank folks for finding bugs, for making mistakes, for accepting feedback, and for the willingness to learn.

Speak the language

As I mentioned before, sometimes we just have things that are unfortunately named. Many of us have the opportunity to attend conferences, read articles and blogs, earn certifications, etc. It’s important to remember that leadership often doesn't have the time to do those things. We as individual contributors, team leaders, engineers, whatever our title may be, need to be well equipped to speak effectively to our audience; leaders need to understand the message we are trying to convey. I have found that using the phrase “just trust me” isn’t always an effective communication tool. I had to learn how to talk to decision makers and leadership in the terms they used, such as business objectives, business outcomes, and Return on Investment (ROI). By communicating the business case I was trying to solve, they could connect the dots to the ROI of adopting and sponsoring new ways of working.

It’s a Wrap

To sum it up, chaos is part of our lives from the moment we are born, from learning to walk to learning to code, and all of the messiness in between. We don’t need to be afraid of experimentation, but we should be thoughtful with our tests and be open to learning. For me personally, this next year I plan on learning to play Bohemian Rhapsody, and professionally, I plan on experimenting with AWS and building a multi-regional application to test ways to be more resilient in the face of outages. Wish me luck, I think I’ll need it on both fronts.

Happy holidays, and may the chaos be with you.

December 18, 2021

Day 18 - Minimizing False Positive Monitoring Alerts with Checkmk

By: Elias Voelker (@Elijah2807) and Faye Tandog (@fayetandog)
Edited by: Jennifer Davis (@sigje)

Good IT monitoring stands and falls with its precision. Monitoring must inform you at the right time when something is wrong. But similar to statistics, you also have to deal with errors produced by your monitoring. In this post, I will talk about two types of errors - false positives and false negatives. And similar again to statistics, you can’t eliminate these errors completely in your monitoring. The best you can do is manage them and optimize for an acceptable level of errors.

In this article, I share ways in which you can fine-tune notifications from your monitoring system, to alleviate noisy alerts and ideally receive only those alerts that are really relevant.

Fine-tuning notifications is one of the most important and rewarding activities when it comes to configuring your monitoring system. The impact of a well-defined notification setup is felt immediately. First and foremost, your team will benefit from better focus due to less ‘noise’. This ultimately results in better service levels and higher service level objective (SLO) attainment across the board.

In this article, I talk about ‘alerts’ and ‘notifications’ using them interchangeably. An ‘alert’ or ‘notification’ is your monitoring system letting you know that something is supposedly wrong. Depending on your setup, this may be via email, text or a trouble ticket in PagerDuty.

When I talk about a ‘monitoring system’, I’m referring to both ‘traditional’ IT infrastructure and application monitoring tools such as Nagios, Zabbix, or Solarwinds Orion, as well as cloud-based monitoring solutions such as Prometheus, Datadog or Sensu.

Types of Alert Errors

Let’s start by examining two common alert errors: false positives and false negatives.

A false positive would be your monitoring tool alerting about an issue when in reality the monitored system is perfectly fine (or has recovered in the meantime). It could be a server or service being shown as DOWN because there was a short glitch in the network connection, or a specific service instance - for example, Apache restarting to rotate its logs.

False negatives are when your monitoring system does not alert you, although something really is wrong. If you're running an on-prem infrastructure and your firewall is down, you want to know about it. If your monitoring system for some reason does not alert you about this, your network may be exposed to all kinds of threats, which can get you into real trouble, really quickly.

However, the cost of erroneous alerting can differ vastly. Hence, when IT Ops teams try to determine the acceptable level of false positives versus an acceptable level of false negatives, they will often deem false positives more acceptable: a false negative could mean a mission-critical system is down and nobody is alerted, while a false positive might just be one unnecessary notification that’s quickly deleted from your inbox.

This is why they will err on the side of caution and notify, which is totally understandable. The consequence, however, is that these teams get drowned in meaningless alerts, which increases the risk of overlooking a critical one.

Notifications only help when no — or only occasional — false alarms are produced.

In this article, I use Checkmk to show examples of minimizing false positive alerting. You can apply the same philosophy with other tools though they may vary in implementation and functionality.

1. Don’t alert.

My first tip to improve monitoring and reduce the noise of false notifications is to simply not send notifications. Seriously!

In Checkmk, notifications are actually optional. The monitoring system can still be used efficiently without them. Some large organizations have a sort of control panel in which an ops team is constantly monitoring the Checkmk interface. As they will be visually alerted, additional notifications are unnecessary.

These are typically users that can’t risk any downtime of their IT at all, like a stock exchange, for example. They use the problem dashboards in Checkmk to immediately see an issue and its details. As the lists are mostly empty, it is pretty clear when something red pops up on a big dashboard.

But in my opinion, this is rather the exception. Most people use some way of notifying their ops and sysadmin teams, be it through email, SMS or notification plugins for ITSM tools such as ServiceNow, PagerDuty or Splunk OnCall.

2. Give it time

So if you’ve decided you don’t want to go down the ‘no notifications’ route from my previous point, you need to make sure that your notifications are finely tuned to only notify people in case of real problems.

The first thing to tell your monitoring is: Give the system being monitored time.

Some systems produce sporadic and short-lived errors. Of course, what you really should do is investigate and eliminate the reason for these sporadic problems, but you may not have the capacity to chase after all of them.

You can reduce alarms from systems like that in two ways:

  • You can simply delay notifications to be sent only after a specified time AND if the system state hasn’t changed back to OK in the meantime.
  • You can alert on the number of failed checks. For Checkmk this is the ‘Maximum number of check attempts for service’ rule set. This will make the monitoring system check a defined number of times before triggering a notification. By multiplying the number of check attempts by your defined check interval, you can determine how much time you want to give the system. The default Checkmk check interval is 1 minute, but you can configure this differently.

The two options are slightly different in how they treat the monitored system. By using the number of failed checks, you can be sure that the system has really been re-checked. If you alert only based on time and you (or someone else) changed the check interval to a longer timeframe you gain nothing. In Checkmk specifically there are some other factors as well, but that’s out of scope for this article. The essential effect is: By giving a system a bit of time to ‘recover’, you can avoid a bunch of unnecessary notifications.
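
If it helps to see the idea spelled out, here is a tiny Python sketch of the 'maximum number of check attempts' logic - purely illustrative, not Checkmk configuration, since the rule set above does this for you:

MAX_ATTEMPTS = 3  # e.g. 3 attempts x 1-minute check interval = ~3 minutes of grace


def notify(message):
    print("NOTIFY:", message)  # stand-in for email/SMS/PagerDuty


def run_checks(results):
    failed = 0
    for ok in results:
        if ok:
            failed = 0  # one good result resets the counter, nothing is sent
            continue
        failed += 1
        if failed == MAX_ATTEMPTS:
            notify("service DOWN after %d consecutive failed checks" % failed)


# a short glitch (single failure) stays silent; a sustained outage notifies once
run_checks([True, False, True, True, False, False, False, True])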

This method also works great for ‘self-healing’ systems that should recover on their own; for example, you wouldn’t want to get a notification about a cloud provider killing an instance to upgrade it when your code will automatically deploy a new container instance to handle requests.

Of course, this is not an option for systems that are mission-critical with zero downtime that require rapid remediation. For example, a hedge-fund that monitors the network link to a derivative marketplace can't trade if it goes down. Every second of downtime costs them dearly.

3. On average, you don’t have a problem

Notifications are often triggered by threshold values on utilization metrics (e.g. CPU utilization) which might only exceed the threshold for a short time. As a general rule, such brief peaks are not a problem and should not immediately cause the monitoring system to start notifying people.

For this reason, many check plug-ins have the configuration option that their metrics are averaged over a longer period (say, 15 minutes) before the thresholds for alerting are applied. By using this option, temporary peaks are ignored, and the metric will first be averaged over the defined time period and only afterwards will the threshold values be applied to this average value.
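
Here is a rough Python sketch of that effect (check plug-ins implement this for you; the numbers are invented): the threshold is applied to a rolling average rather than to each raw sample, so a brief spike does not trigger an alert.

from collections import deque

WINDOW = 15        # e.g. fifteen 1-minute samples
THRESHOLD = 90.0   # percent CPU utilization

samples = deque(maxlen=WINDOW)


def add_sample(cpu_percent):
    samples.append(cpu_percent)
    average = sum(samples) / len(samples)
    if average > THRESHOLD:
        print("ALERT: average CPU over the window is %.1f%%" % average)


# a single spike to 100% barely moves the average; sustained load trips it
for value in [20, 25, 100, 30, 20] + [95] * 15:
    add_sample(value)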

4. Like parents, like children

Imagine the following scenario: You are monitoring a remote data center. You have hundreds of servers in that data center working well and being monitored by your monitoring system. However, the connection to those servers goes through the DC’s core switch (forget redundancy for a moment). Now that core switch goes down, and all hell breaks loose. All of a sudden, hundreds of hosts are no longer reachable by your monitoring system and are being shown as DOWN. Hundreds of DOWN hosts mean a wave of hundreds of notifications…

But in reality, all those servers are (probably) doing just fine. We can’t tell either way, because we can’t connect to them while the core switch is acting up. So what do you do about it?

Configure your monitoring system so that it knows this interdependency. So the server checks are dependent on that core switch. You can do so in Checkmk by using ‘parent-child-relationships’. By declaring host A the ‘Child’ of another ‘Parent’ host B, you tell your Checkmk system that A is dependent on host B. Checkmk pauses notifications for the children if the parent is down.
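
Conceptually, the suppression works like this small Python sketch (Checkmk handles it for you once parent-child relationships are configured; the host names are invented):

parents = {
    "server-01": "core-switch",
    "server-02": "core-switch",
    "core-switch": None,
}


def should_notify(host, states):
    parent = parents.get(host)
    if parent is not None and states.get(parent) == "DOWN":
        return False  # the real problem is the parent, so only notify for it
    return states.get(host) == "DOWN"


states = {"core-switch": "DOWN", "server-01": "DOWN", "server-02": "DOWN"}
for host in parents:
    print(host, "-> notify" if should_notify(host, states) else "-> suppressed")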

5. Avoid alerts on systems that are supposed to be down

There are hundreds of reasons why a system might be down at times. Maybe some systems need to be rebooted regularly, maybe you are doing some maintenance, or maybe you simply don’t need a system at certain times. What you don’t want is your monitoring system going into panic mode during these times, alerting who-knows-whom about a system that is supposed to be down. To prevent that, you can use ‘Scheduled Downtimes’.

Scheduled downtimes work for entire hosts, but also for individual services. But why would you send certain services into scheduled downtimes? More or less for the same reason as hosts – when you know something will be going on that would trigger an unnecessary notification. You still might want your monitoring to keep an eye on the host as a whole, but you are expecting and accepting that some services might go haywire and breach thresholds for some time. An example could be a nightly cron job that syncs data to long term storage, causing the disk I/O check to spike. But, if everything goes back to normal once the sync is through, no need to lose sleep over it.

Moreover, you can extend scheduled downtimes to ‘Children’ of a ‘Parent’ host as well.
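
And once more as a purely illustrative Python sketch (in Checkmk you configure downtimes rather than code them; the hosts, services and time windows below are invented): notifications are suppressed while the current time falls inside a planned window, for a whole host or for a single service.

from datetime import datetime

# (host, service or None for the whole host, start, end)
downtimes = [
    ("db-01", None, datetime(2021, 12, 24, 1, 0), datetime(2021, 12, 24, 3, 0)),
    ("web-01", "Disk IO", datetime(2021, 12, 24, 2, 0), datetime(2021, 12, 24, 2, 30)),
]


def in_downtime(host, service, now):
    for dt_host, dt_service, start, end in downtimes:
        if dt_host == host and dt_service in (None, service) and start <= now <= end:
            return True
    return False


now = datetime(2021, 12, 24, 2, 15)
print(in_downtime("db-01", "CPU load", now))   # True - the whole host is in downtime
print(in_downtime("web-01", "Disk IO", now))   # True - only this service is covered
print(in_downtime("web-01", "CPU load", now))  # False - this would still notify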

Wrapping Up

I hope this short overview has given you some ideas about really simple ways with which you can cut down on the number of meaningless notifications your team is getting from your monitoring system. There are other strategies to do this, but this should get you started.

Additional Resources

If you want to learn more about how to manage notifications in Checkmk, check out this docs article or post a question in the forum.

December 17, 2021

Day 17 - Death to Localhost: The Benefits of Developing In A Cloud Native Environment

By: Tyler Auerbeck (@tylerauerbeck)
Edited by: Ben Cotton (@funnelfiasco)

Thank you everyone for joining us today. We gather here to say our goodbyes to our dear friend, Localhost. They’ve been there for us through the good times, the bad times, and the “we should really be sleeping right now…but let me just try one last thing” times. They’ve held our overly-complicated terminal configurations and—in all likelihood—most of our secrets. But alas, it is time to let our good friend ride into the sunset.

Saying Goodbye

But why?! We’ve all likely spent more time than we care to admit making these machines feel like home. They’re part of the family! Well, as it turns out, that can become part of the problem. We’ve all seen issues that are accompanied by the line “well it works on my machine” and a round of laughs. The problem with localhost is that it can be extremely difficult to ensure that the setup being utilized by one developer actually matches what is being run by another. This can happen for any number of reasons, such as developer platform (Linux vs MacOS vs Windows), IDE (VScode vs Jetbrains), or even just the installation method of the tools you’re using. The different combinations of these factors only exacerbate the problem and likely lead to (at a minimum!) hundreds of hours of lost productivity. All in the name of working locally. But what if there was a better way?

My Machine is Your Machine

With everything becoming Cloud Native these days, why do we want to treat development any differently? The common trend recently is to push a number of our workloads into containers. Why? Because with containers we have the ability to bundle our runtimes, tooling, and any additional dependencies via a well-defined format. We can expect them to run almost anywhere, the same way, each and every time. What if we took that same approach and instead of a web application, we shipped our development environment?

Well, as it turns out, this is exactly what a few projects are starting to give us the ability to do. Now, instead of shipping complex Makefiles, multiple install scripts, or having to ask our users to pipe our mystery scripts into bash, we can simply launch our development environments out into the cloud of our choice. Currently, there are two main projects that offer us this functionality. If you’re not interested in hosting anything yourself, GitHub Codespaces is a hosted solution that integrates directly with your codebase and allows you to easily spin up a VScode instance to get to work. However, if you have more specific restrictions or just prefer to run your own infrastructure, another project offering this functionality is Eclipse Che. Whatever solution works best for your situation is fine. The more important part of both of these offerings is _how_ they make these environments available to you.

Development Environment Specs

Both of the above offerings allow you to specify the dev environment that you want to make available to your users/developers. It’s important to note that this is done on a per-repository basis, because there is never going to be a single dev environment that works to run them all. This is exactly the mess that we’re trying to get out of! We want to be able to define an environment that is purpose-built for the specific project that we are working on!

To do this, these platforms give us configuration files: devcontainer.json (GitHub Codespaces) and devfile (Eclipse Che). Although the specs differ between the two formats, the underlying principles are the same. Within one well-defined configuration file, I am able to specify the tooling that needs to be installed, an image that should be used or built to run all of my development inside of, ports that need to be exposed, storage that needs to be mounted, plugins to be used, etc. Everything that I would usually need to configure by hand when getting started with a project now _just happens_ whenever I launch my environment. So now not only are we solving the _snowflake_ environment problem, but we are also saving valuable time because the environment will be configured and ready as soon as we click launch. It’s just what we’ve always wanted: push button and get to work!

What Problems Are We Solving

This all sounds great, right? But you might be shaking your fist in the air and screaming “Just let me use my laptop!” While this is absolutely something that I can empathize with and may generally work for personal projects, there are real problems being solved with this approach. I’ve seen this most clearly in enterprise development shops where _your machine_ isn’t really *your* machine. Which brings us to our first problem:

Permissions

Given the current security environment, most enterprise development shops aren’t too keen on giving you the permissions to install any of the tooling that you actually need. I have seen developers lose weeks waiting on a request just to install their runtime on their machines before they’re ever even able to begin contributing to their team. Multiply that by every tool and dependency that they might need and you can imagine how much valuable and productive time is lost in the name of security and process.

By moving to a cloud native development approach, your development environments can be treated just like any other application that you run and scanned/approved by your security teams. When a new developer comes on board, they can get right to work! No more waiting on approvals/installation because this has already gone through the necessary pipelines and is just ready whenever you are.

Develop In Production

Alright, so maybe we shouldn’t develop *in* production, but rather in an environment that is _like_ production. By developing an application in an environment like the one where it will ultimately run, you get a better feel for configurations and even failure modes that you otherwise may not experience by developing solely on your local machine. Expecting certain ports to be available? Need specific hardware? By ensuring your configuration files mirror your environments, you can catch these problems earlier in your process rather than finding them once they’ve launched into a staging or production environment. This ultimately helps you reduce downtime and speeds up your time to resolution, as you may find these problems before they’re ever even introduced.

Localhost: Still Slightly Alive

Realistically, this isn’t going to be a solution for everything or everyone. There are workloads and development tasks that require specialized environments or are potentially just not well suited to being done inside of a container environment. And that’s okay! There are still other approaches to finding a way off of your local machine and into the hearts of all of your developers without having to have them sink their time into troubleshooting differences between each of their machines. The heart of the problem still stands: developers want to get to work and provide value. Being able to provide on-demand environments that encapsulate all of the requirements of a project so that they can get involved immediately helps drive this productivity for both your teams and your communities, all without having to burn hours troubleshooting a personal machine.

So for now, let us lay our dear friend Localhost to rest. They may no longer be with us, but have no fear! Our localhost will always be with us up in the cloud(s)!

December 16, 2021

Day 16 - Setting up k3s in your home lab

By: Joe Block (@curiousbiped)
Edited by: Jennifer Davis (@sigje)

Background

Compute, even at home with consumer-grade hardware, has gotten ridiculously cheap. You can get a quad-core ARM machine with 4GB of RAM, like a Raspberry Pi 4, for under $150, including power supply and SD card for booting - and it'll idle at less than 5 watts of power draw and be completely silent because it is fanless.

What we're going to do

In this post, I'll show you how to set up a Kubernetes cluster on a cheap ARM board (or an x86 box if you prefer) using k3s and k3sup so you can learn Kubernetes without breaking an environment in use.

These instructions will also work on x86 machines, so you can repurpose that old hardware instead of buying a new Raspberry Pi.

Why k3s?

k3s was created by Rancher as a lightweight, easy to install, and secure Kubernetes option.

It's packaged as a single ~40MB binary that reduces the dependencies needed to get a cluster up and running. It even includes an embedded containerd, so you don't need to install that or docker. The ARM64 and ARM7 architectures are fully supported, so it's perfect for running on a Raspberry Pi in a home lab environment.

Why k3sup?

Alex Ellis wrote k3sup, a great tool for bringing up k3s clusters, and we're going to use it in this post to simplify setting up a brand new cluster. With k3sup, we'll have a running Kubernetes cluster in less than ten minutes.

Let's get started!

Pre-requisites.

  • A spare Linux box. I'll be using a Raspberry Pi for my examples, but you can follow along on an x86 Linux box or VM if you prefer.
  • k3sup - download the latest release from k3sup/releases into a directory in your $PATH (see the example below).
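On a Raspberry Pi, that download might look something like the following. The release tag and asset name here are just examples, so check the k3sup releases page for the latest version and pick the binary matching your architecture:


curl -sSLo k3sup https://github.com/alexellis/k3sup/releases/download/0.11.0/k3sup-arm64
chmod +x k3sup
sudo mv k3sup /usr/local/bin/k3sup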

Set up your cluster.

In the following example, I'm assuming you've created a user for configuring the cluster (you can use the pi user on a Raspberry Pi if you prefer - I used borg below), that you've added your ssh public key to that user's ~/.ssh/authorized_keys, and that the user has sudo privileges. I'm also assuming you've downloaded k3sup and put it into /usr/local/bin, and that /usr/local/bin is in your $PATH.
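If you want to mirror that setup, a rough sketch of the prep looks like the following. The user name, hostname, and key path are just the ones from my example, and the exact commands may differ slightly between distros:


# On the Pi: create the cluster user and give it passwordless sudo
sudo adduser borg
echo 'borg ALL=(ALL) NOPASSWD:ALL' | sudo tee /etc/sudoers.d/borg

# From your workstation: copy your public key into the new account
ssh-copy-id -i ~/.ssh/demo-key.pub borg@cephalopod.example.com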

Create the leader node

The first step is to create the leader node with the k3sup utility:


k3sup install --host $HOSTNAME --user pi

Below is the output when I ran it against my scratch rPi. In the scrollback you'll see that I'm using my borg account instead of the pi user. After setting up the rPi, the first step I took was to disable the well-known pi account. I also specify the path to an SSH key that is in the borg account's authorized_keys, and configure the borg account to allow passwordless sudo.

Notice that I don't have to specify an architecture - k3sup automagically determines the architecture of the host and installs the correct binaries when it connects to the machine. All I have to do is tell it what host to connect to, what user to use, what ssh key, and whether I want to use the stable or latest k3s channels or a specific version.


❯ k3sup install --host cephalopod.example.com --user borg --ssh-key demo-key --k3s-channel stable
Running: k3sup install
2021/12/13 16:30:49 cephalopod.example.com
Public IP: cephalopod.example.com
[INFO]  Finding release for channel stable
[INFO]  Using v1.21.7+k3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/sha256sum-arm64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/k3s-arm64
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO]  systemd: Starting k3s
Result: [INFO]  Finding release for channel stable
[INFO]  Using v1.21.7+k3s1 as release
[INFO]  Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/sha256sum-arm64.txt
[INFO]  Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/k3s-arm64
[INFO]  Verifying binary download
[INFO]  Installing k3s to /usr/local/bin/k3s
[INFO]  Skipping installation of SELinux RPM
[INFO]  Creating /usr/local/bin/kubectl symlink to k3s
[INFO]  Creating /usr/local/bin/crictl symlink to k3s
[INFO]  Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO]  Creating killall script /usr/local/bin/k3s-killall.sh
[INFO]  Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO]  env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO]  systemd: Creating service file /etc/systemd/system/k3s.service
[INFO]  systemd: Enabling k3s unit
[INFO]  systemd: Starting k3s
 Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.

Saving file to: /Users/jpb/democluster/kubeconfig

# Test your cluster with:
export KUBECONFIG=/Users/jpb/democluster/kubeconfig
kubectl config set-context default
kubectl get node -o wide

Test it out

Per the directions output by k3sup, you can now test your brand new cluster by setting the KUBECONFIG environment variable and then running kubectl to work with your new cluster.

My steps to verify my new cluster is up and running:

  1. export KUBECONFIG=/Users/jpb/democluster/kubeconfig
  2. kubectl config set-context default
  3. kubectl get node -o wide

And I see nice healthy output where the status shows Ready -



NAME         STATUS   ROLES                  AGE     VERSION        INTERNAL-IP   EXTERNAL-IP   OS-IMAGE             KERNEL-VERSION   CONTAINER-RUNTIME
cephalopod   Ready    control-plane,master   2m53s   v1.21.7+k3s1   10.1.2.3      <none>        Ubuntu 18.04.3 LTS   4.9.196-63       containerd://1.4.12-k3s1

And I can also look at pods in the cluster



❯ kubectl get pods -A
NAMESPACE     NAME                                      READY   STATUS      RESTARTS   AGE
kube-system   coredns-7448499f4d-b2rdp                  1/1     Running     0          9m29s
kube-system   local-path-provisioner-5ff76fc89d-d9rrc   1/1     Running     0          9m29s
kube-system   metrics-server-86cbb8457f-cqk6q           1/1     Running     0          9m29s
kube-system   helm-install-traefik-crd-jgk2x            0/1     Completed   0          9m29s
kube-system   helm-install-traefik-l2j96                0/1     Completed   2          9m29s
kube-system   svclb-traefik-7tzzs                       2/2     Running     0          8m38s
kube-system   traefik-6b84f7cbc-92kkp                   1/1     Running     0          8m38s

Clean Up

k3s is tidy and easy to uninstall, so you can stand up a cluster on a machine, do some experimentation, then dispose of the cluster and have a clean slate for your next experiment. This makes it great for continuous integration!


sudo /usr/local/bin/k3s-uninstall.sh

This shuts down the node and deletes /var/lib/rancher and the data stored there.

Next Steps

Learn Kubernetes! Some interesting tutorials that I recommend -

Finally, now that you've set up a cluster the easy way, if you want to understand everything k3sup did behind the scenes to get your Kubernetes cluster up and running, Kubernetes the Hard Way by Kelsey Hightower is a must-read.

December 15, 2021

Day 15 - Introduction to the PagerDuty API

By: Mandi Walls (@lnxchk)
Edited by: Joe Block (@curiousbiped)

Keeping track of all the data generated by a distributed ecosystem is a daunting task. When something goes wrong, or a service isn’t behaving properly, tracking down the culprit and getting the right folks enabled to fix it is also challenging. PagerDuty can help you with these challenges.

The PagerDuty platform integrates with over 600 other components to gather data, add context, and process automation. Under the hood of all of these integrations is the PagerDuty API, ready to help you programmatically interact with your PagerDuty account.

What’s Exposed Via the API

The PagerDuty API provides access to all the structural objects in your PagerDuty account - users, teams, services, escalation policies, etc - and also to the data objects including incidents, events, and change events.

For objects like users, teams, escalation policies, schedules, and services, you may find using the PagerDuty Terraform Provider will help you maintain the state of your account more efficiently without using the API directly.

The other object types in PagerDuty are more useful when we can send them anytime from anywhere, including via the API from our own code. Let’s take a look at three of them: incidents, events, and change events. If you’d like a copy of the code for these examples, you can find them on GitHub.

API Basics

To write new information into PagerDuty via the API, you'll need some authorization. You can use OAuth, or create an API key. There are account-level and user-level API keys available. You'll use account-level keys for the rest of the examples here to keep things simple.

To create a key in your PagerDuty app, you'll need Admin, Global Admin, or Account Owner access to your account. More on that here.

In PagerDuty, navigate to Integrations and then choose API Access Keys. Create a new key, give it a description, and save it somewhere safe. The keys are strings that look like y_NbAkKc66ryYTWUXYEu.

Now you’re ready to generate some incidents! These examples use curl, but there are a number of client libraries for the API as well.

Incidents

Incidents are probably what you’re most familiar with in PagerDuty - they represent a problem or issue that needs to be addressed and resolved. Sometimes this includes alerting a human responder. Many of the integrations in the PagerDuty ecosystem generate incidents from other systems and services to send to PagerDuty.

In PagerDuty, incidents are assigned explicitly to services in your account, so an incoming incident will register with only that service. If your database has too many long-running queries, you want an incident to be assigned to the PagerDuty service representing that database so responders have all the correct context to fix the issue.

If you have a service that doesn’t have an integration out of the box, you can still get information from that service into PagerDuty via the API, and you don’t need anything special to do it. You can send an incident to the API via a curl request to the https://api.pagerduty.com/incidents endpoint.

There are three required headers for these requests: Accept, Content-Type, and From, which needs to be an email address associated with your account for attribution of the incident. Setting up the request will look something like:


curl -X POST --header 'Content-Type: application/json' \
--url https://api.pagerduty.com/incidents \
--header 'Accept: application/vnd.pagerduty+json;version=2' \
--header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
--header 'From: system2@myemail.com' \

Now you need the information bits of the incident. These will be passed as --data in the curl request. There are just a few required pieces to set up the format and a number of optional pieces that help add context to the incident.

The most important piece you'll need is the service ID. Every object in the PagerDuty platform has a unique identifier. You can find the ID of a service in its URL in the UI. It will be something like https://myaccount.pagerduty.com/service-directory/SERVICEID.
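If you'd rather find IDs programmatically, the REST API also has a service listing endpoint. A quick sketch (the query filter is just an example, and the token is the same placeholder used above):


curl --request GET \
  --url 'https://api.pagerduty.com/services?query=database' \
  --header 'Accept: application/vnd.pagerduty+json;version=2' \
  --header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu'


Each service object in the response includes its id field.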

Now you can create the rest of the message with JSON:


curl -X POST --header 'Content-Type: application/json' \
--url https://api.pagerduty.com/incidents \
--header 'Accept: application/vnd.pagerduty+json;version=2' \
--header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
--header 'From: system2@myemail.com' \
--data '{
  "incident": {
    "type": "incident",
    "title": "Too many blocked requests",
    "service": {
      "id": "PWIXJZS",
      "summary": null,
      "type": "service_reference",
      "self": null,
      "html_url": null
    },
    "body": {
      "type": "incident_body",
      "details": "The service queue is full. Requests are no longer being fulfilled."
    }
  }
}'

When you run this curl command, it will generate a new incident on the service PWIXJZS with the title "Too many blocked requests", along with some context in the "body" of the data to help our responders. You can add diagnostics or other information here to help your team fix whatever is wrong.

What if there is information being generated that might not need an immediate response? Instead of an incident, you can create an event.

Events

Events are non-alerting items sent to PagerDuty. They can be processed via Event Rules to help create context on incidents or provide information about the behavior of your services. They utilize the PagerDuty Common Event Format to make processing and collating more effective.

Events are registered to a particular routing_key via an integration on a particular service in your PagerDuty account. In your PagerDuty account, select a service you'd like to send events to, or create a new one to practice with. On the page for that service, select the Integrations tab and Add an Integration. For this integration, select "Events API V2" and click Add. You'll have a new integration on your service page. Click the gear icon, and copy the Integration Key. For the full walkthrough of this setup, see the docs.

The next step is to set up the event. The request is a little different from the incident request - the url is different, the From: header is not required, and the authorization is completely handled in the routing_key instead of using an API token.

The content of the request is more structured, based on the Common Event Format, so that you can create event rules and take actions if necessary based on what the events contain.



curl --request POST \
  --url https://events.pagerduty.com/v2/enqueue \
  --header 'Content-Type: application/json' \
  --data '{
  "payload": {
    "summary": "DISK at 99% on machine prod-datapipe03.example.com",
    "timestamp": "2021-11-17T08:42:58.315+0000",
    "severity": "critical",
    "source": "prod-datapipe03.example.com",
    "component": "mysql",
    "group": "prod-datapipe",
    "class": "disk",
    "custom_details": {
      "free space": "1%",
      "ping time": "1500ms",
      "load avg": 0.75
    }
  },
  "event_action": "trigger",
  "routing_key": "e93facc04764012d7bfb002500d5d1a6"
}'

Change Events

A third type of contextual data you can send to the API is a Change Event. Change events are non-alerting, and help add context to a service. They are informational data about what's changing in your environment, and while they don't generate an incident, they can inform responders about other activities in the system that might have contributed to a running incident. Change events might come from build and deploy services, infrastructure as code, security updates, or other places that change is generated in your environment.

These events have a similar basic structure to the general events, and the setup with the routing_key is the same, as you can see in the below example. The custom_details can contain anything you want, like the build number, a link to the build report, or the list of objects that were changed during an Infrastructure as Code execution.

Change events have a time horizon. They expire after 90 days in the system, so you aren't looking at old context based on past changes.



curl --request POST \
  --url https://events.pagerduty.com/v2/change/enqueue \
  --header 'Content-Type: application/json' \
  --data '{
  "routing_key": "737ea619db564d41bd9824063e1f6b08",
  "payload": {
    "summary": "Build Success: Increase snapshot create timeout to 30 seconds",
    "timestamp": "2021-11-17T09:42:58.315+0000",
    "source": "prod-build-agent-i-0b148d1040d565540",
    "custom_details": {
      "build_state": "passed",
      "build_number": "220",
      "run_time": "1236s"
    }
  }
}'

Adding Notes

One final fun bit of functionality you can leverage in PagerDuty's API is with notes. Notes are short text entries added to the timeline of an incident. In some integrations, like PagerDuty and Slack, notes will be sent to any Slack channel that is configured to receive updates for an impacted service, making them helpful for responders to coordinate and record activity across different teams.

Notes are associated with a specific incident, so when you are creating a note, the URL will include the incident ID. Incident IDs are similar to the other object IDs in PagerDuty in that you can find them in the URL of the incident in the UI. They are longer strings than the other object IDs, like the service ID in the examples above.
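If you want to grab an incident ID without the UI, you can also list incidents through the API. Here is a hedged sketch filtering to open incidents (the -g flag stops curl from treating the square brackets as a glob pattern):


curl -g --request GET \
  --url 'https://api.pagerduty.com/incidents?statuses[]=triggered&statuses[]=acknowledged' \
  --header 'Accept: application/vnd.pagerduty+json;version=2' \
  --header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu'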

The content of a note can be anything that might be interesting to the timeline of the incident, like commands that have been run, notifications that have been sent, or additional data and links for responders and stakeholders.


curl --request POST \
  --url https://api.pagerduty.com/incidents/{id}/notes \
  --header 'Accept: application/vnd.pagerduty+json;version=2' \
  --header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
  --header 'Content-Type: application/json' \
  --header 'From: responder2@myemail.com' \
  --data '{
  "note": {
    "content": "Firefighters are on the scene."
  }
}'

Responders utilizing the UI will see notes in a widget on the incident page.

Next Steps

Using the API to create tooling where integrations don't yet exist, or for internally-developed services, can help your team stay on top of all the moving parts of your ecosystem when you have an incident. Learn more about the PagerDuty resources available at https://developer.pagerduty.com/. Join the PagerDuty Community to learn from other folks working in PagerDuty, ask questions, and get answers.

December 14, 2021

Day 14 - What's in a job description (and who does it keep away)?

By: Daniel Medina
Edited by: James Turnbull (@kartar)

A colleague supporting our recruitment efforts asked hiring managers if their "job descriptions are still partying like it's 1999?" The point was to revisit old postings that had been copy-and-pasted down the years and create something that would increase engagement with candidates. But reading the title made me think about a job I applied for (and got) circa 1999. It was a systems administrator role and included language like

The associate must regularly lift and/or move 20-35 pounds and occasionally lift or pull 35-80 pounds.

No joke, those Sun Microsystems monitors were heavy. Checking a fact sheet confirms the "flat screen" (non-curved) 21-inch CRT from around that time was ~80 pounds.

Large network switches in the Cisco Catalyst 6500 family were easily twice that weight and were definitely a two-person job. Best practice for racking servers in the datacenter was to use a Genie Lift.

To this day, if I hear someone talking about a strong developer I might wonder "but how much can they deadlift?" Most job descriptions for roles outside physical datacenter management don't include this language anymore. This all got me thinking, what might be in job descriptions these days that could be turning off candidates?

"Education Level" might be one of those things we should re-think. Many postings require a "Bachelor's Degree". Granted, we don't describe what that degree is in and I've had colleagues with degrees in History, Library Sciences, Geology, Economics, and more (even Computer Science!)

Sometimes the phrase "or equivalent experience" is added to these requirements. It's unclear if this means something akin to a college experience, for example, thirteen weeks reading The Iliad in your teenage years. I've had colleagues who are Managing Directors and Distinguished Engineers with no college degrees; so why bother asking for this in our requirements? Maybe it's cloned from an existing description, or it's a required field in the system used to post the description and the option "None" isn't pre-filled. At best it's a proxy that means we're really looking for someone older than 21. At worst, we've dissuaded some candidates from considering us.

Sometimes the HR systems used for creating job descriptions can add unexpected data to your job descriptions. One job description posted in Montreal automatically included "Knowledge of French and English is required". This wasn't a Language Requirement that came from us! We were at a global firm using English as a common language and would be happy to hire anyone who met Canadian work requirements and had the skills we were looking for!

Other French-language oddities you may encounter are labels like "(H/F)" to indicate "Homme / Femme", that the job description is intended to be gender-neutral, despite pronouns and gendered language used throughout. This isn't as awkward as some of the "s/he will..." references used in English-language descriptions when the simpler "you", speaking directly to the candidate, seems so much more natural!

Speaking of strange language, some descriptions include language that doesn't make me think first of a technology role:

I'm hiring... a hacker that wants to work on the bleeding edge...
We spend a lot of time doing applied research...
You should be the type of person who likes to roll up their sleeves and get their hands dirty.
Source: Wikipedia: _Dexter (season 2)_

Your signal that you have an existing, tight-knit group:

You'll be part of a small team of like-minded individuals.

might run counter to your efforts to advertise your goals of building a diverse and inclusive environment, one where the candidate-turned-new-joiner may not feel able to offer their valuable external input if it challenges the current thinking.

We found that we were having trouble filling a "DevOps" role. Without suggesting that "DevOps isn't a job title", candidates wanted clarification on what that might mean in our environment. Reviewing some of the many open roles across different teams showed they varied widely, leaving candidates to try to figure out which of the DevOps Topologies they might be walking into (and was it a Pattern or Anti-Pattern?!)

These included:

  • Cloud SecDevOps (Cyber): This wins keyword bingo
  • Apply Now to The Wonderful World of DevOps: Points for creative use of the job title field
  • Devops Specialist - Private Cloud: "providing L3 support... including on-call"
  • DevOps Developer: "You are a developer who is not afraid of infrastructure. You identify with the 'Dev' in DevOps way more than the 'Ops'"
  • DevOps App Dev: A "release engineer" role that sounded more like DevOps in practice
  • DevOps Authentication Security L3 Engineer: Okay...

Much of this has been about job descriptions that can lose candidates. What should you include to gain credibility and interest? An honest declaration of the mission of the group they’re joining always helps. Don't shy away from describing a need to support existing legacy systems, even if the goal is to modernize and move to a new platform. Describe the lifecycle of the team; is it "newly formed", "fast-growing", or is this a chance to "join an established team" and learn from established experts?

What's the topology of the team: distributed (participation from a range of locations and timezones in an asynchronous arrangement), multi-site (people working from two or perhaps three sites, passing off work between each other or operating in overlapping hours), or fully co-located (roughly shared time and location)? This can affect travel, working hours, and collaboration styles.

Basic details of work-life balance should be included. These might include remote work arrangements (which will likely become a lasting legacy of the pandemic era), on-call staffing strategies, night and weekend work requirements, or travel requirements. We tend to advertise "flexible opportunities", which may have some constraints (we may want individuals to reside in a specific country but not care as much about sitting in an office).

Some of the most thoughtful job descriptions lay out a multi-month roadmap for the role and growth. "Within three months we expect you to join our on-call rotation in support of our production environment", "Within six months you will obtain certification in at least one of our hosting platforms", "Within nine months you will be doing my job and I will be riding off into the sunset", etc. Having such a timeline is important to set expectations for performance during any initial probation period that may be part of local labor law or new hire contract. This also sets a pace for someone to ramp up in your environment, ensuring enough time is set aside for required learning as opposed to "throwing them in the deep end".

I've made all the mistakes described here but can take some solace that I've created zero job postings seeking ninjas, rockstars, gurus, or wizards! Best of luck to all the hiring managers out there looking for their unicorns!

Source: Wikipedia: _Kiss (band)_

December 13, 2021

Day 13 - Ephemeral PR Environments: Enabling automated testing at a rapid pace

By: Amar Sattaur
Edited by: Jennifer Davis (@sigje)

Recently, I've been thinking a lot about how to implement the concepts of least privilege while also speeding up the feedback cycle in the developer workflow. However, these two goals don't naturally go hand in hand. To reconcile them, there needs to be underlying tooling and visibility that shows developers the data they need for a successful PR merge.

A developer doesn't care about what those underlying tools are; they just want access to a system where they can:

  • See the logs of the app that they're making a change for and the other relevant apps
  • See the metrics of their app so they can adequately gauge performance impact

One way to achieve this is with ephemeral environments based on PRs. The idea is that when a developer opens a PR, a new environment is automatically spun up based on provided defaults, with the conditions that the environment is:

  • deployed in the same way that dev/stage/prod are deployed, just with a few key elements different
  • labeled correctly so that the NOC/Ops teams know the purpose of these resources
  • integrated with logging/metrics and useful tags so that the engineer can easily see metrics for this given PR build

That sounds like a daunting task but through the use of Kubernetes, Helm, a CI Platform (GitHub Actions in this tutorial) and ArgoCD, you can make this a reality. Let's look at an example application leveraging all of this technology.

Example app

You can find all the code readily available in this GitHub Repo.

Pre-requisites Used in this Example

Tool                    Version
kubectl                 v1.21
Kubernetes Cluster      v1.20.9
Helm                    v3.6.3
ArgoCD                  v2.0.5
kube-prometheus-stack   v0.50.0

The example app that you’re going to deploy today is a Prometheus exporter that exports a custom metric with an overridable label set:

  • The `version` of the deployed app
  • The `branch` of the PR
  • The PR ID

Pipeline

Now that I've defined the goal, let's go a little more in-depth on how you'll get there. First, let's take a look at the PR pipeline in .github/workflows/pull_requests.yml:


---
name: 'Build image and push PR image to ghcr'
on:
  pull_request:
    types: [assigned, opened, synchronize, reopened]
    branches:
      - main

jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
      - name: Checkout
        uses: actions/checkout@v2
      - name: Build image
        uses: docker/build-push-action@v1
        with:
          registry: ghcr.io
          username: ${{ github.repository_owner }}
          password: ${{ secrets.GITHUB_TOKEN }}
          tags: PR-${{ github.event.pull_request.number }}
       

This pipeline runs on pull request events targeting the main branch. So, when you open a PR, push a commit to an existing PR, reopen a closed PR, or assign it to a user, this pipeline will get triggered. It defines two jobs, the first of which is build. It's relatively straightforward: take the Dockerfile that lives in the root of your repo, build a container image out of it, and tag it for use with GitHub Container Registry. The tag is the PR ID of the triggering pull request.

The second job is the one where we deploy to ArgoCD:


 deploy:
    needs: build
    container: ghcr.io/jodybro/argocd-cli:1.1.0
    runs-on: ubuntu-latest
    steps:
      - name: Log into argocd
        run: |
          argocd login ${{ secrets.ARGOCD_GRPC_SERVER }} --username ${{ secrets.ARGOCD_USER }} --password ${{ secrets.ARGOCD_PASSWORD }}
      - name: Deploy PR Build
        run: |
          argocd app create sysadvent2021-pr-${{ github.event.pull_request.number }} \
            --repo https://github.com/jodybro/sysadvent2021.git \
            --revision ${{ github.head_ref }} \
            --path . \
            --upsert \
            --dest-namespace argocd \
            --dest-server https://kubernetes.default.svc \
            --sync-policy automated \
            --values values.yaml \
            --helm-set version="PR-${{ github.event.pull_request.number }}" \
            --helm-set name="sysadvent2021-pr-${{ github.event.pull_request.number }}" \
            --helm-set env[0].value="PR-${{ github.event.pull_request.number }}" \
            --helm-set env[1].value="${{ github.head_ref }}" \
            --helm-set env[2].value="sysadvent2021-pr-${{ github.event.pull_request.number }}"
       

This job runs a custom image that I wrote, which wraps the argocd CLI tool in a container and allows arbitrary commands to be executed against an authenticated ArgoCD instance.

It then creates a Kubernetes object of kind: Application, a custom resource that ArgoCD uses to determine where to pull the application from and how to deploy it (Helm, Kustomize, etc.).
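If you're curious what that object looks like, here's a hedged sketch of a roughly equivalent declarative Application manifest that you could apply yourself. The fields loosely mirror the CLI flags above (only the version parameter is shown), with the branch name and PR number filled in as placeholders:


# A rough, illustrative equivalent of the `argocd app create` call above
kubectl apply -f - <<'EOF'
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: sysadvent2021-pr-1        # placeholder PR number
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/jodybro/sysadvent2021.git
    targetRevision: my-feature-branch   # placeholder for github.head_ref
    path: .
    helm:
      valueFiles:
        - values.yaml
      parameters:
        - name: version
          value: PR-1
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd
  syncPolicy:
    automated: {}
EOF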

Putting it all together

Now, let's see this pipeline in action. First, head to your repo and create a PR against the main branch with some changes; it doesn't matter what the changes are as all PR events will trigger the pipeline.

You can see that my PR has triggered a pipeline which can be viewed here. Furthermore, you can see that this pipeline was executed successfully, so if I go to my ArgoCD instance, I would see an application with this PR ID.

So, if you are following along, you now have two deployments of this example app: one should show labels for the main branch, and one should show labels for the PR branch.

Let's verify by port-forwarding to each and see what you get back.

Main branch

First, let's check out the main branch application:


kubectl port-forward service/sysadvent2021-main 8000:8000 
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
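
With the port-forward running, you can hit the exporter directly. The metric name below is purely illustrative (it depends on the exporter code in the repo), but the output should look roughly like this, carrying the label set described earlier:


curl -s localhost:8000/metrics

# hypothetical output, trimmed to the custom metric
sysadvent_info{version="main",branch="main",name="sysadvent2021-main"} 1.0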
       

As you can see, the branch is set to main with the correct version.

And if you check out the state of our Application in ArgoCD, everything is healthy!

PR

Now let's check the PR deployment:


kubectl port-forward service/sysadvent2021-pr-1 8000:8000 
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
       

This one's labels show the branch and the version from the PR, and the Application for the PR build reports healthy in ArgoCD as well.

Final thoughts

It really is that easy to get PR environments running in your company!

Resources

* Source Code Repo