sysadvent - Systems Administration Advent Calendar

<h1>Day 23 - What is eBPF?</h1>
<p>
By: Ania Kapuścińska (<a href="https://twitter.com/lambdanis">@lambdanis</a>)<br />
Edited by: Shaun Mouton (<a href="https://twitter.com/sdmouton">@sdmouton</a>)
</p>
<p>
Like many engineers, for a long time I’ve thought of the Linux kernel as a black
box. I've been using Linux daily for many years - but my usage was mostly
limited to following the installation guide, interacting with the command line
interface and writing bash scripts.
</p>
<p>
Some time ago I heard about eBPF (extended BPF). The first thing I heard was
that it’s a programmable interface for the Linux kernel. Wait a second. Does
that mean I can now inject my code into Linux without fully understanding all
the internals and compiling the kernel? The answer turns out to be approximately
yes!
</p>
<p>
An eBPF (or BPF - these acronyms are used practically interchangeably) program
is written in a restricted version of C. Restricted, because a dedicated
verifier checks that the program is safe to run in a BPF VM - it can’t crash,
loop infinitely, or access arbitrary memory. If the program passes the check, it
can be attached to some kind of event in the Linux kernel, and run every time
this event happens.
</p>
<p>
A growing ecosystem makes it easier to create tools on top of BPF. One very
popular framework is <a href="https://github.com/iovisor/bcc">BCC</a> (BPF
Compiler Collection), containing a Python interface for writing BPF programs.
Python is a very popular scripting language, for good reason - simple syntax,
dynamic typing and a rich standard library make writing even complex scripts quick
and fun. On top of that, bcc provides easy compilation, event attachment and
output processing of BPF programs. That makes it the perfect tool to start
experimenting with writing BPF code.
</p>
<p>
To run code examples from this article, you will need a Linux machine with a
fairly recent kernel version (supporting eBPF). If you don’t have a Linux
machine available, you can experiment in a Vagrant box. You will also need to <a
href="https://github.com/iovisor/bcc/blob/master/INSTALL.md">install Python bcc
package</a>.
</p>
<h2>Very complicated hello</h2>
<p>
Let’s start in a very unoriginal way - with a “hello world” program. As I
mentioned before, BPF programs are written in (restricted) C. A BPF program
printing “Hello World!” can look like this:
</p>
<p>
hello.c
</p>
<pre
class="prettyprint">#define HELLO_LENGTH 13

BPF_PERF_OUTPUT(output);

struct message_t {
    char hello[HELLO_LENGTH];
};

static int strcp(char *src, char *dest) {
    for (int i = 0; src[i] != '\0'; i++) {
        dest[i] = src[i];
    }
    return 0;
};

int hello_world(struct pt_regs *ctx) {
    struct message_t message = {};
    strcp("Hello World!", message.hello);
    output.perf_submit(ctx, &message, sizeof(message));
    return 0;
}
</pre>
<p>
The main piece here is the hello_world function - later we will attach it to a
kernel event. We don’t have access to many common libraries, so we are
implementing strcp (string copy) functionality ourselves. Extra functions are
allowed in BPF code, but have to be defined as static. Loops are also allowed,
but the verifier will check that they are guaranteed to complete.
</p>
<p>
The way we output data might look unusual. First, we define a perf ring buffer
called “output” using the BPF_PERF_OUTPUT macro. Then we define a data structure
that we will put in this buffer - message_t. Finally, we write to the “output”
buffer using the perf_submit function.
</p>
<p>
Now it’s time to write some Python:
</p>
<p>
hello.py
</p>
<pre
class="prettyprint">from bcc import BPF

b = BPF(src_file="hello.c")
b.attach_kprobe(
    event=b.get_syscall_fnname("clone"),
    fn_name="hello_world"
)

def print_message(_cpu, data, _size):
    message = b["output"].event(data)
    print(message.hello)

b["output"].open_perf_buffer(print_message)

while True:
    try:
        b.perf_buffer_poll()
    except KeyboardInterrupt:
        exit()
</pre>
<p>
We import BPF from bcc - the BPF class is the core of the bcc package's Python
interface to eBPF. It loads our C program, compiles it, and gives us a Python
object to operate on. The program has to be attached to a Linux kernel event -
in this case it will be the clone system call, used to create a new process. The
attach_kprobe method hooks the hello_world C function to the start of a clone
system call.
</p>
<p>
The rest of the Python code reads and prints output. A great feature
provided by bcc is automatic translation of C structures (in this case the “output”
perf ring buffer) into Python objects. We access the buffer with a simple
b[“output”], and use the open_perf_buffer method to associate it with the
print_message function. In this function we read incoming messages with the
event method. The C structure we used to send them gets automatically converted
into a Python object, so we can read “Hello World!” by accessing the hello
attribute.
</p>
<p>
To see it in action, run the script with root privileges:
</p>
<div><pre><code class="language-none">
> sudo python hello.py
</code></pre></div>
<p>
In a different terminal window, run any command, e.g. ls. “Hello World!”
messages will start popping up.
</p>
<p>
Does it look awfully complicated for a “hello world” example? Yes, it does :)
But it covers a lot, and most of the complexity comes from the fact that we are
sending data to user space via a perf ring buffer.
</p>
<p>
In fact, similar functionality can be achieved with much simpler code. We can
get rid of the complex printing logic by using the bpf_trace_printk function to
write a message to the shared trace_pipe. Then, in the Python script we can read
from this pipe using the trace_print method. It’s not recommended for real-world
tools, as trace_pipe is global and the output format is limited - but for
experiments or debugging it’s perfectly fine.
</p>
<p>
Additionally, bcc allows us to write C code inline in the Python script. We can
also use a shortcut for attaching C functions to kernel events - if we name the
C function kprobe__<kernel function name>, it will get hooked to the desired
kernel function automatically. In this case we want to hook into the sys_clone
function.
</p>
<p>
So, hello world, the simplest version, can look like this:
</p>
<pre
class="prettyprint">from bcc import BPF
BPF(text='int kprobe__sys_clone(void *ctx) { bpf_trace_printk("Hello World!\\n"); return 0; }').trace_print()
</pre>
<p>
The output will be different, but what doesn’t change is that while the script
is running, custom code will run whenever a clone system call is starting.
</p>
<h2>What even is an event?</h2>
<p>
Code compilation and attaching functions to events are greatly simplified by the
bcc interface. But a lot of its power lies in the fact that we can glue many BPF
programs together with Python. Nothing prevents us from defining multiple C
functions in one Python script and attaching them to multiple different hook
points.
</p>
<p>
Let’s talk about these “hook points”. What we used in the “hello world” example
is a kprobe (kernel probe). It’s a way to dynamically run code at the beginning
of Linux kernel functions. We can also define a kretprobe to run code when a
kernel function returns. Similarly, for programs running in user space, there
are uprobes and uretprobes.
</p>
<p>
Probes are extremely useful for dynamic tracing use cases. They can be attached
almost anywhere, but that can cause stability problems - a function rename could
break our program. Better stability can be achieved by using predefined static
tracepoints wherever possible. The Linux kernel provides many of those, and for user
space tracing you can define them too (<a
href="https://lwn.net/Articles/753601/">user statically defined tracepoints</a>
- USDTs).
</p>
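<p>
For example, here is a minimal sketch of attaching to a static tracepoint with
bcc's TRACEPOINT_PROBE macro instead of a kprobe (assuming a kernel that exposes
the sched:sched_process_exec tracepoint; run it with root privileges):
</p>
<pre class="prettyprint">from bcc import BPF

program = """
TRACEPOINT_PROBE(sched, sched_process_exec) {
    // runs every time a process calls exec()
    bpf_trace_printk("new program exec'd\\n");
    return 0;
}
"""

BPF(text=program).trace_print()
</pre>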
<p>
Network events are very interesting hook points. BPF can be used to inspect,
filter and route packets, opening a whole sea of possibilities for very
performant networking and security tools. In this category, XDP (eXpress Data
Path) is a BPF framework that allows running BPF programs not only in the Linux
kernel, but also on supported network devices.
</p>
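<p>
As a rough sketch of what that looks like with bcc (assuming an interface named
eth0 whose driver supports XDP - and don't try this on a machine you are logged
into over that interface), a minimal XDP program that drops every incoming packet
can be loaded and attached like this:
</p>
<pre class="prettyprint">from time import sleep
from bcc import BPF

program = """
int xdp_drop_all(struct xdp_md *ctx) {
    return XDP_DROP;   // drop every packet that arrives on the interface
}
"""

b = BPF(text=program)
fn = b.load_func("xdp_drop_all", BPF.XDP)
b.attach_xdp("eth0", fn, 0)

sleep(10)                 # drop traffic on eth0 for ten seconds...

b.remove_xdp("eth0", 0)   # ...then detach the program again
</pre>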
<h2>We need to store data</h2>
<p>
So far I’ve mentioned functions attached to other functions many times. But
interesting computer programs generally have something more than functions -
state that can be shared between function calls. That can be a database or a
filesystem, and in the BPF world it’s BPF maps.
</p>
<p>
BPF maps are key/value pairs stored in the Linux kernel. They can be accessed by
both kernel and user space programs, allowing communication between them.
Usually BPF maps are defined with C macros, and read or modified with <a
href="https://man7.org/linux/man-pages/man7/bpf-helpers.7.html">BPF helpers</a>.
There are several different types of BPF maps, e.g.: hash tables, histograms,
arrays, queues and stacks. In newer kernel versions, some types of maps let you
protect concurrent access with spin locks.
</p>
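<p>
Here is a minimal sketch of a hash table map in action (the map and function
names are made up for illustration): the kernel side counts clone() calls per
process, and the Python side reads the same map afterwards:
</p>
<pre class="prettyprint">from time import sleep
from bcc import BPF

program = """
BPF_HASH(counts, u32, u64);

int count_clone(struct pt_regs *ctx) {
    u32 pid = bpf_get_current_pid_tgid() >> 32;
    counts.increment(pid);
    return 0;
}
"""

b = BPF(text=program)
b.attach_kprobe(event=b.get_syscall_fnname("clone"), fn_name="count_clone")

sleep(10)  # let the kernel side count clone() calls for a while

# the same map is visible from Python as a dictionary-like object
for pid, count in b["counts"].items():
    print("pid %d called clone() %d times" % (pid.value, count.value))
</pre>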
<p>
In fact, we’ve seen a BPF map in action already. The perf ring buffer we
created with the BPF_PERF_OUTPUT macro is nothing more than a BPF map of type
BPF_MAP_TYPE_PERF_EVENT_ARRAY. We also saw that it can be accessed from the Python
bcc script, including automatic translation of the items’ structure into Python
objects.
</p>
<p>
A good, but still simple example of using a hash table BPF map for communication
between different BPF programs can be found in <a
href="https://www.oreilly.com/library/view/linux-observability-with/9781492050193/">“Linux
Observability with BPF” book</a> (or in the <a
href="https://github.com/bpftools/linux-observability-with-bpf/blob/master/code/chapter-4/uretprobes/example.py">accompanying
repo</a>). It’s a script using a uprobe and a uretprobe to measure the duration of a Go
binary’s execution:
</p>
<pre
class="prettyprint">from bcc import BPF

bpf_source = """
BPF_HASH(cache, u64, u64);

int trace_start_time(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 start_time_ns = bpf_ktime_get_ns();
    cache.update(&pid, &start_time_ns);
    return 0;
}
"""

bpf_source += """
int print_duration(struct pt_regs *ctx) {
    u64 pid = bpf_get_current_pid_tgid();
    u64 *start_time_ns = cache.lookup(&pid);
    if (start_time_ns == 0) {
        return 0;
    }
    u64 duration_ns = bpf_ktime_get_ns() - *start_time_ns;
    bpf_trace_printk("Function call duration: %d\\n", duration_ns);
    return 0;
}
"""

bpf = BPF(text = bpf_source)
bpf.attach_uprobe(name = "./hello-bpf", sym = "main.main", fn_name = "trace_start_time")
bpf.attach_uretprobe(name = "./hello-bpf", sym = "main.main", fn_name = "print_duration")
bpf.trace_print()
</pre>
<p>
First, a hash table called “cache” is defined with the BPF_HASH macro. Then we
have two C functions: trace_start_time writing the process start time to the map
using cache.update(), and print_duration reading this value using
cache.lookup(). The former is attached to a uprobe, and the latter to a uretprobe
for the same function - main.main in the hello-bpf binary. That allows
print_duration to, well, print the duration of the Go program execution.
</p>
<h2>Sounds great! Now what?</h2>
<p>
To start using the bcc framework, visit its GitHub repo. There is a <a
href="https://github.com/iovisor/bcc/blob/master/docs/tutorial_bcc_python_developer.md">developer
tutorial</a> and a <a
href="https://github.com/iovisor/bcc/blob/master/docs/reference_guide.md">reference
guide</a>. Many tools have been built on the bcc framework - you can learn about them
from a <a
href="https://github.com/iovisor/bcc/blob/master/docs/tutorial.md">tutorial</a>
or check <a href="https://github.com/iovisor/bcc/tree/master/tools">their
code</a>. It’s great inspiration and a great way to learn - the code of a single
tool is usually not overly complicated.
</p>
<p>
Two goldmines of eBPF resources are <a href="https://ebpf.io/">ebpf.io</a> and
<a href="https://project-awesome.org/zoidbergwill/awesome-ebpf">eBPF awesome
list</a>. Start browsing any of those, and you have all your winter evenings
sorted :)
</p>
<p>
Have fun!
</p>

<h1>Day 22 - So, You're Incident Commander, Now What?</h1>
<p>
By: Joshua Timberman (<a href="https://twitter.com/jtimberman">@jtimberman</a>)<br />
</p>
<p>
You’re the SRE on call and, while working on a project, your phone buzzes with
an alert: “Elevated 500s from API.”
</p>
<p>
You’re a software developer, and your team lead posts in Slack: “Hey, the
library we use for our payment processing endpoint has a remote exploit.”
</p>
<p>
You work on the customer success team and, during a routine sync with a
high-profile customer, they install the new version of your client CLI. Then,
when they run <em>any</em> command, it exits with a non-zero return code.
</p>
<p>
An incident is <em>any</em> situation that disrupts the ability of customers to
use a system, service, or software product in a safe and secure manner. And in
each of the incidents above, the person who noticed the incident first will most
likely become the incident commander. So, now what?
</p>
<h2>What does it mean to be an incident commander?</h2>
<p>
Once an individual identifies an incident, one or more people will respond to
it. Their goal is to resolve the incident and return systems, services, or other
software back to a functional state. While an incident may have a few or many
responders—only one person is the incident commander. This role is not about
experience, seniority, or position on an org chart; it is to ensure that
progress is being made to resolve the incident. The incident commander must
think about the inputs from the incident, make decisions about what to do next,
and communicate with others about what is happening. The incident commander also
determines when the incident is resolved based on the information they have.
After the incident is over, the incident commander is also responsible for
conducting a post-incident analysis and review to summarize what happened, what
the team learned, and what will be done to mitigate the risk of a similar
incident happening in the future.
</p>
<p>
Having a single person—the incident commander—be responsible for handling the
incident, delegating responsibility to others, determining when the incident is
resolved, and conducting the post-incident review is one of the most effective
incident management strategies.
</p>
<h2>How do you become an incident commander?</h2>
<p>
Organizations vary on how a team member can become an incident commander. Some
call upon the first responder to an incident. Others require specific training
and have an on-call rotation of just incident commanders. However you find
yourself in the role of incident commander, you should be trusted and empowered
by your organization to lead the effort to resolve the incident.
</p>
<h2>Now what?</h2>
<p>
Now that you’re incident commander, follow your organization’s incident response
procedure for the specifics about what to do. But for more general questions,
we’ve got some guidance.
</p>
<h3>What are the best strategies for communication and coordination?</h3>
<p>
One of an incident commander’s primary tasks is to communicate with relevant
teams and stakeholders about the status of the incident and to coordinate with
other teams to ensure the right people are involved.
</p>
<p>
If your <a
href="https://allma.io/blog/effective-incident-management-using-slack">primary
communication tool is Slack</a>, use a separate channel for each incident.
Prefix any time-sensitive notes with “timeline” or “TL” so they are easy to find
later. If higher-bandwidth communication is required, use a video conference,
and keep the channel updated with important information and interactions. When
an incident affects external customers, be sure to update them as required by
your support teams and agreements with customers.
</p>
<p>
In the case of a security incident, there may be additional communications
requirements with your organization’s legal and/or marketing teams. Legal
considerations to communicate may include:
</p>
<ul>
<li>Statutory or regulatory reporting
<li>Contractual commitments and obligations to customers
<li>Insurance claims
</li>
</ul>
<p>
Marketing considerations to communicate may include:
</p>
<ul>
<li>Sensitive information from customer data exposure
<li>“Zero Day” exploits
<li>Special messaging requirements, e.g. for publicly traded companies
</li>
</ul>
<h3>When should you hand off a long-running incident?</h3>
<p>
During an extended outage or other long-running incident, you will likely need a
break. Whether you are feeling overwhelmed, would contribute better
by working on a solution for the incident itself, or need to eat, take
care of your family, or sleep—all are good reasons to hand off incident
command to someone else.
</p>
<p>
Coordinate with your other responders in the appropriate channel, whether that’s
in a Slack chat or in a Zoom meeting. If necessary, escalate by having someone
paged out to get help. Once someone else can take over, communicate with them
on the latest progress, the current steps being taken, and who else is involved
with the incident. Remember, we’re all human and we need breaks.
</p>
<h3>How should you approach post-incident analysis and review?</h3>
<p>
One of an incident commander’s most important jobs is to conduct a post-incident
analysis and review after the incident is resolved. This meeting must be
<em>blameless</em>: That is, the goal of the meeting is to learn what happened,
determine what contributing factors led to the incident, and take action to
mitigate the risk of such an incident happening in the future. It’s also to
establish a timeline of events, demonstrate an understanding of the problems,
and set up the organization for future success in mitigating that problem.
</p>
<p>
The sooner the incident analysis and review meeting occurs after the incident is
resolved, the better. You should ensure adequate rest time for yourself and
other responders, but the review meeting should happen within 24 hours—and
ideally not longer than two business days after the incident. The incident
commander (or commanders) must attend, as they have the most context on what
happened and what decisions were made. Any responders who performed significant
remediation steps or investigation must also attend so they can share what they
learned and what they did during the incident.
</p>
<p>
Because the systems that fail and cause incidents are complex, a good analysis
and review process is complex. Let’s break it down:
</p>
<h4>Describe the incident</h4>
<p>
The incident commander will describe the incident. This description should
detail the impact as well as its scope, i.e., whether the incident affected
internal or external users, how long it took to discover, how long it took to
recover, and what major steps were taken to resolve the incident.
</p>
<p>
“The platform was down” is not a good description.
</p>
<p>
“On its 5 minute check interval, our monitoring system alerted the on-call
engineer that the API service was non-responsive, which meant external customers
could not run their workflows for 15 minutes until we were able to restart the
message queue” is a good description.
</p>
<h4>Contributing factors</h4>
<p>
Successful incident analysis should identify the contributing factors and places
where improvements can be made to systems and software. Our world is complex,
and technology stacks have multiple moving parts and places where failures
occur. Not only can a contributing factor be something technical like “a
configuration change was made to an application,” it can be nontechnical like
“the organization didn’t budget for new hardware to improve performance.” In
reviewing the incident for contributing factors, incident commanders and
responders are looking for areas for improvement in order to identify potential
corrective actions.
</p>
<h4>Corrective action items</h4>
<p>
Finally, incident analysis should determine corrective action items. These must
be specific work items that are assigned to a person or a team accountable for
their completion, and they must be the primary work priority for that person or
team. These aren’t “nice to have,” these are “must do to ensure the safe and
reliable operation of the site or service.” Such tasks aren’t necessarily the
actions taken during the incident to stabilize or remediate a problem, which are
often temporary workarounds to restore service. A corrective action can be as
simple as adding new monitoring alerts or system metrics that weren’t
implemented before. It can also be as complex as rebuilding a database cluster
with a different high availability strategy or migrating to a different database
service entirely.
</p>
<h2>Conclusion</h2>
<p>
If you’ve recently been the incident commander for your first
incident—congratulations. You’ve worked to solve a hard problem that had a lot
of moving parts. You took on the role and communicated with the relevant teams
and stakeholders. Then, you got some much needed rest and conducted a successful
post-incident analysis. Your team identified corrective actions, and your site
or service is going to be more reliable for your customers.
</p>
<p>
Incident management is one of the most stressful aspects of operations work for
DevOps and SRE professionals. The first time you become an incident commander,
it may be confusing or upsetting. Don’t panic. You’re doing just fine, and
you’ll keep getting better.
</p>
<h2>Further reading</h2>
<p>
If you’re new to post incident analysis and review, check out <a
href="https://www.jeli.io/howie-the-post-incident-guide/">Howie: The
Post-Incident Guide</a> from Jeli.
</p>
<p>
PagerDuty also has extensive documentation on <a
href="https://response.pagerduty.com/">incident response</a> and <a
href="https://response.pagerduty.com/training/incident_commander/">incident
command</a>.
</p>

<h1>Day 20 - To Deploy or Not to Deploy? That is the question.</h1>
<p>
By: Jessica DeVita (<a href="https://twitter.com/ubergeekgirl">@ubergeekgirl</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Deployment Decision-Making during the holidays amid the COVID-19 Pandemic
</p>
<p>
<em>A sneak peek into my forthcoming MSc. thesis in Human Factors and Systems
Safety, Lund University</em>.
</p>
<p>
Web services that millions of us depend on for work and entertainment require
vast compute resources (servers, nodes, networking) and interdependent software
services, each configured in specialized ways. The experts who work on these
distributed systems are under <em>enormous</em> pressure to deploy new features,
<em>and</em> keep the services running, so deployment decisions are happening
hundreds or thousands of times every day. While automated testing and deployment
pipelines allow for frequent production changes, an engineer making a change
wants confidence that the automated testing system is working. However,
automating the testing pipeline makes the test-and-release process more opaque
to the engineer, making it difficult to troubleshoot.
</p>
<p>
</p>
<p>
When an incident occurs, the decisions preceding the event may be brought under
a microscope, often concluding that “human error” was the cause. As society
increasingly relies on web services, it is imperative to understand the
tradeoffs and considerations engineers face when they decide to deploy a change
into production. The themes uncovered through this research underscore the
complexity of engineering work in production environments and highlight the role
of relationships with co-workers and management on deployment decision-making.
</p>
<h2>There’s No Place Like Production</h2>
<p>
Many deployments are uneventful and proceed without issues, but unforeseen
permissions issues, network latency, sudden increases in demand, and security
vulnerabilities may only manifest in production. When asked to describe
a recent deployment decision, engineers reported intense feelings of uncertainty
as they could not predict how their change would interact with changes elsewhere
in the system. More automation isn’t always the solution, as one engineer
explains:
</p>
<p>
<em>“I can’t promise that when it goes out to the entire production fleet
that the timing won’t be wrong. It’s a giant Rube Goldberg of a race condition.
It feels like a technical answer to a human problem. I’ve seen people set up
Jenkins jobs with locks that prevent other jobs from running until it’s
complete. How often does it blow up in your face and fail to release the lock?
If a change is significant enough to worry about, there should be a human
shepherding it. Know each other’s names. Just talk to each other; it’s not that
hard.”</em>
</p>
<h2>Decision-making Under Pressure</h2>
<p>
<em>“The effects of an action can be totally different, if performed too
early or too late. But the right time is not clock time: it depends upon the
precise state of the process evolution”</em> (De Keyser, 1990).
</p>
<p>
Some engineers were under pressure to deploy fixes and features before the
holidays, while other engineers were constrained by a "<em>code freeze</em>",
when during certain times of the year, they “<em>can’t make significant
production changes that aren’t trivial or that fix something</em>”. One engineer
felt that they could continue to deploy to their test and staging environments
but warned, “... <em>a lot of things in a staging environment waiting to go out
can compound the risk of the deployments.”</em>
</p>
<p>
Responding to an incident or outage at any time of the year is challenging, but
even more so because of <em>“oddities that happen around holidays”</em> and
additional pressures from management, customers, and the engineers themselves.
Pairing or working together was often done as a means to increase confidence in
decision making. Pairing resulted in joint decisions, as engineers described
actions and decisions with “we”: <em>“So that was a late night. When I hit
something like that, it involves a lot more point-by-point communications with
my counterpart. For example, ‘I'm going to try this, do you agree this is a good
thing? What are we going to type in?’”</em>
</p>
<p>
Engineers often grappled with "<em>clock time</em>" and reported that they made
certain sacrifices to “<em>buy more time</em>” to make further decisions. An
engineer expressed that a change <em>“couldn’t be decided under pressure in the
moment”</em> so they implemented a temporary measure. Fully aware of
the potential for their change to trigger new and different problems, engineers
wondered what they could do <em>“without making it worse”</em>.
</p>
<p>
When triaging unexpected complications, engineers sometimes <em>“went down
rabbit holes”</em>, exemplifying a cognitive fixation known as a “failure to
revise” (Woods & Cook, 1999). Additionally, having pertinent knowledge does not
guarantee that engineers can apply it in a given situation. For example, one
engineer recounted their experience during an incident on Christmas Eve:
</p>
<p>
<em>“...what happens to all of these volumes in the meantime? And so then
we're just thinking of the possible problems, and then [my co-worker] suggested
resizing it. And I said, ‘Oh, can you do that to a root volume?’ ‘Cause I hadn't
done that before. I know you can do it to other volumes, but not the
root.’”</em>
</p>
<p>
Incidents were even <em>more</em> surprising in systems that rarely fail. For
one engineer working on a safety critical system, responding to an incident was
like a <em>“third level of panic”</em>.
</p>
<h3>Safety Practices</h3>
<p>
The ability to roll back a deployment was a critically important capability that
for one engineer was only possible because they had “<em>proper safety practices
in place”.</em> However, rollbacks were not guaranteed to work, as another
engineer explained:
</p>
<p>
<em>“It was a fairly catastrophic failure because the previous migration
with a typo had partially applied and not rolled back properly when it failed.
The update statement failed, but the migration tool didn’t record that it had
attempted the migration, because it had failed. It did not roll back the
addition, which I believed it would have done automatically”.</em>
</p>
<h3>Sleep Matters</h3>
<p>
One engineer described how they felt that being woken up several times during
the night was a direct cause of taking down production during their on-call
shift:
</p>
<p>
<em>“I didn't directly connect that what I had done to try to fix the page
was what had caused the outage because of a specific symptom I was seeing… I
think if I had more sleep it would have gotten fixed sooner”. </em>
</p>
<p>
Despite needing <em>“moral support”</em>, engineers didn’t want to wake up their
co-workers in different time zones: <em>“You don't just have the stress of the
company on your shoulders. You've got the stress of paying attention to what
you're doing and the stress of having to do this late at night.”</em> This was
echoed in another engineer’s reluctance to page co-workers at night as they
“<em>thought they could try one more thing, but it’s hard to be self-aware in
the middle of the night when things are broken, we’re stressed and tired”.</em>
</p>
<p>
Engineers also talked about the impacts of a lack of sleep on their
effectiveness at work as <em>“not operating on all cylinders”</em>, and no
different than having 3 or 4 drinks: “<em>It could happen in the middle of the
night when you're already tired and a little delirious. It's a form of
intoxication in my book.</em>”
</p>
<h3>Blame Culture</h3>
<p>
<em>“What's the mean time to innocence? How quickly can you show that it's not a
problem with your system?”</em>
</p>
<p>
Some engineers described feeling that management was blameful after incidents
and untruthful about priorities. For example, an engineer described the
aftermath of a difficult database migration: <em>“Upper management was not
straightforward with us. We compromised our technical integrity and our
standards for ourselves because we were told we had to”.</em>
</p>
<p>
Another engineer described a blameful culture during post-incident review
meetings:
</p>
<p>
<em>“It is a very nerve-wracking and fraught experience to be asked to come
to a meeting with the directors and explain what happened and why your product
broke. And because this is an interwoven system, everybody's dependent on us and
if something happens, then it’s like ‘you need to explain what happened because
it hurt us.”</em>
</p>
<p>
Engineers described their errors as <em>“honest mistakes”</em> as they made
sense of these events after the fact. Some felt a strong sense of personal
failure, and that their actions were the cause of the incident, as this engineer
describes:
</p>
<p>
<em>“We are supposed to follow a blameless process, but a lot of the time
people self-blame. You can't really shut it down that much because frankly they
are very causal events. I'm not the only one who can't really let go of it. I
know it was because of what I did.”</em>
</p>
<p>
Not all engineers felt they could take <em>“interpersonal risks”</em> or admit a
lack of knowledge without fear of <em>“being seen as incompetent”</em>.
Synthesizing theories of psychological safety with this study’s findings, it
seems clear that environments of psychological safety may increase engineers’
confidence in decision making (Edmondson, 2002).
</p>
<h2>What Would They Change?</h2>
<p>
Engineers were asked “If you could wave a magic wand, what would you change
about your current environment that would help you feel more confident or safe
in your day-to-day deployment decisions?
</p>
<p>
In addition to <em>“faster CI and pre-deployments”, </em>engineers overarchingly
spoke about needing better testing. One participant wanted a better way to test
front-end code end-to-end, <em>"I return to this space every few years and am a
bit surprised that this still is so hard to get right”. </em>In another mention
of improved testing, an engineer wanted “<em>integration tests that exercise the
subject component along with the graph of dependencies (other components,
databases, etc.), using only public APIs. I.e., no "direct to database"
fixtures, no mocking”.</em>
</p>
<h2>Wrapping Up</h2>
<p>
Everything about engineers’ work was made more difficult in the face of a global
pandemic. In the “before times” engineers could <em>“swivel their chair”</em> to
get a <em>“second set of eyes”</em> from co-workers before deploying. While
some engineers in the study had sophisticated deployment automation, others
spoke of manual workarounds with heroic scripts written ‘on the fly’ to repair
the system when it failed. Engineers grappled with the complexities of
automation, and the risk and uncertainty associated with decisions to deploy.
Most engineers using tools to automate and manage configurations did
<em>not</em> experience relief in their workload. They had to maintain skills in
manual interventions when the automation did not work as expected or when they
could not discern the machine’s state. Such experiences highlight the continued
relevance of Lisanne Bainbridge’s (1983) research on the Ironies of Automation
which found that “the more advanced a control system is, the more crucial the
role of the operator”.
</p>
<p>
This study revealed that deployment decisions <em>cannot</em> be understood
independently from the social systems, rituals, and organizational structures in
which they occurred (Pettersen, McDonald, & Engen, 2010). So when a deployment
decision results in an incident or outage, instead of blaming the engineer,
consider the words of James Reason (1990) who said “<em>...operators tend to be
the inheritors of system defects…adding the final garnish to a lethal brew whose
ingredients have already been long in the cooking”.</em> Engineers may bring
their previous experiences to deployment decisions, but the tools and conditions
of their work environment, historical events, power structures, and hierarchy
are what <em>“enables and sets the stage for all human action.”</em> (Dekker &
Nyce, 2014, p. 47).
</p>
<p>
____
</p>
<p>
This is an excerpt from Jessica’s forthcoming thesis. If you’re interested in
learning more about this deployment decision-making study or would like to
explore future research opportunities, send Jessica a message on <a
href="https://twitter.com/ubergeekgirl">Twitter</a>.
</p>
<p>
<strong>References</strong>
</p>
<p>
Bainbridge, L. (1983). Ironies of automation. In G. Johannsen & J. E.
Rijnsdorp (Eds.), <em>Analysis, Design and Evaluation of Man–Machine
Systems</em> (pp. 129–135). Pergamon.
</p>
<p>
De Keyser, V. (1990). Temporal decision making in complex environments.
<em>Philosophical Transactions of the Royal Society of London. Series B,
Biological Sciences</em>, 327(1241), 569–576.
</p>
<p>
Dekker, S. W. A., & Nyce, J. M. (2014). There is safety in power, or power
in safety. <em>Safety Science</em>, 67, 44–49.
</p>
<p>
Edmondson, A. C. (2002). <em>Managing the risk of learning: Psychological
safety in work teams</em>. Citeseer.
</p>
<p>
Pettersen, K. A., McDonald, N., & Engen, O. A. (2010). Rethinking the role
of social theory in socio-technical analysis: a critical realist approach to
aircraft maintenance. <em>Cognition, Technology & Work</em>, 12(3), 181–191.
</p>
<p>
Reason, J. (1990). <em>Human Error</em> (pp. 173–216). Cambridge University
Press.
</p>
<p>
Woods, D. D., & Cook, R. I. (1999). Perspectives on Human Error: Hindsight
Bias and Local Rationality. In <em>In F. Durso (Ed.) Handbook of Applied
Cognitive Psychology</em>. Retrieved 9 June 2021 from
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.474.3161
</p>

<h1>Day 19 - Into the World of Chaos Engineering</h1>
<p>
By: Julie Gunderson (<a href="https://twitter.com/Julie_Gund">@Julie_Gund</a>)<br />
Edited by: Kerim Satirli (<a href="https://twitter.com/ksatirli">@ksatirli</a>)
</p>
<h2>Intro</h2>
<p>
I recently left my role as a DevOps Advocate at PagerDuty to join the Gremlin
team as a Sr. Reliability Advocate. The past few months have been an immersive
experience into the world of Chaos Engineering and all things reliability. That
said, my foray into the world of Chaos Engineering started long before joining
the Gremlin team.
</p>
<p>
From my time as a lab researcher, to being a single parent, to dealing with
cancer, I have learned that the journey of unpredictability is everywhere. I
could never have imagined in college that I would end up doing what I do now. As
I reflect on the path I have taken to where I am today, I realize one thing:
chaos was always right there with me. My start in tech was as a recruiter and
let me tell you: there is no straight line that leads from recruiting to
advocacy. I experimented in my career, tried new things, failed more than a few
times, learned from my experiences and made tweaks. Being a parent is very
similar: most experiences you have along the way fall in one of two
camps: mistakes or learning. With cancer, there was, and is, a lot of
experimenting and learning. Even with the brightest of minds, every person’s
system handles treatments differently. Luckily, I had, and still have, others who
mentor me both professionally and personally, people who help me improve along
the way, and I learned that chaos is a part of life that can be harnessed for
positive change.
</p>
<p>
Technical systems have a lot of similarities to our life experiences: we think
we know how they are going to act, but suddenly a monkey wrench gets thrown into
the mix and poof, all bets are off. So what do we do? We experiment, we try new
things, we follow the data, we don’t let failure stop us in our tracks, and we
learn how to build resiliency in.
</p>
<p>
We can’t mitigate every possible issue out in the wild, but we should be proactive
in identifying potential failure modes. We need to prepare folks to handle
outages in a calm and efficient manner. We need to remember that there are users
on the other end of those ones and zeros. We need to keep our eye on the
reliability needle. Most of all, we need to have empathy for our co-workers, and
remember that we are all in this together, and that we don’t need to be afraid
of failure.
</p>
<h2>Talking about Chaos in the System</h2>
<p>
When a system or provider goes down (cough, cough us-east-1), people notice, and
they share their frustrations, widely. Long Twitter rants are one thing, the
media’s reaction is another: – outages make great headlines, and the old adage
of “all press is good press” doesn’t really hold up anymore. Brand awareness is
one thing, however, great SEO numbers based off of a headline in the New York
Times that calls out a company for being down might not be the best way to go
about it.
</p>
<h2>What is Chaos Engineering</h2>
<p>
So what is Chaos Engineering, and more importantly: why would you want to
engineer Chaos? Chaos Engineering is one of those things that is just
unfortunately named. After all, the practice has evolved a lot from the time
when Jesse Robbins coined the term GameDays, to the codified processes we have
in place today. The word “chaos” can still unfortunately lead to anxiety across
the management team(s) of a company. But, fear not, the practice of Chaos
Engineering helps us all create those highly reliable systems that the world
depends on, it builds a culture of learning, and teaches us all to embrace
failure and improve.
</p>
<p>
Chaos Engineering is the practice of proactively injecting failure into your
systems to identify weaknesses. In a world where everyone relies on digital
systems in some way, shape, or form, almost all of us have a focus on
reliability. After all: the cost of downtime can be astronomical!
</p>
<p>
My studies started at the University of Idaho in microbiology. I worked as a
researcher and studied the effects of carbon dioxide (CO2) on the short-term
storage success of Chinook salmon milt (spoiler alert: there is no advantage to
using CO2). That’s where I learned that <strong>effective</strong> research
requires the use of the scientific method:<br>
</p>
<ol>
<li>Observe the current state
<li>Create a hypothesis
<li>Run experiments in a controlled, consistent environment
<li>Analyze the data
<li>Repeat the experiments and data analysis
<li>Share the results
</li>
</ol>
<p>
In the research process, we focused on one thing at a time, we didn’t introduce
all the variables at once, we built on our experiments as we gathered and
analyzed the data. For example, we started off with the effects of CO2 and once
we had our data we introduced clove oil into the study. Once we understood the
effect on Chinook we moved to Sturgeon, and so on.
</p>
<p>
Similarly, you want to take a scientific approach when identifying weaknesses in
your systems with Chaos Engineering; a key difference is the system
under study: your technical and sociotechnical systems, vs. CO2 and
Chinook salmon milt (also, there are no cool white coats). With Chaos
Engineering you aren’t running around unplugging everything at once, or
introducing 100% memory consumption on all of your containers at the same time;
you take little steps, starting with a small blast radius and increasing that
blast radius so you can understand where the failure has impact.
</p>
<img height="134" src="https://lh3.googleusercontent.com/rvY18DoLrWsIahDlhnI4x-WGDNq7P0s5serUySR7OslmizjdRlkAvdcY2wQcCFu8UFs-AZJI5wEBqTCPlWTdBerob_3Cw7A9i1ymyPuvYNsO3CA3r0oWYWTLbTHyMRlKJws3eXkh" style="margin-left: 0px; margin-top: 0px;" width="203" />
<h2>How do we get there</h2>
<h3>Metrics</h3>
<p>
At PagerDuty, I focused on best practices around reducing <em>Mean Time to
Detection</em> (MTTD) and <em>Mean Time to Resolution</em> (MTTR) of incidents,
and then going beyond those metrics to learning and improvement. I often spoke
about Chaos Engineering and how through intentionally injecting failure into our
systems, we could not only build more reliable systems, we could build a culture
of blamelessness, learning, and continuous improvement.
</p>
<p>
In my time at Gremlin, I have seen a lot of folks get blocked at the very
beginning when it comes to metrics such as MTTD and MTTR. Some organizations may
not have the right monitoring tools in place, or are just at the beginning of the
journey into metric collection. It’s okay if everything isn’t perfect here; the
fact is you can just pick a place to start: one or two metrics to start
collecting that give you a baseline to measure improvement from. As far as
monitoring is concerned, you can use Chaos Engineering to validate what you do
have, and make improvements from there.
</p>
<h3>People</h3>
<p>
On the people-side of our systems, being prepared to handle incidents takes
practice. Waking up at 2am to a barbershop quartet alert singing “The Server is
on Fire” is a blood-pressure-raising experience; however, that stress can be
reduced through practice.
</p>
<p>
For folks who are on-call, it’s important to give them some time to learn the
ropes before tossing them into the proverbial fire. Give folks a chance to
practice incident response through Chaos Engineering, run GameDays and
FireDrills, where people can learn in a safe and controlled environment what the
company process looks like in action. This is also a great way to validate your
alerting mechanisms and response process. At PagerDuty we had a great joint
workshop with Gremlin where people could practice incident response with Chaos
Engineering to learn about the different roles and responsibilities and
participate in mock incident calls. As a piano player, I had to build the muscle
memory needed to memorize Beethoven’s Moonlight Sonata by practicing over, and
over, and over for months. Similar to learning a musical instrument, practicing
incident response builds the muscle memory needed to reduce the stress on those
2am calls. If I can stress (no pun intended) anything from my experiences in
life, it is that repetition and practice are essential elements to handling
surprises calmly.
</p>
<p>
Building a culture of accepting failure as a learning opportunity takes bravery
and doesn’t happen overnight. Culture takes practice, empathy, and patience, so
make sure to take the time to thank folks for finding bugs, for making mistakes,
for accepting feedback, and for the willingness to learn.
</p>
<h4>Speak the language</h4>
<p>
As I mentioned before, sometimes we just have things that are unfortunately
named. Many of us have the opportunity to attend conferences, read articles
and blogs, earn certifications, and so on. It’s important to remember that
leadership often doesn't have the time to do those things. We as individual
contributors, team leaders, engineers, whatever our title may be, need to be
well equipped to speak effectively to our audience; leaders need to understand
the message we are trying to convey. I have found that using the phrase “just
trust me” isn’t always an effective communication tool. I had to learn how to
talk to decision makers and leadership in the terms they used, such as business
objectives, business outcomes, Return on Investment (ROI). By communicating the
business case I was trying to solve, they could connect the dots to the ROI of
adopting and sponsoring new ways of working.
</p>
<h2>It’s a Wrap</h2>
<p>
To sum it up, chaos is part of our lives from the moment we are born, from
learning to walk to learning to code, and all of the messiness in between. We
don’t need to be afraid of experimentation, but we should be thoughtful with our
tests, and be open to learning. For me personally, this next year I plan on
learning to play Bohemian Rhapsody, and professionally, I plan on experimenting
with AWS and building a multi-regional application to test ways to be more
resilient in the face of outages. Wish me luck, I think I’ll need it on both
fronts.
</p>
<p>
Happy holidays, and may the chaos be with you.
</p>

<h1>Day 18 - Minimizing False Positive Monitoring Alerts with Checkmk</h1>
<p>
By: Elias Voelker (<a href="https://twitter.com/Elijah2807">@Elijah2807</a>) and Faye Tandog (<a href="https://twitter.com/fayetandog">@fayetandog</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Good IT monitoring stands and falls with its precision. Monitoring must inform you at
the right time when something is wrong. But similar to statistics, you also have
to deal with errors produced by your monitoring. In this post, I will talk about
two types of errors - false positives and false negatives. And similar again to
statistics, you can’t eliminate these errors completely in your monitoring. The
best you can do is manage them and optimize for an acceptable level of errors.
</p>
<p>
In this article, I share ways in which you can fine-tune
notifications from your monitoring system, to alleviate noisy
alerts and ideally receive only those alerts that are really relevant.
</p>
<p>
Fine-tuning notifications is one of the most important and
rewarding activities when it comes to configuring your monitoring system. The
impact of a well-defined notification setup is felt immediately. First and
foremost, your team will benefit from better focus due to less ‘noise’.
This ultimately results in better service levels and higher service level objective (SLO) attainment
across the board.
</p>
<p>
In this article, I talk about ‘alerts’ and ‘notifications’ using them interchangeably.
An ‘alert’ or ‘notification’ is your monitoring system letting you know that
something is supposedly wrong. Depending on your setup, this may be via email, text or a
trouble ticket in PagerDuty.
</p>
<p>
When I talk about a ‘monitoring system’, I’m referring to both ‘traditional’ IT
infrastructure and application monitoring tools such as Nagios, Zabbix, or
Solarwinds Orion, as well as cloud-based monitoring solutions such as
Prometheus, Datadog or Sensu.
</p>
<h2>Types of Alert Errors</h2>
<p>
Let’s start by examining two common alert errors: false positives and false
negatives.
</p>
<p>
A false positive would be your monitoring tool alerting about an issue when in
reality the monitored system is perfectly fine (or has recovered in the
meantime). It could be a server or service being shown as DOWN because there was a short
glitch in the network connection, or a specific service instance - for example, Apache - restarting to rotate its logs.
</p>
<p>
False negatives are when your monitoring system does not alert you, although
something really is wrong. If you're running an on-prem infrastructure and your
firewall is down, you want to know about it. If your monitoring system for some
reason does not alert you about this, your network may be exposed to all kinds
of threats, which can get you into real trouble, really quickly.
</p>
<p>
However, the cost of erroneous alerting can differ vastly. Hence, when IT Ops
teams try to determine the acceptable level of false positives versus an
acceptable level of false negatives, they will often deem false positives more
acceptable, because a false negative could be a mission-critical system down and
not alerting, whereas a false positive might just be one unnecessary notification that’s
quickly deleted from your inbox.
</p>
<p>
This is why they will err on the side of caution and notify, which is totally
understandable. The consequence, however, is that these teams get drowned in
meaningless alerts, which increases the risk of overlooking a critical one.
</p>
<p>
After all, notifications will only be of help when no — or only occasional — false
alarms are produced.
</p>
<p>
In this article, I use Checkmk to show examples of minimizing false positive
alerting. You can apply the same philosophy with other tools though they may vary in implementation and functionality.
</p>
<h2><strong>1. Don’t alert. </strong></h2>
<p>
My first tip to improve monitoring and reduce the noise of false notifications
is to simply not send notifications. Seriously!
</p>
<p>
In Checkmk, notifications are actually optional. The monitoring system can still
be used efficiently without them. Some large organizations have a sort of
control panel in which an ops team is constantly monitoring the Checkmk
interface. As they will be visually alerted, additional notifications are
unnecessary.
</p>
<p>
These are typically users that can’t risk any downtime of their IT at all, like
a stock exchange for example. They use the problem dashboards in Checkmk to
immediately see the issue and its detail. As the lists are mostly empty, it is
pretty clear when something red pops up on a big dashboard.
</p>
<p>
But in my opinion, this is rather the exception. Most people use some way of
notifying their ops and sysadmin teams, be it through email, SMS or notification
plugins for ITSM tools such as ServiceNow, PagerDuty or Splunk OnCall.
</p>
<h2><strong>2. Give it time</strong></h2>
<p>
So if you’ve decided you don’t want to go down the ‘no notifications’ route from
my previous point, you need to make sure that your notifications are finely
tuned to only notify people in case of real problems.
</p>
<p>
The first thing to tell your monitoring is: Give the system being monitored
time.
</p>
<p>
Some systems produce sporadic and short-lived errors. Of course, what you really
should do is investigate and eliminate the reason for these sporadic problems,
but you may not have the capacity to chase after all of them.
</p>
<p>
You can reduce alarms from systems like that in two ways:
</p>
<ul>
<li>You can simply delay notifications to be sent only after a specified time
AND if the system state hasn’t changed back to OK in the meantime.
<li>You can alert on the number of failed checks. For Checkmk this is the ‘<a
href="https://docs.checkmk.com/latest/en/intro_finetune.html#sporadic_errors">Maximum
number of check attempts for service</a>’ rule set. This will make the
monitoring system check for a defined number of times before triggering a
notification. By multiplying the number of check attempts with your defined
check interval you can determine how much time you want to give the system. The
default Checkmk interval is 1 minute, but you can configure this differently.
</li>
</ul>
<p>
The two options are slightly different in how they treat the monitored system.
By using the number of failed checks, you can be sure that the system has really
been re-checked. If you alert only based on time and you (or someone else)
changed the check interval to a longer timeframe you gain nothing. In Checkmk
specifically there are some other factors as well, but that’s out of scope for
this article. The essential effect is: By giving a system a bit of time to
‘recover’, you can avoid a bunch of unnecessary notifications.
</p>
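<p>
To make the idea concrete, here is a small Python sketch - not Checkmk code, just
an illustration of the principle - that only notifies after a number of
consecutive failed checks:
</p>
<pre class="prettyprint">MAX_ATTEMPTS = 3      # with a 1-minute check interval this gives ~3 minutes
failed_attempts = 0

def notify(message):
    print("ALERT:", message)          # stand-in for email/SMS/PagerDuty

def handle_check_result(state_is_ok):
    """Called once per check interval with the latest result."""
    global failed_attempts
    if state_is_ok:
        failed_attempts = 0           # recovered on its own - no notification
        return
    failed_attempts += 1
    if failed_attempts == MAX_ATTEMPTS:
        notify("service still failing after %d consecutive checks" % failed_attempts)

# a short glitch stays silent, a persistent problem eventually alerts
for result in [True, False, True, False, False, False]:
    handle_check_result(result)
</pre>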
<p>
This method also works great for ‘self-healing’ systems that should recover on
their own; for example, you wouldn’t want to get a notification for a cloud
provider killing an instance to upgrade it when your code will automatically
deploy a new container instance to handle requests
</p>
<p>
Of course, this is not an option for systems that are mission-critical with zero
downtime that require rapid remediation. For example, a hedge-fund that monitors
the network link to a derivative marketplace can't trade if it goes down. Every
second of downtime costs them dearly.
</p>
<h2><strong>3. On average, you don’t have a problem</strong></h2>
<p>
Notifications are often triggered by threshold values on utilization metrics
(e.g. CPU utilization) which might only exceed the threshold for a short time.
As a general rule, such brief peaks are not a problem and should not immediately
cause the monitoring system to start notifying people.
</p>
<p>
For this reason, many check plug-ins have the configuration option that their
metrics are averaged over a longer period (say, 15 minutes) before the
thresholds for alerting are applied. By using this option, temporary peaks are
ignored, and the metric will first be averaged over the defined time period and
only afterwards will the threshold values be applied to this<a
href="https://docs.checkmk.com/latest/en/intro_finetune.html#average_value">
average value</a>.
</p>
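<p>
Again as a purely illustrative Python sketch (this is not how Checkmk implements
it), averaging before applying the threshold suppresses short spikes while still
catching sustained load:
</p>
<pre class="prettyprint">from collections import deque

WINDOW = 15           # e.g. fifteen 1-minute samples
THRESHOLD = 90.0      # per cent CPU utilization

samples = deque(maxlen=WINDOW)

def handle_cpu_sample(value):
    samples.append(value)
    average = sum(samples) / len(samples)
    if average > THRESHOLD:
        print("ALERT: average CPU over %d samples is %.1f%%" % (len(samples), average))

# a single spike to 100% is ignored, sustained high load fires
for value in [20, 25, 100, 30, 20] + [95] * 15:
    handle_cpu_sample(value)
</pre>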
<h2><strong>4. Like parents, like children</strong></h2>
<p>
Imagine the following scenario: You are monitoring a remote data center. You
have hundreds of servers in that data center working well and being monitored by
your monitoring system. However, the connection to those servers goes through
the DC’s core switch (forget redundancy for a moment). Now that core switch goes
down, and all hell breaks loose. All of a sudden, hundreds of hosts are no
longer being reached by your monitoring system and are being shown as DOWN.
Hundreds of DOWN hosts mean a wave of hundreds of notifications…
</p>
<p>
But in reality, all those servers are (probably) doing just fine. Anyway, we
can’t tell, because we can’t connect to them while the core switch is
acting up. So what do you do about it?
</p>
<p>
Configure your monitoring system so that it knows this interdependency. So the
server checks are dependent on that core switch. You can do so in Checkmk by
using ‘parent-child-relationships’. By declaring host A the ‘Child’ of another
‘Parent’ host B, you tell your Checkmk system that A is dependent on host B.
Checkmk pauses notifications for the children if the parent is down.
</p>
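<p>
Conceptually, the logic looks something like this small Python sketch (an
illustration of the idea, not Checkmk internals): while a parent is DOWN, alerts
for its children are suppressed because they are merely unreachable:
</p>
<pre class="prettyprint">parents = {
    "server-01": ["core-switch"],
    "server-02": ["core-switch"],
}

host_state = {"core-switch": "DOWN", "server-01": "DOWN", "server-02": "DOWN"}

def should_notify(host):
    if host_state[host] != "DOWN":
        return False                  # nothing wrong, nothing to notify
    # if any parent is DOWN, the child is only unreachable - stay quiet
    return all(host_state[p] != "DOWN" for p in parents.get(host, []))

for host in host_state:
    print(host, "-> notify" if should_notify(host) else "-> suppress")
</pre>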
<h2><strong>5. Avoid alerts on systems that are supposed to be
down</strong></h2>
<p>
There are hundreds of reasons why a system should be down at times. Maybe some
systems need to be rebooted regularly, maybe you are doing some maintenance or
simply don’t need a system at certain times. What you don’t want is your
monitoring system going into panic mode during these times, alerting
who-knows-whom when a system is supposed to be down. To avoid that, you can use
‘Scheduled Downtimes’.
</p>
<p>
Scheduled downtimes work for entire hosts, but also for individual services. But
why would you send certain services into scheduled downtimes? More or less for
the same reason as hosts – when you know something will be going on that would
trigger an unnecessary notification. You still might want your monitoring to
keep an eye on the host as a whole, but you are expecting and accepting that
some services might go haywire and breach thresholds for some time. An example
could be a nightly cron job that syncs data to long term storage, causing the
disk I/O check to spike. But, if everything goes back to normal once the sync is
through, no need to lose sleep over it.
</p>
<p>
Moreover, you can extend scheduled downtimes to ‘Children’ of a ‘Parent’ host as
well.
</p>
<h2>Wrapping Up</h2>
<p>
I hope this short overview has given you some ideas about really simple ways
to cut down on the number of meaningless notifications your team
is getting from your monitoring system. There are other strategies to do this, but
this should get you started.
</p>
<h2>Additional Resources</h2>
<p>
If you want to learn more about how to manage notifications in Checkmk,
check out this<a href="https://docs.checkmk.com/latest/en/notifications.html">
docs article</a> or<a href="https://forum.checkmk.com"> post a question in the
forum</a>.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-390587273620638462021-12-17T00:00:00.010-05:002021-12-17T00:00:00.176-05:00Day 17 - Death to Localhost: The Benefits of Developing In A Cloud Native Environment<p>
By: Tyler Auerbeck (<a href="https://twitter.com/tylerauerbeck">@tylerauerbeck</a>) <br />
Edited by: Ben Cotton (<a href="https://twitter.com/funnelfiasco">@funnelfiasco</a>)
</p>
<p>
Thank you everyone for joining us today. We gather here to say our goodbyes to
our dear friend, Localhost. They’ve been there for us through the good times,
the bad times, and the “we should really be sleeping right now…but let me just
try one last thing” times. They’ve held our overly-complicated terminal
configurations and—in all likelihood—most of our secrets. But alas, it is time
to let our good friend ride into the sunset.
</p>
<h2>Saying Goodbye</h2>
<p>
But why?! We’ve all likely spent more time than we care to admit making these
machines feel like home. They’re part of the family! Well, as it turns out, that
can become part of the problem. We’ve all seen issues that are accompanied by
the line “well it works on my machine” and a round of laughs. The problem with
localhost is that it can be extremely difficult to ensure that a setup being
utilized by one developer actually matches what is being run by another. This
can happen for any number of reasons such as developer platform (Linux vs MacOS
vs Windows), IDE (VScode vs Jetbrains), or even just the installation method of
the tools you're using. The different combinations of these variables only
exacerbate the problem and likely lead to (at a minimum!) hundreds of hours of
lost productivity. All in the name of working locally. But what if there was a
better way?
</p>
<h2>My Machine is Your Machine</h2>
<p>
With everything becoming Cloud Native these days, why do we want to treat
development any differently? The common trend recently is to push a number of
our workloads into containers. Why? Because with containers we have the ability
to bundle our runtimes, tooling, and any additional dependencies via a
well-defined format. We can expect them to run almost anywhere, the same way,
each and every time. What if we took that same approach and instead of a web
application, we shipped our development environment?
</p>
<p>
Well, as it turns out, this is exactly what a few projects are starting to give
us the ability to do. Now instead of shipping complex Makefiles, multiple
install scripts, or having to ask our users to pipe our mystery scripts into
bash, we can simply just launch our development environments out into the cloud
of our choice. Currently, there are two main projects that offer us this
functionality. If you’re not interested in hosting anything yourself, <a
href="https://github.com/features/codespaces">GitHub Codespaces</a> is a hosted
solution that integrates directly with your codebase and allows you to easily
spin up a VScode instance to get to work. However, if you have more specific
restrictions or just prefer to run your own infrastructure, another project
offering this functionality is <a href="https://www.eclipse.org/che/">Eclipse
Che</a>. Whatever solution works best for your situation is fine. The more
important part of both of these offerings is _how_ they make these environments
available to you.
</p>
<h3>Development Environment Specs</h3>
<p>
Both of the above offerings allow you to specify the dev environment that you
want to make available to your users/developers. It’s important to note that
this is done on a per repository basis because there is never going to be a
single dev environment that works to run them all. This is exactly the mess that
we’re trying to get out of! We want to be able to define an environment that is
purpose-built for the specific project that we are working on!
</p>
<p>
To do this, these platforms give us configuration files: devcontainer.json
(GitHub Codespaces) and devfile (Eclipse Che). Although the specs differ between
the two formats, the underlying principles are the same. Within one well-defined
configuration file, I am able to specify the tooling that needs to be installed, an
image that should be used or built to run all of my development inside of, ports
that need to be exposed, storage that needs to be mounted, plugins to be used, etc.
Everything that I would usually need to configure by hand when getting started
with a project now _just happens_ whenever I launch my environments. So now not
only are we solving the _snowflake_ environment problem, but we are also saving
valuable time because the environment will be configured and ready as soon as we
click launch. It’s just what we’ve always wanted: push button and get to work!
</p>
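<p>
As a rough sketch only (the image, port, and command below are placeholder values,
not taken from any specific project; the field names come from the dev container
spec), a minimal devcontainer.json could be bootstrapped like this:
</p>
<div><pre><code class="language-bash">
# Write a minimal .devcontainer/devcontainer.json via a heredoc (values are illustrative)
mkdir -p .devcontainer
cat > .devcontainer/devcontainer.json <<'EOF'
{
  "name": "my-project",
  "image": "mcr.microsoft.com/devcontainers/base:ubuntu",
  "forwardPorts": [8080],
  "postCreateCommand": "make deps"
}
EOF
</code></pre></div>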
<h2>What Problems Are We Solving</h2>
<p>
This all sounds great, right? But you might be shaking your fist in the air and
screaming "Just let me use my laptop!" While this is absolutely something that I
can empathize with, and it may generally work on personal projects, there are real
problems that are being solved with this approach. I've seen this more
specifically in enterprise development shops where <em>your machine</em> isn't really
<em>your</em> machine. Which brings us to our first problem.
</p>
<h3>Permissions</h3>
<p>
Given the current security environment, most enterprise development shops aren’t
too keen on giving you the permissions to install any of the tooling that you
actually need. I have seen developers lose weeks waiting on a request to just
install their runtime on their machines before they’re ever even able to begin
contributing to their time. Multiply that by every tool and dependency that they
might need and you can imagine how much valuable and productive time is lost in
the name of security and process.
</p>
<p>
By moving to a cloud native development approach, your development environments
can be treated just like any other application that you run and scanned/approved
by your security teams. When a new developer comes on board, they can get right
to work! No more waiting on approvals/installation because this has already gone
through the necessary pipelines and is just ready whenever you are.
</p>
<h3>Develop In Production</h3>
<p>
Alright, so maybe we shouldn’t develop *in* production, but rather in an
environment that is _like_ production. By developing an application in a
location where it will ultimately be running, you get a better feel for
configurations and even failure modes that you otherwise may not experience by
developing solely on your local machine. Expecting certain ports to be
available? Need specific hardware? By ensuring your configuration files mirror
your environments you can determine these problems earlier on in your process
versus finding them once they’ve launched into a staging or production
environment. This ultimately helps you reduce downtime and speeds up your time
to resolving these problems as you may find them before they’re ever even
introduced.
</p>
<h2>Localhost: Still Slightly Alive</h2>
<p>
Realistically, this isn’t going to be a solution for everything or everyone.
There are workloads and development tasks that require specialized environments
or are potentially just not well suited to being done inside of a container
environment. And that’s okay! There are still other approaches to finding a way
off of your local machine and into the hearts of all of your developers without
having to have them sink their time into troubleshooting differences between
each of their machines. The heart of the problem still stands: developers want
to get to work and provide value. Being able to provide on-demand environments
that encapsulate all of the requirements of a project so that they can get
involved immediately helps drive this productivity for both your teams and your
communities, all without having to burn hours troubleshooting a personal
machine.
</p>
<p>
So for now, let us lay our dear friend Localhost to rest. They may no longer be
with us, but have no fear! Our localhost will always be with us up in the
cloud(s)!
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-88628481196970618202021-12-16T00:00:00.038-05:002021-12-16T00:00:00.196-05:00Day 16 - Setting up k3s in your home lab<p>
By: Joe Block (<a href="https://twitter.com/curiousbiped">@curiousbiped</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<h2><strong>Background</strong></h2>
<p>
Compute, even at home with consumer-grade hardware, has gotten ridiculously
cheap. You can get a quad-core ARM machine with 4GB of RAM, like a Raspberry Pi 4, for
under $150, including power supply and SD card for booting - and it'll idle at
less than 5 watts of power draw and be completely silent because it is fanless.
</p>
<h2><strong>What we're going to do</strong></h2>
<p>
In this post, I'll show you how to set up a Kubernetes cluster on a cheap ARM
board (or an x86 box if you prefer) using<a
href="https://rancher.com/products/k3s"> k3s</a> and k3sup so you can learn
Kubernetes without breaking an environment in use.
</p>
<p>
These instructions will also work on x86 machines, so you can repurpose that old
hardware instead of buying a new Raspberry Pi.
</p>
<h3><strong>Why k3s?</strong></h3>
<p>
k3s was created by<a href="https://rancher.com/docs/k3s/latest/en/"> Rancher</a>
as a lightweight, easy to install, and secure Kubernetes option.
</p>
<p>
It's packaged as a single ~40MB binary that reduces the dependencies needed to
get a cluster up and running. It even includes an embedded containerd, so you
don't need to install that or docker. The ARM64 and ARM7 architectures are fully
supported, so it's perfect for running on a Raspberry Pi in a home lab
environment.
</p>
<h3><strong>Why k3sup?</strong></h3>
<p>
Alex Ellis wrote<a href="https://github.com/alexellis/k3sup"> k3sup</a>, a great
tool for bringing up k3s clusters and we're going to use it in this post to
simplify setting up a brand new cluster. With k3sup, we'll have a running
kubernetes cluster in less than ten minutes.
</p>
<h2><strong>Let's get started!</strong></h2>
<h3><strong>Pre-requisites.</strong></h3>
<ul>
<li>A spare linux box. I'll be using a Raspberry Pi for my examples, but you can
follow along on an x86 linux box or VM if you prefer.
<li><a href="https://github.com/alexellis/k3sup">k3sup</a> - download the latest
release from<a href="https://github.com/alexellis/k3sup/releases">
k3sup/releases</a> into a directory in your $PATH.
</li>
</ul>
<h2><strong>Set up your cluster.</strong></h2>
<p>
In the following example, I'm assuming you've created a user (you can use the pi
user on a Raspberry Pi if you prefer) for configuring the cluster (I used borg below),
you've added your SSH public key to that user's ~/.ssh/authorized_keys, and that the user
has sudo privileges. I'm also assuming you've downloaded k3sup and put it into
/usr/local/bin, and that /usr/local/bin is in your $PATH.
</p>
<h3><strong>Create the leader node</strong></h3>
<p>
The first step is to create the leader node with the k3sup utility:
</p>
<div><pre><code class="language-bash">
k3sup install --host $HOSTNAME --user pi
</code></pre></div>
<p>
Below is the output when I ran it against my scratch rPi. In the scrollback
you'll see that I'm using my borg account instead of the pi user. After setting
up the rPi, the first step I took was to disable the known pi account. I also
specify the path to an SSH key that is in the borg account's authorized_keys,
and configure the borg account to allow passwordless sudo.
</p>
<p>
Notice that I don't have to specify an architecture - k3sup automagically
determines the architecture of the host and installs the correct binaries when
it connects to the machine. All I have to do is tell it what host to connect to,
what user to use, what ssh key, and whether I want to use the stable or latest
k3s channels or a specific version.
</p>
<div><pre><code class="language-bash">
❯ k3sup install --host cephalopod.example.com --user borg --ssh-key demo-key \
  --k3s-channel stable
</code></pre></div>
<div><pre><code class="language-bash">
k3sup install --host cephalopod.example.com --user borg --ssh-key demo-key --k3s-channel stable
Running: k3sup install
2021/12/13 16:30:49 cephalopod.example.com
Public IP: cephalopod.example.com
[INFO] Finding release for channel stable
[INFO] Using v1.21.7+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/sha256sum-arm64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/k3s-arm64
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
Result: [INFO] Finding release for channel stable
[INFO] Using v1.21.7+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/sha256sum-arm64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.21.7+k3s1/k3s-arm64
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
[INFO] systemd: Starting k3s
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
Saving file to: /Users/jpb/democluster/kubeconfig
# Test your cluster with:
export KUBECONFIG=/Users/jpb/democluster/kubeconfig
kubectl config set-context default
kubectl get node -o wide
</code></pre></div>
<h3><strong>Test it out</strong></h3>
<p>
Per the directions output by k3sup, you can now test your brand new cluster by
setting the environment variable KUBECONFIG, and then run kubectl to work with
your new cluster.
</p>
<p>
My steps to verify my new cluster is up and running:
</p>
<ol>
<li>export KUBECONFIG=/Users/jpb/democluster/kubeconfig
<li>kubectl config set-context default
<li>kubectl get node -o wide
</li>
</ol>
<p>
And I see nice healthy output where the status shows Ready -
</p>
<div><pre><code class="language-bash">
NAME STATUS ROLES AGE VERSION INTERNAL-IP
EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
cephalopod Ready control-plane,master 2m53s v1.21.7+k3s1 10.1.2.3
<none> Ubuntu 18.04.3 LTS 4.9.196-63 containerd://1.4.12-k3s1
</code></pre></div>
<p>
And I can also look at pods in the cluster
</p>
<div><pre><code class="language-bash">
❯ kubectl get pods -A
Alias tip: kc get pods -A
NAMESPACE NAME READY STATUS
RESTARTS AGE
kube-system coredns-7448499f4d-b2rdp 1/1 Running 0
9m29s
kube-system local-path-provisioner-5ff76fc89d-d9rrc 1/1 Running 0
9m29s
kube-system metrics-server-86cbb8457f-cqk6q 1/1 Running 0
9m29s
kube-system helm-install-traefik-crd-jgk2x 0/1 Completed 0
9m29s
kube-system helm-install-traefik-l2j96 0/1 Completed 2
9m29s
kube-system svclb-traefik-7tzzs 2/2 Running 0
8m38s
kube-system traefik-6b84f7cbc-92kkp 1/1 Running 0
8m38s
</code></pre></div>
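<p>
As an optional extra check (not part of the original walkthrough; the deployment
name is arbitrary), you can deploy a throwaway workload and confirm it schedules
and runs before moving on:
</p>
<div><pre><code class="language-bash">
# Create a test deployment, wait for it to roll out, then remove it
kubectl create deployment hello --image=nginx
kubectl rollout status deployment/hello
kubectl get pods -l app=hello
kubectl delete deployment hello
</code></pre></div>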
<h3><strong>Clean Up</strong></h3>
<p>
k3s is tidy and easy to uninstall, so you can stand up a cluster on a
machine, do some experimentation, then dispose of the cluster and have a
clean slate for your next experiment. This makes it great for continuous
integration!
</p>
<div><pre><code class="language-bash">
# shut down the node and delete /var/lib/rancher and the data stored there
sudo /usr/local/bin/k3s-uninstall.sh
</code></pre></div>
<h3><strong>Next Steps</strong></h3>
<p>
Learn kubernetes! Some interesting tutorials that I recommend -
</p>
<ul>
<li>The Kubernetes project has a set of tutorials to get you started at<a
href="https://kubernetes.io/docs/tutorials/">
https://kubernetes.io/docs/tutorials/</a>
<li>VMWare sponsors a free set of online Kubernetes courses at<a
href="https://kube.academy/courses"> https://kube.academy/courses</a>.
</li>
</ul>
<p>
Finally, now that you've set up a cluster the easy way, if you want to
understand everything k3sup did behind the scenes to get your Kubernetes cluster
up and running,<a
href="https://github.com/kelseyhightower/kubernetes-the-hard-way"> Kubernetes
the Hard Way</a> by Kelsey Hightower is a must-read.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-77427394017700538122021-12-15T00:00:00.036-05:002021-12-15T00:00:00.180-05:00Day 15 - Introduction to the PagerDuty API<p>
By: Mandi Walls (<a href="https://twitter.com/lnxchk">@lnxchk</a>) <br />
Edited by: Joe Block (<a href="https://twitter.com/curiousbiped">@curiousbiped</a>)
</p>
<p>
Keeping track of all the data generated by a distributed ecosystem is a daunting
task. When something goes wrong, or a service isn’t behaving properly, tracking
down the culprit and getting the right folks enabled to fix it is also
challenging. PagerDuty can help you with these challenges.
</p>
<p>
The PagerDuty platform integrates with over 600 other components to gather data,
add context, and process automation. Under the hood of all of these integrations
is the PagerDuty API, ready to help you programmatically interact with your
PagerDuty account.
</p>
<h3><strong>What’s Exposed Via the API</strong></h3>
<p>
The<a href="https://developer.pagerduty.com/docs/ZG9jOjQ2NDA2-introduction">
PagerDuty API</a> provides access to all the structural objects in your
PagerDuty account - users, teams, services, escalation policies, etc - and also
to the data objects including incidents, events, and change events.
</p>
<p>
For objects like users, teams, escalation policies, schedules, and services, you
may find using the<a
href="https://registry.terraform.io/providers/PagerDuty/pagerduty/latest/docs">
PagerDuty Terraform Provider</a> will help you maintain the state of your
account more efficiently without using the API directly.
</p>
<p>
The other object types in PagerDuty are more useful when we can send them
anytime from anywhere, including via the API from our own code. Let’s take a
look at three of them: incidents, events, and change events. If you’d like a
copy of the code for these examples, you can find them on<a
href="https://github.com/lnxchk/pdgarage-samples/tree/main/sysadvent-2021">
Github</a>.
</p>
<h3><strong>API Basics</strong></h3>
<p>
To write new information into PagerDuty via the API, you'll need some
authorization. You can use<a
href="https://developer.pagerduty.com/docs/ZG9jOjExMDI5NTcz-o-auth-2-0-functionality">
OAuth</a>, or create an<a
href="https://developer.pagerduty.com/docs/ZG9jOjExMDI5NTUx-authentication#api-token-authentication">
API key</a>. There are<a
href="https://support.pagerduty.com/docs/generating-api-keys#section-generating-a-general-access-rest-api-key">
account-level</a> and<a
href="https://support.pagerduty.com/docs/generating-api-keys#generating-a-personal-rest-api-key">
user-level</a> API keys available. You'll use an account-level key for the
rest of the examples here to keep things simple.
</p>
<p>
To create a key in your PagerDuty app, you'll need Admin, Global Admin, or
Account Owner access to your account. More on that<a
href="https://support.pagerduty.com/docs/user-roles"> here</a>.
</p>
<p>
In PagerDuty, navigate to <em>Integrations</em> and then choose <em>API Access
Keys</em>. Create a new key, give it a description, and save it somewhere safe.
The keys are strings that look like y_NbAkKc66ryYTWUXYEu.
</p>
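<p>
Before creating anything, you can sanity-check that the key works with a simple
read-only request - for example, listing the users on the account:
</p>
<div><pre><code class="language-bash">
curl --url https://api.pagerduty.com/users \
  --header 'Accept: application/vnd.pagerduty+json;version=2' \
  --header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu'
</code></pre></div>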
<p>
Now you’re ready to generate some incidents! These examples use curl, but there
are a number of<a
href="https://developer.pagerduty.com/docs/ZG9jOjExMDI5NTg2-api-client-libraries">
client libraries</a> for the API as well.
</p>
<h3><strong>Incidents</strong></h3>
<p>
Incidents are probably what you’re most familiar with in PagerDuty - they
represent a problem or issue that needs to be addressed and resolved. Sometimes
this includes alerting a human responder. Many of the integrations in the
PagerDuty ecosystem generate incidents from other systems and services to send
to PagerDuty.
</p>
<p>
In PagerDuty, incidents are assigned explicitly to services in your account, so
an incoming incident will register with only that service. If your database has
too many long-running queries, you want an incident to be assigned to the
PagerDuty service representing that database so responders have all the correct
context to fix the issue.
</p>
<p>
If you have a service that doesn’t have an integration out of the box, you can
still get information from that service into PagerDuty via the API, and you
don’t need anything special to do it. You can send an incident to the API via a
curl request to the https://api.pagerduty.com/incidents endpoint.
</p>
<p>
There are three required headers for these requests: Accept, Content-Type, and
From. The From header needs to be an email address associated with your account, for
attribution of the incident. Setting up the request will look something like:
</p>
<div><pre><code class="language-bash">
curl -X POST --header 'Content-Type: application/json' \
--url https://api.pagerduty.com/incidents \
--header 'Accept: application/vnd.pagerduty+json;version=2' \
--header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
--header 'From: system2@myemail.com' \
</code></pre></div>
<p>
Now you need the information bits of the incident. These will be passed as
--data in the curl request. There are just a few required pieces to set up the
format and a number of optional pieces that help add context to the incident.
</p>
<p>
The most important piece you'll need is the service ID. Every object in the
PagerDuty platform has a unique identifier. You can find the ID of a service in
its URL in the UI. It will be something like
https://myaccount.pagerduty.com/service-directory/SERVICEID.
</p>
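<p>
If you'd rather not dig through the UI, you can also look the ID up via the API;
the services endpoint accepts a query parameter that filters by name (the service
name below is just an example):
</p>
<div><pre><code class="language-bash">
curl --get --url https://api.pagerduty.com/services \
  --data-urlencode 'query=my-database' \
  --header 'Accept: application/vnd.pagerduty+json;version=2' \
  --header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu'
</code></pre></div>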
<p>
Now you can create the rest of the message with JSON:
</p>
<div><pre><code class="language-bash">
curl -X POST --header 'Content-Type: application/json' \
--url https://api.pagerduty.com/incidents \
--header 'Accept: application/vnd.pagerduty+json;version=2' \
--header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
--header 'From: system2@myemail.com' \
--data '{
"incident": {
"type": "incident",
"title": "Too many blocked requests",
"service": {
"id": "PWIXJZS",
"summary": null,
"type": "service_reference",
"self": null,
"html_url": null
},
"body": {
"type": "incident_body",
"details": "The service queue is full. Requests are no longer being fulfilled."
}
}
}'
</code></pre></div>
<p>
When you run this curl command, it will generate a new incident on the service
PWIXJZS with the title "Too many blocked requests", along with some context in
the "body" of the data to help our responders. You can add diagnostics or other
information here to help your team fix whatever is wrong.
</p>
<p>
What if there is information being generated that might not need an immediate
response? Instead of an incident, you can create an event.
</p>
<h3><strong>Events</strong></h3>
<p>
Events are non-alerting items sent to PagerDuty. They can be processed via<a
href="https://support.pagerduty.com/docs/rulesets"> Event Rules</a> to help
create context on incidents or provide information about the behavior of your
services. They utilize the<a href="https://support.pagerduty.com/docs/pd-cef">
PagerDuty Common Event Format</a> to make processing and collating more
effective.
</p>
<p>
Events are registered to a particular routing_key via an integration on a
particular service in your PagerDuty account. In your PagerDuty account, select
a service you'd like to send events to, or create a new one to practice with. On
the page for that service, select the <em>Integrations</em> tab and <em>Add an
Integration</em>. For this integration, select "Events API V2" and click
<em>Add</em>. You'll have a new integration on your service page. Click the gear
icon, and copy the <em>Integration Key</em>. For the full walkthrough of this
setup, see the<a
href="https://support.pagerduty.com/docs/services-and-integrations#add-integrations-to-an-existing-service">
docs</a>.
</p>
<p>
The next step is to set up the event. The request is a little different from the
incident request - the url is different, the From: header is not required, and
the authorization is completely handled in the routing_key instead of using an
API token.
</p>
<p>
The content of the request is more structured, based on the Common Event Format,
so that you can create event rules and take actions if necessary based on what
the events contain.
</p>
<div><pre><code class="language-bash">
curl --request POST \
--url https://events.pagerduty.com/v2/enqueue \
--header 'Content-Type: application/json' \
--data '{
"payload": {
"summary": "DISK at 99% on machine prod-datapipe03.example.com",
"timestamp": "2021-11-17T08:42:58.315+0000",
"severity": "critical",
"source": "prod-datapipe03.example.com",
"component": "mysql",
"group": "prod-datapipe",
"class": "disk",
"custom_details": {
"free space": "1%",
"ping time": "1500ms",
"load avg": 0.75
}
},
"event_action": "trigger",
"routing_key": "e93facc04764012d7bfb002500d5d1a6"
}'
</code></pre></div>
<h3><strong>Change Events</strong></h3>
<p>
A third type of contextual data you can send to the API is a<a
href="https://support.pagerduty.com/docs/change-events"> Change Event</a>.
Change events are non-alerting, and help add context to a service. They are
informational data about what's changing in your environment, and while they
don't generate an incident, they can inform responders about other activities in
the system that might have contributed to a running incident. Change events
might come from build and deploy services, infrastructure as code, security
updates, or other places that change is generated in your environment.
</p>
<p>
These events have a similar basic structure to the general events, and the setup
with the routing_key is the same, as you can see in the below example. The
custom_details can contain anything you want, like the build number, a link to
the build report, or the list of objects that were changed during an
Infrastructure as Code execution.
</p>
<p>
Change events have a time horizon. They expire after 90 days in the system, so
you aren't looking at old context based on past changes.
</p>
<div><pre><code class="language-bash">
curl --request POST \
--url https://events.pagerduty.com/v2/change/enqueue \
--header 'Content-Type: application/json' \
--data '{
"routing_key": "737ea619db564d41bd9824063e1f6b08",
"payload": {
"summary": "Build Success: Increase snapshot create timeout to 30 seconds",
"timestamp": "2021-11-17T09:42:58.315+0000",
"source": "prod-build-agent-i-0b148d1040d565540",
"custom_details": {
"build_state": "passed",
"build_number": "220",
"run_time": "1236s"
}
}
}'
</code></pre></div>
<h3><strong>Adding Notes</strong></h3>
<p>
One final fun bit of functionality you can leverage in PagerDuty's API is with
<em>notes</em>. Notes are short text entries added to the timeline of an
incident. In some integrations, like<a
href="https://www.pagerduty.com/integrations/slack/"> PagerDuty and Slack</a>,
notes will be sent to any Slack channel that is configured to receive updates
for an impacted service, making them helpful for responders to coordinate and
record activity across different teams.
</p>
<p>
Notes are associated with a specific incident, so when you are creating a note,
the url will include the incident ID. Incident IDs are similar to the other
object IDs in PagerDuty in that you can find them from the URL of the incident
in the UI. They are longer strings than other objects than the service ID in the
examples above.
</p>
<p>
The content of a note can be anything that might be interesting to the timeline
of the incident, like commands that have been run, notifications that have been
sent, or additional data and links for responders and stakeholders.
</p>
<div><pre><code class="language-bash">
curl --request POST \
--url https://api.pagerduty.com/incidents/{id}/notes \
--header 'Accept: application/vnd.pagerduty+json;version=2' \
--header 'Authorization: Token token=y_NbAkKc66ryYTWUXYEu' \
--header 'Content-Type: application/json' \
--header 'From: responder2@myemail.com' \
--data '{
"note": {
"content": "Firefighters are on the scene."
}
}'
</code></pre></div>
<p>
Responders utilizing the UI will see notes in a widget on the incident page.
</p>
<h3><strong>Next Steps</strong></h3>
<p>
Using the API to create tooling where integrations don't yet exist, or for
internally-developed services, can help your team stay on top of all the moving
parts of your ecosystem when you have an incident. Learn more about the PagerDuty resources available
at <a href="https://developer.pagerduty.com/">
https://developer.pagerduty.com/</a>. Join the<a
href="https://community.pagerduty.com"> PagerDuty Community</a> to learn from
other folks working in PagerDuty, ask questions, and get answers.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com1tag:blogger.com,1999:blog-3615332969083650973.post-16812278314954267872021-12-14T00:00:00.001-05:002021-12-14T00:00:00.188-05:00Day 14 - What's in a job description (and who does it keep away)?
<p>
By: Daniel Medina <br />
Edited by: James Turnbull (<a href="https://twitter.com/kartar">@kartar</a>)
</p>
<p>
A colleague supporting our recruitment efforts asked hiring managers if their
"job descriptions are still partying like it's 1999?" The point was to revisit
old postings that had been copy-and-pasted down the years and create something
that would increase engagement with candidates. But reading the title made me
think about a job I applied for (and got) circa 1999. It was a systems
administrator role and included language like
</p>
<blockquote>The associate must regularly lift and/or move 20-35 pounds and occasionally
lift or pull 35-80 pounds.</blockquote>
<p>
No joke, those Sun Microsystems monitors were <i>heavy</i>. Checking a <a href="http://shrubbery.net/~heas/sun-feh-2_1/Devices/Monitor/documents/Monitor_JTF.pdf">fact sheet</a> confirms the "flat screen" (non-curved) 21-inch CRT from around that time was ~80 pounds.
</p>
<div class="separator" style="clear: both;"><a href="https://camo.githubusercontent.com/c29c9168b560838f8deeaaa8481e18fff2593581ecc3b4bd86e5ba6e7c28ef78/68747470733a2f2f692e726564642e69742f78366939343665317a716334312e6a7067" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="450" data-original-width="800" src="https://camo.githubusercontent.com/c29c9168b560838f8deeaaa8481e18fff2593581ecc3b4bd86e5ba6e7c28ef78/68747470733a2f2f692e726564642e69742f78366939343665317a716334312e6a7067"/></a>
<p>
Source: <a href="https://www.reddit.com/r/retrobattlestations/comments/etc3gv/not_x86_week_my_sun_microsystems_collection_ultra/">u/leaningtoweravenger on
Reddit</a>
</p>
</div>
<p>
Large network switches in the Cisco Catalyst 6500 family were easily twice that
weight and were definitely a two-person job. Best practice for racking servers
in the datacenter was to use a <a href="https://www.genielift.com/en/material-handling/material-lifts/gl-8">Genie
Lift</a>.
</p>
<p>
To this day, if I hear someone talking about a <i>strong developer</i> I might wonder
"but how much can they deadlift?" Most job descriptions for roles outside
physical datacenter management don't include this language anymore. This all
got me thinking, <b>what might be in job descriptions these days that could be
turning off candidates?</b>
</p>
<p>
"Education Level" might be one of those things we should re-think. Many
postings require a "Bachelor's Degree". Granted, we don't describe what that
degree is in and I've had colleagues with degrees in History, Library Sciences,
Geology, Economics, and more (even Computer Science!)
</p>
<p>
Sometimes the phrase "or equivalent experience" is added to these requirements.
It's unclear if this means something akin to a college experience, for example,
thirteen weeks reading <i>The Iliad</i> in your teenage years. I've had colleagues
who are Managing Directors and Distinguished Engineers with no college degrees;
so why bother asking for this in our requirements? Maybe it's cloned from an
existing description, or it's a required field in the system used to post the
description and the option "None" isn't pre-filled. At best it's a proxy that
means we're really looking for someone older than 21. At worst, we've dissuaded
some candidates from considering us.
</p>
<p>
Sometimes the HR systems used for creating job descriptions can add unexpected
data to your job descriptions. One job description posted in Montreal
automatically included "Knowledge of French and English is required". This
wasn't a Language Requirement that came from us! We were at a global firm using
English as a common language and would be happy to hire anyone who met Canadian
work requirements and had the skills we were looking for!
</p>
<p>
Other French-language oddities you may encounter are labels like "(H/F)" to
indicate "Homme / Femme", that the job description is intended to be
gender-neutral, despite pronouns and gendered language used throughout. This
isn't as awkward as some of the "s/he will..." references used in
English-language descriptions when the simpler "you", speaking directly to the
candidate, seems so much more natural!
</p>
<p>
Speaking of strange language, some descriptions include language that doesn't
make me think first of a technology role:
</p>
<blockquote>I'm hiring... a hacker that wants to work on the bleeding edge...</blockquote>
<blockquote>We spend a lot of time doing applied research...</blockquote>
<blockquote>You should be the type of person who likes to roll up their sleeves and get
their hands dirty.</blockquote>
<div class="separator" style="clear: both;"><a href="https://upload.wikimedia.org/wikipedia/en/6/6f/Dexter_season_2_DVD.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" height="320" data-original-height="361" data-original-width="275" src="https://upload.wikimedia.org/wikipedia/en/6/6f/Dexter_season_2_DVD.png"/></a>
Source: Wikipedia:<a href="https://en.wikipedia.org/wiki/Dexter_%28season_2%29"> Dexter (season
2)</a></div>
<p>
Your signal that you have an existing, tight-knit group:
</p>
<blockquote>You'll be part of a small team of like-minded individuals.
</blockquote><p>
might run counter to your efforts to advertise your goals of building a diverse
and inclusive environment, one where the candidate-turned-new-joiner might not
be able to provide their valuable outside perspective if it goes against the
current thinking.
</p>
<p>
We found that we were having trouble filling a "DevOps" role. Without
suggesting that "DevOps isn't a job title", candidates wanted clarification on
what that might mean in our environment. Reviewing some of the many open roles
across different teams showed they varied widely, leaving candidates to try to
figure out which of the <a href="https://web.devopstopologies.com/">DevOps Topologies</a>
they might be walking into (and was it a Pattern or Anti-Pattern?!)
</p>
<p>
These included:
</p>
<ul>
<li> <i>Cloud SecDevOps (Cyber)</i>: This wins keyword bingo</li>
<li> <i>Apply Now to The Wonderful World of DevOps</i>: Points for creative use of the
job title field</li>
<li><i> Devops Specialist - Private Cloud</i>: "providing L3 support... including
on-call"</li>
<li> <i>DevOps Developer</i>: "You are a developer who is not afraid of infrastructure.
You identify with the 'Dev' in DevOps way more than the 'Ops'"</li>
<li> <i>DevOps App Dev</i>: A "release engineer" role that sounded more like DevOps in
practice</li>
<li><i> DevOps Authentication Security L3 Engineer</i>: Okay...</li>
</ul>
<p>
Much of this has been about job descriptions that can lose candidates. What
should you include to gain credibility and interest? An honest declaration of
the mission of the group they’re joining always helps. Don't shy away from
describing a need to support existing legacy systems, even if the goal is to
modernize and move to a new platform. Describe the lifecycle of the team; is it
"newly formed", "fast-growing", or is this a chance to "join an established
team" and learn from established experts?
</p>
<p>
What's the topology of the team, distributed (participation from a range of
locations and timezones in an asynchronous arrangement), multi-site (people
working from two or perhaps three sites passing work off between each other or
operating in overlapping times), or fully co-located (in rough time or
location)? This can affect travel, working hours, and collaboration styles.
</p>
<p>
Basic details of work-life balance should be included. These might include
remote work arrangements (which will likely become a lasting legacy of the
pandemic era), on-call staffing strategies, night and weekend work requirements,
or travel requirements. We tend to advertise "flexible opportunities", which
may have some constraints (we may want individuals to reside in a specific
country but not care as much about sitting in an office).
</p>
<p>
Some of the most thoughtful job descriptions lay out a multi-month roadmap for
the role and growth. "Within three months we expect you to join our on-call
rotation in support of our production environment", "Within six months you will
obtain certification in at least one of our hosting platforms", "Within nine
months you will be doing my job and I will be riding off into the sunset", etc.
Having such a timeline is important to set expectations for performance during
any initial probation period that may be part of local labor law or new hire
contract. This also sets a pace for someone to ramp up in your environment,
ensuring enough time is set aside for required learning as opposed to "throwing
them in the deep end".
</p>
<p>
I've made all the mistakes described here but can take some solace that I've
created zero job postings seeking ninjas, rockstars, gurus, or wizards! Best of
luck to all the hiring managers out there looking for their unicorns!
</p>
<div class="separator" style="clear: both;"><a href="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Kiss_Cracow_2019.jpg/640px-Kiss_Cracow_2019.jpg" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="360" data-original-width="640" src="https://upload.wikimedia.org/wikipedia/commons/thumb/d/d6/Kiss_Cracow_2019.jpg/640px-Kiss_Cracow_2019.jpg"/></a></div>
<p>
Source: <a href="https://commons.wikimedia.org/wiki/File:Kiss_Cracow_2019.jpg">Wikipedia: Kiss
(band)</a>
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-21729511673925282682021-12-13T00:00:00.000-05:002021-12-13T00:24:55.896-05:00Day 13 - Ephemeral PR Environments: Enabling automated testing at a rapid pace
<p>
By: Amar Sattaur <br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Recently, I've been thinking a lot about how to implement the concepts of least
privilege while also speeding up the feedback cycle in the developer workflow.
However, these two goals are not easily combined: tighter access tends to slow
developers down. To reconcile them, there needs to be underlying tooling and
visibility that gives developers the data they need for a successful PR merge.
</p>
<p>
A developer doesn't care about what those underlying tools are; they just want
access to a system where they can:
</p>
<ul>
<li>See the logs of the app that they're making a change for and the other
relevant apps
</li><li>See the metrics of their app so they can adequately gauge performance impact
</li>
</ul>
<p>
One way to achieve this is with ephemeral environments based on PRs. The idea
is that when a developer opens a PR, a new environment is automatically spun up
based on provided defaults, with the conditions that the environment is:
</p>
<ul>
<li>deployed in the same way that dev/stage/prod are deployed, just with a few
key elements different
</li><li>labeled correctly so that the NOC/Ops teams know the purpose of these
resources
</li><li>Integrated with logging/metrics and useful tags so that the engineer can
easily see metrics for this given PR build
</li>
</ul>
<p>
That sounds like a daunting task but through the use of Kubernetes, Helm, a CI
Platform (GitHub Actions in this tutorial) and ArgoCD, you can make this a
reality. Let's look at an example application leveraging all of this technology.
</p>
<h2><b>Example app</b></h2>
<p>
You can find all the code readily available in this <a href="https://github.com/jodybro/sysadvent2021">GitHub Repo</a>.
</p>
<h1><b>Pre-requisites Used in this Example</b></h1>
<table>
<tbody><tr>
<td><b>Tool</b>
</td>
<td><b>Version</b>
</td>
</tr>
<tr>
<td>kubectl
</td>
<td>v1.21
</td>
</tr>
<tr>
<td>Kubernetes Cluster
</td>
<td>v1.20.9
</td>
</tr>
<tr>
<td>Helm
</td>
<td>v3.6.3
</td>
</tr>
<tr>
<td>ArgoCD
</td>
<td>v2.0.5
</td>
</tr>
<tr>
<td>kube-prometheus-stack
</td>
<td>v0.50.0
</td>
</tr>
</tbody></table>
<p>
The example app that you’re going to deploy today is a Prometheus exporter that
exports a custom metric with an overridable label set:
</p>
<ul>
<li>The `version` of the deployed app
</li><li>The `branch` of the PR
</li><li>The PR ID
</li>
</ul>
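<p>
For illustration only - the metric and label names below are made up, not taken
from the example repo - a scrape of the exporter might look something like this
once the PR build is running:
</p>
<div><pre><code class="language-bash">
# Hit the exporter's metrics endpoint (port 8000, as used later in this article)
curl -s localhost:8000/metrics | grep build_info
# build_info{version="PR-1",branch="my-feature",name="sysadvent2021-pr-1"} 1
</code></pre></div>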
<h3><b>Pipeline</b></h3>
<p>
Now that I've defined the goal, let's go
a little more in-depth on how you'll get there. First, let's take a look at the PR
pipeline in .github/workflows/pull_requests.yml:
</p>
<div><pre><code class="language-yaml">
---
name: 'Build image and push PR image to ghcr'
on:
pull_request:
types: [assigned, opened, synchronize, reopened]
branches:
- main
jobs:
build:
name: Build
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Build image
uses: docker/build-push-action@v1
with:
registry: ghcr.io
username: ${{ github.repository_owner }}
password: ${{ secrets.GITHUB_TOKEN }}
tags: PR-${{ github.event.pull_request.number }}
</code></pre></div>
<p>
This pipeline runs on pull request events to the main branch. So, when you open a
PR, push a commit to an existing PR, reopen a closed PR, or assign it to a user,
this pipeline will get triggered. It defines two jobs, the first of which
is build. It's relatively straightforward: take the Dockerfile that lives in the
root of your repo and build a container image out of it and tag it for use with
GitHub Container Registry. The tag is the PR ID of the triggering pull
request.
</p>
<p>
The second job is the one where we deploy to ArgoCD:
</p>
<div><pre><code class="language-yaml">
deploy:
needs: build
container: ghcr.io/jodybro/argocd-cli:1.1.0
runs-on: ubuntu-latest
steps:
- name: Log into argocd
run: |
argocd login ${{ secrets.ARGOCD_GRPC_SERVER }} --username ${{ secrets.ARGOCD_USER }} --password ${{ secrets.ARGOCD_PASSWORD }}
- name: Deploy PR Build
run: |
argocd app create sysadvent2021-pr-${{ github.event.pull_request.number }} \
--repo https://github.com/jodybro/sysadvent2021.git \
--revision ${{ github.head_ref }} \
--path . \
--upsert \
--dest-namespace argocd \
--dest-server https://kubernetes.default.svc \
--sync-policy automated \
--values values.yaml \
--helm-set version="PR-${{ github.event.pull_request.number }}" \
--helm-set name="sysadvent2021-pr-${{ github.event.pull_request.number }}" \
--helm-set env[0].value="PR-${{ github.event.pull_request.number }}" \
--helm-set env[1].value="${{ github.head_ref }}" \
--helm-set env[2].value="sysadvent2021-pr-${{ github.event.pull_request.number }}"
</code></pre></div>
<p>
This workflow runs a custom image that I<a href="https://github.com/jodybro/argocd-cli"> wrote</a> that wraps the argocd
cli tool in a container and allows for arbitrary commands to be executed
against an authenticated ArgoCD instance.
</p>
<p>
It then creates a Kubernetes object of kind Application, a custom resource
(defined by a CRD that ArgoCD installs into your cluster) that defines where you
want to pull the application from and how to deploy it (Helm/Kustomize, etc.).
</p>
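<p>
If you want to see the result on the cluster side, you can inspect the generated
Application resource directly (this assumes ArgoCD is installed in the argocd
namespace and uses the naming from this example):
</p>
<div><pre><code class="language-bash">
kubectl --namespace argocd get application sysadvent2021-pr-1 --output yaml
</code></pre></div>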
<h2><b>Putting it all together</b></h2>
<p>
Now, let's see this pipeline in action. First, head to your repo and
create a PR against the main branch with some changes; it doesn't matter what
the changes are as all PR events will trigger the pipeline.
</p>
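<p>
If you prefer the command line, an empty commit is enough to exercise it (the
branch name is arbitrary; the PR itself can be opened in the GitHub UI or with
the gh CLI):
</p>
<div><pre><code class="language-bash">
git checkout -b pr-env-demo
git commit --allow-empty -m "Trigger the PR pipeline"
git push --set-upstream origin pr-env-demo
gh pr create --fill
</code></pre></div>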
<p>
You can see that my PR has triggered a pipeline which can be viewed <a href="https://github.com/jodybro/sysadvent2021/actions/runs/"> here</a>.
Furthermore, you can see that this pipeline was executed successfully, so if I
go to my ArgoCD instance, I would see an application with this PR ID.
</p>
<p>
So, if you are following along, you now have two deployments of this example app: one should show labels for
the main branch, and one should show labels for the PR branch.
</p>
<p>
Let's verify by port-forwarding to each and see what you get back.
</p>
<h3><b>Main branch</b></h3>
<p>
First, let's check out the main branch application:
</p>
<div><pre><code class="language-bash">
kubectl port-forward service/sysadvent2021-main 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
</code></pre></div>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/main-service.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="63" data-original-width="704" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/main-service.png"/></a></div>
<p>
As you can see, the branch is set to main with the correct version.
</p>
<p>
And if you check out the state of our Application in ArgoCD:</p>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-main.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="38" data-original-width="800" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-main.png"/></a></div>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-main-state.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="214" data-original-width="800" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-main-state.png"/></a></div>
<p>
Everything is healthy!
</p>
<h3><b>PR</b></h3>
<p>
Now let's check the PR deployment:
</p>
<div><pre><code class="language-bash">
kubectl port-forward service/sysadvent2021-pr-1 8000:8000
Forwarding from 127.0.0.1:8000 -> 8000
Forwarding from [::1]:8000 -> 8000
</code></pre></div>
<p>
This one's labels are showing the branch and the version from the PR.
</p>
<p>
This pod returns:
</p>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/pr-1-service.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="60" data-original-width="745" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/pr-1-service.png"/></a></div>
<p>
And in ArgoCD:
</p>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-PR-1.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="35" data-original-width="800" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-PR-1.png"/></a></div>
<div class="separator" style="clear: both;"><a href="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-PR-1-state.png" style="display: block; padding: 1em 0; text-align: center; "><img alt="" border="0" width="320" data-original-height="210" data-original-width="800" src="https://raw.githubusercontent.com/jodybro/sysadvent2021/main/images/ArgoCD-PR-1-state.png"/></a></div>
<h2><b>Final thoughts</b></h2>
<p>
It really is that easy to get PR environments running in your company!
</p>
<h1>Resources</h1>
* <a href="https://github.com/jodybro/sysadvent2021">Source Code Repo</a>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com1tag:blogger.com,1999:blog-3615332969083650973.post-53297490890693771682021-12-12T00:00:00.074-05:002021-12-12T00:00:00.179-05:00Day 12 - Terraform Refactoring<p>
By: Bill O'Neill (<a href="https://twitter.com/woneill">@woneill</a>)<br />
Edited by: Kerim Satirli (<a href="https://twitter.com/ksatirli">@ksatirli</a>)
</p>
<p>
Terraform is "Infrastructure as Code" and like all code, it is beneficial to
review and refactor to:
</p>
<ul>
<li>improve code readability and reduce complexity
</li><li>improve the maintainability of the source code
</li><li>create a simpler, cleaner, and more expressive internal architecture or
object model to improve extensibility
</li>
</ul>
<p>
This article outlines the approaches that have helped my teams when refactoring
Terraform code bases.
</p>
<h2>
<b>Convert modules to independent Git repositories</b></h2>
<p>
If your Terraform Git repository has grown organically, you will likely have a
monorepo structure complete with embedded modules, similar to this:
</p>
<pre class="prettyprint">$ tree terraform-monorepo/
.
├── README.md
├── main.tf
├── variables.tf
├── outputs.tf
├── ...
├── modules/
│ ├── moduleA/
│ │ ├── README.md
│ │ ├── variables.tf
│ │ ├── main.tf
│ │ ├── outputs.tf
│ ├── moduleB/
│ ├── .../
</pre>
<p>
Encapsulating resources within modules is a great step, but the monorepo
structure makes it difficult to iterate on individual module development, down
the line.
</p>
<p>
Splitting the modules into independent Git repositories will:
</p>
<ul>
<li>Enable module development in an isolated manner
</li><li>Support re-use of module logic in other Terraform code bases, across your
organization
</li><li>Enable publishing to public and private Terraform Registries
</li>
</ul>
<p>
Here's a process that you can follow to make a module a stand-alone Git
repository while preserving the historical log messages. The steps are examples
of how to extract <code>moduleA</code> from the above file tree into its own git
repository.
</p>
<ol>
<li>Clone the Terraform Git repository to a new directory. I recommend naming
the directory after the module you plan on converting.<br />
<code>git clone <REMOTE_URL> moduleA</code>
</li><li>Change into the new directory:<br /><code>cd moduleA</code>
</li>
<li>Use <code>git filter-branch</code> to split out the module into a new
repository.<br /><code>FILTER_BRANCH_SQUELCH_WARNING=1 git filter-branch
--subdirectory-filter modules/moduleA -- --all </code>
<p>
Note that we're squelching the warning about <code>filter-branch</code>. See
the filter-branch manual page for more details if you're interested.
</p>
</li><li>Now your directory will only contain the contents of the module itself,
while still having access to the full Git history. <br /><br />You can
run <code>git log</code> to confirm this.
</li>
<li>Create a new Git repository and obtain the remote URL for it, then update the
<code>origin</code> in the filtered repository:<br />
<pre class="prettyprint">git remote set-url origin <NEW_REMOTE_URL>
git push -u origin main
</pre>
</li>
<li>
<p>
Tag the repo as <code>v1.0.0</code> <i>before</i> making any
changes<br />
</p>
<pre class="prettyprint">
git tag v1.0.0
git push --tags
</pre>
</li>
<li>
<p>
Now that the new repository is ready to be used, update the existing references
to the module to use a <code>source</code> argument that points to the tag that
you just created.</p><p>The “<a href="https://www.terraform.io/docs/language/modules/sources.html#generic-git-repository">Generic
Git Repository</a>” section in Terraform's <a href="https://www.terraform.io/docs/language/modules/sources.html">Module
Sources</a> documentation has more details on the
format.</p><p>Replace lines such
as<br /><br /><code>source = "../modules/moduleA"</code>
</p>
<p>
with
</p>
<pre class="prettyprint">source = "git::<NEW_REMOTE_URL>?ref=v1.0.0"
</pre>
</li><li>Alternatively, publishing your module to a <a href="https://www.terraform.io/docs/language/modules/develop/publish.html">Terraform
registry</a> is an option (but this is outside the scope of this article).
</li><li>Once all <code>source</code> arguments that previously pointed to the
directory path have been replaced with references to Git repositories or
Terraform registry references, delete the directory-based module in the original
Terraform repository.
</li>
</ol>
<h2>
Update version constraints with <code>tfupdate</code></h2>
<p>
Masayuki Morita's <code><a href="https://github.com/minamijoyo/tfupdate">tfupdate</a></code> utility can be
used to recursively update version constraints of Terraform core, providers, and
modules. <br /><br />As you start refactoring modules and bumping
their version tags, <code>tfupdate</code> becomes an invaluable tool to ensure
all references have been updated.
</p><p>
Some examples of <code>tfupdate</code> usage, assuming the current directory is
to be updated:
</p>
<ul>
<li>Updating the version of Terraform core:<br /><code>tfupdate terraform
--version 1.0.11 --recursive .</code>
</li><li>Updating the version of the Google Terraform
provider:<br /><code>tfupdate provider google --version 4.3.0 --recursive
.</code>
</li><li>Updating the version references of Git-based module sources can be done with
the module subcommand, for example:<br /><code>tfupdate module
git::<REMOTE_URL> --version 1.0.1 --recursive .</code>
</li>
</ul>
<h2>
Test state migrations with <code>tfmigrate</code></h2>
<p>
Many Terraform users are hesitant to refactor their code base, since changes can
require updates to the state configuration. Manually updating the state in a
safe way involves duplicating the state, updating it locally, then copying it
back in place.
</p>
<p>
In addition to <code>tfupdate</code>, Masayuki Morita has another excellent
utility that can be used to apply Terraform state operations in a declarative
way while validating the changes, before committing them: <code><a href="https://github.com/minamijoyo/tfmigrate">tfmigrate</a></code>
</p><p>
You can do a dry run migration where you simulate state operations with a
temporary local state file and check to see if <code>terraform plan</code> has
no changes after the migration. This workflow is safe and non-disruptive, as it
does not <i>actually</i> update the remote state.
</p>
<p>
If the dry run migration looks good, you can use <code>tfmigrate</code> to apply
the state operations in a single transaction instead of multiple, individual
changes.
</p>
<p>
Migrations are written in HCL and use the following format:
</p>
<pre class="prettyprint">migration "state" "test" {
dir = "."
actions = [
"mv google_storage_backup.stage-backups google_storage_backup.stage_backups",
"mv google_storage_backup.prod-backups google_storage_backup.prod_backups",
]
}
</pre>
<p>
Each action line is functionally identical to the command you’d run
manually such as <code>terraform state <action> …</code>. A full list of
possible actions is available on the <a href="https://github.com/minamijoyo/tfmigrate#migration-block-state.">tfmigrate
website</a>.
</p>
<p>
Quoting resources that have indexed keys <a href="https://github.com/minamijoyo/tfmigrate/issues/5#issuecomment-722378933">can
be tricky</a>. The best approach appears to be using a single quote around the
entire resource and then escaping the double quotes in the index. For example:
</p>
<pre class="prettyprint">actions = [
"mv docker_container.nginx 'docker_container.nginx[\"This is an example\"]'",
]
</pre>
<p>
Testing the state migrations can be done via <code>tfmigrate plan
<filename></code>. The output will show you what <code>terraform plan
</code>would look like if you had actually carried out the state changes.
</p>
<p>
Applying the migration to the actual state is done via <code>tfmigrate apply
<filename></code>. Note that by default, it will only apply the changes if the
result from <code>tfmigrate plan</code> was a clean output.
<br /><br />If you still want to apply changes to a “dirty” state, you
can do so by adding a <code>force = true</code> line to the migration file.
</p>
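<p>
As a minimal sketch of that workflow (the migration filename <code>mv_backups.hcl</code> is hypothetical, chosen only for illustration):
</p>
<pre class="prettyprint"># mv_backups.hcl is a placeholder migration filename
tfmigrate plan mv_backups.hcl
tfmigrate apply mv_backups.hcl
</pre>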
<blockquote><p>
<span style="color: #38761d;">If you are using Terraform 1.1 or newer, there is now a built-in
<code>moved</code> statement that works similarly to these approaches. I haven’t
tested it out yet but it looks like a useful feature! I can see it being
especially useful for users who may not have direct access to state files such
as Terraform Cloud and Enterprise users or Atlantis users.
</span></p><span style="color: #38761d;">
</span><p><span style="color: #38761d;">
See the <a href="https://www.hashicorp.com/blog/terraform-1-1-improves-refactoring-and-the-cloud-cli-experience">announcement</a>
in the 1.1 release as well the <a href="https://learn.hashicorp.com/tutorials/terraform/move-config">HashiCorp
Learn tutorial</a> for more details.
</span></p></blockquote>
<h2>
<b>Ensure standards compliance with TFLint</b></h2>
<p>
According to its website, <a href="https://github.com/terraform-linters/tflint">TFLint</a> is a Terraform
linter with a handful of key features:
</p>
<ul>
<li>Finding possible errors (like illegal instance types) for major Cloud
providers (AWS/Azure/GCP)
</li><li>Warning about deprecated syntax and unused declarations
</li><li>Enforcing best practices and naming conventions
</li>
</ul>
<p>
TFLint has a plugin system for including cloud provider-specific linting rules
as well as updated Terraform rules. Setting up the list of rules can be done on
the command line but it is recommended to use a config file to manage the
extensive list of rules to apply to your codebase.
</p>
<p>
Here is a configuration file that enables all of the possible terraform rules as
well as AWS-specific rules. Save it in the root of your Git repository
as <code>.tflint.hcl</code>, then initialize TFLint by running <code>tflint --init</code>. Now
you can lint your codebase by running <code>tflint</code>.
</p>
<pre class="prettyprint">config {
module = false
disabled_by_default = true
}
plugin "aws" {
enabled = true
version = "0.10.1"
source = "github.com/terraform-linters/tflint-ruleset-aws"
}
rule "terraform_comment_syntax" {
enabled = true
}
rule "terraform_deprecated_index" {
enabled = true
}
rule "terraform_deprecated_interpolation" {
enabled = true
}
rule "terraform_documented_outputs" {
enabled = true
}
rule "terraform_documented_variables" {
enabled = true
}
rule "terraform_module_pinned_source" {
enabled = true
}
rule "terraform_module_version" {
enabled = true
exact = false # default
}
rule "terraform_naming_convention" {
enabled = true
}
rule "terraform_required_providers" {
enabled = true
}
rule "terraform_required_version" {
enabled = true
}
rule "terraform_standard_module_structure" {
enabled = true
}
rule "terraform_typed_variables" {
enabled = true
}
rule "terraform_unused_declarations" {
enabled = true
}
rule "terraform_unused_required_providers" {
enabled = true
}
rule "terraform_workspace_remote" {
enabled = true
}
</pre>
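<p>
With that configuration file in place, the workflow described above boils down to two commands:
</p>
<pre class="prettyprint">tflint --init
tflint
</pre>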
<h2>
<b>pre-commit</b></h2>
<p>
Setting up git hooks with the <a href="http://pre-commit.com/">pre-commit</a>
framework allows you to automatically run TFLint, as well as many other
Terraform code checks, prior to any commit.
</p>
<p>
Here is a sample <code>.pre-commit-config.yaml</code> that combines Anton
Babenko's excellent collection of Terraform specific hooks with some
out-of-the-box hooks for pre-commit. It ensures that your Terraform commits are:
</p>
<ol>
<li>Following the canonical format and style per <code>terraform fmt</code>
</li><li>Syntactically valid and internally consistent per <code>terraform
validate</code>
</li><li>Passing TFLint rules
</li><li>Following good practices, such as ensuring that:
<ul>
<li>merge conflicts are resolved
</li><li>private ssh keys aren't included
</li><li>commits are done to a branch instead of directly to <code>master</code> or
<code>main</code>
</li>
</ul>
</li>
</ol>
<pre class="prettyprint">repos:
- repo: git://github.com/antonbabenko/pre-commit-terraform
rev: v1.59.0
hooks:
- id: terraform_fmt
- id: terraform_validate
- id: terraform_tflint
args:
- '--args=--config=__GIT_WORKING_DIR__/.tflint.hcl'
- repo: git://github.com/pre-commit/pre-commit-hooks
rev: v4.0.1
hooks:
- id: check-added-large-files
- id: check-merge-conflict
- id: check-vcs-permalinks
- id: check-yaml
- id: detect-private-key
- id: end-of-file-fixer
- id: no-commit-to-branch
- id: trailing-whitespace
</pre>
<p>
You can take advantage of this configuration by:
</p>
<ul>
<li>Installing the pre-commit framework <a href="https://pre-commit.com/#install">per the instructions</a> on the website.
</li><li>Creating the above configuration in the root directory of your Git
repository as .pre-commit-config.yaml
</li><li>Creating a .tflint.hcl in the base directory of the repository
</li><li>Initializing the pre-commit hooks by running <code>pre-commit install</code>
</li>
</ul>
<p>
Now whenever you create a commit, the hooks will run against any changed files
and report back issues.
</p>
<p>Since the pre-commit framework normally only runs against changed files,
it’s a good idea to start off by validating all files in the repository by
running <code>pre-commit run --all-files</code>.</p>
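<p>
For reference, the whole setup can be sketched as a handful of commands. This assumes you install pre-commit with <code>pip</code>; any of the installation methods from the pre-commit website works equally well.
</p>
<pre class="prettyprint"># install pre-commit (pip shown here; other install methods work too)
pip install pre-commit
# with .pre-commit-config.yaml and .tflint.hcl in the repository root:
pre-commit install
pre-commit run --all-files
</pre>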
<h2><b>Conclusion</b></h2>
<p>
These approaches help make it easier and safer to refactor Terraform codebases,
speeding up a team's "Infrastructure as Code" velocity.
</p>
<p>
This helped my team gain confidence in making changes to our legacy modules and
enabled greater reusability. Standardizing on formatting and validation checks
also sped up code reviews. We could focus on module logic instead of looking for
typos or broken syntax.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-71216733960480770232021-12-11T00:00:00.001-05:002021-12-11T00:00:00.180-05:00Day 11 - Moving from Engineering Manager to IC<p>
By: Brian Scott (<a href="https://twitter.com/brainscott">@brainscott</a>)<br />
Edited by: Don O'Neill (<a href="https://twitter.com/sntxrr">@sntxrr</a>)
</p>
<p>
Within the past month, I've made a radical change into a new role with my
existing employer; for the past decade I was an SRE Manager building teams and a
Tech Executive. I hope to summarize my experience of moving into an IC role,
including how it made me feel. The thoughts and ideas in this article are my own
opinions, drawn from past experience.
</p>
<p>
For the past 6-8 years, I've been in an Engineering Manager/TechExec role,
specifically in Systems Reliability Engineering. I was comfortable, happy, and
engaged in this role, managing multiple SRE teams supporting a wide range of
products & platforms in the Enterprise.
</p>
<p>
Before we dive in deeper, a little history on myself: I've been playing with
technology since I was in 5th grade. My English teacher at the time taught me
everything he knew about repairing computers, primarily 286s & 386s, along with
DOS and the BASIC programming language.
</p>
<p>
As I moved from 8th grade into high school, my computer teacher approached me to
ask if I wanted to help administer the school's network of 12 Windows NT servers
running Active Directory, Exchange & file services, with over 4000 workstations
& printers. Apparently, my 5th-grade teacher had passed him a few tidbits about
what I was doing with computer science in middle school.
</p>
<p>
Little did I know that accepting the position was where my journey began. A few
startups (MySpace, remember that?) and mid-to-large corporations later, I ended
up in Engineering Management, primarily focused on building teams that support
large-scale applications both on-prem and in the cloud, delivering solutions
with a DevOps culture & SRE mindset.
</p>
<p>
I was used to building high-performing engineering teams and meeting new and
amazing engineers while focusing on creating <a
href="https://www.forbes.com/sites/lisabodell/2020/08/28/futurethink-forecasts-t-shaped-teams-are-the-future-of-work/">T-Shaped
teams</a> - not necessarily a new concept, but one that worked well for my
teams. During this time, we had an amazing leadership team that pushed us to go
above and beyond, while every day we met new product teams across the company
that needed our help in delivering great solutions. In certain organizations,
highly technical roles can be treated as semi-management.
</p>
<p>
We introduced several new technologies & concepts to the company as a whole,
developing many Communities of Practice around config management, containers,
CI/CD, and even web development with Go. With the vast coverage of different
areas the company was working in, I found myself slowly moving into a space that
the company had never had a role for - more on this in just a bit.
</p>
<p>
Before moving into management, I was a Staff SRE (Systems Reliability
Engineer). You might be thinking, isn’t it Site Reliability Engineering? Yes,
but different companies tailor the meaning of SRE to meet the needs of their
respective areas. In my case, we weren’t just managing sites & web applications
but systems that handle a wide range of products in the entertainment & media
space - think rendering, control systems, and safety systems.
</p>
<p>
As a manager, I started seeking out and making new connections across the
enterprise, assisting teams in onboarding the latest technology, whether that
was LiDAR, Kubernetes, GitOps & Docker, or new tools bursting with innovation in
the open source space. While being good at helping others and always saying
“YES”, I quickly found myself spread quite thin managing 5 different SRE teams
of roughly 3-5 members each, supporting over 3000 applications, some of which
were centralized services for the entire enterprise to consume. It was also
getting a little hard for me to stay current with the technology, which I loved.
</p>
<p>
Leadership quickly saw my success in evangelizing new technology and helping our
business units move fast in adopting new methods of engineering - not only
introducing new technology, but also ensuring our SREs had the proper tools and
were aware of up-and-coming automation that would help them reduce toil and
accelerate how we delivered value to our customers, internally and externally.
</p>
<p>
My leader called me into a meeting to discuss my interest in moving into an SRE
role - but instead of a pure engineering role, he wanted me to lead the
company’s effort in evangelizing new technology. He went on to explain his
vision of how this would allow me to expand my reach and support more teams:
helping create an organization around developer advocacy, mentoring our entire
global SRE organization to the next level, and inspiring others in areas such as
empathy engineering, automation, best practices, and what’s next in driving
technical leadership.
</p>
<p>
I was a bit taken aback but excited. There was also some nervousness, of course,
about how the move might affect my teams and my relationships with each of my
engineers. Over the next few weeks, my teams and leadership were very supportive
and believed that I was needed in this new role to make a bigger impact on the
organization and company as a whole.
</p>
<p>
Never be discouraged if you find yourself moving into an IC role; new
opportunities have a great way of nudging you in the right direction. People
often think that moving up the management ladder is the only measure of success,
but we have all seen incredible people in IC roles, such as Kelsey Hightower at
Google or Jessie Frazelle of Oxide Computer. Humans do their best work when
positioned to do things they love and given room to reach new heights.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-38841065792030941082021-12-10T00:00:00.003-05:002021-12-10T00:00:00.182-05:00Day 10 - Assembling Your Year In Review<p>
By: Paige Bernier (<a href="https://twitter.com/alpacatron3000">@alpacatron3000</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>) and Scott Murphy (<a href="https://twitter.com/ovsage">@ovsage</a>)
</p>
<h2>Intro</h2>
<p>
There are a few moments in my career when I have been struck by a story told
with data. When I set out as a Site Reliability Engineer into the big wide world,
I wanted to capture that data storytelling magic and have adapted a presentation
I call the “Year in Review”.
</p>
<p>
My first company had a tradition of taking a moment to pause and review the year
by the numbers. The showstopper was the chart showing the amount of data
ingested year over year since the founding.
</p>
<p>
In a single glance that chart conveyed a story that would take hours to tell!
</p>
<p>
It communicated the incredible efforts the employees took to scale the system to
handle ingesting, processing, publishing and storing an ever increasing mountain
of data. It illustrated how far the company had come and we were confronted head
on with the realization that “what got you here, won’t get you there”.
</p>
<p>
The biggest impact I have seen comes <em>after </em>the presentation.
Discussions from Year in Reviews have sparked sweeping oncall
management changes as well as minor, but important, changes in the way
developers engage with the SRE team.
</p>
<p>
Before diving into implementation details, let’s look at <em>why</em> this type
of data storytelling is such a powerful tool by examining the core purpose of
SRE
</p>
<h2>The Mission of SRE</h2>
<p>
The mission of an SRE team is to improve system reliability by facilitating
change.
</p>
<p>
System reliability is the sum of hundreds of decisions humans make when
developing, deploying, and maintaining software systems; it is <em>not</em> an
intrinsic property<sup id="fnref1"><a href="#fn1" rel="footnote">1</a></sup> of
the systems (Patrick O’Connor, 1998). SRE job descriptions tout phrases like
“evangelize a DevOps culture” and “influence without authority” acknowledging
our roles as change agents.
</p>
<p>
And as often heard, “change is hard”. As change agents, we are often faced with
conflicting priorities, multiple stakeholders internal and external, and fear of
the new and unknown.
</p>
<p>
However, just as often we hear “change is the only constant”. Whether it’s
hardware improvements, operating system upgrades, security vulnerability
announcements, software dependencies, or the software that we manage as a
service, we are constantly monitoring and implementing change.
</p>
<p>
Combine these two axioms, for extra difficulty:
</p>
<p>
Ask any engineer who has been forced into a major operating system upgrade when
the version of software they’re running requires the previous OS.
</p>
<p>
As an SRE I often want to make changes across the entire engineering
organization such as developing oncall onboarding, ensuring that we are
monitoring the customer’s experience, clarifying the lines of responsibility
between developers and operators and more!
</p>
<p>
These types of changes that affect everyone are difficult to implement
effectively until two things are true:
</p>
<ul>
<li>Is there a shared understanding of the current state?
<li>Is there agreement that the current state needs to change?
</li>
</ul>
<p>
This does not mean there needs to be consensus on what changes need to be made!
</p>
<h3>
Is there a shared understanding of the current state?
</h3>
<p>
The answer to this can be a resounding “Yes!” after your Year in Review
presentation. Here’s why:
</p>
<p>
Humans learn best from stories, feelings, senses, and opinions, commonly known as
<em>qualitative data</em>. Focusing on these exclusively, you risk coming to
broad conclusions without nuance or context.
</p>
<p>
Businesses claim to operate on data, facts and figures, or <em>quantitative
data</em>. Focusing purely on the numbers, you risk drowning in details and
heading down irrelevant rabbit holes.
</p>
<p>
In fact, the two seemingly disparate viewpoints aren’t at odds at all. You can
even validate findings by using the other category of data.
</p>
<p>
<strong>Feel: </strong>“Our monitoring sucks, none of the last 5 pages I got
were actionable”
</p>
<p>
<strong>Fact: </strong>The primary oncall was paged 5 times out of business
hours last week
</p>
<p>
<strong>Finding: </strong>Team X is getting paged frequently for non-actionable
reasons
</p>
<p>
Hosting a “Year in Review” means weaving a story using the quantitative data
about what occurred in your systems with the qualitative “anec-data” from a
human perspective to build a foundation to introduce change.
</p>
<h3>
Is there agreement that the current state needs to change?
</h3>
<p>
This is a more complex endeavor - identifying and implementing change is the
hard work of collaborating across teams, roles and competing incentives,
motives, and needs. Think of “Year in Review” as a springboard for driving
discussion and debate to align on “do we agree something needs to change?”
</p>
<p>
What does this look like in practice?
</p>
<p>
At a previous company I heard from engineers and managers alike that the oncall
rotations were in need of a shake-up. This was an excellent starting place:
everyone agreed that there was a problem, but we were having trouble
implementing the necessary changes.
</p>
<p>
With the goal of identifying exactly what the oncall issues were, my team
tailored a “Year in Review” focused mainly on oncall metrics such as alert
noise, hours oncall per engineer, and pages received per engineer. Slides
illustrated the deluge of alert storms - largely unactionable noise that no
human could possibly investigate in a given shift. The impact of <span
style="text-decoration:underline;">not</span> addressing this problem was clear:
we were likely missing important signals in the noise, and oncalls weren’t able
to effectively prioritize their time.
</p>
<p>
After reviewing the data as a group, my team facilitated a brainstorm to address
the barriers to changing the rotations:
</p>
<ul>
<li>How to handle ownership when multiple teams contribute code?
<li>What are the “hot potato” services no one feels comfortable owning?
<li>What services are unofficially owned by a single engineer that needs
documentation?
<li>What is the goal of a low urgency or warning alert?
</li>
</ul>
<p>
Based on the main discussion and others in standups and sidebars, my team
proposed new team-service ownership and rotations. Several weeks and a few rounds
of revisions later, we merged the PR with our new Terraformed oncall rotations!
</p>
<p>
</p>
<h2>
DIY “Year in Review”
</h2>
<p>
So, how do you create a “Year in Review” for an SRE team? To start, I typically
have a few things in mind about what I think happened and what the data will
show. It is fascinating to see where your perception of the system and reality
diverge. You can kick off your process by asking a couple of questions:
</p>
<ul>
<li>What story are you expecting the data to tell?
<li> What changes do you think need to be made in the next year to improve
reliability?
</li>
</ul>
<ol>
<li>Book a meeting with all parties (including engineers, managers, SRE, QA,
ops, and product managers). If there is an existing meeting like an All-Hands or
Demo Hour, sign up for a presentation slot.
<li>Kick off a brainstorming session and have participants list out possible
changes to include, such as new features launched, infrastructure expansions to
new regions, or even a doubling of the organization's size.
<li>Ask teams (including managers)
<ol>
<li>What data they would find interesting
<li>What data they could contribute from their domain
</li>
</ol>
<li>List the company-specific tooling for data sources like:
<ol>
<li>Version Control
<li>CI/CD
<li>Monitoring
<li>Incident Management
<li>Ticket tracking system
<li>Documentation store
<li>Support ticket system
</li>
</ol>
<li>Enlist the help of others to gather the interesting metrics over the past
year or year over year. Some suggestions are:
<ol>
<li>Noisiest alerts
<li>Number of environments
<li>Oncall engineers
<li>Number of services
<li>Ratio of oncall engineer to number of services oncall for
<li>Age of dependencies/libraries
<li># of hours oncall per person
<li>Number of features launched
<li># of after hour pages
<li>Ratio of warning alerts to pages
<li>Number of production deploys rolled up by day
<li>Number of open incident AIs
<li>Ingress traffic or other indicator of system load
<li>Most viewed documentation pages
<li>Most search documentation terms
<li>Time to first PR
<li>….and so much more!
</li>
</ol>
<li>Slice and dice the data, trying out top 10 lists, totals, or segmenting by
whatever constructs your company has, such as:
<ol>
<li>Department
<li>Service
<li>Team
<li>Product Feature
</li>
</ol>
<li>Group the data into themed areas: “oncall”, “production”, “onboarding”, etc.
If you have convinced folks to co-present with you, each person can be
responsible for presenting a different theme.
<li>Assemble the data into a slide deck with one chart per slide to maximize impact.
<li>Hold the meeting and present your findings.
<li>Discuss! In the meeting, after the meeting, and before the next Year in
Review, compare how you interpreted the data with how others did.
<li>Publish the data and your queries so everyone can explore and answer their
own questions
</li>
</ol>
<h2>
Parting Thoughts
</h2>
<p>
SREs are uniquely suited to facilitate a Year in Review, bringing a system-wide
perspective on the people, processes, and technology, along with a mission to
improve reliability. Keep in mind that, much like effecting change, hosting a Year in
Review is not a solo effort!
</p>
<p>
Going solo means you will only capture YOUR thoughts, which would almost certainly
benefit from being tempered by the unique vantage points of others. The more perspectives you
invite, the fuller the story of your system will be.
</p>
<p>
Please share your favorite data storytelling moments or Year in Review stats
with me on Twitter at @alpacatron3000
</p>
<h2>Citation</h2>
<p>
O’Connor, P. (1998) <em>Standards in reliability and safety engineering
</em>[Article]. Elsevier Science Limited, 9 Dec. 2021.
</p>
<p>
<a
href="https://www.sciencedirect.com/science/article/abs/pii/S095183209883010X">https://www.sciencedirect.com/science/article/abs/pii/S095183209883010X</a>
</p><!-- Footnotes themselves at the bottom. -->
<h2>Notes</h2>
<div class="footnotes">
<hr>
<ol><li id="fn1">
<p>
Since the SRE field is still getting established outside of Google, I started to
read perspectives from Reliability Engineering in other disciplines. A nugget from Patrick
O’Connor’s “Standards in reliability and safety engineering” paper sparked a spicy but important
revelation about reliability.
<p>
“Those reliability standards which apply mathematical/quantitative methods are
also based on the inappropriate application of “scientific” thinking. An engineered system or a
component has no intrinsic property of reliability, expressible for example as a failure rate. Truly
scientifically based properties of systems and components include mass, power output, etc., and
these can therefore be predicted and measured with credibility. However, whether a missile or a
microcircuit fails depends upon the quality of the design, production, maintenance and use applied to
it. These are human contributions, not “scientific”.” <a href="#fnref1"
rev="footnote">↩</a>
</ol></div>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-70927812865113701502021-12-09T00:16:00.007-05:002021-12-09T00:18:39.047-05:00Day 9 - 3 things parenting taught me about system administration<p>
By: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)<br />
</p>
<p>
The last five years have been grounding for me as I became a beginner at
parenting. In this article, I want to share three things I learned about being a
better sysadmin from being a mom.
</p>
<h2>Prioritize your health</h2>
<p>
Of course, I've heard it so many times. But in the rush of trying to support the
"system," sometimes, I lose track of the little things (getting enough sleep,
eating meals, human engagement that isn't predicated on deliverables and action
items). When it comes to parenting, I see the difference in how the necessities
of the moment can gradually subsume the primary goals and real joy* (a secondary
outcome of successful parenting that I tend to only enjoy in retrospect, after
having assured myself that my internal parenting kanban board is as it should
be–obsession, exhaustion, and then joy tends to be my experiential flow as a
parent).
</p>
<p>
Prioritizing health - if I'm not ok, I'm not able to handle the "system" as
well, regardless of its state.
</p>
<p>
Any parent of a child under five will tell you that 90 percent of the job is
keeping the child alive. If they make it to the next day, smile and giggle the
proper number of times per day, and if your friends, family, and parenting peers
seem unaware that your parenting path bears a concerning resemblance to the plot
of the movie Speed, then you're more or less gravy. You also learn that, while
you can spend a great deal of time analyzing and conversing about your child and
how they're faring, the main thing is to put them in the right places at the
right time. Sunshine, exercise, the company of their peers, easily accessible
bathrooms–these are the things that matter. If my son doesn't get direct
sunlight within 90 minutes of waking, his mood takes a nosedive, and this isn't
a mystery to me. Likewise, if he isn't let loose at the park to terrify small
woodland creatures with his desire to befriend them, his attentional resources
will be suboptimal when it's time for flashcards. Yet I (and I don't think I'm
alone in this) will frequently wake, obtain caffeine, have a quick all-hands
with my family, and proceed to sit in a small room staring at a screen for eight
hours straight. As a result, my ability to practice self-care fails regularly.
</p>
<h2>Leverage the community </h2>
<p>
To prioritize my health, I have to ask for help. I've had the following
experience again and again professionally, and as a parent, and at some point, I
hope that it won't astound me, which it does every time: I believe that I'm
having a singular experience (which, of course, we all are) and that I am an
outlier because obviously no one else is concerned about the state of affairs or
struggling. And then someone else gives voice to the precise issue that I've
devoted considerable resources to NOT sharing. Of course, other people are also
concerned about the children pretending that the scissors are boomerangs. One of
my primary errors is thinking that there is some scorekeeping going on -
tracking social currency and categorizing discourse into the buckets of "I
helped" and "I was helped." It's a binary that renders engagements as transactional, when my
actual community experience is almost always that I walk away feeling better
regardless of who broached a topic.
</p>
<h2>You can't eliminate all Snowflakes </h2>
<p>
Within the community, we often talk about snowflakes as problems. Yet, as a
parent, you discover that there are no handbooks for YOUR kid because every
child is different in their own beautiful, hard, and surprising way. Likewise,
while there is value in the community and sharing stories, every system will be
different. You work with one system, you've learned about that system, and while
there are useful things you'll learn from that system to apply to other systems,
every system will be beautiful, different, and hard in its surprising ways.
</p>
<h2>Wrapping Up</h2>
<p>
Our industry is constantly evolving with the introduction of new technology,
tools, and processes. It may feel overwhelming to try to understand everything.
You have to accept some degree of the unknown. When I first became a parent, I
realized that Operations had prepared me for the inevitable changes that occur
every single day. No matter what tomorrow brings, the essential skills are
learning to adapt to change and learning to learn fast.
<br><br>Please make time for yourself, connect with the
community, and accept what is different and unique about your systems and the
environments they are running in.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-88758232386916578422021-12-08T00:00:00.011-05:002021-12-08T00:08:50.606-05:00Day 8 - D&D for SREs
<p>
By: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)<br />
</p>
<p>
In a past life, I was a full-time SRE and a part-time dragonborn paladin named
Lorarath. While at work, I supported thousands of systems in collaboration with
a team of geeks. Evenings, I tried to survive imaginary disasters and save the
world from the sorceress Morgana. I love collaborative games because they plug
into some of the real-world emotional responses and social processes critical
for successful, meaningful engagement. They provide a safe place to practice
dealing with critical scenarios. When you know the stakes are purely
imaginary, you're able to look at your efforts from a distance, to gain
understanding and enjoy the process of learning and achieving goals together,
even when failing. I want to share a couple of insights D&D has given me about
my work and how this can help you.
</p>
<h2>Building your SRE Team … more than just a name.</h2>
<p>
SRE has many names: Operations, DevOps, Infrastructure engineering, System
Admin. It's someone who deploys and runs a highly available, scalable, and
secure service that meets business and partner requirements. But what does that
mean? Generally, it means someone with a wide-ranging set of skills tackling
different challenges at any point in time.
</p>
<p>
When you first start a campaign in Dungeons & Dragons, you choose a class to
play. This class will then have specializations that you customize based on how
you want to play. Next, you build out your character using a character sheet and
create a backstory. This character sheet has several abilities and skills. You
have several points to allocate to abilities and skills, which grants you
additional chances to handle particular events successfully.
</p>
<p>
In gaming, you collaborate with your teammates to ensure that you have a
well-rounded party, often choosing roles that complement each other. You don't want a team of all
"magic users" or hack and slashers. Often, we stop at identifying who we are
with that single name, whether it's SRE or sysadmin. As an SRE, I depend on a
diverse team with varied skills. I am not seeking people with the same expertise
or abilities. I'm looking for people with complementary skills who can help
accomplish the goals and visions of the team.
</p>
<h2>Developing your “character sheet”</h2>
<p>
There is no equivalent to a "character sheet" when it comes to your job. The
closest might be equating a resume or LinkedIn profile to a character sheet.
Still, these don't align to all of the possible experiences you gain:
</p>
<ul>
<li>Submitting git pull requests.
<li>Participating in hackathons.
<li>Attending training or conferences.
<li>The myriad of other day-to-day challenges you face.
</li>
</ul>
<p>
Additionally, if you don't practice skills in real life, they languish. For
example, I haven't touched Solaris in over a decade, and I no longer document it
as a skill.
</p>
<p>
If SRE did have a character sheet, I think the three core abilities would be
Communication, Collaboration, and Confidence. Let's take a closer look at these
specializations and the value of spending energy on these areas.
</p>
<h3>Specialization: Communication</h3>
<p>
<strong>Communication</strong> is a fundamental building block to successful
character building. As an SRE, I faced various scenarios that required expert
communication.
</p>
<ul>
<li>The first specialty in communication is the <strong>number of
messages</strong>. How often should I remind people about upcoming scheduled
maintenance? How often should I reach out to my manager about working on the
right thing? How often should my team get together to talk about team tasks?
</li>
</ul>
<ul>
<li>The second specialty in communication is the <strong>quality of
messages</strong>. Communication can be visual, written, or oral. Visuals can
often convey much more nuanced meaning than the same information repeated in
textual format, and they are an underleveraged method.
</li>
</ul>
<ul>
<li>The third specialty in communications is <strong>effectiveness.
</strong>Effectiveness is the degree to which your words lead to the desired
results. This specialty is the most advanced because effective communication
requires an in-depth understanding of the audience and crafting your message as
needed.
</li>
</ul>
<h3>Specialization: Collaboration</h3>
<p>
The second core ability is collaboration. In any product or service, you are
working on, work needs to be understood, planned, and executed. It doesn't
matter who does the work; it just matters that it gets done.
</p>
<p>
The role I take today doesn't define who I am. If I say, "I'm an SRE at
Company," that is just one characteristic of my story and not my identity. Every
day as you go into work and tackle your challenge, recognize <strong><em>your
special value</em></strong> and what <strong><em>you</em></strong> bring to the
team. Rather than adopting and marrying your identity to a specific role,
realize some days you take on a role that may be quite different from what you
are used to, and that's part of your character development.
</p>
<p>
There is a distinction between the members of your team and the roles they play.
In gaming, you become comfortable speaking on behalf of your character while
having a separate, sometimes meta-conversation with your teammates. Social
environments seem to tend towards homeostasis, and you (may) naturally ascribe a
simplistic narrative to your co-workers' actions. Adopting the awareness that
everyone is filling a role on the team, and that the role is not representative
of everything about the individual, allows you to approach and do the impactful
work that needs to get done.
</p>
<p>
In other words, never say, "well, they are just the ROLENAME and can't do that,"
or "that's not my job."
</p>
<h3>Specialization: Confidence</h3>
<p>
The third core ability for your SRE character sheet is
<strong>confidence</strong>. Confidence is about the innate quality that drives
you to take risks (or not).
</p>
<p>
In gaming, sometimes you take the wrong path, or you put your squishy players
out front, and they get severely damaged. Mistakes happen. In the "real world,"
customers do something unexpected. There are bugs in the software, hardware
fails, or someone from the team enters the wrong command on the wrong terminal
in the production environment.
</p>
<p>
Collaborative games teach you to fail as a group and rise again while retaining
the group cohesion necessary to succeed. Of course, if a teammate really caused
you to be captured by a giant spider, you'd probably flip out. Still, across the
game board, one has the emotional wiggle-room to behave in a manner that would
be laudable in professional situations.
</p>
<p>
Playing teaches you about exploring challenges with imagination and a sense of
play. You have to piece things together while continuing to take action, both
keeping in mind the larger game goals and what's immediately on the board at the
same time. In addition to this enormous world to explore, there are complex
characters (non-playing characters or NPCs) to talk to, and information gathered
within each encounter. Be on the lookout for the helpful non-production
engineers (NPEs) in your environment, too; while they may not maintain
production, they may have valuable information to support you.
</p>
<h2>Wrapping Up</h2>
<p>
So, perhaps this article has inspired you to add some collaborative gaming to your team
building, build out your team with complementary skills, or map the work of
SRE or system administration onto a character sheet. Great - beyond the
"character sheet," you need the appropriate visualization. By analyzing the
particular work items that an individual completed, there could be an
incremented "skill" counter. Additional information like git commits,
distribution of package management, and incident management APIs could be
gathered and glued together to create a way to look at progress over time. That
way, you could make sure to spend time on the skills that will improve you in
the direction of your choosing.
</p>
<p>
If you want to try out D&D, check out your local game stores or related groups.
Beginner games often provide preconfigured characters that allow you to practice
the gameplay without understanding all of the nuances of playing the game.
</p>
sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-87219054690099112672021-12-07T00:00:00.000-05:002021-12-07T00:34:09.197-05:00Day 7 - Baking Multi-architecture Docker Images<p>
By: Joe Block (<a href="https://twitter.com/curiousbiped">@curiousbiped</a>)<br />
Edited by: Martin Smith (<a href="https://twitter.com/martinb3">@martinb3</a>)
</p>
<p>
My home lab cluster has a mix of CPU architectures - several Odroid<a
href="https://www.hardkernel.com/shop/odroid-hc2-home-cloud-two/"> HC2s</a> that
are arm7, another bunch of Raspberry Pi 4s and Odroid<a
href="https://www.hardkernel.com/shop/odroid-hc4/"> HC4s</a> that are arm64 and
finally a repurposed MacBook Air that is amd64. To further complicate things,
they're not even all running the same linux distribution - some run Raspberry Pi
OS, one's still on Raspbian, some are running debian (a mix of buster and
bullseye), and the MacBook Air runs Ubuntu.
</p>
<p>
To reduce complication, the services in the cluster are all running in docker or
containerd - it's a homelab, so I'm deliberately running multiple options to
learn different tooling. This meant that I had to do three separate builds every
time I updated one of my images (arm7, arm64, and amd64) on three different
machines, and my service startup scripts all had to determine what architecture
they were running on and figure out what image tag to use.
</p>
<h2><strong>Enter multi-architecture images</strong></h2>
<p>
It used to be a hassle to create multi-architecture images. You'd have to create
an image for each architecture, then upload them all separately from each build
machine, then construct a manifest file that included references to all the
different architecture images and then finally upload the manifest. This doesn't
lead to easy rapid iteration.
</p>
<p>
Now, thanks to<a href="https://docs.docker.com/buildx/working-with-buildx/">
docker buildx</a>, you can create multi-architecture images as easily as docker
build creates them for a single architecture.
</p>
<p>
Let's take a look with an example on my system. First, I can see what
architectures are supported with docker buildx ls. As of 2021-12-03, Docker
Desktop for macOS supports the following:
</p>
<div><pre><code class="language-none">
NAME/NODE DRIVER/ENDPOINT STATUS PLATFORMS
multiarch * docker-container
multiarch0 unix:///var/run/docker.sock running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/mips64le, linux/mips64, linux/arm/v7, linux/arm/v6
desktop-linux docker
desktop-linux desktop-linux running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
default docker
default default running linux/amd64, linux/arm64, linux/riscv64, linux/ppc64le, linux/s390x, linux/386, linux/arm/v7, linux/arm/v6
</code></pre></div>
<p>
My home lab only has three architectures, so in these examples I'm going to
build for arm7, arm64 and amd64.
</p>
<h3><strong>Create a builder</strong></h3>
<p>
I need to create a builder that supports multi-architecture builds. This only
needs to be done once as Docker Desktop will reuse it for all of my buildx
builds.
</p>
<div><pre><code class="language-none">
docker buildx create --name multibuild --use
</code></pre></div>
<h3><strong>Building a multi-architecture image</strong></h3>
<p>
Now, when I build an image with docker buildx, all I have to do is specify a
comma-separated list of desired platforms with --platform. Behind the scenes,
Docker Desktop will fire up QEMU virtual machines for each architecture I
specified, run the image builds in parallel, then create the manifest and upload
everything.
</p>
<p>
As an example, I have a docker image,<a
href="https://github.com/unixorn/unixorn-py3"> unixorn/unixorn-py3</a> that I
use for my python projects that installs a minimal Python 3 onto debian 11-slim.
</p>
<p>
I build it with docker buildx build --platform
linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3, resulting in
the output below showing that it's building all three architectures.
</p>
<div><pre><code class="language-none">
❯ rake buildx
Building unixorn/debian-py3
docker buildx build --platform linux/amd64,linux/arm/v7,linux/arm64 --push -t unixorn/debian-py3 .
[+] Building 210.4s (17/17) FINISHED
=> [internal] load build definition from Dockerfile 0.0s
=> => transferring dockerfile: 571B 0.0s
=> [internal] load .dockerignore 0.0s
=> => transferring context: 2B 0.0s
=> [linux/arm64 internal] load metadata for docker.io/library/debian:11-slim 3.7s
=> [linux/arm/v7 internal] load metadata for docker.io/library/debian:11-slim 3.6s
=> [linux/amd64 internal] load metadata for docker.io/library/debian:11-slim 3.6s
=> [auth] library/debian:pull token for registry-1.docker.io 0.0s
=> [auth] library/debian:pull token for registry-1.docker.io 0.0s
=> [auth] library/debian:pull token for registry-1.docker.io 0.0s
=> [linux/arm64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 4.4s
=> => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 0.0s
=> => sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680 30.06MB / 30.06MB 2.0s
=> => extracting sha256:968621624b326084ed82349252b333e649eaab39f71866edb2b9a4f847283680 2.4s
=> [linux/amd64 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 4.0s
=> => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 0.0s
=> => sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d 31.37MB / 31.37MB 1.8s
=> => extracting sha256:e5ae68f740265288a4888db98d2999a638fdcb6d725f427678814538d253aa4d 2.2s
=> [linux/arm/v7 1/2] FROM docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 4.3s
=> => resolve docker.io/library/debian:11-slim@sha256:656b29915fc8a8c6d870a1247aad7796ce296729b0ae95e168d1cfe30dd2fb3c 0.0s
=> => sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206 26.57MB / 26.57MB 2.3s
=> => extracting sha256:ba82a1312e1efdcd1cc6eb31cd40358dcec180da31779dac399cba31bf3dc206 2.0s
=> [linux/amd64 2/2] RUN apt-get update && apt-get install -y apt-utils ca-certificates --no-install-recommends && apt-get upgrade -y --no-install-r 22.3s
=> [linux/arm/v7 2/2] RUN apt-get update && apt-get install -y apt-utils ca-certificates --no-install-recommends && apt-get upgrade -y --no-install 176.9s
=> [linux/arm64 2/2] RUN apt-get update && apt-get install -y apt-utils ca-certificates --no-install-recommends && apt-get upgrade -y --no-install- 173.6s
=> exporting to image 25.4s
=> => exporting layers 6.7s
=> => exporting manifest sha256:ae5a5dcfe0028d32cba8d4e251cd7401c142023689a215c327de8bdbe8a4cba4 0.0s
=> => exporting config sha256:48f97d6d8de3859a66625982c411f0aab062722a3611f18366ecff38ac4eafb9 0.0s
=> => exporting manifest sha256:fc7ad1e5f48da4fcb677d189dbc0abd3e155baf8f50eb09089968d1458fdcfb9 0.0s
=> => exporting config sha256:60ced8a7d9dc49abbbcd02e7062268fdd2f14d9faedcb078b2980642ae959c3b 0.0s
=> => exporting manifest sha256:8f96f20d75502d5672f1be2d9646cbc5d5de3fcffd007289a688185714515189 0.0s
=> => exporting config sha256:0c6e42f87110443450dbc539c97d99d3bfdd6dd78fb18cfdb0a1e3310f4c8615 0.0s
=> => exporting manifest list sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa 0.0s
=> => pushing layers 17.2s
=> => pushing manifest for docker.io/unixorn/debian-py3:latest@sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa 1.4s
=> [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io 0.0s
=> [auth] unixorn/debian-py3:pull,push token for registry-1.docker.io 0.0s
docker pull unixorn/debian-py3
Using default tag: latest
latest: Pulling from unixorn/debian-py3
e5ae68f74026: Already exists
86834dffc327: Pull complete
Digest: sha256:9133393fcebf2a2bdc85a6b7df34fafad55befa58232971b1b963d2ba0209efa
Status: Downloaded newer image for unixorn/debian-py3:latest
docker.io/unixorn/debian-py3:latest
1.60s user 1.05s system 1% cpu 3:36.49s total
</code></pre></div>
<p>
One minor issue - docker buildx has a separate cache that it builds the images
in, so when you build, the images won't be loaded in your local
docker/containerd environment. If you want to have the image in your local
docker environment, you need to run buildx with --load instead of --push.
</p>
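<p>
For example, a quick local test build might look like the sketch below. Note that with the container builder, --load generally handles one target platform at a time, and the :local tag here is only an illustrative name.
</p>
<div><pre><code class="language-none">
# build a single-platform image and load it into the local docker image store
# (the :local tag is just an example name)
docker buildx build --platform linux/arm64 --load -t unixorn/debian-py3:local .
docker run unixorn/debian-py3:local python3 --version
</code></pre></div>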
<p>
In this example, instead of running docker run unixorn/debian-py3:amd64, docker
run unixorn/debian-py3:arm7 or docker run unixorn/debian-py3:arm64 based on what
machine I'm on, now I can use the same image reference on all the machines -
</p>
<div><pre><code class="language-none">
❯ docker run unixorn/debian-py3 python3 --version
Python 3.9.2
❯
</code></pre></div>
<h2><strong>Takeaway</strong></h2>
<p>
If you're running a mix of architectures in your lab environment, docker buildx
will simplify things considerably.
</p>
<p>
No more maintaining multiple architecture tags, no more having to build on
multiple machines, no more accidentally forgetting to update one of the tags so
that things are mysteriously different on just some of our machines, no more
weird issues because we forgot to update service start scripts and
docker-compose.yml files.
</p>
<p>
Simpler is always better, and buildx will simplify the environment for you.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-62319905521744634502021-12-05T22:04:00.001-05:002021-12-05T22:04:34.057-05:00Day 6 - More to come tomorrow!We don't have any special system content for you today. We will have more tomorrow! sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-53645165494375002552021-12-04T03:00:00.003-05:002021-12-05T11:19:48.722-05:00Day 5 - Least Privilege using strace<p>
By: Shaun Mouton (<a href="https://twitter.com/sdmouton">@sdmouton</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Security in software development has been a hot-button issue for years.
Increasing awareness of the threat posed by supply chain breaches has only
increased the pressure on teams to improve security in all aspects of software
delivery and operation. A key premise is least privilege: granting the
minimum privileges necessary to accomplish a task, in order to prevent folks
from accessing or altering things they shouldn't have rights to. Here's my
thinking: we should help users apply the principles of least privilege when designing tools. When we find that our security tooling does not enable least-privilege use, we can still address the problem using tracing tools, which can be found in most Linux distribution package repositories. I would like to share my adventure of
looking at an InSpec profile (using CINC Auditor) and a container I found on
Docker Hub to demonstrate how to apply least privilege using <a href="https://sysadvent.blogspot.com/2008/12/sysadmin-advent-day-1.html">strace</a> for process access
auditing.
</p>
<p>
At my prior job working at Chef, I fielded a request asking how to run an InSpec
profile as a user other than root. InSpec allows you to write policies in code
(called InSpec Profiles) to audit the state of a system. Most of the
documentation and practice at the time had users inspecting the system as root or a
root-equivalent user. At first glance, this makes a certain amount of sense:
many tools in the "let's configure the entire system" and
"let's audit the security of the entire system" spaces need access to whatever
the user decides they want to check against. Users can write arbitrary profile
code for InSpec (and the open source CINC Auditor), ship those profiles
around, and scan their systems to determine whether or not they're in
compliance.
</p>
<p>
I've experienced this pain of excessive privileges with utilities myself. I
can't count the number of times we'd get a request to install some vendor tool
nobody had ever heard of with root privileges. Nobody who asked could tell us
what it'd be accessing, whether it would be able to make changes to the system,
or how much network/cpu/disk it'd consume. The vendor and the security
department or DBAs or whoever would file a request with the expectation that we
should just trust their assertion that nothing would go wrong. So, being
responsible system administrators, we'd say "no, absolutely not, tell us what
it's going to be doing first" or "yes, we'll get that work scheduled" and then
never schedule the work. This put us in the position of being gatekeepers rather than enablers of responsible behavior. While justified, it never sat right with me.
</p>
<p>
(Note: It is deeply strange that vendors often can't tell customers what their
tools do when asked in good faith, as is the idea that there should be an
assumption of trustworthiness in that lack of information.)
</p>
<p>
I've found some tools over the years which might be able to give a user output
which can be used to help craft something like a set of required privileges to
run an arbitrary program with non-root privileges. Not too long ago I discussed
"securing the supply chain" on how to design an ingestion pipeline to enable
folks to run containers in a secure environment where they could be somewhat
assured that a container using code they didn't write wasn't going to try to
access things that they weren't comfortable with. I thought about this old
desire of limiting privileges when running an arbitrary command, and figured
that I should do a little digging to see if something already existed. If not,
maybe I could work towards a solution.
</p>
<p>
Now, I don't consider myself an expert developer but I have been writing or
debugging code in one form or another since the '90s. I hope you consider this
demo code with the expectation that someone wanting to do this in a production
environment will re-implement what I've done far more elegantly. I hope that
seeing my thinking and the work will help folks to understand a bit more about
what's going on behind the scenes when you run arbitrary code, and to help you
design better methods of securing your environment using that knowledge.
</p>
<p>
What I'll be showing here is the use of strace to build a picture of what is
going on when you run code and how to approach crafting a baseline of expected
system behavior using the information you can gather. I'll show two examples:
</p>
<ul>
<li>executing a relatively simple InSpec profile using the open source
distribution's CINC Auditor
</li><li>running a randomly selected container off Docker Hub
(jjasghar/container_cobol)
</li>
</ul>
<p>
Hopefully, seeing this work will help you solve a problem in your environment or
avoid some compliance pain.
</p>
<h2><b>Parsing strace Output for a CINC Auditor (Chef InSpec)
profile</b></h2>
<p>
There are other write-ups of strace functionality which go into broader and
deeper detail on what's possible using it,<a href="https://jvns.ca/categories/strace/"> I'll point to Julia Evans' work</a>
to get you started if you want to know more.
</p>
<p>
Strace is the venerable Linux debugger, and a good tool to use when coming up
against a "<a href="https://sysadvent.blogspot.com/2010/12/day-15-down-ls-rabbit-hole.html">what's going on when this program runs</a>" problem. However, its output
can be decidedly unfriendly. Take a look in the<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output">
strace-output directory in this repo</a> for the files matching the pattern
linux-baseline.* to see the output of the following command:
</p>
<div><pre><code class="language-none">
root@trace1:~# strace --follow-forks --output-separately --trace=%file -o
/root/linux-baseline cinc-auditor exec linux-baseline
</code></pre></div>
<p>
You can parse the output, however, if all you want to know is what files might
need to be accessed (<a href="https://explainshell.com/explain?cmd=awk+-F+%27%22%27+%27%7Bprint+%242%7D%27+linux-baseline%2Flinux-baseline.108579+%7C+sort+-uR+%7C+head">for
an explanation of the command go here</a>) you can do something similar to the
following (maybe don't randomly sort the output and only show 10 lines):
</p>
<div><pre><code class="language-none">
awk -F '"' '{print $2}' linux-baseline/linux-baseline.108579 | sort -uR | head
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/minitest-5.14.4/lib/nokogiri.so
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/train-winrm-0.2.12/lib/psych/visitors.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/i18n-1.8.10/lib/rubygems/resolver/index_set.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-cognitoidentityprovider-1.53.0/lib/inspec/resources/command.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/jwt-2.3.0/lib/rubygems/package/tar_writer.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-codecommit-1.46.0/lib/pp.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/extensions/x86_64-linux/2.7.0/ffi-1.15.4/http/2.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/extensions/x86_64-linux/2.7.0/bcrypt_pbkdf-1.1.0/rubygems/package.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-databasemigrationservice-1.53.0/lib/inspec/resources/be_directory.rb
/opt/cinc-auditor/embedded/lib/ruby/gems/2.7.0/gems/aws-sdk-ram-1.26.0/lib/rubygems/resolver/current_set.rb
</code></pre></div>
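<p>
One optional refinement (a sketch; adjust to taste): strace records the files a process <i>tried</i> to open as well as the ones it actually opened, so failed lookups show up with ENOENT. If you only care about paths that were successfully accessed, drop the failures before extracting the filenames:
</p>
<div><pre><code class="language-none">
# filter out failed lookups (ENOENT) before pulling out the quoted paths
grep -v ENOENT linux-baseline/linux-baseline.108579 | awk -F '"' '{print $2}' | sort -u
</code></pre></div>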
<p>
You can start to build a picture of what all the user would need to be able to
access in order to run a profile based on that output, but in order to go
further I'll use a<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/linux-vsp">
much simpler check</a>:
</p>
<div><pre><code class="language-none">
cinc-auditor exec linux-vsp/
</code></pre></div>
<p>
Full results of that command are located in the<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output">
strace-output directory</a> with files matching the pattern linux-vsp.*, but to
summarize what cinc-auditor/inspec is doing:
</p>
<ul>
<li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109613">linux-vsp.109613</a>
- this file shows all the omnibussed Ruby files the cinc-auditor command (the
parent process) tries to access in order to run
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109614">linux-vsp.109614</a>
- why Auditor is trying to run cmd.exe on a Linux system I don't yet know;
you'll get used to seeing $PATH traversal very quickly
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109615">linux-vsp.109615</a>
- I see a Get-WmiObject Win32_OperatingSys in there so we're checking to see if
this is Windows
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109616">linux-vsp.109616</a>
- more looking on the $PATH for Get-WmiObject so more Windows checking
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109617">linux-vsp.109617</a>
- I am guessing that checking the $PATH for the Select command is more of the
same
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109618">linux-vsp.109618</a>
- Looking for and not finding ConvertTo-Json, this is a PowerShell cmdlet,
right?
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109619">linux-vsp.109619</a>
- Now we're getting somewhere on Linux: this is running uname -s (with $PATH
traversal info in there - see how used to this you are by now?)
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109620">linux-vsp.109620</a>
- Now running uname -m
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109621">linux-vsp.109621</a>
- Now running test -f /etc/debian_version
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109622">linux-vsp.109622</a>
- Doing something with /etc/lsb-release but I didn't use the -v or -s strsize
flags with strace so the command is truncated.
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109623">linux-vsp.109623</a>
- Now we're just doing cat /etc/lsb-release using locale settings
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109624">linux-vsp.109624</a>
- Checking for the inetd package
</li><li><a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp.109625">linux-vsp.109625</a>
- Checking for the auditd package, its config directory /etc/dpkg/dpkg.cfg.d,
and the config files /etc/dpkg/dpkg.cfg, and /root/.dpkg.cfg
</li>
</ul>
<p>
Moving from that to getting an idea of everything a non-root user would need to
be able to access, you can do something like this in the strace-output directory
(<a href="https://explainshell.com/explain?cmd=find+.+-name+%22linux-vsp.10*%22+-exec+awk+-F+%27%22%27+%27%7Bprint+%242%7D%27+%7B%7D+%5C%3B+%7C+sort+-u+%3E+linux-vsp_files-accessed.txt">explainshell
here</a>):
</p>
<div><pre><code class="language-none">
find . -name "linux-vsp.10*" -exec awk -F '"' '{print $2}' {} \; | sort -u >
linux-vsp_files-accessed.txt
</code></pre></div>
<p>
You can see the<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output/linux-vsp_files-accessed.txt">
output of this command here</a>, but you'll need to interpret some of the output
from the perspective of the program being executed. For example, I see "Gemfile"
in there without a preceding path. I expect that's Auditor looking in the
./linux-vsp directory where the profile being called exists, and the other
entries without a preceding path are probably also relative to the command being
executed.
</p>
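<p>
If you want to pull out just those relative entries for a closer look, a quick filter
(a minimal sketch using the same output file) could be:
</p>
<div><pre><code class="language-none">
# list entries that don't start with a slash, i.e. paths relative to the working directory
grep -v '^/' linux-vsp_files-accessed.txt | sort -u | head
</code></pre></div>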
<h2><b>Parsing strace output of a container execution</b></h2>
<p>
I said Docker earlier, but I've got podman installed on this machine so that's
what the output will reflect. You can find the output of the following command
in the strace-output<a href="https://github.com/moutons/tracing-sysadvent-2021/blob/main/strace-output">
directory</a> in files matching the pattern container_cobol.*, and wow. Turns
out running a full CentOS container produces a lot of output. When scanning
through the files, you see what looks like podman doing podman things, and what
looks like the COBOL Hello World application executing in the container. As I go
through these files I will call out anything particularly interesting I see
along the way:
</p>
<div><pre><code class="language-none">
root@trace1:~# strace -ff --trace=%file -o /root/container_cobol podman run -it container_cobol
Hello world!
root@trace1:~# ls -1 container_cobol.* | wc -l
146
</code></pre></div>
<p>
I'm not going to go through 146 files individually as I did previously, but this
is an interesting data point:
</p>
<div><pre><code class="language-none">
root@trace1:strace-output# find . -name "container_cobol.1*" -exec awk -F '"' '{print $2}' {} \; | sort -u > container_cobol_files-accessed.txt
root@trace1:strace-output# wc -l container_cobol_files-accessed.txt
637 container_cobol_files-accessed.txt
root@trace1:strace-output# wc -l linux-vsp_files-accessed.txt
104754 linux-vsp_files-accessed.txt
</code></pre></div>
<p>
So the full CentOS container running a little COBOL Hello World application
needs access to six hundred thirty-seven files, and CINC Auditor running
a 22-line profile directly on the OS needs to access over one hundred four
thousand files. That doesn't directly mean that one is more or less of a
security risk than the other - particularly given that a Hello World application
can't report on the compliance state of your machines, containers, or
applications - but it is fun to think about. One of the neatest
things about debugging with tools which expose the underlying operations of a
container exec is that you can reason about what containerization is actually
doing. In this case, since we're only showing what files are accessed during the
container exec, sorting the list, and removing duplicate entries, it's a cursory
view but still a useful one.
</p>
<p>
Let's say we're consuming a vendor application as a container. We can trace an
execution (or sample a running instance of the container for a day - strace can
attach to running processes), load the list of files into the pipeline we use to
promote new versions of that vendor app to prod, and, when we see a change in the
files that the application is opening, make a determination about whether the
behavior of the new version is appropriate for our production environment with
all its PII and user financial data. Now, instead of trusting the vendor at
their word that they've done their due diligence, we're actually observing the
behavior of the application and using our own knowledge of our environment to
say whether that application is suitable for use.
</p>
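<p>
As a rough sketch of what that pipeline step could look like (the image names and
file paths here are hypothetical, and you'd attach to an existing pid with strace -p
instead of wrapping podman if you're sampling a running instance):
</p>
<div><pre><code class="language-none">
# capture a file-access baseline for the current version
strace -ff --trace=%file -o /tmp/vendorapp-v1 podman run --rm vendorapp:1.0
find /tmp -name "vendorapp-v1.*" -exec awk -F '"' '{print $2}' {} \; | sort -u > vendorapp-v1-files.txt

# capture the same for the candidate version, then review what changed before promoting it
strace -ff --trace=%file -o /tmp/vendorapp-v2 podman run --rm vendorapp:2.0
find /tmp -name "vendorapp-v2.*" -exec awk -F '"' '{print $2}' {} \; | sort -u > vendorapp-v2-files.txt
diff vendorapp-v1-files.txt vendorapp-v2-files.txt
</code></pre></div>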
<h2><b>But wait! Strace isn't just for files!</b></h2>
<p>
I used strace's file syscall filter as an example because it fit the example use
case, but strace can snoop on other syscalls too! Do you need to know what IP
addresses your process knows about? This example uses a container exec
again, but you could snoop on an existing pid if you want, then run a similar
search against the output (IPs have been modified in this output):
</p>
<div><pre><code class="language-none">
strace -ff --trace=%network -o /root/yourcontainer-network -s 10241 podman run -it yourcontainer
for file in $(ls -1 yourcontainer-network.*); do grep -oP 'inet_addr\("\K[^"]+' $file ; done
127.0.0.1
127.0.0.1
693.18.119.36
693.18.119.36
693.18.131.255
75.5117.0.5
75.5117.0.5
75.5117.255.255
161.888.0.2
161.888.0.2
161.888.15.255
832.71.40.1
832.71.40.1
832.71.255.255
</code></pre></div>
<h2><b>Have I answered my original question?</b></h2>
<p>
With all that knowledge, can we address the original question: can the list of
files output by tracing a cinc-auditor run be used to build a restricted set of
permissions that allows a standard user to audit the system using CINC Auditor
and the profile?
</p>
<p>
Yes, with one caveat: My Very Simple Profile was too simple, and didn't require
any additional privileges. I tried with a few other public profiles, but every
one I tried ran successfully using a standard user created with useradd -m
cincauditor. I looked through bug reports related to running profiles as a
non-root user but couldn't replicate their issues - which is good, I suppose. It
could be that the issue my customer was facing at the time was a bug in the
program's behavior when run as a non-root user which has been fixed, or I just
don't remember the use case they presented well enough to replicate it. So
here's a manufactured case:
</p>
<div><pre><code class="language-none">
root@trace1:~# mkdir /tmp/foo
root@trace1:~# touch /tmp/foo/sixhundred
root@trace1:~# touch /tmp/foo/sevenhundred
root@trace1:~# chmod 700 /tmp/foo
root@trace1:~# chmod 600 /tmp/foo/sixhundred
root@trace1:~# chmod 700 /tmp/foo/sevenhundred
cincauditor@trace1:~$ cat << EOF > linux-vsp/controls/filetest.rb
> control "filetester" do
> impact 1.0
> title "Testing files"
> desc "Ensure they're owned by root"
> describe file('/tmp/foo/sixhundred') do
> its('owner') { should eq 'root' }
> end
> describe file('/tmp/foo/sevenhundred') do
> its('group') { should eq 'root'}
> end
> end
> EOF
cincauditor@trace1:~$ cinc-auditor exec linux-vsp/
Profile: Very Simple Profile (linux-vsp)
Version: 0.1.0
Target: local://
× filetester: Testing files (2 failed)
× File /tmp/foo/sixhundred owner is expected to eq "root"
expected: "root"
got: nil
(compared using ==)
× File /tmp/foo/sevenhundred group is expected to eq "root"
expected: "root"
got: nil
(compared using ==)
✔ inetd: Do not install inetd
✔ System Package inetd is expected not to be installed
↺ auditd: Check auditd configuration (1 skipped)
✔ System Package auditd is expected to be installed
↺ Can't find file: /etc/audit/auditd.conf
Profile Summary: 1 successful control, 1 control failure, 1 control skipped
Test Summary: 2 successful, 2 failures, 1 skipped
cincauditor@trace1:~$ find . -name "linux-vsp.1*" -exec awk -F '"' '{print $2}' {} \; | sort -u > linux-vsp_files-accessed.txt
root@trace1:~# diff --suppress-common-lines -y linux-vsp_files-accessed.txt /home/cincauditor/linux-vsp_files-accessed.txt | grep -v /opt/cinc-auditor
> /home
> /home/cincauditor
> /home/cincauditor/.dpkg.cfg
> /home/cincauditor/.gem/ruby/2.7.0
> /home/cincauditor/.gem/ruby/2.7.0/specifications
> /home/cincauditor/.inspec
> /home/cincauditor/.inspec/cache
> /home/cincauditor/.inspec/config.json
> /home/cincauditor/.inspec/gems/2.7.0/specifications
> /home/cincauditor/.inspec/plugins
> /home/cincauditor/.inspec/plugins.json
> /home/cincauditor/linux-vsp
/root <
/root/.dpkg.cfg <
/root/.gem/ruby/2.7.0 <
/root/.gem/ruby/2.7.0/specifications <
/root/.inspec <
/root/.inspec/cache <
/root/.inspec/config.json <
/root/.inspec/gems/2.7.0/specifications <
/root/.inspec/plugins <
/root/.inspec/plugins.json <
/root/linux-vsp <
> /tmp/foo/sevenhundred
> /tmp/foo/sixhundred
> linux-vsp/controls/filetest.rb
root@trace1:~#
</code></pre></div>
<p>
The end of that previous block's output shows compiling the list of files
accessed when the cincauditor user runs the profile in the same way we did for
the root user, then a diff of the two files. Looking at that output, it's fairly
obvious that the profile is trying to access the newly created files which are
in a directory we made inaccessible to the cincauditor user (with chmod 700
/tmp/foo), and when we give cinc-auditor access to that directory with chmod 750
/tmp/foo the profile is able to check those files. A manufactured replication of
the use case, but it does show that it's possible to use the output to
accomplish the task. Whether chmod is the right way to give a least-privilege
user access to the files is a question best left up to the implementer, their
organization, and their auditors - the purpose of this exercise is to
demonstrate the potential value of the strace debugger.</p><p>It is important to note that file permissions aren't the only reason why a program wouldn't run. If you're not able to use the information strace gives you to get an application to run as a user with restricted privileges, at least you can get more information about what is happening under the hood and can communicate about why a program is not suitable for your environment. If a program needs to run anyway, you can profile the application's behavior (perhaps a tool built on eBPF would be more suitable than strace for ongoing monitoring in a production environment) and notify when its behavior changes.<br /></p>
<h2><b>Closing thoughts</b></h2>
<p>
Over the past few years I've had a lot of thoughts about how to get things done in modern environments,
and I've come to the conclusion that it's okay to write shell scripts to get
something like this done. Since in this case I'm wrapping arbitrary tasks so I can
extract information about what happens when they're running, and I won't
be able to predict where I'll need it, I figured it was a good idea to use
bash and awk, as those will be available via the package manager wherever I want to do
this sort of thing.
</p>
<p>
You might not agree, and may wish to see something like this implemented in Ruby,
Python, or Rust (I have to admit that I thought about trying to do this using
Rust so as to get better at it), and you're of course welcome to do so. Again, I
chose shell since it's something many folks can easily run, look at, comprehend,
modify, and re-implement in the way that suits them.
</p>
<p>
Lastly, thanks very much to Julia Evans. A note about the power of storytelling
in one of her posts made me think "I should write a story about solving this
problem so I can be sure I learned something from it", and I hope I've done a decent job of emulating her empathy towards folks learning these concepts for the first time.
</p>Shaun Moutonhttp://www.blogger.com/profile/16550788803145425493noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-66081166472775468052021-12-04T00:00:00.087-05:002021-12-04T23:44:46.189-05:00Day 4 - GWLB: Panacea for Cloud DMZ on AWS<p>
By: Atif Siddiqui <br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Organizations aspire to apply the same security controls to ingress traffic in
Cloud as they have on-premises, ideally taking advantage of Cloud value
propositions to provide resiliency and scalability to traffic inspection
appliances.
</p>
<p>
Within the AWS ecosystem, until last year, there wasn’t an elegant solution.
Consequently, the most notable challenge it created, especially for regulated
organizations, was designing the DMZ (demilitarized zone) pattern in AWS. It
took two announcements to close this gap: VPC Ingress routing and Gateway Load
Balancer (GWLB).
</p>
<p>
Two years ago, AWS announced VPC Ingress routing. This provided the capability
where ingress traffic could be directed to an Elastic Network interface (ENI).
Last year, Amazon followed it up with a complementary announcement of GWLB.
</p>
<p>
GWLB is AWS's fourth load balancer offering following Classic, Application and
Network Load Balancer. Unlike the first three types, GWLB solves a niche problem
and is, specifically, targeted towards partner appliances.
</p>
<p>
GWLB has a novel design with two distinct sides. The front end is connected to
VPC endpoint service and corresponding VPC endpoints. This front end acts as a
Layer 3 gateway. The backend is connected to third party appliances. This
backend acts as a Layer 4 Load Balancer. An oversimplified diagram of the
traffic flow is shown:
</p>
<p>
<i>Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB →
3<sup>rd</sup> party appliance</i>
</p>
<h3>So how do you provision a GWLB? </h3>
<p>
There are 4 key resources that need to be provisioned in order:
</p>
<ul>
<li>Target Group</li>
<li>GWLB using the above as the target group.</li>
<li>VPC endpoint service using above as the load balancer type.</li>
<li>VPC endpoints bound to the above endpoint service.</li>
</ul>
<span style="text-decoration: underline;">Target Group</span>
<p>
As part of this announcement, AWS implemented the GENEVE protocol and added this
option to the UX for Target Group. If you are unfamiliar with this protocol it
will be explained after going through GWLB provisioning requirements.
</p>
<p>
To configure this as infrastructure code (IaC), you could use a terraform code
snippet as follows:
</p><div><pre><code class="language-none">
resource "aws_lb_target_group" "blog_gwlb_tgt_grp" {
name = "blog_gwlb_tgt_grp"
port = 6081
protocol = "GENEVE"
vpc_id = aws_vpc.fw.id
}
</code></pre></div>
<img src="https://lh3.googleusercontent.com/4dgtOgm2LlU5yq3kn0wG3TIfr_DwJFAfE3sjLlT4o0eXvDNYVfuh28Iae7Y1zHRWWi8is2zny15DqnDARgZV5KplwrC1JBxR7gLTBUuD6r_1KSyNDhBlaZF6Z6hygeaVRgjp0Mco" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>GWLB</h2>
<p>
As with Application Load Balancing, GWLB requires a target group to forward
traffic; however, the target group must be created with the GENEVE protocol.
</p>
<p>
Health checks for TCP, HTTP and HTTPS are supported; however, it should be noted
that health check packets are not GENEVE encapsulated.
</p>
<p>
An example of a terraform code snippet is as follows.
</p>
<div><pre><code class="language-none">
resource "aws_lb" "blog_gwlb" {
name = "blog_gwlb"
load_balancer_type = "gateway"
subnets = blog-gwlb-subnet.pvt.*.id
tags = {
Name = “blog-gwlb”,
Environment = "sandbox"
}
}
</code></pre></div>
<img src="https://lh5.googleusercontent.com/GMSIhrOtsTcGcRYq9JNnSLq57w7tipZhTdVzy3HPr9x0wo55wWG8qBBVA7-knV822FxPDVma9-P1M2fsJt6B3UgYJ9ds-gDbXd-zdDUxH3k4FXouKRpMYpqlAfxqNH0Uivlq0fA0" style="margin-left: 0px; margin-top: 0px;" width="492" />
<img src="https://lh5.googleusercontent.com/qPRzqZR5rPJR_GejuCR6JQYBcAKmBXARi5cAGVnYi6mqV7eeHfzwymoXJS0UBIwcntmYAmteuef_eVcgRJU-e9bFuou8uU34wLaSXKbIOgJ7mMBTJZmWWtWNDr-tt47jvb7lKDmt" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>Endpoint Service</h2>
<p>
Prior to the GWLB announcement, if an endpoint service was being created, the only
option offered was Network Load Balancer (NLB). With GWLB’s availability,
gateway is now the second option for load balancer type when creating an
endpoint service. It should be noted that an endpoint service, whether it uses NLB
or GWLB, relies on the underlying PrivateLink technology.
</p>
<p>
An example of terraform code snippet is as follows.
</p>
<div><pre><code class="language-none">
resource "aws_vpc_endpoint_service" "blog-vpce-srvc" {
acceptance_required = false
gateway_load_balancer_arns = [aws_lb.blog-gwlb.arn]
tags = {
Name = “blog-gwlb”,
Environment = "sandbox"
}
}
</code></pre></div>
<img src="https://lh3.googleusercontent.com/YhhmSmS5ul91FSI_0eO-_ewB_8ixJ4ZInPj0nL-lXMhl_Q-qj-urLbjYvrToo0ybsiVz5meLJCN33rMlT36fADe7k4t9rAG1wxBQPB4aAsNwcRdF8cSwFDT5cFiXMM_2kOc2Jo3y" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>VPC endpoint</h2>
<p>
The last key piece of the set is provisioning the VPC endpoints, which will bind
to the endpoint service created in the prior step.
</p>
<div><pre><code class="language-none">
resource "aws_vpc_endpoint" "blog_gwlbe" {
  count             = length(var.az)
  service_name      = aws_vpc_endpoint_service.blog-vpce-srvc.service_name
  # GWLB endpoints use the GatewayLoadBalancer endpoint type
  vpc_endpoint_type = "GatewayLoadBalancer"
  subnet_ids        = [var.blog-gwlb-subnets[count.index]]
  vpc_id            = aws_vpc.fw.id

  tags = {
    Name        = "blog-gwlb"
    Environment = "sandbox"
  }
}
</code></pre></div>
<img src="https://lh3.googleusercontent.com/VqdwLFjULJiuIoqS495-ZI4TzLzBtzf0lzeX_BxxiDJp4WZugZ-5Mvh_PC_tzsgRk4-0N3QXO1ZX4-hojX7j5F4qn0odsLsAJXPHECdgjmo2EiGDY0wZPSMMsnrIii_xNPHuZO66" style="margin-left: 0px; margin-top: 0px;" width="492" />
<img src="https://lh4.googleusercontent.com/FifVP4eog3KHCt-0HkGjk4yteHoCmWCY0aI9sXbQnei8vGPfTGqGMS7YE33AHgdkGGsGDk3JwFoXTvuFf6wliMei492waKqr8QuaFl1SfmMe7a1Z6yIxUt5WyI0PnZ_URObkiBTw" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>GENEVE</h2>
<p>
This is an encapsulation protocol created by the Internet Engineering Task Force
(IETF). GENEVE stands for Generic Network Virtualization Encapsulation and
leverages UDP for the transport layer. This encapsulation is what achieves the
transparent routing of packets to third party appliances from vendors such as
F5 (BIG-IP), Palo Alto Networks, Aviatrix, etc.
</p>
<p>
</p>
<p>
<span style="text-decoration: underline;">Special route table</span>
</p>
<p>
The glue that binds the VPC Ingress routing and GWLB features together is a
special use of a route table.
</p>
<p>
<i>Ingress traffic → GWLB endpoints → GWLB endpoint service → GWLB →
3<sup>rd</sup> party appliance, e.g. a marketplace subscription.</i>
</p>
<p>
This table does not have any explicit subnet association. It does, however, have
the Internet Gateway (IGW) specified as an edge association.
</p>
<p>
Within its routes, the quad-zero route (0.0.0.0/0) points to the network
interfaces (ENIs) of the Gateway Load Balancer endpoints (GWLBe).
</p>
<p>
It is this routing rule that forces ingress traffic to be routed to the GWLBe,
which in turn sends it through the endpoint service to the GWLB, which then
routes it to the appliances. </p>
<img src="https://lh6.googleusercontent.com/8P-AqpGjmZTuTDy2vdgqbc8tOw8NJbUSGqVQ7GqJMPUI8UdfJvGKxcLAaJO4833_tJut5_qgWLe3KIGFmtNfmZG17de7a26U6m05C0ToCVDzD2xbKK4fWDDpQZdlbwKcCVzY-6U0" style="margin-left: 0px; margin-top: 0px;" width="492" />
<img src="https://lh3.googleusercontent.com/-_Q-d8sEwHUPtyAPs8RWYIjBTfqQM1BU20rR_DyJWsIshS9Hnz5oQastOjY6Gr40RVhHHx1MV7GzCcJ4MFHQnH2ZoV2AFrlLn1KdblYqtPtSBqg-lNowTVXL3Tn-W65JThZNche3" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>Limitations</h2>
<p>
A target group using the GENEVE protocol does not support tags. </p>
<img src="https://lh4.googleusercontent.com/4Ivocf32Pme56tvvHo5O6xgx_OO17fH_T3Oy9koYDREXcGDyW8pDlFnh4d4EZAL0gxHPKcLX9zxfrxon13uvfpE3iY7fIZH0kirQdO3DwSZ9RXGYRbv9YZC9PbszxtddbiZ5l_VY" style="margin-left: 0px; margin-top: 0px;" width="492" />
Cloud DMZ: Centralized Inspection Architecture
<img src="https://lh4.googleusercontent.com/YkZX_lTaQW2NZp0_1DLPvGD9eRNSH34vS4wXlM1Qlb9IRE9fCOF21c7TaLY2vptWcoqOFOA82MPJZKqp1A2kN5CZyO0_F0K-mWnT_EA2SYxqe1WeVUgYQFVMiQzw1g0mZ35jWB0B" style="margin-left: 0px; margin-top: 0px;" width="492" />
<h2>Conclusion</h2>
<p>
The pairing of VPC ingress routing and GWLB allows enterprises to have a much
sought-after security posture where both ingress and egress traffic can
undergo firewall inspection. This capability is especially notable when a
Cloud DMZ architecture is being created.
</p>
<h2>Afterthought: AWS Network
Firewall</h2>
<p>
It is always fascinating to me how AWS keeps vendors on their toes. There seems
to be an aura of ineluctability where vendors strive to stay a step ahead of
AWS’s offerings. While customers can use marketplace subscriptions (e.g. a
firewall) with GWLB, there is a competing service from Amazon named AWS Network
Firewall. This is essentially Firewall as a Service, where the VPC ingress routing
primitive is used to point to AWS Network Firewall, which uses GWLB behind
the scenes. It is easy to predict that AWS will push new products into this
space that use GWLB under the hood.
</p>
<p>
Over time, choices will grow, whether with AWS products or with more vendors
certifying their products against GWLB. This abundance will only benefit
customers in their pursuit of a secure network architecture.
</p>
<p>
</p>
<h2>References</h2>
<ul><li><a href="https://aws.amazon.com/blogs/aws/new-vpc-ingress-routing-simplifying-integration-of-third-party-appliances/">VPC
Ingress routing announcement</a></li>
<li><a href="https://aws.amazon.com/blogs/aws/introducing-aws-gateway-load-balancer-easy-deployment-scalability-and-high-availability-for-partner-appliances/">GWLB
announcement</a></li>
<li><a href="https://datatracker.ietf.org/doc/html/rfc8926">GENEVE RFC</a></li>
<li><a href="https://aws.amazon.com/network-firewall/">AWS Network Firewall</a></li>
<li><a href="https://www.redhat.com/en/blog/what-geneve">What is GENEVE</a></li>
</ul>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-22706026869505692782021-12-03T00:00:00.023-05:002021-12-04T23:44:37.075-05:00Day 3 - Keeping Config Management Simple with Itamae<p>
By: Paul Welch (<a href="https://twitter.com/pwelch">@pwelch</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
Our DevOps toolbox is filled with many tools with Configuration Management being
an often neglected and overloaded workhorse. While many resources today are
deployed with containers, you still use configuration management tools to manage
the underlying servers. Whether you use an image-based approach and configure
your systems with <a href="https://www.packer.io">Packer</a> or prefer
configuring your systems manually after creation by something like <a
href="https://www.terraform.io/">Terraform</a>, chances are you still want to
continuously manage your hosts with infrastructure as code. To add to the list
of potential tools to solve this, I’d like to introduce you to <a
href="https://itamae.kitchen/">Itamae</a>. Itamae is a simple tool that helps
you manage your hosts with a straight-forward DSL while also giving you access
to the Ruby ecosystem. Inspired by <a
href="https://github.com/chef/chef">Chef</a>, Itamae has a similar DSL but does
not require a server, complex attributes, or data bags.
</p>
<h2>Managing Resources</h2>
<p>
Itamae is designed to be lightweight; it comes with an essential set of <a
href="https://github.com/itamae-kitchen/itamae/wiki/Resources#resource-type">resource
types</a> to bring your hosts to the expected state. These resource types focus
on the core parts of our host we want to manage like packages, templates, and
services. The bundled `execute` resource can be used as an escape hatch to
manage resources that might not have a builtin resource type. If you find
yourself wanting to manage something often that does not have a built in
resource, you can <a
href="https://github.com/itamae-kitchen/itamae/wiki/Resource-Plugins">build your
own resources</a> if you are comfortable with Ruby.
</p>
<p>
All Itamae resource types have <a
href="https://github.com/itamae-kitchen/itamae/wiki/Resources#resource-type">common
attributes</a> that include: actions, guards, and triggers for other resources.
</p>
<h3>Actions</h3>
<p>
Actions are the activities that you want to have occur with the resource. Each
bundled resource has predefined actions that can be taken. A `service`
resource, for example, can have both an `:enable` and `:start` action which
tells Itamae to enable the service to start on system boot and also start the
service if it is not currently running.
</p>
<div><pre><code class="language-none">
# enable and start the fail2ban service
service "fail2ban" do
  action [:enable, :start]
end
</code></pre></div>
<h3>Guards</h3>
<p>
Guards ensure a resource is idempotent by only invoking the interpreted code if
the conditions pass. The common attributes that are available to use within your
infracode are `only_if` and `not_if`.
</p>
<p>
</p>
<div><pre><code class="language-none">
# create an empty file only if it does not exist
execute "create an empty file" do
  command "touch /tmp/file.txt"
  not_if "test -e /tmp/file.txt"
end
</code></pre></div>
<h3>Triggers</h3>
<p>
Triggers allow you to define event driven notifications to other resources.
</p>
<p>
The `notifies` and `subscribes` attributes allow you to trigger other resources
only if there is a change such as restarting a service when a new template is
rendered. These are synonymous with Chef & Puppet’s `notifies` and `subscribes`
or Ansible’s `handlers`.
</p>
<div><pre><code class="language-none">
# define nginx service
service 'nginx' do
  action [:enable, :start]
end

# render template and restart nginx if there are changes
template "/etc/nginx/sites-available/main" do
  source "templates/etc/nginx/sites-available/main.erb"
  mode "0644"
  action :create
  notifies :restart, "service[nginx]", :delayed
end
</code></pre></div>
<p>
Itamae code is normally organized in “cookbooks” much like Chef. You can <a
href="https://github.com/itamae-kitchen/itamae/wiki/Including-Recipes">include
recipes</a> to separate your code. Itamae also supports <a
href="https://github.com/itamae-kitchen/itamae/wiki/Definitions">definitions</a>
to help <a href="https://en.wikipedia.org/wiki/Don't_repeat_yourself">DRY</a>
your code for resources.
</p>
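<p>
As a rough sketch (the recipe file and definition name here are hypothetical),
including a recipe and defining a reusable resource might look like this:
</p>
<div><pre><code class="language-none">
# default.rb - pull in another recipe from the same cookbook
include_recipe "nginx.rb"

# a definition wraps a common pattern so it can be reused like a resource
define :managed_config, source: nil, owner: "root" do
  template params[:name] do
    source params[:source]
    owner params[:owner]
    mode "0644"
    action :create
  end
end

# use the definition like any other resource
managed_config "/etc/motd" do
  source "templates/etc/motd.erb"
end
</code></pre></div>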
<h2>Example</h2>
<p>
Now that we have an initial overview of the Itamae basics, let’s build a basic
Nginx configuration for a host. This example will install Nginx from a PPA on
Ubuntu and render a basic configuration that will return the requestor’s IP
address. The cookbook resources will be organized as follows:
</p>
<div><pre><code class="language-none">
├── default.rb
└── templates
└── etc
└── nginx
└── sites-available
└── main.erb
</code></pre></div>
<p>
We will keep it simple with a single `default.rb` recipe and single `main.erb`
Nginx site configuration template. The recipe and site configuration template
content can be found below.
</p>
<div><pre><code class="language-none">
# default.rb

# Add Nginx PPA
execute "add-apt-repository-ppa-nginx-stable" do
  command "add-apt-repository ppa:nginx/stable --yes"
  not_if "test -e /usr/sbin/nginx"
end

# Update apt cache
execute "update-apt-cache" do
  command "apt-get update"
end

# install nginx stable
package "nginx" do
  action :install
end

# enable nginx service
service 'nginx' do
  action [:enable, :start]
end

# configure nginx
template "/etc/nginx/sites-available/main" do
  source "templates/etc/nginx/sites-available/main.erb"
  mode "0644"
  action :create
  notifies :restart, "service[nginx]", :delayed
  variables()
end

# enable example site
link '/etc/nginx/sites-enabled/main' do
  to "/etc/nginx/sites-available/main"
  notifies :restart, "service[nginx]", :delayed
  not_if "test -e /etc/nginx/sites-enabled/main"
end

# disable default site
execute "disable-nginx-default-site" do
  command "rm /etc/nginx/sites-enabled/default"
  notifies :restart, "service[nginx]", :delayed
  only_if "test -e /etc/nginx/sites-enabled/default"
end
</code></pre></div>
<div><pre><code class="language-none">
# main.erb
server {
  listen 80 default_server;
  listen [::]:80 default_server;
  server_name _;

  location / {
    # Return the requestor's IP as plain text
    default_type text/html;
    return 200 $remote_addr;
  }
}
</code></pre></div>
<h2>Deploying</h2>
<p>
<em>*To deploy the above example, it is assumed that you have a temporary VPS
instance available.</em>
</p>
<p>
There are 3 different ways you can deploy your configurations with Itamae:
</p>
<ul>
<li>`itamae ssh` via the itamae gem.
</li><li>`itamae local` also via the itamae gem.
</li><li>`mitamae` locally on the host.
</li>
</ul>
<p>
<a href="https://github.com/itamae-kitchen/mitamae">Mitamae</a> is an
alternative implementation of Itamae built with <a
href="https://mruby.org/">mruby</a>. This post is focusing on Itamae in general
but the Mitamae implementation is a notable option if you want to deploy your
configuration using <a
href="https://github.com/itamae-kitchen/mitamae/releases">prebuilt binaries</a>
instead of using SSH or requiring Ruby.
</p>
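<p>
For example, a rough sketch of the binary-based approach (the release URL, version, and
architecture below are placeholders you would adjust for your host) might look like:
</p>
<div><pre><code class="language-none">
# fetch a prebuilt mitamae binary onto the target host (placeholder version/arch)
curl -fsSL -o mitamae https://github.com/itamae-kitchen/mitamae/releases/download/vX.Y.Z/mitamae-x86_64-linux
chmod +x mitamae

# apply the recipe locally - no Ruby installation or SSH connection required
sudo ./mitamae local default.rb
</code></pre></div>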
<p>
With your configuration ready, it’s just a single command to deploy over SSH.
Itamae uses the SpecInfra library which is the same library that <a
href="https://serverspec.org">ServerSpec</a> uses to test hosts. You can also
access a <a href="https://serverspec.org/host_inventory.html">host’s
inventory</a> in Itamae much like you can with Chef & Ohai. To deploy your
configuration, run:
</p>
<div><pre><code class="language-none">
itamae ssh --key=/path/to/ssh_key --host=<IP> --user=<USER> default.rb --log-level=DEBUG
</code></pre></div>
<p>
Itamae will manage those packages and write out the template we specified,
bringing the host to our desired state. Once the command is complete, you should
be able to curl the host’s IP address and receive a response from Nginx.
</p>
<h2>Wrapping Up</h2>
<p>
Thank you for joining me in learning about this lightweight configuration
management tool. Itamae gives you a set of bundled resource types to quickly
configure your infrastructure in a repeatable and automated manner with three
ways to deploy. Check out the <a
href="https://github.com/itamae-kitchen/itamae/wiki">Itamae Wiki</a> for more
information and best practices!
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-34530133637658346942021-12-02T00:00:00.010-05:002021-12-04T23:44:27.631-05:00Day 2 - Reliability as a Product Feature<p>
By: Martin Smith (<a href="https://twitter.com/martinb3">@martinb3</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<h2>Abstract</h2>
<p>
SRE was born out of thinking about reliability as a product feature. However,
all of the industry focus in the last few years on things like SLOs, Error
Budgets, Production Engineering teams, and the other practices that constitute "doing
SRE" sometimes means teams don’t take advantage of a product-centric approach
to reliability these days. And they lose some of the advantages of doing so as a
result. This post covers some project maturity levels, some suggestions for
thinking about reliability as an SRE engaged in those kinds of projects, as well
as what kinds of collaboration might be most successful in driving
reliability-as-product-feature in each phase.
</p>
<h2>A brief history</h2>
<p>
Site Reliability Engineering, or SRE for short, was <a
href="https://www.oreilly.com/content/site-reliability-engineering-sre-a-simple-overview/">born
in 2003</a> out of a need to improve service reliability at <a
href="https://sre.google/">Google</a>. Often described as, “an implementation of
<a href="https://aws.amazon.com/devops/what-is-devops/">DevOps</a>,” the
practice of SRE aims to <strong>treat operations as a software problem</strong>
that can be addressed through software engineering techniques.
</p>
<p>
And according to <a
href="https://insights.devopsinstitute.com/hubfs/Automation%20Downloads/Upskilling%202021-Enterprise%20DevOps%20Skills%20Report.pdf">a
survey by the DevOps Institute</a>, SRE has truly taken off. This approach has
been widely adopted, with <strong>22% of organizations saying they have an SRE
team in 2021</strong>. This shift can also be seen with the rise of conferences
like USENIX’s SREcon which began in 2014, or the release of the popular, “Google
SRE book,” a few years later in 2016.
</p>
<p>
Whether or not your organization has an SRE team that plans work using SLOs and
Error Budgets, regularly reduces toil through automation, or has adopted one of
the <a
href="https://www.oreilly.com/content/site-reliability-engineering-sre-a-simple-overview/">many
SRE rules of thumb</a>, the basic premise of <strong>what impact SRE can have
sometimes gets lost</strong> -- that operations is a software problem. Or,
shifting the focus back to the customer perspective, that <strong>reliability is
a product feature</strong> that we build.
</p>
<p>
Having held <em>DevOps Engineer</em> and <em>Site Reliability Engineer</em>
roles in the past, and having been a technical lead for SRE teams, I’ve had many
opportunities to define the role, activities, and most importantly, the
<strong>impact of an SRE team</strong>. In each case, I’ve found that
<strong>focusing back on our customers’ experience of reliability</strong> has
been the most useful framing when speaking to company leaders about the SRE team’s
“<a href="https://simonsinek.com/product/find-your-why/">why</a>,” instead of
reciting a long, confusing list of things SREs might do in a quarter. I’ve also
found that it’s an easy litmus test for myself to ensure I’m working on the
right things at the right time. If I can’t explain how my work affects customer
reliability, keeping in mind that <strong>reliability for operators usually
leads to reliability for customers, </strong>it might be a sign that I need to
work on something else.
</p>
<p>
</p>
<h2>Shifting focus back to product reliability</h2>
<p>
Shifting the focus from operations and software engineering to talking about
<strong>reliability as a product feature</strong> has some major benefits.
First, it helps our organizations better understand what reliability might mean
for them and their product(s) -- whether that’s <strong>resilience</strong>
(tolerant of failure), <strong>scalability</strong> (can function with large
volumes of work), <strong>observability</strong> (understanding internal state
from outputs), or <strong>security</strong> (trust of the system). These are all
product capabilities that often aren’t well understood, but fundamentally all
matter to customers.
</p>
<p>
Reliability <strong>benefits from product management support</strong>
(communication with stakeholders, building roadmaps, helping with prioritization
and decisions, etc). For example, do you know who your internal stakeholders are
for the scalability of your product? What’s on the roadmap for observability
over the next 6 months? 2 years? And importantly, what metrics will you collect
to be sure you’ve accomplished those goals and delivered on that roadmap? How
does it align with other features’ roadmaps? As a friend and former colleague of
mine says, “reliability is a product feature whether you devote engineering time
to it or not.” If you don’t explicitly plan for that, your customers will
implicitly make their own assumptions about your reliability.
</p>
<p>
Reliability may start to sound like any other product feature, with both
internal and external stakeholders, and that’s by design. Making reliability an
explicit part of your organizational planning also has many benefits.
Thoughtworks’ <a
href="https://www.thoughtworks.com/content/dam/thoughtworks/documents/radar/2021/10/tr_technology_radar_vol_25_en.pdf">Technology
Radar (Volume 25)</a> from October of this year recommends adoption of this kind
of thinking -- that even <strong>internal teams should think of themselves as
product teams</strong>. They also recommend using concepts from the popular <a
href="https://teamtopologies.com/">Team Topologies</a> book to figure out how to
organize these internal teams. In reviewing examples of team structures from the
book, many organizations have adopted Simon Wardley’s <a
href="https://blog.gardeviance.org/2015/03/on-pioneers-settlers-town-planners-and.html">Pioneer-Settler-Town
Planner</a> (or “PST”) framework, too.
</p>
<p>
Let’s take a look at how one might apply these two ideas (reliability as a
product feature, having a specific team profile) to improve the effectiveness of
an SRE team.
</p>
<ol>
<li>First, <strong>there’s no one-size-fits-all approach to improving
reliability</strong>; different stages of a project will benefit from different
kinds of SRE involvement. In this post, I’ll divide products/services into three
levels of maturity: <strong>beginning, growing, and established</strong>.
<li>Then, I’ll describe <strong>what kinds of SRE work could be most effective
at that maturity level</strong>, using the PST framework.
</li>
</ol>
<p>
Here’s a graphic that explains the <a
href="https://leadingedgeforum.com/insights/a-lesson-from-the-past-on-pioneering-organizational-structures/">PST
framework</a>’s three kinds of roles/activities in more detail.
</p>
<img width="492" src="https://lh4.googleusercontent.com/425sdbwBKobjYHRoNHZFVNag1SaP2kxrHT7vP82yphtuNUnJbge9PTz85RZjLMeL-Ut-xjbIvD24K1vdzIQm4SuZND44ZtOJSNSwmXPCQp_GQMsuZvfZpfvWZFKGq1GQFa3Xt8ri" style="margin-left: 0px; margin-top: 0px;" />
<p>
Team Profiles, from blog post <a
href="https://blog.gardeviance.org/2012/06/pioneers-settlers-and-town-planners.html">Pioneers,
Settlers and Town Planners</a> by Simon Wardley
</p>
<h2>Beginning phase (with Pioneer SREs)</h2>
<p>
In new projects, there’s <strong>often uncertainty and unanswered
questions</strong>. Small changes in direction could have large future benefits,
but experimental work may be completely discarded, too. SREs can drive
reliability at this stage by helping teams build prototypes, fail faster, and
make agile decisions, <strong>all with reliability as a top of mind
concern</strong>.
</p>
<p>
Have you ever had a project get close to production/release <strong>without
thinking about reliability or operational burdens</strong>? “Pioneer SREs” can
help. They should be part of the team that’s working to deliver a new product
development, evaluate vendors, build out proofs of concept, or make major
architectural changes. At this stage of a project, any work to “cover”
reliability gaps should be identified or entire directions could be changed due
to reliability concerns raised by the team.
</p>
<p>
<strong>Embedding</strong> in a team building the new product or feature is a
great way for SREs to drive reliability early on in these kinds of projects.
When teams only consult briefly on reliability or operational concerns, often
the final output doesn’t adequately reflect customer or engineering expectations
of reliability of the product or operability of the internals.
</p>
<p>
The <strong>success of Pioneer SREs</strong> can be measured by looking at how
quickly new products or features show up on the roadmap, how quickly vendor
implementations happen, or how quickly a project moves from, “exploration,” to,
“concrete proposal.”
</p>
<p>
The<strong> largest risk</strong> in this phase is having your SRE team end up
owners of the system’s reliability, since they helped design it. Hiding the
overall reliability of your system from the other developers, behind an SRE
team, will typically turn into a situation where the <strong>SRE team ends up
being treated as an operational team</strong> for any product/service problems.
Well-scoped embedding engagements can help avoid this problem by emphasizing
that embedded SREs are a <strong>training resource </strong>for the rest of the
team to learn, <strong>not coverage</strong> for the team once the embedding is
over.
</p>
<h2>Growing phase (with Settler SREs)</h2>
<p>
In this phase, projects are often working to build <strong>production-quality
infrastructure</strong>, <strong>launch</strong> to customers, or
<strong>scale</strong> to the required audience. SREs can help actually build
mature and scalable components from the initial prototypes. They could also
level up the engineering organization on how to prepare for any <strong>new
operational burdens</strong> by emphasizing best practices like automating away
toil or choosing good SLOs.
</p>
<p>
<strong>Continuing to embed</strong> with teams is a great way for SREs to have
a hand in the reliability of a nearly-launched product or feature, especially if
SREs influence the team to build for observability, scalability, and security
into the product. <strong>Consulting </strong>with teams on production
readiness, especially for brand new teams or brand new services, is another way
that SREs can ensure that everything reaching production will meet the original
reliability requirements of the product, as well as operational best practices
(e.g. automation instead of manual database migrations).
</p>
<p>
At this phase, SRE building and maintaining an idea of <strong>Production
Readiness</strong> is especially important as a product or organization scales.
This ensures a consistent approach to reliability across products or services,
as well as creates a minimum bar for reliability that must be satisfied. SREs at
this stage may even build automation into a pipeline to guarantee minimum scale
or ensure resilience on specific failures.
</p>
<p>
The <strong>success of Settler SREs</strong> can be measured by looking at how
many new services and features are safely being launched into production, as
well as examining things like ease of observability (e.g. effective logging,
metrics, or monitoring). Success in this phase is also about
<strong>establishing patterns</strong> that make projects successful (e.g.
proposal templates). <strong>Project retrospectives</strong> are a great way to
find those patterns as well as improve SRE engagement with the project.
</p>
<h2>Established phase (with Town Planner SREs)</h2>
<p>
In this most mature phase, products or services are usually already generally
available, and systemic issues like overall architecture or developer tooling
are the most likely to impact reliability.
</p>
<p>
SREs can influence reliability here by identifying and working to resolve
<strong>systemic reliability issues</strong> (e.g. repeated incidents, poor SLO
choices, lack of on-call process, etc). <strong>Driving</strong>
<strong>continuous improvement </strong>is a very common way that SREs influence
reliability at this phase.
</p>
<p>
In addition, SREs can often identify ways to <strong>reduce operational burdens
or eliminate large scale toil</strong> during this phase, whether through
technical automation or architecture changes, or through helping teams build
process, knowledge, skills, tools and techniques they need for large scale
projects to be repeatedly successful and reliable.
</p>
<p>
This can be a phase where <strong>some SREs will feel there’s a stigma
associated with doing less technical work</strong>, but the impact of this work
cannot be overstated -- it’s where SRE can act as a <strong>true
multiplier</strong> as more and more teams and products/services are launched.
<strong>Examples include</strong> running an incident management program, SLA
program, On-call Program, Disaster Recovery/Business Continuity planning, or
even a Chaos Engineering program. A strategy to address this concern is to pair
SREs with a <a
href="https://blog.tryexponent.com/what-is-the-role-of-a-technical-program-manager/">technical
program management function</a> (TPM) so that SREs can focus most on the
technical aspects of improvement while TPMs can help with the organizational
changes needed to improve a process or execute a program.
</p>
<p>
Measuring the <strong>success of Town Planner SREs</strong> can be especially
tricky. You might look for simple metric improvements like fewer incidents,
reduced incident duration, reduced pages, improved SLO targets, or number of DR
tests -- but isolating the SRE impact to these kinds of metrics can be
difficult. <strong>Qualitative feedback from an SRE team’s internal
customers</strong> is also frequently used to measure success at this stage. The
most impactful SREs at this stage tend to <strong>cause paradigm shifts for the
other development teams</strong>, and often even for their own SRE teammates.
</p>
<h2>Wrapping up</h2>
<p>
<em>[PST is] how you take a highly effective company and push it [...]
towards a continuously adaptive system. </em><a
href="https://twitter.com/swardley/status/1258690134725349376">May 8th, 2020</a>
<a href="https://twitter.com/swardley/">@swardley</a>
</p>
<p>
I hope that the grouping above is useful to readers for structuring work to
drive reliability at various levels of product maturity.
Reliability-as-a-product-feature isn’t a magic bullet to solve for an
organization that doesn’t understand where it fits in the market or what kind of
value it delivers, nor will it make a large difference with an unhealthy product
management practice that might not know how to develop and drive delivery of a
product and its features over time.
</p>
<p>
As mentioned earlier, there usually isn’t a, “one-size fits all,” approach to
driving reliability. You may still need to <strong>establish some best practices
for your organization</strong> such as “Limit toil to 50% of our work” or “Every
product feature that goes live must have a reliability review.” Combined with
these kinds of rules of thumb, the proposed divisions and strategies above
should help focus your team(s) to make the biggest improvement to reliability
for your products and services.
</p>
<p>
In researching this post, it was helpful to review <a
href="https://github.com/upgundecha/howtheysre">how organizations “do SRE”</a>
at various organizations and companies. <a
href="https://www.cnpatterns.org/organization-culture/sre-team">Continuous
improvement</a> was a clear shared trait among them. It’s also worth reviewing
the huge amount of content out there about how SRE can effectively collaborate
with other teams (e.g. <a
href="https://sre.google/sre-book/operational-overload/">embedding SREs</a>); a
poor relationship or failed collaboration with another team can jeopardize all
of your efforts.
</p>
<p>
I invite and encourage you to write about and share your own experiences, both
good and bad, focusing on reliability as a first class product feature at your
organization. Special thanks to my own SRE team for the many discussions and
ideation sessions on how we can best work to drive reliability. And special
thanks to Jennifer Davis, Michael Lumsden, David Nolan, Jordan Rinke, and Kerim
Satirli for feedback and editing on this post.
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-21352210232594859602021-12-01T00:00:00.037-05:002021-12-04T23:44:17.160-05:00Day 1 - The Myths and the Magic in My Search for Acquiring Software Engineering Skills<p>
By: Annie Hedgpeth (<a href="https://twitter.com/anniehedgie">@anniehedgie</a>)<br />
Edited by: Jennifer Davis (<a href="https://twitter.com/sigje">@sigje</a>)
</p>
<p>
A happy SysAdvent to you, my dear elves. Whether you are an individual
contributor (IC), manager, director, or something in between, my holiday wish is
that my story spreads some holiday magic to your teams and roadmap.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“Then I traveled through the seven levels of the Candy Cane forest, past the
sea of twirly-swirly gumdrops, and then I walked through the Lincoln Tunnel.” <a href="https://g.co/kgs/iaSao7">Buddy the Elf</a></i></span>
</p>
<p>
I took an uncommon route into technology. With absolutely no experience of any
kind in any sort of technological pursuit (save for video editing in college), I
<a href="https://youtu.be/bNxc6Y8ZHsI">started my career</a> in IT by learning
configuration management and infrastructure as code first. Why? Because the
opportunity <a href="http://www.anniehedgie.com/leaning-in">presented
itself</a>, and I had a great in-house <a href="https://hedge-ops.com/">tutor</a>. My husband, <a href="https://twitter.com/michaelhedgpeth">Michael</a>, is the one who convinced
me to pursue a career in technology and was the one who spent many late evenings
teaching me how to “computer”. It was a bit of a trek through “<a href="https://g.co/kgs/iaSao7">the seven levels of the Candy Cane forest,
through the sea of swirly twirly gumdrops</a>” but with more tears and
heartache.
</p>
<p>
I spent the first couple of years of my career just trying to learn enough of
the different frameworks, like Chef, Terraform, PowerShell, Groovy, etc., to
build stuff and configure it properly. Learning about <i>how</i> they should
be built and configured came next with a focus on solution architecture and a
bit on systems administration. Looking forward, after five years of work focused
on configuration management, infrastructure as code, and CI/CD pipelines, I’m
now to the point where I want to grow in software engineering, and this is where
our story of myths and magic begins today.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“Some call it ‘the show’ or ‘the big dance’; it’s the profession that every
elf aspires to…” – <a href="https://g.co/kgs/iaSao7">Papa Elf</a></i></span>
</p>
<p>
Grab some hot cocoa and curl up with a blanket while I share with you what I see
as the common myths believed about acquiring software engineering skills and
what I believe to be the actual magic of making that a reality in <i>my</i>
life. We will start with the myths, but please remember, dear elves, that these
are myths and magic as they pertain to me personally. For you or others, they
may not be, and that’s okay. My hope is that sharing my own experiences will
give you empathy for others on their unique journeys and/or compassion for
yourself as you learn and grow in your own way.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761d;"><i>“The best way to spread Christmas cheer is singing loud for all to hear.” –
<a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<h3>Myth #1 - Just read a book</h3>
<p>
I am a huge fan of books, and I consume a pretty good amount of books per year.
I think that learning through books is important in a way that is difficult to
replicate through other modalities. I have gone through <a href="https://headfirstgo.com/">Head First Go</a>, a book that is geared toward
people with little to no programming experience, and I found it to be incredibly
helpful. I did every exercise in the book, learned a lot, and highly recommend
it. That said, the exercises alone were not enough to prepare me immediately for
real life coding. Doing the exercises was good and necessary, but it was only
one piece of the puzzle required to complete the picture of what it takes for me
to be able to contribute in a meaningful way to my company’s Go codebase.
</p>
<p>
Perhaps my lack of any formal training, whether university or code camp,
prevented me from grasping the higher level understanding that would have
enabled me to contribute confidently sooner, but whatever it was, I was still
lacking after simply going through a book. I liken this to studying through a
first year French textbook as your only means of learning the language. You will
gather the concepts and vocabulary, but you will likely not be able to speak the
language without other mediums of instruction.
</p><h3>Myth #2 - Just do some exercises</h3>
<p>
I am a huge fan of <a href="https://exercism.org/">Exercism</a>. I think they
are helping a lot of people learn coding languages, and they do it in such a way
that brings out a spirit of giving back in its users. There is much to love
about that. I have completed many Exercism exercises, and I do find them
helpful, but in the same way that the book was only helpful to a certain point,
I haven’t found that it helps me with the big picture. I have found it to be
like learning French with only Duolingo. Sure it’s a great app, and I use it all
the time. But again, one cannot use it in isolation in order to be a proficient
French speaker.
</p><h3>Myth #3 - Solution Architecture skills are built upon coding skills</h3>
<p>
Working at a cloud consulting firm for 4 years, I got a great education in
architecting solutions for clients. I really enjoyed learning about the process,
and it all made a lot of sense to me. After seeing several of them, I started to
see the patterns and practices that are used to create a good solution. And
then, as the person often implementing someone else’s solution, I learned
quickly what made a bad solution, as well.
</p>
<p>
To be good at architecting solutions, one must think through all of the choices
required to form that solution while you’re still in the planning phase, before
any of the solution is actually implemented. You can’t really “mess around and
find out”, which is why solutions architecture is such a valuable skill; if you
plan well, you do the necessary work, no less and no more.
</p>
<p>
However, not all solutions are equal. Architecting a solution to a cloud
migration feels like more of a tactile experience to me; I can see where things
are moving. I think it helps that you can actually hold a CPU in your hands, and
an architectural diagram has a very structural feel to it, similar to a
blueprint of physical structures. For me, at least, this makes it more
accessible and the concepts easier to grasp.
</p>
<p>
However, software architecture is more conceptual. You have to first understand
all of the interfaces, levels of abstraction, and concepts before you can
understand how to architect it. And if you don’t understand how to architect it,
then you’re back at the Duolingo level of coding.
</p><h3>Myth #4 - The building blocks to starting a tech career are cloud, code
editor, source control, and project management</h3>
<p>
Some people have suggested that huge barriers to moving into a software
engineering role can be mastering the tooling - code editors and IDEs, source
control, the cloud providers, and project management. This is possibly true of a
certain type of person moving from a systems administration type of job into
software development, but this was not true for me. But because Michael worried
that these would be barriers for me, I learned them first. I created a website
with GitHub Pages and used that as a way to learn source control and Visual
Studio Code. I took some online classes on Agile Framework. I got a free Azure
account and started playing with Terraform. These things were most definitely
and obviously important, but again, they’re but one piece of the puzzle.
</p><h3>Myth #5 - It just takes a creativity / growth / problem-solving mindset</h3>
<p>
One of my husband’s main reasons for convincing me to pursue a career in tech
was that I’m a pretty creative person who loves problem solving and that the
desire to dig into a problem until it’s solved is one of the most necessary
components for a career in tech. I completely agree that this is an important
character trait in order to be successful as a technologist. I’m also decently
creative and have a growth mindset, which are equally valuable for such a
pursuit. You can probably see, by now, where I’m going with this, though.
</p>
<p>
These traits alone are great and will serve you well in just about any endeavor.
Having these traits does not make a person automatically good at tech. It’s like
when you’re house-hunting and find a house that needs a ton of cosmetic
remodeling, but you say, “It has good bones,” meaning, you can easily make it
the way you want it to look without having to overhaul anything structurally.
Still, though, the cosmetic renovations are not insignificant. They are a lot of
work.
</p>
<p>
The same is true with me. Yes, I have “good bones” - good traits that are great
assets for a career in tech, like being creative, having a growth mindset, and
being a good problem solver. But to let folks start a career in tech with the
false hope that these traits will give them an unrealistic advantage is not
helpful. Yes, those traits help me a lot, but, goodness me, it is still a
<i>lot </i>of work learning and growing in tech, even with those traits.
</p><h2>Real barriers:</h2>
<h3>Truth #1 - People get pigeon-holed into certain work</h3>
<p>
I worked so hard to get the skills necessary to be valuable to my respective
organizations, and while, yes, I found myself a bit pigeon-holed into “devops-y”
roles, the other truth is that I didn’t <i>feel </i>as experienced as my peers
because I didn’t have the formal training many of them had, so I felt behind in
my learning. I wanted to catch up to the folks my age in this business, and that
was nearly impossible, so the next best thing was to get really good at one
thing, and just like that, I found myself pigeon-holed. Honestly, this was
probably easier and less risky for the companies I worked in, as things were more
predictable and steady when I was focused on a smaller scope of expertise.
And you might be thinking, ‘So what’s the problem with striving to become a
subject matter expert at something? There’s immense value in that.’ And you’d be
right. This is perfectly fine for some people. However, I personally like to
have a range in my work. I find freedom in flexibility as my hope is that it
gives me more options in my future, ultimately decreasing the risk to my career.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761d;"><i>“There’s room for everyone on the Nice list.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<p>
To overcome the barrier of being pigeon-holed into a particular line of work, a
bit of magic is required - the magic that happens when goals are set and people
help other people. Setting goals and tracking them is extremely important to me,
but part of tracking those goals is being accountable to them by someone,
whether it be a manager, a mentor, or a team lead. When my manager or team leads
know my goals and I have milestones set for reaching those goals, then I am so
much more likely to achieve them, and I’m giving them an opportunity to play an
important role, which grows their leadership skills - a win-win.
</p><h3>Truth #2 - It’s an engineering problem for senior engineers to break down
work to share work with juniors</h3>
<p>
My favorite type of senior engineer is one who can not only design a good
solution but one who knows how to allow everyone on the team to contribute to
the solution with their own strengths. Being able to communicate their vision
for a solution to others and lead others effectively to carry out their vision
is arguably the most valuable skill of a senior engineer. The whole team thrives
when seniors lead in this way! Being able to do this is most definitely
classified as a soft skill - one that is not easily measured by a test, and I
have witnessed many ICs discount soft skills, thinking that only managers need
worry themselves with growing such skills. I would argue, though, that this
particular soft skill is also an <i>engineering skill, </i>one necessary to be
an effective IC engineer.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“I mean, parents couldn’t do that all in one night.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<p>
Conversely, how many times have you seen senior engineers go silent for two
months and then emerge with an amazing <i>something</i> that solves a problem,
but it resembles a coded version of a complicated Home Alone trap (like a Rube
Goldberg machine)? This is actually not what we want from our senior engineers,
dear elves. We want senior engineers who are able to thoughtfully and skillfully
level up those at the levels below them.
</p>
<p>
There is a common desire among engineers to remain an IC for as long as possible
with no desire for the managerial track, and that is totally fine! However,
being an IC does not mean that you work within a vacuum. No matter your level,
every IC can have a positive influence on someone else on the team and can bring
leadership and mentorship into their everyday roles. Seniors, however, have the
<i>responsibility </i>to give others the opportunity to contribute to their
vision. By considering the other people on their team and their strengths and
goals, solutions can be designed so that everyone grows. Is it hard? Of course!
But when it happens, it’s like magic.
</p>
<p>
I started my career in tech a few days before I turned 37, so with the amount of
catch-up I have from being late to the game, I just need help sometimes. An hour
of help from a human being, for me at least, is the absolute most supercharged
way to learn. I am so grateful to have had people all throughout my time in tech
who understand that investing in people by pairing on a problem is really an
investment in the health and wellness of the team, product, and company. I would
argue also that it makes them a better person, teacher, and leader.
</p>
<p>
I wholeheartedly believe that fostering this environment should be the number 1
priority of every engineering manager because it will solve a lot of other
problems down the line naturally. We need not be islands unto ourselves but
rather a rising tide that lifts all ships.
</p><h3>Truth #3 - A team needs dedicated time to grow</h3>
<p>
Getting time to grow at a consultancy was tough. It was usually relegated to
times when I was on the bench, but that time wasn’t consistent. There were times
when I would go an entire year or more with no bench time, so I had to use my
personal time. I will take this time to remind you, dear elves, that making your
employees use their personal time for growth and development is not an inclusive
practice. It makes it harder for folks with families, disabilities, or just
plain healthy boundaries to have the time and space to learn.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“I planned out our whole day. First we make snow angels for two hours, and
then we’ll go ice skating, and then we’ll eat a whole roll of Tollhouse Cookie
Dough as fast as we can, and then to finish, we’ll snuggle.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<p>
I’m so incredibly grateful for my current manager and team who have deemed half
a day on Fridays to be dedicated learning times. When we all have learning time
at the same time, then no one feels guilty for not working on sprint work
because, as a team, we’ve decided that learning is important enough to spend
time on it. I’ve gotten a lot out of this; I finished the aforementioned <a href="https://headfirstgo.com/">Head First Go</a> book, and I’ve worked on <a href="https://exercism.org/">Exercism</a> exercises. I’ve also used it to learn
how to do things that were blocking me in my sprint work. But to make the most
out of this time, my next step is to use Friday learning times to actually apply
the things I’ve learned to real-world work. This, however, may exceed the bounds
of half a day on Fridays, and it may mean that I take a bug fix ticket and spend
a whole week on it. The magic required is that the team and manager buy into
this investment of time and energy. I personally know that I would get that
buy-in on my current team, but I know I’m a lucky one. They know that the payoff
of me growing my skills is worth the investment of time.
</p><h3>Truth #4 - Insecurity looms with the lack of formal education through a
coding school OR engineering degree, which makes it feel more difficult to
acquire certain skills</h3>
<p>
This might be an unpopular opinion, and I just stated it as a truth, but I do
believe that this is true for me. There are certain coding exercises that I have
tried that make me feel like I will never truly understand certain concepts. I
do believe that I will know enough to be valuable, but knowing when that matters
and when it doesn’t is a mind trip. It’s difficult to manage my own expectations
of my own growth, learning, and knowledge. The constant nagging thought in the
back of my head is that if I had had any sort of formal coding training,
whether in university or code camp, something would have clicked in my
brain so that I understood certain concepts more quickly, and I honestly don’t
know if this is a valid concern for me or not.
</p>
<p>
I do know that magic happens when people step in. When I have brilliant
developer people in my life telling me what matters and what doesn’t matter and
helping me to grasp fundamental concepts, my growth and confidence are
accelerated greatly. I go from focusing on my blockers to focusing on my
trajectory.
</p><p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“Oh, it’s not a costume. I’m an elf. Well, technically, I’m a human, but I
was raised by elves.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span>
</p>
<h3>Truth #5 - Career planning related to skills is a bit more complicated</h3>
<p>
When you’re a career-changer and are late to the tech game, planning for the
future can be a bit complicated. My current difficulty is that I have the soft
skills required to be a really great manager, but managing a technical team
requires a great depth of knowledge that only comes with experience. So what do
I do with all of this leadership potential? For now, I’m doing nothing. I’m
hunkering down and growing my depth and breadth, and that’s so frustrating!
</p>
<p>
But again, therein lies the potential for magic. If a manager and a team are
intentional about growing people to their own strengths and goals, then we can
carve a path that matches my goals and strengths with the business’s needs, but
it requires a bit of creativity and flexibility. It takes mature leadership to
know how to turn each team member’s potential into something that benefits
everyone.
</p><div style="margin-left: 40px; text-align: left;"><span style="color: #38761D;"><i>“I just like smiling. Smiling’s my favorite.” <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Buddy the
Elf</a></i></span></div>
<h3>TL;DR</h3>
<p>
Did you note a common thread? The myths I outlined are discouraging blockers
that kept me from thinking that I could achieve my goals, and I have a hunch
that I’m not alone in these feelings. But the magic lies in people caring about
and investing in each other’s growth. That’s it! This is not just the kind,
empathetic, and right thing to do, but it will also improve the business’s bottom
line: when people are committed to growth and feel encouraged in it, they create
quality products and they stay in the same place longer because they feel
supported. As you go about your holiday and new
year, I encourage you to bring a little bit of magic to your own teams by either
being the support someone needs or by allowing someone to be a support for you.
</p>
<p style="margin-left: 40px; text-align: left;">
<span style="color: #38761D;"><i>“Bye Buddy, hope you find your dad!” – <a href="https://www.imdb.com/title/tt0319343/?ref_=fn_al_tt_1">Mr.
Narwhal</a></i></span>
</p>sigjehttp://www.blogger.com/profile/18050320060096957519noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-82986001179632510342019-12-25T00:00:00.000-05:002019-12-25T00:39:18.811-05:00Day 25 - The “Just” Basics<p>
By: C.A. Corriere (<a href="https://twitter.com/cacorriere">@cacorriere</a>)<br/>
Edited by: Michelle Carroll (<a href="https://twitter.com/miiiiiche">@miiiiiche</a>)
</p>
<p>
This year we celebrated <a href="https://devopsdays.org/events/2019-ghent/welcome/">the ten year anniversary of devopsdays</a> in Ghent, Belgium, where the conference originated in 2009. I was lucky enough to have my talk <a href="https://youtu.be/ctNzkslnrVE?t=4147">“Cookies, Mapping, & Complexity”</a> selected for the event. The feedback I received was mixed, but it was aligned with <a href="https://youtu.be/TwRxN7TohO8?t=27089">a broader theme</a> that emerged from the conference: given the impact technology has on our society in 2019, we can’t afford to ignore the complexity of our sociotechnical systems. The problem we’re now faced with is, how do we raise awareness around this complexity and make it more accessible to beginners?
</p>
<p>
If the answer to this question were obvious I could list a few examples here. If it were just complicated I could draw you a map or two. Sociotechnical problems, like this one, happen to be centered in a complex domain where models are often helpful. This question is one of multiple safe-to-fail experiments with negative hypotheses I am currently running, intended to serve as probes into a model of our communities I built. There’s a lot of specific jargon in this paragraph tied to complexity science, and the <a href="https://en.wikipedia.org/wiki/Cynefin_framework">cynefin framework</a> specifically.
</p>
<p>
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi80vbjcQwzN_kWYlYBcoyXDzoZPkUH5FkSbBjI4l7DXbo4rtIXnf8kIAUk6MNLowy7ZZANc7M-DSV8edBY-I1PAoEJ0hCz_8Dn2p2t-sbEcGMofEc6ihJbMLdx1DCuTY4YqLzIKiw1-MQ/s1600/Screen+Shot+2019-12-24+at+9.33.43+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEi80vbjcQwzN_kWYlYBcoyXDzoZPkUH5FkSbBjI4l7DXbo4rtIXnf8kIAUk6MNLowy7ZZANc7M-DSV8edBY-I1PAoEJ0hCz_8Dn2p2t-sbEcGMofEc6ihJbMLdx1DCuTY4YqLzIKiw1-MQ/s1600/Screen+Shot+2019-12-24+at+9.33.43+PM.png" width="95%" /></a></div>
</p>
<p>
I facilitated ninety minutes of open space workshops around mapping and complexity science in Ghent, but a workshop on complexity science alone can easily fill a week. Shorter workshops manifested at quite a few events I attended this year. While I prefer sitting through a day of lectures, 30-minute segments with more specific content seem to be a better fit for most people.
</p>
<p>
I’ve also noticed that using common examples, like baking cookies or making a cup of tea, helps folks connect the theory to an area of practice where they already have some experience. Even if you’ve never made tea or baked cookies, the barrier to entry is low enough that someone could try them for the sake of learning about complexity science and mapping.
</p>
<p>
I wouldn’t keep offering the workshops if people didn’t both show up and tell me they were useful, but I must admit I’ve covered the basics on these topics enough times that I worry I sound a bit like a broken record. I have been pulling a lot of this into a book, which I hope will be available in early 2020. For now, I am going to hope some folks can connect the dots between the language I’m using here and the picture of the framework provided. I’d encourage you to study this some on your own too, and you’re always welcome to ask questions on twitter. If I can’t answer them I probably know some clever person that can. What can we do to help make this type of content more accessible? Are you even convinced you need to learn it yet?
</p>
<p>
During the closing panel of Map Camp London, <a href="https://twitter.com/CatSwetel">Cat Swetel</a> referred to both cynefin and <a href="https://en.wikipedia.org/wiki/Wardley_map">wardley mapping</a> as “tools of <a href="https://en.m.wikipedia.org/wiki/Theory_of_justification">epistemic justice</a>”. I understand this to mean cynefin and wardley mapping are tools that can help us know how we know (or don’t know) something, and why our beliefs are (or aren’t) justified. Personally, I like being able to check my work and knowing when I’m wrong. It’s a humbling experience, but I do think it’s a pretty basic life lesson that’s easily justified.
</p>
<p>
What else counts as basic, introductory content in 2020? Is it installing an SDK and writing “Hello World!”? Do we start with a git repo and some yaml files? Maybe it’s a map of our application’s carbon footprint? Mapping and complexity science (among other tools) can help justify the answers to these questions, but I have no doubt those answers are context dependent. I would recommend learning to read a map before trying to draw one. <a href="https://medium.com/@chrisvmcd/mapping-maturity-create-context-specific-maturity-models-with-wardley-maps-informed-by-cynefin-37ffcd1d315">This post on maturity mapping</a> by Chris McDermott is based on cynefin and wardley mapping and serves as a solid example of the emergent justification I’m talking about. I’m looking forward to learning more about philosophy, epistemology, and tools that can help us change our minds and come to new understandings as the world shifts around us in the new year, but I really need to do a better job of pacing myself.
</p>
<p>
If a month of travel and research abroad weren’t enough for this year, then it’s a good thing I helped pull together <a href="https://www.youtube.com/watch?v=bsbWfBdwpIk&list=PL5pdUnQbCX6tZUv75Hl82h_U2TPB0iTaw">three conferences at the Georgia Aquarium</a> in Atlanta too. I have organized <a href="https://devopsdays.org/events/2020-atlanta/welcome/">devopsdays Atlanta</a> for a few years. When we saw an opportunity to host the first <a href="https://www.map-camp.com/">Map Camp</a> outside of the U.K. and the first <a href="https://atlanta.serverlessdays.io/">ServerlessDays Atlanta</a> along with our conference we decided it was worth the effort. Watching the ripples from that event since April has warmed my heart, but 2019 has also brought my attention back to one of my <a href="https://en.wikipedia.org/wiki/First_principle">first principles</a>:
</p>
<p>
<strong>I cannot take care of anything if I am not taking care of myself.</strong>
</p>
<p>
This year has been very global for me. My goal is to make 2020 much more local and regional by comparison, and I’m not alone. More and more presenters are refusing to fly for tech conferences given the growing concerns around global warming, which ended up being the main theme for Map Camp London this year. I think it’s important for our international communities to gather on a regular basis, but the cost of doing so should have little to no impact on our local communities, our planet, or our individual health. It must be done sustainably.
</p>
<p>
I doubt I’m leaving the country next year, but I’m thankful to be part of the vibrant tech community we have in Atlanta. I’ll be speaking at <a href="https://devnexus.com/">devnexus</a> this February, we’re organizing a minimally viable <a href="https://devopsdays.org/events/2020-atlanta/welcome/">devopsdays Atlanta</a> this April (the same week as <a href="https://www.refactr.tech/">REFACTR.TECH</a>), and it seems like there are a few meetups to choose from here every week.
</p>
<p>
If you aren’t participating in your local tech community then maybe 2020 is the year to try attending more events. If there aren’t any events, maybe you’d like to try organizing one. Maybe 2020 is the right time to visit some other cities (like Atlanta : ) or even a different country. Maybe you’ve been doing plenty of that, and like me you’re ready to tap the brakes and invest a little more energy in your own backyard. Please join me in using the days we have left this year to rest, reflect, and justify how we can co-create intentional futures during our next decade together, and for the ones that will follow afterwards.
</p>cwebberhttp://www.blogger.com/profile/16516063226544348092noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-58883086197052909892019-12-24T00:00:00.000-05:002019-12-24T00:00:04.013-05:00Day 24 - Expanding on Infrastructure as Code<p>By: Wyatt Walter (<a href="https://twitter.com/wyattwalter">@wyattwalter</a>)<br/>
Edited by: Joshua Smith (<a href="https://twitter.com/jcsmith">@jcsmith</a>)</p>
<h2>Introduction</h2>
<p>
As operators thinking about Infrastructure as Code, we often think of infrastructure as just the stuff that runs inside our data centers or cloud providers. I recently worked on a project that expanded my view of what I consider “infrastructure” and what things were within reach of being managed similarly to the way I manage cloud resources. In this post I want to inspire you to expand your view of what infrastructure might be for your organization, and to give an example using Terraform that makes that idea more concrete.
</p>
<p>
First, the example I’ll use is a workflow for managing GitHub repositories at an organization. There are tons of other services Terraform can manage (“providers” in Terraform terms), but this example is a service that is free to recreate if you want to experiment. Then, we’ll dig into why you’d even want to go through the trouble of setting something like this up. Lastly, I’ll leave you with some inspiration on other services or ideas on where this can be applied.
</p>
<p>
The example and source code are very contrived, but available here (link: <a href="https://github.com/sysadventco-2019/sysadventco-terraform">https://github.com/sysadventco-2019/sysadventco-terraform</a>).
</p>
<h2>An example using GitHub</h2>
<p>
At SysAdventCo, developers use GitHub as a tool for source code management. The GitHub organization for the company is managed by a central IT team. While the IT team did grant a few individuals throughout the company permission to create repositories or teams, some actions were only accessible to administrators. So, even though teams could modify or create some settings, the IT team was often a bottleneck, because many individuals needed to see or modify settings that they could not reach on their own.
</p>
<p>
So, the IT team imported the configuration for their organization into Terraform and allowed anyone in the organization to view it and submit pull requests to make changes. Their role has shifted from taking in tickets to modify settings (which often had multiple rounds of back-and-forth to ensure correctness) and manually making changes to simply being able to approve pull requests. In the pull requests, they can see exactly what is being asked for and receive validation from CI systems about the exact impact that change would have.
</p>
<p>
A stripped down version of the configuration looks something like this:
</p>
<pre class="prettyprint"># We define a couple of variables we can pass via environment variables.
variable "github_token" {
type = string
}
variable "github_organization" {
type = string
}
# Include the GitHub provider, set some basics
# for the example, set these with environment variables:
# TF_github_token=asdf TF_github_organization=sysadventco terraform plan
provider "github" {
token = var.github_token
organization = var.github_organization
}
# This one is a bit meta: the definition for this repository
resource "github_repository" "sysadventco-terraform" {
name = "sysadventco-terraform"
description = "example Terraform source for managing the example-service repository"
homepage_url = "https://sysadvent.blogspot.com"
gitignore_template = "Terraform"
}
SysAdventCo operates a number of services. The one we'll focus on is example-service. It's a Rails application, and has its own entry in the configuration:
resource "github_repository" "example-service" {
name = "example-service"
description = "the source code for example-service"
homepage_url = "https://sysadvent.blogspot.com/"
gitignore_template = "Rails"
}
</pre>
<p>
The team that builds and operates example-service wants to integrate a new tool into their testing processes that requires an additional webhook. In some organizations, a member of the team may have access to edit that directly. In others, maybe they have to find a GitHub administrator to ask them for help. In either case, only those who have access to change the settings can even see how the webhooks are configured. Luckily, things work a bit differently at SysAdventCo.
</p>
<p>
The developer working on example-service already has access to see what webhooks are configured for this repository. She is ready to start testing the new service, so she submits a small PR (link: <a href="https://github.com/sysadventco-2019/sysadventco-terraform/pull/2">https://github.com/sysadventco-2019/sysadventco-terraform/pull/2</a>):
</p>
<pre class="prettyprint">+
+resource "github_repository_webhook" "example-service-new-hook" {
+ repository = github_repository.example-service.name
+
+ configuration {
+ url = "https://web.hook.com/"
+ content_type = "form"
+ insecure_ssl = false
+ }
+
+ active = false
+
+ events = ["issues"]
+}
</pre>
<p>
The system then automatically creates a comment with exactly what actions Terraform would take if this were approved, so that a member of the IT team can review it and collaborate with the developer requesting the change.
<div class="separator" style="clear: both; text-align: center;"><a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2lvfvkk3_bN5_3wHaDfJdqNKdV11lHGR34EDcoRYKy924uii0-OqDcnPiClSEPcmBzTpxqCBDu-05eeFNpqx0nkMdJEKlyb_xSTzI3ZMiXIc_FQ8A2sFvvdNUymuENDPyHQVxAAA3S5I/s1600/Screen+Shot+2019-12-15+at+7.49.32+PM.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEg2lvfvkk3_bN5_3wHaDfJdqNKdV11lHGR34EDcoRYKy924uii0-OqDcnPiClSEPcmBzTpxqCBDu-05eeFNpqx0nkMdJEKlyb_xSTzI3ZMiXIc_FQ8A2sFvvdNUymuENDPyHQVxAAA3S5I/s1600/Screen+Shot+2019-12-15+at+7.49.32+PM.png" width="95%" /></a></div>
</p>
<p>
No one is stuck filling out a form or ticket, trying to explain in words what is needed so that those words can be interpreted into manual actions. They have simply updated the configuration themselves; it is automatically validated, and a comment is added with the exact details of what in the repository would change as a result of the request. Once the pull request is approved and merged, it is automatically applied.
</p>
<h2>This seems like a lot of work, why bother?</h2>
<p>
What an astute observation, dear reader! Yes, there is a good deal of setup involved once you get past this simple example. And yes, managing more automatically can often be more work. In addition, if your organization already exists but doesn’t use something like this method already, you probably have a good deal of configuration to import into the tool of your choice. I’d argue that there are a number of reasons that you would want to consider using a tool like this to manage tools that aren’t strictly servers, firewall rules, etc.
</p>
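<p>
That import step is more mechanical than it sounds: for each existing resource you write the matching resource block, then pull the real object into Terraform state. Here is a rough sketch, using the example-service repository defined earlier (the GitHub provider accepts the repository name as the import ID):
</p>
<pre class="prettyprint"># Bring the existing repository under management, using its name as the import ID.
terraform import github_repository.example-service example-service

# See what, if anything, Terraform would change to reconcile reality with the config.
terraform plan
</pre>
<p>
With that out of the way, here are the reasons the effort pays off.
</p>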
<p>
First, and this is the first thing I reached for: we can track changes the same way we do for other things in the delivery pipeline while also ensuring consistency. For me on my project, importing the configuration of a PagerDuty account into management by Terraform allowed me to see inconsistencies in the manually configured service. While the tool added value, a huge part of the value was the simple act of doing the import and having a tool that enforced consistency. I caught a number of things that could've misrouted alerts, if conditions were right, before they became issues.
</p>
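<p>
To make that PagerDuty example slightly more concrete, here is a minimal, illustrative sketch of the kind of configuration involved. This is not the configuration from my project, and the resource attributes reflect my reading of the PagerDuty provider documentation, so treat it as a starting point and check the provider docs before relying on it:
</p>
<pre class="prettyprint">variable "pagerduty_token" {
  type = string
}

provider "pagerduty" {
  token = var.pagerduty_token
}

# One user, an escalation policy that pages that user, and a service
# routed to that policy.
resource "pagerduty_user" "oncall_engineer" {
  name  = "Example Engineer"
  email = "engineer@example.com"
}

resource "pagerduty_escalation_policy" "example" {
  name      = "example-service escalation policy"
  num_loops = 2

  rule {
    escalation_delay_in_minutes = 10

    target {
      type = "user_reference"
      id   = pagerduty_user.oncall_engineer.id
    }
  }
}

resource "pagerduty_service" "example" {
  name              = "example-service"
  escalation_policy = pagerduty_escalation_policy.example.id
}
</pre>
<p>
Once something like this is in place, a misrouted escalation shows up as a diff in a pull request rather than as a surprise during an incident.
</p>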
<p>
The next and most compelling reasons to me are in freeing up administrative time and giving teams the freedom to effect changes directly without creating a free-for-all situation. You can restrict administrative access to a very small number of people (or just a bot) without creating a huge bottleneck. It also allows anyone without elevated privileges to confirm settings without having to ask someone else. I'd also argue this creates an excellent model for the basis of a change control process for organizations that require or have them as well.
</p>
<p>
A further advantage is that, since none of these tools exist in isolation, using a method like this can give you an opportunity to reference configuration dynamically. This allows your team to spin up full environments to test configuration end-to-end.
</p>
<h2>But wait, there’s more!</h2>
<p>
Within the Terraform world, there’s an entire world of providers out there just waiting for you to explore! Imagine using the same tools you use to manage AWS or GCP resources that often are linked to other important things your team uses:
</p>
<ul>
<li>Manage your on-call rotations, escalation paths, routing decisions, and more with the PagerDuty provider</li>
<li>Manage the application list and alerts in NewRelic</li>
<li>Add external HTTP monitoring using tools like Statuscake or Pingdom</li>
</ul>cwebberhttp://www.blogger.com/profile/16516063226544348092noreply@blogger.com0tag:blogger.com,1999:blog-3615332969083650973.post-69022098277122125542019-12-23T00:00:00.000-05:002019-12-23T00:00:05.226-05:00Day 23 - Becoming a Database Administrator<p>By: Jaryd Remillard (<a href="https://twitter.com/KarateDBA">@KarateDBA</a>)<br/>
Edited by: Benjamin Marsteau (<a href="https://twitter.com/bmarsteau">@bmarsteau</a>)</p>
<p>Database is a term that is thrown around in meetings amongst all industries. The term is almost always used with a sense of urgency and importance yet contains a vast mystery. It can be a topic that some may feel too confident, one with absolute no knowledge and one that refers to a single copy of a glorified excel spreadsheet sitting on their desktop. In my short time as a database administrator, I have found that it is typically the confident ones that venture into this mystery with the full understanding of the business value and risk that come as the database administrator. Like any area of science, technology, engineering, and math, acronyms are favored, so let it be known that the title of database administrator can be abbreviated as DBA.
</p><p>
Like any career path, one database administrator's path will not necessarily align with the direction you have to take. This is not to discredit the value of the journey and course someone purposely took, or perhaps accidentally stumbled into; there are specific points to remember, which in themselves could present opportunities in your own journey. Instead, be aware that, just as there is theoretically an infinite number of ways of solving a problem with code, there is an endless number of directions you can take to reach your destination of becoming a DBA. All in all, I hope my reflection on the journey I took to become a database administrator will set you up for success.
</p>
<h3>Start with the basics</h3>
<p>
When I was 12 years old, I befriended a stranger online through a collective group of people who played an online video game. We were idling in our TeamSpeak server when I asked them what they were up to, and they replied saying they were coding a website for our group. The concept immediately struck me with curiosity like a static shock; the idea of how to construct a website was so far-fetched I just had to learn so I could quench the burning desire. I naively asked if it was a drag-and-drop type of process. They laughed and began to teach me HTML, showing me how to view the source of a website. The concept blew my mind; words typed in a specific manner could be translated into a structure that is displayed on my screen. It made me feel like anything was possible. I kept building websites with HTML, leveling up to using CSS, JavaScript, learning Linux, and eventually PHP. Soon after, I was building login systems, registration systems, user profiles, all in a LAMP stack that required knowledge of basic SQL. Learning simple DMLs, DDLs, DCLs and TCLs, I wrote whatever worked. The experience and newfound knowledge I bashed together eventually turned into a charming but underwhelming social network that I named Express-It. Building the schemas in phpMyAdmin was accessible in the sense of, "I create a column, PHP writes to it, there we go." However, as the social network grew to a whopping 100 people, who were primarily friends and family there for moral support, my website slowed to a crawl. What I did not understand was the more extensive technical specifics of ints, unsigned bigints, varchars, indexing, and primary keys, and how various people querying similar things at the same time, while the SQL scanned the entire table, affected performance. I could not wrap my head around it, nor did I think there was anything more to it than the query itself, because it did the job locally. Frankly, it also didn't occur to me that my schemas and queries were a DBA's nightmare. I shut down Express-It since my curiosity shifted from LAMP stacks to learning cybersecurity and doing basic IT jobs for friends and family, and I was sick of the free hosting tier I was using.
</p>
<h3>School of Hard Knocks</h3>
<p>
As I shifted my focus from building to fracturing, SQL came up in the form of learning its flaws: the various types of SQL injection, brute force, and DoSing a database. My knowledge expanded to be more aware of possible vulnerabilities and the importance of a database, including losing data. This lesson was exemplified when the code from my Express-It website, along with hundreds of hours of various other projects I had stored on a flash drive as an interim measure between moving homes, was accidentally reformatted by a family member while they were transferring some photos. Losing all my work taught me how easy it can be for a large part of my life to disappear. I then realized the hard way that backups are a thing, and I became hyper-aware. I learned to keep at least two to three copies of whatever was important on separate data stores; I learned that flash drives and hard drives can die without warning or get overwritten accidentally, that you can never have enough backups, and that corruption is a thing. My motto became "backup backup backup, correctly." I often chuckle when reflecting on this time of my life because it reminds me of high school, where any time a big paper was due, someone always had the excuse that their file was gone, overwritten, or corrupted the night before, conveniently, perhaps honestly so. I could not help but blurt out of my smart mouth, "Should have backed up."
</p>
<h3>Venturing Further and Beyond!</h3>
<p>
I went on a hiatus from technology for a bit to focus on school and sports. Eventually, when my interest in technology came back, an internship opportunity landed in my lap, which to this day I still attribute to luck, as the news of the opportunity was shared with me by the CTO of the relevant company. Before starting, I was asked what I wanted to do during the internship, specifically, what direction I wanted my career to go. I thought it was software engineering; at the time, building and designing were shiny to my eyes. However, I was conflicted, as I still enjoyed living in the terminal; something about the rawness of text on a blank background, where specific commands can be used to effectively navigate the computer in ways a GUI cannot, still drew me in even when I was deep in an IDE, and I had moments of barbarianism when I would code in vim. I knew programming was not what I wanted to do for a full eight hours a day; I was conflicted, and I shared my concerns. They mentioned DevOps, and it was perfect: I would get the complete balance of being in the terminal and writing code. I then embarked on the start of my career. As an intern, a lot of my tasks were simple: data entry, setting up a local environment, breaking the local environment, finishing some tickets, attending standup, and the like. But one task stood out to me: the need for an internal tool to show the difference between a system at one point in time and at another, covering things such as permissions and the data in each file, essentially a beefy diff. Like most ideas, it was a task that seemed easy at first but was exponentially more complicated than initially anticipated. As I dug into the task, the first tool I chose to program in was Python, as it seemed easy to learn and it was all the rage. As I learned more about Python's data types, I naively figured it would be an excellent idea to cache every file on the system in a dictionary, which, unbeknownst to me beforehand, resulted in Python running out of memory. After consulting some of the engineers nearby on how to navigate this issue, it was recommended that I use a database. So naturally, I chose SQLite3. I moved to MySQL pretty soon after; SQLite3 was just not working out. I figured MySQL was perfect: it is a solid relational database, I would have the freedom to specify what kind of data to store, and it made storing md5 checksums straightforward. I was eventually able to get the program to work to some degree of success, but not without experiencing bottlenecks. Previously, the amount of data I had worked with was so small that there was little to no need for optimization. So when caching the majority of the file information on a system, that is when I started to see performance impacts on the database, particularly in how long the program took to execute and in the overall high usage of both primary and secondary memory. I figured the best way to tackle this problem would be to take it to a deeper level and learn about the internals of MySQL: the basics of how the client and server work together, and elementary query optimization. But with limited guidance, there was only so much I could dig up on my own. Eventually, time moved on, and I moved to a local company as a system administrator. My new job exposed me to different types of databases: MongoDB and SQL Server, along with MySQL again. I spent a lot of my time in my new position on the front end and on web servers like Apache, on Jekyll and GruntJS workflows, as well as on Active Directory, but I still got to see the back end as well.
Naturally, it fascinated me more; in between tasks I learned how it was accessed by services, how to view the permissions of users as an administrator, and how to query for what I wanted. Questions about the front end were easy to answer, but the back end had a lot of unanswered questions I could not find the answers to, on various topics such as internal functionality, maximum capabilities, dynamically managing users, and so on. Databases remained a mystery I wanted to solve and I was ready to go Sherlock. I read the documentation and tinkered with the databases, and then I would go home to set one up for myself, just to see how I could break it. Unfortunately, my time became more consumed with school and the front-end side of my job, although I knew I wanted to get back to databases in the future. Soon, an opportunity opened up to work as a student employee in the IT department at my university. This potential new position would save me an hour commute to school as well as an hour commute to work. Although it was not nearly as technical as what I was currently doing, and perhaps a step back, it was the right decision for me at the time, and I had a feeling this job could turn into something more technical than what I was already doing.
</p>
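<p>
To give a flavor of the idea behind that internship diff tool, here is a small, present-day sketch of the snapshot half of it. This is not the code I wrote back then; it simply illustrates walking the filesystem and recording per-file metadata and md5 checksums in a database (SQLite here, for brevity) instead of holding everything in one giant dictionary:
</p>
<pre class="prettyprint">import hashlib
import os
import sqlite3
import stat

def file_md5(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files never have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def snapshot(root, db_path):
    """Walk `root` and record the path, permissions, size and md5 of every file."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files "
        "(path TEXT PRIMARY KEY, mode TEXT, size INTEGER, md5 TEXT)"
    )
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.stat(path)
                row = (path, stat.filemode(info.st_mode), info.st_size, file_md5(path))
            except OSError:
                continue  # unreadable or vanished file; skip it
            conn.execute("INSERT OR REPLACE INTO files VALUES (?, ?, ?, ?)", row)
    conn.commit()
    conn.close()

if __name__ == "__main__":
    snapshot("/etc", "snapshot.db")  # example root; point it anywhere
</pre>
<p>
Comparing two points in time is then a matter of joining two snapshot tables on path and reporting rows whose mode, size or checksum differ, which is exactly where a database starts to earn its keep over an in-memory dictionary.
</p>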
<h3>Uh-oh</h3>
<p>
After completing three years of college and almost a year as a student employee, I was offered a full-time job in the IT department of the university I was attending. It was a difficult decision, and it took some time to weigh the pros and cons: take the risk and leap of faith into the field, or continue my education for another two years while piling up debt before flowing into the field anyway. Ultimately, I chose to take the job to pay off the student loans that I had accrued, which were almost the size of my salary and nearly twice my weight in stress. Also, I knew this was an opportunity to delve deeper into different technologies and learn what it means to take ownership and responsibility. The job was to be the system administrator for the STEM department, and that came with a lot of responsibility for managing various software, some cloud-based but much of it on-premises. Much of the software I managed used a SQL Server to manage logins, logs, barcode numbers, etc. Little did I know that an SQL Server was actively in use until I got a call from a chemistry department head saying they were unable to log into their science rental equipment management software. I searched all over in our wiki and could not find a single trace of its existence; for a moment I thought this was a prank. I asked my coworkers if they had heard of this software, or if it even existed; I got nothing in response. I dug further, even going so far as reaching out to the system engineer who had worked there previously. It turns out this software had an SQL Server sitting in an undocumented virtual machine, lost in tribal knowledge. Unfortunately, there were no records of this software ever being provisioned. To look on the bright side, one of the science teachers' users in this database for some reason had super privileges, giving me the ability to log in and work some magic; I thought this was the end of the immediate problem. But there was an itch in my brain, questions that stuck with me: what had happened? Why did it all of a sudden stop working? Why was it sitting on an undocumented VM? How could I prevent this in the future? Why was there no accountability or visibility for this database? Either it was going to be forgotten in this state, or I had to work on keeping it reliable in all aspects, especially documentation. I made a page on everything I learned about this software, including representatives from the company, run books for the database, and how the client and backend work. That was the moment my interest in reliability and uptime grew exponentially, especially around databases. Before I left, I made one big oops.
</p>
<h3>Be wary of drives</h3>
<p>
I was put in charge of the psychology department on top of the STEM department. I had a psychology professor come in because her laptop was due for a replacement, and she was hoping to speed up the process, as she was filling up the 512 GB drive of her special-order Dell XPS, primarily with personal photos and research documentation. The first step was to back up her laptop to a hard drive we had, using some software that did it block by block. I had our student employees complete this process overnight. I woke up to some great news: it kept failing with odd errors that turned up nothing in a Google search. After consulting with my coworkers, we remembered that Office 365 comes with 1 TB of storage via OneDrive. We thought this was perfect: she could store all her valuable documents in OneDrive while we set up her new laptop, then download them back down. She preferred that I do it personally since I was in charge of the psychology department, and as that was our policy, I had to agree. I began the process of uploading her documents to her OneDrive and it took days. Being new to Office 365, I had no idea why it was taking this long, but I shrugged it off as it eventually reported success. I began to download her files onto the new computer and started the RMA of her old one. Problems were immediate: permission issues, overly long file names, disappearing files, you name it. After hours and hours of work, going through shadow copies of our servers, looking at past backups we had, recursively changing permissions of the files, it was exhausting. I was able to recover about 95% of what she had previously, but the 5% I lost was a good chunk of her research. It was a time of reflection when my motto rang in my head non-stop: I had missed one more backup somewhere. From then on, I was no longer super aware. I was hyper-aware and vigilant in storing data. Everything made me skeptical or made me ask questions, and it was a mark in my career of growth through failure. I had a burning desire to learn more about storing data: how to do it robustly and safely, and how to ensure validity and integrity. I set out to fulfill my desire.
</p>
<h3>Where to begin?</h3>
<p>
Finding information on reliability and the internals of data storage is a difficult task when you do not have any reference or expert to guide you towards the correct path. The internet is filled with how-tos on writing read and write queries, but documentation on the internals is tricky to even find, let alone comprehend. I finally started to dip my toes in and quickly learned it's difficult to summarize the paradox that is an SQL database: the simplicity of the query structure itself provides a façade and a false sense of understanding, when really there is much behind the scenes you cannot see. Knowing how to query a database for what I needed gave me the confidence that I knew what to do and how it worked. It wasn't until I started a personal project that was causing the database to suffocate that I realized perhaps there is more to databases than my cute little store of knowledge had previously assumed.
</p>
<h3>Personal Projects</h3>
<p>
Being in the technology field will expose you to various situations that are hard to prepare for in personal studies as well as higher education: outages that are unpredictable due to customer behavior, or simply having no reference for what the threshold of a service is, primarily because you have never reached that point. Scaling up was a term I had heard of but never understood, so natural curiosity decided I needed to seek out what it really meant. It is impossible to scale up unless you have a lot of data to utilize. Finding large data sets that contained false, made-up data was a tall task, so I had an itch to create a data generator to assist in learning how to scale. Yes, there are a few data generator websites. However, they seemed to cap out at a million rows at a time, which was not enough for me and the service to really push things to the limit. In creating this data generator, I made it spit out data in an SQL format, making it easy to slap into MySQL right away. Fortunately, after some heavy work in Python, it was capable of generating eight-figure numbers of rows with columns for names, addresses, cars, ages, and other data. I ran my data generators several more times to add up to 300 million rows, and I decided it was time to load up a MySQL server in a LAMP stack with this data to use in what would essentially be a country-sized simulation. With no visibility into the VM, the OS, or the database, my PHP queries to MySQL locally took ages or crashed the VM altogether. I knew it was the database because even querying via phpMyAdmin was not returning results quickly or timed out, and I couldn't figure out how to better interact with the database. Thinking it lacked power, I kept upping the CPUs and RAM, which only led to crashing the host. I stepped back to think more about scaling: how could I, in this case, scale up if upping power wasn't the solution? Then the concept of how a CPU is designed rang in my head: distributing the job into smaller chunks. A saying from the CTO of the company where I interned came back to me: "Any big problem is just a subset of a bunch of smaller problems. Iterate those small problems, and now you've solved the big problem."
</p>
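<p>
To give a sense of the shape of such a generator, here is a stripped-down sketch of the idea. It is not my original script, and the columns and value pools are made up for illustration; the point is only that fake rows are cheap to mass-produce and can be written out as SQL that loads straight into MySQL:
</p>
<pre class="prettyprint">import random

FIRST_NAMES = ["Ada", "Grace", "Linus", "Radia", "Ken"]
CITIES = ["Atlanta", "Denver", "Austin", "Portland", "Boise"]
CARS = ["Civic", "Model 3", "Outback", "F-150", "Leaf"]

def fake_row(person_id):
    """Build one tuple of made-up person data keyed by an integer ID."""
    return (
        person_id,
        random.choice(FIRST_NAMES),
        "{} Main St, {}".format(random.randint(1, 9999), random.choice(CITIES)),
        random.choice(CARS),
        random.randint(18, 90),
    )

def write_inserts(path, row_count, batch_size=10000):
    """Write multi-row INSERT statements so MySQL can load them in large batches."""
    with open(path, "w") as out:
        out.write(
            "CREATE TABLE IF NOT EXISTS people "
            "(id BIGINT UNSIGNED PRIMARY KEY, name VARCHAR(64), "
            "address VARCHAR(128), car VARCHAR(64), age INT);\n"
        )
        for start in range(1, row_count + 1, batch_size):
            end = min(start + batch_size, row_count + 1)
            values = ",\n".join(
                "({}, '{}', '{}', '{}', {})".format(*fake_row(i))
                for i in range(start, end)
            )
            out.write(
                "INSERT INTO people (id, name, address, car, age) VALUES\n"
                + values + ";\n"
            )

if __name__ == "__main__":
    write_inserts("people.sql", 1000000)  # scale the row count up as far as you dare
</pre>
<p>
A file like this loads with the stock mysql command-line client, and a few hundred million rows of it is enough to make a single test server genuinely suffer, which is where that advice about breaking big problems into smaller ones came back to me.
</p>
<p>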
I got it! Let me split the database into smaller databases, each containing a max of 10 million rows. If I needed something beyond the unique ID, I could query the next database instead of having MySQL scan the entire database. Distributing data through multiple instances of MySQL servers was a weak solution in this case, of course, as PHP now had to maintain 20 MySQL connections. Later I learned this moved the problem instead of solving it, and now I was stuck. By that point I understood that databases are far more complicated than I had initially thought, and that fed my desire to learn more. I did not necessarily feel capable of being a database administrator, but I figured, what is better than to dive in headfirst as a database administrator for a company?
</p>
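<p>
For what it is worth, the routing idea I had landed on was nothing clever. Here is a hedged sketch of it, simplified to Python with a placeholder connection helper rather than my original PHP; it also shows why the application ends up holding one connection per shard:
</p>
<pre class="prettyprint">ROWS_PER_SHARD = 10000000  # 10 million rows per database, as described above
SHARD_COUNT = 20           # one MySQL connection per shard

def shard_for(person_id):
    """Map a person ID to the shard (database) that holds its row."""
    return (person_id - 1) // ROWS_PER_SHARD

def fetch_person(connections, person_id):
    """Route a lookup to the right shard; `connections` holds one handle per shard."""
    conn = connections[shard_for(person_id)]
    cur = conn.cursor()
    cur.execute("SELECT name, address, car, age FROM people WHERE id = %s", (person_id,))
    return cur.fetchone()

# Hypothetical setup: mysql_connect() is a stand-in for whatever DB-API
# driver you use (PyMySQL, mysqlclient, ...), one connection per shard.
# connections = [mysql_connect(db="people_{}".format(i)) for i in range(SHARD_COUNT)]
# fetch_person(connections, 123456789)
</pre>
<p>
Anything that is not a straight lookup by ID, such as a range scan, a join, or an aggregate, now has to touch many shards at once, which is the part that moved the problem instead of solving it.
</p>
<p>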
I am a person who tries not to be afraid to delve into the unknown or face rejection. Imposter syndrome is real, but I know it is something you can grow past, no matter what your mind tells you otherwise. I scoured the internet for DBA jobs and found myself stumbling upon an entry-level DBA posting at a competitor of the company I had interned at. It was perfect, and I applied despite it feeling like a moonshot, as the position was based in a different state.
</p>
<h3>Don't be afraid</h3>
<p>
Unexpectedly, I got a callback. I flew through the phone interview and the manager interview, and eventually hopped on a call for the technical interviews. I was as honest as I could be: I explained my attempts to scale, shared the little experience I had with MySQL, and explained why I wanted to be a DBA.
</p><p>
Simply put, I wanted to be a DBA because databases are fascinating to me. We rely on databases for everything, but hardly anyone delves more in-depth than simple restarts or querying for what they need. I had a difficult time finding resources to help me learn about SQL at a deeper level than how to write basic queries. I was hungry; rather, I was famished to learn. I knew I lacked a lot of knowledge and was honest about it during my technical interviews, but I backed it up with what I was trying to do with MySQL. In particular, I shared my attempts at scaling by distributing the workload, having no idea what the correct term was other than describing distributing "it." I later learned it's called sharding. I jumped up and down after finding out the correct term, as it unlocked a vast amount of new resources via Google searches and technical conversations with people in the industry. During my technical interview, I had a DBA on the call; this was the perfect opportunity to ask what resources I should read, so I jumped in as soon as I could. She recommended reading the <a href="https://www.amazon.com/Database-Reliability-Engineering-Designing-Operating/dp/1491925949">Database Reliability Engineering book by Charity Majors and Laine Campbell</a>. I immediately bought it off Amazon, practically during the interview, and was extremely eager to crack it open the second it arrived. I started reading and taking impeccable notes, absorbing as much as I could.
</p><p>
This was the direction I needed, the direction I wanted to go: pushing my mind, widening my thought process, and making me aware that there is much more to the job than writing code and setting up software, such as service level objectives and agreements, automation, and the need for metrics and the like. I just could not put the book down. It almost felt like I hadn't had a bite to eat in days and was essentially swallowing the book whole. By my second technical interview, I believe my famine showed. I talked about what I was learning and how I was applying it, and it raised the interviewer's eyebrow in a good way. I was flown to their headquarters for further interviewing.
</p>
<h3>Still much to learn</h3>
<p>
It is no secret this job was at SendGrid, and I am very fortunate to have found a job posting that was purposely looking to help the employee grow. I attribute a lot of that to luck and to SendGrid's excellent mentality and awareness of the benefits of hiring and raising a junior employee. The distinctive culture included hunger, the hunger to learn, and I was viciously starving. I could not stop reading documentation, asking questions and writing everything down in a spiral notebook. I am fortunate to have a senior DBA on the team to guide me through the processes of replication and basic troubleshooting of a MySQL server. Later I bought the <a href="https://www.amazon.com/High-Performance-MySQL-Optimization-Replication/dp/1449314287">High Performance MySQL: Optimization, Backups, and Replication</a> book on Amazon, and soon after being hired, I started going through it, diligently taking notes and asking questions along the way. The path to learning about SQL did not stop when I was hired; in fact, it had just started.
</p>
<h3>Conclusion</h3>
<p>
Overall, my natural-born curiosity and love for challenges led me to take an opportunity where no one else dared to venture. I broke through my façade of thinking SQL databases are easy just because I could query something, by trying to force a database to kneel. Finding out why was challenging, but that only led me to viciously seek out a solution and to not be afraid to apply for a DBA job. The key was realizing that I always gravitated toward databases and asked myself the most questions when dealing with one; I wanted to conquer databases. The two books mentioned are a great start to grow your knowledge beyond querying a database and to delve deeper into what it is and how to use it. Another book to look at is <a href="https://www.amazon.com/Joe-Celkos-SQL-Smarties-Programming/dp/0128007613/ref=sr_1_1?keywords=celkos+sql&qid=1576870200&s=books&sr=1-1">Joe Celko's SQL for Smarties: Advanced SQL Programming</a>; it does a good job of delving into how SQL works behind the scenes and makes you realize that your queries can be optimized greatly. While there are many paths to take, the real takeaway is that if you have the hunger to learn, you will succeed no matter what path you take.</p>cwebberhttp://www.blogger.com/profile/16516063226544348092noreply@blogger.com0