December 15, 2019

Day 15 - Moving a Monolithic Rails App to Kubernetes

By: Philip Brocoum (@stedwick)
Edited by: Ryan Hass (@_ryanhass_)

Here at Syncta we have a classic monolithic Ruby on Rails application with roughly 150 models. For the past six months, I have been working on reimagining our deployment on Kubernetes.

I began this project with two principles in mind:

1. Picking the best tool for each job.

2. Each tool should follow the Linux philosophy of doing one thing, and doing it well.

The first thing I asked myself was, "Can we containerize our Rails app?" The most difficult thing was the learning curve related to Docker, but the answer was, “Yes!” Our Gemfile is 200 lines long, and our Docker image weighs in around 1 GB, but I've seen bigger. Our Dockerfile starts with `FROM bitnami/ruby` and includes `nodejs libpq-dev file graphviz busybox vim` for compiling native gem extensions. I chose the Bitnami image because I tried Alpine first and found myself installing too many prerequisites. We push these images to, which is a very user-friendly image repository. For example, they offer copy-paste credentials for installing the imagePullSecret into k8s.

Furthermore, I created a docker-compose.test.yml that runs all ~1000 of our tests completely self-contained; PostgreSQL, Redis, ElasticSearch, and ChromeDriver all run inside their respective containers for our test suite. The biggest change was that we used to simply install Chrome on our virtual machine, and use it locally for our tests. We now have to run our browser in a container using

The next step was to get this Docker Composition running in a CI/CD pipeline, for which we used Bitbucket Pipelines. Running everything within Docker (with its overhead) on one CI instance used up the standard 4 GB in Bitbucket Pipelines. As a result we had to upgrade to the 2x memory build in order to run Chrome. Our test suite now automatically runs in about 15 minutes on a push to Bitbucket. Here's a screenshot of what our pipeline looked like previously in Semaphore, and now in Pipelines.

Now the real fun begins: we have a working, tested, Dockerized version of our monolithic Rails app, and the next step is to deploy it to Kubernetes. I like to keep things as simple as possible. With that in mind, we used Spotinst to spin up a k8s cluster on AWS EKS. Spotinst handles everything, including installation and cluster auto-scaling, and then we spun up a Rancher instance for an easy-to-use GUI and attached it to our k8s cluster using the provided Spotinst credentials. Rancher allows us to one-click-install things like Traefik & Grafana (see screenshot).

Finally, we launched a deployment inside the k8s cluster using the Docker image we created in step one. Wow! Right now, only our secondary staging server is running on k8s, from a git branch parallel to master, and Rancher is currently a single-node install. There's still plenty of QA to do. We are planning a canary rollout next year.

There are some good lessons to learn from this whole journey. First, it's far easier to develop/test/deploy when everything is Dockerized and guaranteed to be identical. However, the downside is that there is a huge learning curve with Docker and k8s that our devs will need to get up to speed with. In order to help mitigate the learning curve I have started a Makefile with many common commands that our developers use. My New Year's wish for SysAdvent is for a smooth rollout to production in 2020. :-)

  @echo Syncta Main Rails Makefile
  @echo ...

  @kubectl run -n "smr-development" \
    --image="${KI}" \

  @export COMPOSE_PROJECT_NAME="smrtest"; \

If any of this sounds interesting to you, and you'd like to join our team, please get in touch with our recruiter, Jimmy! We are a remote-first development team (headquartered in Portland, OR). We are looking for a Senior/Lead Full-Stack Developer with good experience in Ruby on Rails immediately. In 2020, we'll be hiring another Developer and a Project Manager. Syncta is a small startup working on backflow and other water-related technologies, and we were recently acquired by Watts Water Technologies. I'm Philip Brocoum, Head of Development at Syncta, and Happy SysAdvent to all!

December 14, 2019

Day 14 - Building Datacenters in Hell

By: Bryan Horstmann-Allen (@bdha)
Edited By: Wayne Werner

A small tribe of lost souls finds themselves embattled, after long wandering
the plains of Hell. They nurture a unique Praxis, a small thing. It is not the
making of fire, but it is theirs, with a history that crosses spans of mortal
time. It is precious to them.

They find themselves without the resources to thrive. Hell is harsh, and the
so-called walled heavens of the hyperscalers, closed and proprietary in nature,
consume all within reach.

In this state, a group of Powers acquires the tribe.

The Powers That Were tell them: Your will is ours. Your efforts ours. The
Powers demand global storage clouds. The tribe is ill-equipped for an
undertaking so large. Ramp up time will be required. A single datacenter, they
say – let us build a Proof Of Concept, to identify our gaps, to build our
tooling, to measure twice, and cut once.

There will be no practice, no prototype. Everything will be Production, it will
be done Quickly. The tribe will be oncall 24/7 for empty servers. Failure will
not be countenanced. The Powers claim to build airplanes in the air, change
tires while in motion.

The Powers’ promises fall like heavy rain: It will be a joint undertaking, the
tribe is told. We will be as One Team.

Eventually these clouds would house roughly an exabyte of data, streaming from
the Aether at 200Gbps, across upwards of 10,000 devices in 20 datacenters.

For the hyperscalers in their false heavens that’s a Tuesday.

For the small tribe, however, new layers of Hell are created from whole cloth.

No Maps For These Territories

The first phase of the project is targeted for an area of the umbral plains
redolent with the monoliths of dozens of datacenter facilities.

Extreme haste is required. The Powers That Were never explain why. Rocks fall
from the skies when progress seems lacking, crushing arbitrary engineers. Speed
is required over all other considerations.

Mistakes will not be made, because they should not be.

The tribe lacks an actual project plan and much needed automation. A
spreadsheet with a dozen lines, like “order hardware,” “install racks” and
“cable up servers,” is provided. It is lacking an attention to detail that
would perhaps prove useful.

The Powers are not Gods, but old, massive, well-funded and long in reach.

The tribe designates internal groups to handle various aspects of the work.
DCOPS, SYSOPS, NETOPS. They daub their foreheads with mud and gather up their
primitive tools.

If the reader is unfamiliar with datacenters, they are actively hostile to
human life. Souls don’t belong there; they are only for machines. Giant
windowless boxes full of smaller screaming sharp-edged boxes. They are loud,
hot and cold. A constant dry wind blows in your face. The lights harsh,
seemingly unending, but if you do not move often enough the entire facility
will be cast into darkness around you.

The longer a soul spends there, the more they lose of themselves. Higher
cognitive functions shut down, memories lost to the patterns of the blinking

Mistakes are made, and quickly compounded.

Each team airlifted to the site attempts to complete their work all at once,
dependencies between teams are unclear and regularly consume the fingers of the
unwary. The domes of the cloud balloon above them.

Acolytes of the Powers That Were appear, underfoot, refusing to say why. In
truth, they are there to build their own cloud, adjacent but separate from the
tribe’s work. The tribe is given no insight, no access, just responsibility. The
Acolytes fail in their work, and the blame is passed on to the tribe.

The Acolytes argue the tribe’s attempts are untenable, they demand that
velocity increase, demand new directions seemingly at random. Progress stalls
as the context switches tear holes in the thin membrane of the local reality.

Screaming horrors slip through and must be contained.

Racks are built in place by Integrator Daemons: servers built on the raised
floor, racked, cabled, nominally tested. Each system trails five network
cables, two power cables, hungry umbilicals ready for sustenance. The USB keys
installed in the rear slots become too hot to touch after a few hours of
screaming operation.

Dead components, shedding their dried scales, are RMA’d through secret,
unknowable means. It can take weeks or months for parts to be replaced.

Souls become lost behind the stacks of discarded cardboard. Some vanish for
good, seemingly burnt away at the edges in the hot exhaust aisle. Those
remaining run out of food, water, sanity.

Engineers gnaw on packing peanuts, simply to make the emptiness inside less.

Each datacenter has to be recabled three times.

The world outside becomes a lie. There is only the facility.

The Director of NETOPS walks alone into the racks. The bear-sized raptors that
infest the high domes of the datacenters can be heard screaming in the voice of
human infants. Their greasy feathers litter the tiles. The Director is never
seen again.

One of the problems you get when dealing with server vendors, integrators,
VARs, whatever, is that the inventory and testing data they give you is never
in a format you want. It is often full of errors, sometimes very subtle ones.
We had scripts to parse PDFs to pull out system inventory data and burn-in
reports. It makes me tired just to think about.

Long after the work has begun, Projections are finally shared by the Powers.

The Powers had demanded a 25PB cloud for this quarter. 25PB is built. They
actually needed 60PB. This is clearly the tribe’s failing, lacking both an
Oracle and having never been given any data from the Powers despite constantly
beseeching Them. All pending builds have similar problems; budgets approved,
hardware purchased. It will not be enough.

Months after the build is “done,” the lost tribe still finds systems with
missing drives, zpools misconfigured; lopsided, keening pathetically. A cascade
of avoidable work follows.

Lessons are learned; none especially technical in nature.

It comes as no surprise that when you give people no time to prepare, no time
to do research or build processes and automation in advance – what you end
up with is a bunch of exhausted people doing subpar work.

Everyone knew what was needed, but no time or resources were allowed to
build any of it properly.

Eventually a demoralized group emerges from under the cloud they had built with
their own bleeding hands. Exhausted, thinner, fewer than when they started.

The cloud runs out of storage.

Like Boulders Up a Hill

The Powers That Were are relentless. The next clouds would be birthed back to
back. The final cloud and the first expansion of the existing regions would be
done in parallel.

It is learned that those amongst the Powers who actually own the data do not
even wish to move it. The Powers That Were bicker amongst themselves. Budgets
are shredded, fluid hissing down from the skies, burning the earth. Timelines
torn. The unknowable Ancients in Finance lay down ultimatums. The clouds must
be built. Costs must be cut. Now.

The Powers settle, once again direct their gaze to the lost souls.

The tribe gets on with it. They know that before they begin the next builds,
they need two things:

  • A plan
  • Robots

Without a detailed project plan and automation to execute it, they would be
stripped of their nerd hoodies and left to the mercies of Hell, alone.

By now, they have directed the Integrator Daemons to build the racks off-site,
in the Daemons’ own facilities, test them, and only then drag the hundreds of
racks into the maw of the cloud. The Integrators will have spares, the
ability to swap out whole chassis, the cost of errors is pushed back on them,
and not onto DCOPS, who are busy contending with the hawkbears.

Information only flows in one direction. They must do their best with suspect
data rarely and only begrudgingly shared.

The first project plan is built in LiquidPlanner. The spreadsheets are set on

The leads from each of the operating groups sit and gibber at each other for
unknowable hours. The light outside never changes.

At the end, they have a plan with major and minor gates, consisting of some
1500 discrete tasks, all dependencies defined and linked, specific engineers
assigned with worst/best time estimates.

When the Plan needs to change, the timelines shift, realign. The Plan is
flexible. The Plan will be light in the darkness. The Plan will save them.

They cannot be saved. They are already lost.

It can be surprisingly difficult to convince a Project Manager to give up
whatever control they think they possess over the project. When you have
dozens of people working on some incredibly complicated effort, you need to
let the team leads own their work.

The major gates are eldritch concepts:

  • Design
  • Procurement
  • Physical Plant

Portals are built from dead PDU strips, runes engraved with multitools.

The minor gates are interleaved.

  • Preflight
  • Network I
  • Compute I
  • Network II
  • Compute II

SYSOPS can spin up servers once basic Layer 2 networking is in, and NETOPS has
already moved on to configuring the cross-AZ Layer 3 network. Each datacenter
can be built independently until the final stages.

This interleaving means no particular team is blocked from performing useful
work. The constant movement will keep them warm in the unending dusk.

The automation strategy they settle on is two-fold:

  • Validate that what they got is what they bought
  • Script every part of building the cloud itself

The validation piece is termed “preflight.” The Validator and the
Director of SYSOPS hunt the hawkbears infesting the cloud domes. They push back
on the Powers, usually fruitlessly.

The first version of preflight is a hacked up FAI
deployment, running on a single VM on each of the datacenter management
servers, on a dedicated VLAN.

Once the racks hit the datacenter and the NETOPS shamans bring L2 up, the
servers are PXE booted into a custom rolled live image which runs through a
dozen or so shell scripts.

The scripts dump system inventory to a text file on an NFS mount (this being
Hell), and run basic stress testing. The SYSOPS team provides firmware upgrade
and BIOS configuration tools, which are pushed onto each box.

Scripts are written to validate the system inventory. When errors are found
(like cabling being in the wrong ports) JIRA tickets are cast like bones for
DCOPS to breakfix.

Once a full rack passes preflight, NETOPS is asked to flip the TOR switches’
tagged VLAN to production. The servers in the rack are booted and added to a
spreadsheet for the SYSOPS team to provision.

None of this takes very long to implement, which is good because most of it has
to be done during the second cloud build.

There are no resources in the empty plains of Hell for a development lab.

An OPER leaves the tribe.

The Projectionist is cast away by the Powers.

The cloud runs out of storage.

The Operationalization Orb

The Validator is catching the majority of initial errors they’d missed on
the first cloud build, but the process is still manual, still tedious. Building
a new cloud takes four to five months.

Manually managing automation incurs errors.

Being exhausted all the time incurs far more errors.

The Validator knew the next version needed to be less 2002 and more modern.
They hunted and felled a database, an API, UI. They wanted the Daemons and
DCOPS and NETOPS and SYSOPS to be able to look at a page that told them exactly
what was wrong with a server

  • disk 4 is missing
  • missing RAM in slot 2A
  • rabid sandtrout caught in thermal shield
  • NIC0 should be in port 1:2 but is in 1:3

and so on.
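
A minimal sketch of the kind of check that could produce messages like those
above, assuming the expected build and the inventory reported by the live image
are both available as simple records. The `ServerSpec` and `ServerReport`
structures and the `validate` function are hypothetical names for illustration;
they are not taken from the tribe's actual preflight code.

```python
from dataclasses import dataclass


@dataclass
class ServerSpec:
    """What was bought for one server (hypothetical structure)."""
    disks: int                 # number of disks ordered, numbered from 0
    ram_slots: list[str]       # RAM slots that should be populated
    nic_ports: dict[str, str]  # NIC name -> switch port it should be cabled to


@dataclass
class ServerReport:
    """What the live image found on the box (hypothetical structure)."""
    disks_present: set[int]
    ram_slots_populated: set[str]
    nic_ports: dict[str, str]  # NIC name -> switch port it is actually in


def validate(spec: ServerSpec, report: ServerReport) -> list[str]:
    """Return one human-readable problem per mismatch; empty means the box passes."""
    problems = []
    for disk in range(spec.disks):
        if disk not in report.disks_present:
            problems.append(f"disk {disk} is missing")
    for slot in spec.ram_slots:
        if slot not in report.ram_slots_populated:
            problems.append(f"missing RAM in slot {slot}")
    for nic, expected in spec.nic_ports.items():
        actual = report.nic_ports.get(nic)
        if actual is None:
            problems.append(f"{nic} has no link")
        elif actual != expected:
            problems.append(f"{nic} should be in port {expected} but is in {actual}")
    return problems


if __name__ == "__main__":
    spec = ServerSpec(disks=6, ram_slots=["1A", "1B", "2A", "2B"],
                      nic_ports={"NIC0": "1:2"})
    report = ServerReport(disks_present={0, 1, 2, 3, 5},
                          ram_slots_populated={"1A", "1B", "2B"},
                          nic_ports={"NIC0": "1:3"})
    for problem in validate(spec, report):
        print(problem)
```

Keeping the output as plain sentences, one per fault, is what would let the
same list serve Daemons, DCOPS, NETOPS and SYSOPS without translation.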

There will be no more JIRA tickets. No more spreadsheets. There is only the
truth they make themselves.

They use the sharpened bones of a hawkbear to carve a schema into the database.
Janky code is written in Perl’s Catalyst. An API and a (pure HTML tables,
natch) UI are birthed.

The builds progress. The Powers That Were can be heard screaming beyond the
veil, demanding more bits, more bandwidth, more flesh. They spew anger and
rage, and the work is made more difficult.

NETOPS run cables from the clouds of the Powers to the heavens. Dozens are
grafted into the domes. Bits stream in from the Aether upwards of 150Gbps. It
is not enough. It will never be enough.

Scaling problems dig their way up from the depths. Drive firmware under heavy
write loads ceases to service reads. This is deemed excellent behavior for the
databases running on them.

Systems reboot themselves, opaque boxes, lying to the tribe’s shamans. A month
passes. Hourly calls every day with the vendor stonewalling. The systems reboot
daily; the cloud chokes on the bits it must be fed. Finally the Powers ordain
they will stop purchasing globally from this vendor unless a solution is found.

A solution is quickly found.

Beastblades Magnetically Aligned

New strong souls are found in the darkness. Their minds are sharp, tools
unmarred. They rewrite preflight, twice. The second time in Mojolicious.

The tribe uses preflight as a source of truth: What is where, what does it do,
is it working, what is wrong with it.

A tool is created for the Command Line, it helps to bring more advocates to the
work. A new web UI is written.

The more hardware-oriented Validators automate PDU and switch configuration.
Improvements are made to testing the CPU, RAM, and disks. Systems are made to
power themselves off multiple times, to try and shock their components into

The SYSOPS tooling progresses apace. NETOPS gains a new Director and provides
production configs to be burned into the switches during preflight.

Turnkey cloud is near.

A deadline is nearly missed as a vendor fails to schedule air freight to
actually pick up dozens of completed racks. They sit on tarmac for a day.

The tribe is on their fourth Project Manager and second or third VP/Ops.
Everything blurs together. The PM lasts two weeks. The VP a few months.

The Validators devise artifacts that can be placed on top of a freshly squeezed
rack, spinning its servers and reporting back to the API from anywhere on Hell’s
plains. Progress of the rack builds can be viewed live. The servers can be
reached wherever they are, as they are being built, through arcane encrypted

Problems are fixed before the racks ever reach the datacenter. Confidence grows
in what is being delivered. The process is re-run once the racks are installed
in the datacenter. Problems incurred during shipment are quickly repaired. The
number of on-site RMAs plummets to nearly none. Wasted time is regained.

Building a cloud now takes two months. Expansion a few weeks.

The Powers That Were stop the flow of money. They say the cloud is not being
built quickly enough, does not consume quickly enough. They insist the work
continue as they stake the murdered budgets outside the gates of the tribe. The
Powers demand to know why deadlines are missed.

Eventually the money resumes. The tribe has long since learned that their own
actions have little to do with the arbitrary behavior of the Powers.

They are unknowable.

The Insurrection

A lone soul comes in from the darkness. They are admitted, though none of the
women or minority voices among the tribe will be heard. The Powers ignore their
objections, insist again that only Velocity matters. More hands will mean more

This soul immediately argues that the clouds are broken, the tribe is broken.
It will be fine, they know better. They produce nothing but words, but they are
words the Powers want to hear. The tribe is failing purposely, to make the
Powers That Were into the Powers That Will Never Be, out of spite.

Soon the found soul pulls in others, also lost or disaffected from within the
tribe. They argue the Powers are not being appeased quickly enough, not being
fed enough. It is a trick, they wrap their own ambition in devotion.

They cast selfishness as piety. They claim they can build a cloud in a weekend.
They claim they have done it, but show no one. They whisper to anyone who will
listen the tribe is full of morons, the CEO and CTO are fools leading the tribe
to ruin.

These turncoats work in secret; finally discovered: copyright assigned to
themselves and not the tribe, not even the Powers they claim to serve. The
tribe believes these turncoats undone – they surely will be cast away, the
distractions finished.

The turncoats flee to where the Powers reside and a Validator is sent after
them. Irreparable damage is done to the tribe in short order. A duel is
required. The tribe’s Praxis against the turncoats’ vapor. It is all a sham. The
Powers flexing.

Deeply messed up stuff happens, just like really seriously amazingly dumb

The Powers give the turncoats a place to work, hidden from the rest of the
tribe. The tribe is shocked. The last shreds of morale are found hiding under
a bean bag chair, mourned, set alight, the ashes buried.

Never hire anyone without interviewers of a mix of backgrounds talking to
them first. Listen to your people. Heartbreak can be avoided.

Periodically the priest of these turncoats is sent back into the tribe. He
tells them how great the True Work is going. The tribe is not allowed to see
this work; they are assured it is amazing; the existing clouds are already
obsolete; they should just give up already.

The priest is challenged during an incredibly uncomfortable all-hands. The
Powers intervene. The priest assures the tribe he will always protect them, and
then he leaves. This happens several times. The tribe is dubious. The work

The CTO is exiled to an island surrounded by clever, insane kappa. He is given
books and has no one to talk to.

The First Validator walks into the darkness, head bent, tools discarded in the

A year passes. The turncoats unsurprisingly fail miserably. The Powers exile
them. Some of them somehow manage to end up in even worse places, there being
an unending number of just terrible places among the industries of the umbral

The Powers are finally fed at 200Gbps. They are unsatiated.

The CEO follows so many others into the desert.

The CTO crafts a boat from discarded hopes and sails off.

The Final Form Shambles On

The clouds sing, bellies full of encrypted baubles from millions of the unseen.

Much of the original tribe is shattered.

The Powers announce they will take their data to another cloud, a better cloud,
a heavenly cloud, and that the tribe will dismantle what they’ve painfully

This does not occur: the false heavens cannot feed them, cannot sustain them.

The work continues.


This post brought to you by: three years of 10–16 hour days, a no-notice trip
from MEL to ICN to SFO to MEL in ten days, exponential burnout, foxhole
buddies, and taking an entire year off to recover.

Shout out to all the awesome people who did simply amazing work under the worst
conditions. You know who you are.

We all deserved better.

December 13, 2019

Day 13 - A Year of DevOps Days Around the Globe

By: JJ Asghar (@jjasghar)
Edited By: Jon Topper (@jtopper)


Over the last year, I had the privilege of travelling the world, representing IBM, and speaking at DevOpsDays around the globe. I saw different cultures and demographics share in the teachings of digital transformation and DevOps cultural change. I saw many successes and a few failures, and I’d like to share them here. Additionally, after many years of attending DevOpsDays, I became an Organizer at my local event. I hope to make an impact with all I’ve learned and experienced in order to make my local event the best possible.

I’m the type of human that eats his vegetables first, so let me highlight some friction points before the grand successes. I’m going to do my best not to call out any specific DevOpsDays because I know organizers work inexhaustibly to provide the best experience possible to attendees. Of course, the necessary disclaimer applies here: this is simply my perspective.

Don’t let a single voice dominate the conversation

“Cult of Personalities” is a term that has become more mainstream recently and I believe it runs rampant in our DevOpsDays culture. It’s good to see champions of this cultural movement but I started to see an almost “cult” reverence for specific humans.

I saw open spaces with 10-20 people and only that one personality would hold the conversation. I noticed that the ritual of the Open Spaces wasn’t enforced, and the saying “you have two ears and one mouth, listen 2x then speak” went ignored. In other words, we need to practice as we preach, be more inclusive, and encourage more participation from those that shy away from the spotlight. Easier said than done, I know. I’m the guy that had an ignite talk about surviving conferences as an introvert. Shameless self-promotion here:

Delegate responsibilities

I saw the personality problem in the organizer space too. There was one event where every decision seemed to go through one human. This individual became the be-all and end-all for any decision, ranging from tactical issues with AV to signing the check for the catering. This individual was everywhere and didn’t seem to let any “sub-team” be empowered to make any decisions. You could see the exhaustion in this person’s eyes by the end of the day. Being a one-person team is not in the spirit of DevOpsDays. The moral of this observation is, where you can, empower the community and other organizers and lean on them: that’s why they volunteered.

Reiterate the format, and the advantages of Open Spaces

I mentioned the Open Space ritual earlier and I want to revisit it. I am a very strong proponent of the Open Spaces; years ago it’s what brought me into the DevOps world. One constant thing I saw was that if there wasn’t a clear explanation of what Open Spaces are, they would fall into some weird half Open Space, half presentation, half people sitting staring at each other. (Yes, I have three halves there; it was weird, and not always all three.)

Daniel “phrawzty” Maher has an amazing slide deck on Open Spaces. It covers the rules and “law” of Open Spaces and offers some strong suggestions on how to make them successful. Many of us have seen this ritual multiple times, but just like your safety briefing on a plane, it’s important because you never know who hasn’t seen it before and it’s always good to have a refresher.

Something I saw at only one or two events was a volunteer moderator. (There were calls for note-takers too, but only once was this successful). The volunteer moderator was a great tool to make sure lesser voices were heard, but at the same time if they weren’t careful they could cause issues. Moderators are good, but they can cut both ways.

Some of the best run DevOpsDays were not the ones with the longest history, or the newest ones, but the ones that you could tell would iterate and pivot real-time. They trusted the teams or humans who were responsible for their sections, and empowered them to make decisions. They both preached and practiced the DevOps culture, and it paid off.

Provide a dedicated space for Speakers and Introverts

There were a couple of DevOpsDays that had intentionally provided on-call and quiet rooms. There was a saying I heard, “Using a laptop at a conference is like a virus, one person starts working and it gives permission for others to start working and conversations and interactions die out.” If you give people a room to work at, in a limited space, you can contain the spread and also make sure the people that do need to work on an emergency can.

Quiet rooms gave humans some space to decompress and recharge themselves. Having power squids, water and some tables and chairs also allowed humans to get their phones charged too. There was one event that doubled up the quiet room as a mother’s room; it worked great and promoted inclusivity for an underrepresented demographic. It’s a good sign for next year. There was a third type of room for humans needing to respond to on-call situations. In addition there was a small extra room with a live stream of the event. This had a few round tables and water, and was off in the corner, but it was amazing. I actually spent a lot of time in that room, engaging with cohorts while being able to see the main stage and hack on some code. I had a game called Love Letter in my backpack and got quite a few rounds in and met some new gaming friends. Reference to my introverted side from above.

As a speaker at all of these events I spent a lot of time in the “green rooms”. I had an interesting observation about green rooms. First, they seem to be a US-centric thing: the EU events didn’t give us speakers a space. As a speaker, I have a routine before I speak and I’d have to find a spot to do it. I strongly recommend giving your speakers a room; you never know if a speaker needs that space.

One thing I saw in the green room that I want to see at all the DevOpsDays was the “block of help.” I’ve seen the random volunteer sitting in the room approach, but that never seemed to work. The few I saw always seemed bored, stuck there without knowing what to do, and that’s where this “block of help” comes into play. Every organizer and most volunteers have walkie talkies. At this event, in the green room, there was a walkie with big letters that said “HELP”. We were told if we needed anything, all we had to do was press the talk button and we had direct access to the team. Such a simple idea and it worked so well.

Acknowledge people have different ways of interacting

The final win I want to mention is board games. Like quiet rooms, board game rooms are becoming more and more mainstream. I went to a couple of DevOpsDays that had a boardgame portion to the evening event. Most were at bars, but some found a bar or event location that had a quiet place to have a beer and play a collection of games. This worked amazingly, it allowed for people not to feel pressured to drink, and also gave people who normally don’t frequent drinking establishments a place to enjoy socializing.


Thanks for taking the time to read this post about what I’ve learned visiting DevOpsDays around the world. I’m taking all of these lessons learned to my local DevOpsDays, and already had some amazing feedback on some of these influenced ideas. Hopefully, this has highlighted some easy things our community and organizers can iterate on. For instance, start involving underrepresented attendees in open spaces, allowing for more diverse voices.

I really hope that organizers engage more with their organizing teams and allow the group to take the brunt of the stress, instead of only a couple of people carrying most of the water. Finally, I really hope that organizers realize that speakers are not only attendees but do need some space to focus their energies, in order to put their best foot forward. Having this space, like the on-call room, giving people the space they need, can allow everyone to feel empowered.

I really see DevOpsDays only growing more and more. It’s a cultural movement that encourages people to bring their best, and collaborate.

December 12, 2019

Day 12 - Observability

By: Ramez Hanna (@informatiq)
Edited By: Kirstin Slevin (@andersonkirstin)

TL;DR Observability is about people and practices. You don’t need a dedicated team, you need people who care.

Bonus points This applies to many other things, not just observability.


I do not take full credit for all that I am going to share.

This is the result of my learning from people, books and experience.

This is my view on the subject, hence you can disagree with me.

What is Observability?

According to Wikipedia, it is

“ A measure of how well internal states of a system can be inferred from knowledge of its external outputs. ”

When I read that it was so clear and yet so mysterious.

Trying to make sense of that definition in my context I came up with this simplification.

The act of exposing state, and being able to answer 3 questions:
-> what is the status of my system?
-> what is not working?
-> why is it not working?

Let’s inspect that definition closely, starting with “The act of exposing state”. This is the intent, the conscious action.

It’s about instrumenting the code to expose state and data about itself that will help in understanding it.

The goal is not to expose what we know we want to monitor (known unknowns); rather, the goal is to expose more data and add as much context as possible, which will enable the discovery of new failure modes (unknown unknowns).

This will enable us to answer the three questions.
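As a rough illustration of what "instrumenting the code to expose state" can look like, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `observe` decorator, the in-process `COUNTERS` and `TIMINGS` registries, and the `checkout` function are hypothetical names, and a real service would hand these values to whatever metrics library it already uses. The point is only that the code itself records outcomes, durations, and context labels.

```python
import time
from collections import defaultdict

# Hypothetical in-process registries, used here only for illustration.
COUNTERS = defaultdict(int)
TIMINGS = defaultdict(list)


def observe(name, **context):
    """Decorator that records outcome counts and durations, tagged with context."""
    def wrap(func):
        def inner(*args, **kwargs):
            # Sorted label tuple so the same context always maps to the same key.
            labels = tuple(sorted(context.items()))
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                COUNTERS[(name, "ok", labels)] += 1
                return result
            except Exception:
                COUNTERS[(name, "error", labels)] += 1
                raise
            finally:
                # Duration is recorded for successes and failures alike.
                TIMINGS[(name, labels)].append(time.perf_counter() - start)
        return inner
    return wrap


@observe("checkout", service="shop", region="eu")
def checkout(cart_total):
    """Hypothetical business function that is being instrumented."""
    if cart_total < 0:
        raise ValueError("negative total")
    return cart_total
```

Because the labels travel with every measurement, the same data can later be sliced by service or region, which is the kind of added context that helps when the failure mode was not anticipated.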

The goal of observability is to get as close as possible to knowing the cause of the issues that impact the performance of systems, hence improving the response time and the MTTR (Mean Time To Recovery).
To make it more concrete, let’s look at this example:

The Universe company is using Graphite and Grafana for their metrics, and ELK stack for their logs.
Team Earth instrumented their code to expose the necessary metrics. They thought carefully about what metrics are important to their service, how to collect these metrics, and they carefully crafted their logs to have enough context.
They also put in place probes that will query their service and report status as perceived by clients instead of relying only on metrics exposed by the service.
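A probe of the kind Team Earth uses could be sketched roughly as follows. This is not their actual implementation: the `probe` and `http_fetch` names, the 200-status check, and the one second latency budget are all assumptions chosen for illustration. The idea is simply to query the service from the outside, the way a client would, and report what was observed.

```python
import time
import urllib.request


def http_fetch(url, timeout=5.0):
    """Fetch a URL and return its HTTP status code (raises on 4xx/5xx or network errors)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def probe(url, fetch=http_fetch, max_latency=1.0):
    """Query the service like a client would and report what was observed."""
    start = time.monotonic()
    try:
        status = fetch(url)
        error = None
    except Exception as exc:  # any failure here is what a client would experience
        status = None
        error = str(exc)
    latency = time.monotonic() - start
    # Assumed definition of "healthy": a 200 response within the latency budget.
    healthy = status == 200 and latency <= max_latency
    return {"url": url, "status": status, "latency_s": latency,
            "healthy": healthy, "error": error}
```

Passing `fetch` as a parameter keeps the probe easy to exercise without a network, for example `probe("http://example.invalid/health", fetch=lambda url: 200)`.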

On the other hand, team Mars only had the metrics exposed by the framework they use.
Their logs were verbose, unstructured text and they relied on the basic health checks, which are basically a ping check to their homepage.
Both teams use the same tools in an effort to observe their systems but the result is not the same.
During an incident, Team Earth will be able to see how their service’s performance is perceived by clients, and be able to follow the metrics/signals through the different components until they identify a certain metric that is not within thresholds.
They would then look at logs where they would be able to see more details about the anomaly and work to fix it.
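To show why carefully crafted logs make that last step easier, here is a small sketch of structured log lines that carry their context as fields. The `log_event` and `find` helpers, the field names, and the sample records are all hypothetical; in practice the searching would be done in the log platform (ELK in this example) rather than in application code.

```python
import json


def log_event(message, **context):
    """Return one log record as a JSON line carrying its context fields."""
    return json.dumps({"message": message, **context}, sort_keys=True)


def find(lines, **wanted):
    """Select parsed records whose fields match every wanted key/value."""
    records = (json.loads(line) for line in lines)
    return [r for r in records if all(r.get(k) == v for k, v in wanted.items())]


# Hypothetical sample records, invented for illustration only.
lines = [
    log_event("payment accepted", service="billing", request_id="r-1", status=200),
    log_event("payment failed", service="billing", request_id="r-2", status=502,
              upstream="card-gateway"),
    log_event("page rendered", service="web", request_id="r-2", status=200),
]
```

With records like these, `find(lines, request_id="r-2")` follows one request across services, and `find(lines, status=502)` isolates the failure along with the context needed to understand it.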

Team Mars can look at their metrics, but they won’t necessarily find a metric that is out of the norm, so they will go over to the logs and sift through all those blobs of text, scrambling to make sense out of them.

They end up finding a fix, but the effort and frustration leave them demotivated.

This shows that observability is about what people do with the tools.

Who is it for, who will be implementing it really?

Observability is best implemented by the engineers that wrote the code, since they know their systems the best.

I cannot implement observability for all the engineers, but I can enable them to observe their services, showing them how to best observe, monitor, and understand their systems.

My users are the heart of observability; without their involvement and their cooperation I will not succeed at my mission.

Observability is about people.

It comes down to engineers following best practices, understanding what needs to be observed, how it should be observed and how to use that knowledge to improve the reliability of their services.

Observability is about people and practices.

How to implement Observability?

Before implementing observability, I must ask “WHY?”
Why would I want to implement observability?
Well to make our company better at what it does, right? That’s why I was hired in the first place.
Observability should help my company be better at reacting to outages or any issue for that matter.
Engineering will be better because of observability, if correctly implemented.
Keeping that in mind helps set the stage for the work involved in the implementation.
So my mission is to enable the engineering teams through the following:

  • Talk to, advocate to, and train engineers on the principles
  • Provide support when they start applying this knowledge
  • Select the tools that are best suited for my company, whether self-hosted or SaaS
    • Understand the tools’ strengths and limitations and explain those to users

Observability in real life

At Criteo we have 600+ engineers and an Observability team of 5 engineers.
With that ratio, there is no way the Observability team can take the responsibility to implement everything.
The Observability team provides the necessary foundation to enable the teams to observe. This includes:

  • Develop and deploy tools to allow for exposing and visualizing state
  • Integrate the tools with the internal ecosystem
  • Provide support for using the tools
  • Write documentation
  • Drive the adoption of the best practices, by working closely with the different teams

The team deploys different tools and develops the glue to integrate them to have a coherent ecosystem. For example, this might look like:

  • BigGraphite as the long term storage for metrics
    • This is the main Metrics database, where we store metrics. It is also used as long term storage for Prometheus.
  • Prometheus for metrics collection, aggregation and alerting
  • Alertmanager to route alerts
  • Various other tools for tying it all together with sane defaults

Keeping our focus on user enablement, we always try to find ways to improve the experience of our users.
One successful Observability team initiative was to dedicate one member of the team for 3 days every sprint to work alongside another engineering team, to observe how they interact with the observability tools, how they define their service level objectives, and to understand their alerting needs. Through this process, the member of the Observability team was able to spot areas that needed improvement and show how to fix them.
It was mutually beneficial, as the Observability team learned more about users and their needs, and the users improved their ability to observe their systems.

Final word on tools

Vendors will try to sell me observability, but these are tools. Some are good and some are bad, and some are average, but no one can sell me observability.
Observability is more about people and practices - no matter what tools you use, if you don’t know what you’re doing it won’t work.
People are creative and they will find ingenious ways of using the tools to fit their thinking instead of adapting their thinking to the tools.
So tools are crucial but they are not where the focus should be. Ultimately I should be careful to choose the tools that make it easier for my users to exercise the best practices and the principles of observability.

December 11, 2019

Day 11 - When to Take a Chance on Someone

By: Annie Hedgpeth (@anniehedgie)
Edited By: Tyler Auerbeck (@tylerauerbeck)

The second week of Advent traditionally represents HOPE. This post is dedicated to those whose hope is likely in your hands - hope for a better life, hope to better support their families, hope to get out from under the crippling weight of debt, hope of having greater confidence to do things they know they can do if only given the chance.

When I first began my career move into tech, I knew it would be hard and that I’d need a few pieces of good luck along the route. I’ve had the luck I needed and have made it to the other side, so to speak. However, what has surprised me the most since the move is how infrequently what I did actually works. This is totally just my observation, but I’ve seen people successfully make the move maybe 1/4 to a 1/3 of the time. The biggest hurdle these brave souls need to overcome is in large part out of their hands; it’s convincing people to take a chance on them. This is way harder than you’d expect, especially when you have little to nothing in common with your interviewer.

This blog post isn’t for those searching for their first big break into tech, though. It’s for those who hold the power in their hands. Most of you, I would wager, want to be able to give someone their big break. The problem is that you don’t trust yourself enough to decide who would be a good candidate on whom to take that risk. Your heart is in the right place, but you don’t quite feel qualified enough to make a decision that puts your reputation and the company’s money on the line for someone who is a bit of a mystery to you. And this is why we largely hire people that look like us. We are obviously more familiar with people that are just like us. Therefore, if someone comes in who looks, sounds, acts, and has the same background as us, then we know what we’re getting. This, of course, is where we fall prey to having implicit bias and later hate ourselves for it.

Then how do you break the cycle? How do you bring a bit of humanity into the resume vetting process and into the interview room? How do you learn to understand humans more wholly? Regardless of what is or isn’t on a person’s resume, we need to see their humanity. Resumes are not great at disclosing that information, so we need to figure out how they confront challenges so that we can begin to see them in the context of their whole lives.

The following are not interview questions. They’re simply questions to train you to open your eyes to ways in which people’s different backgrounds could very well be an asset to your team. Think about these things as you meet people and begin to see how a person’s background can help and not hurt. Listen for clues of strength, character, hard work, and problem-solving in the stories that people tell you in your day to day life. You will likely relate to the humanness of these stories, but you are likely not used to equating those things to business outcomes.

What do someone’s gap years say about them?

Most people think of gap years as something reserved for women taking off to raise kids, but let me tell you about a different scenario.

About ten years ago a friend of the family, Chris, was an assistant pastor at a medium-sized church in Texas (probably considered a large church anywhere else). He was wanting to get back into the corporate world as church leadership was taking its toll on him. He had been pastoring for about 7 years and was not able to get any interviews due to this “gap” from the corporate world on his resume.

My husband Michael, a software engineering director at NCR, however, knew this guy, knew what pastoring a church really meant (especially assistant pastoring, as you tend to get all the administrative duties) and what skills were needed for such a job. Many people who read his resume may have thought that Chris was on a sort of sabbatical or break and was taking it easy. His life was anything but easy. He was keeping a church afloat amidst a myriad of challenges. Not only was he being a successful leader, but he was also able to write several novels during that time (one being a very touching memoir about his father after his death). Chris’s years as a pastor enriched his life, grew him as a leader and mediator, and gave him skills he would have a hard time growing in the corporate world. He took a huge risk leaving the corporate world for church leadership. He took a harder path in search of the bigger-picture outcome that he was after. That says a lot about his character. He’s willing to do the more difficult thing for the more meaningful outcome.

While some may have read these 7 years as a sort of sabbatical, my husband Michael knew differently. He knew that his company at the time was in need of strong product managers and that Chris was looking for a break. He took a chance on him first as a technical writer, and soon after Chris flew through the ranks into product management on an important new product within their portfolio. He is leading a successful career today in Georgia in product management thanks in large part to the skills he built as an assistant pastor.

What does one’s education say about them?

Personally, I have gone back and forth with whether or not to lament the fact that I got a film degree. If I had to do it all over again, would I? I don’t know, but let me tell you what it says about me in hopes that it sparks your curiosity about what other people’s non-technical degrees say about them.

First of all, neither of my parents went to college. My dad has an 8th grade education, and my mom graduated high school with my sister in her womb. They did not have the ability to coach me into thinking ahead or thinking about what would make the most sense. My mom’s only advice was to follow my heart and my passion (I have since read Cal Newport’s book So Good They Can’t Ignore You and agree that this is misguided advice). Growing up as a hardworking creative type who loved writing and watched a lot of TV, I wrote scripts in my head all day long. Naturally, I wanted to be a TV or filmmaker. I knew that it was risky, but I just thought that I’d figure it out as I went. Do I regret it? Meh, I don’t know. Would I have done it differently if I had to do it all over again? Maybe, maybe not. I love all the knowledge I have stored up because I have an Art/Film degree with a minor in Theatre. I wouldn’t want to give all that up.

Today I can walk through a museum or go to the theatre and appreciate it and connect with it in a way that I don’t know would be possible without those years of study. Creativity is not a mathematical equation but a human pursuit. And because of that human pursuit, I’m a really well-rounded person who connects with people and sees deeper meaning and connections in things, in large part because I hold artists like Caravaggio, Rembrandt, Jackson Pollock, Tom Stoppard, Baz Luhrmann, Ant Farm, and so many more in my heart and mind all the time! When I travel to see my client in Chicago, we have jubilant well-attended happy hours in honor of my travels because I genuinely care about people and humanity, and that draws people into connection with me that rarely exists in tech. I’m not just saying that to blow smoke, but it’s truly just what people have observed. I like that I’m able to bring something special like that to the table. It’s fun for all involved. And don’t you want someone that thinks outside of the tech box and brings something unique?

What does one’s reaction to change say about them?

When I was a kid and my dad drank away all of our rent and we had to skip town in the middle of the night, I experienced big changes frequently. We moved 13 times between the time of my birth and when I turned 11. Change was a part of my normal life, and there were several formative events that shaped me because of the things I determined in those moments. For example, one time in the mid-nineties we were driving down the highway in my dad’s 1972 Buick Skylark. It was primer gray with one gold fender on the driver’s side. He saw his friend pull up beside us, so naturally he decided to race his buddy down the highway. My sisters and I, teenagers at the time, were in the backseat with no seatbelts because it was an old car and they didn’t work. Our clover-leaf exit was approaching, so my dad hit the brakes and we went spinning onto the elevated clover-leaf. I honestly thought we were about to die spinning off of the bridge. When we finally came to a stop and I attempted to scold my dad for his irresponsible and unreasonable behavior, he laughed it off and told me I was over-exaggerating. In that moment, I determined to put my future kids’ well-being and sense of safety above my own - always. I dealt daily with challenges that were largely beyond my control. I hated not having control, so I developed a strong enough sense of myself by understanding what I could change. I resolved to have the better life that I desired and deserved. The challenges that I resented so much really helped me develop resilience, strength, and determination.

In a life where someone else’s rash decisions affected my day to day, I was led to constantly overlook my present circumstances in search of better ones. Over and over again, I was determined to make better choices for my life. I may not have known what the best choice was, but I would experiment and iterate until I found it.

When I grew up and had kids of my own, staying home with my kids was one of those experiments that lasted 10 years (speaking of GAP years!). I kept challenging myself to be different and be better. I taught myself new skills, like carpentry and how to fix small electrical problems around the house. I learned to bake bread from wheat I ground myself, make yogurt, buy produce from a local organic co-op, and was constantly iterating on parenting techniques. I learned all I could about healthy living for pregnancy, nursing, and raising kids. Constant improvement and growth was my path.

And today, in everything I do, technology included, my reaction to change is to learn more about my own potential and pursue it with abandon. I don’t seek to take the easy route but the one that will get me to the end that I want. Life is hard, but I can do hard things! I’ve proven it to myself over and over again. If a person came into my interview room with an attitude toward life like this, then I know that they will be successful at anything they do. Anything.

What are ways in which people deal with fear and struggle in their lives?

A byproduct of my upbringing was an overwhelming sense of shyness. I feared people to an unhealthy degree. Some of this fear was warranted as I was around some pretty questionable people growing up. Other times, it was simply my lifestyle that invited criticism. I remember one time in the sixth grade in the passing period a popular boy in my class passed by and said, “Hey, nice jeans.” I felt pretty good because they were my favorite Lee jeans that my Granny had bought for me the year prior for my birthday for $35! Then to my horror after a long pause he threw in, “High waters,” with a sly chuckle. It turns out I had held onto those jeans for a bit too long, without money to replace them, which led to being called out in front of everyone. The constant embarrassment over my situation at the time led me to being overly shy, a trait that actually wasn’t natural to me, but I wouldn’t realize that until much later.

During this time, however, I was developing this inner strength that knew that I had things to offer the world that needed to be shared. I couldn’t keep them inside any longer; the pressure was growing and growing to share not only my talents but my feelings. My two biggest outlets for creative expression when I was a teenager were singing and poetry. Instead of being paralyzed in my fear and insecurity, I felt the fear, acknowledged it, and moved forward with my expression anyway. It was the only way to be true to myself.

I moved past fear and insecurity and sang solos for my school choir and for my church, and I developed a strong singing voice that grew in passion and character as I shared it with the world. The shaky little voice of my youth turned into a strong soloist able to sing in front of thousands of people over the next twenty years. I was even able to use that voice and my growing extrovertedness as I grew my community by singing without fear or insecurity with my Chef friends at ChefConf over the years. Every show of confidence counts when you’re networking your heart out!

Similarly, a wonderful outlet for self-expression in my tween and teen years was poetry. I had some lovely surrogate grandparents in my life, one of whom was a professor of literature at Incarnate Word University in San Antonio, who took me to poetry readings at the university campus when I was developing my voice. I became inspired to write all the time. It was great therapy for me. I developed quite a strong voice for such a young girl, and was encouraged by the professor to keep at it because there was raw talent there waiting to be sculpted. I let the pain and hurt of the years of growing up with an alcoholic father spill through the lines of poetry until they turned into healing and resolution. Even as a young girl, something inside me felt that other people needed to hear these words, that maybe it would help them in some way. The poetry turned into journalling, and the journalling turned into blogging - raw, honest, and full of conviction - whether I was writing about lifestyle, parenting, budgeting, or later technology. I’m always honest and vulnerable with my readers because that’s my expression of humanity. I can’t help it.

And why should you care about that as an interviewer? You should care because it says that I’m honest and I care about helping other people. Yes, I’m a consultant, but it is not in my DNA to bullshit you. I just can’t. This is something that just won’t be conveyed in a resume alone.

That same writing voice today works through complex technical problems and shares them with compassion for the reader, empathizing with their struggle because sometimes tech is hard. I take care to explain things in such a way that my readers both understand and are able to use the information. And it’s beneficial to me as I cement my own learning along the way, just as my writing before benefitted me, too. Blogging about my learning and discoveries is the only way to be true to myself in the face of fear - fear of failure, fear of being an imposter, fear of looking dumb, fear of losing it all. I feel the fear and move forward anyway.

This is exactly why my life experiences make me well suited for devops. I envision myself to be a better person and do what it takes to be that. Shortly after I started my journey into technology, I was on a podcast with no business being there because I had a sense of urgency and knew I was better than I thought I was, so I acknowledged the fear and stepped out anyway. It started when I was sitting in the back of my dad’s spun-out car knowing I could do better with my life.


No one is single-faceted. We are complex beings with so many opportunities for growth every day. How do you find those people who take those growth opportunities and run with them? You can teach and learn the technology, but it’s much harder to teach and learn how to grow and adapt to change and be a good human.

When you’re facing that internal struggle of knowing when to take a chance on someone, seeing them as a whole person and assessing how they deal with life is key. If it’s someone who adapts to change and grows and loves people, then your risk goes way down. If it’s someone who thinks and has a wide array of interests that they pursue with excellence, then the chances of them pursuing their job with excellence are probably pretty high, too. If it’s someone who looks like they have their shit together because they won’t be caught unaware, then that’s a good sign that they’re responsible and can probably be trusted to deliver.

My challenge to you is to grow this skill of detecting the humanity in people. Start by taking small risks until you trust yourself as you grow this skill. Your team needs it. Your company needs it. And these interviewees need it. They need you to be a better, more responsible gatekeeper into good jobs. You have the power to change their lives. You have a societal mandate to be better at this. You have power; use it wisely.

Thanks for reading! I’m honored to be a part of SysAdvent. You can find me on Twitter @anniehedgie, and find much more about my journey on my blog. Happy Holidays!

December 10, 2019

Day 10 - It’s OK if you’re not running Kubernetes

By: Mattias Geniar (@mattiasgeniar)

I love technology. We’re in an industry that is fast-paced, ever improving and loves to be cutting-edge and bold. It’s this very drive that gives us exciting new tech like HTTP/3, Kubernetes, Golang & so many other interesting projects.

But I also love stability, predictability and reliability. And that’s why I’m here to say that it’s OK if you’re not running the very latest flavor-du-jour insert-new-project-here.

The media tell us only half the truth

If you would only read the media headlines or news outlets, you would believe everyone is running their applications on top of an auto-scaling, load balanced, geo-distributed Kubernetes cluster backed by only a handful of developers that have set the whole thing up overnight. It was an instant success!

Well no. That’s not how that works.

The reality is, most Linux or open source applications today still run on a traditional Debian, Ubuntu or CentOS server. As a VM or as a physical server.

I’ve managed thousands of servers over my lifetime and have watched technology come and go. Today, Kubernetes is very hot. A few years ago it was Openstack. Go back some more and you’ll find KVM & Xen, paravirtualization & plenty more.

I’m not saying these technologies will vanish - far from it. There’s merit in each project or tool, they all solve particular problems. If your organisation can benefit from something that can be fixed that way, great!

There’s still much to improve on the old & boring side of technology

My background is mostly in PHP. We started out using CGI & FastCGI to run our PHP applications and have since moved from mod_php to php-fpm. For many sysadmins, that’s where it ended.

But there’s so much room for improvements here. The same applies to Python, Node or Ruby. We can further optimize our old and boring setups (you know, the ones being used by 90% of the web) and make it even safer, more performant and more robust.

Were you able to check every config and parameter? What does that obscure setting do, exactly? What happens if you start sending malicious traffic to your box? Can you improve the performance of the OS scheduler? Are you monitoring everything you should be?

That Linux server that runs your applications isn’t finished. It requires maintenance, monitoring, upgrades, patches, interventions, back-ups, security fixes, troubleshooting, …

Please don’t let the media make you think you should be running Kubernetes just because it’s hot. You have servers running that you know best and that still have room for improvement. They can be faster. They can be safer.

Get satisfaction in knowing that you’re making a difference for the business & its developers because your servers are running as best they can.

What you do matters, even if it looks like the industry has all gone and left for The Next Big Thing (tm).

But don’t sit still

Don’t take this as an excuse to stop looking for new projects or tools. Have you taken the time yet to look at Kubernetes? Do you think your business would benefit from such a system? Can everyone understand how it works? Its pitfalls?

Ask yourself the hard questions first. There’s a reason organisations adopt new technology. It’s because it solves a problem. You might have the same problems!

Every day new projects & tools come out. I know because I write a weekly newsletter about it. Make sure you stay up-to-date. Follow the news. If something looks interesting, try it out!

But don’t be afraid to stick to the old and boring server setups if that’s what your business requires.

December 9, 2019

Day 9 - In Defense Of The Modern Day JVM (Java Virtual Machine)

By: Gene Kim (@realgenekim)
Edited By: Joshua Smith (@jcsmith)

In this post, I'm going to tell you something very surprising that happened to me earlier this year, which has led me to make an impassioned (and perhaps surprising) defense of something that I feel has been unfairly maligned, and share with you some things you may find surprising.

In September, I had the privilege of attending the Sensu Summit here in my hometown of Portland, Oregon. On the evening before the first day of talks, I ran into so many friends I've met over the last 10+ years. Not surprisingly, it was great to catch up with everyone and hear all about their great new adventures.

When people asked what I’ve been up to, I remember enthusiastically telling everyone how much fun I've been having programming in Clojure, which I later wrote extensively about in a blog post called "Love Letter to Clojure."

I told people how initially difficult I found functional programming and Clojure to be, given that it was a LISP ("doesn't even look like code!") and didn't allow mutation of state ("you can't change variables!"), but also about the sense of incredible accomplishment I felt being able to quickly, easily, and safely solve problems, in a manner completely unlike the first 35 years of my programming career ("even after 3 years, my code base hasn't collapsed in on itself like a house of cards!").

I remember laughing, jubilantly telling everyone this, and then looking around, and then freezing in surprise... Something had changed... People weren't smiling at me anymore. In fact, people were looking at me as if I had said something incredibly impolite, uncouth, or maybe even immoral.

"Uh, what did I say?" I remember asking everyone in the group. No one said anything, instead just looking back at me with a forced smile. "No, really, what did I say?" I insisted. Still just polite smiles.

In my head, I furiously went through everything I had said, trying to figure out what I might have said that was offensive. I had mentioned Clojure, functional programming, LISP, that I loved that Clojure ran on the JVM, the programs I had written, what I learned, and...

"Wait, is it because I said I loved that Clojure runs on the JVM?" I asked. Several people around me finally laughed, apparently with complete disbelief that a fellow enlightened DevOpser could say such a thing. When I asked what was so surprising, it was like the floodgates had opened.

They told me all about the horrific war stories from their lives in Ops, being thrown incomprehensible and completely opaque Java JAR files, which then invariably detonated in production, resulting in endless firefighting, at night, on weekends, during birthday parties...

"Holy cow," I remember saying, shaking my head in disbelief. "I totally forgot about all of that..."

Basically, when I said the word “JVM,” they heard, “Here's my JAR file. Good luck, chumps. Kbye.”

Why I Love And Appreciate The JVM! And It’s Not Just Me!

Until that moment, if you asked me what I thought about the JVM, I would have told you something like this:

"The JVM is amazing! Clojure runs on the JVM, and takes advantage of the billions of dollars of R&D spent over twenty years that has made it one of the most battle-tested and performant compute platforms around.

“And it can use any of the Java components in the Maven ecosystem — Maven -> Java as NPM -> NodeJS, Gems -> Ruby, Pip -> Python, etc... And there’s so much innovation happening right now, thanks to Red Hat's Quarkus, Oracle's GraalVM, Amazon AWS, Azul, and so much more.

“The JVM has enabled me to be so productive! There’s never been a better time to be using the JVM than now!”

But after that astonishing evening at Sensu Summit, for weeks, I kept thinking, “Am I having so much fun programming in Clojure, being a Dev, that I’ve completely forgotten what it’s like to do Ops? Is it possible that Dev and Ops really do have two very different views of the JVM?”

As an experiment, I put out the following tweet (and this one, too):

I recently observed something interesting/unexpected. I’m performing an experiment & will report out results.

Please reply to this tweet w/following info:

1. Which do you identify as? Dev or Ops

2. Then type any words / emotions that come to mind when I say ‘JVM’ or ‘Java Virtual Machine’.

Begin. Thx! 🙏❤️

Amazingly, I got over 300 replies, which included some of these gems that evoked bad memories from the past:

  • Very-annoying memory-hog
  • "Write once, run anywhere" and "It's running slow... Let's just give it more memory."
  • Pain, anguish, suffering, and screams of "Whhhhyyyyyyy?!?!?"
  • Possibly fast, definitely difficult-to-troubleshoot opaque process that is not working and no one knows why! But it's the most important thing in our stack!
  • Oh no, this thing runs in what now? It's horrendously slow, and will crash at inopportune times
  • Bane of my early Ops existence.
  • Pain
  • Oh FFS, another 3am callout!!
  • Out of memory again

And yet, I got a couple comments like these:

  • Amazingly cool under appreciated tech under the hood. See Clojure, JRuby.
  • My life 10 years ago. Can be really really fast and stable if you really really understand how to drive it.

My sample skewed decisively “Ops.” Last week, I asked some friends in the Java Dev community to repeat the tweet, and again, we quickly got over 200 responses. Here is a sample from the replies we got. My thanks to Stu Halloway of Clojure fame, Dr. Mik Kersten from Tasktop, and Josh Long from Pivotal.

Look how differently they talk about the JVM!

  • Fast, reliable, ubiquitous, large ecosystem, easy packaging
  • Brilliant piece of engineering
  • Love the JVM, bored with the core language, great library ecosystem, solid, reliable, familiar, Clojure runs great on it.
  • Robust, ubiquitous, ponderous, vintage, solipsistic
  • Solid, battle-tested with great ecosystem
  • Impressive, stable, rich, complex, ubiquitous, pervasive, Graal, native-awkward, powerful, elegant, clever, surprisingly long lived, under threat (licensing), tool supported, marketed.
  • It's not just for Java; Reliable; It's grown with me over the years; Runs everywhere including my toaster; Under-valued
  • Safe, known, capable, low risk
  • Backend, concurrency, stability, performance, maturity, excellent design
  • Runs everywhere. Java vs Kotlin vs Scala. Spring Boot.
  • useful, but not trendy with the webdevs
  • Polyglot, JIT, fast, happy

I plan on creating a word cloud of all the amazing replies, and grouping them by sentiment — which will be written in Clojure and run on the JVM, of course. But due to deadlines, I’m not sure I can get it done in time for it to be included here. Stay tuned!

My Top Seven Things You Should Know About The Modern JVM

To the SysAdvent community: I want to share the things that excite me most about the JVM, some of which may surprise you. My hope is that you’ll see the JVM as a vibrant and viable way to run kickass applications in production, that you’ll see some incredibly valuable characteristics of it that benefit all of us, and that we can move dramatically more of the JVM responsibilities to Dev (e.g., “you build it, you configure it, you run it”).

  1. The JVM runs more than just Java: some of the well-known languages include Groovy, Kotlin, and Clojure. There are also implementations of other languages, such as JRuby (I had fun reading the slides from this 2019 presentation from the JRuby core team) and Jython, which were created to take advantage of the amazing run-time performance and multi-threaded capabilities of the JVM. You can find a more extensive list here.
  2. The JVM runs some of the most compute- and data-intensive business processes on the planet, including at the technology giants (aka, the FAANGs, or Facebook, Amazon, Apple, Netflix, Google, and really, Microsoft should be in there, too — although I suspect it’s unlikely you’ll see too much of the JVM at Microsoft).

    You can see some of the fun stats and names in posts like this one.

    Many of the most famous data platforms also run on the JVM, either entirely or in significant components: Hadoop, Spark, Storm, Kafka, etc.
  3. The JVM runs some of the biggest web properties on the planet, like eBay, Google, Amazon, Alibaba, Twitter, Netflix, and more. The JVM has been used at scale for decades, and this lived experience has borne a rich, mature ecosystem of options for application developers. Frameworks like Spring (and increasingly Spring Boot) power Netflix, eBay, all of Alibaba’s various online properties, and more. Developers do not need to choose between simplicity and power: they can have both.
  4. There are many JVM options out there, beyond the Oracle version and OpenJDK. And there are now really great tools that make it easy to install, upgrade, and even swap JVMs on your laptop, such as SDKMAN! (the sdk command) on macOS. It supports many JVMs, including Corretto (Amazon), GraalVM (Oracle), Zulu (Azul)...

    Personally, I’ve found it remarkably easy to use SDKMAN! to switch my Clojure programs between different JVMs, following this fabulous tutorial here. I used it to explore GraalVM, which I’m very excited about and will describe next.

    And new JVMs now have a bunch of new garbage collectors, including the Shenandoah GC, written by Red Hat, which brings concurrent compaction to the JVM and so does away with long “stop the world” GC pauses. Here’s a great talk about A New Age of JVM Garbage Collectors by Alexander Yakushev that I saw at the Clojure/conj conference two weeks ago!
  5. I think the GraalVM project is so exciting! For me, GraalVM defies easy explanation — it’s an ambitious, entirely new polyglot JVM that allows running Java and other JVM languages, as well as being able to host languages such as JavaScript, Ruby, R, Python and LLVM-based languages. (!!!)

    GraalVM also enables native-image compilation, which compiles your code ahead of time into a native executable. These executables start almost instantly (typically in milliseconds), which addresses one of the primary complaints about conventional JVMs. The binaries are also usually small (e.g., 33 MB executables instead of 300 MB uberjar files).

    GraalVM is the brainchild of Dr. Thomas Wuerthinger, Senior Director of Research at Oracle. You can listen to an interview of him on Software Engineering Daily. I suspect you’ll be blown away by his ambitious vision, and the massive productivity advantages they’re gaining by writing a JVM in a language that’s not C++.

    I loved this video by Chris Thalinger on how Twitter is now using GraalVM to run all their Scala workloads, resulting in a 20% improvement in performance and compute density. And here’s a video of Jan Stepien presenting on how to create native images for Clojure programs. Here’s another great article from AstRecipes, showing how native images resulted in a 300x improvement in startup times.

    GraalVM seems to be energizing a flurry of innovation outside of Oracle. Red Hat has created the Quarkus platform, intended to optimize the JVM for Kubernetes and similar environments. I remember reading this blog post and getting excited about instantaneous startup times and significantly reduced memory footprints. You can find an interview of Guillaume Smet and Emmanuel Bernard on Software Engineering Daily. I found it especially fascinating that they are using Go as the benchmark to beat.
  6. The JVM and Maven packaging ecosystem is a calming breath of fresh air: The Maven packaging ecosystem is one of the longest-lived and most successful package repositories. In terms of number of packages/components and versions, only NPM for NodeJS is larger.

    In a presentation that I did with Dr. Stephen Magill at GitHub Universe, we presented on the findings from the State of the Software Supply Chain research we did with Sonatype, studying the update behaviors within the software supply chain in the Maven ecosystem.

    One of the things this research reinforced for me is that packaging churn is growing to be untenable, especially in NPM for NodeJS. This funny tweet thread sums it up nicely: “When npm was first released in 2010, the release cycle for typical nodeJS package was 4 months, and npm restore took 15-30 seconds on an average project. By early 2018, the average release cycle for a JS package was 11 days, and the average npm restore step took 3-4 minutes....”

    Of course, the catastrophic “nodularity” is a joke, but there are so many stories of projects where “I didn’t touch the project for 4 months, and now it no longer builds if you update any of the dependencies, and I probably need to update npm, too.”

    In other words, if every dependency you rely on is updating every week, and they are introducing breaking changes, you are in a world of hurt. In a world where the best way to keep your software supply chain healthy is to integrate updating dependencies into your daily work, this becomes impossible.

    The Maven ecosystem is full of components that just work and have been proven in production. In fact, there have been almost no breaking changes to the Clojure core libraries in 12 years!

    If you choose carefully, the components in the JVM ecosystem are stable, reliable, and updates can be made quickly, reliably, easily, without everything blowing up in your face.

    And by the way, one of the most amazing talks I've seen is from @BrianGoetz, the Java language architect, on his stewardship of the Java ecosystem, given at Clojure/conj 2016. What's unmistakable & so admirable is his sense of responsibility to not break the billions of lines of code that 9MM developers have written over the last two decades.

    The promise they make to them: “We won’t break your code.”
  7. Devs should configure and run their own applications and JVMs: The days of Devs throwing JAR files over the wall to Ops, who must then figure out how to run them in production, are over. Instead, the more modern pattern is having the Devs configure their own JVMs however they want, which then get deployed into a platform that Ops creates and maintains. But it’s the Devs who will be woken up if and when things blow up...


I hope I’ve told you something about the JVM that you may not have known, and made the case that the modern day JVM should be a great thing, both for Devs and for Ops!

PS: You can read more about Clojure and functional programming here, and more about how critical it is for Ops to create platforms that enable developers to be productive in my description of the Five Ideals here, featured in “The Unicorn Project: A Novel about Developers, Digital Disruption, and Thriving in the Age of Data” (this book about DevOps is now a Wall Street Journal bestseller!!)

December 8, 2019

Day 8 - Going Nomad in a Kubernetes World

By: Paul Welch (@pwelch)
Edited By: Nathen Harvey (@nathenharvey)

What is Nomad

Nomad by HashiCorp is a flexible orchestration tool that allows for management and scheduling of different types of compute workloads. Nomad is able to orchestrate legacy applications, containers, and even Machine Learning tasks. Kubernetes is a well-known orchestration platform but this post provides an introduction to Nomad, a tool that can provide some of the same orchestration capabilities.

Most distributed systems benefit from having schedulers and a facility for service discovery. Schedulers programmatically manage compute resources across a large number of nodes. Service discovery tools are used to distribute information about services in a cluster.

How is Nomad Different

Let’s look at some of the differences between Nomad and Kubernetes. Both Nomad and Kubernetes are able to manage thousands of nodes across multiple availability zones or regions, but this is where they begin to differ. Kubernetes is specifically designed to manage containers (most commonly Docker). It is built from more than a half-dozen interconnected services that together provide full functionality. Administering a Kubernetes management cluster can be a full-time job if you are not able to leverage one of the managed services that most major cloud providers offer today, e.g., Amazon EKS, Azure Kubernetes Service (AKS), or Google Kubernetes Engine (GKE).

In contrast, Nomad is a more general purpose scheduler supporting virtualized applications, containers, standalone binaries, and even tasks requiring GPU resource management. Nomad is a single binary for both clients and servers that provides a lightweight scheduler and resource manager. Nomad aims to follow the Unix design philosophy of having a smaller scope focusing on cluster management and scheduling, while leveraging other tools, such as Consul for service discovery and Vault for secrets management.

Getting Started

This article is merely a quick introduction to getting started with Nomad using the local development environment with Docker installed. The steps described have been tested with Nomad version 0.10.1. Other great resources for learning more include Learn Nomad and the Nomad Documentation.

To get started, grab the latest version from the Nomad download page for your platform. Note that there is only one binary to install. The binary will run in server or client mode based on the configuration file given. Once you have it installed you can run nomad --version to verify a successful install.

With a successful install confirmed, let’s dive into setting up a local running instance.

Nomad has many options for task drivers available but this demo will be using Docker. Make sure you have Docker installed and running locally.

Nomad Server

Nomad consists of several agents running in server mode, typically 3–5 server instances at a minimum, and any number of agents running in client mode on hosts that will accept jobs from the Nomad server. For our purposes, a single server and client will be enough.

First create a server.hcl file with the following basic configuration:

# server.hcl
# Increase log verbosity
log_level = "DEBUG"

# Setup data dir
data_dir = "/tmp/server1"

# Give the agent a unique name. Defaults to hostname
name = "server1"

# Enable the server
server {
  enabled = true

  # Self-elect, should be 3 or 5 for production
  bootstrap_expect = 1
}

In a new terminal, run nomad agent -config server.hcl. This will start a development server that includes a Web UI available at http://localhost:4646. Here you will be able to see details on Nomad Servers and Clients in this cluster, as well as current and past jobs. Now that we have a server to manage our jobs and resources, let’s add a client.

Nomad Client

Nomad clusters will have agents deployed in client mode on any host that has resources that need to be managed for jobs. In a new terminal window, let’s create the following configuration file and run nomad agent -config client1.hcl.

# client1.hcl
# Increase log verbosity
log_level = "DEBUG"

# Setup data dir
data_dir = "/tmp/client1"

# Give the agent a unique name. Defaults to hostname
name = "client1"

# Enable the client
client {
  enabled = true

  # For demo assume we are talking to server1. For production,
  # this should be like "nomad.service.consul:4647" and a system
  # like Consul used for service discovery.
  servers = [""]
}

# Modify our port to avoid a collision with server1
ports {
  http = 5656
}
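
The servers list is where the client learns which Nomad server to contact. As a minimal sketch for a single-machine demo, assuming server1 runs on the same host and listens on Nomad's default RPC port (4647), the client stanza could look like the following. The 127.0.0.1 address is an assumption for illustration and is not taken from the configuration above.

```hcl
# Sketch of the client stanza for a local, single-machine demo.
client {
  enabled = true

  # Assumption: server1 is running on this same machine and uses
  # Nomad's default RPC port, 4647. Adjust to your server's address.
  servers = ["127.0.0.1:4647"]
}
```

In production you would point this at a service-discovery name instead, as the comment in the configuration suggests.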

Now when you revisit the Web UI for the Nomad Server you should see a new client listed. You can click into the client details to see information such as available resources or drivers that Nomad has detected (e.g. Docker or QEMU).

Nomad Job Configuration

Now that we have an operational Nomad cluster, let’s create a job to be orchestrated. Generate a new Nomad job with nomad job init. This will create a new file called example.nomad in your current directory.

The Nomad job specifications have many options but for this article we are only going to focus on some of the primary stanzas. For a more in-depth breakdown, check out the Nomad Job Specifications documentation.


The job stanza is the topmost configuration block in a job specification file. There can only be one job stanza in a file, and the job name must be unique per Nomad region. Parameters, such as resource constraints, can be set at the job, group, or task level based on your needs. A Nomad job plays a role loosely similar to a Kubernetes Deployment, while a task group (described below) is the closer analogue of a Pod.


The job type refers to the type of scheduler Nomad should use when creating the job. The options are service, batch, and system. The two most frequently used options will probably be service and batch. The service scheduler type is used for long-running tasks that should never go down, such as an application or a cache service like Redis. A batch task is similar to a service task but is less sensitive to performance and is expected to finish within a few minutes to a few days. The system scheduler type is useful for deploying tasks that should be present on every node.
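
To make the type setting concrete, here is a minimal sketch of a batch job. The job, group, and task names, the dc1 datacenter, and the busybox image are all hypothetical choices for illustration; only the type line matters here.

```hcl
# Hypothetical batch job: runs once to completion instead of staying up.
job "nightly-report" {
  datacenters = ["dc1"] # assumed datacenter name

  # One of "service", "batch", or "system"
  type = "batch"

  group "report" {
    task "generate" {
      driver = "docker"

      config {
        image   = "busybox"
        command = "sh"
        args    = ["-c", "echo generating report && sleep 5"]
      }
    }
  }
}
```

Changing the type to "service" would tell Nomad to keep the task running, and "system" would place it on every eligible client.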


A job can have many groups and each group can have many tasks. The group stanza is used to define the tasks that should be co-located on the same Nomad client. Any task defined in a group will be placed on the same client node. It’s out of scope for this tutorial, but for failure tolerance configurations, see the spread stanza documentation.


A task is the unit of work using Docker containers, binary applications, or any of the other Nomad supported task types. This is where you specify what you want to run and how you want it to run, with parameters such as command arguments, services using service discovery, or resource requirements, to name a few.


Not to be confused with the Nomad job type mentioned above, the service configuration is for Nomad to register a service with Consul for service discovery. This allows you to reference the resource in other Nomad configurations by the service name.

Each job has one type and N groups. Each group is made up of N tasks, and each task can register a service.

job
 | type (1)
  \_ group (N)
        \_ task (N)
          | service (1)
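
To tie the stanzas together, here is a trimmed sketch of a service job, modeled on the kind of file nomad job init generates for Nomad 0.10.x. The generated file contains more settings and comments and may differ between versions, so treat names and values such as cache, redis:3.2, and redis-cache as illustrative.

```hcl
# job: the topmost stanza, one per file
job "example" {
  datacenters = ["dc1"]

  # type: which scheduler to use (service, batch, or system)
  type = "service"

  # group: tasks that must be placed together on one client
  group "cache" {
    count = 1

    # task: the unit of work, here a Docker container
    task "redis" {
      driver = "docker"

      config {
        image = "redis:3.2"

        # Map the port labeled "db" to Redis's port inside the container
        port_map {
          db = 6379
        }
      }

      resources {
        cpu    = 500 # MHz
        memory = 256 # MB

        network {
          mbits = 10
          port "db" {} # let Nomad pick a dynamic host port
        }
      }

      # service: registers the task with Consul for service discovery
      service {
        name = "redis-cache"
        port = "db"

        check {
          name     = "alive"
          type     = "tcp"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```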

Nomad Job Execution

After reviewing some of the basics of a Nomad job specification, it’s time to deploy a job to our local Nomad cluster.

The example nomad job created with nomad job init defaults to a service job with a task to run a Redis instance using the Docker driver. Let’s deploy that now.

In a separate terminal, with your server and client nodes running in the others, submit the job by running nomad job run example.nomad. You have now set up a basic Nomad installation and deployed your first job!

You can view the status and details of the job with nomad job status example or via the Web UI mentioned previously. Since we are using the Docker driver, we can also see the running container Nomad is managing by running docker ps. Each running instance of the job is given an allocation ID. From the output of nomad job status example you can retrieve the allocation ID and see its details by running nomad alloc status ALLOC_ID. As with other Nomad commands, running nomad alloc on its own lists the available subcommands, such as exec, fs, stop, logs, and a few others that help manage jobs.

Scheduler Example

Now that we have a running job, let’s see the scheduler maintain the running service. Get the “Allocation ID” from the job status by running nomad job status example. Then run the allocation status command nomad alloc status ALLOC_ID. If you want, you can use the details in the “Address” field from the nomad alloc status command to connect to the redis container. For example, if the value is db: IP_ADDRESS:29906, you can connect to it by running redis-cli -h IP_ADDRESS -p 29906.

Next, let’s cause the Docker container to fail and use the alloc status command to watch what happens. Run docker ps and note the ID of the running redis container. Now stop the container with docker stop CONTAINER_ID. After a few seconds you can run nomad alloc status ALLOC_ID again to see the updated status and event details for the job. If you run docker ps as well, you will see that the Nomad scheduler has started a new Redis container! If this had been a production cluster and the node running our job had failed, Nomad would have rescheduled the task to a healthy node. This is one of the advantages of using an orchestration tool like Nomad.
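
How Nomad reacts when a task exits, as it did when we stopped the container, is governed by the restart stanza of the task group. The sketch below uses values that, to my knowledge, match Nomad's documented defaults for service jobs. Treat the numbers as illustrative and check the documentation for your version.

```hcl
group "cache" {
  # Local restart policy for tasks in this group
  restart {
    attempts = 2     # restart a failed task up to 2 times...
    interval = "30m" # ...within any 30 minute window
    delay    = "15s" # wait this long before each restart

    # Once attempts are used up, mark the allocation as failed
    # rather than waiting for the next interval to retry locally
    mode = "fail"
  }
}
```

This stanza sits inside the group block of the job file, alongside the task definitions.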


Nomad is an exciting project because it focuses on managing resources across a fleet of systems, regardless of the type of resource. This is particularly useful in diverse environments where there is a need to manage multiple types of resources (e.g., binary applications, LXC containers, QEMU virtual machines, etc.). No one tool is perfect for every environment, but hopefully this article has helped you determine if Nomad might be the right solution for you.