December 12, 2019

Day 12 - Observability

By: Ramez Hanna (@informatiq)
Edited By: Kirstin Slevin (@andersonkirstin)

TL;DR Observability is about people and practices. You don’t need a dedicated team; you need people who care.

Bonus points This applies to many other things, not just observability.

Disclaimer

I do not take full credit for all that I am going to share.

This is the result of my learning from people, books and experience.

This is my view on the subject, hence you can disagree with me.

What is Observability?

According to Wikipedia, it is

“A measure of how well internal states of a system can be inferred from knowledge of its external outputs.”

When I read that it was so clear and yet so mysterious.

Trying to make sense of that definition in my context I came up with this simplification.

The act of exposing state, and being able to answer 3 questions:
-> what is the status of my system?
-> what is not working?
-> why is it not working?

Let’s inspect that definition closely, starting with “The act of exposing state”. This is the intent, the conscious action.

It’s about instrumenting the code to expose state and data about itself that will help in understanding it.

The goal is not merely to expose what we already know we want to monitor (known unknowns); rather, the goal is to expose more data and add as much context as possible, which enables the discovery of new failure modes (unknown unknowns).

This will enable us to answer the three questions.

The goal of observability is to get as close as possible to knowing the cause of the issues that impact the performance of systems, hence improving response time and reducing the MTTR (Mean Time To Recovery).
To make it more concrete, let’s look at this example:

The Universe company is using Graphite and Grafana for their metrics, and ELK stack for their logs.
Team Earth instrumented their code to expose the necessary metrics. They thought carefully about what metrics are important to their service, how to collect these metrics, and they carefully crafted their logs to have enough context.
They also put in place probes that will query their service and report status as perceived by clients instead of relying only on metrics exposed by the service.
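
As a rough illustration of what Team Earth's instrumentation and probe could look like, here is a minimal Python sketch. It assumes Graphite's plaintext protocol, where each data point is one `path value timestamp` line. The metric paths, the `probe` and `probe_to_metrics` helpers, and the `earth_api` service name are all made up for this example; they are not what any team in this story actually ran.

```python
import time
from typing import Callable, List, Tuple


def graphite_line(path: str, value: float, timestamp: int) -> str:
    """Format one data point for Graphite's plaintext protocol."""
    return f"{path} {value} {timestamp}\n"


def probe(check: Callable[[], bool],
          clock: Callable[[], float] = time.monotonic) -> Tuple[bool, float]:
    """Run a client-side check and return (success, latency in seconds).

    A check that raises counts as a failure, because that is what a
    client of the service would experience.
    """
    start = clock()
    try:
        ok = bool(check())
    except Exception:
        ok = False
    return ok, clock() - start


def probe_to_metrics(service: str, ok: bool, latency: float,
                     timestamp: int) -> List[str]:
    """Turn one probe result into Graphite lines (hypothetical metric paths)."""
    prefix = f"probes.{service}"
    return [
        graphite_line(f"{prefix}.up", 1 if ok else 0, timestamp),
        graphite_line(f"{prefix}.latency_seconds", round(latency, 3), timestamp),
    ]


if __name__ == "__main__":
    # Stand-in for a real request to the service, e.g. an HTTP GET.
    ok, latency = probe(lambda: True)
    for line in probe_to_metrics("earth_api", ok, latency, int(time.time())):
        print(line, end="")
```

In a real setup the lines would be sent to the Graphite carbon listener rather than printed, and the check would exercise the service the way a client does. The point is that the probe reports what clients see, independently of the metrics the service exposes about itself.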

On the other hand, Team Mars only had the metrics exposed by the framework they use.
Their logs were verbose, unstructured text, and they relied on basic health checks, which amount to a ping check against their homepage.
Both teams use the same tools in an effort to observe their systems but the result is not the same.
During an incident, Team Earth will be able to see how their service’s performance is perceived by clients, and to follow the metrics/signals through the different components until they identify a metric that is not within thresholds.
They would then look at the logs, where they would see more details about the anomaly, and work to fix it.
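
To give a feel for what "logs with enough context" might mean in practice, here is a small sketch using only Python's standard `logging` and `json` modules. The `JsonFormatter` class, the `build_logger` helper, and the field names (`request_id`, `upstream`, `timeout_ms`) are assumptions made for illustration, not Team Earth's actual setup.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Context passed via extra={"context": {...}} is set on the record.
        event.update(getattr(record, "context", {}))
        return json.dumps(event, sort_keys=True)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr if None)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("earth.checkout")
    log.error(
        "payment provider timed out",
        extra={"context": {"request_id": "req-123",
                           "upstream": "payments",
                           "timeout_ms": 2000}},
    )
```

Because every event carries the same named fields, a log store such as the ELK stack can filter on them (for example, all errors for one request ID) instead of someone reading through free-form text.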

Team Mars can look at their metrics, but they won’t necessarily find a metric that is out of the norm, so they will go over to the logs and sift through all those blobs of text, scrambling to make sense out of them.

They end up finding a fix, but the effort and frustration leave them demotivated.

This shows that observability is about what people do with the tools.

Who is it for, who will be implementing it really?

Observability is best implemented by the engineers who wrote the code, since they know their systems best.

I cannot implement observability for all the engineers, but I can enable them to observe their services, showing them how to best observe, monitor, and understand their systems.

My users are the heart of observability; without their involvement and cooperation I will not succeed at my mission.

Observability is about people.

It comes down to engineers following best practices, understanding what needs to be observed, how it should be observed and how to use that knowledge to improve the reliability of their services.

Observability is about people and practices.

How to implement Observability?

Before implementing observability, I must ask “WHY?”
Why would I want to implement observability?
Well, to make our company better at what it does, right? That’s why I was hired in the first place.
Observability should help my company be better at reacting to outages or any issue for that matter.
Engineering will be better because of observability, if correctly implemented.
Keeping that in mind helps set the stage for the work involved in the implementation.
So my mission is to enable the engineering teams through the following:

  • Talk about, advocate for, and train engineers on the principles
  • Provide support when they start applying this knowledge
  • Select the tools best suited to my company, whether self-hosted or SaaS
    • Understand the tools’ strengths and limitations and explain those to users

Observability in real life

At Criteo we have 600+ engineers and an Observability team of 5 engineers.
With that ratio, there is no way the Observability team can take the responsibility to implement everything.
The Observability team provides the necessary foundation to enable the teams to observe. This includes:

  • Develop and deploy tools to allow for exposing and visualizing state
  • Integrate the tools with the internal ecosystem
  • Provide support for using the tools
  • Write documentation
  • Drive the adoption of best practices by working closely with the different teams

The team deploys different tools and develops the glue to integrate them to have a coherent ecosystem. For example, this might look like:

  • BigGraphite as the long-term storage for metrics
    • This is the main metrics database. It is also used as long-term storage for Prometheus.
  • Prometheus for metrics collection, aggregation and alerting
  • Alertmanager to route alerts
  • Various other tools for tying it all together with sane defaults

Keeping our focus on user enablement, we always try to find ways to improve the experience of our users.
One successful Observability team initiative was to dedicate one member of the team for 3 days every sprint to work alongside another engineering team, observing how they interact with the observability tools, how they define their service level objectives, and what their alerting needs are. Through this process, the member of the Observability team was able to spot areas that needed improvement and show how to fix them.
It was mutually beneficial, as the Observability team learned more about users and their needs, and the users improved their ability to observe their systems.

Final word on tools

Vendors will try to sell me observability, but these are tools. Some are good, some are bad, and some are average, but no one can sell me observability.
Observability is more about people and practices: no matter what tools you use, if you don’t know what you’re doing, it won’t work.
People are creative and they will find ingenious ways of using the tools to fit their thinking instead of adapting their thinking to the tools.
So tools are crucial but they are not where the focus should be. Ultimately I should be careful to choose the tools that make it easier for my users to exercise the best practices and the principles of observability.
