December 5, 2014

Day 5 - How To Talk About Monitors, Tests, and Diagnostics

Written by: Yvonne Lam (@yvonnezlam)
Edited by: Jennifer Davis (@sigje)

Over time, I’ve accumulated a short list of statements about test-like things (a personal category that includes monitors and diagnostics as well as traditional tests) that make my heart sink into my shoes when I hear them:

  • “That isn’t a unit/integration/functional test!”
  • “Why do we need a check for that?”
  • “We need to run our functional tests against the production system all the time and alert when they fail.”

These remarks are strong signals that we lack shared context about what we are trying to accomplish with the tests/monitors/diagnostics in question. The ensuing conversation is likely to be difficult unless we can build some of that context.

I think of the context for a test-like thing in terms of five questions:

  1. Where is it going to run?
  2. When is it going to run? (e.g. What events will cause it to run?)
  3. How often will it run?
  4. Who is going to consume the result? (Answers could be “another service”, “an application that generates time-series data”, “a person”, or a chain of any of the above.)
  5. What is that entity or collection of entities going to do with the result?

These questions form a heuristic, so I don’t feel that all of them need to be answered precisely every time.
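To make the heuristic concrete, here is a rough sketch of the five answers written down as a record for a single, hypothetical production check. The class name, field names, and example values are mine, invented for illustration, not a prescribed schema.

    # A sketch only: the answers to the five questions, written down as a
    # small record so the context for one test-like thing is explicit.
    # The class name, field names, and example values are all made up.
    from dataclasses import dataclass

    @dataclass
    class TestLikeThing:
        where: str        # what system it runs against
        when: str         # what event causes it to run
        how_often: str    # rough frequency
        consumer: str     # who or what reads the result
        consequence: str  # what the consumer does with the result

    web_check = TestLikeThing(
        where="production web tier",
        when="on a schedule set by the monitoring system",
        how_often="every minute",
        consumer="an alerting system, then a person if it fires",
        consequence="page someone to investigate",
    )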

For example:

  • A traditional test runs against a non-production system when that system is built or deployed; it is generally triggered by an event such as a code change or a deployment. It will likely run relatively infrequently, e.g. at intervals of hours rather than minutes or seconds. A person may look at the result of the test with the goal of fixing a bug before it goes into production, or an automated system may look at the aggregate test results in order to decide on the flow of code through a continuous integration or deployment system.
  • A monitor runs against a production or production-like system on a schedule. Most monitors run frequently, e.g. at minute or sub-minute intervals, although some may be triggered by other events. The result may be consumed directly by an alerting system such as Nagios, or by another system that produces time-series or other data. In the first case, a result may cause an alert to fire, an alarm to go off, and potentially a person to be paged. In the second case, measured changes in the collected data may cause an alert to fire and a person to be notified, or the collected data may be used for other kinds of analysis or presented in other forms. The eventual consumer of the data will be someone who wants to know the state of the running production system.
  • A diagnostic runs against a production or production-like system on demand. It will be run when there is reason to believe that something is wrong. The result will be used to fix the running system, through reconfiguration or code change/deploy. In sufficiently advanced systems, cheap diagnostics may be run all the time, in order to trigger automated healing of common system errors or conditions.

Note that intent matters: checking to see if, say, nginx is available and serving data may be a test, a monitor, or a diagnostic. [1]
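As a rough illustration of how intent, rather than the code itself, decides what the check is, here is one way that nginx check might be written and then reused in each role. The URL, timeout, and framing are assumptions made for this sketch, not a recommended implementation.

    import sys
    import urllib.request

    URL = "http://localhost/"  # hypothetical address for the nginx under discussion

    def nginx_is_serving(url=URL, timeout=5):
        """Return True if nginx answers with HTTP 200 within the timeout."""
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.status == 200
        except OSError:
            return False

    # As a test: called from a suite after a build or deploy, against a
    # non-production system; a failure stops the code from moving forward.
    def test_nginx_serves():
        assert nginx_is_serving()

    # As a monitor: invoked by a scheduler every minute against production;
    # the exit code follows the Nagios plugin convention and drives alerting.
    if __name__ == "__main__":
        if nginx_is_serving():
            print("OK - nginx responding")
            sys.exit(0)   # 0 = OK
        print("CRITICAL - nginx not responding")
        sys.exit(2)       # 2 = CRITICAL; may end with a person being paged

    # As a diagnostic: the same function, run by hand from a shell when
    # someone suspects the production system is unhealthy.

The code is the same in all three cases; what differs is the answers to the five questions above.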

An immediate benefit of this approach is that it allows discussion of test-like things without first wading through a sea of vocabulary. You don’t have to argue about the difference between an integration test and a functional test, or struggle for words to explain why converting unit tests into production monitors is probably not the best use of the time you have to spend on monitoring. For the sake of effectiveness, it’s good to converge on shared terms, but if people already use different vocabulary, that convergence will take time, and you may need test-like things before it happens.

The main reason why I like this approach is that it allows me to be very specific about what the test-like thing under discussion is supposed to do. Being able to say,

  • “This test-like thing writes data to a database. Do we want it to run every time someone makes a code change, given that we need to set up the database before the test and clean it up afterward?” (probably not; such a test is sketched below),
  • “If we think something is broken, we need to check that data is getting from A to B. Is that something we want to check once every ten seconds when the system is otherwise healthy?” (it depends), or
  • “Our functional tests for account setup create new accounts every time. Do we want to create a new account in production every minute?” (probably not, if we can alert on failed attempts and/or unusually low counts of new accounts per unit time) [2]

makes for a more useful conversation, in my opinion.
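To make the first of those examples concrete, here is roughly what such a database-writing test can look like. The sketch uses pytest and an in-memory SQLite database as stand-ins for whatever your real test and database are; the table and assertion are invented for illustration.

    # Sketch of a test-like thing that writes to a database and so needs
    # setup and teardown around every run. SQLite stands in for the real
    # database; the schema and data are made up.
    import sqlite3
    import pytest

    @pytest.fixture
    def db():
        conn = sqlite3.connect(":memory:")
        conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, body TEXT)")
        yield conn      # setup done; hand the connection to the test
        conn.close()    # teardown: throw the database away

    def test_event_is_persisted(db):
        db.execute("INSERT INTO events (body) VALUES (?)", ("hello",))
        count = db.execute("SELECT COUNT(*) FROM events").fetchone()[0]
        assert count == 1

Against an in-memory database this is cheap; against a shared database that has to be provisioned and cleaned for every run, the answer to “how often should it run?” changes quickly.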

Specificity is especially useful when people ask for a type of test or monitor that is not consonant with the way the application or service component under examination has been written. For example, a common response to user-facing bugs that depend on system state, and hence are not easy to reproduce, is, “Write a functional test and run it as a monitoring check.” This is an excellent idea, since it allows one to track how often the problem happens, get closer to understanding what system state is relevant, and so on. It may also not be practical. A test that replicates a user scenario is powerful in part because it exercises the entity being tested at a high level of abstraction; attempts to capture that scenario in code often result in test code that is fragile, requires more access to service or application internals than is readily available, or involves setup and teardown too expensive to run at the required frequency. The more specific you can be about what such a monitoring check actually needs to do and what that requires, the better equipped you are to suggest alternatives or, in the most extreme case, to talk about what would need to change in the application in order for it to be monitored as desired.

Gratuitous piece of advice #1: It’s worth considering whether some of those health checks that run all the time could be replaced either with monitors that feed a system for collecting data for trend analysis or with diagnostics that a person could use to probe a running system when they suspect something is wrong.

Gratuitous piece of advice #2: It depends on your service, but if it is data-intensive and/or relies heavily on caching to improve performance, using your functional tests as monitors may not get you the result that you want. Services whose primary purpose is data collection and storage are most likely to get use out of functional tests as monitors. For a service that does real-time or near-real-time data mining, I have found that data-dependent functional tests either are or become problematic as monitors.
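As a sketch of advice #1, here is one shape such a data-feeding monitor could take, assuming a Graphite/carbon listener at a made-up address and an invented metric name; the point is that the check emits a measurement for trend analysis rather than a bare pass/fail.

    # Sketch only: measure a response time and ship it to Graphite using
    # the plaintext protocol ("metric value timestamp"). The host, port,
    # metric name, and URL are placeholders.
    import socket
    import time
    import urllib.request

    GRAPHITE = ("graphite.internal", 2003)     # hypothetical carbon host
    METRIC = "frontend.nginx.response_ms"      # hypothetical metric name

    def measure_response_ms(url="http://localhost/", timeout=5):
        """Return the response time in milliseconds, or None on failure."""
        start = time.time()
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                resp.read()
        except OSError:
            return None
        return (time.time() - start) * 1000

    def send_to_graphite(value):
        """Write one data point with Graphite's plaintext protocol."""
        line = "%s %f %d\n" % (METRIC, value, int(time.time()))
        with socket.create_connection(GRAPHITE, timeout=5) as sock:
            sock.sendall(line.encode("ascii"))

    if __name__ == "__main__":
        ms = measure_response_ms()
        if ms is not None:
            send_to_graphite(ms)

Run on a schedule, this produces a series someone can graph and alert on; run by hand with the value printed instead of sent, it doubles as a cheap diagnostic.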
