December 24, 2015

Day 24 - It's not Production without an audit trail

Written by: Brian Henerey (@bhenerey)
Edited by: AJ Bourg (@ajbourg)

Don't be scared

I suspect when most tech people hear the word audit they want to run away in horror. It tends to bring to mind bureaucracy, paperwork creation, and box ticking. There's no technical work involved, so it tends to feel like a distraction from the 'real work' we're already struggling to keep up with.

My mind shifted on this a while ago when I worked very closely with an Auditor over several months, helping put together Controls, Policies and Procedures at an organization to prepare for a SOC2 audit. If you're not familiar with a SOC2, in essence it is a process where you define how you're going to protect your organization's Security, Availability, and Confidentiality (1) in a way that produces evidence for an outside auditor to inspect. The end result is a report you can share with customers, partners, or even board members, with the auditor's opinion on how well you're performing against what you said.

But even without seeking a report, aren't these all good things? As engineers working with complex systems, we constantly think about Security and Availability. We work hard implementing availability primitives such as Redundancy, Load Balancing, Clustering, Replication and Monitoring. We constantly strive to improve our security with DDOS protection, Web Application Firewalls, Intrusion Detection Systems, Pen tests, etc. People love this type of work because there's a never-ending set of problems to solve, and who doesn't love solving problems?

So why does Governance frighten us so? I think it's because we still treat it like a waterfall project, with all the audit work saved until the end. But what if we applied some Agile or Lean thinking to it?

Perhaps if we rub some Devops on it, it won't be so loathsome any more.

Metrics and Laurie's Law

We've been through this before. Does anyone save up monitoring to the end of a project any longer? No, of course not. We're building new infrastructure and shipping code all the time, and as we do, everything has monitoring in place as it goes out the door.

In 2011, Laurie Denness coined the phrase "If it moves, graph it". What this means to me is that any work done by me or my team is not "Done" until we have metrics flowing. Generally we'll have a dashboard as well, grouping as appropriate. However, I've worked with several different teams at a handful of companies, and people generally do not go far enough without some prompting. They might have os-level metrics, or even some application metrics, but they don't instrument all the things. Here are some examples that I see commonly neglected:

  • Cron jobs / background tasks - How often do they fail? How long do they take to run? Is it consistent? What influences the variance?

  • Deployments - How long did it take? How long did each individual step take? How often are deploys rolled back?

  • Operational "Meta-metrics" - How often do things change? How long do incidents last? How many users are affected? How quickly do we identify issues? How quickly from identification can we solve issues?

  • Data backups / ETL processes - Are we monitoring that they are running? How long do they take? How long do restores take? How often do these processes fail?
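As a concrete sketch of the cron job case, you can wrap the job so that every run emits a structured event with its duration and exit status. This is my own illustration, not a standard tool; the job name and command below are made up:

```python
import json
import subprocess
import time
from datetime import datetime, timezone

def run_instrumented(name, cmd):
    """Run a command and emit a one-line JSON event with its duration and outcome."""
    start = time.monotonic()
    result = subprocess.run(cmd, capture_output=True)
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "job": name,
        "duration_seconds": round(time.monotonic() - start, 3),
        "exit_code": result.returncode,
        "failed": result.returncode != 0,
    }
    print(json.dumps(event))  # one JSON object per line is easy to centralize
    return event

# Hypothetical job name and command:
run_instrumented("nightly-backup", ["true"])
```

With events like these flowing, questions such as "how often does this fail?" and "what influences the variance?" become queries instead of guesses.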

Now let's take these lessons we've learned about monitoring and apply them to audits.

Designing for Auditability

There's a saying that goes something like 'Systems well designed to be operated are easy to operate'. I think designing a system to be easily audited will have the same effect. So if you've already embraced 'measure all things!', then 'audit all the things!' should come easily to you. You can do this by having these standards:

  • Every tool or script you run should create a log event.
  • The log event should include as much meta-data as possible, but start with who, what and when.
  • These log events should be centralized into something like Logstash.
  • Adopt JSON as your logging format.
  • Incrementally improve things over time. This is not a big project to take on.
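A minimal sketch of what such a log event might look like, covering the who, what and when. The `@timestamp` field follows Logstash's convention; the rest of the field names (and the `app`/`version` example values) are my own choosing:

```python
import getpass
import json
import os
import socket
import sys
from datetime import datetime, timezone

def audit_event(action, **extra):
    """Build a structured audit log event: who, what, when, where."""
    event = {
        "@timestamp": datetime.now(timezone.utc).isoformat(),    # when
        "user": os.environ.get("USER") or getpass.getuser(),     # who
        "action": action,                                        # what
        "host": socket.gethostname(),                            # where
        "argv": sys.argv,                                        # how it was invoked
    }
    event.update(extra)  # any additional meta-data the tool knows about
    return json.dumps(event)

# Hypothetical deploy event:
print(audit_event("deploy", app="billing", version="1.4.2"))
```

Emitting one JSON object per line means Logstash can ingest it with no extra parsing work.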

While I was writing this article, James Turnbull published a fantastic piece on Structured Logging.

Start small

The lowest hanging fruit comes from just centralizing your logs and using a tool like Logstash. Your configuration management and/or deployment changes are probably already being logged in /var/log/syslog.

The next step is to be a bit more purposeful and instrument your most heavily used tools.

Currently, at the beginning of every Ansible run we run this:

  - name: ansible start debug message #For audit trail
    shell: 'sudo echo Ansible playbook started on {{ inventory_hostname }} '

and also run this at the end:

  - name: ansible finish debug message #For audit trail
    shell: 'sudo echo Ansible playbook finished on {{ inventory_hostname }} '

Running that command with sudo privileges ensures it will show up in /var/log/auth.log.
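If you'd rather not lean on sudo's logging, an alternative sketch (my own assumption, not the setup described above) is to write a tagged line straight to syslog with logger(1):

```yaml
# Hypothetical alternative: emit the audit marker via logger(1) instead of sudo.
- name: ansible start audit marker # For audit trail
  command: logger -t ansible-audit "Playbook started on {{ inventory_hostname }}"
```

The `-t` tag makes these entries trivial to filter for once your syslog is centralized.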

Improve as you go

In the first few years after Statsd came out, I evangelized often to get Dev teams to start instrumenting their code. Commonly, people would think of this as an extra task to be done outside of meeting the acceptance criteria of whatever story a Product Manager had fed to them. As such, this work tended to be put off till later, perhaps when we hoped we'd be less busy (hah!). Don't fall into this habit! Rather, add purposeful, quality logging to every bit of your work.

Back then, I asked a pretty senior engineer from an outside startup to give a demo of how he leveraged Statsd and Graphite at his company, and it was very well received. I asked him what additional amount of effort it added to any coding he did, and his answer was less than 1%.

The lesson here is not to think of this as a big project to go and do across your infrastructure and tooling. Just begin now, improve whatever parts of your infrastructure code-base you're working in, and your incremental improvements will add up over time.


If you're working in AWS, you'd be silly not to leverage CloudTrail. Launched in November 2013, AWS CloudTrail "records API calls made on your account and delivers log files to your Amazon S3 bucket."

One of the most powerful uses for this has been tracking all Security Group changes.

Pulling your CloudTrail logs into Elasticsearch/Logstash/Kibana adds even more power. Here's a graph plus event stream of a security rule being updated to open up a port to the world. Unless this rule is in front of a public-internet-facing service, it is the equivalent of chmod 0777 on a file or directory when you're trying to solve a permissions problem.

It can occasionally be useful to open things to the world when debugging, but too often this change is left behind in a sloppy way and poses a security risk.
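To keep an eye on these changes, a Kibana/Lucene query along these lines filters CloudTrail events down to security group modifications. The event names are the EC2 API action names; the exact field paths depend on your index mapping:

```
eventName:"AuthorizeSecurityGroupIngress" OR eventName:"RevokeSecurityGroupIngress"
```

Saving a search like this as a dashboard panel gives you a running history of every rule change.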

Auditing in real-time!

Audit processes are not usually a part of technical workers' day-to-day activities. Keeping the compliance folks happy doesn't feel central to the work we're normally getting paid to do. However, if we think of the audit work as a key component of protecting our security or availability, perhaps we should be approaching it differently. For example, if the audit process is designed to keep unwanted security holes out of our infrastructure, shouldn't we be checking this all the time, not just in an annual audit? Can we get immediate feedback on the changes we make? Yes, we can.

Alerting on Elasticsearch data is an incredibly powerful way of getting immediate feedback on deviations from your policies. Elastic has a paid product for this called Watcher. I've not used it, preferring to use a Sensu plugin instead.

  "checks": {
    "es-query-count-cloudtrail": {
      "command": "/etc/sensu/plugins/check-es-query-count.rb -h my.elasticsearch  -d 'logstash-%Y.%m.%d' --minutes-previous 30 -p 9200 -c 1 -w 100  --types "cloudtrail"  -s http  -q 'Authorize*' -f eventName --invert",
      "subscribers": ["sensu-server"],
      "handlers": ["default"],
      "interval": 60,
      "occurrences": 2

With this I can query over any time frame, within a subset of event 'types', look for matches in any event field, and define warning and critical alert criteria for the results.
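Under the hood, a check like this boils down to a count query against Elasticsearch. As a rough sketch (not the plugin's actual code, and the exact DSL varies by Elasticsearch version), the request body it sends to the `_count` API might look like:

```python
import json

def es_count_query(field, pattern, minutes_previous):
    """Build an Elasticsearch _count request body matching a wildcard
    pattern over a recent time window."""
    return {
        "query": {
            "bool": {
                "must": [
                    {"wildcard": {field: pattern}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes_previous}m"}}},
                ]
            }
        }
    }

# Count CloudTrail Authorize* events from the last 30 minutes:
body = es_count_query("eventName", "Authorize*", 30)
print(json.dumps(body, indent=2))
```

The check then compares the returned count against its warning and critical thresholds.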

Now you can find out immediately when things are happening like non-approved accounts making changes, new IAM resources being created, activity in AWS regions you don't use, etc.

Closing Time

It can be exceptionally hard to move fast and 'embrace devops' and actually follow what you've documented in your organization's controls, policies, and procedures. If an audit is overly time-consuming, even more time is lost from 'real' work, and there's even more temptation to cut corners and skip steps. I'd argue that the only way to avoid this is to bake auditing into every tool, all along the way as you go. Like I said before, it doesn't need to be a huge monumental effort, just start now and build on it as you go.

Good luck and happy auditing!


(1) I only mentioned three, but there are five "Trust Service Principles", as defined by the AICPA

1 comment:

Unknown said...

I would offer one suggestion to add to your ansible wrappers. Underneath each, add:

  tags:
    - always

This comes with a slight cost that you will run it even if you are doing a dry run where nothing should've changed, but it ensures you get the entry even if someone limits the run to a specific tag.