
December 22, 2017

Day 22 - Building a secure bastion host, or, 50 ways to kill your server

By: Anna Kennedy (@anna_ken_)
Edited By: Gillian Gunson (@shebang_the_cat)

Bastion (noun) 1. A projecting part of a fortification 2. A special purpose computer on a network specifically designed and configured to withstand attacks

If you deploy servers to a private network, then you also need a way to connect to them. The two most common methods are to use a VPN, or to ssh through a bastion host (also known as a jump box). Shielding services this way massively reduces your attack surface, but you need to make sure that the one server exposed to the internet is as secure as you can make it.

At Telenor Digital we have about 20 federated AWS accounts, and we wanted to avoid having to set up a complex system of VPNs. Additionally, we wanted to be able to connect to any account from anywhere, and not just from designated IP ranges. Deploying a bastion host to each account would allow us to connect easily to instances via ssh forwarding. Our preferred forwarding solution is sshuttle, a "transparent proxy server / poor man's VPN".
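A typical sshuttle invocation through such a bastion looks like the following; the bastion hostname and CIDR are placeholders for your own VPC details:

```shell
# Forward all traffic destined for the VPC's private range through an
# ssh session to the bastion (hostname and range are examples).
sshuttle --remote ubuntu@bastion.example.com 10.0.0.0/16
```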

This is where it got... interesting. At the time of writing, Amazon AWS did not have a designated bastion host instance type available. Nor, in fact, did any of the other main cloud providers, nor did there appear to be any trustworthy prebuilt bastions available from other sources. There didn't even seem to be any information about how other people were solving this problem.

I knew that we wanted a secure bastion such that:

  1. Only authorised users can ssh into the bastion
  2. The bastion is useless for anything BUT ssh'ing through

How hard could making such a bastion possibly be?

Constraints and processes

Our technology stack uses Ubuntu exclusively, and we wanted the bastion to be compatible with the various services we already deploy, such as Consul, Ansible, and Filebeat. Beyond that, I personally have a lot more experience with Ubuntu than I do with any other OS.

For these reasons we decided to base the bastion on a minimal Ubuntu install, strip out as many packages as possible, add some extra security, and make a golden image bastion AMI. Had it not been for these constraints, there might be better OSs to start with, such as Alpine Linux.

Additionally, we run everything in AWS so one or two points of the following are AWS-specific, but based on a lot of conversations it seems that the bastion problem is one that affects a much wider range of architectures.

We use Packer to build our AMIs, Ansible to set them up and Serverspec to test them, so building AMIs is a pretty fast process, typically taking about five minutes. After that we deploy everything using Terraform, so it's a quick turnaround from code commit to running instance.

Starting point: Ubuntu minimal-server

My first port of call: what packages are pre-installed in an Ubuntu minimal-server? Inspection via apt list --installed or dpkg-query -W showed over 2000 packages, and of those I was surprised how many I'd never heard of. And of the ones I had heard of, I was further surprised how many seemed, well, superfluous.

I spent some time and made a few spreadsheets trying to figure out what all the mystery packages were before I got bored and had the bright idea of leveraging Ubuntu's package rating system: all packages are labelled as one of: required, important, standard, optional, or extra.

$ dpkg-query -Wf '${Package;-40}${Priority}\n'
apt                             important
adduser                         required
at                              standard
a11y-profile-manager-indicator  optional
adium-theme-ubuntu              extra
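The priority field makes the nonessential packages easy to pick out mechanically. Here's a sketch of that selection step run against the sample listing above; on a real host you would pipe in the full dpkg-query output instead:

```shell
# Print the names of packages whose priority is optional or extra.
list_nonessential() {
  awk '$2 ~ /optional|extra/ { print $1 }'
}

# Only the last two packages of the sample should be selected.
list_nonessential <<'EOF'
apt                             important
adduser                         required
at                              standard
a11y-profile-manager-indicator  optional
adium-theme-ubuntu              extra
EOF
```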

Remove optional and extra packages

Those optional and extra packages sounded very nonessential. I was pretty sure I could rip those out with a nice one-liner and be done.

dpkg-query -Wf '${Package;-40}${Priority}\n' | awk '$2 ~ /optional|extra/ { print $1 }' | xargs -I % sudo apt-get -y purge %

Turns out this was not my best ever idea. All sorts of surprising packages were marked optional or extra and were thus unceremoniously removed, including:

  • cloud-init
  • grub
  • linux-base
  • openssh-server
  • resolvconf
  • ubuntu-server (meta-package)

It doesn't take a genius to realise that removing grub, openssh-server or resolvconf is colossally ill-advised, but even after I tried keeping those and uninstalling the rest I had no luck. Every build produced an unstable and/or unusable image, often one that didn't boot at all. Interestingly, it broke in a different way each time, possibly something to do with how fast it uprooted various dependencies before reaching an unbootable state. After quite a lot of experimenting with package-removal lists and getting apparently nondeterministic results, it was time for a new strategy.

Remove a selected list of packages

I revised my plan somewhat, realising that maybe blindly removing lots of packages wasn't the best of ideas. Instead, maybe I could look through the package list, pick out the ones that seemed the most 'useful', and remove just those. Some obvious candidates for removal were the various scripting languages, plus tools like curl and net-tools. I was pretty sure these were just peripheral to a minimal server.

Package name    Ok to remove?   Dependency
------------------------------------------------
curl            no              Consul
ed              yes             -
ftp             yes             -
gawk            yes             -
nano            yes             -
net-tools       no              sshuttle
perl            no              ssh
python 2.7      no              Ansible
python 3        no              AWS instance checks
rsync           yes             -
screen          yes             -
tar             no              Ansible
tmux            yes             -
vim             yes             -
wget            yes             -

It turns out I was incorrect. Because Consul, sshuttle, Ansible and AWS each depend on one or more of these packages, about half of my hitlist was unremovable.

To compensate for the limitations in my "remove all the things" strategy, I decided to explore limiting user powers.

Restrict user capabilities

Really, I didn't want users to be able to do anything - they should only be allowed to ssh tunnel or sshuttle through the bastion. Therefore locking down the specific commands a user could issue ought to limit potential damage. To restrict user capabilities, I found four possible methods:

  • Change all user shells to /bin/nologin
  • Use rbash instead of bash
  • Restrict allowed commands in authorized_keys
  • Remove sudo from all users
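As an illustration of the third option, a forced-command entry in authorized_keys can pin a key to tunnelling only. Everything here (key, host, port) is a placeholder, and the restrict keyword needs OpenSSH 7.2 or later:

```
# Refuses a shell (the forced command just exits), but re-enables
# port-forwarding to a single host:port for this key.
restrict,port-forwarding,permitopen="10.0.1.5:22",command="/bin/false" ssh-ed25519 AAAA...placeholder anna@laptop
```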

All seemed like good ideas - but on testing I discovered that the first three options only work for pure ssh tunnelling, and don’t work in conjunction with sshuttle.

Remove sudo

I disabled sudo by removing all users from the sudo group, which worked perfectly apart from introducing a new dimension to bastion troubleshooting: without sudo it's not possible to read the logs or perform any meaningful investigation on the instance.
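The group edit itself can be sketched as below; the group entry is hard-coded where a real build would read it from getent group sudo, and the actual gpasswd call is commented out since it needs root:

```shell
# Parse the sudo group's member list and drop each user from the group.
echo 'sudo:x:27:ubuntu,deploy' |
  cut -d: -f4 | tr ',' '\n' |
  while read -r user; do
    # gpasswd -d "$user" sudo   # the real removal; requires root
    echo "removing ${user} from sudo"
  done
```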

We offset most of the pain by having the bastion export its logs to our ELG (Elasticsearch, Logstash, Graylog) logging stack, and export metrics to Prometheus. Between these, most issues could be identified without needing sudo access on the instance. For the couple of bigger build issues I hit in the later stages of development, I built two versions of the bastion at a time, with and without sudo. A little clunky, but only temporary.

With the bastion locked down as much as possible, I then added in a few more restrictions to finalise the hardening.

Install fail2ban

An oldie but a goodie, fail2ban is fantastic at restricting logon attempts: anyone who fails to log in three times in a row is locked out for a set time period.
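A minimal jail sketch matching that behaviour; the numbers here are illustrative rather than fail2ban's defaults:

```
# /etc/fail2ban/jail.local
[sshd]
enabled  = true
# three failures within ten minutes bans the source IP for an hour
maxretry = 3
findtime = 600
bantime  = 3600
```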

Use 2FA and port knocking

Some clever folks in my team ended up making a version of sshuttle that invokes AWS two-factor authentication for our users, and implements a port knocking capability which only opens the ssh port in response to a certain request. The details of this are outside the scope of this article, but we hope to make this open-source in the near future.

Finally! A bastion!

After a lot of experiments and some heavy testing, the bastion was declared production-ready and deployed to all accounts. The final image is t2.nano sized, but we've not seen any performance problems so far, as the ssh forwarding is so lightweight.

It's now been in use for at least half a year, and it's been surprisingly stable. We have a Jenkins job that builds and tests the bastion AMI on any change to the source AMI or on code merge, and we redeploy our bastions every few weeks.

I still lie awake in bed sometimes and try to work out how to build a bastion from the ground up, but to all intents and purposes I think the one we have is just fine.


December 14, 2016

Day 14 - Terraform Deployment Strategy

Written by: Jon Brouse (@jonbrouse)
Edited by: Kerim Satirli (@ksatirli)

Introduction

HashiCorp’s infrastructure management tool, Terraform, is no doubt very flexible and powerful. The question is, how do we write Terraform code and construct our infrastructure in a reproducible fashion that makes sense? How can we keep code DRY, segment state, and reduce the risk of making changes to our service/stack/infrastructure?

This post describes a design pattern to help answer the previous questions. This article is divided into two sections, with the first section describing and defining the design pattern with a Deployment Example. The second part uses a multi-repository GitHub organization to create a Real World Example of the design pattern.

This post assumes you understand or are familiar with AWS and basic Terraform concepts such as CLI Commands, Providers, AWS Provider, Remote State, Remote State Data Sources, and Modules.

Modules

Essentially, any directory with one or more .tf files can be used as, or considered, a Terraform module. I am going to be creating a couple of module types and giving them names for reference. The first module type constructs all of the resources a service needs to be operational, such as EC2 instances, S3 buckets, etc. The remaining module types will instantiate the aforementioned module type.

The first module type is a service module. A service module can be thought of as a reusable library that deploys a single service's infrastructure. Service modules are the brains and contain the logic to create Terraform resources. They are the "how" we build our infrastructure.

The other module types are environment modules. We will run our Terraform commands within this module type. The environment modules all live within a single repository, as compared to service modules, which live in individual repositories or alongside the code of the service they build infrastructure for. This is “where” our infrastructure will be built.

Deployment Example

I am going to start by describing how we would deploy a service and then deconstruct the concepts as we move through the deployment.

Running Terraform

As mentioned earlier, the environments repository is where we actually run the Terraform command to instantiate a service module. I've written a Bash wrapper to manage the service's remote state configuration and ensure we always have the latest modules.

So instead of running terraform apply, we will run ./remote.sh apply

The apply argument will set and get our remote state configuration, run terraform get -update, and then run terraform apply

Environment Module Example

Directory Structure

The environment module’s respository contains a strict directory hierarchy:

production-account (aws-account)
|__ us-east-1 (aws-region)
    |__ production-us-east-1-vpc (vpc)
        |__ production (environment)
            |__ ssh-bastion (service) 
               |__ remote.sh <~~~~~ YOU ARE HERE
Dynamic State Management

The directory structure is one of the cornerstones of this design as it enables us to dynamically generate the S3 key for a service’s state file. Within the remote.sh script (shown below) we parse the directory structure and then set/get our remote state.
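The key derivation can be sketched in isolation; the path below stands in for $(pwd) and follows the example tree above:

```shell
# Derive the S3 bucket and state key from the directory hierarchy, exactly
# as the refresh() function does, but against a fixed example path.
path="/repos/environments/production-account/us-east-1/production-us-east-1-vpc/production/ssh-bastion"

account=$(echo "$path"     | awk -F "/" '{print $(NF-4)}')
region=$(echo "$path"      | awk -F "/" '{print $(NF-3)}')
vpc=$(echo "$path"         | awk -F "/" '{print $(NF-2)}')
environment=$(echo "$path" | awk -F "/" '{print $(NF-1)}')
service=$(echo "$path"     | awk -F "/" '{print $NF}')

echo "bucket: ${account}"
echo "key:    ${region}/${vpc}/${environment}/${service}/terraform.tfstate"
```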

A symlink of /templates/service-remote.sh to remote.sh is created in each service folder.

#!/bin/bash -e
# Must be run in the service's directory.

help_message() {
  echo -e "Usage: $0 [apply|destroy|plan|refresh|show]\n"
  echo -e "The following arguments are supported:"
  echo -e "\tapply   \t Refresh the Terraform remote state, perform a \"terraform get -update\", and issue a \"terraform apply\""
  echo -e "\tdestroy \t Refresh the Terraform remote state and destroy the Terraform stack"
  echo -e "\tplan    \t Refresh the Terraform remote state, perform a \"terraform get -update\", and issue a \"terraform plan\""
  echo -e "\trefresh \t Refresh the Terraform remote state"
  echo -e "\tshow    \t Refresh and show the Terraform remote state"
  exit 1
}

apply() {
  plan
  echo -e "\n\n***** Running \"terraform apply\" *****"
  terraform apply
}

destroy() {
  plan
  echo -e "\n\n***** Running \"terraform destroy\" *****"
  terraform destroy
}

plan() {
  refresh
  terraform get -update
  echo -e  "\n\n***** Running \"terraform plan\" *****"
  terraform plan
}

refresh() {

  account=$(pwd | awk -F "/" '{print $(NF-4)}')
  region=$(pwd | awk -F "/" '{print $(NF-3)}')
  vpc=$(pwd | awk -F "/" '{print $(NF-2)}')
  environment=$(pwd | awk -F "/" '{print $(NF-1)}')
  service=$(pwd | awk -F "/" '{print $NF}')

  echo -e "\n\n***** Refreshing State *****"

  terraform remote config -backend=s3 \
                          -backend-config="bucket=${account}" \
                          -backend-config="key=${region}/${vpc}/${environment}/${service}/terraform.tfstate" \
                          -backend-config="region=us-east-1"
}

show() {
  refresh
  echo -e "\n\n***** Running \"terraform show\"  *****"
  terraform show
}

## Begin script ##
if [ "$#" -ne 1 ] || [ "$1" = "-h" ] || [ "$1" = "--help" ]; then
  help_message
fi

ACTION="$1"

case $ACTION in
  apply|destroy|plan|refresh|show)
    $ACTION
    ;;
  *)
    echo "That is not a valid choice."
    help_message
    ;;
esac

Service Module Instantiation

In addition to remote.sh, the environment’s service directory contains a main.tf file.

module "environment" {
  source = "../"
}

module "bastion" {
  source = "git@github.com:TerraformDesignPattern/bastionhost.git"

  aws_account      = "${module.environment.aws_account}"
  aws_region       = "${module.environment.aws_region}"
  environment_name = "${module.environment.environment_name}"
  hostname         = "${var.hostname}"
  image_id         = "${var.image_id}"
  vpc_name         = "${module.environment.vpc_name}"
}

We call two modules in main.tf. The first is an environment module, which we’ll talk about in a moment, and the second is an environment service module.

Environment Service Module

Environment service module?

Every time I've introduced this term, I've seen confused looks...

An environment service module, or ESM for short, is just a way to specify, in conversation, that we are talking about the code that actually instantiates a service module.

If you look at the ESM declaration in main.tf above, you'll see it is using the output from the environment module to define variables that will be passed into the service module. If we take a step back and review our directory structure, we see the service we are deploying sits within the production environment's directory:

sysadvent-production/us-east-1/production-us-east-1-vpc/production/ssh-bastion

Within the production environment’s directory is an outputs.tf file.

output "aws_account" {
  value = "production"
}

output "aws_region" {
  value = "us-east-1"
}

output "environment_name" {
  value = "production"
}

output "vpc_name" {
  value = "production-us-east-1-vpc"
}

We are able to create an entire service, regardless of resources, with a very generic ESM and just four values from our environment module. We are using our organization's defined, somewhat colloquial terms to create our infrastructure. We don't need to remember ARNs, IDs, or other elusive information. We don't need to remember naming conventions either, as the service module takes care of those for us.

Service Module Example

So far we’ve established a repeatable ways to run our Terraform command and guarantee that our state is managed properly and consistently. We’ve also instantiated a service module from within an environment service module. We are now going to dive into the components of a service module.

A service module will be reused throughout your infrastructure, so it must be generic and parameterized. The module will create all the Terraform provider resources the service requires.

In my experience, I’ve found splitting each resource type into its own file improves readability. Below is the list of Terraform files from our example bastion host service module repository:

bastionhost
|-- data.tf
|-- ec2.tf
|-- LICENSE
|-- outputs.tf
|-- providers.tf
|-- route53.tf
|-- security_groups.tf
`-- variables.tf

The contents of most of these files will look pretty generic to the average Terraform user. The power of this pattern lies within data.tf, as it is what allows the simple instantiation shown earlier.

// Account Remote State
data "terraform_remote_state" "account" {
  backend = "s3"

  config {
    bucket = "${var.aws_account}"
    key    = "terraform.tfstate"
    region = "us-east-1"
  }
}

// VPC Remote State
data "terraform_remote_state" "vpc" {
  backend = "s3"

  config {
    bucket = "${var.aws_account}"
    key    = "${var.aws_region}/${var.vpc_name}/terraform.tfstate"
    region = "us-east-1"
  }
}

Sooooo. What populates the state file for the VPC data resources? Enter the VPC Service Module

There is no cattle.
There are no layers.
There is no spoon.

Everything is just a compartmentalized service. The module that creates your VPC is a separate “service” that lives in its own repository.

We create our account resources (DNS zones, SSL Certs, Users) within an account service module, our vpc resources from a VPC module within a vpc service module and our services (Application Services, RDS, Web Services) within an Environment Service Module.

We use a Bash wrapper to publish the state of resources in a consistent fashion.

Lastly, we abstract the complexity of infrastructure configuration management by querying Terraform state files based on a strict S3 key structure.

Real World Example

Follow along with the example by pulling the TerraformDesignPattern/environments repository. The configuration of each module within the environment repository will consist of roughly the same steps:

  1. Create the required files. Usually main.tf and variables.tf, or simply an outputs.tf file.
  2. Populate variables.tf/outputs.tf with your desired values.
  3. Create a symlink to a specific remote.sh (account-remote.sh, service-remote.sh or vpc-remote.sh) from within the appropriate directory. For example, to create the remote.sh wrapper for your account service module, issue the following from within your environments/$ACCOUNT directory: ln -s ../templates/account-remote.sh remote.sh
  4. Run ./remote.sh apply
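The symlink setup can be sketched in a scratch directory standing in for the environments repository:

```shell
# Mimic the repository layout, then link the account wrapper into place.
# Directory names mirror the Real World Example; mktemp keeps it disposable.
demo=$(mktemp -d)
mkdir -p "$demo/templates" "$demo/sysadvent-production"
touch "$demo/templates/account-remote.sh"

cd "$demo/sysadvent-production"
ln -s ../templates/account-remote.sh remote.sh
readlink remote.sh        # prints ../templates/account-remote.sh
# ./remote.sh apply       # the real run, from a real account directory
```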

Prerequisites

  • Domain Name: I went over to Namecheap.com and grabbed the sysadvent.host domain name for $0.88.
  • State File S3 Bucket: Create the S3 bucket to store your state files in. This should be the name of the account folder within your environments repository. For this example I created the sysadvent-production S3 bucket.
  • SSH Public Key: As of the writing of this post, the aws_key_pair Terraform resource does not support creating a key pair, only importing an existing public key.
  • SSL ARN: AWS Certificate Manager offers free SSL/TLS certificates.

Getting Started

This Real World Example assumes you are provisioning a new AWS account and domain. For those working in a brownfield environment, the following section provides a quick example of how to build scaffolding that can be used to deploy the design pattern.

State Scaffolding or Fake It ’Til You Make It (With Terraform)

Within the environments/sysadvent-production account directory is an s3.tf file that creates a dummy_object in S3:

resource "aws_s3_bucket_object" "object" {
  bucket = "${var.aws_account}"
  key    = "dummy_object"
  source = "outputs.tf"
  etag   = "${md5(file("outputs.tf"))}"
}

A new object will be uploaded whenever outputs.tf is changed. This change updates the remote state file, and thus any outputs that have been added to outputs.tf will be added to the remote state file as well. To use a resource (IAM Role, VPC ID, or Zone ID) that was not created with Terraform, simply add the desired data to the account, vpc, or ESM's outputs.tf file. Since not all resources can be imported via data resources, this enables us to migrate in small, iterative phases.

In the example below, the AWS Account ID will be added to the account’s state file via this mechanism. The outputs.tf file defines the account ID via the aws_account variable:

output "aws_account" {
  value = "${var.aws_account}"
}

Stage One: Create The Account Service Module (ASM)

Working Directory: environments/sysadvent-production

As per the name, this is where account-wide resources are created, such as DNS zones or Cloudtrail logs. The sysadvent-production ASM will create the following:

  • Cloudwatch Log Stream for the account’s Cloudtrail
  • Import a public key
  • Route53 Zone
  • The "scaffolding" S3 dummy_object from outputs.tf, publishing:
      • AWS Account ID
      • Domain Name
      • SSL ARN

Populate Variables

Populate the variables in the account’s variables.tf file:

aws_account - name of the account level folder
aws_account_id - your AWS provided account ID
domain_name - your chosen domain name
key_pair_name - what you want to name the key you are going to import
ssl_arn - ARN of the SSL certificate you created, free, with Amazon's Certificate Manager
public_key - the actual public key you want to import as your key pair

Execute!

Once you have created a state file S3 bucket and populated the variables.tf file with your desired values, run ./remote.sh apply.

Stage Two: Create The VPC Service Module (VSM)

Working Directory: environments/sysadvent-production/us-east-1/production-us-east-1-vpc

The TerraformDesignPattern/vpc module creates the following resources:

  • VPC
  • Enable VPC Flow Logs
  • Cloudwatch Log Stream for the VPC’s Flow Logs
  • Flow Log IAM Policies
  • An Internet Gateway
  • Three private subnets
  • Three public subnets with nat gateways and elastic IP addresses
  • Routes, route tables, and associations

Populate Variables

Populate the following variables in the VSM’s variables.tf file:

availability_zones
aws_region
private_subnets
public_subnets
vpc_cidr
vpc_name

Execute!

Once you have populated the variables.tf file with your desired values, create the resources by running ./remote.sh apply.

Stage Three: Create The Environment Module

Working Directory: environments/sysadvent-production/us-east-1/production-us-east-1-vpc/production

The environment module stores the minimal amount of information required to pass to an environment service module. As mentioned previously, this module consists of a single outputs.tf file which requires you to configure the following:

aws_account
aws_region
environment_name
vpc_name

Stage Four: Create An Environment Service Module (ESM)

Working Directory: environments/sysadvent-production/us-east-1/production-us-east-1-vpc/production/elk

Congratulations, you’ve made it to the point where this wall of text really pays off.

Create the main.tf file. The ELK ESM calls an environment module, an ami_image_id module, and the ELK service module. The environment module supplies environment-specific data such as the AWS account, region, environment name and VPC name. This module data is, in turn, passed to the ami_image_id module. The ami_image_id module will return AMI IDs based on the environment's region.

The ELK ESM will create the following resources:

  • A three-instance ELK stack within the previously created private subnets.
  • The Elasticsearch instances will be clustered via the EC2 discovery plugin.
  • A public facing ELB to access Kibana.
  • A Route53 DNS entry pointing to the ELB.

Execute!

Once you have created the main.tf file, create the resources by running ./remote.sh apply.

Appendix

Gotchas

During the development of this pattern I stumbled across a couple of gotchas. I wanted to share them with you, but didn't think they were necessarily pertinent to an introductory article.

Global Service Resources

Example: IAM Roles

We want to keep the creation of IAM roles in a compartmentalized module, but IAM roles are global, so they can only be created once. Using count and a lookup based on our region, we can tell Terraform to create the IAM role in a single region only.

Example iam.tf file:

...
count = "${lookup(var.create_iam_role, var.aws_region, 0)}"
...

Example variables.tf file:

...
variable "create_iam_role" {
  default = {
    "us-east-1" = 1
  }
}
...

Referencing an ASM or VSM From an ESM

The ASM and VSM create account and VPC resources, respectively. If you were to reference an ASM or VSM, à la an environment module, from within an ESM, you would essentially be attempting to recreate the resources originally created by the ASM and VSM.