December 19, 2017

Day 19 - Infrastructure Testing: Sanity Checking VPC Routes

By: Dan Stark (@danstarkdevops)
Edited By: James Wen

Testing Infrastructure is hard; testing VPCs is even harder

Infrastructure testing is hard. My entire career I’ve tried to bring traditional development testing practices into operations. Linters. Rspec. Mock objects. These tools provide semantic and syntax checking, as well as unit and integration level coverage for infrastructure as code. Ideally, we would also test the system after the code is deployed. End-to-end infrastructure testing has always been a stretch goal – too time-consuming to implement from scratch. This is especially true of network level testing. I am not aware of any existing tools that provide self-contained, end-to-end tests to ensure VPCs, subnets, and route tables are properly configured. As a result, production network deployments can be incredibly anxiety-inducing. Recently, my coworkers and I set up an entire VPC (virtual private cloud) using infrastructure as code, but felt we needed a tool that could perform a VPC specification validation to catch bugs or typos before deployment. The goal was to sanity check our VPC routes using a real resource in every subnet.

Why is this necessary?

A typical VPC architecture may contain multiple VPCs and include peering, internal/external subnets, NAT instances/gateways, and internet gateways. In the “Reliability Pillar” of their “Well-Architected Framework” whitepaper, AWS recommends designing your system around your availability needs. At Element 84, we desired 99.9% reliability for EC2, which required three external and three internal subnets with CIDR blocks of /20 or smaller. In addition, we needed nine of these redundant VPCs to provide the required network segregation. It took significant effort to carve out VPCs with dependent rules and resources across three availability zones.

Here is a hypothetical example of multiple VPCs (Dev, Stage, Prod) over two regions:

[Image: Dev, Stage, and Prod VPCs spanning two regions]

Extending this example with additional VPCs for bastion hosts, reporting, and demilitarized zones (DMZs) for contractors, in both NonProd and Prod:

[Image: NonProd and Prod extended with bastion, reporting, and DMZ VPCs]

It’s too easy for a human to make a mistake, even with infrastructure as code.

Managing VPC infrastructure

Here’s one example of how we want a VPC to behave:

We want a Utility VPC peered to a Staging VPC. The Utility VPC contains bastion EC2 instances living in external subnets, and the Staging VPC contains application EC2 instances living in internal subnets. We want to test and ensure connectivity between these resources. We also want to verify that every bastion EC2 instance can communicate with all of the potential application EC2 instances, across all subnets. Additionally, we want to test connectivity from the application EC2 instances to the external internet, in this case via a NAT gateway.

These behaviors are well defined and should be tested. We decided to write a tool to help manage testing our VPCs and ensuring these kinds of behaviors. It contains:

  1. a maintainable, top-level DSL written in YAML that declares the VPC specification and sits above the VPC configuration code; and
  2. a mechanism to test network-level connectivity between VPCs, subnets, and IGW/NAT gateways, and to report any problems.

Introducing: VpcSpecValidator

This project is “VpcSpecValidator,” a Python 3 library built on top of boto3.

There are a few requirements in how you deploy your VPCs to use this library:

  1. You must have deployed your VPCs with CloudFormation, and each stack must have subnet Outputs whose keys contain the string “Subnet” together with either “Private” or “Public”, e.g. “DevPrivateSubnetAZ1A”, “DevPrivateSubnetAZ1B”, “DevPrivateSubnetAZ1C” (see the example Outputs snippet after this list).
  2. All VPCs should be tagged with ‘Name’ tags in your region(s).
  3. You must ensure that a security group attached to these instances allows SSH access between your VPC peers. This is not recommended for production, so you may want to remove these rules after testing.
  4. You must have permissions to create/destroy EC2 instances for complete setup and teardown. The destroy method has multiple guards to prevent you from accidentally deleting EC2 instances not created by this project.
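
For reference, here is a hypothetical Outputs section that would satisfy requirement 1; the logical names are illustrative, not mandated by the library:

Outputs:
  DevPrivateSubnetAZ1A:
    Description: Private subnet in us-east-1a
    Value: !Ref DevPrivateSubnetAZ1A
  DevPublicSubnetAZ1A:
    Description: Public subnet in us-east-1a
    Value: !Ref DevPublicSubnetAZ1A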

You supply a YAML configuration file to outline your VPCs’ structure. Using our example above, this would look like:

project_name: mycompany
region: us-east-1
availability_zones:
  - us-east-1a
  - us-east-1b
  - us-east-1c

# Environment specification
dev:
  peers:
    - nonprod-util
    - nonprod-reporting
  us-east-1a:
    public: 172.16.0.0/23
    private: 172.16.6.0/23
  us-east-1b:
    public: 172.16.2.0/23
    private: 172.16.8.0/23
  us-east-1c:
    public: 172.16.4.0/23
    private: 172.16.10.0/23

stage:
  peers:
    - nonprod-util
    - nonprod-reporting
  us-east-1a:
    public: 172.17.0.0/23
    private: 172.17.6.0/23
  us-east-1b:
    public: 172.17.2.0/23
    private: 172.17.8.0/23
  us-east-1c:
    public: 172.17.4.0/23
    private: 172.17.10.0/23

nonprod-util:
  peers:
    - dev
    - stage
  us-east-1a:
    public: 172.19.0.0/23
    private: 172.19.6.0/23
  us-east-1b:
    public: 172.19.2.0/23
    private: 172.19.8.0/23
  us-east-1c:
    public: 172.19.4.0/23
    private: 172.19.10.0/23

nonprod-reporting:
  peers:
    - dev
    - stage
  us-east-1a:
    public: 172.20.0.0/23
    private: 172.20.6.0/23
  us-east-1b:
    public: 172.20.2.0/23
    private: 172.20.8.0/23
  us-east-1c:
    public: 172.20.4.0/23
    private: 172.20.10.0/23

prod:
  peers:
    - prod-util
  us-east-1a:
    public: 172.18.0.0/23
    private: 172.18.6.0/23
  us-east-1b:
    public: 172.18.2.0/23
    private: 172.18.8.0/23
  us-east-1c:
    public: 172.18.4.0/23
    private: 172.18.10.0/23

prod-util:
  peers:
    - prod
  us-east-1a:
    public: 172.19.208.0/23
    private: 172.19.214.0/23
  us-east-1b:
    public: 172.19.210.0/23
    private: 172.19.216.0/23
  us-east-1c:
    public: 172.19.212.0/23
    private: 172.19.218.0/23

prod-reporting:
  peers:
    - prod
  us-east-1a:
    public: 172.20.208.0/23
    private: 172.20.214.0/23
  us-east-1b:
    public: 172.20.210.0/23
    private: 172.20.216.0/23
  us-east-1c:
    public: 172.20.212.0/23
    private: 172.20.218.0/23
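
As a rough illustration of how this spec can be consumed (a hedged sketch, not the library's internals), pulling the private CIDR blocks for the dev VPC is just a walk over the declared availability zones:

import yaml

# 'vpc_spec.yml' is an illustrative path for the configuration file shown above
with open('vpc_spec.yml') as ymlfile:
    cfg = yaml.safe_load(ymlfile)

env = cfg['dev']
private_cidrs = [env[az]['private'] for az in cfg['availability_zones']]
# ['172.16.6.0/23', '172.16.8.0/23', '172.16.10.0/23']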

The code will:

  1. parse the YAML for a user-specified VPC,
  2. get the public or private subnets associated with each Availability Zone’s CIDR range in that VPC,
  3. launch an instance in those subnets,
  4. identify the peering VPC(s),
  5. create a list of the test instances in the peer’s subnets (public or private, depending on what was specified in step 2),
  6. attempt a TCP socket connection to port 22 on the private IP of each instance in this list.
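
The connectivity check in step 6 boils down to a plain TCP connect attempt. A minimal sketch, with a function name and timeout that are illustrative rather than the library's actual API:

import socket

def port_is_reachable(private_ip, port=22, timeout=5):
    # True if a TCP connection to private_ip:port succeeds within the timeout
    try:
        with socket.create_connection((private_ip, port), timeout=timeout):
            return True
    except OSError:
        return False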

Step 5 posed an interesting deployment challenge. We decided UserData was a good option to bootstrap and clone the repo on an EC2 instance, but we did not know how to pass it the peered VPCs' private IP addresses as SSH targets.

Given the entire specification is in one file and the CIDR ranges are available, we can cheat and look at the Outputs of the peer(s)’ CloudFormation stack and see if any instances created in Step 3 match.

def get_ip_of_peer_instances_and_write_to_settings_file(self):

    '''
    This is run on the source EC2 instance as part of UserData bootstrapping
    1) Look at the peer(s)' VPC CloudFormation Stack's Outputs for a list of subnets, public or private as defined in the constructor.
    2) Find instances in those subnets created by this library
    3) Get the Private IP address of target instances and write it to a local configuration file
    '''
        
    # Query for peer CloudFormation, get instances
    target_subnet_list = []
    target_ip_list = []
    with open(self.config_file_path, 'r') as ymlfile:
        cfg = yaml.safe_load(ymlfile)
    
    for peer in self.peers_list:
        peer_stack_name = "{}-vpc-{}-{}".format(self.project_name, peer, cfg['region'])
    
        # Look at each peer's CloudFormation Stack Outputs and get a list of subnets (public or private)
        client = boto3.client('cloudformation')
        response = client.describe_stacks(StackName=peer_stack_name)
        response_outputs = response['Stacks'][0]['Outputs']
    
        # Keep only the subnet Outputs that match the requested subnet type
        key_marker = 'Public' if self.subnet_type == 'public' else 'Private'
        for output in response_outputs:
            if 'Subnet' in output['OutputKey'] and key_marker in output['OutputKey']:
                target_subnet_list.append(output['OutputValue'])
    
    
        # Search the instances in the targeted subnets for a Name tag of VpcSpecValidator
        client = boto3.client('ec2')
        describe_response = client.describe_instances(
            Filters=[
                {
                    'Name': 'tag:Name',
                    'Values': ['VpcSpecValidator-test-runner-{}-*'.format(peer)]
                },
                {
                    'Name': 'subnet-id',
                    'Values': target_subnet_list
                }
            ]
        )
    
        # Get Private IP addresses of these instances and write them to target_ip_list.settings
    
        for reservation in describe_response['Reservations']:
            for instance in reservation['Instances']:
                target_ip_list.append(instance['PrivateIpAddress'])
    
        # Write the list to a configuration file used at runtime for EC2 instance
        with open('config/env_settings/target_ip_list.settings', 'w') as settings_file:
            settings_file.write(str(target_ip_list))
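
The instances that the method above searches for (Name tags matching “VpcSpecValidator-test-runner-<peer>-*”) are the ones launched in step 3. Here is a hedged sketch of that launch; the AMI, key pair, subnet ID, and UserData are placeholders rather than the library's defaults, but the Name tag is chosen to match the filter above:

import boto3

# Placeholder values: the real library derives these from the YAML spec and its own settings
subnet_id = 'subnet-0123456789abcdef0'
user_data = '#!/bin/bash\n# clone the VpcSpecValidator repo and run the connectivity checks\n'

ec2 = boto3.client('ec2')
response = ec2.run_instances(
    ImageId='ami-xxxxxxxx',        # placeholder AMI
    InstanceType='t2.micro',
    KeyName='vpc-spec-validator',  # placeholder key pair
    MinCount=1,
    MaxCount=1,
    SubnetId=subnet_id,
    UserData=user_data,
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Name',
                  'Value': 'VpcSpecValidator-test-runner-dev-us-east-1a'}]
    }],
)
print(response['Instances'][0]['InstanceId'])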

There is also a friendly method to ensure that the YAML specification matches what is actually deployed via CloudFormation templates.

spec = VpcSpecValidator('dev', subnet_type='private')

spec.does_spec_match_cloudformation()
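
The implementation of that method isn't shown here, but conceptually the check can be as simple as comparing the CIDR blocks declared in the YAML spec against the subnets EC2 reports for the deployed VPC. A rough sketch under that assumption; the function and its arguments are hypothetical, not the library's API:

import boto3
import yaml

def spec_cidrs_match_deployed(env_name, config_file_path, vpc_id):
    # Hypothetical helper: compare the spec's CIDRs for env_name with the
    # subnets actually deployed in vpc_id.
    with open(config_file_path) as ymlfile:
        cfg = yaml.safe_load(ymlfile)

    expected = set()
    for az in cfg['availability_zones']:
        expected.add(cfg[env_name][az]['public'])
        expected.add(cfg[env_name][az]['private'])

    ec2 = boto3.client('ec2')
    subnets = ec2.describe_subnets(
        Filters=[{'Name': 'vpc-id', 'Values': [vpc_id]}]
    )['Subnets']
    deployed = {subnet['CidrBlock'] for subnet in subnets}

    return expected == deployed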

Next Steps

At Element 84, we believe our work benefits our world, and open source is one way to embody this value. We're in the process of open sourcing this library. Please check back soon and we'll update this blog post with a link. We will also post a link to the repo on our company Twitter.

Future features we would like to add:

  • Make the VpcSpecValidator integration/usage requirements less strict.
  • Add methods to test internet connectivity.
  • Dynamic EC2 keypair generation/destruction. The key pairs should be unique and throwaway after the test.
  • Compatibility with Terraform.
  • CI as a first-class citizen by aggregating results in JUnit-compatible format. Although I think it would be overkill to run these tests with every application code commit, it may make sense for infrastructure commits or running on a schedule.

Wrap Up - Testing VPCs is difficult but important

Many businesses use one big, poorly defined (default) VPC. There are a few problems with this:

Development resources can impact production

At a fundamental level, you want as many barriers between development and production environments as possible. This isn't necessarily to stop developers from caring about production. As an operator, I want to prevent my developers from being in a position where they might unintentionally impact production. In addition to security group restrictions, we want to make these potential mishaps impossible from a network perspective. To steal an idea from Google Cloud Platform, we want to establish “layers of security.” This tool helps enforce these paradigms by validating VPC behavior prior to deployment.

Well-defined and well-tested architecture is necessary for scaling

This exercise forced our team to think about our architecture and its future. What are the dependencies as they sit today? How would we scale to multi-region? What about third party access? We would want them in a DMZ yet still able to get the information they need. How big do we expect these VPCs to scale?

It’s critical to catch these issues before anything is deployed

The best time to find typos, configuration mistakes, and logic errors is before the networking is in use. Once deployed, these errors are hard to troubleshoot because of the built-in redundancy. The goal is to prevent an autoscaling event from yielding a “how was this ever working?” alarm at 3 AM because one subnet route table is misconfigured. That's why we feel a tool like this has a place in the community. Feel free to add comments and voice a +1 in support.

Happy SysAdvent!
