A Practical Guide to Deploying Netflix's BLESS Certificate Authority

Foreword:

Special thanks are due to Ryan Lane and Chris Steipp of Lyft, both for authoring their open-source BLESS client and for specific help via Slack when I ran into things I didn't understand. Thank you also to Russell Lewis of Netflix for writing BLESS in the first place!

You can find the project code here: https://github.com/crielly/bless
The BLESS project is located here: https://github.com/Netflix/bless.git
Lyft's BLESS client project is located here: https://github.com/lyft/python-blessclient

You'll need all three.

What is BLESS?

BLESS is a Lambda function that acts as an SSH Certificate Authority. A user can request a short-lived certificate for their public key, which they then present to a server for authentication - provided that the server is configured to trust the CA's public key.
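
To see what that looks like in practice, you can inspect any issued certificate with ssh-keygen; the filename here matches the client setup shown at the end of this guide:

# Show the principals, validity window, and critical options the CA signed
ssh-keygen -L -f ~/.ssh/blessid-cert.pub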

When I learned of the Netflix project known as BLESS, I was intrigued by two huge wins that it enables:

  • We no longer need to manage deployment of user public keys to instances
  • We can enforce multi-factor authentication on SSH using IAM's existing MFA (using Lyft's blessclient and KMS auth)

The Challenge

As with many open source projects, the available documentation is very minimal. The BLESS readme really just outlines how to compile the Lambda function's dependencies and how to configure your instances' sshd to trust the BLESS CA. There's a ton of work between git clone and successfully authenticating via BLESS that isn't explained anywhere, and the readme assumes fairly detailed knowledge of AWS Lambda, and of IAM in particular.

This article hopes to fill that gap by providing the necessary implementation details in the form of code examples.

Note on Terragrunt

I've opted to use Terragrunt to handle backend definitions for this project, but you are not obligated to use it. If you'd rather not, just remove the terraform.tfvars files and define your own backends per the Terraform documentation.
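
For reference, this is roughly what the shared backend definition looked like in a terraform.tfvars of that Terragrunt era - a sketch, with placeholder bucket, lock table, and region values you'd substitute for your own:

terragrunt = {
  remote_state {
    backend = "s3"
    config {
      bucket         = "your-terraform-state-bucket"
      key            = "${path_relative_to_include()}/terraform.tfstate"
      region         = "us-west-2"
      encrypt        = true
      dynamodb_table = "your-terraform-lock-table"
    }
  }
}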

To Do List:

  • Compile Dependencies with make lambda-deps
  • Generate RSA Keypair with password
  • Update Ansible role "bless-ca" defaults/main.yml with public key(s)
  • Generate a base64-encoded, KMS-encrypted copy of that password for the bless_deploy.cfg
  • make publish to generate a .zip for upload to lambda
  • cp that zip to the terraform/BLESS directory
  • edit terraform/terraform.tfvars with appropriate values for your S3 bucket backend and DynamoDB lock table
  • terragrunt apply

Note: the crielly/bless repo I linked includes other subdirectories in the terraform directory - these aren't strictly necessary to get Lambda up and running. The cloudtrail directory provides audit logging for your AWS API calls. iam-global contains an account password policy and a couple of handy group definitions. One way or another you'll need a group with permission to invoke the Lambda.

Project Structure

tree -aF  
.
├── ansible/
│   ├── base-server-bless.json
│   ├── base-server.yml
│   └── roles/
│       ├── bless-ca/
│       │   ├── defaults/
│       │   │   └── main.yml
│       │   ├── handlers/
│       │   │   └── main.yml
│       │   ├── tasks/
│       │   │   └── main.yml
│       │   └── templates/
│       │       └── cas.pub.j2
│       └── iamsync/
│           ├── defaults/
│           │   └── main.yml
│           ├── files/
│           │   └── iamsync.py
│           ├── tasks/
│           │   └── main.yml
│           └── templates/
│               ├── iamsync-logrotate.j2
│               └── sudoers.j2
├── README.md
└── terraform/
    ├── base/
    │   ├── base-vars-data.tf
    │   └── terraform.tfvars
    ├── BLESS/
    │   ├── base-vars-data.tf -> ../base/base-vars-data.tf
    │   ├── bless_lambda.zip
    │   ├── iam-invoke.tf
    │   ├── iam-kms.tf
    │   ├── iam-lambda.tf
    │   ├── kms-key.tf
    │   ├── lambda-fn.tf
    │   ├── .terraform/
    │   │   └── terraform.tfstate
    │   └── terraform.tfvars
    ├── cloudtrail/
    │   ├── base-vars-data.tf -> ../base/base-vars-data.tf
    │   ├── cloudtrail.tf
    │   ├── s3-bucket.tf
    │   ├── .terraform/
    │   │   └── terraform.tfstate
    │   └── terraform.tfvars -> ../base/terraform.tfvars
    ├── iam-global/
    │   ├── base-vars-data.tf -> ../base/base-vars-data.tf
    │   ├── groups.tf
    │   ├── password-policy.tf
    │   ├── .terraform/
    │   │   └── terraform.tfstate
    │   └── terraform.tfvars -> ../base/terraform.tfvars
    └── terraform.tfvars

Build Lambda Dependencies

Docker actually makes this very easy. Once you clone the Netflix/bless repo, you'll find that the included Makefile has a make lambda-deps recipe. Use it, and you'll end up with a directory called aws_lambda_libs full of the appropriate compiled dependencies.
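
In practice it's just a couple of commands, assuming Docker is installed and running:

git clone https://github.com/Netflix/bless.git
cd bless
make lambda-deps   # compiles dependencies into ./aws_lambda_libs inside a Docker container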

Generate a CA Keypair

Run ssh-keygen -t rsa -b 4096 -f bless-ca- -C "SSH CA Key" - the file name is arbitrary and the comment is optional. Provide a password for the key: you won't need it day to day, but the Lambda does - encrypted with KMS and base64 encoded, as covered in the next section - in order to sign certificates for you. Make a second directory, called lambda_configs, and put the private key inside it. Remember to chmod 0644 the key file - the Lambda will fail to read it otherwise. Also, now that you have a public key available, open up ansible/roles/bless-ca/defaults/main.yml and add it to the provided list. The "region" field in that list isn't relevant to deployment; it's just there to help you keep track of your keys.
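
Condensed into a shell session (run from the root of your Netflix/bless clone; the file and directory names follow the text above):

ssh-keygen -t rsa -b 4096 -f bless-ca- -C "SSH CA Key"   # prompts for a password
mkdir -p lambda_configs
mv bless-ca- lambda_configs/
chmod 0644 lambda_configs/bless-ca-                      # the Lambda must be able to read it
# bless-ca-.pub gets pasted into ansible/roles/bless-ca/defaults/main.yml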

Encrypt Private Key Password as Base64

This needs to be done via a KMS key, which you can create via Terraform like this:

resource "aws_kms_key" "BLESS" {  
  description             = "BLESS KMS for ${var.REGION}"
  deletion_window_in_days = 10
  tags {
    Name      = "BLESS-${var.REGION}"
    Terraform = "True"
  }
}

resource "aws_kms_alias" "BLESS" {  
  name          = "alias/BLESS-${var.REGION}"
  target_key_id = "${aws_kms_key.BLESS.key_id}"
}

The actual encryption can be performed via a second Lambda function, or via Python on a local dev machine, as long as your user has the IAM permissions needed to call kms:Encrypt:

import boto3  
import base64  
import os

# If running as a Lambda, include the event and context parameters. If running locally, remove those parameters from the function call and simply execute it (or pass in KeyID and Plaintext as parameters if you prefer).

# Lambda Function
def lambda_handler(event, context):  
    region = os.environ['AWS_REGION']
    client = boto3.client('kms', region_name=region)
    response = client.encrypt(
        KeyId='alias/your_kms_key',
        Plaintext='Do not forget to delete the real plain text when done'
    )

    ciphertext = response['CiphertextBlob']
    return base64.b64encode(ciphertext)

# Run locally - keyid is the text of your KMS alias; using region us-west-2, mine would be alias/BLESS-us-west-2
# plaintext is the password you used to generate the RSA key you're uploading as part of the Lambda
# (make sure AWS_REGION is set in your environment when running locally)

def get_base64_key(keyid, plaintext):  
    region = os.environ['AWS_REGION']
    client = boto3.client('kms', region_name=region)
    response = client.encrypt(
        KeyId=keyid,
        Plaintext=plaintext
    )

    ciphertext = response['CiphertextBlob']
    return base64.b64encode(ciphertext)
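
If you'd rather skip Python entirely, the AWS CLI can do the same thing; note that the CLI already returns the CiphertextBlob base64-encoded (this is AWS CLI v1 behaviour - v2 expects the plaintext pre-encoded unless you pass --cli-binary-format raw-in-base64-out):

aws kms encrypt \
    --key-id alias/BLESS-us-west-2 \
    --plaintext "your-ca-key-password" \
    --query CiphertextBlob \
    --output text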

Bless Config File

Now that we have the encrypted, base64-encoded password, we have everything we need to fill in the bless_deploy.cfg you'll be putting in the lambda_configs directory:

# This section and its options are optional
[Bless Options]

# Number of seconds +/- the issued time for the certificate to be valid
certificate_validity_after_seconds = 120  
certificate_validity_before_seconds = 120

# Minimum number of bits in the system entropy pool before requiring an additional seeding step
entropy_minimum_bits = 2048

# Number of bytes of random to fetch from KMS to seed /dev/urandom
random_seed_bytes = 256

# Set the logging level
logging_level = DEBUG

# Comma separated list of the SSH Certificate extensions to include. Not specifying this uses the ssh-keygen defaults:
# certificate_extensions = permit-X11-forwarding,permit-agent-forwarding,permit-port-forwarding,permit-pty,permit-user-rc
# Username validation options are described in bless_request.py:USERNAME_VALIDATION_OPTIONS
# Configure how bastion_user names are validated.
# username_validation = useradd
# Configure how remote_usernames names are validated.
# remote_usernames_validation = principal

# These values are all required to be modified for deployment
[Bless CA]
us-west-2_password = AQICAHjOafinwoienf389/1I8mgIfTWd9rs1gRoksJq1i3xNIO4wH1Pf0HwchHNqjxAfTXqVBiAAMNrfiwejo38kiG9w0BBwagbTBrAgEAMGYGCSqGSIb3DQEHATAeBglghkgBZQMEAS4wEQQMA1+FfJa+Tobi29EQgDnJmWvPQELkH8QmooEGDyFAaeYldswogYEdnluc9kI7nldnvc0YtV+Fq3248noNJABjNtFHOZEwbUE=

# Specify the file name of your SSH CA's Private Key in PEM format.
ca_private_key_file = bless-ca-

[KMS Auth]
# Enable kmsauth, to ensure the certificate's username matches the AWS user
use_kmsauth = True

# One or multiple KMS keys, setup for kmsauth (see github.com/lyft/python-kmsauth)
kmsauth_key_id = 7a013fb0-69b2-2807-123c-59f56e834b10

# If using kmsauth, you need to set the kmsauth service name. Users need to set the 'to'
# context to this same service name when they create a kmsauth token.
# This is done in the blessclient.cfg when using the Lyft blessclient
kmsauth_serviceid = bless  
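
One gotcha: kmsauth_key_id wants the key's UUID, not its alias. If you only know the alias, you can look the UUID up with the AWS CLI:

aws kms describe-key \
    --key-id alias/BLESS-us-west-2 \
    --query KeyMetadata.KeyId \
    --output text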

With this in place, we can run make publish to generate a zipped Lambda for upload. You'll need to put it in the BLESS directory under terraform.
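
Something like the following, assuming make publish drops its artifact in ./publish (check the Makefile for the exact output path) and substituting the path to your own infrastructure repo:

make publish                                      # zips the handler, configs, and compiled deps
cp publish/bless_lambda.zip /path/to/your/repo/terraform/BLESS/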

IAM Policies

We need to allow the Bless Lambda function to use the decrypt function on our KMS key, since we're using it to encrypt our private key's password:

resource "aws_iam_policy" "BLESS-kms-decrypt" {  
  name        = "BLESS-kms-decrypt-${var.REGION}"
  description = "BLESS-kms-decrypt-${var.REGION}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowKMSDecryption",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": [
        "${aws_kms_key.BLESS.arn}"
      ]
    }
  ]
}
EOF  
}

resource "aws_iam_role_policy_attachment" "BLESS-kms-decrypt" {  
  role       = "${aws_iam_role.BLESS-lambda.name}"
  policy_arn = "${aws_iam_policy.BLESS-kms-decrypt.arn}"
}

Lyft's BLESS client utilizes a system they dub KMSauth to prove that the user calling the function authenticated to AWS at a specific, recent point in time. As per their readme, we'll allow this via a policy that enforces 2FA and ensures that the cert is only valid for an SSH user matching the name of the IAM user attempting to invoke the Lambda:

resource "aws_iam_role" "BLESS-invoke" {  
  name        = "BLESS-invoke-${var.REGION}"
  description = "BLESS-invoke-${var.REGION}"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "AWS": "arn:aws:iam::${data.aws_caller_identity.current.account_id}:root"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF  
}

resource "aws_iam_policy" "BLESS-invoke" {  
  name        = "BLESS-invoke-${var.REGION}"
  description = "BLESS-invoke-${var.REGION}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
          "lambda:InvokeFunction"
      ],
      "Resource": [
          "${aws_lambda_function.BLESS.arn}"
      ]
    },
    {
      "Sid": "",
      "Effect": "Allow",
      "Action": [
          "iam:GetUser"
      ],
      "Resource": [
          "arn:aws:iam::${data.aws_caller_identity.current.account_id}:user/$${aws:username}"
      ]
    },
    {
      "Sid": "AllowKMSEncryptIfMFAPresent",
      "Action": "kms:Encrypt",
      "Effect": "Allow",
      "Resource": [
        "${aws_kms_key.BLESS.arn}"
      ],
      "Condition": {
        "StringEquals": {
          "kms:EncryptionContext:to": [
            "bless"
          ],
          "kms:EncryptionContext:user_type": "user",
          "kms:EncryptionContext:from": "$${aws:username}"
        },
        "Bool": {
          "aws:MultiFactorAuthPresent": "true"
        }
      }
    }
  ]
}
EOF  
}

resource "aws_iam_role_policy_attachment" "BLESS-invoke" {  
  role       = "${aws_iam_role.BLESS-invoke.name}"
  policy_arn = "${aws_iam_policy.BLESS-invoke.arn}"
}

As you can see, this is a role. The only thing that actually needs to be attached to a group of users to allow execution of the Lambda is a policy allowing that role to be assumed - specifically, the role named in your blessclient.cfg. The role itself is what carries execution permission on the Lambda and encryption privileges on the KMS key we defined.

resource "aws_iam_policy" "BLESS-assume-invoke-role" {  
  name        = "BLESS-STS-${var.REGION}"
  description = "BLESS-STS-${var.REGION}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowAssumeInvokeRole",
      "Effect": "Allow",
      "Action": [
          "sts:AssumeRole"
      ],
      "Resource": [
          "${aws_iam_role.BLESS-invoke.arn}"
      ]
    },
    {
      "Sid": "AllowIndividualUserToListTheirOwnMFA",
      "Effect": "Allow",
      "Action": [
          "iam:ListVirtualMFADevices",
          "iam:ListMFADevices"
      ],
      "Resource": [
          "arn:aws:iam::${data.aws_caller_identity.current.account_id}:mfa/*",
          "arn:aws:iam::${data.aws_caller_identity.current.account_id}:user/$${aws:username}"
      ]
    }
  ]
}
EOF  
}

Lastly, the Lambda itself needs a handful of permissions to do its job:

resource "aws_iam_role" "BLESS-lambda" {  
  name        = "BLESS-lambda-${var.REGION}"
  description = "BLESS-lambda-${var.REGION}"

  assume_role_policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": "sts:AssumeRole",
      "Principal": {
        "Service": "lambda.amazonaws.com",
        "AWS": "arn:aws:sts::${data.aws_caller_identity.current.account_id}:assumed-role/${aws_iam_role.BLESS-invoke.name}/mfaassume"
      },
      "Effect": "Allow",
      "Sid": ""
    }
  ]
}
EOF  
}

resource "aws_iam_policy" "BLESS-lambda" {  
  name        = "BLESS-lambda-${var.REGION}"
  description = "BLESS-lambda-${var.REGION}"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "kms:GenerateRandom",
        "logs:CreateLogGroup",
        "logs:CreateLogStream",
        "logs:PutLogEvents"
      ],
      "Effect": "Allow",
      "Resource": "*"
    },
    {
      "Sid": "AllowKMSDecryption",
      "Effect": "Allow",
      "Action": [
        "kms:Decrypt",
        "kms:DescribeKey"
      ],
      "Resource": [
        "${aws_kms_key.BLESS.arn}"
      ]
    }
  ]
}
EOF  
}

resource "aws_iam_role_policy_attachment" "BLESS-lambda" {  
  role       = "${aws_iam_role.BLESS-lambda.name}"
  policy_arn = "${aws_iam_policy.BLESS-lambda.arn}"
}

And, perhaps anticlimactically, a definition for the Lambda itself - as you can see, it'll look for a zipped copy of your Lambda code:

resource "aws_lambda_function" "BLESS" {  
  filename         = "bless_lambda.zip"
  function_name    = "BLESS"
  role             = "${aws_iam_role.BLESS-lambda.arn}"
  handler          = "bless_lambda.lambda_handler"
  source_code_hash = "${base64sha256(file("bless_lambda.zip"))}"
  runtime          = "python2.7"
  kms_key_arn      = "${aws_kms_key.BLESS.arn}"
  timeout          = 30
}
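
Once terragrunt apply completes, a quick sanity check (outside Terraform) confirms the function landed:

aws lambda get-function --function-name BLESS --region us-west-2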

BLESS Client

At this point, you should have a functional Lambda and the ability to invoke it. In order to successfully get a certificate and establish an SSH connection with it, the following conditions must exist:

  • You must have IAM credentials with the BLESS-assume-invoke-role policy attached, either directly or to a group
  • Your IAM username must match a valid Linux user on the box you wish to connect to
  • Your blessclient.cfg (part of Lyft's blessclient) must be correctly configured (more to follow)
  • Any box you wish to connect to must be configured to trust your Lambda as a cert authority

The blessclient.cfg is quite straightforward. Note that the kms_service_name must match the kmsauth_serviceid in the bless_deploy.cfg inside the Lambda zip, which must in turn match the kms:EncryptionContext:to condition inside the BLESS-invoke IAM policy. Other values should be fairly self-explanatory, though a note for those new to Lambda: $LATEST is a sane default for functionversion, but you can point it at a specific published version or alias instead. This is useful when you're testing a new version or configuration - publish it under a specific alias and call it by that alias.

[MAIN]
# region_aliases: These are regions that can be passed on the commandline of blessclient,
# using the --region option, to specify the AWS region. You must have at least one region
# defined. If the client can't connect to aws services in the region, it will try the next
# region.
region_aliases: WEST

# kms_service_name: Name that will be set in the "To" context for the kmsauth token. Your
# Lambda must have permissions to decrypt with each of the kms keys when the "To" context
# is set to this string. Setting policy appropriately can prevent a staging/dev kmsauth
# token from being used to authenticate to the production Lambda.
kms_service_name: bless

# bastion_ips: These IPs and/or netmasks will be added as valid source IPs to every
# certificate issued. If your users proxy / agent-forward through a bastion host, then
# the internal IP of each should be listed here.
bastion_ips: 203.0.113.10/32

[CLIENT]
# domain_regex: A (python) regex that is tested by the blessclient to determine if we need
# to run bless and get a certificate, or if we can skip it. This prevents blessclient from
# making your users wait to get a certificate when they connect to github, etc.
domain_regex: (.*\.somedomain\.com||\A10\.100(?:\.[0-9]{1,3}){2}\Z)$

# cache_dir / cache_file: file and directory (in the user's home directory) where we cache
# information about the user. Blessclient will cache AWS tokens here, so the directory should
# have permissions to only let the user read the cache.
cache_dir: .bless/session  
cache_file: bless_cache.json

# mfa_cache_dir / mfa_cache_file: If your organization has another tool that generates and
# caches AWS tokens for your users, you can list it here. Blessclient will attempt to use
# any cached credentials to identify the user, to reduce the number of times the user must
# input their MFA code. TODO: make the client gracefully not use this by default.
mfa_cache_dir: .aws/session  
mfa_cache_file: token_cache.json

# ip_urls: comma-separated list of urls that can provide a user's public IP address. This
# IP will be added as an authorized IP to the user's certificate, preventing a stolen
# SSH certificate from being used by another IP.
ip_urls: http://api.ipify.org, http://canihazip.com

# update_script: This script will be called after 7 days of use, so you can push updates
# to your users. Your update script should use some mechanism to verify the integrity of
# the code. Script is relative to the path where blessclient was downloaded.
update_script: update_blessclient.sh

# user_session_length: The length of time that we request AWS issues the session tokens for
# when the user inputs their MFA code. This defaults to 64800 seconds (18 hours). The value
# must be in the range 900-129600, or the sts call will fail.

# usebless_role_session_length: The length of time that we request AWS issues the session
# tokens for when the user assumes the role necessary to call the BLESS Lambda. The default
# is 3600 seconds (1 hour). The value must be in the range 900-3600.

[LAMBDA]
# user_role: IAM Role that the user will assume, in order to run the BLESS Lambda. This
# role should be in the same AWS account as your Lambda.
user_role: BLESS-invoke-us-west-2

# account_id: AWS account id where the BLESS Lambda is setup. For production, you probably
# want the Lambda running in a separate AWS account, to better protect the CA private key.
account_id: 12345678910

# functionname: The name of the BLESS Lambda function
functionname: BLESS

# functionversion: The version alias we use when invoking the Lambda. If you make a change
# to the Lambda function's api, then you can bump this version, and new versions of the client
# code will access the new Lambda. You can also have a set of users call a "canary" version of
# the Lambda, to test new changes. See the AWS Lambda docs for information about aliases.
functionversion: $LATEST

# certlifetime: Let the client know how long the Lambda will set the certificate's validity.
# This DOES NOT control the time limit, but lets blessclient know how long to use a certificate
# before refreshing. TODO: read this directly from the certificate.
certlifetime: 120

# ipcachelifetime: How long to cache the user's current public IP address, before querying
# the ip_urls to see if the user's IP has changed since we last issued a certificate. If your
# users work from one place, you can set this long (to reduce the time to issue a cert), but
# if they move around a lot (e.g., ssh-ing from a moving vehicle while tethered) then decrease
# this. Users can set BLESSIPCACHELIFETIME in their environment to temporarily change this.
ipcachelifetime: 120

# timeout_connect / timeout_read: Set connection timeouts (in seconds) for the boto3 connection
# to the AWS Lambda. If the connection fails, the client will try in the next AWS region.
timeout_connect: 5  
timeout_read: 10

# REGION sections (REGION_<ALIAS>, for each region_aliases defined). Must have the AWS
# region specified, as well as the kmsauth key in that region.

[REGION_WEST]
awsregion: us-west-2  
kmsauthkey: 9a177fd9-89b9-1471-822c-72f27e914b10  

User Management

The IAM credentials are simple enough - you need a group with the above policy attached (allowing members to assume the invoke role), and your user must be a member of that group. The way KMSauth works, you cannot make an encryption request to KMS without a valid MFA token, and the cert that comes back is only valid for a user matching your IAM username. The users deployed to your boxes must therefore correspond to the IAM users you want to be able to invoke the Lambda function. There are quite a few ways to accomplish this, from deploying users via config management to LDAP, NSSCache, etc. We've opted for something much simpler - we make IAM the source of truth for users on our Linux boxes.

All that's required are some basic IAM read-only permissions. Succinctly put, we exploit the "path" attribute available when creating user groups in IAM. We create groups using, for example, the path "/iamsync/":

resource "aws_iam_group" "devs" {  
  name = "devs"
  path = "/iamsync/"
}

resource "aws_iam_group_policy_attachment" "BLESS-invoke-devs" {  
  group      = "${aws_iam_group.devs.name}"
  policy_arn = "${data.terraform_remote_state.BLESS.bless-assume-arn}"
}

output "devs-arn" {  
  value = "${aws_iam_group.devs.arn}"
}

As you can see, I've also attached the policy we created earlier which allows assuming the BLESS-invoke role. Now, using some python glory, we can dynamically fetch, create, and update our users directly from IAM. My code depends on the groups already being present on the box and having their permissions configured via Ansible (baked into the base AMI in fact) - but there is nothing preventing you from doing that via python or any other language, of course. The whole chain looks as follows:

First, a Packer template - this fetches the latest Ubuntu Trusty image from Canonical's public AMI collection, uses it as a base, then runs our playbook against it:

{
  "builders": [
    {
      "type": "amazon-ebs",
      "region": "us-east-1",
      "ami_regions": ["us-east-1"],
      "source_ami_filter": {
          "filters": {
            "name": "*ssd/ubuntu-trusty-14.04-amd64-server*",
            "architecture": "x86_64",
            "virtualization-type": "hvm",
            "root-device-type": "ebs"
          },
          "owners": ["099720109477"],
          "most_recent": true
      },
      "ami_virtualization_type": "hvm",
      "force_deregister": true,
      "instance_type": "m3.medium",
      "ssh_username": "ubuntu",
      "ami_name": "base-server-bless_{{timestamp}}"
    }
  ],
  "provisioners": [
    {
      "type": "ansible",
      "playbook_file": "base-server.yml",
      "groups": ["all"],
      "user": "ubuntu"
    }
  ]
}

This could of course run as many playbooks and roles as desired, but in this case the only goals are to sync users and configure the server to trust our CA. To that end, we lay down two cron jobs - one to run @reboot, the other periodically. I won't copy/paste all of the Ansible code here as it's available in the repo I linked at the top - for the purposes of syncing users, these are the relevant bits. Not shown are things like logrotate templates and the creation of user groups with specific privileges (passwordless sudo for Ops, the ability to ls to your heart's content if you're a dev), which are very well documented elsewhere.

- name: copy iamsync.py to disk
  copy:
    src: iamsync.py
    dest: /usr/local/lib/iamsync.py
    owner: root
    group: root
    mode: 0500

- name: Implement logrotate for iamsync.log
  template:
    src: iamsync-logrotate.j2
    dest: /etc/logrotate.d/iamsync
    owner: root
    group: root
    mode: 0644

- name: ensure iamsync.py runs every x minutes
  cron:
    name: "Run iamsync.py every 10 minutes"
    minute: "*/10"
    user: root
    job: "/usr/bin/python /usr/local/lib/iamsync.py {{ iam_pathprefix }}"

- name: ensure iamsync.py runs on boot
  cron:
    name: "Run iamsync.py at boot"
    special_time: reboot
    user: root
    job: "/usr/bin/python /usr/local/lib/iamsync.py {{ iam_pathprefix }}"

- name: set SHELL in root crontab
  cronvar:
    user: root
    name: SHELL
    value: /bin/bash
    state: present

- name: set PATH in root crontab
  cronvar:
    user: root
    name: PATH
    value: /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin
    state: present

The last two tasks are necessary because of my use of subprocess.call(). To my great surprise - or perhaps a failure in my google-fu - I didn't find a pure-Python way to create or delete users, so I do it via subprocess calls to the shell. These fail silently if you don't have a PATH and SHELL defined in the crontab, and since this runs as root you need to set them explicitly (at least, I did - I only tested this on Ubuntu 14.04, so YMMV).

Depending on the size of your fleet and the desired response time from "I've removed a user from IAM" to "they're gone from all our servers", you may want to lengthen or shorten the period between cron runs. The moment you remove a user from IAM, they lose the ability to get a new certificate since they can no longer invoke the Lambda - but that does nothing to kill existing SSH connections, and the account itself still needs to be cleaned up by the next sync run. Similarly, when you add a new user to IAM, they only need to wait until the next run before they're able to SSH into boxes - no deploy needed.

"""
iamsync.py  
Given an IAM path prefix, this script will fetch a list of groups in that path  
and then fetch the members of those groups and ensure that they match the members  
in the matching local unix groups.  
Usage:  
    iamsync.py <pathprefix> [--logpath=<l>]
"""

import boto3, pwd, grp, logging, logging.handlers, subprocess  
from docopt import docopt

_LOGGER = logging.getLogger('pythonLogger')

def get_local_group_membership(groupname):
    try:
        return grp.getgrnam(groupname)[3]
    except KeyError:
        print('Group {} does not exist.'.format(groupname))
        return []
    except Exception as e:
        _LOGGER.error(e)
        return []

def get_iam_groups(pathprefix):  
    # Construct a list of IAM groups to sync by listing all groups present
    # at the designated IAM path prefix
    try:
        iamgroups = []
        iam = boto3.client('iam')
        groups = iam.list_groups(PathPrefix=pathprefix)
        for g in groups['Groups']:
            iamgroups.append(g['GroupName'])
        return iamgroups
    except Exception as e:
        _LOGGER.error(e)
        return []

def remove_defunct_users(groups, pathprefix):  
    allusers = pwd.getpwall()
    defunct = []
    try:
        for u in allusers:
            delete = False
            # set delete to True if the user's comment is iamsync
            if pwd.getpwnam(u.pw_name).pw_gecos == "iamsync":
                delete = True

            # Check each group's membership for the user
            # If any group has that user as a member, set delete to False
            for g in groups:
                if u.pw_name in grp.getgrnam(g).gr_mem:
                    delete = False

            # If delete is True, add to list of defunct users
            if delete:
                defunct.append(u.pw_name)
        if defunct:
            _LOGGER.info("Defunct Users: {}".format(defunct))

        for u in defunct:
            exitcode = subprocess.call(
                ["userdel", "--remove", u]
            )
            _LOGGER.info(
                "Removing user {} and home directory exited with code {}".format(
                    u, exitcode
                )
            )
    except Exception as e:
        _LOGGER.error(e)



def create_user(username, group):  
    try:
        homedir = "/home/{}".format(username)
        exitcode = subprocess.call(
            [
                "useradd", "-s", "/bin/bash",
                "-c", "iamsync", "-md", homedir,
                "-g", group, username
            ]
        )
        _LOGGER.info(
            "Creating user {} with primary group {} exited with code {}".format(
            username, group, exitcode
        ))
    except Exception as e:
        _LOGGER.error(e)

def check_if_user_exists(username):
    try:
        pwd.getpwnam(username)
        return True
    except KeyError:
        return False

def remove_user_from_group(username, groupname):  
    try:
        subprocess.call(
            [
                "gpasswd", "-d", username, groupname
            ]
        )
    except Exception as e:
        _LOGGER.error(e)

def add_user_to_group(username, groupname):  
    try:
        exitcode = subprocess.call(
            [
                "usermod", "-aG", groupname, username
            ]
        )
        _LOGGER.info("Adding user {} to group {} exited with code {}".format(
            username, groupname, exitcode
        ))
    except Exception as e:
        _LOGGER.error(e)

def get_iam_group_membership(groupname):  
    try:
        users = []
        iam = boto3.resource('iam')
        group = iam.Group(groupname)
        for u in group.users.all():
            users.append(u.name)
        return users
    except Exception as e:
        _LOGGER.error(e)
        return []

if __name__ == '__main__':

    # Parse Docopts
    args = docopt(__doc__)
    pathprefix = args['<pathprefix>']

    if args['--logpath']:
        logpath = args['--logpath']
    else:
        logpath = "/var/log/iamsync.log"

    _LOGGER.setLevel(logging.INFO)
    py_hdlr = logging.handlers.WatchedFileHandler(
        logpath, mode='a'
    )
    py_formatter = logging.Formatter(
        '%(asctime)s : %(levelname)s - %(message)s'
    )
    py_hdlr.setFormatter(py_formatter)
    _LOGGER.addHandler(py_hdlr)

    _LOGGER.info("Executing with pathprefix {}".format(pathprefix))

    iamgroups = get_iam_groups(pathprefix)

    for group in iamgroups:

        # Determine memberships in IAM and locally
        iamusers = get_iam_group_membership(group)

        localusers = get_local_group_membership(group)

        # List of users we want to purge from the group
        subtract = []

        # If user exists locally but is not in IAM group
        # Add to subtract list
        for u in localusers:
            if u not in iamusers:
                subtract.append(u)

        # Perform subtraction
        if subtract:
            for u in subtract:
                remove_user_from_group(u, group)

        # If a user is in the IAM group but not the local group,
        # create the account if needed, then add it to the group
        for u in iamusers:
            if u not in localusers:
                if not check_if_user_exists(u):
                    create_user(u, group)
                add_user_to_group(u, group)

    remove_defunct_users(iamgroups, pathprefix)

Those conversant in Python will recognise super basic exception handling when they see it - I know, I know, I'll get to it! This code does the following:

  • Gets a list of IAM groups that exist at the provided "IAM Prefix", aka path.
  • For each group at that path, retrieves a list of members of that group
  • Get membership of the local groups matching those names
  • If a user exists in a group on the local box, but not in the IAM group, remove them from the local group
  • If the user exists in IAM but not locally, create them and add them to the group
  • If a user's comment (which requires sudo to modify) shows they were created by iamsync, and they're not in any of our IAM groups, remove them as defunct

We've found it to be simple, but effective.

Trust the CA

Lastly, we need to configure our boxes to trust the CA. When we set up the Lambda and generated an RSA keypair, we got a public key alongside the private one. We need to lay that down in a file, then tell sshd to look in that file for trusted CA keys:

- name: Tell sshd about our file full of trusted CA pubkeys
  lineinfile:
    path: /etc/ssh/sshd_config
    line: 'TrustedUserCAKeys /etc/ssh/cas.pub'
  notify: reload ssh

- name: lay down file full of trusted CA pubkeys
  template:
    src: cas.pub.j2
    dest: /etc/ssh/cas.pub
    owner: root
    mode: 0600
  notify: reload ssh

Here we've told sshd to look in /etc/ssh/cas.pub for trusted keys, then laid down a Jinja template containing one or more keys pulled from a list such as this:

# Replace this with your CA's pubkey
bless_ca_pubkeys:  
  - region: us-west-2
    pubkey: ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQD+Ker7rge73/EAgY1EngMqV0xHt1m4eg8ySFd9KGU42DgAFw3AuUrNF5O3PT0pg3c0LCS869uURBYjG4q/4cFAanCwQrXcUZL9EALfZfjAQGsoRWYJcLjhFwt0PbyllsmZcbQxXNSBru7SarncfSnbqaFdyo33RlIeWbGxZGkhyGHO+MG21ishN5F99RTPCqG96myuothMrQeQv+qIkxpRzRjD7NjxONs5AofekA9FxVHsfVQ+i4EP9O6hD89aE0tAvuOjK6+zndbqiMj/CbPUUmlezuxtrqqC7Le9Zo6DJ3HZSA2KV9MtvmoDxDtXgLRRPiJ31kJvzQq2hSUUmqALpabToQvEGVjnP/svKewLIqAbbdDbD+931EXiPg3gZPyNr2PxGRm6QTfqI9Q/3hfhghbJCAv4PFBzSHzGbnFjGtBiUQKj7UG/QWxewpA0zj/K92qFWbfVkSJ43FPT5Ro7T8TuiYKuFQeIcIz0FF9+4g1HEufpQE0o4Kh8PS6pDSGgxiqoynXpHI8nnJ0P3mD6H6v7+af+cLWaQLFOYvz8NLbTsljIkXvoNkLtlVNNvpWUOtFRIOI2+b/KUhEOoO5//qRV2QZB1Uy0UlmSs7Agl4VQ7zkLALxaHu0AtwZwAIXqXio+8+Mk6QId/8JvxAYjVtJHw3rWHk10SPmRUNktdQ== SSH CA Key
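
The cas.pub.j2 template that consumes this list is essentially just a loop - a minimal sketch (the real template ships in the repo linked at the top):

{% for ca in bless_ca_pubkeys %}
{{ ca.pubkey }}
{% endfor %}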

The "region" attribute is just for keeping track of things, it is not used in deploy. Simply create a list entry with region: and pubkey: for every BLESS function you want to be trusted. You can spread them across regions for redundancy or just in case you get a compromised CA key and you need to quickly offline one without disruption. This role gets run by the above packer template as well, so you end up with an AMI that you can use to spin up fresh boxes. These will be pre-configured to trust your CAs and will sync users automatically on boot (and every so often as determined by Cron) from IAM.

None of this precludes the use of traditionally deployed RSA keys - your AWS provisioner keys still work, as will any user for whom you deploy an authorized_keys file. It merely allows a more streamlined method of managing users and SSH auth, as well as providing the previously discussed benefits.

The last thing to do is make a connection using all this. Lyft's BLESS client readme offers methods for integrating the BLESS Lambda into client workflows that are more advanced than what I've chosen to do for testing, so I won't elaborate further on that. In short, all you should need to do is:

ssh-keygen -f ~/.ssh/blessid -b 4096 -t rsa -C 'Temporary key for BLESS certificate'  
ssh-keygen -y -f ~/.ssh/blessid > ~/.ssh/blessid.pub  
touch ~/.ssh/blessid-cert.pub  
ln -s ~/.ssh/blessid-cert.pub ~/.ssh/blessid-cert  

You could of course use an existing keypair, either by symlinking it to ~/.ssh/blessid and ~/.ssh/blessid.pub or by setting the BLESS_IDENTITYFILE env var. Either way, you can now:

/path/to/blessclient/blessclient.run --region WEST
BLESS_COMPLETE=1 ssh $@ -i ~/.ssh/blessid  
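
To save typing, you can wrap both steps in a small shell function - a convenience sketch for testing, not Lyft's recommended integration:

bssh() {
    /path/to/blessclient/blessclient.run --region WEST && \
        BLESS_COMPLETE=1 ssh -i ~/.ssh/blessid "$@"
}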

Troubleshooting Advice

It took me hours of reading CloudWatch logs and debug output to get the configuration right - IAM permissions in particular cost me quite a bit of time as I realised how much I didn't know. Make sure you read the CloudWatch output if you're not successfully receiving a cert. Also, export BLESSDEBUG=1 is your friend.
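
Pulling the Lambda's logs from the command line saves a lot of console clicking; the log group name below assumes the default Lambda naming convention:

# GNU date shown; on macOS use: date -v-10M +%s000
aws logs filter-log-events \
    --log-group-name /aws/lambda/BLESS \
    --start-time $(date -d '10 minutes ago' +%s000)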

That's it, happy authing.