Ansible Tower, Autoscaling and recycled IPs: a short treatise on cleaning up after yourself.

My organization's EC2 inventory is highly dynamic. Our capacity requirements vary considerably with the time of day and, to a lesser degree, the day of the week. As a result, we use a mixture of scaling policies, some driven by schedules and some by performance and load metrics.
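
For illustration, wiring up that sort of mixture might look roughly like the sketch below. This is not our actual configuration: the group name, schedule, and adjustment values are made up, and it uses boto3 against the Auto Scaling API.

```python
# Hypothetical sketch: group names, schedule, and adjustments are placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

# Scheduled policy: raise desired capacity ahead of the weekday peak.
autoscaling.put_scheduled_update_group_action(
    AutoScalingGroupName="app-asg",
    ScheduledActionName="weekday-morning-scale-up",
    Recurrence="0 13 * * 1-5",  # cron syntax, UTC
    DesiredCapacity=12,
)

# Metric-driven policy: a CloudWatch alarm on CPU would trigger this to add capacity.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="app-asg",
    PolicyName="scale-up-on-high-cpu",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=2,
    Cooldown=300,
)
```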

We depend heavily on Ansible Tower for provisioning our autoscaled resources. When the desired capacity on one of our ASGs increases, the launch config's user data tells each new instance to run a Python script, which adds the instance to the appropriate Tower inventory and launches a job against it to finish provisioning. Most of the work is already done - baked into the AMI - but because we run instance types that rely on ephemeral storage, there are often provisioning tasks that must happen after an instance is launched.
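
In broad strokes, the provisioner does something like the following. Treat it as a sketch rather than our script verbatim: the Tower URL, credentials, group ID, and job template ID are placeholders, and the endpoints shown assume Tower's v2 REST API.

```python
#!/usr/bin/env python
"""Sketch of a user-data provisioner: register this instance with a Tower
group, then launch the provisioning job template limited to this host."""
import requests

TOWER = "https://tower.example.com"   # placeholder Tower URL
AUTH = ("provisioner", "s3cret")      # placeholder credentials
GROUP_ID = 42                         # placeholder group ID
JOB_TEMPLATE_ID = 7                   # placeholder job template ID

# Ask the instance metadata service who we are.
ip = requests.get(
    "http://169.254.169.254/latest/meta-data/public-ipv4", timeout=5
).text

# Create the host under the target group; Tower adds it to the group's inventory.
requests.post(
    "{0}/api/v2/groups/{1}/hosts/".format(TOWER, GROUP_ID),
    auth=AUTH,
    json={"name": ip, "enabled": True},
).raise_for_status()

# Launch the provisioning job template, limited to just this host
# (assumes the template prompts for a limit on launch).
requests.post(
    "{0}/api/v2/job_templates/{1}/launch/".format(TOWER, JOB_TEMPLATE_ID),
    auth=AUTH,
    json={"limit": ip},
).raise_for_status()
```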

In the past, when an instance was scaled down, we did not immediately remove it from any inventories or groups. Instead, every few minutes a Jenkins job checked every instance in the Tower inventories via Boto and pruned any that were no longer in the running state. It wasn't terribly efficient, but it had been doing its job for years, and we didn't see sufficient reason to spend dev time writing new code to improve the process.
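
That cleanup pass amounted to something like the sketch below (simplified: boto3 instead of the old boto, placeholder credentials, and no pagination or error handling).

```python
#!/usr/bin/env python
"""Sketch of the periodic cleanup: walk Tower's hosts, check each one's EC2
state by public IP, and delete any host whose instance is no longer running."""
import boto3
import requests

TOWER = "https://tower.example.com"   # placeholder Tower URL
AUTH = ("cleanup", "s3cret")          # placeholder credentials

ec2 = boto3.client("ec2")
hosts = requests.get("{0}/api/v2/hosts/".format(TOWER), auth=AUTH).json()["results"]

for host in hosts:
    # Tower hosts are named after their public IP, so filter EC2 on that.
    reservations = ec2.describe_instances(
        Filters=[{"Name": "ip-address", "Values": [host["name"]]}]
    )["Reservations"]
    instances = [i for r in reservations for i in r["Instances"]]
    if not any(i["State"]["Name"] == "running" for i in instances):
        # Nothing running owns this IP any more; drop the stale host record.
        requests.delete(
            "{0}/api/v2/hosts/{1}/".format(TOWER, host["id"]), auth=AUTH
        ).raise_for_status()
```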

That all changed last week. We had a deploy job fail on one of these autoscaled hosts. Upon investigation, we found a task that failed on this host but succeeded on all the other instances that had just scaled up alongside it. The failing task was laying down a Jinja template - an nginx server block file, not that it's relevant - and it failed because the instance was missing the /etc/nginx/sites-available directory. Poking around the server, we noticed other oddities: a copy of a repository that didn't belong there, and config files for applications that weren't used on this type of server. It didn't take us long to figure out why! We looked at our Tower inventories and found the host was a member of two different groups - nothing inherently wrong with that, but in this case a thoroughly unintended situation.

What had happened was this:

  • During a scheduled scale-down of one of our autoscale groups, another group was scaling up based on a CloudWatch metric.

  • As an instance in group A's ASG was terminated, its host record remained in group A's Tower inventory, and its public IP was released.

  • As the second ASG scaled up, one of its new instances was assigned the public IP of an instance from the first ASG that had just scaled down.

  • Our Python provisioner added this new instance to group B.

  • When the Jenkins cleanup job ran against our Tower inventory a few minutes later, it did not remove the host from group A, because the instance now holding that IP was in the running state.

  • The Tower job kicked off by our provisioner script ran against the new instance as a member of group B.

  • Later on, a code change triggered a deploy against group A. The instance behind that IP no longer belonged to group A's ASG, but Tower had no way of knowing that, so it ran the deploy against the instance because it was still a member of group A in Tower. The deploy failed because the AMI the instance was built from didn't have the needed configuration.

Obviously, this exposed a big problem in our automation: code was being deployed to a server that wasn't equipped to run it.

Our solution:

It's a bit quick and dirty, but effective. We added a shell script called K99_unprovisiontower to /etc/rc6.d. It actually just kicks off a Python script, because the requests library is so nice to work with (a rough sketch of that script follows the list). On shutdown, the script:

  • Grabs the instance's public IP address from the EC2 instance metadata service at 169.254.169.254

  • Looks up the Tower host matching that public IP via the Tower API and grabs its host ID and inventory ID

  • Via a POST request, disassociates that host from all Tower inventories (and thereby removes it from all groups)

  • If any of these steps fails, creates a PagerDuty incident (just a Slack alert, not a phone call, mind you) via their API
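
Stripped down, the Python side looks something like this. It's a sketch, not the script verbatim: the Tower URL, credentials, and PagerDuty service key are placeholders, and the exact disassociate call may differ by Tower version (on some versions a plain DELETE of the host record is the simpler route).

```python
#!/usr/bin/env python
"""Sketch of the shutdown hook invoked by K99_unprovisiontower."""
import requests

TOWER = "https://tower.example.com"        # placeholder Tower URL
AUTH = ("deprovisioner", "s3cret")         # placeholder credentials
PAGERDUTY_SERVICE_KEY = "PD_SERVICE_KEY"   # placeholder service key

def page(description):
    # PagerDuty generic events API; the service behind this key only raises
    # a Slack alert, not a phone call.
    requests.post(
        "https://events.pagerduty.com/generic/2010-04-15/create_event.json",
        json={
            "service_key": PAGERDUTY_SERVICE_KEY,
            "event_type": "trigger",
            "description": description,
        },
    )

try:
    # 1. The metadata service hands back this instance's public IP.
    ip = requests.get(
        "http://169.254.169.254/latest/meta-data/public-ipv4", timeout=5
    ).text

    # 2. Find the matching Tower host record and the inventory it lives in.
    host = requests.get(
        "{0}/api/v2/hosts/".format(TOWER), auth=AUTH, params={"name": ip}
    ).json()["results"][0]

    # 3. Disassociate the host from its inventory, which also drops it from
    #    every group in that inventory.
    requests.post(
        "{0}/api/v2/inventories/{1}/hosts/".format(TOWER, host["inventory"]),
        auth=AUTH,
        json={"id": host["id"], "disassociate": True},
    ).raise_for_status()
except Exception as exc:
    page("Tower deprovision failed on shutdown: {0}".format(exc))
    raise
```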

This way, we ensure that by the time the IP address is released back into the pool of available IPs that a new instance might draw from, the host has already been removed from our Tower database.

Closing Thoughts

This issue had never manifested until this past week, meaning we had never had an IP recycled inside a 15-minute window over many years of autoscaling dozens of instances at a time. In the span of a week, it has now happened three times. Why now? Who knows. Without more insight into how Amazon assigns IP addresses to instances and how the size of the available pools changes over time, we'll probably never know for sure whether something changed or we just beat the odds. It could be that the pool of available IPs for the instance type we were using (the same for both ASGs) has recently shrunk - or maybe we were just super lucky for a prolonged period. Regardless, it is better to do the needful and just clean up after yourself.