Outage Recovery With Ansible

It has been a while since the last post here, so I planned to come in and write about a cool ESLint configuration tool I found, but instead was greeted with a pleasant 502 Bad Gateway message from Nginx.

Initial Investigations

The AWS EC2 instance was still running fine, but the Node.js process serving the blog was not. I confirmed this by SSH'ing into the box and running pm2 list, which showed no apps at all. I hadn't touched the box in three months, so I wasn't sure what state it was in. Rather than spend more time working out what had happened, I decided to fix it first and investigate later.
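
That manual check could also be scripted. Here is a minimal sketch, assuming pm2 jlist (the JSON variant of pm2 list) prints an array of process entries that each carry a name and a pm2_env.status field. The app names in the usage note are made up for illustration.

```python
import json


def online_apps(jlist_output: str) -> list[str]:
    """Return the names of PM2 processes whose status is "online".

    `jlist_output` is assumed to be the JSON text printed by `pm2 jlist`.
    """
    processes = json.loads(jlist_output)
    return [
        proc["name"]
        for proc in processes
        if proc.get("pm2_env", {}).get("status") == "online"
    ]
```

An empty process list, which is what I was effectively looking at, gives online_apps("[]") == []. A list with a hypothetical "blog" app online and a "worker" app stopped would give ["blog"].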

Ansible

Luckily, I had used Ansible to provision the EC2 instance. If you're not familiar, Ansible is a system orchestration, configuration management, and app deployment tool. It is similar to Chef, but unlike Chef, Ansible is agentless: it connects over SSH, so you don't need to install an agent on the target machine. It uses declarative configuration files (Ansible calls them "playbooks") that are designed to be idempotent, meaning they can be run once on a bare box to set it up the way you want it, and then again and again to make sure it stays that way. You describe the state of the system that you want, and Ansible works out what needs to be done to make the system match that desired state.
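
To make "idempotent" a bit more concrete, here is a toy model in Python. It is not how Ansible is implemented, just a sketch of the idea: compare the current state to the desired state and change only what differs. The state keys ("nginx", "blog") are hypothetical labels, not real Ansible modules.

```python
def converge(
    current: dict[str, str], desired: dict[str, str]
) -> tuple[dict[str, str], list[str]]:
    """Bring `current` in line with `desired` and report what changed."""
    result = dict(current)  # work on a copy, leave the input untouched
    changes = []
    for key, want in desired.items():
        if result.get(key) != want:
            result[key] = want
            changes.append(f"set {key} -> {want}")
    return result, changes


# Hypothetical desired state for the box.
desired = {"nginx": "installed", "blog": "running"}

# First run on a bare box: everything has to be changed.
state, changes = converge({}, desired)

# Second run on the already converged box: nothing left to do.
state_again, changes_again = converge(state, desired)
```

After the first run, changes lists both items. After the second, changes_again is empty and the state is unchanged, which is exactly the property that makes re-running a playbook safe.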

Fast-acting Relief

Because Ansible playbooks are idempotent, all I needed to do was run the same command I had used to set up the machine in the first place, and the blog was back up and running. The whole process, from discovering the outage to bringing the site back online, took about five minutes.
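
I'm not reproducing my exact command here, so treat the following as a sketch of what wrapping a playbook run in a small script could look like. ansible-playbook, its -i (inventory) option, and its --check (dry run) option are part of the real CLI. The file names in the usage note are hypothetical placeholders.

```python
import subprocess


def build_command(playbook: str, inventory: str, dry_run: bool = False) -> list[str]:
    """Assemble the ansible-playbook invocation as an argument list."""
    cmd = ["ansible-playbook", "-i", inventory, playbook]
    if dry_run:
        # --check asks Ansible to report what it would change without changing it.
        cmd.append("--check")
    return cmd


def run_playbook(playbook: str, inventory: str, dry_run: bool = False) -> int:
    """Run the playbook and return the process exit code (0 means success)."""
    return subprocess.run(build_command(playbook, inventory, dry_run)).returncode
```

With placeholder names, run_playbook("site.yml", "inventory.ini", dry_run=True) would preview the changes, and dropping dry_run would apply them.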

Lessons Learned

I still don't know exactly what happened to cause the site to go down, but I have identified a few things I should improve:

  • Monitoring. It's not acceptable that I only stumbled across the outage on my own after the site had been down for a few days. PM2 and Keymetrics should have alerted me to the problem, so I need to figure out why they didn't.

  • Maintenance. After SSH'ing into the EC2 instance, I realized there were a lot of outdated packages, some with pending security updates. I should create and follow a plan to update these packages periodically.

  • Memory. Specifically, I shouldn't need to rely on my brain to remember the necessary commands. I hadn't touched Ansible in a few months, and I couldn't exactly remember the structure of the command that I needed to execute. I should keep some better notes and instructions for myself.
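
On the monitoring point, an external check does not have to be elaborate. Below is a minimal sketch of an HTTP probe using only the Python standard library. It is not what PM2 or Keymetrics do, just an independent second opinion. The fetch parameter is there so the logic can be tried without touching the network.

```python
import urllib.error
import urllib.request
from typing import Callable


def fetch_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for `url`, error statuses included."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises for 4xx/5xx responses; the code is still what we want.
        return err.code


def is_up(url: str, fetch: Callable[[str], int] = fetch_status) -> bool:
    """True when the site answers with a 2xx status, False otherwise."""
    try:
        return 200 <= fetch(url) < 300
    except OSError:
        # Connection refused, DNS failure, timeout: treat all of them as "down".
        return False
```

A 502 from Nginx, like the one that greeted me, makes is_up return False. Run on a schedule from a machine other than the web server, a check like this would still notice when the whole box is unreachable.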

Summary

Don't rely on PM2 alone and assume that your web apps are running fine. Make sure your monitoring is set up and working, and use something like Ansible to configure and provision your infrastructure. When your site goes down, you'll be able to get it back up and running quickly, even if you don't know exactly what went wrong in the first place. Be on the lookout for a future post discussing the nitty-gritty of how to use Ansible.