Cattle, not pets - pt1


In the old way of doing things, we treated servers like pets - we got attached to them, we spent money to keep them healthy and spent time making sure they received enough attention and didn’t start causing trouble. There have been countless moments in which a single ‘snowflake’ server was the cause of a service outage, forcing operations teams to wake up in the middle of the night, drowsily reach for their laptops, try to make sense of it all and then stabilize the situation before the outage’s aftershocks were felt throughout the system.

The new way is treating servers like cattle - you don’t get attached to them; you make sure they’re all numbered and standing in line, doing what they were made to do. If any of them starts misbehaving, the culprit is taken out of the line, so as not to cause more trouble, and then laid to rest, leaving its place free for a new one to replace it. This reduces the operational cost of maintaining server infrastructure, especially when coupled with concepts such as automated cluster healing, provisioning and monitoring.

Modern cloud provider offerings make this approach more viable than on-premises datacenters, given the ability to tap into additional resources at any time - with extra costs involved, and within certain limits.

How do I start?

There are a number of considerations to weigh before starting the journey towards this approach.

Application architecture is one of the main deciding factors. The ideal candidate is a stateless application which can also buffer its write operations through a remotely-accessed queueing or stream-processing system, such as RabbitMQ or Apache Kafka. This ensures data is kept locally, on the machine the application is running on, only for a short time before being evicted to the buffering system of choice, as the sketch below illustrates.
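To make that concrete, here is a minimal sketch of such a write path in Python, using the kafka-python client. The broker addresses, the app-writes topic and the event shape are all hypothetical, and any queueing system with delivery acknowledgements would do just as well:

```python
import json

from kafka import KafkaProducer

# Hand every write to a remote Kafka topic instead of local storage,
# so the node itself never holds data for more than a few milliseconds.
producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],  # hypothetical brokers
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",    # wait for broker confirmation, so a culled node loses nothing
    linger_ms=5,   # batch locally for at most 5 ms before sending
)

def record_event(event: dict) -> None:
    # Buffer the write operation through Kafka; "app-writes" is a hypothetical topic.
    producer.send("app-writes", value=event)

record_event({"user_id": 42, "action": "checkout"})
producer.flush()  # on shutdown, evict anything still buffered on the node
```

With writes acknowledged by the brokers rather than persisted on the node, any instance can be taken out of the line at a moment’s notice without losing data.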

Provisioning

Another factor is ensuring that the server nodes, or “cattle”, have a quick start-up process with few moving parts, reducing the unforeseen complications which may arise. This is paramount to bringing a degraded cluster back to full capacity in as little time as possible. Common solutions involve prebuilt VM images containing the application, the libraries needed to run it, as well as the services in charge of node monitoring and discoverability. HashiCorp’s Packer has proven itself invaluable for this use-case, allowing us to run reproducible image builds on a schedule, across cloud providers, while - as a side effect - also keeping up with security updates as part of the pipeline. A sketch of how such scheduled builds might be driven follows below.
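As an illustration only, a scheduled job (cron, a CI pipeline, etc.) could drive those builds with a small wrapper like the one below. The template file names and the image_version variable are hypothetical and would need to match real Packer templates, whose provisioners carry the actual install and upgrade steps:

```python
import subprocess
from datetime import datetime, timezone

# Hypothetical Packer templates, one per cloud provider.
TEMPLATES = ["aws.pkr.hcl", "gcp.pkr.hcl"]

def build_images() -> None:
    # Stamp each run so image names are reproducible and traceable.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d%H%M")
    for template in TEMPLATES:
        # Fetch the plugins the template declares, then build the image.
        subprocess.run(["packer", "init", template], check=True)
        subprocess.run(
            ["packer", "build", "-var", f"image_version={stamp}", template],
            check=True,
        )

if __name__ == "__main__":
    build_images()
```

Because the base image’s package upgrades run inside the templates’ provisioners on every scheduled build, freshly provisioned nodes come up with current security patches baked in.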

Continues here