In the old way of doing things, we treated servers like pets - we got attached to them, we spent money to keep them healthy and spent time to make sure they receive enough attention and don’t start causing trouble. There have been countless moments in which a singular ‘slowflake’ server was the cause for a service outage, causing operations teams to wake up in the middle of the night, drowsily reach for their laptops and try to make sense of it all before attempting to stabilize the situation before the outage’s aftershocks are felt throughout the system.
The new way is treating servers like cattle - you don’t get attached to them, you make sure they’re all numbered and sit in a line doing what they were made to do. If any of them start misbehaving the culprit is taken out of the line, as to not cause more trouble, and then laid to rest, leaving its place free for a new one to replace it. This reduces the operational cost of maintaining server infrastructure, especially when coupled with concepts such as automated cluster healing, provisioning and monitoring.
Modern cloud infrastructure provider offerings make this approach more viable than on-premise datacenters, given the potential to tap into available resources at any time - with additional costs involved, and within certain limits.
How do I start?
There are a number of considerations to take into account before starting the journey towards taking this approach.
Application architecture is one of the main factors which get a say in it. The ideal candidate is a stateless application which can also buffer its write operations through a remotely-accessed queueing or stream-processing system, such as RabbitMQ or Apache Kafka. This ensures data is kept locally, on the machine the application is running on, for a short time and that it is quickly evicted to the buffering system of choice.
Another factor is being able to ensure that the server nodes, or “cattle”, have a quick start-up process, with few moving parts, thus reducing any unforeseen complications which may arise. This is paramount to having a degraded cluster’s capacity back to full speed in as little time as possible. Common solutions involve using prebuilt VM images, containing the application, libraries needed to run it, as well as services in charge of node monitoring and discoverability. Hashicorp’s Packer has proven itself invaluable for this use-case, allowing us to run reproducible image builds on a schedule, across cloud providers, while - as a side effect - also keeping up with security updates, as part of the pipeline.
Monitoring and healthchecks
Being able to detect cluster performance degradation is another important consideration before taking this approach. “Monolithic” healthchecks are easy to set up - they’re configured to monitor an endpoint, port, or the status of a service - and usually work when something goes wrong - the server node is flagged as unhealthy, it is detached from the regular workflow of the cluster, and sometimes an alert is triggered somewhere. This works, but in certain situations the regular behaviour of a monolithic healthcheck can prove itself to be imprecise or too slow and it ends up being impactful, in a bad way.
This is where monitoring and custom metrics come into play. There are large numbers of monitoring solutions available these days, both open source as well as commercial, any of which can monitor things such as CPU, memory and disk usage. Observability is much more than aggregating standard metrics. Observability is maintaining a transparent view towards every facet of the platform, including application and system-level behaviour.
Custom metrics are the first step towards creating a system in which ill-performing servers are pointed out and removed from their clusters. They allow establishing a baseline between the cluster’s nodes, a “normal” behaviour shared between the “cattle” - any outlier clearly not taking its place in line and it ends up removed from the cluster once it crosses a certain threshold.
We track a large number of custom metrics, some of them being related to the type of application - response latency for APIs, while some are business-related and can point out other issues in our platform - not serving a certain type of video container, while its other versions are successfully delivered. Another strength of custom metrics is assessing a server node’s performance and behaviour past its expiration date. It allows operations personnel to analyse the cause of a node’s misbehaviour during regular work hours, after they’ve enjoyed a full night of sleep.
Prometheus is usually a good bet for this. Being a Cloud Native Computing Foundation (CNCF) project - it is open source, flexible, allows creating custom metrics for our applications, and has managed to build a bustling community around it. Some of Prometheus’ other strengths are its built-in alert manager, integrating with services such as Pagerduty and Slack, and its flexible query language, which boasts features such as linear prediction.
Automated cluster healing - or “auto-healing” - was mentioned earlier in the article and is a key component to maintaining a system’s performance at the desired capacity. Auto-healing processes are in charge of bringing up new servers whenever the actual capacity of the cluster is lower than what we desire, keeping everything up, and the ops people happy at night.
Ideally, we’d have a replacement instance already lined up whenever a server node fails its healthchecks and its deemed unhealthy, but that isn’t always the case. Thus, the cluster’s capacity decreases and its performance is degraded, and it starts “running hot” until auto-healing does its job and everything is working properly again.
“Running hot” means that the average resource usage of the cluster is higher than normal, due to having less server nodes than usual. Sometimes, during high loads, running hot can be risky, as having additional nodes fail can start overloading the remaining healthy nodes and bring the whole cluster down.
Luckily, there are a few ways to mitigate this risk. A cluster’s redundancy factor can be increased by provisioning a larger number of low resource machines, instead of having a small number of high resource ones. This way, if two server nodes decide to overstep their boundaries, you’d rather have a cluster of fifteen low-spec servers working in unison, rather than a cluster of five huge servers chugging along. Percentage-wise, the impact of the outage is greatly reduced and a cluster’s load is evenly distributed across many nodes, instead of just a few.
The other way is even simpler to implement, but it adds literal cost - overprovisioning. Overprovisioning means adding more server nodes that would be necessary for the current cluster’s load, in an event something goes wrong during its lifetime. More server nodes = increased costs, though - so how do we handle this?
Fortunately for us, cloud providers provide a mechanism to get the desired compute capacity while having lower overall costs. AWS calls it “spot instances”, Google calls it “preemptible VMs”.
Sounds good - but there’s a catch - machines can indeed be provisioned at 20-50% the cost of regular, on-demand machines, but they’re not here to stay. These instances have short lifespans, ranging from hours to weeks, depending on many factors - including the cloud provider’s physical datacenter’s load. The cloud provider is just selling unused compute capacity at a lower rate, so that it generates income for its usage. Whenever there is need for that capacity, some of these discounted machines get terminated, giving around two minutes to gracefully terminate whatever workload the machine may have, before it forcefully begins a shutdown.
Fortunately, both AWS and Google allow, through instance metadata information, to determine if the machine is going to be terminated, allowing us to hook up custom workflows whenever this happens. We’re trying to stay as provider-agnostic as we can so our applications barely know they’re in the cloud, and can’t know by themselves that the machine they’re running on is going to be terminated. As a solution to this we started provisioning our nodes with a small service which polls the instance termination metadata endpoint every few seconds, issuing SIGHUP signals to notify applications to clean up after themselves before shutdown - again, this approach is greatly benefited by designing applications to be stateless.
Any real-life examples about this?
One of the clusters which has benefited the most from this approach was our video encoding cluster. GPU instances are usually expensive - spot instances are a great use for this, cutting costs by a significant margin. GPU instances, though, are also in very high-demand and we see them being terminated quite often. Our video encoder service runs as any other such service would: it reserves a batch of video encoding jobs in our queue table, fetches those videos, reencodes them and then uploads the resulting files to a remote location, marking the job as completed in the aforementioned tables. To benefit from this workflow, it also has a small but useful built-in feature - whenever it receives a SIGHUP signal, it cancels all encoding jobs running on that server, unflags its batch of videos so it can be picked up by another video encoder process, and then exits.
Given that nodes are terminated based on physical datacenter load, it may come to times in which spot instances are few and far in between. Normally, that would end up leaving clusters starved for resources, but depending on how the auto-healing process is implemented, it can switch cluster configuration from using spot instances to on-demand instances, reverting the change whenever spot instances are available again. This way we can ensure both uptime, as well as reduced overall costs.
Spot instances have a large number of use-cases, from autoscaling Jenkins slaves based on job queue size and time of day, using them as Packer image build servers, running batch jobs with either predictable or unpredictable loads, to even running stateful, but distributed applications. Our Elasticsearch clusters run on spot instances in Kubernetes, leveraging Kubernetes’ Stateful Set and Persistent Volume Claim (PVC) features - whenever a node gets terminated, the disk attached to it lives on to be reattached to the replacement node. This way we don’t lose any data and the cluster doesn’t need to rebalance and replicate its shards between the remaining nodes.
We also use them for all our APIs, paired with Spinnaker and Terraform. Spinnaker is an all-around continuous delivery platform that we use to release new versions of our APIs into the wild, while also increasing confidence through observability. It allows us to run release candidate versions of our applications in parallel to production clusters, running canary analysis processes to identify any type of negative impact before promoting it to production-worthy. After release candidate promotion, a rolling deploy is initiated, bringing online new application servers in batches, before tearing down the old version.
Terraform allows us to version, provision and preview changes to our infrastructure and also serves as an auditing tool for tracking down unwanted manual operations. It is called through Spinnaker to prepare the ground for new cluster versions.
How do you keep track of spots, though?
Now, we’ve said that spot instances come and go - it must be quite hard keeping an eye and collecting metrics from those servers, right? Not quite, cloud providers to the rescue - again. Prometheus fortunately supports all major cloud provider APIs and can detect and group machines based on instance tags. Prometheus also supports a number of service discovery tools, such as Consul, another tool made by Hashicorp which we use internally.
We use spot instances as often as we can. It reduces costs, increases resilience and poses interesting challenges for all technically-minded personnel, developers and operations engineers alike.
The “cattle, not pets” approach takes time to properly set up, as it involves spending time on planning, establishing a more complex than usual application and system architecture and requires its key concepts – auto-healing and quick node start-up – to have bulletproof implementations.
Adopting “pets” is easy and provides quick gratification, but the responsibility and cost of taking care of them adds up in the mid-to-long term to the point in which nobody would dare lay a finger on them. With “cattle”, though, you have the peace of mind that they can be put down at any time, without disrupting both day to day and night-to-night operations.