Cattle, not pets - pt2 - the meat of it all


Monitoring and healthchecks

Being able to detect cluster performance degradation is another important consideration before taking this approach. “Monolithic” healthchecks are easy to set up - they’re configured to monitor an endpoint, port, or the status of a service - and usually work when something goes wrong - the server node is flagged as unhealthy, it is detached from the regular workflow of the cluster, and sometimes an alert is triggered somewhere. This works, but in certain situations the regular behaviour of a monolithic healthcheck can prove itself to be imprecise or too slow and it ends up being impactful - in a bad way.

This is where monitoring and custom metrics come into play. There are large numbers of monitoring solutions available these days, both open source as well as commercial, any of which can monitor things such as CPU, memory and disk usage. Observability is much more than aggregating standard metrics. Observability is maintaining a transparent view towards every facet of the platform, including application and system-level behaviour.

Custom metrics are the first step towards creating a system in which ill-performing servers are pointed out and removed from their clusters. They allow establishing a baseline between the cluster’s nodes, a “normal” behaviour shared between the “cattle” - any outlier clearly not taking its place in line and it ends up removed from the cluster once it crosses a certain threshold.

We track a large number of custom metrics, some of them being related to the type of application - response latency for APIs, while some are business-related and can point out other issues in our platform - not serving a certain type of video container, while its other versions are successfully delivered. Another strength of custom metrics is assessing a server node’s performance and behaviour past its expiration date. It allows operations personnel to analyse the cause of a node’s misbehaviour during regular work hours, after they’ve enjoyed a full night of sleep.

Prometheus is usually a good bet for this. Being a Cloud Native Computing Foundation (CNCF) project - it is open source, flexible, allows creating custom metrics for our applications, and has managed to build a bustling community around it. Some of Prometheus’ other strengths are its built-in alert manager, integrating with services such as Pagerduty and Slack, and its flexible query language, which boasts features such as linear prediction.


Automated cluster healing - or “auto-healing” - was mentioned earlier in the article and is a key component to maintaining a system’s performance at the desired capacity. Auto-healing processes are in charge of bringing up new servers whenever the actual capacity of the cluster is lower than what we desire, keeping everything up, and the ops people happy at night.

Ideally, we’d have a replacement instance already lined up whenever a server node fails its healthchecks and its deemed unhealthy, but that isn’t always the case. Thus, the cluster’s capacity decreases and its performance is degraded, and it starts “running hot” until auto-healing does its job and everything is working properly again.

“Running hot” means that the average resource usage of the cluster is higher than normal, due to having less server nodes than usual. Sometimes, during high loads, running hot can be risky, as having additional nodes fail can start overloading the remaining healthy nodes and bring the whole cluster down.

Luckily, there are a few ways to mitigate this risk. A cluster’s redundancy factor can be increased by provisioning a larger number of low resource machines, instead of having a small number of high resource ones. This way, if two server nodes decide to overstep their boundaries, you’d rather have a cluster of fifteen low-spec servers working in unison, rather than a cluster of five huge servers chugging along. Percentage-wise, the impact of the outage is greatly reduced and a cluster’s load is evenly distributed across many nodes, instead of just a few.

The other way is even simpler to implement, but it adds literal cost - overprovisioning. Overprovisioning means adding more server nodes that would be necessary for the current cluster’s load, in an event something goes wrong during its lifetime. More server nodes = increased costs, though - so how do we handle this?

Cost reduction

Fortunately for us, cloud providers provide a mechanism to get the desired compute capacity while having lower overall costs. AWS calls it “spot instances”, Google calls it “preemptible VMs”.

Sounds good - but there’s a catch - machines can indeed be provisioned at 20-50% the cost of regular, on-demand machines, but they’re not here to stay. These instances have short lifespans, ranging from hours to weeks, depending on many factors - including the cloud provider’s physical datacenter’s load. The cloud provider is just selling unused compute capacity at a lower rate, so that it generates income for its usage. Whenever there is need for that capacity, some of these discounted machines get terminated, giving around two minutes to gracefully terminate whatever workload the machine may have, before it forcefully begins a shutdown.

Fortunately, both AWS and Google allow, through instance metadata information, to determine if the machine is going to be terminated, allowing us to hook up custom workflows whenever this happens. We’re trying to stay as provider-agnostic as we can so our applications barely know they’re in the cloud, and can’t know by themselves that the machine they’re running on is going to be terminated. As a solution to this we started provisioning our nodes with a small service which polls the instance termination metadata endpoint every few seconds, issuing SIGHUP signals to notify applications to clean up after themselves before shutdown - again, this approach is greatly benefited by designing applications to be stateless.