Cattle, not pets - pt 3

Continued from here

Any real-life examples about this?

One of the clusters which has benefited the most from this approach was our video encoding cluster. GPU instances are usually expensive - spot instances are a great use for this, cutting costs by a significant margin. GPU instances, though, are also in very high-demand and we see them being terminated quite often.

Read More

Cattle, not pets - pt2 - the meat of it all

Continued from here

Monitoring and healthchecks

Being able to detect cluster performance degradation is another important consideration before taking this approach. “Monolithic” healthchecks are easy to set up - they’re configured to monitor an endpoint, port, or the status of a service - and usually work when something goes wrong - the server node is flagged as unhealthy, it is detached from the regular workflow of the cluster, and sometimes an alert is triggered somewhere. This works, but in certain situations the regular behaviour of a monolithic healthcheck can prove itself to be imprecise or too slow and it ends up being impactful - in a bad way.

Read More

Cattle, not pets - pt1

In the old way of doing things, we treated servers like pets - we got attached to them, we spent money to keep them healthy and spent time to make sure they receive enough attention and don’t start causing trouble. There have been countless moments in which a singular ‘slowflake’ server was the cause for a service outage, causing operations teams to wake up in the middle of the night, drowsily reach for their laptops and try to make sense of it all before attempting to stabilize the situation before the outage’s aftershocks are felt throughout the system.

Read More