Cattle, not pets - part 3: the conclusion

Continued from here Any real-life examples of this approach? One of the clusters that has benefited the most is our video encoding cluster. GPU instances are usually expensive, which makes spot instances a great fit, cutting costs by a significant margin. GPU instances are also in very high demand, though, so we see them terminated quite often. Our video encoder service runs as any such service would: it reserves a batch of video encoding jobs in our queue table, fetches those videos, re-encodes them, uploads the resulting files to a remote location, and marks the jobs as completed in the aforementioned table.
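The reserve/encode/complete loop described above can be sketched roughly like this. Everything here is illustrative: the table and column names (`jobs`, `status`, `worker`) are assumptions, and an in-memory SQLite database stands in for the real queue table (a production setup would more likely use a server database with row locking, e.g. `SELECT ... FOR UPDATE SKIP LOCKED`):

```python
import sqlite3

def claim_jobs(conn, worker_id, batch_size=2):
    # Reserve a batch atomically: pick pending jobs, then flip their
    # status inside the same transaction so no other worker grabs them.
    with conn:
        rows = conn.execute(
            "SELECT id, video_path FROM jobs "
            "WHERE status = 'pending' ORDER BY id LIMIT ?",
            (batch_size,),
        ).fetchall()
        ids = [r[0] for r in rows]
        if ids:
            conn.execute(
                "UPDATE jobs SET status = 'claimed', worker = ? "
                "WHERE id IN ({})".format(",".join("?" * len(ids))),
                [worker_id, *ids],
            )
    return rows

def complete_job(conn, job_id):
    # Called after the re-encoded file has been uploaded.
    with conn:
        conn.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))

# Demo with an in-memory database standing in for the real queue table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (id INTEGER PRIMARY KEY, video_path TEXT, "
             "status TEXT DEFAULT 'pending', worker TEXT)")
conn.executemany("INSERT INTO jobs (video_path) VALUES (?)",
                 [("a.mp4",), ("b.mp4",), ("c.mp4",)])
batch = claim_jobs(conn, "worker-1")   # reserves the first two jobs
for job_id, path in batch:
    complete_job(conn, job_id)         # fetch/encode/upload would go here
```

Because a claimed-but-unfinished job simply keeps its `claimed` status, a spot termination mid-batch only needs a reaper that returns stale claims to `pending`.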

Cattle, not pets - part 2: the meat of it all

Continued from here Monitoring and healthchecks Being able to detect cluster performance degradation is another important consideration before taking this approach. “Monolithic” healthchecks are easy to set up: they monitor an endpoint, a port, or the status of a service, and usually work when something goes wrong - the server node is flagged as unhealthy, detached from the regular workflow of the cluster, and sometimes an alert is triggered somewhere.
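A “monolithic” healthcheck of the sort described boils down to something like the sketch below. The check names and the 200/503 convention are illustrative assumptions; the point is simply that any single failing probe (or a probe that crashes) marks the whole node unhealthy, so the load balancer detaches it:

```python
def run_healthcheck(checks):
    """Evaluate named check functions; any failure marks the node unhealthy.

    `checks` maps a name to a zero-argument callable returning True/False.
    A load balancer polling this endpoint would detach the node on any
    non-200 status.
    """
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A crashing check counts as a failure, not as an error page.
            results[name] = False
    status = 200 if all(results.values()) else 503
    return status, results

def broken_check():
    # Stand-in for a dependency that is down.
    raise RuntimeError("connection refused")

status_ok, _ = run_healthcheck({"db": lambda: True, "disk": lambda: True})
status_bad, detail = run_healthcheck({"db": lambda: True, "queue": broken_check})
```

The all-or-nothing semantics are exactly what makes this style easy to set up - and also what makes it too coarse for detecting gradual performance degradation.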

Cattle, not pets - part 1: the start

Originally published on In the old way of doing things, we treated servers like pets: we got attached to them, spent money to keep them healthy, and spent time making sure they received enough attention and didn’t start causing trouble. There have been countless moments in which a singular ‘snowflake’ server caused a service outage, forcing operations teams to wake up in the middle of the night, drowsily reach for their laptops, and try to make sense of it all, then stabilize the situation before the outage’s aftershocks were felt throughout the system.