Any real-life examples about this?
One of the clusters which has benefited the most from this approach was our video encoding cluster. GPU instances are usually expensive - spot instances are a great use for this, cutting costs by a significant margin. GPU instances, though, are also in very high-demand and we see them being terminated quite often. Our video encoder service runs as any other such service would: it reserves a batch of video encoding jobs in our queue table, fetches those videos, reencodes them and then uploads the resulting files to a remote location, marking the job as completed in the aforementioned tables. To benefit from this workflow, it also has a small but useful built-in feature - whenever it receives a SIGHUP signal, it cancels all encoding jobs running on that server, unflags its batch of videos so it can be picked up by another video encoder process, and then exits.
Given that nodes are terminated based on physical datacenter load, it may come to times in which spot instances are few and far in between. Normally, that would end up leaving clusters starved for resources, but depending on how the auto-healing process is implemented, it can switch cluster configuration from using spot instances to on-demand instances, reverting the change whenever spot instances are available again. This way we can ensure both uptime, as well as reduced overall costs.
Spot instances have a large number of use-cases, from autoscaling Jenkins slaves based on job queue size and time of day, using them as Packer image build servers, running batch jobs with either predictable or unpredictable loads, to even running stateful, but distributed applications. Our Elasticsearch clusters run on spot instances in Kubernetes, leveraging Kubernetes’ Stateful Set and Persistent Volume Claim (PVC) features - whenever a node gets terminated, the disk attached to it lives on to be reattached to the replacement node. This way we don’t lose any data and the cluster doesn’t need to rebalance and replicate its shards between the remaining nodes.
We also use them for all our APIs, paired with Spinnaker and Terraform. Spinnaker is an all-around continuous delivery platform that we use to release new versions of our APIs into the wild, while also increasing confidence through observability. It allows us to run release candidate versions of our applications in parallel to production clusters, running canary analysis processes to identify any type of negative impact before promoting it to production-worthy. After release candidate promotion, a rolling deploy is initiated, bringing online new application servers in batches, before tearing down the old version.
Terraform allows us to version, provision and preview changes to our infrastructure and also serves as an auditing tool for tracking down unwanted manual operations. It is called through Spinnaker to prepare the ground for new cluster versions.
How do you keep track of spots, though?
Now, we’ve said that spot instances come and go - it must be quite hard keeping an eye and collecting metrics from those servers, right? Not quite, cloud providers to the rescue - again. Prometheus fortunately supports all major cloud provider APIs and can detect and group machines based on instance tags. Prometheus also supports a number of service discovery tools, such as Consul, another tool made by Hashicorp which we use internally.
We use spot instances as often as we can. It reduces costs, increases resilience and poses interesting challenges for all technically-minded personnel, developers and operations engineers alike.
The “cattle, not pets” approach takes time to properly set up, as it involves spending time on planning, establishing a more complex than usual application and system architecture and requires its key concepts – auto-healing and quick node start-up – to have bulletproof implementations.
Adopting “pets” is easy and provides quick gratification, but the responsibility and cost of taking care of them adds up in the mid-to-long term to the point in which nobody would dare lay a finger on them. With “cattle”, though, you have the peace of mind that they can be put down at any time, without disrupting both day to day and night-to-night operations.