Outshift | Overspending in the cloud

Cloud cost management series:
Overspending in the cloud
Managing spot instance clusters on Kubernetes with Hollowtrees
Monitor AWS spot instance terminations
Diversifying AWS auto-scaling groups
Draining Kubernetes nodes
Cluster recommender
Cloud instance type and price information as a service

Cost is one of the primary advantages cited whenever a team is deciding whether to move a deployment to the cloud. There are no upfront costs because you don't have to buy any hardware, and you only pay for what you really use because you can scale your infrastructure according to your workloads. Simple, right? Well, it's not. According to this blog post, enterprises will waste $10 billion in the cloud next year. Wait, but why?

Complicated pricing model

Cloud provider pricing tends to be very complicated. Even though AWS claims that it is continuously simplifying its cloud pricing model, that model is still very difficult to understand clearly when you're planning your infrastructure. Just try running this AWS CLI command against the Pricing API, which describes EC2 products with their different pricing options (and if you do try it, watch out for pagination).

aws pricing get-products --region us-east-1 --service-code AmazonEC2
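The pagination mentioned above can be handled in a few lines of code. The following is a minimal sketch, not a definitive implementation: it assumes a client object that behaves like boto3's Pricing client (`get_products` accepting `ServiceCode` and `NextToken`, and returning `PriceList` and `NextToken`). The function name `iter_products` is made up for this illustration.

```python
import json
from typing import Any, Dict, Iterator


def iter_products(client: Any, service_code: str = "AmazonEC2") -> Iterator[Dict[str, Any]]:
    """Yield every product of a service, following NextToken across pages.

    Assumption: `client` behaves like the object returned by
    boto3.client("pricing", region_name="us-east-1").
    """
    kwargs: Dict[str, Any] = {"ServiceCode": service_code}
    while True:
        page = client.get_products(**kwargs)
        # Each PriceList entry is a JSON document serialized as a string.
        for raw in page.get("PriceList", []):
            yield json.loads(raw) if isinstance(raw, str) else raw
        token = page.get("NextToken")
        if not token:
            break  # last page reached
        kwargs["NextToken"] = token


# Hypothetical usage against the real API (requires boto3 and AWS credentials):
#   import boto3
#   pricing = boto3.client("pricing", region_name="us-east-1")
#   for product in iter_products(pricing):
#       print(product["product"]["sku"])
```

Because the function is a generator, it never holds the whole price list in memory, which matters given how many products the EC2 service code returns.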

EC2 prices vary by region and instance type (and there are more than 100 different instance types available). As soon as you decide which type to use, you'll realize that you're faced with at least a dozen payment options. Do you want to go completely on-demand, or do you want to pay upfront? How many instances do you want to pay for upfront? If you do decide to pay upfront, do you want the added flexibility of convertible instances? Will you commit to one year or three? Will you pay for everything upfront or just part of it? There's even a marketplace where you can sell your reserved instances. And we haven't even touched on spot pricing.
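To make the on-demand versus upfront question concrete, here is a rough sketch of the underlying arithmetic. All prices in the usage note are invented for illustration and are not real AWS rates. The helper names are hypothetical, and the model deliberately ignores convertible instances, the reserved instance marketplace, and spot pricing.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours


def on_demand_cost(hourly_rate: float, hours_used: float) -> float:
    """On-demand instances are billed only for the hours they run."""
    return hourly_rate * hours_used


def reserved_cost(upfront: float, hourly_rate: float, term_years: int = 1) -> float:
    """A reservation is billed for every hour of the term, used or not."""
    return upfront + hourly_rate * HOURS_PER_YEAR * term_years


def break_even_utilization(on_demand_rate: float, upfront: float,
                           reserved_rate: float, term_years: int = 1) -> float:
    """Fraction of the term an instance must run for the reservation to pay off."""
    total_hours = HOURS_PER_YEAR * term_years
    return reserved_cost(upfront, reserved_rate, term_years) / (on_demand_rate * total_hours)


# Hypothetical prices: $0.10/h on-demand versus $250 upfront plus $0.03/h reserved.
utilization = break_even_utilization(on_demand_rate=0.10, upfront=250.0, reserved_rate=0.03)
print(f"The reservation pays off above {utilization:.0%} utilization")
```

With these made-up numbers the one-year reservation costs $512.80 against $876.00 for a full year on-demand, so the reservation only wins if the instance runs for more than roughly 59% of the year.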

Not taking advantage of new cloud services

Cloud providers are developing new services and new options for existing services at breakneck speed. Besides their standard VMs, they now offer serverless compute (like AWS Lambda or Azure Functions), a wide variety of persistent storage solutions (since re:Invent there's even serverless Aurora), different analytics and big data solutions, and managed Kubernetes. You can now run containers without managing their infrastructure at all. The AWS CLI boasts 109 different subcommands, which is roughly how many services AWS offers.

$ aws

Display all 109 possibilities? (y or n)

It all sounds great, but who knows whether, from a cost/benefit perspective, it's better in the long run to deploy a plain VM-based infrastructure and manage it yourself, or to let a cloud provider oversee different aspects via managed services? And a lot of development teams aren't even aware that they have these options; they just want to move to the cloud as quickly and painlessly as possible.

Paying for unused resources

It's not uncommon for cloud customers to pay for resources they're not even using. This can be the result of simple inefficiencies, like dev VMs that are running at night when nobody's using them, unused backup snapshots, unattached volumes, or custom AMIs that have fallen out of favor. It may seem obvious that the solution is to write cleanup scripts, but in practice developers leaving resources up and running remains one of the main reasons for overspending in the cloud.

And there are more complicated causes of waste. It's not easy to determine which instance type is best suited to a particular workload. Customers frequently use much larger instances than they need, just to be on the safe side or because they don't know that there are specific instance families better tailored to their needs.

Last but not least, there's the scaling problem. In theory, it's obvious that we should only use what we really need, but you don't want your service to break when there's a spike in traffic. If you'd like to be especially efficient, you'll want your infrastructure to scale automatically, based on demand. But if you've had this problem before, you'll know that's easier said than done.
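As an illustration of the kind of cleanup script mentioned above, the sketch below flags unattached volumes. It assumes its input is shaped like the `Volumes` list returned by the EC2 DescribeVolumes API, where a `State` of `available` means the volume is not attached to any instance and `Size` is given in GiB. The function names are made up for this example, and the sample data is invented.

```python
from typing import Any, Dict, Iterable, List


def unattached_volumes(volumes: Iterable[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the volumes that are not attached to any instance.

    Assumption: each dict looks like an entry of the "Volumes" list
    from EC2 DescribeVolumes ("State", "Attachments", "Size", "VolumeId").
    """
    return [
        v for v in volumes
        if v.get("State") == "available" and not v.get("Attachments")
    ]


def wasted_gib(volumes: Iterable[Dict[str, Any]]) -> int:
    """Total size, in GiB, of storage that is paid for but unattached."""
    return sum(v.get("Size", 0) for v in unattached_volumes(volumes))


# Invented sample data in the DescribeVolumes shape.
sample = [
    {"VolumeId": "vol-1", "State": "in-use", "Size": 100,
     "Attachments": [{"InstanceId": "i-1"}]},
    {"VolumeId": "vol-2", "State": "available", "Size": 500, "Attachments": []},
    {"VolumeId": "vol-3", "State": "available", "Size": 20, "Attachments": []},
]
for volume in unattached_volumes(sample):
    print(volume["VolumeId"], volume["Size"], "GiB")
```

A report like this is the easy part; someone still has to decide whether each flagged volume can really be deleted, which is why such resources tend to linger.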

Solving these problems

The spot instance problem was the first thing we wanted to solve. We wanted to make sure that we could safely run a cluster of VMs in an Auto Scaling Group without worrying about spot instance termination. There are some well-known companies at work on similar projects, but these are closed-source alternatives, and as a 100% open source company we didn't want to use them. Additionally, you have to give these companies permission to access a large part of your AWS account for their managed services to work, and we didn't want to do that either.

We are using Auto Scaling Groups in some of our other projects, like Pipeline, and we wanted to continue using them routinely instead of using, for example, Spotinst's Elastigroup or Amazon's Spot Fleet in each of our clusters. To use those groups we would need to rewrite these projects and introduce a hard dependency, all just to save some money on AWS.

One project we explored and experimented with was Autospotting. It's a very promising, small project that meets almost all of our needs with regard to spot instances, and it also gave us a few ideas. But we wanted to do things a little differently, and we later realized that spot instances weren't our only problem: we had others that we wanted to handle in a fundamentally similar way. It became clear that a more general approach was called for.

So we're not the first company to become interested in solving these problems, but it's difficult to find an overarching solution to cost optimization in the cloud. And it's even harder to find something that's open source and fits neatly into a cloud native environment. We're throwing our hat into the ring with a new project, Hollowtrees, that's been designed to tackle these difficult challenges. Our project will be open source, cloud native friendly, and pluggable, so anyone in the community can easily extend it to suit their needs.

Hollowtrees is based on a ruleset-controlled alert-react model, wherein alert plugins notify the Hollowtrees engine if something seems wrong on the cloud provider or on a monitoring system (like Prometheus). React plugins then intervene automatically by taking direct action on the cloud provider or in Kubernetes. These plugins understand Kubernetes building blocks and are also aware of Pipeline's spotguides. We'll go into the details of this architecture and its plugins in upcoming blog posts over the next few weeks.
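To give a feel for what a ruleset-controlled alert-react model can look like, here is a toy sketch. It is not Hollowtrees code and does not reflect its actual API: every class, field, and plugin name below is hypothetical, and real plugins would call a cloud provider or Kubernetes rather than append to a list.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Alert:
    """An event raised by an alert plugin (all fields are hypothetical)."""
    name: str
    labels: Dict[str, str] = field(default_factory=dict)


@dataclass
class Rule:
    """Maps a kind of alert to the react plugins that should handle it."""
    alert_name: str
    match_labels: Dict[str, str]
    reactions: List[str]

    def matches(self, alert: Alert) -> bool:
        return alert.name == self.alert_name and all(
            alert.labels.get(key) == value for key, value in self.match_labels.items()
        )


class Engine:
    """Routes incoming alerts to registered react plugins according to the rules."""

    def __init__(self, rules: List[Rule]) -> None:
        self.rules = list(rules)
        self.plugins: Dict[str, Callable[[Alert], None]] = {}

    def register(self, name: str, plugin: Callable[[Alert], None]) -> None:
        self.plugins[name] = plugin

    def handle(self, alert: Alert) -> List[str]:
        """Run every reaction whose rule matches; return the plugin names that ran."""
        fired: List[str] = []
        for rule in self.rules:
            if rule.matches(alert):
                for name in rule.reactions:
                    self.plugins[name](alert)
                    fired.append(name)
        return fired


# Hypothetical wiring: drain the node when a spot termination notice arrives.
actions: List[str] = []
engine = Engine([Rule("spot_termination_notice", {"provider": "aws"}, ["drain_node"])])
engine.register("drain_node", lambda alert: actions.append("drain " + alert.labels["node"]))
engine.handle(Alert("spot_termination_notice", {"provider": "aws", "node": "node-1"}))
```

Keeping the rules as plain data separate from the plugins is what makes such a design pluggable: new reactions can be registered without touching the routing logic.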