3 ways data teams can avoid a tragedy of the cloud commons

News March 12, 2023 techietr


In 1833, British economist William Forster Lloyd coined the term “Tragedy of the Commons” to describe a situation in which individual users who have open access to a collective resource, unimpeded by formal rules that govern access and use, will act according to their own self-interest and contrary to the common good.

In Lloyd’s famous hypothetical, a group of individual herders share a public pasture for grazing their cattle. As each herder seeks to optimize his or her own economic gain by giving more of his or her cows access to graze, the commons eventually becomes depleted to the detriment of all.

In other words, when a finite resource is treated as limitless and “free,” and used with little consideration of cost or consequence, its use becomes unsustainable.

There’s a similar phenomenon happening in today’s cloud-first data operations (dataops) environment. The “commons” in this case is the public cloud, a shared resource that appears to be free to the data teams using it since they have little visibility into what their cloud usage actually costs.

Crisis in the cloud

Industry analysts estimate that at least 30% of cloud spend is “wasted” each year — some $17.6 billion. For modern data pipelines in the cloud, the percentage of waste is significantly higher, estimated at closer to 50%.

It’s not hard to understand how we got here. Public cloud services like AWS and GCP have made it easy to spin resources up and down at will, as they’re needed. Having unfettered access to a “limitless” pool of computing resources has truly transformed how businesses create new products and services and bring them to market.

For modern data teams, this “democratization of IT” facilitated by the public cloud has been a game-changer. For one thing, it’s enabled them to be far more agile as they don’t need to negotiate and justify a business case with the IT department to buy or repurpose a server in the corporate data center. And as an operational expenditure, the pay-by-the-drip model of the cloud makes budget planning seem more flexible.

However, the ease with which we can spin up a cloud instance has unintended consequences: forgotten workloads and over-provisioned or underutilized resources, which lead to spiraling and unpredictable costs. Near-infinite cloud resources also make it easy to simply throw additional compute at inefficient queries.

The practice of FinOps has emerged in part as a response to this democratization of IT. The unifying principle of FinOps is that finance, engineering and business teams, brought together to make decisions around cost and performance, will act more responsibly, provided they have access to the right data to inform those decisions.

According to the 2022 State of FinOps report, the biggest challenge facing organizations trying to establish a FinOps culture is “getting engineers to take action on cost optimization.” The authors go on to say that with so many data projects on their backlog and nearly unlimited cloud resources at their disposal, it’s understandable that data engineers naturally prioritize new data pipeline creation and timely data delivery over resource optimization.

Asking engineers to take ownership of cost is sound advice, but this type of generalized guidance glosses over just how difficult a task this can be, and it raises the question: How can data engineers be accountable if they can’t capture accurate and easy-to-understand metrics about actual usage requirements? Moreover, how do you encourage this type of accountability without sacrificing cloud agility?

Empowering data teams via feedback loops

One powerful mechanism to change behavior is providing people with information about their actions in real time so they can alter their behavior accordingly. This is the fundamental premise of a feedback loop.

For instance, think about the black box that is residential electricity consumption. Few of us have real-time access to utility pricing or a sense of how much it really costs us to run a household appliance. But connect a smart meter to an outlet and suddenly you can just look at an app on your phone and understand at a much more granular level precisely how much energy each device that’s plugged in is using and therefore what it’s costing you.

It’s also important to consider the role that behavior theory and incentives play in shaping how we make decisions. In the context of cloud consumption, the incentives at work for a data engineer are quite different from those of the finance director. The data engineer is primarily motivated by and held accountable to metrics related to performance and reliability. They want to know: Are my applications running reliably, on time, every time?

Given those incentives, engineers have become conditioned to overestimate the resources an application might require. It’s not that they are intentionally over-provisioning resources; rather, they simply don’t know exactly how many resources, or what size, are actually needed, so they “guesstimate,” erring on the side of too much rather than too little.

In order for engineers to take action on cost optimization, they need to be given the granular-level usage details that enable them to make informed and defensible choices — and do so without worrying that they will fall short on their service-level obligations.

Getting at this information, however, is anything but easy. The data pipelines that feed modern data apps are enormously complex, and the sheer size and scale of the data workloads only amplify the challenge of identifying cost-saving opportunities.

A flight path to cloud usage observability

This is the problem that full-stack observability, informed by AI algorithms and machine learning models, was designed to address. There are several ways in which the deep visibility that observability enables can help data teams more fully understand their usage costs and nudge their behavior to become more cost-conscious.

Start at the job level: While most cloud cost control measures take a top-down approach that gives a bird’s-eye aggregated view of spending, they don’t really help users identify exactly where the cost-saving opportunities lie. Controlling cloud costs starts at the job level, as there are typically thousands of jobs running on more expensive instances than necessary. Without deep visibility into the actual resource requirements of each job over time, data teams are just guessing as to what they think they will need.

Enable showback to align IT value with cost: To help connect the dots between what data teams are consuming and what they are spending, a growing number of organizations are using observability to generate showback and/or chargeback reports — itemized bills of materials that show precisely who is consuming what resource and what it costs. With this type of intelligence, cost allocations can be put into a context that makes sense to all — whether that’s breaking down costs by department, team, project or application all the way down to the individual job or user level.

Provide users with prescriptive recommendations: It’s not enough to simply throw a bunch of charts and metrics at engineers and expect them to puzzle everything out to make the right choices. Instead they need to be served up actionable and prescriptive recommendations that tell them in plain English precisely what steps they should take. This level of self-service will empower engineers to make more cost-effective decisions on their own so they can take individual responsibility and be held accountable for their cloud usage.
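
To make these three steps concrete, here is a minimal sketch in Python of how job-level usage records might feed both a showback report and a plain-English rightsizing recommendation. Everything in it is an assumption made for illustration: the JobRun fields, the flat price per vCPU-hour, the 25% headroom rule and the sample numbers do not come from any particular cloud provider or observability product.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class JobRun:
    """One run of a data job. All fields are hypothetical."""
    job: str
    team: str
    vcpus_provisioned: int
    peak_vcpus_used: float
    hours: float
    price_per_vcpu_hour: float  # assumed flat rate, not a real price list

    @property
    def cost(self) -> float:
        # You pay for what was provisioned, not for what was used.
        return self.vcpus_provisioned * self.hours * self.price_per_vcpu_hour


def showback(runs):
    """Step 2: roll job-level cost up to the team that incurred it."""
    totals = defaultdict(float)
    for run in runs:
        totals[run.team] += run.cost
    return dict(totals)


def recommend(runs, headroom=1.25):
    """Step 3: suggest a smaller size per job, keeping headroom above peak use."""
    peak = defaultdict(float)
    provisioned = {}
    for run in runs:
        # Step 1: look at each job's observed peak across all of its runs.
        peak[run.job] = max(peak[run.job], run.peak_vcpus_used)
        provisioned[run.job] = max(provisioned.get(run.job, 0), run.vcpus_provisioned)
    advice = []
    for job, used in peak.items():
        target = max(1, math.ceil(used * headroom))
        if target < provisioned[job]:
            advice.append(f"{job}: reduce from {provisioned[job]} to {target} vCPUs")
    return advice


# Made-up sample data for illustration only.
runs = [
    JobRun("nightly_etl", "analytics", 16, 5.0, 2.0, 0.05),
    JobRun("nightly_etl", "analytics", 16, 6.0, 2.0, 0.05),
    JobRun("feature_build", "ml", 8, 7.5, 1.0, 0.05),
]

print(showback(runs))
print(recommend(runs))
```

With this sample data, only nightly_etl is flagged: its observed peak of 6 vCPUs plus headroom fits in 8, well under the 16 provisioned, while feature_build is already running close to its limit and is left alone. A real system would look at memory, I/O and service-level targets as well, but the principle is the same: the recommendation is derived from what each job actually used, not from a guess.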

One of the enduring lessons from the Tragedy of the Commons analogy is that when everyone is responsible, no one is responsible. It’s not enough to tell stakeholders to be accountable; you need to provide them with the tools, insights and incentives that are needed to change their behavior.

Clinton Ford is DataOps champion at Unravel Data.
