Yes, you built that... but at what cost?

A Thought Experiment

You tighten the last screw on the bridge. Feeling triumphant, you raise your hands and cheer. Your crew cheers. The bridge is five years late, but it is done. The people you started with are no longer with you. But they were traitors and disloyal people. “Tough times don’t last, tough people do,” you mutter to yourself, rationalizing the loss of your entire team.

Was the bridge a success or a failure? Did you immediately jump to an answer? Do you have enough information to answer that question? Let’s assume you picked one of the two options.

Maybe you think that the project was a failure because it was five years late. But what if the city really needed that bridge? Now that the bridge exists, it has cut average commute times by thirty minutes. Is it not a success?

Maybe you think the project was a success because, even though it took five extra years, the bridge was finally built. But what if the bridge took so long to build that it bled the city’s treasury dry, people moved away, and there was barely anyone left to use it? Is it still a success?

Maybe the city realised that the bridge was going to be overdue and that its budget wouldn’t be enough, so it decided to stimulate economic activity, which led to an increase in immigration. Now, despite the bridge being five years late, there are far more people who benefit from it. Is it a success or a failure now?

Reframing the Question: What’s the Best Possible Outcome?

Let’s imagine another scenario. Assume that the city finds the best damn engineers on the planet and the most loyal workers, and builds the bridge within the estimated time and budget. Is this the best possible outcome? Notice that this time I didn’t ask whether the project was a success or a failure. I’m more interested in whether it was the best possible outcome. The answer is: it depends.

What if there was already a single ferry that transported vehicles, and crossing the river by ferry took only slightly longer than crossing by bridge? Suppose the real problem was that during peak hours, cars had to wait for two or three ferry trips before they could get on board. Instead of building a bridge, the city could simply buy more ferries, and that would bring down the wait time for cars.
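To make the ferry arithmetic concrete, here is a minimal sketch in Python. The function name and every number in it (peak demand, ferry capacity, round-trip time) are made up for illustration, and it ignores loading time and uneven arrivals. It only estimates how many ferries are needed so that peak-hour demand doesn't pile up into a queue.

```python
import math


def ferries_needed(peak_cars_per_hour: float,
                   cars_per_ferry: int,
                   round_trip_minutes: float) -> int:
    """Smallest number of ferries whose combined hourly capacity
    covers peak demand (hypothetical, simplified model)."""
    trips_per_hour = 60 / round_trip_minutes
    cars_per_ferry_per_hour = cars_per_ferry * trips_per_hour
    return math.ceil(peak_cars_per_hour / cars_per_ferry_per_hour)


# Illustrative numbers only: 300 cars/hour at peak, 40 cars per ferry,
# and a 20-minute round trip (so each ferry moves 120 cars/hour).
print(ferries_needed(300, 40, 20))  # 3
```

With these assumed numbers, a single ferry serves less than half of peak demand, which is the "wait for two or three trips" situation described above.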

The real problem that the city was trying to solve was a high average travel time across the river. The real problem wasn’t how to build a bridge in time or within budget. The real problem wasn’t what to do if the bridge goes over the estimated time. In that context, building a bridge was not the best possible outcome. The best possible outcome was to add more ferries.

Let’s say that the estimated cost of adding more ferries is the same as the estimated cost of building a bridge, and adding more ferries would reduce commute time by only 15 minutes (a smaller reduction) compared to a bridge which would reduce it by 30 minutes (a bigger reduction). Is adding more ferries still a better outcome? Yes, but why? Because there is more risk involved in building a bridge.

Even if both projects had the same estimated costs, the actual costs can differ greatly based on their risk profiles. The other benefit of the ferry solution is that it can be implemented faster than a bridge. Another problem with bridges is that they are not scalable. A bridge can only sustain X cars at any given time, and if you want to support more than X cars, you need to build another bridge. With ferries, you could simply add more and easily support more traffic. So even though ferries don’t reduce commute time as much as bridges do, there are other considerations which make ferries a much better choice than bridges. Of course, I’m trying to illustrate a point, and real-life scenarios are even more nuanced.

The point I’m trying to make is that there are tradeoffs to each solution. Tradeoffs are what make real-world problem solving so darn interesting. In the above example, we chose a cheaper, less risky solution that was faster to implement and scalable if needed, but that did not cut commute times as much as the more expensive, riskier solution that would take longer to implement. The ferry solution would be a bad idea, however, in a place with lots of storms and unpredictable weather. In that case, it might make sense to take on the riskier solution of building a bridge. So there is no black-and-white answer to this problem.
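One rough way to see how equal estimates can hide unequal risk is to weight each possible overrun by how likely it is. The sketch below does that in Python. The scenarios, probabilities and multipliers are invented purely for illustration, and real risk profiles are rarely this tidy.

```python
def expected_cost(estimate: float, scenarios: list[tuple[float, float]]) -> float:
    """Probability-weighted cost, given (probability, overrun multiplier) pairs."""
    total_probability = sum(p for p, _ in scenarios)
    if abs(total_probability - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")
    return estimate * sum(p * multiplier for p, multiplier in scenarios)


# Both options share the same estimate; only the assumed risk differs.
ESTIMATE = 100.0
BRIDGE_SCENARIOS = [(0.5, 1.0), (0.3, 1.5), (0.2, 3.0)]  # hypothetical
FERRY_SCENARIOS = [(0.8, 1.0), (0.2, 1.2)]               # hypothetical

bridge = expected_cost(ESTIMATE, BRIDGE_SCENARIOS)
ferries = expected_cost(ESTIMATE, FERRY_SCENARIOS)
print(f"bridge: {bridge:.0f}, ferries: {ferries:.0f}")
```

Under these made-up assumptions, the bridge's expected cost comes out at roughly one and a half times its estimate, while the ferries stay close to theirs, even though both started from the same number.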

The Three Pillars of Tradeoff Analysis: Costs, Benefits, and Risks

When deciding between tradeoffs, there are three factors to consider: costs, benefits, and risks. If you consider any of these in isolation, you won’t make good tradeoffs.

Assessing benefits

Let’s start with benefits. These are the first things we tend to think of when considering a solution. Benefits might look like “Feature X will be 10 times faster”, “We won’t have to manage service Y and can focus on developing Z”, or “We won’t have to maintain A anymore”. Benefits are also subject to optimism bias: we tend to overestimate how beneficial something will be. Keeping this bias in mind is important when comparing benefits. In my opinion, benefits are not the most important consideration when comparing two solutions. Costs and risks are more important.

Costs are more than time and money

Costs are the most complicated (amongst the three) to estimate. There are different types of cost. The first type is the amount of cash you will spend on a project. This is where most people tend to stop with their estimation.

The second type is opportunity cost. If you choose to use your resources on one project, you won’t be able to use those same resources on another project (assuming you have limited resources). In simple terms, if you choose to bake a cake, you can’t bake a pizza in the oven at the same time. Opportunity costs are less intuitive than cash costs, which is why they are easy to overlook.

The third type of cost is what future possibilities are being eliminated by the choice you are making. If you choose to build a skyscraper in one place, you can’t build a garden or school in that exact same place.

The fourth type of cost is the kind of problems you will have to solve if you choose a specific solution. This also intersects with risk because depending on the problems you have to solve, and the resources you have, the same problems can have different risk profiles.

For example, if you decide to go with bare metal servers for your infrastructure, you will have to deal with the problems of infrastructure management. But if you have a big team of kickass cloud and networking engineers, these are not really big problems for you, and hence the choice is less costly and less risky. If your team lacked that specific expertise, the same choice would be costlier and riskier.

Another example is that a frontend project will be more expensive if a backend engineer does it compared to a frontend engineer doing it. Not because the backend engineer gets paid more, but because they lack expertise in the field. What can be a week’s worth of work for the frontend engineer might well be two weeks’ worth of work for the backend engineer.

When picking between two solutions, you will often have to solve two different sets of problems. Some sets of problems are easier for you to solve than others, and if possible you should choose to solve the easier ones.

The fifth type of cost is the sunk cost. It is a cost not in the sense that you must factor it into a decision, but rather in the sense that you must be aware of the sunk cost fallacy and avoid it while making a decision. Let’s say you built feature X, and while feature X is used by some customers, it isn’t that popular. Feature X also makes it much costlier to develop feature Y. But you don’t want to remove feature X, because you spent a year building it. So you keep feature X, and developing feature Y becomes significantly more expensive. This example also shows the third type of cost (the cost of limiting future possibilities).

Cost is the most important factor to consider when deciding between solutions because the cheapest feature is the one you don’t build.

Risk

Risk is the third factor to consider when choosing solutions. Risk can derail your entire project and bankrupt you. Risk is highly domain specific: the risks in IT projects aren’t the same as the risks in construction projects. However, there are certain factors that universally influence risk. The most obvious one is time. Longer projects are riskier than shorter projects. Projects with more moving parts are riskier than ones with fewer moving parts. Beyond a certain number of people, adding more people to a project increases the risk. Using unproven, cutting-edge technologies tends to be riskier than using old, reliable, boring technologies.

Each project or solution comes with some level of risk. Risk is the second most important consideration when deciding between solutions, because by the end of the project, risks will have solidified into costs.

Don’t try to quantify everything

Humans have a tendency to try and assign numerical values to things, mostly because numbers are easy to compare. If you have two projects and one costs ten thousand dollars and the other costs a hundred thousand, it is easy to say that the first project is cheaper than the other. Estimating time and money is important for forecasting spend and planning the roadmap for the company, but they aren’t the only factors that should be considered when deciding between projects.

Not all costs, risks and benefits are easily quantifiable. It is much harder to reason qualitatively about these three factors across projects, but it is important to do so. If you ignore these aspects and only focus on the time and money a project is going to cost, then you are falling prey to measurability bias, which is the tendency to assign importance only to quantifiable factors.

In the real world, estimating costs, risks and benefits is part science and part intuition, because it largely depends on domain expertise and experience. These estimations aren’t a one-time thing. You need to constantly re-evaluate them over the lifespan of the project, because costs and benefits are dynamic. There are many stories of companies that started off with fully managed cloud services, grew big, and had to migrate to something else because fully managed services became too expensive. This doesn’t mean that betting on fully managed services in the beginning was a bad idea.

Real-World Tradeoffs in Action

How Levels.fyi scaled to a million users

This is one of my favorite engineering stories. They used Google Sheets as a backend. Why? It doesn’t cost much to store data, they don’t have to manage or even code the backend, and the service is fault tolerant, highly available, and reliable because it is a Google service. However, as they grew really big, they had to move to Postgres. This is an excellent example of how to make a great tradeoff and also of how tradeoffs change over time.

How an HFT firm built the fastest garbage collector

Another story that really stuck with me was how an HFT (high-frequency trading) firm handled the problems of garbage collection. Low-latency systems are critical to HFT firms. Most modern programming languages come with a garbage collector. But the problem with garbage collection is that your program pauses while it runs, which affects the latency of your program. The bigger the pause, the higher the latency, which is obviously problematic for these firms.

One obvious solution is to use a language with manual memory management like C, but that would mean slower development and dealing with potential memory leaks and other bugs. Another solution would be to create a better garbage collector that would be much faster than the default one. This would mean investing in compiler engineers, maintaining a garbage collector, and so on. At some point, trying to optimise the garbage collector would reach the point of diminishing returns.

But what if you took a step back and asked what the real problem is? Why do we need garbage collection? We need it to free up RAM. If we had infinite RAM, we wouldn’t need a garbage collector. We don’t have infinite RAM, but the key insight the HFT firm had was that their low-latency programs only needed to operate during market hours (so 9-5 on weekdays). That meant they could shut down their computers at the end of the day, and that shutdown could be the garbage collection. So they decided to carefully measure how much memory their program would use over an entire day, give the machine enough RAM for that, turn off the garbage collector, and let the program run. The fastest garbage collector is the one you don’t run.

This company effectively picked the problem of having developers carefully measure memory usage of their programs, instead of figuring out how to write a faster garbage collector. This tradeoff was domain specific, because you can’t use this strategy when your program has to run 24x7.
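The story doesn't say which language or runtime the firm used, so the following is only a loose analogue in Python, not their implementation. It turns off CPython's cyclic garbage collector for the duration of a workload and measures peak memory with `tracemalloc`, so you can check it against a budget. Two caveats: CPython still frees most objects through reference counting even with `gc` disabled, and `tracemalloc` only sees memory allocated by Python itself. The function name and the budget figure are hypothetical.

```python
import gc
import tracemalloc


def run_without_gc(workload, memory_budget_bytes: int):
    """Run workload() with the cyclic GC disabled and report
    (peak traced bytes, whether the peak stayed within the budget)."""
    was_enabled = gc.isenabled()
    gc.disable()                      # no collector pauses during the run
    tracemalloc.start()
    try:
        workload()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
        if was_enabled:
            gc.enable()               # restore the previous setting
    return peak, peak <= memory_budget_bytes


def sample_workload():
    # Stand-in for "a day of trading": allocate some short-lived buffers.
    return [bytearray(1024) for _ in range(1000)]


peak, fits = run_without_gc(sample_workload, memory_budget_bytes=64 * 1024 * 1024)
print(f"peak={peak} bytes, within budget: {fits}")
```

The measuring step is the part that carries over from the story: the approach only works if you know the peak ahead of time and provision more memory than that.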

Conclusion

In the end, making good tradeoffs is about picking the right problems to solve, picking the easier problems to solve, and, if you can, avoiding problems instead of solving them. I like to think of it as a tree of problems. At the root is the high-level challenge, like a business trying to be profitable. One level down, you might wonder which path to pursue - do you start a bakery or a database company? Each path comes with its own set of problems to solve. As you move further down the tree, you encounter increasingly specific issues. The bakery has to deal with inventory management and standing out from the competition. The database company might wrestle with fundraising, hiring top talent, and B2B sales. Finally, if you zoom all the way in, you will find engineers debating algorithms and bakers testing recipes.

This mental model always reminds me that the problem I am currently solving is in service of a bigger, higher-level problem. If the current problem I am solving is intractable, I could go a level higher and see if I could solve that in a different way.

Making good tradeoffs is hard, because a lot of the thinking that goes into them is naturally difficult for humans. Thinking about opportunity costs is not at all intuitive. Plus, it is hard to compare risks and costs qualitatively. But it is something you can get better at with conscious effort. Because making good tradeoffs is hard, not a lot of people can do it, but the people who can are rewarded and highly valued.

P.S. I’d like to thank Aditya Athalye for reading and reviewing this.
