The first thought I had, reading about the Skype meltdown last week, was: here’s another opportunity to score points for unit testing. There’s nothing like a single bug causing so much havoc (or silence, in this case) to draw developers’ attention to the catastrophes waiting to happen.
But I decided to go deeper. There are a few lessons in what happened to Skype, and indeed to a few other services we’ve become so dependent on lately.
The power of one bug. The Skype bug managed to disrupt businesses, not just consumers. I rely heavily on Skype for business calls and, like many others, I was left stranded. Would you build your business around such a fragile infrastructure? (The answer is, of course, yes. We won’t really change our ways. We’re too addicted. But we’ll think twice, and maybe spread the risk so we won’t get hurt too much next time.)
The cost of one bug. Think about the damage to Skype and companies like it: this small bug will have investors thinking twice before putting money into infrastructure companies. Who wants their name tarnished like that? I always talk about how huge a bug’s cost to the company is once it’s out of development. This is a case in point. The drop in reputation alone is immeasurable.
At some point people see the light. Could this bug have been prevented? Maybe. With proper practices, like code review, unit tests and acceptance tests, it might have been found before release. I don’t know if Skype uses them, and I recommend they do. According to their CIO, they did see the light: they need more testing.
Overcome by complexity. Could the consequences of the bug have been foreseen in advance? I expect that if they had been, Skype would have poured every resource in its power into preventing the bug from seeing the light of day. The world of software has become so complicated that there’s no way to test every scenario, or even to perform a calculated risk estimate. In unit testing you write tests for certain scenarios. It’s simpler to anticipate and comprehend these scenarios at the unit level, and therefore to make sure that code is protected. It’s still a tool for beating the odds, not a way to anticipate and comprehend much more complex scenarios.
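To make “tests for certain scenarios” concrete, here is a minimal sketch in Python. The `retry_delay` function, its parameters and its doubling-with-a-cap behavior are all hypothetical, invented for this illustration; this is not Skype’s code and says nothing about the actual bug. Each test pins down one scenario that is easy to reason about at the unit level.

```python
import unittest


def retry_delay(attempt, base=1.0, cap=60.0):
    """Hypothetical helper: seconds to wait before retry number `attempt` (0-based).

    The delay starts at `base`, doubles with each attempt, and never exceeds `cap`.
    """
    if attempt < 0:
        raise ValueError("attempt must be non-negative")
    delay = base
    for _ in range(attempt):
        if delay >= cap:
            break  # already at the ceiling, no need to keep doubling
        delay *= 2
    return min(delay, cap)


class RetryDelayTests(unittest.TestCase):
    # Scenario: the very first retry waits exactly the base delay.
    def test_first_retry_uses_base_delay(self):
        self.assertEqual(retry_delay(0), 1.0)

    # Scenario: each further attempt doubles the wait.
    def test_delay_doubles_per_attempt(self):
        self.assertEqual(retry_delay(3), 8.0)

    # Scenario: after many failures the wait stops growing at the cap.
    def test_delay_is_capped(self):
        self.assertEqual(retry_delay(20), 60.0)

    # Scenario: invalid input is rejected rather than silently accepted.
    def test_negative_attempt_is_rejected(self):
        with self.assertRaises(ValueError):
            retry_delay(-1)
```

Saved as a module, this can be run with `python -m unittest`. Four small scenarios are nowhere near proof that a whole system behaves under load, which is exactly the point: they protect the unit, and the odds improve one unit at a time.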
Built to last? Software is evolving so quickly that we need disasters like these to teach us how to build systems. And these lessons are very costly. Architects, expert as you are, this is a good opportunity to take a hard look at your company’s architecture. Will this happen to you?
Communication is everything. At first there was silence. Not just from Skype’s customers, but from Skype management. For 24 hours there were no status reports, and no idea what was going on or when this thing would be over. In times of trouble, more communication is needed. More messages would at least have given the impression that Skype was actually doing something to handle the crisis.
The path to improvement. Now they are talking, explaining the cause and how things deteriorated. It’s not just Skype that needs to learn and improve. Other companies can learn from this as well. I applaud Skype for dealing with this openly. They show us not just that they intend to improve; they show others the way too.
It will take time until most companies understand the impact of bugs, and the much lower cost of preventing them. We can learn how to communicate and understand where visibility helps. We can learn from these disasters and improve.
Until next time.