Credit: Illustration by Toby Dexter
Christopher Meiklejohn was only 19 years old the night he had to drive through a blizzard in a Honda Civic to reboot a server.
It was Jan 19 2002, and he was working for a data centre in Boston that hosted an online shop for the New England Patriots.
His job was to visit the server to fix it in person on the day the Patriots played a crucial home game against the Oakland Raiders.
He managed to do so, just in time, but recalls how the worsening snow on the way back left him stranded with no way to get home.
Today, he is grateful such an arduous task has been taken out of his hands. “This is a lot of what we pay cloud providers for, right?” says Meiklejohn, now a PhD candidate at Carnegie Mellon University studying how to make the internet more resilient. “We pay a premium because they do all of this stuff for us.”
Last Tuesday, millions of web users and thousands of websites got a sharp reminder of the flaws in this arrangement.
The outage that briefly cut off access to Amazon, Reddit, Boots, The Guardian, the Financial Times, the US White House and all Gov.uk websites was not due to freak weather but to a single US cloud computing company called Fastly – one of a handful that have quietly consolidated a surprising share of the internet’s hidden plumbing.
Swathes of the internet including Amazon, Reddit and many news outlets went offline on Tuesday following a glitch affecting a relatively obscure cloud computing company.
Credit: Leon Neal/Getty Images
According to the internet mapping firm Intricately, Fastly is merely the fifth biggest “content delivery network” (CDN), a specific type of cloud service that provides digital liquidity to smaller services that cannot easily handle bandwidth spikes on their own.
The biggest three – Cloudflare, Amazon Web Services (AWS) and Akamai – command an estimated 89pc of this market, and all three have suffered widespread faults since 2010.
These companies are just one corner of a sprawling worldwide apparatus as opaque and impenetrable to most users as the banking system before the 2008 crash. And as Fastly has shown, large parts of the internet may now be just one domino away from collapse.
“We tend to forget that the fact that the internet works on a day-to-day basis is close to a miracle,” says Corinne Cath-Speth, an anthropologist at the Oxford Internet Institute who studies how internet infrastructure exerts political power.
“We think of it as this ephemeral thing – we literally talk about ‘clouds’. But clouds are servers, clouds are things you can kick, clouds are big, humming machines that stand in rooms that need to be cooled… It takes so many individuals, it takes all these autonomous networks.
“Things go wrong all the time, but we as consumers are so used to a frictionless experience. And because it is so important as day-to-day infrastructure, it kind of freaks us out.”
Fastly said a new service configuration sent a wave of disruption across its so-called "POPs" – the servers that store cached copies of web pages to speed up access – leaving users unable to access certain sites.
Credit: AP Photo/Marcio Jose Sanchez
Originally the internet was built to connect different computers and networks together, with many websites simply hosted on their owners’ personal machines.
Those that were featured on popular blogs such as Slashdot and Digg often crashed under the weight of new traffic.
Meiklejohn recalls having to ship and plug in a heavy rack of servers as temporary email capacity for the 2005 Boston Marathon. That decade, though, websites began to fill up with images and videos, creating soaring bandwidth requirements.
CDNs solve these problems with a worldwide fleet of servers that can deliver content quickly to users from within their own countries, flexibly absorbing sudden shocks and pooling the costs of keeping spare capacity.
Cath-Speth says CDNs tend towards market concentration because the business depends on overwhelming bandwidth, economies of scale, expensive physical data centres and serious programming talent.
Nothing stays up forever, and so in this trade risk is quantified by “nines” – a contractual guarantee to keep each client’s services running for perhaps four nines (99.99pc) or five nines (99.999pc) of the year. “Which means that you have to anticipate failure, right?” says Meiklejohn. “The company’s literally telling you that it can be down this long and there’s nothing you can do.”
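To put those “nines” in concrete terms, the short sketch below converts an uptime guarantee into the downtime it permits over a year. It is illustrative arithmetic only: the function name is hypothetical, and it assumes a 365-day year and a guarantee measured across the whole year, which real contracts may define differently.

```python
# Illustrative arithmetic only. The function name and the 365-day year are
# assumptions made for this sketch, not terms taken from any provider's contract.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def allowed_downtime_minutes(nines: int) -> float:
    """Return the yearly downtime budget, in minutes, for an uptime of `nines` nines."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10 ** -nines  # four nines -> 0.0001 of the year may be downtime
    return MINUTES_PER_YEAR * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        print(f"{n} nines: about {allowed_downtime_minutes(n):.2f} minutes a year")
```

On these assumptions, four nines still leaves room for a little under an hour of downtime a year, and five nines for roughly five minutes.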
Some companies adjust for that: Netflix has standing preparations for the loss of an entire AWS “region”, of which Europe has six.
Yet Meiklejohn’s supervisor Heather Miller says many of those who build services in the cloud are not aware of the risk, since it is so distributed throughout a system.
“Instead of there being maybe 50 hidden relationships between various companies, we’re talking about millions. We just can’t see the scale of it,” she says. “It’s really difficult to pin the blame on who, because the graph is so big.”
Fastly’s hour-long outage last week was set off by a single customer updating their settings. That change triggered a previously undiscovered bug in Fastly’s code, which caused 85pc of the company’s network to return errors.
We identified a service configuration that triggered disruptions across our POPs globally and have disabled that configuration. Our global network is coming back online. Continued status is available at https://t.co/RIQWX0LWwl
— Fastly (@fastly) June 8, 2021
ParcelHero, the e-commerce delivery company, estimates retailers across the UK, Europe and US will have lost around £1bn because of the outage.
“Amazon alone currently turns over $950,000 a minute,” says David Jinks, ParcelHero’s Head of Consumer Research. “It was one of the quickest sites to get back online but some organisations were down for around an hour.”
It’s a high price to pay, but the consequences of an online systems failure could prove much worse, according to Cath-Speth.
“It’s annoying if you can’t watch the final episode of The Crown,” she says. “That’s unfortunate. But what if today you were trying to get a Covid-19 vaccination appointment, and the website was down? And so you didn’t get your vaccination, and tomorrow you catch Covid, and the day after you’re dead?”
It’s not just CDNs that are vulnerable. Data centres, undersea cables, telecoms networks and third-party software providers are just a few areas that may be at risk of outages from cyber attacks or natural disasters.
“The internet wasn’t designed for the amount of data being delivered at the speeds we want it to be delivered at,” says Gav Winter, the CEO of security company Rapidspike.
“More than 70pc of shoppers admit to leaving websites because something took more than the seconds to load. However, you leave yourself open to catastrophic failures like this.”
“The core infrastructure needs to change, be more resilient and decentralised,” he adds.
For all the risks inherent in our online systems, however, the internet seems to work most of the time. That may be why governments and policymakers have so far not made it a priority to build resilience into the system.
Cath-Speth, however, believes that it should be a legal requirement for governments to have multiple CDN contracts: “Clearly there were many government websites, or many government entities, that either didn’t take this into consideration or thought it to be too expensive.”
The world could soon find out the true cost of not preparing. Winter believes a much larger outage could happen in the near future:
“The potential of this is very real…and the next one could be much worse.”