Outage shows how Amazon’s complex cloud makes backup plans difficult
Major companies using Amazon.com’s data services got a painful lesson this week about how the complexity and market dominance of the company’s cloud unit make it difficult to back up their data with other providers, analysts and experts said.
Amazon said that “an impairment of several network devices” in its Amazon Web Services (AWS) Virginia data center region caused the prolonged outage on Tuesday. The outage temporarily interrupted streaming platforms Netflix Inc (NFLX.O) and Disney+ (DIS.N), trading app Robinhood Markets Inc (HOOD.O) and even Amazon’s own e-commerce site, which makes heavy use of AWS.
An Amazon spokesperson said that the issues had been resolved.
The huge trail of damage from a network problem at a single region that AWS calls “US-EAST-1” underscored how difficult it is for companies to spread their cloud computing around.
With 24.1% of the overall market, according to research firm IDC, Amazon is the world’s biggest cloud computing firm. Rivals like Microsoft Corp (MSFT.O), Alphabet’s (GOOGL.O) Google Inc and Oracle Corp (ORCL.N) are trying to lure AWS customers to use parts of their clouds, often as a backup.
But crafting a complex online service that can be easily shifted from one provider to another in case of emergency is far from simple, said Naveen Chhabra, a senior analyst with research firm Forrester. Rather than being a singular “cloud,” AWS is actually composed of hundreds of different services, from basic building blocks like computing power and storage to advanced services like high-speed databases and artificial intelligence training.
Any given website, Chhabra said, might use several dozen of those individual services, each of which must work for the site to function. It is difficult to make a backup on another cloud provider because some services are proprietary to AWS and some work very differently at another provider.
“It’s like saying, ‘Can I put an SUV body on a sedan chassis?’ Maybe, if everything is all the same and lines up. But there is no guarantee,” Chhabra said.
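The arithmetic behind Chhabra's point about dozens of interdependent services can be sketched in a few lines of Python. The figures are hypothetical, not AWS numbers, and the sketch assumes each service fails independently, which real outages often violate: if a site needs every one of its services, the chance that all of them are up at once is the product of their individual availabilities.

```python
import math


def combined_availability(availabilities):
    """Probability that every required service is up at the same time.

    Assumes failures are independent, which is a simplification.
    """
    return math.prod(availabilities)


# Hypothetical figures: 36 services ("several dozen"),
# each available 99.9% of the time.
site = combined_availability([0.999] * 36)
print(f"Chance that all 36 services are up: {site:.2%}")
```

Under these assumed numbers the site as a whole is noticeably less available than any single service it relies on, and a backup on another provider would have to reproduce every link in that chain.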
Another issue that makes it hard for businesses to diversify is that AWS makes it relatively cheap to send data into its cloud, but then charges higher prices for “egress fees” to get data out of its cloud to take to a rival.
“That amplifies issues like this (outage) when they happen,” said Matthew Prince, chief executive of internet security firm Cloudflare Inc (NET.N). “A more resilient cloud is one where egress fees are eliminated and customers can be multi-cloud. I think that would actually increase the faith customers have in the cloud.”
DEPENDENCIES IN ONE REGION
AWS itself has critical “dependencies” within its own services, which are linked together in ways that can cause one to fail when another fails, said Angelique Medina, head of product marketing at Cisco Systems Inc’s (CSCO.O) ThousandEyes. That is because AWS’s complex services are often built on top of its own more basic services. A problem that crops up in a basic function like networking can cascade through the services that depend on it.
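The cascade Medina describes can be illustrated with a small dependency graph. This is a minimal sketch, not a description of AWS's actual architecture: the service names and the links between them are invented for illustration.

```python
from collections import deque

# Hypothetical dependency map: service -> the services it is built on.
DEPENDS_ON = {
    "networking": [],
    "storage": ["networking"],
    "compute": ["networking"],
    "database": ["storage", "compute"],
    "monitoring": ["database"],
    "dns": [],
}


def affected_by(failed, depends_on):
    """Return every service that goes down, directly or transitively,
    when `failed` goes down."""
    # Invert the map: service -> the services built on top of it.
    dependents = {name: [] for name in depends_on}
    for service, bases in depends_on.items():
        for base in bases:
            dependents[base].append(service)

    # Breadth-first walk outward from the failed service.
    down = {failed}
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in down:
                down.add(service)
                queue.append(service)
    return down


print(sorted(affected_by("networking", DEPENDS_ON)))
```

In this toy graph, a networking failure takes down storage, compute, the database and the monitoring built on top of it, while the unrelated "dns" entry stays up.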
Early on in the incident on Tuesday, AWS said the outage was “affecting some of our monitoring and incident response tooling, which is delaying our ability to provide updates.”
Medina said AWS also seems to have critical services clustered in its US-EAST-1 region, where another outage last year also had a widely felt impact.
“That’s where a lot of their critical dependencies have been located historically,” Medina said. “Over time, they’ve diversified a bit.”
Chhabra, the Forrester analyst, said Amazon has done a lot of “heavy lifting” to make its own services resilient. But what Amazon does not do for its customers is build applications in a way that can withstand an outage by tapping multiple locations or providers.
Doing so can often involve extra work that might not always be worth it when cloud outages remain relatively rare.
“It’s this tradeoff you always have between something that is decentralized, something that’s secure and something that’s useable,” said Charly Fei, product lead for Inter Blockchain Communication at The Interchain Foundation, which is focused on technologies for decentralizing computing. “It’s not something where you’ll ever get a perfect solution that gets all three.”