Reliability at the Scale of the Web

Web-scale systems should be economical to operate at any load level and have built-in resilience against localized problems.
Infrastructure-as-a-service cloud offerings, such as Amazon AWS, have made it easy and economical to provision servers and storage. But outages like the recent AWS us-east-1 failure quickly expose systems that weren't designed to survive a localized infrastructure breakdown.
We're a heavy AWS customer and have lots of services running in us-east-1. I'm pleased to say that the entire fallout from the outage was a recoverable write failure that affected one customer's web site for some minutes. It could have been worse; it could have been better. But generally, I was happy that things did what they were designed to do and simply worked around the trouble.
As one example, we use MongoDB replica sets, with replicas in different AWS regions and availability zones (AZs), and off AWS entirely. We "lost" one replica to the us-east-1 outage. It happened to be the master, but a new master was elected in a different us-east-1 AZ. The replicas in other regions could have picked up the slack if us-east had gone down altogether.
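To make the placement idea concrete, here is a minimal sketch in Python. It is not our actual configuration: the host names, the set name, the member count, and the tag values are all made up for illustration. The dictionary follows the general shape of a MongoDB replica set configuration document (`_id`, `members`, `host`, `priority`, `tags`), and the helper checks the property that matters here: a replica set can only elect a new master while a strict majority of its voting members is still reachable. The sketch assumes each member has one vote.

```python
# Hypothetical replica set layout; hosts, names, and tags are illustrative only.
REPLICA_SET_CONFIG = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "db1.example.com:27017", "priority": 2,
         "tags": {"region": "us-east-1", "az": "us-east-1a"}},
        {"_id": 1, "host": "db2.example.com:27017", "priority": 2,
         "tags": {"region": "us-east-1", "az": "us-east-1b"}},
        {"_id": 2, "host": "db3.example.com:27017", "priority": 1,
         "tags": {"region": "us-west-1", "az": "us-west-1a"}},
        {"_id": 3, "host": "db4.example.com:27017", "priority": 1,
         "tags": {"region": "eu-west-1", "az": "eu-west-1a"}},
        {"_id": 4, "host": "db5.example.com:27017", "priority": 1,
         "tags": {"region": "off-aws", "az": "colo-1"}},
    ],
}


def survives_loss(config, tag_key, tag_value):
    """Return True if a strict majority of members remains after every
    member whose tag matches (tag_key, tag_value) becomes unreachable.

    Assumes one vote per member.
    """
    members = config["members"]
    remaining = [m for m in members if m["tags"].get(tag_key) != tag_value]
    return len(remaining) > len(members) // 2


# Losing a single AZ leaves 4 of 5 members: an election can still succeed.
print(survives_loss(REPLICA_SET_CONFIG, "az", "us-east-1a"))
# Losing the whole region leaves 3 of 5 members: still a majority.
print(survives_loss(REPLICA_SET_CONFIG, "region", "us-east-1"))
```

The same check makes it easy to spot a fragile layout: put more than half the members in one AZ or region and the function returns False for that failure.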
We have multiple web serving nodes in different regions and AZs, too; the AWS load balancer service did its job and didn't send traffic to dead nodes in the affected AZ. If it hadn't, we could have overridden it in DNS and sent traffic somewhere else, even off Amazon altogether.
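The two layers described here can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the AWS load balancer's behavior or API: the `Node` type, the node names, and both functions are hypothetical. The first function models health-check routing (only nodes that pass their checks receive traffic), and the second models the manual DNS override (an operator steers traffic away from named zones, possibly to nodes off Amazon).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical web serving node."""
    name: str
    zone: str
    healthy: bool
    on_aws: bool = True


def routable_nodes(nodes):
    """Health-check routing: only nodes passing their checks get traffic."""
    return [n for n in nodes if n.healthy]


def dns_override_targets(nodes, avoid_zones):
    """Manual fallback: healthy nodes outside the zones the operator
    wants to avoid, regardless of what the load balancer reports."""
    return [n for n in nodes if n.healthy and n.zone not in avoid_zones]


# Illustrative fleet: one node is dead in the affected AZ.
NODES = [
    Node("web-1", "us-east-1a", healthy=False),
    Node("web-2", "us-east-1b", healthy=True),
    Node("web-3", "us-west-1a", healthy=True),
    Node("web-4", "colo-1", healthy=True, on_aws=False),
]

print([n.name for n in routable_nodes(NODES)])
print([n.name for n in dns_override_targets(NODES, {"us-east-1a", "us-east-1b"})])
```

The point of keeping the override separate is that it does not depend on the load balancer being right: even if health checks misreport, DNS can still move traffic elsewhere.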
The one customer who was affected? Embarrassing, and our fault. As he worked on a system in the crippled AZ, it spewed so many I/O warnings to the log that it filled up a volume that shouldn't have been possible to fill, eventually causing a write to fail. Implementation mistake; now corrected.
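The fix we applied isn't described here, so the following is only an illustration of one common way to keep a flood of warnings from filling a volume: a size-capped rotating log. It uses Python's standard `logging.handlers.RotatingFileHandler`; the function names, the logger name, and the size limits are assumptions chosen for the example.

```python
import logging
import logging.handlers
import os


def make_bounded_logger(path, max_bytes, backups, name="bounded-demo"):
    """Return a logger whose files on disk are capped at roughly
    max_bytes * (backups + 1), no matter how many records are written."""
    handler = logging.handlers.RotatingFileHandler(
        path, maxBytes=max_bytes, backupCount=backups
    )
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.setLevel(logging.WARNING)
    logger.propagate = False  # keep the flood out of the root logger
    logger.addHandler(handler)
    return logger


def total_log_bytes(directory):
    """Sum the sizes of all files in the log directory."""
    return sum(
        os.path.getsize(os.path.join(directory, f)) for f in os.listdir(directory)
    )
```

With a cap like this, the oldest warnings are discarded once the limit is reached, which is usually a better failure mode than a full volume and a failed write.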
Not every system we run is that resilient, and some can still be taken out for long periods by a major data center outage like this one. But it's nice to see that systems intentionally designed for modern web-scale reliability can actually deliver on the promise, even while lots of people are running around yelling that the sky is falling.
@rfc2616: If your solution fell over when one data center burped, you probably adore "virtualization" but gave lip service to "cloud" #justsayin