A guide to building reliable systems #shorts #sre #cloud #incidentresponse

Let's talk about Amazon S3's 11 nines claim.

When it comes to these things, Amazon is certainly one of the best and most impressive organizations I know.

But 11 nines means that a stored photo would be expected to stay there for 100 billion years. The Sun will envelop the Earth in about 5-7 billion years, and I don't think US-East-2 will survive that event!
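
The 100-billion-year figure can be checked with a few lines of arithmetic. This is a minimal sketch that assumes "11 nines" is an annual per-object durability and that the loss probability is constant and independent from year to year; the function name and the constant are illustrative only.

```python
def expected_years_to_loss(nines: int) -> int:
    """Mean years until an object is lost, given `nines` of annual durability.

    Assumption (not stated in the talk): a constant, independent annual loss
    probability of 10**-nines per object, so the mean wait is its reciprocal.
    """
    return 10 ** nines


# Upper end of the 5-7 billion year estimate quoted above.
SUN_ENGULFS_EARTH_YEARS = 7_000_000_000

years = expected_years_to_loss(11)
print(f"{years:,} years")               # 100,000,000,000 years
print(years > SUN_ENGULFS_EARTH_YEARS)  # True
```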

S3 claims to achieve 11 nines by creating 33 copies of data and storing 11 copies each in three data centers, allowing it to tolerate 11 single disk failures.

However, this approach does not take into account correlated failures, such as a data center lost to fire or natural disaster, which would take out multiple copies of the data at once.

When we designed Aurora, we stored two copies of the data in each of three data centers (most other systems kept one copy in each center).

The reason was that the largest correlated failure I expected was the loss of a data center.

If that happens, every single one of my databases loses two copies at once.

Some subset of them will then see another failure somewhere else while I still need time to repair the first two.

So with one copy per data center, if two out of three fail, I'm left with only one copy. And I can't tell whether that copy is up to date. That means my database is corrupt.

But if I use a four-out-of-six quorum and drop to three out of six, I can still read the data and do the repair.
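
The comparison between two-out-of-three and four-out-of-six can be sketched in code. This is a rough illustration, not Aurora's implementation: the `QuorumLayout` class is hypothetical, and the read quorum is derived from the usual rule that every read set must overlap every write set, which is an assumption on my part rather than something stated in the talk. It also assumes the extra failure lands in a data center that is still up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumLayout:
    """Hypothetical model of copies spread evenly across data centers."""
    copies_per_dc: int
    data_centers: int
    write_quorum: int

    @property
    def total_copies(self) -> int:
        return self.copies_per_dc * self.data_centers

    @property
    def read_quorum(self) -> int:
        # Smallest read set guaranteed to overlap every write set (assumed rule).
        return self.total_copies - self.write_quorum + 1

    def surviving(self, failed_dcs: int, extra_failures: int) -> int:
        # Extra failures are assumed to hit copies in data centers still up.
        lost = failed_dcs * self.copies_per_dc + extra_failures
        return max(self.total_copies - lost, 0)

    def can_read(self, failed_dcs: int, extra_failures: int) -> bool:
        return self.surviving(failed_dcs, extra_failures) >= self.read_quorum

    def can_write(self, failed_dcs: int, extra_failures: int) -> bool:
        return self.surviving(failed_dcs, extra_failures) >= self.write_quorum


two_of_three = QuorumLayout(copies_per_dc=1, data_centers=3, write_quorum=2)
four_of_six = QuorumLayout(copies_per_dc=2, data_centers=3, write_quorum=4)

# One data center lost, plus one more independent failure during repair.
print(two_of_three.can_read(failed_dcs=1, extra_failures=1))  # False
print(four_of_six.can_read(failed_dcs=1, extra_failures=1))   # True
```

In this model the four-out-of-six layout loses write availability after the second failure, but it keeps a readable quorum, which is what makes the repair possible.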

Therefore, when designing your systems, you need to:
– take the largest probable correlated event,
– combine it with independent failures that may already be underway,
– multiply that by the number of such things happening in your environment, and then
– divide it by the duration over which this will happen.

For example, if it takes me 10 seconds to fix a segment in Aurora, I'm basically looking at a 10-second window in which an independent failure has to land on top of the correlated failure.
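
One way to put numbers on that window is sketched below. This is my reading of the reasoning, not a formula from the talk: it assumes independent failures arrive at a constant rate (an exponential model), and the failure rate, copy count, and segment count are made-up placeholders you would replace with your own measurements.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def p_extra_failure(annual_failure_rate: float,
                    surviving_copies: int,
                    repair_seconds: float) -> float:
    """Probability that at least one surviving copy fails independently
    before the repair finishes, assuming a constant failure rate."""
    rate_per_second = annual_failure_rate / SECONDS_PER_YEAR
    exposure = rate_per_second * surviving_copies * repair_seconds
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for very small x.
    return -math.expm1(-exposure)


def expected_segments_hit(segments: int,
                          annual_failure_rate: float,
                          surviving_copies: int,
                          repair_seconds: float) -> float:
    """Expected number of segments that see an extra failure during repair."""
    return segments * p_extra_failure(
        annual_failure_rate, surviving_copies, repair_seconds)


# Hypothetical inputs: 2% annual failure rate per copy, 4 copies left after
# losing a data center, one million segments in the fleet.
fast = expected_segments_hit(1_000_000, 0.02, 4, repair_seconds=10)
slow = expected_segments_hit(1_000_000, 0.02, 4, repair_seconds=3600)
print(f"10 s repair:  {fast:.4f} segments expected to take a further hit")
print(f"1 h repair:   {slow:.4f} segments expected to take a further hit")
```

Under these assumptions the exposure scales almost linearly with the repair time, so shortening the repair window is a direct lever on how often the independent failure overlaps the correlated one.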

You want to reduce this number as much as possible in an economically sensible way.

In the end, we got four out of six. For you, it might be a different number.

To find out, you need to consider the following factors:
– your correlated events
– their downstream effects
– the time required for the repair
– the width of the system to which it is applied

Let me know if you found this helpful.

Website: https://shoreline.io

#Runbookautomation #SRE #Incidentresponse #devops #cloud