“Well, that’s on the SAN, so it’s good” had better never be a part of your recovery plan. If it is? You just lost. As I type this I’m on the way home from visiting an old client in another city that just lost their SAN – their SAN that every VM was on, their SAN that their company relied on. The call from the customer was basically, “Can you help us do some reinstalls of SQL and put a plan in place to avoid this pain in the future? Our data center is still there, but for all intents and purposes we’ve lost it with this SAN crash.”
Your SAN? It Can Fail…
Now this customer knew they had some risks in their setup, and they had a decent mitigation strategy in place for SQL Server. In fact I’d say their SQL environment is coming back up nicely. They failed over to a secondary server where they effectively have a copy of production that stays in sync with their main production. That worked seamlessly for most environments. They aren’t using the technology that I love and prefer there, but it still worked alright for them: they only had one dev database come up corrupt. They knew they had a risk here too, but they were willing to pay for that risk and use backups if needed.
The rest of their environment? It didn’t do quite as well. They are still on their backup VPN. They are still running at degraded capacity, performance-wise, in production. They are having to rebuild VMs from scratch. And they’ve switched back to the SAN they recently moved off of to get better performance.
I won’t get into too many identifying details. Basically their SAN had a significant failure of a number of disks at the same time. During some routine building work, they had an orderly shutdown of their environment. On coming back online they noticed that several disks in their storage setup had failed. (Think iSCSI “SAN”: one main storage array doing most of the important work, not a lot of disks, and no redundancy outside of that one array other than backups and some warm standby for some items in DR.) Actually, they first noticed that they had lost their SAN. The research with vendor support revealed that more than the tolerable number of active drives ALL FAILED at the same moment. Turns out the drives were all manufactured on the same date, in the same batch. The drives from that batch that hadn’t failed were swapped out – but the damage was done. The storage vendor’s response was a bit of “wow” and a bit of “wonder if the power came back up incorrectly or if there was dirty power” and then some more “wow”.
Principle #1 From This Situation – Your SAN? It Can Fail. We all technically know this and knock on wood when we talk about its redundancies and invincibility. But... it can fail. Enough drives can fail at once, or something can bork the entire device all at once.
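If you want to know whether you are exposed to the same-batch problem this client hit, one rough check is to count drives per manufacturing batch and compare that against how many simultaneous failures the array can survive. Here is a minimal sketch of that idea in Python. Everything in it is made up for illustration: the `Drive` fields, the batch codes and the `tolerated_failures` number are assumptions, and you would have to pull the real values from your own array's inventory and RAID layout.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Drive:
    slot: str
    batch: str  # hypothetical manufacturing batch / date code


def risky_batches(drives, tolerated_failures):
    """Return {batch: drive_count} for every batch that holds more
    drives than the array can lose at the same time."""
    counts = Counter(d.batch for d in drives)
    return {batch: n for batch, n in counts.items() if n > tolerated_failures}


# Made-up inventory: four of six drives share one batch.
drives = [
    Drive("0", "2011-W14"), Drive("1", "2011-W14"), Drive("2", "2011-W14"),
    Drive("3", "2011-W30"), Drive("4", "2011-W14"), Drive("5", "2012-W02"),
]

# Assume the array survives at most two simultaneous drive failures.
print(risky_batches(drives, tolerated_failures=2))
```

Any batch this reports is one bad lot away from taking the whole array with it. It doesn't predict a failure, it just tells you where to start asking your storage vendor questions.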
It’s The Restores, Stupid
My client made a profound statement when I visited this morning, right after thanking me for dropping some other items, being a good consultant and making this a top priority. He said, “I’m realizing that restores are a lot more important than backups.” I agree. I made this point about 5 years ago when I started blogging, in my “People focus too much on their backups” post.
He wasn’t saying that in an “I wish I did better” way. They actually have been far better in their SQL area than some other people I’ve been called in to help. They have a great backup and recovery process and most servers are covered well by it. But his point is great – restore speed, restore reliability, restore testing, restore scripting, restorability... These things matter so much more than a lot of the questions people ask when they just think of backups as “something we have to do”.
Principle #2 – Your Backups? You will use them someday. It will be a nasty time, you’ll smell, your family won’t see you, you’ll be fighting dragons – but there may be a day when you need to restore – and not just one database at one time. Are you ready?
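To put a number on "restore speed" when it is every database at once, a back-of-the-envelope calculation goes a long way. Below is a rough Python sketch. The database names, sizes, throughput figure and recovery time objective (RTO) are all invented for the example, and it assumes restores run one after another at a steady rate, which real restores rarely do. Treat it as a starting point for the conversation, then replace the guesses with timings from actual test restores.

```python
def estimate_restore_hours(db_sizes_gb, throughput_mb_per_s):
    """Rough hours to restore every database back to back, assuming a
    constant restore throughput (a big simplification)."""
    total_mb = sum(db_sizes_gb.values()) * 1024
    return total_mb / throughput_mb_per_s / 3600


def meets_rto(db_sizes_gb, throughput_mb_per_s, rto_hours):
    """True if the estimated full restore fits inside the RTO."""
    return estimate_restore_hours(db_sizes_gb, throughput_mb_per_s) <= rto_hours


# Made-up sizes in GB; substitute your own backup sizes.
sizes = {"sales": 500, "warehouse": 1200, "reporting": 100}

# Assume 100 MB/s sustained restore throughput and a 4 hour RTO.
hours = estimate_restore_hours(sizes, throughput_mb_per_s=100)
print(f"Estimated full restore: {hours:.2f} hours")
print("Fits a 4 hour RTO:", meets_rto(sizes, 100, rto_hours=4))
```

With these invented numbers the full restore comes out at a bit over five hours, so a four hour RTO is already blown before anything goes wrong. That is the kind of gap you want to find on a quiet Tuesday, not in the middle of the outage.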
SQL Is Not An Island
You are responsible for SQL. That’s great. Does that mean you are good if SQL is good? Not so for this company. What do you do if your VPN is dead? What do you do if your VM hosts are down and you have no snapshots of any VMs? What do you do if all of the monitoring, inventory, runbooks, SharePoint document libraries, etc. are all gone because their app servers and proprietary databases are all down? Trick question – the answer? Not Much.
Listen, DBA, you are The DBA. You are responsible for the data. You can’t say “Where the data ends, my job ends.” That doesn’t work. You are the advocate of your data – and you should be the one doing all you can to keep it available, usable and active. That sometimes means asking pesky questions like, “Hey... What is your restore policy?” or “My lovely friend, Mrs. SAN Administrator who I sometimes yell at, can I help you do anything to plan for recovery? What do you do if the SAN goes down? What kind of Performance SLA do we have while building a boatload of VMs and trying to switch production from DR to ‘normal’?”
You see – you are reading this post. You are a DBA who cares. You have to ask these questions and get the dialogs started. This doesn’t mean the buck stops with you – but if you aren’t asking these questions who will? Hopefully everyone else, but in my experience? That just isn’t a reliable hope to have.
Principle #3 – Your Assumptions About Everything Else Being Safe? They could be wrong. And wrong here isn’t a little “oh silly”, it’s BAD on the good-to-bad scale. In fact it is Bad turned up to 11. Start those painful dialogs now.
I have more principles, but I’m at 900 words. The next two posts will have:
- Failure != DR Test - Instead it is actually a failure. Don’t make your first landing on the Hudson on the actual Hudson. Do it in a simulator.
- Document and Learn - During your test – throw monkey wrenches in. Make things confusing – and document what you learned.
- You gotta keep ‘em separated - The new big arrays of disks with one big happy RAID setting and all the spindles working together are great for a lot of reasons, but sometimes when you put all your eggs in one basket, the basket can get a bit messier. I won’t say you have to do this and there are a ton of “it depends” points – but this is something to consider. And not just for recovery, but performance and workload sharing.
- Complacency Kills - You read these posts and thought “but… It won’t happen to me”, right? Not a good approach.
I’m glad this client lost no serious data. I’m glad they are mostly functioning fine in DR with no huge impact to the user. That makes this a good experience to share. No users were killed in the telling of the story. But there are things that could have been much better. There are pains that didn’t have to be there.
For the love of your data and environments… Get proactive. Spend some quality time with your DR plans, be a negative paranoid worry-wart (basically just be yourself if you are a production DBA) and find and fix the holes. Especially if you are knocking on wood right now about how your data center is fine.
Part 2 – What Else Can We Learn From a SAN Failure? – is now live with some additional lessons over on my blog at Linchpin. Stay tuned to the feed here for Part 3 coming out on 11/14.
Part 3 – How Do You Prevent a SAN Failure? (Well, from ruining your week anyway) – is live on this blog.
I don’t care if you call us at Linchpin People or you call one of the other many great consultancies out there – or if you have the time to do it yourself. But get this taken care of. It is a lot less expensive to deal with finding faults now than it is to patch things together later.