How Do You Prevent a SAN Failure?

… Well, prevent a SAN failure from ruining your week, anyway. You can never prevent failure 100%, even in a SAN, even when your SAN vendor says it’ll never happen (hint: anytime a vendor tells you “never”, be afraid. Be very afraid). So that’s what this whole post series is about: minimizing damage.

SANs have plenty of moving parts and components, all of which can fail (and this can be bad). We learned that last week. Earlier this week, in my Linchpin People blog post, we learned that it’s kind of a good idea to do a DR drill before a real disaster strikes. We also talked about failing back once the disaster is over, and got an always-helpful reminder about documentation (which I feel you have to learn to love).

Today we’ll finish the series and talk about the silent killer in your environment that everyone is okay with, eggs and baskets, and a final reminder/action item.

Silent, But Deadly…

I enjoy talking about lessons for IT professionals from other industries. One of the topics I blog and speak about often is airline disasters. Once, when I was preparing to give my first talk, I bumped into a commercial pilot who flies private jets. I was talking to him about the Payne Stewart crash and some of the potential factors (including at least one of the pilots likely not wearing the required O2 mask above a certain altitude). When I mentioned that, he quickly shot back, “But you’ve never worn one of those, have you? They are incredibly uncomfortable, annoying and besides, the likelihood of something happening is so incredibly low.” (Note: the exact cause of that crash is still debated, including whether maintenance issues played a part. The mask is only one potential factor and, sadly, no one knows for sure. But judging from this pilot’s answer, skipping the mask is apparently not unheard of among aircrews.)

There it is. The silent killer in your environment right now is the same thing that has caused battles to be lost, disasters to happen, patients to be killed and countless lives destroyed: complacency. It’s a serious problem in IT environments across the globe. You may not be talking about a life-saving device for a sudden and unexpected depressurization. No one is likely to die if one of the “big” IT disasters strikes your environment. But similar phrases are echoed in board and conference rooms across all of ITdom. Phrases like “Oh, well, it’s on the SAN with all of those redundancies built in, so the chance of a failure is so incredibly low.” Or “It’s really hard to do a proper DR test. We’d need to stop operations, take resources away from the death marches we’ve assigned them to and risk breaking stuff. Besides, our primary data center is used by so many people and several major carriers run through there. The likelihood of something bad happening is so incredibly low.”

I could go on, but I won’t. You already know what I mean because you heard yourself and your colleagues in just those couple quick examples, right?

It’s like this: Murphy (at least the classic idea of the Murphy with his own law) would have made a great DBA. The second you say, “______ will never happen”, you have just introduced this silent killer into your environment! No, this isn’t superstition, jinxing, cursing or anything like that. It’s just that you’ve allowed the possibility to fester by ignoring it and not preparing for it. In fact, your overconfidence has not only helped smooth the road for ______ to happen, it has also set up a pretty awkward moment when it does.

To me this is one of the biggest takeaways from this entire period of restoring and helping a client get everything up and running: get some foresight, put on your Carnac hat and predict failures. This is actually incredibly easy. Incredibly easy. All you have to do is get cynical, look for faults and imagine what could possibly go wrong. (You can insert some joke about your spouse here if you like. But yes, be like that. Find the problems, and go out looking hard for problems on purpose.)

Hindsight is cheap - any conscious human with neurons still firing and just a thimbleful of reason left can second-guess something after the fact. It’s what sells the sports pages for four days after a football game. It’s why your father-in-law could make a better coach than any coach in any sport, as long as you let him review the tape of the game he is about to coach before he coaches it. But hindsight is also terribly expensive. It’s that moment of looking back and saying, “Durn IT! If only I had done this one thing differently!” (It’s also cruel…) Sometimes hindsight means spending a lot more money recovering from one of those “bad” situations on the good/bad scale than you would have spent if you’d had foresight.

Basically: plan for the worst. I know I blog about that a lot (“If you fail to plan, you have planned to fail” comes to mind first), but it’s true. This whole series highlights that for me. I hope I’m not alone.

Diversify

If your environment diagrams look like a whole bunch of production systems or databases sitting on top of a carefully sculpted set of expensive single points of failure, you might be doing something wrong. We’ve established this elsewhere already, but I’m going to repeat it at least twice more: a SAN doesn’t magically mean you are fully redundant and can stop there. Re-read the section above, or the first post.

So here is where you and your entire IT organization need to come together, talk, and plan the best shape of your environment. You will end up taking known risks and making decisions based on risk/reward, but it will feel a LOT better to have had that conversation and made informed decisions. You may not get to the level of fault tolerance that you want or that is ideal for your paranoia, but at least you laid out the risks, rewards and cost/benefit analysis up front, right? The point is, if you are the DBA and you see a lot of room for failures (even if something “bad” has to happen first), you need to call those out, suggest solutions, understand the business needs, discuss the pros and cons and help architect the right solution for you and your workload. Sometimes it can look different based on the application. Perhaps you have a pool of apps so incredibly critical that you have to protect them at all costs (like the system that pays your consultants), and maybe you have one that no one will miss if it goes away for a week (your time sheet system, perhaps). They don’t each have to have the same level of protection.

But you need to stop putting all your eggs in one basket. A single disk array for all VMs, with all your SQL Server environments living on that set of VMs? Sounds good: simple, less administrative overhead, maybe the performance is okay. But if you aren’t protecting that single array, what happens when it tanks? Can you rebuild everything? And in time? Maybe adding a second array and using synchronous SAN replication offers the protection you need. Maybe splitting up some of your classes of applications works. Maybe it’s site-to-site SAN replication, or Site Recovery Manager (SRM) in VMware, or any of the DR and HA options in SQL Server.
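If you want a quick way to start that eggs-and-baskets conversation, something like the rough Python sketch below can help. It is only an illustration: the inventory, the workload names and the array names are all made up, and it assumes you can list which arrays hold a full copy of each workload. It flags every array that is the only home for one or more workloads.

```python
from collections import defaultdict

# Hypothetical inventory: workload -> storage arrays that hold a full copy of it.
# Every name below is invented for illustration; substitute your own environment.
WORKLOAD_COPIES = {
    "payroll-sql": ["array-a", "array-b"],  # replicated to a second array
    "timesheet-sql": ["array-a"],
    "vm-datastore": ["array-a"],
}


def single_points_of_failure(workload_copies):
    """Return {array: [workloads that live only on that array]}."""
    exposed = defaultdict(list)
    for workload, arrays in workload_copies.items():
        distinct = set(arrays)
        # Exactly one distinct array means losing it takes the workload down.
        if len(distinct) == 1:
            exposed[next(iter(distinct))].append(workload)
    return {array: sorted(names) for array, names in exposed.items()}


if __name__ == "__main__":
    for array, workloads in single_points_of_failure(WORKLOAD_COPIES).items():
        print(f"{array} is a single point of failure for: {', '.join(workloads)}")
```

This only looks at storage arrays. The same idea applies to switches, controllers, power and sites, and a real inventory would need to cover those too.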

The point is, “Next, Next, Next, Finish” doesn’t really cut it for your SQL Server installation and configuration. It doesn’t really cut it for all of the other pieces and parts of your environment either. Talk to each other, be realistic, and find and destroy single points of failure, if you (or your users) care about availability. And try to minimize the risk of having to rely on your backups and your secondary data center. Yes, have them, but treat them like expensive insurance: something you pay for and hope you never have to claim on.

Finally…

We got there: the end of the series and of this long post. One last bit of advice:

The next time you hear, “But.. It’s on the SAN – that is fully redundant and you are fine, stop worrying..” - try and contain your laughter….

(Or have the same reaction I have. Laugh at the statement on the inside, and chuckle a bit to yourself about the concept of “the SAN”, said as though it were one box with disks in it, some invincible, magical, single “thing”, and not a network with a lot of moving parts. Then get serious and poke holes in the statement, not to win a battle, but to prove out the concept and the approach.)
