Tuesday 1 December 2015

Why database corruption shouldn't be that scary.

A rather clever SQL person by the name of Paul Randal once said. "There are two types of people in this world. Those that have experienced database corruption, and those that will."

He's absolutely right. It's more than an occupational hazard, it happens all of the time but despite it's commonality the mere mention of it can still strike fear into many would be database administrators.

Now be assured, I am not trivialising corruption here, not one bit! I've been there. It can turn a good day into a very bad one (the night too), but there are good reasons why you should not live in fear of it.

Firstly database corruption comes in different shapes and sizes and you need to know what you're up against. Yes its likely that it's irrecoverable meaning a full restore is the only way things are going back to normal. But it could a problem with something like metadata such as a view rowcount meaning the object just needs recreating. The time compared to doing that to a full restore could be huge, so lesson one, don't panic and know your enemy! 

The Panic Button - DON'T press it!


There is though a much more important reason as to why you shouldn't fear database corruption and that is because you're are prepared. As another wise man once said that we don't plan to fail, we fail to plan.

If you're having to restore a database into production then you should already be armed with a few bits of critical information. What services are affected, who needs informing and also, how long is this going to take. See my SQL inheritance tax article here on why you really need this info!

You also need to know that your backups are up to date and consistent. I have lost count of the amount of times that I have read on a forum that a database has become corrupt and the only backup is from four weeks ago, if one even exists that is. Seriously, all the time. 

This is where fear really comes from. It's the not knowing. You have to know the backups are there and ready. You have to know how long a restore is going to take and you know why you should know? Because you should have already done it. Recently, and documented it. You've consistency checked it too because you don't just rely on verifying your backups!

So here's a few guidelines:

Backup your databases, regularly.
It's a given I know. No, it should be a given but like I mentioned earlier it is often not the case!

Know your backup jobs.
Do you know if your backup jobs are running successfully? Your backup jobs should definitely notify you in the event of failure but also consider how are you going to know if the job actually ran. If SQL Agent is down, you won't be getting any notifications that way so think about success notifications and customised reporting on your backup tables.

Know your RPO and RTO. 
Standard backup routines are fine but different databases have different recovery objectives and you need to ensure they can be met in the event of disaster. You also need to know what the databases are doing, what services will be affected and who to notify when things go wrong.

Are your databases consistent?
Consistency checking is another given but how often should you do it? Personally, I like to check as often as possible (resource allowing), a minimum of 1 + number of full backups. System databases, every day...why not? And yes, that includes TempDB. Remember corruption often indicates a problem at the IO subsystem, you need to know about it sooner rather than later, speaking of which...

Are you alert?
Consistency checking is all well and good but its scheduled and you need almost real time notifications that things are not quite as they should be. Add SQL Agent alerts for errors 823, 824 and 825 among others.

Your backup strategy is only as good as your restore strategy.
Many times I have encountered SQL instances with sound backup plans and absolutely no restore checks. Verify only is not enough, neither is checksum for that matter...seriously, I'm not kidding! The only way to guarantee that your backup can be restored is to restore it and consistency check it. For critical databases plan a restore check on a regular schedule. 

In the event of corruption, know what you're dealing with.
The golden rule is simple...don't panic! Ensure you know what is being affected here and how. There are several different types of corruption with different methods on how to resolve the problem and it might not necessarily mean a stack of downtime (although it still might). If you are going to have to restore then make sure of one thing; that this is not the first time you have done this!! You want to be going from a plan that is documented, proven to work and rehearsed.

So remember, corruption is not a mythical beast it is very real, it will happen, but rather than be scared of it, be prepared for it.


3 comments:

  1. Totally agree with your post, apart from your comment on backing up tempdb not sure I really agree. Whats your reasoning behind that?

    ReplyDelete
    Replies
    1. REad too much into it. you can't backup tempdb, but you can check it.

      Delete
  2. You mean consistency checking TempDB? I often get asked this, and you absolutely should.

    Ultimately, you need to know if there is a consistency issue. If it's I/O related for example then chances are the problem will be more widespread and a bounce of the instance won't be enough, you'll need to delete the physical files.

    ReplyDelete