Ah, a topic dear to my heart. Both my business partner and I care fanatically about data integrity, and hence backups. I guess that makes us appropriate people to manage data for people 😛
We address backup from multiple dimensions:
- Risk Events
- Recovery Times
We use software RAID1 under Linux on commodity PC hardware. We also use hard drives from different manufacturers, as this decreases the probability of simultaneous drive failure. We use hard-drive-based backup (with monitoring to detect imminent drive failures) rather than tapes because it is more reliable. All our servers are under maintenance agreements and monitored (and yes, this is affordable for small businesses due to our process innovation). All of this means the likelihood of something happening to destroy your data in the first place is drastically reduced.
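To give a flavour of what monitoring a software RAID1 array can look like, here is a minimal sketch, not our actual tooling. It assumes the usual `/proc/mdstat` layout on Linux, where a healthy two-disk mirror shows `[UU]` and a mirror with a missing member shows an underscore, e.g. `[U_]`. The function name and the sample text are made up for illustration.

```python
import re

def degraded_arrays(mdstat_text):
    """Return the names of md arrays whose status string shows a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        header = re.match(r"^(md\d+)\s*:", line)
        if header:
            current = header.group(1)   # remember which array the next status line belongs to
            continue
        status = re.search(r"\[(\d+)/(\d+)\]\s+\[([U_]+)\]", line)
        if status and current:
            if "_" in status.group(3):  # an underscore marks a failed or absent drive
                degraded.append(current)
            current = None
    return degraded

# Illustrative sample only; on a real server you would read /proc/mdstat instead.
SAMPLE = """Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0]
      104320 blocks [2/2] [UU]

md1 : active raid1 sda2[0]
      976655488 blocks [2/1] [U_]

unused devices: <none>
"""

print(degraded_arrays(SAMPLE))  # prints ['md1']
```

Run from a scheduled job and wired to an alert, a check like this tells you that a mirror has lost a drive while the surviving one is still healthy.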
The technology used in your backup matters. Are you using tapes? They can have up to 70% failure rates when you attempt to recover from them! Are you backing up your whole system or just files? Do you know how long it will take to recreate your server from scratch with all the undocumented hacks your IT person has made? Are your backups actually working – are you doing test restores? How are you handling off-site backup? Does your backup software deal with open files (like your Outlook or Exchange databases)? Is all your data stored in one spot?
Keeping all your data in one spot makes backup much, much easier. Are you using RAID5? On a hardware RAID device? You’ll be sold one of these if you ring up Dell and say you need backup, but do you realise that this scenario makes you MORE likely to lose data and incur considerable expense in attempting to recover it (as compared with RAID1, particularly software-based RAID1 under Linux)? Your operating system matters too – Windows and its applications are a lot harder to back up and generally require more expensive hardware to run on (which often means recovery is slowed whilst waiting for replacement parts, or that you pay much more for fast warranty replacements).
People delete data. Deliberately or accidentally, it doesn’t matter. Design your backup system around this fact and you’ll be much happier. For example, some automated backup solutions simply mirror what’s on the fileserver. This includes mirroring any delete operations! Many companies only find out about this limitation of their system after the fact. Too late.
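The difference between mirroring and keeping history can be shown with a toy example. This is only a sketch under simple assumptions (full copies into labelled directories, hypothetical function and directory names), not a description of any particular backup product. Each backup run goes into its own snapshot directory, so a file deleted on the fileserver still exists in the earlier snapshots.

```python
import shutil
import tempfile
from pathlib import Path

def take_snapshot(source: Path, backup_root: Path, label: str) -> Path:
    """Copy the whole source tree into its own labelled snapshot directory."""
    dest = backup_root / label
    shutil.copytree(source, dest)  # older snapshots are never touched
    return dest

def snapshots_containing(backup_root: Path, relative_name: str):
    """List the snapshot labels in which the given file still exists."""
    return sorted(p.name for p in backup_root.iterdir() if (p / relative_name).exists())

def demo():
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "fileserver"
        backups = Path(tmp) / "backups"
        source.mkdir()
        (source / "report.txt").write_text("quarterly figures")
        take_snapshot(source, backups, "day1")
        (source / "report.txt").unlink()        # someone deletes the file
        take_snapshot(source, backups, "day2")  # a pure mirror would now have lost it
        return snapshots_containing(backups, "report.txt")

print(demo())  # prints ['day1']
```

A real system would use incremental or hard-linked snapshots rather than full copies to save space, but the principle is the same: a delete on the server must not reach back into older backups.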
We automate solutions where possible, because leaving backup to busy, overworked staff members is a sure way to guarantee it won’t get done, or won’t get done properly. Next time you see your IT person, ask when *they* last backed up. Chances are, if they’re using a manual system, it was months ago, and they supposedly care!
4. Risk Events
Consider the risk events you are exposed to. The most common are:
- accidental file deletion
- hard drive failure
- other system component failure
- virus/hacker attack
- disgruntled employees
- failure of your backups
Rarer but still deadly events can include:
- dedicated hacker/industrial espionage
- … and all the other weird and wonderful events you can dream up!
You need a plan to address each and every event you can think of. You have a choice of:
- accept – you accept the risk of it happening and lose data
- mitigate – you put processes in place to reduce the likelihood of occurrence and/or reduce the cost of recovery when it does happen (ideally do both)
- offload – you offload the risk to someone else, e.g. an insurance company. N.B. This is tricky with data.
5. Recovery Times
This is often overlooked. Assume you have a perfect backup, how long does it take to recover? Windows servers in particular can take days to recover. Even with an image (snapshot) of the whole server and separate file-based backup, you’ll run into trouble if you don’t have near-identical hardware lying around to restore the image to. This is particularly a problem when your server is more than 12 months old and you can no longer buy replacement parts for it.
What about off-site backup? How long would it take you to download 100GB of data from your off-site backup, or to drive over and collect it? The list of acceptable recovery times needs to be matched against the list of risk events. For example, if your building burns down you might accept everything taking a week to get back up.
But if your hard drive fails, you want to recover in a matter of hours. Ideally this type of failure would have been prevented with RAID1 and drive monitoring.
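A rough back-of-the-envelope check for the 100GB question, sketched in Python. The 10 Mbit/s link speed and the 4-hour target for a drive failure are assumptions picked for illustration; the one-week figure for a fire comes from the example above. Substitute your own numbers.

```python
def transfer_hours(size_gb, link_mbps):
    """Hours needed to move size_gb gigabytes over a link of link_mbps megabits per second."""
    megabits = size_gb * 1000 * 8      # decimal gigabytes -> megabits
    return megabits / link_mbps / 3600

# Acceptable recovery time in hours for each risk event (illustrative figures).
ACCEPTABLE_HOURS = {
    "building fire": 7 * 24,      # "a week"
    "hard drive failure": 4,      # "a matter of hours" (assumed to mean 4)
}

def meets_target(event, estimated_hours):
    """True if the estimated recovery time fits within the target for that event."""
    return estimated_hours <= ACCEPTABLE_HOURS[event]

hours = transfer_hours(100, 10)                   # 100GB over an assumed 10 Mbit/s link
print(f"{hours:.1f}")                             # prints 22.2
print(meets_target("building fire", hours))       # prints True
print(meets_target("hard drive failure", hours))  # prints False
```

The download alone takes most of a day under these assumptions, before any time spent rebuilding the server. That is fine for the fire scenario but nowhere near good enough for a failed drive, which is why the recovery method has to be planned per risk event.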
Also often overlooked is the question of what’s worth backing up, how, how often, and how expensive and time-consuming the retrieval process is. It’s not possible to have 100% secure data (although sending it to the moon at $10,000 per hard disk might be worth it to some!), so it’s important to consider the economics of what data you have and its value to the organisation.
As Mal said above, the disaster prevention and recovery process (of which backup is a part) needs to be communicated to and understood by business owners. All too often we see half-baked backup solutions that we know will fail in so many situations, but the business owner is blissfully unaware. In my opinion, this is worse than no backup at all, as it provides a false sense of security.