Backing up.. Beep Beep
Zetsu
Posts: 186
Let's say someone had like 5 to 20+ TB of data they wanted to come up with a backup solution for. What would you suggest?
Retrospect and a bunch of 2TB externals doesn't work well for this ;x
Comments
The amount of data you're talking about could be well-matched to either standard DVD or Blu-ray disc formats. How to organize a series of backups on a schedule that's efficient yet effective is another issue. With that amount of data, you want some kind of complete backup every now and then, with frequent incremental backups.
That would be at least 100 to 400 50GB Blu-Ray discs!!
I'd start by asking myself what I might exclude.
Then I'd most likely use rsync over an ssh connection to do the majority of the work.
Both rsync and ssh are available for free for Windows, Linux, and Apple.
I presume the data is all owned by you.
http://sourceforge.net/apps/trac/sourceforge/wiki/Rsync%20over%20SSH
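For illustration, a minimal rsync-over-ssh run might look like the sketch below. The hostname and paths are made up, and the --exclude patterns just show how to skip the things you decide not to keep:

    # Mirror /data to a remote machine over ssh; -a preserves
    # permissions and times, -z compresses in transit, --delete
    # removes files on the destination that no longer exist at the source.
    rsync -avz --delete \
        --exclude 'Downloads/' --exclude '*.tmp' \
        -e ssh \
        /data/ backupuser@backuphost:/backups/data/

Run it once by hand and check the output before trusting it to a schedule.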
At my office we use a HP MSL8096 tape robot with 3 LTO4 drives to back up approximately that amount of data.
(An 8096 can hold up to 96 tapes, one of which really should be a cleaning cartridge. LTO4 tapes are 800GB uncompressed. It's possible to add another 8096 body on top of the original and expand the capacity to 192 tapes.)
LTO4 as that's what was available when we bought the thing.
(Today we would have bought LTO5 or waited on LTO6 - which is 2.5TB uncompressed - to prove itself stable)
This robot is connected to the servers using optical fibre (a Fibre Channel SAN, with ports set at 8Gbit/s).
The data is stored on a disk array, also connected to the servers through the same SAN.
There are other and possibly cheaper tape autoloaders and robots out there.
Note that LTO drives require a 'minimum continual data feed' to perform optimally.
(One of our older servers couldn't send data fast enough, and the drive it used tended to stop, backtrack and rewind a lot -- so-called 'shoe-shining' -- which slowed it down even further, just because the data buffers ran out.)
Storing data on Blu-ray or DVD...
Please NO!
A DVD can become unreadable after being left exposed to daylight for only a month or less.
And some of the cheaper ones deteriorate enough to be unreadable after 2 years even if stored correctly.
The only thing worse is using DAT (DDS) tapes, really.
Durn drives tend to stretch the tape slightly, and in a way that's specific to each drive. A tape written on one drive may be unreadable on another.
Bah, 5 to 20 TB is kid's stuff nowadays :)
Out of curiosity, what is this 20TB of data?
One might ask questions like:
What is its value? That will determine how much you want to pay for a solution.
How fast is it growing? That might dictate options as you may want a lot of future scalability.
How fast does it age? It might help if you can dump data past a certain age.
How sensitive is this data? If it's a lot of personal info or financial stuff, making sure it is secure and taking care of privacy issues will come into the equation.
Now, one approach is to farm the problem out to people who specialize in this sort of stuff. That, dare I say it, is to put it in the "cloud".
For example, Amazon Simple Storage Service (S3) will keep that stuff for you: http://aws.amazon.com/s3/ . I'm sure you can get similar services elsewhere.
That will cost you. But it may be better to pay than to cobble something together yourself and have to take care of it forever after.
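As a rough sketch of that approach, assuming you have installed Amazon's command-line tools and already created a bucket (the bucket name here is made up), keeping a copy in S3 can be a one-liner:

    # Upload anything new or changed under /data to the bucket.
    aws s3 sync /data s3://my-backup-bucket/data/

Bear in mind that with this much data, the first upload alone can take weeks on a home connection, and you pay monthly for storage and transfer.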
No, a pile of DVDs is not a solution. For that amount of data you would need a truckload of them. Better keep duplicates, as all these optical disks have a habit of rotting. It will be slow and labour-intensive. Does anyone know of a DVD robot?
I guess that Zetsu might be doing this as homework for a class on system administration or someone might have figured that they could ask a student to do system administration and save some bucks.
The situation is that the sheer size means a huge amount of time spent on continual backups. So rsync is very appropriate, as it can do partial backups and only transfers what has changed.
The Tower of Hanoi puzzle has been applied mathematically to determine the best way to divide time between full backups and partial backups. What I saw in Wikipedia seemed to indicate that an 8-day cycle (a full backup once every 8 days) with various partial backups in between is optimal.
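For illustration only, here is a small Bash sketch of that rotation: with tape sets A B C D E, the set for day N is picked by the position of N's lowest set bit, which produces the classic A B A C A B A D... pattern (the set names and count are arbitrary):

    #!/bin/bash
    # Tower of Hanoi tape rotation (sketch).
    # Day 1=A, 2=B, 3=A, 4=C, 5=A, 6=B, 7=A, 8=D, ...
    sets=(A B C D E)
    day=${1:-1}            # 1-based day number
    n=$day; i=0
    while (( (n & 1) == 0 )); do
        n=$(( n >> 1 )); i=$(( i + 1 ))
    done
    echo "Day $day -> tape set ${sets[i]}"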
But the real savings come from being selective about what is backed up. You cut your backup time in half if you can exclude half the files from the process. Saving everything is the most costly approach.
With rsync over ssh, you don't really require tape storage... you can send this off-site to cloud storage. That may avoid the investment in a rather expensive tape backup system.
But whatever you do, the computers will likely need to run 24/7 so that backups can be done when people are not in the office. For that, you can use Bash and cron, standard Unix/Linux tools that can also be loaded onto a Windows or Apple OS. Again, it is free and excellent, so why buy stuff that is expensive and not as clear to operate?
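As a sketch, assuming you have a backup script saved somewhere like /usr/local/bin/nightly-backup.sh (a made-up path), the crontab entries could look like this:

    # m  h  dom mon dow  command
    30 2   *   *   1-5   /usr/local/bin/nightly-backup.sh >> /var/log/backup.log 2>&1
    0  3   *   *   6     /usr/local/bin/full-backup.sh    >> /var/log/backup.log 2>&1

That runs the nightly job at 02:30 Monday to Friday and a full run at 03:00 on Saturdays, with everything appended to a log.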
Full clone images can be created over ssh with the Unix/Linux dd command, including of an Apple or Windows OS, but so far I have only done this by booting manually. It may be possible to automate the process.
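A manual version of that, sketched with made-up device and host names (the source disk should be idle, ideally with the machine booted from rescue media, or the image will be inconsistent):

    # Stream a raw image of the whole disk to the backup host,
    # compressing on the way to save bandwidth and space.
    dd if=/dev/sda bs=1M | gzip -c | ssh backupuser@backuphost 'cat > /backups/sda.img.gz'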
The biggest hangups are the speed of your computers, the speed of the storage holding all those terabytes, and the bandwidth capability of the network you are using. Slowness is tied both to the size of the project and to the fact that good backup software will actually verify what it wrote. There are steps you can skip, but you never really can afford to.
We have half a dozen servers locally, and nearly two dozen small servers on remote sites that are backed up on my Robot.
(About 1TB from the smaller sites)
We use 'deduplication' SW that copies data/changes from the servers and over to one specific 'backup server', which then does the actual backup.
(It used to be that the servers took turns to use the robot, but unless you were very careful, you could end up with a server overwriting another server's backup)
The backup-server has its own dedicated disk shelf (about 6TB; we're swapping it soon for one with over 20TB) for temporarily storing data.
The nice thing about robots like these is that you can dedicate sets of tapes to different backups and not have to swap daily.
About a dozen tapes are used for the monthly. Then the SW picks a 'free' tape for differentials every night. It can even use free space on a partially full tape from the night before.
With the new solution, not only will we be ready for more data, but we will also be able to do a restore from cached data more easily instead of always having to get it from tape.
One thing to consider when picking a backup solution is 'why do we back up this'...
For example, the user's home share is 'kind of important' to the user.
On those areas we also use the 'Shadow copy' function in the newer versions of Windows server.
(Set to take 'snapshots' of the selected disks at specific times. If a file is accidentally deleted or becomes corrupt, we can just 'roll it back'.)
In fact the process is so painless that some of our more 'accident prone' users are able to do it themselves, saving us Helldeskers from a few calls every month.
A common tape backup routine is:
Full backup every Friday night.
Differential backups all other nights. DO NOT use Incremental.
Last (or first) full backup of the month is set aside for a full year.
Last full backup of the year is set aside for a decade, or whatever is thought necessary.
(Many organisations and companies need to keep backups of financial data for a decade, for auditing purposes.)
Why Differential, not Incremental?
A Differential backup job backs up all the changed files since the last FULL backup.
The Incremental job only takes a backup of the files changed since the previous backup, even if that was also an incremental.
WHEN (not if) you have a disaster recovery scenario:
With differentials: restore the last full, then the last differential.
With incrementals: restore the last full, then EVERY incremental.
If there's something wrong with one of the incremental sets, you will lose all the changes to files in that set unless those files were also changed on later dates.
With a differential, you take the previous set, restore with 'only overwrite older files', and end up losing only one day's worth of changes.
Trust me, the difference can be big...
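A rough sketch of the difference, using GNU tar's --listed-incremental snapshot files (the paths are made up): a differential run always compares against the state recorded at the full backup, while an incremental run updates that state every time:

    # Full backup: record the file state in a snapshot file.
    tar -czf /backups/full.tar.gz -g /backups/full.snar /data

    # Differential: work on a COPY of the full snapshot, so every
    # run captures everything changed since the last FULL backup.
    cp /backups/full.snar /tmp/diff.snar
    tar -czf /backups/diff-$(date +%F).tar.gz -g /tmp/diff.snar /data

    # Incremental: reuse one snapshot file, so each run only captures
    # changes since the PREVIOUS run (the very first run acts as a full).
    tar -czf /backups/incr-$(date +%F).tar.gz -g /backups/incr.snar /data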
First thing to do with a new backup system is to do a FULL BACKUP, then a RESTORE.
Ideally, the restore should be run on the same drive as the data belongs to.
(Rename a large folder, then restore it and the contents from tape)
A backup that doesn't work is no backup at all...
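A quick sanity check along those lines, sketched in Bash with made-up paths:

    mv /data/projects /data/projects.orig      # set the original aside
    # ... restore /data/projects from the backup here ...
    diff -r /data/projects /data/projects.orig \
        && echo "restore verified" \
        || echo "MISMATCH - the backup is no good"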
Personal setting.
I'm just looking for something that's more manageable than Retrospect.
Retrospect works great for some things, but I don't like how you have to dig out catalog files etc. etc.,
guess which disk it's on, and whatnot...
Yes all the data is owned by me and acquired legally.
Well, if my home server fails and I have to spend the next year re-downloading everything, I would probably go crazy, so my sanity is valuable to me ;p
Good. There was a Thai student in the US who learned he could have his family buy PDF textbooks in Thailand much cheaper than in the US and resell them in the US. So he did this to pay for his education. But the publishers found out and dragged him into US court for huge sums of money.
In other words, it seems like the US is very free about doing business, but big companies can sue without proving merit. That comes later, when it goes to trial. So you can be stuck in a huge, costly legal defense for years before you ever get to trial.
With so much that is easy to download, be wary of who you share your data with -- especially if they are willing to pay for a copy of it.
Here in Taiwan, people do a lot of things that can't or shouldn't be done in the US, because the lawyers can't reach them. Personally, I don't have a multimedia collection.. less than 300MB of data.
This sounds like you need more of an archive solution than a backup: copies of data that is essentially non-changing and not continually being accessed and modified.
My solution was a small server from HP and multiple Western Digital drives. An HP Microserver is fairly cheap and can hold up to 4 drives internally. 2TB to 4TB drives are now available for around $100 per TB. Don't buy more capacity than you actually need to start with; prices will drop and capacities will go up over time.
Using rsync in a Bash script, with cron making sure everything is cloned at regular intervals, will provide you with a separate image of your working directories that can be searched with normal tools. If you use a Linux OS, it can even search Windows directories, and usually much faster than Windows can.
To get the duplicate archive onto another computer, use ssh. I believe ssh will also let you search the second computer remotely.
The beauty of all this is that it costs no cash... rock-solid system administration for free.
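To make that concrete, a minimal nightly script that cron could call might look like this sketch (the hostname, paths and log file are all assumptions):

    #!/bin/bash
    # nightly-backup.sh - mirror the working directories to the Microserver.
    SRC=/home/zetsu/data/
    DEST=backupuser@microserver:/backups/data/
    LOG=/var/log/nightly-backup.log

    echo "=== $(date) starting backup ===" >> "$LOG"
    if rsync -avz --delete -e ssh "$SRC" "$DEST" >> "$LOG" 2>&1; then
        echo "=== $(date) backup OK ===" >> "$LOG"
    else
        echo "=== $(date) BACKUP FAILED ===" >> "$LOG"
    fi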