Cost-effective, long-term archival of video and image data? ~50 TB

My lab is setting up a small server to hold the data (mostly video and images, plus a few documents) for whatever project our group is working on at the moment. Historically, after a research project ends, its data has been archived haphazardly: on a single hard drive, on a big pile of DVDs (or CDs in the olden days), on Sony DV cassettes or even VHS tapes (this lab has been active since the early '90s), or on a mixture of all of the above...

Question: (1) What is the best way to consolidate ALL of it into the same format AND storage medium, and (2) what is the best medium for long-term archiving of such data with very occasional access (say, 30+ years)? Unfortunately we don't have an enterprise-level budget (we are just a ~10-person lab), so we can't do anything that costs hundreds of thousands of dollars.

Thanks!

P.S. Considering our old video and images are of smaller resolution, but recent ones are huge, I think we are talking about 30~40 TB for the really old data, another 10~20 TB for recent data, then yearly additions of about 5 TB.
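
To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes the yearly addition stays flat at 5 TB, which is probably optimistic given that recent data is already much larger than the old data; the function name and defaults are just for illustration.

```python
def projected_capacity_tb(years, legacy_tb=(30, 40), recent_tb=(10, 20), yearly_tb=5):
    """Return (low, high) total TB after `years`, assuming flat yearly growth."""
    low = legacy_tb[0] + recent_tb[0] + yearly_tb * years
    high = legacy_tb[1] + recent_tb[1] + yearly_tb * years
    return low, high


# Print the estimate at 5-year intervals over a 30-year horizon.
for year in range(0, 31, 5):
    low, high = projected_capacity_tb(year)
    print(f"year {year:2d}: {low}-{high} TB")
```

Under those assumptions the archive starts at 40-60 TB and ends up at 190-210 TB after 30 years.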


Unfortunately, there is no best way for you. 30-year archival of digital media is a very hard problem and takes routine investment. About the only formats guaranteed to be readable in 30 years are ASCII and UTF-8, which are not video formats. Storage formats change: the 8 track reel-to-reel tapes we were using 30 years ago are nigh impossible to read these days even though the data is still on the tape (there is an interesting story about NASA rebuilding a 40-year-old tape drive to get at some newly recovered/discovered Apollo data tapes). Your best bet is to commit to periodic assessments of your archival environment, I'd say every 5 years, with sufficient budget to migrate old formats into newer ones.

You probably know better than I do, but the video landscape is changing rapidly. Realtime online editing is now possible, where it was only doable on seriously good kit even 10 years ago. Who knows how things will look 30 years hence.

  • Set your archival window for 5 years.
    • In the immediate term a largish storage array should suffice:
      • A big and slow 50 TB disk array can be had for under $70K, possibly well under.
      • An LTO5 tape drive and 50 tapes (well over 50 TB worth) can be had for less than $15K.
  • What format you store your video in is up to you.
  • Start finding and converting all of your older stuff into this new storage.
  • At the end of 5 years, do another full assessment of your archival environment.
    • What formats are you using?
    • What are newer formats?
    • What codecs seem to be dead ends, and what media do you have stored encoded that way?
    • Decide how you're going to migrate to newer storage methods (data formats, disk/tape/something-else), and spend appropriately.
  • Repeat 6 times.

That should get you to 30 years.
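
For the "what formats are you using?" part of each assessment, a small script can give a first answer. This is only a sketch: it groups files by extension, which is a rough proxy for the container and says nothing reliable about the codec inside, and the function name is made up for this example.

```python
from collections import defaultdict
from pathlib import Path


def inventory_by_extension(root):
    """Return {extension: (file_count, total_bytes)} for every file under root."""
    totals = defaultdict(lambda: [0, 0])
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Lower-case the suffix so .MOV and .mov are counted together.
            ext = path.suffix.lower() or "(none)"
            totals[ext][0] += 1
            totals[ext][1] += path.stat().st_size
    return {ext: tuple(counts) for ext, counts in totals.items()}


def format_report(inventory):
    """Render the inventory as text lines, largest total size first."""
    rows = sorted(inventory.items(), key=lambda item: -item[1][1])
    return [f"{ext:10s} {count:8d} files {size / 1e9:10.2f} GB"
            for ext, (count, size) in rows]
```

Running `print("\n".join(format_report(inventory_by_extension("/path/to/archive"))))` against the archive root (the path here is a placeholder) shows which extensions hold most of the data, which is a reasonable place to start looking for dead-end formats.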


I totally agree with sysadmin1138's post in every way bar one caveat - I don't think you're going to have the budget to really achieve what you want.

There are 5 main functions you need to create:

  • a standardised content and catalogue policy - I know you want to store everything in one format but you really should consider two - PDF for images and H.264 for video - both are long-term-support formats with multi-platform code that will almost certainly be supported by one party or another for 25-50 years in their current form simply due to the existing usage around the world.
  • a catalogue or CMS to index and publish the content.
  • a 'content ingest' system - this will take all of your media, package, encode, store and update the catalogue for each new piece of content. You will need a manual or automated content quality check put in place too.
  • a primary content store - this will have two main storage blocks; one small one to hold origin content while it's being transcoded/checked and a much larger block to hold the content 'near'. This is one of the only valid uses for RAID 6 I've come across but try to use enterprise quality disks that have a 24x365 'duty cycle' here.
  • long-term backup system - this is where the real money will be spent, you'll need to select a vendor that offers genuinely long-term backup capability. If I were doing this right now I'd still go with tape over disk purely for data longevity reasons, perhaps by IBM as they have a lot of experience in this area. You also need to consider that you need to do regular tape restorations and data verifications too, meaning you'll need a third storage block at least as large as the largest tape you have - and the systems to verify too of course. On top of that you'll need to ensure that the backup software you use will be around for a long time too, something like TAR on *nix is likely to be around for a while but it may not functionally give you what you want so ensure this isn't overlooked by your tape vendor.
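
The data verification step mentioned above can be as simple as keeping a checksum manifest next to the content and re-checking it after every restore. The following is a minimal sketch of that idea, not a replacement for whatever your backup software provides; the function names are invented for this example and SHA-256 is just one reasonable choice of hash.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large video files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root):
    """Return {relative_path: sha256_hex} for every file under root."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): sha256_of(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def verify(root, manifest):
    """Return the sorted relative paths that are missing or whose hash changed."""
    root = Path(root)
    problems = []
    for rel, expected in manifest.items():
        path = root / rel
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(rel)
    return sorted(problems)
```

Build the manifest once at ingest time, store it with the catalogue, and run `verify` against the restored copy on the third storage block; an empty list means every file came back bit-for-bit.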

So what you want to do can be done, I've done it myself a number of times over the past two decades or so - but none were cheap I'm afraid.

Good luck.


The others have given good advice about how to back your media up. I would suggest you spend some quality time looking at the Library of Congress guidelines:

http://www.digitalpreservation.gov/formats/index.shtml

You might also consider building a cheap whitebox ZFS array. You could probably do something to fit your needs for under $10k. As the drives die, replace them with larger ones, and so your storage capacity grows as you generate data. That would probably keep you going for quite a while, and you can replace it with a higher capacity device when it gets old. The advantage is that your data is online (and so it can be accessed as necessary), and is relatively well protected against bitrot, a serious problem when you have this much data.

A decent build option was put together here:

http://www.zfsbuild.com/


As difficult as it is for technologists, I would recommend immediately stopping thoughts about disks and technology. Break out your business problem into things that you have to make decisions about.

Example:

  • How are you going to deal with converting analog/miscellaneous digital tape formats into digital media that can be stored on some sort of digital storage?
  • How are you going to manage the content and associated metadata? Storage is easy -- you could put everything on LTO tape and store it in an old salt mine, but you would not have access to the data.
  • Are you re-inventing the wheel? If you're at a university, are there already solutions for content management available centrally? Or if you need to buy/build your own content management, is there centralized infrastructure that you can buy a piece of? (Tape, Object storage, SAN)
  • What are the real business requirements? What do you really want to keep and why? Oftentimes when you really dig into the heart of the matter, the real long-term retention requirements actually apply to only a small subset of data.

Be aware that if you store data in a lossy format, and then convert to another lossy format, and then another, your video quality will degrade with each transition.

The following is about audio, but the same generally applies to video:

You can convert any audio format to Ogg Vorbis. However, converting from one lossy format, like MP3, to another lossy format, like Vorbis, is generally a bad idea. Both MP3 and Vorbis encoders achieve high compression ratios by throwing away parts of the audio waveform that you probably won't hear. However, the MP3 and Vorbis codecs are very different, so they each will throw away different parts of the audio, although there certainly is some overlap. Converting a MP3 to Vorbis involves decoding the MP3 file back to an uncompressed format, like WAV, and recompressing it using the Ogg Vorbis encoder. The decoded MP3 will be missing the parts of the original audio that the MP3 encoder chose to discard. The Ogg Vorbis encoder will then discard other audio components when it compresses the data. At best, the result will be an Ogg file that sounds the same as your original MP3, but it is most likely that the resulting file will sound worse than your original MP3. In no case will you get a file that sounds better than the original MP3.

Since many music players can play both MP3 and Ogg files, there is no reason that you should have to switch all of your files to one format or the other. If you like Ogg Vorbis, then we would encourage you to use it when you encode from original, lossless audio sources (like CDs). When encoding from originals, you will find that you can make Ogg files that are smaller or of better quality (or both) than your MP3s.

(If you absolutely must convert from MP3 to Ogg, there are several conversion scripts available on Freshmeat.)

http://www.vorbis.com/faq/#transcode

So it's probably best to pick a lossless format, because once you pick one lossy format, you're stuck with it.
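
A toy illustration of why, with the loud caveat that this is not how any real codec works: here a "lossy codec" just rounds sample values to the nearest multiple of some step, and zlib stands in for a lossless format. The sample values and step sizes are made up.

```python
import zlib


def quantize(samples, step):
    """Toy 'lossy codec': round each sample to the nearest multiple of step."""
    return [((value + step // 2) // step) * step for value in samples]


def total_error(original, decoded):
    """Sum of absolute differences between the original and decoded samples."""
    return sum(abs(a - b) for a, b in zip(original, decoded))


original = [123, 200, 37, 64]        # made-up 8-bit sample values

first_gen = quantize(original, 10)   # encode once with toy codec A
second_gen = quantize(first_gen, 7)  # transcode A's output with toy codec B

# Lossless path: compress and decompress the raw bytes with zlib.
restored = list(zlib.decompress(zlib.compress(bytes(original))))

print(total_error(original, first_gen))   # 10
print(total_error(original, second_gen))  # 13
print(total_error(original, restored))    # 0
```

The second lossy generation ends up further from the original than the first one, while the lossless round trip comes back identical no matter how many times you repeat it.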