Capacity Optimization / Deduplication Options for Primary Storage

I'm exploring options for making more efficient use of our primary storage.

Our current NAS is an HP ProLiant DL380 G5 with an HP StorageWorks MSA20, plus one other disk shelf whose model I'm not sure of.

The vast majority of our files are PDF files (hundreds of millions of them), with a high degree of similarity.

In an expert opinion from George Crump (referenced from Data Domain's Dedupe Central), in the section on granularity, he says: "To be effective data de-duplication needs to be done at a sub file level using variable length segments."

This is hard to find, yet it's exactly what I need. Most dedupe options seem to be block-based, which works really well for minimizing how much space backups take up, since only the changed blocks get stored, but block-based techniques do not find identical segments that sit at different offsets within our PDFs.
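To make concrete what I mean, here is a rough Python sketch of content-defined (variable-length) chunking, the kind of segmentation Crump is describing. It is not any vendor's actual algorithm; the gear-style rolling hash, the chunk-size bounds, and the estimate_savings helper are my own illustrative assumptions. Because chunk boundaries are chosen from the bytes themselves rather than from fixed offsets, the same segment in two PDFs hashes to the same chunk even when it starts at a different position.

    import hashlib
    import random

    # Gear table: one fixed pseudo-random 32-bit value per possible byte value,
    # seeded so the same data always chunks the same way.
    random.seed(1)
    GEAR = [random.getrandbits(32) for _ in range(256)]

    MIN_CHUNK = 2 * 1024      # assumed bounds; real products tune these
    MAX_CHUNK = 64 * 1024
    MASK = (1 << 13) - 1      # low 13 bits zero -> roughly 8 KiB average chunks

    def chunks(data):
        """Split data into variable-length chunks whose boundaries depend only
        on the bytes around them, so identical segments produce identical
        chunks even when they sit at different offsets in different files."""
        start, h = 0, 0
        for i, b in enumerate(data):
            h = ((h << 1) + GEAR[b]) & 0xFFFFFFFF   # simple gear-style rolling hash
            length = i - start + 1
            if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
                yield data[start:i + 1]
                start, h = i + 1, 0
        if start < len(data):
            yield data[start:]

    def estimate_savings(paths):
        """Hash every chunk and count unique bytes: a rough idea of what
        variable-length dedupe could reclaim on a sample of files."""
        seen, total, unique = set(), 0, 0
        for path in paths:
            with open(path, 'rb') as f:
                data = f.read()
            total += len(data)
            for c in chunks(data):
                digest = hashlib.sha256(c).digest()
                if digest not in seen:
                    seen.add(digest)
                    unique += len(c)
        return total, unique

Running estimate_savings over a few thousand of our PDFs should give a rough idea of whether a sub-file, variable-length approach actually finds the similarity we think is there.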

I came across Ocarina Networks the other day, which looks like exactly what we need.

Storage Switzerland's Lab Report Overview - The Deduplication of Primary Storage compares Ocarina Networks and NetApp as being "two of the leaders in primary storage deduplication".

Ideally we'd like to continue using our current NAS, but much more efficiently.

The other solution I've come across is Storwize, which appears to perform inline compression of individual files and to integrate with deduplication solutions.
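Before engaging Storwize, I was thinking of getting a rough baseline of what plain per-file compression buys us on a sample of PDFs. The sketch below just uses zlib as a stand-in for whatever their appliance actually does; the sample size of 500 and the directory argument are arbitrary assumptions.

    import os
    import sys
    import zlib

    def compression_savings(paths):
        """Compress each file independently (zlib level 6 as a stand-in) and
        compare total sizes; per-file compression is roughly what an inline
        compression appliance would see."""
        original = compressed = 0
        for path in paths:
            with open(path, 'rb') as f:
                data = f.read()
            original += len(data)
            compressed += len(zlib.compress(data, 6))
        return original, compressed

    if __name__ == '__main__':
        # Sample up to 500 PDFs under the directory given on the command line.
        sample = [os.path.join(d, name)
                  for d, _, names in os.walk(sys.argv[1])
                  for name in names if name.lower().endswith('.pdf')][:500]
        orig, comp = compression_savings(sample)
        print('%d -> %d bytes (%.1f%% of original)' % (orig, comp, 100.0 * comp / orig))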

What other solutions and informational resources are there?


Solution 1:

I have found that most black-box solutions for de-duplication are not as effective or as efficient as the ones built directly into the storage.

For example, a black-box de-dupe appliance requires all of your data to pass through it in both directions before it hits whatever generic storage you are using, processing everything for de-dupe along the way. Storage arrays such as NetApp, Data Domain, and many others, on the other hand, let you control de-dupe on a per-volume basis, with all processing done on the controller itself.

If you are set on keeping your existing non-intelligent storage and putting a solution in front of it, I would recommend Data Domain, but honestly I would encourage you to upgrade to a storage system that can de-dupe internally.

I would look into the NetApp V-Series of storage controllers. These let you attach an intelligent disk controller to the disk shelf hardware you already have.
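Before you commit either way, it's worth estimating what per-volume, fixed-block dedupe would actually reclaim on a sample of your data. The quick sketch below hashes fixed 4 KiB blocks; the block size and the use of SHA-256 are assumptions, not any array's exact implementation. Comparing its numbers against a variable-length chunking estimate on the same sample will tell you how much the sub-file approach really matters for your PDFs.

    import hashlib

    BLOCK = 4096  # assumed fixed block size for the estimate

    def fixed_block_savings(paths):
        """Hash every fixed-size block in the sample and count unique bytes:
        a rough upper bound on what per-volume block dedupe could reclaim."""
        seen = set()
        total = unique = 0
        for path in paths:
            with open(path, 'rb') as f:
                while True:
                    block = f.read(BLOCK)
                    if not block:
                        break
                    total += len(block)
                    digest = hashlib.sha256(block).digest()
                    if digest not in seen:
                        seen.add(digest)
                        unique += len(block)
        return total, unique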

Solution 2:

The technology you're looking for is called deduplication, and there's a ton of vendors offering dedupe.

If you're using a SAN, call your SAN vendor and they'll fall all over themselves trying to sell you their dedupe options.

Here's a good resource on how to get started with dedupe:

http://www.datadomain.com/dedupe/

Solution 3:

I know the MSA range well and I think you'll struggle to dedupe with what you have. For a start, deduping is a reasonably slow and IO-intensive job that's best done by the actual SAN/NAS controllers. It's slightly different in a backup scenario, where the backup media server can dedupe as it goes, but with live data it's important to maintain data integrity and overall performance, and I'm not sure there's anything available as an 'after-market add-on' that'll really give you what you need.

Solution 4:

It's worth noting that the Ocarina system trawls the original file system and checks whether each file matches a policy. If it does, the Ocarina box expands the file and applies its proprietary compression algorithms, then writes the result to a different file system, optionally deleting the original file.

Apparently the read path can be set up with a FUSE file system so that reads against the original file system are intercepted and served from the "optimised" version, which sounds much more transparent than the original salesperson described.
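To be clear about the workflow as I understand it, here's a hypothetical sketch of the scan-and-rewrite side only (the FUSE read-intercept is omitted). The paths, the policy, and the use of zlib as a stand-in for Ocarina's proprietary optimisers are all my own assumptions.

    import os
    import zlib

    SOURCE_ROOT = '/mnt/nas/original'       # hypothetical paths
    OPTIMISED_ROOT = '/mnt/nas/optimised'

    def matches_policy(path):
        # Example policy: only touch PDFs larger than 64 KiB.
        return path.lower().endswith('.pdf') and os.path.getsize(path) > 64 * 1024

    def optimise_tree(delete_original=False):
        """Walk the source tree, rewrite matching files into a parallel tree
        (zlib standing in for the proprietary compression), and optionally
        delete the original, mirroring the workflow described above."""
        for dirpath, _, filenames in os.walk(SOURCE_ROOT):
            for name in filenames:
                src = os.path.join(dirpath, name)
                if not matches_policy(src):
                    continue
                rel = os.path.relpath(src, SOURCE_ROOT)
                dst = os.path.join(OPTIMISED_ROOT, rel) + '.z'
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                with open(src, 'rb') as f:
                    data = f.read()
                with open(dst, 'wb') as f:
                    f.write(zlib.compress(data, 9))
                if delete_original:
                    os.remove(src)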