Storing duplicate files efficiently on Linux

I host a lot of websites, and our system makes it easy to duplicate items between these sites, which is handy but leads to lots of duplicated (and potentially quite large) files. I was wondering whether there is any mechanism in Linux (specifically Ubuntu) where the filesystem will store the file only once but link to it from all of its locations.

I'd need this to be transparent, and it would also have to handle the case where a user changes one of the files: the change shouldn't alter the contents of the original, but should instead create a new copy for just that particular instance of the file.

The point of the exercise is to reduce wasted space used by duplicated files.


"I'd need this to be transparent"

ZFS on Linux has a feature called "online deduplication".
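
As a minimal sketch of how that is typically enabled (the pool name "tank" and dataset "tank/www" are just placeholders, not anything from your setup), and keeping in mind that the dedup table needs to fit in RAM to perform well:

    # Create a dataset for the site files and turn on online dedup for it
    # ("tank" and "tank/www" are illustrative names)
    zfs create tank/www
    zfs set dedup=on tank/www

    # The DEDUP column reports the achieved deduplication ratio
    zpool list tank

Note that the property only affects data written after it is set, so existing files would need to be rewritten (e.g. copied into the dataset) to benefit.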

UPD.: I've re-read your question, and now it looks like Aufs could be of help to you. It's a very popular solution for hosting environments. Actually, I can mention Btrfs myself now as well: the pattern is that you have a template sub-volume which you snapshot every time you need another instance. It's copy-on-write (COW), so only changed file blocks take extra space. But keep in mind that Btrfs is, ergh… well, not too stable anyway. I'd use it in production only if the data on it is absolutely okay to lose.
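
A rough sketch of that template-plus-snapshot pattern, assuming the site roots live on a Btrfs filesystem (the directory names are made up for illustration):

    # One subvolume holds the shared "template" site
    btrfs subvolume create /srv/sites/template
    # ...populate /srv/sites/template with the common files...

    # Each new site instance is a writable snapshot of the template;
    # it initially shares all blocks with it, and edits only allocate
    # the changed blocks (copy-on-write)
    btrfs subvolume snapshot /srv/sites/template /srv/sites/site1
    btrfs subvolume snapshot /srv/sites/template /srv/sites/site2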


There is a Linux userspace/FUSE filesystem that will do this dedup:

http://sourceforge.net/p/lessfs/wiki/Home/
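
The usual setup, going from the lessfs documentation (so treat the exact flags, config keys, and paths here as assumptions that may vary between versions), is to describe the data and metadata stores in a config file, initialise them, and then mount the FUSE filesystem where you want deduplicated storage:

    # Initialise the lessfs databases described in /etc/lessfs.cfg
    mklessfs -c /etc/lessfs.cfg
    # Mount the deduplicating FUSE filesystem at the chosen mount point
    lessfs /etc/lessfs.cfg /mnt/dedup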

Linux Journal has a good article on it in its August 2011 issue. There are also various filesystem-specific options with Btrfs and ZFS.