Need an efficient data container. Moving from storage to memory as quickly as possible
Problem: I need to copy a large block of data from a remote location into system memory as quickly as possible.
Scenario: I have a data processing system. The system is built via shell scripts on-the-fly using multiple components that are pulled in from remote locations.
One of those components is a large block of data stored as groups of files.
The requirement I have is to retrieve that large block of data from a remote location and install it into system memory as quickly as possible. This is a requirement so that the system which relies on this data can start using it for processing as soon after boot time as possible.
Question: "What would be the most efficient container for my data?"
Solutions already tried/considered:
- ISO file: requires tools for creation and reading that are not typically native
- TAR file: extracting can take a lot of time
- Remote filesystem mounted as local: slow because contents need to be copied into memory
- LVM snapshot: geared more toward backups, not built for speed on restore
Notes:
- Data loss is not a primary concern.
- The remote file transfer procedure is not a primary concern as I already have an adequate tool.
- The system is currently using Ubuntu Linux.
Solution 1:
"The remote file transfer procedure is not a primary concern as I already have an adequate tool."
If you already have the file transferred, I suggest using mmap(2).
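As a rough illustration of the mmap(2) idea, here is a minimal Python sketch using the standard `mmap` module. The helper name `map_file_readonly` and the path in the usage note are made up for the example; it assumes the data has already been transferred into a single local, non-empty file.

```python
import mmap


def map_file_readonly(path):
    """Map an entire file read-only and return the mmap object.

    Hypothetical helper for illustration only. Pages are loaded lazily
    by the kernel on first access, so nothing is copied up front.
    """
    with open(path, "rb") as f:
        # Length 0 means "map the whole file". Mapping an empty file
        # raises ValueError. The mapping stays usable after f is closed.
        return mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
```

Usage would look like `data = map_file_readonly("/path/to/data.bin")`, after which `data[0:16]` reads the first 16 bytes straight from the page cache. Whether this beats an explicit copy depends on how the consumer accesses the data.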
Solution 2:
You should consider an image file with a file system that contains your data (put a loop device over the file with losetup and mount the loop device). The fastest way would probably be a compressed read-only file system like squashfs.
This would even allow some tricks if not all the data is needed simultaneously. Instead of mounting the loop device you could put a DM device on top of it, mount a network file system (or network block device) with the image file, put a second loop device on top of the network version of the file and combine both loop devices with the DM device.
Let's assume you have to copy 500 MiB of data. You start copying it. As soon as the first 100 MiB have been transferred, you create the loop devices and the DM device. The DM device points to the loop device of the local file for the first 100 MiB and to the other one for the rest. After each further block has been transferred (e.g. every 10 MiB), you suspend the DM device, reload it with the border shifted by another 10 MiB, and resume it.
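To make the border shifting concrete, here is a small Python sketch that only generates the text of a device-mapper table using the `linear` target, where each line has the form `start length linear device offset` in 512-byte sectors. The function name `linear_table` and the device names `/dev/loop0` (local file) and `/dev/loop1` (network copy) are assumptions for illustration; the sketch does not call dmsetup itself.

```python
SECTOR_SIZE = 512  # device-mapper tables are expressed in 512-byte sectors
MIB = 1024 * 1024


def linear_table(total_bytes, border_bytes, local_dev, remote_dev):
    """Build a two-segment linear table split at border_bytes.

    Sectors below the border map to local_dev, the rest to remote_dev.
    Both devices expose the same image, so offsets are identical.
    """
    total = total_bytes // SECTOR_SIZE
    border = min(border_bytes, total_bytes) // SECTOR_SIZE
    lines = []
    if border > 0:
        # Already transferred part: served from the local loop device.
        lines.append(f"0 {border} linear {local_dev} 0")
    if border < total:
        # Not yet transferred part: served from the network-backed device.
        lines.append(f"{border} {total - border} linear {remote_dev} {border}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Table for the moment the first 100 MiB of 500 MiB have arrived.
    print(linear_table(500 * MIB, 100 * MIB, "/dev/loop0", "/dev/loop1"))
```

Each time the border moves, the newly generated table would be loaded between `dmsetup suspend` and `dmsetup resume`, for example via `dmsetup reload` with the table text. Once the border reaches the full size, the table collapses to a single line that points only at the local device.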
The risk: if accesses go to the network version, that data is transferred twice. If that happens often, the data transfer as a whole will take longer (though the overall process may still finish earlier, depending on its access characteristics).
Edit 1:
See my answer to another question for an explanation of how to use DM devices this way (without suspend/reload/resume, though).