Why does copying the same amount of data take longer if spread across many separate files?
I noticed that copying 24 MB worth of data from one folder to another took about 30 seconds because (I'm assuming this is the reason) it was spread over more than 1,000 separate files. Copying 24 MB shouldn't take that long. Why does the number of files make a difference?
I'm running Windows 7 on a MacBook (4 GB RAM, Intel(R) Core(TM)2 Duo CPU P7450 @ 2.13GHz, 32-bit operating system).
EDIT: NTFS is the file system used on the drive
Solution 1:
An HDD does not have one exact transfer rate. The rate depends on how well the drive is maintained, i.e. whether it is fragmented, whether it has bad sectors, etc.
If the HDD is SATA 2 and the copy stays within the same partition, the only limit is the drive's data transfer speed.
If the copy is between two partitions on the same HDD, the same drive has to do both the reading and the writing, so the result also depends on the size of the HDD's buffer.
But for every file copied, the system must also create and update an entry in the NTFS MFT (Master File Table), which slows the process down when you copy many files. If you have anti-virus software, it will scan each file as it is copied. And if you have enabled Windows Search file indexing (or any other file-indexing service), the result will be worse still.
I think there must be many other reasons why copying many files is slower, but these should be the main ones.
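To put a rough number on that per-file cost, here is a back-of-the-envelope sketch using the figures from the question (24 MB, 1,000 files, 30 seconds). The 50 MB/s sustained throughput is purely an assumption for illustration, not a measurement of the asker's drive, and `per_file_overhead` is a made-up helper name.

```python
# Rough cost model: total time = file_count * per-file overhead + size / throughput.
# The 50 MB/s throughput below is an ASSUMED figure, not a measured one.

def per_file_overhead(total_seconds, total_mb, file_count, throughput_mb_s):
    """Return the fixed cost per file (in seconds) implied by the model."""
    transfer_seconds = total_mb / throughput_mb_s  # time spent moving the data itself
    return (total_seconds - transfer_seconds) / file_count

overhead = per_file_overhead(total_seconds=30.0, total_mb=24.0,
                             file_count=1000, throughput_mb_s=50.0)
print(f"implied overhead per file: {overhead * 1000:.1f} ms")  # prints 29.5 ms
```

Under that assumption the data itself accounts for about half a second, and nearly all of the 30 seconds is per-file work (MFT updates, scanning, indexing and so on).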
Solution 2:
Why does the number of files make a difference?
Apparently you are focusing solely on the "copy the data" aspect of "copy a file". A file is more than just the data; it is an entity in a filesystem. A file has a name and attributes and permissions. All of this additional information about the file has to be duplicated along with the data when the "file is copied". There is a significant amount of disk I/O to perform this filesystem overhead.
The procedure for copying one (1) file in a generic filesystem would be something like:
- Find the source file in the filesystem. (a)
- Read from disk the directory entry for the source file.
- Verify read permissions.
- Find the destination file in the filesystem. (b)
- Verify write permissions in the destination directory.
- Expand the directory if necessary to accommodate the new file. (c)
- Update the directory on disk. (c1)
- Find free blocks, allocate them and update the allocation table again. (d)
- Read file data and copy to destination file (i.e. copy the "file").
- Update the directory entry for the new file with its size and timestamps. (e)
- Update the access time of the source directory entry. (f)
(a) At the very least this means searching the current directory. Or the path might start at the root of the filesystem, and several levels of directories have to be traversed.
(b) At the very least this means searching the current directory. Or the path might start at the root of the filesystem, and several levels of directories have to be traversed. If the destination file already exists, then determine how the copy should proceed or abort. If the destination file does not exist, then a new directory entry must be created, and maybe this involves expanding the directory (i.e. file block (aka cluster) allocation overhead).
(c) If the directory has to be expanded, allocate a new block by finding a free block, modify the allocation table with the new allocation, and then write the block(s) out to disk. Since most filesystems maintain multiple copies of the allocation table, that means multiple writes to disk.
(c1) Once the destination directory is located, read the directory block from disk, modify it with the new directory entry for the copied file, and then write the block out to disk.
(d) In order to copy the file, allocate blocks by finding free blocks, modify the allocation table with the new allocations, and then write the block(s) out to disk. Since most filesystems maintain multiple copies of the allocation table, that means multiple writes to disk. In order to maintain data integrity, the filesystem may not try to coalesce (delay and merge) disk write operations for directories and allocation tables, but rather perform the write operations immediately as the new files are created and blocks are allocated.
(e) Once the data copy is complete, update the new directory entry for the copied file with the proper file length and timestamps, and then write the directory block out to disk.
(f) Update the source directory entry with a new "access" timestamp and then write the directory block out to disk.
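As a rough illustration of the steps listed above, here is a minimal Python sketch of copying one file "by hand". It is not how Windows or any particular filesystem implements a copy; `copy_one_file` is a hypothetical helper, and each call merely triggers the kind of filesystem work described in the notes (directory lookup, entry creation, block allocation, metadata update).

```python
import os
import stat
import tempfile

def copy_one_file(src, dst, chunk_size=64 * 1024):
    """Copy one file's data, then duplicate its permissions and timestamps."""
    info = os.stat(src)  # look up the source entry and read its metadata
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        # Opening dst creates the new directory entry; the loop moves the data.
        while True:
            chunk = fin.read(chunk_size)
            if not chunk:
                break
            fout.write(chunk)
    os.chmod(dst, stat.S_IMODE(info.st_mode))                # copy permission bits
    os.utime(dst, ns=(info.st_atime_ns, info.st_mtime_ns))   # copy timestamps
    return os.stat(dst)

# Small demonstration on a throwaway file.
with tempfile.TemporaryDirectory() as folder:
    src = os.path.join(folder, "source.bin")
    dst = os.path.join(folder, "copy.bin")
    with open(src, "wb") as f:
        f.write(b"x" * 1000)
    copied_size = copy_one_file(src, dst).st_size
    print("size copied:", copied_size)  # prints 1000
```

Only the read/write loop scales with the amount of data. Everything else runs once per file, so it is repeated a thousand times for a thousand files.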
So your question is really asking whether doing all of this for one thousand files, instead of for just one, might add to the time it takes to copy only the data portion of the files. If you copy a single file of 24 MB, you will have something to compare against your copy time for one thousand files.
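If you want to run that comparison in a repeatable way, something like the following sketch would do it. The helper names (`make_files`, `timed_copy`, `compare`) are made up for this example, and the timings will vary a lot with OS caching, anti-virus and the drive itself, so treat the numbers as indicative only.

```python
import os
import shutil
import tempfile
import time

def make_files(folder, count, size_each):
    """Create `count` files of `size_each` random bytes in `folder`."""
    os.makedirs(folder, exist_ok=True)
    payload = os.urandom(size_each)
    for i in range(count):
        with open(os.path.join(folder, f"f{i:04d}.bin"), "wb") as f:
            f.write(payload)

def timed_copy(src, dst):
    """Copy a whole folder and return the elapsed wall-clock seconds."""
    start = time.perf_counter()
    shutil.copytree(src, dst)
    return time.perf_counter() - start

def compare(total_bytes=24 * 1024 * 1024, file_count=1000):
    """Time copying many small files versus one file of about the same total size."""
    with tempfile.TemporaryDirectory() as root:
        many = os.path.join(root, "many")
        one = os.path.join(root, "one")
        # Integer division means the "many" set can be slightly under total_bytes.
        make_files(many, file_count, total_bytes // file_count)
        make_files(one, 1, total_bytes)
        t_many = timed_copy(many, os.path.join(root, "many_copy"))
        t_one = timed_copy(one, os.path.join(root, "one_copy"))
    return t_many, t_one

if __name__ == "__main__":
    t_many, t_one = compare()
    print(f"1000 small files: {t_many:.2f} s, one large file: {t_one:.2f} s")
```

Because the files were just written, much of the source data will still be in the OS cache, which flatters both runs. The gap between the two timings is the interesting part.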
When backing up a filesystem, copying the individual files to another filesystem on a disk or partition is rarely employed because it's a rather slow process, as you have discovered. A faster method is to create and write a single archive file that holds the source directory entries and file contents in a special file format; backup programs and the *nix command 'tar' can output such an archive file. (Note that 'tar' by itself only builds the archive and does not compress it, unlike combined archival+compression utilities.) The fastest method of backup is to write to a block device (rather than to a filesystem on a device), so that the source filesystem is ignored (treated as just more data) and a block-by-block image copy of the source device can be performed.
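As a small sketch of the single-archive approach, Python's standard `tarfile` module can write an uncompressed tar archive, which is roughly what plain 'tar' does. The function names and the `data` / `backup.tar` paths are just placeholders for this example.

```python
import os
import tarfile
import tempfile

def archive_folder(src_folder, archive_path):
    """Write everything under src_folder into one uncompressed tar archive."""
    name = os.path.basename(os.path.normpath(src_folder))
    with tarfile.open(archive_path, "w") as tar:  # mode "w" means no compression
        tar.add(src_folder, arcname=name)         # recurses into the folder
    return archive_path

def count_archived_files(archive_path):
    """Count the regular files stored inside the archive."""
    with tarfile.open(archive_path, "r") as tar:
        return sum(1 for member in tar.getmembers() if member.isfile())

# Small demonstration: three source files end up in a single destination file.
with tempfile.TemporaryDirectory() as root:
    src = os.path.join(root, "data")
    os.makedirs(src)
    for i in range(3):
        with open(os.path.join(src, f"file{i}.txt"), "w") as f:
            f.write("hello\n")
    archive = archive_folder(src, os.path.join(root, "backup.tar"))
    file_count = count_archived_files(archive)
    print("files stored in one archive:", file_count)  # prints 3
```

The destination filesystem only has to create and extend one file, so the per-file directory and allocation updates on the destination side happen once rather than once per source file. The source files still have to be looked up and read individually.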