Does btrfs balance also defragment files?

When I run btrfs filesystem balance, does this implicitly defragment files? I could imagine that balance simply reallocates each file extent separately, preserving the existing fragmentation.

There is an FAQ entry, 'What does "balance" do?', which is unclear on this point:

btrfs filesystem balance is an operation which simply takes all of the data and metadata on the filesystem, and re-writes it in a different place on the disks, passing it through the allocator algorithm on the way. It was originally designed for multi-device filesystems, to spread data more evenly across the devices (i.e. to "balance" their usage). This is particularly useful when adding new devices to a nearly-full filesystem.

Due to the way that balance works, it also has some useful side-effects:

  • If there is a lot of allocated but unused data or metadata chunks, a balance may reclaim some of that allocated space. This is the main reason for running a balance on a single-device filesystem.
  • On a filesystem with damaged replication (e.g. a RAID-1 FS with a dead and removed disk), it will force the FS to rebuild the missing copy of the data on one of the currently active devices, restoring the RAID-1 capability of the filesystem.

TL;DR

Btrfs' defrag feature is specific to fixing fragmentation in folder metadata and file contents, whilst the balance feature was created to "balance" (hence the name) the amount of data shared between drives whenever a drive is added or removed. While the two have some theoretical overlap in what they do, they are not directly related, which is why the documentation does not link them.

A more verbose answer follows, in the hope that it will help others who don't have the full context of the problems involved.


Chunk Allocation

An important concept with btrfs is chunk allocation. When you write data to btrfs, it writes that data into a "current" chunk, typically 1GB in size [1]. If the "current" chunk becomes full, it allocates a new chunk. If an existing chunk is emptied, its storage space is made available for re-allocation when a new chunk is needed.

If the filesystem is using more than one drive with the "dup", "single", or "raid1" storage profiles, the chunk allocator always prefers putting the next new chunk on the drive(s) with the most free space available. This ensures, generally, that drives are used equally.
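The allocator's preference can be sketched as a toy model (a deliberate simplification, not real btrfs code; real allocation also considers profiles, stripe constraints, and chunk types):

```python
# Toy model of btrfs chunk allocation for the "single" profile:
# each new 1GB chunk goes to the device with the most free space.
def pick_device(devices):
    """devices: dict of name -> (capacity_gb, used_gb)."""
    return max(devices, key=lambda d: devices[d][0] - devices[d][1])

def allocate_chunk(devices):
    """Allocate one 1GB chunk on the preferred device; return its name."""
    dev = pick_device(devices)
    cap, used = devices[dev]
    devices[dev] = (cap, used + 1)
    return dev

# Two part-full drives and one empty drive: new chunks all land on
# the empty drive until its free space matches the others.
devices = {"a": (500, 240), "b": (500, 240), "c": (500, 0)}
for _ in range(3):
    print(allocate_chunk(devices))  # c, c, c
```

With this rule, a freshly added (empty) drive absorbs all new writes until free space evens out, which is exactly the behaviour the rest of this answer leans on.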


How Balance Does its Thing

The balance feature works by taking existing data chunks and re-writing them into the "current" chunk. When an existing chunk is emptied in this way, it is automatically made available to the allocator. If the chunk being emptied was not full to start with (perhaps old data in it was deleted), the net result is freed disk space, since the newer chunk is "more tightly packed" with relevant data.

This is the part which could, in theory, be used as part of a de-fragmentation strategy, which I suspect is why many people assume balance already defragments. However, the balance feature was built with a specific purpose in mind, which is why it does not look at file content at all. It only checks whether or not the data it is taking out of the existing chunks is still relevant [2] before copying that data to the new chunk.
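In practice, the space-reclaiming side-effect is usually invoked with balance's usage filter, which rewrites only chunks that are mostly empty. The commands below are illustrative (the mount point and the 50% threshold are examples, not recommendations):

```shell
# Rewrite only data chunks that are less than 50% full; their old,
# sparsely-used chunks are then returned to the allocator.
btrfs balance start -dusage=50 /mnt/data

# The same idea for metadata chunks:
btrfs balance start -musage=50 /mnt/data

# Compare allocated vs. actually-used space before and after:
btrfs filesystem usage /mnt/data
```

A filtered balance like this is far cheaper than a full balance, since full chunks are left alone.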

Where does the Balance part come in?

When you add a new drive to the filesystem, the allocator will at first tend to write all new data to the new drive, simply because it has more free space available than the existing drives. So when a balance re-writes all chunks, the first chunks balanced are written almost exclusively to the new drive. Once free space has equalised (become balanced), the rest of the data is re-allocated equally between the drives.

Typical Balance Scenario:

I have 2x 500GB drives with 240GB used on each; I add another 500GB drive. I would typically have:

  • drive a: 240GB used
  • drive b: 240GB used
  • drive c: 0GB used

I start a balance of all the data. About one quarter through the balance, I'm likely to see a situation similar to the following:

  • drive a: 180GB used
  • drive b: 180GB used
  • drive c: 120GB used

At about the one-third mark, it appears to be balanced:

  • drive a: 160GB used
  • drive b: 160GB used
  • drive c: 160GB used

You can of course stop the balance operation at this point, though there are reasons (good and bad) why you might want to let it finish [3].
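The progression above can be reproduced with a toy simulation (again a simplification: real balance works block group by block group, chunks are not exactly 1GB, and chunks may be rewritten to the same drive):

```python
CAP = 500  # per-drive capacity in GB

def balance(drives, chunks_to_move):
    """Move 1GB chunks, one at a time, from the fullest source drive
    to the drive the allocator prefers (most free space)."""
    for _ in range(chunks_to_move):
        target = max(drives, key=lambda d: CAP - drives[d])
        source = max((d for d in drives if d != target),
                     key=lambda d: drives[d])
        drives[source] -= 1
        drives[target] += 1
    return drives

drives = {"a": 240, "b": 240, "c": 0}
print(balance(drives, 120))  # {'a': 180, 'b': 180, 'c': 120}
print(balance(drives, 40))   # {'a': 160, 'b': 160, 'c': 160}
```

After 120 of the 480 chunks (one quarter) the drives sit at 180/180/120, and after 160 (one third) they are even at 160/160/160, matching the scenario above.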


How Fragmentation Happens in btrfs

Btrfs is a CoW (Copy on Write) filesystem, which means that data is never overwritten [4]. If you have an existing 100MB file and overwrite a 1MB portion of it, that 1MB portion is not written over the existing data on the drive. Instead it is written elsewhere, in the "current" chunk, and btrfs keeps track of where these "fragments" of new data are stored. This is most useful for maintaining snapshots, as it means the old data is preserved by default. Because SSDs also never overwrite data in place, this CoW mechanism lends itself well to preserving SSD lifespan and performance.
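A toy model makes it clear why an overwrite fragments a CoW file (hypothetical bookkeeping; real btrfs extent items carry on-disk addresses, generations, and more): the file's extent list is split so the overwritten range points at newly written data, while the old data stays where it was.

```python
# Toy model of CoW extent bookkeeping. A file is a sorted list of
# (offset_in_file, length, disk_address) references, in MB units.
def overwrite(extents, off, length, new_addr):
    """Overwrite [off, off+length) by splitting references; the old
    data on disk is never touched."""
    result = []
    for e_off, e_len, addr in extents:
        e_end = e_off + e_len
        if e_end <= off or e_off >= off + length:
            result.append((e_off, e_len, addr))  # untouched extent
            continue
        if e_off < off:      # keep the leading part of the old extent
            result.append((e_off, off - e_off, addr))
        if e_end > off + length:  # keep the trailing part
            cut = off + length
            result.append((cut, e_end - cut, addr + (cut - e_off)))
    result.append((off, length, new_addr))  # the newly written data
    return sorted(result)

# A 100MB file written as one contiguous extent at disk address 0:
f = [(0, 100, 0)]
# Overwrite 1MB in the middle (new data lands at address 200):
f = overwrite(f, 40, 1, 200)
print(f)  # [(0, 40, 0), (40, 1, 200), (41, 59, 41)]
```

One overwrite turned one extent into three; repeat this a few hundred times on a database file and you get the fragmentation described below.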

Where Defrag Comes In

Regardless of the advantages, some files are overwritten very often (typically database files), and so end up having hundreds of these fragments. With SSDs there is little performance penalty in the short term, but with spindle drives the penalty is severe.

One solution of course is to use btrfs' defrag feature. The defrag operation re-writes the file content into the "current" chunk in logical order, reducing the hundreds of fragments to one contiguous 100MB extent instead of numerous separate pieces.
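For reference, the defrag commands look like this (paths are illustrative):

```shell
# Defragment a single frequently-overwritten file:
btrfs filesystem defragment /var/lib/db/main.db

# Or a whole directory tree, recursively:
btrfs filesystem defragment -r /var/lib/db

# filefrag (from e2fsprogs) reports how many extents a file has,
# which is a quick way to see whether defragmenting helped:
filefrag /var/lib/db/main.db
```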

An alternative solution is to use the "nocow" feature for specific files such as these. The nocow feature causes the file to be overwritten in place. Beware that there are caveats to nocow [5] [6].
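nocow is set via the "C" file attribute with chattr. It only takes effect on files that are empty when the attribute is applied, so the usual pattern is to set it on a directory so that new files inherit it (paths below are illustrative):

```shell
# Mark a directory nocow; files created in it afterwards inherit +C:
mkdir -p /var/lib/db
chattr +C /var/lib/db

# For an existing file, +C only works while the file is empty,
# so copy the data through a fresh nocow file instead:
touch /var/lib/db/main.db.new
chattr +C /var/lib/db/main.db.new
cp /var/lib/db/main.db /var/lib/db/main.db.new
mv /var/lib/db/main.db.new /var/lib/db/main.db
```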


Summary Again

  • The balance operation looks at chunks and stripes, and is not actually aware of file content except for whether or not the data in those chunks is still relevant.

  • The defrag operation looks at folder metadata and individual file content and re-writes the data in as contiguous a fashion as possible. The downside is with snapshots, where defrag causes duplication and extra drive usage.


Notes:

  1. Though chunks are typically 1GB in size, they can be bigger or smaller. When using raid types, chunks are typically striped across multiple drives in 1GB multiples. For example, 5 drives with raid0 typically results in a 5GB stripe consisting of 1GB chunks being written to each drive.

  2. Btrfs uses "references" to file content. When part of a file is overwritten, the live filesystem "references" the location where the new data was written. A snapshot, however, might still reference the old location. If there is no snapshot - or the old snapshot is deleted - no references remain that refer to the original overwritten content. That content is then considered irrelevant and will not be copied along with the relevant data during a balance operation.

  3. At this point, assuming storage is using the simple "single" profile [7], the first 160GB balanced would all be moved to the new drive - but at this point there is still about 320GB left to balance. The rest would be balanced equally across the drives. With spindles, you would ideally balance only those first ~160 chunks, then let btrfs spread subsequent writes across all 3 drives for a better "spread" of the data. With SSDs, attempting to maintain an even "spread" of data gets very complicated, is likely pointless, and is far more likely to be bad for SSD lifespan.

  4. The exception is the "nocow" feature.

  5. If there are snapshots, defragmenting the "live" file causes the snapshots and "live" file to refer to divergent data locations on the disk, causing the data to be duplicated and thus take up extra diskspace. When a general-purpose de-duplication feature becomes available, this will not be as much of a problem.

  6. Using nocow means btrfs does not maintain checksums for the file-content.

  7. With most raid types (raid1 is the exception), "spread" across the drives is moot as the stripes are typically written across all drives anyway.


Maybe looking at the source code of the command will help.

Prefer btrfs balance start

'btrfs filesystem balance' command is deprecated, please use 'btrfs balance start' command instead.

And then in the command's usage strings:

"btrfs [filesystem] balance start [options] <path>",
"Balance chunks across the devices",
"Balance and/or convert (change allocation profile of) chunks that",
"passed all filters in a comma-separated list of filters for a",
"particular chunk type.  If filter list is not given balance all",
"chunks of that type.  In case none of the -d, -m or -s options is",
"given balance all chunks in a filesystem."

I might give it a second look, but I can't see any references to defrag in the structs or the ioctl() calls. So there's no explicit defrag.

All it does is copy data from one place to another, using the default allocator in the process. Taken from here:

Depending on the purpose allocation and on allocation mode, algorithm either directly searches for a continuous extent of freespace in each suitable allocation group (a group in btrfs corresponds to a chunk described above).

So depending on the allocation mode, the free space on the device, and so on, btrfs may allocate in such a way that defragmenting won't be necessary - which you might consider a form of implicit defragmentation.

HTH


Balance works at the chunk level; chunks are how Btrfs implements RAID redundancy. It doesn't do anything at the B-tree level and doesn't defragment.