Create Linux swap on external USB 3 hard drive

Solution 1:

There are a number of things to consider and maybe test bearing in mind your setup and use case.

Suggested SWAP partition locations

If the USB HDD is not a good idea where should I put some swap?

Short answer: yes, you can create a swap partition on the USB3 HDD, but the 2x750GB HDD array is possibly the safest place to put the swap.

However, you could also spread and prioritise your swap partitions across all the disks with varied priorities to try to maximise performance and swap capacity. If you like over-optimising like me, I'd recommend trying something like the following, which requires tinkering with fstab, etc. (example fstab entries follow this list):

  • Allocate a little swap partition space on the 2x SSD array, e.g. 4GB, with high priority (limited SSD space and paranoia over the lifespan of SSD are reasons other people don't do this).
  • Allocate more swap partition space on the 2x HDD array, e.g. 8GB, with medium priority.
  • Allocate even more swap space in a swap file on the USB3 HDD, e.g. 16GB with low priority.
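
As a rough sketch of how those priorities could look in /etc/fstab, using hypothetical device names and a swap file path that you'd replace with your own (higher pri values get used first):

# hypothetical /etc/fstab swap entries
/dev/md1            none  swap  sw,pri=10  0  0
/dev/md2            none  swap  sw,pri=5   0  0
/mnt/usb3/swapfile  none  swap  sw,pri=1   0  0

The swap file on the USB3 HDD would be created beforehand with something like dd if=/dev/zero of=/mnt/usb3/swapfile bs=1M count=16384, followed by chmod 600 and mkswap on it, and swapon -s confirms the active swap areas and their priorities.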

That way, if system RAM gets crushed with lots of processes begging for RAM and being swapped out, the load is distributed across all the disk devices. Note also that the swap priorities above are based on the performance of the underlying disk systems.

Next I'll try to go over some of the detailed reasoning.

Storage speed is probably much more important

You've probably read the recommendation to place swap on a less busy or dedicated drive, but that only applies in an apples vs apples comparison, and isn't an accurate rule for a more complex system mixing different storage media (SSDs vs HDDs) and interfaces (SATA vs USB3). For your case, the guiding principle should be to balance the I/O load types and allocate the swap to the storage interfaces and drives you expect to have the most spare random I/O throughput. That could be the SSDs, but with a caveat...
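
To get a feel for how much spare I/O capacity each device actually has under your normal workload, something like iostat from the sysstat package is handy (the 5-second interval is just an example):

# extended per-device statistics every 5 seconds
# look at %util and await to spot the busiest drives
iostat -x 5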

USB3 HDD for SWAP

You mentioned in a comment that the USB3 option didn't perform too well and indeed, the reasons could be:

  • Your USB3 drive is probably a single disk system, whereas your 2x SSD and 2x HDD with RAID should have better performance, given:
    • RAID 0 nearly doubles both read and write performance.
    • RAID 1 nearly doubles just read performance and can degrade write performance by a marginal amount.
    • So, assuming similar individual drive performance, the USB3 HDD would only be better if, on average, the 2x HDD SATA array was busy 50% of the time and the single USB3 HDD was completely idle.
  • And even more so if you compare swapping on one HDD to 2x SSDs: there's no chance it would perform as well. The SATA SSDs would have to be 95%+ busy before a single idle USB HDD might begin to compare...
  • USB3 will have more latency than SATA. And low latency is a key factor in memory access performance and responsiveness.

Internal HDDs array for swap

As above, the 2x HDDs for swap should be better than just 1 HDD hanging off USB3, and, as will be explained, should be safe to use for swap.

  • The 2x HDDs are best suited to large data sets which would tend to have sequential access patterns, e.g. media files (music/video/images).
  • I'm not sure about the Intel RAID setup, but with Linux RAID (mdadm) I know you have options, e.g. (a minimal sketch of the second option follows this list):
    • you could share the same disks, but make a RAID 0 for swap and a RAID 1 for VM images/data
    • you could avoid RAID overhead and directly configure a 1st swap partition at the start of each individual drive, while configuring mdadm to create the array out of the 2nd partition on each drive
  • HDD magnetic media is supposed to have better write longevity compared to SSDs (if they don't suffer other types of premature failure...)
  • If a system swaps a lot, it implies a lot of writes.
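
For example, a minimal sketch of that second option, assuming two hypothetical drives /dev/sdb and /dev/sdc, each pre-partitioned with a small 1st partition for swap and a 2nd partition for the array (adjust device names and sizes to your own layout):

# swap directly on the first partition of each drive
mkswap /dev/sdb1
mkswap /dev/sdc1
# equal priorities make the kernel spread pages across both, a bit like RAID 0
swapon -p 10 /dev/sdb1
swapon -p 10 /dev/sdc1

# mirror the remaining space for VM images/data
mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sdb2 /dev/sdc2
mkfs.ext4 /dev/md0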

SSDs for swap

2x SSD 120GB would be great for swap performance, but SSD lifespan is a factor to look out for.

  • SSDs are more like RAM compared to rotating disks and have much better random I/O support.
  • If lots of VMs and processes are running and your RAM is heavily utilised, the page-fault (read) access patterns to the swap partition/file are going to end up random.
    • Memory page allocation units are small, i.e. 4KB
    • I assume the Linux kernel is smart about 'swapout' (freeing some RAM by taking pages out and putting them on disk) and does it in batches to optimise for more sequential writes to disk.
    • For 'swapin' (when a process page-faults on data that's no longer in RAM but in swap), access could be quite random, and that's where an SSD can excel.
    • The Windows 7 Engineering MSDN blog recommends SSDs, given reads outnumber writes by about 40 to 1 (hopefully Linux is similar in principle), alleviating the concern about too much writing to the SSD.
  • Even if your SSDs are used to store your main OS and some VM images, there's probably plenty of headroom for SWAP file operations too. I have 2x 128GB Crucial M4s in RAID0 and they get awesome sequential IO (almost 1000MB/s) plus fairly good random read/write performance too (I measured close to 5000 IOPs and 50MB/s on a nasty mix of random read with mixed sizes mostly in the 4K and 16K blocks, but up to 256K).
  • Enterprise class SSD, i.e. based on more robust SLC tech, can handle more erase-write cycles and should be okay for swap.
  • Consumer SSDs, i.e. based on cheaper, higher-density MLC, might suffer a worse than expected lifespan if swap usage gets very heavy very often (I'm assuming you have consumer SSDs given the budget comments you made). However, at least in normal desktop workload scenarios, it sounds like swap on an SSD isn't an issue.
  • When SSDs get fully utilised, write performance degrades and the write-wear and lifespan issues become even worse.
  • You can potentially mitigate the erase-write limits and write performance issues of the SSD array by under-provisioning, leaving more headroom for the SSD's garbage collection to free up contiguous write blocks for better write performance and longevity (a sketch follows this list):
    • Assuming you previously used the SSDs to full capacity, an ATA secure erase operation might help refresh them so the wear-levelling algorithms see the full SSD as unallocated.
    • Simply partition only 80 to 90% of the capacity and leave the end of the SSD space free.
  • RAID type? If you have more faith in the reliability of SSDs and can afford the time to restore from a backup, I recommend RAID0. Note RAID 1 on 2 SSDs will technically have double the impact on write lifespan compared to RAID0 (as it doubles every write). So maybe steer clear of RAID1...
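
As a rough sketch of that combination, assuming a hypothetical SSD at /dev/sdX that you're prepared to wipe completely (back up first; the password is throwaway and the 85% figure is just an example):

# ATA secure erase so wear levelling sees the whole drive as free
# (the drive must not be in a 'frozen' security state - check with hdparm -I /dev/sdX)
hdparm --user-master u --security-set-pass tempPass /dev/sdX
hdparm --user-master u --security-erase tempPass /dev/sdX

# then partition only ~85% of the drive and leave the tail unallocated
parted /dev/sdX mklabel gpt
parted /dev/sdX mkpart primary 1MiB 85%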

Other tweaks

There are also several other tweaks and options you should consider given the concerns of supporting multiple VMs, etc.

Linux loves more RAM for caching I/O and Virtualisation Hates Disk I/O

Potential pitfalls:

  • Don't over-allocate all your RAM to guest operating systems, so that you can save some for caching I/O.
  • Find the sweet spot for 'swappiness'. Swapping should leave some room in RAM to cache disk I/O, but too much swapping will cause processes to be swapped out too soon and hurt general multitasking (a sketch of tuning it follows this list).
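
As a reference point, a minimal sketch of checking and tuning swappiness (the value 40 is purely illustrative, not a recommendation for your particular mix of VMs):

# check the current value (Debian defaults to 60)
cat /proc/sys/vm/swappiness

# try a different value at runtime
sysctl -w vm.swappiness=40

# make it persistent across reboots
echo 'vm.swappiness=40' >> /etc/sysctl.conf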

Modern CPUs have good hardware support for virtualizing the CPU and memory resources, but when it comes to sharing disk storage, virtualization workloads often bottleneck. Linux (and Windows) can improve I/O performance by using RAM to cache I/O operations while the SSD or HDD devices are still busy 'catching up'. Therefore, your extra RAM might not just be useful for running multiple OSes, but also for caching virtual machine I/O.

Virtual guest pagefile location

It would be a great solution if I could also use the same location for the Windows vbox clients' to move swap from C: to there!

I'm not sure about this, but my hunch is:

  • Rather allocate enough (or more) RAM per VM and let Linux swap pages belonging to the VirtualBox process on the host in and out as needed; look at using the VirtualBox memory ballooning control (a sketch follows this list)
    • after double checking, it sounds like VirtualBox locks and hogs the RAM, so the host OS can't page it in and out
    • so you'll still need some swap for the virtual guests
  • Having enough RAM for each guest and using memory ballooning should be faster/better compared to each individual VM guest doing its own swapping via virtual I/O, which has a performance penalty
  • also explore the option of installing the virtio drivers for Windows (VirtualBox supports this now and RedHat has these drivers)
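
For completeness, a small sketch of the ballooning control, assuming a hypothetical VM named 'win7-guest' with the Guest Additions installed (the 512MB balloon size is purely illustrative):

# balloon size (in MB) applied from the next VM start
VBoxManage modifyvm "win7-guest" --guestmemoryballoon 512

# or adjust it on the fly while the VM is running
VBoxManage controlvm "win7-guest" guestmemoryballoon 512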

Compress swapped storage

If your virtual host has a fair number of spare CPU cores, then something like zswap could work well:

  • Could have good performance boost if using the 2x HDD for swap space.
  • Might not help performance that much with swapping to 2x SSD, but compression would imply less write cycles.
  • And it implies more virtual memory capacity from less storage.

Anyhow, this may not be worth the effort as it requires a newer kernel, and Debian is notorious for sticking with older, tried and tested kernels, so it's not an easy option unless you backport a kernel or look at a different distro: e.g. Ubuntu 14.04 or CentOS 7 should offer more recent kernels.
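
If you do end up on a kernel recent enough to ship zswap (3.11 or newer), here's a minimal sketch of enabling it, assuming GRUB as the bootloader (the pool size is an illustrative default, not a tuned value):

# add the zswap parameters to the kernel command line in /etc/default/grub, e.g.
#   GRUB_CMDLINE_LINUX_DEFAULT="quiet zswap.enabled=1 zswap.max_pool_percent=20"
# then regenerate the GRUB config and reboot
update-grub

# or toggle it at runtime on a kernel built with zswap support
echo 1 > /sys/module/zswap/parameters/enabled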

Benchmarking Experience

On my own workstation (Windows 7), I used fio (http://www.bluestop.org/fio/) to mimic the random read and random write I/O trends mentioned in the MSDN blog. Anyone else wanting to test what various storage options can offer under swap/page file workloads could try something similar.

In looking at telemetry data from thousands of traces and focusing on pagefile reads and writes, we find that

  • Pagefile.sys reads outnumber pagefile.sys writes by about 40 to 1,
  • Pagefile.sys read sizes are typically quite small, with 67% less than or equal to 4 KB, and 88% less than 16 KB. Pagefile.sys writes are relatively large, with 62% greater than or equal to 128 KB and 45% being exactly 1 MB in size.

Benchmark Setup

This is the fio job file I used:

[global]
description="test random read and write to estimate suitability for page file use"
filename=fakeswap
numjobs=1
iodepth=1
; bypass the page cache and use synchronous I/O, as paging would
direct=1
sync=1
filesize=2048m

[pageout]
; random writes, weighted towards larger blocks per the MSDN stats
rw=randwrite
bssplit=64k/38:256K/15:1024K/45:4096k/2

[pagein]
; random reads, weighted towards small blocks per the MSDN stats
rw=randread
bssplit=4K/67:16K/21:64K/10:256K/2

Since the MSDN blog post only briefly mentioned a few stats, I made some educated guesses about the block sizes and the proportions of IOs at those sizes, and used the bssplit option to weight the different block sizes accordingly. My guesses were hopefully not too bad, given the final ratio of random read vs write IOs I got was 38.5 : 1, which is quite close to the 40 : 1 mentioned in the blog post.

I ran the benchmarks on an AMD SB850 based storage chipset and compared them to the performance of a RAM drive.

  • DDR3 Dual Channel @ 1600MHz with 2G RAMDisk (using DataRAM RAMDisk product)
  • SSDx2 RAID 0 (Crucial M4 128GB), NTFS
  • HDDx4 RAID 10 (Seagate 7200.14 3TB), NTFS
  • ADATA UV150 USB3 Flash Drive 32GB, FAT32

Note, I executed the random read and random write benchmarks independently (not mixed; a real system may see mixed patterns, but I was interested in comparing read/pagein versus write/pageout, hence I separated them). E.g. the commands I used were:

fio --section=pageout --output raid10_hdd4_pageout_2G.txt page2g.fio
fio --section=pagein --output raid10_hdd4_pagein_2G.txt page2g.fio

Benchmark Results

The benchmarks confirmed my own suspicion that a USB3 flash drive (note, not a hard disk on USB3) could perform fairly well with small random I/O. It turns out, however, that it isn't that good at the larger random write blocks, with very erratic latency times.

The following graph shows the time taken to page out and page back in 2GB of swap space with the representative/estimated random I/O patterns for paging.

[Graph: time taken to page out and page back in 2GB of swap space with a representative random I/O pattern]

I also looked at average throughput and compared it to that of RAM - it gives an idea of how bad things get when the system has to use swap ;-)

[Table: comparison of storage options for swap space and page files]

Further observations

  • Random Read I/O matters more than Random Write because of the smaller block sizes and larger number of IOs. Proportionally, pagein is more painful than pageout...
  • SSDx2 RAID 0 was about 10x slower than RAM
  • HDDx4 RAID10 looks to be terrible at pagein - about 300x slower than RAM and 30x slower than SSD.
  • However, HDDx4 RAID10 looks like it'll do relatively better at pageout - about 40x slower than RAM and only about 4x slower than SSD
  • The USB3 flash drive was much better at small random reads compared to the HDD RAID (~9x faster), so much so, that it made up for how poor it was at random write (~7x slower). Even when plugged into a USB 2 port, overall, it beats the HDD RAID.

WARNING - not recommending putting swap / page file on USB flash drive

  • A USB flash drive's NAND and controller could lack robust wear-levelling and garbage collection implementations (e.g. it can't benefit from the ATA TRIM command like an SSD can), making it more likely that, if used for swap space/a page file, it'll suffer a short lifespan and performance degradation over time. My tests were on a fresh/new flash drive. Maybe after 6 months of swapping to and from it, it won't keep up the performance and will die prematurely.

Last few notes

  • My SSDs and HDDs have fairly large caches, 256MB and 64MB respectively on each device, so this presumably gives them a boost, whereas the USB flash drive probably lacks this.
  • I'm not sure how well the observation M$ made about windows page file use applies to a Linux swap partition or file, but I'd bet it's not far off...

References

More reading (sorry, would've posted more links, but I've just signed up and superuser doesn't trust me yet)

  • superuser.com question about placing swap on SSD
  • MSDN Blog - Support and Q&A for Solid State Drives (Windows 7 Engineering blog)
  • SSD endurance myths and legends