When should I use /dev/shm/ and when should I use /tmp/?
When should I use /dev/shm/ and when should I use /tmp/? Can I always rely on them both being there on Unices?
/dev/shm is a temporary file storage filesystem, i.e., tmpfs, that uses RAM for the backing store. It can function as a shared memory implementation that facilitates IPC.
From Wikipedia:
Recent 2.6 Linux kernel builds have started to offer /dev/shm as shared memory in the form of a ramdisk, more specifically as a world-writable directory that is stored in memory with a defined limit in /etc/default/tmpfs. /dev/shm support is completely optional within the kernel config file. It is included by default in both Fedora and Ubuntu distributions, where it is most extensively used by the Pulseaudio application. (Emphasis added.)
/tmp is the location for temporary files as defined in the Filesystem Hierarchy Standard, which is followed by almost all Unix and Linux distributions.
Since RAM is significantly faster than disk storage, you can use /dev/shm instead of /tmp for the performance boost, if your process is I/O intensive and extensively uses temporary files.
To answer your questions: No, you cannot always rely on /dev/shm being present, certainly not on machines strapped for memory. You should use /tmp unless you have a very good reason for using /dev/shm.
Remember that /tmp can be part of the / filesystem instead of a separate mount, and hence can grow as required. The size of /dev/shm is limited by excess RAM on the system, and hence you're more likely to run out of space on this filesystem.
In descending order of tmpfs likelihood:
┌───────────┬──────────────┬────────────────┐
│ /dev/shm │ always tmpfs │ Linux specific │
├───────────┼──────────────┼────────────────┤
│ /tmp │ can be tmpfs │ FHS 1.0 │
├───────────┼──────────────┼────────────────┤
│ /var/tmp │ never tmpfs │ FHS 1.0 │
└───────────┴──────────────┴────────────────┘
Since you are asking about a Linux specific tmpfs mountpoint versus a portably defined directory that may be tmpfs (depending on your sysadmin and what's default for your distro), your question has two aspects, which other answers have emphasized differently:
- Appropriate use of various tmp directories
- Appropriate use of tmpfs
Appropriate use of various tmp directories
Based on the ancient Filesystem Hierarchy Standard and what Systemd says about the matter.
- When in doubt, use /tmp.
- Use /var/tmp for data that should persist across reboots.
- Use /var/tmp for large data that may not easily fit in RAM (assuming that /var/tmp has more available space – usually a fair assumption).
- Use /dev/shm only as a side-effect of calling shm_open(). The intended audience is bounded buffers that are endlessly overwritten. So this is for long-lived files whose content is volatile and not terribly large (see the sketch after this list).
- Definitely don't use /dev/shm for executables (of any kind), as it's commonly mounted noexec.
- If still in doubt, provide a way for the user to override. For the least amount of surprise, do like mktemp and honor the TMPDIR environment variable (also sketched below).
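For illustration, here is a minimal C sketch of the shm_open() pattern referred to above – a small, endlessly overwritten buffer shared between cooperating processes. The object name /example_buf is made up; on Linux it shows up as /dev/shm/example_buf (link with -lrt on older glibc):

    #include <fcntl.h>     /* O_* constants */
    #include <stdio.h>
    #include <sys/mman.h>  /* shm_open, mmap, shm_unlink */
    #include <unistd.h>    /* ftruncate, close */

    int main(void) {
        /* Hypothetical object name; appears as /dev/shm/example_buf on Linux. */
        const char *name = "/example_buf";

        int fd = shm_open(name, O_CREAT | O_RDWR, 0600);
        if (fd == -1) { perror("shm_open"); return 1; }
        if (ftruncate(fd, 4096) == -1) { perror("ftruncate"); return 1; }

        /* Another process can shm_open() the same name and mmap() the same pages. */
        char *buf = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (buf == MAP_FAILED) { perror("mmap"); return 1; }

        snprintf(buf, 4096, "volatile, endlessly overwritten state");

        munmap(buf, 4096);
        close(fd);
        shm_unlink(name);   /* removes /dev/shm/example_buf when no longer needed */
        return 0;
    }

And a sketch of honoring TMPDIR the way mktemp does, falling back to /tmp (the myapp. prefix is only an example):

    #include <stdio.h>
    #include <stdlib.h>    /* getenv, mkstemp */
    #include <unistd.h>    /* unlink, close */

    int main(void) {
        const char *dir = getenv("TMPDIR");   /* honor the user's override */
        if (dir == NULL || *dir == '\0')
            dir = "/tmp";                     /* default location per the FHS */

        char path[4096];
        snprintf(path, sizeof path, "%s/myapp.XXXXXX", dir);

        int fd = mkstemp(path);               /* creates a unique, private temp file */
        if (fd == -1) { perror("mkstemp"); return 1; }
        printf("temporary file: %s\n", path);

        unlink(path);                         /* clean up when done */
        close(fd);
        return 0;
    }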
Where tmpfs excels
tmpfs performance is deceptive. You will find workloads that are faster on tmpfs, and this is not because RAM is faster than disk: all filesystems are cached in RAM – the page cache! Rather, it is a sign that the workload is doing something that defeats the page cache. And one of the worst things a process can do in this regard is sync to disk far more often than necessary.
fsync is a no-op on tmpfs. This syscall tells the OS to flush its page cache for a file, all the way down to flushing the write cache of the relevant storage device, all while blocking the program that issued it from making any progress at all – a very crude write barrier. It is a necessary tool in the box only because storage protocols aren't made with transactions in mind. And the caching is there in the first place to let programs perform millions of small writes to a file without noticing how slow it actually is to write to a storage device – all actual writing happens asynchronously, until fsync is called, which is the only place where write performance is directly felt by the program.
So if you find yourself using tmpfs (or eatmydata) just to defeat fsync, then you (or some other developer in the chain) are doing something wrong. It means that the transactions toward the storage device are unnecessarily fine-grained for your purpose – you are clearly willing to skip some savepoints for performance, since you have now gone to the extreme of sabotaging them all – seldom the best compromise. Also, it is here in transaction-performance land where some of the greatest benefits of having an SSD lie – any SSD worth its salt is going to perform out of this world compared to what a spinning disk can possibly take (7200 rpm = 120 Hz, if nothing else is accessing it). Flash memory cards also vary widely on this metric (it is a tradeoff with sequential performance, and the SD card class rating only considers the latter). So beware, developers with blazing fast SSDs, not to force your users into this use case!
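To make that cost concrete, here is a minimal C sketch of the pattern described above (the journal path is made up). The write() lands in the page cache and returns almost immediately; the fsync() blocks until the device reports the data durable. On tmpfs the same fsync() returns immediately – which is exactly why moving such a workload there looks miraculous:

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical journal file; imagine this running in a loop, once per "transaction". */
        int fd = open("/tmp/journal.log", O_CREAT | O_WRONLY | O_APPEND, 0600);
        if (fd == -1) { perror("open"); return 1; }

        const char *record = "one small transaction record\n";
        if (write(fd, record, strlen(record)) == -1)   /* lands in the page cache: cheap */
            perror("write");

        if (fsync(fd) == -1)   /* crude write barrier: blocks until the device confirms */
            perror("fsync");

        close(fd);
        return 0;
    }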
Wanna hear a ridiculous story? My first fsync lesson: I had a job that involved routinely "upgrading" a bunch of SQLite databases (kept as test cases) to an ever-changing current format. The "upgrade" framework would run a bunch of scripts, making at least one transaction each, to upgrade one database. Of course, I upgraded my databases in parallel (8 in parallel, since I was blessed with a mighty 8-core CPU). But as I found out, there was no parallelization speedup whatsoever (rather a slight hit) because the process was entirely IO-bound. Hilariously, wrapping the upgrade framework in a script that copied each database to /dev/shm, upgraded it there, and copied it back to disk was like 100 times faster (still with 8 in parallel). As a bonus, the PC was usable too, while upgrading databases.
Where tmpfs is appropriate
The appropriate use of tmpfs is to avoid unnecessary writing of volatile data. It effectively disables writeback, like setting /proc/sys/vm/dirty_writeback_centisecs to infinity on a regular filesystem.
This has very little to do with performance, and getting it wrong is a much smaller concern than abusing fsync: the writeback timeout determines how lazily the disk content is updated after the pagecache content, and the default of 5 seconds is a long time for a computer – an application can overwrite a file as frequently as it wants, in pagecache, but the content on disk is only updated about once every 5 seconds. Unless the application forces it through with fsync, that is. Think about how many times an application can output a small file in that time, and you see why fsyncing every single one would be a much bigger problem.
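As a sketch of that volatile-data pattern (the path and interval are made up, and /tmp is assumed to be a tmpfs mount here): a process rewrites a small status file ten times per second. On tmpfs nothing ever reaches a disk; on a regular filesystem, writeback would still flush it only about every 5 seconds – unless something forces it with fsync.

    #include <stdio.h>
    #include <time.h>
    #include <unistd.h>

    int main(void) {
        /* Hypothetical status file on a tmpfs mount (assuming /tmp is tmpfs). */
        const char *path = "/tmp/myapp_status";

        for (int i = 0; i < 1000; i++) {
            FILE *f = fopen(path, "w");
            if (f == NULL) { perror("fopen"); return 1; }
            fprintf(f, "heartbeat=%d time=%ld\n", i, (long)time(NULL));
            fclose(f);          /* note: no fsync – the content is volatile by design */
            usleep(100000);     /* overwrite 10 times per second */
        }
        return 0;
    }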
What tmpfs cannot help you with
- Read performance. If your data is hot (which it had better be if you're considering keeping it in tmpfs), you will hit the pagecache anyway. The difference shows only when the pagecache is missed; if that is your case, see "Where tmpfs sux" below.
- Short-lived files. These can live their entire lives in the pagecache (as dirty pages) before ever being written out. Unless you force it with fsync, of course.
Where tmpfs sux
Keeping cold data. You might be tempted to think that serving files out of swap is just as efficient as a normal filesystem, but there are a couple of reasons why it isn't:
- The simplest reason: there is nothing that contemporary storage devices (be they hard-disk or flash based) love more than reading fairly sequential files neatly organized by a proper filesystem. Swapping in 4KiB blocks is unlikely to improve on that.
- The hidden cost: swapping out. Tmpfs pages are dirty – they need to be written somewhere (to swap) to be evicted from the pagecache, as opposed to file-backed clean pages that can be dropped instantly. This is an extra write penalty on everything else that competes for memory – it affects something else, at a different time than the use of those tmpfs pages.
Okay, here's the reality.
Both tmpfs and a normal filesystem are a memory cache over disk.
A tmpfs uses memory and swap space as its backing store; a regular filesystem uses a specific area of disk. Neither approach limits how large the filesystem can be: it is quite possible to have a 200 GB tmpfs on a machine with less than a GB of RAM, if you have enough swap space.
The difference is in when data is written to the disk. For a tmpfs, data is written ONLY when memory gets too full or the data is unlikely to be used soon. OTOH most normal Linux filesystems are designed to always have a more or less consistent set of data on the disk, so if the user pulls the plug they don't lose everything.
Personally, I'm used to having operating systems that don't crash and UPS systems (e.g. laptop batteries), so I think the ext2/3 filesystems are too paranoid with their 5-10 second checkpoint interval. The ext4 filesystem is better with a 10-minute checkpoint, except it treats user data as second class and doesn't protect it. (ext3 is the same, but you don't notice it because of the 5-second checkpoint.)
This frequent checkpointing means that unnecessary data is being continually written to disk, even for /tmp.
So the result is you need to create swap space as big as you need your /tmp to be (even if you have to create a swapfile) and use that space to mount a tmpfs of the required size onto /tmp.
NEVER use /dev/shm.
Unless you're using it for very small (probably mmap'd) IPC files and you are sure that it exists (it's not a standard) and the machine has more than enough memory + swap available.
Use /tmp/ for temporary files. Use /dev/shm/ when you want shared memory (i.e., interprocess communication through files).
You can rely on /tmp/ being there, but /dev/shm/ is a relatively recent, Linux-only thing.
Another time when you should use /dev/shm (for Linux 2.6 and above) is when you need a guaranteed tmpfs file system because you don't know if you can write to disk.
A monitoring system I'm familiar with needs to write out temporary files while building its report for submission to a central server. In practice it's far more likely that something will prevent writes to a filesystem (running out of disk space, or an underlying RAID failure that has pushed the system into a hardware read-only mode) yet still leave you able to limp along and alert about it, than that something will consume all available memory, render tmpfs unusable, and still leave the box alive. In cases like this, a monitoring system will prefer writing out to RAM so as to potentially be able to send an alert about a full disk or dead/dying hardware.