What does opening a file actually do?
Solution 1:
In almost every high-level language, the function that opens a file is a wrapper around the corresponding kernel system call. It may do other fancy stuff as well, but in contemporary operating systems, opening a file must always go through the kernel.
This is why the arguments of the fopen
library function, or Python's open
closely resemble the arguments of the open(2)
system call.
In addition to opening the file, these functions usually set up a buffer that will be consequently used with the read/write operations. The purpose of this buffer is to ensure that whenever you want to read N bytes, the corresponding library call will return N bytes, regardless of whether the calls to the underlying system calls return less.
I am not actually interested in implementing my own function; just in understanding what the hell is going on...'beyond the language' if you like.
In Unix-like operating systems, a successful call to open
returns a "file descriptor" which is merely an integer in the context of the user process. This descriptor is consequently passed to any call that interacts with the opened file, and after calling close
on it, the descriptor becomes invalid.
It is important to note that the call to open
acts like a validation point at which various checks are made. If not all of the conditions are met, the call fails by returning -1
instead of the descriptor, and the kind of error is indicated in errno
. The essential checks are:
- Whether the file exists;
- Whether the calling process is privileged to open this file in the specified mode. This is determined by matching the file permissions, owner ID and group ID to the respective ID's of the calling process.
In the context of the kernel, there has to be some kind of mapping between the process' file descriptors and the physically opened files. The internal data structure that is mapped to the descriptor may contain yet another buffer that deals with block-based devices, or an internal pointer that points to the current read/write position.
Solution 2:
I'd suggest you take a look at this guide through a simplified version of the open()
system call. It uses the following code snippet, which is representative of what happens behind the scenes when you open a file.
0 int sys_open(const char *filename, int flags, int mode) {
1 char *tmp = getname(filename);
2 int fd = get_unused_fd();
3 struct file *f = filp_open(tmp, flags, mode);
4 fd_install(fd, f);
5 putname(tmp);
6 return fd;
7 }
Briefly, here's what that code does, line by line:
- Allocate a block of kernel-controlled memory and copy the filename into it from user-controlled memory.
- Pick an unused file descriptor, which you can think of as an integer index into a growable list of currently open files. Each process has its own such list, though it's maintained by the kernel; your code can't access it directly. An entry in the list contains whatever information the underlying filesystem will use to pull bytes off the disk, such as inode number, process permissions, open flags, and so on.
-
The
filp_open
function has the implementationstruct file *filp_open(const char *filename, int flags, int mode) { struct nameidata nd; open_namei(filename, flags, mode, &nd); return dentry_open(nd.dentry, nd.mnt, flags); }
which does two things:
- Use the filesystem to look up the inode (or more generally, whatever sort of internal identifier the filesystem uses) corresponding to the filename or path that was passed in.
- Create a
struct file
with the essential information about the inode and return it. This struct becomes the entry in that list of open files that I mentioned earlier.
Store ("install") the returned struct into the process's list of open files.
- Free the allocated block of kernel-controlled memory.
- Return the file descriptor, which can then be passed to file operation functions like
read()
,write()
, andclose()
. Each of these will hand off control to the kernel, which can use the file descriptor to look up the corresponding file pointer in the process's list, and use the information in that file pointer to actually perform the reading, writing, or closing.
If you're feeling ambitious, you can compare this simplified example to the implementation of the open()
system call in the Linux kernel, a function called do_sys_open()
. You shouldn't have any trouble finding the similarities.
Of course, this is only the "top layer" of what happens when you call open()
- or more precisely, it's the highest-level piece of kernel code that gets invoked in the process of opening a file. A high-level programming language might add additional layers on top of this. There's a lot that goes on at lower levels. (Thanks to Ruslan and pjc50 for explaining.) Roughly, from top to bottom:
-
open_namei()
anddentry_open()
invoke filesystem code, which is also part of the kernel, to access metadata and content for files and directories. The filesystem reads raw bytes from the disk and interprets those byte patterns as a tree of files and directories. - The filesystem uses the block device layer, again part of the kernel, to obtain those raw bytes from the drive. (Fun fact: Linux lets you access raw data from the block device layer using
/dev/sda
and the like.) - The block device layer invokes a storage device driver, which is also kernel code, to translate from a medium-level instruction like "read sector X" to individual input/output instructions in machine code. There are several types of storage device drivers, including IDE, (S)ATA, SCSI, Firewire, and so on, corresponding to the different communication standards that a drive could use. (Note that the naming is a mess.)
- The I/O instructions use the built-in capabilities of the processor chip and the motherboard controller to send and receive electrical signals on the wire going to the physical drive. This is hardware, not software.
- On the other end of the wire, the disk's firmware (embedded control code) interprets the electrical signals to spin the platters and move the heads (HDD), or read a flash ROM cell (SSD), or whatever is necessary to access data on that type of storage device.
This may also be somewhat incorrect due to caching. :-P Seriously though, there are many details that I've left out - a person (not me) could write multiple books describing how this whole process works. But that should give you an idea.
Solution 3:
Any file system or operating system you want to talk about is fine by me. Nice!
On a ZX Spectrum, initializing a LOAD
command will put the system into a tight loop, reading the Audio In line.
Start-of-data is indicated by a constant tone, and after that a sequence of long/short pulses follow, where a short pulse is for a binary 0
and a longer one for a binary 1
(https://en.wikipedia.org/wiki/ZX_Spectrum_software). The tight load loop gathers bits until it fills a byte (8 bits), stores this into memory, increases the memory pointer, then loops back to scan for more bits.
Typically, the first thing a loader would read is a short, fixed format header, indicating at least the number of bytes to expect, and possibly additional information such as file name, file type and loading address. After reading this short header, the program could decide whether to continue loading the main bulk of the data, or exit the loading routine and display an appropriate message for the user.
An End-of-file state could be recognized by receiving as many bytes as expected (either a fixed number of bytes, hardwired in the software, or a variable number such as indicated in a header). An error was thrown if the loading loop did not receive a pulse in the expected frequency range for a certain amount of time.
A little background on this answer
The procedure described loads data from a regular audio tape - hence the need to scan Audio In (it connected with a standard plug to tape recorders). A LOAD
command is technically the same as open
a file - but it's physically tied to actually loading the file. This is because the tape recorder is not controlled by the computer, and you cannot (successfully) open a file but not load it.
The "tight loop" is mentioned because (1) the CPU, a Z80-A (if memory serves), was really slow: 3.5 MHz, and (2) the Spectrum had no internal clock! That means that it had to accurately keep count of the T-states (instruction times) for every. single. instruction. inside that loop, just to maintain the accurate beep timing.
Fortunately, that low CPU speed had the distinct advantage that you could calculate the number of cycles on a piece of paper, and thus the real world time that they would take.