Week 3: Filesystems, part 2
Types of files
Linux distinguishes between several types of files:
- Regular file: A sequence of bytes. (Linux doesn't distinguish between binary and text files.)
- Directory: A listing of other files (which of course may themselves be directories).
- Symbolic link: A file that "points" to another file. Most syscalls follow symlinks by default, so if I create a symlink dir1/a that points to dir2/b, then calling open("dir1/a") will have the same effect as calling open("dir2/b").
  - If the original file is removed, then the symlink will be left dangling.
  - It's possible to create symlink loops. Trying to open or manipulate a symlink in a loop will return the ELOOP error.
- Hard link: Another kind of filesystem link, with three major differences from symlinks:
  - The hard link and the original file are completely identical, and in fact there is no "original" – the two are indistinguishable.
  - Hard links can't be left dangling; the file contents will be kept alive until all links to it are removed.
  - Hard links to directories are not allowed on Linux, because they could create filesystem loops.
In fact, hard links aren't really a different type of file – all regular files are effectively hard links. But normally we don't call them hard links unless we've created a second link to an existing file.
The other file types are block devices, character devices, sockets, and FIFOs. The former two are for representing hardware devices. The latter two will be covered later in the course.
Under the hood: inodes
Within the kernel, files are represented by a data structure called an inode. The inode holds metadata about the file, and the location of the file's content on disk. (See man 7 inode for details.)
A directory is just a mapping from file names to inodes. Two path names can point to the same inode (this is what a hard link is). Only when the last path pointing to an inode is removed will the underlying file contents be marked for deletion.
Other things can hold on to references to inodes, too, such as running programs. When you call open, the kernel will keep the file alive until you call close, even if another program deletes it.
So, the filesystem effectively has a reference-counting garbage collector. To extend the analogy further: symbolic links are like weak references and hard links are like strong references.
Try this:
touch test.txt
# make a hard link
ln test.txt test2.txt
ls -i test*.txt
ls -i will print the same inode number for the two files.
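The same experiment can be sketched in Python, whose os module wraps the underlying syscalls. This is only an illustration under my own assumptions: link_demo and the file names are made up for the example. It creates a hard link and a symlink to the same file, removes the first name, and checks that the hard link keeps the contents alive while the symlink is left dangling.

```python
import os
import tempfile

def link_demo(workdir):
    """Create a file, a hard link, and a symlink, then remove the first name."""
    original = os.path.join(workdir, "test.txt")
    hard = os.path.join(workdir, "test2.txt")
    soft = os.path.join(workdir, "test3.txt")

    with open(original, "w") as f:
        f.write("hello\n")
    os.link(original, hard)      # hard link: a second name for the same inode
    os.symlink(original, soft)   # symlink: a separate file that stores a path

    same_inode = os.stat(original).st_ino == os.stat(hard).st_ino
    link_count = os.stat(original).st_nlink   # number of names for this inode

    os.unlink(original)          # removes one name, not the inode

    with open(hard) as f:        # contents survive through the remaining name
        hard_contents = f.read()
    # The symlink still exists, but the path it stores no longer resolves.
    dangling = os.path.islink(soft) and not os.path.exists(soft)
    return same_inode, link_count, hard_contents, dangling

with tempfile.TemporaryDirectory() as d:
    print(link_demo(d))
```

In the garbage-collector analogy, the hard link behaves like a strong reference (it keeps the inode alive) and the symlink like a weak one (it does not).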
The classic Unix permissions model
Linux has inherited the classic file permissions model from Unix. There are three permission bits, each of which has a different meaning for regular files versus directories:
- Read
  - Files: can read from file
  - Directories: can list directory contents
- Write
  - Files: can write to file
  - Directories: can create, delete, and move entries
- Execute
  - Files: can execute as program
  - Directories: can "traverse", i.e. access paths within the directory
Execute for directories
The execute permission is really a misnomer for directories: it has nothing to do with execution. The designers of Unix simply repurposed a bit that would otherwise be meaningless for directories.
It seems similar to read, and indeed usually the read and execute bits for directories have the same value, but all four combinations of the two bits are possible:
- r-x – can both list the directory and access paths within it (this is the normal case)
- r-- – can list the directory, but can't access any of the paths
- --x – can access paths within the directory, but can't get a list of all the paths
- --- – no permission on the directory at all
Owners and groups
Every file has an owner and a group. The three permissions (read, write, execute) can be separately set for the owner, the group, and everyone else. We can write a file's permission as a nine-character string, such as rwxr-xr-x, where the first three characters are the owner's permissions, the next three the group's, and the last three everyone else's.
Example: Suppose I'm a professor teaching a CS course, and I have a file of homework solutions. I should have read and write access to the file. My TAs, who belong to the ta group, should have read-only access. No one else should have access. Therefore, my file permissions should look like:
$ ls -l solutions.txt
-rw-r----- ian ta ... solutions.txt
(The initial dash is how ls indicates the file is not a directory.)
Octal notation
Instead of a nine-character string, we can represent permissions more concisely as a three-digit octal number, for example 755. To break it down:
- 755 in octal is 111 101 101 in binary.
- So this is equivalent to rwx r-x r-x.
- Another way to remember is that r = 4, w = 2, and x = 1, so 7 = 4 + 2 + 1 = rwx.
Some common permissions are:
- 755 = rwxr-xr-x = only owner can write, but anyone can read or execute (directories and executable files)
- 644 = rw-r--r-- = only owner can write, but anyone can read (non-executable files)
- 600 = rw------- = only owner can read and write, no one else can access
- 400 = r-------- = only owner can read, no one can write (read-only files)
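To check the octal arithmetic, here is a small Python sketch that converts a mode to the nine-character string by hand. mode_to_string is a name I made up for illustration; the standard library's stat.filemode does the same job and also prefixes the file-type character that ls prints.

```python
import stat

def mode_to_string(mode):
    """Render the low nine permission bits as an rwxrwxrwx-style string."""
    out = []
    for shift in (6, 3, 0):              # owner, group, other
        bits = (mode >> shift) & 0o7     # one octal digit = three bits
        out.append("r" if bits & 4 else "-")
        out.append("w" if bits & 2 else "-")
        out.append("x" if bits & 1 else "-")
    return "".join(out)

for m in (0o755, 0o644, 0o600, 0o400):
    print(oct(m), mode_to_string(m))

# The library version includes the leading type character, as ls -l does.
print(stat.filemode(stat.S_IFREG | 0o640))   # -rw-r-----
```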
Bonus: chmod abbreviations
The chmod shell command understands some abbreviations:
# give the owner ('user') exec permission
$ chmod u+x myfile.txt
# remove the owner's exec permission
$ chmod u-x myfile.txt
Unfortunately they are easy to get confused:
- u stands for 'user'
- g stands for 'group', not global
- o stands for 'other', not owner
- a stands for 'all', meaning user and group and other, not just other
Syscalls: File metadata and permissions
struct stat {
mode_t st_mode;
uid_t st_uid;
gid_t st_gid;
off_t st_size;
/* other fields */
};
int stat(const char* pathname, struct stat* statbuf);
int fstat(int fd, struct stat* statbuf);
stat returns metadata about a file. Notably:
- the mode, which includes what type of file it is (regular file, directory, etc.) and its permissions
- the owner and group
- the size of the file in bytes
fstat is like stat except it takes a file descriptor instead of a path name.
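As a rough illustration, Python's os.stat and os.fstat return the same fields under the same names. The describe helper below is hypothetical (not part of any library); it picks apart st_mode into a file type and permission bits using the stat module.

```python
import os
import stat
import tempfile

def describe(path_or_fd):
    """Summarize type, permissions, owner, group, and size.

    Accepts a path (uses stat) or an open file descriptor (uses fstat).
    """
    if isinstance(path_or_fd, int):
        st = os.fstat(path_or_fd)
    else:
        st = os.stat(path_or_fd)

    if stat.S_ISREG(st.st_mode):
        kind = "regular file"
    elif stat.S_ISDIR(st.st_mode):
        kind = "directory"
    else:
        kind = "other"

    return {
        "kind": kind,
        "permissions": stat.S_IMODE(st.st_mode),  # mode with the type bits removed
        "uid": st.st_uid,
        "gid": st.st_gid,
        "size": st.st_size,
    }

with tempfile.TemporaryDirectory() as d:
    info = describe(d)
    print(info["kind"], oct(info["permissions"]))
```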
int chmod(const char* pathname, mode_t mode);
int fchmod(int fd, mode_t mode);
chmod and fchmod let you change a file's permissions.
- You must be the owner of the file (or root) to do this.
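A minimal Python sketch of both calls, assuming you own the file. The helper names make_private and make_read_only_fd are mine; os.chmod and os.fchmod are the real wrappers.

```python
import os
import stat
import tempfile

def make_private(path):
    """Set the mode to 600 by path and return the resulting permission bits."""
    os.chmod(path, 0o600)
    return stat.S_IMODE(os.stat(path).st_mode)

def make_read_only_fd(fd):
    """Set the mode to 400 through an open file descriptor."""
    os.fchmod(fd, 0o400)
    return stat.S_IMODE(os.fstat(fd).st_mode)

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "notes.txt")
    fd = os.open(path, os.O_CREAT | os.O_WRONLY, 0o644)
    try:
        print(oct(make_private(path)))
        print(oct(make_read_only_fd(fd)))
    finally:
        os.close(fd)
```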
mode_t umask(mode_t mask);
The open syscall allows us to set arbitrary file permissions on files we create. Sometimes, we might want to set a policy for more restrictive access. For instance, we might not want files to be world-readable or group-writable by default.
The way to do this is with the process's umask, a mask that applies to the mode parameter to open. Any bit that is set in the umask will be turned off in the mode. For example, if umask is 002 (like modes, the umask is expressed as an octal number), then files will not be world-writable by default.
Any process can change its own umask, so it's not a watertight mechanism for enforcing default file permissions.
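To see the masking in action, here is a sketch in Python. create_with_umask is a hypothetical helper: it temporarily sets the umask, creates a file with a requested mode, and reports the mode the file actually got. The result should be the requested mode with the umask bits cleared, i.e. requested & ~mask.

```python
import os
import stat
import tempfile

def create_with_umask(path, requested_mode, mask):
    """Create path with requested_mode under the given umask; return actual mode."""
    old = os.umask(mask)          # umask returns the previous mask
    try:
        fd = os.open(path, os.O_CREAT | os.O_WRONLY | os.O_EXCL, requested_mode)
        os.close(fd)
    finally:
        os.umask(old)             # restore, since the umask is process-wide
    return stat.S_IMODE(os.stat(path).st_mode)

with tempfile.TemporaryDirectory() as d:
    # Requesting rw-rw-rw- under umask 002 turns off the world-writable bit.
    print(oct(create_with_umask(os.path.join(d, "a"), 0o666, 0o002)))
```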
Syscalls: Flushing file data
int fsync(int fd);
int fdatasync(int fd);
The kernel buffers writes to disk, so when write returns, there's no guarantee that what you've written has actually made it to disk. Normally, this is OK, as the kernel will write it out eventually, but some programs (e.g., databases) need to be more careful about ensuring the persistence of data. The fsync system call forces the kernel to flush all pending writes for the file, and does not return until the data has made it to disk.
fdatasync is like fsync but only syncs the file contents (plus any metadata needed to read them back, such as the file size), not other metadata like timestamps.
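A hedged sketch of a write that waits for the disk, using Python's os.write, os.fsync, and os.fdatasync. The function name durable_write and its sync_metadata parameter are my own inventions for the example.

```python
import os
import tempfile

def durable_write(path, data, sync_metadata=True):
    """Write data to path and return only after the kernel has flushed it."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        offset = 0
        while offset < len(data):             # write may be partial, so loop
            offset += os.write(fd, data[offset:])
        if sync_metadata:
            os.fsync(fd)                      # contents and metadata
        else:
            os.fdatasync(fd)                  # contents only
    finally:
        os.close(fd)
    return offset

with tempfile.TemporaryDirectory() as d:
    print(durable_write(os.path.join(d, "db.bin"), b"key=value\n"))
```

Note that this only flushes the file itself. Whether the new directory entry is durable is a separate question, which the "Further reading" links go into.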
Syscalls: Directories
int mkdir(const char* pathname, mode_t mode);
mkdir creates a directory with the permissions specified in mode.
- Unlike open, it does not return a file descriptor.
- The new directory's owner will be set to the effective user ID of the process (same as for creating regular files with open).
struct linux_dirent {
/* other fields */
char d_type;
char d_name[]; /* variable-length name; in the layout getdents64 uses, it comes last */
};
// raw syscall
ssize_t getdents64(int fd, void* dirp, size_t count);
// libc wrapper
DIR* opendir(const char* pathname);
struct dirent* readdir(DIR* dirp);
getdents64 is the raw syscall to get the entries of a directory.
- Quoting man 2 getdents: "These are not the interfaces you are interested in."
- You are supposed to use opendir and readdir from libc instead, and languages other than C may not provide a getdents64 interface at all (Python only has os.scandir, for instance).
- Still, getdents64 isn't hard to understand. You pass in a file descriptor, an array to hold the entries, and a count (of bytes, not of array entries), and it fills the array and returns the number of bytes read.
  - This is because struct linux_dirent values are not fixed in size, so if it returned the number of entries read you wouldn't know where the end of your array is.
  - At any rate, 0 is returned at the end of the directory.
- The man page has a lot of gory details about struct layout, but these only apply to kernels older than Linux 2.4, which was released in 2001.
opendir and readdir are less awkward than getdents64, but have the disadvantage of only returning one entry at a time. struct dirent is similar to struct linux_dirent; read the man page if you're interested.
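For comparison, here is what listing a directory looks like through Python's higher-level os.scandir, together with os.mkdir to set up something to list. list_directory is a hypothetical helper name. Each entry carries a name and a cached type, much like the d_name and d_type fields above.

```python
import os
import tempfile

def list_directory(path):
    """Return a sorted list of (name, kind) pairs for the entries in path."""
    entries = []
    with os.scandir(path) as it:
        for entry in it:                 # "." and ".." are never yielded
            if entry.is_symlink():
                kind = "symlink"
            elif entry.is_dir(follow_symlinks=False):
                kind = "directory"
            elif entry.is_file(follow_symlinks=False):
                kind = "file"
            else:
                kind = "other"
            entries.append((entry.name, kind))
    return sorted(entries)

with tempfile.TemporaryDirectory() as d:
    os.mkdir(os.path.join(d, "sub"), 0o755)   # mode is still subject to the umask
    with open(os.path.join(d, "a.txt"), "w"):
        pass
    print(list_directory(d))
```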
Syscalls: Moving and deleting files
int rename(const char* oldpath, const char* newpath);
rename is used both to rename and move files. (From the kernel's standpoint, these are the same thing.)
- If newpath already exists, it will be replaced atomically, meaning that there is no interval where newpath temporarily ceases to exist, and if the rename fails then newpath will be untouched.
  - The move itself is not atomic, i.e. there will possibly be a time when both oldpath and newpath point to the file being renamed.
- Caveat: oldpath and newpath must be on the same filesystem. Otherwise, the kernel would not be able to do the rename atomically.
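One common use of the atomic-replace guarantee is to swap in a new version of a file so that readers see either the old contents or the new, never a mixture. The sketch below is one way to do it in Python, not a definitive implementation; atomic_replace is a name I made up. It creates the temporary file in the destination's own directory so that both paths are on the same filesystem.

```python
import os
import tempfile

def atomic_replace(path, data):
    """Replace the contents of path with data via a temp file and rename."""
    directory = os.path.dirname(os.path.abspath(path))
    # mkstemp creates the file with mode 600, so the result is owner-only.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())       # make sure the new contents are on disk
        os.rename(tmp_path, path)      # atomically replaces path if it exists
    except BaseException:
        try:
            os.unlink(tmp_path)        # clean up the temp file on failure
        except FileNotFoundError:
            pass
        raise

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "config.txt")
    atomic_replace(target, b"version 1\n")
    atomic_replace(target, b"version 2\n")
    with open(target, "rb") as f:
        print(f.read())
```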
int unlink(const char* pathname);
unlink removes a file.
- Why is it called unlink instead of remove? Because we're just removing the link between the path and the inode. It doesn't necessarily delete the file contents, unless that path was the last reference to the file.
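Here is a small Python sketch of that distinction, with a made-up helper name (unlink_while_open). It unlinks a file while still holding an open descriptor, then reads the contents back through the descriptor even though the path is gone.

```python
import os
import tempfile

def unlink_while_open(path):
    """Unlink path while holding it open; return (path_gone, link_count, data)."""
    with open(path, "w") as f:
        f.write("still here\n")
    fd = os.open(path, os.O_RDONLY)
    try:
        os.unlink(path)                      # removes the name from the directory
        path_gone = not os.path.exists(path)
        link_count = os.fstat(fd).st_nlink   # no names left; typically 0
        data = os.read(fd, 100)              # the inode lives until we close fd
    finally:
        os.close(fd)                         # now the kernel can free the contents
    return path_gone, link_count, data

with tempfile.TemporaryDirectory() as d:
    print(unlink_while_open(os.path.join(d, "scratch.txt")))
```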
int rmdir(const char* pathname);
rmdir deletes an empty directory. Like the rmdir command, it will fail if the directory has any files in it.
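A quick way to see the failure mode in Python, using a hypothetical try_rmdir helper. The error code for a non-empty directory is ENOTEMPTY on Linux; POSIX also permits EEXIST, so the sketch accepts either.

```python
import errno
import os
import tempfile

def try_rmdir(path):
    """Attempt rmdir; return 'removed' or 'not empty'."""
    try:
        os.rmdir(path)
        return "removed"
    except OSError as e:
        if e.errno in (errno.ENOTEMPTY, errno.EEXIST):
            return "not empty"
        raise                                 # some other failure, e.g. permissions

with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "dir")
    os.mkdir(target)
    inner = os.path.join(target, "file.txt")
    with open(inner, "w"):
        pass
    print(try_rmdir(target))   # directory still has an entry
    os.unlink(inner)
    print(try_rmdir(target))   # now it is empty
```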
Syscalls: File locking
const int LOCK_SH = /* */;
const int LOCK_EX = /* */;
const int LOCK_UN = /* */;
const int LOCK_NB = /* */;
int flock(int fd, int op);
flock places or removes a lock on a file.
- A lock can be shared (LOCK_SH) or exclusive (LOCK_EX). A file can have either N shared locks or 1 exclusive lock at a time.
  - Typically, shared locks are for readers and exclusive locks for writers.
- Pass LOCK_UN as the operation to release the lock.
- By default, flock will block until the lock is available. You can combine LOCK_NB with the lock options to make it non-blocking, in which case it will return immediately, with EWOULDBLOCK if the lock was already held.
- It is an advisory lock, meaning that the kernel won't stop other processes from accessing the file unless they also try to acquire the lock.
  - It's good for groups of cooperating processes that all agree to use flock.
  - It's no good for protecting against uncooperative processes – use file permissions to do that.
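The locking rules can be tried out in a single Python process via fcntl.flock. This is a sketch with made-up helper names (try_lock, lock_demo). It relies on the fact that flock locks belong to the open file description, so descriptors obtained from separate open calls contend with each other much as separate processes would.

```python
import fcntl
import os
import tempfile

def try_lock(fd, exclusive):
    """Try to take a lock without blocking; return True on success."""
    op = fcntl.LOCK_EX if exclusive else fcntl.LOCK_SH
    try:
        fcntl.flock(fd, op | fcntl.LOCK_NB)
        return True
    except BlockingIOError:          # EWOULDBLOCK: a conflicting lock is held
        return False

def lock_demo(path):
    """Two readers share the lock; a writer must wait until both release."""
    reader1 = os.open(path, os.O_RDONLY | os.O_CREAT, 0o600)
    reader2 = os.open(path, os.O_RDONLY)
    writer = os.open(path, os.O_WRONLY)
    try:
        results = [
            try_lock(reader1, exclusive=False),   # first shared lock
            try_lock(reader2, exclusive=False),   # shared locks coexist
            try_lock(writer, exclusive=True),     # refused while readers hold it
        ]
        fcntl.flock(reader1, fcntl.LOCK_UN)
        fcntl.flock(reader2, fcntl.LOCK_UN)
        results.append(try_lock(writer, exclusive=True))   # free now
        return results
    finally:
        for fd in (reader1, reader2, writer):
            os.close(fd)             # closing also releases any lock held

with tempfile.TemporaryDirectory() as d:
    print(lock_demo(os.path.join(d, "db.lock")))
```

Because the lock is advisory, nothing here stops the writer from calling write while the readers hold their locks; the protection only works if everyone asks first.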
Final project milestone
Modify your program to create the database file with permissions locked down to the file's owner. Use file locking to ensure that get and set are synchronized. Test your solution by having the set command go to sleep for 5 seconds while holding the lock; in another terminal, ensure that get and set block until the first set command wakes up and releases the lock.
Homework exercises
- (★) Do syscalls follow symlinks? What about hard links?
  Solution: Yes, although to be pedantic it's misleading to speak of "following" hard links in the same way as symbolic links, since there's not really a link to be followed but a direct reference to the file's metadata and content.
- (★) True or false: If I acquire an exclusive lock on a file, no other process will be able to write to it.
  Solution: False. Locks are advisory. Other processes can ignore the lock and write to the file anyway.
- (★) How is the owner of a new file or directory set?
  Solution: The owner of the new file is set to the effective user ID of the process that created it. We'll learn in week 4 what exactly the "effective user ID" is.
- (★★) One syscall we didn't cover in class this week is sendfile, which lets you efficiently copy bytes from one file to another. Read the man page, then use sendfile to write a simple version of cp -r. If your language doesn't expose getdents64 directly, you can use a higher-level API. Warning: Depending on your language, listing the directory may include the special .. entry. Make sure to filter this out! Otherwise you might try to recursively copy the whole tree.
- (★★) What happens if you rename a file while another process is writing to it? Make a prediction, then write a pair of programs to demonstrate what happens.
  Solution: Two reasonable predictions are that (a) it will continue writing to the new path, or (b) it will start writing to the old path. It turns out that (a) is correct – but only if the writing process keeps the file open. If it closes and reopens each time, it will keep writing to the old path.
- (★★) Is getdents64 atomic? Write a pair of programs that demonstrates its behavior.
  Solution: By "atomic", I mean whether a single call to getdents64 presents a consistent view of the directory. For example, if I call rename("foo", "bar"), getdents64 should return either foo or bar, but not both. (With a little bit of thought, you can see why it's not reasonable for multiple calls to getdents64 to be atomic – what if you waited 5 seconds in between?) I wrote a program that creates files 00001 through 09999 and rapidly renames, e.g., 00001 back and forth to 10001. Another thread calls getdents64 in a loop; if it ever sees both file N and N + 10000, then it was not atomic. I also ran the same test with readdir instead of getdents64. My program detected a non-atomic read for readdir but not getdents64. Under the hood, it appears that readdir is calling getdents64 with a fixed-size buffer of 32,768 bytes, so readdir won't be atomic on directories whose entries exceed that size. (It's a little unfair that we're comparing one call to getdents64 against multiple calls to readdir, but that's the only way to call readdir.) ChatGPT says that the Linux VFS implementation takes a shared i_rwsem lock when reading and an exclusive lock when writing. The man page doesn't guarantee that getdents64 is atomic, though, and on some filesystems, like NFS, we can expect that it isn't, so you shouldn't rely on this behavior.
- (★★) What permissions are required to rename a file? Without reading the man page, make a guess, then try to find the minimal set of permissions you need.
  Solution: The basic permissions are +x on each directory in the source and destination paths (this is a general requirement for accessing any path on Linux), and +w on the ultimate source and destination directories. For a more comprehensive answer, see my blog post.
- (★★) Read Dan Luu's article "Files are hard" and write a program that uses his technique to ensure file consistency.
- (★★) Linux file permissions are a little more complicated than what was presented here. Research the concepts of the set-uid, set-gid, and sticky bits.
  Solution: Normally, when an executable is run on Linux, it will run with the user ID and group ID of the user who started it. But if the set-uid or set-gid bit is set in the executable file's st_mode, then it will instead run as the file's owner or group, respectively. This is usually used when an executable needs root permissions but should be available to non-root users. Because they allow privilege escalation, set-uid executables are a potential security hole and must be written very carefully. See section 8.11 of Advanced Programming in the Unix Environment for an example.
  The sticky bit is another bit in st_mode that changes the interpretation of directory permissions: if set, then files in the directory can only be renamed or removed by the file's owner, the directory's owner, or the superuser (instead of anyone with +w permissions). Most commonly, the sticky bit is set on /tmp so that everyone can create files but not interfere with others' files. Tony Finch's blog post goes into more detail.
- (★★★) Is it possible to atomically overwrite a file? First, write a pair of programs (a reader and a writer) that shows that simply overwriting with write isn't atomic. Then, find a way to do it atomically.
  Solution: We demonstrated in class with simultaneous.c that writes are not atomic. To overwrite atomically, create a temporary file, write to it, and then use rename to atomically replace the destination file. It's important to create the temporary file on the same filesystem as the destination file (e.g., in the same directory), because rename does not work across filesystems. You can use the O_TMPFILE flag to open to create an unnamed temporary file.
- (★★★) Linux file permissions are a lot more complicated than what was presented here. Research file ACLs and SELinux contexts. What syscalls do they use?
  Solution: File ACLs let you set granular per-user and per-group permissions. See "Notes on Linux file ACLs" for details. SELinux is a set of security modules for the Linux kernel that encompasses much more than just filesystem permissions. Files can be tagged with a security context that affects how the file can be used – for instance, the system daemons may not be allowed to access files with security context user_home_t, regardless of the permission bits. See wiki/selinux for some useful troubleshooting commands.
  Both file ACLs and SELinux security contexts use POSIX extended attributes, which is a way to store key-value attributes on files. The syscalls are getxattr, setxattr, etc. – read man 7 xattr for more.
Further reading
- Some syscalls we didn't have a chance to cover:
  - man 2 chown – change a file's owner or group
  - man 2 link – create a hard link
  - man 2 symlink – create a symlink
  - man 2 readlink – read what a symlink points to
  - man 2 lstat – same as stat except does not follow symlinks
- "Behind The Scenes of Bun Install"
- "PostgreSQL's fsync() surprise"
- "fsync() on a different thread: apparently a useless trick"