CS644 week 3: Filesystems, part 2

Types of files

Linux distinguishes between several types of files:

Regular file: A sequence of bytes. (Linux doesn't distinguish between binary and text files.)
Directory: A listing of other files (which of course may themselves be directories).
Symbolic link: A file that "points" to another file. Most syscalls follow symlinks by default, so if I create a symlink dir1/a that points to dir2/b, then calling open("dir1/a") will have the same effect as calling open("dir2/b").
- If the original file is removed, then the symlink will be left dangling.
- It's possible to create symlink loops. Trying to open or manipulate a symlink in a loop will return the ELOOP error.
Hard link: Another kind of filesystem link, with three major differences from symlinks:
- The hard link and the original file are completely identical, and in fact there is no "original" – the two are indistinguishable.
- Hard links can't be left dangling; the file contents will be kept alive until all links to it are removed.
- Linux does not allow hard links to directories (link fails with EPERM), since they could create filesystem loops.

In fact, hard links aren't really a different type of file – all regular files are effectively hard links. But normally we don't call them hard links unless we've created a second link to an existing file.
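The difference between the two kinds of link shows up most clearly when the first name is removed. The following is a minimal sketch of that experiment; link_demo and its three path parameters are hypothetical names, not part of any library. It assumes the paths are absolute (a relative symlink target is resolved relative to the symlink's own directory) and that all three live on the same filesystem.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/*
 * Creates `target`, makes a hard link and a symlink to it, then removes
 * `target`. Returns 0 if the hard link still opens while the symlink is
 * left dangling (open fails with ENOENT), and -1 otherwise.
 */
int link_demo(const char* target, const char* hard, const char* sym) {
    /* Remove leftovers from a previous run; errors are ignored on purpose. */
    unlink(target);
    unlink(hard);
    unlink(sym);

    int fd = open(target, O_WRONLY | O_CREAT | O_EXCL, 0600);
    if (fd < 0) return -1;
    close(fd);

    if (link(target, hard) != 0) return -1;    /* second name for the same file */
    if (symlink(target, sym) != 0) return -1;  /* new file that stores the path */
    if (unlink(target) != 0) return -1;        /* remove the first name */

    int hard_fd = open(hard, O_RDONLY);        /* expected to succeed */
    int sym_fd = open(sym, O_RDONLY);          /* expected to fail */
    int sym_errno = (sym_fd < 0) ? errno : 0;

    if (hard_fd >= 0) close(hard_fd);
    if (sym_fd >= 0) close(sym_fd);
    unlink(hard);
    unlink(sym);

    return (hard_fd >= 0 && sym_fd < 0 && sym_errno == ENOENT) ? 0 : -1;
}
```

Calling link_demo with three unused paths in a scratch directory should return 0: the contents stay reachable through the hard link, while the symlink points at a name that no longer exists.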

The other file types are block devices, character devices, sockets, and FIFOs. The first two represent hardware devices. The last two will be covered later in the course.

Under the hood: inodes

Within the kernel, files are represented by a data structure called an inode. The inode holds metadata about the file, and the location of the file's content on disk. (See man 7 inode for details.)

A directory is just a mapping from file names to inodes. Two path names can point to the same inode (this is what a hard link is). Only when the last path pointing to an inode is removed will the underlying file contents be marked for deletion.

Other things can hold on to references to inodes, too, such as running programs. When you call open, the kernel will keep the file alive until you call close, even if another program deletes it.

So, the filesystem effectively has a reference-counting garbage collector. To extend the analogy further: symbolic links are like weak references and hard links are like strong references.

Try this:

touch test.txt
# make a hard link
ln test.txt test2.txt
ls -i test*.txt

ls -i will print the same inode number for the two files.
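The same experiment can be run from C. This is a sketch under the assumption that both paths are on the same filesystem; hard_link_demo is a hypothetical helper name. It relies on two struct stat fields, st_ino (the inode number) and st_nlink (the number of hard links), which stat fills in for every file.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates `path`, hard-links it to `path2`, and checks what stat reports.
 * Returns 0 if both names share one inode whose link count is 2, else -1.
 */
int hard_link_demo(const char* path, const char* path2) {
    /* Remove leftovers from a previous run; errors are ignored on purpose. */
    unlink(path);
    unlink(path2);

    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, 0600);  /* like touch */
    if (fd < 0) return -1;
    close(fd);

    if (link(path, path2) != 0) return -1;                   /* like ln */

    struct stat a, b;
    if (stat(path, &a) != 0 || stat(path2, &b) != 0) return -1;

    /* Inode numbers are only unique per filesystem, so compare st_dev too. */
    int ok = a.st_dev == b.st_dev && a.st_ino == b.st_ino && a.st_nlink == 2;

    unlink(path);
    unlink(path2);
    return ok ? 0 : -1;
}
```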

The classic Unix permissions model

Linux has inherited the classic file permissions model from Unix. There are three permission bits, each of which has a different meaning for regular files versus directories:

Read
- Files: can read from file
- Directories: can list directory contents
Write
- Files: can write to file
- Directories: can create, delete, and move entries
Execute
- Files: can execute as program
- Directories: can "traverse", i.e. access paths within the directory

Execute for directories

The execute permission is really a misnomer for directories: it has nothing to do with execution. Unix simply reused an otherwise meaningless bit.

It seems similar to read, and indeed usually the read and execute bits for directories have the same value, but all four combinations of the two bits are possible:

r-x – can both list the directory and access paths within it (this is the normal case)
r-- – can list the directory, but can't access any of the paths
--x – can access paths within the directory, but can't get a list of all the paths
--- – no permission on the directory at all

Owners and groups

Every file has an owner and a group. The three permissions (read, write, execute) can be separately set for the owner, the group, and everyone else. We can write a file's permissions as a nine-character string, such as rwxr-xr-x, where the first three characters are the owner's permissions, the next three the group's, and the last three everyone else's.

Example: Suppose I'm a professor teaching a CS course, and I have a file of homework solutions. I should have read and write access to the file. My TAs, who belong to the ta group, should have read-only access. No one else should have access. Therefore, my file permissions should look like:

$ ls -l solutions.txt
-rw-r----- ian ta ... solutions.txt

(The initial dash is how ls indicates that the file is a regular file; a directory would show d instead.)

Octal notation

Instead of a nine-character string, we can represent permissions more concisely as a three-digit octal number, for example 755. To break it down:

755 in octal is 111 101 101 in binary.
So this is equivalent to rwx r-x r-x.
Another way to remember is that r = 4, w = 2, and x = 1, so 7 = 4 + 2 + 1 = rwx.

Some common permissions are:

755 = rwxr-xr-x = only owner can write, but anyone can read or execute (directories and executable files)
644 = rw-r--r-- = only owner can write, but anyone can read (non-executable files)
600 = rw------- = only owner can read and write, no one else can access
400 = r-------- = only owner can read, no one can write (read-only files)
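The correspondence between the octal and rwx notations can be sketched in a few lines of C. The function names mode_to_string and string_to_mode are hypothetical. Note that C writes octal literals with a leading zero, so the mode 755 is spelled 0755 in source code.

```c
#include <sys/types.h>

/* Writes the nine-character rwx string for `mode` into `out`. */
void mode_to_string(mode_t mode, char out[10]) {
    const char letters[] = "rwxrwxrwx";
    for (int i = 0; i < 9; i++) {
        /* Bit 8 is the owner's read bit; bit 0 is everyone else's execute bit. */
        mode_t bit = (mode_t)1 << (8 - i);
        out[i] = (mode & bit) ? letters[i] : '-';
    }
    out[9] = '\0';
}

/* Inverse of mode_to_string; assumes `s` is exactly nine characters long. */
mode_t string_to_mode(const char* s) {
    mode_t mode = 0;
    for (int i = 0; i < 9 && s[i] != '\0'; i++) {
        mode <<= 1;            /* make room for the next bit */
        if (s[i] != '-') {
            mode |= 1;         /* any letter means the bit is set */
        }
    }
    return mode;
}
```

For example, mode_to_string(0644, buf) fills buf with rw-r--r--, and string_to_mode("rwxr-xr-x") returns 0755.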

Bonus: chmod abbreviations

The chmod shell command understands some abbreviations:

# give the owner ('user') exec permission
$ chmod u+x myfile.txt
# remove the owner's exec permission
$ chmod u-x myfile.txt

Unfortunately they are easy to get confused:

u stands for 'user'
g stands for 'group', not global
o stands for 'other', not owner
a stands for 'all', meaning user and group and other, not just other

Syscalls: File metadata and permissions

struct stat {
  mode_t st_mode;
  uid_t  st_uid;
  gid_t  st_gid;
  off_t  st_size;
  /* other fields */
};

int stat(const char* pathname, struct stat* statbuf);
int fstat(int fd, struct stat* statbuf);

stat returns metadata about a file. Notably:
- the mode, which includes what type of file it is (regular file, directory, etc.) and its permissions
- the owner and group
- the size of the file in bytes
fstat is like stat except it takes a file descriptor instead of a path name.

int chmod(const char* pathname, mode_t mode);
int fchmod(int fd, mode_t mode);

chmod and fchmod let you change a file's permissions.
You must be the owner of the file (or root) to do this.
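Here is a small sketch of fchmod in use, which restricts a file to its owner and then reads the permissions back with fstat. The name create_private_file is hypothetical. In a real program it would be safer to pass 0600 to open in the first place, since changing the mode afterwards leaves a short window in which the file has looser permissions.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates `path` if needed and restricts it to its owner with fchmod.
 * Returns the resulting permission bits, or -1 on error.
 */
int create_private_file(const char* path) {
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0) return -1;

    if (fchmod(fd, 0600) != 0) {   /* rw------- */
        close(fd);
        return -1;
    }

    struct stat st;
    int rc = fstat(fd, &st);       /* read the mode back to confirm */
    close(fd);
    return rc == 0 ? (int)(st.st_mode & 0777) : -1;
}
```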

mode_t umask(mode_t mask);

The open syscall allows us to set arbitrary file permissions on files we create. Sometimes, we might want to set a policy for more restrictive access. For instance, we might not want files to be world-readable or group-writable by default.

The way to do this is with the process's umask, a mask that applies to the mode parameter to open. Any bit that is set in the umask will be turned off in the mode. For example, if umask is 002 (like modes, the umask is expressed as an octal number), then files will not be world-writable by default.

Any process can change its own umask, so it's not a watertight mechanism for enforcing default file permissions.
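In other words, a new file ends up with the permissions requested & ~umask. The sketch below checks this; create_with_umask is a hypothetical helper, and it assumes the directory has no default ACL, which would override the umask.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Creates `path` with the `requested` mode while `mask` is the umask.
 * Returns the permission bits the file actually got, or -1 on error.
 */
int create_with_umask(const char* path, mode_t mask, mode_t requested) {
    mode_t old = umask(mask);   /* umask returns the previous mask */

    /* The mode argument only applies to newly created files, so make
       sure the file does not exist yet. */
    unlink(path);
    int fd = open(path, O_WRONLY | O_CREAT | O_EXCL, requested);

    umask(old);                 /* restore the caller's mask */
    if (fd < 0) return -1;

    struct stat st;
    int rc = fstat(fd, &st);
    close(fd);
    return rc == 0 ? (int)(st.st_mode & 0777) : -1;
}
```

With a umask of 022, requesting 0666 yields 0644; with a umask of 002, it yields 0664.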

Syscalls: Flushing file data

int fsync(int fd);
int fdatasync(int fd);

The kernel buffers writes to disk, so when write returns, there's no guarantee that what you've written has actually made it to disk. Normally, this is OK, as the kernel will write it out eventually, but some programs (e.g., databases) need to be more careful about ensuring the persistence of data. The fsync system call forces the kernel to flush all pending writes for the given file, and does not return until the data has made it to disk.

fdatasync is like fsync but only syncs file contents, not file metadata (except for metadata that is needed to read the contents back, such as the file size).
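A typical pattern is to write, then fsync, then close, treating an error from any step as a failure. The following is a simplified sketch; write_durably is a hypothetical name, and it does not retry on EINTR.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Writes `len` bytes of `data` to `path` and returns only after fsync
 * has reported them flushed. Returns 0 on success, -1 on error.
 */
int write_durably(const char* path, const void* data, size_t len) {
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    const char* p = data;
    while (len > 0) {
        ssize_t n = write(fd, p, len);   /* write may be partial */
        if (n < 0) {
            close(fd);
            return -1;
        }
        p += n;
        len -= (size_t)n;
    }

    if (fsync(fd) != 0) {                /* blocks until the data is flushed */
        close(fd);
        return -1;
    }
    return close(fd);
}
```

For a newly created file, the fsync man page notes that the directory entry is not necessarily flushed as well; a program that needs that guarantee also has to fsync a file descriptor for the containing directory.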

Syscalls: Directories

int mkdir(const char* pathname, mode_t mode);

mkdir creates a directory with the permissions specified in mode.
Unlike open, it does not return a file descriptor.
The new directory's owner will be set to the effective user ID of the process (same as for creating regular files with open).

struct linux_dirent64 {
    /* other fields (d_ino, d_off) come first */
    unsigned short d_reclen;  /* size of this entry in bytes */
    unsigned char  d_type;    /* file type: DT_REG, DT_DIR, ... */
    char           d_name[];  /* null-terminated name; must be the last member */
};

// raw syscall
ssize_t getdents64(int fd, void* dirp, size_t count);

// libc wrapper
DIR* opendir(const char* pathname);
struct dirent* readdir(DIR* dirp);

getdents64 is the raw syscall to get the entries of a directory.
Quoting man 2 getdents: "These are not the interfaces you are interested in."
You are supposed to use opendir and readdir from libc instead, and languages other than C may not provide a getdents64 interface at all (Python only offers higher-level functions such as os.scandir, for instance).
Still, getdents64 isn't hard to understand. You pass in a file descriptor, a buffer to hold the entries, and a count (of bytes, not of entries), and it fills the buffer and returns the number of bytes read.
- It returns a byte count because struct linux_dirent64 values are not fixed in size, so if it returned the number of entries read you wouldn't know where the data in your buffer ends. Each entry's d_reclen field tells you where the next entry starts.
- At any rate, 0 is returned at the end of the directory.
The man page has a lot of gory details about struct layout, but these only apply to kernels older than Linux 2.4, which was released in 2001.
opendir and readdir are less awkward than getdents64, but have the disadvantage of only returning one entry at a time.
- struct dirent is similar to struct linux_dirent64; read the man page if you're interested.

Syscalls: Moving and deleting files

int rename(const char* oldpath, const char* newpath);

rename is used both to rename and move files. (From the kernel's standpoint, these are the same thing.)
If newpath already exists, it will be replaced atomically, meaning that there is no interval where newpath temporarily ceases to exist, and if the rename fails then newpath will be untouched.
- The move itself is not atomic, i.e. there will possibly be a time when both oldpath and newpath point to the file being renamed.
Caveat: oldpath and newpath must be on the same filesystem. Otherwise, rename fails with EXDEV, because the kernel would not be able to do the rename atomically.
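The atomic-replace guarantee is what makes the common "write a temporary file, then rename it over the destination" pattern work. A minimal sketch follows; replace_file and the ".tmp" suffix are hypothetical choices. Putting the temporary file next to the destination keeps both on the same filesystem.

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Replaces the contents of `path` with the string `data` so that readers
 * see either the old or the new contents, never a mix.
 * Returns 0 on success, -1 on error.
 */
int replace_file(const char* path, const char* data) {
    char tmp[4096];
    int n = snprintf(tmp, sizeof tmp, "%s.tmp", path);
    if (n < 0 || (size_t)n >= sizeof tmp) return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) return -1;

    size_t len = strlen(data);
    while (len > 0) {
        ssize_t w = write(fd, data, len);   /* write may be partial */
        if (w < 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        data += w;
        len -= (size_t)w;
    }

    /* Flush the new contents before making them visible under `path`. */
    if (fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }

    if (rename(tmp, path) != 0) {           /* atomically replaces `path` */
        unlink(tmp);
        return -1;
    }
    return 0;
}
```

This sketch assumes a single writer. Two processes calling it at once would share the same temporary name, so a real program would want a unique name per writer.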

int unlink(const char* pathname);

unlink removes a file.
Why is it called unlink instead of remove? Because we're just removing the link between the path and the inode. It doesn't necessarily delete the file contents, unless that path was the last reference to the file.
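An open file descriptor counts as a reference too, which the following sketch demonstrates: after unlink the path is gone, but the file can still be written and read through the descriptor. The function name unlink_while_open is hypothetical.

```c
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/*
 * Creates `path`, unlinks it while keeping it open, then writes and reads
 * through the descriptor. Returns 0 if that works, -1 otherwise.
 */
int unlink_while_open(const char* path) {
    int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
    if (fd < 0) return -1;

    if (unlink(path) != 0) {
        close(fd);
        return -1;
    }

    char buf[4];
    int ok = access(path, F_OK) != 0        /* the name is gone... */
          && write(fd, "data", 4) == 4      /* ...but the file is writable */
          && pread(fd, buf, 4, 0) == 4      /* ...and readable */
          && memcmp(buf, "data", 4) == 0;

    /* Closing drops the last reference; only now can the kernel free
       the file's contents. */
    close(fd);
    return ok ? 0 : -1;
}
```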

int rmdir(const char* pathname);

rmdir deletes an empty directory. Like the rmdir command, it will fail if the directory has any files in it.

Syscalls: File locking

const int LOCK_SH = /* */;
const int LOCK_EX = /* */;
const int LOCK_UN = /* */;
const int LOCK_NB = /* */;

int flock(int fd, int op);

flock places or removes a lock on a file.
A lock can be shared (LOCK_SH) or exclusive (LOCK_EX). A file can have either N shared locks or 1 exclusive lock at a time.
- Typically, shared locks are for readers and exclusive locks for writers.
Pass LOCK_UN as the operation to release the lock.
By default, flock will block until the lock is available. You can combine LOCK_NB with the lock options to make it non-blocking, in which case it returns immediately, failing with EWOULDBLOCK if the lock is already held.
It is an advisory lock, meaning that the kernel won't stop other processes from accessing the file unless they also try to acquire the lock.
- It's good for groups of cooperating processes that all agree to use flock.
- It's no good for protecting against uncooperative processes – use file permissions to do that.
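A non-blocking attempt to take the exclusive lock might look like the sketch below. The helper names try_lock_exclusive and release_lock are hypothetical; LOCK_EX, LOCK_NB, and LOCK_UN are defined in <sys/file.h>.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

/*
 * Tries to take an exclusive lock on `fd` without blocking.
 * Returns 1 if the lock was acquired, 0 if it is held elsewhere,
 * and -1 on any other error.
 */
int try_lock_exclusive(int fd) {
    if (flock(fd, LOCK_EX | LOCK_NB) == 0) {
        return 1;
    }
    return errno == EWOULDBLOCK ? 0 : -1;
}

/* Releases a lock taken with flock. Returns 0 on success, -1 on error. */
int release_lock(int fd) {
    return flock(fd, LOCK_UN);
}
```

flock locks belong to the open file description, so even within a single process, two descriptors obtained from separate calls to open on the same file contend with each other: if the first holds the lock, try_lock_exclusive on the second returns 0 until release_lock is called on the first.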

Final project milestone

Modify your program to create the database file with permissions locked down to the file's owner. Use file locking to ensure that get and set are synchronized. Test your solution by having the set command go to sleep for 5 seconds while holding the lock; in another terminal, ensure that get and set block until the first set command wakes up and releases the lock.

Homework exercises

(★) Do syscalls follow symlinks? What about hard links?
Solution
Most syscalls follow symlinks by default, though a few, such as unlink and lstat, operate on the link itself. As for hard links, it's misleading to speak of "following" them in the same way as symbolic links, since there's not really a link to be followed but a direct reference to the file's metadata and content.
(★) True or false: If I acquire an exclusive lock on a file, no other process will be able to write to it.
Solution
False. Locks are advisory. Other processes can ignore the lock and write to the file anyway.
(★) How is the owner of a new file or directory set?
Solution
The owner of the new file is set to the effective user ID of the process that created it. We'll learn in week 4 what exactly the "effective user ID" is.
(★★) One syscall we didn't cover in class this week is sendfile, which lets you efficiently copy bytes from one file to another. Read the man page, then use sendfile to write a simple version of cp -r. If your language doesn't expose getdents64 directly, you can use a higher-level API. Warning: Depending on your language, listing the directory may include the special . and .. entries. Make sure to filter these out! Otherwise you might recurse forever or try to copy the whole tree above your starting directory.
Solution
https://github.com/iafisher/cs644/tree/master/week3/solutions/cp.c
(★★) What happens if you rename a file while another process is writing to it? Make a prediction, then write a pair of programs to demonstrate what happens.
Solution
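Two reasonable predictions are that (a) the writes will follow the file to the new path, or (b) the writes will keep going to the old path. It turns out that (a) is correct – but only if the writing process keeps the file open, because a file descriptor refers to the inode, not to the path. If the writer closes and reopens the old path each time, it will keep writing to the old path (creating a new file there, if it opens with O_CREAT).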
(★★) Is getdents64 atomic? Write a pair of programs that demonstrates its behavior.
Solution
By "atomic", I mean whether a single call to getdents64 presents a consistent view of the directory. For example, if I call rename("foo", "bar"), getdents64 should return either foo or bar, but not both. (With a little bit of thought, you can see why it's not reasonable to expect multiple calls to getdents64 to be atomic – what if you waited 5 seconds in between?)

I wrote a program that creates files 00001 through 09999 and rapidly renames, e.g., 00001 back and forth to 10001. Another thread calls getdents64 in a loop; if it ever sees both file N and N + 10000, then the read was not atomic. I also ran the same test with readdir instead of getdents64. My program detected a non-atomic read for readdir but not for getdents64. Under the hood, it appears that readdir calls getdents64 with a fixed-size buffer of 32,768 bytes, so readdir won't be atomic on directories whose entries don't fit in that buffer. (It's a little unfair that we're comparing one call to getdents64 against multiple calls to readdir, but that's the only way to call readdir.)

ChatGPT says that the Linux VFS implementation takes a shared i_rwsem lock when reading and an exclusive lock when writing. The man page doesn't guarantee that getdents64 is atomic, though, and on some filesystems, like NFS, we can expect that it isn't, so you shouldn't rely on this behavior.
(★★) What permissions are required to rename a file? Without reading the man page, make a guess, then try to find the minimal set of permissions you need.
Solution
The basic permissions are +x on each directory in the source and destination paths (this is a general requirement for accessing any path on Linux), and +w on the ultimate source and destination directories. For a more comprehensive answer, see my blog post.
(★★) Read Dan Luu's article "Files are hard" and write a program that uses his technique to ensure file consistency.
Solution
https://github.com/iafisher/cs644/tree/master/week3/solutions/files-are-hard.c
(★★) Linux file permissions are a little more complicated than what was presented here. Research the concepts of the set-uid, set-gid, and sticky bits.
Solution
Normally, when an executable is run on Linux, it will run with the user ID and group ID of the user who started it. But if the set-uid or set-gid bit is set in the executable file's st_mode, then it will instead run as the file's owner or group, respectively. This is typically used when an executable needs root permissions but should be available to non-root users. Because they allow privilege escalation, set-uid executables are a potential security hole and must be written very carefully. See section 8.11 of Advanced Programming in the Unix Environment for an example.

The sticky bit is another bit in st_mode that changes the interpretation of directory permissions: if set, then files in the directory can only be renamed or removed by the file's owner, the directory's owner, or the superuser (instead of anyone with +w permissions). Most commonly, the sticky bit is set on /tmp so that everyone can create files but not interfere with others' files. Tony Finch's blog post goes into more detail.
(★★★) Is it possible to atomically overwrite a file? First, write a pair of programs (a reader and a writer) that shows that simply overwriting with write isn't atomic. Then, find a way to do it atomically.
Solution
We demonstrated in class with simultaneous.c that writes are not atomic. To overwrite atomically, create a temporary file, write to it, and then use rename to atomically replace the destination file. It's important to create the temporary file on the same filesystem as the destination file (e.g., in the same directory), because rename does not work across filesystems. The O_TMPFILE flag to open creates an unnamed temporary file, but because it has no path you can't pass it to rename directly: you would first have to give it a name with linkat, which refuses to replace an existing file. A named temporary file is therefore the simpler choice here.
(★★★) Linux file permissions are a lot more complicated than what was presented here. Research file ACLs and SELinux contexts. What syscalls do they use?
Solution
File ACLs let you set granular per-user and per-group permissions. See "Notes on Linux file ACLs" for details. SELinux is a set of security modules for the Linux kernel that encompasses much more than just filesystem permissions. Files can be tagged with a security context that affects how the file can be used – for instance, the system daemons may not be allowed to access files with security context user_home_t, regardless of the permission bits. See wiki/selinux for some useful troubleshooting commands.

Both file ACLs and SELinux security contexts are stored in extended attributes, which are a way to attach key-value pairs to files. The syscalls are getxattr, setxattr, etc. – read man 7 xattr for more.

Table of contents

CS644 week 3: Filesystems, part 2

Types of files

Under the hood: inodes

The classic Unix permissions model

Execute for directories

Owners and groups

Octal notation

Bonus: chmod abbreviations

Syscalls: File metadata and permissions

Syscalls: Flushing file data

Syscalls: Directories

Syscalls: Moving and deleting files

Syscalls: File locking

Final project milestone

Homework exercises

Further reading
