Ian Fisher

Week 3: Filesystems, part 2

Types of files

Linux distinguishes between several types of files:

In fact, hard links aren't really a different type of file – all regular files are effectively hard links. But normally we don't call them hard links unless we've created a second link to an existing file.

The other file types are block devices, character devices, sockets, and FIFOs. The first two represent hardware devices. The last two will be covered later in the course.

Under the hood: inodes

Within the kernel, files are represented by a data structure called an inode. The inode holds metadata about the file, and the location of the file's content on disk. (See man 7 inode for details.)

A directory is just a mapping from file names to inodes. Two names can point to the same inode (this is what a hard link is). Only when the last name pointing to an inode is removed will the underlying file contents be marked for deletion.

Other things can hold on to references to inodes, too, such as running programs. When you call open, the kernel will keep the file alive until you call close, even if another program deletes it.

So, the filesystem effectively has a reference-counting garbage collector. To extend the analogy further: symbolic links are like weak references and hard links are like strong references.

Try this:

touch test.txt
# make a hard link
ln test.txt test2.txt
ls -i test*.txt

ls -i will print the same inode number for the two files.

The classic Unix permissions model

Linux has inherited the classic file permissions model from Unix. There are three permission bits, each of which has a different meaning for regular files versus directories:

Execute for directories

The execute permission is really a misnomer for directories: it has nothing to do with execution. Unix's designers just appropriated an otherwise meaningless bit.

It seems similar to read, and indeed usually the read and execute bits for directories have the same value, but all four combinations of the two bits are possible:

Owners and groups

Every file has an owner and a group. The three permissions (read, write, execute) can be separately set for the owner, the group, and everyone else. We can write a file's permissions as a nine-character string, such as rwxr-xr-x, where the first three characters are the owner's permissions, the next three the group's, and the last three everyone else's.

Example: Suppose I'm a professor teaching a CS course, and I have a file of homework solutions. I should have read and write access to the file. My TAs, who belong to the ta group, should have read-only access. No one else should have access. Therefore, my file permissions should look like:

$ ls -l solutions.txt
-rw-r----- ian ta ... solutions.txt

(The initial dash is how ls indicates a regular file; a directory would show d instead.)

Octal notation

Instead of a nine-character string, we can represent permissions more concisely as a three-digit octal number, for example 755. To break it down:

Some common permissions are:

Bonus: chmod abbreviations

The chmod shell command understands some abbreviations:

# give the owner ('user') exec permission
$ chmod u+x myfile.txt
# remove the owner's exec permission
$ chmod u-x myfile.txt

Unfortunately they are easy to get confused:

Syscalls: File metadata and permissions

struct stat {
  mode_t st_mode;
  uid_t  st_uid;
  gid_t  st_gid;
  off_t  st_size;
  /* other fields */
};

int stat(const char* pathname, struct stat* statbuf);
int fstat(int fd, struct stat* statbuf);
int chmod(const char* pathname, mode_t mode);
int fchmod(int fd, mode_t mode);
mode_t umask(mode_t mask);

The open syscall allows us to set arbitrary file permissions on files we create. Sometimes, we might want to set a policy for more restrictive access. For instance, we might not want files to be world-readable or group-writable by default.

The way to do this is with the process's umask, a mask that applies to the mode parameter to open. Any bit that is set in the umask will be turned off in the mode. For example, if umask is 002 (like modes, the umask is expressed as an octal number), then files will not be world-writable by default.

Any process can change its own umask, so it's not a watertight mechanism for enforcing default file permissions.

Syscalls: Flushing file data

int fsync(int fd);
int fdatasync(int fd);

The kernel buffers writes to disk, so when write returns, there's no guarantee that what you've written has actually made it to disk. Normally, this is OK, as the kernel will write it out eventually, but some programs (e.g., databases) need to be more careful about ensuring the persistence of data. The fsync system call forces the kernel to flush all pending writes for the given file, and does not return until the data has made it to disk.

fdatasync is like fsync, but it skips file metadata (such as the modification time) that isn't needed to read the data back.

Syscalls: Directories

int mkdir(const char* pathname, mode_t mode);

struct linux_dirent64 {
    /* other fields */
    unsigned char d_type;
    char          d_name[];  /* flexible array member, so it must come last */
};

// raw syscall
ssize_t getdents64(int fd, void* dirp, size_t count);

// libc wrapper
DIR* opendir(const char* pathname);
struct dirent* readdir(DIR* dirp);

Syscalls: Moving and deleting files

int rename(const char* oldpath, const char* newpath);
int unlink(const char* pathname);
int rmdir(const char* pathname);

Syscalls: File locking

const int LOCK_SH = /* */;
const int LOCK_EX = /* */;
const int LOCK_UN = /* */;
const int LOCK_NB = /* */;

int flock(int fd, int op);

Final project milestone

Modify your program to create the database file with permissions locked down to the file's owner. Use file locking to ensure that get and set are synchronized. Test your solution by having the set command go to sleep for 5 seconds while holding the lock; in another terminal, ensure that get and set block until the first set command wakes up and releases the lock.

Homework exercises

  1. (★) Do syscalls follow symlinks? What about hard links?
    Solution Yes, although to be pedantic it's misleading to speak of "following" hard links in the same way as symbolic links, since there's not really a link to be followed but a direct reference to the file's metadata and content.
  2. (★) True or false: If I acquire an exclusive lock on a file, no other process will be able to write to it.
    Solution False. Locks are advisory. Other processes can ignore the lock and write to the file anyway.
  3. (★) How is the owner of a new file or directory set?
    Solution The owner of the new file is set to the effective user ID of the process that created it. We'll learn in week 4 what exactly the "effective user ID" is.
  4. (★★) One syscall we didn't cover in class this week is sendfile, which lets you efficiently copy bytes from one file to another. Read the man page, then use sendfile to write a simple version of cp -r. If your language doesn't expose getdents64 directly, you can use a higher-level API. Warning: Depending on your language, listing the directory may include the special .. entry. Make sure to filter this out! Otherwise you might try to recursively copy the whole tree.
    Solution https://github.com/iafisher/cs644/tree/master/week3/solutions/cp.c
  5. (★★) What happens if you rename a file while another process is writing to it? Make a prediction, then write a pair of programs to demonstrate what happens.
    Solution Two reasonable predictions are that (a) it will continue writing to the new path, or (b) it will start writing to the old path. It turns out that (a) is correct – but only if the writing process keeps the file open. If it closes and reopens each time, it will keep writing to the old path.
  6. (★★) Is getdents64 atomic? Write a pair of programs that demonstrates its behavior.
    Solution By "atomic", I mean whether a single call to getdents64 presents a consistent view of the directory. For example, if I call rename("foo", "bar"), getdents64 should return either foo or bar, but not both. (With a little bit of thought, you can see why it's not reasonable for multiple calls to getdents64 to be atomic – what if you waited 5 seconds in between?)

    I wrote a program that creates files 00001 through 09999 and rapidly renames, e.g., 00001 back and forth to 10001. Another thread calls getdents64 in a loop; if it ever sees both file N and N + 10000, then it was not atomic. I also ran the same test with readdir instead of getdents64. My program detected a non-atomic read for readdir but not getdents64. Under the hood, it appears that readdir is calling getdents64 with a fixed-size buffer of 32,768 bytes, so readdir won't be atomic on directories whose entries don't fit in that buffer. (It's a little unfair that we're comparing one call to getdents64 against multiple calls to readdir, but that's the only way to call readdir.)

    ChatGPT says that the Linux VFS implementation takes a shared i_rwsem lock when reading and an exclusive lock when writing. The man page doesn't guarantee that getdents64 is atomic, though, and on some filesystems like NFS we can expect that it isn't, so you shouldn't rely on this behavior.
  7. (★★) What permissions are required to rename a file? Without reading the man page, make a guess, then try to find the minimal set of permissions you need.
    Solution The basic permissions are +x on each directory in the source and destination paths (this is a general requirement for accessing any path on Linux), and +w on the ultimate source and destination directories. For a more comprehensive answer, see my blog post.
  8. (★★) Read Dan Luu's article "Files are hard" and write a program that uses his technique to ensure file consistency.
    Solution https://github.com/iafisher/cs644/tree/master/week3/solutions/files-are-hard.c
  9. (★★) Linux file permissions are a little more complicated than what was presented here. Research the concepts of the set-uid, set-gid, and sticky bits.
    Solution Normally, when an executable is run on Linux, it will run with the user ID and group ID of the user who started it. But if the set-uid or set-gid bit is set in the executable file's st_mode, then it will instead run as the file's owner or group, respectively. This is typically used when an executable needs root permissions but should be available to non-root users. Because they allow privilege escalation, set-uid executables are a potential security hole and must be written very carefully. See section 8.11 of Advanced Programming in the Unix Environment for an example.

    The sticky bit is another bit in st_mode that changes the interpretation of directory permissions: if set, then files in the directory can only be renamed or removed by the file's owner, the directory's owner, or the superuser (instead of anyone with +w permissions). Most commonly, the sticky bit is set on /tmp so that everyone can create files but not interfere with others' files. Tony Finch's blog post goes into more detail.
  10. (★★★) Is it possible to atomically overwrite a file? First, write a pair of programs (a reader and a writer) that shows that simply overwriting with write isn't atomic. Then, find a way to do it atomically.
    Solution We demonstrated in class with simultaneous.c that writes are not atomic. To overwrite atomically, create a temporary file, write to it, and then use rename to atomically replace the destination file. It's important to create the temporary file on the same filesystem as the destination file (e.g., in the same directory), because rename does not work across filesystems. You can pass the O_TMPFILE flag to open to create an unnamed temporary file, but note that an unnamed file has to be given a name (e.g., with linkat) before rename can be used on it.
  11. (★★★) Linux file permissions are a lot more complicated than what was presented here. Research file ACLs and SELinux contexts. What syscalls do they use?
    Solution File ACLs let you set granular per-user and per-group permissions. See "Notes on Linux file ACLs" for details. SELinux is a set of security modules for the Linux kernel that encompasses much more than just filesystem permissions. Files can be tagged with a security context that affects how the file can be used – for instance, the system daemons may not be allowed to access files with security context user_home_t, regardless of the permission bits. See wiki/selinux for some useful troubleshooting commands.

    Both file ACLs and SELinux security contexts use POSIX extended attributes, which are a way to store key-value attributes on files. The syscalls are getxattr, setxattr, etc. – read man 7 xattr for more.

Further reading