Week 2: Filesystems, part 1
Unix and Linux
Before we begin in earnest, a quick historical note.
Unix was a 1970s-era multi-user time-sharing operating system from Bell Labs that was hugely influential on OS design. Much of what we study in this course dates from the early days of Unix.
Unix spawned many direct and indirect descendants, among them Linux, an operating system that began in the early 1990s as a hobby project by Linus Torvalds and is now the dominant open-source general-purpose operating system, used on everything from supercomputers to cloud servers to mobile phones to microcontrollers.
When people say "Unix" today, they most often mean to evoke the common characteristics of the whole family of Unix-like operating systems, which besides Linux notably includes macOS, the BSDs, and proprietary systems like IBM's AIX and HP's HP-UX.
What is a syscall?
Syscalls (short for 'system calls') are how your program communicates with the operating system. Making a syscall is like calling a function, but instead of jumping to another point in your program, they switch out of your program entirely and into the operating system.
In code, syscalls look just like functions:
int fd = open("file.txt", O_RDONLY);
In fact, this is a function – making a syscall requires executing some processor-specific assembly instructions, so all languages, including C, wrap the raw syscall in a regular function. But that distinction is rarely important, so we'll call functions like open
'syscalls' – knowing that there is one small layer between the open
function and the actual syscall.
Most syscalls can fail – perhaps it tried to access a file that didn't exist, or that it didn't have permission to access, or the programmer passed invalid arguments. In C, syscalls signal an error by returning -1. You can then check a per-thread global variable called errno
to find the exact error; they have mnemonics like EACCES
, EINVAL
, etc. Other languages might have different error-reporting mechanisms. In Python, for instance, an OSError
exception is raised.
This course examines Linux syscalls in great detail – most weeks of the course will involve looking through the set of syscalls that deal with a certain concept, like the filesystem or process control. We do this in order to learn, not because making syscalls directly is a good idea for most programs. Most programs should use platform-independent APIs, like open
in Python or fopen
in C, whenever they are available.
The filesystem, with fresh eyes
The filesystem is a global, hierarchical, mutable store of unstructured data.
- global – There is one filesystem that all programs have access to.
- hierarchical – The filesystem is a hierarchical tree, where the interior nodes are directories and the leaf nodes are files.
- mutable – The contents of files can be changed, and the files themselves can be renamed, deleted, and moved. New files and directories can be created.
- unstructured – Regular files on Linux are just sequences of bytes. The filesystem itself does not know what they are meant to represent. It doesn't even distinguish between text files and binary files. File extensions like
.pdf
and.txt
are mnemonics for human use. The filesystem doesn't care.
This is a simplified picture that we will flesh out in the coming weeks. Notable omissions include file include file permissions, symlinks, and namespaces. Nonetheless, we can already infer a couple of consequences from this description:
- Because the filesystem is global, it can be used to share data between programs – including programs that were not specifically designed to work with each other.
- Because the filesystem is global and mutable, we have to consider the possibility of concurrency bugs and race conditions – what happens if two programs write to the same file at once?
As we shall see, the filesystem is a powerful tool, but careful programs cannot be cavalier in how they use it.
Pathnames
A file is identified by its path. /home/ian/hello.txt
is a file whose name is hello.txt
. It is inside a directory called ian
, which is inside a directory called home
, which is inside the root directory, called /
on Unix systems. Because it is rooted at /
, this is an absolute path. We can also refer to files by relative paths – if we are already in /home/ian
, then we can identify the same file as hello.txt
.
Two special paths: .
is the current directory, and ..
is the parent directory. So, ./hello.txt
is the same as hello.txt
, and ../goodbye.txt
means 'the file named goodbye.txt
in the parent of the current directory'.
Linux filesystem APIs
What operations can a program perform on the filesystem? Two of the most basic are reading and writing data from files. Before a program can read or write, it must notify the OS of its interest in the file by opening it, and when it is done, it is best to close the file.
Opening a file
int open(const char* pathname, int flags, mode_t mode);
pathname
is the path to the file you want to open (e.g.,/usr/share/cs644/bigfile.txt
).flags
control how the file should be open.O_RDONLY
to open for reading onlyO_WRONLY
to open for writing onlyO_RDWR
to open for reading and writingO_CREAT
to create if it does not existO_APPEND
to append writes to the end of the fileO_TRUNC
to truncate the file's length to 0 if it already exists- Use bitwise 'or' to pass multiple flags, e.g.
O_WRONLY | O_APPEND
.
mode
is used for setting permissions of newly-created files. It's optional unlessO_CREAT
is passed inflags
. We'll talk more about it next week.open
returns a file descriptor, an integer that identifies the open file to the OS. The file descriptor itself holds no information (they just count up from 0); all the bookkeeping is done by the OS.
Reading from a file
ssize_t read(int fd, char* buf, size_t count);
fd
is the file descriptor to read from, as returned byopen
.buf
is the pointer to the array to read into.count
is the maximum number of bytes to read. Make sure thatbuf
is at least this long!- The return value is the number of bytes read, or -1 on error. If you are at the end of file, 0 is returned.
Writing to a file
ssize_t write(int fd, const char* buf, size_t count);
fd
is the file descriptor to write to, as returned byopen
.buf
is the pointer to the array to write from.count
is the maximum number of bytes to write. Make sure thatbuf
is at least this long!- The return value is the number of bytes written, or -1 on error. Usually it will equal
count
, but not always, for instance if your disk runs out of space.
Seeking in a file
off_t lseek(int fd, off_t offset, int whence);
- The kernel keeps track of "where you are" in the file, e.g., after you read 100 bytes, the next read will start 100 bytes into the file.
lseek
lets you explicitly control the position of the cursor.- You can probably guess what
fd
is by now. offset
andwhence
together determine the behavior.- If
whence
isSEEK_SET
, thenoffset
is a fixed offset to jump to. - If
whence
isSEEK_CUR
, thenoffset
is relative to the current position. - If
whence
isSEEK_END
, thenoffset
is relative to the end of the file.
- If
- To jump to start of file:
lseek(fd, 0, SEEK_SET)
- To jump to end of file:
lseek(fd, 0, SEEK_END)
- The return value is either the new position, or -1 on error.
Closing a file
int close(int fd);
- File descriptors are not an infinite resource: the kernel sets a maximum number of open files per process. So it's a good idea to clean them up when you're done.
- Note this important caveat from the man page: "Typically, filesystems do not flush buffers when a file is closed."
fd
is the file descriptor to be closed.- There is no information to communicate back, so
close
just returns 0 on success and -1 on error.
Case study: wc
wc
is a standard Unix program that counts the number of bytes, words, and lines in a file.
open
andclose
are called inwc_file
.read
is called inwc_lines
.- Bonus: What does the
fdadvise
syscall do?
In-class exercises
- Let's take a look at the APIs that your programming languages of choice expose for making system calls on Linux.
- Use
man 2 read
to view the manual page for theread
syscall. - Write a program that reads a file in fixed-size chunks and prints the number of bytes in the file. (Next week we'll learn a more efficient way to do this.)
- Write a program that appends a line of text to a file, creating it if it does not already exist. Do it once with
O_APPEND
and once withlseek
. - Let's use
strace
to see what system calls some common Linux utilities use.
Final project milestone
This week, we'll begin work on a final project that ties together all the topics we cover in the course. What we'll build is a key-value server – a program that stores and retrieves data organized by key, like an on-disk hash map. By the end of the course, we'll have a multithreaded program with an IPC interface that can set, retrieve, and delete keys concurrently without data races.
Let's use filesystem APIs to implement the core operations of the key-value server. Your program should have two commands: get
and set
. The set
command takes a key and a value and writes it to disk, and the get
command takes a key and prints the value, if it exists. Store all data in a single file (hard-coded path is fine). Use whatever data format you want – this is not a class on data structures or algorithms, so we won't focus on making the key-value server efficient.
Homework exercises
- (★) What's the difference between a syscall and a function call?
Solution
Function calls jump between different points in your program; syscalls switch control to the operating system. - (★) How do you distinguish between an I/O error and reaching the end of the file with
read
?Solution
read
returns 0 at end of file, and a negative number on an I/O error. - (★) What flags do I pass to
open
to open a file for writing at the end?Solution
O_WRONLY
(orO_RDWR
) andO_APPEND
- (★★)
EACCES
,EEXIST
, andENOENT
are three common errors thatopen
can return. Read the description of these errors inman 2 open
, and write a program that demonstrates each of them. - (★★) Modify your program from in-class exercise 3 to count the number of whitespace characters in the file. Try it out on
/usr/share/cs644/bigfile.txt
. Experiment with different chunk sizes. How does it affect the performance of your program? (Tip: Runtime ./myprogram
to measure the running time of your program.)Solution
There are 1,650,564 whitespace characters in the file. Here's a program to measure it. Unsurprisingly, increasing the buffer size makes the program faster. My program took 7,500 ms with a buffer of 1, but only 70–80 ms with a buffer of 1,000. Past around 10,000 bytes, making the buffer bigger did not reliably make it faster, probably because performance became dominated by actual I/O rather than syscall overhead. - (★★) Why does
read
return the number of bytes read? Why doesn't it just setbuf
to a null-terminated string, like other C functions?Solution
Because files in Linux can hold arbitrary bytes, including the null byte. Ifread
madebuf
null-terminated, the caller could not distinguish the null terminator from a null byte read from the file. - (★★) If you call
write
, uselseek
to rewind, and callread
again, are you guaranteed to see the data you just wrote? Find the place in the man pages that describes Linux's behavior. Write a program to demonstrate it.Solution
man 2 write
says: "POSIX requires that aread(2)
that can be proved to occur after awrite()
has returned will return the new data. Note that not all filesystems are POSIX conforming." Demonstrating program: https://github.com/iafisher/cs644/tree/master/week2/solutions/read-after-write.c - (★★★) Modify your program from exercise 3 to read a file line-by-line.
- (★★★) What happens when one program is reading from a file while another program is writing? Formulate a hypothesis, then write a pair of programs to test it.
Solution
- Some plausible hypotheses:
- If a program tries to
read
while another program is in the middle of awrite
, or vice-versa, the syscall will return with an error. - The OS will allow simultaneous access to a file, but writes will be atomic, so a
read
will never observe the partial effect of a write. - The OS will allow simultaneous access to a file, and writes will not be atomic, so a
read
could observe a partial write.
- If a program tries to
- This program shows that it's the third possibility: there's no synchronization between reads and writes of different programs. Even a write as small as 10 bytes is not atomic. In week 3, we'll learn how we can explicitly synchronize access.
- Some plausible hypotheses: