Week 2: Filesystems, part 1
What is a syscall?
A system call or syscall is how your program communicates with the operating system. Syscalls look like function calls, but instead of jumping to another point in your program, they switch out of your program entirely and into the operating system.
Usually programming languages wrap system calls with higher-level APIs, for portability (system calls are OS-specific) and convenience. However, in this course we will be making syscalls directly[1] because we want to understand exactly what we are asking the OS to do.
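To make "directly" a little more concrete, here is a minimal sketch, assuming Linux with glibc. It sends the same request to the OS twice: once through the write wrapper from unistd.h, and once through glibc's generic syscall() function with the raw syscall number SYS_write. The helper names write_with_wrapper and write_with_raw_syscall are made up for illustration.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ask the OS to write bytes using the libc wrapper,
 * which has the same interface as the underlying syscall. */
ssize_t write_with_wrapper(int fd, const char *msg, size_t len) {
    return write(fd, msg, len);
}

/* Hypothetical helper: make the same request through glibc's generic
 * syscall() entry point, naming the syscall by its number. */
ssize_t write_with_raw_syscall(int fd, const char *msg, size_t len) {
    return syscall(SYS_write, fd, msg, len);
}
```

Calling either helper with file descriptor 1 (standard output) prints the message to the terminal. In the rest of the course, the first style is the one we mean by "making syscalls directly".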
Syscall error handling
If the syscall fails (because of invalid arguments, inadequate permissions, etc.), the kernel returns a negative integer that identifies the specific problem. Error codes have descriptive names like EACCES, EINVAL, and EBUSY, but the exact meaning depends on the syscall.
Languages surface these errors differently. The C wrapper functions return -1 and store the error code in a per-thread variable called errno. Python raises an OSError.
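As a rough sketch of what error handling looks like in C, the hypothetical helper below (try_open is not a real API) tries to open a file and reports why it failed. It uses open and close, which are introduced later in these notes, plus the standard strerror function to turn the error code into a readable message.

```c
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: try to open `path` for reading.
 * Returns 0 on success, or the errno value describing the failure. */
int try_open(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) {
        /* Save errno right away: later library calls may overwrite it. */
        int saved = errno;
        fprintf(stderr, "open(%s) failed: %s\n", path, strerror(saved));
        return saved;
    }
    close(fd);
    return 0;
}
```

For a path that does not exist, try_open returns ENOENT and prints a message ending in "No such file or directory".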
Linux filesystem APIs
Today, we're going to learn the basic APIs for reading and writing files in Linux.
Opening a file
int open(const char* pathname, int flags, mode_t mode);
- pathname is the path to the file you want to open (e.g., /usr/share/cs644/bigfile.txt).
- flags control how the file should be opened. Combine several flags with bitwise OR (|).
  - O_RDONLY to open for reading only
  - O_WRONLY to open for writing only
  - O_RDWR to open for reading and writing
  - O_CREAT to create the file if it does not exist
  - O_APPEND to append writes to the end of the file
  - O_TRUNC to truncate the file's length to 0 if it already exists
- mode is used for setting permissions of newly-created files. It's optional unless O_CREAT is passed in flags. We'll talk more about it next week.
- open returns a file descriptor, an integer that identifies the open file to the OS. The file descriptor itself holds no information (descriptors are just small integers that count up from 0); all the bookkeeping is done by the OS.
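Here is a small sketch of two typical open calls. The helper name create_then_reopen is hypothetical, and the mode 0644 (owner can read and write, everyone else can read) is just an assumed, commonly used value; permissions are covered next week.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical demo: create (or empty) the file at `path`, then reopen it
 * for reading. Returns 0 on success, -1 if either open fails. */
int create_then_reopen(const char *path) {
    /* O_CREAT is present, so the third argument (mode) is required. */
    int wfd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (wfd == -1) return -1;
    close(wfd);

    /* No O_CREAT here, so mode is omitted. */
    int rfd = open(path, O_RDONLY);
    if (rfd == -1) return -1;
    close(rfd);
    return 0;
}
```

Because of O_TRUNC, calling this on an existing file discards its contents, so try it on a scratch file.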
Reading from a file
ssize_t read(int fd, void* buf, size_t count);
- fd is the file descriptor to read from, as returned by open.
- buf is the pointer to the array to read into.
- count is the maximum number of bytes to read. Make sure that buf is at least this long!
- The return value is the number of bytes read, or -1 on error. If you are at the end of file, 0 is returned.
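Since count is only a maximum, a single read may hand back fewer bytes than you asked for. One way to cope, sketched below under the assumption that you want to fill a buffer as far as possible, is to call read in a loop until the buffer is full, the file ends, or an error occurs. The name read_fully is hypothetical.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: keep calling read until `count` bytes have arrived,
 * end of file is reached, or an error occurs.
 * Returns the number of bytes stored in buf, or -1 on error. */
ssize_t read_fully(int fd, char *buf, size_t count) {
    size_t total = 0;
    while (total < count) {
        ssize_t n = read(fd, buf + total, count - total);
        if (n == -1) return -1;  /* error: errno says why */
        if (n == 0) break;       /* end of file: nothing more to read */
        total += (size_t)n;      /* got n bytes; ask for the rest */
    }
    return (ssize_t)total;
}
```

Note how the three possible results of read (a positive count, 0, and -1) each get their own branch.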
Writing to a file
ssize_t write(int fd, const void* buf, size_t count);
- fd is the file descriptor to write to, as returned by open.
- buf is the pointer to the array to write from.
- count is the maximum number of bytes to write. Make sure that buf is at least this long!
- The return value is the number of bytes written, or -1 on error. Usually it will equal count, but not always, for instance if your disk runs out of space.
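Because write can return less than count, code that must get every byte out usually retries with the remainder. The sketch below shows one way to do that; write_all is a hypothetical name, not a standard function.

```c
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: call write repeatedly until all `count` bytes of buf
 * have been written. Returns count on success, or -1 on error. */
ssize_t write_all(int fd, const char *buf, size_t count) {
    size_t total = 0;
    while (total < count) {
        ssize_t n = write(fd, buf + total, count - total);
        if (n == -1) return -1;  /* error: errno says why (e.g., ENOSPC) */
        total += (size_t)n;      /* partial write: continue after the last byte written */
    }
    return (ssize_t)total;
}
```

If the disk fills up partway through, the first call typically reports a short count and the retry then fails with -1, which this helper passes back to the caller.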
Seeking in a file
off_t lseek(int fd, off_t offset, int whence);
- The kernel keeps track of "where you are" in the file, e.g., after you read 100 bytes, the next read will start 100 bytes into the file. lseek lets you explicitly control the position.
- You can probably guess what fd is by now.
- offset and whence together determine the behavior.
  - If whence is SEEK_SET, then offset is a fixed offset to jump to.
  - If whence is SEEK_CUR, then offset is relative to the current position.
  - If whence is SEEK_END, then offset is relative to the end of the file.
  - To jump to start of file: lseek(fd, 0, SEEK_SET)
  - To jump to end of file: lseek(fd, 0, SEEK_END)
- The return value is either the new position, or -1 on error.
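As an illustration of combining SEEK_END and SEEK_SET, here is a sketch that reads the last few bytes of a file. The helper read_tail is hypothetical. It first seeks to the end to learn the file's length (the return value of lseek), then seeks to an absolute position near the end and reads from there.

```c
#include <fcntl.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: copy up to the last `count` bytes of the file at
 * `path` into buf. Returns the number of bytes read, or -1 on error. */
ssize_t read_tail(const char *path, char *buf, size_t count) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;

    /* Jump to the end; the new position equals the file's size. */
    off_t size = lseek(fd, 0, SEEK_END);
    if (size == -1) { close(fd); return -1; }

    /* Start `count` bytes before the end, or at the start for short files. */
    off_t start = size > (off_t)count ? size - (off_t)count : 0;
    if (lseek(fd, start, SEEK_SET) == -1) { close(fd); return -1; }

    ssize_t n = read(fd, buf, count);
    close(fd);
    return n;
}
```

For a file containing 0123456789, asking for the last 4 bytes fills buf with 6789.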
Closing a file
int close(int fd);
- File descriptors are not an infinite resource: the kernel sets a maximum number of open files per process. So it's a good idea to clean them up when you're done.
- Note this important caveat from the man page: "Typically, filesystems do not flush buffers when a file is closed."
- fd is the file descriptor to be closed.
- There's no information to communicate back, so close just returns 0 on success and -1 on error.
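One consequence worth seeing for yourself: once a descriptor is closed, the number no longer refers to anything, and using it is an error. The sketch below (closed_fd_is_invalid is a hypothetical name, and it assumes a single-threaded program so that nothing reuses the number in between) checks that a read on a closed descriptor fails with EBADF.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: open and immediately close `path`, then try to use
 * the stale descriptor. Returns 1 if the read fails with EBADF, 0 if it
 * does not, and -1 if open or close itself failed. */
int closed_fd_is_invalid(const char *path) {
    int fd = open(path, O_RDONLY);
    if (fd == -1) return -1;
    if (close(fd) == -1) return -1;

    char byte;
    ssize_t n = read(fd, &byte, 1);  /* fd no longer refers to an open file */
    return (n == -1 && errno == EBADF) ? 1 : 0;
}
```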
In-class exercises
- Let's take a look at the APIs that your programming languages of choice expose for making system calls on Linux.
- Use man 2 read to view the manual page for the read syscall.
- Write a program that reads a file in fixed-size chunks and prints the number of bytes in the file. (Next week we'll learn a more efficient way to do this.)
- Write a program that appends a line of text to a file, creating it if it does not already exist. Do it once with O_APPEND and once with lseek.
- Let's use strace to see what system calls some common Linux utilities use.
Homework exercises
- (★) What's the difference between a syscall and a function call?
  Solution
  Function calls jump between different points in your program; syscalls switch control to the operating system.
- (★) How do you distinguish between an I/O error and reaching the end of the file with read?
  Solution
  read returns 0 at end of file, and a negative number on an I/O error.
- (★) What flags do I pass to open to open a file for writing at the end?
  Solution
  O_WRONLY (or O_RDWR) and O_APPEND
- (★★) Final project (database): The very first version of your database simply stores key-value pairs to disk. Your program should have two commands: get and set. The set command takes a key and a value and writes the pair to disk, and the get command takes a key and prints the value, if it exists. You should store all data in a single file (it's okay to hard-code the path, since users shouldn't look at the file directly). Use whatever data format you want. It's okay to make assumptions about the data if it simplifies your program (e.g., that it doesn't contain the | character, so you can use that as a delimiter).
- (★★) Final project (web server): Web servers commonly log some details about incoming requests to a file. We're not ready to handle network requests, so this week we'll just do the logging. Your program should have two commands: run and count. The run command will append a line to a log file and exit. The count command should read the log file and print a count of the number of lines. You can format the log lines however you like, though generally they begin with a timestamp and include a descriptive message.
- (★★) EACCES, EEXIST, and ENOENT are three common errors that open can return. Read the description of these errors in man 2 open, and write a program that demonstrates each of them.
- (★★) Modify your program from in-class exercise 3 to count the number of whitespace characters in the file. Try it out on /usr/share/cs644/bigfile.txt. Experiment with different chunk sizes. How does the chunk size affect the performance of your program? (Tip: Run time ./myprogram to measure the running time of your program.)
  Solution
  There are 1,650,564 whitespace characters in the file. Here's a program to measure it. Unsurprisingly, increasing the buffer size makes the program faster. My program took 7,500 ms with a buffer of 1, but only 70–80 ms with a buffer of 1,000. Past around 10,000 bytes, making the buffer bigger did not reliably make it faster, probably because performance became dominated by actual I/O rather than syscall overhead.
- (★★) Modify your program from exercise 3 to read a file line-by-line.
- (★★) Why does read return the number of bytes read? Why doesn't it just set buf to a null-terminated string, like other C functions?
  Solution
  Because files in Linux can hold arbitrary bytes, including the null byte. If read made buf null-terminated, the caller could not distinguish the null terminator from a null byte read from the file.
- (★★) If you call write, use lseek to rewind, and then call read, are you guaranteed to see the data you just wrote? Find the place in the man pages that describes Linux's behavior. Write a program to demonstrate it.
  Solution
  man 2 write says: "POSIX requires that a read(2) that can be proved to occur after a write() has returned will return the new data. Note that not all filesystems are POSIX conforming." Demonstrating program: https://github.com/iafisher/cs644/tree/master/archive/2025-001-spring/week2/solutions/read-after-write.c
- (★★★) Find the location in the Linux kernel source code where a process's table of file descriptors is declared.
  Solution
  The field is struct files_struct *files in struct task_struct (include/linux/sched.h). struct files_struct is defined here, and the actual file representation, struct file, is defined here.
- (★★★) What happens when one program is reading from a file while another program is writing? Formulate a hypothesis, then write a pair of programs to test it.
  Solution
  - Some plausible hypotheses:
    - If a program tries to read while another program is in the middle of a write, or vice-versa, the syscall will return with an error.
    - The OS will allow simultaneous access to a file, but writes will be atomic, so a read will never observe the partial effect of a write.
    - The OS will allow simultaneous access to a file, and writes will not be atomic, so a read could observe a partial write.
  - This program shows that it's the third possibility: there's no synchronization between reads and writes of different programs. Even a write as small as 100 bytes is not atomic. In week 3, we'll learn how we can explicitly synchronize access.
[1] OK, there's still going to be a wrapper function in between your Python/Rust/Go/whatever program and the actual syscall (this is true even for C). But we're going to be using the wrapper function with the same interface as the real syscall, instead of a higher-level API with a different interface.