Ian Fisher

Week 4: Process control

How a process is run

A program is a file on disk containing executable code. The file, which is typically in the ELF format on Linux, has machine code encoded in binary, as well as static data such as string literals.

The CPU can't run code directly from disk, though, so before a program is run the OS has to load it into memory. A program that is currently executing on a CPU (or waiting to execute) is called a process. A computer often has more processes running than physical CPU cores; the scheduler, a part of the kernel, decides which processes to run and when. A process is allowed to run for a limited time slice. Control returns to the OS when the slice expires (or earlier, if the process makes a syscall), and the scheduler decides what to do next.

Every process has a parent (the process that spawned it), so the set of running processes on a system is organized as a tree. The process at the root of the tree is a special system process, traditionally called init but nowadays most often systemd. It is not the kernel, but rather the first process that the kernel spawns in userspace.

PIDs and UIDs

Processes are identified by an integer called a PID. A process's PID is unique while it is running, but may eventually be reused. Two syscalls, getpid and getppid, let you discover your own and your parent's PIDs. Unusually for syscalls, they cannot return an error.

pid_t getpid(void);
pid_t getppid(void);

A process executes as a particular user (and group). In fact, it has both a real UID and an effective UID (and likewise for GIDs):

uid_t getuid(void);
gid_t getgid(void);

uid_t geteuid(void);
gid_t getegid(void);

The real UID is the "actual" user, while the effective user is the user for purposes of access control. Usually they are the same, but not always: if you run sudo sleep 30 & and then ps -eo pid,euser,ruser, you will see that the process's real UID is your UID, but its effective UID is root.

The way this works is that the file /usr/bin/sudo has a special bit set called the set-uid bit. When a program with the set-uid bit is executed, its effective UID is set to the owner of the file rather than the user who launched it. There's an analogous bit for GIDs called the set-gid bit.

The set-uid and set-gid bits are a mechanism for controlled privilege escalation. A set-uid binary lets users assume elevated privileges in a controlled fashion, and for this reason set-uid programs must be written very carefully to avoid granting unintentional privileges. It makes sense that sudo is a set-uid binary as privilege escalation is the essence of what it does. Historically, ping was the classic (and counterintuitive) example of a set-uid binary: despite its innocuous function, it needed privileged access to low-level network interfaces. On our Linux box, though, ping is just a regular binary.

Launching processes

Spawning a child process is accomplished by a pair of syscalls: fork and execve.

pid_t fork(void);

fork is the syscall that creates a new child process.

It is a very ancient Unix syscall with a clever design: when it returns, it returns into both processes. In the child process it returns 0, and in the parent process it returns the child's PID. (No real process has PID 0; the init process is assigned PID 1 and all other processes have a higher PID.)

You can imagine that the original process has split and cloned itself: the child process is running the same program, with a complete copy of the variables, call stack, etc. of the original process.

Normally you'll see:

pid_t pid = fork();
if (pid < 0) {
    // error
} else if (pid == 0) {
    // child
} else {
    // parent
}

It's a little hard to wrap your head around.

A child process copies a lot of state from the parent. Notably:

Regardless of whether the original process was multi-threaded, the child process will have only one thread.

Most of the time, you fork because you want to run a different program (e.g., invoke a shell command) – if you actually want to run two copies of the same program, you're probably better off using multithreading, which we'll cover later in the course.

To execute a different program, call execve in the child process:

int execve(const char* pathname, char* const argv[], char* const envp[]);

execve replaces the current program with the one at pathname, passing it the arguments in argv and the environment variables in envp. It does not create a new process – that's what fork did.

A few sharp edges:

So a full invocation looks like:

char* pathname = "/usr/bin/echo";
char* argv[] = {pathname, "hello", "world", NULL};
execve(pathname, argv, environ);

where environ is a global variable provided by the C library (declare it yourself with extern char** environ;) that holds the current process's environment.

If execve succeeds, it will never return – the old program is "switched out" with the new one.

Why such a confusing design?

fork and exec is a very confusing way to launch processes. Why not just launch_process("prog", "arg1", "arg2")?

In fact, there is an API like this, called posix_spawn. But looking at the function signature gives a clue to why fork and exec persist:

int posix_spawn(pid_t *restrict pid, const char *restrict path,
                const posix_spawn_file_actions_t *restrict file_actions,
                const posix_spawnattr_t *restrict attrp,
                char *const argv[restrict],
                char *const envp[restrict]);

That's a lot of arguments, and initializing the file_actions and attrp arguments takes even more lines of code.

The bottom line is that there are potentially many things you may want to do in a child process before calling exec: opening or closing file descriptors, changing the working directory, setting the signal mask, etc. With fork, you can run whatever code you want before calling exec, and the APIs stay simple. But if you want a single API like posix_spawn, then you have to support all the different ways someone might want to customize their child process.

Waiting for processes

Usually, you'll want a way for the parent to communicate with its child. We'll cover a number of ways to do so next week; today, we'll just look at waitpid, which is how a parent waits for its child to finish.

pid_t waitpid(pid_t pid, int* wstatus, int options);

pid can be -1 to wait for any child, or a positive number to wait for a specific child. There are a few more possibilities documented in man 2 waitpid. The main useful value for the options parameter is WNOHANG, if you just want to check if a process has exited but don't want to block on waiting for it. Some information will be copied into wstatus, such as the child's exit status.

When a child exits, the kernel sends a SIGCHLD signal to the parent process – we'll talk more about signals later in the course.

If a parent exits before its child, the child is reparented to the process with PID 1 (the init process). After a child exits but before its parent calls waitpid, the child is what's called a zombie process: the OS still has to maintain some metadata about it, such as its exit status. So it's important to "reap" child processes by calling waitpid, which frees up these resources.

The last part of a process's lifecycle is exiting:

void _exit(int status);

Calling _exit terminates the process immediately, and registers a status code that can later be retrieved by the parent process when it calls waitpid. Traditionally, an exit code of 0 indicates success and any non-zero value indicates failure.

You more likely want to use the C library function exit() (no leading underscore), which calls exit handlers, flushes output buffers, etc.

In-class exercises

  1. fork and execve can be confusing. Let's write an example program to understand how they work.
    Solution https://github.com/iafisher/cs644/blob/master/week4/forkexec.c

Final project milestone

Implement a size command that prints the size of the database file. Rather than the stat syscall, let's use the standard Unix wc program, which can print the bytes, words, and lines in the file. Use fork and exec to spawn wc as a subprocess. Make sure to wait for wc to finish in the parent process.

Homework exercises

  1. (★) What UID is used to check file permissions?
    Solution The effective UID. Usually it is the same as the real UID – unless the program is a set-uid file.
  2. (★) How does the caller of fork know if they are the child or parent process?
    Solution fork returns 0 in the child process, and the child's PID (which is always greater than 0) in the parent process.
  3. (★) What is a zombie process?
    Solution A zombie process is a process that has exited but whose parent has not yet called waitpid. While a zombie, the OS must keep some resources allocated to the dead process.
  4. (★★) How can a parent process view its child's resource usage (e.g., CPU time) after it exits? Find the relevant syscall, and use it to write your own version of the time command.
    Solution The syscall you want is wait4. Example program: https://github.com/iafisher/cs644/blob/master/week4/solutions/mytime.c
  5. (★★) fork is the classic Unix system call, but Linux also offers something called clone. Read man 2 clone. What are the differences from fork?
    Solution clone gives many more options controlling how the child is created, such as whether it shares the same address space with the parent. Notably, clone is the low-level syscall used to implement threads.
  6. (★★) The C standard library offers several wrappers around execve. Read man 3 exec and implement the wrappers in your language of choice.
    Solution https://github.com/iafisher/cs644/blob/master/week4/solutions/execs.c
  7. (★★) Many programming languages have a high-level way to run a child process, such as Python's subprocess.run. Write a simple program to demonstrate it, then use strace to determine what syscalls it makes. (The -f flag lets you see syscalls in child processes as well.)
    Solution Running strace on the Python program import subprocess; subprocess.run(["echo", "hello", "world"]) reveals that it does vfork and then execve. In fact, we can see that it tries execve with various paths before it finds echo at /usr/bin/echo.
  8. (★★★) Think through what happens in terms of process relationships and UIDs when you make an SSH connection to a remote server. How does a process running on your laptop (ssh) "transform" itself into a process running on the remote server (bash)? How does it end up with the correct UID?
    Solution To begin with, there is an sshd process running on the server as root. When you start ssh on your laptop, it makes a network connection to the server. sshd authenticates, then forks a child process on the server and uses its root privileges to set its effective UID to whatever user was authenticated. This child process then forks its own child to run the user's login shell. It forwards all the output from the child process over the network to the ssh process on your laptop, which in turn sends over any input it receives. So, while it seems that the bash process on the server is running in your terminal, this is an illusion: what is running is the ssh process on your laptop.
  9. (★★★) It's important to follow execve with a call to _exit. What could go wrong if you don't?
    Solution I was writing a little jobserver program in Python. It acquired a lockfile, then looped forever, spawning jobs with fork and execve according to its schedule. Python's os.execve raises an exception if it fails, which seems like the right thing to do... except that it means the child process will run cleanup code from the parent program. One of the cleanup functions was to remove the lockfile. So the bug manifested as the lockfile mysteriously disappearing when the jobserver was running. The fix was to catch any exceptions and call os._exit to exit immediately without doing any clean-up.
  10. (★★★) Read this post on fork. Do you agree with the author?