Week 4: Process control
How a process is run
A program is a file on disk containing executable code. The file, which is typically in the ELF format on Linux, has machine code encoded in binary, as well as static data such as string literals.
The CPU can't run code directly from disk, though, so before a program is run the OS has to load it into memory. A program that is currently executing on a CPU (or waiting to execute) is called a process. A computer often has more processes running than physical CPU cores; the scheduler, a part of the kernel, decides what processes to run and when. A process is allowed to run for a fixed time slice, and then control returns to the OS (or maybe earlier, if the process made a syscall) and the scheduler decides what to do next.
Every process has a parent (the process that spawned it), so the set of running processes on a system is organized as a tree. The process at the root of the tree is a special system process, traditionally called init but nowadays most often systemd. It is not the kernel, but rather the first process that the kernel spawns in userspace.
PIDs and UIDs
Processes are identified by an integer called a PID. A process's PID is unique while it is running, but may eventually be reused. Two syscalls, getpid and getppid, let you discover your own and your parent's PIDs. Unusually for syscalls, they cannot return an error.
pid_t getpid(void);
pid_t getppid(void);
A process executes as a particular user (and group). In fact, it has both a real UID and an effective UID (and likewise for GIDs):
uid_t getuid(void);
gid_t getgid(void);
uid_t geteuid(void);
gid_t getegid(void);
The real UID is the "actual" user, while the effective UID is the one used for access control. Usually they are the same, but not always: if you run sudo sleep 30 & and then ps -eo pid,euser,ruser, you will see that the process's real UID is your UID, but its effective UID is root.
The way this works is that the file /usr/bin/sudo has a special bit set called the set-uid bit. When a program with the set-uid bit is executed, its effective UID is set to the owner of the file rather than the user who launched it. There's an analogous bit for GIDs called the set-gid bit.
The set-uid and set-gid bits are a mechanism for controlled privilege escalation. A set-uid binary lets users assume elevated privileges in a controlled fashion, and for this reason set-uid programs must be written very carefully to avoid granting unintentional privileges. It makes sense that sudo is a set-uid binary as privilege escalation is the essence of what it does. Historically, ping was the classic (and counterintuitive) example of a set-uid binary: despite its innocuous function, it needed privileged access to low-level network interfaces. On our Linux box, though, ping is just a regular binary.
Launching processes
Spawning a child process is accomplished by a pair of syscalls: fork and execve.
pid_t fork(void);
fork is the syscall that creates a new child process.
It is a very ancient Unix syscall with a clever design: when it returns, it returns into both processes. In the child process it returns 0, and in the parent process it returns the child's PID. (No real process has PID 0; the init process is assigned PID 1 and all other processes have a higher PID.)
You can imagine that the original process has split and cloned itself: the child process is running the same program, with a complete copy of the variables, call stack, etc. of the original process.
Normally you'll see:
pid_t pid = fork();
if (pid < 0) {
    // error
} else if (pid == 0) {
    // child
} else {
    // parent
}
It's a little hard to wrap your head around.
A child process copies a lot of state from the parent. Notably:
- All open file descriptors
- Environment variables
- Signal handlers (covered in a later week)
Regardless of whether the original process was multi-threaded, the child process will have only one thread: a copy of the thread that called fork.
Most of the time, you fork because you want to run a different program (e.g., invoke a shell command) – if you actually want to run two copies of the same program, you're probably better off using multithreading, which we'll cover later in the course.
To execute a different program, call execve in the child process:
int execve(const char* pathname, char* const argv[], char* const envp[]);
execve replaces the current program with the one at pathname, passing it the arguments in argv and the environment variables in envp. It does not create a new process – that's what fork did.
A few sharp edges:
- pathname must be a path to the executable file, either absolute or relative to the current working directory; execve won't search PATH for you.
- By convention, pathname should be duplicated as the first element of argv.
- argv and envp must both have a null pointer as their last element, to mark the end of the list.
So a full invocation looks like:
char* pathname = "/usr/bin/echo";
char* argv[] = {pathname, "hello", "world", NULL};
execve(pathname, argv, environ);
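where environ is a global variable that holds the current process's environment. It is defined by the C library, but a program should declare it itself with extern char** environ; before using it.
If execve succeeds, it will never return – the old program is "switched out" with the new one. If it fails, it returns -1 and sets errno.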
Why such a confusing design?
The fork and exec pair is a very confusing way to launch processes. Why not just launch_process("prog", "arg1", "arg2")?
In fact, there is an API like this, called posix_spawn. But looking at the function signature gives a clue to why fork and exec persist:
int posix_spawn(pid_t *restrict pid, const char *restrict path,
const posix_spawn_file_actions_t *restrict file_actions,
const posix_spawnattr_t *restrict attrp,
char *const argv[restrict],
char *const envp[restrict]);
That's a lot of arguments, and initializing the file_actions and attrp arguments takes even more lines of code.
The bottom line is that there are potentially many things you may want to do in a child process before calling exec: opening or closing file descriptors, changing the working directory, setting the signal mask, etc. With fork, you can run whatever code you want before calling exec, and the APIs stay simple. But if you want a single API like posix_spawn, then you have to support all the different ways someone might want to customize their child process.
Waiting for processes
Usually, you'll want a way for the parent to communicate with its child. We'll cover a number of ways to do so next week; today, we'll just look at waitpid, which is how a parent waits for its child to finish.
pid_t waitpid(pid_t pid, int* wstatus, int options);
pid can be -1 to wait for any child, or a positive number to wait for a specific child. There are a few more possibilities documented in man 2 waitpid. options can be 0 for the default blocking behavior; the main useful flag is WNOHANG, which makes waitpid return 0 immediately if the child has not exited yet instead of blocking. Some information will be copied into wstatus, such as the child's exit status, which you decode with macros like WIFEXITED and WEXITSTATUS.
When a child exits, the kernel sends a SIGCHLD signal to the parent process – we'll talk more about signals later in the course.
If a parent exits before its child, the child process is reassigned to the process with PID 1 as its parent (a system service called init). After a child exits but before its parent calls waitpid, the process is what's called a zombie process: the OS still has to maintain some metadata about it, such as its exit status, so that the parent can retrieve it later. So it's important to "reap" child processes with waitpid to free up these resources.
The last part of a process's lifecycle is exiting:
void _exit(int status);
Calling _exit terminates the process immediately, and registers a status code that can later be retrieved by the parent process when it calls waitpid. Traditionally, an exit code of 0 indicates success and any non-zero value indicates failure.
More likely, you want to use the C library function exit() (no leading underscore), which calls exit handlers, flushes output buffers, etc.
In-class exercises
fork and execve can be confusing. Let's write an example program to understand how they work.
Final project milestone
Implement a size command that prints the size of the database file. Rather than the stat syscall, let's use the standard Unix wc program, which can print the bytes, words, and lines in the file. Use fork and exec to spawn wc as a subprocess. Make sure to wait for wc to finish in the parent process.
Homework exercises
- (★) What UID is used to check file permissions?
Solution
The effective UID. Usually it is the same as the real UID – unless the program is a set-uid file.
- (★) How does the caller of fork know if they are the child or parent process?
Solution
fork returns 0 in the child process, and the child's PID (which is always greater than 0) in the parent process.
- (★) What is a zombie process?
Solution
A zombie process is a process that has exited but whose parent has not yet called waitpid. While a zombie, the OS must keep some resources allocated to the dead process.
- (★★) How can a parent process view its child's resource usage (e.g., CPU time) after it exits? Find the relevant syscall, and use it to write your own version of the time command.
Solution
The syscall you want is wait4. Example program: https://github.com/iafisher/cs644/blob/master/week4/solutions/mytime.c
- (★★) fork is the classic Unix system call, but Linux also offers something called clone. Read man 2 clone. What are the differences from fork?
Solution
clone gives many more options controlling how the child is created, such as whether it shares the same address space with the parent. Notably, clone is the low-level syscall used to implement threads.
- (★★) The C standard library offers several wrappers around execve. Read man 3 exec and implement the wrappers in your language of choice.
- (★★) Many programming languages have a high-level way to run a child process, such as Python's subprocess.run. Write a simple program to demonstrate it, then use strace to determine what syscalls it makes. (The -f flag lets you see syscalls in child processes as well.)
Solution
Running strace on the Python program import subprocess; subprocess.run(["echo", "hello", "world"]) reveals that it does vfork and then execve. In fact, we can see that it tries execve with various paths before it finds echo at /usr/bin/echo.
- (★★★) Think through what happens in terms of process relationships and UIDs when you make an SSH connection to a remote server. How does a process running on your laptop (ssh) "transform" itself into a process running on the remote server (bash)? How does it end up with the correct UID?
Solution
To begin with, there is an sshd process running on the server as root. When you start ssh on your laptop, it makes a network connection to the server. sshd authenticates, then forks a child process on the server and uses its root privileges to set its effective UID to whatever user was authenticated. This child process then forks its own child to run the user's login shell. It forwards all the output from the child process over the network to the ssh process on your laptop, which in turn sends over any input it receives. So, while it seems that the bash process on the server is running in your terminal, this is an illusion: what is running is the ssh process on your laptop.
- (★★★) It's important to follow execve with a call to _exit. What could go wrong if you don't?
Solution
I was writing a little jobserver program in Python. It acquired a lockfile, then looped forever, spawning jobs with fork and execve according to its schedule. Python's os.execve raises an exception if it fails, which seems like the right thing to do... except that it means the child process will run cleanup code from the parent program. One of the cleanup functions was to remove the lockfile. So the bug manifested as the lockfile mysteriously disappearing when the jobserver was running. The fix was to catch any exceptions and call os._exit to exit immediately without doing any clean-up.
- (★★★) Read this post on fork. Do you agree with the author?