I wanted a system that could run regularly scheduled jobs on my laptop and home server, with a friendlier interface than cron or systemd. So I decided to write my own jobserver. This is perhaps where I erred first.
The jobserver is a daemon process that runs under launchd (macOS) or systemd (Linux). There is a client program to list scheduled jobs, add and remove jobs, etc. I sometimes call the daemon process the "server" process, though in truth the client and server don't directly talk to each other: as an implementation shortcut, the client just reads the daemon's state file on disk directly. The daemon and client are both written in Python.
λ kg jobs list
job     next run            last run         status  time
backup  tomorrow at 4:35am  today at 4:47am  0       59.1s
Jobs are configured in JSON files:
{
  "jobs": [
    {
      "name": "backup",
      "cmd": ["backup", "b2", "save"],
      "start_at": "4:35am",
      "machines": ["laptop"],
      "extra_path": ["/opt/homebrew/bin"]
    }
  ]
}
The daemon arranges for the job's standard output, standard error, and logging to all go to a single file, which can be quickly viewed with kg logs. If a job fails, the daemon sends me an email.
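One way to get that single-file behaviour is to point the child's stdout and stderr at the log file before exec; here is a sketch (the log path and flags are illustrative, not the daemon's exact code):

import os

log_fd = os.open("job.log", os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
os.dup2(log_fd, 1)  # stdout
os.dup2(log_fd, 2)  # stderr
os.close(log_fd)    # fds 1 and 2 keep the file open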
To spawn child processes, I use the traditional fork and exec method. In retrospect, it would have been wiser to use subprocess from the standard library, but I was teaching a systems programming class at the time and I wanted some hands-on experience.
fork is hard
On start-up, the daemon takes a lock on a pid.lock file and writes its PID to it, so that only one daemon can run at a time and so that the client can find the daemon's PID.
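Something like the following does the job (a sketch using fcntl.flock; the daemon's real code differs in detail):

import fcntl
import os

fd = os.open("pid.lock", os.O_RDWR | os.O_CREAT, 0o644)
try:
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
except BlockingIOError:
    raise SystemExit("another daemon is already running")
os.ftruncate(fd, 0)
os.write(fd, str(os.getpid()).encode())
# keep fd open for the daemon's lifetime; the lock dies with the process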
I noticed that occasionally the lockfile would disappear while the daemon was still running. I read the code carefully. It did not seem possible. The only place where the lockfile was removed was when the program exited. Then I realized:
newpid = os.fork()
if newpid == 0:
    # set-up code omitted
    os.execvpe(job.cmd[0], job.cmd, env)
os.execvpe can never return normally. But it can raise an exception, such as when the executable doesn't exist. And if that happens, the child process is still running the original jobserver program. So the exception bubbles up to the top level, calls the clean-up handler, and removes the PID lockfile.
Solution: Catch any exception raised after fork, and call os._exit to exit immediately.
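Concretely, the fix looks roughly like this (extending the snippet above):

import os
import traceback

newpid = os.fork()
if newpid == 0:
    try:
        # set-up code omitted
        os.execvpe(job.cmd[0], job.cmd, env)
    except BaseException:
        # The child must never fall back into the parent's code paths
        # (top-level handlers, lockfile clean-up, and so on).
        traceback.print_exc()
        os._exit(127)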
Auto-restart is scary
Next, I saw that the 'last exit status' field of jobs was always missing. This puzzled me for a while, until I realized that due to a minor programming error, the jobserver crashed every time it ran a job, but since launchd was configured to auto-restart the daemon, I never noticed.
Locking is hard
Sometimes I called kg jobs schedule but the new job was never added. Race condition: (a) the daemon takes the lock, reads the state file, and releases the lock; (b) the client takes the lock, writes to the state file, and releases the lock; and (c) the server finishes what it was doing in (a) and writes back the state, clobbering the update in (b).
Solution: the daemon must hold the lock the whole time from read to write.
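In sketch form, assuming the state lives in a JSON file locked with fcntl.flock (the real code differs in detail):

import fcntl
import json

def add_job(state_path, new_job):
    with open(state_path, "r+") as f:
        # Hold the lock for the entire read-modify-write cycle.
        fcntl.flock(f, fcntl.LOCK_EX)
        state = json.load(f)
        state["jobs"].append(new_job)
        f.seek(0)
        f.truncate()
        json.dump(state, f)
        # the lock is released when the file is closed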
Signals are hard
Sometimes the daemon would get stuck and stop spawning any child jobs. The daemon listens for the SIGCHLD signal to detect when a child job has exited. In the signal handler, it updates the state file to record the child's exit status, taking a file lock to do so. What if SIGCHLD is received while the daemon is already holding the lock? Deadlock!
I knew that it is a bad idea to do non-trivial work in a signal handler, but I wrongly thought that because the Python signal handler is not the same as the C signal handler (details), the same considerations did not apply. Not so: the Python signal handler can be invoked at an arbitrary point in your program and you have to be just as paranoid as usual.
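The problematic shape was roughly this, where record_exit_status stands in for the real code that takes the state-file lock:

import os
import signal

def on_sigchld(signum, frame):
    # Non-trivial work inside the handler, including taking the same lock
    # the main loop uses. If the signal arrives while the main loop holds
    # that lock, we deadlock.
    pid, status = os.waitpid(-1, os.WNOHANG)
    record_exit_status(pid, status)

signal.signal(signal.SIGCHLD, on_sigchld)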
Locking is really hard
To address the bug above, I changed the signal handler to merely write the signal number to a queue that is read in the daemon's main loop. Python's queue.Queue class is thread-safe, so without thinking too hard about it, it seemed like a good choice.
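The second attempt looked roughly like this (simplified):

import queue
import signal

pending_signals = queue.Queue()

def on_signal(signum, frame):
    # Only record the signal here; the main loop does the real work.
    pending_signals.put(signum)

signal.signal(signal.SIGCHLD, on_signal)

# main loop, elsewhere:  signum = pending_signals.get(timeout=...)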
To my consternation, I continued to see occasional deadlocks in the daemon. By now, maybe you can guess why.
If we are unlucky enough to receive the signal when the main loop is calling queue.get, then the signal handler's call to queue.put will try to take an internal lock that the main loop already holds. Deadlock!
The fact that Queue is thread-safe doesn't help us since the two calls are on the same thread.
Solution: The "self-pipe trick" – write the signal number to an OS pipe in the signal handler, and read from it in the main loop.
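A minimal sketch of the trick (handle_signal is a stand-in for the real work of reaping children and updating the state file):

import os
import select
import signal

read_fd, write_fd = os.pipe()
os.set_blocking(write_fd, False)  # the handler must never block

def on_signal(signum, frame):
    # The only work done in the handler: push the signal number into the pipe.
    os.write(write_fd, bytes([signum]))

signal.signal(signal.SIGCHLD, on_signal)

# main loop
while True:
    ready, _, _ = select.select([read_fd], [], [], 60.0)
    if ready:
        for signum in os.read(read_fd, 4096):
            handle_signal(signum)

(Python's signal.set_wakeup_fd can do the byte-writing part for you, but the idea is the same.)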
Bug free?
After 9 months of bashing bugs, I hesitate to say that the jobserver is rock-solid, but it at least has no more obvious bugs. Except that sometimes it reports that the child process took negative time to run. And I really should have a proper client–server RPC mechanism… ∎