Yet another strange job scheduler bug

I have a job scheduler to run commands on a fixed schedule (once every 15 minutes, every Wednesday at 8 pm, etc.). This weekend, I noticed that some jobs were taking much longer than expected – commands that should exit in less than a second were taking 15 or 30 minutes.

How does the job scheduler handle children exiting?

1. SIGCHLD is delivered to job scheduler.
2. Signal handler is called; timestamp is taken and enqueued in pipe.
3. Job scheduler wakes up and checks the pipe.
4. Job scheduler calls wait4 to get child PID, retrieves start time, and computes wall-clock duration.
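To make the flow concrete, here is a minimal sketch of those four steps in Python. It is not the scheduler's actual code: the function name time_one_child and the pipe layout (one packed float per signal) are assumptions made for illustration, and a Python-level signal handler only approximates a real asynchronous handler because the interpreter runs it between bytecode instructions.

```python
import os
import select
import signal
import struct
import time


def time_one_child():
    """Fork a short-lived child and measure its wall-clock duration.

    Hypothetical helper; mirrors the four steps described above.
    """
    read_fd, write_fd = os.pipe()
    os.set_blocking(write_fd, False)

    def on_sigchld(signum, frame):
        # Step 2: take a timestamp and enqueue it in the pipe.
        try:
            os.write(write_fd, struct.pack("d", time.monotonic()))
        except BlockingIOError:
            pass  # pipe full; a real scheduler would need a policy here

    # Step 1: arrange for SIGCHLD to reach the handler.
    signal.signal(signal.SIGCHLD, on_sigchld)

    started = time.monotonic()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately

    # Step 3: sleep until the pipe becomes readable, then read the timestamp.
    select.select([read_fd], [], [])
    (exit_time,) = struct.unpack("d", os.read(read_fd, 8))

    # Step 4: a single wait4 call per signal. This is the fragile part.
    reaped_pid, status, rusage = os.wait4(-1, 0)
    duration = exit_time - started

    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    os.close(read_fd)
    os.close(write_fd)
    return pid, reaped_pid, duration
```

With only one child in flight, one signal maps to exactly one wait4 call, so this works. The design quietly assumes that mapping always holds.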

Because the timestamp is recorded in step 2, inside the signal handler, there ought not to be any delay between the child exiting and the parent recording the exit time – certainly not 30 minutes.

But the logs from the child processes strongly suggest they are exiting "on time", and a BPF trace confirms that the kernel is delivering SIGCHLD right before the job scheduler's logs show it is received. So, what could it be?

(If you'd like to investigate it yourself, you can read the source code here.)

Suppose job A and job B both run every 15 minutes and take about the same time. What happens if they exit nearly simultaneously – so close in time, in fact, that the job scheduler doesn't have a chance to respond to the first SIGCHLD signal before the second one is raised?

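On Linux, standard signals such as SIGCHLD do not enqueue, they coalesce: if one is already pending when a second is raised, the second is merged into it, so only one SIGCHLD is delivered. The job scheduler wakes up, sees the signal, calls wait4, and gets the result for, let's say, job A.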

The job scheduler continues to wake up again and again, but there are no new signals to handle, so it thinks job B is still running.

Finally, after 15 minutes, job A is run again. (Job B won't be run because the job scheduler thinks it is currently running, and it won't run it again until the current run has finished.) SIGCHLD is delivered, the job scheduler calls wait4, and gets the result for job B from 15 minutes ago. Job B is then run again, wait4 returns the result for job A, and the cycle repeats.
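The coalescing can be reproduced in isolation. The sketch below is an illustration rather than the scheduler's code: it blocks SIGCHLD to stand in for a scheduler that has not yet had a chance to respond, lets two children exit, then unblocks the signal and calls wait4 once per handler invocation. The function name is hypothetical, and the example assumes Linux and that the process has no other children.

```python
import os
import signal


def one_wait_per_signal():
    """Return (handler invocations, children reaped, children left behind)."""
    deliveries = 0

    def on_sigchld(signum, frame):
        nonlocal deliveries
        deliveries += 1

    signal.signal(signal.SIGCHLD, on_sigchld)

    # Hold SIGCHLD pending while both children exit.
    signal.pthread_sigmask(signal.SIG_BLOCK, {signal.SIGCHLD})
    pids = []
    for _ in range(2):
        pid = os.fork()
        if pid == 0:
            os._exit(0)
        pids.append(pid)

    # Wait until both children have exited, without reaping them (WNOWAIT).
    for pid in pids:
        os.waitid(os.P_PID, pid, os.WEXITED | os.WNOWAIT)

    # Unblock: the pending SIGCHLD is delivered now.
    signal.pthread_sigmask(signal.SIG_UNBLOCK, {signal.SIGCHLD})

    # The buggy pattern: exactly one wait4 call per signal received.
    reaped = 0
    for _ in range(deliveries):
        pid, status, rusage = os.wait4(-1, os.WNOHANG)
        if pid > 0:
            reaped += 1

    # Count (and clean up) the children the buggy pattern missed.
    signal.signal(signal.SIGCHLD, signal.SIG_DFL)
    left_behind = 0
    while True:
        try:
            pid, status, rusage = os.wait4(-1, os.WNOHANG)
        except ChildProcessError:
            break
        if pid == 0:
            break
        left_behind += 1
    return deliveries, reaped, left_behind
```

On Linux this should return (1, 1, 1): two children exited, but the handler ran once, so one child was reaped and the other sat unnoticed, just like job B.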

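The fix: Call wait4 (with WNOHANG so it doesn't block) in a loop until it returns 0, meaning no more children have exited, or fails with ECHILD, meaning there are no children left at all. This will reap all exited children even if a SIGCHLD was coalesced.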
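A minimal sketch of that loop in Python, assuming the scheduler wants every child of the process (hence pid -1). The function name reap_all_children is made up for this example, and Python reports ECHILD as ChildProcessError.

```python
import os


def reap_all_children():
    """Reap every child that has already exited, without blocking.

    Returns a list of (pid, status) pairs, one per reaped child.
    """
    reaped = []
    while True:
        try:
            pid, status, rusage = os.wait4(-1, os.WNOHANG)
        except ChildProcessError:
            break  # ECHILD: this process has no children at all
        if pid == 0:
            break  # children exist, but none have exited yet
        reaped.append((pid, status))
    return reaped
```

Calling this every time the scheduler wakes up for a SIGCHLD treats the signal as "at least one child exited" rather than "exactly one child exited", which is all a coalescing signal can promise.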