Slaying Zombies in Production

November 21, 2025
🧟‍♂️
This is a throwback post about saving production from a zombie apocalypse.

Story

A long time ago, I was enjoying a vacation with my family when, close to midnight, I got a call from a co-worker (nikita8).

* ...jumps on call and checks
restart completes and servers start to pick jobs

A moment later… background jobs were being picked up fine; we ran our test and it was all OK.

We quickly checked the logs and metrics, but they looked fine. That was weird 🤔 and both of us were clueless about what had just happened.


A few minutes later, it was the same thing: all of the servers stopped responding. I forget whether my ssh session was still OK on one of them, but we did the ritual of turning them off 🔴 and on 🟢 again.


Once the servers were up and running again, this time I closely inspected the process tree of the main process and asked her to run the job manually.

* ...she invoked the function to enqueue jobs from the console while I kept watch,
and I noticed something interesting:
a process in the `Z` state 🧟‍♂️
* she invoked it again

It was a lil concerning: a child process kept hanging there under its parent.

* I told her to do the thing again
and guess what, it added one more process in the same `Z` state to the process tree.

Btw, I was using htop with filters. I tried killing the process, but couldn’t. I also switched to the root user and tried to kill it again.

B🤯ye, I couldn’t kill it, whaaaat 😯 ? how come 😲 but why ?? 🤔

After a few head scratches and some quick googling about the Z process state, I found out it’s a ZOMBIE process 🧟‍♂️

and you can’t kill 🔪 what is already dead ☠️
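That behaviour is easy to reproduce on a Linux box. Here is a minimal Ruby sketch (not the code from that night): it creates a throwaway zombie, sends it SIGKILL, and shows that only a wait() from the parent clears it. The helper names `state_letter` and `make_zombie` are made up for this example, and it assumes /proc is mounted.

```ruby
# Linux-only sketch: SIGKILL does nothing to a zombie; only wait() clears it.
# Helper names are hypothetical, for illustration only.

# One-letter state from /proc/<pid>/stat ("R", "S", "Z", ...), nil if the PID is gone.
def state_letter(pid)
  stat = File.read("/proc/#{pid}/stat")
  # Format is "pid (comm) state ..."; the state follows the last ")".
  stat[stat.rindex(")") + 2]
rescue SystemCallError
  nil
end

# Forks a child that exits at once and waits until it shows up as a zombie.
def make_zombie
  pid = fork { exit!(0) }
  deadline = Time.now + 5
  sleep 0.05 until state_letter(pid) == "Z" || Time.now > deadline
  pid
end

pid = make_zombie
Process.kill("KILL", pid)   # no error: the PID still exists in the process table
sleep 0.1
puts "after SIGKILL: #{state_letter(pid).inspect}"

Process.wait(pid)           # the parent finally reaps the child
puts "after wait:    #{state_letter(pid).inspect}"
```

The signal is accepted because the PID is still in the process table, but there is no running code left to terminate, so the entry stays until the parent waits on it.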

I checked the resource usage of the process and found none. Woo, that’s even more interesting.

So what’s the problem? If those dead processes are not eating any RAM/CPU/IO, why are they bad? And how is that related to our case?

🧟‍♂️
Well, even though zombies are dead processes and use no machine resources, they don’t release their PIDs, and there is a limit on how many PIDs a machine can have.
It’s already midnight && the engg team is on vacation ⁉️
  1. Can I just restart the parent process with a cronjob?

    → I can’t periodically kill the process while it is actively doing its work. Also, we were unsure what would happen if we did that: was there a retry? Would the process exit abruptly? What would happen to the task it was working on?

    future me: Well, you would restart the process during a deployment anyway, so what happens during a deployment then? I wonder why we chose not to restart the process. 🤔 I guess that version of me would never feel OK about recklessly killing an actively running process in production.

    Usually, programs/processes can be designed to handle a signal (SIGTERM) and gracefully wrap up the task (save the data or whatever state they were working with) before exiting.

  2. What can we do to patch it, then?

    → Something quick and dirty: duct-taping. This service/app was built and released in a short period, and we had a skill gap around its idiosyncrasies.
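For the graceful-exit idea in the first item, here is a minimal Ruby sketch of what "handle SIGTERM, finish the current task, then stop" can look like. It is not how our worker was written; `run_worker` and the job names are made up for illustration, and the `sleep` just stands in for real work.

```ruby
# Sketch of a worker that finishes its in-flight job before stopping on SIGTERM.
# run_worker and the job names are hypothetical, for illustration only.
def run_worker(jobs)
  stop_requested = false
  # Keep the handler tiny: just flip a flag that the loop checks between jobs.
  previous = Signal.trap("TERM") { stop_requested = true }

  finished = []
  jobs.each do |job|
    break if stop_requested   # stop picking up new work
    yield job                 # the current job always runs to completion
    finished << job
  end
  finished
ensure
  # Put back whatever handler was installed before we started.
  Signal.trap("TERM", previous || "DEFAULT")
end

finished = run_worker(%w[job-1 job-2 job-3 job-4]) do |job|
  # Simulate a restart request arriving while job-2 is still running.
  Process.kill("TERM", Process.pid) if job == "job-2"
  sleep 0.1
end

puts "finished before stopping: #{finished.inspect}"
```

With something like this in place, a cron or deploy restart only costs the jobs that have not started yet, which can stay in the queue for the next worker.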

Rescue 🥷

My tiny-tiny brain came up with a random idea: what if we could attach to the parent process, tell it that it has zombies below it in the process tree, and somehow make it clean them up?

🙈
Looking back, I was a junior ops guy with production access, hehe 😅

I skimmed through multiple Stack Overflow pages for answers (yes, these were the times before AI existed), and I got lucky. Someone suggested something like:

You can attach to the parent process via a debugger and make it call wait to release the zombies.

Then, quick action: I fired up my editor and created a shell script named /tmp/zombie_slayer.sh

I don’t remember much, but it vaguely looked like:

#!/bin/bash

# 💡 for illustration purposes only, this code doesn't work as-is

# find zombie processes
ZOMBIE_PIDS=( `ps aux | grep 'zombie process'|...` )

# find parent process
PARENT_PID=$(ps -o ppid= -p "${ZOMBIE_PIDS[0]}")

# create a gdb file: /tmp/signal_zombies with all zombie pids
    attach $PARENT_PID

    # for all pids in ZOMBIE_PIDS;
    call waitpid($ZOMBIE_PID1, 0, 0)
    call waitpid($ZOMBIE_PID2, 0, 0)
    call waitpid($ZOMBIE_PID3, 0, 0)
    ...
    call waitpid($ZOMBIE_PIDn, 0, 0)

    detach
    quit
...

# attach via gdb, and slay them all
sudo gdb  -x /tmp/signal_zombies  -batch ...
Complete code generated with Gemini 🙈
#!/bin/bash

# Define the name of the parent process
PARENT_NAME="hutch"

# Find the PID of the parent process (oldest match).
# PPID is a read-only variable in bash, so use a different name.
PARENT_PID=$(pgrep -o "$PARENT_NAME")

# Check if parent process exists
if [ -z "$PARENT_PID" ]; then
  echo "Error: Parent process '$PARENT_NAME' not found." >&2
  exit 1
fi

echo "Found parent process '$PARENT_NAME' with PID: $PARENT_PID"

# Use `ps` to find all zombie child PIDs and store them in a temporary file.
# -e lists every process, not just the ones attached to this terminal.
ZOMBIE_PIDS_FILE=$(mktemp)
ps -e -o ppid= -o pid= -o state= | awk -v ppid="$PARENT_PID" '$1 == ppid && $3 == "Z" {print $2}' > "$ZOMBIE_PIDS_FILE"

# Check if any zombie children were found and written to the file
if [ ! -s "$ZOMBIE_PIDS_FILE" ]; then
  echo "No zombie children found for PID $PARENT_PID. Exiting."
  rm "$ZOMBIE_PIDS_FILE"
  exit 0
fi

echo "Found zombie children. PIDs listed in $ZOMBIE_PIDS_FILE"

# Create the GDB script file
GDB_SCRIPT_FILE=$(mktemp)
echo "attach $PARENT_PID" > "$GDB_SCRIPT_FILE"
echo "set confirm off" >> "$GDB_SCRIPT_FILE"

# Read PIDs from the temporary file and generate a waitpid call for each one.
# The (int) cast helps GDB when libc has no debug info for waitpid.
while read -r Z_PID; do
  echo "  call (int) waitpid($Z_PID, 0, 0)" >> "$GDB_SCRIPT_FILE"
done < "$ZOMBIE_PIDS_FILE"

echo "detach" >> "$GDB_SCRIPT_FILE"
echo "quit" >> "$GDB_SCRIPT_FILE"

# Use sudo to run GDB in batch mode with the generated script
echo "Attaching GDB to PID $PARENT_PID to reap specific zombie processes using script: $GDB_SCRIPT_FILE"
if sudo gdb -x "$GDB_SCRIPT_FILE" -batch; then
  echo "Successfully reaped specific zombie children."
else
  echo "Failed to run GDB commands. Check permissions or GDB installation." >&2
fi

# Clean up the temporary files
rm "$ZOMBIE_PIDS_FILE" "$GDB_SCRIPT_FILE"

# Verify that the zombies are gone
if ps -e -o ppid= -o state= | awk -v ppid="$PARENT_PID" '$1 == ppid && $2 == "Z"' | grep -q "Z"; then
  echo "Verification failed: Some zombies remain. Rerun the script or investigate further." >&2
else
  echo "Verification successful: No more zombies found."
fi

The script was tested and it kinda worked. It momentarily froze the main process, but that was acceptable. It was a bandage on the problem, not the fix.

We tested it for a few rounds, and later I hooked it up in the root user’s crontab on each of the boxes. Phew, that was quick and dirty 😅

It was already past midnight, so we parted ways for the time being.


Recalling hard:

The next day, I guess, she found the root cause of the zombie apocalypse in our logic…

We got the fix deployed, and zombie_slayer.sh was removed from cron for good.

Summary

Issue:

  • Production servers were silently dying after a certain duration of operation ☠️
  • Instrumentation tools didn’t help much (no major spikes to blame)
  • Zombie processes were piling up and eating up the system’s entire PID space
    • I wonder what the limit on the number of PIDs was back then, 2^15 = 32,768?
  • Hence no new process could be forked, making the server useless
  • We couldn’t ssh in without restarting them, lol, no PID left for a new connection 🙅🏼‍♂️
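
On Linux, the limit wondered about above can be read from /proc/sys/kernel/pid_max (32768 is the classic default; many newer systems are configured higher). Here is a rough Ruby sketch for checking how much PID headroom a box has left. The helper names are made up for this example, it assumes /proc is mounted, and it only counts processes, so treat the number as an estimate since threads take IDs from the same space.

```ruby
# Linux-only sketch: estimate how many PIDs are still free on this machine.
# Helper names are hypothetical, for illustration only.

# Upper bound on the PIDs the kernel will hand out.
def pid_max
  Integer(File.read("/proc/sys/kernel/pid_max").strip)
end

# Number of processes currently listed in /proc (zombies included).
def pids_in_use
  Dir.children("/proc").count { |name| name.match?(/\A\d+\z/) }
end

# Rough number of PIDs left before fork starts failing.
def pid_headroom
  pid_max - pids_in_use
end

puts "pid_max:  #{pid_max}"
puts "in use:   #{pids_in_use}"
puts "headroom: #{pid_headroom}"
```

A metric like this would have pointed at the problem much faster than the CPU/RAM/IO graphs did, since zombies show up here and nowhere else.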

Finding:

  • I vaguely remember the root cause (idk, it was 8-9 yrs ago)
  • We were calling the “ImageMagick” binary (convert ??) from Ruby
  • I asked LLMs/Gemini to help me write what we might have done wrong:
    ...
    # Fork a child process
    pid = fork do
      # This code runs in the child process.
      # Use ImageMagick to resize and compress the image.
      # The `exec` call replaces the child process with the command,
      # so no further Ruby code in this block will run.
      exec("convert", "input.png", "-quality", "85", "-resize", "250x250", "output.jpg")
    end
    
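    # This code runs in the parent process.
    puts "Parent process forked child with PID: #{pid}"
    # Nobody ever calls Process.wait(pid), so once the child exits it stays a zombie.
    puts "The child process will become a zombie when it exits."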

Fix:

  • devs fixed it in the codebase…
  • Again, my memory is not serving me well, but we might have done something like:
    ...
    # Detach the child process to prevent it from becoming a zombie.
    puts "Parent detaching child process to allow for automatic cleanup..."
    Process.detach(pid) # starts a dedicated thread that internally calls wait on pid
    or used some good library that took care of this for us.
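
Here is a small runnable Ruby sketch of the two usual ways to make sure a forked child gets reaped. It is an illustration, not our actual fix: `true` stands in for the ImageMagick command so the example does not depend on `convert` being installed, and `run_and_wait` / `run_detached` are made-up helper names.

```ruby
# Two ways to avoid leaving zombies behind after fork + exec.
# `true` stands in for the real command; helper names are hypothetical.

# Option 1: block until the child exits and collect its status (this reaps it).
def run_and_wait(*cmd)
  pid = fork { exec(*cmd) }
  _pid, status = Process.wait2(pid)
  status
end

# Option 2: fire and forget. Process.detach starts a thread that waits on
# the child, so it gets reaped whenever it exits.
def run_detached(*cmd)
  pid = fork { exec(*cmd) }
  Process.detach(pid)   # returns the watcher thread
end

status = run_and_wait("true")
puts "waited child exited with #{status.exitstatus}"

watcher = run_detached("true")
# Thread#value joins the watcher thread and returns the child's Process::Status.
puts "detached child exited with #{watcher.value.exitstatus}"
```

If the result of the command is needed right away, Kernel#system or backticks are simpler still, since they wait for the command themselves.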

Result:

  • No more zombies, servers happy, me happy.

Learnings:

  • about zombies and their weird lifecycle

  • a child process becomes a “zombie” if its parent fails to do the cleanup routine

  • when a process exits:

    • the operating system (kernel) reclaims all of its allocated resources, such as memory and open files.
    • but the kernel deliberately keeps a minimal entry in the system’s process table.
    • this “zombie” entry contains the process ID and its exit status
    • and its purpose is to let the parent learn the status of its exited child
  • later, when the parent process calls wait() for that PID, the process table entry is cleared

    • this step is aka “reaping”
    • usually reaping happens almost instantly, so we never notice zombies piling up
    • they only pile up when the parent fails to call wait()
      • i.e., if the parent process is poorly written or has a bug that prevents it from doing this, the child’s process table entry persists as a zombie.
  • sometimes the parent process might die abruptly, leaving behind its orphan/zombie processes

    • good thing the init process inherits them
    • the init process invokes wait() on them as part of its housekeeping
    • and that releases those zombie entries from the process table.
🧟‍♂️

A zombie process is a terminated child process that remains in the system’s process table until its parent retrieves its exit status. The kernel keeps this minimal entry so the parent can perform the cleanup ritual, known as “reaping,” by calling wait(). As a good dev, it’s your responsibility to wait on those dead child processes.

If the parent process dies first, the init process (PID 1) inherits the zombie. The init process then reaps the zombie, removing its entry from the process table.
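
The adoption part can be observed with a small Ruby sketch: a middle process forks a child and exits right away without waiting, and the orphaned child reports who its parent is afterwards. `adopter_of_orphan` is a made-up helper name. On a plain Linux box the new parent is usually PID 1, but inside containers or under a process manager it may be another designated "subreaper" process, so the sketch only shows that the parent changed.

```ruby
# Sketch: an orphaned child gets re-parented once its original parent dies.
# adopter_of_orphan is a hypothetical helper, for illustration only.
def adopter_of_orphan
  reader, writer = IO.pipe

  middle = fork do
    reader.close
    middle_pid = Process.pid
    fork do
      # Grandchild: poll until our parent is no longer the middle process.
      deadline = Time.now + 5
      sleep 0.05 while Process.ppid == middle_pid && Time.now < deadline
      writer.puts Process.ppid
      writer.close
      exit!(0)
    end
    exit!(0)   # the middle process dies without waiting on its child
  end

  writer.close
  Process.wait(middle)                    # we reap the middle process ourselves
  adopted_by = Integer(reader.read.strip) # blocks until the grandchild reports
  reader.close
  [middle, adopted_by]
end

middle, adopted_by = adopter_of_orphan
puts "original parent was PID #{middle}, orphan is now owned by PID #{adopted_by}"
```

Whoever adopts the orphan is also the one expected to wait on it when it exits, which is why a well-behaved init matters.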

While writing this post, I also refreshed my understanding of zombies and their fate. Here are some good references I found:

That is all. If you made it this far, thank you 🙇🏻‍♂️.

And please let me know if you have any late-night production rescue or zombie-related stories that I can learn from. Have a good one. 🙌🏼✨