When RTFM Isn’t Enough

One of my first jobs after graduation was as a junior sysadmin for a set of UNIX systems. I was one of four sysadmins, and the most junior at the time, which meant I got a lot of the grunt work. However, it also meant I was part of the regular rotation for routine IT jobs, like responding to user calls.

During my first week on the job, I got a call from a user stating one of their apps was failing to print. My boss and the other IT folks were busy, so I decided to investigate and resolve this ticket on my own.

The system in question was our big HP-UX system, which ran the massive Oracle database backbone for the entire organization. Since this was the early 1990s, there was no internet to consult, so I picked up the manuals and started reading about print jobs and what could cause them to fail. One cause was low disk space on the server, so I dug in there.

Checking available disk space on a UNIX or Linux system is easily done using the df command, and I saw that the root mount (identified as / in the listing) only had a few kilobytes of unused space. Surely, this was the problem! I took a look at the manual to find the remedy, which turned out to be a simple fix: delete all temporary files.
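For readers who want to script the same check, here is a minimal sketch in Python rather than the df command itself. It uses the standard library's shutil.disk_usage; the helper names free_space and is_low_on_space and the 1 MiB threshold are assumptions made for illustration, not anything from the original system.

```python
import os
import shutil


def free_space(path="/"):
    """Return (free_bytes, percent_free) for the filesystem holding `path`."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    percent_free = 100.0 * usage.free / usage.total if usage.total else 0.0
    return usage.free, percent_free


def is_low_on_space(path="/", min_free_bytes=1024 * 1024):
    """True when less than `min_free_bytes` is free (threshold is an assumption)."""
    free_bytes, _ = free_space(path)
    return free_bytes < min_free_bytes


if __name__ == "__main__":
    # "/" and "/tmp" may or may not live on the same filesystem.
    for mount in ("/", "/tmp"):
        if os.path.exists(mount):
            free_bytes, percent_free = free_space(mount)
            print(f"{mount}: {free_bytes // 1024} KiB free ({percent_free:.1f}%)")
```

Like df, this reports on whichever filesystem holds the path it is given, so it only answers the question for the machine it runs on.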

When any system runs, whether it’s Linux, Windows, or Mac, it needs to store data temporarily. On a UNIX/Linux system, the place where temporary data is stored is the /tmp directory. Processes create files there to store data temporarily, and sometimes they don’t clean up after themselves properly. Looking at the contents of the /tmp folder, I could see there were lots of temporary files taking up precious disk space. Deleting the files there would return some disk space to the system. Any files that are not deleted are still in use, so this is usually a safe process.
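A cleanup like this can be sketched so that the target directory is always named explicitly instead of being taken from wherever the shell happens to be. The following is only an illustration in Python: the function names, the seven-day age cutoff, the allow-list of directories, and the dry-run default are all assumptions, not the procedure from the manual.

```python
import os
import time


def find_stale_files(directory, min_age_days=7):
    """List regular files directly inside `directory` older than `min_age_days`."""
    directory = os.path.abspath(directory)
    cutoff = time.time() - min_age_days * 86400
    stale = []
    with os.scandir(directory) as entries:
        for entry in entries:
            # Skip subdirectories and symlinks; only look at plain files.
            if not entry.is_file(follow_symlinks=False):
                continue
            if entry.stat(follow_symlinks=False).st_mtime < cutoff:
                stale.append(entry.path)
    return sorted(stale)


def remove_stale_files(directory, allowed=("/tmp",), min_age_days=7, dry_run=True):
    """Remove stale files, but only from an explicitly allowed directory."""
    target = os.path.realpath(directory)
    if target not in {os.path.realpath(a) for a in allowed}:
        raise ValueError(f"refusing to clean {target!r}: not an allowed directory")
    stale = find_stale_files(target, min_age_days)
    if not dry_run:
        for path in stale:
            os.remove(path)
    return stale
```

Calling remove_stale_files("/tmp") with the default dry_run=True only reports what would be removed; nothing is deleted until dry_run=False is passed, and any directory outside the allow-list raises an error.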

On UNIX/Linux systems, the command to delete files is called rm, short for “remove”. It takes a set of options and the list of files to remove. Since I wanted to remove all the temporary files, the command I issued was simple: rm *, which deleted all the files in the current directory. There was only one problem.

My current directory wasn’t /tmp. It was /.

I had just deleted all the files from the root of our UNIX system. This included the UNIX kernel, which was needed to actually boot the system, along with the backup kernels and other system files.

After recovering from the adrenaline rush of fear, I made possibly the single best decision of my young career – I immediately located my boss and told them what I had done and why. They, of course, recognized the severity of the problem, but they also knew how to remedy it (which didn’t include firing me on the spot!).

And that is how I found myself staying late on the Friday night of my first week identifying, locating, and restoring all the root files one at a time from the most recent backup tapes. That’s also when I experienced my second adrenaline rush, when we rebooted the system to confirm it would still come up, and it did.

As for the user’s original issue, it shouldn’t be surprising to learn my diagnosis was incorrect. I learned that our users never logged into the big HP-UX system to do any work. They all logged into one of three smaller AIX systems, where they had rigidly enforced disk space quotas. The solution to the user’s printing issue was to clear temporary files from their AIX disk space, not from the HP-UX system.

Which is why reading the manual wasn’t enough. I needed to understand how the system was architected and built before I could work out how best to resolve the reported issue. I learned a lot that year as a sysadmin, thanks to a boss who treated my mistake as a teaching moment.