Linux/sysadmin questions (permissions, processes, systemd, boot, performance, and the diagnostics that actually matter on a broken box), graded easy → hard with full answers. Pair with the Linux cheatsheet.
Easy — fundamentals
Explain Linux file permissions and what 755 / 644 mean. easy
Each file has permissions for user (owner), group, and others, each with read(4)/write(2)/execute(1). The octal sums them: 755 = owner rwx (7), group r-x (5), others r-x (5) — typical for executables/dirs. 644 = owner rw-, group r--, others r-- — typical for regular files. On directories, execute means "can enter/traverse." There's also the ownership (chown user:group), special bits (setuid/setgid/sticky), and ACLs for finer control.
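As a quick check of the octal arithmetic above, here is a minimal Python sketch. It relies on the standard library's stat.filemode to render a mode the way ls -l would; the helper name describe_mode is made up for illustration.

```python
import stat


def describe_mode(octal_perms: int, is_dir: bool = False) -> str:
    """Render permission bits as an ls-style string (illustrative helper)."""
    # stat.filemode expects the file-type bits too, so add "regular file" or "directory".
    file_type = stat.S_IFDIR if is_dir else stat.S_IFREG
    return stat.filemode(file_type | octal_perms)


if __name__ == "__main__":
    for perms in (0o755, 0o644):
        print(oct(perms), describe_mode(perms))
```

Each octal digit is just the sum of read (4), write (2), and execute (1) for one of the three classes, which is why 7 renders as rwx and 5 as r-x.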
What's the difference between a hard link and a symbolic link? easy
A hard link is another directory entry pointing to the same inode (the actual file data) — same file, multiple names; the data persists until the last hard link is removed. Hard links can't span filesystems or link directories. A symbolic (soft) link is a tiny file containing a path to another file — it can cross filesystems and link directories, but breaks if the target is moved/deleted (dangling). ln target hardlink vs ln -s target symlink.
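The difference is easy to see in a scratch directory. The sketch below is one possible demonstration, assuming a Linux filesystem that supports both link types; the function name link_demo and the file names are illustrative.

```python
import os
import tempfile


def link_demo() -> dict:
    """Create a file, a hard link and a symlink, then delete the original name."""
    with tempfile.TemporaryDirectory() as d:
        target = os.path.join(d, "target")
        hard = os.path.join(d, "hardlink")
        soft = os.path.join(d, "symlink")
        with open(target, "w") as f:
            f.write("data")

        os.link(target, hard)      # equivalent to: ln target hardlink
        os.symlink(target, soft)   # equivalent to: ln -s target symlink

        same_inode = os.stat(target).st_ino == os.stat(hard).st_ino
        link_count = os.stat(target).st_nlink  # two names now point at the inode

        os.remove(target)          # remove the original name only

        with open(hard) as f:
            hard_still_readable = f.read() == "data"
        # The symlink itself still exists, but the path it stores no longer resolves.
        symlink_dangling = os.path.islink(soft) and not os.path.exists(soft)

        return {
            "same_inode": same_inode,
            "link_count": link_count,
            "hard_still_readable": hard_still_readable,
            "symlink_dangling": symlink_dangling,
        }


if __name__ == "__main__":
    print(link_demo())
```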
How do you check disk and memory usage? easy
Disk: df -h (filesystem usage), du -sh * (size per dir), df -i (inode usage — can fill up even with free space). Memory: free -h (total/used/free/buff-cache/available — look at available, not free, since cache is reclaimable), top/htop (per-process). cat /proc/meminfo for detail. A common gotcha: "low free memory" is usually just the page cache, which is healthy and reclaimable.
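To make the "available, not free" point concrete, here is a small sketch that parses /proc/meminfo-style text. The sample numbers are invented for illustration and the function names are hypothetical; on a real box you would pass in the contents of /proc/meminfo instead.

```python
def parse_meminfo(text: str) -> dict:
    """Map /proc/meminfo field names to their numeric values (kB for most fields)."""
    values = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[name.strip()] = int(parts[0])
    return values


def available_fraction(info: dict) -> float:
    """Share of total RAM the kernel estimates is usable without swapping."""
    return info["MemAvailable"] / info["MemTotal"]


# Illustrative numbers only, not taken from a real machine.
SAMPLE = """\
MemTotal:       16000000 kB
MemFree:          500000 kB
MemAvailable:   12000000 kB
Buffers:          300000 kB
Cached:         11000000 kB
"""

if __name__ == "__main__":
    info = parse_meminfo(SAMPLE)
    print("free kB:", info["MemFree"], "available fraction:", available_fraction(info))
```

In the sample, MemFree looks alarming at about 3% of RAM, while MemAvailable shows that three quarters of memory can still be handed to applications.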
What is a process vs a thread, and what are zombie/orphan processes? easy
A process has its own address space; threads share a process's memory and run concurrently. A zombie is a finished child whose exit status hasn't been reaped by its parent (wait()) — it holds only a PID entry; many zombies = a buggy parent not reaping. An orphan is a child whose parent died; it gets re-parented to init/systemd (PID 1), which reaps it. You can't kill a zombie (it's already dead) — you fix or kill the parent.
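A zombie can be produced on purpose to see what it looks like. This is a hedged sketch, assuming Linux with /proc mounted; the function names are illustrative. The parent forks a child that exits immediately, delays the wait, and reads the child's state from /proc/<pid>/stat.

```python
import os
import time


def proc_state(pid: int) -> str:
    """Return the one-letter process state from /proc/<pid>/stat (e.g. R, S, D, Z)."""
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # The command name is wrapped in parentheses and may contain spaces,
    # so split on the last ')' and take the first field after it.
    return data.rsplit(")", 1)[1].split()[0]


def zombie_demo() -> tuple:
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit right away without running cleanup handlers

    # Parent: poll briefly until the child has exited but is not yet reaped.
    deadline = time.monotonic() + 2.0
    state = proc_state(pid)
    while state != "Z" and time.monotonic() < deadline:
        time.sleep(0.01)
        state = proc_state(pid)

    os.waitpid(pid, 0)  # reap: collect the exit status and free the PID entry
    reaped = not os.path.exists(f"/proc/{pid}")
    return state, reaped


if __name__ == "__main__":
    print(zombie_demo())
```

The state only leaves Z once the parent calls waitpid, which is why the cure for a pile of zombies is always on the parent's side.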
What does the boot process look like on a modern Linux system? easy
Firmware (BIOS/UEFI) → bootloader (GRUB) loads the kernel + initramfs → kernel initializes hardware, mounts the root filesystem → starts PID 1 (systemd) → systemd brings up units/targets (mounts, network, services) according to dependencies until it reaches the default target (e.g. multi-user.target / graphical.target). journalctl -b shows this boot's logs; systemd-analyze blame shows what was slow.
What do Linux file permissions mean (rwx)? easy
Read/write/execute for owner, group, others; shown as e.g. rwxr-xr-x and set numerically (755). Execute on a directory means 'enter/traverse'.
Hard link vs symlink? easy
A hard link is another name for the same inode (same filesystem, survives original deletion); a symlink points to a path (can cross filesystems, breaks if target moves).
What is a process vs a thread? easy
A process has its own address space; threads share the process's memory and run concurrently within it.
What is a zombie process? easy
A finished process whose exit status hasn't been reaped by its parent; it holds a PID slot until the parent calls wait(), or until the parent exits and init/systemd (PID 1) adopts and reaps it.
What is load average? easy
Average number of processes running or waiting (incl. uninterruptible I/O) over 1/5/15 min; compare to core count to judge saturation.
What is a file descriptor? easy
An integer handle to an open file/socket/pipe in a process; stdin/stdout/stderr are 0/1/2.
What is the page cache? easy
The kernel uses otherwise-idle RAM to cache file data and speed up reads/writes. This memory is counted as buff/cache but is reclaimable on demand, which is why 'free' looks low on a healthy system.
What is a signal (SIGTERM/SIGKILL)? easy
An async notification to a process; SIGTERM (15) asks it to stop gracefully (catchable), SIGKILL (9) force-kills (uncatchable).
What is cron? easy
A scheduler running commands on a time schedule via crontab entries (minute hour day month weekday).
What is a systemd unit? easy
A declarative service definition systemd manages (start/stop/restart/deps/logging); systemctl controls it, journalctl reads its logs.
Medium — applied
How do you manage a service with systemd and read its logs? medium
systemctl status/start/stop/restart/enable/disable <svc> (enable = start at boot; status shows state, recent logs, and the main PID). Logs: journalctl -u <svc> (-f follow, -b this boot, --since "10 min ago", -p err by priority). Edit safely with systemctl edit <svc>, which writes a drop-in override and reloads systemd for you; run systemctl daemon-reload after editing unit files by hand, then restart the service. A unit defines ExecStart, Restart=, dependencies (After=/Requires=), and resource limits. If a service won't start: status for the error, journalctl -u <svc> for the full log, then check the unit file and permissions.
A process is using 100% CPU. How do you find and handle it? medium
top/htop to find the PID and confirm it's user vs system CPU. top -H -p <pid> or pidstat -t to find the hot thread. To see what it's doing: strace -p <pid> (syscalls — is it spinning on a syscall?), perf top -p <pid> (hot functions), or a flame graph. Check if it's legit load (scale/optimize) or a bug/runaway (restart, fix). To contain without killing: renice/cpulimit, or cgroup limits. Note CPU % >100 means multiple cores. Distinguish saturation (genuinely needs CPU) from a tight loop (bug).
What is a load average, and how do you interpret 4.0 on a 4-core box? medium
Load average is the exponentially-weighted count of processes runnable or in uninterruptible sleep (usually disk I/O wait) over 1/5/15 minutes. On a 4-core box, 4.0 ≈ fully utilized (roughly one runnable task per core); >4 means tasks are queuing/waiting. But Linux includes D-state (uninterruptible I/O) in the load, so a high load with low CPU often means a disk/NFS bottleneck, not CPU. Always pair it with vmstat (run-queue r vs blocked b), mpstat, and iostat to tell CPU saturation from I/O wait.
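The per-core comparison can be written down in a few lines. This is a rough sketch; the labels and the function names are my own, and the ratio alone cannot tell CPU demand from D-state I/O wait, so it is only a first step before vmstat/iostat.

```python
import os


def load_per_core(load: float, cores: int) -> float:
    """Load average normalised by core count."""
    return load / cores


def classify(load: float, cores: int) -> str:
    """Very coarse reading of a load average (illustrative labels)."""
    ratio = load_per_core(load, cores)
    if ratio < 1.0:
        return "headroom"
    if ratio > 1.0:
        return "tasks are queuing"
    return "roughly saturated"


if __name__ == "__main__":
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    print(one_min, cores, classify(one_min, cores))
```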
Disk shows full but du doesn't account for the space. What's going on? medium
Classic: a file was deleted while still open by a running process. The directory entry is gone (so du can't see it), but the inode and its blocks aren't freed until the process closes the fd. This is common when a log file is deleted or rotated while the app keeps writing to the old fd. Find it with lsof | grep deleted (or lsof +L1); the fix is to restart/signal the process (or truncate via its /proc/<pid>/fd/<n>). Other causes of a df/du gap: data hidden underneath a mountpoint (written before something was mounted over the directory) or root-reserved blocks (ext filesystems reserve 5% by default). Inode exhaustion (df -i, lots of tiny files) is a related failure: writes fail with "no space" even though df -h shows free blocks. Always check df -i too.
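The deleted-but-open case can be reproduced safely with a temporary file. The sketch below assumes Linux with /proc mounted; the function name is illustrative. It unlinks a file while keeping its descriptor open and shows that the data is still there and that /proc marks the target as deleted, which is what lsof reports.

```python
import os
import tempfile


def deleted_but_open_demo() -> dict:
    fd, path = tempfile.mkstemp()
    try:
        os.write(fd, b"x" * 4096)
        os.unlink(path)  # remove the only name; the open fd keeps the inode alive

        st = os.fstat(fd)
        proc_target = os.readlink(f"/proc/self/fd/{fd}")

        os.lseek(fd, 0, os.SEEK_SET)
        data = os.read(fd, 4096)
        return {
            "name_gone": not os.path.exists(path),
            "link_count": st.st_nlink,      # 0: no directory entry points here
            "size_still_held": st.st_size,  # the bytes are still allocated
            "proc_fd_target": proc_target,  # the kernel appends " (deleted)"
            "still_readable": len(data) == 4096,
        }
    finally:
        os.close(fd)  # only now can the filesystem free the space


if __name__ == "__main__":
    print(deleted_but_open_demo())
```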
How do you find what's listening on a port and which process owns it? medium
ss -ltnp (listening TCP, numeric, with process) is the modern tool (netstat -ltnp the old one). To check a specific port: ss -ltnp 'sport = :8080' or lsof -i :8080. To test connectivity to it: nc -vz host 8080 or curl. Watch for a service bound to 127.0.0.1 (only local) vs 0.0.0.0 (all interfaces) — a very common "works locally, refused remotely" bug. fuser can also identify and kill the owner.
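The loopback-versus-all-interfaces distinction comes down to the address passed to bind. A minimal sketch, assuming IPv4 is available on the machine; bound_address is a hypothetical helper and port 0 simply asks the kernel for any free port.

```python
import socket


def bound_address(host: str) -> str:
    """Bind a listening TCP socket to host and report the local address it got."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind((host, 0))  # port 0: let the kernel pick a free port
        s.listen(1)
        return s.getsockname()[0]


if __name__ == "__main__":
    # 127.0.0.1 is reachable only from the same host; 0.0.0.0 accepts on every interface.
    print(bound_address("127.0.0.1"), bound_address("0.0.0.0"))
```

The address returned here is the same one ss -ltnp prints in its Local Address column, so a service showing 127.0.0.1:8080 there will refuse remote clients no matter what the firewall says.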
Walk through the Linux boot process. medium
Firmware/UEFI → bootloader (GRUB) → kernel + initramfs → init/systemd → mounts filesystems → starts target/services.
How does the OOM killer decide what to kill? medium
When memory + swap is exhausted and reclaim fails, it kills the process with the highest oom_score (roughly its share of memory use, shifted up or down by oom_score_adj) to reclaim RAM.
What is swap and swappiness? medium
Disk space used as overflow for RAM; swappiness (0–100 historically, up to 200 on kernels 5.8 and later) tunes how aggressively the kernel swaps anonymous pages vs reclaiming page cache.
Explain the USE method for performance. medium
For each resource check Utilization, Saturation, Errors — a systematic way to find the bottleneck instead of guessing.
What do top/vmstat/iostat tell you? medium
top: per-process CPU/mem; vmstat: run queue, swap, context switches; iostat: disk %util and await (I/O bound?).
What is a cgroup? medium
Kernel mechanism limiting/accounting CPU, memory, I/O for groups of processes — the basis of container resource limits.
What is a namespace? medium
Kernel isolation of resources (PID, net, mount, user, etc.) so processes see their own view — the basis of containers.
How does Linux handle file system caching and dirty pages? medium
Writes go to page cache (dirty pages) and are flushed to disk asynchronously by writeback; sync/fsync force it. Tunable via vm.dirty_*.
What is an inode? medium
A filesystem structure holding a file's metadata (permissions, owner, size, block pointers) but not its name; when a filesystem runs out of inodes, creating new files fails with "no space left" even though df -h shows free blocks.
How do you inspect open files and ports? medium
lsof for open files/sockets per process, ss -tulpn (or netstat) for listening ports and owning processes.
Hard — senior & troubleshooting
The server feels slow but CPU looks fine. Walk through the USE method to find the bottleneck. hard
USE = for every resource, check Utilization, Saturation, Errors. Sweep the resources: CPU — mpstat -P ALL (per-core util), vmstat run-queue (saturation), check %steal (noisy neighbour); Memory — free/vmstat (is it swapping? si/so > 0 kills performance), OOM in dmesg; Disk — iostat -x (%util, await latency, queue depth), iotop for the culprit; Network — ss -s, retransmits, NIC errors (ip -s link). High load + low CPU usually points at disk I/O wait (D-state) or memory pressure/swap. The method is to systematically rule resources in/out rather than guess — then drill into the saturated one with the specific tool (strace/perf/biolatency).
What is the OOM killer and how do you investigate/prevent it? hard
When the kernel can't satisfy a memory allocation and can't reclaim enough, the OOM killer picks a process to kill to save the system. The victim is the one with the highest oom_score, which is based on memory use and shifted by oom_score_adj (-1000 exempts a process, +1000 makes it the preferred victim). Investigate: dmesg/journalctl -k shows "Out of memory: Killed process …" with the victim and an RSS table; confirm which process it was and whether it was a leak or the box was simply over-committed. In containers/cgroups, hitting the memory limit triggers a cgroup OOM (exit code 137) even if the host has RAM. Prevent: right-size limits/requests, fix leaks, add headroom, tune oom_score_adj to protect critical procs, disable/limit overcommit if appropriate, and add swap cautiously. The signal is sudden process death + 137 + a dmesg OOM line.
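The 137 in that answer is just 128 + 9, the shell's way of saying "killed by SIGKILL". The sketch below does not trigger a real OOM; it assumes it is acceptable to simulate the kill by sending SIGKILL to a throwaway child, and the function names are illustrative.

```python
import signal
import subprocess
import sys


def shell_exit_code(returncode: int) -> int:
    """Translate Python's negative 'killed by signal N' into the shell's 128 + N."""
    return 128 - returncode if returncode < 0 else returncode


def kill_child_with_sigkill() -> int:
    """Start a sleeping child, SIGKILL it, and return subprocess's returncode."""
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    child.send_signal(signal.SIGKILL)  # same signal the OOM killer delivers
    return child.wait()


if __name__ == "__main__":
    rc = kill_child_with_sigkill()
    print("subprocess returncode:", rc, "shell exit code:", shell_exit_code(rc))
```

An exit code of 137 therefore only proves SIGKILL; the dmesg or journal line is what ties it to the OOM killer rather than to an operator running kill -9.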
You're locked out / a box won't boot. How do you recover it? hard
Depends on the failure. Won't boot: boot to an earlier kernel or edit the GRUB entry to add systemd.unit=rescue.target (or init=/bin/bash) for single-user/emergency shell, then fix the cause — read journalctl -b -1 -p err from the previous boot, check /etc/fstab (a bad mount blocks boot — add nofail), failed services (systemctl --failed), or a full /. Filesystem errors: fsck from rescue/live media. Forgot root / lost SSH: boot to single-user or a live USB, chroot into the install, reset the password / fix sshd config / re-add the key. Cloud VM: use the serial console, or detach the disk and mount it on a rescue instance. Principle: get a shell (rescue/chroot/serial), read the logs to find the failing unit/mount, fix it, reboot.
How does the Linux page cache work, and why is "free memory" misleading? hard
Linux uses otherwise-idle RAM as a page cache for file data, so repeated reads hit memory instead of disk. This is why free shows lots of "used" memory as buff/cache. That memory is reclaimable on demand, so the number to watch is available, not free. Misreading "low free" as a problem leads people to drop caches (echo 3 > /proc/sys/vm/drop_caches) and hurt performance. Real memory pressure shows as: low available, active reclaim, and especially swapping (si/so in vmstat) or OOM kills. Dirty pages (written, not yet flushed) are controlled by vm.dirty_* tunables; a sudden flush can cause I/O stalls. So: cache is good; judge memory health by available memory, swap activity, and reclaim, not by "free."
Walk me through how you'd harden a fresh Linux server. hard
Layered baseline. Access: SSH key-only (disable password and root login), non-root sudo user, fail2ban, limit how widely SSH is exposed. Patching: enable automatic security updates, keep the kernel current. Network: host firewall (ufw/nftables) default-deny inbound except needed ports, close unused services (ss -ltnp audit). Least privilege: minimal packages, proper file permissions, no setuid where avoidable, service accounts instead of root. Hardening frameworks: SELinux/AppArmor enforcing, CIS benchmark, auditd for logging. Monitoring: centralized logs, integrity monitoring (AIDE), time sync. Secrets: no plaintext creds, rotate keys. Then validate with a scanner (Lynis/OpenSCAP). The theme: reduce attack surface, enforce least privilege, patch, log, and verify.
How do you profile a CPU-bound process? hard
perf top/perf record + flame graphs to find hot functions, check thread contention, and use strace/ltrace for syscall/library time.
What is an uninterruptible (D) state and why can't you kill it? hard
The process is blocked in a kernel syscall (usually I/O — stuck NFS/failing disk) and won't accept signals until the I/O completes; fix the underlying device/mount.
How does copy-on-write fork work? hard
fork() shares parent pages read-only; on write either side gets a private copy — makes fork cheap and is why memory 'doubles' only when written.
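The visible effect of copy-on-write is that a write in the child never shows up in the parent. The sketch below demonstrates only that isolation, not the cost savings; it assumes a Unix-like system where os.fork is available, and the function name is illustrative.

```python
import os


def cow_demo() -> tuple:
    """Fork, modify a buffer in the child, and compare both views of it."""
    data = bytearray(b"parent")
    read_end, write_end = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child: this write lands on the child's private copy of the page.
        os.close(read_end)
        data[:] = b"child!"
        os.write(write_end, bytes(data))
        os._exit(0)

    # Parent: read what the child saw, then reap it.
    os.close(write_end)
    child_view = os.read(read_end, 64)
    os.close(read_end)
    os.waitpid(pid, 0)
    return bytes(data), child_view


if __name__ == "__main__":
    print(cow_demo())
```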
How do you debug high system (sys) CPU time? hard
High kernel time points to syscalls/interrupts — profile syscalls (perf/strace), check context switches, interrupt load, lock contention, or page-fault storms.
Explain TCP tuning knobs you'd touch on a busy server. hard
Backlog (somaxconn), ephemeral port range, tw_reuse, window scaling/buffers (rmem/wmem), and congestion algorithm (BBR) — for high connection rates/throughput.
How do you securely harden a Linux server? hard
Minimal packages, SSH keys + no root login, firewall (default-deny), patching, SELinux/AppArmor enforcing, auditd, fail2ban, least-privilege users, and CIS-benchmark settings.
What happens on a page fault? hard
The MMU traps to the kernel: minor fault maps an existing page (cache/COW), major fault reads from disk/swap (slow). Excessive major faults = thrashing.
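Minor faults are easy to observe from inside a process. A rough sketch, assuming a Unix-like system where the resource module is available; the exact count depends on page size, the allocator, and transparent huge pages, so the code only reports the delta without predicting it.

```python
import resource

PAGE_SIZE = resource.getpagesize()


def minor_faults_from_touching(n_bytes: int) -> int:
    """Allocate n_bytes, write to every page, and return the minor-fault delta."""
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    buf = bytearray(n_bytes)  # freshly mapped memory, faulted in as it is first touched
    for offset in range(0, n_bytes, PAGE_SIZE):
        buf[offset] = 1       # make sure every page has been written at least once
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    return after - before


if __name__ == "__main__":
    usage = resource.getrusage(resource.RUSAGE_SELF)
    print("major faults so far:", usage.ru_majflt)
    print("minor faults from touching 32 MiB:", minor_faults_from_touching(32 * 1024 * 1024))
```

ru_majflt is the counter to watch for thrashing: it only climbs when a fault had to be served from disk or swap.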
How do you diagnose disk I/O latency? hard
iostat -x (await, %util, queue), iotop for the offending process, check filesystem/RAID health, and distinguish throughput saturation from per-op latency.
What is the difference between RSS, VSZ, and actual memory use? hard
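VSZ is total virtual address space (everything mapped, including pages never touched, swapped out, or shared); RSS is resident physical pages (including shared ones, so summing RSS across processes double-counts); neither is exact unique use. Use PSS, which splits each shared page among the processes sharing it.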
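The proportional split behind PSS is simple arithmetic. The following sketch uses invented numbers and hypothetical function names; it models a process as private memory plus a list of shared mappings, each tagged with how many processes share it.

```python
def rss_kb(private_kb: float, shared_mappings: list) -> float:
    """RSS counts every resident shared page in full."""
    return private_kb + sum(size for size, _sharers in shared_mappings)


def pss_kb(private_kb: float, shared_mappings: list) -> float:
    """PSS charges each shared mapping divided by the number of sharers."""
    return private_kb + sum(size / sharers for size, sharers in shared_mappings)


if __name__ == "__main__":
    # Illustrative figures: 100 MiB private, plus a 40 MiB library shared by 4 processes.
    private = 102400
    shared = [(40960, 4)]
    print("RSS kB:", rss_kb(private, shared), "PSS kB:", pss_kb(private, shared))
```

Summing PSS over all processes gives a total that matches real usage, which is not true of summed RSS.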
How does systemd manage service dependencies and ordering? hard
Units declare Wants/Requires (dependency) and Before/After (ordering); targets group units; systemd builds a dependency graph and parallelizes startup.
Scenario-based
Disk is full but `du` shows little usage. What's happening? hard
A process is holding a deleted-but-still-open file: the directory entry is gone (so du doesn't count it), but the inode and its space aren't freed until the file descriptor closes. Classic with a log file rotated/deleted while the app keeps writing. Find it with lsof +L1 (or lsof | grep deleted), then restart/signal the process (or truncate via /proc/PID/fd) to release the space. Also rule out other explanations for a df/du gap, such as data hidden under a mountpoint or du being run on a different filesystem than the one df reports as full.
A server is slow and load average is high. How do you investigate? medium
Run the USE method (Utilization, Saturation, Errors) across resources. top/htop (CPU, run queue, which process), vmstat (run queue r, swap, context switches), iostat -x (disk %util, await — is it I/O-bound?), free (memory/swap pressure). High load with low CPU = I/O wait or uninterruptible processes. Identify the saturated resource, then the offending process, then why. Load average alone is ambiguous — decompose it.
A process won't die even with `kill`. Why, and what do you do? medium
kill (SIGTERM) can be caught/ignored; try kill -9 (SIGKILL). If even -9 won't kill it, it's in uninterruptible sleep (D state) — blocked in a kernel call, usually I/O wait (stuck NFS mount, failing disk); it can't be killed until the I/O returns or you fix the underlying device. A zombie (Z state) is already dead — it's waiting for its parent to reap it; kill or fix the parent (it consumes a PID slot, not resources). Diagnose state via ps -o stat.
The OOM killer killed your app. How do you diagnose and prevent it? hard
Confirm via dmesg/journal ("Out of memory: Killed process"). The kernel kills the process with the worst oom_score when memory + swap is exhausted. Diagnose: was it a genuine leak/spike in your app, a noisy neighbor, no limits (cgroup/container) so one process ate everything, or overcommit? Prevent: set memory limits/cgroups, fix the leak, right-size the box, add swap as a buffer (not a fix), and tune oom_score_adj for critical procs. In containers, set requests/limits.
You can't write to a file despite having correct permissions. What else could it be? medium
Beyond rwx: filesystem full (df) or out of inodes (df -i), read-only mount (remounted ro after errors — check mount/dmesg), the immutable attribute (lsattr shows i, set via chattr +i), SELinux/AppArmor denial (check ausearch/dmesg, context mismatch), or you're hitting a different file (symlink/bind mount) than you think. Also a held lock or quota limit. Check these in order.
A server won't boot. How do you recover it? hard
Boot into rescue/single-user mode (or attach the disk to a recovery instance). Common causes and checks: a bad /etc/fstab entry hangs the mount (comment it out), broken initramfs/kernel after an update (boot the previous kernel from GRUB), filesystem corruption (run fsck on the unmounted FS), full /boot, or a misconfigured service blocking boot (check journalctl -xb, mask the offender). Read the console/boot logs to see where it stops, fix that one thing, reboot.
Disk full but `du` shows little. What's happening? hard
A deleted-but-open file held by a process — space isn't freed until the fd closes; find with lsof +L1, restart/signal the process.
Server slow, load average high. Investigate. medium
USE method: top/vmstat/iostat/free — high load + low CPU = I/O wait or uninterruptible procs; find the saturated resource then the culprit.
A process won't die even with kill -9. Why? medium
It's in uninterruptible D state (blocked in kernel I/O — stuck mount/disk); can't be killed until I/O returns or you fix the device.
OOM killer killed your app. Diagnose/prevent. hard
dmesg confirms; check for leak/spike, missing limits (cgroup), or noisy neighbor; set memory limits, fix the leak, right-size, add swap as buffer.
Can't write to a file despite correct perms. Else? medium
Disk/inodes full, read-only remount, immutable attr (lsattr/chattr +i), SELinux denial, or a different file via symlink — check in order.
Server won't boot. Recover it. hard
Rescue/single-user mode: bad fstab (comment out), broken initramfs/kernel (boot previous from GRUB), FS corruption (fsck), full /boot — read console logs for where it stops.
A service flaps restarting every few seconds. Debug. medium
systemctl status + journalctl -u: crash on startup (bad config/missing dep), too-aggressive watchdog/timeout, or OOM; fix root cause, adjust Restart settings.
CPU shows high iowait. What does it mean and next steps? medium
CPUs are idle while tasks wait for I/O to complete, mostly disk or network-filesystem I/O (plain socket waits are not counted as iowait); find the I/O-heavy process (iotop), check disk latency (iostat -x), and address the slow device or reduce I/O.
Out of file descriptors errors under load. Fix? medium
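Raise the per-process limit (ulimit -n, or LimitNOFILE= in the systemd unit) and, if the system-wide cap is the one being hit, fs.file-max; find fd leaks (lsof -p <pid>) and make sure connections/files are closed. Default limits are often too low for servers. (somaxconn is the listen backlog, a separate knob that helps with dropped connections rather than fd exhaustion.)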
Need to find what's eating memory on a box. Approach? hard
free/top (RSS), smem (PSS for shared), check page cache vs anon, slab (slabtop) for kernel memory, and per-cgroup usage in containers.
Linux/sysadmin and SRE loops lean on live troubleshooting ("box is slow," "disk full but du is clean," "OOM kills," "service won't start," "can't SSH") and expect fluency with the diagnostic toolkit (top/htop, vmstat, iostat, ss, lsof, journalctl, strace) and systemd. Fundamentals (permissions, links, processes, boot, page cache) are screening questions. The differentiator is a structured method (USE method, layer isolation) rather than randomly running commands. Many shops give a broken VM and watch how you debug.