CS 3235 - System Security
January 27, 2026
Daniel Genkin
Running Untrusted Code
The Central Security Challenge
The fundamental question: How do we run potentially malicious programs without them destroying our system?
Why we need to run untrusted code:
- You download a PDF from an unknown email attachment
- You visit websites that run JavaScript in your browser
- You install apps from app stores (how do you know they’re safe?)
- You process user-uploaded files on your server
- You run code from open-source repositories
The dilemma:
We WANT to: But we FEAR:
- Open PDFs → Malicious code execution
- Browse websites → Data theft
- Install apps → System compromise
- Run plugins → Privilege escalation
This lecture’s goal: Learn techniques to run untrusted code SAFELY through isolation and sandboxing.
Process Isolation
Evolution of Security Thinking
Traditional Thinking: Identify the Bad Guys
Old question: “How to run bad/untrustworthy programs safely?”
This assumed we could identify dangerous programs:
- Programs from untrusted sites
- Apps that process adversarial data (PDF viewers, browsers)
- Honeypots (intentionally vulnerable systems to trap attackers)
Problem: You have to KNOW something is bad before protecting against it. What about:
- Zero-day vulnerabilities?
- Compromised trusted programs?
- Supply chain attacks?
New Thinking: Trust Nothing
Modern question: “Be skeptical of ALL programs, isolate to achieve least privilege”
Key principles:
- Assume compromise - ANY program could be exploited
- Default deny - Give minimal permissions by default
- Limit damage - When (not if) compromise happens, contain it
Real example - Chrome Browser:
Old approach:
"Chrome is from Google, it's trustworthy"
→ Give Chrome full system access
→ If exploited, attacker gets everything
New approach:
"Assume Chrome will be exploited"
→ Each tab runs in isolated process
→ Limited syscalls (can't open files, create sockets)
→ If exploited, damage is contained
General Goal: Confinement
Confinement = Ensure misbehaving process cannot harm rest of system
Medical quarantine analogy:
- Sick patient (compromised program) is isolated
- Can’t spread disease (attacks) to others
- But can still function (perform legitimate operations)
- Medical staff (system services) interact through barriers
Spectrum of Isolation Levels
Can be implemented at many levels:
- System call interposition (AppArmor, SELinux)
  - Monitor and filter individual system calls
  - Example: Block a process from accessing /etc/passwd
- Containers (Docker, Kubernetes)
  - Isolated userspace instances sharing one kernel
  - Example: Each microservice in its own container
- Virtual machines (VMware, VirtualBox, Xen)
  - Isolate entire operating systems on one physical machine
  - Example: AWS running different customers on same hardware
- Physical isolation (air gaps)
  - Separate physical hardware with no connection
  - Example: Nuclear power plant control systems
The tradeoff: Stronger isolation → More overhead/less convenience
Approach - Confinement (Hardware / Air Gap)
Physical Isolation - The Strongest Defense
What it is: Run applications on completely separate physical hardware with absolutely no connection
Visual:
Network 1 <<< AIR GAP >>> Network 2
┌──────────┐ (No wires, ┌──────────┐
│Computer A│ No WiFi, │Computer B│
│ App 1 │ No Bluetooth, │ App 2 │
│ │ Physical separation) │ │
└──────────┘ └──────────┘
Security properties:
- ✅ Strongest possible isolation - Cannot attack what you cannot reach
- ✅ No shared resources - Completely independent systems
- ✅ Simple to understand - Physical separation is obvious
Drawbacks:
- ❌ Very expensive - Need multiple complete computer systems
- ❌ Difficult to manage - No remote access, must physically visit
- ❌ Inconvenient data sharing - “Sneakernet” (walk USB drive between systems)
- ❌ Doesn’t scale - Can’t have thousands of air-gapped machines
When to use:
- Nuclear power plants - Control systems separated from internet
- Military networks - SIPRNET (SECRET) physically separate from NIPRNET (UNCLASSIFIED)
- Financial systems - Trading systems isolated from corporate network
- Critical infrastructure - When security is worth ANY cost
Real example - Stuxnet: Even with air gap, Iranian nuclear centrifuges were compromised via USB drive. Shows air gaps aren’t perfect but raise the bar significantly.
Approach - Confinement (Virtual Machines)
Virtual Machine Isolation
What it is: Isolate entire operating systems on a single physical machine
Architecture:
┌────────────────────────────────────────┐
│ OS1 | OS2 | OS3 │ ← Guest Operating Systems
│ app1 | app2 | app3 │
├────────────────────────────────────────┤
│ Virtual Machine Monitor (Hypervisor) │ ← Isolation layer
├────────────────────────────────────────┤
│ Hardware │
└────────────────────────────────────────┘
What each VM sees:
- Its own “CPU” (time-sliced from real CPU)
- Its own “RAM” (isolated portion of real RAM)
- Its own “disk” (file on host or dedicated partition)
- Its own “network card” (virtual NIC)
The illusion:
- From inside VM: looks exactly like a real computer
- Root user in VM doesn’t know it’s virtualized
- VM thinks it has exclusive access to hardware
Security model:
- ✅ Strong isolation (separate OS instances)
- ✅ Can run different OSes (Windows + Linux on same hardware)
- ✅ Compromise of one VM shouldn’t affect others
We’ll dive deeper into VMs later…
Approach - Confinement (System Call Interposition / Containers)
Process-Level Isolation
What it is: Isolate individual processes within a single operating system
Architecture:
┌─────────────────────────────────┐
│ Operating System │
│ │
│ ┌──────────┐ ┌─────────┐ │
│ │Process 1 │ │Process 2│ │ ← Isolated processes
│ │(isolated)│ │(normal) │ │
│ └──────────┘ └─────────┘ │
└─────────────────────────────────┘
How it works:
- Monitor and filter system calls from isolated process
- Block dangerous operations (file access, network, etc.)
- Allow safe operations (computation, limited I/O)
Compared to VMs:
- ✅ Much lighter weight (no separate OS)
- ✅ Faster startup (milliseconds vs minutes)
- ✅ Less memory (MBs vs GBs)
- ❌ Weaker isolation (shared kernel)
Modern implementation: Containers (Docker, Kubernetes)
We’ll explore this in detail after covering system calls…
System Calls - Going from User to OS Code
Understanding System Calls (Foundation for SCI)
Before we can understand System Call Interposition, we need to understand what system calls are.
What is a system call?
- A system call is how user programs ask the operating system to do privileged operations.
Why programs can’t do these directly:
Programs run in user mode with restricted privileges:
- Cannot directly access hardware (disk, network, devices)
- Cannot access other processes’ memory
- Cannot modify kernel memory
- Cannot access protected system files
The OS kernel runs in kernel mode with full privileges:
- Can access all hardware
- Can access all memory
- Can perform privileged operations
Programs must ASK the kernel through system calls
Common System Calls
File operations:
```c
int fd = open("/home/user/file.txt", O_RDONLY);  // Open file
ssize_t n = read(fd, buffer, 100);               // Read data
write(fd, data, 50);                             // Write data
close(fd);                                       // Close file
```
Process operations:
```c
pid_t pid = fork();        // Create new process
exec("/bin/ls", args);     // Execute program
exit(0);                   // Terminate process
```
Network operations:
```c
int sock = socket(AF_INET, SOCK_STREAM, 0);  // Create socket
connect(sock, &addr, sizeof(addr));          // Connect to server
send(sock, data, len, 0);                    // Send data
```
Memory operations:
```c
void* ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, ...);  // Map memory
munmap(ptr, 4096);                                        // Unmap memory
```
How System Calls Work
User Program                         Kernel
─────────────────────────────────────────────────
1. Program calls open()
   ↓
2. Library wraps it:
   syscall(SYS_open, ...)
   ↓
3. Trap to kernel mode ──────→ 4. Kernel receives syscall
   (CPU switches modes)           ↓
                               5. Kernel checks permissions
                                  ↓
                               6. Performs operation
                                  ↓
8. Return to user mode ←────── 7. Returns result
   ↓
9. Program continues
Key insight: Every privileged operation goes through syscalls!
This is why we can control programs by filtering syscalls - if we block dangerous syscalls, we limit what program can do.
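The "library wraps it in syscall(...)" step can be observed directly. A minimal sketch, assuming x86-64 Linux (where syscall number 1 is write on that ABI), invoking the kernel through libc's generic syscall() entry point via ctypes:

```python
import ctypes

# Load the C library; its syscall() function is the generic trap-to-kernel entry
libc = ctypes.CDLL(None, use_errno=True)

SYS_write = 1          # x86-64 Linux syscall number for write()
STDOUT_FD = 1

msg = b"hello from a raw syscall\n"
# Equivalent to what printf()/fwrite() do underneath: trap into kernel mode
n = libc.syscall(SYS_write, STDOUT_FD, msg, len(msg))
print("bytes written by kernel:", n)
```

The return value is the kernel's result (number of bytes written), exactly what a monitor sitting on the syscall boundary would see.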
Achieving Isolation (Reference Monitor)
The Abstract Security Model
Reference Monitor = Abstract design pattern for security
Think of it as a security guard at a checkpoint:
┌─────────────────┐
│ Isolated Process│
│ (Untrusted) │
└────────┬────────┘
│ "Can I open /etc/passwd?"
         ↓
┌────────────────────────────────────┐
│  ┌──────────────────────────────┐  │
│  │  ┌────────────────────────┐  │  │
│  │  │   Reference Monitor    │  │  │
│  │  └────────────────────────┘  │  │
│  │ Trusted Computing Base (TCB) │  │
│  └──────────────────────────────┘  │
│      Operating System Kernel       │
└────────────────────────────────────┘
Three Required Properties
For a reference monitor to provide security:
1. Complete Mediation - “Every request must go through the guard”
Must always be invoked: Every application request must be mediated
What this means:
- NO backdoor paths
- NO bypass mechanisms
- EVERY access checked
Bad example:
Prison with front gate security but unlocked back door
→ Security guard is useless
→ Complete mediation violated
Good example:
OS kernel checks EVERY file access
Program cannot directly access disk
Must go through kernel
→ Complete mediation achieved
2. Tamperproof - “Guard can’t be bribed or killed”
Cannot be killed:
- Reference monitor must run at higher privilege than monitored process
- If monitor killed, monitored process must also die
- Monitor must be in protected memory
Cannot be modified:
- Isolated program cannot change monitor’s code
- Isolated program cannot change monitor’s policy
- Even with exploit, cannot disable security
How enforced:
- Monitor runs in kernel mode (protected)
- Program runs in user mode (restricted)
- Hardware (CPU protection rings, MMU) enforces separation
3. Small and Verifiable - “Simple guards make fewer mistakes”
Must be small enough to analyze and validate
Why size matters:
Complex code = More bugs
More bugs = More vulnerabilities
More vulnerabilities = Weaker security
Example:
Bad: 100,000 lines of monitoring code
→ Too complex to audit
→ Likely contains exploitable bugs
Good: 1,000 lines of monitoring code
→ Security experts can review completely
→ Can prove correctness
→ Fewer attack vectors
Part of Trusted Computing Base
By “trusted” we mean: “If it isn’t trustworthy, we’re vulnerable”
The reference monitor is part of the TCB (Trusted Computing Base) - all the code we MUST trust for security.
We’ll explore TCB more in the next slide…
Need for Trusting an Operating System (TCB)
Why Must We Trust the OS?
Question: Why do we need to trust the operating system?
Answer: Because the OS enforces ALL security!
What the OS controls:
- File permissions (who can read/write files)
- Process isolation (programs can’t access each other’s memory)
- Network access (who can create sockets)
- Hardware access (who can access devices)
If OS is compromised → All security fails
Trusted Computing Base (TCB)
Definition: The TCB is all the code and hardware that, if compromised, breaks your security.
For our isolation systems, TCB includes:
- Operating system kernel
- Hypervisor (for VMs)
- Reference monitor implementation
- Device drivers (unfortunately - they’re buggy!)
- Hardware (CPU, MMU, memory protection)
TCB Requirements
To be trustworthy, TCB must have:
1. Tamper-proof
- Cannot be modified by untrusted code
- Protected memory, privileged execution
2. Complete mediation
- Checks every access
- No bypass mechanisms
3. Correct
- Actually implements security policy properly
- No bugs that allow security bypass
Reference Monitors = Special TCBs
Official definition: TCBs that meet all three requirements are called Reference Monitors
The relationship:
- Reference Monitor = The ideal requirements (what we want)
- TCB = The actual implementation (what we built)
- We WANT our TCB to be a reference monitor
- Reality: TCB often has bugs and falls short of the ideal
Security goal: Make TCB as small as possible while meeting reference monitor requirements
Approach - UNIX chroot System Call
Now we see a SIMPLE (but flawed) isolation technique before learning the correct approaches.
What is chroot?
chroot = “change root directory”
Changes what a process sees as the root directory (/)
Normal filesystem:
/ ← Real root
├── home/user/documents/
├── etc/passwd ← Sensitive!
└── tmp/
After chroot /tmp/jail:
Process sees:
/ ← Now /tmp/jail
├── bin/
├── lib/
└── app/
Process CANNOT see:
/home/user/documents/ ← Outside jail
/etc/passwd ← Outside jail
How to Use chroot
Requirements:
- Must be root to use chroot
- Must create minimal filesystem in jail first
Setup:
# Create jail structure
sudo mkdir -p /tmp/jail/{bin,lib,etc}
# Copy necessary binaries
sudo cp /bin/bash /tmp/jail/bin/
sudo cp /bin/ls /tmp/jail/bin/
# Copy required libraries
sudo cp /lib/x86_64-linux-gnu/libc.so.6 /tmp/jail/lib/
# Enter jail
sudo chroot /tmp/jail /bin/bash
# Drop to unprivileged user
su guest
# Run untrusted program
./app

Security Property
Process cannot access files outside jail because it cannot name them
Example:
# Inside jail, process tries:
cat /etc/passwd
# What happens:
# "/" is interpreted as "/tmp/jail"
# Actually opens: /tmp/jail/etc/passwd
# Real /etc/passwd is never touched

Early Uses
Common uses:
- FTP servers - Prevent users accessing system files
- Test environments - Install software without affecting system
- Build environments - Clean compilation
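The "cannot name them" property rests on path translation: every requested path is interpreted relative to the jail root. A minimal sketch of that translation (jail_resolve is a hypothetical helper for illustration, not a real API):

```python
import os.path

def jail_resolve(jail_root, requested):
    """Resolve a path the way a correct chroot does: treat the jail root as '/'.

    '..' components are normalized against a virtual root FIRST, so a
    request can never name anything above the jail.
    """
    virtual = os.path.normpath("/" + requested.lstrip("/"))
    return os.path.normpath(jail_root + virtual)

print(jail_resolve("/tmp/jail", "/etc/passwd"))       # /tmp/jail/etc/passwd
print(jail_resolve("/tmp/jail", "../../etc/passwd"))  # still /tmp/jail/etc/passwd
```

Note the ordering: normalize first, then prepend the jail root. Early implementations that got this wrong are exactly what the next section's escapes exploit.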
But is it secure?
Spoiler: NO! We’ll see why in upcoming slides…
Escaping from Jails
chroot is NOT Secure!
While chroot provides filesystem isolation, it has MANY escape mechanisms.
Early escapes using relative paths

Example: early implementations resolved ".." past the jail root:
fopen("../../etc/passwd", "r") =>
fopen("/tmp/guest/../../etc/passwd", "r") => opens the real /etc/passwd

chroot should only be executable by root
Why? Otherwise a jailed app can:
- Create a dummy file /aaa/etc/passwd
- Run chroot "/aaa"
- Run su root to become root (su now consults the attacker's fake passwd file)
This was a bug in Ultrix 4.0.
Many Ways to Evade chroot Isolation
Escape Technique 1: Device Files
Attack:
# Inside chroot jail:
mknod /dev/sda b 8 0 # Create block device for hard drive
dd if=/dev/sda of=dump # Read raw disk!

Why it works:
- Device major/minor numbers are GLOBAL (shared namespace)
- chroot doesn’t isolate device namespace
- Raw disk access bypasses file permissions entirely
Result: Can read ANY data on disk, even outside jail
Escape Technique 2: Send Signals
Attack:
# Inside jail, kill process outside jail:
kill -9 1234

Why it works:
- Process IDs (PIDs) are GLOBAL
- chroot doesn’t isolate PID namespace
- Can affect processes outside jail
Escape Technique 3: Reboot System
Attack:
# If somehow have capability:
reboot

Why it works:
- Reboot affects ENTIRE system
- chroot doesn’t restrict system-wide operations
Escape Technique 4: Network Ports
Attack:
# Bind to privileged port:
nc -l 80

Why it works:
- Network namespace is SHARED
- Port numbers are global
- Can block legitimate services
Summary: What chroot Does NOT Isolate
✅ chroot isolates: Filesystem paths
❌ chroot does NOT isolate:
- Process IDs
- Network
- Devices
- System calls
- Users/Groups
- IPC mechanisms
Conclusion: chroot alone is INSECURE!
Modern solution: Containers (covered later) use chroot PLUS many other isolation mechanisms
FreeBSD Jail
Improvement Over chroot
FreeBSD jail extends chroot with additional isolation:
What jail adds:
- Hostname isolation (each jail has own hostname)
- IP address isolation (can only bind to sockets with specific IP address & authorized ports)
- Process isolation (can only communicate with processes inside jail)
- Root restrictions (root in jail ≠ root on host; cannot load kernel modules)
To run:
# Create jail
jail jail-path hostname IP-addr cmd

But still has problems…
Problems with chroot and jail
Remaining Issues
- All or nothing access to parts of file system
- Inappropriate for apps like a web browser
- Needs read access to files outside of jail (e.g. sending attachments in Gmail)
Approach - System Call Interposition
The Core Idea
Observation: To damage host system, app MUST make system calls
What malware wants to do:
- Delete/overwrite files → needs unlink(), open(), write()
- Launch network attacks → needs socket(), bind(), connect()
- Install persistence → needs disk writes
- Steal data → needs file/network access
Key insight: If we control syscalls, we control what program can do!
System Call Interposition Strategy
Idea: Monitor app’s system calls and block unauthorized ones
Program wants: Monitor checks: Action:
──────────────────────────────────────────────────────────
open("/etc/passwd") → Policy: DENY → Block/Kill
open("/tmp/myfile") → Policy: ALLOW → Let through
socket(...) → Policy: NO network → Block
write(stdout, "hi") → Policy: ALLOW → Let through
Implementation Options
1. Completely kernel space (SELinux)
- Filtering built into OS kernel
- Very secure, can’t bypass
- Complex to configure
2. Completely user space (program shepherding)
- Separate monitor process
- More portable
- Can have race conditions
3. Hybrid (Systrace)
- Monitor in user space
- Enforcement in kernel
- Balance of flexibility and security
Let’s see how to implement this…
Implementing System Call Interposition (ptrace)
Extra resource: MITRE ATT&CK T1055.008 - Process Injection: Ptrace System Calls
Naive Approach: ptrace
ptrace = Unix system call for process tracing (designed for debugging)
How it works:
┌────────────────────────────────────┐
│ User Space │
│ │
│ ┌──────────┐ ┌───────────┐ │
│ │ Monitor │ │ Program │ │
│ │ Process │ │(monitored)│ │
│ └────┬─────┘ └────↑──────┘ │
│ │ | │
│ │ ptrace() │ syscall │
└───────┼─────────────────┼──────────┘
│ │
┌───────↓─────────────────┼─────────┐
│ └-----------------┘ │
│ OS Kernel │
│ Monitor wakes when syscall made │
└───────────────────────────────────┘
Process:
- Monitor attaches to program with ptrace()
- Program makes syscall
- Kernel stops program and wakes monitor
- Monitor checks if syscall allowed
- Monitor either allows it or kills the program
Seems reasonable… but has SEVERE problems!
System Call Interposition Policies
Before discussing ptrace problems, let’s see what policies look like:
Example Policy File
path allow /tmp/*
path deny /etc/passwd
path deny /etc/shadow
network deny all
How it works:
open("/tmp/file") → Matches "allow /tmp/*" → ALLOW
open("/etc/passwd") → Matches "deny /etc/passwd" → DENY
socket(...) → Matches "deny all network" → DENY
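A toy first-match evaluator over a policy of this shape makes the matching concrete (the in-memory representation is assumed for illustration, not a real tool's format):

```python
import fnmatch

# Assumed in-memory form of the policy file shown above
POLICY = [
    ("path",    "allow", "/tmp/*"),
    ("path",    "deny",  "/etc/passwd"),
    ("path",    "deny",  "/etc/shadow"),
    ("network", "deny",  "all"),
]

def check(kind, target):
    """First matching rule wins; anything unmatched is denied by default."""
    for rule_kind, action, pattern in POLICY:
        if rule_kind != kind:
            continue
        if pattern == "all" or fnmatch.fnmatch(target, pattern):
            return action
    return "deny"   # default deny

print(check("path", "/tmp/file"))     # allow
print(check("path", "/etc/passwd"))   # deny
print(check("network", "socket"))     # deny
```

The hard part, as the next subsection shows, is not evaluating such a policy but writing one that is both secure and permissive enough.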
The Policy Specification Problem
Challenge 1: Auto-generating policies
Idea: Run program on “good” inputs, learn what it does, generate policy
Problems:
- What are “good” inputs?
- Might miss rare legitimate behaviors
- Attacker could poison training data
Challenge 2: Asking the user
Show popup: “Program wants to open /etc/passwd. Allow?”
Problems:
- Users don’t understand implications
- Alert fatigue → click “Allow” without reading
- Not practical for servers
Why This Is Hard
For complex apps (browsers, office suites):
- Access patterns unpredictable
- Need many capabilities
- Hard to write policy that’s both:
- Secure (blocks attacks)
- Permissive (doesn’t break features)
This difficulty in choosing policy is why syscall interposition isn’t widely used for complex apps
Complications
Issues with ptrace Monitoring
Fork handling:
- If monitored app forks, monitor must also fork
- Must track parent-child relationships
- Complex coordination
Crash handling:
- If monitor crashes, app must be killed immediately
- Can’t allow unmonitored execution
- Requires kernel support
State synchronization:
Monitor must track ALL OS state that affects syscall interpretation:
State to track:
- Current working directory (CWD)
- User ID (UID, EUID, GID)
- Open file descriptors
- Environment variables
Example why CWD matters:
Program does: Monitor must remember:
──────────────────────────────────────────────
chdir("/tmp") CWD = /tmp
open("passwd") → Interprets as /tmp/passwd ✓
chdir("/etc") CWD = /etc
open("passwd") → Interprets as /etc/passwd ✓
If monitor loses track → misinterprets paths!
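The bookkeeping above can be sketched as a tiny state tracker (MonitorState is illustrative, not a real API):

```python
import os.path

class MonitorState:
    """Mirror the kernel state a user-space monitor must track to
    interpret path arguments the same way the kernel will."""
    def __init__(self):
        self.cwd = "/"

    def on_chdir(self, path):
        # Update tracked CWD exactly when the program calls chdir()
        self.cwd = self.resolve(path)

    def resolve(self, path):
        # Relative paths are interpreted against the tracked CWD
        if os.path.isabs(path):
            return os.path.normpath(path)
        return os.path.normpath(os.path.join(self.cwd, path))

m = MonitorState()
m.on_chdir("/tmp")
print(m.resolve("passwd"))   # /tmp/passwd
m.on_chdir("/etc")
print(m.resolve("passwd"))   # /etc/passwd
```

Miss a single chdir() and every subsequent relative path is judged against the wrong directory.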
Problems with ptrace
Why ptrace Fails for Security
Problem 1: All or Nothing
- Can only trace ALL syscalls or NONE
- Cannot selectively trace just dangerous ones
- Inefficient (traces every read(), write(), close())
Problem 2: Cannot Abort Cleanly
- Can only KILL entire program
- Cannot deny individual syscalls gracefully
- Too harsh response
Problem 3: TOCTOU Race Condition ⚠️ CRITICAL SECURITY BUG
TOCTOU = Time-Of-Check, Time-Of-Use
The vulnerability:
Timeline of Attack:
1. Attacker creates: me → mydata.txt (symlink)
2. Program calls: open("me")
3. Monitor wakes: Checks "me" → points to mydata.txt
Decision: ALLOW ✓
4. [ATTACKER STRIKES] Changes: me → /etc/passwd
5. Kernel executes: open("me") → Opens /etc/passwd!
❌ SECURITY VIOLATED
What went wrong: Check (step 3) and use (step 5) were NOT atomic. Attacker changed target in between.
Why this is critical:
- Can bypass ANY security check
- Classic vulnerability in many systems
- Fundamental problem with user-space monitoring
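The race can be reproduced with an ordinary symlink. A self-contained sketch (names mirror the timeline above; /etc/passwd is stood in by a local file so the demo is harmless):

```python
import os, tempfile

jail = tempfile.mkdtemp()
safe   = os.path.join(jail, "mydata.txt")
secret = os.path.join(jail, "passwd")      # stand-in for /etc/passwd
with open(safe, "w") as f:
    f.write("harmless")
with open(secret, "w") as f:
    f.write("root:x:0:0")

link = os.path.join(jail, "me")
os.symlink(safe, link)

# 1. Monitor's CHECK: the link points somewhere allowed
assert os.readlink(link) == safe           # decision: ALLOW

# 2. Attacker strikes between check and use
os.remove(link)
os.symlink(secret, link)

# 3. Kernel's USE: open() follows the NEW target
with open(link) as f:
    data = f.read()
print(data)   # root:x:0:0 -- the check was bypassed
```

In a real attack the swap happens in a tight loop racing the monitor; here it is done sequentially only to make the check/use gap visible.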
Conclusion: ptrace was designed for debugging, NOT security!
SCI in Linux: seccomp-bpf
SCI = System Call Interposition; seccomp-bpf = Secure Computing with Berkeley Packet Filter
This solves ALL the ptrace problems!
How It’s Different
1. Filter runs IN THE KERNEL (not user space)
- ✅ No TOCTOU races (check and use are atomic)
- ✅ No state synchronization issues
- ✅ Perfect state tracking
2. Filter written in BPF language (compiled, efficient)
- ✅ Efficient (runs in kernel)
- ✅ Can selectively filter (only check specific syscalls)
- ✅ No overhead for allowed syscalls
3. Process sets its OWN filter (before doing dangerous things)
- ✅ No separate monitor process
- ✅ No fork complications
- ✅ Filter cannot be removed once set
- ✅ If the program calls execve, all filters are preserved
Visual Architecture
┌──────────────────────────────────┐
│ User Space │
│ │
│ Chrome Renderer Process: │
│ 1. Sets filter │
│ 2. Processes web page │
│ 3. Exploit tries: open("/etc/passwd")│
│ ↓ │
└────────────────┼─────────────────┘
↓
┌────────────────┼─────────────────┐
│ OS Kernel │ │
│ ↓ │
│ Seccomp-BPF Filter │
│ - Check: open() syscall │
│ - Not in whitelist! │
│ - Action: KILL process │
└──────────────────────────────────┘
Used in production:
- Chrome/Chromium browser tabs
- Docker containers
- Android apps
- systemd services
BPF Filters (Policy Programs)
How BPF Filters Work
BPF program:
- Small program that runs for each syscall
- Returns: ALLOW, DENY, or KILL
  - SECCOMP_RET_ALLOW = allow syscall
  - SECCOMP_RET_ERRNO = return specified error to caller
  - SECCOMP_RET_KILL = kill process
- Written in restricted language (safe, cannot loop forever)
Structure:
For each syscall:
1. Load syscall number
2. Check against whitelist/blacklist
3. Return action (ALLOW/DENY/KILL)
Installing a BPF Filter
Two Required Calls (Must Be In This Order)
1. prctl(PR_SET_NO_NEW_PRIVS, 1)
- Purpose: Prevents privilege escalation
- Effect: Disables set-UID/set-GID bits on subsequent execve() calls
- Why needed: Without this, attacker could exec a set-UID root binary and bypass the filter
- Must come FIRST - kernel won’t allow filter installation without this
2. prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy)
- Purpose: Installs the syscall filter
- Effect: Every syscall from now on is checked against the BPF policy
- Irreversible: Once set, filter CANNOT be removed (even by root)
- Point of no return
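Assuming Linux, the first call can be tried from Python via ctypes (option constants copied from <linux/prctl.h>; the irreversible filter-install step is shown but left commented out):

```python
import ctypes

libc = ctypes.CDLL(None, use_errno=True)
PR_SET_NO_NEW_PRIVS = 38    # from <linux/prctl.h>
PR_SET_SECCOMP      = 22
SECCOMP_MODE_STRICT = 1     # simplest mode: only read/write/_exit/sigreturn

# Step 1 (must come first): renounce privilege gain across execve()
ret = libc.prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)
print("no_new_privs set:", ret == 0)

# Step 2 would install the filter -- permanent once set, so commented out:
# libc.prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT, 0, 0, 0)
# open("file.txt", "w")   # would now get the process KILLED
```

Step 1 needs no privileges at all, which is exactly the point: any process can voluntarily restrict itself before touching untrusted input.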
The Example

```c
int main(int argc, char **argv) {
  prctl(PR_SET_NO_NEW_PRIVS, 1);                           // Block privilege escalation
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy); // Install filter
  fopen("file.txt", "w");                // Calls open() syscall
  printf("... will not be printed.\n");  // Never executes - process killed
}
```

What happens:
- fopen() internally calls the open() syscall for writing
- BPF filter checks the policy
- Policy blocks open() for write
- Process is immediately killed
- printf() never runs

Key Properties
- Voluntary self-restriction - Process restricts itself
- Kernel-enforced - No TOCTOU races (unlike ptrace)
- Cannot be removed - Permanent once installed
- Inherited by children - fork() passes filter to child processes

Why This Order Matters

Correct:
1. NO_NEW_PRIVS → 2. Install filter → 3. Restricted execution ✓
Wrong:
Install filter first → Kernel rejects (error) ✗
Without NO_NEW_PRIVS:
Attacker could execve("/usr/bin/passwd") → runs as root → bypasses filter ✗

Docker - Isolating Containers Using seccomp-bpf
Modern Container Isolation
Container: OS-level isolation creating multiple isolated userspace instances
Docker uses multiple isolation mechanisms together:
What Engineers Should Know About Container Isolation
1. Namespaces - What You Can See
PID Namespace: Container sees PIDs 1,2,3 (own processes)
Network Namespace: Container has own IP, ports
Mount Namespace: Container has own filesystem view
User Namespace: Root in container ≠ root on host
IPC Namespace: Isolated shared memory
2. Cgroups - What Resources You Can Use
Memory cgroup: Limit: 512MB RAM
CPU cgroup: Limit: 50% of one core
Disk I/O: Limit: 10MB/s write
3. Seccomp-BPF - What Syscalls You Can Make
Default Docker filter blocks ~40 dangerous syscalls:
❌ ptrace - Can't debug processes
❌ reboot - Can't reboot host
❌ mount - Can't mount filesystems
❌ setns - Can't join other namespaces
Docker Architecture
┌───────────────────────────────────┐
│ App 1 │ App 2 │ App 3 │ ← Containers
└───────────────────────────────────┘
┌───────────────────────────────────┐
| Docker Engine |
└───────────────────────────────────┘
┌───────────────────────────────────┐
│ Host OS (Single Kernel) │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│ Hardware │
└───────────────────────────────────┘
Key features:
- ✅ Lightweight (share kernel)
- ✅ Fast startup (seconds)
- ✅ Good isolation (multiple mechanisms)
- ❌ All containers must use same OS type
- ❌ Shared kernel = shared vulnerabilities
Docker Syscall Filtering
Custom Seccomp Filters in Docker
Command:
docker run --security-opt="seccomp=filter.json" nginx

Example filter.json
{
"defaultAction": "SCMP_ACT_ERRNO", // Deny by default
"syscalls": [
{
"names": ["accept", "read", "write"],
"action": "SCMP_ACT_ALLOW" // allow (whitelist)
},
{
"names": ["socket"],
"action": "SCMP_ACT_ALLOW",
"args": [
{
"index": 0, // First argument
"value": 2, // AF_INET (IPv4)
"op": "SCMP_CMP_EQ" // Must equal
}
]
},
{
"names": ["ptrace", "reboot"],
"action": "SCMP_ACT_KILL" // Kill immediately
}
]
}

What this does:
- ✅ Allows: accept(), read(), write()
- ✅ Allows: socket() only for IPv4
- ❌ Kills process if: ptrace() or reboot()
- ❌ Returns error for: anything else
More Docker Confinement Flags
Additional Security Layers
1. Run as unprivileged user:
docker run --user nginx nginx
- Process runs as the nginx user, not root
- Limited damage if escaped

2. Limit Linux capabilities:
docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx
- Removes all Linux capabilities
- Adds back only the ability to bind to privileged ports
- Root in container has minimal powers

3. Prevent process from becoming privileged:
docker run --security-opt=no-new-privileges:true nginx
- Blocks setuid elevation
- Even setuid binaries won't gain privileges

4. Resource limits:
docker run \
  --restart=on-failure:5 \
  --ulimit nofile=100 \
  --ulimit nproc=50 \
  nginx
- Max 5 restarts (prevents crash loops)
- Max 100 file descriptors
- Max 50 processes
Defense in depth: Multiple layers protect against different attacks!
Approach - Virtual Machines
Now we return to VMs for a deeper dive…
What is a Virtual Machine?
VM = emulate an entire computer (OS and all) running inside your real computer
Each VM thinks it has:
- Its own CPU
- Its own RAM
- Its own disk
- Its own network card
- Its own hardware
How? The hypervisor (VMM - virtual machine monitor) creates this illusion
Two Types of Hypervisors
Type 1 (Bare Metal):
┌─────────┬─────────┬─────────┐
│ VM 1 │ VM 2 │ VM 3 │
├─────────┴─────────┴─────────┤
│ Hypervisor │ ← Runs directly on hardware
├─────────────────────────────┤
│ Physical Hardware │
└─────────────────────────────┘
Examples: VMware ESXi, Xen, Hyper-V
Used: Cloud providers (AWS, Azure)
Type 2 (Hosted):
┌─────────┬─────────┐
│ VM 1 │ VM 2 │
├─────────┴─────────┤
│ Hypervisor │ ← Runs as application
├───────────────────┤
│ Host OS │
├───────────────────┤
│ Physical Hardware │
└───────────────────┘
Examples: VirtualBox, VMware Workstation
Used: Desktop users, developers
Security:
- Type 1: More secure (no host OS to attack)
- Type 2: More convenient (use with normal OS)
Why So Popular?
VMs in the 1960s:
- Back then there were few computers and many users
- VMs allowed many users to share a single computer
VMs from the 1970s to 2000: essentially non-existent
VMs since 2000:
- Too many computers & too few users
- e.g. printer server, mail server, web server, file server, database
- VMs heavily used in private & public clouds
Virtual Machine Security
Security Model
✅ We accept: malware CAN infect the guest OS and apps
- Attacker can get root in VM
- Can install keyloggers, rootkits in VM
- Can read all files in that VM
❌ We require: Malware CANNOT escape from VM
- Cannot infect host OS
- Cannot infect other VMs
- Cannot access hypervisor
Requirements for Security
Hypervisor must:
- Protect itself (cannot be accessed by guest)
- Isolate VMs (cannot see each other)
- Be bug-free (vulnerabilities = escapes)
The Driver Problem
Tension:
- Hypervisor kept simple (50K lines) ← Good!
- Device drivers complex (100K+ lines) ← Bad!
- Drivers run in privileged mode ← Dangerous!
Result:
- Bug in driver → compromise hypervisor
- Hypervisor compromised → ALL VMs compromised
- This weakens the “simple hypervisor” benefit
Problem - Covert Channels
Intentional Secret Communication
Covert Channel: Unintended communication channel between isolated components used to leak data
Scenario:
┌─────────────────────────────────────────┐
│ ┌───────────┐ ┌───────────┐ │
│ │ VM 1 │ │ VM 2 │ │
│ │ │ │ │ │
│ │secret doc │ │ │ │
│ │ ↓ │ │ │ │
│ │ malware ─────→ ←── listener │→Data│
│ │ │ covert │ │leak │
│ │ │channel │ │ │
│ └───────────┘ └───────────┘ │
│ Hypervisor/VMM │
└─────────────────────────────────────────┘
What’s happening:
- VM 1 has malware with stolen secrets
- VM 2 has attacker’s listener
- VMs supposed to be isolated (can’t send packets)
- But they COOPERATE using shared hardware
- Data leaks across security boundary
Key: This is INTENTIONAL cooperation between malware in BOTH VMs
Covert Channel Example
CPU Timing Attack
How it works:
Setup:
- Both VMs run on same physical CPU
- They compete for CPU time
- Can measure each other’s CPU usage by timing
Sending data (VM 1 malware):
# To send bit 1:
at 1:00:00 AM:
    for i in range(huge_number):
        x = x * x   # CPU intensive!
# To send bit 0:
at 1:00:00 AM:
    sleep(1)        # Do nothing

Receiving data (VM 2 listener):
at 1:00:00 AM:
    start = time()
    for i in range(huge_number):
        x = x * x
    duration = time() - start
    if duration > threshold:
        return 1  # Slow (VM 1 competing for CPU)
    else:
        return 0  # Fast (VM 1 idle)

Other Covert Channels
- File lock status - lock/unlock at specific times
- Cache contents - fill/empty cache to signal
- Interrupts - trigger interrupts at specific rates
- Memory bus - compete for memory bandwidth
- Disk I/O patterns - read/write patterns
- Many more…
The sad truth: “Impossible to eliminate all”
Why? Isolated components share physical hardware. Any shared resource = potential covert channel.
Bigger Problem - Side Channels
Unintentional Information Leakage
Side Channel: Unintentional leakage between isolated components
Key difference from covert channel:
- Covert: Malware in BOTH VMs cooperating
- Side: Only listener is malicious, victim is innocent
Scenario:
┌───────────┐ ┌───────────┐
│ VM 1 │ │ VM 2 │
│ │ │ │
│secret doc │ │ │
│ ↓ │ │ │
│Word proc │ │ │
│(innocent) │ │ │
│ ↓ │ │ │
│ accidental --→ ←-- listener │
│ leakage    │  │(observes) │
└────────────┘  └───────────┘
┌──────────────────────────────────────────┐
│ Hypervisor/VMM │
└──────────────────────────────────────────┘
What’s happening:
- VM 1 innocently processes secret document
- VM 2 has attacker’s listener observing
- VM 1 doesn’t know it’s being attacked
- Listener infers information from hardware usage patterns
Why it’s a “bigger problem”:
- Covert channel: attacker must compromise BOTH VMs (harder)
- Side channel: attacker only needs to compromise ONE VM (easier)
- The victim doesn’t need to run any malware
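Side channels are not unique to VMs sharing hardware; the same principle shows up in plain software. A classic example (not from the slides) is a password check that returns at the first mismatching byte, so the response time leaks how many leading bytes of a guess are correct. The sketch below contrasts that leaky version with a comparison whose running time is independent of the data.

```python
import hmac

def naive_check(secret: bytes, guess: bytes) -> bool:
    """Leaky: returns at the first mismatching byte, so timing reveals
    how long the correct prefix of `guess` is."""
    if len(secret) != len(guess):
        return False
    for s, g in zip(secret, guess):
        if s != g:
            return False
    return True

def constant_time_check(secret: bytes, guess: bytes) -> bool:
    """Safe: hmac.compare_digest examines every byte regardless of
    where the first mismatch occurs."""
    return hmac.compare_digest(secret, guess)
```

As with the VM case, the victim code is innocent: the leak is an unintentional by-product of how the computation uses a shared, observable resource (here, time).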
VM Isolation in Practice - Cloud
Cloud Computing Architecture
Type 1 Hypervisor (No Host OS):
┌───────────┬───────────┬───────────┬──────────┐
│User Apps │User Apps │User Apps │User Apps │
├───────────┼───────────┼───────────┼──────────┤
│Guest OS 1 │Guest OS 2 │Guest OS 3 │Guest OS n│
│(Kernel) │(Kernel) │(Kernel) │(Kernel) │
└───────────┴───────────┴───────────┴──────────┘
┌──────────────────────────────────────────────┐
│ Hypervisor/VMM (Xen, KVM, ESXi) │
│ BIOS/SMM │
└──────────────────────────────────────────────┘
┌──────────────────────────────────────────────┐
│ Physical Hardware │
└──────────────────────────────────────────────┘
Key points:
- Type 1 = No host OS (hypervisor runs directly on hardware)
- Used by AWS, Azure, Google Cloud
- More secure (smaller TCB)
- More efficient (no extra OS layer)
Security in cloud:
- VMs from different customers share hardware
- Hypervisor MUST isolate them
- But some information still leaks (side channels)
VM Isolation in Practice - End User
Qubes OS - Security-Focused Desktop
Philosophy: Everything runs in a separate VM
Example configurations shown in slides:
Configuration 1:
┌──────────────┬──────────────┬──────────────┐
│Disposable VM │ Work VM │ Personal VM │
│(Debian) │ (Windows) │ (Debian) │
│ │ │ │
│Sketchy PDFs │Work email │Personal email│
│Random links │Office docs │Social media │
│Auto-deleted │Slack/Teams │Shopping │
└──────────────┴──────────────┴──────────────┘
Xen Hypervisor
Hardware (Intel Xeon)
Configuration 2:
┌──────────────┬──────────────┬──────────────┐
│ Whonix VM │ Work VM │ Vault VM │
│(Debian) │ (Windows) │ (Debian) │
│ │ │ │
│All traffic │Normal work │Password mgr │
│goes through │activities │Crypto keys │
│TOR │ │NO INTERNET │
└──────────────┴──────────────┴──────────────┘
Xen Hypervisor
Hardware
Security benefits:
- Compromise of one VM doesn’t affect others
- Different VMs for different trust levels
- Disposable VMs for risky activities
Every Window Frame Identifies VM Source
Visual Security in Qubes
Color-coded window borders:
- Each VM has a distinct color border on its windows
- Red border = Untrusted/Disposable VM
- Yellow border = Work VM
- Green border = Personal VM
- Black border = Vault VM (most secure)
Why this matters:
- Instantly see which VM a window belongs to
- Prevents spoofing (malicious VM can’t fake border color)
- GUI VM ensures frames drawn correctly
- Helps user make security decisions
Example:
User sees:
- Login window with RED border → Fake! Don't enter real password
- Banking site with GREEN border → Expected VM ✓
GUI isolation:
- Separate GUI VM manages window system
- Prevents VMs from overlaying fake windows
- Enforces visual security policy
Confinement - Summary
Many Sandboxing Techniques
From strongest to weakest:
1. Physical air gap
   - Separate hardware, no connection
   - Strongest isolation, highest cost
2. Virtual air gap (hypervisor)
   - Separate OS instances via VMM
   - Strong isolation, moderate cost
3. System call interposition (SCI)
   - Filter syscalls (seccomp-BPF)
   - Fine-grained, low overhead
4. Software Fault Isolation (SFI)
   - Insert checks in compiled code
   - Application-level isolation
5. Application-specific sandboxing
   - JavaScript sandbox in browser
   - Language-level isolation
Complete Isolation Often Inappropriate
Programs need to communicate!
- Apps must access shared resources
- Need regulated interfaces for communication
- Reference monitors mediate this communication
Examples:
- Browser tabs → share bookmarks via browser process
- Containers → communicate via virtual network
- VMs → use shared folders with access control
Two Hardest Aspects
1. Specifying policy
- What can apps do and not do?
- Too restrictive → breaks functionality
- Too permissive → doesn’t stop attacks
- Auto-generation doesn’t work well
- User decisions unreliable
2. Preventing covert channels
- Fundamentally impossible while sharing hardware
- Can only make them slow/noisy
- Accept some risk for non-critical apps
- Physically separate extremely sensitive work
Final Summary: Key Takeaways
Choosing the Right Isolation Level
Decision factors:
- Threat model
- Performance requirements
- Trust in software
- Resource constraints
Core Principles
- Defense in Depth - Layer multiple techniques
- Least Privilege - Minimal permissions by default
- Complete Mediation - Check every access
- Simplicity - Keep TCB small
- Accept Limitations - Perfect isolation impossible
What We Learned
✅ Why isolation is needed
✅ Reference monitors and TCB
✅ VMs vs Containers vs Syscall filtering
✅ chroot is broken, seccomp-BPF works
✅ Docker security mechanisms
✅ Covert and side channels exist
THE END of lecture 5 finally!!