CS 3235 - System Security

January 27, 2026
Daniel Genkin


Running Untrusted Code

The Central Security Challenge

The fundamental question: How do we run potentially malicious programs without them destroying our system?

Why we need to run untrusted code:

  • You download a PDF from an unknown email attachment
  • You visit websites that run JavaScript in your browser
  • You install apps from app stores (how do you know they’re safe?)
  • You process user-uploaded files on your server
  • You run code from open-source repositories

The dilemma:

We WANT to:                      But we FEAR:
- Open PDFs                   →  Malicious code execution
- Browse websites             →  Data theft
- Install apps                →  System compromise
- Run plugins                 →  Privilege escalation

This lecture’s goal: Learn techniques to run untrusted code SAFELY through isolation and sandboxing.


Process Isolation

Evolution of Security Thinking

Traditional Thinking: Identify the Bad Guys

Old question: “How to run bad/untrustworthy programs safely?”

This assumed we could identify dangerous programs:

  • Programs from untrusted sites
  • Apps that process adversarial data (PDF viewers, browsers)
  • Honeypots (intentionally vulnerable systems to trap attackers)

Problem: You have to KNOW something is bad before protecting against it. What about:

  • Zero-day vulnerabilities?
  • Compromised trusted programs?
  • Supply chain attacks?

New Thinking: Trust Nothing

Modern question: “Be skeptical of ALL programs, isolate to achieve least privilege”

Key principles:

  1. Assume compromise - ANY program could be exploited
  2. Default deny - Give minimal permissions by default
  3. Limit damage - When (not if) compromise happens, contain it

Real example - Chrome Browser:

Old approach:
  "Chrome is from Google, it's trustworthy"
  → Give Chrome full system access
  → If exploited, attacker gets everything

New approach:
  "Assume Chrome will be exploited"
  → Each tab runs in isolated process
  → Limited syscalls (can't open files, create sockets)
  → If exploited, damage is contained

General Goal: Confinement

Confinement = Ensure misbehaving process cannot harm rest of system

Medical quarantine analogy:

  • Sick patient (compromised program) is isolated
  • Can’t spread disease (attacks) to others
  • But can still function (perform legitimate operations)
  • Medical staff (system services) interact through barriers

Spectrum of Isolation Levels

Can be implemented at many levels:

  1. System call interposition (AppArmor, SELinux)
    • Monitor and filter individual system calls
    • Example: Block process from accessing /etc/passwd
  2. Containers (Docker, Kubernetes)
    • Isolated userspace instances sharing one kernel
    • Example: Each microservice in own container
  3. Virtual machines (VMware, VirtualBox, Xen)
    • Isolate entire operating systems on one physical machine
    • Example: AWS running different customers on same hardware
  4. Physical isolation (Air gaps)
    • Separate physical hardware with no connection
    • Example: Nuclear power plant control systems

The tradeoff: Stronger isolation → More overhead/less convenience


Approach - Confinement (Hardware / Air Gap)

Physical Isolation - The Strongest Defense

What it is: Run applications on completely separate physical hardware with absolutely no connection

Visual:

Network 1          <<<  AIR GAP  >>>          Network 2
┌──────────┐      (No wires,              ┌──────────┐
│Computer A│       No WiFi,               │Computer B│
│  App 1   │       No Bluetooth,          │  App 2   │
│          │       Physical separation)   │          │
└──────────┘                              └──────────┘

Security properties:

  • Strongest possible isolation - Cannot attack what you cannot reach
  • No shared resources - Completely independent systems
  • Simple to understand - Physical separation is obvious

Drawbacks:

  • Very expensive - Need multiple complete computer systems
  • Difficult to manage - No remote access, must physically visit
  • Inconvenient data sharing - “Sneakernet” (walk USB drive between systems)
  • Doesn’t scale - Can’t have thousands of air-gapped machines

When to use:

  • Nuclear power plants - Control systems separated from internet
  • Military networks - SIPRNET (SECRET) physically separate from NIPRNET (UNCLASSIFIED)
  • Financial systems - Trading systems isolated from corporate network
  • Critical infrastructure - When security is worth ANY cost

Real example - Stuxnet: Even with air gap, Iranian nuclear centrifuges were compromised via USB drive. Shows air gaps aren’t perfect but raise the bar significantly.


Approach - Confinement (Virtual Machines)

Virtual Machine Isolation

What it is: Isolate entire operating systems on a single physical machine

Architecture:

┌────────────────────────────────────────┐
│     OS1     |     OS2    |      OS3    │ ← Guest Operating Systems
│   app1      |   app2     |    app3     │
├────────────────────────────────────────┤
│  Virtual Machine Monitor (Hypervisor)  │ ← Isolation layer
├────────────────────────────────────────┤
│           Hardware                     │
└────────────────────────────────────────┘

What each VM sees:

  • Its own “CPU” (time-sliced from real CPU)
  • Its own “RAM” (isolated portion of real RAM)
  • Its own “disk” (file on host or dedicated partition)
  • Its own “network card” (virtual NIC)

The illusion:

  • From inside VM: looks exactly like a real computer
  • Root user in VM doesn’t know it’s virtualized
  • VM thinks it has exclusive access to hardware

Security model:

  • ✅ Strong isolation (separate OS instances)
  • ✅ Can run different OSes (Windows + Linux on same hardware)
  • ✅ Compromise of one VM shouldn’t affect others

We’ll dive deeper into VMs later…


Approach - Confinement (System Call Interposition / Containers)

Process-Level Isolation

What it is: Isolate individual processes within a single operating system

Architecture:

┌─────────────────────────────────┐
│      Operating System           │
│                                 │
│   ┌──────────┐   ┌─────────┐    │
│   │Process 1 │   │Process 2│    │ ← Isolated processes
│   │(isolated)│   │(normal) │    │
│   └──────────┘   └─────────┘    │
└─────────────────────────────────┘

How it works:

  • Monitor and filter system calls from isolated process
  • Block dangerous operations (file access, network, etc.)
  • Allow safe operations (computation, limited I/O)

Compared to VMs:

  • ✅ Much lighter weight (no separate OS)
  • ✅ Faster startup (milliseconds vs minutes)
  • ✅ Less memory (MBs vs GBs)
  • ❌ Weaker isolation (shared kernel)

Modern implementation: Containers (Docker, Kubernetes)

We’ll explore this in detail after covering system calls…


System Calls - Going from User to OS Code

Understanding System Calls (Foundation for SCI)

Before we can understand System Call Interposition, we need to understand what system calls are.

What is a system call?

  • A system call is how user programs ask the operating system to do privileged operations.

Why programs can’t do these directly:

Programs run in user mode with restricted privileges:

  • Cannot directly access hardware (disk, network, devices)
  • Cannot access other processes’ memory
  • Cannot modify kernel memory
  • Cannot access protected system files

The OS kernel runs in kernel mode with full privileges:

  • Can access all hardware
  • Can access all memory
  • Can perform privileged operations

Programs must ASK the kernel through system calls

Common System Calls

File operations:

int fd = open("/home/user/file.txt", O_RDONLY);  // Open file
ssize_t n = read(fd, buffer, 100);               // Read data
write(fd, data, 50);                             // Write data
close(fd);                                       // Close file

Process operations:

pid_t pid = fork();           // Create new process
execv("/bin/ls", args);       // Execute program
exit(0);                      // Terminate process

Network operations:

int sock = socket(AF_INET, SOCK_STREAM, 0);       // Create socket
connect(sock, (struct sockaddr*)&addr, sizeof(addr));  // Connect to server
send(sock, data, len, 0);                         // Send data

Memory operations:

void* ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, ...);  // Map memory
munmap(ptr, 4096);                                         // Unmap memory

How System Calls Work

User Program                    Kernel
─────────────────────────────────────────
                                
1. Program calls open()
    ↓
2. Library wraps it
   syscall(SYS_open, ...)
    ↓
3. Trap to kernel mode ──────→ 4. Kernel receives syscall
   (CPU switches modes)            ↓
                                5. Kernel checks permissions
                                   ↓
                                6. Performs operation
                                   ↓
8. Return to user mode ←────── 7. Returns result
   ↓
9. Program continues

Key insight: Every privileged operation goes through syscalls!

This is why we can control programs by filtering syscalls - if we block dangerous syscalls, we limit what program can do.


Achieving Isolation (Reference Monitor)

The Abstract Security Model

Reference Monitor = Abstract design pattern for security

Think of it as a security guard at a checkpoint:

		┌─────────────────┐
		│ Isolated Process│
		│ (Untrusted)     │
		└────────┬────────┘
		         │ "Can I open /etc/passwd?"
		         ↓
┌────────────────↓───────────────────┐
│  ┌─────────────↓────────────────┐  │
│  │   ┌─────────↓────────────┐   │  │
│  │   │ Reference Monitor    │   │  │
│  │   └──────────────────────┘   │  │
│  │ Trusted Computing Base (TCB) │  │
│  └──────────────────────────────┘  │
│      Operating System Kernel       │
└────────────────────────────────────┘

Three Required Properties

For a reference monitor to provide security:

1. Complete Mediation - “Every request must go through the guard”

Must always be invoked: Every application request must be mediated

What this means:

  • NO backdoor paths
  • NO bypass mechanisms
  • EVERY access checked

Bad example:

Prison with front gate security but unlocked back door
→ Security guard is useless
→ Complete mediation violated

Good example:

OS kernel checks EVERY file access
Program cannot directly access disk
Must go through kernel
→ Complete mediation achieved

2. Tamperproof - “Guard can’t be bribed or killed”

Cannot be killed:

  • Reference monitor must run at higher privilege than monitored process
  • If monitor killed, monitored process must also die
  • Monitor must be in protected memory

Cannot be modified:

  • Isolated program cannot change monitor’s code
  • Isolated program cannot change monitor’s policy
  • Even with exploit, cannot disable security

How enforced:

  • Monitor runs in kernel mode (protected)
  • Program runs in user mode (restricted)
  • Hardware (CPU protection rings, MMU) enforces separation

3. Small and Verifiable - “Simple guards make fewer mistakes”

Must be small enough to analyze and validate

Why size matters:

Complex code = More bugs
More bugs = More vulnerabilities
More vulnerabilities = Weaker security

Example:

Bad: 100,000 lines of monitoring code
  → Too complex to audit
  → Likely contains exploitable bugs

Good: 1,000 lines of monitoring code
  → Security experts can review completely
  → Can prove correctness
  → Fewer attack vectors

Part of Trusted Computing Base

By “trusted” we mean: “If it isn’t trustworthy, we’re vulnerable”

The reference monitor is part of the TCB (Trusted Computing Base) - all the code we MUST trust for security.

We’ll explore TCB more in the next slide…


Need for Trusting an Operating System (TCB)

Why Must We Trust the OS?

Question: Why do we need to trust the operating system?

Answer: Because the OS enforces ALL security!

What the OS controls:

  • File permissions (who can read/write files)
  • Process isolation (programs can’t access each other’s memory)
  • Network access (who can create sockets)
  • Hardware access (who can access devices)

If OS is compromised → All security fails

Trusted Computing Base (TCB)

Definition: The TCB is all the code and hardware that, if compromised, breaks your security.

For our isolation systems, TCB includes:

  • Operating system kernel
  • Hypervisor (for VMs)
  • Reference monitor implementation
  • Device drivers (unfortunately - they’re buggy!)
  • Hardware (CPU, MMU, memory protection)

TCB Requirements

To be trustworthy, TCB must have:

1. Tamper-proof

  • Cannot be modified by untrusted code
  • Protected memory, privileged execution

2. Complete mediation

  • Checks every access
  • No bypass mechanisms

3. Correct

  • Actually implements security policy properly
  • No bugs that allow security bypass

Reference Monitors = Special TCBs

Official definition: TCBs that meet all three requirements are called Reference Monitors

The relationship:

  • Reference Monitor = The ideal requirements (what we want)
  • TCB = The actual implementation (what we built)
  • We WANT our TCB to be a reference monitor
  • Reality: TCB often has bugs and falls short of the ideal

Security goal: Make TCB as small as possible while meeting reference monitor requirements


Approach - UNIX chroot System Call

Now we see a SIMPLE (but flawed) isolation technique before learning the correct approaches.

What is chroot?

chroot = “change root directory”

Changes what a process sees as the root directory (/)

Normal filesystem:

/                       ← Real root
├── home/user/documents/
├── etc/passwd          ← Sensitive!
└── tmp/

After chroot /tmp/jail:

Process sees:
/                       ← Now /tmp/jail
├── bin/
├── lib/
└── app/

Process CANNOT see:
/home/user/documents/   ← Outside jail
/etc/passwd             ← Outside jail

How to Use chroot

Requirements:

  • Must be root to use chroot
  • Must create minimal filesystem in jail first

Setup:

# Create jail structure
sudo mkdir -p /tmp/jail/{bin,lib,etc}
 
# Copy necessary binaries
sudo cp /bin/bash /tmp/jail/bin/
sudo cp /bin/ls /tmp/jail/bin/
 
# Copy required libraries
sudo cp /lib/x86_64-linux-gnu/libc.so.6 /tmp/jail/lib/
 
# Enter jail
sudo chroot /tmp/jail /bin/bash
 
# Drop to unprivileged user
su guest
 
# Run untrusted program
./app

Security Property

Process cannot access files outside jail because it cannot name them

Example:

# Inside jail, process tries:
cat /etc/passwd
 
# What happens:
# "/" is interpreted as "/tmp/jail"
# Actually opens: /tmp/jail/etc/passwd
# Real /etc/passwd is never touched

Early Uses

Common uses:

  1. FTP servers - Prevent users accessing system files
  2. Test environments - Install software without affecting system
  3. Build environments - Clean compilation

But is it secure?

Spoiler: NO! We’ll see why in upcoming slides…


Escaping from Jails

chroot is NOT Secure!

While chroot provides filesystem isolation, it has MANY escape mechanisms.

Early escapes used relative paths

Example

fopen("../../etc/passwd", "r") =>
   fopen("/tmp/guest/../../etc/passwd", "r")

chroot should only be executable by root

Why?

  • Otherwise jailed app can:
    • create dummy file /aaa/etc/passwd containing a root password the attacker knows
    • run chroot "/aaa"
    • run su root to become root

This was a bug in Ultrix 4.0


Many Ways to Evade chroot Isolation

Escape Technique 1: Device Files

Attack:

# Inside chroot jail:
mknod /dev/sda b 8 0      # Create block device for hard drive
dd if=/dev/sda of=dump    # Read raw disk!

Why it works:

  • Device major/minor numbers are GLOBAL (shared namespace)
  • chroot doesn’t isolate device namespace
  • Raw disk access bypasses file permissions entirely

Result: Can read ANY data on disk, even outside jail

Escape Technique 2: Send Signals

Attack:

# Inside jail, kill process outside jail:
kill -9 1234

Why it works:

  • Process IDs (PIDs) are GLOBAL
  • chroot doesn’t isolate PID namespace
  • Can affect processes outside jail

Escape Technique 3: Reboot System

Attack:

# If somehow have capability:
reboot

Why it works:

  • Reboot affects ENTIRE system
  • chroot doesn’t restrict system-wide operations

Escape Technique 4: Network Ports

Attack:

# Bind to privileged port:
nc -l 80

Why it works:

  • Network namespace is SHARED
  • Port numbers are global
  • Can block legitimate services

Summary: What chroot Does NOT Isolate

chroot isolates: Filesystem paths

chroot does NOT isolate:

  • Process IDs
  • Network
  • Devices
  • System calls
  • Users/Groups
  • IPC mechanisms

Conclusion: chroot alone is INSECURE!

Modern solution: Containers (covered later) use chroot PLUS many other isolation mechanisms


FreeBSD Jail

Improvement Over chroot

FreeBSD jail extends chroot with additional isolation:

What jail adds:

  • Hostname isolation (each jail has own hostname)
  • IP address isolation (can only bind to sockets with specific IP address & authorized ports)
  • Process isolation (can only communicate with processes inside jail)
  • Root restrictions (root in jail ≠ root on host; cannot load kernel modules)

To run:

# Create jail
jail jail-path hostname IP-addr cmd

But still has problems… (next slide)


Problems with chroot and jail

Remaining Issues

  • All or nothing access to parts of file system
  • Inappropriate for apps like a web browser
    • Needs read access to files outside of jail (e.g. sending attachments in Gmail)

Approach - System Call Interposition

The Core Idea

Observation: To damage host system, app MUST make system calls

What malware wants to do:

  • Delete/overwrite files → needs unlink(), open(), write()
  • Launch network attacks → needs socket(), bind(), connect()
  • Install persistence → needs disk writes
  • Steal data → needs file/network access

Key insight: If we control syscalls, we control what program can do!

System Call Interposition Strategy

Idea: Monitor app’s system calls and block unauthorized ones

Program wants:              Monitor checks:         Action:
──────────────────────────────────────────────────────────
open("/etc/passwd")    →    Policy: DENY        →  Block/Kill
open("/tmp/myfile")    →    Policy: ALLOW       →  Let through
socket(...)            →    Policy: NO network  →  Block
write(stdout, "hi")    →    Policy: ALLOW       →  Let through

Implementation Options

1. Completely kernel space (SELinux)

  • Filtering built into OS kernel
  • Very secure, can’t bypass
  • Complex to configure

2. Completely user space (program shepherding)

  • Separate monitor process
  • More portable
  • Can have race conditions

3. Hybrid (Systrace)

  • Monitor in user space
  • Enforcement in kernel
  • Balance of flexibility and security

Let’s see how to implement this…


Implementing System Call Interposition (ptrace)

Extra resource: MITRE ATT&CK T1055.008 Process Injection: Ptrace System Calls

Naive Approach: ptrace

ptrace = Unix system call for process tracing (designed for debugging)

How it works:

┌────────────────────────────────────┐
│         User Space                 │
│                                    │
│  ┌──────────┐      ┌───────────┐   │
│  │ Monitor  │      │  Program  │   │
│  │ Process  │      │(monitored)│   │
│  └────┬─────┘      └────↑──────┘   │
│       │                 |          │
│       │ ptrace()        │ syscall  │
└───────┼─────────────────┼──────────┘
        │                 │
┌───────↓─────────────────┼─────────┐
│       └-----------------┘         │
│         OS Kernel                 │
│  Monitor wakes when syscall made  │
└───────────────────────────────────┘

Process:

  1. Monitor attaches to program with ptrace()
  2. Program makes syscall
  3. Kernel stops program and wakes monitor
  4. Monitor checks if syscall allowed
  5. Monitor either allows or kills program

Seems reasonable… but has SEVERE problems!


System Call Interposition Policies

Before discussing ptrace problems, let’s see what policies look like:

Example Policy File

path allow /tmp/*
path deny /etc/passwd
path deny /etc/shadow
network deny all

How it works:

open("/tmp/file")    → Matches "allow /tmp/*"     → ALLOW
open("/etc/passwd")  → Matches "deny /etc/passwd" → DENY
socket(...)          → Matches "deny all network" → DENY

The Policy Specification Problem

Challenge 1: Auto-generating policies

Idea: Run program on “good” inputs, learn what it does, generate policy

Problems:

  • What are “good” inputs?
  • Might miss rare legitimate behaviors
  • Attacker could poison training data

Challenge 2: Asking the user

Show popup: “Program wants to open /etc/passwd. Allow?”

Problems:

  • Users don’t understand implications
  • Alert fatigue → click “Allow” without reading
  • Not practical for servers

Why This Is Hard

For complex apps (browsers, office suites):

  • Access patterns unpredictable
  • Need many capabilities
  • Hard to write policy that’s both:
    • Secure (blocks attacks)
    • Permissive (doesn’t break features)

This difficulty in choosing policy is why syscall interposition isn’t widely used for complex apps


Complications

Issues with ptrace Monitoring

Fork handling:

  • If monitored app forks, monitor must also fork
  • Must track parent-child relationships
  • Complex coordination

Crash handling:

  • If monitor crashes, app must be killed immediately
  • Can’t allow unmonitored execution
  • Requires kernel support

State synchronization:

Monitor must track ALL OS state that affects syscall interpretation:

State to track:
- Current working directory (CWD)
- User ID (UID, EUID, GID)
- Open file descriptors
- Environment variables

Example why CWD matters:

Program does:           Monitor must remember:
──────────────────────────────────────────────
chdir("/tmp")          CWD = /tmp
open("passwd")         → Interprets as /tmp/passwd ✓

chdir("/etc")          CWD = /etc
open("passwd")         → Interprets as /etc/passwd ✓

If monitor loses track → misinterprets paths!

Problems with ptrace

Why ptrace Fails for Security

Problem 1: All or Nothing

  • Can only trace ALL syscalls or NONE
  • Cannot selectively trace just dangerous ones
  • Inefficient (trace every read(), write(), close())

Problem 2: Cannot Abort Cleanly

  • Can only KILL entire program
  • Cannot deny individual syscalls gracefully
  • Too harsh response

Problem 3: TOCTOU Race Condition ⚠️ CRITICAL SECURITY BUG

TOCTOU = Time-Of-Check, Time-Of-Use

The vulnerability:

Timeline of Attack:

1. Attacker creates:  me → mydata.txt (symlink)

2. Program calls:     open("me")

3. Monitor wakes:     Checks "me" → points to mydata.txt
                     Decision: ALLOW ✓

4. [ATTACKER STRIKES] Changes:  me → /etc/passwd

5. Kernel executes:   open("me") → Opens /etc/passwd!
                     ❌ SECURITY VIOLATED

What went wrong: Check (step 3) and use (step 5) were NOT atomic. Attacker changed target in between.

Why this is critical:

  • Can bypass ANY security check
  • Classic vulnerability in many systems
  • Fundamental problem with user-space monitoring

Conclusion: ptrace was designed for debugging, NOT security!


SCI in Linux: seccomp-bpf

SCI = System Call Interposition
seccomp-BPF = Secure Computing with Berkeley Packet Filters

This solves ALL the ptrace problems!

How It’s Different

1. Filter runs IN THE KERNEL (not user space)

  • ✅ No TOCTOU races (check and use are atomic)
  • ✅ No state synchronization issues
  • ✅ Perfect state tracking

2. Filter written in BPF language (compiled, efficient)

  • ✅ Efficient (runs in kernel)
  • ✅ Can selectively filter (only check specific syscalls)
  • ✅ No overhead for allowed syscalls

3. Process sets its OWN filter (before doing dangerous things)

  • ✅ No separate monitor process
  • ✅ No fork complications
  • ✅ Filter cannot be removed once set
  • ✅ If the program calls execve(), all installed filters are preserved

Visual Architecture

┌───────────────────────────────────────┐
│        User Space                     │
│                                       │
│  Chrome Renderer Process:             │
│  1. Sets filter                       │
│  2. Processes web page                │
│  3. Exploit tries: open("/etc/passwd")│
│                ↓                      │
└────────────────┼──────────────────────┘
                 ↓
┌────────────────┼──────────────────────┐
│      OS Kernel │                      │
│                ↓                      │
│     Seccomp-BPF Filter                │
│     - Check: open() syscall           │
│     - Not in whitelist!               │
│     - Action: KILL process            │
└───────────────────────────────────────┘

Used in production:

  • Chrome/Chromium browser tabs
  • Docker containers
  • Android apps
  • systemd services

BPF Filters (Policy Programs)

What is BPF?

BPF (Berkeley Packet Filter) is a small in-kernel virtual machine originally designed for filtering network packets; seccomp reuses its restricted instruction set to run syscall filters safely inside the kernel.

How BPF Filters Work

BPF program:

  • Small program that runs for each syscall
  • Returns: ALLOW, DENY, or KILL
    • SECCOMP_RET_ALLOW = allow syscall
    • SECCOMP_RET_ERRNO = return specified error to caller
    • SECCOMP_RET_KILL = kill process
  • Written in restricted language (safe, cannot loop forever)

Structure:

For each syscall:
    1. Load syscall number
    2. Check against whitelist/blacklist
    3. Return action (ALLOW/DENY/KILL)

Installing a BPF Filter

Two Required Calls (Must Be In This Order)

1. prctl(PR_SET_NO_NEW_PRIVS, 1)

  • Purpose: Prevents privilege escalation
  • Effect: Disables set-UID/set-GID bits on subsequent execve() calls
  • Why needed: Without this, attacker could exec a set-UID root binary and bypass the filter
  • Must come FIRST - kernel won’t allow filter installation without this

2. prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy)

  • Purpose: Installs the syscall filter
  • Effect: Every syscall from now on is checked against the BPF policy
  • Irreversible: Once set, filter CANNOT be removed (even by root)
  • Point of no return

The Example

int main(int argc, char **argv) {
    prctl(PR_SET_NO_NEW_PRIVS, 1);              // Block privilege escalation
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy);  // Install filter

    fopen("file.txt", "w");                      // Calls open() syscall
    printf("... will not be printed.\n");        // Never executes - process killed
}

What happens:

  • fopen() internally calls the open() syscall for writing
  • The BPF filter checks the policy
  • The policy blocks open() for write
  • The process is immediately killed
  • printf() never runs

Key Properties

  • Voluntary self-restriction - Process restricts itself
  • Kernel-enforced - No TOCTOU races (unlike ptrace)
  • Cannot be removed - Permanent once installed
  • Inherited by children - fork() passes the filter to child processes

Why This Order Matters

Correct:
1. NO_NEW_PRIVS → 2. Install filter → 3. Restricted execution ✓

Wrong:
Install filter first → Kernel rejects (error) ✗

Without NO_NEW_PRIVS:
Attacker could execve("/usr/bin/passwd") → runs as root → bypasses filter ✗

Docker - Isolating Containers Using seccomp-bpf

Modern Container Isolation

Container: OS-level isolation creating multiple isolated userspace instances

Docker uses multiple isolation mechanisms together:

What Engineers Should Know About Container Isolation

1. Namespaces - What You Can See

PID Namespace:       Container sees PIDs 1,2,3 (own processes)
Network Namespace:   Container has own IP, ports
Mount Namespace:     Container has own filesystem view
User Namespace:      Root in container ≠ root on host
IPC Namespace:       Isolated shared memory

2. Cgroups - What Resources You Can Use

Memory cgroup:  Limit: 512MB RAM
CPU cgroup:     Limit: 50% of one core
Disk I/O:       Limit: 10MB/s write

3. Seccomp-BPF - What Syscalls You Can Make

Default Docker filter blocks ~40 dangerous syscalls:
❌ ptrace   - Can't debug processes
❌ reboot   - Can't reboot host
❌ mount    - Can't mount filesystems
❌ setns    - Can't join other namespaces

Docker Architecture

┌───────────────────────────────────┐
│    App 1  │  App 2  │  App 3      │ ← Containers
└───────────────────────────────────┘
┌───────────────────────────────────┐
|           Docker Engine           |
└───────────────────────────────────┘       
┌───────────────────────────────────┐
│      Host OS (Single Kernel)      │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│             Hardware              │
└───────────────────────────────────┘

Key features:

  • ✅ Lightweight (share kernel)
  • ✅ Fast startup (seconds)
  • ✅ Good isolation (multiple mechanisms)
  • ❌ All containers must use same OS type
  • ❌ Shared kernel = shared vulnerabilities

Docker Syscall Filtering

Custom Seccomp Filters in Docker

Command:

docker run --security-opt="seccomp=filter.json" nginx

Example filter.json

{
  "defaultAction": "SCMP_ACT_ERRNO",  // Deny by default
  
  "syscalls": [
    {
      "names": ["accept", "read", "write"],
      "action": "SCMP_ACT_ALLOW"        // allow (whitelist)
    },
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,        // First argument
          "value": 2,        // AF_INET (IPv4)
          "op": "SCMP_CMP_EQ" // Must equal
        }
      ]
    },
    {
      "names": ["ptrace", "reboot"],
      "action": "SCMP_ACT_KILL"          // Kill immediately
    }
  ]
}

What this does:

  • ✅ Allows: accept(), read(), write()
  • ✅ Allows: socket() only for IPv4
  • ❌ Kills process if: ptrace() or reboot()
  • ❌ Returns error for: anything else

More Docker Confinement Flags

Additional Security Layers

1. Run as unprivileged user:

docker run --user nginx nginx
  • Process runs as nginx user, not root
  • Limited damage if escaped

2. Limit Linux capabilities:

docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx
  • Removes all Linux capabilities
  • Adds back to allow binding to privileged ports
  • Root in container has minimal powers

3. Prevent process from becoming privileged:

docker run --security-opt=no-new-privileges:true nginx
  • Blocks setuid elevation
  • Even setuid binaries won’t gain privileges

4. Resource limits:

docker run \
  --restart=on-failure:5 \
  --ulimit nofile=100 \
  --ulimit nproc=50 \
  nginx
  • Max 5 restarts (prevents crash loops)
  • Max 100 file descriptors
  • Max 50 processes

Defense in depth: Multiple layers protect against different attacks!


Approach - Virtual Machines

Now we return to VMs for a deeper dive…

What is a Virtual Machine?

VM = emulate an entire computer (OS and all) running inside your real computer

Each VM thinks it has:

  • Its own CPU
  • Its own RAM
  • Its own disk
  • Its own network card
  • Its own hardware

How? The hypervisor (VMM - virtual machine monitor) creates this illusion

Two Types of Hypervisors

Type 1 (Bare Metal):

┌─────────┬─────────┬─────────┐
│  VM 1   │  VM 2   │  VM 3   │
├─────────┴─────────┴─────────┤
│         Hypervisor          │ ← Runs directly on hardware
├─────────────────────────────┤
│     Physical Hardware       │
└─────────────────────────────┘

Examples: VMware ESXi, Xen, Hyper-V
Used: Cloud providers (AWS, Azure)

Type 2 (Hosted):

┌─────────┬─────────┐
│  VM 1   │  VM 2   │
├─────────┴─────────┤
│    Hypervisor     │ ← Runs as application
├───────────────────┤
│     Host OS       │
├───────────────────┤
│ Physical Hardware │
└───────────────────┘

Examples: VirtualBox, VMware Workstation
Used: Desktop users, developers

Security:

  • Type 1: More secure (no host OS to attack)
  • Type 2: More convenient (use with normal OS)

VMs in the 1960s:

  • Back when there were few computers and many users
  • VMs allowed many users to share a single computer

VMs in the 1970s - 2000: largely non-existent

VMs since 2000:

  • Too many computers & too few users
    • e.g. printer server, mail server, web server, file server, database
  • VMs heavily used in private & public clouds

Virtual Machine Security

Security Model

We accept: Malware CAN infect the guest OS and apps

  • Attacker can get root in VM
  • Can install keyloggers, rootkits in VM
  • Can read all files in that VM

We require: Malware CANNOT escape from VM

  • Cannot infect host OS
  • Cannot infect other VMs
  • Cannot access hypervisor

Requirements for Security

Hypervisor must:

  • Protect itself (cannot be accessed by guest)
  • Isolate VMs (cannot see each other)
  • Be bug-free (vulnerabilities = escapes)

The Driver Problem

Tension:

  • Hypervisor kept simple (50K lines) ← Good!
  • Device drivers complex (100K+ lines) ← Bad!
  • Drivers run in privileged mode ← Dangerous!

Result:

  • Bug in driver → compromise hypervisor
  • Hypervisor compromised → ALL VMs compromised
  • This weakens the “simple hypervisor” benefit

Problem - Covert Channels

Intentional Secret Communication

Covert Channel: Unintended communication channel between isolated components used to leak data

Scenario:

┌─────────────────────────────────────────┐
│  ┌───────────┐        ┌───────────┐     │
│  │  VM 1     │        │  VM 2     │     │
│  │           │        │           │     │
│  │secret doc │        │           │     │
│  │    ↓      │        │           │     │
│  │ malware ─────→   ←── listener  │→Data│
│  │           │ covert │           │leak │
│  │           │channel │           │     │
│  └───────────┘        └───────────┘     │
│         Hypervisor/VMM                  │
└─────────────────────────────────────────┘

What’s happening:

  1. VM 1 has malware with stolen secrets
  2. VM 2 has attacker’s listener
  3. VMs supposed to be isolated (can’t send packets)
  4. But they COOPERATE using shared hardware
  5. Data leaks across security boundary

Key: This is INTENTIONAL cooperation between malware in BOTH VMs


Covert Channel Example

CPU Timing Attack

How it works:

Setup:

  • Both VMs run on same physical CPU
  • They compete for CPU time
  • Can measure each other’s CPU usage by timing

Sending data (VM 1 malware):

# To send bit 1:
at 1:00:00 AM:
    x = 3
    for i in range(huge_number):
        x = x * x  # CPU intensive!

# To send bit 0:
at 1:00:00 AM:
    sleep(1)  # Do nothing; leave the CPU idle

Receiving data (VM 2 listener):

at 1:00:00 AM:
    start = time()
    x = 3
    for i in range(huge_number):
        x = x * x
    duration = time() - start

    if duration > threshold:
        return 1  # Slow (VM 1 competing for CPU)
    else:
        return 0  # Fast (VM 1 idle)
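
This scheme can be demonstrated in miniature on one machine, with a CPU-burning thread standing in for VM 1's sender. A sketch only: thread scheduling stands in for CPU sharing between co-resident VMs, and `HUGE_NUMBER` is an illustrative loop size.

```python
import threading
import time

HUGE_NUMBER = 1_000_000  # illustrative; tune per machine

def workload():
    # Fixed amount of CPU work; its wall-clock duration is the signal
    x = 3
    for _ in range(HUGE_NUMBER):
        x = x * x % 1_000_003
    return x

def timed():
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start

idle_duration = timed()   # baseline: "sender" idle (bit 0)

stop = threading.Event()

def sender_burn():        # stands in for VM 1 sending bit 1
    while not stop.is_set():
        workload()

t = threading.Thread(target=sender_burn)
t.start()
busy_duration = timed()   # measured while the sender competes for CPU
stop.set()
t.join()

print(busy_duration > idle_duration)  # contention is observable
```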

Other Covert Channels

  • File lock status - lock/unlock at specific times
  • Cache contents - fill/empty cache to signal
  • Interrupts - trigger interrupts at specific rates
  • Memory bus - compete for memory bandwidth
  • Disk I/O patterns - read/write patterns
  • Many more…
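
The file-lock channel is easy to sketch: the sender encodes a bit in whether it currently holds an exclusive lock, and the receiver decodes by probing with a non-blocking lock attempt. Single-process sketch using `flock` (Unix-only; across VMs the lock would live on a shared filesystem):

```python
import fcntl
import tempfile

# The shared file is the channel; its contents never change.
channel = tempfile.NamedTemporaryFile(delete=False)

def sender_set(bit):
    if bit:
        fcntl.flock(channel, fcntl.LOCK_EX)  # hold the lock -> signal 1
    else:
        fcntl.flock(channel, fcntl.LOCK_UN)  # release it    -> signal 0

def receiver_probe():
    with open(channel.name) as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.flock(probe, fcntl.LOCK_UN)
            return 0  # lock was free: sender signalled 0
        except BlockingIOError:
            return 1  # lock was held: sender signalled 1

sender_set(1)
print(receiver_probe())  # → 1
sender_set(0)
print(receiver_probe())  # → 0
```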

The sad truth: “Impossible to eliminate all”

Why? Isolated components share physical hardware. Any shared resource = potential covert channel.
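
Even slow channels matter, as a back-of-envelope rate calculation shows (the 1 bit/s figure follows from the 1-second timing slots above; the key size is illustrative):

```python
# Even a 1 bit/s covert channel exfiltrates real secrets quickly.
bits_per_second = 1          # one timing slot per second, as above
key_bits = 4096              # e.g. an RSA private key
minutes = key_bits / bits_per_second / 60
print(round(minutes, 1))  # → 68.3
```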


Bigger Problem - Side Channels

Unintentional Information Leakage

Side Channel: Unintentional leakage between isolated components

Key difference from covert channel:

  • Covert: Malware in BOTH VMs cooperating
  • Side: Only listener is malicious, victim is innocent

Scenario:


   ┌───────────┐        ┌───────────┐
   │  VM 1     │        │  VM 2     │
   │           │        │           │
   │secret doc │        │           │
   │    ↓      │        │           │
   │Word proc  │        │           │
   │(innocent) │        │           │
   │    ↓      │        │           │
   │ accidental --→    ←-- listener │
   │  leakage  │        │(observes) │
┌──────────────────────────────────────────┐
│         Hypervisor/VMM                   │
└──────────────────────────────────────────┘

What’s happening:

  1. VM 1 innocently processes secret document
  2. VM 2 has attacker’s listener observing
  3. VM 1 doesn’t know it’s being attacked
  4. Listener infers information from hardware usage patterns

Why it’s “Bigger Problem”:

  • Covert channel: need to compromise BOTH VMs (harder)
  • Side channel: only compromise ONE VM (easier)
  • Victim doesn’t need malware
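
The same idea works at small scale inside one process: a victim that compares a guess against a secret with an early-exit loop leaks, through timing alone, how many leading characters matched. A sketch in which `SECRET` and the 1 ms per-character delay are purely illustrative:

```python
import string
import time

SECRET = "hunter2"  # hypothetical secret held by the "victim"

def victim_check(guess):
    # Innocent-looking early-exit comparison: each matching prefix
    # character costs ~1 ms, so runtime leaks how far the guess got.
    for a, b in zip(guess, SECRET):
        if a != b:
            return False
        time.sleep(0.001)
    return guess == SECRET

def attacker_first_char():
    # The attacker never reads SECRET; it only measures durations
    timings = {}
    for c in string.ascii_lowercase:
        start = time.perf_counter()
        victim_check(c + "aaaaaa")
        timings[c] = time.perf_counter() - start
    return max(timings, key=timings.get)  # slowest guess matched most

print(attacker_first_char())  # → h
```

Repeating the attack position by position recovers the whole secret; the fix is a constant-time comparison.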

VM Isolation in Practice - Cloud

Cloud Computing Architecture

Type 1 Hypervisor (No Host OS):

┌───────────┬───────────┬───────────┬──────────┐
│User Apps  │User Apps  │User Apps  │User Apps │
├───────────┼───────────┼───────────┼──────────┤
│Guest OS 1 │Guest OS 2 │Guest OS 3 │Guest OS n│
│(Kernel)   │(Kernel)   │(Kernel)   │(Kernel)  │
└───────────┴───────────┴───────────┴──────────┘
┌──────────────────────────────────────────────┐
│        Hypervisor/VMM (Xen, KVM, ESXi)       │
│              BIOS/SMM                        │
└──────────────────────────────────────────────┘
┌──────────────────────────────────────────────┐
│         Physical Hardware                    │
└──────────────────────────────────────────────┘

Key points:

  • Type 1 = No host OS (hypervisor runs directly on hardware)
  • Used by AWS, Azure, Google Cloud
  • More secure (smaller TCB)
  • More efficient (no extra OS layer)

Security in cloud:

  • VMs from different customers share hardware
  • Hypervisor MUST isolate them
  • But “some info leaks” (side channels)

VM Isolation in Practice - End User

Qubes OS - Security-Focused Desktop

Philosophy: Everything runs in a separate VM

Example configurations shown in slides:

Configuration 1:

┌──────────────┬──────────────┬──────────────┐
│Disposable VM │  Work VM     │ Personal VM  │
│(Debian)      │ (Windows)    │ (Debian)     │
│              │              │              │
│Sketchy PDFs  │Work email    │Personal email│
│Random links  │Office docs   │Social media  │
│Auto-deleted  │Slack/Teams   │Shopping      │
└──────────────┴──────────────┴──────────────┘
           Xen Hypervisor
           Hardware (Intel Xeon)

Configuration 2:

┌──────────────┬──────────────┬──────────────┐
│ Whonix VM    │  Work VM     │  Vault VM    │
│(Debian)      │ (Windows)    │ (Debian)     │
│              │              │              │
│All traffic   │Normal work   │Password mgr  │
│goes through  │activities    │Crypto keys   │
│TOR           │              │NO INTERNET   │
└──────────────┴──────────────┴──────────────┘
           Xen Hypervisor
           Hardware

Security benefits:

  • Compromise of one VM doesn’t affect others
  • Different VMs for different trust levels
  • Disposable VMs for risky activities

Every Window Frame Identifies VM Source

Visual Security in Qubes

Color-coded window borders:

  • Each VM has a distinct color border on its windows
  • Red border = Untrusted/Disposable VM
  • Yellow border = Work VM
  • Green border = Personal VM
  • Black border = Vault VM (most secure)

Why this matters:

  • Instantly see which VM a window belongs to
  • Prevents spoofing (malicious VM can’t fake border color)
  • GUI VM ensures frames drawn correctly
  • Helps user make security decisions

Example:

User sees:
- Login window with RED border → Fake! Don't enter real password
- Banking site with GREEN border → Expected VM ✓

GUI isolation:

  • Separate GUI VM manages window system
  • Prevents VMs from overlaying fake windows
  • Enforces visual security policy

Confinement - Summary

Many Sandboxing Techniques

From strongest to weakest:

  1. Physical air gap
    • Separate hardware, no connection
    • Strongest isolation, highest cost
  2. Virtual air gap (hypervisor)
    • Separate OS instances via VMM
    • Strong isolation, moderate cost
  3. System call interposition (SCI)
    • Filter syscalls (seccomp-BPF)
    • Fine-grained, low overhead
  4. Software Fault Isolation (SFI)
    • Insert checks in compiled code
    • Application-level isolation
  5. Application-specific
    • JavaScript sandbox in browser
    • Language-level isolation
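
Software Fault Isolation (item 4) can be sketched in miniature: every memory access by untrusted code is rewritten to mask its address into the sandbox's region, so even a corrupted pointer stays inside. Toy model only; real SFI instruments compiled code, and the base/mask values here are arbitrary.

```python
# "Memory" shared by host and sandbox; the sandbox may touch only
# the 4 KiB window at SANDBOX_BASE.
MEM = bytearray(1 << 16)
SANDBOX_BASE = 0x4000
SANDBOX_MASK = 0x0FFF

def sfi_store(addr, value):
    # Mask before every store: the access cannot leave the window
    MEM[SANDBOX_BASE | (addr & SANDBOX_MASK)] = value

def sfi_load(addr):
    return MEM[SANDBOX_BASE | (addr & SANDBOX_MASK)]

sfi_store(0x0005, 42)
print(sfi_load(0x0005))       # → 42
print(sfi_load(0xFFFF0005))   # → 42 (wild address masked into bounds)
```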

Complete Isolation Often Inappropriate

Programs need to communicate!

  • Apps must access shared resources
  • Need regulated interfaces for communication
  • Reference monitors mediate this communication

Examples:

  • Browser tabs → share bookmarks via browser process
  • Containers → communicate via virtual network
  • VMs → use shared folders with access control

Two Hardest Aspects

1. Specifying policy

  • What can apps do and not do?
  • Too restrictive → breaks functionality
  • Too permissive → doesn’t stop attacks
  • Auto-generation doesn’t work well
  • User decisions unreliable

2. Preventing covert channels

  • Fundamentally impossible while sharing hardware
  • Can only make them slow/noisy
  • Accept some risk for non-critical apps
  • Physically separate extremely sensitive work

Final Summary: Key Takeaways

Choosing the Right Isolation Level

Decision factors:

  • Threat model
  • Performance requirements
  • Trust in software
  • Resource constraints

Core Principles

  1. Defense in Depth - Layer multiple techniques
  2. Least Privilege - Minimal permissions by default
  3. Complete Mediation - Check every access
  4. Simplicity - Keep TCB small
  5. Accept Limitations - Perfect isolation impossible

What We Learned

✅ Why isolation is needed
✅ Reference monitors and TCB
✅ VMs vs Containers vs Syscall filtering
✅ chroot is broken, seccomp-BPF works
✅ Docker security mechanisms
✅ Covert and side channels exist


THE END of lecture 5 finally!!