CS 3235 - System Security

January 27, 2026
Daniel Genkin


Running Untrusted Code

The Central Security Challenge

The fundamental question: How do we run potentially malicious programs without them destroying our system?

Why we need to run untrusted code:

  • You download a PDF from an unknown email attachment
  • You visit websites that run JavaScript in your browser
  • You install apps from app stores (how do you know they’re safe?)
  • You process user-uploaded files on your server
  • You run code from open-source repositories

The dilemma:

We WANT to:                      But we FEAR:
- Open PDFs                   →  Malicious code execution
- Browse websites             →  Data theft
- Install apps                →  System compromise
- Run plugins                 →  Privilege escalation

This lecture’s goal: Learn techniques to run untrusted code SAFELY through isolation and sandboxing.


Process Isolation

Evolution of Security Thinking

Traditional Thinking: Identify the Bad Guys

Old question: “How to run bad/untrustworthy programs safely?”

This assumed we could identify dangerous programs:

  • Programs from untrusted sites
  • Apps that process adversarial data (PDF viewers, browsers)
  • Honeypots (intentionally vulnerable systems to trap attackers)

Problem: You have to KNOW something is bad before protecting against it. What about:

  • Zero-day vulnerabilities?
  • Compromised trusted programs?
  • Supply chain attacks?

New Thinking: Trust Nothing

Modern question: “Be skeptical of ALL programs, isolate to achieve least privilege”

Key principles:

  1. Assume compromise - ANY program could be exploited
  2. Default deny - Give minimal permissions by default
  3. Limit damage - When (not if) compromise happens, contain it

Real example - Chrome Browser:

Old approach:
  "Chrome is from Google, it's trustworthy"
  → Give Chrome full system access
  → If exploited, attacker gets everything

New approach:
  "Assume Chrome will be exploited"
  → Each tab runs in isolated process
  → Limited syscalls (can't open files, create sockets)
  → If exploited, damage is contained

General Goal: Confinement

Confinement = Ensure misbehaving process cannot harm rest of system

Medical quarantine analogy:

  • Sick patient (compromised program) is isolated
  • Can’t spread disease (attacks) to others
  • But can still function (perform legitimate operations)
  • Medical staff (system services) interact through barriers

Spectrum of Isolation Levels

Can be implemented at many levels:

  1. System call interposition (AppArmor, SELinux)
    • Monitor and filter individual system calls
    • Example: Block process from accessing /etc/passwd
  2. Containers (Docker, Kubernetes)
    • Isolated userspace instances sharing one kernel
    • Example: Each microservice in own container
  3. Virtual machines (VMware, VirtualBox, Xen)
    • Isolate entire operating systems on one physical machine
    • Example: AWS running different customers on same hardware
  4. Physical isolation (Air gaps)
    • Separate physical hardware with no connection
    • Example: Nuclear power plant control systems

The tradeoff: Stronger isolation → More overhead/less convenience


Approach - Confinement (Hardware / Air Gap)

Physical Isolation - The Strongest Defense

What it is: Run applications on completely separate physical hardware with absolutely no connection

Visual:

Network 1          <<<  AIR GAP  >>>          Network 2
┌──────────┐      (No wires,              ┌──────────┐
│Computer A│       No WiFi,               │Computer B│
│  App 1   │       No Bluetooth,          │  App 2   │
│          │       Physical separation)   │          │
└──────────┘                              └──────────┘

Security properties:

  • Strongest possible isolation - Cannot attack what you cannot reach
  • No shared resources - Completely independent systems
  • Simple to understand - Physical separation is obvious

Drawbacks:

  • Very expensive - Need multiple complete computer systems
  • Difficult to manage - No remote access, must physically visit
  • Inconvenient data sharing - “Sneakernet” (walk USB drive between systems)
  • Doesn’t scale - Can’t have thousands of air-gapped machines

When to use:

  • Nuclear power plants - Control systems separated from internet
  • Military networks - SIPRNET (SECRET) physically separate from NIPRNET (UNCLASSIFIED)
  • Financial systems - Trading systems isolated from corporate network
  • Critical infrastructure - When security is worth ANY cost

Real example - Stuxnet: Even with air gap, Iranian nuclear centrifuges were compromised via USB drive. Shows air gaps aren’t perfect but raise the bar significantly.


Approach - Confinement (Virtual Machines)

Virtual Machine Isolation

What it is: Isolate entire operating systems on a single physical machine

Architecture:

┌────────────────────────────────────────┐
│     OS1     |     OS2    |      OS3    │ ← Guest Operating Systems
│   app1      |   app2     |    app3     │
├────────────────────────────────────────┤
│  Virtual Machine Monitor (Hypervisor)  │ ← Isolation layer
├────────────────────────────────────────┤
│           Hardware                     │
└────────────────────────────────────────┘

What each VM sees:

  • Its own “CPU” (time-sliced from real CPU)
  • Its own “RAM” (isolated portion of real RAM)
  • Its own “disk” (file on host or dedicated partition)
  • Its own “network card” (virtual NIC)

The illusion:

  • From inside VM: looks exactly like a real computer
  • Root user in VM doesn’t know it’s virtualized
  • VM thinks it has exclusive access to hardware

Security model:

  • ✅ Strong isolation (separate OS instances)
  • ✅ Can run different OSes (Windows + Linux on same hardware)
  • ✅ Compromise of one VM shouldn’t affect others

We’ll dive deeper into VMs later…


Approach - Confinement (System Call Interposition / Containers)

Process-Level Isolation

What it is: Isolate individual processes within a single operating system

Architecture:

┌─────────────────────────────────┐
│      Operating System           │
│                                 │
│   ┌──────────┐   ┌─────────┐    │
│   │Process 1 │   │Process 2│    │ ← Isolated processes
│   │(isolated)│   │(normal) │    │
│   └──────────┘   └─────────┘    │
└─────────────────────────────────┘

How it works:

  • Monitor and filter system calls from isolated process
  • Block dangerous operations (file access, network, etc.)
  • Allow safe operations (computation, limited I/O)

Compared to VMs:

  • ✅ Much lighter weight (no separate OS)
  • ✅ Faster startup (milliseconds vs minutes)
  • ✅ Less memory (MBs vs GBs)
  • ❌ Weaker isolation (shared kernel)

Modern implementation: Containers (Docker, Kubernetes)

We’ll explore this in detail after covering system calls…


System Calls - Going from User to OS Code

Understanding System Calls (Foundation for SCI)

Before we can understand System Call Interposition, we need to understand what system calls are.

What is a system call?

  • A system call is how user programs ask the operating system to do privileged operations.

Why programs can’t do these directly:

Programs run in user mode with restricted privileges:

  • Cannot directly access hardware (disk, network, devices)
  • Cannot access other processes’ memory
  • Cannot modify kernel memory
  • Cannot access protected system files

The OS kernel runs in kernel mode with full privileges:

  • Can access all hardware
  • Can access all memory
  • Can perform privileged operations

Programs must ASK the kernel through system calls

Common System Calls

File operations:

int fd = open("/home/user/file.txt", O_RDONLY);  // Open file
ssize_t n = read(fd, buffer, 100);               // Read data
write(fd, data, 50);                             // Write data
close(fd);                                       // Close file

Process operations:

pid_t pid = fork();           // Create new process
execv("/bin/ls", args);       // Execute program
exit(0);                      // Terminate process

Network operations:

int sock = socket(AF_INET, SOCK_STREAM, 0);       // Create socket
connect(sock, (struct sockaddr*)&addr, sizeof(addr));  // Connect to server
send(sock, data, len, 0);                         // Send data

Memory operations:

void* ptr = mmap(NULL, 4096, PROT_READ|PROT_WRITE, ...);  // Map memory
munmap(ptr, 4096);                                         // Unmap memory

How System Calls Work

User Program                    Kernel
─────────────────────────────────────────
                                
1. Program calls open()
    ↓
2. Library wraps it
   syscall(SYS_open, ...)
    ↓
3. Trap to kernel mode ──────→ 4. Kernel receives syscall
   (CPU switches modes)            ↓
                                5. Kernel checks permissions
                                   ↓
                                6. Performs operation
                                   ↓
8. Return to user mode ←────── 7. Returns result
   ↓
9. Program continues

Key insight: Every privileged operation goes through syscalls!

This is why we can control programs by filtering syscalls - if we block dangerous syscalls, we limit what program can do.


Achieving Isolation (Reference Monitor)

The Abstract Security Model

Reference Monitor = Abstract design pattern for security

Think of it as a security guard at a checkpoint:

		┌─────────────────┐
		│ Isolated Process│
		│ (Untrusted)     │
		└────────┬────────┘
		         │ "Can I open /etc/passwd?"
		         ↓
┌────────────────↓───────────────────┐
│  ┌─────────────↓────────────────┐  │
│  │   ┌─────────↓────────────┐   │  │
│  │   │ Reference Monitor    │   │  │
│  │   └──────────────────────┘   │  │
│  │ Trusted Computing Base (TCB) │  │
│  └──────────────────────────────┘  │
│      Operating System Kernel       │
└────────────────────────────────────┘

Three Required Properties

For a reference monitor to provide security:

1. Complete Mediation - “Every request must go through the guard”

Must always be invoked: Every application request must be mediated

What this means:

  • NO backdoor paths
  • NO bypass mechanisms
  • EVERY access checked

Bad example:

Prison with front gate security but unlocked back door
→ Security guard is useless
→ Complete mediation violated

Good example:

OS kernel checks EVERY file access
Program cannot directly access disk
Must go through kernel
→ Complete mediation achieved

2. Tamperproof - “Guard can’t be bribed or killed”

Cannot be killed:

  • Reference monitor must run at higher privilege than monitored process
  • If monitor killed, monitored process must also die
  • Monitor must be in protected memory

Cannot be modified:

  • Isolated program cannot change monitor’s code
  • Isolated program cannot change monitor’s policy
  • Even with exploit, cannot disable security

How enforced:

  • Monitor runs in kernel mode (protected)
  • Program runs in user mode (restricted)
  • Hardware (CPU protection rings, MMU) enforces separation

3. Small and Verifiable - “Simple guards make fewer mistakes”

Must be small enough to analyze and validate

Why size matters:

Complex code = More bugs
More bugs = More vulnerabilities
More vulnerabilities = Weaker security

Example:

Bad: 100,000 lines of monitoring code
  → Too complex to audit
  → Likely contains exploitable bugs

Good: 1,000 lines of monitoring code
  → Security experts can review completely
  → Can prove correctness
  → Fewer attack vectors

Part of Trusted Computing Base

By “trusted” we mean: “If it isn’t trustworthy, we’re vulnerable”

The reference monitor is part of the TCB (Trusted Computing Base) - all the code we MUST trust for security.

We’ll explore TCB more in the next slide…


Need for Trusting an Operating System (TCB)

Why Must We Trust the OS?

Question: Why do we need to trust the operating system?

Answer: Because the OS enforces ALL security!

What the OS controls:

  • File permissions (who can read/write files)
  • Process isolation (programs can’t access each other’s memory)
  • Network access (who can create sockets)
  • Hardware access (who can access devices)

If OS is compromised → All security fails

Trusted Computing Base (TCB)

Definition: The TCB is all the code and hardware that, if compromised, breaks your security.

For our isolation systems, TCB includes:

  • Operating system kernel
  • Hypervisor (for VMs)
  • Reference monitor implementation
  • Device drivers (unfortunately - they’re buggy!)
  • Hardware (CPU, MMU, memory protection)

TCB Requirements

To be trustworthy, TCB must have:

1. Tamper-proof

  • Cannot be modified by untrusted code
  • Protected memory, privileged execution

2. Complete mediation

  • Checks every access
  • No bypass mechanisms

3. Correct

  • Actually implements security policy properly
  • No bugs that allow security bypass

Reference Monitors = Special TCBs

Official definition: TCBs that meet all three requirements are called Reference Monitors

The relationship:

  • Reference Monitor = The ideal requirements (what we want)
  • TCB = The actual implementation (what we built)
  • We WANT our TCB to be a reference monitor
  • Reality: TCB often has bugs and falls short of the ideal

Security goal: Make TCB as small as possible while meeting reference monitor requirements


Approach - UNIX chroot System Call

Now we see a SIMPLE (but flawed) isolation technique before learning the correct approaches.

What is chroot?

chroot = “change root directory”

Changes what a process sees as the root directory (/)

Normal filesystem:

/                       ← Real root
├── home/user/documents/
├── etc/passwd          ← Sensitive!
└── tmp/

After chroot /tmp/jail:

Process sees:
/                       ← Now /tmp/jail
├── bin/
├── lib/
└── app/

Process CANNOT see:
/home/user/documents/   ← Outside jail
/etc/passwd             ← Outside jail

How to Use chroot

Requirements:

  • Must be root to use chroot
  • Must create minimal filesystem in jail first

Setup:

# Create jail structure
sudo mkdir -p /tmp/jail/{bin,lib,etc}
 
# Copy necessary binaries
sudo cp /bin/bash /tmp/jail/bin/
sudo cp /bin/ls /tmp/jail/bin/
 
# Copy required libraries
sudo cp /lib/x86_64-linux-gnu/libc.so.6 /tmp/jail/lib/
 
# Enter jail
sudo chroot /tmp/jail /bin/bash
 
# Drop to unprivileged user
su guest
 
# Run untrusted program
./app

Security Property

Process cannot access files outside jail because it cannot name them

Example:

# Inside jail, process tries:
cat /etc/passwd
 
# What happens:
# "/" is interpreted as "/tmp/jail"
# Actually opens: /tmp/jail/etc/passwd
# Real /etc/passwd is never touched

Early Uses

Common uses:

  1. FTP servers - Prevent users accessing system files
  2. Test environments - Install software without affecting system
  3. Build environments - Clean compilation

But is it secure?

Spoiler: NO! We’ll see why in upcoming slides…


Escaping from Jails

chroot is NOT Secure!

While chroot provides filesystem isolation, it has MANY escape mechanisms.

Early escapes used relative paths

Example

fopen("../../etc/passwd", "r") =>
   fopen("/tmp/guest/../../etc/passwd", "r")

chroot should only be executable by root

Why?

  • Otherwise jailed app can:
    • create dummy file /aaa/etc/passwd containing a root password the attacker knows
    • run chroot "/aaa"
    • run su root to become root

This was a bug in Ultrix 4.0


Many Ways to Evade chroot Isolation

Escape Technique 1: Device Files

Attack:

# Inside chroot jail:
mknod /dev/sda b 8 0      # Create block device for hard drive
dd if=/dev/sda of=dump    # Read raw disk!

Why it works:

  • Device major/minor numbers are GLOBAL (shared namespace)
  • chroot doesn’t isolate device namespace
  • Raw disk access bypasses file permissions entirely

Result: Can read ANY data on disk, even outside jail

Escape Technique 2: Send Signals

Attack:

# Inside jail, kill process outside jail:
kill -9 1234

Why it works:

  • Process IDs (PIDs) are GLOBAL
  • chroot doesn’t isolate PID namespace
  • Can affect processes outside jail

Escape Technique 3: Reboot System

Attack:

# If somehow have capability:
reboot

Why it works:

  • Reboot affects ENTIRE system
  • chroot doesn’t restrict system-wide operations

Escape Technique 4: Network Ports

Attack:

# Bind to privileged port:
nc -l 80

Why it works:

  • Network namespace is SHARED
  • Port numbers are global
  • Can block legitimate services

Summary: What chroot Does NOT Isolate

chroot isolates: Filesystem paths

chroot does NOT isolate:

  • Process IDs
  • Network
  • Devices
  • System calls
  • Users/Groups
  • IPC mechanisms

Conclusion: chroot alone is INSECURE!

Modern solution: Containers (covered later) use chroot PLUS many other isolation mechanisms


FreeBSD Jail

Improvement Over chroot

FreeBSD jail extends chroot with additional isolation:

What jail adds:

  • Hostname isolation (each jail has own hostname)
  • IP address isolation (can only bind to sockets with specific IP address & authorized ports)
  • Process isolation (can only communicate with processes inside jail)
  • Root restrictions (root in jail ≠ root on host; cannot load kernel modules)

To run:

# Create jail
jail jail-path hostname IP-addr cmd

But still has problems… (next slide)


Problems with chroot and jail

Remaining Issues

  • All or nothing access to parts of file system
  • Inappropriate for apps like a web browser
    • Needs read access to files outside of jail (e.g. sending attachments in Gmail)

Approach - System Call Interposition

The Core Idea

Observation: To damage host system, app MUST make system calls

What malware wants to do:

  • Delete/overwrite files → needs unlink(), open(), write()
  • Launch network attacks → needs socket(), bind(), connect()
  • Install persistence → needs disk writes
  • Steal data → needs file/network access

Key insight: If we control syscalls, we control what program can do!

System Call Interposition Strategy

Idea: Monitor app’s system calls and block unauthorized ones

Program wants:              Monitor checks:         Action:
──────────────────────────────────────────────────────────
open("/etc/passwd")    →    Policy: DENY        →  Block/Kill
open("/tmp/myfile")    →    Policy: ALLOW       →  Let through
socket(...)            →    Policy: NO network  →  Block
write(stdout, "hi")    →    Policy: ALLOW       →  Let through

Implementation Options

1. Completely kernel space (SELinux)

  • Filtering built into OS kernel
  • Very secure, can’t bypass
  • Complex to configure

2. Completely user space (program shepherding)

  • Separate monitor process
  • More portable
  • Can have race conditions

3. Hybrid (Systrace)

  • Monitor in user space
  • Enforcement in kernel
  • Balance of flexibility and security

Let’s see how to implement this…


Implementing System Call Interposition (ptrace)

Extra resource: MITRE ATT&CK T1055.008 Process Injection: Ptrace System Calls

Naive Approach: ptrace

ptrace = Unix system call for process tracing (designed for debugging)

How it works:

┌────────────────────────────────────┐
│         User Space                 │
│                                    │
│  ┌──────────┐      ┌───────────┐   │
│  │ Monitor  │      │  Program  │   │
│  │ Process  │      │(monitored)│   │
│  └────┬─────┘      └────↑──────┘   │
│       │                 |          │
│       │ ptrace()        │ syscall  │
└───────┼─────────────────┼──────────┘
        │                 │
┌───────↓─────────────────┼─────────┐
│       └-----------------┘         │
│         OS Kernel                 │
│  Monitor wakes when syscall made  │
└───────────────────────────────────┘

Process:

  1. Monitor attaches to program with ptrace()
  2. Program makes syscall
  3. Kernel stops program and wakes monitor
  4. Monitor checks if syscall allowed
  5. Monitor either allows or kills program

Seems reasonable… but has SEVERE problems!


System Call Interposition Policies

Before discussing ptrace problems, let’s see what policies look like:

Example Policy File

path allow /tmp/*
path deny /etc/passwd
path deny /etc/shadow
network deny all

How it works:

open("/tmp/file")    → Matches "allow /tmp/*"     → ALLOW
open("/etc/passwd")  → Matches "deny /etc/passwd" → DENY
socket(...)          → Matches "deny all network" → DENY

The Policy Specification Problem

Challenge 1: Auto-generating policies

Idea: Run program on “good” inputs, learn what it does, generate policy

Problems:

  • What are “good” inputs?
  • Might miss rare legitimate behaviors
  • Attacker could poison training data

Challenge 2: Asking the user

Show popup: “Program wants to open /etc/passwd. Allow?”

Problems:

  • Users don’t understand implications
  • Alert fatigue → click “Allow” without reading
  • Not practical for servers

Why This Is Hard

For complex apps (browsers, office suites):

  • Access patterns unpredictable
  • Need many capabilities
  • Hard to write policy that’s both:
    • Secure (blocks attacks)
    • Permissive (doesn’t break features)

This difficulty in choosing policy is why syscall interposition isn’t widely used for complex apps


Complications

Issues with ptrace Monitoring

Fork handling:

  • If monitored app forks, monitor must also fork
  • Must track parent-child relationships
  • Complex coordination

Crash handling:

  • If monitor crashes, app must be killed immediately
  • Can’t allow unmonitored execution
  • Requires kernel support

State synchronization:

Monitor must track ALL OS state that affects syscall interpretation:

State to track:
- Current working directory (CWD)
- User ID (UID, EUID, GID)
- Open file descriptors
- Environment variables

Example why CWD matters:

Program does:           Monitor must remember:
──────────────────────────────────────────────
chdir("/tmp")          CWD = /tmp
open("passwd")         → Interprets as /tmp/passwd ✓

chdir("/etc")          CWD = /etc
open("passwd")         → Interprets as /etc/passwd ✓

If monitor loses track → misinterprets paths!

Problems with ptrace

Why ptrace Fails for Security

Problem 1: All or Nothing

  • Can only trace ALL syscalls or NONE
  • Cannot selectively trace just dangerous ones
  • Inefficient (trace every read(), write(), close())

Problem 2: Cannot Abort Cleanly

  • Can only KILL entire program
  • Cannot deny individual syscalls gracefully
  • Too harsh response

Problem 3: TOCTOU Race Condition ⚠️ CRITICAL SECURITY BUG

TOCTOU = Time-Of-Check, Time-Of-Use

The vulnerability:

Timeline of Attack:

1. Attacker creates:  me → mydata.txt (symlink)

2. Program calls:     open("me")

3. Monitor wakes:     Checks "me" → points to mydata.txt
                     Decision: ALLOW ✓

4. [ATTACKER STRIKES] Changes:  me → /etc/passwd

5. Kernel executes:   open("me") → Opens /etc/passwd!
                     ❌ SECURITY VIOLATED

What went wrong: Check (step 3) and use (step 5) were NOT atomic. Attacker changed target in between.

Why this is critical:

  • Can bypass ANY security check
  • Classic vulnerability in many systems
  • Fundamental problem with user-space monitoring

Conclusion: ptrace was designed for debugging, NOT security!


SCI in Linux: seccomp-bpf

SCI = System Call Interposition
seccomp-BPF = Secure Computing with Berkeley Packet Filters

This solves ALL the ptrace problems!

How It’s Different

1. Filter runs IN THE KERNEL (not user space)

  • ✅ No TOCTOU races (check and use are atomic)
  • ✅ No state synchronization issues
  • ✅ Perfect state tracking

2. Filter written in BPF language (compiled, efficient)

  • ✅ Efficient (runs in kernel)
  • ✅ Can selectively filter (only check specific syscalls)
  • ✅ No overhead for allowed syscalls

3. Process sets its OWN filter (before doing dangerous things)

  • ✅ No separate monitor process
  • ✅ No fork complications
  • ✅ Filter cannot be removed once set
  • ✅ If the program calls execve(), all installed filters are preserved

Visual Architecture

┌───────────────────────────────────────┐
│        User Space                     │
│                                       │
│  Chrome Renderer Process:             │
│  1. Sets filter                       │
│  2. Processes web page                │
│  3. Exploit tries: open("/etc/passwd")│
│                ↓                      │
└────────────────┼──────────────────────┘
                 ↓
┌────────────────┼──────────────────────┐
│      OS Kernel │                      │
│                ↓                      │
│     Seccomp-BPF Filter                │
│     - Check: open() syscall           │
│     - Not in whitelist!               │
│     - Action: KILL process            │
└───────────────────────────────────────┘

Used in production:

  • Chrome/Chromium browser tabs
  • Docker containers
  • Android apps
  • systemd services

BPF Filters (Policy Programs)

What is BPF?

BPF (Berkeley Packet Filter) is a small in-kernel virtual machine originally designed for filtering network packets; seccomp reuses its restricted instruction set to run syscall filters safely inside the kernel.

How BPF Filters Work

BPF program:

  • Small program that runs for each syscall
  • Returns: ALLOW, DENY, or KILL
    • SECCOMP_RET_ALLOW = allow syscall
    • SECCOMP_RET_ERRNO = return specified error to caller
    • SECCOMP_RET_KILL = kill process
  • Written in restricted language (safe, cannot loop forever)

Structure:

For each syscall:
    1. Load syscall number
    2. Check against whitelist/blacklist
    3. Return action (ALLOW/DENY/KILL)

Installing a BPF Filter

Two Required Calls (Must Be In This Order)

1. prctl(PR_SET_NO_NEW_PRIVS, 1)

  • Purpose: Prevents privilege escalation
  • Effect: Disables set-UID/set-GID bits on subsequent execve() calls
  • Why needed: Without this, attacker could exec a set-UID root binary and bypass the filter
  • Must come FIRST - kernel won’t allow filter installation without this

2. prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy)

  • Purpose: Installs the syscall filter
  • Effect: Every syscall from now on is checked against the BPF policy
  • Irreversible: Once set, filter CANNOT be removed (even by root)
  • Point of no return

The Example

int main(int argc, char **argv) {
    prctl(PR_SET_NO_NEW_PRIVS, 1);              // Block privilege escalation
    prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &bpf_policy);  // Install filter

    fopen("file.txt", "w");                      // Calls open() syscall
    printf("... will not be printed.\n");        // Never executes - process killed
}

What happens:

  • fopen() internally calls the open() syscall for writing
  • The BPF filter checks the policy
  • The policy blocks open() for write
  • The process is immediately killed
  • printf() never runs

Key Properties

  • Voluntary self-restriction - Process restricts itself
  • Kernel-enforced - No TOCTOU races (unlike ptrace)
  • Cannot be removed - Permanent once installed
  • Inherited by children - fork() passes the filter to child processes

Why This Order Matters

Correct:
1. NO_NEW_PRIVS → 2. Install filter → 3. Restricted execution ✓

Wrong:
Install filter first → Kernel rejects (error) ✗

Without NO_NEW_PRIVS:
Attacker could execve("/usr/bin/passwd") → runs as root → bypasses filter ✗

Docker - Isolating Containers Using seccomp-bpf

Modern Container Isolation

Container: OS-level isolation creating multiple isolated userspace instances

Docker uses multiple isolation mechanisms together:

What Engineers Should Know About Container Isolation

1. Namespaces - What You Can See

PID Namespace:       Container sees PIDs 1,2,3 (own processes)
Network Namespace:   Container has own IP, ports
Mount Namespace:     Container has own filesystem view
User Namespace:      Root in container ≠ root on host
IPC Namespace:       Isolated shared memory

2. Cgroups - What Resources You Can Use

Memory cgroup:  Limit: 512MB RAM
CPU cgroup:     Limit: 50% of one core
Disk I/O:       Limit: 10MB/s write

3. Seccomp-BPF - What Syscalls You Can Make

Default Docker filter blocks ~40 dangerous syscalls:
❌ ptrace   - Can't debug processes
❌ reboot   - Can't reboot host
❌ mount    - Can't mount filesystems
❌ setns    - Can't join other namespaces

Docker Architecture

┌───────────────────────────────────┐
│    App 1  │  App 2  │  App 3      │ ← Containers
└───────────────────────────────────┘
┌───────────────────────────────────┐
|           Docker Engine           |
└───────────────────────────────────┘       
┌───────────────────────────────────┐
│      Host OS (Single Kernel)      │
└───────────────────────────────────┘
┌───────────────────────────────────┐
│             Hardware              │
└───────────────────────────────────┘

Key features:

  • ✅ Lightweight (share kernel)
  • ✅ Fast startup (seconds)
  • ✅ Good isolation (multiple mechanisms)
  • ❌ All containers must use same OS type
  • ❌ Shared kernel = shared vulnerabilities

Docker Syscall Filtering

Custom Seccomp Filters in Docker

Command:

docker run --security-opt="seccomp=filter.json" nginx

Example filter.json

{
  "defaultAction": "SCMP_ACT_ERRNO",  // Deny by default
  
  "syscalls": [
    {
      "names": ["accept", "read", "write"],
      "action": "SCMP_ACT_ALLOW"        // allow (whitelist)
    },
    {
      "names": ["socket"],
      "action": "SCMP_ACT_ALLOW",
      "args": [
        {
          "index": 0,        // First argument
          "value": 2,        // AF_INET (IPv4)
          "op": "SCMP_CMP_EQ" // Must equal
        }
      ]
    },
    {
      "names": ["ptrace", "reboot"],
      "action": "SCMP_ACT_KILL"          // Kill immediately
    }
  ]
}

What this does:

  • ✅ Allows: accept(), read(), write()
  • ✅ Allows: socket() only for IPv4
  • ❌ Kills process if: ptrace() or reboot()
  • ❌ Returns error for: anything else

More Docker Confinement Flags

Additional Security Layers

1. Run as unprivileged user:

docker run --user nginx nginx
  • Process runs as nginx user, not root
  • Limited damage if escaped

2. Limit Linux capabilities:

docker run --cap-drop ALL --cap-add NET_BIND_SERVICE nginx
  • Removes all Linux capabilities
  • Adds back to allow binding to privileged ports
  • Root in container has minimal powers

3. Prevent process from becoming privileged:

docker run --security-opt=no-new-privileges:true nginx
  • Blocks setuid elevation
  • Even setuid binaries won’t gain privileges

4. Resource limits:

docker run \
  --restart=on-failure:5 \
  --ulimit nofile=100 \
  --ulimit nproc=50 \
  nginx
  • Max 5 restarts (prevents crash loops)
  • Max 100 file descriptors
  • Max 50 processes

Defense in depth: Multiple layers protect against different attacks!


Approach - Virtual Machines

Now we return to VMs for a deeper dive…

What is a Virtual Machine?

VM = emulate an entire computer (OS and all) running inside your real computer

Each VM thinks it has:

  • Its own CPU
  • Its own RAM
  • Its own disk
  • Its own network card
  • Its own hardware

How? The hypervisor (VMM - virtual machine monitor) creates this illusion

Two Types of Hypervisors

Type 1 (Bare Metal):

┌─────────┬─────────┬─────────┐
│  VM 1   │  VM 2   │  VM 3   │
├─────────┴─────────┴─────────┤
│         Hypervisor          │ ← Runs directly on hardware
├─────────────────────────────┤
│     Physical Hardware       │
└─────────────────────────────┘

Examples: VMware ESXi, Xen, Hyper-V
Used: Cloud providers (AWS, Azure)

Type 2 (Hosted):

┌─────────┬─────────┐
│  VM 1   │  VM 2   │
├─────────┴─────────┤
│    Hypervisor     │ ← Runs as application
├───────────────────┤
│     Host OS       │
├───────────────────┤
│ Physical Hardware │
└───────────────────┘

Examples: VirtualBox, VMware Workstation
Used: Desktop users, developers

Security:

  • Type 1: More secure (no host OS to attack)
  • Type 2: More convenient (use with normal OS)

VMs in the 1960s:

  • Back when there were few computers and many users
  • VMs allowed many users to share a single computer

VMs in the 1970s - 2000: largely non-existent

VMs since 2000:

  • Too many computers & too few users
    • e.g. printer server, mail server, web server, file server, database
  • VMs heavily used in private & public clouds

Virtual Machine Security

Security Model

We accept: Malware CAN infect the guest OS and apps

  • Attacker can get root in VM
  • Can install keyloggers, rootkits in VM
  • Can read all files in that VM

We require: Malware CANNOT escape from VM

  • Cannot infect host OS
  • Cannot infect other VMs
  • Cannot access hypervisor

Requirements for Security

Hypervisor must:

  • Protect itself (cannot be accessed by guest)
  • Isolate VMs (cannot see each other)
  • Be bug-free (vulnerabilities = escapes)

The Driver Problem

Tension:

  • Hypervisor kept simple (50K lines) ← Good!
  • Device drivers complex (100K+ lines) ← Bad!
  • Drivers run in privileged mode ← Dangerous!

Result:

  • Bug in driver → compromise hypervisor
  • Hypervisor compromised → ALL VMs compromised
  • This weakens the “simple hypervisor” benefit

Problem - Covert Channels

Intentional Secret Communication

Covert Channel: Unintended communication channel between isolated components used to leak data

Scenario:

┌─────────────────────────────────────────┐
│  ┌───────────┐        ┌───────────┐     │
│  │  VM 1     │        │  VM 2     │     │
│  │           │        │           │     │
│  │secret doc │        │           │     │
│  │    ↓      │        │           │     │
│  │ malware ─────→   ←── listener  │→Data│
│  │           │ covert │           │leak │
│  │           │channel │           │     │
│  └───────────┘        └───────────┘     │
│         Hypervisor/VMM                  │
└─────────────────────────────────────────┘

What’s happening:

  1. VM 1 has malware with stolen secrets
  2. VM 2 has attacker’s listener
  3. VMs supposed to be isolated (can’t send packets)
  4. But they COOPERATE using shared hardware
  5. Data leaks across security boundary

Key: This is INTENTIONAL cooperation between malware in BOTH VMs


Covert Channel Example

CPU Timing Attack

How it works:

Setup:

  • Both VMs run on same physical CPU
  • They compete for CPU time
  • Can measure each other’s CPU usage by timing

Sending data (VM 1 malware):

# To send bit 1:
at 1:00:00 AM:
    x = 3
    for i in range(huge_number):
        x = x * x  # CPU intensive!

# To send bit 0:
at 1:00:00 AM:
    sleep(1)  # Do nothing; leave the CPU idle

Receiving data (VM 2 listener):

at 1:00:00 AM:
    start = time()
    x = 3
    for i in range(huge_number):
        x = x * x
    duration = time() - start

    if duration > threshold:
        return 1  # Slow (VM 1 competing for CPU)
    else:
        return 0  # Fast (VM 1 idle)
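
This scheme can be demonstrated in miniature on one machine, with a CPU-burning thread standing in for VM 1's sender. A sketch only: thread scheduling stands in for CPU sharing between co-resident VMs, and `HUGE_NUMBER` is an illustrative loop size.

```python
import threading
import time

HUGE_NUMBER = 1_000_000  # illustrative; tune per machine

def workload():
    # Fixed amount of CPU work; its wall-clock duration is the signal
    x = 3
    for _ in range(HUGE_NUMBER):
        x = x * x % 1_000_003
    return x

def timed():
    start = time.perf_counter()
    workload()
    return time.perf_counter() - start

idle_duration = timed()   # baseline: "sender" idle (bit 0)

stop = threading.Event()

def sender_burn():        # stands in for VM 1 sending bit 1
    while not stop.is_set():
        workload()

t = threading.Thread(target=sender_burn)
t.start()
busy_duration = timed()   # measured while the sender competes for CPU
stop.set()
t.join()

print(busy_duration > idle_duration)  # contention is observable
```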

Other Covert Channels

  • File lock status - lock/unlock at specific times
  • Cache contents - fill/empty cache to signal
  • Interrupts - trigger interrupts at specific rates
  • Memory bus - compete for memory bandwidth
  • Disk I/O patterns - read/write patterns
  • Many more…
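
The file-lock channel is easy to sketch: the sender encodes a bit in whether it currently holds an exclusive lock, and the receiver decodes by probing with a non-blocking lock attempt. Single-process sketch using `flock` (Unix-only; across VMs the lock would live on a shared filesystem):

```python
import fcntl
import tempfile

# The shared file is the channel; its contents never change.
channel = tempfile.NamedTemporaryFile(delete=False)

def sender_set(bit):
    if bit:
        fcntl.flock(channel, fcntl.LOCK_EX)  # hold the lock -> signal 1
    else:
        fcntl.flock(channel, fcntl.LOCK_UN)  # release it    -> signal 0

def receiver_probe():
    with open(channel.name) as probe:
        try:
            fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
            fcntl.flock(probe, fcntl.LOCK_UN)
            return 0  # lock was free: sender signalled 0
        except BlockingIOError:
            return 1  # lock was held: sender signalled 1

sender_set(1)
print(receiver_probe())  # → 1
sender_set(0)
print(receiver_probe())  # → 0
```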

The sad truth: “Impossible to eliminate all”

Why? Isolated components share physical hardware. Any shared resource = potential covert channel.
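
Even slow channels matter, as a back-of-envelope rate calculation shows (the 1 bit/s figure follows from the 1-second timing slots above; the key size is illustrative):

```python
# Even a 1 bit/s covert channel exfiltrates real secrets quickly.
bits_per_second = 1          # one timing slot per second, as above
key_bits = 4096              # e.g. an RSA private key
minutes = key_bits / bits_per_second / 60
print(round(minutes, 1))  # → 68.3
```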


Bigger Problem - Side Channels

Unintentional Information Leakage

Side Channel: Unintentional leakage between isolated components

Key difference from covert channel:

  • Covert: Malware in BOTH VMs cooperating
  • Side: Only listener is malicious, victim is innocent

Scenario:


   ┌───────────┐        ┌───────────┐
   │  VM 1     │        │  VM 2     │
   │           │        │           │
   │secret doc │        │           │
   │    ↓      │        │           │
   │Word proc  │        │           │
   │(innocent) │        │           │
   │    ↓      │        │           │
   │ accidental --→    ←-- listener │
   │  leakage  │        │(observes) │
┌──────────────────────────────────────────┐
│         Hypervisor/VMM                   │
└──────────────────────────────────────────┘

What’s happening:

  1. VM 1 innocently processes secret document
  2. VM 2 has attacker’s listener observing
  3. VM 1 doesn’t know it’s being attacked
  4. Listener infers information from hardware usage patterns

Why it’s “Bigger Problem”:

  • Covert channel: need to compromise BOTH VMs (harder)
  • Side channel: only compromise ONE VM (easier)
  • Victim doesn’t need malware
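
The same idea works at small scale inside one process: a victim that compares a guess against a secret with an early-exit loop leaks, through timing alone, how many leading characters matched. A sketch in which `SECRET` and the 1 ms per-character delay are purely illustrative:

```python
import string
import time

SECRET = "hunter2"  # hypothetical secret held by the "victim"

def victim_check(guess):
    # Innocent-looking early-exit comparison: each matching prefix
    # character costs ~1 ms, so runtime leaks how far the guess got.
    for a, b in zip(guess, SECRET):
        if a != b:
            return False
        time.sleep(0.001)
    return guess == SECRET

def attacker_first_char():
    # The attacker never reads SECRET; it only measures durations
    timings = {}
    for c in string.ascii_lowercase:
        start = time.perf_counter()
        victim_check(c + "aaaaaa")
        timings[c] = time.perf_counter() - start
    return max(timings, key=timings.get)  # slowest guess matched most

print(attacker_first_char())  # → h
```

Repeating the attack position by position recovers the whole secret; the fix is a constant-time comparison.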

VM Isolation in Practice - Cloud

Cloud Computing Architecture

Type 1 Hypervisor (No Host OS):

┌───────────┬───────────┬───────────┬──────────┐
│User Apps  │User Apps  │User Apps  │User Apps │
├───────────┼───────────┼───────────┼──────────┤
│Guest OS 1 │Guest OS 2 │Guest OS 3 │Guest OS n│
│(Kernel)   │(Kernel)   │(Kernel)   │(Kernel)  │
└───────────┴───────────┴───────────┴──────────┘
┌──────────────────────────────────────────────┐
│        Hypervisor/VMM (Xen, KVM, ESXi)       │
│              BIOS/SMM                        │
└──────────────────────────────────────────────┘
┌──────────────────────────────────────────────┐
│         Physical Hardware                    │
└──────────────────────────────────────────────┘

Key points:

  • Type 1 = No host OS (hypervisor runs directly on hardware)
  • Used by AWS, Azure, Google Cloud
  • More secure (smaller TCB)
  • More efficient (no extra OS layer)

Security in cloud:

  • VMs from different customers share hardware
  • Hypervisor MUST isolate them
  • But “some info leaks” (side channels)

VM Isolation in Practice - End User

Qubes OS - Security-Focused Desktop

Philosophy: Everything runs in a separate VM

Example configurations shown in slides:

Configuration 1:

┌──────────────┬──────────────┬──────────────┐
│Disposable VM │  Work VM     │ Personal VM  │
│(Debian)      │ (Windows)    │ (Debian)     │
│              │              │              │
│Sketchy PDFs  │Work email    │Personal email│
│Random links  │Office docs   │Social media  │
│Auto-deleted  │Slack/Teams   │Shopping      │
└──────────────┴──────────────┴──────────────┘
           Xen Hypervisor
           Hardware (Intel Xeon)

Configuration 2:

┌──────────────┬──────────────┬──────────────┐
│ Whonix VM    │  Work VM     │  Vault VM    │
│(Debian)      │ (Windows)    │ (Debian)     │
│              │              │              │
│All traffic   │Normal work   │Password mgr  │
│goes through  │activities    │Crypto keys   │
│TOR           │              │NO INTERNET   │
└──────────────┴──────────────┴──────────────┘
           Xen Hypervisor
           Hardware

Security benefits:

  • Compromise of one VM doesn’t affect others
  • Different VMs for different trust levels
  • Disposable VMs for risky activities

Every Window Frame Identifies VM Source

Visual Security in Qubes

Color-coded window borders:

  • Each VM has a distinct color border on its windows
  • Red border = Untrusted/Disposable VM
  • Yellow border = Work VM
  • Green border = Personal VM
  • Black border = Vault VM (most secure)

Why this matters:

  • Instantly see which VM a window belongs to
  • Prevents spoofing (malicious VM can’t fake border color)
  • GUI VM ensures frames drawn correctly
  • Helps user make security decisions

Example:

User sees:
- Login window with RED border → Fake! Don't enter real password
- Banking site with GREEN border → Expected VM ✓

GUI isolation:

  • Separate GUI VM manages window system
  • Prevents VMs from overlaying fake windows
  • Enforces visual security policy

Confinement - Summary

Many Sandboxing Techniques

From strongest to weakest:

  1. Physical air gap
    • Separate hardware, no connection
    • Strongest isolation, highest cost
  2. Virtual air gap (hypervisor)
    • Separate OS instances via VMM
    • Strong isolation, moderate cost
  3. System call interposition (SCI)
    • Filter syscalls (seccomp-BPF)
    • Fine-grained, low overhead
  4. Software Fault Isolation (SFI)
    • Insert checks in compiled code
    • Application-level isolation
  5. Application-specific
    • JavaScript sandbox in browser
    • Language-level isolation
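
Software Fault Isolation (item 4) can be sketched in miniature: every memory access by untrusted code is rewritten to mask its address into the sandbox's region, so even a corrupted pointer stays inside. Toy model only; real SFI instruments compiled code, and the base/mask values here are arbitrary.

```python
# "Memory" shared by host and sandbox; the sandbox may touch only
# the 4 KiB window at SANDBOX_BASE.
MEM = bytearray(1 << 16)
SANDBOX_BASE = 0x4000
SANDBOX_MASK = 0x0FFF

def sfi_store(addr, value):
    # Mask before every store: the access cannot leave the window
    MEM[SANDBOX_BASE | (addr & SANDBOX_MASK)] = value

def sfi_load(addr):
    return MEM[SANDBOX_BASE | (addr & SANDBOX_MASK)]

sfi_store(0x0005, 42)
print(sfi_load(0x0005))       # → 42
print(sfi_load(0xFFFF0005))   # → 42 (wild address masked into bounds)
```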

Complete Isolation Often Inappropriate

Programs need to communicate!

  • Apps must access shared resources
  • Need regulated interfaces for communication
  • Reference monitors mediate this communication

Examples:

  • Browser tabs → share bookmarks via browser process
  • Containers → communicate via virtual network
  • VMs → use shared folders with access control

Two Hardest Aspects

1. Specifying policy

  • What can apps do and not do?
  • Too restrictive → breaks functionality
  • Too permissive → doesn’t stop attacks
  • Auto-generation doesn’t work well
  • User decisions unreliable

2. Preventing covert channels

  • Fundamentally impossible while sharing hardware
  • Can only make them slow/noisy
  • Accept some risk for non-critical apps
  • Physically separate extremely sensitive work

Final Summary: Key Takeaways

Choosing the Right Isolation Level

Decision factors:

  • Threat model
  • Performance requirements
  • Trust in software
  • Resource constraints

Core Principles

  1. Defense in Depth - Layer multiple techniques
  2. Least Privilege - Minimal permissions by default
  3. Complete Mediation - Check every access
  4. Simplicity - Keep TCB small
  5. Accept Limitations - Perfect isolation impossible

What We Learned

✅ Why isolation is needed
✅ Reference monitors and TCB
✅ VMs vs Containers vs Syscall filtering
✅ chroot is broken, seccomp-BPF works
✅ Docker security mechanisms
✅ Covert and side channels exist


THE END of lecture 5 finally!!