ReproVM

Overview

A lightweight yet powerful task execution engine designed for reproducibility and efficiency

ReproVM is a lightweight task execution virtual machine written entirely in C99. It makes complex workflows—builds, data pipelines, test orchestrations, and analyses—reproducible, incremental, and efficient through content-addressed caching and explicit dependency tracking.

Think of it as a miniaturized, transparent cousin of modern build systems like Bazel or data pipeline orchestrators, but without hidden state or heavyweight dependencies. It operates on a declarative manifest of tasks, each describing what to run, what files it consumes and produces, and what it depends on.

100%: Pure C99
0: Runtime Dependencies
SHA-256: Content Addressing
∞: Task Parallelism

Key Features

Why ReproVM stands out from traditional build systems

🔄 Reproducibility

Every piece of work is content-addressed. If inputs and commands are identical, outputs are reused deterministically across machines and environments.

⚡ Incrementality

Only tasks whose inputs, commands, or upstream results have changed are re-executed. This saves substantial time on repeated runs.

🔍 Transparency

ASCII diagrams show exactly what ran, what was skipped, and why. There is no hidden state or magic, only complete visibility into the execution flow.

🚀 Parallel Execution

Independent tasks run concurrently with pthreads. The scheduler respects dependencies automatically while maximizing throughput.

📦 Content-Addressed Storage

All inputs and outputs are stored by SHA-256 hash, which gives natural deduplication, versioning, and provenance tracking.

🔧 Zero Dependencies

Pure C99 implementation with only standard libraries. No external runtime and no package managers: just compile and run.

🌐 Polyglot Support

Tasks can run commands in any language: C, Python, Go, Rust, Ruby, Node.js. The VM doesn't care what's in your command.

🐳 Docker Ready

Full Docker and docker-compose support for containerized, reproducible execution environments across teams.

Architecture

Understanding ReproVM's internal design

Core Components

📄 Manifest Parser

🔗 DAG Builder

🔐 SHA-256 Hasher

💾 Content-Addressed Storage

⚙️ Task Executor

🧵 Parallel Scheduler

📊 Graph Visualizer

💽 Cache Manager

Execution Flow

flowchart TB
    Start([User invokes CLI]) --> Parse[Parse Manifest]
    Parse --> Build[Build Dependency DAG]
    Build --> Filter[Filter Target Tasks]
    Filter --> Topo[Topological Sort]
    Topo --> Ready{Ready Tasks?}
    Ready -->|Yes| Schedule[Schedule Parallel Execution]
    Ready -->|No| Done{All Complete?}
    Schedule --> Task[Task Instance]

    subgraph "Task Execution"
        Task --> Hash[Compute Task Hash<br/>cmd + inputs + deps]
        Hash --> Cache{Cache Hit?}
        Cache -->|Yes + Not Forced| Restore[Restore from CAS<br/>⚡ SKIP]
        Cache -->|No or Forced| Execute[Execute Command<br/>🔧 RUN]
        Execute --> Success{Exit 0?}
        Success -->|Yes| Store[Store Outputs to CAS<br/>💾 CACHE]
        Success -->|No| Fail[Mark Failed<br/>❌ ERROR]
        Store --> UpdateSuccess[Update: SUCCESS ✓]
        Restore --> UpdateSkip[Update: SKIPPED *]
        Fail --> UpdateFail[Update: FAILED X]
    end

    UpdateSuccess --> Ready
    UpdateSkip --> Ready
    UpdateFail --> Ready
    Done -->|Yes| Visualize[Render Final Graph]
    Done -->|No| Ready
    Visualize --> End([Exit])

    style Start fill:#2563eb
    style End fill:#10b981
    style Fail fill:#ef4444
    style Restore fill:#7c3aed
    style Store fill:#f59e0b
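
The "Topological Sort" step in the flow above can be illustrated with Kahn's algorithm. This is a minimal Python sketch for illustration only, not ReproVM's C implementation; the function name `topo_order` and the dict-based graph representation are assumptions.

```python
from collections import deque


def topo_order(deps):
    """Return task names in dependency order (Kahn's algorithm).

    `deps` maps each task name to the list of tasks it depends on.
    Raises ValueError on an unknown dependency or a cycle.
    """
    # Count unmet dependencies and build reverse edges (task -> dependents).
    remaining = {name: len(set(d)) for name, d in deps.items()}
    dependents = {name: [] for name in deps}
    for name, d in deps.items():
        for dep in set(d):
            if dep not in deps:
                raise ValueError(f"unknown dependency: {dep}")
            dependents[dep].append(name)

    # Tasks with no unmet dependencies are ready immediately.
    ready = deque(sorted(n for n, count in remaining.items() if count == 0))
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)
        for child in sorted(dependents[task]):
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)

    # Any task left unscheduled is part of a cycle.
    if len(order) != len(deps):
        raise ValueError("cycle detected")
    return order


print(topo_order({"build": [], "test": ["build"], "checksum": ["test"]}))
# prints ['build', 'test', 'checksum']
```

Tasks that become ready at the same time have no ordering constraint between them, which is what allows a parallel scheduler to run them concurrently.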

Content-Addressed Storage Layout

project/
├── .reprovm/
│   ├── cas/
│   │   └── objects/
│   │       ├── a3/
│   │       │   └── f5d9c0e1b2c3d4...  (blob: file content)
│   │       ├── 5f/
│   │       │   └── 6e7d8c9b0a1f2e...  (blob: output data)
│   │       └── 9a/
│   │           └── 8b7c6d5e4f3a2b...  (blob: binary)
│   └── cache/
│       ├── <task_hash>.meta  (metadata: task identity)
│       ├── <task_hash>.meta
│       └── <task_hash>.meta
├── manifest.txt
└── source files...
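
The layout above suggests that each blob lives at objects/<first two hex characters>/<remaining characters> of its SHA-256 hash. The following Python sketch shows that scheme under this assumption; `cas_put` and `cas_get` are hypothetical helper names, not functions from ReproVM's cas.c.

```python
import hashlib
import os
import tempfile


def cas_put(root, data):
    """Store `data` under root/objects/<2 hex>/<62 hex> and return its hash."""
    digest = hashlib.sha256(data).hexdigest()
    directory = os.path.join(root, "objects", digest[:2])
    path = os.path.join(directory, digest[2:])
    if not os.path.exists(path):  # identical content is stored only once
        os.makedirs(directory, exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)
    return digest


def cas_get(root, digest):
    """Read a blob back by its hash."""
    with open(os.path.join(root, "objects", digest[:2], digest[2:]), "rb") as f:
        return f.read()


# Demo in a throwaway directory standing in for .reprovm/cas
cas_root = tempfile.mkdtemp()
blob_hash = cas_put(cas_root, b"hello")
print(blob_hash[:2], blob_hash[2:])
print(cas_get(cas_root, blob_hash))
```

Because the path is derived purely from content, storing the same bytes twice is a no-op, which is where the deduplication comes from.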

Task Hash Computation

graph LR
    A[Command String] --> H[SHA-256]
    B[Sorted Input Hashes] --> H
    C[Dependency Result Hashes] --> H
    H --> D[Task Hash<br/>64-char hex]
    D --> E{Cache Lookup}
    E -->|Hit| F[Restore Outputs]
    E -->|Miss| G[Execute & Store]
    style D fill:#2563eb
    style F fill:#10b981
    style G fill:#f59e0b
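
As a rough illustration of the diagram above, a task hash can be derived by feeding the command, the sorted input hashes, and the dependency result hashes into one SHA-256 digest. The Python sketch below is an assumption-laden example: the field tags, the NUL separators, and the choice to sort dependency hashes are illustrative and may differ from the byte layout ReproVM actually uses.

```python
import hashlib


def task_hash(cmd, input_hashes, dep_result_hashes):
    """Combine command, input hashes, and dependency results into one digest."""
    h = hashlib.sha256()
    # Tag and terminate each field so adjacent fields cannot run together.
    h.update(b"cmd\0" + cmd.encode("utf-8") + b"\0")
    for digest in sorted(input_hashes):       # sorted: declaration order is irrelevant
        h.update(b"in\0" + digest.encode("ascii") + b"\0")
    for digest in sorted(dep_result_hashes):  # sorting here is an assumption
        h.update(b"dep\0" + digest.encode("ascii") + b"\0")
    return h.hexdigest()


in_a = hashlib.sha256(b"contents of a.c").hexdigest()
in_b = hashlib.sha256(b"contents of b.c").hexdigest()

h1 = task_hash("gcc -o app a.c b.c", [in_a, in_b], [])
h2 = task_hash("gcc -o app a.c b.c", [in_b, in_a], [])      # same inputs, other order
h3 = task_hash("gcc -O2 -o app a.c b.c", [in_a, in_b], [])  # command changed
print(len(h1), h1 == h2, h1 == h3)
# prints 64 True False
```

Any change to the command, to an input's content, or to an upstream result yields a different task hash, so the cache lookup misses and the task is re-executed.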

Technology Stack

Built on battle-tested, industry-standard technologies

🔧 Core Implementation: C99, GCC, Make, POSIX

🔐 Security & Hashing: SHA-256, CRC32, CAS

⚡ Parallelism: pthreads, Mutex/CV, DAG Scheduling

🐳 Deployment: Docker, Docker Compose, CI/CD

🌐 Polyglot Extensions: Rust, Python, Ruby, Node.js, NASM

🔍 Code Quality: clang-format, clang-tidy, uncrustify, AStyle

Quick Start

Get up and running in under 5 minutes

1. Clone the Repository

git clone https://github.com/hoangsonww/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine

2. Build ReproVM

make
# Builds both serial (reprovm) and parallel (reprovm_parallel) binaries

Output:

gcc -std=c99 -O2 -Wall -Wextra -g -c main.c -o main.o
gcc -std=c99 -O2 -Wall -Wextra -g -c task.c -o task.o
gcc -std=c99 -O2 -Wall -Wextra -g -c cas.c -o cas.o
gcc -std=c99 -O2 -Wall -Wextra -g -c util.c -o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm main.o task.o cas.o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm_parallel ...

3. Create a Manifest

task build {
  cmd = gcc -o hello hello.c
  inputs = hello.c
  outputs = hello
  deps =
}

task test {
  cmd = ./hello > result.txt
  inputs = hello
  outputs = result.txt
  deps = build
}

task checksum {
  cmd = sha256sum result.txt > result.sha
  inputs = result.txt
  outputs = result.sha
  deps = test
}

4. Run Your Pipeline

# Serial execution
./reprovm manifest.txt

# Parallel execution (auto-detects CPU count)
./reprovm_parallel manifest.txt

# Parallel with specific worker count
./reprovm_parallel -j 4 manifest.txt

5. Marvel at Cache Hits ⚡

Run again without changes:

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
[*] build (cache hit)
[*] test (cache hit)
[*] checksum (cache hit)
All tasks completed. ✓

🎉 Success! All tasks were skipped because nothing changed. This is the power of content-addressed caching—instant, deterministic rebuilds.

Documentation

Everything you need to know about ReproVM

📖 Manifest Specification

Tasks are declared in a simple DSL:

task <name> {
  cmd = <shell command>
  inputs = <file1>, <file2>, ...
  outputs = <out1>, <out2>, ...
  deps = <task1>, <task2>, ...
}

cmd: Shell command to execute
inputs: Files consumed by the task
outputs: Files produced by the task
deps: Task dependencies (execution order)
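
To make the grammar concrete, here is a small Python sketch of a parser for the manifest format described above. It is a simplified illustration rather than ReproVM's own parser; the function name `parse_manifest` and the error messages are assumptions, and it only handles the four keys listed.

```python
import re


def parse_manifest(text):
    """Parse `task <name> { key = value ... }` blocks into a dict of tasks."""
    tasks = {}
    current = None
    for lineno, raw in enumerate(text.splitlines(), 1):
        line = raw.strip()
        if not line:
            continue
        header = re.fullmatch(r"task\s+([\w.-]+)\s*\{", line)
        if header:
            if current is not None:
                raise ValueError(f"line {lineno}: nested task block")
            current = header.group(1)
            tasks[current] = {"cmd": "", "inputs": [], "outputs": [], "deps": []}
        elif line == "}":
            if current is None:
                raise ValueError(f"line {lineno}: unexpected '}}'")
            current = None
        elif current is not None and "=" in line:
            # Split on the first '=' only, so commands may contain '='.
            key, value = (part.strip() for part in line.split("=", 1))
            if key == "cmd":
                tasks[current]["cmd"] = value
            elif key in ("inputs", "outputs", "deps"):
                tasks[current][key] = [v.strip() for v in value.split(",") if v.strip()]
            else:
                raise ValueError(f"line {lineno}: unknown key {key!r}")
        else:
            raise ValueError(f"line {lineno}: unexpected text {line!r}")
    if current is not None:
        raise ValueError("unterminated task block")
    return tasks


example = """
task build {
  cmd = gcc -o hello hello.c
  inputs = hello.c
  outputs = hello
  deps =
}

task test {
  cmd = ./hello > result.txt
  inputs = hello
  outputs = result.txt
  deps = build
}
"""
parsed = parse_manifest(example)
print(parsed["test"])
```

An empty right-hand side, as in `deps =`, parses to an empty list.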

🎯 CLI Usage

# Run all tasks
./reprovm manifest.txt

# Run specific targets
./reprovm manifest.txt target1 target2

# Parallel execution
./reprovm_parallel -j N manifest.txt

# Force rebuild (ignore cache)
rm -rf .reprovm
./reprovm manifest.txt
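
When specific targets are requested, the engine needs the targets plus everything they transitively depend on. The Python sketch below illustrates that selection, assuming target filtering works this way; `required_tasks` is a hypothetical name, and the task graph is taken from the Web Asset Pipeline example.

```python
def required_tasks(deps, targets):
    """Return the set of tasks needed for `targets`, including transitive deps."""
    needed = set()
    stack = list(targets)
    while stack:
        name = stack.pop()
        if name not in deps:
            raise ValueError(f"Unknown target '{name}'")
        if name not in needed:
            needed.add(name)
            stack.extend(deps[name])  # walk upstream
    return needed


web_deps = {
    "compile_ts": [],
    "bundle_js": ["compile_ts"],
    "minify_css": [],
    "package_assets": ["bundle_js", "minify_css"],
}
print(sorted(required_tasks(web_deps, ["bundle_js"])))
# prints ['bundle_js', 'compile_ts']
```

Requesting `bundle_js` leaves `minify_css` and `package_assets` untouched because nothing on the path to the target depends on them.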

🔍 Task Statuses

Symbol	Status	Meaning
[ ]	Pending	Not yet executed
[~]	Running	Currently executing
[*]	Skipped	Cache hit, restored from CAS
[✔]	Success	Executed successfully
[X]	Failed	Command returned non-zero

🐳 Docker Usage

# Build Docker image
docker build -t reprovm:latest .

# Run with volume mount
docker run --rm -v "$(pwd)":/workspace \
  -w /workspace reprovm:latest \
  ./reprovm_parallel manifest.txt

# Use docker-compose
docker compose up reprovm

# Interactive shell
docker run --rm -it -v "$(pwd)":/workspace \
  -w /workspace reprovm:latest bash

⚙️ Cache Management

# View cache metadata
cat .reprovm/cache/<task_hash>.meta

# Clean cache
rm -rf .reprovm

# Selective invalidation
rm .reprovm/cache/<specific_task>.meta

# CI cache restoration
./ci_restore_cache.sh

🔐 Security Considerations

⚠️ Warning:

Tasks execute arbitrary shell commands—only run trusted manifests
Shared .reprovm directories require trust boundaries
CAS blobs can be replaced on disk (no cryptographic signing by default)
Consider implementing integrity verification for production use

Examples

Real-world manifests demonstrating ReproVM's capabilities

🏗️ C Build Pipeline

task compile {
  cmd = gcc -c -o app.o app.c
  inputs = app.c
  outputs = app.o
  deps =
}

task link {
  cmd = gcc -o app app.o
  inputs = app.o
  outputs = app
  deps = compile
}

task test {
  cmd = ./app --test > test.log
  inputs = app
  outputs = test.log
  deps = link
}

📊 Data Pipeline

task fetch_data {
  cmd = curl -o data.csv https://api.example.com/data
  inputs =
  outputs = data.csv
  deps =
}

task clean_data {
  cmd = python3 clean.py data.csv > clean.csv
  inputs = data.csv
  outputs = clean.csv
  deps = fetch_data
}

task analyze {
  cmd = python3 analyze.py clean.csv > report.json
  inputs = clean.csv
  outputs = report.json
  deps = clean_data
}

task visualize {
  cmd = python3 plot.py report.json chart.png
  inputs = report.json
  outputs = chart.png
  deps = analyze
}

🤖 ML Training Pipeline

task preprocess {
  cmd = python3 preprocess.py raw.csv processed.csv
  inputs = raw.csv
  outputs = processed.csv
  deps =
}

task train_model {
  cmd = python3 train.py processed.csv model.pkl
  inputs = processed.csv
  outputs = model.pkl
  deps = preprocess
}

task evaluate {
  cmd = python3 eval.py model.pkl metrics.json
  inputs = model.pkl, processed.csv
  outputs = metrics.json
  deps = train_model
}

task package {
  cmd = tar czf model.tar.gz model.pkl metrics.json
  inputs = model.pkl, metrics.json
  outputs = model.tar.gz
  deps = evaluate
}

🌐 Web Asset Pipeline

task compile_ts {
  cmd = tsc src/index.ts --outDir dist/
  inputs = src/index.ts
  outputs = dist/index.js
  deps =
}

task bundle_js {
  cmd = webpack dist/index.js -o dist/bundle.js
  inputs = dist/index.js
  outputs = dist/bundle.js
  deps = compile_ts
}

task minify_css {
  cmd = cleancss -o dist/style.min.css src/style.css
  inputs = src/style.css
  outputs = dist/style.min.css
  deps =
}

task package_assets {
  cmd = tar czf assets.tar.gz dist/
  inputs = dist/bundle.js, dist/style.min.css
  outputs = assets.tar.gz
  deps = bundle_js, minify_css
}

📈 Performance Comparison

graph LR
    A[Cold Run<br/>No Cache] -->|100%| B[3.2s]
    C[Warm Run<br/>Full Cache] -->|~0%| D[0.08s]
    E[Partial Change<br/>1 task] -->|33%| F[1.1s]
    style B fill:#ef4444
    style D fill:#10b981
    style F fill:#f59e0b

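💡 Pro Tip: In CI/CD pipelines, persist the .reprovm directory between runs so that unchanged steps are restored from cache instead of re-executed. In the comparison above, the fully cached run is 40x faster than the cold run (0.08s vs. 3.2s).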

Advanced Features

Power-user capabilities and extension points

🧵 Parallel Execution

The reprovm_parallel binary uses pthreads to execute independent tasks concurrently:

# Auto-detect CPU count
./reprovm_parallel manifest.txt

# Specify worker threads
./reprovm_parallel -j 8 manifest.txt

# Set via environment
export REPROVM_JOBS=16
./reprovm_parallel manifest.txt

How it works:

Tasks with satisfied dependencies run immediately
Worker pool managed via mutex/condition variables
Failures propagate but don't block independent tasks
Graph updates are serialized for clean output
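
The scheduling rules above can be sketched with a thread pool. The example below is in Python for brevity and is not the pthreads implementation in reprovm_parallel; the function name `run_parallel`, the `run_task` callback, and the "blocked" status label are assumptions made for illustration.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_parallel(deps, run_task, workers=4):
    """Run tasks as their deps succeed; return {task: 'success'|'failed'|'blocked'}."""
    status = {}
    pending = set(deps)
    running = {}  # future -> task name
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending or running:
            progressed = False
            for name in sorted(pending):
                states = [status.get(d) for d in deps[name]]
                if any(s in ("failed", "blocked") for s in states):
                    status[name] = "blocked"  # an upstream task failed
                    pending.discard(name)
                    progressed = True
                elif all(s == "success" for s in states):
                    running[pool.submit(run_task, name)] = name
                    pending.discard(name)
                    progressed = True
            if running:
                # Sleep until at least one worker finishes, then record results.
                done, _ = wait(list(running), return_when=FIRST_COMPLETED)
                for fut in done:
                    name = running.pop(fut)
                    ok = fut.exception() is None and bool(fut.result())
                    status[name] = "success" if ok else "failed"
            elif not progressed:
                raise ValueError("cycle detected or unknown dependency")
    return status


graph = {"a": [], "b": ["a"], "c": [], "d": ["b"]}
result = run_parallel(graph, lambda name: name != "a")  # task "a" fails
print(sorted(result.items()))
# prints [('a', 'failed'), ('b', 'blocked'), ('c', 'success'), ('d', 'blocked')]
```

Task `c` still succeeds even though `a` fails, because it does not depend on `a`, while `b` and `d` are never started.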

🌐 Remote CAS (Extension)

Extend ReproVM to push/pull blobs from remote storage:

// cas.h extension point
int cas_push_remote(const char *hash, const char *url);
int cas_pull_remote(const char *hash, const char *url);

// Example: S3 backend
cas_push_remote(hash, "s3://bucket/cas/objects/");
cas_pull_remote(hash, "s3://bucket/cas/objects/");

🔄 CI/CD Integration

ReproVM is designed for CI/CD environments:

# .github/workflows/ci.yml
name: CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Restore Cache
        uses: actions/cache@v3
        with:
          path: .reprovm
          key: reprovm-${{ hashFiles('manifest.txt') }}
      - name: Build
        run: |
          make
          ./reprovm_parallel -j 4 manifest.txt
      # No separate save step is needed: actions/cache saves .reprovm
      # automatically in its post-job step when the key was not restored.

📊 Provenance Tracking

Every cache metadata file contains full provenance:

task_hash: a3f5d9c0e1b2c3d4e5f6a7b8...
result_hash: 1f2e3d4c5b6a7980a9b8c7d6...
output hello 5e2f1a3b4c6d7e8f9a0b1c2d...
output result.txt 9c4d7a1e2f3b5c6d7e8f...

Use this to:

Trace output back to exact inputs and commands
Audit reproducibility across environments
Build compliance artifacts for regulated industries
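
For auditing, the metadata format shown above is easy to read programmatically. The Python sketch below parses it into a dictionary; it assumes the file contains only the three line types from the example (`task_hash:`, `result_hash:`, and `output <path> <hash>`), and `parse_meta` is a hypothetical helper name. The sample values are the truncated hashes from the example, kept as-is.

```python
def parse_meta(text):
    """Parse a .meta file into {'task_hash', 'result_hash', 'outputs': {path: hash}}."""
    meta = {"task_hash": None, "result_hash": None, "outputs": {}}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("output "):
            # The hash is the last field; the path may itself contain spaces.
            path, digest = line[len("output "):].rsplit(" ", 1)
            meta["outputs"][path] = digest
        elif ":" in line:
            key, value = line.split(":", 1)
            if key.strip() in ("task_hash", "result_hash"):
                meta[key.strip()] = value.strip()
    return meta


sample = """task_hash: a3f5d9c0e1b2c3d4e5f6a7b8...
result_hash: 1f2e3d4c5b6a7980a9b8c7d6...
output hello 5e2f1a3b4c6d7e8f9a0b1c2d...
output result.txt 9c4d7a1e2f3b5c6d7e8f...
"""
meta = parse_meta(sample)
print(meta["task_hash"])
print(sorted(meta["outputs"]))
```

Each output hash can then be looked up in the CAS to recover the exact bytes a task produced.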

🔌 Language Server Integration

For complex codebases, integrate with language servers:

# Start the TypeScript language server (typescript-language-server package)
typescript-language-server --stdio

# Use with ReproVM to record which files the compiler pulls in
task analyze_deps {
  cmd = tsc --listFilesOnly src/index.ts src/utils.ts > deps.txt
  inputs = src/index.ts, src/utils.ts
  outputs = deps.txt
  deps =
}

🧪 Testing ReproVM

Comprehensive test suite included:

# Run all tests
cd tests && ./test_all.sh

# Individual test suites
./test_cas.sh      # CAS functionality
./test_parallel.sh # Parallel execution
./test_manifest.sh # Manifest parsing
./test_util.sh     # Utility functions

Contributing

Join the ReproVM community

We welcome contributions! Here's how you can help:

🐛 Report bugs and issues
💡 Suggest new features
📝 Improve documentation
🔧 Submit pull requests
⭐ Star the repo to show support

Development Setup

# Fork and clone
git clone https://github.com/YOUR_USERNAME/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine

# Create feature branch
git checkout -b feature/amazing-feature

# Build and test
make
cd tests && ./test_all.sh

# Make changes, commit, push
git add .
git commit -m "Add amazing feature"
git push origin feature/amazing-feature

# Open Pull Request on GitHub

Code Style

ReproVM uses automated formatting:

# Format all code
./format.sh

# Pre-commit hook
cp .githooks/pre-commit .git/hooks/
chmod +x .git/hooks/pre-commit

Frequently Asked Questions

❓ How does caching work?

Each task's identity is a SHA-256 hash of its command, sorted input hashes, and dependency result hashes. If this hash exists in .reprovm/cache/, outputs are restored from CAS instead of re-executing.
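
The lookup itself amounts to checking whether a metadata file named after the task hash exists. The Python sketch below shows that decision, including the "forced" path from the execution flow; `should_run` and its `force` parameter are hypothetical names, and the `<task_hash>.meta` naming follows the Cache Management examples.

```python
import os
import tempfile


def should_run(reprovm_dir, task_hash, force=False):
    """Return True if the task must execute, False if outputs can be restored."""
    meta_path = os.path.join(reprovm_dir, "cache", task_hash + ".meta")
    return force or not os.path.isfile(meta_path)


# Demo against a throwaway directory standing in for .reprovm
demo_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(demo_dir, "cache"))
known = "a" * 64  # placeholder hash for a previously cached task
open(os.path.join(demo_dir, "cache", known + ".meta"), "w").close()

print(should_run(demo_dir, known))              # cached: restore instead of running
print(should_run(demo_dir, "b" * 64))           # never seen: must run
print(should_run(demo_dir, known, force=True))  # forced: run despite the hit
# prints False, True, True (one per line)
```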

❓ Can I share cache across machines?

Yes! Copy .reprovm/cas/ and .reprovm/cache/ to another machine. For teams, implement a remote CAS backend (S3, HTTP, etc.) as an extension.

❓ What if I rename a file?

File identity is content-based in CAS, but manifests reference paths. Renaming changes the input list, causing task re-execution. Create symlinks or update the manifest.

❓ How do I debug failed tasks?

Run the command manually from the shell to see detailed errors. Check .reprovm/cache/ for task metadata and inspect input hashes.

❓ Does ReproVM support Windows?

ReproVM targets POSIX systems (Linux, macOS, BSD). Windows support via WSL, Cygwin, or MinGW is possible but untested.

❓ How fast is it compared to Make/Bazel?

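For a fully cached run, ReproVM is near-instant: the example pipeline above finishes in about 80 ms. Cold builds are dominated by the commands themselves, so performance is comparable to Make. Bazel offers more optimizations, such as sandboxing and remote execution, at the cost of a heavier runtime.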

Troubleshooting

Common issues and solutions

Symptom	Cause	Solution
Unknown target 'foo'	Target not defined in manifest	Check spelling, ensure task exists
Cycle detected	Circular dependency in DAG	Break cycle by removing/reordering deps
Task failed with exit code N	Command returned non-zero	Run command manually to debug
Cache hit but output missing	Output not declared in manifest	Add file to outputs list
Corrupted .meta file	Cache metadata unreadable	Delete .meta to force recompute
Failed to hash input	Input file missing/unreadable	Check file exists and permissions
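
When a "Cycle detected" error appears, it helps to know which tasks form the loop. The Python sketch below finds one cycle with a depth-first search over the `deps` lists; it is a standalone diagnostic aid written for illustration, with `find_cycle` as a hypothetical name, and is not a ReproVM feature.

```python
def find_cycle(deps):
    """Return one dependency cycle as a list of task names, or None if acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {name: WHITE for name in deps}
    path = []

    def visit(name):
        color[name] = GRAY
        path.append(name)
        for dep in deps[name]:
            if color.get(dep) == GRAY:
                # dep is already on the current path: the loop closes here
                return path[path.index(dep):] + [dep]
            if color.get(dep) == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[name] = BLACK
        return None

    for name in sorted(deps):
        if color[name] == WHITE:
            found = visit(name)
            if found:
                return found
    return None


print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))
# prints ['a', 'b', 'c', 'a']
print(find_cycle({"build": [], "test": ["build"]}))
# prints None
```

Removing any one edge on the reported path breaks the cycle.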

🔧 Debug Mode (planned): A REPROVM_VERBOSE=1 setting is a future extension and is not available yet. Once implemented, it is intended to show detailed internal decisions and hash computations.

Roadmap

Future enhancements and planned features

✅ Completed

Core task execution engine
Content-addressed storage
SHA-256 hashing and caching
Parallel execution with pthreads
Docker support
Manifest DSL parser

🚧 In Progress

Remote CAS backend (S3, HTTP)
Web UI for DAG visualization
Enhanced manifest validation
Performance profiling tools

🔮 Planned

Distributed execution across nodes
Language bindings (Python, Rust, Go)
Cryptographic signing of cache entries
YAML/JSON manifest formats
Built-in metrics and observability
IDE integrations (VSCode, IntelliJ)