Reproducible Task Execution Virtual Machine with Content-Addressed Caching
A lightweight yet powerful task execution engine designed for reproducibility and efficiency
ReproVM is a lightweight task execution virtual machine written entirely in C99. It makes complex workflows (builds, data pipelines, test orchestrations, and analyses) reproducible, incremental, and efficient through content-addressed caching and explicit dependency tracking.
Think of it as a miniaturized, transparent cousin of modern build systems like Bazel or data pipeline orchestrators, but without hidden state or heavyweight dependencies. It operates on a declarative manifest of tasks, each describing what to run, what files it consumes and produces, and what it depends on.
Why ReproVM stands out from traditional build systems
Every piece of work is content-addressed. If inputs and commands are identical, outputs are reused deterministically across machines and environments.
Only tasks whose inputs, commands, or upstream results have changed are re-executed. Massive time savings on repeated runs.
ASCII diagrams show exactly what ran, what was skipped, and why. No hidden state or magic: complete visibility into execution flow.
Execute independent tasks concurrently with pthreads. Respects dependencies automatically while maximizing throughput.
All inputs and outputs stored by SHA-256 hash. Natural deduplication, versioning, and provenance tracking built-in.
Pure C99 implementation with only standard libraries. No external runtime, no package managers: just compile and run.
Execute tasks in any language: C, Python, Go, Rust, Ruby, Node.js. The VM doesn't care what's in your command.
Full Docker and docker-compose support for containerized, reproducible execution environments across teams.
Understanding ReproVM's internal design
project/
├── .reprovm/
│   ├── cas/
│   │   └── objects/
│   │       ├── a3/
│   │       │   └── f5d9c0e1b2c3d4... (blob: file content)
│   │       ├── 5f/
│   │       │   └── 6e7d8c9b0a1f2e... (blob: output data)
│   │       └── 9a/
│   │           └── 8b7c6d5e4f3a2b... (blob: binary)
│   └── cache/
│       ├── <task_hash>.meta (metadata: task identity)
│       ├── <task_hash>.meta
│       └── <task_hash>.meta
├── manifest.txt
└── source files...
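The layout above suggests that each blob is stored under a directory named after the first two hex characters of its hash, with the remaining characters as the file name. A minimal sketch of that mapping follows, assuming this reading of the tree is right; `cas_object_path` is a hypothetical helper name, not a function taken from ReproVM's sources.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not from ReproVM's sources): builds
 * "<root>/cas/objects/<first two hex chars>/<remaining chars>".
 * Returns 0 on success, -1 if the hash is too short to split
 * or the result does not fit in buf. */
int cas_object_path(const char *root, const char *hash, char *buf, size_t size)
{
    assert(root != NULL && hash != NULL && buf != NULL);
    if (strlen(hash) < 3)
        return -1;
    /* "%.2s" prints only the first two characters of the hash. */
    int n = snprintf(buf, size, "%s/cas/objects/%.2s/%s", root, hash, hash + 2);
    return (n < 0 || (size_t)n >= size) ? -1 : 0;
}
```

Splitting on a short prefix keeps any single directory from accumulating every object, which is a common choice in content-addressed stores.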
Built on battle-tested, industry-standard technologies
Get up and running in under 5 minutes
git clone https://github.com/hoangsonww/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine
make # Builds both serial (reprovm) and parallel (reprovm_parallel) binaries
Output:
gcc -std=c99 -O2 -Wall -Wextra -g -c main.c -o main.o
gcc -std=c99 -O2 -Wall -Wextra -g -c task.c -o task.o
gcc -std=c99 -O2 -Wall -Wextra -g -c cas.c -o cas.o
gcc -std=c99 -O2 -Wall -Wextra -g -c util.c -o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm main.o task.o cas.o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm_parallel ...
task build {
cmd = gcc -o hello hello.c
inputs = hello.c
outputs = hello
deps =
}
task test {
cmd = ./hello > result.txt
inputs = hello
outputs = result.txt
deps = build
}
task checksum {
cmd = sha256sum result.txt > result.sha
inputs = result.txt
outputs = result.sha
deps = test
}
# Serial execution
./reprovm manifest.txt

# Parallel execution (auto-detects CPU count)
./reprovm_parallel manifest.txt

# Parallel with specific worker count
./reprovm_parallel -j 4 manifest.txt
Run again without changes:
$ ./reprovm manifest.txt
Will execute 3 tasks in order: build test checksum
[*] build (cache hit)
[*] test (cache hit)
[*] checksum (cache hit)
All tasks completed. ✓
Everything you need to know about ReproVM
Tasks are declared in a simple DSL:
task <name> {
cmd = <shell command>
inputs = <file1>, <file2>, ...
outputs = <out1>, <out2>, ...
deps = <task1>, <task2>, ...
}
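The `inputs`, `outputs`, and `deps` fields are comma-separated lists, and a field may be empty (as in `deps =`). The sketch below shows one way such a value could be split into trimmed names. It is not ReproVM's actual parser; `split_list` and the `ITEM_MAX` limit are hypothetical, chosen only for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

#define ITEM_MAX 64  /* hypothetical limit on one file or task name */

/* Splits a comma-separated field value such as "a.c, b.c" into trimmed
 * items. Empty segments are skipped, so an empty value yields 0 items.
 * Returns the number of items stored, or -1 if an item is too long or
 * there are more than max_items. */
int split_list(const char *value, char items[][ITEM_MAX], int max_items)
{
    assert(value != NULL && items != NULL);
    int count = 0;
    const char *p = value;
    while (*p != '\0') {
        /* Find the end of the current comma-delimited segment. */
        const char *end = strchr(p, ',');
        if (end == NULL)
            end = p + strlen(p);
        /* Trim whitespace on both sides of the segment. */
        const char *start = p;
        const char *stop = end;
        while (start < stop && isspace((unsigned char)*start))
            start++;
        while (stop > start && isspace((unsigned char)stop[-1]))
            stop--;
        size_t len = (size_t)(stop - start);
        if (len > 0) {
            if (count >= max_items || len >= ITEM_MAX)
                return -1;
            memcpy(items[count], start, len);
            items[count][len] = '\0';
            count++;
        }
        /* Step past the comma, or stop at the terminating NUL. */
        p = (*end == ',') ? end + 1 : end;
    }
    return count;
}
```

With this sketch, the value `model.pkl, processed.csv` yields two items, and an empty `deps =` value yields none.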
# Run all tasks
./reprovm manifest.txt

# Run specific targets
./reprovm manifest.txt target1 target2

# Parallel execution
./reprovm_parallel -j N manifest.txt

# Force rebuild (ignore cache)
rm -rf .reprovm
./reprovm manifest.txt
| Symbol | Status | Meaning |
|---|---|---|
| [ ] | Pending | Not yet executed |
| [~] | Running | Currently executing |
| [*] | Skipped | Cache hit, restored from CAS |
| [✓] | Success | Executed successfully |
| [X] | Failed | Command returned non-zero |
# Build Docker image
docker build -t reprovm:latest .

# Run with volume mount
docker run --rm -v "$(pwd)":/workspace \
  -w /workspace reprovm:latest \
  ./reprovm_parallel manifest.txt

# Use docker-compose
docker compose up reprovm

# Interactive shell
docker run --rm -it -v "$(pwd)":/workspace \
  -w /workspace reprovm:latest bash
# View cache metadata
cat .reprovm/cache/<task_hash>.meta

# Clean cache
rm -rf .reprovm

# Selective invalidation
rm .reprovm/cache/<specific_task>.meta

# CI cache restoration
./ci_restore_cache.sh
.reprovm directories require trust boundaries
Real-world manifests demonstrating ReproVM's capabilities
task compile {
cmd = gcc -c -o app.o app.c
inputs = app.c
outputs = app.o
deps =
}
task link {
cmd = gcc -o app app.o
inputs = app.o
outputs = app
deps = compile
}
task test {
cmd = ./app --test > test.log
inputs = app
outputs = test.log
deps = link
}
task fetch_data {
cmd = curl -o data.csv https://api.example.com/data
inputs =
outputs = data.csv
deps =
}
task clean_data {
cmd = python3 clean.py data.csv > clean.csv
inputs = data.csv
outputs = clean.csv
deps = fetch_data
}
task analyze {
cmd = python3 analyze.py clean.csv > report.json
inputs = clean.csv
outputs = report.json
deps = clean_data
}
task visualize {
cmd = python3 plot.py report.json chart.png
inputs = report.json
outputs = chart.png
deps = analyze
}
task preprocess {
cmd = python3 preprocess.py raw.csv processed.csv
inputs = raw.csv
outputs = processed.csv
deps =
}
task train_model {
cmd = python3 train.py processed.csv model.pkl
inputs = processed.csv
outputs = model.pkl
deps = preprocess
}
task evaluate {
cmd = python3 eval.py model.pkl metrics.json
inputs = model.pkl, processed.csv
outputs = metrics.json
deps = train_model
}
task package {
cmd = tar czf model.tar.gz model.pkl metrics.json
inputs = model.pkl, metrics.json
outputs = model.tar.gz
deps = evaluate
}
task compile_ts {
cmd = tsc src/index.ts --outDir dist/
inputs = src/index.ts
outputs = dist/index.js
deps =
}
task bundle_js {
cmd = webpack dist/index.js -o dist/bundle.js
inputs = dist/index.js
outputs = dist/bundle.js
deps = compile_ts
}
task minify_css {
cmd = cleancss -o dist/style.min.css src/style.css
inputs = src/style.css
outputs = dist/style.min.css
deps =
}
task package_assets {
cmd = tar czf assets.tar.gz dist/
inputs = dist/bundle.js, dist/style.min.css
outputs = assets.tar.gz
deps = bundle_js, minify_css
}
Cache the .reprovm directory between runs to achieve 40x-90x speedups on unchanged steps.
Power-user capabilities and extension points
The reprovm_parallel binary uses pthreads to execute independent tasks concurrently:
# Auto-detect CPU count
./reprovm_parallel manifest.txt

# Specify worker threads
./reprovm_parallel -j 8 manifest.txt

# Set via environment
export REPROVM_JOBS=16
./reprovm_parallel manifest.txt
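One way to picture which tasks can run side by side is to group them into "waves": a task joins a wave only once all of its dependencies sit in earlier waves. The sketch below illustrates that idea and is not ReproVM's actual scheduler. `struct task`, `assign_waves`, and the size limits are hypothetical, and dependency indices are assumed to be valid.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_TASKS 16  /* hypothetical limits for this sketch */
#define MAX_DEPS  4

struct task {
    const char *name;
    int deps[MAX_DEPS];  /* indices of tasks that must finish first */
    int ndeps;
};

/* Assigns each task a wave number: tasks without deps get wave 0, and every
 * other task lands one wave after its latest dependency. Tasks sharing a
 * wave never depend on each other. Returns the number of waves, or -1 if a
 * circular dependency prevents progress. */
int assign_waves(const struct task *tasks, int n, int *wave)
{
    assert(tasks != NULL && wave != NULL && n <= MAX_TASKS);
    int done = 0, current = 0;
    for (int i = 0; i < n; i++)
        wave[i] = -1;  /* -1 means "not scheduled yet" */
    while (done < n) {
        int placed = 0;
        for (int i = 0; i < n; i++) {
            if (wave[i] != -1)
                continue;
            int ready = 1;
            for (int d = 0; d < tasks[i].ndeps; d++) {
                int w = wave[tasks[i].deps[d]];
                /* A dep placed in this same pass is not finished yet. */
                if (w == -1 || w == current)
                    ready = 0;
            }
            if (ready) {
                wave[i] = current;
                placed++;
            }
        }
        if (placed == 0)
            return -1;  /* nothing became ready: circular dependency */
        done += placed;
        current++;
    }
    return current;
}
```

Applied to the web-asset manifest shown earlier, `compile_ts` and `minify_css` share the first wave, `bundle_js` follows, and `package_assets` comes last. A thread pool could hand all tasks of one wave to its workers before moving on to the next.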
How it works:
Extend ReproVM to push/pull blobs from remote storage:
// cas.h extension point
int cas_push_remote(const char *hash, const char *url);
int cas_pull_remote(const char *hash, const char *url);

// Example: S3 backend
cas_push_remote(hash, "s3://bucket/cas/objects/");
cas_pull_remote(hash, "s3://bucket/cas/objects/");
ReproVM is designed for CI/CD environments:
# .github/workflows/ci.yml
name: CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Restore Cache
        uses: actions/cache@v3
        with:
          path: .reprovm
          key: reprovm-${{ hashFiles('manifest.txt') }}
      - name: Build
        run: |
          make
          ./reprovm_parallel -j 4 manifest.txt
      - name: Save Cache
        uses: actions/cache@v3
        with:
          path: .reprovm
          key: reprovm-${{ hashFiles('manifest.txt') }}
Every cache metadata file contains full provenance:
task_hash: a3f5d9c0e1b2c3d4e5f6a7b8...
result_hash: 1f2e3d4c5b6a7980a9b8c7d6...
output hello 5e2f1a3b4c6d7e8f9a0b1c2d...
output result.txt 9c4d7a1e2f3b5c6d7e8f...
Use this to:
For complex codebases, integrate with language servers:
# Start TypeScript language server
tsserver --stdio
# Use with ReproVM to analyze dependencies
task analyze_deps {
cmd = tsserver --check src/**/*.ts > deps.json
inputs = src/index.ts, src/utils.ts
outputs = deps.json
deps =
}
Comprehensive test suite included:
# Run all tests
cd tests && ./test_all.sh

# Individual test suites
./test_cas.sh       # CAS functionality
./test_parallel.sh  # Parallel execution
./test_manifest.sh  # Manifest parsing
./test_util.sh      # Utility functions
Join the ReproVM community
We welcome contributions! Here's how you can help:
# Fork and clone
git clone https://github.com/YOUR_USERNAME/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine

# Create feature branch
git checkout -b feature/amazing-feature

# Build and test (subshell keeps you in the repository root afterwards)
make
(cd tests && ./test_all.sh)

# Make changes, commit, push
git add .
git commit -m "Add amazing feature"
git push origin feature/amazing-feature

# Open Pull Request on GitHub
ReproVM uses automated formatting:
# Format all code
./format.sh

# Pre-commit hook
cp .githooks/pre-commit .git/hooks/
chmod +x .git/hooks/pre-commit
Each task's identity is a SHA-256 hash of its command, sorted input hashes, and dependency result hashes. If this hash exists in .reprovm/cache/, outputs are restored from CAS instead of re-executing.
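The identity described above can be sketched as a single hash fed with the command, then the input hashes in sorted order, then the dependency result hashes. The example below is a rough illustration under stated assumptions: it uses 64-bit FNV-1a as a short stand-in because SHA-256 is not part of the C standard library, and `task_key` is a hypothetical name rather than ReproVM's API.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* FNV-1a (64-bit) is only a compact stand-in here; ReproVM itself is
 * described as using SHA-256. */
static uint64_t fnv1a(uint64_t h, const char *s)
{
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;  /* FNV prime */
    }
    /* Mix in a terminator so "ab"+"c" and "a"+"bc" hash differently. */
    h ^= 0xffu;
    h *= 1099511628211ULL;
    return h;
}

static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical task_key(): combines the command, the input hashes (sorted,
 * so listing order does not matter), and the dependency result hashes.
 * Note: sorts the caller's input_hashes array in place. */
uint64_t task_key(const char *cmd, const char **input_hashes, size_t n_inputs,
                  const char **dep_results, size_t n_deps)
{
    assert(cmd != NULL);
    uint64_t h = 14695981039346656037ULL;  /* FNV offset basis */
    h = fnv1a(h, cmd);
    if (n_inputs > 1)
        qsort(input_hashes, n_inputs, sizeof *input_hashes, cmp_str);
    for (size_t i = 0; i < n_inputs; i++)
        h = fnv1a(h, input_hashes[i]);
    for (size_t i = 0; i < n_deps; i++)
        h = fnv1a(h, dep_results[i]);
    return h;
}
```

Because upstream result hashes feed into the key, a change early in the chain alters the key of every downstream task, which is what forces those tasks to re-run while untouched branches remain cache hits.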
Yes! Copy .reprovm/cas/ and .reprovm/cache/ to another machine. For teams, implement a remote CAS backend (S3, HTTP, etc.) as an extension.
File identity is content-based in CAS, but manifests reference paths. Renaming changes the input list, causing task re-execution. Create symlinks or update the manifest.
Run the command manually from the shell to see detailed errors. Check .reprovm/cache/ for task metadata and inspect input hashes.
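When inspecting metadata by hand becomes tedious, the `output <path> <hash>` records in a .meta file can be read programmatically. The sketch below assumes the line format matches the metadata example shown in this document; `struct meta_output` and `parse_output_line` are hypothetical names, and paths containing spaces are not handled.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

struct meta_output {
    char path[256];
    char hash[65];  /* room for 64 hex chars of SHA-256 plus terminator */
};

/* Parses one "output <path> <hash>" line from a .meta file.
 * Returns 1 if both fields were read, 0 for any other kind of line. */
int parse_output_line(const char *line, struct meta_output *out)
{
    assert(line != NULL && out != NULL);
    /* Require the exact "output " prefix before reading the two fields. */
    if (strncmp(line, "output ", 7) != 0)
        return 0;
    /* Field widths keep sscanf from overflowing the fixed-size buffers. */
    return sscanf(line + 7, "%255s %64s", out->path, out->hash) == 2;
}
```

Comparing the recorded output hashes between two runs is a quick way to see which artifact actually changed.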
ReproVM targets POSIX systems (Linux, macOS, BSD). Windows support via WSL, Cygwin, or MinGW is possible but untested.
For cache hits, ReproVM is near-instant (~80ms overhead). For cold builds, performance is comparable to Make. Bazel has more optimization but higher overhead.
Common issues and solutions
| Symptom | Cause | Solution |
|---|---|---|
| Unknown target 'foo' | Target not defined in manifest | Check spelling, ensure task exists |
| Cycle detected | Circular dependency in DAG | Break cycle by removing/reordering deps |
| Task failed with exit code N | Command returned non-zero | Run command manually to debug |
| Cache hit but output missing | Output not declared in manifest | Add file to outputs list |
| Corrupted .meta file | Cache metadata unreadable | Delete .meta to force recompute |
| Failed to hash input | Input file missing/unreadable | Check file exists and permissions |
Set REPROVM_VERBOSE=1 (future extension) to see detailed internal decisions and hash computations.
Future enhancements and planned features
Learn more about the technologies behind ReproVM