ReproVM is a lightweight yet powerful task execution virtual machine written entirely in C (C99). It is designed to make complex workflows — builds, data pipelines, test orchestrations, analyses — reproducible, incremental, and efficient through content-addressed caching and explicit dependency tracking. Think of it as a miniaturized, transparent cousin of modern build systems and pipeline orchestrators (e.g., parts of Docker, Bazel, or data lineage tooling) without hidden state or heavyweight dependencies.
It operates on a declarative manifest of tasks, each describing what to run, what files it consumes and produces, and what it depends on. ReproVM automatically computes identities (hashes), reuses work when possible, restores outputs, and visualizes the dependency graph with status in the terminal.
Real-world workflows suffer from repeated, wasted computation, hidden dependencies, and brittle pipelines. ReproVM addresses these problems:
- Task identity is derived from `cmd`, inputs, outputs, and declared dependencies.
- Metadata under `.reprovm/cache/` records what was produced, allowing a skip on repeat runs.
flowchart TB
U[User invokes CLI with manifest & optional targets/env flags] --> CLI["CLI Entry Point<br>(read env: FORCE, VERBOSE, targets)"]
subgraph "Input & Initialization"
CLI --> Parser["Manifest Parser<br>(parse DSL, validate syntax)"]
Parser --> DAG["Dependency DAG Builder / Scheduler<br>(build task graph, apply target filtering)"]
DAG --> ReadyCheck{"Any ready tasks?<br>(dependencies satisfied)"}
end
subgraph "Execution Engine"
ReadyCheck -- yes --> Dispatch["Parallel Executor / Worker Pool<br>(schedule ready tasks)"]
Dispatch --> TaskInstance["Task Instance"]
subgraph "Task Instance Logic"
TaskInstance --> ComputeHash["Compute Task Hash<br>(cmd + sorted input blob hashes + deps result hashes)"]
ComputeHash --> CacheLookup{"Cache record exists?<br>(task_hash) & not forced"}
CacheLookup -- hit & not forced --> RestoreOutputs["Restore outputs from CAS<br>(mark skipped)"]
CacheLookup -- miss or forced --> RunCmd["Execute shell command<br>(capture output, exit code)"]
RunCmd --> CmdSuccess{"Exit code == 0?"}
CmdSuccess -- yes --> StoreCAS["Store outputs in CAS<br>(SHA-256 content-addressed)"]
StoreCAS --> WriteMeta["Write cache metadata<br>(task_hash, result_hash, output map)"]
CmdSuccess -- no --> MarkFailed["Mark task failed<br>(record error)"]
RestoreOutputs --> UpdateSkipped["Update graph: skipped"]
WriteMeta --> UpdateSuccess["Update graph: success"]
MarkFailed --> UpdateFailed["Update graph: failed"]
end
UpdateSkipped --> DAG
UpdateSuccess --> DAG
UpdateFailed --> DAG
Dispatch --> ReadyCheck
end
ReadyCheck -- no --> AllDone{"All tasks processed?"}
AllDone -- yes --> Visualizer["Graph Visualizer<br>(render ASCII DAG with statuses)"]
Visualizer --> Summary["Final Summary & Exit Code<br>(success/failure, cache hits/misses)"]
AllDone -- no --> ReadyCheck
MarkFailed --> OverallFailure["Overall run marked failed"]
OverallFailure --> Visualizer
OverallFailure --> Summary
ChangeInput["Upstream input changed or manifest edited"] --> InvalidateDeps["Invalidate dependent task hashes"]
InvalidateDeps --> DAG
CLI --> ForceBypass["Force rebuild flag<br>(overrides cache hits)"]
ForceBypass --> CacheLookup
make
./reprovm manifest.txt
If nothing has changed, subsequent runs are near-instant due to cache hits.
- `gcc` or another C99-compatible compiler
- A shell (`sh`, `bash`, etc.)
git clone https://github.com/hoangsonww/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine
make
Expected output:
$ make
gcc -std=c99 -O2 -Wall -Wextra -g -c main.c -o main.o
gcc -std=c99 -O2 -Wall -Wextra -g -c task.c -o task.o
gcc -std=c99 -O2 -Wall -Wextra -g -c cas.c -o cas.o
gcc -std=c99 -O2 -Wall -Wextra -g -c util.c -o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm main.o task.o cas.o util.o
This produces the `reprovm` binary.
The ReproVM manifest is a simple, custom DSL. Whitespace is flexible; comments begin with `#`.
manifest := { task_block }+
task_block := "task" <name> "{" { field_line } "}"
field_line := <key> "=" <value>
key := "cmd" | "inputs" | "outputs" | "deps"
value := arbitrary string (for cmd), or comma-separated list (for others)
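The `field_line` rule above can be illustrated with a small splitter. This is a minimal sketch, not ReproVM's actual parser: `parse_field_line` and `copy_trimmed` are hypothetical names, and splitting at the first `=` is an assumption chosen so that a `cmd` value may itself contain `=`.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Copy [start, end) into dst with surrounding whitespace removed.
 * Returns 0 on success, -1 if the trimmed text does not fit. */
static int copy_trimmed(const char *start, const char *end,
                        char *dst, size_t dst_size)
{
    while (start < end && isspace((unsigned char)*start)) start++;
    while (end > start && isspace((unsigned char)end[-1])) end--;
    size_t len = (size_t)(end - start);
    if (len + 1 > dst_size) return -1;
    memcpy(dst, start, len);
    dst[len] = '\0';
    return 0;
}

/* Split one "<key> = <value>" line at the first '='.
 * Returns 0 on success, -1 if there is no '=', the key is empty,
 * or a buffer is too small. An empty value ("deps =") is accepted. */
int parse_field_line(const char *line,
                     char *key, size_t key_size,
                     char *value, size_t value_size)
{
    const char *eq = strchr(line, '=');
    if (eq == NULL) return -1;
    if (copy_trimmed(line, eq, key, key_size) != 0) return -1;
    if (key[0] == '\0') return -1;
    return copy_trimmed(eq + 1, line + strlen(line), value, value_size);
}
```

Calling `parse_field_line("  cmd = gcc -o hello hello.c", ...)` yields the key `cmd` and the value `gcc -o hello hello.c`; list-valued keys would still need a second pass to split on commas.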
- `cmd`: Shell command to execute. It should produce the declared outputs and respect the declared inputs.
- `inputs`: Comma-separated list of file paths consumed by the task.
- `outputs`: Comma-separated list of file paths produced by the task.
- `deps`: Comma-separated list of other task names that must run before this one.
task build {
cmd = gcc -o hello hello.c
inputs = hello.c
outputs = hello
deps =
}
task test {
cmd = ./hello > result.txt
inputs = hello
outputs = result.txt
deps = build
}
task checksum {
cmd = sha256sum result.txt > result.sha
inputs = result.txt
outputs = result.sha
deps = test
}
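- Explicit dependencies (`deps`) are used to control ordering beyond just file-based inference.
- For a task with no dependencies, `deps` can be left empty after the equals sign (`deps =`).
For each task in topological order:
1. Compute the task hash from `cmd`, the sorted input blob hashes, and the result hashes of its dependencies.
2. Look for a cache record at `.reprovm/cache/<task_hash>.meta`.
3. On a hit, restore the outputs from the CAS and mark the task skipped (`[*]`).
4. On a miss, run `cmd` via `system()`.
5. On success, store the outputs in the CAS and write a `.meta` file containing task_hash, result_hash, and output-to-hash mapping.
Statuses: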
- `[ ]` pending
- `[~]` running
- `[*]` skipped (cache hit)
- `[✔]` success
- `[X]` failed
Blobs are stored under:
.reprovm/cas/objects/<first-two-hex>/<remaining-hash>
Example:
.reprovm/cas/objects/a3/f5d9c0e1b2... # blob for a file or output
This two-level split spreads blobs across up to 256 subdirectories, so no single directory grows too large.
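The layout above can be derived from a hex digest with a few lines of C. This is a sketch under assumptions: `cas_object_path` is a hypothetical helper name, and `root` stands for the `.reprovm` directory.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Build "<root>/cas/objects/<first-two-hex>/<remaining-hash>".
 * Returns 0 on success, -1 if the digest is shorter than 3 characters
 * or the result does not fit in out. */
int cas_object_path(const char *root, const char *hex_digest,
                    char *out, size_t out_size)
{
    if (strlen(hex_digest) < 3) return -1;
    int n = snprintf(out, out_size, "%s/cas/objects/%.2s/%s",
                     root, hex_digest, hex_digest + 2);
    if (n < 0 || (size_t)n >= out_size) return -1;
    return 0;
}
```

For the digest prefix `a3f5d9c0e1b2` and root `.reprovm`, this produces `.reprovm/cas/objects/a3/f5d9c0e1b2`.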
Each task produces a metadata file:
.reprovm/cache/<task_hash>.meta
Example contents:
task_hash: a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7
result_hash: 7d9a4f2e5b3c1a0d6e8f9b7c4d3e2f1a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2
output hello 5e2f1a3b4c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1
output result.txt 9c4d7a1e2f3b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9
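Reading such a file back can be sketched line by line. The struct, limits, and function name below are hypothetical and only mirror the three line shapes in the example (`task_hash:`, `result_hash:`, `output <path> <hash>`); paths containing spaces are not handled in this sketch.

```c
#include <stdio.h>
#include <string.h>

#define META_HASH_LEN 64      /* hex characters in a SHA-256 digest */
#define META_PATH_LEN 255     /* assumed path limit for this sketch */
#define META_MAX_OUTPUTS 16   /* assumed output limit for this sketch */

struct meta_record {
    char task_hash[META_HASH_LEN + 1];
    char result_hash[META_HASH_LEN + 1];
    int  output_count;
    char output_path[META_MAX_OUTPUTS][META_PATH_LEN + 1];
    char output_hash[META_MAX_OUTPUTS][META_HASH_LEN + 1];
};

/* Parse one line of a .meta file into rec.
 * Returns 1 if the line was recognized and stored, 0 otherwise. */
int meta_parse_line(struct meta_record *rec, const char *line)
{
    char path[META_PATH_LEN + 1];
    char hash[META_HASH_LEN + 1];

    if (sscanf(line, "task_hash: %64s", hash) == 1) {
        strcpy(rec->task_hash, hash);
        return 1;
    }
    if (sscanf(line, "result_hash: %64s", hash) == 1) {
        strcpy(rec->result_hash, hash);
        return 1;
    }
    if (sscanf(line, "output %255s %64s", path, hash) == 2
        && rec->output_count < META_MAX_OUTPUTS) {
        strcpy(rec->output_path[rec->output_count], path);
        strcpy(rec->output_hash[rec->output_count], hash);
        rec->output_count++;
        return 1;
    }
    return 0;
}
```

A caller would zero the record, feed it each line read with `fgets`, and treat a record with an empty `task_hash` as unreadable.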
./reprovm <manifest> [target1 target2 ...]
- `<manifest>`: path to the manifest file.
- `[target...]`: optional list of task names to build. If omitted, all tasks are considered targets.
Run full pipeline:
./reprovm manifest.txt
Run specific target (and its dependencies):
./reprovm manifest.txt checksum
Force re-computation of everything:
rm -rf .reprovm
./reprovm manifest.txt
ReproVM does not require environment variables to function, but the following can be used for extension or debugging if you choose to augment it:
- `REPROVM_CACHE_DIR` (future extension): override the `.reprovm/cache` path.
- `REPROVM_VERBOSE=1`: hypothetical future verbose mode to log internal decisions.
- `REPROVM_FORCE=<task>` (not implemented in base): could be used to bypass the cache for a particular task.
The current implementation uses the project root (`.`) as base; you can change this by modifying the `cas_init` parameter in `main.c`.
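If you do wire these variables up, the lookups could look like the sketch below. None of these helpers exist in the base code; the function names and the rule that unset, empty, or `0` means "off" are assumptions made for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Interpret an environment value as a switch:
 * unset, empty, or "0" means off; anything else means on. */
int flag_is_set(const char *value)
{
    return value != NULL && value[0] != '\0' && strcmp(value, "0") != 0;
}

/* Hypothetical helper: 1 when REPROVM_VERBOSE is switched on. */
int reprovm_verbose_enabled(void)
{
    return flag_is_set(getenv("REPROVM_VERBOSE"));
}

/* Hypothetical helper: cache directory, honoring REPROVM_CACHE_DIR
 * when it is set and non-empty, else the documented default. */
const char *reprovm_cache_dir(void)
{
    const char *dir = getenv("REPROVM_CACHE_DIR");
    return (dir != NULL && dir[0] != '\0') ? dir : ".reprovm/cache";
}
```

Keeping the string interpretation in `flag_is_set` separate from `getenv` makes the rule easy to check without touching the process environment.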
Note: The following outputs are realistic simulations, with plausible hash values and formatting.
Manifest: as above (`build`, `test`, `checksum`). `hello.c` prints “Hello, ReproVM!”.
$ ./reprovm manifest.txt
Will execute 3 tasks in order:
build
test
checksum
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build (hash=a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7) res=1f2e3d4c5b6a7980a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7)
[ ] test (hash=5f6e7d8c9b0a1f2e3d4c5b6a7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d) res=
[ ] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build (hash=...) res=...
[✔] test (hash=5f6e7d8c9b0a1f2e3d4c5b6a7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d) res=3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c)
[ ] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build (hash=...) res=...
[✔] test (hash=...) res=...
[✔] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=7f8e9d0c1b2a3e4f5d6c7b8a9f0e1d2c3b4a5f6e7d8c9b0a1c2d3e4f5a6b7c8)
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
$ ./reprovm manifest.txt
Will execute 3 tasks in order:
build
test
checksum
[*] build (cache hit) (hash=a3f5d9c0e1b2c3d4...) res=1f2e3d4c...
[*] test (cache hit) (hash=5f6e7d8c9b0a1f2...) res=3c4d5e6f...
[*] checksum (cache hit) (hash=9a8b7c6d5e4f3a2b...) res=7f8e9d0c...
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[*] build ...
[*] test ...
[*] checksum ...
==================
Modify `hello.c` to print “Hello, Updated ReproVM!”, then run:
$ ./reprovm manifest.txt
Will execute 3 tasks in order:
build
test
checksum
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build (hash=de4f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0) res=2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3)
[ ] test ...
[ ] checksum ...
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build ...
[✔] test (hash=8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7f6a5b4c3d2e1f0a9b8c7) res=4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d)
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
...
The manifest has a typo: `inputs = missing.c`, but `missing.c` does not exist.
$ ./reprovm manifest.txt
Will execute 3 tasks in order:
build
test
checksum
==> Running task 'build': gcc -o hello missing.c
Task 'build' failed with exit code 1
=== Task Graph ===
[X] build (hash=...) res=
[ ] test ...
[ ] checksum ...
==================
One or more tasks failed.
Run only a high-level result:
$ ./reprovm manifest.txt checksum
Will execute 3 tasks in order:
build
test
checksum
[*] build (cached)
[*] test (cached)
[*] checksum (cached)
Final graph:
=== Task Graph ===
[*] build ...
[*] test ...
[*] checksum ...
==================
To ignore the cached result of a task:
$ rm .reprovm/cache/a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7.meta
$ ./reprovm manifest.txt build
View metadata manually:
$ cat .reprovm/cache/a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7.meta
task_hash: a3f5d9c0e1b2...
result_hash: 1f2e3d4c5b6a...
output hello 5e2f1a3b...
Restore any blob:
$ ./reprovm # normal run will auto-restore, or use cas APIs to manually extract
task fetch_data {
cmd = curl -s -o raw.csv https://example.com/data.csv
inputs =
outputs = raw.csv
deps =
}
task clean {
cmd = python3 scripts/clean.py raw.csv > cleaned.csv
inputs = raw.csv
outputs = cleaned.csv
deps = fetch_data
}
task train {
cmd = python3 scripts/train_model.py cleaned.csv model.bin
inputs = cleaned.csv
outputs = model.bin
deps = clean
}
task evaluate {
cmd = python3 scripts/evaluate.py model.bin cleaned.csv > metrics.json
inputs = model.bin, cleaned.csv
outputs = metrics.json
deps = train
}
task bundle {
cmd = tar -czf package.tar.gz model.bin metrics.json
inputs = model.bin, metrics.json
outputs = package.tar.gz
deps = evaluate
}
Run the final bundle:
./reprovm pipeline_manifest.txt bundle
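Re-running after editing only `scripts/clean.py` is meant to re-run `clean`, `train`, `evaluate`, and `bundle`, while earlier cached steps (`fetch_data`) are reused. Note that the task hash covers only `cmd`, the declared inputs, and dependency results, so `scripts/clean.py` must be listed in the `inputs` of `clean` for the edit to be detected; as written above, `clean` declares only `raw.csv`.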
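Which edits invalidate a task follows from how its identity is composed. The sketch below assumes the composition named in the flowchart (cmd + sorted input blob hashes + deps result hashes). `task_identity` and its helpers are hypothetical names, and 64-bit FNV-1a is used only as a compact stand-in for SHA-256, which is what the CAS is described as using.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in hash step (FNV-1a, 64-bit); NOT the SHA-256 used by the CAS. */
static uint64_t fnv1a_update(uint64_t h, const char *s)
{
    for (; *s != '\0'; s++) {
        h ^= (unsigned char)*s;
        h *= 1099511628211ULL;            /* FNV prime */
    }
    h ^= 0xff;                            /* field separator */
    h *= 1099511628211ULL;
    return h;
}

static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Combine cmd, input content hashes (sorted in place, so the order of the
 * inputs list does not matter) and dependency result hashes. */
uint64_t task_identity(const char *cmd,
                       const char **input_hashes, size_t n_inputs,
                       const char **dep_results, size_t n_deps)
{
    uint64_t h = 14695981039346656037ULL; /* FNV offset basis */
    h = fnv1a_update(h, cmd);
    if (n_inputs > 1)
        qsort(input_hashes, n_inputs, sizeof *input_hashes, cmp_str);
    for (size_t i = 0; i < n_inputs; i++)
        h = fnv1a_update(h, input_hashes[i]);
    for (size_t i = 0; i < n_deps; i++)
        h = fnv1a_update(h, dep_results[i]);
    return h;
}
```

Because a dependency's result hash feeds into the identity of every task that depends on it, a change in an upstream output alters the identity of each downstream task in turn. A file that is never hashed, such as an undeclared script, cannot affect the identity.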
- Replace `system()` with RPC to remote workers.
- Use `-O2` or higher for building ReproVM itself (the `Makefile` already uses `-O2`).
- If you share the `.reprovm` directory, ensure trust boundaries; consider signing metadata in a hardened variant.

| Symptom / Error | Meaning | Action |
|---|---|---|
| `Unknown target 'foo'` | Specified target not defined | Verify spelling in manifest |
| `Cycle detected among tasks` | Dependency loop | Break the cycle by reordering or removing a dependency |
| `Task 'X' failed with exit code N` | Command returned nonzero | Inspect the task `cmd`, check inputs, run manually for a detailed error |
| Missing output but cache hit reported | Downstream expected file not declared as output | Ensure outputs reflect the files actually produced |
| Corrupted `.meta` file or CAS object | Cache metadata unreadable | Delete the relevant `.meta` to force recompute |
| `Failed to hash input file` | Input file missing or unreadable | Confirm the file exists and permissions are correct |
Exit codes:
- `0`: All tasks succeeded (possibly with skips).
- `1`: One or more errors (parsing, execution failure, cycle, internal error).
Q: Can I use ReproVM inside CI to speed up repeated runs?
A: Yes. Persist the `.reprovm/cas` and `.reprovm/cache` directories between pipeline runs to reuse cached steps and drastically reduce execution time.
Q: What happens if I rename a file but its content is the same?
A: File identity in the CAS is content-based, but the manifest refers to file paths. Renaming changes input/output names in the manifest, so tasks will rerun unless you create symlinks or adapt the manifest.
Q: Can one task produce outputs that are not declared?
A: The system won’t cache undeclared outputs properly. Always declare all side-effect files for reproducibility.
Q: Is the cache sharable across machines?
A: Out of the box, the CAS is local. You can copy `.reprovm/cas/objects` and `.reprovm/cache/` to another machine as long as relative paths and environment are compatible. For robust sharing, one could extend ReproVM with a remote CAS backend.
Contributions are welcome. Suggested ways to help:
Guidelines:
ReproVM can be optionally extended to execute independent tasks in parallel while still respecting explicit dependencies. This is provided via a companion parallel executor and alternate entry point; it is non-invasive and does not require changes to the original codebase. If you don’t opt into parallel mode, the original serial behavior remains exactly the same.
The parallel version lives alongside the regular `reprovm` binary. Compile it by adding the new sources and linking pthreads:
gcc -std=c99 -O2 -Wall -Wextra -g reprovm_parallel.c parallel_executor.c task.c cas.c util.c -lpthread -o reprovm_parallel
You can keep both versions:
gcc -std=c99 -O2 -Wall -Wextra -g main.c task.c cas.c util.c -o reprovm
gcc -std=c99 -O2 -Wall -Wextra -g reprovm_parallel.c parallel_executor.c task.c cas.c util.c -lpthread -o reprovm_parallel
./reprovm_parallel [-j N] <manifest> [target1 target2 ...]
- `-j N` / `--jobs N`: number of worker threads to use. If omitted, it defaults to the number of online CPUs (falling back to 4).
You can also influence parallelism via an environment variable (future extension support):
export REPROVM_JOBS=8 # if implemented, can be read as default parallelism
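How the default and an optional override might be resolved is sketched below. This is not the shipped code: `default_jobs` and `resolve_jobs` are hypothetical names, the upper bound of 1024 is an arbitrary sanity cap, and `_SC_NPROCESSORS_ONLN` is a widely available but non-standard `sysconf` name, hence the `#ifdef`.

```c
#include <stdlib.h>
#include <unistd.h>

/* Number of online CPUs, or 4 when the count cannot be determined. */
int default_jobs(void)
{
    long n = -1;
#ifdef _SC_NPROCESSORS_ONLN
    n = sysconf(_SC_NPROCESSORS_ONLN);
#endif
    return (n >= 1) ? (int)n : 4;
}

/* Resolve the worker count from text such as the -j argument or
 * getenv("REPROVM_JOBS"). Falls back to default_jobs() when text is
 * NULL, empty, not a whole number, or outside 1..1024. */
int resolve_jobs(const char *text)
{
    if (text != NULL && *text != '\0') {
        char *end = NULL;
        long n = strtol(text, &end, 10);
        if (*end == '\0' && n >= 1 && n <= 1024)
            return (int)n;
    }
    return default_jobs();
}
```

A caller could prefer the `-j` value when present and otherwise pass `getenv("REPROVM_JOBS")`, which is `NULL` when the variable is unset.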
$ ./reprovm_parallel manifest.txt
Will execute 3 tasks (parallel workers: 8)
==> Scheduling...
[~] build (hash=...) res=
[~] test (waiting on build)
[~] checksum (waiting on test)
==> Running task 'build': gcc -o hello hello.c
==> Running task 'test': ./hello > result.txt # if build finished early enough and test became eligible
[✔] build ...
[*] test (cache hit or skipped if unchanged)
[ ] checksum ...
...
$ ./reprovm_parallel -j 4 manifest.txt
Will execute 3 tasks (parallel workers: 4)
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build ...
[ ] test ...
[ ] checksum ...
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
Given a manifest with two independent tasks `A` and `B` that both feed into `C`, `A` and `B` can run concurrently:
Will execute 3 tasks (parallel workers: 4)
==> Running task 'A': ...
==> Running task 'B': ...
[✔] A ...
[✔] B ...
==> Running task 'C': ...
[✔] C ...
Final graph:
[✔] A ...
[✔] B ...
[✔] C ...
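The ordering in this example (A and B together, then C) can be derived with a round-based readiness check in the spirit of Kahn's topological sort. The sketch below is an illustration, not the executor's code: `schedule_waves`, the dependency matrix, and `MAX_TASKS` are hypothetical.

```c
#include <stddef.h>

#define MAX_TASKS 64

/* dep[i][j] != 0 means task i depends on task j.
 * Fills wave[i] with the earliest round in which task i can start
 * (0 = no dependencies). Returns the number of tasks scheduled;
 * a result smaller than n means a dependency cycle exists. */
int schedule_waves(size_t n, unsigned char dep[][MAX_TASKS], int wave[])
{
    int scheduled = 0;
    for (size_t i = 0; i < n; i++) wave[i] = -1;

    for (int round = 0; ; round++) {
        int progressed = 0;
        for (size_t i = 0; i < n; i++) {
            if (wave[i] != -1) continue;          /* already placed */
            int ready = 1;
            for (size_t j = 0; j < n; j++) {
                /* Every dependency must have run in an earlier round. */
                if (dep[i][j] && (wave[j] == -1 || wave[j] >= round)) {
                    ready = 0;
                    break;
                }
            }
            if (ready) {
                wave[i] = round;
                scheduled++;
                progressed = 1;
            }
        }
        if (!progressed) break;                   /* done, or stuck on a cycle */
    }
    return scheduled;
}
```

With tasks indexed A=0, B=1, C=2 and C depending on both A and B, the result is wave 0 for A and B and wave 1 for C. Tasks that share a wave are the ones a worker pool could run at the same time.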
If one worker encounters a failure (non-zero exit), the failure is recorded but other in-flight eligible tasks are allowed to finish so you get a full snapshot. The final exit code is non-zero, and the ASCII graph will show `[X]` for failed tasks.
- Use `reprovm_parallel` when you want concurrency.
- Use the serial `./reprovm` for debugging or simple runs, and the parallel `./reprovm_parallel -j N` for performance on larger DAGs.
$ ./reprovm_parallel -j 3 manifest.txt
Will execute 3 tasks (parallel workers: 3)
==> Scheduling and starting workers...
[~] build (running) [ ] test (waiting) [ ] checksum (waiting)
==> Running task 'build': gcc -o hello hello.c
[✔] build (completed)
=== Task Graph ===
[✔] build (hash=...) res=...
[ ] test (ready) res=
[ ] checksum (waiting on test)
==================
==> Running task 'test': ./hello > result.txt
[~] test (running)
[ ] checksum (waiting on test)
[✔] test (completed)
=== Task Graph ===
[✔] build ...
[✔] test ...
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
[✔] checksum (completed)
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
If a task is a cache hit and skipped, it shows as `[*]` and its dependents may immediately become eligible without delay.
- Tune `-j` to control throughput; too many threads on small DAGs may add scheduling overhead, so match worker count to workload size.
ReproVM can be run inside a container for reproducible, isolated execution environments. The provided `Dockerfile` and `docker-compose.yml` (see repo) bundle all required tools (GCC, make, Python, Node.js, Ruby, etc.), build both serial and parallel binaries, and give you a consistent runtime.
- The workspace is mounted into the container, so the `.reprovm` cache/CAS are shared (and persist across container invocations).
- Parallel execution uses the `reprovm_parallel` binary and is optionally controlled with `-j` or the environment variable `REPROVM_JOBS`.
From the repo root:
docker build -t reprovm:latest .
This builds the image, compiling ReproVM (serial + parallel) and preparing ancillary scripts.
Run the pipeline against your manifest with:
docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./run_pipeline.sh manifest.txt
Or invoke the parallel binary directly (auto-detects CPU count unless overridden):
docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel -j 4 manifest.txt
To pass parallelism via environment variable:
docker run --rm -v "$(pwd)":/workspace -w /workspace -e REPROVM_JOBS=8 reprovm:latest ./run_pipeline.sh manifest.txt
With the included `docker-compose.yml`, you can spin up the ReproVM service easily:
docker compose up reprovm
This mounts the current directory into `/workspace`, builds/uses the image, and runs `./run_pipeline.sh` by default. To target a specific manifest or override the command:
docker compose run --rm reprovm ./reprovm_parallel -j 2 pipeline_manifest.txt bundle
Since `.reprovm` lives in the mounted workspace, cache and CAS entries survive between container runs. To maximize CI or repeated-run performance, persist the host-side `.reprovm` directory (e.g., in CI artifacts or volume-backed storage).
Cold build:
docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt
Warm rebuild (uses cache):
docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt
Run a specific target:
docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt checksum
Shell into container for inspection/debugging:
docker run --rm -it -v "$(pwd)":/workspace -w /workspace reprovm:latest bash
# then inside:
./reprovm_parallel -j 4 manifest.txt
To remove the built image when you no longer need it:
docker image rm reprovm:latest
If using docker-compose, tear down (and optionally remove volumes):
docker compose down
- Persist the `.reprovm` directory across container invocations for maximum cache reuse.
- Open a shell with `docker run` or `docker-compose` if you need to run custom diagnostics (e.g., inspect `.reprovm/cache` contents).
- In CI, cache `.reprovm` between jobs to drastically cut repeat run time.
.
├── hello.c
├── manifest.txt
├── hello # compiled binary
├── result.txt # program output
├── result.sha # checksum
├── reprovm # compiled VM executable
└── .reprovm
    ├── cache
    │   ├── a3f5d9c0e1b2...meta
    │   ├── 5f6e7d8c9b0a...meta
    │   └── 9a8b7c6d5e4f...meta
    └── cas
        └── objects
            ├── a3/
            │   └── f5d9c0e1b2c3d4... # blobs: e.g., input hello.c
            ├── 5f/
            │   └── 6e7d8c9b0a1f2e... # more blobs
            └── 9a/
                └── 8b7c6d5e4f3a2b... # output blobs
MIT License. See the LICENSE file for details.
Thank you for using ReproVM! We hope it helps you build reproducible, efficient workflows with ease. If you have any questions or suggestions, feel free to open an issue or contribute to the project.