ReproVM-Virtual-Machine

ReproVM - Reproducible Task Virtual Machine with Content-Addressed Caching


Table of Contents

  1. Overview
  2. Why ReproVM?
  3. Key Concepts
  4. Quickstart
  5. Installation & Build
  6. Manifest Specification
  7. Task Lifecycle
  8. Cache & CAS Internals
  9. CLI Usage Reference
  10. Environment & Configuration
  11. Simulated Realistic Sessions & Logs
  12. Advanced Usage Patterns
  13. Extension Points / Developer Notes
  14. Performance Tuning
  15. Security Considerations
  16. Troubleshooting & Error Codes
  17. FAQ
  18. Contributing
  19. Glossary
  20. Parallel Execution
  21. Docker
  22. Example File Layout After Successful Run
  23. License

Overview

ReproVM is a lightweight yet powerful task execution virtual machine written entirely in C (C99). It is designed to make complex workflows — builds, data pipelines, test orchestrations, analyses — reproducible, incremental, and efficient through content-addressed caching and explicit dependency tracking. Think of it as a miniaturized, transparent cousin of modern build systems and pipeline orchestrators (e.g., parts of Docker, Bazel, or data lineage tooling) without hidden state or heavyweight dependencies.

It operates on a declarative manifest of tasks, each describing what to run, what files it consumes and produces, and what it depends on. ReproVM automatically computes identities (hashes), reuses work when possible, restores outputs, and visualizes the dependency graph with status in the terminal.

Why ReproVM?

Real-world workflows suffer from repeated, wasted computation, hidden dependencies, and brittle pipelines. ReproVM addresses these problems through content-addressed caching, explicit dependency declarations, and a visible task graph.

Key Concepts

Key Technologies

flowchart TB
    U[User invokes CLI with manifest & optional targets/env flags] --> CLI["CLI Entry Point<br>(read env: FORCE, VERBOSE, targets)"]

    subgraph "Input & Initialization"
        CLI --> Parser["Manifest Parser<br>(parse DSL, validate syntax)"]
        Parser --> DAG["Dependency DAG Builder / Scheduler<br>(build task graph, apply target filtering)"]
        DAG --> ReadyCheck{"Any ready tasks?<br>(dependencies satisfied)"}
    end

    subgraph "Execution Engine"
        ReadyCheck -- yes --> Dispatch["Parallel Executor / Worker Pool<br>(schedule ready tasks)"]
        Dispatch --> TaskInstance["Task Instance"]

        subgraph "Task Instance Logic"
            TaskInstance --> ComputeHash["Compute Task Hash<br>(cmd + sorted input blob hashes + deps result hashes)"]
            ComputeHash --> CacheLookup{"Cache record exists?<br>(task_hash) & not forced"}
            CacheLookup -- hit & not forced --> RestoreOutputs["Restore outputs from CAS<br>(mark skipped)"]
            CacheLookup -- miss or forced --> RunCmd["Execute shell command<br>(capture output, exit code)"]
            RunCmd --> CmdSuccess{"Exit code == 0?"}
            CmdSuccess -- yes --> StoreCAS["Store outputs in CAS<br>(SHA-256 content-addressed)"]
            StoreCAS --> WriteMeta["Write cache metadata<br>(task_hash, result_hash, output map)"]
            CmdSuccess -- no --> MarkFailed["Mark task failed<br>(record error)"]
            RestoreOutputs --> UpdateSkipped["Update graph: skipped"]
            WriteMeta --> UpdateSuccess["Update graph: success"]
            MarkFailed --> UpdateFailed["Update graph: failed"]
        end

        UpdateSkipped --> DAG
        UpdateSuccess --> DAG
        UpdateFailed --> DAG
        Dispatch --> ReadyCheck
    end

    ReadyCheck -- no --> AllDone{"All tasks processed?"}
    AllDone -- yes --> Visualizer["Graph Visualizer<br>(render ASCII DAG with statuses)"]
    Visualizer --> Summary["Final Summary & Exit Code<br>(success/failure, cache hits/misses)"]
    AllDone -- no --> ReadyCheck

    MarkFailed --> OverallFailure["Overall run marked failed"]
    OverallFailure --> Visualizer
    OverallFailure --> Summary

    ChangeInput["Upstream input changed or manifest edited"] --> InvalidateDeps["Invalidate dependent task hashes"]
    InvalidateDeps --> DAG

    CLI --> ForceBypass["Force rebuild flag<br>(overrides cache hits)"]
    ForceBypass --> CacheLookup

Quickstart

  1. Write a manifest describing your tasks (see Manifest Specification).
  2. Build ReproVM: make
  3. Run the VM:
    ./reprovm manifest.txt
    

If nothing has changed, subsequent runs are near-instant due to cache hits.

Installation & Build

Requirements

Build

git clone https://github.com/hoangsonww/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine
make

Expected output:

$ make
gcc -std=c99 -O2 -Wall -Wextra -g -c main.c -o main.o
gcc -std=c99 -O2 -Wall -Wextra -g -c task.c -o task.o
gcc -std=c99 -O2 -Wall -Wextra -g -c cas.c -o cas.o
gcc -std=c99 -O2 -Wall -Wextra -g -c util.c -o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm main.o task.o cas.o util.o

This produces the reprovm binary.

Manifest Specification

The ReproVM manifest uses a simple, custom DSL. Whitespace is flexible; comments begin with #.

Grammar Summary

manifest      := { task_block }+
task_block   := "task" <name> "{" { field_line } "}"
field_line   := <key> "=" <value>
key          := "cmd" | "inputs" | "outputs" | "deps"
value        := arbitrary string (for cmd), or comma-separated list (for others)
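As an illustration of the field_line rule, here is a minimal sketch of splitting one `key = value` line in C. The helper names (trim, parse_field_line) are hypothetical and are not taken from the ReproVM sources; the real parser may handle quoting, comments, and errors differently.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Trim leading and trailing whitespace in place; returns the trimmed start. */
static char *trim(char *s)
{
    while (isspace((unsigned char)*s)) s++;
    char *end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1])) end--;
    *end = '\0';
    return s;
}

/*
 * Split one manifest line of the form `key = value` (hypothetical helper).
 * The line is modified in place; on success *key and *value point into it.
 * Returns 1 for a field line, 0 for a blank or comment line, -1 on error.
 */
static int parse_field_line(char *line, char **key, char **value)
{
    char *p = trim(line);
    if (*p == '\0' || *p == '#') return 0;

    /* Split on the first '=' so a cmd may itself contain '='. */
    char *eq = strchr(p, '=');
    if (eq == NULL) return -1;
    *eq = '\0';
    *key = trim(p);
    *value = trim(eq + 1);

    /* Only the four keys named in the grammar are accepted. */
    if (strcmp(*key, "cmd") != 0 && strcmp(*key, "inputs") != 0 &&
        strcmp(*key, "outputs") != 0 && strcmp(*key, "deps") != 0)
        return -1;
    return 1;
}
```

An empty value, as in `deps =`, comes back as an empty string, which matches how the example manifests express "no dependencies".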

Valid Fields

Example

task build {
  cmd = gcc -o hello hello.c
  inputs = hello.c
  outputs = hello
  deps =
}

task test {
  cmd = ./hello > result.txt
  inputs = hello
  outputs = result.txt
  deps = build
}

task checksum {
  cmd = sha256sum result.txt > result.sha
  inputs = result.txt
  outputs = result.sha
  deps = test
}

Notes

Task Lifecycle

For each task in topological order:

  1. Compute Task Hash
  2. Cache Lookup
  3. Execution (if no cache hit)
  4. Graph Update & Display
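The "topological order" used for this loop can be computed in several ways. The sketch below uses Kahn's algorithm over a small adjacency matrix; this is an assumption for illustration, not necessarily how ReproVM orders tasks internally. MAX_TASKS, the matrix layout, and the function name topo_order are all hypothetical.

```c
#include <stddef.h>

#define MAX_TASKS 64

/*
 * Kahn's algorithm over an adjacency matrix (illustrative layout):
 * dep[i][j] != 0 means task i depends on task j. Writes a valid execution
 * order into order[] and returns the number of tasks placed. A return value
 * smaller than n means the remaining tasks form a dependency cycle.
 * The caller must ensure n <= MAX_TASKS.
 */
static size_t topo_order(size_t n, unsigned char dep[][MAX_TASKS], size_t order[])
{
    size_t pending[MAX_TASKS]; /* unmet dependency count per task */
    size_t placed = 0;

    for (size_t i = 0; i < n; i++) {
        pending[i] = 0;
        for (size_t j = 0; j < n; j++)
            if (dep[i][j]) pending[i]++;
    }

    /* Tasks with no dependencies are ready immediately. */
    for (size_t i = 0; i < n; i++)
        if (pending[i] == 0) order[placed++] = i;

    /* order[] doubles as the work queue: process entries as they are added. */
    for (size_t head = 0; head < placed; head++) {
        size_t done = order[head];
        for (size_t i = 0; i < n; i++) {
            if (dep[i][done] && --pending[i] == 0)
                order[placed++] = i;
        }
    }
    return placed;
}
```

For the build, test, checksum manifest above, this yields the order build, test, checksum. A result shorter than the task count corresponds to the "Cycle detected among tasks" condition.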

Statuses:

Cache & CAS Internals

CAS Layout

Blobs are stored under:

.reprovm/cas/objects/<first-two-hex>/<remaining-hash>

Example:

.reprovm/cas/objects/a3/f5d9c0e1b2...  # blob for a file or output

Splitting on the first two hex characters spreads blobs across up to 256 subdirectories, so no single directory has to hold every object.
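A minimal sketch of mapping a hex digest to this layout follows. The function name cas_object_path and its signature are hypothetical; the actual helpers in cas.c may be named and shaped differently.

```c
#include <stdio.h>
#include <string.h>

/*
 * Build the on-disk path for a blob from its hex digest, following the
 * layout <root>/objects/<first two hex chars>/<remaining chars>.
 * Returns 0 on success, -1 if the digest is too short or the buffer
 * is too small to hold the full path.
 */
static int cas_object_path(const char *root, const char *hex,
                           char *out, size_t out_size)
{
    if (strlen(hex) < 3) return -1;

    /* %.2s prints only the first two characters of the digest. */
    int n = snprintf(out, out_size, "%s/objects/%.2s/%s", root, hex, hex + 2);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Called with a root of `.reprovm/cas` and a digest beginning `a3f5d9...`, it produces a path under `.reprovm/cas/objects/a3/`.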

Metadata Record

Each task produces a metadata file:

.reprovm/cache/<task_hash>.meta

Example contents:

task_hash: a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7
result_hash: 7d9a4f2e5b3c1a0d6e8f9b7c4d3e2f1a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2
output hello 5e2f1a3b4c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9c0d1
output result.txt 9c4d7a1e2f3b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d5e6f7a8b9
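Records in this format can be read line by line with sscanf. The sketch below assumes exactly the three line shapes shown in the example (task_hash:, result_hash:, and output <name> <hash>) and output names without spaces. The type and function names are hypothetical and do not come from the ReproVM sources.

```c
#include <stdio.h>
#include <string.h>

#define HASH_MAX 65       /* 64 hex chars for SHA-256 plus the terminator */
#define NAME_MAX_LEN 256  /* illustrative limit on output path length */

enum meta_kind { META_UNKNOWN = 0, META_TASK_HASH, META_RESULT_HASH, META_OUTPUT };

struct meta_line {
    enum meta_kind kind;
    char name[NAME_MAX_LEN];  /* output path; empty for the hash lines */
    char hash[HASH_MAX];
};

/* Classify one line of a .meta file and copy out its fields. */
static enum meta_kind parse_meta_line(const char *line, struct meta_line *out)
{
    out->name[0] = '\0';
    out->hash[0] = '\0';

    if (sscanf(line, "task_hash: %64s", out->hash) == 1) {
        out->kind = META_TASK_HASH;
    } else if (sscanf(line, "result_hash: %64s", out->hash) == 1) {
        out->kind = META_RESULT_HASH;
    } else if (sscanf(line, "output %255s %64s", out->name, out->hash) == 2) {
        out->kind = META_OUTPUT;
    } else {
        /* Discard anything a partial match may have written. */
        out->name[0] = '\0';
        out->hash[0] = '\0';
        out->kind = META_UNKNOWN;
    }
    return out->kind;
}
```

Treating unrecognized lines as META_UNKNOWN gives the caller a natural point to report a corrupted .meta file and fall back to recomputing the task.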

Identity Propagation
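The architecture flowchart describes the task hash as covering the command, the sorted input blob hashes, and the result hashes of dependencies. The sketch below shows one way to assemble such a hash preimage; the `cmd:`/`in:`/`dep:` line format and the function name build_preimage are assumptions made for illustration, and the SHA-256 step itself is left out.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/*
 * Write the hash preimage into out: the command, then the input blob hashes
 * in sorted order, then the dependency result hashes, one item per line.
 * input_hashes is sorted in place. Returns the preimage length, or -1 if
 * out is too small.
 */
static int build_preimage(const char *cmd,
                          const char **input_hashes, size_t n_inputs,
                          const char **dep_results, size_t n_deps,
                          char *out, size_t out_size)
{
    size_t used;
    int n;

    /* Sorting makes the result independent of the order inputs were listed. */
    if (n_inputs > 1)
        qsort(input_hashes, n_inputs, sizeof *input_hashes, cmp_str);

    n = snprintf(out, out_size, "cmd:%s\n", cmd);
    if (n < 0 || (size_t)n >= out_size) return -1;
    used = (size_t)n;

    for (size_t i = 0; i < n_inputs; i++) {
        n = snprintf(out + used, out_size - used, "in:%s\n", input_hashes[i]);
        if (n < 0 || (size_t)n >= out_size - used) return -1;
        used += (size_t)n;
    }
    for (size_t i = 0; i < n_deps; i++) {
        n = snprintf(out + used, out_size - used, "dep:%s\n", dep_results[i]);
        if (n < 0 || (size_t)n >= out_size - used) return -1;
        used += (size_t)n;
    }
    return (int)used;
}
```

Because a dependency's result hash is part of the preimage, any change to an upstream output changes the downstream task hash, which is what causes dependents to rerun after an input edit.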

CLI Usage Reference

./reprovm <manifest> [target1 target2 ...]

Examples

Environment & Configuration

ReproVM does not require environment variables to function, but the following are recognized/usable for extension or debugging if you choose to augment it:

The current implementation uses the project root (.) as its base directory; you can change this by modifying the cas_init parameter in main.c.

Simulated Realistic Sessions & Logs

Note: The following outputs are realistic simulations, with plausible hash values and formatting.

1. First Full Run (cold cache)

Manifest: as above (build, test, checksum). hello.c prints “Hello, ReproVM!”.

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build (hash=a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7) res=1f2e3d4c5b6a7980a9b8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7
[ ] test (hash=5f6e7d8c9b0a1f2e3d4c5b6a7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d) res=
[ ] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build (hash=...) res=...
[✔] test (hash=5f6e7d8c9b0a1f2e3d4c5b6a7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d) res=3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c
[ ] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build (hash=...) res=...
[✔] test (hash=...) res=...
[✔] checksum (hash=9a8b7c6d5e4f3a2b1c0d9e8f7a6b5c4d3e2f1a0b9c8d7e6f5a4b3c2d1e0f9a) res=7f8e9d0c1b2a3e4f5d6c7b8a9f0e1d2c3b4a5f6e7d8c9b0a1c2d3e4f5a6b7c8
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ... 
[✔] test ...
[✔] checksum ...
==================

2. Repeat Run (warm cache)

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
[*] build (cache hit) (hash=a3f5d9c0e1b2c3d4...) res=1f2e3d4c...
[*] test (cache hit) (hash=5f6e7d8c9b0a1f2...) res=3c4d5e6f...
[*] checksum (cache hit) (hash=9a8b7c6d5e4f3a2b...) res=7f8e9d0c...
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[*] build ...
[*] test ...
[*] checksum ...
==================

3. Change in Upstream Input (hello.c modified)

Modify hello.c to print “Hello, Updated ReproVM!” then:

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build (hash=de4f1a2b3c4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0) res=2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7d8e9f0a1b2c3d4e5f6a7b8c9d0e1f2a3
[ ] test ...
[ ] checksum ...
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build ...
[✔] test (hash=8c7d6e5f4a3b2c1d0e9f8a7b6c5d4e3f2a1b0c9d8e7f6a5b4c3d2e1f0a9b8c7) res=4d5e6f7a8b9c0d1e2f3a4b5c6d7e8f9a0b1c2d3e4f5a6b7c8d9e0f1a2b3c4d
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
...

4. Failure Case: Missing Input

The manifest has a typo: the build task uses missing.c (in cmd and in inputs = missing.c), but that file does not exist.

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
==> Running task 'build': gcc -o hello missing.c
Task 'build' failed with exit code 1
=== Task Graph ===
[X] build (hash=...) res=
[ ] test ...
[ ] checksum ...
==================
One or more tasks failed.

Advanced Usage Patterns

Partial Target Execution

Run only a high-level result:

$ ./reprovm manifest.txt checksum
Will execute 3 tasks in order:
  build
  test
  checksum
[*] build (cached)
[*] test (cached)
[*] checksum (cached)
Final graph:
=== Task Graph ===
[*] build ...
[*] test ...
[*] checksum ...
==================

Forcing a Rebuild

To ignore the cached result of a task, delete its metadata file and rerun:

$ rm .reprovm/cache/a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7.meta
$ ./reprovm manifest.txt build

Inspect Cache / Provenance

View metadata manually:

$ cat .reprovm/cache/a3f5d9c0e1b2c3d4e5f6a7b8c9d0e1f2a3b4c5d6e7f8a9b0c1d2e3f4a5b6c7.meta
task_hash: a3f5d9c0e1b2...
result_hash: 1f2e3d4c5b6a...
output hello 5e2f1a3b...

Restore any blob:

$ ./reprovm manifest.txt  # a normal run auto-restores outputs on cache hits; use the CAS APIs to extract a blob manually

Example: More Complex Pipeline Manifest

task fetch_data {
  cmd = curl -s -o raw.csv https://example.com/data.csv
  inputs =
  outputs = raw.csv
  deps =
}

task clean {
  cmd = python3 scripts/clean.py raw.csv > cleaned.csv
  inputs = raw.csv
  outputs = cleaned.csv
  deps = fetch_data
}

task train {
  cmd = python3 scripts/train_model.py cleaned.csv model.bin
  inputs = cleaned.csv
  outputs = model.bin
  deps = clean
}

task evaluate {
  cmd = python3 scripts/evaluate.py model.bin cleaned.csv > metrics.json
  inputs = model.bin, cleaned.csv
  outputs = metrics.json
  deps = train
}

task bundle {
  cmd = tar -czf package.tar.gz model.bin metrics.json
  inputs = model.bin, metrics.json
  outputs = package.tar.gz
  deps = evaluate
}

Run the final bundle:

./reprovm pipeline_manifest.txt bundle

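Re-running after editing only scripts/clean.py re-runs clean, train, evaluate, and bundle, provided the script is declared in clean's inputs (for example, inputs = raw.csv, scripts/clean.py). The task hash covers the command string, the declared input contents, and dependency results, so an edit to an undeclared script is not detected. The earlier cached step (fetch_data) is reused.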

Extension Points / Developer Notes

Performance Tuning

Security Considerations

Troubleshooting & Error Codes

| Symptom / Error | Meaning | Action |
|---|---|---|
| Unknown target 'foo' | Specified target not defined | Verify spelling in manifest |
| Cycle detected among tasks | Dependency loop | Break cycle by reordering or removing dependency |
| Task 'X' failed with exit code N | Command returned nonzero | Inspect task cmd, check inputs, run manually for detailed error |
| Missing output but cache hit reported | Downstream expected file not declared as output | Ensure outputs reflect actual produced files |
| Corrupted .meta file or CAS object | Cache metadata unreadable | Delete relevant .meta to force recompute |
| Failed to hash input file | Input file missing or unreadable | Confirm file exists and permissions are correct |

Exit codes:

FAQ

Q: Can I use ReproVM inside CI to speed up repeated runs?
A: Yes. Persist the .reprovm/cas and .reprovm/cache directories between pipeline runs so that unchanged steps are restored from the cache instead of being re-executed.

Q: What happens if I rename a file but its content is the same?
A: File identity in the CAS is content-based, but the manifest refers to files by path. Renaming changes the input/output names in the manifest, so tasks will rerun unless you create symlinks or adapt the manifest.

Q: Can a task produce outputs that are not declared?
A: Undeclared outputs are not cached properly, so they cannot be relied on after a cache hit. Always declare every file a task produces as a side effect.

Q: Is the cache sharable across machines?
A: Out of the box, the CAS is local. You can copy .reprovm/cas/objects and .reprovm/cache/ to another machine as long as relative paths and the environment are compatible. For robust sharing, the CAS could be extended with a remote backend.

Contributing

Contributions are welcome. Suggested ways to help:

Guidelines:

  1. Fork the repository.
  2. Create a feature branch.
  3. Write tests or example manifests demonstrating your addition.
  4. Submit a pull request with a clear description and any limitations.

Glossary

Parallel Execution

ReproVM can be optionally extended to execute independent tasks in parallel while still respecting explicit dependencies. This is provided via a companion parallel executor and alternate entry point; it is non-invasive and does not require changes to the original codebase. If you don’t opt into parallel mode, the original serial behavior remains exactly the same.

What’s Different

Building the Parallel Binary

The parallel version lives alongside the regular reprovm binary. Compile it by adding the new sources and linking pthreads:

gcc -std=c99 -O2 -Wall -Wextra -g reprovm_parallel.c parallel_executor.c task.c cas.c util.c -lpthread -o reprovm_parallel

You can keep both versions:

gcc -std=c99 -O2 -Wall -Wextra -g main.c task.c cas.c util.c -o reprovm
gcc -std=c99 -O2 -Wall -Wextra -g reprovm_parallel.c parallel_executor.c task.c cas.c util.c -lpthread -o reprovm_parallel

Usage

./reprovm_parallel [-j N] <manifest> [target1 target2 ...]

Parallelism could also be set through an environment variable, but this is a future extension and may not be implemented:

export REPROVM_JOBS=8   # if implemented, can be read as default parallelism
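If the environment variable were supported, the worker count could be resolved as in the sketch below. The precedence (the -j value first, then REPROVM_JOBS, then the detected CPU count) is an assumption, and the function names parse_positive and resolve_jobs are hypothetical rather than part of the ReproVM sources.

```c
#include <errno.h>
#include <stdlib.h>

/* Parse a positive decimal integer; returns 0 if the text is not one. */
static long parse_positive(const char *text)
{
    if (text == NULL || *text == '\0') return 0;

    char *end = NULL;
    errno = 0;
    long v = strtol(text, &end, 10);
    if (errno != 0 || *end != '\0' || v <= 0) return 0;
    return v;
}

/*
 * Pick the worker count: the -j value first, then REPROVM_JOBS, then the
 * detected CPU count, and finally 1 as a safe floor.
 */
static long resolve_jobs(const char *flag_value, const char *env_value,
                         long detected_cpus)
{
    long n = parse_positive(flag_value);
    if (n == 0) n = parse_positive(env_value);
    if (n == 0) n = detected_cpus;
    return n > 0 ? n : 1;
}
```

A caller would typically pass getenv("REPROVM_JOBS") as the second argument and, on Linux or macOS, sysconf(_SC_NPROCESSORS_ONLN) as the third. Invalid values such as "abc" or "0" fall through to the next source instead of aborting the run.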

Examples

Full Parallel Run (auto worker count)

$ ./reprovm_parallel manifest.txt
Will execute 3 tasks (parallel workers: 8)
==> Scheduling...
[~] build (hash=...) res=
[~] test  (waiting on build)
[~] checksum (waiting on test)

==> Running task 'build': gcc -o hello hello.c
==> Running task 'test': ./hello > result.txt       # if build finished early enough and test became eligible
[✔] build ...
[*] test  (cache hit or skipped if unchanged)
[ ] checksum ...
...

Specifying 4 Workers

$ ./reprovm_parallel -j 4 manifest.txt
Will execute 3 tasks (parallel workers: 4)
==> Running task 'build': gcc -o hello hello.c
==> Task 'build' completed.
=== Task Graph ===
[✔] build ...
[ ] test ...
[ ] checksum ...
==================
==> Running task 'test': ./hello > result.txt
==> Task 'test' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
==> Task 'checksum' completed.
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================

Mixed Parallelism with Independent Tasks

Given a manifest with two independent tasks A and B that both feed into C, they can run concurrently:

Will execute 3 tasks (parallel workers: 4)
==> Running task 'A': ...
==> Running task 'B': ...
[✔] A ...
[✔] B ...
==> Running task 'C': ...
[✔] C ...
Final graph:
[✔] A ...
[✔] B ...
[✔] C ...

Failure Behavior

If one worker encounters a failure (non-zero exit), the failure is recorded but other in-flight eligible tasks are allowed to finish so you get a full snapshot. The final exit code is non-zero, and the ASCII graph will show [X] for failed tasks.
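One way to express this scheduling rule is a per-task readiness check, sketched below. It assumes that a task whose direct dependency has failed is never started; the enum names, MAX_TASKS, and task_readiness are hypothetical and are not taken from parallel_executor.c.

```c
#include <stddef.h>

#define MAX_TASKS 64

enum status { ST_PENDING, ST_RUNNING, ST_SUCCESS, ST_SKIPPED, ST_FAILED };
enum readiness { WAITING, READY, BLOCKED };

/*
 * Decide whether a task may start. dep[i][j] != 0 means task i depends on
 * task j; st[] holds the current status of every task.
 *   BLOCKED: a direct dependency failed, so the task must not run.
 *   WAITING: some dependency has not finished yet.
 *   READY:   every dependency succeeded or was restored from cache.
 */
static enum readiness task_readiness(size_t task, size_t n,
                                     unsigned char dep[][MAX_TASKS],
                                     const enum status st[])
{
    enum readiness r = READY;
    for (size_t j = 0; j < n; j++) {
        if (!dep[task][j]) continue;
        if (st[j] == ST_FAILED) return BLOCKED;
        if (st[j] != ST_SUCCESS && st[j] != ST_SKIPPED) r = WAITING;
    }
    return r;
}
```

The check only looks at direct dependencies, so a scheduler built on it would also need to record BLOCKED tasks as not runnable; otherwise their own dependents would wait indefinitely.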

Integration Notes

Simulated Parallel Run Output (Realistic)

$ ./reprovm_parallel -j 3 manifest.txt
Will execute 3 tasks (parallel workers: 3)
==> Scheduling and starting workers...
[~] build (running)         [ ] test (waiting)        [ ] checksum (waiting)
==> Running task 'build': gcc -o hello hello.c
[✔] build (completed)
=== Task Graph ===
[✔] build (hash=...) res=...
[ ] test  (ready) res=
[ ] checksum (waiting on test)
==================
==> Running task 'test': ./hello > result.txt
[~] test (running)
[ ] checksum (waiting on test)
[✔] test (completed)
=== Task Graph ===
[✔] build ...
[✔] test ...
[ ] checksum ...
==================
==> Running task 'checksum': sha256sum result.txt > result.sha
[✔] checksum (completed)
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================
All tasks completed (some may have been cached). Final graph:
=== Task Graph ===
[✔] build ...
[✔] test ...
[✔] checksum ...
==================

If a task is a cache hit and skipped, it shows as [*] and its dependents may immediately become eligible without delay.

Tuning


Docker

ReproVM can be run inside a container for reproducible, isolated execution environments. The provided Dockerfile and docker-compose.yml (see repo) bundle all required tools (GCC, make, Python, Node.js, Ruby, etc.), build both serial and parallel binaries, and give you a consistent runtime.

Overview

Building the Image

From the repo root:

docker build -t reprovm:latest .

This builds the image, compiling ReproVM (serial + parallel) and preparing ancillary scripts.

Running ReproVM with Docker

Run the pipeline against your manifest with:

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./run_pipeline.sh manifest.txt

Or invoke the parallel binary directly (the worker count is auto-detected from the CPU count unless overridden; here -j 4 overrides it):

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel -j 4 manifest.txt

To pass parallelism via environment variable:

docker run --rm -v "$(pwd)":/workspace -w /workspace -e REPROVM_JOBS=8 reprovm:latest ./run_pipeline.sh manifest.txt

Using docker-compose

With the included docker-compose.yml, you can spin up the ReproVM service easily:

docker compose up reprovm

This mounts the current directory into /workspace, builds/uses the image, and runs ./run_pipeline.sh by default. To target a specific manifest or override:

docker compose run --rm reprovm ./reprovm_parallel -j 2 pipeline_manifest.txt bundle

Persisting Cache

Since .reprovm lives in the mounted workspace, cache and CAS entries survive between container runs. To maximize CI or repeated-run performance, persist the host-side .reprovm directory (e.g., in CI artifacts or volume-backed storage).

Examples

Cold build:

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt

Warm rebuild (uses cache):

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt

Run a specific target:

docker run --rm -v "$(pwd)":/workspace -w /workspace reprovm:latest ./reprovm_parallel manifest.txt checksum

Shell into container for inspection/debugging:

docker run --rm -it -v "$(pwd)":/workspace -w /workspace reprovm:latest bash
# then inside:
./reprovm_parallel -j 4 manifest.txt

Cleanup

To remove the built image when you no longer need it:

docker image rm reprovm:latest

If using docker-compose, tear down the services (add -v to also remove volumes):

docker compose down

Tips

Example File Layout After Successful Run

.
├── hello.c
├── manifest.txt
├── hello                    # compiled binary
├── result.txt              # program output
├── result.sha             # checksum
├── reprovm                # compiled VM executable
└── .reprovm
    ├── cache
    │   ├── a3f5d9c0e1b2...meta
    │   ├── 5f6e7d8c9b0a...meta
    │   └── 9a8b7c6d5e4f...meta
    └── cas
        └── objects
            ├── a3/
            │   └── f5d9c0e1b2c3d4...  # blobs: e.g., input hello.c
            ├── 5f/
            │   └── 6e7d8c9b0a1f2e...  # more blobs
            └── 9a/
                └── 8b7c6d5e4f3a2b...  # output blobs

License

MIT License. See the LICENSE file for details.


Thank you for using ReproVM! We hope it helps you build reproducible, efficient workflows with ease. If you have any questions or suggestions, feel free to open an issue or contribute to the project.