ReproVM

Reproducible Task Execution Virtual Machine with Content-Addressed Caching

🔧 C99 🔐 SHA-256 ⚡ pthreads 🐳 Docker 📦 CAS 🔄 Incremental Builds 🎯 Zero Dependencies 📜 MIT License

Overview

A lightweight yet powerful task execution engine designed for reproducibility and efficiency

ReproVM is a lightweight task execution virtual machine written entirely in C99. It makes complex workflows (builds, data pipelines, test orchestrations, and analyses) reproducible, incremental, and efficient through content-addressed caching and explicit dependency tracking.

Think of it as a miniaturized, transparent cousin of modern build systems like Bazel or data pipeline orchestrators, but without hidden state or heavyweight dependencies. It operates on a declarative manifest of tasks, each describing what to run, what files it consumes and produces, and what it depends on.

100% Pure C99
0 Runtime Dependencies
SHA-256 Content Addressing
∞ Task Parallelism

Key Features

Why ReproVM stands out from traditional build systems

🔄

Reproducibility

Every piece of work is content-addressed. If inputs and commands are identical, outputs are reused deterministically across machines and environments.

⚡

Incrementality

Only tasks whose inputs, commands, or upstream results have changed are re-executed. Unchanged tasks are restored from the cache, which saves time on repeated runs.

🔍

Transparency

ASCII diagrams show exactly what ran, what was skipped, and why. No hidden state or magic: complete visibility into execution flow.

🚀

Parallel Execution

Execute independent tasks concurrently with pthreads. Respects dependencies automatically while maximizing throughput.

📦

Content-Addressed Storage

All inputs and outputs are stored by SHA-256 hash. Deduplication, versioning, and provenance tracking follow naturally.

🔧

Zero Dependencies

Pure C99 implementation with only standard libraries. No external runtime, no package managers; just compile and run.

🌐

Polyglot Support

Execute tasks in any language: C, Python, Go, Rust, Ruby, Node.js. The VM doesn't care what's in your command.

🐳

Docker Ready

Full Docker and docker-compose support for containerized, reproducible execution environments across teams.

Architecture

Understanding ReproVM's internal design

Core Components

📄 Manifest Parser
🔗 DAG Builder
🔐 SHA-256 Hasher
💾 Content-Addressed Storage
⚙️ Task Executor
🧵 Parallel Scheduler
📊 Graph Visualizer
💽 Cache Manager

Execution Flow

flowchart TB
    Start([User invokes CLI]) --> Parse[Parse Manifest]
    Parse --> Build[Build Dependency DAG]
    Build --> Filter[Filter Target Tasks]
    Filter --> Topo[Topological Sort]
    Topo --> Ready{Ready Tasks?}
    Ready -->|Yes| Schedule[Schedule Parallel Execution]
    Ready -->|No| Done{All Complete?}
    Schedule --> Task[Task Instance]
    subgraph "Task Execution"
        Task --> Hash[Compute Task Hash<br/>cmd + inputs + deps]
        Hash --> Cache{Cache Hit?}
        Cache -->|Yes + Not Forced| Restore[Restore from CAS<br/>⚡ SKIP]
        Cache -->|No or Forced| Execute[Execute Command<br/>🔧 RUN]
        Execute --> Success{Exit 0?}
        Success -->|Yes| Store[Store Outputs to CAS<br/>💾 CACHE]
        Success -->|No| Fail[Mark Failed<br/>❌ ERROR]
        Store --> UpdateSuccess[Update: SUCCESS ✓]
        Restore --> UpdateSkip[Update: SKIPPED *]
        Fail --> UpdateFail[Update: FAILED X]
    end
    UpdateSuccess --> Ready
    UpdateSkip --> Ready
    UpdateFail --> Ready
    Done -->|Yes| Visualize[Render Final Graph]
    Done -->|No| Ready
    Visualize --> End([Exit])
    style Start fill:#2563eb
    style End fill:#10b981
    style Fail fill:#ef4444
    style Restore fill:#7c3aed
    style Store fill:#f59e0b

Content-Addressed Storage Layout

project/
├── .reprovm/
│   ├── cas/
│   │   └── objects/
│   │       ├── a3/
│   │       │   └── f5d9c0e1b2c3d4...  (blob: file content)
│   │       ├── 5f/
│   │       │   └── 6e7d8c9b0a1f2e...  (blob: output data)
│   │       └── 9a/
│   │           └── 8b7c6d5e4f3a2b...  (blob: binary)
│   └── cache/
│       ├── <task_hash>.meta  (metadata: task identity)
│       ├── <task_hash>.meta
│       └── <task_hash>.meta
├── manifest.txt
└── source files...
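The layout above suggests that each blob sits in a directory named after the first two hex characters of its SHA-256 digest, with the remaining characters as the file name. A minimal C99 sketch of that mapping follows; `cas_object_path` and its parameters are hypothetical names for illustration, not ReproVM's actual API.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: map a 64-character lowercase hex digest to
 * <root>/cas/objects/<first 2 chars>/<remaining 62 chars>.
 * Returns 0 on success, -1 if the digest is malformed or out is too small. */
int cas_object_path(const char *root, const char *hex, char *out, size_t out_size)
{
    if (root == NULL || hex == NULL || out == NULL || strlen(hex) != 64)
        return -1;
    /* Accept only hex digits so a digest can never escape the objects directory. */
    if (strspn(hex, "0123456789abcdef") != 64)
        return -1;
    int n = snprintf(out, out_size, "%s/cas/objects/%.2s/%s", root, hex, hex + 2);
    return (n < 0 || (size_t)n >= out_size) ? -1 : 0;
}
```

Splitting on a two-character prefix keeps any single directory from accumulating every blob, which is the same idea Git uses for its object store.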

Task Hash Computation

graph LR
    A[Command String] --> H[SHA-256]
    B[Sorted Input Hashes] --> H
    C[Dependency Result Hashes] --> H
    H --> D[Task Hash<br/>64-char hex]
    D --> E{Cache Lookup}
    E -->|Hit| F[Restore Outputs]
    E -->|Miss| G[Execute & Store]
    style D fill:#2563eb
    style F fill:#10b981
    style G fill:#f59e0b
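To show why the input hashes are sorted before hashing, here is a minimal C99 sketch that builds the byte string a task hash could be computed over. The field labels (`cmd`, `input`, `dep`), the separator format, and the function names are assumptions made for this example; the SHA-256 step itself is left out, and ReproVM's real serialization may differ.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    return strcmp(*(const char *const *)a, *(const char *const *)b);
}

/* Append "label value\n" at offset len; returns the new length, or -1 if buf is full. */
static long append_field(char *buf, size_t cap, long len,
                         const char *label, const char *value)
{
    if (len < 0 || (size_t)len >= cap) return -1;
    int n = snprintf(buf + len, cap - (size_t)len, "%s %s\n", label, value);
    if (n < 0 || (size_t)n >= cap - (size_t)len) return -1;
    return len + n;
}

/* Hypothetical sketch: serialize a task identity into buf.
 * input_hashes is sorted in place so that the order in which inputs are
 * listed does not change the result; dep_hashes keep their given order.
 * Returns the number of bytes written, or -1 if buf is too small. */
long task_preimage(const char *cmd,
                   const char **input_hashes, size_t n_inputs,
                   const char **dep_hashes, size_t n_deps,
                   char *buf, size_t cap)
{
    if (n_inputs > 1)
        qsort(input_hashes, n_inputs, sizeof *input_hashes, cmp_str);
    long len = append_field(buf, cap, 0, "cmd", cmd);
    for (size_t i = 0; i < n_inputs; i++)
        len = append_field(buf, cap, len, "input", input_hashes[i]);
    for (size_t i = 0; i < n_deps; i++)
        len = append_field(buf, cap, len, "dep", dep_hashes[i]);
    return len;
}
```

Feeding the resulting buffer to any SHA-256 routine would give a 64-character hex identity that changes whenever the command, an input's content, or an upstream result changes.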

Technology Stack

Built on battle-tested, industry-standard technologies

🔧 Core Implementation

C99
GCC
Make
POSIX

🔐 Security & Hashing

SHA-256
CRC32
CAS

⚡ Parallelism

pthreads
Mutex/CV
DAG Scheduling

🐳 Deployment

Docker
Docker Compose
CI/CD

🌐 Polyglot Extensions

Go
Rust
Python
Ruby
Node.js
NASM

🔍 Code Quality

clang-format
clang-tidy
uncrustify
AStyle

Quick Start

Get up and running in under 5 minutes

1. Clone the Repository

git clone https://github.com/hoangsonww/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine

2. Build ReproVM

make
# Builds both serial (reprovm) and parallel (reprovm_parallel) binaries

Output:

gcc -std=c99 -O2 -Wall -Wextra -g -c main.c -o main.o
gcc -std=c99 -O2 -Wall -Wextra -g -c task.c -o task.o
gcc -std=c99 -O2 -Wall -Wextra -g -c cas.c -o cas.o
gcc -std=c99 -O2 -Wall -Wextra -g -c util.c -o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm main.o task.o cas.o util.o
gcc -std=c99 -O2 -Wall -Wextra -g -o reprovm_parallel ...

3. Create a Manifest

task build {
  cmd = gcc -o hello hello.c
  inputs = hello.c
  outputs = hello
  deps =
}

task test {
  cmd = ./hello > result.txt
  inputs = hello
  outputs = result.txt
  deps = build
}

task checksum {
  cmd = sha256sum result.txt > result.sha
  inputs = result.txt
  outputs = result.sha
  deps = test
}

4. Run Your Pipeline

# Serial execution
./reprovm manifest.txt

# Parallel execution (auto-detects CPU count)
./reprovm_parallel manifest.txt

# Parallel with specific worker count
./reprovm_parallel -j 4 manifest.txt

5. Marvel at Cache Hits ⚡

Run again without changes:

$ ./reprovm manifest.txt
Will execute 3 tasks in order:
  build
  test
  checksum
[*] build (cache hit)
[*] test (cache hit)
[*] checksum (cache hit)
All tasks completed. ✓

🎉 Success! All tasks were skipped because nothing changed. This is the power of content-addressed caching: instant, deterministic rebuilds.

Documentation

Everything you need to know about ReproVM

📖 Manifest Specification

Tasks are declared in a simple DSL:

task <name> {
  cmd = <shell command>
  inputs = <file1>, <file2>, ...
  outputs = <out1>, <out2>, ...
  deps = <task1>, <task2>, ...
}
  • cmd: Shell command to execute
  • inputs: Files consumed by the task
  • outputs: Files produced by the task
  • deps: Task dependencies (execution order)
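As an illustration of how one `key = value` line of this DSL might be read, here is a minimal C99 sketch. The function names `parse_field` and `split_list` are hypothetical and not taken from ReproVM's parser. The line is split at the first `=`, so a command that itself contains `=` stays intact, and only the list-valued fields (inputs, outputs, deps) should be passed to `split_list`.

```c
#include <ctype.h>
#include <string.h>

/* Trim leading and trailing whitespace in place; returns the trimmed start. */
static char *trim(char *s)
{
    while (isspace((unsigned char)*s)) s++;
    char *end = s + strlen(s);
    while (end > s && isspace((unsigned char)end[-1])) end--;
    *end = '\0';
    return s;
}

/* Split "key = value" at the first '='. Modifies line in place.
 * Returns 0 and sets *key and *value, or -1 if the line has no '='. */
int parse_field(char *line, char **key, char **value)
{
    char *eq = strchr(line, '=');
    if (eq == NULL) return -1;
    *eq = '\0';
    *key = trim(line);
    *value = trim(eq + 1);
    return 0;
}

/* Split a comma-separated value into at most max items (in place).
 * Empty items are skipped, so an empty "deps =" yields 0 items. */
size_t split_list(char *value, char **items, size_t max)
{
    size_t n = 0;
    char *p = value;
    while (p != NULL && n < max) {
        char *comma = strchr(p, ',');
        if (comma != NULL) *comma = '\0';
        char *item = trim(p);
        if (*item != '\0') items[n++] = item;
        p = (comma != NULL) ? comma + 1 : NULL;
    }
    return n;
}
```

Both functions work on a writable copy of the line and return pointers into it, which avoids allocation but means the caller has to keep the buffer alive while the items are in use.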

🎯 CLI Usage

# Run all tasks
./reprovm manifest.txt

# Run specific targets
./reprovm manifest.txt target1 target2

# Parallel execution
./reprovm_parallel -j N manifest.txt

# Force rebuild (ignore cache)
rm -rf .reprovm
./reprovm manifest.txt

🔍 Task Statuses

Symbol  Status   Meaning
[ ]     Pending  Not yet executed
[~]     Running  Currently executing
[*]     Skipped  Cache hit, restored from CAS
[✔]     Success  Executed successfully
[X]     Failed   Command returned non-zero

🐳 Docker Usage

# Build Docker image
docker build -t reprovm:latest .

# Run with volume mount
docker run --rm -v "$(pwd)":/workspace \
  -w /workspace reprovm:latest \
  ./reprovm_parallel manifest.txt

# Use docker-compose
docker compose up reprovm

# Interactive shell
docker run --rm -it -v "$(pwd)":/workspace \
  -w /workspace reprovm:latest bash

⚙️ Cache Management

# View cache metadata
cat .reprovm/cache/<task_hash>.meta

# Clean cache
rm -rf .reprovm

# Selective invalidation
rm .reprovm/cache/<specific_task>.meta

# CI cache restoration
./ci_restore_cache.sh

🔐 Security Considerations

⚠️ Warning:
  • Tasks execute arbitrary shell commands, so only run trusted manifests
  • Share a .reprovm directory only with users you trust; anyone who can write to it can alter cached results
  • CAS blobs can be replaced on disk (no cryptographic signing by default)
  • Consider implementing integrity verification for production use

Examples

Real-world manifests demonstrating ReproVM's capabilities

🏗️ C Build Pipeline

task compile {
  cmd = gcc -c -o app.o app.c
  inputs = app.c
  outputs = app.o
  deps =
}

task link {
  cmd = gcc -o app app.o
  inputs = app.o
  outputs = app
  deps = compile
}

task test {
  cmd = ./app --test > test.log
  inputs = app
  outputs = test.log
  deps = link
}

📊 Data Pipeline

task fetch_data {
  cmd = curl -o data.csv https://api.example.com/data
  inputs =
  outputs = data.csv
  deps =
}

task clean_data {
  cmd = python3 clean.py data.csv > clean.csv
  inputs = data.csv
  outputs = clean.csv
  deps = fetch_data
}

task analyze {
  cmd = python3 analyze.py clean.csv > report.json
  inputs = clean.csv
  outputs = report.json
  deps = clean_data
}

task visualize {
  cmd = python3 plot.py report.json chart.png
  inputs = report.json
  outputs = chart.png
  deps = analyze
}

🤖 ML Training Pipeline

task preprocess {
  cmd = python3 preprocess.py raw.csv processed.csv
  inputs = raw.csv
  outputs = processed.csv
  deps =
}

task train_model {
  cmd = python3 train.py processed.csv model.pkl
  inputs = processed.csv
  outputs = model.pkl
  deps = preprocess
}

task evaluate {
  cmd = python3 eval.py model.pkl metrics.json
  inputs = model.pkl, processed.csv
  outputs = metrics.json
  deps = train_model
}

task package {
  cmd = tar czf model.tar.gz model.pkl metrics.json
  inputs = model.pkl, metrics.json
  outputs = model.tar.gz
  deps = evaluate
}

🌐 Web Asset Pipeline

task compile_ts {
  cmd = tsc src/index.ts --outDir dist/
  inputs = src/index.ts
  outputs = dist/index.js
  deps =
}

task bundle_js {
  cmd = webpack dist/index.js -o dist/bundle.js
  inputs = dist/index.js
  outputs = dist/bundle.js
  deps = compile_ts
}

task minify_css {
  cmd = cleancss -o dist/style.min.css src/style.css
  inputs = src/style.css
  outputs = dist/style.min.css
  deps =
}

task package_assets {
  cmd = tar czf assets.tar.gz dist/
  inputs = dist/bundle.js, dist/style.min.css
  outputs = assets.tar.gz
  deps = bundle_js, minify_css
}

📈 Performance Comparison

graph LR
    A[Cold Run<br/>No Cache] -->|100%| B[3.2s]
    C[Warm Run<br/>Full Cache] -->|~0%| D[0.08s]
    E[Partial Change<br/>1 task] -->|33%| F[1.1s]
    style B fill:#ef4444
    style D fill:#10b981
    style F fill:#f59e0b

💡 Pro Tip: In CI/CD pipelines, persist the .reprovm directory between runs to achieve 40x-90x speedups on unchanged steps.

Advanced Features

Power-user capabilities and extension points

🧵 Parallel Execution

The reprovm_parallel binary uses pthreads to execute independent tasks concurrently:

# Auto-detect CPU count
./reprovm_parallel manifest.txt

# Specify worker threads
./reprovm_parallel -j 8 manifest.txt

# Set via environment
export REPROVM_JOBS=16
./reprovm_parallel manifest.txt

How it works:

  • Tasks with satisfied dependencies run immediately
  • Worker pool managed via mutex/condition variables
  • Failures propagate but don't block independent tasks
  • Graph updates are serialized for clean output
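The first and third points above can be sketched in a few lines of C99: a task is ready once all of its dependencies have succeeded or been skipped, and a failure only poisons the tasks downstream of it. The names `struct node`, `is_ready`, and `propagate_failures` are hypothetical, and marking blocked tasks as failed is an assumption about one reasonable policy rather than a description of ReproVM's scheduler.

```c
#include <stddef.h>

enum status { PENDING, RUNNING, SKIPPED, SUCCESS, FAILED };

#define MAX_DEPS 8

/* Hypothetical scheduler node: deps[] holds indices of prerequisite tasks. */
struct node {
    size_t deps[MAX_DEPS];
    size_t n_deps;
    enum status st;
};

/* Returns 1 if task i may start now: it is still pending and every
 * dependency has finished successfully or was restored from the cache. */
int is_ready(const struct node *g, size_t i)
{
    if (g[i].st != PENDING) return 0;
    for (size_t d = 0; d < g[i].n_deps; d++) {
        enum status s = g[g[i].deps[d]].st;
        if (s != SUCCESS && s != SKIPPED) return 0;
    }
    return 1;
}

/* Marks every pending task that depends, directly or transitively, on a
 * failed task as FAILED. Returns how many tasks were newly marked. */
size_t propagate_failures(struct node *g, size_t n)
{
    size_t marked = 0;
    int changed = 1;
    while (changed) {               /* repeat until no new task is affected */
        changed = 0;
        for (size_t i = 0; i < n; i++) {
            if (g[i].st != PENDING) continue;
            for (size_t d = 0; d < g[i].n_deps; d++) {
                if (g[g[i].deps[d]].st == FAILED) {
                    g[i].st = FAILED;
                    marked++;
                    changed = 1;
                    break;
                }
            }
        }
    }
    return marked;
}
```

In a threaded scheduler both functions would be called while holding the mutex that guards the task graph, so that workers never observe a half-updated state.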

🌐 Remote CAS (Extension)

Extend ReproVM to push/pull blobs from remote storage:

// cas.h extension point
int cas_push_remote(const char *hash, const char *url);
int cas_pull_remote(const char *hash, const char *url);

// Example: S3 backend
cas_push_remote(hash, "s3://bucket/cas/objects/");
cas_pull_remote(hash, "s3://bucket/cas/objects/");

🔄 CI/CD Integration

ReproVM is designed for CI/CD environments:

# .github/workflows/ci.yml
name: CI
on: [push]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      # actions/cache restores .reprovm at this step and, when the key was a
      # cache miss, saves it again in its post-job step. A second
      # "Save Cache" step with the same key is therefore not needed.
      - name: Cache .reprovm
        uses: actions/cache@v3
        with:
          path: .reprovm
          key: reprovm-${{ hashFiles('manifest.txt') }}
      - name: Build
        run: |
          make
          ./reprovm_parallel -j 4 manifest.txt

📊 Provenance Tracking

Every cache metadata file contains full provenance:

task_hash: a3f5d9c0e1b2c3d4e5f6a7b8...
result_hash: 1f2e3d4c5b6a7980a9b8c7d6...
output hello 5e2f1a3b4c6d7e8f9a0b1c2d...
output result.txt 9c4d7a1e2f3b5c6d7e8f...

Use this to:

  • Trace output back to exact inputs and commands
  • Audit reproducibility across environments
  • Build compliance artifacts for regulated industries
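For auditing, the metadata format shown above is simple enough to read line by line. The following C99 sketch assumes exactly the three line shapes from the example (`task_hash:`, `result_hash:`, and `output <path> <hash>`) and paths without whitespace; `struct meta` and `meta_parse_line` are hypothetical names, and the real file format may carry more fields.

```c
#include <stdio.h>
#include <string.h>

#define HASH_LEN 65        /* 64 hex characters plus the terminator */
#define META_PATH_LEN 256
#define MAX_OUTPUTS 16

/* Hypothetical in-memory form of one .meta file. */
struct meta {
    char task_hash[HASH_LEN];
    char result_hash[HASH_LEN];
    char output_path[MAX_OUTPUTS][META_PATH_LEN];
    char output_hash[MAX_OUTPUTS][HASH_LEN];
    size_t n_outputs;
};

/* Parse one line into m. Returns 0 if the line was recognized, -1 otherwise.
 * The field widths in the format strings match the buffer sizes above. */
int meta_parse_line(struct meta *m, const char *line)
{
    if (sscanf(line, "task_hash: %64s", m->task_hash) == 1) return 0;
    if (sscanf(line, "result_hash: %64s", m->result_hash) == 1) return 0;
    if (m->n_outputs < MAX_OUTPUTS &&
        sscanf(line, "output %255s %64s",
               m->output_path[m->n_outputs],
               m->output_hash[m->n_outputs]) == 2) {
        m->n_outputs++;
        return 0;
    }
    return -1;
}
```

A caller would zero-initialize a `struct meta`, feed it each line read with `fgets`, and then compare the recorded output hashes with the blobs present in the CAS.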

🔌 Language Server Integration

For complex codebases, integrate with language servers:

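# Start the TypeScript language server
# (provided by the separate typescript-language-server npm package)
typescript-language-server --stdio

# Use with ReproVM to record which files a compilation pulls in
task analyze_deps {
  cmd = tsc --noEmit --listFiles src/index.ts src/utils.ts > deps.txt
  inputs = src/index.ts, src/utils.ts
  outputs = deps.txt
  deps =
}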

🧪 Testing ReproVM

Comprehensive test suite included:

# Run all tests
cd tests && ./test_all.sh

# Individual test suites
./test_cas.sh      # CAS functionality
./test_parallel.sh # Parallel execution
./test_manifest.sh # Manifest parsing
./test_util.sh     # Utility functions

Contributing

Join the ReproVM community

We welcome contributions! Here's how you can help:

  • 🐛 Report bugs and issues
  • 💡 Suggest new features
  • 📝 Improve documentation
  • 🔧 Submit pull requests
  • ⭐ Star the repo to show support

Development Setup

# Fork and clone
git clone https://github.com/YOUR_USERNAME/ReproVM-Virtual-Machine
cd ReproVM-Virtual-Machine

# Create feature branch
git checkout -b feature/amazing-feature

# Build and test
make
cd tests && ./test_all.sh

# Make changes, commit, push
git add .
git commit -m "Add amazing feature"
git push origin feature/amazing-feature

# Open Pull Request on GitHub

Code Style

ReproVM uses automated formatting:

# Format all code
./format.sh

# Pre-commit hook
cp .githooks/pre-commit .git/hooks/
chmod +x .git/hooks/pre-commit

Frequently Asked Questions

❓ How does caching work?

Each task's identity is a SHA-256 hash of its command, sorted input hashes, and dependency result hashes. If this hash exists in .reprovm/cache/, outputs are restored from CAS instead of re-executing.

❓ Can I share cache across machines?

Yes! Copy .reprovm/cas/ and .reprovm/cache/ to another machine. For teams, implement a remote CAS backend (S3, HTTP, etc.) as an extension.

❓ What if I rename a file?

File identity is content-based in CAS, but manifests reference paths. Renaming changes the input list, causing task re-execution. Create symlinks or update the manifest.

❓ How do I debug failed tasks?

Run the command manually from the shell to see detailed errors. Check .reprovm/cache/ for task metadata and inspect input hashes.

❓ Does ReproVM support Windows?

ReproVM targets POSIX systems (Linux, macOS, BSD). Windows support via WSL, Cygwin, or MinGW is possible but untested.

❓ How fast is it compared to Make/Bazel?

For cache hits, ReproVM is near-instant (~80ms overhead). For cold builds, performance is comparable to Make. Bazel has more optimization but higher overhead.

Troubleshooting

Common issues and solutions

Symptom                       Cause                            Solution
Unknown target 'foo'          Target not defined in manifest   Check spelling, ensure task exists
Cycle detected                Circular dependency in DAG       Break cycle by removing/reordering deps
Task failed with exit code N  Command returned non-zero        Run command manually to debug
Cache hit but output missing  Output not declared in manifest  Add file to outputs list
Corrupted .meta file          Cache metadata unreadable        Delete .meta to force recompute
Failed to hash input          Input file missing/unreadable    Check file exists and permissions

🔧 Debug Mode: A REPROVM_VERBOSE=1 setting is planned as a future extension to show detailed internal decisions and hash computations.

Roadmap

Future enhancements and planned features

✅ Completed

  • Core task execution engine
  • Content-addressed storage
  • SHA-256 hashing and caching
  • Parallel execution with pthreads
  • Docker support
  • Manifest DSL parser

🚧 In Progress

  • Remote CAS backend (S3, HTTP)
  • Web UI for DAG visualization
  • Enhanced manifest validation
  • Performance profiling tools

🔮 Planned

  • Distributed execution across nodes
  • Language bindings (Python, Rust, Go)
  • Cryptographic signing of cache entries
  • YAML/JSON manifest formats
  • Built-in metrics and observability
  • IDE integrations (VSCode, IntelliJ)

Resources

Learn more about the technologies behind ReproVM