🛡️ FRAMES NEVER GO ON THE BUS (ADR-0008) ⚡ GLASS-TO-EVENT p95 ≈ 2.7 ms (LOCAL) vs 300 ms TARGET 🧠 BOUNDED-AUTONOMY AGENTS — REFUSAL IN CODE, NOT PROMPTS 🎯 WILSON LOWER-BOUND + MIN-SAMPLE FLOOR CANARY (ADR-0023) 👁️ SAME-URN SHADOW INFERENCE (ADR-0024) 📊 MULTI-WINDOW BURN-RATE SLOs (ADR-0025) 📦 AIR-GAPPED BUNDLE AS DAY-ONE CI ARTIFACT (ADR-0027) 🚀 35 SERVICES · 38 HELM CHARTS · 30 ADRs · 14 LIBRARIES 🔒 mTLS STRICT · OPA AuthZ · SPIRE · VAULT TRANSIT · COSIGN KEYLESS · SLSA v1
Distributed, autonomous,
GPU-native visual intelligence.

AegisVision is a 35-service Go monorepo for realtime multimodal perception, reasoning, and orchestration. Bounded-autonomy agents, Wilson canary, same-URN shadow, multi-window SLOs, signed air-gapped install — built from scar tissue.

Built by Son Nguyen · Apache 2.0 · Phases 0–7 complete (incl. production Next.js console)

35
Go Services
38
Helm Charts
33
Console Routes
30
ADRs
14
Shared Libs
49/49
Modules Green
10k
Streams Target
2.7ms
p95 Glass→Event
📡 Go 1.26 · gRPC · Buf · Protobuf ☸️ Kubernetes · Helm · Istio Ambient · ArgoCD · Kyverno · ESO · SPIRE · Vault 🗄️ PostgreSQL · Patroni · ClickHouse · Redis · NATS JetStream · Kafka 🎮 NVIDIA Triton · TensorRT-LLM · DeepStream · MIG 📈 OpenTelemetry · Prometheus · Loki · Tempo · Grafana 🔏 Cosign · Sigstore · Rekor · SLSA v1 · Syft SBOM 🧪 chaos-mesh · k6 · WAL-G · clickhouse-backup 📦 Cosign-signed air-gap bundle · OCI-layout · zstd

What is AegisVision?

A platform-shaped answer to every hard lesson learned from operating large computer-vision platforms — built once, on purpose, with the scar tissue baked in.

AegisVision is what you get when you take the hardest lessons learned from operating large CV platforms (Matroid, Scale, Roboflow, ClearML, Sensible) — and rebuild from first principles around the constraints that actually dominate at scale.

The six non-negotiables

🚫

Frames never go on the bus

The bus carries claim-check URNs; bytes live in object storage. ADR-0008.

Per-frame work never touches the control plane

The data plane is stateless beyond the operator buffer. Temporal never sees a frame. ADR-0001.

🎮

GPUs are MIG-partitioned by default

Hardware isolation bounds blast radius. No soft-share fallback. ADR-0003.

🛑

Agents do not auto-execute consequential actions

Tier-3 tools refuse in code without a resolved gate. ADR-0014·0017.

📚

Platform-fact answers must cite

Knowledge-service returns cited snippets. No hallucinated stream-IDs. ADR-0020.

📦

Air-gap is day-one

Signed bundle is a first-class CI artifact. Never a retrofit. ADR-0027.

Status today

Phases 0–7 complete. The code is end-to-end green; operational validation (real SOC 2 audit, real 10k-stream soak, real production cluster) is what remains.

| Phase | Theme | Status |
|---|---|---|
| 0 | Foundations: proto contracts, pkg/platform, walking-skeleton spine | complete |
| 1 | Glass-to-event walking skeleton (5 services + NATS) | complete · p95 ≈ 2.7 ms local |
| 2 | GPU hot path: Triton + MIG + inference-router + canary plumbing | complete |
| 3 | Multi-tenant + edge + storage tier (Patroni / ClickHouse / Vault) | complete |
| 4 | Intelligence tier: LLM gateway + agent + RAG + bounded autonomy | complete |
| 5 | Adaptive autonomy: canary + shadow + drift + SLO + prefetch | complete |
| 6 | GA hardening: compliance evidence + air-gap + chaos + DR drills + release | complete |
| 7 | Production console: Next.js 14 + Tailwind UI exposing every public endpoint (33 routes) | complete |

High-level architecture

Two planes — separated along the frequency axis, not the domain axis. Per-frame work runs in the data plane; per-event work runs in the control plane.

flowchart LR
  subgraph Edge["Edge / on-prem"]
    CAM[Cameras / RTSP / Files]
  end
  subgraph Stream["Stream tier"]
    SM[stream-manager]
    DR[dataplane-runner]
  end
  subgraph GPU["GPU tier"]
    IR[inference-router]
    TRT[Triton + TRT-LLM]
    GS[gpu-scheduler]
  end
  subgraph Control["Control plane"]
    AG[api-gateway]
    PS[pipeline-service]
    MR[model-registry]
    TS[tenant-service]
  end
  subgraph Events["Event tier"]
    ES[event-service]
    RH[realtime-hub]
    NS[notification-service]
  end
  subgraph Store["Storage"]
    CH[(ClickHouse)]
    PG[(Postgres)]
    OBJ[(Object store)]
  end
  subgraph Brain["Intelligence tier"]
    LG[llm-gateway]
    AS[agent-service]
    PG2[policy-gate-service]
    KS[knowledge-service]
  end
  CAM --> SM
  SM --> DR
  DR --> IR
  IR --> TRT
  DR -- detections --> ES
  ES --> CH
  ES --> RH
  RH --> NS
  AG --> PS
  AG --> ES
  AG --> AS
  AS --> LG
  AS --> KS
  AS --> PG2
  IR --> GS
  PS --> MR

Glass-to-event flow

The hot path. Every pixel that becomes a tenant-visible event takes this route.

sequenceDiagram
  autonumber
  participant CAM as Camera
  participant DR as dataplane-runner
  participant CC as claim-check
  participant IR as inference-router
  participant TRT as Triton + MIG
  participant NATS as NATS
  participant ES as event-service
  participant CH as ClickHouse
  participant AG as api-gateway
  actor U as User
  CAM->>DR: RTSP frame
  DR->>CC: PUT frame_urn
  DR->>DR: sampler
  DR->>IR: Infer(frame_urn, model)
  IR->>TRT: detect
  TRT-->>IR: detections
  IR->>NATS: inference.completed.v1
  IR-->>DR: detections
  DR->>DR: tracker, rule
  alt rule trips
    DR->>NATS: events.v1
    NATS->>ES: deliver
    ES->>CH: insert
    ES->>AG: SSE push
    AG->>U: SSE event
  end
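
The claim-check step in the flow above (PUT the frame, pass only frame_urn onward) might look roughly like the following Go sketch. It is not the project's implementation: ObjectStore, the "urn:aegis:frame:" scheme, InferRequest, and the model name are all hypothetical names chosen for illustration, and an in-memory map stands in for real object storage.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// ObjectStore is a hypothetical in-memory stand-in for S3 / MinIO / Ceph.
type ObjectStore struct {
	blobs map[string][]byte
}

func NewObjectStore() *ObjectStore {
	return &ObjectStore{blobs: make(map[string][]byte)}
}

// Put stores the frame bytes and returns a claim-check URN.
// The "urn:aegis:frame:" scheme is an illustrative assumption.
func (s *ObjectStore) Put(tenant string, frame []byte) string {
	sum := sha256.Sum256(frame)
	urn := "urn:aegis:frame:" + tenant + ":" + hex.EncodeToString(sum[:8])
	s.blobs[urn] = append([]byte(nil), frame...) // copy so callers can reuse the buffer
	return urn
}

// Get redeems a claim check for the original bytes.
func (s *ObjectStore) Get(urn string) ([]byte, error) {
	b, ok := s.blobs[urn]
	if !ok {
		return nil, errors.New("claim-check: unknown URN " + urn)
	}
	return b, nil
}

// InferRequest is the only thing that crosses the bus or RPC boundary:
// a reference to the frame, never the frame itself.
type InferRequest struct {
	FrameURN string
	Model    string
}

func main() {
	store := NewObjectStore()
	frame := make([]byte, 1<<20) // a 1 MiB stand-in for a decoded frame

	req := InferRequest{FrameURN: store.Put("t-demo", frame), Model: "detector-v1"}
	fmt.Printf("message carries a %d-byte URN for a %d-byte frame\n", len(req.FrameURN), len(frame))

	// The consumer redeems the URN only when it actually needs pixels.
	got, err := store.Get(req.FrameURN)
	fmt.Println(len(got), err)
}
```

The point of the pattern is that message size stays constant regardless of frame resolution, so bus throughput is bounded by event rate rather than by pixel volume.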

Inference fan-out

Every Infer call emits up to three bus events. The 5-test integration smoke asserts every subject has both a producer and a consumer in CI.

flowchart LR
  IR[inference-router] -->|inference.completed.v1| NATS
  NATS --> MET[metering-service]
  NATS --> DD[drift-detection-service]
  NATS --> CA[cost-accounting]
  NATS --> AL[active-learning-service]
  IR -->|inference.baseline.v1| NATS
  NATS --> SI[shadow-inference-service]
  IR -->|inference.outcome.v1| NATS
  NATS --> CC[canary-controller]
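
The "every subject has both a producer and a consumer" assertion can be pictured as a set comparison over declared wiring. The sketch below is an assumption about how such a check could be written, not the project's actual smoke test: the Wiring type, OrphanSubjects, and fanOut are hypothetical names, and the wiring data is copied from the diagram above.

```go
package main

import (
	"fmt"
	"sort"
)

// Wiring lists, per service, the bus subjects it publishes and consumes.
type Wiring struct {
	Publishes map[string][]string
	Consumes  map[string][]string
}

// OrphanSubjects returns every subject that has a producer but no consumer,
// or a consumer but no producer. An empty result means the wiring is closed.
func OrphanSubjects(w Wiring) []string {
	produced := map[string]bool{}
	consumed := map[string]bool{}
	for _, subjects := range w.Publishes {
		for _, s := range subjects {
			produced[s] = true
		}
	}
	for _, subjects := range w.Consumes {
		for _, s := range subjects {
			consumed[s] = true
		}
	}
	var orphans []string
	for s := range produced {
		if !consumed[s] {
			orphans = append(orphans, s+" (no consumer)")
		}
	}
	for s := range consumed {
		if !produced[s] {
			orphans = append(orphans, s+" (no producer)")
		}
	}
	sort.Strings(orphans) // deterministic output for CI logs
	return orphans
}

// fanOut reproduces the inference fan-out shown in the diagram.
func fanOut() Wiring {
	return Wiring{
		Publishes: map[string][]string{
			"inference-router": {"inference.completed.v1", "inference.baseline.v1", "inference.outcome.v1"},
		},
		Consumes: map[string][]string{
			"metering-service":         {"inference.completed.v1"},
			"drift-detection-service":  {"inference.completed.v1"},
			"cost-accounting":          {"inference.completed.v1"},
			"active-learning-service":  {"inference.completed.v1"},
			"shadow-inference-service": {"inference.baseline.v1"},
			"canary-controller":        {"inference.outcome.v1"},
		},
	}
}

func main() {
	w := fanOut()
	fmt.Println("orphans:", OrphanSubjects(w))

	// A one-character typo in a consumer's subject shows up as two orphans.
	w.Consumes["canary-controller"] = []string{"inference.outcomes.v1"}
	fmt.Println("orphans after typo:", OrphanSubjects(w))
}
```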

Storage architecture

Three tiers, each chosen for the shape of the data — not "one DB for everything."

flowchart LR
  subgraph hot["hot path"]
    EVT[events.v1] -->|consume| ES[event-service]
    ES --> CH[(ClickHouse<br/>3×2 replicated)]
    ES --> RING[in-mem ring]
  end
  subgraph cold["cold path"]
    PG[(Postgres<br/>Patroni HA)]
    REDIS[(Redis<br/>Sentinel HA)]
  end
  PS[pipeline-service] --> PG
  MR[model-registry] --> PG
  TS[tenant-service] --> PG
  AU[audit-service] --> PG
  NLQ[nlq-service] -.->|read| CH
  OBJ[(Object store<br/>S3 / MinIO / Ceph)]
  CC[claim-check] --> OBJ
  MS[media-service] --> OBJ

🧰 35 services in one go.work — each its own Go module, each its own bounded context 📐 Protobuf-everywhere, Buf-managed, breaking-change-checked (ADR-0007) 🪝 Every chart conformance-tested for mTLS STRICT + OPA AuthZ + default-deny NetworkPolicy 🧪 5-test cross-service integration smoke catches bus subject drift before deploy 🚦 Refusal-in-code: 4 tools tried to bypass tier-3; all four were caught by pkg/agent

The 35 services

Each in its own Go module, each a single bounded context, each communicating over typed protobuf contracts. No service shares state in memory with another.

Control plane · Temporal-friendly

api-gateway
Public REST. JWT verify, OPA AuthZ, RFC 9457, idempotency, cursor pagination, SSE proxy, console.
pipeline-service
Pipeline DAGs + revisions. The canonical CRUD-with-revisions reference.
stream-manager
Stream lifecycle. Dispatches operator.control to dataplane-runner shards.
model-registry
Versioned model artifacts + reference distribution + gated promotion.
dataset-service
Datasets + dataset versions + lineage.
annotation-service
Labels + label-policy revisions (immutable).
training-orchestrator
Wraps Kubeflow / Ray / Argo Workflows training jobs.
media-service
Recordings + clips + retention. Crypto-shredded per tenant key.
tenant-service
Tenants + projects + members + RBAC.
auth-proxy
JWT verify against JWKS, tenant injection. No HMAC.
audit-service
Append-only, hash-chained audit log. Fail-closed.

Data plane · per-frame, stateless

dataplane-runner
Operator DAG: ingest → sampler → detect → tracker → rule → emit.
inference-router
Routes to Triton; publishes inference.completed/baseline/outcome.
gpu-scheduler
MIG-default reservation ledger. No soft-share.
rule-engine
dwell, count, line-cross, zone-enter — composable.
event-service
Consumes events.v1; ClickHouse + SSE.
realtime-hub
WebSocket fan-out for console + integrations.
notification-service
Webhooks, email, Slack — replay-safe idempotent.
edge-gateway
k3s-friendly outbox sync to core.

Intelligence tier · Phase 4

llm-gateway
One OpenAI-compatible endpoint. Sanitizer + PII redactor + refusal threshold.
agent-service
Bounded-autonomy runtime. Tier-3 routes via policy-gate. Auto-resume on gate.resolved.
policy-gate-service
Human-in-the-loop approval. Audit on every decision.
knowledge-service
RAG corpus over docs + ADRs + runbooks. Citation-mandatory.
active-learning-service
Uncertainty + diversity sampling. Never random.
nlq-service
Natural-language → structured query.

Adaptive autonomy · Phase 5

canary-controller
Wilson lower-bound + min-sample floor proportion test.
shadow-inference-service
Same-URN candidate-vs-baseline. Tenant never sees candidate.
drift-detection-service
JS / KL / TVD divergence vs reference.
slo-watchdog
Multi-window burn-rate. SRE workbook.
prefetch-service
7×24 EMA grid; warm-ups ahead of demand.
autonomy-orchestrator
Cron + signal-driven agent sessions. No second runtime.
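
As a rough illustration of the multi-window burn-rate rule that slo-watchdog is described as applying, the sketch below computes burn rate as the observed error rate divided by the error budget and pages only when both windows exceed a threshold. The Window type, the function names, the 99.9% target, and the threshold of 6 are assumptions for the example; the page does not state the real values.

```go
package main

import "fmt"

// Window holds request counts observed over one look-back window.
type Window struct {
	Errors float64
	Total  float64
}

// BurnRate is the observed error rate divided by the error budget (1 - SLO).
// A burn rate of 1 spends the budget exactly over the SLO period.
func BurnRate(w Window, sloTarget float64) float64 {
	budget := 1 - sloTarget
	if w.Total == 0 || budget <= 0 {
		return 0 // no traffic, or no budget defined: nothing to report
	}
	return (w.Errors / w.Total) / budget
}

// ShouldPage fires only when BOTH the fast and the slow window breach the
// threshold, which filters out short spikes and slow-burning noise alike.
func ShouldPage(fast, slow Window, sloTarget, threshold float64) bool {
	return BurnRate(fast, sloTarget) > threshold &&
		BurnRate(slow, sloTarget) > threshold
}

func main() {
	const slo = 0.999     // hypothetical 99.9% availability target
	const threshold = 6.0 // hypothetical burn-rate threshold

	fast := Window{Errors: 20, Total: 1000} // last 1h: 2% errors
	slow := Window{Errors: 30, Total: 6000} // last 6h: 0.5% errors
	fmt.Println("spike only:", ShouldPage(fast, slow, slo, threshold))

	slow = Window{Errors: 60, Total: 6000} // last 6h: 1% errors
	fmt.Println("sustained:", ShouldPage(fast, slow, slo, threshold))
}
```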

GA hardening · Phase 6

compliance-evidence-service
Composes per-control evidence; owns no data.
semantic-search
Cross-tenant semantic search over events + clips.
cost-accounting
Per-tenant GPU-second + token + storage cost.
metering-service
Billable-event aggregation from inference.completed.v1.

Production console · Phase 7

console
Next.js 14 + Tailwind. 33 routes covering every public REST endpoint: dashboard with live SSE, streams + pipelines + models, agents with citations + tier-3 gate banner, gate inbox, canary decision board, drift heatmap, SLO burn-rate, prefetch grid, knowledge RAG, audit log + chain-verify, tenants, cost, compliance bundles, and more. Conformance-clean Helm chart.

Bounded-autonomy agents

The agent has a fixed toolbox. Each tool carries a risk tier. Tier-3 tools refuse to run in code without a resolved gate. You cannot prompt your way past it.

Tier 0 · auto, read-only

Auto-execute. Read-only. Examples:

query_knowledge read_event_stream describe_pipeline list_models

Tier 1 · advisory

Auto-execute. Returns recommendation only. Examples:

summarise compare_distributions predict_dwell

Tier 2 · propose

Auto-execute. Returns a proposal, no side effect. Examples:

propose_retrain propose_canary_plan propose_retention_change

Tier 3 · gated 🛑

Refused in code without resolved gate. Routes via policy-gate-service. Examples:

promote_model override_retention force_failover delete_dataset

Tier-3 gate round-trip

sequenceDiagram
  autonumber
  actor U as User
  participant AS as agent-service
  participant LG as llm-gateway
  participant PG as policy-gate-service
  actor A as Approver
  participant NATS as NATS
  U->>AS: "promote my-model-v2"
  AS->>LG: choose tool
  LG-->>AS: tool=promote_model (tier 3)
  AS->>PG: RequestGate(promote_model, args)
  PG-->>AS: gate_id=g_xyz
  AS-->>U: pending approval (gate_id)
  PG->>A: notify
  A->>PG: approve(g_xyz)
  PG->>NATS: gate.resolved.g_xyz
  NATS->>AS: auto-deliver
  AS->>AS: a.Resume(ctx, g_xyz, toolResult)
  AS-->>U: "Promoted my-model-v2."

The refusal is in pkg/agent/agent.go: if tool.Tier == Tier3 && req.GateID == "" { return ErrTier3NeedsGate }. It cannot be prompted away. Even autonomy-orchestrator goes through the same agent-service runtime (ADR-0022) — every constraint binds.
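
To make the quoted check concrete, here is a minimal, self-contained sketch built around it. Only the condition `tool.Tier == Tier3 && req.GateID == ""` and the name ErrTier3NeedsGate come from the text above; the Tool, Request, and Agent types, the resolved-gate map, and newDemoAgent are assumptions added so the example runs on its own, and the real pkg/agent may be structured quite differently.

```go
package main

import (
	"errors"
	"fmt"
)

type Tier int

const (
	Tier0 Tier = iota // auto, read-only
	Tier1             // advisory
	Tier2             // propose
	Tier3             // gated
)

var ErrTier3NeedsGate = errors.New("tier-3 tool requires a resolved gate")

type Tool struct {
	Name string
	Tier Tier
	Run  func(args map[string]string) (string, error)
}

type Request struct {
	Tool   string
	Args   map[string]string
	GateID string
}

// Agent holds a fixed toolbox. The resolved map is a hypothetical stand-in
// for gate IDs learned from gate.resolved messages.
type Agent struct {
	tools    map[string]Tool
	resolved map[string]bool
}

// Execute refuses tier-3 tools in code, before any tool logic runs.
// No prompt content is consulted, so no prompt can change the outcome.
func (a *Agent) Execute(req Request) (string, error) {
	tool, ok := a.tools[req.Tool]
	if !ok {
		return "", fmt.Errorf("unknown tool %q", req.Tool)
	}
	if tool.Tier == Tier3 && req.GateID == "" {
		return "", ErrTier3NeedsGate
	}
	// Assumption: a gate ID must also have been approved, not merely supplied.
	if tool.Tier == Tier3 && !a.resolved[req.GateID] {
		return "", ErrTier3NeedsGate
	}
	return tool.Run(req.Args)
}

func newDemoAgent() *Agent {
	return &Agent{
		tools: map[string]Tool{
			"list_models": {Name: "list_models", Tier: Tier0,
				Run: func(map[string]string) (string, error) { return "my-model-v1, my-model-v2", nil }},
			"promote_model": {Name: "promote_model", Tier: Tier3,
				Run: func(args map[string]string) (string, error) { return "promoted " + args["model"], nil }},
		},
		resolved: map[string]bool{"g_xyz": true},
	}
}

func main() {
	a := newDemoAgent()
	args := map[string]string{"model": "my-model-v2"}

	_, err := a.Execute(Request{Tool: "promote_model", Args: args})
	fmt.Println("no gate:", err)

	out, err := a.Execute(Request{Tool: "promote_model", Args: args, GateID: "g_xyz"})
	fmt.Println("resolved gate:", out, err)
}
```

Because the check lives in the dispatch path rather than in the tool or the prompt, adding a new tier-3 tool inherits the refusal without any extra code.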

Adaptive autonomy

Six self-improving loops. Each is statistically rigorous; promotion still routes through a human gate by design.

📈

Canary

Wilson lower-bound proportion test + minimum sample floor. Automatic rollback; gated promotion. ADR-0023.

👻

Shadow inference

Same-URN candidate-vs-baseline. Candidate never reaches the tenant; only the comparison metric. ADR-0024.

📊

Drift

Sliding-window JS / KL / TVD divergence vs reference distribution. ADR-0025.

🚨

SLO burn-rate

Multi-window (1h fast, 6h slow). Page only if both windows breach. ADR-0025.

🔥

Predictive prefetch

7×24 hour-of-week EMA grid; warm-ups dispatched at horizon ahead of demand. ADR-0026.

🤖

Continuous autonomy

Cron + signal-driven agent sessions. No second runtime — opens regular agent-service sessions. ADR-0022.
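
For the drift loop above, the three named divergences have standard definitions over discrete distributions. The sketch below implements them from those textbook definitions; it assumes both inputs are normalized histograms of equal length, and it says nothing about how drift-detection-service actually bins its data or sets thresholds.

```go
package main

import (
	"fmt"
	"math"
)

// KL returns the Kullback-Leibler divergence KL(p || q) in nats.
// Bins where p is zero contribute nothing; a bin where p > 0 and q == 0
// makes the divergence infinite.
func KL(p, q []float64) float64 {
	var d float64
	for i := range p {
		if p[i] == 0 {
			continue
		}
		if q[i] == 0 {
			return math.Inf(1)
		}
		d += p[i] * math.Log(p[i]/q[i])
	}
	return d
}

// JS returns the Jensen-Shannon divergence: the mean KL divergence of p and q
// from their midpoint m. It is symmetric and bounded by ln 2.
func JS(p, q []float64) float64 {
	m := make([]float64, len(p))
	for i := range p {
		m[i] = 0.5 * (p[i] + q[i])
	}
	return 0.5*KL(p, m) + 0.5*KL(q, m)
}

// TVD returns the total variation distance: half the L1 distance.
func TVD(p, q []float64) float64 {
	var s float64
	for i := range p {
		s += math.Abs(p[i] - q[i])
	}
	return 0.5 * s
}

func main() {
	reference := []float64{0.70, 0.20, 0.10} // hypothetical class mix at training time
	window := []float64{0.40, 0.35, 0.25}    // hypothetical class mix in the live window

	fmt.Printf("KL=%.4f JS=%.4f TVD=%.4f\n",
		KL(window, reference), JS(window, reference), TVD(window, reference))
}
```

JS and TVD stay finite when the live window contains a class the reference never saw, which is one reason to track them alongside KL.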

The canary loop

sequenceDiagram
  autonumber
  participant IR as inference-router
  participant CC as canary-controller
  participant MR as model-registry
  participant PG as policy-gate-service
  loop per outcome
    IR->>CC: inference.outcome.v1
    CC->>CC: update Wilson lower bound
  end
  alt rollback condition
    CC->>MR: automatic rollback
  else promote recommendation
    CC->>PG: RequestGate(promote_model)
    PG-->>CC: approved
    CC->>MR: promote candidate
  else hold
    CC->>CC: keep accumulating
  end
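
The three branches of the loop can be sketched with the standard Wilson score interval. The interval formula below is the textbook one; everything else is an assumption made for illustration: the Decide function, the 0.95 target, the 200-sample floor, and in particular the rollback rule (the upper bound falling below the target), since the page does not say what the actual rollback condition is.

```go
package main

import (
	"fmt"
	"math"
)

type Decision string

const (
	Hold             Decision = "hold"
	RecommendPromote Decision = "recommend-promote"
	Rollback         Decision = "rollback"
)

// wilsonBounds returns the Wilson score interval for a binomial proportion.
// z is the normal quantile, for example 1.96 for roughly 95% confidence.
func wilsonBounds(successes, n int, z float64) (lo, hi float64) {
	if n == 0 {
		return 0, 1 // no evidence yet: the interval is the whole range
	}
	nf := float64(n)
	p := float64(successes) / nf
	z2 := z * z
	center := p + z2/(2*nf)
	margin := z * math.Sqrt(p*(1-p)/nf+z2/(4*nf*nf))
	denom := 1 + z2/nf
	return (center - margin) / denom, (center + margin) / denom
}

// Decide applies the min-sample floor first, then the interval.
//   - below the floor: hold, however good the raw rate looks
//   - lower bound at or above target: recommend promotion (still human-gated)
//   - upper bound below target: roll back (assumed rule, see lead-in)
//   - otherwise: keep accumulating
func Decide(successes, n, minSamples int, target, z float64) Decision {
	if n < minSamples {
		return Hold
	}
	lo, hi := wilsonBounds(successes, n, z)
	switch {
	case lo >= target:
		return RecommendPromote
	case hi < target:
		return Rollback
	default:
		return Hold
	}
}

func main() {
	const target, z, floor = 0.95, 1.96, 200

	fmt.Println(Decide(10, 10, floor, target, z))    // perfect, but far below the floor
	fmt.Println(Decide(990, 1000, floor, target, z)) // strong evidence of a good candidate
	fmt.Println(Decide(940, 1000, floor, target, z)) // interval straddles the target
	fmt.Println(Decide(850, 1000, floor, target, z)) // strong evidence of a bad candidate
}
```

Using the lower bound rather than the raw success rate means a candidate with 10 successes out of 10 cannot outrank one with 990 out of 1000, and the sample floor prevents any decision at all on thin evidence.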

Security & supply chain

Defense in depth, refuse unsafe defaults, fail closed. Every control here has a corresponding SOC 2 evidence record + pen-test scope entry.

🔐

mTLS STRICT everywhere

Istio Ambient. Every pod-to-pod call is mTLS. No plaintext. Per-service AuthorizationPolicy ALLOW list with SPIFFE IDs.

📜

OPA AuthZ

Per-service Rego. Default-deny. Per-tenant model allow-lists; per-project member roles; per-resource ownership.

🆔

SPIRE workload identity

Every pod gets a SPIFFE ID. Mesh authorization keyed on these IDs.

🗝️

Per-tenant Vault transit

Destroying a tenant's transit key renders all encrypted bytes (including backups) unreadable. Crypto-shredding. ADR-0014.

📦

Cosign keyless + SLSA v1 + SBOM

Every image keylessly signed via OIDC. Kyverno admission verifies signatures against the build workflow. Syft SPDX SBOM attached.

🧱

NetworkPolicy default-deny

Even if the mesh is compromised, the pod network refuses connections.

📝

Append-only, hash-chained audit

No UPDATE, no DELETE. Fail-closed: if audit can't append, the upstream operation fails.

🧬

Prompt-injection defense in depth

Sanitizer + PII redactor + refusal threshold + per-tenant rate limit + tier-3 gate. ADR-0021.
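
The append-only, hash-chained, fail-closed audit control listed above can be illustrated with a small in-memory chain. This is a sketch of the general technique only: the Entry layout, the SHA-256 input format, the Unavailable flag used to simulate an outage, and the FailClosed helper are assumptions, not audit-service's schema or API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// Entry links to its predecessor through PrevHash.
type Entry struct {
	Seq      int
	Action   string
	PrevHash string
	Hash     string
}

// Log exposes Append and Verify only: there is no update or delete path.
type Log struct {
	entries     []Entry
	Unavailable bool // hypothetical switch to simulate a storage outage
}

func hashEntry(seq int, prev, action string) string {
	sum := sha256.Sum256([]byte(fmt.Sprintf("%d|%s|%s", seq, prev, action)))
	return hex.EncodeToString(sum[:])
}

// Append adds a record whose hash covers the previous record's hash.
func (l *Log) Append(action string) error {
	if l.Unavailable {
		return errors.New("audit: log unavailable")
	}
	prev := ""
	if n := len(l.entries); n > 0 {
		prev = l.entries[n-1].Hash
	}
	seq := len(l.entries)
	l.entries = append(l.entries, Entry{
		Seq: seq, Action: action, PrevHash: prev, Hash: hashEntry(seq, prev, action),
	})
	return nil
}

// Verify recomputes every hash; editing or removing any earlier entry
// breaks the chain from that point onward.
func (l *Log) Verify() error {
	prev := ""
	for i, e := range l.entries {
		if e.Seq != i || e.PrevHash != prev || e.Hash != hashEntry(e.Seq, e.PrevHash, e.Action) {
			return fmt.Errorf("audit: chain broken at entry %d", i)
		}
		prev = e.Hash
	}
	return nil
}

// FailClosed writes the audit record first and runs op only if that succeeded.
func FailClosed(l *Log, action string, op func() error) error {
	if err := l.Append(action); err != nil {
		return fmt.Errorf("refusing %q: %w", action, err)
	}
	return op()
}

func main() {
	log := &Log{}
	for _, a := range []string{"gate requested", "gate resolved", "model promoted"} {
		_ = log.Append(a)
	}
	fmt.Println("verify:", log.Verify())

	log.entries[1].Action = "nothing happened" // simulate tampering
	fmt.Println("verify after tamper:", log.Verify())

	log.Unavailable = true
	err := FailClosed(log, "delete_dataset", func() error { return nil })
	fmt.Println("fail-closed:", err)
}
```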

Defense-in-depth example: "promote a model"

flowchart TB
  USER["Agent suggestion: promote_model"] --> CITE["Citation required<br/>knowledge-service"]
  CITE --> TIER["Tier 3 → refused in code without gate"]
  TIER --> AUTHZ["OPA AuthZ check"]
  AUTHZ --> AUDIT1["Audit: gate requested"]
  AUDIT1 --> GATE["policy-gate-service<br/>human approval"]
  GATE --> AUDIT2["Audit: gate resolved"]
  AUDIT2 --> WILSON["Wilson lower bound met"]
  WILSON --> MR["model-registry: promote"]
  MR --> AUDIT3["Audit: model promoted"]
  MR --> COSIGN["Cosign verify on artifact pull"]

Eight checkpoints. Any one of them can refuse.

Deploy paths

Local laptop → online Kubernetes → air-gapped DMZ → k3s edge. Same charts, same contracts, one bundle.

💻

A · Walking skeleton

5 services + embedded NATS, no external deps. ~5 min from clone to first real event.

☁️

B · Local dev (full stack)

Postgres + ClickHouse + NATS + MinIO via docker-compose; all 35 services local.

☸️

C · Online cluster

ArgoCD ApplicationSet reconciles 38 charts. ~5–10 min first sync.

📦

D · Air-gapped

Signed bundle (~6–8 GiB tar.zst). Cosign verify on target. ~30 min install on a 6-node cluster.

🛰️

E · Edge (k3s)

Reduced operator set + outbox sync to core. Jetson AGX Orin friendly.

📈

Capacity targets

10,000 streams / cluster · 1,000,000 detections/s · 1,000 concurrent agents · p95 glass→event < 300 ms.

Walking-skeleton in 5 commands

# Terminal 1 — event-service embeds NATS in dev mode.
AEGIS_EMBED_NATS=true task run:event-service
# Note the URL it logs. Export it for the rest:
export AEGIS_NATS_URL=nats://127.0.0.1:NNNNN

# Terminals 2–5
task run:pipeline-service
AEGIS_NATS_URL=$AEGIS_NATS_URL task run:stream-manager
AEGIS_NATS_URL=$AEGIS_NATS_URL task run:dataplane-runner
AEGIS_STREAM_MANAGER_ADDR=localhost:9092 \
AEGIS_EVENT_SERVICE_URL=http://localhost:8090 \
AEGIS_CONSOLE_DIR=$(pwd)/services/api-gateway/console \
  task run:api-gateway

# Subscribe to the SSE feed, then create a stream:
curl -N -H 'X-Tenant-Id: t-demo' \
  'http://localhost:8080/v1/events:stream?stream_id=stream-dock-1'

curl -X POST -H 'X-Tenant-Id: t-demo' -H 'Idempotency-Key: 1' \
  -H 'Content-Type: application/json' \
  -d '{"name":"dock-1","project_id":"p1","protocol":"file",
       "url":"file:///x","pipeline_id":"p-walking"}' \
  http://localhost:8080/v1/streams

Within ~5 s, an event with kind: KIND_DWELL arrives on the SSE feed. End-to-end. Real services.
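
For readers who would rather consume the feed from Go than from curl, the following sketch reads a Server-Sent Events stream with the standard library. The URL and the X-Tenant-Id header are taken from the curl command above; the readSSE, parseSSE, and subscribe helpers and the sample payload shape are assumptions, and the parser handles only `data:` lines and comments, not the full SSE field set.

```go
package main

import (
	"bufio"
	"fmt"
	"io"
	"net/http"
	"strings"
)

// readSSE calls onEvent once per event. An event is one or more "data:" lines
// terminated by a blank line; lines starting with ":" are comments.
func readSSE(r io.Reader, onEvent func(data string)) error {
	sc := bufio.NewScanner(r)
	var data []string
	for sc.Scan() {
		line := sc.Text()
		switch {
		case line == "":
			if len(data) > 0 {
				onEvent(strings.Join(data, "\n"))
				data = data[:0]
			}
		case strings.HasPrefix(line, "data:"):
			payload := strings.TrimPrefix(line, "data:")
			data = append(data, strings.TrimPrefix(payload, " "))
		}
	}
	return sc.Err()
}

// parseSSE is a convenience wrapper for parsing an in-memory stream.
func parseSSE(s string) []string {
	var events []string
	_ = readSSE(strings.NewReader(s), func(d string) { events = append(events, d) })
	return events
}

// subscribe opens the feed and blocks until the server closes the stream.
func subscribe(url, tenant string, onEvent func(string)) error {
	req, err := http.NewRequest(http.MethodGet, url, nil)
	if err != nil {
		return err
	}
	req.Header.Set("X-Tenant-Id", tenant)
	req.Header.Set("Accept", "text/event-stream")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("unexpected status %s", resp.Status)
	}
	return readSSE(resp.Body, onEvent)
}

func main() {
	// Offline demo with a hypothetical payload shape. Against a running
	// walking skeleton, call instead:
	//   subscribe("http://localhost:8080/v1/events:stream?stream_id=stream-dock-1",
	//       "t-demo", func(d string) { fmt.Println(d) })
	sample := ": keep-alive\n\ndata: {\"kind\":\"KIND_DWELL\"}\n\n"
	for _, e := range parseSSE(sample) {
		fmt.Println("event:", e)
	}
	_ = subscribe // referenced so the real entry point stays visible in the sketch
}
```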

30 Architecture Decision Records

The load-bearing decisions. Violating any of these is an ADR-tracked change, not a feature PR.

| ADR | Decision |
|---|---|
| 0001 | Control plane / data plane are hard-separated. Temporal never sees a frame. |
| 0002 | ClickHouse holds the detection firehose. PostgreSQL holds metadata. |
| 0003 | MIG is the default GPU sharing mode for production inference. |
| 0007 | Protobuf-everywhere, Buf-managed. /proto has strictest CODEOWNERS. |
| 0008 | Frames and media never travel on Kafka or NATS. Claim-check only. |
| 0014 | Bounded autonomy. Agents auto-execute read/advisory + reversible ops; consequential = gate. |
| 0016 | Walking skeleton first: thin but complete on the real architecture. |
| 0017 | 4-tier risk model encoded in tool schemas; refusal-in-code for tier-3. |
| 0018 | One LLM/VLM gateway. OpenAI-compatible. Backend-swappable. |
| 0019 | Active learning samples by uncertainty + diversity, never random. |
| 0020 | Platform-fact answers are retrieval-augmented; no hallucinated identifiers. |
| 0021 | Prompt-injection defense in depth: sanitizer + PII + threshold + rate limit + gate. |
| 0022 | Continuous autonomy uses regular agent-service sessions. No second runtime. |
| 0023 | Canary uses Wilson lower-bound + min-sample floor. Promotion gated; rollback automatic. |
| 0024 | Shadow inference compares candidate vs baseline on same frame URN. |
| 0025 | Drift = JS/KL/TVD vs reference. SLO = multi-window burn-rate. |
| 0026 | Predictive prefetch via 7×24 EMA grid. |
| 0027 | Air-gapped bundle as day-one CI artifact. |
| 0028 | Chaos engineering as production-readiness gate. |
| 0029 | compliance-evidence-service owns no data; composes evidence on demand. |
| 0030 | release-please + signed air-gap bundle promotion; single platform version. |

Browse the docs

📘 README What AegisVision is, status, layout, quickstart.
🏛 ARCHITECTURE Canonical architecture deep dive. 700+ lines.
⚙️ SETUP_GUIDE Local → cluster → air-gapped → edge. Step-by-step.
📐 30 ADRs Every load-bearing architectural decision.
📋 Compliance SOC 2, EU AI Act, GDPR DPIA, pen-test scope.
📒 Runbooks Oncall, incident, DR, chaos game day, drift spike.
💡 Concepts Pipelines, streams, models, datasets, rules.
🔒 Security Threat model, defenses, prompt-injection.
🧠 Agents Bounded autonomy, tier model, citation.
🎯 Canary & Shadow Wilson math + same-URN evaluation.
📊 Drift & SLO JS/KL/TVD + multi-window burn-rate.
🌐 API Reference Every public REST endpoint.
🛠️ Go 1.26 · gRPC · Buf · Protobuf ☸️ Kubernetes · Helm · Istio Ambient · ArgoCD · Kyverno · ESO · SPIRE · Vault 🗄️ PostgreSQL · Patroni · ClickHouse · Redis · NATS JetStream · Kafka · pgvector 🎮 NVIDIA Triton · TensorRT-LLM · DeepStream · MIG · ByteTrack · DeepSORT 📈 OpenTelemetry · Prometheus · Loki · Tempo · Grafana 🔏 Cosign · Sigstore · Rekor · SLSA v1 · Syft SBOM 🧪 chaos-mesh · k6 · WAL-G · clickhouse-backup · GitHub Actions 📦 Cosign-signed air-gap bundle · OCI-layout · zstd · crane · release-please

Built by one person.

AegisVision is authored and maintained by Son Nguyen. The author would love to hear from you.

Son Nguyen

Author · Maintainer · BDFL

⭐ GitHub 💼 LinkedIn 🌐 sonnguyenhoang.com ✉️ Email
"AegisVision is what you get when you take the hardest lessons learned from operating large CV platforms — and rebuild from first principles around the constraints that actually dominate at scale."