AegisVision is a 35-service Go monorepo for realtime multimodal perception, reasoning, and orchestration. Bounded-autonomy agents, Wilson canary, same-URN shadow, multi-window SLOs, signed air-gapped install — built from scar tissue.
Built by Son Nguyen · Apache 2.0 · Phases 0–7 complete (incl. production Next.js console)
A platform-shaped answer to every hard lesson learned from operating large computer-vision platforms — built once, on purpose, with the scar tissue baked in.
The bus carries claim-check URNs; bytes live in object storage. ADR-0008.
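A minimal sketch of the claim-check pattern described above, assuming an in-memory stand-in for object storage. The `ObjectStore` interface, the `FrameRef` type, and the `urn:aegis:frame:<tenant>:<sha256>` URN layout are hypothetical and chosen only for illustration; the real contracts live in the project's protobuf definitions.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"errors"
	"fmt"
)

// ObjectStore is a hypothetical stand-in for the real object storage client.
type ObjectStore interface {
	Put(key string, data []byte) error
	Get(key string) ([]byte, error)
}

// memStore is an in-memory ObjectStore used only for this sketch.
type memStore map[string][]byte

func (m memStore) Put(key string, data []byte) error {
	m[key] = append([]byte(nil), data...) // copy so callers cannot mutate stored bytes
	return nil
}

func (m memStore) Get(key string) ([]byte, error) {
	b, ok := m[key]
	if !ok {
		return nil, errors.New("object not found")
	}
	return b, nil
}

// FrameRef is the small message that travels on the bus instead of the bytes.
type FrameRef struct {
	TenantID string
	URN      string // hypothetical layout: urn:aegis:frame:<tenant>:<sha256>
	Size     int
}

// CheckIn stores the frame bytes and returns the claim check for the bus.
func CheckIn(store ObjectStore, tenantID string, frame []byte) (FrameRef, error) {
	sum := sha256.Sum256(frame)
	urn := fmt.Sprintf("urn:aegis:frame:%s:%s", tenantID, hex.EncodeToString(sum[:]))
	if err := store.Put(urn, frame); err != nil {
		return FrameRef{}, err
	}
	return FrameRef{TenantID: tenantID, URN: urn, Size: len(frame)}, nil
}

// Redeem resolves a claim check back into bytes on the consumer side.
func Redeem(store ObjectStore, ref FrameRef) ([]byte, error) {
	return store.Get(ref.URN)
}

func main() {
	store := memStore{}
	ref, err := CheckIn(store, "t-demo", []byte("fake-jpeg-bytes"))
	if err != nil {
		panic(err)
	}
	frame, err := Redeem(store, ref)
	if err != nil {
		panic(err)
	}
	fmt.Println(ref.URN, len(frame))
}
```

Only `FrameRef` would be published on the bus in this sketch; the consumer pays the object-storage read only when it actually needs the pixels.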
The data plane is stateless beyond the operator buffer. Temporal never sees a frame. ADR-0001.
Hardware isolation bounds blast radius. No soft-share fallback. ADR-0003.
Tier-3 tools refuse in code without a resolved gate. ADR-0014·0017.
Knowledge-service returns cited snippets. No hallucinated stream-IDs. ADR-0020.
Signed bundle is a first-class CI artifact. Never a retrofit. ADR-0027.
Phases 0–7 complete. The code is end-to-end green; what remains is operational validation: a real SOC 2 audit, a real 10k-stream soak, and a real production cluster.
| Phase | Theme | Status |
|---|---|---|
| 0 | Foundations: proto contracts, pkg/platform, walking-skeleton spine | complete |
| 1 | Glass-to-event walking skeleton (5 services + NATS) | complete · p95 ≈ 2.7 ms local |
| 2 | GPU hot path: Triton + MIG + inference-router + canary plumbing | complete |
| 3 | Multi-tenant + edge + storage tier (Patroni / ClickHouse / Vault) | complete |
| 4 | Intelligence tier: LLM gateway + agent + RAG + bounded autonomy | complete |
| 5 | Adaptive autonomy: canary + shadow + drift + SLO + prefetch | complete |
| 6 | GA hardening: compliance evidence + air-gap + chaos + DR drills + release | complete |
| 7 | Production console: Next.js 14 + Tailwind UI exposing every public endpoint (33 routes) | complete |
Two planes — separated along the frequency axis, not the domain axis. Per-frame work runs in the data plane; per-event work runs in the control plane.
The hot path. Every pixel that becomes a tenant-visible event takes this route.
Every Infer call emits up to three bus events. The 5-test integration smoke asserts every subject has both a producer and a consumer in CI.
Three tiers, each chosen for the shape of the data — not "one DB for everything."
Each in its own Go module, each a single bounded context, each communicating over typed protobuf contracts. No service shares state in memory with another.
The agent has a fixed toolbox. Each tool carries a risk tier. Tier-3 tools refuse to run in code without a resolved gate. You cannot prompt your way past it.
Tier 0: auto-execute, read-only. Examples: query_knowledge, read_event_stream, describe_pipeline, list_models.
Tier 1: auto-execute, returns a recommendation only. Examples: summarise, compare_distributions, predict_dwell.
Tier 2: auto-execute, returns a proposal with no side effect. Examples: propose_retrain, propose_canary_plan, propose_retention_change.
Tier 3: refused in code without a resolved gate; routes via policy-gate-service. Examples: promote_model, override_retention, force_failover, delete_dataset.
if tool.Tier == Tier3 && req.GateID == "" { return ErrTier3NeedsGate }
It cannot be prompted away. Even autonomy-orchestrator goes through the same agent-service runtime (ADR-0022), so every constraint binds.
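The refusal can be sketched as a small, self-contained Go program. `Tier3`, `ErrTier3NeedsGate`, `tool.Tier`, and `req.GateID` come from the conditional quoted above; the `Tool` and `Request` struct shapes, the `Tier0` to `Tier2` constants, and the `Authorize` function name are assumptions made for illustration. A real implementation would presumably also confirm with policy-gate-service that the gate is resolved, which this sketch does not model.

```go
package main

import (
	"errors"
	"fmt"
)

// Tier is the risk tier attached to every tool in the agent's toolbox.
type Tier int

const (
	Tier0 Tier = iota // read-only (name assumed)
	Tier1             // advisory (name assumed)
	Tier2             // proposal, no side effect (name assumed)
	Tier3             // consequential, gated
)

// ErrTier3NeedsGate is returned when a tier-3 tool is called without a gate.
var ErrTier3NeedsGate = errors.New("tier-3 tool requires a resolved gate")

// Tool and Request are hypothetical shapes for this sketch.
type Tool struct {
	Name string
	Tier Tier
}

type Request struct {
	Tool   string
	GateID string // empty when no gate has been attached
}

// Authorize runs before any tool executes. The model's prompt never reaches
// this function, so no prompt content can change its outcome.
func Authorize(tool Tool, req Request) error {
	if tool.Tier == Tier3 && req.GateID == "" {
		return ErrTier3NeedsGate
	}
	return nil
}

func main() {
	promote := Tool{Name: "promote_model", Tier: Tier3}
	list := Tool{Name: "list_models", Tier: Tier0}

	fmt.Println(Authorize(promote, Request{Tool: "promote_model"}))
	fmt.Println(Authorize(promote, Request{Tool: "promote_model", GateID: "g-123"}))
	fmt.Println(Authorize(list, Request{Tool: "list_models"}))
}
```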
Five self-improving loops. Each is statistically rigorous; promotion still routes through a human gate by design.
Wilson lower-bound proportion test + minimum sample floor. Automatic rollback; gated promotion. ADR-0023.
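A sketch of how a Wilson lower bound and a minimum sample floor might combine into a canary verdict. The Wilson score interval itself is standard; everything around it is an assumption: the `Decide` function, the verdict names, the use of a success proportion against a fixed `target`, and in particular the rollback rule (upper bound below target) are illustrative and not taken from ADR-0023.

```go
package main

import (
	"fmt"
	"math"
)

// WilsonInterval returns the Wilson score interval for a binomial proportion.
// z is the normal quantile (1.96 for a two-sided 95% interval).
func WilsonInterval(successes, n int, z float64) (lo, hi float64) {
	if n == 0 {
		return 0, 1 // no data: the interval is uninformative
	}
	nf := float64(n)
	p := float64(successes) / nf
	z2 := z * z
	center := p + z2/(2*nf)
	margin := z * math.Sqrt(p*(1-p)/nf+z2/(4*nf*nf))
	denom := 1 + z2/nf
	return (center - margin) / denom, (center + margin) / denom
}

// Verdict is a hypothetical outcome type for this sketch.
type Verdict string

const (
	Hold     Verdict = "hold"
	Rollback Verdict = "rollback"
	Eligible Verdict = "eligible-for-gated-promotion"
)

// Decide applies the sample floor first, then the Wilson bounds.
// Eligible only means a promotion gate may be opened; it never promotes.
func Decide(successes, n, minSamples int, target, z float64) Verdict {
	if n < minSamples {
		return Hold // too few samples to say anything
	}
	lo, hi := WilsonInterval(successes, n, z)
	switch {
	case hi < target:
		return Rollback // assumed rule: confidently below target
	case lo >= target:
		return Eligible // lower bound clears the target
	default:
		return Hold
	}
}

func main() {
	lo, hi := WilsonInterval(90, 100, 1.96)
	fmt.Printf("90/100: [%.4f, %.4f]\n", lo, hi)
	fmt.Println(Decide(90, 100, 50, 0.80, 1.96))
	fmt.Println(Decide(50, 100, 50, 0.80, 1.96))
	fmt.Println(Decide(9, 10, 50, 0.80, 1.96))
}
```

Using the lower bound rather than the raw proportion means a small lucky sample cannot clear the target, and the sample floor keeps the verdict at hold until enough traffic has been seen.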
Same-URN candidate-vs-baseline. Candidate never reaches the tenant; only the comparison metric. ADR-0024.
Sliding-window JS / KL / TVD divergence vs reference distribution. ADR-0025.
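The three divergences named above can be sketched over discrete histograms as follows. This is an illustration under stated assumptions: both histograms share the same bins, logarithms are natural (so JS is bounded by ln 2), and the function names are hypothetical. How the platform bins its signals or sizes its sliding window is not specified here.

```go
package main

import (
	"fmt"
	"math"
)

// Normalize turns histogram counts into a probability distribution.
func Normalize(counts []float64) []float64 {
	var sum float64
	for _, c := range counts {
		sum += c
	}
	out := make([]float64, len(counts))
	if sum == 0 {
		return out
	}
	for i, c := range counts {
		out[i] = c / sum
	}
	return out
}

// KL returns KL(p || q) in nats. It is +Inf when q has zero mass in a bin
// where p does not. p and q must have the same length.
func KL(p, q []float64) float64 {
	var d float64
	for i := range p {
		if p[i] == 0 {
			continue // 0 * log(0/q) is taken as 0
		}
		if q[i] == 0 {
			return math.Inf(1)
		}
		d += p[i] * math.Log(p[i]/q[i])
	}
	return d
}

// JS returns the Jensen-Shannon divergence: symmetric and always finite.
func JS(p, q []float64) float64 {
	m := make([]float64, len(p))
	for i := range p {
		m[i] = (p[i] + q[i]) / 2
	}
	return 0.5*KL(p, m) + 0.5*KL(q, m)
}

// TVD returns the total variation distance, in [0, 1].
func TVD(p, q []float64) float64 {
	var d float64
	for i := range p {
		d += math.Abs(p[i] - q[i])
	}
	return d / 2
}

func main() {
	reference := Normalize([]float64{50, 30, 20}) // e.g. class mix at training time
	window := Normalize([]float64{20, 30, 50})    // e.g. class mix in the current window
	fmt.Printf("JS=%.4f KL=%.4f TVD=%.4f\n",
		JS(window, reference), KL(window, reference), TVD(window, reference))
}
```

KL is unbounded and asymmetric, which is one reason a drift detector might track JS and TVD alongside it rather than relying on KL alone.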
SLO burn-rate alerting over multiple windows (1h fast, 6h slow). Page only if both windows breach. ADR-0025.
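A sketch of the both-windows-must-breach rule, assuming the usual burn-rate definition (observed error ratio divided by the error budget, 1 minus the SLO target). The `Window` type, the function names, and the threshold value used in `main` are hypothetical; the real thresholds are whatever ADR-0025 specifies.

```go
package main

import "fmt"

// Window holds request counts observed over one lookback window.
type Window struct {
	Errors float64
	Total  float64
}

// BurnRate is the observed error ratio divided by the error budget.
// A burn rate of 1 spends the budget exactly over the SLO period.
func BurnRate(w Window, sloTarget float64) float64 {
	budget := 1 - sloTarget
	if w.Total == 0 || budget <= 0 {
		return 0 // no traffic, or no budget defined: nothing to compute
	}
	return (w.Errors / w.Total) / budget
}

// ShouldPage fires only when both the fast and the slow window breach.
// The fast window gives a prompt signal; the slow window confirms it is sustained.
func ShouldPage(fast, slow Window, sloTarget, threshold float64) bool {
	return BurnRate(fast, sloTarget) > threshold &&
		BurnRate(slow, sloTarget) > threshold
}

func main() {
	const slo = 0.999      // 99.9% availability target (illustrative)
	const threshold = 10.0 // hypothetical burn-rate threshold

	spike := Window{Errors: 20, Total: 1000}      // 1h window, 2% errors
	calm := Window{Errors: 30, Total: 6000}       // 6h window, 0.5% errors
	sustained := Window{Errors: 120, Total: 6000} // 6h window, 2% errors

	fmt.Println(ShouldPage(spike, calm, slo, threshold))      // short blip only
	fmt.Println(ShouldPage(spike, sustained, slo, threshold)) // both windows breach
}
```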
7×24 hour-of-week EMA grid; warm-ups dispatched at horizon ahead of demand. ADR-0026.
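One way to picture the 7×24 grid is an exponential moving average per hour-of-week cell, read at the current time plus the horizon. This is only a sketch: the `Grid` type, its methods, the smoothing factor, the warm-up threshold, and the choice to bucket in UTC are all assumptions, not details from ADR-0026.

```go
package main

import (
	"fmt"
	"time"
)

// exampleMonday is an arbitrary Monday 00:00 UTC used by the demo below.
var exampleMonday = time.Date(2024, time.January, 1, 0, 0, 0, 0, time.UTC)

// Grid keeps one EMA of observed load per (weekday, hour) cell.
type Grid struct {
	Alpha float64 // smoothing factor in (0, 1]; higher reacts faster
	cells [7][24]float64
	seen  [7][24]bool
}

// slot maps a timestamp to its hour-of-week cell (UTC, by assumption).
func slot(t time.Time) (day, hour int) {
	u := t.UTC()
	return int(u.Weekday()), u.Hour()
}

// Observe folds a load sample into the cell for time t.
func (g *Grid) Observe(t time.Time, load float64) {
	d, h := slot(t)
	if !g.seen[d][h] {
		g.cells[d][h], g.seen[d][h] = load, true // first sample seeds the EMA
		return
	}
	g.cells[d][h] = g.Alpha*load + (1-g.Alpha)*g.cells[d][h]
}

// Forecast returns the smoothed load expected at now+horizon.
func (g *Grid) Forecast(now time.Time, horizon time.Duration) float64 {
	d, h := slot(now.Add(horizon))
	return g.cells[d][h]
}

// ShouldWarm reports whether a warm-up should be dispatched ahead of demand.
func (g *Grid) ShouldWarm(now time.Time, horizon time.Duration, threshold float64) bool {
	return g.Forecast(now, horizon) >= threshold
}

func main() {
	g := &Grid{Alpha: 0.5}
	mon9 := exampleMonday.Add(9 * time.Hour)

	// Two consecutive Mondays at 09:00 with rising load.
	g.Observe(mon9, 10)
	g.Observe(mon9.AddDate(0, 0, 7), 20)

	// Third Monday, 08:30: look 30 minutes ahead into the 09:00 cell.
	now := mon9.AddDate(0, 0, 14).Add(-30 * time.Minute)
	fmt.Println(g.Forecast(now, 30*time.Minute), g.ShouldWarm(now, 30*time.Minute, 12))
}
```

Keying on hour-of-week rather than hour-of-day lets weekday and weekend patterns stay separate, at the cost of needing at least a week of data before every cell has a seed.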
Cron + signal-driven agent sessions. No second runtime — opens regular agent-service sessions. ADR-0022.
Defense in depth, refuse unsafe defaults, fail closed. Every control here has a corresponding SOC 2 evidence record + pen-test scope entry.
Istio Ambient. Every pod-to-pod call is mTLS. No plaintext. Per-service AuthorizationPolicy ALLOW list with SPIFFE IDs.
Per-service Rego. Default-deny. Per-tenant model allow-lists; per-project member roles; per-resource ownership.
Every pod gets a SPIFFE ID. Mesh authorization keyed on these IDs.
Destroying a tenant's transit key renders all encrypted bytes (including backups) unreadable. Crypto-shredding. ADR-0014.
Every image keylessly signed via OIDC. Kyverno admission verifies signatures against the build workflow. Syft SPDX SBOM attached.
Even if the mesh is compromised, the pod network refuses connections.
The audit log is append-only: no UPDATE, no DELETE. It fails closed: if the audit log can't append, the upstream operation fails.
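The fail-closed behaviour can be sketched as a wrapper that refuses to run an operation unless its audit record was appended first. The `AuditLog` interface, the `memLog` test double, the `Audited` helper, and the append-before-execute ordering are assumptions for illustration; the source only states that a failed append fails the upstream operation.

```go
package main

import (
	"errors"
	"fmt"
)

// AuditLog exposes Append only: there is deliberately no update or delete.
type AuditLog interface {
	Append(entry string) error
}

// memLog is an in-memory AuditLog for this sketch; broken simulates an outage.
type memLog struct {
	entries []string
	broken  bool
}

func (l *memLog) Append(entry string) error {
	if l.broken {
		return errors.New("audit store unavailable")
	}
	l.entries = append(l.entries, entry)
	return nil
}

// Audited appends the audit entry and only then runs op.
// If the append fails, op never runs and the caller gets an error.
func Audited(log AuditLog, entry string, op func() error) error {
	if err := log.Append(entry); err != nil {
		return fmt.Errorf("audit append failed, operation refused: %w", err)
	}
	return op()
}

func main() {
	op := func() error {
		fmt.Println("retention changed")
		return nil
	}

	healthy := &memLog{}
	fmt.Println(Audited(healthy, "t-demo override_retention", op))

	down := &memLog{broken: true}
	fmt.Println(Audited(down, "t-demo override_retention", op))
}
```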
Sanitiser + PII redactor + refusal threshold + per-tenant rate limit + tier-3 gate. ADR-0021.
Eight checkpoints. Any one of them can refuse.
Local laptop → online Kubernetes → air-gapped DMZ → k3s edge. Same charts, same contracts, one bundle.
5 services + embedded NATS, no external deps. ~5 min from clone to first real event.
Postgres + ClickHouse + NATS + MinIO via docker-compose; all 35 services local.
ArgoCD ApplicationSet reconciles 38 charts. ~5–10 min first sync.
Signed bundle (~6–8 GiB tar.zst). Cosign verify on target. ~30 min install on a 6-node cluster.
Reduced operator set + outbox sync to core. Jetson AGX Orin friendly.
Design targets: 10,000 streams per cluster · 1,000,000 detections/s · 1,000 concurrent agents · p95 glass→event < 300 ms.
    # Terminal 1: event-service embeds NATS in dev mode.
    AEGIS_EMBED_NATS=true task run:event-service

    # Note the URL it logs. Export it for the rest:
    export AEGIS_NATS_URL=nats://127.0.0.1:NNNNN

    # Terminals 2–5
    task run:pipeline-service
    AEGIS_NATS_URL=$AEGIS_NATS_URL task run:stream-manager
    AEGIS_NATS_URL=$AEGIS_NATS_URL task run:dataplane-runner
    AEGIS_STREAM_MANAGER_ADDR=localhost:9092 \
      AEGIS_EVENT_SERVICE_URL=http://localhost:8090 \
      AEGIS_CONSOLE_DIR=$(pwd)/services/api-gateway/console \
      task run:api-gateway

    # Subscribe to the SSE feed, then create a stream:
    curl -N -H 'X-Tenant-Id: t-demo' \
      'http://localhost:8080/v1/events:stream?stream_id=stream-dock-1'

    curl -X POST -H 'X-Tenant-Id: t-demo' -H 'Idempotency-Key: 1' \
      -H 'Content-Type: application/json' \
      -d '{"name":"dock-1","project_id":"p1","protocol":"file", "url":"file:///x","pipeline_id":"p-walking"}' \
      http://localhost:8080/v1/streams
Within ~5 s, an event with kind: KIND_DWELL arrives on the SSE feed. End-to-end. Real services.
The load-bearing decisions. Violating any of these is an ADR-tracked change, not a feature PR.
| ADR | Decision |
|---|---|
| 0001 | Control plane / data plane are hard-separated. Temporal never sees a frame. |
| 0002 | ClickHouse holds the detection firehose. PostgreSQL holds metadata. |
| 0003 | MIG is the default GPU sharing mode for production inference. |
| 0007 | Protobuf-everywhere, Buf-managed. /proto has strictest CODEOWNERS. |
| 0008 | Frames and media never travel on Kafka or NATS. Claim-check only. |
| 0014 | Bounded autonomy. Agents auto-execute read/advisory + reversible ops; consequential = gate. |
| 0016 | Walking skeleton first — thin but complete on the real architecture. |
| 0017 | 4-tier risk model encoded in tool schemas; refusal-in-code for tier-3. |
| 0018 | One LLM/VLM gateway. OpenAI-compatible. Backend-swappable. |
| 0019 | Active learning samples by uncertainty + diversity, never random. |
| 0020 | Platform-fact answers are retrieval-augmented; no hallucinated identifiers. |
| 0021 | Prompt-injection defense in depth — sanitizer + PII + threshold + rate limit + gate. |
| 0022 | Continuous autonomy uses regular agent-service sessions. No second runtime. |
| 0023 | Canary uses Wilson lower-bound + min-sample floor. Promotion gated; rollback automatic. |
| 0024 | Shadow inference compares candidate vs baseline on same frame URN. |
| 0025 | Drift = JS/KL/TVD vs reference. SLO = multi-window burn-rate. |
| 0026 | Predictive prefetch via 7×24 EMA grid. |
| 0027 | Air-gapped bundle as day-one CI artifact. |
| 0028 | Chaos engineering as production-readiness gate. |
| 0029 | compliance-evidence-service owns no data; composes evidence on demand. |
| 0030 | release-please + signed air-gap bundle promotion; single platform version. |