Comprehensive guide for deploying, monitoring, and maintaining DocuThinker's enterprise-grade infrastructure.
DocuThinker runs on a modern, cloud-native architecture built from 16 production-ready DevOps components.
graph TB
subgraph "AWS Cloud - Production Infrastructure"
subgraph "Edge Layer"
CF[CloudFront CDN]
WAF[AWS WAF<br/>DDoS Protection]
CERTMGR[cert-manager<br/>Auto TLS]
end
subgraph "Service Mesh - Istio"
ISTIO_IG[Istio Ingress Gateway<br/>3 Replicas + mTLS]
ISTIO_EG[Istio Egress Gateway<br/>Controlled External Access]
ISTIOD[Istiod Control Plane<br/>HA - 3 Replicas]
end
subgraph "Policy & Security Layer"
OPA[OPA Gatekeeper<br/>10 Policies + 8 Mutations]
FALCO[Falco<br/>Runtime Threat Detection]
NETPOL[Network Policies]
end
subgraph "VPC - Multi-AZ (3 Zones)"
subgraph "Public Subnets"
NAT1[NAT Gateway AZ-1]
NAT2[NAT Gateway AZ-2]
NAT3[NAT Gateway AZ-3]
end
subgraph "Private Subnets - EKS Cluster"
subgraph "Application Pods - With Envoy Sidecars"
FE1[Frontend 1 + Envoy]
FE2[Frontend 2 + Envoy]
FE3[Frontend 3 + Envoy]
BE1[Backend 1 + Envoy]
BE2[Backend 2 + Envoy]
BE3[Backend 3 + Envoy]
end
subgraph "Progressive Delivery"
FLAGGER[Flagger<br/>Automated Canary]
CANARY[Canary Deployment<br/>10% Traffic]
end
subgraph "Observability Stack"
OTEL[OpenTelemetry<br/>3 Replicas]
PROM[Prometheus<br/>SLO/SLI Monitoring]
GRAF[Grafana<br/>Dashboards]
JAEGER[Jaeger<br/>Distributed Tracing]
ELK[ELK Stack<br/>Logs]
end
subgraph "Reliability Engineering"
LITMUS[Litmus Chaos<br/>4 Experiments]
VELERO[Velero<br/>Daily + Hourly Backups]
end
subgraph "Autoscaling"
KEDA[KEDA<br/>Event-Driven HPA]
HPA[Traditional HPA<br/>CPU/Memory]
end
subgraph "Data Layer"
RDS[(PostgreSQL RDS<br/>Multi-AZ + Flyway)]
REDIS[(ElastiCache Redis<br/>Cluster Mode)]
end
end
end
subgraph "Security & Secrets"
VAULT[HashiCorp Vault<br/>HA]
SM[AWS Secrets Manager]
ESO[External Secrets Operator]
end
subgraph "Storage"
S3[S3 Buckets<br/>Versioning + Lifecycle]
end
subgraph "Testing"
K6[K6 Load Tests<br/>6 Scenarios]
TERRATEST[Terratest<br/>Infrastructure Validation]
end
end
Users -->|HTTPS| CF
CF --> WAF
WAF --> CERTMGR
CERTMGR --> ISTIO_IG
ISTIOD -.->|Config + Certs| ISTIO_IG
ISTIO_IG -.->|Policy Check| OPA
ISTIO_IG --> FE1 & FE2 & FE3
FE1 & FE2 & FE3 -->|mTLS| BE1 & BE2 & BE3
FLAGGER -.->|Manage| CANARY
CANARY -.->|10% Traffic| BE3
BE1 & BE2 & BE3 --> RDS
BE1 & BE2 & BE3 --> REDIS
BE1 & BE2 & BE3 --> S3
BE1 -.->|Traces| OTEL
OTEL --> JAEGER
OTEL --> PROM
PROM --> GRAF
BE1 -.->|Logs| ELK
FALCO -.->|Monitor| BE1 & BE2 & BE3
LITMUS -.->|Test| BE1 & BE2
VELERO -.->|Backup| RDS
KEDA -.->|Scale| BE1 & BE2 & BE3
HPA -.->|Scale| FE1 & FE2 & FE3
VAULT --> ESO
SM --> ESO
ESO -.->|Sync| BE1 & BE2 & BE3
K6 -.->|Test| ISTIO_IG
TERRATEST -.->|Validate| RDS & S3
style ISTIO_IG fill:#FF6B6B,color:#fff
style OPA fill:#4ECDC4,color:#fff
style OTEL fill:#F38181,color:#fff
style LITMUS fill:#AA96DA,color:#fff
style FLAGGER fill:#95E1D3
style KEDA fill:#FCBAD3
style VELERO fill:#FFD93D
Install required tools:
# AWS CLI
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
# kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
# Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Terraform
brew install terraform
# istioctl
curl -L https://istio.io/downloadIstio | sh -
# k6
brew install k6
# velero
brew install velero
aws configure
cd terraform
terraform init
terraform plan
terraform apply
aws eks update-kubeconfig --name docuthinker-eks-prod --region us-east-1
# cert-manager (TLS automation)
cd tls/cert-manager
./install-cert-manager.sh v1.13.0 admin@docuthinker.example.com
# OPA Gatekeeper (Policy enforcement)
cd ../../policy-as-code/opa
./install-opa.sh 3.14.0
# Istio Service Mesh (Traffic management + mTLS)
cd ../../service-mesh/istio
./install-istio.sh docuthinker-prod production
# OpenTelemetry
helm install otel-collector open-telemetry/opentelemetry-collector \
-n monitoring -f observability/opentelemetry/values.yaml
# Prometheus + SLO/SLI
kubectl apply -f monitoring/slo-sli/prometheus-rules.yaml
# Litmus Chaos Engineering
cd chaos-engineering/litmus
./install-litmus.sh 3.0.0
# Velero Backup & DR
cd ../../backup-dr/velero
./install-velero.sh v1.12.0 us-east-1 docuthinker-velero-backups
# Flagger Progressive Delivery
helm install flagger flagger/flagger \
-n istio-system -f progressive-delivery/flagger/values.yaml
# KEDA Event-Driven Autoscaling
helm install keda kedacore/keda \
-n keda --create-namespace -f autoscaling/keda/values.yaml
# Falco Runtime Security
helm install falco falcosecurity/falco \
-n falco --create-namespace -f security/falco/values.yaml
Istio provides:
graph LR
subgraph "Istio Components"
ISTIOD[Istiod<br/>Control Plane<br/>3 Replicas]
subgraph "Gateways"
IG[Ingress Gateway<br/>3 Replicas<br/>LoadBalancer]
EG[Egress Gateway<br/>2 Replicas<br/>ClusterIP]
end
subgraph "Sidecars"
ENVOY1[Envoy Proxy 1]
ENVOY2[Envoy Proxy 2]
ENVOY3[Envoy Proxy 3]
end
subgraph "Config"
VS[Virtual Services<br/>Routing Rules]
DR[Destination Rules<br/>Circuit Breaking]
PA[Peer Authentication<br/>Strict mTLS]
AP[Authorization Policies<br/>RBAC]
end
subgraph "Observability"
KIALI[Kiali UI]
JAEGER_UI[Jaeger UI]
end
end
INTERNET[Internet] --> IG
IG --> ENVOY1
ENVOY1 <-->|mTLS| ENVOY2
ENVOY2 <-->|mTLS| ENVOY3
ENVOY3 --> EG
EG --> EXTERNAL[External APIs]
ISTIOD -.->|Config| IG & EG & ENVOY1 & ENVOY2 & ENVOY3
VS & DR & PA & AP -.->|Apply| ENVOY1 & ENVOY2
ENVOY1 -.->|Metrics| KIALI
ENVOY1 -.->|Traces| JAEGER_UI
style ISTIOD fill:#FF6B6B,color:#fff
style IG fill:#4ECDC4,color:#fff
cd service-mesh/istio
./install-istio.sh docuthinker-prod production
1. Canary Deployment (10% traffic split):
# Configured in traffic-management/virtual-services.yaml
route:
  - destination:
      host: backend
      subset: stable
    weight: 90
  - destination:
      host: backend
      subset: canary
    weight: 10
2. Circuit Breaking:
# Configured in traffic-management/destination-rules.yaml
outlierDetection:
  consecutive5xxErrors: 3
  interval: 10s
  baseEjectionTime: 60s
  maxEjectionPercent: 30
3. Retry Logic:
retries:
  attempts: 3
  perTryTimeout: 2s
  retryOn: 5xx,reset,connect-failure
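4. Strict mTLS & Authorization: the mesh diagram also shows Peer Authentication (strict mTLS) and Authorization Policies (RBAC). A minimal sketch of those resources is shown below; the namespace, workload labels, and service-account principal are assumptions, not values taken from the repository:
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: docuthinker-prod          # assumed namespace
spec:
  mtls:
    mode: STRICT                       # reject any plaintext traffic inside the mesh
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: backend-allow-frontend
  namespace: docuthinker-prod          # assumed namespace
spec:
  selector:
    matchLabels:
      app: backend                     # assumed workload label
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - cluster.local/ns/docuthinker-prod/sa/frontend   # assumed frontend service account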
# Kiali (Service Mesh Visualization)
kubectl port-forward svc/kiali -n istio-system 20001:20001
# Open: http://localhost:20001
# Jaeger (Distributed Tracing)
kubectl port-forward svc/jaeger-query -n istio-system 16686:16686
# Open: http://localhost:16686
OPA Gatekeeper enforces:
sequenceDiagram
participant Dev
participant K8s API
participant OPA
participant Policies
participant Pod
Dev->>K8s API: kubectl apply deployment.yaml
K8s API->>OPA: Admission Request
OPA->>Policies: Evaluate Constraints
alt Violations Found
Policies->>OPA: Deny: Missing labels, no resource limits
OPA->>K8s API: Admission Denied
K8s API->>Dev: Error: Policy violations
else Policies Satisfied
Policies->>OPA: All policies met
OPA->>OPA: Apply Mutations (add defaults)
OPA->>K8s API: Admission Allowed (modified)
K8s API->>Pod: Create Pod
Pod->>Dev: Success
end
Note over OPA: Continuous Audit (every hour)
OPA->>Policies: Scan existing resources
Policies-->>OPA: Report violations
cd policy-as-code/opa
./install-opa.sh 3.14.0
Enforced policies include blocking :latest image tags.
# List all constraints
kubectl get constraints
# View violations
kubectl get k8srequiredlabels pod-must-have-labels -o yaml
# Test deployment
kubectl apply --dry-run=server -f deployment.yaml
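For reference, the pod-must-have-labels constraint queried above typically takes the following shape (the scoped namespace and the required label names are assumptions):
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata:
  name: pod-must-have-labels
spec:
  match:
    kinds:
      - apiGroups: [""]
        kinds: ["Pod"]
    namespaces: ["docuthinker-prod"]        # assumed scope
  parameters:
    labels: ["app", "team", "environment"]  # assumed required labels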
graph LR
subgraph "11-Stage Pipeline"
GIT[Git Push]
PRE[Pre-Check<br/>Lint + Audit]
BUILD[Build<br/>FE + BE + AI]
TEST[Test<br/>Unit + Coverage]
SECURITY[Security<br/>Trivy + SonarQube]
PACKAGE[Package<br/>Docker Build]
DEPLOY_DEV[Deploy Dev<br/>Auto]
DEPLOY_STG[Deploy Staging<br/>Manual]
CANARY[Deploy Canary<br/>Flagger]
PROMOTE[Promote Prod<br/>Manual Approval]
POST[Post-Deploy<br/>Smoke + Perf Tests]
CLEANUP[Cleanup<br/>Old Images]
end
GIT --> PRE
PRE --> BUILD
BUILD --> TEST
TEST --> SECURITY
SECURITY --> PACKAGE
PACKAGE --> DEPLOY_DEV
DEPLOY_DEV --> DEPLOY_STG
DEPLOY_STG --> CANARY
CANARY -.->|Metrics OK| PROMOTE
CANARY -.->|Metrics Fail| ROLLBACK[Auto Rollback]
PROMOTE --> POST
POST --> CLEANUP
# .gitlab-ci.yml (11 stages)
stages:
- pre-check
- build
- test
- security
- package
- deploy-dev
- deploy-staging
- deploy-canary
- deploy-production
- post-deploy
- cleanup
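As an illustration of how one stage is wired, a job in the security stage might look roughly like the sketch below; the scanner image, variables, and rules are assumptions rather than an excerpt from the actual pipeline:
security:trivy:
  stage: security
  image: aquasec/trivy:latest            # assumed scanner image
  script:
    # fail the pipeline on HIGH/CRITICAL findings in the freshly built image
    - trivy image --exit-code 1 --severity HIGH,CRITICAL "$CI_REGISTRY_IMAGE/backend:$CI_COMMIT_SHA"
  rules:
    - if: '$CI_PIPELINE_SOURCE == "merge_request_event" || $CI_COMMIT_BRANCH == "main"'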
flowchart TD
START([Start]) --> ENV{Environment?}
ENV -->|Dev| DEV_DEPLOY[Auto-Deploy to Dev]
ENV -->|Staging| STG_APPROVAL{Manual Approval?}
ENV -->|Production| PROD_PREP[Prepare Prod Canary]
STG_APPROVAL -->|Yes| STG_DEPLOY[Deploy to Staging]
STG_APPROVAL -->|No| CANCEL([Cancelled])
DEV_DEPLOY --> SMOKE_DEV[Smoke Tests]
SMOKE_DEV --> SUCCESS_DEV([Dev Deployed])
STG_DEPLOY --> SMOKE_STG[Smoke Tests]
SMOKE_STG --> SUCCESS_STG([Staging Deployed])
PROD_PREP --> FLAGGER[Flagger Canary Analysis]
FLAGGER --> CANARY_INIT[Initialize 0% Traffic]
CANARY_INIT --> RAMP[Progressive Ramp 10→50%]
RAMP --> ANALYSIS{Metrics OK?}
ANALYSIS -->|Success Rate >99%<br/>Latency <500ms| PROMOTE[Promote to 100%]
ANALYSIS -->|Metrics Fail| AUTO_RB[Automatic Rollback]
PROMOTE --> FINAL_SMOKE[Final Smoke Tests]
AUTO_RB --> ALERT[Send Alert]
FINAL_SMOKE --> SUCCESS_PROD([Production Deployed])
ALERT --> FAIL([Deployment Failed])
style PROMOTE fill:#6BCB77,color:#fff
style AUTO_RB fill:#FF6B6B,color:#fff
style SUCCESS_PROD fill:#6BCB77,color:#fff
# Deploy via script
./scripts/deploy/deploy.sh [dev|staging|production]
# Deploy via Helm
helm upgrade --install docuthinker ./helm/docuthinker \
-f ./helm/docuthinker/values-prod.yaml \
-n docuthinker-prod
# Rollback
./scripts/deploy/rollback.sh production 3
Flagger automates canary deployments with metric-based promotion/rollback.
Features:
graph TB
START[New Deployment] --> INIT[Flagger Detects Change]
INIT --> CREATE[Create Canary Deployment]
CREATE --> ROUTE_0[Route 0% Traffic to Canary]
ROUTE_0 --> RAMP_10[Ramp to 10%]
RAMP_10 --> ANALYZE_10{Analyze Metrics<br/>1 minute}
ANALYZE_10 -->|Pass| RAMP_20[Ramp to 20%]
ANALYZE_10 -->|Fail| ROLLBACK[Automatic Rollback]
RAMP_20 --> ANALYZE_20{Analyze Metrics}
ANALYZE_20 -->|Pass| RAMP_50[Ramp to 50%]
ANALYZE_20 -->|Fail| ROLLBACK
RAMP_50 --> ANALYZE_50{Analyze Metrics}
ANALYZE_50 -->|Pass| PROMOTE[Promote to 100%]
ANALYZE_50 -->|Fail| ROLLBACK
PROMOTE --> CLEANUP[Delete Canary]
CLEANUP --> SUCCESS([Deployment Complete])
ROLLBACK --> ALERT[Send Slack Alert]
ALERT --> FAIL([Deployment Failed])
style PROMOTE fill:#6BCB77,color:#fff
style ROLLBACK fill:#FF6B6B,color:#fff
style SUCCESS fill:#6BCB77,color:#fff
helm install flagger flagger/flagger \
-n istio-system \
-f progressive-delivery/flagger/values.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: backend
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: backend
  service:
    port: 8080
  analysis:
    interval: 1m
    threshold: 5
    maxWeight: 50
    stepWeight: 10
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99
      - name: request-duration
        thresholdRange:
          max: 500
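If synthetic traffic is needed while the canary is analyzed, Flagger's optional load-tester webhook can be appended under analysis. A sketch, assuming the flagger-loadtester chart is installed (the service URL and command are assumptions):
    webhooks:
      - name: load-test
        type: rollout
        url: http://flagger-loadtester.istio-system/    # assumed load-tester service address
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 http://backend-canary.docuthinker-prod:8080/"   # assumed target URL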
graph TB
subgraph "Applications"
APP[Applications<br/>Frontend + Backend]
end
subgraph "Collection"
OTEL[OpenTelemetry Collector<br/>3 Replicas HA]
PROM_EXP[Prometheus Exporters]
FILEBEAT[Filebeat]
end
subgraph "Storage & Processing"
subgraph "Traces"
JAEGER[Jaeger]
ES_TRACE[(Elasticsearch)]
end
subgraph "Metrics"
PROM[Prometheus]
SLO_CALC[SLO/SLI Calculator]
ERROR_BUDGET[Error Budget]
end
subgraph "Logs"
LOGSTASH[Logstash]
ES_LOG[(Elasticsearch)]
end
subgraph "Coralogix SaaS"
CX_LOGS[Coralogix Logs]
CX_METRICS[Coralogix Metrics]
CX_TRACES[Coralogix Traces]
CX_TCO[TCO Optimizer]
end
end
subgraph "Visualization"
GRAF[Grafana<br/>Unified Dashboards]
KIBANA[Kibana<br/>Log Analysis]
KIALI[Kiali<br/>Service Mesh]
CX_DASH[Coralogix<br/>Dashboards]
end
subgraph "Alerting"
ALERT[AlertManager]
CX_ALERT[Coralogix Alerts]
SLACK[Slack]
PD[PagerDuty]
end
APP -->|OTLP| OTEL
APP -->|Metrics| PROM_EXP
APP -->|Logs| FILEBEAT
OTEL --> JAEGER
OTEL -->|OTLP/gRPC| CX_TRACES
OTEL -->|OTLP/gRPC| CX_METRICS
OTEL -->|OTLP/gRPC| CX_LOGS
JAEGER --> ES_TRACE
PROM_EXP --> PROM
PROM --> SLO_CALC
PROM -->|Remote Write| CX_METRICS
SLO_CALC --> ERROR_BUDGET
FILEBEAT --> LOGSTASH
LOGSTASH --> ES_LOG
CX_LOGS --> CX_TCO
CX_METRICS --> CX_DASH
CX_TRACES --> CX_DASH
PROM --> GRAF
JAEGER --> GRAF
CX_METRICS --> GRAF
ES_LOG --> KIBANA
PROM --> KIALI
PROM -.->|Alerts| ALERT
CX_LOGS -.->|Alerts| CX_ALERT
ALERT --> SLACK
ALERT -->|Critical| PD
CX_ALERT --> SLACK
CX_ALERT -->|Critical| PD
style OTEL fill:#F38181,color:#fff
style PROM fill:#E85D04,color:#fff
style GRAF fill:#F48C06,color:#fff
style SLO_CALC fill:#95E1D3
style CX_LOGS fill:#6C63FF,color:#fff
style CX_METRICS fill:#6C63FF,color:#fff
style CX_TRACES fill:#6C63FF,color:#fff
style CX_TCO fill:#6C63FF,color:#fff
style CX_DASH fill:#6C63FF,color:#fff
style CX_ALERT fill:#6C63FF,color:#fff
Service Level Objectives:
Prometheus Recording Rules:
# Availability SLI
sli:availability:ratio_rate30d >= 0.999
# Latency SLI
sli:latency:p99_5m <= 0.5
# Error Budget
slo:error_budget:remaining
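A recording rule behind the availability SLI might be defined roughly as follows; the underlying metric names and the error-budget formula are assumptions, while the actual rules live in monitoring/slo-sli/prometheus-rules.yaml:
groups:
  - name: slo-sli.rules
    rules:
      - record: sli:availability:ratio_rate30d
        expr: |
          sum(rate(http_requests_total{job="backend",code!~"5.."}[30d]))
          /
          sum(rate(http_requests_total{job="backend"}[30d]))
      - record: slo:error_budget:remaining
        # fraction of the monthly error budget still unspent, against the 99.9% target
        expr: 1 - ((1 - sli:availability:ratio_rate30d) / (1 - 0.999))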
Alerts:
# Grafana (Metrics + SLO/SLI)
kubectl port-forward svc/grafana -n monitoring 3000:80
# Open: http://localhost:3000
# Prometheus (Raw Metrics)
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Open: http://localhost:9090
# Kibana (Logs)
kubectl port-forward svc/kibana -n monitoring 5601:5601
# Open: http://localhost:5601
# Kiali (Service Mesh)
kubectl port-forward svc/kiali -n istio-system 20001:20001
# Open: http://localhost:20001
Coralogix provides a SaaS-based unified observability backend for logs, metrics, and traces with cost optimization via TCO policies.
Integration Points:
| Component | Path | Purpose |
|---|---|---|
| OTel Collector | observability/opentelemetry/values.yaml | Traces + Metrics + Logs via OTLP/gRPC |
| Fluent Bit DaemonSet | monitoring/coralogix/fluent-bit-values.yaml | Node-level K8s log collection |
| K8s Integration | monitoring/coralogix/values.yaml | OTel Agent + Cluster Collector |
| Prometheus Remote Write | monitoring/prometheus/values.yaml | Metrics forwarding to Coralogix |
| Grafana Datasource | monitoring/prometheus/values.yaml | Query Coralogix from Grafana |
| Terraform IaC | terraform/modules/coralogix/ | Alerts, TCO, recording rules |
| External Secrets | secrets/external-secrets/secret-store.yaml | API key management via Vault |
| Network Policies | monitoring/coralogix/network-policy.yaml | Egress rules for Coralogix |
TCO Cost Optimization Tiers:
Deployment:
# Add Coralogix Helm repo
helm repo add coralogix https://cgx.jfrog.io/artifactory/coralogix-charts-virtual
helm repo update
# Deploy K8s integration (OTel Agent + Cluster Collector)
helm upgrade --install coralogix-integration coralogix/coralogix-integration \
--namespace monitoring -f monitoring/coralogix/values.yaml
# Deploy Fluent Bit log shipper
helm upgrade --install fluent-bit fluent/fluent-bit \
--namespace monitoring -f monitoring/coralogix/fluent-bit-values.yaml
# Apply network policies and alert definitions
kubectl apply -f monitoring/coralogix/network-policy.yaml
kubectl apply -f monitoring/coralogix/alerts.yaml
kubectl apply -f monitoring/coralogix/recording-rules.yaml
# Terraform: provision alerts, TCO policies, recording rules
cd terraform && terraform apply -target=module.coralogix
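Inside observability/opentelemetry/values.yaml, the Coralogix export is wired through a collector exporter roughly like the sketch below; the domain, secret reference, and application/subsystem names are placeholders, not the production values:
config:
  exporters:
    coralogix:
      domain: "coralogix.com"                    # placeholder: region-specific Coralogix domain
      private_key: "${CORALOGIX_PRIVATE_KEY}"    # assumed to be injected via External Secrets
      application_name: "docuthinker"
      subsystem_name: "backend"
  service:
    pipelines:
      traces:
        exporters: [coralogix]
      metrics:
        exporters: [coralogix]
      logs:
        exporters: [coralogix]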
12 Production Alerts: High error rate, P95 latency, pod crashlooping, memory/CPU usage, DB pool exhaustion, API endpoint down, error budget burn, node health, disk space, SLO violations, Redis memory.
Litmus validates system resilience through controlled chaos experiments.
Available Experiments:
cd chaos-engineering/litmus
./install-litmus.sh 3.0.0
# Pod deletion test
kubectl apply -f chaos-engineering/litmus/experiments/pod-delete-experiment.yaml
# Network latency test
kubectl apply -f chaos-engineering/litmus/experiments/network-latency-experiment.yaml
# Resource stress test
kubectl apply -f chaos-engineering/litmus/experiments/resource-stress-experiment.yaml
# Comprehensive workflow (all experiments sequentially)
kubectl apply -f chaos-engineering/litmus/workflows/comprehensive-chaos-workflow.yaml
# Watch chaos engine
kubectl get chaosengine -n docuthinker-prod -w
# View results
kubectl describe chaosresult backend-pod-delete -n docuthinker-prod
# Access ChaosCenter UI
kubectl port-forward svc/chaos-litmus-frontend-service -n litmus 9091:9091
# Open: http://localhost:9091
Velero provides automated backup and disaster recovery:
cd backup-dr/velero
./install-velero.sh v1.12.0 us-east-1 docuthinker-velero-backups
# Create manual backup
velero backup create prod-backup-$(date +%Y%m%d) \
--include-namespaces docuthinker-prod
# List backups
velero backup get
# Describe backup
velero backup describe prod-backup-20250127
# View backup logs
velero backup logs prod-backup-20250127
# Restore from backup
velero restore create --from-backup prod-backup-20250127
# Restore specific namespace
velero restore create --from-backup prod-backup-20250127 \
--include-namespaces docuthinker-prod
# Monitor restore
velero restore get
velero restore describe <restore-name>
Automatically configured:
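For example, the daily backup referenced in the architecture diagram ("Daily + Hourly Backups") corresponds to a Velero Schedule resource along these lines (cron expression, retention TTL, and resource names are assumptions):
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: docuthinker-daily
  namespace: velero
spec:
  schedule: "0 2 * * *"            # assumed: 02:00 UTC daily
  template:
    includedNamespaces:
      - docuthinker-prod
    ttl: 168h0m0s                  # assumed retention: keep backups for 7 days
    snapshotVolumes: true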
KEDA provides event-driven autoscaling:
helm install keda kedacore/keda \
-n keda --create-namespace \
-f autoscaling/keda/values.yaml
1. SQS Queue Scaler (1-50 replicas):
triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.us-east-1.amazonaws.com/.../docuthinker-jobs
      queueLength: "5"
      awsRegion: "us-east-1"
2. HTTP Scaler (2-20 replicas):
triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus.monitoring.svc:9090   # assumed in-cluster Prometheus address
      query: sum(rate(http_requests_total{app="backend"}[1m]))
      threshold: "100"
3. Cron Scaler (business hours):
triggers:
  - type: cron
    metadata:
      timezone: America/New_York
      start: 0 8 * * 1-5    # 8 AM weekdays
      end: 0 18 * * 1-5     # 6 PM weekdays
      desiredReplicas: "10"
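Each trigger above is attached to a workload through a ScaledObject. A sketch of the queue scaler's wrapper, using the 1-50 replica bounds stated above (namespace and Deployment name are assumptions):
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: backend-queue-scaler
  namespace: docuthinker-prod        # assumed namespace
spec:
  scaleTargetRef:
    name: backend                    # assumed Deployment name
  minReplicaCount: 1
  maxReplicaCount: 50
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/.../docuthinker-jobs
        queueLength: "5"
        awsRegion: "us-east-1"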
kubectl apply -f autoscaling/keda/scalers/queue-scaler.yaml
Falco provides runtime threat detection:
helm install falco falcosecurity/falco \
-n falco --create-namespace \
-f security/falco/values.yaml
# View Falco logs
kubectl logs -l app=falco -n falco -f
# Check for alerts
kubectl logs -l app=falco -n falco | grep -i "warning\|critical"
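Custom detection rules can be layered on top of the defaults via the chart's customRules value in security/falco/values.yaml; a minimal sketch (rule name, condition, and image match are assumptions):
customRules:
  docuthinker-rules.yaml: |-
    - rule: Shell spawned in backend container
      desc: Detect an interactive shell starting inside a backend pod
      condition: >
        spawned_process and container
        and container.image.repository contains "docuthinker/backend"
        and proc.name in (bash, sh)
      output: "Shell in backend container (user=%user.name command=%proc.cmdline container=%container.name)"
      priority: WARNING
      tags: [container, shell]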
graph TB
subgraph "Secret Sources"
VAULT[HashiCorp Vault<br/>HA]
AWS_SM[AWS Secrets Manager]
end
subgraph "Kubernetes"
ESO[External Secrets Operator]
K8S_SECRET[Kubernetes Secrets]
end
subgraph "Applications"
POD[Application Pods]
end
VAULT -.->|Pull| ESO
AWS_SM -.->|Pull| ESO
ESO --> K8S_SECRET
K8S_SECRET -->|Mount| POD
style VAULT fill:#AA96DA,color:#fff
style ESO fill:#6BCB77,color:#fff
# Install Vault
helm install vault hashicorp/vault \
-n vault -f secrets/vault/vault-values.yaml
# Initialize Vault
./secrets/vault/init-vault.sh
# Access UI
kubectl port-forward svc/vault -n vault 8200:8200
# Open: http://localhost:8200
# Apply secret store
kubectl apply -f secrets/external-secrets/secret-store.yaml
# Secrets are automatically synced from Vault/AWS to K8s
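The sync itself is declared per secret with an ExternalSecret resource; a sketch, in which the store name, Vault path, and key names are assumptions:
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: backend-db-credentials
  namespace: docuthinker-prod            # assumed namespace
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: vault-backend                  # assumed SecretStore/ClusterSecretStore name
    kind: ClusterSecretStore
  target:
    name: backend-db-credentials         # resulting Kubernetes Secret
  data:
    - secretKey: DATABASE_PASSWORD
      remoteRef:
        key: secret/data/docuthinker/prod/db   # assumed Vault path
        property: password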
6 Test Scenarios:
# Basic load test
k6 run --vus 100 --duration 5m testing/load-tests/k6-advanced-scenarios.js
# With custom endpoint
BASE_URL=https://staging.docuthinker.com k6 run testing/load-tests/k6-advanced-scenarios.js
# All scenarios
k6 run testing/load-tests/k6-advanced-scenarios.js
Flyway provides version-controlled database migrations:
database/migrations/
├── flyway.conf              # Configuration
└── sql/
    ├── V1__initial_schema.sql
    ├── V2__add_api_keys.sql
    └── V3__add_audit_log.sql
# Via Flyway CLI
flyway -configFiles=database/migrations/flyway.conf migrate
# Via Docker
docker run --rm \
-v $(pwd)/database/migrations:/flyway/sql \
flyway/flyway migrate
# Rollback (if supported)
flyway -configFiles=database/migrations/flyway.conf undo
Validate Terraform infrastructure with automated tests.
Tests Included:
cd testing/infrastructure
# Run all tests
go test -v -timeout 30m
# Run specific test
go test -v -run TestTerraformDocuThinkerInfrastructure
# Parallel execution
go test -v -parallel 4
graph TB
L1[Layer 1: Network<br/>WAF + TLS + mTLS]
L2[Layer 2: Admission<br/>OPA Gatekeeper]
L3[Layer 3: Authentication<br/>Firebase + JWT + RBAC]
L4[Layer 4: Runtime<br/>Falco Monitoring]
L5[Layer 5: Secrets<br/>Vault + Secrets Manager]
L6[Layer 6: Code & Supply Chain<br/>SonarQube + Snyk + Trivy]
L7[Layer 7: Data<br/>Encryption at Rest/Transit]
L8[Layer 8: Audit<br/>Logs + Compliance]
L1 --> L2 --> L3 --> L4 --> L5 --> L6 --> L7 --> L8
style L1 fill:#FF6B6B,color:#fff
style L2 fill:#4ECDC4,color:#fff
style L4 fill:#F38181,color:#fff
style L5 fill:#AA96DA,color:#fff
style L6 fill:#4E9BCD,color:#fff
# Comprehensive scan (SonarQube + all Snyk scans)
./scripts/security/scan-all.sh
# Trivy image scanning
./scripts/security/trivy-scan.sh
# SonarQube analysis (multi-module)
sonar-scanner
# Snyk open-source dependency scan
snyk test --all-projects --severity-threshold=high
# Snyk container scan
snyk container test docuthinker/backend:latest --severity-threshold=high
# Snyk IaC scan
snyk iac test terraform/ kubernetes/ helm/ --severity-threshold=medium
# Snyk SAST code analysis
snyk code test --severity-threshold=high
# OPA policy violations
kubectl get constraints -o json | jq '.items[].status.violations'
# Falco alerts
kubectl logs -l app=falco -n falco | grep -i critical
SonarQube Enterprise 10.4 provides continuous code quality inspection across all services.
# Deploy SonarQube via Helm
helm repo add sonarqube https://SonarSource.github.io/helm-chart-sonarqube
helm install sonarqube sonarqube/sonarqube \
-f security/sonarqube/values.yaml \
-n security --create-namespace
# Import quality gate
curl -u admin:${SONAR_TOKEN} -X POST \
"${SONAR_URL}/api/qualitygates/create" \
-d "name=DocuThinker Production"
# Import quality profiles
curl -u admin:${SONAR_TOKEN} -X POST \
"${SONAR_URL}/api/qualityprofiles/restore" \
--form backup=@security/sonarqube/quality-profiles.json
The root sonar-project.properties defines a multi-module project:
| Module | Language | Coverage Tool |
|---|---|---|
| frontend | JavaScript/TypeScript | Jest + LCOV |
| backend | JavaScript | Jest + LCOV |
| orchestrator | JavaScript | Jest + LCOV |
| ai_ml | Python | pytest-cov + coverage.xml |
| Metric | Threshold | Rationale |
|---|---|---|
| New code coverage | ≥ 80% | Prevent coverage regression |
| Overall coverage | ≥ 70% | Maintain baseline |
| Duplicated lines | ≤ 3% | Reduce code duplication |
| Security rating | A | Zero vulnerabilities on new code |
| Reliability rating | A | Zero bugs on new code |
| Maintainability rating | A | Zero code smells on new code |
| Security hotspots reviewed | 100% | All hotspots triaged |
| New critical issues | 0 | Block critical findings |
| New blocker issues | 0 | Block blocker findings |
Snyk provides comprehensive vulnerability scanning across 4 dimensions.
graph LR
subgraph "Snyk Scanning Pipeline"
OSS[Open Source<br/>Dependency SCA]
CONTAINER[Container<br/>Image Scan]
IAC[Infrastructure<br/>as Code]
SAST[Snyk Code<br/>SAST Analysis]
end
CODE[Source Code] --> OSS
CODE --> SAST
DOCKER[Docker Images] --> CONTAINER
INFRA[Terraform/K8s/Helm] --> IAC
OSS --> REPORT[Security Report]
CONTAINER --> REPORT
IAC --> REPORT
SAST --> REPORT
style OSS fill:#4C4A73,color:#fff
style CONTAINER fill:#4C4A73,color:#fff
style IAC fill:#4C4A73,color:#fff
style SAST fill:#4C4A73,color:#fff
Snyk K8s Controller continuously monitors running workloads:
# Deploy Snyk K8s Controller
helm repo add snyk-charts https://snyk.github.io/kubernetes-monitor
helm install snyk-monitor snyk-charts/snyk-monitor \
-f security/snyk/values.yaml \
-n snyk-monitor --create-namespace
Monitored namespaces: docuthinker-prod, docuthinker-staging, monitoring, istio-system
| Rule | Threshold | Action |
|---|---|---|
| Critical vulnerabilities | 0 | Fail build |
| High vulnerabilities | ≤ 5 (prod), ≤ 10 (staging) | Fail build |
| Base image age | < 30 days | Warn |
| Banned licenses | GPL-3.0, AGPL-3.0, SSPL-1.0 | Fail build |
| Rule | Description |
|---|---|
| CUSTOM-001 | All pods must have resource limits |
| CUSTOM-002 | No containers running as root |
| CUSTOM-003 | All ingresses require TLS |
| CUSTOM-004 | No hardcoded secrets in manifests |
Run the comprehensive scan script locally:
# Run all scans (SonarQube + Snyk OSS/Container/IaC/SAST)
./scripts/security/scan-all.sh
# Reports generated in security-reports/ directory
ls security-reports/
# sonar-report.json snyk-oss-report.json snyk-container-report.json
# snyk-iac-report.json snyk-sast-report.json
1. Pod not starting:
kubectl describe pod <pod-name> -n docuthinker-prod
kubectl logs <pod-name> -n docuthinker-prod
kubectl logs <pod-name> -c istio-proxy -n docuthinker-prod
2. OPA blocking deployment:
# Check violations
kubectl get constraints
# Test deployment
kubectl apply --dry-run=server -f deployment.yaml
# View specific constraint
kubectl get k8srequiredlabels pod-must-have-labels -o yaml
3. Istio traffic issues:
# Check virtual services
kubectl get virtualservices -n docuthinker-prod
# Check destination rules
kubectl get destinationrules -n docuthinker-prod
# Analyze configuration
istioctl analyze -n docuthinker-prod
# View proxy logs
kubectl logs <pod-name> -c istio-proxy
4. Canary not promoting:
# Check Flagger status
kubectl describe canary backend -n docuthinker-prod
# View Flagger logs
kubectl logs -l app=flagger -n istio-system
# Check metrics
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Query: flagger_canary_status
5. High error rate:
# Check SLO/SLI metrics
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Query: sli:availability:ratio_rate5m
# View error budget
# Query: slo:error_budget:remaining
# Check application logs
kubectl logs -l app=backend -n docuthinker-prod | grep -i error
# View deployment history
helm history docuthinker -n docuthinker-prod
# Rollback to previous version
./scripts/deploy/rollback.sh production
# Rollback to specific revision
helm rollback docuthinker 3 -n docuthinker-prod
# Emergency rollback (bypass Flagger)
kubectl rollout undo deployment/backend -n docuthinker-prod
Daily:
Weekly:
Monthly:
# === Deployment ===
./scripts/deploy/deploy.sh [dev|staging|production]
./scripts/deploy/rollback.sh [environment] [revision]
# === Monitoring ===
kubectl port-forward svc/grafana -n monitoring 3000:80
kubectl port-forward svc/prometheus -n monitoring 9090:9090
kubectl port-forward svc/kiali -n istio-system 20001:20001
# === Chaos Engineering ===
kubectl apply -f chaos-engineering/litmus/experiments/pod-delete-experiment.yaml
kubectl get chaosresult -n docuthinker-prod
# === Backup & Restore ===
velero backup create prod-backup-$(date +%Y%m%d) --include-namespaces docuthinker-prod
velero restore create --from-backup prod-backup-20250127
# === Security ===
./scripts/security/trivy-scan.sh
kubectl get constraints
kubectl logs -l app=falco -n falco
# === Load Testing ===
k6 run --vus 100 --duration 5m testing/load-tests/k6-advanced-scenarios.js
# === Logs ===
kubectl logs -l app=backend -n docuthinker-prod -f
kubectl logs -l app=backend -c istio-proxy -n docuthinker-prod
For issues: