Comprehensive guide for deploying, monitoring, and maintaining DocuThinker's enterprise-grade infrastructure.
DocuThinker runs on a cloud-native AWS architecture made up of 15 production-ready DevOps components.
graph TB
subgraph "AWS Cloud - Production Infrastructure"
subgraph "Edge Layer"
CF[CloudFront CDN]
WAF[AWS WAF<br/>DDoS Protection]
CERTMGR[cert-manager<br/>Auto TLS]
end
subgraph "Service Mesh - Istio"
ISTIO_IG[Istio Ingress Gateway<br/>3 Replicas + mTLS]
ISTIO_EG[Istio Egress Gateway<br/>Controlled External Access]
ISTIOD[Istiod Control Plane<br/>HA - 3 Replicas]
end
subgraph "Policy & Security Layer"
OPA[OPA Gatekeeper<br/>10 Policies + 8 Mutations]
FALCO[Falco<br/>Runtime Threat Detection]
NETPOL[Network Policies]
end
subgraph "VPC - Multi-AZ (3 Zones)"
subgraph "Public Subnets"
NAT1[NAT Gateway AZ-1]
NAT2[NAT Gateway AZ-2]
NAT3[NAT Gateway AZ-3]
end
subgraph "Private Subnets - EKS Cluster"
subgraph "Application Pods - With Envoy Sidecars"
FE1[Frontend 1 + Envoy]
FE2[Frontend 2 + Envoy]
FE3[Frontend 3 + Envoy]
BE1[Backend 1 + Envoy]
BE2[Backend 2 + Envoy]
BE3[Backend 3 + Envoy]
end
subgraph "Progressive Delivery"
FLAGGER[Flagger<br/>Automated Canary]
CANARY[Canary Deployment<br/>10% Traffic]
end
subgraph "Observability Stack"
OTEL[OpenTelemetry<br/>3 Replicas]
PROM[Prometheus<br/>SLO/SLI Monitoring]
GRAF[Grafana<br/>Dashboards]
JAEGER[Jaeger<br/>Distributed Tracing]
ELK[ELK Stack<br/>Logs]
end
subgraph "Reliability Engineering"
LITMUS[Litmus Chaos<br/>4 Experiments]
VELERO[Velero<br/>Daily + Hourly Backups]
end
subgraph "Autoscaling"
KEDA[KEDA<br/>Event-Driven HPA]
HPA[Traditional HPA<br/>CPU/Memory]
end
subgraph "Data Layer"
RDS[(PostgreSQL RDS<br/>Multi-AZ + Flyway)]
REDIS[(ElastiCache Redis<br/>Cluster Mode)]
end
end
end
subgraph "Security & Secrets"
VAULT[HashiCorp Vault<br/>HA]
SM[AWS Secrets Manager]
ESO[External Secrets Operator]
end
subgraph "Storage"
S3[S3 Buckets<br/>Versioning + Lifecycle]
end
subgraph "Testing"
K6[K6 Load Tests<br/>6 Scenarios]
TERRATEST[Terratest<br/>Infrastructure Validation]
end
end
Users -->|HTTPS| CF
CF --> WAF
WAF --> CERTMGR
CERTMGR --> ISTIO_IG
ISTIOD -.->|Config + Certs| ISTIO_IG
ISTIO_IG -.->|Policy Check| OPA
ISTIO_IG --> FE1 & FE2 & FE3
FE1 & FE2 & FE3 -->|mTLS| BE1 & BE2 & BE3
FLAGGER -.->|Manage| CANARY
CANARY -.->|10% Traffic| BE3
BE1 & BE2 & BE3 --> RDS
BE1 & BE2 & BE3 --> REDIS
BE1 & BE2 & BE3 --> S3
BE1 -.->|Traces| OTEL
OTEL --> JAEGER
OTEL --> PROM
PROM --> GRAF
BE1 -.->|Logs| ELK
FALCO -.->|Monitor| BE1 & BE2 & BE3
LITMUS -.->|Test| BE1 & BE2
VELERO -.->|Backup| RDS
KEDA -.->|Scale| BE1 & BE2 & BE3
HPA -.->|Scale| FE1 & FE2 & FE3
VAULT --> ESO
SM --> ESO
ESO -.->|Sync| BE1 & BE2 & BE3
K6 -.->|Test| ISTIO_IG
TERRATEST -.->|Validate| RDS & S3
style ISTIO_IG fill:#FF6B6B,color:#fff
style OPA fill:#4ECDC4,color:#fff
style OTEL fill:#F38181,color:#fff
style LITMUS fill:#AA96DA,color:#fff
style FLAGGER fill:#95E1D3
style KEDA fill:#FCBAD3
style VELERO fill:#FFD93D
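Install the required tools (the commands below target macOS on Intel; on Apple Silicon, download the darwin/arm64 kubectl build instead):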
# AWS CLI
curl "https://awscli.amazonaws.com/AWSCLIV2.pkg" -o "AWSCLIV2.pkg"
sudo installer -pkg AWSCLIV2.pkg -target /
# kubectl
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/darwin/amd64/kubectl"
chmod +x kubectl
sudo mv kubectl /usr/local/bin/
# Helm
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
# Terraform
brew install terraform
# istioctl
curl -L https://istio.io/downloadIstio | sh -
# k6
brew install k6
# velero
brew install velero
aws configure
cd terraform
terraform init
terraform plan
terraform apply
aws eks update-kubeconfig --name docuthinker-eks-prod --region us-east-1
# cert-manager (TLS automation)
cd tls/cert-manager
./install-cert-manager.sh v1.13.0 admin@docuthinker.example.com
# OPA Gatekeeper (Policy enforcement)
cd ../../policy-as-code/opa
./install-opa.sh 3.14.0
# Istio Service Mesh (Traffic management + mTLS)
cd ../../service-mesh/istio
./install-istio.sh docuthinker-prod production
# OpenTelemetry
helm install otel-collector open-telemetry/opentelemetry-collector \
-n monitoring -f observability/opentelemetry/values.yaml
# Prometheus + SLO/SLI
kubectl apply -f monitoring/slo-sli/prometheus-rules.yaml
# Litmus Chaos Engineering
cd chaos-engineering/litmus
./install-litmus.sh 3.0.0
# Velero Backup & DR
cd ../../backup-dr/velero
./install-velero.sh v1.12.0 us-east-1 docuthinker-velero-backups
# Flagger Progressive Delivery
helm install flagger flagger/flagger \
-n istio-system -f progressive-delivery/flagger/values.yaml
# KEDA Event-Driven Autoscaling
helm install keda kedacore/keda \
-n keda --create-namespace -f autoscaling/keda/values.yaml
# Falco Runtime Security
helm install falco falcosecurity/falco \
-n falco --create-namespace -f security/falco/values.yaml
Istio provides:
graph LR
subgraph "Istio Components"
ISTIOD[Istiod<br/>Control Plane<br/>3 Replicas]
subgraph "Gateways"
IG[Ingress Gateway<br/>3 Replicas<br/>LoadBalancer]
EG[Egress Gateway<br/>2 Replicas<br/>ClusterIP]
end
subgraph "Sidecars"
ENVOY1[Envoy Proxy 1]
ENVOY2[Envoy Proxy 2]
ENVOY3[Envoy Proxy 3]
end
subgraph "Config"
VS[Virtual Services<br/>Routing Rules]
DR[Destination Rules<br/>Circuit Breaking]
PA[Peer Authentication<br/>Strict mTLS]
AP[Authorization Policies<br/>RBAC]
end
subgraph "Observability"
KIALI[Kiali UI]
JAEGER_UI[Jaeger UI]
end
end
INTERNET[Internet] --> IG
IG --> ENVOY1
ENVOY1 <-->|mTLS| ENVOY2
ENVOY2 <-->|mTLS| ENVOY3
ENVOY3 --> EG
EG --> EXTERNAL[External APIs]
ISTIOD -.->|Config| IG & EG & ENVOY1 & ENVOY2 & ENVOY3
VS & DR & PA & AP -.->|Apply| ENVOY1 & ENVOY2
ENVOY1 -.->|Metrics| KIALI
ENVOY1 -.->|Traces| JAEGER_UI
style ISTIOD fill:#FF6B6B,color:#fff
style IG fill:#4ECDC4,color:#fff
cd service-mesh/istio
./install-istio.sh docuthinker-prod production
1. Canary Deployment (10% traffic split):
# Configured in traffic-management/virtual-services.yaml
route:
- destination:
host: backend
subset: stable
weight: 90
- destination:
host: backend
subset: canary
weight: 10
2. Circuit Breaking:
# Configured in traffic-management/destination-rules.yaml
outlierDetection:
consecutive5xxErrors: 3
interval: 10s
baseEjectionTime: 60s
maxEjectionPercent: 30
3. Retry Logic:
retries:
attempts: 3
perTryTimeout: 2s
retryOn: 5xx,reset,connect-failure
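The retry policy above also bounds how long a single request can stall before the caller sees a failure. A minimal sketch of that arithmetic, assuming `attempts` counts retries on top of the original request (as the Istio VirtualService reference describes) and ignoring the short backoff between tries. `retry_worst_case_seconds` is a hypothetical helper, not part of the repository:

```shell
# Hypothetical helper: worst-case seconds spent on one request when every
# try runs into perTryTimeout. Assumes "attempts" = retries after the first try.
retry_worst_case_seconds() {
  attempts=$1
  per_try_timeout=$2
  echo $(( (attempts + 1) * per_try_timeout ))
}

# Values from the retry policy above: attempts: 3, perTryTimeout: 2s
retry_worst_case_seconds 3 2   # prints 8
```

If the route also sets an overall `timeout`, that value caps the total, so the real worst case may be lower than this figure.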
# Kiali (Service Mesh Visualization)
kubectl port-forward svc/kiali -n istio-system 20001:20001
# Open: http://localhost:20001
# Jaeger (Distributed Tracing)
kubectl port-forward svc/jaeger-query -n istio-system 16686:16686
# Open: http://localhost:16686
OPA Gatekeeper enforces:
sequenceDiagram
participant Dev
participant K8s API
participant OPA
participant Policies
participant Pod
Dev->>K8s API: kubectl apply deployment.yaml
K8s API->>OPA: Admission Request
OPA->>Policies: Evaluate Constraints
alt Violations Found
Policies->>OPA: Deny: Missing labels, no resource limits
OPA->>K8s API: Admission Denied
K8s API->>Dev: Error: Policy violations
else Policies Satisfied
Policies->>OPA: All policies met
OPA->>OPA: Apply Mutations (add defaults)
OPA->>K8s API: Admission Allowed (modified)
K8s API->>Pod: Create Pod
Pod->>Dev: Success
end
Note over OPA: Continuous Audit (every hour)
OPA->>Policies: Scan existing resources
Policies-->>OPA: Report violations
cd policy-as-code/opa
./install-opa.sh 3.14.0
:latest image tags
# List all constraints
kubectl get constraints
# View violations
kubectl get k8srequiredlabels pod-must-have-labels -o yaml
# Test deployment
kubectl apply --dry-run=server -f deployment.yaml
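Gatekeeper only evaluates a manifest once it reaches the API server, so a quick local check can catch the `:latest` image tag case before `kubectl apply` is run. The sketch below is a rough approximation, not the actual Gatekeeper policy: `uses_latest_tag` is a hypothetical helper, and it assumes each image is declared on its own `image:` line.

```shell
# Hypothetical helper: succeed (exit 0) if any "image:" line in the manifest
# uses the :latest tag, or has neither tag nor digest (which resolves to latest).
uses_latest_tag() {
  awk '
    {
      img = ""
      if ($1 == "image:") img = $2
      else if ($1 == "-" && $2 == "image:") img = $3
      if (img == "") next
      gsub(/"/, "", img)                  # drop optional double quotes
      n = split(img, parts, "/")
      last = parts[n]                     # ignore registry host:port
      if (last !~ /:/) found = 1          # no tag and no digest
      else if (last ~ /:latest$/) found = 1
    }
    END { exit(found ? 0 : 1) }
  ' "$1"
}

# Usage (deployment.yaml as in the commands above):
#   uses_latest_tag deployment.yaml && echo "floating image tag found"
```

The server-side dry run remains the authoritative check, since it exercises the constraints actually installed in the cluster.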
graph LR
subgraph "11-Stage Pipeline"
GIT[Git Push]
PRE[Pre-Check<br/>Lint + Audit]
BUILD[Build<br/>FE + BE + AI]
TEST[Test<br/>Unit + Coverage]
SECURITY[Security<br/>Trivy + SonarQube]
PACKAGE[Package<br/>Docker Build]
DEPLOY_DEV[Deploy Dev<br/>Auto]
DEPLOY_STG[Deploy Staging<br/>Manual]
CANARY[Deploy Canary<br/>Flagger]
PROMOTE[Promote Prod<br/>Manual Approval]
POST[Post-Deploy<br/>Smoke + Perf Tests]
CLEANUP[Cleanup<br/>Old Images]
end
GIT --> PRE
PRE --> BUILD
BUILD --> TEST
TEST --> SECURITY
SECURITY --> PACKAGE
PACKAGE --> DEPLOY_DEV
DEPLOY_DEV --> DEPLOY_STG
DEPLOY_STG --> CANARY
CANARY -.->|Metrics OK| PROMOTE
CANARY -.->|Metrics Fail| ROLLBACK[Auto Rollback]
PROMOTE --> POST
POST --> CLEANUP
# .gitlab-ci.yml (11 stages)
stages:
- pre-check
- build
- test
- security
- package
- deploy-dev
- deploy-staging
- deploy-canary
- deploy-production
- post-deploy
- cleanup
flowchart TD
START([Start]) --> ENV{Environment?}
ENV -->|Dev| DEV_DEPLOY[Auto-Deploy to Dev]
ENV -->|Staging| STG_APPROVAL{Manual Approval?}
ENV -->|Production| PROD_PREP[Prepare Prod Canary]
STG_APPROVAL -->|Yes| STG_DEPLOY[Deploy to Staging]
STG_APPROVAL -->|No| CANCEL([Cancelled])
DEV_DEPLOY --> SMOKE_DEV[Smoke Tests]
SMOKE_DEV --> SUCCESS_DEV([Dev Deployed])
STG_DEPLOY --> SMOKE_STG[Smoke Tests]
SMOKE_STG --> SUCCESS_STG([Staging Deployed])
PROD_PREP --> FLAGGER[Flagger Canary Analysis]
FLAGGER --> CANARY_INIT[Initialize 0% Traffic]
CANARY_INIT --> RAMP[Progressive Ramp 10-50%]
RAMP --> ANALYSIS{Metrics OK?}
ANALYSIS -->|Success Rate >99%<br/>Latency <500ms| PROMOTE[Promote to 100%]
ANALYSIS -->|Metrics Fail| AUTO_RB[Automatic Rollback]
PROMOTE --> FINAL_SMOKE[Final Smoke Tests]
AUTO_RB --> ALERT[Send Alert]
FINAL_SMOKE --> SUCCESS_PROD([Production Deployed])
ALERT --> FAIL([Deployment Failed])
style PROMOTE fill:#6BCB77,color:#fff
style AUTO_RB fill:#FF6B6B,color:#fff
style SUCCESS_PROD fill:#6BCB77,color:#fff
# Deploy via script
./scripts/deploy/deploy.sh [dev|staging|production]
# Deploy via Helm
helm upgrade --install docuthinker ./helm/docuthinker \
-f ./helm/docuthinker/values-prod.yaml \
-n docuthinker-prod
# Rollback
./scripts/deploy/rollback.sh production 3
Flagger automates canary deployments with metric-based promotion/rollback.
Features:
graph TB
START[New Deployment] --> INIT[Flagger Detects Change]
INIT --> CREATE[Create Canary Deployment]
CREATE --> ROUTE_0[Route 0% Traffic to Canary]
ROUTE_0 --> RAMP_10[Ramp to 10%]
RAMP_10 --> ANALYZE_10{Analyze Metrics<br/>1 minute}
ANALYZE_10 -->|Pass| RAMP_20[Ramp to 20%]
ANALYZE_10 -->|Fail| ROLLBACK[Automatic Rollback]
RAMP_20 --> ANALYZE_20{Analyze Metrics}
ANALYZE_20 -->|Pass| RAMP_50[Ramp to 50%]
ANALYZE_20 -->|Fail| ROLLBACK
RAMP_50 --> ANALYZE_50{Analyze Metrics}
ANALYZE_50 -->|Pass| PROMOTE[Promote to 100%]
ANALYZE_50 -->|Fail| ROLLBACK
PROMOTE --> CLEANUP[Delete Canary]
CLEANUP --> SUCCESS([Deployment Complete])
ROLLBACK --> ALERT[Send Slack Alert]
ALERT --> FAIL([Deployment Failed])
style PROMOTE fill:#6BCB77,color:#fff
style ROLLBACK fill:#FF6B6B,color:#fff
style SUCCESS fill:#6BCB77,color:#fff
helm install flagger flagger/flagger \
-n istio-system \
-f progressive-delivery/flagger/values.yaml
apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
name: backend
spec:
targetRef:
apiVersion: apps/v1
kind: Deployment
name: backend
service:
port: 8080
analysis:
interval: 1m
threshold: 5
maxWeight: 50
stepWeight: 10
metrics:
- name: request-success-rate
thresholdRange:
min: 99
- name: request-duration
thresholdRange:
max: 500
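The `interval`, `threshold`, `stepWeight`, and `maxWeight` values in the Canary resource determine both the traffic ramp and how long a rollout takes. A small sketch of that arithmetic, using the timing formulas from the Flagger documentation. The function names are hypothetical and only illustrate the calculation:

```shell
# Hypothetical helpers; formulas follow Flagger's documented timing rules.

# Print the traffic weights Flagger steps through before promotion.
canary_steps() {
  step=$1
  max=$2
  w=$step
  out=""
  while [ "$w" -le "$max" ]; do
    out="${out:+$out }$w"
    w=$(( w + step ))
  done
  echo "$out"
}

# Minimum minutes to promote: interval * (maxWeight / stepWeight)
min_promotion_minutes() {   # args: interval stepWeight maxWeight
  echo $(( $1 * ($3 / $2) ))
}

# Minutes of failed checks before rollback: interval * threshold
rollback_minutes() {        # args: interval threshold
  echo $(( $1 * $2 ))
}

# Values from the Canary spec above
canary_steps 10 50             # prints: 10 20 30 40 50
min_promotion_minutes 1 10 50  # prints: 5
rollback_minutes 1 5           # prints: 5
```

With the values shown, a healthy canary needs at least five minutes to reach promotion, and a failing one is rolled back after five consecutive failed checks.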
graph TB
subgraph "Applications"
APP[Applications<br/>Frontend + Backend]
end
subgraph "Collection"
OTEL[OpenTelemetry Collector<br/>3 Replicas HA]
PROM_EXP[Prometheus Exporters]
FILEBEAT[Filebeat]
end
subgraph "Storage & Processing"
subgraph "Traces"
JAEGER[Jaeger]
ES_TRACE[(Elasticsearch)]
end
subgraph "Metrics"
PROM[Prometheus]
SLO_CALC[SLO/SLI Calculator]
ERROR_BUDGET[Error Budget]
end
subgraph "Logs"
LOGSTASH[Logstash]
ES_LOG[(Elasticsearch)]
end
end
subgraph "Visualization"
GRAF[Grafana<br/>Unified Dashboards]
KIBANA[Kibana<br/>Log Analysis]
KIALI[Kiali<br/>Service Mesh]
end
subgraph "Alerting"
ALERT[AlertManager]
SLACK[Slack]
PD[PagerDuty]
end
APP -->|OTLP| OTEL
APP -->|Metrics| PROM_EXP
APP -->|Logs| FILEBEAT
OTEL --> JAEGER
JAEGER --> ES_TRACE
PROM_EXP --> PROM
PROM --> SLO_CALC
SLO_CALC --> ERROR_BUDGET
FILEBEAT --> LOGSTASH
LOGSTASH --> ES_LOG
PROM --> GRAF
JAEGER --> GRAF
ES_LOG --> KIBANA
PROM --> KIALI
PROM -.->|Alerts| ALERT
ALERT --> SLACK
ALERT -->|Critical| PD
style OTEL fill:#F38181,color:#fff
style PROM fill:#E85D04,color:#fff
style GRAF fill:#F48C06,color:#fff
style SLO_CALC fill:#95E1D3
Service Level Objectives:
Prometheus Recording Rules:
# Availability SLI
sli:availability:ratio_rate30d >= 0.999
# Latency SLI
sli:latency:p99_5m <= 0.5
# Error Budget
slo:error_budget:remaining
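To make the 99.9% availability target concrete, the error budget can be expressed as allowed downtime. The sketch below assumes a 30-day rolling window and a simple definition of remaining budget (one minus the consumed share). How `slo:error_budget:remaining` is actually computed in `prometheus-rules.yaml` is not shown here, so both helpers are illustrative assumptions:

```shell
# Hypothetical helper: allowed downtime in minutes for an availability
# target (e.g. 0.999) over a rolling window of N days.
error_budget_minutes() {
  awk -v slo="$1" -v days="$2" \
    'BEGIN { printf "%.1f\n", (1 - slo) * days * 24 * 60 }'
}

# Hypothetical helper: percentage of the error budget still unspent, given
# the target and the availability observed so far in the window.
error_budget_remaining_pct() {
  awk -v slo="$1" -v observed="$2" \
    'BEGIN { printf "%.1f\n", (1 - (1 - observed) / (1 - slo)) * 100 }'
}

error_budget_minutes 0.999 30            # prints 43.2
error_budget_remaining_pct 0.999 0.9995  # prints 50.0
```

A 99.9% target therefore leaves roughly 43 minutes of downtime per 30 days, and an observed availability of 99.95% means about half of that budget has been used.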
Alerts:
# Grafana (Metrics + SLO/SLI)
kubectl port-forward svc/grafana -n monitoring 3000:80
# Open: http://localhost:3000
# Prometheus (Raw Metrics)
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Open: http://localhost:9090
# Kibana (Logs)
kubectl port-forward svc/kibana -n monitoring 5601:5601
# Open: http://localhost:5601
# Kiali (Service Mesh)
kubectl port-forward svc/kiali -n istio-system 20001:20001
# Open: http://localhost:20001
Litmus validates system resilience through controlled chaos experiments.
Available Experiments:
cd chaos-engineering/litmus
./install-litmus.sh 3.0.0
# Pod deletion test
kubectl apply -f chaos-engineering/litmus/experiments/pod-delete-experiment.yaml
# Network latency test
kubectl apply -f chaos-engineering/litmus/experiments/network-latency-experiment.yaml
# Resource stress test
kubectl apply -f chaos-engineering/litmus/experiments/resource-stress-experiment.yaml
# Comprehensive workflow (all experiments sequentially)
kubectl apply -f chaos-engineering/litmus/workflows/comprehensive-chaos-workflow.yaml
# Watch chaos engine
kubectl get chaosengine -n docuthinker-prod -w
# View results
kubectl describe chaosresult backend-pod-delete -n docuthinker-prod
# Access ChaosCenter UI
kubectl port-forward svc/chaos-litmus-frontend-service -n litmus 9091:9091
# Open: http://localhost:9091
Velero provides automated backup and disaster recovery:
cd backup-dr/velero
./install-velero.sh v1.12.0 us-east-1 docuthinker-velero-backups
# Create manual backup
velero backup create prod-backup-$(date +%Y%m%d) \
--include-namespaces docuthinker-prod
# List backups
velero backup get
# Describe backup
velero backup describe prod-backup-20250127
# View backup logs
velero backup logs prod-backup-20250127
# Restore from backup
velero restore create --from-backup prod-backup-20250127
# Restore specific namespace
velero restore create --from-backup prod-backup-20250127 \
--include-namespaces docuthinker-prod
# Monitor restore
velero restore get
velero restore describe <restore-name>
Automatically configured:
KEDA provides event-driven autoscaling:
helm install keda kedacore/keda \
-n keda --create-namespace \
-f autoscaling/keda/values.yaml
1. SQS Queue Scaler (1-50 replicas):
triggers:
- type: aws-sqs-queue
metadata:
queueURL: https://sqs.us-east-1.amazonaws.com/.../docuthinker-jobs
queueLength: "5"
awsRegion: "us-east-1"
2. HTTP Scaler (2-20 replicas):
triggers:
- type: prometheus
metadata:
query: sum(rate(http_requests_total{app="backend"}[1m]))
threshold: "100"
3. Cron Scaler (business hours):
triggers:
- type: cron
metadata:
timezone: America/New_York
start: 0 8 * * 1-5 # 8 AM weekdays
end: 0 18 * * 1-5 # 6 PM weekdays
desiredReplicas: "10"
kubectl apply -f autoscaling/keda/scalers/queue-scaler.yaml
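For the SQS scaler, `queueLength: "5"` is the target number of messages per replica, so the replica count follows roughly from the queue depth. A minimal sketch of that relationship, assuming the 1-50 replica bounds listed above. `sqs_desired_replicas` is a hypothetical helper, and the real result also depends on HPA tolerance and stabilization settings:

```shell
# Hypothetical helper approximating the replica count for a queue-length
# target: ceil(messages / queueLength), clamped to [min, max].
sqs_desired_replicas() {
  messages=$1
  queue_length=$2
  min=$3
  max=$4
  want=$(( (messages + queue_length - 1) / queue_length ))
  if [ "$want" -lt "$min" ]; then want=$min; fi
  if [ "$want" -gt "$max" ]; then want=$max; fi
  echo "$want"
}

# 37 queued messages, 5 per replica, bounds 1-50
sqs_desired_replicas 37 5 1 50   # prints 8
```

An empty queue falls back to the minimum of one replica, and a large backlog is capped at fifty.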
Falco provides runtime threat detection:
helm install falco falcosecurity/falco \
-n falco --create-namespace \
-f security/falco/values.yaml
# View Falco logs
kubectl logs -l app=falco -n falco -f
# Check for alerts
kubectl logs -l app=falco -n falco | grep -i "warning\|critical"
graph TB
subgraph "Secret Sources"
VAULT[HashiCorp Vault<br/>HA]
AWS_SM[AWS Secrets Manager]
end
subgraph "Kubernetes"
ESO[External Secrets Operator]
K8S_SECRET[Kubernetes Secrets]
end
subgraph "Applications"
POD[Application Pods]
end
VAULT -.->|Pull| ESO
AWS_SM -.->|Pull| ESO
ESO --> K8S_SECRET
K8S_SECRET -->|Mount| POD
style VAULT fill:#AA96DA,color:#fff
style ESO fill:#6BCB77,color:#fff
# Install Vault
helm install vault hashicorp/vault \
-n vault -f secrets/vault/vault-values.yaml
# Initialize Vault
./secrets/vault/init-vault.sh
# Access UI
kubectl port-forward svc/vault -n vault 8200:8200
# Open: http://localhost:8200
# Apply secret store
kubectl apply -f secrets/external-secrets/secret-store.yaml
# Secrets are automatically synced from Vault/AWS to K8s
6 Test Scenarios:
# Basic load test
k6 run --vus 100 --duration 5m testing/load-tests/k6-advanced-scenarios.js
# With custom endpoint
BASE_URL=https://staging.docuthinker.com k6 run testing/load-tests/k6-advanced-scenarios.js
# All scenarios
k6 run testing/load-tests/k6-advanced-scenarios.js
Flyway provides version-controlled database migrations:
database/migrations/
├── flyway.conf # Configuration
└── sql/
    ├── V1__initial_schema.sql
    ├── V2__add_api_keys.sql
    └── V3__add_audit_log.sql
# Via Flyway CLI
flyway -configFiles=database/migrations/flyway.conf migrate
# Via Docker
docker run --rm \
-v "$(pwd)/database/migrations/sql:/flyway/sql" \
-v "$(pwd)/database/migrations/flyway.conf:/flyway/conf/flyway.conf" \
flyway/flyway migrate
# Rollback (requires a Flyway edition with undo support and matching U-prefixed undo migrations)
flyway -configFiles=database/migrations/flyway.conf undo
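Flyway only picks up versioned migrations whose file names follow the `V<version>__<description>.sql` pattern used in the tree above, and a single underscore is an easy mistake to make. A small sketch of a name check that could run before committing a migration. `is_versioned_migration` is a hypothetical helper, and it assumes Flyway's default prefix, separator, and suffix with a description made of letters, digits, and underscores:

```shell
# Hypothetical pre-commit check: succeed only if the file name follows
# Flyway's default versioned-migration pattern V<version>__<description>.sql
is_versioned_migration() {
  printf '%s\n' "$1" | grep -Eq '^V[0-9]+([._][0-9]+)*__[A-Za-z0-9_]+\.sql$'
}

for f in V1__initial_schema.sql V2_add_api_keys.sql; do
  if is_versioned_migration "$f"; then
    echo "ok:  $f"
  else
    echo "bad: $f"
  fi
done
```

The second name is rejected because it uses one underscore instead of the double-underscore separator.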
Terratest validates the Terraform infrastructure with automated Go tests.
Tests Included:
cd testing/infrastructure
# Run all tests
go test -v -timeout 30m
# Run specific test
go test -v -run TestTerraformDocuThinkerInfrastructure
# Parallel execution
go test -v -parallel 4
graph TB
L1[Layer 1: Network<br/>WAF + TLS + mTLS]
L2[Layer 2: Admission<br/>OPA Gatekeeper]
L3[Layer 3: Authentication<br/>Firebase + JWT + RBAC]
L4[Layer 4: Runtime<br/>Falco Monitoring]
L5[Layer 5: Secrets<br/>Vault + Secrets Manager]
L6[Layer 6: Data<br/>Encryption at Rest/Transit]
L7[Layer 7: Audit<br/>Logs + Compliance]
L1 --> L2 --> L3 --> L4 --> L5 --> L6 --> L7
style L1 fill:#FF6B6B,color:#fff
style L2 fill:#4ECDC4,color:#fff
style L4 fill:#F38181,color:#fff
style L5 fill:#AA96DA,color:#fff
# Trivy image scanning
./scripts/security/trivy-scan.sh
# SonarQube analysis
sonar-scanner -Dproject.settings=scripts/security/sonarqube.properties
# OPA policy violations
kubectl get constraints -o json | jq '.items[].status.violations'
# Falco alerts
kubectl logs -l app=falco -n falco | grep -i critical
1. Pod not starting:
kubectl describe pod <pod-name> -n docuthinker-prod
kubectl logs <pod-name> -n docuthinker-prod
kubectl logs <pod-name> -c istio-proxy -n docuthinker-prod
2. OPA blocking deployment:
# Check violations
kubectl get constraints
# Test deployment
kubectl apply --dry-run=server -f deployment.yaml
# View specific constraint
kubectl get k8srequiredlabels pod-must-have-labels -o yaml
3. Istio traffic issues:
# Check virtual services
kubectl get virtualservices -n docuthinker-prod
# Check destination rules
kubectl get destinationrules -n docuthinker-prod
# Analyze configuration
istioctl analyze -n docuthinker-prod
# View proxy logs
kubectl logs <pod-name> -c istio-proxy -n docuthinker-prod
4. Canary not promoting:
# Check Flagger status
kubectl describe canary backend -n docuthinker-prod
# View Flagger logs
kubectl logs -l app=flagger -n istio-system
# Check metrics
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Query: flagger_canary_status
5. High error rate:
# Check SLO/SLI metrics
kubectl port-forward svc/prometheus -n monitoring 9090:9090
# Query: sli:availability:ratio_rate5m
# View error budget
# Query: slo:error_budget:remaining
# Check application logs
kubectl logs -l app=backend -n docuthinker-prod | grep -i error
# View deployment history
helm history docuthinker -n docuthinker-prod
# Rollback to previous version
./scripts/deploy/rollback.sh production
# Rollback to specific revision
helm rollback docuthinker 3 -n docuthinker-prod
# Emergency rollback (bypass Flagger)
kubectl rollout undo deployment/backend -n docuthinker-prod
Daily:
Weekly:
Monthly:
# === Deployment ===
./scripts/deploy/deploy.sh [dev|staging|production]
./scripts/deploy/rollback.sh [environment] [revision]
# === Monitoring ===
kubectl port-forward svc/grafana -n monitoring 3000:80
kubectl port-forward svc/prometheus -n monitoring 9090:9090
kubectl port-forward svc/kiali -n istio-system 20001:20001
# === Chaos Engineering ===
kubectl apply -f chaos-engineering/litmus/experiments/pod-delete-experiment.yaml
kubectl get chaosresult -n docuthinker-prod
# === Backup & Restore ===
velero backup create prod-backup-$(date +%Y%m%d) --include-namespaces docuthinker-prod
velero restore create --from-backup prod-backup-20250127
# === Security ===
./scripts/security/trivy-scan.sh
kubectl get constraints
kubectl logs -l app=falco -n falco
# === Load Testing ===
k6 run --vus 100 --duration 5m testing/load-tests/k6-advanced-scenarios.js
# === Logs ===
kubectl logs -l app=backend -n docuthinker-prod -f
kubectl logs -l app=backend -c istio-proxy -n docuthinker-prod
For issues: