Production-grade AWS infrastructure for DocuThinker's AI-powered document intelligence platform using ECS Fargate, AWS CDK, and CloudFormation.
DocuThinker's AWS infrastructure is designed for production-grade deployment of the AI/ML platform with:
Frontend (Vercel) → CloudFront → ALB → ECS Fargate Services
├── Backend Service (Node.js)
├── AI/ML Service (FastAPI)
├── Neo4j (Aura/Neptune)
└── ChromaDB (EFS/EC2)
graph TB
subgraph "External Users"
USER[Users/Clients]
ADMIN[Administrators]
end
subgraph "Frontend Layer"
VERCEL[Vercel Frontend]
CF[CloudFront CDN]
S3[S3 Static Assets]
end
subgraph "AWS Cloud - Production VPC"
subgraph "Public Subnets"
ALB[Application Load Balancer]
NAT[NAT Gateway]
end
subgraph "Private Subnets"
subgraph "ECS Cluster"
BACKEND[Backend Service<br/>Node.js/Express<br/>Port 5000]
AIML[AI/ML Service<br/>FastAPI/Python<br/>Port 8000]
end
end
subgraph "Data Layer"
MONGO[(MongoDB Atlas /<br/>DocumentDB)]
REDIS[(ElastiCache Redis)]
NEO4J[(Neo4j Aura /<br/>Neptune)]
CHROMA[(ChromaDB on EFS)]
end
subgraph "Management & Security"
ECR[ECR Container Registry]
SM[Secrets Manager]
CW[CloudWatch Logs & Metrics]
IAM[IAM Roles & Policies]
end
end
subgraph "CI/CD Pipeline"
JENKINS[Jenkins]
CB[CodeBuild]
CDP[CodePipeline]
end
USER --> VERCEL
USER --> CF
CF --> ALB
ADMIN --> ALB
ALB --> BACKEND
ALB --> AIML
BACKEND --> MONGO
BACKEND --> REDIS
AIML --> NEO4J
AIML --> CHROMA
BACKEND -.->|Read Secrets| SM
AIML -.->|Read Secrets| SM
BACKEND -.->|Logs| CW
AIML -.->|Logs| CW
JENKINS --> CB
CB --> ECR
CB --> CDP
CDP --> BACKEND
CDP --> AIML
BACKEND --> IAM
AIML --> IAM
style ALB fill:#FF9900,stroke:#232F3E,stroke-width:3px,color:#fff
style BACKEND fill:#527FFF,stroke:#232F3E,stroke-width:2px,color:#fff
style AIML fill:#3F8624,stroke:#232F3E,stroke-width:2px,color:#fff
style ECR fill:#FF9900,stroke:#232F3E,stroke-width:2px,color:#fff
style SM fill:#DD344C,stroke:#232F3E,stroke-width:2px,color:#fff
graph LR
subgraph "Region: us-east-1"
subgraph "VPC (10.0.0.0/16)"
subgraph "AZ-1 (us-east-1a)"
PUB1[Public Subnet<br/>10.0.1.0/24]
PRIV1[Private Subnet<br/>10.0.3.0/24]
end
subgraph "AZ-2 (us-east-1b)"
PUB2[Public Subnet<br/>10.0.2.0/24]
PRIV2[Private Subnet<br/>10.0.4.0/24]
end
IGW[Internet Gateway]
ALB_NODE[ALB]
NAT1[NAT Gateway]
NAT2[NAT Gateway]
PUB1 --> IGW
PUB2 --> IGW
PUB1 --> NAT1
PUB2 --> NAT2
ALB_NODE --> PUB1
ALB_NODE --> PUB2
PRIV1 --> NAT1
PRIV2 --> NAT2
subgraph "ECS Cluster"
TASK1[Fargate Task<br/>Backend]
TASK2[Fargate Task<br/>AI/ML]
TASK3[Fargate Task<br/>Backend]
TASK4[Fargate Task<br/>AI/ML]
end
TASK1 --> PRIV1
TASK2 --> PRIV1
TASK3 --> PRIV2
TASK4 --> PRIV2
end
end
style PUB1 fill:#3F8624,stroke:#232F3E,stroke-width:2px,color:#fff
style PUB2 fill:#3F8624,stroke:#232F3E,stroke-width:2px,color:#fff
style PRIV1 fill:#527FFF,stroke:#232F3E,stroke-width:2px,color:#fff
style PRIV2 fill:#527FFF,stroke:#232F3E,stroke-width:2px,color:#fff
style ALB_NODE fill:#FF9900,stroke:#232F3E,stroke-width:3px,color:#fff
VPC Configuration:
Security Groups:
sequenceDiagram
participant Dev as Developer
participant Git as GitHub
participant Jenkins as Jenkins
participant CB as CodeBuild
participant ECR as ECR
participant ECS as ECS Fargate
participant ALB as Load Balancer
Dev->>Git: git push
Git->>Jenkins: Webhook Trigger
Jenkins->>CB: Start Build Job
CB->>CB: 1. Run Tests
CB->>CB: 2. Build Docker Images
CB->>ECR: 3. Push Images
CB->>CB: 4. Update Task Definitions
Jenkins->>ECS: 5. Deploy New Tasks
ECS->>ECS: 6. Blue/Green Deployment
ECS->>ALB: 7. Register New Tasks
ALB->>ECS: 8. Health Check
ECS->>ECS: 9. Deregister Old Tasks
Jenkins-->>Dev: Deployment Complete
| Component | Technology | Purpose | Configuration |
|---|---|---|---|
| Container Orchestration | ECS Fargate | Serverless container management | aws/cloudformation/fargate-service.yaml |
| Load Balancer | Application Load Balancer | Traffic distribution & SSL termination | Auto-configured via CDK/CFN |
| Container Registry | Amazon ECR | Docker image storage | docuthinker-backend, docuthinker-ai-ml |
| Secrets Management | AWS Secrets Manager | Secure API key storage | docuthinker/* secrets |
| Networking | Amazon VPC | Network isolation & security | 2 AZs, public/private subnets |
| Monitoring | CloudWatch | Logs, metrics, alarms | Container logs, health metrics |
| CDN | CloudFront | Static asset caching | Optional for S3 assets |
| Database | MongoDB Atlas/DocumentDB | Primary data store | External/managed |
| Cache | ElastiCache Redis | Session & caching layer | External/managed |
| Knowledge Graph | Neo4j Aura/Neptune | Graph database | External/managed |
| Vector Store | ChromaDB on EFS | Persistent embeddings | EFS-backed volume |
Backend Service:
- /health endpoint
- NODE_ENV=production
- PORT=5000

AI/ML Service:
- /health endpoint
- DOCUTHINKER_SYNC_GRAPH=true
- DOCUTHINKER_SYNC_VECTOR=true

Recommended for production and automated deployments.
npm install -g aws-cdk
cd aws/infrastructure
npm install
# Bootstrap CDK (first time only)
cdk bootstrap aws://ACCOUNT-ID/REGION
# Synthesize CloudFormation template
cdk synth
# Deploy stack
cdk deploy DocuThinkerStack
# View outputs
cdk deploy --outputs-file outputs.json
cdk destroy DocuThinkerStack
Alternative deployment using raw CloudFormation templates.
# Validate template
aws cloudformation validate-template \
--template-body file://aws/cloudformation/fargate-service.yaml
# Create stack
aws cloudformation create-stack \
--stack-name docuthinker-stack \
--template-body file://aws/cloudformation/fargate-service.yaml \
--parameters \
ParameterKey=VpcId,ParameterValue=vpc-xxxxx \
ParameterKey=PublicSubnets,ParameterValue=subnet-xxxxx,subnet-yyyyy \
ParameterKey=BackendImage,ParameterValue=ACCOUNT.dkr.ecr.REGION.amazonaws.com/docuthinker-backend:latest \
ParameterKey=AIMLImage,ParameterValue=ACCOUNT.dkr.ecr.REGION.amazonaws.com/docuthinker-ai-ml:latest \
--capabilities CAPABILITY_IAM
# Monitor stack creation
aws cloudformation wait stack-create-complete \
--stack-name docuthinker-stack
# Get outputs
aws cloudformation describe-stacks \
--stack-name docuthinker-stack \
--query 'Stacks[0].Outputs'
aws cloudformation update-stack \
--stack-name docuthinker-stack \
--template-body file://aws/cloudformation/fargate-service.yaml \
--parameters \
ParameterKey=BackendImage,ParameterValue=ACCOUNT.dkr.ecr.REGION.amazonaws.com/docuthinker-backend:v2.0 \
ParameterKey=AIMLImage,ParameterValue=ACCOUNT.dkr.ecr.REGION.amazonaws.com/docuthinker-ai-ml:v2.0
aws cloudformation delete-stack --stack-name docuthinker-stack
For development and testing purposes.
See Detailed Setup section below.
aws --version
npm install -g aws-cdk
cdk --version
docker --version
node --version
- AmazonECS_FullAccess
- AmazonEC2ContainerRegistryFullAccess
- AmazonVPCFullAccess
- IAMFullAccess
- SecretsManagerReadWrite
- CloudWatchLogsFullAccess
- ElasticLoadBalancingFullAccess

aws configure
# AWS Access Key ID: YOUR_ACCESS_KEY
# AWS Secret Access Key: YOUR_SECRET_KEY
# Default region name: us-east-1
# Default output format: json
export AWS_PROFILE=your-profile
export AWS_REGION=us-east-1
aws ecr create-repository --repository-name docuthinker-backend
aws ecr create-repository --repository-name docuthinker-ai-ml
# Login to ECR
aws ecr get-login-password --region us-east-1 | \
docker login --username AWS --password-stdin ACCOUNT.dkr.ecr.us-east-1.amazonaws.com
# Build and push backend
cd backend
docker build -t docuthinker-backend .
docker tag docuthinker-backend:latest ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/docuthinker-backend:latest
docker push ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/docuthinker-backend:latest
# Build and push AI/ML
cd ../ai_ml
docker build -t docuthinker-ai-ml .
docker tag docuthinker-ai-ml:latest ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/docuthinker-ai-ml:latest
docker push ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/docuthinker-ai-ml:latest
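Every tag and push command above uses the same registry URI pattern. The following is a minimal TypeScript sketch of that pattern, assuming a hypothetical helper named `ecrImageUri` that is not part of the repository. It may be useful when generating image references in deployment scripts.

```typescript
// Hypothetical helper (not part of the DocuThinker codebase):
// composes the ECR image URI used by `docker tag` and `docker push`.
interface EcrImageRef {
  accountId: string;  // 12-digit AWS account ID
  region: string;     // e.g. "us-east-1"
  repository: string; // e.g. "docuthinker-backend"
  tag: string;        // e.g. "latest" or a version such as "v2.0"
}

function ecrImageUri(ref: EcrImageRef): string {
  // AWS account IDs are always 12 digits; reject obvious placeholders.
  if (!/^\d{12}$/.test(ref.accountId)) {
    throw new Error(`accountId must be 12 digits, got "${ref.accountId}"`);
  }
  return `${ref.accountId}.dkr.ecr.${ref.region}.amazonaws.com/${ref.repository}:${ref.tag}`;
}
```

Passing the account ID `123456789012` with repository `docuthinker-backend` and tag `v2.0` yields a URI of the same shape as the `ACCOUNT.dkr.ecr.REGION.amazonaws.com/...` placeholders used throughout this guide.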
# Create secrets in Secrets Manager
aws secretsmanager create-secret \
--name docuthinker/openai \
--secret-string "sk-..."
aws secretsmanager create-secret \
--name docuthinker/anthropic \
--secret-string "sk-ant-..."
aws secretsmanager create-secret \
--name docuthinker/google \
--secret-string "..."
aws secretsmanager create-secret \
--name docuthinker/neo4j \
--secret-string "bolt://neo4j.example.com:7687"
aws secretsmanager create-secret \
--name docuthinker/chroma \
--secret-string "/mnt/efs/chroma"
cd aws/infrastructure
npm install
cdk deploy DocuThinkerStack
# Get ALB DNS name
aws cloudformation describe-stacks \
--stack-name DocuThinkerStack \
--query 'Stacks[0].Outputs'
# Test backend
curl http://ALB-DNS-NAME/health
# Test AI/ML service
curl http://ALB-DNS-NAME:8080/health
# Backend repository
aws ecr create-repository \
--repository-name docuthinker-backend \
--image-scanning-configuration scanOnPush=true \
--encryption-configuration encryptionType=AES256
# AI/ML repository
aws ecr create-repository \
--repository-name docuthinker-ai-ml \
--image-scanning-configuration scanOnPush=true \
--encryption-configuration encryptionType=AES256
# Keep only last 10 images
aws ecr put-lifecycle-policy \
--repository-name docuthinker-backend \
--lifecycle-policy-text '{
"rules": [{
"rulePriority": 1,
"description": "Keep last 10 images",
"selection": {
"tagStatus": "any",
"countType": "imageCountMoreThan",
"countNumber": 10
},
"action": {
"type": "expire"
}
}]
}'
aws ecr put-replication-configuration \
--replication-configuration '{
"rules": [{
"destinations": [{
"region": "us-west-2",
"registryId": "ACCOUNT-ID"
}]
}]
}'
{
"docuthinker/openai": "sk-proj-...",
"docuthinker/anthropic": "sk-ant-...",
"docuthinker/google": "AIza...",
"docuthinker/neo4j": "bolt://username:password@host:7687",
"docuthinker/chroma": "/mnt/efs/chroma",
"docuthinker/mongodb": "mongodb+srv://...",
"docuthinker/redis": "redis://..."
}
#!/bin/bash
SECRETS=(
"docuthinker/openai:sk-proj-..."
"docuthinker/anthropic:sk-ant-..."
"docuthinker/google:AIza..."
"docuthinker/neo4j:bolt://user:pass@host:7687"
"docuthinker/chroma:/mnt/efs/chroma"
)
for secret in "${SECRETS[@]}"; do
IFS=':' read -r name value <<< "$secret"
aws secretsmanager create-secret \
--name "$name" \
--secret-string "$value" \
--description "DocuThinker API key/config"
done
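The loop above depends on bash `read` assigning everything after the first colon to the last variable, which is what keeps values like `bolt://user:pass@host:7687` intact. The same rule can be sketched in TypeScript, assuming a hypothetical `parseSecretEntry` helper introduced here for illustration only:

```typescript
// Hypothetical helper: split "name:value" on the FIRST colon only,
// so values that contain colons (URIs, credentials) are preserved.
function parseSecretEntry(entry: string): { name: string; value: string } {
  const idx = entry.indexOf(":");
  if (idx <= 0) {
    throw new Error(`expected "name:value", got "${entry}"`);
  }
  return { name: entry.slice(0, idx), value: entry.slice(idx + 1) };
}
```

Splitting on every colon instead would truncate the Neo4j URI at `bolt`, so the first-colon rule is the detail worth preserving if this script is ported to another language.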
aws secretsmanager rotate-secret \
--secret-id docuthinker/openai \
--rotation-lambda-arn arn:aws:lambda:REGION:ACCOUNT:function:SecretRotation \
--rotation-rules AutomaticallyAfterDays=30
# Get default VPC ID
aws ec2 describe-vpcs --filters "Name=isDefault,Values=true"
# Get subnets
aws ec2 describe-subnets --filters "Name=vpc-id,Values=vpc-xxxxx"
# Create VPC
aws ec2 create-vpc --cidr-block 10.0.0.0/16 --tag-specifications 'ResourceType=vpc,Tags=[{Key=Name,Value=docuthinker-vpc}]'
# Create public subnets
aws ec2 create-subnet --vpc-id vpc-xxxxx --cidr-block 10.0.1.0/24 --availability-zone us-east-1a
aws ec2 create-subnet --vpc-id vpc-xxxxx --cidr-block 10.0.2.0/24 --availability-zone us-east-1b
# Create private subnets
aws ec2 create-subnet --vpc-id vpc-xxxxx --cidr-block 10.0.3.0/24 --availability-zone us-east-1a
aws ec2 create-subnet --vpc-id vpc-xxxxx --cidr-block 10.0.4.0/24 --availability-zone us-east-1b
# Create Internet Gateway
aws ec2 create-internet-gateway
aws ec2 attach-internet-gateway --vpc-id vpc-xxxxx --internet-gateway-id igw-xxxxx
# Create NAT Gateways
aws ec2 allocate-address --domain vpc
aws ec2 create-nat-gateway --subnet-id subnet-xxxxx --allocation-id eipalloc-xxxxx
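Before creating subnets, it can help to confirm that each subnet CIDR lies inside the VPC range and that no two subnets overlap. The sketch below is a self-contained TypeScript check for IPv4 CIDR blocks. The function names (`cidrRange`, `cidrContains`, `cidrOverlaps`) are hypothetical and are not AWS APIs.

```typescript
// Parse "a.b.c.d/n" into an inclusive numeric [start, end] address range.
function cidrRange(cidr: string): [number, number] {
  const parts = cidr.split("/");
  if (parts.length !== 2) {
    throw new Error(`invalid CIDR "${cidr}"`);
  }
  const octets = (parts[0] ?? "").split(".").map(Number);
  const prefix = Number(parts[1] ?? "");
  const octetsOk =
    octets.length === 4 &&
    octets.every((o) => Number.isInteger(o) && o >= 0 && o <= 255);
  if (!octetsOk || !Number.isInteger(prefix) || prefix < 0 || prefix > 32) {
    throw new Error(`invalid CIDR "${cidr}"`);
  }
  // Plain arithmetic instead of bit shifts avoids 32-bit sign issues.
  const base = octets.reduce((acc, o) => acc * 256 + o, 0);
  const size = Math.pow(2, 32 - prefix);
  const start = Math.floor(base / size) * size;
  return [start, start + size - 1];
}

// True when `inner` lies entirely within `outer`.
function cidrContains(outer: string, inner: string): boolean {
  const [oStart, oEnd] = cidrRange(outer);
  const [iStart, iEnd] = cidrRange(inner);
  return iStart >= oStart && iEnd <= oEnd;
}

// True when the two blocks share at least one address.
function cidrOverlaps(a: string, b: string): boolean {
  const [aStart, aEnd] = cidrRange(a);
  const [bStart, bEnd] = cidrRange(b);
  return aStart <= bEnd && bStart <= aEnd;
}
```

With the layout used in this guide, all four `/24` subnets (`10.0.1.0/24` through `10.0.4.0/24`) fall inside `10.0.0.0/16` and none overlap.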
aws ecs create-cluster --cluster-name docuthinker-cluster
Backend task definition:
aws ecs register-task-definition --cli-input-json file://task-definition-backend.json
task-definition-backend.json:
{
"family": "docuthinker-backend",
"networkMode": "awsvpc",
"requiresCompatibilities": ["FARGATE"],
"cpu": "512",
"memory": "1024",
"executionRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskExecutionRole",
"taskRoleArn": "arn:aws:iam::ACCOUNT:role/ecsTaskRole",
"containerDefinitions": [{
"name": "backend",
"image": "ACCOUNT.dkr.ecr.REGION.amazonaws.com/docuthinker-backend:latest",
"portMappings": [{
"containerPort": 5000,
"protocol": "tcp"
}],
"environment": [{
"name": "NODE_ENV",
"value": "production"
}],
"secrets": [
{
"name": "OPENAI_API_KEY",
"valueFrom": "arn:aws:secretsmanager:REGION:ACCOUNT:secret:docuthinker/openai"
}
],
"logConfiguration": {
"logDriver": "awslogs",
"options": {
"awslogs-group": "/ecs/docuthinker-backend",
"awslogs-region": "us-east-1",
"awslogs-stream-prefix": "ecs"
}
}
}]
}
aws ecs create-service \
--cluster docuthinker-cluster \
--service-name backend-service \
--task-definition docuthinker-backend:1 \
--desired-count 2 \
--launch-type FARGATE \
--network-configuration "awsvpcConfiguration={subnets=[subnet-xxxxx,subnet-yyyyy],securityGroups=[sg-xxxxx],assignPublicIp=ENABLED}"
# Create ALB
aws elbv2 create-load-balancer \
--name docuthinker-alb \
--subnets subnet-xxxxx subnet-yyyyy \
--security-groups sg-xxxxx \
--scheme internet-facing \
--type application
# Create target groups
aws elbv2 create-target-group \
--name backend-tg \
--protocol HTTP \
--port 5000 \
--vpc-id vpc-xxxxx \
--target-type ip \
--health-check-path /health
aws elbv2 create-target-group \
--name aiml-tg \
--protocol HTTP \
--port 8000 \
--vpc-id vpc-xxxxx \
--target-type ip \
--health-check-path /health
# Create listeners
aws elbv2 create-listener \
--load-balancer-arn arn:aws:elasticloadbalancing:... \
--protocol HTTP \
--port 80 \
--default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...
| Variable | Description | Default | Required |
|---|---|---|---|
| NODE_ENV | Node environment | production | Yes |
| PORT | Server port | 5000 | No |
| OPENAI_API_KEY | OpenAI API key | - | Yes |
| ANTHROPIC_API_KEY | Anthropic API key | - | Yes |
| GOOGLE_API_KEY | Google API key | - | Yes |
| MONGODB_URI | MongoDB connection string | - | Yes |
| REDIS_URL | Redis connection string | - | Yes |
| Variable | Description | Default | Required |
|---|---|---|---|
| DOCUTHINKER_SYNC_GRAPH | Enable Neo4j sync | true | No |
| DOCUTHINKER_SYNC_VECTOR | Enable ChromaDB sync | true | No |
| OPENAI_API_KEY | OpenAI API key | - | Yes |
| ANTHROPIC_API_KEY | Anthropic API key | - | Yes |
| GOOGLE_API_KEY | Google API key | - | Yes |
| DOCUTHINKER_NEO4J_URI | Neo4j connection URI | - | Yes* |
| DOCUTHINKER_CHROMA_DIR | ChromaDB directory | /mnt/efs/chroma | Yes* |
* Required if corresponding sync flag is enabled
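The conditional requirement in the footnote can be checked at startup. The following TypeScript sketch lists missing AI/ML service variables. It assumes the sync flags are compared against the literal string `"true"`, which is a guess about how the service parses them, and the function name `missingAimlVars` is hypothetical.

```typescript
type Env = Record<string, string | undefined>;

// Returns the names of AI/ML service variables that are required but unset.
function missingAimlVars(env: Env): string[] {
  const missing: string[] = [];
  for (const key of ["OPENAI_API_KEY", "ANTHROPIC_API_KEY", "GOOGLE_API_KEY"]) {
    if (!env[key]) {
      missing.push(key);
    }
  }
  // Both sync flags default to "true" according to the table.
  // Assumption: only the exact string "true" enables a flag.
  const syncGraph = (env["DOCUTHINKER_SYNC_GRAPH"] ?? "true") === "true";
  const syncVector = (env["DOCUTHINKER_SYNC_VECTOR"] ?? "true") === "true";
  if (syncGraph && !env["DOCUTHINKER_NEO4J_URI"]) {
    missing.push("DOCUTHINKER_NEO4J_URI");
  }
  // DOCUTHINKER_CHROMA_DIR has a default (/mnt/efs/chroma), so only an
  // explicitly empty value is treated as missing.
  if (syncVector && env["DOCUTHINKER_CHROMA_DIR"] === "") {
    missing.push("DOCUTHINKER_CHROMA_DIR");
  }
  return missing;
}
```

Because graph sync is on by default, an environment that sets only the three API keys still reports `DOCUTHINKER_NEO4J_URI` as missing unless `DOCUTHINKER_SYNC_GRAPH` is set to `false`.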
aws/infrastructure/cdk-app.ts:
const app = new cdk.App();
new DocuThinkerStack(app, "DocuThinkerStack", {
env: {
account: process.env.CDK_DEFAULT_ACCOUNT,
region: process.env.CDK_DEFAULT_REGION ?? "us-east-1",
},
// Custom configurations
desiredCount: 2,
cpu: 512,
memory: 1024,
backendPort: 5000,
aimlPort: 8000,
});
aws/cloudformation/fargate-service.yaml parameters:
Parameters:
VpcId: vpc-xxxxx
PublicSubnets: subnet-xxxxx,subnet-yyyyy
ClusterName: docuthinker-cluster
BackendImage: ACCOUNT.dkr.ecr.REGION.amazonaws.com/docuthinker-backend:latest
AIMLImage: ACCOUNT.dkr.ecr.REGION.amazonaws.com/docuthinker-ai-ml:latest
ContainerPort: 5000
AIMLPort: 8000
DesiredCount: 2
CPU: 512
Memory: 1024
SecretsPrefix: docuthinker
# Create log groups
aws logs create-log-group --log-group-name /ecs/docuthinker-backend
aws logs create-log-group --log-group-name /ecs/docuthinker-ai-ml
# Set retention
aws logs put-retention-policy \
--log-group-name /ecs/docuthinker-backend \
--retention-in-days 30
# Stream logs
aws logs tail /ecs/docuthinker-backend --follow
# Query logs
aws logs filter-log-events \
--log-group-name /ecs/docuthinker-backend \
--filter-pattern "ERROR"
| Metric | Description | Threshold |
|---|---|---|
| CPUUtilization | Task CPU usage | > 80% |
| MemoryUtilization | Task memory usage | > 85% |
| TargetResponseTime | ALB response time | > 1000ms |
| HealthyHostCount | Healthy targets | < 1 |
| UnHealthyHostCount | Unhealthy targets | > 0 |
| RequestCount | Total requests | - |
| HTTPCode_Target_4XX_Count | Client errors | > 100/min |
| HTTPCode_Target_5XX_Count | Server errors | > 10/min |
# High CPU alarm
aws cloudwatch put-metric-alarm \
--alarm-name docuthinker-high-cpu \
--alarm-description "Alert when CPU exceeds 80%" \
--metric-name CPUUtilization \
--namespace AWS/ECS \
--statistic Average \
--period 300 \
--threshold 80 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--dimensions Name=ServiceName,Value=backend-service Name=ClusterName,Value=docuthinker-cluster
# High error rate alarm
aws cloudwatch put-metric-alarm \
--alarm-name docuthinker-high-errors \
--alarm-description "Alert when 5XX errors exceed 10/min" \
--metric-name HTTPCode_Target_5XX_Count \
--namespace AWS/ApplicationELB \
--statistic Sum \
--period 60 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--dimensions Name=LoadBalancer,Value=app/docuthinker-alb/xxxxx
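The `--threshold`, `--comparison-operator`, and `--evaluation-periods` options combine as follows: with `GreaterThanThreshold`, the alarm fires once the required number of consecutive periods all exceed the threshold. The TypeScript sketch below is a simplified model of that rule, not CloudWatch's implementation. It ignores missing-data handling and "M out of N" datapoint settings, and `isInAlarm` is a hypothetical name.

```typescript
// Simplified model of a GreaterThanThreshold alarm: ALARM when each of the
// most recent `evaluationPeriods` datapoints is strictly above the threshold.
function isInAlarm(
  datapoints: number[],      // one aggregated value per period, oldest first
  threshold: number,
  evaluationPeriods: number
): boolean {
  if (evaluationPeriods < 1 || datapoints.length < evaluationPeriods) {
    return false; // not enough data to evaluate in this simplified model
  }
  return datapoints.slice(-evaluationPeriods).every((v) => v > threshold);
}
```

For the CPU alarm above (threshold 80, two evaluation periods of 300 seconds), two consecutive five-minute averages above 80% are needed, so a single spike does not trigger it.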
Enable Container Insights for enhanced monitoring:
aws ecs update-cluster-settings \
--cluster docuthinker-cluster \
--settings name=containerInsights,value=enabled
Enable AWS X-Ray for distributed tracing:
# Add X-Ray daemon to task definition
{
"name": "xray-daemon",
"image": "amazon/aws-xray-daemon",
"cpu": 32,
"memoryReservation": 256,
"portMappings": [{
"containerPort": 2000,
"protocol": "udp"
}]
}
Allows ECS to pull images and write logs:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"ecr:GetAuthorizationToken",
"ecr:BatchCheckLayerAvailability",
"ecr:GetDownloadUrlForLayer",
"ecr:BatchGetImage",
"logs:CreateLogStream",
"logs:PutLogEvents",
"secretsmanager:GetSecretValue"
],
"Resource": "*"
}
]
}
Allows containers to access AWS services:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:GetObject",
"s3:PutObject",
"dynamodb:Query",
"dynamodb:PutItem",
"sqs:SendMessage"
],
"Resource": [
"arn:aws:s3:::docuthinker-*/*",
"arn:aws:dynamodb:*:*:table/docuthinker-*",
"arn:aws:sqs:*:*:docuthinker-*"
]
}
]
}
# Allow inbound HTTP/HTTPS
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxx \
--protocol tcp \
--port 80 \
--cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress \
--group-id sg-xxxxx \
--protocol tcp \
--port 443 \
--cidr 0.0.0.0/0
# Allow inbound from ALB
aws ec2 authorize-security-group-ingress \
--group-id sg-yyyyy \
--protocol tcp \
--port 5000 \
--source-group sg-xxxxx
aws ec2 authorize-security-group-ingress \
--group-id sg-yyyyy \
--protocol tcp \
--port 8000 \
--source-group sg-xxxxx
Enable encryption at rest for Secrets Manager:
aws secretsmanager create-secret \
--name docuthinker/openai \
--secret-string "sk-..." \
--kms-key-id arn:aws:kms:REGION:ACCOUNT:key/KEY-ID
aws acm request-certificate \
--domain-name api.docuthinker.ai \
--validation-method DNS \
--subject-alternative-names "*.api.docuthinker.ai"
aws elbv2 create-listener \
--load-balancer-arn arn:aws:elasticloadbalancing:... \
--protocol HTTPS \
--port 443 \
--certificates CertificateArn=arn:aws:acm:... \
--default-actions Type=forward,TargetGroupArn=arn:aws:elasticloadbalancing:...
| Service | Configuration | Monthly Cost |
|---|---|---|
| ECS Fargate | 4 tasks × 0.5 vCPU × 1 GB × 730 hours | ~$70 |
| Application Load Balancer | 1 ALB + 2 LCUs | ~$30 |
| NAT Gateway | 2 NAT gateways + data transfer | ~$90 |
| CloudWatch Logs | 10 GB ingestion + storage | ~$15 |
| ECR Storage | 20 GB storage | ~$2 |
| Secrets Manager | 10 secrets | ~$4 |
| Data Transfer | 100 GB out | ~$9 |
| Total | | ~$220/month |
* Excludes external services (MongoDB Atlas, Neo4j Aura, etc.)
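The Fargate line item follows from per-vCPU-hour and per-GB-hour pricing. The TypeScript sketch below reproduces the arithmetic. The rates are assumed us-east-1 Linux/x86 on-demand prices and may be out of date, so treat them as placeholders and check current AWS pricing. The function name is hypothetical.

```typescript
interface FargateRates {
  perVcpuHour: number; // USD per vCPU-hour
  perGbHour: number;   // USD per GB-hour of memory
}

// Monthly compute cost for a fleet of identical Fargate tasks.
function fargateMonthlyCost(
  tasks: number,
  vcpu: number,
  memoryGb: number,
  hours: number,
  rates: FargateRates
): number {
  return tasks * hours * (vcpu * rates.perVcpuHour + memoryGb * rates.perGbHour);
}

// ASSUMPTION: illustrative rates only; verify against the AWS pricing page.
const assumedRates: FargateRates = { perVcpuHour: 0.04048, perGbHour: 0.004445 };

// 4 tasks x 0.5 vCPU x 1 GB x 730 hours, as in the table.
const fargateEstimate = fargateMonthlyCost(4, 0.5, 1, 730, assumedRates);

// Sum of the rounded line items from the table.
const lineItems = [70, 30, 90, 15, 2, 4, 9];
const tableTotal = lineItems.reduce((sum, x) => sum + x, 0);
```

Under these assumed rates the Fargate estimate comes to roughly $72, consistent with the table's ~$70, and the rounded line items sum to the stated $220.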
# Switch to EC2 launch type with Spot
# Save up to 70% on compute costs
# Right-size CPU and memory
CPU: 256 (instead of 512) = 50% savings
Memory: 512 MB (instead of 1024 MB) = 50% savings
aws ecs create-service \
--capacity-provider-strategy \
capacityProvider=FARGATE_SPOT,weight=1,base=0
Only run necessary capacity:
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/docuthinker-cluster/backend-service \
--min-capacity 1 \
--max-capacity 4
Eliminate NAT Gateway costs for AWS service traffic:
aws ec2 create-vpc-endpoint \
--vpc-id vpc-xxxxx \
--service-name com.amazonaws.us-east-1.s3 \
--route-table-ids rtb-xxxxx
Reduce log retention period:
aws logs put-retention-policy \
--log-group-name /ecs/docuthinker-backend \
--retention-in-days 7 # Instead of 30
For predictable workloads, consider a Compute Savings Plan. Fargate has no reserved instances, but Savings Plans apply to Fargate usage.
# Register scalable target
aws application-autoscaling register-scalable-target \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/docuthinker-cluster/backend-service \
--min-capacity 2 \
--max-capacity 10
# Target tracking scaling policy (CPU)
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/docuthinker-cluster/backend-service \
--policy-name cpu-target-tracking \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 70.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageCPUUtilization"
},
"ScaleInCooldown": 300,
"ScaleOutCooldown": 60
}'
# Target tracking scaling policy (Memory)
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/docuthinker-cluster/backend-service \
--policy-name memory-target-tracking \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 80.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ECSServiceAverageMemoryUtilization"
}
}'
# Request count scaling
aws application-autoscaling put-scaling-policy \
--service-namespace ecs \
--scalable-dimension ecs:service:DesiredCount \
--resource-id service/docuthinker-cluster/backend-service \
--policy-name request-count-scaling \
--policy-type TargetTrackingScaling \
--target-tracking-scaling-policy-configuration '{
"TargetValue": 1000.0,
"PredefinedMetricSpecification": {
"PredefinedMetricType": "ALBRequestCountPerTarget",
"ResourceLabel": "app/docuthinker-alb/xxxxx/targetgroup/backend-tg/yyyyy"
}
}'
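To build intuition for how a target value interacts with `--min-capacity` and `--max-capacity`, the sketch below models scaling as a simple proportional rule clamped to the registered bounds. This is an illustration only: AWS does not publish the exact target tracking algorithm, cooldowns are ignored, and `proportionalDesiredCount` is a hypothetical name.

```typescript
// Illustrative proportional model (NOT the actual AWS algorithm):
// scale the task count by metric/target, then clamp to [min, max].
function proportionalDesiredCount(
  currentCount: number,
  metricValue: number, // e.g. average CPU utilization in percent
  targetValue: number, // e.g. 70 for the CPU policy above
  minCapacity: number,
  maxCapacity: number
): number {
  if (targetValue <= 0) {
    throw new Error("targetValue must be positive");
  }
  const raw = Math.ceil(currentCount * (metricValue / targetValue));
  return Math.min(maxCapacity, Math.max(minCapacity, raw));
}
```

In this model, 2 tasks at 90% CPU against a 70% target suggest 3 tasks, while 2 tasks at 35% stay at 2 because the minimum capacity registered above prevents scaling in any further.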
Update task definition with more resources:
# Update to 1 vCPU and 2 GB memory
aws ecs register-task-definition \
--family docuthinker-backend \
--cpu 1024 \
--memory 2048 \
--container-definitions file://container-def.json
# Update service
aws ecs update-service \
--cluster docuthinker-cluster \
--service backend-service \
--task-definition docuthinker-backend:2 \
--force-new-deployment
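Fargate accepts only specific CPU and memory pairings, so a resize fails at registration time if the pair is invalid. The TypeScript sketch below encodes the pairings for task sizes up to 4 vCPU (larger sizes exist but are omitted here). The function names are hypothetical, and the table should be checked against the current ECS task size documentation.

```typescript
// Memory values in MiB from `fromGb` to `toGb` inclusive, in 1 GB steps.
function gbRange(fromGb: number, toGb: number): number[] {
  const values: number[] = [];
  for (let gb = fromGb; gb <= toGb; gb++) {
    values.push(gb * 1024);
  }
  return values;
}

// Valid Fargate memory settings (MiB) for a given task CPU value (CPU units).
function validFargateMemory(cpu: number): number[] {
  switch (cpu) {
    case 256:
      return [512, 1024, 2048];
    case 512:
      return gbRange(1, 4);
    case 1024:
      return gbRange(2, 8);
    case 2048:
      return gbRange(4, 16);
    case 4096:
      return gbRange(8, 30);
    default:
      return []; // unknown or unsupported CPU value in this sketch
  }
}

function isValidFargateSize(cpu: number, memory: number): boolean {
  return validFargateMemory(cpu).indexOf(memory) !== -1;
}
```

All sizes used in this guide pass the check: 256/512, 512/1024, and 1024/2048. A pairing such as 256 CPU units with 4096 MiB does not.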
aws elbv2 modify-target-group-attributes \
--target-group-arn arn:aws:elasticloadbalancing:... \
--attributes Key=deregistration_delay.timeout_seconds,Value=30
aws elbv2 modify-target-group-attributes \
--target-group-arn arn:aws:elasticloadbalancing:... \
--attributes Key=stickiness.enabled,Value=true Key=stickiness.type,Value=lb_cookie
Task definitions are versioned automatically:
# List all versions
aws ecs list-task-definitions --family-prefix docuthinker
# Rollback to previous version
aws ecs update-service \
--cluster docuthinker-cluster \
--service backend-service \
--task-definition docuthinker-backend:1
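Because revisions are sequential integers, a rollback target can be derived from the currently deployed `family:revision` string. The TypeScript sketch below uses a hypothetical helper, `previousRevision`, and makes a naive assumption: that the immediately preceding revision is still registered and was known to be good.

```typescript
// Given "family:revision" (or a full task definition ARN ending in
// ":revision"), return the identifier of the preceding revision.
// ASSUMPTION: the previous revision still exists and is a good rollback target.
function previousRevision(taskDefinition: string): string {
  const idx = taskDefinition.lastIndexOf(":");
  const family = taskDefinition.slice(0, idx);
  const revision = Number(taskDefinition.slice(idx + 1));
  if (idx <= 0 || !Number.isInteger(revision) || revision < 2) {
    throw new Error(`no previous revision for "${taskDefinition}"`);
  }
  return `${family}:${revision - 1}`;
}
```

The result can be passed to `aws ecs update-service --task-definition`. In practice, it is safer to record the last known-good revision at deploy time than to assume it is always the current revision minus one.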
# Enable cross-region replication
# Images are automatically replicated to backup region
# Manual backup
docker pull ACCOUNT.dkr.ecr.us-east-1.amazonaws.com/docuthinker-backend:latest
docker tag ... backup-registry/...
docker push ...
# Export CloudFormation stack
aws cloudformation get-template \
--stack-name docuthinker-stack \
--query TemplateBody > backup.yaml
# Export a secret value (the output is plaintext; encrypt the file or store it securely)
aws secretsmanager get-secret-value \
--secret-id docuthinker/openai \
--query SecretString > secrets-backup.json
# Deploy to secondary region
export AWS_REGION=us-west-2
cdk deploy DocuThinkerStack --context env=dr
# Configure Route 53 failover
aws route53 change-resource-record-sets \
--hosted-zone-id ZXXXXX \
--change-batch file://failover-config.json
failover-config.json:
{
"Changes": [{
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.docuthinker.ai",
"Type": "A",
"SetIdentifier": "Primary",
"Failover": "PRIMARY",
"AliasTarget": {
"HostedZoneId": "Z35SXDOTRQ7X7K",
"DNSName": "primary-alb.us-east-1.elb.amazonaws.com",
"EvaluateTargetHealth": true
}
}
}, {
"Action": "CREATE",
"ResourceRecordSet": {
"Name": "api.docuthinker.ai",
"Type": "A",
"SetIdentifier": "Secondary",
"Failover": "SECONDARY",
"AliasTarget": {
"HostedZoneId": "Z1H1FL5HABSF5",
"DNSName": "secondary-alb.us-west-2.elb.amazonaws.com",
"EvaluateTargetHealth": true
}
}
}]
}
| Scenario | RTO | RPO | Recovery Steps |
|---|---|---|---|
| Single Task Failure | < 1 min | 0 | Auto-restart by ECS |
| AZ Failure | < 2 min | 0 | Auto-failover to other AZ |
| Region Failure | < 15 min | < 5 min | Manual Route 53 failover |
| Complete Disaster | < 1 hour | < 15 min | Redeploy from backup |
Symptoms: Tasks keep restarting
Diagnosis:
# Check service events
aws ecs describe-services \
--cluster docuthinker-cluster \
--services backend-service
# Check task logs
aws logs tail /ecs/docuthinker-backend --follow
# Describe stopped tasks
aws ecs describe-tasks \
--cluster docuthinker-cluster \
--tasks task-id
Common Causes:
Diagnosis:
# Check metrics
aws cloudwatch get-metric-statistics \
--namespace AWS/ECS \
--metric-name CPUUtilization \
--dimensions Name=ServiceName,Value=backend-service Name=ClusterName,Value=docuthinker-cluster \
--start-time 2024-01-01T00:00:00Z \
--end-time 2024-01-01T23:59:59Z \
--period 3600 \
--statistics Average
Solutions:
Diagnosis:
# Check target health
aws elbv2 describe-target-health \
--target-group-arn arn:aws:elasticloadbalancing:...
# Check ALB access logs
aws s3 ls s3://alb-logs-bucket/
Solutions:
- Confirm the /health endpoint is responding

Error: Unable to fetch secrets from Secrets Manager
Diagnosis:
# Verify secret exists
aws secretsmanager describe-secret --secret-id docuthinker/openai
# Check IAM permissions
aws iam get-role-policy --role-name ecsTaskExecutionRole --policy-name SecretsAccess
Solutions:
- Grant secretsmanager:GetSecretValue to the task execution role

Error: CannotPullContainerError
Diagnosis:
# Verify image exists
aws ecr describe-images --repository-name docuthinker-backend
# Check ECR permissions
aws ecr get-repository-policy --repository-name docuthinker-backend
Solutions:
# Open an interactive shell in a running container (ECS Exec; requires enableExecuteCommand on the service)
aws ecs execute-command \
--cluster docuthinker-cluster \
--task task-id \
--container backend \
--interactive \
--command "/bin/bash"
# View task details
aws ecs describe-tasks \
--cluster docuthinker-cluster \
--tasks task-id \
--include TAGS
# Check service scaling activity
aws application-autoscaling describe-scaling-activities \
--service-namespace ecs \
--resource-id service/docuthinker-cluster/backend-service
# Force new deployment
aws ecs update-service \
--cluster docuthinker-cluster \
--service backend-service \
--force-new-deployment
✅ Always use IaC (CDK or CloudFormation)
✅ Version control all infrastructure code
✅ Peer review infrastructure changes
✅ Test infrastructure changes in dev/staging first
✅ Document all infrastructure decisions

✅ Use specific image tags (not latest)
✅ Scan images for vulnerabilities
✅ Minimize image size (multi-stage builds)
✅ Run as non-root user
✅ Use health checks for all containers
✅ Implement graceful shutdown

✅ Principle of least privilege for IAM roles
✅ Rotate secrets regularly
✅ Enable encryption at rest and in transit
✅ Use private subnets for ECS tasks
✅ Enable VPC Flow Logs
✅ Implement WAF rules
✅ Regular security audits

✅ Set up CloudWatch alarms for critical metrics
✅ Enable Container Insights
✅ Implement distributed tracing (X-Ray)
✅ Monitor costs with AWS Cost Explorer
✅ Set up log aggregation
✅ Create dashboards for key metrics

✅ Blue/Green deployments for zero downtime
✅ Automated rollback on failures
✅ Gradual rollout (canary deployments)
✅ Integration tests before production
✅ Backup before deployment
Jenkinsfile:
pipeline {
agent any
environment {
AWS_REGION = 'us-east-1'
ECR_REGISTRY = "${AWS_ACCOUNT_ID}.dkr.ecr.${AWS_REGION}.amazonaws.com"
BACKEND_REPO = 'docuthinker-backend'
AIML_REPO = 'docuthinker-ai-ml'
}
stages {
stage('Checkout') {
steps {
checkout scm
}
}
stage('Test') {
parallel {
stage('Backend Tests') {
steps {
dir('backend') {
sh 'npm test'
}
}
}
stage('AI/ML Tests') {
steps {
dir('ai_ml') {
sh 'pytest'
}
}
}
}
}
stage('Build Images') {
parallel {
stage('Backend Image') {
steps {
script {
sh """
cd backend
docker build -t ${BACKEND_REPO}:${BUILD_NUMBER} .
docker tag ${BACKEND_REPO}:${BUILD_NUMBER} ${ECR_REGISTRY}/${BACKEND_REPO}:latest
"""
}
}
}
stage('AI/ML Image') {
steps {
script {
sh """
cd ai_ml
docker build -t ${AIML_REPO}:${BUILD_NUMBER} .
docker tag ${AIML_REPO}:${BUILD_NUMBER} ${ECR_REGISTRY}/${AIML_REPO}:latest
"""
}
}
}
}
}
stage('Push to ECR') {
steps {
script {
sh """
aws ecr get-login-password --region ${AWS_REGION} | docker login --username AWS --password-stdin ${ECR_REGISTRY}
docker push ${ECR_REGISTRY}/${BACKEND_REPO}:latest
docker push ${ECR_REGISTRY}/${AIML_REPO}:latest
"""
}
}
}
stage('Deploy to ECS') {
steps {
script {
sh """
aws ecs update-service --cluster docuthinker-cluster --service backend-service --force-new-deployment
aws ecs update-service --cluster docuthinker-cluster --service aiml-service --force-new-deployment
"""
}
}
}
stage('Verify Deployment') {
steps {
script {
sh """
aws ecs wait services-stable --cluster docuthinker-cluster --services backend-service aiml-service
"""
}
}
}
}
post {
success {
echo 'Deployment successful!'
}
failure {
echo 'Deployment failed. Rolling back...'
// ECS has no ":previous" alias. Replace PREVIOUS_REVISION with the
// last known-good task definition revision number.
sh """
aws ecs update-service --cluster docuthinker-cluster --service backend-service --task-definition docuthinker-backend:PREVIOUS_REVISION
aws ecs update-service --cluster docuthinker-cluster --service aiml-service --task-definition docuthinker-ai-ml:PREVIOUS_REVISION
"""
}
}
}
.github/workflows/deploy.yml:
name: Deploy to AWS ECS
on:
  push:
    branches: [ main ]
env:
  AWS_REGION: us-east-1
  ECR_REPOSITORY_BACKEND: docuthinker-backend
  ECR_REPOSITORY_AIML: docuthinker-ai-ml
  ECS_CLUSTER: docuthinker-cluster
jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v3
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v2
        with:
          # Secret names follow the common convention; adjust to match your repository secrets
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ env.AWS_REGION }}
      - name: Login to Amazon ECR
        id: login-ecr
        uses: aws-actions/amazon-ecr-login@v1
      - name: Build, tag, and push backend image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          cd backend
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY_BACKEND:$IMAGE_TAG .
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY_BACKEND:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY_BACKEND:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY_BACKEND:$IMAGE_TAG
          docker push $ECR_REGISTRY/$ECR_REPOSITORY_BACKEND:latest
      - name: Build, tag, and push AI/ML image
        env:
          ECR_REGISTRY: ${{ steps.login-ecr.outputs.registry }}
          IMAGE_TAG: ${{ github.sha }}
        run: |
          cd ai_ml
          docker build -t $ECR_REGISTRY/$ECR_REPOSITORY_AIML:$IMAGE_TAG .
          docker tag $ECR_REGISTRY/$ECR_REPOSITORY_AIML:$IMAGE_TAG $ECR_REGISTRY/$ECR_REPOSITORY_AIML:latest
          docker push $ECR_REGISTRY/$ECR_REPOSITORY_AIML:$IMAGE_TAG
          docker push $ECR_REGISTRY/$ECR_REPOSITORY_AIML:latest
      - name: Deploy to ECS
        run: |
          aws ecs update-service --cluster $ECS_CLUSTER --service backend-service --force-new-deployment
          aws ecs update-service --cluster $ECS_CLUSTER --service aiml-service --force-new-deployment
      - name: Wait for services to stabilize
        run: |
          aws ecs wait services-stable --cluster $ECS_CLUSTER --services backend-service aiml-service
# Build new images
docker build -t docuthinker-backend:v2.0 backend/
docker build -t docuthinker-ai-ml:v2.0 ai_ml/
# Push to ECR
docker tag docuthinker-backend:v2.0 ${ECR_REGISTRY}/docuthinker-backend:v2.0
docker push ${ECR_REGISTRY}/docuthinker-backend:v2.0
# Update service
aws ecs update-service \
--cluster docuthinker-cluster \
--service backend-service \
--force-new-deployment
# Using CDK
cd aws/infrastructure
npm install # Update dependencies
cdk diff # Preview changes
cdk deploy # Apply changes
# Using CloudFormation
aws cloudformation update-stack \
--stack-name docuthinker-stack \
--template-body file://aws/cloudformation/fargate-service.yaml \
--parameters file://parameters.json
A: Use CloudWatch Logs:
aws logs tail /ecs/docuthinker-backend --follow
A: Update desired count:
aws ecs update-service \
--cluster docuthinker-cluster \
--service backend-service \
--desired-count 4
A: Yes, change launch type to EC2 and manage EC2 instances yourself. Fargate is recommended for ease of management.
A: Use AWS CodeDeploy with ECS:
aws deploy create-deployment \
--application-name docuthinker-app \
--deployment-group-name docuthinker-dg \
--revision file://appspec.yaml
A: Approximately $220/month for the AWS infrastructure (see Cost Optimization section). This excludes external services like MongoDB Atlas and Neo4j Aura.
A: Request an ACM certificate and add HTTPS listener to ALB (see SSL/TLS Configuration).
A: Yes, deploy the CDK/CloudFormation stack in each region and configure Route 53 for failover (see Multi-Region Deployment).
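A: Use ECS Exec to open an interactive shell in the container. It connects through AWS Systems Manager rather than SSH and must be enabled on the service: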
aws ecs execute-command \
--cluster docuthinker-cluster \
--task task-id \
--container backend \
--interactive \
--command "/bin/bash"
Made with ❤️ by Son Nguyen