DocuThinker-AI-App

DocuThinker Architecture Documentation


Overview

DocuThinker is an enterprise-grade, full-stack AI-powered document analysis and summarization application built using the FERN Stack (Firebase, Express, React, Node.js). The application has been enhanced with a comprehensive DevOps platform featuring 15+ production-ready components for security, reliability, and scalability.

Core Features

DevOps Platform Features


System Architecture

The application follows a cloud-native, microservices-oriented architecture with enterprise-grade DevOps components across multiple layers.

graph TB
    subgraph "Client Layer"
        A[Web Browser]
        B[Mobile App - React Native]
        C[VS Code Extension]
    end

    subgraph "Edge & Security Layer"
        D[CloudFront CDN]
        E[AWS WAF]
        F[cert-manager<br/>Auto TLS]
    end

    subgraph "Ingress Layer - Istio Service Mesh"
        G[Istio Ingress Gateway<br/>mTLS + Traffic Management]
        H[Istio Egress Gateway<br/>Controlled External Access]
    end

    subgraph "Policy & Security"
        I[OPA Gatekeeper<br/>Admission Control]
        J[Falco<br/>Runtime Security]
        K[Network Policies]
    end

    subgraph "Application Layer - Service Mesh"
        L[React Frontend<br/>+ Envoy Sidecar]
        M[Express Backend<br/>+ Envoy Sidecar]
        N[GraphQL API<br/>+ Envoy Sidecar]
    end

    subgraph "Progressive Delivery"
        O[Flagger<br/>Automated Canary]
        P[Canary Analysis]
    end

    subgraph "Service Layer"
        Q[Auth Service]
        R[Document Service]
        S[AI/ML Service]
        T[Analytics Service]
    end

    subgraph "Data Layer"
        U[(Firebase Auth)]
        V[(Firestore)]
        W[(MongoDB Atlas)]
        X[(Redis Cache)]
    end

    subgraph "Observability Platform"
        Y[OpenTelemetry Collector]
        Z[Prometheus + SLO/SLI]
        AA[Grafana Dashboards]
        AB[Jaeger Tracing]
        AC[ELK Stack]
        CX[Coralogix<br/>Unified Observability]
    end

    subgraph "Reliability Engineering"
        AD[Litmus Chaos<br/>Resilience Testing]
        AE[Velero<br/>Backup & DR]
    end

    subgraph "Autoscaling"
        AF[KEDA<br/>Event-Driven HPA]
        AG[HPA<br/>CPU/Memory Based]
    end

    subgraph "External Services"
        AH[Google Cloud NLP]
        AI[Google AI Studio]
        AJ[LangChain]
        AK[RabbitMQ]
    end

    A --> D
    B --> E
    C --> D
    D --> E
    E --> F
    F --> G

    G -.->|Policy Check| I
    I -.->|Validate| G
    G --> L
    G --> M
    G --> N

    L --> Q
    M --> R
    M --> S
    N --> T

    O -.->|Monitor| M
    P -.->|Analyze| O

    Q --> U
    R --> V
    R --> W
    S --> AH
    S --> AI
    S --> AJ
    M --> X
    M --> AK

    L -.->|Traces| Y
    M -.->|Traces| Y
    N -.->|Traces| Y

    Y --> AB
    Y --> Z
    Z --> AA
    M -.->|Logs| AC

    J -.->|Monitor| M
    AD -.->|Test| M
    AE -.->|Backup| V

    AF -.->|Scale| M
    AG -.->|Scale| L

Production DevOps Platform

Complete enterprise DevOps infrastructure with 15 integrated components.

graph TB
    subgraph "Ingress & Traffic Management"
        SM[Istio Service Mesh<br/>mTLS + Circuit Breaking]
        IG[Istio Ingress Gateway]
        TLS[cert-manager<br/>Automated TLS]
    end

    subgraph "Security & Governance"
        OPA[OPA Gatekeeper<br/>Policy Enforcement]
        FALCO[Falco<br/>Runtime Security]
        VAULT[HashiCorp Vault<br/>Secrets Management]
    end

    subgraph "Applications"
        FE[Frontend Pods]
        BE[Backend Pods]
        DB[(Databases)]
    end

    subgraph "Observability Stack"
        OTEL[OpenTelemetry<br/>Distributed Tracing]
        PROM[Prometheus<br/>Metrics + SLO/SLI]
        GRAF[Grafana<br/>Visualization]
        JAEGER[Jaeger<br/>Trace Analysis]
        ELK[ELK Stack<br/>Log Aggregation]
    end

    subgraph "Reliability Engineering"
        LITMUS[Litmus Chaos<br/>Resilience Testing]
        FLAGGER[Flagger<br/>Progressive Delivery]
        VELERO[Velero<br/>Backup & DR]
    end

    subgraph "Autoscaling & Optimization"
        KEDA[KEDA<br/>Event-Driven Scaling]
        HPA[HPA<br/>Resource-Based Scaling]
    end

    subgraph "Testing & Validation"
        K6[K6 Load Testing<br/>6 Scenarios]
        TERRATEST[Terratest<br/>Infrastructure Tests]
        FLYWAY[Flyway<br/>DB Migrations]
    end

    TLS --> IG
    IG --> SM
    SM --> OPA
    OPA --> FE
    OPA --> BE

    FALCO -.->|Monitor| BE
    VAULT -.->|Secrets| BE

    FE --> DB
    BE --> DB

    BE -.->|Traces| OTEL
    OTEL --> JAEGER
    OTEL --> PROM
    PROM --> GRAF
    BE -.->|Logs| ELK

    LITMUS -.->|Test| BE
    FLAGGER -.->|Canary| BE
    VELERO -.->|Backup| DB

    KEDA -.->|Scale| BE
    HPA -.->|Scale| FE

    K6 -.->|Test| IG

Frontend Architecture

The frontend is built with React 18 and follows a component-based architecture with modern React patterns.

graph TB
    subgraph "Frontend Application"
        A[App.js - Root Component]

        subgraph "Routing Layer"
            B[React Router v6]
            C[Protected Routes]
            D[Public Routes]
        end

        subgraph "State Management"
            E[Context API]
            F[Local Storage]
            G[Session Storage]
        end

        subgraph "UI Components"
            H[Material-UI Components]
            I[Custom Components]
            J[Styled Components]
        end

        subgraph "Pages"
            K[Landing Page]
            L[Home Page]
            M[Documents Page]
            N[Profile Page]
            O[Analytics Page]
        end

        subgraph "Services"
            P[API Service]
            Q[Auth Service]
            R[Storage Service]
        end

        subgraph "Utilities"
            S[Error Handling]
            T[Form Validation]
            U[Date Formatting]
        end
    end

    A --> B
    B --> C
    B --> D
    A --> E
    E --> F
    E --> G
    K --> H
    L --> H
    M --> H
    N --> H
    O --> H
    H --> I
    I --> J
    P --> Q
    P --> R
    K --> P
    L --> P
    M --> P
    N --> P
    O --> P

Backend Architecture

The backend follows the MVC (Model-View-Controller) pattern with additional service layers for business logic.

graph TB
    subgraph "Backend Architecture"
        A[Express Server]

        subgraph "Middleware Layer"
            B[CORS Middleware]
            C[Auth Middleware - JWT]
            D[Firebase Auth Middleware]
            E[Error Handler]
            F[Request Logger]
        end

        subgraph "Routes Layer"
            G[User Routes]
            H[Document Routes]
            I[AI/ML Routes]
            J[GraphQL Routes]
        end

        subgraph "Controller Layer"
            K[User Controller]
            L[Document Controller]
            M[AI Controller]
            N[Analytics Controller]
        end

        subgraph "Service Layer"
            O[User Service]
            P[Document Service]
            Q[AI/ML Service]
            R[Storage Service]
        end

        subgraph "Model Layer"
            S[User Model]
            T[Document Model]
            U[Analytics Model]
        end

        subgraph "Integration Layer"
            V[Firebase Admin SDK]
            W[Google Cloud APIs]
            X[LangChain]
            Y[Redis Client]
        end
    end

    A --> B
    A --> C
    A --> D
    A --> E
    A --> F

    B --> G
    C --> H
    D --> I

    G --> K
    H --> L
    I --> M
    J --> N

    K --> O
    L --> P
    M --> Q
    N --> R

    O --> S
    P --> T
    Q --> U

    O --> V
    P --> V
    Q --> W
    Q --> X
    P --> Y

MVC Pattern Implementation

Controllers

Controllers handle HTTP requests and responses and delegate business logic to the service layer.

// Example Controller Structure
// (service module and response-helper paths are illustrative)
const documentService = require('../services/documentService');
const { sendSuccessResponse, sendErrorResponse } = require('../utils/responses');

exports.uploadDocument = async (req, res) => {
  try {
    // 1. Parse request
    const { userId, file } = req.body;

    // 2. Validate input
    if (!file) throw new Error('No file provided');

    // 3. Call service layer
    const result = await documentService.processDocument(userId, file);

    // 4. Format response
    sendSuccessResponse(res, 200, 'Document uploaded successfully', result);
  } catch (error) {
    // 5. Handle errors
    sendErrorResponse(res, 400, 'Document upload failed', error.message);
  }
};

Services

Contain business logic and interact with data models and external APIs.

graph LR
    A[Controller] --> B[Service Layer]
    B --> C[Firebase Auth]
    B --> D[Firestore]
    B --> E[AI/ML APIs]
    B --> F[Redis Cache]
    C --> G[Response]
    D --> G
    E --> G
    F --> G
    G --> A
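
The controller example above delegates to documentService.processDocument; a hypothetical sketch of that service is shown below. Cache key format, helper names, and the dependency-injection style are illustrative assumptions, not the repository's actual modules.

```javascript
// Illustrative service: business logic only, no HTTP concerns. The cache,
// AI, and DB helpers are injected stand-ins, not the real modules.
const processDocument = async (userId, file, deps) => {
  const { cache, ai, db } = deps;

  // Serve a cached result when the same file was processed before.
  const cacheKey = `summary:${userId}:${file.name}`;
  const cached = await cache.get(cacheKey);
  if (cached) return cached;

  // Extract text, summarize, persist, then cache the result.
  const text = await ai.extractText(file);
  const summary = await ai.summarize(text);
  const record = await db.saveDocument({ userId, title: file.name, summary });
  await cache.set(cacheKey, record);
  return record;
};

module.exports = { processDocument };
```

Keeping HTTP concerns out of the service makes it reusable from REST routes, GraphQL resolvers, and queue consumers alike.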

Models

Define data schemas and database interactions.

classDiagram
    class User {
        +String id
        +String email
        +Date createdAt
        +Array documents
        +Object socialMedia
        +String theme
    }

    class Document {
        +String id
        +String userId
        +String title
        +String originalText
        +String summary
        +Array keyIdeas
        +Array discussionPoints
        +Date uploadedAt
    }

    class Analytics {
        +String userId
        +Number totalDocuments
        +Number totalSummaries
        +Object usageStats
        +Date lastAccess
    }

    User "1" --> "*" Document : has
    User "1" --> "1" Analytics : has

Agentic Orchestration Layer

The orchestrator (orchestrator/) is a standalone Node.js service (port 4000) that sits between the Express backend and the Python AI/ML services. It implements a supervisor-driven agentic architecture with circuit breakers, cost controls, context management, and MCP integration.

Orchestrator Architecture

graph TB
    subgraph "Entry Points"
        REST[REST API :4000]
        MCP_S[MCP Server<br/>13 tools / stdio]
    end

    subgraph "Supervisor Pipeline"
        CLASSIFY[1. Classify Intent<br/>route-match or LLM]
        BUDGET[2. Token Budget Check<br/>context window guard]
        DECOMPOSE[3. Decompose<br/>single-step or multi-step DAG]
        DISPATCH[4. Dispatch<br/>parallel execution with deps]
        AGGREGATE[5. Aggregate<br/>merge results + trace]
    end

    subgraph "Core Components"
        AL[Agent Loop<br/>iterative tool-use<br/>max 10 iterations]
        CB[Circuit Breaker<br/>per-provider<br/>CLOSED / OPEN / HALF_OPEN]
        CT[Cost Tracker<br/>per-model pricing<br/>daily + monthly budgets]
        BP[Batch Processor<br/>batch=10, concurrency=3]
        DLQ[Dead Letter Queue<br/>3 retries then DLQ]
        HO[Handoff Manager<br/>context serialization]
        TR[Tool Registry<br/>local + Python-bridged]
        PB[Python Bridge<br/>HTTP with circuit breaker]
    end

    subgraph "Context Management"
        TBM[Token Budget Manager<br/>7 model context windows]
        CS[Conversation Store<br/>auto-summarize at 20 msgs<br/>LRU eviction at 10K]
        OBS[Context Observability<br/>OTel metrics + alerts]
        HRAG[Hybrid RAG<br/>keyword + semantic + RRF]
    end

    subgraph "Prompt Engineering"
        SP[14 Versioned Prompts]
        ZOD[12 Zod Output Schemas]
        PC[3-Layer Prompt Cache<br/>system / document / history]
    end

    subgraph "LLM Providers"
        CLAUDE[Anthropic Claude<br/>Sonnet / Haiku / Opus]
        GEMINI[Google Gemini<br/>Pro / 1.5-Pro]
    end

    subgraph "Python AI/ML :8000"
        PY[FastAPI Service]
    end

    REST --> CLASSIFY
    CLASSIFY --> BUDGET
    BUDGET --> DECOMPOSE
    DECOMPOSE --> DISPATCH
    DISPATCH --> AGGREGATE

    DISPATCH --> AL
    DISPATCH --> BP
    AL --> TR
    TR --> PB
    PB --> PY

    AL --> CB
    CB --> CLAUDE
    CB --> GEMINI

    CT -.-> BUDGET
    TBM -.-> BUDGET
    CS -.-> AL
    OBS -.-> CT
    HO -.-> AL
    HRAG -.-> TR
    SP -.-> AL
    ZOD -.-> AGGREGATE
    PC -.-> AL

    DLQ -.-> DISPATCH
    MCP_S --> TR
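
The Hybrid RAG component in the diagram above fuses keyword and semantic rankings with Reciprocal Rank Fusion (RRF). A minimal sketch of the fusion step follows; the k = 60 constant is the conventional RRF default, not a documented setting of this codebase.

```javascript
// Reciprocal Rank Fusion: merge several ranked lists of document IDs.
// score(d) = sum over lists of 1 / (k + rank_of_d_in_that_list)
function rrfFuse(rankings, k = 60) {
  const scores = new Map();
  for (const list of rankings) {
    list.forEach((docId, idx) => {
      scores.set(docId, (scores.get(docId) ?? 0) + 1 / (k + idx + 1));
    });
  }
  // Highest fused score first.
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([docId]) => docId);
}
```

Usage: `rrfFuse([keywordHits, semanticHits])` returns one merged ranking that rewards documents appearing near the top of either list.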

Supervisor Workflow

The DocuThinkerSupervisor processes every request through a five-stage pipeline:

  1. Classify – Determines the intent from 18+ registered intents. Uses exact route matching first (/upload -> document.upload, /chat -> chat.document, etc.), then falls back to LLM classification using Claude Haiku, and finally defaults to chat.general.
  2. Budget Check – Validates the request against the model’s context window using TokenBudgetManager. Rejects requests that would overflow the available token budget.
  3. Decompose – Breaks the intent into a task DAG. Simple intents produce a single task. document.upload decomposes into three sequential tasks (extract -> summarize -> store). Batch intents produce parallel tasks.
  4. Dispatch – Executes tasks respecting dependency order. Independent tasks run in parallel via Promise.allSettled. On failure, automatically retries with an alternate provider from the intent’s provider preference list.
  5. Aggregate – Merges results from all tasks, attaches a trace ID (dt-{timestamp}-{random}), and records cost/token usage.
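
The five stages above can be condensed into a sketch like the following. The route table, context-window number, and the classifier fallback are illustrative stand-ins, not the production values.

```javascript
// Minimal sketch of the five-stage supervisor pipeline.
const ROUTE_INTENTS = { '/upload': 'document.upload', '/chat': 'chat.document' };

function classify(route, llmClassify = () => 'chat.general') {
  // 1. Exact route match first, then LLM fallback, then the default intent.
  return ROUTE_INTENTS[route] ?? llmClassify(route) ?? 'chat.general';
}

function checkBudget(tokens, contextWindow = 200000) {
  // 2. Reject requests that would overflow the model's context window.
  if (tokens > contextWindow) throw new Error('token budget exceeded');
}

function decompose(intent) {
  // 3. Simple intents are one task; document.upload becomes a sequential DAG.
  if (intent === 'document.upload') {
    return [
      { id: 'extract', deps: [] },
      { id: 'summarize', deps: ['extract'] },
      { id: 'store', deps: ['summarize'] },
    ];
  }
  return [{ id: intent, deps: [] }];
}

async function dispatch(tasks, run) {
  // 4. Execute in dependency order; independent tasks run in parallel.
  const results = {};
  let pending = [...tasks];
  while (pending.length) {
    const ready = pending.filter((t) => t.deps.every((d) => d in results));
    const settled = await Promise.allSettled(ready.map((t) => run(t)));
    ready.forEach((t, i) => { results[t.id] = settled[i]; });
    pending = pending.filter((t) => !ready.includes(t));
  }
  return results;
}

function aggregate(results) {
  // 5. Merge results and attach a trace ID.
  const traceId = `dt-${Date.now()}-${Math.random().toString(36).slice(2, 8)}`;
  return { traceId, results };
}
```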

Circuit Breaker State Diagram

stateDiagram-v2
    [*] --> CLOSED
    CLOSED --> OPEN : failures >= threshold
    OPEN --> HALF_OPEN : cooldown elapsed
    HALF_OPEN --> CLOSED : probe succeeds
    HALF_OPEN --> OPEN : probe fails
    CLOSED --> CLOSED : success (reset failures)

Each LLM provider (claude, gemini, python-ai-ml) has an independent circuit breaker. Configuration:
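
A minimal sketch of one per-provider breaker follows; the threshold and cooldown defaults shown are illustrative, not the production configuration.

```javascript
// Per-provider circuit breaker: CLOSED -> OPEN on repeated failures,
// OPEN -> HALF_OPEN after a cooldown, HALF_OPEN -> CLOSED on a good probe.
class CircuitBreaker {
  constructor({ threshold = 5, cooldownMs = 30000, now = Date.now } = {}) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
    this.now = now; // injectable clock, handy for testing
    this.state = 'CLOSED';
    this.failures = 0;
    this.openedAt = 0;
  }

  canRequest() {
    if (this.state === 'OPEN' && this.now() - this.openedAt >= this.cooldownMs) {
      this.state = 'HALF_OPEN'; // allow a single probe through
    }
    return this.state !== 'OPEN';
  }

  recordSuccess() {
    this.state = 'CLOSED';
    this.failures = 0; // success resets the failure count
  }

  recordFailure() {
    this.failures += 1;
    if (this.state === 'HALF_OPEN' || this.failures >= this.threshold) {
      this.state = 'OPEN';
      this.openedAt = this.now();
    }
  }
}
```

Because each provider gets its own instance, an Anthropic outage trips only the claude breaker while gemini traffic continues unaffected.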

Orchestrator API Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| GET | /health | Full system health: circuit breakers, costs, cache stats, DLQ, providers, tools |
| GET | /api/costs | Cost breakdown by provider and intent |
| GET | /api/circuits | Circuit breaker state per provider (state, failures, uptime %) |
| GET | /api/context-metrics | Context utilization stats, cache hit rate, per-provider breakdown |
| GET | /api/dlq | DLQ stats and last 20 dead-lettered messages |
| GET | /api/tools | Registered tool definitions and count |
| POST | /api/tools/execute | Execute a tool by name: { "tool": "...", "input": {...} } |
| POST | /api/token-check | Token budget check: { "model": "...", "systemPrompt": "...", "messages": [...] } |
| POST | /api/supervisor/process | Full supervisor pipeline: { "route": "/...", ...body } |
| POST | /api/agent/run | Agent loop: { "message": "...", "context": {...}, "provider": "claude" } |
| POST | /api/batch/process | Batch processing: { "documents": [...], "operation": "summarize" } |
| POST | /api/conversations/:userId/:documentId/message | Add message: { "role": "user", "content": "..." } |
| GET | /api/conversations/:userId/:documentId | Get conversation history and summary state |
| DELETE | /api/conversations/:userId/:documentId | Clear a conversation |

Context Management Details

Token Budget Manager tracks context windows for 7+ models (Claude 200K, Gemini 2M, GPT-4 128K) and provides a check() method that estimates token usage and returns whether the request is allowed, the utilization percentage, and a recommendation to compact if above 80%.
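
A sketch of that check under the usual chars-to-tokens approximation; the 4-chars-per-token estimate and the exact return shape are assumptions, and the window sizes mirror the examples above.

```javascript
// Rough token budget check against a model's context window.
const CONTEXT_WINDOWS = { claude: 200000, gemini: 2000000, 'gpt-4': 128000 };

function estimateTokens(text) {
  // Crude heuristic: ~4 characters per token.
  return Math.ceil(text.length / 4);
}

function check({ model, systemPrompt = '', messages = [] }) {
  const window = CONTEXT_WINDOWS[model];
  if (!window) throw new Error(`unknown model: ${model}`);
  const used = estimateTokens(systemPrompt) +
    messages.reduce((sum, m) => sum + estimateTokens(m.content), 0);
  const utilization = used / window;
  return {
    allowed: used <= window,
    utilization: Math.round(utilization * 100), // percent
    recommendation: utilization > 0.8 ? 'compact' : 'ok',
  };
}
```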

Conversation Store maintains per-user per-document conversations in memory with automatic summarization. When a conversation exceeds 20 messages, the oldest messages are summarized using Claude Haiku and replaced with a summary injection. LRU eviction kicks in at 10,000 active conversations.
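
The summarize-and-evict behavior can be sketched as below. The summarizer is a stand-in for the Claude Haiku call, and how many of the oldest messages get folded into the summary is an assumption; the thresholds come from the description above.

```javascript
// Per-(user, document) conversation store with auto-summarization and
// LRU eviction. `summarize` stands in for the Claude Haiku call.
class ConversationStore {
  constructor({ maxMessages = 20, maxConversations = 10000, summarize } = {}) {
    this.maxMessages = maxMessages;
    this.maxConversations = maxConversations;
    this.summarize = summarize ?? ((msgs) => `summary of ${msgs.length} messages`);
    this.conversations = new Map(); // insertion order doubles as LRU order
  }

  addMessage(userId, documentId, message) {
    const key = `${userId}:${documentId}`;
    const convo = this.conversations.get(key) ?? { summary: null, messages: [] };
    this.conversations.delete(key); // re-insert to mark most recently used
    convo.messages.push(message);
    if (convo.messages.length > this.maxMessages) {
      // Fold the oldest half into a summary injection.
      const old = convo.messages.splice(0, Math.floor(this.maxMessages / 2));
      convo.summary = this.summarize(old);
    }
    this.conversations.set(key, convo);
    if (this.conversations.size > this.maxConversations) {
      const lruKey = this.conversations.keys().next().value;
      this.conversations.delete(lruKey); // evict least recently used
    }
    return convo;
  }
}
```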

Prompt Cache Strategy implements Anthropic’s 3-layer caching:
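
A sketch of how the three layers might be arranged in an Anthropic Messages request. The field names follow Anthropic's prompt-caching convention of cache_control breakpoints; treat the exact layout as an assumption about this codebase, not its verbatim implementation.

```javascript
// Three cache layers: a stable system prompt, the (large) document, and the
// rolling conversation history as the uncached, fast-moving suffix.
function buildCachedRequest({ systemPrompt, documentText, history, userTurn }) {
  return {
    model: 'claude-3-5-sonnet-20241022',
    max_tokens: 1024,
    system: [
      // Layer 1: system prompt - changes rarely, cached longest.
      { type: 'text', text: systemPrompt, cache_control: { type: 'ephemeral' } },
      // Layer 2: document content - cached per document.
      { type: 'text', text: documentText, cache_control: { type: 'ephemeral' } },
    ],
    // Layer 3: conversation history - sent fresh on every turn.
    messages: [...history, { role: 'user', content: userTurn }],
  };
}
```

Ordering matters: cached layers must form a stable prefix, so the frequently changing history goes last.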


Beads Task Coordination

DocuThinker employs a Beads sub-architecture for coordinating work across multiple AI agents (or developers) operating on the same codebase concurrently. A bead is an atomic, dependency-aware task unit that any agent can claim, execute, and complete, enabling safe parallel development without merge conflicts or duplicated effort.

Beads Architecture Overview

graph TB
    subgraph "Beads Coordination Layer"
        STATUS[".beads/.status.json<br/>Agent reservations & counters"]
        TEMPLATE["Bead Templates<br/>Structured task definitions"]
        DEPS["Dependency Graph<br/>Upstream / downstream ordering"]
    end

    subgraph "Conflict Zones (single agent)"
        CZ1["docker-compose.yml"]
        CZ2["ai_ml/services/orchestrator.py"]
        CZ3["ai_ml/providers/registry.py"]
        CZ4["orchestrator/index.js"]
        CZ5["Shared config files"]
    end

    subgraph "Safe Parallel Zones"
        PZ1["Separate service directories"]
        PZ2["Independent test files"]
        PZ3["New files / new directories"]
        PZ4["Documentation files"]
    end

    subgraph "Runtime Layers"
        ORCH["Orchestrator :4000<br/>Supervisor → Agent Loop → Tools"]
        AIML["AI/ML Backend :8000<br/>RAG Pipeline → CrewAI → Stores"]
    end

    STATUS -->|reserves files in| CZ1
    STATUS -->|reserves files in| CZ2
    STATUS -->|allows parallel work| PZ1
    TEMPLATE -->|defines tasks for| ORCH
    TEMPLATE -->|defines tasks for| AIML
    DEPS -->|orders execution| TEMPLATE

Bead Lifecycle

Each bead moves through a well-defined lifecycle:

stateDiagram-v2
    [*] --> Authored: Bead created from template
    Authored --> Claimed: Agent reserves files via .status.json
    Claimed --> InProgress: Agent begins implementation
    InProgress --> Testing: Code changes complete
    Testing --> Done: Acceptance criteria pass
    Testing --> InProgress: Tests fail, iterate
    Done --> [*]: Reservations released
    InProgress --> Blocked: Upstream dependency not met
    Blocked --> InProgress: Dependency resolved

Bead Structure

Every bead follows the canonical template at .beads/templates/feature-bead.md:

| Section | Purpose |
| --- | --- |
| Background | Business or technical context for why the work exists |
| Current State | Files the agent must read before making any changes |
| Desired Outcome | A specific, testable description of the end state |
| Files to Touch | Explicit list: READ FIRST, then ENHANCE or CREATE NEW |
| Dependencies | Upstream beads (Depends on) and downstream beads (Blocks) |
| Acceptance Criteria | Checklist that must pass, always including "all existing tests still pass" |

Status Tracking

The .beads/.status.json file is the single source of truth for coordination:

{
  "version": "1.0.0",
  "agents": {
    "agent-1": { "name": "orchestrator-dev", "startedAt": "...", "currentBead": "ORCH-04" }
  },
  "reservations": {
    "orchestrator/index.js": "agent-1"
  },
  "lastUpdated": "2025-01-15T10:30:00Z",
  "beadsCompleted": 12,
  "beadsActive": 2
}
| Field | Type | Purpose |
| --- | --- | --- |
| agents | Record<string, AgentMeta> | Active agent IDs mapped to metadata (name, start time, current bead) |
| reservations | Record<string, string> | File paths mapped to the agent ID holding the reservation |
| lastUpdated | ISO 8601 / null | Timestamp of the most recent status update |
| beadsCompleted | number | Running count of successfully completed beads |
| beadsActive | number | Number of beads currently being worked on |

Conflict Zones & Safe Zones

Conflict zones are files that only one agent may reserve at a time because they are shared entry points or cross-cutting configuration:

| File | Reason |
| --- | --- |
| docker-compose.yml | Global service topology |
| ai_ml/services/orchestrator.py | Central AI/ML façade: all pipelines flow through it |
| ai_ml/providers/registry.py | LLM provider registry shared by all AI/ML components |
| orchestrator/index.js | Orchestrator entry point and route wiring |
| Shared config files | Cross-service environment and build settings |

Safe parallel zones allow multiple agents to work simultaneously because they are logically isolated:

Agent Communication Protocol

sequenceDiagram
    participant A as Agent
    participant S as .beads/.status.json
    participant C as Codebase
    participant T as Test Suite

    A->>S: 1. Read status, check for conflicts
    S-->>A: No reservation on target files
    A->>S: 2. Write reservation (agent ID + file list)
    A->>C: 3. Implement bead instructions
    loop Every 30 minutes
        A->>S: 4. Heartbeat, update lastUpdated
    end
    A->>T: 5. Run acceptance criteria tests
    T-->>A: All tests pass
    A->>S: 6. Release reservations
    A->>S: 7. Increment beadsCompleted, decrement beadsActive

Rules:

  1. Check first: always read .status.json before claiming files.
  2. Reserve explicitly: post agent ID and every file path you will modify.
  3. Heartbeat: update status every 30 minutes while actively working.
  4. Release on exit: release all reservations on completion or failure. A post-session hook (.claude/hooks/post-session.sh) auto-cleans stale reservations.
  5. Branch convention: agent/<agent-name>/<bead-id> (e.g., agent/claude/ORCH-04).
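
The claim and release steps can be sketched as pure functions over the status object; persisting the result back to .beads/.status.json is elided, and the function names are illustrative.

```javascript
// Reserve files for an agent, respecting existing reservations, and release
// them on completion. Operates on an in-memory copy of .beads/.status.json.
function claimFiles(status, agentId, files) {
  const conflict = files.find(
    (f) => status.reservations[f] && status.reservations[f] !== agentId
  );
  if (conflict) {
    throw new Error(`${conflict} already reserved by ${status.reservations[conflict]}`);
  }
  for (const f of files) status.reservations[f] = agentId;
  status.beadsActive += 1;
  status.lastUpdated = new Date().toISOString();
  return status;
}

function releaseFiles(status, agentId, { completed = true } = {}) {
  for (const [file, owner] of Object.entries(status.reservations)) {
    if (owner === agentId) delete status.reservations[file];
  }
  status.beadsActive -= 1;
  if (completed) status.beadsCompleted += 1;
  status.lastUpdated = new Date().toISOString();
  return status;
}
```

Note that the conflict check and the writes happen in one synchronous pass, which is what makes the reservation safe when agents coordinate only through the shared status file.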

Relationship to Runtime Layers

Beads operate at development time: they coordinate who changes what. The runtime architecture has analogous patterns at request time:

| Beads Concept | Runtime Analogue | Layer |
| --- | --- | --- |
| Bead (atomic task) | Intent (e.g., document.upload) | Orchestrator Supervisor |
| Dependency graph | Task DAG decomposition | Supervisor decompose() |
| File reservation | Circuit breaker per provider | Circuit Breaker |
| Agent heartbeat | Health checks & cost tracking | Cost Tracker / /health |
| Conflict zones | Mutex on shared state | Conversation Store LRU |
| Acceptance criteria | Zod output schema validation | Schema validation layer |

For the full agent protocol including branch naming and escalation, see AGENTS.md.


AI/ML Pipeline

DocuThinker’s AI/ML pipeline is a production-ready, multi-agent RAG (Retrieval-Augmented Generation) platform that orchestrates multiple LLM providers, vector stores, and knowledge graphs for comprehensive document intelligence.

AI/ML Architecture Overview

graph TB
    subgraph "Entry Points"
        CLI[CLI Interface<br/>main.py]
        API[FastAPI Server<br/>server.py]
        MCP[MCP Server<br/>mcp/server.py]
        PY[Python API<br/>backend.py]
    end

    subgraph "Service Layer"
        SERVICE[DocumentIntelligenceService<br/>services/orchestrator.py]
    end

    subgraph "Agentic Pipeline - LangGraph"
        INGEST[Ingest Node<br/>Chunking & Embedding]
        RAG_NODE[RAG Node<br/>Primary Analysis]
        CREW_NODE[Crew Node<br/>Multi-Agent Validation]
        FINAL[Finalize Node<br/>Report Assembly]
    end

    subgraph "Multi-Agent System - CrewAI"
        ANALYST[Document Analyst<br/>OpenAI GPT-4o]
        RESEARCHER[Cross-Referencer<br/>Google Gemini]
        REVIEWER[Insights Curator<br/>Anthropic Claude]
    end

    subgraph "LLM Providers"
        REGISTRY[LLMProviderRegistry<br/>providers/registry.py]
        OPENAI[OpenAI<br/>GPT-4o/GPT-4o-mini]
        ANTHROPIC[Anthropic<br/>Claude 3.5 Sonnet]
        GEMINI[Google<br/>Gemini 1.5 Pro]
    end

    subgraph "Embeddings & Tools"
        HF[HuggingFace<br/>all-MiniLM-L6-v2]
        SEARCH[DocumentSearchTool<br/>Semantic Search]
        INSIGHTS[InsightsExtractionTool<br/>Topic Extraction]
    end

    subgraph "Persistence Layer"
        FAISS[FAISS<br/>In-Memory Vector Store]
        CHROMA[ChromaDB<br/>Persistent Vector Store]
        NEO4J[Neo4j<br/>Knowledge Graph]
    end

    subgraph "Processing Features"
        SENTIMENT[Sentiment Analysis]
        TRANSLATION[Multi-Language Translation<br/>Helsinki-NLP]
        SUMMARIZE[Summarization]
        TOPIC[Topic Extraction]
        QA[Question Answering]
        REWRITE[Content Rewriting]
    end

    CLI --> SERVICE
    API --> SERVICE
    MCP --> SERVICE
    PY --> SERVICE

    SERVICE --> INGEST
    INGEST --> RAG_NODE
    RAG_NODE --> CREW_NODE
    CREW_NODE --> FINAL

    CREW_NODE --> ANALYST
    CREW_NODE --> RESEARCHER
    CREW_NODE --> REVIEWER

    ANALYST --> REGISTRY
    RESEARCHER --> REGISTRY
    REVIEWER --> REGISTRY

    REGISTRY --> OPENAI
    REGISTRY --> ANTHROPIC
    REGISTRY --> GEMINI

    RAG_NODE --> HF
    SEARCH --> FAISS
    INSIGHTS --> FAISS

    SERVICE --> CHROMA
    SERVICE --> NEO4J

    SERVICE --> SENTIMENT
    SERVICE --> TRANSLATION
    SERVICE --> SUMMARIZE
    SERVICE --> TOPIC
    SERVICE --> QA
    SERVICE --> REWRITE

    style SERVICE fill:#4CAF50,stroke:#333,stroke-width:3px,color:#fff
    style CREW_NODE fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff
    style REGISTRY fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
    style NEO4J fill:#00BFA5,stroke:#333,stroke-width:2px,color:#fff
    style CHROMA fill:#9C27B0,stroke:#333,stroke-width:2px,color:#fff

Core Components

1. DocumentIntelligenceService

The main orchestration service that coordinates all AI/ML operations.

Location: ai_ml/services/orchestrator.py

Key Responsibilities:

Key Methods:

analyze_document()      # Full pipeline execution
summarize()            # Narrative summarization
bullet_summary()       # Bullet-point summaries
extract_topics()       # Topic extraction
answer_question()      # Q&A system
sentiment()            # Sentiment analysis
translate()            # Multi-language translation
semantic_search()      # Vector-based search

2. AgenticRAGPipeline (LangGraph)

Stateful multi-step RAG workflow using LangGraph.

Location: ai_ml/pipelines/rag_graph.py

Pipeline Flow:

graph LR
    A[Start] --> B[Ingest Node]
    B --> C[RAG Node]
    C --> D[Crew Node]
    D --> E[Finalize Node]
    E --> F[End]

Node Details:

  1. Ingest Node
    • Chunks document into manageable pieces (default: 900 chars, 120 overlap)
    • Generates embeddings using HuggingFace models
    • Creates in-memory FAISS vector store
    • Prepares retrieval tools for agents
  2. RAG Node
    • Performs primary document analysis
    • Uses DocumentSearchTool for semantic retrieval
    • Generates structured JSON output:
      • Overview summary
      • Key topics
      • Q&A answer (if question provided)
      • Supporting context/citations
  3. Crew Node
    • Invokes CrewAI multi-agent system
    • Three agents collaborate sequentially
    • Validates RAG findings with multiple LLM perspectives
    • Produces executive-level insights
  4. Finalize Node
    • Merges RAG and Crew outputs
    • Assembles final comprehensive report
    • Returns structured payload
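
The Ingest Node's sliding-window chunking (900-character chunks with 120-character overlap) can be sketched as follows; character-based slicing is an assumption about the implementation.

```javascript
// Split text into fixed-size chunks with overlap between neighbours,
// mirroring the Ingest Node defaults (900 chars, 120 overlap).
function chunkText(text, chunkSize = 900, overlap = 120) {
  if (overlap >= chunkSize) throw new Error('overlap must be smaller than chunk size');
  const chunks = [];
  const step = chunkSize - overlap; // advance by size minus overlap
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break; // last chunk reached the end
  }
  return chunks;
}
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both neighbouring chunks.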

3. Multi-Agent System (CrewAI)

Three specialized agents collaborate for thorough document analysis.

Location: ai_ml/agents/crew_agents.py

sequenceDiagram
    participant Pipeline
    participant Analyst as Document Analyst<br/>(OpenAI GPT-4o)
    participant Researcher as Cross-Referencer<br/>(Google Gemini)
    participant Reviewer as Insights Curator<br/>(Anthropic Claude)
    participant Report

    Pipeline->>Analyst: RAG Overview + Question
    Analyst->>Analyst: Draft Summary<br/>Use Search & Insights Tools
    Analyst->>Researcher: Summary + Document Context
    Researcher->>Researcher: Validate Claims<br/>Verify with Citations
    Researcher->>Reviewer: Validated Findings
    Reviewer->>Reviewer: Distill Insights<br/>Generate Recommendations
    Reviewer->>Report: Executive Summary
    Report->>Pipeline: Complete Analysis

Agent Specifications:

| Agent | LLM | Role | Tools | Output |
| --- | --- | --- | --- | --- |
| Document Analyst | OpenAI GPT-4o-mini | Lead summarizer | DocumentSearchTool, InsightsExtractionTool | Structured summary with citations |
| Cross-Referencer | Google Gemini 1.5 Pro | Fact verifier | DocumentSearchTool | Validated statements with flagged uncertainties |
| Insights Curator | Anthropic Claude 3.5 Sonnet | Executive reviewer | DocumentSearchTool, InsightsExtractionTool | Strategic recommendations and action items |

4. LLM Provider Registry

Unified interface for multiple LLM providers with lazy loading.

Location: ai_ml/providers/registry.py

Supported Providers:

Features:

Configuration Example:

LLMConfig(
    provider="openai",           # openai | anthropic | google
    model="gpt-4o-mini",
    temperature=0.15,
    max_tokens=900,
    extra={}                     # Provider-specific parameters
)

5. Vector Stores

FAISS (In-Memory):

ChromaDB (Persistent):

Location: ai_ml/vectorstores/chroma_store.py

6. Knowledge Graph (Neo4j)

Stores document relationships and topic networks.

Location: ai_ml/graph/neo4j_client.py

Schema:

(Document {
  id: String,
  title: String,
  summary: Text,
  updated_at: DateTime,
  metadata: Map
})

(Topic {
  name: String
})

(Document)-[:COVERS]->(Topic)

Use Cases:

AI/ML Data Flow

sequenceDiagram
    participant User
    participant Service as DocumentIntelligenceService
    participant Pipeline as AgenticRAGPipeline
    participant RAG as RAG Node
    participant Crew as CrewAI Agents
    participant Neo4j
    participant ChromaDB

    User->>Service: analyze_document(text, question)
    Service->>Pipeline: run(document, question)

    Pipeline->>Pipeline: 1. Ingest: Chunk & Embed
    Pipeline->>Pipeline: 2. Build FAISS Vector Store

    Pipeline->>RAG: 3. Primary RAG Analysis
    RAG->>RAG: Semantic Search
    RAG->>RAG: LLM Generation (JSON)
    RAG-->>Pipeline: RAG Payload

    Pipeline->>Crew: 4. Multi-Agent Validation
    Crew->>Crew: Analyst → Researcher → Reviewer
    Crew-->>Pipeline: Crew Insights

    Pipeline->>Pipeline: 5. Finalize Report
    Pipeline-->>Service: Complete RAG Output

    Service->>Service: 6. Enrichment
    Service->>Service: - Sentiment Analysis
    Service->>Service: - Topic Extraction
    Service->>Service: - Translation (optional)

    alt Knowledge Graph Enabled
        Service->>Neo4j: 7. Sync Document + Topics
        Neo4j-->>Service: Sync Confirmation
    end

    alt Vector Store Enabled
        Service->>ChromaDB: 8. Upsert Document
        ChromaDB-->>Service: Upsert Confirmation
    end

    Service-->>User: Complete Analysis Results

Processing Features

Document Summarization

Types:

  1. Narrative Summary: Coherent prose summary
  2. Bullet Summary: Concise bullet points
  3. Refined Summary: Iterative improvement of draft summaries

Models Used: OpenAI GPT-4o-mini (configurable)

Location: ai_ml/processing/summarizer.py, ai_ml/extended_features/

Sentiment Analysis

Returns structured sentiment with confidence scores.

Output Format:

{
  "label": "positive | neutral | negative",
  "confidence": 0.85,
  "rationale": "Explanation of sentiment determination"
}

Model: Anthropic Claude 3 Haiku (fast & cost-effective)

Location: ai_ml/processing/sentiment.py

Topic Extraction

Extracts main themes and topics from documents.

Methods:

  1. LLM-based: Uses primary analyst model
  2. Heuristic-based: TF-IDF and frequency analysis

Location: ai_ml/processing/topic_extractor.py

Multi-Language Translation

Supports 7+ languages using Helsinki-NLP models.

Supported Languages:

Models: Helsinki-NLP/opus-mt-en-{lang}

Location: ai_ml/processing/translator.py, ai_ml/models/hf_model.py

Question Answering

Context-aware Q&A using RAG pipeline.

Process:

  1. Document chunks embedded and indexed
  2. Question used to retrieve relevant context
  3. LLM generates answer with citations
  4. Supporting context returned for verification

Model: OpenAI GPT-4o-mini or configured QA model

Content Rewriting

Style-based document transformation.

Supported Tones:

Location: ai_ml/extended_features/rewriter.py

Semantic Search

Vector-based document search within or across documents.

Features:

Location: ai_ml/tools/document_tools.py

Technology Stack

Core ML Frameworks:

LLM Providers:

Embeddings & Vector Stores:

Knowledge Graph:

Model Optimization:

API & Deployment:

Utilities:

Deployment Interfaces

1. CLI Interface

python -m ai_ml.main documents/sample.txt \
  --question "What are the key findings?" \
  --translate_lang es \
  --doc_id doc-001 \
  --title "Sample Document"

2. Python API

from ai_ml.services import get_document_service

service = get_document_service()
results = service.analyze_document(
    document="Document text...",
    question="What are the insights?",
    translate_lang="fr",
    metadata={"id": "doc-001"}
)

3. FastAPI Server

uvicorn ai_ml.server:app --host 0.0.0.0 --port 8000

# POST /analyze endpoint
curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{"document": "...", "question": "..."}'

4. MCP Server

python -m ai_ml.mcp.server

# Exposes tools for external consumption:
# - agentic_document_brief
# - semantic_document_search
# - quick_topics
# - vector_upsert/search
# - graph_upsert/query

Configuration

All AI/ML components are configurable via environment variables:

Core Settings (ai_ml/core/settings.py):

# LLM Models
DOCUTHINKER_OPENAI_MODEL=gpt-4o-mini
DOCUTHINKER_CLAUDE_MODEL=claude-3-5-sonnet-20241022
DOCUTHINKER_GEMINI_MODEL=gemini-1.5-pro

# Embeddings
DOCUTHINKER_EMBEDDING_PROVIDER=huggingface
DOCUTHINKER_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Chunking
DOCUTHINKER_CHUNK_SIZE=900
DOCUTHINKER_CHUNK_OVERLAP=120

# Neo4j
DOCUTHINKER_SYNC_GRAPH=true
DOCUTHINKER_NEO4J_URI=bolt://localhost:7687
DOCUTHINKER_NEO4J_USER=neo4j
DOCUTHINKER_NEO4J_PASSWORD=password

# ChromaDB
DOCUTHINKER_SYNC_VECTOR=true
DOCUTHINKER_CHROMA_DIR=.chroma
DOCUTHINKER_CHROMA_COLLECTION=docuthinker
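A minimal sketch of reading these variables with defaults, following the `DOCUTHINKER_*` convention shown above (the actual loader in `ai_ml/core/settings.py` may differ, e.g. by using a typed settings class):

```python
import os

def setting(name: str, default: str) -> str:
    # Look up a DOCUTHINKER_-prefixed environment variable with a fallback.
    return os.environ.get(f"DOCUTHINKER_{name}", default)

chunk_size = int(setting("CHUNK_SIZE", "900"))
chunk_overlap = int(setting("CHUNK_OVERLAP", "120"))
sync_graph = setting("SYNC_GRAPH", "false").lower() == "true"
```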

Performance Characteristics

Typical Performance (5K token document, M1 MacBook Pro):

| Operation | Time | Notes |
| --- | --- | --- |
| Full Agentic Analysis | 15-25s | With CrewAI collaboration |
| Summary Only | 3-5s | Single LLM call |
| Topic Extraction | 2-4s | Single LLM call |
| Semantic Search (FAISS) | 100-200ms | In-memory, 10K docs |
| Vector Upsert (ChromaDB) | 50-100ms | Single document |
| Translation | 5-10s | Helsinki-NLP model |
| Sentiment Analysis | 2-3s | Claude Haiku |

Optimization Features:

Integration with Backend

The AI/ML pipeline integrates with the Express backend through:

  1. Direct Python API Calls: Backend calls Python functions via child processes
  2. REST API: Backend communicates with FastAPI server
  3. Shared Database: Results stored in PostgreSQL/Firestore
  4. Message Queue: Async processing via RabbitMQ
  5. Caching Layer: Redis caches AI/ML results

graph LR
    EXPRESS[Express Backend] --> PYTHON[Python AI/ML API]
    EXPRESS --> FASTAPI[FastAPI Server]
    EXPRESS --> REDIS[Redis Cache]
    EXPRESS --> POSTGRES[(PostgreSQL)]

    PYTHON --> OPENAI[OpenAI]
    PYTHON --> ANTHROPIC[Anthropic]
    PYTHON --> GEMINI[Google AI]

    FASTAPI --> NEO4J[(Neo4j)]
    FASTAPI --> CHROMA[(ChromaDB)]

    style EXPRESS fill:#68A063,stroke:#333,stroke-width:2px,color:#fff
    style PYTHON fill:#3776AB,stroke:#333,stroke-width:2px,color:#fff
    style FASTAPI fill:#009688,stroke:#333,stroke-width:2px,color:#fff
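As a sketch of the caching layer (point 5 above), a deterministic cache key lets the backend reuse prior AI/ML results for identical requests. The key layout below is illustrative, not the production scheme:

```python
import hashlib
import json

def ai_cache_key(doc_id: str, operation: str, params: dict) -> str:
    # Hash the operation parameters so semantically identical requests
    # map to the same Redis entry; the "aiml:" prefix is an assumption.
    digest = hashlib.sha256(
        json.dumps(params, sort_keys=True).encode()
    ).hexdigest()[:16]
    return f"aiml:{operation}:{doc_id}:{digest}"
```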

Monitoring & Observability

AI/ML pipeline metrics tracked by Prometheus:

Logging: Structured JSON logs with trace IDs for distributed tracing

Alerts:


Database Architecture

DocuThinker uses a hybrid database approach with Flyway migrations for version control.

graph TB
    subgraph "Database Layer"
        A[Application Layer]

        subgraph "Primary Databases"
            B[(PostgreSQL RDS<br/>Multi-AZ)]
            C[(Firestore<br/>Real-time Sync)]
            D[(MongoDB Atlas<br/>Document Store)]
        end

        subgraph "Caching Layer"
            E[(Redis Cache<br/>ElastiCache)]
            F[API Response Cache]
            G[Session Cache]
        end

        subgraph "Database Migrations"
            H[Flyway<br/>Version Control]
            I[Migration Scripts]
            J[Rollback Support]
        end

        subgraph "Backup & Recovery"
            K[Automated Backups<br/>Daily + Hourly]
            L[Point-in-Time Recovery]
            M[Velero Snapshots]
        end
    end

    A --> B
    A --> C
    A --> D
    A --> E

    E --> F
    E --> G

    H --> B
    H --> I
    H --> J

    B -.->|Backup| K
    K --> L
    K --> M

PostgreSQL Schema with Flyway Migrations

erDiagram
    USERS ||--o{ DOCUMENTS : owns
    USERS ||--|| ANALYTICS : has
    USERS ||--o{ API_KEYS : has
    DOCUMENTS ||--o{ DOCUMENT_TAGS : has
    USERS ||--o{ ANALYTICS_EVENTS : generates

    USERS {
        uuid id PK
        string email UK
        string username UK
        string password_hash
        string role
        boolean is_active
        timestamp created_at
        timestamp updated_at
    }

    DOCUMENTS {
        uuid id PK
        uuid user_id FK
        string title
        text content
        string file_path
        bigint file_size
        string status
        integer version
        timestamp created_at
        timestamp updated_at
        timestamp deleted_at
    }

    DOCUMENT_TAGS {
        uuid id PK
        uuid document_id FK
        string tag
        timestamp created_at
    }

    ANALYTICS {
        uuid user_id PK
        integer total_documents
        integer total_chats
        jsonb usage_by_day
        timestamp last_access
    }

    ANALYTICS_EVENTS {
        bigserial id PK
        uuid user_id FK
        uuid document_id FK
        string event_type
        jsonb event_data
        inet ip_address
        timestamp created_at
    }

    API_KEYS {
        uuid id PK
        uuid user_id FK
        string key_hash UK
        string name
        text[] scopes
        boolean is_active
        timestamp expires_at
        timestamp created_at
    }

    AUDIT_LOG {
        bigserial id PK
        uuid user_id FK
        string action
        string entity_type
        uuid entity_id
        jsonb old_values
        jsonb new_values
        timestamp created_at
    }

Service Mesh Architecture

Istio provides zero-trust security and advanced traffic management.

graph TB
    subgraph "Istio Service Mesh"
        subgraph "Control Plane"
            ISTIOD[Istiod<br/>HA - 3 Replicas]
            PILOT[Pilot<br/>Service Discovery]
            CITADEL[Citadel<br/>Certificate Authority]
        end

        subgraph "Data Plane - Envoy Sidecars"
            FE_PROXY[Frontend<br/>+ Envoy]
            BE_PROXY[Backend<br/>+ Envoy]
            DB_PROXY[Database<br/>+ Envoy]
        end

        subgraph "Gateways"
            INGRESS[Ingress Gateway<br/>HTTPS + mTLS]
            EGRESS[Egress Gateway<br/>External APIs]
        end

        subgraph "Traffic Management"
            VS[Virtual Services<br/>Routing Rules]
            DR[Destination Rules<br/>Circuit Breaking]
            GW[Gateway Config<br/>TLS Termination]
        end

        subgraph "Security"
            PA[Peer Authentication<br/>Strict mTLS]
            AP[Authorization Policies<br/>RBAC]
        end

        subgraph "Observability"
            KIALI[Kiali<br/>Mesh Visualization]
            JAEGER[Jaeger<br/>Distributed Tracing]
            PROM[Prometheus<br/>Metrics]
        end
    end

    INTERNET[Internet] --> INGRESS
    INGRESS --> FE_PROXY
    FE_PROXY <-->|mTLS| BE_PROXY
    BE_PROXY <-->|mTLS| DB_PROXY
    BE_PROXY --> EGRESS
    EGRESS --> EXTERNAL[External APIs]

    ISTIOD -.->|Config| INGRESS
    ISTIOD -.->|Config| FE_PROXY
    ISTIOD -.->|Config| BE_PROXY
    ISTIOD -.->|Certificates| CITADEL

    VS -.->|Apply| FE_PROXY
    DR -.->|Apply| BE_PROXY
    PA -.->|Enforce| BE_PROXY
    AP -.->|Enforce| BE_PROXY

    FE_PROXY -.->|Traces| JAEGER
    BE_PROXY -.->|Metrics| PROM
    PROM --> KIALI

Traffic Management Flow

sequenceDiagram
    participant Client
    participant Ingress as Istio Ingress Gateway
    participant VS as Virtual Service
    participant DR as Destination Rule
    participant Stable as Backend Stable (90%)
    participant Canary as Backend Canary (10%)

    Client->>Ingress: HTTPS Request
    Ingress->>VS: Route Request
    VS->>VS: Apply Routing Rules

    alt Header-based routing
        VS->>Canary: Route if x-version: v2
    else Weight-based routing
        VS->>DR: Check Destination Rules
        DR->>DR: Apply Circuit Breaking
        DR->>DR: Check Connection Pool
        DR->>Stable: 90% Traffic
        DR->>Canary: 10% Traffic
    end

    Stable-->>Client: Response (with retry)
    Canary-->>Client: Response (with retry)

Observability & Monitoring

Comprehensive observability with OpenTelemetry, Prometheus, ELK Stack, and Coralogix.

graph TB
    subgraph "Data Collection"
        APP[Applications]
        OTEL[OpenTelemetry Collector<br/>HA - 3 Replicas]
        PROM_EXP[Prometheus Exporters]
        FILEBEAT[Filebeat]
        FLUENTBIT[Fluent Bit DaemonSet<br/>Node-level Logs]
    end

    subgraph "Traces"
        JAEGER[Jaeger<br/>Distributed Tracing]
        TRACE_STORE[(Elasticsearch<br/>Trace Storage)]
    end

    subgraph "Metrics"
        PROM[Prometheus<br/>Metrics Storage]
        SLO[SLO/SLI Calculator]
        ERROR_BUDGET[Error Budget Tracker]
    end

    subgraph "Logs"
        LOGSTASH[Logstash<br/>Log Processing]
        ELASTIC[(Elasticsearch<br/>Log Storage)]
        KIBANA[Kibana<br/>Log Analysis]
    end

    subgraph "Coralogix SaaS"
        CX_INGEST[Coralogix Ingestion<br/>OTLP/gRPC + Fluent Bit]
        CX_TCO[TCO Optimizer<br/>Cost Tiering]
        CX_LOGS[Logs Engine]
        CX_METRICS[Metrics Engine]
        CX_TRACES[Traces Engine]
        CX_ALERTS[Coralogix Alerts<br/>12 Production Rules]
        CX_DASH[Coralogix Dashboards]
    end

    subgraph "Visualization"
        GRAF[Grafana<br/>Unified Dashboards]
        KIALI[Kiali<br/>Service Mesh View]
    end

    subgraph "Alerting"
        ALERT_MGR[AlertManager<br/>Alert Routing]
        SLACK[Slack Notifications]
        PAGERDUTY[PagerDuty<br/>Incident Management]
    end

    APP -->|Traces OTLP| OTEL
    APP -->|Metrics| PROM_EXP
    APP -->|Logs| FILEBEAT
    APP -->|Logs| FLUENTBIT

    OTEL --> JAEGER
    OTEL -->|OTLP/gRPC| CX_INGEST
    JAEGER --> TRACE_STORE

    PROM_EXP --> PROM
    PROM --> SLO
    PROM -->|Remote Write| CX_INGEST
    SLO --> ERROR_BUDGET

    FILEBEAT --> LOGSTASH
    LOGSTASH --> ELASTIC
    ELASTIC --> KIBANA
    FLUENTBIT -->|HTTPS| CX_INGEST

    CX_INGEST --> CX_TCO
    CX_TCO --> CX_LOGS
    CX_TCO --> CX_METRICS
    CX_TCO --> CX_TRACES
    CX_LOGS --> CX_ALERTS
    CX_METRICS --> CX_DASH
    CX_TRACES --> CX_DASH
    CX_METRICS --> GRAF

    PROM --> GRAF
    JAEGER --> GRAF
    TRACE_STORE --> KIALI

    PROM -.->|Alerts| ALERT_MGR
    CX_ALERTS -.-> SLACK
    CX_ALERTS -.->|Critical| PAGERDUTY
    ALERT_MGR --> SLACK
    ALERT_MGR -->|Critical| PAGERDUTY

    style OTEL fill:#F38181,color:#fff
    style PROM fill:#E85D04,color:#fff
    style GRAF fill:#F48C06,color:#fff
    style SLO fill:#95E1D3
    style CX_INGEST fill:#6C63FF,color:#fff
    style CX_TCO fill:#6C63FF,color:#fff
    style CX_LOGS fill:#6C63FF,color:#fff
    style CX_METRICS fill:#6C63FF,color:#fff
    style CX_TRACES fill:#6C63FF,color:#fff
    style CX_ALERTS fill:#6C63FF,color:#fff
    style CX_DASH fill:#6C63FF,color:#fff
    style FLUENTBIT fill:#6C63FF,color:#fff

Coralogix Integration

Coralogix serves as the unified SaaS observability backend, complementing the existing on-cluster Prometheus/Grafana/ELK stack.

Data Flow:

| Signal | Source | Transport | Destination |
| --- | --- | --- | --- |
| Traces | OTel Collector | OTLP/gRPC (TLS) | Coralogix Traces |
| Metrics | OTel Collector + Prometheus Remote Write | OTLP/gRPC + Remote Write | Coralogix Metrics |
| Logs (app) | OTel Collector | OTLP/gRPC (TLS) | Coralogix Logs |
| Logs (K8s) | Fluent Bit DaemonSet | HTTPS | Coralogix Logs |
| K8s Events | Cluster Collector | OTLP/gRPC (TLS) | Coralogix Logs |
| K8s Metrics | Cluster Collector | OTLP/gRPC (TLS) | Coralogix Metrics |

TCO Cost Optimization:

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Frequent Search  β”‚ Monitoring          β”‚ Compliance      β”‚
β”‚ (High Priority)  β”‚ (Medium Priority)   β”‚ (Low Priority)  β”‚
β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”Όβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€
β”‚ Errors, critical β”‚ Warnings, info      β”‚ Debug, verbose  β”‚
β”‚ Error spans      β”‚ Normal spans        β”‚ K8s infra logs  β”‚
β”‚ Full indexing    β”‚ Monitoring index    β”‚ Archive only    β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
  Health check logs β†’ BLOCKED (zero cost)
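The tiering table above can be re-expressed as a routing rule. This is only an illustration of the mapping; the real routing is configured declaratively in Coralogix's TCO Optimizer, not in application code:

```python
def tco_tier(record: dict) -> str:
    # Map a log record to a Coralogix TCO tier per the table above.
    # The record field names ("path", "severity") are assumptions.
    if record.get("path") == "/health":
        return "blocked"  # health checks cost nothing
    severity = record.get("severity", "info")
    if severity in {"error", "critical"}:
        return "frequent_search"  # full indexing, high priority
    if severity in {"warning", "info"}:
        return "monitoring"       # monitoring index, medium priority
    return "compliance"           # debug/verbose and infra logs, archive only
```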

IaC Management (Terraform coralogix/coralogix provider):

SLO/SLI Monitoring

graph LR
    subgraph "SLI Collection"
        AVAIL[Availability SLI<br/>Success Rate]
        LAT[Latency SLI<br/>P50/P95/P99]
        ERR[Error Rate SLI]
    end

    subgraph "SLO Targets"
        SLO_AVAIL[Availability > 99.9%]
        SLO_LAT[P99 Latency < 500ms]
        SLO_ERR[Error Rate < 0.1%]
    end

    subgraph "Error Budget"
        BUDGET[Error Budget<br/>Monthly]
        BURN[Burn Rate<br/>Fast/Slow]
        REMAINING[Budget Remaining]
    end

    subgraph "Alerting"
        FAST_BURN[Fast Burn Alert<br/>>14.4x]
        SLOW_BURN[Slow Burn Alert<br/>>1x]
        SLO_BREACH[SLO Violation]
    end

    AVAIL --> SLO_AVAIL
    LAT --> SLO_LAT
    ERR --> SLO_ERR

    SLO_AVAIL --> BUDGET
    SLO_LAT --> BUDGET
    SLO_ERR --> BUDGET

    BUDGET --> BURN
    BURN --> REMAINING

    BURN -.->|Monitor| FAST_BURN
    BURN -.->|Monitor| SLOW_BURN
    SLO_AVAIL -.->|Check| SLO_BREACH
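The burn-rate thresholds in the diagram can be made concrete: with a 99.9% availability SLO the allowed error rate (error budget) is 0.1%, and a burn rate of 14.4x means errors are consuming the monthly budget 14.4 times faster than allowed. A minimal sketch of the calculation:

```python
def burn_rate(error_rate: float, slo: float) -> float:
    # Burn rate = observed error rate / allowed error rate (1 - SLO).
    budget = 1.0 - slo
    return error_rate / budget

def classify(rate: float) -> str:
    # Thresholds from the diagram: fast burn > 14.4x, slow burn > 1x.
    if rate > 14.4:
        return "fast_burn"
    if rate > 1.0:
        return "slow_burn"
    return "ok"
```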

Security Architecture

Multi-layered security with OPA, Falco, mTLS, SonarQube, and Snyk.

graph TB
    subgraph "Layer 1: Network Security"
        WAF[AWS WAF<br/>DDoS Protection]
        TLS[cert-manager<br/>Auto TLS Renewal]
        MTLS[Istio mTLS<br/>Service-to-Service]
    end

    subgraph "Layer 2: Admission Control"
        OPA[OPA Gatekeeper<br/>Policy Enforcement]
        POLICIES[10 Security Policies]
        MUTATIONS[8 Auto-Remediation Rules]
    end

    subgraph "Layer 3: Authentication"
        FIREBASE[Firebase Auth]
        JWT[JWT Tokens]
        RBAC[Kubernetes RBAC]
    end

    subgraph "Layer 4: Runtime Security"
        FALCO[Falco<br/>Threat Detection]
        RULES[4 Custom Rules]
        ALERTS[Real-time Alerts]
    end

    subgraph "Layer 5: Secrets Management"
        VAULT[HashiCorp Vault]
        AWS_SM[AWS Secrets Manager]
        ESO[External Secrets Operator]
    end

    subgraph "Layer 6: Code & Supply Chain Security"
        SONAR[SonarQube 10.4<br/>Static Analysis + Quality Gates]
        SNYK_OSS[Snyk Open Source<br/>Dependency Vulnerabilities]
        SNYK_CONTAINER[Snyk Container<br/>Image Scanning + Licenses]
        SNYK_IAC[Snyk IaC<br/>Terraform/K8s Misconfig]
        SNYK_SAST[Snyk Code<br/>SAST Analysis]
        TRIVY[Trivy<br/>Filesystem + Image Scan]
    end

    subgraph "Layer 7: Data Protection"
        ENCRYPT_REST[Encryption at Rest]
        ENCRYPT_TRANSIT[Encryption in Transit]
        BACKUP[Encrypted Backups]
    end

    subgraph "Layer 8: Audit & Compliance"
        AUDIT[Audit Logs]
        COMPLIANCE[Compliance Reports]
        SIEM[SIEM Integration]
    end

    INTERNET[Internet] --> WAF
    WAF --> TLS
    TLS --> MTLS

    MTLS --> OPA
    OPA --> POLICIES
    POLICIES --> MUTATIONS

    MUTATIONS --> FIREBASE
    FIREBASE --> JWT
    JWT --> RBAC

    RBAC -.->|Monitor| FALCO
    FALCO --> RULES
    RULES --> ALERTS

    APP[Applications] --> VAULT
    VAULT --> AWS_SM
    AWS_SM --> ESO

    CODE[Source Code] --> SONAR
    CODE --> SNYK_SAST
    CODE --> SNYK_OSS
    IMAGES[Container Images] --> SNYK_CONTAINER
    IMAGES --> TRIVY
    INFRA[IaC Configs] --> SNYK_IAC

    APP --> ENCRYPT_REST
    APP --> ENCRYPT_TRANSIT
    APP --> BACKUP

    FALCO -.->|Log| AUDIT
    SONAR -.->|Reports| COMPLIANCE
    SNYK_OSS -.->|Findings| COMPLIANCE
    AUDIT --> COMPLIANCE
    COMPLIANCE --> SIEM

    style OPA fill:#4ECDC4,color:#fff
    style FALCO fill:#FF6B6B,color:#fff
    style VAULT fill:#AA96DA,color:#fff
    style SONAR fill:#4E9BCD,color:#fff
    style SNYK_OSS fill:#4C4A73,color:#fff
    style SNYK_CONTAINER fill:#4C4A73,color:#fff
    style SNYK_IAC fill:#4C4A73,color:#fff
    style SNYK_SAST fill:#4C4A73,color:#fff

OPA Policy Enforcement Flow

sequenceDiagram
    participant Dev as Developer
    participant API as K8s API Server
    participant OPA as OPA Gatekeeper
    participant POLICIES as Policy Templates
    participant POD as Pod

    Dev->>API: kubectl apply -f deployment.yaml
    API->>OPA: Admission Request

    OPA->>POLICIES: Check Constraints

    alt Policy Violations Found
        POLICIES->>OPA: Violations: Missing labels, No resource limits
        OPA->>API: Deny Admission
        API->>Dev: Error: Policy violations
    else Policies Met
        POLICIES->>OPA: All policies satisfied
        OPA->>OPA: Apply Mutations (add defaults)
        OPA->>API: Allow Admission (modified)
        API->>POD: Create Pod
        POD->>Dev: Pod Created Successfully
    end

    Note over OPA: Continuous Audit
    OPA->>POLICIES: Scan Existing Resources
    POLICIES-->>OPA: Report Violations
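The deny path in the sequence above (missing labels, no resource limits) can be re-expressed in Python for clarity. The real policies are Rego constraint templates enforced by OPA Gatekeeper, not application code:

```python
def admission_review(deployment: dict) -> tuple[bool, list[str]]:
    # Illustrative admission check mirroring the sequence diagram above;
    # field paths follow a simplified Kubernetes deployment shape.
    violations: list[str] = []
    if not deployment.get("metadata", {}).get("labels"):
        violations.append("Missing labels")
    for c in deployment.get("spec", {}).get("containers", []):
        if "limits" not in c.get("resources", {}):
            violations.append(f"No resource limits: {c.get('name', '?')}")
    return (not violations, violations)
```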

Reliability Engineering

Chaos testing and disaster recovery for production resilience.

graph TB
    subgraph "Chaos Engineering - Litmus"
        CHAOS_CTRL[Litmus Controller]

        subgraph "Chaos Experiments"
            POD_DELETE[Pod Deletion<br/>50% Pods]
            NET_LATENCY[Network Latency<br/>2000ms Injection]
            CPU_STRESS[CPU Stress<br/>100% Load]
            MEM_STRESS[Memory Stress<br/>500MB Consumption]
        end

        subgraph "Validation"
            HTTP_PROBE[HTTP Health Probes]
            K8S_PROBE[K8s Resource Probes]
            PROM_PROBE[Prometheus Metrics]
        end

        WORKFLOW[Chaos Workflows<br/>Sequential Execution]
    end

    subgraph "Disaster Recovery - Velero"
        BACKUP_CTRL[Velero Controller]

        subgraph "Backup Schedule"
            DAILY[Daily Full Backup<br/>30-day Retention]
            HOURLY[Hourly Incremental<br/>7-day Retention]
        end

        subgraph "Storage"
            S3[S3 Backup Storage]
            EBS_SNAP[EBS Snapshots]
        end

        RESTORE[Restore Operations<br/>RTO < 1 hour]
    end

    subgraph "Application"
        APP[Backend Services]
        DB[(Databases)]
    end

    CHAOS_CTRL --> POD_DELETE
    CHAOS_CTRL --> NET_LATENCY
    CHAOS_CTRL --> CPU_STRESS
    CHAOS_CTRL --> MEM_STRESS

    POD_DELETE -.->|Test| APP
    NET_LATENCY -.->|Test| APP
    CPU_STRESS -.->|Test| APP
    MEM_STRESS -.->|Test| APP

    POD_DELETE --> HTTP_PROBE
    NET_LATENCY --> K8S_PROBE
    CPU_STRESS --> PROM_PROBE

    CHAOS_CTRL --> WORKFLOW

    BACKUP_CTRL --> DAILY
    BACKUP_CTRL --> HOURLY

    DAILY --> S3
    HOURLY --> S3
    DB -.->|Snapshot| EBS_SNAP

    S3 --> RESTORE
    EBS_SNAP --> RESTORE

    style CHAOS_CTRL fill:#AA96DA,color:#fff
    style RESTORE fill:#6BCB77,color:#fff

Progressive Delivery

Automated canary deployments with Flagger.

graph TB
    subgraph "Flagger Progressive Delivery"
        FLAGGER[Flagger Controller]

        subgraph "Deployment Stages"
            INIT[Initialize Canary<br/>0% Traffic]
            RAMP[Progressive Ramp<br/>10% β†’ 50%]
            ANALYSIS[Canary Analysis<br/>1-min Intervals]
            PROMOTE[Promote to 100%]
            ROLLBACK[Automatic Rollback]
        end

        subgraph "Metrics Analysis"
            SUCCESS[Success Rate > 99%]
            LATENCY[Latency < 500ms]
            CUSTOM[Custom Metrics]
        end

        subgraph "Integration"
            ISTIO_INT[Istio<br/>Traffic Splitting]
            PROM_INT[Prometheus<br/>Metrics Source]
            SLACK_INT[Slack<br/>Notifications]
        end
    end

    subgraph "Deployments"
        STABLE[Stable Version<br/>Current Production]
        CANARY[Canary Version<br/>New Release]
    end

    FLAGGER --> INIT
    INIT --> RAMP
    RAMP --> ANALYSIS

    ANALYSIS --> SUCCESS
    ANALYSIS --> LATENCY
    ANALYSIS --> CUSTOM

    SUCCESS & LATENCY & CUSTOM -.->|All Pass| PROMOTE
    SUCCESS & LATENCY & CUSTOM -.->|Any Fail| ROLLBACK

    PROMOTE --> STABLE
    ROLLBACK --> STABLE

    FLAGGER -.->|Control| ISTIO_INT
    FLAGGER -.->|Query| PROM_INT
    FLAGGER -.->|Notify| SLACK_INT

    ISTIO_INT -.->|Split| STABLE
    ISTIO_INT -.->|Split| CANARY
    
    style PROMOTE fill:#6BCB77,color:#fff
    style ROLLBACK fill:#FF6B6B,color:#fff
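The promote/rollback gates in the diagram can be sketched as a decision function. Flagger actually evaluates these thresholds against Prometheus queries configured in the Canary resource; this is only an illustration of the logic:

```python
def canary_decision(metrics: dict) -> str:
    # Gates mirror the diagram: success rate > 99%, p99 latency < 500ms.
    # Metric key names here are assumptions for illustration.
    checks = [
        metrics.get("success_rate", 0.0) > 0.99,
        metrics.get("p99_latency_ms", float("inf")) < 500,
    ]
    return "promote" if all(checks) else "rollback"
```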

Autoscaling Strategy

Multi-dimensional autoscaling with KEDA and HPA.

graph TB
    subgraph "KEDA - Event-Driven Autoscaling"
        KEDA[KEDA Operator<br/>HA - 2 Replicas]

        subgraph "Scalers"
            SQS[AWS SQS Scaler<br/>Queue Depth]
            HTTP[HTTP Scaler<br/>Request Rate]
            CRON[Cron Scaler<br/>Scheduled]
            PROM_SCALER[Prometheus Scaler<br/>Custom Metrics]
        end

        SCALE_ZERO[Scale to Zero<br/>Cost Optimization]
    end

    subgraph "HPA - Resource-Based"
        HPA[Horizontal Pod Autoscaler]

        subgraph "Triggers"
            CPU[CPU > 70%]
            MEM[Memory > 80%]
        end
    end

    subgraph "Applications"
        WORKER[Worker Pods<br/>1-50 Replicas]
        BACKEND[Backend Pods<br/>2-20 Replicas]
        FRONTEND[Frontend Pods<br/>2-10 Replicas]
    end

    KEDA --> SQS
    KEDA --> HTTP
    KEDA --> CRON
    KEDA --> PROM_SCALER

    SQS -.->|Scale| WORKER
    HTTP -.->|Scale| BACKEND
    CRON -.->|Scale| BACKEND
    PROM_SCALER -.->|Scale| BACKEND

    KEDA -.->|Enable| SCALE_ZERO
    SCALE_ZERO -.->|Apply| WORKER

    HPA --> CPU
    HPA --> MEM

    CPU -.->|Scale| FRONTEND
    MEM -.->|Scale| FRONTEND

    style KEDA fill:#FCBAD3,color:#000
    style SCALE_ZERO fill:#6BCB77,color:#fff
    style HPA fill:#4D96FF,color:#fff
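The SQS scaler's proportional rule, with scale-to-zero and the 1-50 worker replica bounds from the diagram, can be sketched as follows (KEDA's actual behavior also involves HPA stabilization windows; the target of 5 messages per replica is an assumed setting):

```python
import math

def desired_replicas(queue_length: int, target_per_replica: int = 5,
                     min_replicas: int = 0, max_replicas: int = 50) -> int:
    # Scale to zero when the queue is empty; otherwise scale proportionally
    # to queue depth, clamped to the configured replica bounds.
    if queue_length == 0:
        return min_replicas
    return min(max_replicas, max(1, math.ceil(queue_length / target_per_replica)))
```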

Container Orchestration

Enhanced Kubernetes with Istio, OPA, and advanced deployments.

graph TB
    subgraph "Kubernetes Cluster - EKS"
        subgraph "Control Plane Components"
            ISTIOD[Istiod Control Plane]
            OPA_CTRL[OPA Gatekeeper]
            FLAGGER_CTRL[Flagger]
            KEDA_CTRL[KEDA Operator]
        end

        subgraph "Ingress"
            ISTIO_IG[Istio Ingress Gateway<br/>3 Replicas]
        end

        subgraph "Frontend Namespace"
            FE_SVC[Frontend Service]
            FE1[Frontend Pod 1 + Envoy]
            FE2[Frontend Pod 2 + Envoy]
            FE3[Frontend Pod 3 + Envoy]
        end

        subgraph "Backend Namespace"
            BE_SVC[Backend Service]
            BE_STABLE[Stable Deployment<br/>90% Traffic]
            BE_CANARY[Canary Deployment<br/>10% Traffic]
        end

        subgraph "Data Services"
            PG_SVC[(PostgreSQL<br/>StatefulSet)]
            REDIS_SVC[(Redis<br/>StatefulSet)]
        end

        subgraph "Monitoring Namespace"
            PROM[Prometheus]
            GRAF[Grafana]
            JAEGER[Jaeger]
        end

        subgraph "Config & Secrets"
            CM[ConfigMaps]
            SECRET[Kubernetes Secrets]
            ESO[External Secrets]
        end
    end

    INTERNET[Internet] --> ISTIO_IG

    ISTIOD -.->|Config| ISTIO_IG
    OPA_CTRL -.->|Validate| FE1
    OPA_CTRL -.->|Validate| BE_STABLE

    ISTIO_IG --> FE_SVC
    FE_SVC --> FE1 & FE2 & FE3

    FE1 --> BE_SVC
    BE_SVC --> BE_STABLE
    BE_SVC --> BE_CANARY

    FLAGGER_CTRL -.->|Manage| BE_CANARY
    KEDA_CTRL -.->|Scale| BE_STABLE

    BE_STABLE --> PG_SVC
    BE_STABLE --> REDIS_SVC

    BE_STABLE -.->|Metrics| PROM
    FE1 -.->|Traces| JAEGER

    ESO -.->|Sync| SECRET
    SECRET --> BE_STABLE
    CM --> BE_STABLE

Technology Stack

Comprehensive overview of all technologies used.

mindmap
  root((DocuThinker<br/>Tech Stack))
    Frontend
      React 18
      Material-UI
      TailwindCSS
      React Router
      Axios
      Context API
      React Markdown / KaTeX
      pdfjs-dist
      React Dropzone
      Dropbox SDK
      Google API / OAuth
      Vercel Analytics
      Babel / Craco / Webpack
    Backend
      Node.js 18+
      Express
      Firebase Admin SDK
      Firebase Auth
      GraphQL / graphql-tools
      JWT / jsonwebtoken
      Redis
      RabbitMQ
      Multer / Busboy
      Mammoth / pdf-parse
      Google APIs / googleapis
      Google Generative AI SDK
      Swagger / OpenAPI
    Orchestrator
      Anthropic AI SDK
      Google Generative AI SDK
      MCP SDK
      Zod Schema Validation
      Supervisor Pattern
      Agent Loop / ReAct
      Circuit Breaker
      Cost Tracker
      Dead Letter Queue
      Token Budget Manager
      Conversation Store
      Hybrid RAG
      Prompt Cache Strategy
      14 System Prompts
    AI/ML Pipeline
      FastAPI / Uvicorn
      LangChain
      LangGraph
      CrewAI
      OpenAI GPT-4o
      Anthropic Claude 3.5 Sonnet
      Google Gemini 1.5 Pro
      FAISS
      ChromaDB
      Neo4j
      sentence-transformers
      PyTorch
      HuggingFace Transformers
      ONNX / ONNX Runtime
      Optuna
      ROUGE Score
      Pandas / Matplotlib
      MCP Server
      Google Cloud NLP
      Google Speech-to-Text
    Database
      PostgreSQL / RDS
      MongoDB Atlas
      Firestore
      Redis / ElastiCache
      Neo4j Graph DB
      ChromaDB Vectors
    Mobile App
      React Native 0.74
      Expo 51
      Expo Router
      React Navigation
      React Native Reanimated
    VS Code Extension
      TypeScript
      VS Code Extension API
      VSCE
    Service Mesh
      Istio 1.20
      Envoy Proxy
      mTLS
      Circuit Breaking
      Kiali Dashboard
    Security
      OPA Gatekeeper
      Falco 0.36
      HashiCorp Vault 1.15
      External Secrets Operator
      cert-manager
      AWS WAF
      Trivy
      SonarQube 10.4
      Snyk (OSS/Container/IaC/SAST)
    Observability
      OpenTelemetry Collector
      Prometheus / AlertManager
      Grafana
      Jaeger
      Loki
      ELK Stack
      SLO / SLI
    Reliability
      Flagger 1.34
      KEDA 2.12
      Velero
      HPA
      Blue-Green Deployments
      Canary Deployments
      AWS Backup
    DevOps
      Docker
      Docker Compose
      Kubernetes 1.28+
      Helm 3.13+
      ArgoCD
      Terraform 1.5+
      GitHub Actions
      GitLab CI
      CircleCI
      Jenkins
    Cloud Services
      AWS EKS
      AWS RDS
      AWS S3
      ElastiCache
      CloudFront
      ECS Fargate
      Secrets Manager
      IAM / IRSA
      VPC Multi-AZ
    Testing and Quality
      Jest
      React Testing Library
      pytest
      k6 Load Testing
      Supertest
      SonarQube
      ESLint
      Prettier
      Postman
    API and Documentation
      Swagger / OpenAPI 3.0
      GraphiQL
      REST APIs
      GraphQL
      MCP Protocol

Scalability & Performance

Enhanced with event-driven autoscaling and SLO monitoring.

graph TB
    subgraph "Scalability Architecture"
        A[Application Growth]

        subgraph "Horizontal Scaling"
            B[Istio Load Balancing]
            C[Multi-AZ Deployment]
            D[Database Replication]
            E[KEDA Auto-scaling]
        end

        subgraph "Performance Optimization"
            F[Redis Caching<br/>85% Hit Rate]
            G[CDN Distribution]
            H[Code Splitting]
            I[Lazy Loading]
            J[Connection Pooling]
        end

        subgraph "SLO Targets"
            K[Availability > 99.9%]
            L[P99 Latency < 500ms]
            M[Error Rate < 0.1%]
            N[Error Budget Tracking]
        end

        subgraph "Resilience"
            O[Circuit Breaking]
            P[Retry Logic]
            Q[Timeout Controls]
            R[Chaos Testing]
        end

        subgraph "Monitoring"
            S[OpenTelemetry]
            T[Prometheus Metrics]
            U[Real-time Alerting]
        end
    end

    A --> B & C & E

    B --> F
    C --> G
    E --> H & I & J

    F --> K
    G --> L
    H --> M
    I --> N

    O --> K
    P --> L
    Q --> M
    R --> N

    S --> T
    T --> U
    U -.->|Alert| DevOps[DevOps Team]

Deployment Architecture

Multi-environment with GitOps and progressive delivery.

graph TB
    subgraph "CI/CD Pipeline"
        GIT[Git Repository]

        subgraph "Build"
            BUILD[Build Stage]
            TEST[Test Stage<br/>Unit + Integration]
            SECURITY[Security Scan<br/>Trivy + SonarQube + Snyk]
            PACKAGE[Docker Build]
        end

        subgraph "GitOps - ArgoCD"
            ARGO[ArgoCD Controller]
            SYNC[Auto-Sync]
            HEALTH[Health Check]
        end

        subgraph "Progressive Delivery"
            CANARY[Flagger Canary]
            ANALYSIS[Metric Analysis]
            DECISION[Promote/Rollback]
        end
    end

    subgraph "Environments"
        DEV[Development<br/>Auto-Deploy]
        STAGING[Staging<br/>Manual Approval]
        PROD[Production<br/>Canary + Manual]
    end

    subgraph "Infrastructure"
        HELM[Helm Charts]
        K8S[Kubernetes]
        ISTIO[Istio Mesh]
    end

    GIT --> BUILD
    BUILD --> TEST
    TEST --> SECURITY
    SECURITY --> PACKAGE

    PACKAGE --> ARGO
    ARGO --> SYNC
    SYNC --> HEALTH

    HEALTH --> DEV
    DEV -.->|Smoke Tests Pass| STAGING
    STAGING -.->|Approval| CANARY

    CANARY --> ANALYSIS
    ANALYSIS --> DECISION
    DECISION -->|Success| PROD
    DECISION -->|Failure| STAGING

    HELM --> K8S
    K8S --> ISTIO
    ARGO -.->|Deploy| HELM

    style ARGO fill:#FF6B35,color:#fff
    style PROD fill:#6BCB77,color:#fff

Conclusion

DocuThinker’s enhanced architecture delivers:

For more details, refer to:


Last Updated: January 2025
Version: 2.0.0 - Enterprise DevOps Edition
Author: Son Nguyen