DocuThinker-AI-App

DocuThinker AI/ML Agentic Platform

Python 3.10+ · LangChain · CrewAI · MIT License

A production-ready, multi-agent RAG platform orchestrating LangGraph, LangChain, CrewAI, and multiple LLM providers for comprehensive document intelligence.


📋 Table of Contents

  1. Overview
  2. Architecture
  3. Core Components
  4. Features
  5. Getting Started
  6. Usage
  7. Advanced Features
  8. API Reference
  9. Configuration Reference
  10. Performance & Optimization
  11. Deployment
  12. Troubleshooting
  13. Development
  14. Testing
  15. Contributing
  16. License
  17. Acknowledgments


Overview

The ai_ml package is a sophisticated, production-ready Retrieval-Augmented Generation (RAG) platform that seamlessly integrates LangGraph orchestration, LangChain tooling, CrewAI multi-agent collaboration, and multiple LLM providers (OpenAI, Anthropic, Google) for end-to-end document intelligence.

Key Capabilities

✅ Comprehensive Document Analysis - Extract summaries, topics, insights, and sentiments
✅ Multi-Agent Reasoning - Three specialized agents collaborate for thorough analysis
✅ Semantic Search - Vector-based retrieval with configurable embeddings
✅ Q&A System - Context-aware question answering
✅ Multi-Language Translation - Helsinki-NLP models for 7+ languages
✅ Knowledge Graph Sync - Persistent document relationships in Neo4j
✅ Persistent Memory - ChromaDB for cross-session semantic recall
✅ Flexible Deployment - CLI, API server, MCP server, or Python library


Architecture

High-Level Architecture

graph TB
    subgraph "External Interfaces"
        CLI[CLI main.py]
        FASTAPI[FastAPI Server<br/>server.py]
        MCP[MCP Server<br/>mcp/server.py]
        PYAPI[Python API<br/>backend.py]
    end

    subgraph "Service Layer"
        SVC[DocumentIntelligenceService<br/>services/orchestrator.py]
    end

    subgraph "Pipeline Layer"
        RAG[AgenticRAGPipeline<br/>pipelines/rag_graph.py]
        CREW[CrewAI Agents<br/>agents/crew_agents.py]
    end

    subgraph "Tools & Utilities"
        TOOLS[Document Tools<br/>tools/document_tools.py]
        SEARCH[DocumentSearchTool]
        INSIGHTS[InsightsExtractionTool]
    end

    subgraph "Providers & Models"
        REGISTRY[LLMProviderRegistry<br/>providers/registry.py]
        OPENAI[OpenAI<br/>GPT-4o]
        ANTHROPIC[Anthropic<br/>Claude 3.5]
        GEMINI[Google<br/>Gemini 1.5 Pro]
        HF[HuggingFace<br/>Embeddings]
        HELSINKI[Helsinki-NLP<br/>Translators]
    end

    subgraph "Persistence Layer"
        FAISS[FAISS<br/>In-Memory Vector Store]
        CHROMA[ChromaDB<br/>Persistent Vector Store]
        NEO4J[Neo4j<br/>Knowledge Graph]
    end

    CLI --> SVC
    FASTAPI --> SVC
    MCP --> SVC
    PYAPI --> SVC

    SVC --> RAG
    SVC --> CREW
    SVC --> TOOLS

    RAG --> REGISTRY
    CREW --> REGISTRY
    TOOLS --> REGISTRY

    REGISTRY --> OPENAI
    REGISTRY --> ANTHROPIC
    REGISTRY --> GEMINI
    REGISTRY --> HF
    SVC --> HELSINKI

    TOOLS --> FAISS
    SVC --> CHROMA
    SVC --> NEO4J

    style SVC fill:#4CAF50,stroke:#333,stroke-width:3px,color:#fff
    style RAG fill:#2196F3,stroke:#333,stroke-width:2px,color:#fff
    style CREW fill:#FF9800,stroke:#333,stroke-width:2px,color:#fff

Component Diagram

graph LR
    subgraph "Core Settings"
        CFG[Settings<br/>core/settings.py]
        ENV[Environment Variables]
    end

    subgraph "Service Orchestration"
        SERVICE[DocumentIntelligenceService]
    end

    subgraph "Agentic Pipeline"
        INGEST[Ingest Node<br/>Chunking & Embedding]
        RAG_NODE[RAG Node<br/>Primary Analysis]
        CREW_NODE[Crew Node<br/>Multi-Agent Validation]
        FINAL[Finalize Node<br/>Report Assembly]
    end

    subgraph "CrewAI Agents"
        ANALYST[Analyst Agent<br/>OpenAI GPT-4o]
        RESEARCHER[Researcher Agent<br/>Google Gemini]
        REVIEWER[Reviewer Agent<br/>Anthropic Claude]
    end

    ENV --> CFG
    CFG --> SERVICE
    SERVICE --> INGEST

    INGEST --> RAG_NODE
    RAG_NODE --> CREW_NODE
    CREW_NODE --> FINAL

    CREW_NODE --> ANALYST
    CREW_NODE --> RESEARCHER
    CREW_NODE --> REVIEWER

    ANALYST --> FINAL
    RESEARCHER --> FINAL
    REVIEWER --> FINAL

    style SERVICE fill:#9C27B0,stroke:#333,stroke-width:3px,color:#fff
    style CREW_NODE fill:#FF5722,stroke:#333,stroke-width:2px,color:#fff

Data Flow

sequenceDiagram
    participant User
    participant CLI/API
    participant Service
    participant Pipeline
    participant RAG
    participant CrewAI
    participant Neo4j
    participant ChromaDB

    User->>CLI/API: Submit Document + Question
    CLI/API->>Service: analyze_document()
    Service->>Pipeline: run(document, question)

    Pipeline->>Pipeline: 1. Ingest & Chunk
    Pipeline->>Pipeline: 2. Create FAISS Vector Store

    Pipeline->>RAG: 3. Primary RAG Pass
    RAG->>RAG: Retrieve Context
    RAG->>RAG: Generate JSON (overview, topics, QA)
    RAG-->>Pipeline: RAG Payload

    Pipeline->>CrewAI: 4. Crew Collaboration
    CrewAI->>CrewAI: Analyst drafts summary
    CrewAI->>CrewAI: Researcher validates with citations
    CrewAI->>CrewAI: Reviewer synthesizes insights
    CrewAI-->>Pipeline: Crew Payload

    Pipeline->>Pipeline: 5. Finalize Report
    Pipeline-->>Service: Final Output

    Service->>Service: Enrich (sentiment, translation)

    alt Knowledge Graph Enabled
        Service->>Neo4j: Sync Document & Topics
        Neo4j-->>Service: Sync Status
    end

    alt Vector Store Enabled
        Service->>ChromaDB: Upsert Document
        ChromaDB-->>Service: Upsert Status
    end

    Service-->>CLI/API: Complete Results
    CLI/API-->>User: Display Analysis

Core Components

📂 Module Structure

ai_ml/
β”œβ”€β”€ core/
β”‚   β”œβ”€β”€ settings.py              # Runtime configuration & environment settings
β”‚   └── __init__.py
β”œβ”€β”€ services/
β”‚   β”œβ”€β”€ orchestrator.py          # DocumentIntelligenceService facade
β”‚   └── __init__.py
β”œβ”€β”€ pipelines/
β”‚   β”œβ”€β”€ rag_graph.py             # LangGraph agentic RAG pipeline
β”‚   └── __init__.py
β”œβ”€β”€ agents/
β”‚   β”œβ”€β”€ crew_agents.py           # CrewAI multi-agent collaboration
β”‚   └── __init__.py
β”œβ”€β”€ providers/
β”‚   β”œβ”€β”€ registry.py              # Multi-provider LLM & embedding registry
β”‚   └── __init__.py
β”œβ”€β”€ tools/
β”‚   β”œβ”€β”€ document_tools.py        # Semantic search & insights tools
β”‚   └── __init__.py
β”œβ”€β”€ graph/
β”‚   β”œβ”€β”€ neo4j_client.py          # Neo4j knowledge graph client
β”‚   └── __init__.py
β”œβ”€β”€ vectorstores/
β”‚   β”œβ”€β”€ chroma_store.py          # ChromaDB persistent vector store
β”‚   └── __init__.py
β”œβ”€β”€ mcp/
β”‚   β”œβ”€β”€ server.py                # MCP server for tool exposure
β”‚   └── __init__.py
β”œβ”€β”€ processing/
β”‚   β”œβ”€β”€ summarizer.py            # Summarization utilities
β”‚   β”œβ”€β”€ sentiment.py             # Sentiment analysis
β”‚   β”œβ”€β”€ topic_extractor.py       # Topic extraction
β”‚   └── translator.py            # Translation utilities
β”œβ”€β”€ extended_features/
β”‚   β”œβ”€β”€ chat_interface.py        # Conversational AI interface
β”‚   β”œβ”€β”€ bullet_summary_generator.py
β”‚   β”œβ”€β”€ key_ideas_extractor.py
β”‚   β”œβ”€β”€ recommendations_generator.py
β”‚   β”œβ”€β”€ refine_summary.py
β”‚   β”œβ”€β”€ rewriter.py
β”‚   └── voice_chat.py
β”œβ”€β”€ models/
β”‚   β”œβ”€β”€ hf_model.py              # HuggingFace model loaders
β”‚   β”œβ”€β”€ model_utils.py           # Model utilities
β”‚   └── onnx_helper.py           # ONNX conversion helpers
β”œβ”€β”€ backend.py                    # High-level API facade
β”œβ”€β”€ server.py                     # FastAPI REST server
β”œβ”€β”€ main.py                       # CLI entry point
β”œβ”€β”€ config.py                     # Legacy configuration
β”œβ”€β”€ requirements.txt              # Python dependencies
└── README.md                     # This file

🔑 Key Classes

| Class | Location | Purpose |
|-------|----------|---------|
| DocumentIntelligenceService | services/orchestrator.py | Main facade - orchestrates all AI/ML capabilities |
| AgenticRAGPipeline | pipelines/rag_graph.py | LangGraph pipeline - stateful RAG workflow |
| LLMProviderRegistry | providers/registry.py | Provider registry - lazy-loads LLMs & embeddings |
| Neo4jGraphClient | graph/neo4j_client.py | Knowledge graph - Neo4j operations |
| ChromaVectorClient | vectorstores/chroma_store.py | Vector store - persistent semantic search |
| DocumentSearchTool | tools/document_tools.py | Semantic search - FAISS-backed retrieval |
| InsightsExtractionTool | tools/document_tools.py | Topic extraction - heuristic-based insights |

Features

🎯 Document Intelligence

| Feature | Description | Module |
|---------|-------------|--------|
| Comprehensive Analysis | Full document intelligence with multi-agent validation | services/orchestrator.py |
| Summarization | Narrative and bullet-point summaries | processing/summarizer.py |
| Topic Extraction | AI-powered theme identification | services/orchestrator.py:140 |
| Q&A System | Context-aware question answering | pipelines/rag_graph.py |
| Sentiment Analysis | JSON-based sentiment with confidence scores | services/orchestrator.py:211 |
| Semantic Search | Vector-based document retrieval | tools/document_tools.py |
| Translation | Multi-language support (7+ languages) | models/hf_model.py |
| Recommendations | Actionable next-step generation | extended_features/recommendations_generator.py |
| Discussion Points | Debate prompt generation | discussion/discussion_generator.py |
| Rewriting | Tone-based document rewriting | extended_features/rewriter.py |

🤖 Multi-Agent System

The platform employs three specialized CrewAI agents that collaborate sequentially:

graph LR
    A[Document Analyst<br/>OpenAI GPT-4o] -->|Draft Summary| B[Cross-Referencer<br/>Google Gemini]
    B -->|Validated Findings| C[Insights Curator<br/>Anthropic Claude]
    C -->|Executive Insights| D[Final Report]

    style A fill:#10A37F,stroke:#333,stroke-width:2px,color:#fff
    style B fill:#4285F4,stroke:#333,stroke-width:2px,color:#fff
    style C fill:#D97757,stroke:#333,stroke-width:2px,color:#fff
    style D fill:#9C27B0,stroke:#333,stroke-width:2px,color:#fff

Agent Roles

  1. 📊 Document Analyst (OpenAI GPT-4o)
    • Role: Lead summarizer
    • Goal: Create faithful synopsis and highlight structure
    • Tools: DocumentSearchTool, InsightsExtractionTool
    • Output: Structured markdown summary with citations
  2. 🔍 Cross-Referencer (Google Gemini)
    • Role: Research agent verifying facts
    • Goal: Validate claims with direct citations
    • Tools: DocumentSearchTool
    • Output: Verified statements with flagged uncertainties
  3. 💡 Insights Curator (Anthropic Claude)
    • Role: Executive reviewer
    • Goal: Distill strategic recommendations and risks
    • Tools: DocumentSearchTool, InsightsExtractionTool
    • Output: Executive-ready action items and follow-ups

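A minimal sketch of how a sequential crew like this is typically assembled with CrewAI (the agent and task variables below are illustrative; the real wiring lives in agents/crew_agents.py):

from crewai import Crew, Process

# analyst, researcher, reviewer and their tasks are assumed to be built
# as described above (see agents/crew_agents.py for the actual definitions).
crew = Crew(
    agents=[analyst, researcher, reviewer],
    tasks=[summary_task, verification_task, insights_task],
    process=Process.sequential,  # each agent builds on the previous output
)
result = crew.kickoff(inputs={"question": "What are the key risks?"})
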
πŸ—„οΈ Persistence Layers

Vector Store (ChromaDB)

ChromaDB provides persistent, file-based vector storage, so document embeddings survive across sessions and power cross-session semantic recall (see Vector Store Operations below).

Knowledge Graph (Neo4j)

Neo4j stores documents and their extracted topics as graph nodes and relationships, enabling queries such as finding documents that share topics (see Knowledge Graph Operations below).


Getting Started

Prerequisites

  • Python 3.10+
  • An API key for at least one LLM provider (OpenAI, Anthropic, or Google)
  • Optional: a running Neo4j instance (knowledge graph) and ChromaDB (persistent vector store)

Installation

1. Clone Repository

git clone https://github.com/hoangsonww/DocuThinker-AI-App.git
cd DocuThinker-AI-App/ai_ml

2. Create Virtual Environment

python -m venv venv
source venv/bin/activate  # On Windows: venv\Scripts\activate

3. Install Dependencies

pip install -r requirements.txt

4. Install Optional Dependencies

For specific features:

# For Neo4j knowledge graph
pip install neo4j

# For ChromaDB vector store
pip install chromadb

# For ONNX model optimization
pip install onnx onnxruntime optimum[onnxruntime]

Configuration

Environment Variables

Create a .env file in the ai_ml/ directory:

# Required: At least one LLM provider
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=...

# Optional: Model overrides
DOCUTHINKER_OPENAI_MODEL=gpt-4o-mini
DOCUTHINKER_CLAUDE_MODEL=claude-3-5-sonnet-20241022
DOCUTHINKER_GEMINI_MODEL=gemini-1.5-pro
DOCUTHINKER_SENTIMENT_MODEL=claude-3-haiku-20240307
DOCUTHINKER_QA_MODEL=gpt-4o-mini

# Embedding configuration
DOCUTHINKER_EMBEDDING_PROVIDER=huggingface
DOCUTHINKER_EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Chunking configuration
DOCUTHINKER_CHUNK_SIZE=900
DOCUTHINKER_CHUNK_OVERLAP=120

# RAG configuration
DOCUTHINKER_RAG_QUESTION="Provide a comprehensive intelligence brief for this document."
DOCUTHINKER_BULLET_STYLE="Use concise bullet points and preserve key metrics or figures."

# Neo4j configuration (optional)
DOCUTHINKER_SYNC_GRAPH=true
DOCUTHINKER_NEO4J_URI=bolt://localhost:7687
DOCUTHINKER_NEO4J_USER=neo4j
DOCUTHINKER_NEO4J_PASSWORD=your-password
DOCUTHINKER_NEO4J_DATABASE=neo4j

# ChromaDB configuration (optional)
DOCUTHINKER_SYNC_VECTOR=true
DOCUTHINKER_CHROMA_DIR=.chroma
DOCUTHINKER_CHROMA_COLLECTION=docuthinker
DOCUTHINKER_VECTOR_TOP_K=6

# Knowledge base path (optional)
DOCUTHINKER_KB_PATH=/path/to/knowledge/base
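
If you embed the package in your own script rather than using the CLI or server, you may need to load the .env yourself; a sketch assuming the python-dotenv package is installed:

from dotenv import load_dotenv

# Load ai_ml/.env into the process environment before the service is first
# constructed; loading before the first import is the safe order (an
# assumption about when settings are read).
load_dotenv("ai_ml/.env")

from ai_ml.services import get_document_service
service = get_document_service()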

Minimal Configuration

For quick start with OpenAI only:

export OPENAI_API_KEY=sk-...

The system will automatically fall back to OpenAI for all agents if other providers are unavailable.
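
Conceptually, the fallback amounts to substituting OpenAI wherever a provider's key is missing; a hypothetical illustration (the registry's real logic lives in providers/registry.py):

import os

def resolve_provider(preferred: str, env_key: str) -> str:
    """Hypothetical helper: use the preferred provider only if its key is set."""
    return preferred if os.getenv(env_key) else "openai"

print(resolve_provider("anthropic", "ANTHROPIC_API_KEY"))  # -> "openai" if the key is absent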


Usage

CLI Usage

The CLI provides a comprehensive document analysis interface.

Basic Analysis

python -m ai_ml.main documents/sample.txt

With Question

python -m ai_ml.main documents/sample.txt \
  --question "What are the key risks mentioned?"

With Translation

python -m ai_ml.main documents/sample.txt \
  --question "Summarize the main findings" \
  --translate_lang es

With Metadata

python -m ai_ml.main documents/sample.txt \
  --doc_id policy-2024-001 \
  --title "Q4 Policy Whitepaper" \
  --translate_lang de

CLI Output

The CLI displays:

  1. Agentic RAG Overview - Structured JSON payload
  2. Summary - Narrative summary
  3. Bullet Summary - Concise bullet points
  4. Topics - Extracted themes
  5. Insights - Key findings
  6. Q&A - Answer to question (if provided)
  7. Sentiment - Sentiment analysis (label, confidence, rationale)
  8. Discussion - Discussion prompts
  9. Recommendations - Actionable next steps
  10. Translation - Translated version (if requested)
  11. Sync Report - Neo4j/ChromaDB sync status (if enabled)

Python API

Basic Usage

from ai_ml.services import get_document_service

# Get singleton service instance
service = get_document_service()

# Analyze document
results = service.analyze_document(
    document="Your document text here...",
    question="What are the key takeaways?",
    translate_lang="fr",
    metadata={"id": "doc-001", "title": "Sample Document"}
)

# Access results
print(results["summary"])
print(results["topics"])
print(results["qa"])
print(results["sentiment"])
print(results["translation"])

Individual Features

from ai_ml.services import get_document_service

service = get_document_service()
document = "Your document text..."

# Summarization
summary = service.summarize(document)
bullet_summary = service.bullet_summary(document)

# Topic extraction
topics = service.extract_topics(document)

# Q&A
answer = service.answer_question(document, "What is the main conclusion?")

# Sentiment analysis
sentiment = service.sentiment(document)

# Translation
translation = service.translate(document, "es")

# Semantic search
results = service.semantic_search(document, "risk assessment")

# Recommendations
recommendations = service.recommendations(document)

# Discussion points
discussion = service.discussion_points(document)

# Rewriting
rewritten = service.rewrite(document, tone="executive")

# Refine summary
refined = service.refine_summary("Draft summary...", document)

Vector Store Operations

from ai_ml.services import get_document_service

service = get_document_service()

# Upsert document to vector store
result = service.upsert_vector_document(
    document="Document text...",
    metadata={"source": "research_paper", "year": 2024},
    doc_id="doc-123"
)

# Query vector store
hits = service.query_vector_index("machine learning trends", n_results=10)
for hit in hits:
    print(f"ID: {hit['id']}, Score: {hit['distance']}")
    print(f"Snippet: {hit['document'][:200]}...")

Knowledge Graph Operations

from ai_ml.services import get_document_service

service = get_document_service()

# Sync to knowledge graph
result = service.sync_to_knowledge_graph(
    document="Document text...",
    agentic_payload={"overview": "...", "key_topics": [...]},
    metadata={"id": "doc-456", "title": "Research Paper"}
)

# Run Cypher query
results = service.run_graph_query(
    query="""
    MATCH (d:Document)-[:COVERS]->(t:Topic)
    WHERE t.name = $topic
    RETURN d.title, d.summary
    LIMIT 10
    """,
    params={"topic": "artificial intelligence"}
)

Conversation Chain

from ai_ml.services import get_document_service

service = get_document_service()

# Create conversation chain
chain = service.create_conversation_chain()

# Interactive conversation
response1 = chain.run(input="What is machine learning?")
response2 = chain.run(input="Can you give me an example?")
response3 = chain.run(input="How does it relate to AI?")

FastAPI Server

Start Server

uvicorn ai_ml.server:app --reload --host 0.0.0.0 --port 8000

API Endpoint

POST /analyze

Request Body:

{
  "document": "Your document text here...",
  "question": "What are the main findings?",
  "translate_lang": "fr",
  "metadata": {
    "id": "doc-001",
    "title": "Sample Document",
    "source": "research"
  }
}

Response:

{
  "rag": {
    "overview": "...",
    "key_topics": ["topic1", "topic2"],
    "qa_answer": "...",
    "supporting_context": ["quote1", "quote2"],
    "crew_analysis": {...},
    "citations": [...]
  },
  "summary": "...",
  "topics": ["topic1", "topic2"],
  "qa": "...",
  "discussion": "...",
  "insights": "...",
  "sentiment": {
    "label": "positive",
    "confidence": 0.85,
    "rationale": "..."
  },
  "translation": "...",
  "document_id": "doc-001",
  "metadata": {...},
  "sync": {
    "graph": {"status": "ok", "document_id": "doc-001"},
    "vector_store": {"status": "ok", "document_id": "doc-001"}
  }
}

cURL Example

curl -X POST http://localhost:8000/analyze \
  -H "Content-Type: application/json" \
  -d '{
    "document": "Artificial intelligence is transforming industries...",
    "question": "What industries are affected?",
    "translate_lang": "es"
  }'

Python Client Example

import requests

response = requests.post(
    "http://localhost:8000/analyze",
    json={
        "document": "Your document text...",
        "question": "What are the key findings?",
        "translate_lang": "fr"
    }
)

results = response.json()
print(results["summary"])

MCP Server

The MCP server exposes DocuThinker capabilities as standardized tools for external consumption.

Start MCP Server

python -m ai_ml.mcp.server

Available MCP Tools

| Tool | Description | Parameters |
|------|-------------|------------|
| agentic_document_brief | Full agentic analysis pipeline | document, question?, translate_lang |
| semantic_document_search | Semantic search within document | document, query |
| quick_topics | Extract bullet topics | document |
| vector_upsert | Persist to vector store | document, doc_id?, metadata? |
| vector_search | Search vector store | query, n_results |
| graph_upsert | Sync to knowledge graph | document, metadata? |
| graph_query | Execute Cypher query | query, params? |

MCP Configuration

Add to your MCP client configuration (e.g., Claude Desktop, Cline):

{
  "mcpServers": {
    "docuthinker-agentic": {
      "command": "python",
      "args": ["-m", "ai_ml.mcp.server"],
      "env": {
        "OPENAI_API_KEY": "sk-...",
        "ANTHROPIC_API_KEY": "sk-ant-...",
        "GOOGLE_API_KEY": "..."
      }
    }
  }
}

MCP Usage Example

From an MCP-compatible client:

Use the docuthinker-agentic tool agentic_document_brief with:
- document: "Quarterly earnings report shows 15% revenue growth..."
- question: "What are the key financial metrics?"
- translate_lang: "de"

Advanced Features

Knowledge Graph Integration

Setup Neo4j

Using Docker:

docker run -d \
  --name neo4j \
  -p 7474:7474 -p 7687:7687 \
  -e NEO4J_AUTH=neo4j/your-password \
  -v $HOME/neo4j/data:/data \
  neo4j:5.11

Using Neo4j Desktop:
  1. Download from neo4j.com/download
  2. Create a new database
  3. Start the database
  4. Note the connection details

Configure DocuThinker

export DOCUTHINKER_SYNC_GRAPH=true
export DOCUTHINKER_NEO4J_URI=bolt://localhost:7687
export DOCUTHINKER_NEO4J_USER=neo4j
export DOCUTHINKER_NEO4J_PASSWORD=your-password
export DOCUTHINKER_NEO4J_DATABASE=neo4j

Query Examples

Find Documents by Topic:

from ai_ml.services import get_document_service

service = get_document_service()

results = service.run_graph_query(
    query="""
    MATCH (d:Document)-[:COVERS]->(t:Topic)
    WHERE t.name CONTAINS $keyword
    RETURN d.id, d.title, d.summary, t.name
    ORDER BY d.updated_at DESC
    LIMIT 20
    """,
    params={"keyword": "machine learning"}
)

Find Related Documents (documents sharing topics with a given document):

results = service.run_graph_query(
    query="""
    MATCH (d1:Document)-[:COVERS]->(t:Topic)<-[:COVERS]-(d2:Document)
    WHERE d1.id = $doc_id AND d1 <> d2
    RETURN DISTINCT d2.id, d2.title, COUNT(t) AS shared_topics
    ORDER BY shared_topics DESC
    LIMIT 10
    """,
    params={"doc_id": "doc-123"}
)

Topic Network Analysis:

results = service.run_graph_query(
    query="""
    MATCH (t:Topic)<-[:COVERS]-(d:Document)
    RETURN t.name, COUNT(d) AS document_count
    ORDER BY document_count DESC
    LIMIT 20
    """
)

Vector Store Persistence

Setup ChromaDB

ChromaDB is file-based and requires no separate server:

export DOCUTHINKER_SYNC_VECTOR=true
export DOCUTHINKER_CHROMA_DIR=.chroma
export DOCUTHINKER_CHROMA_COLLECTION=docuthinker

Upsert Documents

from ai_ml.services import get_document_service

service = get_document_service()

# Upsert multiple documents
documents = [
    {"text": "Document 1 content...", "id": "doc-1", "metadata": {"year": 2024}},
    {"text": "Document 2 content...", "id": "doc-2", "metadata": {"year": 2023}},
    {"text": "Document 3 content...", "id": "doc-3", "metadata": {"year": 2024}},
]

for doc in documents:
    service.upsert_vector_document(
        document=doc["text"],
        doc_id=doc["id"],
        metadata=doc["metadata"]
    )

# Search across all documents
results = service.query_vector_index(
    query="What are the latest trends in AI?",
    n_results=10
)

for result in results:
    print(f"Document: {result['id']}")
    print(f"Similarity Score: {result['distance']}")
    print(f"Snippet: {result['document'][:200]}...")
    print(f"Metadata: {result['metadata']}")
    print()

Multi-Agent Collaboration

The CrewAI integration provides sophisticated multi-agent reasoning.

Agent Configuration

Agents are configured in agents/crew_agents.py and can be customized:

from crewai import Agent, Crew, Process, Task
from ai_ml.providers.registry import LLMProviderRegistry, LLMConfig

registry = LLMProviderRegistry()

# Create custom agent
custom_agent = Agent(
    name="Data Analyst",
    role="Statistical data analyzer",
    goal="Extract quantitative insights and trends",
    backstory="You are a data scientist specializing in statistical analysis...",
    llm=registry.chat(LLMConfig(provider="openai", model="gpt-4o-mini")),
    tools=[search_tool, insights_tool],
    verbose=True
)

# Create custom task
custom_task = Task(
    name="Statistical Analysis",
    description="Perform statistical analysis on the document data...",
    agent=custom_agent,
    expected_output="Statistical report with key metrics and trends"
)

Custom Crew

from ai_ml.agents import build_document_crew
from ai_ml.tools import DocumentSearchTool, InsightsExtractionTool, build_vector_store
from ai_ml.providers.registry import LLMProviderRegistry

# Build tools (document: your raw text; chunks: its split text chunks from ingestion)
retriever = build_vector_store(document)
search_tool = DocumentSearchTool(retriever)
insights_tool = InsightsExtractionTool(chunks)

# Build crew with custom context
registry = LLMProviderRegistry()
crew = build_document_crew(
    registry,
    retriever_tool=search_tool,
    insights_tool=insights_tool,
    additional_context={
        "openai_model": "gpt-4o",  # Use more powerful model
        "gemini_model": "gemini-1.5-pro",
        "claude_model": "claude-3-5-sonnet-20241022"
    }
)

# Run crew
result = crew.kickoff(inputs={
    "question": "Analyze the financial projections",
    "rag_overview": "...",
    "rag_topics": [...]
})

API Reference

DocumentIntelligenceService

The main service facade for all document intelligence operations.

analyze_document(document, question=None, translate_lang='fr', metadata=None)

Run the full agentic pipeline with enrichments.

Parameters:

  • document (str) - Document text to analyze
  • question (str, optional) - Question to answer about the document
  • translate_lang (str, default 'fr') - Target language code for translation
  • metadata (dict, optional) - Document metadata such as id and title

Returns:

  • dict - Full analysis results (rag, summary, topics, qa, sentiment, translation, document_id, metadata, sync)

Example:

results = service.analyze_document(
    document="Your text...",
    question="What are the risks?",
    translate_lang="es",
    metadata={"id": "doc-001"}
)

summarize(document, style=None)

Generate narrative summary.

Parameters:

  • document (str) - Document text to summarize
  • style (str, optional) - Style instructions overriding the default narrative style

Returns:

  • str - Narrative summary

bullet_summary(document)

Generate bullet-point summary.

Parameters:

  • document (str) - Document text to summarize

Returns:

  • str - Bullet-point summary

extract_topics(document)

Extract main topics/themes.

Parameters:

  • document (str) - Document text to analyze

Returns:

  • list - Extracted topic strings

answer_question(document, question)

Answer question about document.

Parameters:

  • document (str) - Document text providing the context
  • question (str) - Question to answer

Returns:

  • str - Context-grounded answer

sentiment(document)

Analyze sentiment.

Parameters:

  • document (str) - Document text to analyze

Returns:

  • dict - Sentiment with label, confidence, and rationale keys

translate(document, target_lang)

Translate document.

Parameters:

  • document (str) - Document text to translate
  • target_lang (str) - Target language code (e.g., 'fr', 'de', 'es')

Returns:

  • str - Translated text

semantic_search(document, query)

Perform semantic search.

Parameters:

  • document (str) - Document text to search within
  • query (str) - Search query

Returns:

  • Matching passages retrieved from the document

recommendations(document)

Generate actionable recommendations.

Parameters:

  • document (str) - Document text to analyze

Returns:

  • str - Actionable recommendations

discussion_points(document)

Generate discussion prompts.

Parameters:

  • document (str) - Document text to analyze

Returns:

  • str - Discussion prompts

rewrite(document, tone='professional')

Rewrite document in specified tone.

Parameters:

  • document (str) - Document text to rewrite
  • tone (str, default 'professional') - Target tone (e.g., 'executive')

Returns:

  • str - Rewritten document

refine_summary(draft_summary, document)

Refine draft summary.

Parameters:

  • draft_summary (str) - Draft summary to refine
  • document (str) - Source document for grounding

Returns:

  • str - Refined summary

sync_to_knowledge_graph(document, agentic_payload, metadata)

Sync document to Neo4j.

Parameters:

  • document (str) - Document text to sync
  • agentic_payload (dict) - RAG payload (overview, key_topics, ...)
  • metadata (dict) - Document metadata such as id and title

Returns:

  • dict - Sync status, e.g. {"status": "ok", "document_id": "..."}

run_graph_query(query, params=None)

Execute Cypher query.

Parameters:

  • query (str) - Cypher query to execute
  • params (dict, optional) - Query parameters

Returns:

  • Query result records

upsert_vector_document(document, metadata=None, doc_id=None)

Upsert document to vector store.

Parameters:

  • document (str) - Document text to persist
  • metadata (dict, optional) - Metadata stored alongside the embedding
  • doc_id (str, optional) - Stable identifier for upserts

Returns:

  • dict - Upsert status, e.g. {"status": "ok", "document_id": "..."}

query_vector_index(query, n_results=None)

Query vector store.

Parameters:

  • query (str) - Search query
  • n_results (int, optional) - Number of results (defaults to DOCUTHINKER_VECTOR_TOP_K)

Returns:

  • list - Hits with id, distance, document, and metadata keys

create_conversation_chain()

Create conversation chain.

Returns:

  • A LangChain conversation chain with memory for multi-turn dialogue


Configuration Reference

Core Settings

Defined in core/settings.py and configurable via environment variables.

| Setting | Environment Variable | Default | Description |
|---------|----------------------|---------|-------------|
| **LLM Models** | | | |
| Analyst Model | DOCUTHINKER_OPENAI_MODEL | gpt-4o-mini | OpenAI model for analysis |
| Researcher Model | DOCUTHINKER_GEMINI_MODEL | gemini-1.5-pro | Google model for research |
| Reviewer Model | DOCUTHINKER_CLAUDE_MODEL | claude-3-5-sonnet-20241022 | Anthropic model for review |
| Sentiment Model | DOCUTHINKER_SENTIMENT_MODEL | claude-3-haiku-20240307 | Model for sentiment analysis |
| Q&A Model | DOCUTHINKER_QA_MODEL | Same as analyst | Model for Q&A |
| **Embeddings** | | | |
| Embedding Provider | DOCUTHINKER_EMBEDDING_PROVIDER | huggingface | Provider for embeddings |
| Embedding Model | DOCUTHINKER_EMBEDDING_MODEL | sentence-transformers/all-MiniLM-L6-v2 | Embedding model name |
| **Chunking** | | | |
| Chunk Size | DOCUTHINKER_CHUNK_SIZE | 900 | Characters per chunk |
| Chunk Overlap | DOCUTHINKER_CHUNK_OVERLAP | 120 | Overlap between chunks |
| **RAG** | | | |
| Default Question | DOCUTHINKER_RAG_QUESTION | "Provide a comprehensive intelligence brief..." | Default RAG question |
| Bullet Style | DOCUTHINKER_BULLET_STYLE | "Use concise bullet points..." | Bullet summary style |
| **Neo4j** | | | |
| Enable Sync | DOCUTHINKER_SYNC_GRAPH | false | Enable Neo4j sync |
| URI | DOCUTHINKER_NEO4J_URI | None | Neo4j connection URI |
| User | DOCUTHINKER_NEO4J_USER | None | Neo4j username |
| Password | DOCUTHINKER_NEO4J_PASSWORD | None | Neo4j password |
| Database | DOCUTHINKER_NEO4J_DATABASE | None | Neo4j database name |
| **ChromaDB** | | | |
| Enable Sync | DOCUTHINKER_SYNC_VECTOR | false | Enable vector store sync |
| Directory | DOCUTHINKER_CHROMA_DIR | None | ChromaDB persist directory |
| Collection | DOCUTHINKER_CHROMA_COLLECTION | docuthinker | Collection name |
| Top K | DOCUTHINKER_VECTOR_TOP_K | 6 | Number of results |
| **Other** | | | |
| Knowledge Base Path | DOCUTHINKER_KB_PATH | None | Path to knowledge base |
| Fallback Summarizer | DOCUTHINKER_FALLBACK_SUMMARIZER | facebook/bart-large-cnn | HuggingFace summarizer |

Provider Specifications

Each agent model is configured as a ProviderSpec:

ProviderSpec(
    provider="openai",  # "openai", "anthropic", "google"
    model="gpt-4o-mini",
    temperature=0.15,
    max_tokens=900,
    extra={}  # Provider-specific kwargs
)

Translation Models

Default Helsinki-NLP models by language:

| Language | Code | Model |
|----------|------|-------|
| French | fr | Helsinki-NLP/opus-mt-en-fr |
| German | de | Helsinki-NLP/opus-mt-en-de |
| Spanish | es | Helsinki-NLP/opus-mt-en-es |
| Italian | it | Helsinki-NLP/opus-mt-en-it |
| Portuguese | pt | Helsinki-NLP/opus-mt-en-pt |
| Chinese | zh | Helsinki-NLP/opus-mt-en-zh |
| Japanese | ja | Helsinki-NLP/opus-mt-en-ja |
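
These are standard HuggingFace checkpoints; a quick sketch of loading one directly with transformers (the project's own loader lives in models/hf_model.py, so this is illustrative rather than the package's exact code):

from transformers import pipeline

translator = pipeline("translation", model="Helsinki-NLP/opus-mt-en-fr")
print(translator("DocuThinker turns documents into insights.")[0]["translation_text"])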

Performance & Optimization

Benchmarks

Typical performance metrics (M1 MacBook Pro, 16GB RAM):

| Operation | Input Size | Time | Notes |
|-----------|------------|------|-------|
| Full Analysis | 5K tokens | ~15-25s | With CrewAI collaboration |
| Summary Only | 5K tokens | ~3-5s | Single LLM call |
| Topic Extraction | 5K tokens | ~2-4s | Single LLM call |
| Semantic Search | 10K docs | ~100-200ms | FAISS in-memory |
| Vector Upsert | 1 doc | ~50-100ms | ChromaDB persist |
| Translation | 5K tokens | ~5-10s | HuggingFace model |

Optimization Tips

1. Model Selection

# Faster models for lower latency
export DOCUTHINKER_OPENAI_MODEL=gpt-4o-mini  # Instead of gpt-4o
export DOCUTHINKER_CLAUDE_MODEL=claude-3-haiku-20240307  # Instead of sonnet

2. Chunking Strategy

# Smaller chunks = faster embedding, less context
export DOCUTHINKER_CHUNK_SIZE=600
export DOCUTHINKER_CHUNK_OVERLAP=60
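
To see what these values change, you can reproduce the splitting step with LangChain's RecursiveCharacterTextSplitter (whether the pipeline uses this exact splitter class internally is an assumption):

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=600, chunk_overlap=60)
chunks = splitter.split_text(document_text)  # document_text: your raw document string
print(f"{len(chunks)} chunks to embed")  # smaller chunks -> more, but cheaper, embeddings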

3. Caching

The service uses a singleton pattern (get_document_service() returns one shared instance), and the provider registry lazy-loads LLM clients and embedding models so they are initialized once and reused across calls.

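A common way to implement such a cached singleton accessor (a sketch of the pattern, not necessarily the package's exact code):

from functools import lru_cache

from ai_ml.services import DocumentIntelligenceService

@lru_cache(maxsize=1)
def get_service() -> DocumentIntelligenceService:
    # First call constructs the service (and its LLM/embedding clients);
    # later calls return the same cached instance.
    return DocumentIntelligenceService()
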
4. Parallel Processing

# Use asyncio for concurrent document processing
import asyncio
from ai_ml.services import get_document_service

async def analyze_batch(documents):
    service = get_document_service()
    tasks = [
        asyncio.to_thread(service.analyze_document, doc)
        for doc in documents
    ]
    return await asyncio.gather(*tasks)

# Process 10 documents concurrently
results = asyncio.run(analyze_batch(documents))

5. ONNX Optimization

Convert HuggingFace models to ONNX for faster inference:

python -m ai_ml.convert_to_onnx \
  --model_name sentence-transformers/all-MiniLM-L6-v2 \
  --output_dir models/onnx/
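
Once exported, the model can be served with ONNX Runtime. A sketch using optimum (from the optional dependencies above), assuming the export produced a Transformers-compatible folder at models/onnx/:

from optimum.onnxruntime import ORTModelForFeatureExtraction
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
model = ORTModelForFeatureExtraction.from_pretrained("models/onnx/")  # assumed export path

inputs = tokenizer("DocuThinker test sentence", return_tensors="pt")
embedding = model(**inputs).last_hidden_state.mean(dim=1)  # simple mean pooling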

Deployment

Docker Deployment

Dockerfile

FROM python:3.10-slim

WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Copy requirements
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy application
COPY . .

# Expose ports
EXPOSE 8000

# Set environment variables
ENV PYTHONUNBUFFERED=1

# Run FastAPI server
CMD ["uvicorn", "ai_ml.server:app", "--host", "0.0.0.0", "--port", "8000"]

Docker Compose

version: '3.8'

services:
  docuthinker-api:
    build: .
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
      - GOOGLE_API_KEY=${GOOGLE_API_KEY}
      - DOCUTHINKER_SYNC_GRAPH=true
      - DOCUTHINKER_NEO4J_URI=bolt://neo4j:7687
      - DOCUTHINKER_NEO4J_USER=neo4j
      - DOCUTHINKER_NEO4J_PASSWORD=${NEO4J_PASSWORD}
      - DOCUTHINKER_SYNC_VECTOR=true
      - DOCUTHINKER_CHROMA_DIR=/data/chroma
    volumes:
      - chroma-data:/data/chroma
    depends_on:
      - neo4j

  neo4j:
    image: neo4j:5.11
    ports:
      - "7474:7474"
      - "7687:7687"
    environment:
      - NEO4J_AUTH=neo4j/${NEO4J_PASSWORD}
    volumes:
      - neo4j-data:/data

volumes:
  chroma-data:
  neo4j-data:

Build & Run

# Set environment variables
export OPENAI_API_KEY=sk-...
export ANTHROPIC_API_KEY=sk-ant-...
export GOOGLE_API_KEY=...
export NEO4J_PASSWORD=your-password

# Build and start
docker-compose up -d

# View logs
docker-compose logs -f docuthinker-api

# Stop
docker-compose down

Heroku Deployment

# Install Heroku CLI
brew install heroku/brew/heroku

# Login
heroku login

# Create app
heroku create docuthinker-api

# Set environment variables
heroku config:set OPENAI_API_KEY=sk-...
heroku config:set ANTHROPIC_API_KEY=sk-ant-...
heroku config:set GOOGLE_API_KEY=...

# Deploy
git push heroku main

# View logs
heroku logs --tail
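
Heroku determines the web process from a Procfile at the repository root; a minimal sketch mirroring the uvicorn command used elsewhere in this README ($PORT is assigned by Heroku):

web: uvicorn ai_ml.server:app --host 0.0.0.0 --port $PORT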

AWS Lambda Deployment

Use AWS Lambda + API Gateway for serverless deployment:

# lambda_handler.py
import json
from ai_ml.services import get_document_service

service = get_document_service()

def lambda_handler(event, context):
    body = json.loads(event['body'])

    results = service.analyze_document(
        document=body['document'],
        question=body.get('question'),
        translate_lang=body.get('translate_lang', 'fr'),
        metadata=body.get('metadata')
    )

    return {
        'statusCode': 200,
        'body': json.dumps(results),
        'headers': {'Content-Type': 'application/json'}
    }

Package and deploy using AWS SAM or Serverless Framework.


Troubleshooting

Common Issues

1. Missing API Keys

Error:

MissingAPIKeyError: OPENAI_API_KEY environment variable is required

Solution:

export OPENAI_API_KEY=sk-...
# Or add to .env file

2. Missing Dependencies

Error:

MissingDependencyError: Install langchain-anthropic to use the Anthropic provider

Solution:

pip install langchain-anthropic

3. Neo4j Connection Failed

Error:

Neo4jNotConfigured: Neo4j credentials are not configured

Solution:

# Ensure Neo4j is running
docker ps | grep neo4j

# Check connection
export DOCUTHINKER_NEO4J_URI=bolt://localhost:7687
export DOCUTHINKER_NEO4J_USER=neo4j
export DOCUTHINKER_NEO4J_PASSWORD=your-password

4. ChromaDB Persistence Issues

Error:

ChromaNotConfigured: Chroma persist directory is not configured

Solution:

# Create directory
mkdir -p .chroma

# Set environment variable
export DOCUTHINKER_CHROMA_DIR=.chroma

5. Model Loading Timeout

Error:

TimeoutError: Model loading exceeded timeout

Solution:

# Pre-download models
python -c "from sentence_transformers import SentenceTransformer; SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')"

6. Memory Issues with Large Documents

Error:

MemoryError: Unable to allocate array

Solution:

# Reduce chunk size
export DOCUTHINKER_CHUNK_SIZE=500

# Or split document before processing
def split_large_document(text, max_size=50000):
    return [text[i:i+max_size] for i in range(0, len(text), max_size)]

parts = split_large_document(large_document)
results = [service.analyze_document(part) for part in parts]

Debug Mode

Enable verbose logging:

import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger('ai_ml')
logger.setLevel(logging.DEBUG)

Health Check

from ai_ml.services import get_document_service

service = get_document_service()

# Test basic functionality
try:
    result = service.summarize("This is a test document.")
    print(f"✅ Service operational: {result[:50]}...")
except Exception as e:
    print(f"❌ Service error: {e}")

Development

Development Setup

# Clone repository
git clone https://github.com/hoangsonww/DocuThinker-AI-App.git
cd DocuThinker-AI-App/ai_ml

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install development dependencies
pip install -r requirements.txt
pip install -r requirements-dev.txt  # If available

# Install pre-commit hooks
pre-commit install

Code Style

The project follows PEP 8 with Black formatting:

# Format code
black ai_ml/

# Check formatting
black --check ai_ml/

# Lint
flake8 ai_ml/

# Type checking
mypy ai_ml/

Project Structure Guidelines

Adding New Features

1. Add Service Method

# services/orchestrator.py

from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import ChatPromptTemplate

def new_feature(self, document: str) -> str:
    """New feature description."""
    prompt = ChatPromptTemplate.from_template("...")
    llm = self._resolve_llm(self.settings.agent_models["analyst"])
    chain = prompt | llm | StrOutputParser()
    return chain.invoke({"document": document})

2. Add Backend Wrapper

# backend.py

def new_feature(document: str) -> str:
    return SERVICE.new_feature(document)

3. Add MCP Tool

# mcp/server.py

@app.tool()
def new_feature_tool(document: str) -> str:
    """Tool description for MCP."""
    return SERVICE.new_feature(document)

4. Update CLI

# main.py

print("\n=== New Feature ===")
print(backend.new_feature(document))

Testing

Unit Tests

# Run all tests
pytest ai_ml/tests/

# Run specific test
pytest ai_ml/tests/test_services.py

# Run with coverage
pytest --cov=ai_ml --cov-report=html

Integration Tests

# Test with live APIs (requires API keys)
pytest ai_ml/tests/integration/

Mock Tests

from langchain_core.language_models import FakeListChatModel
from ai_ml.services import DocumentIntelligenceService
from ai_ml.providers.registry import LLMProviderRegistry

# Create mock LLM
fake_llm = FakeListChatModel(responses=["Test summary"])

# Create mock registry
class MockRegistry(LLMProviderRegistry):
    def chat(self, config):
        return fake_llm

# Test with mock
service = DocumentIntelligenceService(registry=MockRegistry())
result = service.summarize("Test document")
assert result == "Test summary"

End-to-End Test

from ai_ml.services import get_document_service

service = get_document_service()

# Test full pipeline
results = service.analyze_document(
    document="Artificial intelligence is rapidly transforming industries...",
    question="What industries are affected?"
)

assert "summary" in results
assert "topics" in results
assert "qa" in results
assert results["qa"] is not None

Contributing

We welcome contributions! Please follow these guidelines:

Contribution Workflow

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature/amazing-feature
  3. Commit your changes: git commit -m 'Add amazing feature'
  4. Push to the branch: git push origin feature/amazing-feature
  5. Open a Pull Request

Pull Request Guidelines

Code Review Checklist


License

This project is licensed under the MIT License - see the LICENSE file for details.


Acknowledgments

Technologies

  • LangChain & LangGraph - Agentic RAG orchestration
  • CrewAI - Multi-agent collaboration
  • OpenAI, Anthropic, and Google - LLM providers
  • HuggingFace - Embeddings and Helsinki-NLP translation models
  • FAISS & ChromaDB - Vector stores
  • Neo4j - Knowledge graph
  • FastAPI - REST server


Made with ❤️ by Son Nguyen