An end-to-end, production-ready Agentic RAG pipeline powered by Google Gemini (planning, writing, critique), FAISS (vector search), Google Programmable Search (hybrid web retrieval), and a simple file-backed session memory. It performs intent recognition, task decomposition, dynamic retrieval planning, multi-agent verification, tool/API calls, and guardrailed finalization.
```mermaid
flowchart TD
    U[User] --> IR[Intent Router - Flash]
    IR --> PL[Planner - Pro]
    PL --> RP[Retrieval Planner - Pro]
    RP --> R1[VectorRetriever - FAISS]
    RP --> R2[WebRetriever - Google CSE + Fetch]
    R1 --> W[Writer - Pro]
    R2 --> W
    W --> C[Critic - Pro]
    C -->|follow-ups| RP
    W --> G[Guardrails]
    G --> A[Answer + Evidence]
    PL -. session .-> M[(File-backed Memory)]
```
```text
User
  │
  ▼
Intent Router (Gemini Flash)
  │  JSON: intents, safety, urgency
  ▼
Planner / Decomposer (Gemini Pro)
  │  JSON: sub-goals with sources & done-tests
  ├──────────────────────────────────┐
  │                                  │
  ▼                                  ▼
Retrieval Planner (Gemini Pro)    Memory (session)
  │  JSON: diverse queries, k
  ▼
Retrievers (parallel per query)
  ├─ VectorRetriever (FAISS)
  └─ WebRetriever (Google CSE + page reader)
  │
  ▼
Writer / Synthesizer (Gemini Pro)
  │  JSON: {status, draft, missing}
  ▼
Critic / Verifier (Gemini Pro)
  │  JSON: {ok, issues, followup_queries}
  └─(if gaps)→ targeted re-retrieval → Writer
  ▼
Guardrails (PII masking)
  ▼
Final Answer + Evidence Trace
```
Do agents share the same LLM instance? Each agent runs its own LLM session (distinct system prompt, temperature, token budget) while pointing to the same Gemini family (e.g., 1.5 Pro/Flash). This isolates roles, enables parallelism, and simplifies telemetry/cost control.
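A minimal sketch of that wiring with the `google-generativeai` SDK (the `AGENT_CONFIGS` table and `make_agent` helper are illustrative, not the repo's exact code):

```python
import os
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Hypothetical per-role configs: same Gemini family, different prompts/budgets.
AGENT_CONFIGS = {
    "intent":  {"model": "gemini-1.5-flash", "temperature": 0.1, "system": "Classify user intent; return JSON."},
    "planner": {"model": "gemini-1.5-pro",   "temperature": 0.2, "system": "Decompose the task into sub-goals; return JSON."},
    "writer":  {"model": "gemini-1.5-pro",   "temperature": 0.3, "system": "Synthesize a cited draft from evidence; return JSON."},
    "critic":  {"model": "gemini-1.5-pro",   "temperature": 0.1, "system": "Verify the draft against done-tests; return JSON."},
}

def make_agent(role: str) -> genai.GenerativeModel:
    """Build an isolated model object per agent role."""
    cfg = AGENT_CONFIGS[role]
    return genai.GenerativeModel(
        model_name=cfg["model"],
        system_instruction=cfg["system"],
        generation_config={"temperature": cfg["temperature"], "max_output_tokens": 2048},
    )
```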
Accounts/keys:

- `GOOGLE_API_KEY` (required) for Gemini.
- `CSE_API_KEY` and `CSE_ENGINE_ID` (optional) for Google Programmable Search (web retrieval).

Install:

```bash
python -m venv .venv
source .venv/bin/activate   # Windows: .venv\Scripts\activate
pip install --upgrade pip
pip install google-generativeai faiss-cpu httpx requests beautifulsoup4 pydantic python-dotenv
```
Configure keys (create a `.env` in the project root):

```bash
# Required
GOOGLE_API_KEY=your_gemini_api_key

# Optional (enable web retrieval)
CSE_API_KEY=your_google_cse_key
CSE_ENGINE_ID=your_google_cse_engine_id

# Optional (where to ingest local docs)
CORPUS_DIR=corpus
```

Drop `.txt` or `.md` files into `corpus/`. They'll be chunked and embedded on startup.

Environment variables quick reference:
| Name | Required | Purpose |
|---|---|---|
| `GOOGLE_API_KEY` | Yes | Gemini API key |
| `CSE_API_KEY` | No | Google Programmable Search API key |
| `CSE_ENGINE_ID` | No | Google CSE Engine ID |
| `CORPUS_DIR` | No | Directory of local documents to index |
Run the app:

```bash
python app.py
```

You'll see:

```text
[ingest] Loading corpus from: corpus
[ingest] Added N chunks.
[web] Google Programmable Search enabled.   # if keys provided
```

Then type questions at the prompt:

```text
>>> Compare the two documents in the corpus and list actionable next steps.
```

The system will plan, retrieve (vector + web if enabled), synthesize, critique, and output a grounded answer with a Sources list.
Each sub-goal gets its own queries and `k`. Retrievers run hybrid search: FAISS vector retrieval plus Google web retrieval (when enabled), in parallel per query.
Project layout:

```text
agentic-rag/
  app.py
  core/
    llm.py                 # Gemini client, embeddings, JSON helpers
    vector.py              # FAISS index + corpus ingestion
    tools.py               # Web search + page fetcher
    memory.py              # File-backed session memory
    structs.py             # Pydantic data contracts
  agents/
    base.py
    intent.py
    planner.py
    retrieval_planner.py
    retrievers.py
    writer.py
    critic.py
    guardrails.py
  graph/
    orchestrator.py        # The control flow / loop
  eval/
    harness.py             # Optional quick smoke tests
  corpus/                  # (your .txt/.md docs)
  .session_memory/         # (generated)
```
Agent contracts (all JSON):

- Intent Router → `{intents[], safety[], urgency, notes}`
- Planner → sub-goals, each with `sources` and a `done_test`
- Retrieval Planner → `{queries[], k}` per sub-goal
- Writer → `{status, draft, missing}`, with bracketed citations like `[#1]`
- Critic → `{ok, issues, followup_queries}`; triggers one targeted repair loop

LLM instances: each agent uses its own Gemini session & parameters. Same model family (e.g., `gemini-1.5-pro`), different prompts and budgets.
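A hedged sketch of what those contracts could look like as Pydantic models in `core/structs.py` (field names follow the JSON shapes above; defaults and types are assumptions):

```python
from pydantic import BaseModel, Field

class IntentResult(BaseModel):
    intents: list[str]
    safety: list[str] = []
    urgency: str = "normal"
    notes: str = ""

class SubGoal(BaseModel):
    goal: str
    sources: list[str]            # e.g., ["vector", "web"]
    done_test: str                # criterion the critic checks the draft against

class RetrievalPlan(BaseModel):
    queries: list[str]
    k: int = Field(default=6, ge=4, le=12)  # bounded per the tuning notes below

class Draft(BaseModel):
    status: str                   # e.g., "ok" | "needs_more"
    draft: str                    # includes bracketed citations like [#1]
    missing: list[str] = []

class Critique(BaseModel):
    ok: bool
    issues: list[str] = []
    followup_queries: list[str] = []
```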
Local corpus: drop `.txt` / `.md` files in `corpus/`. Chunks are embedded with `text-embedding-004` (768-dim); FAISS uses inner product on normalized vectors.

Tip: create a file like `corpus/knowledge.md` with key facts, glossaries, or SOPs for stronger grounding.
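A minimal sketch of that embed-and-index path, assuming the standard `google-generativeai` and `faiss-cpu` APIs (chunking omitted):

```python
import os
import faiss
import numpy as np
import google.generativeai as genai

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
DIM = 768  # text-embedding-004 output dimension

def embed(texts: list[str]) -> np.ndarray:
    vecs = [
        genai.embed_content(model="models/text-embedding-004", content=t)["embedding"]
        for t in texts
    ]
    x = np.asarray(vecs, dtype="float32")
    faiss.normalize_L2(x)  # normalized vectors make inner product == cosine
    return x

index = faiss.IndexFlatIP(DIM)          # inner-product index
index.add(embed(["chunk one ...", "chunk two ..."]))
scores, ids = index.search(embed(["example query"]), k=2)
```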
Web (optional) broadens coverage and provides fresh/public context; enable it with `CSE_API_KEY` and `CSE_ENGINE_ID`.
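The web leg can be as simple as Programmable Search's public REST endpoint plus a naive page reader. A sketch (error handling trimmed; the repo's `core/tools.py` may differ):

```python
import os
import httpx
from bs4 import BeautifulSoup

def web_search(query: str, num: int = 5) -> list[dict]:
    """Query Google Programmable Search and return lightweight hits."""
    r = httpx.get(
        "https://www.googleapis.com/customsearch/v1",
        params={
            "key": os.environ["CSE_API_KEY"],
            "cx": os.environ["CSE_ENGINE_ID"],
            "q": query,
            "num": num,
        },
        timeout=15,
    )
    r.raise_for_status()
    return [
        {"title": it["title"], "uri": it["link"], "snippet": it.get("snippet", "")}
        for it in r.json().get("items", [])
    ]

def fetch_text(uri: str, limit: int = 8000) -> str:
    """Fetch a result page and strip it to plain text for the Writer."""
    html = httpx.get(uri, timeout=15, follow_redirects=True).text
    return BeautifulSoup(html, "html.parser").get_text(" ", strip=True)[:limit]
```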
Tuning knobs (see `graph/orchestrator.py`):

- `k` per sub-goal comes from the Retrieval Planner (bounded to 4–12).
- Evidence is deduplicated by `(uri, chunk_id)`, as sketched below.
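The dedup step is tiny; a sketch keyed on `(uri, chunk_id)`:

```python
def dedupe(evidence: list[dict]) -> list[dict]:
    """Keep the first occurrence of each (uri, chunk_id) pair."""
    seen: set[tuple] = set()
    unique = []
    for ev in evidence:
        key = (ev["uri"], ev.get("chunk_id"))
        if key not in seen:
            seen.add(key)
            unique.append(ev)
    return unique
```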
Memory: sessions persist to `.session_memory/SESSION_ID.jsonl`. The orchestrator appends user and assistant messages and can generate a short summary window for context.
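A plausible shape for `core/memory.py` (function names are illustrative): append-only JSONL per session, reading back the last N turns for the summary window:

```python
import json
import time
from pathlib import Path

MEMORY_DIR = Path(".session_memory")

def append_turn(session_id: str, role: str, content: str) -> None:
    """Append one user/assistant message to the session's JSONL file."""
    MEMORY_DIR.mkdir(exist_ok=True)
    record = {"ts": time.time(), "role": role, "content": content}
    with (MEMORY_DIR / f"{session_id}.jsonl").open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")

def recent_turns(session_id: str, n: int = 10) -> list[dict]:
    """Return the last n turns for a short context window."""
    path = MEMORY_DIR / f"{session_id}.jsonl"
    if not path.exists():
        return []
    return [json.loads(line) for line in path.read_text(encoding="utf-8").splitlines()[-n:]]
```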
Done-tests: each sub-goal carries a `done_test`. The critic evaluates whether the draft meets it; if not, it proposes follow-up queries.

Temperatures: 0.1–0.3 for determinism; ~0.1 for the critic to keep follow-ups targeted.

Production considerations to plan for: latency & cost, observability, security, and scaling.
"GOOGLE_API_KEY is required": set it in `.env` or your shell environment.

Web search always disabled: set both `CSE_API_KEY` and `CSE_ENGINE_ID` (and ensure your CSE is configured to search the web or desired domains).

Empty/weak answers: add relevant docs to `corpus/`, then raise `k` or the chunk size in `core/vector.py`.
"JSON parsing" warnings (rare): the pipeline is resilient and attempts to coerce malformed JSON. If it recurs, lower temperatures.
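One common coercion strategy, sketched here (the repo's actual helper may differ): strip code fences, then fall back to the outermost braces:

```python
import json
import re

def coerce_json(text: str) -> dict:
    """Best-effort parse of almost-JSON returned by an LLM."""
    text = re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip())
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        start, end = text.find("{"), text.rfind("}")
        if start != -1 and end > start:
            return json.loads(text[start:end + 1])
        raise
```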
Slow runs: disable web, reduce `k`, or remove the critic loop in `orchestrator.py`.
Q: Does each agent use its own instance of the LLM?
A: Yes. Each agent maintains its own Gemini session & config (system prompt, temperature, token limits). They typically use the same base model (Gemini 1.5 Pro for planning/writing/critique; Gemini 1.5 Flash for routing/guardrails).
Q: Can I run without web search?
A: Yes. The system runs vector-only if `CSE_API_KEY`/`CSE_ENGINE_ID` aren't set.
Q: How do I add a custom tool/API (e.g., SQL, Jira, GitHub)?
A: Add a client in `core/tools.py`, create a dedicated agent (e.g., `DataAgent`) with restricted prompts/permissions, and call it from the orchestrator based on sub-goal `sources` or intent routing. A sketch follows.
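A hypothetical `DataAgent` to make that concrete, with SQLite as the stand-in backend (all names here are illustrative, not the repo's code):

```python
import sqlite3

def run_sql(db_path: str, query: str) -> list[tuple]:
    """Tool client for core/tools.py: read-only SQL to limit blast radius."""
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        return conn.execute(query).fetchall()
    finally:
        conn.close()

class DataAgent:
    """Handles sub-goals whose `sources` include "sql"."""

    def __init__(self, db_path: str):
        self.db_path = db_path

    def handle(self, sub_goal: str, query: str) -> dict:
        rows = run_sql(self.db_path, query)
        return {"sub_goal": sub_goal, "rows": rows, "source": "sql"}
```

The orchestrator would route to it whenever a sub-goal's `sources` list names the tool, keeping the agent's prompt and permissions narrowly scoped.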
Q: How do I swap FAISS for pgvector/Pinecone?
A: Replace `FAISSIndex` with your client; keep the `add()`/`search()` signatures. Most adapters are a few dozen lines; see the sketch below.
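For example, a pgvector adapter might look like this sketch (assumes a psycopg connection and a table with an `embedding vector(768)` column; the exact `FAISSIndex` signatures live in `core/vector.py`):

```python
import json
import numpy as np

class PGVectorIndex:
    """Drop-in stand-in for FAISSIndex, keeping the add()/search() signatures."""

    def __init__(self, conn, table: str = "chunks"):
        self.conn, self.table = conn, table

    def add(self, ids: list[str], vectors: np.ndarray, payloads: list[dict]) -> None:
        with self.conn.cursor() as cur:
            for i, v, p in zip(ids, vectors, payloads):
                cur.execute(
                    f"INSERT INTO {self.table} (id, embedding, payload) "
                    f"VALUES (%s, %s::vector, %s)",
                    (i, str(v.tolist()), json.dumps(p)),
                )
        self.conn.commit()

    def search(self, vector: np.ndarray, k: int = 8) -> list[tuple[str, float]]:
        vec = str(vector.tolist())  # pgvector accepts '[0.1, 0.2, ...]' literals
        with self.conn.cursor() as cur:
            cur.execute(
                f"SELECT id, 1 - (embedding <=> %s::vector) AS score "
                f"FROM {self.table} ORDER BY embedding <=> %s::vector LIMIT %s",
                (vec, vec, k),
            )
            return cur.fetchall()
```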
Q: How do I change models or parameters?
A: Edit `core/llm.py` (`GEMINI_PRO`, `GEMINI_FLASH`, temperatures, max tokens). You can also route some roles to Flash for cost/latency.
```bash
# 1) Install deps
pip install google-generativeai faiss-cpu httpx requests beautifulsoup4 pydantic python-dotenv

# 2) Configure keys
echo "GOOGLE_API_KEY=your_gemini_api_key" >> .env
# Optional web:
echo "CSE_API_KEY=..." >> .env
echo "CSE_ENGINE_ID=..." >> .env

# 3) Add local docs (optional)
mkdir -p corpus
echo "Your internal SOPs or notes go here." > corpus/notes.md

# 4) Run
python app.py
```
This pipeline is designed to be a solid foundation for building advanced, agentic RAG systems with Gemini. It can be extended with more agents, tools, and retrieval methods as needed. Happy coding!