End-to-End Data Pipeline

Enterprise-grade data platform with 20 Docker services, Snowflake data warehouse, .NET 8 REST API, batch & streaming processing, ML experiment tracking, and multi-provider deployment (AWS, GCP, Azure, on-prem)

πŸ”₯ 20 Docker Services ⚑ Real-Time Streaming ❄️ Snowflake Warehouse πŸ“Š MLflow + Spark 🌐 Multi-Cloud Deploy πŸš€ Helm + Terraform πŸ”§ .NET 8 API πŸ“¦ 49 Tests Passing

Overview

A comprehensive, production-ready data pipeline that seamlessly integrates batch and streaming processing with enterprise-grade monitoring, governance, and machine learning capabilities.

πŸ“₯ Data Ingestion

Batch Sources: MySQL, PostgreSQL, CSV/JSON/XML files, Data Lakes (MinIO/S3)

Streaming Sources: Apache Kafka for event logs, IoT sensor data, social media streams, and real-time CDC (Change Data Capture)
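
The streaming sources above are fed by a Kafka producer service. A minimal pure-Python sketch of the kind of IoT sensor event such a producer might publish follows; the field names are illustrative assumptions, not the project's actual schema:

```python
import json
import random
import time

def make_sensor_reading(sensor_id: str) -> dict:
    """Build one illustrative IoT sensor event (field names are hypothetical)."""
    return {
        "sensor_id": sensor_id,
        "timestamp": time.time(),  # epoch seconds
        "temperature_c": round(random.uniform(15.0, 35.0), 2),
        "humidity_pct": round(random.uniform(20.0, 90.0), 1),
    }

# Serialize to JSON bytes, as a Kafka producer would before publishing
event = make_sensor_reading("sensor-001")
payload = json.dumps(event).encode("utf-8")
```

In the real service this payload would go to a topic such as `sensor_readings` via a Kafka client; the serialization step is the same either way.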

βš™οΈ Data Processing

Batch Processing: Apache Spark for large-scale ETL with Great Expectations for data quality

Stream Processing: Spark Structured Streaming for real-time transformations, anomaly detection, and event processing
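
The stream-processing step can be illustrated with the windowing logic that Spark Structured Streaming applies. This is a pure-Python sketch of a tumbling-window average, not the Spark API itself:

```python
from collections import defaultdict

def tumbling_window_avg(events, window_secs=10):
    """Group (timestamp, value) events into fixed-size windows and average each.

    Pure-Python sketch of the windowed aggregation Spark Structured
    Streaming performs; not the Spark API.
    """
    buckets = defaultdict(list)
    for ts, value in events:
        # Align each event to the start of its window
        window_start = int(ts // window_secs) * window_secs
        buckets[window_start].append(value)
    return {w: sum(v) / len(v) for w, v in sorted(buckets.items())}

readings = [(0, 20.0), (3, 22.0), (11, 30.0), (19, 32.0), (21, 25.0)]
averages = tumbling_window_avg(readings, window_secs=10)
# windows: [0,10) -> 21.0, [10,20) -> 31.0, [20,30) -> 25.0
```

In Spark the equivalent is a `groupBy` on a time window over the event timestamp; the bucketing arithmetic is the same.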

πŸ’Ύ Multi-Tier Storage

7 Storage Systems: MinIO (S3-compatible data lake), PostgreSQL (analytics + staging), MongoDB (documents), InfluxDB (time-series), Redis (caching), Elasticsearch (search + logs), MySQL (OLTP source)

❄️ Snowflake Data Warehouse

Star Schema: 4 dimension tables, 3 fact tables, 2 aggregation tables

ETL: Automated staging, MERGE-based loading, Snowflake Tasks for scheduled aggregation

Fallback: PostgreSQL warehouse for local development
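
The MERGE-based loading step can be sketched as a statement builder for upserting a dimension from staging. The table and column names below (`dim_customer`, `stg_customer`) are hypothetical examples, not necessarily the project's schema:

```python
def build_dim_merge(target: str, staging: str, key: str, cols: list[str]) -> str:
    """Render a Snowflake-style MERGE that upserts a dimension table
    from its staging table. Names are illustrative only.
    """
    set_clause = ", ".join(f"t.{c} = s.{c}" for c in cols)
    insert_cols = ", ".join([key] + cols)
    insert_vals = ", ".join(f"s.{c}" for c in [key] + cols)
    return (
        f"MERGE INTO {target} t USING {staging} s ON t.{key} = s.{key} "
        f"WHEN MATCHED THEN UPDATE SET {set_clause} "
        f"WHEN NOT MATCHED THEN INSERT ({insert_cols}) VALUES ({insert_vals})"
    )

sql = build_dim_merge("dim_customer", "stg_customer", "customer_id", ["name", "email"])
```

The same MERGE shape works against both Snowflake and the PostgreSQL 15 fallback warehouse, which is one reason MERGE-based loading keeps the two targets interchangeable.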

πŸ”§ .NET 8 REST API

16 Endpoints: Batch ingestion, streaming, warehouse ETL, ML, governance, CI/CD

Enterprise: Serilog logging, Polly retry, Swagger docs, health probes (live/ready/full)

7 Controllers, 10 Services, 6 Health Checks

πŸ“Š Monitoring & ML

Observability: Prometheus (9 scrape targets), Grafana dashboards, Elasticsearch logs

MLOps: MLflow experiment tracking with PostgreSQL backend + MinIO artifacts

Quality: Great Expectations validation, Apache Atlas lineage

πŸš€ Multi-Provider Deployment

Local: Docker Compose (full 20-service or lite 8GB mode)

Cloud: Helm chart with AWS/GCP/Azure/on-prem value overrides

IaC: Terraform (VPC, EKS, RDS, S3) + Argo CD GitOps

CI/CD: GitHub Actions (lint, test, build, integration)

Architecture

Cloud-native, microservices-based architecture designed for scalability, reliability, and maintainability

High-Level System Architecture

graph TB
  subgraph Sources
    BS[Batch Sources - MySQL]
    SS[Streaming - Kafka]
  end
  subgraph Orchestration
    AIR[Apache Airflow - 3 DAGs]
    KAF[Apache Kafka 7.5.0]
  end
  subgraph Processing
    SPB[Spark Batch ETL]
    SPS[Spark Streaming]
    GE[Great Expectations]
  end
  subgraph Storage
    MIN[MinIO - S3 Storage]
    PG[PostgreSQL 15]
    MDB[MongoDB 6.0]
    IDB[InfluxDB 2.7]
    RED[Redis 7]
    ES[Elasticsearch 8.11]
  end
  subgraph Serving
    MLF[MLflow 2.9.2]
    API[.NET 8 API]
    SF[Snowflake DW]
  end
  subgraph Monitoring
    PROM[Prometheus]
    GRAF[Grafana]
  end
  BS --> AIR
  SS --> KAF
  AIR --> SPB
  KAF --> SPS
  SPB --> GE
  SPS --> GE
  GE --> MIN
  GE --> PG
  PG --> SF
  PG --> MLF
  PG --> API
  PROM --> GRAF

Batch Processing Flow

sequenceDiagram
  participant BS as MySQL Source
  participant AF as Airflow DAG
  participant GE as Great Expectations
  participant MN as MinIO
  participant SP as Spark Batch
  participant PG as PostgreSQL
  BS->>AF: Trigger Batch Job
  AF->>BS: Extract Data
  AF->>GE: Validate Quality
  GE-->>AF: Validation OK
  AF->>MN: Upload Raw Data
  AF->>SP: Submit Spark Job
  SP->>MN: Read Raw Data
  SP->>SP: Transform + Enrich
  SP->>PG: Write Processed Data
  SP->>MN: Archive to S3

Streaming Processing Flow

sequenceDiagram
  participant KP as Kafka Producer
  participant KT as Kafka Topic
  participant SS as Spark Streaming
  participant AD as Anomaly Detection
  participant PG as PostgreSQL
  participant MN as MinIO
  loop Every 2 seconds
    KP->>KT: Publish sensor_readings
  end
  KT->>SS: Consume Stream
  SS->>AD: Process Events
  AD->>AD: Detect Anomalies
  AD->>PG: Store Anomalies
  AD->>MN: Archive to S3
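
The anomaly-detection step in the flow above is not specified in detail; one plausible approach is a z-score check over a batch of readings. A minimal sketch, assuming that method:

```python
from statistics import mean, stdev

def zscore_anomalies(values, threshold=3.0):
    """Flag readings more than `threshold` standard deviations from the mean.

    A sketch of one common anomaly-detection approach; the pipeline's
    actual detection logic may differ.
    """
    if len(values) < 2:
        return []  # not enough data to estimate spread
    mu, sigma = mean(values), stdev(values)
    if sigma == 0:
        return []  # all values identical, nothing stands out
    return [v for v in values if abs(v - mu) / sigma > threshold]

readings = [21.0, 21.5, 20.8, 21.2, 21.1, 21.3, 20.9, 21.0, 95.0]
anomalies = zscore_anomalies(readings, threshold=2.0)
```

Anything flagged would then be written to PostgreSQL and archived to MinIO, per the sequence above.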

Docker Compose Stack (20 Services)

graph TB
  subgraph Data
    ZK[Zookeeper :2181] --> KAFKA[Kafka :9092]
    MYSQL[MySQL :3306]
    PG[PostgreSQL :5432]
    REDIS[Redis :6379]
    MONGO[MongoDB :27017]
    MINIO[MinIO :9000]
    INFLUX[InfluxDB :8086]
    KP[Kafka Producer]
  end
  subgraph Orchestration
    AF_WS[Airflow :8080]
    AF_SC[Scheduler]
  end
  subgraph Processing
    SM[Spark Master :8081]
    SW[Spark Worker]
  end
  subgraph Serving
    MLF[MLflow :5001]
    API[.NET API :5000]
  end
  subgraph Monitoring
    PROM[Prometheus :9090]
    GRAF[Grafana :3000]
    ES[Elasticsearch :9200]
  end
  KP --> KAFKA
  AF_WS --> MYSQL
  AF_WS --> PG
  SM --> SW
  SM --> MINIO
  SM --> KAFKA
  SM --> PG
  MLF --> PG
  API --> PG
  API --> KAFKA
  PROM --> GRAF

Technology Stack

Built with industry-leading open-source technologies and cloud-native tools

πŸ”„ Data Processing

  • ⚑ Apache Spark 3.5.3 - Batch & Stream Processing
  • πŸ“« Apache Kafka (Confluent Platform 7.5.0) - Event Streaming
  • 🦁 Zookeeper - Kafka Coordination
  • ✨ Great Expectations - Data Quality Validation

πŸ“Š Orchestration & API

  • πŸŒ€ Apache Airflow 2.7.3 - DAG Orchestration (3 DAGs)
  • πŸ”§ .NET 8 / C# - REST API (16 endpoints, Swagger)
  • πŸ“¦ Serilog + Polly - Structured Logging & Retry
  • 🎯 Dapper - Micro-ORM for DB Access

πŸ’Ύ Databases & Storage

  • ❄️ Snowflake - Cloud Data Warehouse (primary)
  • 🐘 PostgreSQL 15 - Analytics + Staging + Fallback DW
  • 🐬 MySQL 8.0 - Source OLTP Database
  • πŸͺ£ MinIO - S3-Compatible Object Storage
  • πŸƒ MongoDB 6.0 - NoSQL Document Store
  • πŸ“ˆ InfluxDB 2.7 - Time-Series Database
  • ⚑ Redis 7 - In-Memory Cache
  • πŸ” Elasticsearch 8.11 - Search & Log Indexing

πŸ€– ML & Monitoring

  • πŸ”¬ MLflow 2.9.2 - Experiment Tracking & Model Registry
  • πŸ”₯ Prometheus 2.48 - Metrics Collection (9 targets)
  • πŸ“ˆ Grafana 10.2 - Dashboards & Alerting
  • πŸ—ΊοΈ Apache Atlas - Data Lineage & Governance

πŸš€ DevOps & Deployment

  • 🐳 Docker Compose - Local Development (20 services)
  • ☸️ Kubernetes + Helm - Production Deployment
  • πŸ—οΈ Terraform - AWS Infrastructure (VPC, EKS, RDS, S3)
  • πŸš€ Argo CD - GitOps Continuous Delivery
  • βš™οΈ GitHub Actions - CI/CD Pipeline

πŸ“‹ Languages & Frameworks

  • 🐍 Python 3.10 - Pipeline Code & DAGs
  • πŸ”§ C# / .NET 8 - Backend REST API
  • 🌐 HTML5 / CSS3 / JavaScript - Wiki Frontend
  • πŸ—„οΈ SQL - Warehouse Schema & Transformations
  • πŸ“ HCL - Terraform Infrastructure

Key Features

Enterprise-grade capabilities for production data engineering workloads

⚑ Real-Time Processing

Process millions of events per second with Spark Structured Streaming and Kafka. Sub-second latency for critical business insights with exactly-once semantics.

πŸ“Š Batch Analytics

Scalable batch processing with Apache Spark. Handle petabyte-scale datasets with optimized partitioning, compression, and distributed computing.

πŸ” Data Quality

Automated validation with Great Expectations. Define expectations, run validations, and generate data quality reports. Prevent bad data from entering your pipeline.
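
The expectation-check idea can be illustrated with a toy stand-in; this is not the Great Expectations API, just a sketch of required-column and range expectations:

```python
def validate_batch(rows, required, ranges):
    """Check rows against two simple expectation types: required non-null
    columns, and numeric (min, max) ranges. A toy stand-in for
    Great Expectations, not its actual API.
    """
    failures = []
    for i, row in enumerate(rows):
        for col in required:
            if row.get(col) is None:
                failures.append((i, col, "null"))
        for col, (lo, hi) in ranges.items():
            val = row.get(col)
            if val is not None and not (lo <= val <= hi):
                failures.append((i, col, "out_of_range"))
    return {"success": not failures, "failures": failures}

report = validate_batch(
    rows=[{"id": 1, "temp": 21.5}, {"id": None, "temp": 250.0}],
    required=["id"],
    ranges={"temp": (-40.0, 60.0)},
)
```

In the actual pipeline this gate runs before raw data lands in MinIO, which is what keeps bad records from propagating downstream.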

πŸ—ΊοΈ Data Lineage

Track data flow from source to destination with Apache Atlas. Understand data dependencies, impact analysis, and compliance with automated lineage tracking.

πŸ“ˆ Observability

Comprehensive monitoring with Prometheus and Grafana. Track pipeline health, performance metrics, SLA compliance, and receive intelligent alerts.

πŸ”’ Security & Compliance

Enterprise security with encryption at rest and in transit. RBAC, audit logging, secrets management, and compliance with GDPR, HIPAA, SOC 2.

πŸ“¦ Containerized

Fully containerized with Docker and Kubernetes. Portable, scalable, and cloud-agnostic. Run on AWS, GCP, Azure, or on-premises infrastructure.

πŸ”„ CI/CD Ready

GitOps workflow with Argo CD. Automated testing, deployment, and rollback. Blue/Green and Canary deployment strategies with progressive delivery.

πŸ€– ML Integration

Seamless MLOps with MLflow and Feast. Track experiments, manage models, serve predictions, and maintain feature stores for ML workflows.

Deployment Strategies

Enterprise-grade deployment patterns for zero-downtime releases

πŸ”΅πŸŸ’ Blue/Green Deployment (Zero Downtime)

Deploy new version alongside current version, then switch traffic instantly. Instant rollback capability by switching back to previous version.

  • Instant traffic switching between versions
  • Zero downtime during deployment
  • Immediate rollback capability
  • Full testing in production environment
  • Preview environment before promotion

πŸ•―οΈ Canary Deployment Progressive Rollout

Gradually shift traffic from old to new version with automated analysis. Detect issues early and rollback automatically if metrics degrade.

  • Progressive traffic shifting (10% β†’ 25% β†’ 50% β†’ 100%)
  • Automated Prometheus metrics analysis
  • Auto-rollback on failure detection
  • Reduced blast radius for issues
  • Real-time performance comparison
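
The progressive shifting and auto-rollback above can be simulated in a few lines. In a real rollout the per-step error rates would come from Prometheus queries; here they are supplied directly as hypothetical numbers:

```python
def run_canary(weights, error_rates, max_error_rate=0.01):
    """Step traffic through the canary weights, rolling back if the
    observed error rate at any step exceeds the threshold.

    `error_rates` maps a traffic weight to the canary's observed error
    rate at that step (stand-in for a Prometheus analysis query).
    """
    promoted = []
    for w in weights:
        if error_rates.get(w, 0.0) > max_error_rate:
            return {"status": "rolled_back", "failed_at": w, "promoted": promoted}
        promoted.append(w)
    return {"status": "promoted", "failed_at": None, "promoted": promoted}

# Healthy rollout: every step stays under the error budget
ok = run_canary([10, 25, 50, 100], {10: 0.002, 25: 0.004, 50: 0.003, 100: 0.001})

# Degraded canary: errors spike at 50% traffic, triggering rollback
bad = run_canary([10, 25, 50, 100], {10: 0.002, 25: 0.004, 50: 0.08})
```

The reduced blast radius falls out of the structure: a failure at 50% traffic means only half of users ever saw the bad version, and the rollback is automatic.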

πŸ”„ Rolling Deployment (Incremental Update)

Update pods incrementally while maintaining service availability. Ideal for stateless services with minimal resource requirements.

  • Incremental pod updates
  • Configurable update speed
  • Health checks before promotion
  • Pause and resume capability
  • Minimal resource overhead

CI/CD Pipeline Flow

graph LR
  subgraph Development
    DEV[Developer] --> GIT[Git Push]
  end
  subgraph CI/CD
    GIT --> GHA[GitHub Actions]
    GHA --> TEST[Run Tests]
    TEST --> BUILD[Build Images]
    BUILD --> PUSH[Push to GHCR]
  end
  subgraph Deploy
    PUSH --> HELM[Helm Chart]
    HELM --> K8S[Kubernetes]
  end
  subgraph Infra
    TERRA[Terraform] --> CLOUD[AWS/GCP/Azure]
    CLOUD --> K8S
  end

Use Cases

Real-world applications across industries and domains

πŸ›’ E-Commerce & Retail

Real-time recommendations, fraud detection, inventory optimization, customer behavior analysis, and personalized marketing campaigns.

πŸ’° Financial Services

Risk analysis, trade surveillance, fraud detection, regulatory compliance, portfolio analytics, and real-time transaction monitoring.

πŸ₯

Healthcare

Patient monitoring, clinical trial analysis, predictive diagnostics, IoT medical device data processing, and treatment outcome prediction.

🏭 Manufacturing & IoT

Predictive maintenance, supply chain optimization, quality control, sensor data analysis, and production efficiency monitoring.

πŸ“± Media & Social

Sentiment analysis, ad fraud detection, content recommendations, user engagement tracking, and real-time trend detection.

πŸš— Transportation & Logistics

Route optimization, fleet management, demand forecasting, delivery tracking, and real-time traffic analysis.

Getting Started

Deploy all 20 services in minutes with a single command

1. Clone the Repository

Get the source code from GitHub and navigate to the project directory

bash
git clone https://github.com/hoangsonww/End-to-End-Data-Pipeline.git
cd End-to-End-Data-Pipeline

2. Configure & Start

Create environment file and launch all 20 services with Docker Compose

bash
cp .env.example .env
make build && make up

# Or for 8GB machines:
make up-lite

3. Access the Services

Once all services are running, open the web interfaces (ports as published by the Docker Compose stack):

  • Airflow: http://localhost:8080
  • Spark Master UI: http://localhost:8081
  • MLflow: http://localhost:5001
  • .NET API (Swagger): http://localhost:5000
  • Prometheus: http://localhost:9090
  • Grafana: http://localhost:3000
  • MinIO: http://localhost:9000
  • InfluxDB: http://localhost:8086
  • Elasticsearch: http://localhost:9200

4. Run Your First Pipeline

Trigger the batch ingestion DAG to see the full pipeline in action

bash
# Trigger batch pipeline: MySQL β†’ Validation β†’ MinIO β†’ Spark β†’ PostgreSQL
make trigger-batch

# Trigger warehouse ETL: Staging β†’ Snowflake dimensions β†’ facts β†’ aggregations
make trigger-warehouse

# View all DAGs
make list-dags

5. Run Streaming & Spark Jobs

The kafka-producer service auto-generates sensor data. Run Spark jobs for processing

bash
# Kafka producer is already running (auto-started)
make kafka-topics

# Run Spark batch ETL
make spark-batch

# Run Spark streaming (continuous)
make spark-stream

6. Deploy to Production (Optional)

Deploy to any cloud or on-prem Kubernetes cluster via Helm chart

bash
# Deploy to any Kubernetes cluster
make deploy-k8s

# Or deploy to specific providers:
make deploy-aws      # AWS EKS (Terraform + Helm)
make deploy-gcp      # GCP GKE
make deploy-azure    # Azure AKS
make deploy-onprem   # On-prem (k3s, kubeadm)

# Check status / teardown
make deploy-status
make deploy-teardown

πŸ“š Additional Resources
