Enterprise-grade data platform with 20 Docker services, Snowflake data warehouse, .NET 8 REST API, batch & streaming processing, ML experiment tracking, and multi-provider deployment (AWS, GCP, Azure, on-prem)
A comprehensive, production-ready data pipeline that seamlessly integrates batch and streaming processing with enterprise-grade monitoring, governance, and machine learning capabilities.
Batch Sources: MySQL, PostgreSQL, CSV/JSON/XML files, Data Lakes (MinIO/S3)
Streaming Sources: Apache Kafka for event logs, IoT sensor data, social media streams, and real-time CDC (Change Data Capture)
Batch Processing: Apache Spark for large-scale ETL with Great Expectations for data quality
Stream Processing: Spark Structured Streaming for real-time transformations, anomaly detection, and event processing
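The CDC flow above boils down to folding a stream of change events into current table state. A minimal pure-Python sketch of the idea (the event shape and `apply_cdc` helper are illustrative, not part of the project — real CDC tools emit richer envelopes, and in this pipeline Spark does the apply):

```python
# Minimal sketch of applying a CDC (Change Data Capture) event stream.
# The event format here is hypothetical; in the pipeline itself, Spark
# consumes these events from Kafka and applies them at scale.

def apply_cdc(events):
    """Fold a stream of change events into the current table state."""
    table = {}
    for ev in events:
        op, key = ev["op"], ev["key"]
        if op in ("insert", "update"):
            table[key] = ev["row"]
        elif op == "delete":
            table.pop(key, None)  # ignore deletes for unknown keys
    return table

events = [
    {"op": "insert", "key": 1, "row": {"name": "sensor-a", "temp": 21.5}},
    {"op": "update", "key": 1, "row": {"name": "sensor-a", "temp": 22.1}},
    {"op": "insert", "key": 2, "row": {"name": "sensor-b", "temp": 19.8}},
    {"op": "delete", "key": 2},
]
state = apply_cdc(events)
```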
7 Storage Systems: MinIO (S3-compatible data lake), PostgreSQL (analytics + staging), MongoDB (documents), InfluxDB (time-series), Redis (caching), Elasticsearch (search + logs), MySQL (OLTP source)
Star Schema: 4 dimension tables, 3 fact tables, 2 aggregation tables
ETL: Automated staging, MERGE-based loading, Snowflake Tasks for scheduled aggregation
Fallback: PostgreSQL warehouse for local development
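The MERGE-based loading above can be sketched with `sqlite3` standing in for the warehouse (Snowflake uses `MERGE INTO ... WHEN MATCHED / WHEN NOT MATCHED`; SQLite expresses the same upsert idea with `INSERT ... ON CONFLICT`, which keeps the sketch runnable anywhere). Table and column names here are illustrative, not the project's actual schema:

```python
import sqlite3

# MERGE-style upsert from a staging table into a dimension table,
# sketched with sqlite3. Matched rows are updated in place; unmatched
# staging rows are inserted. (The "WHERE true" is a SQLite quirk needed
# to disambiguate INSERT ... SELECT ... ON CONFLICT parsing.)
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE stg_customer (customer_id INTEGER, name TEXT);
    INSERT INTO dim_customer VALUES (1, 'Alice');
    INSERT INTO stg_customer VALUES (1, 'Alice B.'), (2, 'Bob');
""")
conn.execute("""
    INSERT INTO dim_customer (customer_id, name)
    SELECT customer_id, name FROM stg_customer WHERE true
    ON CONFLICT(customer_id) DO UPDATE SET name = excluded.name
""")
rows = conn.execute(
    "SELECT customer_id, name FROM dim_customer ORDER BY customer_id"
).fetchall()
```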
16 Endpoints: Batch ingestion, streaming, warehouse ETL, ML, governance, CI/CD
Enterprise: Serilog logging, Polly retry, Swagger docs, health probes (live/ready/full)
7 Controllers, 10 Services, 6 Health Checks
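The retry behaviour the API gets from Polly (exponential backoff around transient failures) looks roughly like this. A pure-Python analogue for illustration, not the actual .NET code:

```python
import time

# Polly-style retry policy sketch: retry a flaky call with exponential
# backoff, re-raising the error once all attempts are exhausted.
def retry(fn, attempts=3, base_delay=0.01):
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))  # 0.01s, 0.02s, ...

calls = {"n": 0}

def flaky():
    """Simulated dependency that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = retry(flaky)
```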
Observability: Prometheus (9 scrape targets), Grafana dashboards, Elasticsearch logs
MLOps: MLflow experiment tracking with PostgreSQL backend + MinIO artifacts
Quality: Great Expectations validation, Apache Atlas lineage
Local: Docker Compose (full 20-service or lite 8GB mode)
Cloud: Helm chart with AWS/GCP/Azure/on-prem value overrides
IaC: Terraform (VPC, EKS, RDS, S3) + Argo CD GitOps
CI/CD: GitHub Actions (lint, test, build, integration)
Cloud-native, microservices-based architecture designed for scalability, reliability, and maintainability
Built with industry-leading open-source technologies and cloud-native tools
Enterprise-grade capabilities for production data engineering workloads
Process millions of events per second with Spark Structured Streaming and Kafka. Sub-second latency for critical business insights with exactly-once semantics.
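The anomaly detection mentioned above amounts to a windowed statistic per stream. A stdlib sketch of the idea — in the pipeline this logic runs inside a Spark Structured Streaming job, not plain Python:

```python
from collections import deque
from statistics import mean, stdev

# Sliding-window z-score anomaly detector: flag a reading that is more
# than `threshold` standard deviations away from the recent window's
# mean. Window size and threshold here are illustrative defaults.
def detect_anomalies(readings, window=5, threshold=3.0):
    recent = deque(maxlen=window)
    anomalies = []
    for i, value in enumerate(readings):
        if len(recent) == recent.maxlen:
            mu, sigma = mean(recent), stdev(recent)
            if sigma > 0 and abs(value - mu) / sigma > threshold:
                anomalies.append((i, value))
        recent.append(value)
    return anomalies

readings = [20.1, 20.3, 19.9, 20.2, 20.0, 95.0, 20.1]
spikes = detect_anomalies(readings)
```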
Scalable batch processing with Apache Spark. Handle petabyte-scale datasets with optimized partitioning, compression, and distributed computing.
Automated validation with Great Expectations. Define expectations, run validations, and generate data quality reports. Prevent bad data from entering your pipeline.
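The expectation pattern — define expectations, run validations, report failures — can be sketched by hand. Great Expectations itself provides a much richer framework; the checks below are stand-ins that show the shape of the idea:

```python
# Hand-rolled sketch of the "expectation" pattern. Each check returns a
# small result record; a run passes only if every expectation succeeds.
def expect_column_values_not_null(rows, column):
    bad = [r for r in rows if r.get(column) is None]
    return {"expectation": f"{column} not null",
            "success": not bad, "failed": len(bad)}

def expect_column_values_between(rows, column, lo, hi):
    bad = [r for r in rows
           if r.get(column) is not None and not lo <= r[column] <= hi]
    return {"expectation": f"{column} in [{lo}, {hi}]",
            "success": not bad, "failed": len(bad)}

rows = [
    {"order_id": 1, "amount": 42.0},
    {"order_id": 2, "amount": -5.0},  # out of range
    {"order_id": 3, "amount": None},  # null
]
results = [
    expect_column_values_not_null(rows, "amount"),
    expect_column_values_between(rows, "amount", 0, 10_000),
]
all_passed = all(r["success"] for r in results)
```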
Track data flow from source to destination with Apache Atlas. Understand data dependencies, impact analysis, and compliance with automated lineage tracking.
Comprehensive monitoring with Prometheus and Grafana. Track pipeline health, performance metrics, SLA compliance, and receive intelligent alerts.
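Each of the scrape targets above serves metrics in the Prometheus text exposition format. A minimal sketch of what a scraped `/metrics` payload contains — metric names are illustrative, and in practice a client library renders this:

```python
# Minimal sketch of the Prometheus text exposition format served at
# /metrics by each scrape target: a HELP line, a TYPE line, then the
# sample itself. Metric names below are illustrative only.
def render_metrics(counters):
    lines = []
    for name, (help_text, value) in counters.items():
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

payload = render_metrics({
    "pipeline_records_ingested_total":
        ("Records ingested by batch jobs.", 12345),
    "pipeline_validation_failures_total":
        ("Rows failing data quality checks.", 7),
})
```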
Enterprise security with encryption at rest and in transit. RBAC, audit logging, secrets management, and compliance with GDPR, HIPAA, SOC 2.
Fully containerized with Docker and Kubernetes. Portable, scalable, and cloud-agnostic. Run on AWS, GCP, Azure, or on-premises infrastructure.
GitOps workflow with Argo CD. Automated testing, deployment, and rollback. Blue/Green and Canary deployment strategies with progressive delivery.
Seamless MLOps with MLflow and Feast. Track experiments, manage models, serve predictions, and maintain feature stores for ML workflows.
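The record an experiment tracker keeps per run — params, metrics, artifact URIs — can be sketched minimally. This is a stdlib illustration of the data model, not MLflow's actual API; in this stack MLflow persists such records in PostgreSQL with artifacts in MinIO:

```python
import json
import time
import uuid

# Stdlib sketch of what an experiment tracker records for a single run.
# Field names and the example artifact URI are illustrative.
def log_run(experiment, params, metrics, artifacts):
    return {
        "run_id": uuid.uuid4().hex,
        "experiment": experiment,
        "start_time": time.time(),
        "params": params,      # hyperparameters used for this run
        "metrics": metrics,    # evaluation results to compare runs by
        "artifacts": artifacts,  # where the model/files were stored
    }

run = log_run(
    experiment="churn-model",
    params={"max_depth": 6, "learning_rate": 0.1},
    metrics={"auc": 0.91, "logloss": 0.34},
    artifacts=["s3://mlflow/artifacts/model.pkl"],
)
serialized = json.dumps(run)  # what a backend store would persist
```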
Enterprise-grade deployment patterns for zero-downtime releases
Deploy the new version alongside the current one, then switch traffic in a single step. Rollback is just as fast: switch traffic back to the previous version.
Gradually shift traffic from old to new version with automated analysis. Detect issues early and rollback automatically if metrics degrade.
Update pods incrementally while maintaining service availability. Ideal for stateless services with minimal resource requirements.
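The canary strategy above — shift traffic, observe, roll back on degradation — reduces to a simple control loop. A sketch with a stand-in metric check (in practice Argo Rollouts plus Prometheus-based analysis plays this role):

```python
# Sketch of a canary rollout control loop: shift traffic to the new
# version in steps, check a health signal after each step, and roll
# back if it degrades. `healthy` is a stand-in for automated metric
# analysis; the step weights are illustrative percentages.
def canary_rollout(steps, healthy):
    weight = 0
    for step in steps:
        weight = step
        if not healthy(weight):
            return {"status": "rolled_back", "final_weight": 0}
    return {"status": "promoted", "final_weight": weight}

# Simulated run where the new version stays healthy at every weight.
ok = canary_rollout([10, 25, 50, 100], healthy=lambda w: True)
# Simulated regression that appears once 50% of traffic hits the canary.
bad = canary_rollout([10, 25, 50, 100], healthy=lambda w: w < 50)
```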
Real-world applications across industries and domains
Real-time recommendations, fraud detection, inventory optimization, customer behavior analysis, and personalized marketing campaigns.
Risk analysis, trade surveillance, fraud detection, regulatory compliance, portfolio analytics, and real-time transaction monitoring.
Patient monitoring, clinical trial analysis, predictive diagnostics, IoT medical device data processing, and treatment outcome prediction.
Predictive maintenance, supply chain optimization, quality control, sensor data analysis, and production efficiency monitoring.
Sentiment analysis, ad fraud detection, content recommendations, user engagement tracking, and real-time trend detection.
Route optimization, fleet management, demand forecasting, delivery tracking, and real-time traffic analysis.
Deploy all 20 services in minutes with a single command
Get the source code from GitHub and navigate to the project directory
git clone https://github.com/hoangsonww/End-to-End-Data-Pipeline.git
cd End-to-End-Data-Pipeline
Create environment file and launch all 20 services with Docker Compose
cp .env.example .env
make build && make up
# Or for 8GB machines:
make up-lite
Once all services are running, access the web interfaces:
Trigger the batch ingestion DAG to see the full pipeline in action
# Trigger batch pipeline: MySQL → Validation → MinIO → Spark → PostgreSQL
make trigger-batch
# Trigger warehouse ETL: Staging → Snowflake dimensions → facts → aggregations
make trigger-warehouse
# View all DAGs
make list-dags
The kafka-producer service auto-generates sensor data; run Spark jobs to process it:
# Kafka producer is already running (auto-started)
make kafka-topics
# Run Spark batch ETL
make spark-batch
# Run Spark streaming (continuous)
make spark-stream
Deploy to any cloud or on-prem Kubernetes cluster via Helm chart
# Deploy to any Kubernetes cluster
make deploy-k8s
# Or deploy to specific providers:
make deploy-aws # AWS EKS (Terraform + Helm)
make deploy-gcp # GCP GKE
make deploy-azure # Azure AKS
make deploy-onprem # On-prem (k3s, kubeadm)
# Check status / teardown
make deploy-status
make deploy-teardown