This repository contains a fully integrated, production-ready data pipeline that supports both batch and streaming data processing using open-source technologies. It is designed to be easily configured and deployed by any business or individual with minimal modifications.
The pipeline incorporates:
- Apache Airflow for orchestration and Apache Kafka for event streaming
- Apache Spark for large-scale batch ETL and real-time stream processing
- Great Expectations for data quality validation
- MinIO, PostgreSQL, MongoDB, InfluxDB, and AWS S3 for storage
- Prometheus and Grafana for monitoring, with Apache Atlas for lineage and governance
- MLflow and Feast for model tracking and feature serving
Read this README and follow the step-by-step guide to set up the pipeline on your local machine or cloud environment. Customize the pipeline components, configurations, and example applications to suit your data processing needs.
The architecture of the end-to-end data pipeline is designed to handle both batch and streaming data processing. Below is a high-level overview of the components and their interactions:
graph TB
subgraph "Data Sources"
BS[Batch Sources<br/>MySQL, Files, CSV/JSON/XML]
SS[Streaming Sources<br/>Kafka Events, IoT, Social Media]
end
subgraph "Ingestion & Orchestration"
AIR[Apache Airflow<br/>DAG Orchestration]
KAF[Apache Kafka<br/>Event Streaming]
end
subgraph "Processing Layer"
SPB[Spark Batch<br/>Large-scale ETL]
SPS[Spark Streaming<br/>Real-time Processing]
GE[Great Expectations<br/>Data Quality]
end
subgraph "Storage Layer"
MIN[MinIO<br/>S3-Compatible Storage]
PG[PostgreSQL<br/>Analytics Database]
S3[AWS S3<br/>Cloud Storage]
MDB[MongoDB<br/>NoSQL Store]
IDB[InfluxDB<br/>Time-series DB]
end
subgraph "Monitoring & Governance"
PROM[Prometheus<br/>Metrics Collection]
GRAF[Grafana<br/>Dashboards]
ATL[Apache Atlas<br/>Data Lineage]
end
subgraph "ML & Serving"
MLF[MLflow<br/>Model Tracking]
FST[Feast<br/>Feature Store]
BI[BI Tools<br/>Tableau/PowerBI/Looker]
end
BS --> AIR
SS --> KAF
AIR --> SPB
KAF --> SPS
SPB --> GE
SPS --> GE
GE --> MIN
GE --> PG
MIN --> S3
PG --> MDB
PG --> IDB
SPB --> PROM
SPS --> PROM
PROM --> GRAF
SPB --> ATL
SPS --> ATL
PG --> MLF
PG --> FST
PG --> BI
MIN --> MLF
At a high level, data is streamed with Kafka, processed with Spark, and stored in a PostgreSQL data warehouse. MinIO serves as the object store, and Airflow orchestrates the end-to-end data flow. Great Expectations enforces data quality checks, while Prometheus and Grafana provide monitoring and alerting. MLflow and Feast handle machine learning model tracking and feature store integration.
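To make the core flow concrete, below is a minimal sketch of the streaming leg (Kafka → Spark Structured Streaming → PostgreSQL). The topic name, schema, target table, and credentials are assumptions for illustration; the repository's actual logic lives in spark/spark_streaming_job.py.

```python
# Minimal sketch of the streaming leg: Kafka -> Spark -> PostgreSQL.
# Requires the spark-sql-kafka connector and the PostgreSQL JDBC driver
# on the Spark classpath; names and credentials below are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import DoubleType, StringType, StructType

spark = SparkSession.builder.appName("streaming-sketch").getOrCreate()

schema = StructType() \
    .add("event_id", StringType()) \
    .add("value", DoubleType())

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker address
    .option("subscribe", "events")                    # assumed topic name
    .load()
    .select(from_json(col("value").cast("string"), schema).alias("e"))
    .select("e.*")
)

def write_to_postgres(batch_df, batch_id):
    # Persist each micro-batch to the analytics database (assumed table name).
    (batch_df.write.format("jdbc")
        .option("url", "jdbc:postgresql://postgres:5432/processed_db")
        .option("dbtable", "stream_events")
        .option("user", "user")
        .option("password", "pass")
        .mode("append")
        .save())

events.writeStream.foreachBatch(write_to_postgres).start().awaitTermination()
```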
[!CAUTION] Note: The diagrams may not reflect ALL components in the repository, but they provide a good overview of the main components and their interactions. For instance, I added BI tools like Tableau, Power BI, and Looker to the repo for data visualization and reporting.
sequenceDiagram
participant BS as Batch Source<br/>(MySQL/Files)
participant AF as Airflow DAG
participant GE as Great Expectations
participant MN as MinIO
participant SP as Spark Batch
participant PG as PostgreSQL
participant MG as MongoDB
participant PR as Prometheus
BS->>AF: Trigger Batch Job
AF->>BS: Extract Data
AF->>GE: Validate Data Quality
GE-->>AF: Validation Results
AF->>MN: Upload Raw Data
AF->>SP: Submit Spark Job
SP->>MN: Read Raw Data
SP->>SP: Transform & Enrich
SP->>PG: Write Processed Data
SP->>MG: Write NoSQL Data
SP->>PR: Send Metrics
AF->>PR: Job Status Metrics
sequenceDiagram
participant KP as Kafka Producer
participant KT as Kafka Topic
participant SS as Spark Streaming
participant AD as Anomaly Detection
participant PG as PostgreSQL
participant MN as MinIO
participant GF as Grafana
KP->>KT: Publish Events
KT->>SS: Consume Stream
SS->>AD: Process Events
AD->>AD: Detect Anomalies
AD->>PG: Store Results
AD->>MN: Archive Data
SS->>GF: Real-time Metrics
GF->>GF: Update Dashboard
graph LR
subgraph "Data Quality Pipeline"
DI[Data Ingestion] --> GE[Great Expectations]
GE --> VR{Validation<br/>Result}
VR -->|Pass| DP[Data Processing]
VR -->|Fail| AL[Alert & Log]
AL --> DR[Data Rejection]
DP --> DQ[Quality Metrics]
end
subgraph "Data Governance"
DP --> ATL[Apache Atlas]
ATL --> LIN[Lineage Tracking]
ATL --> CAT[Data Catalog]
ATL --> POL[Policies & Compliance]
end
DQ --> PROM[Prometheus]
PROM --> GRAF[Grafana Dashboard]
graph LR
subgraph "Development"
DEV[Developer] --> GIT[Git Push]
end
subgraph "CI/CD Pipeline"
GIT --> GHA[GitHub Actions]
GHA --> TEST[Run Tests]
TEST --> BUILD[Build Docker Images]
BUILD --> SCAN[Security Scan]
SCAN --> PUSH[Push to Registry]
end
subgraph "Deployment"
PUSH --> ARGO[Argo CD]
ARGO --> K8S[Kubernetes Cluster]
K8S --> HELM[Helm Charts]
HELM --> PODS[Deploy Pods]
end
subgraph "Infrastructure"
TERRA[Terraform] --> CLOUD[Cloud Resources]
CLOUD --> K8S
end
PODS --> MON[Monitoring]
Batch Side:
┌─────────────────────────────────┐
│          Batch Source           │
│ (MySQL, Files, User Interaction)│
└────────────────┬────────────────┘
                 │
                 │ (Extract/Validate)
                 ▼
┌──────────────────────────────────────┐
│          Airflow Batch DAG           │
│ - Extracts data from MySQL           │
│ - Validates with Great Expectations  │
│ - Uploads raw data to MinIO          │
└──────────────────┬───────────────────┘
                   │ (spark-submit)
                   ▼
┌─────────────────────────────────┐
│         Spark Batch Job         │
│ - Reads raw CSV from MinIO      │
│ - Transforms, cleans, enriches  │
│ - Writes transformed data to    │
│   PostgreSQL & MinIO            │
└───────────────┬─────────────────┘
                │ (Load/Analyze)
                ▼
┌─────────────────────────────────┐
│      Processed Data Store       │
│ (PostgreSQL, MongoDB, AWS S3)   │
└───────────────┬─────────────────┘
                │ (Query/Analyze)
                ▼
┌─────────────────────────────────┐
│        Cache & Indexing         │
│    (Elasticsearch, Redis)       │
└─────────────────────────────────┘
Streaming Side:
┌──────────────────────────────┐
│       Streaming Source       │
│           (Kafka)            │
└─────────────┬────────────────┘
              │
              ▼
┌────────────────────────────────────┐
│        Spark Streaming Job         │
│ - Consumes Kafka messages          │
│ - Filters and detects anomalies    │
│ - Persists anomalies to            │
│   PostgreSQL & MinIO               │
└────────────────────────────────────┘
Monitoring & Governance:
┌─────────────────────────────────┐
│          Monitoring &           │
│     Data Governance Layer       │
│ - Prometheus & Grafana          │
│ - Apache Atlas / OpenMetadata   │
└─────────────────────────────────┘
ML & Serving:
┌──────────────────────────────┐
│        AI/ML Serving         │
│ - Feature Store (Feast)      │
│ - MLflow Model Tracking      │
│ - Model training & serving   │
│ - BI Dashboards              │
└──────────────────────────────┘
CI/CD & Terraform:
┌──────────────────────────────┐
│       CI/CD Pipelines        │
│ - GitHub Actions / Jenkins   │
│ - Terraform for Cloud Deploy │
└──────────────────────────────┘
Container Orchestration:
┌───────────────────────────────┐
│      Kubernetes Cluster       │
│ - Argo CD for GitOps          │
│ - Helm Charts for Deployment  │
└───────────────────────────────┘
A more detailed flow diagram that includes backend and frontend integration is available in the assets/ directory. This diagram illustrates how the data pipeline components interact with each other and with external systems, including data sources, storage, processing, visualization, and monitoring.
Although the frontend and backend integration is not included in this repository (since it is meant to contain only the pipeline), you can easily integrate it with your existing frontend application or build a new one using popular frameworks like React, Angular, or Vue.js.
graph TB
subgraph "Docker Compose Stack"
subgraph "Data Sources"
MYSQL[MySQL<br/>Port: 3306]
KAFKA[Kafka<br/>Port: 9092]
ZK[Zookeeper<br/>Port: 2181]
end
subgraph "Processing"
AIR[Airflow<br/>Webserver:8080<br/>Scheduler]
SPARK[Spark<br/>Master/Worker]
end
subgraph "Storage"
MINIO[MinIO<br/>API: 9000<br/>Console: 9001]
PG[PostgreSQL<br/>Port: 5432]
end
subgraph "Monitoring"
PROM[Prometheus<br/>Port: 9090]
GRAF[Grafana<br/>Port: 3000]
end
KAFKA --> ZK
AIR --> MYSQL
AIR --> PG
AIR --> SPARK
SPARK --> MINIO
SPARK --> PG
SPARK --> KAFKA
PROM --> AIR
PROM --> SPARK
GRAF --> PROM
end
flowchart LR
subgraph "Feature Engineering"
RAW[Raw Data] --> FE[Feature<br/>Extraction]
FE --> FS[Feature Store<br/>Feast]
end
subgraph "Model Training"
FS --> TRAIN[Training<br/>Pipeline]
TRAIN --> VAL[Validation]
VAL --> MLF[MLflow<br/>Registry]
end
subgraph "Model Serving"
MLF --> DEPLOY[Model<br/>Deployment]
DEPLOY --> API[Prediction<br/>API]
API --> APP[Applications]
end
subgraph "Monitoring"
API --> METRICS[Performance<br/>Metrics]
METRICS --> DRIFT[Drift<br/>Detection]
DRIFT --> RETRAIN[Retrigger<br/>Training]
end
RETRAIN --> TRAIN
end-to-end-pipeline/
├── .devcontainer/                  # VS Code Dev Container settings
├── docker-compose.yaml             # Docker orchestration for all services
├── docker-compose.ci.yaml          # Docker Compose for CI/CD pipelines
├── End_to_End_Data_Pipeline.ipynb  # Jupyter notebook for pipeline overview
├── requirements.txt                # Python dependencies for scripts
├── .gitignore                      # Standard Git ignore file
├── README.md                       # Comprehensive documentation (this file)
├── airflow/
│   ├── Dockerfile                  # Custom Airflow image with dependencies
│   ├── requirements.txt            # Python dependencies for Airflow
│   └── dags/
│       ├── batch_ingestion_dag.py       # Batch pipeline DAG
│       └── streaming_monitoring_dag.py  # Streaming monitoring DAG
├── spark/
│   ├── Dockerfile                  # Custom Spark image with Kafka and S3 support
│   ├── spark_batch_job.py          # Spark batch ETL job
│   └── spark_streaming_job.py      # Spark streaming job
├── kafka/
│   └── producer.py                 # Kafka producer for simulating event streams
├── storage/
│   ├── aws_s3_influxdb.py          # S3-InfluxDB integration stub
│   ├── hadoop_batch_processing.py  # Hadoop batch processing stub
│   └── mongodb_streaming.py        # MongoDB streaming integration stub
├── great_expectations/
│   ├── great_expectations.yaml     # GE configuration
│   └── expectations/
│       └── raw_data_validation.py  # GE suite for data quality
├── governance/
│   └── atlas_stub.py               # Dataset lineage registration with Atlas/OpenMetadata
├── monitoring/
│   ├── monitoring.py               # Python script to set up Prometheus & Grafana
│   └── prometheus.yml              # Prometheus configuration file
├── ml/
│   ├── feature_store_stub.py       # Feature Store integration stub
│   └── mlflow_tracking.py          # MLflow model tracking
├── kubernetes/
│   ├── argo-app.yaml               # Argo CD application manifest
│   └── deployment.yaml             # Kubernetes deployment manifest
├── terraform/                      # Terraform scripts for cloud deployment
└── scripts/
    └── init_db.sql                 # SQL script to initialize MySQL and demo data
Clone the Repository
git clone https://github.com/hoangsonww/End-to-End-Data-Pipeline.git
cd End-to-End-Data-Pipeline
Start the Pipeline Stack
Use Docker Compose to launch all components:
docker-compose up --build
This command will:
- Initialize the MySQL source database with demo data (scripts/init_db.sql).
- Start all services with the default connection settings:
  - mysql_default → Host: mysql, DB: source_db, User: user, Password: pass
  - postgres_default → Host: postgres, DB: processed_db, User: user, Password: pass
  - MinIO (User: minio, Password: minio123)
  - Kafka broker on port 9092
  - Airflow and Grafana web UIs (admin/admin)

Once the stack is up, trigger the batch_ingestion_dag to run the end-to-end batch pipeline.

To simulate streaming data, start the Kafka producer and submit the Spark streaming job:

docker-compose exec kafka python /opt/spark_jobs/../kafka/producer.py
docker-compose exec spark spark-submit --master local[2] /opt/spark_jobs/spark_streaming_job.py
With the stack running, you can explore the rest of the platform:
- Run the monitoring.py script (or open Grafana) to view real-time metrics and dashboards.
- The governance/atlas_stub.py script registers lineage between datasets (it can be extended for full Apache Atlas integration).
- Run ml/mlflow_tracking.py to simulate model training and tracking.
- Use ml/feature_store_stub.py to integrate with a feature store like Feast.
- Use the docker-compose.ci.yaml file to set up CI/CD pipelines.
- See the kubernetes/ directory for Kubernetes deployment manifests.
- See the terraform/ directory for cloud deployment scripts.
- See the .github/workflows/ directory for GitHub Actions CI/CD workflows.

Congratulations! You have successfully set up the end-to-end data pipeline with batch and streaming processing. Keep in mind that this is a general-purpose pipeline and needs to be customized for your specific use case.
[!IMPORTANT] Note: Be sure to review the files and scripts in the repository and change the credentials, configurations, and logic to match your environment and use case. Feel free to extend the pipeline with additional components, services, or integrations as needed.
Docker Compose:
All services are defined in docker-compose.yaml. Adjust resource limits, environment variables, and service dependencies as needed.
Airflow:
Customize DAGs in the airflow/dags/ directory. Use the provided PythonOperators to integrate custom processing logic.
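As a reference, here is a minimal, hypothetical DAG skeleton using PythonOperator. The DAG id, schedule, task ids, and callables are placeholders; the real batch pipeline is defined in airflow/dags/batch_ingestion_dag.py.

```python
# Hypothetical skeleton of a custom DAG with PythonOperator (Airflow 2.x).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    # Pull rows from the MySQL source (placeholder logic).
    ...

def validate():
    # Run Great Expectations checks on the extracted data (placeholder logic).
    ...

def load():
    # Upload raw data to MinIO and hand off to Spark (placeholder logic).
    ...

with DAG(
    dag_id="custom_batch_pipeline",   # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_validate = PythonOperator(task_id="validate", python_callable=validate)
    t_load = PythonOperator(task_id="load", python_callable=load)

    t_extract >> t_validate >> t_load
```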
Spark Jobs:
Edit transformation logic in spark/spark_batch_job.py and spark/spark_streaming_job.py to match your data and processing requirements.
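For orientation, here is a short sketch of a batch transformation in the same spirit: reading a CSV from MinIO over the S3A connector and writing to PostgreSQL via JDBC. The bucket, file, column names, and target table are assumptions; adapt them to your own data and the actual logic in spark/spark_batch_job.py.

```python
# Sketch of a batch transform: MinIO (S3A) -> clean -> PostgreSQL.
# Assumes the hadoop-aws and PostgreSQL JDBC packages are available.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, trim

spark = (
    SparkSession.builder.appName("batch-sketch")
    # MinIO exposed through the S3A connector (endpoint/credentials assumed).
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read raw CSV from an assumed bucket/object.
raw = spark.read.option("header", True).csv("s3a://raw-data/orders.csv")

# Example cleaning steps: deduplicate, drop null keys, trim strings.
cleaned = (
    raw.dropDuplicates()
       .filter(col("order_id").isNotNull())
       .withColumn("customer_name", trim(col("customer_name")))
)

# Write the processed data to the analytics database (assumed table name).
(cleaned.write.format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/processed_db")
    .option("dbtable", "orders_clean")
    .option("user", "user")
    .option("password", "pass")
    .mode("overwrite")
    .save())
```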
Kafka Producer:
Modify kafka/producer.py to simulate different types of events or adjust the batch size and frequency using environment variables.
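As an illustration, a simplified producer along these lines might look as follows. The topic, event shape, and environment variable names are assumptions for this sketch, not necessarily those used by kafka/producer.py.

```python
# Illustrative event producer with batch size and frequency driven by env vars.
import json
import os
import random
import time

from kafka import KafkaProducer  # kafka-python

BOOTSTRAP = os.getenv("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092")
TOPIC = os.getenv("KAFKA_TOPIC", "events")
BATCH_SIZE = int(os.getenv("PRODUCER_BATCH_SIZE", "10"))
INTERVAL_SECONDS = float(os.getenv("PRODUCER_INTERVAL_SECONDS", "1.0"))

producer = KafkaProducer(
    bootstrap_servers=BOOTSTRAP,
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    # Publish a small batch of synthetic sensor readings, then pause.
    for _ in range(BATCH_SIZE):
        event = {"sensor_id": random.randint(1, 5), "value": random.gauss(50, 10)}
        producer.send(TOPIC, event)
    producer.flush()
    time.sleep(INTERVAL_SECONDS)
```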
Monitoring:
Update monitoring/monitoring.py and prometheus.yml to scrape additional metrics or customize dashboards. Place Grafana dashboard JSON files in the monitoring/grafana_dashboards/ directory.
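If you add your own metrics, a small exporter based on prometheus_client is one option. The metric names and port below are assumptions; prometheus.yml would need a matching scrape target for them to appear in Grafana.

```python
# Sketch of exposing custom pipeline metrics for Prometheus to scrape.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

ROWS_PROCESSED = Counter(
    "pipeline_rows_processed_total", "Rows processed by the pipeline"
)
LAST_BATCH_SECONDS = Gauge(
    "pipeline_last_batch_duration_seconds", "Duration of the most recent batch run"
)

if __name__ == "__main__":
    start_http_server(8000)  # Prometheus scrape target (assumed port)
    while True:
        # Simulated values; replace with real counters from your jobs.
        ROWS_PROCESSED.inc(random.randint(100, 1000))
        LAST_BATCH_SECONDS.set(random.uniform(5, 30))
        time.sleep(15)
```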
Governance & ML:
Replace stub implementations in governance/atlas_stub.py and ml/ with real integrations as needed.
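For example, a minimal MLflow tracking run in the spirit of ml/mlflow_tracking.py could look like this; the tracking URI, experiment name, parameters, and metrics are illustrative assumptions.

```python
# Minimal MLflow tracking sketch: log params and metrics for one run.
import random

import mlflow

mlflow.set_tracking_uri("http://mlflow:5000")   # assumed MLflow server address
mlflow.set_experiment("pipeline-demo")          # assumed experiment name

with mlflow.start_run():
    mlflow.log_param("model_type", "gradient_boosting")
    mlflow.log_param("max_depth", 6)
    # Placeholder metrics; replace with real evaluation results.
    mlflow.log_metric("rmse", random.uniform(0.1, 0.5))
    mlflow.log_metric("r2", random.uniform(0.7, 0.95))
```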
CI/CD & Deployment:
Customize CI/CD workflows in .github/workflows/ and deployment manifests in kubernetes/ and terraform/ for your cloud environment.
Storage:
The storage/ directory contains integration stubs for AWS S3, InfluxDB, MongoDB, and Hadoop. Replace these with real integrations and credentials as needed; a small example follows below.
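As a starting point, swapping a stub for a real object-store client can be as simple as the boto3 sketch below; the endpoint, credentials, bucket, and file names are assumptions (drop endpoint_url to target AWS S3 directly).

```python
# Sketch of uploading a processed artifact to MinIO/S3 with boto3.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",   # point at MinIO locally; omit for AWS S3
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)

# Upload a local file to an assumed bucket and key.
s3.upload_file("output/processed.parquet", "processed-data", "processed.parquet")
```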
mindmap
root((Data Pipeline<br/>Use Cases))
E-Commerce
Real-Time Recommendations
Clickstream Processing
User Behavior Analysis
Personalized Content
Fraud Detection
Transaction Monitoring
Pattern Recognition
Risk Scoring
Finance
Risk Analysis
Credit Assessment
Portfolio Analytics
Market Risk
Trade Surveillance
Market Data Processing
Compliance Monitoring
Anomaly Detection
Healthcare
Patient Monitoring
IoT Sensor Data
Real-time Alerts
Predictive Analytics
Clinical Trials
Data Integration
Outcome Prediction
Drug Efficacy Analysis
IoT/Manufacturing
Predictive Maintenance
Sensor Analytics
Failure Prediction
Maintenance Scheduling
Supply Chain
Inventory Optimization
Logistics Tracking
Demand Forecasting
Media
Sentiment Analysis
Social Media Streams
Brand Monitoring
Trend Detection
Ad Fraud Detection
Click Pattern Analysis
Bot Detection
Campaign Analytics
Feel free to use this pipeline as a starting point for your data processing needs. Extend it with additional components, services, or integrations to build a robust, end-to-end data platform.
If you run into issues:
- Check the container logs (docker-compose logs) to troubleshoot errors with MySQL, Kafka, Airflow, or Spark.
- Review the port mappings, credentials, and resource settings in docker-compose.yaml.

Contributions, issues, and feature requests are welcome!
1. Fork the repository
2. Create your feature branch (git checkout -b feature/AmazingFeature)
3. Commit your changes (git commit -m 'Add some AmazingFeature')
4. Push to the branch (git push origin feature/AmazingFeature)
5. Open a Pull Request

This project is licensed under the MIT License.
[!NOTE] This end-to-end data pipeline is designed for rapid deployment and customization. With minor configuration changes, it can be adapted to many business cases, from real-time analytics and fraud detection to predictive maintenance and advanced ML model training. Enjoy building a data-driven future with this pipeline!
Thanks for reading! If you found this repository helpful, please star it and share it with others. For questions, feedback, or suggestions, feel free to reach out to me on GitHub.