This repository contains a fully integrated, production-ready data pipeline that supports both batch and streaming data processing using open-source technologies. It is designed to be easily configured and deployed by any business or individual with minimal modifications.
The pipeline incorporates:
- Batch ingestion and orchestration with Apache Airflow
- Batch and streaming processing with Apache Spark
- Event streaming with Apache Kafka
- Data quality validation with Great Expectations
- Storage across MySQL, PostgreSQL, MinIO, MongoDB, AWS S3, Elasticsearch, and Redis
- Monitoring with Prometheus and Grafana, and data governance with Apache Atlas / OpenMetadata
- ML tooling with Feast (feature store) and MLflow (model tracking)
- CI/CD with GitHub Actions / Jenkins, plus Kubernetes, Helm, Argo CD, and Terraform for deployment
Read this README and follow the step-by-step guide to set up the pipeline on your local machine or cloud environment. Customize the pipeline components, configurations, and example applications to suit your data processing needs.
The architecture of the end-to-end data pipeline is designed to handle both batch and streaming data processing. Below is a high-level overview of the components and their interactions:
Note: The diagram may not reflect ALL components in the repository, but it provides a good overview of the main components and their interactions. For instance, I added BI tools like Tableau, Power BI, and Looker to the repo for data visualization and reporting.
┌─────────────────────────────────┐
│          Batch Source           │
│ (MySQL, Files, User Interaction)│
└────────────────┬────────────────┘
                 │
                 │ (Extract/Validate)
                 ▼
┌─────────────────────────────────────┐
│          Airflow Batch DAG          │
│ - Extracts data from MySQL          │
│ - Validates with Great Expectations │
│ - Uploads raw data to MinIO         │
└──────────────────┬──────────────────┘
                   │ (spark-submit)
                   ▼
┌────────────────────────────────┐
│        Spark Batch Job         │
│ - Reads raw CSV from MinIO     │
│ - Transforms, cleans, enriches │
│ - Writes transformed data to   │
│   PostgreSQL & MinIO           │
└───────────────┬────────────────┘
                │ (Load/Analyze)
                ▼
┌────────────────────────────────┐
│      Processed Data Store      │
│ (PostgreSQL, MongoDB, AWS S3)  │
└───────────────┬────────────────┘
                │ (Query/Analyze)
                ▼
┌────────────────────────────────┐
│        Cache & Indexing        │
│    (Elasticsearch, Redis)      │
└────────────────────────────────┘
Streaming Side:
┌──────────────────────────────┐
│       Streaming Source       │
│           (Kafka)            │
└──────────────┬───────────────┘
               │
               ▼
┌─────────────────────────────────┐
│       Spark Streaming Job       │
│ - Consumes Kafka messages       │
│ - Filters and detects anomalies │
│ - Persists anomalies to         │
│   PostgreSQL & MinIO            │
└─────────────────────────────────┘
Monitoring & Governance:
┌───────────────────────────────┐
│         Monitoring &          │
│     Data Governance Layer     │
│ - Prometheus & Grafana        │
│ - Apache Atlas / OpenMetadata │
└───────────────────────────────┘
ML & Serving:
┌────────────────────────────┐
│        AI/ML Serving       │
│ - Feature Store (Feast)    │
│ - MLflow Model Tracking    │
│ - Model training & serving │
│ - BI Dashboards            │
└────────────────────────────┘
CI/CD & Terraform:
┌──────────────────────────────┐
│       CI/CD Pipelines        │
│ - GitHub Actions / Jenkins   │
│ - Terraform for Cloud Deploy │
└──────────────────────────────┘
Container Orchestration:
┌──────────────────────────────┐
│      Kubernetes Cluster      │
│ - Argo CD for GitOps         │
│ - Helm Charts for Deployment │
└──────────────────────────────┘
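To make the Spark Streaming Job shown above more concrete, here is a minimal Structured Streaming sketch in the spirit of spark/spark_streaming_job.py. The topic name, event schema, and anomaly threshold are illustrative assumptions, not the repository's actual values:

```python
# Illustrative Structured Streaming sketch: consume Kafka events, keep anomalies.
# The topic name, schema, and threshold below are assumptions for illustration.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType, StringType, StructField, StructType

spark = SparkSession.builder.appName("streaming-anomaly-detection").getOrCreate()

event_schema = StructType([
    StructField("sensor_id", StringType()),
    StructField("value", DoubleType()),
])

events = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")
    .option("subscribe", "events")  # assumed topic name
    .load()
    .select(F.from_json(F.col("value").cast("string"), event_schema).alias("e"))
    .select("e.*")
)

anomalies = events.filter(F.col("value") > 100.0)  # assumed anomaly threshold

# The console sink keeps the sketch self-contained; the real job would persist
# anomalies to PostgreSQL and MinIO, e.g. inside a foreachBatch sink.
query = anomalies.writeStream.outputMode("append").format("console").start()
query.awaitTermination()
```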
A more detailed flow diagram that includes backend and frontend integration is available in the assets/ directory. This diagram illustrates how the data pipeline components interact with each other and with external systems, including data sources, storage, processing, visualization, and monitoring.
Although the frontend & backend integration is not included in this repository (since it's intended to contain only the pipeline), you can easily integrate it with your existing frontend application or create a new one using popular frameworks like React, Angular, or Vue.js.
end-to-end-pipeline/
├── .devcontainer/                      # VS Code Dev Container settings
├── docker-compose.yaml                 # Docker orchestration for all services
├── docker-compose.ci.yaml              # Docker Compose for CI/CD pipelines
├── End_to_End_Data_Pipeline.ipynb      # Jupyter notebook for pipeline overview
├── requirements.txt                    # Python dependencies for scripts
├── .gitignore                          # Standard Git ignore file
├── README.md                           # Comprehensive documentation (this file)
├── airflow/
│   ├── Dockerfile                      # Custom Airflow image with dependencies
│   ├── requirements.txt                # Python dependencies for Airflow
│   └── dags/
│       ├── batch_ingestion_dag.py      # Batch pipeline DAG
│       └── streaming_monitoring_dag.py # Streaming monitoring DAG
├── spark/
│   ├── Dockerfile                      # Custom Spark image with Kafka and S3 support
│   ├── spark_batch_job.py              # Spark batch ETL job
│   └── spark_streaming_job.py          # Spark streaming job
├── kafka/
│   └── producer.py                     # Kafka producer for simulating event streams
├── storage/
│   ├── aws_s3_influxdb.py              # S3-InfluxDB integration stub
│   ├── hadoop_batch_processing.py      # Hadoop batch processing stub
│   └── mongodb_streaming.py            # MongoDB streaming integration stub
├── great_expectations/
│   ├── great_expectations.yaml         # GE configuration
│   └── expectations/
│       └── raw_data_validation.py      # GE suite for data quality
├── governance/
│   └── atlas_stub.py                   # Dataset lineage registration with Atlas/OpenMetadata
├── monitoring/
│   ├── monitoring.py                   # Python script to set up Prometheus & Grafana
│   └── prometheus.yml                  # Prometheus configuration file
├── ml/
│   ├── feature_store_stub.py           # Feature Store integration stub
│   └── mlflow_tracking.py              # MLflow model tracking
├── kubernetes/
│   ├── argo-app.yaml                   # Argo CD application manifest
│   └── deployment.yaml                 # Kubernetes deployment manifest
├── terraform/                          # Terraform scripts for cloud deployment
└── scripts/
    └── init_db.sql                     # SQL script to initialize MySQL and demo data
Clone the Repository
git clone https://github.com/hoangsonww/End-to-End-Data-Pipeline.git
cd End-to-End-Data-Pipeline
Start the Pipeline Stack
Use Docker Compose to launch all components:
docker-compose up --build
This command will:

- Initialize MySQL with demo data (scripts/init_db.sql).
- Create the Airflow connection mysql_default → Host: mysql, DB: source_db, User: user, Password: pass
- Create the Airflow connection postgres_default → Host: postgres, DB: processed_db, User: user, Password: pass
- Start MinIO for object storage (User: minio, Password: minio123)
- Start the Kafka broker on port 9092
- Start the Airflow web UI (login: admin/admin)

Once the stack is running:

- Trigger the batch_ingestion_dag to run the end-to-end batch pipeline.
- Start the event producer: docker-compose exec kafka python /opt/spark_jobs/../kafka/producer.py
- Launch the streaming job: docker-compose exec spark spark-submit --master local[2] /opt/spark_jobs/spark_streaming_job.py
- Run the monitoring.py script (or access Grafana) to view real-time metrics and dashboards.
- The governance/atlas_stub.py script registers lineage between datasets (can be extended for full Apache Atlas integration).
- Run ml/mlflow_tracking.py to simulate model training and tracking.
- Use ml/feature_store_stub.py to integrate with a feature store like Feast.
- Use the docker-compose.ci.yaml file to set up CI/CD pipelines.
- See the kubernetes/ directory for Kubernetes deployment manifests.
- See the terraform/ directory for cloud deployment scripts.
- See the .github/workflows/ directory for GitHub Actions CI/CD workflows.

Congratulations! You have successfully set up the end-to-end data pipeline with batch and streaming processing. However, this is a very general pipeline that needs to be customized for your specific use case.
Note: Be sure to review the files and scripts in the repository and change the credentials, configurations, and logic to match your environment and use case. Feel free to extend the pipeline with additional components, services, or integrations as needed.
Docker Compose:
All services are defined in docker-compose.yaml. Adjust resource limits, environment variables, and service dependencies as needed.
Airflow:
Customize DAGs in the airflow/dags/ directory. Use the provided PythonOperators to integrate custom processing logic.
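As a starting point, a custom DAG built from PythonOperators might look like the sketch below; the task names, callables, and schedule are illustrative assumptions, not the repository's exact DAG:

```python
# Minimal illustrative sketch of a batch DAG built from PythonOperators.
# Task names, callables, and the schedule are assumptions, not the repository's exact code.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def extract_from_mysql(**context):
    """Placeholder: pull rows via the mysql_default connection and stage them as CSV."""


def validate_with_great_expectations(**context):
    """Placeholder: run the Great Expectations suite against the staged CSV."""


def upload_raw_to_minio(**context):
    """Placeholder: upload the validated CSV to a MinIO bucket."""


with DAG(
    dag_id="batch_ingestion_dag",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # use `schedule=` on newer Airflow releases
    catchup=False,
) as dag:
    extract = PythonOperator(task_id="extract", python_callable=extract_from_mysql)
    validate = PythonOperator(task_id="validate", python_callable=validate_with_great_expectations)
    upload = PythonOperator(task_id="upload_raw", python_callable=upload_raw_to_minio)

    extract >> validate >> upload
```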
Spark Jobs:
Edit transformation logic in spark/spark_batch_job.py and spark/spark_streaming_job.py to match your data and processing requirements.
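For orientation, the batch job's shape is roughly as follows. The bucket, input path, target table, and MinIO endpoint are placeholder assumptions; the MinIO and PostgreSQL credentials mirror the defaults listed earlier in this README:

```python
# Illustrative batch ETL sketch: read raw CSV from MinIO, clean it, write to PostgreSQL.
# The bucket, input path, target table, and endpoint are placeholders; the credentials
# mirror the defaults listed earlier in this README.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder.appName("batch-etl")
    .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
    .config("spark.hadoop.fs.s3a.access.key", "minio")
    .config("spark.hadoop.fs.s3a.secret.key", "minio123")
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .getOrCreate()
)

# Read the raw CSV uploaded by the batch DAG (bucket/path are assumptions).
raw = spark.read.option("header", "true").csv("s3a://raw-data/batch/")

cleaned = (
    raw.dropDuplicates()
       .withColumn("ingested_at", F.current_timestamp())
)

# Write the transformed data to PostgreSQL over JDBC (table name is an assumption).
(
    cleaned.write.mode("overwrite")
    .format("jdbc")
    .option("url", "jdbc:postgresql://postgres:5432/processed_db")
    .option("dbtable", "processed_records")
    .option("user", "user")
    .option("password", "pass")
    .save()
)
```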
Kafka Producer:
Modify kafka/producer.py to simulate different types of events or adjust the batch size and frequency using environment variables.
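A simulated event producer can be as simple as the sketch below (shown with the kafka-python client; the topic name, payload fields, and environment variable names are illustrative assumptions):

```python
# Illustrative event producer using the kafka-python client.
# The topic name, payload fields, and environment variable names are assumptions.
import json
import os
import random
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=os.getenv("KAFKA_BOOTSTRAP_SERVERS", "kafka:9092"),
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

while True:
    event = {
        "sensor_id": f"sensor-{random.randint(1, 5)}",
        "value": round(random.uniform(0.0, 150.0), 2),
    }
    producer.send("events", value=event)  # assumed topic name
    time.sleep(float(os.getenv("PRODUCER_INTERVAL_SECONDS", "1")))
```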
Monitoring:
Update monitoring/monitoring.py and prometheus.yml to scrape additional metrics or customize dashboards. Place Grafana dashboard JSON files in the monitoring/grafana_dashboards/ directory.
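If you want to expose additional custom metrics from your own scripts, a minimal prometheus_client sketch looks like this (metric names and the port are placeholders; add the endpoint as a scrape target in prometheus.yml):

```python
# Minimal custom-metrics sketch with prometheus_client.
# Metric names and the port are placeholders; point a Prometheus scrape job at :8000.
import random
import time

from prometheus_client import Counter, Gauge, start_http_server

records_processed = Counter(
    "pipeline_records_processed_total", "Records processed by the pipeline"
)
batch_lag_seconds = Gauge(
    "pipeline_batch_lag_seconds", "Lag of the most recent batch in seconds"
)

if __name__ == "__main__":
    start_http_server(8000)  # add this endpoint as a scrape target in prometheus.yml
    while True:
        records_processed.inc(random.randint(1, 10))
        batch_lag_seconds.set(random.uniform(0, 60))
        time.sleep(5)
```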
Governance & ML:
Replace stub implementations in governance/atlas_stub.py and the ml/ directory with real integrations as needed.
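On the ML side, MLflow tracking typically reduces to a few calls like the sketch below (experiment name, parameters, and metrics are illustrative placeholders):

```python
# Illustrative MLflow tracking sketch; experiment name, params, and metrics are placeholders.
import mlflow

mlflow.set_experiment("pipeline-demo")

with mlflow.start_run(run_name="baseline-model"):
    mlflow.log_param("model_type", "logistic_regression")
    mlflow.log_param("train_rows", 10_000)
    mlflow.log_metric("accuracy", 0.91)
    mlflow.log_metric("f1", 0.88)
```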
CI/CD & Deployment:
Customize CI/CD workflows in .github/workflows/ and deployment manifests in kubernetes/ and terraform/ for your cloud environment.
Storage:
Data storage options live in the storage/ directory as AWS S3, InfluxDB, MongoDB, and Hadoop stubs. Replace these with real integrations and credentials as needed.
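When replacing the stubs with real integrations, the client code is standard; for example, a hedged sketch using boto3 and pymongo (bucket, object key, collection, hostnames, and the sample document are placeholders, while the MinIO credentials mirror the defaults above):

```python
# Illustrative replacements for the storage stubs using boto3 and pymongo.
# Bucket, object key, collection, hostnames, and the sample document are placeholders.
import boto3
from pymongo import MongoClient

# S3-compatible upload (drop endpoint_url and use real AWS credentials for S3 proper).
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)
s3.upload_file("transformed.csv", "processed-data", "batch/transformed.csv")

# MongoDB insert (assumed service name and collection).
mongo = MongoClient("mongodb://mongodb:27017")
mongo["processed_db"]["anomalies"].insert_one({"sensor_id": "sensor-1", "value": 131.2})
```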
Feel free to use this pipeline as a starting point for your data processing needs. Extend it with additional components, services, or integrations to build a robust, end-to-end data platform.
Check the container logs (docker-compose logs) to troubleshoot errors with MySQL, Kafka, Airflow, or Spark. If services fail to start or ports conflict, adjust the settings in docker-compose.yaml.

Contributions, issues, and feature requests are welcome! To contribute:

1. Fork the repository.
2. Create your feature branch (git checkout -b feature/AmazingFeature).
3. Commit your changes (git commit -m 'Add some AmazingFeature').
4. Push to the branch (git push origin feature/AmazingFeature).
5. Open a pull request.

This project is licensed under the MIT License.
This end-to-end data pipeline is designed for rapid deployment and customization. With minor configuration changes, it can be adapted to many business cases, from real-time analytics and fraud detection to predictive maintenance and advanced ML model training. Enjoy building a data-driven future with this pipeline!
Thanks for reading! If you found this repository helpful, please star it and share it with others. For questions, feedback, or suggestions, feel free to reach out to me on GitHub.