End-to-End-Data-Pipeline

End-to-End Data Pipeline with Batch & Streaming Processing

This repository contains a fully integrated, production-ready data pipeline that supports both batch and streaming data processing using open-source technologies. It is designed to be easily configured and deployed by any business or individual with minimal modifications.

The pipeline incorporates:

Python SQL Bash Docker Kubernetes Apache Airflow Apache Spark Apache Flink Kafka Apache Hadoop PostgreSQL MySQL MongoDB InfluxDB MinIO AWS S3 Prometheus Grafana Elasticsearch MLflow Feast Great Expectations Apache Atlas Tableau Power BI Looker Redis Terraform

Read this README and follow the step-by-step guide to set up the pipeline on your local machine or cloud environment. Customize the pipeline components, configurations, and example applications to suit your data processing needs.

Table of Contents

  1. Architecture Overview
  2. Directory Structure
  3. Components & Technologies
  4. Setup Instructions
  5. Configuration & Customization
  6. Example Applications
  7. Troubleshooting & Further Considerations
  8. Contributing
  9. License
  10. Final Notes

Architecture Overview

The architecture of the end-to-end data pipeline is designed to handle both batch and streaming data processing. Below is a high-level overview of the components and their interactions:

Flow Diagram

Architecture Diagram

Note: The diagram may not reflect ALL components in the repository, but it provides a good overview of the main components and their interactions. For instance, I added BI tools like Tableau, Power BI, and Looker to the repo for data visualization and reporting.

Text-Based Pipeline Diagram

                            β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                            β”‚         Batch Source           β”‚
                            β”‚(MySQL, Files, User Interaction)β”‚
                            β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚
                                             β”‚  (Extract/Validate)
                                             β–Ό
                           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                           β”‚      Airflow Batch DAG              β”‚
                           β”‚ - Extracts data from MySQL          β”‚
                           β”‚ - Validates with Great Expectations β”‚
                           β”‚ - Uploads raw data to MinIO         β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚ (spark-submit)
                                             β–Ό
                             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                             β”‚         Spark Batch Job        β”‚
                             β”‚ - Reads raw CSV from MinIO     β”‚
                             β”‚ - Transforms, cleans, enriches β”‚
                             β”‚ - Writes transformed data to   β”‚
                             β”‚   PostgreSQL & MinIO           β”‚
                             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                            β”‚ (Load/Analyze)
                                            β–Ό
                             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                             β”‚       Processed Data Store     β”‚
                             β”‚ (PostgreSQL, MongoDB, AWS S3)  β”‚
                             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                             β”‚ (Query/Analyze)
                                             β–Ό
                             β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                             β”‚         Cache & Indexing       β”‚
                             β”‚     (Elasticsearch, Redis)     β”‚
                             β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        
Streaming Side:
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚       Streaming Source      β”‚
                              β”‚         (Kafka)             β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                           β”‚
                                           β–Ό
                           β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                           β”‚    Spark Streaming Job            β”‚
                           β”‚ - Consumes Kafka messages         β”‚
                           β”‚ - Filters and detects anomalies   β”‚
                           β”‚ - Persists anomalies to           β”‚
                           β”‚   PostgreSQL & MinIO              β”‚
                           β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Monitoring & Governance:
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚       Monitoring &             β”‚
                              β”‚  Data Governance Layer         β”‚
                              β”‚ - Prometheus & Grafana         β”‚
                              β”‚ - Apache Atlas / OpenMetadata  β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                                        
ML & Serving:
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚        AI/ML Serving         β”‚
                              β”‚ - Feature Store (Feast)      β”‚
                              β”‚ - MLflow Model Tracking      β”‚
                              β”‚ - Model training & serving   β”‚
                              β”‚ - BI Dashboards              β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                              
CI/CD & Terraform:
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚        CI/CD Pipelines       β”‚
                              β”‚ - GitHub Actions / Jenkins   β”‚
                              β”‚ - Terraform for Cloud Deploy β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Container Orchestration:
                              β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
                              β”‚       Kubernetes Cluster     β”‚
                              β”‚ - Argo CD for GitOps         β”‚
                              β”‚ - Helm Charts for Deployment β”‚
                              β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Full Flow Diagram with Backend & Frontend Integration (Optional)

A more detailed flow diagram that includes backend and frontend integration is available in the assets/ directory. This diagram illustrates how the data pipeline components interact with each other and with external systems, including data sources, storage, processing, visualization, and monitoring.

Although the frontend & backend integration is not included in this repository (since it’s supposed to only contain the pipeline), you can easily integrate it with your existing frontend application or create a new one using popular frameworks like React, Angular, or Vue.js.

Full Flow Diagram

Directory Structure

end-to-end-pipeline/
  β”œβ”€β”€ .devcontainer/                 # VS Code Dev Container settings
  β”œβ”€β”€ docker-compose.yaml            # Docker orchestration for all services
  β”œβ”€β”€ docker-compose.ci.yaml         # Docker Compose for CI/CD pipelines
  β”œβ”€β”€ End_to_End_Data_Pipeline.ipynb # Jupyter notebook for pipeline overview
  β”œβ”€β”€ requirements.txt               # Python dependencies for scripts
  β”œβ”€β”€ .gitignore                     # Standard Git ignore file
  β”œβ”€β”€ README.md                      # Comprehensive documentation (this file)
  β”œβ”€β”€ airflow/
  β”‚   β”œβ”€β”€ Dockerfile                 # Custom Airflow image with dependencies
  β”‚   β”œβ”€β”€ requirements.txt           # Python dependencies for Airflow
  β”‚   └── dags/
  β”‚       β”œβ”€β”€ batch_ingestion_dag.py # Batch pipeline DAG
  β”‚       └── streaming_monitoring_dag.py  # Streaming monitoring DAG
  β”œβ”€β”€ spark/
  β”‚   β”œβ”€β”€ Dockerfile                 # Custom Spark image with Kafka and S3 support
  β”‚   β”œβ”€β”€ spark_batch_job.py         # Spark batch ETL job
  β”‚   └── spark_streaming_job.py     # Spark streaming job
  β”œβ”€β”€ kafka/
  β”‚   └── producer.py                # Kafka producer for simulating event streams
  β”œβ”€β”€ storage/
  β”‚   β”œβ”€β”€ aws_s3_influxdb.py         # S3-InfluxDB integration stub
  β”‚   β”œβ”€β”€ hadoop_batch_processing.py  # Hadoop batch processing stub
  β”‚   └── mongodb_streaming.py       # MongoDB streaming integration stub
  β”œβ”€β”€ great_expectations/
  β”‚   β”œβ”€β”€ great_expectations.yaml    # GE configuration
  β”‚   └── expectations/
  β”‚       └── raw_data_validation.py # GE suite for data quality
  β”œβ”€β”€ governance/
  β”‚   └── atlas_stub.py              # Dataset lineage registration with Atlas/OpenMetadata
  β”œβ”€β”€ monitoring/
  β”‚   β”œβ”€β”€ monitoring.py              # Python script to set up Prometheus & Grafana
  β”‚   └── prometheus.yml             # Prometheus configuration file
  β”œβ”€β”€ ml/
  β”‚   β”œβ”€β”€ feature_store_stub.py      # Feature Store integration stub
  β”‚   └── mlflow_tracking.py         # MLflow model tracking
  β”œβ”€β”€ kubernetes/
  β”‚   β”œβ”€β”€ argo-app.yaml              # Argo CD application manifest
  β”‚   └── deployment.yaml            # Kubernetes deployment manifest
  β”œβ”€β”€ terraform/                     # Terraform scripts for cloud deployment
  └── scripts/
      └── init_db.sql                # SQL script to initialize MySQL and demo data

Components & Technologies

Setup Instructions

Prerequisites

Step-by-Step Guide

  1. Clone the Repository

    git clone https://github.com/hoangsonww/End-to-End-Data-Pipeline.git
    cd End-to-End-Data-Pipeline
    
  2. Start the Pipeline Stack

    Use Docker Compose to launch all components:

    docker-compose up --build
    

    This command will:

    • Build custom Docker images for Airflow and Spark.
    • Start MySQL, PostgreSQL, Kafka (with Zookeeper), MinIO, Prometheus, Grafana, and Airflow webserver.
    • Initialize the MySQL database with demo data (via scripts/init_db.sql).
  3. Access the Services
  4. Run Batch Pipeline
    • In the Airflow UI, enable the batch_ingestion_dag to run the end-to-end batch pipeline.
    • This DAG extracts data from MySQL, validates it, uploads raw data to MinIO, triggers a Spark job for transformation, and loads data into PostgreSQL.
  5. Run Streaming Pipeline
    • Open a terminal and start the Kafka producer:
      docker-compose exec kafka python /opt/spark_jobs/../kafka/producer.py
      
    • In another terminal, run the Spark streaming job:
      docker-compose exec spark spark-submit --master local[2] /opt/spark_jobs/spark_streaming_job.py
      
    • The streaming job consumes events from Kafka, performs real-time anomaly detection, and writes results to PostgreSQL and MinIO.
  6. Monitoring & Governance
    • Prometheus & Grafana:
      Use the monitoring.py script (or access Grafana) to view real-time metrics and dashboards.
    • Data Lineage:
      The governance/atlas_stub.py script registers lineage between datasets (can be extended for full Apache Atlas integration).
  7. ML & Feature Store
    • Use ml/mlflow_tracking.py to simulate model training and tracking.
    • Use ml/feature_store_stub.py to integrate with a feature store like Feast.
  8. CI/CD & Deployment
    • Use the docker-compose.ci.yaml file to set up CI/CD pipelines.
    • Use the kubernetes/ directory for Kubernetes deployment manifests.
    • Use the terraform/ directory for cloud deployment scripts.
    • Use the .github/workflows/ directory for GitHub Actions CI/CD workflows.

Next Steps

Congratulations! You have successfully set up the end-to-end data pipeline with batch and streaming processing. However, this is a very general pipeline that needs to be customized for your specific use case.

Note: Be sure to visit the files and scripts in the repository and change the credentials, configurations, and logic to match your environment and use case. Feel free to extend the pipeline with additional components, services, or integrations as needed.

Configuration & Customization

Example Applications

E-Commerce & Retail

Financial Services & Banking

Healthcare & Life Sciences

IoT & Manufacturing

Media & Social Networks

Feel free to use this pipeline as a starting point for your data processing needs. Extend it with additional components, services, or integrations to build a robust, end-to-end data platform.

Troubleshooting & Further Considerations

Contributing

Contributions, issues, and feature requests are welcome!

  1. Fork the Project
  2. Create your Feature Branch (git checkout -b feature/AmazingFeature)
  3. Commit your Changes (git commit -m 'Add some AmazingFeature')
  4. Push to the Branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request
  6. We will review your changes and merge them into the main branch upon approval.

License

This project is licensed under the MIT License.

Final Notes

This end-to-end data pipeline is designed for rapid deployment and customization. With minor configuration changes, it can be adapted to many business casesβ€”from real-time analytics and fraud detection to predictive maintenance and advanced ML model training. Enjoy building a data-driven future with this pipeline!


Thanks for reading! If you found this repository helpful, please star it and share it with others. For questions, feedback, or suggestions, feel free to reach out to me on GitHub.

⬆️ Back to top