# Portfolio-grade Machine Learning System

End-to-end platform for YouTube success prediction, channel clustering, global intelligence visualization, and production MLOps delivery across AWS, GCP, Azure, and OCI, with a Next.js frontend, FastAPI backend, and Kubernetes deployment architecture.

- Core ML pillars (predict, cluster, map)
- Supported production cloud providers
- Deployment strategies (rolling/canary/bluegreen)
- Major API and platform capability surfaces
## What The Platform Delivers

- **Predict:** inputs are uploads, category, country, and age; outputs are predicted subscribers, predicted earnings, and predicted growth.
- **Cluster:** unsupervised segmentation via KMeans and DBSCAN to identify strategic archetypes across channel behavior.
- **Map:** country/category influence analytics and map-ready data products for dashboard storytelling and decisions.
## System Design
These views describe how the platform is separated into product-facing interfaces, inference services, and lifecycle infrastructure, so teams can evolve each area without introducing coupling failures across the system.
The diagrams below should be read together: component topology explains ownership, pipeline flow explains data contracts, and lifecycle/observability views define runtime guardrails for production traffic.
```mermaid
flowchart LR
FE[Next.js Frontend] --> API[FastAPI/Flask]
API --> SVC[Intelligence Services]
SVC --> SUP[Supervised Models]
SVC --> CLU[Clustering Models]
SVC --> MLOPS[MLOps Artifacts]
SUP --> ART[(artifacts/models)]
CLU --> ART
MLOPS --> REP[(reports + manifest + registry)]
```
```mermaid
flowchart TD
RAW[Raw CSV Dataset] --> LOAD[data.loader]
LOAD --> CLEAN[Type coercion and null handling]
CLEAN --> FEAT[Feature engineering]
FEAT --> TRAIN[Train supervised and clustering]
TRAIN --> EVAL[Metrics and quality reports]
EVAL --> PUBLISH[Bundle artifacts and registry update]
```
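The stage flow above can be sketched as a thin orchestrator. The stage functions, field names, and the mean-baseline "model" below are illustrative stand-ins, not the project's real `data.loader` or training API:

```python
# Minimal sketch of the load -> clean -> feature -> train -> eval -> publish flow.

def load(raw_rows):
    # data.loader stand-in: materialize raw CSV-like rows as dicts
    return [dict(r) for r in raw_rows]

def clean(rows):
    # type coercion and null handling
    out = []
    for r in rows:
        r = dict(r)
        r["uploads"] = int(r.get("uploads") or 0)
        r["subscribers"] = float(r.get("subscribers") or 0.0)
        out.append(r)
    return out

def engineer(rows):
    # feature engineering example: subscribers per upload
    for r in rows:
        r["subs_per_upload"] = r["subscribers"] / max(r["uploads"], 1)
    return rows

def train(rows):
    # placeholder "model": mean target, standing in for the supervised fit
    mean_subs = sum(r["subscribers"] for r in rows) / len(rows)
    return {"mean_subscribers": mean_subs}

def evaluate(model, rows):
    # metrics and quality report
    mae = sum(abs(r["subscribers"] - model["mean_subscribers"]) for r in rows) / len(rows)
    return {"mae": mae, "n_rows": len(rows)}

def publish(model, report):
    # bundle artifacts and update the registry (in-memory stand-in)
    return {"model": model, "report": report}

def run_pipeline(raw_rows):
    rows = engineer(clean(load(raw_rows)))
    model = train(rows)
    return publish(model, evaluate(model, rows))
```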
```mermaid
sequenceDiagram
participant U as User
participant FE as Frontend
participant API as API Router
participant S as Predictor Service
participant A as Artifacts
U->>FE: submit channel inputs
FE->>API: POST /predict
API->>S: validate and infer
S->>A: load active model bundle
A-->>S: model + preprocessing pipeline
S-->>API: predictions + cluster context
API-->>FE: response payload
FE-->>U: charts, explainability, recommendations
```
```mermaid
flowchart LR
APP["Next.js App Router"] --> HOME["Route: /"]
APP --> CHARTS["Route: /visualizations/charts"]
APP --> LAB["Route: /intelligence/lab"]
APP --> WIKI["Route: /wiki"]
HOME --> PREDICT["Prediction workflows"]
CHARTS --> ANALYTICS["Post-processing charts"]
LAB --> EXPLAIN["Simulation and explainability"]
```
```mermaid
flowchart TD
IN[Incoming payload] --> VALIDATE[Pydantic validation]
VALIDATE --> READY{Artifacts ready?}
READY -- no --> E503[503 not_ready]
READY -- yes --> SERVE[Predictor service]
SERVE --> OUT[Typed response]
```
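The validation-and-readiness gate above can be sketched in a few lines. The real service uses Pydantic models under FastAPI; this stdlib-only stand-in mirrors the documented inputs (uploads, category, country, age) and the 503 `not_ready` branch, with all names illustrative:

```python
from dataclasses import dataclass

@dataclass
class ChannelInput:
    uploads: int
    category: str
    country: str
    age: int

    def __post_init__(self):
        # Pydantic-style constraint check
        if self.uploads < 0 or self.age < 0:
            raise ValueError("uploads and age must be non-negative")

# Artifact slot; stays None until a training run publishes a model bundle
ARTIFACTS = {"supervised_bundle": None}

def handle_predict(payload: dict):
    """Return (status_code, body), mirroring the readiness gate above."""
    try:
        data = ChannelInput(**payload)          # validation step
    except (TypeError, ValueError) as exc:
        return 422, {"error": str(exc)}
    model = ARTIFACTS["supervised_bundle"]      # artifacts ready?
    if model is None:
        return 503, {"status": "not_ready"}
    return 200, {"predicted_subscribers": model(data)}
```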
```mermaid
flowchart LR
Client --> HEALTH["/health"]
Client --> READY["/ready"]
Client --> METRICS["/metrics"]
METRICS --> Dashboards[Grafana/Prometheus]
READY --> Alerts[Release gating]
```
## Quality, Governance, Reliability
MLOps here is treated as an operational control plane, not only a training script. Every run produces lineage metadata, quality reports, and promotion context that can be audited before a model becomes active.
The lifecycle charts map how drift checks, readiness gates, rollback paths, and registry state interact to keep inference stable while still allowing frequent model refreshes.
Advanced controls are intentionally opt-in: MLflow/W&B tracking, Optuna tuning, DVC/Feast data workflows, Prefect retraining orchestration, and Prometheus/Grafana observability can be enabled without destabilizing baseline CI quality gates.
```mermaid
flowchart LR
T[Training Run] --> M1[supervised_bundle.joblib]
T --> M2[clustering_bundle.joblib]
T --> R1[training_metrics.json]
T --> R2[data_quality_report.json]
T --> R3[feature_store_snapshot.csv]
T --> R4[optuna_study.json]
T --> MAN[training_manifest.json]
MAN --> REG[model_registry.json]
REG --> ACTIVE[active_run_id]
```
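Resolving the serving model from the registry can be sketched as below. The schema (a `runs` list plus an `active_run_id` pointer) is inferred from the artifact names in the diagram, not a verbatim copy of the project's `model_registry.json` format:

```python
import json

def resolve_active_run(registry_json: str) -> dict:
    """Return the registry entry pointed at by active_run_id."""
    registry = json.loads(registry_json)
    active_id = registry["active_run_id"]
    for run in registry["runs"]:
        if run["run_id"] == active_id:
            return run
    # A dangling pointer should fail loudly rather than serve a stale model
    raise LookupError(f"active_run_id {active_id!r} not found in registry")
```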
```mermaid
flowchart TD
INPUT["Incoming payload sample"] --> DRIFT["Endpoint: /mlops/drift-check"]
DRIFT --> NUM["Numeric z-score checks"]
DRIFT --> CAT["Categorical frequency checks"]
NUM --> SCORE[Severity aggregation]
CAT --> SCORE
SCORE --> DECISION{High severity?}
DECISION -- yes --> RETRAIN[Trigger retraining workflow]
DECISION -- no --> SERVE[Continue active run]
```
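The two check families and the severity decision above can be sketched as follows. Thresholds, the total-variation metric for categoricals, and the "any flag escalates" aggregation rule are illustrative assumptions, not the endpoint's exact policy:

```python
from statistics import mean, pstdev

def numeric_drift(baseline, sample, z_threshold=3.0):
    # z-score of the sample mean against the baseline distribution
    mu, sigma = mean(baseline), pstdev(baseline) or 1.0
    z = abs(mean(sample) - mu) / sigma
    return z > z_threshold, z

def categorical_drift(baseline_freq, sample_freq, tol=0.2):
    # total variation distance between category frequency distributions
    cats = set(baseline_freq) | set(sample_freq)
    tvd = 0.5 * sum(abs(baseline_freq.get(c, 0) - sample_freq.get(c, 0)) for c in cats)
    return tvd > tol, tvd

def drift_decision(num_flag, cat_flag):
    # severity aggregation: any flagged check escalates to retraining
    return "retrain" if (num_flag or cat_flag) else "serve"
```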
```mermaid
stateDiagram-v2
[*] --> Boot
Boot --> NotReady: artifacts missing
Boot --> Ready: artifacts found
NotReady --> Train
Train --> Ready
Ready --> Serving
Serving --> DriftRisk
DriftRisk --> Train
```
```mermaid
sequenceDiagram
participant Train as Training Pipeline
participant Reg as model_registry.json
participant API as Inference API
participant Ops as Operator
Train->>Reg: append run and set active_run_id
API->>Reg: load active run
Ops->>Reg: rollback to prior run if needed
API-->>Ops: serving previous stable run
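The operator rollback path above amounts to moving `active_run_id` back one entry. This sketch assumes the registry's `runs` list is ordered by creation time; that ordering and the field names are assumptions about the registry format:

```python
def rollback(registry: dict) -> dict:
    """Point active_run_id at the run registered just before the current one."""
    runs = registry["runs"]
    idx = next(i for i, r in enumerate(runs) if r["run_id"] == registry["active_run_id"])
    if idx == 0:
        # Nothing older to fall back to; surface this instead of guessing
        raise RuntimeError("no prior run to roll back to")
    registry["active_run_id"] = runs[idx - 1]["run_id"]
    return registry
```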
```mermaid
flowchart LR
DATA["Data intake"] --> TRAIN["Model training"]
TRAIN --> EVAL["Evaluation + quality gates"]
EVAL --> REG["Register artifacts + manifest"]
REG --> DEPLOY["Deploy active run"]
DEPLOY --> MONITOR["Health + drift monitoring"]
MONITOR --> RETRAIN{"Retrain needed?"}
RETRAIN -- yes --> TRAIN
RETRAIN -- no --> DEPLOY
```
```mermaid
flowchart LR
TR[train.py] --> CFG[Tracking flags]
CFG --> MLF["MLflow backend (optional)"]
CFG --> WB["W&B backend (optional)"]
TR --> PAR[log params]
TR --> MET[log metrics]
TR --> ART[log artifacts]
PAR --> MLF
PAR --> WB
MET --> MLF
MET --> WB
ART --> MLF
ART --> WB
```
```mermaid
flowchart TD
START[CLI --optuna-trials] --> STUDY[Create or load study]
STUDY --> TRIAL[Sample trial params]
TRIAL --> TRAINRF[Train candidate regressor]
TRAINRF --> SCORE[Evaluate RMSE]
SCORE --> BEST{Best score?}
BEST -- yes --> UPDATE[Update best params]
BEST -- no --> NEXT[Next trial]
UPDATE --> NEXT
NEXT --> STUDY
STUDY --> OUT[optuna_study.json + tuned config]
```
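The trial loop above has the same shape regardless of the sampler. This dependency-free sketch substitutes plain random search for Optuna's `Study` API to show the sample/score/update-best cycle; the hyperparameter ranges and the toy objective (pretending the optimum is `max_depth=6` with few trees) are invented for illustration:

```python
import random

def objective(params):
    # Stand-in for "train candidate regressor, evaluate RMSE": score
    # improves as max_depth nears 6 and n_estimators shrinks
    return abs(params["max_depth"] - 6) + 0.01 * params["n_estimators"] / 100

def tune(n_trials=30, seed=0):
    rng = random.Random(seed)
    best = {"score": float("inf"), "params": None}
    for _ in range(n_trials):
        params = {                                  # sample trial params
            "n_estimators": rng.randrange(50, 500, 50),
            "max_depth": rng.randrange(2, 12),
        }
        score = objective(params)                   # evaluate RMSE
        if score < best["score"]:                   # best score? update best params
            best = {"score": score, "params": params}
    return best
```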
```mermaid
flowchart LR
RAW[Raw dataset] --> PREP[Preprocess stage]
PREP --> SNAP[feature_store_snapshot.csv]
SNAP --> DVC[dvc.yaml stages]
SNAP --> FEAST[Feast file source]
FEAST --> OFFLINE[Offline feature retrieval]
DVC --> LINEAGE[Data lineage + reproducibility]
```
```mermaid
flowchart TB
PREF[Prefect schedule] --> FLOW[retraining flow]
FLOW --> TRAIN[run_training]
TRAIN --> REG[manifest + registry]
API[API service] --> READY["/ready"]
API --> MET["/metrics"]
API --> CAP["/mlops/capabilities"]
MET --> PROM[Prometheus]
PROM --> GRAF[Grafana]
READY --> FLOW
```
```mermaid
flowchart TD
CANDIDATE[Candidate run] --> QUALITY[Quality checks]
QUALITY --> DRIFT[Baseline drift tolerance]
DRIFT --> POLICY[Policy review]
POLICY --> APPROVE{Approved?}
APPROVE -- yes --> PROMOTE[Set active_run_id]
APPROVE -- no --> HOLD[Keep previous stable run]
PROMOTE --> AUDIT[Append governance record]
HOLD --> AUDIT
```
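The promotion gate above reduces to a small, auditable function: promote only when checks pass, keep the previous stable run otherwise, and always append a governance record. The thresholds, metric names, and audit fields below are illustrative assumptions:

```python
def promote_if_approved(registry, candidate, min_r2=0.7, max_drift=0.2):
    """Promote candidate to active only if quality and drift gates pass."""
    approved = (candidate["metrics"]["r2"] >= min_r2
                and candidate["drift"] <= max_drift)
    if approved:
        registry["active_run_id"] = candidate["run_id"]   # set active_run_id
    # Either way, append a governance record for later audit
    registry.setdefault("audit", []).append({
        "run_id": candidate["run_id"],
        "decision": "promote" if approved else "hold",
    })
    return registry
```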
## Production Delivery
Delivery is designed as a single workflow model with provider-specific execution lanes. CI verifies build quality and artifact integrity once, then GitOps drives reconciled rollout behavior across AWS, GCP, Azure, and OCI.
Strategy overlays let operators switch between rolling, canary, and blue/green without changing app code. This keeps release mechanics explicit, reviewable, and reversible.
```mermaid
flowchart LR
COMMIT[Git Commit] --> J[Jenkins Pipeline]
J --> TEST[Train + test + frontend build]
J --> IMG[Build and push containers]
IMG --> REG[Cloud registry]
J --> KUST[Update kustomize overlay]
KUST --> ARGO[Argo CD sync]
ARGO --> K8S[Kubernetes rollout]
```
```mermaid
flowchart TD
STRAT[Selected strategy] --> ROLLING[rolling overlay]
STRAT --> CANARY[canary overlay]
STRAT --> BG[bluegreen overlay]
ROLLING --> D[Deployment controller]
CANARY --> AR[Argo Rollouts canary]
BG --> ARBG[Argo Rollouts bluegreen]
```
```mermaid
flowchart TB
TF[Terraform roots] --> AWS[AWS: EKS + ECR + S3]
TF --> GCP[GCP: GKE + Artifact Registry + GCS]
TF --> AZ[Azure: AKS + ACR + Blob]
TF --> OCI[OCI: OKE + OCIR + Object Storage]
AWS --> K8S1[Kubernetes runtime]
GCP --> K8S2[Kubernetes runtime]
AZ --> K8S3[Kubernetes runtime]
OCI --> K8S4[Kubernetes runtime]
```
```mermaid
sequenceDiagram
participant CI as Jenkins
participant Argo as Argo Rollouts
participant SVC as Service
CI->>Argo: deploy canary image
Argo->>SVC: route 10% traffic
Argo->>SVC: route 25% traffic
Argo->>SVC: route 50% traffic
Argo->>SVC: run analysis template
Argo->>SVC: promote 100%
```
```mermaid
flowchart TB
CI["Jenkins CI"] --> GITOPS["GitOps overlay commit"]
GITOPS --> ARGO["Argo CD controller"]
ARGO --> AWSLANE["AWS lane"]
ARGO --> GCPLANE["GCP lane"]
ARGO --> AZLANE["Azure lane"]
ARGO --> OCILANE["OCI lane"]
AWSLANE --> AWSC["EKS workload rollout"]
GCPLANE --> GCPC["GKE workload rollout"]
AZLANE --> AZC["AKS workload rollout"]
OCILANE --> OCIC["OKE workload rollout"]
```
```mermaid
flowchart LR
BUILD[Verified build] --> DEV[Dev ring]
DEV --> QA[QA ring]
QA --> PREPROD[Preprod ring]
PREPROD --> PROD[Production ring]
DEV --> HEALTH1[Health gate]
QA --> HEALTH2[Soak gate]
PREPROD --> HEALTH3[Approval gate]
HEALTH1 --> QA
HEALTH2 --> PREPROD
HEALTH3 --> PROD
PROD --> ROLLBACK[Instant rollback to prior revision]
```
| Provider | Kubernetes | Container Registry | Artifact Storage | Terraform Root |
|---|---|---|---|---|
| AWS | EKS | ECR | S3 | infra/terraform/environments/aws |
| GCP | GKE | Artifact Registry | GCS | infra/terraform/environments/gcp |
| Azure | AKS | ACR | Blob Storage | infra/terraform/environments/azure |
| OCI | OKE | OCIR | Object Storage | infra/terraform/environments/oci |
## Operator Toolkit
This command catalog is structured by execution intent: local validation, frontend quality checks, Kubernetes rendering, GitOps strategy control, and infrastructure provisioning.
Use it as the fast path for repeatable operations. Each block is copy-ready and maps to a production concern so on-call and delivery teams can execute with low ambiguity.
```bash
source .venv/bin/activate
PYTHONPATH=src python -m youtube_success_ml.train --run-all
PYTHONPATH=src pytest -q
PYTHONPATH=src uvicorn youtube_success_ml.api.fastapi_app:app --host 0.0.0.0 --port 8000
```
```bash
cd frontend
npm ci
npm run lint
npm run build
npm run dev
```
```bash
kubectl kustomize infra/k8s/overlays/rolling
kubectl kustomize infra/k8s/overlays/canary
kubectl kustomize infra/k8s/overlays/bluegreen
```
```bash
bash infra/argocd/bootstrap.sh
bash infra/argocd/switch-strategy.sh canary
bash infra/argocd/switch-strategy.sh bluegreen
bash infra/argocd/switch-strategy.sh rolling
```
```bash
cd infra/terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars
terraform init
terraform plan
terraform apply
```

```bash
docker compose up --build
```
## Release Governance

1. Run the training pipeline, produce artifacts, and validate metrics and data quality reports.
2. Pass `pytest`, frontend lint/build, and confirm the `/health` + `/ready` contracts.
3. Build API/frontend images from the same repo revision and publish to the cloud registry.
4. Sync rolling/canary/bluegreen overlays through Argo CD and verify rollout health.
5. Gate promotions, monitor SLO signals, and execute rollback if regression is detected.
## Execution Playbook
The runbook captures high-signal commands used during validation, incident handling, and controlled rollout. It complements CI/CD by giving humans deterministic break-glass and verification steps.
In production operations, prioritize this sequence: verify readiness, run smoke checks, confirm deployment parameters, and only then escalate to abort/undo actions.
```bash
bash scripts/smoke_api.sh http://127.0.0.1:8000
curl -i http://127.0.0.1:8000/ready
```
```text
CLOUD_PROVIDER=(aws|gcp|azure|oci)
DEPLOY_STRATEGY=(rolling|canary|bluegreen)
RUN_TERRAFORM_APPLY=(true|false)
IMAGE_TAG=(optional)
```
```bash
kubectl argo rollouts abort yts-api -n yts-prod
kubectl argo rollouts abort yts-frontend -n yts-prod
kubectl argo rollouts undo yts-api -n yts-prod
```
## Documentation Index

- Platform overview, setup, APIs, and operations entrypoint.
- Detailed system design, interaction models, and reliability views.
- Jenkins, Argo CD, K8s, and multi-cloud Terraform workflows.
- Model lineage, governance, drift checks, and promotion policy.
- Endpoint contracts for prediction, clustering, analytics, and health.
- Route topology, charts UX, SEO metadata, and integration model.

Next.js production demo on Vercel: https://youtube-success.vercel.app
## Frequently Asked Questions

**Terraform is missing on the CI agent?** Install Terraform in the Jenkins agent image or use a dedicated IaC stage container before `terraform_plan_apply.sh`.

**How do I change the deployment strategy?** Use `infra/argocd/switch-strategy.sh` to enforce one active strategy app, then sync via Argo and monitor health.

**Why does `/ready` report not_ready?** Artifacts are missing or inaccessible. Run training and verify mounted artifact storage before deployment sync.

**Where does the wiki content come from?** Static assets are served from `frontend/public/wiki` and available through the Next route `/wiki`.