Portfolio-grade Machine Learning System

YouTube Success Prediction & Intelligence Visualization Platform

Q: How do we change rollout strategy safely?

Use infra/argocd/switch-strategy.sh to enforce one active strategy app, then sync via Argo and monitor health.

Q: Where is this wiki hosted in the app?

Static assets are served from frontend/public/wiki and available through the Next route /wiki.

End-to-end platform for YouTube success prediction, channel clustering, global intelligence visualization, and production MLOps delivery across AWS, GCP, Azure, and OCI, with a Next.js frontend, FastAPI backend, and Kubernetes deployment architecture.

Explore Architecture Open Command Center Frontend Demo GitHub Repository

0

Core ML pillars (predict, cluster, map)

0

Supported production cloud providers

0

Deployment strategies (rolling/canary/bluegreen)

0

Major API and platform capability surfaces

What The Platform Delivers

Capability Surface

1. Success Prediction System

Inputs: uploads, category, country, age. Outputs: predicted subscribers, predicted earnings, predicted growth.

Single and batch inference
Simulation and recommendation endpoints
Model lab with four insight cards plus always-visible drift snapshot
Icon-only top-left control for animated navbar collapse/expand
FastAPI + Flask compatibility

2. Channel Clustering

Unsupervised segmentation via KMeans and DBSCAN to identify strategic archetypes across channel behavior.

Viral entertainers
Consistent educators
High earning low upload and high upload low growth

3. Global Visualization

Country/category influence analytics and map-ready data products for dashboard storytelling and decisions.

Country metrics APIs
Real world map tabs (influence, earnings choropleth, category dominance)
Charts page with post-processed views
Plotly/Folium map artifacts

System Design

Architecture Views

These views describe how the platform is separated into product-facing interfaces, inference services, and lifecycle infrastructure so teams can evolve each area without coupling failures across the system.

The diagrams below should be read together: component topology explains ownership, pipeline flow explains data contracts, and lifecycle/observability views define runtime guardrails for production traffic.

Platform Components

flowchart LR
  FE[Next.js Frontend] --> API[FastAPI/Flask]
  API --> SVC[Intelligence Services]
  SVC --> SUP[Supervised Models]
  SVC --> CLU[Clustering Models]
  SVC --> MLOPS[MLOps Artifacts]
  SUP --> ART[(artifacts/models)]
  CLU --> ART
  MLOPS --> REP[(reports + manifest + registry)]

Data And Feature Pipeline

flowchart TD
  RAW[Raw CSV Dataset] --> LOAD[data.loader]
  LOAD --> CLEAN[Type coercion and null handling]
  CLEAN --> FEAT[Feature engineering]
  FEAT --> TRAIN[Train supervised and clustering]
  TRAIN --> EVAL[Metrics and quality reports]
  EVAL --> PUBLISH[Bundle artifacts and registry update]

Request Lifecycle

sequenceDiagram
  participant U as User
  participant FE as Frontend
  participant API as API Router
  participant S as Predictor Service
  participant A as Artifacts

  U->>FE: submit channel inputs
  FE->>API: POST /predict
  API->>S: validate and infer
  S->>A: load active model bundle
  A-->>S: model + preprocessing pipeline
  S-->>API: predictions + cluster context
  API-->>FE: response payload
  FE-->>U: charts, explainability, recommendations

Frontend Route Topology

flowchart LR
  APP["Next.js App Router"] --> HOME["Route: /"]
  APP --> CHARTS["Route: /visualizations/charts"]
  APP --> LAB["Route: /intelligence/lab"]
  APP --> WIKI["Route: /wiki"]
  HOME --> PREDICT["Prediction workflows + 6 intelligence cards"]
  CHARTS --> ANALYTICS["Real map workspace + post-processing charts"]
  LAB --> EXPLAIN["Simulation and explainability + 4 insight cards"]
  LAB --> DRIFT["Always-visible Drift Snapshot (run-lab guidance when idle)"]

Overview Intelligence Card Stack

flowchart TB
  HOME["Route: /"] --> C1["Market Momentum Lens"]
  HOME --> C2["Archetype Share Wheel"]
  HOME --> C3["Revenue Efficiency Signals"]
  HOME --> C4["Category Pressure Map"]
  HOME --> C5["Market Share Balance"]
  HOME --> C6["Monetization Lift Curve"]
  C1 --> TABLE["Global Country Intelligence"]
  C6 --> TABLE

Model Lab Insight Card Stack

flowchart TB
  LAB["Route: /intelligence/lab"] --> L1["Growth Elasticity Pulse"]
  LAB --> L2["Explainability Concentration"]
  LAB --> L3["Earnings Response Gradient"]
  LAB --> L4["Drift Severity Mix"]
  LAB --> DS["Drift Snapshot (always rendered, run-lab text when idle)"]
  LAB --> BW["Batch Prediction Workbench"]
  LAB --> EMPTY["Growth/Explainability + Drift: run-lab text before first run"]

Navigation Shell Interaction

stateDiagram-v2
  [*] --> Expanded
  Expanded --> Collapsed: top-left icon click
  Collapsed --> Expanded: top-left icon click
  Expanded --> MobileMenuOpen: mobile Menu tap
  MobileMenuOpen --> Expanded: close/menu tap or route change

Inference Guardrails

flowchart TD
  IN[Incoming payload] --> VALIDATE[Pydantic validation]
  VALIDATE --> READY{Artifacts ready?}
  READY -- no --> E503[503 not_ready]
  READY -- yes --> SERVE[Predictor service]
  SERVE --> OUT[Typed response]

Observability Surface

flowchart LR
  Client --> HEALTH["/health"]
  Client --> READY["/ready"]
  Client --> METRICS["/metrics"]
  METRICS --> Dashboards[Grafana/Prometheus]
  READY --> Alerts[Release gating]

Quality, Governance, Reliability

MLOps Lifecycle

MLOps here is treated as an operational control plane, not only a training script. Every run produces lineage metadata, quality reports, and promotion context that can be audited before a model becomes active.

The lifecycle charts map how drift checks, readiness gates, rollback paths, and registry state interact to keep inference stable while still allowing frequent model refreshes.

Advanced controls are intentionally opt-in: MLflow/W&B tracking, Optuna tuning, DVC/Feast data workflows, Prefect retraining orchestration, and Prometheus/Grafana observability can be enabled without destabilizing baseline CI quality gates.

Artifact Lineage

flowchart LR
  T[Training Run] --> M1[supervised_bundle.joblib]
  T --> M2[clustering_bundle.joblib]
  T --> R1[training_metrics.json]
  T --> R2[data_quality_report.json]
  T --> R3[feature_store_snapshot.csv]
  T --> R4[optuna_study.json]
  T --> MAN[training_manifest.json]
  MAN --> REG[model_registry.json]
  REG --> ACTIVE[active_run_id]

Drift Detection Process

flowchart TD
  INPUT["Incoming payload sample"] --> DRIFT["Endpoint: /mlops/drift-check"]
  DRIFT --> NUM["Numeric z-score checks"]
  DRIFT --> CAT["Categorical frequency checks"]
  NUM --> SCORE[Severity aggregation]
  CAT --> SCORE
  SCORE --> DECISION{High severity?}
  DECISION -- yes --> RETRAIN[Trigger retraining workflow]
  DECISION -- no --> SERVE[Continue active run]

Service Readiness Model

stateDiagram-v2
  [*] --> Boot
  Boot --> NotReady: artifacts missing
  Boot --> Ready: artifacts found
  NotReady --> Train
  Train --> Ready
  Ready --> Serving
  Serving --> DriftRisk
  DriftRisk --> Train

Model Promotion And Rollback

sequenceDiagram
  participant Train as Training Pipeline
  participant Reg as model_registry.json
  participant API as Inference API
  participant Ops as Operator

  Train->>Reg: append run and set active_run_id
  API->>Reg: load active run
  Ops->>Reg: rollback to prior run if needed
  API-->>Ops: serving previous stable run

MLOps Lifecycle Orchestration

flowchart LR
  DATA["Data intake"] --> TRAIN["Model training"]
  TRAIN --> EVAL["Evaluation + quality gates"]
  EVAL --> REG["Register artifacts + manifest"]
  REG --> DEPLOY["Deploy active run"]
  DEPLOY --> MONITOR["Health + drift monitoring"]
  MONITOR --> RETRAIN{"Retrain needed?"}
  RETRAIN -- yes --> TRAIN
  RETRAIN -- no --> DEPLOY

Experiment Tracking Layer

flowchart LR
  TR[train.py] --> CFG[Tracking flags]
  CFG --> MLF["MLflow backend (optional)"]
  CFG --> WB["W&B backend (optional)"]
  TR --> PAR[log params]
  TR --> MET[log metrics]
  TR --> ART[log artifacts]
  PAR --> MLF
  PAR --> WB
  MET --> MLF
  MET --> WB
  ART --> MLF
  ART --> WB

Optuna Tuning Loop

flowchart TD
  START[CLI --optuna-trials] --> STUDY[Create or load study]
  STUDY --> TRIAL[Sample trial params]
  TRIAL --> TRAINRF[Train candidate regressor]
  TRAINRF --> SCORE[Evaluate RMSE]
  SCORE --> BEST{Best score?}
  BEST -- yes --> UPDATE[Update best params]
  BEST -- no --> NEXT[Next trial]
  UPDATE --> NEXT
  NEXT --> STUDY
  STUDY --> OUT[optuna_study.json + tuned config]

Feature Store And Versioning Flow

flowchart LR
  RAW[Raw dataset] --> PREP[Preprocess stage]
  PREP --> SNAP[feature_store_snapshot.csv]
  SNAP --> DVC[dvc.yaml stages]
  SNAP --> FEAST[Feast file source]
  FEAST --> OFFLINE[Offline feature retrieval]
  DVC --> LINEAGE[Data lineage + reproducibility]

Retraining + Monitoring Control Plane

flowchart TB
  PREF[Prefect schedule] --> FLOW[retraining flow]
  FLOW --> TRAIN[run_training]
  TRAIN --> REG[manifest + registry]
  API[API service] --> READY["/ready"]
  API --> MET["/metrics"]
  API --> CAP["/mlops/capabilities"]
  MET --> PROM[Prometheus]
  PROM --> GRAF[Grafana]
  READY --> FLOW

Model Governance Gate

flowchart TD
  CANDIDATE[Candidate run] --> QUALITY[Quality checks]
  QUALITY --> DRIFT[Baseline drift tolerance]
  DRIFT --> POLICY[Policy review]
  POLICY --> APPROVE{Approved?}
  APPROVE -- yes --> PROMOTE[Set active_run_id]
  APPROVE -- no --> HOLD[Keep previous stable run]
  PROMOTE --> AUDIT[Append governance record]
  HOLD --> AUDIT

Production Delivery

Multi-Cloud CI/CD and GitOps

Delivery is designed as a single workflow model with provider-specific execution lanes. CI verifies build quality and artifact integrity once, then GitOps drives reconciled rollout behavior across AWS, GCP, Azure, and OCI.

Strategy overlays let operators switch between rolling, canary, and blue/green without changing app code. This keeps release mechanics explicit, reviewable, and reversible.

Jenkins + Argo Delivery Flow

flowchart LR
  COMMIT[Git Commit] --> J[Jenkins Pipeline]
  J --> TEST[Train + test + frontend build]
  J --> IMG[Build and push containers]
  IMG --> REG[Cloud registry]
  J --> KUST[Update kustomize overlay]
  KUST --> ARGO[Argo CD sync]
  ARGO --> K8S[Kubernetes rollout]

Deployment Strategy Control

flowchart TD
  STRAT[Selected strategy] --> ROLLING[rolling overlay]
  STRAT --> CANARY[canary overlay]
  STRAT --> BG[bluegreen overlay]
  ROLLING --> D[Deployment controller]
  CANARY --> AR[Argo Rollouts canary]
  BG --> ARBG[Argo Rollouts bluegreen]

Cloud Provider Packs

flowchart TB
  TF[Terraform roots] --> AWS[AWS: EKS + ECR + S3]
  TF --> GCP[GCP: GKE + Artifact Registry + GCS]
  TF --> AZ[Azure: AKS + ACR + Blob]
  TF --> OCI[OCI: OKE + OCIR + Object Storage]
  AWS --> K8S1[Kubernetes runtime]
  GCP --> K8S2[Kubernetes runtime]
  AZ --> K8S3[Kubernetes runtime]
  OCI --> K8S4[Kubernetes runtime]

Canary Rollout Sequence

sequenceDiagram
  participant CI as Jenkins
  participant Argo as Argo Rollouts
  participant SVC as Service

  CI->>Argo: deploy canary image
  Argo->>SVC: route 10% traffic
  Argo->>SVC: route 25% traffic
  Argo->>SVC: route 50% traffic
  Argo->>SVC: run analysis template
  Argo->>SVC: promote 100%

Multi-Cloud Delivery Lanes

flowchart TB
  CI["Jenkins CI"] --> GITOPS["GitOps overlay commit"]
  GITOPS --> ARGO["Argo CD controller"]
  ARGO --> AWSLANE["AWS lane"]
  ARGO --> GCPLANE["GCP lane"]
  ARGO --> AZLANE["Azure lane"]
  ARGO --> OCILANE["OCI lane"]
  AWSLANE --> AWSC["EKS workload rollout"]
  GCPLANE --> GCPC["GKE workload rollout"]
  AZLANE --> AZC["AKS workload rollout"]
  OCILANE --> OCIC["OKE workload rollout"]

GitOps Promotion Rings

flowchart LR
  BUILD[Verified build] --> DEV[Dev ring]
  DEV --> QA[QA ring]
  QA --> PREPROD[Preprod ring]
  PREPROD --> PROD[Production ring]
  DEV --> HEALTH1[Health gate]
  QA --> HEALTH2[Soak gate]
  PREPROD --> HEALTH3[Approval gate]
  HEALTH1 --> QA
  HEALTH2 --> PREPROD
  HEALTH3 --> PROD
  PROD --> ROLLBACK[Instant rollback to prior revision]

Cloud Integration Matrix

Provider	Kubernetes	Container Registry	Artifact Storage	Terraform Root
AWS	EKS	ECR	S3	`infra/terraform/environments/aws`
GCP	GKE	Artifact Registry	GCS	`infra/terraform/environments/gcp`
Azure	AKS	ACR	Blob Storage	`infra/terraform/environments/azure`
OCI	OKE	OCIR	Object Storage	`infra/terraform/environments/oci`

Operator Toolkit

Command Center

This command catalog is structured by execution intent: local validation, frontend quality checks, Kubernetes rendering, GitOps strategy control, and infrastructure provisioning.

Use it as the fast path for repeatable operations. Each block is copy-ready and maps to a production concern so on-call and delivery teams can execute with low ambiguity.

Local ML + API Bootstrap

source .venv/bin/activate
PYTHONPATH=src python -m youtube_success_ml.train --run-all
PYTHONPATH=src pytest -q
PYTHONPATH=src uvicorn youtube_success_ml.api.fastapi_app:app --host 0.0.0.0 --port 8000

Frontend Lifecycle

cd frontend
npm ci
npm run lint
npm run build
npm run dev

Kubernetes Render Checks

kubectl kustomize infra/k8s/overlays/rolling
kubectl kustomize infra/k8s/overlays/canary
kubectl kustomize infra/k8s/overlays/bluegreen

Argo Strategy Switching

bash infra/argocd/bootstrap.sh
bash infra/argocd/switch-strategy.sh canary
bash infra/argocd/switch-strategy.sh bluegreen
bash infra/argocd/switch-strategy.sh rolling

Terraform Apply Pattern

cd infra/terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars
terraform init
terraform plan
terraform apply

Container Stack

docker compose up --build

Release Governance

Delivery Roadmap And Checklist

Phase 1: Data + ML Integrity

Run training pipeline, produce artifacts, validate metrics and data quality reports.

Phase 2: API + Frontend Validation

Pass `pytest`, frontend lint/build, and confirm `/health` + `/ready` contracts.

Phase 3: Container Build and Push

Build API/frontend images from the same repo revision and publish to cloud registry.

Phase 4: GitOps Strategy Rollout

Sync rolling/canary/bluegreen overlays through Argo CD and verify rollout health.

Phase 5: Observe, Promote, Rollback

Gate promotions, monitor SLO signals, and execute rollback if regression is detected.

Release Checklist

0/11 completed

Execution Playbook

Operator Runbook

The runbook captures high-signal commands used during validation, incident handling, and controlled rollout. It complements CI/CD by giving humans deterministic break-glass and verification steps.

In production operations, prioritize this sequence: verify readiness, run smoke checks, confirm deployment parameters, and only then escalate to abort/undo actions.

Smoke API

bash scripts/smoke_api.sh http://127.0.0.1:8000

Readiness Verification

curl -i http://127.0.0.1:8000/ready

Jenkins Inputs

CLOUD_PROVIDER=(aws|gcp|azure|oci)
DEPLOY_STRATEGY=(rolling|canary|bluegreen)
RUN_TERRAFORM_APPLY=(true|false)
IMAGE_TAG=(optional)

Incident Response

kubectl argo rollouts abort yts-api -n yts-prod
kubectl argo rollouts abort yts-frontend -n yts-prod
kubectl argo rollouts undo yts-api -n yts-prod

Documentation Index

Deep-Dive Documents

README.md

Platform overview, setup, APIs, and operations entrypoint.

ARCHITECTURE.md

Detailed system design, interaction models, and reliability views.

DEPLOYMENT.md

Jenkins, Argo CD, K8s, and multi-cloud Terraform workflows.

MLOPS.md

Model lineage, governance, drift checks, and promotion policy.

API_REFERENCE.md

Endpoint contracts for prediction, clustering, analytics, and health.

FRONTEND.md

Route topology, charts UX, SEO metadata, and integration model.

Live Frontend Demo

Next.js production demo on Vercel: https://youtube-success.vercel.app.

Frequently Asked Questions

Operational FAQ

What if Terraform is not installed on a runner?

Install Terraform in the Jenkins agent image or use a dedicated IaC stage container before `terraform_plan_apply.sh`.

How do we change rollout strategy safely?

Use `infra/argocd/switch-strategy.sh` to enforce one active strategy app, then sync via Argo and monitor health.

Why might `/ready` return 503?

Artifacts are missing or inaccessible. Run training and verify mounted artifact storage before deployment sync.

Where is this wiki hosted in the app?

Static assets are served from `frontend/public/wiki` and available through the Next route `/wiki`.