Portfolio-grade Machine Learning System

Comprehensive Platform Wiki and Landing Site

End-to-end platform for YouTube success prediction, channel clustering, global intelligence visualization, and production MLOps delivery across AWS, GCP, Azure, and OCI, with a Next.js frontend, FastAPI backend, and Kubernetes deployment architecture.

3

Core ML pillars (predict, cluster, map)

4

Supported production cloud providers

3

Deployment strategies (rolling/canary/bluegreen)

3

Major API and platform capability surfaces

What The Platform Delivers

Capability Surface

1. Success Prediction System

Inputs: upload count, category, country, and channel age. Outputs: predicted subscribers, predicted earnings, and predicted growth.

  • Single and batch inference
  • Simulation and recommendation endpoints
  • FastAPI + Flask compatibility
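The single/batch split above can be sketched as a thin validation layer over one scoring function. This is a dependency-free sketch: the field names, the `predict_one`/`predict_batch` helpers, and the placeholder scoring math are illustrative assumptions, not the platform's real Pydantic models or trained bundle.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical input/output shapes for the prediction endpoints; the real
# service defines these as Pydantic models and loads a trained model bundle.
@dataclass
class ChannelInput:
    uploads: int
    category: str
    country: str
    channel_age_years: float

    def validate(self) -> None:
        if self.uploads < 0:
            raise ValueError("uploads must be non-negative")
        if self.channel_age_years < 0:
            raise ValueError("channel age must be non-negative")

@dataclass
class Prediction:
    predicted_subscribers: float
    predicted_earnings: float
    predicted_growth: float

def predict_one(channel: ChannelInput) -> Prediction:
    """Placeholder scoring function standing in for the trained model."""
    channel.validate()
    base = channel.uploads * 1000.0  # illustrative only, not the real model
    return Prediction(base, base * 0.01, 0.05)

def predict_batch(channels: List[ChannelInput]) -> List[Prediction]:
    # Batch inference is single inference mapped over validated inputs.
    return [predict_one(c) for c in channels]
```

Keeping batch inference as a map over the single-item path is what lets one validation contract serve both endpoints.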

2. Channel Clustering

Unsupervised segmentation via KMeans and DBSCAN to identify strategic archetypes across channel behavior.

  • Viral entertainers
  • Consistent educators
  • High-earning, low-upload channels
  • High-upload, low-growth channels
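
A minimal sketch of the segmentation approach, assuming scikit-learn and synthetic channel features (uploads, subscribers, earnings, growth); the real feature set and cluster counts come from the training pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Synthetic channel features: uploads, subscribers, earnings, growth rate.
# Two illustrative archetypes: viral entertainers and consistent educators.
X = np.vstack([
    rng.normal(loc=[10, 5e6, 2e5, 0.30], scale=[3, 1e6, 5e4, 0.05], size=(50, 4)),
    rng.normal(loc=[200, 8e5, 4e4, 0.10], scale=[30, 2e5, 1e4, 0.02], size=(50, 4)),
])

# Scaling matters: raw subscriber counts would otherwise dominate distances.
X_scaled = StandardScaler().fit_transform(X)

# KMeans assigns every channel to exactly one of k archetypes.
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_scaled)

# DBSCAN finds dense archetypes and marks sparse channels as noise (-1),
# which is useful for surfacing outlier channels rather than forcing a label.
dbscan_labels = DBSCAN(eps=0.9, min_samples=5).fit_predict(X_scaled)
```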

3. Global Visualization

Country/category influence analytics and map-ready data products for dashboard storytelling and decisions.

  • Country metrics APIs
  • Charts page with post-processed views
  • Plotly/Folium map artifacts
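
The country metrics APIs above reduce channel-level rows to per-country aggregates before any map rendering. A stdlib sketch of that roll-up, with hypothetical record fields:

```python
from collections import defaultdict

# Hypothetical channel records; the real API reads post-processed artifacts.
channels = [
    {"country": "US", "category": "Music", "subscribers": 120e6},
    {"country": "US", "category": "Gaming", "subscribers": 60e6},
    {"country": "IN", "category": "Music", "subscribers": 220e6},
]

def country_metrics(rows):
    """Roll channel rows up into per-country totals for map-ready output."""
    totals = defaultdict(lambda: {"channels": 0, "subscribers": 0.0})
    for row in rows:
        bucket = totals[row["country"]]
        bucket["channels"] += 1
        bucket["subscribers"] += row["subscribers"]
    return dict(totals)
```

The resulting dict keys directly onto country codes, which is the shape Plotly/Folium choropleth layers expect.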

System Design

Architecture Views

These views describe how the platform is separated into product-facing interfaces, inference services, and lifecycle infrastructure so teams can evolve each area without coupling failures across the system.

The diagrams below should be read together: component topology explains ownership, pipeline flow explains data contracts, and lifecycle/observability views define runtime guardrails for production traffic.

Platform Components

flowchart LR
  FE[Next.js Frontend] --> API[FastAPI/Flask]
  API --> SVC[Intelligence Services]
  SVC --> SUP[Supervised Models]
  SVC --> CLU[Clustering Models]
  SVC --> MLOPS[MLOps Artifacts]
  SUP --> ART[(artifacts/models)]
  CLU --> ART
  MLOPS --> REP[(reports + manifest + registry)]
            

Data And Feature Pipeline

flowchart TD
  RAW[Raw CSV Dataset] --> LOAD[data.loader]
  LOAD --> CLEAN[Type coercion and null handling]
  CLEAN --> FEAT[Feature engineering]
  FEAT --> TRAIN[Train supervised and clustering]
  TRAIN --> EVAL[Metrics and quality reports]
  EVAL --> PUBLISH[Bundle artifacts and registry update]
            

Request Lifecycle

sequenceDiagram
  participant U as User
  participant FE as Frontend
  participant API as API Router
  participant S as Predictor Service
  participant A as Artifacts

  U->>FE: submit channel inputs
  FE->>API: POST /predict
  API->>S: validate and infer
  S->>A: load active model bundle
  A-->>S: model + preprocessing pipeline
  S-->>API: predictions + cluster context
  API-->>FE: response payload
  FE-->>U: charts, explainability, recommendations
            

Frontend Route Topology

flowchart LR
  APP["Next.js App Router"] --> HOME["Route: /"]
  APP --> CHARTS["Route: /visualizations/charts"]
  APP --> LAB["Route: /intelligence/lab"]
  APP --> WIKI["Route: /wiki"]
  HOME --> PREDICT["Prediction workflows"]
  CHARTS --> ANALYTICS["Post-processing charts"]
  LAB --> EXPLAIN["Simulation and explainability"]
            

Inference Guardrails

flowchart TD
  IN[Incoming payload] --> VALIDATE[Pydantic validation]
  VALIDATE --> READY{Artifacts ready?}
  READY -- no --> E503[503 not_ready]
  READY -- yes --> SERVE[Predictor service]
  SERVE --> OUT[Typed response]
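
The guardrail flow above can be sketched framework-agnostically: validate first, then gate on artifact readiness before the predictor is ever touched. The field set and status payloads here are assumptions; the real service uses Pydantic models and FastAPI exception handling.

```python
# Hypothetical required fields; the real schema lives in the API package.
REQUIRED_FIELDS = {"uploads", "category", "country"}

def handle_predict(payload: dict, artifacts_ready: bool):
    """Return (status_code, body) for an incoming prediction payload."""
    missing = REQUIRED_FIELDS - payload.keys()
    if missing:
        # Validation failure: reject before any model work happens.
        return 422, {"error": f"missing fields: {sorted(missing)}"}
    if not artifacts_ready:
        # Mirrors the /ready contract: no model bundle, no inference.
        return 503, {"status": "not_ready"}
    return 200, {"prediction": "..."}  # predictor service would run here
```

Ordering the checks this way keeps 503s a pure artifact signal, so release gating never conflates bad input with an unready service.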
            

Observability Surface

flowchart LR
  Client --> HEALTH["/health"]
  Client --> READY["/ready"]
  Client --> METRICS["/metrics"]
  METRICS --> Dashboards[Grafana/Prometheus]
  READY --> Alerts[Release gating]
            

Quality, Governance, Reliability

MLOps Lifecycle

MLOps here is treated as an operational control plane, not only a training script. Every run produces lineage metadata, quality reports, and promotion context that can be audited before a model becomes active.

The lifecycle charts map how drift checks, readiness gates, rollback paths, and registry state interact to keep inference stable while still allowing frequent model refreshes.

Advanced controls are intentionally opt-in: MLflow/W&B tracking, Optuna tuning, DVC/Feast data workflows, Prefect retraining orchestration, and Prometheus/Grafana observability can be enabled without destabilizing baseline CI quality gates.

Artifact Lineage

flowchart LR
  T[Training Run] --> M1[supervised_bundle.joblib]
  T --> M2[clustering_bundle.joblib]
  T --> R1[training_metrics.json]
  T --> R2[data_quality_report.json]
  T --> R3[feature_store_snapshot.csv]
  T --> R4[optuna_study.json]
  T --> MAN[training_manifest.json]
  MAN --> REG[model_registry.json]
  REG --> ACTIVE[active_run_id]
            

Drift Detection Process

flowchart TD
  INPUT["Incoming payload sample"] --> DRIFT["Endpoint: /mlops/drift-check"]
  DRIFT --> NUM["Numeric z-score checks"]
  DRIFT --> CAT["Categorical frequency checks"]
  NUM --> SCORE[Severity aggregation]
  CAT --> SCORE
  SCORE --> DECISION{High severity?}
  DECISION -- yes --> RETRAIN[Trigger retraining workflow]
  DECISION -- no --> SERVE[Continue active run]
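
A minimal sketch of the two check families and the severity aggregation in the diagram; thresholds and the aggregation rule are illustrative assumptions, not the endpoint's actual defaults.

```python
import statistics

def numeric_drift(baseline, incoming, z_threshold=3.0):
    """Flag drift when the incoming mean sits far outside the baseline spread."""
    mu = statistics.mean(baseline)
    sigma = statistics.stdev(baseline) or 1.0  # guard against zero spread
    z = abs(statistics.mean(incoming) - mu) / sigma
    return z, z > z_threshold

def categorical_drift(baseline, incoming, tolerance=0.2):
    """Flag drift when any category's frequency shifts beyond tolerance."""
    def freqs(values):
        return {v: values.count(v) / len(values) for v in set(values)}
    base, new = freqs(baseline), freqs(incoming)
    shift = max(abs(new.get(k, 0.0) - base.get(k, 0.0)) for k in set(base) | set(new))
    return shift, shift > tolerance

def severity(flags):
    """Aggregate severity as the share of checks that fired."""
    fired = sum(1 for f in flags if f)
    return fired / len(flags) if flags else 0.0
```

A high severity score is what would route the flow to the retraining branch; below it, the active run keeps serving.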
            

Service Readiness Model

stateDiagram-v2
  [*] --> Boot
  Boot --> NotReady: artifacts missing
  Boot --> Ready: artifacts found
  NotReady --> Train
  Train --> Ready
  Ready --> Serving
  Serving --> DriftRisk
  DriftRisk --> Train
            

Model Promotion And Rollback

sequenceDiagram
  participant Train as Training Pipeline
  participant Reg as model_registry.json
  participant API as Inference API
  participant Ops as Operator

  Train->>Reg: append run and set active_run_id
  API->>Reg: load active run
  Ops->>Reg: rollback to prior run if needed
  API-->>Ops: serving previous stable run
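
The promotion/rollback contract above can be sketched over a minimal registry dict. The `previous_run_id` field is an assumption added here to make rollback explicit; the real model_registry.json may track lineage differently.

```python
def promote(registry: dict, run_id: str) -> dict:
    """Append a run and make it the active serving run."""
    if run_id not in registry["runs"]:
        registry["runs"].append(run_id)
    registry["previous_run_id"] = registry.get("active_run_id")
    registry["active_run_id"] = run_id
    return registry

def rollback(registry: dict) -> dict:
    """Point active_run_id back at the prior stable run."""
    prior = registry.get("previous_run_id")
    if prior is None:
        raise RuntimeError("no prior run to roll back to")
    registry["active_run_id"] = prior
    return registry

registry = {"runs": [], "active_run_id": None, "previous_run_id": None}
promote(registry, "run-001")
promote(registry, "run-002")
rollback(registry)  # inference API now serves run-001 again
```

Because the API only reads `active_run_id`, rollback is a one-field write with no redeploy required.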
            

MLOps Lifecycle Orchestration

flowchart LR
  DATA["Data intake"] --> TRAIN["Model training"]
  TRAIN --> EVAL["Evaluation + quality gates"]
  EVAL --> REG["Register artifacts + manifest"]
  REG --> DEPLOY["Deploy active run"]
  DEPLOY --> MONITOR["Health + drift monitoring"]
  MONITOR --> RETRAIN{"Retrain needed?"}
  RETRAIN -- yes --> TRAIN
  RETRAIN -- no --> DEPLOY
            

Experiment Tracking Layer

flowchart LR
  TR[train.py] --> CFG[Tracking flags]
  CFG --> MLF["MLflow backend (optional)"]
  CFG --> WB["W&B backend (optional)"]
  TR --> PAR[log params]
  TR --> MET[log metrics]
  TR --> ART[log artifacts]
  PAR --> MLF
  PAR --> WB
  MET --> MLF
  MET --> WB
  ART --> MLF
  ART --> WB
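
The fan-out above suggests a small facade: training code logs once, and flag-enabled backends each receive the event. This sketch records events in a local list instead of calling the mlflow/wandb clients, so the facade shape is the point, not the backend wiring.

```python
class TrackingFacade:
    """Opt-in experiment tracking: backends activate via flags, and
    training code logs through one interface regardless of what is enabled."""

    def __init__(self, use_mlflow: bool = False, use_wandb: bool = False):
        self.backends = []
        if use_mlflow:
            self.backends.append("mlflow")
        if use_wandb:
            self.backends.append("wandb")
        self.log = []  # stand-in for remote backend calls

    def log_params(self, params: dict):
        for backend in self.backends:
            self.log.append((backend, "params", params))

    def log_metrics(self, metrics: dict):
        for backend in self.backends:
            self.log.append((backend, "metrics", metrics))

tracker = TrackingFacade(use_mlflow=True)
tracker.log_params({"n_estimators": 200})
tracker.log_metrics({"rmse": 0.42})
```

With no flags set, every call is a no-op, which is what keeps baseline CI runs free of tracking dependencies.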
            

Optuna Tuning Loop

flowchart TD
  START[CLI --optuna-trials] --> STUDY[Create or load study]
  STUDY --> TRIAL[Sample trial params]
  TRIAL --> TRAINRF[Train candidate regressor]
  TRAINRF --> SCORE[Evaluate RMSE]
  SCORE --> BEST{Best score?}
  BEST -- yes --> UPDATE[Update best params]
  BEST -- no --> NEXT[Next trial]
  UPDATE --> NEXT
  NEXT --> STUDY
  STUDY --> OUT[optuna_study.json + tuned config]
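
The sample-score-update loop above can be sketched without the Optuna dependency; with Optuna installed, `study.optimize` replaces this manual loop and the study state persists to optuna_study.json. The search space and the stand-in objective here are illustrative assumptions.

```python
import random

def score_candidate(params: dict) -> float:
    """Stand-in objective; the real pipeline trains a candidate regressor
    and returns its validation RMSE."""
    return (params["n_estimators"] - 300) ** 2 / 1e4 + params["max_depth"] * 0.01

def tune(n_trials: int = 20, seed: int = 0):
    """Sample trial params, score each candidate, keep the best (lowest RMSE)."""
    rng = random.Random(seed)
    best_params, best_rmse = None, float("inf")
    for _ in range(n_trials):
        params = {
            "n_estimators": rng.randrange(50, 500),
            "max_depth": rng.randrange(3, 15),
        }
        rmse = score_candidate(params)
        if rmse < best_rmse:  # lower RMSE wins
            best_params, best_rmse = params, rmse
    return best_params, best_rmse
```

Seeding the sampler keeps trial sequences reproducible, which is what makes the tuned config auditable alongside the other run artifacts.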
            

Feature Store And Versioning Flow

flowchart LR
  RAW[Raw dataset] --> PREP[Preprocess stage]
  PREP --> SNAP[feature_store_snapshot.csv]
  SNAP --> DVC[dvc.yaml stages]
  SNAP --> FEAST[Feast file source]
  FEAST --> OFFLINE[Offline feature retrieval]
  DVC --> LINEAGE[Data lineage + reproducibility]
            

Retraining + Monitoring Control Plane

flowchart TB
  PREF[Prefect schedule] --> FLOW[retraining flow]
  FLOW --> TRAIN[run_training]
  TRAIN --> REG[manifest + registry]
  API[API service] --> READY["/ready"]
  API --> MET["/metrics"]
  API --> CAP["/mlops/capabilities"]
  MET --> PROM[Prometheus]
  PROM --> GRAF[Grafana]
  READY --> FLOW
            

Model Governance Gate

flowchart TD
  CANDIDATE[Candidate run] --> QUALITY[Quality checks]
  QUALITY --> DRIFT[Baseline drift tolerance]
  DRIFT --> POLICY[Policy review]
  POLICY --> APPROVE{Approved?}
  APPROVE -- yes --> PROMOTE[Set active_run_id]
  APPROVE -- no --> HOLD[Keep previous stable run]
  PROMOTE --> AUDIT[Append governance record]
  HOLD --> AUDIT
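
The gate above reduces to a conjunction of checks plus an unconditional audit write; this is a sketch of that shape, with the check inputs passed as booleans rather than computed from real reports.

```python
def governance_gate(candidate: str, quality_ok: bool, drift_ok: bool,
                    approved: bool, registry: dict, audit: list) -> str:
    """Promote a candidate run only if every gate passes; audit either way."""
    if quality_ok and drift_ok and approved:
        registry["active_run_id"] = candidate
        decision = "promoted"
    else:
        decision = "held"  # previous stable run keeps serving
    # Both branches converge on the audit trail, as in the diagram.
    audit.append({"run": candidate, "decision": decision})
    return decision
```

Writing the audit record on both branches is the property worth keeping: a held candidate is as much governance evidence as a promoted one.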
            

Production Delivery

Multi-Cloud CI/CD and GitOps

Delivery is designed as a single workflow model with provider-specific execution lanes. CI verifies build quality and artifact integrity once, then GitOps drives reconciled rollout behavior across AWS, GCP, Azure, and OCI.

Strategy overlays let operators switch between rolling, canary, and blue/green without changing app code. This keeps release mechanics explicit, reviewable, and reversible.

Jenkins + Argo Delivery Flow

flowchart LR
  COMMIT[Git Commit] --> J[Jenkins Pipeline]
  J --> TEST[Train + test + frontend build]
  J --> IMG[Build and push containers]
  IMG --> REG[Cloud registry]
  J --> KUST[Update kustomize overlay]
  KUST --> ARGO[Argo CD sync]
  ARGO --> K8S[Kubernetes rollout]
            

Deployment Strategy Control

flowchart TD
  STRAT[Selected strategy] --> ROLLING[rolling overlay]
  STRAT --> CANARY[canary overlay]
  STRAT --> BG[bluegreen overlay]
  ROLLING --> D[Deployment controller]
  CANARY --> AR[Argo Rollouts canary]
  BG --> ARBG[Argo Rollouts bluegreen]
            

Cloud Provider Packs

flowchart TB
  TF[Terraform roots] --> AWS[AWS: EKS + ECR + S3]
  TF --> GCP[GCP: GKE + Artifact Registry + GCS]
  TF --> AZ[Azure: AKS + ACR + Blob]
  TF --> OCI[OCI: OKE + OCIR + Object Storage]
  AWS --> K8S1[Kubernetes runtime]
  GCP --> K8S2[Kubernetes runtime]
  AZ --> K8S3[Kubernetes runtime]
  OCI --> K8S4[Kubernetes runtime]
            

Canary Rollout Sequence

sequenceDiagram
  participant CI as Jenkins
  participant Argo as Argo Rollouts
  participant SVC as Service

  CI->>Argo: deploy canary image
  Argo->>SVC: route 10% traffic
  Argo->>SVC: route 25% traffic
  Argo->>SVC: route 50% traffic
  Argo->>SVC: run analysis template
  Argo->>SVC: promote 100%
            

Multi-Cloud Delivery Lanes

flowchart TB
  CI["Jenkins CI"] --> GITOPS["GitOps overlay commit"]
  GITOPS --> ARGO["Argo CD controller"]
  ARGO --> AWSLANE["AWS lane"]
  ARGO --> GCPLANE["GCP lane"]
  ARGO --> AZLANE["Azure lane"]
  ARGO --> OCILANE["OCI lane"]
  AWSLANE --> AWSC["EKS workload rollout"]
  GCPLANE --> GCPC["GKE workload rollout"]
  AZLANE --> AZC["AKS workload rollout"]
  OCILANE --> OCIC["OKE workload rollout"]
            

GitOps Promotion Rings

flowchart LR
  BUILD[Verified build] --> DEV[Dev ring]
  DEV --> QA[QA ring]
  QA --> PREPROD[Preprod ring]
  PREPROD --> PROD[Production ring]
  DEV --> HEALTH1[Health gate]
  QA --> HEALTH2[Soak gate]
  PREPROD --> HEALTH3[Approval gate]
  HEALTH1 --> QA
  HEALTH2 --> PREPROD
  HEALTH3 --> PROD
  PROD --> ROLLBACK[Instant rollback to prior revision]
            

Cloud Integration Matrix

Provider | Kubernetes | Container Registry | Artifact Storage | Terraform Root
AWS      | EKS        | ECR                | S3               | infra/terraform/environments/aws
GCP      | GKE        | Artifact Registry  | GCS              | infra/terraform/environments/gcp
Azure    | AKS        | ACR                | Blob Storage     | infra/terraform/environments/azure
OCI      | OKE        | OCIR               | Object Storage   | infra/terraform/environments/oci

Operator Toolkit

Command Center

This command catalog is structured by execution intent: local validation, frontend quality checks, Kubernetes rendering, GitOps strategy control, and infrastructure provisioning.

Use it as the fast path for repeatable operations. Each block is copy-ready and maps to a production concern so on-call and delivery teams can execute with low ambiguity.

Local ML + API Bootstrap

source .venv/bin/activate
PYTHONPATH=src python -m youtube_success_ml.train --run-all
PYTHONPATH=src pytest -q
PYTHONPATH=src uvicorn youtube_success_ml.api.fastapi_app:app --host 0.0.0.0 --port 8000

Frontend Lifecycle

cd frontend
npm ci
npm run lint
npm run build
npm run dev

Kubernetes Render Checks

kubectl kustomize infra/k8s/overlays/rolling
kubectl kustomize infra/k8s/overlays/canary
kubectl kustomize infra/k8s/overlays/bluegreen

Argo Strategy Switching

bash infra/argocd/bootstrap.sh
bash infra/argocd/switch-strategy.sh canary
bash infra/argocd/switch-strategy.sh bluegreen
bash infra/argocd/switch-strategy.sh rolling

Terraform Apply Pattern

cd infra/terraform/environments/aws
cp terraform.tfvars.example terraform.tfvars
terraform init
terraform plan
terraform apply

Container Stack

docker compose up --build

Release Governance

Delivery Roadmap And Checklist

Phase 1: Data + ML Integrity

Run training pipeline, produce artifacts, validate metrics and data quality reports.

Phase 2: API + Frontend Validation

Pass `pytest`, frontend lint/build, and confirm `/health` + `/ready` contracts.

Phase 3: Container Build and Push

Build API/frontend images from the same repo revision and publish to cloud registry.

Phase 4: GitOps Strategy Rollout

Sync rolling/canary/bluegreen overlays through Argo CD and verify rollout health.

Phase 5: Observe, Promote, Rollback

Gate promotions, monitor SLO signals, and execute rollback if regression is detected.

Release Checklist


Execution Playbook

Operator Runbook

The runbook captures high-signal commands used during validation, incident handling, and controlled rollout. It complements CI/CD by giving humans deterministic break-glass and verification steps.

In production operations, prioritize this sequence: verify readiness, run smoke checks, confirm deployment parameters, and only then escalate to abort/undo actions.

Smoke API

bash scripts/smoke_api.sh http://127.0.0.1:8000

Readiness Verification

curl -i http://127.0.0.1:8000/ready

Jenkins Inputs

CLOUD_PROVIDER=(aws|gcp|azure|oci)
DEPLOY_STRATEGY=(rolling|canary|bluegreen)
RUN_TERRAFORM_APPLY=(true|false)
IMAGE_TAG=(optional)

Incident Response

kubectl argo rollouts abort yts-api -n yts-prod
kubectl argo rollouts abort yts-frontend -n yts-prod
kubectl argo rollouts undo yts-api -n yts-prod

Documentation Index

Deep-Dive Documents

Frequently Asked Questions

Operational FAQ

What if Terraform is not installed on a runner?

Install Terraform in the Jenkins agent image or use a dedicated IaC stage container before `terraform_plan_apply.sh`.

How do we change rollout strategy safely?

Use `infra/argocd/switch-strategy.sh` to enforce one active strategy app, then sync via Argo and monitor health.

Why might `/ready` return 503?

Artifacts are missing or inaccessible. Run training and verify mounted artifact storage before deployment sync.

Where is this wiki hosted in the app?

Static assets are served from `frontend/public/wiki` and available through the Next route `/wiki`.