Spot the Scam - AI Job Fraud Detection

🎯 Project Overview

Spot the Scam is a machine learning system designed to protect job seekers from fraudulent job postings. It combines classical ML models with transformer-based deep learning, and features ensemble architectures, calibrated probability estimates, and explainable AI capabilities.

🤖 Intelligent Detection

Uses an ensemble ML pipeline that combines Linear SVMs, Logistic Regression, XGBoost, LightGBM, and DistilBERT for robust fraud detection across diverse posting styles.

📈 High Performance

Achieves 85.4% precision on test data with excellent calibration (ECE: 0.0066), ensuring reliable confidence estimates for critical decision-making.

🔍 Explainable Predictions

Token-level importance analysis with SHAP-style contribution rankings provides transparency and trust in model decisions.
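As a rough illustration of token-level contributions, the sketch below ranks tokens for a linear model by the product of each token's TF-IDF weight and its learned coefficient. This is one simple way to get SHAP-style rankings for linear classifiers; it is not necessarily how the project computes them, and the function name, tokens, and numbers are all hypothetical.

```python
from typing import Dict, List, Tuple


def token_contributions(
    tfidf_weights: Dict[str, float],
    coefficients: Dict[str, float],
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank tokens by the magnitude of (TF-IDF weight * linear coefficient)."""
    contributions = {
        token: weight * coefficients.get(token, 0.0)
        for token, weight in tfidf_weights.items()
    }
    # Largest absolute contribution first; the sign shows the direction.
    ranked = sorted(contributions.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]


# Hypothetical TF-IDF weights for one posting and hypothetical coefficients.
weights = {"wire": 0.5, "transfer": 0.4, "salary": 0.2, "team": 0.3}
coefs = {"wire": 2.0, "transfer": 1.5, "salary": -0.5, "team": -1.0}

top_tokens = token_contributions(weights, coefs)
for token, score in top_tokens:
    direction = "towards fraud" if score > 0 else "towards legitimate"
    print(f"{token}: {score:+.2f} ({direction})")
```

Positive scores push the prediction towards the fraud class and negative scores push it away, which is the kind of ranking a reviewer can scan quickly.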

⚙️ Smart Policy

Gray-zone policy routes low-confidence predictions to human review, optimizing the balance between automation and accuracy.

🎨 Interactive Dashboard

Real-time prediction interface with AI-powered chat assistant for fraud analysis, built with Next.js and modern UI components.

🐳 Production Ready

FastAPI REST API with Docker containerization, MLflow experiment tracking, and ONNX model export for scalable deployment.

✨ Key Features

🏗️ Ensemble Architecture

Calibrated Linear SVMs and Logistic Regression models
TF-IDF vectorization for text features
Engineered tabular features for job posting metadata
Weighted ensemble combining top-performing models
Isotonic calibration for reliable probability estimates
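The weighted-ensemble step can be pictured as a weighted average of each model's calibrated fraud probability. The sketch below is a minimal version under that assumption; the model names, weights, and the `weighted_ensemble` helper are hypothetical and not taken from the project code.

```python
from typing import Dict


def weighted_ensemble(probas: Dict[str, float], weights: Dict[str, float]) -> float:
    """Blend per-model calibrated probabilities with a normalized weighted mean."""
    total = sum(weights[name] for name in probas)
    if total <= 0:
        raise ValueError("ensemble weights must sum to a positive value")
    return sum(probas[name] * weights[name] for name in probas) / total


# Hypothetical calibrated probabilities and weights for two ensemble members.
model_probas = {"logreg": 0.80, "linear_svm": 0.60}
model_weights = {"logreg": 3.0, "linear_svm": 1.0}

blended = weighted_ensemble(model_probas, model_weights)
print(f"blended fraud probability: {blended:.2f}")
```

Normalizing by the weight total keeps the blended value in the [0, 1] range as long as every input probability is.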

🤖 Machine Learning Models

Classical Models: Logistic Regression, Linear SVM, XGBoost, LightGBM
Deep Learning: DistilBERT fine-tuning for transformer-based classification
Bayesian Hyperparameter Optimization via Optuna with TPE sampler
Automated tuning workflows with YAML configuration overrides
Model versioning and experiment tracking with MLflow
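One plausible way to apply tuned hyperparameters as configuration overrides is a recursive merge of the override mapping into the base configuration. The sketch below uses plain dictionaries in place of parsed YAML; the keys and the `apply_overrides` helper are hypothetical, and the project's actual merge rules may differ.

```python
from copy import deepcopy
from typing import Any, Dict


def apply_overrides(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of `base` with `overrides` merged in, key by key."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Merge nested sections instead of replacing them wholesale.
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical base config and a tuned override, as they might look once parsed.
base_config = {"model": {"type": "logreg", "C": 1.0}, "tuning": {"n_trials": 20}}
tuned_override = {"model": {"C": 0.25}}

final_config = apply_overrides(base_config, tuned_override)
print(final_config)
```

Returning a new dictionary leaves the base configuration untouched, so the same defaults can be reused across tuning runs.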

📊 Evaluation & Explainability

Comprehensive metrics: F1, Precision, Recall, ROC-AUC, PR-AUC, Brier Score
Calibration curves and Expected Calibration Error (ECE)
Token-level contribution analysis with SHAP-style explanations
Slice-based bias detection across job categories
Automated report generation with visualizations
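To make the threshold-dependent metrics concrete, the sketch below computes precision, recall, and F1 at a given cutoff and then picks the cutoff that maximizes F1 on a held-out set. It is a plain-Python illustration with made-up labels and scores, not the project's evaluation code.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    y_true: Sequence[int], y_prob: Sequence[float], threshold: float
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 when predicting fraud for prob >= threshold."""
    tp = fp = fn = 0
    for label, prob in zip(y_true, y_prob):
        predicted_fraud = prob >= threshold
        if predicted_fraud and label == 1:
            tp += 1
        elif predicted_fraud and label == 0:
            fp += 1
        elif not predicted_fraud and label == 1:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Try each observed score as a cutoff and keep the one with the highest F1."""
    candidates = sorted(set(y_prob))
    return max(candidates, key=lambda t: precision_recall_f1(y_true, y_prob, t)[2])


# Made-up validation labels (1 = fraud) and predicted probabilities.
val_labels = [0, 0, 1, 1, 0, 1]
val_probs = [0.10, 0.40, 0.35, 0.80, 0.20, 0.90]

chosen = best_f1_threshold(val_labels, val_probs)
precision, recall, f1 = precision_recall_f1(val_labels, val_probs, chosen)
print(f"threshold={chosen:.2f} precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```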

🚀 Deployment & API

FastAPI REST endpoints for single and batch predictions
Server-Sent Events (SSE) for streaming chat responses
AI chatbot with Gemini integration for natural language queries
Docker containerization for reproducible environments
ONNX model export for cross-platform compatibility
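For readers unfamiliar with Server-Sent Events, each streamed message is a `data:` line followed by a blank line. The sketch below frames text chunks that way. The JSON payload shape and the closing `[DONE]` sentinel are assumptions made for illustration; the project's `ChatStreamChunk` schema is not shown in this document.

```python
import json
from typing import Iterable, Iterator


def sse_events(chunks: Iterable[str]) -> Iterator[str]:
    """Wrap text chunks in the SSE wire format: 'data: <payload>' plus a blank line."""
    for text in chunks:
        # JSON-encoding keeps newlines inside a chunk from breaking the framing.
        payload = json.dumps({"text": text})
        yield f"data: {payload}\n\n"
    # Hypothetical end-of-stream marker so the client knows to stop listening.
    yield "data: [DONE]\n\n"


events = list(sse_events(["Hello", "world"]))
for event in events:
    print(repr(event))
```

A generator like this can be handed to a streaming response object in most Python web frameworks, which then flushes each event to the client as it is produced.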

⭐️ Performance Metrics

Our ensemble model reaches 85.4% precision on the test set, with well-calibrated probability estimates (test ECE: 0.0066).

Metrics are reported for both the validation set and the test set: F1 Score, Precision, Recall, ROC-AUC, PR-AUC, and Brier Score. See RESULTS.md for the values.

📈 Model Calibration

The model is well calibrated: its Expected Calibration Error (ECE) on the test set is only 0.0066, so predicted probabilities closely track observed fraud rates.
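For reference, ECE can be computed by binning predictions by probability and averaging the gap between mean predicted probability and observed fraud rate, weighted by bin size. The sketch below uses equal-width bins over the positive-class probability, which is one common variant and an assumption here. The Brier score is included for comparison, and the sample data is made up.

```python
from typing import List, Sequence, Tuple


def expected_calibration_error(
    y_true: Sequence[int], y_prob: Sequence[float], n_bins: int = 10
) -> float:
    """Size-weighted mean gap between predicted probability and observed rate per bin."""
    bins: List[List[Tuple[int, float]]] = [[] for _ in range(n_bins)]
    for label, prob in zip(y_true, y_prob):
        index = min(int(prob * n_bins), n_bins - 1)  # prob == 1.0 goes in the last bin
        bins[index].append((label, prob))
    total = len(y_true)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_prob = sum(prob for _, prob in members) / len(members)
        observed_rate = sum(label for label, _ in members) / len(members)
        ece += (len(members) / total) * abs(observed_rate - mean_prob)
    return ece


def brier_score(y_true: Sequence[int], y_prob: Sequence[float]) -> float:
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((prob - label) ** 2 for label, prob in zip(y_true, y_prob)) / len(y_true)


# Made-up labels and probabilities: every prediction is off by 0.25.
sample_labels = [0, 0, 1, 1]
sample_probs = [0.25, 0.25, 0.75, 0.75]

sample_ece = expected_calibration_error(sample_labels, sample_probs, n_bins=2)
sample_brier = brier_score(sample_labels, sample_probs)
print(f"ECE={sample_ece:.4f} Brier={sample_brier:.4f}")
```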

Decision Threshold: 0.5802 (optimized on validation F1)
Gray-Zone Width: 0.10 (for uncertain predictions)
Calibration Method: Isotonic Regression
Expected Calibration Error (Test): 0.0066
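A minimal sketch of how the threshold and gray zone might interact, assuming the gray zone is a band of total width 0.10 centered on the decision threshold. The centering and the label names `fraud`, `legit`, and `review` are assumptions made for illustration; the source does not spell them out.

```python
def decide(proba: float, threshold: float = 0.5802, gray_zone_width: float = 0.10) -> str:
    """Map a calibrated fraud probability to a decision, deferring uncertain cases."""
    half_width = gray_zone_width / 2
    # Assumption: the gray zone is centered on the decision threshold.
    if abs(proba - threshold) <= half_width:
        return "review"
    return "fraud" if proba > threshold else "legit"


# Probabilities near the threshold go to human review; the rest are automated.
decisions = {p: decide(p) for p in (0.10, 0.52, 0.54, 0.60, 0.64, 0.90)}
for proba, decision in decisions.items():
    print(f"{proba:.2f} -> {decision}")
```

Widening the gray zone sends more postings to reviewers and narrows the set of automated decisions, so the width is effectively a knob for trading reviewer workload against error rate.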

Figures: Precision-Recall Curve, Calibration Curve, Score Distribution, Confusion Matrix.

🏗️ System Architecture

A comprehensive end-to-end pipeline from data ingestion to production deployment, with modular components for training, evaluation, and inference.

High-Level System Overview

flowchart TD
    subgraph Offline["🔄 Offline Training Pipeline"]
        A1[📦 Raw Kaggle CSV] -->|Download| A2[📥 Data Ingestion]
        A2 --> A3[🔧 Preprocessing & Feature Engineering]
        A3 --> A4[🤖 Classical Models & Ensembles]
        A3 --> A5[🧠 DistilBERT Fine-tuning]
        A4 --> A6[📊 Calibration & Model Selection]
        A5 --> A6
        A6 -->|Persist| A7[💾 Artifacts & Reports]
    end
    subgraph Online["🌐 Online Serving"]
        B1[🚀 FastAPI Service<br/>/predict, /chat, /insights]
        B2[🔮 FraudPredictor]
        B3[📁 Model Artifacts<br/>Features & Pipelines]
        B4[🤖 Gemini AI API]
        B5[🧭 Chat Routing Layer]
    end
    A7 -->|Load Models| B3
    B3 --> B2 --> B1
    B1 -->|Invoke| B5
    B5 -->|Detect Job Posts| B2
    B5 -->|LLM Responses| B4
    subgraph Frontend["💻 Frontend & Registry"]
        C1[⚛️ Next.js Dashboard<br/>& Chat Assistant]
        C2[📊 MLflow Model Registry]
    end
    B1 <-->|REST + SSE| C1
    A6 -->|Register| C2
    C2 -->|pyfunc / ONNX| B1
    style Offline fill:#1e293b,stroke:#2563eb,stroke-width:2px
    style Online fill:#1e293b,stroke:#10b981,stroke-width:2px
    style Frontend fill:#1e293b,stroke:#f59e0b,stroke-width:2px

Training Flow Sequence

sequenceDiagram
    participant CLI as 🖥️ CLI (Typer)
    participant Config as ⚙️ Config Loader
    participant Data as 📊 Data Pipeline
    participant Features as 🔧 Feature Builder
    participant Models as 🤖 Model Trainers
    participant Eval as 📈 Evaluation Suite
    participant Persist as 💾 Artifact Writer
    CLI->>Config: load_config()
    CLI->>Data: load_raw_dataset()
    Data->>Data: preprocess_dataframe()
    Data->>Data: create_splits()
    Data->>Features: build_feature_bundle()
    Features->>Models: train_classical_models()
    Features->>Models: train_transformer_model()
    Models->>Eval: compute_metrics()
    Eval->>Persist: save artifacts + plots
    Persist->>CLI: append_run_record()

Chat & Inference Pipeline

sequenceDiagram
    participant UI as 💬 Next.js Chat UI
    participant API as 🚀 FastAPI /chat
    participant Cls as 🔍 Gemini Classifier
    participant Pred as 🔮 FraudPredictor
    participant LLM as 🤖 Gemini Assistant
    UI->>API: POST /chat (message, context, history)
    API->>Cls: classify message
    Cls-->>API: {is_job_posting, confidence, reason}
    alt Is Job Posting
        API->>Pred: predict(job_data)
        Pred-->>API: {proba, decision, explanations}
    end
    API->>LLM: stream response with context
    LLM-->>API: text chunks
    API-->>UI: SSE chunks (ChatStreamChunk)
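The routing step in the chat sequence can be sketched as follows: classify the message, run the fraud predictor only when it looks like a job posting, and pass both results on as context for the assistant. The classifier and predictor here are stand-in stubs; only the field names `is_job_posting`, `confidence`, `reason`, `proba`, `decision`, and `explanations` come from the diagram, and everything else is hypothetical.

```python
from typing import Any, Callable, Dict, Optional

Classifier = Callable[[str], Dict[str, Any]]
Predictor = Callable[[Dict[str, Any]], Dict[str, Any]]


def route_chat_message(
    message: str, classify: Classifier, predict: Predictor
) -> Dict[str, Optional[Dict[str, Any]]]:
    """Build the context for the assistant, predicting only for job postings."""
    verdict = classify(message)
    prediction = None
    if verdict.get("is_job_posting"):
        prediction = predict({"text": message})
    return {"classification": verdict, "prediction": prediction}


def stub_classifier(message: str) -> Dict[str, Any]:
    """Stand-in for the LLM classifier: a simple keyword check."""
    is_job = "hiring" in message.lower()
    return {"is_job_posting": is_job, "confidence": 0.9, "reason": "keyword stub"}


def stub_predictor(job_data: Dict[str, Any]) -> Dict[str, Any]:
    """Stand-in for FraudPredictor with a fixed, made-up result."""
    return {"proba": 0.91, "decision": "fraud", "explanations": []}


job_context = route_chat_message(
    "We are hiring! Pay a fee to start.", stub_classifier, stub_predictor
)
chat_context = route_chat_message("What is phishing?", stub_classifier, stub_predictor)
print(job_context["prediction"], chat_context["prediction"])
```

Passing the classifier and predictor in as callables keeps the routing logic testable without network access to an LLM.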

📂 Repository Structure

spot-the-scam/
├── 📦 src/spot_scam/       # Core Python package
│   ├── data/                # Data ingestion & preprocessing
│   ├── features/            # Feature engineering (TF-IDF, tabular)
│   ├── models/              # Classical, XGBoost, Transformer models
│   ├── tuning/              # Optuna hyperparameter optimization
│   ├── evaluation/          # Metrics, curves, calibration, reporting
│   ├── inference/           # FraudPredictor, gray-zone policy
│   └── api/                 # FastAPI endpoints & schemas
├── 🎨 frontend/             # Next.js dashboard + chat UI
├── 🔧 configs/              # YAML configuration files
├── 📜 scripts/              # CLI utilities (train, tune, API)
├── 💾 artifacts/            # Trained models, vectorizers, metadata
├── 📊 experiments/          # Generated reports, figures, tables
├── 🧪 tests/                # Unit tests
├── 🐳 docker/               # Dockerfiles & compose configs
└── 📖 docs/                 # Comprehensive documentation

🛠️ Technology Stack

🐍 Backend & ML

🐍 Python 3.9+, ⚡ FastAPI, 🔬 scikit-learn, 📊 XGBoost, ⚡ LightGBM, 🤗 Transformers, 🔥 PyTorch, 📈 MLflow, 🎯 Optuna, 🔄 ONNX, 🧮 NumPy, 🐼 Pandas, 📉 Matplotlib, 🎨 Seaborn, 📊 Plotly

⚛️ Frontend

⚛️ Next.js 14, 📘 TypeScript, 🎨 Tailwind CSS, 🧩 shadcn/ui, 🔄 SWR, 📝 ReactMarkdown, 🧮 KaTeX

🚀 DevOps & Infrastructure

🐳 Docker, 🔧 Docker Compose, ☁️ Vercel, 🔄 GitHub Actions, 🧪 pytest, 🎯 pre-commit, 📝 Black, 🔍 Ruff, 📊 Coverage

🤖 AI & LLM Integration

🧠 Google Gemini, 🤗 DistilBERT, 🔮 SHAP, 🍋 LIME, 📊 Imbalanced-learn

📚 Documentation

Comprehensive guides and resources for understanding, deploying, and extending the system.

📖 INFO.md

Project overview, feature summary, and quick start guide for getting up and running.

🏗️ ARCHITECTURE.md

Detailed system design, component breakdown, and architecture diagrams.

📋 INSTRUCTIONS.md

Setup instructions, environment configuration, and usage guidelines.

🔬 TRAINING_ANALYSIS.md

Training pipeline walkthrough, data analysis, and model development process.

📊 RESULTS.md

Performance metrics, evaluation results, and model comparison analysis.

➕ ADD_MODELS.md

Guide for integrating new models and extending the ensemble architecture.

🚀 DEVOPS_READINESS.md

Progressive delivery, CI/CD pipelines, and production operations guide.

🎯 Optuna Tuning

Hyperparameter optimization with Optuna, including TPE sampler and study analysis.

🚀 Getting Started

1. Clone the Repository

git clone https://github.com/hoangsonww/Spot-the-Scam-AI-Job-Fraud.git
cd Spot-the-Scam-AI-Job-Fraud

2. Set Up Environment

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

3. Download Data

python scripts/download_data.py

4. Train Models

python -m spot_scam.pipeline.train

5. Run API Server

python scripts/run_api.py

6. Launch Dashboard

cd frontend
npm install
npm run dev

🐳 Docker Quick Start

# Build and run with Docker Compose
docker-compose up --build

# Access the services
# API: http://localhost:8000
# Frontend: http://localhost:3000
# MLflow: http://localhost:5000

🌟 Live Demo & Screenshots

Explore the interactive dashboard and see the system in action.


⚠️ Important: The hosted demo on Vercel runs with sample data and a demo model, for exploration only. For real predictions, follow the Docker setup instructions to run the system locally.

Dashboard Screenshots: Main Dashboard Interface, Prediction Analysis View, AI Chat Assistant.

Optuna Dashboard: Optimization History, Parameter Importance, Parallel Coordinate Analysis.

🚀 Citation

BibTeX

@software{spot_the_scam_2025-2026,
  title = {Spot the Scam: Calibrated Job-Posting Fraud Detection},
  author = {Nguyen, Son},
  year = {2025-2026},
  version = {0.1.0},
  url = {https://github.com/hoangsonww/Spot-the-Scam-AI-Job-Fraud},
  note = {End-to-end ML pipeline for detecting fraudulent job postings with 
          classical and transformer models, calibrated policies, and 
          explainability tooling.}
}

APA

Nguyen, S. (2025-2026). Spot the Scam: Calibrated Job-Posting Fraud Detection (Version 0.1.0) [Computer software]. https://github.com/hoangsonww/Spot-the-Scam-AI-Job-Fraud

🤝 Contributing

We welcome contributions, whether bug fixes, new features, documentation improvements, or model enhancements.

🐛 Report Issues

Found a bug or have a feature request? Open an issue on GitHub with detailed information.

🔧 Submit Pull Requests

Fork the repository, make your changes, and submit a PR with a clear description of improvements.

📖 Improve Documentation

Help us make the docs better by fixing typos, adding examples, or clarifying explanations.

🧪 Add Tests

Increase code coverage by writing unit tests for existing features or new functionality.

⚖️ License

MIT License

This project is licensed under the MIT License. You are free to use, modify, and distribute this software for both commercial and non-commercial purposes. See the LICENSE file for details.