Spot the Scam

AI-Powered Job Fraud Detection with Ensemble Machine Learning

Protecting job-seekers from fraudulent postings using calibrated ML models achieving 85.4% precision and 77.2% F1 score with explainable predictions.

85.4% Precision | 98.6% ROC-AUC | Explainable AI | Production Ready

Project Overview

Spot the Scam is a machine learning system that protects job-seekers from fraudulent job postings. It combines classical ML models with a fine-tuned transformer (DistilBERT) in an ensemble, and adds calibrated probability estimates and explainable predictions.

Intelligent Detection

Uses an ensemble ML pipeline that combines Linear SVMs, Logistic Regression, XGBoost, LightGBM, and DistilBERT for robust fraud detection across diverse posting styles.

High Performance

Achieves 85.4% precision on test data with excellent calibration (ECE: 0.0066), ensuring reliable confidence estimates for critical decision-making.

Explainable Predictions

Token-level importance analysis with SHAP-style contribution rankings provides transparency and trust in model decisions.
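
For a linear model over text features, one common way to produce token-level rankings is to multiply each token's feature value by its learned coefficient. The sketch below illustrates that idea only; the coefficients are made up, plain term frequency stands in for TF-IDF, and `token_contributions` is a hypothetical helper rather than the project's explainer.

```python
from collections import Counter

# Hypothetical coefficients of a linear fraud classifier (positive = more fraud-like).
COEFS = {"wire": 1.8, "fee": 1.2, "urgent": 0.9, "salary": -0.4, "engineer": -1.1}


def token_contributions(text, coefs, top_k=3):
    """Rank tokens by |term frequency * coefficient| for one posting."""
    tokens = text.lower().split()
    counts = Counter(tokens)
    total = len(tokens) or 1
    # Term frequency stands in for the TF-IDF value to keep the sketch dependency-free.
    contribs = {tok: (n / total) * coefs[tok] for tok, n in counts.items() if tok in coefs}
    ranked = sorted(contribs.items(), key=lambda kv: abs(kv[1]), reverse=True)
    return ranked[:top_k]


ranking = token_contributions("urgent wire fee required before salary", COEFS)
for token, value in ranking:
    print(f"{token:>8s} {value:+.3f}")
```

Signed values keep the direction of each token's influence visible, while sorting by absolute value surfaces the strongest signals first.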

Smart Policy

Gray-zone policy routes low-confidence predictions to human review, optimizing the balance between automation and accuracy.

Interactive Dashboard

Real-time prediction interface with AI-powered chat assistant for fraud analysis, built with Next.js and modern UI components.

๐Ÿณ

Production Ready

FastAPI REST API with Docker containerization, MLflow experiment tracking, and ONNX model export for scalable deployment.

Key Features

๐Ÿ—๏ธ Ensemble Architecture

  • Calibrated Linear SVMs and Logistic Regression models
  • TF-IDF vectorization for text features
  • Engineered tabular features for job posting metadata
  • Weighted ensemble combining top-performing models
  • Isotonic calibration for reliable probability estimates
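
To make the isotonic calibration step concrete, here is a minimal, dependency-free sketch of the pool-adjacent-violators algorithm, which fits a non-decreasing mapping from raw model scores to probabilities. It is an illustration under simplifying assumptions (a step function with no interpolation, toy data); `fit_isotonic` and `calibrate` are hypothetical names, and in practice a library implementation such as scikit-learn's isotonic regression would normally be used.

```python
def fit_isotonic(scores, labels):
    """Fit a non-decreasing step function from raw scores to probabilities (PAVA)."""
    pairs = sorted(zip(scores, labels))
    # Each block holds [sum of labels, count, largest score in the block].
    blocks = []
    for score, label in pairs:
        blocks.append([float(label), 1, score])
        # Merge backwards while adjacent block means violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]:
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(score, steps):
    """Map a raw score to the value of the first block whose upper bound covers it."""
    for upper, prob in steps:
        if score <= upper:
            return prob
    return steps[-1][1]


# Toy validation scores (e.g. SVM margins) and fraud labels.
scores = [-2.0, -1.0, -0.5, 0.3, 0.8, 1.5]
labels = [0, 0, 1, 0, 1, 1]
steps = fit_isotonic(scores, labels)
print(steps)
print(calibrate(0.0, steps))
```

The out-of-order pair at scores -0.5 and 0.3 is pooled into a single block, so the fitted mapping never decreases as the raw score grows.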

Machine Learning Models

  • Classical Models: Logistic Regression, Linear SVM, XGBoost, LightGBM
  • Deep Learning: DistilBERT fine-tuning for transformer-based classification
  • Bayesian Hyperparameter Optimization via Optuna with TPE sampler
  • Automated tuning workflows with YAML configuration overrides
  • Model versioning and experiment tracking with MLflow
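
One way configuration overrides of this kind can work is a recursive merge of an override mapping into a base configuration. The sketch below operates on plain dictionaries, which is what a YAML loader would return; the keys and the `apply_overrides` helper are hypothetical and not taken from the project's config files.

```python
import copy


def apply_overrides(base, overrides):
    """Return a new config where nested override keys replace matching base keys."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)  # merge nested sections
        else:
            merged[key] = copy.deepcopy(value)  # scalars and lists are replaced outright
    return merged


# Hypothetical base config and tuning override.
base_config = {
    "model": {"type": "linear_svm", "C": 1.0},
    "tuning": {"n_trials": 20, "sampler": "tpe"},
}
override = {"model": {"C": 0.25}, "tuning": {"n_trials": 50}}

config = apply_overrides(base_config, override)
print(config)
```

Returning a fresh dictionary leaves the base configuration untouched, so several tuning runs can derive from the same defaults.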

Evaluation & Explainability

  • Comprehensive metrics: F1, Precision, Recall, ROC-AUC, PR-AUC, Brier Score
  • Calibration curves and Expected Calibration Error (ECE)
  • Token-level contribution analysis with SHAP-style explanations
  • Slice-based bias detection across job categories
  • Automated report generation with visualizations
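
Two of the calibration metrics listed above are simple enough to sketch directly. The version below assumes binary labels, positive-class probabilities, and equal-width bins; the project's own binning scheme and bin count are not stated, so treat those choices as assumptions.

```python
def brier_score(probs, labels):
    """Mean squared difference between predicted probability and the 0/1 label."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(probs, labels, n_bins=10):
    """Weighted mean gap between average predicted probability and observed rate per bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, labels):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    ece = 0.0
    for bucket in bins:
        if not bucket:
            continue
        avg_prob = sum(p for p, _ in bucket) / len(bucket)
        frac_pos = sum(y for _, y in bucket) / len(bucket)
        ece += (len(bucket) / len(probs)) * abs(avg_prob - frac_pos)
    return ece


# Toy predictions: slightly under-confident at both ends.
probs = [0.1, 0.1, 0.9, 0.9]
labels = [0, 0, 1, 1]
brier = brier_score(probs, labels)
ece = expected_calibration_error(probs, labels)
print(f"Brier={brier:.4f} ECE={ece:.4f}")
```

Lower is better for both metrics: the Brier score penalizes each individual prediction, while ECE only looks at how well the average probability in each bin matches the observed fraud rate.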

Deployment & API

  • FastAPI REST endpoints for single and batch predictions
  • Server-Sent Events (SSE) for streaming chat responses
  • AI chatbot with Gemini integration for natural language queries
  • Docker containerization for reproducible environments
  • ONNX model export for cross-platform compatibility
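
As an illustration of the streaming format, the sketch below serializes chat chunks as Server-Sent Events frames: a `data:` line followed by a blank line. The payload fields (`text`, `done`) are assumptions, since the project's `ChatStreamChunk` schema is not shown here, and `sse_event` and `stream_chat` are hypothetical helpers.

```python
import json


def sse_event(payload, event=None):
    """Serialize one SSE frame: optional event name, a data line, then a blank line."""
    lines = []
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps emits a single line by default, so one data field is enough.
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"


def stream_chat(chunks):
    """Yield an SSE frame per text chunk, then a final frame marking completion."""
    for text in chunks:
        yield sse_event({"text": text, "done": False})
    yield sse_event({"text": "", "done": True})


frames = list(stream_chat(["This posting ", "looks suspicious."]))
print("".join(frames), end="")
```

In a FastAPI service, a generator like this is typically returned through `StreamingResponse` with the `text/event-stream` media type so that the browser can consume it incrementally.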

Performance Metrics

The ensemble model reaches 85.4% precision on the test set, with well-calibrated probability estimates (ECE of 0.0066).

Validation Set

F1 Score, Precision, Recall, ROC-AUC, PR-AUC, and Brier Score are reported for this split.

Test Set

The same six metrics are reported for this split; precision is 0.854.

Model Calibration

The model is well calibrated: its Expected Calibration Error (ECE) on the test set is 0.0066, so predicted probabilities closely track observed fraud rates.

  • Decision Threshold: 0.5802 (optimized on validation F1)
  • Gray-Zone Width: 0.10 (for uncertain predictions)
  • Calibration Method: Isotonic Regression
  • Expected Calibration Error (Test): 0.0066
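
The threshold and gray-zone settings above can be read as a small decision policy. The sketch below shows one plausible interpretation: pick the threshold that maximizes F1 on validation data, then route any probability within half the gray-zone width of that threshold to human review. Centering the zone on the threshold, the label names, and both helper functions are assumptions made for illustration.

```python
def f1_at(threshold, probs, labels):
    """F1 score when every probability >= threshold is flagged as fraud."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_threshold(probs, labels):
    """Pick the candidate threshold (each distinct score) with the highest F1."""
    return max(sorted(set(probs)), key=lambda t: f1_at(t, probs, labels))


def decide(prob, threshold=0.5802, gray_width=0.10):
    """Assumption: the gray zone is centered on the threshold (+/- gray_width / 2)."""
    if abs(prob - threshold) <= gray_width / 2:
        return "review"  # low-confidence prediction, send to a human
    return "fraud" if prob > threshold else "legit"


# Toy validation data for the threshold sweep.
val_probs = [0.05, 0.20, 0.55, 0.62, 0.90]
val_labels = [0, 0, 0, 1, 1]
threshold = best_threshold(val_probs, val_labels)
decisions = [decide(p) for p in (0.10, 0.60, 0.97)]
print(threshold, decisions)
```

Widening the gray zone sends more postings to reviewers and reduces automated mistakes near the threshold, so the width is effectively a workload-versus-risk setting.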

Figures: Precision-Recall Curve, Calibration Curve, Score Distribution, Confusion Matrix.

๐Ÿ—๏ธ System Architecture

A comprehensive end-to-end pipeline from data ingestion to production deployment, with modular components for training, evaluation, and inference.

High-Level System Overview

flowchart TD
    subgraph Offline["Offline Training Pipeline"]
        A1[Raw Kaggle CSV] -->|Download| A2[Data Ingestion]
        A2 --> A3[Preprocessing & Feature Engineering]
        A3 --> A4[Classical Models & Ensembles]
        A3 --> A5[DistilBERT Fine-tuning]
        A4 --> A6[Calibration & Model Selection]
        A5 --> A6
        A6 -->|Persist| A7[Artifacts & Reports]
    end

    subgraph Online["Online Serving"]
        B1["FastAPI Service<br/>/predict, /chat, /insights"]
        B2[FraudPredictor]
        B3["Model Artifacts<br/>Features & Pipelines"]
        B4[Gemini AI API]
        B5[Chat Routing Layer]
    end

    A7 -->|Load Models| B3
    B3 --> B2 --> B1
    B1 -->|Invoke| B5
    B5 -->|Detect Job Posts| B2
    B5 -->|LLM Responses| B4

    subgraph Frontend["Frontend & Registry"]
        C1["Next.js Dashboard<br/>& Chat Assistant"]
        C2[MLflow Model Registry]
    end

    B1 <-->|REST + SSE| C1
    A6 -->|Register| C2
    C2 -->|pyfunc / ONNX| B1

    style Offline fill:#1e293b,stroke:#2563eb,stroke-width:2px
    style Online fill:#1e293b,stroke:#10b981,stroke-width:2px
    style Frontend fill:#1e293b,stroke:#f59e0b,stroke-width:2px

Training Flow Sequence

sequenceDiagram
    participant CLI as CLI (Typer)
    participant Config as Config Loader
    participant Data as Data Pipeline
    participant Features as Feature Builder
    participant Models as Model Trainers
    participant Eval as Evaluation Suite
    participant Persist as Artifact Writer

    CLI->>Config: load_config()
    CLI->>Data: load_raw_dataset()
    Data->>Data: preprocess_dataframe()
    Data->>Data: create_splits()
    Data->>Features: build_feature_bundle()
    Features->>Models: train_classical_models()
    Features->>Models: train_transformer_model()
    Models->>Eval: compute_metrics()
    Eval->>Persist: save artifacts + plots
    Persist->>CLI: append_run_record()

Chat & Inference Pipeline

sequenceDiagram
    participant UI as Next.js Chat UI
    participant API as FastAPI /chat
    participant Cls as Gemini Classifier
    participant Pred as FraudPredictor
    participant LLM as Gemini Assistant

    UI->>API: POST /chat (message, context, history)
    API->>Cls: classify message
    Cls-->>API: {is_job_posting, confidence, reason}
    alt Is Job Posting
        API->>Pred: predict(job_data)
        Pred-->>API: {proba, decision, explanations}
    end
    API->>LLM: stream response with context
    LLM-->>API: text chunks
    API-->>UI: SSE chunks (ChatStreamChunk)

Repository Structure

spot-the-scam/
├── src/spot_scam/           # Core Python package
│   ├── data/                # Data ingestion & preprocessing
│   ├── features/            # Feature engineering (TF-IDF, tabular)
│   ├── models/              # Classical, XGBoost, Transformer models
│   ├── tuning/              # Optuna hyperparameter optimization
│   ├── evaluation/          # Metrics, curves, calibration, reporting
│   ├── inference/           # FraudPredictor, gray-zone policy
│   └── api/                 # FastAPI endpoints & schemas
├── frontend/                # Next.js dashboard + chat UI
├── configs/                 # YAML configuration files
├── scripts/                 # CLI utilities (train, tune, API)
├── artifacts/               # Trained models, vectorizers, metadata
├── experiments/             # Generated reports, figures, tables
├── tests/                   # Unit tests
├── docker/                  # Dockerfiles & compose configs
└── docs/                    # Comprehensive documentation

Technology Stack

Backend & ML

Python 3.9+, FastAPI, scikit-learn, XGBoost, LightGBM, Transformers, PyTorch, MLflow, Optuna, ONNX, NumPy, Pandas, Matplotlib, Seaborn, Plotly

Frontend

Next.js 14, TypeScript, Tailwind CSS, shadcn/ui, SWR, ReactMarkdown, KaTeX

DevOps & Infrastructure

Docker, Docker Compose, Vercel, GitHub Actions, pytest, pre-commit, Black, Ruff, Coverage

AI & LLM Integration

Google Gemini, DistilBERT, SHAP, LIME, Imbalanced-learn

Documentation

Comprehensive guides and resources for understanding, deploying, and extending the system.

Getting Started

1. Clone the Repository

git clone https://github.com/hoangsonww/Spot-the-Scam-AI-Job-Fraud.git
cd Spot-the-Scam-AI-Job-Fraud

2. Set Up Environment

python -m venv .venv
source .venv/bin/activate  # On Windows: .venv\Scripts\activate
pip install -e .

3. Download Data

python scripts/download_data.py

4. Train Models

python -m spot_scam.pipeline.train

5. Run API Server

python scripts/run_api.py

6. Launch Dashboard

cd frontend
npm install
npm run dev

๐Ÿณ Docker Quick Start

# Build and run with Docker Compose
docker-compose up --build

# Access the services
# API: http://localhost:8000
# Frontend: http://localhost:3000
# MLflow: http://localhost:5000

Live Demo & Screenshots

Explore the interactive dashboard and see the system in action.

Important: The hosted demo on Vercel runs with fake data and a demo model, for exploration only. For real predictions, follow the Docker setup instructions and run the system locally.

Dashboard Screenshots

  • Main Dashboard
  • Prediction Analysis
  • Chat Assistant

Optuna Dashboard

  • Optimization History
  • Parameter Importance
  • Parallel Coordinate Plot

Citation

BibTeX

@software{spot_the_scam_2025-2026,
  title = {Spot the Scam: Calibrated Job-Posting Fraud Detection},
  author = {Nguyen, Son},
  year = {2025-2026},
  version = {0.1.0},
  url = {https://github.com/hoangsonww/Spot-the-Scam-AI-Job-Fraud},
  note = {End-to-end ML pipeline for detecting fraudulent job postings with 
          classical and transformer models, calibrated policies, and 
          explainability tooling.}
}

APA

Nguyen, S. (2025-2026). Spot the Scam: Calibrated Job-Posting Fraud Detection (Version 0.1.0) [Computer software]. https://github.com/hoangsonww/Spot-the-Scam-AI-Job-Fraud

Contributing

We welcome contributions of all kinds: bug fixes, new features, documentation improvements, and model enhancements.

๐Ÿ›

Report Issues

Found a bug or have a feature request? Open an issue on GitHub with detailed information.

Submit Pull Requests

Fork the repository, make your changes, and submit a PR with a clear description of improvements.

Improve Documentation

Help us make the docs better by fixing typos, adding examples, or clarifying explanations.

Add Tests

Increase code coverage by writing unit tests for existing features or new functionality.

License

MIT License

This project is licensed under the MIT License. You are free to use, modify, and distribute this software for both commercial and non-commercial purposes. See the LICENSE file for details.