AI / ML

Toxicity Mining

NLP pipeline classifying hate speech and toxicity across 1.8M+ social media records using classical ML and transformer-based models, with DistilBERT fine-tuned end-to-end across corpora.

2026

Python, BERT, DistilBERT, LightGBM, TF-IDF


Overview

A machine learning project that categorizes toxic language and hate speech on social media platforms, distinguishing between generic offensive language and hate speech targeted at specific entities, users, or groups. It uses BERT and DistilBERT transformer models alongside traditional classifiers: Naive Bayes, Logistic Regression, SVM, and LightGBM. The models are trained on the Google Jigsaw Civil Comments (1.8M rows) and TweetEval Hate Speech datasets.

Problem Statement & Approach

Social platforms need to separate generic offensive language from hate speech aimed at specific people or groups, a distinction that generic toxicity filters routinely miss. It is also a hard distinction to learn, because toxic vocabulary differs sharply from platform to platform: the two corpora have only 2.6% vocabulary overlap.
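The write-up does not say how the 2.6% overlap was measured. As one plausible reading, the sketch below computes a Jaccard overlap between the word types of two corpora. The tokenizer, the helper names `vocabulary` and `vocab_overlap`, and the toy sentences are all illustrative assumptions, not the project's code.

```python
import re

def vocabulary(texts):
    """Collect the lowercased word types that appear in a list of texts."""
    vocab = set()
    for text in texts:
        vocab.update(re.findall(r"[a-z']+", text.lower()))
    return vocab

def vocab_overlap(texts_a, texts_b):
    """Jaccard overlap: shared word types divided by all word types."""
    a, b = vocabulary(texts_a), vocabulary(texts_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Toy stand-ins for forum comments and tweets (made up for illustration).
forum = ["This argument is nonsense", "What a thoughtful reply"]
tweets = ["ur argument is trash lol", "smh what a clown"]

overlap = vocab_overlap(forum, tweets)
print(f"vocabulary overlap: {overlap:.1%}")
```

A low value on real data would mean that a model fitted to one platform sees mostly unfamiliar tokens on the other.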

Approach: Run two preprocessing tracks over the combined corpus (TF-IDF with stopword removal for classical models, minimal normalization for transformers), then compare a classical ML baseline against a DistilBERT model fine-tuned end-to-end on the balanced Jigsaw + TweetEval set.
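The two preprocessing tracks can be sketched as a pair of small functions. This is a minimal illustration that assumes a short stopword list and a URL placeholder. The project's actual cleaning rules are not given, so the names `preprocess_classical` and `preprocess_transformer` and every rule below are hypothetical.

```python
import re

# Tiny illustrative stopword list; a real pipeline would likely use a fuller one.
STOPWORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "you", "so"}

URL_RE = re.compile(r"https?://\S+")
WS_RE = re.compile(r"\s+")

def preprocess_classical(text):
    """Aggressive cleaning for TF-IDF: lowercase, drop URLs, punctuation, stopwords."""
    text = URL_RE.sub(" ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return " ".join(t for t in tokens if t not in STOPWORDS)

def preprocess_transformer(text):
    """Minimal normalization: mask URLs and collapse whitespace, keep case and punctuation."""
    text = URL_RE.sub("[URL]", text)
    return WS_RE.sub(" ", text).strip()

sample = "You are SO   annoying!!! see https://example.com"
classical = preprocess_classical(sample)
minimal = preprocess_transformer(sample)
print(classical)  # annoying see
print(minimal)    # You are SO annoying!!! see [URL]
```

The transformer track keeps casing and punctuation because a subword tokenizer can use them as signal, while bag-of-words features gain little from them.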

System Architecture

The pipeline spans 1.8M+ records from Google Jigsaw Civil Comments and TweetEval. Text flows through dual preprocessing tracks into two modeling paths: a classical track (TF-IDF → TruncatedSVD/LSA → clustering and linear/tree classifiers) and a transformer track (minimal normalization → DistilBERT fine-tuning), with evaluation via precision, recall, F1, and confusion-matrix analysis across both corpora.
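The classical track described above maps naturally onto a scikit-learn pipeline. The sketch below assumes Logistic Regression as the final classifier and uses a six-sentence toy corpus. The function name `build_classical_pipeline`, the hyperparameters, and the data are illustrative, and the demo shrinks the LSA dimension because the toy vocabulary is far smaller than 200 terms.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def build_classical_pipeline(n_components=200, seed=0):
    """TF-IDF -> TruncatedSVD (LSA) -> linear classifier."""
    return make_pipeline(
        TfidfVectorizer(stop_words="english", sublinear_tf=True),
        TruncatedSVD(n_components=n_components, random_state=seed),
        LogisticRegression(max_iter=1000),
    )

# Toy corpus (made up); 1 = toxic, 0 = non-toxic.
texts = [
    "you are a worthless idiot",
    "what a stupid pathetic take",
    "shut up you moron",
    "thanks for the helpful explanation",
    "great point, well argued",
    "lovely weather in the city today",
]
labels = [1, 1, 1, 0, 0, 0]

# n_components must not exceed the vocabulary size, so the demo uses 2, not 200.
clf = build_classical_pipeline(n_components=2)
clf.fit(texts, labels)
preds = clf.predict(texts)
print(list(preds))
```

Swapping the last step for another estimator (for example LightGBM or a linear SVM) leaves the feature extraction unchanged, which keeps baseline comparisons like-for-like.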

Key Features

Distinguishes generic offensive language from targeted hate speech
Trained on Google Jigsaw Civil Comments (1.8M rows) and TweetEval
BERT and DistilBERT transformers with Naive Bayes, Logistic Regression, SVM, and LightGBM baselines
Dual preprocessing: TF-IDF + stopword removal and minimal normalization
TF-IDF features reduced to 200 dims via TruncatedSVD (LSA) for classical modeling
DistilBERT fine-tuned end-to-end on the combined, balanced Jigsaw + TweetEval corpus

Technical Stack

Data

Google Jigsaw Civil Comments (1.8M rows), TweetEval Hate Speech

Classical ML

scikit-learn, TF-IDF, TruncatedSVD (LSA), KMeans, Naive Bayes, Logistic Regression, SVM, LightGBM

Deep Learning

PyTorch, Hugging Face Transformers, BERT, DistilBERT

Website

TypeScript

Deployment

The results and write-up are published as a static site on GitHub Pages.

Challenges & Solutions

Challenge: Only 2.6% vocabulary overlap between the forum-comment and tweet corpora, and a staged Jigsaw→Twitter fine-tune stalled around 67% accuracy.

Solution: Switched to direct end-to-end DistilBERT training on the combined, balanced corpus with consistent preprocessing, which trained stably across both sources.
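To make "end-to-end" concrete, the sketch below runs one optimisation step in which every DistilBERT weight, encoder and classification head alike, receives gradients from a single mixed batch. It is a hedged illustration, not the project's training script. So that it runs offline, it builds a tiny randomly initialised model from a `DistilBertConfig`. Real fine-tuning would instead load a pretrained checkpoint with `from_pretrained`. The `train_step` helper, the learning rate, and the random token IDs are assumptions.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification

def train_step(model, batch, optimizer):
    """One gradient step; no layers are frozen, so all weights are updated."""
    model.train()
    optimizer.zero_grad()
    output = model(**batch)  # supplying labels makes the model return a loss
    output.loss.backward()
    optimizer.step()
    return output.loss.item()

torch.manual_seed(0)

# Tiny random config for an offline demo (not a pretrained checkpoint).
config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64,
    max_position_embeddings=16, num_labels=2,
)
model = DistilBertForSequenceClassification(config)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

# Stand-in for a tokenized batch mixing forum comments and tweets.
batch = {
    "input_ids": torch.randint(1, 100, (4, 16)),
    "attention_mask": torch.ones(4, 16, dtype=torch.long),
    "labels": torch.tensor([0, 1, 0, 1]),
}
loss = train_step(model, batch, optimizer)
print(f"loss after one step: {loss:.4f}")
```

Shuffling both sources into the same batches, rather than training on one corpus and then the other, is the difference between this setup and the staged fine-tune described in the challenge.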

Challenge: The Jigsaw corpus was 91.7% non-toxic, making raw accuracy a misleading metric.

Solution: Balanced the combined dataset and evaluated with precision, recall, F1, and confusion matrices instead of accuracy.
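A short worked example shows why accuracy misleads at this class skew. The sketch assumes 1,000 comments with the same 91.7% non-toxic share and a degenerate classifier that always predicts "non-toxic". The helper names are illustrative.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn

def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1, defined as 0.0 when a denominator is zero."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# 917 non-toxic (0) and 83 toxic (1) comments, mirroring the 91.7% skew.
y_true = [0] * 917 + [1] * 83
y_pred = [0] * 1000  # a classifier that never flags anything

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(accuracy, precision, recall, f1)  # 0.917 0.0 0.0 0.0
```

The do-nothing classifier scores 91.7% accuracy while catching no toxic comments at all, which recall and F1 expose immediately.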

Challenge: KMeans on the TF-IDF/LSA features clustered poorly (silhouette ≈ 0.10), indicating that toxic and non-toxic text overlap heavily in that linear feature space.

Solution: Used that finding to justify moving from classical clustering to fine-tuned DistilBERT for the classification task.
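The clustering check behind that finding can be approximated as follows. This is a sketch under assumptions: two clusters, default KMeans settings, and a made-up toy corpus with a reduced LSA dimension. The function name `lsa_silhouette` is hypothetical, and the score it prints says nothing about the project's reported 0.10.

```python
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import silhouette_score

def lsa_silhouette(texts, n_components=200, n_clusters=2, seed=0):
    """Cluster LSA-reduced TF-IDF vectors with KMeans and score the separation."""
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(texts)
    lsa = TruncatedSVD(n_components=n_components, random_state=seed).fit_transform(tfidf)
    cluster_ids = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed).fit_predict(lsa)
    return silhouette_score(lsa, cluster_ids)

# Toy corpus (made up); the demo uses 2 LSA dimensions instead of 200.
texts = [
    "you are a worthless idiot",
    "what a stupid pathetic take",
    "shut up you moron",
    "thanks for the helpful explanation",
    "great point, well argued",
    "lovely weather in the city today",
]
score = lsa_silhouette(texts, n_components=2)
print(f"silhouette: {score:.2f}")
```

Silhouette scores range from -1 to 1, and values near 0 indicate clusters that overlap rather than separate cleanly.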

Improvements

Multi-class targeting to identify which group a hate-speech instance is aimed at
A real-time inference API for live moderation
Multilingual toxicity detection beyond English

NLP, BERT, Hate Speech Detection, Machine Learning
