Deep Learning on Noisy & Imbalanced Data

MSc thesis at Imperial College London and ACM CCS 2025 publication on robust deep learning under severe label noise.

Training frameworks that boost macro F1 from 74.5% to 96.0% on noisy cybersecurity benchmarks.

This thesis developed novel training strategies for learning from data whose labels are partly machine-generated, inconsistent, or expensive to verify — a common challenge in cybersecurity, medical, and industrial settings. The work led to a peer-reviewed publication at ACM CCS 2025 and won the Corporate Partnership Programme Individual Project Prize for best MSc thesis at Imperial College London.

Publication

Deep Learning for Imperfectly Labeled Malware Data — F. Alotaibi, E. Goodbrand, S. Maffeis. ACM CCS 2025.

Key Contributions

  • SLB framework — dynamic clean/noisy set partitioning with pseudo-labelling for robust training under severe label noise.
  • MIMICRY — synthetic label noise injection for controlled evaluation of noise-robust methods.
  • CLEAN-STOP & SENTINEL — ensemble-based data cleaning strategies that improve robustness under combined label noise, feature noise, and class imbalance.
  • 74.5% → 96.0% macro F1 on noisy malware classification benchmarks, demonstrating scalable training under weak and imperfect supervision.
  • Cross-domain validation — evaluated on both malware and network intrusion detection datasets, confirming generality of the approach.
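To make the bullets above concrete, here is a minimal sketch of the two generic ideas the frameworks build on: injecting symmetric label noise for controlled evaluation (as MIMICRY does), and partitioning training samples into clean/noisy sets by per-sample loss (the "small-loss" heuristic underlying approaches like SLB). This is illustrative only — the function names and details are assumptions, not the thesis's actual implementation.

```python
import numpy as np

def inject_label_noise(labels, num_classes, noise_rate, rng):
    """Symmetric noise: flip a fixed fraction of labels to a
    uniformly chosen *different* class, and return which
    indices were flipped (the ground-truth noise mask)."""
    noisy = labels.copy()
    n = len(labels)
    flip_idx = rng.choice(n, size=int(noise_rate * n), replace=False)
    for i in flip_idx:
        # choose any class except the current one
        others = [c for c in range(num_classes) if c != noisy[i]]
        noisy[i] = rng.choice(others)
    return noisy, flip_idx

def small_loss_partition(losses, clean_fraction):
    """Small-loss heuristic: deep nets tend to fit clean labels
    before noisy ones, so the lowest-loss samples are treated as
    the 'clean' set and the rest as the 'noisy' set."""
    k = int(clean_fraction * len(losses))
    order = np.argsort(losses)
    return order[:k], order[k:]

# Example: corrupt 30% of labels, then partition by loss.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=200)
noisy_labels, flipped = inject_label_noise(labels, 5, 0.3, rng)

per_sample_loss = rng.uniform(0, 1, size=200)  # stand-in for model losses
clean_set, noisy_set = small_loss_partition(per_sample_loss, 0.7)
```

In a full pipeline the noisy set would not be discarded: methods in this family typically pseudo-label it with the model's own predictions, which is where the robustness gains under severe noise come from.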