MSc Thesis · Royal Holloway
// academic research · MSc Information Security

Intelligent Financial Fraud Detection
Using a Hybrid Approach

MSc thesis exploring a hybrid machine learning model combining Neural Networks, Random Forest, and XGBoost for detecting fraudulent financial transactions — submitted as part of the MSc in Information Security at Royal Holloway, University of London.

Royal Holloway, University of London  ·  MSc Information Security  ·  August 2023
Neural Networks
XGBoost
Random Forest
Fraud Detection
Machine Learning

00 Executive Summary

This research introduces a hybrid machine learning model that integrates three distinct algorithms — Neural Networks, Random Forest, and XGBoost — to address the growing challenge of financial fraud detection in digital transactions.

Neural Networks excel at identifying complex patterns in high-dimensional data. Random Forest provides robustness and stability through ensemble averaging. XGBoost contributes predictive accuracy and efficient handling of imbalanced datasets — a frequent challenge in fraud scenarios where fraudulent transactions are rare.

NN · Neural Networks: Captures complex non-linear patterns and high-dimensional relationships in transaction data.
RF · Random Forest: Ensemble of decision trees reducing overfitting and providing stable, consistent predictions.
XGB · XGBoost: Gradient boosting with exceptional accuracy on imbalanced datasets and efficient sparse data handling.
Key outcome: The hybrid approach aims to deliver a detection system that is highly accurate, adapts to evolving fraud tactics, and screens transactions in real time, using continuous learning to stay effective as fraud patterns change.

01 Introduction

The digital revolution has transformed transactions into an integral part of daily life. While online commerce and digital banking offer unprecedented convenience, they have simultaneously created new vectors for financial fraud. As fraudsters deploy increasingly sophisticated techniques, traditional rule-based detection systems are proving inadequate.

Background & Context

The surge in online transactions globally — driven by e-commerce growth, digital banking adoption, and the globalization of services — has been paralleled by an escalating sophistication in fraudulent schemes. Criminals exploit vulnerabilities in detection systems, leading to significant financial losses and undermining consumer confidence.

Problem Statement

Traditional rule-based fraud detection systems frequently falter when faced with novel, sophisticated fraud patterns. The challenge is twofold: detecting fraudulent transactions with high accuracy, and doing so in real time to ensure a seamless user experience. Machine learning offers a solution, particularly a hybrid model that combines the strengths of Neural Networks, Random Forest, and XGBoost.

Significance of the Study

Financial institutions, e-commerce platforms, and payment processors stand to benefit enormously from an effective hybrid fraud detection system. The research also demonstrates the broader potential of combining multiple machine learning algorithms to address complex, real-world challenges.

Motivation & Objectives

The project aims to design, develop and evaluate a hybrid fraud detection model that leverages the complementary strengths of three powerful ML algorithms. The system is designed for real-time deployment with continuous learning capabilities to adapt to emerging fraud patterns.

02 Literature Review

A comprehensive review of existing research establishes the theoretical foundation and identifies gaps in current fraud detection methodologies:

2.1 Financial Fraud Detection

Delamaire, Abdou & Pointon (2009) examine multiple categories of credit card fraud — bankruptcy, counterfeit, theft, application, and behavioural fraud — and evaluate pair-wise matching, decision trees, clustering, neural networks, and evolutionary algorithms. The study highlights the ethical dilemmas in cases where fraud identification costs exceed the financial benefit.

2.2 Neural Network Approaches (GANN)

Research into Genetic Algorithm Neural Networks (GANN) demonstrates the potential of combining neural networks with genetic algorithms for parameter optimization in fraud detection. The work highlights the challenge of real-time detection given the vast volume of credit card transactions and the relative rarity of fraud events.

2.3 XGBoost for Financial Fraud

Lei, Xu, Huang & Sha propose a distinctive system for e-commerce merchants combining manual and automatic classification with XGBoost. Their hybrid methodology minimizes false positives through human oversight while leveraging XGBoost's capability to handle imbalanced datasets, in which genuine transactions vastly outnumber fraudulent ones.

2.4 Random Forest for Fraud Detection

Research utilizing two variants of Random Forest on e-commerce transaction data from China demonstrates the algorithm's robustness in handling large datasets and its inherent ability to manage class imbalance. Comparative analysis evaluates base classifier variations for optimal fraud detection.

2.5 Resampling Techniques

Udeze, Eteng & Ibor evaluate four sampling techniques (baseline split, class-weighted hyperparameter, undersampling, oversampling) combined with Random Forest, XGBoost, and a TensorFlow DNN. Their results show that the DNN outperforms the other models on undersampled data, while all three perform well on oversampled datasets.

2.6 Comparative ML Review

A rigorous comparative analysis of logistic regression, decision trees, KNN, and XGBoost across accuracy, precision, recall, and F1-score establishes XGBoost as the top performer for computational speed and overall metrics. The study suggests that fine-tuning or combining algorithms could yield more robust systems.

03 Theoretical Exploration

3.1 Neural Networks

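A neural network is a machine learning method inspired by the structure of the human brain. It is organised into an input layer, one or more hidden layers, and an output layer. Nodes in adjacent layers are joined by weighted connections, and each node transforms the signals it receives. During training, backpropagation computes how the loss changes with each weight so that the weights can be adjusted iteratively.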
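The layered structure and the role of backpropagation can be sketched in a few lines. This is a minimal illustration in plain Python, assuming one hidden layer with two nodes, sigmoid activations, and a single made-up training example; the weights, features, and learning rate are hypothetical and are not taken from the thesis.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, params):
    """Input layer -> one sigmoid hidden layer -> single sigmoid output node."""
    hidden = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(params["W1"], params["b1"])]
    output = sigmoid(sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"])
    return hidden, output

def cross_entropy(y, p):
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def backprop_step(x, y, params, lr=0.5):
    """One gradient-descent update on a single example (mutates params)."""
    hidden, p = forward(x, params)
    delta_out = p - y  # gradient of the loss w.r.t. the output pre-activation
    # Propagate the error back to each hidden node before w2 is changed.
    delta_hidden = [delta_out * w * h * (1 - h) for w, h in zip(params["w2"], hidden)]
    params["w2"] = [w - lr * delta_out * h for w, h in zip(params["w2"], hidden)]
    params["b2"] -= lr * delta_out
    for j, d in enumerate(delta_hidden):
        params["W1"][j] = [w - lr * d * xi for w, xi in zip(params["W1"][j], x)]
        params["b1"][j] -= lr * d

# Hypothetical starting weights and one hypothetical "fraud" example.
params = {"W1": [[0.1, -0.2], [0.3, 0.4]], "b1": [0.0, 0.0], "w2": [0.5, -0.5], "b2": 0.0}
x, y = [1.0, 2.0], 1
loss_before = cross_entropy(y, forward(x, params)[1])
backprop_step(x, y, params)
loss_after = cross_entropy(y, forward(x, params)[1])
```

After the single update, `loss_after` is lower than `loss_before` for this example, which is the iterative adjustment the paragraph describes.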

Key architectures include FNNs (general classification), CNNs (image/spatial data), RNNs (sequential data), LSTMs (long-term dependencies), GRUs (efficient sequential processing), GANs (data generation), and Autoencoders (anomaly detection).

ADVANTAGES
Models complex non-linear relationships
End-to-end learning without manual feature engineering
Scalable to large high-dimensional datasets
Adaptable across diverse problem domains
DISADVANTAGES
Black-box — limited interpretability
Requires large amounts of labelled training data
Computationally expensive to train
Long development and iteration cycles
3.2 Gradient Boosting Machines (GBM)

GBMs build an additive model sequentially: each new weak learner (typically a shallow decision tree) is fitted to the negative gradient of the loss with respect to the current predictions, so that it corrects the errors of its predecessors. Introduced by Jerome Friedman (1999) at Stanford, GBMs apply gradient descent in this way to iteratively minimize a loss function, producing flexible, powerful predictive models from weak learners.
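The sequential error-correcting idea can be sketched as follows. This is a simplified illustration, assuming squared-error loss (where the negative gradient is simply the residual), a single numeric feature, and one-split trees ("stumps") as weak learners; the function names and toy data are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-threshold split on a 1-D feature under squared error."""
    best = None
    for t in sorted(set(xs))[:-1]:  # drop the largest value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        err = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or err < best[0]:
            best = (err, t, left_mean, right_mean)
    _, t, left_mean, right_mean = best
    return lambda x: left_mean if x <= t else right_mean

def gradient_boost(xs, ys, n_rounds=20, learning_rate=0.3):
    base = sum(ys) / len(ys)  # initial constant prediction
    stumps = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # For loss 0.5 * (y - p)^2 the negative gradient is the residual y - p.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * stump(x) for p, x in zip(preds, xs)]

    def predict(x):
        return base + learning_rate * sum(s(x) for s in stumps)
    return predict

# Toy data: the target jumps from 0 to 1 between x = 3 and x = 4.
predict = gradient_boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
```

Each round shrinks the remaining residual by the learning rate, so `predict(1)` moves towards 0 and `predict(6)` towards 1 as rounds accumulate.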

ADVANTAGES
Flexible for both regression and classification
Robust to missing data and irrelevant features
High accuracy on large datasets
Feature importance quantification
DISADVANTAGES
Sequential training — computationally expensive
Prone to overfitting without careful tuning
Sensitive to hyperparameter selection
High memory consumption for large tree counts
3.3 XGBoost (eXtreme Gradient Boosting)

XGBoost extends GBMs with a sparsity-aware split-finding algorithm, approximate tree learning via a weighted quantile sketch, and system-level optimizations for cache access, data compression, and sharding. It scales to billions of examples while maintaining resource efficiency.

Its design prioritizes adaptability — supporting regression, classification, ranking, and custom prediction tasks — while combining algorithmic advances with system optimization to deliver state-of-the-art results on diverse machine learning challenges.
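For reference, the regularized objective that XGBoost minimizes, as stated by Chen and Guestrin [18], can be written as below. Here l is a differentiable convex loss, each f_k is one tree in the ensemble, T is the number of leaves in a tree, w is its vector of leaf weights, and γ and λ are regularization hyperparameters.

```latex
\mathcal{L}(\phi) = \sum_{i} l\left(\hat{y}_i, y_i\right) + \sum_{k} \Omega\left(f_k\right),
\qquad
\Omega(f) = \gamma T + \tfrac{1}{2}\,\lambda\,\lVert w \rVert^{2}
```

The penalty term Ω discourages trees with many leaves or large leaf weights, which is one way the method limits overfitting.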

3.4 Random Forest

Random Forest constructs an ensemble of decision trees, each trained on a bootstrapped subset of data with a random subset of features at each split. This ensures tree diversity, mitigating overfitting. Classification uses majority voting; regression uses averaging across all trees.
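The two sources of randomness and the voting rule can be sketched as follows. This is a deliberately small illustration, assuming one-split trees ("stumps") in place of full decision trees, so the random feature subset is drawn once per tree rather than at every split; all names and the toy data are hypothetical.

```python
import random
from collections import Counter

def bootstrap(rows, rng):
    """Sample len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def fit_stump(rows, feature_ids):
    """Pick the (feature, threshold) with the fewest misclassified rows."""
    best = None
    for f in feature_ids:
        for t in sorted({x[f] for x, _ in rows}):
            left = [y for x, y in rows if x[f] <= t]
            right = [y for x, y in rows if x[f] > t]
            left_label = Counter(left).most_common(1)[0][0]
            right_label = Counter(right).most_common(1)[0][0] if right else left_label
            errors = (sum(y != left_label for y in left)
                      + sum(y != right_label for y in right))
            if best is None or errors < best[0]:
                best = (errors, f, t, left_label, right_label)
    _, f, t, left_label, right_label = best
    return lambda x: left_label if x[f] <= t else right_label

def fit_forest(rows, n_trees=25, max_features=1, seed=0):
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = bootstrap(rows, rng)                              # bootstrapped rows
        feature_ids = rng.sample(range(n_features), max_features)  # random feature subset
        forest.append(fit_stump(sample, feature_ids))
    return forest

def predict(forest, x):
    """Classification by majority vote across all trees."""
    return Counter(tree(x) for tree in forest).most_common(1)[0][0]

# Toy rows: (features, label), with label 1 standing in for "fraud".
rows = [((1.0, 10.0), 0), ((2.0, 12.0), 0), ((3.0, 11.0), 0),
        ((8.0, 30.0), 1), ((9.0, 31.0), 1), ((10.0, 29.0), 1)]
forest = fit_forest(rows)
```

Because each tree sees a different resample and a different feature, individual trees can be wrong while the majority vote remains stable.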

Key strengths include handling noise and outliers effectively, working with numerical and categorical data with little preprocessing (although some implementations still require categorical variables to be encoded), providing built-in feature importance analysis, and parallelizing training across CPUs for scalable large-dataset performance.

04 Design & Methodology

The hybrid approach is a multifaceted end-to-end pipeline combining data engineering, individual model training, and ensemble combination:

4.1 Data Collection
Financial transaction data gathered from internal databases, third-party providers, and public datasets. The target variable (fraud/legitimate) is defined, and compliance with legal and ethical guidelines (GDPR) is ensured. Comprehensive mix of both legitimate and fraudulent transactions to represent real-world distributions.
4.2 Data Processing
Missing values imputed, outliers treated, and data inconsistencies resolved. Feature engineering creates new attributes (polynomial features, interaction terms, domain-specific calculations) to expose fraud patterns. Numerical features normalized; categorical variables encoded. Class imbalance addressed via SMOTE or undersampling. Data split into training, validation, and test sets.
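The oversampling idea behind SMOTE can be sketched as below. This is a simplified, pure-Python approximation rather than a reference implementation: it creates each synthetic fraud record by interpolating between a minority-class point and one of its nearest minority-class neighbours. The function name, parameters, and sample points are hypothetical. A common practice, assumed here, is to resample only the training split so that the validation and test sets keep the real class ratio.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority points by neighbour interpolation."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority-class neighbours of the chosen point (squared Euclidean distance).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to other
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, other)))
    return synthetic

# Hypothetical scaled features of four fraudulent training transactions.
fraud_train = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
extra_fraud = smote_like(fraud_train, n_new=6)
```

Every synthetic point lies on a line segment between two real minority points, so the new records stay inside the region already occupied by observed fraud cases.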
4.3 Neural Network Design & Training
Architecture selected based on input features, hidden layers, neurons, and fraud detection requirements. Cross-entropy loss function with stochastic gradient descent optimization. Hyperparameter tuning via grid/random search. Iterative training with backpropagation; early stopping and L1/L2 regularization to prevent overfitting.
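Two of the overfitting controls mentioned above, early stopping and L2 regularization, can be sketched independently of any deep learning framework. The helper names, the patience value, and the loss figures are hypothetical; in practice the validation losses would come from the training loop.

```python
def l2_penalised_loss(data_loss, weights, lam):
    """Add an L2 penalty (lam times the sum of squared weights) to a data loss."""
    return data_loss + lam * sum(w * w for w in weights)

def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (stop_epoch, best_epoch) for a sequence of validation losses.

    Training stops once the validation loss has failed to improve by more
    than min_delta for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = -1
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, waited = loss, epoch, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch, best_epoch
    return len(val_losses) - 1, best_epoch

# Hypothetical validation losses: improvement stalls after epoch 2.
stop_epoch, best_epoch = early_stopping_epoch([0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5])
```

With a patience of 3, training halts at epoch 5 and the weights from epoch 2 would be restored; the later loss of 0.5 is never reached, which shows the trade-off that the patience setting controls.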
4.4 XGBoost Training
Data prepared and split (67% train / 33% test). XGBoost model initialized with hyperparameter experimentation. Training via the fit method; performance evaluated using accuracy, precision, recall, F1-score. Categorical variables handled via embedding layers in the Neural Network component, complementing XGBoost's preprocessing requirements.
4.5 Random Forest Training
Data split 70/30 (train/test). Random Forest classifier initialized and trained via the fit method. Hyperparameters tuned experimentally. Feature importance analysis performed. Out-of-bag error monitored. Metrics evaluated: accuracy, precision, recall, F1-score.
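The four evaluation metrics named for both XGBoost and Random Forest follow directly from the confusion-matrix counts. The sketch below assumes binary labels with 1 meaning fraud; the function name and the example labels are hypothetical.

```python
def classification_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for binary labels (1 = fraud)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]
metrics = classification_metrics(y_true, y_pred)
never_fraud = classification_metrics(y_true, [0] * len(y_true))
```

On these labels a model that never predicts fraud still scores 0.7 accuracy while its recall is 0, which is why precision, recall, and F1 are reported alongside accuracy on imbalanced data.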
4.6 Model Combination (Stacking)
Individual model predictions used as features for a meta-model (logistic regression) trained on the validation set. Stacking allows the meta-model to learn optimal combination strategies — capturing interactions between base model predictions. Offers flexibility to explore different stacking architectures.
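The stacking step can be sketched as a small logistic-regression meta-model fitted on base-model outputs. This is an illustration under stated assumptions: the three probabilities per transaction stand in for the Neural Network, Random Forest, and XGBoost predictions on the validation set, all numbers are invented, and plain batch gradient descent replaces whatever solver a library would use.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_meta_model(base_probs, labels, lr=0.5, epochs=2000):
    """Logistic regression on base-model probabilities via batch gradient descent."""
    n_features = len(base_probs[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(labels)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(base_probs, labels):
            error = sigmoid(sum(w * xi for w, xi in zip(weights, x)) + bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - lr * g / n for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n
    return weights, bias

def stacked_probability(weights, bias, probs):
    """Final fraud probability from the three base-model probabilities."""
    return sigmoid(sum(w * p for w, p in zip(weights, probs)) + bias)

# Hypothetical validation-set outputs: [p_nn, p_rf, p_xgb] per transaction.
val_probs = [[0.9, 0.8, 0.95], [0.7, 0.6, 0.8], [0.2, 0.3, 0.1],
             [0.1, 0.2, 0.05], [0.6, 0.4, 0.7], [0.3, 0.5, 0.2]]
val_labels = [1, 1, 0, 0, 1, 0]
weights, bias = fit_meta_model(val_probs, val_labels)
```

A new transaction is then scored with `stacked_probability(weights, bias, [p_nn, p_rf, p_xgb])`. Fitting the meta-model on data the base models did not train on keeps it from simply rewarding whichever base model memorized the training set.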
4.7 Synergy of All Three Models
Neural Networks uncover complex non-linear patterns; Random Forest provides robustness against overfitting; XGBoost handles imbalanced data with computational efficiency. Together, stacking achieves enhanced predictive accuracy, robustness to individual model weaknesses, and improved generalization to unseen fraud patterns.

05 Benefits & Challenges

POTENTIAL BENEFITS
Adaptability — adapts to evolving fraud patterns by combining diverse algorithmic strengths
Imbalanced Data Handling — XGBoost integration directly addresses the class imbalance challenge
High Predictive Accuracy — three potent algorithms reduce false negatives and financial losses
Continuous Learning — embedded mechanism remains updated with emerging fraud patterns
CHALLENGES & LIMITATIONS
Complexity — three-algorithm integration is difficult to interpret and deploy
Computational Overhead — significant resources required for real-time processing
Data Privacy — transactional data must be anonymized and GDPR-compliant
Overfitting Risk — ensemble stacking requires careful validation and monitoring

06 Ethical Considerations

DATA PRIVACY
Data Privacy
Financial fraud detection involves analysing sensitive personal and transaction data. Robust security measures, GDPR compliance, and data minimization principles are essential to prevent unauthorized access and maintain customer trust.
TRANSPARENCY
Transparency & Accountability
Stakeholders — customers, regulators, internal teams — must understand how transactions are flagged. Mechanisms for review, appeal, and redress are required. Transparency builds trust and enables continuous improvement through scrutiny and feedback.
FAIRNESS
Bias & Fairness
Biased training data or feature selection can lead to systematic errors that disproportionately flag transactions from certain demographics. Active bias identification, careful data preparation, and ongoing monitoring help ensure impartial, fair treatment of all transactions.
ACCURACY
False Positives & Negatives
False positives erode customer trust and increase operational costs. False negatives expose institutions to financial risk and regulatory penalties. Balancing sensitivity and specificity requires continuous tuning, considering each institution's unique risk profile and customer base.
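One way to make this balance concrete is to choose the decision threshold that minimizes an institution-specific cost. The sketch below assumes each false positive and false negative carries a fixed cost; the cost values, scores, and candidate thresholds are hypothetical.

```python
def expected_cost(y_true, scores, threshold, cost_fp=1.0, cost_fn=10.0):
    """Total cost of flagging every transaction whose score is >= threshold."""
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    return cost_fp * fp + cost_fn * fn

def best_threshold(y_true, scores, candidates, **costs):
    """Candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: expected_cost(y_true, scores, t, **costs))

y_true = [0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.4, 0.6, 0.5, 0.9]  # hypothetical model fraud scores
candidates = [0.3, 0.5, 0.7]

# Missed fraud is ten times as costly as a false alarm.
risk_averse = best_threshold(y_true, scores, candidates, cost_fp=1.0, cost_fn=10.0)
# False alarms are twenty times as costly as missed fraud.
friction_averse = best_threshold(y_true, scores, candidates, cost_fp=20.0, cost_fn=1.0)
```

The same scores yield a threshold of 0.5 under the first cost profile and 0.7 under the second, which illustrates why tuning depends on each institution's risk profile.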

07 Future Research Directions

  • Expanding Algorithmic Diversity — integrate CNNs and RNNs for sequential transaction patterns; explore unsupervised algorithms for anomaly detection without labelled data
  • Enhancing Real-Time Capabilities — develop lightweight models optimized for speed; explore edge computing for device-level preliminary fraud screening to reduce latency
  • Deeper Feature Engineering — investigate automated feature engineering via deep feature synthesis; explore domain-specific features emerging from new transactional patterns
  • Model Explainability — develop hybrid architectures where interpretable components (decision trees) coexist with black-box algorithms; apply SHAP or LIME for feature attribution

08 Conclusion

Overview

This research presents a novel hybrid machine learning model that fuses Neural Networks, Random Forest, and XGBoost for financial fraud detection. Each algorithm contributes distinct strengths — deep pattern recognition, robustness against overfitting, and imbalanced data handling — forming a comprehensive detection system.

Methodological Rigor

The project emphasizes thorough data collection, robust preprocessing and balancing techniques, meaningful feature engineering, individual algorithm training, and ensemble combination through stacking. This systematic approach is intended to maximize accuracy and precision.

Implications

By screening transactions in real time, the model is designed to shorten the window of vulnerability in digital transactions. The research demonstrates the broader potential of multi-algorithm hybrid approaches for complex, real-world cybersecurity and fraud challenges.

Closing thought: As digital transactions become the norm, this research underscores the importance of proactive, adaptive fraud detection. The hybrid model stands as a testament to what can be achieved when innovation meets determination — guiding future endeavours toward a safer transactional landscape.

A List of Abbreviations

GBM: Gradient Boosting Machine
XGBoost: eXtreme Gradient Boosting
GANN: Genetic Algorithm Neural Network
DNN: Deep Neural Network
LSTM: Long Short-Term Memory
CNN: Convolutional Neural Network
GAN: Generative Adversarial Network
RNN: Recurrent Neural Network
GRU: Gated Recurrent Unit
AE: Autoencoder
SMOTE: Synthetic Minority Over-sampling Technique
KNN: K-Nearest Neighbours
GDPR: General Data Protection Regulation
FNN: Feedforward Neural Network
NAS: Neural Architecture Search
ReLU: Rectified Linear Unit

B Bibliography

[1] L. Delamaire, H. Abdou, and J. Pointon, "Credit card fraud and detection techniques: a review," 2009.
[2] D. P. Foster and R. A. Stine, "Variable Selection in Data Mining: Building a Predictive Model for Bankruptcy," 2004.
[3] R. Patidar and L. Sharma, "Credit card fraud detection using neural network," 2011.
[4] Y. Fang et al., "Credit Card Fraud Detection Based on Machine Learning," Neural Information Processing Conference, pp. 483–490, 2016.
[5] "Random forest for credit card fraud detection," IEEE Xplore, 2018.
[6] C. L. Udeze, I. E. Eteng, and A. E. Ibor, "Application of Machine Learning and Resampling Techniques to Credit Card Fraud Detection," Journal of the Nigerian Society of Physical Sciences.
[7] Z. Faraji, "A Review of Machine Learning Applications for Credit Card Fraud Detection," SEISENSE Journal of Management, vol. 5, no. 1, pp. 49–59, Feb. 2022.
[8] "What is a Neural Network?" AWS. aws.amazon.com/what-is/neural-network/
[14] "Gradient and Newton Boosting for Classification and Regression," arXiv:1808.03064.
[18] T. Chen and C. Guestrin, "XGBoost: A Scalable Tree Boosting System," arXiv:1603.02754.
[20] L. Breiman, "Random Forests," Machine Learning, vol. 45, pp. 5–32, Springer, 2001.
[29] European Parliament, "General Data Protection Regulation (GDPR)," EUR-Lex 32016R0679.