Detecting Fraud Using Machine Learning

By: Malcom Calivar

Published: July 23, 2021

Project information

The dataset included in this project contains information about credit card transactions and their labels as either legitimate or fraudulent transactions. As the vast majority of transactions carried out are legitimate, this is a highly imbalanced dataset. Using different techniques such as undersampling, oversampling, and SMOTE, we will attempt to balance the data and train machine learning models to correctly predict whether a transaction is fraudulent or not.

The dataset contains transactions made with credit cards by European cardholders over a period of two days in September 2013.

Necessary libraries

In order to perform exploratory data analysis, I used the pandas Python library. For visualizing data, I used a combination of matplotlib and seaborn. Finally, scikit-learn provides all the machine learning models used during this project, as well as the tools to effectively measure their performance.

Exploring the dataset

Initial exploration of the data shows a few things. We have 31 columns:

  • Time (seconds elapsed since the first transaction in the dataset)
  • V1-V28 (principal components produced by a PCA transform; for confidentiality reasons, the original features behind them are not disclosed)
  • Amount (how much was spent in that transaction)
  • Class (0 for a legitimate transaction, 1 for a fraudulent one. From this point on, 0 will be referred to as legitimate and 1 as fraudulent)
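That first look can be sketched as follows. Since the actual CSV isn't bundled with this post, the snippet builds a tiny stand-in frame with the same 31-column schema; the file name in the comment is an assumption.

```python
import pandas as pd

# The Kaggle file is commonly named creditcard.csv; the path is an assumption.
# df = pd.read_csv("creditcard.csv")

# Tiny stand-in frame with the same 31-column schema, for illustration:
df = pd.DataFrame({
    "Time": [0.0, 1.0, 2.0],
    **{f"V{i}": [0.1, -0.2, 0.3] for i in range(1, 29)},
    "Amount": [149.62, 2.69, 378.66],
    "Class": [0, 0, 1],
})

print(df.shape)  # (rows, 31): Time, V1-V28, Amount, Class
print(list(df.columns[:2]), "...", list(df.columns[-2:]))
```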

Transforming the Data

Most of the data has already been transformed. The only columns that haven't been are Time, Amount, and Class. We won't be using Time, and Class is simply a binary label marking a transaction as legitimate or fraudulent.

If we want to use Amount in our training models, we will need to standardize it so that its values are similar to those in our already transformed data contained within columns V1-V28.
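A minimal sketch of that standardization using scikit-learn's StandardScaler, on toy amounts (the column name scaled_amount is my own choice, not from the original notebook):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy Amount values standing in for the real column.
df = pd.DataFrame({"Amount": [149.62, 2.69, 378.66, 123.50]})

# StandardScaler rescales to zero mean and unit variance, putting Amount
# on a scale comparable to the PCA-transformed columns V1-V28.
df["scaled_amount"] = StandardScaler().fit_transform(df[["Amount"]])
print(df["scaled_amount"].round(3).tolist())
```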

Visualizing the Data

This dataset is highly imbalanced: legitimate charges make up 99.83% of the transactions, while fraudulent charges account for only 0.17%.
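The imbalance is easy to quantify with pandas; here on a synthetic Class column mirroring the reported proportions:

```python
import pandas as pd

# Synthetic labels: 9,983 legitimate and 17 fraudulent out of 10,000.
labels = pd.Series([0] * 9983 + [1] * 17, name="Class")

# normalize=True returns proportions instead of raw counts.
share = labels.value_counts(normalize=True)
print(share)  # 0 -> 0.9983, 1 -> 0.0017
```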

What would happen if we were to train a machine learning algorithm on this dataset? Using a metric such as accuracy, we'd see some promising results. In fact, a Logistic Regression model trained on this data and evaluated on accuracy appears to be 99.9% accurate.

What's really happening here is that the machine learning algorithm sees that most of the data consists of legitimate transactions and simply "guesses" its way through its predictions. It will still be right more than 99% of the time, because fraudulent transactions are hugely underrepresented.
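That guessing behavior is easy to reproduce: scikit-learn's DummyClassifier, which here always predicts the majority class, posts near-perfect accuracy on similarly imbalanced synthetic labels without learning anything about fraud.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic data mirroring the imbalance: 17 fraud cases in 10,000.
rng = np.random.default_rng(0)
y = np.zeros(10_000, dtype=int)
y[rng.choice(10_000, size=17, replace=False)] = 1
X = rng.normal(size=(10_000, 2))  # features are irrelevant to this baseline

# "most_frequent" always predicts the majority class (legitimate).
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
acc = accuracy_score(y, baseline.predict(X))
print(acc)  # 0.9983
```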

Accuracy is obviously not a good metric to measure this model by. In fact, if we use a different metric, the area under the precision-recall curve (AUPRC), we see that the model falls well short of reliably identifying fraudulent charges: it scores about 0.66, where a perfect classifier scores 1.0. Clearly not good enough to use in the real world.
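Average precision (the area under the precision-recall curve) exposes a majority-guessing model immediately: a constant, uninformative score collapses to the fraud prevalence instead of a flattering number near 1.0.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# 3 fraudulent cases out of 1,000 transactions.
y_true = np.array([0] * 997 + [1] * 3)

# A model that assigns every transaction the same score is uninformative;
# its average precision equals the fraud prevalence (3/1000).
ap = average_precision_score(y_true, np.zeros(1000))
print(ap)  # 0.003
```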

Solving the Imbalance

In this project, three techniques were used: random undersampling, random oversampling, and a combination of random undersampling + SMOTE. For more details on how these methods work, view the notebook included in this post's source code. In essence, either the minority class is enlarged, the majority class is reduced, or a combination of the two is applied.
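A minimal sketch of the two random resampling schemes, using sklearn.utils.resample on a toy frame (SMOTE itself lives in the separate imbalanced-learn package, as imblearn.over_sampling.SMOTE):

```python
import pandas as pd
from sklearn.utils import resample

# Toy imbalanced frame: 95 legitimate rows, 5 fraudulent rows.
df = pd.DataFrame({"V1": range(100), "Class": [0] * 95 + [1] * 5})
legit, fraud = df[df["Class"] == 0], df[df["Class"] == 1]

# Random undersampling: shrink the majority class to the minority size.
under = pd.concat(
    [resample(legit, n_samples=len(fraud), replace=False, random_state=42), fraud]
)

# Random oversampling: duplicate minority rows up to the majority size.
over = pd.concat(
    [legit, resample(fraud, n_samples=len(legit), replace=True, random_state=42)]
)

print((under["Class"] == 0).sum(), (under["Class"] == 1).sum())  # 5 5
print((over["Class"] == 0).sum(), (over["Class"] == 1).sum())    # 95 95
```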

Overall, results were very promising, but each method has trade-offs. Random oversampling creates exact duplicates of minority-class rows, so our models may overfit to those specific examples and fail to generalize to real-world data. Random undersampling balances the classes but prunes almost the entire dataset to do so. What important information could our models be missing out on by doing this?

Overall, random undersampling + SMOTE offered the best results, with the K-Nearest Neighbors classifier achieving the highest area under the precision-recall curve at 0.9956.
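As an illustration only (not a reproduction of the 0.9956 figure, which comes from the notebook), here is a KNN classifier scored by average precision on balanced synthetic data standing in for the resampled training set:

```python
from sklearn.datasets import make_classification
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Balanced synthetic data standing in for the resampled dataset.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

knn = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
fraud_scores = knn.predict_proba(X_te)[:, 1]  # probability of class 1

ap = average_precision_score(y_te, fraud_scores)
print(round(ap, 4))
```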

Further research

More data on fraudulent cases would be extremely helpful in improving these models. Even when oversampling the minority class to just 5% of the total dataset and using techniques that create synthetic data points, we run the risk of creating highly specific data. Our classifiers will find it difficult to interpret anything outside these parameters, and previously unseen cases of fraud may slip through the cracks.