Credit Card Fraud Detection

To uncover patterns in fraudulent behavior and develop a machine learning model capable of detecting future fraud.

Tools: Microsoft Excel, Python, Google Colab, Microsoft Power BI

Keywords: finance, commerce, business, data standardization, data cleaning, machine learning

With the increasing reliance on e-commerce and online transactions, banks need automated machine learning systems that can flag potential credit card fraud as it happens, protecting both their customers' interests and the bank's reputation.

In this project, we use a publicly available dataset of credit card transactions made by European cardholders in September 2013 to uncover patterns in fraudulent behavior and develop a machine learning model capable of detecting future fraud.

Examining the dataset

The dataset comprises 284,807 transactions across two days, with 492 labeled as fraudulent, making up only 0.172% of all transactions.

The dataset in csv format

Here are the components of the dataset:

  • Numerical Variables: The dataset features 28 principal components (V1 to V28) obtained through PCA (Principal Component Analysis). These anonymized features ensure data confidentiality.

  • Time: The time elapsed (in seconds) between each transaction and the first transaction in the dataset.

  • Amount: The transaction amount, which can influence cost-sensitive analyses.

  • Class: The target variable, where 1 indicates fraud and 0 indicates a legitimate transaction.

The small percentage of fraudulent cases in this dataset presents a challenge: the model may not see enough fraudulent samples to learn what distinguishes a fraudulent transaction from a legitimate one.
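A first look at the structure and class balance described above can be sketched with pandas. This is a minimal sketch, not the project's actual notebook code: the helper name `summarize_transactions` and the file name `creditcard.csv` are assumptions made for illustration, and only the column name `Class` comes from the dataset description.

```python
import pandas as pd


def summarize_transactions(df: pd.DataFrame) -> dict:
    """Return basic structure and class-balance facts for the transactions table."""
    class_counts = df["Class"].value_counts()
    n_fraud = int(class_counts.get(1, 0))  # Class == 1 marks fraud
    return {
        "n_rows": len(df),
        "n_columns": df.shape[1],
        "n_fraud": n_fraud,
        "fraud_pct": 100.0 * n_fraud / len(df),
        "n_missing": int(df.isna().sum().sum()),  # total missing cells
    }


# Usage, assuming the CSV was saved locally as "creditcard.csv" (hypothetical path):
# df = pd.read_csv("creditcard.csv")
# print(df.dtypes)
# print(summarize_transactions(df))
```

Returning a plain dictionary keeps the check easy to rerun after each cleaning step, so any change in row count or missing values is visible right away.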

Initial dataset analysis

This project was conducted in two main phases: Exploratory Data Analysis (EDA) and Machine Learning Model Development. Below is the detailed process:

  1. Understanding the Dataset:

  • Loaded the dataset and reviewed its structure (columns, data types, and target variable).

  • Identified the highly imbalanced nature of the data (0.172% fraud cases).

  2. Exploratory Data Analysis (EDA):

  • Summary Statistics:

    • Examined the distribution of features like Time, Amount, and Class.

    • Checked for missing or inconsistent data (none were found in this dataset).

  • Visualizations:

    • Created histograms and boxplots to identify patterns in transaction times and amounts.

This histogram shows the number of transactions in each time slot, measured from the first transaction in the dataset.

Boxplot showing the number of transactions within each amount bin.

  • Correlation Analysis:

    • Investigated relationships between PCA features (V1 to V28) and the likelihood of fraud.
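The correlation step can be sketched as ranking each feature by its Pearson correlation with the Class column. This is an illustrative sketch under stated assumptions: the function name `rank_features_by_correlation` is hypothetical, and the demo data is synthetic (V1 is built to track the label, V2 is noise), not drawn from the real dataset.

```python
import numpy as np
import pandas as pd


def rank_features_by_correlation(df: pd.DataFrame, target: str = "Class") -> pd.Series:
    """Pearson correlation of each feature with the target, strongest first."""
    features = df.drop(columns=[target])
    corr = features.corrwith(df[target])
    # Sort by absolute value so strong negative correlations rank high as well.
    order = corr.abs().sort_values(ascending=False).index
    return corr.loc[order]


# Synthetic demo data (not the real transactions).
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
demo = pd.DataFrame({
    "V1": -2.0 * labels + rng.normal(size=500),  # constructed to follow the label
    "V2": rng.normal(size=500),                  # pure noise
    "Class": labels,
})
ranking = rank_features_by_correlation(demo)
print(ranking)
```

On a dataset as imbalanced as this one, correlations with a binary target tend to be small in magnitude, so the ordering is usually more informative than the raw values.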

  3. Data Preprocessing:

  • Standardized features such as Time and Amount to ensure they are on a similar scale.

  • Addressed the imbalanced dataset using techniques like Synthetic Minority Oversampling Technique (SMOTE) to oversample fraud cases.
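The two preprocessing steps can be illustrated with a small NumPy sketch. It shows the idea behind SMOTE (pick a minority sample, pick one of its k nearest minority neighbours, and interpolate between them) rather than the implementation used in the project, which the write-up does not specify. In practice the SMOTE class from the imbalanced-learn library is the usual choice. The function names and toy data here are assumptions for illustration.

```python
import numpy as np


def standardize(column: np.ndarray) -> np.ndarray:
    """Z-score a 1-D array: subtract the mean, divide by the standard deviation."""
    return (column - column.mean()) / column.std()


def smote_oversample(X_minority: np.ndarray, n_new: int, k: int = 5, seed: int = 0) -> np.ndarray:
    """Create n_new synthetic minority rows by interpolating towards nearest neighbours."""
    n = len(X_minority)
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = np.random.default_rng(seed)
    k = min(k, n - 1)
    # Pairwise Euclidean distances between minority samples.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                       # starting samples
    picked = neighbours[base, rng.integers(0, k, size=n_new)]   # one neighbour each
    gap = rng.random((n_new, 1))                                # weight in [0, 1)
    return X_minority[base] + gap * (X_minority[picked] - X_minority[base])


# Toy minority class standing in for the fraud rows (synthetic, for illustration).
toy_rng = np.random.default_rng(42)
fraud = toy_rng.normal(loc=3.0, size=(20, 2))
synthetic = smote_oversample(fraud, n_new=80)
scaled_amount = standardize(np.array([1.0, 2.0, 3.0]))
```

A common recommendation is to fit the scaler and run the oversampling on the training split only, so that no information from the test set leaks into the model.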

ML model building

  • Model Selection:

    • Experimented with multiple models, including Logistic Regression, Random Forest, and Gradient Boosting.

  • Training and Evaluation:

    • Split the data into training and testing sets (80%-20%).

    • Used cross-validation to evaluate models and tune hyperparameters.

  • Metrics:

    • Focused on precision, recall, and F1-score to assess model performance, given the imbalanced dataset.
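The training and evaluation loop above can be sketched with scikit-learn. This is a hedged outline, not the project's code: it uses Logistic Regression only (one of the three models named), synthetic data from `make_classification` in place of the real transactions, and no resampling step. The function name `train_and_evaluate`, the 5-fold setting, and the seed are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split


def train_and_evaluate(X, y, seed: int = 42) -> dict:
    """80/20 stratified split, 5-fold CV on the training part, metrics on the test part."""
    # Stratify so both splits keep the same share of the minority class.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=seed
    )
    model = LogisticRegression(max_iter=1000)
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    cv_f1 = cross_val_score(model, X_train, y_train, cv=cv, scoring="f1")

    model.fit(X_train, y_train)
    precision, recall, f1, _ = precision_recall_fscore_support(
        y_test, model.predict(X_test), average="binary", zero_division=0
    )
    return {
        "cv_f1_mean": float(cv_f1.mean()),
        "precision": float(precision),
        "recall": float(recall),
        "f1": float(f1),
    }


# Synthetic, imbalanced stand-in data (about 5% positives), not the real dataset.
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=0
)
results = train_and_evaluate(X_demo, y_demo)
print(results)
```

Stratifying both the split and the folds matters with rare positives: an unstratified fold could end up with very few fraud cases, which makes precision and recall unstable.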

Fraud Detection and Correlation Analysis

  • Fraud Timing Patterns: Fraudulent transactions showed distinct timing patterns compared to non-fraudulent ones.

  • Transaction Amounts: Fraudulent transactions tended to cluster within specific ranges, suggesting that transaction amount plays a significant role in identifying fraud.

  • Feature Correlations: Analysis revealed certain features (Vx) with higher predictive power for detecting fraud.

Model Building for Fraud Detection

  • Machine Learning Approach:

    • Techniques like Random Forest and Logistic Regression were employed.

    • Resampling strategies (e.g., SMOTE) were applied to address the class imbalance.

  • Model Performance:

    • The best-performing model achieved high precision in identifying fraudulent transactions while minimizing false positives to avoid unnecessary disruptions for customers.
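To make the link between precision and false positives concrete, the metrics can be computed directly from confusion-matrix counts. The sketch below is illustrative: the baseline uses the dataset's reported totals (492 fraud cases out of 284,807), while the second set of counts is hypothetical and is not the project's result.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Baseline that labels every transaction as legitimate: accuracy looks
# excellent, yet not a single fraud case is caught.
baseline = classification_metrics(tp=0, fp=0, fn=492, tn=284_807 - 492)
print(baseline)

# Hypothetical counts, for illustration only: fewer false positives (fp)
# raise precision, fewer missed frauds (fn) raise recall.
example = classification_metrics(tp=90, fp=10, fn=30, tn=9_870)
print(example)
```

The baseline shows why accuracy alone is a poor guide on this dataset, and why precision, recall, and F1 are the more useful measures.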

To access the files used for this project, please visit the GitHub repository.
