# Credit Card Fraud Detection

**Tools:** Microsoft Excel, Python, Google Colab, Microsoft Power BI

**Keywords:** finance, commerce, business, data standardization, data cleaning, machine learning

<figure><img src="https://539050446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FW65tN1jweulcjF26J7LM%2Fuploads%2FhXxmqJ4a8IEBiFUthNHb%2FCredit%20Card%20Fraud.jpg?alt=media&#x26;token=e24c4f65-253c-4167-aa5e-bbb3ed368ee8" alt=""><figcaption></figcaption></figure>

With the growing reliance on e-commerce and online transactions, banks need automated machine learning systems that can instantly detect potential credit card fraud, protecting both their customers' interests and the bank's reputation.

In this project, we use a publicly available dataset of credit card transactions made by European cardholders in September 2013 to uncover patterns in fraudulent behavior and develop a machine learning model capable of detecting future fraud.

## Examining the dataset

The dataset comprises 284,807 transactions across two days, with 492 labeled as fraudulent, making up only 0.172% of all transactions.

<figure><img src="https://539050446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FW65tN1jweulcjF26J7LM%2Fuploads%2FQVgTLkMx3JMaLGO85jns%2FScreenshot%202025-12-24%20131531.png?alt=media&#x26;token=5da9578f-44e2-4f32-9586-daf9705f1fc1" alt=""><figcaption></figcaption></figure>

> The dataset in CSV format

Here are the components of the dataset:

* **Numerical Variables:** The dataset features 28 principal components (`V1` to `V28`) obtained through PCA (Principal Component Analysis). These anonymized features ensure data confidentiality.
* **Time:** The time elapsed (in seconds) between each transaction and the first transaction in the dataset.
* **Amount:** The transaction amount, which can influence cost-sensitive analyses.
* **Class:** The target variable, where `1` indicates fraud and `0` indicates a legitimate transaction.

Because fraudulent cases make up such a small share of the dataset, a machine learning model may not see enough fraudulent samples to learn what distinguishes fraud from legitimate activity.
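To make the imbalance concrete, the short sketch below recomputes the fraud rate from the two counts quoted above and shows why plain accuracy is a misleading metric here. The function names are illustrative only and are not taken from the project's notebook.

```python
# Counts quoted in the dataset description above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492


def fraud_rate(fraud: int, total: int) -> float:
    """Return the fraction of transactions labeled as fraud."""
    return fraud / total


def majority_baseline_accuracy(fraud: int, total: int) -> float:
    """Accuracy of a trivial model that labels every transaction legitimate."""
    return (total - fraud) / total


rate = fraud_rate(FRAUD_TRANSACTIONS, TOTAL_TRANSACTIONS)
baseline = majority_baseline_accuracy(FRAUD_TRANSACTIONS, TOTAL_TRANSACTIONS)

print(f"Fraud rate: {rate:.4%}")
print(f"Accuracy when always predicting 'legitimate': {baseline:.4%}")
```

A model that never flags anything would still be right about 99.83% of the time while catching zero fraud, which is why accuracy alone says little on this dataset.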

## Initial dataset analysis

This project was conducted in two main phases: **Exploratory Data Analysis (EDA)** and **Machine Learning Model Development**. Below is the detailed process:

1. **Understanding the Dataset:**

* Loaded the dataset and reviewed its structure (columns, data types, and target variable).
* Identified the highly imbalanced nature of the data (0.172% fraud cases).

2. **Exploratory Data Analysis (EDA):**

* **Summary Statistics:**
  * Examined the distribution of features like `Time`, `Amount`, and `Class`.
  * Checked for missing or inconsistent data (none were found in this dataset).
* **Visualizations:**
  * Created histograms and boxplots to identify patterns in transaction times and amounts.

<figure><img src="https://539050446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FW65tN1jweulcjF26J7LM%2Fuploads%2F7dvWIcTMz53Z2VLlRIZJ%2FHistogram.jpg?alt=media&#x26;token=57e3c08e-a7d7-4665-8b01-b52d420d536a" alt=""><figcaption></figcaption></figure>

> This histogram shows the number of transactions in each time slot, measured from the first transaction in the dataset.

<figure><img src="https://539050446-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2FW65tN1jweulcjF26J7LM%2Fuploads%2FYudGCBziXQ4qI8NXgCWf%2FBoxplot.jpg?alt=media&#x26;token=cf32739e-2214-4496-bdb2-cacb6e8c0258" alt=""><figcaption></figcaption></figure>

> Boxplot showing the number of transactions within each amount bin.

* **Correlation Analysis:**
  * Investigated relationships between PCA features (`V1` to `V28`) and the likelihood of fraud.
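As a rough illustration of the correlation step, the sketch below ranks features by the absolute Pearson correlation between each feature and the binary `Class` label. This is one common way to do it and is an assumption, not necessarily the method used in the project. The feature names and values are toy data invented for the example, not rows from the real dataset.

```python
from math import sqrt


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def rank_features(columns: dict[str, list[float]], labels: list[int]) -> list[tuple[str, float]]:
    """Sort features by the strength (absolute value) of their correlation with the label."""
    scores = {name: pearson(values, labels) for name, values in columns.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Toy data (hypothetical): 1 = fraud, 0 = legitimate.
toy_labels = [0, 0, 0, 1, 0, 1]
toy_columns = {
    "feature_a": [-0.1, 0.2, 0.0, -3.1, 0.1, -2.8],  # clearly lower for fraud rows
    "feature_b": [0.5, -0.4, 0.3, 0.2, -0.6, 0.1],   # little relation to the label
}

ranking = rank_features(toy_columns, toy_labels)
for name, score in ranking:
    print(f"{name}: {score:+.3f}")
```

Correlation with a binary label only captures linear separation, so a low score does not by itself prove a feature is useless to a non-linear model such as Random Forest.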

3. **Data Preprocessing:**

* Standardized features such as `Time` and `Amount` to ensure they are on a similar scale.
* Addressed the imbalanced dataset using techniques like Synthetic Minority Oversampling Technique (SMOTE) to oversample fraud cases.
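The standardization step above can be sketched as a z-score transform: subtract the mean and divide by the standard deviation. The helper names and the toy amounts below are hypothetical. The sketch also assumes the scaling parameters are learned from the training data only and then reused on new data, which is a common practice rather than something the project description confirms.

```python
from statistics import fmean, pstdev


def fit_scaler(values: list[float]) -> tuple[float, float]:
    """Learn the mean and population standard deviation from training values."""
    return fmean(values), pstdev(values)


def transform(values: list[float], mean: float, std: float) -> list[float]:
    """Apply z-score scaling; a constant column maps to all zeros."""
    if std == 0:
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


# Toy transaction amounts (hypothetical values, not from the real dataset).
train_amounts = [2.5, 10.0, 149.62, 1.0, 999.0]
new_amounts = [5.0, 250.0]

mean, std = fit_scaler(train_amounts)
train_scaled = transform(train_amounts, mean, std)
new_scaled = transform(new_amounts, mean, std)  # reuses the training statistics

print([round(v, 3) for v in train_scaled])
print([round(v, 3) for v in new_scaled])
```

scikit-learn's `StandardScaler` applies the same formula, also using the population standard deviation, and is the more usual choice when working with full DataFrames.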

## ML model building

* **Model Selection:**
  * Experimented with multiple models, including Logistic Regression, Random Forest, and Gradient Boosting.
* **Training and Evaluation:**
  * Split the data into training and testing sets (80%-20%).
  * Used cross-validation to evaluate models and tune hyperparameters.
* **Metrics:**
  * Focused on precision, recall, and F1-score to assess model performance, given the imbalanced dataset.
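For reference, the three metrics named above can be computed directly from the counts of true positives, false positives, and false negatives. The sketch below is a minimal standalone version with toy labels invented for the example. In a real pipeline, a library routine such as scikit-learn's `classification_report` would normally be used instead.

```python
def confusion_counts(y_true: list[int], y_pred: list[int]) -> tuple[int, int, int, int]:
    """Return (tp, fp, fn, tn), treating label 1 (fraud) as the positive class."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == 1 and pred == 1:
            tp += 1
        elif truth == 0 and pred == 1:
            fp += 1
        elif truth == 1 and pred == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 for the fraud class (0.0 when undefined)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f1


# Toy labels (hypothetical): 1 = fraud, 0 = legitimate.
toy_true = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
toy_pred = [0, 0, 1, 0, 1, 1, 0, 0, 0, 1]

precision, recall, f1 = precision_recall_f1(toy_true, toy_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Precision answers "of the transactions flagged, how many were really fraud?", while recall answers "of the real fraud cases, how many were caught?". F1 is their harmonic mean.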

## Fraud Detection and Correlation Analysis

* **Fraud Timing Patterns:** Fraudulent transactions showed distinct timing patterns compared to non-fraudulent ones.
* **Transaction Amounts:** Fraudulent transactions tended to cluster within specific ranges, suggesting that transaction amount plays a significant role in identifying fraud.
* **Feature Correlations:** Analysis revealed certain features (`Vx`) with higher predictive power for detecting fraud.

**Model Building for Fraud Detection**

* **Machine Learning Approach:**
  * Techniques like Random Forest and Logistic Regression were employed.
  * Resampling strategies (e.g., SMOTE) were applied to address the class imbalance.
* **Model Performance:**
  * The best-performing model achieved high precision in identifying fraudulent transactions while minimizing false positives to avoid unnecessary disruptions for customers.
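The core idea behind the SMOTE resampling mentioned above is to create synthetic fraud samples by interpolating between an existing fraud sample and one of its nearest fraud neighbors. The sketch below is a simplified, standard-library-only illustration of that idea. It is not the implementation used in the project, and the function name, parameters, and toy points are all hypothetical. In practice a library such as imbalanced-learn (`imblearn.over_sampling.SMOTE`) would typically be used.

```python
import random
from math import dist


def smote_sketch(minority: list[list[float]], n_new: int, k: int = 5, seed: int = 0) -> list[list[float]]:
    """Create n_new synthetic points by interpolating between minority-class neighbors."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbors than other samples
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Indices of the k nearest minority samples to `base`, excluding itself.
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist(base, minority[j]),
        )[:k]
        neighbour = minority[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy fraud samples with two features each (hypothetical values).
toy_fraud = [[1.0, 2.0], [1.5, 1.8], [2.0, 2.2], [1.2, 2.5]]
new_points = smote_sketch(toy_fraud, n_new=6, k=2)

for point in new_points:
    print([round(value, 3) for value in point])
```

Resampling of this kind is usually applied to the training split only, so that the test set keeps the real class balance and the reported precision and recall stay realistic.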

## Link to the GitHub repository

To access the files used in this project, please visit [the GitHub repository](https://github.com/cedricyu000925/Credit-Card-Fraud-Detection-Project/tree/main).
