# Credit Card Fraud Detection

**Tools:** Microsoft Excel, Python, Google Colab, Microsoft Power BI

**Keywords:** finance, commerce, business, data standardization, data cleaning, machine learning

With the growing reliance on e-commerce and online transactions, banks need automated machine learning systems that can instantly detect potential credit card fraud, protecting both their customers' interests and the bank's reputation. In this project, we use a publicly available dataset of credit card transactions made by European cardholders in September 2013 to uncover patterns in fraudulent behavior and to develop a machine learning model capable of detecting future fraud.

## Examining the dataset

The dataset comprises 284,807 transactions across two days, of which 492 are labeled as fraudulent, making up only 0.172% of all transactions.
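The imbalance can be checked with simple arithmetic. The sketch below uses only the two counts quoted above and also shows what a trivial "always legitimate" baseline would score; that baseline is an illustration added here, not a model from this project.

```python
# Counts taken from the dataset description above.
TOTAL_TRANSACTIONS = 284_807
FRAUD_TRANSACTIONS = 492

legit_transactions = TOTAL_TRANSACTIONS - FRAUD_TRANSACTIONS
fraud_rate = FRAUD_TRANSACTIONS / TOTAL_TRANSACTIONS

# Hypothetical baseline: label every transaction as legitimate.
# It is correct on every legitimate transaction and wrong on every fraud.
baseline_accuracy = legit_transactions / TOTAL_TRANSACTIONS

# Fraud recall of that baseline: it finds none of the fraudulent cases.
baseline_fraud_recall = 0 / FRAUD_TRANSACTIONS

print(f"Legitimate transactions: {legit_transactions:,}")
print(f"Fraud rate: {fraud_rate:.2%}")
print(f"Always-legitimate baseline accuracy: {baseline_accuracy:.2%}")
print(f"Always-legitimate baseline fraud recall: {baseline_fraud_recall:.0%}")
```

A baseline that never flags fraud would still be more than 99.8% accurate, so accuracy alone says very little on data this skewed.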

> The dataset in CSV format

Here are the components of the dataset:

* **Numerical Variables:** The dataset features 28 principal components (`V1` to `V28`) obtained through PCA (Principal Component Analysis). These anonymized features ensure data confidentiality.
* **Time:** The time elapsed (in seconds) between each transaction and the first transaction in the dataset.
* **Amount:** The transaction amount, which can influence cost-sensitive analyses.
* **Class:** The target variable, where `1` indicates fraud and `0` indicates a legitimate transaction.

The small share of fraudulent cases in this dataset presents a challenge, as the model may not see enough fraudulent samples to learn what distinguishes fraud from a legitimate transaction.

## Initial dataset analysis

This project was conducted in two main phases: **Exploratory Data Analysis (EDA)** and **Machine Learning Model Development**. Below is the detailed process:

1. **Understanding the Dataset:**
   * Loaded the dataset and reviewed its structure (columns, data types, and target variable).
   * Identified the highly imbalanced nature of the data (0.172% fraud cases).
2. **Exploratory Data Analysis (EDA):**
   * **Summary Statistics:**
      * Examined the distribution of features like `Time`, `Amount`, and `Class`.
      * Checked for missing or inconsistent data (none were found in this dataset).
   * **Visualizations:**
      * Created histograms and boxplots to identify patterns in transaction times and amounts.

> This histogram shows the number of transactions in each time slot, measured from the first transaction in the dataset.

> Boxplot showing the number of transactions within each amount category bin.

   * **Correlation Analysis:**
      * Investigated relationships between PCA features (`V1` to `V28`) and the likelihood of fraud.
3. **Data Preprocessing:**
   * Standardized features such as `Time` and `Amount` to ensure they are on a similar scale.
   * Addressed the class imbalance using techniques such as the Synthetic Minority Oversampling Technique (SMOTE) to oversample fraud cases.

## ML model building

* **Model Selection:**
   * Experimented with multiple models, including Logistic Regression, Random Forest, and Gradient Boosting.
* **Training and Evaluation:**
   * Split the data into training and testing sets (80%-20%).
   * Used cross-validation to evaluate models and tune hyperparameters.
* **Metrics:**
   * Focused on precision, recall, and F1-score to assess model performance, given the imbalanced dataset.

## Fraud Detection and Correlation Analysis

* **Fraud Timing Patterns:** Fraudulent transactions showed distinct timing patterns compared to non-fraudulent ones.
* **Transaction Amounts:** Fraudulent transactions tended to cluster within specific ranges, suggesting that transaction amount plays a significant role in identifying fraud.
* **Feature Correlations:** Analysis revealed certain features (`Vx`) with higher predictive power for detecting fraud.

**Model Building for Fraud Detection**

* **Machine Learning Approach:**
   * Techniques like Random Forest and Logistic Regression were employed.
   * Resampling strategies (e.g., SMOTE) were applied to address the class imbalance.
* **Model Performance:**
   * The best-performing model achieved high precision in identifying fraudulent transactions while minimizing false positives to avoid unnecessary disruptions for customers.

## Link to the GitHub repository

To access the files used for this project, please visit [the GitHub repository](https://github.com/cedricyu000925/Credit-Card-Fraud-Detection-Project/tree/main).
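As a supplement to the metrics named in the model-building section, here is a minimal from-scratch sketch of how precision, recall, and F1-score are computed for the fraud class (`Class` = `1`). The helper names and the tiny label lists are hypothetical and made up purely for illustration; they are not results from this project, and the project code may well rely on a library implementation instead.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true positives, false positives, and false negatives for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    return tp, fp, fn


def precision_recall_f1(y_true, y_pred, positive=1):
    """Return (precision, recall, F1) for the positive class, using 0.0 when undefined."""
    tp, fp, fn = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical labels (NOT project results): 16 legitimate, 4 fraudulent.
y_true = [0] * 16 + [1] * 4
# Hypothetical predictions: one false alarm, two frauds caught, two missed.
y_pred = [0] * 15 + [1] + [1, 1, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} "
      f"recall={recall:.2f} f1={f1:.2f}")
```

In this toy example accuracy looks comfortable even though half of the fraudulent cases are missed, which is why the fraud-class metrics are the more informative ones on imbalanced data.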