Credit card fraud affects millions of cardholders worldwide and undermines both financial security and trust in digital payments. Data science gives us practical tools to detect and prevent fraudulent activity. Today, we work through the Credit Card Fraud Detection dataset with Python, looking for the patterns that separate fraudulent transactions from legitimate ones.
The Credit Card Fraud Detection dataset, hosted on Kaggle and released by the Machine Learning Group at Université Libre de Bruxelles (ULB) in collaboration with Worldline, is the basis for our exploration. It contains transactions made by European cardholders over two days in September 2013. Features V1 through V28 are the output of a PCA transformation that anonymizes the original attributes; only Time and Amount are left untransformed. The Class column is the label, which makes this a binary classification problem: fraud (1) or legitimate (0).
We start by loading the data and inspecting its structure, summary statistics, and class balance.
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Assuming you have the dataset as 'creditcard.csv'
df = pd.read_csv('creditcard.csv')
# Basic data exploration
print(df.head())
print(df.describe())
# Check for imbalance in the dataset
fraud_counts = df['Class'].value_counts()
print(fraud_counts)
Time V1 V2 V3 V4 V5 V6 V7 \
0 0.0 -1.359807 -0.072781 2.536347 1.378155 -0.338321 0.462388 0.239599
1 0.0 1.191857 0.266151 0.166480 0.448154 0.060018 -0.082361 -0.078803
2 1.0 -1.358354 -1.340163 1.773209 0.379780 -0.503198 1.800499 0.791461
3 1.0 -0.966272 -0.185226 1.792993 -0.863291 -0.010309 1.247203 0.237609
4 2.0 -1.158233 0.877737 1.548718 0.403034 -0.407193 0.095921 0.592941
V8 V9 ... V21 V22 V23 V24 V25 \
0 0.098698 0.363787 ... -0.018307 0.277838 -0.110474 0.066928 0.128539
1 0.085102 -0.255425 ... -0.225775 -0.638672 0.101288 -0.339846 0.167170
2 0.247676 -1.514654 ... 0.247998 0.771679 0.909412 -0.689281 -0.327642
3 0.377436 -1.387024 ... -0.108300 0.005274 -0.190321 -1.175575 0.647376
4 -0.270533 0.817739 ... -0.009431 0.798278 -0.137458 0.141267 -0.206010
V26 V27 V28 Amount Class
0 -0.189115 0.133558 -0.021053 149.62 0
1 0.125895 -0.008983 0.014724 2.69 0
2 -0.139097 -0.055353 -0.059752 378.66 0
3 -0.221929 0.062723 0.061458 123.50 0
4 0.502292 0.219422 0.215153 69.99 0
[5 rows x 31 columns]
Time V1 V2 V3 V4 \
count 284807.000000 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 94813.859575 1.168375e-15 3.416908e-16 -1.379537e-15 2.074095e-15
std 47488.145955 1.958696e+00 1.651309e+00 1.516255e+00 1.415869e+00
min 0.000000 -5.640751e+01 -7.271573e+01 -4.832559e+01 -5.683171e+00
25% 54201.500000 -9.203734e-01 -5.985499e-01 -8.903648e-01 -8.486401e-01
50% 84692.000000 1.810880e-02 6.548556e-02 1.798463e-01 -1.984653e-02
75% 139320.500000 1.315642e+00 8.037239e-01 1.027196e+00 7.433413e-01
max 172792.000000 2.454930e+00 2.205773e+01 9.382558e+00 1.687534e+01
V5 V6 V7 V8 V9 \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean 9.604066e-16 1.487313e-15 -5.556467e-16 1.213481e-16 -2.406331e-15
std 1.380247e+00 1.332271e+00 1.237094e+00 1.194353e+00 1.098632e+00
min -1.137433e+02 -2.616051e+01 -4.355724e+01 -7.321672e+01 -1.343407e+01
25% -6.915971e-01 -7.682956e-01 -5.540759e-01 -2.086297e-01 -6.430976e-01
50% -5.433583e-02 -2.741871e-01 4.010308e-02 2.235804e-02 -5.142873e-02
75% 6.119264e-01 3.985649e-01 5.704361e-01 3.273459e-01 5.971390e-01
max 3.480167e+01 7.330163e+01 1.205895e+02 2.000721e+01 1.559499e+01
... V21 V22 V23 V24 \
count ... 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05
mean ... 1.654067e-16 -3.568593e-16 2.578648e-16 4.473266e-15
std ... 7.345240e-01 7.257016e-01 6.244603e-01 6.056471e-01
min ... -3.483038e+01 -1.093314e+01 -4.480774e+01 -2.836627e+00
25% ... -2.283949e-01 -5.423504e-01 -1.618463e-01 -3.545861e-01
50% ... -2.945017e-02 6.781943e-03 -1.119293e-02 4.097606e-02
75% ... 1.863772e-01 5.285536e-01 1.476421e-01 4.395266e-01
max ... 2.720284e+01 1.050309e+01 2.252841e+01 4.584549e+00
V25 V26 V27 V28 Amount \
count 2.848070e+05 2.848070e+05 2.848070e+05 2.848070e+05 284807.000000
mean 5.340915e-16 1.683437e-15 -3.660091e-16 -1.227390e-16 88.349619
std 5.212781e-01 4.822270e-01 4.036325e-01 3.300833e-01 250.120109
min -1.029540e+01 -2.604551e+00 -2.256568e+01 -1.543008e+01 0.000000
25% -3.171451e-01 -3.269839e-01 -7.083953e-02 -5.295979e-02 5.600000
50% 1.659350e-02 -5.213911e-02 1.342146e-03 1.124383e-02 22.000000
75% 3.507156e-01 2.409522e-01 9.104512e-02 7.827995e-02 77.165000
max 7.519589e+00 3.517346e+00 3.161220e+01 3.384781e+01 25691.160000
Class
count 284807.000000
mean 0.001727
std 0.041527
min 0.000000
25% 0.000000
50% 0.000000
75% 0.000000
max 1.000000
[8 rows x 31 columns]
Class
0 284315
1 492
Name: count, dtype: int64
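As a quick sanity check on the structure shown above, the 31 columns should be Time, V1 through V28, Amount, and Class. The helper below is a minimal sketch of that check; `missing_columns` and `EXPECTED_COLUMNS` are hypothetical names introduced here, not part of pandas.

```python
# Expected schema: Time, 28 PCA components, Amount, and the Class label.
EXPECTED_COLUMNS = ["Time"] + [f"V{i}" for i in range(1, 29)] + ["Amount", "Class"]


def missing_columns(columns):
    """Return the expected column names that are absent from `columns`."""
    present = set(columns)
    return [name for name in EXPECTED_COLUMNS if name not in present]
```

With the DataFrame loaded earlier, `missing_columns(df.columns)` should return an empty list.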
Class Distribution Insight (Balanced or unbalanced? - that is the question!)
The class counts reveal a stark imbalance: only 492 of the 284,807 transactions (about 0.17%) are fraudulent, a common situation in fraud detection datasets. A log-scaled y-axis keeps the small fraud bar visible next to the legitimate one.
sns.countplot(x='Class', data=df)
plt.title('Transaction Class Distribution')
plt.yscale('log')
plt.show()
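To put a number on the imbalance, the sketch below computes the fraud share and the legitimate-to-fraud ratio from the two class counts printed earlier. `imbalance_summary` is a hypothetical helper name. It also reports the accuracy a model would reach by labeling every transaction legitimate, which is one reason plain accuracy can be misleading on data like this.

```python
def imbalance_summary(n_legit, n_fraud):
    """Summarize class imbalance from raw class counts."""
    total = n_legit + n_fraud
    return {
        "fraud_share": n_fraud / total,            # fraction of fraudulent rows
        "legit_per_fraud": n_legit / n_fraud,      # legitimate rows per fraud row
        "always_legit_accuracy": n_legit / total,  # accuracy of predicting 0 everywhere
    }


# Counts taken from the value_counts() output above.
summary = imbalance_summary(n_legit=284315, n_fraud=492)
for name, value in summary.items():
    print(f"{name}: {value:.4f}")
```

With the DataFrame in memory, the same call can be written as `imbalance_summary(fraud_counts[0], fraud_counts[1])`.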
Delving into Transaction Amounts
The distribution of transaction amounts shows how cardholders spend and may reveal characteristics specific to fraudulent transactions.
plt.figure(figsize=(10, 6))
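# Use the same bin edges for both classes so the histograms are directly comparable
amount_range = (0, df['Amount'].max())
sns.histplot(df[df['Class'] == 0]['Amount'], bins=50, binrange=amount_range, color='green', label='Legitimate')
sns.histplot(df[df['Class'] == 1]['Amount'], bins=50, binrange=amount_range, color='red', label='Fraudulent')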
plt.xlabel('Transaction Amount')
plt.ylabel('Frequency')
plt.title('Transaction Amount Distribution')
plt.legend()
plt.yscale('log') # Set the y-axis to log scale
plt.show()
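Because the two classes differ so much in size, a per-class table of summary statistics can complement the histogram. The sketch below uses a tiny made-up frame so it runs without the CSV; `amount_summary_by_class` is a hypothetical helper, and the toy values are invented, not taken from the dataset.

```python
import pandas as pd


def amount_summary_by_class(frame):
    """Count, mean, median, and max of Amount for each Class value."""
    return frame.groupby("Class")["Amount"].agg(["count", "mean", "median", "max"])


# Toy stand-in for the real data (invented values, same column names).
toy = pd.DataFrame(
    {"Class": [0, 0, 0, 1, 1], "Amount": [10.0, 20.0, 30.0, 1.0, 99.0]}
)
toy_summary = amount_summary_by_class(toy)
print(toy_summary)
```

On the real data, the call would be `amount_summary_by_class(df)`.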
Temporal Patterns: Time vs. Fraud
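Investigating the timing of transactions may uncover temporal patterns indicative of fraudulent activity. The Time column holds the seconds elapsed since the first transaction in the dataset, so with a maximum of 172,792 seconds the x-axis spans roughly 48 hours.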
plt.figure(figsize=(12, 8))
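# Use the same bin edges for both classes so the histograms are directly comparable
time_range = (0, df['Time'].max())
sns.histplot(df[df['Class'] == 0]['Time'], bins=50, binrange=time_range, color='green', label='Legitimate', kde=True)
sns.histplot(df[df['Class'] == 1]['Time'], bins=50, binrange=time_range, color='red', label='Fraudulent', kde=True)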
plt.xlabel('Time (in Seconds)')
plt.ylabel('Frequency')
plt.title('Transaction Time Distribution')
plt.legend()
plt.yscale('log')
plt.show()
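Raw seconds are hard to read, so one option is to fold Time into a 24-hour cycle and compare fraud rates per hour. This is a sketch under an assumption: hour 0 is the hour of the first transaction in the dataset, which is not necessarily midnight. `fraud_rate_by_hour` is a hypothetical helper, and the toy frame uses invented values.

```python
import pandas as pd


def fraud_rate_by_hour(frame):
    """Transactions, fraud count, and fraud rate per hour of a 24-hour cycle."""
    # Assumption: Time is seconds since the first transaction, so hour 0 is
    # relative to that first transaction rather than to wall-clock midnight.
    hour = ((frame["Time"] // 3600) % 24).astype(int).rename("hour")
    stats = frame.groupby(hour)["Class"].agg(["count", "sum", "mean"])
    return stats.rename(
        columns={"count": "transactions", "sum": "frauds", "mean": "fraud_rate"}
    )


# Toy stand-in for the real data (invented values, same column names).
toy = pd.DataFrame(
    {"Time": [0.0, 1800.0, 3600.0, 7200.0, 90000.0], "Class": [0, 1, 0, 0, 1]}
)
by_hour = fraud_rate_by_hour(toy)
print(by_hour)
```

On the real data, `fraud_rate_by_hour(df)` returns one row per hour, which can be plotted or scanned for hours with unusually high fraud rates.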