E-Commerce Platform Data Analysis and Predictive Modeling

Data Source

  1. Online Retail Dataset (https://archive.ics.uci.edu/ml/datasets/Online%20Retail)

    This dataset contains all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based and registered non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.

Overview

Python Packages Used: datetime, pandas, numpy, matplotlib, seaborn

Customer Nationality

Figure: Customer Nationality

Sales Trend

Figure: Sales Grouped by Week
Figure: Average Sales of Each Day in a Week

RFM Analysis

Python Packages Used: datetime, pandas, numpy, matplotlib, seaborn

Using recency, frequency, and monetary value, all customers are segmented into 8 categories. Each indicator is split into "high" or "low" at its median.

| Customer Type | Recency | Frequency | Monetary |
|---|---|---|---|
| High-Value Loyalist | High | High | High |
| High-Value Loyalist At Risk | Low | High | High |
| Active Potential Loyalist | High | Low | High |
| Hibernating Potential Loyalist | Low | Low | High |
| Low-Value Loyalist | High | High | Low |
| Low-Value Loyalist At Risk | Low | High | Low |
| New Customer | High | Low | Low |
| Hibernating | Low | Low | Low |
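
The median split can be sketched in plain Python as follows. This is a minimal illustration, not the project's code: the function name `segment_customers`, the input layout, and the toy values are hypothetical. It also assumes that "High" recency means a more recent purchase (fewer days since the last order) and that values equal to the median are handled as shown in the comments.

```python
from statistics import median

# Segment names from the table above, keyed by (recency, frequency, monetary).
SEGMENTS = {
    ("High", "High", "High"): "High-Value Loyalist",
    ("Low", "High", "High"): "High-Value Loyalist At Risk",
    ("High", "Low", "High"): "Active Potential Loyalist",
    ("Low", "Low", "High"): "Hibernating Potential Loyalist",
    ("High", "High", "Low"): "Low-Value Loyalist",
    ("Low", "High", "Low"): "Low-Value Loyalist At Risk",
    ("High", "Low", "Low"): "New Customer",
    ("Low", "Low", "Low"): "Hibernating",
}


def segment_customers(rfm):
    """rfm maps customer id -> (days_since_last_purchase, n_purchases, total_spent)."""
    med_days = median(v[0] for v in rfm.values())
    med_freq = median(v[1] for v in rfm.values())
    med_money = median(v[2] for v in rfm.values())
    result = {}
    for cid, (days, freq, money) in rfm.items():
        # Assumption: fewer days since the last purchase means "High" recency;
        # a value equal to the median counts as "High" here.
        r = "High" if days <= med_days else "Low"
        # Assumption: frequency and monetary must exceed the median to be "High".
        f = "High" if freq > med_freq else "Low"
        m = "High" if money > med_money else "Low"
        result[cid] = SEGMENTS[(r, f, m)]
    return result


# Toy data, not taken from the dataset.
toy = {
    "A": (5, 20, 900.0),
    "B": (200, 18, 850.0),
    "C": (10, 1, 30.0),
    "D": (300, 2, 40.0),
}
segments = segment_customers(toy)
```

With the toy values, customer A lands in "High-Value Loyalist" and customer D in "Hibernating".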
Figure: Proportion of Different Types of Customer
Figure: Amount of Money Spent Grouped by Customer Type
Figure: RFM Analysis Result
Note: The frequency and monetary axes are on a log10 scale.

The RFM segmentation is compared with the result of the k-means clustering algorithm. Before k-means is fitted, the RFM values are standardized with the standard scaler. The number of clusters is set to 8 so that the clusters can be compared directly with the 8 RFM segments.
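
A minimal sketch of that pipeline with scikit-learn is shown below. The RFM matrix here is synthetic random data, and `n_init` and `random_state` are illustrative choices rather than settings taken from the project.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the RFM table: recency, frequency, monetary.
rfm = np.column_stack([
    rng.integers(1, 374, size=200),            # days since last purchase
    rng.integers(1, 50, size=200),             # number of purchases
    rng.lognormal(mean=6, sigma=1, size=200),  # total amount spent
]).astype(float)

# Standardize each column to zero mean and unit variance so that no single
# indicator dominates the Euclidean distances used by k-means.
scaled = StandardScaler().fit_transform(rfm)

# Eight clusters, matching the eight RFM segments.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
labels = kmeans.fit_predict(scaled)
```

Each entry of `labels` is a cluster index from 0 to 7, which can be cross-tabulated against the RFM segment of the same customer.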

Figure: Kmeans Result
Note: The frequency and monetary axes are on a log10 scale.

Churn Prediction

Python Packages Used: datetime, pandas, numpy, matplotlib, seaborn, statistics, sklearn

A customer is considered churned when more than 90 days have passed since the last purchase. The 90-day threshold is chosen based on the density graph of purchase intervals below.

Figure: Customer Purchase Interval
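
The 90-day rule can be sketched as follows. The function name `label_churn` and the toy dates are hypothetical, and two details are assumptions: the interval is measured up to the last date covered by the dataset, and a gap of exactly 90 days does not yet count as churn.

```python
from datetime import date, timedelta

CHURN_WINDOW = timedelta(days=90)


def label_churn(last_purchase, snapshot):
    """Map customer id -> True if the last purchase is more than 90 days before snapshot."""
    return {cid: (snapshot - d) > CHURN_WINDOW for cid, d in last_purchase.items()}


# Assumption: the snapshot is the last date covered by the dataset.
snapshot = date(2011, 12, 9)

# Toy customers, not taken from the dataset.
last_purchase = {
    "A": date(2011, 12, 1),  # 8 days before the snapshot
    "B": date(2011, 9, 10),  # exactly 90 days before the snapshot
    "C": date(2011, 6, 1),   # 191 days before the snapshot
}
churned = label_churn(last_purchase, snapshot)
```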

A number of feature variables are derived from the original dataset. The nationality of each customer is transformed into dummy variables.

| Variable Code | Meaning |
|---|---|
| AveMonetery | The average amount of money spent based on frequency |
| NumBuy | The number of different types of items bought |
| AveQuantBuy | The average quantity of items bought in each invoice |
| AvePriceBuy | The average price of items bought in each invoice |
| AveTotalBuy | The average amount of money spent in each purchase |
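
One possible way to derive such features with pandas is sketched below. The column names follow the Online Retail dataset, but the rows are toy data and the exact definitions are assumptions read off the table above: per-invoice values are computed first and then averaged per customer. AveMonetery is left out because its definition is not spelled out in enough detail.

```python
import pandas as pd

# Toy rows using the column names of the Online Retail dataset.
df = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1"],
    "StockCode": ["X", "Y", "X", "Z"],
    "Quantity": [2, 1, 4, 10],
    "UnitPrice": [3.0, 5.0, 3.0, 1.5],
    "Country": ["United Kingdom", "United Kingdom", "United Kingdom", "France"],
})
df["Total"] = df["Quantity"] * df["UnitPrice"]

# Aggregate per invoice first, then average the invoices of each customer.
per_invoice = df.groupby(["CustomerID", "InvoiceNo"]).agg(
    quantity=("Quantity", "sum"),
    price=("UnitPrice", "mean"),
    total=("Total", "sum"),
).reset_index()

features = per_invoice.groupby("CustomerID").agg(
    AveQuantBuy=("quantity", "mean"),
    AvePriceBuy=("price", "mean"),
    AveTotalBuy=("total", "mean"),
)

# Number of distinct products each customer bought.
features["NumBuy"] = df.groupby("CustomerID")["StockCode"].nunique()

# One dummy column per country (assumes one country per customer).
country = df.groupby("CustomerID")["Country"].first()
features = features.join(pd.get_dummies(country, prefix="Country"))
```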

A number of different predictive algorithms are tested using k-fold cross-validation.

| Model | Average Accuracy | Standard Deviation | Maximum Accuracy | Minimum Accuracy |
|---|---|---|---|---|
| Logistic Regression | 0.719752 | 0.011095 | 0.735823 | 0.699862 |
| Gradient Boosting | 0.713528 | 0.009773 | 0.73029 | 0.697095 |
| Ada Boost | 0.712604 | 0.011968 | 0.728907 | 0.692946 |
| Random Forest | 0.702928 | 0.007709 | 0.71231 | 0.688797 |
| K-Neighbors | 0.683798 | 0.00671 | 0.69018 | 0.669433 |
| SVC | 0.666745 | 0.002955 | 0.669433 | 0.661602 |
| Gaussian Naive Bayes | 0.663287 | 0.003052 | 0.666667 | 0.65884 |
| Decision Tree | 0.635171 | 0.008165 | 0.650069 | 0.627939 |
Figure: ROC of Logistic Regression
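
A comparison of this kind can be sketched with scikit-learn as follows. The data is synthetic, only three of the models are included to keep the example short, and the number of folds (10) and all hyperparameters are assumptions, since the settings used in the project are not stated here.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data standing in for the churn features.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=50, random_state=0),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Assumption: 10 shuffled folds.
cv = KFold(n_splits=10, shuffle=True, random_state=0)

summary = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    # Same four statistics as the columns of the table above.
    summary[name] = {
        "mean": scores.mean(),
        "std": scores.std(),
        "max": scores.max(),
        "min": scores.min(),
    }
```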

The feature importance of the logistic regression model is further analyzed using the permutation method.

Figure: Feature Importance
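
The permutation method shuffles one feature at a time and measures how much the model's score drops. A minimal sketch using `sklearn.inspection.permutation_importance` is given below; the two-column synthetic data, `n_repeats`, and the accuracy scoring are illustrative assumptions, not the project's settings.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)   # informative feature
noise = rng.normal(size=n)    # irrelevant feature
X = np.column_stack([signal, noise])
y = (signal > 0).astype(int)  # the label depends only on the first column

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)

# Shuffle each column 20 times on held-out data and record the accuracy drop.
result = permutation_importance(
    model, X_test, y_test, n_repeats=20, random_state=0, scoring="accuracy"
)
importances = result.importances_mean
```

Here the first entry of `importances` is large because shuffling the informative column destroys the prediction, while the second stays close to zero.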

Customer Lifetime Value Analysis

Python Packages Used: datetime, pandas, numpy, matplotlib, seaborn, lifetimes

Using the Beta-Geometric/Negative Binomial Distribution (BG/NBD) model, the expected number of future purchases and the probability that a customer is still alive are estimated from frequency and recency.

Figure: Frequency Recency Matrix
Figure: Probability Alive Matrix
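
To make the "probability alive" quantity concrete, the closed-form BG/NBD expression can be evaluated directly. The sketch below is independent of the lifetimes package; the function name is hypothetical and the parameter values in the usage lines are made up for illustration, not fitted to this dataset.

```python
def probability_alive(x, t_x, T, r, alpha, a, b):
    """BG/NBD probability that a customer is still active.

    x: number of repeat purchases, t_x: time of the last purchase (recency),
    T: customer age, (r, alpha, a, b): model parameters.
    """
    if x == 0:
        # In the BG/NBD model a customer can only drop out right after a
        # repeat purchase, so a customer with none is alive with certainty.
        return 1.0
    odds_dead = (a / (b + x - 1)) * ((alpha + T) / (alpha + t_x)) ** (r + x)
    return 1.0 / (1.0 + odds_dead)


# Made-up parameters, for illustration only.
params = dict(r=0.8, alpha=70.0, a=0.1, b=2.0)

recent_buyer = probability_alive(x=10, t_x=372, T=373, **params)
lapsed_buyer = probability_alive(x=10, t_x=100, T=373, **params)
```

With the same frequency, the customer whose last purchase is close to the end of the observation period gets a much higher probability of being alive than the one who has been silent for a long time.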

Based on the above analysis, the 5 customers with the highest expected number of purchases in the next period are selected.

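| Customer ID | Frequency | T | Recency | Monetary | predicted_purchases |
|---|---|---|---|---|---|
| 14911 | 132 | 373 | 372 | 1089.583788 | 0.316007 |
| 12748 | 114 | 373 | 372 | 295.787105 | 0.273483 |
| 17841 | 112 | 373 | 371 | 365.996161 | 0.268593 |
| 15311 | 90 | 373 | 373 | 675.198889 | 0.216880 |
| 14606 | 89 | 373 | 372 | 136.591573 | 0.214418 |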

Using the Gamma-Gamma model, the expected average value of each customer can be estimated. The 5 customers with the highest expected value are selected.

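| Customer ID | Frequency | T | Recency | Monetary | predicted_purchases | conditional_expected_average_profit |
|---|---|---|---|---|---|---|
| 16446 | 2 | 205 | 204 | 84236.250000 | 0.013693 | 84660.425564 |
| 12346 | 1 | 325 | 0 | 77183.600000 | 0.000047 | 77965.789328 |
| 15098 | 1 | 182 | 0 | 39916.500000 | 0.000306 | 40326.709551 |
| 15749 | 2 | 332 | 97 | 22267.150000 | 0.001654 | 22383.590581 |
| 16000 | 1 | 2 | 0 | 12393.700000 | 0.037296 | 12529.192039 |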
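
For reference, the Gamma-Gamma estimate of a customer's expected average transaction value has a simple closed form. The sketch below is independent of the lifetimes package; the function name is hypothetical and the parameters `p`, `q`, `v` in the usage line are made-up values, not the ones fitted in this project.

```python
def conditional_expected_average_value(x, m_x, p, q, v):
    """Gamma-Gamma estimate of a customer's mean transaction value.

    x: number of transactions observed, m_x: their average value,
    (p, q, v): model parameters.
    """
    # Closed form p * (v + x * m_x) / (p * x + q - 1). For q > 1 this equals
    # a weighted average of the population mean p * v / (q - 1) and the
    # customer's own average m_x, with weight p * x / (p * x + q - 1) on m_x.
    return p * (v + x * m_x) / (p * x + q - 1)


# Made-up parameters, for illustration only.
estimate = conditional_expected_average_value(x=2, m_x=100.0, p=6.0, q=4.0, v=15.0)
```

With these values the population mean is 30 and the customer's own average is 100, so the estimate of 86 sits between the two and moves closer to 100 as more transactions are observed.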