E-Commerce Platform Data Analysis and Predictive Modeling

Data Source

  1. Online Retail Dataset (https://archive.ics.uci.edu/ml/datasets/Online%20Retail)

    This dataset contains all transactions that occurred between 01/12/2010 and 09/12/2011 for a UK-based, registered, non-store online retailer. The company mainly sells unique all-occasion gifts. Many of its customers are wholesalers.


Python Packages Used: datetime, pandas, numpy, matplotlib, seaborn

Customer Nationality

Customer Nationality

Sales Trend

Sales Grouped by Week
Average Sales of Each Day in a Week
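
The two sales views above can be derived from the transaction table with a few pandas group-by steps. The following is a minimal sketch, not the original notebook code: it assumes the UCI column names `InvoiceDate`, `Quantity`, and `UnitPrice`, and the rows are made-up toy values.

```python
import pandas as pd

# Toy rows standing in for the Online Retail transactions (values are made up).
df = pd.DataFrame({
    "InvoiceDate": pd.to_datetime([
        "2010-12-01 08:26", "2010-12-01 10:03", "2010-12-02 09:15",
        "2010-12-08 11:40", "2010-12-09 14:05", "2010-12-15 16:30",
    ]),
    "Quantity": [6, 2, 10, 4, 1, 3],
    "UnitPrice": [2.55, 7.65, 1.25, 3.39, 12.75, 4.95],
})
df["Sales"] = df["Quantity"] * df["UnitPrice"]

# Total sales per calendar week (pandas "W" bins end on Sunday).
weekly_sales = df.set_index("InvoiceDate")["Sales"].resample("W").sum()

# Average sales for each day of the week: total the sales per calendar day
# first, then average those daily totals by weekday name.
daily_sales = df.groupby(df["InvoiceDate"].dt.normalize())["Sales"].sum()
weekday_avg = daily_sales.groupby(daily_sales.index.day_name()).mean()

print(weekly_sales)
print(weekday_avg)
```

Averaging daily totals (rather than individual invoice lines) keeps busy days from being over-weighted; whether the original plots were built this way is an assumption.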

RFM Analysis

Python Packages Used: datetime, pandas, numpy, matplotlib, seaborn

Using recency, frequency, and monetary value, all customers are segmented into 8 categories, with each indicator split into “high” or “low” at its median.

Customer Type                    Recency  Frequency  Monetary
High-Value Loyalist              High     High       High
High-Value Loyalist At Risk      Low      High       High
Active Potential Loyalist        High     Low        High
Hibernating Potential Loyalist   Low      Low        High
Low-Value Loyalist               High     High       Low
Low-Value Loyalist At Risk       Low      High       Low
New Customer                     High     Low        Low

Proportion of Different Types of Customer and Amount of Money Spent Grouped by Customer Type
RFM Analysis Result. Note: The frequency and monetary axes are on a log10 scale.
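
The median split described above can be sketched as follows. This is an illustration under stated assumptions, not the original code: recency is measured here as days since the last purchase, so "High" recency means fewer days than the median; values equal to the median are labelled "Low"; the column names and toy values are made up. The segment names come from the table above, and the Low/Low/Low combination, which the table does not name, is left as "Not listed".

```python
import pandas as pd

# Toy per-customer RFM table (values are made up for illustration).
rfm = pd.DataFrame({
    "CustomerID": [1, 2, 3, 4, 5, 6],
    "RecencyDays": [5, 200, 12, 300, 40, 90],   # assumed: days since last purchase
    "Frequency": [12, 9, 1, 2, 7, 1],
    "Monetary": [950.0, 780.0, 640.0, 510.0, 80.0, 35.0],
}).set_index("CustomerID")


def high_low(series, smaller_is_better=False):
    """Label each value 'High' or 'Low' relative to the series median."""
    median = series.median()
    is_high = (series < median) if smaller_is_better else (series > median)
    return is_high.map({True: "High", False: "Low"})


# (Recency, Frequency, Monetary) -> segment name, as listed in the table.
SEGMENTS = {
    ("High", "High", "High"): "High-Value Loyalist",
    ("Low", "High", "High"): "High-Value Loyalist At Risk",
    ("High", "Low", "High"): "Active Potential Loyalist",
    ("Low", "Low", "High"): "Hibernating Potential Loyalist",
    ("High", "High", "Low"): "Low-Value Loyalist",
    ("Low", "High", "Low"): "Low-Value Loyalist At Risk",
    ("High", "Low", "Low"): "New Customer",
}

r = high_low(rfm["RecencyDays"], smaller_is_better=True)
f = high_low(rfm["Frequency"])
m = high_low(rfm["Monetary"])
rfm["Segment"] = [SEGMENTS.get(key, "Not listed") for key in zip(r, f, m)]

print(rfm)
```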

The RFM segmentation is compared with the result of the K-means clustering algorithm. Before K-means is fitted, the RFM values are standardized with a standard scaler. The number of clusters is set to 8 so that the clusters can be compared directly with the eight RFM categories.

K-means Result. Note: The frequency and monetary axes are on a log10 scale.
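
A minimal sketch of the clustering step, assuming scikit-learn's `StandardScaler` and `KMeans` are what "standard scaler" and "Kmeans" refer to. The RFM values below are synthetic, and `n_init` and `random_state` are illustrative choices that the text does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic RFM values (made up); the real table has one row per customer.
rng = np.random.default_rng(0)
rfm = pd.DataFrame({
    "Recency": rng.integers(1, 365, size=200),
    "Frequency": rng.integers(1, 50, size=200),
    "Monetary": rng.gamma(shape=2.0, scale=300.0, size=200),
})

# Standardize each column to zero mean and unit variance so that no single
# indicator dominates the Euclidean distances used by K-means.
scaled = StandardScaler().fit_transform(rfm[["Recency", "Frequency", "Monetary"]])

# Eight clusters, matching the eight RFM categories.
kmeans = KMeans(n_clusters=8, n_init=10, random_state=0)
rfm["Cluster"] = kmeans.fit_predict(scaled)

print(rfm["Cluster"].value_counts().sort_index())
```

Without the scaling step, the monetary column, which spans a much wider numeric range than the other two, would largely determine the cluster assignments.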

Churn Prediction

Python Packages Used: datetime, pandas, numpy, matplotlib, seaborn, statistics, sklearn

To decide whether a customer has churned, a threshold of 90 days since the last purchase is used. The threshold is chosen from the density plot of purchase intervals below.

Customer Purchase Interval
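
The 90-day rule can be sketched as below. This assumes the reference date is the last date in the data, which the text does not state; the transactions are toy values and the variable names are illustrative.

```python
import pandas as pd

CHURN_DAYS = 90  # threshold stated in the text

# Toy transactions (made up).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 3],
    "InvoiceDate": pd.to_datetime(
        ["2011-03-01", "2011-11-20", "2011-01-15", "2011-06-30", "2011-09-10"]
    ),
})

# Days between consecutive purchases of the same customer; a density plot of
# these values is what the threshold would be read from.
intervals = (
    tx.sort_values(["CustomerID", "InvoiceDate"])
    .groupby("CustomerID")["InvoiceDate"]
    .diff()
    .dt.days
    .dropna()
)

# Assumption: the reference date is the last date present in the data.
snapshot = tx["InvoiceDate"].max()

last_purchase = tx.groupby("CustomerID")["InvoiceDate"].max()
days_since_last = (snapshot - last_purchase).dt.days
churned = days_since_last > CHURN_DAYS

print(pd.DataFrame({"days_since_last": days_since_last, "churned": churned}))
```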

A number of feature variables are derived from the original dataset. Customer nationality is encoded as dummy variables.

Variable Code   Meaning
AveMonetery     The average amount of money spent based on frequency
NumBuy          The number of different types of items bought
AveQuantBuy     The average quantity of items bought in each invoice
AvePriceBuy     The average price of items bought in each invoice
AveTotalBuy     The average amount of money spent in each purchase
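
As an illustration, three of these variables and the nationality dummies could be computed along the following lines. This is one possible reading of the definitions above, not the original code: it assumes the UCI column names (`InvoiceNo`, `StockCode`, `Quantity`, `UnitPrice`, `Country`), treats each invoice as one purchase, and uses made-up rows.

```python
import pandas as pd

# Toy invoice lines (made up).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 2],
    "InvoiceNo": ["A1", "A1", "A2", "B1", "B1"],
    "StockCode": ["S1", "S2", "S1", "S3", "S4"],
    "Quantity": [2, 4, 6, 1, 3],
    "UnitPrice": [1.0, 2.0, 1.0, 5.0, 3.0],
    "Country": ["United Kingdom"] * 3 + ["France"] * 2,
})
tx["Total"] = tx["Quantity"] * tx["UnitPrice"]

# Aggregate to one row per invoice, then average the invoices per customer.
per_invoice = tx.groupby(["CustomerID", "InvoiceNo"]).agg(
    quantity=("Quantity", "sum"),
    total=("Total", "sum"),
)

features = pd.DataFrame({
    # Number of distinct products the customer has bought.
    "NumBuy": tx.groupby("CustomerID")["StockCode"].nunique(),
    # Average number of items per invoice.
    "AveQuantBuy": per_invoice.groupby(level="CustomerID")["quantity"].mean(),
    # Average amount spent per invoice.
    "AveTotalBuy": per_invoice.groupby(level="CustomerID")["total"].mean(),
})

# One dummy column per country (each customer is assumed to have one country).
country = tx.groupby("CustomerID")["Country"].first()
features = features.join(pd.get_dummies(country, prefix="Country"))

print(features)
```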

Several predictive algorithms are tested using k-fold cross-validation.

Model                  Average Accuracy   Standard Deviation   Maximum Accuracy   Minimum Accuracy
Logistic Regression    0.719752           0.011095             0.735823           0.699862
Gradient Boosting      0.713528           0.009773             0.73029            0.697095
Ada Boost              0.712604           0.011968             0.728907           0.692946
Random Forest          0.702928           0.007709             0.71231            0.688797
Gaussian Naive Bayes   0.663287           0.003052             0.666667           0.65884
Decision Tree          0.635171           0.008165             0.650069           0.627939

ROC of Logistic Regression
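
A comparison of this kind can be set up with scikit-learn's `cross_val_score`. The sketch below uses synthetic data in place of the churn features, default hyperparameters, and 5 folds; the number of folds and the model settings used for the table above are not stated, so these are assumptions, and the accuracies it prints are unrelated to the reported ones.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the churn features (X) and churn labels (y).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Gradient Boosting": GradientBoostingClassifier(random_state=0),
    "Ada Boost": AdaBoostClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
    "Gaussian Naive Bayes": GaussianNB(),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
}

# Assumption: 5 shuffled folds; the text only says "k-fold".
cv = KFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")
    results[name] = {
        "mean": scores.mean(),
        "std": scores.std(),
        "max": scores.max(),
        "min": scores.min(),
    }

for name, row in results.items():
    print(f"{name:22s} {row['mean']:.3f} {row['std']:.3f} {row['max']:.3f} {row['min']:.3f}")
```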

The feature importance of the logistic regression model is further analyzed using the permutation method.

Feature Importance
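
Permutation importance measures how much a model's score drops when the values of a single feature are shuffled. A minimal sketch with scikit-learn's `permutation_importance`, using synthetic data and a held-out split; whether the original analysis scored on held-out data, and how many repeats it used, are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn features and labels.
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=2, n_redundant=0, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Shuffle each feature 10 times and record the drop in accuracy.
result = permutation_importance(
    model, X_test, y_test, scoring="accuracy", n_repeats=10, random_state=0
)

# Feature indices ordered from most to least important.
ranking = result.importances_mean.argsort()[::-1]
for idx in ranking:
    print(f"feature {idx}: {result.importances_mean[idx]:.4f} "
          f"+/- {result.importances_std[idx]:.4f}")
```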

Customer Lifetime Value Analysis

Python Packages Used: datetime, pandas, numpy, matplotlib, seaborn, lifetimes

Using the Beta-Geometric/Negative Binomial Distribution (BG/NBD) model, the expected number of future purchases and the probability that a customer is still alive are estimated from frequency and recency.

Frequency Recency Matrix
Probability Alive Matrix
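
To make the inputs of the BG/NBD model concrete, the sketch below builds the frequency, recency, and T columns with plain pandas and evaluates the model's closed-form probability of being alive. In the `lifetimes` package these steps correspond to `summary_data_from_transaction_data` and `BetaGeoFitter.conditional_probability_alive`. The transactions are toy values, and the parameters `r`, `alpha`, `a`, and `b` are made-up placeholders; in practice they come from fitting the model to the data.

```python
import numpy as np
import pandas as pd

# Toy transactions (made up).
tx = pd.DataFrame({
    "CustomerID": [1, 1, 1, 2, 3, 3],
    "InvoiceDate": pd.to_datetime([
        "2011-01-01", "2011-03-01", "2011-06-01",
        "2011-02-10",
        "2011-01-05", "2011-01-20",
    ]),
})
period_end = pd.Timestamp("2011-12-09")  # last date covered by the dataset

# Count each customer at most once per day.
days = tx.assign(Day=tx["InvoiceDate"].dt.normalize())
days = days.drop_duplicates(["CustomerID", "Day"])
grouped = days.groupby("CustomerID")["Day"]
first, last = grouped.min(), grouped.max()

summary = pd.DataFrame({
    "frequency": grouped.count() - 1,     # repeat purchases only
    "recency": (last - first).dt.days,    # customer age at the last purchase
    "T": (period_end - first).dt.days,    # customer age at the period end
})


def probability_alive(frequency, recency, T, r, alpha, a, b):
    """BG/NBD probability that a customer is still alive."""
    x = np.asarray(frequency, dtype=float)
    t_x = np.asarray(recency, dtype=float)
    T = np.asarray(T, dtype=float)
    ratio = (alpha + T) / (alpha + t_x)
    p = 1.0 / (1.0 + a / (b + np.maximum(x, 1.0) - 1.0) * ratio ** (r + x))
    # Without a repeat purchase the model treats the customer as alive.
    return np.where(x == 0, 1.0, p)


# Placeholder parameters, NOT fitted to the Online Retail data.
params = {"r": 0.25, "alpha": 4.0, "a": 0.8, "b": 2.5}
summary["p_alive"] = probability_alive(
    summary["frequency"], summary["recency"], summary["T"], **params
)

print(summary)
```

Customer 3, who bought twice within two weeks and then stayed silent for most of the year, gets a much lower probability than customer 1, whose last purchase is more recent relative to the observation window.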

Based on the above analysis, the 5 customers with the highest expected number of purchases in the next period are selected.

Customer ID   Frequency   T   Recency   Monetary   predicted_purchases

Using the Gamma-Gamma model, the expected average value of each customer can be estimated. The 5 customers with the highest expected value are selected.

Customer ID   Frequency   T   Recency   Monetary   predicted_purchases   conditional_expected_average_profit
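
The Gamma-Gamma estimate has a closed form: a weighted average of the population mean and the customer's own observed average, where the weight on the customer's own average grows with the number of repeat purchases. In `lifetimes` this is what `GammaGammaFitter.conditional_expected_average_profit` returns. The sketch below evaluates the formula directly with NumPy; the parameters `p`, `q`, and `v` are made-up placeholders rather than values fitted to this dataset, and the formula requires `q > 1`.

```python
import numpy as np


def expected_average_value(frequency, monetary_value, p, q, v):
    """Gamma-Gamma conditional expectation of a customer's average order value."""
    x = np.asarray(frequency, dtype=float)
    m = np.asarray(monetary_value, dtype=float)
    # Weight on the customer's own average; 0 when there are no repeat purchases.
    weight = p * x / (p * x + q - 1.0)
    population_mean = v * p / (q - 1.0)
    return (1.0 - weight) * population_mean + weight * m


# Placeholder parameters, NOT fitted to the Online Retail data.
p, q, v = 6.25, 3.74, 15.44
population_mean = v * p / (q - 1.0)

frequency = [0, 2, 50]
monetary = [0.0, 100.0, 100.0]
estimates = expected_average_value(frequency, monetary, p, q, v)

print(population_mean)
print(estimates)
```

With no repeat purchases the estimate equals the population mean; with many repeat purchases it approaches the customer's own observed average.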