Google Merchandise Store

Google Analytics Customer Revenue Prediction

Predict how much GStore customers will spend

--

Table of Contents:

  1. Business problem
  2. About data
  3. About features
  4. Performance metrics for the problem
  5. Machine learning problem formulation
  6. Data loading and Preprocessing
  7. Data cleaning
  8. Exploratory data analysis
  9. Feature engineering
  10. Time-series featurization
  11. Additional features
  12. Preparing train and test data sets
  13. Machine learning models
  14. Hyper-parameter tuning for classification model
  15. Hyper-parameter tuning for regression model
  16. Final Model
  17. Results
  18. Feature Importance
  19. Trying ensemble models with bagging
  20. Future work
  21. GitHub, LinkedIn profile links
  22. References
1. Business Problem :

  • In almost every business the 80-20 rule holds: roughly 80% of the revenue is generated by only 20% of the potential customers. Our goal is to predict the revenue that those potential customers will generate in the near future, so that marketing teams can invest the promotional budget where it attracts the most valuable customers.
  • In simple words, we are given users' past data and transactions (from their visits to the GStore), and using this data we need to predict the future revenue that those customers will create.
  • Google provided the Merchandise Store customer dataset along with the number of transactions per customer. We will build a predictive model on the GStore dataset to predict the total revenue per customer, which helps in better use of the marketing budget, and we will also interpret which elements have the most impact on the predicted total revenue using different models.

2. About data :

We have downloaded the data from the Kaggle link below :

  • We need to download train_v2.csv and test_v2.csv.
  • We will be predicting the target for all users in the posted test set, test_v2.csv, for their transactions in the future time period of December 1st 2018 through January 31st 2019.
  • Each row in the dataset is one visit to the store. Because we are predicting the log of the total revenue per user, not all rows in test_v2.csv will correspond to a row in the submission, but all unique fullVisitorIds will correspond to a row in the submission.
  • Some of the features are in JSON format, so we need to parse those JSON columns; we will cover this in detail when reading the data.

3. About features/columns/independent variables :

  • fullVisitorId : A unique identifier for each user of the Google Merchandise Store.
  • channelGrouping : The channel via which the user came to the Store.
  • date : The date on which the user visited the Store.
  • device : The specifications for the device used to access the Store.
  • geoNetwork : This section contains information about the geography of the user.
  • sessionId : A unique identifier for this visit to the store.
  • socialEngagementType : Engagement type, either “Socially Engaged” or “Not Socially Engaged”.
  • totals : This section contains aggregate values across the session.
  • trafficSource : This section contains information about the Traffic Source from which the session originated.
  • visitId : An identifier for this session. This is part of the value usually stored as the utmb cookie. This is only unique to the user. For a completely unique ID, you should use a combination of fullVisitorId and visitId.
  • visitNumber : The session number for this user. If this is the first session, then this is set to 1.
  • visitStartTime : The timestamp (expressed as POSIX time).

for more details about features description: https://support.google.com/analytics/answer/3437719?hl=en

4. Performance metric for the problem :

  • Submissions are scored on the root mean squared error (RMSE).
  • RMSE is defined as:

RMSE = sqrt( (1/n) * Σ (ŷ_i - y_i)² )

where ŷ is the natural log of the predicted revenue for a customer and y is the natural log of the actual summed revenue value plus one.
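As a small sanity check, this metric can be computed with a few lines of NumPy (the arrays here are purely illustrative):

import numpy as np

def rmse(y_true, y_pred):
    # y_true: natural log of (actual summed revenue per customer + 1)
    # y_pred: natural log of the predicted revenue per customer
    return np.sqrt(np.mean((y_pred - y_true) ** 2))

# example: three customers, two of whom spent nothing
actual = np.log1p(np.array([0.0, 0.0, 120.5]))
predicted = np.log1p(np.array([0.0, 3.2, 98.0]))
print(rmse(actual, predicted))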

5. Machine learning problem formulation :

  • Here we are going to predict the revenue (in dollars) generated by a customer when they visit the store, so we can pose this as a regression problem.
  • Following some of the Kaggle discussions and the winners' solutions, the problem is typically solved in two stages.
  • First, a classification model predicts whether a user will visit the store during the test period; if there is a chance that they will visit, a regression model then predicts the revenue that the customer is going to generate.

Solving the problem as Classification + Regression is motivated by Hurdle Model.

Hurdle Model :-

  • This model is a preferred way of solving problems where the target variable contains many more zeroes than non-zero values.

    It recommends solving the problem by:
    * classifying whether the value is going to be non-zero or not,
    * and then predicting the amount for the non-zero cases.
    The solution implemented for this challenge is based on the above model.
  • We will discuss this in more detail during featurization and model building.

6. Data loading and Preprocessing :

Data loading and parsing JSON columns
  • Train dataset shape : (1708337, 60)
  • Test dataset shape : (401589, 59)
  • Each record corresponds to one visit to the store.

Note : I am using a virtual machine with a higher configuration, so I am not running into memory issues while reading this huge dataset. If your system has a low configuration, you can use a Dask DataFrame to avoid memory issues and speed up the operations.
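As noted earlier, four of the raw columns (device, geoNetwork, totals, trafficSource) are JSON strings. Here is a minimal sketch of one common way to flatten them with pandas; the helper name is illustrative, and this sketch only handles those four columns:

import json
import pandas as pd
from pandas import json_normalize

JSON_COLUMNS = ['device', 'geoNetwork', 'totals', 'trafficSource']

def load_and_flatten(csv_path, nrows=None):
    # read the JSON columns as dicts, then flatten them into regular columns
    df = pd.read_csv(csv_path,
                     converters={col: json.loads for col in JSON_COLUMNS},
                     dtype={'fullVisitorId': 'str'},  # keep IDs as strings to avoid precision loss
                     nrows=nrows)
    for col in JSON_COLUMNS:
        flat = json_normalize(df[col].tolist())
        flat.columns = [f'{col}.{sub}' for sub in flat.columns]  # e.g. 'device.browser'
        df = df.drop(columns=[col]).merge(flat, left_index=True, right_index=True)
    return df

# train_df = load_and_flatten('train_v2.csv')
# test_df = load_and_flatten('test_v2.csv')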

7. Data cleaning :

  • we will check how many unique values are present in each column of the data :
column_names = train_df.columns.to_list()
unique_value_columns = []

for column in column_names:
    count = train_df[column].nunique()
    if count == 1:
        del train_df[column]
        unique_value_columns.append(column)

In the above code snippet we check, for each column, how many unique values are present. If the count is one, the same value exists for that feature across the whole dataset, so that feature will not help with any kind of prediction, and we drop those columns.

  • Missing data analysis:
Bar plot for missing data percentage

Now we will analyse each feature with missing values to decide whether it is useful or not; if it is useful, we will analyse how to impute its missing values.
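The percentages behind that bar plot can be computed directly; here is a small sketch (the plot styling is illustrative):

import matplotlib.pyplot as plt

# percentage of missing values per column, keeping only columns that have any missing data
missing_pct = train_df.isnull().mean().mul(100).sort_values(ascending=False)
missing_pct = missing_pct[missing_pct > 0]

missing_pct.plot(kind='barh', figsize=(8, 10), title='Missing data percentage per feature')
plt.xlabel('% missing')
plt.tight_layout()
plt.show()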

8. Exploratory data analysis :

a) Target feature analysis:

Transaction Revenue analysis

In the above plot we have the user index on the x-axis and each user's log transaction revenue on the y-axis.

As we already discussed with the 80/20 rule, this graph confirms it: most of the transactions generated zero revenue and only a few transactions had non-zero revenue.
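This can be checked directly from the target column; here is a hedged sketch, assuming the JSON columns have already been flattened so that 'totals.transactionRevenue' exists:

revenue = train_df['totals.transactionRevenue'].astype(float).fillna(0)

# share of visits that produced any revenue at all
print('Visits with non-zero revenue: {:.2%}'.format((revenue > 0).mean()))

# share of total revenue contributed by the top 20% of paying users
user_revenue = revenue.groupby(train_df['fullVisitorId']).sum().sort_values(ascending=False)
paying_users = user_revenue[user_revenue > 0]
top20_share = paying_users.head(int(len(paying_users) * 0.2)).sum() / paying_users.sum()
print('Revenue share of the top 20% of paying users: {:.2%}'.format(top20_share))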

b) Trend analysis :

Now we will see how the number of visits and transactions change over time :

visits and transactions trend analysis
  • If we observe the above plot, in the month of December 2017 the number of visits and the revenue rise drastically.
  • This is one of the useful insights for the promotional team: they can invest more money in promotions in the month of December.

c) Channel grouping analysis:

Now we will see the number of visits and transactions that happen through each channel :

no.of visits and revenue per channel
  • Most of the revenue comes from ‘Organic Search’, ‘Direct’ and ‘Referral’, but the number of visits through ‘Direct’ and ‘Referral’ is comparatively small.
  • So the conclusion here is that the analytics team can invest less money in the ‘Direct’ and ‘Referral’ channels (since fewer users visit through these channels) and still generate most of the revenue.
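The numbers behind these plots come from a simple groupby; here is a sketch of the aggregation, and the same pattern applies to the browser, operating-system, device and continent breakdowns that follow:

# visits and total revenue per channel (assumes the revenue column is numeric, with NaNs filled as 0)
channel_stats = (train_df
                 .assign(revenue=train_df['totals.transactionRevenue'].astype(float).fillna(0))
                 .groupby('channelGrouping')
                 .agg(visits=('fullVisitorId', 'size'),
                      revenue=('revenue', 'sum'))
                 .sort_values('revenue', ascending=False))
print(channel_stats)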

d) Web-browser analysis :

Now we will see the number of visits and transactions that happen through each web browser :

no.of visits and revenue through web-browser
  • The number of visits through the Chrome browser is very large compared with all other browsers.
  • Most of the revenue comes from Chrome, Firefox, Safari, Internet Explorer, Edge, Opera, Samsung Internet, Android Webview, Amazon Silk, YaBrowser, etc.
  • So the conclusion here is that the analytics team can invest less money on users visiting the store through browsers other than Chrome (e.g. Safari, Firefox, Opera, Edge) and still generate most of the revenue.

e) Operating System analysis:

Now we will see the number of visits and transactions that happen through each operating system :

no.of visits and revenue for each operating system
  • Most of the users visit the store through Windows and Macintosh, and most of the revenue is generated from Windows and Macintosh.
  • If we observe carefully, very few people (fewer than 100K) visit through Linux and Chrome OS, so the business team can invest very little money in promotions on these two OS platforms and still generate most of the revenue.
  • Very importantly, fewer than 2,000 people visit the merchandise site through Windows Phone, but they also generate a good amount of revenue, so the analytics team can invest less money on Windows Phone and still generate good revenue.

f) Device category analysis:

Now we will see the number of visits and transactions that happen through each device category :

no.of visits and revenue for each device
  • Most of the users visit through desktop.
  • The very important observation here is that fewer than 68K people visit through tablet devices (significantly fewer than through other devices), yet they generate significantly high revenue.
  • So the analytics team can invest a smaller amount of money in promotions targeting users who visit the store through tablets and still generate significantly high revenue.

g) Mobile vs non-mobile analysis :

Now we will see the number of visits and transactions that happen through mobile and non-mobile devices :

no.of visits and revenue in mobile vs non-mobile
  • Many more users come through non-mobile devices, and more revenue is generated from non-mobile devices.
  • The number of visitors on mobile is relatively small, but they also generate significantly good revenue compared to non-mobile users.

h) Continent analysis :

Now we will see the number of visits and transactions coming from each continent :

no.of visits and revenue for each continent
  • The number of visits from the Americas is significantly higher than from the other continents.
  • Even though the number of visits from ‘Oceania’ and ‘Africa’ is lower, these continents also generate a good amount of revenue, so it is better to invest in these two continents as well.

9. Feature engineering:

a) Impute missing values:

Here we will impute zero for the missing values in the target feature, since we already know that about 98% of transactions do not generate any money.

train_df['totals.transactionRevenue'].fillna(0,inplace=True)

b) Convert Boolean Features :

device.isMobile is a boolean feature:

# here we are converting the "device.isMobile" data type from string to boolean.
train_df['device.isMobile'] = train_df['device.isMobile'].astype(bool)
test_df['device.isMobile'] = test_df['device.isMobile'].astype(bool)

c) Convert numerical features to float :

# here we define the list of all the numerical features and convert each numeric feature to float.
numeric_features = [
    'visitNumber', 'visitStartTime', 'totals.hits', 'totals.pageviews',
    'totals.timeOnSite', 'totals.transactions', 'totals.transactionRevenue']

for col in numeric_features:
    train_df[col].fillna(0, inplace=True)
    train_df[col] = train_df[col].astype('float')

    test_df[col].fillna(0, inplace=True)
    test_df[col] = test_df[col].astype('float')

d) Label encoding for categorical features :

Here we perform Label Encoding for the categorical features. We are not using one-hot encoding because it would increase the dimensionality of the data.

from sklearn import preprocessing

categorical_feat = ['channelGrouping', 'device.browser', 'device.operatingSystem',
                    'device.deviceCategory', 'geoNetwork.continent',
                    'geoNetwork.subContinent', 'geoNetwork.country', 'geoNetwork.region',
                    'geoNetwork.metro', 'geoNetwork.city', 'geoNetwork.networkDomain',
                    'totals.sessionQualityDim', 'trafficSource.campaign',
                    'trafficSource.source', 'trafficSource.medium',
                    'trafficSource.keyword', 'trafficSource.referralPath', 'trafficSource.adContent']

for feature in categorical_feat:
    label_encoder = preprocessing.LabelEncoder()  # initializing label encoder object

    # fit on the combined list of values from train and test for this feature
    label_encoder.fit(list(train_df[feature].values.astype('str')) +
                      list(test_df[feature].values.astype('str')))

    # transforming that feature
    train_df[feature] = label_encoder.transform(list(train_df[feature].values.astype('str')))
    test_df[feature] = label_encoder.transform(list(test_df[feature].values.astype('str')))

    print("for this feature : {0} label-encoding was done successfully".format(feature))

10. Time-series featurization :

The most important task for this problem is time series featurization:

  • Credits : https://www.kaggle.com/c/ga-customer-revenue-prediction/discussion/82614
  • Since this is a regression problem where most target values are zero, we solve this kind of problem using a hurdle model.
  • Here I will discuss the entire methodology behind this idea.
  • Basically, Kaggle has given us:
    * train data time period : Aug 1st 2016 to Apr 30th 2018 => 638 days in total.
    * test data time period : May 1st 2018 to Oct 15th 2018 => 168 days in total.
    * prediction time period : Dec 1st 2018 to Jan 31st 2019 => 62 days in total.
  • So we need to predict the revenue of users in the period of Dec 1st 2018 to Jan 31st 2019 by using the train and test data given to us.
  • We have data up to Oct 15th 2018 and the prediction period starts on Dec 1st 2018; the period in between is called the “cooling period” and it is 46 days long.
  • So the idea is to first predict whether a user will come to the store after the “cooling period” of 46 days (i.e. in the test period); for this we will use a classification model.
  • If the user does come to the store, we then predict the revenue of that user using a regression model on the user's data (features).
  • The next step is to build the data for the classification model in such a way that it replicates the real-world scenario.

Real-world scenario?
That means the train data will consist of 168 days of data and the test data will consist of 62 days of data, and we will maintain a 46-day gap between the train data end date and the test data start date.

Using this train data, we need to predict whether each user will come to the store during the test data period that we prepared.
ex:
train data = Aug 1st 2016 to Jan 15th 2017 (168 days)
test data = Mar 2nd 2017 to May 3rd 2017 (62 days)
The gap between the train and test data is 46 days.

So, using the data that we have, we can make 4 sets of train and test frames.

data set-1:
*train data = Aug 1st 2016 to Jan 15th 2017 (168 days)
*test data = Mar 2nd 2017 to May 3rd 2017 (62 days)
data set-2:
*train data = Jan 16th 2017 to Jul 2nd 2017 (168 days)
*test data = Aug 17th 2017 to Oct 18th 2017 (62 days)
data set-3:
*train data = Jul 3rd 2017 to Dec 17th 2017 (168 days)
*test data = Feb 1st 2018 to Apr 4th 2018 (62 days)
data set-4:
*train data = Dec 18th 2017 to Jun 4th 2018 (168 days)
*test data = Jul 20th 2018 to Sep 20th 2018 (62 days)

So, from the above data sets, for the users who are common to both the train and test frames (meaning they returned after the cooling period) we create a new feature ‘is_returned’ and set it to 1; for the users who did not return we set ‘is_returned’ to 0.

We also create some new features for every user in the ‘train data’ and finally we merge all these data frames.

  • So now our target features are “is_returned” and “revenue”.
  • “is_returned” indicates whether the user will come to the store in the test period.
  • “revenue” indicates the revenue generated by the user.

Note: I know this is difficult to understand on a first read, so here is a brief summary of the time-series featurization in a few lines.

We decided to build a classification model and a regression model. The task of the classification model is to predict whether the user will come to the store or not; if the user does not come to the store, the revenue from that user is zero. Up to here we are clear.

But for building the classification model we do not have any labelled data, so we generate it ourselves: with the data we have on hand, we divide it into a train frame and a test frame that replicate the real-world scenario (the cooling-period gap). If a user is present in both the train frame and the test frame, it means they returned to the store, and the label for that user is ‘1’; if they are not present in the test frame, we label that user with ‘0’. I hope it is clear now.
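The author's get_time_series_features (called below) builds the full per-window feature set. As a rough sketch of just the labelling idea, assuming the combined data has a datetime 'date' column and the parsed column names from earlier, it could look like this (the helper name and details are illustrative, not the actual implementation):

import pandas as pd

def label_returning_users(df, train_start, train_end, test_start, test_end):
    # sessions that fall inside the 168-day training window
    train_frame = df[(df['date'] >= train_start) & (df['date'] <= train_end)]
    # sessions that fall inside the 62-day target window, after the 46-day cooling period
    test_frame = df[(df['date'] >= test_start) & (df['date'] <= test_end)]

    # users seen in the target window are the "returned" users
    returning_ids = set(test_frame['fullVisitorId'])

    labels = train_frame.groupby('fullVisitorId').size().to_frame('visits')
    labels['is_returned'] = labels.index.isin(returning_ids).astype(int)

    # revenue each user actually generated inside the target window (0 if they did not return)
    revenue = test_frame.groupby('fullVisitorId')['totals.transactionRevenue'].sum()
    labels['revenue'] = revenue.reindex(labels.index).fillna(0)
    return labels

# e.g. data set-1 from the windows listed above:
# labels_1 = label_returning_users(train_test_data,
#                                  pd.Timestamp('2016-08-01'), pd.Timestamp('2017-01-15'),
#                                  pd.Timestamp('2017-03-02'), pd.Timestamp('2017-05-03'))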

train_test_data = pd.concat([train_df, test_df], axis=0).reset_index()

%time train_frame_1 = get_time_series_features(train_test_data, 1)
train_frame_1.to_pickle('train_frame_1')
%time train_frame_2 = get_time_series_features(train_test_data, 2)
train_frame_2.to_pickle('train_frame_2')
%time train_frame_3 = get_time_series_features(train_test_data, 3)
train_frame_3.to_pickle('train_frame_3')
%time train_frame_4 = get_time_series_features(train_test_data, 4)
train_frame_4.to_pickle('train_frame_4')

# concatenating all our featurized frames:
final_featurized_data = pd.concat([train_frame_1, train_frame_2, train_frame_3, train_frame_4],
                                  axis=0, sort=False).reset_index(drop=True)

11. Additional features:

We will calculate these features for every user: we group the data points by visitor ID and then compute the following new features.

  • Max value of network domain
  • Max value of the city
  • Max value of Device Operating System
  • Max value of Geo Network Metro
  • Max Value of Geo Network Region
  • Max Value of Channel Grouping
  • Max value of Referral Path
  • Max value of Country
  • Max value of Source
  • Max value of Medium
  • Max value of keyword
  • Max value of Browser
  • Max value of device category
  • Max value of continent
  • Summation of time on site
  • Min value of time on site
  • Max value of time on site
  • Mean value of time on site
  • Summation of page views
  • Min value of page views
  • Max value of page views
  • Mean value of page views
  • Summation of hits
  • Min value of hits
  • Max value of hits
  • Mean value of hits
  • Count of visit Start Time
  • Max value of session Quality Dim
  • Max value of isMobile
  • Maximum number of visits
  • Summation of all the transaction amounts
  • Summation of all the transaction counts
  • Days from first shopping session for customer from the period start date
  • Days from last shopping session for customer before the period start date
  • Interval Days — Difference between first and last shopping session for customer in current frame
  • Unique number of dates customer visited

Similarly, we will featurize the test data as well, and we will fill its target features “is_returned” and “revenue” with null (np.nan) values.
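As a rough illustration, the per-user aggregates listed above can be produced with a single pandas groupby. This is only a condensed sketch covering a subset of the features, assuming train_frame holds one window of session-level data with the parsed column names used earlier:

aggregations = {
    'geoNetwork.networkDomain': ['max'],
    'device.operatingSystem': ['max'],
    'channelGrouping': ['max'],
    'totals.timeOnSite': ['sum', 'min', 'max', 'mean'],
    'totals.pageviews': ['sum', 'min', 'max', 'mean'],
    'totals.hits': ['sum', 'min', 'max', 'mean'],
    'visitStartTime': ['count'],
    'totals.sessionQualityDim': ['max'],
    'visitNumber': ['max'],
    'totals.transactionRevenue': ['sum'],
    'totals.transactions': ['sum'],
    'date': ['nunique'],
}

user_features = train_frame.groupby('fullVisitorId').agg(aggregations)

# flatten the resulting MultiIndex columns, e.g. ('totals.hits', 'mean') -> 'totals.hits_mean'
user_features.columns = ['_'.join(col) for col in user_features.columns]
user_features = user_features.reset_index()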

12. Preparing train and test data sets :

  • We merge all the data frames we generated on ‘fullVisitorId’.
  • Now, the data points whose ‘is_returned’ and ‘revenue’ feature values are null are our test data points (since while featurizing the test data we appended null values for the target features).
# for all our test records we already filled the 'revenue' column with 'null' values,
# so here we are separating our train and test records
train_df = final_featurized_data[final_featurized_data['revenue'].notnull()]
test_df = final_featurized_data[final_featurized_data['revenue'].isnull()]

13. Machine learning models :

Here we are using two models to build the final predictor of revenue:

  • Classification Model to predict whether customer would return during test window.
  • Regression Model to predict transaction amount.

so the final value is :-

predicted revenue = classification model output (probability) * regression model output (real value)

Note: If the predicted revenue is negative, we set it to zero, since revenue cannot be negative.
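In code form, this combination together with the clipping from the note is simply (array names are illustrative):

import numpy as np

# p_return: classifier output, the probability that the user returns in the test window
# pred_revenue: regressor output, the predicted revenue if the user returns
final_prediction = np.clip(p_return * pred_revenue, 0, None)  # floor negative values at zero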

14. Hyper-parameter tuning for classification model :

Here we take LightGBM as our base model for the classification task, and we use random search to find good hyper-parameter values.

import lightgbm as lgb
from sklearn.model_selection import RandomizedSearchCV

# initializing grid parameters:
gridParams = {
    'learning_rate': [0.005, 0.01, 0.015],
    'n_estimators': [40, 100, 200],
    'num_leaves': [6, 8, 12, 15, 16],
    'boosting_type': ['gbdt'],
    'objective': ['binary'],
    'metric': ['binary_logloss'],
    'colsample_bytree': [0.6, 0.8, 1],
    'subsample': [0.7, 0.9, 1],
    'reg_alpha': [0, 1],
    'reg_lambda': [0, 1],
    'max_leaves': [128, 256, 512],
    'min_child_samples': [1, 20]
}

# initializing the model object:
model = lgb.LGBMClassifier()

target_columns = ['is_returned', 'revenue', 'fullVisitorId']

# RandomizedSearchCV to tune the parameters
grid = RandomizedSearchCV(model, gridParams, cv=3)

# Run the randomized search on the train dataset to find tuned hyper-parameters:
%time grid.fit(train_df.drop(target_columns, axis=1), train_df['is_returned'])

After executing the above code snippet, we got the following best hyper-parameters for our LightGBM classification model:

{'subsample': 0.9, 'reg_lambda': 1, 'reg_alpha': 0, 'objective': 'binary', 'num_leaves': 16, 'n_estimators': 200, 'min_child_samples': 20, 'metric': 'binary_logloss', 'max_leaves': 128, 'learning_rate': 0.015, 'colsample_bytree': 1, 'boosting_type': 'gbdt'}

15. Hyper-parameter tuning for regression model :

Here we take LightGBM as our base model for the regression task, and we use random search to find good hyper-parameter values.

# defining grid parameters:
gridParams = {
    'learning_rate': [0.005, 0.01, 0.015],
    'n_estimators': [40, 100, 200],
    'num_leaves': [6, 8, 12, 15, 16],
    'boosting_type': ['gbdt'],
    'objective': ['regression'],
    'metric': ['rmse'],
    'colsample_bytree': [0.6, 0.8, 1],
    'subsample': [0.7, 0.9, 1],
    'reg_alpha': [0, 1],
    'reg_lambda': [0, 1],
    'max_leaves': [128, 256, 512],
    'min_child_samples': [1, 20]
}

# Define the LightGBM Regressor model
model = lgb.LGBMRegressor()

# RandomizedSearchCV to tune the parameters
random_search = RandomizedSearchCV(model, gridParams, cv=3)

# Run the randomized search on the returning customers only to find tuned hyper-parameters
%time random_search.fit(train_df.drop(target_columns, axis=1)[train_df['is_returned']==1], train_df['revenue'][train_df['is_returned']==1])

After executing the above code snippet, we got the following best hyper-parameters for our LightGBM regression model:

{'subsample': 0.9, 'reg_lambda': 0, 'reg_alpha': 1, 'objective': 'regression', 'num_leaves': 8, 'n_estimators': 100, 'min_child_samples': 20, 'metric': 'rmse', 'max_leaves': 128, 'learning_rate': 0.015, 'colsample_bytree': 1, 'boosting_type': 'gbdt'}

16. Final Model :

  • Converting the data into LightGBM Dataset objects, which are easier to operate on:

# Define dataset for the Classification model to determine whether a customer would return during the test time window.
dtrain_returned = lgb.Dataset(train_df.drop(target_columns, axis=1),
                              label=train_df['is_returned'])

# Define dataset for the Regression model, picking only the customers who returned during the test time window.
dtrain_revenue = lgb.Dataset(train_df.drop(target_columns, axis=1)[train_df['is_returned']==1],
                             label=train_df['revenue'][train_df['is_returned']==1])
  • Final model:
  • We build the classification model and the regression model with the best hyper-parameter values we obtained, run them multiple times (say 10), and take the average of all the predictions generated across iterations.
# Running the Light-GBM models for 10 iterations and taking the average of their predictions.
# Source :- https://www.kaggle.com/kostoglot/winning-solution

pr_lgb_sum = 0  # Variable to accumulate predictions.

print('Training and predictions')
for i in range(10):
    print('Iteration number ', i)

    # Classification model to predict whether a customer will return in the test window.
    # params_classification holds the tuned hyper-parameters from section 14.
    classification_model = lgb.train(params_classification, dtrain_returned)
    pr_lgb = classification_model.predict(test_df.drop(target_columns, axis=1))
    classification_model.save_model('lgb_model1_itr_' + str(i) + '.txt')

    # Regression model to predict the transaction amount for the customers who returned in that window.
    # params_lgb2 holds the tuned hyper-parameters from section 15.
    regression_model = lgb.train(params_lgb2, dtrain_revenue)
    pr_lgb_ret = regression_model.predict(test_df.drop(target_columns, axis=1))
    regression_model.save_model('lgb_model2_itr_' + str(i) + '.txt')

    # Calculating the final prediction as the product of the above two outputs.
    pr_lgb_sum = pr_lgb_sum + pr_lgb * pr_lgb_ret

# Taking the average value over the iterations the model was run.
pr_final2 = pr_lgb_sum / 10

Now we will create the final submission.csv file with ‘fullVisitorId’ and ‘PredictedLogRevenue’ as columns.

# creating a data frame for the predictions that we made for the test data users:
pred_df = pd.DataFrame({"fullVisitorId": test_df["fullVisitorId"].values})
pred_df["PredictedLogRevenue"] = pr_final2
pred_df.columns = ["fullVisitorId", "PredictedLogRevenue"]

17. Results :

The above model resulted in a score of 0.8848 on the private leaderboard, which would correspond to a rank of 5 on the leaderboard.

kaggle leader board submission
  • We can also try multiple combinations of models, e.g. logistic regression + linear regression, Random Forest (classification) + Random Forest (regression), or XGBoost (classification) + XGBoost (regression).

18. Feature Importance :

  • Now we will do some experimentation, starting with feature importance.
  • Here we will see which features are really useful, so that we can use only those features and thereby reduce the dimensionality of the data and the computational time.
  • For this we use ‘Recursive Feature Elimination’.

Recursive feature elimination:

  • The recursive feature elimination idea is similar to backward feature selection.
  • First we need to specify the base model (the base model has to expose feature importances), and the algorithm first trains the model on all features of the dataset.
  • It then takes the feature importances of all features, removes the least important ones, and re-trains the model on the new set of features; this operation runs iteratively over different sets of features.
  • Finally, the feature set that gives the best accuracy is selected as our final feature set.
from sklearn.feature_selection import RFECV

# Define the Light-GBM Regressor model as our base model:
estimator = lgb.LGBMRegressor(objective="regression", metric="rmse", max_leaves=128,
                              num_leaves=8, min_child_samples=20, learning_rate=0.015,
                              subsample=0.9, colsample_bytree=1, bagging_frequency=1,
                              n_estimators=100, reg_alpha=1, reg_lambda=0,
                              boosting_type="gbdt")

rfecv = RFECV(estimator, step=1)  # here step denotes how many features to drop at a time.

%time rfecv.fit(train_df.drop(target_columns, axis=1)[train_df['is_returned']==1], train_df['revenue'][train_df['is_returned']==1])
Feature Importance
print('Optimal number of features: {}'.format(rfecv.n_features_))

Optimal number of features: 19
Feature importance
  • Now we can rebuild the same models that we built earlier using only these features. I tried it, and the results did not improve significantly; I observed only a slight change in the performance of the final model.

19. Trying ensemble models with bagging :

  • Ensemble learning helps improve machine learning results by combining several models. Ensemble methods are meta-algorithms that combine several machine learning techniques into one predictive model in order to decrease variance (bagging).
  • Bagging : also known as bootstrap aggregation. Here we take samples with replacement: instead of passing the whole dataset to each model, we pass a subset of the data to each model in our ensemble architecture, where the subsets are formed by sampling (with replacement) from the whole training data. Using this technique we can reduce the variance of the models.
  • Here is my ensemble architecture:
Ensemble Model
  • Here I am not using the classification model; I am using only the regression model to predict the revenue of each user.

Note: Here I am using only 3 base models, but in practice people use hundreds of base models to improve the results. A minimal sketch of this setup is shown below.
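As a rough sketch of this bagging setup, the following trains three LightGBM regressors on bootstrap samples and averages their predictions (the function name and hyper-parameters are illustrative, not the tuned values used above):

import numpy as np
import lightgbm as lgb

def bagged_predictions(X_train, y_train, X_test, n_models=3, seed=42):
    rng = np.random.RandomState(seed)
    preds = np.zeros(len(X_test))
    for i in range(n_models):
        # bootstrap sample: draw rows with replacement from the training data
        idx = rng.choice(len(X_train), size=len(X_train), replace=True)
        model = lgb.LGBMRegressor(objective='regression', n_estimators=100,
                                  learning_rate=0.015, num_leaves=8,
                                  random_state=seed + i)
        model.fit(X_train.iloc[idx], y_train.iloc[idx])
        preds += model.predict(X_test)
    # average the base models' outputs to reduce variance; floor negatives at zero
    return np.clip(preds / n_models, 0, None)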

Here is my result for ensemble method:

Ensemble model results

20. Future Work :

Deep learning models:

  • We can try various CNN architectures with Conv1D layers and max-pooling layers, and we can also try LSTM models (since the given data varies with time, and we know LSTMs do a great job on time-series problems).
  • In the description section you can find my GitHub profile, where you can find the code for the entire project and the future work.

21. Profile :

22. References :

--
