Google Analytics Customer Revenue Prediction

Gaurav Kavhar
7 min read · Feb 18, 2021


Google Merchandise Store (also known as GStore, where Google swag is sold)

In this blog I’ll talk about the “Google Analytics Customer Revenue Prediction” competition hosted by Kaggle.

Index:

  1. Business Problem
  2. Use of Machine Learning to solve the problem
  3. Source of data
  4. Data loading and preprocessing
  5. Exploratory data analysis
  6. Feature engineering
  7. Data splitting: Train, validation and test sets
  8. Machine learning models
  9. Results
  10. Future work
  11. References

1. Business problem:

  • In this competition, we’re challenged to analyze a Google Merchandise Store customer dataset to predict revenue per customer.
  • We need to analyse the data of customers visiting the GStore and predict the revenue each customer might generate in the future.
  • The insights from our models would be used to make operational changes and to make better use of marketing budgets for companies that choose to use data analysis on top of Google Analytics data.
  • The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

2. Use of Machine learning to solve the problem:

We are going to predict the future revenue generated by each customer, so we will pose this as a regression problem. The performance metric we will be using is Root Mean Squared Error (RMSE).

Root Mean Squared Error:

RMSE = √( (1/n) · Σᵤ (ŷᵤ − yᵤ)² )

ŷᵤ : the natural log of the predicted revenue for customer u.

yᵤ : the natural log of the actual summed revenue for customer u, plus one.
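
As a quick illustration, the metric can be computed on the log-transformed values as follows (a minimal sketch; the revenue values are made up for the example):

import numpy as np

def rmse(y_true_log, y_pred_log):
    # both inputs are natural-log-transformed revenues (log of summed revenue + 1)
    return np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))

# example: actual summed revenues and predicted revenues for three customers
y_true_log = np.log1p(np.array([0.0, 120.0, 35.0]))
y_pred_log = np.log1p(np.array([5.0, 100.0, 30.0]))
print(rmse(y_true_log, y_pred_log))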

3. Source of data:

The data can be downloaded from the Kaggle competition page linked below. All details regarding the dataset can also be found there.

4. Data loading and preprocessing

The dataset provided by Kaggle is very large, and reading it all at once requires more than 30 GB of RAM. If your machine doesn't have that much memory, use the "nrows" parameter of load_df() to read the data in batches and save each batch to a pickle file.
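
For reference, here is a minimal sketch of what such a load_df() could look like, assuming the raw CSV stores the device, geoNetwork, totals and trafficSource columns as JSON strings; the file name and batch size are illustrative:

import json
import pandas as pd

JSON_COLS = ['device', 'geoNetwork', 'totals', 'trafficSource']

def load_df(csv_path, nrows=None):
    # parse the JSON columns while reading; keep visitor IDs as strings
    df = pd.read_csv(csv_path,
                     converters={col: json.loads for col in JSON_COLS},
                     dtype={'fullVisitorId': str},
                     nrows=nrows)
    for col in JSON_COLS:
        # flatten each JSON column into separate "col.subfield" columns
        flat = pd.json_normalize(df[col].tolist())
        flat.columns = [f'{col}.{sub}' for sub in flat.columns]
        df = df.drop(columns=[col]).join(flat)
    return df

# read a batch of 200k rows and save it as a pickle file
train_df = load_df('train_v2.csv', nrows=200_000)
train_df.to_pickle('train_batch_0.pkl')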

Now we need to clean the data: we will delete all columns that have a single value across all rows, since such columns provide no information for predicting the target variable.

from time import time

start = time()
# make a list of all column names
column_list = train_df.columns.to_list()
for column in column_list:
    # drop columns having a single value across all rows
    if train_df[column].nunique() == 1:
        del train_df[column]
end = time()

print(f'Time taken: {(end - start) / 60} mins')
print(f'After dropping single-valued columns df size is: {train_df.shape}')
train_df.head(2)

5. Exploratory data analysis

5.1 Target variable analysis

  • We can see that the natural log of the target variable (transactionRevenue) follows a Gaussian distribution.
  • From the pie chart we can see that only 1.2% of all our customers have completed a transaction.

5.2 Trend analysis

Visits and Transactions trend analysis
  • We can observe from the graphs that the number of visits increased drastically in Dec 2017, but the number of transactions did not.
  • Also, the number of transactions suddenly increased in March 2018.

5.3 Pageviews, Time spent on a Session

Bi-variate analysis
  • Around 75% of sessions have between 1 and 4 pageviews. The log of "transactionRevenue" follows a normal distribution against "pageviews", but there is no direct correlation between them.
  • Around 75% of users spend less than 4 mins (244 secs) on a session.
    The log of "transactionRevenue" also follows a normal distribution against "timeOnSite".

5.4 Channel Grouping analysis

Transactions and revenue across channels
  • Most transactions come from Referral and Organic Search; these channels also generate the most revenue.
  • The number of transactions from Direct sources is low, but their revenue generation is high, on par with Referral and Organic Search.

5.5 Web browser analysis

Transactions and revenue across web browsers
  • The number of transactions and the revenue generated are highest for Chrome.
  • Firefox and Safari users have very few transactions compared to Chrome, but their revenue generation is close to that of Chrome.
  • Marketing teams can focus on Chrome users to maximise revenue generation.

5.6 City Analysis

Transactions and revenue across cities
  • A lot of the city data is missing in the dataset (58%).
  • New York, Mountain View and San Francisco are the three highest revenue-generating cities and have the most transactions.

Multivariate analysis

5.7 Grouping OS and browsers to see their impact on transactionRevenue

Transactions and revenue across Browser-OS
  • Both Windows and Mac users have more transactions and higher total revenue when using the Chrome browser.
  • Across all operating systems, Chrome users have the highest number of transactions.
  • This supports the earlier conclusion that Chrome users generate more revenue than users of other browsers.

6. Feature engineering

6.1 Data imputation

We will impute null values with 0 for all numerical variables. The following code snippet shows an example of imputation for the target variable "transactionRevenue":

%%time
# we will impute 'nan' with 0
train_df['totals.transactionRevenue'].fillna(0, inplace=True)
test_df['totals.transactionRevenue'].fillna(0, inplace=True)
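
The same idea can be extended to the remaining numeric columns (the column list below is only illustrative):

# illustrative list of other numeric columns to impute with 0
num_cols = ['totals.pageviews', 'totals.hits', 'totals.timeOnSite']
for col in num_cols:
    train_df[col] = train_df[col].fillna(0)
    test_df[col] = test_df[col].fillna(0)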

6.2 Delete non-useful features

We will delete features that have more than 85% missing data and are unlikely to contain useful information for predicting the target variable.

# list of columns to drop due to a large share of missing data
cols_to_drop = ['trafficSource.adContent', 'trafficSource.adwordsClickInfo.adNetworkType',
                'trafficSource.adwordsClickInfo.slot', 'trafficSource.adwordsClickInfo.page',
                'trafficSource.adwordsClickInfo.gclId', 'hits', 'totals.totalTransactionRevenue']
train_df.drop(cols_to_drop, axis=1, inplace=True)

# delete all columns in the test dataframe that are not present in train
tl = [col for col in test_df.columns if col not in train_df.columns]
test_df.drop(tl, axis=1, inplace=True)

6.3 Standardising Numeric features

We will scale the numeric features to the [0, 1] range using MinMaxScaler() from scikit-learn.
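
A minimal sketch of this step, fitting the scaler on the training data only to avoid leakage (the column list is illustrative):

from sklearn.preprocessing import MinMaxScaler

# illustrative list of numeric feature columns
num_cols = ['totals.pageviews', 'totals.hits', 'visitNumber']

scaler = MinMaxScaler()
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])
test_df[num_cols] = scaler.transform(test_df[num_cols])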

6.4 Label encoding Categorical features

We will encode categorical features using LabelEncoder() from scikit-learn.
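
A sketch of the encoding step; fitting each encoder on the combined train and test values avoids errors on labels that appear only in the test set (the column list is illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# illustrative list of categorical feature columns
cat_cols = ['channelGrouping', 'device.browser', 'device.operatingSystem']

for col in cat_cols:
    le = LabelEncoder()
    le.fit(pd.concat([train_df[col], test_df[col]]).astype(str))
    train_df[col] = le.transform(train_df[col].astype(str))
    test_df[col] = le.transform(test_df[col].astype(str))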

6.5 Time window features

Main idea:
https://www.kaggle.com/c/ga-customer-revenue-prediction/discussion/81542

  • This approach is inspired by the discussion thread above, whose author explains that the problem is essentially a time-window-to-time-window prediction.
  • He created overlapping windows of 15 days each; instead, we can try windows of 168 days, since the test data given by Kaggle covers 168 days of sessions (see the sketch after this list).
  • The target variable for each window must be separated from the feature window by a gap period of 45 days.
  • The target period should be 2 months, the same as the private leaderboard period on Kaggle.
  • Another key idea is not to do any hyperparameter tuning.
  • Kaggle data:
    TEST DATA: transactions from May 1st 2018 to October 15th 2018 (168 days)
    KAGGLE PRIVATE DATA: Dec 1st 2018 to Jan 31st 2019 (2 months)
    GAP PERIOD: time interval between the test data and the private data (45 days)
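
A rough sketch of building one (feature window, gap, target window) split, assuming a session-level dataframe with a datetime "date" column; the aggregated features are illustrative:

import numpy as np
import pandas as pd

WINDOW = pd.Timedelta(days=168)  # feature window, same length as the Kaggle test data
GAP = pd.Timedelta(days=45)      # gap between feature window and target window
TARGET = pd.Timedelta(days=62)   # roughly 2 months, matching the private leaderboard period

def make_window(df, start):
    feat_mask = (df['date'] >= start) & (df['date'] < start + WINDOW)
    tgt_start = start + WINDOW + GAP
    tgt_mask = (df['date'] >= tgt_start) & (df['date'] < tgt_start + TARGET)

    # aggregate per-visitor features over the feature window
    feats = df[feat_mask].groupby('fullVisitorId').agg(
        total_pageviews=('totals.pageviews', 'sum'),
        num_visits=('visitId', 'count'),
    )
    # target: natural log of the summed revenue (plus one) in the target window
    revenue = df[tgt_mask].groupby('fullVisitorId')['totals.transactionRevenue'].sum()
    feats['target'] = np.log1p(revenue.reindex(feats.index).fillna(0))
    return feats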

7. Data splitting: Train, validation and test sets

  • We will use the first 8 windows for training and the last 2 windows for validation (we will not be doing parameter tuning, but this validation set will be used to compare models, so it can also be thought of as test data); a sketch of this split is shown below.
Train set
Validation set
  • We will use the test data provided by Kaggle for the submission and for getting the private leaderboard score.
Test set for Kaggle submission
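
Under this windowing scheme the split could look roughly like this, assuming "windows" is a list of the per-window dataframes built as sketched in section 6.5:

import pandas as pd

train = pd.concat(windows[:8])  # first 8 windows for training
val = pd.concat(windows[8:])    # last 2 windows for validation / model comparison

X_train, y_train = train.drop(columns=['target']), train['target']
X_val, y_val = val.drop(columns=['target']), val['target']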

8. Machine learning models

As discussed in the feature engineering section and suggested in the discussion thread referenced above, we will not be doing any hyperparameter tuning for our models.

8.1 LightGBM (Light Gradient Boosting Machine)

  • This model gave the best Kaggle private score, 0.884, among all the models I tried.
Kaggle private score
  • A sketch of how the model can be trained and the submission file created is shown below.
  • Feature importance according to the LightGBM model:
Feature importance using LightGBM model
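
A minimal sketch of training the LightGBM model and writing the submission file, with no hyperparameter tuning; "test_features" is an assumed per-visitor feature frame indexed by fullVisitorId:

import lightgbm as lgb
import numpy as np
import pandas as pd

# train with mostly default parameters (no tuning)
model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.01, random_state=42)
model.fit(X_train, y_train)

# predict the log revenue per visitor on the Kaggle test features
preds = model.predict(test_features)

submission = pd.DataFrame({
    'fullVisitorId': test_features.index.astype(str),
    'PredictedLogRevenue': np.clip(preds, 0, None),  # revenue cannot be negative
})
submission.to_csv('submission.csv', index=False)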

8.2 Random Forest model

  • Random Forest gave a score of 0.9373 on the private leaderboard (a training sketch is shown below).
  • Feature importance:
Feature importance using Random Forest model
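
A comparable sketch for the Random Forest model, again without tuning:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
val_preds = rf.predict(X_val)

# feature importances, as plotted above
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))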

9. Results

The following table summarises the results of the two models on the Kaggle private leaderboard (RMSE):

Model           Private leaderboard score
LightGBM        0.884
Random Forest   0.9373

Final results

10. Future work

In future work, we can try the following:

  • Ensembling of different models.
  • Various sizes of time windows; we did not use overlapping windows in this work, so those can also be tried.

11. References

  • Google Analytics Customer Revenue Prediction competition: https://www.kaggle.com/c/ga-customer-revenue-prediction
  • Time-window approach discussion thread: https://www.kaggle.com/c/ga-customer-revenue-prediction/discussion/81542
