Google Analytics Customer Revenue Prediction

Gaurav Kavhar
7 min read · Feb 18, 2021


Google Merchandise Store (also known as GStore, where Google swag is sold)

In this blog I’ll talk about the “Google Analytics Customer Revenue Prediction” competition hosted by Kaggle.

Index:

  1. Business Problem
  2. Use of Machine Learning to solve the problem
  3. Source of data
  4. Data loading and preprocessing
  5. Exploratory data analysis
  6. Feature engineering
  7. Data splitting: Train, validation and test sets
  8. Machine learning models
  9. Results
  10. Future work
  11. References

1. Business problem:

  • In this competition, we’re challenged to analyze a Google Merchandise Store customer dataset to predict revenue per customer.
  • We need to analyse the data of customers visiting the GStore and predict the revenue each customer might generate in the future.
  • The insights from our models would be used to make operational changes and to make better use of marketing budgets for companies that choose to use data analysis on top of Google Analytics data.
  • The 80/20 rule has proven true for many businesses: only a small percentage of customers produce most of the revenue. As such, marketing teams are challenged to make appropriate investments in promotional strategies.

2. Use of Machine learning to solve the problem:

We are going to predict the future revenue generated by each customer, so we will pose this as a regression problem. The performance metric we will be using is Root Mean Squared Error (RMSE).

Root Mean Squared Error:

RMSE = √( (1/n) · Σᵤ (ŷᵤ − yᵤ)² )

ŷᵤ : the natural log of the predicted revenue for customer u.

yᵤ : the natural log of the actual summed revenue for customer u, plus one.
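
As a quick illustration, the metric can be computed on the log-transformed values as follows (a minimal sketch; the revenue values are made up for the example):

import numpy as np

def rmse(y_true_log, y_pred_log):
    # both inputs are natural-log-transformed revenues (log of summed revenue + 1)
    return np.sqrt(np.mean((y_pred_log - y_true_log) ** 2))

# example: actual summed revenues and predicted revenues for three customers
y_true_log = np.log1p(np.array([0.0, 120.0, 35.0]))
y_pred_log = np.log1p(np.array([5.0, 100.0, 30.0]))
print(rmse(y_true_log, y_pred_log))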

3. Source of data:

The data can be downloaded from the Kaggle competition page linked below. All details regarding the dataset can also be found there.

4. Data loading and preprocessing

The dataset provided by Kaggle is very large, and reading it all at once requires more than 30 GB of RAM. If your machine doesn't have that much memory, use the "nrows" parameter of load_df() to read the data in batches and save each batch to a pickle file.
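
For reference, here is a minimal sketch of what such a load_df() could look like, assuming the raw CSV stores the device, geoNetwork, totals and trafficSource columns as JSON strings; the file name and batch size are illustrative:

import json
import pandas as pd

JSON_COLS = ['device', 'geoNetwork', 'totals', 'trafficSource']

def load_df(csv_path, nrows=None):
    # parse the JSON columns while reading; keep visitor IDs as strings
    df = pd.read_csv(csv_path,
                     converters={col: json.loads for col in JSON_COLS},
                     dtype={'fullVisitorId': str},
                     nrows=nrows)
    for col in JSON_COLS:
        # flatten each JSON column into separate "col.subfield" columns
        flat = pd.json_normalize(df[col].tolist())
        flat.columns = [f'{col}.{sub}' for sub in flat.columns]
        df = df.drop(columns=[col]).join(flat)
    return df

# read a batch of 200k rows and save it as a pickle file
train_df = load_df('train_v2.csv', nrows=200_000)
train_df.to_pickle('train_batch_0.pkl')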

Now we need to clean the data: we will delete all columns that have a single value across all rows, since such columns provide no information for predicting the target variable.

from time import time

start = time()
# make a list of all column names
column_list = train_df.columns.to_list()
for column in column_list:
    # drop columns having a single value across all rows
    if train_df[column].nunique() == 1:
        del train_df[column]
end = time()

print(f'Time taken: {(end - start) / 60} mins')
print(f'After dropping single-valued columns df size is: {train_df.shape}')
train_df.head(2)

5. Exploratory data analysis

5.1 Target variable analysis

  • We can see that the natural log of the target variable (transactionRevenue) follows a Gaussian distribution.
  • From the pie chart we can see that only 1.2% of all our customers have completed a transaction.

5.2 Trend analysis

Visits and Transactions trend analysis
  • We can observe from the graphs that the number of visits increased drastically in Dec 2017, but the number of transactions did not.
  • Also, the number of transactions suddenly increased in March 2018.

5.3 Pageviews, Time spent on a Session

Bi-variate analysis
  • Around 75% of sessions have between 1 and 4 pageviews. The log of "transactionRevenue" follows a normal distribution against "pageviews", but there is no direct correlation between them.
  • Around 75% of users spend less than 4 mins (244 secs) on a session.
    The log of "transactionRevenue" also follows a normal distribution against "timeOnSite".

5.4 Channel Grouping analysis

Transactions and revenue across channels
  • Most transactions come from Referral and Organic Search; these channels also generate the most revenue.
  • The number of transactions from Direct sources is low, but their revenue generation is high, on par with Referral and Organic Search.

5.5 Web browser analysis

Transactions and revenue across web browsers
  • The number of transactions and the revenue generated are highest for Chrome.
  • Firefox and Safari users have very few transactions compared to Chrome, but their revenue generation is close to that of Chrome.
  • Marketing teams can focus on Chrome users to maximise revenue generation.

5.6 City Analysis

Transactions and revenue across cities
  • A lot of the city data is missing in the dataset (58%).
  • New York, Mountain View and San Francisco are the three highest revenue-generating cities and have the most transactions.

Multivariate analysis

5.7 Grouping OS and browsers to see their impact on transactionRevenue

Transactions and revenue across Browser-OS
  • Both Windows and Mac users have more transactions and higher total revenue when using the Chrome browser.
  • Across all operating systems, Chrome users have the highest number of transactions.
  • This supports the earlier conclusion that Chrome users generate more revenue than users of other browsers.

6. Feature engineering

6.1 Data imputation

We will impute null values with 0 for all numerical variables. The following code snippet shows an example of imputation for the target variable "transactionRevenue":

%%time
# we will impute 'nan' with 0
train_df['totals.transactionRevenue'].fillna(0, inplace=True)
test_df['totals.transactionRevenue'].fillna(0, inplace=True)
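
The same idea can be extended to the remaining numeric columns (the column list below is only illustrative):

# illustrative list of other numeric columns to impute with 0
num_cols = ['totals.pageviews', 'totals.hits', 'totals.timeOnSite']
for col in num_cols:
    train_df[col] = train_df[col].fillna(0)
    test_df[col] = test_df[col].fillna(0)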

6.2 Delete non-useful features

We will delete features that have more than 85% missing data and are unlikely to contain useful information for predicting the target variable.

# list of columns to drop due to a large share of missing data
cols_to_drop = ['trafficSource.adContent', 'trafficSource.adwordsClickInfo.adNetworkType',
                'trafficSource.adwordsClickInfo.slot', 'trafficSource.adwordsClickInfo.page',
                'trafficSource.adwordsClickInfo.gclId', 'hits', 'totals.totalTransactionRevenue']
train_df.drop(cols_to_drop, axis=1, inplace=True)

# delete all columns in the test dataframe that are not present in train
tl = [col for col in test_df.columns if col not in train_df.columns]
test_df.drop(tl, axis=1, inplace=True)

6.3 Standardising Numeric features

We will scale the numeric features to the [0, 1] range using MinMaxScaler() from scikit-learn.
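
A minimal sketch of this step, fitting the scaler on the training data only to avoid leakage (the column list is illustrative):

from sklearn.preprocessing import MinMaxScaler

# illustrative list of numeric feature columns
num_cols = ['totals.pageviews', 'totals.hits', 'visitNumber']

scaler = MinMaxScaler()
train_df[num_cols] = scaler.fit_transform(train_df[num_cols])
test_df[num_cols] = scaler.transform(test_df[num_cols])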

6.4 Label encoding Categorical features

We will encode categorical features using LabelEncoder() from scikit-learn.
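
A sketch of the encoding step; fitting each encoder on the combined train and test values avoids errors on labels that appear only in the test set (the column list is illustrative):

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# illustrative list of categorical feature columns
cat_cols = ['channelGrouping', 'device.browser', 'device.operatingSystem']

for col in cat_cols:
    le = LabelEncoder()
    le.fit(pd.concat([train_df[col], test_df[col]]).astype(str))
    train_df[col] = le.transform(train_df[col].astype(str))
    test_df[col] = le.transform(test_df[col].astype(str))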

6.5 Time window features

Main idea:
https://www.kaggle.com/c/ga-customer-revenue-prediction/discussion/81542

  • This approach is inspired by the discussion thread above, whose author explains that the problem is essentially a time-window-to-time-window prediction.
  • He created overlapping windows of 15 days each; instead, we can try windows of 168 days, since the test data given by Kaggle covers 168 days of sessions (see the sketch after this list).
  • The target variable for each window must be separated from the feature window by a gap period of 45 days.
  • The target period should be 2 months, the same as the private leaderboard period on Kaggle.
  • Another key idea is not to do any hyperparameter tuning.
  • Kaggle data:
    TEST DATA: transactions from May 1st 2018 to October 15th 2018 (168 days)
    KAGGLE PRIVATE DATA: Dec 1st 2018 to Jan 31st 2019 (2 months)
    GAP PERIOD: time interval between the test data and the private data (45 days)
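
A rough sketch of building one (feature window, gap, target window) split, assuming a session-level dataframe with a datetime "date" column; the aggregated features are illustrative:

import numpy as np
import pandas as pd

WINDOW = pd.Timedelta(days=168)  # feature window, same length as the Kaggle test data
GAP = pd.Timedelta(days=45)      # gap between feature window and target window
TARGET = pd.Timedelta(days=62)   # roughly 2 months, matching the private leaderboard period

def make_window(df, start):
    feat_mask = (df['date'] >= start) & (df['date'] < start + WINDOW)
    tgt_start = start + WINDOW + GAP
    tgt_mask = (df['date'] >= tgt_start) & (df['date'] < tgt_start + TARGET)

    # aggregate per-visitor features over the feature window
    feats = df[feat_mask].groupby('fullVisitorId').agg(
        total_pageviews=('totals.pageviews', 'sum'),
        num_visits=('visitId', 'count'),
    )
    # target: natural log of the summed revenue (plus one) in the target window
    revenue = df[tgt_mask].groupby('fullVisitorId')['totals.transactionRevenue'].sum()
    feats['target'] = np.log1p(revenue.reindex(feats.index).fillna(0))
    return feats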

7. Data splitting: Train, validation and test sets

  • We will use the first 8 windows for training and the last 2 windows for validation (we will not be doing parameter tuning, but this validation set will be used to compare models, so it can also be thought of as test data); a sketch of this split is shown below.
Train set
Validation set
  • We will use the test data provided by Kaggle for the submission and for getting the private leaderboard score.
Test set for Kaggle submission
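
Under this windowing scheme the split could look roughly like this, assuming "windows" is a list of the per-window dataframes built as sketched in section 6.5:

import pandas as pd

train = pd.concat(windows[:8])  # first 8 windows for training
val = pd.concat(windows[8:])    # last 2 windows for validation / model comparison

X_train, y_train = train.drop(columns=['target']), train['target']
X_val, y_val = val.drop(columns=['target']), val['target']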

8. Machine learning models

As discussed in the feature engineering section and suggested in the discussion thread referenced above, we will not be doing any hyperparameter tuning for our models.

8.1 LightGBM (Light Gradient Boosting Machine)

  • This model gave the best Kaggle private score, 0.884, among all the models I tried.
Kaggle private score
  • A sketch of how the model can be trained and the submission file created is shown below.
  • Feature importance according to the LightGBM model:
Feature importance using LightGBM model
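
A minimal sketch of training the LightGBM model and writing the submission file, with no hyperparameter tuning; "test_features" is an assumed per-visitor feature frame indexed by fullVisitorId:

import lightgbm as lgb
import numpy as np
import pandas as pd

# train with mostly default parameters (no tuning)
model = lgb.LGBMRegressor(n_estimators=1000, learning_rate=0.01, random_state=42)
model.fit(X_train, y_train)

# predict the log revenue per visitor on the Kaggle test features
preds = model.predict(test_features)

submission = pd.DataFrame({
    'fullVisitorId': test_features.index.astype(str),
    'PredictedLogRevenue': np.clip(preds, 0, None),  # revenue cannot be negative
})
submission.to_csv('submission.csv', index=False)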

8.2 Random Forest model

  • Random Forest gave a score of 0.9373 on the private leaderboard (a training sketch is shown below).
  • Feature importance:
Feature importance using Random Forest model
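
A comparable sketch for the Random Forest model, again without tuning:

import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rf = RandomForestRegressor(n_estimators=100, n_jobs=-1, random_state=42)
rf.fit(X_train, y_train)
val_preds = rf.predict(X_val)

# feature importances, as plotted above
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))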

9. Results

The following table summarises the results of the two models on the Kaggle private leaderboard (RMSE):

Model           Private leaderboard score
LightGBM        0.884
Random Forest   0.9373

Final results

10. Future work

In future work, we can try the following:

  • Ensembling of different models.
  • Various sizes of time windows; we did not use overlapping windows in this work, so those can also be tried.

11. References

  • Google Analytics Customer Revenue Prediction competition: https://www.kaggle.com/c/ga-customer-revenue-prediction
  • Time-window approach discussion thread: https://www.kaggle.com/c/ga-customer-revenue-prediction/discussion/81542
