Starbucks Data Science Capstone Project
Project Introduction
Starbucks is the largest retail coffee chain in the world, with a market capitalization of nearly $140 billion at the time of this writing. With such a broad global footprint, maintaining customer satisfaction and maximizing revenue is no simple task.
The scale of that challenge was revealed in mock data sets provided through a partnership between Udacity and Starbucks for the final capstone project of the Udacity Data Scientist program. As a current student, I was excited to get a glimpse behind the curtain at the complex challenges faced by data science teams at the coffee behemoth.
The data sets provided centered on customer transactions and the offers that Starbucks provides through its app. Two data sets were similar to lookup tables and provided fake offer and customer information. The third set, however, was similar to an event log, in that every customer transaction or interaction of a customer with an offer was recorded. The goal of the project was to determine which demographic groups of customers respond best to which offer types.
Solution Strategy
The strategy implemented followed a three-step approach. The first step was to clean, transform, and merge the provided data sets on the primary keys of customer and offer ID. The second step was to transform the merged data set to extract metrics and analysis pertaining to the project goals. The third step focused more on customer transaction data, and sought to use a linear regression machine learning model to predict how much a customer would spend based on demographics.
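The first step can be sketched with pandas as below. The miniature frames and column names (`offer_id`, `customer_id`, etc.) are illustrative assumptions, not the project's exact schema:

```python
import pandas as pd

# Hypothetical miniature versions of the three data sets; the real
# column names in the Starbucks data may differ.
portfolio = pd.DataFrame({"offer_id": ["o1", "o2"],
                          "offer_type": ["bogo", "discount"]})
profile = pd.DataFrame({"customer_id": ["c1", "c2"],
                        "gender": ["F", "M"],
                        "income": [60000, 45000]})
transcript = pd.DataFrame({"customer_id": ["c1", "c1", "c2"],
                           "offer_id": ["o1", "o2", "o1"],
                           "event": ["offer viewed", "offer completed",
                                     "offer viewed"]})

# Left-join the event log onto the two lookup tables on their keys,
# keeping one row per logged event.
merged = (transcript
          .merge(profile, on="customer_id", how="left")
          .merge(portfolio, on="offer_id", how="left"))
print(merged.shape)  # (3, 6)
```

Left joins preserve every event even when a lookup row is missing, which keeps the event log intact for the later analysis steps.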
Metrics and Analysis
After extensive data transformation, I sought to answer some questions about the data as step two of the outlined strategy:
1. Does gender influence offer completions by offer type?
I began with a fairly straightforward question. As the chart below shows, I found no significant correlation between gender and offer completions by type:
2. Which group of customers completes offers without viewing those offers?
This question presented some significant data processing challenges, in that an offer view and completion were treated as two separate logging entries with the same customer ID. My approach treated the separate entries as unique series, from which two respective lists of customer ID values were extracted. I then used a list comprehension to find people in the completed-offer list that were not in the viewed-offer list. Breaking these customers down by mean income and age, I found no meaningful pattern:
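The extraction step can be sketched as follows, with toy IDs standing in for the real customer lists:

```python
# Customers with an "offer viewed" event vs. an "offer completed" event;
# the toy IDs stand in for values pulled from the real event log.
viewed = ["c1", "c2", "c3"]
completed = ["c2", "c3", "c4", "c5"]

# List comprehension as in the project; converting the viewed list to a
# set makes each membership test O(1) instead of O(n).
viewed_set = set(viewed)
unviewed_completers = [c for c in completed if c not in viewed_set]
print(unviewed_completers)  # ['c4', 'c5']
```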
However, grouping by gender, I found a notable pattern. Female customers were 2.7 times more likely than males to complete discount offers without viewing them, and 2.3 times more likely than males to complete BOGO offers without viewing them. That said, the number of customers who complete offers without viewing them proved to be only 0.4% of all customers, so this finding should be taken with a grain of salt due to the small sample size:
3. Does customer annual income influence the amount spent per transaction?
Looking at raw transaction event types and excluding offer events, I was surprised to learn that annual income does not meaningfully influence the amount spent by the customer per transaction. The average amount spent per transaction is $13.996, or roughly $14, with 99.5% of all customers spending less than $50. Looking at the graph below, the somewhat linear relationship only applies to the remaining 0.5% of customers who spend more than $50 per transaction:
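The two summary statistics above reduce to a mean and a threshold share over the transaction amounts. A minimal sketch with made-up amounts (the real data produced the ~$14 mean and 99.5% figure):

```python
import pandas as pd

# Toy per-transaction amounts; stand-ins for the real transaction log.
amounts = pd.Series([5.0, 12.5, 14.0, 9.99, 48.0, 75.0])

mean_spend = amounts.mean()                 # average spend per transaction
share_under_50 = (amounts < 50).mean()      # fraction of transactions under $50
print(round(mean_spend, 2), round(share_under_50, 3))
```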
Modeling and Results
Based on the prior findings, my hopes for a robust linear regression machine learning model were slim. The vast majority of customers in the data set appear to behave in a very similar manner regardless of demographic traits. Nonetheless, I pushed forward and decided to predict amount spent per transaction from customer age, gender, and annual income.
4. Can the amount spent per transaction be predicted based on customer traits?
I used a simple linear regression model with normalization and standardization (scaling) to put the numerical age and annual income features on comparable scales. Gender values were one-hot encoded with pandas' pd.get_dummies function. The r-squared metric was used to measure model performance under both the normalized and standardized approaches. Both yielded identical, poor model performance:
'The r-squared score for the model was 0.05879398428498239 on 36653 values.'
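The modeling pipeline can be sketched as below. The synthetic data, feature names, and scikit-learn standardization approach are assumptions standing in for the project's merged data set; the weak relationship between features and target mirrors the reported result:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; the real project used the merged Starbucks set.
rng = np.random.default_rng(42)
n = 500
df = pd.DataFrame({
    "age": rng.integers(18, 90, n),
    "income": rng.integers(30_000, 120_000, n),
    "gender": rng.choice(["F", "M", "O"], n),
})
# Target is roughly $14 plus noise, barely related to the features,
# mimicking the project's near-zero r-squared outcome.
df["amount"] = 14 + rng.normal(0, 10, n)

# One-hot encode gender and standardize the numeric features.
X = pd.get_dummies(df[["age", "income", "gender"]], columns=["gender"])
X[["age", "income"]] = StandardScaler().fit_transform(X[["age", "income"]])
y = df["amount"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LinearRegression().fit(X_train, y_train)
r2 = r2_score(y_test, model.predict(X_test))
print(round(r2, 3))
```

When the features carry little signal, the held-out r-squared hovers near (or below) zero, which is exactly the failure mode the project's score of ~0.059 reflects.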
Conclusion and Improvements
The poor r-squared results under both normalization and standardization do not come as a surprise given the aforementioned challenges in the exploratory data analysis. Improvements could have been made in my modeling approach: the model would have been much more informative on a targeted subset of the data (e.g., the transaction subset from question 3 above) than with the 'shotgun' approach of the entire merged data set.
In closing, I learned that industry data science teams at large corporations face the challenge of messy data that is difficult to wrangle and to develop robust machine learning models for. A good chunk of this project's time was spent extracting, transforming, and cleaning the data, which points to the fact that data science is not all about machine learning. Furthermore, in performing this project I learned that it is important to always tie data science work back to key company goals.
Github link: https://github.com/mlevanduski/Data_Science_Capstone