Home Depot Kaggle Competition Comes to a Conclusion

01 May 2016

I have spent the past few days working on the Home Depot Product Search Relevance competition on Kaggle. My approach was to use it as an opportunity to try different algorithms in the caret package rather than to compete aggressively for the prize, and the purpose of this post is to share some of what I learned along the way. The basic premise of the competition was to predict how relevant a product is to a customer's search query. Home Depot provided a list of product search queries and, for each query, several products with a relevance score for each. The task was to accurately predict a product's relevance score for a given query. The goal was not to classify a product as relevant or not, but to predict the score assigned by Home Depot's human raters. It's a subtle difference, but it's important to know where the prediction target comes from.

Prediction Error and Competition Rank

The chart at right summarizes the relationship between prediction error and competitor rank. You can see how the prediction error decreases as a competitor's rank improves, that is, gets closer to the winner. This is similar to scoring in golf: the lowest score wins, and as your score increases, your rank behind the winner also increases. One thing you will notice from the chart is that the error scores flatten out at just over 0.53 and decline slowly toward the winning score of 0.431. This means that small improvements in your prediction error could result in large jumps in the rankings, so it would make sense to put in the extra effort to improve your score. On the other hand, since nearly all of the competitors had very similar prediction error scores, the chart also suggests that opportunities to improve a model were hard to find.

You can see where I landed in the competition. I ranked 1,397 of 2,147 teams, which wasn't as high as I'd hoped. I take comfort in knowing that my error score was only 0.06 behind the winner's. I could have put more time into my model, but the law of diminishing returns was in full effect: any further gains in my score would have cost far more effort than the ones I had already made.

Workflow Summary

My workflow followed this general path. You can also follow along with the R code here.
  1. Prepare the search queries and product descriptions by correcting spelling errors, stemming the words, and removing stop words (e.g., words like a, the, and).
  2. Calculate term-frequency metrics: count the number of words from the search query that occur in each of the title, description, and brand name fields, with each matching word counting as one (a simplified sketch of steps 1 and 2 follows this list).
  3. Generate several independent regression models based on the term-frequency metrics. The models included a generalized linear model (GLM), support vector machines (SVM), and Extreme Gradient Boosting (XGBoost). Each model was trained using cross validation, and because SVM is very resource intensive I set the models to run in parallel (see the training sketch below).
  4. Stack the models using a simple linear regression: I added the outputs of the three models as features to the training set and trained a simple linear regression model on top of this super-set (see the stacking sketch below).
  5. Repeat steps 1 and 2 on the test set and then predict the relevance score.
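
To make steps 1 and 2 concrete, here is a simplified sketch of the idea. The tm and SnowballC packages, the helper names, and the example strings are stand-ins for illustration; the full R code linked above is what I actually ran.

```r
library(tm)         # stop word lists and text utilities
library(SnowballC)  # Porter stemmer

# Step 1: lower-case, split into words, drop stop words, and stem what remains.
clean_text <- function(x) {
  words <- unlist(strsplit(tolower(x), "[^a-z0-9]+"))
  words <- words[words != "" & !(words %in% stopwords("en"))]
  wordStem(words, language = "english")
}

# Step 2: term-frequency metric -- how many query words appear in a product field?
# Each matching word counts as one.
count_matches <- function(query, field) {
  sum(clean_text(query) %in% clean_text(field))
}

count_matches("angle bracket", "Simpson Strong-Tie 12-Gauge Angle")
```

The same count_matches() call is repeated against the title, description, and brand name fields to build the term-frequency features.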
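
Step 3 with caret looked roughly like the following. The data frame name, fold count, and specific caret method strings are placeholders rather than my exact settings, but the sketch shows the cross-validated training and the parallel backend that kept the SVM runs manageable.

```r
library(caret)
library(doParallel)

# SVM in particular is resource intensive, so spread the work across cores.
cl <- makeCluster(4)
registerDoParallel(cl)

# The same cross-validation setup is shared by all three models.
ctrl <- trainControl(method = "cv", number = 5)

# train_df holds the term-frequency features plus the relevance target.
glm_fit <- train(relevance ~ ., data = train_df, method = "glm",       trControl = ctrl)
svm_fit <- train(relevance ~ ., data = train_df, method = "svmRadial", trControl = ctrl)
xgb_fit <- train(relevance ~ ., data = train_df, method = "xgbTree",   trControl = ctrl)

stopCluster(cl)
```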
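
Step 4, the stacking layer, is then just a linear regression over the three model outputs added back onto the training set (again with placeholder names):

```r
# Add each model's output on the training set as new features ...
train_df$pred_glm <- predict(glm_fit, newdata = train_df)
train_df$pred_svm <- predict(svm_fit, newdata = train_df)
train_df$pred_xgb <- predict(xgb_fit, newdata = train_df)

# ... and train a simple linear regression on top of this super-set.
stack_fit <- lm(relevance ~ pred_glm + pred_svm + pred_xgb, data = train_df)
```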

Lessons Learned

  • XGBoost is incredibly fast and powerful. This was my first time using the algorithm and I was really impressed by how quickly it trained. It took me a while to figure out how to use the tuning parameters and what the difference between L1 and L2 regularization was, but I will definitely be using this model again.
  • The caret package documentation left me frustrated. I was often trying to bridge from the docs for the original model library to how those options map onto caret's train() and trainControl() functions. I'm still not sure how to determine which inputs go in which function, and I only got my models to train through a combination of Stack Overflow and trial and error. The sketch below shows the split that eventually worked for me.
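
In short: resampling choices go in trainControl(), while model-specific tuning values are passed to train() through tuneGrid. The grid values below are illustrative rather than my final settings, and caret's xgbLinear method is one place where the L1 (alpha) and L2 (lambda) penalties show up directly as tuning parameters.

```r
library(caret)

# Resampling setup lives in trainControl() ...
ctrl <- trainControl(method = "cv", number = 5)

# ... while the model-specific knobs go to train() via tuneGrid.
# For xgbLinear, alpha is the L1 penalty and lambda is the L2 penalty.
xgb_grid <- expand.grid(
  nrounds = c(100, 200),
  lambda  = c(0, 0.1),  # L2 regularization
  alpha   = c(0, 0.1),  # L1 regularization
  eta     = 0.3
)

xgb_fit <- train(relevance ~ ., data = train_df, method = "xgbLinear",
                 trControl = ctrl, tuneGrid = xgb_grid)
```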

Conclusion

This was an interesting competition that allowed me to try out some new tools for the first time. I've learned a lot about XGBoost, which will make future competitions more fun. If you enjoyed this post, feel free to share it on your favorite social network. You can also follow me on Google+.

Posted with: R, Kaggle