Prediction Error and Competition Rank
The chart at right summarizes the relationship between prediction error and competitor rank. You can see how the prediction error score decreases as the competitor rank gets closer to the winner. This is similar to scoring in golf: the lowest score wins, and as your score increases, so does your rank behind the winner. One thing you will notice from the chart is that the prediction error scores flatten out at just over 0.53 and decrease slowly toward the winning score of 0.431. This means that small improvements in your prediction error score could produce large jumps in the rankings, so it would make sense to put in the extra effort to improve your score. On the other hand, since nearly all of the competitors had very close prediction error scores, the chart also suggests that opportunities to improve a model were hard to find.
You can see where I landed in the competition. I ranked 1,397 out of 2,147 teams, which wasn't as high as I'd hoped. I take comfort in knowing that my error rate was only 0.06 behind the winner's. I could have put more time into my model, but the law of diminishing returns was in full effect: any further gains in my score would have been much more expensive than what I had already achieved.
Workflow Summary
My workflow followed this general path. You can also follow along with the R code here.
- Prepare the search queries and product descriptions by correcting spelling errors, stemming the words, and removing stop words (e.g., words like a, the, and).
- Calculate term-frequency metrics: count the number of words from the search query that occur in each of the title, description, and brand name fields. Each matching word counts as one.
- Generate several independent regression models based on the term-frequency metrics. The models included a generalized linear model (GLM), a support vector machine (SVM), and Extreme Gradient Boosting (XGB). Each model was trained using cross-validation. SVM training is very resource intensive, so I set the models to run in parallel.
- Stack the models using a simple linear regression. I added the outputs of the three models as features to the training set and trained a simple linear regression model on top of this super-set.
- Repeat steps 1 and 2 on the test set and then predict the relevance score.
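To make steps 1 and 2 concrete, here is a minimal base-R sketch (the helper names are mine, not from the competition code). The real pipeline also corrected spelling and stemmed words; this version only lowercases, strips punctuation, and drops a small stop-word list before counting query-word matches:

```r
# Sketch of steps 1-2: normalize text, then count query-word matches.
# A fuller version would also fix spelling and stem words (e.g. with SnowballC).

stop_words <- c("a", "an", "the", "and", "of")

# Lowercase, strip punctuation, split on whitespace, drop stop words.
clean_terms <- function(text) {
  text  <- tolower(gsub("[[:punct:]]", " ", text))
  terms <- unlist(strsplit(text, "\\s+"))
  terms[nchar(terms) > 0 & !terms %in% stop_words]
}

# Count how many words from the search query occur in a product field;
# each matching query word counts once.
match_count <- function(query, field) {
  sum(clean_terms(query) %in% clean_terms(field))
}

query <- "angle bracket"
title <- "Simpson Strong-Tie 12-Gauge Angle"
match_count(query, title)  # 1: only "angle" appears in the title
```

Running this over the title, description, and brand name fields gives one term-frequency feature per field.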
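The stacking step amounts to treating each base model's predictions as new features and fitting a plain lm() on top. A toy sketch with synthetic predictions (the data here is made up to stand in for the GLM, SVM, and XGB outputs; it is not the competition set):

```r
# Sketch of step 4: stack base-model predictions with a simple linear regression.
set.seed(42)
relevance <- runif(100, min = 1, max = 3)       # target relevance scores (synthetic)
glm_pred  <- relevance + rnorm(100, sd = 0.30)  # stand-ins for the three
svm_pred  <- relevance + rnorm(100, sd = 0.25)  # base models' predictions
xgb_pred  <- relevance + rnorm(100, sd = 0.20)

# Add the base-model outputs as features to the training set ("super-set").
train_super <- data.frame(relevance, glm_pred, svm_pred, xgb_pred)

# The stacker: a plain linear regression over the three base predictions.
stack_fit <- lm(relevance ~ glm_pred + svm_pred + xgb_pred, data = train_super)

# At prediction time, run the base models on the test set (steps 1-2 repeated),
# then feed their outputs through the stacker.
stacked <- predict(stack_fit, newdata = train_super)
sqrt(mean((stacked - relevance)^2))  # in-sample RMSE of the stack
```

The stacker learns how much to trust each base model, which is why its in-sample error is no worse than any single model's.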
- XGBoost is incredibly fast and powerful. This was my first time using the algorithm, and I was really impressed by how quickly it trained. It took me a while to figure out how to use the tuning parameters and what the difference between L1 and L2 regularization was, but I will definitely be using this model again.
- The caret package documentation left me frustrated. I was often trying to bridge from the docs for the original model library to how it was implemented in caret's train and trainControl functions. I'm still not sure how to determine which inputs go to which function, and I only got my model to train through a combination of StackOverflow and trial and error.
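On the L1 versus L2 point above, a summary of my own (not from the competition write-up): in xgboost, the alpha parameter controls the L1 penalty and lambda controls the L2 penalty. The behavioral difference is easy to see in the one-weight updates the two penalties induce; L1 soft-thresholds weights and can zero them out, while L2 only shrinks them proportionally:

```r
# L1 (alpha * sum(|w|)) induces soft-thresholding: small weights become exactly 0.
soft_threshold <- function(w, alpha) sign(w) * pmax(abs(w) - alpha, 0)

# L2 (lambda * sum(w^2)) induces proportional shrinkage: weights approach
# zero but never reach it.
ridge_shrink <- function(w, lambda) w / (1 + lambda)

w <- c(-0.05, 0.4, 1.2)
soft_threshold(w, alpha = 0.1)   # first weight is zeroed out
ridge_shrink(w, lambda = 0.1)    # all weights shrink, none become 0
```

In practice this means raising alpha prunes weak leaf weights entirely, while raising lambda smooths all of them.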