Random forests are one of my favorite machine learning methods. I’ve found them incredibly powerful for prediction in my work, but I often run into performance issues running them on my local machine. A coworker recommended the R package H2O, an open-source, high-performance, in-memory machine learning platform. It has been a game changer for running efficient predictive models locally. In this post, I will walk through an implementation of a random forest using a long-closed Kaggle competition, Don’t Get Kicked.
I chose this dataset because it has both numeric and categorical predictors, a mix I often encounter in my own work. The goal of this Kaggle competition is essentially to predict which cars bought at auction will turn out to be “lemons.” Download the training data to get started. There is also an accompanying data dictionary if you are interested in the variable definitions.
One last note before digging in: Markdown has trouble working with the H2O environment, so I ran the code in a separate workbook and pasted the output and images into this document. I also did not set a seed, so your results (if you choose to run this) may differ slightly from mine.
Data Setup
First, I will read the data and clean it up a bit.
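Since the original code lives in a separate workbook, here is a minimal sketch of this step. The file name `training.csv` and the use of `data.table` are my assumptions; the column names (`IsBadBuy`, `WheelType`, etc.) come from the Kaggle data dictionary.

```r
library(data.table)

# Read the Kaggle training data (assumed to be saved locally as training.csv)
train_raw <- fread("training.csv", stringsAsFactors = FALSE)

# The target, IsBadBuy, should be a factor so the model treats this
# as a classification problem rather than a regression
train_raw$IsBadBuy <- as.factor(train_raw$IsBadBuy)

# Convert the character columns (e.g. WheelType, Make) to factors
char_cols <- names(train_raw)[sapply(train_raw, is.character)]
train_raw[, (char_cols) := lapply(.SD, as.factor), .SDcols = char_cols]
```

Any additional cleaning (dropping high-cardinality text fields, parsing dates) would go here as well.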
H2O and Random Forest
Now that the data has been formatted, let’s run the random forest model. Since I will be using H2O, I need to initialize a local cluster before running the model. I will use 75% of the data as a training set and 25% as a testing set. There is a separate test data set available on Kaggle; if you prefer to use the entire training.csv file as the training set and test.csv as the test set, you could certainly do that instead.
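A sketch of the cluster setup, split, and model call. It assumes the cleaned data frame from the previous step is named `train_raw`, that `RefId` is the row identifier (per the data dictionary), and `ntrees = 100` is just a reasonable starting point, not a tuned value.

```r
library(h2o)

# Start a local H2O cluster using all available cores
h2o.init(nthreads = -1)

# Push the cleaned R data frame into the H2O cluster
train_hex <- as.h2o(train_raw)

# 75% training / 25% testing split
splits <- h2o.splitFrame(train_hex, ratios = 0.75)
train  <- splits[[1]]
test   <- splits[[2]]

# Random forest: predict IsBadBuy from every column except the ID
rf_model <- h2o.randomForest(
  x                = setdiff(names(train), c("IsBadBuy", "RefId")),
  y                = "IsBadBuy",
  training_frame   = train,
  validation_frame = test,
  ntrees           = 100
)
```

Passing the held-out split as `validation_frame` is what makes the validation metrics in the next section available directly from the model object.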
In this example, I just threw all the variables into the model. I would typically do a separate analysis to determine which features to include, but let’s skip that for now. Next, we will look at variable importance and some metrics from the validation set.
Validation Metrics
Let’s see how our model performed. The output below summarizes the model performance on the held-out test set (the 25% split off from training.csv).
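Assuming the model was fit with a `validation_frame` as above, the metrics can be pulled like this (a sketch, with `rf_model` carried over from the earlier step):

```r
# Performance metrics computed on the 25% validation split
perf <- h2o.performance(rf_model, valid = TRUE)
perf

# AUC is a convenient single-number summary for this binary target
h2o.auc(perf)
```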
Whenever I run a random forest model, I always look at the variable importance output. It is interesting to see which variables perform well and which do not. According to the H2O documentation …
“Variable importance is determined by calculating the relative influence of each variable: whether that variable was selected during splitting in the tree building process and how much the squared error (over all trees) improved as a result.”
In our case, “WheelType” (the vehicle wheel type description: Alloy, Covers, Special) was the strongest performer.
The variable importance plot displays the scaled importance.
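Both the importance table and the scaled-importance plot come straight from the model object (the `num_of_features` cutoff here is my choice):

```r
# Variable importance table (relative, scaled, and percentage columns)
h2o.varimp(rf_model)

# Plot of scaled importance for the top features
h2o.varimp_plot(rf_model, num_of_features = 20)
```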
Lastly, I will take a look at the ROC curve. Our model is better than making random predictions - yay!
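A sketch of how the ROC curve can be drawn from the validation metrics; the diagonal on the plot corresponds to random guessing:

```r
# ROC curve for the validation split
perf <- h2o.performance(rf_model, valid = TRUE)
plot(perf, type = "roc")
```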
Cluster Shut Down
If you are satisfied with the result, go ahead and shut down the cluster you have running locally. However, if you would like to go back and refine the model you built, you can shut it down later.
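Shutting down is a one-liner (note that this discards any models still in the cluster’s memory):

```r
# Stop the local H2O cluster without asking for confirmation
h2o.shutdown(prompt = FALSE)
```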
Submission to Kaggle
If you want to submit this random forest model (or a tweaked version from my code above) to Kaggle, the code I used to create the submission file is below. For your submission to be accepted, you must make predictions on the test.csv dataset provided on the Kaggle website. You will have to import it and repeat the data preprocessing above to get a result Kaggle will accept (hey … it will be good practice!).
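A sketch of the submission step. It assumes the submission format is a `RefId` column plus an `IsBadBuy` probability (per the competition page), and that `rf_model` and the preprocessing from earlier are still in scope; `p1` is H2O’s predicted probability for the positive class.

```r
# Read Kaggle's test.csv and apply the same preprocessing as training.csv
submit_raw <- fread("test.csv", stringsAsFactors = FALSE)
char_cols  <- names(submit_raw)[sapply(submit_raw, is.character)]
submit_raw[, (char_cols) := lapply(.SD, as.factor), .SDcols = char_cols]

# Score the test set on the H2O cluster and pull predictions back into R
submit_hex <- as.h2o(submit_raw)
preds <- as.data.frame(h2o.predict(rf_model, submit_hex))

# Build the two-column file Kaggle expects and write it out
submission <- data.frame(RefId = submit_raw$RefId, IsBadBuy = preds$p1)
write.csv(submission, "submission.csv", row.names = FALSE)
```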
This solution will get you about middle of the pack on Kaggle.