Kaggle is an excellent platform for data scientists to hone their skills, build a reputation and potentially win some prize money. However, succeeding on Kaggle is no small task; it takes patience, hard work, and consistent practice. Keep in mind that this platform is home to some of the most brilliant minds in data science, so the competition is tough. To become a Grandmaster, you need a high level of commitment and industry insight. This chapter gives you a brief guide to succeeding on Kaggle.
Step one is to read the competition guidelines thoroughly. Many Kagglers who struggle on this platform do not have a thorough understanding of the competition: its overview, description, timeline, evaluation and eligibility criteria, and prize. Ignoring these details will cost you in the long run. You need to know the deadline for your last submission; small details such as the timeline of a particular competition can be deal breakers. By studying the guidelines carefully, you will also uncover other commonly missed details such as the required submission format and guidance on reproducing benchmarks. Do not start working on a Kaggle competition before you are clear about all the instructions. Take your time before jumping in.
The second and very crucial step is to understand the performance measure. The performance measure is the yardstick your submission will be measured against, and you need to know it inside out. According to most experienced Kagglers, an approach optimised for the particular measure makes it substantially easier to boost your score. For instance, Mean Squared Error (MSE) and Mean Absolute Error (MAE) are closely related, but MSE penalizes large errors much more heavily than MAE, so a model tuned for one can score poorly on the other. Not knowing the difference will hurt your final score.
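To make the MSE versus MAE difference concrete, here is a minimal sketch using only the Python standard library. The target values are invented for illustration. It compares two constant predictions, the mean and the median of the targets, and shows that each one wins under a different measure.

```python
from statistics import mean, median

def mse(y_true, y_pred):
    """Mean squared error: average of squared residuals."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred))

def mae(y_true, y_pred):
    """Mean absolute error: average of absolute residuals."""
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

# Hypothetical targets with one large outlier (made up for this sketch).
y_true = [1.0, 2.0, 3.0, 4.0, 100.0]

# Two constant predictions: the mean and the median of the targets.
pred_mean = [mean(y_true)] * len(y_true)
pred_median = [median(y_true)] * len(y_true)

scores = {
    "mean": {"mse": mse(y_true, pred_mean), "mae": mae(y_true, pred_mean)},
    "median": {"mse": mse(y_true, pred_median), "mae": mae(y_true, pred_median)},
}

for name, s in scores.items():
    print(f"predict {name:<6}  MSE={s['mse']:.1f}  MAE={s['mae']:.1f}")
```

On this toy data the mean prediction has the lower MSE while the median prediction has the lower MAE, which is why the same model can rank differently depending on the measure the competition uses.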
Step three is to understand the data in detail. Start with exploratory data analysis to find missing and null values and hidden patterns in the dataset. The more you know about the data, the better the models you can build on top of it. Specialising your approach to the dataset works in your favor as long as you do not over-fit. Look for quirks in the data that you can exploit: can you extract secondary fields from the given primary values, or can you typecast the given values to another format to make them more machine-learning friendly?
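As a rough illustration of this step, the sketch below counts missing values per column, typecasts string fields, and derives secondary fields (year, month, weekday) from a date. It uses only the standard library to stay self-contained; in practice a dataframe library would be the usual tool. The column names and rows are hypothetical and do not come from any competition.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Hypothetical raw data; the columns and values are made up for illustration.
RAW = """id,purchase_date,price,color
1,2010-03-15,7200,RED
2,2010-07-04,,BLUE
3,,6500,
4,2011-01-20,8100,RED
"""

rows = list(csv.DictReader(io.StringIO(RAW)))

# 1. Count empty values per column.
missing = Counter()
for row in rows:
    for col, val in row.items():
        if val == "":
            missing[col] += 1

# 2. Typecast strings and derive secondary fields from the date.
def engineer(row):
    out = {"id": int(row["id"])}
    out["price"] = float(row["price"]) if row["price"] else None
    if row["purchase_date"]:
        d = datetime.strptime(row["purchase_date"], "%Y-%m-%d")
        out["year"], out["month"], out["weekday"] = d.year, d.month, d.weekday()
    else:
        out["year"] = out["month"] = out["weekday"] = None
    # Keep missing categories explicit instead of silently dropping them.
    out["color"] = row["color"] or "UNKNOWN"
    return out

features = [engineer(r) for r in rows]

print("missing per column:", dict(missing))
for f in features:
    print(f)
```

Whether to impute, flag, or drop the missing values is a modelling decision that depends on the competition; the sketch only makes them visible.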
Step four is to know what you want (the objective) before worrying about how. Most novices on Kaggle tend to worry excessively about which language to use (R or Python). It is wiser to begin by learning the data and ascertaining the patterns you intend to model. Knowing the domain and understanding the data go a long way when it comes to winning the competition.
Step five, an often neglected one, is to set up your own local validation environment. Doing so lets you move at a faster pace and produce dependable results instead of relying solely on leaderboard scores. You can skip this step if you are out of time, or if the dataset is small enough to be managed and executed comfortably in Kaggle's hosted Docker environment. With your own environment, you can score your models as many times as you like, and you are not bound by the daily submission limit (five a day in many Kaggle competitions). Once you feel confident enough about the results, you can submit them to the live competition. This gives you an immense edge over peers who do not have a local environment set up. By reducing the number of submissions you make, you also substantially reduce the probability of over-fitting the leaderboard, which will save you from poor results at the evaluation stage.
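One common way to build such a local validation setup is k-fold cross-validation. The following is a minimal standard-library sketch under stated assumptions: the helper names (`k_fold_indices`, `cross_validate`), the toy data, and the mean-predicting baseline are all hypothetical, and a real setup should use the competition's own metric and a split that mirrors how the test set was drawn.

```python
import random
from statistics import mean

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle row indices once and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(X, y, fit, predict, metric, k=5, seed=0):
    """Train on k-1 folds, score on the held-out fold, repeat for each fold."""
    scores = []
    for held_out in k_fold_indices(len(y), k, seed):
        held = set(held_out)
        train = [i for i in range(len(y)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = predict(model, [X[i] for i in held_out])
        scores.append(metric([y[i] for i in held_out], preds))
    return scores

# Toy baseline "model": always predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return mean(y_train)

def predict_mean(model, X_test):
    return [model] * len(X_test)

def mae(y_true, y_pred):
    return mean(abs(t - p) for t, p in zip(y_true, y_pred))

# Hypothetical data, invented for this sketch.
X = [[i] for i in range(20)]
y = [2.0 * i + 1.0 for i in range(20)]

fold_scores = cross_validate(X, y, fit_mean, predict_mean, mae, k=5)
cv_score = mean(fold_scores)
print("per-fold MAE:", [round(s, 2) for s in fold_scores])
print("local CV MAE:", round(cv_score, 2))
```

Comparing the local score with the public leaderboard score for a few early submissions is a reasonable check that the local setup tracks the real evaluation before you start trusting it.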
Step six is to read the forums. Forums and discussions are your friends. Take the time to monitor the forum consistently as you work on the competition; there is no way around it. Subscribe to the forum so that you receive notifications related to the competition you are participating in. The forum will help you keep abreast of what your competitors are up to, which has been made possible by the recent Kaggle trend of sharing code while the competition is still running. The host also often shares insights and directions about the competition on the forum. Even if you do not win, you can keep trying and learn from the post-competition summaries on the forum to see where you went wrong or what your peers did to outperform you. This is a great way to learn from the best and improve consistently.
Step seven is to research exhaustively. There is a good possibility that the competition you are participating in is hosted by people who have dedicated their careers to finding a viable solution. The hosts of such competitions often have code, benchmarks, official company blogs and extensive published papers or patents that come in handy. Even if you do not win in your first several attempts, you will learn, hone your skills and become a better data scientist.
Step eight is to stay with the basics and apply them rigorously. While playing around with obscure methods is fun for data scientists, it is the basics that will get you far in a competition. The common algorithms you may be tempted to ignore have great implementations. It is wise to manually tune the main parameters when experimenting with methods. Experienced Kagglers admit that manual tuning is one of their winning habits.
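Manual tuning can be as simple as sweeping one main parameter by hand and reading the validation scores. The sketch below does this for the number of neighbours in a toy nearest-neighbour regressor. The model, the synthetic data, and the candidate values are all assumptions made for illustration and are not taken from any competition.

```python
import random
from statistics import mean

def knn_predict(train, x, k):
    """Average the targets of the k training points nearest to x."""
    nearest = sorted(train, key=lambda pair: abs(pair[0] - x))[:k]
    return mean(t for _, t in nearest)

def validation_mae(train, valid, k):
    """Mean absolute error of the k-NN predictions on the validation set."""
    return mean(abs(t - knn_predict(train, x, k)) for x, t in valid)

# Synthetic data: y = x^2 plus Gaussian noise (invented for this sketch).
rng = random.Random(42)
data = [(x, x * x + rng.gauss(0, 2.0)) for x in [i / 4 for i in range(80)]]
rng.shuffle(data)
train, valid = data[:60], data[60:]

# Try a handful of values for the main parameter and inspect each score.
results = {k: validation_mae(train, valid, k) for k in (1, 3, 5, 10, 30, 60)}
for k, score in results.items():
    print(f"k={k:>2}  validation MAE={score:.3f}")

best_k = min(results, key=results.get)
print("best k on this split:", best_k)
```

Looking at the whole table of scores, rather than only the best value, shows how sensitive the model is to the parameter and where a finer search is worth the time.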
Step nine is the mother of all steps: it’s time to ensemble models. Ensembling simply means combining the models that you have developed independently. In most high-profile competitions, different teams come together to combine their models and boost their scores. Since high-profile Kaggle competitions are rarely won with a single model, it is wise to merge different independent models even when you are competing solo.
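The simplest form of ensembling is averaging the predictions of several models. The sketch below simulates three hypothetical models whose errors are independent and compares each one with their blend. The `blend` helper and the simulated predictions are assumptions for illustration; real models usually have correlated errors, so the gain is typically smaller than in this idealised case.

```python
import random
from statistics import mean

def rmse(y_true, y_pred):
    """Root mean squared error."""
    return mean((t - p) ** 2 for t, p in zip(y_true, y_pred)) ** 0.5

def blend(predictions, weights=None):
    """Weighted average of several models' predictions, row by row."""
    if weights is None:
        weights = [1.0 / len(predictions)] * len(predictions)
    return [sum(w * p for w, p in zip(weights, row)) for row in zip(*predictions)]

rng = random.Random(0)
y_true = [float(i) for i in range(200)]

# Three simulated models: the truth plus independent Gaussian noise.
model_preds = [[t + rng.gauss(0, 5.0) for t in y_true] for _ in range(3)]

single_scores = [rmse(y_true, p) for p in model_preds]
blend_score = rmse(y_true, blend(model_preds))

for i, s in enumerate(single_scores, start=1):
    print(f"model {i} RMSE: {s:.3f}")
print(f"blend   RMSE: {blend_score:.3f}")
```

The `weights` argument allows a better model to count for more; choosing those weights on a local validation set rather than on the leaderboard avoids over-fitting the blend.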
Step ten is to commit to a single project or a select few. If you try to compete in every single competition, you will lose focus. It is better to focus on one or two and prove your mettle. The rank progression all the way to Grandmaster will come naturally from doing that. Remember that time and patience are two prime factors, along with your data science expertise, in moving forward.
Step eleven, the final step, is to pick the right approach. In the history of Kaggle, two winning approaches keep emerging from the competitions: feature engineering, and neural networks and deep learning.
Feature engineering is the best approach if you understand the data. The first step is to plot histograms of the provided data to help you explore it. You will then typically spend a large amount of time generating features and testing which ones correlate with the given target variable. For example, consider a recent Kaggle competition titled Don’t Get Kicked, hosted by the used-car retailer Carvana. The participants were required to predict which cars purchased at a second-hand (pre-owned) auction would turn out to be bad buys. Many participants put forward their algorithms and models. Ultimately, it turned out that one of the most predictive features was color. The participants grouped the cars into two categories: standard colors and unusual colors. It turned out that unusually colored cars were less likely to be bad buys. Before participants arrived at this conclusion, there were numerous hypotheses, models, and kernels that did not perform as expected.
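The color grouping described above can be sketched in a few lines: derive a coarse feature from a raw one, then compare the target rate across the new groups. Everything here is hypothetical, including the set of colors treated as standard, the toy rows, and the generic binary `target`; the numbers are invented to show the mechanics and say nothing about the actual competition data.

```python
from collections import defaultdict

# Assumed grouping; which colors count as "standard" is a judgment call.
STANDARD_COLORS = {"BLACK", "WHITE", "SILVER", "GREY", "BLUE", "RED"}

# Hypothetical toy rows of (color, binary target), invented for this sketch.
rows = [
    ("BLACK", 1), ("WHITE", 0), ("SILVER", 1),
    ("BLUE", 0), ("RED", 1), ("GREY", 0),
    ("ORANGE", 0), ("PURPLE", 0), ("YELLOW", 1), ("GREEN", 0),
]

def color_group(color):
    """Engineered feature: collapse many colors into two categories."""
    return "standard" if color.upper() in STANDARD_COLORS else "unusual"

def target_rate_by_group(rows):
    """Mean of the binary target within each engineered group."""
    totals = defaultdict(lambda: [0, 0])  # group -> [sum of target, row count]
    for color, target in rows:
        group = color_group(color)
        totals[group][0] += target
        totals[group][1] += 1
    return {group: s / n for group, (s, n) in totals.items()}

rates = target_rate_by_group(rows)
for group, rate in sorted(rates.items()):
    print(f"{group:<8} target rate = {rate:.2f}")
```

A visible gap between the groups suggests the engineered feature is worth feeding to a model; it should still be confirmed on a held-out validation set before being trusted.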
The most popular winning algorithm used to be Random Forest. However, this has changed over the last six months. A newer algorithm, XGBoost, is becoming the winner; it is taking over practically every competition for structured data.
The second winning approach on Kaggle is neural networks and deep learning. If you are dealing with a dataset that contains speech or image-rich content, deep learning is the way to go. The Kagglers who emerge as winners in these competitions on unstructured data rarely spend any time on feature engineering; they consider it more productive and effective to focus on the construction of neural networks. For example, let’s take a look at a Kaggle problem that calls for the deep learning approach. In the diabetic retinopathy detection competition, sponsored by the California Healthcare Foundation, participants were given images of the eye and asked to diagnose which ones indicated the presence of diabetic retinopathy. This devastating illness is one of the leading causes of blindness in the United States. The winning algorithm agreed with an ophthalmologist about as often as one professional ophthalmologist agrees with another.
So in a Kaggle competition, should you use deep learning and build neural networks, or just opt for feature engineering? Choosing the best approach for a particular competition is fairly straightforward. If you are dealing with a problem that consists of a lot of structured data, your best bet is the feature engineering approach. On the other hand, if you are dealing with unstructured data, such as a large number of images, the recommended approach is building and training neural networks. In practice, it is often a mix of the two that takes the prize.
Believe in yourself and take the time to learn as much as you can. Avoid dismissing any piece of information. For all data scientists who want to master machine learning algorithms, Kaggle is the best platform to boost your experience and hone your skills.
You may like to read my recent book – Kaggle For Beginners as well.