How do Kaggle competitions work and where do I start from?
Featured competitions, where a host posts a project built around a specific real-world problem, are the place to be. For instance, GE posted a project about vibration and heat and asked contributors to come up with an algorithm that predicts when an airplane might fail. As part of the problem, GE provided training data for which the expected outcome is known to both the company and the Kaggle contributors taking part in the competition. There is also a second set of data for which the outcome is known only to the company and not to the contributors. Scoring predictions on outcomes the contributors have never seen is one of the best and most reliable ways of determining the accuracy of a learning model. As contributors upload their submissions, Kaggle shows in real time how each one is faring against their peers on the live leaderboard.
Competitions typically last between two and six months, and contributors are allowed to upload five entries per day (as an individual or as a team).
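To make the scoring idea concrete, here is a minimal sketch of how a host could rank submissions against outcomes that only the host knows. This is not Kaggle's actual scoring code; the labels, team names, and the choice of plain accuracy as the metric are all made up for illustration.

```python
def accuracy(predicted, actual):
    """Fraction of predictions that match the hidden true outcomes."""
    if len(predicted) != len(actual):
        raise ValueError("a submission needs one prediction per test row")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


# Outcomes known only to the host (made-up values for illustration).
hidden_labels = [1, 0, 0, 1, 1, 0, 1, 0]

# Hypothetical submissions from two contributors.
submissions = {
    "team_a": [1, 0, 1, 1, 1, 0, 0, 0],
    "team_b": [1, 0, 0, 1, 0, 0, 1, 0],
}

# Rank contributors by score, best first, the way a leaderboard would.
leaderboard = sorted(
    ((name, accuracy(preds, hidden_labels)) for name, preds in submissions.items()),
    key=lambda row: row[1],
    reverse=True,
)

for rank, (name, score) in enumerate(leaderboard, start=1):
    print(rank, name, score)
```

Real competitions each define their own metric, so accuracy here only stands in for whatever the competition page specifies.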
One of the most critical questions that most people ask when getting started on Kaggle is whether they possess the necessary skills to take part in Kaggle competitions. It is normal to be overwhelmed by the level of difficulty that these competitions offer. It is all about being confident and approaching each competition as a learning experience and maybe a chance to win a prize.
Some reasons people hesitate to get started on Kaggle competitions include: underestimating their own knowledge, experience, techniques, and skills; the time commitment such a project demands; having to follow the guidelines and strict deadlines of the competition; and failing to correctly match their skill level to the difficulty of a problem.
According to most experienced Kagglers, the issue arises from the platform itself, as it offers very little information and help for beginners. Kaggle should provide a learning path so that beginners can learn independently, at their own pace, while coding for competitions. Because the competitions are so complex and diverse, it is challenging for a beginner to pick which one to participate in.
For a beginner, choosing a competition is perhaps the most challenging and overwhelming part, since there is a plethora of options to choose from.
Here are some tips to get you started in the right direction:
1. Go for a competition that you are genuinely interested in. When you choose a problem on a subject that intrigues and appeals to you, you will be more passionate and committed to following through and coming up with a feasible solution. You should have some experience or interest in a problem; it will differentiate you from your peers and will increase your chances of learning and success.
2. Take an active role in the forums and read the scripts, as this is a good opportunity to learn how other competitors construct features and interpret data. Also read blog posts detailing previous competitions and any winner interviews you can access. Kaggle has an official blog called No Free Hunch; make sure to follow it and read it regularly. The information you get by exploring these avenues will be extremely helpful in getting a feel for the battlefield
3. When starting out, keep in mind that you are taking on a learning process that has its own challenges and requires patience. If you get stuck somewhere and do not know what to do next, ask in the forums or team up with someone who can teach you more
4. Once you get the feel of the platform and have confidence in competing, find and explore different types of competitions to enhance your skills and enrich your experience. Go for something new every time. It is not only exciting and mentally stimulating but also gives you a rare learning opportunity
5. Every competition you take part in is an opportunity to learn from your peers and reflect on the weaknesses of your model and approach. It is advisable to keep track of your learning and techniques you have used. Ask yourself as many questions as you can. See if you can apply the same models to solve similar problems in the same or entirely different domains
6. The problems presented in Kaggle competitions are difficult. The host companies do not post simple problems that can be solved within one afternoon. The projects presented are complex and complicated. Hosts offer prizes to the winners and structure the competition so that they get value for every single penny spent. Most hosts see Kaggle as a platform for solving their toughest and biggest problems
7. The solutions need to be new and unique, not something already available elsewhere. This means that to have a chance at ranking high in any competition you need to not only customize algorithms but also learn advanced models and do extended research. This will require patience, exceptional data handling skills, time and creativity to build promising models
8. On Kaggle performance is relative; if you get frustrated at not winning or get discouraged by not ranking high, then this is probably not the best data science community for you. Whatever you come up with will be compared against every other participant and team in the competition. Not winning on Kaggle is not a loss but rather a chance to learn and advance your data science and machine learning skills
9. As a rule of thumb always submit a solution before the deadline expires
10. Take your time to thoroughly understand the domain before you even start analyzing the data. Understanding the domain in detail gives you a clear idea of how to handle the data
11. For each competition, it is advisable to build your own local evaluation procedure that mimics the Kaggle test score. A simple tenfold cross-validation typically works just fine. Make sure you understand the evaluation metric in detail
12. Use the training data to carve out different features. This is what pushes you from average into the top forty percent
13. Keep in mind that a single model has very little chance of getting you into the top ten. Your best chance of getting to the top is to create as many models as you can and then combine them into an ensemble
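Tips 11 and 13 can be sketched together. The example below uses only the Python standard library and toy data: it splits rows into ten folds, scores a model by average RMSE across the folds, and then scores a simple ensemble that averages the predictions of two models. The data, the two toy models, the contiguous (unshuffled) folds, and the use of RMSE are all assumptions for illustration; a real competition would use its own metric and far stronger models.

```python
import math


def k_fold_indices(n_rows, k):
    """Yield (train_idx, valid_idx) pairs for k contiguous folds."""
    fold_sizes = [n_rows // k + (1 if i < n_rows % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        stop = start + size
        valid = list(range(start, stop))
        train = [i for i in range(n_rows) if i < start or i >= stop]
        yield train, valid
        start = stop


def rmse(predicted, actual):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))


def fit_mean(xs, ys):
    """Toy model 1: always predict the mean of the training targets."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y


def fit_line(xs, ys):
    """Toy model 2: least-squares straight line through the training rows."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    var_x = sum((x - mean_x) ** 2 for x in xs)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x if var_x else 0.0
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def fit_ensemble(xs, ys):
    """Ensemble: average the predictions of both toy models."""
    members = [fit_mean(xs, ys), fit_line(xs, ys)]
    return lambda x: sum(m(x) for m in members) / len(members)


def cross_val_rmse(fit, xs, ys, k=10):
    """Average validation RMSE over k folds for a given fitting function."""
    scores = []
    for train, valid in k_fold_indices(len(xs), k):
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        preds = [model(xs[i]) for i in valid]
        scores.append(rmse(preds, [ys[i] for i in valid]))
    return sum(scores) / len(scores)


# Made-up data: a linear trend plus a small alternating disturbance.
xs = list(range(20))
ys = [2 * x + 1 + (1 if x % 2 else -1) for x in xs]

for name, fit in [("mean", fit_mean), ("line", fit_line), ("ensemble", fit_ensemble)]:
    print(name, round(cross_val_rmse(fit, xs, ys, k=10), 3))
```

Comparing the printed scores is the whole point: an ensemble only helps when its members make different kinds of mistakes, so it is worth checking each combination with cross-validation rather than assuming that more models always means a better score. Real data is usually shuffled before it is split into folds.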
There are several winning approaches to Kaggle competitions, depending on who you ask. The two that stand out are feature engineering and neural networks. Feature engineering is a killer approach if you understand the data inside out. It begins with plotting histograms and exploring what is included in the dataset. Part of it is generating and testing features to ascertain which ones correlate with the target variable. So far, most winning models on Kaggle have been ensembles of decision trees.
Deep learning and neural networks are a good place to start if you are dealing with speech datasets or image classification problems.
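As a rough sketch of the "generate features and test them against the target" step, the snippet below computes the Pearson correlation between each candidate feature and the target and ranks the features by the strength of that correlation. The column names and values are invented for illustration (loosely echoing the vibration and heat example above), and the product feature is just one hypothetical engineered feature.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 if either column is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        return 0.0
    return cov / (spread_x * spread_y)


# Made-up training columns and target (1 = the part failed).
vibration = [0.2, 0.5, 0.9, 0.4, 0.8, 0.1]
heat = [30, 45, 70, 40, 65, 28]
failed = [0, 0, 1, 0, 1, 0]

# Candidate features, including one hypothetical engineered feature.
features = {
    "vibration": vibration,
    "heat": heat,
    "vibration_x_heat": [v * h for v, h in zip(vibration, heat)],
}

# Rank features by absolute correlation with the target, strongest first.
ranked = sorted(
    ((name, pearson(values, failed)) for name, values in features.items()),
    key=lambda row: abs(row[1]),
    reverse=True,
)

for name, corr in ranked:
    print(name, round(corr, 3))
```

Correlation only captures linear relationships, so a low score here does not prove a feature is useless; tree-based models in particular can exploit features that look weak by this measure.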
Kaggle has created a platform where data scientists and hosts can interact and solve real-life problems. It is the place to be for companies looking for innovative, brilliant minds ready to offer solutions for their data science or machine learning problems. This platform is also an indispensable learning tool for the data science enthusiasts who want to learn and excel in their careers.
I’ve recently published a book Kaggle for Beginners. I hope you will enjoy reading it.