The most pressing question data science enthusiasts face when starting out on Kaggle is whether they are skilled enough to do well in Kaggle competitions. This is like a fear of water that keeps you from ever starting to swim. Just as you cannot find out how deep the water is until you get in, do not assume you cannot win until you try.
I am sure you are wondering: how do I try? The great news is that Kaggle has learning resources to help beginners understand what is involved in real-life Kaggle competitions. We are going to look at how to choose the appropriate Kaggle problem based on your experience, knowledge, tools, and techniques. We will also look at the skills each problem requires and its level of difficulty.
My goal is to make Kaggle a less frightening place for you, so you can practice and learn on your own.
Let’s start by looking at common Kaggle tutorials and their level of difficulty:
Titanic: Machine Learning from Disaster
This is the perfect problem for beginners in machine learning. In this competition, users are given the attributes of the passengers on board a ship that sinks and are required to predict whether each passenger will survive. The attributes provided include gender, passenger ID, traveling class, and the cost of the ticket. This is the evergreen Kaggle tutorial, and you will find plenty of kernels and blog posts on how to complete this learning assignment.
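To make the task concrete, here is a minimal sketch of a rule-based baseline of the kind often tried before any model is trained: predict survival from gender alone. The rows below are made up for illustration, and the column names (`PassengerId`, `Sex`, `Pclass`, `Fare`) are assumptions about how the attributes are labeled; check them against the competition's data files.

```python
# Made-up toy rows for illustration only; real data comes from the competition's CSV files.
# Column names are assumptions and should be checked against the actual files.
passengers = [
    {"PassengerId": 1, "Sex": "male", "Pclass": 3, "Fare": 7.25},
    {"PassengerId": 2, "Sex": "female", "Pclass": 1, "Fare": 71.28},
    {"PassengerId": 3, "Sex": "female", "Pclass": 3, "Fare": 7.93},
    {"PassengerId": 4, "Sex": "male", "Pclass": 1, "Fare": 53.10},
]


def predict_survival(passenger):
    """Rule-based baseline: predict 1 (survived) for women, 0 otherwise."""
    return 1 if passenger["Sex"] == "female" else 0


# Map each passenger ID to the baseline prediction.
predictions = {p["PassengerId"]: predict_survival(p) for p in passengers}
print(predictions)  # {1: 0, 2: 1, 3: 1, 4: 0}
```

A baseline like this gives you a score to beat once you start training real models on the full attribute set.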
Digit Recognizer
In this tutorial competition, users are required to identify digits from thousands of provided handwritten images. It is a very good starting point for image recognition and for gaining experience with machine learning techniques.
Bag of Words Meets Bags of Popcorn
The aim of this tutorial competition is to predict sentiment labels. The data provided is a collection of movie reviews. This competition will introduce you to Word2Vec, a word-embedding technique developed at Google.
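The "bag of words" in the competition's title refers to representing each review as a vector of word counts, ignoring word order. The sketch below shows that idea in plain Python; the two reviews are made up, the helper names are hypothetical, and this is the simple counting representation, not Word2Vec.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(reviews):
    """Return the sorted vocabulary and one word-count vector per review."""
    vocab = sorted({token for review in reviews for token in tokenize(review)})
    vectors = []
    for review in reviews:
        counts = Counter(tokenize(review))
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


# Made-up reviews for illustration only.
reviews = ["A great, great movie", "Not a great movie"]
vocab, vectors = bag_of_words(reviews)
print(vocab)    # ['a', 'great', 'movie', 'not']
print(vectors)  # [[1, 2, 1, 0], [1, 1, 1, 1]]
```

Vectors like these can be fed to any classifier to predict the sentiment label of a review.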
Denoising Dirty Documents
Ever heard of OCR? Optical Character Recognition is a valuable technology that helps convert handwritten documents into digital ones. Like any other technology, OCR has shortcomings; your job as a data scientist is to improve its performance using machine learning.
San Francisco Crime Classification
This is a very interesting competition. The user is given the time and location of a crime and needs to predict the crime category.
Taxi Trajectory Prediction
This is a challenging data science problem that requires solving two problems using the same dataset. Users are asked to use initial partial trajectories to predict the destination and how long it will take to get there.
Facebook Recruiting – Bot or Human
Understanding a new domain is crucial for data scientists. We live in a hybrid, connected world where the boundaries between humans and machines keep blurring, and this is a very interesting problem for that world. Users are given extensive bidding data and then asked to identify whether a bidder is a human or a bot.
Now that you have an idea of what these tutorial competitions are about, let’s discuss which path to choose based on your expertise and interests. If you are new to machine learning but have a good programming background, the most suitable Kaggle tutorial to begin with is Taxi Trajectory Prediction. It features a complex data set in which one column holds JSON (the coordinates visited by the taxi). Once you can break this down, you do not necessarily need machine learning to get initial estimates of the time to reach the destination; you can use your coding background to solve the problem.
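As a rough illustration of breaking down a JSON trajectory column with plain code, here is a minimal sketch. It assumes each row stores its trace as a JSON string of [longitude, latitude] pairs and that consecutive points are a fixed 15 seconds apart; both are assumptions to verify against the competition's data description. The function names and the sample trajectory are hypothetical.

```python
import json

# Assumed sampling interval between consecutive GPS points;
# verify this against the competition's data page.
SAMPLING_SECONDS = 15


def parse_polyline(raw):
    """Decode the JSON text of a trajectory into a list of (lon, lat) tuples."""
    return [tuple(point) for point in json.loads(raw)]


def trip_seconds(points, step=SAMPLING_SECONDS):
    """Elapsed time implied by the number of recorded points."""
    return max(len(points) - 1, 0) * step


def naive_destination(points):
    """Simplest destination guess: the last position seen so far."""
    return points[-1] if points else None


# Made-up trajectory with four GPS fixes, for illustration only.
raw = "[[-8.61, 41.14], [-8.62, 41.15], [-8.63, 41.15], [-8.64, 41.16]]"
points = parse_polyline(raw)
print(trip_seconds(points))       # 45
print(naive_destination(points))  # (-8.64, 41.16)
```

Estimates this crude will not win the competition, but they give you a working pipeline from raw data to a prediction without any machine learning.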
The next project you should take on is Titanic. You now have some background in handling complicated data sets, so this is the time to try machine learning problems. There are good kernels and scripts available to guide you in the right direction.
I would suggest taking on the Facebook bot problem last to challenge your machine learning skills.
If you are experienced with building models but not really comfortable working with Python or R, your first bet should be the Titanic competition. You already know how to build a model, so you can focus on learning the language. There are several kernels available to play with, and you can apply various models to find a solution and improve its performance. It will also help you learn new machine learning methods.
As a next step, you can take on the Facebook bot competition, or maybe something outside your comfort zone such as crime classification, Avito Context Ad Clicks, or diabetic retinopathy detection, to practice and polish your data science and machine learning skills.
I’ve recently published a book, Kaggle for Beginners. I hope you will enjoy it.