Mercari (from the Latin for "to trade") is a Japanese unicorn startup – think of it as an advanced version of Craigslist. It was founded in 2013 by Hiroshi Toshima, Ryo Ishizuka, Shintaro Yamada, and Tommy Tomishima. It is a community-sourced, consumer-to-consumer mobile marketplace for securely buying and selling anything and everything. Mercari expanded to the US and the UK in 2014. This link gives an inside account of their market analysis and data patterns for the US market, and of how they fight fraudsters using data science.
It has raised $116 million since inception across five funding rounds and has seen 55 million downloads worldwide. The app sees more than 100,000 item updates per day. Mercari differentiates itself by attaching emotional value to the things being traded on its app and by helping customers throughout the transaction. Going one step further in the age of AI and data science, Mercari recently launched this competition with $100,000 in prize money.
The goal of the competition is to predict a product's selling price so Mercari can guide sellers as they list their products – increasing their chances of selling at the right price, and giving buyers confidence that they are paying the right price as well.
Right now, there are 650+ teams competing for the prize, and you have three months to compete. Performance is measured in Root Mean Squared Logarithmic Error (RMSLE) – the gap between your predicted price and the price Mercari actually sold the product for – and is updated live on the leaderboard.
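For concreteness, here is a minimal sketch of the RMSLE metric in NumPy (the exact evaluation code is Kaggle's, so treat this as an illustration):

```python
import numpy as np

def rmsle(y_pred, y_true):
    """Root Mean Squared Logarithmic Error between predicted and actual prices."""
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))

# A prediction that is off by a constant factor of 2x everywhere
print(rmsle(np.array([10.0, 100.0]), np.array([20.0, 200.0])))  # ≈ 0.67
```

Note that because the error is taken on log-transformed prices, being $10 off on a $20 item hurts far more than being $10 off on a $200 item.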
The given datasets include close to 700,000 observations in test.tsv; your submission is a predicted price for each item in this file. The training dataset (which includes the price) contains close to 1.5 million observations.
The tab-delimited files contain eight fields – ID, Name, Item Condition, Category, Brand Name, Price, Shipping (1 if paid by the seller, 0 if paid by the buyer), and Item Description (the textual description of the product).
Of these eight variables, four contain numeric data (ID, Item Condition, Price, and Shipping) and four contain string data (Name, Category, Brand, and Description). A lot of NLP can be performed on the descriptions to find out which words and descriptors correlate with product price. Brand name and category should carry their own very important weights as well.
There are 1,287 item categories in the dataset, which can be reduced to a dozen or so main categories. Women and Beauty account for more than 50% of the observations. I would highly recommend going through the wonderful EDA, Interactive EDA, and Data Analysis by my fellow Kagglers to understand the data inside out.
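That reduction falls out naturally because the category field is a slash-delimited path (e.g. "Women/Tops & Blouses/Blouse"). A quick pandas sketch, with hypothetical sample rows:

```python
import pandas as pd

# Hypothetical sample of the slash-delimited category column
df = pd.DataFrame({"category_name": [
    "Women/Tops & Blouses/Blouse",
    "Beauty/Makeup/Face",
    "Electronics/Cell Phones & Accessories/Cases",
    None,  # categories can be missing
]})

# Split the path into main / sub1 / sub2 levels (at most 3 pieces)
levels = ["main_cat", "sub_cat1", "sub_cat2"]
df[levels] = df["category_name"].str.split("/", n=2, expand=True)
df[levels] = df[levels].fillna("missing")
print(df["main_cat"].value_counts())
```

Grouping on `main_cat` is how you get from 1,287 categories down to the dozen or so top-level ones.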
First Kernel – Pre-Processing
- Handle missing values – replaced NAs with the placeholder value "missing".
- Lemmatization performed on item_description – aiming to remove inflectional endings only and return the base or dictionary form of each word
- Label encoding has been performed on categorical values – Encode labels with value between 0 and n_classes-1.
- Tokenization – given a character sequence, tokenization chops it up into pieces called tokens and removes punctuation
- Maximum length of all sequences has been specified
- Scaling performed on target variable (price)
- Sentiment score computed on item_description
- Scaling performed on item description length as well
- Checked if the item has pictures
- Final dataset – 15 variables:
train_id, name, item_condition, category_name (transformed into seq), brand_name (transformed into seq), price, shipping, item_description, desc_lemma, seq_item_description, seq_name, target, desc_sentiment, desc_len, may_have_pictures
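The core of those steps can be sketched roughly as follows – a minimal version assuming Keras and scikit-learn, with a hypothetical two-row stand-in for train.tsv:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Hypothetical mini-dataset standing in for train.tsv
df = pd.DataFrame({
    "name": ["MLB Cincinnati shirt", "Leather wallet"],
    "brand_name": [None, "Coach"],
    "item_description": ["No description yet", "Soft brown leather, barely used"],
    "price": [10.0, 45.0],
})

# 1. Handle missing values
df["brand_name"] = df["brand_name"].fillna("missing")

# 2. Label-encode categorical values (0 .. n_classes-1)
df["brand_id"] = LabelEncoder().fit_transform(df["brand_name"])

# 3. Tokenize the text and pad every sequence to a fixed maximum length
tok = Tokenizer()
tok.fit_on_texts(df["item_description"])
df_seq = pad_sequences(tok.texts_to_sequences(df["item_description"]), maxlen=10)

# 4. Scale the target: log-transform the price, then min-max scale
target = MinMaxScaler().fit_transform(np.log1p(df[["price"]]))
print(df_seq.shape, target.ravel())
```

The lemmatization, sentiment-scoring, and picture-check steps are omitted here for brevity; they each add one more derived column in the same fashion.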
First Kernel – Predictive Model
The dataset has been converted into Keras's data-dictionary format, and I've defined the Keras model in a single function covering the input layers, embedding layers, main layer, the output, and the model definition itself. The dropout probability has been set to 0.1 (i.e., 10%).
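A rough sketch of such a single-function model – the layer sizes, GRU encoders, and the three-feature numeric input are illustrative assumptions, not the Kernel's exact architecture:

```python
from tensorflow.keras.layers import (Input, Embedding, GRU, Dense,
                                     Dropout, concatenate)
from tensorflow.keras.models import Model

def build_model(vocab_size=20000, max_desc_len=75, max_name_len=10):
    # Input layers: tokenized description and name, plus numeric features
    desc_in = Input(shape=(max_desc_len,), name="seq_item_description")
    name_in = Input(shape=(max_name_len,), name="seq_name")
    num_in = Input(shape=(3,), name="numeric")  # e.g. condition, shipping, desc length

    # Embedding layers turn token ids into dense vectors
    desc_emb = Embedding(vocab_size, 50)(desc_in)
    name_emb = Embedding(vocab_size, 20)(name_in)

    # Main layer: recurrent encodings of the text concatenated with numerics
    x = concatenate([GRU(16)(desc_emb), GRU(8)(name_emb), num_in])
    x = Dropout(0.1)(Dense(128, activation="relu")(x))  # 10% dropout

    # Output: a single scaled price
    out = Dense(1, activation="linear")(x)

    model = Model(inputs=[desc_in, name_in, num_in], outputs=out)
    model.compile(loss="mse", optimizer="adam")
    return model

model = build_model()
model.summary()
```

Training on MSE over the log-scaled target lines up directly with the RMSLE leaderboard metric.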
Here is a great tutorial to get you started with Text Processing in Keras
Ideas on Winning
Here are a few ideas I am exploring for my next Kernel. I'd like to share them with the community to see if anyone else is working along similar or different lines and would like to join hands to form a team.
- Since we have to match the predicted price with what Mercari currently gets, the first phase is to look in detail at what Mercari is doing now, and how. For example, from their recently advertised Data Scientist and Machine Learning positions, we can see that Mercari uses MySQL, Tableau, TensorFlow, MXNet, Caffe/Caffe2, Spark, scikit-learn, Theano, Python, and R. That helps us guesstimate the limits of their prediction capability.
- These two SlideShares (Link 1, Link 2) will give you an idea of their current technical stack.
- Canvassing similar competitions worldwide, location is usually a good indicator of product price (along with other factors). Clearly, the price of a used item in Alabama will differ from that of a similar product in California. I am wondering why the location variable was left out, and whether there is a way to incorporate or infer location from the current dataset.
- Here are the profiles of the top machine learning and data science folks at Mercari – these are the people we are competing against. I am exploring their past research, current work, GitHub repos, presentations, and talks to estimate the algorithm currently in use and to explore how I can improve upon it – Shuichi Iida, Shintaro Matsuda, Ishikawa Yu and his GitHub, Hikaru K, Hao Zhang, Wen-Ying-Fen (recently poached by their competitor), Lukas Frannek, Kenji Sugiki, Ishikawa Yu, Will Faught, Yanpeng Lin, Gabriel Driver-Wilson, Naoki Shimizu, Takuma Yamaguchi
- I am also scraping the complete dataset from the Mercari website to calculate a plausible weight for each location, brand, and category.
Note: As fellow Kagglers have pointed out, scraping is not allowed in this competition – please refrain from it. Do read my other article on how to win Kaggle competitions.
There are a few more approaches that I'll save for the next blog post. Until then, happy coding.