Kaggle is an online community (social medium?) geared towards data science enthusiasts. It aims to link data analytics problems with data analysts who have a flair for machine learning and other data science methodologies. I’ve even mentioned it in my books and videos at times, as a potential resource for gaining some traction with the data modeling aspect of the field and for getting people more interested in the craft. However, some people confuse Kaggle experience with data science experience. Let’s delve more into this matter.
Kaggle problems tend to be geared towards analysts, not data scientists. The whole nature of a competition is one-sided (aiming to optimize a particular evaluation metric), which is not the same as creating a successful data science product. Although the model in the back-end of a data science product has to be somewhat accurate, there are other factors involved that are just as important as (if not more important than) the model’s raw performance. For example, the amount of resources it uses, its interpretability, how easy it is to use, its maintenance costs, and how compatible it is with existing technologies are all relevant factors when building a data science model.
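To make the point about factors beyond raw performance concrete, here is a minimal sketch that reports accuracy next to two rough cost measures: time per prediction and the number of stored parameters. Everything in it is an assumption made for illustration (the two toy models, the synthetic data, and the `profile` helper are hypothetical), so treat it as one possible way to frame the comparison rather than a standard evaluation procedure.

```python
import random
import time


class ThresholdRule:
    """A tiny, interpretable model: predict 1 when x exceeds a cutoff."""

    def __init__(self, cutoff):
        self.cutoff = cutoff

    def predict(self, x):
        return 1 if x > self.cutoff else 0

    def n_parameters(self):
        return 1  # just the cutoff


class NearestNeighbor:
    """A memorizing model: predict the label of the closest training point."""

    def __init__(self, xs, ys):
        self.points = list(zip(xs, ys))

    def predict(self, x):
        return min(self.points, key=lambda p: abs(p[0] - x))[1]

    def n_parameters(self):
        return 2 * len(self.points)  # every stored (x, y) pair


def profile(model, xs, ys):
    """Report accuracy alongside two rough cost measures."""
    start = time.perf_counter()
    preds = [model.predict(x) for x in xs]
    elapsed = time.perf_counter() - start
    accuracy = sum(p == y for p, y in zip(preds, ys)) / len(ys)
    return {
        "accuracy": accuracy,
        "seconds_per_prediction": elapsed / len(xs),
        "n_parameters": model.n_parameters(),
    }


# Synthetic data, made up for illustration: label is 1 when x > 0.5,
# with roughly 10% of the labels flipped to simulate noise.
rng = random.Random(0)


def make_data(n):
    xs = [rng.random() for _ in range(n)]
    ys = [int(x > 0.5) if rng.random() > 0.1 else int(x <= 0.5) for x in xs]
    return xs, ys


train_x, train_y = make_data(2000)
test_x, test_y = make_data(500)

reports = {
    "threshold": profile(ThresholdRule(0.5), test_x, test_y),
    "nearest_neighbor": profile(NearestNeighbor(train_x, train_y), test_x, test_y),
}
for name, report in reports.items():
    print(name, report)
```

A leaderboard would rank these two models on the accuracy column alone. A product team would also look at the other columns, and at things a script cannot measure so easily, such as how simple the model is to explain and maintain.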
What’s more, the data in Kaggle competitions is very much like baby formula. In the real world, data is far more complex, noisier, and drawn from a variety of sources. So, if someone can handle Kaggle data, that’s great, but it’s not the same as handling real-world data, no matter how robust their models are. In fact, many people argue that around 80% of a data scientist’s work, time-wise, is getting the data clean and ready for the data model.
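As a small illustration of what that cleaning work can look like, here is a hedged sketch using only the Python standard library. The records, the field names (`sensor`, `temp`, `ts`), and the cleaning rules are all invented for this example, including the assumption that slash-separated dates are day/month/year; real data would need its own rules.

```python
from datetime import date, datetime

# Hypothetical raw records, invented to show common kinds of mess.
raw_rows = [
    {"sensor": " A1 ", "temp": "21.5", "ts": "2021-03-01"},
    {"sensor": "a1", "temp": "N/A", "ts": "2021-03-01"},     # missing reading
    {"sensor": "B2", "temp": "70.7F", "ts": "01/03/2021"},   # Fahrenheit, other date format
    {"sensor": "B2", "temp": "70.7F", "ts": "01/03/2021"},   # exact duplicate
    {"sensor": "", "temp": "19", "ts": "2021-03-02"},        # missing sensor ID
]

MISSING = {"", "N/A", "NA", "NULL"}


def parse_celsius(text):
    """Return the temperature in Celsius, or None if the value is missing."""
    text = text.strip().upper()
    if text in MISSING:
        return None
    if text.endswith("F"):
        return round((float(text[:-1]) - 32) * 5 / 9, 1)
    return round(float(text), 1)


def parse_date(text):
    """Try the two date formats assumed for this example; None if neither fits."""
    for fmt in ("%Y-%m-%d", "%d/%m/%Y"):
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    return None


def clean(rows):
    """Normalize IDs and units, drop incomplete rows, and remove duplicates."""
    seen, out = set(), []
    for row in rows:
        sensor = row["sensor"].strip().upper()
        temp = parse_celsius(row["temp"])
        day = parse_date(row["ts"])
        if not sensor or temp is None or day is None:
            continue
        record = (sensor, temp, day)
        if record not in seen:
            seen.add(record)
            out.append(record)
    return out


cleaned = clean(raw_rows)
for record in cleaned:
    print(record)
```

Of the five raw rows, only two survive. None of this involves a model, yet every decision in it (which values count as missing, which unit and date format to assume) affects whatever model is trained afterwards.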
Of course, it’s not just Kaggle that provides this false image of data science; other platforms do the same. The UCI repository, for example, is similar, though in all fairness, they state that their datasets are there for research purposes. Still, they lend themselves to practicing data modeling and trying out heuristics and new data analytics or A.I. algorithms.
So, when it comes to data science experience, it is important to remember that data modeling is just one part of it. It is an important part, but not the whole picture. If you want to gain real-world experience, you are better off getting some data from real-world sources, such as social media feeds, sensor data from IoT devices, etc., rather than from Kaggle competitions. The latter are fun, but data science is not all fun and games. Hackathons are great for practicing coding, but when it comes to programming in the real world, you need more than just hackathon experience. So, why would data science be any different?
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.