Although it's been over two weeks since I finished working on the Data Visualization video, and about a month since I completed the Deep Learning one, both of them have only just been made available on Safari (a subscription-based platform for various educational material). So, if you are up for some food for thought on DL and DV, check them out when you have a moment: Deep Learning vid and Data Visualization vid. Note that these are both overview videos, and although in the Data Viz one I include several references to libraries in Python and Julia for creating various plots, the videos are fairly high-level. These are not in-depth tutorials on the topics. Once I decide to take a break from all the book writing these days, I'll probably make another video, either on AI or on a more conventional DS topic. So, stay tuned...
When it comes to DS education, nowadays there is a lot of emphasis on one of two things: the math aspect of it, and the complex algorithms of deep learning systems. Although all this is essential, particularly if you want to be a future-proof data science professional, there is much more to the field than that. Namely, the engineering mentality is something that you need to cultivate, since at its core, data science is an engineering discipline. I don't mean that in a software sense, but more as a practicality- and efficiency-oriented approach to building a system.

This is largely due to the scaling dimension of a data science metric or model. Unfortunately, most data science "educators" fail to elaborate on this point, since they focus mainly on parroting other people's work, instead of inciting students to gain a deeper understanding of the methods and processes being taught. Also, scaling is the filter that distinguishes a robust algorithm from a mediocre one. As we obtain more and more data, having an algorithm that works well only on a small dataset (or one that requires a great deal of parallelization to yield any benefits) is not sustainable. Of course, some people are happy with that, since they have a great deal of resources available, which they are happy to rent out. However, we can often obtain good enough results with fewer resources, through algorithms that scale better.

Even if most people don't share this fox-like approach to data science, it doesn't make it less relevant. After all, many people associate methods with the frameworks particular companies offer, rather than understand the science behind these methods. Scaling a method up intelligently is the product of three things:

1. having a deep understanding of the method
2. not relying on an abundance of resources to scale it up
3. being creative about the method, making compromises where necessary, to make it more lightweight

That's where the engineering mentality comes in.
The engineer understands the math, but isn't concerned about having the perfect solution to a problem. Instead, he cares about having a good enough solution that is reliable and not too costly. This kind of thinking is what drives the development of modern optimization systems, which are an important part of AI. Artificial Intelligence may involve things like deep learning networks, but there is more to it than that. So, if you want to delve more into this field and its numerous applications in data science, cultivating this engineering mentality is the optimal way to go. Perhaps not the absolute best one, but definitely one that works well and is efficient enough!

I've mentioned, both in the DS Modeling Tutorial and in another article of mine, the importance of discretization / binning of a continuous variable as a strategy for turning it into a feature to be used in a data model. However, how meaningful and information-rich the resulting categorical feature is going to be depends on the thresholds we use. In this post I'd like to share with you a strategy that I've come up with that works well in doing just that.

First of all, we need to make sure we have a potent method for calculating the density of a data point. I'm not talking about probability density though, since the latter is a statistical concept that has more to do with the mathematical form of a distribution than with the actual density observed. The actual density is what we would measure if we were to look at the data itself, and although it's quite straightforward to compute, it's not as easy to do at scale. That's why I first developed a very simple (almost simplistic) method for approximating density using a sampling of sorts, rather than looking at each individual element in the variable. Afterwards, we just need to figure out the point of least density that's not an extreme of the variable.
In other words, we need to identify a local minimum in the density distribution, a fairly easy task that's also computationally cheap. Of course, it's good to have a threshold too, to distinguish between this point being an actual low-density point and one that could be due to chance. If the density of that point is below this threshold, we can take it to be a point of dissection for the variable, effectively binarizing it.

Beyond that, we can repeat the same process recursively for the two partitions of the variable. This way, we can end up with 3, 4, or even 100 partitions at the end of the process. This is another reason why the aforementioned threshold is very important. After all, not all partitions would be binarizable in a meaningful way. Also, it would be a good idea to have a limit on how many partitions we allow overall, so that we don't end up with a categorical variable having 1000 unique values either!

This optimal discretization / binning process is very simple and robust, resulting in a simpler form of the original variable, one that can be broken down into a set of binary features afterwards, if needed. This can also be useful in identifying potential outliers and being able to use them (as separate values in the new feature) instead of discarding them. The method is made even faster through its implementation in Julia, which once again proved itself as a great DS tool.
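
To make the steps above more concrete, here is a minimal sketch of the general idea. It is written in Python rather than Julia, and it is not the actual implementation mentioned in this post. Everything in it is an assumption made for illustration: the function names (probe_densities, find_valley, density_cuts, assign_bin), the way density is approximated (counting the points that fall in a window around each of a handful of evenly spaced probe points), and the parameters (n_probes, ratio, min_size, max_bins) along with their default values. The threshold here is taken to be a fraction of the partition's average density, which is just one plausible choice.

```python
from bisect import bisect_left, bisect_right
from collections import deque


def probe_densities(seg, n_probes):
    """Approximate density (points per unit length) at evenly spaced probes.

    `seg` must be sorted. Only n_probes + 1 windows are inspected, each with
    two binary searches, instead of visiting every element of the variable.
    """
    lo, hi = seg[0], seg[-1]
    step = (hi - lo) / n_probes
    probes = [lo + i * step for i in range(n_probes + 1)]
    dens = [
        (bisect_right(seg, p + step / 2) - bisect_left(seg, p - step / 2)) / step
        for p in probes
    ]
    return probes, dens


def find_valley(seg, n_probes, ratio):
    """Return the lowest-density interior probe that passes the threshold, or None."""
    probes, dens = probe_densities(seg, n_probes)
    # Assumed threshold: a fraction of the partition's average density.
    cutoff = ratio * len(seg) / (seg[-1] - seg[0])
    best = None
    for i in range(1, n_probes):  # skip the two extremes of the partition
        d = dens[i]
        # A valley must have denser probes somewhere on both of its sides.
        is_valley = d < max(dens[:i]) and d < max(dens[i + 1:])
        if is_valley and d < cutoff and (best is None or d < dens[best]):
            best = i
    return None if best is None else probes[best]


def density_cuts(values, max_bins=8, n_probes=20, ratio=0.5, min_size=10):
    """Split at low-density points, partition by partition; return sorted cuts."""
    queue = deque([sorted(values)])
    cuts = []
    while queue and len(cuts) + 1 < max_bins:  # cap on the number of partitions
        seg = queue.popleft()
        if len(seg) < min_size or seg[-1] <= seg[0]:
            continue  # too small or constant: not worth splitting
        cut = find_valley(seg, n_probes, ratio)
        if cut is None:
            continue  # this partition is not binarizable in a meaningful way
        cuts.append(cut)
        k = bisect_right(seg, cut)  # values <= cut go to the left partition
        queue.append(seg[:k])
        queue.append(seg[k:])
    return sorted(cuts)


def assign_bin(x, cuts):
    """Map a value to the index of its partition (0 .. len(cuts))."""
    return bisect_left(cuts, x)


# Usage: two well-separated groups of values end up in two partitions.
values = [i / 100 for i in range(100)] + [5 + i / 100 for i in range(100)]
cuts = density_cuts(values)
labels = [assign_bin(v, cuts) for v in values]
```

The partitions are processed in a first-in, first-out order, so that the cap on the number of partitions isn't used up by one side of the variable alone. When several probes tie for the lowest density (as in a completely empty gap), this sketch simply takes the first one; picking the middle of the gap would be a reasonable refinement.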
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.
February 2018
