Many things have changed in data science over the past few years, yet recruiting doesn’t seem to be one of them. There are still companies and agencies evaluating candidates as if we were software developers or something, giving a disproportional amount of emphasis to experience, as well as other not-so-relevant factors. A data scientist, however, is much more than his / her coding skills or any other facet that’s reflected in the “years of experience” metric. This is especially true in our era where A.I. is gaining more and more ground, transforming the field in unprecedented ways.
Back in the day, you’d need to know a technology or a piece of know-how quite well, if you were to use it. Particularly when it came to data models, you had to know the ins and outs of them because chances were that you’d have to code one of them from scratch at some point, especially in cases where the two-language problem was an unresolved issue.
Now, however, these issues have withered as new technologies have come along. A.I. models have been proven to be better than even the best-performing conventional models, at least in cases where sufficient data is available. As big data is becoming more and more widespread, having sufficient data is less of an issue. Also, all of these A.I. models are made available as part of this or the other framework, so a data scientist usually needs to just create the necessary wrapper functions, a fairly easy task that can be mastered within a month or two. Therefore, a candidate having 5+ years of experience won’t necessarily be more adept at data modeling than a new data scientist who is trained in the latest technologies of our craft.
As for the two-language problem, that’s something that seems to linger more than reason would dictate. Nowadays there are various programming languages that can be used end-to-end in the data science pipeline (with Julia being one of the most prominent ones). Therefore, coding some method from scratch so that it can be deployed is masochistic at best. Even in Python there are APIs that make the deployment of a model feasible, without having to translate the corresponding code to C++ or Java.
As we are now in the post-AI era, there is a growing consensus among data scientists that the increased automation that A.I. provides is bound to bleed into our field too, meaning fewer and fewer low-level tasks will need to be carried out by the data scientist, as specialized A.I. will be able to do them instead. This doesn’t necessarily mean that humans won’t be in the loop, however. The data scientists involved will just have a different role to play, one that is still uncorrelated to years of experience and other obsolete metrics.
So, the only thing that stands in the way of new, enthusiastic, competent data scientists and their placement in data science jobs is the outdated recruitment processes that are attached to the job market like fleas. Fortunately, there are companies like ResourceFlow that evaluate candidates in a more holistic way, looking at both the technical and the non-technical aspects of them, while also opting for a good understanding of what exactly is required by the company having the vacancy, in terms of data science and A.I. expertise. So, even if most recruiting companies are still slow to adapt to the new realities of our field, fortunately companies like ResourceFlow give us hope about the future of data science recruitment.
This is a topic that I'm pretty confident hasn't been featured much anywhere in the pop data science literature. Although it is quite well-known in the research sphere, most non-PhDs (and some PhDs too!) may have never heard about it, or why it is useful in day-to-day data science work. So, if you are one of those people who are curious and interested in learning even the less popular topics of our field, feel free to check it out on Safari.
Note that although I made an effort to cover this subject from various angles, this is still an introduction video to its topics. Also, some experience in data science would be immensely useful, otherwise the video may appear a bit abstract. Whatever the case, I hope you find it useful and use it as a jumping board to new aspects of data science that you were not aware of. Cheers!
In many data science courses, these peculiar data points in a dataset often go by the term “anomalies” and are considered to be inherently bad. In fact, it is suggested by many that they be removed before the data modeling stage. Now, for obvious reasons I cannot contradict that approach partly because I myself have taken that stance when covering basic data engineering topics, but also because there is merit in this treatment of outliers and inliers. After all, they are just too weird to be left as they are, right?
Well, it depends. In all the cases when they are removed, it’s usually because we are going to use some run-of-the-mill model that is just too rudimentary to do anything clever with the data it’s given. So, if there are anomalous data points in the training set, it’s likely to over-fit or at the very least under-perform. This would not happen so often though in an A.I. model, which is one of the reasons why the data engineering stage is so closely linked to the data modeling one. Also, sometimes the signal we are looking for lies in those particular anomalous elements, so getting rid of them isn’t that wise then.
Regardless of all that though, we need to differentiate between these two kinds of anomalies. The outliers can be easily smoothed out, if we were to adopt a possibilistic way of handling the data, instead of the crude statistical metrics we are used to using. Smoothing outliers is also a good way to retain more signal in the dataset (especially if it’s a small sample that we are working with), something that translates into better-performing models.
Inliers though are harder to process. Oftentimes removing them is the best strategy, but they need to be looked at holistically, not just in individual variables. Also, even if they distort the signal at times, they may not be that harmful when doing dimensionality reduction, so keeping them in the dataset may be a good idea. Nevertheless, it’s good to make a note of these anomalous elements, as they may have a particular significance once the data is processed by a model we build. Perhaps we can use them as fringe cases in a classification model, for example, to do some more rigorous testing to it.
To sum up, outliers and inliers are interesting data points in a dataset and whether they are more noise than signal depends on the problem we are trying to solve. When tackled in a multi-dimensional manner, they can be better identified, and when processed, certain care needs to be taken. After all, just because certain data analytics methods aren’t well-equipped to handle them, we shouldn’t change our data to suit the corresponding models / metrics. Often we have more to gain by shifting our perspective and adapting our ways to the data at hand. The possibilistic approach to data may be a great asset in all that. Should you wish to learn more about outliers and inliers, you can check out my presentation video on this topic on the Safari platform.
First of all, I'd like to thank all of you for visiting this blog and checking out the various posts I've put up over the past couple of years. I appreciate it, even if I don't express it!
Lately it has come to my attention that many people comment in various posts for the sake of commenting. You never get to see these comments because I delete them or mark them as Spam. The reason is simple. Even if they don't directly promote this or the other company or brand, they are:
Naturally, even if a comment doesn't directly promote this or the other brand, it is accompanied by a link, so there is SEO value in it. Having served as an SEO manager in a company once, I'm quite familiar with these tricks. So, it seems that the intent of these comments is not aligned with the intent of this blog, which is to inform people about certain data science and A.I. related topics and challenge conventional ideas and preconceptions about them. I am considering removing the commenting option from now on, so if this happens, know that it is in order to avoid these noisy comments. Whatever the case, you are always welcome to contact me directly, like some of you have done already.
Again, thank you for reading this blog. I look forward to sharing more fox-like insights in the future!
Contrary to the probabilistic approach to data analytics, which relies on probabilities and ways to model them, usually through a statistical framework, the possibilistic approach focuses on what’s actually there, not what could be there, in an effort to model uncertainty. Although not officially a paradigm (yet), it has what it takes to form a certain mindset, highly congruent with that of a competent data scientist.
If you haven’t heard of the possibilistic approach to things, that’s normal. Most people have already jumped on the bandwagon of the probabilistic dogma, so someone seriously thinking of things possibilistically would be considered eccentric at best. After all, the last successful possibilistic systems are often considered obsolete, due to their inherent limitations when it came to higher dimensionality datasets. I’m referring to the Fuzzy Logic systems, which are part of the GOFAI family of A.I. systems (in these systems the possibilities are expressed as membership levels, through corresponding functions). These systems are still useful, of course, but not the go-to choice when it comes to building an AI solution to most modern data science problems.
Possibilistic reasoning is that which relies on concrete facts and observable relationships in the data at hand. It doesn’t assume anything, nor does it opt for shortcuts by summarizing a variable with a handful of parameters corresponding to a distribution. So, if something is predicted with a possibilistic model, you know all the how’s and why’s of that prediction. This is directly opposite to the black-box predictions of most modern AI systems.
Working with possibilities isn’t easy though. Oftentimes it requires a lot of computational resources, while an abundance of creativity is also needed, when the data is complex. For example, you may need to do some clever dimensionality reduction before you can start looking at the data, while unbiased sampling may be a prerequisite also, particularly in transduction-related systems. So, if you are looking for a quick-and-easy way of doing things, you may want to stick with MXNet, TensorFlow, or whatever A.I. framework takes your fancy.
If on the other hand you are up for a challenge, then you need to start thinking in terms of possibilities, forgetting about probabilities for the time being. Some questions that may help in that are the following:
* How much does each data point contribute to a metric (e.g. one of central tendency or one of spread)?
* Which factors / features influence the similarity between two data points and by how much?
* What do the fundamental components of a dataset look like, if they are defined by both linear and non-linear relationships among the original features?
* How can we generate new data without any knowledge of the shape or form of the original dataset?
* How can we engineer the best possible centroids in a K-means-like clustering framework?
* What is an outlier or inlier essentially and how does it relate to the rest of the dataset?
For all of these cases, assume that there is no knowledge of the statistical distributions of the corresponding variables. In fact, you are better off disregarding any knowledge of Stats whatsoever, as it’s easy to be tempted to use a probability-based approach.
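The first question in the list above can be explored with a small leave-one-out experiment. What follows is only a minimal sketch and not a possibilistic method as such: the helper `leave_one_out_contributions` and the sample values are made up for illustration, and the contribution of a point is assumed to be the change in the metric when that point is dropped. No distribution is assumed at any step.

```python
from statistics import mean, pstdev

def leave_one_out_contributions(values, metric):
    """For each point, return how much the metric changes when that point is dropped."""
    full = metric(values)
    contributions = []
    for i in range(len(values)):
        rest = values[:i] + values[i + 1:]
        contributions.append(full - metric(rest))
    return contributions

# Made-up sample with one extreme value (hypothetical data).
data = [2.0, 3.0, 3.5, 4.0, 20.0]

mean_shift = leave_one_out_contributions(data, mean)      # central tendency
spread_shift = leave_one_out_contributions(data, pstdev)  # spread

for x, m, s in zip(data, mean_shift, spread_shift):
    print(f"{x:5.1f}  mean shift {m:+.3f}  spread shift {s:+.3f}")
```

In this toy sample the value 20.0 shifts both metrics far more than any other point, which is the kind of per-point information the question is after.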
Finally, although this new way of thinking about data is fairly superior to the probabilistic one, the latter has its uses too. So, I’m not advocating that you shouldn’t learn Stats. In fact, I’d argue that only after you’ve learned Stats quite well, will you be able to appreciate the possibilistic approach to data in full. So, if you are looking into A.I., Machine Learning, or both, you may want to consider a possibilistic way of tackling uncertainty, instead of blindly following those who have vested interests in the currently dominant paradigm.
Blockchain has been making waves in the past 10 years or so, with many applications like Bitcoin and other cryptocurrencies that have been developed on this platform. Yet, there are also alternative platforms like Hashgraph that promise to deliver the same services but in a more efficient manner. All these technologies are under the umbrella of Distributed Ledger Technologies and are particularly important in our era of pronounced cyber-security concerns.
Recently I’ve put together a video on this topic that’s now available on Safari. It’s more high level but it covers all the key aspects of the technologies, making it ideal for someone new to the topic. What’s more, I’ve written a short article comparing the two technologies, on the DSP blog. Feel free to check them both out. Enjoy!
Before someone says “yes, of course; you just need to apply a non-linear transformation to one of the variables!”, let me rephrase: can we measure a non-linear relationship between two variables, without any transformations whatsoever? In other words, is there a heuristic metric that can facilitate the task of establishing whether two variables are linked in some fashion, without any data engineering on our part?
The answer is “yes, of course” again. However, the relationship has to be monotonic for this to work. In other words, one variable must consistently increase (or consistently decrease) as the other increases, so that there is a 1-1 relationship between the values of the two variables. Otherwise, it may not appear as strong, due to the nature of non-linearity.
So, if we have two variables x and y, and y is something like x^10 + exp(x), that’s a relationship that is clearly non-linear, but also monotonic. Also, the Pearson correlation of the two variables in this case is not particularly strong (for the variables tested, it was about 0.67). If it were measured by a different correlation metric, however, like a custom-built one I’ve recently developed, the relationship would be somewhat stronger (for these variables, it would be around 0.75) while Kendall's rank correlation coefficient would produce a great result too (1.00 for these variables).
In a different scenario, where z = 1 / x, for example, the results of the correlation metrics differ more. Pearson’s correlation in this case would be something like -0.16, while the custom-made metric would yield something around -0.69. Also, Kendall’s coefficient would be -1.00.
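To make the comparison concrete, here is a minimal sketch that computes Pearson's coefficient and Kendall's tau for both relationships using only the standard library. The evenly spaced positive values of x are an assumption for illustration (not the variables tested above), so the Pearson figures will differ from the ones quoted. The custom metric is not reproduced, since its definition isn't given here.

```python
import math

def sign(v):
    """Return -1, 0 or 1 depending on the sign of v."""
    return (v > 0) - (v < 0)

def pearson(a, b):
    """Pearson's linear correlation coefficient."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    va = sum((p - ma) ** 2 for p in a)
    vb = sum((q - mb) ** 2 for q in b)
    return cov / math.sqrt(va * vb)

def kendall_tau(a, b):
    """Kendall's tau-a: (concordant - discordant) pairs over all pairs, O(n^2)."""
    n = len(a)
    score = 0
    for i in range(n):
        for j in range(i + 1, n):
            score += sign(a[i] - a[j]) * sign(b[i] - b[j])
    return score / (n * (n - 1) / 2)

# Hypothetical evenly spaced positive values; not the data used in the post.
x = [0.1 + 0.01 * i for i in range(300)]
y = [v ** 10 + math.exp(v) for v in x]   # non-linear, increasing
z = [1 / v for v in x]                   # non-linear, decreasing for x > 0

print("y: Pearson %.2f, Kendall %.2f" % (pearson(x, y), kendall_tau(x, y)))
print("z: Pearson %.2f, Kendall %.2f" % (pearson(x, z), kendall_tau(x, z)))
```

Because Kendall's tau only looks at the ordering of the values, it comes out as 1 for y and -1 for z on this grid, whereas Pearson's coefficient is smaller in magnitude in both cases.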
Although the effect is not always pronounced, in cases like this one, a different metric can make the difference between a strong correlation and a not-so-strong one, affecting our decisions about the variables.
Bottom line, even if the Pearson correlation coefficient is the most popular method for measuring the relationship between two variables, it’s not the best choice when it comes to non-linear relationships. That’s why different metrics need to be used for evaluating the relationship between two variables, particularly if it’s a non-linear one.
Although I’ve talked about dimensionality reduction for data science in the corresponding video on Safari, covering various angles of the topic, I was never fully content with the methodologies out there. After all, all the good ones are fairly sophisticated, while all the easier ones are quite limited. Could there be a different (better) way of performing dimensionality reduction in a dataset? If so, what issue would such a method tackle?
First of all, conventional dimensionality reduction methods tend to come from Statistics. That’s great if the dataset is fairly simple, but methods like PCA focus on the linear relationships among the features, which, although a good place to start, doesn’t cover all the bases. For example, what if features F1 and F2 have a non-linear relationship? Will PCA be able to spot that? Probably not, unless there is a strong linear component to it. Also, if F1 and F2 follow some strange distribution, the PCA method won’t work very well either.
What's more, what if you want to have meta-features that are independent of each other, yet still explain a lot of variance? Clearly PCA won’t always give you this sort of result, since for complex datasets the PCs will end up being tangled themselves. Also, ICA, a method designed for independent components, is not as easy to use since it’s hard to figure out exactly where to stop when it comes to selecting meta-features.
In addition, what’s the deal with outliers in the features? Surely they affect the end result, by changing the whole landscape of the features, breaking the whole scale equilibrium at times. Well, that’s one of the weak points of PCA and similar dimensionality reduction methods, since they require some data engineering before they can do their magic.
Finally, how much does each one of the original features contribute to the meta-features you end up with after using PCA? That’s a question that few people can answer although the answer is right there in front of them. Also, such a piece of information may be useful in evaluating the original features or providing some explanation of how much they are worth in terms of predictive potential, after the meta-features are used in a model.
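As a rough illustration of where that answer sits, the sketch below finds the first principal component of a tiny made-up dataset by power iteration and reads off the squared loadings. Since the component has unit length, those squares sum to 1 and can be taken as each feature's share of that meta-feature. The data, the helper names, and the choice not to standardize the features are all assumptions made for the example. In practice a library routine would be used, and unscaled features with larger variance will dominate.

```python
import math

def covariance_matrix(rows):
    """Sample covariance matrix of a list of equally long feature rows."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    return [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
            for i in range(d)]

def first_component(cov, iterations=500):
    """Leading eigenvector of a covariance matrix via power iteration."""
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical data: F2 is roughly 2 * F1, F3 is nearly constant.
rows = [
    [1.0, 2.1, 0.5],
    [2.0, 3.9, 0.4],
    [3.0, 6.2, 0.6],
    [4.0, 8.1, 0.5],
    [5.0, 9.8, 0.4],
]

loadings = first_component(covariance_matrix(rows))
shares = [w * w for w in loadings]  # squared loadings sum to 1

for name, share in zip(["F1", "F2", "F3"], shares):
    print(f"{name}: {share:.1%} of the first meta-feature")
```

Here F2 takes the largest share simply because it has the largest variance, which is one reason the features are usually scaled before PCA is applied.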
All of these issues and more can be tackled by using a new approach to dimensionality reduction, one that is based on a new paradigm (the same one that can tackle the clustering issues mentioned in the previous post). Also, even though the new approach doesn’t use a network architecture, it can still be considered a type of A.I. as there is some kind of optimization involved. As for the specifics of the new approach, that’s something to be discussed in another post, when the time is right...
A/B testing is a crucial methodology / application in the data science field. Although it mainly relies on Statistics, it has remained quite relevant in this machine learning and AI oriented era of our field. It's no coincidence that at Thinkful that's one of the first things data science students learn, once they get comfortable with descriptive Stats and basic data manipulation. So, I decided to do a video on this topic to help those interested in learning about it get a good perspective of it and understand better its relationship with Hypothesis Testing. It is my hope that this video can be a good supplement to one's learning on the subject. Enjoy!
I was never particularly fond of this unsupervised learning methodology that’s under the umbrella of machine learning. It’s not that I didn’t see value in it, but the methods that were available for it when I started delving into it were rudimentary at best and fairly crude. In fact, if I were to do a PhD now, I’d choose a clustering-related topic since there is so much room for improvement that even a simple idea for improving the most popular clustering methods out there is bound to improve them!
However, the fact that data science researchers and machine learning engineers in particular haven’t spent much time looking into clustering doesn’t make clustering a bad methodology. In fact, I’d argue that it’s one of the most insightful ones and it plays an important role in many data science projects, particularly in the data exploration stage.
The key issues with clustering are:
1. The whole set of distance metrics used
2. The fact that the vast majority of clustering methods yield a (slightly) different result every time they are run
3. The need for an external parameter (K) in most clustering methods used in practice, in order to define how many clusters there are
4. The fact that it’s very shallow in its results
There may be more issues with clustering, but these are the most important ones I’ve found. So, if we were to rethink clustering and do it better, we’d need to address each one of these issues. Namely:
1. A new set of distance metrics would be needed, metrics that are not influenced by the dimensional “noise” so much, in the case of many dimensions in the dataset.
2. The option for a deterministic clustering method, one that would optimize the centroid seed before starting the whole clustering process
3. An optimization process would be in place so as to find the best number of clusters. This should include the possibility of a single cluster, in the case where there isn’t enough diversity in the dataset.
4. A multi-level clustering option needs to be available, much like hierarchical clustering but in reverse, i.e. start with the main clusters in the dataset and gradually dig deeper into levels of sub-clusters.
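To give a feel for the second point only, here is a minimal sketch of a K-means variant whose starting centroids are chosen deterministically, so that repeated runs on the same data give the same result. It uses farthest-point seeding with plain Euclidean distance as a stand-in. It is not the approach hinted at in this post, it does not address the distance metric, the choice of K, or the multi-level aspect, and all function names and data points are hypothetical.

```python
import math

def dist(a, b):
    """Euclidean distance between two points."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return tuple(sum(p[j] for p in points) / n for j in range(len(points[0])))

def deterministic_seeds(points, k):
    """Farthest-point seeding: no randomness, so repeated runs agree."""
    center = centroid(points)
    seeds = [max(points, key=lambda p: dist(p, center))]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(dist(p, s) for s in seeds)))
    return seeds

def kmeans(points, k, iterations=100):
    """Lloyd's algorithm started from the deterministic seeds above."""
    centers = deterministic_seeds(points, k)
    labels = None
    for _ in range(iterations):
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = centroid(members)
    return labels, centers

# Two well-separated made-up groups of points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (8, 8), (8, 9), (9, 8), (9, 9)]

labels, centers = kmeans(points, 2)
print(labels)
print(centers)
```

Running `kmeans(points, 2)` again returns the same labels and centroids, since nothing in the procedure depends on a random seed.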
Now, all this may sound simple but it’s not as easy to put into practice. Apart from an in-depth understanding of data science, a quite refined programming ability is needed too, so that the implementation of this clustering approach can be efficient and scalable. Perhaps all this is not even possible with the conventional data analytics framework, but there is not a single doubt in my mind that it is possible in general, and if a high-performance language is used (e.g. Julia), it is even practically feasible.
Naturally, a clustering framework like this one would require a certain level of A.I. to be used. This doesn’t have to be an ANN though, since A.I. can take many forms, not just network-based ones. Whatever the case, conventional statistics-based methods may be largely inadequate, while the very basic machine learning methods for clustering may not be sufficient either.
This illustrates something that many data science practitioners have forgotten: that data science methods evolve, just like other aspects of the craft. New tools may be intriguing, but equally intriguing are the conventional methodological tools, especially if we were to rethink them from a more advanced perspective. This can be beneficial in many ways, such as opening new avenues of data analytics and even synthesizing new data. This, however, is a story for another time...
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.