The previous week has been intense as I was working on a part of the proposal for a new project, attending a conference, and figuring out some things about my publication-related endeavors. With all that in mind, it was natural that I didn’t post anything on the blog, even though I wanted to. However, as my focus is always on quality, I didn’t want to just publish a rushed post or a simple announcement. That’s why I waited until now to get a new post out.
The Event of the Decade
On 8/8/18 the new release of Julia came out. This wasn’t just any release though, but the big one: 1.0. It is really hard to overestimate the importance of this release, even if the most conservative Julia users still feel that it will take a few months before the full force of v. 1.0 reaches the world. After all, just because Julia is now production-ready, it doesn’t mean that everyone using it can benefit from this the same way, since the packages people depend on may take some time before they are fully compatible with the new release. Nevertheless, those of us who prefer to rely primarily on our own code can experience the benefits of Julia right now. Whatever the case, the fact is that Julia has now entered a new era, since it has proven itself to be robust and even faster than ever before.
To give you an example of that, at the conference there was a talk about how Julia is applied in Robotics, via a specialized package a Robotics researcher developed recently. Even though he had worked with C++ before for the same project, he eventually shifted to Julia for the vast majority of the code, since it was good enough (i.e. sufficiently fast and reliable) to perform challenging optimization-related tasks in real-time. To be exact, the operations were 36% faster than real-time, enabling a robot operation frequency of 1000 Hz, at least in the simulations he was conducting. At the time of this writing, no other language has accomplished that without having significant dependencies on C libraries.
Ramifications of Version 1.0 in Data Science and A.I.
But how does all this affect us, as data science and A.I. professionals? Well, Julia isn’t evolving merely on the Base package or the fairly niche application of Robotics. In fact, there are now full-fledged packages that cover a variety of data science related applications, including deep learning models. At the conference there was a talk about the Knet package, for example, which is a deep learning package built entirely on Julia. Personally I don’t know any other deep learning tool that has been built entirely on a data science language (I don’t consider C++ to be such a language by the way, since data scientists tend to use high-level languages mainly). What’s more, this deep learning tool has comparable performance to other more established frameworks, while in one of the benchmarks it outperformed all of them.
But data science is not just deep learning. There is a significant part of it that has to do with more conventional methods, mainly deriving from Statistics. What about Julia’s role in all that? Well, Julia has a number of fairly mature packages in Stats, including Bayesian Stats. What’s more, there is a new book being written right now on Stats with Julia, by a couple of academics who teach Stats in a university in Australia. So, it’s safe to say that Julia is pretty evolved in this aspect of data science too.
More specialized parts of data science, such as Graph Analytics, also have corresponding packages in Julia, while the LightGraphs package I talked about in my Julia for Data Science book is still out there, now better than ever. Data engineering packages also exist, while there are several packages on optimization too, something data science can benefit from greatly when tackling its more challenging problems.
From all this, I believe it’s fair to say that the age-old argument that “Julia is not ready for DS / A.I. because x, y, z” is now as ridiculous as the belief that the number of available libraries is what makes a language more suitable for data science. Sure, packages can help, but it’s mostly due to their quality, not their quantity, while how fast a language runs is an important factor when analyzing the truckload of factors in a modern data model. That’s not to say that Python, Scala, and other data science languages are not useful any more, but ignoring the value of Julia in the data science / A.I. arena is silly and to some extent unprofessional.
Recently I decided to do something a bit more experimental, which very few people have tried covering in a video. So, I tackled a more niche sub-topic of Natural Language Processing, related to custom-made features and their construction. Despite its seemingly simple nature, this skill is something that can differentiate you from a newcomer in NLP. This A.I. video assumes some knowledge of NLP but you don’t need to be a seasoned data scientist to follow. Also, I provide several examples, as well as an original taxonomy to help you organize all this information in your mind. So, check it out on Safari when you have a moment.
Note that a subscription to the Safari portal is required in order to view the video in its entirety. With the subscription you have access to a large number of books and videos, across various publishers and domains.
When a machine learning predictive analytics system makes predictions about something, chances are that we have some idea of what drove it to make these predictions. Oftentimes, we even know how confident the system is about each prediction, something that helps us become confident about it too. However, most A.I. systems (including all modern AIs used in data science) fail to give us any insight into how they arrived at their results. This is known as the black box problem and it’s one of the most challenging issues of network-based AIs.
Although things are hopeless for these kinds of systems due to their inherent complexity and lack of any order behind their predictive magic, it doesn’t mean that all AIs need to be under the same umbrella. Besides, the A.I. space is mostly unexplored, even if it often seems to be a fully mapped terrain. Without discounting the immense progress that has been made in network-based systems and their potential, let’s explore the possibility of a different kind of A.I. that is more transparent.
Unfortunately, I cannot be very transparent about this matter as the tech is closed-source, while the whole framework it is based on is so far beyond conventional data analytics that most people would have a hard time making sense of it all. So, I’ll keep it high-level enough so that everyone can get the gist of it.
The Rationale heuristic is basically a way of breaking down a certain similarity metric into its core components, figuring out how much each one of them contributes to the end result. The similarity metric is non-linear and custom-made, with various versions of it to accommodate different data geometries. As for the components, if they are the original features, then we have a way to directly link the outputs (decisions) with the inputs, for each data point predicted. By examining each input-output relationship and applying some linear transformation to the measures we obtain, we end up with a vector for each data point, whose components add up to 1.
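Since the actual heuristic and its similarity metric are closed-source, the following is only a toy sketch of the general idea and not the method described above. It assumes a simple stand-in similarity (1 over 1 plus the squared Euclidean distance), attributes the gap between a data point and a reference point to each original feature, and rescales those amounts so that they add up to 1. The names `similarity` and `feature_shares` are hypothetical.

```python
def similarity(x, y):
    """Stand-in similarity metric (an assumption): 1 / (1 + squared Euclidean distance)."""
    return 1.0 / (1.0 + sum((a - b) ** 2 for a, b in zip(x, y)))


def feature_shares(x, reference):
    """Share of the dissimilarity between x and reference owed to each feature.

    The per-feature squared differences are divided by their total
    (a linear rescaling), so the resulting shares add up to 1.
    """
    gaps = [(a - b) ** 2 for a, b in zip(x, reference)]
    total = sum(gaps)
    if total == 0:  # identical points: spread the shares evenly
        return [1.0 / len(gaps)] * len(gaps)
    return [g / total for g in gaps]


# Hypothetical data point and reference point with three original features
point = [1.0, 5.0, 2.0]
reference = [1.0, 3.0, 1.0]

shares = feature_shares(point, reference)
print(similarity(point, reference))
print(shares)
```

In this toy case the first feature gets a share of zero because the two points agree on it, while the second feature accounts for most of the difference. A real similarity metric that is non-linear in the features would need a more careful attribution scheme than this.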
Naturally, the similarity metric needs to be versatile enough to be usable in different scenarios, while also able to handle high dimensionality. In other words, we need a new method that allows us to process high-dimensional data, without having to dumb it down through a series of meta-features (something all network-based AIs do in one way or another). Of course, no-one is stopping you from using this method with meta-features, but then interpretability goes out the window since these features may not have any inherent meaning attached to them. Unless of course you have generated the meta-features yourself and know how everything is connected.
“But wait,” you may say, “how can an A.I. make predictions with just a single layer of abstraction, so as to enable interpretability through the Rationale heuristic?” Well, if we start thinking laterally about it we can also try to make A.I. systems that emulate this kind of thinking, exhibiting a kind of intuition, if you will. So, if we do all that, then we wouldn’t need to ask this question at all and could start asking more meaningful questions such as: what constitutes the most useful data for the A.I. and how can I distill the original dataset to provide that? Because the answer to this question would render any other layer-related questions meaningless.
As someone once said, "Knowledge is having the right answer; Intelligence is asking the right question." So, if we are to make truly intelligent systems, we might want to try acting as intelligent beings ourselves...
Over the past couple of weeks I've been thinking about this topic and gathering material about it. After all, unlike other more attractive aspects of A.I., this one still eludes the limelight, even though it's become quite popular as a research topic lately. Since I believe this is a matter that concerns everyone, not just those of us who are in the A.I. field, I created this video on the topic. It's a bit longer than the other ones on A.I. topics, but I made an effort to make it relatable and avoid too many technical terms. So, if you have a Safari account, I invite you to check it out here.
This past week we received the first round of feedback from our publisher, so my co-author and I have been feverishly working on refining the book, making clarifications where necessary and adding some content for better context here and there (mostly there). So, after a week’s worth of editing we have completed the revised version of the book which we’ve sent to the publisher this weekend...
Also, this past week I wrote three articles for one of the blogs of the company I work with in London, so it’s been quite busy writing-wise. These are all part of the SEO plan for one of the websites of the company, so they are a bit dry but they are still interesting to read.
What’s more, in my free time I’ve been thinking about A.I. Safety and creating mind maps on the topic. In fact, until further notice, that’s going to be my main pastime from now on, that and creative writing. After all, that sci-fi novella of mine isn’t going to write itself!
So, with all that going on, I didn’t have the chance to put together an article for this blog this week. Stay tuned though since the ones I have in mind are going to be unique and intriguing...
Many things have changed in data science over the past few years, yet recruiting doesn’t seem to be one of them. There are still companies and agencies evaluating candidates as if we were software developers or something, giving a disproportional amount of emphasis to experience, as well as other not-so-relevant factors. A data scientist, however, is much more than his / her coding skills or any other facet that’s reflected in the “years of experience” metric. This is especially true in our era where A.I. is gaining more and more ground, transforming the field in unprecedented ways.
Back in the day, you’d need to know a technology or a piece of know-how quite well, if you were to use it. Particularly when it came to data models, you had to know the ins and outs of them because chances were that you’d have to code one of them from scratch at some point, especially in cases where the two-language problem was an unresolved issue.
Now, however, these issues have withered as new technologies have come along. A.I. models have been proven to be better than even the best-performing conventional models, at least in cases where sufficient data is available. As big data is becoming more and more widespread, having sufficient data is less of an issue. Also, all of these A.I. models are made available as part of this or the other framework, so a data scientist usually needs to just create the necessary wrapper functions, a fairly easy task that can be mastered within a month or two. Therefore, a candidate having 5+ years of experience won’t necessarily be more adept at data modeling than a new data scientist who is trained in the latest technologies of our craft.
As for the two-language problem, that’s something that seems to linger more than reason would dictate. Nowadays there are various programming languages that can be used end-to-end in the data science pipeline (with Julia being one of the most prominent ones). Therefore, coding some method from scratch so that it can be deployed is masochistic at best. Even in Python there are APIs that make the deployment of a model feasible, without having to translate the corresponding code to C++ or Java.
As we are now in the post-AI era, there is a growing consensus among data scientists that the increased automation that A.I. provides is bound to bleed into our field too, meaning fewer and fewer low-level tasks will need to be carried out by the data scientist, as specialized A.I. will be able to do them instead. This doesn’t necessarily mean that humans won’t be in the loop, however. The data scientists involved will just have a different role to play, one that is still uncorrelated to years of experience and other obsolete metrics.
So, the only thing that stands between new, enthusiastic, competent data scientists and their placement in data science jobs is the outdated recruitment processes that are attached to the job market like fleas. Fortunately, there are companies like ResourceFlow that evaluate candidates in a more holistic way, looking at both their technical and their non-technical aspects, while also aiming for a good understanding of what exactly is required by the company having the vacancy, in terms of data science and A.I. expertise. So, even if most recruiting companies are still slow to adapt to the new realities of our field, companies like ResourceFlow give us hope about the future of data science recruitment.
This is a topic that I'm pretty confident hasn't been featured much anywhere in the pop data science literature. Although it is quite well-known in the research sphere, most non-PhDs (and some PhDs too!) may have never heard about it, or why it is useful in day-to-day data science work. So, if you are one of those people who are curious and interested in learning even the less popular topics of our field, feel free to check it out on Safari.
Note that although I made an effort to cover this subject from various angles, this is still an introductory video on the topic. Also, some experience in data science would be immensely useful, otherwise the video may appear a bit abstract. Whatever the case, I hope you find it useful and use it as a springboard to new aspects of data science that you were not aware of. Cheers!
In many data science courses, these peculiar data points in a dataset often go by the term “anomalies” and are considered to be inherently bad. In fact, it is suggested by many that they be removed before the data modeling stage. Now, for obvious reasons I cannot contradict that approach partly because I myself have taken that stance when covering basic data engineering topics, but also because there is merit in this treatment of outliers and inliers. After all, they are just too weird to be left as they are, right?
Well, it depends. In all the cases when they are removed, it’s usually because we are going to use some run-of-the-mill model that is just too rudimentary to do anything clever with the data it’s given. So, if there are anomalous data points in the training set, it’s likely to over-fit or at the very least under-perform. This would not happen so often though in an A.I. model, which is one of the reasons why the data engineering stage is so closely linked to the data modeling one. Also, sometimes the signal we are looking for lies in those particular anomalous elements, so getting rid of them isn’t that wise then.
Regardless of all that though, we need to differentiate between these two kinds of anomalies. The outliers can be easily smoothed out, if we were to adopt a possibilistic way of handling the data, instead of the crude statistical metrics we are used to using. Smoothing outliers is also a good way to retain more signal in the dataset (especially if it’s a small sample that we are working with), something that translates into better-performing models.
Inliers though are harder to process. Oftentimes removing them is the best strategy, but they need to be looked at holistically, not just in individual variables. Also, even if they distort the signal at times, they may not be that harmful when doing dimensionality reduction, so keeping them in the dataset may be a good idea. Nevertheless, it’s good to make a note of these anomalous elements, as they may have a particular significance once the data is processed by a model we build. Perhaps we can use them as fringe cases in a classification model, for example, to test it more rigorously.
To sum up, outliers and inliers are interesting data points in a dataset and whether they are more noise than signal depends on the problem we are trying to solve. When tackled in a multi-dimensional manner, they can be better identified, and when they are processed, certain care needs to be taken. After all, just because certain data analytics methods aren’t well-equipped to handle them, we shouldn’t change our data to suit the corresponding models / metrics. Often we have more to gain by shifting our perspective and adapting our ways to the data at hand. The possibilistic approach to data may be a great asset in all that. Should you wish to learn more about outliers and inliers, you can check out my presentation video on this topic on the Safari platform.
First of all, I'd like to thank all of you for visiting this blog and checking out the various posts I've put up over the past couple of years. I appreciate it, even if I don't express it!
Lately it has come to my attention that many people comment in various posts for the sake of commenting. You never get to see these comments because I delete them or mark them as Spam. The reason is simple. Even if they don't directly promote this or the other company or brand, they are:
Naturally, even if a comment doesn't directly promote this or the other brand, it is accompanied by a link, so there is SEO value in it. Having served as an SEO manager in a company once, I'm quite familiar with these tricks. So, it seems that the intent of these comments is not aligned with the intent of this blog, which is to inform people about certain data science and A.I. related topics and challenge conventional ideas and preconceptions about them. I am considering removing the commenting option from now on, so if this happens, know that it is in order to avoid these noisy comments. Whatever the case, you are always welcome to contact me directly, like some of you have done already.
Again, thank you for reading this blog. I look forward to sharing more fox-like insights in the future!
Contrary to the probabilistic approach to data analytics, which relies on probabilities and ways to model them, usually through a statistical framework, the possibilistic approach focuses on what’s actually there, not what could be there, in an effort to model uncertainty. Although not officially a paradigm (yet), it has what it takes to form a certain mindset, highly congruent with that of a competent data scientist.
If you haven’t heard of the possibilistic approach to things, that’s normal. Most people have already jumped on the bandwagon of the probabilistic dogma, so someone seriously thinking of things possibilistically would be considered eccentric at best. After all, the last successful possibilistic systems are often considered obsolete, due to their inherent limitations when it comes to higher dimensionality datasets. I’m referring to the Fuzzy Logic systems, which are part of the GOFAI family of A.I. systems (in these systems the possibilities are expressed as membership levels, through corresponding functions). These systems are still useful, of course, but not the go-to choice when it comes to building an AI solution to most modern data science problems.
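To make the idea of membership levels a bit more concrete, here is a minimal sketch of a triangular membership function, one of the common shapes used in Fuzzy Logic systems. The function name, the fuzzy sets, and the temperature values are made up purely for illustration.

```python
def triangular_membership(x, low, peak, high):
    """Membership level in [0, 1] for a triangular fuzzy set.

    The level is 0 at or outside [low, high], rises linearly to 1 at
    `peak`, and then falls linearly back to 0.
    """
    if x <= low or x >= high:
        return 0.0
    if x <= peak:
        return (x - low) / (peak - low)
    return (high - x) / (high - peak)


# Hypothetical fuzzy sets for a temperature reading in degrees Celsius,
# each given as (low, peak, high)
fuzzy_sets = {
    "cold": (-10.0, 0.0, 15.0),
    "mild": (10.0, 20.0, 30.0),
    "hot": (25.0, 35.0, 50.0),
}

reading = 12.0
levels = {
    name: triangular_membership(reading, *params)
    for name, params in fuzzy_sets.items()
}
print(levels)
```

Note that the reading belongs partly to "cold" and partly to "mild" at the same time, and that the levels do not have to add up to 1, which is one of the ways membership levels differ from probabilities.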
Possibilistic reasoning is that which relies on concrete facts and observable relationships in the data at hand. It doesn’t assume anything, nor does it opt for shortcuts by summarizing a variable with a handful of parameters corresponding to a distribution. So, if something is predicted with a possibilistic model, you know all the how’s and why’s of that prediction. This is directly opposite to the black-box predictions of most modern AI systems.
Working with possibilities isn’t easy though. Oftentimes it requires a lot of computational resources, while an abundance of creativity is also needed, when the data is complex. For example, you may need to do some clever dimensionality reduction before you can start looking at the data, while unbiased sampling may be a prerequisite also, particularly in transduction-related systems. So, if you are looking for a quick-and-easy way of doing things, you may want to stick with MXNet, TensorFlow, or whatever A.I. framework takes your fancy.
If on the other hand you are up for a challenge, then you need to start thinking in terms of possibilities, forgetting about probabilities for the time being. Some questions that may help in that are the following:
* How much does each data point contribute to a metric (e.g. one of central tendency or one of spread)?
* Which factors / features influence the similarity between two data points and by how much?
* What do the fundamental components of a dataset look like, if they are defined by both linear and non-linear relationships among the original features?
* How can we generate new data without any knowledge of the shape or form of the original dataset?
* How can we engineer the best possible centroids in a K-means-like clustering framework?
* What is an outlier or inlier essentially and how does it relate to the rest of the dataset?
For all of these cases, assume that there is no knowledge of the statistical distributions of the corresponding variables. In fact, you are better off disregarding any knowledge of Stats whatsoever, as it’s easy to be tempted to use a probability-based approach.
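As an example of how the first question above could be explored without any distributional assumptions, here is a small sketch that measures how far a metric moves when each data point is left out in turn. The function name `leave_one_out_shifts` and the data are hypothetical, and this is just one simple way to approach the question, not a definitive one.

```python
from statistics import mean


def leave_one_out_shifts(values, metric=mean):
    """How far the metric moves when each point is removed in turn.

    A large shift (in absolute value) means that point contributes a lot
    to the metric. Requires at least two values.
    """
    full = metric(values)
    return [
        full - metric(values[:i] + values[i + 1:])
        for i in range(len(values))
    ]


# Hypothetical sample with one point lying far from the rest
data = [2.0, 3.0, 4.0, 5.0, 36.0]

shifts = leave_one_out_shifts(data)
for value, shift in zip(data, shifts):
    print(value, shift)
```

Any other metric that takes a list of numbers (e.g. `statistics.median` or `statistics.pstdev`) can be passed through the `metric` argument, which makes it easy to compare how sensitive different metrics are to the same data point.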
Finally, although this new way of thinking about data is fairly superior to the probabilistic one, the latter has its uses too. So, I’m not advocating that you shouldn’t learn Stats. In fact, I’d argue that only after you’ve learned Stats quite well, will you be able to appreciate the possibilistic approach to data in full. So, if you are looking into A.I., Machine Learning, or both, you may want to consider a possibilistic way of tackling uncertainty, instead of blindly following those who have vested interests in the currently dominant paradigm.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.