Although I’ve talked about dimensionality reduction for data science in the corresponding video on Safari, covering various angles of the topic, I was never fully content with the methodologies out there. After all, all the good ones are fairly sophisticated, while all the easier ones are quite limited. Could there be a different (better) way of performing dimensionality reduction in a dataset? If so, what issue would such a method tackle?
First of all, conventional dimensionality reduction methods tend to come from Statistics. That’s great if the dataset is fairly simple, but methods like PCA focus on the linear relationships among the features, which, although a good place to start, doesn’t cover all the bases. For example, what if features F1 and F2 have a non-linear relationship? Will PCA be able to spot that? Probably not, unless there is a strong linear component to it. Also, if F1 and F2 follow some strange distribution, the PCA method won’t work very well either.
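To make the F1/F2 point concrete, here is a minimal sketch using only the Python standard library. PCA works from the covariance (or correlation) matrix of the features, so the Pearson correlation is a fair proxy for what it can "see". The data and the helper name `pearson` are made up for illustration.

```python
import statistics

def pearson(xs, ys):
    """Pearson correlation: the linear association that PCA builds on."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical feature F1, symmetric around zero
f1 = [i / 10 for i in range(-50, 51)]

# F2 as a linear function of F1 vs. F2 as a purely non-linear function of F1
f2_linear = [2 * x + 1 for x in f1]
f2_quadratic = [x ** 2 for x in f1]

r_linear = pearson(f1, f2_linear)        # close to 1
r_quadratic = pearson(f1, f2_quadratic)  # close to 0, although F2 is fully determined by F1

print(f"linear: {r_linear:.3f}, quadratic: {r_quadratic:.3f}")
```

In the quadratic case F2 is completely determined by F1, yet the linear correlation is practically zero, so a covariance-based method has nothing to latch onto.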
What's more, what if you want to have meta-features that are independent of each other, yet still explain a lot of variance? Clearly PCA won’t always give you this sort of result, since it only guarantees uncorrelated components, and for complex datasets the PCs can still end up being tangled themselves. Also, ICA, a method designed for independent components, is not as easy to use since it’s hard to figure out exactly where to stop when it comes to selecting meta-features.
In addition, what’s the deal with outliers in the features? Surely they affect the end result, by changing the whole landscape of the features, breaking the whole scale equilibrium at times. Well, that’s one of the weak points of PCA and similar dimensionality reduction methods, since they require some data engineering before they can do their magic.
Finally, how much does each one of the original features contribute to the meta-features you end up with after using PCA? That’s a question that few people can answer although the answer is right there in front of them. Also, such a piece of information may be useful in evaluating the original features or providing some explanation of how much they are worth in terms of predictive potential, after the meta-features are used in a model.
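As a rough illustration of where that answer lives, here is a sketch for the two-feature case, where the first principal component can be computed in closed form from the 2x2 covariance matrix. The squared loadings sum to 1 and can be read as each feature's share in the meta-feature. The data and the function name `first_pc_2d` are assumptions made for this example; with more features you would use a PCA library and inspect its component matrix instead.

```python
import math
import statistics

def first_pc_2d(xs, ys):
    """First principal component of two features, via the 2x2 covariance matrix."""
    a = statistics.variance(xs)  # sample variance of feature 1
    c = statistics.variance(ys)  # sample variance of feature 2
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

    # Largest eigenvalue of [[a, b], [b, c]]
    lam1 = (a + c) / 2 + math.hypot((a - c) / 2, b)

    if b == 0:
        # Uncorrelated features: the PC is simply the higher-variance axis
        loadings = (1.0, 0.0) if a >= c else (0.0, 1.0)
    else:
        # Eigenvector (b, lam1 - a), normalized to unit length
        norm = math.hypot(b, lam1 - a)
        loadings = (b / norm, (lam1 - a) / norm)

    explained = lam1 / (a + c)  # share of total variance captured by PC1
    return loadings, explained

# Hypothetical, strongly related features
f1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
f2 = [1.1, 1.9, 3.2, 3.8, 5.1, 6.0]

loadings, explained = first_pc_2d(f1, f2)
contributions = [w ** 2 for w in loadings]  # per-feature share in PC1

print("loadings:", loadings)
print("contributions:", contributions)
print("explained variance ratio:", explained)
```

Here both features contribute about equally to the meta-feature, which is what you would expect given how similar they are.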
All of these issues and more can be tackled by using a new approach to dimensionality reduction, one that is based on a new paradigm (the same one that can tackle the clustering issues mentioned in the previous post). Also, even though the new approach doesn’t use a network architecture, it can still be considered a type of A.I. as there is some kind of optimization involved. As for the specifics of the new approach, that’s something to be discussed in another post, when the time is right...
A/B testing is a crucial methodology / application in the data science field. Although it mainly relies on Statistics, it has remained quite relevant in this machine learning and AI oriented era of our field. It's no coincidence that at Thinkful that's one of the first things data science students learn, once they get comfortable with descriptive Stats and basic data manipulation. So, I decided to do a video on this topic to help those interested in learning about it get a good perspective of it and better understand its relationship with Hypothesis Testing. It is my hope that this video can be a good supplement to one's learning on the subject. Enjoy!
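For readers who want to see the link to Hypothesis Testing in code, here is a minimal sketch of a two-proportion z-test, one common way of analyzing an A/B test on conversion rates. It is not taken from the video; the function name and the numbers are made up for illustration, and it assumes samples large enough for the normal approximation to hold.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Two-sided z-test for H0: both variants have the same conversion rate."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    # Pooled rate under the null hypothesis
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical experiment: variant A converts 200 of 2000 visitors, B converts 260 of 2000
z, p_value = two_proportion_z_test(200, 2000, 260, 2000)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone, which is exactly the hypothesis testing logic applied to an A/B setting.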
I was never particularly fond of clustering, the unsupervised learning methodology that’s under the umbrella of machine learning. It’s not that I didn’t see value in it, but the methods that were available for it when I started delving into it were rudimentary at best and fairly crude. In fact, if I were to do a PhD now, I’d choose a clustering-related topic since there is so much room for improvement that even a simple idea for improving the most popular clustering methods out there is bound to improve them!
However, the fact that data science researchers and machine learning engineers in particular haven’t spent much time looking into clustering doesn’t make clustering a bad methodology. In fact, I’d argue that it’s one of the most insightful ones and it plays an important role in many data science projects, particularly in the data exploration stage.
The key issues with clustering are:
1. The whole set of distance metrics used
2. The fact that the vast majority of clustering methods yield a (slightly) different result every time they are run
3. The need for an external parameter (K) in most clustering methods used in practice, in order to define how many clusters there are
4. The fact that it’s very shallow in its results
There may be more issues with clustering, but these are the most important ones I’ve found. So, if we were to rethink clustering and do it better, we’d need to address each one of these issues. Namely:
1. A new set of distance metrics would be needed, metrics that are not influenced by the dimensional “noise” so much, in the case of many dimensions in the dataset.
2. The option for a deterministic clustering method, one that would optimize the centroid seed before starting the whole clustering process.
3. An optimization process would be in place so as to find the best number of clusters. This should include the possibility of a single cluster, in the case where there isn’t enough diversity in the dataset.
4. A multi-level clustering option needs to be available, much like hierarchical clustering but in reverse, i.e. start with the main clusters in the dataset and gradually dig deeper into levels of sub-clusters.
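To illustrate point 2 only, here is a minimal sketch of a k-means variant with deterministic seeding (farthest-point initialization), so that every run yields the same result. This is a well-known textbook heuristic swapped in for illustration, not the approach hinted at in this post, and it doesn't address the distance metric, the choice of K, or the multi-level aspect. All names and data are hypothetical.

```python
import math

def farthest_point_seeds(points, k):
    """Deterministic seeding: start near the data's mean, then keep adding
    the point that is farthest from all seeds chosen so far."""
    dim = len(points[0])
    mean = tuple(sum(p[d] for p in points) / len(points) for d in range(dim))
    seeds = [min(points, key=lambda p: math.dist(p, mean))]
    while len(seeds) < k:
        seeds.append(max(points, key=lambda p: min(math.dist(p, s) for s in seeds)))
    return seeds

def deterministic_kmeans(points, k, max_iter=100):
    """Plain Lloyd iterations on top of the deterministic seeds."""
    centroids = farthest_point_seeds(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[j] = tuple(sum(m[d] for m in members) / len(members)
                                     for d in range(len(members[0])))
    return labels, centroids

# Two hypothetical, well-separated groups of points
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]

labels, centroids = deterministic_kmeans(points, k=2)
print(labels)
print(centroids)
```

Since there is no random number generator involved, running the function twice on the same data gives identical labels and centroids.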
Now, all this may sound simple but it’s not as easy to put into practice. Apart from an in-depth understanding of data science, a quite refined programming ability is needed too, so that the implementation of this clustering approach can be efficient and scalable. Perhaps all this is not even possible with the conventional data analytics framework, but there is not a single doubt in my mind that it is possible in general, and if a high-performance language is used (e.g. Julia), it is even practically feasible.
Naturally, a clustering framework like this one would require a certain level of A.I. to be used. This doesn’t have to be an ANN though, since A.I. can take many forms, not just network-based ones. Whatever the case, conventional statistics-based methods may be largely inadequate, while the very basic machine learning methods for clustering may not be sufficient either.
This illustrates something that many data science practitioners have forgotten: that data science methods evolve, just like other aspects of the craft. New tools may be intriguing, but equally intriguing are the conventional methodological tools, especially if we were to rethink them from a more advanced perspective. This can be beneficial in many ways, such as opening new avenues of data analytics and even synthesizing new data. This, however, is a story for another time...
So, my publisher has been co-organizing this conference for a few years now, and this September it is going to be in Düsseldorf, Germany. What's so special about it? Well, I'll be participating in it too, as a speaker. But regardless of that, the DMZ conference has grown a lot since it first started and now covers a variety of topics, not just related to Data Modeling. Also, just like other good conferences, DMZ has a variety of good technical books made available, plus if you register for the conference using the code DMZEU2018_VOULGARIS you can get a 25 Euro discount on any book-related purchase you make (that's about $29 worth of reading material). So, check it out, when you get the chance, at https://datamodelingzone.com.
It’s not the programming language, as some people may think. After all, if you know what you are doing, even a suboptimal language could be used without too much of an efficiency compromise. No, the biggest mistake people make, in my experience, is that they rely too much on libraries they find as well as the methods out there. This is not the worst part though. If someone relies excessively on predefined processes and methods, the chances of that person’s role getting automated by an A.I. are quite high. So, what can you do?
For starters, one needs to understand that both data science and artificial intelligence, like other modern fields, are in a state of flux. This means that what was considered gospel a few years back may be irrelevant in the near future, even if it is somewhat useful right now. Take Expert Systems, for example. These were all the rage during the time when A.I. came out as an independent field. However, nowadays, they are hardly used and in the near future, they may appear more anachronistic than ever before. That’s not to say that modern aspects of data science and A.I. are going to wane necessarily, but if one focuses too much on them, at the expense of the objective they are designed for, that person risks becoming obsolete as they become less relevant.
Of course, certain things may remain relevant no matter what. Regardless of how data science and A.I. evolve, the k-fold cross-validation method will still be useful. The same goes for certain evaluation metrics. So, how do you discern what is bound to remain relevant from what isn’t? Well, you can’t unless you try to innovate. If certain methods appear too simple, for example, they may not stick around for much longer, even if they linger in the textbooks. Do these methods have variants already that outperform the original algorithms? Are people developing similar methods to overcome drawbacks that they exhibit? What would you do if you were to improve these methods? Questions like these may be hard to answer because you won’t find the necessary info on Wikipedia or on StackOverflow, but they are worth thinking about for sure, even if an exact answer may elude you.
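In case k-fold cross-validation is new to you, here is a bare-bones sketch of how the folds are formed. The function name is made up, and the split is done in order, without shuffling; in practice you would normally shuffle the data first or rely on a library implementation.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.
    Each sample lands in exactly one test fold; no shuffling is done here."""
    # Spread the remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    indices = list(range(n_samples))
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size

# Hypothetical dataset of 10 samples split into 3 folds
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

A model would be trained on each training set and evaluated on the corresponding test fold, with the k scores averaged at the end.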
For example, I always thought that clustering had to be stochastic because everyone was telling me that it is an NP-hard problem that cannot be solved efficiently with a deterministic method. Well, with this mindset no innovations would ever take place in that method of unsupervised learning, would they? So, I questioned this matter and found out that not only are there ways to solve clustering in a deterministic way, but some of these methods are more stable than the stochastic ones. Are they easy? No. But they work. So, just like we tend to opt for mechanized transportation today, instead of the (much simpler) horse and carriage alternative, perhaps the more sophisticated clustering methods will prevail. But even if they don’t (after all, there are no limits to some people’s aversion to something new, especially if it’s difficult for them to understand), the fact that I’ve learned about them enables me to be more flexible if this change takes place. At the same time, I can be more prepared for other changes in the field, of a similar nature.
I am not against stochastic methods, by the way, but if an efficient deterministic solution exists for a problem, I see no reason why we should stick with a stochastic approach to that problem. However, for optimization related scenarios, especially those involving very complex problems, the stochastic approach may be the only viable option. Bottom line, we need to be flexible about these matters.
To sum up, learning about the conventional way of solving data-related problems, be it through data science methods, or via A.I. ones, is but the first step. Stopping there though would be a grave mistake, since you’d be depriving yourself of the opportunity to delve deeper into the field and explore not only what’s feasible but also what’s possible. Isn’t that what science is about?
A few years back, at a time when I was both inspired to experiment with different Complex Systems and had enough time on my hands, I created this interesting variant of John Conway's Game of Life. As the beings in this model evolved, I named it the Game of Evolving Life. I ran a bunch of simulations on it and analyzed the results, a project that took the form of a whole ebook, which I never got around to publishing. Whatever the case, I thought this project would make a good example for the Complex Systems subtopic of the previous video's topic, so I made a video on it. This new video is now online on Safari. Enjoy!
Note that this video covers the main highlights of the model, with a very brief introduction to what complex systems are. Also, I focused on the more visual aspect of the analysis I'd done, otherwise it would be a much longer video that wouldn't be as interesting to most people. Finally, this whole thing was more of a programming exercise, so if you are looking at Data Science related videos that go into more depth on the methods of the craft, perhaps other videos would be better for you.
There is no doubt that Artificial Intelligence has a number of issues that need to be addressed before its benefits can become more widespread. Also, if it were to become more autonomous, we would need to be able to at least anticipate its decisions and perhaps even understand how they come about. However, none of this appears to be happening yet. Whether that’s due to some innate infeasibility or due to some other factor is yet to be discovered.
What we have discovered though, again and again, is that most A.I. developments take the world by surprise, even the people involved in this field, dedicated scientists and engineers who have spent countless hours working with such systems. A collective understanding of these systems still eludes us, and it’s not the A.I.’s fault.
It’s easy to blame an A.I. or the people behind it for anything that goes wrong, but remember that various A.I. projects were seen to their completion because we as potential users of them wanted them out there. Whether we understood the implications of these systems or not though is questionable.
So, the biggest issue of A.I. might be how we relate to it, combined with the fact that we don’t really understand it in depth. The evangelists of the field view it as a panacea of sorts, oftentimes confusing A.I. with ML, while often considering the latter as a subfield of the former. On the other hand, the technical people involved in A.I. see it as a cool technology that can keep them relevant in the tech market. As for the consumers of A.I., they see it as a cool futuristic tech that may make life more interesting, though it may also change the dynamics of the job market in very disruptive (or even disturbing) ways. Unless we all obtain a clearer understanding of what A.I. is, what it can and cannot do, and how it works (to the extent each person’s technical level allows), A.I. will remain an exotic technology wrapped in a mist of mystique.
That’s not an insurmountable problem though. Nowadays, knowledge is more accessible than ever before, so if someone wants to learn more about A.I., it’s just a matter of committing to that task and putting in the necessary hours. Granted, sometimes a few books or videos would be needed too, with whatever cost this entails, but the task is still quite a manageable one. Besides, one doesn’t need to be an A.I. expert in order to have sensible expectations of this tech and be able to discern the brilliance of some such systems from the BS of many of the futurists.
All in all, the more one knows about this field and the more realistic his or her expectations are, the better the chances of deriving value from A.I., without falling victim to the problems that surround it.
When I started my life-long journey in the world of data analytics (which morphed into Data Science and modern AI-based predictive analytics systems), it was through academia. I even did a post-doc at one point, which, although it paid the bills, was the worst-paying job I’ve ever had during my career. Yet, as long as there were things to learn and challenges to overcome, I was willing to see past that.
As I matured, I realized that the only thing that mattered in that strange world, if you were to have a career in it, was publications. As I enjoyed writing, I gave it a shot. However, the needlessly long waiting time for any feedback, the low quality of that feedback, and the overall time it took for something to get published, put me off eventually. After that, I decided to pursue a career, any career, in the real world, as at least here there is more meritocracy and shorter waiting times, enabling much faster growth.
A few months ago, I was approached by a big-time academic publishing house for an article in their encyclopedia of big data. I was surprised to see that after so many years they had come to be more progressive about the whole publications-related business. As the topic was up my alley, I decided to accept their offer. At the time I felt that this would be my way of giving back to the data science programming community. I only asked that the companies I work with get mentioned in the article so that they can at least justify my being distracted by this project. The academic publisher accepted and said that these companies would be mentioned as my affiliations. I even provided their location details afterwards, so that they would be represented fully.
Months later, I got some feedback, some really minor corrections, that I took care of promptly. Finally, last month the article was published. I was pleased, for a couple of minutes, till I realized that the affiliations were all screwed up. To this day I am not sure how this could happen. It would take a whole new level of incompetence to mess up such a simple task, more than I was used to seeing throughout my academic life. Of course, mistakes happen and since I’m not perfect either, I politely asked for corrections on this part of the article. I had to do this twice, since the first time they must have forgotten about it (apparently these corrections were not a priority for them). To this day, the article remains uncorrected, since clearly this 2-minute task is just too much for them to handle, or perhaps there isn’t much of a motivation.
If there was a slight chance of me ever working in an academic setting again, e.g. by writing articles like that one or academic papers, this is gone as this event proved what a colossal waste of time it is working with this sort of bureaucracy. Perhaps for you it’s different because you have higher tolerance or lower self-esteem (or maybe both) and you can put up with these clowns. However, if you are at a crossroads in your career in our field, be sure to explore your options wisely before being tempted to settle for an academic publication gig. More often than not, it would not be worth your time, while all the other alternatives would be more rewarding.
UPDATE: finally they managed to update the affiliations bit. I wonder if this article had anything to do with it! It's doubtful that I'll change my view on the academic publications matter any time soon though.
Randomness, Uncertainty, Complex Systems, and Applications Video Now Online + Shout-out to a Viewer of This Blog
This past week I decided to do a vid on an experimental topic, involving different fields, an interdisciplinary topic if you will. I understand the risks of such a video, since randomness is not particularly easy as a subject, while complex systems are a bit niche as a field. However, I tried to bring about a more intuitive approach to all this and introduce a new feature for such videos: mini-quizzes so that you can test your understanding while you watch the video. Anyway, feel free to check out this introductory video to this topic by visiting the corresponding Safari page. Warning: some of the stuff covered in this video veers away from conventional approaches to this topic. Also, the video is very light on the math aspect of the topic as otherwise it would be too long and it's already over 30 minutes in length...
Also, recently a viewer of this blog, S.M., contacted me with some suggestions on how to tackle certain typo-related issues he had found. Big thanks to S.M. for his contribution!
Bias-Variance Trade-Off for Data Science & Backing Up and Wiping Out Sensitive Data Videos Are Now Online
This past week I've had some time off work as my CEO was on vacation. As a result I did 2 videos, not just 1. Here they are:
The Bias-Variance Trade-Off: when you have a model that favors a certain class or a certain set of values, you have high bias, while when you have a model whose predictions are all over the place, you have high variance. Could you find a compromise between the two? And how does all this relate to the model's fitness? This video includes a few examples too, for classification and regression problems, to cement the concepts introduced.
Backing Up and Wiping Out Sensitive Data: you probably have heard of this topic and perhaps even apply it to some extent, since taking care of sensitive data is a good cyber-security habit to have, plus it's not new either. However, there is much more to it than that, like which storage media are best for back-up, how you can handle sensitive data on your computer without leaving a trace, and what software is out there that helps make that happen.
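Going back to the bias-variance video for a moment, here is a small simulation sketch of the two failure modes described there. It is not taken from the video; the "models" are just made-up lists of predictions for a single known target, standing in for what a model would output if it were retrained many times. For a fixed target, the mean squared error splits exactly into squared bias plus variance.

```python
import random
import statistics

def bias_variance(predictions, target):
    """Split the mean squared error of repeated predictions for one fixed
    target into squared bias and variance."""
    mean_pred = statistics.fmean(predictions)
    bias_sq = (mean_pred - target) ** 2
    variance = statistics.pvariance(predictions)
    mse = statistics.fmean([(p - target) ** 2 for p in predictions])
    return bias_sq, variance, mse

rng = random.Random(0)  # fixed seed so the run is repeatable
target = 10.0

# Hypothetical high-bias model: consistently off, but very stable
high_bias = [8.0 + rng.gauss(0, 0.1) for _ in range(1000)]
# Hypothetical high-variance model: right on average, but all over the place
high_variance = [10.0 + rng.gauss(0, 2.0) for _ in range(1000)]

for name, preds in [("high bias", high_bias), ("high variance", high_variance)]:
    bias_sq, variance, mse = bias_variance(preds, target)
    print(f"{name}: bias^2={bias_sq:.3f}, variance={variance:.3f}, mse={mse:.3f}")
```

Both models end up with a similar overall error, but for opposite reasons, which is why a compromise between the two is usually what you are after.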
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.