A/B testing is a crucial methodology in the data science field. Although it relies mainly on Statistics, it has remained quite relevant in this machine learning and AI oriented era of our field. It's no coincidence that at Thinkful it's one of the first things data science students learn, once they get comfortable with descriptive statistics and basic data manipulation. So, I decided to do a video on this topic to help those interested in learning about it get a good perspective on it and better understand its relationship with Hypothesis Testing. It is my hope that this video can be a good supplement to one's learning on the subject. Enjoy!
I was never particularly fond of clustering, the unsupervised learning methodology that’s under the umbrella of machine learning. It’s not that I didn’t see value in it, but the methods available for it when I started delving into it were rudimentary at best and fairly crude. In fact, if I were to do a PhD now, I’d choose a clustering-related topic, since there is so much room for improvement that even a simple idea is bound to make the most popular clustering methods out there better!
However, the fact that data science researchers and machine learning engineers in particular haven’t spent much time looking into clustering doesn’t make clustering a bad methodology. In fact, I’d argue that it’s one of the most insightful ones and it plays an important role in many data science projects, particularly in the data exploration stage.
The key issues with clustering are:
1. The whole set of distance metrics used
2. The fact that the vast majority of clustering methods yield a (slightly) different result every time they are run
3. The need for an external parameter (K) in most clustering methods used in practice, in order to define how many clusters to look for
4. The fact that its results are shallow: a single, flat partition of the data
There may be more issues with clustering, but these are the most important ones I’ve found. So, if we were to rethink clustering and do it better, we’d need to address each one of these issues. Namely:
1. A new set of distance metrics would be needed: metrics that are not influenced so much by dimensional “noise” when the dataset has many dimensions.
2. The option for a deterministic clustering method, one that would optimize the centroid seed before starting the whole clustering process.
3. An optimization process would be in place to find the best number of clusters. This should include the possibility of a single cluster, in case there isn’t enough diversity in the dataset.
4. A multi-level clustering option needs to be available, much like hierarchical clustering but in reverse, i.e. start with the main clusters in the dataset and gradually dig deeper into levels of sub-clusters.
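To make the first of these points more concrete, one family of metrics often proposed for high-dimensional data is the fractional Minkowski distances (order p < 1), which are reported to preserve the relative contrast between near and far neighbors better than the Euclidean metric as dimensionality grows. The sketch below is my own minimal illustration, not an implementation from any particular library:

```python
def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors.
    p = 2 gives the familiar Euclidean distance; p < 1 gives a
    'fractional' metric often suggested for high-dimensional data."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

# With p = 2 this is plain Euclidean distance:
print(minkowski([0, 0], [3, 4], 2))    # 5.0

# With a fractional order the same pair of points gets a different
# (larger) distance value; the practical appeal is the better spread
# of distances in spaces with many noisy dimensions:
print(minkowski([0, 0], [3, 4], 0.5))
```

Whether a fractional order actually helps depends on the dataset, so in practice the order p would itself be something to tune.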
Now, all this may sound simple, but it’s not as easy to put into practice. Apart from an in-depth understanding of data science, a fairly refined programming ability is needed too, so that the implementation of this clustering approach can be efficient and scalable. Perhaps all this is not even possible within the conventional data analytics framework, but there is not a single doubt in my mind that it is possible in general, and if a high-performance language is used (e.g. Julia), it is even practically feasible.
Naturally, a clustering framework like this one would require a certain level of A.I. to be used. This doesn’t have to be an ANN though, since A.I. can take many forms, not just network-based ones. Whatever the case, conventional statistics-based methods may be largely inadequate, while the very basic machine learning methods for clustering may not be sufficient either.
This illustrates something that many data science practitioners have forgotten: that data science methods evolve, just like other aspects of the craft. New tools may be intriguing, but equally intriguing are the conventional methodological tools, especially if we were to rethink them from a more advanced perspective. This can be beneficial in many ways, such as opening new avenues of data analytics and even synthesizing new data. This, however, is a story for another time...
So, my publisher has been co-organizing this conference for a few years now, and this September it is going to be in Düsseldorf, Germany. What's so special about it? Well, I'll be participating in it too, as a speaker. But regardless of that, the DMZ conference has grown a lot since it first started and now covers a variety of topics, not just ones related to Data Modeling. Also, just like other good conferences, DMZ has a variety of technical books available, plus if you register for the conference using the code DMZEU2018_VOULGARIS, you can get a 25 Euro discount on any book-related purchase you make (that's about $29 worth of reading material). So, check it out, when you get the chance, at https://datamodelingzone.com.
It’s not the programming language, as some people may think. After all, if you know what you are doing, even a suboptimal language could be used without too much of an efficiency compromise. No, the biggest mistake people make, in my experience, is that they rely too much on the libraries they find and on the methods already out there. This is not the worst part though. If someone relies excessively on predefined processes and methods, the chances of that person’s role getting automated by an A.I. are quite high. So, what can you do?
For starters, one needs to understand that both data science and artificial intelligence, like other modern fields, are in a state of flux. This means that what was considered gospel a few years back may be irrelevant in the near future, even if it is somewhat useful right now. Take Expert Systems, for example. These were all the rage when A.I. first emerged as an independent field. However, nowadays they are hardly used, and in the near future they may appear more anachronistic than ever before. That’s not to say that modern aspects of data science and A.I. are necessarily going to wane, but if one focuses too much on them, at the expense of the objective they are designed for, that person risks becoming obsolete as those tools become less relevant.
Of course, certain things may remain relevant no matter what. Regardless of how data science and A.I. evolve, the k-fold cross-validation method will still be useful. The same goes for certain evaluation metrics. So, how do you discern what is bound to remain relevant from what isn’t? Well, you can’t, unless you try to innovate. If certain methods appear too simple, for example, they may not stick around for much longer, even if they linger in the textbooks. Do these methods already have variants that outperform the original algorithms? Are people developing similar methods to overcome the drawbacks they exhibit? What would you do if you were to improve these methods? Questions like these may be hard to answer because you won’t find the necessary info on Wikipedia or on StackOverflow, but they are worth thinking about for sure, even if an exact answer may elude you.
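For readers new to the k-fold cross-validation method mentioned above, it is simple enough to sketch from scratch: split the data into k folds, and k times train on all folds but one and evaluate on the held-out fold. The model and scoring function below (a training-mean predictor scored by mean squared error) are purely illustrative stand-ins:

```python
def k_fold_indices(n, k):
    """Split range(n) into k contiguous folds of (near-)equal size."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_val_scores(xs, ys, k, fit, score):
    """Train on k-1 folds, evaluate on the held-out fold, k times."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        model = fit([xs[i] for i in train], [ys[i] for i in train])
        scores.append(score(model, [xs[i] for i in fold],
                            [ys[i] for i in fold]))
    return scores

# Toy stand-ins: a "model" that always predicts the training mean,
# scored by mean squared error on the held-out fold.
fit_mean = lambda xs, ys: sum(ys) / len(ys)
mse = lambda m, xs, ys: sum((y - m) ** 2 for y in ys) / len(ys)

xs = list(range(10))
ys = [2.0 * x for x in xs]
print(cross_val_scores(xs, ys, 5, fit_mean, mse))  # one MSE per fold
```

In real projects one would also shuffle the data before folding (ideally with a fixed seed), since contiguous folds can hide ordering effects.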
For example, I always thought that clustering had to be stochastic because everyone was telling me that it is an NP-hard problem that cannot be solved efficiently with a deterministic method. Well, with this mindset no innovations would ever take place in that area of unsupervised learning, would they? So, I questioned this matter and found out that not only are there ways to solve clustering in a deterministic way, but some of these methods are more stable than the stochastic ones. Are they easy? No. But they work. So, just like we tend to opt for mechanized transportation today, instead of the (much simpler) horse and carriage alternative, perhaps the more sophisticated clustering methods will prevail. But even if they don’t (after all, there are no limits to some people’s distaste for something new, especially if it’s difficult for them to understand), the fact that I’ve learned about them enables me to be more flexible if this change takes place. At the same time, I can be more prepared for other changes of a similar nature in the field.
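To give a flavor of what deterministic clustering can look like (this is my own illustrative seeding scheme, not the specific methods alluded to above): pick the point closest to the overall data mean as the first centroid seed, then greedily add the point farthest from all seeds chosen so far. Given the same data, this always yields the same seeds, so the clustering that follows is reproducible run after run:

```python
def euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def deterministic_seeds(points, k):
    """Pick k centroid seeds with no randomness: start near the data
    mean, then greedily add the point farthest from all chosen seeds."""
    dims = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dims)]
    seeds = [min(points, key=lambda p: euclidean(p, mean))]
    while len(seeds) < k:
        seeds.append(max(points,
                         key=lambda p: min(euclidean(p, s) for s in seeds)))
    return seeds

# Two well-separated blobs; one seed lands in each blob, and repeated
# calls give the exact same answer every time.
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2), (5.0, 5.1), (5.2, 5.0)]
print(deterministic_seeds(pts, 2))
```

Farthest-point traversal is sensitive to outliers, which is one reason deterministic schemes need more care than random restarts; the point here is only that determinism is achievable.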
I am not against stochastic methods, by the way, but if an efficient deterministic solution exists for a problem, I see no reason why we should stick with a stochastic approach to that problem. However, for optimization-related scenarios, especially those involving very complex problems, the stochastic approach may be the only viable option. Bottom line, we need to be flexible about these matters.
To sum up, learning about the conventional way of solving data-related problems, be it through data science methods or via A.I. ones, is but the first step. Stopping there, though, would be a grave mistake, since you’d be depriving yourself of the opportunity to delve deeper into the field and explore not only what’s feasible but also what’s possible. Isn’t that what science is about?
When I started my life-long journey in the world of data analytics (which morphed into Data Science and modern AI-based predictive analytics systems), it was through academia. I even did a post-doc at one point, which, although it paid the bills, was the worst-paying job I’ve ever had in my career. Yet, as long as there were things to learn and challenges to overcome, I was willing to see past that.
As I matured, I realized that the only thing that mattered in that strange world, if you were to have a career in it, was publications. As I enjoyed writing, I gave it a shot. However, the needlessly long waiting time for any feedback, the low quality of that feedback, and the overall time it took for something to get published eventually put me off. After that, I decided to pursue a career, any career, in the real world, as at least here there is more meritocracy and there are shorter waiting times, enabling much faster growth.
A few months ago, I was approached by a big-time academic publishing house for an article in their encyclopedia of big data. I was surprised to see that after so many years they had become more progressive about the whole publications business. As the topic was up my alley, I decided to accept their offer. At the time I felt that this would be my way of giving back to the data science community. I only asked that the companies I work with get mentioned in the article so that they could at least justify my being distracted by this project. The academic publisher accepted and said that these companies would be mentioned as my affiliations. I even provided their location details afterwards, so that they would be represented fully.
Months later, I got some feedback (some really minor corrections), which I took care of promptly. Finally, last month the article was published. I was pleased, for a couple of minutes, till I realized that the affiliations were all screwed up. To this day I am not sure how this could happen. It would take a whole new level of incompetence to mess up such a simple task, more than I was used to seeing throughout my academic life. Of course, mistakes happen, and since I’m not perfect either, I politely asked for corrections on this part of the article. I had to do this twice, since apparently the first time they forgot about it (these corrections were clearly not a priority for them). To this day, the article remains uncorrected, since this 2-minute task is evidently too much for them to handle, or perhaps there isn’t much of a motivation.
If there was ever a slight chance of me working in an academic setting again, e.g. by writing articles like that one or academic papers, it is gone now, as this event proved what a colossal waste of time it is to work with this sort of bureaucracy. Perhaps for you it’s different, because you have higher tolerance or lower self-esteem (or maybe both) and you can put up with these clowns. However, if you are at a crossroads in your career in our field, be sure to explore your options wisely before being tempted to compromise with an academic publication gig. More often than not, it won’t be worth your time, while all the other alternatives will be more rewarding.
UPDATE: they finally managed to update the affiliations bit. I wonder if this article had anything to do with it! It's doubtful that I'll change my view on the academic publications matter any time soon though.
Bias-Variance Trade-Off for Data Science & Backing Up and Wiping Out Sensitive Data Videos Are Now Online
This past week I've had some time off work as my CEO was on vacation. As a result I did 2 videos, not just 1. Here they are:
The Bias-Variance Trade-Off: when you have a model that favors a certain class or a certain set of values, you have high bias, while when you have a model whose predictions are all over the place, you have high variance. Can you find a compromise between the two? And how does all this relate to the model's fitness? This video also includes a few examples, for classification and regression problems, to cement the concepts introduced.
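As a quick taste of the trade-off (a toy sketch of my own, not material from the video): fit two extreme models on many resampled noisy datasets and look at their predictions at one test point. A model that ignores the input and always predicts the training mean is systematically off (high bias) but very stable (low variance); a 1-nearest-neighbour model that memorises the noise is right on average (low bias) but its predictions jump around (high variance):

```python
import random

random.seed(42)  # fixed seed so the sketch is reproducible

def sample_data(n=20):
    """Noisy observations of a simple linear trend y = 2x."""
    return [(i / 10, 2.0 * (i / 10) + random.gauss(0, 0.5))
            for i in range(n)]

# High-bias model: ignores x entirely and predicts the training mean.
def fit_mean(data):
    m = sum(y for _, y in data) / len(data)
    return lambda x: m

# High-variance model: 1-nearest-neighbour, which memorises the noise.
def fit_1nn(data):
    return lambda x: min(data, key=lambda p: abs(p[0] - x))[1]

# Collect each model's predictions at one test point over many resamples.
x_test, y_true = 1.9, 3.8
preds = {"mean": [], "1nn": []}
for _ in range(300):
    d = sample_data()
    preds["mean"].append(fit_mean(d)(x_test))
    preds["1nn"].append(fit_1nn(d)(x_test))

def bias_sq(ps):
    avg = sum(ps) / len(ps)
    return (avg - y_true) ** 2

def variance(ps):
    avg = sum(ps) / len(ps)
    return sum((p - avg) ** 2 for p in ps) / len(ps)

# The mean model shows high bias / low variance; 1-NN the opposite.
print(bias_sq(preds["mean"]), variance(preds["mean"]))
print(bias_sq(preds["1nn"]), variance(preds["1nn"]))
```

Neither extreme minimises total error; the compromise between them is exactly what model selection is after.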
Backing Up and Wiping Out Sensitive Data: you have probably heard of this topic and perhaps even apply it to some extent, since taking care of sensitive data is a good cyber-security habit to have, plus it's not new either. However, there is much more to it than that, like which storage media are best for back-ups, how you can handle sensitive data on your computer without leaving a trace, and what software is out there that helps make that happen.
Why Articles on Social Media about Programming for Data Science Seem to Be Straight Out of a Time Capsule
Data science related topics sell, no doubt about that. This is great if you are interested in the field and want to learn more about it, especially practical things that can offer you some orientation in the field. Since programming is a key component of data science, it makes sense to pay attention to material along these lines, particularly if you are new to this whole matter.
How the Situation Is Today
Fortunately, there is an abundance of articles on this topic, especially on social media. However, not everyone who writes such articles is up-to-date on the subject, since many of these “expert” tech writers are not forward-thinking data scientists themselves. Best case scenario, they have spent a few minutes on the web, probably focusing on the first page of search engine results for the bulk of their material. And shocking as it may be, this material may be geared more towards what’s popular rather than what’s accurate. Alternatively, they may have relied on what some data science guru once said on the topic, information that may no longer be particularly relevant. Apart from that, the writers who delve into producing this sort of article (or infographic, in some cases) have their own biases. They probably took a programming course at university, so if a particular programming platform comes up in their “research”, they may be more likely to highlight it. After all, this would make them look knowledgeable, since they have hands-on experience with that platform, even if it’s not that useful to data science any more. What’s more, many people who write about these topics don’t want to take risks with newer things. It’s much safer to mention languages that everyone knows about and which have a large community around them than to mention newer ones that may be despised by the hardcore users of older coding platforms.
Hope for the Future
For better or for worse, an article on social media has a limited life span. After all, its purpose is mainly to get enough people to click on a particular link where a given site serves ads, so that the people owning the site can get some revenue from said ads. Therefore, if the article is forgotten in a week, its producers won’t lose any sleep over it. Books and subscription-based videos are not like that though. Neither are technical conferences. So, since the new trends rely more on these kinds of platforms to become well-known, they are not that hindered by social media misinformation. After all, if a programming language is good, this will eventually show, even if the fan-boys of the more traditional languages would sooner die than change their views on their favorite coding platforms.
What You Can Do
So, instead of getting swayed by this or the other “expert” with X thousand followers (many of whom are probably either bots or bought followers), you can do your own research. Check out what books are out there on the various programming languages and whether they hint at applicability in data science. Check out videos on Safari and other serious educational platforms. Look at what new language conferences are out there and how they cover data science related topics. And most importantly, try some of these languages yourself. This way you’ll have more reliable data when deciding which language is most relevant and most future-proof in our field, rather than blindly believing whatever this or the other “expert” on social media says.
After investigating this topic quite a bit, as I was looking into A.I. stuff, I decided to create a video on it. To make it more complete, I included other methods too, such as Statistics-based and heuristics-based ones. Despite the excessive amount of content I put together into this project (the script was over 4000 words), I managed to keep the video at a manageable length (a bit less than half an hour). Check it out on Safari when you have some time!
Ever since social media (SM) became a mainstream option for spending one’s time on the web, it has started to disrupt the way we view information and even knowledge to some extent. Even though there is no doubt that SM offer substantial benefits in advertising and branding, there is little they can offer when it comes to actually learning something. Here is why.
Even though some articles can be thought-provoking, consuming information to satisfy your curiosity and actually assimilating it are two different things. This is particularly true when it comes to a technical field, like data science, where being informed about something is barely enough to have an opinion on the topic, let alone do something useful with it. Many people who roam the SM in search of mentors don’t realize that. They tend to forget that following someone in an attempt to learn from them is the equivalent of body-building by just hanging out in the lobby of a gym. Yet, they do it anyway because it’s easy and it doesn’t cost them anything (other than some time, assuming that they read the stuff their leaders post on the SM).
If you really want to learn something, especially something complex and multifaceted like data science, you need to get your hands dirty and you have to break a sweat. The various things someone posts on the SM aren’t going to help much. There is a reason why books and videos on the subject sell, even if there is abundant information on the web. Also, in my experience, if a platform doesn’t charge you for the “products” it offers to you, that’s because you are the product! SM are designed with that in mind. Of course, some of them may be worth the time you spend on them since they can be a source of a diverse array of views on a topic (hopefully from different perspectives), but that’s not the same as applicable knowledge. If you want to hone your data science skills you need something you can rely on, not something someone types on the SM while enjoying their morning coffee, to pass the time.
So, what can you do, instead of following someone on the SM? There are various strategies, each with its own set of benefits. Ideally, you would do a combination of them to maximize your learning opportunities. The main ones of these strategies are:
What are your thoughts on the matter? How do you learn data science?
For the past few months I've been working on a tutorial on the data modeling part of the data science process. I recently finished it, and as of 2 weeks ago it has been available online at the Safari portal. Although this tutorial is mainly for newcomers to the field, everyone can benefit from it, particularly people who are interested not just in the technical aspects but also in the concepts behind them and how it all relates to the other parts of the pipeline. Enjoy!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.