So, my publisher has been co-organizing this conference for a few years now, and this September it is going to be in Düsseldorf, Germany. What's so special about it? Well, I'll be participating in it too, as a speaker. But regardless of that, the DMZ conference has grown a lot since it first started and now covers a variety of topics, not just related to Data Modeling. Also, just like other good conferences, DMZ has a variety of good technical books made available, plus if you register for the conference using the code DMZEU2018_VOULGARIS you can get a 25 Euro discount on any book-related purchase you make (that's about $29 worth of reading material). So, check it out, when you get the chance, at https://datamodelingzone.com.
It’s not the programming language, as some people may think. After all, if you know what you are doing, even a suboptimal language could be used without too much of an efficiency compromise. No, the biggest mistake people make, in my experience, is that they rely too much on libraries they find as well as the methods out there. This is not the worst part though. If someone relies excessively on predefined processes and methods, the chances of that person’s role getting automated by an A.I. are quite high. So, what can you do?
For starters, one needs to understand that both data science and artificial intelligence, like other modern fields, are in a state of flux. This means that what was considered gospel a few years back may be irrelevant in the near future, even if it is somewhat useful right now. Take Expert Systems, for example. These were all the rage during the time when A.I. came out as an independent field. However, nowadays, they are hardly used and in the near future, they may appear more anachronistic than ever before. That’s not to say that modern aspects of data science and A.I. are going to wane necessarily, but if one focuses too much on them, at the expense of the objective they are designed for, that person risks becoming obsolete as they become less relevant.
Of course, certain things may remain relevant no matter what. Regardless of how data science and A.I. evolve, the k-fold cross-validation method will be useful still. Same goes with certain evaluation metrics. So, how do you discern what is bound to remain relevant from what isn’t? Well, you can’t unless you try to innovate. If certain methods appear too simple, for example, they may not stick around for much longer, even if they linger in the textbooks. Do these methods have variants already that outperform the original algorithms? Are people developing similar methods to overcome drawbacks that they exhibit? What would you do if you were to improve these methods? Questions like this may be hard to answer because you won’t find the necessary info on Wikipedia or on StackOverflow, but they are worth thinking about for sure, even if an exact answer may elude you.
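As a reminder of how little machinery k-fold cross-validation actually needs, here is a minimal sketch of the index-splitting step in plain Python. The function name `k_fold_indices` is made up for illustration and is not taken from any library. The sketch assumes the data has already been shuffled; in practice you would normally shuffle (or stratify) first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper for illustration; assumes the samples are
    already in random order.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so that fold sizes
    # differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, test_idx
        start += size


# Each sample lands in exactly one test fold.
folds = list(k_fold_indices(10, 3))
```

Every sample is used for testing exactly once and for training k-1 times, which is probably a big part of why the method has aged so well.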
For example, I always thought that clustering had to be stochastic because everyone was telling me that it is an NP-hard problem that cannot be solved efficiently with a deterministic method. Well, with this mindset no innovations would ever take place in that method of unsupervised learning, would they? So, I questioned this matter and found out that not only are there ways to solve clustering in a deterministic way, but some of these methods are more stable than the stochastic ones. Are they easy? No. But they work. So, just like we tend to opt for mechanized transportation today, instead of the (much simpler) horse and carriage alternative, perhaps the more sophisticated clustering methods will prevail. But even if they don’t (after all, there are no limits to some people’s distaste for something new, especially if it’s difficult for them to understand), the fact that I’ve learned about them enables me to be more flexible if this change takes place. At the same time, I can be more prepared for other changes in the field, of a similar nature.
I am not against stochastic methods, by the way, but if an efficient deterministic solution exists for a problem, I see no reason why we should stick with a stochastic approach to that problem. However, for optimization-related scenarios, especially those involving very complex problems, the stochastic approach may be the only viable option. Bottom line, we need to be flexible about these matters.
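To make the deterministic-versus-stochastic point about clustering a bit more tangible, here is a small sketch in plain Python. It is not the method alluded to above, which this post does not describe. It simply takes the familiar k-means loop and replaces the random initialization with a farthest-first seeding rule, so the same data always yields the same clusters. The function name `deterministic_kmeans` is made up for illustration, and the result is still only a local optimum, not a guaranteed global one.

```python
import math


def deterministic_kmeans(points, k, max_iter=100):
    """K-means with farthest-first seeding instead of random seeds.

    Illustrative sketch only: points are tuples of numbers, and the
    input is sorted first so the outcome does not depend on its order.
    """
    pts = sorted(points)
    # Seed 1 is the smallest point in sorted order; each further seed is
    # the point farthest from all seeds chosen so far (ties go to the
    # earliest point in sorted order).
    centers = [pts[0]]
    while len(centers) < k:
        centers.append(
            max(pts, key=lambda p: min(math.dist(p, s) for s in centers)))
    clusters = []
    for _ in range(max_iter):
        # Assignment step: attach every point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in pts:
            nearest = min(range(k), key=lambda i: math.dist(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its cluster
        # (an empty cluster keeps its previous center).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, clusters = deterministic_kmeans(points, 2)
```

Running it again, even with the points in a different order, gives the same centers and clusters, which is the kind of stability a randomly seeded run cannot promise.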
To sum up, learning about the conventional way of solving data-related problems, be it through data science methods, or via A.I. ones, is but the first step. Stopping there though would be a grave mistake, since you’d be depriving yourself of the opportunity to delve deeper into the field and explore not only what’s feasible but also what’s possible. Isn’t that what science is about?
When I started my life-long journey in the world of data analytics (which morphed into Data Science and modern AI-based predictive analytics systems), it was through academia. I even did a post-doc at one point, which, although it paid the bills, was the worst-paying job I’ve ever had during my career. Yet, as long as there were things to learn and challenges to overcome, I was willing to see past that.
As I matured, I realized that the only thing that mattered in that strange world, if you were to have a career in it, was publications. As I enjoyed writing, I gave it a shot. However, the needlessly long waiting time for any feedback, the low quality of that feedback, and the overall time it took for something to get published, put me off eventually. After that, I decided to pursue a career, any career, in the real world, as at least here there is more meritocracy and shorter waiting times, enabling much faster growth.
A few months ago, I was approached by a big-time academic publishing house for an article in their encyclopedia of big data. I was surprised to see that after so many years they had come to be more progressive about the whole publications-related business. As the topic was right up my alley, I decided to accept their offer. At the time I felt that this would be my way of giving back to the data science programming community. I only asked that the companies I work with get mentioned in the article so that they can at least justify my being distracted by this project. The academic publisher accepted and said that these companies would be mentioned as my affiliations. I even provided their location details afterwards, so that they would be represented fully.
Months later, I got some feedback, some really minor corrections, that I took care of promptly. Finally, last month the article was published. I was pleased, for a couple of minutes, till I realized that the affiliations were all screwed up. To this day I am not sure how this could happen. It would take a whole new level of incompetence to mess up such a simple task, more than I was used to seeing through my academic life. Of course, mistakes happen and since I’m not perfect either, I politely asked for corrections on this part of the article. I had to do this twice, since the first time they must have forgotten about it (apparently these corrections were not a priority to them). To this day, the article remains uncorrected, since clearly this 2-minute task is just too much for them to handle, or perhaps there isn’t much motivation.
If there was a slight chance of me ever working in an academic setting again, e.g. by writing articles like that one or academic papers, this is gone, as this event proved what a colossal waste of time it is working with this sort of bureaucracy. Perhaps for you it’s different because you have higher tolerance or lower self-esteem (or maybe both) and you can put up with these clowns. However, if you are at a crossroads in your career in our field, be sure to explore your options wisely before being tempted to compromise with an academic publication gig. More often than not, it would not be worth your time, while all the other alternatives would be more rewarding.
UPDATE: finally they managed to update the affiliations bit. I wonder if this article had anything to do with it! It's doubtful that I'll change my view on the academic publications matter any time soon though.
Bias-Variance Trade-Off for Data Science & Backing Up and Wiping Out Sensitive Data Videos Are Now Online
This past week I've had some time off work as my CEO was on vacation. As a result I did 2 videos, not just 1. Here they are:
The Bias-Variance Trade-Off: when you have a model that favors a certain class or a certain set of values, you have high bias, while when you have a model whose predictions are all over the place, you have high variance. Could you find a compromise between the two? And how does all this relate to the model's fitness? This video includes a few examples too, for classification and regression problems, to cement the concepts introduced.
Backing Up and Wiping Out Sensitive Data: you probably have heard of this topic and perhaps even apply it to some extent, since taking care of sensitive data is a good cyber-security habit to have, plus it's not new either. However, there is much more to it than that, like which storage media are best for back-up, how you can handle sensitive data on your computer without leaving a trace, and what software is out there that helps make that happen.
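For the bias-variance trade-off mentioned in the first of these two videos, a quick simulation can make the idea concrete. The sketch below is not taken from the video. It is a toy regression setup made up for illustration (the function `true_f`, the noise level, and the two models are all assumptions): one model always predicts the mean of the training targets, the other copies the target of the nearest training point, and both are evaluated at a single test point over many noisy training sets.

```python
import random
import statistics


def true_f(x):
    # Assumed ground-truth function for this toy example.
    return x * x


def simulate(n_trials=2000, sigma=0.3, seed=0):
    """Estimate squared bias and variance of two toy models at x0 = 1.0."""
    rng = random.Random(seed)
    xs = [i / 10 for i in range(11)]  # fixed training inputs 0.0 .. 1.0
    x0 = 1.0                          # test point
    preds_mean, preds_nn = [], []
    for _ in range(n_trials):
        # Draw a fresh noisy training set on the same inputs.
        ys = [true_f(x) + rng.gauss(0, sigma) for x in xs]
        # Rigid model: ignore x and predict the mean training target.
        preds_mean.append(statistics.fmean(ys))
        # Flexible model: copy the target of the nearest training input.
        nearest = min(range(len(xs)), key=lambda i: abs(xs[i] - x0))
        preds_nn.append(ys[nearest])

    def summarize(preds):
        bias_sq = (statistics.fmean(preds) - true_f(x0)) ** 2
        variance = statistics.pvariance(preds)
        return bias_sq, variance

    return {"mean_model": summarize(preds_mean),
            "nearest_neighbor": summarize(preds_nn)}


results = simulate()
```

In this setup the mean model is consistently off target at the test point but barely changes from one training set to the next (high bias, low variance), whereas the nearest-neighbor model is right on average but inherits all the noise of a single observation (low bias, high variance).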
Why Articles on Social Media about Programming for Data Science Seem to Be Straight Out of a Time Capsule
Data science related topics sell, no doubt about that. This is great if you are interested in the field and want to learn more about it, especially practical things that can offer you some orientation in the field. Since programming is a key component of data science, it makes sense to pay attention to material along these lines, particularly if you are new to this whole matter.
How the Situation Is Today
Fortunately there is an abundance of articles on this topic, especially on social media. However, not everyone who writes such articles is up-to-date on this subject since many of these “expert” tech writers are not forward-thinking data scientists themselves. Best case scenario, they have spent a few minutes on the web, probably focusing on the results on the first page of a search engine for the bulk of their material. And shocking as it may be, this material may be geared more towards what’s more popular rather than what’s more accurate. Alternatively, they may have relied on what some data science guru once said on the topic, information that may no longer be particularly relevant. Apart from that, the writers who delve into the production of this sort of article (or infographics in some cases) have their own biases. Probably they took a programming course at university, so if a particular programming platform comes up in their “research” they may be more likely to highlight it. After all, this would make them appear knowledgeable since they have hands-on experience on that platform, even if it’s not that useful to data science any more. What’s more, many people who write about these topics don’t want to take risks with newer things. It’s much safer to mention languages that everyone knows about and which have a large community around them, than to mention newer ones that may be despised by the hardcore users of older coding platforms.
Hope for the Future
For better or for worse, an article on social media has a limited life span. After all, its purpose is mainly to get enough people to click on a particular link where a given site serves ads, so that the people owning the site can get some revenue from said ads. Therefore, if the article is forgotten in a week, its producers won’t lose any sleep over it. Books and subscription-based videos are not like that though. Neither are technical conferences. So, since the new trends rely more on these kinds of platforms to become well-known, they are not that much hindered by social media misinformation. After all, if a programming language is good, this is something that will eventually show, even if the fan-boys of the more traditional languages would sooner die than change their views on their favorite coding platforms.
What You Can Do
So, instead of getting swayed by this or the other “expert” with X thousand followers (many of whom are probably either bots or bought followers), you can do your own research. Check out what books are out there on the various programming languages and if they hint towards applicability in data science. Check out videos on Safari and other serious educational platforms. Look at what new language conferences are out there and how they cover data science related topics. And most importantly, try some of these languages yourself. This way you’ll have some more reliable data when making a decision on what language is most relevant and most future-proof in our field, rather than blindly believing whatever this or the other “expert” on social media says.
After investigating this topic quite a bit, as I was looking into A.I. stuff, I decided to create a video on it. To make it more complete, I included other methods too, such as Statistics-based and heuristics-based ones. Despite the excessive amount of content I put together into this project (the script was over 4000 words), I managed to keep the video at a manageable length (a bit less than half an hour). Check it out on Safari when you have some time!
Ever since social media (SM) became a mainstream option for spending one’s time on the web, it has started to disrupt the way we view information and even knowledge to some extent. Even though there is no doubt that SM offer substantial benefits in advertising and branding, there is little they can offer when it comes to actually learning something. Here is why.
Some articles can be thought-provoking, but consuming information to satisfy your curiosity and actually assimilating it are two different things. This is particularly true when it comes to a technical field, like data science, where being informed about something is barely enough to have an opinion on the topic, let alone do something useful with it. Many people who roam the SM in search of mentors don’t realize that. They tend to forget that following someone in an attempt to learn from them is the equivalent of body-building by just hanging out in the lobby of a gym. Yet, they do it anyway because it’s easy and it doesn’t cost them anything (other than some time, assuming that they read the stuff their leaders post on the SM).
If you really want to learn something, especially something complex and multifaceted like data science, you need to get your hands dirty and you have to break a sweat. The various things someone posts on the SM aren’t going to help much. There is a reason why books and videos on the subject sell, even if there is abundant information on the web. Also, in my experience, if a platform doesn’t charge you for the “products” it offers to you, that’s because you are the product! SM are designed with that in mind. Of course, some of them may be worth the time you spend on them since they can be a source of a diverse array of views on a topic (hopefully from different perspectives), but that’s not the same as applicable knowledge. If you want to hone your data science skills you need something you can rely on, not something someone types on the SM while enjoying their morning coffee, to pass the time.
So, what can you do, instead of following someone on the SM? There are various strategies, each with its own sets of benefits. Ideally, you would do a combination of them to maximize your learning opportunities. The main ones of these strategies are:
What are your thoughts on the matter? How do you learn data science?
For the past few months I've been working on a tutorial on the data modeling part of the data science process. I recently finished it and, as of 2 weeks ago, it is available online at the Safari portal. Although this tutorial is mainly for newcomers to the field, everyone can benefit from it, particularly people who are interested not just in the technical aspects but also in the concepts behind them and how it all relates to the other parts of the pipeline. Enjoy!
So, when I was in the US recently, I interviewed with some people from a Podcast geared towards SW engineering and data science topics (with some A.I. stuff too). This interview, which constitutes a whole episode on that podcast, covered various topics related to both data science as a field and some specific aspects of it that can help someone embrace it as a practitioner / professional in it. The podcast episode is now online and freely available. Although it's by no means a thorough coverage of the field of data science, or even the topic of the mindset related to it, it's a good introduction to it, engaging enough to keep your commute somewhat more interesting than listening to the radio. Enjoy!
“I have never let my schooling interfere with my education.” (quote believed to be originally by Mark Twain)
People talk about education a lot these days, particularly in a data science setting. However, we need to discern between actual education and training. Both are essential, but it is the former that holds the most value. The latter is easier and oftentimes faster, but it may not be a good investment of your time if it is not accompanied by the former.
Education is all about mindset development and the ability to feel inspired by knowledge, thereby developing a healthy yearning for it. It is what happens when you teach a child how to play a game, or do a specific task. Although it’s more of a state of mind than anything else, education also has a formal aspect to it which is related to courses, seminars, workshops and talks, geared towards enhancing one’s understanding and comprehension of the topic at hand.
Training on the other hand is more geared towards techniques, methods, and the technical details of the topic taught. This is useful, of course, since every data scientist needs to know all these things. That’s why there are so many data science books and videos out there! However, knowing how to build an SVM or a neural network doesn’t make someone a competent data scientist. In fact, in some cases it doesn’t make him even an employable one.
Perhaps there is a reason why most companies require X years of experience in their recruits. Some things in data science you can only learn through time, by practicing them and by developing an intuition for the data and how it is processed. Although the idea that a data scientist has to have X years of experience to be worthy is something that remains debatable (why X and not Y?), this trend shows that hiring managers can spot a difference between someone who knows data science from a book (or videos) and someone who knows the craft because she has worked the data and has developed a bunch of models, through lots of trials and the inevitable mistakes that ensue.
Education is therefore something that can be attained through experience, not just reading and watching data science material on the Safari platform. The latter can be a great start, but you still need to get your hands dirty and also think about the whole thing, instead of just following recipes, from a data science cookbook. It’s important to know techniques, no doubt, but unless you have developed an understanding that allows you to go beyond these techniques and explore alternative features and alternative models, you may never grow beyond the advanced beginner stage.
Even someone who has spent most of his life in data science can still learn about this field, as it's a) very diverse and wide-spread, and b) always evolving. Personally, I still find that I’m learning new things as I delve deeper into the field and as I converse with other data scientists and A.I. professionals, of all levels. This too can be a form of education, not any less valuable than the education of creating a new data analytics method, or a new data product. The moment someone starts looking down on education and thinks that he knows “enough” is the moment he begins becoming obsolete.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.