When I started my life-long journey in the world of data analytics (which morphed into Data Science and modern AI-based predictive analytics systems), it was through academia. I even did a post-doc at one point, which, although it paid the bills, was the worst-paying job I’ve ever had in my career. Yet, as long as there were things to learn and challenges to overcome, I was willing to see past that.
As I matured, I realized that the only thing that mattered in that strange world, if you were to have a career in it, was publications. As I enjoyed writing, I gave it a shot. However, the needlessly long waiting time for any feedback, the low quality of that feedback, and the overall time it took for something to get published, put me off eventually. After that, I decided to pursue a career, any career, in the real world, as at least there you find more meritocracy and shorter waiting times, enabling much faster growth.
A few months ago, I was approached by a big-time academic publishing house for an article in their encyclopedia of big data. I was surprised to see that after so many years they had come to be more progressive about the whole publications business. As the topic was right up my alley, I decided to accept their offer. At the time I felt that this would be my way of giving back to the data science programming community. I only asked that the companies I work with get mentioned in the article, so that they could at least justify my being distracted by this project. The academic publisher accepted and said that these companies would be mentioned as my affiliations. I even provided their location details afterwards, so that they would be represented fully.
Months later, I got some feedback, some really minor corrections, which I took care of promptly. Finally, last month the article was published. I was pleased, for a couple of minutes, until I realized that the affiliations were all screwed up. To this day I am not sure how this could happen. It would take a whole new level of incompetence to mess up such a simple task, more than I was used to seeing throughout my academic life. Of course, mistakes happen and, since I’m not perfect either, I politely asked for corrections on this part of the article. I had to do this twice, since the first time they must have forgotten about it (apparently these corrections were not a priority for them). To this day, the article remains uncorrected, since clearly this 2-minute task is just too much for them to handle, or perhaps there isn’t much of a motivation.
If there was ever a slight chance of me working in an academic setting again, e.g. by writing articles like that one or academic papers, it is gone now, as this event proved what a colossal waste of time it is to work with this sort of bureaucracy. Perhaps for you it’s different, because you have higher tolerance or lower self-esteem (or maybe both) and you can put up with these clowns. However, if you are at a crossroads in your career in our field, be sure to explore your options wisely before being tempted to settle for an academic publication gig. More often than not, it will not be worth your time, while all the other alternatives will be more rewarding.
Randomness, Uncertainty, Complex Systems, and Applications Video Now Online + Shout-out to a Viewer of This Blog
This past week I decided to do a video on an experimental topic, involving different fields, an interdisciplinary topic if you will. I understand the risks of such a video, since randomness is not a particularly easy subject, while complex systems are a bit of a niche field. However, I tried to bring a more intuitive approach to all this and introduce a new feature for such videos: mini-quizzes, so that you can test your understanding while you watch the video. Anyway, feel free to check out this introductory video on the topic by visiting the corresponding Safari page. Warning: some of the material covered in this video veers away from conventional approaches to the topic. Also, the video is very light on the math aspect of the topic, as otherwise it would be too long, and it's already over 30 minutes in length...
Also, recently a viewer of this blog, S.M., contacted me with some suggestions on how to tackle certain typo-related issues he had found. Big thanks to S.M. for his contribution!
Bias-Variance Trade-Off for Data Science & Backing Up and Wiping Out Sensitive Data Videos Are Now Online
This past week I've had some time off work, as my CEO was on vacation. As a result, I made two videos, not just one. Here they are:
The Bias-Variance Trade-Off: when you have a model that favors a certain class or a certain set of values, you have high bias, while when you have a model whose predictions are all over the place, you have high variance. Can you find a compromise between the two? And how does all this relate to the model's fitness? This video includes a few examples too, for both classification and regression problems, to cement the concepts introduced.
Backing Up and Wiping Out Sensitive Data: you probably have heard of this topic and perhaps even apply it to some extent, since taking care of sensitive data is a good cyber-security habit to have, plus it's not new either. However, there is much more to it than that, like which storage media are best for back-up, how you can handle sensitive data on your computer without leaving a trace, and what software is out there that helps make that happen.
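For the curious, the bias-variance trade-off can be seen in just a few lines of code. The sketch below is my own minimal illustration (it's not taken from the video), assuming you have NumPy and scikit-learn installed: it fits polynomials of increasing degree to noisy data and compares training versus test error.

```python
# Minimal bias-variance illustration: a low-degree model underfits (high bias),
# a very high-degree one overfits (high variance). Assumes NumPy + scikit-learn.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 40)  # noisy sine wave
X = x.reshape(-1, 1)
X_train, X_test = X[::2], X[1::2]  # alternate points for train/test
y_train, y_test = y[::2], y[1::2]

errors = {}
for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    errors[degree] = (
        mean_squared_error(y_train, model.predict(X_train)),
        mean_squared_error(y_test, model.predict(X_test)),
    )
    print(f"degree={degree:2d}  train MSE={errors[degree][0]:.3f}  "
          f"test MSE={errors[degree][1]:.3f}")
```

The straight line (degree 1) has high error on both sets (bias), while the degree-15 polynomial chases the noise, doing well on the training points but poorly on the unseen ones (variance); the middle ground is where the trade-off pays off.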
It seems like yesterday when I came up with this encryption system, which I even wrote about on this blog. I never expected to create a video on it, but what better way to share it with the world, or at least its core aspects. As there is no reason why I'd consider my implementation of this idea the best possible one, I leave viewers to experiment on their own on that matter, after I explain each aspect of the method and showcase a couple of examples of it. Anyway, check out the video on Safari when you get the chance and let me know here what you think of it. Enjoy!
Why Articles on Social Media about Programming for Data Science Seem to Be Straight Out of a Time Capsule
Data science related topics sell, no doubt about that. This is great if you are interested in the field and want to learn more about it, especially practical things that can offer you some orientation in the field. Since programming is a key component of data science, it makes sense to pay attention to material along these lines, particularly if you are new to this whole matter.
How the Situation Is Today
Fortunately, there is an abundance of articles on this topic, especially on social media. However, not everyone who writes such articles is up-to-date on the subject, since many of these “expert” tech writers are not forward-thinking data scientists themselves. In the best-case scenario, they have spent a few minutes on the web, probably focusing on the first page of a search engine's results for the bulk of their material. And shocking as it may be, this material may be geared more towards what’s popular rather than what’s accurate. Alternatively, they may have relied on what some data science guru once said on the topic, information that may no longer be particularly relevant.

Apart from that, the writers who delve into the production of this sort of article (or infographics, in some cases) have their own biases. They probably took a programming course at university, so if a particular programming platform comes up in their “research” they may be more likely to highlight it. After all, this would make them look knowledgeable, since they have hands-on experience with that platform, even if it’s not that useful to data science any more. What’s more, many people who write about these topics don’t want to take risks with newer things. It’s much safer to mention languages that everyone knows about and which have a large community around them, than to mention newer ones that may be despised by the hardcore users of older coding platforms.
Hope for the Future
For better or for worse, an article on social media has a limited life span. After all, its purpose is mainly to get enough people to click on a particular link where a given site serves ads, so that the people who own the site can get some revenue from said ads. Therefore, if the article is forgotten in a week, its producers won’t lose any sleep over it. Books and subscription-based videos are not like that though. Neither are technical conferences. So, since the new trends rely more on these kinds of platforms to become well-known, they are not that hindered by social media misinformation. After all, if a programming language is good, this will eventually show, even if the fan-boys of the more traditional languages would sooner die than change their views on their favorite coding platforms.
What You Can Do
So, instead of getting swayed by this or the other “expert” with X thousand followers (many of whom are probably either bots or bought followers), you can do your own research. Check out what books are out there on the various programming languages and whether they hint at applicability in data science. Check out videos on Safari and other serious educational platforms. Look at what new language conferences are out there and how they cover data science related topics. And most importantly, try some of these languages yourself. This way you’ll have more reliable data when deciding which language is most relevant and most future-proof in our field, rather than blindly believing whatever this or the other “expert” on social media says.
So, the NLP Fundamentals video I made recently is online as of today (you can find it on the Safari site). Note that since Natural Language Processing is a very broad subject, it is quite hard to do it justice in a single video. However, for someone needing a good introduction to it, this video should be fine. Enjoy!
A few months ago, I wrote a blog post on Artificial Emotional Intelligence, a kind of A.I. that emulates the EQ aspects of our mental process. Of course this technology is still limited to emulating basic aspects of the human emotional spectrum, focusing mainly on comprehending emotion through text data. Nevertheless, this can still add a lot of value to an organization, as in the case of ZimGo Polling, an initiative to predict the outcome of an election, prior to the counting of the votes.
BPU Holdings, the company behind this ambitious yet quite down-to-earth initiative, makes use of an advanced NLP system, employing some specialized AI systems (interestingly, I’m currently in the process of creating a video on the topic of NLP, for Safari!). Also, the data for this endeavor stems from social media, something that ensures abundance as well as freshness, both key factors in making good predictions about this sort of trend.
Since part of my job at DSP Ltd. is staying up-to-date on the latest and greatest trends in data science and A.I., when a representative of BPU Holdings approached me with their AEI ideas, I took the opportunity to learn more about this field and about what they were doing. That’s how the previous blog post I mentioned came about. Last month, I had the opportunity to talk to the people of this company directly, learning more about their work and AEI’s promise of bringing more value to organizations around the world, through this intriguing niche. After looking into this matter a bit, I became convinced that this company may actually be on to something.
The case study presented to me involved the S. Korean elections, where this AEI system managed to predict the results with impressive accuracy. Of course, the company doesn’t plan to rest on its laurels, as there are already plans to apply this new approach to data analytics in other areas, such as the US elections. You can read more about this, as well as the company’s offerings, in the attached press release as well as on its website.
Note that I am not affiliated with this company, so if I appear a bit biased towards it, that’s because I favor the use of A.I. for such initiatives, rather than other, more aggressive applications, such as those in the military. After all, if there is one thing that I hope has come across from all my postings on this topic, it is that A.I. can be a positive tech, bringing about value to everyone, not just some multinational conglomerates that may not always use it wisely. Also, instead of blindly following this or the other A.I. expert on social media, I prefer to take a more active approach to this matter by directly connecting with the people involved and providing them with feedback on this tech as they develop it. That’s why this blog is the first one worldwide to publicly announce this company’s initiative and bring AEI to the conversation table.
What are your thoughts / emotions on it? Feel free to share them with me, either through this blog or via a direct message.
Even though this topic may be a bit polarizing, especially among people who are new to data science, knowing more about it can be very useful, particularly if you value a sense of perspective more than a good grade in some data science crash course. The latter is bound to overemphasize either Stats or AI, depending on the instructor's knowledge and experience. However, some data science professionals, myself included, prefer a more balanced approach on the topic. This is the reason why I decided to make this video, which is now available on Safari for your viewing.
Note that this is by no means a complete tutorial on the topic, but it is a good overview of the various aspects of statistics related to data science, along with some programming resources in both Python and Julia, to get you started. Enjoy!
Recently I decided to spice things up a bit and experiment with a new, fresher approach to videos. As a result, I played around with graphics more, in an effort to go for a more intuitive presentation of the topic I looked at, namely sampling (check out the video here). Not all videos that ensue are going to be like that, but I’m definitely going to look into more interesting ways of tackling the graphics part.
This kind of video production takes a lot of work though, and as I haven’t done graphic design in years, I’m a bit rusty, so such a project takes a considerable amount of time. At the same time, I need to keep promoting my stuff online, and one of the strategies I’ve found quite effective is through articles on beBee. As a result, I won’t be posting articles on my blog that often. However, if someone is interested in contributing to it, I’d be happy to consider guest authors on Data Science, A.I., Cyber-security, Programming, and other relevant topics.
A.I. and ML are often used interchangeably, while many people consider one to be a subset of the other (which one is the bigger set depends on who you ask). However, things may not be as clear-cut as they seem, since the communities of these two fields are not all that related, while there is a sort of rivalry among the hard-core members of each of them. Why is that, though, if A.I. and ML are so similar to each other, enough to confuse even data scientists?
First of all, let’s start with some definitions. A.I. is the group of methods, algorithms, and processes that bring about computer systems emulating human intelligence, even if the intelligence they usually exhibit is quite different from our own. Also, these systems often take the form of self-sufficient machines, such as robots, as well as agent programs that roam the Internet or cyberspace in general. ML, on the other hand, is the group of methods, algorithms, and processes that bring about computer systems that solve some data analytics problem in an efficient manner, through some training procedure (the “learning” part of machine learning). The latter can take place with the help of some specific outcomes (aka targets) or without them. Also, the training can take the form of feedback on the system’s predictions, which is like on-the-job training of sorts.
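To make the "with or without targets" distinction concrete, here is a minimal sketch of the two main training modes, using scikit-learn and its iris dataset (my own illustrative choices, not anything tied to a particular A.I. or ML school):

```python
# Supervised vs. unsupervised learning in a nutshell, using scikit-learn.
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Supervised: the model trains on features X *and* known targets y.
clf = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("supervised training accuracy:", round(clf.score(X, y), 3))

# Unsupervised: the model sees only X and discovers structure (clusters) on its own.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("cluster sizes found:", np.bincount(km.labels_).tolist())
```

The feedback-driven mode mentioned above (reinforcement learning) doesn't fit in a couple of lines, but it follows the same spirit: the system adjusts itself based on how well its predictions or actions turn out.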
Clearly, there is a close link between ML and data science, since ML systems are designed for this sort of problem. A.I. systems, on the other hand, may tackle different kinds of problems too (e.g. finding the optimal route given some restrictions). So, there is a part of A.I. that is leveraged in data science and a part of A.I. that has nothing to do with our craft. The part of A.I. that is used in data science has a large intersection with ML, mainly through network-based systems, such as ANNs. Lately, Deep Learning networks, which are specialized and more sophisticated kinds of ANNs, have become quite popular and are also part of that intersection between A.I. and ML.
Many people who work in A.I. consider it more of a science than ML, and they are right, in a way. Most ML methods are heuristics-based and don’t have much theory behind them, while the ones that are tied to Stats (statistical and ML hybrids) are heavily restrained by the assumptions that Stats theory carries. A.I. methods are generally data-driven too, but they are also related to processes found in nature, so they don’t come out of the blue.
Nevertheless, a data scientist who is professional and pragmatic doesn’t put too much emphasis on the differences between A.I. and ML methods, caring more about how they can be applied to solve the problems at hand. So, even if these two families of methods are not the same, nor is one a subset of the other, they are both very useful, if not essential, in practical data science.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.