A few weeks ago I created a video on DB frameworks, from a data science perspective. Somehow it didn't get into the production pipeline, but now it surfaced and is available on the Safari platform. You can view it here. Enjoy!
There is a certain idea about this matter that I find particularly vexing and misleading, as it paints a very limiting picture of what data science is. There are people who have entered the field through Statistics, as there is a direct link between data science and Stats. However, for reasons of their own, these people tend to view data science as part of Statistics, or sometimes a branch of it. Let’s delve into this matter and clarify this complicated relationship between data science and Stats, before things get out of hand.
First of all, let’s get some definitions in place. Statistics is a sub-field of Mathematics involving the description and analysis of data, particularly numeric data, through a variety of models and processes. It is a very useful framework that is essential in data science. As for the latter, it is usually viewed as a new field, one that comprises several other fields, such as computer science, business, communication, and mathematical modeling. In other words, it’s an inter-disciplinary field that borrows from several other fields, in order to tackle complex problems that couldn't be solved otherwise.
As I have repeatedly stated in my books as well as many of my videos, a data scientist needs to know a variety of things, particularly programming. Statisticians usually focus on all-in-one platforms, like R, SAS, SPSS, etc. for their scripts. These are not the same as full-blown programming languages like Python, Julia, Scala, etc. that are usually used in data science. So, calling data science a part of Statistics is not only inaccurate, but also a sign that the person making the claim doesn't understand what data science entails.
Also, data science tackles a large variety of data types, including text. In fact, there are a lot of data scientists who focus primarily on text data, while there are various methodologies that aim to quantify text data, in a way that enables the analysis of a corpus using a mathematical model. Statistics is unable to tackle any data of this kind, even if data scientists oftentimes make use of Stats when analyzing the quantified text data.
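To make the idea of quantifying text a bit more concrete, here is a minimal Python sketch of one common approach, the bag-of-words representation, where each document becomes a vector of word counts. The two-sentence corpus and the helper names (tokenize, bag_of_words) are made up for this illustration, and real projects usually rely on more elaborate tokenization and weighting schemes.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(corpus):
    """Turn a list of documents into a vocabulary and a matrix of word counts."""
    tokenized = [tokenize(doc) for doc in corpus]
    # The vocabulary is the sorted set of all distinct tokens in the corpus.
    vocabulary = sorted({token for doc in tokenized for token in doc})
    matrix = []
    for doc in tokenized:
        counts = Counter(doc)
        # One row per document, one column per vocabulary word.
        matrix.append([counts[word] for word in vocabulary])
    return vocabulary, matrix


# Hypothetical toy corpus, only for illustration.
corpus = [
    "Data science borrows from statistics",
    "Statistics describes data",
]
vocabulary, matrix = bag_of_words(corpus)
print(vocabulary)
for row in matrix:
    print(row)
```

Once the text is in this numeric form, any mathematical model that accepts a numeric matrix can be applied to it.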
Moreover, Statistics tends to make use of certain models that are based on a number of assumptions about the distribution of the data, or some characteristics of it. Many data science methods don’t have any assumptions about the data. This allows for more versatile models that exhibit a more robust performance, oftentimes unattainable by statistical models. So, if someone claims that data science is part of Stats, they are probably oblivious to the Machine Learning and A.I. systems employed in data science.
Naturally, Statistics is useful in data science and there is no data science course out there that doesn't cover this useful framework in its syllabus. Every data scientist is expected to have a solid grasp of Statistics and use statistical methods in her work. However, relying on Stats exclusively is quite rare and often unproductive.
To sum up, Statistics is a great field that has a lot to offer to data science. However, data science is an inter-disciplinary field, borrowing from various areas, including but definitely not limited to Statistics. If you want to learn more about the various aspects of the data science craft and how you can enrich your know-how of it, feel free to check out my latest book, Data Science Mindset, Methodologies, and Misconceptions (Technics Publications). Then, even if you don’t share my view on this topic, at least you’ll be more aware of the complicated relationship between data science and Statistics.
Recently I read an article on Pulse (LinkedIn) that was talking about what to look for when hiring a statistician. This shocked me for two reasons. 1. The role of the statistician is becoming obsolete, giving way to that of the data scientist and the A.I. professional. 2. If you need help hiring someone in a profession that has been around for over a century, then no article can help you, no matter how well-written it is.
I have no opinion on the article itself. I’m sure its author meant well and that he did his research prior to writing it. I do have a view on the whole matter though and how oftentimes the market is lagging in its understanding of what data analytics entails. So, let me clarify some things on this, since I've worked in this field for the largest part of my career.
Data analytics involves various sub-fields, such as statistics, business intelligence, data science, and modern A.I.-based predictive analytics. Statistics is a great tool, something that every self-respecting data scientist needs to know (though its usefulness is not limited to data science). However, a statistician is an overly specialized professional who relies on Stats primarily for his analyses. This is like someone who is a professional traveler (say a blogger of sorts specializing in touristic destinations and such), who only uses his car to go to the various countries he visits. Of course driving can get you to many places, depending on where you start from, but this way you miss out on all the islands, as well as Australia. There is nothing wrong with using your car to go to places, but if you want to be a professional traveler, you need to use other modes of transport too (otherwise you’ll never get to Victoria, BC, which is awesome, especially at this time of the year).
So, if you want to get a good story on the various beautiful spots someone can go to in the Hawaiian Islands, the aforementioned overly specialized professional traveler may not be able to deliver. However, someone who is more versatile and doesn’t mind using a plane once in a while can get you that story and do so fairly quickly (especially if he is based on the West Coast). That latter professional traveler is the equivalent of a data scientist.
Nowadays, the world needs people who are versatile and comfortable with technology. In the previous example, they need someone who not only drives a car, but also knows how it works and is able to fix it if it doesn’t drive well. The statistician may know how to drive various vehicles (statistical models), but is usually unable to do more than drive. A data scientist, on the other hand, is quite comfortable with all kinds of modes of transport and can even build one from scratch (given enough expertise, of course). So, if you were to hire someone to get that data to talk, who are you going to go with, the overly specialized statistician or the versatile data scientist?
Lately I've been looking into cyber security, as it is a field that is very useful to know about, regardless of one's profession. As in data science we often deal with sensitive data, I found it useful to be able to apply certain cyber security principles to ensure the data at hand remains secure. As I've researched this matter enough to have something useful to share with the world, I decided to create a video on it, which is now available on the Safari platform here (you'll need a subscription to the Safari platform in order to view it). This is by no means enough to make you an expert in network security, but it's a good starting point. Enjoy!
Lately I've been thinking about A.I. and Statistics a lot (you could say that the amount of time spent thinking about these topics is significantly higher at alpha = 0.05!). This is partly because my Stats article managed to get more traction than any other article I've written in the past few months, and partly because A.I. is becoming more and more relevant in our field. So, the question of whether A.I. is one day going to replace Stats altogether remains a very relevant one.
The key advantage of A.I. methods is that they are assumption-free. This by itself enables them to tackle the problems they are aiming to solve, in a very methodical and efficient way. Of course, certain assumptions might speed things up, but they might obstruct the discovery of the optimal solutions to the problem at hand. Statistical inference models lost the war against machine learning models because of that, especially when artificial neural networks (ANNs) entered the scene. Also, the fact that many ML models could be combined in an ensemble setting allowed them to become even more robust, attaining F1 scores that were unfathomable for statistical prediction models. So, the possibility of other methods of statistics becoming outsourced to alternative systems is quite real.
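For readers who haven't seen an ensemble or an F1 score in practice, the following is a minimal Python sketch of majority voting across three classifiers, with F1 computed as the harmonic mean of precision and recall. The labels and the three sets of predictions are made up for illustration and deliberately arranged so that the models err on different cases; an ensemble is not guaranteed to help when its members make the same mistakes.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine the label predictions of several models by majority vote."""
    combined = []
    for votes in zip(*predictions_per_model):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


def f1_score(y_true, y_pred, positive=1):
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Made-up labels and predictions; each model is wrong on two different cases.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [1, 0, 0, 1, 0, 1, 1, 0]
model_b = [0, 0, 1, 1, 0, 1, 0, 1]
model_c = [1, 1, 1, 0, 0, 1, 0, 0]

ensemble = majority_vote([model_a, model_b, model_c])
print("model A F1:", f1_score(y_true, model_a))
print("ensemble F1:", f1_score(y_true, ensemble))
```

In this contrived case the individual errors cancel out in the vote, which is the intuition behind ensembles being more robust than their members.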
On the other hand, statistics are very easy to use and interpret, since most of them were designed from a user’s perspective. There are doctors out there (the medical kind), who don’t know much about data analytics but can easily work a statistical model for figuring out if a certain drug has a positive influence on certain patients, and derive some scientific conclusions based on that. That doctor may not be able to write a script to save his life, but he can make use of the data he gathers and advance his scientific field, using just statistics. It’s quite unlikely that this kind of person, who is usually too busy or just not technically adept enough, will take up an A.I. approach to this kind of analysis any time soon.
Of course, A.I. constantly evolves, so the black-box issue that makes many ANN-based systems unfavorable may wane in the future. Already there are A.I. professionals talking about A.I. systems that offer some kind of interpretability. So, even if statistical systems are easier to understand and communicate, it could be that A.I. hasn't said its final word yet.
Whatever the case, I prefer to remain agnostic on this matter. Just like with programming, it’s best to keep one’s options open, when it comes to data science. I’m not a fan of statistics (and never was), but I see value in them and I’m happy to use them to the extent that they offer value to the projects I work on. A.I. may be more of a novel and exciting framework, but if an A.I. system is hard to communicate to the client, or doesn't lend itself to interpretation, then I may not use it everywhere. Just like you don’t take your fancy fringe science book to the beach, you don’t need to show off your A.I. know-how at every opportunity. Perhaps the humble historic novel is more suitable for reading while sunbathing, just like the humble statistics are fine for describing if sample A is significantly different from sample B.
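As an illustration of how humble such a comparison of two samples can be, here is a minimal Python sketch of an exact permutation test on the difference of means. This particular test is chosen here only because it needs nothing beyond the standard library; it is not necessarily the test one would pick in a given project. The two samples are made-up numbers, and the function name is hypothetical.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_test(sample_a, sample_b):
    """Two-sided exact permutation test on the difference of means.

    Returns the fraction of all possible regroupings of the pooled data
    whose absolute mean difference is at least as large as the observed one.
    """
    pooled = list(sample_a) + list(sample_b)
    n_a, n_total = len(sample_a), len(pooled)
    observed = abs(mean(sample_a) - mean(sample_b))
    total = sum(pooled)
    extreme = 0
    count = 0
    # Enumerate every way of choosing which values form "group A".
    for indices in combinations(range(n_total), n_a):
        sum_a = sum(pooled[i] for i in indices)
        diff = abs(sum_a / n_a - (total - sum_a) / (n_total - n_a))
        if diff >= observed - 1e-12:  # small tolerance for float rounding
            extreme += 1
        count += 1
    return extreme / count


# Made-up measurements, only for illustration.
sample_a = [1, 2, 3, 4]
sample_b = [6, 7, 8, 9]
p_value = exact_permutation_test(sample_a, sample_b)
print("p-value:", p_value)
print("significant at alpha = 0.05:", p_value < 0.05)
```

Since the enumeration grows combinatorially, this exact version is only practical for small samples; larger ones would call for random resampling or a classical test.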
Recently I had a nice chat with a fellow data scientist who works at LinkedIn. After bouncing some ideas off him, I decided to make another video, based on a topic of mutual interest, partly to demonstrate to him how straightforward the process is, once you have done the research on the topic. This video is now published on Safari here (subscription required). Enjoy!
With so many options for publishing videos online nowadays, someone may wonder “why would I want to jump through hoops to get something published on Safari?” This is a valid question, and it’s equivalent to asking “why should I get published through a publishing house when I can self-publish on Amazon, or some other platform?” Although there is merit in self-publishing, there are two main issues with it: quality assurance (QA), and marketing.
Before I get into the details of all this, let me inform you that I've been down the self-publishing path and it wasn't as glamorous as people make it out to be. I published not just 1, but 3 e-books, created a website for them, and even hired people to help promote them. A few years later the only real benefit I've seen through all this was the experience I’d gained through the whole process. So, if this is your sole motivation, that’s fine. If, however, you want to make enough money to make the whole thing worthwhile, then there are better options out there.
Getting published on Safari (or any other professional video platform) ensures a certain quality standard. Of course not all videos there are great, but at least you won’t find many that are a total waste of time or riddled with inaccurate information, like you would on YouTube, for example. The reason is that for a video to get on the Safari site, it first goes through some QA process. If there is an issue about it, you will need to revise it. This doesn't happen often, if you know what you are doing, but it’s a good fail-safe.
Marketing is another matter where platforms like Safari excel. If something is on Safari, people will see it and may watch/read it. If you have a video on YouTube, few people will notice it and even fewer will watch the whole thing. Especially now with the new strict policies that YouTube has adopted, content creators have it hard. Unless you create a lot of content regularly, your exposure on YouTube is bound to be very limited. Of course, if you create a lot of content, the quality is bound to drop, but YouTube doesn't seem to care much about this. As long as they get lots of people watching the videos they host, and keep the ad money rolling, they are fine. And if your vid gets flagged because some oversensitive person finds it problematic for whatever reason, that’s your problem, not YouTube’s.
I’m not trying to say that YouTube is bad. Every video hosting platform has its use cases. However, for quality content that you expect to at least pay for the effort you've put into creating it, a more professional platform like Safari makes more sense. You can create a promo video and put it on YouTube, or Vimeo. But if you spend a week creating a data science or A.I. video, you are better off publishing it through proper channels, like Safari.
To give you an idea of the profits that a Safari video can yield, last year I published a book. I spent about 9 months writing it and editing it. It was considered successful and helped me get some traction in the field, while also promoting the programming language it was about. One of the videos I created and published for Safari yielded about the same revenue. It had taken me about a week to create it and edit it, while I also enjoyed it more, since it felt more like a creative endeavor, rather than work. Since I don’t have a huge following, I doubt that the same video could yield the same revenue if it were published on YouTube or some other open platform.
If you find that you have content you wish to share with the world, in a professional manner, I’d recommend you consider Safari as an option. If you find that it entails too much work and you are unsure as to where you need to start, you can always go through a publisher, like Technics Publications, like I did. As Nelson Mandela eloquently said, “it always seems impossible until it's done.”
Recently someone on LI recommended that I bring more JOY to the world instead of merely complaining about it (I wasn’t complaining but apparently she thought I was!). I’m not an entertainer, nor a psychology expert, but perhaps you don’t need to be in these lines of work in order to bring joy to the people you interact with. I thought about it and decided that perhaps data science could be a source of joy to other people. However, for this to happen, it needs first and foremost to be joyful to you.
Deriving joy from a challenging and oftentimes frustrating procedure such as a data science project is not easy. In fact, many people can’t stand the largest part of the work such a project entails. However, with the right mindset, even the more tedious aspects of the work can be enjoyable (i.e. be conducive to joy). So, what is this mindset that turns boredom to beauty and drudgery to delight?
Although there is no magic formula for making things more enjoyable in data science, if you have the attitude of the data science amateur when you approach a problem, your chances of enjoying it are better. This doesn’t mean being sloppy and checking Stack Overflow or Quora every 5 minutes. The amateur’s attitude is, as the word amateur implies, an attitude based on love for what you are doing. The amateur doesn’t care if they get paid for their work. They may even never get paid, but they do it anyway because they find it fulfilling. It’s like a hobby for them.
However, a data scientist still needs to be professional about her work. There are deadlines, meetings with stakeholders, and of course debugging scripts that throw errors at the worst possible time! Handling these matters takes professionalism, but it doesn’t need to be a mechanical and draining process. If you see part of your work as a data scientist (even the debugging stage) as a learning experience and have what is known in Zen as the beginner’s mind, you are bound to find everything a bit more enjoyable. It’s the joy that comes from detachment and lack of rigid expectations from your work, something that every professional knows.
Remembering all this, especially on a Monday morning, is not as straightforward as it may seem when you think of it. However, being joyful is a matter of perspective and at the end of the day a matter of habit. Aristotle famously said that “virtue is a matter of habit” and some could argue that joy is a kind of virtue. Maybe not something you would put on your resume or talk about in an interview, but definitely something worth keeping in mind in those long mornings when you may be tempted to question your career choices. After all, if you could be joyful about data science as a field once, you can be joyful about data science work too. And if you still feel that you need some help to get your enthusiasm flowing, invigorating a joyful mindset, you can always read my book Data Science – Mindset, Methodologies, and Misconceptions. :-)
After several days being in limbo, the video "Remaining Relevant in Data Science" that I've made recently, is now online on Safari (link). If you have a subscription to that platform, do check it out. If you prefer to access this kind of knowledge through a different medium, feel free to check out the last chapter of my latest book, Data Science Mindset, Methodologies, and Misconceptions. Enjoy!
When people nowadays talk about A.I., they usually refer to the deep learning methodology and other ANN frameworks. This is great, considering that ANNs were almost considered a dead-end once, due to the inability of technology to help them exhibit their potential. Yet, now computers are more powerful than ever and GPUs are commonplace as add-ons, enabling deep learning and other ANN-based systems to function at greater scales. However, there are some other A.I. methodologies that are equally valid and actually predate ANNs. These I refer to as the “hipsters of A.I.” since they were part of the A.I. field before A.I. was cool.
The A.I. hipster methodologies are A.I. frameworks that are not ANN-related. These are systems like Fuzzy Logic (FL), which came about years before ANNs reached a level of development that made them worth using in machine learning. FL systems were used heavily in data analytics, while they were even implemented in hardware. At one point, researchers even experimented with a hybrid system that is part FL and part ANN (this was called ANFIS, short for Adaptive Neuro-Fuzzy Inference System, and was in essence an artificial neural network that optimized the membership functions of a Fuzzy Inference System).
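To give a feel for what membership functions and a fuzzy inference system look like, here is a minimal Python sketch. It uses triangular membership functions and a simple weighted-average rule base (a zero-order Sugeno-style system), which is just one of several possible designs. The temperature ranges, fan speeds, and function names are all made-up assumptions for illustration, and this is not ANFIS, since nothing here is learned from data.

```python
def triangular(x, left, peak, right):
    """Degree (between 0 and 1) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)
    return (right - x) / (right - peak)


# Hypothetical fuzzy sets for room temperature in degrees Celsius.
TEMPERATURE_SETS = {
    "cool": (0.0, 10.0, 20.0),
    "warm": (15.0, 22.0, 30.0),
    "hot": (25.0, 35.0, 45.0),
}

# Hypothetical rule base: IF temperature is <set> THEN fan speed is <value> (%).
FAN_SPEED_RULES = {"cool": 20.0, "warm": 50.0, "hot": 90.0}


def fan_speed(temperature):
    """Weighted average of the rule outputs, weighted by membership degree."""
    weights = {
        name: triangular(temperature, *params)
        for name, params in TEMPERATURE_SETS.items()
    }
    total_weight = sum(weights.values())
    if total_weight == 0.0:
        return 0.0  # no rule fires outside the covered temperature range
    weighted_sum = sum(weights[name] * FAN_SPEED_RULES[name] for name in weights)
    return weighted_sum / total_weight


print(triangular(18.5, 15.0, 22.0, 30.0))  # membership of 18.5 in "warm"
print(fan_speed(18.5))                     # blends the "cool" and "warm" rules
```

The appeal of this kind of system is that a temperature can be partly "cool" and partly "warm" at once, and the output blends the corresponding rules instead of switching abruptly between them.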
Another hipster methodology is the family of optimization methods. These are systems like Genetic Algorithms, Simulated Annealing, and Particle Swarm Optimization (as well as its many variants). Although the scope of these A.I. fields is limited to finding optima of particular functions (aka fitness functions), their usefulness covers a variety of fields. Even dimensionality reduction processes sometimes make use of GAs or some other optimization tool. Note that these systems are not the same as the analytical optimization methods known from Calculus, since they tackle very complex search spaces, with oftentimes dozens of variables, and use a stochastic process in the back-end.
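As a rough illustration of the stochastic process behind such optimizers, here is a minimal Python sketch of simulated annealing minimizing a toy fitness function (the sum of squares, whose optimum is at the origin). The cooling schedule, step size, number of steps, and starting point are arbitrary assumptions picked for this example, not recommended settings.

```python
import math
import random


def sphere(point):
    """Toy fitness function: sum of squares, minimized at the origin."""
    return sum(x * x for x in point)


def simulated_annealing(fitness, start, steps=5000, start_temp=1.0,
                        cooling=0.999, step_size=0.3, seed=0):
    """Minimize `fitness` by random perturbations with a cooling temperature."""
    rng = random.Random(seed)  # seeded so that runs are repeatable
    current = list(start)
    current_val = fitness(current)
    best, best_val = list(current), current_val
    temp = start_temp
    for _ in range(steps):
        # Propose a random neighbor of the current point.
        candidate = [x + rng.gauss(0.0, step_size) for x in current]
        candidate_val = fitness(candidate)
        delta = candidate_val - current_val
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature drops (Metropolis criterion).
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current, current_val = candidate, candidate_val
            if current_val < best_val:
                best, best_val = list(current), current_val
        temp *= cooling  # geometric cooling schedule
    return best, best_val


start = [3.0] * 5
best_point, best_value = simulated_annealing(sphere, start)
print("start fitness:", sphere(start))
print("best fitness found:", best_value)
```

Accepting a worse solution every now and then is what lets this kind of method escape local optima in more rugged search spaces, something a purely greedy search cannot do.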
If there is one take-away from these hipster A.I. systems it is that there is more than meets the eye when it comes to artificial intelligence. That’s not to say that deep learning systems are not worth your while, but it’s good to keep an open mind about other A.I. systems that may not be as popular today, but may have played (and still play) an important role in the evolution of the field.
Also, having a solid understanding of A.I. through its various methodologies, allows us to be able to think forward in a creative way. Instead of merely trying to extend the methodologies we know, we may come up with new ones, enriching A.I. in ways that we wouldn't be able to fathom if our understanding were limited to a single A.I. framework. Isn't that what A.I. is about, finding novel ways to solve problems, leveraging clever heuristics and imaginative architectures?
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.