Lately I've been looking into cyber security, as it is a field that is very useful to know about, regardless of one's profession. As in data science we often deal with sensitive data, I found it useful to be able to apply certain cyber security principles to ensure the data at hand remains secure. As I've researched this matter enough to have something useful to share with the world, I decided to create a video on it, which is now available on the Safari platform here (you'll need a subscription to the Safari platform in order to view it). This is by no means enough to make you an expert in network security, but it's a good starting point. Enjoy!
Lately I've been thinking about A.I. and Statistics a lot (you could say that the amount of time spent thinking about these topics is significantly higher at alpha = 0.05!). This is partly because my Stats article managed to get more traction than any other article I've written in the past few months, and partly because A.I. is becoming more and more relevant in our field. So, the question of whether A.I. is one day going to replace Stats altogether remains a very relevant one.
The key advantage of A.I. methods is that they are assumption-free. This by itself enables them to tackle the problems they are aiming to solve, in a very methodical and efficient way. Of course, certain assumptions might speed things up, but they might obstruct the discovery of the optimal solutions to the problem at hand. Statistical inference models lost the war against machine learning models because of that, especially when artificial neural networks (ANNs) entered the scene. Also, the fact that many ML models could be combined in an ensemble setting allowed them to become even more robust, attaining F1 scores that were unfathomable for statistical prediction models. So, the possibility of other methods of statistics becoming outsourced to alternative systems is quite real.
On the other hand, statistics are very easy to use and interpret, since most of them were designed from a user’s perspective. There are doctors out there (the medical kind), who don’t know much about data analytics but can easily work a statistical model for figuring out if a certain drug has a positive influence on certain patients, and derive some scientific conclusions based on that. That doctor may not be able to write a script to save his life, but he can make use of the data he gathers and advance his scientific field, using just statistics. It’s quite unlikely that this kind of person, who is usually too busy or just not technically adept enough, will take up an A.I. approach to this kind of analysis any time soon.
Of course, A.I. constantly evolves, so the black-box issue that makes many ANN-based systems unfavorable may wane in the future. Already there are A.I. professionals talking about A.I. systems that offer some kind of interpretability. So, even if statistical systems are easier to understand and communicate, it could be that A.I. hasn't said its final word yet.
Whatever the case, I prefer to remain agnostic on this matter. Just like with programming, it’s best to keep one’s options open, when it comes to data science. I’m not a fan of statistics (and never was), but I see value in them and I’m happy to use them to the extent that they offer value to the projects I work on. A.I. may be more of a novel and exciting framework, but if an A.I. system is hard to communicate to the client, or doesn't lend itself to interpretation, then I may not use it everywhere. Just like you don’t take your fancy fringe science book to the beach, you don’t need to show off your A.I. know-how at every opportunity. Perhaps the humble historic novel is more suitable for reading while sunbathing, just like the humble statistics are fine for describing if sample A is significantly different from sample B.
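To make the "is sample A significantly different from sample B" question a bit more concrete, here is a minimal sketch of a permutation test on the difference of means, using only Python's standard library. This is just one of several ways to answer that question and not necessarily the test you would pick in practice; the two samples, the function name, and the number of permutations are all made up for illustration.

```python
import random


def permutation_test(sample_a, sample_b, n_permutations=10000, seed=0):
    """Two-sided permutation test on the difference of the sample means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_a = len(sample_a)
    pooled = list(sample_a) + list(sample_b)

    def mean_gap(values):
        # Absolute difference between the means of the two groups
        first, second = values[:n_a], values[n_a:]
        return abs(sum(first) / len(first) - sum(second) / len(second))

    observed = mean_gap(pooled)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # randomly reassign the group labels
        if mean_gap(pooled) >= observed:
            extreme += 1
    # Add-one correction keeps the estimated p-value strictly above zero
    return (extreme + 1) / (n_permutations + 1)


# Hypothetical measurements, invented for this example
sample_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.4, 5.7, 5.2]
sample_b = [4.2, 4.0, 4.8, 4.5, 4.1, 4.6, 4.3, 4.4]

p_value = permutation_test(sample_a, sample_b)
print("p-value:", p_value, "| significant at alpha = 0.05:", p_value < 0.05)
```

The appeal of this kind of test is that it is easy to explain to a non-technical stakeholder: if the group labels didn't matter, shuffling them should produce a gap as large as the observed one fairly often.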
Recently I had a nice chat with a fellow data scientist who works at LinkedIn. After bouncing some ideas off him, I decided to make another video, based on a topic of mutual interest, partly for demonstrating to him how straight-forward the process is, once you have done the research on the topic. This video is now published on Safari here (subscription required). Enjoy!
With so many options for publishing videos online nowadays, someone may wonder “why would I want to go through hoops to get something published on Safari?” This is a valid question, and it’s equivalent to asking “why should I get published through a publishing house when I can self-publish on Amazon, or some other platform?” Although there is merit in self-publishing, there are two main issues with it: quality assurance (QA), and marketing.
Before I get into the details of all this, let me inform you that I've been down the self-publishing path and it wasn't as glamorous as people make it out to be. I published not just 1, but 3 e-books, created a website for them, and even hired people to help promote them. A few years later the only real benefit I've seen through all this was the experience I’d gained through the whole process. So, if this is your sole motivation, that’s fine. If you however want to make enough money to make the whole thing worthwhile, then there are better options out there.
Getting published on Safari (or any other professional video platform) ensures a certain quality standard. Of course not all videos there are great, but at least you won’t find many that are a total waste of time or riddled with inaccurate information, like you would on YouTube, for example. The reason is that for a video to get on the Safari site, it first goes through some QA process. If there is an issue about it, you will need to revise it. This doesn't happen often, if you know what you are doing, but it’s a good fail-safe.
Marketing is another matter where platforms like Safari excel. If something is on Safari, people will see it and may watch/read it. If you have a video on YouTube, few people will notice it and even fewer will watch the whole thing. Especially now with the new strict policies that YouTube has adopted, content creators have it hard. Unless you create a lot of content regularly, your exposure on YouTube is bound to be very limited. Of course, if you create a lot of content, the quality is bound to drop, but YouTube doesn't seem to care much about this. As long as they get lots of people watching the videos they host, and keep the ad money rolling, they are fine. And if your vid gets flagged because some oversensitive person finds it problematic for whatever reason, that’s your problem, not YouTube’s.
I’m not trying to say that YouTube is bad. Every video hosting platform has its use cases. However, for quality content that you expect to at least pay for the effort you've put into creating it, a more professional platform like Safari makes more sense. You can create a promo video and put it on YouTube, or Vimeo. But if you spend a week creating a data science or A.I. video, you are better off publishing it through proper channels, like Safari.
To give you an idea of the profits that a Safari video can yield, last year I published a book. I spent about 9 months writing it and editing it. It was considered successful and helped me get some traction in the field, while also promoting the programming language it was about. One of the videos I created and published for Safari yielded about the same revenue. It had taken me about a week to create it and edit it, while I also enjoyed it more, since it felt more like a creative endeavor, rather than work. Since I don’t have a huge following, I doubt that the same video could yield the same revenue if it were published on YouTube or some other open platform.
If you find that you have content you wish to share with the world, in a professional manner, I’d recommend you consider Safari as an option. If you find that it entails too much work and you are unsure as to where you need to start, you can always go through a publisher, like Technics Publications, like I did. As Nelson Mandela eloquently said, “it always seems impossible until it's done.”
Recently someone on LI recommended that I bring more JOY to the world instead of merely complaining about it (I wasn’t complaining but apparently she thought I was!). I’m not an entertainer, nor a psychology expert, but perhaps you don’t need to be in these lines of work in order to bring joy to the people you interact with. I thought about it and decided that perhaps data science could be a source of joy to other people. However, for this to happen, it needs first and foremost to be joyful to you.
Deriving joy from a challenging and oftentimes frustrating procedure such as a data science project is not easy. In fact, many people can’t stand the largest part of the work such a project entails. However, with the right mindset, even the more tedious aspects of the work can be enjoyable (i.e. be conducive to joy). So, what is this mindset that turns boredom to beauty and drudgery to delight?
Although there is no magic formula for making things more enjoyable in data science, if you have the attitude of the data science amateur when you approach a problem, your chances of enjoying it are better. This doesn’t mean being sloppy and checking Stackoverflow or Quora every 5 minutes. The amateur’s attitude is, as the word amateur implies, an attitude based on love for what you are doing. The amateur doesn’t care if they get paid for their work. They may even never get paid, but they do it anyway because they find it fulfilling. It’s like a hobby for them.
However, a data scientist still needs to be professional about her work. There are deadlines, meetings with stakeholders, and of course debugging scripts that throw errors at the worst possible time! Handling these matters takes professionalism, but it doesn’t need to be a mechanical and draining process. If you see part of your work as a data scientist (even the debugging stage) as a learning experience and have what is known in Zen as the beginner’s mind, you are bound to find everything a bit more enjoyable. It’s the joy that comes from detachment and lack of rigid expectations from your work, something that every professional knows.
Remembering all this, especially on a Monday morning, is not as straight-forward as it may seem when you think of it. However, being joyful is a matter of perspective and at the end of the day a matter of habit. Aristotle famously said that “virtue is a matter of habit” and some could argue that joy is a kind of virtue. Maybe not something you would put on your resume or talk about in an interview, but definitely something worth keeping in mind in those long mornings when you may be tempted to question your career choices. After all, if you could be joyful about data science as a field once, you can be joyful about data science work too. And if you still feel that you need some help to get your enthusiasm flowing, invigorating a joyful mindset, you can always read my book Data Science – Mindset, Methodologies, and Misconceptions. :-)
After several days in limbo, the video "Remaining Relevant in Data Science" that I made recently is now online on Safari (link). If you have a subscription to that platform, do check it out. If you prefer to access this kind of knowledge through a different medium, feel free to check out the last chapter of my latest book, Data Science Mindset, Methodologies, and Misconceptions. Enjoy!
When people nowadays talk about A.I., they usually refer to the deep learning methodology and other ANN frameworks. This is great, considering that ANNs were almost considered a dead-end once, due to the inability of technology to help them exhibit their potential. Yet, now computers are more powerful than ever and GPUs are commonplace as add-ons, enabling deep learning and other ANN-based systems to function at greater scales. However, there are some other A.I. methodologies that are equally valid and actually predate ANNs. These I refer to as the “hipsters of A.I.” since they were part of the A.I. field before A.I. was cool.
The A.I. hipster methodologies are A.I. frameworks that are not ANN-related. These are systems like Fuzzy Logic (FL), which came about years before ANNs reached a level of development that made them worth using in machine learning. FL systems were used heavily in data analytics, and they were even implemented in hardware. At one point, researchers even experimented with a hybrid system that is part FL and part ANN (this was called ANFIS, short for Adaptive Neuro-Fuzzy Inference System, and was in essence an artificial neural network that optimized the membership functions of a Fuzzy Inference System).
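For those who haven't come across membership functions before, here is a minimal sketch of the idea in plain Python. It uses a triangular membership function, which is just one common shape, and the temperature sets and their parameters are made up for illustration. It is not an implementation of ANFIS or of a full Fuzzy Inference System; it only shows the "fuzzification" step that such systems build upon.

```python
def triangular(x, left, peak, right):
    """Degree (between 0 and 1) to which x belongs to a triangular fuzzy set."""
    if x <= left or x >= right:
        return 0.0
    if x <= peak:
        return (x - left) / (peak - left)  # rising edge
    return (right - x) / (right - peak)    # falling edge


# Hypothetical fuzzy sets for a temperature reading in Celsius,
# each given as (left, peak, right). The values are assumptions.
TEMPERATURE_SETS = {
    "cold": (-10.0, 0.0, 15.0),
    "warm": (10.0, 20.0, 30.0),
    "hot": (25.0, 35.0, 50.0),
}


def fuzzify(value, fuzzy_sets):
    """Map a crisp value to its membership degree in every fuzzy set."""
    return {
        label: triangular(value, *params)
        for label, params in fuzzy_sets.items()
    }


memberships = fuzzify(12.0, TEMPERATURE_SETS)
print(memberships)
```

Note how a single reading can be partly "cold" and partly "warm" at the same time. The parameters of these triangles are exactly the kind of thing a hybrid system would tune automatically instead of having an expert set them by hand.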
Another hipster methodology is the family of optimization methods. These are systems like Genetic Algorithms, Simulated Annealing, and Particle Swarm Optimization (as well as its many variants). Although the scope of these A.I. fields is limited to finding optima of particular functions (aka fitness functions), their usefulness covers a variety of fields. Even dimensionality reduction processes sometimes make use of GAs or some other optimization tool. Note that these systems are not the same as the analytical optimization methods known from Calculus, since they tackle very complex search spaces, with oftentimes dozens of variables, and use a stochastic process in the back-end.
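To illustrate what a stochastic optimizer looks like in practice, here is a bare-bones sketch of Simulated Annealing in plain Python. The fitness function, the starting point, and all the parameter values (number of steps, initial temperature, cooling rate) are assumptions chosen for illustration, and a single variable is used to keep things short. A production-grade optimizer would be considerably more elaborate.

```python
import math
import random


def fitness(x):
    # Hypothetical bumpy function with several local minima (lower is better)
    return x * x + 10.0 * math.sin(3.0 * x)


def simulated_annealing(func, start, steps=5000, temp=10.0, cooling=0.999, seed=42):
    """Minimize func over one variable with a basic simulated annealing loop."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    current, current_val = start, func(start)
    best, best_val = current, current_val
    for _ in range(steps):
        # Propose a random neighbor of the current solution
        candidate = current + rng.gauss(0.0, 1.0)
        candidate_val = func(candidate)
        delta = candidate_val - current_val
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature drops
        if delta < 0 or rng.random() < math.exp(-delta / temp):
            current, current_val = candidate, candidate_val
            if current_val < best_val:
                best, best_val = current, current_val
        temp *= cooling  # geometric cooling schedule
    return best, best_val


best_x, best_val = simulated_annealing(fitness, start=8.0)
print("best x:", best_x, "fitness:", best_val)
```

The occasional acceptance of a worse solution is what sets this apart from a greedy search: it gives the process a chance to escape a local optimum early on, while the cooling schedule makes it settle down toward the end.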
If there is one take-away from these hipster A.I. systems it is that there is more than meets the eye when it comes to artificial intelligence. That’s not to say that deep learning systems are not worth your while, but it’s good to keep an open mind about other A.I. systems that may not be as popular today, but may have played (and still play) an important role in the evolution of the field.
Also, having a solid understanding of A.I. through its various methodologies allows us to think forward in a creative way. Instead of merely trying to extend the methodologies we know, we may come up with new ones, enriching A.I. in ways that we wouldn't be able to fathom if our understanding were limited to a single A.I. framework. Isn't that what A.I. is about, finding novel ways to solve problems, leveraging clever heuristics and imaginative architectures?
There is a lot of unstructured data out there. Many people view it as untapped potential, and they are right. There are a lot of signals out there, waiting to be harnessed by the data scientists who get to them. However, most of the data where these signals dwell is unstructured, or semi-structured (there is some structure to it but it’s not consistent). This leads some people to believe that structuring it will instantly make the data more valuable. This view is quite debatable, however, and is worth exploring further, before it brings about unrealistic expectations of what data science can do.
Structuring data is part of the data science process. Before we can feed it to a model, we need to get the data into the form of a matrix (if all the data is of the same type) or a data frame (whenever we have various types in the dataset). However, the fact that structuring data is necessary for the mining of the information in it (usually in the form of insights), does not make it a sufficient condition for that. In other words, we have to structure the data, but this doesn't guarantee anything. There have been many times when upon training various models, from different frameworks, things don’t seem to pan out. The performance is mediocre, the results are not actionable, and the whole thing is labeled as a failure of sorts. I do not mean to dismay anyone, but it’s healthy to be aware of this possibility, since it’s not often shown in data science books or tutorials. People like to talk about the success stories, leading to a false understanding and unrealistic expectations.
For the data to be valuable, it needs to have a strong signal in it. This means that even by just looking at it, you can tell that there is something there that given enough time and effort, you would be able to find yourself. In this case, data science facilitates the process of mining that signal, since no-one has the patience or the resources to go through a data stream on their own, no matter how motivated they are. Here, data science is bound to be successful, since it accelerates the process of turning this information-rich data into actual information, or even knowledge. However, the structure of the data is not so relevant in this case. Even if the data is in a JSON or raw text format, for example, it can still be useful, since it’s not too difficult to generate features that penetrate this nebulous form and manage to encapsulate the essence of it, in a form that can easily fit into a database table (albeit a very large one usually).
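As a rough illustration of generating features from semi-structured data, here is a small sketch in plain Python that turns inconsistent JSON records into flat rows that could fit into a table. The records, the field names, and the features themselves are all made up for this example; which features are worth extracting depends entirely on the data and the problem at hand.

```python
import json

# Hypothetical semi-structured records: the fields are not consistent
RAW_RECORDS = [
    '{"user": "a1", "text": "Great product, fast delivery!", "tags": ["review", "positive"]}',
    '{"user": "b2", "text": "Late.", "meta": {"stars": 2}}',
]


def extract_features(raw):
    """Turn one JSON record into a flat row with a fixed set of columns."""
    record = json.loads(raw)
    text = record.get("text", "")
    return {
        "user": record.get("user"),
        "n_chars": len(text),
        "n_words": len(text.split()),
        "has_exclamation": int("!" in text),
        "n_tags": len(record.get("tags", [])),         # 0 when the field is absent
        "stars": record.get("meta", {}).get("stars"),  # None when the field is absent
    }


rows = [extract_features(raw) for raw in RAW_RECORDS]
for row in rows:
    print(row)
```

Every row ends up with the same columns, regardless of which fields the original record had, so the result can be loaded into a data frame or a database table. Whether those columns carry any signal is a separate matter, of course.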
So, it is important to exercise discernment in this matter. Surely structured data may be more appealing for a data scientist, as it means less tedious work for her, but it doesn't guarantee anything of value. Besides, the process of structuring the data (aka data engineering) can be insightful too, as it involves some data exploration. Data exploration may not always accelerate the structuring of the data, but it definitely helps you understand it better and make more informed choices about the whole data science process (including structuring). After all, shortcuts in the process may save you some time, but if you know what you are doing, you can definitely do without them, saving your organization some money in the process, since automated data structuring is not free. The choice is yours.
Many people argue that data science’s main purpose, particularly in a business setting, is to mine and deliver insights. Contrary to data products (which is another data science deliverable type), insights are fairly straight-forward and require little software development (something often outsourced to the dev team). However, their value is something that is the subject of debate, since few insights are actually used in practice, in real-world projects.
An insight is generally some non-trivial conclusion that stems from rigorous analysis of a data stream, be it with A.I. techniques (e.g. a deep learning network), some other machine learning methodology (e.g. an unsupervised learning system), or even some statistical process (e.g. a chi-square test). By definition, it is not something that you can pinpoint by just plotting the data, or calculating some superficial metric, like the mean, or standard deviation (which are fine by themselves, but insufficient for generating useful insights).
It would be good to differentiate between the various aspects of the value of an insight. First of all, there is the inherent value of the insight. This is in essence a signal in the data analyzed, or some interpretation of it. This kind of value is useful primarily for the data scientist and other people involved in the project, in a hands-on way. If the data science project is related to research, this kind of insight can be the basis of a publication. However, an insight that has merely innate value is often not enough.
Another aspect of the value of an insight is its commercial application. This is significantly more important for the majority of data science projects. The reason is that someone is paying for the project and it’s this kind of valuable insight that eventually brings about a positive ROI for the project. The data scientist may not necessarily value the commercial aspect of the insights he delivers, but the project manager definitely does, as well as other stakeholders of the project.
Finally, there is the practical value of the insight. Whether the insight has commercial value or not, it may enable the development of something tangible, like a data product, or some in-depth understanding of the problem at hand. This kind of value is conducive to a new cycle in the data science process, something that is bound to bring about new insights, yielding additional value.
Whatever the value of the insights, it is important to remember that one’s work shouldn’t be judged entirely by them. Surely it’s great if you can produce something actionable, or something that sheds light on the problem investigated, but if the data streams available are as noisy as the screen of a TV that’s not tuned to a network, then there is not much you can do with them. After all, the rule that many software developers have, “garbage in, garbage out” (GIGO), is applicable to data science as well. If you want valuable insights, you need data streams that have some useful signal(s) in them, otherwise you are just wasting your time.
What are your insights on this matter?
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.