With so many ways to get a book out there, even in a fairly challenging subject such as data science, you may wonder what this process entails and what is the best way to go about it. After all, these days it’s easier than ever to reach an audience online and promote your work, all while branding yourself as a professional in the field.
Writing a book in data science is first and foremost an education initiative, targeting a particular audience. Usually, this is data science learners, though it may be other professionals involved in data science, such as managers and developers. A data science book generally tries to explain what data science can do, what its various methodologies are, and how all of that can be useful for solving particular problems (emphasis on the last part!). If you see a book that focuses a lot on the methods, particularly those of a single methodology, it may be too specialized to be of use to most audiences, unless you are targeting the particular niche that requires this specific know-how.
A key thing to consider when exploring the option of writing a book is whether to work with a publisher. Even if you prefer to self-publish, your book must be able to compete with other books in this area, and a publisher’s interest is usually the best way to gauge that. If a publisher is interested in your book, then it’s likely to be somewhat successful. Also, if you are new to book authoring, you may want to start with a publisher since there are a lot of things you’d never learn without one. Finally, a book published through a publisher is bound to have more credibility and a longer life-span.
Understandably, you may have explored the various deals publishers make with their authors and figured out that you’ll never make a lot of money by publishing books. Fair enough; you’ll probably never make a living by selling your words (although it is possible still). However, if your book is good, you’ll probably make enough money to justify the time you’ve put into this project. Also, remember that most publishing deals provide you with a passive income, even if the publisher wants you to promote your book to some extent. So, even though you won’t make a lot of cash, you’ll have a revenue stream for the duration of your book’s lifetime.
With all the data science material available on the web these days, acquiring all the relevant information and compiling it into a book is a fairly straightforward task. However, just because it is feasible, it doesn’t mean that it’s what the readers need. Without someone to guide you through the whole process and give you honest feedback (that’s also useful feedback), it’s really hard to figure out what is necessary to put in the book, what should be included in an appendix, and what should be mentioned in a link. Your readers may or may not be able to provide you with this information, and if your main means of interacting with them is counting how many of them download your book or visit your website, you are just satisfying your ego!
A publisher's honest feedback often hurts but that’s what gradually turns you into a real author, namely one who has some authority in his/her written works. Otherwise, you’ll be yet another writer, which is fine if you just want to talk about writing a book or how you have written a book that you have on Amazon, things that are bound to be forgotten quicker than you may think…
Although when people think of math in data science, it’s usually Calculus, Linear Algebra, and Graph Theory that come to mind, Geometry is also a very important aspect of our craft. After all, once we have formatted the data and turned it into a numeric matrix (or a numeric data frame), it’s basically a bunch of points in an m-dimensional space, where m is the number of features.
Of course, most people don’t linger at this stage to explore the data much since there are various tools that can do that for you. Some people just proceed to data modeling or dimensionality reduction, using PCA or some other method. However, oftentimes we need to look at the data and explore it, something that is done with Clustering to some extent. The now trending methodology of Data Visualization is very relevant here and if you think about it, it is based on Geometry.
Geometry does more than just help us visualize the data though. Many data models use geometry to make sense of the data, for example, particularly those models based on distances. I talked about distances recently, but it’s hard to do the topic justice in a blog post, especially without the context that geometry offers.
Perhaps geometry seems old-fashioned to those people used to fancy methods that other areas of math offer. However, it is through geometry that revolutionary ideas in science took root (e.g. Theory of Relativity) while cutting edge research in Quantum Physics is also using geometry as a way to understand those other dimensions and how the various fundamental particles of our world relate to each other.
In data science, geometry may not be in the limelight, unless you are doing research in the field. However, understanding it can help you gain a better appreciation of data science work and the possibilities that exist in the field. After all, a serious mistake someone can make when delving into data science is to think that the theory in a course curriculum or some book is all there is to it. When you reduce data science to a set of methods and algorithms, you are basically limiting its potential and how you can use the field as a data scientist. If, however, you maintain a sense of mystery, such as that which geometry can offer, you are bound to have a healthier relationship with the craft and a channel for new ideas. After all, data science is still in its infancy as a field, and the best data science methods are yet to come...
As the field of Data Science matures and everything in it is categorized and turned into a teaching module, compartmentalization may seem easier and more efficient as a learning strategy. After all, there is a bunch of books on specialized topics of the craft. That’s all great and for some people, it may even work satisfactorily, but that’s where the risk lies and it’s a pretty big risk too!
Learning about something specialized in data science, particularly without a good sense of context or its limitations, can be catastrophic. The old saying “for someone who only knows how to use a hammer everything starts looking like a nail” is applicable here too. Learning about a specialized aspect of data science can often make you think that this is the best approach to solving data science related problems. After all, the author seems to know what he’s talking about and some employers value this skill. However, if this know-how is out of context, it is bound to be ineffective at best and problematic at worst. Data science is an interdisciplinary field with lots of different tools in it, from various areas. Anyone who tries to dissect it and focus mainly on one of them is doing a disservice to the field and if you as a data science learner pay attention to this person, you are bound to warp your knowledge of the craft and delay your mastery of it.
Also, this overspecialization in know-how may make you think that you are better than the other data science practitioners who have not developed that niche skill yet. This will significantly limit your ability to learn from, and perhaps even cooperate with, these people. After all, you are an expert in this, so why bother with less fancy know-how at all? Well, sometimes even the more humble aspects of the field, such as feature engineering, can turn out to be more effective at solving a problem than some fancy model, so it’s good to remember that.
That’s why I’ve always promoted the idea of the right mindset in data science, something that is bound to remain stable in the years to come, no matter how the field evolves, and to help you adapt to whatever know-how becomes the norm. Also, no matter how important the algorithms are, it’s even more important to know how to create your own algorithms and change existing ones, optimizing them for the problem at hand. That’s something that no data science book teaches adequately, as the emphasis is on covering material related to certain buzzwords, sometimes without the supervision of an editor. An editor can help immensely in making the contents of a book more comprehensible and relevant to data science in general, providing you with a sense of perspective.
So, be careful with what you let enter your data science curriculum as you learn about the craft. Some books may be a waste of time while others, especially those not published through a publisher, may even hinder your development as a data scientist.
Before starting the new data science book, I made one video on a very fascinating topic that I've delved into for a while now: Cryptanalysis. Although I'm not a hacker, I've researched this topic sufficiently and even broken a few ciphers myself over the years. This video (available on Safari/O'Reilly) is a gentle introduction to the topic and ties in very well with my other Cybersecurity videos. Check it out when you have the chance!
Note that in order to view the video in its entirety, you'll need an account (e.g. through a subscription). If you are an employee of a tech company, you may have full access to the Safari platform already. The latter is a useful resource for both videos and books, all of which you can access through a mobile device too.
For some reason, people who delve into data science tend to focus more on certain aspects of the craft at the expense of others. One of the things that often doesn’t get nearly enough attention is the concept of distance. If you ask a data scientist (especially one who is fairly new to the craft or overspecialized in one aspect of it), they’ll tell you about the distance metrics they are familiar with and how distance is a kind of similarity metric. Although all of this is true, it portrays only one part of the picture.
I’ve delved into the topic for several years now and since my Ph.D. is based on transductive systems (i.e. data science systems that are based on distances), I’ve come to have a particular perspective on the matter, one that helps me see the incompleteness of it all. After all, no matter how many distance heuristics we develop, the way distance is perceived will remain limited until we look at it through a more holistic angle. So, let’s look at the different kinds of distances out there and how they are useful in data science.
Distances of the first kind are those most commonly used and are expressed through the various distance heuristics people have devised over the centuries. The most common ones are Euclidean distance and Manhattan distance. Mathematically, each is defined as a norm of the vector connecting two points (the L2 norm for Euclidean distance and the L1 norm for Manhattan distance).
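To make the "norm of the connecting vector" idea concrete, here is a minimal sketch in plain Python. The function names are my own, chosen for illustration, and the general p-norm form (often called Minkowski distance) is used only because it covers both cases with one formula.

```python
def p_norm_distance(x, y, p):
    """Distance between points x and y as the p-norm of the vector x - y."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of dimensions")
    # Sum the p-th powers of the absolute coordinate differences,
    # then take the p-th root.
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


def euclidean(x, y):
    """L2 norm of the connecting vector."""
    return p_norm_distance(x, y, 2)


def manhattan(x, y):
    """L1 norm of the connecting vector."""
    return p_norm_distance(x, y, 1)
```

For the points (0, 0) and (3, 4), the two functions disagree (5 versus 7), which is a quick reminder that "the distance" between two data points depends on the heuristic you pick.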
Another kind of distance is the normalized one. Every distance metric out there that is not in this category is crude and tied to the particular set of dimensions it was calculated in. This makes comparisons of distances between two datasets of different dimensionality meaningless, even if they are mathematically straightforward. Normalizing the matrix of distances of the various data points in the dataset requires finding the largest distance, something feasible when the number of data points is small but quite challenging otherwise. What if we need the normalized distances of a sample of data points only, because the whole dataset is too large? That’s a fundamental question that needs to be answered efficiently (i.e. at a fairly low big O complexity) if normalized distances are to be practical.
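As a rough illustration of the brute-force approach described above, the sketch below computes every pairwise Euclidean distance and divides by the largest one. The function name and the choice of Euclidean distance are assumptions made for the example, not a prescribed method. Note that it examines every pair of points, so its cost grows quadratically with the number of points, which is exactly why it becomes impractical for large datasets.

```python
import math
from itertools import combinations


def normalized_distances(points):
    """Return {(i, j): d_ij / d_max} for every pair of points with i < j.

    Brute force: all pairs are computed in order to find the largest distance.
    """
    dists = {
        (i, j): math.dist(points[i], points[j])
        for i, j in combinations(range(len(points)), 2)
    }
    d_max = max(dists.values(), default=0.0)
    if d_max == 0.0:
        # Fewer than two points, or all points identical: nothing to scale.
        return {pair: 0.0 for pair in dists}
    return {pair: d / d_max for pair, d in dists.items()}
```

If you run this on a sample rather than the whole dataset, keep in mind that the sample's largest distance can never exceed the dataset's, so the normalized values you get may be larger than the ones the full dataset would yield.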
The last and most interesting kind of distances is the weighted distance. Although this kind of distance is already well-documented, the way it has been tackled is fairly rudimentary, considering the plethora of possibilities it offers. For example, by warping the feature space based on the discernibility scores of the various features, you can improve the feature set’s predictive potential in various transductive systems. Also, using a specialized weighted distance, you can better pinpoint the signal of a dataset and refine the similarity between two data points in a large dimensionality space, effectively rendering the curse of dimensionality a non-issue. However, all this is possible only through a different kind of data analytics paradigm, one that is not limited by the unnecessary assumptions of the current one.
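For reference, the rudimentary form of a weighted distance looks something like the sketch below: a Euclidean distance where each feature's squared difference is scaled by a weight. The weights here are hypothetical placeholders; how to derive them (e.g. from discernibility scores) is a separate matter that this example does not attempt to cover.

```python
import math


def weighted_euclidean(x, y, weights):
    """Euclidean distance with each feature's squared difference scaled by a weight."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y and weights must have the same length")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative")
    # A weight of 0 ignores a feature entirely; a larger weight stretches
    # the feature space along that feature's axis.
    return math.sqrt(sum(w * (a - b) ** 2 for a, b, w in zip(x, y, weights)))
```

With all weights equal to 1 this reduces to the ordinary Euclidean distance, while a weight vector such as (1, 0) makes the second feature invisible to the distance calculation.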
Naturally, you can have a combination of the latter two kinds of distances for an even more robust distance measure. Whatever the case, understanding the limitations of the first kind of distances is crucial for gaining a deeper understanding of the concept and applying it more effectively.
Note that all this is my personal take on the matter. You are advised to approach it with skepticism and arrive at your own conclusions. After all, the intention of this post is to make you think more (and hopefully more deeply) about this topic, instead of spoon-feeding you canned answers. So, experiment with distances instead of limiting your thinking to the stuff that’s already been documented. Otherwise, the distance between what you can do and what you are capable of doing, in data science, will remain depressingly large...
Lately, there has been an explosion of interest in Data Science, mainly due to the appealing job prospects of someone who has the relevant know-how. It is easy, unfortunately, to get into a state of complacency whereby data science becomes all too familiar and you find yourself working with the same methods and the same processes in general when dealing with the problems you are asked to solve. This situation can be quite toxic though, even if it’s unlikely someone will tell you so. After all, as long as you deliver what you have to deliver no one cares, right? Unfortunately, no. If you stop evolving as a data scientist, chances are that you’ll become obsolete in a few years, while your approach to the problems at hand will cease to be as effective. Besides, the field evolves, as do the challenges we as data scientists have to face.
The remedy to all this is exploring data science with a renewed sense of enthusiasm, something akin to what is referred to as “beginner’s mind” in the Zen tradition. Of course, enthusiasm doesn’t come about on its own after you’ve experienced it once. You need to create the conditions for it and what better way to do that than exploring data science further. This exploration can be in more breadth (i.e. additional aspects of the craft, including but not limited to new methods), and in more depth (i.e. understand the inner workings of various algorithms and the variants they may have). Research in the field can go a long way when it comes to both of these exploration strategies. It’s important to note that you don’t need to publish a paper in order to do proper research. In fact, you can do perfectly adequate research with just a computer and a few datasets, as long as you know how.
It’s also good to keep the breadth and depth in balance when you are exploring data science. Going too much in breadth can leave you with a superficial knowledge of the field, while going too much in depth can make you overspecialized. What you do first, however, is totally up to you. Also, it’s important to use reliable resources when exploring the field, since nowadays it seems that everyone wants to be a data science content creator, without having the essential training or educational mindset. A good rule of thumb is to stick to content that has undergone extensive editing, such as the stuff made available through a publisher, particularly one specializing in data related books and videos.
Whatever the case, it’s always good to explore data science in an enjoyable manner too. Find a dataset you are interested in, before starting to apply some obscure method. This way the whole process will become more manageable and perhaps even fulfilling. Fortunately, there is no shortage of datasets out there, so you have many options. Happy exploration!
After noticing a subtle but clear gap in the data science education of today, and after discussing this matter with a couple of my associates, I decided that a new data science book would be in order. So, after some negotiations and refinements of this idea, over the space of 3 months, we are now ready to initiate this publication project. So, once the paperwork is done, I'll be working on a new title, one that would appeal to a large audience of data science related professionals. We expect the first draft to be ready by the beginning of summer, and if all goes well, the book should be available for purchase by early autumn.
A big thanks to my publisher Technics Publications and to all of you, particularly those buying my books and watching the videos of mine that are made available on Safari. Cheers!
Happy holidays everyone! I hope you have a chance to relax, recuperate, and rejuvenate this holiday period :-) See you in 2019 with new, insightful, and fox-like blog posts!
Lately I worked on a more ambitious topic for a data science video. Graph Analytics, aka Network Analytics, is one of the more niche aspects of our craft and although I've been using it for many years, creating a video on the topic has always been daunting due to the amount of material it has. However, I managed to create a fairly succinct clip (a bit less than half an hour long) and put it out there through my publisher. You can find it on the Safari portal.
Note that you will need a subscription to Safari in order to view it in its entirety. Also, a subscription to this educational platform enables you to have access to a bunch of different material, including all of my books. Cheers!
Although I’ve always been a big fan of online videos and find many such projects entertaining to watch, I’ve never really seriously considered doing anything on YouTube. That’s despite the fact that I’m fully aware that some people are making a living on this endeavor.
First of all, YouTube has changed dramatically over the years and not for the better. Specifically, the algorithm used for featuring what’s hot on the YouTube homepage has degraded drastically, in a desperate effort to promote “fresh” content creators. In other words, if a producer doesn’t publish videos frequently, they are not promoted much by the algorithm, something that inevitably gives rise to sloppy and cheap content, created merely to satisfy that mindless algorithm. Of course, many YouTube fanatics (or YouTubers as they like to call themselves) have their own channels and networks for promoting their stuff, so they get some views regardless. However, the effort it takes to build such a network, and the fact that it requires constant work to keep it active, make the whole process inefficient and problematic in many ways.
In addition, YouTube has started to filter its content in an effort to block offensive videos from being made available. It’s not that the company gives a damn about what you view since there is already a plethora of super low quality videos over there, but it wants to avoid lawsuits. So, in a desperate effort to save its ass, YouTube has aggressively started filtering its content through any means necessary. This includes having its own unpaid workers, some dedicated users that have nothing else to do with their time, do this deed for YouTube. Of course, these people are not trained, while the guidelines they have been given are vague at best. So, it’s up to their limited discernment to figure out what constitutes a bad video and what doesn’t, and what they flag is oftentimes seemingly random. This way, many legitimate videos have been filtered as inappropriate just because some idiot couldn’t tell what they were about. This resulted in the corresponding producer not receiving any revenue from these videos, despite the amount of work he/she had put into these projects.
Moreover, the revenue YouTubers make from a single video is not that high, unless the video goes viral. What’s worse, the revenue drops off sharply over time since just the most recent and most popular videos attract enough viewership. Who cares about something that was published a year ago, right? Well, wrong. If a video is of a certain quality standard, it is bound to be good to watch even a year or two after its release date. Then again, most YouTubers have given up on quality videos since those take a lot of time and they need to get something online soon, if it is going to be fresh. So, since I don't have a whole crew working for me, if I were to do YouTube videos I'd make a fairly small income from the videos themselves, unless of course I were to have some sponsor. Sponsor ads, however, are not something the viewer wants to watch, so once you have a sponsor in a video, its quality immediately drops.
Furthermore, as I have a better alternative to YouTube (the Safari platform), it makes no sense whatsoever to settle for a less professional platform. Besides, YouTube is only popular because it's been around the longest and with newer and better platforms entering the scene lately, it's doubtful this trend will continue. As a bonus for not working for YouTube, I don’t have to worry about the Article 13 issue that seems to trouble YouTubers, nor do I have to busk for subscriptions from my viewers. I still get some nasty comments from time to time, but the majority of the feedback I receive is positive.
Finally, there is also the recent fiasco with the YouTube Rewind 2018 video (which broke the record for the number of dislikes in a single video, as well as the record of how quickly a video accumulates dislikes). This may seem insignificant to the YouTube fanatic, whose allegiance to YouTube and Alphabet trumps any rational thoughts on this matter, but the fact is that the company doesn't care about its content creators. Otherwise, it would mention the ones that actually make a contribution to it, instead of veering away from them, in favor of a celebrity and some not so relevant YouTubers. I don't know about you, but I'd rather not make videos at all than publish my videos to a platform like this, which fails to appreciate its contributors.
So, if you are someone thinking of becoming a content creator and making some revenue from all this, there are better ways than YouTube. Perhaps it was a viable option once but right now it’s one of the worst places to publish your stuff. Besides, with Safari and other quality-based platforms out there, figuring out what to do with a quality video is really a no-brainer.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.