You may wonder why anyone would write a technical book these days, especially when the earnings of such an endeavor are small and getting smaller. I wondered about the same thing once, until I eventually came to realize some key reasons why such an endeavor is indeed worthwhile, particularly in the Data Science area. I’d like to share with you here the most important of these reasons (benefits), knowing full well that these are just my own insights and that you’d be able to find your own, should you ever consider writing a technical book.
First of all, when you discover something, even if it’s not entirely new, the natural next step is to try to share this with others, be it to advance your career, to build your personal brand, or for whatever other reason drives you. If you are serious about this task, you’d want to make sure that whatever you deliver meets a certain quality standard, so writing a book on the topic could help you accomplish that. If you are more inclined to use film as your medium, you may decide to go with a video instead, but that would require more time and considerably more expertise. So, a technical book would be a more viable option, especially if what you have to say has enough commercial value to attract a publisher. You as an author may not be motivated by the royalties you’ll receive from this project, but no publisher would publish something that isn’t going to pay for the paper or digital storage it is going to need.
Also, through writing such a book you realize what you don’t know and develop a more balanced approach to the whole subject, since you are more aware of what’s out there. The arrogance you may have harbored as a newcomer will gradually give way to humility and a deeper appreciation of the field, which the research required to write the book is bound to cultivate in you. Besides, you may realize that even the stuff you know well is something you cannot always express comprehensibly, something that the editor will be more than happy to let you know! So, your development as a professional in this field will be (at least) two-fold: in your knowledge of the subtle aspects of the field and in your ability to express all that effectively and eloquently.
Finally, writing a technical book, particularly one that is marketed professionally by a publisher, enables your thoughts to cross lots of borders, reaching out to people you wouldn’t normally find on your own. This will expose you to a larger variety of feedback for your work that can help you grow further as a professional. Not all of the feedback is going to be useful, but at least some of it is bound to be. Besides, the people who would normally read your work are likely to be people who have valued it enough to pay for it beforehand, be it through the publisher or through a subscription to a technical knowledge platform. Either way, they would most likely be people who are driven by genuine care about the subject, not just curiosity.
I could go on about this for a while, perhaps even write a book on this topic! However, as I respect your time, I’ll leave it at that. What other benefits of writing a technical book can you think of? Do they justify, for you, undertaking such a project?
Puzzles, especially programming-related ones, can be very useful for honing one’s problem-solving skills. This has been known for a while among coders, who often have to come up with their own solutions to problems that lend themselves to analytical solutions. Since data science often involves similar situations, it is useful to have this kind of experience, albeit in moderation.
Programming puzzles often involve math problems, many of which require a lot of specialized knowledge to solve. This sort of problem may not be that useful, since it’s unlikely that you’ll ever need that knowledge in data science or any other applied field. Number theory, for example, although very interesting, has little to do with hands-on problems like the ones we are asked to solve in a data science setting.
The kind of problems that benefit a data scientist the most are the ones where you need to come up with a clever algorithm as well as do some resource management. It’s easy to think of the computer as a system of infinite resources, but that’s not the case. Even in the case of the cloud, where resource limitations are more lenient, you still have to pay for them, so it’s unwise to use them willy-nilly unless you have no other choice.
Fortunately, there are lots of places on the web where you can find good programming challenges. I find that the ones that are not language-specific are the best, since they focus on the algorithms rather than the technique.
Solving programming puzzles won’t make you a data scientist, for sure, but if you are new to coding or if you can use coding to express your problem-solving creativity, that’s definitely something worth exploring. Remember, though, that being able to handle a dataset and build a robust model is always more useful, so budget your time accordingly.
With all the talk about Data Science and A.I., it’s easy to forget about the person doing all this and how his awareness grows as he gets more involved in the field of data science. What’s more, all those gurus on social media will tell you anything about data science except this sort of stuff, since they prefer to have you as a dependent follower rather than an autonomous individual making his own way as a data scientist.
So, as you enter the field of data science, you are naturally focused on its applications and the tangible benefits of it. As a professional in this field, you may care about the high salary such a vocation entails or the cool stuff you may build, using data science methods. Everything else seems like something you have to put up with in order to arrive at this place where you can reap the fruits of your data science related efforts. It’s usually at this level of awareness that you see people complain about the field as being too hard, or not engaging enough after a while. Still, this level is important because it often provides you with a strong incentive to continue learning about this field, growing more aware of it.
The second level of data science awareness involves a deeper understanding of it and an appreciation of its various tools, methods, and algorithms. People who dwell at this level of awareness either come from academia or end up spending a lot of time in academic endeavors, while in the worst case, they become fanatics of this or the other technology, seeing all other technologies as inferior, along with the people who prefer them. The same goes for the methods involved, since there are data scientists who swear by the models they use and wouldn’t use any others unless they absolutely had to. This is the level where most people end up, since it’s quite challenging to transcend it, especially on your own.
Finally, when you reach the third level of data science awareness, you are more interested in the data and the possibilities it offers. You have a solid understanding of most of the methods used and can see beyond them, since they all seem like instances of the same thing. Your interest in data engineering grows and you become more comfortable with processes that are either esoteric or mundane for most people. Heuristics seem far more interesting, while you begin to question things that others take for granted regarding how data should be used. The best part is that you can see through the truisms (and other useless information) of the various “experts” on social media and value your experience and intuition more than what you may read in this or the other book on this subject.
It’s fairly easy to figure out which level you are at in your data science journey. More importantly, knowing your level matters less than being aware of it and making an effort to move on, going deeper into the field. After all, just like other aspects of science, data science can be a path of sorts, rather than just the superficial field of work that many people make it appear. So, if you want to find meaning in all this, it’s really up to you!
Although there is a plethora of heuristics for assessing the similarity of two arrays, few of them can handle different sizes in these arrays and even fewer can address various aspects of these pieces of data. PCM is one such heuristic, which I’ve come up with, in order to answer the question: when are two arrays (vectors or matrices) similar enough? The idea was to use this metric as a proxy for figuring out when a sample is representative enough in terms of the distributions of its variables and able to reflect the same relationships among them. PCM manages to accomplish the first part.
PCM stands for Possibility Congruency Metric and it makes use primarily of the distribution of the data involved as a way to figure out if there is congruency or not. Optionally, it also uses the differences between the means and between the variances. The output is a number between 0 and 1 (inclusive), denoting how similar the two arrays are. The higher the value, the more similar they are. Random sampling provides PCM values around 0.9, but more careful sampling can reach values closer to 1.0, for the data tested. Naturally, there is a limit to how high this value can get with respect to the sample size, because of the inevitable loss of information through the process.
PCM works in a fairly simple and therefore scalable manner. Note that the primary method focuses on vectors. The distribution information (frequencies) is acquired by binning the variables and counting how many data points fall into each bin. The number of bins is determined by taking the harmonic mean of the optimum numbers of bins for the two arrays and rounding it to the closest integer. Then the absolute difference between the two frequency vectors is taken and normalized. When the mean and variance option is active, the mean and variance of each array are calculated and their absolute differences are taken. Each of these is then normalized by dividing by the maximum difference possible for these particular arrays. The larger array is always taken as the reference point.
When calculating the PCM of matrices (which need to have the same number of dimensions, i.e. columns), the PCM for each one of their columns is calculated first. Then, an average of these values is taken and used as the PCM of the whole matrix. The PCM method also yields the PCMs of the individual variables as part of its output.
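The procedure described above for vectors and matrices can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the actual PCM implementation: the rule that gives the "optimum" number of bins is not named here, so Sturges’ rule is assumed; the normalization of the frequency difference is also assumed (half the sum of absolute differences between relative frequencies, which keeps the score within [0, 1]); the optional mean and variance terms are left out; and all function names are hypothetical.

```python
import numpy as np


def assumed_optimal_bins(x):
    """Sturges' rule, an assumed stand-in for the unspecified 'optimum' bin count."""
    return int(np.ceil(np.log2(len(x)))) + 1


def pcm_vector(x, y):
    """Distribution-only similarity of two 1-D arrays, in [0, 1] (a sketch)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # The larger array serves as the reference point.
    ref, other = (x, y) if len(x) >= len(y) else (y, x)

    # Harmonic mean of the two bin counts, rounded to the closest integer.
    b_ref, b_other = assumed_optimal_bins(ref), assumed_optimal_bins(other)
    n_bins = max(1, int(round(2 * b_ref * b_other / (b_ref + b_other))))

    # Shared bin edges over the reference range (an assumption); values of the
    # other array that fall outside this range are counted in the edge bins.
    lo, hi = ref.min(), ref.max()
    if lo == hi:
        hi = lo + 1.0
    edges = np.linspace(lo, hi, n_bins + 1)
    p = np.histogram(ref, bins=edges)[0] / len(ref)
    q = np.histogram(np.clip(other, lo, hi), bins=edges)[0] / len(other)

    # Normalized absolute difference of the two frequency vectors.
    score = 1.0 - 0.5 * np.abs(p - q).sum()
    return float(np.clip(score, 0.0, 1.0))


def pcm_matrix(a, b):
    """Average of the column-wise scores, plus the per-column scores."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    if a.shape[1] != b.shape[1]:
        raise ValueError("both matrices need the same number of columns")
    per_column = [pcm_vector(a[:, j], b[:, j]) for j in range(a.shape[1])]
    return float(np.mean(per_column)), per_column


rng = np.random.default_rng(42)
data = rng.normal(size=(10_000, 3))
sample = data[rng.choice(len(data), size=1_000, replace=False)]
overall, per_column = pcm_matrix(data, sample)
print(round(overall, 3), [round(v, 3) for v in per_column])
```

Because the two arrays are compared only through relative frequencies, they can have different lengths, which is the property highlighted earlier. The scores this sketch produces will not necessarily match those of the real metric.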
PCM is great for figuring out the similarity of two arrays through an in-depth view of the data involved. Instead of looking at just the mean and variance metrics, which can be deceiving, it makes use of the distribution data too. The fact that it doesn’t assume any particular distribution is a plus since it allows for biased samples to be considered. Overall, it’s a useful heuristic to know, especially if you prefer an approach to analytics that differs from what Stats has to offer.
Recently I came across this interesting platform for sharing curated content, called Wakelet. It's also a British startup from Manchester, by the way, one that appears quite promising, provided that they find a way to monetize their project.
Anyway, the platform is a bit like Pinterest but with more features and an offline presence too. These are the most important features, in my view:
* Very intuitive and fast to learn
* Can work with a variety of content types: videos, images, formatted text, PDFs, and website links
* Every list can be exported to a PDF
* Free to use
* No account is required to view the lists
* Lots of free images to use as thumbnails and backgrounds
* QR code is generated for each list you wish to share
* Private lists are also an option
* Plenty of tutorials online that explain the various features and use-cases
You can check out a wake (that's what these curated lists are called) that I've made in the space of a few minutes, here. In the future I'll probably be using it more, particularly on this blog. Whatever the case, do let me know what you think of this platform and of my wake. Cheers!
If you wish to put yourself out there as a content creator, now more than ever, videos are a great way to do it. This may seem somewhat daunting to some, but with the plethora of software options out there and the ease of use of many of them, it’s just a matter of resolving to do it. Apart from the obvious benefit of personal branding, creating a data science or A.I. video can also be lucrative as an endeavor.
I’m not referring to the amateur videos many people on YouTube make, in their vain attempts to gather likes and shares, much like beggars gather pitiful coins from the passers-by. If you want to create a technical video that will be worth your while, there are better and more self-respecting options to do so, options that you would be happy to include in your resume/CV. Namely, you can create a video that you promote through a respectable publisher, such as Technics Publications. Such an alternative will enable you to receive royalties every 6 months and not have to worry about promoting your work all by yourself. Of course, there is also the option of a one-time payment that some publishers offer, but this isn’t nearly as appealing since the amount of money you can potentially earn through royalties is higher and the requirements are easier to meet.
When creating a video, many people think that it’s just you standing in front of a camera and ad-libbing about a topic, perhaps using some props like a whiteboard. Although that’s one straightforward way to do it, it may not appeal to the less charismatic presenters or those who don’t consider themselves particularly photogenic. Besides, a screen-share video or a slideshow with voice-over, always based on a script, is much easier to produce and sometimes more effective at illustrating the points you wish to make. Alternatively, you can try combining both approaches, though this may require more takes.
Whatever the case, making a video is the easy part of the whole project, relatively speaking. What follows is the most challenging task for most people: promoting the video to your target audience. Although social media have an important role to play in all this, having some support from a publisher is priceless. After all, promoting technical content is what publishers are really good at, especially if they have a good niche in the market. Still, if you have a large enough network, it doesn’t hurt to spread the word yourself too, for additional exposure, though you are not required to do so.
If you are interested in covering a data science or A.I. related topic with a video, through a publisher, feel free to contact me directly, as I’d be happy to help you with that, particularly if you are serious about it. The world could definitely use some new content out there for data science and A.I. since there is way too much noise, confounding those who wish to study these fields. Perhaps this new content could come from you.
When I created this heuristic about a year and a half ago, I wasn't planning to make a video about it. However, after exploring its various benefits, I felt this should become more well-known to data science and A.I. practitioners. So, after a series of experiments and some extra research, I've made this video demonstrating the various aspects of this intriguing heuristic metric. Check it out whenever you have the chance!
Please note that Safari Books Online (O'Reilly) is a paid platform for quality content, so you need to have a subscription to it in order to view this and any other video in their entirety. However, it's a worthy investment that every data science and A.I. learner ought to consider making.
Dimensionality reduction has been a standard methodology for dealing with datasets that have a lot of features, more than a typical model can handle effectively. Reducing the number of features can also save time and storage space, and when it comes to sensitive data it can be a big plus, as it helps preserve the anonymity of the people involved. What’s more, in some cases, a reduced dimensionality dataset can be more effective as there is less noise in it. However, conventional dimensionality reduction methods don’t always do the trick due to the inherent limitations they have. For example, PCA only considers linear relationships among the variables and offers only linear combinations of features as a solution.
Of course, other people are not sitting idle when it comes to this issue. There are several dimensionality reduction options being pursued, the most interesting of which is autoencoders. This AI-based method involves a data-driven approach to figuring out the nature of the data and creating new variables that can represent the underlying signal, by minimizing the error. The issue with this is that it often requires a lot of data and some specialized know-how to configure it optimally. Also, this whole process may be fairly slow, due to the large number of computations involved.
An alternative approach has to do with feature fusion in a non-AI way. The idea is to maintain transparency to the extent this is possible, while at the same time optimizing the whole process in terms of speed. The use of multiple operators, some linear and some non-linear, is essential, while the option of dropping useless features is also very useful. Naturally, this whole process would be more effective in the presence of a target variable, but it should be able to work without it, for better applicability. Whatever the case, the use of a metric able to handle non-linear correlations is paramount, since the conventional correlation metric leaves a lot to be desired.
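To make the idea of fusing features through several operators more concrete, here is a minimal Python sketch on synthetic data. Everything in it is an assumption made for illustration: the three operators, the synthetic target (a product of two features), and the use of the absolute Pearson correlation to score each fused feature. It is not a specific fusion method, just a toy case where the raw features look useless to the conventional correlation metric while one non-linear combination of them carries almost all of the signal.

```python
from itertools import combinations

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5_000, 3))  # the third column is pure noise
target = X[:, 0] * X[:, 1] + rng.normal(scale=0.02, size=5_000)

# Candidate operators: two linear ones and a non-linear one (illustrative choice).
operators = {"sum": np.add, "difference": np.subtract, "product": np.multiply}


def abs_corr(a, b):
    """Absolute Pearson correlation, used here only as a simple scoring rule."""
    return abs(float(np.corrcoef(a, b)[0, 1]))


# Each raw feature on its own shows almost no linear relationship to the target.
raw_scores = [abs_corr(X[:, j], target) for j in range(X.shape[1])]

# Score every (feature pair, operator) combination and keep the best one.
fused_scores = {
    (i, j, name): abs_corr(op(X[:, i], X[:, j]), target)
    for i, j in combinations(range(X.shape[1]), 2)
    for name, op in operators.items()
}
best = max(fused_scores, key=fused_scores.get)

print("raw feature scores:", [round(s, 3) for s in raw_scores])
print("best fused feature:", best, round(fused_scores[best], 3))
```

The fused feature stays fully transparent, since it is just a named operator applied to two known columns, and the noise column never makes it into the winning combination, so it can be dropped.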
Based on all this, it’s clear that the dimensionality reduction area is still capable of enhancements. Despite the great work that has been done already, there is still room for new methods that can address the limitations the existing methods have, which aren’t going away any time soon. Perhaps it would be best to explore this methodology of data engineering more, instead of focusing on the latest and greatest system, which, although intriguing, may sacrifice too much (e.g. transparency) in the name of accuracy, a trade-off that may no longer be cost-effective. Something to think about...
This famous Buddhist quote is one of my personal favorites and one that Bruce Lee also used in one of his movies. Although it may seem more relevant to some Eastern philosophy or martial arts, it actually has a lot of relevance in data science too.
Through this blog, my books, and my videos, I’ve put forward some ideas and hopefully some useful knowledge for anyone interested in data science and A.I. However, it’s easy to mistake conviction for cult-like hegemony, something I’ve observed in social media a lot. Whenever someone competent enough to have a good professional role and some prestige comes along, many people choose to become his or her followers, treating that person as a guru of sorts. This, in my view, is one of the most toxic things someone can do and it’s best avoided at all costs. That’s not to say that all those people who have followers are bad, far from it! However, the act of blindly following someone just because of their status and/or their conviction is dangerous. You may get lots of information this way, but you will lose the most important thing in your quest: initiative.
Of course, some of these people are happy to have a following and couldn’t care less about your loss of initiative. After all, they often measure their value in terms of how many followers they have, how many downloads their free book has, and how many likes they receive. This in and of itself should raise some serious red flags because no matter how much data science or A.I. know-how these individuals have, the path they are on doesn’t go anywhere good.
I’m a firm believer in free will and I value it more than anything else, especially in the domain of science. As data science (and A.I.) are part of this domain, it’s imperative to show respect to this quality, even at the expense of a large following. That’s why whenever I share something with you, be it some data science methodology, some A.I. system, some heuristic, or some ideas about our field, I expect you to experiment with it and draw your own conclusions. Don’t take my word for it, because even though I make an effort to verify everything I write about, some inaccuracies are inevitable. After all, data science and A.I. are not an exact science!
Naturally, it takes more than experimentation to learn data science and A.I., but with some guidance, some contemplation, some skepticism, and some experimentation, it is quite doable to learn and eventually master this craft. That has been my experience both in my own journey in data science and A.I. and in the journeys of my mentees. Hopefully, your experience will be equally rewarding and educational...
Sounds like a bold statement, doesn’t it? Well, regardless of how it sounds, this is a project I’ve been working on for a long time and which I’ve been refining for the past couple of weeks, while also doing some additional testing. So, this is not some half-baked idea like many of the things that tech evangelists write about to promote this or the other agenda. This is the kind of stuff I’d publish a paper about if I still cared about publications.
In a nutshell, the diversity heuristic is a simple metric for measuring how diverse the points of a dataset are. This is quite different to spread metrics (e.g. standard deviation), since the latter focus on the spread of a distribution and can take any non-negative value. Diversity, on the other hand, takes values between 0 and 1, inclusive. So, if the vast majority of the data points are crammed into one or two places, the diversity is 0, while if the data points are more or less evenly distributed in the data space, the diversity is 1. Interestingly, even a random set of points has a diversity score that’s less than 1, since perfect uniformity is super rare unless you are using a really good random number generator!
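To illustrate the kind of behavior described above, here is a small Python sketch. It is not the diversity heuristic itself, whose computation isn’t spelled out here. It uses a different, well-known quantity as a stand-in: the normalized Shannon entropy of how the points occupy the cells of a regular grid. The function name and the number of bins per dimension are assumptions made for the example.

```python
import numpy as np


def occupancy_diversity(points, bins=5):
    """Normalized entropy of grid-cell occupancy, in [0, 1] (a stand-in measure)."""
    pts = np.asarray(points, dtype=float)
    if pts.ndim == 1:
        pts = pts[:, None]
    # Count how many points fall into each cell of a bins x ... x bins grid.
    counts, _ = np.histogramdd(pts, bins=bins)
    p = counts[counts > 0] / counts.sum()
    entropy = -(p * np.log(p)).sum()
    # Dividing by the maximum possible entropy maps the result into [0, 1].
    return float(min(1.0, max(0.0, entropy / np.log(counts.size))))


rng = np.random.default_rng(1)
uniform = rng.uniform(size=(2_000, 2))
# Almost all points crammed into one spot, plus a handful scattered around.
clustered = np.vstack([
    rng.normal(scale=0.01, size=(1_990, 2)),
    rng.uniform(-1.0, 1.0, size=(10, 2)),
])

print("uniform:  ", round(occupancy_diversity(uniform), 3))
print("clustered:", round(occupancy_diversity(clustered), 3))
```

Like the heuristic described above, this stand-in is bounded, scores a crammed dataset low, and gives random uniform points a value just under 1. A spread metric such as the standard deviation, by contrast, depends on the scale of the data.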
Also, this diversity metric is pretty fast because, well, if a heuristic is to be useful, it has to scale well. So, I designed it to be quite fast to compute, even for a multi-dimensional dataset. Because of this, it can be used several times without the computer overheating. As a result, it is fairly easy and computationally cheap to have a diversity-based sampling process, i.e. a sampling method that aims to optimize the yielded sample in terms of diversity. Naturally, a diverse sample is bound to cram more of the original dataset’s signal in it, though some information loss is inevitable. Nevertheless, the diverse sample, which usually has higher diversity than the original dataset, can be used as a proxy of the original dataset for a dimensionality reduction process, such as PCA. Interestingly, the meta-features that stem from the sample are not exactly the same as those of the original dataset, but they are good enough, in terms of predictive power. So, by taking the rotation matrix of the PCA model of the sample, we can use it to reduce the original dataset, making dimensionality reduction a piece of cake.
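The last step, fitting PCA on a sample and reusing its rotation matrix on the whole dataset, can be sketched as follows. This is a minimal sketch with assumptions: the data is synthetic, the helper names are hypothetical, and a plain random sample stands in for the diversity-based sample, since the sampling procedure itself is not detailed here.

```python
import numpy as np


def pca_rotation(X, n_components):
    """Return the column means and the rotation matrix of a PCA fit via SVD."""
    mean = X.mean(axis=0)
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt[:n_components].T


def variance_retained(X, rotation):
    """Share of the total variance of X kept after projecting onto the rotation."""
    Z = (X - X.mean(axis=0)) @ rotation
    return float(Z.var(axis=0).sum() / X.var(axis=0).sum())


rng = np.random.default_rng(0)
latent = rng.normal(size=(5_000, 2))
mixing = rng.normal(size=(2, 6))
X = latent @ mixing + 0.05 * rng.normal(size=(5_000, 6))  # 6 features, 2 signals

# Stand-in for the diversity-based sample: a simple random sample of 10%.
idx = rng.choice(len(X), size=500, replace=False)
mean_s, rot_s = pca_rotation(X[idx], n_components=2)

# Reduce the *original* dataset with the rotation learned from the sample.
X_reduced = (X - mean_s) @ rot_s

_, rot_full = pca_rotation(X, n_components=2)
print("shape of reduced data:", X_reduced.shape)
print("variance kept (sample rotation):", round(variance_retained(X, rot_s), 4))
print("variance kept (full rotation):  ", round(variance_retained(X, rot_full), 4))
```

The rotation learned from the sample cannot retain more variance than the one learned from the full data, but the decomposition only has to be computed on a fraction of the rows, which is where the savings come from.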
So, there you have it: diversity can be used to reduce a dataset not just in terms of the number of data points it has (sampling) but also in terms of its dimensions. I know this may sound very simple as a process, but considering the computational cost of the alternative (not using diversity-based sampling), I believe it’s a step forward. Naturally, this is just one application of this new heuristic, which can perhaps help in other aspects of data science.
Anyway, I’d love to write more about this but I’m saving it for a video I plan to do on this topic. Currently, I’m still busy with the new book, so stay tuned...
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.