In the previous article, I talked about the student-mentee dichotomy in a data science context. However, it’s really the teacher-mentor dichotomy that is at the root of all this, while the data science field itself has a role to play too, something many learners of the craft have forgotten. In this article, we’ll explore just that, in an attempt to gain a better perspective on how true learning works and what it takes to connect with the essence of this fascinating field, one that is being tainted by those who see it as merely a career-boosting opportunity.
Although there is nothing wrong with the role of a professor in any field, especially the sciences, it’s important to highlight a distinct difference between a professor and a mentor. The former is usually geared towards delivering a set of lectures in order to fulfill the requirements of his/her professional position, something that may or may not be adequate for conveying the essence of the field, especially one as complex as data science. It’s not that the professor doesn’t care about all this, but the nature of the profession makes it incredibly difficult, if not impossible, to do it justice. After all, most of these professional educators have other priorities, such as their research.
Mentors, on the other hand, help others learn about data science not because it’s their job, but because they care about the field, while having other sources of income to cover their daily expenses. Of course, there may still be monetary benefits linked to mentoring, but these are generally not the key motivation. Also, mentors share knowledge about the field based on their own experience, rather than some curriculum which may not always align with it. Finally, the connection with the people they help (mentees) is more direct and tailored to their needs, rather than generic and impersonal.
It’s important to note that these two roles, although different, may still overlap. There are professors who can act as mentors, though they usually do so outside the classroom, as in the case of supervising a PhD student. Also, someone can be a mentor and still work part-time at a university. So, it’s good to maintain a flexible view of this whole matter.
Anyway, if you are willing to learn data science in depth, it’s definitely better to do so through a mentor, particularly one with a diversity of experiences in the field. But what about the mentor himself? Where does he learn about all this? In many cases, a mentor may have another mentor to learn from, though it is also possible that the data science field itself is that person’s mentor. After all, data science is a living field, dynamic and ever-changing, with plenty of things to teach to those who are willing to learn from it. Many of its secrets have been discovered, but there is still a lot that is uncharted territory. That’s something data science can teach anyone who is willing to learn from it. All it takes is a solid understanding of the fundamentals, a strong sense of discipline, and the open-mindedness to abandon what you know for what you can know if you maintain a beginner’s mind...
I've talked about mentoring in the past and what a good mentee looks like. Here I'd like to highlight the differences between a student and a mentee since it's easy to confuse the two.
First of all, a student is someone who is generally more passive than a mentee. The latter takes initiative and feels responsible for her progress in the field of study. The former often outsources this to the instructor(s) and focuses more on passing exams rather than actually learning.
In addition, a mentee has a closer connection with the person helping him learn, namely the mentor. The student's relationship with his instructors is more impersonal, mainly because the latter have lots of students to deal with and can't usually focus on every single one unless it's for their dissertation project or something. The mentor, however, is more dedicated to getting to know the mentee better and coach him accordingly.
Moreover, the mentee-mentor relationship extends beyond academia, even though it can exist in a university too (e.g. in the case of a PhD program). More often than not, mentees and mentors are working professionals, and the former tend to already have a degree.
Furthermore, mentees tend to have a more focused approach to learning, usually related to a specific field, much like an apprentice. The student, on the other hand, may study lots of different fields, as part of her curriculum at the college or university.
Interestingly, although there are several university courses on Data Science these days, most people who learn the craft tend to do so either independently or with the help of a mentor. Perhaps there is something to this, beyond mere coincidence. That's not to say that university courses are bad, but oftentimes learning data science as a mentee tends to be more cost-effective and time-efficient.
What are your thoughts and experiences on the matter?
Lately, I’ve been thinking a lot about what information is in a data science context, partly because of a couple of projects I’m involved in, and partly because that’s what I enjoy thinking about in my leisure time. After all, there are people who care about data science in a deeper way, as it’s more than a profession for them, something that commands a certain level of dedication that others may not comprehend. As I’m one of those people, I can attest to the unique beauty of the field and the qualities of it that keep it evergreen and ever-interesting.
For the longest time, I was under the impression that it’s the data points or the features that contain the information in a dataset. After all, that’s what most data science sources imply, and it makes some intuitive sense. However, lately I’ve experimented with new algorithms that can generate new data points and new features, while others manage to reduce the number of data points without information loss (aka intelligent sampling) or summarize the same information in a smaller number of features through (usually non-linear) combinations of the original features (aka feature fusion). In all these cases, there isn’t any new information generated and there isn’t any significant information loss. What’s more, if you can effectively replace the original dataset with synthetic data without losing any information, then the claim that the information is basically the original data doesn’t hold any water. In other words, information exists with or without the data at hand, since the same information can often be expressed more eloquently with a more succinct set of data points or features. Much like the essence of an ice cube is not the exact molecules of the water in it, but the fact that it consists of water and has a certain shape, dictated by the ice cube tray mold.
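To make the point concrete, here’s a minimal sketch (not any of the algorithms alluded to above, just standard linear algebra) showing how a 10-feature dataset whose information really lives in 3 latent dimensions can be fused into 3 features and rebuilt with negligible loss:

```python
import numpy as np

rng = np.random.default_rng(0)

# Build a dataset whose 10 features are really combinations of 3 latent ones,
# so most of its "information" lives in a 3-dimensional subspace.
latent = rng.normal(size=(500, 3))
mixing = rng.normal(size=(3, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))  # plus a little noise

# Fuse the 10 features into 3 via truncated SVD (a linear form of feature fusion).
U, s, Vt = np.linalg.svd(X, full_matrices=False)
X_fused = U[:, :3] * s[:3]          # 500 x 3 summary of the data
X_rebuilt = X_fused @ Vt[:3, :]     # map back to the original 10 features

# The relative reconstruction error is tiny: almost no information was lost,
# even though we kept less than a third of the features.
rel_error = np.linalg.norm(X - X_rebuilt) / np.linalg.norm(X)
print(rel_error)
```

The same logic applies to intelligent sampling along the rows: if the retained data expresses the same structure, the information survives the shrinkage.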
From all this, it follows that what we need is an information-rich dataset, i.e. a dataset that contains useful information without excessive data points or excessive features. Of course, it’s not always easy to perform the transformations required to accomplish this, but it is feasible, and most modern A.I. systems are proof of that. Whether this black-box approach is the most effective way to accomplish this information distillation, however, is something that needs to be investigated. In my view, looking into matters like these and having this perspective is far more important than all the technical know-how about the latest and greatest machine learning system, know-how that is oftentimes superficial when not accompanied by the data science mindset. The latter is super important, yet it cannot be described in a simple blog article. I’ve written a book trying to explain it, and even that may not have done it justice.
Anyway, pondering these things may seem a bit philosophical, but if this pondering is transformed into concrete and actionable insights that can help improve existing data science methods or spawn new ones, then it’s probably more than just theoretical. Perhaps it’s this pondering that helps keep data science fresh in our minds, preventing it from becoming a mechanical process devoid of any life and inspiration. After all, just because many people have forgotten what lured them to data science, it doesn’t mean that this is the only course of action. Someone can practice data science and still be enthusiastic about it, while maintaining a sense of creative curiosity about the subject. It’s all a matter of perspective...
Short answer: Nope! Longer answer: clustering can be a simple deterministic problem, given that you figure out the optimal centroids to start with. But isn’t the latter the solution of a stochastic process though? Again, nope. You can meander around the feature space like a gambler, hoping to find some points that can yield a good solution, or you can tackle the whole problem scientifically. To do that, however, you have to forget everything you know about clustering and even basic statistics, since the latter are inherently limited and frankly, somewhat irrelevant to proper clustering.
Finding the optimal clusters is a two-fold problem: 1. you need to figure out which solutions make sense for the data (i.e. a good value for K), and 2. you need to figure out these solutions in a methodical and robust manner. The former has been resolved as a problem and it’s fairly trivial. Vincent Granville talked about it on his blog many years ago, and since he is better at explaining things than I am, I’m not going to bother with that part at all. My solution to it is a bit different, but it’s still heuristics-based. The second part of the problem is also the more challenging one, since it’s something many people have been pursuing for a while now, without much success (unless you count the super slow method of DBSCAN, with more parameters than letters in its name, as a viable solution).
To find the optimal centroids, you need to take into account two things: the density around each centroid and the distances of each centroid to the other ones. Then you need to combine the two into a single metric, which you need to maximize. Each of these problems seems fairly trivial, but what many people don’t realize is that in practice it’s very hard, especially if you have high-dimensional data (where conventional distance metrics fail) and lots of it (making the density calculations a major pain). Fortunately, I found a solution to both of these problems using 1. a new kind of distance metric that yields a higher U value (the heuristic used to evaluate distance metrics in higher-dimensional space), though with an inevitable compromise, and 2. a far more efficient way of calculating densities. The aforementioned compromise is that this metric cannot guarantee that the triangle inequality holds, but then again, this is not something you need for clustering anyway. As long as the clustering algo converges, you are fine.
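To illustrate the general idea only (this is emphatically not my method; it uses plain Euclidean distance, a naive radius-based density, and an arbitrary way of combining the two), here’s a toy sketch that scores a set of candidate centroids by rewarding both high local density and good separation from the other centroids:

```python
import numpy as np

rng = np.random.default_rng(42)

# Toy data: two well-separated Gaussian blobs in 2-D.
X = np.vstack([rng.normal(0, 0.5, (100, 2)),
               rng.normal(5, 0.5, (100, 2))])

def score_centroids(X, centroids, radius=1.0):
    """Combine the density around each candidate centroid with its distance
    to the other centroids into one score to maximize (higher = better)."""
    # Distance of every point to every centroid
    dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    # Density: fraction of points falling within `radius` of each centroid
    density = (dists < radius).sum(axis=0) / len(X)
    # Separation: distance from each centroid to its nearest fellow centroid
    sep = np.array([np.linalg.norm(c - np.delete(centroids, i, axis=0), axis=1).min()
                    for i, c in enumerate(centroids)])
    return (density * sep).sum()  # a naive product of the two criteria

good = np.array([[0.0, 0.0], [5.0, 5.0]])  # centroids on the blob centers
bad = np.array([[2.5, 2.5], [2.6, 2.6]])   # centroids in the gap, nearly overlapping
print(score_centroids(X, good) > score_centroids(X, bad))  # True
```

Even this crude version prefers the sensible centroids; the hard part, as noted above, is making the distance and density components behave in high dimensions and at scale.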
Preliminary results of this new clustering method show that it’s fairly quick (even though it searches through various values of K to find the optimum one) and computationally light. What’s more, it is designed to be fairly scalable, something that I’ll be experimenting with in the weeks to come. The reason for the scalability is that it doesn’t calculate the density of each data point, but of certain regions of the dataset only. Finding these regions is the hardest part, but you only need to do that once, before you start playing around with K values.
Anyway, I’d love to go into detail about the method, but the math I use is different from anything you’ve seen and beyond what is considered canon. Then again, some problems need new math to be solved, and perhaps clustering is one of them. Whatever the case, this is just one of the numerous applications of this new framework of data analysis, which I call AAF (alternative analytics framework), a project I’ve been working on for more than 10 years now. More on that in the coming months.
Someone may wonder why anyone would write a technical book these days, especially when the earnings from such an endeavor are small and getting smaller. I wondered about the same thing once, until eventually I came to realize some key reasons why such an endeavor is indeed worthwhile, particularly in the Data Science area. I’d like to share with you here the most important of these reasons (benefits), knowing full well that these are just my own insights and that you’d be able to find your own, should you ever consider writing a technical book.
First of all, when you discover something, even if it’s not entirely new, the natural next step is to try to share it with others, be it to advance your career, build your personal brand, or for whatever other reason drives you. If you are serious about this task, you’ll want to make sure that whatever you deliver meets a certain quality standard, and writing a book on the topic can help you accomplish that. If you are more inclined to use film as your medium, you may decide to go with a video instead, but that would require more time and significantly more expertise. So, a technical book would be a more viable option, especially if what you have to say has enough commercial value to attract a publisher. You as an author may not be motivated by the royalties you’ll receive from this project, but no publisher would publish something that isn’t going to pay for the paper or digital storage it’s going to need.
Also, through writing such a book, you realize what you don’t know and develop a more balanced approach to the whole subject, since you become more aware of what’s out there. The arrogance you may have harbored as a newcomer will gradually give way to humility and a deeper appreciation of the field, something that the research required for the book is bound to cultivate in you. Besides, even the things you know well, you may find you cannot express comprehensibly, something that the editor will be more than happy to let you know! So, your development as a professional in this field will be (at least) two-fold: related to your knowledge of the subtle aspects of the field, and related to your ability to express all that effectively and eloquently.
Finally, writing a technical book, particularly one that is marketed professionally by a publisher, enables your thoughts to cross lots of borders, reaching people you wouldn’t normally find on your own. This will expose you to a larger variety of feedback on your work, which can help you grow further as a professional. Not all of this feedback is going to be useful, but at least some of it is bound to be. Besides, the people who read your work are likely to be people who valued it enough to pay for it beforehand, be it through the publisher or through a subscription to a technical knowledge platform. Either way, they will most likely be people driven by genuine care about the subject, not just curiosity.
I could go on about this for a while, perhaps even write a book on this topic! However, as I respect your time, I’ll leave it at this. What other benefits of writing a technical book can you think of? Do they justify to you the undertaking of such a project?
Puzzles, especially programming-related ones, can be very useful for honing one’s problem-solving skills. This has been known for a while among coders, who often have to come up with their own solutions to problems that lend themselves to analytical solutions. Since data science often involves similar situations, it is useful to have this kind of experience, albeit in moderation.
Programming puzzles often involve math problems, many of which require a lot of specialized knowledge to solve. These sorts of problems may not be that useful, since it’s unlikely that you’ll ever need that knowledge in data science or any other applied field. Number theory, for example, although very interesting, has little to do with the hands-on problems we are asked to solve in a data science setting.
The kinds of problems that benefit a data scientist the most are the ones where you need to come up with a clever algorithm as well as do some resource management. It’s easy to think of the computer as a system with infinite resources, but that’s not the case. Even in the case of the cloud, where resource limitations are more lenient, you still have to pay for the resources you use, so it’s unwise to use them willy-nilly unless you have no other choice.
Fortunately, there are lots of places on the web where you can find good programming challenges. I find that the ones that are not language-specific are the best, since they focus on the algorithms rather than the technique.
Solving programming puzzles won’t make you a data scientist, for sure, but if you are new to coding, or if you can use coding to express your problem-solving creativity, it’s definitely something worth exploring. Always remember, though, that being able to handle a dataset and build a robust model is always more useful, so budget your time accordingly.
With all the talk about Data Science and A.I., it’s easy to forget about the person doing all this and how his awareness grows as he gets more involved in the field of data science. What’s more, all those gurus on social media will tell you anything about data science except this sort of stuff, since they prefer to have you as a dependent follower rather than an autonomous individual making his own way as a data scientist.
So, as you enter the field of data science, you are naturally focused on its applications and the tangible benefits of it. As a professional in this field, you may care about the high salary such a vocation entails or the cool stuff you may build, using data science methods. Everything else seems like something you have to put up with in order to arrive at this place where you can reap the fruits of your data science related efforts. It’s usually at this level of awareness that you see people complain about the field as being too hard, or not engaging enough after a while. Still, this level is important because it often provides you with a strong incentive to continue learning about this field, growing more aware of it.
The second level of data science awareness involves a deeper understanding of the field and an appreciation of its various tools, methods, and algorithms. People who dwell at this level of awareness either come from academia or end up spending a lot of time in academic endeavors; in the worst case, they become fanatics of this or that technology, seeing all others as inferior, along with the people who prefer them. The same goes for the methods involved, since there are data scientists who swear by the models they use and wouldn’t use any others unless they absolutely had to. This is the level where most people end up, since it’s quite challenging to transcend, especially on your own.
Finally, when you reach the third level of data science awareness, you are more interested in the data and the possibilities it offers. You have a solid understanding of most of the methods used and can see beyond them, since they all seem like instances of the same thing. Your interest in data engineering grows, and you become more comfortable with processes that, for most people, are either esoteric or mundane. Heuristics seem far more interesting, while you begin to question things that others take for granted regarding how data should be used. The best part is that you can see through the truisms (and other useless information) of the various “experts” on social media and value your experience and intuition more than what you may read in this or that book on the subject.
It’s fairly easy to figure out which level you are at in your data science journey. Most importantly, the level itself doesn’t matter as much as being aware of it and making an effort to move on, going deeper into the field. Because, just like other aspects of science, data science can be a path of sorts, rather than just the superficial field of work that many people make it appear to be. So, if you want to find meaning in all this, it’s really up to you!
Although there is a plethora of heuristics for assessing the similarity of two arrays, few of them can handle arrays of different sizes, and even fewer can address the various aspects of these pieces of data. PCM is one such heuristic, which I came up with in order to answer the question: when are two arrays (vectors or matrices) similar enough? The idea was to use this metric as a proxy for figuring out when a sample is representative enough, in terms of both the distributions of its variables and its ability to reflect the same relationships among them. PCM manages to accomplish the first part.
PCM stands for Possibility Congruency Metric, and it primarily makes use of the distribution of the data involved as a way to figure out whether there is congruency or not. Optionally, it uses the differences between the mean values and the variances too. The output is a number between 0 and 1 (inclusive), denoting how similar the two arrays are; the higher the value, the more similar they are. Random sampling yields PCM values around 0.9, but more careful sampling can reach values closer to 1.0, for the data tested. Naturally, there is a limit to how high this value can get with respect to the sample size, because of the inevitable loss of information through the process.
PCM works in a fairly simple and therefore scalable manner. Note that the primary method focuses on vectors. The distribution information (frequencies) is acquired by binning the variables and examining how many data points fall into each bin. The number of bins is determined by taking the harmonic mean of the optimum numbers of bins for the two arrays and rounding it to the closest integer. Then the absolute difference between the two frequency vectors is taken and normalized. If the mean-and-variance option is active, the mean and variance of each array are calculated and their absolute differences taken. Each of these differences is then normalized by dividing by the maximum difference possible for these particular arrays. The larger array is always taken as the reference point.
When calculating the PCM of two matrices (which need to have the same number of columns), the PCM of each pair of corresponding columns is calculated first. Then the average of these values is taken and used as the PCM of the whole matrix. The PCM method also yields the PCMs of the individual variables as part of its output.
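Here is a rough sketch of the vector and matrix cases as described above. Note that the precise “optimum” bin-count rule and normalization are not specified, so this version assumes Sturges’ rule for the per-array bin counts and a total-variation-style normalization of the frequency differences; the actual heuristic may well differ:

```python
import numpy as np

def pcm_vector(a, b):
    """Sketch of PCM for two 1-D arrays of possibly different lengths.
    Assumes Sturges' rule as the 'optimum' bin count per array."""
    sturges = lambda n: int(np.ceil(np.log2(n) + 1))
    k_a, k_b = sturges(len(a)), sturges(len(b))
    # Harmonic mean of the two bin counts, rounded to the nearest integer
    k = int(round(2 * k_a * k_b / (k_a + k_b)))
    # Bin both arrays over their common range and compare relative frequencies
    lo, hi = min(a.min(), b.min()), max(a.max(), b.max())
    fa, _ = np.histogram(a, bins=k, range=(lo, hi))
    fb, _ = np.histogram(b, bins=k, range=(lo, hi))
    diff = np.abs(fa / len(a) - fb / len(b)).sum() / 2  # normalized to [0, 1]
    return 1.0 - diff  # 1 = identical distributions, 0 = no overlap at all

def pcm_matrix(A, B):
    """PCM of two matrices with the same number of columns: the average of
    the column-wise PCMs, returned alongside the individual values."""
    per_column = np.array([pcm_vector(A[:, j], B[:, j]) for j in range(A.shape[1])])
    return per_column.mean(), per_column

# A random 500-row sample of a 10,000-row dataset scores close to 1
rng = np.random.default_rng(1)
pop = rng.normal(size=(10_000, 3))
sample = pop[rng.choice(len(pop), 500, replace=False)]
overall, per_col = pcm_matrix(pop, sample)
print(round(overall, 2))
```

The mean-and-variance option and the exact reference-point handling are omitted here for brevity, but they would follow the same normalize-to-[0, 1] pattern.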
PCM is great for figuring out the similarity of two arrays through an in-depth view of the data involved. Instead of looking at just the mean and variance metrics, which can be deceiving, it makes use of the distribution data too. The fact that it doesn’t assume any particular distribution is a plus, since it allows for biased samples to be considered. Overall, it’s a useful heuristic to know, especially if you prefer an alternative approach to analytics to what Stats has to offer.
Recently I came across an interesting platform for sharing curated content, called Wakelet. It's a British startup from Manchester, by the way, one that appears quite promising, provided they find a way to monetize their project.
Anyway, the platform is a bit like Pinterest but with more features and an offline presence too. These are the most important features, in my view:
* Very intuitive and fast to learn
* Can work with a variety of content types: videos, images, formatted text, PDFs, and website links
* Every list can be exported to a PDF
* Free to use
* No account is required to view the lists
* Lots of free images to use as thumbnails and backgrounds
* QR code is generated for each list you wish to share
* Private lists are also an option
* Plenty of tutorials online that explain the various features and use-cases
You can check out a wake (that's what these curated lists are called) that I've made in the space of a few minutes, here. In the future I'll probably be using it more, particularly on this blog. Whatever the case, do let me know what you think of this platform and of my wake. Cheers!
If you wish to put yourself out there as a content creator, now more than ever, videos are a great way to do it. This may seem somewhat daunting to some, but with the plethora of software options out there and the ease of use of many of them, it’s just a matter of making your resolve to do it. Apart from the obvious benefit of personal branding, creating a data science or A.I. video can also be lucrative as an endeavor.
I’m not referring to the amateur videos many people on YouTube make, in their vain attempts to gather likes and shares, much like beggars gather pitiful coins from the passers-by. If you want to create a technical video that will be worth your while, there are better and more self-respecting options to do so, options that you would be happy to include in your resume/CV. Namely, you can create a video that you promote through a respectable publisher, such as Technics Publications. Such an alternative will enable you to receive royalties every 6 months and not have to worry about promoting your work all by yourself. Of course, there is also the option of a one-time payment that some publishers offer, but this isn’t nearly as appealing since the amount of money you can potentially earn through royalties is higher and the requirements are easier to meet.
When creating a video, many people think that it’s just you standing in front of a camera, talking ad lib about a topic, perhaps using some props like a whiteboard. Although that’s one straightforward way to do it, it may not appeal to less charismatic presenters or those who don’t consider themselves particularly photogenic. Besides, a screen-share video or a slideshow with a voice-over, always based on a script, is much easier to produce and sometimes more effective at illustrating the points you wish to make. Alternatively, you can try combining both approaches, though this may require more takes.
Whatever the case, making the video is the easy part of the whole project, relatively speaking. What ensues is the most challenging task for most people: promoting the video to your target audience. Although social media has an important role to play in all this, having some support from a publisher is priceless. After all, promoting technical content is what publishers are really good at, especially if they have a good niche in the market. Still, if you have a large enough network, it doesn’t hurt to spread the word yourself too, for additional exposure, though you are not required to do so.
If you are interested in covering a data science or A.I. related topic with a video, through a publisher, feel free to contact me directly, as I’d be happy to help you in that, particularly if you are serious about it. The world could definitely use some new content out there for data science and A.I. since there is way too much noise, confounding those who wish to study these fields. Perhaps this new content could come from you.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.