There are many mistakes that can be made in data science, many of which can go unnoticed for a while. The reason is that, unlike coding bugs, these mistakes don't throw an error or an exception, which makes them harder to spot and fix. In my view, the biggest such mistake is thinking that one aspect of data science is so much better than the others that the latter don't matter much. I used to think like that back in my PhD days (my thesis was on Machine Learning and heuristics), but fortunately I discovered the error of my thinking and started broadening my perspective on this matter, something I continue to do as I learn more about this fascinating field.
Let's look into this more closely. For starters, there are several frameworks or toolkits available in data science today, ranging from Statistics to Machine Learning and, lately, A.I. based models. All of them have their own advantages as well as limitations. Many Machine Learning models, for example, particularly A.I. based ones (mainly ANNs), are very hard to interpret and are often referred to as black boxes. Stats models, on the other hand, may be easy to interpret, but they may not be as accurate, and they rest on a number of assumptions which don't always hold true. That's why claiming that one of these frameworks or toolkits is the best one, at the expense of the others, is a very shaky position.
However, with all the hype around the latest and greatest Deep Learning methods (and other A.I. based models used in Data Science), it's difficult to argue against this position. Also, with Statistics having such a good reputation in academia and proven applicability across different domains, it's equally hard to argue that it isn't as good a framework. This may be good in a way, since it keeps us humble, but it may also obstruct progress. How can you have the nerve to put forward something new if it doesn't comply with what is considered "the best" or with the traditional approaches to learning from data, such as Statistical Learning?
I'm not claiming to have a solution to this conundrum, by the way, and perhaps it's not something that can be answered simply. However, these kinds of riddles that plague the data science field can be good food for thought and bring about a sense of genuine wonder about the prospects and the future of data science. Maybe when someone asks us what the best framework of data science is, it's better to say "I don't know" and consider using different ones in tandem, instead of flocking to one group or another of people who have made up their minds about this and are unlikely to ever change them. After all, open-mindedness is something that never gets old, at least not in a truly scientific field.
Being open-minded has been a key trait of any scientist since the beginning of Science. The scientific method is basically a practice that relies on open-mindedness, focusing on testing a hypothesis based on the evidence at hand. However, nowadays there is a trend towards a more dogmatic attitude (for lack of a better word) when it comes to the science of data, as well as the application of A.I. in it.
Open-mindedness is not just being open about the results of an experiment, though. That's easy. Being open to other people's ideas and beliefs is also important. It's easy to dismiss some people, especially those writing about this matter without the training you may have in the field. Still, those people may have some interesting insights, which they often express in their articles. You don't have to agree with them in order to gain from this and expand your perspective. However, dismissing an article because it uses some term or other (which in your opinion is not that relevant to the topic it tackles) is closed-minded.
That's not to say that we should accept everything we read, however. Some of the material out there is of low informational value and can be biased towards one technology or another, for various reasons. That's normal, since the field of data science (as well as A.I., to some extent) is closely linked to the business world and is influenced by the dynamics of the market for tools and frameworks related to data analytics.
So, what do we do about all this? For starters, we can read an article before we dismiss it as irrelevant or otherwise problematic. Also, if we don't agree with the author about something, we can construct arguments against that point and express them without attacking the other person. There are people who are incredibly toxic and pose a threat to the field by propagating their erroneous beliefs, but fortunately, these are few. Also, they are probably beyond salvation, since they have too large a following to ever question their beliefs. Still, by countering their propaganda, we can help the people who haven't made up their minds yet on the topic.
Perhaps that's why the most important thing you can learn about data science and A.I. is to have a mindset that is congruent with your development as a professional, always maintaining an open mind. Just because there are fanatics in this field who are getting paid way more than they should and maintain a large following due to their charisma, it doesn't mean that this is the best way to go. It's not easy to be open-minded in a place where fanaticism thrives, but in the long run, it's a viable strategy. After all, data science is here to stay, in one form or another, while the views on it that are now popular are bound to change.
As experience and knowledge accumulate in our minds, it's increasingly easy to lose touch with that original spark that brought us into this journey of learning, in the fascinating field of data science. I'm referring to that sense of wonder that made all this otherwise dry know-how of math, programming, and data something we could lose sleep over. Because if you are really in a state of wonder, it's easy to forget to eat, postpone other tasks, and even find sleep somewhat less important, when your other option is delving deeper into the craft.
A sense of wonder, however, is much more than curiosity or even interest in data science. It is all that, but it's also a way of feeling, a higher sentiment if you will. Being in a state of wonder is what incites us to wonder and go into more depth. It is what makes a seemingly mundane task, such as data cleaning, appear intriguing and valuable. It is what makes learning about a new model truly interesting, not just as a memory-based activity, but as something that sparks imagination and innovation. It is wonder that makes us ask "what if?" instead of just being content with what is presented to us.
Naturally, this sense of wonder is fleeting, just like the perspective we have as newcomers to data science. The more we learn, the more limited our wanderings in the vast knowledge the field entails, since focusing on specific tasks and time frames becomes of the essence. That's normal, since as data scientists we need to be practical and attuned to the way the world works; otherwise, we'd be unemployable. Yet, at a certain point of aptitude and understanding of the craft, it is this sense of wonder that enables us to go further and grow beyond what we are expected to be.
The sense of wonder can be cultivated through a sincere wish to become better for the sake of being better, a wish nourished by our love for data science. Ambition can only take us so far, and after a while it can become stressful. Wanting to become better because of a lasting motivation is therefore essential for bringing about the sense of wonder. However, we also need to make time for it and allocate resources to such endeavors. Learning through a book or a crash course may be efficient, but it's what we do beyond this that enables us to learn deeply and cultivate the sense of wonder. Liaising with people who already have this sense strong in them, such as beginners who are dedicated learners of the craft, can be a great aid too. Finally, we need to think about the craft and experiment with new ideas. If we just rely on what this or that expert says, we are bound to be limited by them. We need to study existing ideas, but also dare to venture beyond them, exploring new models and new metrics. Most of these explorations are bound to lead nowhere, but some of them are bound to work and help us look at data science from a different angle.
Cultivating a sense of wonder isn’t easy and it’s an ongoing challenge. However, through it, new perspectives come about (such as some of the stuff I talk about in this blog periodically) while the connectedness of the various aspects of the field becomes apparent. All in all, it’s this perspective that makes the field truly wonderful, much more than a line of work. That’s something to wonder about...
In the previous article, I talked about the student-mentee dichotomy in a data science context. However, it's really the teacher-mentor dichotomy that is at the root of all this, while the data science field itself has a role to play too, something many learners of the craft have forgotten. In this article, we'll explore just that, in an attempt to gain a better perspective of how true learning works and what it takes to connect to the essence of this fascinating field, which is being tainted by those who see it as merely a career-boosting opportunity.
Although there is nothing wrong with the role of a professor in any field, especially the fields of science, it’s important to highlight a distinct difference between a professor and a mentor. The former is usually geared towards giving a set of lectures, in order to fulfill the requirements of his/her professional position, something that may or may not be adequate for conveying the essence of the field, especially for a field as complex as data science. It’s not that the professor doesn’t care for all this, but the nature of this profession makes it incredibly difficult, if not impossible, to do this justice. After all, most of these professional educators have other priorities, such as their research.
We mentors, on the other hand, help others learn about data science not because it's our job, but because we care about the field, having other sources of income to cover our daily expenses. Of course, we may still have monetary benefits linked to mentoring, but they are generally not the key motivation. Also, we share knowledge about the field based on our own experience, rather than some curriculum which may not always align with it. Finally, our connection with the people we help (mentees) is more direct and tailored to their needs, rather than generic and impersonal.
It's important to note that these two roles, although different, may still overlap. There are professors who can act as mentors, though they usually do this outside the classroom, as in the case of their supervisory role for a PhD student. Also, someone can be a mentor and still work part-time in a university. So, it's good to maintain a flexible view of this whole matter.
Anyway, if you are willing to learn data science in depth, it's definitely better to do so through a mentor, particularly one with a diversity of experiences in the field. But what about the mentor himself? Where does he learn about all this? In many cases, a mentor may have another mentor to learn from, though it is also possible that the data science field itself is that person's mentor. After all, data science is a living field, dynamic and ever-changing, with plenty of things to teach those who are willing to learn from it. Many of its secrets have been discovered, but there is still a lot that is uncharted territory. That's something data science can teach anyone who is willing to learn from it. All it takes is a solid understanding of the fundamentals, a strong sense of discipline, and the open-mindedness to abandon what you know for what you can know, if you maintain a beginner's mind...
I've talked about mentoring in the past and what a good mentee looks like. Here I'd like to highlight the differences between a student and a mentee since it's easy to confuse the two.
First of all, a student is someone who is generally more passive than a mentee. The latter takes initiative and feels responsible for her progress in the field of study. The former often outsources this to the instructor(s) and focuses more on passing exams rather than actually learning.
In addition, a mentee has a closer connection with the person helping him learn, namely the mentor. The student's relationship with his instructors is more impersonal, mainly because the latter have lots of students to deal with and can't usually focus on every single one unless it's for their dissertation project or something. The mentor, however, is more dedicated to getting to know the mentee better and coach him accordingly.
Moreover, the mentee-mentor relationship extends beyond academia, even though it can exist within a university too (e.g., in the case of a PhD program). More often than not, mentees and mentors are working professionals, and the former tend to already have a degree.
Furthermore, mentees tend to have a more focused approach to learning, usually related to a specific field, much like an apprentice. The student, on the other hand, may study lots of different fields, as part of her curriculum at the college or university.
Interestingly, although there are several university courses on Data Science these days, most people who learn the craft tend to do so either independently or through the help of a mentor. Perhaps there is something to this, more than mere coincidence. That's not to say that university courses are bad, but oftentimes learning data science as a mentee tends to be more cost-effective and time-efficient.
What are your thoughts and experiences on the matter?
Short answer: Nope! Longer answer: clustering can be a simple deterministic problem, provided you figure out the optimal centroids to start with. But isn't finding those centroids the solution of a stochastic process? Again, nope. You can meander around the feature space like a gambler, hoping to find some points that yield a good solution, or you can tackle the whole problem scientifically. To do that, however, you have to forget everything you know about clustering and even basic statistics, since the latter are inherently limited and, frankly, somewhat irrelevant to proper clustering.
Finding the optimal clusters is a two-fold problem: 1. you need to figure out which solutions make sense for the data (i.e. a good value for K), and 2. you need to figure out these solutions in a methodical and robust manner. The former has been resolved as a problem and is fairly trivial. Vincent Granville talked about it in his blog many years ago, and since he is better at explaining things than I am, I'm not going to bother with that part at all. My solution to it is a bit different, but it's still heuristics-based. The second part of the problem is the more challenging one, since it's something many people have been pursuing for a while now, without much success (unless you count the super slow method of DBSCAN, with more parameters than letters in its name, as a viable solution).
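To give a sense of the first part in conventional terms, here is a minimal sketch that scans a range of K values and scores each with the silhouette coefficient, using scikit-learn's standard k-means. This is a common baseline shown purely for illustration; it is neither Granville's heuristic nor the one I allude to above.

```python
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def pick_k(X, k_range=range(2, 11)):
    """Pick a value of K via the silhouette score.

    A conventional baseline for illustration only -- not the heuristic
    referenced in the text above.
    """
    best_k, best_score = None, -1.0
    for k in k_range:
        # Fit standard k-means and label every point
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        # Higher silhouette = denser, better-separated clusters
        score = silhouette_score(X, labels)
        if score > best_score:
            best_k, best_score = k, score
    return best_k
```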
To find the optimal centroids, you need to take into account two things: the density of each centroid and the distances of each centroid to the other ones. Then you need to combine the two into a single metric, which you need to maximize. Each of these sub-problems seems fairly trivial, but something many people don't realize is that in practice it's very, very hard, especially if you have multi-dimensional data (where conventional distance metrics fail) and lots of it (making the density calculations a major pain). Fortunately, I found a solution to both of these problems using 1. a new kind of distance metric that yields a higher U value (this is the heuristic used to evaluate distance metrics in higher dimensional space), though with an inevitable compromise, and 2. a far more efficient way of calculating densities. The aforementioned compromise is that this metric cannot guarantee that the triangle inequality holds, but then again, this is not something you need for clustering anyway. As long as the clustering algo converges, you are fine.
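To illustrate only the general shape of such a combined metric, here is a toy sketch that scores a set of candidate centroids as average density times average pairwise separation. The Euclidean distance, the fixed radius, and the product used to combine the two are placeholder choices for illustration, not the metric discussed above.

```python
import numpy as np
from itertools import combinations

def centroid_score(X, centroids, radius=1.0):
    """Toy quality score: average centroid density times average separation.

    Euclidean distance, the fixed radius, and the product combination are
    placeholder choices for illustration, not the metric described in the post.
    """
    centroids = np.asarray(centroids)

    # Density: fraction of points within `radius` of each centroid
    densities = [np.mean(np.linalg.norm(X - c, axis=1) < radius)
                 for c in centroids]

    # Separation: average pairwise distance between centroids
    separations = [np.linalg.norm(centroids[i] - centroids[j])
                   for i, j in combinations(range(len(centroids)), 2)]

    # Combine the two into a single figure to maximize
    return np.mean(densities) * np.mean(separations)
```

You would then compare this score across candidate centroid sets (and K values) and keep whichever maximizes it.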
Preliminary results of this new clustering method show that it’s fairly quick (even though it searches through various values of K to find the optimum one) and computationally light. What’s more, it is designed to be fairly scalable, something that I’ll be experimenting with in the weeks to come. The reason for the scalability is that it doesn’t calculate the density of each data point, but of certain regions of the dataset only. Finding these regions is the hardest part, but you only need to do that once, before you start playing around with K values.
Anyway, I'd love to go into detail about the method, but the math I use is different from anything you've seen and beyond what is considered canon. Then again, some problems need new math to be solved, and perhaps clustering is one of them. Whatever the case, this is just one of the numerous applications of this new framework of data analysis, which I call AAF (alternative analytics framework), a project I've been working on for more than 10 years now. More on that in the coming months.
Recently I came across an interesting platform for sharing curated content, called Wakelet. It's a British startup from Manchester, by the way, one that appears quite promising, provided they find a way to monetize their project.
Anyway, the platform is a bit like Pinterest but with more features and an offline presence too. These are the most important features, in my view:
* Very intuitive and fast to learn
* Can work with a variety of content types: videos, images, formatted text, PDFs, and website links
* Every list can be exported to a PDF
* Free to use
* No account is required to view the lists
* Lots of free images to use as thumbnails and backgrounds
* A QR code is generated for each list you wish to share
* Private lists are also an option
* Plenty of tutorials online that explain the various features and use-cases
You can check out a wake (that's what these curated lists are called) that I've made in the space of a few minutes, here. In the future I'll probably be using it more, particularly on this blog. Whatever the case, do let me know what you think of this platform and of my wake. Cheers!
This famous Buddhist quote is one of my personal favorites and one that Bruce Lee also used in one of his movies. Although it may seem more relevant to some Eastern philosophy or martial arts, it actually has a lot of relevance in data science too.
Through this blog, my books, and my videos, I've put forward some ideas and hopefully some useful knowledge for anyone interested in data science and A.I. However, it's easy to mistake conviction for cult-like hegemony, something I've observed a lot on social media. Whenever someone competent enough to have a good professional role and some prestige comes about, many people choose to become his or her followers, treating that person as a guru of sorts. This, in my view, is one of the most toxic things someone can do, and it's best avoided at all costs. That's not to say that all those people who have followers are bad, far from it! However, the act of blindly following someone just because of their status and/or their conviction is dangerous. You may get lots of information this way, but you will lose the most important thing in your quest: initiative.
Of course, some of these people are happy to have a following and couldn’t care less about your loss of initiative. After all, they often measure their value in terms of how many followers they have, how many downloads their free book has, and how many likes they receive. This in and of itself should raise some serious red flags because no matter how much data science or A.I. know-how these individuals have, the path they are on doesn’t go anywhere good.
I'm a firm believer in free will and I value it more than anything else, especially in the domain of science. As data science (and A.I.) are part of this domain, it's imperative to show respect to this quality, even at the expense of a large following. That's why whenever I share something with you, be it some data science methodology, some A.I. system, some heuristic, or some ideas about our field, I expect you to experiment with it and draw your own conclusions. Don't take my word for it, because even though I make an effort to verify everything I write about, some inaccuracies are inevitable. After all, data science and A.I. are not exact sciences!
Naturally, it takes more than experimentation to learn data science and A.I., but with some guidance, some contemplation, some skepticism, and some experimentation, it is quite doable to learn and eventually master this craft. That has been my experience both for my own journey in data science and A.I., as well as in the journeys of my mentees. Hopefully, your experience will be equally rewarding and educational...
With so many ways to get a book out there, even in a fairly challenging subject such as data science, you may wonder what this process entails and what is the best way to go about it. After all, these days it’s easier than ever to reach an audience online and promote your work, all while branding yourself as a professional in the field.
Writing a book in data science is first and foremost an educational initiative, targeting a particular audience. Usually, this is data science learners, though it may be other professionals involved in data science, such as managers, developers, etc. A data science book generally tries to explain what data science can do, what its various methodologies are, and how all of that can be useful for solving particular problems (emphasis on the last part!). If you see a book that focuses a lot on the methods, particularly those of one particular methodology, it may be too specialized to be of use to most audiences, unless you are targeting that particular niche that requires this specific know-how.
A key thing to consider when exploring the option of writing a book is the publisher. Even if you prefer to self-publish, your book must be able to compete with other books in this area, and a publisher is usually the best way to figure that out. If a publisher is interested in your book, then it's likely to be somewhat successful. Also, if you are new to book authoring, you may want to start with a publisher, since there are a lot of things you'd never learn without one. Moreover, a book released through a publisher is bound to have more credibility and a longer life-span.
Understandably, you may have explored the various deals publishers make with their authors and figured out that you'll never make a lot of money by publishing books. Fair enough; you'll probably never make a living by selling your words (although it is still possible). However, if your book is good, you'll probably make enough money to justify the time you've put into the project. Also, remember that most publishing deals provide you with a passive income, even if the publisher wants you to promote your book to some extent. So, even though you won't make a lot of cash, you'll have a revenue stream for the duration of your book's lifetime.
With all the data science material available on the web these days, acquiring all the relevant information and compiling it into a book is a fairly straightforward task. However, just because it is feasible, it doesn't mean that it's what the readers need. Without someone to guide you through the whole process and give you honest feedback that is also useful, it's really hard to figure out what is necessary to put in the book, what should be included in an appendix, and what should be mentioned in a link. Your readers may or may not be able to provide you with this information, and if your main means of interacting with them is how many of them download your book or visit your website, you are just satisfying your ego!
A publisher's honest feedback often hurts, but that's what gradually turns you into a real author, namely one who has some authority in his/her written works. Otherwise, you'll be yet another writer, which is fine if you just want to talk about writing a book, or about that book of yours on Amazon, things that are bound to be forgotten quicker than you may think…
As the field of Data Science matures and everything in it is categorized and turned into a teaching module, compartmentalization may seem easier and more efficient as a learning strategy. After all, there are plenty of books on specialized topics of the craft. That's all great, and for some people it may even work satisfactorily, but that's where the risk lies, and it's a pretty big risk too!
Learning about something specialized in data science, particularly without a good sense of context or its limitations, can be catastrophic. The old saying "to someone who only knows how to use a hammer, everything starts looking like a nail" is applicable here too. Learning about a specialized aspect of data science can often make you think that this is the best approach to solving data science-related problems. After all, the author seems to know what he's talking about, and some employers value this skill. However, if this know-how is out of context, it is bound to be ineffective at best and problematic at worst. Data science is an interdisciplinary field with lots of different tools in it, from various areas. Anyone who tries to dissect it and focus mainly on one of them is doing a disservice to the field, and if you as a data science learner pay attention to this person, you are bound to warp your knowledge of the craft and delay your mastery of it.
Also, this overspecialization in know-how may make you think that you are better than the other data science practitioners who have not developed that niche skill yet. This will significantly limit your ability to learn from, and perhaps even cooperate with, these people. After all, you are an expert in this, so why bother with less fancy know-how at all? Well, sometimes even the more humble aspects of the field, such as feature engineering, can turn out to be more effective at solving a problem than some fancy model, so it's good to remember that.
That's why I've always promoted the idea of the right mindset in data science, something that, no matter how the field evolves, is bound to remain stable in the years to come and help you adapt to whatever know-how becomes the norm. Also, no matter how important the algorithms are, it's even more important to know how to create your own algorithms and change existing ones, optimizing them for the problem at hand. That's something that no data science book teaches adequately, as the emphasis is on covering material related to certain buzzwords, sometimes without the supervision of an editor. The latter can help immensely in making the contents of a book more comprehensible and relevant to data science in general, providing you with a sense of perspective.
So, be careful with what you let enter your data science curriculum as you learn about the craft. Some books may be a waste of time while others, especially those not published through a publisher, may even hinder your development as a data scientist.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.