Lately, I've been busy with preparations for my conference trips, hence my online absence. Nevertheless, I found time to write something for you all who keep an open mind to non-hyped data science and A.I related content. So, this time I'd like to share a few thoughts on programming for data science, from a somewhat different perspective.
First of all, it doesn't matter that much what language you use, if you have attained mastery of it. Even sub-Julia languages can be useful if you know how to use them well. However, in cases where you use a less powerful language, you need to know about lambda functions. I mastered this programming technique only recently because in Julia the performance improvement is negligible (unless your original code is inefficient to start with). However, as they make for more compact scripts, it seems like useful know-how to have. Besides, they have numerous uses in data science, particularly when it comes to:
Another thing that I’ve found incredibly useful, and which I mastered in the past few weeks, is the use of auxiliary functions for refactoring complex programs. A large program is bound to be difficult to comprehend and maintain, something that often falls into the workload of someone else you may not have a chance to help out. As comments in your script may also prove insufficient, it’s best to break things down to smaller and more versatile functions that are combined in your wrapper function. This modular approach, which is quite common in functional programming, makes for more useful code, which can be reused elsewhere, with minor modifications. Also, it’s the first step towards building a versatile programming library (package).
Moreover, I’ve rediscovered the value of pen and paper in a programming setting. Particularly when dealing with problems that are difficult to envision fully, this approach is very useful. It may seem rudimentary and not something that a "good data scientist" would do, but if you think about it, most programmers also make use of a whiteboard or some other analog writing equipment when designing a solution. It may seem like an excessive task that may slow you down, but in the long run, it will save you time. I've tried that for testing a new graph algorithm I've developed for figuring out if a given graph has cycles (cliques) in it or not. Since drawing graphs is fairly simple, it was a very useful auxiliary task that made it possible to come up with a working solution to the problem in a matter of minutes.
Finally, I discovered again the usefulness of in-depth pair-coding, particularly for data engineering tasks. Even if one's code is free of errors, there are always things that could use improvement, something that can be introduced through pair-coding. Fortunately, with tools like Zoom, this is easier than ever before as you don't need to be in the same physical room to perform this programming technique. This is something I do with all my data science mentees, once they reach a certain level of programming fluency and according to the feedback I've received, it is what benefits them the most.
Hopefully, all this can help you clarify the role of programming in data science a bit more. After all, you don't need to be a professional coder to make use of a programming language in fields like data science.
Rhythm in learning is something that most people don't think about, mostly because they take it for granted. If you were educated in a structure-oriented country, like most countries in the West, this would be instilled in you (contrary to countries like Greece where disorder and lack of any functional structure reign supreme). However, even then you may not value it so much because it is not something you're conscious of always. The need to be aware of it and make conscious effort comes about when you are on your own, be it as a freelancer or a learner in a free-form kind of course (i.e. not a university course of a boot camp). And just like any other real needs, this needs to be fulfilled in one way or another.
The idea of this article came about from a real situation, namely a session with one of my mentees. Although she is a very conscientious learner and a very good mentee, she was struggling with rhythm, mostly due to external circumstances in her life. Having been there myself, I advised her accordingly. The distillation of this is what follows.
So, rhythm is not something you need to strive for as it's built-in yourself as an innate characteristic. In other words, it's natural, like breathing and should come by on its own. If it doesn't, it's because you've put something in its way. So, you just need to remove this obstacle and rhythm will start flowing again on its own. This action of removal may take some effort but it's a one-time thing (unless you are in a very demanding situation in your life, in which case you need to re-set your boundaries). But how does rhythm manifest in practice? It's all about being able to do something consistently, even if it's a small amount certain days.
In my experience with writing (a truly challenging task in the long run, particularly when there is a deadline looming over you), I make it a habit of writing a bit every day, even if it's just a single paragraph or the headings and subheadings structure of a new chapter. Sometimes I don't feel like working on a book at all, in which case I take the time to annotate the corresponding Jupyter notebooks or write an article on this blog. Whatever the case, I avoid idleness like the plague since it's the killer of rhythm.
When it comes to learning data science and A.I., rhythm manifests as follows. You cultivate the habit of reading/coding/writing something related to the topic of your study plan or course curriculum. Even a little bit can go a long way since it's not that bit that makes the difference but the maintenance of your momentum. It's generally harder to pick up something that has gone rusty in your mind, particularly coding. However, if you coded a bit the previous day, it's so much easier. If you get stuck somewhere, you can always work on another drill or project. The important thing is to never give up and go idle.
Frustration is oftentimes inevitable but if you leverage it properly, it can be a powerful force as it has elements of willpower in it, willpower that doesn't have a proper outlet and it trapped. This is what can cause the break of rhythm but what can also remedy it. You always have the energy to carry on, even at a slower pace sometimes. You just need to tap into it and apply yourself. That's when having a mentor can do wonders, yet even without one, you can still manage, but with a bit more effort. It's all up to you!
Throughout this blog, I've talked about all sorts of problems and how solving them can aid one's data science acumen as well as the development of the data science mindset. Problem-Solving skills rank high when it comes to the soft skills aspect of our craft, something I also mentioned in my latest video on O'Reilly. However, I haven't talked much about how you can hone this ability.
Enter Brilliant, a portal for all sorts of STEM-related courses and puzzles that can help you develop problem-solving, among other things. If you have even a vague interest in Math and the positive Sciences, Brilliant can help you grow this into a passion and even a skill-set in these disciplines. The most intriguing thing about all this is that it does so in a fun and engaging way.
Naturally, most of the stuff Brilliant offers comes with a price tag (if it didn't, I would be concerned!). However, the cost of using the resources this site offers is a quite reasonable one and overall good value for money. The best part is that by signing up there you can also help me cover some of the expenses of this blog, as long as you use this link here: www.brilliant.org/fds (FDS stands for Foxy Data Science, by the way). Also, if you are among the first 200 people to sign up you'll get a 20% discount, so time is definitely of the essence!
Note that I normally don't promote anything of this blog unless I'm certain about its quality standard. Also, out of respect for your time I refrain from posting any ads on the site. So, whenever I post something like this affiliate link here I do so after careful consideration, opting to find the best way to raise some revenue for the site all while providing you with something useful and relevant to it. I hope that you view this initiative the same way.
Being a data science author is not a simple matter. With the bookshelves brimming with data science books these days, one may come to think of this as being something easy and accessible to everyone. Perhaps the latter is true since nowadays everyone can publish a data science book through some publisher with very low standards or he can publish the book himself, thanks to Amazon and other sites that are happy to make your book available to everyone. Some people stoop so low as to give away their book for free, something that says more about the quality of their book than it does for their generosity (of course there are exceptions to this, since many academics prefer this approach since the academic publishers make their books inaccessible to most of their students due to the high price tag they force on them). Whatever the case, being a data science author involves more than just putting a book out there for the world to view and perhaps read.
In my experience for the past 10 years or so, authoring a book is quite different to just writing one and making it accessible to the public. Authoring a book is all about providing a certain level of quality and going through the oftentimes exhausting process of revisions and edits, once the first draft is completed. Fortunately, the first book I authored was on something I had spent 5 years working on, namely my PhD project. The book was my PhD thesis, which is much like a normal technical book, though geared towards a more limited audience.
Other books I've authored were mostly through a publisher, except for some ebooks and a novel ("I, AGI: the adventures of an advanced Artificial Intelligence"). Every time it was a challenge of sorts, through one through which I could grow as a writer. Here is a list of the things I learned that are necessary to author a book:
Beyond these, several other things are necessary for authoring a book, perhaps too many to list in a blog article. However, for anyone serious about writing, these are a good place to start. Cheers!
Just like week, during a business trip to London, I started working on this video, on my spare time, and now it's already online! In this 40 minute video, comprising of 3 clips, I explore the topic of Optimization, through a series of questions spanning across 5 categories. Whether you are an aspiring A.I. expert or a data scientist, you can learn a lot of useful things from this test of sorts and with the right mindset, even enjoy the whole process! You can find it on the O'Reilly platform, where you need to have an account (even a trial one will do) to watch it in its entirety. Cheers!
With everyone in A.I. feeling the need to have an opinion or even a stance on Artificial General Intelligence (AGI), we often neglect the source of this concept. Namely, the well-rounded intelligence that characterizes a human being, having all kinds of smarts. The latter I refer to as Natural General Intelligence (NGI) and someone can argue that it's as important if not more important than AGI, at least in this point in time, particularly to data science professionals.
But isn’t this kind of intelligence another name for genius? Not necessarily. NGI is modeled after the human being in general even if its artificial counterpart (AGI) is often linked to super-intelligence, a kind of supergenius that may characterize an A.I. that has developed this level of intelligence. Still, it is possible to have NGI without being a modern Leonardo DaVinci or a Benjamin Franklin.
Natural General Intelligence is all about enabling your mind to develop in different aspects, not merely the ones that you need for your vocation or the ones that were essential for your survival so far. This idea is not new and has been popular during the Renaissance. Even today we use the term "Renaissance Man" to refer to the individual who is well-rounded in his or her life and can be good at different things. In this era of overspecialization, this seems to be a Utopian endeavor, at least to some people. In reality, however, it isn't. If you want to learn a musical instrument, for example, there are plenty of courses and books you can leverage, while there are even music instructors who can teach you over the internet. As for the instruments themselves, they are far more affordable than they used to be while for certain instruments, the prices continue to drop due to high demand. However, more important than developing one’s musical aptitude is the growth of one’s emotional intelligence (EQ), particularly interpersonal skills.
What does all this have to do with data science? Well, in data science it’s easy to overspecialize too (e.g. in Machine Learning, Data Engineering, NLP, etc.). However, this creates artificial barriers which may render communication with other data professionals more challenging. Of course, more often than not these issues are alleviated through a competent data science lead or a manager with sufficient data science understanding. Still, if you as a data science professional can mitigate the need for external intervention when it comes to collaborating with others, that’s definitely a plus. Not just in terms of smoothing the professional relationships involved, but also in terms of business value. Stand-alone professionals are very sought after since such people tend to be (or quickly become) assets. In time, these professionals can grow into versatilists and/or assume leadership positions.
From all this, it is hopefully clear that Natural General Intelligence is more tangible and significantly more feasible than any other kind of advanced intelligence capable of yielding value in an organization. What's more, an individual with NGI is bound to be more relate-able and accountable, rendering the whole team he/she belongs to a more functional unit. Perhaps such a goal is more beneficial than the blind pursuit of some exotic kind of A.I. that can solve all of our problems. The latter is intriguing and worth investigating, but I wouldn't bet on it benefiting the average Joe any time soon!
Being an expert in this topic since my PhD, I decided to create a video about it. The topic is a bit niche but it's very practical and useful in various data science tasks, particularly data engineering. Check out the video on O'Reilly and feel free to give me any feedback on it, especially regarding the I.D. metric once you look into it. Note that you will need an account on the O'Reilly platform in order to view the video (and any other material) in its entirety. However, considering the quality of the stuff there and the diversity of the content, it is a worthwhile investment. Also, you can have a free trial for 10 days to check it out, before you make a decision about it. Cheers!
The knowledge vs. faith conundrum has been a philosophical debate for eons, yet it usually is geared towards abstract matters, such as life after death. So, how does this apply to a pragmatic field such as data science? Well, contrary to what many people think, most data science practitioners often rely on faith to a great extent, when dealing with data science matters. But why is that?
Unfortunately, most people learning the craft have a strict time table to keep, so they don't have a chance to go in depth on the material covered. This increasingly severe temporal limitation is also coupled with other factors, such as the plethora of "cookbooks" on the topic. Not to be confused with actual cookbooks, comprising of various recipes, oftentimes original tried and tested dishes developed by experienced chefs; these cookbooks are fine and probably have a bigger bang for your buck, compared to the technical cookbooks that are basically a bunch of methods and functions, usually in a popular programming language, organized by someone who oftentimes doesn't even understand them. If you rely mainly on such sources of knowledge, you are basically putting your faith in these people and creating gaps in your understanding of the craft.
So, if you obtain technical knowledge quickly or from a source that doesn't go much in depth, it is unlikely to truly know data science. That's not to say that you shouldn't read books; far from it. Books are useful but no matter how good they are, the best way to learn something remains the empirical approach. Going under the hood of the methods involved, implementing methods from scratch and even experimenting with your own ideas, are all good ways to learn something in more depth and remember it for longer periods of time. Also, through empirical knowledge of the craft, you are more confident about what you know and oftentimes more aware of the boundaries of your knowledge.
There is room for faith in our field, as for example when you trust what your data science lead/director tells you, when you accept advice from a mentor, and when you rely on the know-how of an academic paper written by someone who knows data science in-depth. However, it's good to balance it with empirical knowledge to the extent your time allows. Perhaps in abstract matters, it's hard to obtain empirical knowledge, but on things that you can test yourself, the only limitations are man-made ones. Are you willing to transcend them?
There are many mistakes that can be made in data science, many of which can go unnoticed for a while. The reason is that unlike coding bugs, these mistakes don't throw an error or an exception, making them harder to spot and fix, as a result. In my view, the biggest such mistake is that of thinking that one aspect of data science is so significantly better than the others that the latter don't matter much. I used to think like that back in PhD days (my thesis was on Machine Learning and heuristics) but fortunately, I discovered the error of my thinking and started broadening my perspective on this matter, something I continue to do as I learn more about this fascinating field.
Let's look into this more closely. For starters, there are several frameworks or tool-kits available in data science today, ranging from Statistics to Machine Learning, and lately, A.I. based models. All of them have their own set of advantages as well as limitations. Many Machine Learning models, for example, particularly A.I. based ones (mainly ANNs) are very hard to interpret and are often referred to as black boxes. Stats models, on the other hand, may be easy to interpret, but they may not be as accurate, while they tend to have a number of assumptions which may not always hold true. That's why claiming that one of these frameworks or tool-kits is the best one at the expense of others is a very shaky position.
However, with all the hype around the latest and greatest Deep Learning methods (and other A.I. based models used in Data Science), it's difficult to argue against this position. Also, with Statistics having such a good reputation in academia and proven applicability across different domains, it's also hard to argue that it's not as good a framework. This may be good in a way since it keeps us humble, but it may also obstruct progress. How can you have the nerve to put forward something new if it doesn't comply with what is considered "the best" or if it doesn't comply with the traditional approaches to data learning, such as Statistical Learning?
I'm not claiming to have a solution to this conundrum, by the way, and perhaps it's not something that can be answered simply. However, this kind of riddles that plague the data science field are what can be good food for thought and bring about a sense of genuine wonder about the prospects and the future of data science. Maybe when someone asks us what the best framework of data science is it's better to say "I don't know" and consider using different ones in tandem, instead of flocking into this or the other group of people who have made up their minds about this, and who are unlikely to ever change it. After all, open-mindedness is something that never gets old, at least not in a truly scientific field.
Being open-minded is a key trait of any scientist, since the beginning of Science. The scientific method is basically a practice that relies on open-mindedness, focusing on testing a hypothesis based on the evidence at hand. However, nowadays there is a trend towards a heretic behavior (in lack of a better word) when it comes to the science of data, as well as the application of A.I. in it.
Open-mindedness is not just being open about the results of an experiment though. That’s easy. Being open to other people’s ideas and beliefs is also important. It’s easy to dismiss some people, especially those writing about this matter, even though they lack the training you may have on the field. Still, those people may have some interesting insights, which they often express in their articles. You don’t have to agree with them, in order to gain from this, expanding your perspective. However, dismissing an article because it makes use of this or the other term (which in your opinion is not that relevant to the topic they tackle) is closed-minded.
That’s not to say that we should accept everything we read, however. Some of the material out there is of low informational value and can be biased towards this or the other technology, for various reasons. That’s normal since the field of data science (as well as A.I. to some extent) is closely linked to the business world and is influenced by the dynamics of the markets of tools and frameworks related to data analytics.
So, what do we do about all this? For starters, we can read an article before we dismiss it as irrelevant or otherwise problematic. Also, if we don’t agree about something with the author, we can construct arguments against that point and express them without attacking the other person. There are people who are incredibly toxic to the field and pose a threat to the field, by propagating their erroneous beliefs, but fortunately, these are few. Also, they are probably beyond salvation, since they have too large a following to ever question their beliefs. Still, by going against their propaganda, we can still help the people who haven’t made up their minds yet on the topic.
Perhaps that’s why the most important thing you can learn about data science and A.I. is to have a mindset that is congruent to your development as a professional, always maintaining an open mind. Just because there are fanatics in this field who are getting paid way more than they should and maintain a large following due to their charisma, it doesn’t mean that this is the best way to go. It’s not easy to be open-minded in a place where fanaticism thrives, but in the long run, it’s a viable strategy. After all, data science is here to stay, in one form or another, while the views on it that are now popular are bound to change.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.