Lately I've been down with a severe case of quiz fever! Combined with the fact that it was too hot to go outside (Mediterranean summer heat is no joke!), this meant I was more focused on this task. As a result, I created a bunch of quizzes to publish on O'Reilly, two of which are now online. Namely, the Data Engineering and the Machine Learning Applications ones are now available on the O'Reilly platform. Check them out when you have the chance.
Note that in order to have full access to the quizzes, you need an account with O'Reilly, which is a pretty good investment. If you are unsure whether you want to go for it, you can always create a trial account first (valid for 10 days) and check out the content of this great platform, without any strings attached. However, to maximize your benefit, I recommend you get a paid account. The plethora of quality content on this platform makes it worth it!
Furthermore, I recently noticed that my videos are receiving lots of hits and have fetched some very promising reviews. I'd like to thank you all for that. Taking time out of your busy day to watch my work may seem like a relatively small thing, and you may not think much of it. However, it does make a difference to me, so I'm grateful for it. At a time when YouTube is the go-to option for many people, some of you choose quality over convenience, and that's something I never take for granted. Cheers!
Although transparency is often viewed in relation to predictive analytics models, when it comes to data science there is another aspect of transparency that is also particularly important: the transparency of the data science work itself. This has to do mainly with how transparent the process followed is, as well as the models used and the results obtained.
Nowadays it’s easy (perhaps too easy!) to build a predictive analytics model in complete obscurity, thanks to the wonder of deep learning. This way, you may be able to bring about a satisfactory result without gaining a sufficient understanding of the data at play or the quirks of the problem at hand. Of course, it's not just the data scientist who is to blame for this reckless behavior. Far from it. The root of the problem is managerial, since we often forget that the data scientist will tend to follow the most economical course of action to deliver a result, be it a data product or a set of insights, in the least amount of time. This is often due to the strict deadlines involved in a data science project and the all-too-frequent lack of understanding of the field by the people managing the project.
There is more to a successful project than a high accuracy rate or an easily accessible model on the cloud. Oftentimes the problems tackled by data science are complex and have lots of peculiarities that deserve close attention if they are to be solved properly. Anyone can build a predictive analytics model nowadays, without having a good grasp of data science, thanks to all these 10-12 week boot camps that offer the most superficial knowledge humanly possible to aspiring data scientists! Yet, if our expectations of the data scientists are equally shallow and we are willing to put up with opaque models and pipelines, then we reap what we sow. That's why it's important to have good communication about these matters, going beyond the basics. Mentoring can also be a priceless aid in all this.
Fixing this fundamental issue requires more than just good communication and mentoring, however. We also need to opt for a transparent approach to data science. All aspects of the pipeline need to be explainable, even if the models used are black boxes, due to the performance requirements involved. The data scientists need to be able to communicate their work and findings, while we as managers need to do the same when it comes to requirements, domain knowledge, and other factors that may play a role in the project at hand. All this may not solve every issue with today’s obscure data science pipelines, but it is a good place to start.
Perhaps if we have transparency as a key value in our data science teams, we have a better chance of deriving true insights from the available data and bringing about a more valuable result overall.
Since I'm in a quiz frame of mind these days, I've created yet another quiz video, which is now available on the O'Reilly platform. Namely, this Machine Learning quiz video explores a few key aspects of the subject, such as supervised, unsupervised, and reinforcement learning, as well as the main model types and the hyper-parameters involved. Designed to be as inclusive as possible, this is a video that can benefit both the beginner to this topic and the more seasoned machine learning professional. Enjoy!
Note that O'Reilly is a subscription-based platform (formerly known as Safari). So, in order to view this or any other video in its entirety, you'll need to have an account there. Definitely a worthwhile investment, if you ask me, particularly if you are a data science professional. I don't receive any benefits from saying this, btw, since I work with a different publisher (Technics Publications), which contributes these videos to this platform.
Beyond the play on words here, there is an important matter that needs to be addressed, since data science is becoming increasingly influential nowadays, in various aspects of our lives. Gone are the days when it was limited to the data science departments of certain companies; these days, the impact of data science transcends the boundaries of the organizations it serves. Take for example the data scientists working for large companies like Facebook and Google. Their work influences a large number of people, even outside the companies themselves. Perhaps the range of this impact is hard to fathom even for the managers of these data science teams, since it is often a lasting one that's nearly impossible to gauge without sufficient data and the time required for it to fully manifest.
Ethics is a word that's used so much that it has lost its meaning, or maybe it was never really properly defined in the first place. Also, with the impersonal aspects of ethics being formalized in particular codes of conduct, it has lost its essence, since it has been reduced to a number of do's and don'ts, a set of guidelines which can be followed unconsciously and mechanically. However, ethics is the formal aspect of morality, which is founded on the values we hold. The latter are real and oftentimes comprehensible things that we express in our actions, often consciously. Values like honesty, diligence, and efficiency don't require a Master's in philosophy to comprehend, while the ethics of a modern information worker can be a bit more abstract and challenging to relate to. Values are something we have, whether we talk about them or not, and it's not too difficult to figure out what they are with a little introspection. However, even though values are a personal matter, they have a concrete effect on our work and on how we relate to the world. Good managers are aware of that and pay attention to the values of the candidates for the positions they wish to fill. The resume/CV is important, but it’s not the only factor at play when hiring a professional.
Perhaps it's time to pay more attention to this aspect of the craft. Knowledge and know-how are becoming more easily accessible to everyone, particularly those who are willing to pay for them, an investment that tends to pay off. That's great, particularly for those who wish to enter this field even if their education is not aligned with the subject. Still, it's equally important to balance this aptitude with the moral strength that empowers us to deliver our data science work in a way that respects other people's privacy and doesn't abuse the information involved. At some point in our careers, it is natural to come to a crossroads where we need to either do what is expected or do what is ethically right. The former is bound to be a more tempting option, at least financially, while the latter may be void of any direct benefit. Having a solid set of positive values may help us make the right choice, instead of trading the long-term benefit of the many for the short-term gain of the few.
Just last week, during a business trip to London, I started working on this video in my spare time, and now it's already online! In this 40-minute video, comprising 3 clips, I explore the topic of Optimization through a series of questions spanning 5 categories. Whether you are an aspiring A.I. expert or a data scientist, you can learn a lot of useful things from this test of sorts and, with the right mindset, even enjoy the whole process! You can find it on the O'Reilly platform, where you need to have an account (even a trial one will do) to watch it in its entirety. Cheers!
With everyone in A.I. feeling the need to have an opinion or even a stance on Artificial General Intelligence (AGI), we often neglect the source of this concept: namely, the well-rounded intelligence that characterizes a human being, having all kinds of smarts. The latter I refer to as Natural General Intelligence (NGI), and one could argue that it's as important, if not more important, than AGI, at least at this point in time, particularly to data science professionals.
But isn’t this kind of intelligence another name for genius? Not necessarily. NGI is modeled after the human being in general, even if its artificial counterpart (AGI) is often linked to super-intelligence, a kind of supergenius that may characterize an A.I. that has developed this level of intelligence. Still, it is possible to have NGI without being a modern Leonardo da Vinci or a Benjamin Franklin.
Natural General Intelligence is all about enabling your mind to develop in different aspects, not merely the ones that you need for your vocation or the ones that were essential for your survival so far. This idea is not new; it was popular during the Renaissance, and even today we use the term "Renaissance Man" to refer to an individual who is well-rounded in his or her life and can be good at different things. In this era of overspecialization, this seems to be a Utopian endeavor, at least to some people. In reality, however, it isn't. If you want to learn a musical instrument, for example, there are plenty of courses and books you can leverage, while there are even music instructors who can teach you over the internet. As for the instruments themselves, they are far more affordable than they used to be, with the prices of certain instruments continuing to drop. However, more important than developing one’s musical aptitude is the growth of one’s emotional intelligence (EQ), particularly interpersonal skills.
What does all this have to do with data science? Well, in data science it’s easy to overspecialize too (e.g. in Machine Learning, Data Engineering, NLP, etc.). However, this creates artificial barriers which may render communication with other data professionals more challenging. Of course, more often than not these issues are alleviated through a competent data science lead or a manager with sufficient data science understanding. Still, if you as a data science professional can mitigate the need for external intervention when it comes to collaborating with others, that’s definitely a plus. Not just in terms of smoothing the professional relationships involved, but also in terms of business value. Stand-alone professionals are very sought after since such people tend to be (or quickly become) assets. In time, these professionals can grow into versatilists and/or assume leadership positions.
From all this, it is hopefully clear that Natural General Intelligence is more tangible and significantly more feasible than any other kind of advanced intelligence capable of yielding value in an organization. What's more, an individual with NGI is bound to be more relatable and accountable, rendering the whole team he or she belongs to a more functional unit. Perhaps such a goal is more beneficial than the blind pursuit of some exotic kind of A.I. that can solve all of our problems. The latter is intriguing and worth investigating, but I wouldn't bet on it benefiting the average Joe any time soon!
Being an expert in this topic since my PhD, I decided to create a video about it. The topic is a bit niche but it's very practical and useful in various data science tasks, particularly data engineering. Check out the video on O'Reilly and feel free to give me any feedback on it, especially regarding the I.D. metric once you look into it. Note that you will need an account on the O'Reilly platform in order to view the video (and any other material) in its entirety. However, considering the quality of the stuff there and the diversity of the content, it is a worthwhile investment. Also, you can have a free trial for 10 days to check it out, before you make a decision about it. Cheers!
In the most venerable of sciences, Physics, there are two closely linked concepts: that of work and that of energy. Work is the result of a force applied over a given distance, while energy can be seen as the capacity to produce work. Energy takes a variety of forms, which enables us to produce work through its use, be it a preexisting form (e.g. uranium and thorium) or some man-made form (e.g. a battery). This fundamental relationship between work and energy, which we often take for granted, applies to data science as well, if we substitute value for energy.
Value is sometimes considered the 5th V of Big Data (the other four being Volume, Velocity, Variety, and Veracity), something that is quite inaccurate, though, since value is a fundamental characteristic of information, not of a particular kind of data. Information can be found even in relatively small datasets (which were considered large once, before the era of big data), so calling value a characteristic of big data can be misleading. This misconception doesn't take anything away from the idea of value, though, which is instilled in many data scientists, particularly those who go beyond the techniques and methods. These data scientists penetrate the essence of the craft through the development of the data science mindset, which is the most valuable aspect of the field.
Value is something that concerns business people too, however, since it is one of the outcomes of a data science project, which ideally can translate into increased revenue, be it via the development of a new product or by making a business process more efficient. Also, value can enable an organization to expand its scope, know its customers better (KYC), and liaise with other organizations more effectively. This value, which often takes the form of insights, is at the core and oftentimes at the end of the data science pipeline.
Value, however, can also take the form of a product, such as an API that automates a particular evaluation process or a prediction. Although the technology behind such a product is nothing spectacular (APIs have existed for a while now and they are fairly straightforward for a software engineer to develop), the data science part of that product is what brings about the real value in such an API. Without a data science engine behind it, an API is bound to be more of an ETL tool, which, although still valuable, is not of the same caliber as a data science-powered API.
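To make this distinction more concrete, here is a minimal sketch in Python of the kind of scoring engine that sits behind such an API. The feature names and model weights are made up for illustration, and the "endpoint" is just a plain function taking and returning JSON strings; in a real product, a trained model would supply the parameters and a web framework would do the serving.

```python
import json
import math

# Hypothetical model parameters; in practice these would come from training.
WEIGHTS = {"tenure": 0.8, "usage": 1.2}
BIAS = -1.5

def predict(features):
    """Score a single record with a simple logistic model."""
    z = BIAS + sum(WEIGHTS[k] * features.get(k, 0.0) for k in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def handle_request(payload):
    """Mimic an API endpoint: JSON string in, JSON string out."""
    record = json.loads(payload)
    return json.dumps({"score": round(predict(record), 4)})
```

The plumbing (parsing JSON, returning a response) is the easy part that any software engineer can supply; the value lies in the `predict` function, which is where the data science work lives.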
Value in data science is often found in the information distilled from the data, particularly through a predictive analytics model. Elements of it, however, are already encountered in the data discovery stage of the pipeline, where the data scientist evaluates the features at hand and the metadata available. This is often conducted through the creation of data models, which is why it is part of the data modeling part of the pipeline. I talk about all this in detail in the Data Science Modeling Tutorial, available on the O'Reilly (formerly known as Safari) platform.
Value in data science is a big topic and if I were to continue, this article would be irksomely long. It would be best to continue this in another article, or even a series of articles, in the weeks to come. Cheers!
As you may know already, one of the world's top data modeling conferences, Data Modeling Zone (DMZ), is taking place this November in Stuttgart. This is an international conference for all sorts of data professionals, not just data architects, plus it has an impressive bookstore. As I'll be participating in this conference as a speaker, I get to give away discounts and such. So, if you plan to register for the conference, you can use my last name (8 characters) to claim a 15% discount. With this kind of money, you can treat yourself to a tour of the lovely German countryside and still have some money left to buy a book or two from the aforementioned bookstore!
The knowledge vs. faith conundrum has been a philosophical debate for eons, yet it usually is geared towards abstract matters, such as life after death. So, how does this apply to a pragmatic field such as data science? Well, contrary to what many people think, most data science practitioners often rely on faith to a great extent, when dealing with data science matters. But why is that?
Unfortunately, most people learning the craft have a strict timetable to keep, so they don't have a chance to go in depth into the material covered. This increasingly severe temporal limitation is also coupled with other factors, such as the plethora of "cookbooks" on the topic. These are not to be confused with actual cookbooks, which comprise various recipes, oftentimes original, tried-and-tested dishes developed by experienced chefs; those cookbooks are fine and probably offer a bigger bang for your buck than the technical cookbooks, which are basically a bunch of methods and functions, usually in a popular programming language, organized by someone who oftentimes doesn't even understand them. If you rely mainly on such sources of knowledge, you are basically putting your faith in these people and creating gaps in your understanding of the craft.
So, if you obtain technical knowledge quickly, or from a source that doesn't go much in depth, you are unlikely to truly know data science. That's not to say that you shouldn't read books; far from it. Books are useful, but no matter how good they are, the best way to learn something remains the empirical approach. Going under the hood of the methods involved, implementing methods from scratch, and even experimenting with your own ideas are all good ways to learn something in more depth and remember it for longer. Also, through empirical knowledge of the craft, you become more confident about what you know and oftentimes more aware of the boundaries of your knowledge.
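As an illustration of this empirical approach, here is a from-scratch sketch of a classic method, k-nearest neighbors, in Python. The function name and data layout are my own choices for this sketch, not a reference implementation, but writing even a toy version like this teaches you more about the method than reading a recipe ever could.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbours.

    `train` is a list of (point, label) pairs, where each point is a
    tuple of numbers; the distance used is plain Euclidean.
    """
    # Sort the training points by their distance to the query point
    # (math.dist requires Python 3.8+).
    neighbours = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    # Tally the labels of the k closest points and return the most common one.
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]
```

For instance, with two clusters of labeled points such as `[((0, 0), "a"), ((0, 1), "a"), ((5, 5), "b"), ((6, 5), "b")]`, a query near the first cluster gets the label "a". Once something like this works, experimenting with it (other distance metrics, weighted votes, different values of k) is where the deeper understanding comes from.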
There is room for faith in our field, as for example when you trust what your data science lead/director tells you, when you accept advice from a mentor, and when you rely on the know-how of an academic paper written by someone who knows data science in-depth. However, it's good to balance it with empirical knowledge to the extent your time allows. Perhaps in abstract matters, it's hard to obtain empirical knowledge, but on things that you can test yourself, the only limitations are man-made ones. Are you willing to transcend them?
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.