These days I'm working feverishly on a book project so there is no time for any new data science / A.I. related post here. If you want something else to read, feel free to check my articles on beBee, such as the latest one, available here. Parallel to all this, I'm preparing another educational project, something I'll talk more about later on. Stay tuned!
So, recently I decided to make a video on this topic, based on some things I've observed in data science candidates. The hope is that this may help them and anyone else who may be looking into becoming a more holistic data scientist, instead of just a data science technician. The video I made is now available online on O'Reilly and although it's a bit longer than others I've made (not counting the quiz ones), it's fairly easy to follow. Enjoy!
Everyone wants to do business especially when it comes to data science. The more someone is aware of the merits of this field and the value it can bring, the keener that person usually is. Whether it is for a hands-on project or something more high level, the wish to do a collaborative project is bound to rise, the more they get to know you and what you can do for them. However, just because you can work with someone on a potentially interesting and lucrative project, it doesn't mean that you should. Namely, there are certain red flags you ought to be aware of and which once spotted should make you rethink the whole endeavor.
First of all, there is a lack of organization when it comes to the first meeting (and the ones that may follow). Many people want to meet but they often lack the basics of organizing a meeting. Sometimes the time is vague (e.g. they set up a day but not a clear time) or the place is unclear (e.g. there is agreement about using a VoIP system but there is no mention of which system or which room, as in the case of Zoom). If your potential client fails to provide such crucial information, probably they are still new to doing business and there are bound to be other discrepancies down the line.
What’s more, the lack of clear objectives is something to be wary of. Some people want to do wonders with data science (esp. when A.I. is also leveraged) but they have no idea how. There are no clear objectives or deadlines, and the whole project feels more like a plan drafted by a 5-year-old. Situations like this spell out trouble since no matter how hard you work, they won’t be satisfied by your deliverables.
Moreover, be wary when someone doesn’t have a solid understanding of the field and has irrational expectations because of this. This ties into the previous point since the lack of clear objectives often stems from the lack of a solid understanding of what data science is and what it can do. With a perception tainted by the hype of data science and A.I., the client may be unaware of what is feasible and what isn't, leading to a very unrealistic set of expectations that, no matter how good you are, you are unlikely to be able to meet.
Furthermore, the lack of access to the actual data is a serious issue for a data science project. If I had a dime for every time I encountered this situation, I wouldn't need to work anymore! Yes, many people may have a clear plan and a solid understanding of data science but the data is not there. Sometimes they do have it but it is inaccessible and you have to go through miles of red tape just to get a glimpse of it. Cybersecurity and privacy processes are something completely unknown to clients like this, and they are overly protective of the data they have, granting you access to it only after you have signed a contract. However, embarking on a data science project without some exploratory data analysis first is like asking for trouble, but they don't usually understand that either.
Finally, if the paperwork is not properly handled (contracts, NDAs, etc.), that’s a big red flag. This is the other extreme, whereby the client is very open about everything but has no idea of how the world works and doesn't bother with NDAs, formal contracts, etc. This way, if there are issues (something quite likely), you are screwed since there are no legal guarantees for the whole project, making any pending payments as likely to become actual revenue as a lottery ticket! Also, the ownership of the IP involved in such a project can become a nightmare.
Note that all these are red flags I’ve experienced myself so this list is by no means complete. Hopefully, it can give you an idea of things to look out for, ensuring that your data science expertise is not exploited or wasted in projects that are not likely to yield any benefit for you.
Alright, the quiz video fever is over for the time being, so I'm back to making conventional data science videos. This latest one on APIs, for example, just got published on O'Reilly. It's more technical than others, but very useful, particularly if you know already a few things about data science. Anyway, I hope you enjoy it!
Note that although you can view the list of videos and books on O'Reilly's learning platform, you need to have a valid account in order to view them in their entirety. A pretty good investment, if you ask me, but before you commit to a monthly or a yearly subscription, you can always have a trial one which lasts for 10 days. Cheers!
So, the 7th quiz video I've created is finally online on O'Reilly. This is the longest one so far, spanning over 51 minutes, meaning there are lots of explanations for the various questions. It covers a bunch of topics, such as A/B testing, ANOVA, and various statistical tests. I put a lot of thought into this, much like you'd put a lot of thought into designing a data science experiment. Hopefully, you'll find it as useful and enjoyable as I did.
Note that just like other videos published on O'Reilly, you'll need to have an active account (even if it's a trial one), in order to view it in its entirety. As a bonus, you'll be able to view other videos as well as books available on that platform. Enjoy!
So, the royalties for the last 3-month period came for my self-published novel today ("I, AGI; the adventures of an advanced AI") and they were quite underwhelming. In fact, with the money I received I couldn't even cover my expenses for this book. Yes, I did pay others to help out, such as an editor and someone to handle the formatting that Kindle Publishing expects of its books, including the cover design. After all, I have a lot of respect for my audience, even if probably most of the people who read the book chose to not pay for it (there are loopholes when it comes to Amazon Kindle). Still, the reviews I got about it, from reliable sources like Goodreads, were quite positive, so I must have done something right!
Anyway, I could have published this book elsewhere and perhaps if I had 6 months to a year to spend, I could have found a literary publisher for it (unfortunately my regular publisher doesn't do novels!). Yet, even then it's not really worth it for the revenue a fiction book can bring. After all, the standards for sci-fi these days are quite high and I'm more of a non-fiction author. So, why did I bother with this whole project? Well, mostly because I enjoy writing, all kinds, not just non-fiction. And if you have a story in your head that you wish to share with others, the low revenue that stems from a publication of this story doesn't pose a real obstacle.
Also, and perhaps more importantly, I had a message to get to the world, regarding the safety aspect of A.I. and AGI. Of course, I've made this point through other forms, such as a video on the topic and numerous articles on this blog. However, if you care about reaching as many people as possible, you need to be creative about how you promote your idea. And that's exactly what I did.
So, even if Amazon Kindle is not the most profitable way to publish an ebook, even if the people reading this book probably have a dozen other books on their to-read list and are less likely to value it the same way we used to value books before the Internet era, even if people are mesmerized by the benefits of A.I. today and are quite reluctant to view any of the potential shortcomings, I'm glad I published this book. At the very least, it was a learning experience and a way to gauge the literary market first hand. And who knows, if things go well, I may author a sequel to this novel as there is more to the story!
Lately I've been down with a severe case of quiz fever! Combined with the fact that it was too hot to go outside (Mediterranean summer heat is no joke!), I was more focused on this task. As a result, I created a bunch of quizzes to publish on O'Reilly, two of which are now online. Namely, the Data Engineering and the Machine Learning Applications ones are now available on the O'Reilly platform. Check them out when you have the chance.
Note that in order to have full access to the quizzes, you need an account with O'Reilly, a pretty good investment. If you are unsure whether you want to go for it, you can always create a trial account first (valid for 10 days) and check out the content of this great platform, without any strings attached. However, to maximize your benefit, I recommend you get a paid account. The plethora of quality content on this platform makes it worth it!
Furthermore, I recently noticed that my videos are receiving lots of hits and have fetched some very promising reviews. I'd like to thank you all for that. It's a relatively small thing for every one of you to take the time out of your busy day and watch my work, and you may not think much of it. However, it does make a difference to me, so I'm grateful for that. At a time when YouTube is the go-to option for many people, some of you choose quality over convenience and that's something I never take for granted. Cheers!
Although transparency is often viewed in relation to predictive analytics models, when it comes to data science there is another aspect of transparency that is also particularly important: the transparency of the data science work. This has to do mainly with how transparent the process followed is, as well as the models used and the results.
Nowadays it’s easy (perhaps too easy!) to build a predictive analytics model in complete obscurity, thanks to the wonder of deep learning. This way, you may be able to bring about a satisfactory result, without gaining a sufficient understanding of the data at play, or the quirks of the problem at hand. Of course, it's not just the data scientist to blame for this reckless behavior. Far from it. The root of the problem is managerial since we often forget that the data scientist will tend to follow the most economical course of action, to deliver a result, be it a data product or a set of insights, in the least amount of time. This is often due to the strict deadlines involved in a data science project and the all too frequent lack of understanding of the field, by the people managing the project.
There is more to a successful project than a high accuracy rate or an easily accessible model on the cloud. Oftentimes the problems tackled by data science are complex and have lots of peculiarities that deserve close attention if the problems are to be solved properly. Anyone can build a predictive analytics model nowadays, without having a good grasp of data science, thanks to all these 10-12 week boot camps that offer the most superficial knowledge humanly possible to the aspiring data scientists! Yet, if our expectations of the data scientists are equally shallow and we are willing to put up with opaque models and pipelines, then we reap what we sow. That's why it's important to have good communication about these matters, going beyond the basics. Mentoring can also be a priceless aid in all this.
Fixing this fundamental issue requires more than just good communication and mentoring, however. We also need to opt for a transparent approach to data science. All aspects of the pipeline need to be explainable, even if the models used are black boxes, due to the performance requirements involved. The data scientists need to be able to communicate their work and findings, while we as managers need to do the same when it comes to requirements, domain knowledge, and other factors that may play a role in the project at hand. All this may not solve every issue with today’s obscure data science pipelines, but it is a good place to start.
Perhaps if we have transparency as a key value in our data science teams, we have a better chance of deriving true insights from the data available and bringing about a more valuable result overall.
Since I'm in a quiz frame of mind these days, I've created yet another quiz video, which is now available on the O'Reilly platform. Namely, this quiz video on Machine Learning explores a few key aspects of the subject, such as supervised, unsupervised and reinforcement learning, as well as the main model types and the hyper-parameters involved. Designed to be as inclusive as possible, this is a video that can benefit both the beginner to this topic and the more seasoned machine learning professional. Enjoy!
Note that O'Reilly is a subscription-based platform (formerly known as Safari). So, in order to view this or any other video in its entirety, you'll need to have an account there. Definitely a worthwhile investment, if you ask me, particularly if you are a data science professional. I don't receive any benefits from saying this, btw, since I work with a different publisher (Technics Publications), who contributes these videos to this platform.
Beyond the play of words here, there is an important matter that needs to be addressed, since data science is becoming increasingly influential nowadays, in various aspects of our lives. Gone are the days when it was limited to the data science departments of certain companies; these days, the impact of data science transcends the boundaries of the organizations it serves. Take for example the data scientists working for large companies like Facebook and Google. The impact of their work influences a large number of people, even outside the companies themselves. Perhaps the range of this impact is hard to fathom even by the managers of these data science teams since it often has a lasting impact that's nearly impossible to gauge without sufficient data and the time required for this impact to fully manifest.
Ethics is a word that's used so much that it has lost its meaning, or maybe it was never really properly defined in the first place. Also, with the impersonal aspects of ethics being formalized in particular codes of conduct, it has lost its essence since it has been reduced to a number of do's and don'ts, a set of guidelines which can be followed unconsciously and mechanically. However, ethics is the formal aspect of morality, which is founded in the values we follow. The latter are real and oftentimes comprehensible things that we express in our actions, oftentimes consciously. Values like honesty, diligence, and efficiency don't require a Master's in philosophy in order to comprehend, while the ethics of a modern information worker can be a bit more abstract and challenging to relate to. Values are something we have, whether we talk about them or not, and it's not too difficult to figure out what these are with a little introspection. However, even though values are a personal matter, they have a concrete effect on our work and on how we relate to the world. Good managers are aware of that and pay attention to the values of the candidates for the positions they wish to fill. The resume/CV is important but it’s not the only factor at play when hiring a professional.
Perhaps it's time to pay attention to this aspect of the craft more. Knowledge and know-how are becoming more easily accessible to everyone, particularly those who are willing to pay for that, an investment that is guaranteed to pay off. That's great, particularly for those who wish to enter this field even if their education is not aligned with this subject. Still, it's equally important to balance this aptitude with the moral strength that empowers us to deliver our data science work in a way that respects other people's privacy and doesn't abuse the information involved. At some point in our careers, it is natural to come to a crossroads where we need to either do what is expected or do what is ethically right. The former is bound to be a more tempting option, at least financially, while the latter may be void of any direct benefit. Having a solid set of positive values may help us make the right choice instead of trading the long-term benefit of the many for the short-term gain of the few.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.