Just a heads-up: my second video for Technics Publications, "Becoming a Data Scientist, in a Nutshell," is now available on Safari Books Online. Based on my first book, "Data Scientist: The Definitive Guide to Becoming a Data Scientist," this video covers some key aspects of the data science role and offers practical advice on the skills you need to develop in order to pursue a career in data science.
Data science is a rapidly evolving field; there is no doubt about that. However, this doesn't have to be a source of stress for those involved in it. In fact, you can benefit from it, since a field like this is bound to maintain a sense of novelty for a long time. To make the most of this situation and take better advantage of the fast pace of data science, it is best to have a mentor.
The role of the mentor has been popularized in pop culture, particularly in movies. Perhaps there is something in our culture that makes it very relevant, if not necessary, especially in career-related endeavors. That's not to say that a mentor is just a career booster. In fact, you can have a mentor in every aspect of life, be it your profession, the art form that expresses you best, or even the sport you enjoy. A mentor is basically someone who is more adept at a certain activity and is eager to share their experience and expertise with you, usually as a labor of love. This doesn't mean that a mentor has to be someone at the apex of their journey, though. Anyone can be a mentor, provided they know enough and are willing to share it constructively with their peers.

In data science, having a mentor is crucial: there are so many new technologies out there, along with many more mature ones, that the landscape is often confounding. Also, with so many people holding conflicting views on where data science is heading, and with the recent buzz about A.I., you can really use some guidance, even if you already know enough to call yourself a data scientist on your business card. Although anyone more experienced than you can qualify as your mentor, it is usually best if that person is genuinely committed to the role. I may want Ms. X to be my mentor, but if she is too busy with her career or her family to help me out, this isn't really going to work, is it? One great place to find people who take this responsibility seriously is Thinkful, a startup that aims to connect data science learners with mentors in the field; think of it as an Uber for aspiring data science professionals. There are, of course, other places where you can seek mentorship for your data science learning, but this is the one I've found to be the most serious about the task at hand.

Whatever mentoring ecosystem you go with, it is important to cultivate the following qualities in yourself, so that you benefit from the experience as much as possible. First of all, you need to have an open mind and be willing to learn new things. This seems obvious, but you'd be surprised how many people lack this fundamental requirement (which is probably why they never have a mentor throughout their careers). You also need to be willing to investigate whatever your mentor shares with you. Mentorship is not a cult: take whatever your mentor tells you with healthy skepticism and look into it before you accept it. This allows for better comprehension as well. Finally, you need to be willing to change, by applying the new things you learn. Learning is enjoyable, especially if you are not tested on it afterwards! However, for it to be useful, you need to apply what you learn, after you assimilate it of course. If you are willing to do that, you are bound not only to benefit from the new know-how, but also to encourage your mentor to share more, perhaps going deeper into the secrets of the craft.

Finally, whatever you decide to do with mentorship, be aware that it is not a one-directional graph. You can connect with other people who are less experienced and less knowledgeable than you, and help them too. You can do that on your own or via a more organized platform like Thinkful. Whatever the case, even if this seems like a lofty goal for now, it doesn't hurt to keep it in mind as a possibility.
Because at the end of the day, what's left when everything else fades away is our legacy. Personally, I can't think of a better legacy than helping others in one's field fulfill their potential through mentoring. What about you?

You have seen them. They are everywhere these days. I'm not talking just about the YouTube ones, which are taken for granted nowadays. There are educational videos on the MOOC platforms (e.g. edX and Coursera), on Vimeo, and of course on the Safari Books Online platform. Many of these are not free, which may deter some people, but there is value in all of them, to some extent. This value may not be readily accessible, though, as it is often hidden, just like the signal in the data we are summoned to analyze.
Why look into educational videos and not just focus on books, though? Books are great, but in today's fast-paced world they struggle to keep up. Even publishing houses with high throughput often fail to deliver books fast enough for them to remain relevant for long. The eBook movement tackles this issue, at least partly. However, even eBooks are not as engaging as videos, since the latter have more channels through which to convey information. Emulating lectures and workshops, educational videos engage viewers through both visual and audio stimuli, directing their attention to the most essential parts of the topic. Books can do that as well, if they are well written, but they require much more concentration. Even if you possess this level of focus, you may not be able to muster it everywhere. When you are on a bus or a train, for example, the myriad distractions can make focusing on a book for long quite a challenge. A video, however, is easier to concentrate on, even under such adverse conditions.

Watching educational videos is not the same as watching a documentary or some other non-fiction audiovisual, however. The latter are created to be very engaging and, to some extent, entertaining. Educational videos, on the other hand, tend to be more densely packed with information. One technique I've found useful is taking notes while watching them. Fortunately, you can easily pause the video so that note-taking doesn't distract you. If you are on the move while watching, you can always take a shorter note, perhaps just the timestamp of the part of the video that you feel requires more thought. That way, you can go back to it when you're at home and delve deeper. An educational video may require some work from you, too. Apart from assimilating its content, you may need to do some research on the topics it covers. This may not make you an expert, but it will definitely help you retain what you've learned.

Unlike conventional videos that are geared towards giving you an excuse to eat some popcorn or chips, educational videos provide you with a different kind of reward. This may take a while to settle in, since assimilating new information, especially know-how, can be a time-consuming process. However, they definitely help you, right here and now, keep boredom at bay and make the most of time you'd otherwise dedicate to less productive tasks. That's not to say that you need to watch educational videos whenever you are not engaged in some other productive activity, but you can definitely strike a balance between watching an educational video and playing your favorite game on your phone! If you haven't done so already, check out my own educational videos on Safari Books Online. If you can get past my peculiar accent, there is no doubt that your mind will have quite a bit to chew on for that day!

It is without a doubt that Artificial Intelligence (A.I.) has taken the world by storm. Although this field used to be a very arcane domain, after the development of data science it has refurbished its image and won the respect and admiration of people all over the tech world. Namely, since Prof. Hinton's novel approach to neural networks (his deep learning networks), A.I. has seen a huge boost in popularity and has been embraced irrevocably by the data science community.
Living up to the hype of easy and robust predictive analytics models that require little, if any, domain knowledge, deep learning networks have delivered where other models have failed. This makes one wonder what the next evolutionary step will be.
Contrary to what many people think, deep learning and the similar A.I. technologies that constitute the cutting edge of A.I. today are not all about GPUs, although this kind of cheap computing power certainly plays an important role in the field. Beyond the raw computational capabilities of the hardware, A.I. is also about writing programs that employ its principles in a robust and efficient manner. Even though there are packages for this kind of tech in every language out there, people tend to flock either to the more efficient or to the more easy-to-use programming platforms. This polarization is to be expected, considering that the programming languages themselves are polarized: there are lower-level languages like C, C++, and Java that are very fast but a major pain to write code in, and there are high-level languages like Python and R that are easy to develop scripts in but painfully slow when it comes to executing those scripts. The latter problem is solved to some extent by linking these languages to a distributed computing (big data) platform, such as Hadoop or Spark. That's great, but it usually means you spend the majority of your time doing ETL between the programming language you use and the big data platform, as well as a lot of data engineering to ensure that everything works well. With Spark things aren't that bad, but it is often the case that, in order to ensure better performance, you end up translating your high-level scripts into Scala, the language that this particular big data platform works best with.

To resolve this false dichotomy, some people at MIT developed Julia. I've talked about this language in a previous post, so this time I'd like to focus on its usefulness in A.I. Julia doesn't need all the overhead that other languages need in order to process big data. Also, as of late, it has become easy to use in a GPU setting, so deploying a deep learning network in Julia doesn't require advanced expertise. Most importantly, as it is designed to be very fast, it is ideal for computationally expensive processes, such as those involved in a modern A.I. system. So, a language like this is more or less ideal for this kind of sophisticated model, while it also lends itself to experimenting with new A.I. systems. But don't take my word for it; give it a try and see for yourself!
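To give a feel for how little ceremony Julia requires, here is a minimal sketch of a tiny two-layer network forward pass written in plain Julia. The layer sizes, data, and labels are made up for illustration; for serious work you would reach for a dedicated package from the ecosystem (Flux.jl and Knet.jl are two examples), but none is assumed here.

```julia
# A minimal sketch of a tiny feed-forward network in plain Julia.
# Sizes and data are arbitrary; real work would use a deep learning package.

relu(x) = max.(x, 0)                      # element-wise rectified linear unit
sigmoid(x) = 1 ./ (1 .+ exp.(-x))         # element-wise logistic function

# Randomly initialized weights and biases for a 4 -> 8 -> 1 network
W1, b1 = randn(8, 4), zeros(8)
W2, b2 = randn(1, 8), zeros(1)

# Forward pass: hidden layer with ReLU, output layer with sigmoid
predict(x) = sigmoid(W2 * relu(W1 * x .+ b1) .+ b2)

X = randn(4, 100)                         # 100 fake observations, 4 features each
y = rand(0:1, 100)                        # fake binary labels

# Mean squared error of the (untrained) network on the fake data
ŷ = vec(predict(X))
mse = sum((ŷ .- y) .^ 2) / length(y)
println("MSE on random data: ", round(mse, digits = 3))
```

Nothing here is tuned or trained; the point is simply that the numerical core of such a model is a handful of lines of readable code, and the same code scales to much larger matrices, or to GPU-backed arrays, without changing shape.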
Having a bunch of sensors and tiny computers working in tandem while being connected to the internet may seem like the premise of a sci-fi movie, but nowadays it is a reality. Maybe not yet on a large scale, but it's getting there. Since this whole idea may not sound so appealing to the average citizen, someone came up with the term Internet of Things (IoT) to describe it. "What does all this have to do with data science?" you may ask. "Everything," I would reply.

Interestingly, the IoT movement is not about the sensors, nor about the connectivity of these sensors to a global network via high-tech computers the size of your phone. It is all about the data that is collected and then aggregated in the cloud, via the Internet. People have been using sensors for decades, and today most cars have several of them built into the information infrastructure managed by their on-board computers. Yet only recently has a wide range of such sensors become accessible both physically (due to their low cost) and informationally (due to the Internet and the cheap distributed computing infrastructure that tiny computers allow).

Naturally, there is no better way to obtain accurate data than with a sensor. A patient may tell his doctor that he feels he has a fever, but if the doctor is to do her job, she won't take his word for it. After listening to him explain his symptoms, she will take his temperature with a thermometer. Although it may not look like it, this simple medical device is nothing more than a specialized sensor for obtaining temperature data. So, sensors are not some abstract piece of tech that only engineers use; they are everywhere and have a variety of uses, many of which are critical. The data streams that flow from the abundance of sensors in an IoT framework are therefore very rich in information and insights. Without this data, a modern plane wouldn't be able to cruise safely at the high altitudes it is designed for. However, the data on its own is not enough to make anything interesting happen. It is its processing that makes it useful and valuable, and many modern data science systems make this task much easier.
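As a toy illustration of that last point, here is a small sketch in Julia that turns a raw stream of temperature readings into something slightly more actionable: a smoothed signal and a flag for suspicious readings. The readings, window size, and alert threshold are all made-up assumptions for the example.

```julia
# Toy example: from raw sensor readings to a (slightly) more useful signal.
# The readings, window size, and alert threshold are all made up.

readings = [36.5, 36.6, 36.7, 36.6, 39.2, 36.8, 36.7, 36.9, 40.1, 36.8]  # °C

# Simple moving average over a window of 3 readings
window = 3
smoothed = [sum(readings[i:i+window-1]) / window for i in 1:length(readings)-window+1]

# Flag any reading that deviates a lot from the overall mean
threshold = 1.5
avg = sum(readings) / length(readings)
alerts = [i for (i, r) in enumerate(readings) if abs(r - avg) > threshold]

println("Smoothed readings: ", round.(smoothed, digits = 2))
println("Suspicious readings at positions: ", alerts)
```

Trivial as it is, this is the essence of the matter: the raw stream by itself says little, while even a minimal amount of processing starts to surface the events worth acting on.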
However, before diving into IoT data and immersing ourselves in the brewing of insights from it, there are a few things we need to consider. These may not seem relevant, and will probably be shunned by most conventional data scientists, people who care only about doing what they are told. Yet a fox-like data scientist, who sees things from various angles and explores different possibilities and several aspects of the problem at hand, would definitely consider the bigger picture discussed next.

Applying foxy data science is not just about finding clever and innovative ways of working the data and presenting it in visuals that are engaging and insightful. Foxy data science, in my experience, is also about seeing the bigger picture and asking some interesting, albeit sometimes hard, questions. This may seem obvious to conventional scientists, but it is not so obvious to many people today who are so engaged in doing data science that they lose sight of what it is for. Some of us have had the privilege of working with talented and relatable managers who would share the bigger picture in a succinct and elegant way. However, most people just don't bother, partly because they are too focused on the technological aspects of it all. Even though this may have been relatively harmless in most cases so far, the IoT framework is a whole new animal and may require closer attention. Because once the genie is out of the bottle, it isn't going to go back...

When Predictions Go Awry - The Case of Statisticians' Failure to Guess the US Election Outcome (11/10/2016)
Unless you were living under a rock, you probably know about the Black Swan that is the recent election outcome in the US. Nate Silver, who had successfully predicted the previous president's success, failed miserably to predict the current one's. Other statisticians had a similar experience with this challenge. So, what went wrong? Why did their models fail to make an accurate prediction?

Although this may seem like a simple problem, in essence it is one of the most complicated data analytics tasks people have dealt with. It's not the volume of data that was the issue, or the velocity with which it was generated. As for variety, that was practically non-existent, since all the data points were of the same type. However, the veracity of the data may very well have been the underlying factor of this data analytics blunder. People assume that when someone tells them that they voted for X, they have indeed voted for X. After all, the data points in such a survey are anonymous, so what's the point of lying? Well, the fact that there is no way to verify the validity of one's input makes lying not just a convenient option but also quite a likely one. This results in lower veracity of the data, leading to incorrect output from a predictive model that might otherwise be very accurate. When people believe in a particular candidate, especially in what he or she stands for, they are more than happy to voice their opinion and their vote on the matter. However, if they don't believe in that person and vote only because they don't want the other candidate to get elected, they may want to hide their true vote and instead state something that is more socially acceptable. It's not rocket science, just human psychology.

Would this predictive analytics fiasco have been avoided if the analysts had used a more robust system, like deep learning, or whatever kind of A.I. you prefer? Unlikely. An A.I. system cannot guard against bad data. There is a famous adage about this in computer science: garbage in, garbage out (GIGO). If you feed an algorithm garbage inputs, you can be certain that the outputs aren't going to be any better. That's not to say that an A.I. system is a bad choice in general, since many such systems have yielded very accurate results on a variety of problems. However, if the data is problematic, they won't magically filter out all the low-veracity data points and yield accurate results. This is science, not a sci-fi movie. Also, contrary to what some managers think, data scientists don't have a magic wand, so if there is an issue with the data, they can't just wish it away with a spell. This sounds obvious, but many people's expectations of the field suggest that they may believe it, even if they never admit it. So, before blaming Nate Silver or any other statistician who failed to predict this election's outcome accurately, be sure to examine the root causes of their failure. Predictive analytics is not an exact science and it is heavily dependent on the data at hand. If the data is unreliable, you may want to adjust your expectations accordingly.
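To make the garbage-in, garbage-out point concrete, here is a small simulation sketch in Julia of how a modest share of respondents misreporting their vote can pull a poll estimate away from the true support level. The percentages are purely illustrative assumptions, not estimates of what actually happened in this election.

```julia
# Toy simulation: how misreporting skews a poll estimate (all numbers made up).

true_support = 0.52        # actual share of voters supporting candidate X
misreport_rate = 0.10      # share of X's supporters who claim otherwise when polled
n = 10_000                 # number of simulated poll respondents

supports_x = rand(n) .< true_support                     # true preferences
reported_x = supports_x .& (rand(n) .>= misreport_rate)  # some supporters hide it

println("True support in the sample:    ", sum(supports_x) / n)
println("Support reported to pollsters: ", sum(reported_x) / n)  # systematically too low
```

No amount of modeling sophistication downstream recovers the gap between the two numbers, because the bias is baked into the inputs, which is exactly the veracity problem described above.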
This post is not about the talk on this topic that I gave at Galvanize a couple of months ago; that was for the few who happened to be around the Seattle area and didn't have any other commitments at the time. I'm referring to the video based on that talk, which I created afterwards and which found its way to the Safari Books Online website, via Technics Publications.

This 20+ minute video covers some of the basics of Julia (so that you don't have to read a book on it to learn them), as well as some more data-science-specific topics, illustrating how it can be a useful tool in your toolkit. I am not making the argument that Julia is the best thing since sliced bread, like other passionate coders often do, particularly when talking about Python in relation to R, or vice versa. Everyone has options, and Julia is just one of them. Since it is the option I am qualified to talk about more than any of the others, I chose to do so in this video. My hope is that people will start using it more, probably in combination with Python or whatever else they are using (even the C language). Because at the end of the day, what's important is not the tool itself, but what you do with it.

However, how useful a tool is depends greatly on the know-how around it. Even though you won't become an expert in Julia by watching this video, you will get a good understanding of what it is about and why it can be a useful technology to know if you are doing data science. The better you are at data science, the better your chances of finding it useful. This is probably why many people use Julia for other applications (e.g. academic research, simulations, etc.). There is nothing wrong with that, since Julia was developed to be a versatile tool. The reason this video is special is that it demonstrates an angle that many Julians may not be so aware of: Julia's usefulness in data science. So, if you are intrigued by this possibility, here is my recommendation: improve your data science know-how, examine where you can use Julia in your data science pipeline, and start experimenting with it on specific data science problems you are trying to solve. Hopefully this video can be an asset towards this objective.

Disclaimer: I'm not promoting Julia because someone told me to, or because it's a niche technology that I happen to be an expert in, at least for data science applications. The reason I'm promoting this new tech is that right now it appears to be the optimum choice for doing data science, particularly the hard parts of it. If Dr. X of university Y comes out tomorrow with a new programming platform that outperforms Julia overall, you can be sure that I'll be looking into it with the same zest I now have for Julia.

Recently I've changed the favicon of my blog site to a fox image (what a surprise!). I have to say I was quite impressed with how easy it has become nowadays to create a favicon online from an image file. The variety of premade favicons on the web is impressive too! Anyway, I just wanted to say that the favicon site I used is great and that I'd recommend it to all website owners out there.
Graphs have become more and more popular in data science over the past few years. In fact, it is highly unlikely that you haven't used a graph in your analytics work, even without realizing it. Decision trees and neural networks, for example, are special cases of graphs, of the DAG (Directed Acyclic Graph) category. Developing a graph in order to model a problem or a process, however, is not a trivial task. Maybe it's easy for employees of Facebook and LinkedIn, who work with graphs all day long, but for the average data scientist it can be a bit of a challenge. The reason is simple: graphs deal with an abstraction of a feature space or process that has only two main elements: objects (aka vertices or nodes) and the relationships among these objects (aka edges or arcs). So, how do you go about developing a graph to express a particular data set or process (e.g. a machine learning model)? Well, you need to talk to your business liaison first and make sure that you understand the requirements of the model you are trying to develop. You can make a great graph that represents your data perfectly, but it may not be the droid they are looking for! So, step 0 is to make sure that you are aligned with the business directive and the question(s) you are aiming to answer through your data analytics efforts. Once you have figured that out, it is fairly easy to craft your graph by following a few key steps.
Note that you may often have to make assumptions about the connectivity of the nodes, since you don't want to make your graph too complicated. Although there are perks to having a fully connected graph, the computational overhead of such a graph model may not justify the additional resources required to store and process it. So, you may want to introduce a threshold below which a connection is considered absent (i.e. the corresponding nodes appear disconnected). This naturally relates to the weights of the edges, and you may need to do some analysis to come up with a meaningful threshold (a minimal sketch of this idea follows at the end of this post). Even though graphs have their own algorithms for processing the data they represent, they are not divorced from statistics and other data analysis tools. People who see graphs as a completely separate part of data science have not understood them in depth. I recommend that you distance yourself as much as possible from those people and join our sub-graph of data scientists: the ones who use all data analytics tools in tandem, without silos among them. After all, a well-connected graph is bound to yield more interesting (and often more meaningful) insights. Isn't that why you craft graphs in the first place?
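Here is the promised sketch of the thresholding idea, in Julia: it builds a sparse, weighted, undirected graph from a made-up similarity matrix and keeps only the edges whose weight clears a cutoff. The matrix values, the cutoff, and the plain edge-list representation are all assumptions for illustration; dedicated packages in the ecosystem (LightGraphs.jl, for instance) provide proper graph types for real work.

```julia
# Toy example: build a sparse, weighted, undirected graph from a similarity matrix,
# keeping only edges whose weight is at or above a chosen threshold.
# The similarity values and the threshold are arbitrary.

similarity = [1.0  0.8  0.1  0.4;
              0.8  1.0  0.3  0.7;
              0.1  0.3  1.0  0.2;
              0.4  0.7  0.2  1.0]

threshold = 0.5
n = size(similarity, 1)

# Edge list of (node_i, node_j, weight) for pairs at or above the threshold
edges = [(i, j, similarity[i, j]) for i in 1:n for j in (i+1):n
         if similarity[i, j] >= threshold]

println("Kept edges: ", edges)   # weaker connections are treated as absent
```

Raising or lowering the threshold trades off sparsity against information retained, which is exactly the kind of decision the preceding paragraph suggests backing with some analysis rather than guesswork.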
As some of you may know, last year I took a sabbatical and focused on my writing, in order to produce a book on Julia (the Julia for Data Science book, by Technics Publications). Actually, it wasn't a full sabbatical, as I had to do some odd jobs here and there to pay for the overpriced rent and other expenses of the city of Seattle! That's not to say that I didn't enjoy the whole process, though. I particularly loved working as a data scientist on a three-month contract at G2 Web Services, even more than I had enjoyed other data analytics positions over the years. However, I found it particularly challenging to do a good job on the book while holding a full-time job. That's because, unlike other publishers, Technics Publications focused (and still focuses) on quality rather than quantity, so I had to make sure that whatever I wrote was worth the ink and paper it would use once published.

Writing a book on an evolving technology like Julia wasn't an easy feat, which is probably why Technics Publications was the first non-trivial publisher to finish such a project (even though other, more well-known publishers were attempting the same thing while I was writing this book, such as O'Reilly with Leah Hanson, one of the Julia gurus, whom I have followed over the years through her blog). However, writing yet another book on Julia itself wasn't appealing to me, for two reasons. First of all, I wasn't a developer, so there were some more esoteric aspects of the language that I wouldn't be able to explain properly. Secondly, people in my field are more interested in the usefulness of a technology, particularly how it can be used to crunch data effectively and efficiently. Since I had some expertise in data science, I decided to write yet another data science book, geared towards the Julia language.

This whole endeavor was tough but educational for me. I had an opportunity to go deep into the technology, staying up to date with the latest trends and interacting with some of its more experienced users (who were very happy to help out, by the way, something I hadn't encountered while learning other technologies). Also, I got to write several Julia scripts and get acquainted with the IDEs in a way that would have been impossible otherwise. Because it's only when you try to explain something to someone who has never heard of it before that you really get to know it yourself.

"Was it worth it?" you may ask. Well, for me it was. I wouldn't do the same thing again for yet another book on R or Python, though, since I would have a hard time staying motivated. After all, Julia was a bit of a gamble back then (even though I had no doubt that it would become better and more popular as time went by). Things could have gone awry and the book would have been all for nothing. If I did my best, though, it might just help this technology become better known. It was a risk, but a calculated one. So, if you are thinking about writing a book yourself on some data science technology or methodology, here is my advice: don't talk to too many people, but do talk to the ones who know. Form a concrete idea of what the world needs and aim to fulfill that need through your book. It may not be a best seller, but it definitely looks good on both your resume and your bookshelf!
Zacharias Voulgaris, PhD - Passionate data scientist with a foxy approach to technology, particularly related to A.I.