“I have never let my schooling interfere with my education.” (quote believed to be originally by Mark Twain)
People talk about education a lot these days, particularly in a data science setting. However, we need to discern between actual education and training. Both are essential, but it is the former that holds the most value. The latter is easier and oftentimes faster, but it may not be a good investment of your time if it is not accompanied by the former.
Education is all about mindset development and the ability to feel inspired by knowledge, thereby developing a healthy yearning for it. It is what happens when you teach a child how to play a game, or do a specific task. Although it’s more of a state of mind than anything else, education also has a formal aspect to it which is related to courses, seminars, workshops and talks, geared towards enhancing one’s understanding and comprehension of the topic at hand.
Training, on the other hand, is more geared towards techniques, methods, and the technical details of the topic taught. This is useful, of course, since every data scientist needs to know all these things. That’s why there are so many data science books and videos out there! However, knowing how to build an SVM or a neural network doesn’t make someone a competent data scientist. In fact, in some cases it doesn’t even make him an employable one.
Perhaps there is a reason why most companies require X years of experience in their recruits. Some things in data science you can only learn through time, by practicing them and by developing an intuition for the data and how it is processed. Although the idea that a data scientist has to have X years of experience to be worthy is something that remains debatable (why X and not Y?), this trend shows that hiring managers can spot a difference between someone who knows data science from a book (or videos) and someone who knows the craft because she has worked the data and has developed a bunch of models, through lots of trials and the inevitable mistakes that ensue.
Education is therefore something that can be attained through experience, not just reading and watching data science material on the Safari platform. The latter can be a great start, but you still need to get your hands dirty and also think about the whole thing, instead of just following recipes from a data science cookbook. It’s important to know techniques, no doubt, but unless you have developed an understanding that allows you to go beyond these techniques and explore alternative features and alternative models, you may never grow beyond the advanced beginner stage.
Even someone who has spent most of his life in data science can still learn about this field, as it's a) very diverse and widespread, and b) always evolving. Personally, I still find that I’m learning new things as I delve deeper into the field and as I converse with other data scientists and A.I. professionals, of all levels. This too can be a form of education, not any less valuable than the education of creating a new data analytics method, or a new data product. The moment someone starts looking down on education and thinks that he knows “enough” is the moment he begins becoming obsolete.
We often tend to forget that at the end of the day, data science is a business process and that data is a business resource. Whether this business is a for-profit or a non-profit is irrelevant. The essence of the whole thing is that data science is not a typical scientific field. In fact, some would argue that it’s not a “real science” at all since it is so attached to the business world. Although these people would probably view this as a defect of the craft, I tend to see it in a very positive light. After all, what constitutes a real science is often a matter of debate.
Sometimes it’s easy to get carried away and focus on data science too much, losing sight of the applications of it. Although this is something somewhat common in an academic setting (particularly in universities that don’t have any ties to the industry), it may happen in companies too. When this happens, it’s usually best to walk away, since data science without any real-world application can be problematic.
Data science, and A.I. that’s geared towards data analytics, involve a lot of scientific methodologies, which are quite interesting on their own. This may urge someone to get lost in that aspect of the craft and neglect the application part, particularly the one where these methodologies are employed for solving real-world problems. That’s not to say that doing data science research is bad. Quite the contrary. However, when the research is without any application, focusing too much on the math side of things, it is bound to be a waste of resources (unless you are doing this as part of a research project, e.g. for a research center or a university, in which case this is expected). The reason is that data science is by definition an applied field, much like engineering. Particularly when it is undertaken by a company (e.g. a startup), it needs to be able to deliver something concrete, and more importantly, something useful.
It’s hard to overestimate the value of this aspect of data science that has to do with the end-user. After all, this person is often the one paying the bills! Also, focusing on the application part of the craft enables something else too: the more practical implementation of the technologies developed and the inception of new methods that are more hands-on and therefore useful. This is one of the reasons that data science has veered away from Statistics, a field which is by its nature more theoretical and more math-y than applied Science. That’s also the main reason why data science involves a lot of programming, oftentimes building things from scratch, even if it’s simple scripts. That’s quite different from using an all-in-one software package, like SAS or SPSS, where the user merely calls functions and does rudimentary data processing.
You can come up with ingenious methods in data science that would be able to fetch a journal publication or two. However, if these methods don’t add value to an organization, they are not that great, from a holistic standpoint. This is observed in other parts of Science too, e.g. Electromagnetism. Despite the various theoretical aspects of that field, its usefulness is also apparent. People who practice this part of Physics tend to be very practical and oftentimes come up with interesting inventions that add value to their user (e.g. in the case of electromagnets, or power transformers). Data science is not any different.
All the clever mathematics behind a method may be enchanting for the mind, but it’s when this method is put into practice and yields some insight, oftentimes an actionable one, that it really becomes meaningful. That’s something worth remembering, since it’s easy to lose sight of the questions we are trying to answer, and focus too much on the possibilities that we discover. And some may argue that it’s the journey that matters, but for a journey to be a journey there needs to be a destination. The latter is usually some person who doesn't care much about the science behind the insights, but more about their applicability and usefulness. Companies like MAXset LLC may be completely ignorant of that, but this doesn't make it a viable strategy. On the other hand, companies that have a chance of providing true value to the world make the business aspect of the craft their priority.
People like to talk about the V’s of big data, since it is a topic comprehensible to almost everyone, while it also provides insight regarding the benefits of using data science in an organization. Naturally, these benefits are linked to having access to various data streams, usually resulting in massive amounts of data, and usually referred to as big data. Not everyone agrees as to what V’s are valid for characterizing this valuable resource (some say it’s 4, others exclude Veracity, while others include a couple more too). However, there seems to be a consensus about the last V, namely Value. Nevertheless, whether there is value in big data or not is something that remains to be determined, since not all big data is created equal.
The issue with the V of value is that it’s not inherent in the data. If that were the case, someone could just buy this data (or license it) and then automatically improve his organization’s ROI. The value of big data is actually something that stems from data science’s transformation of this data into insights and/or data products. The same data that would otherwise be gathering dust on some computer cluster somewhere is turned into something people can use and oftentimes monetize, through data science. This is something that takes effort, however, and most importantly, requires a certain quality in the data to begin with.
It’s often useful to think of data as a gold mine. After all, just because it has the potential of yielding large amounts of the valuable metal, it doesn't mean that it will. Perhaps the mine is all dried up, or doesn’t have much gold to begin with. No amount of data science can remedy that. Data science can yield something of value only if there is something in the data that could be of value. Many times people forget that, just like the people who buy a gold mine and expect that they’ll be swimming in gold soon enough.
The V’s of big data, on the other hand, are something real and present in every data stream that qualifies as big data. In fact, they are more like characteristics of the data itself, rather than something dependent on data science. However, the V’s themselves may provide some insight as to how much of big data the data at hand is, but not much regarding its potential for an organization. For example, big data of high veracity that’s related to people’s views on a particular commercial product may be completely useless to an organization that is all about some service. The data itself is fine, but doesn't add value to the organization.
So, in order for big data to be of actual value, we need certain things to be in place. First of all, the data needs to be handled by a data science team (or a single data scientist, if he’s competent enough). Moreover, it needs to have some affinity to the organization’s domain. Finally, there needs to be something insightful in the data, which can be surfaced through a data science project, be it through a better understanding of a situation or through a data product that the organization can use.
In conclusion, the fact that some data stream can offer value doesn't necessarily mean that it will. After the data science team has done its part, the stakeholders of the project need to take action, utilizing the insights and/or the data product developed. People sometimes forget that and neglect leveraging the benefits of a data science project to the fullest extent, much like a gold miner may obtain the gold from a mine, but never get around to doing anything useful with it...
It is easy to fall into the misconception that in data science we are all solitary people doing our work and interacting only in the workplace and on social media. Perhaps we are part of some data science team, but still feel we are on our own when it comes to our relationship with the field. However, this is just one of many possibilities in how we relate to the data science world, and it is definitely not the best one.
Being part of a community in data science is not only possible but also necessary. Of course just networking with other data scientists may not be enough, but it is often a good starting point. This is particularly important towards the beginning of one’s career. After all, not even the best data science books can give someone solace in times of difficulty or doubt. That’s when having a good mentor comes in very handy. After all, even if that mentor is a bit aloof and preoccupied with his own stuff, he tends to have a genuine interest in your career and is motivated to help you out, at least to some extent. This can be another step towards becoming part of a community of data science professionals.
Make no mistake, however. Neither the mentor, nor anyone else is going to fight your battles for you. The other data scientists, be it professional acquaintances, mentors, or teammates, have their own battles to tackle. However, they may be able to offer you advice or help you gain insight into solutions that you couldn't think of by yourself, especially during the time you are immersed in the problems you are tackling.
Finding a physical community may not always be possible. Not all cities are as advanced as the ones where the field thrives and has cohorts bustling with data science events and activities. However, data scientists are out there who are also in need of a community, so it’s only a matter of time before you find them. Perhaps you’ll “meet” them online, through some social network or a data science forum. Maybe you’ll encounter them at a data science conference, or a webinar. Bottom line, if you are open to finding a community of data scientists, the opportunities to do so will manifest, sooner or later.
Being part of a data science community is not only to help you in difficult times though. It’s also a great accelerator for developing yourself as a data scientist through being exposed to new trends, novel approaches to known problems, and most importantly, to unknown problems that you’d probably not encounter on your own, even if you work in a data-driven company. All that is bound to foster in you the knowledge and know-how you need to advance to the next level, whatever that level is for you. At the same time, it can help you maintain your enthusiasm for data science, and perhaps even make you more zestful about the field. After all, it is usually the people who are passionate about something that make the most progress in it and are also consistent in doing so. Data science is not any different in that respect.
Everyone talks about data science these days, as well as A.I., since the value these disciplines can add to an organization is being verified more and more. However, there are organizations out there that are not ready yet to make use of data science, even if they have ads for data scientists in various job forums. Before applying to places like that, you may want to answer this question for yourself: is this organization I’m interested in data science ready?
Just because an organization has seen value in a data science proof-of-concept (PoC) project, it doesn't make it ready to employ and utilize data science professionals. First of all, it has to have a solid leadership team, one that at the very least has a CTO who has worked with data scientists, though additional roles like that of a CIO and a CDO would also be useful. If the C-level team of an organization hasn't worked with data scientists and doesn't have a clear idea of what data science can and what it cannot do, then this is a red flag.
In addition, having access to a variety of data streams, even if these don’t qualify for “big data” status, is essential for an organization to be data science ready. If all its data is in Excel spreadsheets and SQL databases, perhaps it needs a data analyst, a business intelligence professional, or a statistician. If it does get a data scientist, it won’t be able to do much with her, since she will not have enough to work with to provide sufficient value, the kind that can translate to a positive ROI for her group. That data scientist is better off working somewhere else, where they make better use of her skills and her mindset.
Moreover, a data science ready organization has realistic expectations and a good plan about how to utilize its data resources. Just because it has access to good data, it doesn’t mean that it can get value from it, even if it employs a group of very talented data scientists. It also needs to know what it is going to do with it, what data products it can create, how it is going to leverage the insights the data science team provides, etc. All that is not going to take place in the next quarter necessarily, especially if the organization is new to data science. So, expecting some ground-breaking results within the next 3 months would be naive and financially irresponsible. An investment like this is bound to take some time before it yields dividends and if the organization is not aware of this, then it may not be ready just yet.
Beyond these signs, there are other, more specialized ones that are more domain-specific or data-specific. However, mentioning them here would make the article so long that you’ll need to run some text analytics system on it to derive all the information from it! So, let’s just say that there are other things that can be good predictors as to whether an organization is worth your time as a data scientist, or in the case you are a hiring manager of such an organization, whether you should start recruiting data scientists at this point. After all, data science is a long game, so there is no point rushing into it. It’s more beneficial if it is conducted in an environment that is conducive to it, and capable of fostering a congruent and efficient team, poised to add value to whatever data it utilizes.
People like to argue, especially about things they can reason with. However, just because you can justify that your view has merit, giving some practical examples or through logical reasoning, this doesn't make alternative views invalid. If there are several programming languages in data science, perhaps an oversimplification like “X is the best language for data science because Y” doesn't hold much water. Let’s examine why.
Although it is possible to rule out certain languages (e.g. Assembly or C) as optimal for data science, this doesn't mean that the problem has a clear-cut solution. Also, the assumption that a single programming language can cover all the use cases of a data science professional is a quite unjustifiable one. Some data scientists use two or three programming languages, sometimes in combination, getting the best of each, for optimal overall performance.
Also, data science is all about solving a business problem in a scientific manner. Just because, say, Dr. Smith prefers to use language X over Y, it doesn't mean that you have to follow her example. Maybe she used language X during her PhD and didn't have time to learn another language, or she attained mastery of that language, so she feels more comfortable doing her data science work with that. She may be a successful data scientist, but following her programming habits won’t necessarily make you a great data scientist.
Moreover, with new languages and new packages in the existing languages coming about all the time, asking which language is best is like asking which basketball team is the best performing one. Definitely not something particularly stable! Besides, it’s often the case that a particular project may require special handling, so what is a top-performer now may not be the best option for that particular case.
In addition, the almost religious attitude towards programming languages that many people have (not just data scientists) is by itself problematic. If a potential employer sees you arguing about how your language of choice is the best and that you are not open to considering alternatives, he may not be so eager to hire you, since this kind of attitude creates disharmony and difficulty in collaboration among the members of a team. Besides, most companies nowadays rarely ask for a specific language in the candidate requirements. As long as you can do the task that’s required of you, they don’t really care much what your programming background is. Of course companies that have already invested in a particular language and have all their code in that language may not be so flexible, but that shouldn't be the principal factor in your decision about which language you learn.
Finally, when it comes to deep learning, many modern frameworks, like Apache’s MXNet, have APIs for a variety of programming languages. So if your A.I. guru friend tries to convince you that you should learn language X because that’s the best deep learning language, take that suggestion with a pinch of salt!
The important thing is that, whatever language you decide to learn for data science, you make sure that you learn it well. Familiarize yourself with its packages, use it to solve various problems, and learn the best strategies for debugging code written in that language. If you do that, you can still make good use of it for your data science projects, even if the majority of people prefer this or the other language instead.
Just wanted to clarify something about the videos I post on Safari Books Online. Each one of these videos is not an audio-visual version of a book on the topic, but more of an overview of it.
I have specific requirements about the duration, so it is infeasible to go into much depth on any one of the topics, especially those topics that are more general. So, if you decide to watch a video of mine, please manage your expectations accordingly. None of these videos will make you an expert or provide you with the specialized knowledge that you'd find in a book. However, they can be a quick and effective way to get the basics down so that when you read a book on that topic, you'll have a sense of perspective and be able to focus on the details, since you'll have a firm grasp of the key concepts.
So, if you want to go into depth on any given topic, I'd recommend either reading a book or two, or taking a course on it. The videos have a more supportive role and they are more useful if they are seen as such.
Recently I decided to make another video on cyber security, a topic I'm quite fond of. This time, I tackled Cryptography, which is a truly intriguing field, independent of data science but similar to it in some ways. So, as of today this video is available on Safari (you need to have a subscription to the portal in order to view the whole of it). Now, it's just an introductory video, so don't expect it to make you an expert in this. However, after viewing it, you'll have a solid understanding of what Cryptography is, how it is useful, what methods it includes, and some practical tips on how you can make use of it in your everyday life. Enjoy!
People nowadays, especially those who don’t understand programming, tend to be opinionated about programming languages and harbor unrealistic expectations. It’s this kind of people who spill negativity towards promising projects like Julia, which are still in the process of development. The same people would probably say nasty things about Python, or R, if these languages were developed in a time when early releases of them were accessible to the world through the Internet. So, perhaps it’s not really Julia these people have an issue with...
It’s easy to criticize something, be it a book, a movie, or a programming language. It’s probably the easiest thing someone can do, other than doing nothing. However, doing nothing doesn't hurt anyone, while the negativity of criticism has a corrosive effect on whoever is exposed to it. It would be overly idealistic to think that people who have this nasty habit could be cured of it, since most likely there are deep issues that cause it to manifest, which would probably require professional help to remedy. What can be remedied fairly easily though is the effect of these criticisms, since they are based on some shallow opinion rather than facts.
So, if you have heard someone who has spent a few hours learning about Julia and trying it out on his laptop dis Julia, that’s not a view you need to take very seriously. Just like every programming language, Julia has its issues and the packages out there are not in their final form. Just because something doesn't have the maturity and elegance of Pandas or Scikit-learn, it doesn't make it useless though. Julia, unlike other high-level languages, enables its users to make their own scripts easily and ensure high performance in them. Imagine trying to do that in Python! You’d need to be a computer science expert in order to guarantee high performance in a script you just put together and most likely you’d need to make use of C at some point in it (Cython).
However, just because some people love Julia and swear by it, you shouldn't take their word for it. The idea is that you try it out yourself, like you’d try some other language, namely through methodical studying and practice. After you've spent quite some time and have developed your own (working) programs in it, then you can have a valid opinion on it. And if you don’t like it, that’s fine. Most Julia users don’t take offense if you don’t like their favorite language. However, since these people don’t dis your language of choice, I believe it is only fair if you show some respect for their favorite language. After all, Julia is not competing with any other language. It just does its thing, like Swift, and other fairly new programming languages.
Perhaps Julia is not the language of choice for the majority of data science practitioners. That’s perfectly fine. Just because it’s not as mature as Python or R, however, it doesn't mean that it’s not useful. Also, as it’s still in its early stages of development, it can only improve as time goes by. Till then, you can always use it for specific tasks, parallel to your language of choice. After all, there are bridge packages that enable that, which is more than someone could say about some other new languages, like Go.
If I've tried to make the argument that Julia is a great programming language, that’s because I find new technologies interesting and useful for an ever-changing field, such as data science. It was never my intention to convert anyone to that language, merely to make it more well-known. After all, data science is all about mindset and methodologies, not so much about the specific tools, which inevitably change over time.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.