Everyone talks about data science these days, as well as A.I., since the value these disciplines can add to an organization is being verified more and more. However, there are organizations out there that are not ready yet to make use of data science, even if they have ads for data scientists in various job forums. Before applying to places like that, you may want to answer this question for yourself: is this organization I’m interested in data science ready?
The fact that an organization has seen value in a data science proof-of-concept (PoC) project doesn't make it ready to employ and utilize data science professionals. First of all, it has to have a solid leadership team, one that at the very least has a CTO who has worked with data scientists, though additional roles like those of a CIO and a CDO would also be useful. If the C-level team of an organization hasn't worked with data scientists and doesn't have a clear idea of what data science can and cannot do, then this is a red flag.
In addition, access to a variety of data streams, even if these don’t qualify for “big data” status, is essential for making an organization data science ready. If all its data is in Excel spreadsheets and SQL databases, perhaps it needs a data analyst, a business intelligence professional, or a statistician instead. If it does get a data scientist, it won’t be able to do much with her, since she will not have enough to work with to provide the kind of value that can translate to a positive ROI for her group. That data scientist is better off working somewhere else, where they make better use of her skills and her mindset.
Moreover, a data science ready organization has realistic expectations and a good plan about how to utilize its data resources. Just because it has access to good data, it doesn’t mean that it can get value from it, even if it employs a group of very talented data scientists. It also needs to know what it is going to do with that data, what data products it can create, how it is going to leverage the insights the data science team provides, etc. All that is not necessarily going to take place in the next quarter, especially if the organization is new to data science. So, expecting some ground-breaking results within the next 3 months would be naive and financially irresponsible. An investment like this is bound to take some time before it yields dividends and if the organization is not aware of this, then it may not be ready just yet.
Beyond these signs, there are other, more specialized ones that are more domain-specific or data-specific. However, mentioning them here would make the article so long that you’ll need to run some text analytics system on it to derive all the information from it! So, let’s just say that there are other things that can be good predictors as to whether an organization is worth your time as a data scientist, or in case you are a hiring manager of such an organization, whether you should start recruiting data scientists at this point. After all, data science is a long game, so there is no point rushing into it. It’s more beneficial if it is conducted in an environment that is conducive to it, and capable of fostering a congruent and efficient team, poised to add value to whatever data it utilizes.
People like to argue, especially about things they can reason with. However, just because you can justify that your view has merit, giving some practical examples or through logical reasoning, this doesn't make alternative views invalid. If there are several programming languages in data science, perhaps an oversimplification like “X is the best language for data science because Y” doesn't hold much water. Let’s examine why.
Although it is possible to rule out certain languages (e.g. Assembly or C) as optimal for data science, this doesn't mean that the problem has a clear-cut solution. Also, the assumption that a single programming language can cover all the use cases of a data science professional is a quite unjustifiable one. Some data scientists use two or three programming languages, sometimes in combination, getting the best of each, for optimal overall performance.
Also, data science is all about solving a business problem in a scientific manner. Just because, say, Dr. Smith prefers to use language X over Y, it doesn't mean that you have to follow her example. Maybe she used language X during her PhD and didn't have time to learn another language, or she attained mastery of that language, so she feels more comfortable doing her data science work with it. She may be a successful data scientist, but following her programming habits won’t necessarily make you a great data scientist.
Moreover, with new languages and new packages in the existing languages coming about all the time, asking which language is best is like asking which basketball team is the best-performing one. Definitely not something particularly stable! Besides, it’s often the case that a particular project may require special handling, so what is a top-performer now may not be the best option for that particular case.
In addition, the almost religious attitude towards programming languages that many people have (not just data scientists) is by itself problematic. If a potential employer sees you arguing about how your language of choice is the best and that you are not open to considering alternatives, he may not be so eager to hire you, since this kind of attitude creates disharmony and difficulty in collaboration among the members of a team. Besides, most companies nowadays rarely ask for a specific language in the candidate requirements. As long as you can do the task that’s required of you, they don’t really care much what your programming background is. Of course, companies that have already invested in a particular language and have all their code in that language may not be so flexible, but that shouldn't be the principal factor in your decision about which language you learn.
Finally, when it comes to deep learning, many modern frameworks, like Apache’s MXNet, have APIs for a variety of programming languages. So if your A.I. guru friend tries to convince you that you should learn language X because that’s the best deep learning language, take that suggestion with a pinch of salt!
The important thing is that, whatever language you decide to learn for data science, you make sure that you learn it well. Familiarize yourself with its packages, use it to solve various problems, and learn the best strategies for debugging code written in that language. If you do that, you can still make good use of it for your data science projects, even if the majority of people prefer this or the other language instead.
Just wanted to clarify something about the videos I post on Safari Books Online. Each one of these videos is not an audio-visual version of a book on the topic, but more of an overview of it.
I have specific requirements about the duration, so it is infeasible to go into much depth on any one of the topics, especially those topics that are more general. So, if you decide to watch a video of mine, please manage your expectations accordingly. None of these videos will make you an expert or provide you with the specialized knowledge that you'd find in a book. However, they can be a quick and effective way to get the basics down so that when you read a book on that topic, you'll have a sense of perspective and be able to focus on the details, since you'll have a firm grasp of the key concepts.
So, if you want to go into depth on any given topic, I'd recommend either reading a book or two, or taking a course on it. The videos have a more supportive role and it is more useful if they are seen as such.
Recently I decided to make another video on cyber security, a topic I'm quite fond of. This time, I tackled Cryptography, which is a truly intriguing field, independent of data science but similar to it in some ways. So, as of today this video is available on Safari (you need to have a subscription to the portal in order to view the whole of it). Now, it's just an introductory video, so don't expect it to make you an expert in this. However, after viewing it, you'll have a solid understanding of what Cryptography is, how it is useful, what methods it includes, and some practical tips on how you can make use of it in your everyday life. Enjoy!
People nowadays, especially those who don’t understand programming, tend to be opinionated about programming languages and harbor unrealistic expectations. It’s this kind of people who spill negativity towards promising projects like Julia, which are still in the process of development. The same people would probably say nasty things about Python, or R, if these languages were developed in a time when early releases of them were accessible to the world through the Internet. So, perhaps it’s not really Julia these people have an issue with...
It’s easy to criticize something, be it a book, a movie, or a programming language. It’s probably the easiest thing someone can do, other than doing nothing. However, doing nothing doesn't hurt anyone, while the negativity of criticism has a corrosive effect on whoever is exposed to it. It would be overly idealistic to think that people who have this nasty habit could be cured of it, since most likely there are deep issues that cause it to manifest, which would probably require professional help to remedy. What can be remedied fairly easily though is the effect of these criticisms, since they are based on some shallow opinion rather than facts.
So, if you have heard someone who has spent a few hours learning about Julia and trying it out on his laptop dis Julia, that’s not a view you need to take very seriously. Just like every programming language, Julia has its issues and the packages out there are not in their final form. Just because something doesn't have the maturity and elegance of Pandas or Scikit-learn, it doesn't make it useless though. Julia, unlike other high-level languages, enables its users to write their own scripts easily and ensure high performance in them. Imagine trying to do that in Python! You’d need to be a computer science expert in order to guarantee high performance in a script you just put together and most likely you’d need to make use of C at some point (e.g. through Cython).
However, just because some people love Julia and swear by it, you shouldn't take their word for it. The idea is that you try it out yourself, like you’d try some other language, namely through methodical studying and practice. After you've spent quite some time and have developed your own (working) programs in it, then you can have a valid opinion on it. And if you don’t like it, that’s fine. Most Julia users don’t take offense if you don’t like their favorite language. However, since these people don’t dis your language of choice, I believe it is only fair if you show some respect for their favorite language. After all, Julia is not competing with any other language. It just does its thing, like Swift, and other fairly new programming languages.
Perhaps Julia is not the language of choice for the majority of data science practitioners. That’s perfectly fine. Just because it’s not as mature as Python or R, however, it doesn't mean that it’s not useful. Also, as it’s still in its early stages of development, it can only improve as time goes by. Till then, you can always use it for specific tasks, parallel to your language of choice. After all, there are bridge packages that enable that, which is more than someone could say about some other new languages, like Go.
If I've tried to make the argument that Julia is a great programming language, that’s because I find new technologies interesting and useful for an ever-changing field, such as data science. It was never my intention to convert anyone to that language, merely make it more well-known. After all, data science is all about mindset and methodologies, not so much about the specific tools, which inevitably change over time.
That’s a question that many people ask themselves and professionals in the data analytics field. However, they get different answers depending on who they ask. Naturally, the A.I. professional will tell you that of course, since A.I. methods are much better than conventional machine learning ones, while the field is booming lately. The data scientist may have a more restrained approach, as she is more likely to look at the matter scientifically, expressing some caution about how influential A.I. professionals will be in the data science field. As someone who is both in A.I. and Data Science, perhaps I could offer a more balanced perspective.
First of all, an A.I. professional is a specialist in A.I. methods and if we are thinking about how this person can do a data scientist’s job, we are looking at someone who focuses on data analytics, rather than some other part of A.I. (e.g. robotics, theoretical A.I., etc.). Also, when we are examining a data science professional, we are looking at someone who is not in A.I. and who uses mostly conventional data science methods for the data analytics problems he tackles.
In my latest book, I outlined the importance of A.I. and how it is very influential in the data science field and the role of the data scientist. I even encouraged people to be kept up-to-date about the developments of A.I. as I predicted it will have an important role to play in the years to come. However, I did not urge anyone to drop what they are doing and focus on A.I. methods alone. If someone is already in the field, that’s great, since they already have developed the mindset of the data scientist and have mastered some of the tools, so by studying A.I. methods for data analytics, they are expanding their skill-set. That’s different from becoming A.I. specialists though. The A.I. specialists may be great at tackling Kaggle competitions, where the data is in a pretty clean and structured form (or at least mostly structured). However, this doesn't automatically make them adept at handling all kinds of data, like a data scientist does.
It’s really hard to make predictions about things involving people and their work, as the market is a chaotic system. However, I can venture an educated guess about what is most likely to happen, if things continue evolving the way they do. So, as A.I. becomes more and more versatile and more robust in tackling data analytics problems, it is bound to dominate over other data science techniques. So, if you are happy using SVMs or random forests, for example, you may want to rethink your toolkit! Yet, it is unlikely that A.I. will fully automate the data science process, much like statistics has not become fully obsolete just because there are several statistical programming environments out there (e.g. Statistica, R, SAS, etc.). Statistics is and is bound to remain useful because it is much more than its techniques. The same goes for data science. Even if all the conventional methods used by a data scientist become obsolete, giving way to A.I. ones, people will continue asking questions about the data, forming hypotheses, analyzing problems so that they can be modeled as data science ones, etc.
Of course, people will still communicate with the stakeholders of the projects, create visuals, do presentations, etc. So, even if the A.I. professional is bound to be an asset to an organization, he is most likely going to be part of a data science team, working side-by-side with a data scientist. As for the latter, she will be more knowledgeable about A.I. methods and will spend more time on other parts of her job, rather than doing feature selection and building a series of models, since that’s something that will be automated by an A.I. system.
Therefore, unless a major breakthrough happens in the next few years, I’d recommend being a bit skeptical about the A.I. paradigm shift that many evangelists talk about, as if it’s the coming of a new Messiah. It would be nice if everything was suddenly easy and smooth, due to A.I., but I wouldn’t uninstall my data science software just yet...
With all the hype about A.I. lately, many people have jumped on the A.I. bandwagon without realizing that what they are producing is not always related to A.I. and that their false promises can only get them so far. That’s not to say that modern processes in data science that leverage alternative approaches to analyzing data without relying on a predefined data representation system are not A.I. Far from it. However, there is a lot of jazz about knowledge representation systems (KRS), such as those applied in Natural Language Processing (NLP), that are merely transformations of text data into a quantitative format. Calling that A.I. is like calling a sedan a 4-by-4 monster truck!
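To make "transformation of text data into a quantitative format" more concrete, here is a minimal sketch of one of the simplest such transformations, a bag-of-words count. It is only an illustration using the Python standard library; the function names (tokenize, bag_of_words) and the two sample sentences are made up for this example and do not come from any particular product.

```python
from collections import Counter
import re


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words(documents):
    """Turn a list of strings into a vocabulary and one count vector per document."""
    tokenized = [tokenize(doc) for doc in documents]
    # Sort the vocabulary so that the column order is deterministic.
    vocabulary = sorted({token for tokens in tokenized for token in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)  # missing words count as 0
        vectors.append([counts[word] for word in vocabulary])
    return vocabulary, vectors


# Made-up sample documents, for illustration only.
docs = ["The cat sat on the mat.", "The dog sat."]
vocab, vectors = bag_of_words(docs)
print(vocab)    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
print(vectors)  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]
```

The output is a table of numbers, nothing more. Whatever value comes out of it depends on the model that someone builds on top of it afterwards.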
Knowledge representation is useful in many ways, as it is an often necessary component of Natural Language Understanding (NLU) and other NLP-related systems. For example, the NLTK package in Python has a process in place that categorizes a given text into a series of parts of speech (PoS), by labeling each word with the most appropriate PoS tag. That’s useful, but it’s not exactly A.I. technology. Similar frameworks providing some kind of labeling of text data fall under the same umbrella. In fact, without someone processing their output and building some kind of model based on it, such a labeling is utterly useless. It’s like dough, which without additional processing (e.g. baking) is something you’d probably not serve at a dinner party as-is (though many kids may be quite content eating it in this form).
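As a rough illustration of the difference between labeling text and doing something with the labels, here is a toy sketch. It does not use NLTK and does not reproduce how NLTK's tagger works; the lexicon, the tag names, and the functions toy_pos_tag and tag_counts are all hypothetical, hand-made stand-ins chosen so that the example runs on its own.

```python
from collections import Counter

# Hypothetical, hand-made lexicon; a real tagger learns its labels from annotated data.
TOY_LEXICON = {
    "the": "DET", "a": "DET",
    "cat": "NOUN", "dog": "NOUN", "mat": "NOUN",
    "sat": "VERB", "chased": "VERB",
    "on": "ADP",
}


def toy_pos_tag(tokens):
    """Label each token with a tag from the toy lexicon ('X' if the word is unknown)."""
    return [(token, TOY_LEXICON.get(token.lower(), "X")) for token in tokens]


def tag_counts(tagged):
    """Collapse the labels into counts, a feature a downstream model could use."""
    return Counter(tag for _, tag in tagged)


tagged = toy_pos_tag("The dog chased a cat".split())
features = tag_counts(tagged)
print(tagged)
print(dict(features))
```

The first function is the "dough": a list of word and label pairs. Only the second step, however simple, starts turning the labels into something a model could consume.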
People managing data-driven products, however, are not kids. They expect some kind of value from the processing of the text-based data streams (which sometimes come at a cost) and a positive ROI. It’s quite unlikely that serving them some half-baked data using a knowledge representation system on the given data is going to make them content. Maybe they are fooled once into believing that this is A.I. at work, but it’s probably going to be a one-time thing. This is especially true if they have some data scientist on-board, who knows a thing or two about text analytics.
A.I. systems are automated processes that make an in-depth transformation of the data they are fed, yielding something of value at the end. They usually require a lot of sophisticated processes in the back-end, such as the generation of a large number of meta-features, gradually refining the original features into something that encapsulates the information in them, and then using the end-result to make predictions of some kind. When it comes to data, this could be some new text that mimics the style of the original text, or some better representation of the data using a compact feature set. All this is done through computationally heavy processes that often employ GPUs. So, saying that a knowledge representation system that can run on an average computer, without any additional computing power, is an A.I. system, is inaccurate and misleading. Best case scenario, its results will be later discovered to be interesting but practically useless. After all, A.I. systems are robust because they drill into the data in ways that no human can do, and usually not even comprehend fully.
So, if you hear someone claim that they have developed some new A.I. system that can handle raw text data, without the use of some non-parametric model, they are probably trying to sell you snake oil. This is expected in times where new technologies are available yet not fully understood, and charlatans trying to take advantage of the fact are promoting products convoluted enough to masquerade as this new tech, without actually offering any real value to the user. The answer to this situation is to better understand the field through methodical study (it doesn’t have to be too time-consuming) of reliable sources and the consultation of A.I. professionals and data scientists with an NLP focus. Once you are armed with this understanding, no KRS charlatans can take advantage of you since you’ll be able to see through their lies.
Being part of a tech start-up is a more intimate kind of work, since you are more involved in the decisions of the company, while at the same time collaboration is more direct and sincere. Of course there are still politics, but they are significantly less impactful in your career as a tech entrepreneur. Because if you are part of the founding team of a start-up, you are an entrepreneur, period. So, why would someone leave such a company, especially if it’s still in its growing phase? There are many reasons and they greatly depend on the company and its team dynamics. Here is my story, in a company called MAXset.
MAXset started as an NLP company with the mission to automate the structuring of text data, for any given corpus. Originally it was decided to use the state-of-the-art programming paradigm (functional programming) and a custom-built framework for knowledge representation. Basically, the goal was efficiency and innovation, so as to facilitate text analytics, particularly related to data science and business intelligence. Great idea, yet ideas that are good are a dime a dozen. Implementing this idea was a whole different ball game, one that required a lot of sacrifices and dirty compromises.
MAXset's Framework Implementation
Implementing a novel framework like that wasn’t easy. All the conventional text analytics systems were insufficient and embarrassingly suboptimal. Eventually we decided we had to build everything from scratch. This was great for me, since prototyping in a functional language was fairly easy and fast, while at the same time we were building a unique code base that could be featured as IP for the company, an asset of sorts. We even examined the possibility of filing a patent, at one point.
However, even though all the scripts I developed were fine, they were not used in practice since the framework was poorly defined and was changing constantly. It was like trying to optimize a fitness function that was different every time you looked at it. Also, at one point a decision was made to use a certain Python package, since the developer we had hired was not comfortable with using a functional language like Julia (even though that was a condition for hiring him). Of course, if you are hiring someone without giving them a salary, you have to make compromises like that, otherwise things will never take off.
Other Issues of MAXset
Technology and ideas aside, MAXset had other serious issues that were highly incompatible with what investors would call a promising start-up. For example, there was no clear product definition, no clear market or audience, and no clear strategy for how this great idea would eventually make money. Investors may be very keen on spreadsheets and plots, but they are also intelligent enough to see beyond these and tend to have a pretty good BS detector. After all, there are so many other options for putting their hard-earned cash, especially in a tech city like Seattle. So, needless to say, the idea never got the anticipated traction in the angel investment and VC community.
Also, the fact that there was no regular meeting location (we usually met in the study rooms of libraries, or sometimes in coffee shops) didn't help the situation either. Apart from the obvious issue of lack of privacy, the logistics of the meetings were a constant problem. One of the team members had a good contact in a shared office space and he was certain he could get a really good deal for an office there. Yet, this never materialized for various reasons.
Regarding the team, we were originally 4 people, each one having a sizeable part of the company’s equity. There were also people having advisory roles, like a very talented cloud systems expert who I personally looked up to. Naturally you don’t expect everyone who is in the company in its first stages to linger, since not everyone is that patient, even if they are vested in the company’s success. Even one of Apple’s founders left within the first couple of weeks, leaving Steve Jobs and Steve Wozniak as the only major stakeholders of the start-up they had all created. However, if most of the founders leave, that’s not a good sign. That’s what happened in MAXset. I was the last original founder other than the CEO who was around, when I sent my resignation letter. Perhaps I was less experienced than the other two gentlemen who had made the same choice months earlier. Or maybe I was too optimistic. Whatever the case, I eventually had to go, since it was no longer cost-effective for me to stay there.
Innovation Wasn't That Great
As for the innovation factor, MAXset prides itself on being an A.I. company, employing fringe data science methods for NLP applications. However, upon closer look, if you manage to see beyond the convoluted framework of its main product, it is merely a knowledge representation system. Also, prior to it busking for investors' money, everyone there was oblivious to the fact that there are several other companies out there that do the same thing, though with a different technology. Perhaps the technology in MAXset is unique, but this does not necessarily make the product innovative. Needless to say, most investors who flirted with the idea of investing in the company didn't take long to figure that out and keep their distance from MAXset.
Disrespect Towards People Outside the Company
It's one thing not liking someone because they are a competitor, or a former employee, and it's quite another dissing them. MAXset was notorious for the latter. Also, even people who would be considered potential collaborators, people who had a very positive attitude toward the company and wanted to help, were often treated with disrespect. For example, there was a marketing guy who had an appointment with the CEO one day at a local Starbucks. The CEO had double-booked himself that morning so he didn't show up for the meeting with that guy. He didn't even bother to reschedule or let him know, so that guy called the CEO asking him where he was. The CEO apologized of course, but at that moment I felt really embarrassed for his sake.
It is quite normal in start-ups to have to work without getting paid much. However, you would expect that the compensation would reflect the amount of work you've put in and how vested you are in the company. That wasn't the case with MAXset. During one of the main payments, the compensation was hugely disproportionate to the amount of work or time invested in the company. This wasn't just for me, as there was another person too who was paid much less than he had worked. Also, another person got more than either one of us, even though he had been recruited recently. In general, the cash-flows in the company were managed so poorly that I wouldn't be surprised if there is an embezzlement fiasco in the news about this company (if it doesn't file for Chapter 11 in the meantime).
Start-ups are evolving creatures, so it is natural to change and adapt to circumstances, in order to survive and prosper. However, this kind of change tends to be gradual and in relation to some external factor that needs to be reckoned with. MAXset would change in a very whimsical fashion, shifting programming platforms, data analytics frameworks, and even product objectives like most people would change their clothes. This kind of work is not conducive to sustainable professional development, in my view, and highly incongruent to my values as a tech professional. Although it is good to be flexible, if the requirements of a system change bi-weekly, it is really hard to produce anything worthwhile. Also, the lack of any sort of solid plan about the company's strategy is not a good sign either.
Although I still feel like this whole gig was a waste of my time, time I could have spent creating more videos, or engaging in other data science projects, I find that even from this kind of experience it is possible to learn and hone one’s skills, while at the same time broadening one's perspective. There is a very nice Greek saying that goes “he who sits and hasn’t sat uncomfortably, doesn't sit comfortably.” Perhaps some people need to go through these harsh experiences in order to appreciate other companies. These companies may be less innovative and perhaps less exciting than a Seattle start-up, yet they are more viable and more useful to the world, since they have a definite objective and a clear plan on how to achieve it. So, I focus on that part of my experience and sincerely hope that if you pursue employment in a tech start-up, you never work in a place like MAXset.
A few weeks ago I created a video on DB frameworks, from a data science perspective. Somehow it didn't get into the production pipeline, but now it surfaced and is available on the Safari platform. You can view it here. Enjoy!
There is a certain idea about this matter that I find particularly vexing and misleading, as it paints a very limiting picture of what data science is. There are people who have entered the field through Statistics, as there is a direct link between data science and Stats. However, for reasons of their own, these people tend to view data science as part of Statistics, or sometimes a branch of it. Let’s delve into this matter and clarify this complicated relationship between data science and Stats, before things get out of hand.
First of all, let’s get some definitions in place. Statistics is a sub-field of Mathematics involving the description and analysis of data, particularly numeric data, through a variety of models and processes. It is a very useful framework that is essential in data science. As for the latter, it is usually viewed as a new field, one that comprises several other fields, such as computer science, business, communication, and mathematical modeling. In other words, it’s an inter-disciplinary field that borrows from several other fields, in order to tackle complex problems that couldn't be solved otherwise.
As I have repeatedly stated in my books as well as many of my videos, a data scientist needs to know a variety of things, particularly programming. Statisticians usually focus on all-in-one platforms, like R, SAS, SPSS, etc. for their scripts. These are not the same as full-blown programming languages like Python, Julia, Scala, etc. that are usually used in data science. So, calling data science a part of Statistics is not only inaccurate, but also a sign that the person making the claim doesn't understand what data science entails.
Also, data science tackles a large variety of data types, including text. In fact, there are a lot of data scientists who focus primarily on text data, while there are various methodologies that aim to quantify text data, in a way that enables the analysis of a corpus using a mathematical model. Statistics is unable to tackle any data of this kind, even if data scientists oftentimes make use of Stats when analyzing the quantified text data.
Moreover, Statistics tends to make use of certain models that are based on a number of assumptions about the distribution of the data, or some characteristics of it. Many data science methods don’t have any assumptions about the data. This allows for more versatile models that exhibit a more robust performance, oftentimes unattainable by statistical models. So, if someone claims that data science is part of Stats, they are probably oblivious to the Machine Learning and A.I. systems employed in data science.
Naturally, Statistics is useful in data science and there is no data science course out there that doesn't cover this useful framework in its syllabus. Every data scientist is expected to have a solid grasp of Statistics and use statistical methods in her work. However, relying on Stats exclusively is quite rare and often unproductive.
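As one illustration of a method that makes no assumption about the distribution of the data, here is a minimal sketch of a k-nearest-neighbors classifier, written with the Python standard library only. It is just an example picked for its simplicity; the function name knn_predict, the choice of k, and the sample points are all made up for this sketch.

```python
from collections import Counter
import math


def knn_predict(train_points, train_labels, query, k=3):
    """Predict the label of `query` by majority vote among its k nearest neighbors."""
    # Pair every training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up data: two small, well-separated groups of points.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(points, labels, (1.1, 1.0)))  # a
print(knn_predict(points, labels, (5.0, 5.1)))  # b
```

Nothing in this code assumes normality, linearity, or any other property of the data. The prediction relies only on which known points lie closest to the new one.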
To sum up, Statistics is a great field that has a lot to offer to data science. However, data science is an inter-disciplinary field, borrowing from various areas, including but definitely not limited to Statistics. If you want to learn more about the various aspects of the data science craft and how you can enrich your know-how of it, feel free to check out my latest book, Data Science Mindset, Methodologies, and Misconceptions (Technics Publications). Then, even if you don’t share my view on this topic, at least you’ll be more aware of the complicated relationship between data science and Statistics.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.