People nowadays, especially those who don't understand programming, tend to be opinionated about programming languages and harbor unrealistic expectations. It is these people who spill negativity toward promising projects like Julia, which are still under development. The same people would probably have said nasty things about Python or R, had those languages been developed at a time when their early releases were accessible to the world through the Internet. So, perhaps it's not really Julia these people have an issue with...
It’s easy to criticize something, be it a book, a movie, or a programming language. It’s probably the easiest thing someone can do, other than doing nothing. However, doing nothing doesn't hurt anyone, while the negativity of criticism has a corrosive effect on whoever is exposed to it. It would be overly idealistic to think that people who have this nasty habit could be cured of it, since most likely there are deep issues that cause it to manifest, which would probably require professional help to remedy. What can be remedied fairly easily though is the effect of these criticisms, since they are based on some shallow opinion rather than facts.
So, if you have heard someone who has spent a few hours learning about Julia and trying it out on his laptop dismiss the language, that's not a view you need to take very seriously. Just like every programming language, Julia has its issues, and the packages out there are not in their final form. Just because something doesn't have the maturity and elegance of Pandas or Scikit-learn doesn't make it useless, though. Julia, unlike other high-level languages, enables its users to write their own scripts easily and ensure high performance in them. Imagine trying to do that in Python! You'd need to be a computer science expert to guarantee high performance in a script you just put together, and most likely you'd need to make use of C at some point (e.g. via Cython).
However, just because some people love Julia and swear by it, you shouldn't take their word for it either. The idea is that you try it out yourself, like you'd try any other language, through methodical study and practice. After you've spent quite some time with it and have developed your own (working) programs in it, then you can have a valid opinion on it. And if you don't like it, that's fine. Most Julia users don't take offense if you don't like their favorite language. However, since these people don't dis your language of choice, I believe it is only fair to show some respect for theirs. After all, Julia is not competing with any other language. It just does its thing, like Swift and other fairly new programming languages.
Perhaps Julia is not the language of choice for the majority of data science practitioners. That's perfectly fine. Just because it's not as mature as Python or R doesn't mean that it's not useful. Also, as it's still in its early stages of development, it can only improve as time goes by. Till then, you can always use it for specific tasks, in parallel with your language of choice. After all, there are bridge packages that enable that, which is more than can be said about some other new languages, like Go.
If I've tried to make the argument that Julia is a great programming language, that's because I find new technologies interesting and useful for an ever-changing field such as data science. It was never my intention to convert anyone to that language, merely to make it better known. After all, data science is all about mindset and methodologies, not so much about the specific tools, which inevitably change over time.
That's a question that many people, professionals in the data analytics field included, ask themselves. However, they get different answers depending on who they ask. Naturally, the A.I. professional will tell you that of course it will, since A.I. methods are much better than conventional machine learning ones, while the field has been booming lately. The data scientist may have a more restrained approach, as she is more likely to look at the matter scientifically, expressing some cautiousness about how influential A.I. professionals will be in the data science field. As someone who is in both A.I. and Data Science, perhaps I can offer a more balanced perspective.
First of all, an A.I. professional is a specialist in A.I. methods and if we are thinking about how this person can do a data scientist’s job, we are looking at someone who focuses on data analytics, rather than some other part of A.I. (e.g. robotics, theoretical A.I., etc.). Also, when we are examining a data science professional, we are looking at someone who is not in A.I. and who uses mostly conventional data science methods for the data analytics problems he tackles.
In my latest book, I outlined the importance of A.I., how influential it is in the data science field, and its effect on the role of the data scientist. I even encouraged people to keep up to date with the developments of A.I., as I predicted it would have an important role to play in the years to come. However, I did not urge anyone to drop what they are doing and focus on A.I. methods alone. If someone is already in the field, that's great, since they have already developed the mindset of the data scientist and mastered some of the tools, so by studying A.I. methods for data analytics, they are expanding their skill-set. That's different from becoming an A.I. specialist, though. A.I. specialists may be great at tackling Kaggle competitions, where the data is in a pretty clean and structured form (or at least mostly structured). However, this doesn't automatically make them adept at handling all kinds of data, like a data scientist does.
It's really hard to make predictions about things involving people and their work, as the market is a chaotic system. However, I can attempt an educated guess about what is most likely to happen, if things continue evolving the way they do. As A.I. becomes more versatile and more robust in tackling data analytics problems, it is bound to dominate over other data science techniques. So, if you are happy using SVMs or random forests, for example, you may want to rethink your toolkit! Yet, it is unlikely that A.I. will fully automate the data science process, much like statistics has not become obsolete just because there are several statistical programming environments out there (e.g. Statistica, R, SAS, etc.). Statistics is, and is bound to remain, useful because it is much more than its techniques. The same goes for data science. Even if all the conventional methods used by a data scientist become obsolete, giving way to A.I. ones, people will continue asking questions about the data, forming hypotheses, analyzing problems so that they can be modeled as data science ones, etc.
Of course, people will still communicate with the stakeholders of the projects, create visuals, do presentations, etc. So, even if the A.I. professional is bound to be an asset to an organization, he is most likely going to be part of a data science team, working side-by-side with a data scientist. As for the latter, she will be more knowledgeable about A.I. methods and will spend more time on other parts of her job, rather than doing feature selection and building a series of models, since that's something that will be automated by an A.I. system.
Therefore, unless a major breakthrough happens in the next few years, I’d recommend you are a bit skeptical about the A.I. paradigm shift that many evangelists talk about, as if it’s the coming of a new Messiah. It would be nice if everything was suddenly easy and smooth, due to A.I., but I wouldn’t uninstall my data science software just yet...
With all the hype about A.I. lately, many people have jumped on the A.I. bandwagon without realizing that what they are producing is not always related to A.I. and that their false promises can only get them so far. That's not to say that modern processes in data science that leverage alternative approaches to analyzing data, without relying on a predefined data representation system, are not A.I. Far from it. However, there is a lot of buzz around knowledge representation systems (KRS), such as those applied in Natural Language Processing (NLP), that are merely transformations of text data into a quantitative format. Calling that A.I. is like calling a sedan a 4-by-4 monster truck!
Knowledge representation is useful in many ways, as it is an often necessary component of Natural Language Understanding (NLU) and other NLP-related systems. For example, the NLTK package in Python has a process in place that categorizes a given text into a series of parts of speech (PoS), by labeling each word with the most appropriate PoS tag. That's useful, but it's not exactly A.I. technology. Similar frameworks providing some kind of labeling of text data fall under the same umbrella. In fact, without someone processing their output and building some kind of model based on it, such a labeling is utterly useless. It's like dough: without additional processing (e.g. baking), it's not something you'd serve at a dinner party as-is (though many kids may be quite content eating it in this form).
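To see what this kind of labeling looks like in practice, here is a toy sketch of a PoS tagger in pure Python. Note that this is just a hand-made lexicon lookup for illustration, not NLTK's actual tagger (which relies on trained models); the word list and tags here are made up for the example.

```python
# Toy part-of-speech labeler: assigns each word a PoS tag from a small
# hand-made lexicon, defaulting to "NN" (noun) for unknown words.
# It mimics the *output format* of real taggers, nothing more.
LEXICON = {
    "the": "DT", "a": "DT",
    "cat": "NN", "mat": "NN",
    "sat": "VBD", "on": "IN",
}

def toy_pos_tag(sentence):
    """Return a list of (word, tag) pairs, as a PoS tagger would."""
    return [(w, LEXICON.get(w.lower(), "NN")) for w in sentence.split()]

print(toy_pos_tag("The cat sat on the mat"))
# → [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('on', 'IN'), ('the', 'DT'), ('mat', 'NN')]
```

The point of the dough analogy is visible here: the output is just labeled words. Until someone builds a model on top of such labels, no value has been produced.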
People managing data-driven products, however, are not kids. They expect some kind of value from the processing of the text-based data streams (which sometimes come at a cost) and a positive ROI. It’s quite unlikely that serving them some half-baked data using a knowledge representation system on the given data is going to make them content. Maybe they are fooled once into believing that this is A.I. at work, but it’s probably going to be a one-time thing. This is especially true if they have some data scientist on-board, who knows a thing or two about text analytics.
A.I. systems are automated processes that make an in-depth transformation of the data they are fed, yielding something of value at the end. They usually require a lot of sophisticated processes in the back-end, such as the generation of a large number of meta-features, gradually refining the original features into something that encapsulates the information in them, and then using the end-result to make predictions of some kind. When it comes to text data, this could be some new text that mimics the style of the original, or some better representation of the data using a compact feature set. All this is done through computationally heavy processes that often employ GPUs. So, saying that a knowledge representation system that can run on an average computer, without any additional computing power, is an A.I. system is inaccurate and misleading. Best-case scenario, its results will later be discovered to be interesting but practically useless. After all, A.I. systems are robust because they drill into the data in ways that no human can, and usually not even comprehend fully.
So, if you hear someone claim that they have developed some new A.I. system that can handle raw text data without the use of some non-parametric model, they are probably trying to sell you snake oil. This is expected in times when new technologies are available yet not fully understood, and charlatans trying to take advantage of that fact promote products convoluted enough to masquerade as this new tech, without actually offering any real value to the user. The answer to this situation is to better understand the field through methodical study (it doesn't have to be too time-consuming) using reliable sources, and through consultation with A.I. professionals and data scientists with an NLP focus. Once you are armed with this understanding, no KRS charlatans can take advantage of you, since you'll be able to see through their lies.
Being part of a tech start-up is a more intimate kind of work, since you are more involved in the decisions of the company, while at the same time collaboration is more direct and sincere. Of course there are still politics, but they are significantly less impactful on your career as a tech entrepreneur. Because if you are part of the founding team of a start-up, you are an entrepreneur, period. So, why would someone leave such a company, especially if it's still in its growing phase? There are many reasons, and they greatly depend on the company and its team dynamics. Here is my story, in a company called MAXset.
MAXset started as an NLP company with the mission to automate the structuring of text data, for any given corpus. Originally it was decided to use the state-of-the-art programming paradigm (functional programming) and a custom-built framework for knowledge representation. Basically, the goal was efficiency and innovation, so as to facilitate text analytics, particularly related to data science and business intelligence. Great idea, yet ideas that are good are a dime a dozen. Implementing this idea was a whole different ball game, one that required a lot of sacrifices and dirty compromises.
MAXset's Framework Implementation
Implementing a novel framework like that wasn’t easy. All the conventional text analytics systems were insufficient and embarrassingly suboptimal. Eventually we decided we had to build everything from scratch. This was great for me, since prototyping in a functional language was fairly easy and fast, while at the same time we were building a unique code base that could be featured as IP for the company, an asset of sorts. We even examined the possibility of filing a patent, at one point.
However, even though all the scripts I developed were fine, they were not used in practice, since the framework was poorly defined and was changing constantly. It was like trying to optimize a fitness function that was different every time you looked at it. Also, at one point a decision was made to use a certain Python package, since the developer we had hired was not comfortable with using a functional language like Julia (even though that was a condition for hiring him). Of course, if you are hiring someone without giving them a salary, you have to make compromises like that, otherwise things will never take off.
Other Issues of MAXset
Technology and ideas aside, MAXset had other serious issues that were highly incompatible with what investors would call a promising start-up. For example, there was no clear product definition, no clear market / audience, and no clear strategy for how this great idea would eventually make money. Investors may be very keen on spreadsheets and plots, but they are also intelligent enough to see beyond these and tend to have a pretty good BS detector. After all, there are so many other places for them to put their hard-earned cash, especially in a tech city like Seattle. So, needless to say, the idea never got the anticipated traction in the angel investment and VC community.
Also, the fact that there was no regular meeting location (we usually met in the study rooms of libraries, or sometimes in coffee shops) didn't help the situation either. Apart from the obvious issue of lack of privacy, the logistics of the meetings were a constant problem. One of the team members had a good contact in a shared office space and was certain he could get a really good deal for an office there. Yet, this never materialized, for various reasons.
Regarding the team, we were originally 4 people, each one having a sizeable part of the company's equity. There were also people in advisory roles, like a very talented cloud systems expert whom I personally looked up to. Naturally, you don't expect everyone who is in the company in its first stages to linger, since not everyone is that patient, even if they are vested in the company's success. Even one of Apple's founders left within the first couple of years, leaving Steve Jobs and Steve Wozniak the only major stakeholders of the start-up they had all created. However, if most of the founders leave, that's not a good sign. That's what happened in MAXset. I was the last original founder, other than the CEO, still around when I sent my resignation letter. Perhaps I was less experienced than the other two gentlemen who had made the same choice months before. Or maybe I was too optimistic. Whatever the case, I eventually had to go, since it was no longer cost-effective for me to stay there.
Innovation Wasn't That Great
As for the innovation factor, MAXset prides itself on being an A.I. company, employing fringe data science methods for NLP applications. However, upon closer look, if you manage to see beyond the convoluted framework of its main product, it is merely a knowledge representation system. Also, prior to asking for investors' money, everyone there was oblivious to the fact that there are several other companies out there that do the same thing, though with a different technology. Perhaps the technology in MAXset is unique, but this does not necessarily make the product innovative. Needless to say, most investors who flirted with the idea of investing in the company didn't take long to figure that out and keep their distance from MAXset.
Disrespect Towards People Outside the Company
It's one thing not to like someone because they are a competitor, or a former employee, and it's quite another to dis them. MAXset was notorious for the latter. Also, even people who would be considered potential collaborators, people who had a very positive attitude toward the company and wanted to help, were often treated with disrespect. For example, there was a marketing guy who had an appointment with the CEO one day at a local Starbucks. The CEO had double-booked himself that morning, so he didn't show up for the meeting. He didn't even bother to reschedule or let the guy know, so that guy called the CEO asking him where he was. The CEO apologized, of course, but at that moment I felt really embarrassed on his behalf.
It is quite normal in start-ups to have to work without getting paid much. However, you would expect that the compensation would reflect the amount of work you've put in and how vested you are in the company. That wasn't the case with MAXset. During one of the main payments, the compensation was hugely disproportionate to the amount of work or time invested in the company. This wasn't just for me, as there was another person too who was paid much less than he had worked for. Also, another person got more than either one of us, even though he had been recruited only recently. In general, the cash flows in the company were managed so poorly that I wouldn't be surprised if there is an embezzlement fiasco in the news about this company (if it doesn't file for Chapter 11 in the meantime).
Start-ups are evolving creatures, so it is natural for them to change and adapt to circumstances, in order to survive and prosper. However, this kind of change tends to be gradual and in relation to some external factor that needs to be reckoned with. MAXset would change in a very whimsical fashion, shifting programming platforms, data analytics frameworks, and even product objectives the way most people change their clothes. This kind of work is not conducive to sustainable professional development, in my view, and highly incongruent with my values as a tech professional. Although it is good to be flexible, if the requirements of a system change bi-weekly, it is really hard to produce anything worthwhile. Also, the lack of any sort of solid plan for the company's strategy is not a good sign either.
Although I still feel like this whole gig was a waste of my time, time I could have spent creating more videos or engaging in other data science projects, I find that even from this kind of experience it is possible to learn and hone one's skills, while at the same time broadening one's perspective. There is a very nice Greek saying that goes "he who sits and hasn't sat uncomfortably, doesn't sit comfortably." Perhaps some people need to go through these harsh experiences in order to appreciate other companies. These companies may be less innovative and perhaps less exciting than a Seattle start-up, yet they are more viable and more useful to the world, since they have a definite objective and a clear plan on how to achieve it. So, I focus on that part of my experience and sincerely hope that if you pursue employment in a tech start-up, you never work in a place like MAXset.
A few weeks ago I created a video on DB frameworks, from a data science perspective. Somehow it didn't get into the production pipeline, but now it surfaced and is available on the Safari platform. You can view it here. Enjoy!
There is a certain idea about this matter that I find particularly vexing and misleading, as it paints a very limiting picture of what data science is. There are people who have entered the field through Statistics, as there is a direct link between data science and Stats. However, for reasons of their own, these people tend to view data science as part of Statistics, or sometimes a branch of it. Let’s delve into this matter and clarify this complicated relationship between data science and Stats, before things get out of hand.
First of all, let's get some definitions in place. Statistics is a sub-field of Mathematics involving the description and analysis of data, particularly numeric data, through a variety of models and processes. It is a very useful framework that is essential in data science. As for the latter, it is usually viewed as a new field, one that comprises several other fields, such as computer science, business, communication, and mathematical modeling. In other words, it's an inter-disciplinary field that borrows from several other fields in order to tackle complex problems that couldn't be solved otherwise.
As I have repeatedly stated in my books as well as many of my videos, a data scientist needs to know a variety of things, particularly programming. Statisticians usually focus on all-in-one platforms, like R, SAS, SPSS, etc., for their scripts. These are not the same as full-blown programming languages like Python, Julia, Scala, etc. that are usually used in data science. So, calling data science a part of Statistics is not only inaccurate, but a sign that one doesn't understand what data science entails.
Also, data science tackles a large variety of data types, including text. In fact, there are a lot of data scientists who focus primarily on text data, while there are various methodologies that aim to quantify text data, in a way that enables the analysis of a corpus using a mathematical model. Statistics on its own is unable to tackle data of this kind, even if data scientists oftentimes make use of Stats when analyzing the quantified text data.
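To make the idea of quantifying text concrete, here is a minimal bag-of-words sketch in pure Python. This is only an illustration of the principle; in real pipelines one would use a library implementation (e.g. scikit-learn's CountVectorizer), and the example documents are made up.

```python
from collections import Counter

def bag_of_words(corpus):
    """Turn a list of documents into count vectors over a shared vocabulary."""
    # Build a sorted vocabulary from all words in the corpus
    vocab = sorted({w for doc in corpus for w in doc.lower().split()})
    vectors = []
    for doc in corpus:
        counts = Counter(doc.lower().split())
        # One count per vocabulary word, in a fixed order
        vectors.append([counts[w] for w in vocab])
    return vocab, vectors

vocab, vectors = bag_of_words(["the cat sat", "the dog sat down"])
print(vocab)    # → ['cat', 'dog', 'down', 'sat', 'the']
print(vectors)  # → [[1, 0, 0, 1, 1], [0, 1, 1, 1, 1]]
```

Once the corpus is in this numeric form, mathematical models (statistical ones included) can be applied to it, which is exactly the point made above.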
Moreover, Statistics tends to make use of certain models that are based on a number of assumptions about the distribution of the data, or some of its characteristics. Many data science methods make no such assumptions about the data. This allows for more versatile models that exhibit more robust performance, oftentimes unattainable by statistical models. So, if someone claims that data science is part of Stats, they are probably oblivious to the Machine Learning and A.I. systems employed in data science.
Naturally, Statistics are useful in data science and there is no data science course out there that doesn't cover this useful framework in its syllabus. Every data scientist is expected to have a solid grasp of Statistics and use statistical methods in her work. However, relying on Stats exclusively is quite rare and often unproductive.
To sum up, Statistics is a great field that has a lot to offer to data science. However, data science is an inter-disciplinary field, borrowing from various areas, including but definitely not limited to Statistics. If you want to learn more about the various aspects of the data science craft and how you can enrich your know-how of it, feel free to check out my latest book, Data Science Mindset, Methodologies, and Misconceptions (Technics Publications). Then, even if you don’t share my view on this topic, at least you’ll be more aware of the complicated relationship between data science and Statistics.
Recently I read an article on Pulse (LinkedIn) about what to look for when hiring a statistician. This shocked me for two reasons. First, the role of the statistician is becoming obsolete, giving way to that of the data scientist and the A.I. professional. Second, if you need help hiring someone in a profession that has been around for over a century, then no article can help you, no matter how well-written it is.
I have no opinion on the article itself. I’m sure its author meant well and that he did his research prior to writing it. I do have a view on the whole matter though and how oftentimes the market is lagging in its understanding of what data analytics entails. So, let me clarify some things on this, since I've worked in this field for the largest part of my career.
Data analytics involves various sub-fields, such as statistics, business intelligence, data science, and modern A.I.-based predictive analytics. Statistics is a great tool, something that every self-respecting data scientist needs to know (though its usefulness is not limited to data science). However, a statistician is an overly specialized professional who relies primarily on Stats for his analyses. This is like someone who is a professional traveler (say, a blogger of sorts specializing in tourist destinations and such) who only uses his car to get to the various countries he visits. Of course driving can get you to many places, depending on where you start from, but this way you miss out on all the islands, as well as Australia. There is nothing wrong with using your car to go places, but if you want to be a professional traveler, you need to use other modes of transport too (otherwise you'll never get to Victoria, BC, which is awesome, especially at this time of the year).
So, if you want to get a good story on the various beautiful spots someone can visit in the Hawaiian islands, the aforementioned overly specialized professional traveler may not be able to deliver. However, someone who is more versatile and doesn't mind using a plane once in a while can get you that story, and do so fairly quickly (especially if he is based on the West Coast). That latter professional traveler is the equivalent of a data scientist.
Nowadays, the world needs people who are versatile and comfortable with technology. In the previous example, they need someone who not only drives a car, but also knows how it works and is able to fix it if it doesn’t drive well. The statistician may know how to drive various vehicles (statistical models), but is usually unable to do more than drive. A data scientist, on the other hand, is quite comfortable with all kinds of modes of transport and can even build one from scratch (given enough expertise, of course). So, if you were to hire someone to get that data to talk, who are you going to go with, the overly specialized statistician or the versatile data scientist?
Lately I've been looking into cyber security, as it is a field that is very useful to know about, regardless of one's profession. As in data science we often deal with sensitive data, I found it useful to be able to apply certain cyber security principles to ensure the data at hand remains secure. As I've researched this matter enough to have something useful to share with the world, I decided to create a video on it, which is now available on the Safari platform here (you'll need a subscription to the Safari platform in order to view it). This is by no means enough to make you an expert in network security, but it's a good starting point. Enjoy!
Lately I've been thinking about A.I. and Statistics a lot (you could say that the amount of time spent thinking about these topics is significantly higher at alpha = 0.05!). This is partly because my Stats article managed to get more traction than any other article I've written in the past few months, and partly because A.I. is becoming more and more relevant in our field. So, the question of whether A.I. is one day going to replace Stats altogether remains a very relevant one.
The key advantage of A.I. methods is that they are largely assumption-free. This by itself enables them to tackle the problems they are aiming to solve in a very methodical and efficient way. Of course, certain assumptions might speed things up, but they might also obstruct the discovery of the optimal solutions to the problem at hand. Statistical inference models lost the war against machine learning models because of that, especially when artificial neural networks (ANNs) entered the scene. Also, the fact that many ML models can be combined in an ensemble setting allowed them to become even more robust, attaining F1 scores that were unfathomable for statistical prediction models. So, the possibility of other statistical methods being superseded by alternative systems is quite real.
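The ensemble idea mentioned above can be sketched as a simple majority vote over the predictions of several models. This is a toy illustration in pure Python (real work would use something like scikit-learn's VotingClassifier); the three "models" and their predictions below are entirely hypothetical.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    """Combine class predictions from several models by majority vote.

    predictions_per_model: a list of prediction lists, one per model,
    all of the same length (one prediction per data point).
    """
    n_points = len(predictions_per_model[0])
    combined = []
    for i in range(n_points):
        votes = [preds[i] for preds in predictions_per_model]
        # The most common vote wins for each data point
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# Three hypothetical models predicting classes for four data points:
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
print(majority_vote([model_a, model_b, model_c]))  # → [1, 0, 1, 1]
```

The intuition is that the models' individual errors tend to cancel out, which is why ensembles are typically more robust than any single member.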
On the other hand, statistics are very easy to use and interpret, since most of them were designed from a user’s perspective. There are doctors out there (the medical kind), who don’t know much about data analytics but can easily work a statistical model for figuring out if a certain drug has a positive influence on certain patients, and derive some scientific conclusions based on that. That doctor may not be able to write a script to save his life, but he can make use of the data he gathers and advance his scientific field, using just statistics. It’s quite unlikely that this kind of person, who is usually too busy or just not technically adept enough, will take up an A.I. approach to this kind of analysis any time soon.
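The kind of analysis such a researcher would run, comparing a treatment group with a control group, boils down to something like a two-sample t-test. Here is a bare-bones sketch of Welch's t-statistic in pure Python (in practice one would use a ready-made routine such as scipy.stats.ttest_ind, or the equivalent in R or SAS); the patient scores below are made-up numbers.

```python
from statistics import mean, variance
from math import sqrt

def welch_t_statistic(sample_a, sample_b):
    """Welch's t-statistic for two independent samples (unequal variances)."""
    na, nb = len(sample_a), len(sample_b)
    # Standard error of the difference between the two sample means
    se = sqrt(variance(sample_a) / na + variance(sample_b) / nb)
    return (mean(sample_a) - mean(sample_b)) / se

# Hypothetical recovery scores: treatment group vs. placebo group
treatment = [7.1, 6.8, 7.5, 7.0, 7.3]
placebo = [6.2, 6.0, 6.5, 6.1, 6.4]
t = welch_t_statistic(treatment, placebo)
print(round(t, 2))  # a large positive t suggests the treatment mean exceeds the placebo mean
```

A value of t this far from zero would be compared against the t-distribution to get a p-value, which is precisely the kind of user-friendly conclusion ("the drug works, p < 0.05") that keeps statistics attractive to non-technical researchers.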
Of course, A.I. constantly evolves so the black-box issue that makes many ANN-based systems unfavorable, may wane in the future. Already there are A.I. professionals talking about A.I. systems that offer some kind of interpretability. So, even if statistical systems are easier to understand and communicate, it could be that A.I. hasn't said its final word yet.
Whatever the case, I prefer to remain agnostic on this matter. Just like with programming, it’s best to keep one’s options open, when it comes to data science. I’m not a fan of statistics (and never was), but I see value in them and I’m happy to use them to the extent that they offer value to the projects I work on. A.I. may be more of a novel and exciting framework, but if an A.I. system is hard to communicate to the client, or doesn't lend itself to interpretation, then I may not use it everywhere. Just like you don’t take your fancy fringe science book to the beach, you don’t need to show off your A.I. know-how at every opportunity. Perhaps the humble historic novel is more suitable for reading while sunbathing, just like the humble statistics are fine for describing if sample A is significantly different from sample B.
Recently I had a nice chat with a fellow data scientist who works at LinkedIn. After bouncing some ideas off him, I decided to make another video, based on a topic of mutual interest, partly for demonstrating to him how straight-forward the process is, once you have done the research on the topic. This video is now published on Safari here (subscription required). Enjoy!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.