Why Data Science and A.I. Need to Be Focused on the Customer Rather than Merely on the Science Part

10/30/2017

Sometimes it’s easy to get carried away and focus on data science too much, losing sight of its applications. Although this is fairly common in an academic setting (particularly in universities that don’t have any ties to the industry), it may happen in companies too. When this happens, it’s usually best to walk away, since data science without any real-world application can be problematic.

Data science, and the part of A.I. that’s geared towards data analytics, involve a lot of scientific methodologies, which are quite interesting on their own. This may tempt someone to get lost in that aspect of the craft and neglect the application part, namely the one where these methodologies are employed for solving real-world problems.

That’s not to say that doing data science research is bad. Quite the contrary. However, when the research has no application and focuses too much on the math side of things, it is bound to be a waste of resources (unless you are doing this as part of a research project, e.g. for a research center or a university, in which case this is expected). The reason is that data science is by definition an applied field, much like engineering. Particularly when it is undertaken by a company (e.g. a startup), it needs to be able to deliver something concrete, and more importantly, something useful.

It’s hard to overestimate the value of the aspect of data science that has to do with the end-user. After all, this person is often the one paying the bills! Also, focusing on the application part of the craft enables something else too: a more practical implementation of the technologies developed and the inception of new methods that are more hands-on and therefore useful.
This is one of the reasons that data science has veered away from Statistics, a field which is by its nature more theoretical and more math-y than applied Science. That’s also the main reason why data science involves a lot of programming, oftentimes building things from scratch, even if it’s simple scripts. That’s quite different from using an all-in-one software package, like SAS or SPSS, where the user merely calls functions and does rudimentary data processing.

You can come up with ingenious methods in data science that would be able to fetch a journal publication or two. However, if these methods don’t add value to an organization, they are not that great from a holistic standpoint. This is observed in other parts of Science too, e.g. Electromagnetism. Despite the various theoretical aspects of that field, its usefulness is also apparent. People who practice this part of Physics tend to be very practical and oftentimes come up with interesting inventions that add value to their users (e.g. electromagnets or power transformers).

Data science is not any different. All the clever mathematics behind a method may be enchanting for the mind, but it’s when this method is put into practice and yields some insight, oftentimes an actionable one, that it really becomes meaningful. That’s something worth remembering, since it’s easy to lose sight of the questions we are trying to answer, and focus too much on the possibilities that we discover.

Some may argue that it’s the journey that matters, but for a journey to be a journey there needs to be a destination. The latter is usually some person who doesn't care much about the science behind the insights, but more about their applicability and usefulness. Companies like MAXset LLC may be completely ignorant of that, but this doesn't make it a viable strategy. On the other hand, companies that have a chance of providing true value to the world make the business aspect of the craft their priority.
People like to talk about the V’s of big data, since it is a topic comprehensible to almost everyone, while it also provides insight regarding the benefits of using data science in an organization. Naturally, these benefits are linked to having access to various data streams, usually resulting in massive amounts of data, commonly referred to as big data. Not everyone agrees as to which V’s are valid for characterizing this valuable resource (some say there are 4, others exclude Veracity, while others include a couple more). However, there seems to be a consensus about the last V, namely Value. Nevertheless, whether there is value in big data or not is something that remains to be determined, since not all big data is created equal.

The issue with the V of value is that it’s not inherent in the data. If that were the case, someone could just buy this data (or license it) and then automatically improve his organization’s ROI. The value of big data is actually something that stems from data science’s transformation of this data into insights and/or data products. The same data that would otherwise be gathering dust on some computer cluster somewhere is turned into something people can use and oftentimes monetize, through data science. This is something that takes effort, however, and most importantly, requires a certain quality in the data to begin with.

It’s often useful to think of data as a gold mine. After all, just because it has the potential of yielding large amounts of the valuable metal, it doesn't mean that it will. Perhaps the mine is all dried up, or doesn’t have much gold to begin with. No amount of data science can remedy that. Data science can yield something of value only if there is something in the data that could be of value. Many times people forget that, just like the people who buy a gold mine and expect that they’ll be swimming in gold soon enough.
The V’s of big data, on the other hand, are something real and present in every data stream that qualifies as big data. In fact, they are more like characteristics of the data itself, rather than something dependent on data science. However, the V’s themselves may provide some insight as to how much the data at hand qualifies as big data, but not much regarding its potential for an organization. For example, big data of high veracity that’s related to people’s views on a particular commercial product may be completely useless to an organization that is all about some service. The data itself is fine, but doesn't add value to the organization.

So, in order for big data to be of actual value, we need certain things to be in place. First of all, the data needs to be handled by a data science team (or a single data scientist, if he’s competent enough). Moreover, it needs to have some affinity to the organization’s domain. Finally, there needs to be something insightful in the data, which can be surfaced through a data science project, be it through a better understanding of a situation or through a data product that the organization can use.

In conclusion, the fact that some data stream can offer value doesn't necessarily mean that it will. After the data science team has done its part, the stakeholders of the project need to take action, utilizing the insights and/or the data product developed. People sometimes forget that and neglect leveraging the benefits of a data science project to the fullest extent, much like a gold miner may obtain the gold from a mine, but never get around to doing anything useful with it...

It is easy to fall into the misconception that in data science we are all solitary people doing our work and interacting only in the workplace and on social media. Perhaps we are part of some data science team, but still feel we are on our own when it comes to our relationship with the field.
However, this is just one of many possibilities in how we relate to the data science world, and it is definitely not the best one. Being part of a community in data science is not only possible but also necessary. Of course just networking with other data scientists may not be enough, but it is often a good starting point.

This is particularly important towards the beginning of one’s career. After all, not even the best data science books can give someone solace in times of difficulty or doubt. That’s when having a good mentor comes in very handy. Even if that mentor is a bit aloof and preoccupied with his own stuff, he tends to have a genuine interest in your career and is motivated to help you out, at least to some extent. This can be another step towards becoming part of a community of data science professionals.

Make no mistake, however. Neither the mentor nor anyone else is going to fight your battles for you. The other data scientists, be they professional acquaintances, mentors, or teammates, have their own battles to tackle. However, they may be able to offer you advice or help you gain insight into solutions that you couldn't think of by yourself, especially during the time you are immersed in the problems you are tackling.

Finding a physical community may not always be possible. Not all cities are as advanced as the ones where the field thrives and has cohorts bustling with data science events and activities. However, there are data scientists out there who are also in need of a community, so it’s only a matter of time before you find them. Perhaps you’ll “meet” them online, through some social network or a data science forum. Maybe you’ll encounter them at a data science conference or a webinar. Bottom line, if you are open to finding a community of data scientists, the opportunities to do so will manifest, sooner or later.

Being part of a data science community is not only about getting help in difficult times though.
It’s also a great accelerator for developing yourself as a data scientist through being exposed to new trends, novel approaches to known problems, and most importantly, to unknown problems that you’d probably not encounter on your own, even if you work in a data-driven company. All that is bound to foster in you the knowledge and know-how you need to advance to the next level, whatever that level is for you. At the same time, it can help you maintain your enthusiasm for data science, and perhaps even make you more zestful about the field. After all, it is usually the people who are passionate about something that make the most progress in it and are also consistent in doing so. Data science is not any different in that respect.

Everyone talks about data science these days, as well as A.I., since the value these disciplines can add to an organization is being verified more and more. However, there are organizations out there that are not ready yet to make use of data science, even if they have ads for data scientists in various job forums. Before applying to places like that, you may want to answer this question for yourself: is this organization I’m interested in data science ready?

Just because an organization has seen value in a data science proof-of-concept (PoC) project, it doesn't make it ready to employ and utilize data science professionals. First of all, it has to have a solid leadership team, one that at the very least has a CTO who has worked with data scientists, though additional roles like those of a CIO and a CDO would also be useful. If the C-level team of an organization hasn't worked with data scientists and doesn't have a clear idea of what data science can and cannot do, then this is a red flag.

In addition, access to a variety of data streams, even if these don’t qualify for “big data” status, is essential for an organization to be data science ready.
If all its data is in Excel spreadsheets and SQL databases, perhaps it needs a data analyst, a business intelligence professional, or a statistician. If it does get a data scientist, it won’t be able to do much with her, since she will not have enough to work with to provide sufficient value, the kind that can translate to a positive ROI for her group. That data scientist is better off working somewhere else, where they make better use of her skills and her mindset.

Moreover, a data science ready organization has realistic expectations and a good plan about how to utilize its data resources. Just because it has access to good data, it doesn’t mean that it can get value from it, even if it employs a group of very talented data scientists. It also needs to know what it is going to do with it, what data products it can create, how it is going to leverage the insights the data science team provides, etc. All that is not necessarily going to take place in the next quarter, especially if the organization is new to data science. So, expecting some ground-breaking results within the next 3 months would be naive and financially irresponsible. An investment like this is bound to take some time before it yields dividends, and if the organization is not aware of this, then it may not be ready just yet.

Beyond these signs, there are other, more specialized ones that are domain-specific or data-specific. However, mentioning them here would make the article so long that you’d need to run some text analytics system on it to derive all the information from it! So, let’s just say that there are other things that can be good predictors as to whether an organization is worth your time as a data scientist or, in case you are a hiring manager of such an organization, whether you should start recruiting data scientists at this point. After all, data science is a long game, so there is no point rushing into it.
It’s more beneficial if it is conducted in an environment that is conducive to it, and capable of fostering a congruent and efficient team, poised to add value to whatever data it utilizes.

People like to argue, especially about things they can reason with. However, just because you can justify that your view has merit, by giving some practical examples or through logical reasoning, this doesn't make alternative views invalid. If there are several programming languages in data science, perhaps an oversimplification like “X is the best language for data science because Y” doesn't hold much water. Let’s examine why.

Although it is possible to rule out certain languages (e.g. Assembly or C) as optimal for data science, this doesn't mean that the problem has a clear-cut solution. Also, the assumption that a single programming language can cover all the use cases of a data science professional is quite unjustifiable. Some data scientists use two or three programming languages, sometimes in combination, getting the best of each, for optimal overall performance.

Also, data science is all about solving a business problem in a scientific manner. Just because, say, Dr. Smith prefers to use language X over Y, it doesn't mean that you have to follow her example. Maybe she used language X during her PhD and didn't have time to learn another language, or she attained mastery of that language, so she feels more comfortable doing her data science work with it. She may be a successful data scientist, but following her programming habits won’t necessarily make you a great data scientist.

Moreover, with new languages and new packages in the existing languages coming about all the time, asking which language is best is like asking which basketball team performs best. Definitely not something particularly stable! Besides, it’s often the case that a particular project may require special handling, so what is a top performer now may not be the best option for that particular case.
In addition, the almost religious attitude towards programming languages that many people have (not just data scientists) is by itself problematic. If a potential employer sees you arguing about how your language of choice is the best and that you are not open to considering alternatives, he may not be so eager to hire you, since this kind of attitude creates disharmony and difficulty in collaboration among the members of a team. Besides, most companies nowadays rarely ask for a specific language in the candidate requirements. As long as you can do the task that’s required of you, they don’t really care much what your programming background is. Of course companies that have already invested in a particular language and have all their code in that language may not be so flexible, but that shouldn't be the principal factor in your decision about which language to learn.

Finally, when it comes to deep learning, many modern frameworks, like Apache’s MXNet, have APIs for a variety of programming languages. So if your A.I. guru friend tries to convince you that you should learn language X because that’s the best deep learning language, take that suggestion with a pinch of salt!

The important thing is that whatever language you decide to learn for data science, you make sure that you learn it well. Familiarize yourself with its packages, use it to solve various problems, and learn the best strategies for debugging code written in that language. If you do that, you can still make good use of it for your data science projects, even if the majority of people prefer this or the other language instead.

Just wanted to clarify something about the videos I post on Safari Books Online. Each one of these videos is not an audio-visual version of a book on the topic, but more of an overview of it. I have specific requirements about the duration, so it is infeasible to go into much depth on any one of the topics, especially those topics that are more general.
So, if you decide to watch a video of mine, please manage your expectations accordingly. None of these videos will make you an expert or provide you with the specialized knowledge that you'd find in a book. However, they can be a quick and effective way to get the basics down so that when you read a book on that topic, you'll have a sense of perspective and be able to focus on the details, since you'll have a firm grasp of the key concepts.

So, if you want to go into depth on any given topic, I'd recommend either reading a book or two, or taking a course on it. The videos have a more supportive role and it is more useful if they are seen as such.

Recently I decided to make another video on cyber security, a topic I'm quite fond of. This time, I tackled Cryptography, which is a truly intriguing field, independent of data science but similar to it in some ways. So, as of today this video is available on Safari (you need to have a subscription to the portal in order to view the whole of it). Now, it's just an introductory video, so don't expect it to make you an expert in this. However, after viewing it, you'll have a solid understanding of what Cryptography is, how it is useful, what methods it includes, and some practical tips on how you can make use of it in your everyday life. Enjoy!

People nowadays, especially those who don’t understand programming, tend to be opinionated about programming languages and harbor unrealistic expectations. It’s this kind of people who spill negativity towards promising projects like Julia, which are still in the process of development. The same people would probably have said nasty things about Python or R, if these languages had been developed at a time when early releases of them were accessible to the world through the Internet. So, perhaps it’s not really Julia these people have an issue with...
It’s easy to criticize something, be it a book, a movie, or a programming language. It’s probably the easiest thing someone can do, other than doing nothing. However, doing nothing doesn't hurt anyone, while the negativity of criticism has a corrosive effect on whoever is exposed to it. It would be overly idealistic to think that people who have this nasty habit could be cured of it, since most likely there are deep issues that cause it to manifest, which would probably require professional help to remedy.

What can be remedied fairly easily though is the effect of these criticisms, since they are based on some shallow opinion rather than facts. So, if you have heard someone who has spent a few hours learning about Julia and trying it out on his laptop dis Julia, that’s not a view you need to take very seriously. Just like every programming language, Julia has its issues and the packages out there are not in their final form. Just because something doesn't have the maturity and elegance of Pandas or Scikit-learn, it doesn't make it useless though. Julia, unlike other high-level languages, enables its users to write their own scripts easily and ensure high performance in them. Imagine trying to do that in Python! You’d need to be a computer science expert in order to guarantee high performance in a script you just put together, and most likely you’d need to make use of C at some point (Cython).

However, just because some people love Julia and swear by it, you shouldn't take their word for it. The idea is that you try it out yourself, like you’d try some other language, namely through methodical studying and practice. After you've spent quite some time on it and have developed your own (working) programs in it, then you can have a valid opinion on it. And if you don’t like it, that’s fine. Most Julia users don’t take offense if you don’t like their favorite language.
However, since these people don’t dis your language of choice, I believe it is only fair if you show some respect for their favorite language. After all, Julia is not competing with any other language. It just does its thing, like Swift and other fairly new programming languages.

Perhaps Julia is not the language of choice for the majority of data science practitioners. That’s perfectly fine. Just because it’s not as mature as Python or R, however, it doesn't mean that it’s not useful. Also, as it’s still in its early stages of development, it can only improve as time goes by. Till then, you can always use it for specific tasks, in parallel with your language of choice. After all, there are bridge packages that enable that, which is more than someone could say about some other new languages, like Go.

If I've tried to make the argument that Julia is a great programming language, that’s because I find new technologies interesting and useful for an ever-changing field, such as data science. It was never my intention to convert anyone to that language, merely to make it more well-known. After all, data science is all about mindset and methodologies, not so much about the specific tools, which inevitably change over time.

That’s a question that many people ask themselves and professionals in the data analytics field. However, they get different answers depending on who they ask. Naturally, the A.I. professional will tell you “of course,” since A.I. methods are much better than conventional machine learning ones and the field is booming lately. The data scientist may have a more restrained approach, as she is more likely to look at the matter scientifically, expressing some caution about how influential A.I. professionals will be in the data science field. As someone who is both in A.I. and Data Science, perhaps I could offer a more balanced perspective.

First of all, an A.I. professional is a specialist in A.I.
methods, and if we are thinking about how this person can do a data scientist’s job, we are looking at someone who focuses on data analytics, rather than some other part of A.I. (e.g. robotics, theoretical A.I., etc.). Also, when we are examining a data science professional, we are looking at someone who is not in A.I. and who uses mostly conventional data science methods for the data analytics problems he tackles.

In my latest book, I outlined the importance of A.I. and how influential it is in the data science field and the role of the data scientist. I even encouraged people to keep up to date with the developments in A.I., as I predicted it will have an important role to play in the years to come. However, I did not urge anyone to drop what they are doing and focus on A.I. methods alone. If someone is already in the field, that’s great, since they have already developed the mindset of the data scientist and have mastered some of the tools, so by studying A.I. methods for data analytics, they are expanding their skill-set. That’s different from becoming A.I. specialists though. The A.I. specialists may be great at tackling Kaggle competitions, where the data is in a pretty clean and structured form (or at least mostly structured). However, this doesn't automatically make them adept at handling all kinds of data, like a data scientist does.

It’s really hard to make predictions about things involving people and their work, as the market is a chaotic system. However, I can venture an educated guess about what is most likely to happen, if things continue evolving the way they do. So, as A.I. becomes more and more versatile and more robust in tackling data analytics problems, it is bound to dominate over other data science techniques. So, if you are happy using SVMs or random forests, for example, you may want to rethink your toolkit! Yet, it is unlikely that A.I.
will fully automate the data science process, much like statistics has not become fully obsolete just because there are several statistical programming environments out there (e.g. Statistica, R, SAS, etc.). Statistics is and is bound to remain useful because it is much more than its techniques.

The same goes for data science. Even if all the conventional methods used by a data scientist become obsolete, giving way to A.I. ones, people will continue asking questions about the data, forming hypotheses, analyzing problems so that they can be modeled as data science ones, etc. Of course, people will still communicate with the stakeholders of the projects, create visuals, do presentations, etc.

So, even if the A.I. professional is bound to be an asset to an organization, he is most likely going to be part of a data science team, working side-by-side with a data scientist. As for the latter, she will be more knowledgeable about A.I. methods and will spend more time on other parts of her job, rather than doing feature selection and building a series of models, since that’s something that will be automated by an A.I. system.

Therefore, unless a major breakthrough happens in the next few years, I’d recommend being a bit skeptical about the A.I. paradigm shift that many evangelists talk about, as if it’s the coming of a new Messiah. It would be nice if everything was suddenly easy and smooth, due to A.I., but I wouldn’t uninstall my data science software just yet...

With all the hype about A.I. lately, many people have jumped on the A.I. bandwagon without realizing that what they are producing is not always related to A.I. and that their false promises can only get them so far. That’s not to say that modern processes in data science that leverage alternative approaches to analyzing data without relying on a predefined data representation system are not A.I. Far from it.
However, there is a lot of jazz about knowledge representation systems (KRS), such as those applied in Natural Language Processing (NLP), that are merely transformations of text data into a quantitative format. Calling that an A.I. is like calling a sedan a 4-by-4 monster truck!

Knowledge representation is useful in many ways, as it is an often necessary component of Natural Language Understanding (NLU) and other NLP-related systems. For example, the NLTK package in Python has a process in place that categorizes a given text into a series of parts of speech (PoS), by labeling each word with the most appropriate PoS tag. That’s useful, but it’s not exactly A.I. technology. Similar frameworks providing some kind of labeling of text data fall under the same umbrella. In fact, without someone processing their output and building some kind of model based on it, such a labeling is utterly useless. It’s like dough, which without additional processing (e.g. baking) is bound to be something you’d probably not serve at a dinner party as-is (though many kids may be quite content eating it in this form).

People managing data-driven products, however, are not kids. They expect some kind of value from the processing of the text-based data streams (which sometimes come at a cost) and a positive ROI. It’s quite unlikely that serving them some half-baked data, produced by running a knowledge representation system on the given data, is going to make them content. Maybe they are fooled once into believing that this is A.I. at work, but it’s probably going to be a one-time thing. This is especially true if they have some data scientist on board who knows a thing or two about text analytics.

A.I. systems are automated processes that make an in-depth transformation of the data they are fed, yielding something of value at the end.
They usually require a lot of sophisticated processes in the back-end, such as generating a large number of meta-features, gradually refining the original features into something that encapsulates the information in them, and then using the end result to make predictions of some kind. When it comes to data, this could be some new text that mimics the style of the original text, or some better representation of the data using a compact feature set. All this is done through computationally heavy processes that often employ GPUs. So, saying that a knowledge representation system that can run on an average computer, without any additional computing power, is an A.I. system, is inaccurate and misleading. Best case scenario, its results will later be discovered to be interesting but practically useless. After all, A.I. systems are robust because they drill into the data in ways that no human can replicate, and usually not even fully comprehend.

So, if you hear someone claim that they have developed some new A.I. system that can handle raw text data without the use of some non-parametric model, they are probably trying to sell you snake oil. This is expected in times when new technologies are available yet not fully understood, and charlatans trying to take advantage of the fact are promoting products convoluted enough to masquerade as this new tech, without actually offering any real value to the user. The answer to this situation is to better understand the field through methodical study (it doesn’t have to be too time-consuming) of reliable sources and through consultation with A.I. professionals and data scientists with an NLP focus. Once you are armed with this understanding, no KRS charlatans can take advantage of you, since you’ll be able to see through their lies.
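To make the point about knowledge representation more concrete, here is a minimal sketch of the kind of "transformation of text data into a quantitative format" discussed above. It is not NLTK's PoS tagger or any particular KRS product; it is a plain bag-of-words count written with the standard library only, and the function names (`bag_of_words`, `to_vector`) and the two sample sentences are made up for illustration.

```python
import re
from collections import Counter


def bag_of_words(text):
    """Count lowercase word tokens: a purely mechanical text-to-numbers step."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)


def to_vector(counts, vocabulary):
    """Lay the counts out as a numeric vector in a fixed vocabulary order."""
    return [counts.get(word, 0) for word in vocabulary]


# Hypothetical sample documents, used only for this illustration.
docs = ["The product is great", "The service is not great"]

counts = [bag_of_words(doc) for doc in docs]
vocabulary = sorted(set().union(*counts))  # every distinct word, sorted
vectors = [to_vector(c, vocabulary) for c in counts]

print(vocabulary)
print(vectors)
```

The output is just a pair of integer vectors. It runs instantly on any laptop, and nothing in it predicts or decides anything; whatever value there is only appears once someone builds a model on top of these numbers, which is the distinction this post draws between a representation step and an actual A.I. system.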
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.