In a nutshell, open-mindedness is our ability to view things from a wider perspective, with as few assumptions as humanly possible. It's very much like the "beginner's mind" concept, which I've talked about in previous posts. I've also written an article about the value of open-mindedness on this blog before, a post that remains somewhat popular to this day. That's why I decided to go deeper into this topic, which is both evergreen and practical.

The first scenario where open-mindedness becomes practically useful in data science is when you are learning about it. For example, you can learn the craft like some people do, blindly following some course, video, or book, or you can be more open-minded about it and learn through a series of case studies, Q&A sessions with a mentor, and your own research into the topic. Having an active role in learning about the field is crucial if you want to have an open-minded approach to it. The same goes for taking initiative in practice projects and the like.

Of course, open-mindedness has other advantages in data science work. For example, when finding a solution in a data science project, you may consider different, somewhat unconventional, approaches to it. You may try all the standard methods, but also consider different combinations of models, or variants of them. Such an approach is bound to be beneficial in complex problems that cannot be easily tackled with conventional models.

What's more, open-mindedness can be applied to data handling too. For example, you can consider different ways of managing your features, alternative ways of combining them, and even different options for creating new features. All this can potentially enable you to use more refined data, providing you with an edge in your data engineering work. Let's not forget that the latter constitutes the bulk of most data scientists' workload. As such, this part of the pipeline conceals the largest potential for improvement.

Communicating with data science project stakeholders is another aspect of open-mindedness, perhaps the one that deserves the most attention. After all, it's not always easy to convey one's insights and methodology to the other stakeholders of a data science project. Sometimes you need to find the right angle and the right justifications, which may not be purely technical. That's why open-mindedness can shine here and help bring about new iterations of the data science pipeline for a given project. It can also bring about spin-off data science projects related to the original one.

Although this topic is vast, you can learn more about open-mindedness and data science through one of my books. Namely, the Data Science Mindset, Methodologies, and Misconceptions book that I authored a few years ago covers such topics. Although the term open-mindedness is not used per se, the book delves into the way of thinking of a data scientist and how qualities like creativity (which is very closely related to open-mindedness) come into play. So, check it out when you have the chance. Cheers!
Data visualization is a key aspect of data science work, as it illustrates insights and findings efficiently and effortlessly. It is an intriguing methodology that's used in every data science and data analytics project at one point or another. However, many people today, particularly those in the data analytics field, entertain the idea that it's best to perform this sort of task using Tableau, a paid data visualization tool. Note that there are several options when it comes to data visualization, many of which are either free or better priced than Tableau. The latter appears to be popular as it was one of the first such tools to become available. However, this doesn't mean that it's a good option, particularly for data scientists. So, what other options are there for data visualization tasks?

For starters, every programming language (and even math platforms) has a set of visualization libraries. These make it possible to create plots on the fly, customize them to your heart's content, and replicate the whole process easily if needed through the corresponding script (there is a minimal sketch of this at the end of this post). Also, these libraries are regularly updated and have a community around them, making it easy to seek help and advice on your data visualization tasks. Moreover, there are other data visualization programs, much more affordable than Tableau, which are also compatible with Linux-based operating systems. Tableau may be fine for Windows and macOS, but when it comes to GNU/Linux, it leaves you hanging.

Let's shift gears a bit and look at the business aspect of data science work. In a data science team, there are various costs that can diminish its chances of being successful. After all, just like other parts of the organization, a data science team has a budget to work with. This budget has to cover a variety of tasks, from data governance costs (e.g., a big data system for storing and querying data) and data analytics costs (e.g., cloud computing resources) to the salaries and bonuses of the people involved. Adding yet another cost to all this, for a Tableau subscription, doesn't make much sense, especially considering how challenging it can be for a data science project to yield profits in the beginning. Also, considering that there are free alternatives for data visualization tasks, it makes more sense to invest in them (i.e., learn them) instead.

So what are some better ways to invest money for data science work? For starters, you can invest in the education of your team (e.g., through a course or a good book). Even if they are all adept in the essentials of data science work, they can always get up to speed on some new technology or methodology that can be used in some of their projects. Also, you can invest in additional equipment, upgrading the computers involved, or even getting more cloud resources. Finally, you can always invest in specialized software related to your domain, or hire consultants to help out when needed.

A few years ago, as I was writing the Data Science Mindset, Methodologies, and Misconceptions book, I mentioned Tableau as a data visualization alternative. However, I didn't look at the bigger picture of data science work from the organization's perspective. The latter is something my co-author and I did in our book Data Scientist Bedside Manner, which I'd encourage you to buy. In it we cover a variety of topics related to data science work and how there are better ways to invest resources in it, building towards a more successful pipeline. Cheers!
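Coming back to the scripting-based libraries mentioned above, here is a minimal sketch of what that workflow can look like in Julia, using the Plots.jl package (assuming it is installed; any comparable library in Python or R would serve the same purpose):

```julia
# Minimal, reproducible plotting sketch using Plots.jl
# (install it first with: using Pkg; Pkg.add("Plots"))
using Plots

x = 1:50
y = cumsum(randn(50))        # some toy data to visualize

plot(x, y,
     label = "random walk",
     xlabel = "step",
     ylabel = "value",
     title = "A plot generated entirely from a script")

savefig("random_walk.png")   # rerun the script and the exact same figure is regenerated
```

Everything about the figure lives in the script, so tweaking it or reproducing it later is just a matter of rerunning the code.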
Benchmarking is the process of measuring a script's performance in terms of the time it takes to run and the memory it requires. It is an essential part of programming, and it's particularly useful for developing scalable code. Usually, it involves a more detailed analysis of the code, such as profiling, so we know exactly which parts of the script run more often and what proportion of the overall time they take. As a result, we know how to optimize the script using this information, making it lighter and more efficient.

Benchmarking is great as it allows us to optimize our scripts, but what does this mean for us as data scientists? From a practical perspective, it enables us to work with larger data samples and save time. We can use this extra time for more high-level thinking, refining our work. Also, being able to develop high-performance code can make us more independent as professionals, something that has numerous advantages, especially when dealing with large-scale projects. Finally, benchmarking allows us to assess the methods we use (e.g., our heuristics) and thereby make better decisions regarding them.

In Julia, in particular, there is a useful package for benchmarking, which I discovered recently through a fellow Julia user. It's called BenchmarkTools, and it has a number of useful functions you can use for accurately measuring the performance of any script (e.g., the @btime and @benchmark macros, which provide essential performance statistics; see the sketch at the end of this post). With these measures as a guide, you can easily improve the performance of a Julia script, making it more scalable. Give it a try when you get the chance.

Note that benchmarking may not be a sufficient condition for improving a script. Unless you take action to change the script, perhaps even rewrite it using a different algorithm, benchmarking can't do much. After all, a benchmark is more like an objective function that you try to optimize; how it changes is really up to you! This illustrates that benchmarking is just one part of the whole editing process.

What's more, note that benchmarking needs to be done on scripts that are free of bugs. Otherwise, it wouldn't be possible to assess the performance of the script, since it wouldn't run to completion. Still, you can evaluate parts of it independently, something that a functional approach to the program would enable.

Finally, it's always good to remember this powerful methodology for script optimization. Its value in data science is beyond doubt, plus it can make programming more enjoyable. After all, for those who can appreciate elegance in a script, a piece of code can be a work of art, one that is truly valuable.
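To make the above more concrete, here is a minimal benchmarking sketch using BenchmarkTools.jl; the function being measured is just a toy example:

```julia
# Minimal benchmarking sketch with BenchmarkTools.jl
# (install it first with: using Pkg; Pkg.add("BenchmarkTools"))
using BenchmarkTools

# A toy function to measure: the sum of squares of a vector's elements
sum_of_squares(v) = sum(x -> x^2, v)

v = rand(10_000)

# @btime runs the expression several times and prints a compact time/memory summary;
# interpolating the argument with $ keeps global-variable overhead out of the measurement
@btime sum_of_squares($v)

# @benchmark returns the full distribution of timings and allocation statistics
results = @benchmark sum_of_squares($v)
display(results)
```

With numbers like these in hand, you can compare alternative implementations of the same function and keep the one that scales best.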
As privacy matters gain more attention these days, transparency also gains a lot of value in data science work. This is to be expected for another reason, which I hope has become obvious if you have been following this blog: transparent models are easier to explain to others. Beyond these advantages, there are other ones too (such as the fact that transparent models are easier to tweak and optimize), which I'm not going to elaborate on right now. Instead, I'm going to look at the various data models used in data science and where they fall on the transparency spectrum.

On one extreme of this spectrum lie the most transparent data models. These are usually Stats-based, since they can provide exact proportions of each feature's contribution. Also, you know exactly what's going on with the decisions involved in the predictions they yield. Even if you know nothing about data science, you can still make sense of these models and understand the predictions they yield. The main disadvantage of these models is that they are not as accurate, partly because of the overly simple processes they use.

On the other extreme of the spectrum, you can find the most opaque data models. These are usually AI-based and are referred to as black boxes. Not only do they not tell us anything about feature importance, but trying to explain their inner workings is a futile task. However, they tend to have an edge in performance when it comes to accuracy, plus they require very little prep work for the data they use (data engineering).

Somewhere in the middle of the spectrum lie all the other models, mostly under the machine learning category. These include random forests and boosted trees (some transparency), k nearest neighbors (very little transparency), support vector machines (no transparency), and fuzzy logic systems (pretty decent transparency). That's a category of models most people forget, since they tend to think of transparency as a binary attribute.

Finally, it's good to remember that transparency is usually linked to a business requirement. Also, the performance you obtain from black-box models is sometimes a good trade-off, since some projects require high accuracy in the predictions involved. So, transparency is not always a necessity, even if it can facilitate the communication of these models to the project stakeholders. As a result, it's always good to consider whether you really need the extra transparency that a statistical model may offer you, if you can achieve better performance with a less transparent model.

For more information about transparency and other aspects of data science models (particularly machine learning related ones), you can check out my latest book, Julia for Machine Learning. It is a very hands-on kind of book, which doesn't neglect to provide the information needed to build the right mindset when it comes to data science work. Also, it includes lots of examples and links to useful resources that can help you understand all the concepts involved.

Data analytics is the field of analyzing data and using any insights you've discovered to facilitate an organization's workflow. It has a very hands-on approach to things and focuses on describing a problem accurately and liaising with the stakeholders to drive decisions based on the data analyzed. Data science is akin to that, though it employs the scientific method to go deeper into the data and develop more sophisticated strategies to drive those decisions. Nowadays, the roles of the data analyst and the data scientist are somewhat mixed up, since the latter is still relatively new. The fact that its evangelists haven't publicized it accurately makes it even more challenging to understand how it fills a somewhat different niche as a role.

A data analyst is more widespread as a role and ties to a large variety of tasks, including marketing analytics (e.g., SEO) and business intelligence (BI) work. An organization usually leverages a data scientist in cases with more unstructured data (e.g., text) or data in various forms, making data wrangling a necessary part of the analysis. Also, in scenarios where the objectives are not as clear-cut as, say, predicting the sales of the next quarter, a data scientist is usually preferred. However, it's worth pointing out that many data scientists start their careers as data analysts and that both roles are necessary.
After all, they both work with data to produce insights, even if they often go about it differently. Beyond the differences mentioned previously, another critical difference between the two roles lies in the models used. Namely, the data analyst is more geared towards describing the data and understanding the problem it represents. The data scientist goes a step further and also makes predictions (through mathematical models, usually based on machine learning), digging deeper into the data. As a result, she can put together a complete solution, such as a predictive model accessible through an API. Naturally, she can also create a dashboard, something that is among the data analyst's deliverables.

What's more, although data analysts can tackle all sorts of data, text and semi-structured data are usually where they draw the line. For these sorts of datasets, specialized methods are required, such as Natural Language Processing (NLP), which falls in the data scientist's domain. The use of AI is often essential in problems like this, something a data scientist is usually required to know but which lies beyond the job description of a data analyst.

You can learn more about data science and the data scientist's role through a couple of my books. Specifically, the book Data Scientist: The Definitive Guide to Becoming a Data Scientist illustrates the ins and outs of this role and offers some practical advice as to how you can pursue a career in this field. Additionally, the Data Science Mindset, Methodologies, and Misconceptions book showcases the field overall and its defining aspects, as well as the essential techniques used. Together, both books offer a bird's-eye view of data science and how you can build your career in it. Check them out when you have a chance. Cheers!

Statistics is a very interesting field that has some relevance to data science work. Perhaps not as much as some people claim, but it's definitely a useful toolset, particularly in the data exploration part of the pipeline. But how can Stats itself improve, particularly with the help of new technologies like A.I.?

Statistics can benefit from such technologies in various ways. The most important is the mindset of the AI-based approach, which is empirical and pragmatic. Instead of imagining complex theories for explaining the data, AI looks at the data as it is and works with it accordingly (the data-driven approach). Conventional Statistics, on the other hand, tries to fit the data into one mathematical model or another for describing it, and then processes the data with somewhat arbitrary metrics that are mediocre at best. However, the math is elegant, so we go with it anyway. So many textbooks can't be wrong, right? Well, in data science there is no right or wrong, just models that work well and others that don't work as well. Since there is usually money on the line, we prefer to go with the former models, which coincidentally tend to be machine learning related, particularly AI-based. So, there is surely room for improvement for Statistics if it were to adopt the same mindset.

What's more, Statistics can benefit from A.I. through additional heuristics (or statistics, as they are often referred to in that context). The existing heuristics may work well, but they are very narrowly defined, making them overly specialized. AI-based heuristics are broader and tend to be applicable to a wider variety of datasets. If Statistics were to adopt a similar approach to heuristics, it would surely benefit greatly and become more widely applicable.
Finally, Statistics can benefit from A.I. by embracing a different approach to describing the data. This is a more fundamental change, and probably most fans of the field will disregard it as impractical. However, it is feasible and even efficient, with a bit of clever programming (based on heuristics and a geometrical approach). The latter is something Statistics seems to be divorced from, which is another area for improvement.

It's worth noting that although developments in Statistics are bound to be beneficial to anyone applying it in data analytics projects, the practitioner also needs to evolve. There is no point in advancing this field if its practitioners remain in their old ways, limited and rigid. Perhaps that's one of the reasons machine learning and A.I. have advanced so much: their practitioners are more open to change and willing to adapt. No wonder these fields now dominate the data science world. Something to think about...

PS - This article was supposed to be published yesterday. However, there was an issue with the scheduler, hence the delay. Normally, I have new material every Monday and sometimes on Thursdays too. Cheers!

With all this talk these days about Statistics and other frameworks and their immense value in data science, it's good to be more pragmatic about the matter. After all, it's not a coincidence that Machine Learning maintains the top position both as a framework and as a specialization when it comes to data science work. In this article, we'll explore why this is.

First of all, machine learning is a more scientific paradigm for data science. It makes hardly any assumptions, as it relies on the data at hand and little else. Well, there are also the ML models that it makes use of, but it doesn't try to model everything as one distribution or another and rely on metrics based on those distributions. The scientific approach has proven itself to be very useful in understanding the world, so it only makes sense that it is used (in the form of machine learning methods) in data science too.

What's more, machine learning makes use of more advanced methods than other frameworks. After all, it makes sense that if a framework works well, as in the case of machine learning, more methods are researched and refined. As a result, the models that machine learning brings to the table are more state-of-the-art and efficient. This makes using the machine learning framework a no-brainer, particularly when it comes to critical processes where accuracy and efficiency are key requirements.

Also, machine learning nowadays is powered to a great extent by AI, yielding powerful models that outperform anything else available to a data scientist. This may be a trend that's here to stay, since many AI-based models have proven to be exceptionally good and versatile. Although these models have special requirements that may not be met in every data science project, it's good that this option is available for data science work.

Moreover, machine learning is easier to learn and use, since it doesn't have a lot of theory behind it. As a result, you don't need to spend a lot of time learning it or worrying about the requirements of each model, as in Statistics. Of course, there is some theory in this framework too, but it's fairly straightforward and doesn't require overly specialized math to learn to an adequate degree.

Finally, there are lots of libraries nowadays for every machine learning model or process, making it easy to implement. In other words, you don't have to do a lot of coding to get your machine learning method up and running. Also, the fact that there is usually adequate documentation in these libraries makes it easier to understand the corresponding programs and the techniques too, supplementing your learning.
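As an illustration of how little code this can take, here is a minimal sketch using Julia's DecisionTree.jl package and a toy dataset (this is just one of several library options you could pick):

```julia
# Minimal sketch: training and using a random forest with DecisionTree.jl
# (install it first with: using Pkg; Pkg.add("DecisionTree"))
using DecisionTree
using Random

# Toy dataset: 100 samples, 3 numeric features, and labels derived from two of them
Random.seed!(1)
X = rand(100, 3)
y = [row[1] + row[2] > 1.0 ? "yes" : "no" for row in eachrow(X)]

# Training a random forest is a single call with sensible defaults...
model = build_forest(y, X)

# ...and so is generating predictions
preds = apply_forest(model, X)
println("Training accuracy: ", sum(preds .== y) / length(y))
```

The heavy lifting (tree construction, splitting criteria, ensembling) all happens inside the library, which is exactly the point being made above.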
Speaking of learning, if you wish to learn more about machine learning through a hands-on approach to the subject, feel free to check out my latest book, Julia for Machine Learning (Technics Publications). There I talk about the subject in some depth, while explaining how you can use Julia to deploy different kinds of machine learning models and heuristics. Cheers!

Throughout our careers in data science and AI, we constantly encounter all sorts of obstacles that hinder our development. This is inevitable, particularly when we undertake a role that's constantly evolving. However, the biggest obstacle is not something external, as one might think, but something closer to home. On the bright side, this means that it's more within our control than anything subject to external circumstances. Let's clarify.

The biggest obstacle is related to the limits of our aptitude, something primarily linked to our knowledge and know-how. After all, no one knows all there is to know on a subject as broad as data science (or AI). However, as we gather enough knowledge to do what we are asked to, we can be overtaken by the idea that we know enough. Eventually, this can morph into a conviction and even expand, leading us to cultivate the illusion that we know everything there is to know in our field. Naturally, nothing could be further from the truth, since even a unicorn data scientist has gaps in her knowledge.

One great way to avoid this obstacle is to constantly challenge yourself in anything related to the field. I'm not talking about Kaggle competitions and other trivial things like that; after all, these are hardly realistic as data science challenges. I'm referring to challenging yourself with techniques and methods that you are lacking, as well as refining those that you already have under your belt. This may seem simple, but it's not, especially since no one enjoys becoming aware of the things they don't know or don't know fully. Perhaps that's why developing ourselves isn't something easy or popular.

Another way to enhance ourselves is by reading technical books related to our field. Of course, not all such books are worth your while, but if you know where to look, it's not as challenging a task. What's more, it's good to remember that the value of such a book also depends on how you process the new information. For example, many such books include exercises and problems that the reader is asked to solve. By taking advantage of such opportunities, you can learn the new material better and develop a deeper understanding of the topics presented.

One way to learn more is through Technics Publications books. Although many of the books from that publishing house are related to data modeling, there are a few data science-related ones, as well as a couple on AI. Of course, even the data modeling books can be useful to a data scientist, since we often need to deal with databases, particularly in the initial stages of a project. Also, if you buy a book from this publisher using the coupon code DSML, you can get a 20% discount. The same applies to any webinars you may register for.
So, if the cost of this material is an obstacle for you, at least with this code you can alleviate it and get a bigger bang for your buck!

Although it's fairly easy to compare two continuous variables and assess their similarity, it's not so straightforward when you perform the same task on categorical variables. Of course, things are fairly simple when the variables at hand are binary (aka dummy variables), but even in this case, it's not as obvious as you may think. For example, if two variables are aligned (zeros to zeros and ones to ones), that's fine: you can use Jaccard similarity to gauge how similar they are. But what happens when the two variables are reversely similar (the zeros of the first variable correspond to the ones of the second, and vice versa)? Then Jaccard similarity finds them dissimilar, even though there is no doubt that such a pair of variables may be relevant and the first one could be used to predict the second.

Enter the Symmetric Jaccard Similarity (SJS), a metric that can alleviate this shortcoming of the original Jaccard similarity. Namely, it takes the maximum of the two Jaccard similarities, one with the features as they are originally and one with one of them reversed (see the sketch at the end of this post). SJS is easy to use and scalable, while its implementation in Julia is quite straightforward. You just need to be comfortable with contingency tables, something that's already an easy task in this language, though you can also code it from scratch without too much of a challenge.

Anyway, SJS is a fairly simple metric, and something I've been using for years now. However, only recently did I explore its generalization to nominal variables, something that's not as simple as it may first seem. Applying the SJS metric to a pair of nominal variables entails maximizing the potential similarity value between them, just like the original SJS does for binary variables. In other words, it shuffles the values of the first variable until its similarity with the second variable is maximized, something that's done in a deterministic and scalable manner. However, it becomes apparent through the algorithm that SJS may fail to reveal the edge that a non-symmetric approach may yield, namely in the case where certain values of the first variable correspond more closely to a particular value of the second variable. In practical terms, this means that certain values of the nominal feature at hand are good at predicting a specific class, but not all of the classes. That's why an exhaustive search over all the binary combinations is generally better, since a given nominal feature may have more to offer in a classification model if it's broken down into several binary ones. That's something we do anyway, but this investigation through the SJS metric illustrates why this strategy is also a good one.

Of course, SJS for nominal features may still be useful for assessing whether one of them is redundant. Just like we apply some correlation metric to a group of continuous features, we can apply SJS to a group of nominal features, eliminating those that are unnecessary before we start breaking them down into binary ones, something that can make the dataset explode in size in some cases. All this is something I was working on the other day, as part of another project.
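To ground the above, here is a minimal sketch of SJS for binary variables, following the description given in this post (the jaccard and sjs functions below are just illustrative implementations written from scratch, not part of any package):

```julia
# Symmetric Jaccard Similarity (SJS) for binary variables: the maximum of the
# Jaccard similarity computed with the variables as they are and with one of
# them reversed. Illustrative code only.

# Jaccard similarity of two binary vectors: matching ones over positions with at least one 1
function jaccard(x::AbstractVector{Bool}, y::AbstractVector{Bool})
    union_count = sum(x .| y)
    union_count == 0 && return 1.0   # both vectors all zeros: treat as identical
    return sum(x .& y) / union_count
end

# SJS: also consider the reversed version of the first variable
sjs(x::AbstractVector{Bool}, y::AbstractVector{Bool}) = max(jaccard(x, y), jaccard(.!x, y))

# Toy example: y is the exact reverse of x, so plain Jaccard is 0 while SJS is 1
x = Bool[1, 0, 1, 1, 0, 0, 1, 0]
y = Bool[0, 1, 0, 0, 1, 1, 0, 1]
println("Jaccard: ", jaccard(x, y))
println("SJS:     ", sjs(x, y))
```

This captures the binary case only; the nominal-variable generalization described above would additionally require searching over the possible value mappings between the two variables.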
In my latest book, Julia for Machine Learning (Technics Publications), I talk about such metrics (not SJS in particular) and how you can develop them from scratch in this programming language. Feel free to check it out. Cheers!

The concept of antifragility was established by Dr. Taleb and has even been adopted by the mainstream to some extent (e.g., on Investopedia). This is a vast concept, and it's unlikely that I can do it justice, especially in a blog post. That's why I suggest you familiarize yourself with it first before reading the rest of this article.

Antifragility is not only desirable but also essential to some extent, particularly when it comes to data science / AI work. Even though most data models are antifragile by nature (particularly the more sophisticated ones that manage to get every drop of signal from the data they are given), there are fragilities all over the place when it comes to how these models are used. A clear example of this is the computer code around them. I'm not referring to the code that's used to implement them, which usually comes from some specialized packages; that code is fine and usually better than most code found in data science / AI projects. The code around the models, however, be it the code taking care of ETL work, feature engineering, or even data visualization, may not always be good enough. Antifragility applies to computer code in various ways. Here are the ones I've found so far:
All this may seem like a lot of work, and it may not fit within your time constraints, particularly if you have strict deadlines. However, you can always improve on your code after you've cleared a milestone. This way, you can avoid some Black Swans, like an error being thrown while the program you've made is already in production. Cheers!