As much as I'd love to write a (probably long) post about this, I'd rather use my voice. So, if you are interested in learning more about this topic, check out the latest episode of my podcast, available on Buzzsprout and a few other places (e.g., Spotify). Cheers!
Creating Diagrams and Unconventional Graphics for Data Science and Data Analytics Work – Revisited (7/5/2021)

I realize that I've done this topic before, but perhaps it deserves some more attention, as it's a very useful one. Diagrams are great, but they are also challenging. The same goes for other graphics, particularly those not generated by a plotting library. Yet both diagrams and these unconventional graphics are often essential in our line of work, be it as data scientists or data analysts. Let's examine the hows and whys of all this.

First of all, diagrams and graphics in general are a means of conveying information more intuitively. When you look at a table filled with numbers and other kinds of data, you need to think about them, and sometimes you have to know something about their context. With a diagram, you may get an idea of the underlying information even if you don't know much about the context. Of course, the latter can bring about scope and perspective, helping you interpret the diagram better and make it more applicable to the task at hand.

Diagrams and unconventional graphics are paramount in presentations too. Imagine going to a client or a manager with just a code notebook at your disposal! Even if they appreciate all the work you've done, chances are you'll need more than that to get them on your side and help them see the real value behind all these ones and zeros. Besides, the adage that a picture is worth a thousand words holds true, even in analytics work. Data modelers figured that out a long time ago, which is why diagrams are their bread and butter. Perhaps there is something to be learned from all this.

But how do you go about creating diagrams and unconventional graphics? After all, graphic design is a challenging discipline, and it's not realistic to attempt this kind of work without lots of studying and practicing. It's also doubtful we'll ever be as good as graphics professionals, who often have talent driving their know-how. Still, we can learn some basics and create decent-looking diagrams and graphics to facilitate our data science endeavors.

For starters, we can invest in learning a program like GIMP. This software is an open-source alternative to Photoshop, and it's well established and documented. So, if you have a good image or graphic to work with, GIMP can make it shine. Also, programs like LibreOffice Draw can be practically essential for this sort of work, especially if you want to build something from scratch.

Contrary to what some people think, creating graphics is very detailed work, not just an artistic endeavor. You need to use both your analytical and your imaginative faculties for such a task, even if the imagination part may seem dominant, at least in the beginning. So, for any graphics-related task, remember: zooming in is your friend! As for the properties box of any graphical object, that's your best friend!
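If you'd rather script a diagram than draw it by hand, a plotting library can serve here too. Below is a minimal sketch in Python with matplotlib that draws a simple pipeline diagram; the stage names are made up for illustration, and a GUI tool like LibreOffice Draw remains the better option for anything elaborate.

```python
import matplotlib.pyplot as plt
from matplotlib.patches import FancyBboxPatch

# Hypothetical pipeline stages, purely for illustration
stages = ["Raw Data", "Cleaning", "Features", "Model", "Report"]

fig, ax = plt.subplots(figsize=(10, 2))
for i, label in enumerate(stages):
    x = i * 2.2
    # A rounded box for each stage of the pipeline
    ax.add_patch(FancyBboxPatch((x, 0), 1.8, 1, boxstyle="round,pad=0.1",
                                facecolor="lightsteelblue", edgecolor="black"))
    ax.text(x + 0.9, 0.5, label, ha="center", va="center")
    if i < len(stages) - 1:
        # An arrow connecting this stage to the next one
        ax.annotate("", xy=(x + 2.2, 0.5), xytext=(x + 1.9, 0.5),
                    arrowprops={"arrowstyle": "->"})

ax.set_xlim(-0.3, len(stages) * 2.2)
ax.set_ylim(-0.3, 1.3)
ax.axis("off")
fig.savefig("pipeline_diagram.png", dpi=150, bbox_inches="tight")
```

The advantage of the scripted route is reproducibility: when the pipeline changes, you edit a list instead of redrawing boxes.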
Anyway, I could go on talking about graphics in data science and data analytics work all day, but it's not possible to do this topic justice in a single blog post. Besides, the best way to learn is by practicing, just like when it comes to building and refining data models for your analytics work. Cheers!

Many people talk about strategy nowadays, from the strategy of a marketing campaign to business strategy, and even content strategy. However, strategy is a more general concept that finds application in many other areas, including data science. In this article, we'll look at how strategy relates to data science work, as well as data science learning.

Strategy is the ability to analyze a situation, create a plan of action around it, and follow that plan. Strategy is relevant when there are other people (players) involved, as it deals with the dynamics of the interactions among all of them. It's a vast field, often associated with Game Theory, which was pioneered by John von Neumann and famously advanced by John Nash, considered one of the best modern mathematicians (he even won the Nobel Memorial Prize in Economics once his work's applications there were recognized). In any case, strategy is not something to be taken lightly, even if there are more lighthearted applications of it out there, such as strategy games, something I'm passionate about.

Strategy applies to data science too, as the latter is a complex matter that also involves lots of people (e.g., the project stakeholders). Thinking about data science strategically is all about understanding the risks involved and the various options available, and employing foresight in your every action as a data scientist. It's not just a responsible role (especially when dealing with sensitive data) but also a crucial one in many organizations. After all, in many cases, it's us who deliver the insights that effect changes in the organization or bring about valuable (and often profitable) products or services, which the organization can market to its clients.

Strategy in data science is all about thinking outside the box and understanding the bigger picture. It's not just the datasets at hand that matter, but how they are leveraged to build valuable data products. It's about mining them for insights significant to the stakeholders instead of coming up with findings of limited importance. Data science is practical and hands-on, just like the strategies that revolve around it.

Strategy in data science is also relevant to how we learn it. We may go for the more established option of taking a course and reading a textbook or two that the instructor recommends. However, this is just one strategy, and perhaps not the best one for you. Mentoring is another strategy that's becoming increasingly important these days, since it's more hands-on and personal, in the sense that it addresses the specific issues you face as a learner while assimilating your newfound data science knowledge. Yet another powerful strategy is videos and quizzes that provide valuable knowledge and know-how, enabling a more intuitive understanding of a data science topic. Of course, there is also the strategy of combining two or more of these for a more holistic approach to data science learning.

Choosing a strategy for your data science work or your data science learning isn't easy. It's something you often need to think about and evaluate over several days. In any case, data science educational material can usually help you in that and can also supplement your work, enriching your skill-set. You can find some such material among the books I've published, as well as the video courses I've created (e.g., those on Cybersecurity). I hope they can help you in your data science journey and make it easier and more enjoyable. Cheers!

Data modeling, or data architecture, is the discipline that deals with how data is organized, how various (mostly business-related) processes express themselves as data flows, and how we leverage data to answer business-related questions.
It involves some basic analytics (the stuff you'd do to create a pivot table, for example) but no heavy-lifting data analysis like what you'd find in our field. There is no doubt that data modeling benefits a great deal from data analytics, but the reverse is also true. Let's explore why through a few examples.

First of all, data modeling is fundamental to the structure of the data involved (data architects often design the databases we use) and the relationships among the various datasets, especially when it comes to an RDBMS architecture. However, data modelers also work with semi-structured data and ensure that the data is kept accessible and secure. Over the past few years, they have also been working on the cloud, ensuring efficiency in how we access the data stored there, all while keeping the overall costs low. So, it's next to impossible to do any data-related work without consulting a data architect. Since data modeling is the language these professionals speak, we need to know it, at least to some extent.

Data modeling also involves generating reports based on the data at hand. These reports may need to be augmented with additional metrics, which may not be easy to compute with conventional analytics tools (slicing and dicing methods). So, we may need to step in and build some models to make these metrics available for these reports. Before we do, however, we need to know about their context in the problem at hand, which is something some knowledge of data modeling can help provide.
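To make the report-metrics point concrete, here is a minimal sketch in Python, assuming a hypothetical monthly sales table (all column names and figures are invented for illustration). Metrics like month-over-month growth or a rolling average are awkward to get through pure slicing and dicing, but straightforward once you step in with pandas.

```python
import pandas as pd

# Hypothetical monthly sales data; names and numbers are made up
df = pd.DataFrame({
    "month": pd.date_range("2021-01-01", periods=6, freq="MS"),
    "revenue": [100, 120, 115, 140, 150, 165],
    "customers": [40, 44, 43, 50, 52, 55],
})

# Metrics that conventional slicing and dicing struggles with:
df["revenue_per_customer"] = df["revenue"] / df["customers"]
df["mom_growth_pct"] = df["revenue"].pct_change() * 100        # month-over-month growth
df["revenue_3m_avg"] = df["revenue"].rolling(window=3).mean()  # smoothed trend

print(df.round(2))  # ready to feed into the report
```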
Apart from these two cases, there are other scenarios where we need to leverage data modeling knowledge in our pipelines. These, however, are project-specific and beyond the scope of this article. In any case, having the right mindset in data science (and data analytics in general) is crucial for bridging the gap between our field and data modeling. This is something I explore in all of my books, particularly the Data Science Mindset, Methodologies, and Misconceptions one. So, check it out when you have a moment. Cheers.

Since the relatively recent exodus of users from Facebook and other conventional platforms, there has been a rise in privacy-focused social media. Most of them are blockchain-based, a promising technology linked to financial rewards, usually in crypto, for the more successful content contributors. One such platform is Flote, which I've been testing for the past few weeks. In this article, I'll share some thoughts based on my experience with it.

First of all, I'm not affiliated with Flote in any way, and no one invited me to join it either. So, I could have left at any time, especially since there are other platforms I frequent and where I have already established a network of contacts. Still, I lingered on Flote because of its simplicity, clean user interface, and innovative model. I don't aspire to be an influencer there, but I enjoy the platform, and the fact that one of its founders, Erin Edwards, actively engages with its members, offering help and promoting posts others may find interesting, is a nice touch. I've only seen that in a couple of other places.

So, why Flote? Well, it's privacy-oriented, fresh, and big on blockchain tech. That's not to say that it's there yet, though. The engagement you may get on a platform like this is bound to be linked with a very particular set of people. It doesn't feature the diversity of places like MeWe, but it has the potential to do so, perhaps once the beta-testing is over. Some features still don't work well on the mobile app, which is why it's still labeled as beta, while the whole platform seems quite minimalist for a social medium. Perhaps that's why some people view it as a Twitter alternative, even if it doesn't have the silly size limitations that well-known gossiping site features.

I've tried other privacy-focused platforms over the past two years, and although Flote seems quite promising, I don't think I'll drop the others any time soon to make Flote my primary online socializing site. Still, I don't think I'll quit it any time soon either, partly because there is a feeling of authenticity among the users there. If you look past the biases of the user base, Flote is very open-minded and fosters debate, something most social media today have forgotten or even banned. So, if the attention we give the sites we frequent counts for something, like a vote of sorts for what is worth spending time on, I feel that Flote deserves a chance, at least right now. After all, many places start great (e.g., Voice) and then take a wrong turn somewhere, turning into something undesirable.

If you are interested in platforms like Flote, i.e., it floats your boat, but you're open to other places too, you may want to check out my Privacy Fundamentals course on WintellectNow. There I talk about various privacy-related matters, with lots of practical advice on your online options. This includes, but is not limited to, privacy-oriented social media sites. So, check it out when you have a moment. Cheers!

Good documentation is in high demand everywhere, from coding libraries to products and services, to even data science projects. The funny thing is that even though many people value communication in data science, not everyone can link good communication with good documentation. Interestingly, even if you are the most charismatic communicator out there, if you don't express your communication skills in your documentation, your data science work will suffer. But why is documentation so valuable? What about visuals? Aren't they worth (at least) a thousand words each? What's the point of dressing up our code notebooks with text too?

First things first. You don't need to be a technical writer to write good documentation. Just take a look at the documentation of the most mature packages in Julia. Do you think their creators were technical writers? The same goes for other kinds of documentation available online. As long as the reader can understand what you are doing without having to dig deep into the code (or, even worse, run parts of it), your documentation is a decent first draft. It can be improved later, but first, you need to write it! Even if you are the only person who ever reads this documentation, perhaps on a future iteration of that data science project, it's good to do it properly. This way, you won't scratch your head trying to figure out what you were thinking when you put that notebook together.

Good documentation is not just about the reader, though. It's also about organizing your thoughts and understanding your code better. Perhaps some refactoring needs to take place, simplifying the whole project. Or maybe some examples could help clarify the objective or the value-add of your script. It's easy to lose sight of these matters when you are entrenched in analytics work, especially the coding part.
A well-documented data science project can also be a great addition to your portfolio (assuming, of course, that you have the option of exhibiting your work publicly). It's unlikely that someone will go through every line of your code to see what you've done. Still, that person may read at least parts of your documentation, especially the text at the beginning, where you explain the objectives, assumptions, and datasets related to the project. And you can be almost certain that if someone makes it to the end of your code notebook, they'll read your conclusions too.
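As a sketch of what such an opening note might look like in a Python notebook (every project name, file, and threshold below is invented for illustration):

```python
"""
Customer Churn Analysis: exploratory notebook (all names here are hypothetical).

Objective  : identify the main drivers of churn in the Q3 customer base.
Datasets   : customers.csv (CRM export) and usage_logs.csv (product telemetry).
Assumptions: rows with a missing signup date (~2%) are dropped; "churn" means
             no recorded activity for 60 or more consecutive days.
"""

import pandas as pd

def churn_rate(inactivity_days: pd.Series, threshold: int = 60) -> float:
    """Return the share of customers whose inactivity exceeds `threshold` days.

    Unlike a terse inline comment, this states intent and units, so readers
    don't have to run the code to understand what it produces.
    """
    return float((inactivity_days >= threshold).mean())
```

A header like this takes a couple of minutes to write, yet it answers the three questions every reader has before the first line of code: what, from what, and under which assumptions.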
Documentation in data science may not seem as important a skill as knowledge of machine learning, data visualization, and so on, but it's a powerful catalyst for all of these. After all, just because you create a fancy visual, it doesn't mean that everything in it is fully comprehensible. Perhaps there is so much to see that you need to point the reader to the key findings, which they can then verify by looking closely at the plot.

Although good code is self-explanatory, thanks to its structure and naming conventions, it's always useful to add some text around it. I'm not talking about mere comments, but about material that goes beyond the code itself. After all, the code you write is not a work of art (even if you may think so at times!) but a means to an end. That end, along with how the code achieves it, is something the readers of your code notebook shouldn't have to think about too much. It's better to make it easy for them through good documentation, allowing them to ponder the whole project, rather than having to spend all their time trying to figure out what you have done and why.

I could go on about this topic until the cows come home. However, an attribute of good documentation is brevity, which is why I'll stop right here. If you find this material of value, you can check out my various books, where I talk about topics like this in more detail. Cheers!

Open-source software is any piece of software that's open to review and edits/forks. In most cases, it's also free and under the GNU license or something equivalent, though when people refer to it as free, they often use the term as a proxy for freedom. As a result, most people refer to open-source software today as FOSS, which stands for Free and Open-Source Software. FOSS is also a movement of sorts that's taken hold since the earlier days of computing, with people like Richard Stallman, who spearheaded the GNU initiative and has been very active in promoting FOSS throughout his life. With the advent of FOSS programming languages and FOSS operating systems (such as GNU/Linux and FreeBSD), this movement grew and is now quite established across various fields that involve programming.

As you can imagine, FOSS is also quite relevant in data science and A.I., at least lately. Most data scientists and A.I. professionals today tend to use an open-source language (many of them using Python, while the more adventurous dabble with Julia, Scala, and lately even Rust), handle open-source datasets (such as those made freely available at the UCI Machine Learning repository, among many other sites), and work with open-source frameworks (such as Scikit-learn, MXNet, and TensorFlow). It's doubtful that many people get into data science with any monetary investment in the tools or the datasets they need, since it's a far better investment to spend money on educational resources, such as books and videos marketed by a technical publisher. Interestingly, these resources have more in common with FOSS than all that mediocre stuff you find on YouTube these days, labeled as educational for some reason.

FOSS in data science (and A.I. to a great extent) is largely responsible for the immense growth of this field. While back in the old days, when I was doing my Ph.D., the best way to get into analytics, particularly machine learning, was through platforms like Matlab, which comes with a relatively high price tag, nowadays you can start your data science journey without spending any money on the software you use. This way, you can develop some skills and try out the field before deciding to stick with it. Since there are more reasons to commit to data science than not to, this easy point of entry has made data science popular, and the trend is bound to continue.

Nevertheless, it's important to note some exceptions to the FOSS paradigm, which are also relevant in data science. First of all, there is Mathematica, which is probably one of the best closed-source platforms out there, not just for data science but for any field that involves numeric data. Contrary to what its name suggests, Mathematica is a broad kind of platform with its own programming language built in; it's not just about Math. Also, its latest version features A.I. tools, while the person behind this piece of software is a genius scientist who also came up with a novel model for describing the universe. Apart from Mathematica, there is also Matlab, which is still used by many learners of the craft, particularly in academia. Lately, however, its popularity has started to decline, partly because of its open-source clone, Octave, and partly because it pales when compared with modern data science and A.I. platforms that feature better performance and larger communities of users.

All in all, FOSS is paramount in data science work, partly due to the relevance of programming in this field. While new FOSS players come to our field (the most notable of which is Rust, which I covered briefly in the previous article on this blog), chances are that some of them are bound to stay. Things like the Jupyter notebook, for example, aren't going to disappear, even if other code notebooks have entered the scene lately, especially when it comes to the Julia language. In any case, if you want to learn more about the various (mostly open-source) software that populates our fascinating field, you can check out my book Data Science Mindset, Methodologies, and Misconceptions. As a bonus, you can also learn about other aspects of the data science field, such as the marvelous methodologies it features, without getting all too mathy about it! It's been a few years since I authored it, but so far, it's aged quite well, just like most of the FOSS we use in data science and A.I. work. Cheers!

The data scientist and the data analyst both deal with data analysis as their primary task, yet the two roles differ enough to warrant an entirely different set of expectations for each. They share common attributes and skills, however, making them more similar than people think. This similarity allows for a relatively straightforward transition from one role to the other, if needed, something not everyone realizes. This article explores the details of this situation and makes some suggestions as to how each role can benefit from the other.

The two roles are surprisingly similar, in ways going beyond the surface kinship (i.e., data analysis). Data scientists and data analysts deal with all kinds of data (even though text data is not standard among data analysts), often directly from databases. So, they both use SQL (or some SQL-like language) to access a database and obtain the data needed for the project at hand. Both kinds of professionals deal with cleaning and formatting the data to some extent, be it in a programming language (e.g., Python or Julia) or some specialized software (e.g., a spreadsheet program, in the case of data analysts). Also, both data scientists and data analysts deal with visuals and the presentations containing these graphics. Finally, both kinds of professionals write reports or some form of documentation for their work and share it with the project's appropriate stakeholders.
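Here is a minimal sketch of that shared workflow in Python, using the built-in sqlite3 module with pandas; the database, table, and column names are all hypothetical.

```python
import sqlite3

import pandas as pd

# Hypothetical database, table, and column names, for illustration only
conn = sqlite3.connect("company.db")
df = pd.read_sql_query(
    "SELECT order_id, customer_id, order_date, amount "
    "FROM orders WHERE order_date >= '2021-01-01'",
    conn,
)
conn.close()

# Typical cleaning steps both roles perform, whatever their main tool:
df["order_date"] = pd.to_datetime(df["order_date"])  # normalize the date format
df = df.drop_duplicates(subset="order_id")           # remove duplicate records
df = df.dropna(subset=["amount"])                    # drop rows missing the key figure
```

A data analyst might do the same three steps with filters in a spreadsheet; the logic, and the care required, are identical.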
Despite the sophistication of our field, we data scientists can learn some things from data analysts. Particularly the new generation of data scientists, coming out of bootcamps or from a programming background, has a lot to gain from these professionals. Namely, data analysts are closer to the business side of things and often have domain knowledge that data scientists don't. After all, data analysts are more versatile professionals in terms of employability, making them more prone to gathering experience in different domains. Also, data analysts tend to have more developed soft skills, particularly communication, as they have more opportunities to hone them. Learning all that can benefit any data scientist, especially those who are new to the field.

Data analysts can learn from data science professionals too. Specifically, the in-depth analyses we do as data scientists are something every analyst can undoubtedly benefit from. In particular, data engineering is the kind of work that adds a lot of value to data science projects (when it's done right) and something we don't see that much in data analytics ones. What's more, predictive modeling (e.g., using modern frameworks, such as machine learning) is found only in data science, yet it's something a data analyst can apply too. Once someone has the right mindset (aka the data science mindset), it's not too difficult to pick up these skills, particularly if they are already versed in data analytics.

If you wish to learn more about the soft skills and business-related aspects of data science, you can check out one of my relatively recent books, Data Scientist Bedside Manner. In this book, my co-author and I look into the organizations hiring data scientists, the relevant expectations, and how such a professional can work effectively and efficiently within an organization. So, check it out if you haven't already. Cheers!

The data scientist role is an incredibly important one in the world today. Be it in for-profit or non-profit organizations, it adds a lot of value, aiding decision-making. However, it's still unclear what exactly it entails and how someone can become a data scientist starting from a data analytics background.

The data scientist is a tech professional who processes data, especially complex data in large amounts (aka big data), to derive insights and build data products. The role involves gathering data, cleaning it up, combining it with other relevant data, evaluating the features involved, and building models based on them, usually to predict some variable of interest or solve some complex problem. It also involves creating insightful visuals and presenting your findings to the project stakeholders, with whom you often need to liaise throughout a data science project.

For all this work, you need to use a lot of programming and various data analysis methods, particularly machine learning. To transition to the data scientist role from the data analyst one, you need to beef up your programming skills and work on your data analysis methodologies. Learning more techniques for pre-processing data (data engineering) is also essential. What's more, you need to familiarize yourself with various ways of depicting data, such as graphs, and with how to process data in this sort of encoding. Dimensionality reduction methods are also vital for assuming the data scientist role, just like various sampling techniques. Furthermore, handling data in different formats (e.g., JSON, XML, and text) is essential, particularly in projects that deal with semi-structured data. Naturally, having some familiarity with NoSQL databases is also very important, as it goes hand-in-hand with this sort of data.
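As one example of those skills, here's a minimal dimensionality-reduction sketch using scikit-learn's PCA on synthetic data; the dataset and the 95% variance threshold are arbitrary, for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data: 10 observed features driven by only 3 latent factors
rng = np.random.default_rng(42)
latent = rng.normal(size=(200, 3))
X = latent @ rng.normal(size=(3, 10)) + 0.05 * rng.normal(size=(200, 10))

# Keep as many components as needed to explain ~95% of the variance
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)  # expect roughly (200, 3): the latent structure recovered
print(round(pca.explained_variance_ratio_.sum(), 3))
```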
Naturally, all this is just the tip of the iceberg when it comes to transitioning to data scientist from data analyst. To make sure this transition is solid enough to build a career on, you need to develop other skills and a good understanding of the complex data involved in data science projects. Being able to communicate well with other data professionals and understand them is also very important. Nowadays, you often have to work as part of a data science team, which involves a certain level of specialization. So, having such expertise is significant, at least for certain data scientist positions.

You can learn more about this topic by reading my first book on data science, namely Data Scientist: The Ultimate Guide to Becoming a Data Scientist. This book covers various topics related to the data scientist role and has a whole section dedicated to similar roles. It is also written in an easy-to-follow way, without too much technical jargon, and it has a glossary at the end. Interviews with data scientists of various levels help clarify the role's details and what it looks like on a practical level. So, check it out when you have a moment. Cheers!

A data product is the main deliverable of data science and some data analytics projects. It involves developing a stand-alone piece of software, often with a data model under the hood. Other times, it takes the form of a set of visualizations that depict particular variables of interest or other useful insights. In any case, data products are vital, as they constitute an essential part of a data science project and a useful deliverable in a data analytics project (even if it's not always a requirement).

Dashboards are a kind of data product, featuring graphics and an intuitive (albeit minimalist) interface. They sometimes involve a control element that enables the user to change some settings and adjust the related graphics to different operating conditions. This element provides a more dynamic aspect to the dashboard, augmenting its innate dynamism. The latter stems from the fact that dashboards are usually linked to a dataset that changes over time, as new data becomes available.
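A real dashboard would typically live in a BI tool or a web framework, but as a minimal sketch of that control-element idea, here is a matplotlib Slider adjusting a plot's trailing window; the data is synthetic, for illustration only.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.widgets import Slider

# Synthetic stand-in for a dataset that gets refreshed over time
rng = np.random.default_rng(0)
daily_sales = rng.normal(100, 15, size=365)

fig, ax = plt.subplots()
fig.subplots_adjust(bottom=0.25)  # leave room for the control element
line, = ax.plot(daily_sales[-30:])
ax.set_title("Daily sales, trailing window (synthetic data)")

# The control element: a slider selecting how many trailing days to show
slider_ax = fig.add_axes([0.2, 0.1, 0.6, 0.03])
window = Slider(slider_ax, "Days", 7, 365, valinit=30, valstep=1)

def update(_):
    n = int(window.val)
    line.set_data(np.arange(n), daily_sales[-n:])  # redraw the selected window
    ax.relim()
    ax.autoscale_view()
    fig.canvas.draw_idle()

window.on_changed(update)
plt.show()
```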
The popularity of dashboards illustrates the value of data visualization, be it in data science or data analytics. It's hard to imagine a project like this without some visuals pinpointing important insights and other findings. Additionally, whenever predictive models are involved, specialized visuals for showcasing the models' performance are a must. That's why data visualization as a sub-field of data science and data analytics has grown, especially in the past few years. The development of professional software undertaking such tasks, as well as of specialized libraries in various programming languages, has contributed to this growth.

Beyond data visualization, however, other subtle aspects of the data science and data analytics fields are essential but less pronounced in the various educational material out there. For example, the communication of insights and the use of the visuals mentioned earlier in presentations is something every data professional ought to know. This point is particularly important when you need to liaise with non-technical people, whether colleagues or clients. Also, managing a data analytics project can be challenging, especially in the modern Agile-driven workplace. After all, most data analytics projects today are all about teamwork, tight deadlines, and changing requirements. What's more, although a dashboard is a powerful asset in an organization, it needs to be maintained periodically and fed good-quality data. The latter requires additional work and proper data governance, something not everyone involved in this field is aware of, unfortunately.

My Data Scientist Bedside Manner book, which I co-authored last year, is an excellent resource for this kind of topic. Although written mainly for data science professionals, it can be useful to all sorts of data analysts and people involved in data-driven projects (e.g., managers). The idea is to bridge the gap between technical and non-technical professionals in an organization and leverage data analytics work effectively. This is an excellent reference book that every data professional can benefit from in the years to come. Cheers!