Hi everyone. Since I've been exploring a different avenue for data science education lately, I've put together another webinar, which is just three weeks away (May 18th). If you are interested in AI, be it as a data science professional or as a stakeholder in data science projects, this is something that can add value for you. You'll also have the chance to ask me questions directly and, if time allows, even have a short discussion on the topic.
Note that, due to the success of previous webinars on the Technics Publications platform, the price of each webinar has gone up. However, this upcoming webinar, originally designed as a talk for an international conference in Germany, is still at the very accessible price of $14.99. Feel free to check it out here and spread the word to friends or colleagues. You can also learn about the other webinars the platform offers through the corresponding web page. Cheers!
I haven't had a chance to prepare an article for my blog lately. Between helping out a friend of mine and preparing for my webinar this Thursday, I didn't have the headspace to write anything. Nevertheless, one of the articles I wrote for my friend's initiative, related to mentoring, is now available on Medium. Feel free to check it out!
As for the webinar, it's about the data science mindset, a topic I've covered in all of my books, particularly Data Science Mindset, Methodologies, and Misconceptions. At the time of this writing, there are still some spots available, so if you are interested, feel free to register for it here.
On another note, my latest book is almost ready for the review stage, so I'll be working on that come Friday. Stay tuned for more details in the weeks to come...
That's all for now. I hope you have a great week. Stay healthy and positive!
With more and more people getting into data science and AI these days, certain aspects of the field are inevitably over-emphasized while others are neglected. Naturally, those providing the corresponding know-how are not professional educators, even if they are competent practitioners and very knowledgeable people. As a result, a lot of emphasis is placed on the technical aspects, such as math- and programming-related skills, data visualization, etc. What about domain knowledge, though? Where does it fit in the whole picture?
Domain knowledge is all the knowledge that is specific to the domain data science or AI is applied to. If you are in the finance industry, it involves economic theory as well as how certain econometric indexes come into play. In the epidemiology sector, it involves some knowledge of how viruses come about, how they propagate, and their effects on the organisms they exploit. Even though domain knowledge is specialized, it can play an important role in many cases. How much exactly depends on the problem at hand, as well as how deep the data scientist or AI practitioner wants to go into the subject.
Domain knowledge may also include certain business-related aspects that factor into data science work. Understanding the role of the different individuals who participate in a project is very important, especially if you are tackling a problem that is too complex for data professionals alone. Oftentimes, in projects like this, subject matter experts (SMEs) are brought in, and as a data scientist or AI professional you need to liaise with them. This is not always easy, as there is limited common ground to use as a frame of reference. That's where some general-purpose business knowledge comes in handy.
Naturally, incorporating domain knowledge in a data science project is a challenge in and of itself. Even if you do have this non-technical knowledge, you need to find ways to include it in the project organically, so that it adds value to your analysis. That's why certain questions, particularly the high-level questions the stakeholders want answered, are so important. Pairing these questions with other, more low-level questions that have to do with the data at hand is crucial. Part of being a holistic, well-rounded data science / AI professional involves being able to accomplish this.
Of course, exploring this vast topic in a single blog post, or even several, isn't practical. Besides, how deep can someone go into this subject before it becomes difficult to read, especially if you are accessing this blog via a mobile device? For this reason, my co-author and I have gathered all the material we have accumulated on this topic and put it into a more refined form, namely a technical book. We are now at the final stages of this book, which is titled “Data Scientist Bedside Manner” and is published by Technics Publications. The book should be available before the end of the season. Stay tuned for more details...
Webinars have been a valuable educational resource for years now, but only recently has the potential of this technology been appreciated so much. This is largely due to the Covid-19 situation, which has made conventional conferences a no-no. Also, the low cost of webinars, coupled with the ecological advantage they have over their physical counterparts, makes them a great alternative.
At a time when video-based content is abundant, it's easy to find something to watch and potentially educate yourself with. However, if you want quality content and value your time more than the easy accessibility of free material, it's worth exploring the webinar option. Besides, nowadays the technology is more affordable than ever before, making it a high-ROI endeavor. As a bonus, you get to ask the presenter questions and do a bit of networking too.
How does all this fit with data science, though, and why is it part of this blog? Well, although webinars are good in general, they are particularly useful in data science, as the latter is a hot topic. Because it's such a popular subject, data science has attracted all sorts of opportunists who brand themselves as data scientists just to make a quick buck. These people tend to create all sorts of content that is low-veracity information at best (and a scam at worst). Since discerning legitimate content from click-bait can sometimes be difficult (these con artists have become pretty good at what they do), it makes sense to pursue reputable sources for this video content. One such source is the Technics Publications platform, which has recently started providing its own video content in the form of webinars. Although most of these webinars are on data modeling, a couple of them are on data science topics (ahem). Feel free to check them out!
Disclaimer: I have a direct monetary benefit from promoting these data science webinars. However, I do so after ensuring I put a lot of work into preparing them, the same amount of work I'd put into preparing for a physical conference, like Customer Identity World or Data Modeling Zone. The only difference is the medium through which the content is delivered.
What’s a Transductive Model?
A transductive model is a predictive analytics model that makes use of distances or similarities. Contrary to inferential models, which use induction and deduction to make their predictions, transductive models tend to be more direct. Oftentimes, they don't even have a training phase as such; the model “learns” as it performs its predictions on the testing set. Transductive models generally fall under the machine learning umbrella and, so far, they have always been opaque (black boxes).
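To make the “no training phase” point more concrete, here is a minimal sketch of a transductive predictor in Python: a bare-bones kNN classifier that merely keeps the reference data around and does all of its work at prediction time, via distances. The function name and the toy data are illustrative, not taken from any particular library.

```python
import numpy as np

def knn_predict(X_ref, y_ref, X_new, k=3):
    """Transductive prediction: there is no training step; all the work
    happens at prediction time, using distances to the reference data."""
    predictions = []
    for x in X_new:
        dists = np.sqrt(((X_ref - x) ** 2).sum(axis=1))   # Euclidean distances
        nearest = np.argsort(dists)[:k]                    # k closest reference points
        labels, counts = np.unique(y_ref[nearest], return_counts=True)
        predictions.append(labels[np.argmax(counts)])      # majority vote
    return np.array(predictions)

# Toy data: two obvious groups
X_ref = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
y_ref = np.array([0, 0, 1, 1])
print(knn_predict(X_ref, y_ref, np.array([[1.0, 1.0], [5.0, 5.0]]), k=3))  # -> [0 1]
```

Notice that nothing gets “fitted” here; the reference data itself is the model, which is the essence of the transductive approach.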
What’s Transparency in a Predictive Analytics Model?
Transparency is an umbrella term for a model's capacity to offer a clear understanding of how it makes its predictions and/or how to interpret its results. Statistical models boast transparency since they are simple enough to understand and explain (but not simplistic). Transparency is valued greatly, particularly when it comes to business decisions that rely on the outputs of a predictive model. For example, if you decide to let an employee go, you want to be able to explain why, be it to your manager, to your team, or to the employee himself.
Transparent kNN sounds like an oxymoron, partly because the basic algorithm itself is a moron: it's very hard to think of a simpler, more basic algorithm in machine learning. This, however, hasn't stopped people from using it again and again due to the high speed it exhibits, particularly on smaller datasets. Still, kNN has been a black box so far, despite its many variants, some of which are quite ingenious.
Lately, I've been experimenting with distances and with how they can be broken down into their fundamental components. As a result, I managed to develop a method for building a distance metric that is transparent by design. By employing this metric in the kNN model, and by applying some tweaks in various parts of it, the transparent version of kNN came about.
In particular, this transparent kNN model yields not only its predictions for the data at hand but also a confidence metric (akin to a probability score for each of its predictions) and a weight matrix consisting of the weight each feature has in each of its predictions. Naturally, as kNN is a model used in both classification and regression, all of the above are available in either of its modalities. On top of that, the system can identify which modality to use based on the target variable of the training set.
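Since the actual metric and model aren't described here, the following is only a hypothetical sketch of the general idea: decomposing a simple (Manhattan) distance into per-feature contributions and using those contributions to produce, alongside the prediction, a crude confidence score and per-feature weights. It is emphatically not the method mentioned above, just an illustration of what such outputs could look like.

```python
import numpy as np

def transparent_knn_sketch(X_ref, y_ref, x_new, k=3):
    """Illustrative only: break Manhattan distances into per-feature
    contributions and use them to report a prediction, a crude confidence
    score, and per-feature weights. NOT the actual transparent kNN method."""
    per_feature = np.abs(X_ref - x_new)        # per-feature differences (n_ref x n_features)
    dists = per_feature.sum(axis=1)            # Manhattan distance to each reference point
    nearest = np.argsort(dists)[:k]            # indices of the k nearest neighbors

    labels, counts = np.unique(y_ref[nearest], return_counts=True)
    prediction = labels[np.argmax(counts)]     # majority vote
    confidence = counts.max() / k              # share of neighbors agreeing with the vote

    # Feature "weights" (illustrative): invert each feature's mean contribution,
    # so features on which the neighbors are closest to the new point weigh more
    inv = 1.0 / (per_feature[nearest].mean(axis=0) + 1e-9)
    weights = inv / inv.sum()

    return prediction, confidence, weights

X_ref = np.array([[1.0, 10.0], [1.2, 50.0], [5.0, 12.0], [5.2, 48.0]])
y_ref = np.array([0, 0, 1, 1])
print(transparent_knn_sketch(X_ref, y_ref, np.array([1.1, 11.0]), k=3))
```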
For now, I’ll probably continue with other, more useful matters, such as feature fusion. After all, just like most machine learning models, kNN is at the mercy of the features it is given. If I were in academic research, I’d probably write a series of papers on this topic, but as I work solo on these endeavors, I need to prioritize. However, for anyone interested in learning more about this, I’m happy to reply to any queries through this blog. Cheers!
(image by Arek Socha, available at pixabay)
Lately, I've been working on the final parts of my latest book, which is contracted for the end of Spring this year. As this is probably going to be my last technical book for the foreseeable future, I'd like to put my best into it, given the available resources of time and energy. This is one of the reasons I haven't been very active on this blog as of late. In this book (whose details I'm going to reveal once it's at the printing press), I examine various aspects of data science in a quite hands-on way. One of these aspects, which I often talk about with my mentees, is scaling.
Scaling is very important in data science projects, particularly those involving distance-based metrics. Although the latter may seem a bit niche from a modern standpoint, where A.I.-based systems are often the go-to option, there is still a lot of value in distances, as they are usually the prima materia of almost all similarity metrics. Similarity-based systems, aka transductive systems, remain quite popular even in this era of A.I.-based models. This is particularly the case in clustering problems, where both the clustering algorithms and the evaluation metrics (e.g. the Silhouette score/width) rely on distances for evaluating cluster affinity. Also, certain dimensionality reduction methods, like Principal Component Analysis (PCA), often require a certain kind of scaling to function optimally.
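As a quick illustration of why scaling matters for anything distance-based (the numbers below are made up), consider how a feature living on a large scale can dominate a Euclidean distance:

```python
import numpy as np

def euclid(p, q):
    return np.sqrt(((p - q) ** 2).sum())

# Two features on very different scales: age (years) and income (dollars)
a = np.array([30.0, 50_000.0])
b = np.array([60.0, 51_000.0])   # very different age, similar income
c = np.array([31.0, 80_000.0])   # almost the same age, very different income

# Without scaling, the income axis dominates completely:
print(euclid(a, b))   # roughly 1000  -> a and b look "close"
print(euclid(a, c))   # roughly 30000 -> a and c look "far"
# Any distance-based method (kNN, k-means, the Silhouette score, etc.) built on
# these raw values would effectively ignore age; scaling the features first fixes this.
```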
Scaling is not as simple as it may first seem. After all, it greatly depends on the application as well as on the data itself (something not everyone is aware of, since the way scaling/normalization is treated in data science educational material is somewhat superficial). For example, you can have a fixed-range scaling process or a fixed-center one. You can even have both a fixed range and a fixed center at the same time if you wish, though that's not something you'd normally see anywhere. Fixed-range scaling usually targets the [0, 1] interval and involves scaling the data so that its range is constant. The center point of that data (usually measured with the arithmetic mean/average), however, could be distorted; how much so depends on the structure of the data. As for fixed-center scaling, this ensures that the center of the scaled variable is a given value, usually 0. In many cases, the spread of the scaled data is fixed too, usually by setting the standard deviation to 1.
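Here is a minimal sketch of the two flavors in Python, using a made-up variable x; the first keeps the range fixed to [0, 1], the second fixes the center at 0 and the standard deviation at 1:

```python
import numpy as np

x = np.array([2.0, 5.0, 9.0, 14.0, 40.0])   # a made-up variable

# Fixed-range scaling: the result always spans [0, 1],
# but its center (mean) lands wherever the data dictates
x_fixed_range = (x - x.min()) / (x.max() - x.min())
print(x_fixed_range, x_fixed_range.mean())

# Fixed-center scaling (standardization): mean 0 and standard deviation 1,
# but the range of the result is not fixed
x_fixed_center = (x - x.mean()) / x.std()
print(x_fixed_center, x_fixed_center.min(), x_fixed_center.max())
```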
Programmatic methods for performing scaling vary, perhaps more than the Stats educators would have you think. For example, for fixed-range scaling, you could use min-max normalization (aka 0-1 normalization, a term that is both vague and indicative of a limited understanding of the topic), or you could use a non-linear function that is also bounded by these values. The advantage of the latter is that you can mitigate the effect of any outliers without having to eradicate them, all through the use of good old-fashioned Math! Naturally, most Stats educators shy away at the mention of the word non-linear, since they like to keep things simple (perhaps too simple), so don't expect to learn about this kind of fixed-range scaling in a Stats book.
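The text doesn't specify which non-linear function is meant, but a logistic (sigmoid) transform centered on the median, with a robust spread such as the interquartile range, is one way to obtain a bounded, outlier-tolerant scaling; here is a sketch under that assumption:

```python
import numpy as np

def sigmoid_scale(x):
    """Non-linear, fixed-range scaling (one possible choice, an assumption here):
    map x into (0, 1) with a logistic curve centered on the median and spread
    by the interquartile range, so extreme values get squashed instead of
    stretching the whole scale."""
    x = np.asarray(x, dtype=float)
    center = np.median(x)
    spread = np.percentile(x, 75) - np.percentile(x, 25)
    if spread == 0:
        spread = 1.0   # avoid division by zero for (nearly) constant data
    return 1.0 / (1.0 + np.exp(-(x - center) / spread))

x = np.array([2.0, 5.0, 9.0, 14.0, 400.0])    # one extreme outlier
print((x - x.min()) / (x.max() - x.min()))    # min-max: the outlier flattens everything else
print(sigmoid_scale(x))                       # sigmoid: non-outliers remain distinguishable
```

The point is not this particular transform but the fact that any bounded, monotonic function can serve as a fixed-range scaler, with the non-linearity acting as a built-in outlier dampener.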
All in all, scaling is something worth keeping in mind when dealing with data, particularly when using a distance-based method or a dimensionality reduction process like PCA. Naturally, there is more to the topic than meets the eye; plus, as a process, it's not as basic as it may seem through the lens of package documentation or a Stats book. Whatever the case, it's something worth utilizing, always in tandem with other data engineering tools, to ensure a better-quality data science project.
Hello everyone and happy new year! I hope you all had a good holiday break. I've thought about it quite a bit and I've decided to go in a different direction with the videos I make this year, as I plan to focus more on courses. Stay tuned for more news on this matter in the weeks to come...
Just wanted to wish you all Happy Holidays! It's been a great year and I appreciate your support through this blog. I won't be posting anything new in the next couple of weeks as I'll be traveling. Feel free to check out some of my older posts, though.
I hope your holidays are insightful, inspirational, and intriguing!
Lately, I've made some progress on a data science research project I've been working on for the past couple of years. I've hinted at it in previous posts, though due to the nature of this work I've abstained from going into any details. Besides, most people are not that open to new ideas unless they are marketed by some established company or some renowned professor.
Anyway, the other day I made a breakthrough in this work, something that can have significant implications for how we deal with private data. What's more, I've developed a new way of summarizing a dataset (which is innately different from sampling it) with minimal loss of information. This opens new avenues of research, and the possibilities for new data science and A.I. methods are vast. Naturally, I'll need to look into this more, so any online writing I do will have to take second priority.
In parallel, I've been working on another project lately, something I plan to continue for the foreseeable future. However, an important part of it is now complete, which I'll make sure to announce in the next few days.
As a result of all this, I'm now more open to hosting other people's articles on data science and A.I. topics, provided they are not spammy in any way. Back-links are also acceptable, provided they point to sites relevant to the articles. So, if you have something you'd like to contribute to the blog, now is a great opportunity to do so.
Whatever the case, I plan to continue writing on this blog albeit at a slower pace for the time being, so stay tuned!
Throughout this blog, I've talked about all sorts of problems and how solving them can aid one's data science acumen as well as the development of the data science mindset. Problem-solving skills rank high when it comes to the soft skills aspect of our craft, something I also mentioned in my latest video on O'Reilly. However, I haven't talked much about how you can hone this ability.
Enter Brilliant, a portal for all sorts of STEM-related courses and puzzles that can help you develop problem-solving skills, among other things. If you have even a vague interest in Math and the positive Sciences, Brilliant can help you grow that interest into a passion, and even a skill-set, in these disciplines. The most intriguing thing about all this is that it does so in a fun and engaging way.
Naturally, most of the stuff Brilliant offers comes with a price tag (if it didn't, I would be concerned!). However, the cost of using the resources this site offers is quite reasonable and overall good value for money. The best part is that by signing up there you can also help me cover some of the expenses of this blog, as long as you use this link: www.brilliant.org/fds (FDS stands for Foxy Data Science, by the way). Also, if you are among the first 200 people to sign up, you'll get a 20% discount, so time is definitely of the essence!
Note that I normally don't promote anything on this blog unless I'm certain about its quality. Also, out of respect for your time, I refrain from posting any ads on the site. So, whenever I post something like this affiliate link, I do so after careful consideration, opting for the best way to raise some revenue for the site while providing you with something useful and relevant to it. I hope that you view this initiative the same way.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.