Throughout this blog, I've talked about all sorts of problems and how solving them can aid one's data science acumen as well as the development of the data science mindset. Problem-solving ranks high among the soft skills of our craft, something I also mentioned in my latest video on O'Reilly. However, I haven't talked much about how you can hone this ability.
Enter Brilliant, a portal for all sorts of STEM-related courses and puzzles that can help you develop problem-solving, among other things. If you have even a vague interest in Math and the positive Sciences, Brilliant can help you grow this into a passion and even a skill-set in these disciplines. The most intriguing thing about all this is that it does so in a fun and engaging way.
Naturally, most of the stuff Brilliant offers comes with a price tag (if it didn't, I would be concerned!). However, the cost of using the resources this site offers is quite reasonable and overall good value for money. The best part is that by signing up there you can also help me cover some of the expenses of this blog, as long as you use this link here: www.brilliant.org/fds (FDS stands for Foxy Data Science, by the way). Also, if you are among the first 200 people to sign up, you'll get a 20% discount, so time is definitely of the essence!
Note that I normally don't promote anything on this blog unless I'm certain about its quality standard. Also, out of respect for your time, I refrain from posting any ads on the site. So, whenever I post something like this affiliate link, I do so after careful consideration, aiming to find the best way to raise some revenue for the site while providing you with something useful and relevant to it. I hope that you view this initiative the same way.
Translinearity is a superset of linearity that also includes, in a meaningful manner, what is not linear. In data analytics, it includes all connections among data points and variables that make sense while maintaining robustness (i.e. avoiding any kind of over-fitting). Although fairly abstract, it is in essence what has brought about most modern fields of science, including Relativistic Physics. Naturally, when modeled appropriately, it can have an equally groundbreaking effect in all kinds of data analytics processes, including all the statistical ones as well as some machine learning processes. Effectively, a framework based on translinearity can bridge the different aspects of data science processes into a unified whole, where everything is sophisticated enough to be considered A.I. related while at the same time as transparent as statistical models are.
Translinearity matters because we have reached the limits of what the linear approach has to offer through Statistics, Linear Algebra, etc. Also, non-linear approaches, although effective and accessible, are black boxes, and they may remain so for the foreseeable future. Finally, the translinear approach can unveil aspects of the data that are inaccessible with the conventional methods at our disposal, while it can help cultivate a more holistic and more intuitive mindset, benefiting data scientists as much as the projects it is applied to.
So far, I have implemented Translinearity in the Julia ecosystem. This is something I've been working on for the past decade or so. I have reason to believe that it is more than just a novelty, as I have observed various artifacts concerning some of its methods, things that were previously considered impossible. Examples include optimal binning of multi-dimensional data, a metric that can assess the similarity of data points in high-dimensional space, and a new kind of normalization method that combines the benefits of the two existing ones (min-max and mean-std normalization, aka standardization).
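For reference, the two existing normalization methods mentioned above can be sketched in a few lines. This is only an illustration of min-max normalization and standardization as they are commonly defined, not the new method from the framework (which is not described here). It is written in Python rather than Julia, and the function names `min_max_normalize` and `standardize` are hypothetical ones chosen for this example.

```python
from statistics import mean, pstdev


def min_max_normalize(values):
    """Rescale values linearly so that they fall in the [0, 1] interval."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant variable carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def standardize(values):
    """Shift to zero mean and scale to unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Illustrative data with one extreme value (made up for this example).
sample = [2.0, 4.0, 6.0, 8.0, 100.0]
print(min_max_normalize(sample))
print(standardize(sample))
```

With an extreme value in the sample, min-max squeezes the remaining points toward 0, whereas standardization keeps a fixed spread but has no fixed range. These are the kinds of trade-offs a combined method would presumably try to balance.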
Translinearity is made applicable through the systematic and meticulous development of a new data analytics framework, rooted in the principles and completely void of assumptions about the data. Everything in the data is discovered based on the data itself and is fully parametrized in the corresponding functions. Also, all the functions are optimized and build on each other. A bit more than 30 in total, the main methods of this model cover all the fundamentals of data analytics and open the way to the development of predictive analytics models too.
Translinearity opens new roads in data analytics, rendering conventional approaches more or less obsolete. However, the key outcome of this new paradigm of data analytics is the possibility of a new kind of A.I. that is transparent and comprehensible, not merely comprehensive in terms of application domains. Translinearity is employed in the more advanced deep learning systems, but it’s so well hidden that it escapes the user. However, if an A.I. system is built from the ground up using translinear principles, it can maintain transparency and flexibility alongside high performance.
It's interesting that, even though there are a zillion ways to assess the similarity between two vectors (each representing a single-dimensional data sample), the metrics available for doing the same thing with matrices (each representing a whole sample of data) are mediocre at best. It's really strange that when it comes to clustering, for example, where this is an important part of the whole process, we often revert to crude metrics like Silhouette Width to figure out if the clusters are similar enough or not. What if there were a way to assess similarity more scientifically, beyond such amateur heuristics?
Well, fortunately, there is a way, at least as of late. Enter the Congruency concept. This is basically the idea that you can explore the similarity of two n-dimensional samples through the systematic analysis of their components, given that the latter are orthogonal. If they are not orthogonal, it shouldn't be difficult to make them orthogonal, without any loss of information. Whatever the case, it's important to avoid any strong relationships among the variables involved as this can skew the whole process of assessing similarity.
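To make the orthogonalization step a bit more concrete, here is a minimal sketch of one way it could be done. The text above doesn't specify a method, so this is an assumption: it applies the Gram-Schmidt process to mean-centered variables, in Python rather than Julia, with hypothetical helper names (`center`, `dot`, `orthogonalize`). It is not the preprocessing the Congruency metric actually uses.

```python
def center(column):
    """Subtract the mean so that orthogonal columns are also uncorrelated."""
    m = sum(column) / len(column)
    return [x - m for x in column]


def dot(u, v):
    """Inner product of two equally long sequences."""
    return sum(a * b for a, b in zip(u, v))


def orthogonalize(columns, tol=1e-12):
    """Gram-Schmidt on mean-centered columns (one list per variable).

    Each output column is the centered input column minus its projections
    onto the previously produced columns. A column that is a linear
    combination of earlier ones comes out as (numerically) all zeros.
    """
    result = []
    for col in columns:
        w = center(col)
        for q in result:
            qq = dot(q, q)
            if qq > tol:  # skip directions that collapsed to zero
                coef = dot(w, q) / qq
                w = [wi - coef * qi for wi, qi in zip(w, q)]
        result.append(w)
    return result


# Three illustrative variables (made-up numbers), the second strongly
# related to the first.
variables = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 9.0],
    [1.0, 0.0, 1.0, 0.0],
]
for column in orthogonalize(variables):
    print(column)
```

The transformation only removes linear relationships, and it can be reversed as long as the column means and projection coefficients are kept, which is the sense in which no information is lost here. Principal component analysis with all components retained would be another common option.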
The Congruency concept is something I came up with a few months ago, but it wasn't until recently that I managed to implement it in a reliable and scalable way, using a new framework I've developed in Julia. The metric takes the two matrices as inputs and yields a float between 0 and 1 as its output. The larger this number is, the more similar (congruent) the two matrices are. Naturally, the metric was designed to be robust regardless of the dimensionality of the data, though if there are a lot of noisy variables, they are bound to distort the result. That's why it performs some preprocessing first to ensure that the variables are independent and as useful as possible.
Applications of the Congruency metric (which I named dcon) go beyond clustering, however. Namely, it can be used in assessing sampling algorithms too (usually yielding values of 0.95+ for a reliable sample) as well as synthetic data generation. Since the metric doesn't make any assumptions about the data, it can be used with all kinds of data, not just those following a particular set of distributions. Also, as it doesn't make use of all dimensions simultaneously, it is possible to avoid the curse of dimensionality altogether.
Things like Congruency may seem like ambitious heuristics, and few people would trust them when the more established statistical heuristics exist as an option. However, there comes a time when a data scientist starts to question whether the age of a statistic or metric is sufficient for establishing its usefulness. After all, what is now old and established was once new and experimental, let's not forget that...
Being a data science author is not a simple matter. With the bookshelves brimming with data science books these days, one may come to think of this as being something easy and accessible to everyone. Perhaps the latter is true, since nowadays anyone can publish a data science book through some publisher with very low standards or self-publish it, thanks to Amazon and other sites that are happy to make your book available to everyone. Some people stoop so low as to give away their book for free, something that says more about the quality of their book than it does about their generosity (of course there are exceptions to this, as many academics prefer this approach because academic publishers make their books inaccessible to most of their students through the high price tag they put on them). Whatever the case, being a data science author involves more than just putting a book out there for the world to view and perhaps read.
In my experience over the past 10 years or so, authoring a book is quite different to just writing one and making it accessible to the public. Authoring a book is all about providing a certain level of quality and going through the oftentimes exhausting process of revisions and edits once the first draft is completed. Fortunately, the first book I authored was on something I had spent 5 years working on, namely my PhD project. The book was my PhD thesis, which is much like a normal technical book, though geared towards a more limited audience.
Other books I've authored were mostly through a publisher, except for some ebooks and a novel ("I, AGI: the adventures of an advanced Artificial Intelligence"). Every time it was a challenge of sorts, though one through which I could grow as a writer. Here is a list of the things I learned that are necessary to author a book:
Beyond these, several other things are necessary for authoring a book, perhaps too many to list in a blog article. However, for anyone serious about writing, these are a good place to start. Cheers!
These days I was on vacation (this image should give you a hint!), so no post this week unfortunately... However, as of next week (or even later this week, depending on my workload), I should have something for you. In the meantime, you can check out some of my older posts. Until next time!
These days I'm working feverishly on a book project so there is no time for any new data science / A.I. related post here. If you want something else to read, feel free to check my articles on beBee, such as the latest one, available here. Parallel to all this, I'm preparing another educational project, something I'll talk more about later on. Stay tuned!
So, recently I decided to make a video on this topic, based on some things I've observed in data science candidates. The hope is that this may help them and anyone else who may be looking into becoming a more holistic data scientist, instead of just a data science technician. The video I made is now available online on O'Reilly and although it's a bit longer than others I've made (not counting the quiz ones), it's fairly easy to follow. Enjoy!
Everyone wants to do business, especially when it comes to data science. The more someone is aware of the merits of this field and the value it can bring, the keener that person usually is. Whether it is for a hands-on project or something more high level, the wish to do a collaborative project is bound to rise the more they get to know you and what you can do for them. However, just because you can work with someone on a potentially interesting and lucrative project, it doesn't mean that you should. Namely, there are certain red flags you ought to be aware of, which once spotted should make you rethink the whole endeavor.
First of all, watch for a lack of organization when it comes to the first meeting (and the ones that may follow). Many people want to meet but they often lack the basics of organizing a meeting. Sometimes the time is vague (e.g. they set up a day but not a clear time) or the place is unclear (e.g. there is agreement about using a VoIP system but there is no mention of which system or which room, as in the case of Zoom). If your potential client fails to provide such crucial information, they are probably still new to doing business and there are bound to be other discrepancies down the line.
What’s more, the lack of clear objectives is something to be wary of. Some people want to do wonders with data science (esp. when A.I. is also leveraged) but they have no idea how. There are no clear objectives or deadlines, and the whole project feels more like a plan drafted by a 5-year-old. Situations like this spell trouble since no matter how hard you work, they won’t be satisfied by your deliverables.
Moreover, be wary when someone doesn’t have a solid understanding of the field and has irrational expectations because of this. This ties into the previous point, since the lack of clear objectives often stems from the lack of a solid understanding of what data science is and what it can do. With a perception tainted by the hype of data science and A.I., the client may be unaware of what is feasible and what isn't, leading to a very unrealistic set of expectations that, no matter how good you are, you are unlikely to be able to meet.
Furthermore, the lack of access to the actual data is a serious issue for a data science project. If I had a dime for every time I encountered this situation, I wouldn't need to work anymore! Yes, many people may have a clear plan and a solid understanding of data science but the data is not there. Sometimes they do have it but it is inaccessible and you have to go through miles of red tape just to get a glimpse of it. Cybersecurity and privacy processes are something completely unknown to clients like this, and they are overly protective of the data they have, granting you access to it only after you have signed a contract. However, embarking on a data science project without some exploratory data analysis first is like asking for trouble, but they don't usually understand that either.
Finally, if the paperwork is not properly handled (contracts, NDAs, etc.) that’s a big red flag. This is the other extreme, whereby the client is very open about everything but has no idea of how the world works and doesn't bother with NDAs, formal contracts, etc. This way, if there are issues (something quite likely) you are screwed, since there are no legal guarantees for the whole project, making any pending payments as likely to become actual revenue as a lottery ticket! Also, the ownership of the IP involved in such a project can become a nightmare.
Note that all these are red flags I’ve experienced myself so this list is by no means complete. Hopefully, it can give you an idea of things to look out for, ensuring that your data science expertise is not exploited or wasted in projects that are not likely to yield any benefit for you.
Alright, the quiz video fever is over for the time being, so I'm back to making conventional data science videos. This latest one on APIs, for example, just got published on O'Reilly. It's more technical than others, but very useful, particularly if you already know a few things about data science. Anyway, I hope you enjoy it!
Note that although you can view the list of videos and books on O'Reilly's learning platform, you need to have a valid account in order to view them in their entirety. A pretty good investment, if you ask me, but before you commit to a monthly or a yearly subscription, you can always have a trial one which lasts for 10 days. Cheers!
So, the 7th quiz video I've created is finally online on O'Reilly. This is the longest one so far, spanning over 51 minutes, meaning there are lots of explanations for the various questions. It covers a bunch of topics, such as A/B testing, ANOVA, and various statistical tests. I put a lot of thought into this, much like you'd put a lot of thought into designing a data science experiment. Hopefully, you'll find it as useful and enjoyable as I did.
Note that, just like other videos published on O'Reilly, you'll need to have an active account (even if it's a trial one) in order to view it in its entirety. As a bonus, you'll be able to view other videos as well as books available on that platform. Enjoy!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.