In the previous post (not counting the webinars one, which was more of an announcement), I talked a bit about a new high-level model of scientific knowledge. However, I didn't talk much about its evolution, since that would make for a very long article (or even a book!). In this article, I'll look into some additional parts of this model and how it can help us understand the evolution of scientific knowledge. All this is closely tied to the data science mindset since, at its core, data science is science applied to real-world problems. In the previous article, we covered research, fidelity, and application as the key aspects of scientific knowledge, and how the three of them are closely linked to a fourth one: scope. But how do all these relate to the scientist and her work? Let's find out.
So, if you recall, the aforementioned factors can be visualized in the schematic we saw in the previous post.
But what lies in the middle of this? What's at the heart of scientific knowledge? If you guessed the scientific method, you are right. After all, scientific knowledge doesn't grow on trees (with the exception of that apple tree under which Newton was resting, perhaps). The scientific method is at the core since it binds research, fidelity, and even application to some extent. When an engineer (or the scientist herself) explores a new theory and tests its validity, she makes use of the scientific method. Without it, she could still argue for or against the theory, but the result would be more of a philosophical treatise than anything else. Naturally, philosophy has value too, especially when it is a practical kind of philosophy, like that of the Stoics. However, in science, we are more interested in things that can be expressed as mathematical formulas and tested rigorously through various data analytics tools, such as Statistics. The scientific method also constitutes the mindset of the scientist, something very important across different disciplines.
Now, if we were to explore this further, going beyond the plane of all the aforementioned aspects of scientific knowledge, we'd find (at least) two more aspects closely related to all this: understanding and vision, both of which have to do primarily with the scientist. Understanding involves how deep we go into the ideas that scientific knowledge entails. It is not just rational, though, since it involves our intuition too. Understanding is like the roots of a tree, grounding scientific knowledge in something beyond the data and making the scientific theory we delve into something potentially imbued with enthusiasm. When you hear some scientists talk about their inventions, for example, you can almost feel that. Few scientists get passionate about math formulas per se, but when it comes to their understanding of the scientific knowledge they have worked on, they can get quite passionate for sure!
In the other direction we have vision, which has to do with what we imagine about the scientific knowledge, be it its applicability, its extensions, or even the questions it may raise. The latter may bring about additional scientific projects, evolving the knowledge (and understanding) further. That's why it makes sense to visualize vision as an upward vector. Likewise, since we talk about understanding going deep, we'd visualize it as a downward vector. Naturally, we'd expect these two to be correlated to some extent, since deeper understanding makes for loftier visions regarding the scientific knowledge we explore. These two aspects also highlight the evolutionary nature of scientific knowledge, rendering it highly dynamic and adaptive.
Hopefully, this article has shed some light on this intriguing topic. It may be a bit abstract but scientific knowledge is like this, at least until it manifests as technology. Feel free to share your thoughts on this topic through this blog. Cheers!
Scientific knowledge is a greatly misunderstood matter, especially today. As we are bombarded with scientific innovations regularly and see scientists featured in various media, or even have films made about them (e.g. the classic "A Beautiful Mind" and "The Theory of Everything"), we may conclude that science is easy, or that anyone with enough determination and some intelligence can make it in the scientific world. However, science is anything but easy, while mental prowess and willpower, although they play an important role in all this, are not the best predictors of success. Sometimes, people are just at the right place at the right time (as in the case of Einstein). In any case, to get a better understanding of all this, let's break scientific knowledge down into its fundamental components through a high-level model of sorts and see how they come together to make scientific knowledge come about.
First of all, scientific knowledge is the knowledge that comes about through the scientific method, applied to preexisting knowledge or information (e.g. observations or raw data) and concerning a particular problem. The latter may be something concrete (e.g. a machine that can transform chemical energy into work) or abstract (e.g. a mathematical model that explains how two variables relate to each other and how one of them can act as a predictor for the other). A problem may also be a weakness of an existing theory or model that requires further understanding before it can be more widely useful.
Scientific knowledge has three primary aspects: research, fidelity, and application. Research has to do with the integration of information into a theory and/or new knowledge that supplements existing knowledge. This doesn't have to be groundbreaking, since even a meta-analysis on a subject can provide crucial insight into the problem at hand from a more holistic perspective. Perhaps that's why the first steps in a scientific project involve exploration and a critical analysis of the literature. In any case, it's hard to imagine scientific knowledge without research at its core, since otherwise it can become static, dogmatic, and even superficial. The variety of approaches to research and the value placed on it in all scientific institutions (including the non-academic ones) attest to that.
Fidelity has to do with developing confidence in something, particularly something new or different from what's already known. If the product of one's research doesn't carry confidence with it and is just speculation or a thin interpretation of the data, it doesn't provide much usefulness. A scientist needs to attack the new knowledge with everything she's got to ensure that it holds water. That's why experimentation is so important, as is peer review in some cases. The latter is very useful, though not always essential, since if the experiments are carried out properly and the scientist has no vested interest in the new knowledge (i.e. her intentions are pure), any issues with it will surface sooner rather than later. If the new knowledge remains firm, its fidelity will grow, along with the scientist's confidence in what it can do. Naturally, this confidence level will never reach 100%, since in science there is always room for disproving something.
This brings us to the next aspect: application. This involves applying the new knowledge to a real-world problem or some other situation that's somewhat different from the original one. It has to do with making predictions using the new theory, predictions that are of some value to someone beyond the scientist. "Application is the ultimate and most sacred form of theory," a Greek philosopher once wrote. Even if he wasn't a scientist, he must have been on to something. After all, most of today's new scientific output is geared towards applications of one form or another. That's not to say that purely theoretical research is of no value. However, purely theoretical research is still knowledge in progress. Once this research finds its way to robustness (through fidelity) and to a model or a physical system that can be applied to solve a real problem, it will have completed its evolutionary journey.
Naturally, research, fidelity, and application are not isolated from each other. There is a great deal of interaction between them, partly because they are part of an organic whole and partly because one stems from the other. Without research, we cannot talk about fidelity, nor about application. Technology (which is linked to the latter) doesn't come out of thin air. Also, no matter how much research we do, without testing the new knowledge to ensure a level of fidelity, we cannot use it elsewhere without contaminating the existing pool of knowledge. What's more, if an application doesn't work well enough, we often need to go back to the research stage and refine the underlying knowledge. Finally, sometimes it is the application that drives both research and fidelity, giving everything an end goal and a quality standard. Otherwise, we could be researching for research's sake, without ever producing any new knowledge that can benefit others.
Because of all this, it makes sense to connect these three aspects as the points of a triangle. We can also draw a circle around this triangle as a way to picture another important aspect of scientific knowledge: its scope (see figure below). No scientific theory aims to explain everything, except perhaps some ambitious projects in Physics that aim to unify all existing theories regarding the universe (though none of them has been successful so far, and their chances of success remain highly debatable). That's where scope fits in, putting all this into perspective.
In data science, scope is beautifully captured by the "No Free Lunch" theorem, which states that if a model or algorithm has an edge over the alternatives for a particular kind of dataset, it is bound to be weaker than those same alternatives for a different kind of dataset. In other words, no model outperforms all its alternatives all the time, just like there is no car out there that's better than all other cars on all sorts of terrain. Naturally, as the scientific knowledge of a particular domain grows, new knowledge may come about that has a larger scope (e.g. a theory that explains more of the observed phenomena and the corresponding data). Still, most new knowledge tends to have a very specific scope that, although small in relation to the whole domain, is still useful, as there is value in niche systems (think of a pickup truck, which, although a bit specialized, addresses certain use cases very effectively and efficiently).
The aforementioned aspects of scientific knowledge are the basis of the high-level model proposed here as an effort to describe it. Naturally, this is just the beginning, as other factors come into play once we examine things from a larger time frame. This, however, is something that deserves its own article, so stay tuned...
What’s a Transductive Model?
A transductive model is a predictive analytics model that makes use of distances or similarities. Contrary to inference models, which make use of induction and deduction for their predictions, transductive models tend to be direct. Oftentimes, they don't even have a training phase, in the sense that the model "learns" as it performs its predictions on the testing set. Transductive models generally fall under the machine learning umbrella and, so far, they have always been opaque (black boxes).
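To make this more concrete, here's a minimal sketch of a transductive classifier in Julia (the language I use for such experiments). The names here are mine, for illustration only; notice that there is no training step, just distances from the query point to the labeled points:

```julia
# A minimal transductive classifier (basic kNN): nothing is fit beforehand;
# predictions come directly from distances to the labeled points.
euclidean(a, b) = sqrt(sum((a .- b) .^ 2))

function knn_predict(X, y, q, k::Int = 3)
    dists = [euclidean(X[i, :], q) for i in 1:size(X, 1)]  # distance to each labeled point
    votes = y[sortperm(dists)[1:k]]                        # labels of the k nearest neighbors
    return argmax(c -> count(==(c), votes), unique(votes)) # majority vote
end

# Toy usage with two 2-D classes
X = [1.0 1.0; 1.2 0.9; 5.0 5.0; 5.1 4.8]
y = ["a", "a", "b", "b"]
knn_predict(X, y, [1.1, 1.0])  # yields "a"
```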
What’s Transparency in a Predictive Analytics Model?
Transparency is an umbrella term for anything that lends itself to a clear understanding of how it makes its predictions and/or how to interpret its results. Statistical models boast transparency since they are simple enough to understand and explain (but not simplistic). Transparency is valued greatly particularly when it comes to business decisions that use the outputs of a predictive model. For example, if you decide to let an employee go, you want to be able to explain why, be it to your manager, to your team, or the employee himself.
Transparent kNN sounds like an oxymoron, partly because the basic algorithm itself is a moron. It's very hard to think of a simpler and more basic algorithm in machine learning. This, however, hasn't stopped people from using it again and again due to the high speed it exhibits, particularly in smaller datasets. Still, kNN has been a black box so far, despite its many variants, some of which are ingenious indeed.
Lately, I've been experimenting with distances and how they can be broken down into their fundamental components. As a result, I managed to develop a distance metric that is transparent by design. By employing this metric in the kNN model, and by applying some tweaks to various parts of it, a transparent version of kNN came about.
In particular, this transparent kNN model yields not only its predictions for the data at hand but also a confidence metric (akin to a probability score for each of its predictions) and a weight matrix consisting of the weight each feature has in each prediction. Naturally, as kNN is a model used in both classification and regression, all of the above are available in either of its modalities. On top of that, the system can identify which modality to use based on the target variable of the training set.
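I won't reveal the internals of the method here, but to illustrate the kind of output involved, below is a hypothetical sketch (my own simplification, not the actual implementation): the confidence comes from neighbor agreement, and the feature weights from each feature's share of the distances involved.

```julia
using Statistics: mean

# Hypothetical sketch of a transparent kNN classification (NOT the actual
# method from this post): returns a prediction, a confidence score based on
# neighbor agreement, and per-feature weights based on each feature's share
# of the distances to the nearest neighbors.
function transparent_knn(X, y, q, k::Int = 3)
    n, d = size(X)
    diffs = [(X[i, j] - q[j])^2 for i in 1:n, j in 1:d]  # per-feature squared differences
    dists = sqrt.(sum(diffs, dims = 2))[:]               # Euclidean distances to all points
    neighbors = sortperm(dists)[1:k]
    votes = y[neighbors]
    label = argmax(c -> count(==(c), votes), unique(votes))
    confidence = count(==(label), votes) / k             # fraction of agreeing neighbors
    shares = diffs[neighbors, :] ./ (sum(diffs[neighbors, :], dims = 2) .+ eps())
    weights = vec(mean(shares, dims = 1))                # average share per feature (sums to ~1)
    return label, confidence, weights
end
```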
For now, I’ll probably continue with other, more useful matters, such as feature fusion. After all, just like most machine learning models, kNN is at the mercy of the features it is given. If I were in academic research, I’d probably write a series of papers on this topic, but as I work solo on these endeavors, I need to prioritize. However, for anyone interested in learning more about this, I’m happy to reply to any queries through this blog. Cheers!
The reality of data is often taken for granted, just like many things in data science. However, there is more to it than meets the eye, and it was only after talking with other data professionals (particularly data architects) that this hierarchy of realities became accessible to me. Of course, this is not something you'll see in a data science book or video, but if you think about it, it makes good sense. I had been thinking about it quite a bit before putting it down in words; eventually, all this helped me put things into perspective. Hopefully, it will do the same for you.
First of all, as the most basic and most accessible reality of data, we have the values of a dataset. This involves all the numeric and non-numeric data that lives in the data frames we process. Naturally, this is what we usually refer to as data, and it's the most fundamental entity we work with in every data science project. However, there is much more to all this, since this data comes from somewhere else, through a higher abstraction of it.
This abstraction is the variables of the dataset. These are much more than just containers of data values, since they often correspond to characteristics we can relate to in the problem we are tackling. Also, the variables themselves have an inherent structure representing a pattern, one that goes beyond the data values themselves. This is why Statistics is so obsessed with the various metrics describing individual variables; in a way, these metrics reflect the essence of a variable, and they are usually more important than the data itself.
Moreover, the relationships among all these variables are another level of reality regarding the data. After all, these variables are rarely independent of each other and the relationships among them are crucial for analyzing the data involved. This is what makes data generation a bit tricky since it's not as simple as creating data that follows the distribution of each variable involved. The relationships among the variables play a role in all this. That's why things like correlation metrics are important and help us analyze the data on a deeper level.
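To make these first three levels more tangible, here's a small Julia snippet (with a made-up dataset) going from the raw values, to per-variable metrics, to the relationships among the variables:

```julia
using Statistics  # mean, std, and cor live in Julia's standard library

# Level 1: the raw values of a (made-up) dataset, one column per variable
data = [1.0 2.1; 2.0 3.9; 3.0 6.2; 4.0 8.1; 5.0 9.8]

# Level 2: metrics describing each variable on its own
for j in 1:size(data, 2)
    println("Variable $j: mean = ", mean(data[:, j]), ", std = ", std(data[:, j]))
end

# Level 3: the relationships among the variables (here, Pearson correlation)
println(cor(data))  # the off-diagonal values capture how the variables relate
```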
Furthermore, there is the structure of the dataset, based on its inherent patterns and the reference variable. The latter is usually the target variable we are trying to predict. Naturally, the structure of the dataset is also relevant to the previous realities, particularly the one related to the relationships of the variables, since it influences the densities of the data. However, a higher order is introduced to the data through the target variable, making this structure even more prominent. Whatever the case, it is by understanding this structure (e.g. through clustering, feature evaluation, etc.) that we manage to gain a deeper understanding of the essence of the data.
Finally, there are the multidimensional patterns that generated the data in the first place. This is the most important reality of the data since it's the one that defines the whole dataset and in a way transcends it. After all, a dataset is but a sample of all the possible data points that stem from a certain population. The latter is usually beyond reach and it can be limitless as new data usually becomes available. So, knowing these multidimensional patterns is the closest we can get to that population and making use of them is what makes a data science project successful.
Naturally, A.I. is involved in each of these realities, usually as a tool for analyzing the data. However, it's particularly relevant at the last level, where it figures out these multidimensional patterns and manages to create new data similar to the original. Also, understanding these patterns well enables it to make more accurate predictions, due to the generalization over the data that it accomplishes.
Nevertheless, this 5-fold hierarchy of the realities of the data is useful for understanding a dataset, with or without A.I. methods. As a bonus, it enables us to gain a better appreciation of the heuristics available and helps us use them more consciously.
As mentioned in a previous post, translinearity is a concept describing the fluidity of the linear and the non-linear, as they are combined in a unified framework. However, linear relationships are still valuable, particularly if you want to develop a robust model. It's just that the rigid classification between linear and non-linear is arbitrary and meaningless when it comes to such a model. To clarify this whole matter I started exploring it further and developed an interesting heuristic to measure the level of non-linearity on a scale that's intuitive and useful.
So, let's start with a single feature or variable. How does it fare by itself in terms of linearity and non-linearity? A statistician will probably tell you that this sort of question is meaningless, since the indoctrination he/she has received makes it impossible to ask anything that's not within a Stats course's curriculum. However, the question is meaningful, even though it's not as useful as the follow-up questions that can ensue. So, depending on the data in that feature, it can be linear, super-linear, or sub-linear, in various degrees. The Index of Non-Linearity (INL) metric gauges that, and through the values it takes (ranging from -1 to 1, inclusive) we can assess what a feature is like on its own. Naturally, these scores can be easily shifted by a non-linear operator (e.g. sqrt(x) or exp(x)), while linear operators (e.g. the standard normalization methods) do not affect them. Also, in the current implementation of INL, the value of the heuristic is calculated using three reference points in the variable.
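I won't disclose the actual formula here, but to give you a feel for the idea, below is a naive sketch of how such a score could be computed with three reference points (my own illustration, not the INL implementation):

```julia
# A naive sketch of an INL-style score (NOT the actual heuristic): sort the
# values, pick three reference points (first, middle, last), and compare the
# slopes of the two segments they define. Equal slopes yield 0 (linear), a
# growing slope yields a positive score (super-linear), and a shrinking one
# a negative score (sub-linear). Linear operators scale both slopes equally,
# leaving the score intact, as described above.
function inl_sketch(x::AbstractVector{<:Real})
    v = sort(x)
    n = length(v)
    m = (n + 1) ÷ 2                       # the middle reference point
    s1 = (v[m] - v[1]) / (m - 1)          # slope of the first segment
    s2 = (v[n] - v[m]) / (n - m)          # slope of the second segment
    return (s2 - s1) / (s2 + s1 + eps())  # roughly in the [-1, 1] range
end

inl_sketch(collect(1.0:100.0))     # ~0.0 (linear)
inl_sketch(exp.(0.05 .* (1:100)))  # positive (super-linear)
inl_sketch(sqrt.(1.0:100.0))       # negative (sub-linear)
```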
Having established that, we can proceed to explore how a feature fares in relation to another variable (e.g. the target variable in a predictive analytics setting). Usually, the feature is used as the independent variable and the other as the dependent one, though you can explore the reverse relationship too, using this same heuristic. Interestingly, the problem is not as simple now, because the two variables need to be viewed in tandem. That's why all the reference points used shift if we change the order of the variables (i.e. the heuristic is not symmetric). Whatever the case, it is still possible to calculate INL based on the same idea, taking into account the reference values of both variables. In the current implementation of the heuristic, the values can go slightly beyond the limits, which is why they are artificially bounded to the [-1, 1] range.
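Continuing the naive sketch from above (again, an illustration of the idea rather than the actual heuristic), the two-variable case could look like this; note how swapping the variables changes the ordering of the points, hence the asymmetry:

```julia
# The same idea extended to a feature-target pair: order the pairs by the
# feature x and compare the slope of y over the two segments defined by
# three reference points. Swapping x and y changes the ordering, so the
# score is not symmetric, matching the behavior described above.
function inl_sketch(x::AbstractVector{<:Real}, y::AbstractVector{<:Real})
    p = sortperm(x)
    xs, ys = x[p], y[p]
    n, m = length(xs), (length(xs) + 1) ÷ 2
    s1 = (ys[m] - ys[1]) / (xs[m] - xs[1] + eps())
    s2 = (ys[n] - ys[m]) / (xs[n] - xs[m] + eps())
    return (s2 - s1) / (abs(s1) + abs(s2) + eps())  # bounded in [-1, 1]
end
```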
Naturally, metrics like INL are just the tip of the iceberg of this deep concept. However, the existence of INL illustrates that it is possible to devise heuristics for every concept in data science, as long as we are open to the possibilities the data world offers. Not everything has been analyzed through Stats, which, despite its indisputable value as a data science tool, is still just one framework, a singular way of looking at things. Fortunately, the data-scapes of data science can be viewed in many more ways, leading to intriguing possibilities worth exploring.
Everyone in data science (and even beyond it, to some extent) is familiar with the process of sampling. It's such a fundamental method in data analytics that it's hard to be unaware of it. The fact that it's so intuitive makes it even easier to comprehend and apply. Besides, in the world of Big Data, sampling seems to be not only useful but also necessary! What about data summarization though? How does it fit into data science, and how does it differ from sampling?
Both data summarization and sampling aim to reduce the number of data points in a dataset. However, they go about it in very different ways. For starters, sampling usually picks the data points randomly, while in some cases it takes into account an additional variable (usually the target variable). The latter is the case with stratified sampling, something essential if you want to perform proper K-fold cross-validation for a classification problem. Data summarization, on the other hand, creates new data points that aim to contain the same information as the original dataset, or at least retain as much of it as possible.
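As a quick illustration of the stratified case, here's a minimal sampler in Julia (a sketch of the general idea, with names of my own choosing):

```julia
using Random

# A minimal stratified sampler: draws a fraction of the indices from each
# class of the target variable, so the sample preserves the class proportions.
function stratified_sample(y::AbstractVector, frac::Float64; rng = Random.default_rng())
    idx = Int[]
    for c in unique(y)
        members = findall(==(c), y)                  # indices belonging to class c
        k = max(1, round(Int, frac * length(members)))
        append!(idx, shuffle(rng, members)[1:k])     # random picks within the class
    end
    return sort(idx)
end

y = [fill("spam", 80); fill("ham", 20)]  # an imbalanced target variable
stratified_sample(y, 0.1)                # ~8 "spam" and 2 "ham" indices
```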
Another important difference between the two methodologies is that data summarization tends to be deterministic, while sampling is highly stochastic. This means that you cannot use data summarization instead of sampling, at least not repeatedly as in the case of K-fold cross-validation. Otherwise, you’ll end up with the same results every time, something that doesn’t help with the validation of the models at hand! Perhaps that’s one of the reasons why data summarization is not so widely known in the data science community, where model validation is a key focus of data science work.
What's more, if sampling is done properly, it can maintain the relationships among the variables at hand (obviously this would entail the use of some heuristics, since random sampling alone won't cut it). Data summarization, on the other hand, doesn't do that so well, partly because it focuses on the most important aspects of the dataset, discarding everything else. This results in skewing the variable relationships a bit, much like PCA changes the data completely when applied. So, if you care about maintaining these variable correlations, data summarization is not the way to go.
Finally, due to the nature of the data involved, data summarization could be used for data anonymization and even data generation. Sampling, however, wouldn't work so well for these sorts of tasks, even though it could be used for data generation if the sampling is free of biases (something which can also be attained if certain heuristics are applied). All this illustrates the point that although these two methods are quite different, they are also applicable in different use cases so they don’t exactly compete with each other. It’s up to the discerning data scientist to figure out when to use which, adding value to the project at hand.
Rhythm in learning is something that most people don't think about, mostly because they take it for granted. If you were educated in a structure-oriented country, like most countries in the West, it would have been instilled in you (contrary to countries like Greece, where disorder and lack of any functional structure reign supreme). However, even then you may not value it much, because it is not something you're always conscious of. The need to be aware of it and make a conscious effort comes about when you are on your own, be it as a freelancer or as a learner in a free-form kind of course (i.e. not a university course or a boot camp). And just like any other real need, this one has to be fulfilled in one way or another.
The idea of this article came about from a real situation, namely a session with one of my mentees. Although she is a very conscientious learner and a very good mentee, she was struggling with rhythm, mostly due to external circumstances in her life. Having been there myself, I advised her accordingly. The distillation of this is what follows.
So, rhythm is not something you need to strive for, as it's built into you as an innate characteristic. In other words, it's natural, like breathing, and should come about on its own. If it doesn't, it's because you've put something in its way. So, you just need to remove this obstacle and rhythm will start flowing again on its own. This act of removal may take some effort, but it's a one-time thing (unless you are in a very demanding situation in your life, in which case you need to re-set your boundaries). But how does rhythm manifest in practice? It's all about being able to do something consistently, even if it's a small amount on certain days.
In my experience with writing (a truly challenging task in the long run, particularly when there is a deadline looming over you), I make a habit of writing a bit every day, even if it's just a single paragraph or the headings and subheadings of a new chapter. Sometimes I don't feel like working on a book at all, in which case I take the time to annotate the corresponding Jupyter notebooks or write an article on this blog. Whatever the case, I avoid idleness like the plague, since it's the killer of rhythm.
When it comes to learning data science and A.I., rhythm manifests as follows. You cultivate the habit of reading/coding/writing something related to the topic of your study plan or course curriculum. Even a little bit can go a long way since it's not that bit that makes the difference but the maintenance of your momentum. It's generally harder to pick up something that has gone rusty in your mind, particularly coding. However, if you coded a bit the previous day, it's so much easier. If you get stuck somewhere, you can always work on another drill or project. The important thing is to never give up and go idle.
Frustration is oftentimes inevitable, but if you leverage it properly, it can be a powerful force, as it has elements of willpower in it, willpower that doesn't have a proper outlet and is trapped. This is what can break your rhythm, but also what can remedy it. You always have the energy to carry on, even if at a slower pace sometimes. You just need to tap into it and apply yourself. That's where having a mentor can do wonders; yet even without one, you can still manage, just with a bit more effort. It's all up to you!
Translinearity is a super-set of the linear, extended to include what is not linear, in a meaningful manner. In data analytics, it includes all connections among data points and variables that make sense in order to maintain robustness (i.e. avoid any kind of over-fitting). Although fairly abstract, it is in essence what has brought about most modern fields of science, including Relativistic Physics. Naturally, when modeled appropriately, it can have an equally groundbreaking effect on all kinds of data analytics processes, including all the statistical ones as well as some machine learning processes. Effectively, a framework based on translinearity can bridge the different aspects of data science processes into a unified whole, where everything can be sophisticated enough to be considered A.I.-related while at the same time remaining transparent, much like the statistical models.
Why does this matter? Because we have reached the limits of what the linear approach has to offer through Statistics, Linear Algebra, etc. Meanwhile, the non-linear approaches, although effective and accessible, are black boxes, something that may remain so for the foreseeable future. The translinear approach can unveil aspects of the data that are inaccessible with the conventional methods at our disposal, while also helping cultivate a more holistic and more intuitive mindset, benefiting the data scientist as much as the projects this approach is applied to.
So far, translinearity has been implemented in the Julia ecosystem by myself; this is something I've been working on for the past decade or so. I have reason to believe that it is more than just a novelty, as I have observed various artifacts of its methods that were previously considered impossible. Examples include the optimal binning of multi-dimensional data, a metric that can assess the similarity of data points in a high-dimensionality space, and a new kind of normalization method that combines the benefits of the two existing ones (min-max and mean-std normalization, aka standardization).
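I'm not going to reveal that normalization method here, but as a stand-in to illustrate what "combining the benefits" could mean, here's one classic blend (my example, not the method in question): standardize first, then squash through a logistic, yielding outputs bounded like min-max normalization yet driven by the mean and standard deviation.

```julia
using Statistics: mean, std

# A stand-in illustration (NOT the actual translinear method): mean-std
# normalization followed by a logistic squashing. The output is bounded in
# (0, 1) like min-max normalization, yet centered and scaled like
# standardization, and outliers saturate instead of crushing the rest.
function hybrid_normalize(x::AbstractVector{<:Real})
    z = (x .- mean(x)) ./ (std(x) + eps())  # mean-std normalization
    return 1 ./ (1 .+ exp.(-z))             # logistic squashing into (0, 1)
end

hybrid_normalize([1.0, 2.0, 3.0, 4.0, 100.0])  # the outlier lands near 1.0 without flattening the rest
```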
Translinearity is made applicable through the systematic and meticulous development of a new data analytics framework, rooted in these principles and completely devoid of assumptions about the data. Everything is discovered based on the data itself and is fully parametrized in the corresponding functions. Also, all the functions are optimized and build on each other. Numbering a bit more than 30 in total, the main methods of this model cover all the fundamentals of data analytics and open the way to the development of predictive analytics models too.
Translinearity opens new roads in data analytics, rendering conventional approaches more or less obsolete. However, the key outcome of this new paradigm is the possibility of a new kind of A.I. that is transparent and comprehensible, not merely comprehensive in terms of application domains. Translinearity is already employed in the more advanced deep learning systems, but it's so well hidden that it escapes the user. If an A.I. system is built from the ground up using translinear principles, however, it can maintain transparency and flexibility to accompany its high performance.
It's interesting how, even though there are a zillion ways to assess the similarity between two vectors (each representing a single-dimensional data sample), when it comes to doing the same thing with matrices (each representing a whole sample of data) the metrics available are mediocre at best. It's really strange that in clustering, for example, where this is an important part of the whole process, we often revert to crude metrics like Silhouette Width to figure out whether the clusters are distinct enough or not. What if there was a way to assess similarity more scientifically, beyond such amateur heuristics?
Well, fortunately, there is a way, at least as of late. Enter the Congruency concept. This is basically the idea that you can explore the similarity of two n-dimensional samples through the systematic analysis of their components, given that the latter are orthogonal. If they are not, it shouldn't be difficult to make them orthogonal without any loss of information. Whatever the case, it's important to avoid any strong relationships among the variables involved, as these can skew the whole process of assessing similarity.
The Congruency concept is something I came up with a few months ago, but it wasn't until recently that I managed to implement it in a reliable and scalable way, using a new framework I've developed in Julia. The metric takes the two matrices as inputs and yields a float number between 0 and 1 as its output. The larger this number, the more similar (congruent) the two matrices are. Naturally, the metric was designed to be robust regardless of the data's dimensionality, though if there are a lot of noisy variables, they are bound to distort the result. That's why it performs some preprocessing first, to ensure that the variables are independent and as useful as possible.
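The actual dcon implementation is part of the aforementioned framework, but to illustrate the general idea (orthogonalize, compare the components, map to the [0, 1] range), here's a naive stand-in of my own devising:

```julia
using LinearAlgebra, Statistics

# A naive stand-in for a congruency-style metric (NOT the actual dcon):
# project both samples onto the orthogonal principal axes of the pooled
# data, compare each component's location and spread, and map the total
# discrepancy into (0, 1], where 1 means the summary statistics coincide.
function congruency_sketch(A::AbstractMatrix, B::AbstractMatrix)
    pooled = vcat(A, B)
    μ = mean(pooled, dims = 1)
    W = eigen(Symmetric(cov(pooled))).vectors    # orthogonal components
    PA, PB = (A .- μ) * W, (B .- μ) * W          # decorrelated projections
    σ = vec(std(pooled * W, dims = 1)) .+ eps()  # scale of each component
    dμ = abs.(vec(mean(PA, dims = 1)) .- vec(mean(PB, dims = 1))) ./ σ
    dσ = abs.(vec(std(PA, dims = 1)) .- vec(std(PB, dims = 1))) ./ σ
    return exp(-mean(dμ .+ dσ))                  # 1.0 for identical statistics
end

congruency_sketch(randn(200, 3), randn(200, 3))  # close to 1 for samples from the same source
```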
Applications of the Congruency metric (which I have coined dcon) go beyond clustering, however. Namely, it can be used for assessing sampling algorithms (usually yielding values of 0.95 or higher for a reliable sample) as well as for synthetic data generation. Since the metric doesn't make any assumptions about the data, it can be used with all kinds of data, not just data following a particular set of distributions. Also, as it doesn't make use of all the dimensions simultaneously, it is possible to avoid the curse of dimensionality altogether.
Things like Congruency may seem like ambitious heuristics, and few people would trust them when more established statistical heuristics exist as an option. However, there comes a time when a data scientist starts to question whether a statistic's or metric's age is sufficient for establishing its usefulness. After all, what is now old and established was once new and experimental; let's not forget that...
So, recently I decided to make a video on this topic, based on some things I've observed in data science candidates. The hope is that this may help them and anyone else who may be looking into becoming a more holistic data scientist, instead of just a data science technician. The video I made is now available online on O'Reilly and although it's a bit longer than others I've made (not counting the quiz ones), it's fairly easy to follow. Enjoy!