The reality of data is often taken for granted, just like many things in data science. However, there is more to it than meets the eye and it's only after talking with other data professionals (particularly data architects) that this hierarchy of realities becomes accessible. Of course, this is not something you'll see in a data science book or video, but if you think about it it makes good sense. I've been thinking about it quite a bit before putting it down in words; eventually, all this helped me put things into perspective. Hopefully, it will do the same for you.
First of all, as the basest and most accessible reality of data, we have the values of a dataset. This involves all the numeric and non-numeric data that lives in the data frames we process. Naturally, this is usually referred to as data and it's the most fundamental entity we work with in every data science project. However, there is much more to all this than that since this data comes from somewhere else, through a higher abstraction of it.
This abstraction is the variables of the dataset. These are much more than just containers of the data values since they often represent pieces of information that represent characteristics we can relate to in the problem we are tackling. Also, the variables themselves have an inherent structure representing a pattern, which goes beyond the data values themselves. This is why Statistics is so obsessed with various metrics describing individual variables; in a way, these metrics reflect the essence of a variable and they are usually more important than the data itself.
Moreover, the relationships among all these variables are another level of reality regarding the data. After all, these variables are rarely independent of each other and the relationships among them are crucial for analyzing the data involved. This is what makes data generation a bit tricky since it's not as simple as creating data that follows the distribution of each variable involved. The relationships among the variables play a role in all this. That's why things like correlation metrics are important and help us analyze the data on a deeper level.
Furthermore, there is the structure of the dataset based on the inherent patterns and the reference variable. The latter is usually the target variable we are trying to predict. Naturally, the structure of the dataset is also relevant to the previous realities, particularly the one related to the relationship of the variables, since it influences the densities of the data. However, a higher-order is introduced to the data through the target variable, making this structure even more prominent. Whatever the case, it is by understanding this structure (e.g. through clustering, feature evaluation, etc.) that we manage to gain a deeper understanding of the essence of the data.
Finally, there are the multidimensional patterns that generated the data in the first place. This is the most important reality of the data since it's the one that defines the whole dataset and in a way transcends it. After all, a dataset is but a sample of all the possible data points that stem from a certain population. The latter is usually beyond reach and it can be limitless as new data usually becomes available. So, knowing these multidimensional patterns is the closest we can get to that population and making use of them is what makes a data science project successful.
Naturally, A.I. is involved in each one of these realities, usually as a tool for analyzing the data. However, it’s particularly relevant in the last level whereby it figures out these multidimensional patterns and manages to create new data similar to the original. Also, understanding these patterns well enables it to make more accurate predictions, due to the generalization of the data that it accomplishes.
Nevertheless, this 5-fold hierarchy of the realities of the data is useful for understanding a dataset, with or without A.I. methods. As a bonus, it enables us to gain a better appreciation of the heuristics available and helps us use them more consciously.
As mentioned in a previous post, translinearity is a concept describing the fluidity of the linear and the non-linear, as they are combined in a unified framework. However, linear relationships are still valuable, particularly if you want to develop a robust model. It's just that the rigid classification between linear and non-linear is arbitrary and meaningless when it comes to such a model. To clarify this whole matter I started exploring it further and developed an interesting heuristic to measure the level of non-linearity on a scale that's intuitive and useful.
So, let's start with a single feature or variable. How does it fare by itself in terms of linearity and non-linearity? A statistician will probably tell you that this sort of question is meaningless since the indoctrination he/she has received would make it impossible to ask anything that's not within a Stats course's curriculum. However, the question is meaningful even though it's not as useful as the follow-up questions that can ensue. So, depending on the data in that feature, it can be linear, super-linear, or sub-linear, in various degrees. The Index of Non-Linearity (INL) metric gauges that and through the values it takes (ranging from -1 to 1, inclusive) we can assess what a feature is like on its own. Naturally, these scores can be easily shifted by a non-linear operator (e.g. sqrt(x) or exp(x)) while all linear operators (e.g. standard normalization methods) do not affect these scores. Also, at the current implementation of INL, the value of the heuristic is calculated using three reference points in the variable.
Having established that, we can proceed to explore how a feature fares in relation to another variable (e.g. the target variable in a predictive analytics setting). Usually, the feature is used as the independent variable and the other variable as the dependent one, though you can explore the reverse relationship too, using this same heuristic. Interestingly the problem is not as simple now because the two variables need to be viewed in tandem. That's why all the reference points used shift if we change the order of the variables (i.e. the heuristic is not symmetric). Whatever the case, it is still possible to calculate INL with the same idea but taking into account the reference values of both variables. In the current implementation of the heuristic, the values can go a bit off-limits, which is why they are bound artificially to the [-1, 1] range.
Naturally, metrics like INL are just the tip of the iceberg in this deep concept. However, the existence of INL illustrates that it is possible to devise heuristics for every concept in data science, as long as we are open to the possibilities the data world offers. Not everything has been analyzed through Stats, which despite its indisputable value as a data science tool, it is still just one framework, a singular way of looking at things. Fortunately, the data-scapes of data science can be viewed in many more ways leading to intriguing possibilities worth exploring.
Everyone in data science (and even beyond data science to some extent) is familiar with the process of sampling. It’s such a fundamental method in data analytics that it’s hard to be unaware of it. The fact that’s so intuitive as well makes it even easier to comprehend and apply. Besides, in the world of Big Data, sampling seems to be not only useful but also necessary! What about data summarization though? How does that fit in data science and how does it differ from sampling?
Both data summarization and sampling aim to reduce the number of data points in the data set. However, they go about it in very different ways. For starters, sampling usually picks the data points randomly while in some cases, it takes into account an additional variable (usually the target variable). The latter is the case of stratified sampling, something essential if you want to perform proper K-fold cross-validation for a classification problem. Data summarization, on the other hand, creates new data points that aim to contain the same information as the original dataset, or at least retain as much of it as possible.
Another important difference between the two methodologies is that data summarization tends to be deterministic, while sampling is highly stochastic. This means that you cannot use data summarization instead of sampling, at least not repeatedly as in the case of K-fold cross-validation. Otherwise, you’ll end up with the same results every time, something that doesn’t help with the validation of the models at hand! Perhaps that’s one of the reasons why data summarization is not so widely known in the data science community, where model validation is a key focus of data science work.
What’s more, if sampling is done properly, it can maintain the relationships among the variables at hand (obviously this would entail the use of some heuristics since random sampling alone won’t cut it). Data summarization, on the other hand, doesn't do that so well, partly because it focuses on the most important aspects of the dataset, discarding everything else. This results in skewing the variable relationships a bit, much like a PCA method changes the data completely when it is applied. So, if you care about maintaining these variable correlations, data summarization is not the way to go.
Finally, due to the nature of the data involved, data summarization could be used for data anonymization and even data generation. Sampling, however, wouldn't work so well for these sorts of tasks, even though it could be used for data generation if the sampling is free of biases (something which can also be attained if certain heuristics are applied). All this illustrates the point that although these two methods are quite different, they are also applicable in different use cases so they don’t exactly compete with each other. It’s up to the discerning data scientist to figure out when to use which, adding value to the project at hand.
Rhythm in learning is something that most people don't think about, mostly because they take it for granted. If you were educated in a structure-oriented country, like most countries in the West, this would be instilled in you (contrary to countries like Greece where disorder and lack of any functional structure reign supreme). However, even then you may not value it so much because it is not something you're conscious of always. The need to be aware of it and make conscious effort comes about when you are on your own, be it as a freelancer or a learner in a free-form kind of course (i.e. not a university course of a boot camp). And just like any other real needs, this needs to be fulfilled in one way or another.
The idea of this article came about from a real situation, namely a session with one of my mentees. Although she is a very conscientious learner and a very good mentee, she was struggling with rhythm, mostly due to external circumstances in her life. Having been there myself, I advised her accordingly. The distillation of this is what follows.
So, rhythm is not something you need to strive for as it's built-in yourself as an innate characteristic. In other words, it's natural, like breathing and should come by on its own. If it doesn't, it's because you've put something in its way. So, you just need to remove this obstacle and rhythm will start flowing again on its own. This action of removal may take some effort but it's a one-time thing (unless you are in a very demanding situation in your life, in which case you need to re-set your boundaries). But how does rhythm manifest in practice? It's all about being able to do something consistently, even if it's a small amount certain days.
In my experience with writing (a truly challenging task in the long run, particularly when there is a deadline looming over you), I make it a habit of writing a bit every day, even if it's just a single paragraph or the headings and subheadings structure of a new chapter. Sometimes I don't feel like working on a book at all, in which case I take the time to annotate the corresponding Jupyter notebooks or write an article on this blog. Whatever the case, I avoid idleness like the plague since it's the killer of rhythm.
When it comes to learning data science and A.I., rhythm manifests as follows. You cultivate the habit of reading/coding/writing something related to the topic of your study plan or course curriculum. Even a little bit can go a long way since it's not that bit that makes the difference but the maintenance of your momentum. It's generally harder to pick up something that has gone rusty in your mind, particularly coding. However, if you coded a bit the previous day, it's so much easier. If you get stuck somewhere, you can always work on another drill or project. The important thing is to never give up and go idle.
Frustration is oftentimes inevitable but if you leverage it properly, it can be a powerful force as it has elements of willpower in it, willpower that doesn't have a proper outlet and it trapped. This is what can cause the break of rhythm but what can also remedy it. You always have the energy to carry on, even at a slower pace sometimes. You just need to tap into it and apply yourself. That's when having a mentor can do wonders, yet even without one, you can still manage, but with a bit more effort. It's all up to you!
Translinearity is the super-set of what’s linear, so as to include what is not linear, in a meaningful manner. In data analytics, it includes all connections among data points and variables that make sense in order to maintain robustness (i.e. avoid any kind of over-fitting). Although fairly abstract, it is in essence what has brought about most modern fields of science, including Relativistic Physics. Naturally, when modeled appropriately, it can have an equally groundbreaking effect in all kinds of data analytics processes, including all the statistical ones as well as some machine learning processes. Effectively, a framework based on translinearity can bridge the different aspects of data science processes into a unified whole where everything can be sophisticated enough to be considered A.I. related while at the same time transparent enough, much like all statistical models.
Because we have reached the limits of what the linear approach has to offer through Statistics, Linear Algebra, etc. Also, the non-linear approach, although effective and accessible, are black boxes, something that may remain so for the foreseeable future. Also, the translinear approach can unveil aspects of the data that are inaccessible with the conventional methods at our disposal, while they can help cultivate a more holistic and more intuitive mindset, benefiting the data scientists as much as the projects it is applied on.
So far, Translinearity is implemented in the Julia ecosystem by myself. This is something I've been working on for the past decade or so. I have reason to believe that it is more than just a novelty as I have observed various artifacts concerning some of its methods, things that were previously considered impossible. One example is optimal binning of multi-dimensional data, developing a metric that can assess the similarity of data points in high dimensionality space, a new kind of normalization method that combines the benefits of the two existing ones (min-max and mean-std normalization, aka standardization), etc.
Translinearity is made applicable through the systematic and meticulous development of a new data analytics framework, rooted in the principles and completely void of assumptions about the data. Everything in the data is discovered based on the data itself and is fully parametrized in the corresponding functions. Also, all the functions are optimized and build on each other. A bit more than 30 in total, the main methods of this model cover all the fundamentals of data analytics and open the way to the development of predictive analytics models too.
Translinearity opens new roads in data analytics rendering conventional approaches more or less obsolete. However, the key outcome of this new paradigm of data analytics is the possibility of a new kind of A.I. that is transparent and comprehensible, not merely comprehensive in terms of application domains. Translinearity is employed in the more advanced deep learning systems but it’s so well hidden that it escapes the user. However, if an A.I. system is built from the ground-up using translinear principles, it can maintain transparency and flexibility, to accompany high performance.
It's interesting how even though there are a zillion ways to assess the similarity between two vectors (each representing a single-dimensional data sample) when it comes to doing the same thing with matrices (each representing a whole sample of data) the metrics available are mediocre at best. It's really strange that when it comes to clustering, for example, where this is an important part of the whole process, we often revert to crude metrics like Silhouette Width to figure out if the clusters are similar enough or not. What if there was a way to assess similarity more scientifically, beyond such amateur heuristics?
Well, fortunately, there is a way, at least as of late. Enter the Congruency concept. This is basically the idea that you can explore the similarity of two n-dimensional samples through the systematic analysis of their components, given that the latter are orthogonal. If they are not orthogonal, it shouldn't be difficult to make them orthogonal, without any loss of information. Whatever the case, it's important to avoid any strong relationships among the variables involved as this can skew the whole process of assessing similarity.
The Congruency concept is something I came up with a few months ago but it wasn't until recently that I managed to implement it in a reliable and scalable way, using a new framework I've developed in Julia. The metric takes as inputs the two matrices and yields a float number as its output, between 0 and 1. The larger this number is the more similar (congruent) the two matrices are. Naturally, the metric was designed to be robust regardless of its dimensionality, though if there are a lot of noisy variables, they are bound to distort the result. That's why it performs some preprocessing first to ensure that the variables are independent and as useful as possible.
Applications of the Congruency metric (which I coined as dcon) go beyond clustering, however. Namely, it can be used in assessing sampling algorithms too (usually yielding values 0.95+ for a reliable sample) as well as synthetic data generation. Since the metric doesn't make any assumptions about the data, it can be used with all kinds of data, not just those following a particular set of distributions. Also, as it doesn't make use of all dimensions simultaneously, it is possible to avoid the curse of dimensionality altogether.
Things like Congruency may seem like ambitious heuristics and few people would trust it when the more established statistical heuristics exist as an option. However, there comes a time when a data scientist starts to question whether a statistic's / metric's age is sufficient for establishing its usefulness. After all, what is now old and established was once new and experimental, let's not forget that...
So, recently I decided to make a video on this topic, based on some things I've observed in data science candidates. The hope is that this may help them and anyone else who may be looking into becoming a more holistic data scientist, instead of just a data science technician. The video I made is now available online on O'Reilly and although it's a bit longer than others I've made (not counting the quiz ones), it's fairly easy to follow. Enjoy!
So, the 7th quiz video I've created is finally online on O'Reilly. This is the longest one so far spanning over 51 minutes, meaning there are lots of explanations for the various questions. It covers a bunch of topics, such as A/B testing, ANOVA, and various statistical tests. I put a lot of thought in this, much like you'd put a lot of thought in designing a data science experiment. Hopefully, you'll find it as useful and enjoyable as I did.
Note that just like other videos published on O'Reilly, you'll need to have an active account (even if it's a trial one), in order to view it in its entirety. As a bonus, you'll be able to view other videos as well as books available on that platform. Enjoy!
Although transparency is often viewed in relation to predictive analytics models, when it comes to data science there is another aspect of transparency that is also particularly important: the transparency of the data science work. This has to do mainly with how transparent the process followed is, as well as the models used and the results.
Nowadays it’s easy (perhaps too easy!) to build a predictive analytics model in complete obscurity, thanks to the wonder of deep learning. This way, you may be able to bring about a satisfactory result, without gaining a sufficient understanding of the data at play, or the quirks of the problem at hand. Of course, it's not just the data scientist to blame for this reckless behavior. Far from it. The root of the problem is managerial since we often forget that the data scientist will tend to follow the most economical course of action, to deliver a result, be it a data product or a set of insights, in the least amount of time. This is often due to the strict deadlines involved in a data science project and the all too frequent lack of understanding of the field, by the people managing the project.
There is more to a successful project than a high accuracy rate or an easily accessible model on the cloud. Oftentimes the problems tackled by data science are complex and have lots of peculiarities that deserve close attention if the problems are to be solved properly. Anyone can build a predictive analytics model nowadays, without having a good grasp of data science, thanks to all these 10-12 week boot camps that offer the most superficial knowledge humanly possible to the aspiring data scientists! Yet, if our expectations of the data scientists are equally shallow and we are willing to up with opaque models and pipelines, then we reap what we saw. That's why it's important to have good communication about these matters, going beyond the basics. Mentoring can also be a priceless aid in all this.
Fixing this fundamental issue requires more than just good communication and mentoring, however. We also need to opt for a transparent approach to data science. All aspects of the pipeline need to be explainable, even if the models used are black boxes, due to the performance requirements involved. The data scientists need to be able to communicate their work and findings, while we as managers need to do the same when it comes to requirements, domain knowledge, and other factors that may play a role in the project at hand. All this may not solve every issue with today’s obscure data science pipelines, but it is a good place to start.
Perhaps if we have transparency as a key value in our data science teams, we have a better chance of deriving true insights from the data available and bring about a more valuable result overall.
Beyond the play of words here, there is an important matter that needs to be addressed, since data science is becoming increasingly influential nowadays, in various aspects of our lives. Gone are the days when it was limited to the data science departments of certain companies; these days, the impact of data science transcends the boundaries of the organizations it serves. Take for example the data scientists working for large companies like Facebook and Google. The impact of their work influences a large number of people, even outside the companies themselves. Perhaps the range of this impact is hard to fathom even by the managers of these data science teams since it often has a lasting impact that's nearly impossible to gauge without sufficient data and the time required for this impact to fully manifest.
Ethics is a word that's used so much that has lost its meaning, or maybe it was never really properly defined in the first place. Also, with the impersonal aspects of ethics being formalized in particular codes of conduct, it has lost its essence since it has been reduced to a number of do's and don't, a set of guidelines which can be followed unconsciously and mechanically. However, ethics is the formal aspect of morality, which is founded in the values we follow. The latter is real and oftentimes comprehensible things that we express in our actions, oftentimes consciously. Values like honesty, diligence, and efficiently don't require a Master's in philosophy in order to comprehend, while the ethics of a modern information worker can be a bit more abstract and challenging to relate to. Values are something we have, whether we talk about them or not, and it's not too difficult to figure out what these are with a little introspection. However, even though values are a personal matter, they have a concrete effect on our work and in how we relate to the world. Good managers are aware of that and pay attention to the values of the candidates of the positions they wish to fill. The resume/CV is important but it’s not the only factor at play when hiring a professional.
Perhaps it's time to pay attention to this aspect of the craft more. Knowledge and know-how are becoming more easily accessible to everyone, particularly those who are willing to pay for that, an investment that is guaranteed to pay off. That's great, particularly for those who wish to enter this field even if their education is not aligned with this subject. Still, it's equally important to balance this aptitude with the moral strength that empowers us to deliver our data science work in a way that respects other people's privacy and doesn't abuse the information involved. At one point in our careers, it is natural to come into a crossroad where we need to either do is expected or do what is ethically right. The former is bound to be a more tempting option, at least financially, while the latter may be void of any direct benefit. Having a solid set of positive values may help us make the right choice instead of trading the long-term benefit of the many for the short-term gain of the few.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.