The reality of data is often taken for granted, just like many things in data science. However, there is more to it than meets the eye, and it's only after talking with other data professionals (particularly data architects) that this hierarchy of realities becomes accessible. Of course, this is not something you'll see in a data science book or video, but if you think about it, it makes good sense. I had been thinking about it quite a bit before putting it down in words; eventually, all this helped me put things into perspective. Hopefully, it will do the same for you.
First of all, as the basest and most accessible reality of data, we have the values of a dataset. This involves all the numeric and non-numeric data that lives in the data frames we process. Naturally, this is usually referred to as data and it's the most fundamental entity we work with in every data science project. However, there is much more to all this than that since this data comes from somewhere else, through a higher abstraction of it.
This abstraction is the variables of the dataset. These are much more than just containers of the data values since they often represent pieces of information that represent characteristics we can relate to in the problem we are tackling. Also, the variables themselves have an inherent structure representing a pattern, which goes beyond the data values themselves. This is why Statistics is so obsessed with various metrics describing individual variables; in a way, these metrics reflect the essence of a variable and they are usually more important than the data itself.
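To make the point about per-variable metrics concrete, here is a minimal sketch using only Python's standard library. The `describe` helper and the `ages` variable are made up for illustration; they are not part of any particular toolkit.

```python
import statistics


def describe(values):
    """Return a few summary metrics for one numeric variable."""
    return {
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),  # sample standard deviation
        "min": min(values),
        "max": max(values),
    }


# Hypothetical variable: the ages of six customers
ages = [23, 31, 35, 35, 42, 58]
summary = describe(ages)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

A handful of numbers like these often says more about the variable than scanning its raw values would.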
Moreover, the relationships among all these variables are another level of reality regarding the data. After all, these variables are rarely independent of each other and the relationships among them are crucial for analyzing the data involved. This is what makes data generation a bit tricky since it's not as simple as creating data that follows the distribution of each variable involved. The relationships among the variables play a role in all this. That's why things like correlation metrics are important and help us analyze the data on a deeper level.
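The data generation point can be illustrated with a small experiment. The sketch below assumes a toy dataset of two linearly related variables and a deliberately naive generator that shuffles each column independently. The `pearson` helper and all variable names are illustrative assumptions.

```python
import random


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Toy "real" data: y depends strongly on x
x = [rng.gauss(0, 1) for _ in range(2000)]
y = [2 * xi + rng.gauss(0, 0.5) for xi in x]

# Naive "generated" data: each column keeps its exact distribution,
# but the rows are shuffled independently of each other
x_fake = x[:]
y_fake = y[:]
rng.shuffle(x_fake)
rng.shuffle(y_fake)

r_real = pearson(x, y)
r_fake = pearson(x_fake, y_fake)
print(f"real data correlation:      {r_real:.3f}")
print(f"generated data correlation: {r_fake:.3f}")
```

The generated columns match the originals value for value, yet the relationship between them is gone, which is exactly why matching each variable's distribution is not enough.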
Furthermore, there is the structure of the dataset based on the inherent patterns and the reference variable. The latter is usually the target variable we are trying to predict. Naturally, the structure of the dataset is also relevant to the previous realities, particularly the one related to the relationships among the variables, since it influences the densities of the data. However, a higher order is introduced to the data through the target variable, making this structure even more prominent. Whatever the case, it is by understanding this structure (e.g. through clustering, feature evaluation, etc.) that we manage to gain a deeper understanding of the essence of the data.
Finally, there are the multidimensional patterns that generated the data in the first place. This is the most important reality of the data since it's the one that defines the whole dataset and in a way transcends it. After all, a dataset is but a sample of all the possible data points that stem from a certain population. The latter is usually beyond reach and it can be limitless as new data usually becomes available. So, knowing these multidimensional patterns is the closest we can get to that population and making use of them is what makes a data science project successful.
Naturally, A.I. is involved in each one of these realities, usually as a tool for analyzing the data. However, it’s particularly relevant in the last level whereby it figures out these multidimensional patterns and manages to create new data similar to the original. Also, understanding these patterns well enables it to make more accurate predictions, due to the generalization of the data that it accomplishes.
Nevertheless, this 5-fold hierarchy of the realities of the data is useful for understanding a dataset, with or without A.I. methods. As a bonus, it enables us to gain a better appreciation of the heuristics available and helps us use them more consciously.
Throughout this blog, I've talked about all sorts of problems and how solving them can aid one's data science acumen as well as the development of the data science mindset. Problem-Solving skills rank high when it comes to the soft skills aspect of our craft, something I also mentioned in my latest video on O'Reilly. However, I haven't talked much about how you can hone this ability.
Enter Brilliant, a portal for all sorts of STEM-related courses and puzzles that can help you develop problem-solving, among other things. If you have even a vague interest in Math and the positive Sciences, Brilliant can help you grow this into a passion and even a skill-set in these disciplines. The most intriguing thing about all this is that it does so in a fun and engaging way.
Naturally, most of the stuff Brilliant offers comes with a price tag (if it didn't, I would be concerned!). However, the cost of using the resources this site offers is a quite reasonable one and overall good value for money. The best part is that by signing up there you can also help me cover some of the expenses of this blog, as long as you use this link here: www.brilliant.org/fds (FDS stands for Foxy Data Science, by the way). Also, if you are among the first 200 people to sign up you'll get a 20% discount, so time is definitely of the essence!
Note that I normally don't promote anything on this blog unless I'm certain about its quality standard. Also, out of respect for your time I refrain from posting any ads on the site. So, whenever I post something like this affiliate link here I do so after careful consideration, opting to find the best way to raise some revenue for the site all while providing you with something useful and relevant to it. I hope that you view this initiative the same way.
Translinearity is the super-set of what’s linear, so as to include what is not linear, in a meaningful manner. In data analytics, it includes all connections among data points and variables that make sense in order to maintain robustness (i.e. avoid any kind of over-fitting). Although fairly abstract, it is in essence what has brought about most modern fields of science, including Relativistic Physics. Naturally, when modeled appropriately, it can have an equally groundbreaking effect in all kinds of data analytics processes, including all the statistical ones as well as some machine learning processes. Effectively, a framework based on translinearity can bridge the different aspects of data science processes into a unified whole where everything can be sophisticated enough to be considered A.I. related while at the same time transparent enough, much like all statistical models.
Translinearity matters because we have reached the limits of what the linear approach has to offer through Statistics, Linear Algebra, etc. Also, non-linear approaches, although effective and accessible, are black boxes, something that may remain so for the foreseeable future. Finally, the translinear approach can unveil aspects of the data that are inaccessible with the conventional methods at our disposal, while it can help cultivate a more holistic and more intuitive mindset, benefiting the data scientists as much as the projects it is applied to.
So far, I have implemented Translinearity in the Julia ecosystem myself. This is something I've been working on for the past decade or so. I have reason to believe that it is more than just a novelty as I have observed various artifacts concerning some of its methods, things that were previously considered impossible. Examples include optimal binning of multi-dimensional data, a metric that can assess the similarity of data points in high-dimensionality space, and a new kind of normalization method that combines the benefits of the two existing ones (min-max and mean-std normalization, aka standardization).
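For reference, the two existing normalization methods mentioned above can be sketched as follows. This is plain Python for illustration only; it does not attempt to reproduce the new translinear method, and the function names and sample data are assumptions.

```python
import statistics


def min_max_normalize(values):
    """Rescale values linearly to the closed interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("min-max normalization is undefined for constant data")
    return [(v - lo) / (hi - lo) for v in values]


def standardize(values):
    """Mean-std normalization: zero mean and unit (population) std."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    if sigma == 0:
        raise ValueError("standardization is undefined for constant data")
    return [(v - mu) / sigma for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
scaled = min_max_normalize(data)  # bounded, but sensitive to outliers
z_scores = standardize(data)      # centred, but unbounded

print(scaled)
print(z_scores)
```

Min-max gives a fixed range at the cost of being driven by the extreme values, whereas standardization is less dependent on the extremes but has no fixed range.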
Translinearity is made applicable through the systematic and meticulous development of a new data analytics framework, rooted in the principles and completely void of assumptions about the data. Everything in the data is discovered based on the data itself and is fully parametrized in the corresponding functions. Also, all the functions are optimized and build on each other. A bit more than 30 in total, the main methods of this model cover all the fundamentals of data analytics and open the way to the development of predictive analytics models too.
Translinearity opens new roads in data analytics rendering conventional approaches more or less obsolete. However, the key outcome of this new paradigm of data analytics is the possibility of a new kind of A.I. that is transparent and comprehensible, not merely comprehensive in terms of application domains. Translinearity is employed in the more advanced deep learning systems but it’s so well hidden that it escapes the user. However, if an A.I. system is built from the ground-up using translinear principles, it can maintain transparency and flexibility, to accompany high performance.
So, the royalties for the last 3-month period came for my self-published novel today ("I, AGI; the adventures of an advanced AI") and they were quite underwhelming. In fact, with the money I received I couldn't even cover my expenses for this book. Yes, I did pay others to help out, such as an editor and someone to handle the formatting that Kindle Publishing expects of its books, including the cover design. After all, I have a lot of respect for my audience, even if probably most of the people who read the book chose to not pay for it (there are loopholes when it comes to Amazon Kindle). Still, the reviews I got about it, from reliable sources like Goodreads, were quite positive, so I must have done something right!
Anyway, I could have published this book elsewhere and perhaps if I had 6 months to a year to spend, I could have found a literary publisher for it (unfortunately my regular publisher doesn't do novels!). Yet, even then it's not really worth it for the revenue a fiction book can bring. After all, the standards for sci-fi these days are quite high and I'm more of a non-fiction author. So, why did I bother with this whole project? Well, mostly because I enjoy writing, all kinds, not just non-fiction. And if you have a story in your head that you wish to share with others, the low revenue that stems from a publication of this story doesn't pose a real obstacle.
Also, and perhaps more importantly, I had a message to get to the world, regarding the safety aspect of A.I. and AGI. Of course, I've made this point through other forms, such as a video on the topic and numerous articles on this blog. However, if you care about reaching as many people as possible, you need to be creative about how you promote your idea. And that's exactly what I did.
So, even if Amazon Kindle is not the most profitable way to publish an ebook, even if the people reading this book probably have a dozen other books on their to-read list and are less likely to value it the same way we used to value books before the Internet era, even if people are mesmerized by the benefits of A.I. today and are quite reluctant to view any of the potential shortcomings, I'm glad I published this book. At the very least, it was a learning experience and a way to gauge the literary market first hand. And who knows, if things go well, I may author a sequel to this novel as there is more to the story!
Just last week, during a business trip to London, I started working on this video in my spare time, and now it's already online! In this 40-minute video, comprising 3 clips, I explore the topic of Optimization through a series of questions spanning 5 categories. Whether you are an aspiring A.I. expert or a data scientist, you can learn a lot of useful things from this test of sorts and, with the right mindset, even enjoy the whole process! You can find it on the O'Reilly platform, where you need to have an account (even a trial one will do) to watch it in its entirety. Cheers!
Dimensionality reduction has been a standard methodology to deal with datasets that have a lot of features, more than a typical model can handle effectively. Reducing the number of features can also save time and storage space, while when it comes to sensitive data it can be a big plus as it helps preserve the anonymity of the people involved. What’s more, in some cases, a reduced dimensionality dataset can be more effective as there is less noise in it. However, conventional dimensionality reduction methods don’t always do the trick due to the inherent limitations they have. For example, PCA only considers linear relationships among the variables and offers only linear combinations of features as a solution.
Of course, other people are not sitting idle when it comes to this issue. There are several dimensionality reduction options that are being pursued, the most interesting of which is autoencoders. This AI-based method involves a data-driven approach to figuring out the nature of the data and creating new variables that can represent the underlying signal, by minimizing the error. The issue with this is that it often requires a lot of data and some specialized know-how in order to configure optimally. Also, this whole process may be fairly slow, due to the large number of computations involved.
An alternative approach has to do with feature fusion in a non-AI way. The idea is to maintain transparency to the extent this is possible, while at the same time optimize the whole process in terms of speed. The use of multiple operators, some linear and some non-linear, is essential, while the option of dropping useless features is also very useful. Naturally, this whole process would be more effective in the presence of a target variable, but it should be able to work without it, for better applicability. Whatever the case, the use of a metric able to handle non-linear correlations is paramount since the conventional correlation metric used leaves a lot to be desired.
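To see why the conventional correlation metric leaves a lot to be desired, consider the minimal sketch below. It assumes that "the conventional correlation metric" means Pearson's coefficient, and it uses a toy feature on a symmetric range; the `pearson` helper and variable names are illustrative.

```python
def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Toy feature covering a symmetric range around zero
x = [i / 10 for i in range(-50, 51)]

y_linear = [3 * v + 1 for v in x]   # perfectly linear dependence
y_quadratic = [v * v for v in x]    # perfectly deterministic, but non-linear

r_linear = pearson(x, y_linear)
r_quadratic = pearson(x, y_quadratic)

print(f"linear relationship:    r = {r_linear:.3f}")
print(f"quadratic relationship: r = {r_quadratic:.3f}")
```

The quadratic target is fully determined by the feature, yet its Pearson coefficient is essentially zero, so a fusion process guided by this metric alone would discard a perfectly informative feature.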
Based on all this, it’s clear that the dimensionality reduction area is still capable of enhancements. Despite the great work that has been done already, there is still room for new methods that can address the limitations the existing methods have, which aren’t going away any time soon. Perhaps it would be best to explore this methodology of data engineering more, instead of focusing on the latest and greatest system, which although intriguing, may sacrifice too much (e.g. transparency) in the name of accuracy, a trade-off that may no longer be cost-effective. Something to think about...
In an interview I recently watched, Elon Musk put forward the case of a utility (objective) function for a hypothetical advanced A.I. (basically an AGI) and how special attention must be given to such a task to avoid undesirable results. So, he suggested we use some utility function some person had recommended (probably an A.I. expert), namely that of maximizing “freedom of action for everyone,” something that’s quite reasonable and perhaps even profound if you think about it. However, if you think more about it, it becomes evident that it’s a terrible, terrible idea!
First of all, I mean no disrespect to Elon Musk. I think many of the things he’s created are great, even if some of his ideas are somewhat extreme. So even if he is not a role model of mine, I admire him as a tech entrepreneur and find that he has a lot to offer to the world through his businesses and his ideas for a better world. Except of course his idea for a utility function; that would be catastrophic, though I’m sure that in his mind it’s a brilliant solution to the utility function problem.
For starters, freedom is a very abstract concept even if it’s made more specific by the term “of action” to clarify it. How do you measure freedom of action? How would an A.I. understand this concept, especially if it never gets to experience it? Then, would maximum freedom be a good thing necessarily? Isn’t that a form of anarchy in a way? These are things that need to be addressed before asking an A.I. engineer to implement such a function for this hypothetical A.I. So, unless we figure this out, we cannot be sure that this A.I. will be benign, even if its creators have the best intentions in the world for it.
For example, an A.I. that makes use of this utility function may accelerate the depletion of natural resources of this planet (and any other planet it has access to), in order to ensure that everyone, even some random criminal on the streets or an inmate in a high security prison, has as much freedom of action as possible. Do you see where I’m going with this? Perhaps I’d better stop here before this whole post turns into some dystopian scenario or something.
The utility function problem is a difficult one and in all fairness Elon Musk is not someone knowledgeable enough in A.I. to be able to provide a bullet-proof solution to it. He may know a lot about the topic but I doubt he’s ever created an A.I. system from scratch. And unless you are close to the metal about these things, any ideas you have about how things should be regarding the high-level aspects of such complex systems are just opinions on the matter, not serious candidates for a solution to the problem at hand. The latter would be something that has legs and right now it seems that Mr. Musk’s suggestion is floating in the clouds, just like the ideas of many futurists when they talk about A.I. Perhaps that’s why many people don’t take Elon Musk’s warnings about A.I. very seriously, although I believe that’s one of the things he’s got right.
Despite the inevitable risks such an endeavor has, I’ll venture to make a suggestion of my own for a utility function, namely one that evolves over time. In other words, I propose a narrow A.I. whose sole purpose is to optimize the utility function of the AGI, perhaps in a Reinforcement Learning fashion, based on the feedback it receives from other people, while it starts with a utility function that’s as risk-free as possible (based on some simulations we run before we deploy it to the AGI). Some core heuristics may be in place to ensure a large enough diversity of signals that this A.I. will take into account, coordinating the various objectives / values that the AGI will have to uphold. Besides, it would be naive to assume that a human being, no matter how knowledgeable, can be in a position to come up with a utility function that can apply to some creature more intelligent than all the people in the world, forever.
If our own evolution has taught us anything, it is that there are no absolutes in nature and that we evolve to become better and adjust our values according to the circumstances we face and the challenges we wish to overcome. Why should an AGI be any different, considering that it’s created in our own image?
Trinary Logic is not something new. It’s been around for decades, though it was more of a mathematical / high-level framework. I should know, as I did my Masters thesis on this subject and how it applies to GIS. I even wrote code implementing the corresponding model I came up with, though in today’s programming world it seems like legacy code... Anyway, bottom line is that Trinary Logic is useful and could have a place in modern Information Systems, including data analytics projects. The question is, could it be applicable to A.I. too?
The answer is, as usual, “it depends.” Trinary Logic on its own is quite limited and unless you are familiar with its 700+ gates, it may be like any novel idea: interesting but not exactly something worth delving into. After all, just like any system of reasoning, Trinary Logic is meaningless without an in-depth understanding of its key contribution to the thorny issue we always tackle through reasoning: handling uncertainty effectively.
Uncertainty, oftentimes modeled as noise or randomness (depending on who you ask), is everywhere. Since we cannot eliminate it without damaging the signal too, we find ways to cope with it. Trinary Logic offers an interesting way of doing that through the 3rd value of its variables, namely the “indifferent” state. Something can be True, False, or Indifferent, the latter being something in-between. These are the states of those intermediate values in the membership functions of fuzzy variables, in Fuzzy Logic. The latter is a well-known and quite established A.I. framework with lots of applications in data science. Do you see where I’m going with this?
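As a rough illustration of three-valued reasoning, the sketch below encodes the three states as -1, 0 and 1 and implements three basic gates. It assumes Kleene's strong three-valued logic, which is only one of many possible gate definitions and not necessarily the one used in my thesis; the names are illustrative.

```python
# Encoding of the three states (an assumption made for this sketch)
FALSE, INDIFFERENT, TRUE = -1, 0, 1


def t_not(a):
    """Negation: True and False swap, Indifferent stays Indifferent."""
    return -a


def t_and(a, b):
    """Conjunction: the result is only as true as the least true input."""
    return min(a, b)


def t_or(a, b):
    """Disjunction: the result is as true as the most true input."""
    return max(a, b)


names = {FALSE: "False", INDIFFERENT: "Indifferent", TRUE: "True"}

# Print the full truth table for AND and OR
for a in (FALSE, INDIFFERENT, TRUE):
    for b in (FALSE, INDIFFERENT, TRUE):
        print(f"{names[a]:>11} AND {names[b]:<11} = {names[t_and(a, b)]:<11} | "
              f"OR = {names[t_or(a, b)]}")
```

Note how uncertainty propagates only when it matters: False AND Indifferent is still False, while True AND Indifferent remains Indifferent.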
So Trinary Logic is a framework for reasoning, much like Fuzzy Logic, but the latter is an A.I. framework too, so Trinary Logic is A.I. also, right? Well, no. Trinary Logic is a mathematical construct, so unless it is applied to A.I. programmatically, and as a well-defined process, it is yet another concept that can’t even fetch an academic publication! But if it were to manifest as a heuristic of sorts and add value to a process in the A.I. sphere, things would be different.
Enter the Trinary Curve, a heuristic (or meta-heuristic, depending on how you use it) that encapsulates Trinary Logic in a simple yet not simplistic way, turning an input signal into something that an A.I. agent can understand and work with. Namely, it can engineer a new variable in the [-1, 1] interval (notice the closed brackets in this case) that enables the corresponding module to have the in-between state of uncertainty more evident. As a result, the A.I. agent is allowed to be unsure about something and examine it more closely, given the right architecture, instead of working with what it has and hoping for the best. Note that the Trinary Curve can be customized, while its output can be normalized to a different interval (always closed) if needed. The Trinary Curve is differentiable throughout the space in which it is defined, while it’s easy to use programmatically (at least in Julia).
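The exact formula of the Trinary Curve is not given in this post, so the sketch below is only a hypothetical stand-in with similar qualitative properties: a smooth, differentiable mapping to [-1, 1] with a flat "unsure" zone around zero. The function `trinary_like_curve`, its `steepness` parameter, and the `rescale` helper are all assumptions made for illustration.

```python
import math


def trinary_like_curve(x, steepness=1.0):
    """Hypothetical stand-in, NOT the actual Trinary Curve.

    tanh of a cubic is flat around zero (the in-between state) and
    saturates towards -1 or 1 for strongly negative or positive inputs.
    """
    return math.tanh(steepness * x * x * x)


def rescale(value, low, high):
    """Map a value from [-1, 1] to another closed interval [low, high]."""
    if low >= high:
        raise ValueError("low must be smaller than high")
    return low + (value + 1.0) / 2.0 * (high - low)


for signal in (-3.0, -1.0, -0.2, 0.0, 0.2, 1.0, 3.0):
    out = trinary_like_curve(signal)
    print(f"input {signal:+.1f} -> {out:+.4f} -> {rescale(out, 0.0, 1.0):.4f}")
```

Weak signals stay close to the middle value, which is what lets a downstream module treat them as uncertain rather than forcing a True or False reading.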
Perhaps the Trinary Curve is a novelty and an A.I. system can evolve adequately without it. However, it is something worth considering, instead of just experimenting with the countable parameters of existing A.I. systems solely. After all, Trinary Logic is compatible with existing A.I. frameworks so if it’s not utilized, it’s primarily because of some people’s unwillingness to think outside the box, and that’s something that doesn’t have any uncertainty about it...
This week I'm away, as I prepare for my talk at the Consumer Identity World EU 2018 conference in Amsterdam (the same conference takes place in a couple of other places, but I'll be attending just the one in Europe). So, if you are in the Dutch capital, feel free to check it out. More information on my talk here. Cheers!
Dichotomy: a binary separation of a set into two mutually exclusive subsets
Data Science: the interdisciplinary field for analyzing data, building models, and bringing about insights and/or data products, which add value to an organization. Data science makes use of various frameworks and methodologies, including (but not limited to) Stats, ML, and A.I.
After getting these pesky definitions out of the way, in an effort to mitigate the chances of misunderstandings, let’s get to the gist of this fairly controversial topic. For starters, all this information here is for educational purposes and shouldn’t be taken as gospel since in data science there is plenty of room for experimentation and someone adept in it doesn’t need to abide by this taxonomy or any rules deriving from it.
The inaccurate dichotomy issues in data science, however, can be quite problematic for newcomers to the field as well as for managers involved in data related processes. After all, in order to learn about this field a considerable amount of time is required, something that is not within the temporal budget of most people involved in data science, particularly those who are starting off now. So, let’s get some misconceptions out of the way so that your understanding of the field is not contaminated by the garbage that roams the web, especially the social media, when it comes to data science.
Namely, there are (mis-)infographics out there that state that Stats and ML are mutually exclusive, or that there is no overlap between non-AI methods and ML. The latter claim amounts to saying that ML is merely a part of AI, something that is considered blasphemy in the ML community. The reason is simple: ML as a field was developed independently of AI and has its own applications. AI can greatly facilitate ML through its various network-based models (among other systems), but ML stands on its own. After all, many ML models are not AI related, even if AI can be used to improve them in various ways. So, there is an overlap between ML and AI, but there are non-AI models that are under the ML umbrella.
Same goes with Statistics. This proud sub-field of Mathematics has been the main framework for data analytics for a long time before ML started to appear, revolting against the model-based approach dictated by Stats. However, things aren’t that clear-cut. Even if the majority of Stats models are model-based, there are also models that are hybrid, having elements of Stats and ML. Take Bayesian Networks for example, or some variants of the Naive Bayes model. Although these models are inherently Statistical, they have enough elements of ML that they can be considered ML models too. In other words, they lie on the nexus of the two sets of methods.
What about Stats and AI? Well, Variational AutoEncoders (VAEs) are an AI-based model for dimensionality reduction and data generation. So, there is no doubt that they lie within the AI set. However, if you look under the hood you’ll see that they make use of Stats for figuring out what the data they generate would be like. Specifically, they make use of distributions, a fundamentally statistical concept, for the understanding and the generation of the data involved. So, it wouldn’t be far-fetched to put VAEs in the Stats set too.
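The statistical core of a VAE can be sketched without any neural network. The snippet below assumes the common setup with a one-dimensional Gaussian latent variable and a standard normal prior. It shows the sampling step (the so-called reparameterization trick) and the closed-form KL divergence used as a regularizer; the function names are illustrative.

```python
import math
import random


def sample_latent(mu, log_var, rng):
    """Draw z ~ N(mu, sigma^2) as z = mu + sigma * eps, with eps ~ N(0, 1)."""
    sigma = math.exp(0.5 * log_var)
    eps = rng.gauss(0.0, 1.0)
    return mu + sigma * eps


def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ) for a single latent dimension."""
    return 0.5 * (mu ** 2 + math.exp(log_var) - 1.0 - log_var)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Pretend an encoder produced these parameters for one data point
mu, log_var = 2.0, 0.0
samples = [sample_latent(mu, log_var, rng) for _ in range(5000)]
sample_mean = sum(samples) / len(samples)

print(f"mean of sampled latents: {sample_mean:.3f}")
print(f"KL divergence to prior:  {kl_to_standard_normal(mu, log_var):.3f}")
```

Everything here is ordinary probability theory: a Gaussian distribution, a sampling procedure, and a divergence between two distributions.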
From all this I hope it becomes clear that the taxonomy of data science models isn’t as rigid as it may seem. If there was a time when this rigid separation of models made sense, that time is now gone as hybrid systems are becoming more and more popular, while at the same time the ML field expands in various directions outside AI. So, I’d recommend you take those (mis-)infographics with a pinch of salt. After all, most likely they were created by some overworked employee (perhaps an intern) with a limited understanding of data science.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.