Don't let the somewhat philosophical title mislead you! This isn't an article about the abstract aspects of the world, but one pointing to a very real problem in our thinking and how it makes experiencing life difficult. Namely, it's an age-old issue with our logic and reasoning: as a data modeling tool, it fails to capture the essence of the stuff it aims to understand. In other words, it's a data problem that influences how we process information and understand life in its various aspects. In more practical terms, it is the loss of information through predefined categories that may or may not have relevance to life itself.
The problem starts with how we reason logically. In conventional logic, there are two states, true or false. It's the simplest possible way of thinking about something, assuming it's simple enough to break it down into binary components. If we look at a particular plant, for example, we can say that it's either alive or dead. Fair enough. Of course, this approach may not scale well for a group of such plants (e.g., a forest). Can a forest's state be reduced to this all-or-nothing categorization? If so, what do we sacrifice in the process?
Things get more complicated once we introduce additional categories (or classes in Data Science). We can say that a particular plant is either a sprouting seed (still well within the ground), a newly blossomed plant (above the ground too, but just barely), a mature plant (possibly yielding seeds of its own), or a piece of wood that's lying there lifeless. These four categories may describe the state of a plant in more detail and provide better insight regarding how the plant is faring. But whether these categories have any real meaning is debatable. Most likely, unless you are a botanist, you wouldn't care much about this classification, as you may come up with your own that is better suited for the plant you have in mind. Naturally, the same issue with scalability exists with this classification too, as it's not as straightforward as it seems. A forest may exist in various states at the same time. How would you aggregate the states of its members? Is that even possible?
In analytics, we tend to avoid such problems as we often deal with continuous variables. These variables can be aggregated very easily and we can teach a computer to reason with them very efficiently, perhaps more efficiently than we do. So, a well-trained computer model can make inferences about the plants we are dealing with based on the continuous variables we use to describe those plants. Practically, that gives rise to various sophisticated models that appear to exhibit a kind of intelligence different from our conventional intelligence. This is what we refer to as AI, and it's all the rage lately.
So, where does conventional (human) intelligence end and AI begin? Or, to generalize, where do any categories end and others begin? Can we even answer such a question when we are dealing with qualitative matters? Perhaps that's why we have this inherent need to quantify everything, especially for stuff we can measure. But how does this measurement affect our thinking? It's doubtful that many people stop to ask questions like that. The reason is simple: it's very challenging to answer them in a way agreeable to many people.
This lack of consensus is what gave rise to heuristics of all sorts over the years. We cannot always reason with complexity, and sometimes we just need to have a crisp answer. Heuristics provide that for us, empowering us in the process. These shortcuts are super popular in data science too, even if few people acknowledge the fact. There is something uncomfortable about accepting uncertainty, especially the kind that's impossible to tackle definitively. Some things in nature, however, aren't black and white, nor do they conform to whatever taxonomy we have designed for them. They just are, giddily existing in some spectrum that we may choose to ignore as it's easier to view them in terms of categories. Categories simplify things and give us comfort, much like when we organize our notes in some predefined sections in a notebook (physical or digital) for easier referencing.
The problem with categories arises when we think of them as something real and perhaps more important than the phenomena they aim to model. John Michael Greer makes this argument very powerfully in his book "The Retro Future," where he criticizes various things we have taken for granted due to the nature of our technology-oriented culture. But why are categories such a big problem, practically? Well, categories are by definition simplifications of something that would otherwise be expressed as a continuum, a spectrum of values. So, by applying a categorization to that continuum, we lose information and make the transition back to the original phenomenon next to impossible.
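To make the information loss concrete, here is a minimal sketch in Python. The height measurements, the cut points, and the labels are all hypothetical, chosen only for illustration; the point is simply that several distinct values end up under the same label, and the label alone cannot tell you which value it came from.

```python
from bisect import bisect_right
from collections import defaultdict

# Hypothetical cut points (in cm) and labels, chosen only for illustration.
CUTS = [5.0, 50.0]
LABELS = ["sprout", "young", "mature"]

def to_category(height_cm):
    """Map a continuous height to one of the predefined categories."""
    return LABELS[bisect_right(CUTS, height_cm)]

# Made-up measurements of six different plants.
heights = [1.2, 4.8, 12.0, 49.0, 80.0, 300.0]
categories = [to_category(h) for h in heights]

# Group the original values by the category they fell into.
collapsed = defaultdict(list)
for h, c in zip(heights, categories):
    collapsed[c].append(h)

# Each label stands in for several different original values.
for label in LABELS:
    print(label, "<-", collapsed[label])
```

Six distinct measurements go in, three labels come out. A plant of 80 cm and one of 300 cm are both just "mature," and nothing in the label lets us tell them apart afterwards.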
Additionally, categories are closely related to the binary classification of things, since every categorical variable can be broken down into a series of binary variables. In the previous plant example, we can say that a plant is or isn't a sprouting seed, is or isn't a newly blossomed plant, etc. The cool thing about this, which many data scientists leverage, is that this transition doesn't involve any information loss. Also, if you have enough binary variables about something, you can recreate the categorical variable they were originally derived from.
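This breakdown into binary variables is commonly called one-hot encoding. A minimal sketch of the round trip for the four plant states follows; the function names and the sample data are made up for illustration, and real projects would typically rely on a library for this.

```python
# The four plant states from the example above.
STAGES = ["sprouting seed", "newly blossomed plant", "mature plant", "dead wood"]

def one_hot(stage):
    """Encode a category as one binary flag per possible stage."""
    return [1 if stage == s else 0 for s in STAGES]

def decode(bits):
    """Recover the category from its binary flags (exactly one flag is 1)."""
    return STAGES[bits.index(1)]

# Hypothetical sample of three plants.
plants = ["mature plant", "sprouting seed", "dead wood"]
encoded = [one_hot(p) for p in plants]
decoded = [decode(b) for b in encoded]

print(encoded)
print(decoded == plants)
```

Decoding the flags gives back exactly the categories we started with, which is the sense in which this transition is lossless. Note that it only preserves what the categories already held; whatever was lost when the categories were created stays lost.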
All this mental work (which to a large extent is automated nowadays) makes for a very artificial worldview where everything seems to exist in a series of ones and zeros, trues and falses, veering away from the complexity of the real world. So, it's not that the world itself is very limited, but rather that our logic, and as a result our perception and our mental models of it, are limited. As the statistician George Box famously said, "all models are wrong, but some are useful." My question is: "how much accuracy are we willing to sacrifice to make something useful from the information at hand?"
If you find that heuristics are a worthwhile option for dealing with the complexity of life effectively, you'd be intrigued by how far they can go in data science and AI. There are plenty of powerful heuristics which can simplify the problems we are dealing with without information loss, all while bringing about interesting insights and new ways of applying creativity. My latest book, The Data Path Less Traveled, is a gentle introduction to this topic and is accompanied by lots of code to keep things down-to-earth and practical. Check it out!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.