I've been trying to answer this question for years. Well, not many years, but still, at least since the second half of the previous decade. Why? Well, I've always liked to explore the boundaries between the continuous and the discrete and since I finally internalized the teaching that everything in this universe is discrete (see: Quantum Physics), I decided to explore that angle and see if there was indeed a way to turn a continuous variable into a discrete one, with minimal information loss.
Over the past few months, I've developed three distinct approaches, depending on how distinct the values of the target variable are (see what I did there?). Let's start with something simple: no target variable at all. So, how can we discretize a continuous variable x? Well, you have to binarize it until there is no more binarization possible! But how do you optimally binarize a variable? That's something that involves densities after you handle all outliers and inliers in x, of course. How do you do that? Well, that's a topic that can fill a whole book chapter, so I'll have to draw the line here, I'm afraid.
What about when there is a target variable? Let's start with a binary one as it's simpler this way. We can employ a robust similarity metric that can assess the similarity of two binary variables, regardless of their alignment or any similarities due to chance. Fortunately, I've developed one such metric, which I call holistic symmetric similarity (HSS), which also works with all sorts of discrete variables. So, by using this metric, we can optimize the split to maximize the HSS score between the binarized x and the target variable y. The same approach works if y is discrete but not binary since I've generalized HSS to handle nominal variables too.
Ok, but what about when y is continuous, though? Well, that takes a bit more creativity since it's not as simple a task as it may seem. Fortunately, it's doable and relatively light, computationally speaking. We can find the threshold that maximizes a custom correlation metric that becomes larger once any non-linearities are tackled. This process doesn't have to be rocket science since I'm sure you can come up with a metric like that if you've been mentored by someone worth his salt in data science. Of course, you could use a translinearity correlation metric, yet, I wouldn't recommend that since it would inevitably pick up signals you wouldn't want it to, plus it's bound to be more computationally heavy.
So, there you have it. You can binarize and therefore discretize any feature x you like, with or without a target variable. The latter can be binary, discrete, or even continuous, depending on the problem at hand. Such a process can help you preserve computational resources and perhaps even enable you to make better and more transparent models (after all, binary variables tend to be easier on the mind, not just on the computer). All this I've done in the OD.jl script, which I cannot share here, unfortunately, as it has dependencies on proprietary code (the BROOM framework), which I'd rather not give away. Still, if you wish to explore this topic further, we can do that in a one-on-one mentoring session or two, given that you have the required commitment to the craft and a genuine interest to learn more about it. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.