OK, this title may sound a bit heavy, especially for this time of year, so let me break it down for you. There are various correlation metrics out there that take two variables (let's call them x and y) and measure their relationship. More often than not, these metrics focus on the linear aspects of this relationship and get confused by the non-linear ones. For example, a correlation metric like Pearson's Correlation can tell you that a variable y defined as 2x + 5 is strongly correlated to x (a shocker, isn't it!), but if a variable z defined as exp(x^2 + 1) were used instead, well, Pearson's Correlation might struggle with that. A mathematician or even a Stats professional would assure you that there is a non-linear relationship between the two variables (x and z), but to measure it they'd have to rely on a plot of the two variables or on some transformation of them (e.g., applying log() to z). Things get even more complicated if the relationship is not monotonic, e.g. that of x and a variable w defined as cos(x). If x spans a range that is symmetric around zero or covers several periods, Pearson's correlation most likely won't find anything there (a value close to 0), even though the Math or Stats professional mentioned previously would be sure there is a relationship. So, what gives?
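Before getting to that, here is a quick way to see the Pearson behaviour described above for yourself. This is only a minimal sketch: the grid for x (evenly spaced on [-3, 3], symmetric around zero) and the helper name `pearson` are my own assumptions for illustration, and on a one-sided range such as [0, 1] the numbers would look quite different.

```python
import math

def pearson(a, b):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((u - mean_a) * (v - mean_b) for u, v in zip(a, b))
    var_a = sum((u - mean_a) ** 2 for u in a)
    var_b = sum((v - mean_b) ** 2 for v in b)
    return cov / math.sqrt(var_a * var_b)

# Assumption: x is an evenly spaced grid on [-3, 3], symmetric around zero.
x = [-3 + 6 * i / 2000 for i in range(2001)]

y = [2 * v + 5 for v in x]             # linear in x
z = [math.exp(v ** 2 + 1) for v in x]  # non-linear, not monotonic on this range
w = [math.cos(v) for v in x]           # non-linear, not monotonic on this range

r_xy = pearson(x, y)  # essentially 1: Pearson captures the linear relationship
r_xz = pearson(x, z)  # essentially 0: the relationship is there, Pearson misses it
r_xw = pearson(x, w)  # essentially 0 as well

# A transformation can rescue the x-z pair: log(z) = x^2 + 1 is linear in x^2.
r_x2_logz = pearson([v ** 2 for v in x], [math.log(v) for v in z])

print(f"x vs y:        {r_xy:.3f}")
print(f"x vs z:        {r_xz:.3f}")
print(f"x vs w:        {r_xw:.3f}")
print(f"x^2 vs log(z): {r_x2_logz:.3f}")
```

Note that the log() transformation only helps here once x is squared as well, which is exactly the kind of prior knowledge about the relationship you would rather not need.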
Well, that is a big question, and if I were to answer it here, it would shake your belief in Stats like a super quake, similar to the one that brought San Francisco down over a century ago. Interestingly, most Stats concepts are from around that same time, perhaps a bit older. So, you've got to give those guys a break, since they didn't know any better and didn't have the tools we have at our disposal. Given the circumstances, they did a pretty good job of defining the metrics they did and weaving the fabric of a theory around their methods. Come to think of it, if modern mathematicians were like them, we'd be reasoning in high-dimensional terms now, instead of relying on these old-fashioned formulas and techniques.
I propose a method based on the BROOM framework that looks into the non-linear and non-monotonic artifacts of a pair of variables to establish their relationship. This metric, which I call rbc (range-based correlation, as it belongs to the ranges part of the framework), explores the two variables in an entirely data-driven manner, making no assumptions whatsoever about their distributions or their other aspects. As long as they are normalized, they are good to go. And this metric, contrary to all the other correlation metrics I've tried, yields a correlation of 0.99 for the x-w pair and a similar figure for the x-z one. When you compare x with some random variable q (with values in the [0, 1] interval), it yields a weak correlation (usually between 0.1 and 0.2). As a result, we can deduce that it's a worthwhile metric for measuring the relationship between two variables, taking into account all non-linear artifacts while being unaffected by the lack of a monotonic pattern between the two variables. If you are interested in learning more, feel free to contact me. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.