Sounds like a bold statement, doesn’t it? Well, regardless of how it sounds, this is a project I’ve been working on for a long time and refining over the past couple of weeks, along with some additional testing. So, this is not some half-baked idea like many of the things tech evangelists write about to promote one agenda or another. This is the kind of stuff I’d publish a paper on if I still cared about publications.
In a nutshell, the diversity heuristic is a simple metric for measuring how diverse the points of a dataset are. This is quite different from spread metrics (e.g. standard deviation): a spread metric describes how dispersed a distribution is and can take any non-negative value, while diversity is bounded between 0 and 1, inclusive. So, if the vast majority of the data points are crammed into one or two places, the diversity is close to 0, while if the data points are more or less evenly distributed across the data space, the diversity approaches 1. Interestingly, even a random set of points has a diversity score that’s less than 1, since perfect uniformity is extremely rare unless you are using a really good random number generator!
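The post doesn’t reveal the heuristic itself, so to make the properties above concrete, here is a toy stand-in of my own (not the author’s actual metric): bin the data space into a grid and score diversity as the normalized entropy of the cell occupancies. Like the heuristic described, it lives in [0, 1], hits 0 when all points share one spot, and approaches 1 for an even spread. The function name and binning scheme are purely illustrative assumptions.

```python
import numpy as np

def diversity_sketch(X, bins=10):
    """Toy diversity score in [0, 1] -- an illustration, NOT the author's heuristic.

    Bins each dimension of the data's bounding box into `bins` intervals,
    then returns the Shannon entropy of the grid-cell occupancies,
    normalized by the entropy of a perfectly uniform fill.
    Assumes every dimension has a nonzero range.
    """
    X = np.asarray(X, dtype=float)
    n, d = X.shape
    # Interior bin edges per dimension (over the data's own bounding box)
    cells = np.stack(
        [np.clip(np.digitize(X[:, j],
                             np.linspace(X[:, j].min(), X[:, j].max(), bins + 1)[1:-1]),
                 0, bins - 1)
         for j in range(d)],
        axis=1,
    )
    # Occupancy counts of the distinct cells that are actually used
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    entropy = -(p * np.log(p)).sum()
    max_entropy = np.log(float(bins) ** d)  # perfectly even fill of the grid
    return float(entropy / max_entropy)
```

With this sketch, two tight clumps score low, while uniform random points score high but (as the post notes) still a bit below 1, since random samples never fill the grid perfectly evenly.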
Also, this diversity metric is pretty fast because, well, if a heuristic is to be useful, it has to scale well. So, I designed it to be cheap to compute, even for a multi-dimensional dataset, which means it can be evaluated many times without the computer overheating. As a result, it is fairly easy and computationally cheap to build a diversity-based sampling process, i.e. a sampling method that aims to optimize the yielded sample in terms of diversity. Naturally, a diverse sample is bound to retain more of the original dataset’s signal, though some information loss is inevitable. Nevertheless, the diverse sample, which usually has higher diversity than the original dataset, can serve as a proxy for the original dataset in a dimensionality reduction process such as PCA. Interestingly, the meta-features that stem from the sample are not exactly the same as those of the original dataset, but they are good enough in terms of predictive power. So, by taking the rotation matrix of the PCA model fitted on the sample, we can use it to reduce the original dataset, making dimensionality reduction a piece of cake.
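Since the author’s sampler isn’t described, here is a plausible stand-in for the overall pipeline: greedy farthest-point sampling (a common diversity-seeking sampler, used here purely as an assumption) to pick a spread-out sample, PCA fitted on that sample via NumPy’s SVD, and then the sample’s rotation matrix applied to the full dataset. All function names and parameters are illustrative.

```python
import numpy as np

def farthest_point_sample(X, k, seed=0):
    """Greedy max-min sampling: repeatedly pick the point farthest from
    all points chosen so far. A simple stand-in for a diversity-based sampler."""
    rng = np.random.default_rng(seed)
    chosen = [int(rng.integers(X.shape[0]))]
    dist = np.linalg.norm(X - X[chosen[0]], axis=1)  # distance to nearest chosen point
    for _ in range(k - 1):
        idx = int(dist.argmax())
        chosen.append(idx)
        dist = np.minimum(dist, np.linalg.norm(X - X[idx], axis=1))
    return X[np.array(chosen)]

def pca_rotation(sample, n_components):
    """Fit PCA on the (diverse) sample only; return its mean and rotation matrix."""
    mu = sample.mean(axis=0)
    _, _, vt = np.linalg.svd(sample - mu, full_matrices=False)
    return mu, vt[:n_components].T  # columns = principal directions

# Reduce the FULL dataset using the rotation matrix learned from the sample
rng = np.random.default_rng(42)
X = rng.normal(size=(5000, 10))                    # stand-in dataset
sample = farthest_point_sample(X, k=200)           # diverse subset
mu, R = pca_rotation(sample, n_components=3)
X_reduced = (X - mu) @ R                           # shape (5000, 3)
```

The payoff described in the post shows up in the last line: the expensive fitting step touches only 200 points, yet the resulting rotation reduces all 5,000.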
So, there you have it: diversity can be used to reduce a dataset not just in terms of the number of data points it has (sampling) but also in terms of its dimensions. I know this may sound like a very simple process, but considering the computational cost of the alternative (not using diversity-based sampling), I believe it’s a step forward. Naturally, this is just one application of this new heuristic, which can perhaps help in other aspects of data science.
Anyway, I’d love to write more about this, but I’m saving it for a video I plan to make on this topic. Currently, I’m still busy with the new book, so stay tuned...
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.