Every data set is a multi-dimensional structure, a crystallization of information. As such, it is best described through mathematics, particularly Geometry (a name that literally means "Earth measurement", and a field that underpins many sciences, such as Physics). You may have a different view of Geometry based on your high school education, but let me tell you this: the field is much more than theorems and diagrams, which, although essential, are merely its backbone. Geometry is not just theoretical but also very practical, and the fact that it applies to the data we deal with in data science attests to that.
When it comes to Geometry in data science, a couple of metrics come to mind: the Index of Discernibility (ID) and Density. (There is also an optimizer I've developed called DCO, short for Divide and Conquer Optimizer, but it doesn't scale beyond 5 dimensions, so I won't talk about it in this article.) Both metrics are useful in assessing the feature space and in aiding various data engineering tasks, such as feature selection (ID can be used for evaluating features) and data generation (through an accurate assessment of the dataset's "hot spots"). Both also work with either hyperspheres (spheres in a multidimensional space) or hyper-rectangles. The latter are a more efficient way of handling hyperspaces, so they are sometimes preferable.
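To make the hypersphere versus hyper-rectangle distinction concrete, here is a minimal Python sketch. It is not the Density metric itself, whose exact formula is not given in this article; it simply counts the points that fall inside each kind of region around a center and divides by the region's volume. The function names, the toy points, and the choice of NumPy are all assumptions made for illustration.

```python
import math
import numpy as np

def sphere_count(points, center, radius):
    """Count points whose Euclidean distance to `center` is at most `radius`."""
    dists = np.linalg.norm(points - center, axis=1)
    return int(np.sum(dists <= radius))

def box_count(points, center, half_width):
    """Count points inside the axis-aligned hyper-rectangle around `center`.
    Only per-coordinate comparisons are needed, no distance computation."""
    inside = np.all(np.abs(points - center) <= half_width, axis=1)
    return int(np.sum(inside))

def sphere_volume(d, radius):
    """Volume of a d-dimensional hypersphere of the given radius."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1) * radius ** d

def box_volume(d, half_width):
    """Volume of a d-dimensional hypercube with side 2 * half_width."""
    return (2 * half_width) ** d

# Toy 2-D data (hypothetical values, for illustration only)
points = np.array([[0.0, 0.0], [0.5, 0.5], [0.9, 0.9], [2.0, 2.0], [0.0, 0.8]])
center = np.array([0.0, 0.0])

# Illustrative count-per-volume densities, an assumed definition
sphere_density = sphere_count(points, center, 1.0) / sphere_volume(2, 1.0)
box_density = box_count(points, center, 1.0) / box_volume(2, 1.0)
print(sphere_density, box_density)
```

Note that the point (0.9, 0.9) falls inside the rectangle but outside the sphere, so the two regions can disagree near the corners. One plausible reason the rectangular version is cheaper is that it needs only coordinate-wise comparisons rather than a distance calculation per point.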
Metrics like these have a lot to offer in data science and A.I. In particular, ID is useful for evaluating features in classification problems, while Density is useful for assessing the data space in general. Naturally, both are very useful in exploratory data analysis (EDA) as well as in other aspects of data engineering. Whereas ID is geared towards classification, Density is more flexible, since it doesn't require a target variable at all. Note that Density is largely misunderstood, since many people view it as a probability-related metric, which is not the case. Neither ID nor Density has anything to do with Statistics, even though Statistics has its own set of useful metrics.
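As a rough illustration of how a geometry-based score can evaluate a feature for classification, consider the Python sketch below. It is emphatically not the Index of Discernibility, whose formula is not reproduced in this article; it is a simple stand-in that measures, for each point, the fraction of neighbors within a given radius that share its class, and then averages that fraction. The function name `neighborhood_purity`, the radius, and the toy data are all hypothetical.

```python
import numpy as np

def neighborhood_purity(X, y, radius):
    """Average fraction of same-class neighbors within `radius` of each point.
    A stand-in score for illustration, not the published ID formula."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a single feature as an (n, 1) matrix
    y = np.asarray(y)

    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    neighbors = dists <= radius
    np.fill_diagonal(neighbors, False)  # a point is not its own neighbor

    same_class = y[:, None] == y[None, :]
    counts = neighbors.sum(axis=1)
    same_counts = (neighbors & same_class).sum(axis=1)

    has_neighbors = counts > 0
    if not has_neighbors.any():
        return float("nan")  # no point has any neighbor at this radius
    return float(np.mean(same_counts[has_neighbors] / counts[has_neighbors]))

# Toy labels and two candidate features (hypothetical values)
y = [0, 0, 0, 1, 1, 1]
good_feature = [0.0, 0.1, 0.2, 1.0, 1.1, 1.2]  # classes occupy separate regions
bad_feature = [0.0, 1.0, 0.1, 1.1, 0.2, 1.2]   # classes are mixed together

good_score = neighborhood_purity(good_feature, y, radius=0.3)
bad_score = neighborhood_purity(bad_feature, y, radius=0.3)
print(good_score, bad_score)
```

A feature whose classes occupy separate regions scores close to 1, while a feature that mixes the classes scores much lower, which is the kind of signal one would use for feature selection. The pairwise distance matrix makes this sketch quadratic in the number of points, so it is suitable only for small samples.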
Beyond these points, there are several more things worth pondering when it comes to geometry-based metrics. Specifically, visualization is key to understanding the data at hand, and whenever possible, it's good to combine it with geometry-based metrics like ID and Density. The idea is that you view the data set both as a collage of data points mapped on a grid and as a set of entities with certain characteristics. The ID score and the density value are a couple of such characteristics. In any case, it's good to remember that geometry-based metrics like these are useful only when used intelligently, since no metric out there is a panacea for data science tasks.
So what's next? What can you do to put all this into practice? First of all, you can learn more about the Index of Discernibility and other heuristic metrics in my latest book, Julia for Machine Learning (Technics Publications). You can also experiment with ID and see how you can employ it in your data science projects. For the more adventurous among you, there is also the option of coming up with your own geometry-based metrics to augment your data science work. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.