Although there is a plethora of heuristics for assessing the similarity of two arrays, few of them can handle different sizes in these arrays and even fewer can address various aspects of these pieces of data. PCM is one such heuristic, which I’ve come up with, in order to answer the question: when are two arrays (vectors or matrices) similar enough? The idea was to use this metric as a proxy for figuring out when a sample is representative enough in terms of the distributions of its variables and able to reflect the same relationships among them. PCM manages to accomplish the first part.
PCM stands for Possibility Congruency Metric and it makes use primarily of the distribution of the data involved as a way to figure out if there is congruency or not. Optionally, it uses the difference between the mean values and the variance too. The output is a number between 0 and 1 (inclusive), denoting how similar the two arrays are. The higher the value, the more similar they are. Random sampling provides PCM values around 0.9, but more careful sampling can reach values up closer to 1.0, for the data tested. Naturally, there is a limit to how high this value can get in respect with the sample size, because of the inevitable loss of information through the process.
PCM works in a fairly simple and therefore scalable manner. Note that the primary method focuses on vectors. The distribution information (frequencies) is acquired by binning the variables and examining how many data points fall into each bin. The number of bins is determined by taking the harmonic mean of the optimum numbers of bins for the two arrays, after rounding it to the closest integer. Then the absolute difference between the two frequency vectors is taken and normalized. In the case of the mean and variance option being active, the mean and variance of each array are calculated and their absolute differences are taken. Then each one of them is normalized by dividing with the maximum difference possible, for these particular arrays. The largest array is always taken as a reference point.
When calculating the PCM of matrices (which need to have the same number of dimensions), the PCM for each one of their columns is calculated first. Then, an average of these values is taken and used as the PCM of the whole matrix. The PCM method also yields the PCMs of the individual variables as part of its output.
PCM is great for figuring out the similarity of two arrays through an in-depth view of the data involved. Instead of looking at just the mean and variance metrics, which can be deceiving, it makes use of the distribution data too. The fact that it doesn’t assume any particular distribution is a plus since it allows for biased samples to be considered. Overall, it’s a useful heuristic to know, especially if you prefer an alternative approach to analytics than what Stats has to offer.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.