The PCM Heuristic and Its Uses
Although there is a plethora of heuristics for assessing the similarity of two arrays, few of them can handle arrays of different sizes and even fewer can address more than one aspect of the data. PCM is one such heuristic, which I’ve come up with in order to answer the question: when are two arrays (vectors or matrices) similar enough? The idea was to use this metric as a proxy for figuring out when a sample is representative enough in terms of the distributions of its variables and able to reflect the same relationships among them. PCM manages to accomplish the first part.
PCM stands for Possibility Congruency Metric and it makes use primarily of the distribution of the data involved as a way to figure out if there is congruency or not. Optionally, it uses the differences between the mean values and between the variances too. The output is a number between 0 and 1 (inclusive), denoting how similar the two arrays are. The higher the value, the more similar they are. Random sampling provides PCM values around 0.9, but more careful sampling can reach values closer to 1.0, for the data tested. Naturally, there is a limit to how high this value can get with respect to the sample size, because of the inevitable loss of information through the sampling process.
PCM works in a fairly simple and therefore scalable manner. Note that the primary method focuses on vectors. The distribution information (frequencies) is acquired by binning the variables and examining how many data points fall into each bin. The number of bins is determined by taking the harmonic mean of the optimum numbers of bins for the two arrays and rounding it to the closest integer. Then the absolute difference between the two frequency vectors is taken and normalized. If the mean and variance option is active, the mean and variance of each array are calculated and their absolute differences are taken. Each of these differences is then normalized by dividing it by the maximum difference possible for these particular arrays. The larger array is always taken as the reference point.
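The distribution part of the steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual PCM implementation: the function names are hypothetical, Sturges' rule stands in for the "optimum number of bins" (the text does not say which rule is used), the bins span the range of the larger array, and the normalization simply halves the summed absolute difference of the relative frequencies so that the result falls in [0, 1]. The optional mean and variance terms are left out.

```python
import math
import random


def sturges_bins(n):
    """Assumed 'optimum' bin count (Sturges' rule); the text names no rule."""
    return max(1, math.ceil(math.log2(n)) + 1)


def harmonic_mean(a, b):
    """Harmonic mean of two positive numbers."""
    return 2 * a * b / (a + b)


def rel_freqs(x, lo, hi, k):
    """Relative frequencies of x over k equal-width bins spanning [lo, hi]."""
    counts = [0] * k
    width = (hi - lo) / k
    for v in x:
        i = int((v - lo) / width) if width > 0 else 0
        # Values outside the reference range are clamped into the edge bins.
        counts[min(max(i, 0), k - 1)] += 1
    return [c / len(x) for c in counts]


def pcm_sketch(x, y):
    """Distribution-only similarity in [0, 1]; higher means more similar."""
    ref = x if len(x) >= len(y) else y  # the larger array is the reference
    lo, hi = min(ref), max(ref)
    k = round(harmonic_mean(sturges_bins(len(x)), sturges_bins(len(y))))
    p = rel_freqs(x, lo, hi, k)
    q = rel_freqs(y, lo, hi, k)
    # The summed absolute difference of two frequency vectors is at most 2,
    # so halving it maps the difference to [0, 1] (assumed normalization).
    return 1 - 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Usage: compare a population with a random sample drawn from it.
random.seed(0)
population = [random.gauss(0, 1) for _ in range(2000)]
sample = random.sample(population, 200)
score = pcm_sketch(population, sample)
```

Because the frequencies are relative, the two arrays can have different lengths, which is the property the heuristic is after.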
When calculating the PCM of matrices (which need to have the same number of dimensions, i.e. columns), the PCM for each one of their columns is calculated first. Then, the average of these values is taken and used as the PCM of the whole matrix. The PCM method also yields the PCMs of the individual variables as part of its output.
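A rough sketch of the matrix case, again with hypothetical names: `matrix_pcm` takes two matrices as lists of rows plus any vector-level similarity function, applies it column by column, and returns both the average and the per-column values. The `overlap` function below is just a toy stand-in so that the example runs on its own; it is not PCM.

```python
from statistics import mean


def matrix_pcm(a_rows, b_rows, vector_pcm):
    """Average a column-wise similarity over two matrices given as lists of rows.

    The matrices may have different numbers of rows but need the same
    number of columns. Returns (overall value, list of per-column values).
    """
    a_cols = list(zip(*a_rows))
    b_cols = list(zip(*b_rows))
    if len(a_cols) != len(b_cols):
        raise ValueError("both matrices need the same number of columns")
    per_column = [vector_pcm(list(a), list(b)) for a, b in zip(a_cols, b_cols)]
    return mean(per_column), per_column


def overlap(x, y):
    """Toy stand-in for a vector similarity: share of distinct values in common."""
    sx, sy = set(x), set(y)
    return len(sx & sy) / len(sx | sy)


# Usage: a 3x2 matrix against a 2x2 matrix.
A = [[1, 10], [2, 20], [3, 30]]
B = [[1, 10], [3, 99]]
overall, per_column = matrix_pcm(A, B, overlap)
```

Returning the per-column values alongside the average makes it easy to spot which variable is dragging the overall score down.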
PCM is great for figuring out the similarity of two arrays through an in-depth view of the data involved. Instead of looking at just the mean and variance metrics, which can be deceiving, it makes use of the distribution data too. The fact that it doesn’t assume any particular distribution is a plus since it allows for biased samples to be considered. Overall, it’s a useful heuristic to know, especially if you prefer an approach to analytics other than what Stats has to offer.
The Wakelet Platform
Recently I came across this interesting platform for sharing curated content, called Wakelet. It's a British startup from Manchester, by the way, one that appears quite promising, provided that they find a way to monetize their project.
Anyway, the platform is a bit like Pinterest but with more features and an offline presence too. These are the most important features, in my view:
* Very intuitive and fast to learn
* Can work with a variety of content types: videos, images, formatted text, PDFs, and website links
* Every list can be exported to a PDF
* Free to use
* No account is required to view the lists
* Lots of free images to use as thumbnails and backgrounds
* QR code is generated for each list you wish to share
* Private lists are also an option
* Plenty of tutorials online that explain the various features and use-cases
You can check out a wake (that's what these curated lists are called) that I've made in the space of a few minutes, here. In the future I'll probably be using it more, particularly on this blog. Whatever the case, do let me know what you think of this platform and of my wake. Cheers!
If you wish to put yourself out there as a content creator, now more than ever, videos are a great way to do it. This may seem somewhat daunting to some, but with the plethora of software options out there and the ease of use of many of them, it’s just a matter of resolving to do it. Apart from the obvious benefit of personal branding, creating a data science or A.I. video can also be a lucrative endeavor.
I’m not referring to the amateur videos many people on YouTube make, in their vain attempts to gather likes and shares, much like beggars gather pitiful coins from the passers-by. If you want to create a technical video that will be worth your while, there are better and more self-respecting options to do so, options that you would be happy to include in your resume/CV. Namely, you can create a video that you promote through a respectable publisher, such as Technics Publications. Such an alternative will enable you to receive royalties every 6 months and not have to worry about promoting your work all by yourself. Of course, there is also the option of a one-time payment that some publishers offer, but this isn’t nearly as appealing since the amount of money you can potentially earn through royalties is higher and the requirements are easier to meet.
When creating a video, many people think that it’s just you standing in front of a camera and ad-libbing about a topic, perhaps using some props like a whiteboard. Although that’s one straightforward way to do it, it may not appeal to the less charismatic presenters or those who don’t consider themselves particularly photogenic. Besides, a screen-share video or a slideshow with voice-over, always based on a script, is much easier to produce and sometimes more effective at illustrating the points you wish to make. Alternatively, you can try combining both approaches, though this may require more takes.
Whatever the case, making a video is the easy part of the whole project, relatively speaking. What follows is the most challenging task for most people: promoting the video to your target audience. Although social media have an important role to play in all this, having some support from a publisher is priceless. After all, promoting technical content is what publishers are really good at, especially if they have a good niche in the market. Still, if you have a large enough network, it doesn’t hurt to spread the word yourself too, for additional exposure, though you are not required to do so.
If you are interested in covering a data science or A.I. related topic with a video, through a publisher, feel free to contact me directly, as I’d be happy to help you with that, particularly if you are serious about it. The world could definitely use some new content on data science and A.I., since there is way too much noise out there, confounding those who wish to study these fields. Perhaps this new content could come from you.
When I created this heuristic about a year and a half ago, I wasn't planning to make a video about it. However, after exploring its various benefits, I felt it should become better known among data science and A.I. practitioners. So, after a series of experiments and some extra research, I've made this video demonstrating the various aspects of this intriguing heuristic metric. Check it out whenever you have the chance!
Please note that Safari Books Online (O'Reilly) is a paid platform for quality content, so you need to have a subscription to it in order to view this and any other video in their entirety. However, it's a worthy investment that every data science and A.I. learner ought to consider making.
Dimensionality reduction has been a standard methodology for dealing with datasets that have a lot of features, more than a typical model can handle effectively. Reducing the number of features can also save time and storage space, and when it comes to sensitive data it can be a big plus, as it helps preserve the anonymity of the people involved. What’s more, in some cases, a reduced dimensionality dataset can be more effective as there is less noise in it. However, conventional dimensionality reduction methods don’t always do the trick due to their inherent limitations. For example, PCA only considers linear relationships among the variables and offers only linear combinations of features as a solution.
Of course, other people are not sitting idle when it comes to this issue. There are several dimensionality reduction options that are being pursued, the most interesting of which is autoencoders. This AI-based method involves a data-driven approach to figuring out the nature of the data and creating new variables that can represent the underlying signal, by minimizing the error. The issue with this is that it often requires a lot of data and some specialized know-how in order to configure optimally. Also, this whole process may be fairly slow, due to the large number of computations involved.
An alternative approach has to do with feature fusion in a non-AI way. The idea is to maintain transparency to the extent this is possible, while at the same time optimizing the whole process in terms of speed. The use of multiple operators, some linear and some non-linear, is essential, while the option of dropping useless features is also very useful. Naturally, this whole process would be more effective in the presence of a target variable, but it should be able to work without it, for better applicability. Whatever the case, the use of a metric able to handle non-linear correlations is paramount since the conventional correlation metric leaves a lot to be desired.
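To illustrate the last point, here is a small Python example, assuming that the "conventional correlation metric" refers to Pearson's coefficient (the text does not name it). The `pearson` helper is written out by hand for the sake of a self-contained example. A perfectly deterministic but symmetric quadratic relationship gets a coefficient of about zero, while a linear one gets a coefficient of about one.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


x = [-3, -2, -1, 0, 1, 2, 3]
y_linear = [2 * v + 1 for v in x]  # linear in x
y_square = [v ** 2 for v in x]     # fully determined by x, but not linearly

r_linear = pearson(x, y_linear)  # close to 1
r_square = pearson(x, y_square)  # close to 0 despite the exact dependence
```

A feature-fusion method that scores candidate features with this coefficient alone would discard `y_square` as unrelated to `x`, which is why a metric sensitive to non-linear relationships matters here.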
Based on all this, it’s clear that the dimensionality reduction area still has room for enhancements. Despite the great work that has been done already, new methods are needed to address the limitations of the existing ones, which aren’t going away any time soon. Perhaps it would be best to explore this methodology of data engineering more, instead of focusing on the latest and greatest system, which although intriguing, may sacrifice too much (e.g. transparency) in the name of accuracy, a trade-off that may no longer be cost-effective. Something to think about...
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.