Image from bnymellonmarketeye.com
Unless you were living under a rock, you probably know about the Black Swan that is the outcome of the recent US election. Nate Silver, who had successfully predicted the former president’s success, failed miserably to predict that of the current one. Other statisticians had a similar experience with this challenge. So, what went wrong? Why did their models fail to make an accurate prediction?
Although this may seem like a simple problem, in essence it is one of the most complicated data analytics tasks people have dealt with. It’s not the volume of data that was the issue, or the velocity with which it was generated. As for variety, that was practically non-existent since all the data points were of the same type. However, the veracity of the data may very well have been the underlying factor in this data analytics blunder.
People assume that when someone tells them they voted for X, that person has indeed voted for X. After all, all the data points in this survey are anonymous, so what’s the point of lying? Well, the fact that there is no way to verify the validity of one’s input makes lying not just a convenient option but also quite a likely one. This results in lower veracity of the data, leading to an incorrect output from a predictive model that might otherwise be very accurate. In cases where people believe in a particular candidate, especially for what he or she stands for, they would be more than happy to voice their opinion and their vote on the matter. However, if they don’t believe in that person and they just vote because they don’t want the other candidate to get elected, it is possible that they may want to hide their true vote and instead state something that is more socially acceptable. It’s not rocket science, just human psychology.
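To make the effect concrete, here is a minimal Python sketch of how hidden votes can flip a poll. Everything in it is an assumption made for illustration: the function names, the 52% true support for candidate X, and the 8% of X’s supporters who claim otherwise are made-up numbers, not figures from any real survey. It also assumes the other candidate’s supporters answer honestly.

```python
import random


def reported_share(true_share, hide_rate):
    """Expected share of respondents who *say* they back candidate X
    when a fraction `hide_rate` of X's real supporters claim otherwise."""
    return true_share * (1.0 - hide_rate)


def simulate_poll(true_share, hide_rate, n, seed=0):
    """Simulate n anonymous responses and return the share that reports X."""
    rng = random.Random(seed)
    said_x = 0
    for _ in range(n):
        votes_x = rng.random() < true_share
        # A real X voter is counted for X only if they answer honestly.
        if votes_x and rng.random() >= hide_rate:
            said_x += 1
    return said_x / n


if __name__ == "__main__":
    # Hypothetical numbers, chosen only to illustrate the mechanism.
    true_share, hide_rate = 0.52, 0.08
    print("expected reported share:", reported_share(true_share, hide_rate))
    for n in (1_000, 100_000):
        print(n, "respondents ->", simulate_poll(true_share, hide_rate, n))
```

Under these assumptions X actually wins with 52% of the vote, yet the expected reported share is 0.52 × 0.92 ≈ 0.478, so the poll shows X losing. Collecting more responses only narrows the estimate around that wrong number; it does not move it closer to the truth.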
Would this predictive analytics fiasco have been avoided if the analysts had used a more robust system, like deep learning, or whatever kind of A.I. you prefer? Unlikely. An A.I. system cannot guard against bad data. There is a famous adage about this in computer science: garbage in, garbage out (GIGO). If you feed an algorithm garbage inputs, you can be certain that the outputs aren’t going to be any better. That’s not to say that an A.I. system is not a good choice in general, since many such systems have yielded very accurate results in a variety of problems. However, if the data is problematic, they won’t magically filter out all the low-veracity data points and yield accurate results. This is science, not a sci-fi movie. Also, contrary to what some managers think, data scientists don’t have a magic wand, so if there is an issue with the data, they can’t just wish it away with a spell. This sounds obvious, but many people’s expectations of the field suggest that they believe otherwise, even if they never admit it.
So, before blaming Nate Silver or any other statistician who failed to predict this election’s outcome accurately, be sure to examine the root causes of their failure. Predictive analytics is not an exact science, and it is heavily dependent on the data at hand. If the data is unreliable, you may want to adjust your expectations accordingly.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.