In many data science courses, these peculiar data points in a dataset often go by the term “anomalies” and are considered to be inherently bad. In fact, it is suggested by many that they be removed before the data modeling stage. Now, for obvious reasons I cannot contradict that approach partly because I myself have taken that stance when covering basic data engineering topics, but also because there is merit in this treatment of outliers and inliers. After all, they are just too weird to be left as they are, right?
Well, it depends. In all the cases when they are removed, it’s usually because we are going to use some run-of-the-mill model that is just too rudimentary to do anything clever with the data it’s given. So, if there are anomalous data points in the training set, it’s likely to over-fit or at the very least under-perform. This would not happen so often though in an A.I. model, which is one of the reasons why the data engineering stage is so closely linked to the data modeling one. Also, sometimes the signal we are looking for lies in those particular anomalous elements, so getting rid of them isn’t that wise then.
Regardless of all that though, we need to differentiate between these two kinds of anomalies. The outliers can be easily smoothed out, if we were to adopt a possibilistic way of handing the day, instead of the crude statistical metrics we are used to using. Smoothing outliers is also a good way to retain more signal in the dataset (especially if it’s a small sample that we are working with), something that translates into better-performing models.
Inliers though are harder to process. Oftentimes removing them is the best strategy, but they need to be looked at holistically, not just in individual variables. Also, even if they distort the signal at times, they may not be that harmful when doing dimensionality reduction, so keeping them in the dataset may be a good idea. Nevertheless, it’s good to make a note of these anomalous elements, as they may have a particular significance once the data is processed by a model we build. Perhaps we can use them as fringe cases in a classification model, for example, to do some more rigorous testing to it.
To sum up, outliers and inliers are interesting data points in a dataset and whether they are more noise than signal depends on the problem we are trying to solve. When tackled in a multi-dimensional manner, they can be better identified, when when processed, certain care needs to be taken. After all, just because certain data analytics methods aren’t well-equipped to handle them, we shouldn’t change our data to suit the corresponding models / metrics. Often we have more to gain by shifting our perspective and adapting our ways to the data at hand. The possibilistic approach to data may be a great asset in all that. Should you wish to learn more about outlier and inliers, you can check out my presentation video on this topic in the Safari platform.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy flair when it comes to technology, technique, and tests.