As privacy concerns gain prominence, transparency is becoming increasingly valuable in data science work. This is to be expected for another reason too, which I hope has become obvious if you have been following this blog: transparent models are easier to explain to others. There are additional advantages as well (for example, transparent models are easier to tweak and optimize), but I'm not going to elaborate on them right now. Instead, I'm going to look at the various data models used in data science and where they fall on the transparency spectrum.
At one extreme of this spectrum lie the most transparent data models. These are usually statistics-based, since they can provide the exact contribution of each feature to a prediction, and the decision process behind their outputs is plain to see. Even if you know nothing about data science, you can still make sense of these models and the predictions they yield. Their main disadvantage is lower accuracy, partly due to the overly simple processes they rely on.
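To make this concreteness tangible, here is a minimal sketch (using hypothetical data) of the most transparent model of all, a simple linear regression fitted in closed form. Every part of the prediction is an explicit, inspectable number:

```python
# Minimal illustration (hypothetical data): in a simple linear model,
# each feature's contribution to a prediction is an explicit coefficient.

def fit_simple_ols(xs, ys):
    """Ordinary least squares for y = a + b*x, solved in closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance(x, y) divided by variance(x)
    b = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
    a = mean_y - b * mean_x
    return a, b

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.1, 9.8]
a, b = fit_simple_ols(xs, ys)
# The model is fully transparent: prediction = a + b * x, so b states
# exactly how much one unit of x changes the prediction.
```

Anyone can read `a` and `b` and reconstruct any prediction by hand, which is precisely the kind of transparency these models offer.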
At the other extreme of the spectrum you can find the most opaque data models. These are usually AI-based and are often referred to as black boxes. Not only do they tell us nothing about feature importance, but trying to explain their inner workings is a largely futile task. However, they tend to have an edge in predictive accuracy, and they require very little preparation of the data they use (data engineering).
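A toy sketch can show why even a tiny neural network resists this kind of inspection (the weights below are arbitrary, chosen purely for illustration). A feature's effect passes through nonlinear hidden units, so no single weight states its contribution:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tiny_net(x1, x2):
    # Two inputs -> two hidden units -> one output (weights are arbitrary).
    h1 = sigmoid(0.8 * x1 - 1.2 * x2 + 0.1)
    h2 = sigmoid(-0.5 * x1 + 0.9 * x2 - 0.3)
    return sigmoid(1.5 * h1 - 2.0 * h2 + 0.2)

# The effect of x1 on the output depends on x2 (and on where the
# sigmoids saturate), so its "importance" shifts from input to input:
effect_when_x2_is_0 = tiny_net(1.0, 0.0) - tiny_net(0.0, 0.0)
effect_when_x2_is_1 = tiny_net(1.0, 1.0) - tiny_net(0.0, 1.0)
```

With millions of weights instead of nine, this entanglement is what makes explaining a black-box model's inner workings so futile.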
Somewhere in the middle of the spectrum lie all the other models, most of them in the machine learning category. These include random forests and boosted trees (some transparency), k-nearest neighbors (very little transparency), support vector machines (no transparency), and fuzzy logic systems (fairly decent transparency). This is the category most people forget, since they tend to think of transparency as a binary attribute.
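The k-nearest-neighbors case illustrates what partial transparency looks like in practice. In this minimal sketch (hypothetical data), a prediction can be traced to the specific neighbors that produced it, a local explanation, but there are no global coefficients to inspect:

```python
# Minimal k-NN sketch (hypothetical data): the prediction is explainable
# locally (which neighbors voted for it), but there is no global model
# structure to inspect, hence the very limited transparency.

def knn_predict(train, query, k=3):
    """Majority-vote k-NN on 2-D points; train is [((x, y), label), ...]."""
    by_distance = sorted(
        train,
        key=lambda p: (p[0][0] - query[0]) ** 2 + (p[0][1] - query[1]) ** 2,
    )
    neighbors = by_distance[:k]
    votes = {}
    for _, label in neighbors:
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get), neighbors

train = [((0.0, 0.0), "A"), ((0.1, 0.2), "A"), ((0.9, 1.0), "B"),
         ((1.1, 0.9), "B"), ((0.2, 0.1), "A")]
label, neighbors = knn_predict(train, (0.15, 0.15))
# 'neighbors' shows exactly which training points drove the prediction,
# yet nothing in the model summarizes feature contributions overall.
```

This is why transparency is better thought of as a spectrum than a binary attribute: the model answers "why this prediction?" but not "how does the model work in general?"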
Finally, it's good to remember that transparency is usually linked to a business requirement. Sometimes the performance of a black-box model is a worthwhile trade-off, since some projects demand high predictive accuracy. So transparency is not always a necessity, even though it can facilitate communicating a model to the project's stakeholders. It's always worth asking whether you need the extra transparency a statistical model offers when you can achieve better performance with a less transparent one.
For more information about transparency and other aspects of data science models (particularly those related to machine learning), check out my latest book, Julia for Machine Learning. It is a very hands-on book that doesn't neglect the information needed to build the right mindset for data science work, and it includes plenty of examples and links to useful resources covering the concepts involved.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.