Data security is a topic I’ve covered in most of my books over the years and even made videos about (unfortunately, these videos are no longer available because the contract with the platform has expired). In any case, since it’s an important topic, I’ll continue talking about it. After all, it concerns all data professionals, including data scientists.
Data security is essential because it affects the usability of your models as well as the people involved in your projects. I'm not talking only about the shareholders but also about the people the data describes. Say you have some personally identifiable information (PII) in your dataset, for example. Do you think the people this information corresponds to would be pleased if it got compromised, e.g. by a hacker? And what about the accountability of your models? Securing your data is no longer a nice-to-have but something of an obligation, especially when sensitive information is involved.
Fortunately, you can secure your data in various ways. Encryption and back-ups are by far the most popular methods, though other cybersecurity techniques such as steganography can also be applied. Each method also comes in variants you can consider, such as the different encryption algorithms and the various back-up schemata. Usually, a cybersecurity professional can assess your needs and propose a solution for your data, though a tech-savvy data scientist may well be able to provide the same service.
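To make the back-up idea a bit more concrete, here is a minimal sketch in Python that copies a data file to a back-up folder and checks the copy against a SHA-256 checksum. The function names (`sha256_of`, `back_up`) and the folder layout are my own assumptions for illustration, not a recommended back-up scheme. Encryption is deliberately left out, since it should rely on a vetted library rather than hand-written code.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def back_up(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` (hypothetical layout) and verify the copy."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copies contents and file metadata
    # If the digests differ, the copy is corrupted and should not be trusted.
    if sha256_of(source) != sha256_of(target):
        raise OSError(f"Back-up of {source} failed verification")
    return target
```

Calling `back_up(Path("data.csv"), Path("backups"))` would return the path of the verified copy. A real set-up would also keep the back-up on a separate machine or storage service.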
What about the cost of all this? After all, if you are to implement a cybersecurity solution, that’s the first question the stakeholders will ask. The cost breaks down into two main parts: hardware and software. The hardware part, which tends to be the larger one, involves the purchase of specialized equipment (e.g. a firewall node in your computer network, or a back-up server).
The software part involves specialized software, such as the programs responsible for encryption, intrusion detection, etc. This category also includes any software-as-a-service solution you may purchase (usually through a subscription) for software that lives in the cloud. Software for mitigating DDoS attacks, for example, is commonplace and often comes as an add-on to the web hosting package for your site. Naturally, some of this software may have nothing to do with your data directly (e.g. the aforementioned DDoS protection), but it can help keep any APIs you have up and running, serving processed data to your users and clients.
A good rule of thumb for assessing a cybersecurity module and its relevance to a data-related project is the useful lifetime of the data at hand. If the data is going to be obsolete (stale) in a few months, perhaps you don't need the latest and greatest encryption module. Likewise, if the data is available elsewhere for a small fee (so it's mostly an ETL effort to get it back on your computers), then your back-up system may not need to follow the most advanced schema.
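This rule of thumb can be written down as a tiny decision helper. The sketch below is only an illustration: the function name, the six-month threshold, and the labels it returns are all assumptions of mine, and a real assessment would weigh far more factors (regulation and data sensitivity, for instance).

```python
# Hypothetical cut-off: data that goes stale within this many months is "short-lived".
SHORT_LIVED_MONTHS = 6


def suggest_protection(useful_months: float, cheaply_replaceable: bool) -> dict:
    """Map the rule of thumb to rough, illustrative protection levels."""
    return {
        # Short-lived data may not justify the latest and greatest encryption.
        "encryption": "standard" if useful_months <= SHORT_LIVED_MONTHS else "strong",
        # Data that can be re-acquired cheaply may not need an advanced back-up schema.
        "backup": "basic" if cheaply_replaceable else "advanced",
    }
```

For example, `suggest_protection(3, True)` suggests the lighter option on both counts, while `suggest_protection(36, False)` suggests the heavier ones.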
Beyond these cybersecurity matters, there are other useful considerations, which are, however, beyond the scope of this article. Suffice it to say that this is a topic worth considering and discussing with your colleagues, as it is crucial in today’s data-driven world, where the security of digital assets is as important as physical security.
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.