Data security is a topic I’ve talked about in most of my books over the years and even made videos about (unfortunately these videos are no longer available as the contract with the platform has expired). In any case, as it’s an important topic I’ll continue talking about it. After all, it concerns all data professionals, including data scientists.
Data security is essential because it affects the usability of your models as well as the people involved in your projects. I'm not talking about just the shareholders but also the people behind the data involved. Say you have some personally identifiable information (PII) in your dataset, for example. Do you think the people this information corresponds to would be pleased if it got compromised, e.g. by a hacker? What about the accountability of the models? Securing your data is no longer a nice-to-have but something of an obligation, especially whenever sensitive information is involved.
Fortunately, you can secure your data in various ways. Encryption and back-ups are by far the most popular methods, though other cybersecurity techniques such as steganography can also be applied. Also, for each method, there are variants that you can consider, such as the different encryption algorithms, the various back-up schemata, etc. Usually, a cybersecurity professional can assess your needs and provide a solution for your data, though it's not far-fetched to obtain the same services from a tech-savvy data scientist too.
What about the cost of all this? After all, if you are to implement a cybersecurity solution that’s the first question you’d be asked by the stakeholders. The cost is broken down into two main parts: hardware- and software-related. As for the former (which tends to be the larger part), it involves the purchase of specialized equipment (e.g. a firewall node in your computer network, or a back-up server).
The software part involves specialized software, such as the one responsible for your encryption, intrusion detection, etc. Also, this category includes any software-as-a-service solution you may purchase (usually through a subscription) for software that lives on the cloud. Software handling DDoS attacks, for example, is commonplace and often comes as an add-on for any web hosting package you have for your site. Naturally, some of this software may have nothing to do with your data (e.g. the aforementioned DDoS attack prevention) but it can help keep any APIs you have up and running, serving processed data to your users and clients.
A good rule-of-thumb for assessing a cybersecurity module and its relevance to a data-related project is the usefulness time for the data at hand. If the data is going to be obsolete (stale) in a few months perhaps you don't need the latest and greatest encryption module, while if the data is available in other places with a small fee (so it's mostly an ETL effort to get it on your computers), then back-up systems may not need to follow the most advanced schema.
Beyond these cybersecurity matters, there are other considerations that are useful to have, which however are beyond the scope of this article. Suffice to say that this is a topic worth considering and discussing with your colleagues as it is crucial in today’s data-driven world where the security of digital assets is as important as physical security.
Personally Identifiable Information, or PII for short, is an essential aspect of data science work today. It involves sensitive data that can compromise the identity of at least some of the people involved in a dataset (e.g., someone's name, financial data, address, phone number, etc.). PII is particularly important today as it's protected by law in many countries, and any violation of this sort of data can fetch huge fines. What's more, PII is often essential in data science projects as it carries useful information that can bring about a sense of personalization to the data products developed.
Due to various factors, such as using multiple data streams, datasets used in data science today are full of PII. Note that PII can result from a combination of variables since there isn't an infinite amount of people. As a result, given enough information-rich variables, you can predict several PII variables with reasonable accuracy. This ability to predict PII makes the problem even more severe since PII can be a serious liability if it leaks. As there is plenty of it in modern datasets, the risk of this happening grows with the more data you gather for your data science project.
Of course, you could remove PII from your dataset, but it's not always a good option. After all, much of this PII is useful information that can help with the models built. So, even if you can eliminate certain variables, the bulk of PII will need to be retained for the models at hand to be useful and a value-add to your data science project. As for obscuring the PII variables (e.g., through a method like PCA), this is also a valid option. However, with it, any chance of transparency in your models goes out the window.
Fortunately, you can protect PII with various cybersecurity methods, without compromising your models' performance or transparency. Encryption, for example, is one of the most widely used techniques to keep data secure as it's turned into gibberish when not in use. In some cases, even in that gibberish state, you can perform some operations for additional security. However, in most cases, the protection is there for the time the data is in transit, which is when it's also the most vulnerable.
Since nowadays the use of the cloud for both storing and processing data is commonplace, the risk of exposing PII is more significant than ever. Fortunately, it's not too difficult to have security even in these situations, as long as the cloud provider offers this protection level. It just needs to have that in the platform it uses and in all the network connections involved.
Hostkey is a Dutch company providing cloud services, targeted towards data science professionals. Just like most modern cloud providers, it offers high-quality cybersecurity for all the data handled. At the same time, if you are super serious about this matter, you can also lease a dedicated server from it. Additionally, Hostkey offers GPU servers, which are a bigger bang for your buck when it comes to high-performance data models, such as deep learning ones. So, check out this cloud company and see how you can benefit from its services. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.