Category: Ethics

The Value of Privacy and Data Science's Role in All This

4/5/2021

More and more datasets these days contain sensitive data capable of identifying the people behind those ones and zeros. We usually refer to this kind of data as personally identifiable information, or PII for short. PII is a privacy concern for every data scientist or analyst working with such a dataset, since if it leaks, we're all in trouble! Not just the data scientist, but the whole organization too, especially if it is subject to privacy regulations like GDPR. Let's look into this matter in more detail.

First of all, PII-related privacy concerns are unavoidable in most real-world data science projects today. Chances are that at least some of the variables you deal with contain some type of sensitive data. These can be things like names, contact details, credit card numbers, and even health-related data (this last kind of PII is particularly important since most of it cannot be changed, unlike a credit card number). Even geo-location data often falls under the PII umbrella, though on its own it's less sensitive because it's hard to match it to a particular individual without using some other variable too.

This matching of particular variables to specific individuals is the source of all privacy-related problems. The issue is not so much that some people's identities are compromised (who cares if it becomes public that I enjoy a cup of coffee at the local coffee shop every morning?) but that this data is supposed to be protected. When it's out in the open, it's a breach of some privacy legislation, and the organization that handles this data is exposed to a lawsuit. To make matters worse, if word gets out that a particular company doesn't protect its clients' sensitive data adequately, its reputation is bound to suffer, and its brand can be damaged. Not to mention that some of this PII can be traded on the black market, so if a malicious hacker gets hold of it, things become even more challenging to manage.
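One common way to weaken the link between variables and individuals is to pseudonymize the direct identifiers before the data reaches the analysis stage. The snippet below is only a minimal sketch of that idea, using a keyed hash (HMAC) from Python's standard library. The field names, the sample record, and the `SECRET_KEY` value are all made up for illustration, and in practice the key would have to be stored separately from the dataset.

```python
import hashlib
import hmac

# Hypothetical secret key; it should live outside the dataset (e.g. in a secrets manager).
SECRET_KEY = b"replace-with-a-long-random-secret"

# Hypothetical names of the fields that hold direct identifiers.
PII_FIELDS = ("name", "email")


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Return a keyed SHA-256 digest (HMAC) of a PII value as a hex string."""
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(key, normalized, hashlib.sha256).hexdigest()


def mask_record(record: dict, pii_fields=PII_FIELDS) -> dict:
    """Return a copy of the record with each PII field replaced by its pseudonym."""
    masked = dict(record)
    for field in pii_fields:
        if masked.get(field) is not None:
            masked[field] = pseudonymize(str(masked[field]))
    return masked


# Made-up record, for illustration only.
record = {"name": "Jane Doe", "email": "jane@example.com", "purchases": 3}
masked = mask_record(record)
```

The same input always maps to the same pseudonym, so records can still be joined and counted without exposing the raw values. Keep in mind that pseudonymized data is generally still treated as personal data under GDPR, so this reduces the risk rather than eliminating it.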

To avoid these problems, we need to handle PII properly. You can do this in various ways, some of which we're going to explore in future articles. As I've lately delved more into cybersecurity and privacy, I can provide a better perspective on this subject, one that ties into data science work more practically. However, should you wish to delve into this topic a bit now, you can check out my latest video course on WintellectNow, titled Privacy Fundamentals. There I cover various practical ways of securing privacy in your personal and professional life. It's not data science-focused, but it can help you cultivate the right mindset that will enable you to handle PII more responsibly. Stay tuned for more material in the coming months. Cheers!

The Ethics of Web Scraping

11/30/2020

In a nutshell, web scraping is the process of taking stuff from the web programmatically and often at scale. This involves specialized libraries as well as some understanding of how websites are structured, to separate the useful data from any markup and other stuff found on a web page.

Web scraping is very useful, especially if you are looking at updating your dataset based on a particular online source that's publicly available but doesn't have an API. APIs are very important and popular these days, but they are beyond the scope of this article. Feel free to check out another article I’ve written on this topic. In any case, web scraping is very popular too, as it greatly facilitates data acquisition, be it for building a dataset from scratch or supplementing existing datasets.

Despite its immense usefulness, web scraping is not without its limitations. These generally fall under the umbrella of ethics since they aren’t set in stone, nor are they necessarily linked to legal issues. Nevertheless, they are quite serious and could jeopardize a data science project if left unaddressed. So, even though these ethical rules and guidelines vary from case to case, here is a summary of the most important ones for a typical data science project:

Ask for permission before launching a web scraping project. You don’t know if the website you are accessing has sufficient bandwidth to spare for this, nor if the people who manage it require you to license this data stream before using it.
Ensure you leave some time between two consecutive requests so that you don’t overwhelm the server. Usually a second or two is enough, while some people prefer to use random time intervals.
Have a clear idea of how you are going to use this data and inform the people managing the website. This way you can avoid any abuse of the data involved.
Mention your source in your project so that at least the people providing you with this data get some organic traffic on their site and some acknowledgment.
Don’t download everything as this could be perceived as industrial espionage (in the case of a commercial site) or an otherwise malicious endeavor.
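Two of the guidelines above, pausing between consecutive requests and not downloading everything, are easy to bake into the scraping script itself. Here is a minimal sketch of how that could look. The function name `polite_fetch_all`, its parameters, and the example URLs are all hypothetical, and the actual page retrieval is passed in as a `fetch` function so that the sketch makes no assumptions about which HTTP library you use.

```python
import random
import time
from typing import Callable, Iterable


def polite_fetch_all(
    urls: Iterable[str],
    fetch: Callable[[str], str],
    min_delay: float = 1.0,
    max_delay: float = 2.0,
    max_pages: int = 100,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Fetch up to max_pages URLs one at a time, pausing between consecutive requests."""
    pages = {}
    for i, url in enumerate(urls):
        if i >= max_pages:
            break  # cap the crawl instead of downloading everything
        if i > 0:
            # Random pause of min_delay to max_delay seconds before the next request.
            sleep(random.uniform(min_delay, max_delay))
        pages[url] = fetch(url)
    return pages


# Demo with a stand-in fetch function, so no real website is contacted.
def fake_fetch(url: str) -> str:
    return f"<html>{url}</html>"


pauses = []  # records the pauses instead of actually sleeping
demo_urls = [f"https://example.com/page/{n}" for n in range(1, 4)]
pages = polite_fetch_all(demo_urls, fake_fetch, sleep=pauses.append)
```

In a real project, `fetch` would wrap whatever call you use to download a page, and the default `sleep` would actually pause the script. The one-to-two-second range mirrors the rule of thumb mentioned above and can be widened if the site owner asks for it.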

It’s also important to keep certain other things in mind when it comes to web scraping. Namely, since web scraping scripts tend to hang (usually due to the server they are requesting data from), it's good to save the scraped data periodically. Also, make sure you scrape all the fields (variables) you might need, even if you are not sure about them. It's better to err on the side of collecting too much, since you can always remove unnecessary data afterward. If you need additional fields though, or your script hangs before you save the data, you'll have to redo the web scraping from the beginning.
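The periodic saving mentioned above can be sketched as follows: rows are buffered in memory and appended to a JSON Lines file every few items, so a hang costs you at most one batch. The names `scrape_with_checkpoints`, `flush`, and `scrape_one`, as well as the batch size and the output file, are hypothetical choices for illustration only.

```python
import json
import os
import tempfile


def flush(buffer: list, out_path: str) -> int:
    """Append buffered rows to out_path, clear the buffer, and return how many were written."""
    with open(out_path, "a", encoding="utf-8") as f:
        for row in buffer:
            f.write(json.dumps(row) + "\n")
    written = len(buffer)
    buffer.clear()
    return written


def scrape_with_checkpoints(items, scrape_one, out_path: str, every: int = 10) -> int:
    """Scrape items one by one, saving the results to disk every `every` rows."""
    buffer, saved = [], 0
    for item in items:
        buffer.append(scrape_one(item))
        if len(buffer) >= every:
            saved += flush(buffer, out_path)
    saved += flush(buffer, out_path)  # write whatever is left at the end
    return saved


# Demo with a stand-in scraper that keeps every field it "finds".
out_path = os.path.join(tempfile.mkdtemp(), "scraped.jsonl")
saved = scrape_with_checkpoints(
    range(25),
    lambda n: {"id": n, "title": f"item {n}", "extra": None},
    out_path,
    every=10,
)
```

Because each row is stored as a full dictionary, fields that turn out to be unnecessary can simply be dropped when the file is loaded later.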

If you want to learn more about ethics and other non-technical aspects of data science work, I invite you to check out a book I co-authored earlier this year. Namely, the Data Scientist Bedside Manner book covers a variety of such topics, including some hands-on advice to boost your career in this field, all while maintaining an ethical mindset. Cheers!

The Ethics of Modern NLP Systems

9/14/2020

It’s been a while since I’ve written anything on Natural Language Processing (NLP), so I figured I might as well do an article now. In a nutshell, NLP is the set of algorithms that deal with natural language data (e.g. English sentences), usually in the form of text. Since it’s a fairly advanced part of data science, it is considered to be more AI-related than anything else. Nevertheless, NLP employs a large number of data science techniques and models under the hood, so it’s usually part of a data scientist’s toolbox.

What about modern NLP systems then? This kind of system deals with rough data (containing typos, strange characters like emojis, etc.) and delivers a humanly understandable result, e.g. in the form of a prediction (a categorization of the text or new text). However, other modern NLP systems go a step further and deliver original text based on a corpus they have learned and a prompt. The latter is a sentence or two that helps outline the topic of the text that follows. NLP systems geared towards that task pick up this topic and compose a novel piece of text that's similar in style to the prompt and relevant to it topic-wise. NLP systems like these are made available through OpenAI, primarily.

All this is great, but this article is about the ethics of NLP. So, where do ethics enter the picture? Well, when you have an A.I. system writing a piece of text for you at practically no cost and with no responsibility, the amount of information out there is going to skyrocket. The worst part is that much of that information will seem real even if it bears no relation to reality. And when a text produced by an A.I. is useful enough, it can be abused for school projects and such. So, any sense of merit when it comes to essay writing, and content creation in general, is bound to disappear gradually.

Besides, this whole situation affects our morals since it’s becoming increasingly acceptable to cheat using A.I. From the mediocre AI-written articles published on various social media to promote one product or service or another, to students cheating in their courses, the NLP matter is clearly a problem that has already manifested rather than some imminent threat. However, this can also be viewed as a test of our morality and a chance to be ethical even if others are not.

So what can you do to get a hold on NLP and other AI-related matters? It may seem a daunting task, but it’s doable, given enough understanding of the subject and practice. One useful and empowering resource on this topic is the AI for Data Science book I co-authored with another data scientist a couple of years ago. Although its focus is A.I. in general, it provides the fundamentals for doing all sorts of tasks. Feel free to check it out when you have the chance. Cheers!

PS - you can get a 20% discount on this and any other book you buy from this publisher. Do you remember the coupon code I've mentioned in previous posts?

AI Thought Experiment – The Perfect Answer Conundrum

6/22/2020

Suppose we have an advanced AI system (e.g. an AGI) which is now proven to be safe enough to use in general-purpose scenarios. This system can do all sorts of things, including finding the optimal stance on a matter, given the data available on the subject. Fortunately, everyone who manages the corresponding databases has agreed to let this AI access and analyze this data, so long as it acknowledges the generous contributors of this data. So, data abundance is a given in this hypothetical scenario. Now, with this data and the immense computing resources this AI has at its disposal, it can sort out any controversial topic and come up with a mathematically sound solution that is valid beyond any doubt, given the data at hand. The question is: would you trust this result, even if it is probably beyond your understanding, and accept it as the “right answer” to the controversial topic in question?

Let's make this more concrete. Suppose that we are dealing with a fairly realistic situation where we have a settlement in some inhospitable environment (e.g. a research center in Antarctica or on the ISS). Due to unforeseeable circumstances, there aren't sufficient resources to save everyone and everything from that place. So, someone has to decide whether to save all the scientific samples that these people have spent years accumulating and analyzing, or the scientists themselves. Or perhaps a combination of the two, prioritizing senior scientists, for example. Obviously, this isn't a decision that anyone would be comfortable making, especially if that person has a conscience. However, an AI system may be more than happy to provide a solution to this problem. A clear-cut solution may be unfathomable to us, but for that AI (which has access to all sorts of data, not just the data specific to the problem at hand), it's a much more feasible task. Yet, we may not like the AI's solution. Would we accept it nevertheless? And to whom would we attribute responsibility for the outcome?

Thinking about things like that may not help anyone gain a better understanding of the ins and outs of AI technology. However, someone could argue that solving this sort of conundrum is as important as sorting out the technical aspects of AI. After all, at some point, probably sooner rather than later, we may have to deal with real-world situations akin to this thought experiment. So, preparing ourselves for this is definitely a worthwhile task, even if it seems challenging or futile, depending on whom you ask. There is no doubt that AIs help us solve all sorts of problems and that we can outsource a large variety of tasks to them. Soon, an AI may be able to undertake even high-level responsibilities. It is doubtful, however, that it can act ethically if we are not able to do the same ourselves. And we don't need an AI to know that with sufficient certainty. Cheers!

FOXY DATA SCIENCE
unconventional insights about data science, A.I., cybersecurity, data analytics, and more

Zacharias Voulgaris, PhD
