The Ethics of Web Scraping

11/30/2020

In a nutshell, web scraping is the process of extracting data from the web programmatically, often at scale. It involves specialized libraries as well as some understanding of how websites are structured, so that you can separate the useful data from the markup and other content found on a web page.
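To make the "separate the useful data from the markup" part a bit more concrete, here is a minimal sketch that uses only Python's standard library. The sample page, the `CellTextExtractor` class, and the `extract_cells` helper are all made up for illustration; a real project would more likely rely on a dedicated scraping library and would need to match the actual structure of the target site.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Collect the text inside <td> cells and ignore all other markup."""

    def __init__(self):
        super().__init__()
        self._in_cell = False
        self.cells = []

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")  # start a new, empty cell

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False

    def handle_data(self, data):
        # Only keep text that appears inside a table cell.
        if self._in_cell:
            self.cells[-1] += data


def extract_cells(html):
    """Return the stripped text of every <td> cell in the given HTML."""
    parser = CellTextExtractor()
    parser.feed(html)
    parser.close()
    return [cell.strip() for cell in parser.cells]


# Hypothetical page content, standing in for a downloaded web page.
SAMPLE_HTML = """
<html><body>
  <h1>Prices</h1>
  <table>
    <tr><td>Widget</td><td>4.50</td></tr>
    <tr><td>Gadget</td><td>12.00</td></tr>
  </table>
  <p>Footer text we do not need.</p>
</body></html>
"""

print(extract_cells(SAMPLE_HTML))
```

The heading and the footer paragraph are dropped, and only the four table values survive, which is the kind of filtering a scraping script has to do for every page it visits.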

Web scraping is very useful, especially if you want to keep a dataset up to date from a particular online source that is publicly available but doesn't offer an API. APIs are important and popular these days, but they are beyond the scope of this article; feel free to check out another article I’ve written on that topic. In any case, web scraping is popular too, as it greatly facilitates data acquisition, be it for building a dataset from scratch or for supplementing existing datasets.

Despite its immense usefulness, web scraping is not without its limitations. These generally fall under the umbrella of ethics, since they are neither set in stone nor necessarily tied to legal issues. Nevertheless, they are quite serious and could jeopardize a data science project if left unaddressed. So, even though these ethical guidelines vary from case to case, here is a summary of the most important ones for a typical data science project:

Ask for permission before launching a web scraping project. You don’t know whether the website you are accessing has sufficient bandwidth to spare for this, nor whether the people who manage it require you to license this data stream before using it.
Ensure you leave some time between two consecutive requests so that you don’t overwhelm the server. Usually a second or two is enough, while some people prefer to use random time intervals.
Have a clear idea of how you are going to use this data and inform the people managing the website. This way you can avoid any abuse of the data involved.
Mention your source in your project so that at least the people providing you with this data get some organic traffic on their site and some acknowledgment.
Don’t download everything as this could be perceived as industrial espionage (in the case of a commercial site) or an otherwise malicious endeavor.
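The guideline about leaving time between consecutive requests can be sketched roughly as follows. This is only an illustration under my own assumptions: `fetch_url` and `polite_fetch_all` are hypothetical helper names, the User-Agent string is a placeholder, and the one-to-two-second range simply mirrors the rule of thumb mentioned above. The delay is drawn at random for those who prefer random intervals.

```python
import random
import time
import urllib.request


def fetch_url(url, timeout=10):
    """Download one page. The User-Agent value is a placeholder to replace."""
    request = urllib.request.Request(
        url,
        headers={"User-Agent": "example-research-bot (contact: you@example.com)"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


def polite_fetch_all(urls, fetch=fetch_url, min_delay=1.0, max_delay=2.0,
                     sleep=time.sleep):
    """Fetch each URL in turn, pausing a random interval between requests.

    `fetch` and `sleep` are parameters so they can be swapped out,
    for example when trying the function without touching a real server.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            # Wait before every request except the first one.
            sleep(random.uniform(min_delay, max_delay))
        results[url] = fetch(url)
    return results


# Example call (commented out so nothing is requested by accident):
# pages = polite_fetch_all(["https://example.com/a", "https://example.com/b"])
```

If the site owners ask for a slower pace when you request permission, raising `min_delay` and `max_delay` is all it takes.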

It’s also important to keep a few practical things in mind when it comes to web scraping. Since web scraping scripts tend to hang (usually because of the server they are requesting data from), it’s good to save the scraped data periodically. Also, make sure you scrape all the fields (variables) you might need, even if you are not sure about them. It’s better to err on the side of collecting too much, since you can always remove unnecessary data afterward. If you find you need additional fields, though, or your script hangs before you save the data, you’ll have to redo the web scraping from the beginning.
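One possible way to save the scraped data periodically is a simple checkpoint file, as in the sketch below. Everything here is an assumption for the sake of illustration: the function names, the JSON file format, and the default of saving every 10 records. It also assumes each item is identified by a string (such as a URL) and that `scrape_one` returns a JSON-serializable record with all the fields you collected.

```python
import json
import os


def _save(results, path):
    """Write results to a temporary file, then swap it into place."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as handle:
        json.dump(results, handle)
    os.replace(tmp_path, path)  # avoids leaving a half-written checkpoint


def scrape_with_checkpoints(items, scrape_one, path, every=10):
    """Scrape items one by one, saving to `path` every `every` new records.

    Items already present in the checkpoint file are skipped, so a rerun
    after a hang or crash resumes instead of starting from the beginning.
    """
    results = {}
    if os.path.exists(path):
        with open(path, encoding="utf-8") as handle:
            results = json.load(handle)

    pending = 0  # records scraped since the last save
    for item in items:
        if item in results:
            continue
        results[item] = scrape_one(item)  # keep the full record, all fields
        pending += 1
        if pending >= every:
            _save(results, path)
            pending = 0
    if pending:
        _save(results, path)
    return results
```

With this approach, a script that hangs loses at most the last few records since the previous save, rather than the whole run.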

If you want to learn more about ethics and other non-technical aspects of data science work, I invite you to check out a book I co-authored earlier this year. The book, Data Scientist Bedside Manner, covers a variety of such topics, including some hands-on advice to boost your career in this field, all while maintaining an ethical mindset. Cheers!

FOXY DATA SCIENCE
unconventional insights about data science, A.I., cybersecurity, data analytics, and more

Zacharias Voulgaris, PhD
