In a nutshell, web scraping is the process of taking stuff from the web programmatically and often at scale. This involves specialized libraries as well as some understanding of how websites are structured, to separate the useful data from any markup and other stuff found on a web page.
Web scraping is very useful, especially if you are looking at updating your dataset based on a particular online source that's publicly available but doesn't have an API. The latter is something very important and popular these days, but it’s beyond the scope of this article. Feel free to check out another article I’ve written on this topic. In any case, web scraping is very popular too as it greatly facilitates data acquisition be it for building a dataset from scratch or supplementing existing datasets.
Despite its immense usefulness, web scraping is not without its limitations. These generally fall under the umbrella of ethics since they aren’t set in stone nor are they linked to legal issues. Nevertheless, they are quite serious and could potentially jeopardize a data science project if left unaddressed. So, even though these ethical rules/guidelines vary from case to case, here is a summary of the most important ones that are relevant for a typical data science project:
It’s also important to keep certain other things in mind when it comes to web scraping. Namely, since web scraping scripts tend to hang (usually due to the server they are requesting data from), it's good to save the scraped data periodically. Also, make sure you scrape all the fields (variables) you need, even if you are not sure about them. It's better to err on the plus side since you can always remove unnecessary data afterward. If you need additional fields though, or your script hangs before you save the data, you'll have to redo the web scraping from the beginning.
If you want to learn more about ethics and other non-technical aspects of data science work, I invite you to check out a book I co-authored earlier this year. Namely, the Data Scientist Bedside Manner book covers a variety of such topics, including some hands-on advice to boost your career in this field, all while maintaining an ethical mindset. Cheers!
Zacharias Voulgaris, PhD
Passionate data scientist with a foxy approach to technology, particularly related to A.I.