Data Scraping & Data Mining
Author: Sheona Burrow
Illustration: Ilaria Urbinati
Data scraping and data mining are (usually technological) processes in which data (usually held in some form of database) is accessed and used, sometimes as part of a wider process of data cleansing, integration, transformation or evaluation. Both are common research and business techniques: a price comparison tool, for example, will usually scrape pricing data from databases. Unfortunately, it can be difficult to know when such use is lawful because several overlapping legal regimes apply. Legal use can depend on what is in the data (for example, personal information about individuals or copyright works), on whether the data is held in a legally protected database, and on whether there is some form of contract with the database owner (e.g. terms and conditions).
This section aims to provide a picture of the different layers of legal protection one needs to consider in order to make informed decisions around data scraping and data mining. Lawful access to and use of data is not a settled legal matter. There is a need to establish community-specific norms that are in tune with legitimate public interest needs as well as competition and innovation concerns.
Both copyright law and a special database law protect databases which meet certain criteria. To be protected, a database must be a ‘collection of independent works, data or other materials, which are arranged in a systematic or methodical way and are individually accessible by electronic or other means.’ This is quite a broad definition and will capture most databases.
A special right called the sui generis database right (SGDR) protects data in a database from unauthorised extraction or reutilisation unless there is a basis for legal use. However, it protects those who have invested in the ‘obtaining, verification or presentation’ of the data, rather than those who created the original data.
Copyright law also protects a database if, by reason of ‘the selection or arrangement of the contents of the database’, the database ‘constitutes the author’s own intellectual creation.’ This protection targets databases where the data has been compiled in a way that required some intellectual thought. For example, a compilation of addresses listed alphabetically is unlikely to be protected by copyright. The protection covers the structure of the database and not the data itself.
The two protections differ slightly in effect because they last for different periods of time. They can also overlap and apply to the same database, and can even be owned by different parties. Both apply in addition to all other forms of protection for the data, including contractual and competition law restrictions and confidentiality.
Click on the most relevant links below to read more about whether data scraping or data mining is permissible in the UK.
Personal information capable of identifying living individuals is granted special protection by most legal regimes, including that of the United Kingdom.
As well as potentially protecting any works in the data, copyright law also protects a database where the database structure or arrangement is in some way the ‘intellectual creation’ of the author.
3. I want to scrape or use a dataset which I don’t think contains any personal information or copyright protected works
Protection of a dataset is contextual in every case – the data, database and other context may dictate higher or lower risk for data scraping and data mining.
It is important for users to have a better understanding of how to deal with the influence of algorithms on the production, distribution and consumption of creative works, and the implications for copyright.
Students and researchers often need to make use of materials which are copyright protected. In the context of their research or study, they may have to make copies or use extracts of those materials.
The electronic analysis of large amounts of copyright works allows researchers to discover patterns, trends and other useful information that cannot be detected through usual ‘human’ reading. This process, known as ‘text and data mining’…