Text & Data Mining
Author: Maurizio Borghi
Illustration: Davide Bonazzi
The electronic analysis of large amounts of copyright works allows researchers to discover patterns, trends and other useful information that cannot be detected through usual ‘human’ reading. This process, known as ‘text and data mining’, may lead to knowledge which is contained in the works being mined but has not yet been explicitly formulated. For example, the processing of data contained in a large collection of scientific papers in a particular medical field could suggest a possible association between a gene and a disease, or between a drug and an adverse event, without this connection being explicitly identified or mentioned in any of the papers.
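As a rough illustration of the kind of processing described above, the following minimal sketch counts how often two terms are mentioned in the same document. Everything in it is hypothetical: the four sentences, the labels GENE_A, GENE_B and DISEASE_X, and the function name cooccurrence_counts are invented for the example, and simple co-occurrence counting is only one of many techniques a real text mining project might use.

```python
from collections import Counter
from itertools import product

# Hypothetical toy corpus: invented sentences, not real findings.
papers = [
    "Expression of GENE_A was elevated in the patient cohort.",
    "Patients with DISEASE_X showed altered GENE_A activity.",
    "GENE_B variants were unrelated to DISEASE_X outcomes.",
    "A follow-up of DISEASE_X cases again measured GENE_A levels.",
]
genes = ["GENE_A", "GENE_B"]      # assumed term list
diseases = ["DISEASE_X"]          # assumed term list


def cooccurrence_counts(documents, left_terms, right_terms):
    """Count the documents in which each (left, right) pair appears together."""
    counts = Counter()
    for text in documents:
        for left, right in product(left_terms, right_terms):
            if left in text and right in text:
                counts[(left, right)] += 1
    return counts


counts = cooccurrence_counts(papers, genes, diseases)

# Pairs that appear together most often are candidates for closer study.
for pair, n in counts.most_common():
    print(pair, n)
```

A high count does not prove a link. It only flags a pair of terms that a researcher may wish to investigate further.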
Scientific publishers recognise the importance of text mining and offer various applications which can be used by researchers. For instance, Elsevier (RELX Group) makes available to its subscribers facilities to carry out independent text mining research, and offers services of customised text mining research in the fields of life science and pharmaceutical research.
Text mining is also used by researchers in the humanities. Google Books, one of the largest existing collections of digitised books, offers a ‘text mining experience’ to all internet users through Ngram Viewer, a graphic tool created in collaboration with researchers from Harvard University. The tool tracks the frequency of particular words or combinations of letters across more than five million digitised books published between 1800 and 2000. However, access to the whole corpus of Google Books for more sophisticated text mining research is restricted, and can only be obtained upon request.
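The idea of tracking a word's frequency over time can be sketched in a few lines. This is not how Ngram Viewer itself is implemented. It is a minimal illustration under stated assumptions: the three short 'books', their years and the function name relative_frequency_by_year are all invented for the example.

```python
import re
from collections import Counter

# Hypothetical toy corpus of (publication_year, text) pairs.
books = [
    (1850, "The telegraph carried the news. The telegraph was new."),
    (1900, "The telephone rang while the telegraph sat idle."),
    (1950, "The telephone and the television filled the home."),
]


def relative_frequency_by_year(corpus, word):
    """Return {year: occurrences of `word` / all words published that year}."""
    hits = Counter()
    totals = Counter()
    for year, text in corpus:
        # Lower-case the text and keep alphabetic runs as tokens.
        tokens = re.findall(r"[a-z]+", text.lower())
        totals[year] += len(tokens)
        hits[year] += tokens.count(word.lower())
    return {year: hits[year] / totals[year] for year in sorted(totals)}


trend = relative_frequency_by_year(books, "telegraph")
for year, share in trend.items():
    print(year, round(share, 3))
```

Dividing by the total number of words for each year makes periods with many books comparable with periods with few.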
Technologies based on the electronic analysis of large amounts of works are still in their infancy, and the possibilities they might open up in the future are largely unpredictable.
Text mining and copyright
Text and data mining often involves copying large amounts of copyright material. In order to ‘mine’ texts and other content, researchers need to access, copy and process them using computer programs. Even if researchers can lawfully access and read the material, for instance through their university library, copying a substantial part of works may infringe copyright in those works (where what is ‘substantial’ depends on the context and circumstances).
However, copyright was never meant to restrict the use of the ideas, facts and information that exist in a work. The UK Supreme Court recently reaffirmed this principle in a case on internet browsing: ‘Broadly speaking, it is an infringement to make or distribute copies or adaptations of a protected work. Merely viewing or reading it is not an infringement.’ (Public Relations Consultants Association Ltd v The Newspaper Licensing Agency Ltd, UKSC 18). Text and data mining can be understood as a technology that simply substitutes for human viewing or reading. One can therefore argue that copying in the course of a text mining process should be considered merely incidental to the way this technology works, rather than an activity aimed at exploiting the copyright-protected material. In this respect, copyright owners (publishers) have normally been willing to give researchers permission to ‘mine’ works contained in their catalogues, especially because the research may result in mutually beneficial outputs, such as software tools that eventually improve the value of those catalogues. In this way, readers and researchers are not competitors but allies of copyright owners.
An exception for text and data analysis
In the UK, copyright law provides an exception that allows researchers to make copies of works ‘for text and data analysis’. This means that where a user has lawful access to a work they can make a copy of it for the purpose of carrying out a computational analysis of anything recorded in the work. The exception applies under the following conditions:
1) The computational analysis must be for the purpose of non-commercial research.
2) The copy must be accompanied by sufficient acknowledgment (unless this is practically impossible).
The provision further specifies that copyright is infringed if the copy made is transferred to another person, or if it is used for purposes other than those permitted by the exception (although the researcher could ask the owner for permission to do either of these things). Also, copies made for text and data analysis cannot be sold or let for hire.
Importantly, the provision states that the activities covered by the exception cannot be ruled out by contract. Contractual terms which purport to restrict or prevent the doing of the acts permitted under the exception are unenforceable.
Although text and data analysis is mainly concerned with mining literary works, the exception covers all categories of copyright works, and a parallel exception applies to recordings of performances.
Researchers who have lawful access to a work or a recording of a performance in electronic format (for instance, through the library of their own institution) can freely make further copies of those works or recordings to carry out computational analysis of their content, without having to ask for permission from the copyright owner (for instance the publisher or the recording company). This is true irrespective of the terms and conditions set out in any licensing agreement between the publisher and the library. However, the research must be non-commercial in nature and the researcher must give acknowledgment of the source, unless this is impossible for practical reasons. Acknowledgment is typically impractical where the computational analysis involves large quantities of works.
Legal and technological restrictions on databases
It should be borne in mind that other legal or technical restrictions may limit the access to collections of works, such as databases of scientific publishers. Examples of such databases are JSTOR, ScienceDirect and LexisNexis.
In the UK and in the EU, any collection of data, information or works which required substantial investment in obtaining, verifying or presenting its contents is protected by a ‘database right’. The database right is an exclusive right that prevents substantial extraction or re-utilisation of the content of the database, as well as systematic insubstantial extraction of that content (where what is ‘substantial’ and ‘systematic’ depends on the context). Moreover, the use of a database can also be regulated by contract. In some cases, access to a database may require acceptance of ‘terms and conditions’ that restrict certain activities, including text and data analysis. But, as with the copyright exception discussed above, engaging in permissible activities on a database for the purpose of text and data analysis cannot be ruled out by contract.
However, the European Court of Justice has said that when a database is not protected either by copyright or the ‘database right’, the owner is free to determine the contractual conditions of the use of such database. In effect, this means the owner can rely on contract law to prevent or restrict text and data analysis of the database.
Databases are also usually sheltered by technological measures which impede systematic access to their contents and ‘bulk’ copying. For example, public domain books in Google Books can be read and downloaded individually, but no one can access the entire corpus of these digitised books at once. Hence, researchers may need not only permission, but also technical support from the database owner before engaging in large-scale computational analysis of the contents of a database. For this reason, despite the fact that researchers can rely on the exception for text and data analysis, collaboration between database owners and researchers remains a fundamental component of text and data mining research.
Public Relations Consultants Association Ltd v The Newspaper Licensing Agency Ltd,  UKSC 18
‘Broadly speaking, it is an infringement to make or distribute copies or adaptations of a protected work. Merely viewing or reading it is not an infringement’.
Ryanair Ltd v PR Aviation BV, Case C-30/14, § 43
‘[I]f the author of a database decides to authorise the use of its database or a copy thereof, he has the option […] to regulate that use by an agreement concluded with a lawful user which sets out […] the ‘purposes and the way’ of using that database or a copy thereof.’
The law on copying works and recordings of performances for the purpose of text and data mining is set out in section 29A and Schedule 2(2)1D of the Copyright, Designs and Patents Act 1988.
The Value and Benefits of Text Mining, Report of JISC (2012), available at: http://www.jisc.ac.uk/publications/reports/2012/value-and-benefits-of-text-mining.aspx
More information about text mining is available from the website of the National Centre for Text Mining (NaCTeM) at the University of Manchester.