
Text and Data Mining

Guide to text mining resources available through the University of Chicago Library

What is Text Mining?

Text mining is a research technique using computational analysis to uncover patterns in large text-based data sets.

It is useful in numerous scholarly fields, from the humanities, where it is one of the tools of digital humanities, to the sciences, where useful data can be mined from text databases of published literature.

Source: UMass Amherst Libraries

Which Library resources permit text mining?

Text and data mining is sometimes permitted under the Library's license agreements. In the vast majority of cases, providers prohibit any automated searching, scraping, and/or downloading of content, even if you are only "testing." Raw data files may instead be supplied through other mechanisms (APIs, hard drives, download sites). This guide is a non-exhaustive list of resources for which the Library has secured rights for text and data mining. Even with rights secured, obtaining the data may involve additional costs and/or time.

For information about text and data mining access to other resources, researchers should contact:

Kristin Martin, Director of Technical Services

Email: kmarti@uchicago.edu

Adam Matthew

The Library has secured text mining rights from the publisher, Adam Matthew. The Library has purchased multiple archives from Adam Matthew, including the collections below.

Although rights are secured, an additional project description will be required from the researcher before Adam Matthew will release the data. Additional costs may be involved. Adam Matthew prohibits any automated searching and downloading of content from the website. Please contact the Library for assistance if you have a text and data mining project where you would like to use content from Adam Matthew.

ARTFL Project

Brill

Brill permits text mining on its publications and on content purchased by the Library; however, the content cannot be downloaded automatically. Please contact us with specifics and we will work with Brill to provide the content needed.

Gale (Including the Economist Historical Archive)

Researchers may request text mining access to content from most Gale Digital Collections.

HathiTrust

The HathiTrust Research Center provides computational access to HathiTrust.

  • Use the Workset Builder to create corpora from HathiTrust content and then search the full, uncorrected OCR text or within specific metadata fields.

Related Tools:

HathiTrust+Bookworm - Occurrences of the word "Chicago" by year

JSTOR Data for Research

The JSTOR Data for Research (DfR) service, freely available to the public, provides text and data mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can obtain, view and bulk download document-level datasets, including word frequencies, citations, key terms and ngrams. JSTOR will work with you to tailor datasets to your needs.  For more information, see the Data for Research FAQ.
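
To illustrate what document-level word frequencies and ngrams look like, here is a minimal Python sketch that computes them for a short sample string. It is not JSTOR's implementation: the tokenization rule, the function names, and the sample text are all assumptions made for this example.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens.

    This simple rule is an assumption; real datasets may tokenize differently.
    """
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens (n=1 gives word frequencies)."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


# Hypothetical sample text, used only to demonstrate the counting.
sample = "Text mining uncovers patterns in text. Text mining scales."
tokens = tokenize(sample)
unigrams = ngram_counts(tokens, 1)
bigrams = ngram_counts(tokens, 2)
```

Calling `unigrams.most_common(5)` or `bigrams.most_common(5)` lists the most frequent terms, which is the kind of summary a document-level dataset lets you compute without the full text.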

Library of Congress Chronicling America

Chronicling America provides access to information about historic newspapers and select digitized newspaper pages. To encourage a wide range of potential uses, the Library of Congress designed several different views of the data, all of which are publicly visible. Each uses common Web protocols, and access is not restricted in any way. You do not need to apply for a special key to use them. Together they make up an extensive application programming interface (API) which you can use to explore the data in many ways.

Library of Congress, Chronicling America API
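
As a rough sketch of how the API can be queried without a key, the Python below builds a page-search URL that requests JSON. The endpoint path and the parameter names (`andtext`, `date1`, `date2`, `dateFilterType`, `format`, `page`) are assumptions based on the commonly documented search interface; check them against the current Chronicling America API documentation before relying on them. The function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed search endpoint; verify against the current API documentation.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_search_url(term, start_year, end_year, page=1):
    """Build a URL that searches digitized pages for a term in a year range."""
    params = {
        "andtext": term,                # word(s) that must appear in the OCR text
        "date1": start_year,            # first year of the range
        "date2": end_year,              # last year of the range
        "dateFilterType": "yearRange",
        "format": "json",               # ask for JSON instead of HTML
        "page": page,
    }
    return BASE_URL + "?" + urlencode(params)


def fetch_results(url):
    """Download and decode one page of JSON results (requires network access)."""
    with urlopen(url) as response:
        return json.load(response)


url = build_search_url("Chicago", 1900, 1910)
```

Even though access is unrestricted, it is courteous to pause between requests when paging through many results.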

Linguistic Data Consortium (LDC)

The University of Chicago has subscribed to the Linguistic Data Consortium since 2001, so authorized UC users have access to all of the corpora that LDC has produced from 2001 to the present. In addition, we have separately acquired a small number of LDC corpora from 1992-2000. Note that most corpora are available as CDs or DVDs and can be accessed by individual title through the Library Catalog. Some corpora are available online only, and another subset is available both on CD/DVD and online. Many of these are available for download. Follow the links below to view the data sets available to users and how to download the data available online.

PLOS

PLOS provides two APIs:

  • PLOS Search API enables querying of the content of PLOS journals. See the PLOS Search API FAQ.
  • PLOS Article-Level Metric (ALM) API provides access to data collected from articles published in PLOS journals, including usage statistics (e.g. page views, downloads), citation counts, mentions in Wikipedia, and activity on social networks and blogs. For more information, see the PLOS ALM API FAQ.

See also the PLOS API Display Policy.

ProQuest Historical Newspapers

ProQuest offers data for text and data mining for select years of Historical Newspapers. The Library may already have this data available, or will work with individual researchers to acquire the files. Please contact us for more information.

Baltimore Afro-American 1893-1988
Boston Globe 1872-1983
Chicago Defender 1910-1975
Guardian and Observer (Guardian) 1821-1907
Guardian and Observer (Observer) 1791-1907
Irish Times (Irish Times) 1859-1926
Irish Times (Irish Weekly Times) 1876-1926
Jerusalem Post 1932-1976
LA Times 1881-1931
New York Times 1851-1934
New York Herald Tribune 1841-1962
SF Chronicle 1865-1922
Times of India 1838-2005
Wall Street Journal 1889-1933
Washington Post 1877-1933

PubMed

The full PubMed database can be downloaded and kept up to date with daily update files. A press release from June 2017 provides further details and links to the files.

ScienceDirect / Scopus / Elsevier

Access to ScienceDirect and Scopus content for text mining is available through an API, for which an API key is needed. To get an API key:

  1. Go to the "My Projects" page on the Elsevier developer portal.
  2. Log in with your ScienceDirect/Scopus username or create a new profile (a separate account from your CNetID and password must be created).
  3. On the "My Projects" page, click on "Register a New Text Mining Project."
  4. Enter a project name and description and accept the text mining user agreement.
  5. You will now see your newly registered project listed under "My Text Mining Projects". Click "View API Key" to get your API key.
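
Once you have a key, requests typically carry it in an HTTP header. The Python sketch below prepares (but does not send) such a request. The `X-ELS-APIKey` header name, the Scopus search endpoint, and the `query` parameter are assumptions to verify against the Elsevier developer documentation; the `ELSEVIER_API_KEY` environment variable and the function name are hypothetical choices for this example.

```python
import os
from urllib.parse import urlencode
from urllib.request import Request

# Assumed endpoint; confirm in the Elsevier developer portal documentation.
SCOPUS_SEARCH_URL = "https://api.elsevier.com/content/search/scopus"


def build_scopus_request(query, api_key):
    """Prepare a search request that carries the API key in a header."""
    url = SCOPUS_SEARCH_URL + "?" + urlencode({"query": query})
    headers = {
        "X-ELS-APIKey": api_key,        # assumed header name for the key
        "Accept": "application/json",   # ask for a JSON response
    }
    return Request(url, headers=headers)


# Read the key from the environment so it never appears in shared code.
# ELSEVIER_API_KEY is a hypothetical variable name.
api_key = os.environ.get("ELSEVIER_API_KEY", "your-api-key-here")
request = build_scopus_request('TITLE("text mining")', api_key)
```

Passing `request` to `urllib.request.urlopen` would send it. Keeping the key out of source files matters because it is tied to your registered project.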

For further instructions, see:

Springer Nature

Springer Nature allows text mining from content collected from their websites, including:

Content may be downloaded manually or through automated means, but must be stored on a server only accessible to University of Chicago affiliates. Automated means should not be faster than one request per second. Content must be deleted at the conclusion of the text and data mining project. If you have any questions regarding the content that is covered by this agreement or mechanisms for downloading, contact the Library.
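
One way to respect the one-request-per-second limit is to enforce a minimum interval between downloads. The following Python sketch shows the idea; the `Throttle` class and `fetch_all` function are hypothetical helpers written for this guide, not part of any Springer Nature tool, and the `fetch` argument stands in for whatever download function your project uses.

```python
import time


class Throttle:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval  # seconds between requests
        self._clock = clock
        self._sleep = sleep
        self._last = None                 # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def fetch_all(urls, fetch, throttle):
    """Call fetch(url) for each URL, never faster than the throttle allows."""
    results = []
    for url in urls:
        throttle.wait()
        results.append(fetch(url))
    return results
```

For example, `fetch_all(urls, download_one, Throttle(1.0))` would call a `download_one` function of your own at most once per second. Remember that the downloaded files must still be stored on a restricted server and deleted when the project ends.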

Text Creation Partnership

From the website: The Text Creation Partnership creates standardized, accurate XML/SGML encoded electronic text editions of early print books. We transcribe and mark up the text from the millions of page images in ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints.

This work, and the resulting text files, are jointly funded and owned by more than 150 libraries worldwide. All of the TCP's work will be released into the public domain for anyone to use.

Currently available collections are:

To download the files, email the appropriate help address.

More Text and Data Mining Resources