Text and Data Mining

Guide to text mining resources available through the University of Chicago Library

What is Text Mining?

Text mining is a research technique using computational analysis to uncover patterns in large text-based data sets.

It is useful in numerous scholarly fields, from the humanities, where it is one of the tools of digital humanities, to the sciences, where useful data can be mined from text databases of published literature.

Source: UMass Amherst Libraries

Which Library resources permit text mining?

Text and data mining is not permitted under most of the Library's license agreements. The resources listed in this guide are the exceptions.

For information about text and data mining access to other resources, researchers should contact:

Kristin Martin, Electronic Resources Management Librarian


BioMed Central

BioMed Central makes available over 250,000 full-text, peer-reviewed articles for text and data mining.

Gale (Including the Economist Historical Archive)

Researchers may request text mining access to content from most Gale Digital Collections.


The HathiTrust Research Center provides computational access to HathiTrust.

  • Use the Workset Builder to create corpora from HathiTrust content and then search the full, uncorrected OCR text or within specific metadata fields.

Related Tools:

HathiTrust+Bookworm - Occurrences of the word "Chicago" by year

JSTOR Data for Research

The JSTOR Data for Research (DfR) service, freely available to the public, provides text and data mining tools for selecting and interacting with the content in JSTOR. The tools include faceted searching, topic modeling, and data visualization. Researchers can obtain, view, and bulk download document-level datasets, including word frequencies, citations, key terms, and ngrams. JSTOR will work with you to tailor datasets to your needs. For more information, see the Data for Research FAQ.
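
For readers unfamiliar with the terms, the sketch below shows what word frequencies and ngrams are by counting them in a short sample sentence. It is an illustration only: the function name ngram_counts, the tokenization rule, and the sample text are assumptions made for this example, and they do not reflect how JSTOR computes or formats its datasets.

```python
import re
from collections import Counter


def ngram_counts(text, n=2):
    """Count word n-grams after lowercasing and a simple regex tokenization.

    Hypothetical helper for illustration; not JSTOR's method or file format.
    """
    tokens = re.findall(r"[a-z0-9']+", text.lower())
    # Zip n staggered copies of the token list to form consecutive n-grams.
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)


sample = "the city of Chicago and the city of Boston"
unigrams = ngram_counts(sample, n=1)  # word frequencies
bigrams = ngram_counts(sample, n=2)   # two-word sequences

print(unigrams.most_common(3))
print(bigrams.most_common(3))
```

With n=1 the function returns plain word frequencies; larger values of n count longer word sequences.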

Library of Congress Chronicling America

Chronicling America provides access to information about historic newspapers and select digitized newspaper pages. To encourage a wide range of potential uses, the Library of Congress designed several different views of the data, all of which are publicly visible. Each uses common Web protocols, and access is not restricted in any way. You do not need to apply for a special key to use them. Together they make up an extensive application programming interface (API) which you can use to explore the data in many ways.

Library of Congress, Chronicling America API
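
As a rough sketch of how the API can be queried without a key, the example below builds a page-search URL that asks for JSON results. The base URL and the parameter names (andtext, format, rows, dateFilterType, date1, date2) are assumptions based on the publicly documented search interface and should be checked against the current Chronicling America API documentation. The helper names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed search endpoint; confirm against the Library of Congress documentation.
BASE_URL = "https://chroniclingamerica.loc.gov/search/pages/results/"


def build_page_search_url(terms, start_year=None, end_year=None, rows=20):
    """Build a URL that searches digitized newspaper pages for the given terms."""
    params = {"andtext": terms, "format": "json", "rows": rows}
    if start_year is not None and end_year is not None:
        # Restrict results to a range of years (assumed parameter names).
        params.update(
            {"dateFilterType": "yearRange", "date1": start_year, "date2": end_year}
        )
    return BASE_URL + "?" + urlencode(params)


def fetch_json(url):
    """Download and parse a JSON response. Not called here; needs network access."""
    with urlopen(url) as response:
        return json.load(response)


url = build_page_search_url("chicago fire", start_year=1871, end_year=1872)
print(url)
```

Calling fetch_json(url) would retrieve the results; keep automated requests modest so the public service stays available to everyone.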


PLOS provides two APIs:

  • PLOS Search API enables querying of the content of PLOS journals. See the PLOS Search API FAQ.
  • PLOS Article-Level Metric (ALM) API provides access to data collected from articles published in PLOS journals, including usage statistics (e.g. page views, downloads), citation counts, mentions in Wikipedia, and activity on social networks and blogs. For more information, see the PLOS ALM API FAQ.

See also the PLOS API Display Policy.
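
The sketch below shows one way to compose a PLOS Search API query. It assumes the endpoint https://api.plos.org/search and the Solr-style parameters q, fl, rows, and wt; the field names id and title are likewise assumptions. Consult the PLOS Search API FAQ for the authoritative list of fields and usage limits. The function name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed endpoint for the PLOS Search API; verify in the PLOS documentation.
PLOS_SEARCH_URL = "https://api.plos.org/search"


def build_plos_search_url(query, fields=("id", "title"), rows=10):
    """Build a search URL that requests selected fields as JSON."""
    params = {
        "q": query,              # Solr-style query, e.g. title:DNA
        "fl": ",".join(fields),  # fields to return for each article
        "rows": rows,            # number of results per request
        "wt": "json",            # response format
    }
    return PLOS_SEARCH_URL + "?" + urlencode(params)


plos_url = build_plos_search_url("title:DNA")
print(plos_url)
```

The resulting URL can be opened in a browser or passed to any HTTP client to inspect the JSON response.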

ProQuest Historical Newspapers

ProQuest offers data for text and data mining for select years of Historical Newspapers. The Library may already have this data available, or will work with individual researchers to acquire the files. Please contact us for more information.

Baltimore Afro-American 1893-1988
Boston Globe 1872-1983
Chicago Defender 1910-1975
Guardian and Observer (Guardian) 1821-1907
Guardian and Observer (Observer) 1791-1907
Irish Times (Irish Times) 1859-1926
Irish Times (Irish Weekly Times) 1876-1926
Jerusalem Post 1932-1976
LA Times 1881-1931
New York Times 1851-1934
New York Herald Tribune 1841-1962
SF Chronicle 1865-1922
Times of India 1838-2005
Wall Street Journal 1889-1933
Washington Post 1877-1933

ScienceDirect / Scopus / Elsevier

Access to ScienceDirect and Scopus content for text mining is available through an API, for which an API key is needed. To get an API key:

  1. Go to the "My Projects" page on the Elsevier developer portal.
  2. Log in with your ScienceDirect/Scopus username or create a new profile. This account is separate from your CNetID and password.
  3. On the "My Projects" page, click on "Register a New Text Mining Project."
  4. Enter a project name and description and accept the text mining user agreement.
  5. You will now see your newly registered project listed under "My Text Mining Projects". Click "View API Key" to get your API key.
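
Once you have a key, it is sent with each API request. The sketch below prepares a full-text request for one article; it assumes the endpoint https://api.elsevier.com/content/article/doi/ and the X-ELS-APIKey request header, both of which should be confirmed on the Elsevier developer portal. The DOI and key shown are placeholders, and the function name is hypothetical.

```python
from urllib.parse import quote

# Assumed base URL for article retrieval by DOI; verify on the developer portal.
ARTICLE_BASE_URL = "https://api.elsevier.com/content/article/doi/"


def build_article_request(doi, api_key, accept="text/xml"):
    """Return the URL and headers for a full-text request (nothing is sent)."""
    url = ARTICLE_BASE_URL + quote(doi, safe="/")
    headers = {
        "X-ELS-APIKey": api_key,  # assumed header name for the API key
        "Accept": accept,         # requested response format
    }
    return url, headers


# Placeholder values for illustration only.
article_url, article_headers = build_article_request(
    "10.1016/j.example.2020.01.001", "YOUR_API_KEY"
)
print(article_url)
```

The URL and headers can then be passed to an HTTP client, for example urllib.request.Request(article_url, headers=article_headers). Keep the key out of shared code and public repositories.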

For further instructions, see:

Springer Nature

Springer Nature allows text mining from content collected from their websites, including:

Content may be downloaded manually or through automated means, but must be stored on a server accessible only to University of Chicago affiliates. Automated downloads must not exceed one request per second. Content must be deleted at the conclusion of the text and data mining project. If you have any questions regarding the content that is covered by this agreement or mechanisms for downloading, contact the Library.
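
One way to respect the one-request-per-second limit in a script is to pause before each download. The sketch below is a minimal example of that idea; the class name RateLimiter and its parameters are assumptions made for illustration and are not part of any Springer Nature tool.

```python
import time


class RateLimiter:
    """Enforce a minimum interval between successive calls to wait()."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock   # injectable so the logic can be tested without waiting
        self._sleep = sleep
        self._last = None     # time of the previous request, if any

    def wait(self):
        """Block until at least min_interval has passed since the last call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


limiter = RateLimiter(min_interval=1.0)
# Typical use, where download() is a hypothetical function you supply:
# for url in urls:
#     limiter.wait()
#     download(url)
```

Calling limiter.wait() before every request keeps the script at or below one request per second, however quickly each download finishes.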

Text Creation Partnership

From the website: The Text Creation Partnership creates standardized, accurate XML/SGML encoded electronic text editions of early print books. We transcribe and mark up the text from the millions of page images in ProQuest's Early English Books Online, Gale Cengage's Eighteenth Century Collections Online, and Readex's Evans Early American Imprints.

This work, and the resulting text files, are jointly funded and owned by more than 150 libraries worldwide. All of the TCP's work will be released into the public domain for anyone to use.

Currently available collections are:

To download the files, email the appropriate help address.

More Text and Data Mining Resources

Ask a Librarian
