

Linguistic Corpora

Corpus Linguistics


Linguistic Corpora: A collection of linguistic data, either written texts or a transcription of recorded speech, which can be used as a starting-point of linguistic description or as a means of verifying hypotheses about a language (corpus linguistics). Linguistic descriptions which are ‘corpus-restricted’ have been the subject of criticism, especially by generative grammarians, who point to the limitations of corpora (e.g. that they are samples of performance only, and that one still needs a means of projecting beyond the corpus to the language as a whole). In fieldwork on a new language, or in historical study, it may be very difficult to get beyond one's corpus (i.e. it is a ‘closed’ as opposed to an ‘extendable’ corpus), but in languages where linguists have regular access to native-speakers (and may be native-speakers themselves) their approach will invariably be ‘corpus-based’, rather than corpus-restricted. Corpora provide the basis for one kind of computational linguistics. A computer corpus is a large body of machine-readable texts. Increasingly large corpora (especially of English) have been compiled since the 1980s, and are used both in the development of natural language processing software and in such applications as lexicography, speech recognition and machine translation.

-David Crystal. A Dictionary of Linguistics and Phonetics, 2003

Linguistic Data Consortium

The University of Chicago has subscribed to the Linguistic Data Consortium (LDC) since 2001, so authorized UC users have access to all of the corpora that LDC has produced from 2001 to the present. In addition, we have separately acquired a small number of LDC corpora from 1992-2000. If you need corpora from these early years that we lack, please contact the Linguistics Bibliographer.

Note that most corpora are available as CDs or DVDs and can be accessed by individual title through the Library Catalog. Some corpora are available online only, and another subset is available both on CD/DVD and online. Many of these are available for download. To determine which corpora are available for downloading:

  • Register with LDC as an authorized University of Chicago user (see directions below) and wait for email confirmation
  • On the LDC web site, click the MEMBERS tab
  • Click the INTRANET link and log in
  • Click the CORPORA AVAILABLE FOR DOWNLOAD link to determine which are available to University of Chicago users


Subject Specialist

June Farris
Bibliographer for Slavic, East European and Eurasian Studies & General Linguistics
Joseph Regenstein Library
Room 263

Related Links