What is a Corpus?

By Michael McCarthy

A corpus is a collection of texts, written or spoken, usually stored in a computer database. A corpus may be quite small, for example, containing only 50,000 words of text, or very large, containing many millions of words. …

Written texts in corpora might be drawn from books, newspapers, or magazines that have been scanned or downloaded electronically. Other written corpora might contain works of literature, or all the writings of one author (e.g., William Shakespeare). Such corpora help us to see how language is used in contemporary society, how our use of language has changed over time, and how language is used in different situations.

Spoken corpora, on the other hand, contain transcripts of spoken language. Such transcripts may be of ordinary conversations recorded in people’s homes and workplaces, or of phone calls, business meetings, radio broadcasts, or TV shows. Like written corpora, spoken corpora show us how language is used in real life and in many different contexts.

People build corpora of different sizes for specific reasons. For example, a very large corpus would be required to help in the preparation of a dictionary. It might contain tens of millions of words, because it has to include many examples of all the words and expressions that are used in the language. A medium-sized corpus might contain transcripts of lectures and seminars and could be used to write books for learners who need academic language for their studies. Such corpora range in size from one million to ten million words. Other corpora are more specialized and much smaller. These might contain the transcripts of business meetings, for instance, and could be used to help writers design materials for teaching business language.

Once a corpus is stored in a database, we can analyze it and “search” for information in the same way we use search engines to find keywords on the Internet, but with more sophisticated tools. By searching a corpus we can get answers to questions like these:

  • What are the most frequent words and phrases in English?
  • What are the differences between spoken and written English?
  • Which tenses do people use most frequently?
  • What prepositions follow particular verbs?
  • How do people use words like can, may and might?
  • Which words are used in more formal situations, and which are used in more informal ones?
  • How often do people use idiomatic expressions and why?
  • How many words must a learner know in order to participate in everyday conversation?
  • How many different words do native speakers generally use in conversation?
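
To make the idea of "searching" a corpus concrete, here is a minimal sketch in Python of how two of the questions above (the most frequent words, and the words that follow a particular verb) might be answered. The three-sentence corpus, the tokenize helper, and the following_words function are all invented for illustration; real corpus tools work on far larger collections and handle punctuation, spelling variants, and word forms much more carefully.

```python
import re
from collections import Counter

# Hypothetical toy corpus (an assumption for illustration only);
# a real corpus would contain thousands or millions of words.
corpus = [
    "I was looking at the photos when she called.",
    "We are looking for a new flat, and he is looking for a job.",
    "You can look at the menu while I look for my keys.",
]


def tokenize(text):
    """Lowercase a text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def following_words(texts, verb_forms):
    """Count the word that comes immediately after any of the given verb forms."""
    counts = Counter()
    for text in texts:
        tokens = tokenize(text)
        # Pair each token with its successor within the same text.
        for current, nxt in zip(tokens, tokens[1:]):
            if current in verb_forms:
                counts[nxt] += 1
    return counts


# Question: what are the most frequent words?
word_counts = Counter(tok for text in corpus for tok in tokenize(text))

# Question: which words follow the verb "look"?
after_look = following_words(corpus, {"look", "looking"})

print(word_counts.most_common(5))
print(after_look.most_common())
```

In this toy example, "for" follows "look" or "looking" three times and "at" follows it twice. With a corpus of millions of words, the same kind of count begins to show which patterns are genuinely typical of the language.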

With corpora and software tools to analyze them, we can see how language is really used. We no longer have to rely heavily on intuition to know what we say or what we write; instead we can see what hundreds of different speakers and writers have actually said or written, all at the click of a mouse.

A corpus, then, is simply a large collection of texts that we can analyze using computer software, just as we can access the millions of texts on the Internet. It is not a theory of language learning or a teaching methodology, but it does influence our way of thinking about language and the kinds of texts and examples we use in language teaching.

Editor's Note: Reprinted with permission from McCarthy, M. 2004. Touchstone: From Corpus to Course Book. Cambridge: Cambridge University Press.

