Corpus Use for Beginners
By Susan Holzman
To get a feel for what you are going to experience, you might begin by reading Professor Mark Davies’ home page:
http://davies-linguistics.byu.edu/personal/
Then read this page which describes the various corpora that are available:
There are several corpora listed here. For example, the Corpus of Contemporary American English (COCA) has 425 million words and was last updated in 2012. Another is the British National Corpus (BNC), which has 100 million words and has not been updated since 1993. This does not negate or diminish the value of the BNC, but it does qualify it, and the user should be aware of this.
Corpus | Size (words) | Variety | Period
Corpus of Contemporary American English (COCA) | 425 million | American English | 1990-2012
BYU-BNC: British National Corpus* | 100 million | British English | 1980s-1993
Steps for corpus use:
- Go to the site:
- Register/log in:
It is possible to access the corpus without registering for the site, but after a few searches, you will be asked to register. This is easily done in the upper right-hand corner of the page.
- Have a question:
An editor friend of mine once asked my opinion about a disagreement she had with a client. The editor maintained that differentiate should not be followed by between. She said that we differentiate X from Y and not between X and Y. I could check the use of differentiate in two ways: I could type differentiate into the search box (labeled WORD(S)), or I could type differentiate between.
- Search:
- Type differentiate between into the box. Click on SEARCH.
- A new window opens showing the words differentiate between and the number of times this phrase appears in the corpus. Clicking on the phrase gives actual instances of use.
- The data clearly demonstrate that this phrase is commonly used in spoken English, in academic texts and in magazines.
- Based on these data, my editor friend had to concede that her client was right. This might not change her opinion that the phrase is clumsy or awkward, but using between is not wrong.
- Get your answer!
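For readers who like to see the idea behind a phrase search, here is a minimal Python sketch of what such a search does: it counts how often a phrase occurs and breaks the count down by genre. The sentences and the function name phrase_counts are invented for illustration; they are not corpus data, and this is not how the corpus site itself is implemented.

```python
import re
from collections import Counter

# Hypothetical mini-corpus of (genre, sentence) pairs, invented for
# illustration. These sentences are NOT taken from COCA or the BNC.
SAMPLE = [
    ("spoken", "It is hard to differentiate between the two proposals."),
    ("academic", "The test cannot differentiate between strains A and B."),
    ("academic", "We differentiate the treated cells from the controls."),
    ("magazine", "Few shoppers differentiate between the brands."),
    ("fiction", "She could not differentiate one twin from the other."),
]


def phrase_counts(phrase, corpus):
    """Count case-insensitive whole-word matches of `phrase` per genre."""
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    counts = Counter()
    for genre, text in corpus:
        counts[genre] += len(pattern.findall(text))
    return counts


counts = phrase_counts("differentiate between", SAMPLE)
print("total:", sum(counts.values()))
for genre, n in counts.items():
    print(genre, n)
```

A real corpus search works over hundreds of millions of words rather than five sentences, but the output has the same shape: a total, plus a count for each section of the corpus.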
More advanced corpus use:
- Type differentiate in the search box (see below).
- Go to the drop-down POS (part of speech) LIST and select “prep.ALL” (all prepositions). After selection, the symbol for prepositions appears in the collocates box.
- Click on search.
The editor who believed that “differentiate” should not be followed by “between” could now learn that “differentiate” is followed by a great number of prepositions. The corpus tells how many times each combination appears, and with a few more clicks, it is possible to know the dates and sources of the use of the combination: spoken or written, fiction, academic, magazine or newspaper.
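The collocate search can be pictured in the same way. The sketch below tallies which prepositions come directly after a given word. The preposition list is short and incomplete, the sentences are invented, and the function name following_prepositions is hypothetical; the corpus site relies on its own part-of-speech tagging, which this sketch does not attempt to reproduce.

```python
import re
from collections import Counter

# A small, non-exhaustive preposition list; an assumption for this sketch.
PREPOSITIONS = {"among", "between", "by", "from", "in", "on", "with"}

# Hypothetical sentences invented for illustration, not corpus data.
SENTENCES = [
    "Readers differentiate between fact and opinion.",
    "The label helps differentiate among the three models.",
    "Cells differentiate in response to the signal.",
    "It is hard to differentiate between the two.",
    "We differentiate the copy from the original.",
]


def following_prepositions(word, sentences):
    """Tally prepositions that appear immediately after `word`."""
    tally = Counter()
    for sentence in sentences:
        # Lowercase and split into simple word tokens.
        tokens = re.findall(r"[a-z']+", sentence.lower())
        # Walk through adjacent token pairs: (current word, next word).
        for current, nxt in zip(tokens, tokens[1:]):
            if current == word and nxt in PREPOSITIONS:
                tally[nxt] += 1
    return tally


result = following_prepositions("differentiate", SENTENCES)
for preposition, n in result.most_common():
    print(preposition, n)
```

Because this sketch looks only at the very next word, it misses “from” in the last sentence, where the object comes between the verb and the preposition. That limitation is a reminder to read the actual instances of use rather than relying on the counts alone.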
This is just the most basic tutorial. Taking full advantage of this tool requires time, patience and imagination (What is the best way to get the information I need to answer this particular lexical question?). There are days when I do not consult the corpus for my work. There are days when I consult it numerous times. Like a thesaurus, dictionary or encyclopedia, the corpus is another indispensable resource for the editor.
This tutorial might be enough for an editor to get quick, easy and authoritative answers to pressing questions. However, editors are usually curious about language and many will want to know more and extend their use of the corpus. I encourage you to take advantage of the help offered at the site:
Good luck!