By Mark Davies
Perhaps one of the most interesting uses of large online corpora is to examine current, ongoing changes in a given language. In this paper, we briefly discuss, with examples, how the 425-million-word Corpus of Contemporary American English (COCA) can be used to look at ongoing changes in English, in ways that are probably not possible with any other corpus or resource.
In order to look at ongoing changes, a corpus would ideally have the following characteristics:
- Large (probably 100 million words or more), so that we can look at even low-frequency phenomena
- Recent texts (ideally, it would be updated to within a year of the present time)
- Balance between several genres (e.g. not just newspapers, since that doesn’t reflect the full range of a given language)
- Roughly the same genre balance from year to year (will be discussed below)
- An architecture that shows frequency over time and which allows one to compare frequencies between different periods
Although other corpora have some of these features, none besides COCA has all five. For example:
- The British National Corpus is large and has many genres, but it is now almost 20 years out of date and will likely never be updated.
- The Brown family of corpora (Brown, Frown, LOB, FLOB) is neither large nor recent. (The last texts are from 1991.)
- The American National Corpus has none of the five characteristics listed above.
- The Bank of English and the Oxford English Corpus are both large and fairly recent (up through 2005 / 2006), but their genre balance varies a great deal from year to year. As a result, there is no way to know whether the changes they show indicate actual changes in the “real world” or just reflect changes in the corpus itself. To give a simple example, a higher frequency of “fiction” words (pale, smile, sparkle, etc.) in the early 2000s than in the late 1990s might simply reflect an increase in the total number of words in fiction texts during that time, and would give no evidence that these words had actually increased in real-world usage. (Another issue with these two corpora is that neither is freely available to the public.)
- The Web (via Google) and text archives are not genre-balanced, and (most importantly) there is no way to measure change over time. In order to do so, one would have to know the frequency of an item in a given year and then know the overall size of all texts in that year (to get normalized frequency statistics). There are also real problems in terms of searches involving phrases, and not just individual words.
(Note that the corpora mentioned above might be great for other things, just not as a monitor corpus to look at ongoing changes.)
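The normalization and genre-balance points above can be made concrete with a little arithmetic. The following Python sketch is only an illustration: the function names are hypothetical, and the counts are invented (they are not COCA data). It assumes a "fiction" word that occurs at exactly the same rate within each genre in two periods, and shows that the pooled frequency still rises when the share of fiction in the corpus grows.

```python
def per_million(raw_count, total_words):
    """Normalized frequency: occurrences per million words of text."""
    if total_words <= 0:
        raise ValueError("total_words must be positive")
    return raw_count * 1_000_000 / total_words


def pooled_rate(genre_counts, genre_sizes):
    """Per-million rate over all genres of one period taken together."""
    return per_million(sum(genre_counts.values()), sum(genre_sizes.values()))


# Invented figures for one word in two periods (assumption, not real data).
# The corpus is 50 million words in both periods, but the genre mix differs.
sizes_a = {"fiction": 10_000_000, "newspaper": 40_000_000}
sizes_b = {"fiction": 30_000_000, "newspaper": 20_000_000}
counts_a = {"fiction": 2_000, "newspaper": 800}
counts_b = {"fiction": 6_000, "newspaper": 400}

# Within each genre the rate is identical in both periods ...
fiction_a = per_million(counts_a["fiction"], sizes_a["fiction"])
fiction_b = per_million(counts_b["fiction"], sizes_b["fiction"])
news_a = per_million(counts_a["newspaper"], sizes_a["newspaper"])
news_b = per_million(counts_b["newspaper"], sizes_b["newspaper"])

# ... yet the pooled rate changes, purely because the genre balance changed.
pooled_a = pooled_rate(counts_a, sizes_a)
pooled_b = pooled_rate(counts_b, sizes_b)

print("fiction:", fiction_a, fiction_b)
print("newspaper:", news_a, news_b)
print("pooled:", pooled_a, pooled_b)
```

With these made-up numbers the word looks more than twice as frequent in the second period even though usage within each genre is unchanged, which is why a stable genre balance from year to year matters for a monitor corpus.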
Let us consider briefly a few examples of data from the corpus relating to ongoing changes in English. Rather than presenting tables in the paper itself, we have chosen to take advantage of the online format of this paper, and to make the searches more interactive. Readers can simply click on any of the following links to run the queries and see the actual data from COCA.
- Lexical change (words and phrases): COCA allows us to see the frequency of words and phrases in five-year periods since the early 1990s. (Users can click on [See all sections] in the Chart Display to see the frequency by individual years as well.) For example, some words and phrases that have either entered the language since the early 1990s or significantly increased in frequency are jonesing, morph, old-school, gift (as a verb), freak out, perfect storm, (think) outside the box, (be) on the hook for, and throw someone under the bus.
- In the examples above, we searched for a specific word or phrase. But COCA also allows us to find all words or phrases that have a significantly different frequency in two different periods, even when we don’t know ahead of time what these specific words or phrases might be. For example, we can find phrasal verbs with up that are much more frequent in 2005-11 than in 1990-99. (Note that not all entries are relevant, but it’s a good start.) Likewise, we could find the frequency of all words ending in -ism (e.g. communism, terrorism) in each time period since the early 1990s, and we could do a simple query to find which -ism words are more common in the 2000s than the 1990s. (In this search, the newer words are on the left.) Notice the increase in words like bioterrorism, cyberterrorism, and antiterrorism, and the decrease in words like Afrocentricism, behavioralism, Freudianism, and Leninism, all of which provide interesting insight into historical, cultural, and intellectual shifts in the US during the past two decades.
- Morphological change (word formation): Continuing with the -ism example above, we can easily compare the frequency of words with specific roots, prefixes, or substrings. For example, a search for -gate (indicating “scandal”: filegate, zippergate, travelgate) shows more entries in the 1990s (on the left) than in the 2000s (on the right), which is perhaps due to the political situation in the 1990s. We can also see the rise of words that are becoming semi-bound morphemes, in cases like -friendly (e.g. wallet-friendly, eco-friendly), which are much more frequent in the 2000s (left) than in the 1990s (right).
- Syntactic change (grammar): Because the corpus is tagged and lemmatized (unlike a resource like Google or Google Books), we can search for syntactic constructions. For example, COCA shows an increase in the end up V-ing construction (e.g. we’ll end up paying too much), an increase in the get passive (e.g. he got hired last week) and a corresponding decrease in the be passive (he was fired last week), and an increase in the “quotative like” construction (he’s like, I’m not going) and the so not ADJ construction (I’m so not interested in her).
- Semantic change (word meaning): Changes over time in collocates (nearby words) can often indicate changes in the meaning of a given word. However, we need very large corpora for there to be enough collocates of a given word to carry out such analyses; it would never work with a small 2-4 million word set like the Brown family of corpora. Using COCA, for example, we can compare the collocates for the following words in the 2000s (left) and the 1990s (right): green (increase in jobs, building, energy, and economy, showing the rise in the meaning “environmentally friendly”), web (increase in site, email, page, and browser, showing its newer meaning related to the Internet), and engine (increase in search, Google, Internet, and access, showing its newer meaning related to search engines).
- Discourse analysis: Again, we can use collocates to compare usage. For example, compare the collocates for the given words in the 2000s (left) and the 1990s (right): crisis (mortgage, foreclosure, climate, obesity), terror (global, war, September 11), or China (competitor, debt, and savings, reflecting US fears about economic competition from China). While the collocates with green, web, and engine (the previous section) show a semantic shift, in these latter cases the “meaning” of the word is essentially the same (crisis still means crisis); it’s simply “what we’re saying about topic X” that has changed over time.
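The period comparisons described in the list above (for -ism words, -gate words, or collocates) all rest on the same basic idea: normalize each item's count by the size of its period, then rank items by how much the normalized frequency differs. The Python sketch below illustrates that idea only; it is not COCA's actual implementation, the function name and the `floor` parameter are hypothetical, and the counts and period sizes are invented for the example.

```python
def rank_by_period_ratio(counts_new, counts_old, size_new, size_old, floor=0.5):
    """Rank items by the ratio of per-million rates (newer period / older period).

    `floor` replaces zero counts so that items absent from one period
    still get a finite ratio (an assumption made for this sketch).
    """
    ratios = {}
    for item in set(counts_new) | set(counts_old):
        new_rate = max(counts_new.get(item, 0), floor) * 1_000_000 / size_new
        old_rate = max(counts_old.get(item, 0), floor) * 1_000_000 / size_old
        ratios[item] = new_rate / old_rate
    # Highest ratio first: items most characteristic of the newer period.
    return sorted(ratios.items(), key=lambda pair: pair[1], reverse=True)


# Invented counts (not COCA data) for a few -ism words in two periods,
# each period assumed to contain 200 million words.
counts_2000s = {"bioterrorism": 300, "terrorism": 9_000, "leninism": 20}
counts_1990s = {"bioterrorism": 10, "terrorism": 3_000, "leninism": 80}

ranked = rank_by_period_ratio(counts_2000s, counts_1990s, 200_000_000, 200_000_000)
for word, ratio in ranked:
    print(f"{word}: {ratio:.2f}")
```

The same function could be applied to collocate counts (for example, the words occurring near green in each period) rather than to word counts; only the input dictionaries change.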
As can be seen, with a corpus like COCA – which is large, recent, genre-balanced, and well-annotated – we can carry out a wide range of investigations into recent changes in the language.
Davies, Mark. “The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English”. Literary and Linguistic Computing 25 (2011): 447-65.
(Much more emphasis on the theory of creating monitor corpora, but correspondingly fewer concrete examples than the present paper).
Davies, Mark. “Examining Recent Changes in English: Some Methodological Issues”. In Handbook on the History of English: Rethinking Approaches to the History of English, edited by Terttu Nevalainen and Elizabeth Closs Traugott. Oxford: Oxford Univ. Press, forthcoming.
(Detailed focus on using corpora to map syntactic changes in English from the 1800s to the current time).