Introducing the 1.9 Billion Word Global Web-Based English Corpus (GloWbE)


By Mark Davies

 

 

  1. Introduction

The GloWbE Corpus (Global Web-Based English), which was released in 2013, is based on 1.9 billion words of text from 20 different countries. The texts in the corpus consist of informal blogs (about 60% of the corpus) and other web-based materials, such as newspapers, magazines, company websites, and so on. As with the other corpora from corpus.byu.edu (see Davies 2009, 2011, 2012), GloWbE is freely available to all researchers at http://corpus.byu.edu/glowbe.

In this short paper, I will provide a number of concrete examples of how GloWbE allows researchers to carry out a wide range of studies on lexical, phraseological, morphological, syntactic, and semantic variation among dialects of English, many of which could probably not be studied with other, smaller corpora. (Note that you will only be able to click on 20-30 of these links to see the search results before the corpus asks you to register. Registration is free, and takes less than one minute.)

Before we look at the corpus data, however, I will briefly discuss how the corpus was designed and created.

  2. Designing and creating the GloWbE corpus

The first goal in creating GloWbE was to have a corpus that was large enough to permit research on a wide range of phenomena in World Englishes. To this end, there was really only one possible source for the texts: web pages. The second goal was to ensure that these web pages represented informal language fairly well. About 60% of the words for each country come from informal blogs, whereas the other 40% come from a wide variety of (often) more formal genres and text types.

The first task in creating the corpus was to get the URLs for millions of web pages from the 20 different countries. In order to do so, hundreds of very high frequency 3-grams (three word strings) in COCA were run against Google – phrases such as {and from the}, {but if it}, {and they are}, etc. Because of the high frequency of the search strings and because Google does not use search engine optimization criteria for phrases like “and from the”, it ends up listing essentially random URLs, which is precisely what we wanted. I stored these URLs in a database, along with all associated metadata (web site, country, page title, etc.).
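As a rough illustration of how high frequency 3-grams can be ranked, the following Python sketch counts three-word strings in a tokenized text. The toy sentence and the function name top_trigrams are assumptions made for this example; the actual seed phrases came from COCA, and this is not necessarily how they were extracted.

```python
from collections import Counter

def top_trigrams(tokens, n=5):
    """Return the n most frequent three-word strings in a list of tokens."""
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    counts = Counter(" ".join(t) for t in trigrams)
    return counts.most_common(n)

# Toy text standing in for a large reference corpus (invented data).
sample = ("and from the start and from the end but if it rains "
          "and from the top but if it helps").split()

seeds = top_trigrams(sample, n=2)
print(seeds)
```

In a real setting the token list would come from a corpus of many millions of words, and only the top few hundred strings would be kept as search phrases.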

In order to achieve a roughly 60/40 mix of informal and somewhat more formal language, one million URLs from a “general” search in Google were collected and then another one million URLs from Google searches of just blogs. In the general search, however, about 20% of these were also blogs (there is no way to exclude them from “general” searches), which results in (roughly) a 60/40 mix overall.
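The arithmetic behind the 60/40 figure can be checked with a few lines of Python. The variable names are illustrative; the numbers are the approximate ones given in the paragraph above.

```python
# Approximate figures from the text above.
general_urls = 1_000_000        # URLs from "general" Google searches
blog_urls = 1_000_000           # URLs from blog-only Google searches
blog_share_in_general = 0.20    # about 20% of "general" hits were also blogs

# Blog pages = all blog-only URLs plus the blogs hidden in the general set.
blog_pages = blog_urls + general_urls * blog_share_in_general
blog_fraction = blog_pages / (general_urls + blog_urls)

print(f"blogs: {blog_fraction:.0%}, other: {1 - blog_fraction:.0%}")
```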

The most challenging part of the corpus creation was ensuring that the web pages were correctly associated with each of the 20 countries in the corpus. To do so, each of these two sets of searches (general and blogs) was carried out for each country separately, using Google “Advanced Search”, and limiting by “Region” (as Google calls it) – Canada, Ireland, India, Singapore, etc.

After the list of URLs was created, HTTrack was used to download the 2,000,000 web pages, and then JusText was used to remove “boilerplate” material from the web pages – recurring headers, footers, sidebars, and so on. After this, the CLAWS 7 tagger was used to tag the entire corpus. Finally, the texts were imported into a database, where they would use the same architecture and interface as the other corpora from corpus.byu.edu.

The end result was a 1.9 billion word corpus from about 1.8 million web pages in 20 different countries:

 

Country         Web sites   Web pages           Words
United States      82,260     275,156     386,809,355
Canada             33,776     135,692     134,765,381
Great Britain      64,351     381,841     387,615,074
Ireland            15,840     102,147     101,029,231
Australia          28,881     129,244     148,208,169
New Zealand        14,053      82,679      81,390,476
India              18,618     113,765      96,430,888
Sri Lanka           4,208      38,389      46,583,115
Pakistan            4,955      42,769      51,367,152
Bangladesh          5,712      45,059      39,658,255
Singapore           8,339      45,459      42,974,705
Malaysia            8,966      45,601      42,420,168
Philippines        10,224      46,342      43,250,093
Hong Kong           8,740      43,936      40,450,291
South Africa       10,308      45,264      45,364,498
Nigeria             4,516      37,285      42,646,098
Ghana               3,616      47,351      38,768,231
Kenya               5,193      45,962      41,069,085
Tanzania            4,575      41,356      35,169,042
Jamaica             3,488      46,748      39,663,666
TOTAL             340,619   1,792,045   1,885,632,973

 

  3. Examples of GloWbE-based queries

3.1 Lexical variation

 In terms of research on lexical variation among the dialects, we can search for any word or phrase, and see its frequency in all 20 varieties. For example, all of the following are more common in [British] than in American English:

fortnight, trousers, rained off, on holiday, at university, [be] different to, rather more ADJ.

Some other examples are:

[Ireland] jackeen*, banjax*, culchie*, childer, soft day, [act] the maggot*; [Australia] bikkies, thongs, rockmelon*

[Malaysia] (+Singapore) rakyat, makan, hand phone, [take] ADJ food, lah!; [Jamaica] ackee, bammy, guinep, callaloo.

We can also see comparisons across groups of countries, e.g.:

[South Asia] out of station, eve teas*, be elder to, keep in view

[Non-“core” countries]: equipments, thrice, godown, same to the, [discuss] about, [cope] up.

In all of the preceding searches, we input a specific word or phrase, and then see the frequency in each country. But because the corpus has already stored the frequency of each word and phrase in each country, we can also do more complicated searches, in which we have GloWbE show us what words or phrases occur in a given country (or set of countries), but not in another. For example, we could compare all *ism words in the six “core” countries (left) and the four countries in South Asia (right), and we would see that words like Euroskepticism and Nimbyism (“not in my backyard”) are more common in the Inner Circle countries, whereas words related to religion (e.g. Talibanism, Shaivism) are more common in the South Asian countries. Or we could find all *ies nouns that are more common in Australia (left) than in other countries (e.g. cockies, pollies, furphies).
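Because the countries contribute very different amounts of text, comparisons of this kind are normally made on normalized frequencies (for example, tokens per million words) rather than raw counts. The following Python sketch shows the idea. The corpus sizes are taken from the table in Section 2, but the raw counts are invented for illustration and are not actual GloWbE results; the function name per_million is likewise just a label for this example.

```python
# Corpus sizes in words for three countries, from the table in Section 2.
CORPUS_WORDS = {
    "United States": 386_809_355,
    "Great Britain": 387_615_074,
    "India": 96_430_888,
}

def per_million(raw_count, country):
    """Convert a raw token count into a frequency per million words."""
    return raw_count / CORPUS_WORDS[country] * 1_000_000

# Hypothetical raw counts for a single word (NOT real corpus results).
raw = {"United States": 400, "Great Britain": 4000, "India": 500}

rates = {country: per_million(n, country) for country, n in raw.items()}
for country, rate in sorted(rates.items(), key=lambda kv: -kv[1]):
    print(f"{country:15s} {rate:6.2f} per million")
```

Note that in this invented example India has only slightly more raw tokens than the United States, but a rate roughly five times as high, since its section of the corpus is about a quarter of the size.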

3.2 Idioms (and phrases)

The following are a few idioms related to “head” that are more common in American (and Canadian) English:

in over ~ head, head start, heads or tails, talking [head], (like) a deer in the headlights, cooler heads (will prevail).

(Note that the GloWbE corpus online uses different search syntax than the “~”, but it is used for ease in presentation here.) On the other hand, the following are spread more evenly across the dialects:

price on ~ head, head over heels (in love), head and shoulders above, two heads are better (than one), [use] ~ head, [make] ~ head spin, [put] ~ head* together, from head to toe, hanging over ~ head, off the top of ~ head.

Note, by the way, how sensitive the frequency of idioms is to size. In a “small” corpus like the British National Corpus, which is 1/20th the size of GloWbE, there might only be 1/20th as many tokens (so perhaps just 5 or 6 total), and in a tiny one million word corpus, there probably wouldn’t be any tokens at all.

Again, because we can easily compare anything in different countries or regions, we could for example compare V-ed me out (e.g. stressed, freaked, creeped me out) in the six “Inner Circle” countries (left) and the countries in South Asia (right). Or we could see, for example, what prepositions are used with a given adjective (like integrated) in different countries (notice the “non-standard” ones in India: in and to, instead of into).

 3.3 Morphology

Just a few examples show that [be] spoilt (vs. spoiled) and [have] learnt (vs. learned) are less common in the US and Canada than in other varieties, whereas American and Canadian English prefer dove (vs. dived) more than other “core” varieties. In addition, whereas some have suggested that snuck is limited to use just in the United States, we find that it is also found in other “Inner Circle” varieties (cf. Kachru 1985), but crucially very little in the UK, which may explain the mistaken perception that it occurs only in the US.

 3.4 Syntax

We can enter any grammatical construction and then see its frequency across each of the 20 countries. For example, we could look for likely V (e.g. would likely remember), which is more common in the United States and Canada. We can search for use of the subjunctive (e.g. if I were king) or the lack of the subjunctive (e.g. if I was king), and we see how the non-use of the subjunctive is less common in South Asia than in the Inner Circle varieties. An interesting case of variation occurs with try and verb (e.g. you should try and do it), which is much less common in the United States and Canada – perhaps due to greater prescriptive pressure against this construction in those countries. With the oft-studied “like” construction (and he’s like ,…), we find that it is used the most in the United States, and that it is less common (in stair-step fashion) in the other Inner Circle dialects.

We can also look for much broader constructions (of the types that are popular in Construction Grammar), such as the “go + ADJ” construction (e.g. go crazy, go bankrupt), the “way” construction (e.g. he pushed his way through the crowd) or the “verb someone into V-ing” construction (e.g. he talked her into coming), and see the different verbs or adjectives by country.

Because of its size, GloWbE can compare low frequency constructions in different dialects. For example, compared to the UK, Ireland, Australia, and New Zealand, [stop] someone V-ing and [prevent] someone V-ing (they stopped / prevented him going) are quite infrequent in American and Canadian English (they would need from as well: stop, prevent).

Finally, we can also examine “discourse markers”, such as “that said ,” or “having said that ,”. Note that here, the former is more common in the US and Canada, whereas the latter is more common in the other Inner Circle dialects.

3.5 Semantics

We can use collocates (nearby words) to compare the meaning of a word in two dialects. For example, the collocates of scheme in the US (left; e.g. evil, fraudulent, nefarious) are much more negative than those in the UK (right). In the UK (right), cupboards are not limited just to kitchens (as in the US; left), and so we get collocates like wardrobe and clothes. And finally, it looks like in British English (right) boost (verb) refers primarily to “increasing” something (e.g. finances, figures), whereas in American English (left) it has expanded its meaning to “improvement” (e.g. mood, spirits, security).
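For readers unfamiliar with collocates, the following Python sketch shows one simple way to collect the words that occur within a fixed span of a node word and to compare the result for two samples. The two sentences are invented, and the function name and span size are assumptions for this example; the online interface works from a database rather than from raw text, so this only illustrates the general idea.

```python
from collections import Counter

def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens to either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts

# Two toy samples standing in for two dialects (invented, not corpus data).
us = "the fraudulent scheme collapsed after the evil scheme was exposed".split()
uk = "the pension scheme and the training scheme were both extended".split()

us_colls = collocates(us, "scheme", span=1)
uk_colls = collocates(uk, "scheme", span=1)

# Collocates found in one sample but not in the other.
print(sorted(set(us_colls) - set(uk_colls)))
print(sorted(set(uk_colls) - set(us_colls)))
```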

 3.6 Discourse analysis (cultural) insights

Finally, one of the most interesting uses of the corpus is the ability to compare frequency or collocates across countries. For example, it is probably no surprise in which countries the words Quran or Allah are most common (Pakistan and other Muslim countries), or Buddh* (Sri Lanka), or feminism (six Inner Circle countries). Using collocates, we can also compare “what is being said” about specific concepts in different countries or regions. For example, ADJ book in the Asian countries (left) refers much more to religious texts (divine, revealed, Buddhist) than in the six Inner Circle (more secular) countries (right). ADJ belief in South Asia (left) contains Hindu, corrupt, wrong, Islamic, heretical, etc., compared to silly, contradictory, liberal, and Catholic in the six Inner Circle (more secular) countries (right). Finally, the adjectives with wife in the non-Inner Circle countries (left) contain chaste, temporary, obedient, Muslim, virtuous, etc. much more than in the (more secular) Inner Circle countries (right).

  4. Conclusion (including a discussion on corpus size)

In Section 3, we have seen a number of examples showing how the data from GloWbE can be used to insightfully investigate a wide range of phenomena in different dialects of English. One aspect of this that I might touch on in a somewhat more detailed fashion here is the importance of corpus size.

Other than GloWbE, the only other corpus of English that contains data from a number of different dialects, and which is organized in a way that allows us to compare across these dialects, is the International Corpus of English (see Greenbaum 1996). The ICE corpus contains one million words each for fourteen different dialects (eleven of which contain both spoken and written English), for a total of about 12,200,000 words of text. GloWbE, on the other hand, contains about 1.9 billion words of data. In other words, GloWbE is more than 150 times as large as ICE. Where ICE may yield 20-30 tokens of a given word, phrase, or construction, GloWbE will often yield 150 times as many, or in other words 3,000-4,500 tokens for the same phenomenon. Another advantage of GloWbE is that it provides data on a number of varieties so far not included in ICE (such as Pakistani and Malaysian English).

For high frequency syntactic constructions, ICE often has enough data, and this is why it is probably no surprise that so many ICE-based studies in fact deal with rather high frequency constructions. But for many of the phenomena discussed in this paper, ICE would probably not have enough tokens. For example, most of the words and phrases shown in Section 3 occur 500-2000 times in GloWbE, and they would only occur between perhaps 4 and 15 times in ICE. In terms of morphological variation, contrasting forms like dived/dove occur 1,000-1,200 times in GloWbE, and they might therefore only occur 6 or 7 times in ICE. In GloWbE there are about 8,000 tokens for a construction like each of them {is|are}, and in ICE there would be only about 50 tokens – probably too few to say much of interest. And things are even more problematic in terms of the number of tokens for collocates shown in Sections 3.5-3.6. For a given collocate, there are often only 30-40 tokens in GloWbE, and with a corpus only 1/150th the size, we might be lucky to have a single token in ICE.
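The estimates in this section rest on simple proportional scaling by corpus size. The following Python sketch makes that calculation explicit. It assumes, as a simplification, that a phenomenon occurs at the same rate in both corpora, which will not hold exactly given their different compositions; the function name expected_tokens is just a label for this example.

```python
# Approximate corpus sizes given in the text.
GLOWBE_WORDS = 1_900_000_000
ICE_WORDS = 12_200_000

def expected_tokens(glowbe_tokens, target_words=ICE_WORDS):
    """Scale a GloWbE token count down to a corpus of `target_words` words,
    assuming the same rate of occurrence in both corpora."""
    return glowbe_tokens * target_words / GLOWBE_WORDS

ratio = GLOWBE_WORDS / ICE_WORDS
print(f"GloWbE is about {ratio:.0f} times the size of ICE")

# Token counts mentioned in the text, scaled to the size of ICE.
for n in (500, 2000, 1000, 1200, 8000, 35):
    print(f"{n:5d} tokens in GloWbE -> about {expected_tokens(n):.1f} in ICE")
```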

But of course size is not everything. The ICE corpora have been constructed very carefully, and for phenomena where “every token counts” and when there can be no “messiness” at all in the data, then the carefully-curated, manually annotated ICE corpora may be more useful than GloWbE. Likewise, for phenomena where actual spoken material is needed, ICE will probably be better than GloWbE, where there is no spoken data (although the 60% or so of texts in GloWbE that come from blogs do provide fairly informal language). Finally, in GloWbE we only know that a website is from a particular country, but there might be speakers from other countries who have posted to that website. In ICE, on the other hand, care has been taken to ensure that all speakers are from the country in question.

Recognizing the fact that each corpus has its own strengths and weaknesses, it is probably not an “either/or” situation. Researchers may want to use ICE for some studies, GloWbE for others, and perhaps proprietary corpora that they have created for yet other studies. All of these can be seen as useful “tools” in the researchers’ “toolbox”, and they complement each other nicely.

To the extent, though, that researchers do adopt GloWbE as part of their “toolbox” (along with ICE and other corpora), they will be able to expand their horizons in terms of the types of variation that they consider, as they carry out research on World Englishes.

 

References

Davies, Mark. 2012. “Expanding Horizons in Historical Linguistics with the 400 million word Corpus of Historical American English”. Corpora 7: 121-57.

Davies, Mark. 2011. “The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English”. Literary and Linguistic Computing 25: 447-65.

Davies, Mark. 2009. “The 385+ Million Word Corpus of Contemporary American English (1990-2008+): Design, Architecture, and Linguistic Insights”. International Journal of Corpus Linguistics. 14: 159-90.

Greenbaum, Sidney, ed. 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Oxford University Press.

Kachru, Braj B. 1985. “Standards, codification and sociolinguistic realism: the English language in the outer circle”. In Randolph Quirk and Henry Widdowson, eds. English in the World: Teaching and Learning the Language and Literatures. Cambridge: Cambridge University Press, 11–30.

 
