Introducing the 1.9 Billion Word Global Web-Based English Corpus (GloWbE)

By Mark Davies

Introduction

The GloWbE Corpus (Global Web-Based English), which was released in 2013, is based on 1.9 billion words of text from 20 different countries. The texts in the corpus consist of informal blogs (about 60% of the corpus) and other web-based materials, such as newspapers, magazines, company websites, and so on. As with the other corpora from corpus.byu.edu (see Davies 2009, 2011, 2012), GloWbE is freely available to all researchers at http://corpus.byu.edu/glowbe.

In this short paper, I will provide a number of concrete examples of how GloWbE allows researchers to carry out a wide range of studies on lexical, phraseological, morphological, syntactic, and semantic variation among dialects of English, many of which could probably not be studied with other, smaller corpora. (Note that you will only be able to click on 20-30 of these links to see the search results, before the corpus asks you to register. Registration is free, and takes less than one minute.)

Before we look at the corpus data, however, I will briefly discuss how the corpus was designed and created.

Designing and creating the GloWbE corpus

The first goal in creating GloWbE was to have a corpus that was large enough to permit research on a wide range of phenomena in World Englishes. To this end, there was really only one possible source for the texts: web pages. The second goal was to ensure that these web pages represented informal language fairly well. About 60% of the words for each country come from informal blogs, whereas the other 40% come from a wide variety of (often) more formal genres and text types.

The first task in creating the corpus was to get the URLs for millions of web pages from the 20 different countries. In order to do so, hundreds of very high frequency 3-grams (three word strings) in COCA were run against Google – phrases such as {and from the}, {but if it}, {and they are}, etc. Because of the high frequency of the search strings and because Google does not use search engine optimization criteria for phrases like “and from the”, it ends up listing essentially random URLs, which is precisely what we wanted. I stored these URLs in a database, along with all associated metadata (web site, country, page title, etc.).
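As a rough illustration of how high-frequency 3-grams can be pulled from a text, here is a minimal sketch. It is not the script actually used for GloWbE: the function name `top_trigrams` and the toy sample sentence are assumptions standing in for COCA.

```python
from collections import Counter

def top_trigrams(tokens, n=3):
    """Return the n most frequent three-word strings in a list of tokens."""
    # Slide a window of three tokens across the text.
    trigrams = zip(tokens, tokens[1:], tokens[2:])
    counts = Counter(" ".join(t) for t in trigrams)
    return counts.most_common(n)

# Toy sample (hypothetical); a real run would use a large reference corpus.
sample = ("and from the start and from the end "
          "but if it rains and from the top").split()

print(top_trigrams(sample, n=1))
```

Function-word strings such as "and from the" rise to the top of such a list because they are not tied to any topic, which is what makes them useful as neutral search seeds.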

In order to achieve a roughly 60/40 mix of informal and somewhat more formal language, one million URLs from a “general” search in Google were collected and then another one million URLs from Google searches of just blogs. In the general search, however, about 20% of these were also blogs (there is no way to exclude them from “general” searches), which results in (roughly) a 60/40 mix overall.
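The arithmetic behind the 60/40 figure can be checked with a short sketch. It counts URLs rather than words, so it assumes that blog pages and other pages are of broadly similar length; the function name is illustrative only.

```python
def blog_fraction(general_urls, blog_urls, blog_share_in_general):
    """Overall share of blog URLs when some 'general' results are also blogs."""
    blogs = blog_urls + general_urls * blog_share_in_general
    total = general_urls + blog_urls
    return blogs / total

# One million general URLs (about 20% of them blogs) plus one million blog URLs.
share = blog_fraction(1_000_000, 1_000_000, 0.20)
print(f"{share:.0%} blogs, {1 - share:.0%} other")
```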

The most challenging part of the corpus creation was ensuring that the web pages were correctly associated with each of the 20 countries in the corpus. To do so, each of these two sets of searches (general and blogs) was carried out for each country separately, using Google “Advanced Search” and limiting by “Region” (as Google calls it) – Canada, Ireland, India, Singapore, etc.

After the list of URLs was created, HTTrack was used to download the 2,000,000 web pages, and then JusText was used to remove “boilerplate” material from the web pages – recurring headers, footers, sidebars, and so on. After this, the CLAWS 7 tagger was used to tag the entire corpus. Finally, the texts were imported into a database, where they would use the same architecture and interface as the other corpora from corpus.byu.edu.

The end result was a 1.9 billion word corpus from about 1.8 million web pages in 20 different countries:

Country	Web sites	Web pages	Words
United States	82,260	275,156	386,809,355
Canada	33,776	135,692	134,765,381
Great Britain	64,351	381,841	387,615,074
Ireland	15,840	102,147	101,029,231
Australia	28,881	129,244	148,208,169
New Zealand	14,053	82,679	81,390,476
India	18,618	113,765	96,430,888
Sri Lanka	4,208	38,389	46,583,115
Pakistan	4,955	42,769	51,367,152
Bangladesh	5,712	45,059	39,658,255
Singapore	8,339	45,459	42,974,705
Malaysia	8,966	45,601	42,420,168
Philippines	10,224	46,342	43,250,093
Hong Kong	8,740	43,936	40,450,291
South Africa	10,308	45,264	45,364,498
Nigeria	4,516	37,285	42,646,098
Ghana	3,616	47,351	38,768,231
Kenya	5,193	45,962	41,069,085
Tanzania	4,575	41,356	35,169,042
Jamaica	3,488	46,748	39,663,666
TOTAL	340,619	1,792,045	1,885,632,973
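Because the country sub-corpora differ considerably in size, raw counts are only comparable once they are normalized. The sketch below converts a raw count to a per-million-words rate using word totals from the table above. The raw count of 100 is an invented figure for illustration, and the helper name `per_million` is an assumption.

```python
# Word totals taken from the table above (three countries shown for brevity).
WORDS = {
    "United States": 386_809_355,
    "Great Britain": 387_615_074,
    "Jamaica": 39_663_666,
}

def per_million(raw_count, country):
    """Convert a raw token count into a frequency per million words."""
    return raw_count / WORDS[country] * 1_000_000

# The same hypothetical raw count (100) is a much higher rate in a small sub-corpus.
for country in WORDS:
    print(country, round(per_million(100, country), 2))
```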

Examples of GloWbE-based queries

3.1 Lexical variation

In terms of research on lexical variation among the dialects, we can search for any word or phrase, and see its frequency in all 20 varieties. For example, all of the following are more common in [British] than in American English:

fortnight, trousers, rained off, on holiday, at university, [be] different to, rather more ADJ.

Some other examples are:

[Ireland] jackeen*, banjax*, culchie*, childer, soft day, [act] the maggot*; [Australia] bikkies, thongs, rockmelon*

[Malaysia] (+Singapore) rakyat, makan, hand phone, [take] ADJ food, lah!; [Jamaica] ackee, bammy, guinep, callaloo.

We can also see comparisons across groups of countries, e.g.:

[South Asia] out of station, eve teas*, be elder to, keep in view

[Non-“core” countries]: equipments, thrice, godown, same to the, [discuss] about, [cope] up.

In all of the preceding searches, we input a specific word or phrase, and then see the frequency in each country. But because the corpus has already stored the frequency of each word and phrase in each country, we can also do more complicated searches, in which we have GloWbE show us what words or phrases occur in a given country (or set of countries), but not in another. For example, we could compare all *ism words in the six “core” countries (left) and the four countries in South Asia (right), and we would see that words like Euroskepticism and Nimbyism (“not in my backyard”) are more common in the Inner Circle countries, whereas words related to religion (e.g. Talibanism, Shaivism) are more common in the South Asian countries. Or we could find all *ies nouns that are more common in Australia (left) than in other countries (e.g. cockies, pollies, furphies).
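The kind of comparison described above can be sketched as a ratio of normalized frequencies between two groups of countries. This is only an illustration of the idea, not the method used by the GloWbE interface: the function `distinctive`, the threshold values, and all frequency figures are invented.

```python
def distinctive(freq_a, freq_b, min_ratio=2.0, floor=0.01):
    """Words at least min_ratio times more frequent in group A than in group B.

    freq_a and freq_b map words to per-million frequencies.
    floor avoids division by zero for words absent from group B.
    """
    result = []
    for word, fa in freq_a.items():
        fb = max(freq_b.get(word, 0.0), floor)
        ratio = fa / fb
        if ratio >= min_ratio:
            result.append((word, ratio))
    # Most distinctive words first.
    return sorted(result, key=lambda pair: pair[1], reverse=True)

# Hypothetical per-million frequencies for *ism words in two groups.
core = {"nimbyism": 0.5, "shaivism": 0.1, "tourism": 40.0}
south_asia = {"nimbyism": 0.05, "shaivism": 1.2, "tourism": 38.0}

print([w for w, _ in distinctive(core, south_asia)])
print([w for w, _ in distinctive(south_asia, core)])
```

A word like tourism, which is about equally frequent in both groups, falls below the threshold in both directions and so appears in neither list.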

3.2 Idioms (and phrases)

The following are a few idioms related to “head” that are more common in American (and Canadian) English:

in over ~ head, head start, heads or tails, talking [head], (like) a deer in the headlights, cooler heads (will prevail).

(Note that the online GloWbE interface uses a different search syntax than the “~”, e.g. in over [ap*] head, but the tilde is used here for ease of presentation.) On the other hand, the following are spread more evenly across the dialects:

price on ~ head, head over heels (in love), head and shoulders above, two heads are better (than one), [use] ~ head, [make] ~ head spin, [put] ~ head* together, from head to toe, hanging over ~ head, off the top of ~ head.

Note, by the way, how sensitive the frequency of idioms is to size. In a “small” corpus like the British National Corpus, which is 1/20th the size of GloWbE, there might only be 1/20th as many tokens (so perhaps just 5 or 6 total), and in a tiny one million word corpus, there probably wouldn’t be any tokens at all.

Again, because we can easily compare anything in different countries or regions, we could for example compare V-ed me out (e.g. stressed, freaked, creeped me out) in the six “Inner Circle” countries (left) and the countries in South Asia (right). Or we could see, for example, what prepositions are used with a given adjective (like integrated) in different countries (notice the “non-standard” ones in India: in and to, instead of into).

3.3 Morphology

Just a few examples show that [be] spoilt (vs. spoiled) and [have] learnt (vs. learned) are less common in the US and Canada than in other varieties, whereas American and Canadian English prefer dove (vs. dived) more than other “core” varieties. In addition, whereas some have suggested that snuck is limited to use just in the United States, we find that it is also found in other “Inner Circle” varieties (cf. Kachru 1985), but crucially very little in the UK, which may explain the mistaken perception that it occurs only in the US.

3.4 Syntax

We can enter any grammatical construction and then see its frequency across each of the 20 countries. For example, we could look for V likely V (e.g. would likely remember), which is more common in the United States and Canada. We can search for use of the subjunctive (e.g. if I were king) or the lack of the subjunctive (e.g. if I was king), and we see how the non-use of the subjunctive is less common in South Asia than in the Inner Circle varieties. An interesting case of variation occurs with try and verb (e.g. you should try and do it), which is much less common in the United States and Canada – perhaps due to greater prescriptive pressure against this construction in those countries. With the oft-studied “like” construction (and he’s like ,…), we find that it is used the most in the United States, and that it is less common (in stair-step fashion) in the other Inner Circle dialects.

We can also look for much broader constructions (of the types that are popular in Construction Grammar), such as the “go + ADJ” construction (e.g. go crazy, go bankrupt), the “way” construction (e.g. he pushed his way through the crowd) or the “verb someone into V-ing” construction (e.g. he talked her into coming), and see the different verbs or adjectives by country.

Because of its size, GloWbE can compare low frequency constructions in different dialects. For example, compared to the UK, Ireland, Australia, and New Zealand, [stop] someone V-ing and [prevent] someone V-ing (they stopped / prevented him going) are quite infrequent in American and Canadian English, which would need from as well (they stopped / prevented him from going).

Finally, we can also examine “discourse markers”, such as “that said ,” or “having said that ,”. Note that here, the former is more common in the US and Canada, whereas the latter is more common in the other Inner Circle dialects.

3.5 Semantics

We can use collocates (nearby words) to compare the meaning of a word in two dialects. For example, the collocates of scheme in the US (left; e.g. evil, fraudulent, nefarious) are much more negative than those in the UK (right). In the UK (right), cupboards are not limited just to kitchens (as in the US; left), and so we get collocates like wardrobe and clothes. And finally, it looks like in British English (right) boost (verb) refers primarily to “increasing” something (e.g. finances, figures), whereas in American English (left) it has expanded its meaning to “improvement” (e.g. mood, spirits, security).
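To make the notion of collocates concrete, the following minimal sketch counts the words that occur within a fixed window around a node word. It is a simplified illustration, not the algorithm behind the corpus interface: the function name, the window size, and the two toy "dialect" samples are all invented.

```python
from collections import Counter

def collocates(tokens, node, window=4):
    """Count words occurring within `window` tokens on either side of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != node:
            continue
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the node word itself at this position
                counts[tokens[j]] += 1
    return counts

# Toy samples (hypothetical) standing in for two country sub-corpora.
us = "an evil scheme to hide a fraudulent scheme".split()
uk = "a pension scheme and a training scheme".split()

print(collocates(us, "scheme", window=1))
print(collocates(uk, "scheme", window=1))
```

Comparing the two resulting counters side by side is the toy equivalent of the left/right collocate displays described above.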

3.6 Discourse analysis (cultural) insights

Finally, one of the most interesting uses of the corpus is the ability to compare frequency or collocates across countries. For example, it is probably no surprise in which countries the words Qu*ran or Allah are most common (Pakistan and other Muslim countries), or Buddh* (Sri Lanka), or feminism (six “inner circle” countries). Using collocates, we can also compare “what is being said” about specific concepts in different countries or regions. For example, ADJ book in the Asian countries (left) refers much more to religious texts (divine, revealed, Buddhist) than in the six “inner circle” (more secular) countries (right). ADJ belief in South Asia (left) contains Hindu, corrupt, wrong, Islamic, heretical, etc., compared to silly, contradictory, liberal, and Catholic in the six “inner circle” (more secular) countries (right). Finally, the adjectives with wife in the “non-inner circle” countries (left) contain chaste, temporary, obedient, Muslim, virtuous, etc., much more than in the (more secular) “inner circle” countries (right).

Conclusion (including a discussion on corpus size)

In Section 3, we have seen a number of examples showing how the data from GloWbE can be used to investigate a wide range of phenomena in different dialects of English. One aspect that deserves somewhat more detailed discussion here is the importance of corpus size.

Apart from GloWbE, the only corpus of English that contains data from a number of different dialects, and which is organized in a way that allows us to compare across these dialects, is the International Corpus of English (see Greenbaum 1996). The ICE corpus contains one million words each for fourteen different dialects (eleven of which contain both spoken and written English), for a total of about 12,200,000 words of text. GloWbE, on the other hand, contains about 1.9 billion words of data. In other words, GloWbE is more than 150 times as large as ICE. Where ICE may yield 20-30 tokens of a given word, phrase, or construction, GloWbE will often yield 150 times as many, or in other words roughly 3,000-4,500 tokens for the same phenomenon. Another advantage of GloWbE is that it provides data on a number of varieties so far not included in ICE (such as Pakistani and Malaysian English).

For high frequency syntactic constructions, ICE often has enough data, and this is why it is probably no surprise that so many ICE-based studies in fact deal with rather high frequency constructions. But for many of the phenomena discussed in this paper, ICE would probably not have enough tokens. For example, most of the words and phrases shown in Section 3 occur 500-2000 times in GloWbE, and they would only occur between perhaps 4 and 15 times in ICE. In terms of morphological variation, contrasting forms like dived/dove occur 1,000-1,200 times in GloWbE, and they might therefore only occur 6 or 7 times in ICE. In GloWbE there are about 8,000 tokens for a construction like each of them {is|are}, and in ICE there would be only about 50 tokens – probably too few to say much of interest. And things are even more problematic in terms of the number of tokens for collocates shown in Sections 3.5-3.6. For a given collocate, there are often only 30-40 tokens in GloWbE, and with a corpus only 1/150th the size, we might be lucky to have a single token in ICE.
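The estimates in this paragraph follow from simple proportional scaling, which can be sketched as below. The sketch assumes that a phenomenon occurs at the same rate per word in both corpora, which is a simplification; the word totals are the figures given in this paper, and the function name is illustrative.

```python
GLOWBE_WORDS = 1_885_632_973  # total from the corpus table
ICE_WORDS = 12_200_000        # approximate ICE total cited above

def expected_tokens(glowbe_count, target_words=ICE_WORDS):
    """Scale a GloWbE token count down to a corpus of target_words words."""
    return glowbe_count * target_words / GLOWBE_WORDS

# Token counts mentioned in the text: a frequent construction, a
# morphological variant, and a typical collocate.
for count in (8_000, 1_000, 35):
    print(count, "->", round(expected_tokens(count), 1))
```

Under this assumption, a collocate with a few dozen tokens in GloWbE is expected to occur less than once in a corpus the size of ICE.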

But of course size is not everything. The ICE corpora have been constructed very carefully, and for phenomena where “every token counts” and where there can be no “messiness” at all in the data, the carefully curated, manually annotated ICE corpora may be more useful than GloWbE. Likewise, for phenomena where actual spoken material is needed, ICE will probably be better than GloWbE, where there is no spoken data (although the 60% or so of texts in GloWbE that come from blogs do provide fairly informal language). Finally, in GloWbE we only know that a website is from a particular country, but there might be speakers from other countries who have posted to that website. In ICE, on the other hand, care has been taken to ensure that all speakers are from the country in question.

Recognizing the fact that each corpus has its own strengths and weaknesses, it is probably not an “either/or” situation. Researchers may want to use ICE for some studies, GloWbE for others, and perhaps proprietary corpora that they have created for yet other studies. All of these can be seen as useful “tools” in the researchers’ “toolbox”, and they complement each other nicely.

To the extent, though, that researchers do adopt GloWbE as part of their “toolbox” (along with ICE and other corpora), they will be able to expand their horizons in terms of the types of variation that they consider, as they carry out research on World Englishes.

References

Davies, Mark. 2012. “Expanding Horizons in Historical Linguistics with the 400 million word Corpus of Historical American English”. Corpora 7: 121-57.

Davies, Mark. 2011. “The Corpus of Contemporary American English as the First Reliable Monitor Corpus of English”. Literary and Linguistic Computing 25: 447-65.

Davies, Mark. 2009. “The 385+ Million Word Corpus of Contemporary American English (1990-2008+): Design, Architecture, and Linguistic Insights”. International Journal of Corpus Linguistics 14: 159-90.

Greenbaum, Sidney, ed. 1996. Comparing English Worldwide: The International Corpus of English. Oxford: Oxford University Press.

Kachru, Braj B. 1985. “Standards, codification and sociolinguistic realism: the English language in the outer circle”. In Randolph Quirk and Henry Widdowson, eds. English in the World: Teaching and Learning the Language and Literatures. Cambridge: Cambridge University Press, 11–30.

The 21st Century Text
