In 2011 I gave a plenary talk on how American English is changing over time (contrasting it with British English), using the Brown Family of corpora. Each member of the Brown family consists of a corpus of 1 million words of written, published, standard English, divided into 500 files of about 2,000 words each. Fifteen genres of writing are represented – a framework created decades ago, when the original Brown corpus was compiled by Henry Kučera and W. Nelson Francis at Brown University. That corpus has the distinction of being the first publicly available corpus ever built. Containing only American texts published in 1961, it originally went by the name of A Standard Corpus of Present-Day Edited American English for use with Digital Computers but later became known as just the Brown Corpus. It was followed by an equivalent British version, with later members representing English from the 1990s, the 2000s and the 1930s. A 1901 British version is in the pipeline.
Before I gave my talk, however, Mark Davies gave a brilliant presentation on the COHA (Corpus of Historical American English), which has 400 million words and covers the period from 1800 to the present day. It was the proverbial hard act to follow. Compared to the COHA, the Brown family are tiny, and their coverage comes in snapshots 15 or 30 years apart, rather than representing every year. If we find, say, that the word Mr is less frequent in 2006 than in 1991 then it is tempting to say that Mr is becoming less frequent over time. But we don’t know for certain what corpora from all the years in between would tell us. Having multiple sampling points presents a more convincing picture, but judicious hedging must still be applied.
Also, because the corpora are small, many words in the Brown family have tiny frequencies, so it’s very difficult to make any claims about them. And the sampling could be viewed as rather outdated – the sorts of texts that people accessed in the 1960s are not necessarily the same as those they access now. There are no online texts in the Brown family (although to ease collection, both the 2006 members involved texts that were originally published in written form, then placed online). Nor is there any advertising text. Or song lyrics. Or horror fiction. Or erotica (although there is a section on Romantic Fiction which could be pushed in that direction). Finally, the fact that all the texts are of the published variety means that they tend to represent a somewhat standardised, conservative form of English. A lot of the innovation in English happens in much more informal contexts, especially where young people or people from different backgrounds mix together – inner-city playgrounds and internet forums being two good examples. By the time such innovation gets into written published standard English, it’s no longer innovative. So the Brown family can’t tell us about the cutting edge of language use – they’ll always be a few years out of fashion.
So what are the Brown family good for, if anything?
Well, there are a few advantages to being small. Familiarity with your corpus is one. Another is that the corpora work with a variety of different software and are relatively fast to process. It’s also easier to control for sampling when something is small, and the beauty of the Brown family is that it is exquisitely sampled and balanced, even if the framework is crystallised in the Mad Men era. And that allows us to make diachronic and synchronic comparisons with some confidence, providing we deal with high-frequency phenomena.
So, for lexical-grammatical analyses, the most frequent 380 words in the Brown family account for about 62% of all of the linguistic content across the corpora (at around the 400th most frequent word, we are seeing frequencies of around 250 – which in most cases will give more than enough data to account for rare and frequent patterns). And there’s plenty that can be done with 380 words. When I looked at which of these words showed constant increases over time in the 4 British members of Brown, the 10 which showed the most dramatic increases were around, health, information, it’s, didn’t, says, social, family, children, and need. The 10 which showed the most dramatic decreases were upon, shall, Mr, certain, Mrs, great, sir, must, whole, and present. Analysing how these words appear in context and trying to extrapolate why they show the patterns they do has the potential to tell us a great deal about the interface between language and culture. Additionally, carrying out grammatical and semantic tagging of words will group them into larger categories. A single word may be quite infrequent on its own, but group it with similar words and frequencies are increased. So another potential of the Brown family is in working with tagged versions. A recent book by Leech et al. has successfully examined grammatical variation within the Brown family, focussing on high-frequency phenomena like modals, the progressive, the passive voice, noun phrases and non-finite clauses. Lower frequencies also mean that imposing some sort of user-created annotation scheme is more feasible, as is identifying features that may be difficult to spot through regex concordance searches (such as metaphor).
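The idea of picking out words that show constant increases or decreases across the sampling points can be sketched in a few lines of Python. This is only a minimal illustration, not the procedure used for the lists above: the function names, the per-million normalisation, the "relative change" measure and the toy counts are all assumptions made for the example.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count * 1_000_000 / corpus_size


def trend(freqs):
    """Classify frequencies that are ordered by sampling year.

    Returns "increase" if every value is strictly higher than the one
    before it, "decrease" if every value is strictly lower, else "mixed".
    """
    pairs = list(zip(freqs, freqs[1:]))
    if all(a < b for a, b in pairs):
        return "increase"
    if all(a > b for a, b in pairs):
        return "decrease"
    return "mixed"


def relative_change(freqs):
    """Change from the first to the last sampling point, relative to the first.

    Assumes the first frequency is not zero.
    """
    return (freqs[-1] - freqs[0]) / freqs[0]


# Hypothetical raw counts at four sampling points (invented, NOT real corpus data).
toy_counts = {
    "upon": [500, 400, 250, 120],
    "health": [80, 120, 200, 310],
    "table": [150, 170, 140, 160],
}
# Assumed corpus sizes; real members are only approximately 1 million words.
corpus_sizes = [1_000_000, 1_000_000, 1_000_000, 1_000_000]

results = {}
for word, counts in toy_counts.items():
    freqs = [per_million(c, n) for c, n in zip(counts, corpus_sizes)]
    results[word] = (trend(freqs), relative_change(freqs))

# List words from the steepest fall to the steepest rise.
for word, (label, change) in sorted(results.items(), key=lambda kv: kv[1][1]):
    print(f"{word:8s} {label:9s} {change:+.0%}")
```

Requiring a strict rise or fall at every sampling point is a deliberately conservative choice here: a word like the invented "table" above, which wobbles up and down, is labelled "mixed" rather than being read as a trend.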
For many small corpus projects, particularly those which analyse a relatively narrow genre of language (newspaper articles on a certain topic, student essays, pop lyrics), one or more members of the Brown family can act as a good reference corpus to create keywords. Also, for future decades, it will be relatively easy to keep the Brown family up to date by adding new members. It took me about 12 working days to collect the British 2006 member, working alone. At Lancaster we have used classes of students to collect about 5-10 texts each as a corpus building exercise, which could allow for new members to be built with little individual effort. As a note of caution, though, texts are easier to locate and gather in some language varieties than in others. At the recent Corpus Linguistics Conference, Sally Hunt gave a presentation on adding a South African English version to the family, and she pointed out that it took much longer than 12 working days to complete her corpus. It is hoped that as global online access improves, so will the availability of potential corpus texts.
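To make the reference-corpus idea concrete, here is a minimal sketch of one common way of scoring keywords, the log-likelihood statistic, which compares a word's frequency in a study corpus with its frequency in a reference corpus. The choice of statistic is an assumption for illustration (tools differ in their defaults and settings), and the function names and counts below are invented.

```python
import math


def log_likelihood(freq_study, freq_ref, size_study, size_ref):
    """Log-likelihood keyness score for one word.

    Compares observed frequencies with the frequencies expected if the
    word were equally common (relative to corpus size) in both corpora.
    """
    total = size_study + size_ref
    combined = freq_study + freq_ref
    expected_study = size_study * combined / total
    expected_ref = size_ref * combined / total
    score = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_study > 0:
        score += freq_study * math.log(freq_study / expected_study)
    if freq_ref > 0:
        score += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * score


def keywords(study_counts, ref_counts, size_study, size_ref):
    """Return (word, score) pairs for words relatively more frequent in the
    study corpus than in the reference corpus, highest score first."""
    scored = []
    for word, f_study in study_counts.items():
        f_ref = ref_counts.get(word, 0)
        if f_study / size_study > f_ref / size_ref:
            scored.append(
                (word, log_likelihood(f_study, f_ref, size_study, size_ref))
            )
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Hypothetical counts: a 100,000-word pop lyrics corpus against a
# 1-million-word reference corpus (invented numbers, NOT real data).
lyrics_counts = {"love": 300, "the": 5000, "baby": 200}
reference_counts = {"love": 200, "the": 60000, "baby": 50}

for word, score in keywords(lyrics_counts, reference_counts, 100_000, 1_000_000):
    print(f"{word:6s} {score:8.1f}")
```

In this toy example "the" drops out because it is relatively no more frequent in the lyrics than in the reference corpus, which is the behaviour a well-balanced general reference corpus is meant to provide.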
And regarding the issue of the sampling frame being stuck in the 1960s, there is no reason why future members of the family cannot address this by including a 1 million word core, which could be supplemented (or not, depending on the user’s needs) with 1 million words of spoken English, 1 million words of computer-mediated communication or 1 million words of unpublished writing.
All corpora have their limitations. We need to be honest about what they are and make adjustments if necessary. But we should also acknowledge their strengths – and continue to exploit such strengths where possible.
If you’re interested in working with the Brown Family, they are more accessible now than ever before. The entire extended family can be accessed through CQPweb, and BE06 and AmE06 (the most recent additions) have just been added as reference corpora in Wmatrix and SketchEngine. If you would like to use AmE06 or BE06 as reference corpora in WordSmith, you’ll find it helpful to know that the wordlists for these corpora are also available for download on my website.
CASS co-investigator Paul Baker is a Professor of Linguistics and English Language at Lancaster University. His research interests include corpus linguistics, language and gender/sexual identities and critical discourse analysis.