A new addition to the Brown Family: BE21

Back in the 1960s, the Brown Corpus became the first computer-readable corpus of general English – 1 million words of written standard American English from 15 registers, spread across 500 text samples of around 2,000 words each. Since then, matched versions have been built to cover the 1930s, the 1990s and the 2000s. Today’s reference corpora can be very large (enTenTen20 is 36.5 billion words), so 1 million words is not very big in corpus linguistics these days. However, because the members of the Brown family use the same sampling frame, they can be usefully employed to compare UK and US English, and to track trends in language development over time – with the stipulation that these corpora tend to be most effective for high-frequency phenomena.

In 2008 I built BE06, a British English version of the Brown Corpus with texts from 2006, and I was also involved in building its American sibling, AmE06. For those corpora, we stipulated that texts could be taken from online sources, as long as they had also been published in paper format elsewhere. That way, the job of creating the corpus was made easier, but the texts would still be similar in form to those in the other corpora. The gap between the 1990s versions of the corpora and BE06 was 15 years, so 2021 was a good point to create new versions. I began collecting data for BE21 (British English from 2021) around the mid-point of 2021 and finished in early 2022.

As well as helping us to examine language change and variation, the corpora can also be used as reference corpora for projects which involve relatively small amounts of text. So if you have a corpus of recent British newspaper texts, for example, that is under a million words in size, then BE21 would be a reasonable option as a reference corpus.

As with BE06, for BE21 I collected texts from online sources. For most of the texts (around 80-90%), there are “on paper” equivalents somewhere. However, these days, some texts appear designed to exist only in online form – this includes things like online magazines and government documents. So I relaxed the stipulation a little to reflect how written language is increasingly migrating from paper to online publication.

My memory of collecting the BE06 corpus was that it didn’t take that long. At the time I calculated that it took around 10 working days in total. For BE21, collecting the texts took about three times as long. That was a surprise – I’d assumed that with more people publishing online, compared to 15 years ago, there would be a wealth of texts to choose from and the task would be easier. However, compared to 15 years ago, many texts are now behind paywalls so cannot be freely accessed. This was especially the case for magazines – collecting texts for the Popular Lore and Skills, trades and hobbies categories was more difficult than expected.

For the fiction sections, back in 2006, many authors had set up their own websites, where they provided free samples of their latest novels. These days, this doesn’t seem to be as common – although Amazon does provide free samples of books via Kindle, so that was the route I took to collect the samples of fiction as well as the biographies.

Another complicating factor involved identifying and locating British authors, particularly when it came to the Academic Writing category. This was reasonably easy back in 2006, but today academic publishing is a more international activity, as well as a more team-based one, so I found that I was passing over many more possible articles than I remembered doing in 2006. Trying to locate journals that contained the word “British” in their titles was a bit of a red herring, as that was no guarantee that British academics would be publishing in them. To be certain I was sampling “British” English, I needed to make sure that everyone on the authorial team was from the UK, which meant quite a lot of Googling the names of academics and trying to get a sense of their background. I’ve erred on the side of caution, although this made the task of collecting the academic articles more difficult.

2021 will be remembered as a year when the main topic of conversation was COVID. If you obtain keywords from BE21, using BE06 as the reference corpus, the top 10 are COVID, pandemic, lockdown, I, vaccine, my, Brexit, care, people and coronavirus. The top one, COVID, appears in 114 out of the 500 text samples, a total of 446 times across the corpus. This was not because I actively sought out texts about COVID – it was just very hard to avoid them, particularly when I was collecting the non-fiction texts, as COVID had permeated almost every aspect of British society in 2021. Coincidentally, another corpus in the Brown family, B-LOB (Before LOB), also covers an international crisis. The year 1931 saw the Great Depression, which began in the United States, overwhelm Britain, with investors withdrawing their gold from London at the rate of £2.5 million a day. The UK went off the gold standard and, during the election of that year, the Labour party was virtually destroyed, leaving Labour’s Ramsay MacDonald as the Prime Minister for a National Government, an all-party coalition. This results in a few linguistic peculiarities in the B-LOB corpus, such as the high frequency of words like unemployment (unemployment in Great Britain increased by 129% between 1929 and 1932). Although the crisis brought new lexical items and an increased focus on certain topics, the Great Depression does not make B-LOB ineffective as a reference corpus, just as COVID-19 does not render BE21 so idiosyncratic as to be useless.
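
As a rough illustration of how a keyword comparison like the one above can work, here is a minimal Python sketch using the log-likelihood statistic, one common keyness measure. The post does not say which measure or tool produced the keyword list, so the choice of statistic, the function name, the nominal corpus sizes of 1 million words each and the BE06 frequency of zero for COVID are all assumptions made for illustration.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness score for one word across two corpora.

    freq_* are raw counts of the word; size_* are corpus sizes in tokens.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref

    # Expected counts if the word were used at the same rate in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    score = 0.0
    for observed, expected in (
        (freq_target, expected_target),
        (freq_ref, expected_ref),
    ):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# 446 is the BE21 frequency of COVID given in the post.
# The corpus sizes and the BE06 frequency of 0 are assumed, not measured.
covid_keyness = log_likelihood(446, 1_000_000, 0, 1_000_000)
```

With these illustrative figures the score works out to 2 × 446 × ln 2, a little over 618, which is far above the thresholds usually used for significance. A word used at the same rate in both corpora scores 0, so ranking every word by this score pushes terms like COVID and pandemic to the top of the list.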

Today I had the pleasure of seeing students use BE21 for the first time, in one of my corpus linguistics seminars. We looked at the frequency of negation forms like “should not” and “shouldn’t” to determine the extent to which the latter form is gaining precedence. The “n’t” form has been increasing in frequency for the past century. It has not yet become the dominant form, but it is getting very close, indicating grammaticalisation of “n’t” as a bound morpheme, linked to the densification and colloquialisation of written English. Looking to the future, I would expect to build the next Brown Family corpus in 2036. I suspect that by then, the “n’t” form will have become more popular than “not” and that, increasingly, we will only see “not” occurring in archaic-sounding phrasing like “it mattered not”.
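
The kind of comparison described above can be sketched in a few lines of Python. This is only an illustration on raw text, not the procedure used in the seminar (which worked with the corpus itself): the function name and the sample sentence are hypothetical, and a real study would query the corpus through a tool such as CQPweb rather than a regular expression.

```python
import re


def contraction_share(text, modal="should"):
    """Share of negated uses of a modal that take the contracted n't form.

    Returns None when the text contains no negated use of the modal.
    """
    # Normalise case and curly apostrophes so both spellings are counted.
    lowered = text.lower().replace("’", "'")
    stem = re.escape(modal)

    full = len(re.findall(rf"\b{stem}\s+not\b", lowered))
    contracted = len(re.findall(rf"\b{stem}n't\b", lowered))

    total = full + contracted
    return contracted / total if total else None


# Hypothetical sample text: one full form and two contracted forms.
sample = "You should not go. You shouldn't stay. She shouldn’t leave."
share = contraction_share(sample)
```

Running the same function over matched samples from different decades would give one proportion per corpus, which is enough to see whether the contracted form is gaining ground. Irregular pairs such as “will not” and “won’t” would need their own patterns.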

The corpus is currently available through Lancaster University’s CQPweb, which is free to sign up for, and it will be coming to new versions of AntConc and #LancsBox soon. I have done some work examining changes in part-of-speech tag frequencies across the British Brown members, and a paper in the International Journal of Corpus Linguistics is forthcoming. I also gave a talk on the corpus and the results of that analysis, which is available online. Compared to earlier corpora, BE21 has more first- and second-person pronouns, more use of -s and -ing forms of verbs, more genitives, and far fewer terms of address like Mr and Mrs, as well as fewer modal verbs, gradable adverbs, of-based noun phrases and male pronouns. The trends identified by Geoffrey Leech, like densification, colloquialisation and democratisation, are continuing. Another trend, Americanisation, also seems to be holding up, and when the AmE21 corpus is available (work is currently under way at Cardiff University to build it), we can start to further identify the ways that English is likely to shift in the future.

About Paul Baker

CASS co-investigator Paul Baker is a Professor of Linguistics and English Language at Lancaster University. His research interests include corpus linguistics, language and gender/sexual identities, and critical discourse analysis.