Spoken BNC2014 project announcement

We are excited to announce that the ESRC-funded Centre for Corpus Approaches to Social Science (CASS) at Lancaster University and Cambridge University Press have agreed to collaborate on the compilation of a new, publicly accessible corpus of spoken British English called the ‘Spoken British National Corpus 2014’ (the Spoken BNC2014).

The aim of the Spoken BNC2014 project, which will be led jointly by Lancaster University’s Professor Tony McEnery and Cambridge University Press’ Dr Claire Dembry, is to compile a very large collection of recordings of real-life, informal, spoken interactions between people whose first language is British English. These will then be transcribed and made available publicly for a wide range of research purposes.

We aim to encourage people from all over the UK to record their interactions and send them to us as MP3 files. For each hour of good-quality recordings we receive, along with all associated consent forms and information sheets completed correctly, we will pay £18. A single recording does not have to be one hour long; participants may submit two 30-minute recordings, or three 20-minute recordings, and they will receive £18 for each hour in total.

The collaboration between CASS at Lancaster University and Cambridge University Press brings together the best resources available for this task. Cambridge University Press has extensive experience in collecting very large English corpora, and it already has the infrastructure in place to undertake such a large compilation project. CASS at Lancaster University has the linguistic research expertise necessary to ensure that the Spoken BNC2014 will be as useful and accessible as possible for a wide range of purposes. The academic community will benefit from access to a new, large corpus of spoken British English that is balanced according to a selection of useful demographic criteria, including gender, age, and socio-economic status. This opens the door to all kinds of research projects, including the comparison of the Spoken BNC2014 with older spoken corpora.

CASS at Lancaster University and Cambridge University Press are very excited to launch the Spoken BNC2014 project, and we look forward to sharing the corpus as widely as possible once it is complete.

To contribute to the Spoken BNC2014 project as a participant, please email corpus@cambridge.org for more information.

Visiting With The Brown Family

In 2011 I gave a plenary talk on how American English is changing over time (contrasting it with British English), using the Brown Family of corpora. Each member of the Brown family consists of a corpus of 1 million words of written, published, standard English, divided into 500 files of about 2000 words each. Fifteen genres of writing are represented. This framework was created decades ago when the original Brown corpus was compiled by Henry Kučera and W. Nelson Francis at Brown University; it has the distinction of being the first publicly available corpus ever built. Containing only American texts published in 1961, it originally went by the name of A Standard Corpus of Present-Day Edited American English for use with Digital Computers but later became known as just the Brown Corpus. It was followed by an equivalent British version, with later members representing English from the 1990s, the 2000s and the 1930s. A 1901 British version is in the pipeline.

Before I gave my talk, however, Mark Davies gave a brilliant presentation on the COHA (Corpus of Historical American English), which has 400 million words and covers the period from 1800 to the present day. It was the proverbial hard act to follow. Compared to the COHA, the members of the Brown family are tiny, and their coverage comes in snapshots taken 15 or 30 years apart, rather than representing every year. If we find, say, that the word Mr is less frequent in 2006 than in 1991, then it is tempting to say that Mr is becoming less frequent over time. But we don’t know for certain what corpora from all the years in between would tell us. Having multiple sampling points presents a more convincing picture, but even then judicious hedging must be applied.
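To make the two-snapshot comparison concrete, here is a minimal sketch of how such a difference is often checked in corpus work: normalise each count to a rate per million words, then compute a log-likelihood score for the two corpora. The counts for Mr below are invented purely for illustration (they are not taken from any Brown family corpus), and the function names are my own. The two-corpus log-likelihood statistic is a common choice for this kind of comparison, though not necessarily the one used in the talk.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count * 1_000_000 / corpus_size


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood score for one word.

    Compares the observed counts with the counts expected if the word
    were equally frequent (relative to corpus size) in both corpora.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# HYPOTHETICAL counts of "Mr" in two 1-million-word snapshots.
snapshots = {1991: (1500, 1_000_000), 2006: (1200, 1_000_000)}

for year, (count, size) in snapshots.items():
    print(year, per_million(count, size), "per million words")

(count_1991, size_1991), (count_2006, size_2006) = snapshots[1991], snapshots[2006]
score = log_likelihood(count_1991, size_1991, count_2006, size_2006)

# With one degree of freedom, a score above 3.84 corresponds to p < 0.05.
print("log-likelihood:", round(score, 2), "| above 3.84:", score > 3.84)
```

Even a high score only tells us that the two snapshots differ. It says nothing about the years in between, which is exactly why the hedging matters.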

Also, because the corpora are small, many words in the Brown family have tiny frequencies, so it’s very difficult to make any claims about them. And the sampling could be viewed as rather outdated – the sorts of texts that people accessed in the 1960s are not necessarily the same as those they access now. There are no online texts in the Brown family (although to ease collection, both the 2006 members involved texts that were originally published in written form, then placed online). Nor is there any advertising text. Or song lyrics. Or horror fiction. Or erotica (although there is a section on Romantic Fiction which could be pushed in that direction). Finally, the fact that all the texts are of the published variety means that they tend to represent a somewhat standardised, conservative form of English. A lot of the innovation in English happens in much more informal contexts, especially where young people or people from different backgrounds mix together – inner-city playgrounds and internet forums being two good examples. By the time such innovation gets into written, published, standard English, it’s no longer innovative. So the Brown family can’t tell us about the cutting edge of language use – its members will always be a few years out of fashion.

So what are the Brown family good for, if anything?

