Compiling a new, publicly accessible corpus of British English conversation
The Spoken BNC2014 is now accessible online in full, free of charge, for research and teaching purposes. To access the corpus, you should first create a free account on Lancaster University’s CQPweb server (https://cqpweb.lancs.ac.uk/) if you do not already have one. Once registered, please visit the BNC2014 website (http://corpora.lancs.ac.uk/bnc2014) to (a) sign the corpus’ end-user licence and (b) register your CQPweb account – following the instructions on the site. When you return to CQPweb, you will have access to the Spoken BNC2014 via the link that appears in the list of ‘Present-day English’ corpora. While access is initially only via the CQPweb platform, the underlying corpus XML files and associated metadata will be available for download in Autumn 2018. The BNC2014 website also contains lots of useful information about the corpus, and in particular a downloadable manual and reference guide.
This project is a collaboration between CASS and Cambridge University Press. Together, we have collected samples of real-life, informal, spoken interactions between speakers of British English from across the United Kingdom. The transcriptions of these recordings form a corpus known as the Spoken British National Corpus 2014 (Spoken BNC2014), which was made publicly available in September 2017 on Lancaster University’s CQPweb server.
The audio recordings contain face-to-face conversations between people who speak British English as their first language, collected between 2012 and 2016. The recordings could be on any subject, and speakers were aware of being recorded as they conversed. In total the corpus comprises over 10 million words.
We used the year 2014 in the name of the corpus for three reasons: it commemorates the 20th anniversary of the release of the original British National Corpus (1994); it is the year in which CASS and CUP launched the project; and, perhaps most importantly, it is the median year of the data, which was collected between 2012 and 2016.
In 2015, we announced that some of the data would be released early to selected researchers who could apply with a research proposal. Since then, a dozen fascinating research projects have been conducted and we look forward to publishing them; some will appear in a forthcoming special issue of the International Journal of Corpus Linguistics (edited by Tony McEnery, Robbie Love & Vaclav Brezina) in 2017, and others are being published in a Routledge book (edited by Vaclav Brezina, Robbie Love & Karin Aijmer) in 2018.
On Monday 26th June 2017, we hosted a half-day symposium at Lancaster University to celebrate the then-upcoming release of the Spoken BNC2014.
Earlier in 2017 we announced plans to build a large-scale extension to the Spoken BNC2014, using audio recordings from the BBC Listening Project, which are archived by the British Library. Working with the BBC and the British Library, we will undertake transcription of a large number of recordings from hard-to-reach areas of the UK. Once completed, the transcripts will be made available as a supplement to the Spoken BNC2014.
McEnery, T. and Love, R. (fc). Bad Language. In Culpeper, J., F. Katamba, P. Kerswill, R. Wodak and T. McEnery (eds.). (fc). English Language: Description, Variation and Context (2nd ed.). London: Palgrave.
Brezina, V., R. Love and K. Aijmer (eds.). (2018 fc). Corpus Approaches to Contemporary British Speech: Sociolinguistic studies of the Spoken BNC2014. New York: Routledge.
Love, R., Dembry, C., Hardie, A., Brezina, V. and McEnery, T. (2017 fc). The Spoken BNC2014: designing and building a spoken corpus of everyday conversations. In International Journal of Corpus Linguistics, 22:3.
McEnery, T., Love, R. and Brezina, V. (eds.). (2017 fc). International Journal of Corpus Linguistics, 22:3, Special Issue.
Related conference papers & public talks
Love, R. (2017 fc). Bad language revisited: swearing in the Spoken BNC2014. Corpus Linguistics 2017 Conference. University of Birmingham, UK. July 2017.
Love, R. and Hardie, A. (2017 fc). Introducing the Spoken BNC2014 – explore the data yourself. Pre-conference workshop. Corpus Linguistics 2017 Conference. University of Birmingham, UK. July 2017.
Love, R. and Dembry, C. (2017 fc). Introducing the Spoken BNC2014. Spoken BNC2014 symposium. Lancaster University, UK. June 2017.
Love, R. (2017). FUCK in spoken British English revisited with the Spoken BNC2014. ICAME 38 Conference. Charles University, Czech Republic. May 2017.
Love, R. (2016). “Accent – General American; Dialect – British English”: reflections on tricky metadata in the Spoken BNC2014. American Association for Corpus Linguistics (AACL) 2016 Conference. Iowa State University, Ames, Iowa, USA. September 2016.
Love, R. (2016). Sociolinguistics for spoken corpora: swearing in the Spoken BNC2014. Sociolinguistics Summer School 7. Université de Lyon, France. June 2016.
Love, R. (2016). “Normal with a brummy twang”: dealing with metadata in the Spoken BNC2014. IVACS 2016 Conference. Bath Spa University, UK. June 2016.
Love, R. (2015). Spoken English in UK society. ESRC Language Matters: Communication, Culture, and Society. International Anthony Burgess Foundation, Manchester, UK. November 2015.
Love, R. and Dembry, C. (2015). Who says what in spoken corpora?: speaker identification in the Spoken BNC2014. Corpus Linguistics 2015 Conference. Lancaster University, UK. July 2015.
Dembry, C. and Love, R. (2015). Collecting the new Spoken BNC2014 – overview of methodology. Corpus Linguistics 2015 Conference. Lancaster University, UK. July 2015.
Love, R. (2015). Critical issues in spoken corpus development: defining a transcription schema for the spoken BNC2014. ICAME 36 Conference. University of Trier, Germany. May 2015.
McEnery, T., Love, R. and Dembry, C. (2014). Words ‘yesterday and today’. ESRC Language Matters: Communication, Culture, and Society. Royal United Services Institute, London, UK. November 2014.
Dembry, C. and Love, R. (2014). Spoken English in Today’s Britain. Cambridge Festival of Ideas. Cambridge University, UK. October 2014.
Co-Investigator: Tony McEnery
Co-Investigator: Claire Dembry (Cambridge University Press)
Co-Investigator: Andrew Hardie
Senior Research Associate: Vaclav Brezina
Research Student: Robbie Love
Read the latest updates on this project:
- ‘Using corpora to teach sociolinguistics’ at the TaLC conference in Cambridge (22 July 2018)
Last week, the Faculty of Education, The University of Cambridge hosted the 13th Teaching and Language Corpora Conference. This wonderful event brought together researchers and practitioners interested in different applications of corpus techniques in the classroom. Dana Gablasova and I, with the help of Irene Marin Cervantes and Tanjun Liu, gave a practical workshop introducing the idea ...
- My experience with working at CASS as a SPRINT intern (17 July 2018)
Over the last few weeks I have been working at the ESRC Centre for Corpus Approaches, Lancaster University (CASS) as part of the SPRINT 2018 internship programme. I have just finished my second year studying Spanish and Linguistics and this project was particularly interesting to me from a linguistic perspective. I wanted to work with ...
- Is Academic Writing Becoming More Colloquial? (10 July 2018)
Have you noticed that academic writing in books and journals seems less formal than it used to? Preliminary data from the Written BNC2014 shows that you may be right! Some early data from the academic journals and academic books sections of the new corpus has been analysed to find out whether academic writing has become more ...
- British National Corpus 2014: A sociolinguistic book is out (1 June 2018)
Have you ever wondered what real spoken English looks like? Have you ever asked the question of whether people from different backgrounds (based on gender, age, social class etc.) use language differently? Have you ever thought it would be interesting to investigate how much English has changed over the last twenty years? All these questions ...
- Learn about the BNC2014, scan a book sample and contribute to the corpus… (14 May 2018)
On Saturday 12 May 2018, CASS hosted a small training event at Lancaster University for a group of participants, who came from different universities in the UK. We talked about the BNC2014 project and discussed both the theoretical underpinnings as well as the practicalities of corpus design and compilation. Slides from the event are available ...
- The Spoken BNC2014 is now available! (25 September 2017)
On behalf of Lancaster University and Cambridge University Press, it gives us great pleasure to announce the public release of the Spoken British National Corpus 2014 (Spoken BNC2014). The Spoken BNC2014 contains 11.5 million words of transcribed informal British English conversation, recorded by (mainly English) speakers between the years 2012 and 2016. The situational context of ...
- Spoken BNC2014 Symposium (27 June 2017)
On the afternoon of Monday 26th June, CASS hosted a special symposium to celebrate the upcoming public launch of the Spoken British National Corpus 2014 – a corpus which members of CASS and Cambridge University Press have spent the last three years compiling. More than fifty guests attended, representing a mixture of Lancaster Summer Schools participants, ...
- Introducing a new project with the British Library (21 February 2017)
Since 2012 the BBC have been working with the British Library to build a collection of intimate conversations from across the UK in the BBC Listening Project. Through its network of local radio stations, and with the help of a travelling recording booth the BBC has captured many conversations of people, who are well known ...
- Spoken BNC2014 book announcement (5 August 2016)
We are excited to announce a forthcoming book which will be published as part of the Routledge Advances in Corpus Linguistics series. “Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014” (edited by Vaclav Brezina, Robbie Love and Karin Aijmer) will feature a collection of research which is currently being undertaken by ...
- The Spoken BNC2014 early access projects: Part 4 (16 March 2016)
In January, we announced the recipients of the Spoken BNC2014 Early Access Data Grants. Over the next several months, they will use exclusive access to the first five million words of Spoken BNC2014 data to carry out a total of thirteen research projects. In this series of blogs, we are excited to share more information about ...