Celebrating the Written BNC2014: Lancaster Castle event

On 19 November 2021, the ESRC Centre for Corpus Approaches to Social Science (CASS) organised an event to celebrate the launch of the Written British National Corpus 2014 (BNC2014). The event was live-streamed from a very special location: the medieval Lancaster Castle. About 20 participants attended on site, and more than 1,200 joined the event online. Dr Vaclav Brezina opened the event and welcomed participants from over 30 different countries. After the official welcome by Professor Elena Semino and Professor Paul Connolly, a series of invited talks was delivered by prominent speakers from the UK and abroad. The talks covered topics such as corpus development, corpora in the classroom, corpora and fiction, and the historical development of English.

The BNC2014 is now available, together with its predecessor the BNC1994, via #LancsBox X.

#LancsBox X interface

More information about the design and development of the Written BNC2014 is available from this open access research article:

If you missed the event, recordings of the individual sessions are available below. You can also view the PDF slides about the Written BNC2014.

Online programme: Lancaster Corpus Linguistics
Vaclav Brezina, Elena Semino, Paul Connolly (Lancaster University): Welcome and Introduction to the event
Tony McEnery (Lancaster University): The idea of the written BNC2014
Dawn Knight (Cardiff University): Building a National Corpus: The story of the National Corpus of Contemporary Welsh
Vaclav Brezina and William Platt (Lancaster University): Current British English and Exploring the BNC2014 using #LancsBox X
Randi Reppen (Northern Arizona University): Corpora in the classroom
Alice Deignan (University of Leeds): Corpora in education
Dana Gablasova (Lancaster University): Corpus for schools
Bas Aarts (University College London): Plonker of a politician NPs
Marc Alexander (University of Glasgow): British English: A historical perspective
Michaela Mahlberg (University of Birmingham): Corpora and fiction
Martin Wynne (University of Oxford): CLARIN – corpora, corpus tools and collaboration
Vaclav Brezina (Lancaster University): Farewell

Introductory Blog – Hanna Schmueck

I am very honoured to have received the Geoffrey Leech Outstanding MA Student Award for my MA in Language and Linguistics. This award traditionally goes to the MA student with the highest overall average.

I started my postgraduate journey in September 2019 after finishing my undergraduate degree at the University of Bamberg (Germany) in 2018 and working as a freelance translator and teacher for a year. I’ve always had an interest in the way language influences us both as individuals and as a society and have carried with me a fascination for experimentation and statistics. I first discovered corpus linguistics in the second year of my undergraduate degree, and it soon became my primary research interest. I chose a corpus-based project for my undergraduate dissertation on pronouns in the English-lexifier lingua franca Bislama. While working on it, I realised that much of the relevant methodological literature had been published by Lancaster academics, which cemented my decision to apply to Lancaster despite having to move abroad and face a number of Brexit-related administrative hurdles.

When I finally came to Lancaster for my MA, I felt welcome in the department from day one, and I had the chance to attend or audit a wide variety of modules such as Cognitive Linguistics, Experimental Approaches to Language and Cognition, Forensic Linguistics, Stylistics, and Corpus Linguistics. The freedom of choice that Lancaster MA students in Language and Linguistics are given was another major motivation for studying at Lancaster, and the flexible approach really benefited my personal learning experience. Another important element of my academic learning experience was being able to attend research groups – such as the Trinity group and UCREL talks – which focus on a wide variety of topics and allow you to come into contact with people who have all kinds of specialisms while getting the opportunity to develop your own research interests further.

Like all of us, I had not foreseen that my MA would move online in spring, nor all the challenges COVID-19 would bring about, but after the first phase of getting used to the situation I tried my best to see this as an opportunity to focus on my MA thesis titled “More than the sum of its parts: Collocation networks in the written section of the BNC2014 Baby+”. The aim of this thesis was to explore corpus-wide collocation networks and their structural and graph-theoretical properties using the BNC2014 Baby+ as the underlying dataset. I developed a method to create and display large weighted networks based on MI2 scores in order to analyse the meta-level collocational patterns that emerge, and performed a graph-theoretical analysis on them. The results obtained from this pilot study suggested that there is an underlying structure that all sections in the BNC2014 Baby+ share, and that the structure of the generated networks resembles that of networks from a wide variety of phenomena such as power grids, social networks, and networks of brain neurons. The findings indicated that there are, however, text-type specific differences in terms of how connected different topic areas are and that certain words serve as hubs connecting topics with one another. The network displayed below is an example taken from the BNC2014 Baby+ academic books section with a filter applied to only show the node “award”, its direct neighbours and their weighted interrelations.
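To make the idea of an MI2-weighted collocation network concrete, here is a minimal Python sketch. It is not the method developed in the thesis, whose details are not given here. It assumes the commonly used definition MI2 = log2(O² / E), where O is the observed co-occurrence frequency of two words and E = f(w1) × f(w2) / N is a simple expected frequency without any window-size correction. The function names `mi2_network` and `ego_network`, the window size, the frequency threshold and the toy sentence are all illustrative assumptions.

```python
import math
from collections import Counter


def mi2_network(tokens, window=2, min_cooc=2):
    """Return {(w1, w2): MI2 score} for word pairs co-occurring within `window` tokens.

    Assumption: expected frequency is f(w1) * f(w2) / N, with no window-size correction.
    """
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    for i, word in enumerate(tokens):
        # Look only to the right so that each co-occurrence is counted once.
        for j in range(i + 1, min(i + window + 1, n)):
            if tokens[j] != word:
                cooc[tuple(sorted((word, tokens[j])))] += 1

    edges = {}
    for (a, b), observed in cooc.items():
        if observed < min_cooc:
            continue  # drop rare pairs, which give unreliable scores
        expected = freq[a] * freq[b] / n
        edges[(a, b)] = math.log2(observed ** 2 / expected)
    return edges


def ego_network(edges, node):
    """Keep `node`, its direct neighbours, and the weighted edges among them."""
    neighbours = {a if b == node else b for (a, b) in edges if node in (a, b)}
    keep = neighbours | {node}
    return {pair: w for pair, w in edges.items()
            if pair[0] in keep and pair[1] in keep}


# Toy data, invented for illustration only (not taken from the BNC2014 Baby+).
toy_tokens = "the award was given the award was won".split()
toy_edges = mi2_network(toy_tokens, window=1, min_cooc=2)
toy_ego = ego_network(toy_edges, "award")
```

The `ego_network` filter mirrors the kind of view described above: one node, its direct neighbours, and the weighted links between them. On a real corpus the same edge dictionary could be handed to a graph library for the graph-theoretical measures.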

I am very grateful for having had the opportunity to learn from and exchange ideas with so many amazing academics in the department over the course of my MA and I’m very excited to carry on researching collocation networks for my PhD here at Lancaster.

Is Academic Writing Becoming More Colloquial?

Have you noticed that academic writing in books and journals seems less formal than it used to? Preliminary data from the Written BNC2014 shows that you may be right!

Some early data from the academic journals and academic books sections of the new corpus has been analysed to find out whether academic writing has become more colloquial since the 1990s. Colloquialisation is “a tendency for features of the conversational spoken language to infiltrate and spread in the written language” (Leech, 2002: 72). The colloquialisation of language can make messages more easily understood by the general public because, whilst not everybody is familiar with the specifics of academic language, everyone is familiar with spoken language. In order to investigate the colloquialisation of academic writing, the frequencies of several linguistic features which have been associated with colloquialisation were compared in academic writing in the BNC1994 and the BNC2014.

Results show that, of the eleven features studied, five have shown large changes in frequency between the BNC1994 and the BNC2014, pointing to the colloquialisation of academic writing. First and second person pronouns, verb contractions, and negative contractions have previously been found to be strongly associated with spoken language. These features have all increased in academic language between 1994 and 2014. Passive constructions and relative pronouns have previously been found to be strongly associated with written language, and are not often used in spoken language. This analysis shows that both of these features have decreased in frequency in academic language in the BNC2014.

Figure 1: Frequency increases indicating the colloquialisation of academic language.

Figure 2: Frequency decreases indicating the colloquialisation of academic language.

These frequency changes were also compared for each genre of academic writing separately. The genres studied were: humanities & arts; social science; politics, law & education; medicine; natural science; and technology & engineering. An interesting difference between some of these genres emerged. It seems that the ‘hard’ sciences (medicine, natural science, and technology & engineering) have shown much larger changes in some of the linguistic features studied than the other genres have. For example, Figure 3 shows the percentage increase in verb contractions for each genre, and clearly shows a difference between the ‘hard’ sciences and the social sciences and humanities subjects.


Figure 3: % increases in the frequency of the use of verb contractions between 1994 and 2014 for each genre of academic writing.

This may lead you to think that medicine, natural science, and technology & engineering writing has become more colloquial than the other genres, but this is in fact not the case. Looking more closely at the data shows us that these ‘hard’ science genres were actually much less colloquial than the other genres in the 1990s, and that the large change seen here is actually a symptom of all genres becoming more similar in their use of these features. In other words, some genres have not become more colloquial than others; they have simply had to change more in order for all of the genres to become more alike.
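The arithmetic behind this point can be illustrated with a short Python sketch. The figures below are hypothetical and chosen only to show the effect; they are not taken from the BNC1994 or the BNC2014, and the helper names `per_million` and `percent_change` are assumptions made for this example. A genre that starts from a low baseline shows a much larger percentage increase even when it ends up no more colloquial than the others.

```python
def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count / corpus_size * 1_000_000


def percent_change(old, new):
    """Percentage change from an old to a new relative frequency."""
    return (new - old) * 100 / old


# Hypothetical verb-contraction frequencies per million words (invented numbers).
rates_1994 = {"natural science": 50.0, "humanities & arts": 200.0}
rates_2014 = {"natural science": 240.0, "humanities & arts": 300.0}

changes = {
    genre: percent_change(rates_1994[genre], rates_2014[genre])
    for genre in rates_1994
}

for genre, change in changes.items():
    print(f"{genre}: {rates_1994[genre]:.0f} -> {rates_2014[genre]:.0f} "
          f"per million ({change:+.0f}%)")
```

In this invented example the natural science rate rises far more in percentage terms, yet its 2014 frequency is still below that of humanities & arts: the gap between the genres has narrowed rather than reversed.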

So it seems from this analysis that, in some respects at least, academic language has become more colloquial since the 1990s. The following is a typical example of academic writing in the 1990s, taken from a sample of a natural sciences book in the BNC1994. It avoids first and second person pronouns and contractions (which have increased in use in the BNC2014), and uses a passive construction (the use of which has decreased in the BNC2014).

Experimentally one cannot set up just this configuration because of the difficulty in imposing constant concentration boundary conditions (Section 14.3). In general, the most readily practicable experiments are ones in which an initial density distribution is set up and there is then some evolution of the configuration during the course of the experiment.

It is much more common nowadays to see examples such as the following, taken from an academic natural sciences book in the BNC2014. This example contains active sentence constructions, first person pronouns, verb contractions, negative contractions, and a question.

No doubt people might object in further ways, but in the end nearly all these replies boil down to the first one I discussed above. I’d like to return to it and ponder a somewhat more aggressive version, one that might reveal the stakes of this discussion even more clearly. Very well, someone might say. Not reproducing may make sense for most people, but my partner and I are well-educated, well-off, and capable of protecting our children from whatever happens down the road. Why shouldn’t we have children if we want to?

It will certainly be interesting to see if this trend of colloquialisation can be seen in other genres of writing in the BNC2014!


Would you like to contribute to the Written BNC2014?

We are looking for native speakers of British English to submit their student essays, emails, and Facebook and WhatsApp messages for inclusion in the corpus! To find out more, and to get involved, click here. All contributors will be fully credited in the corpus documentation.

British National Corpus 2014: A sociolinguistic book is out

Have you ever wondered what real spoken English looks like? Have you ever asked whether people from different backgrounds (based on gender, age, social class etc.) use language differently? Have you ever thought it would be interesting to investigate how much English has changed over the last twenty years? All these questions can be answered by looking at language corpora such as the Spoken BNC2014 and analysing them from a sociolinguistic perspective. Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014 is a book which offers a series of studies that provide a unique insight into a number of topics ranging from Discourse, Pragmatics and Interaction to Morphology and Syntax.

This is, however, only the first step. We are hoping that there will be many more studies to come based on this wonderful dataset. If you want to start exploring the Spoken BNC2014 corpus, it is just three mouse clicks away:

Get access to the BNC2014 Spoken

  1. Register for free and log on to CQPweb.
  2. Sign up for access to the BNC2014 Spoken.
  3. Select ‘BNC2014’ in the main CQPweb menu.

Also, right now there is a great opportunity to take part in the Written BNC2014 project, a written counterpart to the Spoken BNC2014. If you’d like to contribute to the Written BNC2014, please check out the project’s website for more information.

Learn about the BNC2014, scan a book sample and contribute to the corpus…

On Saturday 12 May 2018, CASS hosted a small training event at Lancaster University for a group of participants from different universities in the UK. We talked about the BNC2014 project and discussed both the theoretical underpinnings and the practicalities of corpus design and compilation. Slides from the event are available as pdf here.

The participants then tried out in practice what is involved in compiling a large general corpus such as the BNC2014. They selected and scanned samples of books from current British fiction, poetry and a range of non-fiction books (history, popular science, hobbies etc.). Once processed, these samples will become a part of the written BNC2014.

Here are some pictures from the event:

Carmen Dayrell and Vaclav Brezina before the event

Elena Semino welcoming participants

In the computer lab: Abi Hawtin helping participants


A box full of books

If you are interested in contributing to the written BNC2014, go to the project website to find out about different ways in which you can participate in this exciting project.

The event was supported by ESRC grant no. EP/P001559/1.

The Spoken BNC2014 is now available!

On behalf of Lancaster University and Cambridge University Press, it gives us great pleasure to announce the public release of the Spoken British National Corpus 2014 (Spoken BNC2014).

The Spoken BNC2014 contains 11.5 million words of transcribed informal British English conversation, recorded by (mainly English) speakers between the years 2012 and 2016. The situational context of the recordings – casual conversation among friends and family members – is designed to make the corpus broadly comparable to the demographically-sampled component of the original spoken British National Corpus.

The Spoken BNC2014 is now accessible online in full, free of charge, for research and teaching purposes. To access the corpus, you should first create a free account on Lancaster University’s CQPweb server (https://cqpweb.lancs.ac.uk/) if you do not already have one. Once registered, please visit the BNC2014 website (http://corpora.lancs.ac.uk/bnc2014) to (a) sign the corpus’ end-user licence and (b) register your CQPweb account – following the instructions on the site. When you return to CQPweb, you will have access to the Spoken BNC2014 via the link that appears in the list of ‘Present-day English’ corpora. While access is initially only via the CQPweb platform, the underlying corpus XML files and associated metadata will be available for download in Autumn 2018.

The BNC2014 website also contains lots of useful information about the corpus, and in particular a downloadable manual and reference guide, which will be available soon. Further information, as well as the first research articles to use Spoken BNC2014 data, will be available in two in-press publications associated with the project: a special issue of the International Journal of Corpus Linguistics (due next month) and an edited collection in the Routledge ‘Advances in Corpus Linguistics’ series (due early 2018).

The BNC2014 does not end here – we are currently working on transcribing materials provided to us by the British Library to provide a substantial supplement to the corpus – find out more about that here: http://cass.lancs.ac.uk/?p=2241. For now, we will be waiting and watching with interest to see what work the corpus released today stimulates. As ever with corpus data, it does not enable all questions to be answered, but it does allow a very wide range of questions to be investigated.

The Spoken BNC2014 research team would like to express our gratitude to all who have had a hand in the creation of the corpus, and hope that you enjoy exploring the data. We are, of course, keen to hear your feedback about the corpus; this, as well as any questions, can be directed to Robbie Love (r.m.love@lancaster.ac.uk) or Andrew Hardie (a.hardie@lancaster.ac.uk).

Spoken BNC2014 Symposium

On the afternoon of Monday 26th June, CASS hosted a special symposium to celebrate the upcoming public launch of the Spoken British National Corpus 2014 – a corpus which members of CASS and Cambridge University Press have spent the last three years compiling.

More than fifty guests attended, representing a mixture of Lancaster Summer Schools participants, members of the CASS Challenge Panel, and those who travelled to Lancaster just for the day.

To kick off the symposium, CASS Centre Director Andrew Hardie said a few words about the history of Corpus Linguistics at Lancaster University, and put the compilation of a new BNC into context against previous developments in the field. He expressed his delight at the interest in the Spoken BNC2014 project as evidenced by the number of guests who were in attendance for the symposium.

I then gave the first talk alongside Claire Dembry (from Cambridge University Press) and Andrew Hardie, as representatives of the Spoken BNC2014 research team which also includes Vaclav Brezina and Tony McEnery. We discussed the main methodological decisions we made when thinking about the design, data collection, transcription and processing of the corpus. Andrew then gave a quick demonstration of the corpus in CQPweb, showing how features including speaker IDs, overlaps and attribution confidence are displayed in the interface.

Following our talk came the first of four research presentations, all of which used (the early access subset of) the Spoken BNC2014. The first of these was a talk by Karin Aijmer (University of Gothenburg) about the intensifier fucking, which went down very well with the audience. Karin’s Spoken BNC2014 research, which also includes other intensifiers, will be published as a chapter in Brezina et al. (forthcoming).

After a short break for refreshments, Jacqueline Laws (University of Reading) presented research into verb-forming suffixation which she had undertaken with Chris Ryder and Sylvia Jaworska. Comparing the demographically-sampled component of the Spoken BNC1994 to the new Spoken BNC2014, she found that females now appear to produce more neologisms (e.g. favouritize, popify) compared to males. Laws et al.’s research will be published in a forthcoming special issue of the International Journal of Corpus Linguistics.

Susan Reichelt (Lancaster University) was next to present her work on producing sociolinguistically comparable subsets of both the original and new Spoken British National Corpora. She highlighted a point which I had touched upon in my earlier talk: that the compilation of the Spoken BNC2014 sought to strike a balance between direct comparability with the original corpus on the one hand, and methodological improvement on the other. The areas where improvement was favoured over comparability (e.g. the classification of speaker socio-economic status) ought to be considered especially when thinking about sociolinguistic analysis. Susan’s work is associated with the recently announced CASS SDA project.

Finally, Jonathan Culpeper and Mathew Gillings (Lancaster University) presented their work on politeness variation between the north and south of England. They aimed to assess the extent to which commonly held stereotypes about differences between northern and southern politeness were reflected in language use in both the original and new corpora as a single dataset. Their work will be published as a chapter in Brezina et al. (forthcoming).

My reaction as the organiser of the symposium was that there is definitely a sense of anticipation about the release of the Spoken BNC2014, which is planned to take place in the autumn. Furthermore, it was lovely to meet so many friendly and enthusiastic attendees. I am very grateful to each of the speakers for giving such interesting talks, and to all who attended – especially those who tweeted their reactions to the talks using the #BNC2014 hashtag! As one of my final duties as a member of CASS before moving on to pastures new, I am very glad that the symposium went as well as it did.