Spoken BNC2014 Symposium

On the afternoon of Monday 26th June, CASS hosted a special symposium to celebrate the upcoming public launch of the Spoken British National Corpus 2014 – a corpus which members of CASS and Cambridge University Press have spent the last three years compiling.

More than fifty guests attended, representing a mixture of Lancaster Summer Schools participants, members of the CASS Challenge Panel, and those who travelled to Lancaster just for the day.

To kick off the symposium, CASS Centre Director Andrew Hardie said a few words about the history of Corpus Linguistics at Lancaster University, and put the compilation of a new BNC into context against previous developments in the field. He expressed his delight at the interest in the Spoken BNC2014 project as evidenced by the number of guests who were in attendance for the symposium.

I then gave the first talk alongside Claire Dembry (from Cambridge University Press) and Andrew Hardie, as representatives of the Spoken BNC2014 research team which also includes Vaclav Brezina and Tony McEnery. We discussed the main methodological decisions we made when thinking about the design, data collection, transcription and processing of the corpus. Andrew then gave a quick demonstration of the corpus in CQPweb, showing how features including speaker IDs, overlaps and attribution confidence are displayed in the interface.

Following our talk came the first of four research presentations, all of which used (the early access subset of) the Spoken BNC2014. The first of these was a talk by Karin Aijmer (University of Gothenburg) about the intensifier fucking, which went down very well with the audience. Karin’s Spoken BNC2014 research, which also includes other intensifiers, will be published as a chapter in Brezina et al. (forthcoming).

After a short break for refreshments, Jacqueline Laws (University of Reading) presented research into verb-forming suffixation which she had undertaken with Chris Ryder and Sylvia Jaworska. Comparing the demographically-sampled component of the Spoken BNC1994 to the new Spoken BNC2014, she found that females now appear to produce more neologisms (e.g. favouritize, popify) than males. Laws et al.’s research will be published in a forthcoming special issue of the International Journal of Corpus Linguistics.

Susan Reichelt (Lancaster University) was next to present her work on producing sociolinguistically comparable subsets of both the original and new Spoken British National Corpora. She highlighted a point which I had touched upon in my earlier talk: that the compilation of the Spoken BNC2014 sought to strike a balance between direct comparability with the original corpus on the one hand, and methodological improvement on the other. The areas where improvement was favoured over comparability (e.g. the classification of speaker socio-economic status) ought to be considered especially when thinking about sociolinguistic analysis. Susan’s work is associated with the recently announced CASS SDA project.

Finally, Jonathan Culpeper and Mathew Gillings (Lancaster University) presented their work on politeness variation between the north and south of England. They aimed to assess the extent to which commonly held stereotypes about differences between northern and southern politeness were reflected in language use, treating the original and new corpora as a single dataset. Their work will be published as a chapter in Brezina et al. (forthcoming).

My reaction as the organiser of the symposium was that there is definitely a sense of anticipation about the release of the Spoken BNC2014, which is planned to take place in the autumn. Furthermore, it was lovely to meet so many friendly and enthusiastic attendees. I am very grateful to each of the speakers for giving such interesting talks, and to all who attended – especially those who tweeted their reactions to the talks using the #BNC2014 hashtag! As one of my final duties as a member of CASS before moving on to pastures new, I am very glad that the symposium went as well as it did.

Introducing a new project with the British Library

Since 2012 the BBC has been working with the British Library to build a collection of intimate conversations from across the UK in the BBC Listening Project. Through its network of local radio stations, and with the help of a travelling recording booth, the BBC has captured, in high quality audio, many conversations on a range of topics between people who are well known to one another.

For the past two years we have been discussing with the BBC and the British Library the possibility of using these recordings as the basis of a large-scale extension of our Spoken BNC. The Spoken BNC2014 has been built so far to reflect language in intimate settings, with recordings made in the home. This has led to a large and very useful collection of data but, without the resources of an organisation such as the BBC, we were not able to roam the country with a sound recording booth to sample language from John o’Groats to Land’s End! By teaming up with the BBC and British Library we can supplement this very useful corpus, which is strongly focused on a ‘hard to capture’ context (intimate conversations in the home), with another type of data: intimate conversations in a public situation, sampled from across the UK.

Another way in which the Listening Project data should prove helpful to linguists is that the data itself was captured in a recording studio as high quality audio recordings. Our hope is that a corpus based on this material will be of direct interest and use to phoneticians.

We have recently concluded our discussion with the British Library, which is archiving this material, and signed an agreement which will see CASS undertake orthographic transcription of the data. Our goal is to provide a high quality transcription of the data which will be of use both to linguists and to members of the public who may wish to browse the collection. In doing this we will be building on our experience of producing the Trinity Lancaster Corpus of Spoken Learner English and the Spoken BNC2014.

We take our first delivery of recordings at the beginning of March and are very excited at the prospect of lifting the veil a little further on the fascinating topic of everyday conversation and language use. The plan is to transcribe up to 1000 of the recordings archived at the British Library. We will also be working to time-align the transcriptions with the sound recordings, and we are working closely with our strong phonetics team in the Department of Linguistics and English Language at Lancaster University to begin to assess the extent to which this new dataset could facilitate new work, for example on the accents of the British Isles.

Our partners in the British Library are just as excited as we are – Jonnie Robinson, lead Curator for Spoken English at the British Library, says: ‘The British Library is delighted to enable Lancaster to make such innovative use of the Listening Project conversations and we look forward to working with them to make the collection more accessible and to enhance its potential to support linguistic and other research enquiries’.

Keep an eye on the CASS website and Twitter feed over the next couple of years for further updates on this new project!

The Spoken BNC2014 early access projects: Part 4

In January, we announced the recipients of the Spoken BNC2014 Early Access Data Grants. Over the next several months, they will use exclusive access to the first five million words of Spoken BNC2014 data to carry out a total of thirteen research projects.

In this series of blogs, we are excited to share more information about these projects, in the words of their authors.

In the fourth and final part of our series, read about the work of Tanja Hessner & Ira Gawlitzek, Karin Axelsson, Andrew Caines et al. and Tanja Säily et al.


Tanja Hessner and Ira Gawlitzek

University of Mannheim, Germany

Women speak in an emotional manner; men show their authority through speech! – A corpus-based study on linguistic differences showing which gender clichés are (still) true by analysing boosters in the Spoken BNC2014

Western world clichés claim that women are emotional and often exaggerate, which is reflected in their speech. In contrast, men’s language is said to be characterised by bluntness. Aiming to shed a bit more light on statements like these, this study is going to consider gender differences on the lexical level.

In order to discover whether, and if so to what extent, there really is a difference between female and male speakers, the phenomenon of boosters will be investigated in the Spoken BNC2014 early access subset. Boosters such as totally or absolutely are particularly appealing and suitable for analysing gender differences since they are extremely multifaceted and they are indicators not only of lively, but also of emotional and powerful speech. Appropriate boosters will be investigated not only with quantitative methods but also through qualitative analysis of the data.


Karin Axelsson

University of Gothenburg, Sweden

Canonical and non-canonical tag questions in the Spoken BNC2014: What has happened since the original BNC?

What is happening to tag questions in British everyday conversation? Are canonical tag questions, where the form of the tag reflects that of the preceding clause (as in She won’t come, will she?), on the way out as the use of innit and other invariant tags spreads? Who uses innit in 2014? The use of tag questions in the Spoken BNC2014 early access subset will be compared to their use in the demographic part of the original Spoken BNC, which reflects the language of the early 1990s.


Andrew Caines¹, Michael McCarthy² and Paula Buttery¹

¹University of Cambridge, UK

²University of Nottingham, UK

‘You still talking to me?’ The zero auxiliary progressive in spoken British English, twenty years on

With early access to a subset of the Spoken BNC2014, we will be able to assess whether a supposedly ‘ungrammatical’ construction has become more frequently used in conversational British English over the past 20 years. The construction in question is the ‘zero auxiliary’ – for example, the progressive aspect construction may be used with an -ing verb form alone (“you talking to me?”, “What you doing?”, “We going to town”) whereas the standard rule is to combine an auxiliary verb (BE or HAVE) with the -ing form.

In the original Spoken BNC recorded in the early 1990s, the zero auxiliary occurred in one-in-twenty progressive constructions, a rate that rose to one-in-three if second person interrogatives (You talking to me? etc.) were considered alone. Moreover, younger working-class speakers were more likely to use the zero auxiliary than older middle-class speakers. We will investigate how these usage rates compare to the Spoken BNC2014, in the process updating the demographics of zero auxiliary use as well.


Tanja Säily¹, Victoria González-Díaz² and Jukka Suomela³

¹University of Helsinki, Finland

²University of Liverpool, UK

³Aalto University, Finland

Variation in the productivity of adjective comparison

The functional competition between inflectional (‑er) and periphrastic (more) comparative strategies in English has received a great deal of attention in corpus-based research. A key area of competition remains relatively unexplored, however: the productivity of either comparative strategy, or how diversely they are used with different adjectives. The received wisdom is that inflection is fully productive, so we might expect to find no variation within the productivity of ‑er. However, recent research using new methods shows sociolinguistic variation in the productivity of extremely productive derivational suffixes. Whether the same variation applies to the productivity of inflectional processes remains an open question.

On the basis of the Spoken BNC2014 early access subset, our project will analyse intra- and extra-linguistic variation in the productivity of inflectional and periphrastic comparative strategies. Intra-linguistic factors include syntactic position, modification preferences, length and derivational type of the adjective. The extra-linguistic determinants focus on gender, age, socio-economic status, conversational setting and roles of the interlocutors. Our research constitutes a timely contribution to current knowledge of adjective comparison and morphological theory-building. If (a) variation in the productivity of inflectional comparison is found and (b) similar change in the productivity of both derivational and inflectional processes is observed, this will support our hypothesis that there is a derivation-to-inflection cline rather than a sharp divide.


Check back soon for more updates on the Spoken BNC2014 project!

Spoken BNC2014 Early Access Data Grant Scheme – Applications now open

Lancaster University’s ESRC-funded Centre for Corpus Approaches to Social Science (CASS) and Cambridge University Press are excited to announce the Spoken British National Corpus 2014 Early Access Data Grant scheme.

Applications are now open for researchers at any level in the field of corpus linguistics and beyond to gain early access to a large subset of the Spoken BNC2014, which is currently being compiled and is due for release in late 2017. Successful applicants will write a paper based on their proposed research for exclusive publication (subject to peer review) in either a special issue of the International Journal of Corpus Linguistics or an edited collection.

We invite proposals for interesting and innovative research that would use approximately five million words of the upcoming Spoken BNC2014 as its primary source of data.

Successful applicants will gain access to the data via the CQPweb platform (cqpweb.lancs.ac.uk). Standard CQPweb functionality will be provided, including annotation (POS tagging, lemmatisation, semantic tagging), along with one new feature: the ability to search the corpus according to categories of speaker metadata such as gender, age, dialect and socio-economic status.
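For anyone unfamiliar with the idea of restricting a search by speaker metadata, a rough sketch in plain Python may help. It is only an illustration of the concept and does not represent CQPweb's interface or query syntax: the class names, fields, function names, speaker IDs and utterances are all invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speaker:
    # Hypothetical metadata fields; the real corpus categories may differ.
    speaker_id: str
    gender: str
    age_band: str


@dataclass(frozen=True)
class Utterance:
    speaker_id: str
    text: str


def restrict_by_metadata(utterances, speakers, **criteria):
    """Keep only utterances whose speaker matches every given metadata value."""
    matching_ids = {
        s.speaker_id
        for s in speakers
        if all(getattr(s, field) == value for field, value in criteria.items())
    }
    return [u for u in utterances if u.speaker_id in matching_ids]


def count_hits(utterances, word):
    """Count case-insensitive whole-token matches of `word`."""
    target = word.lower()
    return sum(
        1
        for u in utterances
        for token in u.text.lower().split()
        if token.strip(".,?!") == target
    )


# Invented toy data, not taken from the corpus.
SPEAKERS = [
    Speaker("S1", "F", "19-29"),
    Speaker("S2", "M", "19-29"),
    Speaker("S3", "F", "60-69"),
]
UTTERANCES = [
    Utterance("S1", "that was absolutely brilliant"),
    Utterance("S2", "absolutely not"),
    Utterance("S3", "it was totally fine, absolutely fine"),
]

# Search for "absolutely" among female speakers only.
female_only = restrict_by_metadata(UTTERANCES, SPEAKERS, gender="F")
print(count_hits(female_only, "absolutely"))
```

Several categories can be combined in one call, for example `restrict_by_metadata(UTTERANCES, SPEAKERS, gender="F", age_band="19-29")`, which mirrors the kind of cross-category restriction described above.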

Proposals can approach the data from any theoretical angle, provided corpus methodologies are used and the research can be carried out within the affordances of CQPweb. Successful applicants will receive access to the data in February 2016 with a deadline for full paper submission in October 2016. Subject to peer review, papers will be published in one of the two Spoken BNC2014 launch publications in 2017 (a special issue of the International Journal of Corpus Linguistics has been agreed and a thematic edited collection is being planned).

This is a fantastic opportunity to work with the first very large, general corpus of informal British English conversation created since the original BNC more than twenty years ago. Successful applicants will get access to a large subset of the Spoken BNC2014 eighteen months before the full corpus is released, and will be the very first scholars to undertake and publish research based on this new dataset.

More details about the terms of the data grant scheme can be found in the application form. To apply, download and complete the application form and email it to Robbie Love (r.m.love@lancaster.ac.uk). The deadline for applications is Friday 11th December 2015.

Spoken BNC2014 project announcement


We are excited to announce that the ESRC-funded Centre for Corpus Approaches to Social Science (CASS) at Lancaster University and Cambridge University Press have agreed to collaborate on the compilation of a new, publicly accessible corpus of spoken British English called the ‘Spoken British National Corpus 2014’ (the Spoken BNC2014).

The aim of the Spoken BNC2014 project, which will be led jointly by Lancaster University’s Professor Tony McEnery and Cambridge University Press’ Dr Claire Dembry, is to compile a very large collection of recordings of real-life, informal, spoken interactions between people whose first language is British English. These will then be transcribed and made available publicly for a wide range of research purposes.

We aim to encourage people from all over the UK to record their interactions and send them to us as MP3 files. For each hour of good quality recordings we receive, along with all associated consent forms and information sheets completed correctly, we will pay £18. Recordings do not each have to be one hour long; participants may submit two 30-minute recordings, or three 20-minute recordings, and for each hour in total they will receive £18.

The collaboration between CASS at Lancaster University and Cambridge University Press brings together the best resources available for this task. Cambridge University Press has extensive experience of collecting very large English corpora, and it already has the infrastructure in place to undertake such a large compilation project. CASS at Lancaster University has the linguistic research expertise necessary to ensure that the Spoken BNC2014 will be as useful and accessible as possible for a wide range of purposes. The academic community will benefit from access to a new large spoken British English corpus that is balanced according to a selection of useful demographic criteria, including gender, age, and socio-economic status. This opens the door for all kinds of research projects, including the comparison of the Spoken BNC2014 with older spoken corpora.

CASS at Lancaster University and Cambridge University Press are very excited to launch the Spoken BNC2014 project, and we look forward to sharing the corpus as widely as possible once it is complete.

To contribute to the Spoken BNC2014 project as a participant please email corpus@cambridge.org for more information.