The Spoken British National Corpus 2014 – project update

It has been a little over a year since CASS and Cambridge University Press announced a collaboration to compile a successor to the spoken component of the British National Corpus, the Spoken BNC2014. This will be the largest corpus of spoken British English since the original, with the advantage of being collected in the 2010s rather than the 1990s, providing an updated snapshot of spoken language in the UK. By including a set of recordings already gathered by Cambridge University Press before our collaboration began, we plan for the corpus to contain data from 2012 to 2016. As well as being the year in which the project was announced, 2014 will be the median year of the planned data range, and so we chose it to feature in the working title of the project: the Spoken BNC2014.

Since our announcement, we have been hard at work: advertising the project nationally, collecting recordings from speakers from all over the UK, transcribing the data, conducting methodological investigations, and presenting our work so far at corpus linguistics conferences. At ICAME 36 in May we described the development of the Spoken BNC2014 transcription scheme, and at Corpus Linguistics 2015 in July we gave an overview of the data collection methodology as well as presenting new research on speaker identification in transcription. All of this activity continues as we work towards making the corpus freely and publicly available in the year 2017.

So far, we have gathered nearly 700 recordings, an estimated total of approximately six million words of informal conversational data. The majority of recordings feature two or three speakers, with about a quarter containing four or more. The balance of speaker gender is fairly even, and we have been able to gather data from a wide range of ages – though at the moment the 19-29 year olds have a clear lead! We have done very well in England to gather recordings from a great range of self-reported dialects, and we plan now to focus more heavily on gathering recordings from Wales, Scotland, and Northern Ireland. The word cloud of self-reported conversation topics gives a first look at the range of things that users can expect to find being discussed in the corpus.

We are very pleased with the progress of the project so far, and we look forward to releasing the corpus texts publicly once they are complete. In the meantime, as announced at CL2015, we will be offering the opportunity to apply for pre-release data grants later this year. More information about the data grants will be announced in the near future.

Swimming in the deep end of the Spoken BNC2014 media frenzy

As someone who enjoys acting in his spare time, I’m rarely afraid of the chance to spend some time in the spotlight. But as I sat one morning a few weeks ago in my bedroom, in nothing but a dressing gown, about to do a live interview on a national Irish radio station, with no kind of media training or experience under my belt, I really did get a case of the nerves. I would spend the entire day appearing on over a dozen radio and TV broadcasts (thankfully with time to get dressed after the first), promoting participation in the Spoken BNC2014 project, and finding out the true meaning of the phrase ‘learning on the job’. My experiences taught me a few things about the relationship between the broadcast media and academic research, which I’ve summarised at the end of this blog.

In late July, CASS and Cambridge University Press announced a new collaboration which aims to compile a new spoken British National Corpus, known as the Spoken BNC2014. This is an ambitious project that requires contributions of recordings from hundreds, if not thousands, of speakers from across the entire United Kingdom. As a research team (which includes Lancaster’s Professor Tony McEnery, Cambridge’s Dr Claire Dembry, as well as Dr Vaclav Brezina, Dr Andrew Hardie, and me), we knew that we had to spread the word far and wide in order to drum up the participation of speakers across the country.

So, at the end of August, we put out a press release which teased some preliminary observations, and invited people to get involved by emailing corpus@cambridge.org. These findings were based on some basic comparisons between the relative frequencies of the words in the demographic section of the original spoken BNC, and those of the first two million words collected for the Spoken BNC2014 project. We put out lists of the top ten words which had fallen and risen in relative frequency the most drastically between the 1990s data and today’s data.

Words which had declined    Words which had risen
fortnight                   facebook
marvellous                  internet
fetch                       website
walkman                     awesome
poll                        email
catalogue                   google
pussy cat                   smartphone
marmalade                   iphone
drawers                     essentially
cheerio                     treadmill
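
For readers curious about what a comparison of relative frequencies can look like in practice, here is a minimal sketch in Python. It is not the procedure the project team used: the function names, the per-million scaling, the add-half smoothing, and the two toy "corpora" are all assumptions made purely for illustration. The idea is simply to divide each word's count by the size of its corpus, so that samples of different sizes can be compared, and then rank words by how much that share has changed.

```python
from collections import Counter


def relative_frequencies(tokens, per=1_000_000):
    """Return each word's frequency scaled to `per` tokens (default: per million)."""
    counts = Counter(tokens)
    total = len(tokens)
    return {word: count * per / total for word, count in counts.items()}


def rank_changes(old_tokens, new_tokens, smoothing=0.5):
    """Rank words by the ratio of new to old relative frequency.

    `smoothing` is an illustrative assumption: a small constant added to every
    count so that words absent from one corpus do not cause division by zero.
    Returns (word, ratio) pairs sorted from biggest faller to biggest riser.
    """
    old_counts, new_counts = Counter(old_tokens), Counter(new_tokens)
    old_total, new_total = len(old_tokens), len(new_tokens)
    ratios = {}
    for word in set(old_counts) | set(new_counts):
        old_rel = (old_counts[word] + smoothing) / old_total
        new_rel = (new_counts[word] + smoothing) / new_total
        ratios[word] = new_rel / old_rel
    return sorted(ratios.items(), key=lambda pair: pair[1])


# Toy data standing in for the two corpora (hypothetical, not project data).
old_sample = "we had a marvellous fortnight and a marvellous walk".split()
new_sample = "we had an awesome week on facebook and an awesome walk".split()

ranked = rank_changes(old_sample, new_sample)
print("Biggest fallers:", ranked[:3])
print("Biggest risers:", ranked[-3:])
```

A ratio below 1 means a word takes up a smaller share of the newer sample than of the older one, and a ratio above 1 means a larger share. A real comparison over millions of words would also need a significance measure and a minimum frequency threshold, which this sketch leaves out.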

It seems that these words really captured the imagination of the media powers that be. In the week of the release at the end of August, I was told on the Monday afternoon that the release had been sent out. By late that night, the story had already been picked up by the Daily Mail. Such was my joy, and perhaps naivety, that I sent out a brief and fairly humble blog post celebrating the fact that one person from one newspaper had run an article on our story. What I didn’t realise at the time was that, had I put out a blog post every time we discovered a piece of coverage the next day, I would still be writing them now.

The next morning I was woken by a message from Lancaster Linguistics and English Language department’s resident media celebrity, Dr Claire Hardaker, asking urgently for some information about the Spoken BNC2014 project. She had been contacted by LBC Radio, who had caught wind of the story and assumed sort-of-understandably that, since it was a linguistics story that involved Lancaster University, Claire would be directly involved. She isn’t, sadly, but they had lined up a live interview with her in twenty minutes’ time regardless, and she had kindly agreed to do it anyway with what information I could get to her in time.

After that, I soon realised that perhaps this story would garner more interest than a few newspaper articles. My phone went into melt-down, bleeping with emails from the PR team at the university and phone calls from unknown numbers. There was a 90 minute period where I couldn’t leave my room to get a shower, get dressed, and get on to the campus, simply because I was being lined up for so many interviews throughout the day. As such, I had to do my first there and then, in my dressing gown, while Claire Hardaker kindly waited on stand-by in the university press office in case I couldn’t make it to campus on time for my next.

Once I got there, it was a busy day of interviews right through to 6pm that evening. Over the course of the day, I was interviewed by international radio stations BBC World Service and Talk Radio Europe, UK national stations BBC Radio 4, Sky Radio, and Classic FM, Irish national station Today FM, and Russian national station Voice of Russia UK. I was also interviewed by UK regional BBC news stations London, Merseyside, Coventry & Warwick, Lancashire, and Three Counties. The highlight for me though was the TV interview with the Sky News channel, which I recorded using the Skype app on my little Windows tablet. The interviewer could see me, but I couldn’t see her (or indeed hear her all that well), and I had no idea that she was set up in the studio and that the video would be edited together and released that day. Aside from being shown on the Sky News television channel itself, and their website, the interview appeared on upwards of 40 regional radio websites, including Rock FM, Magic FM, The Bee, North Sound, Yorkshire Coast Radio, Wave 965, and Juice Brighton, as well as other media sites. Claire Dembry also got involved from Cambridge, doing further TV interviews with Sky News and even joining me for a live double interview with BBC Radio London.

So, what did I ‘learn on the job’ through my baptism of fire in the media world? Three main points:

  • Some interviewers thought I was announcing the death of the English language

Though most of the interviews went about as smoothly as I could have expected, with me remembering to plug the email address corpus@cambridge.org at any given opportunity, some were much harder work. Some interviewers seemed horrified at the thought of ‘losing’ words such as marvellous and cheerio, and wanted me to tell them what they could do to help rescue them. Though it was tempting to say “well if you keep saying them they won’t disappear…”, I instead politely made the point that language, like everything else to do with being human, changes over time, and that this is perfectly okay. Just like fashion. This ‘endangered species’ discourse came up in a few interviews, and it seemed that the interviewers felt I was suggesting that the English language was somehow shrinking or degrading over time.

  • Some interviewers thought I was actively promoting the changes I was reporting

In other cases, the interviewers seemed to imply that I was making recommendations for the words that speakers should avoid or should start saying more, in order to ‘stay up to date’ and not come across ‘old fashioned’. In other words, I was mistaken for a prescriptivist rather than a descriptivist, who was trying to stop people from using the word catalogue, or encouraging everybody to say the word treadmill at least five times a day.

  • Some interviewers asked ‘nice’ questions, and some didn’t

This is a more general observation, which I suspected to be the case before I started and which was confirmed as the interviews went on. It is a simple truth that the interviewers who ‘got’ the project the most were the ones who, for me, asked the best questions. When being interviewed about the list of words which have decreased in frequency I was, in varying forms and among many others, asked the following two types of question:

A: The words which were more popular in the 1990s but not so much now – tell me about ‘pussy cat’ – what’s going on there?

B: The words which were as popular in the 1990s as Facebook is now – I guess words like ‘marvellous’ and ‘catalogue’ are harder to spell and we’re getting lazier these days so we’re just going to say shorter words aren’t we?

For me, and I imagine many others, question A is the ‘nice’ question of this pair. The interviewer draws me to one example which looks interesting – fair enough – but importantly they make no inference themselves about the possible explanation. They set up a blank canvas and allow me to paint it in the way which is most advantageous to my purpose.

Question B, however, is much more problematic for me as the interviewee, and sadly questions like it occurred as often as, if not more often than, those like question A. Firstly, the interviewer has re-conceptualised the findings and created equivalence between the frequency of the declining words and the words on the rise. Therefore the possibility of conclusions like “marmalade used to be as popular as Facebook” or, worse, “iPhones replace pussy cats in British society” is opened up and thrown into the ether.

Secondly, and much harder to deal with immediately, is the lumping of two completely unrelated words (marvellous and catalogue), the assumption of societal degradation (we’re getting lazier), the pseudo-logical causal relationship between written conventions and spoken interaction (harder to spell), which are based on such assumptions of societal degradation (so we’re just going to say shorter words), and, the icing on the cake, the tag question which invites me to agree that everything the interviewer has just said is perfectly correct (aren’t we?). Yes, this is indeed not a nice question. The strategy I developed is to say that yes, everything you have just said could be the case, and then to go about repackaging their question into something more reasonable for me to say anything about. This was not easy and in some cases I did this better than others!

The recurring theme of my experience was the extent to which the interviewers’ expectations of the Spoken BNC2014 research matched what we are actually trying to do. Most of the time, there was a close match and the questions fit my aims well. In the cases where this didn’t happen, and the questions made all sorts of false assumptions, life was more difficult. I don’t think, however, that anyone was deliberately misconstruing our humble aims, and really I’d rather have given those difficult interviews, where I felt like I was in a fight for mutual understanding, than not to have given them at all for fear of being misunderstood. It seems that this is an inevitable aspect of daring to throw your work out of the bubble of academia and into the public sphere, where it really matters. My goal for next time is to improve the way that the research is communicated in the first place, and to plug potential potholes of misunderstanding in a way that is as accurate as reasonable but still makes a good story.

Overall, I think I managed as well as I could have done, given the abrupt start to the day and my naïve expectation that the press wouldn’t be as interested in the story as it turns out they were. Hopefully we’ll have generated lots of interest in the project. I’d like to thank Claire Hardaker for helping me learn the ropes as I went along, the staff at Lancaster University’s press office for keeping me in the right place at the right time, and the ESRC, who have since offered me some media training, which I will very gladly accept. Awesome!

The Spoken BNC2014 project features in the Daily Mail

The recently announced collaboration between Cambridge University Press and CASS, the Spoken BNC2014 project, has made headlines in the Daily Mail.

The article, entitled “No longer marvellous – now we’re all awesome: Britons are using more American words because traditional English is in decline”, describes the preliminary findings of the project, which is in its early stages.

To participate in the project, native British English speakers from all over the UK can record their conversations and send them to us as MP3 files. For each hour of good quality recordings we receive, along with all associated consent forms and information sheets completed correctly, we will pay £18. Each recording does not have to be 1 hour in length; participants may submit two 30 minute recordings, or three 20 minute recordings, but for each hour in total, they will receive £18.

To register your interest in participating, please email corpus@cambridge.org.