The Spoken BNC2014 project features in the Daily Mail

The recently announced collaboration between Cambridge University Press and CASS, the Spoken BNC2014 project, has made headlines in the Daily Mail.

The article, entitled “No longer marvellous – now we’re all awesome: Britons are using more American words because traditional English is in decline”, describes the preliminary findings of the project, which is in its early stages.

To participate in the project, native British English speakers from all over the UK can record their conversations and send them to us as MP3 files. For each hour of good-quality recordings we receive, along with all associated consent forms and information sheets completed correctly, we will pay £18. A recording does not have to be one hour long; participants may submit two 30-minute recordings, or three 20-minute recordings, and they will receive £18 for each hour in total.

To register your interest in participating, please email

Using Corpora to Analyze Gender

I wrote UCAG during a sabbatical as a semi-sequel to a book I published in 2006 called Using Corpora for Discourse Analysis. Part of the reason for the second book was to update and expand some of my thinking around discourse- or social-related corpus linguistics. As time has passed, I haven’t become disenamoured of corpus methods, but I have become more reflective and critical of them and I wanted to use the book to highlight what they can and can’t do, and how researchers need to be guarded against using tools which might send them down a particular analytical path with a set of pre-ordained answers. Part of this has involved reflecting on how interpretations and explanations of corpus findings often need to come from outside the texts themselves (one of the tenets of critical discourse analysis), and subsequently whether a corpus approach requires analysts to go further and critically evaluate their findings in terms of “who benefits”.

Another way in which my thinking around corpus linguistics has developed since 2006 is in considering the advantages of methodological triangulation (or approaching a research project in multiple ways). In one analysis chapter I take three small corpora of adverts from Craigslist and try out three methods of attempting to uncover something interesting about gender from them – one very broad involving an automated tagging of every word, one semi-automatic relying on a focus on a smaller set of words, and another much more qualitative, relying on looking at concordance lines only. In another chapter I look at “difficult” search terms – comparing two methods of finding all the cases where a lecturer indicates that a student has given an incorrect answer in a corpus of academic-related speech. Would it be better to just read the whole corpus from start to finish, or is it possible to devise search terms so concordancing would elicit pretty much the same set?

The book also gave me a chance to revisit older data, particularly a set of newspaper articles about gay people from the Daily Mail which I had first looked at in Public Discourses of Gay Men (2005). As a replication experiment I revisited that data and redid an analysis I had first carried out about 10 years ago. While the idea of an objective researcher is fictional, corpus methods have aimed to redress the issue of researcher bias to an extent – although in retreading my steps, I did not obtain exactly the same results. Fortunately, the overall outcome was the same, but there were a few important points that the 10 years younger version of me missed. Does that matter? I suspect it doesn’t invalidate the analysis although it is a useful reminder about how our own analytical abilities alter over time.

Part of the reason for writing the book was to address other researchers who are either from corpus linguistics and want to look at gender, or who do research in gender and want to use corpus methods. I sometimes feel that these two groups of people do not talk to each other very much and as a result the corpus research in this area is often based around the “gender differences” paradigm where the focus is on how men and women apparently differ from each other in language use (with attendant metaphors about Mars and Venus). Chapters 2 and, to an extent, 3 address this by trying a number of experiments to see just how much lexical variation there is in sets of spoken corpora of male and female language – and when difference is found, how can it be explained? I also warn against lumping all men together into a box to compare them with all women who are put in a second box. The variation within the boxes can actually be the more interesting story to tell and this is where corpus tools around dispersion can really come into their own. So even if, for example, men do swear more than women, it’s not all men and not all the time. On the other hand, some differences which are more consistent and widespread can be incredibly revealing, although not in ways you might think – chapter 2 took me down an analytical path that ended up at the word Christmas – not perhaps an especially interesting word relating to gender, but it produced a lovely punchline to the chapter.

It was also good to introduce different corpora, tools and techniques that weren’t available in 2006. Mark Davies has an amazing set of online corpora, mostly based around American English, and I took the opportunity to use the COHA (Corpus of Historical American English) to track changes in language which reflects male bias over time, from the start of the 19th century to the present day. Another chapter utilises Adam Kilgarriff’s online tool Sketch Engine, which allows collocates to be calculated in terms of their grammatical relationships to one another. This made it possible to compare the terms boy and girl by considering the verbs that position either as subject or object. So girls are more likely to be impressed while boys are more likely to be outperformed. On the other hand, boys cry whereas girls scream.

It would be great if the book inspired other researchers to consider the potential of using corpora in discourse/social related subjects as well as showing how this potential has expanded in recent years. It’s been fun to explore a relatively unexplored field (or rather travel a route between two connecting fields) but it occasionally gets lonely. I hope to encounter a few more people heading in the same direction as me in the coming years.

“It’s all sex and celebrity now”: Page three corpus linguistics

On Monday (16th October), readers of the Daily Mail could come across a short article on page three about changes in the English lexicon, with the title “Forget supper and soup… it’s all sex and celebrity now”. (A longer version of the article is available online.) The article quoted some data from the New General Service List (new-GSL) and compared them with Bauman and Culligan’s version of West’s GSL. Bauman and Culligan offer a list of words from West’s GSL (1953) combined with word frequency rankings based on the Brown Corpus (1961).

It comes as no surprise that the word ranks in Bauman and Culligan’s version of the GSL differ from the ranks in the new-GSL. This may be due not only to the passage of time, but also to the composition of the source corpora. The new-GSL is a wordlist based on four different language corpora (three British English corpora and one corpus representing the language of the internet) with a total size of over 12 billion words; Bauman and Culligan’s counts, on the other hand, rely on a single one-million-word corpus of American English compiled in the 1960s. The comparisons in the Daily Mail therefore need to be interpreted with caution. In particular, the following points should be considered:

  • Language changes, and there is no doubt that over time some words become more popular and other words fall out of fashion. The new-GSL lists 378 lexical innovations, including words such as Internet, website, online, email, network, client, mobile, file and web.
  • On the other hand, the research shows that there is a large stable lexical core (2,116 items in the new-GSL) including frequent nouns, verbs and adjectives such as time, year, people, way, say, make, take, go, good, new, great and same.
  • In order to interpret the social significance of lexical changes, we need to look at the contexts in which different words appear. A good example of this is the word “sex” quoted in the headline of the Daily Mail article.

Let’s talk about “sex”, shall we?

The word sex is polysemous and can mean either physical activity or biological dimorphism (male or female). I suppose the phrase “it’s all sex now” in the headline of the Daily Mail article alludes to the former meaning of the word sex, because the fact that we talk about males and females (the latter meaning of the word) does not sell newspapers. Let’s have a look at some corpus evidence.

                                Brown (1961)        EnTenTen12 (2012)
                                American writing    Internet language
form “sex” per million words    82.7                86.6
sex as activity                 75%                 90%*
sex as dimorphism               25%                 10%*

*based on a random sample of 250 lines

A quick comparison of the evidence in the Brown Corpus (which the Daily Mail uses as the point of departure) and the EnTenTen12 internet corpus (one of the sources of the new-GSL) shows that the frequencies per million do not differ very much. There is a difference, however, in the proportions of the two meanings (sex as activity and sex as dimorphism), which can be explained by the difference in the genres sampled in the two corpora. In contrast to the Brown Corpus, EnTenTen12 also includes pornography (as you would expect from an internet-based corpus). This is also reflected in some of the prominent collocates of the word “sex” in EnTenTen12, such as oral, anal, hardcore, gay, lesbian and toy. However, the fact that “it’s all sex now” (as the Daily Mail puts it) has an even simpler and more prosaic explanation: when compiling the original wordlist, Michael West very likely decided to exclude the term “sex” as something that does not need to be mentioned in the classroom context.
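
For readers who want to check this kind of comparison themselves, here is a minimal Python sketch of the two calculations involved: normalising a raw count to a frequency per million words, and estimating how much a proportion based on a 250-line sample might vary. It is not part of the original analysis. The function names and the raw counts are hypothetical and chosen only to illustrate the arithmetic, and the margin of error assumes a simple normal approximation for a random sample.

```python
import math


def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalised frequency: occurrences per one million tokens."""
    return raw_count * 1_000_000 / corpus_size


def proportion_margin(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of an approximate 95% interval for a sample proportion.

    Uses the normal approximation; assumes the n lines are a simple
    random sample of all hits in the corpus.
    """
    return z * math.sqrt(p * (1 - p) / n)


# Hypothetical raw counts (NOT taken from Brown or EnTenTen12),
# used only to show how normalisation makes corpora comparable.
hypothetical_hits = 500
hypothetical_tokens = 5_000_000
print(per_million(hypothetical_hits, hypothetical_tokens))

# The 90% share of "sex as activity" rests on 250 sampled lines.
margin = proportion_margin(0.90, 250)
print(f"90% +/- {margin * 100:.1f} percentage points")
```

Under these assumptions the sampled proportion could plausibly be a few percentage points higher or lower, which is small next to the 15-point gap between the two corpora.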