Trinity Lancaster Spoken Learner Corpus: A milestone to celebrate

On Monday 19 May we came together to celebrate the completion of the first part of the Trinity Lancaster Spoken Learner Corpus project. The transcription of our 2012 dataset is now complete and the corpus comprises 1.5 million running words. The Trinity Lancaster Spoken Learner Corpus represents a balanced sample of learner speech from six different countries (Italy, Spain, Mexico, India, China and Sri Lanka) covering the B1.2 – C2 levels of the Common European Framework (CEFR). Below are some pictures from our small celebration.


[Photos: trinity1, trinity2]

We are continuing to develop the corpus by adding more data from our 2014 dataset, so there is still a lot of work to be done. However, we are really excited about the possibilities for applied linguistic and language testing research based on this unique dataset.

You can read more about the Trinity Lancaster Spoken Learner Corpus in the AEA-Europe newsletter report.

Is this the way to do Corpus Linguistics? Feedback system for the Corpus Linguistics MOOC

Corpus linguistics (CL) is a set of incredibly versatile methods of language analysis applicable to a number of different contexts. So, for example, if you are interested in language, culture, history or society, corpus linguistics has something to offer. Today, thanks to the amazing developments in computer technology, corpus linguistic tools are literally only a mouse click away, or a touch away if you are using a tablet or a smartphone. Are you ready, then, to get your hands dirty with computational analysis of large amounts of language? If the answer is yes, you have probably already registered for the new massive open online course (MOOC) on Corpus Linguistics, created and run by Tony McEnery and other members of the CASS team. (If you haven’t managed to register yet, you can still do so at the FutureLearn website. The course kicks off on 27th January 2014.)

An essential part of the Corpus Linguistics MOOC is its unique feedback system. You will be given a question, a data set and a software tool, and you will be asked to apply what you have learnt in the MOOC lectures to real language analysis. You will explore a topic using corpus techniques which will enable you to uncover interesting patterns in language data. We have a range of topics in store for you. These include English grammar, British and American language and culture, historical discourse of 17th century news books and learner language. But don’t worry, we won’t ask you to write an essay on the topic. Instead, we will give you a number of analyses and descriptions of the corpus data and you will decide which ones use the corpus techniques correctly. After you’ve made your decisions we will provide detailed comments on each of the options. In this way, the CASS Corpus Linguistics MOOC system aims to promote independent learning so that next time you can apply the corpus tools with confidence to answer your own questions.

Centre Vacancy: Senior Research Associate in Corpus Linguistics

Post A793

Linguistics & English Language
Salary:   £31,331 to £36,298
Closing Date:   Sunday 20 October 2013
Interview Date:   To be confirmed
Reference:  A793

The Centre for Corpus Approaches to Social Science, funded by the ESRC, is seeking to appoint to a 4-year research contract in the area of corpus linguistics and language teaching/assessment.

You will pursue research on developing new approaches to the use of corpus linguistics within language teaching and assessment in conjunction with Trinity College London. A knowledge of corpus linguistics is essential, as well as some familiarity with language teaching and/or assessment.

You will join an interdisciplinary team of internationally renowned researchers within CASS. You will be offered excellent career progression opportunities through the ESRC Centre.

This is a fixed term contract for 4 years.

Informal enquiries may be made to Dr. Vaclav Brezina:

Apply through the Lancaster University website. 

“It’s all sex and celebrity now”: Page three corpus linguistics

On Monday (16th October), readers of the Daily Mail could come across a short article on page three about changes in the English lexicon, with the title “Forget supper and soup… it’s all sex and celebrity now”. (A longer version of the article is available online.) The article quoted some data from the New General Service List (new-GSL) and compared these with Bauman and Culligan’s version of West’s GSL. Bauman and Culligan offer a list of words from West’s GSL (1953) combined with word frequency rankings based on the Brown Corpus (1961).

It comes as no surprise that the word ranks in Bauman and Culligan’s version of the GSL differ from the ranks in the new-GSL. This may be due not only to the time factor, but also to the composition of the source corpora. The new-GSL is a wordlist based on four different language corpora (three British English corpora and one corpus representing the language of the internet) with a total size of over 12 billion words; Bauman and Culligan’s counts, on the other hand, rely on a single one-million-word corpus of American English compiled in the 1960s. The comparisons in the Daily Mail therefore need to be interpreted with caution. In particular, the following points should be considered:

  • Language changes, and there is no doubt that over time some words become more popular and other words fall out of fashion. The new-GSL lists 378 lexical innovations including words such as Internet, website, online, email, network, client, mobile, file and web.
  • On the other hand, the research shows that there is a large stable lexical core (2,116 items in the new-GSL) including frequent nouns, verbs and adjectives such as time, year, people, way, say, make, take, go, good, new, great and same.
  • In order to interpret the social significance of lexical changes, we need to look at the contexts in which different words appear. A good example of this is the word “sex” quoted in the headline of the Daily Mail article.

Let’s talk about “sex”, shall we?

The word sex is polysemous and can mean either physical activity or biological dimorphism (male or female). I suppose the phrase “it’s all sex now” in the headline of the Daily Mail article alludes to the former meaning of the word sex, because the fact that we talk about males and females (the latter meaning of the word) does not sell newspapers. Let’s have a look at some corpus evidence.

                                Brown (1961)        EnTenTen12 (2012)
                                American writing    Internet language
form “sex” per million words    82.7                86.6
sex as activity                 75%                 90%*
sex as dimorphism               25%                 10%*

*based on a random sample of 250 lines

A quick comparison of the evidence in the Brown Corpus (which the Daily Mail uses as the point of departure) and the EnTenTen12 internet corpus (one of the sources of the new-GSL) shows that the frequencies per million do not differ very much. There is a difference, however, in the proportions of the two meanings (sex as activity and sex as dimorphism), which can be explained by the difference in the genres sampled in the two corpora. In contrast to the Brown Corpus, EnTenTen12 also includes pornography (as you would expect from an internet-based corpus). This is also reflected in some of the prominent collocates of the word “sex” in EnTenTen12, such as oral, anal, hardcore, gay, lesbian and toy. However, the fact that “it’s all sex now” (as the Daily Mail puts it) has an even simpler and more prosaic explanation: when compiling the original wordlist, Michael West very likely decided to exclude the term “sex” as something that does not need to be mentioned in the classroom context.
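The two kinds of figures in the table above, a frequency per million words and a proportion estimated from a sample of concordance lines, can be sketched as follows. This is only an illustrative sketch: the function names are our own, the raw count and corpus size in the first call are hypothetical, and the Wilson score interval is one common way of expressing the uncertainty of a sample-based proportion rather than something reported in the post. The only figure taken from the table is 90% of 250 lines, i.e. 225 lines.

```python
import math


def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalised frequency: occurrences per million running words."""
    return raw_count * 1_000_000 / corpus_size


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval (95% by default) for a sample proportion."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half, centre + half


# Hypothetical raw count and corpus size, for illustration only.
print(per_million(50, 2_000_000))

# 90% of a random sample of 250 lines = 225 lines coded as "activity".
low, high = wilson_interval(225, 250)
print(f"{low:.3f} to {high:.3f}")
```

An interval of this kind is a reminder that a percentage based on 250 sampled lines is an estimate of the proportion in the whole corpus, not an exact count.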

The New General Service List (new-GSL) is out

The new-GSL is an English vocabulary baseline intended for both researchers and practitioners. It is based on a robust comparison of four corpora of general English with a total size of over 12 billion words. It contains 2,494 vocabulary items, 2,116 of which belong to a stable lexical core; 378 words in the wordlist represent lexical innovations. All of these words appear with high frequencies across a large number of different contexts.

The article, which describes the methodology of the wordlist compilation, and the full new-GSL are available in open access from the Applied Linguistics website.
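As a rough illustration of what a "stable core" and "lexical innovations" can mean in terms of wordlist comparison, here is a minimal sketch using set operations. It is a simplification under our own assumptions, not the procedure described in the article: the four toy wordlists, their names and the two helper functions are hypothetical, and the real lists are based on corpus frequencies rather than hand-picked words. The last line simply checks that the two components quoted above add up to the full list.

```python
def stable_core(wordlists):
    """Items that appear in every one of the wordlists."""
    return set.intersection(*(set(w) for w in wordlists))


def recent_only(older, newer):
    """Items shared by all newer lists but absent from every older list."""
    shared_new = set.intersection(*(set(w) for w in newer))
    seen_old = set.union(*(set(w) for w in older))
    return shared_new - seen_old


# Toy wordlists, hypothetical and for illustration only.
older = [
    {"time", "year", "people", "supper"},
    {"time", "year", "people", "soup"},
]
newer = [
    {"time", "year", "people", "internet", "website"},
    {"time", "year", "people", "internet", "email"},
]

print(sorted(stable_core(older + newer)))
print(sorted(recent_only(older, newer)))

# The figures quoted for the new-GSL are consistent: core + innovations = total.
print(2116 + 378 == 2494)
```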

Further research

At the moment, we are working on an American supplement to the new-GSL. Our findings show that there is a surprisingly large overlap between frequent lexical items in British and American corpora. With some modifications, the new-GSL can therefore also be used successfully in American English contexts.

A larger question that the new-GSL raises, however, is this: how do we reconcile our intuitions about important vocabulary items with the corpus-based findings? In this respect, the new-GSL is not a prescriptive but a descriptive wordlist. As we stress in the article, “[w]ith respect to the diversity of ESL/EFL contexts, it is deemed more useful to envision the use of our wordlist as a vocabulary base with the possibility of further additions, rather than a wordlist that strives to cater to a mixed cluster of heterogeneous expectations and needs” (p. 19).

Read more:

Brezina, V. and Gablasova, D. (2013). Is There a Core General Vocabulary? Introducing the New General Service List. Applied Linguistics.

ESRC Summer School in Corpus Approaches to Social Science 2013: feedback

In the week of 16th – 19th July 2013, CASS organised the first Summer School for PhD students and post-doctoral researchers in social science disciplines with an interest in the methods of corpus linguistics. Twenty participants from 15 different Higher Education institutions from the UK and overseas (Israel, Brazil, Poland, Czech Republic, Italy) attended the event. The following is a summary of the feedback we received from the participants at the end of the event (this summary is based on 16 returned surveys).

  • All participants agreed that the quality of the Summer School sessions was high.
  • All participants agreed that the Summer School had a friendly atmosphere.
  • All but one participant said that they felt confident about applying corpus methods in their own work after having attended the Summer School.

In particular, the participants appreciated the practical, hands-on approach (including lab sessions), engaging lectures, and the fact that the Summer School was free of charge.

Did you miss this year’s summer school? Check back regularly for information on dates for next year, as well as information on how to apply.

Vocabulary wordlists designed for learners: Development of the new-GSL

Imagine you have just started learning a new foreign language. Which words do you need to learn first? We all might have some intuitions about this. If the language is English, then time – the most frequent noun both in speech and writing – will probably be more useful than, say, the adjective temporaneous (yes, the OED records this word). However, intuitions (as corpus linguists know) are not to be trusted (at least not all the time). Only through analysis of large amounts of textual data (yes, language corpora!) will we be able to identify words that occur frequently across a number of different contexts.
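One simple way to see what "occurring across a number of different contexts" means is to count, for each word, how many texts it appears in (often called its range). The sketch below is only an illustration under our own assumptions: the four one-line "texts" are invented, the function name is hypothetical, and splitting on whitespace stands in for proper tokenisation. It is not the method used to build the new-GSL.

```python
from collections import Counter


def text_range(texts):
    """For each word, count how many texts it occurs in at least once."""
    counts = Counter()
    for text in texts:
        # A set ensures each word is counted once per text.
        counts.update(set(text.lower().split()))
    return counts


# Invented mini-texts, for illustration only.
texts = [
    "time flies",
    "no time left",
    "a temporaneous remark",
    "take your time",
]

ranges = text_range(texts)
print(ranges["time"], ranges["temporaneous"])
```

A word with a high overall frequency but a low range is concentrated in a few texts, which is one reason frequency alone can be misleading when choosing vocabulary for learners.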

The research Dana and I are going to talk about on Thursday looks at the methodology of creating a pedagogical wordlist – the new-GSL (the old one is now really out of date) – which can assist both learners and teachers in the process of acquiring basic English vocabulary. We’ll be looking at the ways in which both large (BNC, EnTenTen12) and small corpora (LOB, BE06) can be used in the creation of such a wordlist.