New CASS Briefing now available — What words are most useful for learners of English?

What words are most useful for learners of English? Introducing the New General Service List. Learning vocabulary is a complex process in which the learner needs to acquire both the form and a variety of meanings of a given vocabulary item. General vocabulary lists can assist in the process of learning words by providing common vocabulary items. In response to problems identified in the currently available General Service List, the authors decided to investigate the core English vocabulary with very large language corpora using current corpus linguistics technology.

New resources are being added regularly to the new CASS: Briefings tab above, so check back soon.

“It’s all sex and celebrity now”: Page three corpus linguistics

On Monday (16th October), readers of the Daily Mail could find on page three a short article about changes in the English lexicon, titled “Forget supper and soup… it’s all sex and celebrity now”. (A longer version of the article is available online.) The article quoted some data from the New General Service List (new-GSL) and compared them with Bauman and Culligan’s version of West’s GSL. Bauman and Culligan offer a list of words from West’s GSL (1953) combined with word frequency rankings based on the Brown Corpus (1961).

It comes as no surprise that the word ranks in Bauman and Culligan’s version of the GSL differ from the ranks in the new-GSL. The differences may be due not only to the passage of time, but also to the composition of the source corpora. The new-GSL is a wordlist based on four different language corpora (three British English corpora and one corpus representing the language of the internet) with a total size of over 12 billion words; Bauman and Culligan’s counts, on the other hand, rely on a single one-million-word corpus of American English compiled in the 1960s. The comparisons in the Daily Mail therefore need to be interpreted with caution. In particular, the following points should be considered:

  • Language changes, and there is no doubt that over time some words become more popular and other words fall out of fashion. The new-GSL lists 378 lexical innovations, including words such as Internet, website, online, email, network, client, mobile, file and web.
  • On the other hand, the research shows that there is a large stable lexical core (2,116 items in the new-GSL) including frequent nouns, verbs and adjectives such as time, year, people, way, say, make, take, go, good, new, great and same.
  • In order to interpret the social significance of lexical changes, we need to look at the contexts in which different words appear. A good example of this is the word “sex” quoted in the headline of the Daily Mail article.

Let’s talk about “sex”, shall we?

The word sex is polysemous and can mean either physical activity or biological dimorphism (male or female). I suppose the phrase “it’s all sex now” in the headline of the Daily Mail article alludes to the former meaning of the word sex, because the fact that we talk about males and females (the latter meaning of the word) does not sell newspapers. Let’s have a look at some corpus evidence.

                               Brown (1961)       EnTenTen12 (2012)
                               American writing   Internet language
form “sex” per million words   82.7               86.6
sex as activity                75%                90%*
sex as dimorphism              25%                10%*

*based on a random sample of 250 lines

A quick comparison of the evidence in the Brown Corpus (which the Daily Mail uses as the point of departure) and the EnTenTen12 internet corpus (one of the sources of the new-GSL) shows that the frequencies per million words do not differ very much. There is a difference, however, in the proportions of the two meanings (sex as activity and sex as dimorphism), which can be explained by the difference in the genres sampled in the two corpora. In contrast to the Brown Corpus, EnTenTen12 also includes pornography (as you would expect from an internet-based corpus). This is reflected in some of the prominent collocates of the word “sex” in EnTenTen12, such as oral, anal, hardcore, gay, lesbian and toy. However, the fact that “it’s all sex now” (as the Daily Mail puts it) has an even simpler and more prosaic explanation: when compiling the original wordlist, Michael West very likely decided to exclude the term “sex” as something that does not need to be mentioned in the classroom context.
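To make the comparison a little more concrete, the overall rate per million words can be split by the proportion of each meaning. The short Python sketch below does only that arithmetic, using the figures from the table above; the names `TABLE` and `sense_rates` are made up for this illustration. Bear in mind that the EnTenTen12 proportions come from a random sample of 250 lines, so the resulting per-sense rates are rough estimates rather than exact counts.

```python
# Figures copied from the table above. The sense proportions are estimates
# (the EnTenTen12 ones are based on a random sample of 250 lines).
TABLE = {
    "Brown (1961)": {"per_million": 82.7, "activity": 0.75, "dimorphism": 0.25},
    "EnTenTen12 (2012)": {"per_million": 86.6, "activity": 0.90, "dimorphism": 0.10},
}


def sense_rates(row):
    """Split an overall per-million rate into estimated per-sense rates."""
    return {
        sense: round(row["per_million"] * row[sense], 1)
        for sense in ("activity", "dimorphism")
    }


rates = {corpus: sense_rates(row) for corpus, row in TABLE.items()}

for corpus, by_sense in rates.items():
    print(corpus, by_sense)
```

On these figures, the estimated rate of “sex as activity” rises from about 62 to about 78 occurrences per million words, while “sex as dimorphism” falls from about 21 to about 9, even though the overall rate barely changes.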

The New General Service List (new-GSL) is out

The new-GSL is an English vocabulary baseline intended for both researchers and practitioners. It is based on a robust comparison of four corpora of general English with a total size of over 12 billion words. It contains 2,494 vocabulary items, 2,116 of which belong to a stable lexical core; the remaining 378 words in the wordlist represent lexical innovations. All of these words appear with high frequencies across a large number of different contexts.

The article, which describes the methodology of the wordlist compilation, and the full new-GSL are both available open access from the Applied Linguistics website.

Further research

At the moment, we are working on an American supplement to the new-GSL. Our findings show that there is a surprisingly large overlap between frequent lexical items in British and American corpora. With some modifications, the new-GSL can therefore also be used successfully in American English contexts.

A larger question that the new-GSL raises, however, is this: how do we reconcile our intuitions about important vocabulary items with the corpus-based findings? In this respect, the new-GSL is not a prescriptive but a descriptive wordlist. As we stress in the article, “[w]ith respect to the diversity of ESL/EFL contexts, it is deemed more useful to envision the use of our wordlist as a vocabulary base with the possibility of further additions, rather than a wordlist that strives to cater to a mixed cluster of heterogeneous expectations and needs” (p. 19).

Read more:

Brezina, V. and Gablasova, D. (2013) Is There a Core General Vocabulary? Introducing the New General Service List. Applied Linguistics.

Vocabulary wordlists designed for learners: Development of the new-GSL

Imagine you have just started learning a new foreign language. Which words do you need to learn first? We all might have some intuitions about this. If the language is English, then time – the most frequent noun in both speech and writing – will probably be more useful than, say, the adjective temporaneous (yes, the OED records this word). However, intuitions (as corpus linguists know) are not to be trusted (at least not all the time). Only through analysis of large amounts of textual data (yes, language corpora!) will we be able to identify words that occur frequently across a number of different contexts.

The research Dana and I are going to talk about on Thursday looks at the methodology of creating a pedagogical wordlist, the new-GSL (the old one is now really out of date), which can assist both learners and teachers in the acquisition of basic English vocabulary. We’ll be looking at the ways in which both large corpora (BNC, EnTenTen12) and small corpora (LOB, BE06) can be used in the creation of such a wordlist.
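For readers who like to see the general idea in code, here is a toy Python sketch of comparing frequency lists across several corpora. It is an illustration only and not the procedure used for the new-GSL, which is described in the article: the four miniature "corpora", the cut-off `N` and the helper `top_words` are all invented for the example, and the assumption that words shared by every list form a core while words shared only by the newer lists count as innovations is a simplification made here.

```python
from collections import Counter


def top_words(text, n):
    """Return the n most frequent word forms (ties broken alphabetically)."""
    counts = Counter(text.lower().split())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:n]}


# Toy "corpora" invented for illustration; real corpora hold millions of words.
corpora = {
    "older_a": "time people say make time people good letter time say",
    "older_b": "people time make good say time letter people time make",
    "newer_a": "time people say website make time email people time website",
    "newer_b": "email time people website say time website email people time",
}

N = 4  # hypothetical cut-off: keep the 4 most frequent words per corpus
tops = {name: top_words(text, N) for name, text in corpora.items()}

# Words that make the cut in every corpus: a stand-in for a stable core.
shared_by_all = set.intersection(*tops.values())

# Words frequent in both newer corpora but not in all four: a stand-in
# for lexical innovations.
shared_by_newer = (tops["newer_a"] & tops["newer_b"]) - shared_by_all

print(sorted(shared_by_all))    # ['people', 'time']
print(sorted(shared_by_newer))  # ['email', 'website']
```

A real wordlist would of course work with lemmas rather than raw word forms and would take into account how evenly a word is spread across texts, not just how often it occurs.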