What’s wrong with “a bunch of migrants”? Looking at the linguistic evidence

This week at Prime Minister’s Questions, David Cameron used the term “a bunch of migrants” to describe refugees at a camp in Calais. He was subsequently criticised by Labour MPs and members of the general public on Twitter, and the story was reported on in mainstream newspapers like the Guardian and the Telegraph. Critics described his comments as “dehumanising”, “callous” and “inflammatory”.

Something about David Cameron saying the words “bunch of” to describe a group of people caused a furore – but what was it? Is this how people normally use this phrase, or is this a noteworthy departure from the norm?

Here at CASS we have the unique opportunity to analyse a very large set of everyday conversations between speakers of British English from all over the UK, which participants have been recording in their homes and sending to us to be transcribed. With the transcriptions, we can use computer software to analyse how words and phrases are commonly used across the entire country.

I searched through 4.5 million words of present-day conversation to find out how people in the UK normally use the phrase “bunch of”. I found that “people”, “flowers” and “things” are the most likely words to be described in this way. Beyond this, there are several other words which refer to groups of people:

“kids”, “volunteers”, “retards”, “losers”, “lads”, “individuals”, “friends”, “dickheads”, “dancers”, “Aussies”, “alcoholics”, “thieving sods” and “thieving fuckers”.

Absent from this list is the word “migrants”, which does not occur in this context. The evidence suggests that people do often use “bunch of” to describe groups of people negatively or with distaste. Therefore the upset caused by Cameron’s use of the phrase “a bunch of migrants” is perhaps understandable.
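
For readers curious about what a search like this involves, here is a minimal sketch in Python. It is not the software used for the analysis above; the function name `following_words` and the three sample utterances are invented purely for illustration, and a real corpus tool would also handle tokenisation, multi-word units and statistical measures far more carefully.

```python
import re
from collections import Counter

def following_words(utterances, phrase=("bunch", "of")):
    """Count the word that immediately follows `phrase` in each utterance."""
    counts = Counter()
    n = len(phrase)
    for utt in utterances:
        # Very rough tokenisation: lower-case runs of letters and apostrophes.
        tokens = re.findall(r"[a-z']+", utt.lower())
        for i in range(len(tokens) - n):
            if tuple(tokens[i:i + n]) == phrase:
                counts[tokens[i + n]] += 1
    return counts

# Invented sample utterances (NOT taken from the corpus).
sample = [
    "there was a bunch of flowers on the table",
    "a whole bunch of people turned up",
    "he bought a bunch of flowers",
]

counts = following_words(sample)
print(counts.most_common())
```

Ranking the counts shows which words most often follow the phrase; a word that never appears after it, as with “migrants” in the conversations examined here, is simply absent from the result.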

We are still collecting recordings from speakers all over the UK. For information on how to contribute to this project, which is led by Lancaster University and Cambridge University Press, please visit the Spoken BNC2014 website.

FireAnt Launch Event

We will be running a launch event and workshop for a new software tool that we have created called FireAnt. The event and workshop will be held from 13:00 to 17:00 on Monday 22nd February 2016 here at Lancaster University.

FireAnt was created by Laurence Anthony as part of the 2015 ESRC-funded CASS-affiliated DOOM project on social media analysis. FireAnt is a free and easy-to-use tool designed to help corpus linguists and social scientists analyze Twitter and other social network data without the need for programming or database management skills. The following features of the tool will be explored in this workshop:

  • import different formats of data (e.g. Twitter data in JSON format, Reddit data in CSV format, etc.)
  • search that data and its associated metadata in a variety of ways (e.g., retrieve all tweets containing #blacklivesmatter sent in December 2015)
  • export the results to other formats including a plain text file for “standard” corpus analysis, an Excel/CSV file for statistical analysis, a timeline chart, and a network graph
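
FireAnt is designed so that none of this requires programming, but for anyone wondering what the search step amounts to, the following Python sketch may help. It is not how FireAnt itself works. It assumes one JSON object per line with a `text` field and a `created_at` field in ISO 8601 format; those field names and the timestamp format are assumptions for this example, as are the function name `filter_tweets` and the sample records, and real Twitter exports may differ.

```python
import json
import re
from datetime import datetime

def filter_tweets(lines, hashtag, year, month):
    """Yield the text of tweets containing `hashtag` sent in the given month."""
    wanted = hashtag.lower()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        tweet = json.loads(line)
        # Assumed ISO 8601 timestamp, e.g. "2015-12-03T10:15:00".
        sent = datetime.fromisoformat(tweet["created_at"])
        tags = set(re.findall(r"#\w+", tweet["text"].lower()))
        if wanted in tags and sent.year == year and sent.month == month:
            yield tweet["text"]

# Invented sample records, one JSON object per line.
sample_lines = [
    '{"text": "Marching today #BlackLivesMatter", "created_at": "2015-12-03T10:15:00"}',
    '{"text": "#blacklivesmatter rally", "created_at": "2015-11-30T09:00:00"}',
    '{"text": "Lunch time", "created_at": "2015-12-05T12:00:00"}',
]

matches = list(filter_tweets(sample_lines, "#blacklivesmatter", 2015, 12))
print("\n".join(matches))
```

Writing the matching texts out one per line gives the kind of plain text file that standard corpus tools can read.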

We will be providing lunch at the start of the event and all materials for the workshop (including the software and help guide) on a USB drive. The schedule for the day can be found below.


Time        Agenda
1315-1415   PDR Room: Lunch
1415-1430   Introduction, log on, etc.
1430-1530   FireAnt basics
1530-1545   Refuel: Coffee break
1545-1645   FireAnt advanced
1645-1700   Q&As, requests, bouquets, encores

Please note that places are extremely limited and must be booked in advance. If you would like to attend, please email Claire Hardaker (c.hardaker@lancaster.ac.uk) in the first instance.

Welcome Jens Zinn – Marie Skłodowska-Curie Fellow

CASS is delighted to welcome Jens Zinn to the centre following the award of his Marie Skłodowska-Curie Fellowship! This is an extremely prestigious award, named after the double Nobel Prize-winning Polish-French scientist famed for her work on radioactivity. The fellowships support outstanding scholars at all stages of their careers, irrespective of nationality.

Jens has studied and taught at many universities in Germany, and in 2009 he was appointed Associate Professor and Reader in Sociology at The University of Melbourne. Jens has founded a number of international research networks on the Sociology of Risk and Uncertainty (SoRU). The joint internet portal of these groups is open to everyone to contribute to current debates and ongoing activities. His research activities include a number of studies on people’s management of risk and uncertainty during the course of their life (e.g. youth transitions into the labour market; certainty constructions in reflexive modernity; British veterans’ management of risk and uncertainty). He led a collaborative research initiative ‘Risk, Social Inclusion and the Life Course – A Social Policy Perspective’ at the University of Melbourne and a research project ‘Decision Taking in Times of Uncertainty. Towards an efficient strategy to manage risk and uncertainty in climate change adaptation’ funded by the Victorian Centre for Climate Change Adaptation Research. Most recently he has worked with Daniel Mcdonald on a project examining the change of the risk semantic in the New York Times from an historical perspective, combining corpus linguistics with sociology.

Here at CASS, Jens will be working with Professor Tony McEnery on a project which aims to advance our understanding of the forces that have driven the proliferation of risk discourses in the UK and Germany since World War Two. Working at the boundaries of risk sociology and corpus linguistics, this is a highly innovative enterprise, both theoretically and methodologically. It will examine the contribution made by mainstream risk theories to explaining the increasing use of the risk semantic in media coverage during the last 50 years, and it will develop an empirically grounded theory of the observable shift towards risk. Jens will utilise cutting-edge corpus-based research strategies to systematically reconstruct the changing use of the discourse-semantics of risk and will complement these with interviews of media experts to examine how these changes are linked to institutional and socio-cultural changes and historically significant events.

CASS would like to congratulate Jens on securing this highly esteemed fellowship, and we are very much looking forward to working with Jens on this exciting project!

Check back soon for more updates!

Workshop on Corpus Linguistics in Ghana

Back in 2014, a team from CASS ran a well-received introductory workshop on Corpus Linguistics in Accra, Ghana – a country where Lancaster University has a number of longstanding academic partnerships and has recently established a campus.

We’re pleased to announce that in February of this year, we will be returning to Ghana and running two more introductory one-day events. Both events are free to attend, each consisting of a series of introductory lectures and practical sessions on topics in corpus linguistics and the use of corpus tools.

Since the 2014 workshop was attended by some participants from a long way away, this time we are running events in two different locations in Ghana. The first workshop, on Tuesday 23rd February 2016, will be in Cape Coast, organised jointly with the University of Cape Coast: click this link for details. The second workshop, on Friday 26th February 2016, will be in Legon (nr. Accra), organised jointly with the University of Ghana: click this link for details. The same material will be covered at both workshops.

The workshop in 2014 was built largely around the use of our online corpus tools, particularly CQPweb. In the 2016 events, we’re going to focus instead on a pair of programs that you can run on your own computer to analyse your own data: AntConc and GraphColl. For that reason we will be encouraging participants who have their own corpora to bring them along to analyse in the workshop. These can be in any language – not just English! Don’t worry however – we will also provide sample datasets that participants who don’t have their own data can work with.

We invite anyone in Ghana who wants to learn more about the versatile methodology for language analysis that is corpus linguistics to attend! While the events are free, registration in advance is required, as places are limited.

Spoken BNC2014 Early Access Data Grant Scheme – winning proposals

Lancaster University’s ESRC-funded Centre for Corpus Approaches to Social Science (CASS) and Cambridge University Press are pleased to announce the recipients of the Spoken BNC2014 Early Access Data Grants. These successful applicants will receive exclusive early access to approximately five million words of the Spoken BNC2014 via CQPweb. They will be the first to conduct research using the data and produce papers to be published in 2017, coinciding with the release of the full corpus.

The successful applicants, their institutions, and the research they intend to undertake, are:

Karin Aijmer
Investigating intensifiers in the Spoken BNC2014

Karin Axelsson
Canonical and non-canonical tag questions in the Spoken BNC2014: What has happened since the original BNC?

Andrew Caines (Cambridge), Michael McCarthy (Nottingham) and Paula Buttery (Cambridge)
‘You still talking to me?’ The zero auxiliary progressive in spoken British English, twenty years on

Andreea Simona Calude
Sociolinguistic Variation in Cleft Constructions – a quantitative corpus study of spontaneous conversation

Jonathan Culpeper
Politeness variation in England

Robert Fuchs
Recent Change in the sociolinguistics of intensifiers in British English

Kazuki Hata, Yun Pan and Steve Walsh
Talking the talk, walking the walk: interactional competence in and out

Tanja Hessner and Ira Gawlitzek
Women speak in an emotional manner; men show their authority through speech! – A corpus-based study on linguistic differences showing which gender clichés are (still) true by analysing boosters in the Spoken BNC2014

Barbara McGillivray (Oxford), Jenset Gard (Oxford) and Michael Rundell (Lexicography MasterClass)
The dative alternation revisited: fresh insights from contemporary spoken data

Laura Paterson
‘You can just give those documents to myself’: Untriggered reflexive pronouns in 21st century spoken British English

Chris Ryder, Jacqueline Laws and Sylvia Jaworska
From oldies to selfies: A diachronic corpus-based study into changing productivity patterns in British English suffixation

Tanja Säily (Helsinki), Victoria González-Díaz (Liverpool) and Jukka Suomela (Aalto)
Variation in the productivity of adjective comparison

Deanna Wong
Investigating British English backchannels in the Spoken BNC2014

Thank you to everyone who applied, and congratulations to the winning proposals. Check back soon for more details on the Early Access Data Grant Scheme research.


Encyclopaedia of Shakespeare’s Language Project: A methodological journey

Just before Christmas 2015, the AHRC announced that it was going to fund the £1 million Encyclopaedia of Shakespeare’s Language project. I actually had the idea for the project 20 years ago. The fact that it took so long has much to do with method.

The approach I envisaged for Shakespeare’s language is analogous to more recent developments in dictionaries of general English, and, specifically, the departure from the philological tradition that resulted in the Collins Cobuild Dictionary of the English Language, the first full corpus-based dictionary. Being corpus-based implies both a particular methodology for revealing meanings, and a particular theoretical approach to meaning. There is less reliance on the vagaries and biases of editors, and a greater focus on the evidence of actual usage. The question ‘what does X mean?’ is pursued through another question: ‘how is X used?’

But I wanted more from the encyclopaedia than this. I wanted it to be comparative, to reveal not just the usage of words and other linguistic units in Shakespeare but also in the general language of the period. This way, we can tap into issues such as what is distinctive about Shakespeare’s language, and, more particularly, how Shakespeare’s language would have been perceived by his contemporary audience.

For example, the play Henry V contains Welsh, Irish and Scottish characters. A pilot examination I conducted with Alison Findlay (English and Creative Writing) of the words Welsh, Irish and Scottish used in over 100 million words written in Shakespeare’s time revealed that: (1) the Welsh barely registered on the Elizabethan consciousness, being considered a harmless in-group, only noteworthy for their curious language, (2) the Irish were wild, savage, rebels, viewed positively only in relation to Irish rugs (an important colonial import), and (3) the Scottish, whilst also rebels, were respected for their political power. (Current Shakespearean dictionaries do not contain entries for any of these three words.)

The problem 20 years ago was the lack of comparative data. Back in the early 1990s, the leading historical corpus of English was without doubt the Helsinki Corpus of English Texts, completed in 1991. This corpus amounted to 1.5 million words – an impressive figure in those days! Moreover, it had been put together with great care; it was reliable. But those 1.5 million words covered the period 730 to 1710. The section contemporaneous with Shakespeare amounted to less than half a million words, and was thus far short of what is required for serious comparative work.

To solve the problem, I set about, with Merja Kytö, creating the Corpus of English Dialogues. The reason for the focus on dialogues is that this would provide an interesting comparison for the dialogues of Shakespeare’s plays. This project soaked up 10 or more years, not just in creating the corpus but also in publishing the various insights it afforded into early modern dialogues along the way.

I was then overtaken – in a positive way! – by other events, notably, the advent of a fully searchable 1.2 billion word transcribed version of Early English Books Online (EEBO) (i.e. EEBO-TCP). For years, EEBO, which contains pretty much all early modern printed output, had been of limited value to linguists because the texts were only available as images, and language searches relied on OCR, with all its inaccuracies. Now, however, I have a 321 million word fully searchable corpus of texts written by Shakespeare’s contemporaries.

In addition, solutions, or at least partial solutions, had evolved for the various problems associated with the computational analysis of historical language data. Early modern spelling variation had been a major stumbling block (e.g. the word would could be spelt would, wold, wolde, woolde, wuld, vvold, etc.). This problem has been largely solved by the Variant Detector (VARD), devised by scholars at Lancaster, especially Alistair Baron. The Lancaster-developed CLAWS part-of-speech annotation system, which works well for present-day English, has been adapted for Early Modern English (though more work will be necessary). Similarly, semantic annotation has received attention from generations of researchers at Lancaster University, and has been (and is being) adapted for Early Modern English, most recently within the AHRC-funded SAMUELS project, involving a consortium of universities, including Lancaster.
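
To give a flavour of why spelling variation matters for searching, here is a deliberately simple Python sketch that maps the variant spellings of would listed above onto a single modern form. It is only an illustration of the general idea of normalisation; it is not VARD's method, and the names `VARIANTS` and `normalise` are made up for this example.

```python
# Variant spellings of "would" taken from the examples above.
# A hand-built lookup table like this is an assumption for illustration only.
VARIANTS = {
    "wold": "would",
    "wolde": "would",
    "woolde": "would",
    "wuld": "would",
    "vvold": "would",
}

def normalise(tokens, variants=VARIANTS):
    """Replace known variant spellings with their modern form; keep other tokens."""
    return [variants.get(token.lower(), token) for token in tokens]

print(normalise(["I", "wolde", "go", "if", "I", "wuld"]))
```

Once variants are normalised in this way, a single search for would retrieves every spelling, which is what makes frequency counts across early modern texts comparable.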

I don’t doubt that there will be many more twists and turns, lumps and bumps in the future methodological journey. But I am cheered by the fact that I will not be facing them alone but in the company of a wonderful group of people who are part of the project: Andrew Hardie and Tony McEnery (both LAEL), Paul Rayson (Computing and Communications), Alison Findlay (English & Creative Writing) and Dawn Archer (Manchester Metropolitan).

For a brief project description, see: AHRC award to create a new Encyclopaedia of Shakespeare’s Language

Beyond the checkbox – understanding what patients say in feedback on NHS services

In 2016 I will be working on a new project in CASS, which has received funding from the ESRC (£61,532 FEC). The purpose of this project is to help the National Health Service better understand the results of patient feedback so that they can improve their services. The NHS gathers a great deal of user feedback on its services from patients. Much of this is in “free text” format and represents a rich dataset, although the amount of text generated in the thousands of feedback forms patients fill in each year makes it unfeasible to undertake a close qualitative analysis of all of it. Categorisation-based approaches like sentiment analysis have been tried on the dataset but have not been found to be revealing. In this project we will be working with the NHS to first identify a set of research questions they would like to be answered from the data, and then we will use corpus-based discourse analysis to draw out the main themes and issues arising from the data. We will focus on four key NHS services – dentists, GP practices, hospitals and pharmacies. From these services alone we have 423,418 comments to analyse, totalling 105,380,697 words. Some of the issues we are likely to be focussing on include: what matters most for patients, the key drivers for positive and negative feedback, indicators in comments that might trigger an alert or urgent review, and differences across providers/services or by socio-demographic group.

Language Matters: Communication, Culture and Society

On 12th November, the CASS team made their way over to the International Anthony Burgess Foundation in Manchester for the ESRC Festival of Social Science 2015. The theme for this year’s event was “Language Matters: Communication, Culture and Society,” and it featured a series of four informal talks by CASS researchers based at Lancaster. The talks were pitched to a general audience, and gave the public the opportunity to hear renowned scholars talk about their lives, their work, and what they find most interesting about the relationship between language and society.

The first talk was by Robbie Love, a current PhD student in CASS who is working on the Spoken BNC2014 project along with Cambridge University Press. The researchers are collecting 10 million words for the project, and it has received a great deal of media attention since it was announced last year. Robbie delved into some preliminary findings from the corpus, and explained how “fortnight, cheerio, catalogue, marvellous, and marmalade” are all on the decrease, whilst “treadmill, essentially, Internet, Google, and Facebook” are all on the rise. Some of these findings might be expected, but these subtle differences say a great deal about how our language usage has changed over the past 20 years.

The second presentation was by Jonathan Culpeper, who discussed “Impoliteness: The Language of Offence”. He drew upon several pieces of corpus research, and argued that impoliteness doesn’t necessarily stem from what people say, but rather the way in which they say it. The use of 3rd person constructions, sarcastic remarks, and reduced eye-contact can all signal impoliteness. He argued that impoliteness often fits into a category, such as insults, negative evaluations, dismissals, silencers, threats, or condescensions. By using corpus-based methods, Jonathan is able to determine the most common constructions which signal impoliteness, and then consider the subtle pragmatic cues that may accompany them.

The third of the mini-lectures came from Paul Iganski (Law School). The presentation was entitled: “Vile words. What is the case for criminalising everyday hate speech as hate crime?” Given that well over half of the racially and religiously aggravated offences in England and Wales in 2010-11 were categorised as “hate speech,” Paul considered both the legal and societal implications when the state criminalises language. He is firmly of the view that there is no such thing as “free speech,” as every nation state in the EU criminalises a form of hate speech. Furthermore, he argued that hate crimes hurt more than otherwise motivated crimes as they send a message striking at the core of the victim’s identity, and that the restriction on “free speech” goes a long way towards protecting minority groups in society.

The fourth and final talk was from Claire Hardaker, who discussed “The ethics of investigating online aggression”. Claire started by discussing the media’s culture when stories regarding online abuse arise. They tend to have an appetite for exposing online trolls, and want to put a real-life face to the otherwise faceless online character. Claire went on to describe how easy it is to track someone down based solely on the information they give away on their Twitter accounts. Academics, for example, often promote their position and institution on their profile, and a quick internet search can lead someone straight to the individual’s office. This information can be used by the media, and Claire discussed how the media essentially witch-hunted a 63-year-old troll called Brenda Leyland, despite her comments not actually being considered criminal under British law. There’s a constant, sensitive conflict between ethics and the online environment, and Claire argued that whilst we need to publish about our research, we must do so without endangering anyone involved.

After the four presentations came to a close, visitors had the opportunity to meet with the speakers, talk about their research, and network with other attendees.

I’d like to personally offer my thanks not only to the speakers for offering their time throughout the day, but also to everyone who joined us in the audience. I think you’ll agree that it was a huge success, and the day really highlighted why corpus-based research is so important for uncovering the fascinating relationship between language and society.

Spoken BNC2014 Early Access Data Grant Scheme – Applications now open

Lancaster University’s ESRC-funded Centre for Corpus Approaches to Social Science (CASS) and Cambridge University Press are excited to announce the Spoken British National Corpus 2014 Early Access Data Grant scheme.

Applications are now open for researchers at any level in the field of corpus linguistics and beyond to gain early access to a large subset of the Spoken BNC2014, which is currently being compiled and is due for release in late 2017. Successful applicants will write a paper based on their proposed research for exclusive publication (subject to peer review) in either a special issue of the International Journal of Corpus Linguistics or an edited collection.

We invite proposals for interesting and innovative research that would use approximately five million words of the upcoming Spoken BNC2014 as its primary source of data.

Successful applicants will gain access to the data via the CQPweb platform (cqpweb.lancs.ac.uk). Standard CQPweb functionality will be provided, including annotation (POS tagging, lemmatisation, semantic tagging), along with one new feature: the ability to search the corpus according to categories of speaker metadata such as gender, age, dialect and socio-economic status.

Proposals can approach the data from any theoretical angle, provided corpus methodologies are used and the research can be carried out within the affordances of CQPweb. Successful applicants will receive access to the data in February 2016 with a deadline for full paper submission in October 2016. Subject to peer review, papers will be published in one of the two Spoken BNC2014 launch publications in 2017 (a special issue of the International Journal of Corpus Linguistics has been agreed and a thematic edited collection is being planned).

This is a fantastic opportunity to work with the first very large, general corpus of informal British English conversation created since the original BNC more than twenty years ago. Successful applicants will get access to a large subset of the Spoken BNC2014 eighteen months before the full corpus is released, and will be the very first scholars to undertake and publish research based on this new dataset.

More details about the terms of the data grant scheme can be found in the application form. To apply, download and complete the application form and email it to Robbie Love (r.m.love@lancaster.ac.uk). The deadline for applications is Friday 11th December 2015.

Corpus compilation: working paper now available

We are pleased to announce that the CASS Corpus on Urban Violence in Brazil is now ready to be analysed. It contains a total of 5,127 articles (1,778,282 words) published between January and December 2014 by four Brazilian newspapers: Folha de São Paulo, Estado de São Paulo, Zero Hora and Pioneiro.

This working paper explains the process of compiling the corpus. It describes the selection of sources and individual texts and the preparation of the texts so that they can be processed by corpus linguistics techniques, and it concludes with an overview of the corpus’s content.