Representing trans people in the UK press – a follow-up study

I do not identify as trans, nor did I carry out this research for profit or because I am an activist. I approached the subject from the position of allowing the data to speak for itself, and the corpus methods I use rely on computational techniques that are unbiased – computer software identifies the most frequent words, phrases and combinations of words, which then have to be accounted for by the analyst.

Introduction

A few years ago I published the “corpus linguistics” chapter in an edited collection relating to different methods of carrying out critical discourse studies. As a case study for the chapter, I decided to look at the representation of trans people in the British press. At the time The Daily Mail had run a disapproving article about a trans person who was also a school-teacher, and who committed suicide three months later, while another article published in the Observer, one of the more respectable Sunday broadsheet newspapers, had used pejorative phrases about trans people like ‘a bunch of bed-wetters in bad wigs’ and ‘screaming mimis’. I wanted to use corpus approaches to see whether these articles were typical of the general press discussion around trans people or whether they stood out as unusually harsh. I built a corpus of around 900 articles (small by corpus linguistics standards), drawn only from 2012, and used traditional corpus methods (keywords, collocates, concordancing) to examine a range of words like transgender, transsexual and trannie. My analysis found that the two articles mentioned above were at the extreme end of a continuum, although:

“the analysis did find a great deal of evidence to support the view that trans people are regularly represented in reasonably large sections of the press as receiving special treatment lest they be offended, as victims or villains, as involved in transient relationships or sex scandals, as the object of jokes about their appearance or sexual organs and as attention-seeking freakish objects. There were a scattering of more positive representations but they were not as easy to locate and tended to appear as isolated cases, rather than occurring repeatedly as trends.” (Baker 2014)

I was recently approached by the charity Mermaids UK, who asked me if I would carry out an updated analysis of more recent press representation. This time I collected data from the previous 2 years (21 October 2017 to 21 October 2019), resulting in a larger corpus of around 6,400 articles, indicating that there were around 3 and a half times as many articles written about trans people per year in this later period. In terms of news values, trans people are seen as rather more newsworthy these days. So has the discourse around them changed?
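The "3 and a half times" figure compares yearly rates, because the two corpora cover different lengths of time. A minimal sketch of that arithmetic, using the rounded article counts quoted above (the variable names are illustrative only):

```python
# Rounded article counts and collection periods quoted in the text.
articles_2012 = 900        # collected over one year (2012)
articles_2017_19 = 6400    # collected over two years (Oct 2017 to Oct 2019)

rate_2012 = articles_2012 / 1        # articles per year
rate_2017_19 = articles_2017_19 / 2  # articles per year

ratio = rate_2017_19 / rate_2012
print(f"{rate_2017_19:.0f} vs {rate_2012:.0f} articles per year, ratio {ratio:.2f}")
```

With these rounded inputs the later period works out at roughly 3,200 articles a year against 900.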

Changing Labels

In terms of how the press refer to trans people, in 2012, the most common term by far was transgender. In 2018-19, transgender and trans were about equally frequent, this being mostly an effect of the Guardian and Observer showing a strong preference for trans. Terms that I had expected to have died out, like sex-change and transsexual, had decreased somewhat but were still being used about once every other day, with the Mail, Telegraph and Times accounting for the bulk of such cases. Another decreasing term, tranny, occurred about once a fortnight. In 2012 it was used to imply bad taste, outlandishness, sex romps or the subject of jokes. The term was a particular favourite of journalist AA Gill (who used it in bizarre ways like tranny panto and tranny centaur night out). However, in 2018-19 it was mainly acknowledged as a bullying term (AA Gill died in 2016). The rather jarring use of transgender(s) as a noun (“How about One Guy, A Girl, A Transgender and Two Nonbinary Persons” (The Sun)) occurred 37 times in 2018-19 (there was only 1 such usage in 2012).

Collocates of trans(gender)

Examining the contexts in which trans and transgender people were written about showed one of the most notable changes. I’d noted in 2012 that transgender people were implied to be quick to take offence – in that year there were 8 cases of trans(gender) co-occurring with words like angry, clash, complaint, fury, offended, outrage, row, spat, upset and wrath. There was an enormous increase in this representation in 2018-19 – 586 cases. While a small number of these cases don’t cast trans people as the ones who are angry or complaining, the vast majority do – and the wider point is that trans people are being discussed as being at the centre of controversy. A related set of words to do with conflict, including aggressive, demand, harassed, bullied, confronted, lunge, militant, outspoken, pressure and threat, showed a similar pattern – 5 cases of these kinds of words appearing near trans(gender) in 2012, but 334 cases in 2018-19. The result is that trans people are constructed as newsworthy because they are difficult, angry, easily offended (and often unreasonably so).
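For readers unfamiliar with collocation counts, the idea of words "appearing near" trans(gender) can be sketched in a few lines of Python. This is only an illustration: the window of five tokens either side, the simple tokeniser and the function name are assumptions, not the settings or software used in the study. The sample sentence is the Mail on Sunday example quoted in this post.

```python
import re
from collections import Counter

def count_collocates(text, node_pattern, collocates, span=5):
    """Count collocates occurring within `span` tokens of a node word."""
    tokens = re.findall(r"[a-z]+", text.lower())
    node = re.compile(node_pattern)
    counts = Counter()
    for i, token in enumerate(tokens):
        if node.fullmatch(token):
            # Tokens to the left and right of the node word.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(w for w in window if w in collocates)
    return counts

sample = ("Scout leaders have been told to avoid referring to children as boys "
          "and girls to ensure transgender members are not offended.")
offence_words = {"angry", "clash", "complaint", "fury", "offended",
                 "outrage", "row", "spat", "upset", "wrath"}

result = count_collocates(sample, r"trans(gender)?", offence_words)
print(result)
```

A count like this only flags candidate lines; deciding who is actually being cast as angry still requires reading the concordance lines, as described above.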

Scout leaders have been told to avoid referring to children as boys and girls to ensure transgender members are not offended. (Mail on Sunday)

A transgender woman is demanding an apology and £2,500 compensation after claiming she was called “sir” by rail company staff. (Times, March 16, 2019)

It’s not a new representation. I saw the same thing when I looked at news stories about gay people in the early 2000s, Muslims in the 2000s and feminists in the 1990s and 2000s. Another representation (also applied to gay people) was to link trans people with crime, connecting them to words like killer, prisoner, lag, criminal, murderer, rapist, jail and kill. These words occurred with trans(gender) 3 times in 2012, but 608 times in 2018-19.

‘It’s crazy to give trans prisoners everything they say they want,’ said chair Janice Williams. ‘Why wouldn’t they lie in the circumstances?’ (Daily Mail)

Women’s jail holds trans lag born lad (The Sun, September 13, 2019)

Some of the trans brigade advocate the murder of Terfs as the best course. (Telegraph, 12 January 2019)

Transphobia, trans children and the trans lobby

What about more general contexts? What topics are trans people talked about in relation to more, or less, these days? Here we see potentially a change for the better. Topics that now take up less space in the overall debate involve references to transvestites and ladyboys as well as discussion of implants, the clothing worn by trans people and their ability to “pass” as a particular gender. There’s less of the inappropriate, prurient interest in trans people that’s associated with sitcom characters like Alan Partridge. In its place, the biggest area of growth is in stories relating to transphobia and discrimination, although there were also increases in references to transitioning, inclusivity and gender-neutral pronouns.

Lest we think that references to transphobia indicate that the press are overall more concerned about trans people being abused, a closer look indicates this is not always the case. Although such references are 112 times more frequent in 2018-19 compared to 2012, 15% of the 2018-19 mentions put the word transphobia in quotes, implying authorial distance or even rejection of the term.
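A share like the 15% above can be estimated by comparing quoted occurrences of the word against all occurrences. The sketch below is an assumption about how one might do this with regular expressions, not a description of the study's procedure. The first example string is taken from the Times sentence quoted in this post; the second is invented purely for illustration.

```python
import re

# transphobia / transphobic / transphobe wrapped directly in quote marks.
QUOTED = re.compile(r"[\"'“‘]transphob\w*[\"'”’]", re.IGNORECASE)
# Any occurrence of the word family, quoted or not.
ANY = re.compile(r"transphob\w*", re.IGNORECASE)

def quoted_share(texts):
    """Proportion of transphob* mentions that appear inside quote marks."""
    total = sum(len(ANY.findall(t)) for t in texts)
    quoted = sum(len(QUOTED.findall(t)) for t in texts)
    return quoted / total if total else 0.0

examples = [
    "her allegedly “transphobic” views",   # from the Times example in this post
    "a rise in transphobic hate crime",    # invented for illustration
]
share = quoted_share(examples)
print(f"{share:.0%} of mentions are in quotes")
```

A pattern like this cannot tell scare quotes from reported speech, so the matches would still need checking by hand.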

A transgender teenager who demanded the removal of a female Labour member from her post as women’s officer over her allegedly “transphobic” views has been elected to the post in her local Labour party. (The Times, November 20, 2017)

I took 100 random cases of transphobia and related words like transphobe and looked at them in more detail. Approximately half (47) used the term to raise questions about its validity – either using the distancing quotes, referring to “supposed” or “alleged” transphobia, mentioning the way that the accusers behave (e.g. “howled down as transphobia”) or simply baldly stating that something is not transphobia.
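Drawing a random sample of concordance lines for close reading is simple to reproduce. The following is a minimal sketch; the function name, the fixed seed and the placeholder lines are all assumptions for illustration, not the tool or data used here.

```python
import random

def sample_concordance(lines, k=100, seed=2019):
    """Return k lines chosen at random without replacement (repeatable via seed)."""
    rng = random.Random(seed)
    if len(lines) <= k:
        return list(lines)
    return rng.sample(lines, k)

# Placeholder data standing in for real concordance lines.
hits = [f"concordance line {i}" for i in range(1, 1001)]
subset = sample_concordance(hits)
print(len(subset))
```

Fixing the seed means the same 100 lines can be retrieved again if the manual coding needs to be checked.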

An analysis of the term trans(gender) children found a slightly better picture. That term doesn’t occur in the distancing scare quotes – so the concept of trans(gender) children appears to be more accepted in the press than the concept of transphobia. An analysis of 100 random cases found 56 that accepted the existence of trans children and/or advocated that they should receive support. Thirty-seven cases were more disapproving, either suggesting that children who identify as trans should not be supported in transitioning or that efforts to support them (e.g. through pronoun stickers or gender-neutral toilets) are unnecessary, even unhelpful. A further seven cases appear more neutral, noting that this is an issue which divides people but not clearly coming down on either side. It’s very rare to find voices of trans(gender) children in these press articles.

A final change relates to the increase in the phrase trans(gender) lobby. There were no mentions of this phrase in 2012. In contrast, 2018-19 saw 151 mentions of it, with over 90% of such cases writing about it in a negative way (e.g. as silencing debate, peddling politically-correct fallacies, being deranged or aggressively militant). The transgender lobby is described in somewhat contradictory terms across the press. At times, journalists go out of their way to stress that it is unimportant, referring to it as minuscule and doomed, yet at other times it is described as powerful, hegemonic and influential (with the implication that it should not be these things).

Conclusion

The UK press wrote over 6,000 articles about trans people in 2018-19. On the surface there appear to have been improvements – the more sexualising and joking uses of language around trans people have reduced since 2012 and there are many more stories around transphobia and inclusivity. However, there are large swathes of the press which write about these topics in order to be critical of trans people and many articles which consequently paint trans people as unreasonable and aggressive. The picture suggests that the conservative press and most of the tabloids have shifted from an openly hostile and ridiculing stance on trans people towards a carefully worded but still very negative stance.

Reference

Baker, P. (2014) ‘“Bad wigs and screaming mimis”: Using corpus-assisted techniques to carry out critical discourse analysis of the representation of trans people in the British press.’ In C. Hart and P. Cap (eds) Contemporary Critical Discourse Studies. London, Bloomsbury: 211-236.

Time to Celebrate: Trinity Lancaster Corpus

On Wednesday 30 October, the ESRC Centre for Corpus Approaches to Social Science (CASS) organised a small get-together in its new location, Bailrigg House, to celebrate the research that is being carried out at the centre. Specifically, on this occasion, we wanted to highlight the Trinity Lancaster Corpus, a corpus of spoken learner English built in collaboration between Lancaster University and Trinity College London.

Cutting the cake with the Trinity Lancaster Corpus logo

We are really proud of the corpus, which is the largest learner corpus of its kind. It took us over five years to complete this part of the project. Here are a few numbers that describe the Trinity Lancaster Corpus:

  • Over 2,000 transcripts
  • Over 4.2 million words
  • Over 3,500 hours of transcription time
  • Over 10 L1 and cultural backgrounds
  • Up to four speaking tasks

A balanced sample of the corpus is now available for online searching via TLC Hub (password: Lancaster1964). To read more about the corpus and its development, check out this article in the International Journal of Learner Corpus Research:

Gablasova, D., Brezina, V., & McEnery, T. (2019). The Trinity Lancaster Corpus: Development, Description and Application. International Journal of Learner Corpus Research, 5(2), 126-158. [open access]

A new special issue of the journal featuring articles on various aspects of learner language, which use the Trinity Lancaster Corpus as their primary data source, is available from this link.

Table of contents of the special issue of the International Journal of Learner Corpus Research

A cake to celebrate the Trinity Lancaster Corpus

Celebrations at CASS

Celebrations at CASS (posters featuring research on TLC in the background)

Islam in the Media – A new CASS project working with The Aziz Foundation

We are very pleased to announce that in our next CASS project we will be working with The Aziz Foundation to examine representations of Islam in the British press. The project will be led by Tony McEnery (Principal Investigator) with Gavin Brookes as Co-Investigator. We are also delighted to announce that Isobelle Clarke will be joining us later this year as Research Fellow on this project – welcome Isobelle! (introductory blog post to follow…).

The aim of this research will be to expand on previous work on this topic carried out by members of the Centre. The project will be methodologically innovative, devising new techniques and adapting existing methods to afford new insights into representations of Islam in the UK and how these vary across different parts of the country and over time. Specifically, this project will be structured according to the following three strands:

  1. Examining press representations of Islam over time. This will involve expanding the University’s existing database of press articles about Islam – which currently represents national news articles up to 2015 – allowing for a comparison of representations of Islam over time, from 1998 to the present day.
  2. Comparing national and regional press representations of Islam. As well as providing insight into what is, in the regional press, an under-researched area of media representations of Islam, this strand will also be able to address hypotheses which suggest that Muslims positively appraise local over national media coverage of Islam and Muslims (Open Society Institute, 2010: 215). In addition to expanding the existing dataset to include articles published up to the present day, as per (1), this strand will entail the expansion of this dataset to include regional (as well as national) newspaper articles about Islam published from 1998 to the present day. By studying temporal changes in both the national and regional press, this project will be able to assess the extent to which any shifts are uniform across both tiers (local/national) or, on the other hand, whether divergences between the two actually become starker over time.
  3. Exploring the social effect of press representations. This strand will analyse how readers respond to both positive and negative framings of Islam in ‘below-the-line’ comments which accompany the articles in the data. This strand will therefore take a wider view of societal discussions of Islam, comparing readers’ perspectives against press representations in order to ascertain the extent to which such representations might influence but also be challenged by the public. By exploring comments on articles which contain positive as well as negative representations – as these are identified in (1) and (2) – this strand will provide useful evidence for demonstrating to the media the social effects of constructive journalism over poor journalistic practice.

We will actively engage members of the British Muslim community in our research by sharing our findings with them, listening to their thoughts and feedback, and helping them to read media texts more critically in order to challenge negative representations in the future, for example by formulating complaints that are informed by (corpus) linguistic insight. We are excited to begin this project and are looking forward to working with The Aziz Foundation and, of course, to welcoming Isobelle to Lancaster!

‘Collaborations between Linguistics and the Professions’ event – Three participants’ views

On 4th-6th March 2019, we organised an event on ‘Collaborations between Linguistics and the Professions’. If you missed it, here are three reports from early-career researchers – one for each day.

Day 1 – by Mathew Gillings

The aim and focus of the Collaborations between Linguistics and the Professions event was to look at connections and opportunities that arise between the academic discipline of linguistics and wider industry. Broadly speaking, the event considered how university-based linguists may be able to advise in both the public and private sectors, providing consultancy to help inform real-world issues. As a PhD student applying corpus methods to the study of deceptive language, the first day of the three-day event was of particular interest to me, due to its focus on forensic linguistics and public engagement.

After a quick welcome and opening of the event from Elena Semino, we started with a talk by Louise Mullany. Louise works at the University of Nottingham, but also carries out linguistic consultancy through the Linguistic Profiling for Professionals unit she directs. Louise discussed how her team has worked within a whole range of sectors and applied linguistic theory to help inform their practice. For example, politeness theory might well provide the answers to why one firm is struggling with their customer service; or perhaps an investigation into gendered talk might reveal some underlying problems or tensions. Perhaps even other methods could provide further insight, such as eye-tracking, or putting clients through an online learning course. Louise’s talk gave a good insight into how such a unit operates.

The second talk built on the first one, with Isobelle Clarke showing us not only what you need to think about and be aware of, but also what you shouldn’t do when trying to build a reputation as a linguistic consultant. Although Isobelle has already had some good opportunities through the connections made throughout her PhD, she argued that her reputation will always be questioned because of the stereotypes that come with the territory. For example: she is female, unlike most forensic linguistic consultants; she is from Essex, and therefore has an accent that is often the target of prejudice; and she is also still early-career. These are things she cannot change, but they still unfortunately affect whether or not she is considered credible as an expert in the field. It was good to hear such an open and frank discussion about inequalities within the field.

Continuing with the forensic linguistics theme, Georgina Brown offered an insight into how new methodologies within forensic speech science are now being used to inform proceedings within the courtroom. Georgina introduced us to Soundscape Voice Evidence, a new start-up based right here in the Department of Linguistics and English Language at Lancaster, which is all the proof you need that there is a real appetite for further collaboration between academia and industry.

Another interesting talk later that day was by Lancaster’s very own Claire Hardaker, who talked about learning when to say ‘no’ to opportunities that come your way. Claire discussed several cases where, due to her own excitement, she may have jumped into a new opportunity too quickly. As a PhD student, it was good to hear advice on how to handle different cases, and especially on how to be careful in picking which cases to pursue. Likewise, this seemed to be a common theme over the three days, with a whole range of attendees discussing issues they had encountered whilst carrying out this kind of collaborative work.

The day came to a close with Tony McEnery’s talk discussing linguistics and the impact agenda. Tony reflected on some of his own experiences working with various agencies outside of academia, but the bulk of his talk concerned impact work against the backdrop of the REF. Tony gave some top tips for how to get your research out there and informing public life through the Civil Service, but also spoke very realistically about the priority it will be given by others. Everyone is busy – those in academia and those outside of it – and we must not lose sight of that. Tony finished with a call to arms: language pervades each and every aspect of our life, and it is clear that the discipline has a lot more to offer than it has traditionally done in the past.

Day one of the Collaborations between Linguistics and the Professions event was enlightening. I, for one, never knew quite how much consultancy work linguists are involved with, and it was refreshing to see such a healthy appetite for it within the room – especially from early-career researchers, like myself, who may well do it in the future.

Day 2 – by Sundeep Dhillon

I attended the education-focused sessions on Tuesday, given that my background is in English language teaching and, as a current ESRC doctoral student in Applied Linguistics, I was keen to find out more about collaborations between linguistics and the professions. The day started with a warm welcome from Elena Semino prior to the first presentation. Alison Mackey spoke about her work as a linguistics consultant in the private sector, which ranged from educational technology companies to private schools in the USA. Alison gave lots of varied (and humorous) examples of the consultancy work and how she achieved these contracts, which she traced back to three key factors. These were networking and word-of-mouth referrals, the publication of a book on bilingualism by HarperCollins, and a Guardian article which has over 65,000 shares on Facebook. I was impressed by the range of Alison’s work activities, proving that linguistics can be widely applied to real-life practical contexts.

One of the schools Alison has worked with in the USA, ‘Avenues: The World School’, was then represented by Abby Brody. The private school has an innovative approach to teaching as students are immersed in Spanish or Mandarin (alongside English) with the aim of becoming ‘truly fluent’. The links between linguistics and the school’s curriculum development over time were outlined. It was clear that the school was responsive to research and adapted their teaching and learning practices accordingly.

The next presentation by Judit Kormos was very inspiring in that the linguistics research has led to a direct impact on the way inclusive practices are promoted in educational publishing and second language assessment. Judit’s research on specific learning difficulties and L2 learning difficulties has aimed to give a voice and agency to those who are traditionally underrepresented. There were a number of examples given of working with publishers and government departments to develop strategies and ways of working which are inclusive. The success of Lancaster’s FutureLearn MOOC on Dyslexia and Foreign Language Teaching was also discussed and there is now an opportunity to join the next launch of this MOOC on the 15th April 2019.

Following a lovely long and well-catered lunch break, we then heard from Claire Dembry of Cambridge University Press (CUP). Claire outlined the many opportunities for links between the publisher and academic research, including the recent Spoken British National Corpus 2014 (BNC) project in collaboration with Lancaster University. This project involved collecting 11.5 million words of spoken conversation and the BNC 2014 is now available online with free access. There are also opportunities to contribute to articles, books, research guides and white papers which are produced by CUP. Claire also answered questions on practical considerations such as contacting CUP, pitching ideas and negotiating fees, all of which was useful information to consider prior to any collaboration.

We then heard from Vaclav Brezina about corpora and language teaching and learning. There were three main sections in the presentation – accessibility, research partnerships and interdisciplinarity. Accessibility covered the link between theoretical ideas of linguistics and the practical tools and techniques used in projects such as the BNCLab and #LancsBox. Research partnership highlighted the importance of collaboration with others such as CUP and Trinity College London. Finally, interdisciplinarity covered good practice guidelines on working with others including flexibility and collective ownership of goals.

Cathy Taylor of Trinity College London presented on ‘The Spoken Learner Corpus’ (SLC) project, collaboratively undertaken by Trinity College London and CASS, Lancaster University. This has involved collecting data from Trinity’s Graded Examinations in Spoken English (GESE) at B1 level and above, leading to the creation of the SLC, which can be explored for language teaching and research purposes. Cathy described the stages of the project including the rationale, the practical data collection of audio-recorded exams from GESE and also the creation of teaching and learning materials based on the SLC. These materials are available on the Trinity website and cover topics such as managing hesitation and asking questions. This project is a great example of how corpus data can be used to inform and improve the classroom experience of English language learners.

The final presentation was by John Pill, who spoke about his experience of updating the Occupational English Test (OET), an English language test for medical professionals. Collaboration between the test developers, language researchers and medical experts was outlined, including tensions between them in relation to the expected content, assessment criteria and outcomes of the OET. Overall, the process to create a relevant language test which covered English language and also practical medical aspects was successful with an updated test being launched following the collaboration.

Each of the presentations linked the research within linguistics to applications in the wider education profession. There was a lot of useful information and plenty of food for thought for the audience in considering future collaboration activities.

Day 3 – by Joelle Loew

It is the third day of the conference – by now a familiar crowd is coming together around coffee in the morning, and the atmosphere is at ease. People have come from all over the UK and beyond to the beautiful campus of Lancaster University – I had flown in from Switzerland a while ago, where I am doing my PhD in Linguistics at the University of Basel. Everyone is looking forward to the last day, which brings together researchers and practitioners applying linguistics in various professions including media and marketing. We start off with a talk from Colleen Cotter from Queen Mary University of London on bridging ‘the professional divide’ between journalists’ and academics’ talk about language – she outlines journalistic language ideologies but also highlights journalistic audience design and corresponding readership-orientation as an example of how journalistic practice can feed into academic practice. After a quick refill, we gather again to hear Lancaster’s own Veronika Koller discuss her experience of opportunities and obstacles in linguistics consulting in healthcare. Throughout the presentation she refers to and outlines the main stakeholders in healthcare particularly relevant to linguistics:

https://twitter.com/ZsofiaDemjen/status/1103251411268182016

On we then go to hear Jeannette Littlemore from the University of Birmingham discuss her work with marketing and communications agencies on their use of figurative messaging. She focuses on the role of metaphor and metonymy for brand recognition, brand recall and consumer preference, drawing on examples from her research and work with the creative industry. Discussions following her talk continue into the lunch break. Refreshed and well fed, we move into an afternoon packed with insight from industry. Gill Ereaut brings in the perspective of a linguist working within the professions, introducing her consultancy Linguistic Landscapes. Their work includes evidence-based consulting for organisations on multiple levels, including organisational culture change. Another perspective from industry follows from Sandra Pickering of opento, who talks about the role of language in marketing. She provides a wide array of fascinating examples from her diverse experience with different organisations, and spends some time outlining how brands become metaphorical persons on their quest to build compelling brand narratives. The audience discusses some well-known brand narratives and archetypes of smaller and bigger players in the industry following her talk.

https://twitter.com/VeronikaKoller/status/1103306088953331713

Dan McIntyre and Hazel Price from the University of Huddersfield then present two very different case studies applying corpus linguistics in a private and a public setting with their consultancy Language Unlocked. The day ends with a Skype talk by Deborah Tannen from Georgetown University, who captures the audience with her account of why and how she writes for non-academic audiences. Her multiple and diverse experiences of writing for the broader public make for interesting insights on the differences between writing for academics and writing for a lay audience. She emphasizes the value of having to find simple terminology for expressing and simplifying complicated ideas. Her talk was followed by a lively discussion, as were the others in the day. Exploring opportunities and challenges in linguistic consultancy work through hands-on examples from different perspectives also brought out recurrent themes, such as the importance of considering ethical aspects in this process. It also showed the tremendous potential and relevance of linguistics for a variety of different aspects of the professional world.

In sum, it was a fascinating day and a very inspiring conference overall – throughout the day it was evident that attendees genuinely felt the exchange between academics and practitioners applying linguistics in the professional world was very fruitful, and I am almost certain it is not the last we’ve heard of events such as this! It certainly broadened my own horizon as a PhD researcher looking at professional communication – showing many opportunities and highlighting the challenges to prepare for and navigate when seeking collaboration between linguistics and the professions.

Introductory Blog – Luke Collins

I am delighted to have joined the CASS team as Senior Research Associate and will be working across the new programme of studies in Corpus Approaches to Health(care) Communication. I have already begun working on a fascinating strand exploring the Narratives of Voice-hearers and I will be working closely with Professor Elena Semino in applying corpus methods to see what effects a therapeutic intervention has on the experiences of those who hear distressing voices – and how they articulate these experiences – over time. More broadly, we will be examining representations of mental health and illness in the media, looking to address issues of stigmatisation and support public awareness and understanding.

Working towards the application of corpus linguistics and the findings of corpus analysis to health services is a great motivation to me and I am thrilled to have the opportunity to build on my previous work in this area. I have published work on the experiences of people undergoing a therapeutic intervention and demonstrated how corpus approaches can help to capture some of the complexities of those experiences. I have also implemented corpus analyses to investigate discussions of complex global issues in the news media (specifically, climate change and AMR), thinking about public understanding and how media reporting can help readers to comprehend their role in such issues. I have recently been working on my edition of the Routledge ‘Corpus Linguistics for..’ series, focusing on applications of corpus tools for analysing different types of online communication and hope to announce its release early next year. Throughout my work, I have endeavoured to raise awareness of corpus methods outside of the discipline and create opportunities to work with collaborators from various backgrounds. I am glad to find that in my role with CASS, this can continue!

Outside of my work, I have a reputation for hand-made greeting cards and I am an avid record collector. Since I have moved to Lancaster I have been exploring the local area and discovering what a picturesque part of the country this is. I don’t even mind the rain!

Statistics in (Higher) Education: A few thoughts at the beginning of the new academic year

As every year around this time, university campuses are buzzing with students who are starting their studies or returning to the campus after the summer break – this incredible transformation pours life into buildings – empty spaces become lecture theatres, seminar rooms and labs. Students have the opportunity to learn many new things about the subject they chose to study and also engage with the academic environment more generally. Among the educational and development opportunities students have at the university, one transferable skill stands out: statistical literacy.

Numbers are an essential part of our everyday life. We count the coins in our pocket, the minutes before the next bus arrives or the sunny days in a rainy year. Numbers and quantitative information are also very important for students and educators. Statistical literacy – the ability to produce and interpret quantitative information – belongs to the basic set of academic skills that, despite its importance, may not always receive the attention it deserves.

Many students (and academics) are afraid of statistics – think about what your first reaction is to the equation in Figure 1 below.

Figure 1: The equation of standard deviation (mathematical form)

This is because statistics is often misconstrued as the art of solving extremely complicated equations or a mysterious magic with numbers. Statistics, however, is first and foremost about understanding and making sense of numbers and quantitative information. For this, we need to learn the basic principles of collecting, organising and interpreting quantitative information. Critical thinking is thus much more important for statistics than number-crunching ability. After all, computers are very good at processing numbers and solving equations and we can happily leave this task to them. For example, many statistical tasks, even complex ones, can be achieved by using tools such as the Lancaster Stats Tool online, where the researcher can merely copy-paste their data (in an appropriate format) and press one button to receive the answer.

Humans, on the other hand, outperform computers in interpretation. This is because we have the knowledge of the context in which numbers appear and we can therefore evaluate the relative importance of different quantitative results. We as teachers, linguists, sociologists, scientists etc. can provide the underlying meaning to numbers and equations and relate them to our experience and the knowledge of the field. For example, the equation in Figure 1 can be simplified as follows:

Figure 2: The equation of standard deviation (conceptual)

When we relate this to what we know about the world, we can see that the question we are asking in Figure 2 is how much variation there is in our data, a question about variability, difference in tendencies and preferences and overall diversity. This is something that we can relate to in our everyday experience: Will I ever find a twenty-pound note in my pocket? Is the wait for the bus longer in the evening? Is the number of sunny days different every year? When talking about statistics in education, I consider the following point crucial: as with any subject matter, it is important to connect statistical thinking and statistical literacy with our daily experience.
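The idea that standard deviation measures how far values typically lie from their mean can be illustrated with a minimal sketch. Since the two figures are not reproduced here, the exact form of the equation is an assumption: the code below uses the common sample formula (dividing by n - 1), and the bus waiting times are invented purely for illustration.

```python
import math


def standard_deviation(values, sample=True):
    """Return the standard deviation of a list of numbers.

    Assumption: sample=True divides by n - 1 (sample formula);
    sample=False divides by n (population formula).
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    # Squared distance of each value from the mean: "how different is it?"
    squared_distances = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return math.sqrt(sum(squared_distances) / divisor)


# Hypothetical waiting times for the bus, in minutes.
bus_waits = [4, 7, 5, 12, 7]
print(round(standard_deviation(bus_waits), 2))
```

A larger result means the waits differ more from day to day; a result of zero would mean the bus always takes exactly the same time.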

To read more about statistics for corpus linguistics, see Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.

Introductory Blog – Gavin Brookes

This is the second time I have been a part of CASS, which means that this is the second time I’ve written one of these introductory blog pieces. I first worked in CASS in 2016, on an eight-month project with Paul Baker where we looked at the feedback that patients gave about the NHS in England. This was a really fun project to work on – I enjoyed being a part of CASS and working with Paul and made some great friends in the Centre with whom I’m still in contact to this day. Since leaving CASS in October 2016, I have completed my PhD in Applied Linguistics in the School of English at the University of Nottingham, which examined the ways that people with diabetes and eating disorders construct their illnesses and identities in online support groups. Following my PhD, I stayed in the School of English at Nottingham, working as a Postdoctoral Research Fellow in the School’s Professional Communication research and consultancy unit.

As you might have guessed from the topic of my doctoral project and my previous activity with CASS, my main research interests are in the areas of corpus linguistics and health communication. I am therefore very excited to return to the Centre now, with its new focus on the application of corpora to the study of health communication. I’m currently working on a new project within the Centre, Representations of Obesity in the News, which explores the ways that obesity and people affected by obesity are represented in the media, focussing in particular on news articles and readers’ responses. I’m very excited to be working on this important project. Obesity is a growing and seemingly ever-topical public health concern, not just in the UK but globally. However, the media’s treatment of the issue can often be stigmatising, making it quite deserving of scrutiny! Yet, our aim in this project isn’t just to take the media to task, but to eventually work with media outlets to advise them on how to cover obesity in a way that is more balanced and informative and, crucially, less stigmatising for people who are affected by it. In this project, we’re also working with obesity charities and campaign groups, which provides a great opportunity to make sure that the focus of our research is not just fit for academic journals but is relevant to people affected by this issue and so can be applied in the ‘real world’, as it were.

So, to finish on more of a personal note, the things I said about myself the last time I wrote one of these blog posts are still true: I still like walking, I still travel lots, I still read fantasy and science fiction, I still do pub quizzes, my football team are still rubbish and I don’t think I’ve changed that much since the photo used in that piece was taken… Most of all, though, it still excites me to be a part of CASS and I am absolutely delighted to be back.

ESRC Postdoctoral Fellowship: The psychological validity of non-adjacent collocations

Having recently completed my PhD in CASS, I am really excited to announce that I have been awarded an ESRC Postdoctoral Fellowship for the upcoming academic year.

My research focuses on finding neurophysiological evidence for the existence of collocations, i.e. sequences of two or more words where the words are statistically highly likely to occur together. There are a lot of different types of collocation, and the different types vary along the dimensions of fixedness and compositionality. Idioms, for example, are highly fixed in the sense that one word cannot typically be substituted for another word. They are also non-compositional, which means that the meaning of the expression cannot be derived from knowing the meaning of the component words.

Previous studies investigating the psychological validity of collocation have tended to focus on idioms and other highly fixed expressions. However, this massively limits the generalizability of the findings. In my research, I therefore use a much more fluid conceptualization of collocation, where sequences of words can be considered to be collocational even if they are not fixed, and even if the meaning of the expression is highly transparent. For example, the word pair clinical trials is a collocation, despite lacking the properties of fixedness and non-compositionality, because the word trials is highly likely to follow the word clinical. In this way, I focus on the transition probabilities between words; the transition probability of clinical trials (as measured in a corpus) is much higher than the transition probability of clinical devices, even though the latter word pair is completely acceptable in English, both in terms of meaning and grammar.
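To make the notion of transition probability concrete, here is a minimal sketch of how it could be estimated from corpus counts: the number of times the second word directly follows the first, divided by the number of times the first word occurs with any following word. The toy token list and the function name are assumptions for illustration only; they are not taken from the BNC1994 or from the procedure used in the thesis.

```python
from collections import Counter


def transition_probability(tokens, first, second):
    """Estimate P(second | first) from adjacent word pairs in tokens."""
    # Count each word only where it has a following word.
    first_counts = Counter(tokens[:-1])
    pair_counts = Counter(zip(tokens, tokens[1:]))
    if first_counts[first] == 0:
        return 0.0
    return pair_counts[(first, second)] / first_counts[first]


# Hypothetical miniature "corpus".
toy_tokens = (
    "clinical trials show that clinical trials need clinical devices"
).split()

print(transition_probability(toy_tokens, "clinical", "trials"))
print(transition_probability(toy_tokens, "clinical", "devices"))
```

In this toy example, trials follows clinical more often than devices does, so the first probability is the higher of the two, which mirrors the contrast described above.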

In my research, I extract collocational word pairs such as clinical trials from the written BNC1994. I then construct matched non-collocational word pairs such as clinical devices, embed the two sets of word pairs into corpus-derived sentences, and then ask participants to read these sentences on a computer screen while electrodes attached to their scalp detect some of their brain activity. This method of recording the electrical activity of the brain using scalp electrodes is known as electroencephalography, or EEG. More specifically, I use the event-related potential (ERP) technique of analysing brainwave data, where the brain activity is measured in response to a particular stimulus (in this case, collocational and non-collocational word pairs).

My PhD consisted of four ERP experiments. In the first two experiments, I investigated whether or not collocations and non-collocations are processed differently (at the neural level) by native speakers of English. In the third experiment, I did the same but with non-native speakers of English. Having found that there are indeed neurophysiological differences in the way that collocations and non-collocations are processed by both native and non-native speakers, I then conducted a fourth experiment to investigate which measures of collocation strength most closely correlate with the brain response. The results of this experiment have really important implications for the field of corpus linguistics, as I found that the two most widely used measures of collocation strength (namely log-likelihood and mutual information) are actually the two that seem to have the least psychological validity.
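For readers unfamiliar with the two measures named above, the following sketch shows one standard way of computing them from a 2x2 contingency table of corpus frequencies. The exact variants and settings used in the experiments are not stated here, so the formulas (mutual information as the base-2 log of observed over expected frequency, and the log-likelihood ratio statistic) and all the frequencies are assumptions for illustration.

```python
import math


def mutual_information(pair_freq, freq1, freq2, corpus_size):
    """MI score: log2 of observed vs. expected co-occurrence frequency."""
    expected = freq1 * freq2 / corpus_size
    return math.log2(pair_freq / expected)


def log_likelihood(pair_freq, freq1, freq2, corpus_size):
    """Log-likelihood ratio over the four cells of a 2x2 contingency table."""
    observed = [
        pair_freq,                               # word1 with word2
        freq1 - pair_freq,                       # word1 without word2
        freq2 - pair_freq,                       # word2 without word1
        corpus_size - freq1 - freq2 + pair_freq, # neither word
    ]
    row_totals = [freq1, freq1, corpus_size - freq1, corpus_size - freq1]
    col_totals = [freq2, corpus_size - freq2, freq2, corpus_size - freq2]
    total = 0.0
    for obs, row, col in zip(observed, row_totals, col_totals):
        expected = row * col / corpus_size
        if obs > 0:  # a cell with zero observations contributes nothing
            total += obs * math.log(obs / expected)
    return 2 * total


# Hypothetical frequencies: the pair occurs 10 times, each word 100 times,
# in a corpus of 100,000 tokens.
print(round(mutual_information(10, 100, 100, 100_000), 2))
print(round(log_likelihood(10, 100, 100, 100_000), 2))
```

Both scores grow as the pair occurs more often than chance would predict, but they weight rare and frequent pairs differently, which is one reason their rankings of collocations can diverge.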

The ESRC Postdoctoral Fellowship is unique in that, although it allows for the completion of additional research, the main focus is actually on disseminating the results of the PhD. Thus, during my year as an ESRC Postdoctoral Fellow, I intend to publish the results of my PhD research in high-impact journals in the fields of corpus linguistics and cognitive neuroscience. I will also present my findings at conferences in both of these fields, and I will attend training workshops in other neuroscientific methods.

The additional research that I intend to do during the term of the Fellowship will build upon my PhD work by using the ERP technique to investigate whether or not the neurophysiological difference in the processing of collocations vs. non-collocations is still apparent when the (non-)collocations contain intervening words. For instance, I want to find out whether or not the collocation take seriously is still recognized as such by the brain when there is one intervening word (e.g. take something seriously) or two intervening words (e.g. take the matter seriously), and so on.

Investigating the processing of these non-adjacent collocations is important for the development of linguistic theory. While my PhD thesis focused on word pairs rather than longer sequences of words in order to reduce the number of factors that might influence how the word sequences were processed, making it feasible to conduct controlled experiments, this is actually a very narrow way of conceptualizing the notion of collocation; in practice, words are considered to form collocations when they occur in one another’s vicinity even if there are several intervening words, and even if the words do not always occur in the same order. I will therefore use the results of this additional research to inform the design of research questions and methods for future work engaging with yet more varied types of collocational pattern. This will have important implications for our understanding of how language works in the mind.

I would like to conclude by expressing my gratitude to the ESRC for providing funding for this Fellowship. I am very grateful to be given this opportunity to disseminate the results of my PhD thesis, and I am very excited to carry out further research on the psychological validity of collocation.

Compiling a trilingual corpus to examine the political and social representation(s) of ‘people’ and ‘democracy’

I am a visiting researcher at CASS from mid-October 2017 until the end of August 2018, coming from the University of Athens, where I am Associate Professor of Corpus Linguistics and Translation. My research aim is to investigate critical aspects of populist discourses in Europe and their variation, especially during and after the 2008 financial (and then social and political) crisis, and to reveal patterns of similarity and difference (and, tentatively, of interconnectedness and intertextuality) across a wide spectrum of political parties, think tanks and organisations. As this is essentially a Corpus-Assisted Discourse Study (CADS), a first way into examining the data is to identify and statistically analyse collocational patterns and networks that are built around key lexemes (e.g. ‘people’, ‘popular’, ‘democracy’, in this scenario), before moving on to critically correlating such quantitative findings with the social and political backdrop(s) and crucial milestones.

The first task of this complex corpus-driven effort, which is now complete, has been to compile a large-scale trilingual (EN, FR, EL) ‘focus’ corpus. This has been a tedious technical process: before the data could be examined in a consistent manner, several problems needed to be addressed and solutions implemented, as outlined below.

  1. A key primary aim was to gather as much data as possible from the websites of political parties, political organisations, think tanks and official party newspapers from the UK, France and Greece, so it was clear from the outset that it would not be possible to cull the corpus data manually, given the sheer number of sources and texts. On the other hand, automatic corpus compilation tools (e.g. BootCaT and WebBootCaT in SketchEngine) could not handle the extent and the diversification of the corpora. To address this problem, texts were culled using web crawling techniques (‘wget -r’ in Linux bash) and the HTTrack app, with a lot of tweaking and the necessary customisation of download parameters, to account for the (sometimes very tricky) batch download restrictions of some websites.
  2. Clean up HTML boilerplate (i.e., sections of code, advertising material, etc. that are included in HTML pages but are irrelevant to the corpus text). This was accomplished using Justext (the app used by M. Davies to compile the NOW corpus), with a few tweaks so as to be able to handle some ‘malformed’ data, especially from Greek sources.

As I plan to specifically analyse the variation of key descriptors and qualifiers (‘key’ keywords and their c-collocates) as a first way into the “charting” of the discourses at hand, the (article or text) publication date is a critical part of the corpus metadata, one that needs to be retained for further processing. However, most if not all of this information is practically lost in the web crawling and boilerplating stages. Therefore, the HTML clean-up process was preceded by the identification and extraction of the articles’ publication dates, using a PHP script that was developed with the help of Dr. Matt Timperley (CASS, Lancaster) and Dr. Pantelis Mitropoulos (University of Athens). This script scans all files in a dataset, accounts for all possible date formats in all three languages, and then automatically creates a CSV (tab-delimited) table that contains the extracted date(s), matched with the respective filenames. Its accuracy is estimated at ca. 95% and can be improved further by checking the output and rescanning the original data with a few code tweaks.
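The general idea of such a date-extraction step can be sketched as follows. This is not the PHP script mentioned above, only a simplified Python illustration under stated assumptions: it covers just three date formats (ISO, day-first slashed dates, and "day month-name year"), the month-name lists and function names are my own, and slashed dates are assumed to be day-first.

```python
import re
from datetime import date

# Month names in EN, FR and EL (Greek genitive forms); lists are illustrative.
MONTHS = {}
for names in (
    ["january", "february", "march", "april", "may", "june", "july",
     "august", "september", "october", "november", "december"],
    ["janvier", "février", "mars", "avril", "mai", "juin", "juillet",
     "août", "septembre", "octobre", "novembre", "décembre"],
    ["ιανουαρίου", "φεβρουαρίου", "μαρτίου", "απριλίου", "μαΐου", "ιουνίου",
     "ιουλίου", "αυγούστου", "σεπτεμβρίου", "οκτωβρίου", "νοεμβρίου",
     "δεκεμβρίου"],
):
    for number, name in enumerate(names, start=1):
        MONTHS[name] = number

ISO = re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b")
SLASHED = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")  # assumed day-first
WORDED = re.compile(r"\b(\d{1,2})(?:er)?\s+(\w+)\s+(\d{4})\b")


def _safe_date(year, month, day):
    """Return a date, or None if the numbers do not form a valid date."""
    try:
        return date(year, month, day)
    except ValueError:
        return None


def extract_dates(text):
    """Return all recognised dates in text as ISO strings (YYYY-MM-DD)."""
    found = []
    for y, m, d in ISO.findall(text):
        found.append(_safe_date(int(y), int(m), int(d)))
    for d, m, y in SLASHED.findall(text):
        found.append(_safe_date(int(y), int(m), int(d)))
    for d, name, y in WORDED.findall(text):
        month = MONTHS.get(name.lower())
        if month:
            found.append(_safe_date(int(y), month, int(d)))
    return [item.isoformat() for item in found if item is not None]


print(extract_dates("Published 15 October 2017"))
print(extract_dates("Publié le 1er mai 2012"))
```

Each file's extracted dates could then be written, together with its filename, as one row of a tab-delimited table, for example with Python's csv module and delimiter="\t".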

  3. Streamline the data by removing irrelevant stretches of text (e.g. “Share this article on Facebook”) that may have been left behind during the boilerplating process. This step is carried out using Linux commands (e.g. find, grep, sed, awk) and regular expressions, and it greatly improves the accuracy of the following step.
  4. Remove duplicate files. Since onion (ONe Instance ONly: the script used e.g. in SketchEngine) only looks for pattern repetitions within a single file and within somewhat short maximum paragraph intervals, I used FSLint, an application that takes account of the files’ MD5 signature and identifies duplicates. This is extremely accurate and practically eliminates all files that have a one hundred percent text repetition across various sections of the websites, regardless of the file name or creation date (this was found to be the case mostly with political party websites, not newspapers). (NB: A similar process is also available in Mike Scott’s WordSmith Tools v7.)
  5. Order files by publication year for each subcorpus, calculate the corresponding metadata (files, tokens, types and average token count, by year) for each dataset, and filter out the “focus corpus” by keeping only the relevant files that contain node lemmas, i.e. lemmas related to the core question of this research (people*|popular|democr*|human* and their FR and EL equivalents). This was done using grep and regular expressions; note that FAR is an open-source, Java-based GUI app that combines these search options for large datasets.
  6. Finally, prepare the data for uploading to LU’s CQPWeb by appending the text publication year info, as extracted from stage 2, to the corresponding raw text file. This was done using yet another PHP script, kindly developed by Matt Timperley.
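Two of the steps listed above, removing exact duplicates by MD5 signature and filtering out the focus corpus by node lemmas, can be illustrated with a short Python sketch. This is not how FSLint or grep were actually run: the function names are hypothetical, the texts are held in memory rather than read from disk, and the regular expression is only my reading of the English search pattern (the FR and EL equivalents are omitted).

```python
import hashlib
import re

# Assumed translation of people*|popular|democr*|human* into a Python regex.
NODE_PATTERN = re.compile(
    r"\b(?:people\w*|popular|democr\w*|human\w*)\b", re.IGNORECASE
)


def md5_signature(text):
    """Return the MD5 hex digest of a text, used as its fingerprint."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(files):
    """Keep the first file for each MD5 signature.

    files maps a filename to its text; files with identical content are
    treated as duplicates regardless of their names.
    """
    seen = set()
    unique = {}
    for name, text in files.items():
        signature = md5_signature(text)
        if signature not in seen:
            seen.add(signature)
            unique[name] = text
    return unique


def focus_subset(files):
    """Keep only the files that contain at least one node lemma."""
    return {name: text for name, text in files.items()
            if NODE_PATTERN.search(text)}


# Hypothetical miniature dataset.
dataset = {
    "party/news_01.txt": "The people voted for democratic reform.",
    "party/archive_01.txt": "The people voted for democratic reform.",
    "paper/weather.txt": "Rain is expected across the north.",
}
unique_files = drop_exact_duplicates(dataset)
print(sorted(unique_files))
print(sorted(focus_subset(unique_files)))
```

Hashing identifies only files with one hundred percent identical content, which matches the kind of duplication described above; near-duplicates would need a different technique.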

In a nutshell, texts were culled from a total of 68 sources (24 Greek, 26 British, and 18 French). This dataset is divided into three major corpora, as follows:

  1. Cumulative corpus (CC, all data): 746,798 files/465,180,684 tokens.
  2. Non-journalistic research corpus (RC): 419,493 files/307,231,559 tokens.
  3. Focus corpus (FC): 205,038 files/235,235,353 tokens.