CASS has always been associated with innovation in corpus linguistics. Innovation comes in different forms and guises, such as the creation of new corpora and tools as well as novel applications of corpus methods in a wide range of areas of social and linguistic research. With increasing demands on the sophistication of corpus linguistic analyses comes the need for new tools and techniques that can respond to these demands. #LancsBox X is one such tool.
#LancsBox X is a free desktop tool that can quickly search very large corpora (millions or even billions of words), whether they consist of simple texts or richly annotated XML documents. It produces concordances, summary tables, collocation graphs and tables, wordlists and keyword lists.
On Friday 24 February 2023, a new version of #LancsBox X was released. To mark this occasion, we organised a hybrid event, which attracted over 1,300 attendees. This event was co-sponsored by CLARIN-UK. A recording of this event is available above.
On 19 November 2021, the ESRC Centre for Corpus Approaches to Social Science (CASS) organised an event to celebrate the launch of the Written British National Corpus 2014 (BNC2014). The event was live-streamed from a very special location: the medieval Lancaster Castle. There were about 20 participants on site and more than 1,200 participants joined the event online. Dr Vaclav Brezina started the event and welcomed the participants from over 30 different countries. After the official welcome by Professor Elena Semino and Professor Paul Connolly, a series of invited talks were delivered by prominent speakers from the UK and abroad. The talks covered topics such as corpus development, corpora in the classroom, corpora and fiction and the historical development of English.
The BNC2014 is now available, together with its predecessor the BNC1994, via #LancsBox X.
More information about the design and development of the Written BNC2014 is available from this open access research article:
There’s no question that all of us within society have been impacted in one way or another by the ongoing COVID-19 pandemic. However, it’s also the case that the health and wellbeing of certain groups have been particularly affected. A review of evidence on the disparities in the risks and outcomes of COVID-19 carried out by Public Health England suggests the virus has ‘replicated existing health inequalities and, in some cases, has increased them’. One group at particular risk of experiencing serious complications from COVID-19 is people living with obesity. Another report from Public Health England, reviewing evidence on the impact of excess weight on COVID-19, concluded that ‘the evidence consistently suggests that people with COVID-19 who are living with overweight or obesity, compared with those of a healthy weight, are at an increased risk of serious COVID-19 complications and death’.
In this paper, published in Critical Discourse Studies, Gavin Brookes explores how British print media representations of obesity have responded to the pandemic. The study, which is the latest in the CASS project exploring representations of obesity in the British press, is based on purpose-built corpora representing UK broadsheet and tabloid coverage of obesity during the pandemic. The analysis involved the use of keywords which were obtained by comparing each corpus against two reference corpora: one representing general press coverage of COVID-19 and the other representing general coverage of obesity in the six months leading up to the start of the pandemic. In this way, the study could account for keywords (and attendant discourses) that were characteristic of press representations of obesity during the pandemic relative not only to general coverage of obesity, but also general coverage of the virus.
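The comparison against two reference corpora can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the procedure reported in the paper: it assumes log-likelihood as the keyness statistic, uses an illustrative cut-off of 6.63 (roughly p < 0.01 for one degree of freedom), and all function names are hypothetical.

```python
import math
from collections import Counter


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood (G2) keyness score for one word in two corpora."""
    total = freq_target + freq_ref
    combined_size = size_target + size_ref
    expected_target = size_target * total / combined_size
    expected_ref = size_ref * total / combined_size
    g2 = 0.0
    # A zero observed frequency contributes nothing to the sum.
    if freq_target > 0:
        g2 += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        g2 += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * g2


def keywords(target, reference, threshold=6.63):
    """Words relatively more frequent in target whose score meets the threshold."""
    size_t = sum(target.values())
    size_r = sum(reference.values())
    result = set()
    for word, freq_t in target.items():
        freq_r = reference.get(word, 0)
        overused = freq_t / size_t > freq_r / size_r
        if overused and log_likelihood(freq_t, size_t, freq_r, size_r) >= threshold:
            result.add(word)
    return result


def keywords_against_both(target, reference_a, reference_b, threshold=6.63):
    """Keep only words that are key against both reference corpora."""
    return keywords(target, reference_a, threshold) & keywords(target, reference_b, threshold)


# Toy frequency lists (invented numbers, for illustration only).
pandemic_obesity = Counter({"obesity": 50, "death": 40, "the": 500})
general_obesity = Counter({"obesity": 50, "death": 2, "the": 500})
general_covid = Counter({"obesity": 1, "death": 3, "the": 600})

print(keywords_against_both(pandemic_obesity, general_obesity, general_covid))
```

In this toy example, "obesity" is key only against the general COVID-19 list, whereas "death" is key against both lists, so only "death" survives the intersection. This mirrors the logic of accounting for both kinds of general coverage at once.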
Compared to this more general reportage, both broadsheet and tabloid reporting of obesity during the pandemic was found to be more fatalistic, with people with obesity being particularly likely to be construed as dying, or at least as being at heightened risk of dying, from the virus. For the broadsheets, this is a marked change in tone, with the pandemic seemingly ushering in a more pronounced focus on the connection between obesity and mortality. While such fatalistic discourses are characteristic of tabloid coverage of obesity in general, it seems that this way of framing obesity has gained even more prominence in these newspapers during the pandemic.
People with obesity were also depicted as a strain on an already-overburdened NHS, for example by taking up hospital beds and requiring oxygen therapy to the extent that this need creates a shortage for the rest of the population. The solution put forward (particularly by the tabloids, and to a lesser extent the right-leaning broadsheets) is for people with obesity to lose weight, for instance through exercise and supplements, in order to ‘save’ the NHS. This results in the responsibilisation of people with obesity, both for ensuring their own health and that of the wider public. This includes being responsible for ‘saving’ the NHS, though notably the damage endured by the NHS at the hands of austerity politics over the last decade is, conveniently, elided.
The link between obesity and coronavirus affords the press a means by which it can maintain the newsworthiness of obesity in the context of what is, in COVID-19, a news story of global relevance. Meanwhile, the fatalistic and responsibilising depictions allow news agencies to key into the news value of negativity. Yet, discourses of personal responsibility are often criticised because they typically fail to grasp that obesity (along with other so-called ‘lifestyle’ conditions) is not simply the outcome of individual lifestyle choices, but likely results from a variety of factors (both individual and socio-political), over which individuals often have limited control. When the newspapers offer a public figure as privileged and powerful as Boris Johnson as a ‘role model’ for readers wanting to lose weight, they risk overlooking the influence of factors such as social privilege in the development of obesity.
When individuals and groups are blamed for problems in society, the result is the creation and propagation of stigma. The way much of the press has reported on obesity during the pandemic represents a ‘ramping up’ of weight stigma, with people with obesity not only being blamed for their own health challenges but also shouldering responsibility for problems with the NHS against the backdrop of the most severe public health crisis of modern times. The weight stigma that results from this kind of blame-loading may engender further negative attitudes towards people with obesity, resulting in internalised shame. Yet the consequences of weight stigma may also be intensified by the circumstances surrounding the pandemic, which have already adversely affected the population’s mental health. Meanwhile, the aforementioned report by Public Health England stated that ‘stigma experienced by people living with obesity, may delay interaction with health care and may also contribute to increased risk of severe complications arising from COVID-19’.
It’s not all doom-and-gloom, though, as the pandemic also seems to have given rise to other, less stigmatising, changes to the press’s approach to obesity. For example, the broadsheets, and to a lesser extent the tabloids, also focussed more on race-related health disparities than they do in usual coverage of obesity. Meanwhile, the right-leaning tabloids offered otherwise uncharacteristic criticism of the UK Government, in particular for its ‘Eat Out to Help Out’ scheme, which was presented as hypocritical for encouraging people to eat out on the one hand while imploring them to lose weight on the other. From the perspective of promoting more balanced obesity coverage, which cuts across political allegiances, this could be viewed as an encouraging sign.
However, it remains to be seen whether this, along with the other changes to press discourse ushered in by the pandemic, will be lasting or particular to this unique and unprecedented news context.
Lancaster University is very proud to offer MA and Postgraduate Certificate programmes in Corpus linguistics. The programmes aim to equip students with skills that will enable them to analyse large amounts of linguistic data (corpora) using cutting-edge computational technology.
We asked our future students a few questions about their interests and motivation to study at Lancaster.
Alexandra Terashima: “Applying for this program represents a major pivot in my life.”
Hello! My name is Alexandra Terashima and I’ve recently been accepted into the Corpus Linguistics (Distance) MA program. I am originally from Russia, but I grew up and studied in the United States, and currently, I am living in Japan.
I feel incredibly grateful to have been selected to receive a bursary to support my studies towards an MA in Corpus Linguistics.
Can you tell us a little bit about your background and research interests?
Applying for this program represents a major pivot in my life—I already have a PhD in genetics and worked for several years as a researcher in a lab. But something was missing for me and a few years ago, I stepped away from the bench and turned towards the communication side of science, spending a few years helping scientists edit and revise papers for publication, which led to my current position, teaching academic writing to English language learners.
My research interests include language acquisition, in particular how learners of English acquire knowledge of formulaic language, such as collocations and multi-word phrases, particularly ones that are used in specific genres of writing, such as scientific literature.
Why have you applied to study MA in Corpus linguistics at Lancaster University?
While perhaps I am not a traditional MA program student, I applied to this program after careful consideration of my future career goals. During my time as a biology researcher, I was fascinated by the fact that, while scientific articles play a big role in the career of a scientist, the conventions of how to write scientific articles are not taught to science students at either the undergraduate or graduate level. Instead, students are expected to learn how to write from their supervisor and other lab members.
When I worked as an in-house editor at a research institute, I saw first-hand how the quality of writing can influence an editor’s response to and reviewers’ comments on a submitted manuscript, regardless of the quality of the scientific findings. Through working closely with scientists to help them improve their papers for publication, I became interested in education, and five years ago started working at the University of Tokyo, teaching academic writing to undergraduate students. Also five years ago, I was introduced to corpus linguistics at an English for Specific Purposes conference where I heard talks by Laurence Anthony and Paul Thompson. The methodology of systematic analysis of language for patterns appealed to me and I began exploring this area of research in the context of my teaching. My career goal is to have a position in academia that combines teaching, research and supervision of graduate students, but I feel that I need additional qualifications to achieve my goals. I have been contemplating an MA in applied linguistics for several years as a way to acquire research training and qualifications in this field. In parallel, I became aware of Lancaster University as one of the leaders in the field of corpus linguistics by reading literature and taking part in the Corpus Linguistics MOOC on FutureLearn. Last fall, when I saw the announcement for this new distance MA program in corpus linguistics, I knew it was time to apply!
Can you tell us a little bit about the topic you have selected for your MA dissertation?
Because of my strong interest in formulaic language, the topic of my MA dissertation focuses on the use of corpus analysis tools to measure and visualize phraseological development in spoken L2 English. In particular, I will explore whether different levels of L2 proficiency can be distinguished by differences in the knowledge of collocations and, if so, which statistical measures for identifying collocations are most effective. This project will utilize the Trinity Lancaster Corpus, which, in addition to being the largest spoken learner corpus of its kind, is rich in metadata that allows users to quickly access the data of interest, such as the samples from different levels of L2 proficiency. I will also need to learn my way around #LancsBox for this project, which no doubt will be an invaluable tool in my future research.
Why have you selected this topic?
As a lifelong language learner, I am fascinated by how people acquire language, are taught language and ultimately, how they use language. I believe formulaic language, namely collocations and collocation networks, is one of the cornerstones of language study that can help improve learner motivation and accelerate the understanding of an L2 language.
I selected this topic because I am intrigued by the challenge of distinguishing collocational knowledge at different levels of L2 proficiency. I recognize the importance of such distinctions for developing assessment tools and graded teaching materials. It is also reasonable to assume that learners acquire L2 proficiency in different ways and so defining the borders between different levels of L2 English proficiency in terms of collocation knowledge is a challenging and useful endeavor, one that goes a step beyond vocabulary and grammar knowledge assessment.
What are your plans for the future?
For my future research, I would like to focus on formulaic language, specifically language used in scientific papers. I would like to help establish conventions to teach science paper writing systematically to undergraduate and graduate students, to bridge the gap for scientists struggling to publish due to the poor writing skills of their supervisor or due to being a non-native English speaker. The majority of current studies analyzing scientific papers have, understandably, been carried out by linguists. While these studies provide many useful insights, I feel that linguists’ lack of familiarity with scientific research culture, as well as with the culture of scientific publishing, doesn’t allow them to fully capture the dynamic and evolving nature of the language of scientific publications. I believe that my background as a scientist can help bridge this gap and help expand this genre of linguistic research.
Lee Daniels: Corpus linguistics at Lancaster is “a fantastic opportunity for me!”
Hi there! My name is Lee Daniels, and I am a bursary holder for the Corpus Linguistics MA at Lancaster University.
I am a 28-year-old North Yorkshireman turned Mancunian, who has lived in Salford for the past seven years. I have just completed my B.A. (Hons) Linguistics undergraduate degree with Manchester Metropolitan University, and I am incredibly excited for this fantastic opportunity with Lancaster University!
So! Let me tell you a little bit about myself in a mini-interview format.
Can you tell us a little bit about your background and research interests?
I began my higher education relatively late, that is, it was not until the age of 25 that I entered Manchester Metropolitan University (MMU) as a mature student studying Linguistics and Italian. Prior to this I was working as a Third-Party Liability and Credit-Hire Motor Claims Handler. However, for a multitude of reasons, I decided that this career path was not for me and I wanted to dedicate my efforts to something where my passions lay. That passion was (and still is) any and all things Linguistics! Subsequently, I studied, paid for, and completed the qualifications needed (iGCSE and A-Level Italian) to gain entry into university and develop these passions further.
Through three fantastic years of study at MMU, I honed these passions into particular research interests, namely the sub-disciplines of cognitive linguistics, pragmatics (with a dash of semantics) and corpus linguistics (go figure!). In particular, my interests lie in the combination of these three areas. As I argue in my undergraduate dissertation research, isolating language conceptualisation from the real-world context in which it is found is counterintuitive. Thus, in line with an emerging socio-cognitive sub-discipline, my interests lie in intertwining conceptual and pragmatic processes which may influence unique language conceptualisations and, in turn, language output.
I have found that the application of the corpus linguistic methodology, with its ever-developing capabilities thanks to ever-emerging new technology, provides a fantastic opportunity to offer some substantiation or refutation of such claims (although I hope the former!). The integration of these interests is something that I have initiated in my dissertation project and would love to continue to pursue throughout my academic career.
Why have you applied to study MA in Corpus linguistics at Lancaster University?
Lancaster not only has one of the best Linguistics departments in the world, but the corpus work coming out of the institution is also at the cutting edge of the discipline. During my time at MMU, I often utilised the corpus work of Lancaster scholars to demonstrate the benefits and applicability of its methodology, be it through Baker, Brezina, McEnery, Hardie, Semino or Culpeper (and many more). I thus quickly learned of Lancaster’s position at the forefront of the field.
I have also had the pleasure of working with some Lancaster alumni, such as Professor Dawn Archer and Dr Sean Murphy, in a corpus-led research project looking at Shakespeare’s representation of gender in his works. This was via the utilisation of the Enhanced Shakespearean Corpus (ESC) and CQPweb (developed at Lancaster). Additionally, I enjoy a fantastic and productive working relationship with Dr Lexi Webster, which I hope will continue for many years and produce exciting work. Ultimately, I applied to Lancaster because I want to contribute to, and be associated with, the incredible work and people that are associated with the institution.
Can you tell us a little bit about the topic you have selected for your MA dissertation?
I have chosen to study disagreement strategies in spoken L2 English (English not as a native language). This study will utilise the Trinity Lancaster Corpus (TLC), developed at the ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, in collaboration with Trinity College London. TLC contains the largest body of spoken L2 English across all corpora and is thus best placed for the application of this MA dissertation piece. The topic selected allows the analysis of a complex pragmatic process (disagreement) through empirical means, whilst at the same time complementing it with in-depth qualitative analysis. The findings obtained from this analysis may then enhance our understanding of second language pragmatic abilities and of communicative strategies in language testing, and may thus contribute to greater understanding and improved practice within TESOL/TEFL contexts.
Why have you selected this topic?
What drew me to this topic was the opportunity to provide great insight into a pragmatic communicative strategy; it also allows me to explore my research interests. That is, the project allows me to further explore the conceptual/contextual practices that are behind pragmatic strategy constructions.
Using corpora to provide substantiation for such a complex pragmatic phenomenon also falls in line with my interests. I think we are in an exciting time for Linguistics because the technology associated with corpora is only getting better and more capable. With that expansion, all sorts of new research may be attempted into complex phenomena (like L2 English disagreement strategies!) that was previously not feasible. Therefore, to be at an institution that fully resonates with this thinking is a fantastic opportunity for me!
What are your plans for the future?
More Linguistics! In other words, my aim is to become a Lecturer within the field. In addition to having a passion for the Linguistic discipline, I love rambling on about it too (if you have not guessed already)! I developed this at MMU by applying it in a teaching capacity in both paid and voluntary roles. I find teaching a topic that I am genuinely passionate about, and trying to stir that same passion in others, to be incredibly rewarding. To reach this goal, I need to acquire my PhD and would love this to be at Lancaster via a similar corpus-led opportunity. It will require a lot of hard work, but I am as committed now as I was on day one when I started this incredibly rewarding journey!
A new ESRC-funded project based in CASS will apply the methods of corpus linguistics to arrive at new understandings of vaccine hesitancy, which the World Health Organization lists among the top 10 global health challenges, and defines as ‘a delay in acceptance or refusal of vaccines despite availability of vaccination services’.
Vaccine hesitancy is often a consequence of views and attitudes that are formed and exchanged through discourse, for example by reading the news, listening to politicians and interacting on social media. The ‘Quo VaDis’ project (Questioning Vaccination Discourse) will employ corpus linguistic methods to study systematically the ways in which vaccinations are discussed, both currently and historically, in the UK press, UK parliamentary debates, and social media (Twitter, Reddit and Mumsnet). The goal is to arrive at a better understanding of pro- and anti-vaccination views, as well as undecided views, and to use the findings to inform future public health campaigns about vaccinations, in collaboration with public health agencies. For more information: https://www.lancaster.ac.uk/vaccination-discourse/ Twitter: @vaccine_project
POS Tagging for Georgian is now available in #LancsBox
We are delighted to announce that part-of-speech tagging for Georgian is now available in #LancsBox. This is the very first Georgian POS tagger made available for a wide range of users and uses. It enables users to perform various linguistic analyses on their own texts or corpora in #LancsBox.
The POS tagger for Georgian was developed within my PhD project (Computational analysis of morphosyntactic categories in Georgian) at the University of Leeds. The tagset design part of my research was conducted in the Centre for Corpus Approaches to Social Science at Lancaster University and was supervised by Dr Andrew Hardie. Thanks to Dr Vaclav Brezina, the lead developer of #LancsBox, this tagger is now available for use in #LancsBox (Brezina et al., 2015, 2018, 2020).
I use a probabilistic methodology (TreeTagger) and an enclitic tokenisation approach to perform tagging in Georgian. The accuracy of part-of-speech tagging is 92%. The tagger program uses a new morphosyntactic language model (developed for POS tagging purposes) and the KATAG tagset (219 tags) based on this model. The KATAG tagset is a hierarchical-decomposable tagset, which allows the user to search for different sections of the paradigm.
#LancsBox is a very powerful corpus analysis tool. It can be used at different levels of analysis of language data and corpora. It automatically annotates data for part-of-speech and can be used to find frequencies of different word classes such as nouns, verbs etc., to compute frequency and dispersion measures for POS tags, and to find and visualise co-occurrence of grammatical categories. It can also find complex linguistic structures using ‘smart searches’. For example, there are 60 ‘smart searches’ available for Georgian in #LancsBox such as:
ADJECTIVES GENITIVE CASE: adjectives in the genitive case
ADVERBS: any adverbs
NOUNS ERGATIVE CASE: nouns in the ergative case
PRONOUNS DEMONSTRATIVE: demonstrative pronouns
PRONOUNS INTERROGATIVE: interrogative pronouns
PRONOUNS PERSONAL: personal pronouns
VERBS AORIST TENSE: verbs in the aorist tense
VERBS I PERSON: verbs with a 1st person subject
VERBS II PERSON PLURAL: verbs with a 2nd person plural subject
VERBS IMPERFECT TENSE: verbs in the imperfect tense
To demonstrate how to use ‘smart searches’ in #LancsBox, I use a small Covid19 corpus (229,481 tokens). I am interested in finding out which verbs immediately follow the word coronavirus. Thus, my search term is: კორონავირუსი VERBS
The image displays alphabetically arranged concordance lines in #LancsBox, showing the immediate contexts in which the search term is used. This allows me to analyse the frequency and dispersion of the node კორონავირუსი (coronavirus) immediately followed by a verb. Here it occurs 37 times (1.612 per 10k) in the Covid19 corpus, in 10 out of 11 texts.
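The relative frequency quoted above can be reproduced from the raw counts. The short sketch below is only a check of the arithmetic: the function names are illustrative and are not part of #LancsBox, and the simple proportion of texts containing the node is shown as a basic indication of spread, not necessarily the dispersion measure #LancsBox reports.

```python
def relative_frequency(hits, corpus_size, base=10_000):
    """Frequency normalised to a common base (here: per 10,000 tokens)."""
    return hits / corpus_size * base


def text_range(texts_with_hits, total_texts):
    """Proportion of texts in which the search term occurs at least once."""
    return texts_with_hits / total_texts


# Figures taken from the Covid19 corpus example above.
print(round(relative_frequency(37, 229_481), 3))  # 1.612
print(round(text_range(10, 11), 2))               # 0.91
```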
We are delighted to announce the forthcoming publication of a book based on research carried out as part of the CASS project, ‘Representations of Obesity in the News’. The book, titled Obesity in the News: Language and Representation in the Press, will be published by Cambridge University Press in 2021. You can see a sneak preview of the cover here!
The book reports analysis of a 36 million-word corpus of all UK national newspaper articles mentioning obese or obesity published over a ten-year period (2008-2017). This analysis combines methods from Corpus Linguistics with Critical Discourse Studies to explore the discourses that characterise press coverage of obesity during this period. The book explores a wide range of themes in this large dataset, with chapters that answer the following questions:
• What discourses characterise representations of obesity in the press as a whole?
• How do obesity discourses differ according to newspapers’ formats and political leanings?
• How have obesity discourses changed over time, and how do they interact with the annual news cycle?
• How does the press use language to shame and stigmatise people with obesity, and how are attempts to ‘reclaim’ the notion of obesity depicted?
• What discourses surround the core concepts of the ‘healthy body’, ‘diet’ and ‘exercise’ in press coverage of obesity?
• How do obesity discourses interact with gender, and how does this influence the ways in which men and women with obesity are represented?
• How does the press talk about social class in relation to obesity, and how do such discourses contribute to differing depictions of obesity in people from different social class groups?
• Finally, how do audiences respond to press depictions of obesity in below-the-line comments on online articles?
The book will be the latest output from this project. You can read more about our work on changing representations of obesity over time in this recent Open Access article published in Social Science & Medicine. We are also working on articles which expand our analysis of obesity and social class, depictions of obesity risk, and obesity discourses in press coverage of the coronavirus pandemic, so keep your eyes peeled for further announcements!
I am honoured to have received the Institute for Corpus Research Outstanding Doctoral Thesis Award. The purpose of this annual award is to recognise and reward theses in the field of Corpus Linguistics.
I conducted my PhD research in the Centre for Corpus Approaches to Social Science at Lancaster University, which is part of the Department of Linguistics and English Language. My thesis was titled Collocational Processing in Typologically Different Languages, English and Turkish: Evidence from Corpora and Psycholinguistic Experimentation. Some of the findings based on my PhD research are reported in this article. The study was multidisciplinary, involving both corpus analysis and psycholinguistic experimentation. My supervisors, Dr Vaclav Brezina and Prof Patrick Rebuschat, played a key role in shaping the thesis. Their academic knowledge and insight have been invaluable in developing a multidisciplinary perspective to pioneer a contrastive study of English and Turkish.
Turkish, with its rich morphology, differs from English – prompting questions about whether the same variables affect collocational processing in the two languages. Importantly, so far the vast majority of research on collocational processing has focussed on a narrow range of primarily European languages, especially English, which makes it difficult to generalise the findings to other languages. Corpus analyses showed that uninflected collocations have similar mean frequencies and association counts in both languages. When inflected forms were included, 75% of the Turkish collocations occurred at a higher frequency than the collocations in English, suggesting that language typology impacts frequency of collocations.
I then conducted psycholinguistic experiments to understand the differences and similarities between the processing of collocations in English and Turkish and by native and non-native English speakers. To what extent is there a difference between native-speakers’ (of English and Turkish) sensitivity to both individual word-level and phrase level frequency information when processing collocations? Mixed-effects regression modelling revealed that Turkish and English native-speakers are equally sensitive to collocation frequencies, confirming collocations’ psychological reality in both languages. Yet English speakers were additionally affected by individual word-frequencies, indicating that language typologies require users to process collocations from different sources of information.
Furthermore, this thesis investigated the effects of individual word and collocational frequency on native and non-native speakers’ collocational processing in English. Both groups of participants demonstrated sensitivity to individual word and collocation frequency. The findings align with the predictions of usage-based approaches that language acquisition should be viewed as a statistical accumulation of experiences that changes every time we encounter a particular utterance.
This study identified both universal fundamentals and language-specific differences in collocational processing. It addressed language typology and second-language learning through a novel multidisciplinary approach which reinforces and challenges usage-based theories of language learning, demonstrating that they should include typologically different languages to develop broader perspectives on processing.
Please see the link here for more information about this award.
If you have any questions, or are interested in working with me, get in touch.
Dr Doğuş Can Öksüz, Research Fellow at the University of Leeds. firstname.lastname@example.org
A new three-year project based in CASS will use corpus linguistic methods to study how vaccinations (including future vaccines for Covid-19) are talked about in the UK press, UK parliamentary discourse and social media. Through collaborations with governmental and public health partners, the findings will be used to help address vaccine hesitancy, which is one of the World Health Organization’s top 10 global health challenges.
The project will start in March 2021 and is funded by the Economic and Social Research Council, part of UK Research and Innovation.
To find out more, read Lancaster University’s announcement and watch a brief introduction to the project by Principal Investigator Elena Semino.
We are delighted to announce that we are joining the International Consortium for Communication in Health Care. The Consortium is led by the Australian National University, and also includes University College London, Nanyang Technological University, the University of Hong Kong and Queensland University of Technology. The aim of the Consortium is to conduct research that increases understanding of communication about illness in clinical and non-clinical settings, and to translate research findings into changes in education and practice that will improve the experiences and safety of patients.