#LancsBox: The emerging historical linguist’s MO? A brief case study of Aramaic.

By: Charbel El-Khaissi

I took Lancaster University’s free Corpus Linguistics course (Corpus MOOC) to fill time. Three months later, a doctoral research proposal enabled by #LancsBox, a software tool introduced in the course, was accepted at the Australian National University.

For as long as they have been studied, ancient Semitic languages have been approached through classical philology. Naturally, a tension exists between this tradition and contemporary approaches in computational linguistics. It would be unfair to characterise this divide as a mere consequence of ‘old-school’ scholars resisting technological change in research, because philology is an inherent part of the study. The study of any ancient language requires far more human involvement than a machine can achieve: a careful hand to conserve and restore manuscripts, a keen eye for epigraphic analysis and a well-rounded, learned mind to interpret literature in medias res, politically, theologically and societally. However, insofar as the researcher is open to computer-assistive technology, #LancsBox fills a gap in historical linguistics, especially in the field of Semitic historical syntax.

As a case in point, consider the Aramaic language: the longest continuously spoken Semitic language, with an attested lifespan of approximately 3,000 years. It offers linguists intriguing insights into how human languages change over a substantial time period, including changes in underlying structure (i.e., grammar and syntax). If these changes are substantiated, they may offer important clues about the evolution of human cognition itself. Yet the historical syntax of Aramaic remains largely underrepresented and understudied. A few commendable scholars have undertaken the task of analysing developments in areas of Aramaic grammar (e.g., Huehnergard, 2005; Rubin, 2005; Grassi, 2009; Pat-El, 2012; Coghill, 2012). Among other reasons, the lack of rigorous study in this discipline is due to the labour-intensive task of qualitatively analysing large corpora. This task is made more difficult by manual transcription and grammatical tagging, in addition to administrative duties such as record management and categorisation. Recent advancements in Aramaic computational linguistics – including, but not limited to, Handwritten Text Recognition (HTR) technology and digital archives – have significantly reduced the time needed for text transcription and tagging. However, the diachronic analysis of large corpora remains tedious without free, user-friendly and accessible corpus software like #LancsBox.

My doctoral research is among the first studies in Semitic historical linguistics to experiment with Lancaster University’s #LancsBox corpus software and analyse Aramaic syntax over time. Thus far, it has proven to be an exceptional tool for data management and diachronic analysis (see Figure 1 and Figure 2):

• Corpus management: the ease of creating, storing and analysing (sub-)corpora based on variables of interest (e.g., by dialect, century, author) reduces administrative overhead and gives me more time to test different hypotheses according to multiple variables.

• POS-tagging: in addition to offering POS tagging in a number of languages, #LancsBox caters to self-tagged corpora. This means I can import datasets that have been annotated according to my own tagging scheme, which gives me flexibility when testing the robustness of tag sets according to various theoretical frameworks.

As with any computer software, a few caveats are worth mentioning to historical Semitic linguists interested in using the software for their research.

• Coding: basic knowledge of Regular Expression coding is needed to execute meaningful, in-context searches.

• Font: in its current version (5.0), Aramaic is only partially supported, with some fonts appearing disconnected. This makes in-tool legibility difficult, but not impossible.

• Text-direction: in its current version (5.0), Aramaic texts appear reversed (e.g., “cat” appears “tac”). Current workarounds include (1) using free, online tools to reverse the text prior to import, or (2) conducting analysis outside the tool.
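For readers who would rather script workaround (1) than rely on an online tool, the following is a minimal Python sketch of reversing text before import. It is an illustration only: the function names are hypothetical, it assumes that combining marks (such as vowel points) should stay attached to the letter they follow, and it has not been checked against how #LancsBox 5.0 actually renders the result.

```python
import unicodedata


def reverse_line(line):
    """Reverse one line, keeping combining marks attached to their base letter."""
    clusters = []
    for ch in line:
        # Unicode categories starting with "M" are marks (e.g. vowel points);
        # attach them to the preceding base character instead of moving them.
        if unicodedata.category(ch).startswith("M") and clusters:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return "".join(reversed(clusters))


def reverse_text(text):
    """Reverse each line separately so that the line order is preserved."""
    return "\n".join(reverse_line(line) for line in text.split("\n"))


if __name__ == "__main__":
    print(reverse_line("cat"))
```

Reversing line by line, rather than the whole file at once, keeps the lines in their original order; whether word order within a line then displays as intended would need to be checked in the tool itself.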

Will #LancsBox become the MO for future historical linguists? Only time will tell. It seems to me to be the only accessible software currently available for linguists who wish to design and build their own corpus, especially in underrepresented and under-resourced languages. In fact, I can think of a number of innovative applications outside the research domain as well: for example, Australian linguists might be able to use #LancsBox to investigate which linguistic features have been declining in student writing over the last decade. Perhaps then #LancsBox’s core functionalities could help academics in other fields and a wider group of users.

Watch a 60-second video of Charbel El-Khaissi’s research here.

Acknowledgements: Thank you Professor Tony McEnery, Dr Pierre Weill-Tessier and Dr Vaclav Brezina whose innovations have enabled my research. I express gratitude to my supervisory panel for their ongoing guidance.

British Muslims Caught Amidst FOGs – A Discourse Analysis of Religious Advice and Authority

By Usman Maravia

In this blog entry, I will provide an overview of my latest article which explores the writing style of Islamic advice texts on COVID-19. The issues that were addressed in these advice texts were related to the topic of mosque closures, funerary rites, fasting during Ramadan, and suspending Friday and daily prayers to help curb the spread of COVID-19. These texts were being circulated in the UK in March and April of 2020, a crucial period wherein information was passed on to address issues that, in the scope of the study, British Muslims would face in Ramadan, which began on 25th April 2020.

The context

My interest in this topic was sparked by an unfortunate COVID-19 related death of an elderly Muslim from Walsall. A family member of the deceased stated in the press that “It is imperative that we learn from this tragic loss and comply with Government guidelines to save lives”. What further caught my interest was that if the aim of the Islamic advice documents was to help Muslims stay safe during the pandemic, a unified and standardised message with collaboration between Muslim faith leaders and health professionals would have been helpful. Instead, a range of documents was in circulation, and these documents differed in their titles – leading to ambiguity about exactly what preventative measures British Muslims were to take and where exactly the authority lay.

The titles themselves varied. Some were titled fatwa, which is a non-binding legal opinion of an Islamic legal expert, but still a document that could potentially carry much influence in Muslim communities in the UK. Some documents were written by healthcare professionals and were titled guidance documents – I wondered, do these documents carry the same weight as fatwas? Yet other documents were titled neither fatwa nor guidance but were written in a hybrid style of the two categories. Again I wondered: why were these words used in the titles?

The FOG corpus

As such, I sought to identify (a) the underlying reasons behind the titling of the documents; and (b) the construction of discourses in the documents. In collaboration with my colleagues Zhazira Bekzhanova (Astana IT University, Kazakhstan), Mansur Ali (Centre for the Study of Islam in the UK, Cardiff University), and Rakan Alibri (University of Tabuk), we collected a total of 76 texts that were available online on websites of British mosques, Facebook pages and other online venues. We found that of these 76 documents, 14 were clearly titled fatwa. We also found that six documents were titled guidance documents, and an eye-catching 56 documents, which we refer to as other documents, included a range of words in their titles such as analysis, clarification, confirmation, guidelines, method, pathway, permissibility, plan of action, points, recommendation, response, ruling, and statement. This classification led to our jocular acronym FOG, i.e., fatwas, other documents, and guidance documents. This compilation then led to the creation of the specialised FOG corpus consisting of around 110,000 words.

We examined these written electronic texts in the social context of Muslims and COVID-19 in the UK. We explored the way language was used in real life in fatwas, guidance documents, and other documents. We then focused on the way the authors of these documents differed in their writing styles to create a certain impression on the audience by increasing, in Bourdieu’s terms, symbolic capital. We also drew on the representation of social actors (van Leeuwen, 1995) to decipher power relations across the FOG documents, analysing and interpreting references to social actors throughout. In addition to references to the text producers, references to the audience were analysed, explained, and interpreted through the prism of authority.

Corpus methods

We applied corpus-assisted critical discourse analysis, which helped us to uncover important patterns in relation to FOGs. Using AntConc software, we analysed word frequencies, word lists, lexical bundles, collocations, concordance plots, and concordances to detect linguistic patterns in the FOG corpus. Corpus methods also gave us the tools to detect power hierarchies and inequalities within the texts. Moreover, our corpus-assisted study supports Brookes and McEnery’s finding that texts acquire symbolic capital through an accumulation of patterns of textual cohesion and rhetorical strategies. We found that the documents appear to follow an underlying hierarchy among British Muslim scholars.

Findings

To elaborate, a particular writing style can be found across the FOG documents. We found fatwas and guidance documents to be textual opposites, whereas other documents featured greater intertextuality and maintained respect for the authority of muftis and their fatwas, but with reservations. The fatwas were found to be written by senior muftis and contained important references to the Qur’an and Muhammad, the Prophet of Islam. Fatwas also included legal terminology in Arabic related to Shariah law. Moreover, fatwas contained phrases such as ‘according to’ and ‘Allah knows best’.

Such a writing style is in accordance with the traditional writing style of fatwas and thereby holds higher symbolic capital. On the other hand, guidance documents were produced by healthcare professionals and did not contain such theologically related phrases but rather relied on scientific and medical language. Interestingly, we found the other category of documents to be written in a hybrid style of fatwas and guidance documents. Such a writing style appears to increase the symbolic capital of these documents and empowers the writers to challenge existing fatwas – whilst maintaining respect for senior muftis.

While the FOG documents reveal that multiple voices are welcome in addressing a national emergency, we recommend that a standardisation of documents, issued in collaboration with the NHS and senior muftis, could perhaps give a clearer action plan for British Muslims in future. As such, this study is intended to give an impetus to social scientists to explore the discourse of British Muslims and COVID-19 through a linguistic lens.

Our article is available to read in MDPI’s open access journal Religions. Additionally, further research is being carried out on the topic of COVID-19 by the British Islamic Medical Association (BIMA) as part of ‘Operation Vaccination’.

For my article on addressing vaccine resistance from an Islamic perspective, please read Vaccines: religio-cultural arguments from an Islamic perspective published by JBIMA.

‘Face masks’ and ‘face coverings’ in the UK press during the Covid-19 pandemic: Scottish vs. national newspapers

Carmen Dayrell, Isobelle Clarke and Elena Semino (Lancaster University)

1 Introduction

Since the beginning of the Covid-19 pandemic, the use of face masks or face coverings as a means of reducing the transmission of the virus has been a major area of debate in many countries around the world. In the UK specifically, the first nine months of 2020 saw a rapid change from a view of face masks as a medical piece of PPE that would not be appropriate or acceptable for the general population, to the establishment of non-surgical face coverings as a recommended public health measure in indoor public spaces, such as buses and supermarkets. As with other aspects of the response to the pandemic, during that time there were differences in the approach to face masks/coverings between the Scottish devolved administration and the Westminster government.

Table 1 provides a timeline summary of policy decisions concerning face masks/coverings on public transport, shops and schools in Scotland and England. For the most part, in Scotland face coverings were recommended or made mandatory earlier than in England. They are also mandatory in corridors and communal areas in Scottish schools, whereas in England this is at the school’s discretion.

| Month | Public transport | Shops | Schools |
|---|---|---|---|
| April | (28th) Scotland (recommended) | (28th) Scotland (recommended) | |
| May | (11th) England (recommended) | (11th) England (recommended) | |
| June | (15th) England (mandatory); (22nd) Scotland (mandatory) | | |
| July | | (10th) Scotland (mandatory); (24th) England (mandatory) | |
| August | | | (31st) Scotland (mandatory in corridors and communal areas) |
| September | | | (1st) England (school/college discretion in indoor communal areas) |
Table 1 – Timeline of policy decisions about the wearing of face coverings by the general public in Scotland vs. England.

Scotland has also had a lower incidence of Covid-19 than England. According to official UK government data, as of 30th December, 23 people per 1,000 had had at least one positive Covid-19 test in Scotland, in contrast with 39 people per 1,000 in England.

This blog post is concerned with references to face masks and face coverings in Scottish vs. national UK newspapers between December 2019 and August 2020, that is from the start of reports about a new type of pneumonia in Wuhan, China, up to the beginning of the 2020-21 school year in the UK.

2 Research questions

Overarching research question

How does press reporting on face masks and face coverings in Scotland compare with national UK reporting between December 2019 and August 2020?

Specific research questions

  1. How did the frequency of use of ‘face covering(s)’ vs. ‘face mask(s)’ change over time in Scottish vs. national press reporting?
  2. Were there any statistically significant differences in the relative frequencies of the use of ‘face mask(s)’ and ‘face covering(s)’, and of terms relating to places where face masks/coverings may be used, in Scottish vs. national press reporting?
  3. What are the differences and similarities in the collocations (co-occurrence of words) of ‘face mask(s)’ vs. ‘face covering(s)’ in Scottish and national press reporting?

3 Findings in brief

Finding 1 – Over time, ‘face covering(s)’ became more frequent than ‘face mask(s)’ in the Scottish press, but not in the national press.

Finding 2 – ‘Face covering(s)’ are mentioned much more often, relatively speaking, in the Scottish press than in the national press, alongside other terms for public indoor environments where they may be worn.

Finding 3 – Face ‘mask(s)’ and ‘covering(s)’ have partly different collocates, reflecting differences in status and associated narratives.

4 Data

The news aggregator service LexisNexis was used to collect articles that contained either the phrase ‘face mask(s)’ or ‘face covering(s)’ and that were published in a selection of national and Scottish newspapers in the period between 01.12.2019 and 31.08.2020.

Table 2 provides the numbers of texts and words included in each of the resulting two corpora: the Scottish Corpus and the National Corpus. For the National Corpus, we also provide figures for articles extracted from ‘broadsheet’ vs. ‘tabloid’ newspapers, constituting the Broadsheet and Tabloid subcorpora. (NB: For the national newspapers specifically, we selected the national editions only, thus excluding the Irish, Scottish and Northern Ireland editions.) Figures 1 to 4 below show the number of articles per newspaper title within each corpus.

| Corpus | Number of texts | Number of words |
|---|---|---|
| National corpus | 11,536 | 19,401,316 |
| The Broadsheet subcorpus | 6,631 | 16,657,194 |
| The Tabloid subcorpus | 2,419 | 1,264,952 |
| Scottish corpus | 1,084 | 588,894 |
Table 2: Number of texts and total number of words comprising each corpus
Figure 1: Number of texts from each national title

Figure 2: Number of texts from each broadsheet title
Figure 3: Number of texts from each tabloid title

Figure 4: Number of texts from each Scottish newspaper title

The Broadsheet subcorpus is by far the largest of all datasets, both in terms of the number of texts and the number of words (Table 2). Within that subcorpus, The Guardian and The Observer account for the highest number of articles, corresponding to 36% of texts and 83% of the words in that subcorpus (13,744,333 out of 16,657,194). The number of texts is more evenly distributed in the Tabloid subcorpus (Figure 3). The Daily Mail accounts for the largest number of texts (20%) but it is closely followed by The Express, The Sun and Evening Standard (17% and 15% each respectively). Within the Scottish corpus, most texts come from The Daily Record and The National (32% each).

5 Method

To answer specific research question 1, we plotted the frequencies of the search terms used to collect the texts that comprise the corpora, ‘face mask(s)’ and ‘face covering(s)’. These figures give us an indication of how the level of attention fluctuated in the National and Scottish press over time.

To answer specific research question 2, we carried out a ‘keyword’ analysis of the Scottish Corpus as compared with the National Corpus as a whole. Keywords are words that are much more frequent in a corpus of interest (known as the ‘study corpus’) than they are in another corpus (known as the ‘reference corpus’), where the difference is statistically significant. They can be interpreted as reflecting the most distinctive concepts and themes in a particular corpus. The analysis was carried out using WordSmith Tools, version 7.

For the calculation of keywords, we established that the candidate keyword should occur in at least 5% of texts in the study corpus. This determined the minimum frequency of each term, which varied from one corpus to another. The minimum frequency was 577 instances in the National Corpus and 54 in the Scottish Corpus. In terms of statistical tests, we combined the log-likelihood test (a statistical measure of confidence) with log-ratio as the effect size measure, using the following thresholds: a critical value higher than 15.13 (p < 0.0001) for the log-likelihood test and 1.5 as the minimum log-ratio score, discarding negative scores. Keywords were then grouped by theme through close reading of the concordance lines, that is, individual occurrences of each word with the preceding and following stretches of text.
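To make the two thresholds concrete, here is a minimal Python sketch of how a log-likelihood score and a log-ratio score can be computed for a single candidate keyword. It uses the commonly cited two-corpus formulas and is not WordSmith Tools’ own implementation; the function names, the zero-frequency adjustment of 0.5 and the word frequencies in the usage example are assumptions made for illustration. Only the corpus sizes are taken from Table 2.

```python
import math


def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Two-corpus log-likelihood score for one word."""
    total = freq_study + freq_ref
    expected_study = size_study * total / (size_study + size_ref)
    expected_ref = size_ref * total / (size_study + size_ref)
    score = 0.0
    for observed, expected in ((freq_study, expected_study),
                               (freq_ref, expected_ref)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


def log_ratio(freq_study, size_study, freq_ref, size_ref, zero_adjust=0.5):
    """Binary log of the ratio of relative frequencies (effect size)."""
    # Assumption: replace a zero reference frequency with a small constant
    # so that the ratio stays defined.
    if freq_ref == 0:
        freq_ref = zero_adjust
    return math.log2((freq_study / size_study) / (freq_ref / size_ref))


def is_keyword(freq_study, size_study, freq_ref, size_ref,
               min_ll=15.13, min_log_ratio=1.5):
    """Apply both thresholds; negative log-ratio scores fail automatically."""
    ll = log_likelihood(freq_study, size_study, freq_ref, size_ref)
    lr = log_ratio(freq_study, size_study, freq_ref, size_ref)
    return ll > min_ll and lr >= min_log_ratio


if __name__ == "__main__":
    # Corpus sizes from Table 2; the word frequencies are hypothetical.
    scottish_words, national_words = 588_894, 19_401_316
    print(is_keyword(600, scottish_words, 4_000, national_words))
```

The separate 5%-of-texts filter is not shown here because it depends on per-text counts rather than on corpus totals.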

To answer specific research question 3, we carried out a ‘collocation’ analysis of the terms ‘face mask(s)’ and ‘face covering(s)’. Collocation analyses explore co-occurrence relationships between words, and therefore make it possible to study the narratives or discourses that a word is part of. A word collocates with another if it is more likely to be found in close proximity to the other word than elsewhere. Collocations were generated by means of the software package LancsBox, on the basis of the criteria below:

  • Span of 5:5 – a window of five words to the left and five words to the right of the search word.
  • Mutual Information (MI) score ≥ 6. MI is a statistical procedure widely employed in corpus studies to indicate how strong the association between two words is. It is calculated by considering their frequency of co-occurrence in relation to their frequencies when occurring independently in each corpus.
  • Minimum frequency of collocation: 10 occurrences per 1,000 instances of the term in question. For example, if ‘face mask(s)’ occurs 1,672 times in a corpus, the minimum frequency of collocation is 17 instances.
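The MI and minimum-frequency criteria above can be sketched in a few lines of Python. This is a simplified illustration rather than LancsBox’s own implementation: it uses the basic MI formula (binary log of observed over expected co-occurrences) with no correction for window size, and the function names and the frequencies in the usage example are hypothetical.

```python
import math


def mutual_information(cooccurrences, node_freq, collocate_freq, corpus_size):
    """Basic MI: log2 of observed vs. expected co-occurrence frequency."""
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(cooccurrences / expected)


def min_collocation_freq(node_freq, per_thousand=10):
    """10 occurrences per 1,000 instances of the node, rounded up."""
    return math.ceil(node_freq * per_thousand / 1000)


def is_collocate(cooccurrences, node_freq, collocate_freq, corpus_size,
                 min_mi=6.0):
    """Apply the frequency threshold first, then the MI threshold."""
    if cooccurrences < min_collocation_freq(node_freq):
        return False
    mi = mutual_information(cooccurrences, node_freq,
                            collocate_freq, corpus_size)
    return mi >= min_mi


if __name__ == "__main__":
    # A node occurring 1,672 times needs at least 17 co-occurrences.
    print(min_collocation_freq(1672))
    # Hypothetical frequencies for one node-collocate pair.
    print(is_collocate(64, 1000, 1000, 1_000_000))
```

Co-occurrences would be counted within the 5:5 span before these functions are applied; that counting step is left out of the sketch.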

Similar to the analysis of keywords, collocations were analysed by close reading of their concordance lines.

6 Findings

Finding 1 – Over time, ‘face covering(s)’ became more frequent than ‘face mask(s)’ in the Scottish press, but not in the national press.

Figures 5-6 show the frequency distribution of the terms ‘face mask(s)’ and ‘face covering(s)’ in the two corpora across time, considering the relative frequencies of terms (per 100,000 words). Note that the scale varies from one chart to another; that is due to differences in the amount of data from each corpus.
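As a reminder of what the normalisation involves, the snippet below is a minimal Python sketch of a relative frequency per 100,000 words. The function name and the monthly word total are hypothetical; only the two raw counts (181 and 83, for April 2020 in the Scottish Corpus) come from the findings reported in this post.

```python
def relative_frequency(count, corpus_words, per=100_000):
    """Occurrences per `per` words of running text."""
    return count * per / corpus_words


if __name__ == "__main__":
    # Hypothetical word total for one month of one corpus.
    monthly_words = 90_000
    print(round(relative_frequency(181, monthly_words), 1))  # 'face mask(s)'
    print(round(relative_frequency(83, monthly_words), 1))   # 'face covering(s)'
```

Dividing by the amount of text in each month is what makes the two corpora comparable despite their very different sizes.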


Figure 5: Relative frequencies of ‘face covering(s)’ and ‘face mask(s)’ in the National Corpus

Figure 6: Relative frequencies of ‘face covering(s)’ and ‘face mask(s)’ in the Scottish Corpus

As can be seen, both corpora show a clear preference for the term ‘face mask(s)’ in the early months, from December 2019 to March 2020, with hardly any mention of the term ‘face covering(s)’. Scottish newspapers seem to have embraced the term first, with mentions of ‘face covering(s)’ increasing swiftly in April 2020, corresponding to nearly half of the number of mentions of ‘face mask(s)’ in that month (83 as compared with 181 instances). National newspapers showed a modest increase in the mentions of ‘face covering(s)’ in April; the term ‘face mask(s)’ was nearly six times more frequent than ‘face covering(s)’ in the national newspapers (2,241 in relation to 386 instances). Mentions of ‘face covering(s)’ continued to rise across both corpora in the following months. In May, they represented about half of the number of mentions of ‘face mask(s)’ in the Scottish corpus and about a third in the National Corpus. By June, mentions of ‘face covering(s)’ surpassed those of ‘face mask(s)’ in Scottish newspapers. In national newspapers, ‘face mask(s)’ remained more frequent than ‘face covering(s)’ across the entire period.

Finding 2 – ‘Face covering(s)’ are mentioned much more often, relatively speaking, in the Scottish press than in the national press, alongside other terms for public indoor environments where they may be worn.

The words ‘covering’ and ‘coverings’, which tend to occur in the phrase ‘face covering(s)’, were found to be ‘key’ or ‘overused’ in the Scottish as compared with the National Corpus. In other words, ‘covering’ and ‘coverings’ are used much more often, in terms of relative frequencies, in the Scottish Corpus than in the National Corpus, based on our thresholds for effect size (log-ratio) and statistical significance (log-likelihood). However, based on the same thresholds, the word ‘mask(s)’ is not overused in the National corpus as compared with the Scottish Corpus. This means that ‘covering(s)’ in the Scottish Corpus is not in complementary distribution to ‘mask(s)’ in the National Corpus.

Overall, the keyword calculation retrieved 41 overused items in the Scottish Corpus, using the National Corpus as reference. Table 3 includes the complete list of keywords in the Scottish Corpus, grouped thematically and then ordered by their frequency of occurrence in the corpus.

Table 3 shows that the keywords in the Scottish Corpus include three other terms that are related to face coverings (‘mandatory’, ‘worn’ and ‘mouth’) as well as groups of words that relate to the different environments where face coverings may or may not be recommended or mandatory: Space (e.g. ‘indoor’, ‘outdoor’, ‘household’), Retail/hospitality (e.g. ‘shop’, ‘hospitality’) and Education (e.g. ‘pupils’, ‘teachers’).

Table 3: ‘Keywords’ in the Scottish Corpus, grouped by theme

The overuse of the word ‘kids’ reflects discussions about the age at which face masks/coverings should be made compulsory, as expressed by a reader’s comment published by The Glasgow Evening Times (Extract 1):

(1) “I AM so confused myself. Our kids are going with no distancing and in shops and malls and cinemas and public transport and airports. There is this hype of distancing. Which one is right? Are the poor kids so strong that they will not catch it at all and will not bring anything back home to their elderly grans etc? So illogical!” (The Glasgow Evening Times, 21.08.2020)

The keywords also include a group that is to do with Other Measures to reduce contagion, particularly in public spaces such as shops, restaurants and pubs (e.g. ‘screens’, ‘two-metre’). This is because face coverings are often presented as necessary when those other measures are not practicable:

(2) The government guidance says: “If you can, wear a face covering in an enclosed space where social distancing isn’t possible and where you will come into contact with people you do not normally meet.” (The National, 25.06.2020)

Finding 3 – Face ‘mask(s)’ and ‘covering(s)’ have partly different collocates, reflecting differences in status and associated narratives.

We now examine the collocates of ‘face mask(s)’ and ‘face covering(s)’ in the two corpora. These are listed in Tables 4 and 5, in decreasing order of frequency of co-occurrence with each term.


Table 4: Collocations of ‘face mask(s)’ and ‘face covering(s)’ in the Scottish Corpus

Table 5: Collocations of ‘face mask(s)’ and ‘face covering(s)’ in the National Corpus

Five words appeared as collocates of both ‘face mask(s)’ and ‘face covering(s)’ in both corpora. These are: three different forms of the verb ‘wear’ (‘wear’, ‘wearing’, ‘worn’), ‘compulsory’ and ‘mandatory’. These suggest that ‘mask(s)’ and ‘covering(s)’ are both used in the context of debates and decisions about the need or obligation to wear them in certain settings.

Figure 7: Instances of ‘face mask(s)’ in the Scottish Corpus

However, the collocates that only apply to ‘face mask(s)’ show that they tend to be talked about as a type of PPE in clinical or care settings (e.g. ‘protective’, ‘surgical’, ‘gloves’, ‘aprons’).

(3) Carers, many of whom are paid low wages by private sector firms, have complained they have not been provided with essential items such as hand sanitiser, gloves, aprons, and face masks. (The Independent, 24.03.2020)

In contrast, the collocates that only apply to ‘covering(s)’ show that they tend to be talked about as a non-medical item of clothing that is:

  • made of cloth and a potential fashion accessory or political statement (‘cloth’, ‘branded’);

(4) Currently no other party is selling branded face coverings, although many independent online shops stock masks with Union flag or political designs. (The National, 25.07.2020)

(5) Face coverings include scarves, a piece of cloth or a mask and certain travellers – such as people with disabilities or breathing difficulties – will be exempt. (The Daily Express, 06.06.2020)

  • recommended to be worn (e.g. ‘recommended’, ‘advised’);

(6) Earlier this week, First Minister Nicola Sturgeon recommended the limited use of face coverings – not necessarily masks – when social distancing is hard to maintain. (Glasgow Evening Times, 04.05.2020)

(7) Other precautions advised include wearing face coverings in public as much as possible, keeping two metres apart, avoiding physical contact with those outside one’s household and to be tested and isolate if told to do so. (The Telegraph, 18.07.2020)

  • in specific indoor public settings (‘crowded’, ‘enclosed’; ‘shops’, ‘transport’);

(8) “However, we are recommending you do wear a cloth face covering if you are in an enclosed space with others where social distancing is difficult – for example, on public transport, or in a shop.” (The National, 28.04.2020)

(9) It is compulsory to wear face coverings on public transport, in shops and when collecting takeaway food. (The Sun, 14.08.2020)

  • by large sections of the population (‘secondary’, ‘pupils’, ‘passengers’).

(10) A SECONDARY school is asking pupils to wear face coverings as part of efforts to combat the spread of coronavirus. (The Herald, 23.08.2020)

(11) Passengers have been told to wear face coverings on public transport to prevent a further outbreak of coronavirus as Britain slowly emerges from the lockdown. (The Times, 12.05.2020)

What does not, however, emerge from the collocates of ‘face covering(s)’ in either corpus is a consistent message about their role in protecting others from droplets produced by the wearer, thus reducing transmission overall. This may partly explain ongoing opposition to or scepticism about the usefulness of face coverings during the pandemic.

7 Conclusions

Overall, in the period December 2019 – August 2020, reports on face mask(s)/covering(s) in the Scottish press contrasted with the national press in terms of: a preference for ‘face covering(s)’ over ‘face mask(s)’ from April 2020 onwards; and a greater concern for their use to mitigate the transmission of the virus in schools, shops and other public indoor environments. This can only be partly explained by the fact that the Westminster government made decisions about the recommended/mandatory use of face coverings in public indoor spaces slightly later than the Scottish devolved administration. The contrasting collocates of ‘face covering(s)’ vs. ‘face mask(s)’ confirm that they are associated with different settings and narratives: PPE in clinical/care settings vs. item of clothing/accessory to be worn in public indoor environments by the general population as a public health measure. In the period under consideration, the latter narrative was therefore increasingly prevalent in Scottish but not in national newspapers.

Introductory Blog – Hanna Schmueck

I am very honoured to have received the Geoffrey Leech Outstanding MA Student Award for my MA in Language and Linguistics. This award traditionally goes to the MA student with the highest overall average.

I started my postgraduate journey in September 2019 after finishing my undergraduate degree at the University of Bamberg (Germany) in 2018 and working as a freelance translator and teacher for a year. I’ve always had an interest in the way language influences us both as individuals and as a society and have carried with me a fascination for experimentation and statistics. I first discovered corpus linguistics in the second year of my undergraduate degree, and it soon became my primary research interest. I chose a corpus-based project for my undergraduate dissertation on pronouns in the English-lexifier lingua franca Bislama. From here I realised that much of the relevant methodological literature had been published by Lancaster academics – which cemented my decision to apply to Lancaster despite having to move abroad and face a number of Brexit-related administrative hurdles.

When I finally came to Lancaster for my MA, I felt welcome in the department from day one and I had the chance to attend/audit a wide variety of modules such as Cognitive Linguistics, Experimental Approaches to Language and Cognition, Forensic Linguistics, Stylistics, and Corpus Linguistics. The freedom of choice that Lancaster MA students in Language and Linguistics are given was another major motivation for studying at Lancaster and the flexible approach really benefited my personal learning experience. Another important element of my academic learning experience was being able to attend research groups – such as the Trinity group and UCREL talks – which focus on a wide variety of topics and allow you to come into contact with people that have all kinds of specialisms while getting the opportunity to develop your own research interests further.

I had, like all of us, not foreseen that my MA would move online in spring, nor all the challenges COVID-19 would bring about, but after the first phase of getting used to the situation, I tried my best to see this as an opportunity to focus on my MA thesis titled “More than the sum of its parts: Collocation networks in the written section of the BNC2014 Baby+”. The aim of this thesis was to explore corpus-wide collocation networks and their structural and graph-theoretical properties using the BNC2014 Baby+ as the underlying dataset. I developed a method to create and display large MI2-score-based weighted networks in order to analyse the meta-level collocational patterns that emerge, and performed a graph-theoretical analysis on them. The results obtained from this pilot study suggested that there is an underlying structure that all sections in the BNC2014 Baby+ share and that the structure of the generated networks resembles other networks from a wide variety of phenomena such as power grids, social networks, and networks of brain neurons. The findings indicated that there are, however, text-type specific differences in terms of how connected different topic areas are and that certain words serve as hubs connecting topics with one another. The network displayed below is an example taken from the BNC2014 Baby+ academic books section with a filter applied to only show the node “award”, its direct neighbours and their weighted interrelations.
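To give a rough idea of what an MI2-weighted collocation network involves, here is a minimal Python sketch. It is not the method used in the thesis. It assumes one common formulation of the score, MI2 = log2(O² / E), where O is the observed co-occurrence count and E = f(w1) × f(w2) × window size / N is the expected count. The function names, the toy sentence and the window setting are all illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def collocation_network(tokens, span=1, min_cooc=1):
    """Return {word: {collocate: MI2}} for an undirected collocation network.

    Assumption: MI2 = log2(O^2 / E) with E = f(a) * f(b) * window / N,
    where the window covers `span` tokens on each side of the node.
    """
    n = len(tokens)
    freq = Counter(tokens)
    cooc = Counter()
    # Scan to the right only and store each pair in sorted order, so every
    # co-occurrence within the symmetric window is counted exactly once.
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + span + 1, n)):
            if tokens[j] != word:
                cooc[tuple(sorted((word, tokens[j])))] += 1

    graph = defaultdict(dict)
    for (a, b), observed in cooc.items():
        if observed < min_cooc:
            continue
        expected = freq[a] * freq[b] * (2 * span) / n
        mi2 = math.log2(observed ** 2 / expected)
        graph[a][b] = mi2  # edge weight, stored in both directions
        graph[b][a] = mi2
    return dict(graph)


def ego_network(graph, node):
    """Keep only `node`, its direct neighbours, and the edges among them."""
    keep = {node} | set(graph.get(node, {}))
    return {
        a: {b: w for b, w in neighbours.items() if b in keep}
        for a, neighbours in graph.items()
        if a in keep
    }


# Toy example (hypothetical data, not taken from the BNC2014 Baby+).
tokens = "the award was given the award was won".split()
network = collocation_network(tokens, span=1)
print(network["award"]["the"])                # 2.0
print(sorted(ego_network(network, "award")))  # ['award', 'the', 'was']
```

The `ego_network` helper mirrors the kind of filter described above: it keeps a single node, its direct neighbours and the weighted links between them. A real corpus would also need frequency and score thresholds to keep the network readable.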

I am very grateful for having had the opportunity to learn from and exchange ideas with so many amazing academics in the department over the course of my MA and I’m very excited to carry on researching collocation networks for my PhD here at Lancaster.

New CASS project: Feedback on NHS Cancer services

In recent months, CASS members Paul Baker and Gavin Brookes have embarked on a project working with the National Health Service (NHS), using corpus linguistics methods to investigate patient concerns in a large corpus (approx. 14 million words) of patient feedback on NHS cancer care. Below we discuss what the project will entail.

If this project sounds familiar, it is because we carried out a similar project four years ago, also using corpus techniques to examine NHS patient feedback more generally (you can read about this work in this book and this journal article in BMJ Open).

This latest project was made possible with ESRC funding (£84,006 FEC) and involves collaboration between CASS and NHS England, who provided us with electronic versions of approximately 200,000 patient questionnaires given annually to all patients who receive treatment for cancer in England. We have been given access to four years of data (2015-2018), which has been mounted on Lancaster University’s online corpus analysis system CQPweb.

We will be using and refining some of the techniques we developed in that earlier work to explore, for example, what kinds of concerns drive patients’ evaluations, how patients’ priorities change throughout the duration of their care, and what types of concerns patients regard as being most urgent. This set of comments differs from that which we analysed previously in an important respect: we have access to metadata regarding patients’ age, ethnicity, sex and sexuality, as well as the type of cancer they received treatment for and the hospital they attended. Therefore, our analysis will also explore what impact these variables are likely to have on patients’ expectations and how that affects the language they use when talking about and evaluating NHS services.

Another important difference between this project and the last one is that we will be able to draw on the expertise of Professor Sheila Payne – an expert in palliative and end of life care who has also been involved in other CASS projects in the past (e.g. Metaphor in End of Life Care (MELC)). Sheila’s insight will help to guide the aims of the project and to ensure that these are relevant and of value to the NHS, while her expertise will be key to interpreting the significance of our findings.

POS Tagging for Georgian is now available in #LancsBox

We are delighted to announce that part-of-speech tagging for Georgian is now available in #LancsBox. This is the very first Georgian POS tagger made available for a wide range of users and uses. It enables users to perform various linguistic analyses on their own texts or corpora in #LancsBox.

The POS tagger for Georgian was developed within my PhD project (Computational analysis of morphosyntactic categories in Georgian) at the University of Leeds. The tagset design part of my research was conducted in the Centre for Corpus Approaches to Social Science at Lancaster University and was supervised by Dr Andrew Hardie. Thanks to Dr Vaclav Brezina, the lead developer of #LancsBox, this tagger is now available in #LancsBox (Brezina et al., 2015, 2018, 2020).

I use a probabilistic methodology (TreeTagger) and an enclitic tokenisation approach to perform tagging in Georgian. The accuracy of part-of-speech tagging is 92%. The tagger uses a new morphosyntactic language model (developed for POS tagging purposes) and the KATAG tagset (219 tags) based on this model. The KATAG tagset is a hierarchical-decomposable tagset, which allows the user to search for different sections of the paradigm.
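To illustrate why a hierarchical-decomposable tagset makes it possible to search sections of a paradigm, here is a small Python sketch. The tag strings are invented for illustration and are not actual KATAG tags. The dot-separated layout, the `*` wildcard and the function names are likewise assumptions, not the format used by the tagger or by #LancsBox.

```python
def tag_matches(tag, pattern):
    """Check whether a dot-separated tag fits a pattern.

    The pattern is compared segment by segment from the left, so a short
    pattern selects a whole branch of the hierarchy ("N" = all nouns).
    A "*" segment accepts any value in that position.
    """
    tag_parts = tag.split(".")
    pattern_parts = pattern.split(".")
    if len(pattern_parts) > len(tag_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, tag_parts))


def search(tagged_tokens, pattern):
    """Return the tokens whose tag matches the pattern."""
    return [word for word, tag in tagged_tokens if tag_matches(tag, pattern)]


# Hypothetical tagged tokens: placeholder words and made-up tags.
tagged = [
    ("w1", "N.ERG.SG"),     # noun, ergative, singular
    ("w2", "V.AOR.1.SG"),   # verb, aorist, 1st person, singular
    ("w3", "N.GEN.PL"),     # noun, genitive, plural
    ("w4", "V.IMPF.1.PL"),  # verb, imperfect, 1st person, plural
    ("w5", "ADJ.GEN"),      # adjective, genitive
]

print(search(tagged, "N"))      # ['w1', 'w3']  all nouns
print(search(tagged, "V.*.1"))  # ['w2', 'w4']  1st person verbs, any tense
print(search(tagged, "*.GEN"))  # ['w3', 'w5']  anything in the genitive
```

Because each segment carries one category, a single query can target a word class, a case or a person without listing every full tag.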

#LancsBox is a very powerful corpus analysis tool. It can be used at different levels of analysis of language data and corpora. It automatically annotates data for part-of-speech and can be used to find frequencies of different word classes such as nouns and verbs, compute frequency and dispersion measures for POS tags, and find and visualise co-occurrence of grammatical categories. It can also find complex linguistic structures using ‘smart searches’. For example, there are 60 ‘smart searches’ available for Georgian in #LancsBox, such as:

ADJECTIVES GENITIVE CASE      adjectives in genitive case
ADVERBS                       any adverbs
NOUNS ERGATIVE CASE           nouns in ergative case
PRONOUNS DEMONSTRATIVE        demonstrative pronouns
PRONOUNS INTERROGATIVE        interrogative pronouns
PRONOUNS PERSONAL             personal pronouns
VERBS AORIST TENSE            verbs in aorist tense
VERBS I PERSON                verbs with 1st person subject
VERBS II PERSON PLURAL        verbs with 2nd person plural subject
VERBS IMPERFECT TENSE         verbs in imperfect tense

To demonstrate how to use ‘smart searches’ in #LancsBox, I use a small Covid19 corpus (229,481 tokens). I am interested in finding out which verbs immediately follow the word coronavirus. Thus, my search term is: კორონავირუსი VERBS

This image displays alphabetically arranged concordance lines in #LancsBox, showing the most immediate contexts in which the search term is used. This allows me to analyse the frequency and dispersion of the node კორონავირუსი (coronavirus) immediately followed by a verb. Here it occurs 37 times (1.612 per 10k) in the Covid19 corpus, in 10 out of 11 texts.
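The figures reported here can be checked by hand. The short Python sketch below shows the arithmetic: a relative frequency per 10,000 tokens and the range across texts. It also includes a toy search for a node word immediately followed by a verb. The function names, the `V` tag label and the toy data are assumptions made for illustration; they do not reflect how #LancsBox works internally.

```python
def relative_frequency(hits, corpus_size, per=10_000):
    """Normalise a raw count to a rate per `per` tokens."""
    return hits / corpus_size * per


def text_range(texts_with_hits, total_texts):
    """Proportion of texts in which the pattern occurs at least once."""
    return texts_with_hits / total_texts


def node_followed_by_tag(texts, node, tag_prefix):
    """Count `node` immediately followed by a token whose tag starts with
    `tag_prefix`. Each text is a list of (token, tag) pairs.
    Returns (total hits, number of texts with at least one hit).
    """
    total_hits = 0
    texts_with_hits = 0
    for text in texts:
        hits = sum(
            1
            for (word, _), (_, next_tag) in zip(text, text[1:])
            if word == node and next_tag.startswith(tag_prefix)
        )
        total_hits += hits
        if hits > 0:
            texts_with_hits += 1
    return total_hits, texts_with_hits


# Figures reported above: 37 hits in 229,481 tokens, found in 10 of 11 texts.
print(round(relative_frequency(37, 229_481), 3))  # 1.612
print(round(text_range(10, 11), 3))               # 0.909

# Toy data with placeholder tokens and hypothetical tags.
toy_texts = [
    [("virus", "N"), ("spreads", "V"), ("fast", "ADV")],
    [("the", "DET"), ("virus", "N"), ("outbreak", "N")],
]
print(node_followed_by_tag(toy_texts, "virus", "V"))  # (1, 1)
```

Normalising to a common base such as 10,000 tokens lets a count from this small corpus be compared with counts from corpora of other sizes.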

Representations of Obesity in the News: Project update and book announcement!

Gavin Brookes and Paul Baker

We are delighted to announce the forthcoming publication of a book based on research carried out as part of the CASS project, ‘Representations of Obesity in the News’. The book, titled Obesity in the News: Language and Representation in the Press, will be published by Cambridge University Press in 2021. You can see a sneak preview of the cover here!

The book reports an analysis of a 36-million-word corpus of all UK national newspaper articles mentioning obese or obesity published over a ten-year period (2008-2017). This analysis combines methods from Corpus Linguistics with Critical Discourse Studies to explore the discourses that characterise press coverage of obesity during this period. The book explores a wide range of themes in this large dataset, with chapters that answer the following questions:

• What discourses characterise representations of obesity in the press as a whole?

• How do obesity discourses differ according to newspapers’ formats and political leanings?

• How have obesity discourses changed over time, and how do they interact with the annual news cycle?

• How does the press use language to shame and stigmatise people with obesity, and how are attempts to ‘reclaim’ the notion of obesity depicted?

• What discourses surround the core concepts of the ‘healthy body’, ‘diet’ and ‘exercise’ in press coverage of obesity?

• How do obesity discourses interact with gender, and how does this influence the ways in which men and women with obesity are represented?

• How does the press talk about social class in relation to obesity, and how do such discourses contribute to differing depictions of obesity in people from different social class groups?

• Finally, how do audiences respond to press depictions of obesity in below-the-line comments on online articles?

The book will be the latest output from this project. You can read more about our work on changing representations of obesity over time in this recent Open Access article published in Social Science & Medicine. We are also working on articles which expand our analysis of obesity and social class, depictions of obesity risk, and obesity discourses in press coverage of the coronavirus pandemic, so keep your eyes peeled for further announcements!

ICR Outstanding Corpus Thesis Award for Lancaster PhD graduate

I am honoured to have received the Institute for Corpus Research Outstanding Doctoral Thesis Award. The purpose of this annual award is to recognise and reward theses in the field of Corpus Linguistics.

I conducted my PhD research in the Centre for Corpus Approaches to Social Science at Lancaster University, which is part of the Department of Linguistics and English Language. My thesis was titled Collocational Processing in Typologically Different Languages, English and Turkish: Evidence from Corpora and Psycholinguistic Experimentation. Some of the findings based on my PhD research are reported in this article. The study was multidisciplinary, involving both corpus analysis and psycholinguistic experimentation. My supervisors, Dr Vaclav Brezina and Prof Patrick Rebuschat, played a key role in shaping the thesis. Their academic knowledge and insight have been invaluable in developing a multidisciplinary perspective to pioneer a contrastive study of English and Turkish.

Turkish, with its rich morphology, differs from English – prompting questions about whether the same variables affect collocational processing in the two languages. Importantly, so far the vast majority of research on collocational processing has focussed on a narrow range of primarily European languages, especially English, which makes it difficult to generalise the findings to other languages. Corpus analyses showed that uninflected collocations have similar mean frequencies and association counts in both languages. When inflected forms were included, 75% of the Turkish collocations occurred at a higher frequency than the collocations in English, suggesting that language typology impacts frequency of collocations.

I then conducted psycholinguistic experiments to understand the differences and similarities between the processing of collocations in English and Turkish and by native and non-native English speakers. To what extent is there a difference between native speakers’ (of English and Turkish) sensitivity to both individual word-level and phrase-level frequency information when processing collocations? Mixed-effects regression modelling revealed that Turkish and English native speakers are equally sensitive to collocation frequencies, confirming collocations’ psychological reality in both languages. Yet English speakers were additionally affected by individual word frequencies, indicating that language typologies require users to process collocations from different sources of information.

Furthermore, this thesis investigated the effects of individual word and collocational frequency on native and non-native speakers’ collocational processing in English. Both groups of participants demonstrated sensitivity to individual word and collocation frequency. The findings align with the predictions of usage-based approaches that language acquisition should be viewed as a statistical accumulation of experiences that changes every time we encounter a particular utterance.

This study identified both universal fundamentals and language-specific differences in collocational processing. It addressed language typology and second-language learning through a novel multidisciplinary approach which reinforces and challenges usage-based theories of language learning, demonstrating that they should include typologically different languages to develop broader perspectives on processing.

Please see the link here for more information about this award.

If you have any questions, or are interested in working with me, get in touch: Dr Doğuş Can Öksüz, Research Fellow at the University of Leeds (d.oksuz@leeds.ac.uk).

Covid-19 and the International Baccalaureate

A month passed, but yet our pain hasn’t diminished and justice unserved (#ibscandal, Aug 6)

Three months ago, when I wrote my introductory post for the CASS blog, I had a clear research plan for my SSHRC (Canada’s Social Sciences & Humanities Research Council) postdoctoral fellowship, which involved examining IB discourses in a large corpus of global (English) newspapers to see how these compared to IB discourses in Canada. However, that plan took a completely new and unexpected turn last month when Covid-19 and the IB collided.

The word “unprecedented” has been used a great deal in connection with our current Covid-19 world. While readers of history may raise a sceptical eyebrow about exactly how unprecedented this situation is, it does apply to the IB organization and the events unfolding globally in relation to the May 2020 final examination results, which were released on July 6. This year has been unlike any other in the organization’s 52-year history because, for the first time, the high-stakes IB diploma program final exams were cancelled due to Covid-19. The announcement was made on March 23, further elaborated by the Director General Siva Kumari on March 24, with a follow-up statement on May 13 describing in detail the alternate assessment model to be used.

On July 6, the IB organization published the final results for 174,355 IB Diploma Program (DP) and Career-related Program (CP) students with great fanfare. Of these, 170,343 were DP candidates from 146 countries, whose results would most likely be linked to university admission. Congratulatory messages were splashed on the IB organization website and Twitter feeds, celebrating the triumph of the Class of 2020 for their great achievements in such a difficult year. Messages from the Director General, Siva Kumari, Deputy Director Sally Holloway, Chief Assessment Officer Paula Wilcock and representatives from IB schools around the world joined together in their praise for this cohort, who had been forced to adapt to a new and fluid situation.

But a problem was brewing that was not evident from the IB organization’s celebratory communication. Reports began to emerge from a variety of sources (e.g., Wired, Reuters, TES, Financial Times, Bloomberg) about issues with IB final results, which turned out to be lower than many had expected and thus put students’ university admission and/or scholarships in jeopardy. Within four days of the release of results, an online petition calling for “Justice for May 2020 IB Graduates”, with the hashtag #ibscandal, had already collected 15,000 signatures. Government bodies such as Ofqual (Office of Qualifications and Examinations Regulation) in the UK and the Data Protection Authority in Norway also became involved in seeking clarification regarding the IB organization’s grading system. Despite such wide coverage, the IB website released only a single statement on July 15 about the results, and the @iborganization Twitter account remained largely inactive, with nine posts on July 6 in connection to the results, and the next post on July 20 saying: “In response to the enquiries received, we share further clarity regarding our awarding model for the May 2020 session. We are in direct communication with schools, providing support options including a new process to review extraordinary cases. Learn more: bit.ly/3hdiJnC” (https://mobile.twitter.com/iborganization/status/1285175265027665920).

Over the past month, the #ibscandal hashtag has evolved to become a space where not only students, parents and teachers voice their opinions (e.g., the quote at the start), but also where key information is exchanged, such as newly published articles or videos. There are also academics and journalists posting on this site, most recently a professor from New York University looking to talk to students “affected by the 2020 #ibscandal”. And on July 30, there was an announcement saying that the IB controversy was now on Wikipedia. In sum, there is a stark contrast between the IB organization’s silence on anything to do with results on one side, while anger and frustration mount on the other.

Meanwhile, in Canada, no coverage of these events can be found anywhere. This is curious since Canada not only has the second highest number of DP candidates in the world (11,962 reported for the May 2020 examination session) but ranks as the number one “destination for IB transcripts of any university in the world” (Arida, 2016). It would seem reasonable to expect that there might be some interest in the events taking place globally. Just to be sure I wasn’t missing anything, I conducted a search for international AND baccalaureate on Canadian Newsstream, a database containing over 280 news sources. Of the 17 results for the month of July, three were not about the IB, two mentioned the IB in passing as part of a person’s qualifications, one reported on a school in Vancouver going ahead with its plans to become an IB school, and 10 reported on complaints against Canada’s Governor General Julie Payette and her assistant, who had been friends “going back to their days in an international baccalaureate program decades ago”. The one remaining article, from July 21 (over two weeks after the IB results were published), is a reprint of the Reuters story “Global exam grading algorithm under fire for suspected bias” by A. A. Schapiro. In other words, there is no original Canadian news story even though the topic is clearly newsworthy.

So, like everyone else who is interested in this topic, I am a regular visitor to the #ibscandal hashtag, observing events unfold in real time. As a result, I’ve noticed some rather interesting developments over the past four weeks which I hope to explore further using corpus tools and methods. As they often say here at CASS, watch this space!

Update: Since this piece was written, the IB organization issued a statement on August 17 explaining changes to their assessment model in light of data and evidence they received from schools. The announcement also appeared on the organization’s Twitter feed (https://mobile.twitter.com/iborganization/status/1295269390540210176).