Compiling a trilingual corpus to examine the political and social representation(s) of ‘people’ and ‘democracy’

I am a visiting researcher at CASS from mid-October 2017 until the end of August 2018, coming from the University of Athens, where I am Associate Professor of Corpus Linguistics and Translation. My research aim is to investigate critical aspects of populist discourses in Europe and their variation, especially during and after the 2008 financial (and then social and political) crisis, and to reveal patterns of similarity and difference (and, tentatively, of interconnectedness and intertextuality) across a wide spectrum of political parties, think tanks and organisations. As this is essentially a Corpus-Assisted Discourse Study (CADS), a first way into examining the data is to identify and statistically analyse collocational patterns and networks that are built around key lexemes (e.g. ‘people’, ‘popular’, ‘democracy’, in this scenario), before moving on to critically correlating such quantitative findings with the social and political backdrop(s) and crucial milestones.


The first task of this complex corpus-driven effort, which is now complete, has been to compile a large-scale trilingual (EN, FR, EL) ‘focus’ corpus. This has been a tedious technical process: before the data could be examined in a consistent manner, several problems had to be addressed and solutions implemented, as outlined below.


  1. As a key primary aim was to gather as much data as possible from the websites of political parties, political organisations, think tanks and official party newspapers, from the UK, France and Greece, it was clear from the outset that it would not be possible to manually cull the corpus data, given the sheer number of sources and of texts. On the other hand, automatic corpus compilation tools (e.g. BootCaT and WebBootCaT in SketchEngine) could not handle the extent and the diversity of the corpora. To address this problem, texts were culled using web crawling techniques (‘wget -r’ in Linux bash) and the HTTrack app, with a lot of tweaking and the necessary customisation of download parameters, to account for the (sometimes very tricky) batch download restrictions of some websites.
  2. Clean up HTML boilerplate (i.e., sections of code, advertising material, etc. that are included in HTML pages but are irrelevant to the corpus text). This was accomplished using jusText (the app used by M. Davies to compile the NOW corpus), with a few tweaks, so as to be able to handle some ‘malformed’ data, especially from Greek sources.

As I plan to specifically analyse the variation of key descriptors and qualifiers (‘key’ keywords and their c-collocates) as a first way into the “charting” of the discourses at hand, the (article or text) publication date is a critical part of the corpus metadata, one that needs to be retained for further processing. However, most if not all of this information is practically lost in the web crawling and boilerplating stages. Therefore, the HTML clean-up process was preceded by the identification and extraction of the articles’ publication dates, using a PHP script that was developed with the help of Dr. Matt Timperley (CASS, Lancaster) and Dr. Pantelis Mitropoulos (University of Athens). This script scans all files in a dataset, accounts for all possible date formats in all three languages, and then automatically creates a CSV (tab-delimited) table that contains the extracted date(s), matched with the respective filenames. Its accuracy is estimated at ca. 95% and can be improved further by checking the output and rescanning the original data with a few code tweaks.
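To give a flavour of what such a date-extraction step involves, here is a minimal Python sketch. It is not the PHP script mentioned above, which is not reproduced in this post. It assumes a single ‘day month year’ pattern with full month names (genitive forms for Greek), and all function and variable names are hypothetical; a real script would need to cover many more formats.

```python
import csv
import io
import re
import unicodedata

# Hypothetical month tables (full names only); real data would also need
# abbreviations, numeric formats and other inflected forms.
EN = "january february march april may june july august september october november december".split()
FR = "janvier février mars avril mai juin juillet août septembre octobre novembre décembre".split()
EL = ("ιανουαρίου φεβρουαρίου μαρτίου απριλίου μαΐου ιουνίου ιουλίου "
      "αυγούστου σεπτεμβρίου οκτωβρίου νοεμβρίου δεκεμβρίου").split()
MONTHS = {
    unicodedata.normalize("NFC", name): i + 1
    for names in (EN, FR, EL)
    for i, name in enumerate(names)
}

# Day (optionally French "1er"), month word, four-digit year.
DATE_RE = re.compile(r"\b(\d{1,2})(?:er)?\s+(\w+)\s+(\d{4})\b")


def extract_dates(text):
    """Return ISO dates (YYYY-MM-DD) for every 'day month year' match."""
    text = unicodedata.normalize("NFC", text)
    dates = []
    for day, month, year in DATE_RE.findall(text):
        number = MONTHS.get(month.lower())
        if number and 1 <= int(day) <= 31:
            dates.append(f"{year}-{number:02d}-{int(day):02d}")
    return dates


def dates_table(files):
    """Build a tab-delimited table matching filenames with extracted dates.

    `files` maps a filename to the text of that file.
    """
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["filename", "dates"])
    for name, text in files.items():
        writer.writerow([name, ";".join(extract_dates(text))])
    return buffer.getvalue()
```

Checking the resulting table by hand and adding patterns for the formats it misses is the kind of iterative tweaking described above.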

  3. Streamline the data, by removing irrelevant stretches of text (e.g. “Share this article on Facebook”) that were possibly left behind during the boilerplating process – this step is carried out using Linux commands (e.g. find, grep, sed, awk) and regular expressions, and greatly improves the accuracy of the following step.
  4. Remove duplicate files: since onion (ONe Instance ONly: the script used e.g. in SketchEngine) only looks for pattern repetitions within a single file and within somewhat short maximum paragraph intervals, I used FSLint – an application that takes account of the files’ MD5 signature and identifies duplicates. This is extremely accurate and practically eliminates all files that have a one hundred percent text repetition, across various sections of the websites, regardless of the file name or creation date (actually, this was found to be the case mostly with political party websites, not newspapers). (NB: A similar process is also available in Mike Scott’s WordSmith Tools v7).
  5. Order files by publication year for each subcorpus and then calculate the corresponding metadata (files, tokens, types and average token count, by year) for each dataset. Then filter out the “focus corpus” by retaining only the files that contain node lemmas, i.e. lemmas related to the core question of this research (people*|popular|democr*|human* and their FR and EL equivalents), using grep and regular expressions. Note that an open-source, Java-based GUI app that combines these search options for large datasets is FAR.
  6. Finally, prepare the data for uploading to LU’s CQPweb, by appending the text publication year info, as extracted from stage 2, to the corresponding raw text file – this was done using yet another PHP script, kindly developed by Matt Timperley.
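The duplicate-removal and focus-corpus filtering steps can be sketched in a few lines of Python. This is only an illustration of the underlying ideas (MD5 signatures and a node-lemma regular expression), not a reproduction of FSLint or of the grep commands actually used. It works on in-memory strings rather than files, covers only the English search expression, and all names are hypothetical.

```python
import hashlib
import re

# English node-lemma pattern, adapted from the search expression quoted
# above; the FR and EL equivalents are not given in this post and are
# therefore left out.
NODE_RE = re.compile(r"\b(?:people\w*|popular|democr\w*|human\w*)\b", re.IGNORECASE)


def md5_of(text):
    """MD5 signature of a text, computed on its UTF-8 bytes."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def drop_exact_duplicates(texts):
    """Keep one file (the first in name order) per distinct MD5 signature.

    `texts` maps a filename to the text of that file. Only files whose
    content is 100% identical are treated as duplicates.
    """
    seen = set()
    unique = {}
    for name in sorted(texts):
        digest = md5_of(texts[name])
        if digest not in seen:
            seen.add(digest)
            unique[name] = texts[name]
    return unique


def focus_corpus(texts):
    """Retain only the files that contain at least one node lemma."""
    return {name: text for name, text in texts.items() if NODE_RE.search(text)}
```

Because the comparison relies on a signature of the whole content, two files that differ by a single character are kept as distinct, which is why streamlining the text first matters.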


In a nutshell, texts were culled from a total of 68 sources (24 Greek, 26 British, and 18 French). This dataset is divided into three major corpora, as follows:

  1. Cumulative corpus (CC, all data): 746,798 files/465,180,684 tokens.
  2. Non-journalistic research corpus (RC): 419,493 files/307,231,559 tokens.
  3. Focus corpus (FC): 205,038 files/235,235,353 tokens.

Is Academic Writing Becoming More Colloquial?

Have you noticed that academic writing in books and journals seems less formal than it used to? Preliminary data from the Written BNC2014 shows that you may be right!

Some early data from the academic journals and academic books sections of the new corpus has been analysed to find out whether academic writing has become more colloquial since the 1990s. Colloquialisation is “a tendency for features of the conversational spoken language to infiltrate and spread in the written language” (Leech, 2002: 72). The colloquialisation of language can make messages more easily understood by the general public because, whilst not everybody is familiar with the specifics of academic language, everyone is familiar with spoken language. In order to investigate the colloquialisation of academic writing, the frequencies of several linguistic features which have been associated with colloquialisation were compared in academic writing in the BNC1994 and the BNC2014.

Results show that, of the eleven features studied, five features have shown large changes in frequency between the BNC1994 and the BNC2014, pointing to the colloquialisation of academic writing. First and second person pronouns, verb contractions, and negative contractions have previously been found to be strongly associated with spoken language. These features have all increased in academic language between 1994 and 2014. Passive constructions and relative pronouns have previously been found to be strongly associated with written language, and are not often used in spoken language. This analysis shows that both of these features have decreased in frequency in academic language in the BNC2014.

Figure 1: Frequency increases indicating the colloquialisation of academic language.

Figure 2: Frequency decreases indicating the colloquialisation of academic language.

These frequency changes were also compared for each genre of academic writing separately. The genres studied were: humanities & arts, social science, politics, law & education, medicine, natural science, and technology & engineering. An interesting difference between some of these genres emerged. It seems that the ‘hard’ sciences (medicine, natural science, and technology & engineering) have shown much larger changes in some of the linguistic features studied than the other genres have. For example, Figure 3 shows the difference in the percentage increase of verb contractions for each genre, and clearly shows a difference between the ‘hard’ sciences and the social sciences and humanities subjects.

Figure 3: % increases in the frequency of the use of verb contractions between 1994 and 2014 for each genre of academic writing.

This may lead you to think that medicine, natural science, and technology & engineering writing has become more colloquial than the other genres, but this is in fact not the case. Looking more closely at the data shows us that these ‘hard’ science genres were actually much less colloquial than the other genres in the 1990s, and that the large change seen here is actually a symptom of all genres becoming more similar in their use of these features. In other words, some genres have not become more colloquial than others; they have simply had to change more in order for all of the genres to become more alike.
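The arithmetic behind this point can be illustrated with a short Python sketch. The figures below are invented for illustration and are not taken from the BNC1994 or the BNC2014; the function names are hypothetical, and the post does not state exactly how the frequencies were normalised (a per-million-words rate is assumed here).

```python
def per_million(count, corpus_size):
    """Relative frequency per million words."""
    return count * 1_000_000 / corpus_size


def percent_change(old, new):
    """Percentage change from an old rate to a new rate."""
    return (new - old) * 100 / old


# Invented rates (per million words) for two imaginary genres.
# Both end up at the same level, but they start from different baselines.
hard_science_old, hard_science_new = 10.0, 60.0
humanities_old, humanities_new = 40.0, 60.0

hard_science_rise = percent_change(hard_science_old, hard_science_new)
humanities_rise = percent_change(humanities_old, humanities_new)

# The genre with the lower starting point shows the larger percentage
# increase, even though neither genre ends up more colloquial than the other.
print(f"hard science: {hard_science_rise:.0f}% increase")
print(f"humanities:   {humanities_rise:.0f}% increase")
```

In this made-up case the two genres converge on the same rate, so the much larger percentage rise reflects only the lower starting point.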

So it seems from this analysis that, in some respects at least, academic language has certainly become more colloquial since the 1990s. The following is a typical example of academic writing in the 1990s, taken from a sample of a natural sciences book in the BNC1994. It shows avoidance of using first or second person pronouns and contractions (which have increased in use in the BNC2014), and shows use of a passive construction (the use of which has decreased in the BNC2014).

Experimentally one cannot set up just this configuration because of the difficulty in imposing constant concentration boundary conditions (Section 14.3). In general, the most readily practicable experiments are ones in which an initial density distribution is set up and there is then some evolution of the configuration during the course of the experiment.

It is much more common nowadays to see examples such as the following, taken from an academic natural sciences book in the BNC2014. This example contains active sentence constructions, first person pronouns, verb contractions, negative contractions, and a question.

No doubt people might object in further ways, but in the end nearly all these replies boil down to the first one I discussed above. I’d like to return to it and ponder a somewhat more aggressive version, one that might reveal the stakes of this discussion even more clearly. Very well, someone might say. Not reproducing may make sense for most people, but my partner and I are well-educated, well-off, and capable of protecting our children from whatever happens down the road. Why shouldn’t we have children if we want to?

It will certainly be interesting to see if this trend of colloquialisation can be seen in other genres of writing in the BNC2014!

Would you like to contribute to the Written BNC2014?

We are looking for native speakers of British English to submit their student essays, emails, and Facebook and WhatsApp messages for inclusion in the corpus! To find out more, and to get involved, click here. All contributors will be fully credited in the corpus documentation.

British National Corpus 2014: A sociolinguistic book is out

Have you ever wondered what real spoken English looks like? Have you ever asked the question of whether people from different backgrounds (based on gender, age, social class etc.) use language differently? Have you ever thought it would be interesting to investigate how much English has changed over the last twenty years? All these questions can be answered by looking at language corpora such as the Spoken BNC 2014 and analysing them from a sociolinguistic perspective. Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014 is a book which offers a series of studies that provide a unique insight into a number of topics ranging from Discourse, Pragmatics and Interaction to Morphology and Syntax.

This is, however, only the first step. We are hoping that there will be many more studies to come based on this wonderful dataset. If you want to start exploring the Spoken BNC 2014 corpus, it is just three mouse clicks away:

Get access to the BNC2014 Spoken

  1. Register for free and log on to CQPweb.
  2. Sign up for access to the BNC2014 Spoken.
  3. Select ‘BNC2014’ in the main CQPweb menu.

Also, right now there is a great opportunity to take part in the written BNC 2014 project, a written counterpart to the Spoken BNC2014. If you’d like to contribute to the written BNC2014, please check out the project’s website for more information.

Learn about the BNC2014, scan a book sample and contribute to the corpus…

On Saturday 12 May 2018, CASS hosted a small training event at Lancaster University for a group of participants who came from different universities in the UK. We talked about the BNC2014 project and discussed both the theoretical underpinnings and the practicalities of corpus design and compilation. Slides from the event are available as pdf here.

The participants then tried in practice what is involved in the compilation of a large general corpus such as the BNC2014. They selected and scanned samples of books from current British fiction, poetry and a range of non-fiction books (history, popular science, hobbies etc.). Once processed, these samples will become a part of the written BNC2014.

Here are some pictures from the event:

Carmen Dayrell and Vaclav Brezina before the event

Elena Semino welcoming participants

In the computer lab: Abi Hawtin helping participants

A box full of books

If you are interested in contributing to the written BNC2014, go to the project website to find out about different ways in which you can participate in this exciting project.

The event was supported by ESRC grant no. EP/P001559/1.

Triangulating findings from a corpus-based study: An interview with Adil Ray (creator of Citizen Khan)

My doctoral thesis investigated the ‘Construction of Identities in the BBC Sitcom Citizen Khan’ by analysing a corpus of over 40,000 words, which consisted of transcripts from all the episodes of the show within the first two seasons. In my analysis, there were a number of instances where I had made some assumptions in relation to the motivations of the scriptwriters when incorporating certain scenes into the programme. Therefore, in order to triangulate my findings, I decided to interview the content creator (Adil Ray) and ascertain from him if my assumptions had been correct. (Transcript of the full interview is available at:

It would not be feasible in this short article to highlight all the various points discussed within the interview and how they correlated with my findings. However, I aim to provide at least one very vivid example, which highlights the importance of triangulating especially the qualitative aspects of a corpus-based analysis. Readers who have viewed the sitcom Citizen Khan may have noticed the ‘running gag’ throughout the series, where the mosque manager Dave (a Caucasian convert to Islam) would offer the Islamic salutation (As Salaamu Alaikum) to Mr Khan (a Pakistani British Muslim) and Khan would respond to him by saying ‘hello Dave’.

In total, there were ten instances of this exchange in the corpus. In my thesis (pp.136-145), I discussed in some detail the textual evidence from the Quran and Prophetic Traditions (Hadeeth) that indicates the importance of greetings within Islam and the etiquette involved when giving or responding to an Islamic salutation. Taking into consideration this contextual information, I concluded that the scriptwriters had incorporated this gag into the show to indicate that Mr Khan did not consider Dave to be a ‘real’ or ‘proper’ Muslim. Furthermore, I argued that they were also trying to highlight how Muslim converts are part of an ‘out-group’, which can sometimes be ostracised by those who are born Muslim.

During my interview with Ray, I questioned him on the significance of Mr Khan responding to Dave’s Islamic greeting with ‘hello’. Ray said that from his own point of view, it could be seen as signifying that Khan was not comfortable with ‘this white chap coming in and being the manager of the mosque and being a Muslim’. However, Ray then went on to explicitly state that he believed that Khan’s main grievance was that Dave, rather than Khan himself, had the job of mosque manager:

‘I don’t think it was the fact that he was uncomfortable necessarily with his race, it’s not that, but he was uncomfortable that there was somebody who was a manager and probably had more authority than Khan and in a way he was better than Khan at managing and doing things. And that was the thing that riled Khan, Khan probably thought he would be that person, he would have that job.’

I then specifically mentioned that there was a Quranic instruction that dictated how an Islamic greeting should be responded to and thus the interaction could indicate that Khan did not see Dave as a full Muslim. However, Ray did not seem to fully entertain this notion and once again went on to stress that he believed Khan’s main gripe with Dave was due to the mosque manager position and that ultimately Khan would always be there for Dave.

The main premise of my argument in the thesis was that the scriptwriters chose to highlight Khan’s usage of ‘hello’ to clearly indicate that he did not fully accept Dave as a Muslim. However, after my discussion with Ray, it became abundantly clear that this was not in fact their primary motivation. Citizen Khan has three co-writers (Anil Gupta, Richard Pinto, Adil Ray), with Ray being the only Muslim amongst them; thus, if the intention was to signify a contravention of Islamic etiquette, he would presumably be the most aware of it of the three.

Thus, triangulating my research findings in this manner has highlighted that, when engaging in linguistic analysis, an analyst may at times ‘over-analyse’ language usage in their pursuit of extracting ‘meaning behind the text’.

Bilal Kadiri is an Assistant Professor at King Khalid University and completed his PhD at Lancaster University under the supervision of CASS’s Paul Baker. He can be contacted by email at bilal(Replace this parenthesis with the @ sign)

Media delusions: the (mis)representation of people with schizophrenia in 9 U.K. national newspapers between 2000 and 2015

In 2006, The Independent published an article advertising to its readers some of the islands off the coast of New Zealand as ideal holiday destinations. Amongst descriptions of various idyllic landscapes, cultural eccentricities and tourist attractions, the author warns readers of some of the local fauna:

     “Wildlife-wise, there are not just hammerhead sharks in these parts, he told me, but school sharks and mako sharks – the paranoid schizophrenics of the shark world” (The Independent, 2 September 2006).

What is meant by ‘the paranoid schizophrenics of the shark world’? Surely not that a species of shark is affected by a chronic mental disorder, typically characterised by delusions and auditory hallucinations?

I think most readers would agree that the author does not mean this literally. In fact, there is currently no evidence that sharks, or other animals besides humans, experience symptoms of psychosis. Instead, given some of the shared characteristics of sharks and ‘paranoid schizophrenics’ in our background knowledge, we can infer that by ‘paranoid schizophrenics of the shark world’, the journalist means ‘highly dangerous sharks’. In other words, our understanding of what is meant rests on the assumption that paranoid schizophrenics are aggressive and violent members of the human race.

So where have we learnt to believe that people with schizophrenia are more violent than other people? Probably not from personal experience or credible evidence. Statistics repeatedly show that people with schizophrenia are not significantly more likely to commit violent crimes than the general population (Kalucy et al, 2011; Fazel and Grann, 2006). Other studies report that people with schizophrenia are instead more likely to be the targets of violence (Wehring and Carpenter, 2011).

One likely explanation is the media, which has been shown to frequently represent people with schizophrenia as aggressive, unpredictable killers who are both ‘mad’ and ‘bad’ (e.g. Cross, 2014; Clement, 2008; Coverdale, 2002). These misrepresentations are particularly alarming given the influence of the media on public attitudes. Since members of the public are unlikely to have first-hand experiences with people with schizophrenia, they obtain almost all their understanding of, and attitudes towards, people with schizophrenia from the increasingly ubiquitous media (Angermeyer et al, 2005). People with the diagnosis themselves are given little choice but to internalise these malign stereotypes, which have been found to deter some individuals from seeking medical help and, sadly, to increase the risk of suicide (Harrison and Gill, 2010; Wilkinson, 1994).

For my doctoral thesis, I have spent the last two years examining the ways in which schizophrenia is portrayed in the U.K. national press between 2000 and 2015. I pay attention, not only to how people with schizophrenia are misrepresented as violent, but also to other, less well-documented representations. For instance, my project has also led me to consider cases where the press paint a picture of people with schizophrenia as having special and unexplained creative powers, serving to make them separate and different from ‘ordinary folk’, and perhaps resulting in expectations that may be difficult to meet.

In order to consider the most frequent portrayals and how they might have changed over time, I am examining all articles that refer to a diagnosis of schizophrenia in some way and were published between 2000 and 2015 by 9 U.K. national newspapers: five tabloids (The Express, The Mail, The Mirror, The Star, The Sun) and four broadsheets (The Guardian, The Independent, The Telegraph, The Times). This comprises 16,466 articles and 15,134,066 words, which I then analyse for frequent patterns using a combination of statistical computer tools and manual linguistic analysis.

As a linguist, I pay close attention to how positive or negative portrayals are realised through subtle choices in language. These can reveal the origins of various misconceptions. For instance, 14 of the top 25 ‘doing’ words that occur unusually frequently in the vicinity of the word schizophrenic in my data refer to violent behaviours. In other words, ‘schizophrenics’ frequently attack, behead, punch, murder, rape, stab and slash others, or at least pose a risk or threaten them.

     A MACHETE-wielding schizophrenic who slashed two guards in a rampage through MI5’s HQ was locked up in a mental health unit indefinitely yesterday. (The Sun, 22 June 2005).

     A paranoid schizophrenic beheaded his flatmate in a frenzied attack after suffering from delusions that he was being persecuted, a court has heard. (The Mirror, 2 December 2013).

The frequency with which schizophrenic occurs in the vicinity of violent action words helps explain why someone might have chosen to characterise the school and mako sharks as ‘the paranoid schizophrenics of the shark world’, and how readers were able to make sense of this analogy.
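For readers curious about what looking ‘in the vicinity’ of a word involves, here is a minimal Python sketch of window-based co-occurrence counting. It is not the software used in this project, which is not named here. The tokenisation, the window size and all names are assumptions made for illustration, and it produces raw counts only, whereas identifying words that occur ‘unusually frequently’ additionally requires a statistical association measure.

```python
import re
from collections import Counter


def window_collocates(text, node, span=5):
    """Count the words found within `span` tokens either side of `node`.

    Tokenisation is deliberately crude: lower-cased runs of word
    characters. Each occurrence of the node word contributes its own
    left and right context to the counts.
    """
    tokens = re.findall(r"\w+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - span):i]
            right = tokens[i + 1:i + 1 + span]
            counts.update(left + right)
    return counts


# Usage: the most frequent neighbours of a node word in a toy text.
sample = "A paranoid schizophrenic beheaded his flatmate in a frenzied attack."
top_neighbours = window_collocates(sample, "schizophrenic", span=3).most_common(5)
```

Across millions of words, counts of this kind are compared against how often each word occurs elsewhere in the data, which is what allows patterns such as the violent ‘doing’ words above to stand out.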

Overall, it is clear that the media have a responsibility to offer more accurate and balanced representations of schizophrenia. In utilising a characteristically language-oriented approach, the goal of this research is to offer media institutions practical ways to improve their reporting of schizophrenia by identifying potentially problematic choices in language and suggesting more balanced and accurate alternatives.

If you have any suggestions or observations, or for any reason would like to get in touch, please contact me via my email: j.balfour(Replace this parenthesis with the @ sign)

CASS in the 2017 ESRC Festival of Social Science

The ESRC Festival of Social Science is an annual celebration of social science research, comprising a huge array of public events of all kinds and designed to promote awareness of UK social science research across the board. This year, it runs from 4th to 11th November.

As the team at ESRC says,

“You may be surprised at just how relevant the Festival’s events are to society today. Social science research makes a difference. Discover how it shapes public policy and contributes to making the economy more competitive, as well as giving people a better understanding of 21st century society. From big ideas to the most detailed observations, social science affects us all everyday – at work, in school, when raising children, within our communities, and even at the national level.”

As an ESRC Centre, CASS has been involved in the Festival since our work began in 2013. We have organised events of different types in different years – for instance, in the first year of the Centre, our contribution to the Festival was a series of talks in schools in the North West of England to introduce the kind of social science analysis in which we specialise to students in sixth-form. It was great to be able to reach out to an audience that we rarely have a chance to communicate with about our work.

In subsequent years, we organised events under our “Valuing language” banner – aimed at using examples of our work to present to a public audience the benefits across the social sciences that arise in research that understands the value of language for all kinds of social investigations. Our first “Valuing language” event was in London; the following year we held another event in Manchester.

This year our contribution to the Festival of Social Science is a new “Valuing language” presentation. This event focuses in particular on two strands of research that have been under way in CASS for the past two years or so, looking at the intersection of language with the critical issue of health and healthcare. We are also returning to London for the event, entitled “Valuing language: Effective communication in healthcare provision”. The event – at 6.30 pm on Thursday 9th November – is particularly aimed at healthcare practitioners and those training to enter healthcare services – but of course, it is open to anyone with an interest in this work!

The evening will include two presentations, one on each of these strands of work. First will be a presentation of research into patient comments on healthcare services collected through the NHS Choices website. Patient feedback has often been analysed by looking straightforwardly at the numeric ratings given in feedback. However, the textual responses supplied alongside these ratings are a far richer source of data – albeit so extensive that they can be far from straightforward to analyse! But this is, of course, where corpus-based linguistic methods come in. A CASS project, led by Paul Baker, has applied these methods to investigate the value that patients place on interpersonal skills and effective, compassionate communication. Two members of the team working on this project, myself and Craig Evans, will give an overview of how we have gone about analysing this unique and fascinating source of data.

In the second half of the event, CASS Director Elena Semino will present her work looking at patients’ reporting of pain. A common way for healthcare practitioners to assess the level of pain that patients are experiencing is to use questionnaires that present descriptor words – such as “pricking/boring/drilling/stabbing”. The descriptor word that a patient chooses is assumed to reflect the level of their pain. Elena’s research suggests, however, that patients’ choice of descriptor may in many cases instead be a result of how strongly associated with the word “pain” the descriptor word is. Again, this is a problem that corpus-based language analysis is an ideal way to address. Elena will explain the findings of her investigation and also consider the implications these findings have for how descriptor-word questionnaires should be used in assessing patients’ pain.

We’re all looking forward to participating once again in the ESRC Festival and we hope to see you there!

Find out more (and sign up for the event) via

Introducing Visiting Researcher Ioannis Saridakis

Starting from Translation Studies, as both an academic discipline and a professional practice in the early 90s, I soon embarked on the then innovative field of corpus linguistics and started exploring its links with, and applicability in, translation and interpreting studies. Soon after finishing my PhD in Corpus Linguistics, Translation and Terminology in 1999, and having already worked for more than a decade as a professional translator and head of a translation agency, I started teaching at the Department of Translation and Interpreting of the Ionian University in Greece. Currently, I am Associate Professor of Corpus Linguistics and Translation Studies at the University of Athens (School of Economics and Political Science), as well as director of the IT research lab and co-director of the Bilingualism, Linguistics and Translation research lab and deputy director of the Translation and Interpreting postgraduate programme at the University of Athens.

In the past, and in parallel to my core academic and research activities, I have also collaborated with many national and international organisations, as a consultant in the fields of linguistics, translation and interpreting. My research activities include a number of empirical studies and research projects in the fields of Corpus Linguistics and Discourse Analysis, as well as on Corpus Linguistics and Translation Studies, a discipline which I consider to rely essentially on the functional analysis of discourse, both methodologically and practically. My most recent research and publications focus on corpus-driven methods and models for systematically analysing the lexis and the rhetoric of a range of discourses, including analysis of the discourse of Golden Dawn, i.e. Greece’s far right political party, and its representations and meta-discoursal perceptions by Greek and European newspapers, the study of the diachronic variation of the lexis used to designate and qualify RASIM, especially during and after the recent migrant crisis, and exploration of the linguistic aspects of impoliteness and aggression in Greek Computer-mediated communication (CMC).

At CASS, I will be working with Professor Paul Baker on a project that aims to investigate critical aspects of populist discourses in Europe, especially during and after the 2008 financial (and then social and political) crisis. The research draws heavily on large-scale corpora, with a focus on so far under-researched discourses, particularly of the ‘left’ and the ‘far-left’, including ‘anti-austerity’ and ‘anti-globalisation’ discourses, from Greece, the UK and France. By charting such a landscape of discourse traits, foci and conventionalisations, also from a cross-linguistic perspective, I also aim to reveal patterns of similarity and dissimilarity (and tentatively, interconnectedness) with the significantly more researched ‘right-wing’ political and newspaper discourses (‘nationalist’, ‘anti-immigration’, ‘anti-Islam’). To pursue these goals, my research will use cutting-edge research methods and computational techniques for corpus compilation and annotation, as well as statistical analysis, including analysis of collocational patterns and networks, and will critically correlate quantitative findings with the social and political backdrop and its crucial milestones. In other words, it will explore how linguistic patterns, as well as changes and variations, are linked to social, political and economic changes and to significant events.

I’m excited to be able to work at CASS, and to join such a wonderful team of committed academics and researchers.

I intend to post frequently on this blog, as the project is pursued further, highlighting significant preliminary findings and tentative conclusions.