Compiling a trilingual corpus to examine the political and social representation(s) of ‘people’ and ‘democracy’

As a visiting researcher at CASS from mid-October 2017 until the end of August 2018 (coming from the University of Athens, where I am Associate Professor of Corpus Linguistics and Translation), I aim to investigate critical aspects of populist discourses in Europe and their variation, especially during and after the 2008 financial (and subsequently social and political) crisis, and to reveal patterns of similarity and difference (and, tentatively, of interconnectedness and intertextuality) across a wide spectrum of political parties, think tanks and organisations. Since this is essentially a Corpus-Assisted Discourse Study (CADS), a first way into the data is to identify and statistically analyse the collocational patterns and networks built around key lexemes (e.g. ‘people’, ‘popular’, ‘democracy’), before moving on to critically correlating such quantitative findings with the social and political backdrop(s) and crucial milestones.


The first task of this complex corpus-driven effort, now complete, has been to compile a large-scale trilingual (EN, FR, EL) ‘focus’ corpus. This was a laborious technical process: before the data could be examined in a consistent manner, several problems had to be addressed and solutions implemented, as outlined below.


  1. As a key primary aim was to gather as much data as possible from the websites of political parties, political organisations, think tanks and official party newspapers, from the UK, France and Greece, it was clear from the outset that it would not be possible to manually cull the corpus data, given the sheer number of sources and of texts. On the other hand, automatic corpus compilation tools (e.g. BootCaT and WebBootCaT in SketchEngine) could not handle the extent and diversity of the corpora. To address this problem, texts were culled using web crawling techniques (‘wget -r’ in Linux bash) and the HTTrack app, with a lot of tweaking and the necessary customisation of download parameters, to account for the (sometimes very tricky) batch download restrictions of some websites.
  2. Clean up html boilerplate (i.e. sections of code, advertising material, etc. included in html pages that are irrelevant to the corpus texts). This was accomplished using jusText (the tool used by M. Davies to compile the NOW corpus), with a few tweaks so as to be able to handle some ‘malformed’ data, especially from Greek sources.
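The recursive crawl in step 1 can be sketched as a wget invocation along the following lines. The URL and all parameters here are illustrative stand-ins, not the actual settings, since each site needed its own tuning (and HTTrack was used where wget proved insufficient); the command is built and echoed rather than executed, to keep the sketch side-effect free:

```shell
#!/bin/sh
# Hypothetical source URL -- stands in for a party or newspaper website.
SITE="https://example.org/news/"

# Recursive crawl, HTML pages only, throttled (--wait/--random-wait/
# --limit-rate) so as not to trip a site's batch-download restrictions.
CMD="wget --recursive --level=5 --accept html,htm --no-parent --wait=1 --random-wait --limit-rate=200k $SITE"

# Print the command instead of running it.
echo "$CMD"
```

In a real run, `--level` and the throttling values would be adjusted per site, exactly the kind of customisation of download parameters described above.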

As I plan to specifically analyse the variation of key descriptors and qualifiers (‘key’ keywords and their c-collocates) as a first way into the “charting” of the discourses at hand, the publication date of each article or text is a critical part of the corpus metadata, one that needs to be retained for further processing. However, most if not all of this information is practically lost in the web-crawling and boilerplate-removal stages. Therefore, the html clean-up process was preceded by the identification and extraction of the articles’ publication dates, using a PHP script developed with the help of Dr Matt Timperley (CASS, Lancaster) and Dr Pantelis Mitropoulos (University of Athens). This script scans all files in a dataset, accounts for all possible date formats in all three languages, and then automatically creates a tab-delimited (csv) table containing the extracted date(s), matched with the respective filenames. Its accuracy is estimated at ca. 95% and can be improved further by checking the output and rescanning the original data with a few code tweaks.
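A much simplified shell equivalent of that extraction step (the actual script is in PHP and covers the full range of EN, FR and EL date formats) might look like this, matching only ISO-style dates:

```shell
# Scan every .html file in a directory, grab the first ISO-style date
# (YYYY-MM-DD) found in each, and emit a tab-delimited filename/date table.
extract_dates() {
  for f in "$1"/*.html; do
    [ -e "$f" ] || continue
    d=$(grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}' "$f" | head -n 1)
    printf '%s\t%s\n' "$f" "${d:-UNKNOWN}"
  done
}
```

The hard part in practice is the variety of formats per language and per source (e.g. ‘15 October 2017’, ‘15 octobre 2017’, ‘15 Οκτωβρίου 2017’), which is what the ca. 95% accuracy estimate refers to.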

  3. Streamline the data by removing irrelevant stretches of text (e.g. “Share this article on Facebook”) possibly left behind during the boilerplate-removal process. This step, performed with Linux commands (e.g. find, grep, sed, awk) and regular expressions, greatly improves the accuracy of the following step.
  4. Remove duplicate files: since onion (ONe Instance ONly, the script used e.g. in SketchEngine) only looks for pattern repetitions within a single file and within rather short maximum paragraph intervals, I used FSLint, an application that compares the files’ MD5 checksums and identifies duplicates. This is extremely accurate and practically eliminates all files whose text is repeated in full across various sections of the websites, regardless of file name or creation date (in fact, this was found to be the case mostly with political party websites, not newspapers). (NB: a similar process is also available in Mike Scott’s WordSmith Tools v7.)
  5. Order the files by publication year for each subcorpus, calculate the corresponding metadata (files, tokens, types and average token count, by year) for each dataset, and extract the “focus corpus”, i.e. the files containing the node lemmas (lemmas related to the core question of this research: people*|popular|democr*|human* and their FR and EL equivalents), using grep and regular expressions. (Note that FAR, an open-source, Java-based GUI app, combines these search options for large datasets.)
  6. Finally, prepare the data for uploading to LU’s CQPWeb by appending the publication year of each text, as extracted in the date-extraction stage above, to the corresponding raw text file. This was done using yet another PHP script, kindly developed by Matt Timperley.
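The streamlining, deduplication and lemma-filtering steps above can be sketched as small shell functions. The patterns are simplified stand-ins (a single boilerplate phrase, an EN-only lemma pattern), and the checksum pass is a hand-rolled analogue of what FSLint does, not the tool itself:

```shell
# Streamline: strip leftover boilerplate lines such as social-media prompts.
streamline() {
  sed -i '/Share this article on Facebook/d' "$1"/*.txt
}

# Deduplicate: report exact duplicates by MD5 checksum. Files whose hash
# repeats are byte-for-byte copies, whatever their names or timestamps.
find_dupes() {
  md5sum "$1"/*.txt | sort | awk 'seen[$1]++ { print $2 }'
}

# Focus corpus: keep only files containing at least one node lemma
# (simplified pattern; the real filter also covered FR and EL equivalents).
focus_files() {
  grep -l -E 'people|popular|democr|human' "$1"/*.txt
}
```

Running streamline first matters: once the “Share this article…” lines are gone, files that differ only in such residue become exact duplicates and are caught by the checksum pass, which is why streamlining improves the accuracy of deduplication.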


In a nutshell, texts were culled from a total of 68 sources (24 Greek, 26 British, and 18 French). This dataset is divided into three major corpora, as follows:

  1. Cumulative corpus (CC, all data): 746,798 files/465,180,684 tokens.
  2. Non-journalistic research corpus (RC): 419,493 files/307,231,559 tokens.
  3. Focus corpus (FC): 205,038 files/235,235,353 tokens.
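File and token counts like those above can be tallied for any dataset directory with standard tools; a minimal sketch (tokens crudely approximated by whitespace-delimited words via wc -w, whereas the published figures come from proper corpus tokenisation):

```shell
# Count files and (approximate) tokens in a directory of plain-text files.
corpus_stats() {
  files=$(ls "$1"/*.txt | wc -l | tr -d ' ')
  tokens=$(cat "$1"/*.txt | wc -w | tr -d ' ')
  printf 'files=%s tokens=%s\n' "$files" "$tokens"
}
```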

Introducing Visiting Researcher Ioannis Saridakis

Starting from Translation Studies, as both an academic discipline and a professional practice, in the early 1990s I soon moved into the then-innovative field of corpus linguistics and started exploring its links with, and applicability to, translation and interpreting studies. Soon after finishing my PhD in Corpus Linguistics, Translation and Terminology in 1999, and having already worked for more than a decade as a professional translator and head of a translation agency, I started teaching at the Department of Translation and Interpreting of the Ionian University in Greece. Currently, I am Associate Professor of Corpus Linguistics and Translation Studies at the University of Athens (School of Economics and Political Science), as well as director of the IT research lab, co-director of the Bilingualism, Linguistics and Translation research lab, and deputy director of the Translation and Interpreting postgraduate programme at the University of Athens.

In the past, and in parallel to my core academic and research activities, I have also collaborated with many national and international organisations as a consultant in the fields of linguistics, translation and interpreting. My research activities include a number of empirical studies and research projects in the fields of Corpus Linguistics and Discourse Analysis, as well as in Corpus Linguistics and Translation Studies, a discipline which I consider to rely essentially on the functional analysis of discourse, both methodologically and practically. My most recent research and publications focus on corpus-driven methods and models for systematically analysing the lexis and the rhetoric of a range of discourses, including analysis of the discourse of Golden Dawn, Greece’s far-right political party, and its representations and meta-discoursal perceptions in Greek and European newspapers; the study of the diachronic variation of the lexis used to designate and qualify RASIM (refugees, asylum seekers, immigrants and migrants), especially during and after the recent migrant crisis; and exploration of the linguistic aspects of impoliteness and aggression in Greek computer-mediated communication (CMC).

At CASS, I will be working with Professor Paul Baker on a project that aims to investigate critical aspects of populist discourses in Europe, especially during and after the 2008 financial (and then social and political) crisis. The research draws heavily on large-scale corpora, with a focus on so far under-researched discourses, particularly of the ‘left’ and the ‘far-left’, including ‘anti-austerity’ and ‘anti-globalisation’ discourses, from Greece, the UK and France. By charting such a landscape of discourse traits, foci and conventionalisations, also from a cross-linguistic perspective, I also aim to reveal patterns of similarity and dissimilarity (and, tentatively, interconnectedness) with the significantly more researched ‘right-wing’ political and newspaper discourses (‘nationalist’, ‘anti-immigration’, ‘anti-Islam’). To pursue these goals, my research will use cutting-edge research methods and computational techniques for corpus compilation and annotation, as well as statistical analysis, including analysis of collocational patterns and networks, and will critically correlate quantitative findings with the social and political backdrop and its crucial milestones. In other words, it will explore how linguistic patterns, as well as changes and variations, are linked to social, political and economic changes and to significant events.

I’m excited to be able to work at CASS, and to join such a wonderful team of committed academics and researchers.

I intend to post frequently on this blog, as the project is pursued further, highlighting significant preliminary findings and tentative conclusions.