Compiling a trilingual corpus to examine the political and social representation(s) of ‘people’ and ‘democracy’

As a visiting researcher at CASS from mid-October 2017 until the end of August 2018 (coming from the University of Athens, where I am Associate Professor of Corpus Linguistics and Translation), my research aim is to investigate critical aspects of populist discourses in Europe and their variation, especially during and after the 2008 financial (and, subsequently, social and political) crisis, and to reveal patterns of similarity and difference (and, tentatively, of interconnectedness and intertextuality) across a wide spectrum of political parties, think tanks and organisations. Since this is essentially a Corpus-Assisted Discourse Study (CADS), a first way into the data is to identify and statistically analyse the collocational patterns and networks built around key lexemes (e.g. ‘people’, ‘popular’, ‘democracy’), before moving on to critically correlate these quantitative findings with the social and political backdrop(s) and crucial milestones.
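As a toy illustration of this first quantitative step, the sketch below counts the immediate left/right collocates of a node word (‘people’) in a one-word window, over an invented sample sentence. A real collocation analysis would of course use a larger window and association measures (e.g. MI or log-likelihood); the sentence and window size here are purely illustrative.

```shell
# Invented sample text; the node word and window are placeholders for the sketch.
text='the people decide and the people vote while leaders watch the people'
colls=$(echo "$text" | tr ' ' '\n' | awk '
    { w[NR] = tolower($0) }
    END {
        for (i = 1; i <= NR; i++)
            if (w[i] == "people") {
                if (i > 1)  c[w[i-1]]++   # left collocate
                if (i < NR) c[w[i+1]]++   # right collocate
            }
        for (k in c) print c[k], k
    }' | sort -rn)
echo "$colls"
```

Even at this toy scale, the frequency ranking (here, ‘the’ dominates as a left collocate) shows the kind of raw counts that the statistical stage would then weigh against expected frequencies.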


The first task of this complex corpus-driven effort, now complete, has been to compile a large-scale trilingual (EN, FR, EL) ‘focus’ corpus. This was a tedious technical process: before the data could be examined in a consistent manner, several problems had to be addressed and solutions implemented, as outlined below.


  1. As a key primary aim was to gather as much data as possible from the websites of political parties, political organisations, think tanks and official party newspapers in the UK, France and Greece, it was clear from the outset that manually culling the corpus data would not be possible, given the sheer number of sources and texts. On the other hand, automatic corpus compilation tools (e.g. BootCaT and WebBootCaT in SketchEngine) could not handle the size and diversity of the corpora. To address this problem, texts were culled using web crawling techniques (‘wget -r’ in the Linux bash) and the HTTrack app, with a lot of tweaking and the necessary customisation of download parameters to account for the (sometimes very tricky) batch download restrictions of some websites.
  2. Clean up html boilerplate (i.e. sections of code, advertising material, etc. included in html pages that are irrelevant to the corpus text). This was accomplished using jusText (the tool used by M. Davies to compile the NOW corpus), with a few tweaks so as to be able to handle some ‘malformed’ data, especially from Greek sources.
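For illustration, a wget invocation of the kind used in step 1 might look as follows. The URL and all parameter values here are placeholders, not the actual settings used (which were tuned per website); the sketch only prints the command rather than running it, so it needs no network access.

```shell
# Hypothetical crawl command; the URL and parameter values are placeholders.
SITE_URL='https://example.org/news/'
CRAWL_CMD="wget -r -l 6 -np -A html,htm --wait=1 --random-wait $SITE_URL"
# -r: recursive crawl; -l 6: depth limit; -np: stay below the start path;
# -A html,htm: keep only HTML pages; --wait/--random-wait: throttle requests
# to cope with per-site batch download restrictions.
echo "$CRAWL_CMD"
```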

As I plan to specifically analyse the variation of key descriptors and qualifiers (‘key’ keywords and their c-collocates) as a first way into “charting” the discourses at hand, the publication date of each article or text is a critical part of the corpus metadata, and one that needs to be retained for further processing. However, most if not all of this information is effectively lost in the web crawling and boilerplate-removal stages. Therefore, the html clean-up process was preceded by the identification and extraction of the articles’ publication dates, using a PHP script developed with the help of Dr Matt Timperley (CASS, Lancaster) and Dr Pantelis Mitropoulos (University of Athens). This script scans all files in a dataset, accounts for all possible date formats in all three languages, and then automatically creates a tab-delimited (csv) table containing the extracted date(s), matched with the respective filenames. Its accuracy is estimated at ca. 95%, and can be improved further by checking the output and rescanning the original data with a few code tweaks.
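The idea behind the date-extraction pass can be sketched in a few lines of shell. The real pipeline used a PHP script covering the many date formats found across the three languages; this minimal sketch matches only two formats (ISO yyyy-mm-dd and dd/mm/yyyy), runs over two invented demo files, and writes a filename-to-date tab-delimited table.

```shell
# Minimal sketch of the date-extraction pass; the demo files and the two
# date patterns are illustrative only (the real script handled many more).
workdir=$(mktemp -d)
printf 'Published 2014-03-21 by staff\n' > "$workdir/a.html"
printf 'Mis en ligne le 07/11/2016\n'   > "$workdir/b.html"

out="$workdir/dates.tsv"
: > "$out"
for f in "$workdir"/*.html; do
    # keep the first candidate date; the real script records all matches
    d=$(grep -oE '[0-9]{4}-[0-9]{2}-[0-9]{2}|[0-9]{2}/[0-9]{2}/[0-9]{4}' "$f" | head -n1)
    printf '%s\t%s\n' "$(basename "$f")" "$d" >> "$out"
done
cat "$out"
```

The resulting table is exactly the kind of filename/date mapping that is re-attached to the raw text files at the final stage below.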

  3. Streamline the data by removing irrelevant stretches of text (e.g. “Share this article on Facebook”) that may have been left behind during the boilerplate-removal process. This step was carried out using Linux commands (e.g. find, grep, sed, awk) and regular expressions, and it greatly improves the accuracy of the following step.
  4. Remove duplicate files: since onion (ONe Instance ONly, the deduplication script used e.g. in SketchEngine) only looks for pattern repetitions within a single file and within fairly short maximum paragraph intervals, I used FSlint – an application that compares the files’ MD5 signatures to identify duplicates. This is extremely accurate and practically eliminates all files whose text is repeated in full across various sections of a website, regardless of file name or creation date (in fact, this was found to be the case mostly with political party websites, not newspapers). (NB: a similar process is also available in Mike Scott’s WordSmith Tools v7.)
  5. Order the files by publication year for each subcorpus, calculate the corresponding metadata (files, tokens, types and average token count, by year) for each dataset, and then extract the “focus corpus” by keeping only the files that contain the node lemmas (i.e. lemmas related to the core question of this research: people*|popular|democr*|human* and their FR and EL equivalents), using grep and regular expressions. (Note that FAR, an open-source, Java-based GUI app, combines these search options for large datasets.)
  6. Finally, prepare the data for uploading to LU’s CQPWeb by appending the publication year of each text, as extracted in stage 2, to the corresponding raw text file – this was done using yet another PHP script, kindly developed by Matt Timperley.
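The deduplication and node-lemma filtering steps above can be sketched as follows. The three demo files, their contents, and the EN-only lemma pattern are invented for the sketch (the real pipeline used FSlint for deduplication and added FR and EL alternatives to the grep pattern); the MD5-comparison logic is the same idea FSlint applies.

```shell
# Sketch of the dedup + filter steps; demo files and the EN-only pattern
# are placeholders (FR/EL equivalents would be added as alternations).
corpus=$(mktemp -d)
printf 'The people have spoken; popular democracy endures.\n' > "$corpus/a.txt"
printf 'The people have spoken; popular democracy endures.\n' > "$corpus/dup_of_a.txt"
printf 'Quarterly budget figures were released today.\n'       > "$corpus/other.txt"

# 1) Remove exact duplicates: keep only the first file per MD5 signature.
md5sum "$corpus"/*.txt | sort | awk 'seen[$1]++ {print $2}' | xargs -r rm --

# 2) Keep only files containing at least one node lemma (-l lists matching
#    filenames; -i ignores case; -E enables alternation).
focus_files=$(grep -liE 'people|popular|democr|human' "$corpus"/*.txt)
echo "$focus_files"
```

Because the comparison is on full-file MD5 signatures rather than paragraph windows, byte-identical copies are caught regardless of filename or creation date, which matches the behaviour described for step 4.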


In a nutshell, texts were culled from a total of 68 sources (24 Greek, 26 British, and 18 French). This dataset is divided into three major corpora, as follows:

  1. Cumulative corpus (CC, all data): 746,798 files/465,180,684 tokens.
  2. Non-journalistic research corpus (RC): 419,493 files/307,231,559 tokens.
  3. Focus corpus (FC): 205,038 files/235,235,353 tokens.