CASS in the 2017 ESRC Festival of Social Science

The ESRC Festival of Social Science is an annual celebration of social science research – comprising a huge array of public events of all kinds, and designed to promote awareness of UK social science research across the board. This year, it runs from 4th to 11th November.

As the team at ESRC says,

“You may be surprised at just how relevant the Festival’s events are to society today. Social science research makes a difference. Discover how it shapes public policy and contributes to making the economy more competitive, as well as giving people a better understanding of 21st century society. From big ideas to the most detailed observations, social science affects us all every day – at work, in school, when raising children, within our communities, and even at the national level.”

As an ESRC Centre, CASS has been involved in the Festival since our work began in 2013. We have organised events of different types in different years – for instance, in the first year of the Centre, our contribution to the Festival was a series of talks in schools in the North West of England, introducing sixth-form students to the kind of social science analysis in which we specialise. It was great to be able to reach out to an audience that we rarely have a chance to communicate with about our work.

In subsequent years, we organised events under our “Valuing language” banner – aimed at using examples of our work to present to a public audience the benefits, across the social sciences, that arise from research that understands the value of language for all kinds of social investigation. Our first “Valuing language” event was in London; the following year we held another event in Manchester.

This year our contribution to the Festival of Social Science is a new “Valuing language” presentation. This event focuses in particular on two strands of research that have been under way in CASS for the past two years or so, looking at the intersection of language with the critical issue of health and healthcare. We are also returning to London for the event, entitled “Valuing language: Effective communication in healthcare provision”. The event – at 6.30 pm on Thursday 9th November – is particularly aimed at healthcare practitioners and those training to enter healthcare services – but of course, it is open to anyone with an interest in this work!

The evening will include two presentations, one on each of these strands of work. First will be a presentation of research into patient comments on healthcare services collected through the NHS Choices website. Patient feedback has often been analysed by looking straightforwardly at the numeric ratings given in feedback. However, the textual responses supplied alongside these ratings are a far richer source of data – albeit one so extensive that it can be difficult to analyse! But this is, of course, where corpus-based linguistic methods come in. A CASS project, led by Paul Baker, has applied these methods to investigate the value that patients place on interpersonal skills and effective, compassionate communication. Two members of the team working on this project, myself and Craig Evans, will give an overview of how we have gone about analysing this unique and fascinating source of data.

In the second half of the event, CASS Director Elena Semino will present her work looking at patients’ reporting of pain. A common way for healthcare practitioners to assess the level of pain that patients are experiencing is to use questionnaires that present descriptor words – such as “pricking/boring/drilling/stabbing”. The descriptor word that a patient chooses is assumed to reflect the level of their pain. Elena’s research suggests, however, that patients’ choice of descriptor may in many cases instead be a result of how strongly associated with the word “pain” the descriptor word is. Again, this is a problem that corpus-based language analysis is an ideal way to address. Elena will explain the findings of her investigation and also consider the implications these findings have for how descriptor-word questionnaires should be used in assessing patients’ pain.

We’re all looking forward to participating once again in the ESRC Festival and we hope to see you there!

Find out more (and sign up for the event) via

Introducing Visiting Researcher Ioannis Saridakis

Starting from Translation Studies, as both an academic discipline and a professional practice, in the early 90s I soon embarked on the then-innovative field of corpus linguistics and started exploring its links with, and applicability to, translation and interpreting studies. Soon after finishing my PhD in Corpus Linguistics, Translation and Terminology in 1999, and having already worked for more than a decade as a professional translator and head of a translation agency, I started teaching at the Department of Translation and Interpreting of the Ionian University in Greece. Currently, I am Associate Professor of Corpus Linguistics and Translation Studies at the University of Athens (School of Economics and Political Science), where I also direct the IT research lab, co-direct the Bilingualism, Linguistics and Translation research lab, and serve as deputy director of the Translation and Interpreting postgraduate programme.

In the past, and in parallel to my core academic and research activities, I have also collaborated with many national and international organisations as a consultant in the fields of linguistics, translation and interpreting. My research activities include a number of empirical studies and research projects in the fields of Corpus Linguistics and Discourse Analysis, as well as Corpus Linguistics and Translation Studies – a discipline which I consider to rely essentially, both methodologically and practically, on the functional analysis of discourse. My most recent research and publications focus on corpus-driven methods and models for systematically analysing the lexis and rhetoric of a range of discourses. This work includes analysis of the discourse of Golden Dawn, Greece’s far-right political party, and its representations and meta-discoursal perceptions in Greek and European newspapers; study of diachronic variation in the lexis used to designate and qualify RASIM (refugees, asylum seekers, immigrants and migrants), especially during and after the recent migrant crisis; and exploration of the linguistic aspects of impoliteness and aggression in Greek computer-mediated communication (CMC).

At CASS, I will be working with Professor Paul Baker on a project that aims to investigate critical aspects of populist discourses in Europe, especially during and after the 2008 financial (and then social and political) crisis. The research draws heavily on large-scale corpora, with a focus on so-far under-researched discourses, particularly of the ‘left’ and the ‘far-left’, including ‘anti-austerity’ and ‘anti-globalisation’ discourses, from Greece, the UK and France. By charting such a landscape of discourse traits, foci and conventionalisations, also from a cross-linguistic perspective, I also aim to reveal patterns of similarity and dissimilarity (and, tentatively, interconnectedness) with the significantly more researched ‘right-wing’ political and newspaper discourses (‘nationalist’, ‘anti-immigration’, ‘anti-Islam’). To pursue these goals, my research will use cutting-edge research methods and computational techniques for corpus compilation and annotation, as well as statistical analysis, including analysis of collocational patterns and networks, and will critically correlate quantitative findings with the social and political backdrop and its crucial milestones. In other words, it will explore how linguistic patterns, as well as changes and variations, are linked to social, political and economic changes and to significant events.

I’m excited to be able to work at CASS, and to join such a wonderful team of committed academics and researchers.

I intend to post frequently on this blog, as the project is pursued further, highlighting significant preliminary findings and tentative conclusions.

Texts and Images of Austerity: Workshop in Erlangen, Germany

On Sunday 24th September, a few of us from CASS travelled to the small Bavarian city of Erlangen, Germany, to attend ‘Texts and Images of Austerity in Britain’, a five-day workshop held at Friedrich-Alexander-Universität. Our number included Deputy Director of CASS Andrew Hardie, Olivia Ha, Craig Evans, and former CASS member Laura Paterson (now with the Open University). Also at the event was former CASS Director Tony McEnery.

Partly inspired by ‘Triangulating Methodological Approaches in Corpus Linguistic Research’ (2016), edited by Paul Baker and Jesse Egbert, the workshop brought together researchers – both seasoned and budding – to work on a common data set on the topic of austerity: a 20+ million-word corpus of news articles from the Guardian and Daily Telegraph (2010-2016), nearly 400 images from these articles, and a collection of Twitter messages from the same period. In that volume, Baker, Egbert and their contributors had each applied different corpus methods to a shared data set. At the Erlangen workshop, methods from different disciplines were used, with participants coming from a variety of fields, including Sociology, Political Science, Linguistics, and Economics. The purpose was to encourage transdisciplinary collaboration in the study of how austerity is discursively constructed.

The workshop followed a format that combined short presentations with working groups. In the convivial atmosphere of a group of twenty or so international researchers, each participant presented their approach to looking at austerity. A variety of theories and techniques were outlined in the presentations, and corpus software and methods were well represented across the workshop. In his talk, Tony McEnery provided an overview of corpus linguistics, presenting its value as an approach that focuses on how language is actually used rather than on how people think it is used. This overview also highlighted the variety of ways corpus methods can be employed in the study of text and talk. In other presentations, the focus was more on the means of doing corpus analysis, namely the software – for example, CQPweb, the main interface for accessing and analysing the text data at the workshop, for which Andrew Hardie was on hand to provide a demonstration and offer his support.

Contributions from others with links to CASS included: Olivia Ha’s look at the collocates of emotion and evaluation, Craig Evans’s consideration of the notion of empathy, and Laura Paterson’s presentation on the use of geoparsing software. Other participants covered a range of techniques, theories and topics including multimodal annotation, textual analysis of moral logic, metaphor of austerity as attack, gender and austerity, crisis narratives, and critical realism.

In the working groups, participants with similar interests naturally gravitated towards each other, particularly along the lines of those with more of a corpus focus and those with more of a multimodal focus. This, nevertheless, did not prevent a fruitful exchange of information and ideas, with several participants also presenting initial findings from their collaborative work. From a corpus perspective, a major challenge was the high number of duplicates in the newspaper corpus (a known issue with LexisNexis and capturing online newspaper data). The benefit of the workshop situation was that several participants with computational expertise were present and able to work out ways of cleaning up the data.
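One common way to tackle such duplication – offered here as a generic sketch rather than the method the workshop participants actually used – is to compare articles by their overlapping word n-grams (“shingles”) and flag pairs whose Jaccard similarity exceeds a threshold. The example articles below are invented:

```python
def shingles(text, n=5):
    """Return the set of n-word sequences (shingles) in a text."""
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity between two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def find_duplicates(articles, threshold=0.7):
    """Return index pairs of articles whose similarity exceeds threshold."""
    sets = [shingles(t) for t in articles]
    pairs = []
    for i in range(len(sets)):
        for j in range(i + 1, len(sets)):
            if jaccard(sets[i], sets[j]) > threshold:
                pairs.append((i, j))
    return pairs

docs = [
    "The chancellor defended austerity measures in a speech on Monday morning",
    "The chancellor defended austerity measures in a speech on Monday evening",
    "Campaigners marched against public spending cuts in central London today",
]
print(find_duplicates(docs))  # the first two, near-identical articles are flagged: [(0, 1)]
```

Real news deduplication has to deal with boilerplate, updated editions and much larger collections (where pairwise comparison becomes too slow, and techniques such as MinHash are used instead), but the underlying idea is the same.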

The workshop in Erlangen was run by Tim Griebel, Stefan Evert, and a team of others at Friedrich-Alexander-Universität. Our hosts were incredibly welcoming, providing food and refreshments, organising accommodation and evening meals in the charming city of Erlangen, and even arranging a mid-week city tour. The workshop itself was an interesting and rewarding exercise that forms part of a larger project on austerity. It helped create a space for different kinds of social scientists to exchange ideas and develop working relationships, which may grow into future research collaborations.

For more information on the workshop and its theme of austerity, visit the workshop website.

Sketch Engine and other tools for language analysis

Here’s some good news for the beginning of the term: all Lancaster University staff and students now have access to Sketch Engine, an online tool for the analysis of linguistic data. Sketch Engine is used by major publishers (CUP, OUP, Macmillan, etc.) to produce dictionaries and grammar books. It can also be used for a wide range of research projects involving the analysis of language and discourse. Sketch Engine offers access to a large number of corpora in over 85 different languages. Many of the web-based corpora available through Sketch Engine include billions of words that can be analysed easily via the online interface.

In Sketch Engine, you can, for example:

  • Search and analyse corpora via a web browser.
  • Create word sketches, which summarise the use of words in different grammatical frames.
  • Load and grammatically annotate your own data.
  • Use parallel (translation) corpora in many languages.
  • Crawl the web and collect texts that include a combination of user-defined keywords.
  • Much more.

How to connect to Sketch Engine?

  1. Go to
  2. Click on ‘Authenticate using your institution account (Single Sign On)’

  3. Select ‘Lancaster University’ from the drop-down menu and use your Lancaster login details to log on. That’s all – you can start exploring corpora straightaway!

Other corpus tools

There are also many other tools for analysis of language and corpora available to Lancaster University staff and students (and others, of course!). The following table provides an overview of some of them.


Desktop (offline) tools

#LancsBox – analyses your own data: YES; provides corpora: YES
This tool runs on all major operating systems (Windows, Linux, Mac). It has a simple, easy-to-use interface and allows searching and comparing corpora (your own data as well as the corpora provided). In addition, #LancsBox provides unique visualisation tools for analysing frequency, dispersion, keywords and collocations.

Web-based (online) tools

CQPweb – analyses your own data: NO; provides corpora: YES
This tool offers a range of pre-loaded corpora for English (current and historical) and other languages including Arabic, Italian, Hindi and Chinese. It includes the Spoken BNC2014, a brand new 10-million-word corpus of current informal British speech. It has a number of powerful analytical functionalities. The tool is freely available from

Wmatrix – analyses your own data: YES; provides corpora: NO
This tool allows users to process their own data, adding part-of-speech and semantic annotation. Corpora can also be searched and compared with reference wordlists. Wmatrix is available from

My experience with the Corpus MOOC

Lancaster University’s MOOC in Corpus Linguistics has been hugely important to me during my doctoral research and I’ve taken it each year since it was first offered in 2014. This is not because I’m an especially slow learner, nor because I was unsuccessful in my previous attempts – it’s because the course has so much to offer that it’s impossible to appreciate all of its different aspects in one go; it requires repeated visits as understanding deepens and new questions emerge.

When I first took the course, I knew nothing about corpus linguistics (or MOOCs for that matter) and went through each week’s materials at a very introductory level, trying to get a handle on the principles and terminology and also learning about tools and techniques. At first, I was apprehensive, ready to bail at any sign of discomfort, but I found the lectures not only easy to follow, but also thoroughly enjoyable and endlessly fascinating. I was hooked! Although the course itself spanned eight weeks, the materials were available on the website long after the course was over. This allowed me to revisit and review tutorials whenever I felt unsure about something, and also start to focus on areas that aligned with my own research interests.

The following year, with the basics under my belt, I decided to take the course for a second time with the intention of tackling the content in more depth and also using my own data for the tutorials. What I found was that the multiple layers of the course became extremely valuable as I became more comfortable with different concepts and research in the field and also that my approach to the course had changed. Instead of following the course week by week as I had done the first time, I started to pick and choose different aspects that matched particular stages of my own research.

The third time I took the course, I was driven by an interest in the advanced materials as well as the discussions and comments by other students and mentors. I had so many questions arising from my own research that I felt it would be helpful to hear what others had to say about their own. The forum became an incredibly valuable resource and one that I had not appreciated as a beginner. It is extremely generous of Lancaster to offer such a fantastic course with all the support, resources, knowledge and materials and ask for nothing in return.

And now, even though I’ve completed my doctoral research, I’ve registered for the course for the fourth time. It is such an incredibly diverse and fascinating course, with so many layers and areas of interest, that there is still a great deal for me to learn. And the numerous scholars discussing their research have an enthusiasm and passion for their work that is both infectious and inspirational. Perhaps my husband is right, I’ve become addicted to this MOOC!

The next Corpus MOOC starts 25 September 2017. You can register for free at

The course is intended for anyone interested in quantitative language analysis – no prior knowledge of linguistics or corpora is required.

Would you like to share your experience of the Corpus MOOC? Include #CorpusMOOC in your tweets or other social media posts or get in touch via v.brezina(Replace this parenthesis with the @ sign)

CASS PhD Student Tanjun Liu wins Best Poster Award at EUROCALL2017

In late August, I attended the 25th annual conference of EUROCALL (the European Association for Computer Assisted Language Learning) at the University of Southampton. This year’s theme encompassed how Computer-Assisted Language Learning (CALL) responds to changing global circumstances that impact on education. Over 240 sessions were presented, covering topics such as computer-mediated communication, MOOCs, social networking, corpora, European projects and teacher education.


At this conference, I presented a poster entitled “Evaluating the effect of data-driven learning (DDL) on the acquisition of academic collocations by advanced Chinese learners of English”. DDL is a term coined by Tim Johns in 1991 to refer to the use of authentic corpus data in student-centred discovery learning activities. However, even though many corpus-based pedagogical studies have advocated applying corpora in classroom teaching, DDL has yet to become mainstream teaching practice. My research therefore sets out to examine the contribution of DDL to the acquisition of academic collocations in the Chinese university context.


The tool I used in my research was #LancsBox, a newly developed corpus tool from CASS whose GraphColl module has the capacity to create collocational networks. The poster I presented reported a five-week pilot study for my research, the results of which show that the learners’ attitudes towards using #LancsBox were mostly positive, but that there were no statistically significant differences between using the corpus tool and an online collocations dictionary – which may be largely due to the very short intervention time in the pilot study. My poster also described the forthcoming main study, which will involve longer exposure and more EFL learners.


At this conference I was fortunate enough to win the EUROCALL2017 Best Poster Award (PhD), which was given to the best poster presented by a PhD student, as nominated by conference delegates. Thank you to all of the delegates who voted for me – it was a real pleasure to attend such a wonderful conference!

CASS at Corpus Linguistics 2017

The biennial Corpus Linguistics conference first took place in 2001 at Lancaster, and the 2017 conference at Birmingham was its 9th outing. The conference lasted four days, with an additional day for workshops; this blog post details CASS participation at the event.

On Monday 24th July CASS ran two pre-conference workshops: Vaclav Brezina and Matt Timperley’s workshop was based around the tool #LancsBox, which has the capacity to create collocational networks, while Robbie Love and Andrew Hardie introduced the Spoken BNC2014 Corpus. Pre-conference presentations were also given by CASS members in the Corpus Approaches to Health Communication workshop, which saw talks by Paul Baker (on NHS patient feedback), Elena Semino (on the assessment of a diagnostic pain questionnaire) and Karen Kinloch, who gave two talks on discourses around IVF treatment and postnatal depression (her second talk was co-presented with Sylvia Jaworska).

On the first day of the conference proper, CASS Director Andrew Hardie gave a plenary entitled Exploratory analysis of word frequencies across corpus texts: towards a critical contrast of approaches, which involved a “for one night only” Topic Modelling analysis, demonstrating some of the problems and assumptions behind this approach. Key points were illustrated with a friendly picture of a Gigantopithecus (pictures of dinosaurs and other extinct creatures were used in several talks, perhaps suggesting a new theme for CL research). The plenary can be watched in full here.

A number of conference talks involved the creation and analysis of the new 2014 British National Corpus, with Abi Hawtin presenting on how she developed parameters for the written section and Robbie Love discussing swearing in the spoken section of the BNC2014. Vaclav Brezina and Matt Timperley discussed a proposal for standardised tokenization and word counting, using the new BNC as an exemplar while Susan Reichelt examined ways of adapting the BNC for sociolinguistic research, taking a case study on negative concord.

In terms of other corpus creation projects, Paul Rayson, Scott Piao and a team from Cardiff University discussed the creation of a Welsh Semantic tagger for use with the CorCenCC Project.

Two talks involved uses of corpus linguistics in teaching. First, Gillian Smith described the creation and analysis of a corpus of interactions in Special Educational Needs classrooms, with the goal of investigating teacher scaffolding, while Liam Blything, Kate Cain and Andrew Hardie analysed a half-million-word corpus of teacher-child interactions during guided reading sessions.

Regarding work examining discourse and representation using corpus approaches, Carmen Dayrell presented her work with Helen Baker and Tony McEnery on a diachronic analysis of newspaper articles about droughts, their research combining corpus approaches with GIS (Geographical Information Systems). GIS was also used by Laura Paterson and Ian Gregory to map textual analysis of poverty in the UK, while Paul Baker and Mark McGlashan reported on their work looking at representations of Romanians in the Daily Express, comparing articles with online reader comments. A fourth paper, by Jens Zinn and Daniel McDonald, considered changing understandings of the concept of risk in English-language newspapers.

Collocation was also a popular topic in our presentations. Native and non-native processing of collocations was investigated by Jen Hughes, who carried out an experimental study using electroencephalography (EEG), which measures electrical potentials in the brain, while another approach to collocation was taken by Doğuş Can Öksüz and Vaclav Brezina, who examined adjective-noun collocations in Turkish and English. A third collocation study, by Dana Gablasova, Vaclav Brezina and Tony McEnery, involved the empirical validation of MI-score-based measures of collocation for language learning research.
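For readers unfamiliar with the MI score mentioned above: it compares a pair’s observed co-occurrence frequency with the frequency expected if the two words occurred independently. A minimal sketch of the standard formula follows; the frequencies are invented, and real implementations handle windows, tokenisation and statistical significance far more carefully:

```python
import math

def mi_score(o11, f_node, f_coll, corpus_size, window=1):
    """Mutual information (MI) score for a node-collocate pair.

    o11         observed co-occurrences of node and collocate
    f_node      overall frequency of the node word
    f_coll      overall frequency of the collocate
    corpus_size total number of tokens in the corpus
    window      collocation window size in tokens
    """
    # Expected co-occurrence frequency if the two words were independent
    expected = (f_node * f_coll * window) / corpus_size
    return math.log2(o11 / expected)

# Invented frequencies: the pair co-occurs 160 times more often than
# chance predicts, giving MI = log2(160), roughly 7.32.
print(round(mi_score(o11=80, f_node=1000, f_coll=500, corpus_size=1_000_000), 2))
```

A well-known property of MI – and one reason empirical validation of the kind described above matters – is that it rewards rare, exclusive pairings, so low-frequency combinations can receive inflated scores.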

Finally, Jonathan Culpeper and Amelia Joulain-Jay talked about an affiliated CASS project involving work on creating an Encyclopaedia of Shakespeare’s language. They discussed issues surrounding spelling variation and part-of-speech tagging, and gave two case studies (involving the words I and good).

The conference brought together corpus linguists from dozens of countries (including Germany, Finland, Spain, Israel, Japan, Brazil, Iran, the Netherlands, USA, New Zealand, Taiwan, Ireland, China, the Czech Republic, Italy, Sweden, Poland, Chile, the UK, Hong Kong, Norway, Australia, Belgium, Canada, South Africa and Venezuela) and was a great opportunity to share and hear about developing work in the field. There was a lively Twitter presence throughout the conference, under the tag #CL2017bham. However, my favourite tag was #HardiePieChartWatch, which had me going back to my slides to check whether I had used a pie chart appropriately. Be careful with your pie charts!

The next conference will be held (for the first time) in Cardiff – I hope to see you there in two years.

More pictures of the conference can be found at

How to Produce Vocabulary Lists

As part of the Forum discussion in Applied Linguistics, we have formulated some basic principles of corpus-based vocabulary studies and pedagogical wordlist creation and use. These principles can be summarised as follows:

  1. Explicitly define the vocabulary construct.
  2. Operationalize the vocabulary construct using transparent and replicable criteria.
  3. If using corpora, take corpus evidence seriously and avoid cherry-picking.
  4. Use multiple sources of evidence to test the validity of the vocabulary construct.
  5. Do not rely on your intuition/experience to determine what is useful for learners; collect evidence about learner needs to evaluate the usefulness of the list.
  6. Do not present learners with a decontextualized list of lexical items; use/create contextualized materials instead.
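By way of illustration, principles 2 and 3 might be operationalised as explicit frequency and range thresholds applied uniformly across a whole corpus, rather than by picking items that feel useful. This is only a toy sketch with an invented three-text corpus, not the procedure from the paper:

```python
from collections import Counter

def build_wordlist(texts, min_freq=2, min_range=2):
    """Select words meeting both a frequency and a range threshold."""
    freq = Counter()        # total occurrences across the corpus
    doc_range = Counter()   # number of texts each word appears in
    for text in texts:
        tokens = text.lower().split()
        freq.update(tokens)
        doc_range.update(set(tokens))
    return sorted(w for w in freq
                  if freq[w] >= min_freq and doc_range[w] >= min_range)

corpus = [
    "the analysis of the corpus data",
    "corpus data require careful analysis",
    "the corpus was annotated",
]
print(build_wordlist(corpus))  # ['analysis', 'corpus', 'data', 'the']
```

The point of stating the criteria in this explicit, replicable form is that anyone with the same corpus can reproduce – and critique – the resulting list.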

To find out more, you can read:

Brezina, V. & Gablasova, D. (2017). How to Produce Vocabulary Lists? Issues of Definition, Selection and Pedagogical Aims. A Response to Gabriele Stein. Applied Linguistics, doi:10.1093/applin/amx022.

CASS Guided Reading project presented to The Society for the Scientific Studies of Reading (SSSR)

In mid-July, it was my pleasure to represent CASS at the SSSR conference in Nova Scotia, Canada! Over 400 professionals attended, including language and literacy researchers, school teachers, and speech and language therapists.

My primary aim was to demonstrate how our CASS language development project is using corpus search methods to identify the effectiveness of teacher strategies that are being used in guided reading classroom interactions (also see part 1 & part 2 of my project introduction blogs). The best opportunity for this was during my poster presentation, which highlighted our first round of findings on the types of questions that teachers ask children.

We first demonstrated that teachers are following the recommended guidelines to ask plenty of wh-questions (why, how, what, when, etc.): wh-questions typically make up around 20% of the total questions asked in ordinary adult conversation, but made up 40% of the total questions asked by teachers in our spoken classroom interactions.
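The kind of count behind these proportions can be sketched as a simple first-word classification of question utterances. The example questions below are invented, and the actual study used corpus search tools over the full transcribed corpus rather than this toy function:

```python
import re

WH_WORDS = {"who", "what", "when", "where", "why", "which", "how", "whose"}

def wh_proportion(questions):
    """Proportion of questions whose first word is a wh-word."""
    wh = sum(1 for q in questions
             if re.split(r"\W+", q.strip().lower(), maxsplit=1)[0] in WH_WORDS)
    return wh / len(questions)

questions = [
    "Why do you think the fox ran away?",
    "What happens next in the story?",
    "Did you like this chapter?",
    "How does the character feel now?",
    "Can anyone read the next page?",
]
print(wh_proportion(questions))  # 3 of 5 questions are wh-questions: 0.6
```

A corpus query over part-of-speech-tagged transcripts does the same job far more reliably, since it can catch wh-words that are not utterance-initial and exclude relative uses.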

Second, the poster presented initial findings on our developmental question of whether teachers of older children ask more challenging question types than teachers of younger children. So far, however, our chosen categories of question type were used equivalently across year groups, prompting a follow-up to examine whether finer-grained categories of question type differ in their proportion of use across year groups.

Third, the poster reported that teachers at schools in low socio-economic-status (SES) regions asked a higher proportion of wh-questions than teachers at schools in high SES regions. Most viewers of the poster agreed that this prompts us to look at children’s responses: the high proportion of wh-questions asked by teachers at schools in low SES regions might be shaped by less engaged answers from low SES children that require more follow up wh-questions relative to the typically more engaged answers provided by high SES children.

Although there were a number of other posters throughout the week that examined classroom interactions, none had taken advantage of the precise, fast and reliable search methods that we are using. Attendees were therefore very impressed by how we have been able to interrogate our large corpus without being restricted by the amount of manual coding that can be achieved within a realistic time window.

Finally, a big thanks to CASS and SSSR for making this visit possible. As well as the incredible learning opportunities provided by the wide range of high-quality presentations on reading research, I had a great time meeting the fun and interesting conference attendees – and local Canadians too! Nova Scotia is a beautiful place to visit, with a very friendly and youthful demographic.

Liam will be presenting a talk on this project at the Corpus Linguistics 2017 conference on Wednesday 26th July at 4pm in Lecture Theatre 117, Physics Building, University of Birmingham. For updates, watch this space and twitter @CorpusSocialSci @LiamBlything



User Involvement: CASS go to CLARIN PLUS workshop

At the beginning of June, I attended the CLARIN PLUS workshop on User Involvement held in the Finnish capital, Helsinki. CLARIN stands for “Common Language Resources and Technology Infrastructure”; it is an international research infrastructure that provides scholars in the social sciences and humanities with easy access to digital language data, along with advanced tools to handle those data sets. The main purpose of the workshop was to share information, good practice, expertise and ideas on how potential and current users can most benefit from CLARIN services.

I was representing Lancaster University as part of the UK branch of CLARIN, which is led by Martin Wynne at Oxford. Some of the participants, representing CLARIN’s different national consortia, shared success stories of their involvement with their local communities.

At the workshop, Johanna Berg, from Sweden, and Mietta Lennes, from Finland, showed us how they made innovative use of the roadshow event format to present language resources across different institutions in their countries. Mietta also gave us a taste of the very useful tools and corpora available at The Language Bank of Finland.

Another fruitful example presented at the workshop was the Helsinki Digital Humanities Hackathons. The event, which is in its third edition, brings together researchers from computer science, humanities and social sciences for a week of intensive work sharing a diversity of skills. Eetu Mäkelä, one of the organisers of the DHH, demonstrated that it is possible to engage researchers from very different backgrounds and have them working in a complementary way. The impressive results of last year’s edition can be checked out at the DHH16 website.

At the end of two profitable days, Darja Fišer, director of CLARIN-ERIC User Involvement, wrapped up the event by presenting other amazing experiences across several institutions connected to CLARIN. One of the success stories she mentioned was the Corpus Linguistics: Method, Analysis, Interpretation MOOC offered by CASS, which will be running again in Autumn this year (you can register your interest here!). Darja also highlighted the importance of events such as summer schools to reach out to more users. Indeed, Darja shared some incredible resources and insightful ideas at our recent Summer Schools in Corpus Linguistics and other Digital methods (#LancsSS17). Make sure you read our next blog post for a summary of the summer school week!