Tracking terrorists who leave a technological trail

Dr Sheryl Prentice’s work on using technology to aid in the detection of terrorists has been gaining a lot of attention in the media this week! Sheryl’s discussion of the different ways in which technology can be used to tackle the issue of terrorism and how effective these methods are was originally published in The Conversation, and then republished by the ‘i’ newspaper on 23rd June 2016. You can read the original article here.

Corpus Data and Psycholinguistics Seminar

On the afternoon of Thursday 19th May 2016, CASS held its first ever psycholinguistics seminar which brought together researchers from both linguistics and psychology. The theme of the seminar was “Corpus Data and Psycholinguistics”, with a particular focus on experimental psycholinguistics.

The afternoon consisted of four 40-minute presentations which covered a range of different experimental methods including eye-tracking and EEG. Interestingly, the notion of collocation also emerged as a strong theme throughout the presentations. Different types of collocation were addressed, including bigrams, idioms, and compounds, and this prompted thought-provoking discussions about the nature of collocation and the relationship between psycholinguistic results and the different statistical measures of collocation strength.

The first presentation was delivered by Professor Padraic Monaghan from the Psychology Department at Lancaster University. In this presentation, Padraic provided an engaging introduction to computational modelling in psycholinguistics, focusing mainly on connectionist models where the input determines the structure of processing. This talk prompted a particularly interesting observation about the relationship between connectionist models and part-of-speech tags in corpora.

In the second presentation, Dr Phil Durrant from the University of Exeter provided a critical perspective on his own earlier work into whether or not psycholinguistic priming is evident in collocations at different levels of frequency, and on the distinction between the related notions of collocation and psychological association. This presentation also provided a really interesting insight into the different ways in which corpus linguistics and psychological experimentation can be combined in psycholinguistic studies. This really helped to contextualise the studies reported in the other presentations within the field of psycholinguistics.

After a short break, I presented the results of the first of several studies which will make up my PhD thesis. This initial study pilots a procedure for using EEG to determine whether or not the brain is sensitive to the transition probabilities between words. This was an excellent opportunity for me to gain feedback on my work and I really appreciate the input and suggestions for further reading that I received from participants at this event.

The final presentation of the afternoon was delivered by Professor Michaela Mahlberg and Dr Gareth Carrol from the University of Birmingham. This presentation drew upon eye-tracking data from a study exploring literary reading in order to pinpoint the methodological issues associated with combining eye-tracking techniques with literary corpora, and with corpus data more generally.

With such an interesting series of talks sharing the theme of “Corpus Data and Psycholinguistics”, the CASS psycholinguistics seminar proved to be a very successful event. We would like to thank the presenters and all of the participants who attended the seminar for their contribution to the discussions, and we are really looking forward to hosting similar seminars in the near future.

TLC and innovation in language testing

One of Trinity College London’s objectives in investing in the Trinity Lancaster Spoken Corpus has been to share findings with the language assessment community. The corpus allows us to develop an innovative approach to validating test constructs and offers a window into the exam room so we can see how test takers utilise their language skills in managing the series of test tasks.

Recent work by the CASS team in Lancaster has thrown up a variety of features that illustrate how test takers voice their identity in the test, how they manage interaction through a range of strategic competences and how they use epistemic markers to express their point of view and negotiate a relationship with the examiner (for more information see Gablasova et al. 2015). I have spent the last few months disseminating these findings at a range of language testing conferences, and audiences have been fascinated by them.

We have presented findings at BAAL TEASIG in Reading, at EAQUALS in Lisbon and at EALTA in Valencia. Audiences ranged from assessment experts to teacher educators and classroom practitioners, and there was great interest both in how the test takers manage the exam and in the manifestations of L2 language. Each presentation was tailored to the audience and the theme of the conference. In separate presentations, we covered how assessments can inform classroom practice, how the data could inform the type of feedback we give learners and how the data can be used to help validate aspects of the test construct. The feedback has been very positive, urging us to investigate further. Comments have praised the extent and quality of the corpus and range from the observation that the evidence “is something that we have long been waiting for” (Dr Parvaneh Tavakoli, University of Reading) to musings on what some of the data might mean both for how we assess spoken language and for the classroom. It has certainly opened the door on the importance of strategic and pragmatic competences as well as validating Trinity’s aims to allow the test taker to bring themselves into the test. The excitement spilled over into some great tweets. There is general recognition that the data offers something new, sometimes confirming what we suspected and sometimes, as with all corpora, refuting our beliefs!

We have always recognised that the data is constrained by the semi-formal context of the test. However, each test is structured but not scripted, and its tasks represent language pertinent to communicative events in the wider world; this allows the test taker to produce language which is more reflective of naturally occurring speech than in many other oral tests. It has been enormously helpful to have feedback from the audiences, who have fully engaged with the issues raised, highlighted aspects we can investigate in greater depth and raised features they would like to know more about. These features are precisely those that the research team wishes to explore in order to develop ‘a more fine-grained and comprehensive understanding of spoken pragmatic ability and communicative competence’ (Gablasova et al. 2015: 21).

One of the next steps is to show how this data can be used to develop and support performance descriptors. Trinity is confident that the features of communication which the test takers display are captured in its new Integrated Skills in English exam, validating claims that Trinity assesses real-world communication.

CASS Changing Climates project presented at the University of Turin

It was a great honour and pleasure to present the CASS Changing Climates project to an engaged audience at the University of Turin last month, on 27th April 2016. This was the 8th symposium on ‘Energy and Power: Social ontology perspectives and energy transitions’, held as part of a UNESCO Chair programme in Sustainable Development and Territory Management currently hosted by the University of Turin (Italy), under the coordination of Professor Dario Padovan.

The symposium brought together academics and students from various disciplines (sociology, linguistics, history and environmental sciences), which made for an enthusiastic audience and a lively debate. CASS would like to thank the organisers, Professor Dario Padovan, Dr Maria Cristina Caimotto and Gabriela Cavaglià, for this great opportunity to exchange experience and ideas. I very much enjoyed the event and, as expected, had a great time in lovely Torino.

Chinese Applied Corpus Linguistics Symposium

On Friday 29th April 2016, Lancaster University hosted a symposium which brought together researchers and practitioners interested in Chinese linguistics and the corpus method. The symposium was supported by the British Academy (International Mobility and Partnership Scheme IPM 2013) and was hosted by the ESRC Centre for Corpus Approaches to Social Science (CASS). The symposium introduced the Guangwai-Lancaster Chinese Learner Corpus, a 1.2-million-word corpus of spoken and written L2 Chinese produced by learners of Chinese at different proficiency levels; the corpus was built as part of a collaboration between Guangdong University of Foreign Studies (Prof. Hai Xu and his team) and Lancaster University. The project was initiated by Richard Xiao, who also obtained the funding from the British Academy. Richard’s vision to bring corpus linguistics to the analysis of L2 Chinese (both spoken and written) is now coming to fruition with the final stages of the project and the public release of the corpus planned for the end of this year.

The symposium showcased different areas of Chinese linguistics research through presentations by researchers from Lancaster and other UK universities (Coventry, Essex), with the topics ranging from the use of corpora as resources in the foreign language classroom to a cross-cultural comparison of performance evaluation in concert reviews, second language semasiology, and CQPweb as a tool for Chinese corpus data. As part of the symposium, the participants were also given an opportunity to search the Guangwai-Lancaster Chinese Learner Corpus and explore different features of the dataset. At the end of the symposium, we discussed the applications of corpus linguistics in Chinese language learning and teaching and the future of the field.

Thanks are due to the presenters and all participants for joining the symposium and for very engaging presentations and discussions. The following snapshots summarise the presentations; links to the slides are available below the images.


Hai Xu (Guangdong University of Foreign Studies): Guangwai-Lancaster Chinese Learner Corpus: A profile – via video conferencing from Guangzhou

Simon Smith (Coventry University): 语料酷!Corpora and online resources in the Mandarin classroom

Fong Wa Ha (University of Essex): A cross-cultural comparison of evaluation between concert reviews in Hong Kong and British newspapers

Vittorio Tantucci (Lancaster University): Second language semasiology (SLS): The case of the Mandarin sentence final particle 吧 ba

Andrew Hardie (Lancaster University): Using CQPweb to analyse Chinese corpus data

Vaclav Brezina (Lancaster University): Practical demonstration of the Guangwai-Lancaster Chinese Learner Corpus followed by a general discussion.

Clare Wright: Using Learner Corpora to analyse task effects on L2 oral interlanguage in English-Mandarin bilinguals




Introducing Yufang Qian to CASS

CASS is delighted to welcome visiting researcher Yufang Qian to the centre, where she will be working on a project exploring the representation of Chinese medicine in British historical news texts over the last 200 years. Continue reading to find out more about Yufang and the research which she will be undertaking!


In 2009, Yufang Qian obtained her PhD at Lancaster University with a dissertation on corpus-based discourse studies, under the supervision of Professors Tony McEnery and Paul Baker. She then returned to Zhejiang University of Media and Communications (ZUMC) and was appointed Professor in 2011.

Yufang is committed to popularizing the combination of corpus and discourse approaches in China. She has taught corpus linguistics and media discourse at ZUMC to students at all levels and supervised more than 50 student dissertations relating to corpus-based discourse studies, in disciplines as diverse as communication studies, education, sociology, psychology and politics. These students have gone on either to pursue further education in this area in the UK, USA, Japan, South Korea and Hong Kong, or to use the expertise they have gained in various institutions and organizations in China.

In 2010 Yufang published ‘Corpus and Critical Discourse Analysis’ in the journal Foreign Language Teaching and Research, the first paper to introduce corpus-based discourse analysis in a Chinese journal. To date, it has been cited 48 times and downloaded 3515 times. In the past few years she has published nearly two dozen journal articles on corpus-based media discourse analysis. Her PhD thesis, Discursive constructions around terrorism in the People’s Daily and The Sun before and after 9.11 (Oxford: Peter Lang 2010), won Third Prize in the Sixth Outstanding Achievement Awards for Research in Humanities and Social Sciences, conferred by the Ministry of Education in 2013, the top governmental award in social science in China.

To explain and promote the application of the corpus-based discourse approach, Yufang has spoken at many national and international conferences and has given lectures at more than a dozen universities in China. She is Founding Director of the Research Center for Discourse and Communications at ZUMC, the first of its kind in China. She is principal investigator for many research projects, such as ‘Discursive constructions around the low carbon economy in the press of China, the UK and the US’, funded by the Ministry of Education; and ‘A corpus-based comparative study of Western and Chinese political discourse analysis’, funded by the National Social Science Foundation. She is also co-principal investigator of the project entitled ‘A comparative study of the discourse system in Chinese dream films’, funded by the National Social Science Foundation.

Yufang’s comparative perspective is evident from her early paper, ‘Contrasting signals of politeness between Western and Eastern countries’, published in Education in China (ed. E Fizette; Fenton, MI: Hana Guild, 1993). Since 2014, she has been working with CCPN Global (China in Comparative Perspective Network Global, an affiliate member of the Academy of Social Sciences, UK) to develop a project entitled ‘Corpus approaches for Chinese social science (CACSS)’. She is organizing a panel on ‘Corpus approaches to governance in the context of climate change’ at the 3rd Global China Dialogue on 2 December 2016 at the British Academy.

Yufang has recently returned to her alma mater, Lancaster University, as a visiting researcher, where she will work with Professor McEnery on a project exploring the representation of Chinese medicine in British historical news texts over the last 200 years. This diachronic observation of discourse on Chinese medicine is significant in that it will provide specific evidence of the media’s role in public health vis-à-vis the use of traditional Chinese medicine in the West. It is hoped that the findings of this study will help bridge the gap between Western and Chinese medicine, both of which play a role in serving public health.

NewsHack 2016 Retrospective

The BBC’s multilingual NewsHACK event was run on the 15th and 16th of March as an opportunity for teams of language technology researchers to work with multilingual data from the BBC’s connected studio. The theme was ‘multilingual journalism: tools for future news’, and teams were encouraged to bring some existing language technologies to apply to problems in this area. Nine teams attended from various news and research organisations. Lancaster University sent two teams with funding from CASS, CorCenCC, DSI, and UCREL: team ‘1’ consisting of Paul, Scott and Hugo, and team ‘A’ comprising Matt, Mahmoud, Andrew and Steve.


The brief from the newsHACK team suggested two possible directions: to provide a tool for the BBC’s journalist staff, or to create an audience-facing utility. To support us, the BBC provided access to a variety of APIs, but the Lancaster ‘A’ team were most interested to discover that something we’d thought would be available wasn’t — there is no service mapping news stories to their counterparts in other languages. We decided to remedy that.

The BBC is a major content creator, perhaps one of the largest multilingual media organisations in the world. This presents a great opportunity. Certain events are likely to be covered in every language the BBC publishes in, providing ‘translations’ of the news which are not merely literal translations at the word, sentence or paragraph level, but full-fledged contextual translations which identify the culturally appropriate ways to convey the same information. Linking these articles together could help the BBC create a deeply interesting multilingual resource for exploring questions about language, culture and journalism.

Interesting, but how do we make this into a tool for the BBC? Our idea was to take these linked articles directly to the users. Say you have a friend who would prefer to read the news in their native tongue, one different to your own: how would you share a story with them? Existing approaches seem to involve either using an external search engine (but then, not speaking the target language, how do you know the results are what you intend to share?) or using machine translation to offer your friend a barely-readable version of the exact article you have read. We came up with an idea that keeps content curation within the BBC and provides readers with easy access to the existing high-quality translations being made by professional writers: a simple drop-down menu for articles which allows a user to ‘Read about this in…’ any of the BBC’s languages.


To implement this, in two days, required a bit of creative engineering. We wanted to connect articles based on their content, but we didn’t have tools to extract and compare features in all the BBC’s languages. Instead, we translated small amounts of text — article summaries and a few other pieces of information — into English, which has some of the best NLP tool support (and was the only language all of our team spoke). Then we could use powerful existing solutions to named entity recognition and part-of-speech tagging to extract informative features from articles, and compare them using a few tricks from record linkage literature. Of course, a lack of training data (and time to construct it!) meant that we couldn’t machine-learn our way to perfection for weighting these features, so a ‘human learning’ process was involved in manually tweaking the weights and thresholds until we got some nice-looking links between articles in different languages.
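The linking step described above can be illustrated with a minimal sketch. It assumes each article has already been reduced to sets of English features (for example, named entities and nouns taken from a translated summary), and it compares two articles with a weighted Jaccard overlap, a common record-linkage heuristic. The function names, feature types, weights and threshold below are all hypothetical, chosen for illustration; they are not the values or the code the team actually used.

```python
from typing import Dict, Set

# Hypothetical hand-tuned weights per feature type (illustrative only).
WEIGHTS: Dict[str, float] = {"entities": 0.6, "nouns": 0.3, "date": 0.1}
# Hypothetical cut-off above which two articles are treated as linked.
THRESHOLD = 0.4


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Share of features two sets have in common (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def similarity(x: Dict[str, Set[str]], y: Dict[str, Set[str]]) -> float:
    """Weighted average of per-feature-type Jaccard overlaps, in [0, 1]."""
    total = sum(WEIGHTS.values())
    score = sum(
        weight * jaccard(x.get(kind, set()), y.get(kind, set()))
        for kind, weight in WEIGHTS.items()
    )
    return score / total


def is_link(x: Dict[str, Set[str]], y: Dict[str, Set[str]]) -> bool:
    """Decide whether two articles are similar enough to link."""
    return similarity(x, y) >= THRESHOLD


if __name__ == "__main__":
    # Made-up feature sets standing in for two translated article summaries.
    english = {"entities": {"brussels", "eu"}, "nouns": {"summit", "talks"}}
    arabic = {"entities": {"brussels", "eu"}, "nouns": {"summit", "leaders"}}
    print(round(similarity(english, arabic), 3), is_link(english, arabic))
```

Adjusting `WEIGHTS` and `THRESHOLD` by hand and inspecting the resulting links corresponds to the ‘human learning’ process mentioned above; with labelled training data, these values could instead be fitted automatically.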

Data is only part of the battle, though. We needed a dazzling front-end to impress the judges. We used a number of off-the-shelf web frameworks to quickly develop a prototype, drawing upon the BBC’s design to create something that could conceivably be worked into a reader’s workflow: users enter a URL at the top and are shown results from all languages in a single dashboard, from which they can read or link to the original articles or their identified translations.

Here we have retrieved a similar article in Arabic, as well as two only-vaguely-similar ones in Portuguese and Spanish (the number indicates a percentage similarity). The original article text is reproduced, along with a translated English summary.


The judges were impressed (perhaps as much with our pun-filled presentation as with our core concept), and our contribution, the spontaneously-titled ‘Super Mega Linkatron 5000’, was joint winner in the category of best audience-facing tool.

The BBC’s commitment to opening up their resources to outsiders seems to have paid off with a crop of high-quality concepts from all the competitors, and we’d like to thank them for the opportunity to attend (as well as the pastries!).

The code and presentation for the team ‘A’ entry are available via GitHub, and images from Lancaster’s visit are available online. Some of the team, including Matt and Steve, have also written their own blog posts on the subject.

Team ‘1’ based their work around the BBC Reality Check service. This was part of the BBC News coverage of the 2015 UK general election; it published news items on Twitter and contributed to TV and radio news as well. For example, in May 2015, when a politician claimed that the number of GPs had declined under the coalition government, BBC Reality Check produced a summary of data obtained from a variety of sources to enable the audience to make up their own mind about this claim. Reality Check is continuing in 2016 with a similar service for the EU referendum, providing, for example, a check on how many new EU regulations there are every year (1,269 rather than the 2,500 claimed by Boris Johnson!). After consulting with the BBC technology producer and journalist attending the newsHACK, Team ‘1’ realised that the current Reality Check service could only serve its purpose for English news stories, so they set about making a new ‘BBC Multilingual Reality Check’ service to support journalists in their search for suitable sources. Having a multilingual system is really important for the EU referendum and other international news topics because potential sources may be written in languages other than English.

In order to bridge related stories across different languages, we adopted the UCREL Semantic Analysis System (USAS), developed at Lancaster over the last 26 years. The system automatically assigns semantic fields (concepts or coarse-grained senses) to each word or phrase in a given text, and we reasoned that the frequency profile of these concepts would be similar for related stories even in different languages; for example, the semantic profile could help distinguish between news stories about finance, education or health. Using the APIs that the BBC newsHACK team provided, we gathered stories in English, Spanish and Chinese (the native languages spoken by team ‘1’). Each story was then processed through the USAS tagger and a frequency profile was generated. Using a cosine distance measure, we ranked related stories across languages. Although we only used the BBC multilingual news stories during the newsHACK event, the system could be extended to ingest text from other sources, e.g. UK Parliamentary Hansard and manifestos, proceedings of the European Parliament and archives of speeches from politicians (before they are removed from political party websites).
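As a rough illustration of the ranking step, the sketch below builds a relative-frequency profile from a list of semantic tags and ranks candidate stories by cosine distance to a query story. It does not call the USAS tagger: the tag labels are placeholders rather than real USAS codes, the stories are invented, and all function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def profile(tags: List[str]) -> Dict[str, float]:
    """Relative frequency of each semantic tag in one story."""
    counts = Counter(tags)
    total = sum(counts.values())
    return {tag: n / total for tag, n in counts.items()} if total else {}


def cosine_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """1 - cosine similarity; 0.0 means identical profiles, 1.0 no overlap."""
    dot = sum(value * q.get(tag, 0.0) for tag, value in p.items())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0.0 or norm_q == 0.0:
        return 1.0
    return 1.0 - dot / (norm_p * norm_q)


def rank_related(
    query_tags: List[str], candidates: Dict[str, List[str]]
) -> List[Tuple[float, str]]:
    """Candidate stories sorted from closest to most distant."""
    query_profile = profile(query_tags)
    scored = [
        (cosine_distance(query_profile, profile(tags)), name)
        for name, tags in candidates.items()
    ]
    return sorted(scored)


if __name__ == "__main__":
    # Placeholder tag sequences standing in for tagger output.
    query = ["MONEY", "MONEY", "GOVERNMENT", "MONEY"]
    stories = {
        "es-finance": ["MONEY", "GOVERNMENT", "MONEY"],
        "zh-health": ["HEALTH", "HEALTH", "EDUCATION"],
    }
    for distance, name in rank_related(query, stories):
        print(f"{name}: {distance:.3f}")
```

Because the profiles are built from language-independent tags rather than words, the same comparison applies whether the two stories are in the same language or not, which is the property the approach relies on.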

The screenshot below shows our analysis focussed on some main topics of the day: UK and Catalonia referendums, economics, Donald Trump, and refugees. Journalists can click on news stories in the system and show related articles in the other languages, ranked by our distance measure (shown here in red).

Team ‘1’s Multilingual Reality Check system would not only allow fact checking such as the number of refugees and migrants over time entering the EU, but also allow journalists to observe different portrayals of the news about refugees and migrants in different countries.


Upcoming CASS Psycholinguistics Seminar

CASS is excited to announce an upcoming half-day research seminar on the theme of “Corpus Data and Psycholinguistics”. The event will take place on Thursday 19th May 2016 at 1-5pm in Furness Lecture Theatre 3.

The aim of the event is to bring together researchers with an interest in combining methods from corpus linguistics and psycholinguistics. In particular, there will be a focus on experimental psycholinguistics. It is set to be an exciting afternoon consisting of four 40-minute presentations from both internal and external speakers. Professor Padraic Monaghan from the Department of Psychology will be giving an introduction to computational modelling in psycholinguistics, and I will be presenting my work on investigating the processing of collocation using EEG. Furthermore, Dr Phil Durrant from the University of Exeter will be giving a talk entitled “Revisiting collocational priming”, and Professor Michaela Mahlberg from the University of Birmingham will be discussing the methodological issues associated with combining eye-tracking techniques with corpus data.

You can find out more about these talks from the abstracts below.

Padraic Monaghan, Lancaster University

Computational modelling of corpus data in psycholinguistic studies

Computational models of language learning and processing enable us to determine the inherent structure present in language input, and also the cognitive mechanisms that react to this structure. I will give an introduction to computational models used in psycholinguistic studies, with a particular focus on connectionist models where the structure of processing is derived principally from the structure of the input to the model.

Phil Durrant, University of Exeter

Revisiting collocational priming

Durrant & Doherty (2010) evaluated whether collocations at different levels of frequency exhibit psycholinguistic priming. It also attempted to untangle collocation from the related phenomenon of psychological association by comparing collocations which were and were not associates. Priming was found between high-frequency collocations but associated collocates appeared to exhibit more deep-rooted priming (as reflected in a task designed to reflect automatic, rather than strategic processes) than those which were not associated. This presentation will critically review the 2010 paper in light of more recent work. It will re-evaluate the study itself and suggest ways in which research could be taken forward.

Durrant, P., & Doherty, A. (2010). Are high-frequency collocations psychologically real? Investigating the thesis of collocational priming. Corpus linguistics and linguistic theory, 6(2), 125-155.

Jennifer Hughes, Lancaster University

Investigating the processing of collocation using EEG: A pilot study

In this presentation, I discuss the results of an EEG experiment which pilots a procedure for determining whether or not there is a quantitatively distinct brain response to the processing of collocational bigrams compared to non-collocational bigrams. Collocational bigrams are defined as adjacent word pairs which have a high forward transitional probability in the BNC (e.g. crucial point), while non-collocational bigrams are defined as adjacent word pairs which are semantically plausible but are absent from the BNC (e.g. crucial night). The results show that there is a neurophysiological difference in how collocational bigrams and non-collocational bigrams are processed.
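For readers unfamiliar with the measure, forward transitional probability is usually understood as the probability of the second word given the first, estimated as count(w1 w2) / count(w1). The sketch below computes it over a toy token list; it is only an illustration of the general definition, not the procedure used in the study, which works with BNC frequencies and whose exact counting choices and cut-off for "high" are not given here. The function name is hypothetical.

```python
from collections import Counter
from typing import List


def forward_transitional_probability(tokens: List[str], w1: str, w2: str) -> float:
    """Estimate P(w2 | w1) as count(w1 w2) / count(w1 as first word of a bigram)."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # The final token never starts a bigram, so it is excluded from the denominator.
    first_word_counts = Counter(tokens[:-1])
    if first_word_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / first_word_counts[w1]


if __name__ == "__main__":
    # Toy corpus for illustration only; real estimates would come from the BNC.
    toy = "a crucial point is a crucial point but a crucial factor".split()
    print(forward_transitional_probability(toy, "crucial", "point"))
    print(forward_transitional_probability(toy, "crucial", "night"))
```

In this toy corpus "crucial" is followed by "point" in two of its three occurrences, whereas "crucial night" never occurs, mirroring the contrast between an attested collocation and an unattested but plausible word pair.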

Michaela Mahlberg, Kathy Conklin, and Gareth Carrol, University of Birmingham

Exploring corpus-attested patterns in Dickens’s fiction – methodological challenges of using eye-tracking techniques

The study of the relationship between patterns and meanings is a key concern in corpus linguistics. The data that corpus linguists work with, however, only provides a partial picture. In this paper, we will look at how questions of frequencies in corpora can be related to questions raised by data from eye-tracking studies on reading times. We will also discuss challenges of designing experiments to address these questions. As a case study, we focus on examples of patterns identified in Dickens’s fiction, but the methodological issues we address have wider implications beyond the study of literary corpora.

The event is free to attend and is open to both internal and external attendees. If you are an external guest, please email us so we know that you intend to come.

We are really looking forward to this event as it will be an exciting opportunity to share ideas regarding the different approaches to using corpus data in experimental psycholinguistics.

FireAnt is making headlines!

FireAnt, a tool for extracting, visualising and exporting social media data, is making headlines! The tool, developed by Claire Hardaker and Laurence Anthony at CASS, has been noted by the Daily Mail for its ability to “hunt down terrorists and trolls”. We’re delighted that FireAnt is being recognised for its capabilities in social media data analysis, and that this is being illustrated to the public in mainstream news.

You can read the article here.

You can read more about FireAnt and its development here and here.

News: Professor John Urry

CASS is extremely sorry to hear of the death of Professor John Urry. We have lost a very distinguished and enthusiastic member of our team, and he will be greatly missed by all at the centre. You can read more about John’s life and work here.