Sino-UK Corpus Linguistics Summer School

At the end of July, Tony McEnery and I taught at the second Sino-UK corpus linguistics summer school, arranged between CASS and Shanghai Jiao Tong University. It was my first time visiting China and we arrived during an especially warm season with temperatures hitting 40 degrees Celsius (we were grateful for the air conditioning in the room we taught in).

Tony opened the summer school, giving an introductory session on corpus linguistics, followed a few days later by a session on collocations, where he introduced CASS’s new tool for collocational networks, GraphColl. I gave a session on frequency and keywords, followed by later sessions on corpus linguistics and language teaching, and CL and discourse analysis. For the lab work components of our sessions, we didn’t use a computer lab. Instead the students brought along their own laptops and tablets, including a few who carried out BNCweb searches on their mobile phones! I was impressed by how much the students attending already knew, and had to think on my feet a couple of times – particularly when asked to explain some of the more arcane aspects of WordSmith (such as the “Standardised Type Token ratio standard deviation”).

At the end of the summer school, a symposium was held where Tony gave a talk on his work with Dana Gablasova and Vaclav Brezina on the Trinity Learner Language corpus. I talked about some research I’m currently doing with Amanda Potts on change and variation in British and American English.

Also presenting were Prof. Gu Yueguo (Beijing Foreign Studies University), who talked about building a corpus of texts on Chinese medicine, and Prof. Roger K Moore (University of Sheffield), who discussed adaptive speech recognition in noisy contexts.

We were made to feel very welcome by our host, Gavin Zhen, one of the lecturers at the university, who went out of his way to shuttle us on the 90-minute journey from the university to our hotel on the Bund.

It was a great event and it was nice to see students getting to grips with corpus linguistics so enthusiastically.

#CL2015 social media roundup: Using Corpus Linguistics to investigate Corpus Linguists talking about Corpus Linguistics


Corpus Linguistics 2015 – CL2015 – is the largest conference of its kind and this year drew over 250 attendees from all over the world to present work outlining the state of Corpus Linguistics (CL) at large, leading-edge technology and methods, and setting the agenda for years to come.

Of particular interest to me was a small but important streak of enquiry running through the conference, which is also becoming more prevalent in CL as a whole. That is, a focus on corpora collected from online sources such as blogs and social media (Elgesem & Salway 2015; Grieve, et al. 2015; Hardaker & McGlashan 2015; Knight 2015; Longhi & Wigham 2015; McGlashan & Hardaker 2015; Statache, et al. 2015). The Internet now enables great opportunities for the collection and interrogation of large amounts of data – big data, even – and the rapid compilation of specialised corpora in ways previously impossible.

I focus here on social media data, specifically data collected from Twitter. Sampling data from Twitter, like a lot of other online sources, offers the opportunity to collect what people are saying (the content of their posts; tweets) but also a huge amount of metadata about the date, time, user, shared content (e.g. hyperlinks, retweets), interactional information, etc. relating to those posts. As Corpus Linguists, we therefore get the data we sample for – posts containing the thing(s) we are interested in – as well as other social information about the content creators and their social networks that we may or may not be interested in. Indeed, concerns about the kinds of metadata included in and attached to online posts have sparked a great deal of debate about the ethics of collecting and using publicly posted online content, though these concerns are not discussed here. Instead, the potential for online ethnography is explored. In order to do this, I pair familiar CL research methods with methods from Social Network Analysis (SNA) that are more explicitly focussed on social networks and examining the myriad ways people affiliate with each other.

Theory & Methods: Corpus-assisted Community Analysis (CoCoA)

Corpus-assisted Community Analysis (CoCoA) is a multimethodological approach to the study of online discourse communities that combines methods from Discourse Analysis (DA), CL, and SNA.

Corpus-assisted Discourse Analysis

I predominantly draw on Baker (2006) in my approach to corpus-assisted DA, seeing discourse in a Foucauldian sense as forms of social practice; “practices which systematically form the objects of which they speak” (Foucault 1972: 49). Particularly, I am interested in the incremental effect of discourse. Baker suggests, “a single word, phrase or grammatical construction on its own may suggest the existence of a discourse” (2006: 13). However, in order to investigate how quantitatively typical or pervasive a discourse is within a discourse community, numerous examples of linguistic instantiations of that discourse are required to make a claim about its cumulative effect (ibid.). Following Baker, I argue here that corpora and CL techniques enable this kind of quantitative examination of discourse.

Social Network Analysis

SNA implements notions from graph theory for formally modelling and describing the properties of relationships between objects of study such as people and institutions. A graph (or ‘sociogram’) is a representation of people or institutions of interest as ‘nodes’ and the relationships between them as a set of lines known as ‘edges’; a graph is built by representing “a set of lines [‘edges’] connecting points [‘nodes’]” (Scott 2013: 17). To interpret graphs, graph theory contributes “a body of mathematical axioms and formulae that describe the properties of the patterns formed by the lines [‘edges’]” (Scott 2013: 17). One of these axioms is ‘directionality’. Directed graphs can encode both symmetric and asymmetric relations (D’Andrea, et al. 2010: 12). Relationships in which nodes are connected by an edge that has a direction of flow from one node to another are known as asymmetric, as illustrated by the relations between A and C, and C and B in Fig. 1. Symmetric relationships are those in which an edge connects two nodes but is bidirectional – the direction of relation flows both ways – as illustrated by the relationship between A and B in Fig. 1. Directed relationships on Twitter include followership relations and the act of mentioning – i.e. including another user’s handle (e.g. @CorpusSocialSci) – in tweets.


Figure 1: A simple directed graph

Undirected graphs represent identical, symmetric relationships between nodes which might be the result of nodes sharing reciprocal attitudes or “because they have a common involvement in the same activity” (Scott 2013: 17). Fig. 2 gives a graphical representation of an undirected graph.


Figure 2: A simple undirected graph
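As a rough illustration of the distinction drawn in Figs. 1 and 2, a directed graph can be sketched as a set of (source, target) pairs. This is only a minimal sketch: the A–B tie is reciprocated as described above, but the directions chosen for the A–C and C–B edges are assumptions, since the text does not say which way they point, and the function names are hypothetical.

```python
# Directed edges stored as (source, target) pairs.
# A <-> B is reciprocated (symmetric); the directions A -> C and C -> B
# are assumed for illustration only.
directed_edges = {("A", "B"), ("B", "A"), ("A", "C"), ("C", "B")}


def is_symmetric(u, v, edges):
    """True if the relation flows both ways between u and v."""
    return (u, v) in edges and (v, u) in edges


def to_undirected(edges):
    """Drop direction: each tie becomes an unordered pair of nodes."""
    return {frozenset(pair) for pair in edges}


undirected_edges = to_undirected(directed_edges)
```

Collapsing the directed edges in this way leaves three undirected ties, because the reciprocated A–B pair becomes a single edge.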

Directed and undirected (‘ambient’) kinds of affiliation are both understood here as being distinct forms of discursively constructed social practices. Furthermore, I adopt the term ‘ambient affiliation’ from the work of Zappavigna on the use of social media in the formation of community and identity (Zappavigna 2012; Zappavigna 2013). Ambient affiliation is about the functionalities of social media platforms that enable users “to commune with others without necessarily engaging in direct conversational exchanges” (Zappavigna 2013: 223-4). Therefore, ambient affiliation is about people exhibiting the same behaviours or sharing the same qualities but without directly interacting with each other. This notion closely approximates the notion of an ‘undirected’ graph. In developing the theory of ambient affiliation Zappavigna draws on Page’s work on hashtags. Page refers to hashtags as “a search term” (2012: 183). Hashtags – a string of characters (usually a word or short phrase) unbroken by spaces or non-alphabetic/non-numeric characters (excl. underscores ‘_’) preceded by ‘#’ (e.g. #YOLO) – are used as metadiscursive markers of the topic of a tweet. Page goes on to argue that, “the kind of talk which aggregate around hashtags […] involve multiple participants talking simultaneously about the same topic, rather than individuals necessarily talking with each other in dyadic exchanges that resemble a conversation” (2012: 196). As such, Page suggests that hashtags destabilise conventional adjacency pairs characteristic of many forms of human dialogue and give a new way for humans to interact on a topic of mutual interest.
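The definition of a hashtag given above maps quite directly onto a regular expression. The following is a minimal sketch rather than the extraction method actually used for this post: the pattern and the `hashtags` function are illustrative, and lower-casing the matches (so that #CL2015 and #cl2015 count as the same tag) is an assumption.

```python
import re

# '#' followed by an unbroken run of letters, digits or underscores.
HASHTAG = re.compile(r"#\w+")


def hashtags(text):
    """Return the hashtags in a tweet, lower-cased (an assumed normalisation)."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


example = "Stupidly excited about #Fireant from @antlabjp  #CL2015"
found = hashtags(example)
```

Because the pattern stops at the first character that is not a letter, digit or underscore, trailing punctuation such as the hyphen in “#CL2015-” is not treated as part of the tag.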


I collected all tweets and retweets including the official hashtag of the Corpus Linguistics 2015 conference – #CL2015 – posted from the date of the first pre-conference workshop (20/07/2015) through until the final day of the conference (24/07/2015). To do this, I used the R based Twitter client ‘twitteR’ to access the Twitter API. The resulting data amounted to:

Type | Total number
--- | ---
Tweets | 671
Retweets | 1025

Date | Tweets | Retweets
--- | --- | ---
20/07/2015 | 57 | 76
21/07/2015 | 128 | 169
22/07/2015 | 152 | 370
23/07/2015 | 176 | 241
24/07/2015 | 158 | 169
Totals | 671 | 1025

The tweets corpus contained around 10,000 words in total.


The data contained some ‘noise’ mainly caused by other people using the same #CL2015 hashtag to talk about another event occurring during the period of the conference. However, as I will show in the analysis, the methods enable researchers to focus only on the communities they are interested in.


Tweets – what was being talked about?

To find out what people were talking about day-to-day, I created daily tweet corpora. With each of these daily corpora, I performed a keyword analysis using a reference corpus compiled from the tweets sent on the remaining days. So, for the tweets sent during the pre-conference workshop day (20/07/2015) I used the tweets sent during the rest of the conference (21/07/2015-24/07/2015) as a reference corpus, and so on. The resulting top 10 keywords for each day are given in the table below.

Rank | 20/07/2015 | 21/07/2015 | 22/07/2015 | 23/07/2015 | 24/07/2015
--- | --- | --- | --- | --- | ---
1 | CL2015 | change | sealey | partington | illness
2 | workshop | fireant | granger | duguid | mental
3 | pre | biber | animals | gala | news
4 | workshops | climate | sylviane | class | literature
5 | conference | doom | campaign | dinner | yahoo
6 | main | misogyny | heforshe | please | dickens
7 | starting | academic | collocation | poster | csr
8 | historian | assist | eeg | legal | health
9 | day | biber’s | handford | alan | incelli
10 | lancaster | bnc | learner | mock | jaworska

The keywords shown in each column outline the most distinctive topics tweeted about during the conference. Italics used here relate back to keywords in the table.
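For readers who want to try the day-against-the-rest design themselves, here is a minimal sketch. The post does not say which keyness statistic or software was used, so the log-likelihood measure below is an assumption, as are the function names and the toy token lists.

```python
import math
from collections import Counter


def log_likelihood(a, b, c, d):
    """Log-likelihood keyness.

    a: frequency in the target corpus, b: frequency in the reference corpus,
    c: size of the target corpus,      d: size of the reference corpus.
    """
    e1 = c * (a + b) / (c + d)  # expected frequency in the target corpus
    e2 = d * (a + b) / (c + d)  # expected frequency in the reference corpus
    ll = 0.0
    if a > 0:
        ll += a * math.log(a / e1)
    if b > 0:
        ll += b * math.log(b / e2)
    return 2 * ll


def daily_keywords(daily_tokens, top_n=10):
    """For each day, rank words against a reference built from all other days."""
    results = {}
    for day, tokens in daily_tokens.items():
        target = Counter(tokens)
        reference = Counter()
        for other_day, other_tokens in daily_tokens.items():
            if other_day != day:
                reference.update(other_tokens)
        c, d = sum(target.values()), sum(reference.values())
        scored = []
        for word, a in target.items():
            b = reference[word]
            # Keep positive keywords only: relatively more frequent in the target.
            if d == 0 or a / c > b / d:
                scored.append((log_likelihood(a, b, c, d), word))
        scored.sort(key=lambda item: (-item[0], item[1]))
        results[day] = [word for _, word in scored[:top_n]]
    return results


# Hypothetical, heavily simplified token lists standing in for the daily corpora.
toy_days = {
    "day1": ["workshop", "workshop", "workshop", "corpus", "talk"],
    "day2": ["fireant", "fireant", "corpus", "talk", "talk"],
    "day3": ["animals", "animals", "corpus", "talk", "corpus"],
}
toy_keywords = daily_keywords(toy_days, top_n=1)
```

Words that occur on one day but never in the reference days score highly, which is why names of speakers and tools tend to rise to the top of each column in this kind of design.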

On day 1, the pre-conference workshops, including @antlab’s pre-conference corpus tools brainstorming session and @stgries’s pre-conference #R workshop were popular topics of conversation in the smallest subsample of tweets for the week.

Top favourited tweet from day 1:

On day 2, more diverse topics start to emerge. Change became a theme, relating both to Andrew Salway’s talk on discourse surrounding climate change and to a talk given by Doug Biber on historical linguistic change in ‘uptight’ academic texts. Fireant, a new user-friendly tool for efficiently dealing with large databases developed by Laurence Anthony, was also unveiled to the CL masses on day 2, which prompted a flurry of excited tweets [keep track of Laurence’s Twitter page for release]. DOOM and misogyny also became topical following talks by Claire Hardaker and Mark McGlashan on the Discourse of Online Misogyny project. Finally, some excitement followed a paper given by Robbie Love and Claire Dembry about the new Spoken BNC2014. For those interested, keep track of the CASS website for spoken data grants later in the year.

Top favourited tweet from day 2:

Day 3 saw another topic change focussing most prominently on Alison Sealey’s talk on the discursive construction of animals in the media, Sylviane Granger’s plenary on learner corpora, a talk on the public’s online reaction to the #HeForShe campaign given by Rosie Knight, and Jen Hughes’ talk on the application of EEG (‘Electroencephalography’) to the study of collocation as a cognitive phenomenon.

Top favourited tweet from day 3:

After 3 days of incredibly interesting talks, corpus linguists were about ready for their gala dinner on day 4. But before all the cheesecake, CL2015 attendees were excitedly tweeting about the all-important poster session; Alison Duguid’s talk on class; the Geoffrey Leech tribute panel, which included Charlotte Taylor’s paper on mock politeness and ‘bitchiness’ as well as Lynne Murphy and Rachele de Felice’s talk on the differential use of please in BrE and AmE; Alan Partington’s plenary speech on CADS; and papers given by Ruth Breeze, Amanda Potts, and Alex Trklja on the application of CL methods to the study of a broad range of legal language.

Top favourited tweet from day 4:

Day 5 brought #CL2015 to a close, but the number of tweets remained steady. Health was on the agenda, with talks from Ersilia Incelli and Gillian Smith, who both focussed partly on the construction of mental illness/health in the news. News also featured in Monika Bednarek’s talk on news discourse and Antonio Fruttaldo’s analysis of news tickers. Other key topics related to Sylvia Jaworska and Anupam Nanda’s paper on the Corpus Linguistic analysis of Corporate Social Responsibility (CSR), Michaela Mahlberg’s work on the literature of Charles Dickens, and discussion of a corpus of Yahoo answers in the week’s penultimate panel on triangulating methodological approaches.

Top favourited tweet from day 5:

Approaching tweets in this way, it was possible to find out the most salient topics of each day. However, I was also interested in the retweeting behaviour of attendees.

Retweets – what was being talked about?

I looked at the top 10 most frequently retweeted tweets during the conference. Due to the intertextual nature of retweets – they are simply identical reposts of the same content – methods familiar to CL such as word frequency lists may not be as useful in their study. For example, if a few retweets are particularly frequently reposted, the most frequent words will be skewed by the content of the most frequent retweets. Instead, I suggest that retweets themselves should be conceptualised as being individual types in and of themselves that require more qualitative approaches to their interpretations (at least in this context). The top 10 most frequently retweeted tweets including the #CL2015 hashtag are given below:

Rank | Retweet | Date | Freq
--- | --- | --- | ---
1 | RT @EstrategiasEc: Concluimos este viernes con exitoso proceso de postulación @ECLideres VI Prom. #CL2015 con auspicio de @ucatolicagye. ht… | 22/07/2015 | 218
2 | RT @perayson: To access the new HT semantic tagger from the @SAMUELSProject see and #CL2015 | 23/07/2015 | 15
3 | RT @UCREL_Lancaster: The #CL2015 abstract book is now available to download from the conference website | 21/07/2015 | 13
4 | RT @duygucandarli: Important take-away messages about corpus research in Biber’s plenary talk at #CL2015! | 21/07/2015 | 11
5 | RT @lynneguist: Alan Partington looking at how quickly language changes in White House Press Briefings… #CL2015 | 23/07/2015 | 10
6 | RT @CorpusSocialSci: .@_paulbaker_ reflecting on a number of approaches to the same data at the Triangulation panel at #CL2015… | 24/07/2015 | 10
7 | RT @CorpusSocialSci: .@vaclavbrezina introduces Graphcoll, a new visualisation tool for collocational networks #CL2015… | 22/07/2015 | 9
8 | RT @_ctaylor_: It’s a myth that reference corpora have to larger than target corpus says @antlabjp  #cl2015 | 22/07/2015 | 7
9 | RT @Loopy63: #CL2015 Call for papers for Intl. Conference on  statistical analysis of textual data 2016 in Nice, France:… | 23/07/2015 | 7
10 | RT @vaclavbrezina: A great use of #GraphColl by @violawiegand – #CL2015 poster presentation @TonyMcEnery @StephenWattam… | 24/07/2015 | 7
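Treating each retweet as a type in its own right amounts to counting identical texts rather than the words inside them. A minimal sketch of that idea follows; the `top_retweets` function and the sample texts are hypothetical and are not the data or code behind the table.

```python
from collections import Counter


def top_retweets(retweets, n=10):
    """Count identical retweet texts as types; return the n most frequent."""
    return Counter(text.strip() for text in retweets).most_common(n)


# Hypothetical retweet texts, used only to show the shape of the output.
sample_retweets = (
    ["RT @user_a: abstract book now available #CL2015"] * 3
    + ["RT @user_b: new tagger released #CL2015"] * 2
    + ["RT @user_c: call for papers #CL2015"]
)
ranked = top_retweets(sample_retweets, n=2)
```

A word frequency list built from the same sample would be dominated by the words of the most reposted text, which is the skew described above.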

The most frequent retweet was regarding a Latin American Youth Leadership programme that shared the same #CL2015 hashtag [nb. For next year, Corpus Linguistics conference organisers…]. As you will notice, this retweet occurred on 22/07/2015 but, as retweets and tweets are dealt with separately, the retweet does not interfere with the keyword analysis done for the same day on the tweets.

What do the most frequent retweets highlight? Free tools (GraphColl, HT semantic tagger), free resources (abstract book), plenary talks and more conferences.


With a general idea of what people are talking about and sharing using the #CL2015 hashtag, I was interested to examine the overall activity around #CL2015 and the emergence of discourse communities.

In terms of tweets, the gif below shows how relationships developed over the course of the conference. Every node represents a Twitter account that posted a tweet containing #CL2015 during the period of data collection. The size of each node is dictated by its ‘degree’, or number of edges. More edges = larger node. The colour of the nodes is determined by ‘betweenness centrality’, which indicates how central a node is in a network. Nodes with high betweenness centrality help the speed of transfer of information through networks as they help create the shortest distance between other nodes in the network. Nodes with high betweenness centrality are coloured red, a medium betweenness centrality is yellow, and low betweenness centrality is blue. Nodes with intermediary colours (orange, green) represent those that have a betweenness centrality somewhere between low and medium or between medium and high. Finally, the colour and size of edges is dictated by ‘weight’. In this example, weight is dictated by the frequency of tweets that exist between nodes. Thick red edges between nodes represent nodes that send tweets to each other frequently, or one node mentions another frequently. Thin blue edges represent low frequency mentioning relationships. Yellow are medium. Again, blended colours represent intermediary frequencies and thus, in this case, weight.

CL2015 tweets
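The two node measures used in the visualisation can be sketched in plain Python. This is not the software used to produce the gif (the post does not name it); it is a minimal sketch of degree and of Brandes’ algorithm for betweenness centrality on an unweighted, undirected graph, and the star-shaped toy network and its node names are hypothetical.

```python
from collections import deque


def degree(adj):
    """Number of edges attached to each node."""
    return {node: len(neighbours) for node, neighbours in adj.items()}


def betweenness(adj):
    """Unnormalised betweenness centrality (Brandes' algorithm).

    adj maps each node to the set of its neighbours (undirected, unweighted).
    """
    scores = {v: 0.0 for v in adj}
    for s in adj:
        order = []                      # nodes in order of distance from s
        preds = {v: [] for v in adj}    # predecessors on shortest paths
        sigma = {v: 0 for v in adj}     # number of shortest paths from s
        dist = {v: -1 for v in adj}
        sigma[s], dist[s] = 1, 0
        queue = deque([s])
        while queue:                    # breadth-first search from s
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = {v: 0.0 for v in adj}
        while order:                    # accumulate dependencies, farthest first
            w = order.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                scores[w] += delta[w]
    # Each undirected pair is visited from both ends, so halve the totals.
    return {v: score / 2 for v, score in scores.items()}


# Hypothetical star network: one hub joined to three otherwise unconnected accounts.
toy_network = {
    "hub": {"a", "b", "c"},
    "a": {"hub"},
    "b": {"hub"},
    "c": {"hub"},
}
toy_degree = degree(toy_network)
toy_betweenness = betweenness(toy_network)
```

In the toy network every shortest path between two peripheral accounts runs through the hub, so the hub has both the highest degree and the highest betweenness, while the peripheral accounts score zero on betweenness.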

The tweets network shows that @CorpusSocialSci was – perhaps unsurprisingly – the most prolific and central account in the #CL2015 network. It had the most connections and joined the most individual accounts together. But other users were also very active in helping to disseminate information more widely, as shown by the nodes in yellow and orange. The accounts on the periphery of the network are good examples of ambient affiliation. They use #CL2015 to affiliate but do not directly engage with others by mentioning other users. Moreover, the gif attempts to show the evolution and growth of the network over time, and also shows that new topics, and networks of interaction relating to those topics, emerged each day. As talks (and news of talks in the network) became topical, people tweeted and shared ideas and notes relevant to those talks. An example of this is the emergence of fireant on 21/07/2015. When it was introduced to delegates, an ad hoc online discourse community formed to spread the news of a new tool, add new information and channel their enthusiasm back to source.

User | Date | Tweet
--- | --- | ---
RachelleVessey | 2015-07-21 16:50:43 | Excellent end to the first day of #CL2015- FireAnt looks like a fantastic programme @antlabjp @DrClaireH can’t wait to try it out!
SLGlaas | 2015-07-21 16:54:13 | Stupidly excited about #Fireant from @antlabjp  #CL2015
CorpusSocialSci | 2015-07-21 16:54:43 | Everyone is eagerly wondering when FireAnt will be available. @antlabjp’s answer is hopefully within the next few months. #CL2015
Rosie_Knight | 2015-07-21 16:56:40 | Amazing talk about FireAnt- can’t wait to use this on my #HeForShe data! @DrClaireH @antlabjp @Mark_McGlashan #CL2015
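The four FireAnt tweets above can be turned into weighted, directed mention edges with a few lines of code. This is a minimal sketch of the idea, not the pipeline used for the post: the regular expression and the `mention_edges` function are illustrative assumptions, with weight taken as the number of tweets in which one account mentions another.

```python
import re
from collections import Counter

# '@' followed by an unbroken run of letters, digits or underscores.
MENTION = re.compile(r"@(\w+)")


def mention_edges(tweets):
    """Map (sender, mentioned account) to the number of mentions."""
    edges = Counter()
    for user, text in tweets:
        for mentioned in MENTION.findall(text):
            edges[(user, mentioned)] += 1
    return edges


fireant_tweets = [
    ("RachelleVessey", "Excellent end to the first day of #CL2015- FireAnt looks like a fantastic programme @antlabjp @DrClaireH can’t wait to try it out!"),
    ("SLGlaas", "Stupidly excited about #Fireant from @antlabjp  #CL2015"),
    ("CorpusSocialSci", "Everyone is eagerly wondering when FireAnt will be available. @antlabjp’s answer is hopefully within the next few months. #CL2015"),
    ("Rosie_Knight", "Amazing talk about FireAnt- can’t wait to use this on my #HeForShe data! @DrClaireH @antlabjp @Mark_McGlashan #CL2015"),
]

edges = mention_edges(fireant_tweets)

# Total weight of incoming mentions per account.
incoming = Counter()
for (sender, mentioned), weight in edges.items():
    incoming[mentioned] += weight
```

All four tweets point at @antlabjp, which is the sense in which the community channels its enthusiasm back to source.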

CL2015 retweets

The retweets network again shows that @CorpusSocialSci was – again, perhaps unsurprisingly – at the centre of #CL2015 retweeting activity. The retweet network gif shows 2 discrete networks. The right hand network shows activity at the CL conference; the left hand network shows the retweeting behaviour of the Latin American Youth Leadership programme mentioned above. Avid conference tweeters may have noticed this when keeping track of the #CL2015 hashtag. The left hand network – a graphic representation of the most retweeted tweet containing #CL2015 shown above – shows 218 users retweeting a single central account. In this network there is no interaction between the users engaged in retweeting this user. This kind of network formation is extremely typical of users retweeting news stories on Twitter. The right hand network, however, shows a great deal of mutual retweeting, whereby users are engaged on a prolonged basis in sharing each other’s tweets and forming a network of sharing and resharing.
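Separating discrete networks of this kind, and telling a one-way ‘star’ of retweets apart from mutual resharing, can be sketched as follows. This is only an illustrative sketch: the account names are hypothetical, edges are assumed to run from the retweeting account to the account being retweeted, and the function names are not taken from any particular library.

```python
def connected_components(edges):
    """Group accounts into discrete networks, ignoring edge direction."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, components = set(), []
    for start in adj:
        if start in seen:
            continue
        component, frontier = set(), [start]
        while frontier:
            node = frontier.pop()
            if node in component:
                continue
            component.add(node)
            frontier.extend(adj[node] - component)
        seen |= component
        components.append(component)
    return components


def is_star(component, edges):
    """True if every retweet in the component is of one single account."""
    retweeted = {v for u, v in edges if u in component}
    return len(retweeted) == 1


# Hypothetical retweet edges: (retweeting account, retweeted account).
retweet_edges = [
    ("fan1", "news_account"), ("fan2", "news_account"), ("fan3", "news_account"),
    ("delegate1", "delegate2"), ("delegate2", "delegate1"), ("delegate3", "delegate1"),
]
networks = connected_components(retweet_edges)
shapes = sorted(is_star(network, retweet_edges) for network in networks)
```

In the toy data one component is a star around a single retweeted account and the other contains reciprocal retweeting, mirroring the left hand and right hand networks described above.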


Integrating methods from CL and SNA offers some really interesting possibilities for the analysis of large amounts of social data. Here, I have used keyword analysis to find the most salient topics for each day of the conference, used those topics to find and visualise small but coherent discourse communities, and situated those communities within the wider #CL2015 social network.


m.mcglashan(Replace this parenthesis with the @ sign)



Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.

D’Andrea, A., Ferri, F. & Grifoni, P. (2010). An overview of methods for virtual social network analysis. In: A. Abraham, A.-E. Hassanien, & V. Snášel (eds.). Computational Social Network Analysis. London: Springer London, pp. 3–26.

Elgesem, D. & Salway, A. (2015) Traitor, whistleblower or hero? Moral evaluations of the Snowden affair in the blogosphere. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 99-101

Foucault, M. (1972) The Archaeology of Knowledge. London: Tavistock.

Grieve, J., Nini, A., Guo, D. & Kasakoff, A. (2015) Recent changes in word formation strategies in American social media. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 140-3

Knight, R. (2015) Tweet all about it: public views on the UN’s HeForShe campaign. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 201-3

Longhi, J. & Wigham, C. R. (2015) Structuring a CMC corpus of political tweets in TEI: corpus features, ethics and workflow. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 408-9

Hardaker, C. & McGlashan, M. (2015) Twitter rape threats and the discourse of online misogyny (DOOM): from discourses to networks. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 154-6

McGlashan, M. & Hardaker, C. (2015) Twitter rape threats and the discourse of online misogyny (DOOM): using corpus-assisted community analysis (COCOA) to detect abusive online discourse communities. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 234-5

Page, R. (2012). The linguistics of self-branding and micro-celebrity in Twitter: The role of hashtags. Discourse & Communication. 6 (2). pp. 181–201.

Scott, J. (2013). Social Network Analysis. 3rd Ed. London: Sage.

Statache, R., Adolphs, S., Carter, C. J., Koene, A., McAuley, D., O’Malley, C., Perez, E. & Rodden, T. (2015) Descriptive ethics on social media from the perspective of ideology as defined within systemic functional linguistics. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. p. 433

Zappavigna, M. (2012). Discourse of Twitter and social media. London: Continuum.

Zappavigna, M. (2013). Enacting identity in microblogging through ambient affiliation. Discourse & Communication. 8 (2). pp. 209–228.

CASS presentation at Cambridge University Centre of Islamic Studies symposium on Anti-Muslim Hate Crime

The CASS ‘Hate Speech’ project team were invited on the 16th of June to present some of our findings at a Symposium on Anti-Muslim Hate Crime held at the University of Cambridge Centre of Islamic Studies. The Symposium was organised by Julian Hargreaves, a Lancaster University Law School PhD student and Research Associate at the Cambridge Centre.

The symposium brought together academics, community experts and civil society leaders in a unique event that allowed the sharing of knowledge, experience and expertise on the subject from a wide range of perspectives.

The first session of the day focussed on the research approaches and findings from three UK academic centres. Stevie-Jade Hardy from the University of Leicester’s Hate Crime Project isolated and shared some of the project’s key findings on experiences and impacts of hate crime for Muslims in Leicester. Sussex University PhD student Harriet Fearn discussed the early observations she had made in her research on the impacts of hate crime against Muslims on the internet.

Representing CASS and Lancaster University Law School, Paul Iganski and I then delivered a presentation of our work conducted with Jonathan Culpeper examining Crown Prosecution Service files from cases of religiously aggravated offences. In our paper titled ‘A question of faith?’, Paul and I explored the boundaries of free speech, the roles of religious identity and religious beliefs in the alleged offences committed, and the commonalities in the circumstances and contexts which surround offences prosecuted as religiously aggravated.

After lunch, the experiences of representatives from three community organisations confronting hate crime in Britain were shared with those present. Alice Purves gave a compelling account of the challenges faced by the Edinburgh and Lothians Regional Equality Council (ELREC). Jed Din, director of the Bradford Hate Crime Alliance then offered a personal account of the particular challenges of anti-Muslim hate crime and his own visions to develop community cohesion as a response. The session concluded with a presentation on anti-Muslim hate crime in Leicester from Jawaahir Daahir, CEO of the Somali Development Services.

The final session of the day, chaired by Paul Iganski, offered different approaches to documenting and responding to anti-Muslim hate crime. Shenaz Bunglawala, the head of research at MEND, shared insights and observations on the prevalence of anti-Muslim hate crime and attitudes to Muslims in Britain. The presentation included several of the key findings and observations from the research led by CASS director Tony McEnery on Representations of Islam in the British press. Those gathered then had the opportunity to hear from Hayyan Bhabha, the independent parliamentary researcher for the All Party Parliamentary Group on Islamophobia, who shared the latest developments in the work of the APPG and illustrated some of the evidence received or collated by the APPG. The final paper of the day came from Vishal Vora, from SOAS, with a perspective on indirect discrimination towards British Muslim women as a consequence of declarations of ‘non marriage’ by the English family court.


From left to right: Abe Sweiry, Julian Hargreaves and Paul Iganski

The symposium was a very successful event and Paul and I very much enjoyed contributing to the day. Thanks are due to Julian and to Louise Beazor for putting together a very interesting programme, bringing together a wide range of perspectives on an important social issue, and arranging a highly productive day for all in attendance.

2014/15 in retrospective: Perspectives on Chinese

Looking back over the academic year as it draws to a close, one of the highlights for us here at CASS was the one-day seminar we hosted in January on Perspectives on Chinese: Talks in Honour of Richard Xiao. This event celebrated the contributions to linguistics of CASS co-investigator Dr. Richard Zhonghua Xiao, on the occasion of both his retirement in October 2014 (and simultaneous taking-up of an honorary position with the University!), and the completion of the two funded research projects which Richard has led under the aegis of CASS.

The speakers included present and former collaborators with Richard – some (including myself) from here at Lancaster, others from around the world – as well as other eminent scholars working in the areas that Richard has made his own: Chinese corpus linguistics (especially, but not only, comparative work), and the allied area of the methodologies that Richard’s work has both utilised and promulgated.

In the first presentation, Prof. Hongyin Tao of UCLA took a classic observation of corpus-based studies – the existence, and frequent occurrence, of highly predictable strings or structures – and pointed out a little-noticed aspect of these highly-predictable elements. They often involve lacunae, or null elements, where some key component of the meaning is simply left unstated and assumed. An example of this is the English expression under the influence, where “the influence of what?” is often implicit, but understood to be drugs/alcohol. It was pointed out that collocation patterns may identify the null elements, but that a simplistic application of collocation analysis may fail to yield useful results for expressions containing null elements. Finally, an extension of the analysis to yinxiang, the Chinese equivalent of influence, showed much the same tendencies – including, crucially, the importance of null elements – at work.

The following presentation came from Prof. Gu Yueguo of the Chinese Academy of Social Sciences. Gu is well-known in the field of corpus linguistics for his many projects over the years to develop not just new corpora, but also new types of corpus resources – for example, his exciting development in recent years of novel types of ontology. His presentation at the seminar was very much in this tradition, arguing for a novel type of multimodal corpus for use in the study of child language acquisition.

At this point in proceedings, I was deeply honoured to give my own presentation. One of Richard’s recently-concluded projects involved the application of Douglas Biber’s method of Multidimensional Analysis to translational English as the “Third Code”. In my talk, I presented methodological work which, together with Xianyao Hu, I have recently undertaken to assist this kind of analysis by embedding tools for the MD approach in CQPweb. A shorter version of this talk was subsequently presented at the ICAME conference in Trier at the end of May.

Prof. Xu Hai of Guangdong University of Foreign Studies gave a presentation on the study of Learner Chinese, an issue which was prominent among Richard’s concerns as director of the Lancaster University Confucius Institute. As noted above, Richard has led a project funded by the British Academy, looking at the acquisition of Mandarin Chinese as a foreign language; as a partner on that project, Xu’s presentation of a preliminary report on the Guangwai Lancaster Chinese Learner Corpus was timely indeed. This new learner corpus – already in excess of a million words in size, and consisting of a roughly 60-40 split between written and spoken materials – follows the tradition of the best learner corpora for English by sampling learners with many different national backgrounds, but also, interestingly, includes some longitudinal data. Once complete, the value of this resource for the study of L2 Chinese interlanguage will be incalculable.

The next presentation was another from colleagues of Richard here at Lancaster: Dr. Paul Rayson and Dr. Scott Piao gave a talk on the extension of the UCREL Semantic Analysis System (USAS) to Chinese. This has been accomplished by means of mapping the vast semantic lexicon originally created for English across to Chinese, initially by automatic matching, and secondarily by manual editing. Scott and Paul, with other colleagues including CASS’s Carmen Dayrell, went on to present this work – along with work on other languages – at the prestigious NAACL HLT 2015 conference, in whose proceedings a write-up has been published.

Prof. Jiajin Xu (Beijing Foreign Studies University) then made a presentation on corpus construction for Chinese. This area has, of course, been a major locus of activity by Richard over the years: his Lancaster Corpus of Mandarin Chinese (LCMC), a Mandarin match for the Brown corpus family, is one of the best openly-available linguistic resources for that language, and his ZJU Corpus of Translational Chinese (ZCTC) was a key contribution of his research on translation in Chinese. Xu’s talk presented a range of current work building on that foundation, especially the ToRCH (“Texts of Recent Chinese”) family of corpora – a planned Brown-family-style diachronic sequence of snapshot corpora in Chinese from BFSU, starting with the ToRCH2009 edition. Xu rounded out the talk with some case studies of applications for ToRCH, looking first at recent lexical change in Chinese by comparing ToRCH2009 and LCMC, and then at features of translated language in Chinese by comparing ToRCH2009 and ZCTC.

The last presentation of the day was from Dr. Vittorio Tantucci, who has recently completed his PhD at the Department of Linguistics and English Language at Lancaster, and who specialises in a number of issues in cognitive linguistic analysis including intersubjectivity and evidentiality. His talk addressed specifically the Mandarin evidential marker 过 guo, and the path it took from a verb meaning ‘to get through, to pass by’ to becoming a verbal grammatical element. He argued that this exemplified a path for an evidential marker to originate from a traversative structure – a phenomenon not noted in the literature on this kind of grammaticalisation, which focuses on two other paths of development, from verbal constructions conveying a result or a completion. Vittorio’s work is extremely valuable, not only in its own right but as a demonstration of the role that corpus-based analysis, and cross-linguistic evidence, have to play in linguistic theory. Given Richard’s own work on the grammar and semantics of aspect in Chinese, a celebration of Richard’s career would not have been complete without an illustration of how this trend in current linguistics continues to develop.

All in all, the event was a magnificent tribute to Richard and his highly productive research career, and a potent reminder of how diverse his contributions to the field have actually been, and of their far-reaching impact among practitioners of Chinese corpus linguistics. The large and lively audience certainly seemed to agree with our assessment!

Our deep thanks go out to all the invited speakers, especially those who travelled long distances to attend – our speaker roster stretched from California in the west, to China in the east.

CASS Corpus Linguistics workshop at the University of Caxias do Sul (UCS, Brazil)

Last month at UCS (Brazil), the CASS Corpus Linguistics workshop found a receptive audience who participated actively and enthusiastically engaged in the discussion. The workshop was run from 27-28 May by CASS members Elena Semino, Vaclav Brezina and Carmen Dayrell, and perfectly organised by the local committee Heloísa Feltes and Ana Pelosi.


From left to right: Carmen Dayrell, Heloísa Feltes, Vaclav Brezina, Elena Semino, and Ana Pelosi

This workshop brought together lecturers, researchers, PhDs and MA research students from various Brazilian universities. It was a positive, invigorating experience for the CASS team and a golden opportunity to discuss the various applications of corpus linguistics methods. We would like to thank UCS for offering all necessary conditions to make this workshop run so smoothly.

The workshop was part of a collaborative project between UK and Brazilian scholars funded by the UK’s ESRC and the Brazilian research agency CONFAP (FAPERGS) which will make use of corpus linguistics techniques to investigate the linguistic representation of urban violence in Brazil. Further details of this project can be found at

CASS PhD student in Moscow to attend the XVI April International Academic Conference on Economic and Social Development

I recently got the opportunity to travel to Moscow to attend the XVI April International Academic Conference on Economic and Social Development at the National Research University – Higher School of Economics (HSE). This conference covered a wide variety of fields including Sociology, Geography, and Technology, and, on the last day of the conference, there was a seminar specifically for Linguistics PhD students. The aim of this seminar was to allow students from Russia and other countries to exchange ideas, and to introduce students from around the world to HSE.

At the seminar, there were presentations from 10 PhD students and these covered a variety of Linguistics topics including Grammar, Semantics, Sign Language, and Cognitive Linguistics. There were also some presentations on Corpus Linguistics: one which discussed semantic role labelling for the Russian language based on the Russian FrameBank, and another which discussed building a corpus of Soviet poetry. I found it interesting to see corpus analyses based on the Russian language, and it was also interesting to see the use of the ‘web as corpus’. This introduced me to tools that I haven’t used before, such as the Google Ngram Viewer.

In the afternoon, I gave a presentation entitled The collocation hypothesis: Evidence from self-paced reading. This was the first time I had ever given a conference presentation and I was really pleased to have an audience that seemed interested in my work. The audience was composed of PhD students, some undergraduate students from the Linguistics Department at HSE, researchers from other fields who had presented at the conference on the previous days, as well as a few senior academics who gave me some really useful feedback.

The conference was held at the central building of HSE and, the day before the seminar, an MA student in Computational Linguistics kindly gave me a tour of the Linguistics Department. It was interesting to see that their classes are all seminar-based and I particularly liked the way they had a common room where all members of the department, including undergraduates, postgraduates, and lecturers, go between classes in order to socialise or do work. Here, I got the chance to speak to some undergraduates and postgraduates and I was shown some of the corpora that were compiled at that department, such as the Corpus of Modern Yiddish, the Bashkir Poetic Corpus, and the Russian Learner Corpus of Academic Writing. I was also told about a project called Tolstoy Digital, which involved making a corpus of Tolstoy’s works. It was interesting to hear about the unique problems that were faced when compiling this corpus. For instance, Tolstoy used an older orthography so this had to be translated to the modern form before the corpus could be tagged and parsed.

When speaking to members of the department, it was also interesting to discuss how some of their work links to some of the work carried out at CASS and the Linguistics Department at Lancaster University. For example, Elena Semino’s work on pain questionnaires seemed to link closely to an article written by members of HSE entitled Towards a typology of pain predicates (Reznikova et al. 2012). This article discusses the way in which the semantic domain of pain is largely composed of words borrowed from other semantic domains.

After showing me around the department, the MA student, Natalia, showed me around some of the main sights in central Moscow. I really appreciated this as I got to see some of Moscow from a local’s perspective as well as getting to visit some of the key sights that I was looking forward to seeing such as the Bolshoi Theatre. Whilst in Moscow, I also went to see Swan Lake at the Kremlin Theatre of Classical Russian Ballet. This was an amazing experience because I had always wanted to see a Russian ballet and, although I had already seen Swan Lake several times, this was definitely the best version I had ever seen. Overall I had a brilliant time in Moscow and I am really grateful to the Higher School of Economics for funding and organising the trip.

Workshop on ‘Metaphor in end of life care’ at St Joseph’s Hospice, London

On 26th September 2014, three members of the CASS-affiliated ‘Metaphor in end of life care’ project team were invited to run a workshop at St Joseph’s Hospice in London. The workshop was attended by 27 participants, including clinical staff, non-clinical staff and volunteers.

Veronika Koller (Lancaster University) introduced the project, including its background, rationale, research questions, data and use of corpus methods in combination with qualitative analysis. Zsófia Demjén (The Open University) and Elena Semino (Lancaster University) presented the findings from the project that are particularly relevant to communication between healthcare professionals and patients nearing the end of their lives. These findings include: how patients diagnosed with terminal cancer use Violence and Journey metaphors to talk about their experiences of illness and treatment; and how patients and healthcare professionals use a variety of metaphors to talk about their mutual relationships. The project team pointed out the different ‘framings’ provided by different uses of metaphor, particularly in terms of the empowerment and disempowerment of patients. They provided evidence that no metaphor is inherently good or bad for all patients, but rather suggested that different metaphors work differently for different people, or even for the same person at different times. In the final session, Veronika Koller introduced the ‘Metaphor Menu’ – a collection of metaphors used by cancer sufferers, which the team are planning to pilot as a resource for newly-diagnosed patients.

A lively discussion followed each presentation, with many members of the audience asking questions and contributing their personal and professional experiences. The workshop received very positive evaluations in anonymous feedback questionnaires: 83% of participants rated the session at 4 or 5 on a 5-point scale (where 1 corresponds to ‘Very poor’ and 5 to ‘Excellent’). Comments included: ‘Very interesting research & resonated with my experience. Food for thought!’ and ‘Will help with my area of care, will help me understand and think about what my patients and relatives are actually telling me. Will make me reflect and respond more appropriately’.

CASS visit to Ghana

From June 24th, three other members of CASS and I spent a week in Accra, Ghana, demonstrating corpus methods and our own research at two universities, the University of Ghana and the recently established Lancaster University Ghana campus in Accra. From the UK it’s a flight of just over six hours, although thankfully there is only one hour of time difference. However, travel did involve some advance preparation, with jabs for yellow fever (and a few other things), visa applications and taking anti-malarial pills for a month after the trip. Fortunately, we only encountered one mosquito during the whole trip and none of us were bitten.

Although close together, the two universities we visited have a very different feel to them: the former is a large university spread out over a lot of land, with many departments and buildings, while the latter is (at the moment) a three-storey, modern-looking grey and red building with the familiar Lancaster logo on it.


Our first trip was to the University of Ghana, where Andrew, Tony and I each gave a lecture to about 90 members of staff and students. Tony talked about the theoretical principles behind corpus linguistics, I discussed (and problematized) sex differences in the British National Corpus and Andrew showed applications of corpus linguistics to field linguistics using Corpus Workbench. The University of Ghana has some alumni members of Lancaster University and it was great to run into Clement Appah and Grace Diabah (formerly Bota) again.


Over the following two days, we gave corpus linguistics workshops, which included a two hour lab session where Andrew walked students through setting up a CQPweb account and doing some analysis of the Brown Family of corpora. I suspect this was the highlight of the day for those who attended, who were pleased to get access to many of the corpora we have at Lancaster. Each day we taught about 35 people, including some who had travelled quite long distances to get to us. Four students had driven in that morning from Cape Coast – a journey that we did some of when we went to Kakum National Park on our day off, and that took us over three hours – so we were impressed by their dedication. Tony gave an introduction to corpus linguistics and Vaclav talked about the General Service List for English words and let the students use a tool he had developed for exploring it. I ended each day with a talk on corpus linguistics and discourse analysis.


As I’d mentioned, we had a day off, where we visited Kakum National Park. This gave us an opportunity to see more of Ghana on the drive there, and then we had a great experience in the park, walking across a 350m network of rope bridges (the Kakum Canopy Walk) that were suspended high above the ground – you literally got a bird’s eye view of the tropical rainforest below. It was one of the most memorable experiences I’ve had and I think we all came away with very positive feelings about our trip, and are looking forward to our next visit to Ghana. I also hope that we managed to inspire people to incorporate some corpus linguistics methods into their own research.

Reflections from the Front Line: Sarah Russell on MELC and Twitter

Sarah Russell (Director of Education and Research, Peace Hospice Care and the Hospice of St Francis) attended this month’s Language in End-of-Life Care event, where an audience of approximately 40 healthcare professionals and researchers specialising in palliative and end-of-life care gathered to share their perspectives.

In a new blog post on eHospice, she reflects on this experience, as well as sharing some insight into a tweet chat with @WeNurses, where 128 participants came together to discuss individual experiences, symptom control, communication, recognising dying, family and patient needs, caring, and denial as a coping mechanism.

Read more to learn about Sarah’s experience, and to hear her challenge for everyone (including researchers and health care professionals) by visiting eHospice now.

‘Language in End-of-Life Care’: A user engagement event

On 8th May 2014, the main findings of the CASS-affiliated project ‘Metaphor in End-of-Life Care’ were presented to potential users of the research at the Work Foundation in central London. The event, entitled ‘Language in End-of-Life Care’, attracted an audience of approximately forty participants, consisting primarily of healthcare professionals and researchers specialising in palliative and end-of-life care. Although most participants are based in the UK, international guests joined us from Germany, the Netherlands, Spain and the US.

Professor Sheila Payne (Co-Investigator on the project and Co-Director of Lancaster’s International Observatory on End-of-Life Care) opened proceedings and acted as chair for the day’s activities. Two high-profile invited speakers shared their perspectives on communication in end-of-life care. Professor Lukas Radbruch (Chair of Palliative Medicine, University of Bonn) gave a presentation entitled ‘The search for a final sense of meaning in end-of-life discourses’. Among other things, he emphasized the influence of language and culture on perceptions and attitudes towards end of life and end-of-life care. Professor Dame Barbara Monroe (Chief Executive of St Christopher’s Hospice, London) discussed the main current challenges in hospice care in a talk entitled ‘Listening to patient and professional voices in end-of-life care’. These challenges, she argued, include those posed by a variety of linguistic and communicative barriers.


The methods, data and findings of the ‘Metaphor in End-of-Life Care’ project were introduced by four members of the team: Professor Elena Semino (Principal Investigator), Dr Veronika Koller (Co-Investigator), Dr Jane Demmen (Research Associate) and Dr Zsófia Demjén (former Research Associate, currently at the Open University). The project involves a combination of ‘manual’ and corpus-based methods to investigate the metaphors used to talk about end-of-life care in a 1.5-million-word corpus consisting of interviews with and online forum posts by terminally ill patients, family carers and health professionals. The team introduced the findings from the analysis that are particularly relevant to practitioners in end-of-life care, namely: the use of ‘violence’ and ‘journey’ metaphors by terminally ill patients, and the narratives of ‘good’ and ‘bad’ deaths told by hospice managers in semi-structured interviews. The implications of these findings for end-of-life care were suggested by the team and discussed with the audience. Participants were also invited to discuss selected uses of metaphors from the health professionals’ data, and to consider the potential value of some creative, alternative metaphors for cancer in particular.

The richness of the interactions on the day and the liveliness of the event’s hashtag on Twitter (#melc14) suggest that the event was a success. In the words of a hospice director: ‘everybody at the conference was truly inspired by the potential for change in practice and training!’ Although the funded phase of the project is coming to an end, the contacts made on the day are likely to lead to further collaborative research between the Lancaster team and healthcare professionals in the UK and beyond.