2014/15 in retrospective: Perspectives on Chinese

Looking back over the academic year as it draws to a close, one of the highlights for us here at CASS was the one-day seminar we hosted in January on Perspectives on Chinese: Talks in Honour of Richard Xiao. This event celebrated the contributions to linguistics of CASS co-investigator Dr. Richard Zhonghua Xiao, on the occasion of both his retirement in October 2014 (and simultaneous taking-up of an honorary position with the University!), and the completion of the two funded research projects which Richard has led under the aegis of CASS.

The speakers included present and former collaborators with Richard – some (including myself) from here at Lancaster, others from around the world – as well as other eminent scholars working in the areas that Richard has made his own: Chinese corpus linguistics (especially, but not only, comparative work), and the allied area of the methodologies that Richard’s work has both utilised and promulgated.

In the first presentation, Prof. Hongyin Tao of UCLA took a classic observation of corpus-based studies – the existence, and frequent occurrence, of highly predictable strings or structures – and pointed out a little-noticed aspect of these highly predictable elements: they often involve lacunae, or null elements, where some key component of the meaning is simply left unstated and assumed. An example of this is the English expression under the influence, where “the influence of what?” is often implicit, but understood to be drugs/alcohol. It was pointed out that collocation patterns may identify the null elements, but that a simplistic application of collocation analysis may fail to yield useful results for expressions containing null elements. Finally, an extension of the analysis to yinxiang, the Chinese equivalent of influence, showed much the same tendencies – including, crucially, the importance of null elements – at work.

The following presentation came from Prof. Gu Yueguo of the Chinese Academy of Social Sciences. Gu is well-known in the field of corpus linguistics for his many projects over the years to develop not just new corpora, but also new types of corpus resources – for example, his exciting development in recent years of novel types of ontology. His presentation at the seminar was very much in this tradition, arguing for a novel type of multimodal corpus for use in the study of child language acquisition.

At this point in proceedings, I was deeply honoured to give my own presentation. One of Richard’s recently-concluded projects involved the application of Douglas Biber’s method of Multidimensional Analysis to translational English as the “Third Code”. In my talk, I presented methodological work which, together with Xianyao Hu, I have recently undertaken to assist this kind of analysis by embedding tools for the MD approach in CQPweb. A shorter version of this talk was subsequently presented at the ICAME conference in Trier at the end of May.

Prof. Xu Hai of Guangdong University of Foreign Studies gave a presentation on the study of Learner Chinese, an issue which was prominent among Richard’s concerns as director of the Lancaster University Confucius Institute. As noted above, Richard has led a project funded by the British Academy, looking at the acquisition of Mandarin Chinese as a foreign language; as a partner on that project, Xu’s presentation of a preliminary report on the Guangwai Lancaster Chinese Learner Corpus was timely indeed. This new learner corpus – already in excess of a million words in size, and consisting of a roughly 60-40 split between written and spoken materials – follows the tradition of the best learner corpora for English by sampling learners with many different national backgrounds, but also, interestingly, includes some longitudinal data. Once complete, the value of this resource for the study of L2 Chinese interlanguage will be incalculable.

The next presentation was another from colleagues of Richard here at Lancaster: Dr. Paul Rayson and Dr. Scott Piao gave a talk on the extension of the UCREL Semantic Analysis System (USAS) to Chinese. This has been accomplished by means of mapping the vast semantic lexicon originally created for English across to Chinese, initially by automatic matching, and secondarily by manual editing. Scott and Paul, with other colleagues including CASS’s Carmen Dayrell, went on to present this work – along with work on other languages – at the prestigious NAACL HLT 2015 conference, in whose proceedings a write-up has been published.

Prof. Jiajin Xu (Beijing Foreign Studies University) then made a presentation on corpus construction for Chinese. This area has, of course, been a major locus of activity by Richard over the years: his Lancaster Corpus of Mandarin Chinese (LCMC), a Mandarin match for the Brown corpus family, is one of the best openly-available linguistic resources for that language, and his ZJU Corpus of Translational Chinese (ZCTC) was a key contribution of his research on translation in Chinese. Xu’s talk presented a range of current work building on that foundation, especially the ToRCH (“Texts of Recent Chinese”) family of corpora – a planned Brown-family-style diachronic sequence of snapshot corpora in Chinese from BFSU, starting with the ToRCH2009 edition. Xu rounded out the talk with some case studies of applications for ToRCH, looking first at recent lexical change in Chinese by comparing ToRCH2009 and LCMC, and then at features of translated language in Chinese by comparing ToRCH2009 and ZCTC.

The last presentation of the day was from Dr. Vittorio Tantucci, who has recently completed his PhD at the department of Linguistics and English Language at Lancaster, and who specialises in a number of issues in cognitive linguistic analysis including intersubjectivity and evidentiality. His talk addressed specifically the Mandarin evidential marker 过 guo, and the path it took from a verb meaning ‘to get through, to pass by’ to becoming a verbal grammatical element. He argued that this exemplified a path for an evidential marker to originate from a traversative structure – a phenomenon not noted in the literature on this kind of grammaticalisation, which focuses on two other paths of development, from verbal constructions conveying a result or a completion. Vittorio’s work is extremely valuable, not only in its own right but as a demonstration of the role that corpus-based analysis, and cross-linguistic evidence, has to play in linguistic theory. Given Richard’s own work on the grammar and semantics of aspect in Chinese, a celebration of Richard’s career would not have been complete without an illustration of how this trend in current linguistics continues to develop.

All in all, the event was a magnificent tribute to Richard and his highly productive research career, and a potent reminder of how diverse his contributions to the field have actually been, and of their far-reaching impact among practitioners of Chinese corpus linguistics. The large and lively audience certainly seemed to agree with our assessment!

Our deep thanks go out to all the invited speakers, especially those who travelled long distances to attend – our speaker roster stretched from California in the west, to China in the east.

Coming to CASS to code: The first two months


After working at Waseda University in Japan for exactly 10 years, I was granted a one-year sabbatical in 2014 to concentrate on my corpus linguistics research. As my first choice of destination was Lancaster University, I was overjoyed to hear from Tony McEnery that the Centre for Corpus Approaches to Social Science (CASS) would be able to offer me office space and access to some of the best corpus resources in the world. I have now been at CASS for two months and thought this would be a good time to report on my experience here to date.

Since arriving at CASS, I have been working on several projects. My main project here is the development of a new database architecture that will allow AntConc, my freeware corpus analysis toolkit, to process very large corpora in a fast and resource-light way. The strong connection between applied linguistics and computer science at Lancaster has allowed me to work closely with some excellent computer science faculty and graduate students, including Paul Rayson, John Mariani, Stephen Wattam, and John Vidler. We just presented our first results at LREC 2014 in Reykjavik.

I’ve also been working closely with the CASS members, including Amanda Potts and Robbie Love, to develop a set of ‘mini’ corpus tools to help with the collection, cleaning, and processing of corpora. I have now released VariAnt, which is a tool that finds spelling variants in a corpus, and SarAnt, which allows multiple search-and-replace functions to be carried out in a corpus as a batch process. I am also just about to release TagAnt, which will finally give corpus linguists a simple and intuitive interface to popular freeware Part-Of-Speech (POS) tagging tools such as TreeTagger. I am hoping to develop more of these tools to help corpus linguists in CASS and around the world with the complex and time-consuming tasks that they have to perform each day.

I always expected that I would enjoy the time at Lancaster, but did not anticipate that I would enjoy it as much as I am. Lancaster University has a great campus, the research facilities are some of the best in the world, the CASS members have treated me like family since the day I arrived, and even the weather has been kind to me, with sunny days throughout April and May. I look forward to writing more about my projects here at CASS.

More about the Metaphor in End of Life Care project at Lancaster University

The CASS-affiliated Metaphor in End of Life Care project has just released a free resource containing information of interest to many of our readers. Download the document now to learn more about the project, from basic concepts (what are metaphors, and how are they used in everyday life?) to more specific details (why study metaphor in end-of-life care?). Some interesting initial findings are also included. For instance, “Family carers often say that their emotions can only be safely ‘released’ when talking to people who are ‘in the same boat’.” Read on to learn more about the project.

Politeness and impoliteness in digital communication: Corpus-related explorations

Post-event review of the one-day workshop at Lancaster University

Topics don’t come much hotter than the forms of impoliteness or aggression that are associated with digital communication – flaming, trolling, cyberbullying, and so on. Yet academia has done surprisingly little to pull together experts in social interaction (especially (im)politeness) and experts in the new media, let alone experts in corpus-related work. That is, until last Friday, when the Corpus Approaches to Social Science Centre (@CorpusSocialSci) invited fifteen such people from diverse backgrounds (from law to psychology) to gather for an intense one-day workshop.


The scope of the workshop was broad. One cannot very well study impoliteness without considering politeness, since merely failing to be polite in a particular context could be taken as impoliteness. Similarly, the range of digital communication types – email, blogs, texts, tweets and so on – presents a varied terrain to navigate. And then there are plenty of corpus-related approaches and notions, including collocation, keywords, word sketches, etc.

Andrew Kehoe (@ayjaykay), Ursula Lutzky (@UrsulaLutzky) and Matt Gee (@mattbgee) kicked off the day with a talk on swearwords and swearing, based on their 628-million-word Birmingham Blog Corpus. Amongst other things, they showed how internet swearword/profanity filters would work rather better if they incorporated notions like collocation. For example, knowing the words that typically accompany items like balls and tart can help disambiguate neutral usages (e.g. “tennis balls”, “lemon tart”) from less salubrious usages! (See more research from Andrew here, from Ursula here, and from Matt here.)
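The idea of a collocation-aware filter can be sketched in a few lines of Python. This is only an illustration of the general principle, not the speakers' method: the collocate lists below are invented for the example rather than derived from the Birmingham Blog Corpus, and the function name `flag_tokens` and the two-word window are assumptions.

```python
# Hypothetical neutral collocates; a real filter would derive these from corpus data.
NEUTRAL_COLLOCATES = {
    "balls": {"tennis", "golf", "cricket"},
    "tart": {"lemon", "apple", "treacle"},
}

def flag_tokens(tokens, window=2):
    """Return the positions of ambiguous words that lack a neutral collocate nearby."""
    flagged = []
    for i, tok in enumerate(tokens):
        word = tok.lower()
        if word not in NEUTRAL_COLLOCATES:
            continue
        # Look at up to `window` words on either side of the ambiguous item.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if not any(c.lower() in NEUTRAL_COLLOCATES[word] for c in context):
            flagged.append(i)
    return flagged

neutral = flag_tokens("she bought tennis balls".split())
suspect = flag_tokens("what a load of balls".split())
```

Here the first sentence passes untouched because tennis sits next to balls, while the second is flagged for review because no neutral collocate appears in the window.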

With Ruth Page’s (@ruthtweetpage) presentation came a switch from blogs to Twitter. Using corpus-related techniques, Ruth revealed the characteristics of corporate tweets. Given that the word sorry turns out to be the seventh most characteristic word, or keyword, for corporate tweets, it was not surprising that Ruth focused on apologies. She showed that corporate tweets tend to avoid stating a problem or giving an explanation (thus avoiding damage to their reputation), but are accompanied by offers of repair and attempts to build – at least superficially – rapport. (See more research from Ruth here.)
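For readers unfamiliar with keywords in the corpus sense, one widely used way of ranking them is the log-likelihood statistic, which compares a word's frequency in a target corpus against a reference corpus. The sketch below is a generic illustration under that assumption; the post does not say which keyness measure was used in Ruth's study, and the function name and figures are hypothetical.

```python
import math

def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of one word across a target and a reference corpus."""
    total = size_target + size_ref
    combined = freq_target + freq_ref
    # Frequencies expected if the word were equally common in both corpora.
    expected_target = size_target * combined / total
    expected_ref = size_ref * combined / total
    ll = 0.0
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Hypothetical counts: a word seen 50 times per 1,000 tokens in the target
# corpus against 10 times per 1,000 tokens in the reference corpus.
score = log_likelihood(50, 1000, 10, 1000)
```

A word with the same relative frequency in both corpora scores zero; the higher the score, the more characteristic the word is of one corpus relative to the other.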

Last of the morning was Caroline Tagg’s (@carotagg) presentation, and with this came another shift in medium, from Twitter to text messages. Focusing on convention and creativity, Caroline pointed out that, contrary to popular opinion, heavily abbreviated messages are not in fact the norm, and that when abbreviations do occur, they are often driven by communicative needs, e.g. using creativity to foster interest and engagement. Surveying the functions of texts, Caroline established that maintenance of friendship is key. And corpus-related techniques revealed the supporting evidence: politeness formulae were particularly frequent, including the salutation have a good one, the hedge a bit for the invitation, and for further contact, give us a bell. (See more research from Caroline here.)

With participants refuelled by lunch, Claire Hardaker (@DrClaireH) and I presented a smorgasbord of relevant issues. As an opening shot, we displayed frequencies showing that the stereotypical emblems of British politeness, words such as please, thank you, sorry, excuse me, can you X, tend not to be frequent in any digital media variety, relative to spoken conversation (as represented in the British National Corpus). Perhaps this accounts for why at least some sectors of the British public find digital media barren of politeness. This is not to say that politeness does not take place, but it seems to take place through different means – consider the list of politeness items derived by Caroline above. And there was an exception: sorry was the only item that occurred with greater frequency in some digital media. This, of course, nicely ties in with Ruth’s focus on apologies. The bulk of my and Claire’s presentation revolved around using corpus techniques to help establish: (1) definitions (e.g. what is trolling?), (2) strategies and formulae (e.g. what is the linguistic substance of trolling?) and (3) evaluations (e.g. what or who is considered rude?). Importantly, we showed that corpus-related approaches are not just lists of numbers, but can integrate qualitative analyses. (See more research from me here, and from Claire here.)

With encroaching presentation fatigue, the group decamped to a computer lab. Paul Rayson (@perayson) introduced some corpus tools, notably WMatrix, of which he is the architect. Amanda Potts (@watchedpotts) then put everybody through their paces – gently of course! – giving everybody the opportunity of valuable hands-on experience.

Back in our discussion room and refreshed by various caffeinated beverages, we spent an hour reflecting on a range of issues. The conversation moved towards corpora that include annotations (interpretative information). Such annotations could be a way of helping to analyse images, context, etc., creating an incredibly rich dataset that could only be interrogated by computer (see here, for instance). I noted that this end of corpus work was not far removed from using Atlas or Nudist. Snapchat came up in discussion, not only because it involves images (though they can include text), but also because it raises issues of data accessibility (how do you get hold of a record of this communication, if one of its essential features is that it dissolves within a narrow timeframe?). The thorny problem of ethics was discussed (e.g. data being used in ways that were not signaled when original user agreements were completed).

Though exhausting, it was a hugely rewarding and enjoyable day. Often those rewards came in the form of vibrant contributions from each and every participant. Darren Reed, for example, pointed out that sometimes what we were dealing with is neither digital text nor digital image, but a digital act. Retweeting somebody, for example, could be taken as a “tweet act” with politeness implications.

Official launch of the ESRC Centre for Corpus Approaches to Social Science

The official opening of the £4.1 million ESRC Centre for Corpus Approaches to Social Science (CASS) took place on Tuesday, 23 July 2013, at the start of the seventh international Corpus Linguistics 2013 conference attended by more than 300 delegates. Delegates representing dozens of universities around the world convened with civil servants to honour the past, promote the present, and celebrate the future of corpus methods in the social sciences.

Former Home Secretary Charles Clarke was among several special guests at the launch event including representatives from the Ministry of Justice, the Home Office and the Environment Agency. Mr. Clarke said a few words to the audience of scholars and other users of research, stressing the importance of investigating language in the context of society, as well as continuing to foster and nurture interdisciplinary collaborative links in social science research.

With such a large and influential crowd gathered, we took the opportunity to showcase a variety of new and exciting research featuring corpus methods applied to the social sciences to a wide network of people. A range of researchers from Lancaster and much further afield were invited to give poster presentations highlighting their current work, which offers a variety of exciting contributions ranging from methodological advances to increased social understanding, and greater emphasis on interdisciplinarity in academia. Poster presenters included Mike Scott, Alan Partington, Ute Römer, Kevin Harvey, Elena Semino, Veronika Koller, Ramesh Krishnamurthy, Alison Sealey, Andrew Salway, Paul Rayson, Steve Young, Jonathan Culpeper, Paul Baker, Rachelle Vessey, Charlotte Taylor, Anna Marchi, Catherine Chorley, Costas Gabrielatos, and Robbie Love. The posters proved great fodder for stimulating conversation about the future potentials of corpus linguistics and corpus approaches to social science.


Corporate Financial Information Environment (CFIE)

The UK financial sector is a major driver of economic activity and transparent and effective financial communication is a key determinant of its success. Audited financial statements, unaudited corporate disclosures, and information signalled through corporate financial choices are the primary ways that firms communicate with capital market participants. These mechanisms, together with information from analysts, financial journalists, rating agencies and other market commentators external to the firm combine to form the Corporate Financial Information Environment (CFIE).

Narrative disclosures represent a large part of firms’ overall financial communications with investors. We will study the causes and consequences of corporate disclosure and financial reporting outcomes. While a considerable body of research exists on financial narratives, it has been limited by the methods used for measuring the characteristics of such disclosures. In particular, the need to hand-collect relevant data from firms’ annual reports and the subjectivity of textual scoring methods have restricted progress. Recent advances in computing and linguistics provide a basis for undertaking more sophisticated analyses.

This project brings together a multidisciplinary team with the aim of developing statistical and computer-based techniques for measuring the properties of UK corporate disclosures. In particular, we will develop new ways of measuring the quality and tone of company narratives using computer-based rankings of annual reports. Both the rankings and the linguistic techniques on which these rankings are based will be made available to those seeking information on corporate disclosure policy or wishing to undertake their own analysis of specific narrative statements. We will also use the findings from our analysis as the basis for studying how managers communicate expectations of firm performance to investors and how they seek to manipulate investors’ impressions of reported results.

The project is expected to yield important insights for business policy makers, accounting standard setting bodies and financial market information regulators. We also expect equity market participants including investors, investment analysts, finance directors, auditors, and firm officials to benefit from the research.

For more information, visit the project webpage.


Recent news associated with this project:

  • Corporate Financial Information Environment (CFIE) (9 July 2013)
