2014/15 in retrospective: Perspectives on Chinese

Looking back over the academic year as it draws to a close, one of the highlights for us here at CASS was the one-day seminar we hosted in January on Perspectives on Chinese: Talks in Honour of Richard Xiao. This event celebrated the contributions to linguistics of CASS co-investigator Dr. Richard Zhonghua Xiao, on the occasion of both his retirement in October 2014 (and simultaneous taking-up of an honorary position with the University!), and the completion of the two funded research projects which Richard has led under the aegis of CASS.

The speakers included present and former collaborators with Richard – some (including myself) from here at Lancaster, others from around the world – as well as other eminent scholars working in the areas that Richard has made his own: Chinese corpus linguistics (especially, but not only, comparative work), and the allied area of the methodologies that Richard’s work has both utilised and promulgated.

In the first presentation, Prof. Hongyin Tao of UCLA took a classic observation of corpus-based studies – the existence, and frequent occurrence, of highly predictable strings or structures – and pointed out a little-noticed aspect of these highly predictable elements: they often involve lacunae, or null elements, where some key component of the meaning is simply left unstated and assumed. An example of this is the English expression under the influence, where “the influence of what?” is often implicit, but understood to be drugs/alcohol. Tao pointed out that collocation patterns may identify the null elements, but that a simplistic application of collocation analysis may fail to yield useful results for expressions containing null elements. Finally, an extension of the analysis to yinxiang, the Chinese equivalent of influence, showed much the same tendencies – including, crucially, the importance of null elements – at work.
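Tao’s actual analysis is of course far more sophisticated, but the basic mechanics of window-based collocation counting can be sketched as follows (the toy corpus, the node word, and the window size here are my own illustrative assumptions, not his data or method):

```python
from collections import Counter

def collocates(tokens, node, window=3):
    """Count words co-occurring with `node` within +/- `window` tokens."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
                if j != i:
                    counts[tokens[j]] += 1
    return counts

# A toy corpus: note that "the influence" often appears with its
# complement ("of what?") left unstated -- the null element.
corpus = ("he was driving under the influence and was arrested "
          "she acted under the influence of alcohol "
          "the influence of the media is growing").split()

print(collocates(corpus, "influence").most_common(3))
```

Ranking such raw counts with an association measure (log-likelihood, MI, etc.) is the usual next step; the point made in the talk is that for expressions like under the influence, the most revealing “collocate” may be one that is systematically absent from the surface text.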

The following presentation came from Prof. Gu Yueguo of the Chinese Academy of Social Sciences. Gu is well-known in the field of corpus linguistics for his many projects over the years to develop not just new corpora, but also new types of corpus resources – for example, his exciting development in recent years of novel types of ontology. His presentation at the seminar was very much in this tradition, arguing for a novel type of multimodal corpus for use in the study of child language acquisition.

At this point in proceedings, I was deeply honoured to give my own presentation. One of Richard’s recently-concluded projects involved the application of Douglas Biber’s method of Multidimensional Analysis to translational English as the “Third Code”. In my talk, I presented methodological work which, together with Xianyao Hu, I have recently undertaken to assist this kind of analysis by embedding tools for the MD approach in CQPweb. A shorter version of this talk was subsequently presented at the ICAME conference in Trier at the end of May.

Prof. Xu Hai of Guangdong University of Foreign Studies gave a presentation on the study of Learner Chinese, an issue which was prominent among Richard’s concerns as director of the Lancaster University Confucius Institute. As noted above, Richard has led a project funded by the British Academy, looking at the acquisition of Mandarin Chinese as a foreign language; as a partner on that project, Xu’s presentation of a preliminary report on the Guangwai Lancaster Chinese Learner Corpus was timely indeed. This new learner corpus – already in excess of a million words in size, and consisting of a roughly 60–40 split between written and spoken materials – follows the tradition of the best learner corpora for English by sampling learners with many different national backgrounds, but also, interestingly, includes some longitudinal data. Once complete, the value of this resource for the study of L2 Chinese interlanguage will be incalculable.

The next presentation was another from colleagues of Richard here at Lancaster: Dr. Paul Rayson and Dr. Scott Piao gave a talk on the extension of the UCREL Semantic Analysis System (USAS) to Chinese. This has been accomplished by means of mapping the vast semantic lexicon originally created for English across to Chinese, initially by automatic matching, and secondarily by manual editing. Scott and Paul, with other colleagues including CASS’s Carmen Dayrell, went on to present this work – along with work on other languages – at the prestigious NAACL HLT 2015 conference, in whose proceedings a write-up has been published.

Prof. Jiajin Xu (Beijing Foreign Studies University) then made a presentation on corpus construction for Chinese. This area has, of course, been a major locus of activity by Richard over the years: his Lancaster Corpus of Mandarin Chinese (LCMC), a Mandarin match for the Brown corpus family, is one of the best openly-available linguistic resources for that language, and his ZJU Corpus of Translational Chinese (ZCTC) was a key contribution of his research on translation in Chinese. Xu’s talk presented a range of current work building on that foundation, especially the ToRCH (“Texts of Recent Chinese”) family of corpora – a planned Brown-family-style diachronic sequence of snapshot corpora in Chinese from BFSU, starting with the ToRCH2009 edition. Xu rounded out the talk with some case studies of applications for ToRCH, looking first at recent lexical change in Chinese by comparing ToRCH2009 and LCMC, and then at features of translated language in Chinese by comparing ToRCH2009 and ZCTC.

The last presentation of the day was from Dr. Vittorio Tantucci, who has recently completed his PhD at the department of Linguistics and English Language at Lancaster, and who specialises in a number of issues in cognitive linguistic analysis including intersubjectivity and evidentiality. His talk addressed specifically the Mandarin evidential marker 过 guo, and the path it took from a verb meaning ‘to get through, to pass by’ to becoming a verbal grammatical element. He argued that this exemplified a path for an evidential marker to originate from a traversative structure – a phenomenon not noted in the literature on this kind of grammaticalisation, which focuses on two other paths of development, from verbal constructions conveying a result or a completion. Vittorio’s work is extremely valuable, not only in its own right but as a demonstration of the role that corpus-based analysis, and cross-linguistic evidence, has to play in linguistic theory. Given Richard’s own work on the grammar and semantics of aspect in Chinese, a celebration of Richard’s career would not have been complete without an illustration of how this trend in current linguistics continues to develop.

All in all, the event was a magnificent tribute to Richard and his highly productive research career, and a potent reminder of how diverse his contributions to the field have actually been, and of their far-reaching impact among practitioners of Chinese corpus linguistics. The large and lively audience certainly seemed to agree with our assessment!

Our deep thanks go out to all the invited speakers, especially those who travelled long distances to attend – our speaker roster stretched from California in the west, to China in the east.

CASS Corpus Linguistics workshop at the University of Caxias do Sul (UCS, Brazil)

Last month at UCS (Brazil), the CASS Corpus Linguistics workshop found a receptive audience, who participated actively and engaged enthusiastically in the discussion. The workshop was run on 27–28 May by CASS members Elena Semino, Vaclav Brezina and Carmen Dayrell, and perfectly organised by the local committee, Heloísa Feltes and Ana Pelosi.


From left to right: Carmen Dayrell, Heloísa Feltes, Vaclav Brezina, Elena Semino, and Ana Pelosi

This workshop brought together lecturers, researchers, and PhD and MA research students from various Brazilian universities. It was a positive, invigorating experience for the CASS team and a golden opportunity to discuss the various applications of corpus linguistics methods. We would like to thank UCS for providing all the necessary conditions to make this workshop run so smoothly.

The workshop was part of a collaborative project between UK and Brazilian scholars funded by the UK’s ESRC and the Brazilian research agency CONFAP (FAPERGS) which will make use of corpus linguistics techniques to investigate the linguistic representation of urban violence in Brazil. Further details of this project can be found at http://cass.lancs.ac.uk/?page_id=1501.

Big data media analysis and the representation of urban violence in Brazil: Kick-off meeting


The first meeting of the project took place earlier this month at CASS, Lancaster. This kick-off meeting brought together the Brazilian researchers Professors Heloísa Pedroso de Moraes Feltes (UCS) and Ana Cristina Pelosi (UNISC/UFC) and the CASS team (Professors Elena Semino and Tony McEnery, and Dr Carmen Dayrell) to plan the project’s activities and discuss the next steps.

The meeting was an excellent opportunity to discuss the partners’ roles and activities in the project and to clarify how CASS can provide the Brazilian researchers with the expertise needed in a corpus investigation. A key decision towards this goal was to run a two-day Workshop in Corpus Linguistics in Brazil. This will be run by the CASS team (drawing also on the expertise of Dr Vaclav Brezina) in the last week of May.

The workshop aims to reach a wider audience, not only the Brazilian researchers’ team. It will be open to their colleagues, graduate and undergraduate students, and anyone interested in learning and using corpus linguistics methods and tools in their research.

We are all looking forward to that!

Trinity Lancaster Corpus at the International ESOL Examiner Training Conference 2015

On Friday 30th January 2015, I gave a talk at the International ESOL Examiner Training Conference 2015 in Stafford. Every year, Trinity College London, CASS’s research partner, organises a large conference for all their examiners, consisting of plenary lectures and individual training sessions. This year, I was invited to speak in front of an audience of over 300 examiners about the latest developments in the learner corpus project. For me, this was a great opportunity not only to share some of the exciting results from the early research based on this unique resource, but also to meet the Trinity examiners, many of whom have been involved in collecting the data for the corpus. This talk was therefore also an opportunity to thank everyone for their hard work and wonderful support.

It was very reassuring to see the high level of interest in the corpus project among the examiners, who have a deep insight into the examination process from their everyday professional experience. The corpus, as a body of transcripts from the Trinity spoken tests, in some way reflects this rich experience, offering an overall holistic picture of the exam and, ultimately, of L2 speech in a variety of communicative contexts.

Currently, the Trinity Lancaster Corpus consists of over 2.5 million running words sampling the speech of over 1,200 L2 speakers from eight different L1 and cultural backgrounds. The size alone makes the Trinity Lancaster Corpus the largest corpus of its kind. However, size is not all the corpus has to offer. In cooperation with Trinity (and with great help from the Trinity examiners) we were able to collect detailed background information about each speaker in our 2014 dataset. In addition, the corpus covers a range of proficiency levels (B1–C2 on the Common European Framework), which allows us to research spoken language development in a way that has not previously been possible. The Trinity Lancaster Corpus, which is still being developed with an average growth of 40,000 words a week, is an ambitious project: using this robust dataset, we can now start exploring crucial aspects of L2 speech and communicative competence, and thus help language learners, teachers and materials developers to make the process of L2 learning more efficient and also (hopefully) more enjoyable. Needless to say, without Trinity as a strong research partner, and the support of the Trinity examiners, this project wouldn’t be possible.

New CASS Briefing now available — What words are most useful for learners of English?

What words are most useful for learners of English? Introducing the New General Service List. Learning vocabulary is a complex process in which the learner needs to acquire both the form and a variety of meanings of a given vocabulary item. General vocabulary lists can assist in the process of learning words by providing common vocabulary items. In response to problems identified in the currently available General Service List, the authors decided to investigate the core English vocabulary with very large language corpora using current corpus linguistics technology.

New resources are being added regularly to the new CASS: Briefings tab above, so check back soon.

Participate in our ESRC Festival of Social Sciences “Language Matters” event online

We are very pleased to announce an event that we are live streaming on YouTube and Google+ next week. We hope you can find time to attend online*; if not, the recording will be available on YouTube afterwards.

From 17:30 to 19:00 GMT on 4 November, the ESRC Centre for Corpus Approaches to Social Science is hosting a live event in association with the ESRC Festival of Social Sciences and in tandem with our popular FutureLearn course. We would be thrilled if you could ‘tune in’ and collaborate with us during “Language Matters: Communication, Culture, and Society”.

This evening is a mini-series of four informal talks showcasing the impact of language on society. These are presented by some leading names in corpus linguistics (including the CASS Principal Investigator, Tony McEnery) and their talks draw upon the most popular themes in our corpus MOOC:

– What can corpora tell us about learning a foreign language? (with Vaclav Brezina)
– A ‘battle’, a ‘journey’, or none of these? Metaphors for cancer (with Elena Semino)
– Wolves in the wires: online abuse from people to press (with Claire Hardaker)
– Words ‘yesterday and today’ (with Tony McEnery, Claire Dembry, and Robbie Love)

Though we pride ourselves on bringing interesting, accessible material to people on the go, what really brings these events to life is the interactions that we have with attendees. That’s why we invite you to log in and contribute to the discussions taking place after each presentation.

There are two ways to attend virtually.

First, via Google Hangout if you have a Google account. Sign up at https://plus.google.com/events/ca15afbicmmeiu6d25pn1qbverg and then log in from 17:15 GMT on 4 November to greet your fellow participants.

If you don’t have a Google account, you can watch us on YouTube at https://www.youtube.com/watch?v=hF_fl95tiSk with no registration.

We’ll be taking questions from the Google Hangout and from the #corpusMOOC hashtag on Twitter (particularly for those viewing on YouTube) and mixing these in with questions from our live audience.

We hope that you can take advantage of this event by participating online.

* If you are available, located in the London area, and would like to attend in person, please visit our event website to register.

The New General Service List (new-GSL) is out

The new-GSL is an English vocabulary baseline intended for both researchers and practitioners. It is based on a robust comparison of four corpora of general English with a total size of over 12 billion words. It contains 2,494 vocabulary items, 2,116 of which belong to a stable lexical core; 378 words in the wordlist represent lexical innovations. All of these words appear with high frequencies across a large number of different contexts.
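The published methodology uses far more careful frequency and dispersion criteria, but the core/innovation split can be sketched schematically. In this sketch the threshold, the frequency figures, and the choice of “recent” corpora are all hypothetical, purely for illustration:

```python
def split_wordlist(freq, recent, threshold=100):
    """Split vocabulary into a stable core (frequent in every corpus)
    and innovations (frequent only in the recent corpora).

    freq:   {corpus_name: {word: frequency per million words}}
    recent: set of corpus names treated as 'recent'
    """
    vocab = set().union(*freq.values())

    def frequent(word, names):
        return all(freq[n].get(word, 0) >= threshold for n in names)

    older = [n for n in freq if n not in recent]
    core = {w for w in vocab if frequent(w, freq)}
    innovations = {w for w in vocab
                   if frequent(w, recent) and not frequent(w, older)}
    return core, innovations

# Hypothetical per-million frequencies in four corpora:
freq = {
    "LOB":        {"time": 1800, "wireless": 120},
    "BNC":        {"time": 1700, "wireless": 40, "online": 30},
    "BE06":       {"time": 1750, "online": 200},
    "EnTenTen12": {"time": 1600, "online": 350},
}
core, innovations = split_wordlist(freq, recent={"BE06", "EnTenTen12"})
print(core, innovations)  # 'time' is stable core; 'online' is an innovation
```

The design choice illustrated here is the key one behind the list: membership is decided by behaviour across several independent corpora, not by frequency in any single corpus.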

The article, which describes the methodology of the wordlist compilation, and the full new-GSL are available open access from the Applied Linguistics website.

Further research

At the moment, we are working on an American supplement to the new-GSL. Our findings show that there is a surprisingly large overlap between frequent lexical items in British and American corpora. With some modifications, the new-GSL can therefore also be used successfully in American English contexts.

A larger question that the new-GSL raises, however, is how we reconcile our intuitions about important vocabulary items with corpus-based findings. In this respect, the new-GSL is not a prescriptive but a descriptive wordlist. As we stress in the article, “[w]ith respect to the diversity of ESL/EFL contexts, it is deemed more useful to envision the use of our wordlist as a vocabulary base with the possibility of further additions, rather than a wordlist that strives to cater to a mixed cluster of heterogeneous expectations and needs” (p. 19).

Read more:

Brezina, V. and Gablasova, D. (2013) ‘Is There a Core General Vocabulary? Introducing the New General Service List’. Applied Linguistics.

Vocabulary wordlists designed for learners: Development of the new-GSL

Imagine you have just started learning a new foreign language. Which words do you need to learn first? We all might have some intuitions about this. If the language is English, then time – the most frequent noun both in speech and writing – will probably be more useful than, say, the adjective temporaneous (yes, the OED records this word). However, intuitions (as corpus linguists know) are not to be trusted (at least not all the time). Only through the analysis of large amounts of textual data (yes, language corpora!) will we be able to identify words that occur frequently across a number of different contexts.
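The “across a number of different contexts” part matters as much as raw frequency. A crude way to capture it is range dispersion, the fraction of corpus parts in which a word occurs at least once; the actual wordlist research relies on more robust dispersion measures, and the toy data below is purely illustrative:

```python
def range_dispersion(parts, word):
    """Fraction of corpus parts containing `word` at least once --
    a crude dispersion measure."""
    return sum(word in part for part in parts) / len(parts)

# Three toy corpus parts (e.g. text samples), tokenised into sets:
parts = [set("the time is now".split()),
         set("time flies when you read".split()),
         set("no news today".split())]

print(range_dispersion(parts, "time"))  # occurs in 2 of 3 parts
```

A word with high frequency but low dispersion (think of a technical term piled up in one text) would score poorly here, which is exactly the behaviour a learner-oriented wordlist wants to screen out.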

The research Dana and I are going to present on Thursday looks at the methodology of creating a pedagogical wordlist – the new-GSL (the old one is now really out of date) – which can assist both learners and teachers in the process of acquiring basic English vocabulary. We’ll be looking at the ways in which both large (BNC, EnTenTen12) and small corpora (LOB, BE06) can be used in the creation of such a wordlist.