Morphological complexity: How is grammar acquired and how do we measure this?

Vaclav Brezina and Gabriele Pallotti

Inflectional morphology has to do with how words change their form to express grammatical meaning. It plays an important role in a number of languages. In these languages, the patterns of word change may for example indicate number and case on nouns, or past, present and future tense on verbs. For example, to express the past participle in German we regularly add the prefix ge- and optionally modify the base. Ich gehe [I go/walk] thus becomes Ich bin gegangen [I have walked].  English also inflects words (e.g. walk – walks – walking – walked; drive – drove – driven) but the range of inflected forms is narrower than in many other languages. The range of morphological forms in a text can be seen as its morphological complexity. Simply put, it is an indicator of the morphological variety of a text, i.e. how many changes to the dictionary forms of the words are manifested in the text.
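To make the idea of morphological variety concrete, here is a minimal sketch that counts how many distinct changes to dictionary forms occur in a list of words. It is a deliberately crude illustration and not the procedure used by the Morphological complexity tool: the `exponence` labelling scheme and both function names are assumptions made for this example, and the input is a hand-made list of (dictionary form, inflected form) pairs rather than the output of a real tagger.

```python
from os.path import commonprefix


def exponence(lemma: str, form: str) -> str:
    """Crude label for how `form` differs from its dictionary form `lemma`.

    Hypothetical scheme: strip the shared prefix and record what was
    removed from the lemma and what was added to produce the form.
    """
    stem = commonprefix([lemma, form])
    removed, added = lemma[len(stem):], form[len(stem):]
    if not removed and not added:
        return "0"  # the form is identical to the dictionary form
    if not removed:
        return f"+{added}"
    return f"-{removed}+{added}"


def morphological_variety(pairs) -> int:
    """Number of distinct exponence labels found in (lemma, form) pairs."""
    return len({exponence(lemma, form) for lemma, form in pairs})


# Toy input using the English examples from the text above.
pairs = [
    ("walk", "walks"),
    ("walk", "walked"),
    ("talk", "talked"),
    ("drive", "drove"),
    ("walk", "walk"),
]
variety = morphological_variety(pairs)
```

In this toy input, walked and talked share one label, so repeating the same kind of change does not increase the count; only new kinds of change do.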

To find out more about morphological complexity, how it can be measured and how L2 speakers acquire it, you can read:

Gabriele Pallotti and I have been working together to investigate the construct and to develop a tool that can analyse the morphological complexity of texts. So far, the tool has been implemented for English, Italian and German verbal morphology. Currently, together with Michael Gauthier from Université Lyon, we are implementing the morphological complexity measure for French verbs.

To analyse a text in the Morphological complexity tool, copy and paste the text into the text box, select the appropriate language and press ‘Analyse text now’ (Fig. 1).

Figure 1. Morphological tool: Interface

The tool outputs the results of a linguistic analysis that highlights all verbs and nouns in the text and identifies morphological changes (exponences). When you click the ‘Calculate MCI’ button, the tool also automatically calculates the Morphological Complexity Index (MCI); see Fig. 2.

Figure 2. Morphological tool output: Selected parts

 

Chinese Applied Corpus Linguistics Symposium

On Friday 29th April 2016, Lancaster University hosted a symposium which brought together researchers and practitioners interested in Chinese linguistics and the corpus method. The symposium was supported by the British Academy (International Mobility and Partnership Scheme IPM 2013) and was hosted by the ESRC Centre for Corpus Approaches to Social Science (CASS). The symposium introduced the Guangwai-Lancaster Chinese Learner Corpus, a 1.2-million-word corpus of spoken and written L2 Chinese produced by learners of Chinese at different proficiency levels; the corpus was built as part of a collaboration between Guangdong University of Foreign Studies (Prof. Hai Xu and his team) and Lancaster University. The project was initiated by Richard Xiao, who also obtained the funding from the British Academy. Richard’s vision to bring corpus linguistics to the analysis of L2 Chinese (both spoken and written) is now coming to fruition with the final stages of the project and the public release of the corpus planned for the end of this year.

The symposium showcased different areas of Chinese linguistics research through presentations by researchers from Lancaster and other UK universities (Coventry, Essex), with the topics ranging from the use of corpora as resources in the foreign language classroom to a cross-cultural comparison of performance evaluation in concert reviews, second language semasiology, and CQPweb as a tool for Chinese corpus data. As part of the symposium, the participants were also given an opportunity to search the Guangwai-Lancaster Chinese Learner Corpus and explore different features of the dataset. At the end of the symposium, we discussed the applications of corpus linguistics in Chinese language learning and teaching and the future of the field.

Thanks are due to the presenters and all participants for joining the symposium and for very engaging presentations and discussions. The following snapshots summarise the presentations; links to the slides are available below the images.


 

Hai Xu (Guangdong University of Foreign Studies): Guangwai-Lancaster Chinese Learner Corpus: A profile (via video conferencing from Guangzhou)

Simon Smith (Coventry University): 语料酷!Corpora and online resources in the Mandarin classroom

Fong Wa Ha (University of Essex): A cross-cultural comparison of evaluation between concert reviews in Hong Kong and British newspapers

Vittorio Tantucci (Lancaster University): Second language semasiology (SLS): The case of the Mandarin sentence final particle 吧 ba

Andrew Hardie (Lancaster University): Using CQPweb to analyse Chinese corpus data

Vaclav Brezina (Lancaster University): Practical demonstration of the Guangwai-Lancaster Chinese Learner Corpus followed by a general discussion

Clare Wright: Using Learner Corpora to analyse task effects on L2 oral interlanguage in English-Mandarin bilinguals


 

 

 

Syntactic structures in the Trinity Lancaster Corpus

We are proud to announce a collaboration with Markus Dickinson and Paul Richards from the Department of Linguistics, Indiana University, on a project that will analyse syntactic structures in the Trinity Lancaster Corpus. The focus of the project is to develop a syntactic annotation scheme for spoken learner language and apply this scheme to the Trinity Lancaster Corpus, which is being compiled at Lancaster University in collaboration with Trinity College London. The aim of the project is to provide an annotation layer for the corpus that will allow sophisticated exploration of the morphosyntactic and syntactic structures in learner speech. The project will have an impact both on the theoretical understanding of spoken language production at different proficiency levels and on the development of practical NLP solutions for the annotation of learner speech. More specific goals include:

  • Identification of units of spoken production and their automatic recognition.
  • Annotation and visualization of morphosyntactic and syntactic structures in learner speech.
  • Contribution to the development of syntactic complexity measures for learner speech.
  • Description of the syntactic development of spoken learner production.

 

Trinity Lancaster Corpus at the International ESOL Examiner Training Conference 2015

On Friday 30th January 2015, I gave a talk at the International ESOL Examiner Training Conference 2015 in Stafford. Every year, Trinity College London, CASS’s research partner, organises a large conference for all its examiners, which consists of plenary lectures and individual training sessions. This year, I was invited to speak in front of an audience of over 300 examiners about the latest developments in the learner corpus project. For me, this was a great opportunity not only to share some of the exciting results from the early research based on this unique resource, but also to meet the Trinity examiners, many of whom have been involved in collecting the data for the corpus. This talk was therefore also an opportunity to thank everyone for their hard work and wonderful support.

It was very reassuring to see the high level of interest in the corpus project among the examiners, who have a deep insight into the examination process from their everyday professional experience. The corpus, as a body of transcripts from the Trinity spoken tests, in some way reflects this rich experience, offering a holistic picture of the exam and, ultimately, of L2 speech in a variety of communicative contexts.

Currently, the Trinity Lancaster Corpus consists of over 2.5 million running words sampling the speech of over 1,200 L2 speakers from eight different L1 and cultural backgrounds. Its size alone makes the Trinity Lancaster Corpus the largest corpus of its kind. However, size is not the only thing the corpus has to offer. In cooperation with Trinity (and with great help from the Trinity examiners), we were able to collect detailed background information about each speaker in our 2014 dataset. In addition, the corpus covers a range of proficiency levels (B1–C2 levels of the Common European Framework), which allows us to research spoken language development in a way that has not previously been possible. The Trinity Lancaster Corpus, which is still being developed with an average growth of 40,000 words a week, is an ambitious project. Using this robust dataset, we can now start exploring crucial aspects of L2 speech and communicative competence and thus help language learners, teachers and material developers to make the process of L2 learning more efficient and also (hopefully) more enjoyable. Needless to say, without Trinity as a strong research partner and the support of the Trinity examiners, this project wouldn’t be possible.

Trinity Lancaster Spoken Learner Corpus: A milestone to celebrate

On Monday 19 May we came together to celebrate the completion of the first part of the Trinity Lancaster Spoken Learner Corpus project. The transcription of our 2012 dataset is now complete and the corpus comprises 1.5 million running words. The Trinity Lancaster Spoken Learner Corpus represents a balanced sample of learner speech from six different countries (Italy, Spain, Mexico, India, China and Sri Lanka) covering the B1.2 – C2 levels of the Common European Framework (CEFR). Below are some pictures from our small celebration.

[Three photographs from the celebration]

We are continuing to develop the corpus by adding data from our 2014 dataset, so there is still a lot of work to be done. However, we are really excited about the possibilities for applied linguistic and language testing research based on this unique dataset.

You can read more about the Trinity Lancaster Spoken Learner Corpus in the AEA-Europe newsletter report.

Is this the way to do Corpus Linguistics? Feedback system for the Corpus Linguistics MOOC

Corpus linguistics (CL) is a set of incredibly versatile methods of language analysis applicable to a number of different contexts. So, for example, if you are interested in language, culture, history or society, corpus linguistics has something to offer. Today, thanks to the amazing developments in computer technology, corpus linguistic tools are literally only a mouse click away (or a touch away, if you are using a tablet or a smartphone). Are you ready, then, to get your hands dirty with computational analysis of large amounts of language? If the answer is yes, you have probably already registered for the new massive open online course (MOOC) on Corpus Linguistics, created and run by Tony McEnery and other members of the CASS team. (If you haven’t managed to register yet, you can still do so at the FutureLearn website. The course kicks off on 27th January 2014.)

An essential part of the Corpus Linguistics MOOC is its unique feedback system. You will be given a question, a data set and a software tool, and you will be asked to apply what you have learnt in the MOOC lectures to real language analysis. You will explore a topic using corpus techniques which will enable you to uncover interesting patterns in language data. We have a range of topics in store for you. These include English grammar, British and American language and culture, historical discourse of 17th century news books and learner language. But don’t worry, we won’t ask you to write an essay on the topic. Instead, we will give you a number of analyses and descriptions of the corpus data and you will decide which ones use the corpus techniques correctly. After you’ve made your decisions we will provide detailed comments on each of the options. In this way, the CASS Corpus Linguistics MOOC system aims to promote independent learning so that next time you can apply the corpus tools with confidence to answer your own questions.

Centre Vacancy: Senior Research Associate in Corpus Linguistics

Post A793

Linguistics & English Language
Salary: £31,331 to £36,298
Closing Date: Sunday 20 October 2013
Interview Date: To be confirmed
Reference: A793

The Centre for Corpus Approaches to Social Science, funded by the ESRC, is seeking to appoint to a 4-year research contract in the area of corpus linguistics and language teaching/assessment.

You will pursue research on developing new approaches to the use of corpus linguistics within language teaching and assessment in conjunction with Trinity College London. A knowledge of corpus linguistics is essential, as well as some familiarity with language teaching and/or assessment.

You will join an interdisciplinary team of internationally renowned researchers within CASS. You will be offered excellent career progression opportunities through the ESRC Centre.

This is a fixed term contract for 4 years.

Informal enquiries may be made to Dr. Vaclav Brezina: brezina(Replace this parenthesis with the @ sign)exchange.lancs.ac.uk

Apply through the Lancaster University website. 

“It’s all sex and celebrity now”: Page three corpus linguistics

On Monday (16th October), readers of the Daily Mail could come across a short article on page three about changes in the English lexicon, with the title “Forget supper and soup… it’s all sex and celebrity now”. (A longer version of the article is available online.) The article quoted some data from the New General Service List (new-GSL) and compared them with Bauman and Culligan’s version of West’s GSL. Bauman and Culligan offer a list of words from West’s GSL (1953) combined with word frequency rankings based on the Brown Corpus (1961).

It comes as no surprise that the word ranks in Bauman and Culligan’s version of the GSL differ from the ranks in the new-GSL. This may be due not only to the time factor, but also to the composition of the source corpora. The new-GSL is a wordlist based on four different language corpora (three British English corpora and one corpus representing the language of the internet) with a total size of over 12 billion words; Bauman and Culligan’s counts, on the other hand, rely on a single one-million-word corpus of American English compiled in the 1960s. The comparisons in the Daily Mail therefore need to be interpreted with caution. In particular, the following points should be considered:

  • Language changes, and there is no doubt that over time some words become more popular and other words fall out of fashion. The new-GSL lists 378 lexical innovations, including words such as Internet, website, online, email, network, client, mobile, file and web.
  • On the other hand, the research shows that there is a large stable lexical core (2,116 items in the new-GSL) including frequent nouns, verbs and adjectives such as time, year, people, way, say, make, take, go, good, new, great and same.
  • In order to interpret the social significance of lexical changes, we need to look at the contexts in which different words appear. A good example of this is the word “sex” quoted in the headline of the Daily Mail article.

Let’s talk about “sex”, shall we?

The word sex is polysemous and can mean either physical activity or biological dimorphism (male or female). I suppose the phrase “it’s all sex now” in the headline of the Daily Mail article alludes to the former meaning of the word sex, because the fact that we talk about males and females (the latter meaning of the word) does not sell newspapers. Let’s have a look at some corpus evidence.

| | Brown (1961), American writing | EnTenTen12 (2012), Internet language |
|---|---|---|
| Form “sex” per million words | 82.7 | 86.6 |
| Sex as activity | 75% | 90%* |
| Sex as dimorphism | 25% | 10%* |

*Based on a random sample of 250 lines

A quick comparison of the evidence in the Brown Corpus (which the Daily Mail uses as the point of departure) and the EnTenTen12 internet corpus (one of the sources of the new-GSL) shows that the frequencies per million words do not differ very much. There is a difference, however, in the proportions of the two meanings (sex as activity and sex as dimorphism), which can be explained by the difference in the genres sampled in the two corpora. In contrast to the Brown Corpus, EnTenTen12 also includes pornography (as you would expect from an internet-based corpus). This is reflected in some of the prominent collocates of the word “sex” in EnTenTen12, such as oral, anal, hardcore, gay, lesbian and toy. However, the fact that “it’s all sex now” (as the Daily Mail puts it) has an even simpler and more prosaic explanation: when compiling the original wordlist, Michael West very likely decided to exclude the term “sex” as something that does not need to be mentioned in the classroom context.
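For readers who want to run this kind of comparison themselves, the two calculations involved are simple. The sketch below shows how a raw count is normalised to a frequency per million words and how the shares of each meaning are computed from a manually coded sample of concordance lines. The function names and all numbers in the example are hypothetical and chosen only for illustration; they are not the raw counts behind the figures reported above.

```python
from collections import Counter


def per_million(raw_count: int, corpus_size: int) -> float:
    """Normalise a raw frequency to occurrences per million words."""
    return raw_count * 1_000_000 / corpus_size


def sense_shares(coded_lines) -> dict:
    """Percentage of each sense label in a manually coded sample."""
    total = len(coded_lines)
    return {sense: 100 * n / total for sense, n in Counter(coded_lines).items()}


# Hypothetical example: 50 hits in a corpus of 2 million words.
freq = per_million(50, 2_000_000)

# Hypothetical coding of a random sample of 250 concordance lines.
sample = ["activity"] * 225 + ["dimorphism"] * 25
shares = sense_shares(sample)
```

Normalising to a common base is what makes a one-million-word corpus and a multi-billion-word corpus comparable at all; the sense shares, by contrast, depend on the sample drawn and on how each line is coded.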

The New General Service List (new-GSL) is out

The new-GSL is an English vocabulary baseline intended for both researchers and practitioners. It is based on a robust comparison of four corpora of general English with a total size of over 12 billion words. It contains 2,494 vocabulary items, 2,116 of which belong to a stable lexical core; the remaining 378 words in the wordlist represent lexical innovations. All of these words appear with high frequencies across a large number of different contexts.
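The distinction between a stable core and lexical innovations can be pictured with a few set operations. The sketch below is only a simplified illustration under stated assumptions, not the procedure described in the article: it assumes that each corpus has already been reduced to a list of its most frequent items, treats the core as the items shared by every list, and treats innovations as items shared by the newer lists but absent from the older ones. The function name and the toy wordlists are hypothetical.

```python
def split_core_and_innovations(older_lists, newer_lists):
    """Return (core, innovations) from lists of frequent vocabulary items.

    core: items present in every wordlist, old and new.
    innovations: items present in every newer list but in no older list.
    """
    older_sets = [set(words) for words in older_lists]
    newer_sets = [set(words) for words in newer_lists]
    core = set.intersection(*(older_sets + newer_sets))
    innovations = set.intersection(*newer_sets) - set.union(*older_sets)
    return core, innovations


# Toy wordlists, invented for this example.
older = [
    ["time", "year", "say", "supper"],
    ["time", "year", "say", "soup"],
]
newer = [
    ["time", "year", "say", "internet", "email"],
    ["time", "year", "say", "internet", "website"],
]
core, innovations = split_core_and_innovations(older, newer)
```

Requiring an item to appear in several corpora, rather than in one, is what keeps words that are frequent in a single source from entering the list.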

The article, which describes the methodology of the wordlist compilation, and the full new-GSL are both available open access from the Applied Linguistics website.

Further research

At the moment, we are working on an American supplement to the new-GSL. Our findings show that there is a surprisingly large overlap between frequent lexical items in British and American corpora. With some modifications, the new-GSL can therefore also be used successfully in American English contexts.

A larger question that the new-GSL raises, however, is how we reconcile our intuitions about important vocabulary items with the corpus-based findings. In this respect, the new-GSL is not a prescriptive but a descriptive wordlist. As we stress in the article, “[w]ith respect to the diversity of ESL/EFL contexts, it is deemed more useful to envision the use of our wordlist as a vocabulary base with the possibility of further additions, rather than a wordlist that strives to cater to a mixed cluster of heterogeneous expectations and needs” (p. 19).


Read more:

Brezina, V. and Gablasova, D. (2013). Is There a Core General Vocabulary? Introducing the New General Service List. Applied Linguistics.