Syntactic structures in the Trinity Lancaster Corpus

We are proud to announce a collaboration with Markus Dickinson and Paul Richards from the Department of Linguistics, Indiana University, on a project that will analyse syntactic structures in the Trinity Lancaster Corpus. The focus of the project is to develop a syntactic annotation scheme for spoken learner language and apply this scheme to the Trinity Lancaster Corpus, which is being compiled at Lancaster University in collaboration with Trinity College London. The aim of the project is to provide an annotation layer for the corpus that will allow sophisticated exploration of the morphosyntactic and syntactic structures in learner speech. The project will have an impact both on the theoretical understanding of spoken language production at different proficiency levels and on the development of practical NLP solutions for annotation of learner speech. More specific goals include:

  • Identification of units of spoken production and their automatic recognition.
  • Annotation and visualization of morphosyntactic and syntactic structures in learner speech.
  • Contribution to the development of syntactic complexity measures for learner speech.
  • Description of the syntactic development of spoken learner production.


New CASS Briefing now available — How to communicate successfully in English?

How to communicate successfully in English? An exploration of the Trinity Lancaster Corpus. Many speakers use English as their non-native language (L2) to communicate in a variety of situations: at school, at work or in other everyday situations. As well as needing to master the grammar and vocabulary of the English language, L2 users of English need to know how to react appropriately in different communicative situations. In linguistics, this aspect of language is studied under the label of “pragmatics”. This briefing offers an exploration of the pragmatic features of L2 speech in the Trinity Lancaster Corpus of spoken L2 production.

Trinity Lancaster Corpus at the International ESOL Examiner Training Conference 2015

On Friday 30th January 2015, I gave a talk at the International ESOL Examiner Training Conference 2015 in Stafford. Every year, Trinity College London, CASS’s research partner, organises a large conference for all their examiners, which consists of plenary lectures and individual training sessions. This year, I was invited to speak in front of an audience of over 300 examiners about the latest developments in the learner corpus project. For me, this was a great opportunity not only to share some of the exciting results from the early research based on this unique resource, but also to meet the Trinity examiners; many of them have been involved in collecting the data for the corpus. This talk was therefore also an opportunity to thank everyone for their hard work and wonderful support.

It was very reassuring to see the high level of interest in the corpus project among the examiners, who have a deep insight into the examination process from their everyday professional experience. The corpus, as a body of transcripts from the Trinity spoken tests, in some way reflects this rich experience, offering a holistic picture of the exam and, ultimately, of L2 speech in a variety of communicative contexts.

Currently, the Trinity Lancaster Corpus consists of over 2.5 million running words sampling the speech of over 1,200 L2 speakers from eight different L1 and cultural backgrounds. The size itself makes the Trinity Lancaster Corpus the largest corpus of its kind. However, size is not all that the corpus has to offer. In cooperation with Trinity (and with great help from the Trinity examiners) we were able to collect detailed background information about each speaker in our 2014 dataset. In addition, the corpus covers a range of proficiency levels (B1–C2 levels of the Common European Framework), which allows us to research spoken language development in a way that has not been previously possible. The Trinity Lancaster Corpus, which is still being developed with an average growth of 40,000 words a week, is an ambitious project. Using this robust dataset, we can now start exploring crucial aspects of L2 speech and communicative competence and thus help language learners, teachers and material developers to make the process of L2 learning more efficient and also (hopefully) more enjoyable. Needless to say, without Trinity as a strong research partner and the support from the Trinity examiners this project wouldn’t be possible.

A Journey into Transcription, Part 4: The Question Question

question: (NOUN) A sentence worded or expressed so as to elicit information.

Since we speak in utterances (not sentences), most forms of punctuation are omitted in this corpus of learner language; the exceptions being apostrophes, hyphens and question marks. 

This blog concerns question marks.  (Warning: there are not many jokes!)

When we started transcription, the convention seemed simple and straightforward: “Question mark indicates a question”. This is easy to apply when questions are straightforward. For example, the following question types are easy to identify:

  • yes/no questions (do you like chocolate?);
  • wh- questions (where have you been?);
  • tag questions (rock music is popular isn’t it?);
  • either/or questions (did you catch the train or did you fly?)
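As a rough illustration of what “structure rather than intonation” can mean for these straightforward types, here is a minimal Python sketch. It is not the transcribers’ procedure or part of the corpus conventions: the word lists, the tag pattern and the function name `has_question_structure` are all hypothetical, chosen only to show a purely structural check.

```python
import re

# Hypothetical word lists for illustration only; not taken from the conventions.
AUXILIARIES = {
    "do", "does", "did", "is", "are", "was", "were", "have", "has", "had",
    "can", "could", "will", "would", "should", "shall", "may", "might", "must",
}
WH_WORDS = {"what", "where", "when", "who", "whom", "whose", "which", "why", "how"}
LEADING_FILLERS = {"and", "so", "but", "well", "yes", "er", "erm"}

# A tag question ends in a negated auxiliary followed by a pronoun.
TAG_PATTERN = re.compile(
    r"\b(isn't|aren't|wasn't|weren't|don't|doesn't|didn't|won't|can't|haven't|hasn't)"
    r" (it|he|she|they|you|we|i)$"
)


def has_question_structure(utterance: str) -> bool:
    """Return True if the utterance opens like a yes/no or wh- question,
    or ends in a tag. Intonation is ignored: only the words are inspected."""
    text = utterance.lower().replace("’", "'").replace("?", " ")
    words = text.split()
    # Skip connectives and hesitations before the first content word.
    while words and words[0] in LEADING_FILLERS:
        words.pop(0)
    if not words:
        return False
    if words[0] in AUXILIARIES or words[0] in WH_WORDS:
        return True
    return bool(TAG_PATTERN.search(" ".join(words)))


for example in ["do you like chocolate", "rock music is popular isn’t it", "really"]:
    print(example, "->", has_question_structure(example))
```

A check this simple misfires on exclamations that open with a question word and misses questions whose auxiliary comes late in the utterance, which is one reason the less straightforward cases need human judgement and a bank of worked examples.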

However, very soon, we found ourselves in debate about whether and where to transcribe question marks in less straightforward utterances.  This enabled us to amend the convention and add illustrative examples.  In addition, transcribers created a Questions Bank and began to keep a log of decisions made regarding the transcription of question marks; this was done with the aim of achieving the consistency which we anticipate might be vital to researchers in the future. 

So here follows a reflection on some of the varied ways in which speakers can elicit a response in spoken discourse, along with remarks on whether or not a question mark is transcribed in the context of this corpus.

It is useful to keep two vital rules in mind:

  • For the learner language corpus it is the structure of the utterance that is crucial rather than the expression or tone of voice. 
  • If in doubt, leave it out!

Either/Or Adjusted Question

Speaker adjusts wording and question structure remains.

  • so in Indian houses do you also have landline telephones or do they  are they disappearing?

Either/Or Anticipation Question:

Use of ‘or’ suggests a choice of alternatives is going to be presented but the questioner’s voice and pace tails off in anticipation of the listener’s response.

  • do you go to a special school? or… [note: the ellipsis would not be transcribed in the corpus]

Doubled Up Question

Structurally, there may be two questions but only one question is actually being asked; question mark transcribed at the end.

  • is it important to do school trips do you think?

Rephrased / Clarified Question:

Multiple rephrased/related questions in quick succession; each is structurally complete, eliciting a single response.

  • in what area? in what field? do have you any idea?
  • what are you going to do when you finish at this school? what will you do next?

Wondering Question:

A question word (often ‘what’) within the utterance and transcribed with question mark.

  • it seems to me your class sizes you have what? forty five students in a class it seems to me they are very large

Question Word/Context Question:

Question word followed by context/detail; often for emphasis and expressing shock or surprise.

  • what? they have a party all day
  • when? in the middle of the night

Clarification/Qualification Question:

A question followed by qualifying phrase for emphasis or for clarification; question mark may be transcribed at the end…

  • what about education more broadly more generally?
  • would you make it more fashionable more stylish?

…or in the middle of the utterance.

  • what do you think the biggest problems are in Mumbai? the biggest pollution problems
  • is that your ambition? to design a bicycle

Interrupted (Clause) Question:

A clause inserted mid-question but structure remains and one main question is being asked.

  • what about looking at education not just at your school looking at education in general?

Implied Question:

Interrogative intonation communicates speaker’s aim to elicit information; however, in this corpus we focus solely on structure so no question mark is transcribed.

Useful test: is the utterance meaningful without interrogative intonation?  If so, no question mark is added.

S:            I thought I was late

E:            really

S:            yes I overslept


E:            and how are you today?

S:            I’m fine and you

E:            I’m fine too


E:            any questions for me about your topic

S:            yes have you ever been to New York?

Statement Question:

Again, interrogative intonation communicates speaker’s aim to elicit information but structurally there is no question in the second part of this utterance and so no question mark is transcribed.

E:            so what do you think is the answer then? you think that parents should be at home more

S:            no I think they should have the choice

Unclear Question:

Key words are unclear making question structure incomplete; no question mark is transcribed.

S:            <unclear=can you> repeat the question please

A Complex Utterance with a Question Structure:

A number of self-corrections but the structure of a question exists.

S:            and do you think it’s it’s good to be in to be in touch with many people and to and to and to con= er contact with your friends and erm and at your home for exa= on your home for example?

Interrupted Question:

If the question is interrupted, no question mark is transcribed; however, sometimes a short question structure remains.

S:            is he er good enough?  to

E:            mm

S:            you know develop India and make it a superpower

Interrupted Either/Or Question:

What would originally have been a single either/or question is interrupted resulting in two independent question structures which are each transcribed with question marks.

E:            do you think it’s a skill?

S:            erm I think

E:            or can you get better at it?

So this has been a glimpse at some of the many varied ways speakers use language to elicit a response. Time and again we chant our mantra: “If in doubt, leave it out”!

The full version of our Questions Bank is now pretty exhaustive.  Generally we find that utterances can be mapped onto existing example structures so we can be confident that the decision as to if/where to transcribe the question mark will be consistent with previous decisions. So the Questions Bank, for us, has definitely been a valuable transcription tool. 

A Journey into Transcription, Part 3: Clarity

As audio transcribers we listen to sound.  Of primary importance is the clarity of the sound.



The quality of being clear (‘easy to perceive, understand, or interpret’), in particular:

  • The quality of being coherent and intelligible
  • The quality of being easy to hear; sharpness of sound
  • The quality of purity

Let’s consider these qualities and their relevance to the audio transcriber.

The quality of being coherent and intelligible

All of us, when engaged in discussion and conversation, want our language to be coherent and intelligible.  However, for the transcriber listening to a recording, its clarity in the sense of being coherent and intelligible is something of a paradox; it is simultaneously useful and yet also to be ignored.

Naturally, we know that our brains are programmed to attempt to organise and make sense of language.  In this sense, context can often present the transcriber with an invaluable clue to making out words which may be difficult to hear in a recording.

At the initial drafting stage of transcription what we hear at first can turn out to be quite different when we re-listen, edit and proofread the transcript with the glorious benefit of wider context to assist us.  Here are a few of the more entertaining examples:

you wear glasses becomes yoga classes

it’s among the becomes it’s a manga [comic]

yes she was becomes H G Wells

whisking gently becomes whiskey J&B [discussing a recipe!]

However, since the raison d’être of this corpus is to serve as a basis for research into the language of learners, part of the skill here is in not being distracted by our knowledge of grammatical rules and the surrounding context.

The audio transcriber’s task is to hear what the learner actually says; this may not always be what they (or we) think or expect might be logical or appropriate (or desirable!).  Indeed, the transcription conventions are designed specifically to minimise the possibility of this happening during the transcription process.  In the context of a Graded Examination in Spoken English (GESE) the students (and, on rare occasion, the examiners) can, and sometimes do, say anything!

Below are a few examples of wrong words and non-words which are to be transcribed, alongside words which may have been intended by the speaker:

Trinity Lancaster Corpus: A glimpse of the future

At Trinity we are totally impressed that our spoken learner corpus is now just over 1.5 million words. Although there are still some quality checks to run, it means we’ve reached that anticipatory moment where we can start digging into the goldmine and seeing what insights the data can offer. We’ve been working closely with CASS so that their team have been able to participate in Trinity’s test creation processes as well as examiner training sessions. This has allowed the researchers to fully understand the communicative skills the exam elicits and to identify interesting aspects of language that might be investigated. Equally, the Trinity team are very much looking forward to an upcoming visit to Lancaster where the CASS team will guide us on the corpus tools and the type of reports we can run that will access the data we need for our own research interests into the test itself.

Currently we are so excited at having such a wealth of data at our fingertips that we are in that dangerous moment of skimming the corpus to see if our assumptions are played out. We’ve all been there – when you are convinced that the corpus will finally confirm your long held beliefs about how learners use language – only to discover that you are wrong or the evidence is not there! This is, however, significantly ameliorated by emerging findings that will allow us to add a quantitative component to our test validity arguments. Mining corpus data indicates a new approach to evidencing that the test tasks are performing as anticipated – and as designed! And then there’s that little delve where the numbers and patterns indicate something unanticipated – how delicious!

This Trinity Lancaster Corpus is fascinating because it comprises data from tasks where the candidate is given free rein to ‘show off’ their language skills and engage in authentic interaction with the examiner – thus giving a very close parallel with real life and so enriching applied linguistic research. At the same time, the test also contains a task type which really homes in on the candidate’s skill at enacting Gricean principles of co-operation, thus allowing us to investigate metacognitive processes such as how learners manage a conversation.

It has to be said that we recognize that the opportunities and insights offered by this unique corpus are in large part down to the high quality corpus transcription and annotation process implemented by CASS. We are now planning the collection of 2014 data, including, we hope, a wider range of L1s – because we are now totally addicted!

Trinity Lancaster Spoken Learner Corpus: A milestone to celebrate

On Monday 19 May we came together to celebrate the completion of the first part of the Trinity Lancaster Spoken Learner Corpus project. The transcription of our 2012 dataset is now complete and the corpus comprises 1.5 million running words. The Trinity Lancaster Spoken Learner Corpus represents a balanced sample of learner speech from six different countries (Italy, Spain, Mexico, India, China and Sri Lanka) covering the B1.2 – C2 levels of the Common European Framework (CEFR). Below are some pictures from our small celebration.


We are continuing with the corpus development adding more data from our 2014 dataset so there is still a lot of work to be done. However, we are really excited about the possibilities of applied linguistic and language testing research based on this unique dataset.

You can read more about the Trinity Lancaster Spoken Learner Corpus in the AEA-Europe newsletter report.

Trinity oral test corpus: The first hurdle

At Trinity we are wildly excited – yes, wildly – to finally have our corpus project set up with CASS. It’s a unique opportunity to create a learner corpus of English based on some fairly free flowing L2 language which is not too constrained by the testing context.  All Trinity oral tests are recorded and most of the tests include one or two tasks where the candidate has free rein to talk about their own interests in their own way – very much their own contributions, expressed as themselves. We have been hoping to use what is referred to as our ‘gold dust’ for research that will be meaningful – not just to the corpus community but also in terms of the impact on our tests and our feedback to learners and teachers. Working with CASS has now given us this golden opportunity.

The project is now up and running and in the corpus building stage, and we have moved from the heady excitement of imagining what we could do with all the data to the grindstone of pulling together all the strands of meta data needed to make the corpus robust and useful. The challenges are real – for example, we need to log first languages but how do we ensure reliability? Meta data is now an opt-in in most countries so how do we capture everyone? Even when the data boxes are completed how do we know it’s true? No, the only way is the very non-technological method of contacting the students again and following up in person.

A related concern is whether the meta data we need has shifted. We would normally be interested in what kind of input students had had to their learning, e.g. how many years of study. In the past, part of this data gathering was to ask about time learners had spent in an English-speaking country. Should this now be shifted to time spent watching videos online in English, in social media, in reading online sources? What is relevant – and also collectable?

The challenges in what might be considered non-core information are forcing us to re-examine how sure we are about influences on learning – not just from our perception but from the learner’s perception as well.

Vocabulary wordlists designed for learners: Development of the new-GSL

Imagine you have just started learning a new foreign language. Which words do you need to learn first? We all might have some intuitions about this. If the language is English then time – the most frequent noun both in speech and writing – will probably be more useful than, say, the adjective temporaneous (yes, the OED records this word). However, intuitions (as corpus linguists know) are not to be trusted (at least not all the time). Only through analysis of large amounts of textual data (yes, language corpora!) will we be able to identify words that occur frequently across a number of different contexts.

The research Dana and I are going to talk about on Thursday will look at the methodology of creating a pedagogical wordlist – the new-GSL (the old one is now really out of date) – which can assist both learners and teachers in the process of acquisition of basic English vocabulary. We’ll be looking at the ways in which both large (BNC, EnTenTen12) and small corpora (LOB, BE06) can be used in the creation of such a wordlist.
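To make the idea of “words that occur frequently across a number of different contexts” concrete, here is a minimal Python sketch. It is an illustration under simplifying assumptions, not the method behind the new-GSL: it works on raw word forms rather than lemmas, uses a plain top-n cut-off in each corpus, and the function names and the tiny toy “corpora” are invented for the example.

```python
from collections import Counter


def top_words(tokens, n):
    """Return the set of the n most frequent word forms in one corpus."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def shared_core_vocabulary(corpora, n):
    """Keep only the words that rank in the top n of every corpus."""
    top_sets = [top_words(tokens, n) for tokens in corpora.values()]
    return set.intersection(*top_sets) if top_sets else set()


# Toy data standing in for real corpora (hypothetical, for illustration only).
toy_corpora = {
    "corpus_a": "the time is now the time is short the end".split(),
    "corpus_b": "time and the tide wait the time the time".split(),
}

print(sorted(shared_core_vocabulary(toy_corpora, n=3)))
```

Requiring a word to rank highly in every corpus, rather than in a single pooled count, is a crude way of rewarding words that are spread across contexts over words that are frequent in only one source.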