We at the Trinity Lancaster Learner Corpus team are very pleased to announce that we have a logo for our lovely corpus. We very much hope that it represents the corpus by capturing its key features. We knew we wanted to portray what we feel are the unique aspects of our corpus – interactive L2 spoken data – but that within this we had to reflect the identities of both the ESRC Centre for Corpus Approaches to Social Science at Lancaster University and Trinity. It was a challenge in the design brief to present abstract concepts in a way that allowed the designer to ‘see’ our vision.
We are very happy with the result. We have a logo which displays our respective organisational ‘colours’ in two speech bubbles. The two bubbles denote the largely interactive nature of the data and are slightly squared off to shift away from cartoon associations and towards the academic nature of our endeavours.
We are happy that the logo gives the corpus a very strong identity and is meaningful to the wider community of researchers and language practitioners who are likely to engage with the corpus in the future. It all feels very real now as the Lancaster team process and tag the final recordings that make up the data.
On Friday 30th January 2015, I gave a talk at the International ESOL Examiner Training Conference 2015 in Stafford. Every year, Trinity College London, CASS’s research partner, organises a large conference for all its examiners, which consists of plenary lectures and individual training sessions. This year, I was invited to speak in front of an audience of over 300 examiners about the latest developments in the learner corpus project. For me, this was a great opportunity not only to share some of the exciting results from the early research based on this unique resource, but also to meet the Trinity examiners; many of them have been involved in collecting the data for the corpus. This talk was therefore also an opportunity to thank everyone for their hard work and wonderful support.
It was very reassuring to see the high level of interest in the corpus project among the examiners, who have a deep insight into the examination process from their everyday professional experience. The corpus, as a body of transcripts from the Trinity spoken tests, in some way reflects this rich experience, offering an overall holistic picture of the exam and, ultimately, L2 speech in a variety of communicative contexts.
Currently, the Trinity Lancaster Corpus consists of over 2.5 million running words sampling the speech of over 1,200 L2 speakers from eight different L1 and cultural backgrounds. The size itself makes the Trinity Lancaster Corpus the largest corpus of its kind. However, it is not only the size that the corpus has to offer. In cooperation with Trinity (and with great help from the Trinity examiners) we were able to collect detailed background information about each speaker in our 2014 dataset. In addition, the corpus covers a range of proficiency levels (B1–C2 levels of the Common European Framework), which allows us to research spoken language development in a way that has not been previously possible. The Trinity Lancaster Corpus, which is still being developed with an average growth of 40,000 words a week, is an ambitious project: using this robust dataset, we can now start exploring crucial aspects of L2 speech and communicative competence and thus help language learners, teachers and material developers to make the process of L2 learning more efficient and also (hopefully) more enjoyable. Needless to say, without Trinity as a strong research partner and the support from the Trinity examiners this project wouldn’t be possible.
Since we speak in utterances (not sentences), most forms of punctuation are omitted in this corpus of learner language; the exceptions being apostrophes, hyphens and question marks.
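For readers who like to see a rule written down, here is a minimal sketch of how the punctuation convention above might be checked automatically. It is an illustration only, not a tool used by the project: the function name `disallowed_punctuation` is hypothetical, and the sketch assumes plain utterance text, so it makes no attempt to handle any annotation markup that a transcript may contain.

```python
# Assumption: the only punctuation kept in a transcript is the apostrophe
# (straight or curly), the hyphen and the question mark, as described above.
ALLOWED_PUNCTUATION = {"'", "’", "-", "?"}


def disallowed_punctuation(utterance: str) -> list[str]:
    """Return every character that is neither a letter, a digit, whitespace
    nor one of the permitted punctuation marks, in order of appearance."""
    found = []
    for ch in utterance:
        if ch.isalnum() or ch.isspace() or ch in ALLOWED_PUNCTUATION:
            continue
        found.append(ch)
    return found


if __name__ == "__main__":
    for line in ["do you like chocolate?", "well, I overslept."]:
        print(line, "->", disallowed_punctuation(line))
```

An empty list means the utterance uses only the permitted marks; anything else is a prompt for a human to look again.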
This blog concerns question marks. (Warning: there are not many jokes!)
When we started transcription, the convention seemed simple and straightforward: a question mark indicates a question. This is easy to apply when questions are straightforward. For example, the following question types are easy to identify:
- yes/no questions (do you like chocolate?);
- wh- questions (where have you been?);
- tag questions (rock music is popular isn’t it?);
- either/or questions (did you catch the train or did you fly?)
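The four straightforward types listed above are identified by structure alone, which can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the word lists, the tag pattern and the function name `classify_question` are hypothetical, and the heuristic looks only at the first word and the final tag of an utterance. It is not how the transcribers work, who listen and decide case by case.

```python
import re

# Hypothetical, deliberately short word lists for illustration only.
WH_WORDS = {"what", "where", "when", "who", "whom", "whose", "which", "why", "how"}
AUXILIARIES = {
    "do", "does", "did", "is", "are", "was", "were", "have", "has", "had",
    "can", "could", "will", "would", "shall", "should", "may", "might", "must",
}
# A negative auxiliary followed by a pronoun at the very end of the utterance.
TAG_PATTERN = re.compile(
    r"\b(isn't|aren't|don't|doesn't|didn't|wasn't|weren't|won't|can't|haven't|hasn't)"
    r" (it|you|they|he|she|we|i)$"
)


def classify_question(utterance: str):
    """Return 'tag', 'wh', 'either/or' or 'yes/no' when the utterance has one
    of the straightforward question structures, otherwise None."""
    text = utterance.lower().replace("’", "'").rstrip("?").strip()
    words = text.split()
    if not words:
        return None
    if TAG_PATTERN.search(text):
        return "tag"
    if words[0] in WH_WORDS:
        return "wh"
    if words[0] in AUXILIARIES:
        # An auxiliary-initial utterance offering alternatives with 'or'.
        return "either/or" if " or " in f" {text} " else "yes/no"
    return None


if __name__ == "__main__":
    for example in [
        "do you like chocolate?",
        "where have you been?",
        "rock music is popular isn’t it?",
        "did you catch the train or did you fly?",
        "I thought I was late",
    ]:
        print(f"{example!r}: {classify_question(example)}")
```

A heuristic of this kind copes with the textbook cases and nothing more, which is exactly where the difficulties described next begin.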
However, very soon, we found ourselves in debate about whether and where to transcribe question marks in less straightforward utterances. This enabled us to amend the convention and add illustrative examples. In addition, transcribers created a Questions Bank and began to keep a log of decisions made regarding the transcription of question marks; this was done with the aim of achieving the consistency which we anticipate might be vital to researchers in the future.
So here follows a reflection on some of the varied ways in which speakers can elicit a response in spoken discourse, along with remarks on whether or not a question mark is transcribed in the context of this corpus.
It is useful to keep two vital rules in mind:
- For the learner language corpus it is the structure of the utterance that is crucial rather than the expression or tone of voice.
- If in doubt, leave it out!
Either/Or Adjusted Question:
Speaker adjusts wording and question structure remains.
- so in Indian houses do you also have landline telephones or do they are they disappearing?
Either/Or Anticipation Question:
Use of ‘or’ suggests a choice of alternatives is going to be presented but the questioner’s voice and pace tails off in anticipation of the listener’s response.
- do you go to a special school? or… [the ellipsis would not be transcribed in the corpus]
Doubled Up Question:
Structurally, there may be two questions but only one question is actually being asked; question mark transcribed at the end.
- is it important to do school trips do you think?
Rephrased / Clarified Question:
Multiple rephrased/related questions in quick succession; each is structurally complete, eliciting a single response.
- in what area? in what field? do have you any idea?
- what are you going to do when you finish at this school? what will you do next?
A question word (often ‘what’) within the utterance and transcribed with question mark.
- it seems to me your class sizes you have what? forty five students in a class it seems to me they are very large
Question Word/Context Question:
Question word followed by context/detail; often for emphasis and expressing shock or surprise.
- what? they have a party all day
- when? in the middle of the night
A question followed by qualifying phrase for emphasis or for clarification; question mark may be transcribed at the end…
- what about education more broadly more generally?
- would you make it more fashionable more stylish?
…or in the middle of the utterance.
- what do you think the biggest problems are in Mumbai? the biggest pollution problems
- is that your ambition? to design a bicycle
Interrupted (Clause) Question:
A clause inserted mid-question but structure remains and one main question is being asked.
- what about looking at education not just at your school looking at education in general?
Interrogative intonation communicates speaker’s aim to elicit information; however, in this corpus we focus solely on structure so no question mark is transcribed.
Useful test: is the utterance meaningful without interrogative intonation? If so, no question mark is added.
S: I thought I was late
S: yes I overslept
E: and how are you today?
S: I’m fine and you
E: I’m fine too
E: any questions for me about your topic
S: yes have you ever been to New York?
Again, interrogative intonation communicates speaker’s aim to elicit information but structurally there is no question in the second part of this utterance and so no question mark is transcribed.
E: so what do you think is the answer then? you think that parents should be at home more
S: no I think they should have the choice
Key words are unclear making question structure incomplete; no question mark is transcribed.
S: <unclear=can you> repeat the question please
A Complex Utterance with a Question Structure:
A number of self-corrections but the structure of a question exists.
S: and do you think it’s it’s good to be in to be in touch with many people and to and to and to con= er contact with your friends and erm and at your home for exa= on your home for example?
If the question is interrupted, no question mark is transcribed; however, sometimes a short question structure remains.
S: is he er good enough? to
S: you know develop India and make it a superpower
Interrupted Either/Or Question:
What would originally have been a single either/or question is interrupted resulting in two independent question structures which are each transcribed with question marks.
E: do you think it’s a skill?
S: erm I think
E: or can you get better at it?
So this has been a glimpse at some of the many varied ways speakers use language to elicit a response. Time and again we chant our mantra: “If in doubt, leave it out”!
The full version of our Questions Bank is now pretty exhaustive. Generally we find that utterances can be mapped onto existing example structures so we can be confident that the decision as to if/where to transcribe the question mark will be consistent with previous decisions. So the Questions Bank, for us, has definitely been a valuable transcription tool.
As audio transcribers we listen to sound. Of primary importance is the clarity of the sound.
The quality of being clear (‘easy to perceive, understand, or interpret’), in particular:
- The quality of being coherent and intelligible
- The quality of being easy to hear; sharpness of sound
- The quality of purity
Let’s consider these qualities and their relevance to the audio transcriber.
The quality of being coherent and intelligible
All of us, when engaged in discussion and conversation, want our language to be coherent and intelligible. However, for the transcriber listening to a recording, its clarity in the sense of being coherent and intelligible is something of a paradox; it is simultaneously useful and yet also to be ignored.
Naturally, we know that our brains are programmed to attempt to organise and make sense of language. In this sense, context can often present the transcriber with an invaluable clue to making out words which may be difficult to hear in a recording.
At the initial drafting stage of transcription what we hear at first can turn out to be quite different when we re-listen, edit and proofread the transcript with the glorious benefit of wider context to assist us. Here are a few of the more entertaining examples:
you wear glasses becomes yoga classes
it’s among the becomes it’s a manga [comic]
yes she was becomes H G Wells
whisking gently becomes whiskey J&B [discussing a recipe!]
However, since the raison d’être of this corpus is as a basis for research into the language of learners, part of the skill here is in not being distracted by our knowledge of grammatical rules and the surrounding context.
The audio transcriber’s task is to hear what the learner actually says; this may not always be what they (or we) think or expect might be logical or appropriate (or desirable!). Indeed, the transcription conventions are designed specifically to minimise the possibility of this happening during the transcription process. In the context of a Graded Examination in Spoken English (GESE) the students (and, on rare occasion, the examiners) can, and sometimes do, say anything!
Below are a few examples of wrong words and non-words which are to be transcribed, alongside words which may have been intended by the speaker:
The action of teaching a person or animal a particular skill or type of behaviour.
So how to begin? With experts as our guides (and thankfully no animals in sight!)…
The Context: The first week was to be dedicated to training. We began by watching a short video clip of a Trinity examination in progress. Although our day-to-day work is based purely on audio recordings, we really appreciated having this quick peek into the world of the examination room. Being able to picture the scene when listening to exam recordings somehow brings the spoken language to life.
Picture this: a desk with a friendly examiner seated at one side; tape recorder in situ and possibly a fan whirring (quietly, we hope) in the background; a pile of papers (perhaps held down by a paperweight); and then, most importantly for us in this research into learner language, a student seated on the other side of the desk; some nervous, some shy, some confident, some excited, some reluctant to speak and a rare few who might even have felt quite at home seated on the other side of the desk!
Time spent viewing this clip was truly a valuable introduction to the context of this research and the real world to which the audio transcriber is privy on a daily basis.
What next? Enthusiastic to get started, headsets on, foot pedals down…
Practice File: We started with a practice recording that had been transcribed previously, applying to it our first set of transcription conventions. (These have subsequently been altered and updated on numerous occasions.) This was an extremely valuable process – in listening separately and together to sections of the recording and in comparing our own transcripts with each other and with the original, we quickly realised the range of subtleties that are involved in this task. The aim, of course, is for transcribers to do as little interpretation as possible and to be able to apply the conventions in a more or less uniform manner, thus making the transcription process as straightforward as possible. This, after all, is what will enable us to build a reliable corpus of words that are actually uttered. Whilst the technology now exists to generate text from spoken words, the accuracy of the text produced does not come close to that produced by a real-life human transcriber.
Key to this task is the fact that it is unlike transcription in other working environments; we are not seeking to produce grammatically correct punctuated documents such as you might find on a BBC website when you want to review that radio programme you heard, or perhaps missed. In spoken language there are only utterances and our job is to record every utterance precisely by following the given conventions, the only punctuation in sight being apostrophes and the odd question mark. So is that syllable a word ending, a false start to another word, perhaps a filler used intentionally to maintain a turn in conversation, or perhaps an involuntary sound? All these are natural features of spoken discourse. Tackling this challenge and striving to produce a document that represents as accurately as is humanly possible the words actually uttered by each individual speaker – once again, here is the challenge that makes our job enjoyable and rewarding.
And finally… A Transcriber’s Thought For The Day:
I tried to catch some fog. I mist.
to put (thoughts, speech, or data) into written or printed form
mid 16th century (in the sense ‘make a copy in writing’):
from Latin transcribere, from trans- ‘across’ + scribere ‘write’
In September 2013 we applied for the post of Audio Transcriber in the CASS Office in the Department of Linguistics and English Language here at Lancaster University. The job description seemed straightforward: to transcribe audio tape materials according to a predefined scheme and to undertake other appropriate duties as directed. And the person specification? As you would expect, a list of essential/desirable skills including working effectively as part of a team; the ability to learn and apply schemes (more of that later); and the ability to work with a range of accents and dialects of English (this is the fun part!).
We say the post of Audio Transcriber since, as far as we knew, only one post was available. How wonderful to find ourselves both appointed (long may the funding last!); the opportunity to establish a slick working team, as well as to consult when problems arise and, not least, to celebrate the successes (yes, transcribing is a rewarding job!) are a huge benefit not only to ourselves in our work but also to the success of the project as a whole. In the ESRC Centre for Corpus Approaches to Social Science, it must be the corpus that is at the heart of the centre. Knowing that we play a key role within the team working together to develop this corpus, we take great pride in what we do. After all, our listening skills, our focus on accuracy and our meticulous attention to detail have the potential to help develop a corpus of excellent quality, and this will make a vital contribution to the validity of all the research that will follow. Quite simply, it is this which makes our job so enjoyable and rewarding.
Our day-to-day work involves transcribing recordings of oral examinations taken by learners of English as a second language at elementary, intermediate and advanced stages. The examinations have been carried out by Trinity College London and have taken place in various countries: Spain, Mexico, Italy, China, India and Sri Lanka so far. Each language and each stage has its own unique features.
Seven months and 1.5 million words later (Stage One completed and celebrated with colleagues and cake!), we were delighted to be invited to contribute a BLOG documenting our experience as transcribers. Over the coming months we plan to describe and discuss various aspects of the job. The aim is to offer other transcribers and researchers an insight into this particular process.
Look out for the next instalment on Getting Started!
And finally… A Transcriber’s Thought For The Day:
They told me I had type A blood, but it was a type-O.
At Trinity we are totally impressed that our spoken learner corpus is now just over 1.5 million words. Although there are still some quality checks to run, it means we’ve reached that anticipatory moment where we can start digging into the goldmine and seeing what insights the data can offer. We’ve been working closely with CASS so that their team have been able to participate in Trinity’s test creation processes as well as examiner training sessions. This has allowed the researchers to fully understand the communicative skills the exam elicits and to identify interesting aspects of language that might be investigated. Equally, the Trinity team are very much looking forward to an upcoming visit to Lancaster where the CASS team will guide us on the corpus tools and the type of reports we can run that will access the data we need for our own research interests into the test itself.
Currently we are so excited at having such a wealth of data at our fingertips that we are in that dangerous moment of skimming the corpus to see if our assumptions are played out. We’ve all been there – when you are convinced that the corpus will finally confirm your long held beliefs about how learners use language – only to discover that you are wrong or the evidence is not there! This is, however, significantly ameliorated by emerging findings that will allow us to add a quantitative component to our test validity arguments. Mining corpus data indicates a new approach to evidencing that the test tasks are performing as anticipated – and as designed! And then there’s that little delve where the numbers and patterns indicate something unanticipated – how delicious!
This Trinity Lancaster corpus is fascinating because it comprises data from tasks where the candidate is given free rein to ‘show off’ their language skills and engage in authentic interaction with the examiner – thus giving a very close parallel with real life and so enriching applied linguistic research. At the same time, the test also contains a task type which really homes in on the candidate’s skill at enacting Gricean principles of co-operation, thus allowing us to investigate metacognitive processes such as how learners manage a conversation.
It has to be said that we recognize that the opportunities and insights offered by this unique corpus are in large part down to the high quality corpus transcription and annotation process implemented by CASS. We are now planning the collection of 2014 data, including, we hope, a wider range of L1s – because we are now totally addicted!
On Monday 19 May we came together to celebrate the completion of the first part of the Trinity Lancaster Spoken Learner Corpus project. The transcription of our 2012 dataset is now complete and the corpus comprises 1.5 million running words. The Trinity Lancaster Spoken Learner Corpus represents a balanced sample of learner speech from six different countries (Italy, Spain, Mexico, India, China and Sri Lanka) covering the B1.2 – C2 levels of the Common European Framework (CEFR). Below are some pictures from our small celebration.
We are continuing with the corpus development adding more data from our 2014 dataset so there is still a lot of work to be done. However, we are really excited about the possibilities of applied linguistic and language testing research based on this unique dataset.
You can read more about the Trinity Lancaster Spoken Learner Corpus in the AEA-Europe newsletter report.