Spoken BNC2014 meets FOLK

On Thursday 3rd December I visited the Institut für Deutsche Sprache (Institute for German Language) in Mannheim. The IDS is Germany’s national, non-university institution for the research and documentation of the German language in both the present day and the past.

I was thrilled to be invited there by Swantje Westpfahl, a PhD student at the Institute, who is working on the compilation of a large spoken corpus of German known as the FOLK (Forschungs- und Lehrkorpus Gesprochenes Deutsch; research and teaching corpus of spoken German). With the similarities between FOLK and the Spoken BNC2014 (my own PhD research project) apparent, we spent a day at the IDS learning about each other’s work.

In the morning, I gave an hour-long talk about the Spoken BNC2014, including an overview of our data collection and transcription methods as well as an investigation into speaker identification which I conducted earlier this year. I explained that, with a small budget, we (CASS and our partner Cambridge University Press) have very much favoured size and speed of production over minute detail of transcription, a decision that has allowed us to produce approximately 8 million words of orthographic transcription so far in only 18 months.

After lunch, I attended a workshop entitled “Spoken BNC2014 meets FOLK”, where Dr Thomas Schmidt gave an equivalent talk to my own about the FOLK project, followed by Swantje, whose specific focus is on the annotation of the transcribed corpus data. In terms of general design, the FOLK is fairly similar to the Spoken BNC2014; it contains transcripts of audio recordings of conversations held between speakers in a variety of settings. The major differences, as I learned, lie in the approach to transcription and the release of data. I learned about the incredible level of detail with which the FOLK recordings are transcribed, using Thomas’ own transcription software FOLKER. I was impressed by the affordances of this tool and the dedication to detail that was evident at the IDS, including the transcription of breathing, pauses measured to the millisecond and direct alignment to the (anonymized) audio recordings. All of this work takes a long time (on average, one hour of recording takes 100 hours to prepare in this way!), and as such the FOLK is much smaller than the Spoken BNC2014 (1.3 million words after three years), but extremely rich in terms of potential for analysis.
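To put the trade-off between size and detail into rough numbers, here is a back-of-the-envelope sketch in Python using only the approximate figures quoted in this post. It assumes that "three years" means 36 months, and it ignores differences in team size, budget and start date, so the result is only indicative. The function name `words_per_month` is made up for this illustration.

```python
def words_per_month(total_words: int, months: int) -> float:
    """Average number of transcribed words produced per month."""
    return total_words / months


# Approximate figures quoted in the post (assumption: three years = 36 months).
spoken_bnc2014_rate = words_per_month(8_000_000, 18)
folk_rate = words_per_month(1_300_000, 36)

print(round(spoken_bnc2014_rate))                 # 444444
print(round(folk_rate))                           # 36111
print(round(spoken_bnc2014_rate / folk_rate, 1))  # 12.3
```

On these figures the Spoken BNC2014 has grown roughly twelve times faster per month, which is the price the FOLK pays for its far richer transcription.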

The IDS was in turn impressed by the Spoken BNC2014’s approach to data collection, where we ‘crowd-source’ participants and invite them, through media engagement and other means, to make recordings using their smartphones in exchange for payment. I suggested that they might like to try putting out a press release about marmalade to see whether the German media respond in the same way that the British media did.

Overall, my visit to Mannheim was a fantastic opportunity to learn about the FOLK project and to have some really interesting discussions about the aims of spoken corpus linguistics, and I would like to thank all at the IDS for their hospitality. I look forward to seeing Swantje again when CASS hosts her in Lancaster for a research visit in the Spring next year.

The Spoken British National Corpus 2014 – project update

It has been a little over a year since CASS and Cambridge University Press announced a collaboration to compile a successor to the spoken component of the British National Corpus, the Spoken BNC2014. This will be the largest corpus of spoken British English since the original, with the advantage of being collected in the 2010s rather than the 1990s, providing an updated snapshot of spoken language in the UK. By including a set of recordings already gathered by Cambridge University Press before our collaboration began, we plan for the corpus to contain data from the years 2012 to 2016. As well as being the year in which the project was announced, 2014 will be the median year of the planned data range, and so we chose it to feature in the working title of the project: the Spoken BNC2014.

Since our announcement, we have been hard at work: advertising the project nationally, collecting recordings from speakers from all over the UK, transcribing the data, conducting methodological investigations, and presenting our work so far at corpus linguistics conferences. At ICAME 36 in May we described the development of the Spoken BNC2014 transcription scheme, and at Corpus Linguistics 2015 in July we gave an overview of the data collection methodology as well as presenting new research on speaker identification in transcription. All of this activity continues as we work towards making the corpus freely and publicly available in the year 2017.

So far, we have gathered nearly 700 recordings, amounting to an estimated six million words of informal conversational data. The majority of recordings feature two or three speakers, with about a quarter containing four or more. The balance of speaker gender is fairly even, and we have been able to gather data from a wide range of ages – though at the moment the 19-29 year olds have a clear lead! We have done very well in England to gather recordings from a great range of self-reported dialects, and we plan now to focus more heavily on gathering recordings from Wales, Scotland, and Northern Ireland. The word cloud of self-reported conversation topics gives a first look at the range of things that users can expect to find being discussed in the corpus.

We are very pleased with the progress of the project so far, and we look forward to releasing the corpus texts publicly once they are complete. In the meantime, as announced at CL2015, we will be offering the opportunity to apply for pre-release data grants later this year. More information about the data grants will be announced in the near future.

A Journey into Transcription, Part 4: The Question Question

question: (NOUN) A sentence worded or expressed so as to elicit information.

Since we speak in utterances (not sentences), most forms of punctuation are omitted in this corpus of learner language; the exceptions being apostrophes, hyphens and question marks. 

This blog concerns question marks.  (Warning: there are not many jokes!)

When we started transcription, the convention seemed simple and straightforward: “Question mark indicates a question”. This is easy to apply when questions are straightforward. For example, the following question types are easy to identify:

  • yes/no questions (do you like chocolate?);
  • wh- questions (where have you been?);
  • tag questions (rock music is popular isn’t it?);
  • either/or questions (did you catch the train or did you fly?)

However, very soon, we found ourselves in debate about whether and where to transcribe question marks in less straightforward utterances.  This enabled us to amend the convention and add illustrative examples.  In addition, transcribers created a Questions Bank and began to keep a log of decisions made regarding the transcription of question marks; this was done with the aim of achieving the consistency which we anticipate might be vital to researchers in the future. 

So here follows a reflection on some of the varied ways in which speakers can elicit a response in spoken discourse, along with remarks on whether or not a question mark is transcribed in the context of this corpus.

It is useful to keep two vital rules in mind:

  • For the learner language corpus it is the structure of the utterance that is crucial rather than the expression or tone of voice. 
  • If in doubt, leave it out!
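The first rule, that structure decides and intonation does not, can be roughly sketched in code. What follows is a minimal illustrative heuristic in Python. It is not the procedure used by the transcribers, whose decisions are made by human judgement against the written conventions and the Questions Bank. The word lists and the function name `has_question_structure` are assumptions made for this sketch. It only inspects the start and the end of an utterance, so it will miss questions embedded mid-utterance and will wrongly accept exclamations that begin with a question word.

```python
import re

# Hypothetical word lists, chosen for illustration only.
WH_WORDS = {"what", "where", "when", "who", "whom", "whose", "why", "which", "how"}
AUXILIARIES = {
    "do", "does", "did", "is", "are", "was", "were", "have", "has", "had",
    "can", "could", "will", "would", "shall", "should", "may", "might", "must",
}
SUBJECTS = {"i", "you", "he", "she", "it", "we", "they", "there", "that", "this"}
LEADING_FILLERS = {"and", "so", "but", "er", "erm", "well", "yes", "no"}

# A negative auxiliary followed by a pronoun at the very end, e.g. "isn't it".
TAG_ENDING = re.compile(
    r"\b(?:isn't|aren't|wasn't|weren't|don't|doesn't|didn't|won't|can't|"
    r"haven't|hasn't) (?:it|you|he|she|we|they|i)$"
)


def has_question_structure(utterance: str) -> bool:
    """Return True if the utterance looks structurally like a question."""
    # Normalise case and apostrophes; drop any question marks already present.
    words = utterance.lower().replace("’", "'").replace("?", " ").split()
    # Skip discourse markers and fillers at the start of the utterance.
    while words and words[0] in LEADING_FILLERS:
        words = words[1:]
    if not words:
        return False
    if words[0] in WH_WORDS:  # wh- question
        return True
    if len(words) >= 2 and words[0] in AUXILIARIES and words[1] in SUBJECTS:
        return True  # subject-auxiliary inversion (yes/no, either/or)
    return bool(TAG_ENDING.search(" ".join(words)))  # tag question


print(has_question_structure("do you like chocolate"))            # True
print(has_question_structure("rock music is popular isn’t it"))   # True
print(has_question_structure("really"))                           # False
```

Note that "really" is rejected however questioning the speaker's tone may have been: the sketch looks only at wording, which is the point of the rule.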

Either/Or Adjusted Question

Speaker adjusts wording and question structure remains.

  • so in Indian houses do you also have landline telephones or do they  are they disappearing?

Either/Or Anticipation Question:

Use of ‘or’ suggests a choice of alternatives is going to be presented but the questioner’s voice and pace tails off in anticipation of the listener’s response.

  • do you go to a special school? or… [NB: the ellipsis would not be transcribed in the corpus]

Doubled Up Question

Structurally, there may be two questions but only one question is actually being asked; question mark transcribed at the end.

  • is it important to do school trips do you think?

Rephrased / Clarified Question:

Multiple rephrased/related questions in quick succession; each is structurally complete, eliciting a single response.

  • in what area? in what field? do have you any idea?
  • what are you going to do when you finish at this school? what will you do next?

Wondering Question:

A question word (often ‘what’) within the utterance and transcribed with question mark.

  • it seems to me your class sizes you have what? forty five students in a class it seems to me they are very large

Question Word/Context Question:

Question word followed by context/detail; often for emphasis and expressing shock or surprise.

  • what? they have a party all day
  • when? in the middle of the night

Clarification/Qualification Question:

A question followed by qualifying phrase for emphasis or for clarification; question mark may be transcribed at the end…

  • what about education more broadly more generally?
  • would you make it more fashionable more stylish?

…or in the middle of the utterance.

  • what do you think the biggest problems are in Mumbai? the biggest pollution problems
  • is that your ambition? to design a bicycle

Interrupted (Clause) Question:

A clause inserted mid-question but structure remains and one main question is being asked.

  • what about looking at education not just at your school looking at education in general?

Implied Question:

Interrogative intonation communicates speaker’s aim to elicit information; however, in this corpus we focus solely on structure so no question mark is transcribed.

Useful test: is the utterance meaningful without interrogative intonation?  If so, no question mark is added.

S:            I thought I was late

E:            really

S:            yes I overslept

 

E:            and how are you today?

S:            I’m fine and you

E:            I’m fine too

 

E:            any questions for me about your topic

S:            yes have you ever been to New York?

Statement Question:

Again, interrogative intonation communicates speaker’s aim to elicit information but structurally there is no question in the second part of this utterance and so no question mark is transcribed.

E:            so what do you think is the answer then? you think that parents should be at home more

S:            no I think they should have the choice

Unclear Question:

Key words are unclear making question structure incomplete; no question mark is transcribed.

S:            <unclear=can you> repeat the question please

A Complex Utterance with a Question Structure:

A number of self-corrections but the structure of a question exists.

S:            and do you think it’s it’s good to be in to be in touch with many people and to and to and to con= er contact with your friends and erm and at your home for exa= on your home for example?

Interrupted Question:

If the question is interrupted, no question mark is transcribed; however, sometimes a short question structure remains.

S:            is he er good enough?  to

E:            mm

S:            you know develop India and make it a superpower

Interrupted Either/Or Question:

What would originally have been a single either/or question is interrupted resulting in two independent question structures which are each transcribed with question marks.

E:            do you think it’s a skill?

S:            erm I think

E:            or can you get better at it?


So this has been a glimpse at some of the many varied ways speakers use language to elicit a response. Time and again we chant our mantra: “If in doubt, leave it out”!

The full version of our Questions Bank is now pretty exhaustive.  Generally we find that utterances can be mapped onto existing example structures so we can be confident that the decision as to if/where to transcribe the question mark will be consistent with previous decisions. So the Questions Bank, for us, has definitely been a valuable transcription tool. 

A Journey into Transcription, Part 3: Clarity

As audio transcribers we listen to sound.  Of primary importance is the clarity of the sound.

clarity:

ABSTRACT NOUN:

The quality of being clear (‘easy to perceive, understand, or interpret’), in particular:

  • The quality of being coherent and intelligible
  • The quality of being easy to hear; sharpness of sound
  • The quality of purity

Let’s consider these qualities and their relevance to the audio transcriber.

The quality of being coherent and intelligible

All of us, when engaged in discussion and conversation, want our language to be coherent and intelligible.  However, for the transcriber listening to a recording, its clarity in the sense of being coherent and intelligible is something of a paradox; it is simultaneously useful and yet also to be ignored.

Naturally, we know that our brains are programmed to attempt to organise and make sense of language.  In this sense, context can often present the transcriber with an invaluable clue to making out words which may be difficult to hear in a recording.

At the initial drafting stage of transcription what we hear at first can turn out to be quite different when we re-listen, edit and proofread the transcript with the glorious benefit of wider context to assist us.  Here are a few of the more entertaining examples:

you wear glasses becomes yoga classes

it’s among the becomes it’s a manga [comic]

yes she was becomes H G Wells

whisking gently becomes whiskey J&B [discussing a recipe!]

However, since the raison d’être of  this corpus is as a basis for research into the language of learners, part of the skill here is in not being distracted by our knowledge of grammatical rules and the surrounding context.

The audio transcriber’s task is to hear what the learner actually says; this may not always be what they (or we) think or expect might be logical or appropriate (or desirable!).  Indeed, the transcription conventions are designed specifically to minimise the possibility of this happening during the transcription process.  In the context of a Graded Examination in Spoken English (GESE) the students (and, on rare occasion, the examiners) can, and sometimes do, say anything!

Below are a few examples of wrong words and non-words which are to be transcribed, alongside words which may have been intended by the speaker:


A Journey into Transcription, Part 2: Getting Started

training:
MASS NOUN:
The action of teaching a person or animal a particular skill or type of behaviour.

So how to begin?  With experts as our guides (and thankfully no animals in sight!)…

The Context:  The first week was to be dedicated to training.  We began by watching a short video clip of a Trinity examination in progress.  Although our day-to-day work is based purely on audio recordings, we really appreciated having this quick peek into the world of the examination room.  Being able to picture the scene when listening to exam recordings somehow brings the spoken language to life.

Picture this: a desk with a friendly examiner seated at one side; tape recorder in situ and possibly a fan whirring (quietly, we hope) in the background;  a pile of papers (perhaps held down by a paperweight); and then, most importantly for us in this research into learner language, a student seated on the  other side of the desk;  some nervous, some shy, some confident, some excited, some reluctant to speak and a rare few who might even have felt quite at home seated on the other side of the desk! 

Time spent viewing this clip was truly a valuable introduction to the context of this research and the real world to which the audio transcriber is privy on a daily basis.

What next?  Enthusiastic to get started, headsets on, foot pedals down…

Practice File:  We started with a practice recording that had been transcribed previously, applying to it our first set of transcription conventions.  (These have subsequently been altered and updated  on numerous occasions.)  This was an extremely valuable process – in listening separately and together to sections of the recording and in comparing our own transcripts with each other and with the original, we quickly realised the range of subtleties that are involved in this task.  The aim, of course, is for transcribers to do as little interpretation as possible and to be able to apply the conventions in a more or less uniform manner, thus making  the transcription process as straightforward as possible.  This, after all, is what will enable us to build a reliable corpus of words that are actually uttered.  Whilst the technology now exists to generate text from spoken words, the accuracy of the text produced does not come close to that produced by a real-life human transcriber.

Key to this task is the fact that it is unlike transcription in other working environments; we are not seeking to produce grammatically correct punctuated documents such as you might find on a BBC website when you want to review that radio programme you heard, or perhaps missed.  In spoken language there are only utterances and our job is to record every utterance precisely by following the given conventions, the only punctuation in sight being apostrophes and the odd question mark.  So is that syllable a word ending, a false start to another word, perhaps a filler used intentionally to maintain a turn in conversation, or perhaps an involuntary sound? All these are natural features of spoken discourse.  Tackling this challenge and striving to produce a document that represents as accurately as is humanly possible the words actually uttered by each individual speaker – once again, here is the challenge that makes our job enjoyable and rewarding.

And finally… A Transcriber’s  Thought For The Day:

I tried to catch some fog.  I mist.

A Journey into Transcription, Part 1: Our Approach

To Transcribe:
VERB:
to put (thoughts, speech, or data) into written or printed form
origin:
mid 16th century (in the sense ‘make a copy in writing’):
from Latin transcribere, from trans- ‘across’ + scribere ‘write’

In September 2013 we applied for the post of Audio Transcriber in the CASS Office in the Department of Linguistics and English Language here at Lancaster University.  The job description seemed straightforward: to transcribe audio tape materials according to a predefined scheme and to undertake other appropriate duties as directed.  And the person specification?  As you would expect, a list of essential/desirable skills including working effectively as part of a team; the ability to learn and apply schemes (more of that later); and the ability to work with a range of accents and dialects of English (this is the fun part!).

We say the post of Audio Transcriber since, as far as we knew, only one post was available.  How wonderful to find ourselves both appointed (long may the funding last!); the opportunity to establish a slick working team, as well as to consult when problems arise and, not least, to celebrate the successes (yes, transcribing is a rewarding job!) are a huge benefit not only to ourselves in our work but also to the success of the project as a whole.  In the ESRC Centre for Corpus Approaches to Social Science, it must be the corpus that is at the heart of the centre.  Knowing that we play a key role within the team working together to develop this corpus, we take great pride in what we do.  After all, our listening skills, our focus on accuracy and our meticulous attention to detail have the potential to help develop a corpus of excellent quality, and this will make a vital contribution to the validity of all the research that will follow.  Quite simply, it is this which makes our job so enjoyable and rewarding.

Our day-to-day work involves transcribing recordings of oral examinations taken by learners of English as a second language at elementary, intermediate and advanced stages.  The examinations have been carried out by Trinity College London and have taken place in various countries: Spain, Mexico, Italy, China, India and Sri Lanka so far.  Each language and each stage has its own unique features.

Seven months and 1.5 million words later (Stage One completed and celebrated with colleagues and cake!), we were delighted to be invited to contribute a BLOG documenting our experience as transcribers.  Over the coming months we plan to describe and discuss various aspects of the job.  The aim is to offer an insight to other transcribers and researchers about this particular process.

Look out for the next instalment on Getting Started!

And finally… A Transcriber’s  Thought For The Day:

They told me I had type A blood, but it was a type-O.