Spoken BNC2014 meets FOLK

On Thursday 3rd December I visited the Institut für Deutsche Sprache (Institute for German Language) in Mannheim. The IDS is Germany’s national, non-university institution for the research and documentation of the German language in both the present day and the past.

I was thrilled to be invited there by Swantje Westpfahl, a PhD student at the Institute, who is working on the compilation of a large spoken corpus of German known as the FOLK (Forschungs- und Lehrkorpus Gesprochenes Deutsch; research and teaching corpus of spoken German). With the similarities between FOLK and the Spoken BNC2014 (my own PhD research project) apparent, we spent a day at the IDS learning about each other’s work.

In the morning, I gave an hour-long talk about the Spoken BNC2014, including an overview of our data collection and transcription methods as well as an investigation into speaker identification which I conducted earlier this year. I explained that, with a small budget, we (CASS and our partner Cambridge University Press) have very much favoured size and speed of production over minute detail of transcription, a decision that has allowed us to produce approximately 8 million words of orthographic transcription so far in only 18 months.

After lunch, I attended a workshop entitled “Spoken BNC2014 meets FOLK”, where Dr Thomas Schmidt gave an equivalent talk to my own about the FOLK project, followed by Swantje, whose specific focus is on the annotation of the transcribed corpus data. In terms of general design, the FOLK is fairly similar to the Spoken BNC2014; it contains transcripts of audio recordings of conversations between speakers in a variety of settings. The major differences, as I learned, lie in the approach to transcription and the release of data. I learned about the incredible level of detail with which the FOLK recordings are transcribed, using Thomas’ own transcription software FOLKER. I was impressed by the affordances of this tool and the dedication to detail that was evident at the IDS, including the transcription of breathing, pauses measured to the millisecond and direct alignment to the (anonymized) audio recordings. All of this work takes a long time (on average, one hour of recording takes 100 hours to prepare in this way!), and as such the FOLK is much smaller than the Spoken BNC2014 (1.3 million words after three years), but extremely rich in terms of potential for analysis.

The IDS was in turn impressed by the Spoken BNC2014’s approach to data collection, where we ‘crowd-source’ participants and invite them, through media engagement and other means, to make recordings using their smartphones in exchange for payment. I suggested that they might like to try putting out a press release about marmalade to see whether the German media respond in the same way that the British media did.

Overall, my visit to Mannheim was a fantastic opportunity to learn about the FOLK project and to have some really interesting discussions about the aims of spoken corpus linguistics, and I would like to thank all at the IDS for their hospitality. I look forward to seeing Swantje again when CASS hosts her in Lancaster for a research visit in the Spring next year.