The Spoken BNC2014 is now available!

On behalf of Lancaster University and Cambridge University Press, it gives us great pleasure to announce the public release of the Spoken British National Corpus 2014 (Spoken BNC2014).

The Spoken BNC2014 contains 11.5 million words of transcribed informal British English conversation, recorded by (mainly English) speakers between the years 2012 and 2016. The situational context of the recordings – casual conversation among friends and family members – is designed to make the corpus broadly comparable to the demographically-sampled component of the original spoken British National Corpus.

The Spoken BNC2014 is now accessible online in full, free of charge, for research and teaching purposes. To access the corpus, you should first create a free account on Lancaster University’s CQPweb server, if you do not already have one. Once registered, please visit the BNC2014 website to (a) sign the corpus’ end-user licence and (b) register your CQPweb account, following the instructions on the site. When you return to CQPweb, you will have access to the Spoken BNC2014 via the link that appears in the list of ‘Present-day English’ corpora. While access is initially only via the CQPweb platform, the underlying corpus XML files and associated metadata will be available for download in Autumn 2018.

The BNC2014 website also contains much useful information about the corpus, including a downloadable manual and reference guide, which will be available soon. Further information, as well as the first research articles to use Spoken BNC2014 data, will appear in two in-press publications associated with the project: a special issue of the International Journal of Corpus Linguistics (due next month) and an edited collection in the Routledge ‘Advances in Corpus Linguistics’ series (due early 2018).

The BNC2014 does not end here: we are currently working on transcribing materials provided to us by the British Library, which will form a substantial supplement to the corpus. For now, we will be watching with interest to see what work today’s corpus release stimulates. As ever with corpus data, it does not enable all questions to be answered, but it does allow a very wide range of questions to be investigated.

The Spoken BNC2014 research team would like to express our gratitude to all who have had a hand in the creation of the corpus, and we hope that you enjoy exploring the data. We are, of course, keen to hear your feedback about the corpus; this, as well as any questions, can be directed to Robbie Love or Andrew Hardie.

Change of Leadership in CASS

Andrew Hardie is delighted to announce that he has handed over his role as CASS Centre Director to Elena Semino.

Elena has been Head of Department for Lancaster’s Department of Linguistics and English Language for 6 years, and has published widely in the areas of stylistics, metaphor theory, and medical humanities/health communication.

In Elena’s own words: 

‘It is a great honour and challenge to take over as CASS Director. Over the last four years, CASS has led the way nationally and internationally in the application of corpus methods to a wide range of social scientific problems, and has had a significant impact on research, policy and practice in many different contexts. I look forward to working with colleagues in Lancaster, and partners in the UK and around the world, to continue and extend this work in years to come.’


CASS PhD Student Tanjun Liu wins Best Poster Award at EUROCALL2017

In late August, I attended the 25th annual conference of EUROCALL (the European Association for Computer Assisted Language Learning) at the University of Southampton. This year’s theme encompassed how Computer-Assisted Language Learning (CALL) responds to changing global circumstances that impact on education. Over 240 sessions were presented, covering topics such as computer-mediated communication, MOOCs, social networking, corpora, European projects, and teacher education.


At this conference, I presented a poster entitled “Evaluating the effect of data-driven learning (DDL) on the acquisition of academic collocations by advanced Chinese learners of English”. DDL is a term coined by Tim Johns in 1991 to refer to the use of authentic corpus data in student-centred discovery learning activities. However, even though many corpus-based studies have recommended applying corpora in classroom teaching, DDL has not yet become mainstream teaching practice. My research therefore examines the contribution of DDL to the acquisition of academic collocations in the Chinese university context.


The corpus tool that I used in my research was #LancsBox, a newly developed tool created at CASS that has the capacity to create collocational networks through its GraphColl module. The poster presented a five-week pilot study of my research, the results of which show that the learners’ attitudes towards using #LancsBox were mostly positive, but that there were no statistically significant differences between using the corpus tool and an online collocations dictionary, which may be largely due to the very short intervention time in the pilot study. My poster also described the forthcoming main study, which will involve longer exposure and more EFL learners.


At this conference I was fortunate enough to win the EUROCALL2017 Best Poster Award (PhD), which was given to the best poster presented by a PhD student as nominated by conference delegates. Thank you to all of the delegates who voted for me to win this award and it was a real pleasure to attend such a wonderful conference!

How to Produce Vocabulary Lists

As part of the Forum discussion in Applied Linguistics, we have formulated some basic principles of corpus-based vocabulary studies and pedagogical wordlist creation and use. These principles can be summarised as follows:

  1. Explicitly define the vocabulary construct.
  2. Operationalize the vocabulary construct using transparent and replicable criteria.
  3. If using corpora, take corpus evidence seriously and avoid cherry-picking.
  4. Use multiple sources of evidence to test the validity of the vocabulary construct.
  5. Do not rely on your intuition/experience to determine what is useful for learners; collect evidence about learner needs to evaluate the usefulness of the list.
  6. Do not present learners with a decontextualized list of lexical items; use/create contextualized materials instead.
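
Principles 2 and 3 can be illustrated with a minimal sketch: rank candidate items by corpus frequency, but also require them to be dispersed across corpus parts, so the list reflects the corpus as a whole rather than a handful of texts. Everything here (the thresholds, the range-based dispersion measure, the toy corpus) is illustrative, not taken from the paper:

```python
from collections import Counter

def build_wordlist(corpus_parts, min_freq=5, min_range=0.5):
    """Rank candidate items by corpus frequency, keeping only those
    that also occur in a minimum proportion of corpus parts, so the
    selection criteria are transparent and replicable."""
    total = Counter()    # overall token frequencies
    doc_freq = Counter() # number of parts each word occurs in
    for part in corpus_parts:
        counts = Counter(t.lower() for t in part.split())
        total.update(counts)
        doc_freq.update(counts.keys())
    n_parts = len(corpus_parts)
    return [
        (word, freq)
        for word, freq in total.most_common()
        if freq >= min_freq and doc_freq[word] / n_parts >= min_range
    ]

parts = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "a cat and a dog sat down",
]
print(build_wordlist(parts, min_freq=3, min_range=0.6))
# → [('the', 4), ('sat', 3)]
```

Real wordlist projects use more sophisticated dispersion statistics, and, following principles 5 and 6, would also weigh evidence about learner needs and deliver the items in contextualised materials.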

To find out more, you can read:

Brezina, V. & Gablasova, D. (2017). How to Produce Vocabulary Lists? Issues of Definition, Selection and Pedagogical Aims. A Response to Gabriele Stein. Applied Linguistics, doi:10.1093/applin/amx022.

CASS Guided Reading project presented to the Society for the Scientific Study of Reading (SSSR)

In mid-July, it was my pleasure to represent CASS at the SSSR conference in Nova Scotia, Canada! Over 400 professionals attended, including language and literacy researchers, school teachers, and speech and language therapists.

My primary aim was to demonstrate how our CASS language development project is using corpus search methods to identify the effectiveness of teacher strategies that are being used in guided reading classroom interactions (also see part 1 & part 2 of my project introduction blogs). The best opportunity for this was during my poster presentation, which highlighted our first round of findings on the types of questions that teachers ask children.

We first demonstrated that teachers are paying attention to recommended guidelines to ask a lot of wh-questions (why, how, what, when, etc.): wh-questions typically make up around 20% of the questions asked in ordinary adult conversation, but made up 40% of the questions asked by teachers in our spoken classroom interactions.
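
For illustration, the proportion being compared can be computed with a crude sketch that treats any question beginning with a wh-word as a wh-question. The real analysis uses corpus search methods and careful coding; this is only a toy approximation:

```python
WH_WORDS = {"who", "what", "when", "where", "why", "which", "how"}

def wh_proportion(questions):
    """Proportion of questions whose first word is a wh-word
    (a crude proxy for the wh-question category)."""
    if not questions:
        return 0.0
    wh = sum(
        1 for q in questions
        if q.split() and q.split()[0].lower().strip("?,.") in WH_WORDS
    )
    return wh / len(questions)

qs = [
    "Why did he run away?",
    "What happens next?",
    "Did you like it?",
    "How does she feel?",
    "Is that right?",
]
print(wh_proportion(qs))  # 3 of 5 → 0.6
```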

Second, the poster presented initial findings on our developmental question of whether teachers of older children ask more challenging question types than teachers of younger children. However, our chosen categories of question type were (thus far) used equivalently across year groups, which prompts a follow-up examining whether finer-grained categories of question type differ in their proportions of use across year groups.

Third, the poster reported that teachers at schools in low socio-economic-status (SES) regions asked a higher proportion of wh-questions than teachers at schools in high SES regions. Most viewers of the poster agreed that this prompts us to look at children’s responses: the high proportion of wh-questions asked by teachers at schools in low SES regions might be shaped by less engaged answers from low SES children that require more follow up wh-questions relative to the typically more engaged answers provided by high SES children.

Although a number of other posters throughout the week examined classroom interactions, none had taken advantage of the precise, fast and reliable search methods that we are using. Attendees were therefore very impressed by how we have been able to interrogate our large corpus without being restricted by the amount of manual coding that can be achieved within a realistic time window.

Finally, a big thanks to CASS and SSSR for making this visit possible. As well as the incredible learning opportunities provided by the wide range of high-quality presentations on reading research, I had a good time meeting the fun and interesting conference attendees – and local Canadians too! Nova Scotia is a beautiful place to visit, with a very friendly and youthful demographic.

Liam will be presenting a talk on this project at the Corpus Linguistics 2017 conference on Wednesday 26th July at 4pm in Lecture Theatre 117, Physics Building, University of Birmingham. For updates, watch this space and Twitter: @CorpusSocialSci @LiamBlything



CASS go to ICAME38!

Researchers from CASS recently attended the ICAME38 conference at Charles University in Prague. Luckily, we arrived in Prague a day early which gave us plenty of time to explore the city. The weather was sunny, so we walked to Wenceslas Square, and then took the lift to the top of the Old Town Hall Tower to enjoy the views over the city.

The following day, it was time to begin the conference! Over the course of the event, seven CASS members presented their research (you can view full abstracts of all talks here). Up first was Robbie Love, presenting “FUCK in spoken British English revisited with the Spoken BNC2014”. By replicating the approaches of McEnery & Xiao (2004) on the new data contained in the Spoken BNC2014, Robbie found, among other things, that FUCK is now used equally by men and women, and that use of FUCK peaks when speakers are in their 20s and then decreases with age, apart from the 60-69 group which has a higher frequency than the 50-59 group.

Also discussing the BNC2014 project was Abi Hawtin, who presented “The British National Corpus Revisited: Developing parameters for Written BNC2014.” Abi discussed the progress on the project so far, and gave the audience a chance to look at the sampling frame which has been designed for the corpus. Abi also highlighted the difficulty of collecting certain text types, particularly published books.

Amelia Joulain-Jay presented “Describing collocation patterns in OCR data: are MI and LL reliable?” Amelia discussed the fact that data which has been digitized using OCR procedures often has low levels of accuracy, and how this can affect corpus analysis. Amelia tested the reliability of Mutual Information statistics and Log Likelihood statistics when working with OCR data, and found that, among other things, Mutual Information and Log Likelihood attract high rates of false positives. However, she also found that correcting OCR data using Overproof makes a positive difference for both statistics.
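
For readers unfamiliar with the two statistics, both can be computed from the 2×2 contingency table of node and collocate frequencies. The sketch below uses the standard textbook formulas (base-2 log for Mutual Information; the four-cell log-likelihood) and is, of course, not the code used in the study:

```python
import math

def mi_ll(f_joint, f_node, f_coll, n):
    """Mutual Information and log-likelihood for a node-collocate
    pair, from the usual 2x2 contingency table. Assumes f_joint > 0."""
    # MI: observed co-occurrence vs. expected under independence
    e11 = f_node * f_coll / n
    mi = math.log2(f_joint / e11)
    # LL: sum over all four cells of the contingency table
    obs = [f_joint, f_node - f_joint, f_coll - f_joint,
           n - f_node - f_coll + f_joint]
    rows = [f_node, n - f_node]
    cols = [f_coll, n - f_coll]
    exp = [rows[i] * cols[j] / n for i in range(2) for j in range(2)]
    ll = 2 * sum(o * math.log(o / e) for o, e in zip(obs, exp) if o > 0)
    return mi, ll

# toy figures: collocate pair seen 30 times in a 100,000-token corpus
mi, ll = mi_ll(f_joint=30, f_node=100, f_coll=200, n=100_000)
print(round(mi, 2), round(ll, 1))
```

MI rewards exclusivity (hence its known bias towards rare pairs), while LL tests the strength of evidence against independence; the two can rank the same collocate lists very differently, which is why their behaviour on noisy OCR data needs separate evaluation.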

CASS director Andrew Hardie also presented research using OCR data, in a talk titled “Plotting and comparing corpus lexical growth curves as an assessment of OCR quality in historical news data”. Andrew further drew our attention to the number of errors, or ‘noise’, in OCR data, and showed that if a graph is constructed of the number of tokens observed versus the count of types at intervals (say, every 10,000 tokens), a curve characteristic of lexical growth over the span of a given corpus emerges. Visual comparison of lexical growth curves among historical collections, or against modern corpora, therefore gives a good impression of the relative extent of OCR noise, and thus some estimate of how much such noise will impede analysis.
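
The measure itself is easy to sketch: accumulate the set of distinct types and record its size at fixed token intervals. The toy corpus and interval below are illustrative only (OCR noise inflates the type count, bending the curve upward relative to clean text):

```python
def lexical_growth(tokens, interval=10_000):
    """Record the number of distinct types seen after each
    `interval` of running tokens, giving the points of a
    lexical growth curve."""
    seen = set()
    curve = []
    for i, tok in enumerate(tokens, start=1):
        seen.add(tok.lower())
        if i % interval == 0:
            curve.append((i, len(seen)))
    return curve

toks = "a b c a b d e a f g".split() * 3  # toy "corpus", 30 tokens
print(lexical_growth(toks, interval=10))
# → [(10, 7), (20, 7), (30, 7)]
```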

Also presenting was Dana Gablasova who discussed “A corpus-based approach to the expression of subjectivity in L2 spoken English: The case of ‘I + verb’ construction”. Dana used the Trinity Lancaster Corpus (TLC) to investigate the ‘I + verb’ construction in L1 Spanish and Italian speakers aged over 20 years. Dana found that with the increase in proficiency the frequency of emotive verbs decreased while the frequency of the epistemic verbs increased considerably. The study also identified the most frequent cognitive and emotive verbs and the trends in their use according to the proficiency level of L2 users.

Vaclav Brezina (and Matt Timperley, who was unfortunately not able to attend the conference) gave a software demonstration of #LancsBox – a new-generation corpus analysis tool developed at CASS. Vaclav showed that #LancsBox can:

  • Search, sort and filter examples of language use.
  • Compare frequency of words and phrases in multiple corpora and subcorpora.
  • Identify and visualise meaning associations in language (collocations).
  • Compute and visualize keywords.
  • Use a simple but powerful interface.
  • Support a number of advanced features such as customisable statistical measures.

#LancsBox can be downloaded for free from the tool website.

Dana and Vaclav also gave a presentation together, titled “MI-score-based collocations in language learning research: A critical evaluation.” Dana and Vaclav identified several issues in the use of MI-score as a measure in language learning research, and used data from the BNC and TLC to:

  • place the MI-score in the context of other similar association measures and discuss the similarities and differences directly relevant to LLR;
  • propose general principles for the selection of association measures in LLR.

Finally, former CASS senior research associate Laura Paterson, who recently moved to a lectureship at the Open University, presented “Visualising corpora using Geographical Text Analysis (GTA): (Un)employment in the UK, a case study”, which stemmed from her work on the CASS Distressed Communities project. Laura showed how GTA can be used to generate maps from concordance lines. She showed lots of interesting data visualisations and highlighted the way in which GTA allows the researcher to visualise their corpus and adds a consideration of physical space to language analysis.

Aside from all of the fascinating talks, ICAME38 also had a brilliant social programme. We were able to go on two boat trips along the river. The first gave us brilliant views of the city, and the second allowed us to get much closer to the bridges and buildings that line the river. The Gala dinner was also great fun – we had a linguistics-themed menu and, best of all, an Abba tribute band!

Thank you to all of the organisers of ICAME38 for such an enjoyable and well-organised conference!


Data-driven learning: learning from assessment

The process of converting valuable spoken corpus data into classroom materials is not necessarily straightforward, as a recent project conducted by Trinity College London reveals.

One of the buzzwords we increasingly hear from teacher trainers in English Language Teaching (ELT) is data-driven learning. This ties in with other contemporary pedagogies, such as discovery learning. A key component of this is how data from a corpus can be used to inform learning. One of our long-running projects with the Trinity Lancaster Corpus has been to see how we could use the spoken data in the classroom so that students could learn from assessment as well as for assessment. We have reported before (From Corpus to Classroom 1 and From Corpus to Classroom 2) on the research focus on pragmatic and strategic features. These linguistic features and competences are often not practised – or are only superficially addressed – in course books, and yet they can be significant in enhancing learners’ communication skills, especially across cultures. Our ambition is to translate the data findings for classroom use, specifically to help teachers improve learners’ wider speaking competences.

We developed a process of constructing sample worksheets based on, and including, the corpus data. The data was contextualized and presented to teachers in order to give them an opportunity to use their expertise in guiding how this data could be developed for, and utilized in, the classroom. So, essentially, we asked teachers to collaborate on checking how useful the data and tasks were and potentially improving these tasks. We also asked teachers to develop their own tasks based on the data and we now have the results of this project.

Overwhelmingly, the teachers were very appreciative of the data and they each produced some great tasks. All of these were very useful for the classroom but they did not really exploit the unique information we identified as being captured in the data. We have started exploring why this might be the case.

What the teachers did was the following:

  • Created noticing and learner autonomy activities with the data (though most tasks would need much more scaffolding).
  • Focused on traditional information about phrases identified in the data, e.g. the strength and weakness of expressions of agreement.
  • Created activities that reflected traditional course book approaches.
  • Created reflective, contextual practice related to the data although this sometimes became lost in the addition of extra non-corpus texts.

We had expectations that the data would inspire activities which:

  • showed new ways of approaching the data
  • supported discovery learning tasks with meaningful outcomes
  • explored the context and pragmatic functions of the data
  • reflected pragmatic usage; perhaps even referring to L1 as a resource for this
  • focused on the listener and interpersonal aspects rather than just the speaker

It was clear that the teachers were intellectually engaged and excited, so we considered the reasons why their tasks had taken a more traditional path than expected. Many of these have been raised in the past by Tim Johns and Simon Borg. There is no doubt that the heavy teacher workload affects how far teachers feel they can be innovative with materials. There is a surety in doing what you know and what you know works. Also, many teachers, despite being in the classroom every day, often need a certain confidence to design input, as this is traditionally something that has been left to syllabus and course book creators. We also realised that teachers would probably need more support in understanding corpus data, and many don’t have the time for extra training. Finally, with this particular data, teachers may not be fully aware of the importance of pragmatic and strategic competences. These are often seen as an ‘add-on’ rather than a core competence, especially in contemporary communicative contexts where English is largely used as a lingua franca.

Ultimately, there was a difference between what the researchers ‘saw’ and what the teachers ‘saw’. As an alternative, we asked a group of expert material writers to produce new tasks and they have produced some innovative material. We concluded that maybe this is a fairer approach. In other words, instead of expecting each of the roles involved in language teaching (SLA researchers, teachers, materials designers) to find the time to become experts in new skills, it may sometimes be better to use each other as a resource. This would still be a learning experience as we draw on each other’s expertise.

In future, if we want teachers to collaborate on designing materials, we must make sure we discuss the philosophy or pedagogy behind our objectives (Rapti, 2013) with our collaborators, show how the data maps to relevant curricula, and recognise the restrictions caused by practical issues such as a lack of time or training opportunities.

The series of worksheets is now available from the Trinity College London website. More will follow, so keep checking.

Corpus-based insights into spoken L2 English: Introducing eight projects that use the Trinity Lancaster Corpus

In November 2016, we announced the Early Data Grant Scheme in which researchers could apply for access to the Trinity Lancaster Corpus (TLC) before its official release in 2018.  The Early Data subset of the corpus contains 2.83 million words from 1,244 L2 speakers.

The Trinity Lancaster Corpus project is a product of an ongoing collaboration between The ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, and Trinity College London, a major international examination board. The Trinity Lancaster Corpus contains several features (rich metadata, a range of proficiency levels, L1s and age groups) that make it an important resource for studying L2 English. Soon after we started working on the corpus development in 2013, we realised the great potential of the dataset for researchers in language learning and language testing. We were very excited to receive a number of outstanding applications from around the world (Belgium, China, Germany, Italy, Spain, UK and US).  The selected projects cover a wide range of topics focusing on different aspects of learner language use. In the rest of this blog post we introduce the successful projects and their authors.

  1. Listener response in examiner-EFL examinee interactions

Erik Castello and Sara Gesuato, University of Padua

The term listener response is used to denote (non-)verbal behaviour produced in reaction to an interlocutor’s talk and sharing a non-turn status, e.g. short verbalisations, sentence completion, requests for clarifications, restatements, shakes, frowns (Xudong 2009). Listener response is a form of confluence-oriented behaviour (McCarthy 2006) which contributes to the construction and smooth handling of conversation (Krauss et al. 1982). Response practices can vary within the same language/culture in terms of placement and function in the turn sequence and the roles played by the same listener response types (Schiffrin 1987; Gardner 2007). They can also vary across cultures/groups (Cutrone 2005; Tottie 1991) and between the sexes (Makri-Tsilipakou 1994; Rühlemann 2010). Therefore, interlocutors from different linguistic/cultural backgrounds may experience communication breakdown, social friction and the emergence of negative attitudes (Wieland 1991; Li 2006), including participants in examiner-EFL examinee interactions (Götz 2013) and in EFL peer-to-peer interactions (Castello 2013). This paper explores the listener response behaviour of EFL examinees in the Trinity Lancaster Corpus (Gablasova et al. 2015), which may display interference from the examinees’ L1s and affect the examiners’ impression of their fluency. It aims to: identify forms of verbal listener responses in examinee turns and classify them in terms of conventions of form (mainly following Clancy et al. 1996) and conventions of function (mainly following Maynard 1997); identify strategies for co-constructing turn-taking, if any (Clancy/McCarthy 2015); and determine the frequencies of occurrence of the above phenomena across types of interaction, examinees’ perceived proficiency levels and between the sexes.

Erik Castello is Assistant Professor of English Language and Translation at the University of Padua, Italy. His research interests include (learner) corpus linguistics, discourse analysis, language testing, academic English and SFL. He has co-edited two volumes and published two books and several articles on these topics.

Sara Gesuato is Associate Professor of English language at the University of Padua, Italy. Her research interests include pragmatics, genre analysis, verbal aspect, and corpus linguistics. She has co-edited two volumes on pragmatic issues in language teaching, and is currently investigating sociopragmatic aspects of L2 written speech acts.

  2. Formulaic expressions in learner speech: New insights from the Trinity Lancaster Corpus

Francesca Coccetta, Ca’ Foscari University of Venice

This study investigates the use of formulaic expressions in the dialogic component of the Trinity Lancaster Corpus. Formulaic expressions are multi-word units serving pragmatic or discourse structuring functions (e.g. discourse markers, indirect forms performing speech acts, and hedges), and their mastery is essential for language learners to sound more native-like. The study explores the extent to which the Trinity exam candidates use formulaic expressions at the various proficiency levels (B1, B2 and C1/C2), and the differences in their use between successful and less successful candidates. In addition, it investigates how the exam candidates compare with native speakers in the use of formulaic expressions. To do this, recurrent multi-word units consisting of two to five words will be automatically extracted from the corpus using Sketch Engine; then, the data will be manually filtered to eliminate unintentional repetitions, phrase and clause fragments (e.g. in the, it and, of the), and the multi-word units that do not perform any pragmatic or discourse function. The high-frequency formulaic expressions of each proficiency level will be provided and compared with each other and with the ones identified in previous studies on native speech. The results will offer new insights into learners’ use of prefabricated expressions in spoken language, particularly in an exam setting.
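
The automatic first step of this pipeline (extracting recurrent two- to five-word sequences before the manual filtering) can be sketched as follows. The study itself uses Sketch Engine; this toy version only illustrates the idea, with the fragment examples taken from the abstract:

```python
from collections import Counter

# fragment types named in the abstract as candidates for removal
FRAGMENTS = {"in the", "it and", "of the"}

def candidate_mwus(tokens, n_min=2, n_max=5, min_freq=2):
    """Extract recurrent 2-5 word sequences by frequency, dropping
    known fragment types; deciding which survivors actually perform
    a pragmatic or discourse function is done manually."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return [(gram, freq) for gram, freq in counts.most_common()
            if freq >= min_freq and gram not in FRAGMENTS]

text = "you know I mean it and you know I mean it"
print(candidate_mwus(text.lower().split()))
```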

Francesca Coccetta is a tenured Assistant Professor at Ca’ Foscari University of Venice. She holds a doctorate in English Linguistics from Padua University where she specialised in multimodal corpus linguistics. Her research interests include multimodal discourse analysis, learner corpus research, and the use of e-learning in language learning and teaching. 

  3. The development of high-frequency verbs in spoken EFL and ESL

Gaëtanelle Gilquin, Université catholique de Louvain

This project aims to contribute to the recent effort to bridge the paradigm gap between second language acquisition research and corpus linguistics. While most such studies have relied on written corpus data to compare English as a Foreign Language (EFL) and English as a Second Language (ESL), the present study will take advantage of a new resource, the Trinity Lancaster Corpus, to compare speech in an EFL variety (Chinese English) and in an ESL variety (Indian English). The focus will be on high-frequency verbs and how their use develops across proficiency levels in the two varieties, as indicated by the CEFR scores provided in the corpus. Various aspects of language will be considered, taking high-frequency verbs as a starting point, among which grammatical complexity (e.g. through the use of infinitival constructions of the causative type), idiomaticity (e.g. through the degree of typicality of object nouns) and fluency (e.g. through the presence of filled pauses in the immediate environment). The assumption is that, given the different acquisitional contexts of EFL and ESL, one and the same score in EFL and ESL may correspond to different linguistic realities, and that similar developments in scores (e.g. from B1 to B2) may correspond to different developments in language usage. More particularly, it is hypothesised that EFL speakers will progress more rapidly in aspects that can benefit from instruction (e.g. those involving grammatical rules), whereas ESL speakers will progress more rapidly in aspects that can benefit from exposure to naturalistic language (like phraseology).

Gaëtanelle Gilquin is a Lecturer in English Language and Linguistics at the University of Louvain. She is the coordinator of LINDSEI and one of the editors of The Cambridge Handbook of Learner Corpus Research. Her research interests include spoken learner English, the link between EFL and ESL, and applied construction grammar.

  4. Describing fluency across proficiency levels: From ‘can-do statements’ towards learner-corpus-informed descriptions of proficiency

Sandra Götz, Justus Liebig University Giessen

While it has been noted that current assessment scales describing learners’ proficiency levels in ‘can-do statements’ (e.g. the Common European Framework of Reference, CEF; Council of Europe 2009) are often formulated somewhat vaguely (e.g. North 2014), researchers and CEF developers have pointed out the benefits of including more specific linguistic descriptors emerging from learner corpus analyses (e.g. McCarthy 2013; Park 2014). In this project, I will test how, and whether, descriptions of fluency in learner language such as those in the CEF can benefit from analysing learner data at different proficiency levels in the Trinity Lancaster Corpus. More specifically, I will test whether learners’ proficiency levels can serve as robust predictors of their use of core fluency variables, such as filled and unfilled pauses (e.g. er, erm, eh, ehm), discourse markers (e.g. you know, like, well), or small words (e.g. sort of, kind of). I will also test whether learners show similar or different paths in their developmental stages of fluency from the B1 to the C2 level, regardless of (or dependent on) their L1. Through the meta-information available on the learners in the Trinity Lancaster Corpus, sociolinguistic and learning-context variables (such as the learners’ age, gender or the task type) will also be taken into consideration in developing data-driven descriptor scales for fluency at different proficiency levels. Thus, it will be possible to differentiate between L1-specific and universal learner features in fluency development.
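
One of the core fluency variables mentioned, filled pauses, reduces to a simple normalised count once an utterance is tokenised. The sketch below is illustrative only; the per-100-tokens normalisation is an assumption, not a detail of the project:

```python
FILLED_PAUSES = {"er", "erm", "eh", "ehm"}  # the forms listed above

def fluency_profile(tokens):
    """Filled pauses per 100 tokens, one candidate 'core fluency
    variable' to relate to proficiency level."""
    pauses = sum(1 for t in tokens if t.lower() in FILLED_PAUSES)
    return 100 * pauses / len(tokens)

utt = "well er I think erm it is er quite difficult to say".split()
print(fluency_profile(utt))  # 3 pauses in 12 tokens → 25.0
```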

Sandra Götz obtained her PhD from Justus Liebig University Giessen and Macquarie University Sydney in 2011. Since then, she has been working as a Senior Lecturer in English Linguistics at University of Giessen. Her main research interests include (learner) corpus linguistics and its application to language teaching and testing, applied linguistics and World Englishes.

  5. Self-repetition in the spoken English of L2 English learners: The effects of task type and proficiency levels

Lalita Murty, University of York

Self-repetition (SR), where the speaker repeats a word or phrase, is a much-observed phenomenon in spoken discourse. SR serves a range of distinct communicative and interactive functions, such as expressing agreement or disagreement or adding emphasis to what the speaker wants to say, as the following example shows: ‘Yes, I know I know and I certainly think that limits are…’ (expressing agreement with the previous speaker) (Gablasova et al., 2015). Self-repetitions also help in creating coherence (Bublitz, 1989, as cited in Fung, 2007: 224), enhancing the clarity of the message (Kaur, 2012), keeping the floor, maintaining the smooth flow of conversation, linking the speaker’s ideas to a previous speaker’s ideas (Tannen, 1989), and initiating self- and other-repairs (Bjorkman, 2011; Robinson and Kevoe-Feldman, 2010). This paper will use Sketch Engine to extract instances of single-content-word self-repetitions from the Trinity Lancaster Corpus data, to examine the effects of (i) L2 proficiency level and (ii) task type on the frequency and functions of the different types of self-repetition made by speakers. A quantitative and qualitative analysis of the extracted data will be conducted using a mix of Norrick’s (1987) framework and CA approaches.
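
A first pass at extracting single-word self-repetitions can be sketched as below. Note that phrase-level repetitions such as ‘I know I know’ would require n-gram matching, and the study itself uses Sketch Engine; this toy code merely illustrates the phenomenon being counted:

```python
def self_repetitions(tokens):
    """Positions where a word immediately repeats: a first-pass
    candidate list for single-word self-repetition. Classifying
    each candidate's function (emphasis, repair, floor-keeping,
    etc.) still requires manual analysis."""
    return [(i, tokens[i]) for i in range(1, len(tokens))
            if tokens[i].lower() == tokens[i - 1].lower()]

utt = "yes I know I know and it's it's really really hard".split()
print(self_repetitions(utt))
# → [(7, "it's"), (9, 'really')]
```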

Lalita Murty is a Lecturer at the Norwegian Study Centre, University of York.  Her previous research focused on spoken word recognition and call centre language. Currently she is working on Reduplication and Iconicity in Telugu, a South Indian language.

Certainty adverbs in learner language: The role of tasks and proficiency

Pascual Pérez-Paredes, University of Cambridge and María Belén Díez-Bedmar, University of Jaén

When comparing native and non-native use of stance adverbs, the effect of task has been largely ignored. An exception is Gablasova et al. (2015), who researched the effect of different speaking tasks on L2 speakers’ use of epistemic stance markers and found a significant difference between the monologic prepared task and every other task, and between the dialogic general topic and the dialogic pre-selected topic (p < .05). This study suggests that the type of speaking task conditions speakers’ repertoire of markers, including certainty markers. Pérez-Paredes & Bueno (forthcoming) looked at how certainty stance adverbs were employed during the picture description task in the LINDSEI and the extended LOCNEC (Aguado et al., 2012). In particular, the authors discussed the contexts of use of obviously, really and actually by native and non-native speakers across the same speaking task in the four datasets when expressing the range of meanings associated with certainty. They found that different groups of speakers used these adverbs differently, both quantitatively and qualitatively. Our research seeks to expand the findings of Gablasova et al. (2015) and Pérez-Paredes & Bueno (forthcoming) by examining the uses of certainty adverbs across the L1s, proficiency levels and tasks represented in the Trinity Lancaster Corpus. We believe that the use of this corpus, together with the findings from the LINDSEI, will help us reach a better understanding of the uses of certainty adverbs in spoken learner language.

Pascual Pérez-Paredes is a Lecturer in Research in Second Language Education at the Faculty of Education, University of Cambridge. His main research interests are learner language variation, the use of corpora in language education and corpus-assisted discourse analysis.

María Belén Díez-Bedmar is Associate Professor at the University of Jaén (Spain). Her main research interests include Learner Corpus Research, error-tagging, the learning of English as a Foreign Language, language testing and assessment, the CEFR and CMC.  She is currently involved in national and international corpus-based projects.

Emerging verb constructions in spoken learner English

Ute Römer and James Garner, Georgia State University

Recent research in first language (L1) and second language (L2) acquisition has demonstrated that we learn language by learning constructions, defined as conventionalized form-meaning pairings. While studies in L2 English acquisition have begun to examine construction development in learner production data, these studies have been based on rather small corpora. Using a larger set of data from the Trinity Lancaster Corpus (TLC), this study investigates how verb-argument constructions (VACs; e.g. ‘V about n’) emerge in the spoken English of L2 learners at different proficiency levels. We will systematically and exhaustively extract a small set of VACs (‘V about n’, ‘V for n’, ‘V in n’, ‘V like n’, and ‘V with n’) from the L1 Italian and L1 Spanish subsets of the TLC, separately for three CEFR proficiency levels. For each VAC and L1-proficiency combination (e.g. Italian-B1), we will create frequency-sorted verb lists, allowing us to determine how learners’ verb-construction knowledge develops with increasing proficiency. We will also examine in what ways VAC emergence in the TLC data is influenced by VAC usage as captured in a large native-speaker reference corpus (the BNC). We will use chi-square tests to compare VAC type and token frequencies across L1 subsets and proficiency levels. We will use path analysis (a type of structural equation modeling) including the predictor variables L1 status, proficiency level, and BNC usage information to gain insights into how learner characteristics and variables concerning L1 construction usage affect the emergence of the target VACs in spoken L2 learner English.
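The chi-square comparison mentioned above is the standard Pearson statistic over a contingency table of observed frequencies. As a minimal sketch (the counts below are invented purely for illustration, not real TLC frequencies):

```python
def chi_square(table):
    """Pearson chi-square statistic for an r x c table of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected count under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand
            stat += (obs - expected) ** 2 / expected
    return stat

# Invented example: token counts of one VAC vs. the other target VACs
# at two proficiency levels (rows). NOT real corpus figures.
observed = [[30, 70],
            [45, 55]]
print(round(chi_square(observed), 3))
```

The resulting statistic would then be compared against the chi-square distribution with (r−1)(c−1) degrees of freedom; in practice a library routine such as scipy.stats.chi2_contingency also returns the p-value directly.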

Ute Römer is currently Assistant Professor in the Department of Applied Linguistics and ESL at Georgia State University. Her research interests include corpus linguistics, phraseology, second language acquisition, discourse analysis, and the application of corpora in language teaching. She serves on a range of editorial and advisory boards of professional journals and organizations, and is general editor of the Studies in Corpus Linguistics book series.

James Garner is currently a PhD student in the Department of Applied Linguistics and ESL at Georgia State University. His current research interests include learner corpus research, phraseology, usage-based second language acquisition, and data-driven learning.

Verb-argument constructions in Chinese EFL learners’ spoken English production

Jiajin Xu and Yang Liu, Beijing Foreign Studies University

The widespread recognition of the usage-based approach to constructions has made corpus linguistics a highly viable methodology for scrutinising frequent morpho-syntactic patterns such as verb-argument constructions (VACs) in learner language. The present study examines the use of VACs in Chinese EFL learners’ spoken English. Our focus will be on the semantics of the verbal constructions in light of collostructional statistics (Stefanowitsch & Gries, 2003), as well as on comparisons across learners’ proficiency levels and task types. Twenty VACs were collected from COBUILD Grammar Patterns 1: Verbs (Francis, Hunston & Manning, 1996). On the basis of the VAC concordances retrieved from the Trinity Lancaster Corpus, the semantic prototypicality of the VACs will be analysed according to the collocational strength of verbs with their host constructions. Comparisons will be made between Chinese EFL learners and native speakers, as well as across task types. It is hoped that our findings will shed light on Chinese EFL learners’ knowledge of VACs and on the crosslinguistic influence that affects the verb semantics of learners’ spoken English. We also consider language proficiency and task type as potential factors that may account for differences across CEFR groups within the Chinese EFL learner data.
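Collostructional strength is typically computed from a 2×2 table crossing a verb with a construction. As a hedged sketch (Stefanowitsch & Gries themselves favour the Fisher-Yates exact test; the log-likelihood score below is a common alternative association measure, and the counts are invented):

```python
import math

def log_likelihood(table):
    """G-squared (log-likelihood ratio) score for a 2x2 contingency table:
    [[verb in construction, verb elsewhere],
     [other verbs in construction, other verbs elsewhere]]."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand = sum(row_totals)
    g2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            expected = row_totals[i] * col_totals[j] / grand
            if obs > 0:  # terms with zero observed count contribute nothing
                g2 += 2 * obs * math.log(obs / expected)
    return g2

# Invented counts for one hypothetical verb-construction pairing.
print(round(log_likelihood([[120, 380], [80, 9420]]), 2))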

Jiajin Xu is Professor of Linguistics at the National Research Centre for Foreign Language Education, Beijing Foreign Studies University as well as secretary general and a founding member of the Corpus Linguistics Society of China. His research interests include discourse studies, second language acquisition, contrastive linguistics and translation studies, and corpus linguistics.

Yang Liu is currently a PhD candidate at Beijing Foreign Studies University. His research focus is on the corpus-based study of construction acquisition of Chinese EFL learners.


Introducing a new project with the British Library

Since 2012 the BBC has been working with the British Library to build a collection of intimate conversations from across the UK: the BBC Listening Project. Through its network of local radio stations, and with the help of a travelling recording booth, the BBC has captured many conversations between people who are well known to one another, on a range of topics, in high-quality audio.

For the past two years we have been discussing with the BBC and the British Library the possibility of using these recordings as the basis of a large-scale extension of our spoken BNC corpus. The Spoken BNC2014 has so far been built to reflect language in intimate settings, with recordings made in the home. This has led to a large and very useful collection of data but, without the resources of an organization such as the BBC, we were not able to roam the country with a sound recording booth to sample language from John o’Groats to Land’s End! By teaming up with the BBC and the British Library we can supplement this very useful corpus, which is strongly focused on a ‘hard to capture’ context – intimate conversations in the home – with another type of data: intimate conversations in a public situation, sampled from across the UK.

Another way in which the Listening Project data should prove helpful to linguists is that it was captured in a recording studio as high-quality audio. Our hope is that a corpus based on this material will be of direct interest and use to phoneticians.

We have recently concluded our discussions with the British Library, which is archiving this material, and signed an agreement under which CASS will undertake orthographic transcription of the data. Our goal is to provide a high-quality transcription that will be of use both to linguists and to members of the public who may wish to browse the collection. In doing this we will be building on our experience of producing the Trinity Lancaster Corpus of Spoken Learner English and the Spoken BNC2014.

We take our first delivery of recordings at the beginning of March and are very excited at the prospect of lifting the veil a little further on the fascinating topic of everyday conversation and language use. The plan is to transcribe up to 1,000 of the recordings archived at the British Library. We will also be working to time-align the transcriptions with the sound recordings, and are working closely with the strong phonetics team in the Department of Linguistics and English Language at Lancaster University to begin to assess the extent to which this new dataset could facilitate new work on, for example, the accents of the British Isles.

Our partners at the British Library are just as excited as we are. Jonnie Robinson, Lead Curator for Spoken English at the British Library, says: ‘The British Library is delighted to enable Lancaster to make such innovative use of the Listening Project conversations and we look forward to working with them to make the collection more accessible and to enhance its potential to support linguistic and other research enquiries’.

Keep an eye on the CASS website and Twitter feed over the next couple of years for further updates on this new project!

Analysing Corporate Communications

Detecting the structure of annual financial reports and extracting their contents for further corpus analysis has never been easier. The UCREL Corporate Financial Information Environment (CFIE) project and CASS’ Corporate Communications sub-project have now released the CFIE-FRSE Final Report Structure Extractor: a desktop application that detects the structure of UK annual reports and extracts the reports’ contents at the section level. This extraction step is vital for the analysis of UK reports, which adopt a much more flexible structure than their US equivalent, the 10-K. The CFIE-FRSE tool is a desktop version of our CFIE-FRSE web tool.

The tool provides batch extraction and analysis of English-language PDF annual report content. Crucially, our approach preserves the structure of the underlying report (as represented by the document’s table of contents) and therefore offers a clear delineation between the narrative and financial statement components, as well as facilitating analysis of the narrative component on a schedule-by-schedule basis.

The tool was trained using more than 10,000 UK annual reports. Extraction accuracy exceeds 95% against manual validations, and large-sample tests confirm that extracted content varies predictably with economic and regulatory factors.

Accessing the tool:

The tool is available for direct download from the GitHub link below:

GitHub Repository:

The CFIE-FRSE tool:

  • Detects the structure of UK annual reports, identifying the key sections and their start and end pages, and extracts their contents.
  • Extracts the text of each section in a plain text file format.
  • Splits the text of each section into sentences using the Stanford Sentence Splitter.
  • Provides a Section Classification mechanism to detect the type of the extracted section.
  • Each extracted section will be annotated with a number between 0 and 8 as follows:

    Header Type   Header
    1             Chairman’s Statement
    2             CEO Review
    3             Corporate Governance Report
    4             Directors’ Remuneration Report
    5             Business Review
    6             Financial Review
    7             Operating Review
    8             Highlights
    0             A section that is none of the above
  • The tool uses Levenshtein distance, other similarity metrics and synonym lists for section classification. For example, ‘Chairman’s letter’ and ‘letter to shareholders’ can still be detected as a Type 1 section (Chairman’s Statement).
  • The analysis results for the uploaded files or reports can be found in a subdirectory named following the pattern “FileName_Analysis”. For example, if you upload a file called XYZCompany.pdf, the results will be in a subdirectory called XYZCompany_Analysis.
  • Analysis outputs are saved in Comma-Separated Value (CSV) file format, which can be opened with any spreadsheet editor.
  • The tool provides additional fields in the Sections_Frequencies.csv file, which can be found in the Analysis subdirectory. The new fields are:
    • Start and End pages of each section.
    • Readability of each extracted section, and of the whole report, using the Fog and Flesch readability metrics.
    • Keyword frequencies, using preloaded keyword sets for Forward Looking, Positivity, Negativity and Uncertainty.
    • Report Year: this only works if the year is part of the file name, e.g. “XYZCompany_2015.pdf”.
    • Performance Flag: shows 1 if a section is a performance section, 0 otherwise.
    • Strategy Flag: shows 1 if a section is a strategic section, 0 otherwise.
    • Booklet Flag: shows 1, 2 or 3 if the report has a booklet layout, 0 otherwise. The tool is unable to process booklet annual reports (reports where two pages are combined into one PDF page). The numbers 1–3 indicate how confident the system is: 1 means a suspected booklet layout, 3 a definite booklet layout.
  • The keyword lists (Forward Looking, Uncertainty, Positivity and Negativity) have been updated to eliminate duplicates and encoding errors.
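The classification idea (nearest canonical header by edit distance, backed by a synonym list) can be illustrated with a small sketch. The canonical strings, synonym table and distance threshold here are hypothetical, not the tool’s actual values:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Hypothetical canonical headers for types 1-8 (type 0 = none of the above).
CANONICAL = {
    1: "chairman's statement", 2: "ceo review",
    3: "corporate governance report", 4: "directors' remuneration report",
    5: "business review", 6: "financial review",
    7: "operating review", 8: "highlights",
}

# Hypothetical synonym table mapping alternative wordings to a type.
SYNONYMS = {"chairman's letter": 1, "letter to shareholders": 1}

def classify(header, max_rel_dist=0.3):
    """Return the section type for a header string, or 0 if nothing is close."""
    h = header.lower().strip()
    if h in SYNONYMS:
        return SYNONYMS[h]
    best_type, best = 0, float("inf")
    for t, name in CANONICAL.items():
        d = levenshtein(h, name) / max(len(h), len(name))  # relative distance
        if d < best:
            best_type, best = t, d
    return best_type if best <= max_rel_dist else 0
```

With this sketch, a slightly misspelt header such as "Chairmans Statement" still lands on Type 1, while an unrelated header falls through to Type 0.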

How to run the software:

  • [MS Windows]: To run the tool, simply clone the repository to your machine, place your PDF annual reports in the pdfs directory and run (double-click) the runnable.bat file.
  • [Linux/Ubuntu]: To run the tool, simply clone the repository to your machine, place your PDF annual reports in the pdfs directory and run the shell script: cd to the directory where the script is located and execute it with ./ followed by the script name.
  • [Unix/Mac]: To run the tool, simply clone the repository to your machine, place your PDF annual reports in the pdfs directory and run the shell script: cd to the directory where the script is located and execute it with sh (or bash) followed by the script name.
  • The analysis output directories (one for each PDF file) will be created in the pdfs directory.
  • Please do not delete any of the files or directories or change their structure.
  • You can add or delete PDF files in the pdfs directory, and you can also edit userKeywords.txt to use your own keyword list: simply empty the file and insert one keyword (or keyphrase) per line.
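Once the tool has run, the CSV outputs can be post-processed with a few lines of scripting. The column names below are illustrative guesses only, so check the header row of the Sections_Frequencies.csv file the tool actually produces:

```python
import csv
import io

# Illustrative stand-in for a Sections_Frequencies.csv produced by the tool;
# the real column names may differ, so inspect the file's header row first.
sample = io.StringIO(
    "Section,Start Page,End Page,Fog,Performance Flag\n"
    "Chairman's Statement,4,6,17.2,1\n"
    "Financial Review,7,12,19.8,1\n"
    "Corporate Governance Report,13,20,21.0,0\n"
)

# Collect the sections flagged as performance sections (flag == 1).
performance_sections = [row["Section"]
                        for row in csv.DictReader(sample)
                        if row["Performance Flag"] == "1"]
print(performance_sections)
```

To process real output, replace the io.StringIO stand-in with open("XYZCompany_Analysis/Sections_Frequencies.csv").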

Related Papers:

  • El-Haj, Mahmoud, Rayson, Paul, Young, Steven, and Walker, Martin. “Detecting Document Structure in a Very Large Corpus of UK Financial Reports”. In Proceedings of the 9th Language Resources and Evaluation Conference (LREC 2014), 26–31 May 2014, Reykjavik, Iceland.
    Available at:


The tool is available under the GNU General Public License.

More about the CFIE research:

For more information about the project’s outputs, web tools, resources and contact information, please visit our page below: