Corpus-based insights into spoken L2 English: Introducing eight projects that use the Trinity Lancaster Corpus

In November 2016, we announced the Early Data Grant Scheme in which researchers could apply for access to the Trinity Lancaster Corpus (TLC) before its official release in 2018. The Early Data subset of the corpus contains 2.83 million words from 1,244 L2 speakers.

The Trinity Lancaster Corpus project is a product of an ongoing collaboration between The ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, and Trinity College London, a major international examination board. The Trinity Lancaster Corpus contains several features (rich metadata, a range of proficiency levels, L1s and age groups) that make it an important resource for studying L2 English. Soon after we started working on the corpus development in 2013, we realised the great potential of the dataset for researchers in language learning and language testing. We were very excited to receive a number of outstanding applications from around the world (Belgium, China, Germany, Italy, Spain, UK and US). The selected projects cover a wide range of topics focusing on different aspects of learner language use. In the rest of this blog post we introduce the successful projects and their authors.

  1. Listener response in examiner-EFL examinee interactions

Erik Castello and Sara Gesuato, University of Padua

The term listener response is used to denote (non-)verbal behaviour produced in reaction to an interlocutor’s talk and sharing a non-turn status, e.g. short verbalisations, sentence completion, requests for clarifications, restatements, shakes, frowns (Xudong 2009). Listener response is a form of confluence-oriented behaviour (McCarthy 2006) which contributes to the construction and smooth handling of conversation (Krauss et al. 1982). Response practices can vary within the same language/culture in terms of placement and function in the turn sequence and the roles played by the same listener response types (Schiffrin 1987; Gardner 2007). They can also vary across cultures/groups (Cutrone 2005; Tottie 1991) and between the sexes (Makri-Tsilipakou 1994; Rühlemann 2010). Therefore, interlocutors from different linguistic/cultural backgrounds may experience communication breakdown, social friction and the emergence of negative attitudes (Wieland 1991; Li 2006), including participants in examiner-EFL examinee interactions (Götz 2013) and in EFL peer-to-peer interactions (Castello 2013). This paper explores the listener response behaviour of EFL examinees in the Trinity Lancaster Corpus (Gablasova et al. 2015), which may display interference from the examinees’ L1s and affect the examiners’ impression of their fluency. It aims to: identify forms of verbal listener responses in examinee turns and classify them in terms of conventions of form (mainly following Clancy et al. 1996) and conventions of function (mainly following Maynard 1997); identify strategies for co-constructing turn-taking, if any (Clancy/McCarthy 2015); and determine the frequencies of occurrence of the above phenomena across types of interaction, examinees’ perceived proficiency levels and between the sexes.

Erik Castello is Assistant Professor of English Language and Translation at the University of Padua, Italy. His research interests include (learner) corpus linguistics, discourse analysis, language testing, academic English and SFL. He has co-edited two volumes and published two books and several articles on these topics.

Sara Gesuato is Associate Professor of English language at the University of Padua, Italy. Her research interests include pragmatics, genre analysis, verbal aspect, and corpus linguistics. She has co-edited two volumes on pragmatic issues in language teaching, and is currently investigating sociopragmatic aspects of L2 written speech acts.

  2. Formulaic expressions in learner speech: New insights from the Trinity Lancaster Corpus

Francesca Coccetta, Ca’ Foscari University of Venice

This study investigates the use of formulaic expressions in the dialogic component of the Trinity Lancaster Corpus. Formulaic expressions are multi-word units serving pragmatic or discourse structuring functions (e.g. discourse markers, indirect forms performing speech acts, and hedges), and their mastery is essential for language learners to sound more native-like. The study explores the extent to which the Trinity exam candidates use formulaic expressions at the various proficiency levels (B1, B2 and C1/C2), and the differences in their use between successful and less successful candidates. In addition, it investigates how the exam candidates compare with native speakers in the use of formulaic expressions. To do this, recurrent multi-word units consisting of two to five words will be automatically extracted from the corpus using Sketch Engine; then, the data will be manually filtered to eliminate unintentional repetitions, phrase and clause fragments (e.g. in the, it and, of the), and the multi-word units that do not perform any pragmatic or discourse function. The high-frequency formulaic expressions of each proficiency level will be provided and compared with each other and with the ones identified in previous studies on native speech. The results will offer new insights into learners’ use of prefabricated expressions in spoken language, particularly in an exam setting.
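For readers unfamiliar with the extraction step, the idea of pulling recurrent two- to five-word units from a transcript can be sketched in a few lines of Python. This is only an illustration and not the Sketch Engine procedure the study will use; the function name `recurrent_ngrams`, the `min_freq` threshold and the sample utterance are all assumptions made for the example.

```python
from collections import Counter

def recurrent_ngrams(tokens, n_min=2, n_max=5, min_freq=2):
    """Count every n-gram of n_min..n_max words and keep the recurrent ones.

    Hypothetical helper for illustration; the threshold is an assumption.
    """
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the token list.
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: freq for gram, freq in counts.items() if freq >= min_freq}

# Invented sample utterance, whitespace-tokenised.
sample = "you know i think it is you know a kind of i think so".split()
units = recurrent_ngrams(sample)
for gram, freq in sorted(units.items()):
    print(" ".join(gram), freq)
```

The output of such a step is still raw: as the abstract notes, fragments such as "in the" or "of the" would surface in real data and need to be filtered out manually.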

Francesca Coccetta is a tenured Assistant Professor at Ca’ Foscari University of Venice. She holds a doctorate in English Linguistics from Padua University where she specialised in multimodal corpus linguistics. Her research interests include multimodal discourse analysis, learner corpus research, and the use of e-learning in language learning and teaching.

  3. The development of high-frequency verbs in spoken EFL and ESL

Gaëtanelle Gilquin, Université catholique de Louvain

This project aims to contribute to the recent effort to bridge the paradigm gap between second language acquisition research and corpus linguistics. While most such studies have relied on written corpus data to compare English as a Foreign Language (EFL) and English as a Second Language (ESL), the present study will take advantage of a new resource, the Trinity Lancaster Corpus, to compare speech in an EFL variety (Chinese English) and in an ESL variety (Indian English). The focus will be on high-frequency verbs and how their use develops across proficiency levels in the two varieties, as indicated by the CEFR scores provided in the corpus. Taking high-frequency verbs as a starting point, the study will consider various aspects of language, including grammatical complexity (e.g. through the use of infinitival constructions of the causative type), idiomaticity (e.g. through the degree of typicality of object nouns) and fluency (e.g. through the presence of filled pauses in the immediate environment). The assumption is that, given the different acquisitional contexts of EFL and ESL, one and the same score in EFL and ESL may correspond to different linguistic realities, and that similar developments in scores (e.g. from B1 to B2) may correspond to different developments in language usage. More particularly, it is hypothesised that EFL speakers will progress more rapidly in aspects that can benefit from instruction (e.g. those involving grammatical rules), whereas ESL speakers will progress more rapidly in aspects that can benefit from exposure to naturalistic language (like phraseology).

Gaëtanelle Gilquin is a Lecturer in English Language and Linguistics at the University of Louvain. She is the coordinator of LINDSEI and one of the editors of The Cambridge Handbook of Learner Corpus Research. Her research interests include spoken learner English, the link between EFL and ESL, and applied construction grammar.

  4. Describing fluency across proficiency levels: From ‘can-do statements’ towards learner-corpus-informed descriptions of proficiency

Sandra Götz, Justus Liebig University Giessen

While it has been noted that current assessment scales (e.g. the Common European Framework of Reference; CEF; Council of Europe 2009) describing learners’ proficiency levels in ‘can-do statements’ are often formulated somewhat vaguely (e.g. North 2014), researchers and CEF developers have pointed out the benefits of including more specific linguistic descriptors emerging from learner corpus analyses (e.g. McCarthy 2013; Park 2014). In this project, I will test whether and how descriptions of fluency in learner language, such as those in the CEF, can benefit from analyzing learner data at different proficiency levels in the Trinity Lancaster Corpus. More specifically, I will test whether the learners’ proficiency levels can serve as robust predictors of their use of core fluency variables, such as filled and unfilled pauses (e.g. er, erm, eh, ehm), discourse markers (e.g. you know, like, well), or small words (e.g. sort of, kind of). I will also test whether learners show similar or different paths in their developmental stages of fluency from the B1 to the C2 level, regardless of (or dependent on) their L1. Through the meta-information available on the learners in the Trinity Lancaster Corpus, sociolinguistic and learning context variables (such as the learners’ age, gender or the task type) will also be taken into consideration in developing data-driven descriptor scales on fluency at different proficiency levels. Thus, it will be possible to differentiate between L1-specific and universal learner features in fluency development.

Sandra Götz obtained her PhD from Justus Liebig University Giessen and Macquarie University Sydney in 2011. Since then, she has been working as a Senior Lecturer in English Linguistics at the University of Giessen. Her main research interests include (learner) corpus linguistics and its application to language teaching and testing, applied linguistics and World Englishes.

  5. Self-repetition in the spoken English of L2 English learners: The effects of task type and proficiency levels

Lalita Murty, University of York

Self-repetition (SR), where the speaker repeats a word or phrase, is a much-observed phenomenon in spoken discourse. SR serves a range of distinct communicative and interactive functions, such as expressing agreement or disagreement or adding emphasis to what the speaker wants to say, as the following example shows: ‘Yes, I know I know and I certainly think that limits are…’ (to express agreement with the previous speaker) (Gablasova et al., 2015). Self-repetitions also help in creating coherence (Bublitz, 1989 as cited in Fung, 2007: 224), enhancing the clarity of the message (Kaur, 2012), keeping the floor, maintaining smooth flow of conversation, linking the speaker’s ideas to the previous speaker’s ideas (Tannen, 1989), and initiating self and other repairs (Bjorkman, 2011; Robinson and Kevoe-Feldman, 2010). This paper will use Sketch Engine to extract instances of single content word self-repetitions in the Trinity Lancaster Corpus data to examine the effect of (i) L2 proficiency levels and (ii) task types on the frequency and functions of different types of self-repetitions made by speakers at varying proficiency levels in the different tasks. A quantitative and qualitative analysis of the data thus extracted will be conducted using a mix of Norrick’s (1987) framework along with CA approaches.
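As a rough illustration of what extracting single content word self-repetitions might involve, the Python sketch below flags a word that is immediately repeated and is not on a small function-word list. It is not the Sketch Engine query the study will use; the helper `adjacent_repeats`, the tiny `FUNCTION_WORDS` set and the sample utterance are assumptions made for the example, and a real analysis would rely on part-of-speech tags rather than a hand-made list.

```python
# Tiny illustrative stop list (an assumption, far from complete).
FUNCTION_WORDS = frozenset({"i", "the", "a", "and", "to", "of", "it", "is"})

def adjacent_repeats(tokens, function_words=FUNCTION_WORDS):
    """Return (position, word) for each content word repeated immediately."""
    hits = []
    for i in range(1, len(tokens)):
        same_as_previous = tokens[i] == tokens[i - 1]
        if same_as_previous and tokens[i] not in function_words:
            hits.append((i - 1, tokens[i]))
    return hits

# Invented sample utterance, lower-cased and whitespace-tokenised.
utterance = "it is very very important i i think".split()
repeats = adjacent_repeats(utterance)
print(repeats)
```

In this sample only "very very" is kept; "i i" is skipped because "i" is treated as a function word. Deciding what each repetition is doing (emphasis, agreement, repair) remains a qualitative step.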

Lalita Murty is a Lecturer at the Norwegian Study Centre, University of York. Her previous research focused on spoken word recognition and call centre language. Currently she is working on Reduplication and Iconicity in Telugu, a South Indian language.

  6. Certainty adverbs in learner language: The role of tasks and proficiency

Pascual Pérez-Paredes, University of Cambridge and María Belén Díez-Bedmar, University of Jaén

When comparing native and non-native use of stance adverbs, the effect of task has been largely ignored. An exception is Gablasova et al. (2015). The authors researched the effect of different speaking tasks on L2 speakers’ use of epistemic stance markers and concluded that there was a significant difference between the monologic prepared tasks and every other task and between the dialogic general topic and the dialogic pre-selected topic (p < .05). This study suggests that the type of speaking task conditions speakers’ repertoire of markers, including certainty markers. Pérez-Paredes & Bueno (forthcoming) looked at how certainty stance adverbs were employed during the picture description task in the LINDSEI and the extended LOCNEC (Aguado et al., 2012). In particular, the authors discussed the contexts of use of obviously, really and actually by native and non-native speakers across the same speaking task in the four datasets when expressing the range of meanings associated with certainty. The authors found that different groups of speakers used these adverbs differently, both quantitatively and qualitatively. Our research seeks to expand on the findings of Gablasova et al. (2015) and Pérez-Paredes & Bueno (forthcoming) and examine the uses of certainty adverbs across the L1s, proficiency levels and tasks represented in the Trinity Lancaster Corpus. We believe that the use of this corpus, together with the findings from the LINDSEI, will help us reach a better understanding of the uses of certainty adverbs in spoken learner language.

Pascual Pérez-Paredes is a Lecturer in Research in Second Language Education at the Faculty of Education, University of Cambridge. His main research interests are learner language variation, the use of corpora in language education and corpus-assisted discourse analysis.

María Belén Díez-Bedmar is Associate Professor at the University of Jaén (Spain). Her main research interests include Learner Corpus Research, error-tagging, the learning of English as a Foreign Language, language testing and assessment, the CEFR and CMC. She is currently involved in national and international corpus-based projects.

  7. Emerging verb constructions in spoken learner English

Ute Römer and James Garner, Georgia State University

Recent research in first language (L1) and second language (L2) acquisition has demonstrated that we learn language by learning constructions, defined as conventionalized form-meaning pairings. While studies in L2 English acquisition have begun to examine construction development in learner production data, these studies have been based on rather small corpora. Using a larger set of data from the Trinity Lancaster Corpus (TLC), this study investigates how verb-argument constructions (VACs; e.g. ‘V about n’) emerge in the spoken English of L2 learners at different proficiency levels. We will systematically and exhaustively extract a small set of VACs (‘V about n’, ‘V for n’, ‘V in n’, ‘V like n’, and ‘V with n’) from the L1 Italian and L1 Spanish subsets of the TLC, separately for three CEFR proficiency levels. For each VAC and L1-proficiency combination (e.g. Italian-B1), we will create frequency-sorted verb lists, allowing us to determine how learners’ verb-construction knowledge develops with increasing proficiency. We will also examine in what ways VAC emergence in the TLC data is influenced by VAC usage as captured in a large native-speaker reference corpus (the BNC). We will use chi-square tests to compare VAC type and token frequencies across L1 subsets and proficiency levels. We will use path analysis (a type of structural equation modeling) including the predictor variables L1 status, proficiency level, and BNC usage information to gain insights into how learner characteristics and variables concerning L1 construction usage affect the emergence of the target VACs in spoken L2 learner English.
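To make the chi-square comparison concrete, here is a minimal Python sketch of Pearson's chi-square statistic for a table of token frequencies. It shows the general technique only and is not the authors' analysis script; the function name `chi_square` is hypothetical and the counts are invented toy numbers, not TLC results.

```python
def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, dof

# Invented counts: rows = L1 subset, columns = VAC tokens vs. other verb tokens.
toy_counts = [
    [10, 20],  # hypothetical subset A
    [20, 10],  # hypothetical subset B
]
statistic, dof = chi_square(toy_counts)
print(round(statistic, 3), dof)
```

The statistic would then be compared against the chi-square distribution with `dof` degrees of freedom to obtain a p-value, a step usually left to a statistics package.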

Ute Römer is currently Assistant Professor in the Department of Applied Linguistics and ESL at Georgia State University. Her research interests include corpus linguistics, phraseology, second language acquisition, discourse analysis, and the application of corpora in language teaching. She serves on a range of editorial and advisory boards of professional journals and organizations, and is general editor of the Studies in Corpus Linguistics book series.

James Garner is currently a PhD student in the Department of Applied Linguistics and ESL at Georgia State University. His current research interests include learner corpus research, phraseology, usage-based second language acquisition, and data-driven learning.

  8. Verb-argument constructions in Chinese EFL learners’ spoken English production

Jiajin Xu and Yang Liu, Beijing Foreign Studies University

The widespread recognition of usage-based approaches to constructions has made Corpus Linguistics a most viable methodology to scrutinise such frequent morpho-syntactic patterns as verb-argument constructions (VACs) in learner language. The present study attempts to examine the use of VACs in Chinese EFL learners’ spoken English. Our focus will be on the semantics of the verbal constructions in light of collostructional statistics (Stefanowitsch & Gries, 2003) as well as the comparisons across learners’ proficiency levels and task types. 20 VACs were collected from COBUILD Grammar Patterns 1: Verbs (Francis, Hunston & Manning, 1996). On the basis of the retrieved VAC concordances from the Trinity Lancaster Corpus, the semantic prototypicality of the VACs will be analysed according to the collocational strength of verbs with their host constructions. Comparisons of Chinese EFL learners against native speakers will be made, and also across different task types. It is hoped that our findings will shed light on Chinese EFL learners’ knowledge of VACs and the crosslinguistic influence that impacts the verb semantics of learners’ spoken English. Meanwhile, we also consider language proficiency and task type as potential factors that may account for the differences across CEFR groups based on the comparisons within Chinese EFL learners.
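Collocational strength between a verb and a construction is commonly measured with an exact test on a 2x2 frequency table. The Python sketch below computes a one-tailed Fisher exact p-value for attraction, assuming this is the kind of statistic meant here; the abstract does not specify the exact measure, so treat the function `attraction_p`, its cell labels and the counts as illustrative assumptions rather than the authors' procedure.

```python
from math import comb

def attraction_p(a, b, c, d):
    """One-tailed Fisher exact p-value for a verb being attracted to a construction.

    a: verb inside the construction     b: verb elsewhere
    c: other verbs in the construction  d: other verbs elsewhere
    """
    n_total = a + b + c + d
    verb_total = a + b           # all tokens of the verb
    construction_total = a + c   # all tokens of the construction
    upper = min(verb_total, construction_total)
    denominator = comb(n_total, verb_total)
    # Hypergeometric probability of a co-occurrence count at least as high as a.
    numerator = sum(
        comb(construction_total, k)
        * comb(n_total - construction_total, verb_total - k)
        for k in range(a, upper + 1)
    )
    return numerator / denominator

# Invented counts for illustration only.
p_value = attraction_p(a=12, b=88, c=40, d=9860)
print(p_value)
```

A smaller p-value indicates stronger attraction, so ranking verbs by this value gives one possible ordering of how typical each verb is of a given VAC.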

Jiajin Xu is Professor of Linguistics at the National Research Centre for Foreign Language Education, Beijing Foreign Studies University as well as secretary general and a founding member of the Corpus Linguistics Society of China. His research interests include discourse studies, second language acquisition, contrastive linguistics and translation studies, and corpus linguistics.

Yang Liu is currently a PhD candidate at Beijing Foreign Studies University. His research focus is on the corpus-based study of construction acquisition of Chinese EFL learners.