Compiling a trilingual corpus to examine the political and social representation(s) of ‘people’ and ‘democracy’

Since mid-October 2017 and until the end of August 2018, I have been a visiting researcher at CASS (coming from the University of Athens, where I am Associate Professor of Corpus Linguistics and Translation). My research aim is to investigate critical aspects of populist discourses in Europe and their variation, especially during and after the 2008 financial (and then social and political) crisis, and to reveal patterns of similarity and difference (and, tentatively, of interconnectedness and intertextuality) across a wide spectrum of political parties, think tanks and organisations. As this is essentially a Corpus-Assisted Discourse Study (CADS), a first way into the data is to identify and statistically analyse the collocational patterns and networks built around key lexemes (e.g. ‘people’, ‘popular’, ‘democracy’), before critically correlating these quantitative findings with the social and political backdrop(s) and crucial milestones.

 

The first task of this complex corpus-driven effort, now complete, was to compile a large-scale trilingual (EN, FR, EL) ‘focus’ corpus. This was a tedious technical process: before the data could be examined in a consistent manner, several problems had to be addressed and solutions implemented, as outlined below.

 

  1. As a key aim was to gather as much data as possible from the websites of political parties, political organisations, think tanks and official party newspapers in the UK, France and Greece, it was clear from the outset that manually culling the corpus data would not be possible, given the sheer number of sources and texts. On the other hand, automatic corpus compilation tools (e.g. BootCaT and WebBootCaT in Sketch Engine) could not handle the size and diversity of the corpora. Texts were therefore culled using web crawling techniques (‘wget -r’ in a Linux bash shell) and the HTTrack app, with a lot of tweaking and customisation of the download parameters to account for the (sometimes very tricky) batch download restrictions of some websites.
  2. Clean up the HTML boilerplate, i.e. the sections of a page that are irrelevant to the corpus text (navigation code, advertising material, etc.). This was accomplished using jusText (the tool used by M. Davies to compile the NOW corpus), with a few tweaks so as to handle some ‘malformed’ data, especially from Greek sources (a minimal sketch of this step follows below).
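For illustration, here is a minimal Python sketch of this boilerplate-removal step, using the jusText library named above; the URL, stoplist choice and output filename are placeholders rather than the project's actual sources or settings.

```python
# Minimal sketch: strip HTML boilerplate with jusText (the library mentioned above).
# The URL and output path are illustrative placeholders, not the actual project sources.
import requests
import justext

def clean_page(url, stoplist="English"):
    """Download one page and return its non-boilerplate paragraphs as plain text."""
    response = requests.get(url, timeout=30)
    paragraphs = justext.justext(response.content, justext.get_stoplist(stoplist))
    # Keep only paragraphs that jusText does not classify as boilerplate
    return "\n".join(p.text for p in paragraphs if not p.is_boilerplate)

if __name__ == "__main__":
    text = clean_page("https://example.org/some-article.html", stoplist="Greek")
    with open("some-article.txt", "w", encoding="utf-8") as out:
        out.write(text)
```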

As I plan to analyse specifically the variation of key descriptors and qualifiers (‘key’ keywords and their c-collocates) as a first way into the “charting” of the discourses at hand, the publication date of each article or text is a critical part of the corpus metadata and needs to be retained for further processing. However, most if not all of this information is lost in the web crawling and boilerplate-removal stages. Therefore, the HTML clean-up was preceded by the identification and extraction of the articles’ publication dates, using a PHP script developed with the help of Dr. Matt Timperley (CASS, Lancaster) and Dr. Pantelis Mitropoulos (University of Athens). The script scans all files in a dataset, accounts for all possible date formats in all three languages, and automatically creates a tab-delimited (csv) table containing the extracted date(s) matched with the respective filenames. Its accuracy is estimated at ca. 95% and can be improved further by checking the output and rescanning the original data with a few code tweaks.
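The date extraction itself was done with the PHP script described above; purely as an illustration of the idea, the Python sketch below scans a folder of downloaded pages, matches a couple of common date formats, and writes a tab-delimited table of filename-date pairs. The patterns and paths are invented examples and cover only a fraction of the EN/FR/EL formats the real script handles.

```python
# Illustrative Python sketch of the date-extraction step described above
# (the project used a PHP script; the patterns below cover only two common
# formats and would need extending for all EN/FR/EL variants).
import csv
import re
from pathlib import Path

DATE_PATTERNS = [
    re.compile(r"\b(\d{4})-(\d{2})-(\d{2})\b"),      # e.g. 2012-03-11
    re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b"),  # e.g. 11/03/2012
]

def extract_dates(folder):
    """Scan every .html file in a folder and record the first date found in each."""
    rows = []
    for path in sorted(Path(folder).glob("*.html")):
        text = path.read_text(encoding="utf-8", errors="ignore")
        for pattern in DATE_PATTERNS:
            match = pattern.search(text)
            if match:
                rows.append((path.name, match.group(0)))
                break
        else:
            rows.append((path.name, "NOT_FOUND"))  # flag files for manual checking
    return rows

with open("dates.csv", "w", newline="", encoding="utf-8") as f:
    csv.writer(f, delimiter="\t").writerows(extract_dates("crawled_pages"))
```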

  3. Streamline the data by removing irrelevant stretches of text (e.g. “Share this article on Facebook”) that may have survived the boilerplate-removal process. This step uses Linux commands (e.g. find, grep, sed, awk) and regular expressions, and greatly improves the accuracy of the following step.
  4. Remove duplicate files: since onion (ONe Instance ONly, the script used e.g. in Sketch Engine) only looks for pattern repetitions within a single file and within fairly short maximum paragraph intervals, I used FSLint, an application that compares the files’ MD5 signatures to identify duplicates (a minimal sketch of this check follows the list). This is extremely accurate and practically eliminates all files whose text is repeated in full across various sections of the websites, regardless of file name or creation date (in practice, this was found to be the case mostly with political party websites, not newspapers). (NB: a similar process is also available in Mike Scott’s WordSmith Tools v7.)
  5. Order the files by publication year for each subcorpus, calculate the corresponding metadata (files, tokens, types and average token count, by year) for each dataset, and then extract the “focus corpus” by selecting only the files that contain the node lemmas (i.e. lemmas related to the core question of this research: people*|popular|democr*|human* and their FR and EL equivalents), using grep and regular expressions (note that FAR, an open-source, Java-based GUI app, combines these search options for large datasets).
  6. Finally, prepare the data for uploading to Lancaster’s CQPweb by appending the text publication year, as extracted at stage 2, to the corresponding raw text file; this was done using yet another PHP script, kindly developed by Matt Timperley.
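As a rough illustration of the duplicate-removal step, the following Python sketch groups files by the MD5 hash of their contents, which is essentially what FSLint does; the folder name and file extension are placeholders.

```python
# Minimal sketch of the MD5-based duplicate check described above
# (the project used FSLint; file paths here are placeholders).
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(folder):
    """Group files by the MD5 hash of their contents; groups with >1 file are duplicates."""
    by_hash = defaultdict(list)
    for path in Path(folder).rglob("*.txt"):
        digest = hashlib.md5(path.read_bytes()).hexdigest()
        by_hash[digest].append(path)
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

for digest, paths in find_duplicates("cleaned_corpus").items():
    keep, *remove = sorted(paths)
    print(f"{digest}: keeping {keep}, duplicates: {[str(p) for p in remove]}")
```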

 

In a nutshell, texts were culled from a total of 68 sources (24 Greek, 26 British, and 18 French). This dataset is divided into three major corpora, as follows:

  1. Cumulative corpus (CC, all data): 746,798 files/465,180,684 tokens.
  2. Non-journalistic research corpus (RC): 419,493 files/307,231,559 tokens.
  3. Focus corpus (FC): 205,038 files/235,235,353 tokens.

Is Academic Writing Becoming More Colloquial?

Have you noticed that academic writing in books and journals seems less formal than it used to? Preliminary data from the Written BNC2014 shows that you may be right!

Some early data from the academic journals and academic books sections of the new corpus has been analysed to find out whether academic writing has become more colloquial since the 1990s. Colloquialisation is “a tendency for features of the conversational spoken language to infiltrate and spread in the written language” (Leech, 2002: 72). The colloquialisation of language can make messages more easily understood by the general public because, whilst not everybody is familiar with the specifics of academic language, everyone is familiar with spoken language. In order to investigate the colloquialisation of academic writing, the frequencies of several linguistic features which have been associated with colloquialisation were compared in academic writing in the BNC1994 and the BNC2014.
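For readers unfamiliar with this kind of comparison, the sketch below shows the basic arithmetic involved: raw feature counts are normalised per million words in each corpus and the percentage change is then computed. All figures in the example are invented placeholders, not the actual BNC1994/BNC2014 counts.

```python
# Illustrative calculation of the comparison described above: raw feature counts
# are normalised per million words in each corpus and the percentage change is
# reported. All numbers below are invented placeholders, not BNC figures.
def per_million(raw_count, corpus_size):
    return raw_count / corpus_size * 1_000_000

def percent_change(freq_old, freq_new):
    return (freq_new - freq_old) / freq_old * 100

size_1994, size_2014 = 16_000_000, 14_000_000          # hypothetical token counts
features = {"verb contractions": (4_000, 9_500), "passives": (220_000, 150_000)}

for name, (raw_1994, raw_2014) in features.items():
    f1 = per_million(raw_1994, size_1994)
    f2 = per_million(raw_2014, size_2014)
    print(f"{name}: {f1:.1f} -> {f2:.1f} per million ({percent_change(f1, f2):+.1f}%)")
```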

Results show that, of the eleven features studied, five show large changes in frequency between the BNC1994 and the BNC2014, pointing to the colloquialisation of academic writing. First and second person pronouns, verb contractions, and negative contractions have previously been found to be strongly associated with spoken language; all of these have increased in academic language between 1994 and 2014. Passive constructions and relative pronouns have previously been found to be strongly associated with written language and are not often used in spoken language; this analysis shows that both have decreased in frequency in academic language in the BNC2014.

Figure 1: Frequency increases indicating the colloquialisation of academic language.

Figure 2: Frequency decreases indicating the colloquialisation of academic language.

These frequency changes were also compared for each genre of academic writing separately. The genres studied were: humanities & arts, social science, politics, law & education, medicine, natural science, and technology & engineering. An interesting difference between some of these genres emerged. It seems that the ‘hard’ sciences (medicine, natural science, and technology & engineering) have shown much larger changes in some of the linguistic features studied than the other genres have. For example, figure 3 shows the difference in the percentage increase of verb contractions for each genre, and clearly shows a difference between the ‘hard’ sciences and the social sciences and humanities subjects.


Figure 3: % increases in the frequency of the use of verb contractions between 1994 and 2014 for each genre of academic writing.

This may lead you to think that medicine, natural science, and technology & engineering writing has become more colloquial than the other genres, but this is in fact not the case. Looking more closely at the data shows that these ‘hard’ science genres were actually much less colloquial than the other genres in the 1990s, and that the large change seen here is really a symptom of all the genres becoming more similar in their use of these features. In other words, some genres have not become more colloquial than others; they have simply had to change more in order for all of the genres to become more alike.

So it seems from this analysis that, in some respects at least, academic language has certainly become more colloquial since the 1990s. The following is a typical example of academic writing in the 1990s, taken from a sample of a natural sciences book in the BNC1994. It avoids first and second person pronouns and contractions (which have increased in use in the BNC2014) and uses a passive construction (the use of which has decreased in the BNC2014).

Experimentally one cannot set up just this configuration because of the difficulty in imposing constant concentration boundary conditions (Section 14.3). In general, the most readily practicable experiments are ones in which an initial density distribution is set up and there is then some evolution of the configuration during the course of the experiment.

It is much more common nowadays to see examples such as the following, taken from an academic natural sciences book in the BNC2014. This example contains active sentence constructions, first person pronouns, verb contractions, negative contractions, and a question.

No doubt people might object in further ways, but in the end nearly all these replies boil down to the first one I discussed above. I’d like to return to it and ponder a somewhat more aggressive version, one that might reveal the stakes of this discussion even more clearly. Very well, someone might say. Not reproducing may make sense for most people, but my partner and I are well-educated, well-off, and capable of protecting our children from whatever happens down the road. Why shouldn’t we have children if we want to?

It will certainly be interesting to see if this trend of colloquialisation can be seen in other genres of writing in the BNC2014!


Would you like to contribute to the Written BNC2014?

We are looking for native speakers of British English to submit their student essays, emails, and Facebook and WhatsApp messages for inclusion in the corpus! To find out more, and to get involved, click here. All contributors will be fully credited in the corpus documentation.

British National Corpus 2014: A sociolinguistic book is out

Have you ever wondered what real spoken English looks like? Have you ever asked whether people from different backgrounds (based on gender, age, social class, etc.) use language differently? Have you ever thought it would be interesting to investigate how much English has changed over the last twenty years? All these questions can be answered by looking at language corpora such as the Spoken BNC2014 and analysing them from a sociolinguistic perspective. Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014 is a book which offers a series of studies providing unique insights into a number of topics, ranging from Discourse, Pragmatics and Interaction to Morphology and Syntax.

This is, however, only the first step. We are hoping that there will be many more studies to come based on this wonderful dataset. If you want to start exploring the Spoken BNC2014, it is just three mouse clicks away:

Get access to the BNC2014 Spoken

  1. Register for free and log on to CQPweb.
  2. Sign up for access to the BNC2014 Spoken.
  3. Select ‘BNC2014’ in the main CQPweb menu.

Also, right now there is a great opportunity to take part in the Written BNC2014 project, a written counterpart to the Spoken BNC2014. If you’d like to contribute to the Written BNC2014, please check out the project’s website for more information.

Learn about the BNC2014, scan a book sample and contribute to the corpus…

On Saturday 12 May 2018, CASS hosted a small training event at Lancaster University for a group of participants who came from different universities in the UK. We talked about the BNC2014 project and discussed both the theoretical underpinnings and the practicalities of corpus design and compilation. Slides from the event are available as a pdf here.

The participants then tried in practice what is involved in the compilation of a large general corpus such as the BNC2014. They selected and scanned samples of books from current British fiction, poetry and a range of non-fiction books (history, popular science, hobbies etc.). Once processed, these samples will become a part of the written BNC2014.

Here are some pictures from the event:

Carmen Dayrell and Vaclav Brezina before the event

Elena Semino welcoming participants

In the computer lab: Abi Hawtin helping participants


A box full of books

If you are interested in contributing to the written BNC2014, go to the project website  to find out about different ways in which you can participate in this exciting project.

The event was supported by ESRC grant no. EP/P001559/1.

Media delusions: the (mis)representation of people with schizophrenia in 9 U.K. national newspapers between 2000 and 2015

In 2006, The Independent published an article advertising to its readers some of the islands off the coast of New Zealand as ideal holiday destinations. Amongst descriptions of various idyllic landscapes, cultural eccentricities and tourist attractions, the author warns readers of some of the local fauna:

     “Wildlife-wise, there are not just hammerhead sharks in these parts, he told me, but school sharks and mako sharks – the paranoid schizophrenics of the shark world” (The Independent, 2 September 2006).

What is meant by ‘the paranoid schizophrenics of the shark world’? Surely not that a species of shark is affected by a chronic mental disorder, typically characterised by delusions and auditory hallucinations?

I think most readers would agree that the author does not mean this literally. In fact, there is currently no evidence that sharks, or other animals besides humans, experience symptoms of psychosis. Instead, given some of the shared characteristics of sharks and ‘paranoid schizophrenics’ in our background knowledge, we can infer that by paranoid schizophrenics of the shark world, the journalist means ‘highly dangerous sharks’.  In other words, our understanding of what is meant rests on the assumption that paranoid schizophrenics are aggressive and violent members of the human race.

So where have we learnt to believe that people with schizophrenia are more violent than other people? Probably not from personal experience or credible evidence. Statistics repeatedly show that people with schizophrenia are not significantly more likely to commit violent crimes than the general population (Kalucy et al., 2011; Fazel and Grann, 2006). Other studies report that people with schizophrenia are instead more likely to be the targets of violence (Wehring and Carpenter, 2011).

One likely explanation is the media, which has been shown to frequently represent people with schizophrenia as aggressive, unpredictable killers who are both ‘mad’ and ‘bad’ (e.g. Cross, 2014; Clement, 2008; Coverdale, 2002). These misrepresentations are particularly alarming given the influence of the media on public attitudes. Since members of the public are unlikely to have first-hand experiences with people with schizophrenia, they obtain almost all their understanding of, and attitudes towards, people with schizophrenia from the increasingly ubiquitous media (Angermeyer et al., 2005). People with the diagnosis themselves are given little choice but to internalise these malign stereotypes, which have been found to deter some individuals from seeking medical help and, sadly, to increase the risk of suicide (Harrison and Gill, 2010; Wilkinson, 1994).

For my doctoral thesis, I have spent the last two years examining the ways in which schizophrenia is portrayed in the U.K. national press between 2000 and 2015. I pay attention, not only to how people with schizophrenia are misrepresented as violent, but also to other, less well-documented representations. For instance, my project has also led me to consider cases where the press paint a picture of people with schizophrenia as having special and unexplained creative powers, serving to make them separate and different from ‘ordinary folk’, and perhaps resulting in expectations that may be difficult to meet.

In order to consider the most frequent portrayals and how they might have changed over time, I am examining all articles published between 2000 and 2015 by nine U.K. national newspapers – five tabloids (The Express, The Mail, The Mirror, The Star, The Sun) and four broadsheets (The Guardian, The Independent, The Telegraph, The Times) – that refer to a diagnosis of schizophrenia in some way. This comprises 16,466 articles and 15,134,066 words, which I analyse for frequent patterns using a combination of statistical computer tools and manual linguistic analysis.

As a linguist, I pay close attention to how positive or negative portrayals are realised through subtle choices in language. These can reveal the origins of various misconceptions. For instance, 14 of the top 25 ‘doing’ words that occur unusually frequently in the vicinity of the word schizophrenic in my data refer to violent behaviours.  In other words, ‘schizophrenics’ frequently attack, behead, punch, murder, rape, stab and slash others, or at least pose a risk or threaten them.
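The analysis itself relies on established collocation statistics over a large, annotated corpus; the toy Python sketch below only illustrates the basic idea of collecting words that occur within a small window of a node word such as schizophrenic.

```python
# Simplified sketch of window-based collocate counting around a node word.
# The thesis work uses statistical association measures on a large tagged corpus;
# this toy version only counts words within +/- 5 tokens of the node.
import re
from collections import Counter

def collocates(text, node="schizophrenic", window=5):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = max(0, i - window)
            context = tokens[left:i] + tokens[i + 1:i + 1 + window]
            counts.update(context)
    return counts

sample = "A paranoid schizophrenic beheaded his flatmate in a frenzied attack."
print(collocates(sample).most_common(5))
```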

     A MACHETE-wielding schizophrenic who slashed two guards in a rampage through MI5’s HQ was locked up in a mental health unit indefinitely yesterday. (The Sun, 22 June 2005)

     A paranoid schizophrenic beheaded his flatmate in a frenzied attack after suffering from delusions that he was being persecuted, a court has heard. (The Mirror, 2 December 2013)

The frequency with which schizophrenic occurs in the vicinity of violent action words helps explain why someone might have chosen to characterise the school and mako sharks as the paranoid schizophrenics of the shark world, and how readers were able to make sense of this analogy.

Overall, it is clear that the media have a responsibility to offer more accurate and balanced representations of schizophrenia. In utilising a characteristically language-oriented approach, the goal of this research is to offer media institutions practical ways to improve their reporting of schizophrenia by identifying potentially problematic choices in language and suggesting more balanced and accurate alternatives.

If you have any suggestions or observations, or for any reason would like to get in touch, please contact me via email: j.balfour@lancaster.ac.uk

More on drought: The ENDOWS project

We are thrilled to announce that our latest bid was successful – ENDOWS: ENgaging diverse stakeholders and publics with outputs from the UK DrOught and Water Scarcity programme.

The ENDOWS project will capitalise on the outcomes from four existing projects within NERC’s UK Drought and Water Scarcity programme (Historic Droughts, DRY, MaRIUS and IMPETUS) to maximise impact. ENDOWS will exploit the synergies between these projects, promote active interaction among disciplines, and develop close collaboration with a diverse range of stakeholders (policy makers, water companies, NGOs and community leaders). The project therefore opens up the possibility of genuinely enhancing and innovating the UK planning and management of future drought events.

Funded by NERC, this is a two-year project which brings together a multi-disciplinary team of 37 researchers from 12 UK universities or research centres and Climate Outreach, one of Europe’s leading voices on public engagement:

  • Centre for Ecology and Hydrology (CEH)
  • University of the West of England
  • University of Oxford
  • Cranfield University
  • The University of Reading
  • University of Bristol
  • British Geological Survey (BGS)
  • Sheffield University
  • Harper Adams University
  • University of Exeter
  • Lancaster University
  • Loughborough University
  • University of Warwick

Carmen Dayrell is the Co-Investigator at Lancaster University. She has been working with Tony McEnery and Helen Baker as part of the CASS team within the Historic Droughts project. CASS is examining how British newspapers have debated drought and water scarcity events in the UK, covering 200 years of discourse: 1850 to 2014. The analysis uses an innovative methodological approach which combines Critical Discourse Analysis with methods from Corpus Linguistics and GIS (Geographic Information Systems), enabling the researchers to examine the link between textual patterns and geographic references and hence explore geographically bounded discourses.

To illustrate the interesting results that this type of analysis can yield, let’s have a look at the places appearing around the word “drought” in newspaper texts published between 2010 and 2012. These were years when England and Wales were hit hard by drought. The bigger the dot in the maps, the higher the number of mentions.
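As a toy illustration of how corpus output can feed into GIS mapping of this kind, the sketch below counts mentions of a handful of place names occurring near the word “drought” and writes a small table of coordinates and counts that a mapping tool could plot. The gazetteer, coordinates and sample text are invented examples, not the project's data.

```python
# Toy illustration of the corpus-to-GIS step: count how often a small set of
# place names occurs within 10 tokens of "drought" and write a table that a
# GIS/plotting tool could map. The gazetteer and text are invented examples.
import csv
import re
from collections import Counter

GAZETTEER = {"yorkshire": (53.96, -1.08), "lincolnshire": (53.23, -0.54),
             "sussex": (50.92, -0.46), "scotland": (56.49, -4.20)}

def places_near_drought(text, window=10):
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == "drought":
            for nearby in tokens[max(0, i - window):i + window + 1]:
                if nearby in GAZETTEER:
                    counts[nearby] += 1
    return counts

text = "Parts of the Midlands and Yorkshire are expected to be declared high risk in the drought report."
with open("drought_places.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["place", "lat", "lon", "mentions"])
    for place, n in places_near_drought(text).items():
        writer.writerow([place, *GAZETTEER[place], n])
```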

 

Figure 1: 2010-2012 drought in Britain (tabloid)

 

 

Figure 2: 2010-2012 drought in Britain (broadsheet)

 

A closer reading of the texts reveals interesting patterns:

  • The press does not always specify the locations impacted by the drought. Britain and England were by far the places most frequently mentioned.
  • When mentioning specific locations, these were usually in England.

Large areas of Britain face drought conditions, the Environment Agency said. Parts of the Midlands and Yorkshire are expected to be declared high risk in the agency’s drought prospects report.

The Telegraph 11/03/2012

  • Rather than being impacted by the drought, Scotland was portrayed as the solution to the problem, since it is rich in water resources.

SCOTLAND yesterday offered to provide water to drought-hit Southern England. Infrastructure Secretary Alex Neil said it was “only right” to offer some of Scotland’s “plentiful supply of water”.

The Express, 10/03/2012

  • The newspapers also report on actions taken to address the problem of drought. Applications for drought orders and permits and the introduction of hosepipe bans were the most frequently mentioned.

The remaining area in drought is South and West of a line from Lincolnshire to Sussex, taking in Oxfordshire, where hosepipe bans imposed by seven water companies remain in place.

The Independent, 19/05/2012

  • There were also mentions of the impact of the drought. These mainly related to: (i) wildlife and plants/gardens being affected and (ii) water levels of rivers and reservoirs going low.

SCIENTISTS fear rare eel species could be completely wiped out because of drought in the South of England.

The Daily Record, 17/07/2011

River levels are as low as in 1976 after another very dry week across England and Wales, the Environment Agency said. In its latest drought briefing yesterday, the Government agency said all areas had seen less than 1mm of rain.

The Herald, 31/03/2012

By examining 200 years of newspaper discourse, the analysis can trace repeated patterns and changes across time. This in turn can inform ways of thinking about how the media representation of drought has influenced the way in which the British public perceives and responds to drought events. The newspaper analysis will thus contribute to fostering more informed dialogue among policy makers, water companies, community leaders and the general public.

Data-driven learning: learning from assessment

The process of converting valuable spoken corpus data into classroom materials is not necessarily straightforward, as a recent project conducted by Trinity College London reveals.

One of the buzzwords we increasingly hear from teacher trainers in English Language Teaching (ELT) is data-driven learning. This ties in with other contemporary pedagogies, such as discovery learning. A key component of this is how data from a corpus can be used to inform learning. One of our long-running projects with the Trinity Lancaster Corpus has been to see how we could use the spoken data in the classroom so that students could learn from assessment as well as for assessment. We have reported before (From Corpus to Classroom 1 and From Corpus to Classroom 2) on the research focus on pragmatic and strategic examples. These linguistic features and competences are often not practised – or are only superficially addressed – in course books, and yet they can be significant in enhancing learners’ communication skills, especially across cultures. Our ambition is to translate the data findings for classroom use, specifically to help teachers improve learners’ wider speaking competences.

We developed a process of constructing sample worksheets based on, and including, the corpus data. The data was contextualized and presented to teachers in order to give them an opportunity to use their expertise in guiding how this data could be developed for, and utilized in, the classroom. So, essentially, we asked teachers to collaborate on checking how useful the data and tasks were and potentially improving these tasks. We also asked teachers to develop their own tasks based on the data and we now have the results of this project.

Overwhelmingly, the teachers were very appreciative of the data and they each produced some great tasks. All of these were very useful for the classroom but they did not really exploit the unique information we identified as being captured in the data. We have started exploring why this might be the case.

What the teachers did was the following:

  • Created noticing and learner autonomy activities with the data (though most tasks would need much more scaffolding).
  • Focused on traditional information about phrases identified in the data, e.g. the strength and weakness of expressions of agreement.
  • Created activities that reflected traditional course book approaches.
  • Created reflective, contextual practice related to the data although this sometimes became lost in the addition of extra non-corpus texts.

We had expectations that the data would inspire activities which:

  • showed new ways of approaching the data
  • supported discovery learning tasks with meaningful outcomes
  • explored the context and pragmatic functions of the data
  • reflected pragmatic usage; perhaps even referring to L1 as a resource for this
  • focused on the listener and interpersonal aspects rather than just the speaker

It was clear that the teachers were intellectually engaged and excited, so we considered the reasons why their tasks had taken a more traditional path than expected. Many of these reasons have been raised in the past by Tim Johns and Simon Borg. There is no doubt that a heavy teacher workload affects how far teachers feel they can be innovative with materials; there is a surety in doing what you know and what you know works. Also, many teachers, despite being in the classroom every day, often need a certain confidence to design input, since this is traditionally something that has been left to syllabus and course book creators. Another issue is that teachers would probably need more support in understanding corpus data, and many don’t have the time to do extra training. Finally, with this particular data, teachers may not be fully aware of the importance of pragmatic and strategic competences. These are often seen as an ‘add-on’ rather than a core competence, especially in contemporary communication contexts where English is largely used as a lingua franca.

Ultimately, there was a difference between what the researchers ‘saw’ and what the teachers ‘saw’. As an alternative, we asked a group of expert material writers to produce new tasks and they have produced some innovative material. We concluded that maybe this is a fairer approach. In other words, instead of expecting each of the roles involved in language teaching (SLA researchers, teachers, materials designers) to find the time to become experts in new skills, it may sometimes be better to use each other as a resource. This would still be a learning experience as we draw on each other’s expertise.

In future, if we want teachers to collaborate on designing materials, we must make sure that we discuss the philosophy or pedagogy behind our objectives (Rapti, 2013) with our collaborators, that we show how the data is mapped to relevant curricula, and that we recognise the restrictions caused by practical issues such as a lack of time or training opportunities.

The series of worksheets is now available from the Trinity College London website. More to come in the future so keep checking.

Corpus-based insights into spoken L2 English: Introducing eight projects that use the Trinity Lancaster Corpus

In November 2016, we announced the Early Data Grant Scheme in which researchers could apply for access to the Trinity Lancaster Corpus (TLC) before its official release in 2018.  The Early Data subset of the corpus contains 2.83 million words from 1,244 L2 speakers.

The Trinity Lancaster Corpus project is a product of an ongoing collaboration between The ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, and Trinity College London, a major international examination board. The Trinity Lancaster Corpus contains several features (rich metadata, a range of proficiency levels, L1s and age groups) that make it an important resource for studying L2 English. Soon after we started working on the corpus development in 2013, we realised the great potential of the dataset for researchers in language learning and language testing. We were very excited to receive a number of outstanding applications from around the world (Belgium, China, Germany, Italy, Spain, UK and US).  The selected projects cover a wide range of topics focusing on different aspects of learner language use. In the rest of this blog post we introduce the successful projects and their authors.

  1. Listener response in examiner-EFL examinee interactions

Erik Castello and Sara Gesuato, University of Padua

The term listener response is used to denote (non-)verbal behaviour produced in reaction to an interlocutor’s talk and sharing a non-turn status, e.g. short verbalisations, sentence completion, requests for clarifications, restatements, shakes, frowns (Xudong 2009). Listener response is a form of confluence-oriented behaviour (McCarthy 2006) which contributes to the construction and smooth handling of conversation (Krauss et al. 1982). Response practices can vary within the same language/culture in terms of placement and function in the turn sequence and the roles played by the same listener response types (Schiffrin 1987; Gardner 2007). They can also vary across cultures/groups (Cutrone 2005; Tottie 1991) and between the sexes (Makri-Tsilipakou 1994; Rühlemann 2010). Therefore, interlocutors from different linguistic/cultural backgrounds may experience communication breakdown, social friction and the emergence of negative attitudes (Wieland 1991; Li 2006), including participants in examiner-EFL examinee interactions (Götz 2013) and in EFL peer-to-peer interactions (Castello 2013). This paper explores the listener response behaviour of EFL examinees in the Trinity Lancaster Corpus (Gablasova et al. 2015), which may display interference from the examinees’ L1s and affect the examiners’ impression of their fluency. It aims to: identify forms of verbal listener responses in examinee turns and classify them in terms of conventions of form (mainly following Clancy et al. 1996) and conventions of function (mainly following Maynard 1997); identify strategies for co-constructing turn-taking, if any (Clancy/McCarthy 2015); and determine the frequencies of occurrence of the above phenomena across types of interaction, examinees’ perceived proficiency levels and between the sexes.

Erik Castello is Assistant Professor of English Language and Translation at the University of Padua, Italy. His research interests include (learner) corpus linguistics, discourse analysis, language testing, academic English and SFL. He has co-edited two volumes and published two books and several articles on these topics.

Sara Gesuato is Associate Professor of English language at the University of Padua, Italy. Her research interests include pragmatics, genre analysis, verbal aspect, and corpus linguistics. She has co-edited two volumes on pragmatic issues in language teaching, and is currently investigating sociopragmatic aspects of L2 written speech acts.

  2. Formulaic expressions in learner speech: New insights from the Trinity Lancaster Corpus

Francesca Coccetta, Ca’ Foscari University of Venice

This study investigates the use of formulaic expressions in the dialogic component of the Trinity Lancaster Corpus. Formulaic expressions are multi-word units serving pragmatic or discourse structuring functions (e.g. discourse markers, indirect forms performing speech acts, and hedges), and their mastery is essential for language learners to sound more native-like. The study explores the extent to which the Trinity exam candidates use formulaic expressions at the various proficiency levels (B1, B2 and C1/C2), and the differences in their use between successful and less successful candidates. In addition, it investigates how the exam candidates compare with native speakers in the use of formulaic expressions. To do this, recurrent multi-word units consisting of two to five words will be automatically extracted from the corpus using Sketch Engine; then, the data will be manually filtered to eliminate unintentional repetitions, phrase and clause fragments (e.g. in the, it and, of the), and the multi-word units that do not perform any pragmatic or discourse function. The high-frequency formulaic expressions of each proficiency level will be provided and compared with each other and with the ones identified in previous studies on native speech. The results will offer new insights into learners’ use of prefabricated expressions in spoken language, particularly in an exam setting.
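The study itself uses Sketch Engine for the automatic extraction; the hedged Python sketch below shows what that first, automatic step amounts to (listing recurrent two-to-five-word sequences above a frequency threshold), while the crucial manual filtering of fragments such as in the or of the is only flagged in a comment.

```python
# Hedged sketch of the automatic step described above: extract recurrent
# 2-5 word sequences from transcribed speech (the study itself uses Sketch Engine;
# the manual filtering of fragments such as "in the" is only hinted at here).
import re
from collections import Counter

def recurrent_ngrams(text, n_min=2, n_max=5, min_freq=2):
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

sample = "you know I think that you know it is sort of difficult sort of hard"
for gram, freq in sorted(recurrent_ngrams(sample).items(), key=lambda x: -x[1]):
    print(freq, gram)  # candidates still need manual filtering for fragments
```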

Francesca Coccetta is a tenured Assistant Professor at Ca’ Foscari University of Venice. She holds a doctorate in English Linguistics from Padua University where she specialised in multimodal corpus linguistics. Her research interests include multimodal discourse analysis, learner corpus research, and the use of e-learning in language learning and teaching. 

  3. The development of high-frequency verbs in spoken EFL and ESL

Gaëtanelle Gilquin, Université catholique de Louvain

This project aims to contribute to the recent effort to bridge the paradigm gap between second language acquisition research and corpus linguistics. While most such studies have relied on written corpus data to compare English as a Foreign Language (EFL) and English as a Second Language (ESL), the present study will take advantage of a new resource, the Trinity Lancaster Corpus, to compare speech in an EFL variety (Chinese English) and in an ESL variety (Indian English). The focus will be on high-frequency verbs and how their use develops across proficiency levels in the two varieties, as indicated by the CEFR scores provided in the corpus. Various aspects of language will be considered, taking high-frequency verbs as a starting point, among which grammatical complexity (e.g. through the use of infinitival constructions of the causative type), idiomaticity (e.g. through the degree of typicality of object nouns) and fluency (e.g. through the presence of filled pauses in the immediate environment). The assumption is that, given the different acquisitional contexts of EFL and ESL, one and the same score in EFL and ESL may correspond to different linguistic realities, and that similar developments in scores (e.g. from B1 to B2) may correspond to different developments in language usage. More particularly, it is hypothesised that EFL speakers will progress more rapidly in aspects that can benefit from instruction (e.g. those involving grammatical rules), whereas ESL speakers will progress more rapidly in aspects that can benefit from exposure to naturalistic language (like phraseology).

Gaëtanelle Gilquin is a Lecturer in English Language and Linguistics at the University of Louvain. She is the coordinator of LINDSEI and one of the editors of The Cambridge Handbook of Learner Corpus Research. Her research interests include spoken learner English, the link between EFL and ESL, and applied construction grammar.

  4. Describing fluency across proficiency levels: From ‘can-do statements’ towards learner-corpus-informed descriptions of proficiency

Sandra Götz, Justus Liebig University Giessen

While it has been noted that current assessment scales (e.g. the Common European Framework of Reference, CEF; Council of Europe 2009) describing learners’ proficiency levels in ‘can-do statements’ are often formulated somewhat vaguely (e.g. North 2014), researchers and CEF developers have pointed out the benefits of including more specific linguistic descriptors emerging from learner corpus analyses (e.g. McCarthy 2013; Park 2014). In this project, I will test how/if descriptions of fluency in learner language such as the CEF can benefit from analyzing learner data at different proficiency levels in the Trinity Lancaster Corpus. More specifically, I will test whether the learners’ proficiency levels can serve as robust predictors of their use of core fluency variables, such as filled and unfilled pauses (e.g. er, erm, eh, ehm), discourse markers (e.g. you know, like, well), or small words (e.g. sort of, kind of). I will also test whether learners show similar or different paths in their developmental stages of fluency from the B1 to the C2 level, regardless of (or depending on) their L1. Through the meta-information available on the learners in the Trinity Lancaster Corpus, sociolinguistic and learning-context variables (such as the learners’ age, gender or the task type) will also be taken into consideration in developing data-driven descriptor scales of fluency at different proficiency levels. Thus, it will be possible to differentiate between L1-specific and universal learner features in fluency development.

Sandra Götz obtained her PhD from Justus Liebig University Giessen and Macquarie University Sydney in 2011. Since then, she has been working as a Senior Lecturer in English Linguistics at University of Giessen. Her main research interests include (learner) corpus linguistics and its application to language teaching and testing, applied linguistics and World Englishes.

  5. Self-repetition in the spoken English of L2 English learners: The effects of task type and proficiency levels

Lalita Murty, University of York

Self-repetition (SR), where the speaker repeats a word or phrase, is a much-observed phenomenon in spoken discourse. SR serves a range of distinct communicative and interactive functions in interaction, such as expressing agreement or disagreement or adding emphasis to what the speaker wants to say, as the following example shows: ‘Yes, I know I know and I certainly think that limits are…’ (to express agreement with the previous speaker) (Gablasova et al., 2015). Self-repetitions also help in creating coherence (Bublitz, 1989, as cited in Fung, 2007: 224), enhancing the clarity of the message (Kaur, 2012), keeping the floor, maintaining the smooth flow of conversation, linking the speaker’s ideas to the previous speaker’s ideas (Tannen, 1989), and initiating self- and other-repairs (Bjorkman, 2011; Robinson and Kevoe-Feldman, 2010). This paper will use Sketch Engine to extract instances of single content-word self-repetitions from the Trinity Lancaster Corpus data in order to examine the effect of (i) L2 proficiency levels and (ii) task types on the frequency and functions of the different types of self-repetition made by speakers at varying proficiency levels in the different tasks. A quantitative and qualitative analysis of the data thus extracted will be conducted using a mix of Norrick’s (1987) framework and CA approaches.

Lalita Murty is a Lecturer at the Norwegian Study Centre, University of York.  Her previous research focused on spoken word recognition and call centre language. Currently she is working on Reduplication and Iconicity in Telugu, a South Indian language.

  6. Certainty adverbs in learner language: The role of tasks and proficiency

Pascual Pérez-Paredes, University of Cambridge and María Belén Díez-Bedmar, University of Jaén

When comparing native and non-native use of stance adverbs, the effect of task has been largely ignored. An exception is Gablasova et al. (2015). The authors researched the effect of different speaking tasks on L2 speakers’ use of epistemic stance markers and concluded that there was a significant difference between the monologic prepared tasks and every other task, and between the dialogic general topic and the dialogic pre-selected topic (p < .05). This study suggests that the type of speaking task conditions speakers’ repertoire of markers, including certainty markers. Pérez-Paredes & Bueno (forthcoming) looked at how certainty stance adverbs were employed during the picture description task in the LINDSEI and the extended LOCNEC (Aguado et al., 2012). In particular, the authors discussed the contexts of use of obviously, really and actually by native and non-native speakers across the same speaking task in the four datasets when expressing the range of meanings associated with certainty. The authors found that different groups of speakers used these adverbs differently, both quantitatively and qualitatively. Our research seeks to expand on the findings of Gablasova et al. (2015) and Pérez-Paredes & Bueno (forthcoming) and examine the uses of certainty adverbs across the L1s, proficiency levels and tasks represented in the Trinity Lancaster Corpus. We believe that the use of this corpus, together with the findings from the LINDSEI, will help us reach a better understanding of the uses of certainty adverbs in spoken learner language.

Pascual Pérez-Paredes is a Lecturer in Research in Second Language Education at the Faculty of Education, University of Cambridge. His main research interests are learner language variation, the use of corpora in language education and corpus-assisted discourse analysis.

María Belén Díez-Bedmar is Associate Professor at the University of Jaén (Spain). Her main research interests include Learner Corpus Research, error-tagging, the learning of English as a Foreign Language, language testing and assessment, the CEFR and CMC.  She is currently involved in national and international corpus-based projects.

  7. Emerging verb constructions in spoken learner English

Ute Römer and James Garner, Georgia State University

Recent research in first language (L1) and second language (L2) acquisition has demonstrated that we learn language by learning constructions, defined as conventionalized form-meaning pairings. While studies in L2 English acquisition have begun to examine construction development in learner production data, these studies have been based on rather small corpora. Using a larger set of data from the Trinity Lancaster Corpus (TLC), this study investigates how verb-argument constructions (VACs; e.g. ‘V about n’) emerge in the spoken English of L2 learners at different proficiency levels. We will systematically and exhaustively extract a small set of VACs (’V about n’, ‘V for n’, ‘V in n’, ‘V like n’, and ‘V with n’) from the L1 Italian and L1 Spanish subsets of the TLC, separately for three CEFR proficiency levels. For each VAC and L1-proficiency combination (e.g. Italian-B1), we will create frequency-sorted verb lists, allowing us to determine how learners’ verb-construction knowledge develops with increasing proficiency. We will also examine in what ways VAC emergence in the TLC data is influenced by VAC usage as captured in a large native-speaker reference corpus (the BNC). We will use chi-square tests to compare VAC type and token frequencies across L1 subsets and proficiency levels. We will use path analysis (a type of structural equation modeling) including the predictor variables L1 status, proficiency level, and BNC usage information to gain insights into how learner characteristics and variables concerning L1 construction usage affect the emergence of the target VACs in spoken L2 learner English.
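As a minimal illustration of the planned chi-square comparisons, the sketch below tests whether the token frequency of a single VAC differs between two L1 subsets using scipy; the counts in the contingency table are invented placeholders, and the actual study will work with the full set of VACs, proficiency levels and type/token frequencies.

```python
# Minimal sketch of the planned chi-square comparison: does the token frequency of
# a VAC differ across two L1 subsets? All counts below are invented placeholders.
from scipy.stats import chi2_contingency

# rows: L1 Italian, L1 Spanish; columns: tokens of 'V about n', all other verb tokens
table = [[120, 14880],
         [95, 12405]]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p:.3f}")
```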

Ute Römer is currently Assistant Professor in the Department of Applied Linguistics and ESL at Georgia State University. Her research interests include corpus linguistics, phraseology, second language acquisition, discourse analysis, and the application of corpora in language teaching. She serves on a range of editorial and advisory boards of professional journals and organizations, and is general editor of the Studies in Corpus Linguistics book series.

James Garner is currently a PhD student in the Department of Applied Linguistics and ESL at Georgia State University. His current research interests include learner corpus research, phraseology, usage-based second language acquisition, and data-driven learning.

  8. Verb-argument constructions in Chinese EFL learners’ spoken English production

Jiajin Xu and Yang Liu, Beijing Foreign Studies University

The widespread recognition of the usage-based approach to constructions has made corpus linguistics a most viable methodology for scrutinising frequent morpho-syntactic patterns such as verb-argument constructions (VACs) in learner language. The present study examines the use of VACs in Chinese EFL learners’ spoken English. Our focus will be on the semantics of the verbal constructions in light of collostructional statistics (Stefanowitsch & Gries, 2003), as well as on comparisons across learners’ proficiency levels and task types. Twenty VACs were collected from COBUILD Grammar Patterns 1: Verbs (Francis, Hunston & Manning, 1996). On the basis of the VAC concordances retrieved from the Trinity Lancaster Corpus, the semantic prototypicality of the VACs will be analysed according to the collocational strength of verbs with their host constructions. Comparisons of Chinese EFL learners against native speakers will be made, and also across task types. It is hoped that our findings will shed light on Chinese EFL learners’ knowledge of VACs and the crosslinguistic influence that shapes the verb semantics of learners’ spoken English. We also consider language proficiency and task type as potential factors that may account for differences across CEFR groups within the Chinese EFL learner data.

Jiajin Xu is Professor of Linguistics at the National Research Centre for Foreign Language Education, Beijing Foreign Studies University as well as secretary general and a founding member of the Corpus Linguistics Society of China. His research interests include discourse studies, second language acquisition, contrastive linguistics and translation studies, and corpus linguistics.

Yang Liu is currently a PhD candidate at Beijing Foreign Studies University. His research focus is on the corpus-based study of construction acquisition of Chinese EFL learners.


Morphological complexity: How is grammar acquired and how do we measure this?

Vaclav Brezina and Gabriele Pallotti

Inflectional morphology has to do with how words change their form to express grammatical meaning. It plays an important role in a number of languages. In these languages, the patterns of word change may for example indicate number and case on nouns, or past, present and future tense on verbs. For example, to express the past participle in German we regularly add the prefix ge- and optionally modify the base. Ich gehe [I go/walk] thus becomes Ich bin gegangen [I have walked].  English also inflects words (e.g. walk – walks – walking – walked; drive – drove – driven) but the range of inflected forms is narrower than in many other languages. The range of morphological forms in a text can be seen as its morphological complexity. Simply put, it is an indicator of the morphological variety of a text, i.e. how many changes to the dictionary forms of the words are manifested in the text.
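To make the notion of morphological variety a little more concrete, the Python sketch below counts the distinct inflectional endings in a small list of regular English verb forms. This is only an illustration of the intuition; it is not the authors' Morphological Complexity Index, which is computed by the tool described below over random subsamples of verb exponences.

```python
# Highly simplified illustration of 'morphological variety': count the distinct
# inflectional endings in a list of (regular) English verb forms. This is NOT the
# authors' Morphological Complexity Index, which their online tool computes over
# random subsamples of verb exponences; it only shows the underlying intuition.
SUFFIXES = ("ing", "ed", "s")  # crude: regular-verb endings only

def exponence(verb_form):
    """Return the inflectional ending of a regular verb form, or 'BASE' if none is found."""
    for suffix in SUFFIXES:
        if verb_form.endswith(suffix):
            return suffix
    return "BASE"

verb_forms = ["walks", "walked", "walking", "walk", "talks", "talked"]
variety = {exponence(v) for v in verb_forms}
print(variety, "->", len(variety), "distinct exponences")
```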

To find out more about morphological complexity, how it can be measured and how L2 speakers acquire it, see our publications on the Morphological Complexity Index.

Gabriele Pallotti and I have been working together to investigate this construct and to develop a tool that can analyse the morphological complexity of texts. So far, the tool has been implemented for English, Italian and German verbal morphology. Currently, together with Michael Gauthier from Université Lyon, we are implementing the morphological complexity measure for French verbs.

To analyse a text in the Morphological Complexity tool, copy and paste the text into the text box, select the appropriate language and press ‘Analyse text now’ (Fig. 1).

Figure 1. Morphological tool: Interface

The tool outputs the results of the linguistic analysis, highlighting all verbs and nouns in the text and identifying morphological changes (exponences). After clicking on the ‘Calculate MCI’ button, the tool also automatically calculates the Morphological Complexity Index (MCI) – see Fig. 2.

Figure 2. Morphological tool output: Selected parts

 

Introducing a new project with the British Library

Since 2012 the BBC has been working with the British Library to build a collection of intimate conversations from across the UK in the BBC Listening Project. Through its network of local radio stations, and with the help of a travelling recording booth, the BBC has captured many conversations between people who know one another well, on a range of topics, in high-quality audio.

For the past two years we have been discussing with the BBC and the British Library the possibility of using these recordings as the basis of a large-scale extension of our spoken BNC corpus. The Spoken BNC2014 has so far been built to reflect language in intimate settings, with recordings made in the home. This has produced a large and very useful collection of data but, without the resources of an organisation such as the BBC, we were not able to roam the country with a sound recording booth to sample language from John o’Groats to Land’s End! By teaming up with the BBC and the British Library we can supplement this very useful corpus, which is strongly focused on a ‘hard to capture’ context (intimate conversations in the home), with another type of data: intimate conversations in a public situation, sampled from across the UK.

Another way in which the Listening Project data should prove helpful to linguists is that it was captured in a recording studio as high-quality audio. Our hope is that a corpus based on this material will be of direct interest and use to phoneticians.

We have recently concluded our discussions with the British Library, which is archiving this material, and signed an agreement which will see CASS undertake orthographic transcription of the data. Our goal is to provide a high-quality transcription that will be of use both to linguists and to members of the public who may wish to browse the collection. In doing this we will be building on our experience of producing the Trinity Lancaster Corpus of Spoken Learner English and the Spoken BNC2014.

We take our first delivery of recordings at the beginning of March and are very excited at the prospect of lifting the veil a little further on the fascinating topic of everyday conversation and language use. The plan is to transcribe up to 1,000 of the recordings archived at the British Library. We will also be working to time-align the transcriptions with the sound recordings, and are working closely with the strong phonetics team in the Department of Linguistics and English Language at Lancaster University to begin to assess the extent to which this new dataset could facilitate new work, for example on the accents of the British Isles.

Our partners in the British Library are just as excited as we are – Jonnie Robinson, lead Curator for Spoken English at the British Library says ‘The British Library is delighted to enable Lancaster to make such innovative use of the Listening Project conversations and we look forward to working with them to make the collection more accessible and to enhance its potential to support linguistic and other research enquiries’.

Keep an eye on the CASS website and Twitter feed over the next couple of years for further updates on this new project!