Introducing Yufang Qian to CASS

CASS is delighted to welcome visiting researcher Yufang Qian to the centre, where she will be working on a project exploring the representation of Chinese medicine in British historical news texts over the last 200 years. Continue reading to find out more about Yufang and the research which she will be undertaking!


In 2009, Yufang Qian obtained her PhD at Lancaster University with a dissertation on corpus-based discourse studies, under the supervision of Professors Tony McEnery and Paul Baker. She then returned to Zhejiang University of Media and Communications (ZUMC) and was appointed Professor in 2011.

Yufang is committed to popularizing the combination of corpus and discourse approaches in China. She has taught corpus linguistics and media discourse at the ZUMC to students at all levels and supervised more than 50 students’ dissertations relating to corpus-based discourse studies, in disciplines as diverse as communication studies, education, sociology, psychology and politics. The students have then either pursued further education in this area in the UK, USA, Japan, South Korea, and Hong Kong, or have used the expertise they have gained in various institutions and organizations in China.

In 2010 Yufang published ‘Corpus and Critical Discourse Analysis’ in the journal Foreign Language Teaching and Research, the first paper to introduce corpus-based discourse analysis in a Chinese journal. To date, it has been cited 48 times and downloaded 3515 times. In the past few years she has published nearly two dozen journal articles on corpus-based media discourse analysis. Her PhD thesis, Discursive constructions around terrorism in the People’s Daily and The Sun before and after 9.11 (Oxford: Peter Lang 2010), won the Third Prize in the Sixth Outstanding Achievement Awards for Research in Humanities and Social Sciences, conferred by the Ministry of Education in 2013, the top governmental award in social science in China.

To explain and promote the application of the corpus-based discourse approach, Yufang has spoken at many national and international conferences and has given lectures at more than a dozen universities in China. She is Founding Director of the Research Center for Discourse and Communications at the ZUMC, the first of its kind in China. She is principal investigator for many research projects, such as ‘Discursive constructions around the low carbon economy in the press of China, the UK and the US’, funded by the Ministry of Education; and ‘A corpus-based comparative study of Western and Chinese political discourse analysis’, funded by the National Social Science Foundation. She is also co-principal investigator of the project entitled ‘A comparative study of the discourse system in Chinese dream films’, funded by the National Social Science Foundation.

Yufang’s comparative perspective is evident from her early paper, ‘Contrasting signals of politeness between Western and Eastern countries’, published in Education in China (ed. E Fizette; Fenton, MI: Hana Guild, 1993). Since 2014, she has been working with CCPN Global (China in Comparative Perspective Network Global, an affiliate member of the Academy of Social Sciences, UK) to develop a project entitled ‘Corpus approaches for Chinese social science (CACSS)’. She is organizing a panel on ‘Corpus approaches to governance in the context of climate change’ at the 3rd Global China Dialogue on 2 December 2016 at the British Academy.

Yufang has recently returned to her alma mater, Lancaster University, as a visiting researcher, where she will work with Professor McEnery on a project exploring the representation of Chinese medicine in British historical news texts over the last 200 years. This diachronic observation of discourse on Chinese medicine is significant in that it will provide specific evidence of the media’s role in public health vis-à-vis the use of traditional Chinese medicine in the West. It is hoped that the findings of this study will help bridge the gap between Western and Chinese medicine, both of which play a role in serving public health.

NewsHack 2016 Retrospective

The BBC’s multilingual NewsHACK event was run on the 15th and 16th of March as an opportunity for teams of language technology researchers to work with multilingual data from the BBC’s connected studio.  The theme was ‘multilingual journalism: tools for future news’, and teams were encouraged to bring some existing language technologies to apply to problems in this area. Nine teams attended from various news and research organisations. Lancaster University sent two teams with funding from CASS, CorCenCC, DSI, and UCREL: team ‘1’ consisting of Paul, Scott and Hugo, and team ‘A’ comprising Matt, Mahmoud, Andrew and Steve.

image00

The brief from the newsHACK team suggested two possible directions: to provide a tool for the BBC’s journalist staff, or to create an audience-facing utility. To support us, the BBC provided access to a variety of APIs, but the Lancaster ‘A’ team were most interested to discover that something we’d thought would be available wasn’t — there is no service mapping news stories to their counterparts in other languages. We decided to remedy that.

The BBC is a major content creator, perhaps one of the largest multilingual media organisations in the world. This presents a great opportunity. Certain events are likely to be covered in every language the BBC publishes in, providing ‘translations’ of the news which are not merely literal translations at the word, sentence or paragraph level, but full-fledged contextual translations which identify the culturally appropriate ways to convey the same information. Linking these articles together could help the BBC create a deeply interesting multilingual resource for exploring questions about language, culture and journalism.

Interesting, but how do we make this into a tool for the BBC? Our idea was to take these linked articles directly to the users. Say you have a friend who would prefer to read the news in their native tongue, one different to your own: how would you share a story with them? Existing approaches seem to involve either using an external search engine (but then, not speaking the target language, how do you know the results are what you intend to share?) or using machine translation to offer your friend a barely-readable version of the exact article you have read. We came up with an idea that keeps content curation within the BBC and provides readers with easy access to the existing high-quality translations being made by professional writers: a simple drop-down menu for articles which allows a user to ‘Read about this in…’ any of the BBC’s languages.

image03

To implement this, in two days, required a bit of creative engineering. We wanted to connect articles based on their content, but we didn’t have tools to extract and compare features in all the BBC’s languages. Instead, we translated small amounts of text — article summaries and a few other pieces of information — into English, which has some of the best NLP tool support (and was the only language all of our team spoke). Then we could use powerful existing solutions to named entity recognition and part-of-speech tagging to extract informative features from articles, and compare them using a few tricks from record linkage literature. Of course, a lack of training data (and time to construct it!) meant that we couldn’t machine-learn our way to perfection for weighting these features, so a ‘human learning’ process was involved in manually tweaking the weights and thresholds until we got some nice-looking links between articles in different languages.
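For readers curious what that kind of hand-weighted comparison might look like, here is a minimal Python sketch. It is not the code from the hack (that lives in the GitHub repository linked at the end of this post); the feature names, weights, threshold and example data are all hypothetical, and set overlap (Jaccard similarity) is used as one simple stand-in for the record linkage tricks mentioned above.

```python
def jaccard(a, b):
    """Overlap between two feature sets: 0.0 (disjoint) to 1.0 (identical)."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Hypothetical weights and threshold, standing in for values tuned by hand.
WEIGHTS = {"entities": 0.6, "nouns": 0.3, "date": 0.1}
THRESHOLD = 0.4


def similarity(x, y, weights=WEIGHTS):
    """Weighted average of per-feature overlaps between two articles."""
    total = sum(weights.values())
    score = sum(w * jaccard(x.get(f, ()), y.get(f, ())) for f, w in weights.items())
    return score / total


def is_link(x, y, threshold=THRESHOLD):
    """Treat two articles as covering the same story if they score high enough."""
    return similarity(x, y) >= threshold


# Invented feature sets, as might be extracted from English-translated summaries.
english = {
    "entities": {"bbc", "lancaster", "london"},
    "nouns": {"event", "news", "team"},
    "date": {"2016-03-15"},
}
arabic_summary = {
    "entities": {"bbc", "london"},
    "nouns": {"event", "news", "prize"},
    "date": {"2016-03-15"},
}
unrelated = {
    "entities": {"nasa"},
    "nouns": {"rocket", "launch"},
    "date": {"2016-03-10"},
}

print(round(similarity(english, arabic_summary), 2))
print(is_link(english, arabic_summary), is_link(english, unrelated))
```

Raising the weight on named entities or lowering the threshold changes which pairs count as links, which is the knob-twiddling that the ‘human learning’ step amounts to.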

Data is only part of the battle, though. We needed a dazzling front-end to impress the judges.  We used a number of off-the-shelf web frameworks to quickly develop a prototype, drawing upon the BBC’s design to create something that could conceivably be worked into a reader’s workflow: users enter a URL at the top and are shown results from all languages in a single dashboard, from which they can read or link to the original articles or their identified translations.


Here we have retrieved a similar article in Arabic, as well as two only-vaguely-similar ones in Portuguese and Spanish (the number indicates a percentage similarity).  The original article text is reproduced, along with a translated English summary.

image01

The judges were impressed (perhaps as much with our pun-filled presentation as our core concept) and our contribution, the spontaneously-titled ‘Super Mega Linkatron 5000’, was joint winner in the category of best audience-facing tool.

The BBC’s commitment to opening up their resources to outsiders seems to have paid off with a crop of high-quality concepts from all the competitors, and we’d like to thank them for the opportunity to attend (as well as the pastries!).

The code and presentation for the team ‘A’ entry are available via GitHub at https://github.com/StephenWattam/LU-Newshack and images from Lancaster’s visit can be seen at https://flic.kr/s/aHskwHcpNH . Some of the team have also written their own blog posts on the subject: here are Matt’s and Steve’s.

Team ‘1’ based their work around the BBC Reality Check service. This was part of the BBC News coverage of the 2015 UK general election; it published news items on Twitter and contributed to TV and radio news as well. For example, in May 2015 when a politician claimed that the number of GPs had declined under the coalition government, BBC Reality Check produced a summary of data obtained from a variety of sources to enable the audience to make up their own mind about this claim. Reality Check is continuing in 2016 with a similar service for the EU referendum, providing, for example, a check on how many new EU regulations there are every year (1,269 rather than the 2,500 claimed by Boris Johnson!). After consulting with the BBC technology producer and journalist attending the newsHACK, Team ‘1’ realised that the current Reality Check service could only serve its purpose for English news stories, so set about making a new ‘BBC Multilingual Reality Check’ service to support journalists in their search for suitable sources. Having a multilingual system is really important for the EU referendum and other international news topics, because potential sources are often written in languages other than English.

In order to bridge related stories across different languages, we adopted the UCREL Semantic Analysis System (USAS) developed at Lancaster over the last 26 years. The system automatically assigns semantic fields (concepts or coarse-grained senses) to each word or phrase in a given text, and we reasoned that the frequency profile of these concepts would be similar for related stories even in different languages e.g. the semantic profile could help distinguish between news stories about finance or education or health. Using the APIs that the BBC newsHACK team provided, we gathered stories in English, Spanish and Chinese (the native languages spoken by team ‘1’). Each story was then processed through the USAS tagger and a frequency profile was generated. Using a cosine distance measure, we ranked related stories across languages. Although we only used the BBC multilingual news stories during the newsHACK event, it could be extended to ingest text from other sources e.g. UK Parliamentary Hansard and manifestos, proceedings of the European parliament and archives of speeches from politicians (before they are removed from political party websites).
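As a rough illustration of the ranking step, the Python sketch below builds a frequency profile from a list of semantic tags and ranks candidate stories by cosine distance. It is not the team’s code: the tag labels (MONEY, HEALTH and so on) are placeholders rather than real USAS tags, the stories are invented, and the tagging itself is assumed to have been done already.

```python
import math
from collections import Counter


def cosine_similarity(p, q):
    """Cosine of the angle between two tag-frequency profiles."""
    dot = sum(p[t] * q[t] for t in p.keys() & q.keys())
    norm_p = math.sqrt(sum(v * v for v in p.values()))
    norm_q = math.sqrt(sum(v * v for v in q.values()))
    if norm_p == 0 or norm_q == 0:
        return 0.0
    return dot / (norm_p * norm_q)


def cosine_distance(p, q):
    """Smaller values mean more similar profiles."""
    return 1.0 - cosine_similarity(p, q)


def rank_related(query, candidates):
    """Return candidate story names, closest profile first."""
    return sorted(candidates, key=lambda name: cosine_distance(query, candidates[name]))


# Placeholder tag sequences, one per story, as a semantic tagger might emit them.
english_story = Counter(["MONEY", "MONEY", "MONEY", "GOVERNMENT"])
candidates = {
    "zh_health": Counter(["HEALTH", "HEALTH", "GOVERNMENT"]),
    "es_finance": Counter(["MONEY", "MONEY", "GOVERNMENT", "GOVERNMENT"]),
}

print(rank_related(english_story, candidates))
```

Because the profiles are built from language-independent concept labels rather than words, the same comparison applies whichever language each story was written in, which is the property the approach relies on.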

The screenshot below shows our analysis focussed on some main topics of the day: UK and Catalonia referendums, economics, Donald Trump, and refugees. Journalists can click on news stories in the system and show related articles in the other languages, ranked by our distance measure (shown here in red).

Team ‘1’s Multilingual Reality Check system would not only allow fact checking such as the number of refugees and migrants over time entering the EU, but also allow journalists to observe different portrayals of the news about refugees and migrants in different countries.

image02

Upcoming CASS Psycholinguistics Seminar

CASS is excited to announce an upcoming half-day research seminar on the theme of “Corpus Data and Psycholinguistics”. The event will take place on Thursday 19th May 2016 at 1-5pm in Furness Lecture Theatre 3.

The aim of the event is to bring together researchers with an interest in combining methods from corpus linguistics and psycholinguistics. In particular, there will be a focus on experimental psycholinguistics. It is set to be an exciting afternoon consisting of four 40-minute presentations from both internal and external speakers. Professor Padraic Monaghan from the Department of Psychology will be giving an introduction to computational modelling in psycholinguistics, and I will be presenting my work on investigating the processing of collocation using EEG. Furthermore, Dr Phil Durrant from the University of Exeter will be giving a talk entitled “Revisiting collocational priming”, and Professor Michaela Mahlberg from the University of Birmingham will be discussing the methodological issues associated with combining eye-tracking techniques with corpus data.

You can find out more about these talks from the abstracts below.


Padraic Monaghan, Lancaster University

Computational modelling of corpus data in psycholinguistic studies

Computational models of language learning and processing enable us to determine the inherent structure present in language input, and also the cognitive mechanisms that react to this structure. I will give an introduction to computational models used in psycholinguistic studies, with a particular focus on connectionist models where the structure of processing is derived principally from the structure of the input to the model.


Phil Durrant, University of Exeter

Revisiting collocational priming

Durrant & Doherty (2010) evaluated whether collocations at different levels of frequency exhibit psycholinguistic priming. It also attempted to untangle collocation from the related phenomenon of psychological association by comparing collocations which were and were not associates. Priming was found between high-frequency collocations but associated collocates appeared to exhibit more deep-rooted priming (as reflected in a task designed to reflect automatic, rather than strategic processes) than those which were not associated. This presentation will critically review the 2010 paper in light of more recent work. It will re-evaluate the study itself and suggest ways in which research could be taken forward.

Durrant, P., & Doherty, A. (2010). Are high-frequency collocations psychologically real? Investigating the thesis of collocational priming. Corpus linguistics and linguistic theory, 6(2), 125-155.


Jennifer Hughes, Lancaster University

Investigating the processing of collocation using EEG: A pilot study

In this presentation, I discuss the results of an EEG experiment which pilots a procedure for determining whether or not there is a quantitatively distinct brain response to the processing of collocational bigrams compared to non-collocational bigrams. Collocational bigrams are defined as adjacent word pairs which have a high forward transitional probability in the BNC (e.g. crucial point), while non-collocational bigrams are defined as adjacent word pairs which are semantically plausible but are absent from the BNC (e.g. crucial night). The results show that there is a neurophysiological difference in how collocational bigrams and non-collocational bigrams are processed.
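For readers unfamiliar with the term, forward transitional probability is usually understood as the probability of the second word given the first: the number of times the bigram occurs divided by the number of times its first word is followed by any word. The Python sketch below shows that calculation on a toy token list. The toy data stands in for the BNC, and the abstract does not state the exact counting procedure or the cut-off for ‘high’, so both are assumptions here.

```python
from collections import Counter


def forward_transitional_probability(tokens, w1, w2):
    """Estimate P(w2 | w1): how often w1 is immediately followed by w2."""
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    # Count w1 only where another word follows it (i.e. not at the very end).
    first_word_counts = Counter(tokens[:-1])
    if first_word_counts[w1] == 0:
        return 0.0
    return bigram_counts[(w1, w2)] / first_word_counts[w1]


# Invented toy text standing in for a real corpus.
tokens = "a crucial point and a crucial point but a crucial role".split()

print(forward_transitional_probability(tokens, "crucial", "point"))
print(forward_transitional_probability(tokens, "crucial", "night"))
```

In the toy data, crucial is followed by point in two of its three occurrences, while crucial night never occurs, mirroring the contrast between the two bigram types.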


Michaela Mahlberg, Kathy Conklin, and Gareth Carrol, University of Birmingham

Exploring corpus-attested patterns in Dickens’s fiction – methodological challenges of using eye-tracking techniques

The study of the relationship between patterns and meanings is a key concern in corpus linguistics. The data that corpus linguists work with, however, only provides a partial picture. In this paper, we will look at how questions of frequencies in corpora can be related to questions raised by data from eye-tracking studies on reading times. We will also discuss challenges of designing experiments to address these questions. As a case study, we focus on examples of patterns identified in Dickens’s fiction, but the methodological issues we address have wider implications beyond the study of literary corpora.


The event is free to attend and is open to both internal and external attendees. If you are an external guest, please email j.j.hughes(Replace this parenthesis with the @ sign)lancaster.ac.uk so we know that you intend to come.

We are really looking forward to this event as it will be an exciting opportunity to share ideas regarding the different approaches to using corpus data in experimental psycholinguistics.

Ants On Fire

Being an honorary research fellow at CASS is not only a great honor but a great pleasure. In December of 2015, my initial three-year fellowship at CASS was extended for a further three years, and this introduced the possibility of returning to Lancaster for a sabbatical-length seven-week research stay between February and March of 2016.

The timing of this research stay was especially enjoyable as it coincided with CASS receiving the Queen’s Anniversary Prize for Higher and Further Education for its contributions to computer analysis of world languages in print, speech and online. As part of a week of celebrations at the Centre, I worked with Claire Hardaker of CASS to organize a launch event for our new FireAnt social media analysis toolkit on February 22. FireAnt is a tool that allows researchers to easily extract relevant data from the social media data sources, visualize that data in the form of time-series plots, network graphs, and geolocation maps, and export results for further analysis using traditional corpus tools. At the event, 20 invited participants learned how to use the new tool to analyze Twitter and other social media data sets. They also gave us very valuable comments and suggestions that were immediately incorporated into the software before it was released to the public later on the same day.

Screenshot of FireAnt main display

Following the release of FireAnt, I then worked with Claire over the next few weeks on our first research project utilizing the software – a forensic corpus linguistics analysis of the Ashley Madison dataset. Here, we used FireAnt to identify the creation and activities of automated ‘Angel’ accounts on the site. We presented preliminary results from this analysis at a UCREL/Forge event on March 18 that was attended by a wide range of forensic linguists, corpus linguists, computer scientists, and others around the university.

Time-series analysis of account creation in the AshleyMadison data set

One of the great advantages of being at Lancaster is that it is home to excellent scholars who are interested in the entire span of linguistics fields. Since my one-year sabbatical at Lancaster, I’ve had the pleasure of working with Marije Michel in the Dept. of Linguistics and English Language, who uses eye-tracking methodology in her research into Second Language Acquisition (SLA), Task-Based Language Teaching (TBLT) and written Synchronous Computer Mediated Communication (SCMC). Returning to Lancaster allowed us to work together further to develop a new eye-tracking tool that has applications not only in SLA, TBLT and SCMC research, but also in corpus linguistics. Again, we presented these ideas at a UCREL event held on March 10, and we are now in the process of writing up the research for publication.

Although this research visit was mainly focused on FireAnt development, I fortunately also had time to continue work on some of the other projects that were initiated during my sabbatical year. Meeting up again with Paul Baker allowed us to consider the next stage of development of ProtAnt, which we will be presenting at the TaLC 12 conference. I also met up with Paul Rayson and CASS’s new lecturer in digital humanities, Steve Wattam, to discuss how we can promote an understanding of tools development and programming skills among corpus linguists (an area of interest that I have had for several years now). Sadly, my schedule prevented me from joining them at the BBC #newsHack, but I was so happy to hear that Steve’s team won the Editorial Prize at the event.

Nothing beats having an entire sabbatical year to focus on research and collaborate with the excellent members of CASS. But this seven-week research visit comes a very enjoyable second. I would like to thank Tony McEnery and his team for funding the visit and making me feel so welcome again at Lancaster. It was a true pleasure to be back. I look forward to continuing to work with Tony and the team over the next three years.


Biography:

Laurence Anthony is Professor of Applied Linguistics at the Faculty of Science and Engineering, Waseda University, Japan, and an Honorary Research Fellow at the ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, UK. His main interests are in corpus linguistics, educational technology, and English for Specific Purposes (ESP) program design and teaching methodologies. He received the National Prize of the Japan Association for English Corpus Studies (JAECS) in 2012 for his work in corpus software tools design. He is the developer of various corpus tools including AntConc, AntWordProfiler, FireAnt, ProtAnt, and TagAnt.

A tribute to John Urry

It is difficult to find words to express how shocked and deeply saddened I was with John’s early passing. I have worked with John for the past two years on a research project on climate change that he led with so much enthusiasm and interest. One of his ‘obsessions’, he would say.

Being a linguist by background, I was not familiar with his work when we first started this project but it did not take long for me to be happily immersed in his books and develop full admiration for his work. John opened up my horizons and has changed the way I see society and language. Genuinely modest, he was always kind, supportive, full of ideas and insightful comments that made me think and work hard. I have learned so much from him.

I will always keep a special memory of our last meeting. I asked him how he would explain some ups and downs in the frequency of the word energy in our newspaper data. He just turned to me and said: ‘We will have to examine the collocations of energy’. I was so glad to hear that and I could not help having a good laugh. He added: ‘Is collocation the wrong word?’.  ‘No John, but you sound like a corpus linguist’.

I feel very much privileged to have had the chance to work with such a brilliant mind. John was a very special person and will be greatly missed.

The Spoken BNC2014 early access projects: Part 4

In January, we announced the recipients of the Spoken BNC2014 Early Access Data Grants. Over the next several months, they will use exclusive access to the first five million words of Spoken BNC2014 data to carry out a total of thirteen research projects.

In this series of blogs, we are excited to share more information about these projects, in the words of their authors.

In the fourth and final part of our series, read about the work of Tanja Hessner & Ira Gawlitzek, Karin Axelsson, Andrew Caines et al. and Tanja Säily et al.


Tanja Hessner and Ira Gawlitzek

University of Mannheim, Germany

Women speak in an emotional manner; men show their authority through speech! – A corpus-based study on linguistic differences showing which gender clichés are (still) true by analysing boosters in the Spoken BNC2014

Western world clichés claim that women are emotional and often exaggerate, which is reflected in their speech. In contrast, men’s language is said to be characterised by bluntness. Aiming to shed a bit more light on statements like these, this study is going to consider gender differences on the lexical level.

In order to discover whether, and if so to what extent, there really is a difference between female and male speakers, the phenomenon of boosters will be investigated in the Spoken BNC2014 early access subset. Boosters such as totally or absolutely are particularly appealing and suitable for analysing gender differences since they are extremely multifaceted and they are indicators not only of lively, but also of emotional and powerful speech. Appropriate boosters will be investigated not only by using quantitative methods, but also by analysing the data in a qualitative way.


Karin Axelsson

University of Gothenburg, Sweden

Canonical and non-canonical tag questions in the Spoken BNC2014: What has happened since the original BNC?

What is happening to tag questions in British everyday conversation? Are canonical tag questions, where the form of the tag reflects that of the preceding clause (as in She won’t come, will she?), on the way out as the use of innit and other invariant tags is spreading? Who uses innit in 2014? The use of tag questions in the Spoken BNC2014 early access subset will be compared to the use in the demographic part of the original Spoken BNC reflecting the language of the early 1990s.


Andrew Caines¹, Michael McCarthy² and Paula Buttery¹

¹University of Cambridge, UK

²University of Nottingham, UK

‘You still talking to me?’ The zero auxiliary progressive in spoken British English, twenty years on

With early access to a subset of the Spoken BNC2014, we will be able to assess whether a supposedly ‘ungrammatical’ construction has become more frequently used in conversational British English over the past 20 years. The construction in question is the ‘zero auxiliary’ – for example, the progressive aspect construction may be used with an -ing verb form alone (“you talking to me?”, “What you doing?”, “We going to town”) whereas the standard rule is to combine an auxiliary verb (BE or HAVE) with the -ing form.

In the original Spoken BNC recorded in the early 1990s, the zero auxiliary occurred in one-in-twenty progressive constructions, a rate that rose to one-in-three if second person interrogatives (You talking to me? etc.) were considered alone. Moreover, younger working-class speakers were more likely to use the zero auxiliary than older middle-class speakers. We will investigate how these usage rates compare to the Spoken BNC2014, in the process updating the demographics of zero auxiliary use as well.


Tanja Säily¹, Victoria González-Díaz² and Jukka Suomela³

¹University of Helsinki, Finland

²University of Liverpool, UK

³Aalto University, Finland

Variation in the productivity of adjective comparison

The functional competition between inflectional (‑er) and periphrastic (more) comparative strategies in English has received a great deal of attention in corpus-based research. A key area of competition remains relatively unexplored, however: the productivity of either comparative strategy, or how diversely they are used with different adjectives. The received wisdom is that inflection is fully productive, so we might expect to find no variation within the productivity of ‑er. However, recent research using new methods shows sociolinguistic variation in the productivity of extremely productive derivational suffixes. Whether the same variation applies to the productivity of inflectional processes remains an open question.

On the basis of the Spoken BNC2014 early access subset, our project will analyse intra- and extra-linguistic variation in the productivity of inflectional and periphrastic comparative strategies. Intra-linguistic factors include syntactic position, modification preferences, length and derivational type of the adjective. The extra-linguistic determinants focus on gender, age, socio-economic status, conversational setting and roles of the interlocutors. Our research constitutes a timely contribution to current knowledge of adjective comparison and morphological theory-building. If (a) variation in the productivity of inflectional comparison is found and (b) similar change in the productivity of both derivational and inflectional processes is observed, this will support our hypothesis that there is a derivation-to-inflection cline rather than a sharp divide.


Check back soon for more updates on the Spoken BNC2014 project!

From Corpus to Classroom 2

There is great delight that the Trinity Lancaster Corpus is providing so much interesting data that can be used to enhance communicative competences in the classroom. From Corpus to Classroom 1 described some of these findings. But how exactly do we go about ‘translating’ this for classroom use so that it can be used by busy teachers with high-pressure curricula to get through? How can we be sure we enhance rather than problematize the communicative feature we want to highlight?

Although the Corpus data comes from a spoken test, we want to use it to illustrate  wider pragmatic features of communication. The data fascinates students who are entranced to see what their fellow learners do, but how does it help their learning? The first step is to send the research outputs to an experienced classroom materials author to see what they suggest.

Here’s how our materials writer, Jeanne Perrett, went about this challenging task:

As soon as I saw the research outputs from TLC, I knew that this was something really special; proper, data-driven learning on how to be a more successful speaker. I could also see that the corpus scripts, as they were, might look very alien and quirky to most teachers and students. Speaking and listening texts in coursebooks don’t usually include sounds of hesitation, people repeating themselves, people self-correcting or even asking ‘rising intonation’ questions. But all of those things are a big part of how we actually communicate so I wanted to use the original scripts as much as possible. I also thought that learners would be encouraged by seeing that you don’t have to speak in perfectly grammatical sentences, that you can hesitate and you can make some mistakes but still be communicating well.

Trinity College London commissioned me to write a series of short worksheets, each one dealing with one of the main research findings from the Corpus, and intended for use in the classroom to help students prepare for GESE and ISE exams at a B1 or B2 level.

I started each time with extracts from the original scripts from the data. Where I thought that the candidates’ mistakes would hinder the learner’s comprehension (unfinished sentences for example), I edited them slightly (e.g. with punctuation). But these scripts were not there for comprehension exercises; they were there to show students something that they might never have been taught before.

For example, sounds of hesitation: we all know how annoying it is to listen to someone (native or non-native speaker) continually erm-ing and er-ing in their speech, and the data showed that candidates were hesitating too much. But we rarely, if ever, teach our students that it is in fact okay and indeed natural to hesitate while we are thinking of what we want to say and how we want to say it. What they need to know is that, as the more successful candidates in the data show, there are other words and phrases that we can use instead of erm and er. So one of the worksheets shows how we can use hedging phrases such as ‘well…’ or ‘like…’ or ‘okay…’ or ‘I mean…’ or ‘you know…’.

The importance of taking responsibility for a conversation was another feature to emerge from the data and again, I felt that these corpus findings were very freeing for students; that taking responsibility doesn’t, of course, mean that you have to speak all the time but that you also have to create opportunities for the other person to speak and that there are specific ways in which you can do that such as making active listening sounds (ah, right, yeah), asking questions, making short comments and suggestions.

Then there is the whole matter of how you ask questions. The corpus findings show that there is far less confusion in a conversation when properly formed questions are used. When someone says ‘You like going to the mountains?’ the question is not as clear as when they say ‘Do you like going to the mountains?’ This might seem obvious, but pointing it out, and showing that less checking of what has been asked is needed when questions are direct ones, is, I think, very helpful to students. It might also be a consolation: all those years of grammar exercises really were worth it! ‘Do you know how to ask a direct question?’ ‘Yes, I do!’

These worksheets are intended for EFL exam candidates but the more I work on them, the more I think that the Corpus findings could have a far wider reach. How you make sure you have understood what someone is saying, how you can be a supportive listener, how you can make yourself clear, even if you want to be clear about being uncertain; these are all communication skills which everyone needs in any language.

The Spoken BNC2014 early access projects: Part 3

In January, we announced the recipients of the Spoken BNC2014 Early Access Data Grants. Over the next several months, they will use exclusive access to the first five million words of Spoken BNC2014 data to carry out a total of thirteen research projects.

In this series of blogs, we are excited to share more information about these projects, in the words of their authors.

In Part 3 of our series, read about the work of Karin Aijmer, Kazuki Hata et al. and Laura Paterson.


Karin Aijmer

University of Gothenburg, Sweden

Investigating intensifiers in the Spoken BNC2014

Intensifiers undergo rapid changes. Old ones may go out of fashion and be replaced by new ones even in a short diachronic perspective. They should therefore be studied in up-to-date spoken material. This project will describe ‘new’ intensifiers (or new developments of intensifiers) such as so (cool), fucking, damn, dead, enough and the contexts in which they are used. What, for example, do they collocate with? Who are the typical users?

The aim of the article using data from the Spoken BNC2014 early access subset (EAS) is to study recent or on-going changes in the area of intensification. Intensifiers are interesting to study because they have a tendency to lose ground and may be replaced by other intensifiers even in a short diachronic perspective. Intensifiers have earlier been studied on the basis of the spoken part of the British National Corpus, and access to the EAS will make it possible to compare the frequencies of intensifiers across time. On the basis of the corpus data it will also be possible to give information about the speakers (e.g. whether they are teenagers or adults, and their gender and social class).
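As a rough illustration of what comparing frequencies across two corpora of different sizes can involve, the sketch below normalises raw counts to a per-million-word rate and computes a log-likelihood score, a measure commonly used in corpus linguistics for this kind of comparison. This is an assumption made for illustration only: the project description does not say which statistics will be used, the function names are hypothetical, and the counts in the usage lines are made-up placeholders rather than real corpus figures.

```python
import math


def per_million(count, corpus_size):
    """Normalise a raw count to a frequency per million words."""
    return count / corpus_size * 1_000_000


def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood score for one word.

    Higher values indicate a larger difference in relative frequency
    between corpus A and corpus B.
    """
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((count_a, expected_a), (count_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Placeholder counts for one intensifier in an older and a newer corpus
# (illustrative numbers only, not real data).
old_count, old_size = 120, 10_000_000
new_count, new_size = 150, 5_000_000

old_rate = per_million(old_count, old_size)
new_rate = per_million(new_count, new_size)
score = log_likelihood(old_count, old_size, new_count, new_size)
```

Normalising first matters because the two corpora are not the same size, so raw counts alone would be misleading.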


Kazuki Hata, Yun Pan and Steve Walsh

Newcastle University, UK

Talking the talk, walking the walk: interactional competence in and out

Our project aims to characterise interactional competence through a comparison of casual conversation and institutional talk, two distinct genres. The proposed study will build on an ongoing project using the NUCASE corpus (Newcastle University Corpus of Academic Spoken English), led by the School of Education, Communication and Language Sciences, Newcastle University. From our analysis of the NUCASE data, we have identified specific features of interactional competence which operate in different academic contexts. Interactional competence, across a range of academic disciplines, can be characterised by identifying the key linguistic and interactional features which promote engagement and maximise ‘learning’ and ‘learning opportunities’.

The proposed study would extend findings from the NUCASE study by comparing two corpora, and by highlighting the ways in which interactional competence operates in both formal and informal settings. We see the Spoken BNC2014 early access subset as an ideal source to accomplish our research aim, due to its geographical and functional features, offering a unique opportunity to study speakers’ interactional competence in different settings, with a particular focus on the ‘organising features’ of spoken interactions. We anticipate that the proposed study would bring into question some of the recent claims from functional/interactional linguistic studies, regarding the textual and interpersonal functions of several tokens, and provide a better understanding of the context-shaped/renewing nature of discourse across interactional contexts.


Laura Paterson

Lancaster University, UK

‘You can just give those documents to myself’: Untriggered reflexive pronouns in 21st century spoken British English

Reflexive pronouns (myself, herself, etc.) must share reference with another grammatical unit in order to fulfil their syntactic criteria: in the sentence ‘The cat washes herself’, the noun phrase the cat and the reflexive pronoun herself represent the same entity and share a syntactic bond. However, despite syntactic constraints, reflexive pronouns occur without coreferent NPs in some varieties of English. In ‘You can just give those documents to myself’, the pronoun you and the reflexive pronoun myself cannot be coreferent and have different real-world referents. Reflexives occurring without coreferent noun phrases are classed as ‘untriggered’ and have traditionally been deemed ungrammatical. However, untriggered reflexives can be understood.

Using the Spoken BNC2014 early access subset, I will investigate the use of untriggered reflexives in 21st century spoken British English, asking:

  1. Do untriggered reflexives occur in particular syntactic positions?
  2. Does the use of untriggered reflexives correlate with use of a particular grammatical person?
  3. Does the use of untriggered reflexives correlate with particular demographic groups?
  4. How does the use of untriggered reflexives compare with the use of reflexives in 21st century spoken British English?

Check back soon for Part 4!

CASS receives Queen’s Anniversary Prize for Further and Higher Education

At the end of February, a team of CASS researchers attended the Presentation of the Queen’s Anniversary Prizes for Further and Higher Education, held at Buckingham Palace. The CASS team officially received the award from Their Royal Highnesses the Prince of Wales and the Duchess of Cornwall on 25th February 2016.

Back in November, it was announced that CASS received the esteemed Queen’s Anniversary Prize for its work in “computer analysis of world languages in print, speech, and online.” The Queen’s Anniversary Prizes are awarded every two years to universities and colleges who submit work judged to show excellence, innovation, impact, and benefit for the institution itself, and for the people and society generally in the wider world.

Ten of us were selected to attend the ceremony itself, including the Chancellor, Vice-Chancellor, our Centre Director Tony McEnery, and three students. Buckingham Palace sent strict instructions about dress code and the possession of electronic devices, and we were well-read on royal etiquette by the time the big day arrived. I think all of us were a little nervous about what the day would have in store, but we met bright-eyed and bushy-tailed at 9:30am, and took a taxi to Green Park. We entered through the front gates into Buckingham Palace, and looked back at the crowd of adoring fans on the other side of the railings.

We showed our entry cards, and found ourselves being ushered across the courtyard and into the Palace itself. We dropped off our coats and bags, and then went up the grand staircase into the Ballroom where the ceremony was held. We began to relax as the Equerry told us what would be happening throughout the ceremony, and the Countess of Wessex’s String Orchestra provided excellent music throughout the event. The score ranged from Handel right through to John Lennon’s ‘Imagine’, and even a James Bond theme.

As the ceremony started, Vice-Chancellor Mark Smith and CASS Centre Director Tony McEnery passed through the guests, along with representatives from other universities and colleges, and then proceeded to form a line to receive the award. Chancellor Alan Milburn was seated at the front of the Ballroom, along with Anne, Princess Royal. As the award was received on behalf of CASS, the Prince of Wales asked the Vice-Chancellor about our work, and was fascinated to discover what we have undertaken in the past 40 years. After a brief chat about our work, Mark Smith and Tony McEnery were presented with the Queen’s Anniversary Prize medal and certificate that will be displayed in the John Welch Room in University House.

After the ceremony, we filed through into the Picture Gallery for the reception. Over the course of the next 60-90 minutes, guests were free to mingle and network with each other whilst canapés were served. Dignitaries passed through and spoke to the visitors; Anne, Princess Royal, had a keen interest in the impact of our work on dictionary-making, and I must admit that Tony McEnery was excellent at giving a summary of what corpus research entails. He outlined how it is used in modern-day dictionary building, and discussed some of the historical texts that we now have access to.

The Duchess of Cornwall also visited our group over the course of the event, and made a point of speaking to both Gill Smith and Rosie Knight about the practical applications of their research. They discussed extensively why corpus research is such a useful method in the social sciences, and spoke of their personal connection to the research centre.

Having the opportunity to promote and discuss our research with royalty was a true honour, and I think it is fantastic to see the work of CASS recognised in this unique and special way.

The Spoken BNC2014 early access projects: Part 2

In January, we announced the recipients of the Spoken BNC2014 Early Access Data Grants. Over the next several months, they will use exclusive access to the first five million words of Spoken BNC2014 data to carry out a total of thirteen research projects.

In this series of blogs, we are excited to share more information about these projects, in the words of their authors.

In Part 2 of our series, read about the work of Chris Ryder et al., Andreea Calude and Barbara McGillivray et al.


Chris Ryder, Jacqueline Laws and Sylvia Jaworska

University of Reading, UK

From oldies to selfies: A diachronic corpus-based study into changing productivity patterns in British English suffixation

The data from the Spoken BNC2014 early access subset will provide a unique opportunity to examine changes that have occurred in affix use in spoken British English over a twenty-year period; for example, the word selfie has only entered general usage since the invention of the iPhone. Using the recently developed MorphoQuantics database containing complex word data for 222 word-final affixes from the demographically sampled subset of the original Spoken BNC, direct comparisons can be made between old and new datasets, focussing on suffixation patterns, changes in productivity, and trends that demonstrate the shifts in semantic scope of individual suffixes. These features will be analysed chiefly through an examination (both quantitative and qualitative) of neologisms within the data, specifically regarding their regularity of construction, occurrence, and meaning.

This study is just one example of the diachronic morphological analyses that will be made available through a comparison of the Spoken BNC2014 EAS and the Spoken BNC, by utilising the categorisation system provided by MorphoQuantics.


Andreea Calude

University of Waikato, New Zealand

Sociolinguistic variation in cleft constructions: a quantitative corpus study of spontaneous conversation

This project concerns links between the use of various grammatical constructions and sociolinguistic variation: for example, is grammar used differently by men and women, or by younger and older speakers? We know that such variation can be observed for certain phonological features (e.g., some vowel sounds) and for certain pragmatic constructions (e.g., discourse markers and new and given information), but as regards grammar features, the answer remains largely unknown or at best vague.

I intend to use the Spoken BNC2014 early access subset to investigate cleft constructions from a sociolinguistic variationist perspective, with the aim of uncovering (potential) systematic syntactic variation across age, gender, dialect, and socio-economic status. Clefts constitute the most frequently used focusing strategy in English, with demonstrative clefts being among the most common in spontaneous conversation, for example: “That is what I want to study”, “This is where I was born”. Despite intense diachronic and synchronic study of the structure and function of clefts in English, virtually nothing is known about the relationship between cleft use and sociolinguistic variation.

The Spoken BNC2014 data will be coded for all demonstrative clefts using a combination of manual and automatic detection, and each construction identified will be attributed to a particular speaker profile (in terms of their sociolinguistic features). Three linguistic features will also be coded for each construction, namely discourse function, reference direction (cataphoric or anaphoric), and information structure (amount of new and given information included). The data will be analysed using a mixed effects generalised linear regression model.
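To give a sense of what the automatic side of such detection might look like, here is a minimal sketch that flags candidate demonstrative clefts with a regular expression. It is an illustrative assumption rather than the project's actual procedure: the pattern, the list of wh-words and the function name are all hypothetical, and it only finds surface candidates (a demonstrative, a form of "be", then a wh-word), which would still need manual checking.

```python
import re

# Candidate pattern: "this"/"that" + "is"/"was"/"'s" + a wh-word.
# Both straight and curly apostrophes are accepted for the contraction.
CLEFT_CANDIDATE = re.compile(
    r"\b(this|that)(?:\s+(?:is|was)|['’]s)\s+(what|where|when|why|how|who)\b",
    re.IGNORECASE,
)


def find_cleft_candidates(utterance):
    """Return the matched strings for every candidate cleft in an utterance."""
    return [match.group(0) for match in CLEFT_CANDIDATE.finditer(utterance)]


# The two example clefts quoted above, plus a non-cleft use of "that".
examples = [
    "That is what I want to study",
    "This is where I was born",
    "I like that book",
]
candidates = [find_cleft_candidates(text) for text in examples]
```

A pattern like this over-generates by design; the manual pass would then discard false hits and add the three linguistic features to each confirmed construction.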


Barbara McGillivray¹, Gard Buen Jenset¹ and Michael Rundell²

¹University of Oxford, UK

²Lexicography MasterClass, UK

The dative alternation revisited: fresh insights from contemporary spoken data

A well-known feature of English grammar is the dative alternation, whereby a verb may be used in an SVOO construction (Give me the money) or in the pattern SVO followed by a PP with the preposition to (Give the money to me). This is quite a well-researched topic, and generalizations have been made about the factors influencing a writer’s choice of one construction or another, and about which verbs show a preference for one of these patterns over the other. However, most of the studies published to date draw either on introspection or on data from written sources. The availability of contemporary, unscripted spoken data takes us into new territory, and offers an exciting opportunity to revisit this topic.

Our plan is to use the data from the Early Access Scheme to investigate verbs whose argument structure preferences include the dative alternation. Once we have all the relevant corpus data from the Spoken BNC2014 early access subset, we will analyse it using state-of-the-art multivariate statistical techniques, in order to account for the interplay of all the potentially significant variables, whether lexical, semantic, syntactic, or social. The proposed study thus exploits many of the unique features of this dataset, including the metadata on speakers and the USAS semantic tagging, to answer questions concerning the possible influence of semantic categories, socio-economic factors, gender, dialect and age, as well as linguistic features, on a speaker’s preferences. Once the study is complete, there would be opportunities for fresh comparative studies, either with the original Spoken BNC or with contemporary written data.


Check back soon for Part 3!