CASS at Corpus Linguistics 2017

The biennial Corpus Linguistics conference first took place in 2001 at Lancaster, with the 2017 conference at Birmingham being its 9th outing. The conference lasted four days, with an additional day for workshops. This blog post details CASS participation at the event.

On Monday 24th July, CASS ran two pre-conference workshops: Vaclav Brezina and Matt Timperley’s workshop was based around the tool #LancsBox, which can create collocational networks, while Robbie Love and Andrew Hardie introduced the Spoken BNC2014 corpus. CASS members also gave presentations in the Corpus Approaches to Health Communication workshop, which saw talks by Paul Baker (on NHS patient feedback), Elena Semino (on assessment of a diagnostic pain questionnaire) and Karen Kinloch, who gave two talks on discourses around IVF treatment and postnatal depression (her second talk was co-presented with Sylvia Jaworska).

On the first day of the conference proper, CASS Director Andrew Hardie gave a plenary entitled “Exploratory analysis of word frequencies across corpus texts: towards a critical contrast of approaches”, which involved a “for one night only” Topic Modelling analysis, demonstrating some of the problems and assumptions behind this approach. Key points were illustrated with a friendly picture of a Gigantopithecus (pictures of dinosaurs and other extinct creatures were used in several talks, perhaps suggesting a new theme for CL research). The plenary can be watched in full here: https://www.youtube.com/watch?v=ka4yDJLtSSc

A number of conference talks involved the creation and analysis of the new 2014 British National Corpus, with Abi Hawtin presenting on how she developed parameters for the written section and Robbie Love discussing swearing in the spoken section of the BNC2014. Vaclav Brezina and Matt Timperley discussed a proposal for standardised tokenization and word counting, using the new BNC as an exemplar, while Susan Reichelt examined ways of adapting the BNC for sociolinguistic research, using negative concord as a case study.

In terms of other corpus creation projects, Paul Rayson, Scott Piao and a team from Cardiff University discussed the creation of a Welsh Semantic tagger for use with the CorCenCC Project.

Two talks involved uses of corpus linguistics in teaching. First, Gillian Smith described the creation and analysis of a corpus of interactions in Special Educational Needs classrooms, with the goal of investigating teacher scaffolding. Second, Liam Blything, Kate Cain and Andrew Hardie analysed a half-million-word corpus of teacher-child interactions during guided reading sessions.

Regarding work examining discourse and representation using corpus approaches, Carmen Dayrell presented her work with Helen Baker and Tony McEnery on a diachronic analysis of newspaper articles about droughts, their research combining corpus approaches with GIS (Geographical Information Systems). GIS was also used by Laura Paterson and Ian Gregory to map text analysis of poverty in the UK, while Paul Baker and Mark McGlashan reported on their work looking at representations of Romanians in the Daily Express, comparing articles with online reader comments. A fourth paper, by Jens Zinn and Daniel McDonald, considered changing understandings around the concept of risk in English language newspapers.

Collocation was also a popular topic in CASS presentations. Jen Hughes investigated native and non-native processing of collocations in an experimental study using electroencephalography (EEG), which measures electrical potentials in the brain. Another approach to collocation was taken by Doğuş Can Öksüz and Vaclav Brezina, who examined adjective-noun collocations in Turkish and English. A third collocation study, by Dana Gablasova, Vaclav Brezina and Tony McEnery, involved empirical validation of MI-score-based collocations for language learning research.

Finally, Jonathan Culpeper and Amelia Joulain-Jay talked about an affiliated CASS project involving work on creating an Encyclopaedia of Shakespeare’s language. They discussed issues surrounding spelling variation and part-of-speech tagging, and gave two case studies (involving the words “I” and “good”).

 
The conference brought together corpus linguists from dozens of countries (including Germany, Finland, Spain, Israel, Japan, Brazil, Iran, The Netherlands, USA, New Zealand, Taiwan, Ireland, China, Czech Republic, Italy, Sweden, Poland, Chile, UK, Hong Kong, Norway, Australia, Belgium, Canada, South Africa and Venezuela) and was a great opportunity to share and hear about developing work in the field. There was a lively Twitter presence throughout the conference, under the hashtag #CL2017bham. However, my favourite hashtag was #HardiePieChartWatch, which had me going back to my slides to see if I had used a pie chart appropriately. Be careful with your pie charts!

The next conference will be held (for the first time) in Cardiff – I hope to see you there in two years.

More pictures of the conference can be found at https://www.flickr.com/photos/artsatbirmingham/sets/72157684181373191

CASS go to ICAME38!

Researchers from CASS recently attended the ICAME38 conference at Charles University in Prague. Luckily, we arrived in Prague a day early, which gave us plenty of time to explore the city. The weather was sunny, so we walked to Wenceslas Square, and then took the lift to the top of the Old Town Hall Tower to enjoy the views over the city.

The following day, it was time to begin the conference! Over the course of the event, seven CASS members presented their research (you can view full abstracts of all talks here). Up first was Robbie Love, presenting “FUCK in spoken British English revisited with the Spoken BNC2014”. By replicating the approaches of McEnery & Xiao (2004) on the new data contained in the Spoken BNC2014, Robbie found, among other things, that FUCK is now used equally by men and women, and that use of FUCK peaks when speakers are in their 20s and then decreases with age, apart from the 60-69 group which has a higher frequency than the 50-59 group.

Also discussing the BNC2014 project was Abi Hawtin, who presented “The British National Corpus Revisited: Developing parameters for Written BNC2014.” Abi discussed the progress on the project so far, and gave the audience a chance to look at the sampling frame which has been designed for the corpus. Abi also highlighted the difficulty of collecting certain text types, particularly published books.

Amelia Joulain-Jay presented “Describing collocation patterns in OCR data: are MI and LL reliable?” Amelia discussed the fact that data which has been digitized using OCR procedures often has low levels of accuracy, and how this can affect corpus analysis. Amelia tested the reliability of Mutual Information statistics and Log Likelihood statistics when working with OCR data, and found that, among other things, Mutual Information and Log Likelihood attract high rates of false positives. However, she also found that correcting OCR data using Overproof makes a positive difference for both statistics.
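For readers unfamiliar with the two statistics, here is a minimal sketch of how Mutual Information and Log Likelihood are commonly computed for a word pair from co-occurrence counts. It uses the standard textbook formulations over a 2x2 contingency table, which may differ in detail from the exact settings used in the talk. The function names, parameter names and counts are illustrative assumptions, not taken from the presentation.

```python
import math


def mi_score(o11: int, f1: int, f2: int, n: int) -> float:
    """Mutual Information: log2 of observed over expected co-occurrence.

    o11 = co-occurrences of the node and the collocate,
    f1, f2 = their individual frequencies, n = corpus size in tokens.
    (Hypothetical helper; parameter names are assumptions.)
    """
    if o11 <= 0:
        raise ValueError("MI is undefined when the pair never co-occurs")
    expected = f1 * f2 / n
    return math.log2(o11 / expected)


def log_likelihood(o11: int, f1: int, f2: int, n: int) -> float:
    """Log-likelihood (G2) over the 2x2 contingency table of the pair."""
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


if __name__ == "__main__":
    # Made-up counts: the pair co-occurs 10 times in a 10,000-token corpus.
    print(mi_score(10, 100, 100, 10_000))
    print(log_likelihood(10, 100, 100, 10_000))
```

Both scores depend directly on the raw counts, so any OCR error that changes how often a word form is observed changes the score as well.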

CASS director, Andrew Hardie, also presented research using OCR data. He gave a talk titled “Plotting and comparing corpus lexical growth curves as an assessment of OCR quality in historical news data”. Andrew further drew our attention to the number of errors, or ‘noise’, in OCR data, and showed that if the number of tokens observed is plotted against the count of types at intervals (say, every 10,000 tokens), a curve characteristic of lexical growth over the span of a given corpus emerges. Andrew showed that visual comparison of lexical growth curves among historical collections, or against modern corpora, therefore gives a good impression of the relative extent of OCR noise, and thus some estimate of how much such noise will impede analysis.
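The idea of a lexical growth curve can be sketched in a few lines of code. The example below is only an illustration of the general technique described in the talk, not Andrew’s own implementation: the function name, the whitespace tokenisation in the demo and the default interval are assumptions made for the sketch.

```python
from typing import Iterable, List, Tuple


def lexical_growth_curve(tokens: Iterable[str], interval: int = 10_000) -> List[Tuple[int, int]]:
    """Return (tokens seen, types seen) points sampled every `interval` tokens.

    Hypothetical helper for illustration; a final point is added for any
    leftover tokens that do not fill a whole interval.
    """
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    seen = set()
    points = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen.add(token)
        if count % interval == 0:
            points.append((count, len(seen)))
    if count and count % interval != 0:
        points.append((count, len(seen)))
    return points


if __name__ == "__main__":
    # Toy text with naive whitespace tokenisation (an assumption of this sketch).
    toy = "the cat sat on the mat and the dog sat on the log".split()
    for tokens_seen, types_seen in lexical_growth_curve(toy, interval=4):
        print(tokens_seen, types_seen)
```

Plotting these points for two collections on the same axes gives the kind of visual comparison described above. A text with many OCR errors would be expected to keep adding new types for longer, since each misrecognised word form counts as a fresh type.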

Also presenting was Dana Gablasova, who discussed “A corpus-based approach to the expression of subjectivity in L2 spoken English: The case of ‘I + verb’ construction”. Dana used the Trinity Lancaster Corpus (TLC) to investigate the ‘I + verb’ construction in L1 Spanish and Italian speakers aged over 20 years. Dana found that, as proficiency increased, the frequency of emotive verbs decreased while the frequency of epistemic verbs increased considerably. The study also identified the most frequent cognitive and emotive verbs and the trends in their use according to the proficiency level of L2 users.

Vaclav Brezina (and Matt Timperley, who was unfortunately not able to attend the conference) gave a software demonstration of #LancsBox – a new-generation corpus analysis tool developed at CASS. Vaclav showed that #LancsBox can:

  • Search, sort and filter examples of language use.
  • Compare frequency of words and phrases in multiple corpora and subcorpora.
  • Identify and visualise meaning associations in language (collocations).
  • Compute and visualize keywords.
  • Use a simple but powerful interface.
  • Support a number of advanced features such as customisable statistical measures.

#LancsBox can be downloaded for free from the tool website http://corpora.lancs.ac.uk/lancsbox.

Dana and Vaclav also gave a presentation together, titled “MI-score-based collocations in language learning research: A critical evaluation.” Dana and Vaclav identified several issues in the use of MI-score as a measure in language learning research, and used data from the BNC and TLC to:

  • place the MI-score in the context of other similar association measures and discuss the similarities and differences directly relevant to language learning research (LLR)
  • propose general principles for selection of association measures in LLR.

Finally, former CASS senior research associate Laura Paterson, who recently moved to a lectureship at the Open University, presented “Visualising corpora using Geographical Text Analysis (GTA): (Un)employment in the UK, a case study”, which stemmed from her work on the CASS Distressed Communities project. Laura showed how GTA can be used to generate maps from concordance lines. She showed lots of interesting data visualisations and highlighted the way in which GTA allows the researcher to visualise their corpus and adds a consideration of physical space to language analysis.

Aside from all of the fascinating talks, ICAME38 also had a brilliant social programme. We were able to go on two boat trips along the river. The first gave us brilliant views of the city, and the second allowed us to get much closer to the bridges and buildings which line the river. The gala dinner was also great fun: we had a linguistics-themed menu and, best of all, an ABBA tribute band!

Thank you to all of the organisers of ICAME38 for such an enjoyable and well-organised conference!

 

In memory: Professor Geoffrey Leech

It is with great sorrow that we report the death on 19th August of Professor Geoffrey Leech.

Geoff was not only the founder of the UCREL research centre for corpus linguistics at Lancaster University, he was also the first Professor and founding Head of the Department of Linguistics and English Language. His contributions to linguistics – not only in corpus linguistics, but also in English grammar, pragmatics and stylistics – were immense. After his retirement in 2002, he remained an active member of our department, not only continuing his own research but also, characteristically, providing advice, support and encouragement for students and junior colleagues.

All our thoughts are with Geoff’s wife Fanny, and with his family.

It is still hard for us to find the right words at this time. For many of us he was an inspirational teacher and mentor, but for all of us, he was a kind and generous friend.

The video below was recorded by Tony McEnery in conversation with Geoff in late 2013 for Lancaster’s online course in corpus linguistics. In it, Tony and Geoff discuss the history of the field. We present it now publicly as a first tribute to Geoff’s life and work.

(A transcript is available from this link.)