CASS go to ICAME38!

Researchers from CASS recently attended the ICAME38 conference at Charles University in Prague. Luckily, we arrived in Prague a day early, which gave us plenty of time to explore the city. The weather was sunny, so we walked to Wenceslas Square, and then took the lift to the top of the Old Town Hall Tower to enjoy the views over the city.

The following day, it was time to begin the conference! Over the course of the event, seven CASS members presented their research (you can view full abstracts of all talks here). Up first was Robbie Love, presenting “FUCK in spoken British English revisited with the Spoken BNC2014”. By replicating the approaches of McEnery & Xiao (2004) on the new data contained in the Spoken BNC2014, Robbie found, among other things, that FUCK is now used equally by men and women, and that use of FUCK peaks when speakers are in their 20s and then decreases with age, apart from the 60-69 group which has a higher frequency than the 50-59 group.

Also discussing the BNC2014 project was Abi Hawtin, who presented “The British National Corpus Revisited: Developing parameters for Written BNC2014.” Abi discussed the progress on the project so far, and gave the audience a chance to look at the sampling frame which has been designed for the corpus. Abi also highlighted the difficulty of collecting certain text types, particularly published books.

Amelia Joulain-Jay presented “Describing collocation patterns in OCR data: are MI and LL reliable?” Amelia discussed the fact that data which has been digitised using OCR procedures often has low levels of accuracy, and how this can affect corpus analysis. Amelia tested the reliability of Mutual Information (MI) and Log Likelihood (LL) statistics when working with OCR data, and found, among other things, that both attract high rates of false positives. However, she also found that correcting OCR data using Overproof makes a positive difference for both statistics.
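For readers less familiar with the two statistics, here is a minimal sketch of how MI and LL are commonly computed for a pair of words from a 2x2 contingency table. This is not the procedure from Amelia's talk: the function names are made up for illustration, the expected frequency uses the simplest textbook form with no window-size adjustment, and LL is the usual G2 statistic.

```python
import math


def mutual_information(o11: int, f1: int, f2: int, n: int) -> float:
    """MI score: log2 of observed over expected co-occurrence frequency.

    o11 = co-occurrences of word 1 and word 2, f1 and f2 = their overall
    frequencies, n = corpus size in tokens. Assumes the simplest expected
    frequency, f1 * f2 / n, with no collocation-window adjustment.
    """
    expected = f1 * f2 / n
    return math.log2(o11 / expected)


def log_likelihood(o11: int, f1: int, f2: int, n: int) -> float:
    """Log-likelihood (G2) over the four cells of the contingency table."""
    observed = [o11, f1 - o11, f2 - o11, n - f1 - f2 + o11]
    expected = [
        f1 * f2 / n,
        f1 * (n - f2) / n,
        (n - f1) * f2 / n,
        (n - f1) * (n - f2) / n,
    ]
    # Cells with an observed count of zero contribute nothing to the sum.
    return 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)


if __name__ == "__main__":
    # Invented counts, purely for illustration.
    print(mutual_information(10, 100, 100, 10_000))
    print(log_likelihood(10, 100, 100, 10_000))
```

Both scores are zero when the observed co-occurrence frequency matches what chance alone would predict. MI is well known to reward pairs of rare words, which is worth bearing in mind when a corpus contains many one-off forms.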

CASS director Andrew Hardie also presented research using OCR data, in a talk titled “Plotting and comparing corpus lexical growth curves as an assessment of OCR quality in historical news data”. Andrew drew further attention to the number of errors, or ‘noise’, in OCR data, and showed that if the number of tokens observed is plotted against the count of types at intervals (say, every 10,000 tokens), a curve characteristic of lexical growth over the span of a given corpus emerges. Andrew showed that visual comparison of lexical growth curves among historical collections, or against modern corpora, therefore gives a good impression of the relative extent of OCR noise, and thus some estimate of how much such noise will impede analysis.
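The idea of a lexical growth curve can be sketched in a few lines of Python. This is only an illustration of the general technique, not Andrew's implementation: the function name, the toy sentences and the tiny interval are all invented, and a real study would use proper tokenisation and an interval closer to the 10,000 tokens mentioned above.

```python
from typing import Iterable, List, Tuple


def lexical_growth_curve(tokens: Iterable[str], interval: int = 10_000) -> List[Tuple[int, int]]:
    """Return (tokens seen, types seen) points, sampled every `interval` tokens."""
    if interval <= 0:
        raise ValueError("interval must be a positive integer")
    seen_types = set()
    points = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen_types.add(token)
        if count % interval == 0:
            points.append((count, len(seen_types)))
    # Record a final point if the corpus does not end exactly on an interval.
    if count and count % interval != 0:
        points.append((count, len(seen_types)))
    return points


if __name__ == "__main__":
    # Toy data: the second "corpus" simulates OCR misreadings of "the".
    clean = "the cat sat on the mat the cat sat".split()
    noisy = "the cat sat on tbe mat thc cat sat".split()
    print(lexical_growth_curve(clean, interval=3))
    print(lexical_growth_curve(noisy, interval=3))
```

In the toy example the noisy text keeps gaining new types where the clean text levels off, because each misreading counts as a new word form. Plotting the two lists of points against each other gives the kind of visual comparison described in the talk.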

Also presenting was Dana Gablasova, who discussed “A corpus-based approach to the expression of subjectivity in L2 spoken English: The case of ‘I + verb’ construction”. Dana used the Trinity Lancaster Corpus (TLC) to investigate the ‘I + verb’ construction in L1 Spanish and Italian speakers aged over 20 years. Dana found that as proficiency increased, the frequency of emotive verbs decreased while the frequency of epistemic verbs increased considerably. The study also identified the most frequent cognitive and emotive verbs and the trends in their use according to the proficiency level of L2 users.

Vaclav Brezina (and Matt Timperley, who was unfortunately not able to attend the conference) gave a software demonstration of #LancsBox, a new-generation corpus analysis tool developed at CASS. Vaclav showed that #LancsBox can:

  • Search, sort and filter examples of language use.
  • Compare frequency of words and phrases in multiple corpora and subcorpora.
  • Identify and visualise meaning associations in language (collocations).
  • Compute and visualise keywords.
  • Offer a simple but powerful interface.
  • Support a number of advanced features such as customisable statistical measures.

#LancsBox can be downloaded for free from the tool website http://corpora.lancs.ac.uk/lancsbox.

Dana and Vaclav also gave a presentation together, titled “MI-score-based collocations in language learning research: A critical evaluation.” Dana and Vaclav identified several issues in the use of MI-score as a measure in language learning research (LLR), and used data from the BNC and TLC to:

  • place the MI-score in the context of other similar association measures and discuss the similarities and differences directly relevant to LLR;
  • propose general principles for selection of association measures in LLR.

Finally, former CASS senior research associate Laura Paterson, who recently moved to a lectureship at the Open University, presented “Visualising corpora using Geographical Text Analysis (GTA): (Un)employment in the UK, a case study”, which stemmed from her work on the CASS Distressed Communities project. Laura showed how GTA can be used to generate maps from concordance lines. She showed lots of interesting data visualisations and highlighted the way in which GTA allows the researcher to visualise their corpus and adds a consideration of physical space to language analysis.

Aside from all of the fascinating talks, ICAME38 also had a brilliant social programme. We were able to go on two boat trips along the river. The first gave us brilliant views of the city, and the second allowed us to get much closer to the bridges and buildings which line the river. The Gala dinner was also great fun: we had a linguistics-themed menu and, best of all, an ABBA tribute band!

Thank you to all of the organisers of ICAME38 for such an enjoyable and well-organised conference!