‘Location, location, location’: Introducing corpus linguistics in a novel and interesting way

“Lancaster University is one of the places where corpus linguistics was born. Let’s travel back in time to the year 1970, six years after Lancaster University was founded…” This is a quote from the beginning of the first lecture of a new online Masters programme in Corpus linguistics, which invites the students to embark on a journey of discovery, exploring key concepts, analytical techniques and important thinkers in the world of corpus linguistics.

When preparing the programme we were faced with a seemingly simple question: how to introduce corpus linguistics in an interesting way? In the programme, we wanted to share not only the knowledge and expertise in the field of corpus linguistics but also something of the unique character of Lancaster University, which is so closely connected with the history of the discipline as well as the most recent innovations in the field.

To achieve this aim, we decided to use different memorable locations around Lancaster to record lectures, in which we highlight different aspects of corpus linguistics and its applications. For example, we travelled to Morecambe, a seaside town near Lancaster, to record a lecture entitled ‘A drop in the ocean’, which uses the metaphor of the sea and all the water in it to explain how we can use corpora to sample the vast amounts of language that are produced every day. In another lecture, the Lancaster house of John Austin was chosen as the perfect backdrop for a lecture on corpora and pragmatics. The ruins of the Roman Bath House from the 4th century AD, located in the vicinity of Lancaster Castle, created the opportunity to explain key grammatical categories, which date back to ancient times and which are, with some modification, still used today.

Other stories related to significant sites around Lancaster are also used in the course. This approach allows us to share with the students in our new online programme the energy of the place, Lancaster’s genius loci, if you like, making the study of corpus linguistics more memorable and enjoyable than a simple classroom recording or a PowerPoint lecture.

To find out more about our programme, please visit the programme’s webpage: https://www.lancaster.ac.uk/linguistics/masters-level/corpus-linguistics-distance-ma/ Using this link, you can also access free taster sessions and explore lectures and practical exercises from the programme.

In front of the Bailrigg House, Lancaster University

Morecambe near Lancaster

In front of John Austin’s house

Roman Bath House, Lancaster

Dalton Square, Lancaster

POS Tagging for Georgian is now available in #LancsBox


We are delighted to announce that part-of-speech tagging for Georgian is now available in #LancsBox. This is the very first Georgian POS tagger made available for a wide range of users and uses. It enables users to perform various linguistic analyses on their own texts or corpora in #LancsBox.

The POS tagger for Georgian was developed within my PhD project (Computational analysis of morphosyntactic categories in Georgian) at the University of Leeds. The tagset design part of my research was conducted in the Centre for Corpus Approaches to Social Science at Lancaster University and was supervised by Dr Andrew Hardie. Thanks to Dr Vaclav Brezina, the lead developer of #LancsBox, this tagger is now available for use in #LancsBox (Brezina et al., 2015, 2018, 2020).

I use a probabilistic methodology (TreeTagger) and an enclitic tokenisation approach to perform tagging in Georgian. The accuracy of part-of-speech tagging is 92%. The tagger program uses a new morphosyntactic language model (developed for POS tagging purposes) and the KATAG tagset (219 tags), which is based on this model. The KATAG tagset is a hierarchical-decomposable tagset, which allows the user to search for different sections of the paradigm.

#LancsBox is a very powerful corpus analysis tool. It can be used at different levels of analysis of language data and corpora. It automatically annotates data for part of speech and can be used to find frequencies of different word classes such as nouns and verbs, compute frequency and dispersion measures for POS tags, and find and visualise co-occurrence of grammatical categories. It can also find complex linguistic structures using ‘smart searches’. For example, there are 60 ‘smart searches’ available for Georgian in #LancsBox, such as:

ADJECTIVES GENITIVE CASE      adjectives in genitive case
ADVERBS                       any adverbs
NOUNS ERGATIVE CASE           nouns in ergative case
PRONOUNS DEMONSTRATIVE        demonstrative pronouns
PRONOUNS INTERROGATIVE        interrogative pronouns
PRONOUNS PERSONAL             personal pronouns
VERBS AORIST TENSE            verbs in aorist tense
VERBS I PERSON                verbs with 1st person subject
VERBS II PERSON PLURAL        verbs with 2nd person plural subject
VERBS IMPERFECT TENSE         verbs in imperfect tense

To demonstrate how to use ‘smart searches’ in #LancsBox, I use a small Covid19 corpus (229,481 tokens). I am interested in finding out which verbs immediately follow the word coronavirus. Thus, my search term is: კორონავირუსი VERBS

The image displays alphabetically arranged concordance lines in #LancsBox, showing the most immediate contexts in which the search term is used. This allows me to analyse the frequency and dispersion of the node კორონავირუსი (coronavirus) immediately followed by a verb. Here it occurs 37 times (1.612 per 10k) in the Covid19 corpus, in 10 out of 11 texts.
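The figures above can be checked by hand. The following is a minimal Python sketch of the arithmetic only, not of how #LancsBox computes its statistics internally; the function names are illustrative assumptions, and only the numbers (37 hits, 229,481 tokens, 10 of 11 texts) come from the post.

```python
def relative_frequency(hits, corpus_size, basis=10_000):
    """Normalise a raw frequency to a rate per `basis` tokens."""
    return hits / corpus_size * basis


def range_percentage(texts_with_hits, total_texts):
    """Share of texts (in %) in which the search term occurs at least once."""
    return texts_with_hits / total_texts * 100


# Values reported in the post for the search "კორონავირუსი VERBS"
per_10k = relative_frequency(hits=37, corpus_size=229_481)
spread = range_percentage(texts_with_hits=10, total_texts=11)

print(f"{per_10k:.3f} per 10k tokens")
print(f"found in {spread:.1f}% of texts")
```

Normalising to a common basis such as 10,000 tokens makes the result comparable with counts from corpora of a different size, while the range gives a simple first indication of dispersion.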

Texts and Images of Austerity: Workshop in Erlangen, Germany

On Sunday 24th September, a few of us from CASS travelled to the small Bavarian city of Erlangen, Germany, to attend ‘Texts and Images of Austerity in Britain’, a five-day workshop being held at Friedrich-Alexander-Universität. Our number included Deputy Director of CASS Andrew Hardie, Olivia Ha, Craig Evans, and former CASS member Laura Paterson (now with the Open University). Also at the event was former CASS Director Tony McEnery.

Partly inspired by the book ‘Triangulating Methodological Approaches in Corpus Linguistic Research’ (2016), edited by Paul Baker and Jesse Egbert, the workshop brought together researchers – both seasoned and budding – to work on a common data set on the topic of austerity: a 20+ million-word corpus of news articles from the Guardian and Daily Telegraph (2010-2016), nearly 400 images from these articles, and a collection of Twitter messages from the same period. The contributors to Baker and Egbert’s volume used different corpus methods to analyse their shared data. At the Erlangen workshop, methods from different disciplines were used, with participants coming from a variety of fields, including Sociology, Political Science, Linguistics, and Economics. The purpose was to encourage transdisciplinary collaboration in the study of how austerity is discursively constructed.

The workshop followed a format that combined short presentations with working groups. In the convivial atmosphere of a group of twenty or so international researchers, each participant presented their approach to looking at austerity. A variety of theories and techniques were outlined in the presentations, and corpus software and methods were well represented across the workshop. In his talk, Tony McEnery provided an overview of corpus linguistics, presenting its value as an approach that focuses on how language is actually used rather than on how people think it is used. This overview also highlighted the variety of ways corpus methods can be employed in the study of text and talk. In other presentations, the focus was more on the means of doing corpus analysis, namely the software. One example was CQPweb, the main interface for accessing and analysing the text data at the workshop. Here, Andrew Hardie was on hand to provide a demonstration and offer his support.

Contributions from others with links to CASS included: Olivia Ha’s look at the collocates of emotion and evaluation, Craig Evans’s consideration of the notion of empathy, and Laura Paterson’s presentation on the use of geoparsing software. Other participants covered a range of techniques, theories and topics including multimodal annotation, textual analysis of moral logic, metaphor of austerity as attack, gender and austerity, crisis narratives, and critical realism.

In the working groups, participants with similar interests naturally gravitated to each other, particularly along the lines of those with more of a corpus focus and those with more of a multimodal focus. This, nevertheless, did not prevent a fruitful exchange of information and ideas, with several participants also presenting initial findings from their collaborative work. From a corpus perspective, a major challenge was the large number of duplicates in the newspaper corpus (an issue with LexisNexis and with capturing online newspaper data). The benefit of the workshop situation was that there were several participants with computational expertise present and able to work out ways of cleaning up the data.

The workshop in Erlangen was run by Tim Griebel, Stefan Evert, and a team of others at Friedrich-Alexander-Universität. Our hosts were incredibly welcoming, providing food and refreshments, organising accommodation and evening meals in the charming city of Erlangen, and even arranging a mid-week city tour event. The workshop itself was an interesting and rewarding exercise that forms part of a larger project on austerity. It helped create a space for different kinds of social scientists to exchange ideas and develop working relationships, which may develop into future research collaborations.

For more information on the workshop and its theme of austerity, visit the workshop website.

CASS at Corpus Linguistics 2017

The biennial Corpus Linguistics conference first took place in 2001 at Lancaster, with the 2017 conference at Birmingham being its 9th outing. The conference lasted four days, with an additional day for workshops. This blog post details CASS participation at the event.

On Monday 24th July CASS ran two pre-conference workshops: Vaclav Brezina and Matt Timperley’s workshop was based around the tool #LancsBox, which has the capacity to create collocational networks, while Robbie Love and Andrew Hardie introduced the Spoken BNC2014 Corpus. Pre-conference workshop presentations were also given by CASS members in the Corpus Approaches to Health Communication workshop, which saw talks by Paul Baker (on NHS patient feedback), Elena Semino (on assessment of a diagnostic pain questionnaire) and Karen Kinloch, who gave two talks on discourses around IVF treatment and postnatal depression (her second talk was co-presented with Sylvia Jaworska).

On the first day of the conference proper, CASS Director Andrew Hardie gave a plenary entitled Exploratory analysis of word frequencies across corpus texts: towards a critical contrast of approaches, which involved a “for one night only” Topic Modelling analysis, demonstrating some of the problems and assumptions behind this approach. Key points were illustrated with a friendly picture of a Gigantopithecus (pictures of dinosaurs and other extinct creatures were used in several talks, perhaps suggesting a new theme for CL research). The plenary can be watched in full here: https://www.youtube.com/watch?v=ka4yDJLtSSc

A number of conference talks involved the creation and analysis of the new 2014 British National Corpus, with Abi Hawtin presenting on how she developed parameters for the written section and Robbie Love discussing swearing in the spoken section of the BNC2014. Vaclav Brezina and Matt Timperley discussed a proposal for standardised tokenization and word counting, using the new BNC as an exemplar while Susan Reichelt examined ways of adapting the BNC for sociolinguistic research, taking a case study on negative concord.

In terms of other corpus creation projects, Paul Rayson, Scott Piao and a team from Cardiff University discussed the creation of a Welsh Semantic tagger for use with the CorCenCC Project.

Two talks involved uses of corpus linguistics in teaching. Gillian Smith described the creation and analysis of a corpus of interactions in Special Educational Needs classrooms, with the goal of investigating teacher scaffolding, while Liam Blything, Kate Cain and Andrew Hardie analysed a half-million-word corpus of teacher-child interactions during guided reading sessions.

Regarding work examining discourse and representation using corpus approaches, Carmen Dayrell presented her work with Helen Baker and Tony McEnery on a diachronic analysis of newspaper articles about droughts, their research combining corpus approaches with GIS (Geographical Information Systems). GIS was also used by Laura Paterson and Ian Gregory to map text analysis of poverty in the UK, while Paul Baker and Mark McGlashan reported on their work looking at representations of Romanians in the Daily Express, comparing articles with online reader comments. A fourth paper, by Jens Zinn and Daniel McDonald, considered changing understandings around the concept of risk in English language newspapers.

Collocation was also a popular topic in CASS presentations. Native and non-native processing of collocations was investigated by Jen Hughes, who carried out an experimental study using electroencephalography (EEG), which measures electrical potentials in the brain, while another approach to collocation was taken by Doğuş Can Öksüz and Vaclav Brezina, who examined adjective-noun collocations in Turkish and English. A third collocation study, by Dana Gablasova, Vaclav Brezina and Tony McEnery, involved empirical validation of MI-score-based collocations for language learning research.

Finally, Jonathan Culpeper and Amelia Joulain-Jay talked about an affiliated CASS project involving work on creating an Encyclopaedia of Shakespeare’s language. They discussed issues surrounding spelling variation and part-of-speech tagging, and gave two case studies (involving the words I and good).

 
The conference brought together corpus linguists from dozens of countries (including Germany, Finland, Spain, Israel, Japan, Brazil, Iran, the Netherlands, USA, New Zealand, Taiwan, Ireland, China, Czech Republic, Italy, Sweden, Poland, Chile, UK, Hong Kong, Norway, Australia, Belgium, Canada, South Africa and Venezuela) and was a great opportunity to share and hear about developing work in the field. There was a lively Twitter presence throughout the conference, with the tag #CL2017bham. However, my favourite tag was #HardiePieChartWatch, which had me going back to my slides to see if I had used a pie chart appropriately. Be careful with your pie charts!

The next conference will be held (for the first time) in Cardiff – I hope to see you there in two years.

More pictures of the conference can be found at https://www.flickr.com/photos/artsatbirmingham/sets/72157684181373191

CASS go to ICAME38!

Researchers from CASS recently attended the ICAME38 conference at Charles University in Prague. Luckily, we arrived in Prague a day early which gave us plenty of time to explore the city. The weather was sunny, so we walked to Wenceslas Square, and then took the lift to the top of the Old Town Hall Tower to enjoy the views over the city.

The following day, it was time to begin the conference! Over the course of the event, seven CASS members presented their research (you can view full abstracts of all talks here). Up first was Robbie Love, presenting “FUCK in spoken British English revisited with the Spoken BNC2014”. By replicating the approaches of McEnery & Xiao (2004) on the new data contained in the Spoken BNC2014, Robbie found, among other things, that FUCK is now used equally by men and women, and that use of FUCK peaks when speakers are in their 20s and then decreases with age, apart from the 60-69 group which has a higher frequency than the 50-59 group.

Also discussing the BNC2014 project was Abi Hawtin, who presented “The British National Corpus Revisited: Developing parameters for Written BNC2014.” Abi discussed the progress on the project so far, and gave the audience a chance to look at the sampling frame which has been designed for the corpus. Abi also highlighted the difficulty of collecting certain text types, particularly published books.

Amelia Joulain-Jay presented “Describing collocation patterns in OCR data: are MI and LL reliable?” Amelia discussed the fact that data which has been digitized using OCR procedures often has low levels of accuracy, and how this can affect corpus analysis. Amelia tested the reliability of Mutual Information statistics and Log Likelihood statistics when working with OCR data, and found that, among other things, Mutual Information and Log Likelihood attract high rates of false positives. However, she also found that correcting OCR data using Overproof makes a positive difference for both statistics.

CASS director, Andrew Hardie, also presented research using OCR data. He gave a talk titled “Plotting and comparing corpus lexical growth curves as an assessment of OCR quality in historical news data”. Andrew further drew our attention to the number of errors, or ‘noise’, in OCR data, and showed that if the number of tokens observed is plotted against the count of types at intervals (say, every 10,000 tokens), a curve characteristic of lexical growth over the span of a given corpus emerges. Andrew showed that visual comparison of lexical growth curves among historical collections, or with modern corpora, therefore generates a good impression of the relative extent of OCR noise, and thus some estimate of how much such noise will impede analysis.
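To make the idea of a lexical growth curve concrete, here is a minimal Python sketch, assuming the corpus is already available as a list of token strings. It is not Andrew's implementation; the function name, the tiny interval and the toy data are illustrative assumptions.

```python
def lexical_growth_curve(tokens, interval=10_000):
    """Return (tokens_seen, types_seen) points sampled every `interval` tokens."""
    seen_types = set()
    points = []
    count = 0
    for count, token in enumerate(tokens, start=1):
        seen_types.add(token)
        if count % interval == 0:
            points.append((count, len(seen_types)))
    # Record the final partial interval so the curve covers the whole corpus.
    if count and count % interval != 0:
        points.append((count, len(seen_types)))
    return points


# Toy illustration: OCR misreadings of one word each count as a new type,
# so the noisy curve climbs faster than the clean one.
clean = ["word"] * 6
noisy = ["word", "w0rd", "word", "vvord", "word", "wotd"]

print(lexical_growth_curve(clean, interval=3))
print(lexical_growth_curve(noisy, interval=3))
```

Plotting the returned points for two collections on the same axes gives the kind of visual comparison described above: the steeper the curve relative to a clean reference corpus, the more spurious types the OCR process is likely to have introduced.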

Also presenting was Dana Gablasova, who discussed “A corpus-based approach to the expression of subjectivity in L2 spoken English: The case of ‘I + verb’ construction”. Dana used the Trinity Lancaster Corpus (TLC) to investigate the ‘I + verb’ construction in L1 Spanish and Italian speakers aged over 20 years. Dana found that as proficiency increased, the frequency of emotive verbs decreased while the frequency of epistemic verbs increased considerably. The study also identified the most frequent cognitive and emotive verbs and the trends in their use according to the proficiency level of L2 users.

Vaclav Brezina (and Matt Timperley, who was unfortunately not able to attend the conference) gave a software demonstration of #LancsBox – a new-generation corpus analysis tool developed at CASS. Vaclav showed that #LancsBox can:

  • Search, sort and filter examples of language use.
  • Compare frequency of words and phrases in multiple corpora and subcorpora.
  • Identify and visualise meaning associations in language (collocations).
  • Compute and visualize keywords.
  • Use a simple but powerful interface.
  • Support a number of advanced features such as customisable statistical measures.

#LancsBox can be downloaded for free from the tool website http://corpora.lancs.ac.uk/lancsbox.

Dana and Vaclav also gave a presentation together, titled “MI-score-based collocations in language learning research: A critical evaluation.” Dana and Vaclav identified several issues in the use of MI-score as a measure in language learning research (LLR), and used data from the BNC and TLC to:

  • place the MI-score in the context of other similar association measures and discuss the similarities and differences directly relevant to LLR
  • propose general principles for selection of association measures in LLR.
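For readers unfamiliar with the measure, the MI-score is commonly given as the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected by chance. The Python sketch below assumes this textbook form without a window-size correction; the exact formulation used in the talk is not stated in this post, and the function name and example frequencies are invented for illustration.

```python
import math


def mi_score(observed, freq_node, freq_collocate, corpus_size):
    """Mutual Information: log2 of observed over expected co-occurrence."""
    # Expected co-occurrences if node and collocate were distributed independently.
    expected = freq_node * freq_collocate / corpus_size
    return math.log2(observed / expected)


# Hypothetical frequencies, not taken from the BNC or TLC.
score = mi_score(observed=10, freq_node=100, freq_collocate=50, corpus_size=100_000)
print(f"MI = {score:.2f}")
```

Because the expected frequency sits in the denominator, a pair of rare words that co-occur only a few times can receive a very high score, which is one reason the measure is usually read alongside raw frequency or other association measures.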

Finally, former CASS senior research associate Laura Paterson, who recently moved to a lectureship at the Open University, presented “Visualising corpora using Geographical Text Analysis (GTA): (Un)employment in the UK, a case study”, which stemmed from her work on the CASS Distressed Communities project. Laura showed how GTA can be used to generate maps from concordance lines. She showed lots of interesting data visualisations and highlighted the way in which GTA allows the researcher to visualise their corpus and adds a consideration of physical space to language analysis.

Aside from all of the fascinating talks, ICAME38 also had a brilliant social programme. We were able to go on two boat trips along the river. The first gave us brilliant views of the city, and the second allowed us to get much closer to the bridges and buildings which line the river. The Gala dinner was also great fun: we had a linguistics-themed menu and, best of all, an ABBA tribute band!

Thank you to all of the organisers of ICAME38 for such an enjoyable and well-organised conference!

 

In memory: Professor Geoffrey Leech

It is with great sorrow that we report the death on 19th August of Professor Geoffrey Leech.

Geoff was not only the founder of the UCREL research centre for corpus linguistics at Lancaster University, he was also the first Professor and founding Head of the Department of Linguistics and English Language. His contributions to linguistics – not only in corpus linguistics, but also in English grammar, pragmatics and stylistics – were immense. After his retirement in 2002, he remained an active member of our department, not only continuing his own research but also, characteristically, providing advice, support and encouragement for students and junior colleagues.

All our thoughts are with Geoff’s wife Fanny, and with his family.

It is still hard for us to find the right words at this time. For many of us he was an inspirational teacher and mentor, but for all of us, he was a kind and generous friend.

The video below was recorded by Tony McEnery in conversation with Geoff in late 2013 for Lancaster’s online course in corpus linguistics. In it, Tony and Geoff discuss the history of the field. We present it now publicly as a first tribute to Geoff’s life and work.

(A transcript is available from this link.)