#LancsBox X: Innovation in corpus linguistics

CASS has always been associated with innovation in corpus linguistics. Innovation comes in different forms and guises, such as the creation of new corpora and tools as well as novel applications of corpus methods in a wide range of areas of social and linguistic research. With increasing demands on the sophistication of corpus linguistic analyses comes the need for new tools and techniques that can respond to these demands. #LancsBox X is one such tool.

#LancsBox X is a free desktop tool that can quickly search very large corpora (millions and billions of words) consisting of simple texts or richly annotated XML documents. It produces concordances, summary tables, collocation graphs and tables, wordlists and keyword lists.

On Friday 24 February 2023, a new version of #LancsBox X was released. To mark this occasion, we organised a hybrid event, which attracted over 1,300 attendees. This event was co-sponsored by CLARIN-UK. A recording of this event is available above.

Launching #LancsBox X (Margaret Fell LT, Lancaster University)
CASS team supporting the event (others were helping online).
Online support of the event

‘Location, location, location’: Introducing corpus linguistics in a novel and interesting way

“Lancaster University is one of the places where corpus linguistics was born. Let’s travel back in time to the year 1970, six years after Lancaster University was founded…” This is a quote from the beginning of the first lecture of a new online Masters programme in Corpus linguistics, which invites the students to embark on a journey of discovery, exploring key concepts, analytical techniques and important thinkers in the world of corpus linguistics.

When preparing the programme we were faced with a seemingly simple question: how to introduce corpus linguistics in an interesting way? In the programme, we wanted to share not only the knowledge and expertise in the field of corpus linguistics but also something of the unique character of Lancaster University, which is so closely connected with the history of the discipline as well as the most recent innovations in the field.

To achieve this aim, we decided to use different memorable locations around Lancaster to record lectures, in which we highlight different aspects of corpus linguistics and its applications. For example, we travelled to Morecambe, a seaside town near Lancaster, to record a lecture entitled ‘A drop in the ocean’, which uses the metaphor of the sea and all the water in it to explain how we can use corpora to sample the vast amount of language that is produced every day. In another lecture, the Lancaster house of John Austin was chosen as the perfect backdrop for a lecture on corpora and pragmatics. The ruins of the Roman Bath House from the 4th century AD, located in the vicinity of Lancaster Castle, created the opportunity to explain key grammatical categories, which date back to ancient times and which are, with some modification, still used today.

Other significant sites around Lancaster, each with its own story, are also used in the course. This approach allows us to share with the students in our new online programme the energy of the place, Lancaster’s genius loci, if you like, making the study of corpus linguistics more memorable and enjoyable than a simple classroom recording or a PowerPoint lecture.

To find out more about our programme, please visit the programme’s webpage: https://www.lancaster.ac.uk/linguistics/masters-level/corpus-linguistics-distance-ma/ Using this link, you can also access free taster sessions and explore lectures and practical exercises from the programme.

In front of the Bailrigg House, Lancaster University

Morecambe near Lancaster

In front of John Austin’s house

Roman Bath House, Lancaster

Dalton Square, Lancaster

Time to Celebrate: Trinity Lancaster Corpus

On Wednesday 30 October, the ESRC Centre for Corpus Approaches to Social Science (CASS) organised a small get-together in its new location, Bailrigg House, to celebrate the research that is being carried out at the centre. Specifically, on this occasion, we wanted to highlight the Trinity Lancaster Corpus, a corpus of spoken learner English built in collaboration between Lancaster University and Trinity College London.

Cutting the cake with the Trinity Lancaster Corpus logo

We are really proud of the corpus, which is the largest learner corpus of its kind. It took us over five years to complete this part of the project. Here are a few numbers that describe the Trinity Lancaster Corpus:

  • Over 2,000 transcripts
  • Over 4.2 million words
  • Over 3,500 hours of transcription time
  • Over 10 L1 and cultural backgrounds
  • Up to four speaking tasks

A balanced sample of the corpus is now available for online searching via TLC Hub (password: Lancaster1964). To read more about the corpus and its development, check out this article in the International Journal of Learner Corpus Research:

Gablasova, D., Brezina, V., & McEnery, T. (2019). The Trinity Lancaster Corpus: Development, Description and Application. International Journal of Learner Corpus Research, 5(2), 126-158. [open access]

A new special issue of the journal featuring articles on various aspects of learner language, which use the Trinity Lancaster Corpus as their primary data source, is available from this link.

Table of contents of the special issue of the International Journal of Learner Corpus Research

A cake to celebrate the Trinity Lancaster Corpus

Celebrations at CASS

Celebrations at CASS (posters featuring research on TLC in the background)

Statistics in (Higher) Education: A few thoughts at the beginning of the new academic year

As every year around this time, university campuses are buzzing with students who are starting their studies or returning to campus after the summer break. This incredible transformation pours life into buildings: empty spaces become lecture theatres, seminar rooms and labs. Students have the opportunity to learn many new things about the subject they chose to study and also to engage with the academic environment more generally. Among the educational and development opportunities students have at university, one transferable skill stands out: statistical literacy.

Numbers are an essential part of our everyday life. We count the coins in our pocket, the minutes before the next bus arrives or the sunny days in a rainy year. Numbers and quantitative information are also very important for students and educators. Statistical literacy – the ability to produce and interpret quantitative information – belongs to the basic set of academic skills that, despite its importance, may not always receive the attention it deserves.

Many students (and academics) are afraid of statistics – think about what your first reaction is to the equation in Figure 1 below.

Figure 1: The equation of standard deviation (mathematical form)

This is because statistics is often misconstrued as the art of solving extremely complicated equations or as some mysterious magic with numbers. Statistics, however, is first and foremost about understanding and making sense of numbers and quantitative information. For this, we need to learn the basic principles of collecting, organising and interpreting quantitative information. Critical thinking is thus much more important for statistics than number-crunching ability. After all, computers are very good at processing numbers and solving equations, and we can happily leave this task to them. For example, many statistical tasks, even complex ones, can be carried out with tools such as the Lancaster Stats Tool online, where the researcher merely copies and pastes their data (in an appropriate format) and presses one button to receive the answer.

Humans, on the other hand, outperform computers in interpretation. This is because we know the context in which numbers appear and can therefore evaluate the relative importance of different quantitative results. We as teachers, linguists, sociologists, scientists etc. can give numbers and equations their underlying meaning and relate them to our experience and knowledge of the field. For example, the equation in Figure 1 can be simplified as follows:

Figure 2: The equation of standard deviation (conceptual)

When we relate this to what we know about the world, we can see that the question we are asking in Figure 2 is how much variation there is in our data, a question about variability, difference in tendencies and preferences and overall diversity. This is something that we can relate to in our everyday experience: Will I ever find a twenty-pound note in my pocket? Is the wait for the bus longer in the evening? Is the number of sunny days different every year? When talking about statistics in education, I consider the following point crucial: as with any subject matter, it is important to connect statistical thinking and statistical literacy with our daily experience.
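
The idea of "how much variation there is in our data" can also be spelled out step by step in a few lines of code. The following is a minimal Python sketch, not a reproduction of the figures above: it assumes the sample standard deviation (dividing by n - 1), and the function name `standard_deviation` and the bus waiting times are made up for illustration.

```python
import math
import statistics


def standard_deviation(values, sample=True):
    """Return the standard deviation of a list of numbers.

    sample=True divides by n - 1 (sample standard deviation);
    sample=False divides by n (population standard deviation).
    """
    n = len(values)
    mean = sum(values) / n
    # How far is each value from the mean? Squaring removes the sign.
    squared_distances = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    # Average the squared distances, then undo the squaring.
    return math.sqrt(sum(squared_distances) / divisor)


# Hypothetical data: minutes spent waiting for the bus on five evenings.
waiting_times = [4, 7, 10, 3, 6]

sd = standard_deviation(waiting_times)
print(f"standard deviation: {sd:.2f} minutes")

# Python's standard library gives the same answer for the sample version.
print(f"statistics.stdev:   {statistics.stdev(waiting_times):.2f} minutes")
```

Read this way, the calculation is simply a summary of how far the individual waiting times typically lie from the average waiting time. Passing `sample=False` gives the population version instead, which is the right choice only when the data cover every case of interest.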

To read more about statistics for corpus linguistics, see Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.

British National Corpus 2014: A sociolinguistic book is out

Have you ever wondered what real spoken English looks like? Have you ever asked the question of whether people from different backgrounds (based on gender, age, social class etc.) use language differently? Have you ever thought it would be interesting to investigate how much English has changed over the last twenty years? All these questions can be answered by looking at language corpora such as the Spoken BNC 2014 and analysing them from a sociolinguistic perspective. Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014 is a book which offers a series of studies that provide a unique insight into a number of topics ranging from Discourse, Pragmatics and Interaction to Morphology and Syntax.

This is, however, only the first step. We are hoping that there will be many more studies to come based on this wonderful dataset. If you want to start exploring the Spoken BNC 2014 corpus, it is just three mouse clicks away:

Get access to the BNC2014 Spoken

  1. Register for free and log on to CQPweb.
  2. Sign up for access to the BNC2014 Spoken.
  3. Select ‘BNC2014’ in the main CQPweb menu.

Also, right now there is a great opportunity to take part in the written BNC2014 project, the written counterpart to the Spoken BNC2014. If you’d like to contribute to the written BNC2014, please check out the project’s website for more information.

Learn about the BNC2014, scan a book sample and contribute to the corpus…

On Saturday 12 May 2018, CASS hosted a small training event at Lancaster University for a group of participants who came from different universities in the UK. We talked about the BNC2014 project and discussed both the theoretical underpinnings and the practicalities of corpus design and compilation. Slides from the event are available as a PDF here.

The participants then experienced first-hand what is involved in compiling a large general corpus such as the BNC2014. They selected and scanned samples of books from current British fiction, poetry and a range of non-fiction books (history, popular science, hobbies etc.). Once processed, these samples will become a part of the written BNC2014.

Here are some pictures from the event:

Carmen Dayrell and Vaclav Brezina before the event

Elena Semino welcoming participants

In the computer lab: Abi Hawtin helping participants


A box full of books

If you are interested in contributing to the written BNC2014, go to the project website to find out about different ways in which you can participate in this exciting project.

The event was supported by ESRC grant no. EP/P001559/1.

Sketch Engine and other tools for language analysis

Here’s some good news for the beginning of the term: all Lancaster University staff and students now have access to Sketch Engine, an online tool for the analysis of linguistic data. Sketch Engine is used by major publishers (CUP, OUP, Macmillan, etc.) to produce dictionaries and grammar books. It can also be used for a wide range of research projects involving the analysis of language and discourse. Sketch Engine offers access to a large number of corpora in over 85 different languages. Many of the web-based corpora available through Sketch Engine include billions of words that can be analysed easily via the online interface.

In Sketch Engine, you can, for example:

  • Search and analyse corpora via a web browser.
  • Create word sketches, which summarise the use of words in different grammatical frames.
  • Load and grammatically annotate your own data.
  • Use parallel (translation) corpora in many languages.
  • Crawl the web and collect texts that include a combination of user-defined keywords.
  • Much more.

How to connect to Sketch Engine?

  1. Go to https://the.sketchengine.co.uk/login/
  2. Click on ‘Authenticate using your institution account (Single Sign On)’
  3. Select ‘Lancaster University’ from the drop-down menu and use your Lancaster login details to log on.

That’s all – you can start exploring corpora straightaway!


Other corpus tools

There are also many other tools for analysis of language and corpora available to Lancaster University staff and students (and others, of course!). The following table provides an overview of some of them.

| Tool | Analysis of own data | Provides corpora | Brief description |
|---|---|---|---|
| Desktop (offline) tools | | | |
| #LancsBox | YES | YES | This tool runs on all major operating systems (Windows, Linux, Mac). It has a simple, easy-to-use interface and allows searching and comparing corpora (your own data as well as corpora provided). In addition, #LancsBox provides unique visualisation tools for analysing frequency, dispersion, keywords and collocations. The tool is available from http://corpora.lancs.ac.uk/lancsbox |
| Web-based (online) tools | | | |
| CQPweb | NO | YES | This tool offers a range of pre-loaded corpora for English (current and historical) and other languages including Arabic, Italian, Hindi and Chinese. It includes the BNC2014 Spoken, a brand new 10-million-word corpus of current informal British speech. It has a number of powerful analytical functionalities. The tool is freely available from https://cqpweb.lancs.ac.uk/ |
| Wmatrix | YES | NO | This tool allows processing users’ own data and adding part-of-speech and semantic annotation. Corpora can also be searched and compared with reference wordlists. Wmatrix is available from http://ucrel.lancs.ac.uk/wmatrix/. |

Morphological complexity: How is grammar acquired and how do we measure this?

Vaclav Brezina and Gabriele Pallotti

Inflectional morphology has to do with how words change their form to express grammatical meaning. It plays an important role in a number of languages. In these languages, the patterns of word change may, for example, indicate number and case on nouns, or past, present and future tense on verbs. For example, to express the past participle in German we regularly add the prefix ge- and optionally modify the base. Ich gehe [I go/walk] thus becomes Ich bin gegangen [I have walked]. English also inflects words (e.g. walk – walks – walking – walked; drive – drove – driven) but the range of inflected forms is narrower than in many other languages. The range of morphological forms in a text can be seen as its morphological complexity. Simply put, it is an indicator of the morphological variety of a text, i.e. how many changes to the dictionary forms of the words are manifested in the text.
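
To make the idea of morphological variety concrete, here is a minimal Python sketch that counts how many distinct inflectional changes occur in a handful of English verb forms. It is only an illustration under stated assumptions: the function name `exponence_variety`, the hand-made list of forms and the labels used for each change are all hypothetical, and this is not the calculation performed by the Morphological complexity tool.

```python
def exponence_variety(analysed_verbs):
    """Return the set of distinct inflectional changes in the input.

    analysed_verbs is a list of (word form, change) pairs, where the
    change is a label for how the form differs from the dictionary form.
    """
    return {change for _form, change in analysed_verbs}


# Hypothetical, hand-annotated verb forms from a short text.
# "zero" marks a verb that appears in its dictionary form.
analysed = [
    ("walks", "-s"),
    ("walked", "-ed"),
    ("walking", "-ing"),
    ("drives", "-s"),
    ("talked", "-ed"),
    ("go", "zero"),
]

variety = exponence_variety(analysed)
print(f"{len(analysed)} verb forms, {len(variety)} distinct changes: {sorted(variety)}")
```

A text that used only one of these changes throughout would score lower on this simple count than a text that used all four, which is the intuition behind treating the range of forms as an indicator of complexity.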

To find out more about morphological complexity, how it can be measured and how L2 speakers acquire it, you can read:

Gabriele Pallotti and I have been working together to investigate the construct and develop a tool that can analyse the morphological complexity of texts. So far, the tool has been implemented for English, Italian and German verbal morphology. Currently, together with Michael Gauthier from Université Lyon, we are implementing the morphological complexity measure for French verbs.

To analyse a text in the Morphological complexity tool, copy and paste the text into the text box, select the appropriate language and press ‘Analyse text now’ (Fig. 1).

Figure 1. Morphological tool: Interface

The tool will output the results of a linguistic analysis that highlights all verbs and nouns in the text and identifies morphological changes (exponences). After you click on the ‘Calculate MCI’ button, the tool also automatically calculates the Morphological Complexity Index (MCI); see Fig. 2.

Figure 2. Morphological tool output: Selected parts

Chinese Applied Corpus Linguistics Symposium

On Friday 29th April 2016, Lancaster University hosted a symposium which brought together researchers and practitioners interested in Chinese linguistics and the corpus method. The symposium was supported by the British Academy (International Mobility and Partnership Scheme IPM 2013) and was hosted by the ESRC Centre for Corpus Approaches to Social Science (CASS). The symposium introduced the Guangwai-Lancaster Chinese Learner Corpus, a 1.2-million-word corpus of spoken and written L2 Chinese produced by learners of Chinese at different proficiency levels; the corpus was built as part of a collaboration between Guangdong University of Foreign Studies (Prof. Hai Xu and his team) and Lancaster University. The project was initiated by Richard Xiao, who also obtained the funding from the British Academy. Richard’s vision to bring corpus linguistics to the analysis of L2 Chinese (both spoken and written) is now coming to fruition with the final stages of the project and the public release of the corpus planned for the end of this year.

The symposium showcased different areas of Chinese linguistics research through presentations by researchers from Lancaster and other UK universities (Coventry, Essex), with the topics ranging from the use of corpora as resources in the foreign language classroom to a cross-cultural comparison of performance evaluation in concert reviews, second language semasiology, and CQPweb as a tool for Chinese corpus data. As part of the symposium, the participants were also given an opportunity to search the Guangwai-Lancaster Chinese Learner Corpus and explore different features of the dataset. At the end of the symposium, we discussed the applications of corpus linguistics in Chinese language learning and teaching and the future of the field.

Thanks are due to the presenters and all participants for joining the symposium and for very engaging presentations and discussions. The following snapshots summarise the presentations; links to the slides are available below the images.

Hai Xu

Hai Xu (Guangdong University of Foreign Studies): Guangwai-Lancaster Chinese Learner Corpus: A profile – via video conferencing from Guangzhou


Simon Smith

Simon Smith (Coventry University): 语料酷!Corpora and online resources in the Mandarin classroom


Fong Wa Ha

Fong Wa Ha (University of Essex): A cross-cultural comparison of evaluation between concert reviews in Hong Kong and British newspapers


Vittorio Tantucci

Vittorio Tantucci (Lancaster University): Second language semasiology (SLS): The case of the Mandarin sentence final particle 吧 ba


Andrew Hardie

Andrew Hardie (Lancaster University): Using CQPweb to analyse Chinese corpus data


Vaclav Brezina

Vaclav Brezina (Lancaster University): Practical demonstration of the Guangwai-Lancaster Chinese Learner Corpus followed by a general discussion.


Clare Wright: Using Learner Corpora to analyse task effects on L2 oral interlanguage in English-Mandarin bilinguals


Syntactic structures in the Trinity Lancaster Corpus

We are proud to announce a collaboration with Markus Dickinson and Paul Richards from the Department of Linguistics, Indiana University, on a project that will analyse syntactic structures in the Trinity Lancaster Corpus. The focus of the project is to develop a syntactic annotation scheme for spoken learner language and apply this scheme to the Trinity Lancaster Corpus, which is being compiled at Lancaster University in collaboration with Trinity College London. The aim of the project is to provide an annotation layer for the corpus that will allow sophisticated exploration of the morphosyntactic and syntactic structures in learner speech. The project will have an impact on both the theoretical understanding of spoken language production at different proficiency levels and the development of practical NLP solutions for the annotation of learner speech. More specific goals include:

  • Identification of units of spoken production and their automatic recognition.
  • Annotation and visualization of morphosyntactic and syntactic structures in learner speech.
  • Contribution to the development of syntactic complexity measures for learner speech.
  • Description of the syntactic development of spoken learner production.