From Korea to CASS – Diving into Corpus Linguistics at Lancaster University

In January 2026 I visited the ESRC Centre for Corpus Approaches to Social Science (CASS) at Lancaster University. I recently started a PhD in linguistics, specifically Corpus Linguistics, at Lancaster University, and it was time to meet my supervisor, Professor Vaclav Brezina, and learn more about Corpus Linguistics and the work of the CASS Research Centre.

The CASS Centre comprises a team of world-leading experts who are at the forefront of Corpus Linguistics, driving the discipline forward through groundbreaking research and innovative tools while training the next generation of pioneering researchers.

Meeting my PhD supervisor, CASS Co-Director Vaclav Brezina.

Why Do a PhD Focusing on Corpus Linguistics?

As an English educator based in South Korea for 15 years, I have seen the types of sentences, expressions and errors that Korean learners of English produce, and found the same patterns emerging again and again. For example, Korean learners of English have a hard time using the English articles “a”/“an” and “the”, as equivalents do not exist in the Korean language. However, until recently I hadn’t had a way of explaining these patterns beyond simply talking about the kinds of mistakes learners tend to make. The beauty of Corpus Linguistics is that it allows us to find clear examples of words and phrases in a massive collection of texts, and even to run statistical analyses to confirm (or refute) what we believe about specific language use.
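To make that idea a little more concrete, here is a minimal sketch in Python of the two basic moves described above: counting how often the articles appear (normalised per 1,000 words) and pulling out examples in context. Everything in it is an assumption for illustration only. The two sentences in `SAMPLE_TEXTS` are invented, not real learner data, and the function names `article_frequencies` and `concordance` are hypothetical helpers, not part of any corpus toolkit.

```python
import re
from collections import Counter

# Hypothetical toy "corpus": invented sentences, NOT real learner data.
SAMPLE_TEXTS = [
    "I went to the library and borrowed a book about an old city.",
    "She bought a ticket at the station.",
]

ARTICLES = {"a", "an", "the"}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def article_frequencies(texts):
    """Return each article's frequency per 1,000 tokens, plus the token total."""
    tokens = [tok for text in texts for tok in tokenize(text)]
    counts = Counter(tok for tok in tokens if tok in ARTICLES)
    total = len(tokens)
    per_thousand = {art: 1000 * counts[art] / total for art in sorted(ARTICLES)}
    return per_thousand, total


def concordance(texts, word, window=3):
    """Return simple keyword-in-context lines for every hit of `word`."""
    lines = []
    for text in texts:
        tokens = tokenize(text)
        for i, tok in enumerate(tokens):
            if tok == word:
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left} [{tok}] {right}")
    return lines


if __name__ == "__main__":
    freqs, total = article_frequencies(SAMPLE_TEXTS)
    print(f"Tokens: {total}")
    for article, freq in freqs.items():
        print(f"{article}: {freq:.1f} per 1,000 tokens")
    for line in concordance(SAMPLE_TEXTS, "the"):
        print(line)
```

With a real learner corpus in place of the toy sentences, the same normalised counts could be compared against a reference corpus of native-speaker English, which is where proper statistical testing would come in.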

In recent years, I have also seen first-hand the benefits and pitfalls of the new wave of AI-based tools that are available to English learners. Some of them seem incredible, all-knowing and reliable. However, when we look closely, it’s clear that a lot of them are “black box” AI models. This means that although we can see the outputs (the things they tell us), we can’t see how they generated them. Who knows what goes on “inside” the AI models? We don’t have access to the data they were trained on, or to how they reach the answers they give us. This is a problem because we are sometimes unable to tell which outputs are trustworthy and correct, and which are not.

My Research

This brings me neatly on to the research I am doing for my PhD, which is twofold:
1. Construct a corpus of Korean learners’ English, which can be a useful tool for future research.
2. Develop an AI-based tool which helps Korean people with their English. The tool will be built on my Korean corpus, meaning it should be much more trustworthy, reliable and transparent than “off-the-shelf” AI tools like ChatGPT. Last year I made a prototype AI English writing coach which you can try here.

What I Experienced at the CASS Centre

While at CASS, I shared an office with Lah, a visiting Corpus Linguistics researcher from Malaysia. Each day we chatted about the research we were doing, as well as more general topics like the culture of our countries. It was interesting to get a feel for the international nature of this discipline. It seems Corpus Linguistics has the potential to provide insights for researchers in any country or field. I have heard about researchers using Corpus Linguistics in fields as varied as aviation and healthcare.

With Lah, a Corpus Linguistics researcher visiting from Malaysia.

I also had the opportunity to meet other members of the CASS team, some of whom were investigating questions like “How is cancer discussed on social media?” and “What are the verbal signs of early onset dementia?”. I was impressed by how the researchers are addressing these real-life issues with potential benefits for patients.

The rest of my time was spent working on various elements of my own PhD research, such as planning interview questions for participants, exploring relevant academic papers, and discussing these topics and more with my supervisor, Professor Brezina.


Looking to the Future

Although I am new to the field, I believe that Corpus Linguistics has an exciting future and will become more important and widespread. Additionally, it has the potential to cross-pollinate with various other disciplines, even beyond linguistics.

We are living in a world which is more data-driven than ever. Corpus Linguistics takes the apparent disorganisation and randomness of language and allows us to find order and patterns in it, then apply statistical analysis to test the significance of our findings. The ESRC CASS Centre at Lancaster University is at the forefront of this pioneering field, led by experts such as Vaclav Brezina, who both build corpora such as the British National Corpus 2014 and develop toolkits for their use, such as the excellent #LancsBox X. My hope is that I can be involved in such a dynamic field of work and research. Corpus Linguistics is an illuminating field that can both contribute to my English teaching and provide a rich seam of fascinating research, insight and discovery in the future.

I saw this rainbow from the window while at CASS. Maybe a sign of a bright future!

Some Reading Recommendations

If you would like to learn more about Corpus Linguistics and the work of the CASS Centre at Lancaster University, I recommend the CASS brochure:

https://cass.lancs.ac.uk/wp-content/uploads/2025/06/CASS-Brochure.pdf

I also recommend this paper by Vaclav Brezina which was one of the biggest inspirations behind my research. The paper explores how Corpus Linguistics can respond to emerging technologies like LLMs, particularly in the context of black box AI. It also provides a handy introduction to the toolbox #LancsBox X.

Thanks for reading, and feel free to drop me an email about my research! j.p.sumner@lancaster.ac.uk