What is corpus stats about? A new book on Statistics in Corpus Linguistics has been published

This practical guide will equip the reader to understand the key principles of statistical thinking and apply these concepts to their own research, without the need for prior statistical knowledge. The book provides step-by-step guidance through the process of statistical analysis and offers multiple examples of how statistical techniques can be used to analyse and visualize linguistic data. It also includes a useful selection of discussion questions and exercises. The book comes with a Companion website, which provides additional materials (answers to exercises, datasets, advanced materials, teaching slides etc.) and Lancaster Stats Tools online (http://corpora.lancs.ac.uk/stats), a free click-and-analyse statistical tool for easy calculation of the statistical measures discussed in the book.

British National Corpus 2014: A sociolinguistic book is out

Have you ever wondered what real spoken English looks like? Have you ever asked whether people from different backgrounds (based on gender, age, social class etc.) use language differently? Have you ever thought it would be interesting to investigate how much English has changed over the last twenty years? All these questions can be answered by looking at language corpora such as the Spoken BNC 2014 and analysing them from a sociolinguistic perspective. Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014 is a book which offers a series of studies that provide a unique insight into a number of topics ranging from Discourse, Pragmatics and Interaction to Morphology and Syntax.

This is, however, only the first step. We are hoping that there will be many more studies to come based on this wonderful dataset. If you want to start exploring the Spoken BNC 2014 corpus, it is just three mouse clicks away:

Get access to the BNC2014 Spoken

  1. Register for free and log on to CQPweb.
  2. Sign up for access to the BNC2014 Spoken.
  3. Select ‘BNC2014’ in the main CQPweb menu.

Also, right now there is a great opportunity to take part in the written BNC 2014 project, a written counterpart to the Spoken BNC2014. If you’d like to contribute to the written BNC2014, please check out the project’s website for more information.

CASS: Five more years

We are delighted to announce that CASS has been awarded £2.5 million funding from the Economic and Social Research Council (ESRC) and Lancaster University to continue existing activities and pursue a new research programme for five more years, from April 2018 to March 2023.

The funding, which includes £750,000 from the ESRC, will be used to maximise the economic and societal impact of the research carried out in the first phase of the Centre, particularly in the areas of: Corporate Communications; Climate Change and Maritime Security; Language Development, Disorders and Environment; and Spoken Learner Language.

In addition, a new research programme will extend the facilitative and transformative power of corpus methods to the study of health (care) communication, in the following areas:

  • Language and mental health (including: communication about anxiety disorder; presentation and diagnosis of psychosis; depression in users of social media);
  • Communicating and diagnosing chronic pain;
  • Media representations of obesity;
  • English language assessment and training for medical professionals.

The Centre will also continue to create new openly accessible corpora, extend the existing programme of methodological and technological innovation, especially through #LancsBox and CQPweb, and continue to disseminate methods and tools through the Corpus MOOC, Summer Schools and free workshops in the UK and internationally.

The new CASS team brings together 15 scholars from different disciplines at Lancaster University and two collaborating institutions: Durham University and University College London (see below).

Two postdoctoral Research Associates will also be recruited to work with the rest of the team for the next five years.

CASS Director Professor Elena Semino said: “We are absolutely delighted to have been awarded five more years of funding by the ESRC and grateful to the University for its part in supporting the Centre.

“This award will ensure that the work we have done so far achieves its full potential in terms of societal impact, and will enable us to carry out new research on communication about illness and healthcare.”

CASS is one of eight established research centres awarded a total of £6.9m to continue their work under a new funding model designed to secure the long-term sustainability of social science research excellence in the UK.

Watch this space for updates on the Centre’s work and the release of new tools and corpora!


The CASS team from April 2018:

Principal Investigator:
Elena Semino – Linguistics and English Language (Lancaster University)

Co-Investigators:
Andrew Hardie – Linguistics and English Language (Lancaster University)
Paul Baker – Linguistics and English Language (Lancaster University)
Vaclav Brezina – Linguistics and English Language (Lancaster University)
Dana Gablasova – Linguistics and English Language (Lancaster University)
Claire Hardaker – Linguistics and English Language (Lancaster University)
John Pill – Linguistics and English Language (Lancaster University)
Dimitrinka Atanasova – Linguistics and English Language (Lancaster University)

Basil Germond – Politics, Philosophy and Religion (Lancaster University)
Garrath Williams – Politics, Philosophy and Religion (Lancaster University)

Kate Cain – Psychology (Lancaster University)
Steve Young – Accounting and Finance (Lancaster University)

Angela Woods – English Studies and Hearing the Voice project (Durham University)
Joanna Zakrzewska – University College London Hospitals

Collaborator:
Zsófia Demjén – UCL Centre for Applied Linguistics (University College London)

CASS plays leading role in major European heritage language project

The ESRC Centre for Corpus Approaches to Social Science will play a leading role in the new Heritage Language Consortium. The Consortium is a strategic partnership for the study of heritage languages in Europe and involves six leading universities in the UK, Germany and Portugal, as well as the Portuguese Ministry of Foreign Affairs.

Through this partnership, CASS will have privileged access to over 130,000 students in 85 countries, and we will use this unique opportunity to build the world’s largest heritage language corpus. The corpus will enable ground-breaking new research on language learning and education and have important implications for educational policy, curriculum design, and materials development.

A Memorandum of Understanding was signed at a ceremony in Lisbon to officially launch the Consortium. The launch event featured statements by the Secretary of State for the Portuguese Communities, Dr José Luís Carneiro, by the Secretary of State for Education, Professor João Costa, by the President of the Camões Institute, Ambassador Luís Faro Ramos, and by the Consortium’s Director, Dr Patrick Rebuschat, from Lancaster’s Department of Linguistics and English Language.

Portugal maintains a heritage language network across 85 countries for the families of Portuguese citizens, the world over. This enables children to improve their heritage language with qualified teachers who go into schools to run approved language programmes funded by the Portuguese government.

The Consortium Director, LAEL’s Dr Patrick Rebuschat, said: “This strategic partnership provides us with a unique opportunity – no other country maintains such a significant heritage language network overseas, and we will have privileged access to substantial, yet completely unexplored data.

“The Consortium is a major international initiative which uses Portuguese as a ‘test case’. The insights gained from this project will be applicable to other languages, of course. Our research will help us understand how children and adults learn new languages and identify those factors that make some of us particularly good language learners. We can then use these insights to improve language teaching.

“The Consortium will also organize impact and outreach initiatives to engage with parents, teachers, and policy makers across Europe.”

Professor Steve Bradley, Lancaster’s Pro-Vice-Chancellor (International), said: “This important initiative demonstrates again Lancaster’s strong international outlook and our commitment to playing a leading role in research that impacts lives, communities, and educational practices across the globe. The Consortium will provide unique opportunities for Lancaster’s staff and students to be involved in a research area that is of particular significance to Europe today.”

The idea for the consortium was born earlier this year when the Portuguese Secretary of State for Education, Professor João Costa, visited Lancaster University to deliver a keynote at a conference organized by Dr Rebuschat. The event focused on bilingualism and heritage language education across Europe. It brought together policy makers from the Portuguese Ministries of Foreign Affairs and Education, leading academics, journalists, school teachers and parents to discuss current trends and challenges in heritage language research and education.

Caption: A Memorandum of Understanding was signed at a ceremony in Lisbon to officially launch the Consortium. From left to right: Ambassador Luís Faro Ramos, President of the Camões Institute; Dr José Luís Carneiro, Secretary of State for Portuguese Communities; Professor João Costa, Portuguese Secretary of State for Education; Professor Susana Trovão, NOVA University Lisbon; Dr Patrick Rebuschat, Lancaster University; Professor Maria de Fátima Marinho, University of Porto; Professor Detmar Meurers, Tübingen University; Professor Paulo Farmhouse Alberto, University of Lisbon; Professor Cristina Flores, University of Minho.

For more information, please visit http://www.lancaster.ac.uk/heritage-language or email Dr Patrick Rebuschat: p.rebuschat(Replace this parenthesis with the @ sign)lancaster.ac.uk.

Lancaster Summer Schools in Corpus Linguistics (#LancsSS18)

CASS is pleased to offer three free training events that cover the techniques of corpus linguistics and their application in three different areas.

The schools include both lectures and practical sessions that introduce the latest developments in the field and practical applications of cutting-edge analytical techniques. The summer schools are taught by leading experts in the field from Lancaster University.

The summer schools are intended primarily for postgraduate research students but applications from Masters-level students, postdoctoral researchers, senior researchers, and others will also be considered.

Dates: 25 – 28 June 2018 (four days)

Venue: Lancaster University, Lancaster, UK


Application: To apply for a place in one of the Lancaster summer schools in corpus linguistics, please fill in the Registration form. Since the places in the summer schools are limited, we recommend applying early. Applications will be evaluated on a rolling basis.


The summer schools are free to attend; the participants will need to arrange their own travel and accommodation. During all four days, we will offer free refreshments during the tea & coffee breaks and participants will have time during the lunch break to buy their lunch on campus.

Organising committee: Dr. Dana Gablasova (Chair), Rachael McCarthy

For further details, click through to each Summer School’s full description. Queries about the summer schools can be directed to the Summer School administrator, Rachael McCarthy (r.mccarthy2(Replace this parenthesis with the @ sign)lancaster.ac.uk).

To tweet about the event, please use: #LancsSS18

CASS in the 2017 ESRC Festival of Social Science

The ESRC Festival of Social Science is an annual celebration of social science research – comprised of a huge array of public events of all kinds, and designed to promote awareness of UK social science research across the board. This year, it runs from 4th to 11th November.

As the team at ESRC says,

“You may be surprised at just how relevant the Festival’s events are to society today. Social science research makes a difference. Discover how it shapes public policy and contributes to making the economy more competitive, as well as giving people a better understanding of 21st century society. From big ideas to the most detailed observations, social science affects us all everyday – at work, in school, when raising children, within our communities, and even at the national level.”

As an ESRC Centre, CASS has been involved in the Festival since our work began in 2013. We have organised events of different types in different years – for instance, in the first year of the Centre, our contribution to the Festival was a series of talks in schools in the North West of England to introduce the kind of social science analysis in which we specialise to students in sixth-form. It was great to be able to reach out to an audience that we rarely have a chance to communicate with about our work.

In subsequent years, we organised events under our “Valuing language” banner – aimed at using examples of our work to present to a public audience the benefits across the social sciences that arise in research that understands the value of language for all kinds of social investigations. Our first “Valuing language” event was in London; the following year we held another event in Manchester.

This year our contribution to the Festival of Social Science is a new “Valuing language” presentation. This event focuses in particular on two strands of research that have been under way in CASS for the past two years or so, looking at the intersection of language with the critical issue of health and healthcare. We are also returning to London for the event, entitled “Valuing language: Effective communication in healthcare provision”. The event – at 6.30 pm on Thursday 9th November – is particularly aimed at healthcare practitioners and those training to enter healthcare services – but of course, it is open to anyone with an interest in this work!

The evening will include two presentations, one on each of these strands of work. First will be a presentation of research into patient comments on healthcare services collected through the NHS Choices website. Patient feedback has often been analysed by looking straightforwardly at the numeric ratings given in feedback. However, the textual responses supplied alongside these ratings are a far richer source of data – albeit so extensive that they can be far from straightforward to analyse! But this is, of course, where corpus-based linguistic methods come in. A CASS project, led by Paul Baker, has applied these methods to investigate the importance that patients place on interpersonal skills and effective, compassionate communication. Two members of the team working on this project, myself and Craig Evans, will give an overview of how we have gone about analysing this unique and fascinating source of data.

In the second half of the event, CASS Director Elena Semino will present her work looking at patients’ reporting of pain. A common way for healthcare practitioners to assess the level of pain that patients are experiencing is to use questionnaires that present descriptor words – such as “pricking/boring/drilling/stabbing”. The descriptor word that a patient chooses is assumed to reflect the level of their pain. Elena’s research suggests, however, that patients’ choice of descriptor may in many cases instead be a result of how strongly associated with the word “pain” the descriptor word is. Again, this is a problem that corpus-based language analysis is an ideal way to address. Elena will explain the findings of her investigation and also consider the implications these findings have for how descriptor-word questionnaires should be used in assessing patients’ pain.

We’re all looking forward to participating once again in the ESRC Festival and we hope to see you there!

Find out more (and sign up for the event) via http://cass.lancs.ac.uk/festival17.

The Spoken BNC2014 is now available!

On behalf of Lancaster University and Cambridge University Press, it gives us great pleasure to announce the public release of the Spoken British National Corpus 2014 (Spoken BNC2014).

The Spoken BNC2014 contains 11.5 million words of transcribed informal British English conversation, recorded by (mainly English) speakers between the years 2012 and 2016. The situational context of the recordings – casual conversation among friends and family members – is designed to make the corpus broadly comparable to the demographically-sampled component of the original spoken British National Corpus.

The Spoken BNC2014 is now accessible online in full, free of charge, for research and teaching purposes. To access the corpus, you should first create a free account on Lancaster University’s CQPweb server (https://cqpweb.lancs.ac.uk/) if you do not already have one. Once registered, please visit the BNC2014 website (http://corpora.lancs.ac.uk/bnc2014) to (a) sign the corpus’ end-user licence and (b) register your CQPweb account – following the instructions on the site. When you return to CQPweb, you will have access to the Spoken BNC2014 via the link that appears in the list of ‘Present-day English’ corpora. While access is initially only via the CQPweb platform, the underlying corpus XML files and associated metadata will be available for download in Autumn 2018.

The BNC2014 website also contains lots of useful information about the corpus, and in particular a downloadable manual and reference guide, which will be available soon. Further information, as well as the first research articles to use Spoken BNC2014 data, will be available in two in-press publications associated with the project: a special issue of the International Journal of Corpus Linguistics (due next month) and an edited collection in the Routledge ‘Advances in Corpus Linguistics’ series (due early 2018).

The BNC2014 does not end here – we are currently working on transcribing materials provided to us by the British Library to provide a substantial supplement to the corpus – find out more about that here: http://cass.lancs.ac.uk/?p=2241. For now, we will be waiting and watching with interest to see what work the corpus released today stimulates. As ever with corpus data, it does not enable all questions to be answered, but it does allow a very wide range of questions to be investigated.

The Spoken BNC2014 research team would like to express our gratitude to all who have had a hand in the creation of the corpus, and hope that you enjoy exploring the data. We are, of course, keen to hear your feedback about the corpus; this, as well as any questions, can be directed to Robbie Love (r.m.love(Replace this parenthesis with the @ sign)lancaster.ac.uk) or Andrew Hardie (a.hardie(Replace this parenthesis with the @ sign)lancaster.ac.uk).

Change of Leadership in CASS

Andrew Hardie is delighted to announce that he has handed over his role of CASS Centre Director to Elena Semino.

Elena has been Head of Department for Lancaster’s Department of Linguistics and English Language for 6 years, and has published widely in the areas of stylistics, metaphor theory, and medical humanities/health communication.

In Elena’s own words: 

‘It is a great honour and challenge to take over as CASS Director. Over the last four years, CASS has led the way nationally and internationally in the application of corpus methods to a wide range of social scientific problems, and has had a significant impact on research, policy and practice in many different contexts. I look forward to working with colleagues in Lancaster, and partners in the UK and around the world, to continue and extend this work in years to come.’

CASS PhD Student Tanjun Liu wins Best Poster Award at EUROCALL2017

In late August, I attended the 25th annual conference of EUROCALL (European Association for Computer Assisted Language Learning) at the University of Southampton. This year’s theme encompassed how Computer-Assisted Language Learning (CALL) responds to changing global circumstances, which impact on education. Over 240 sessions were presented covering the topics of computer mediated communication, MOOCs, social networking, corpora, European projects, teacher education, etc.

At this conference, I presented a poster entitled “Evaluating the effect of data-driven learning (DDL) on the acquisition of academic collocations by advanced Chinese learners of English”. DDL is a term created by Tim Johns in 1991 to refer to the use of authentic corpus data to conduct student-centred discovery learning activities. However, even though many corpus-based studies in the pedagogical domain have suggested applying corpora in classroom teaching, DDL has not become a mainstream teaching practice to date. Therefore, my research sets out to examine the contribution of DDL to the acquisition of academic collocations in the Chinese university context.

The corpus tool that I used in my research was #LancsBox (http://corpora.lancs.ac.uk/lancsbox/), which is a newly-developed corpus tool at CASS that has the capacity to create collocational networks, i.e. GraphColl. The poster I presented was a five-week pilot study of my research, the results of which show that the learners’ attitudes towards using #LancsBox were mostly positive, but there were no statistically significant differences between using the corpus tool and an online collocations dictionary, which may be largely due to the very short intervention time in the pilot study. My poster also presented the description of the forthcoming main study that will involve longer exposure and more EFL learners.

At this conference I was fortunate enough to win the EUROCALL2017 Best Poster Award (PhD), which was given to the best poster presented by a PhD student as nominated by conference delegates. Thank you to all of the delegates who voted for me to win this award and it was a real pleasure to attend such a wonderful conference!

How to Produce Vocabulary Lists

As part of the Forum discussion in Applied Linguistics, we have formulated some basic principles of corpus-based vocabulary studies and pedagogical wordlist creation and use. These principles can be summarised as follows:

  1. Explicitly define the vocabulary construct.
  2. Operationalize the vocabulary construct using transparent and replicable criteria.
  3. If using corpora, take corpus evidence seriously and avoid cherry-picking.
  4. Use multiple sources of evidence to test the validity of the vocabulary construct.
  5. Do not rely on your intuition/experience to determine what is useful for learners; collect evidence about learner needs to evaluate the usefulness of the list.
  6. Do not present learners with a decontextualized list of lexical items; use/create contextualized materials instead.

To find out more, you can read:

Brezina, V. & Gablasova, D. (2017). How to Produce Vocabulary Lists? Issues of Definition, Selection and Pedagogical Aims. A Response to Gabriele Stein. Applied Linguistics, doi:10.1093/applin/amx022.
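
As a rough illustration of principle 2 above (transparent and replicable criteria), the sketch below selects words for a list using two explicit thresholds: relative frequency per million tokens and range, i.e. the number of corpus parts in which a word occurs. This is not the procedure described in the article; the function name, the input format and the default thresholds are all assumptions made for the example.

```python
from collections import Counter


def build_wordlist(parts, min_per_million=100.0, min_range=2):
    """Select words by two explicit criteria (illustrative sketch only).

    parts: list of token lists, one per corpus part (assumed input format).
    min_per_million: minimum frequency per million tokens in the whole corpus.
    min_range: minimum number of corpus parts in which the word must occur.
    Returns (word, raw_frequency, range) tuples, most frequent first.
    """
    total_tokens = sum(len(part) for part in parts)
    if total_tokens == 0:
        return []

    frequency = Counter()   # raw frequency across the whole corpus
    word_range = Counter()  # number of parts containing each word
    for part in parts:
        lowered = [token.lower() for token in part]
        frequency.update(lowered)
        word_range.update(set(lowered))

    selected = []
    for word, count in frequency.items():
        per_million = count / total_tokens * 1_000_000
        if per_million >= min_per_million and word_range[word] >= min_range:
            selected.append((word, count, word_range[word]))

    # Sort by descending frequency, then alphabetically, so output is stable.
    selected.sort(key=lambda item: (-item[1], item[0]))
    return selected


if __name__ == "__main__":
    toy_parts = [
        ["the", "cat", "sat"],
        ["the", "dog", "sat"],
        ["The", "bird", "flew"],
    ]
    for entry in build_wordlist(toy_parts, min_per_million=0, min_range=2):
        print(entry)
```

Because the thresholds are named parameters rather than ad hoc judgements, anyone with the same corpus and settings should obtain the same list, and the effect of changing a criterion can be checked directly.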