Sketch Engine and other tools for language analysis

Here’s some good news for the beginning of the term: all Lancaster University staff and students have now access to Sketch Engine, an online tool for the analysis of linguistic data. Sketch Engine is used by major publishers (CUP, OUP, Macmillan, etc.) to produce dictionaries and grammar books. It can also be used for a wide range of research projects involving the analysis of language and discourse. Sketch Engine offers access to a large number of corpora in over 85 different languages. Many of the web-based corpora available through Sketch Engine include billions of words that can be analysed easily via the online interface.

In Sketch Engine, you can, for example:

  • Search and analyse corpora via a web browser.
  • Create word sketches, which summarise the use of words in different grammatical frames.
  • Load and grammatically annotate your own data.
  • Use parallel (translation) corpora in many languages.
  • Crawl the web and collect texts that include a combination of user-defined keywords.
  • Much more.

How to connect to Sketch Engine?

  1. Go to
  2. Click on ‘Authenticate using your institution account (Single Sign On)’

3. Select ‘Lancaster University’ from the drop-down menu and use your Lancaster     login details to log on. That’s all – you can start exploring corpora straightaway!

Other corpus tools

There are also many other tools for analysis of language and corpora available to Lancaster University staff and students (and others, of course!). The following table provides an overview of some of them.


Tool Analysis of own data Provides corpora Brief description
Desktop (offline) tools
#LancsBox YES YES This tool runs on all major operating systems (Windows, Linux, Mac). It has a simple, easy-to-use interface and allows searching and comparing corpora (your own data as well as corpora provided).  In addition, #LancsBox provides unique visualisations tools for analysing frequency, dispersion, keywords and collocations.

Web-based (online) tools
CQPweb NO YES This tool offers a range of pre-loaded corpora for English (current and historical) and other languages including Arabic, Italian, Hindi and Chinese. It includes, the BNC 2014 Spoken, a brand new 10-milion-word corpus of current informal British speech. It has a number of powerful analytical functionalities. The tool is freely available from
Wmatrix YES NO This tool allows processing users’ own data and adding part-of-speech and semantic annotation. Corpora can also be searched and compared with reference wordlists. Wmatrix is available from

Morphological complexity: How is grammar acquired and how do we measure this?

Vaclav Brezina and Gabriele Pallotti

Inflectional morphology has to do with how words change their form to express grammatical meaning. It plays an important role in a number of languages. In these languages, the patterns of word change may for example indicate number and case on nouns, or past, present and future tense on verbs. For example, to express the past participle in German we regularly add the prefix ge- and optionally modify the base. Ich gehe [I go/walk] thus becomes Ich bin gegangen [I have walked].  English also inflects words (e.g. walk – walks – walking – walked; drive – drove – driven) but the range of inflected forms is narrower than in many other languages. The range of morphological forms in a text can be seen as its morphological complexity. Simply put, it is an indicator of the morphological variety of a text, i.e. how many changes to the dictionary forms of the words are manifested in the text.

To find out more about morphological complexity, how it can be measured and how L2 speakers acquire it, you can read:

Gabriele Pallotti and I have been working together to investigate the construct and develop a tool that can analyse the morphological complexity of texts. So far, the tool has been implemented for English, Italian and German verbal morphology. Currently, together with Michael Gauthier from Université Lyon we are implementing the morphological complexity measure for French verbs.

To analyse a text in the Morphological complexity tool, copy/paste the text in the text box, select the appropriate language and press ‘Analyse text now’ (Fig. 1).

Figure 1. Morphological tool: Interface

The tool will output the results of the linguistic analysis that highlights all verbs and nouns in the text and identifies morphological changes (exponences). After clicking on the ‘Calculate MCI’ button the tool also automatically calculates the Morphological Complexity Index (MCI) – see Fig. 2.

Figure 2. Morphological tool output: Selected parts


Chinese Applied Corpus Linguistics Symposium

On Friday 29th April 2016, Lancaster University hosted a symposium which brought together researchers and practitioners interested in Chinese linguistics and the corpus method. The symposium was supported by the British Academy (International Mobility and Partnership Scheme IPM 2013) and was hosted by the ESRC Centre for Corpus Approaches to Social Science (CASS). The symposium introduced the Guangwai-Lancaster Chinese Learner Corpus, a 1.2-million-word corpus of spoken and written L2 Chinese produced by learners of Chinese at different proficiency levels; the corpus was built as part of a collaboration between Guangdong University of Foreign Studies (Prof. Hai Xu and his team) and Lancaster University. The project was initiated by Richard Xiao, who also obtained the funding from the British Academy. Richard’s vision to bring corpus linguistics to the analysis of L2 Chinese (both spoken and written) is now coming to fruition with the final stages of the project and the public release of the corpus planned for the end of this year.

The symposium showcased different areas of Chinese linguistics research through presentations by researchers from Lancaster and other UK universities (Coventry, Essex), with the topics ranging from the use of corpora as resources in the foreign language classroom to a cross-cultural comparison of performance evaluation in concert reviews, second language semasiology, and CQPweb as a tool for Chinese corpus data. As part of the symposium, the participants were also given an opportunity to search the Guangwai-Lancaster Chinese Learner Corpus and explore different features of the dataset. At the end of the symposium, we discussed the applications of corpus linguistics in Chinese language learning and teaching and the future of the field.

Thanks are due to the presenters and all participants for joining the symposium and for very engaging presentations and discussions.  The following snapshots summarise the presentations –links to the slides are available below the images.


Hai Xu


Hai Xu (Guangdong University of Foreign Studies ): Guangwai-Lancaster Chinese Learner Corpus: A profile – via video conferencing from Guangzhou

Simon Smith

Simon Smith (Coventry University): 语料酷!Corpora and online resources in the Mandarin classroom

Fong Wa Ha

Fong Wa Ha (University of Essex): A cross-cultural comparison of evaluation between concert reviews in Hong Kong and British newspapers

Vittorio Tantucci

Vittorio Tantucci (Lancaster University): Second language semasiology (SLS): The case of the Mandarin sentence final particle 吧 ba

Andrew Hardie

Andrew Hardie (Lancaster University): Using CQPweb to analyse Chinese corpus data

Vaclav Brezina

Vaclav Brezina (Lancaster University):  Practical demonstration of the Guangwai-Lancaster Chinese Learner Corpus followed by a general discussion.

Clare Wright: Using Learner Corpora to analyse task effects on L2 oral interlanguage in English-Mandarin bilinguals




Syntactic structures in the Trinity Lancaster Corpus

We are proud to announce collaboration with Markus Dickinson and Paul Richards from the Department of Linguistics, Indiana University on a project  that will analyse syntactic structures in the Trinity Lancaster Corpus. The focus of the project is to develop a syntactic annotation scheme of spoken learner language and apply this scheme to the Trinity Lancaster Corpus, which is being compiled at Lancaster University in collaboration with Trinity College London. The aim of the project is to provide an annotation layer for the corpus that will allow sophisticated exploration of the morphosyntactic and syntactic structures in learner speech. The project will have an impact on both the theoretical understanding of spoken language production at different proficiency levels as well as on the development of practical NLP solutions for annotation of learner speech.  More specific goals include:

  • Identification of units of spoken production and their automatic recognition.
  • Annotation and visualization of morphosyntactic and syntactic structures in learner speech.
  • Contribution to the development of syntactic complexity measures for learner speech.
  • Description of the syntactic development of spoken learner production.


Trinity Lancaster Corpus at the International ESOL Examiner Training Conference 2015

On Friday 30th January 2015, I gave a talk at the International ESOL Examiner Training Conference 2015 in Stafford. Every year, the Trinity College London, CASS’s research partner, organises a large conference for all their examiners which consists of plenary lectures and individual training sessions. This year, I was invited to speak in front of an audience of over 300 examiners about the latest development in the learner corpus project.  For me, this was a great opportunity not only to share some of the exciting results from the early research based on this unique resource, but also to meet the Trinity examiners; many of them have been involved in collecting the data for the corpus. This talk was therefore also an opportunity to thank everyone for their hard work and wonderful support.

It was very reassuring to see the high level of interest in the corpus project among the examiners who have a deep insight into examination process from their everyday professional experience.  The corpus as a body of transcripts from the Trinity spoken tests in some way reflects this rich experience offering an overall holistic picture of the exam and, ultimately, L2 speech in a variety of communicative contexts.

Currently, the Trinity Lancaster Corpus consists of over 2.5 million running words sampling the speech of over 1,200 L2 speakers from eight different L1 and cultural backgrounds. The size itself makes the Trinity Lancaster Corpus the largest corpus of its kind. However, it is not only the size that the corpus has to offer. In cooperation with Trinity (and with great help from the Trinity examiners) we were able to collect detailed background information about each speaker in our 2014 dataset. In addition, the corpus covers a range of proficiency levels (B1– C2 levels of the Common European Framework), which allows us to research spoken language development in a way that has not been previously possible.  The Trinity Lancaster Corpus, which is still being developed with an average growth of 40,000 words a week, is an ambitious project:  Using this robust dataset, we can now start exploring crucial aspects of L2 speech and communicative competence and thus help language learners, teachers and material developers to make the process of L2 learning more efficient and also (hopefully) more enjoyable. Needless to say, without Trinity as a strong research partner and the support from the Trinity examiners this project wouldn’t be possible.

Trinity Lancaster Spoken Learner Corpus: A milestone to celebrate

On Monday 19 May we came together to celebrate the completion of the first part of the Trinity Lancaster Spoken Learner Corpus project. The transcription of our 2012 dataset is now complete and the corpus comprises 1.5 million running words. The Trinity Lancaster Spoken Learner Corpus represents a balanced sample of learner speech from six different countries (Italy, Spain, Mexico, India, China and Sri Lanka) covering the B1.2 – C2 levels of the Common European Framework (CEFR). Below are some pictures from our small celebration.


trinity1 trinity2

We are continuing with the corpus development adding more data from our 2014 dataset so there is still a lot of work to be done. However, we are really excited about the possibilities of applied linguistic and language testing research based on this unique dataset.

You can read more about the Trinity Lancaster Spoken Learner Corpus in the AEA-Europe newsletter report.

Is this the way to do Corpus Linguistics? Feedback system for the Corpus Linguistics MOOC

Corpus linguistics (CL) is a set of incredibly versatile methods of language analysis applicable to a number of different contexts. So, for example, if you are interested in language, culture, history or society, corpus linguistics has something to offer. Today, thanks to the amazing development in computer technology, corpus linguistic tools are literally only a mouse click away or a touch away, if you are using a tablet or a smartphone. Are you then ready to get your hands dirty with computational analysis of large amounts of language? If the answer is yes, you have probably already registered for the new massively open online course (MOOC) on Corpus Linguistics, created and run by Tony McEnery and other members of the CASS team. (If you haven’t managed to register yet, you can still do so at the FutureLearn website. The course kicks off on 27th January 2014.)

An essential part of the Corpus Linguistics MOOC is its unique feedback system. You will be given a question, a data set and a software tool, and you will be asked to apply what you have learnt in the MOOC lectures to real language analysis. You will explore a topic using corpus techniques which will enable you to uncover interesting patterns in language data. We have a range of topics in store for you. These include English grammar, British and American language and culture, historical discourse of 17th century news books and learner language. But don’t worry, we won’t ask you to write an essay on the topic. Instead, we will give you a number of analyses and descriptions of the corpus data and you will decide which ones use the corpus techniques correctly. After you’ve made your decisions we will provide detailed comments on each of the options. In this way, the CASS Corpus Linguistics MOOC system aims to promote independent learning so that next time you can apply the corpus tools with confidence to answer your own questions.

Centre Vacancy: Senior Research Associate in Corpus Linguistics

Post A793

Linguistics & English Language
Salary:   £31,331 to £36,298
Closing Date:   Sunday 20 October 2013
Interview Date:   To be confirmed
Reference:  A793

The Centre for Corpus Approaches to Social Science, funded by the ESRC, is seeking to appoint to 4 year research contract in the area of corpus linguistics and language teaching/assessment.

You will pursue research on developing new approaches to the use of corpus linguistics within language teaching and assessment in conjunction with Trinity College London. A knowledge of corpus linguistics is essential, as well as some familiarity with language teaching and /or assessment.

You will join an interdisciplinary team of internationally renowned researchers within CASS. You will be offered excellent career progression opportunities through the ESRC Centre.

This is a fixed term contract for 4 years.

Informal enquiries may be made to Dr. Vaclav Brezina: brezina(Replace this parenthesis with the @ sign)

Apply through the Lancaster University website. 

“It’s all sex and celebrity now”: Page three corpus linguistics

On Monday (16th October) on page three of the Daily Mail, the readers could come across a short article about changes in the English lexicon with a title: “Forget supper and soup… it’s all sex and celebrity now”. (A longer version of the article is available online.) The article quoted some data from the New General Service List (new-GSL) and compared these with Bauman and Culligan’s version of West’s GSL. Bauman and Culligan offer a list of words from West’s GSL (1953) combined with word frequency rankings based on the Brown Corpus (1961).

It comes as no surprise that the word ranks in Bauman and Culligan’s version of the GSL differ from the ranks in the new-GSL. This might be given not only by the time factor, but also by the composition of the source corpora. The new-GSL is a wordlist based on four different language corpora (three British English corpora and one corpus representing the language of the internet) of the total size of over 12 billion words; Bauman and Culligan’s counts, on the other hand, rely on a single one-million-word corpus of American English compiled in the 1960s. The comparisons in the Daily Mail therefore need to be interpreted with caution. In particular, the following points should be considered:

  • The language changes and there is no doubt that over time, some words become more popular  and other words fall out of fashion. The new GSL lists 378 lexical innovations including words such as Internet, website, online, email, network, client, mobile, file and web.
  • On the other hand, the research shows that there is a large stable lexical core (2,116 items in the new-GSL) including frequent nouns, verbs and adjectives such as time, year, people, way, say, make, take, go, good, new, great and same.
  • In order to interpret the social significance of lexical changes, we need to look at the contexts in which different words appear. A good example of this is the word “sex” quoted in the headline of the Daily Mail article.

Let’s talk about “sex”, shall we?

The word sex is polysemous and can mean either physical activity or biological dimorphism (male or female). I suppose the phrase “it’s all sex now” in the headline of the Daily Mail article alludes to the former meaning of the word sex, because the fact that we talk about males and females (the latter meaning of the word) does not sell newspapers. Let’s have a look at some corpus evidence.

Brown (1961)American writing EnTenTen12 (2012)Internet language
form “sex” per million words 82.7 86.6
sex as activity 75% 90%*
sex as dimorphism 25% 10%*

*based on a random sample of 250 lines

A quick comparison of the evidence in the Brown Corpus (which the Daily Mail uses as the point of departure) and the EnTenTen12 internet corpus (one of the sources of the new-GSL) shows that the frequencies per million do not differ very much. There is a difference, however, in the proportions of the two meanings (sex as activity and sex as dimorphism) which can be explained by the difference in the genres sampled in the two corpora. In contrast to the Brown corpus, EnTenTen12 includes also pornography (as you would expect from an internet-based corpus); This is also reflected in some of the prominent collocates of the word “sex” such as oral, anal, hardcore, gay, lesbian and toy in EnTenTen12. However, the fact that “it’s all sex now” (as the Daily Mail puts it) has even a more simple and prosaic explanation:  When compiling the original wordlist, Michael West very likely decided to exclude the term “sex” as something that does not need to be mentioned in the classroom context.