Workshop on Corpus Linguistics in Ghana

Back in 2014, a team from CASS ran a well-received introductory workshop on Corpus Linguistics in Accra, Ghana – a country where Lancaster University has a number of longstanding academic partnerships and has recently established a campus.

We’re pleased to announce that in February of this year, we will be returning to Ghana and running two more introductory one-day events. Both events are free to attend, each consisting of a series of introductory lectures and practical sessions on topics in corpus linguistics and the use of corpus tools.

Since the 2014 workshop was attended by some participants from a long way away, this time we are running events in two different locations in Ghana. The first workshop, on Tuesday 23rd February 2016, will be in Cape Coast, organised jointly with the University of Cape Coast: click this link for details. The second workshop, on Friday 26th February 2016, will be in Legon (nr. Accra), organised jointly with the University of Ghana: click this link for details. The same material will be covered at both workshops.

The workshop in 2014 was built largely around the use of our online corpus tools, particularly CQPweb. In the 2016 events, we’re going to focus instead on a pair of programs that you can run on your own computer to analyse your own data: AntConc and GraphColl. For that reason we will be encouraging participants who have their own corpora to bring them along to analyse in the workshop. These can be in any language – not just English! Don’t worry however – we will also provide sample datasets that participants who don’t have their own data can work with.

We invite anyone in Ghana who wants to learn more about the versatile methodology for language analysis that is corpus linguistics to attend! While the events are free, registration in advance is required, as places are limited.

2014/15 in retrospective: Perspectives on Chinese

Looking back over the academic year as it draws to a close, one of the highlights for us here at CASS was the one-day seminar we hosted in January on Perspectives on Chinese: Talks in Honour of Richard Xiao. This event celebrated the contributions to linguistics of CASS co-investigator Dr. Richard Zhonghua Xiao, on the occasion of both his retirement in October 2014 (and simultaneous taking-up of an honorary position with the University!), and the completion of the two funded research projects which Richard has led under the aegis of CASS.

The speakers included present and former collaborators with Richard – some (including myself) from here at Lancaster, others from around the world – as well as other eminent scholars working in the areas that Richard has made his own: Chinese corpus linguistics (especially, but not only, comparative work), and the allied area of the methodologies that Richard’s work has both utilised and promulgated.

In the first presentation, Prof. Hongyin Tao of UCLA took a classic observation of corpus-based studies – the existence, and frequent occurrence, of highly predictable strings or structures – and pointed out a little-noticed aspect of these highly predictable elements: they often involve lacunae, or null elements, where some key component of the meaning is simply left unstated and assumed. An example of this is the English expression under the influence, where “the influence of what?” is often implicit, but understood to be drugs/alcohol. It was pointed out that collocation patterns may identify the null elements, but that a simplistic application of collocation analysis may fail to yield useful results for expressions containing null elements. Finally, an extension of the analysis to yinxiang, the Chinese equivalent of influence, showed much the same tendencies – including, crucially, the importance of null elements – at work.

The following presentation came from Prof. Gu Yueguo of the Chinese Academy of Social Sciences. Gu is well-known in the field of corpus linguistics for his many projects over the years to develop not just new corpora, but also new types of corpus resources – for example, his exciting development in recent years of novel types of ontology. His presentation at the seminar was very much in this tradition, arguing for a novel type of multimodal corpus for use in the study of child language acquisition.

At this point in proceedings, I was deeply honoured to give my own presentation. One of Richard’s recently-concluded projects involved the application of Douglas Biber’s method of Multidimensional Analysis to translational English as the “Third Code”. In my talk, I presented methodological work which, together with Xianyao Hu, I have recently undertaken to assist this kind of analysis by embedding tools for the MD approach in CQPweb. A shorter version of this talk was subsequently presented at the ICAME conference in Trier at the end of May.

Prof. Xu Hai of Guangdong University of Foreign Studies gave a presentation on the study of Learner Chinese, an issue which was prominent among Richard’s concerns as director of the Lancaster University Confucius Institute. As noted above, Richard has led a project funded by the British Academy, looking at the acquisition of Mandarin Chinese as a foreign language; as a partner on that project, Xu’s presentation of a preliminary report on the Guangwai Lancaster Chinese Learner Corpus was timely indeed. This new learner corpus – already in excess of a million words in size, and consisting of a roughly 60-40 split between written and spoken materials – follows the tradition of the best learner corpora for English by sampling learners with many different national backgrounds, but also, interestingly, includes some longitudinal data. Once complete, the value of this resource for the study of L2 Chinese interlanguage will be incalculable.

The next presentation was another from colleagues of Richard here at Lancaster: Dr. Paul Rayson and Dr. Scott Piao gave a talk on the extension of the UCREL Semantic Analysis System (USAS) to Chinese. This has been accomplished by means of mapping the vast semantic lexicon originally created for English across to Chinese, initially by automatic matching, and secondarily by manual editing. Scott and Paul, with other colleagues including CASS’s Carmen Dayrell, went on to present this work – along with work on other languages – at the prestigious NAACL HLT 2015 conference, in whose proceedings a write-up has been published.

Prof. Jiajin Xu (Beijing Foreign Studies University) then made a presentation on corpus construction for Chinese. This area has, of course, been a major locus of activity by Richard over the years: his Lancaster Corpus of Mandarin Chinese (LCMC), a Mandarin match for the Brown corpus family, is one of the best openly-available linguistic resources for that language, and his ZJU Corpus of Translational Chinese (ZCTC) was a key contribution of his research on translation in Chinese. Xu’s talk presented a range of current work building on that foundation, especially the ToRCH (“Texts of Recent Chinese”) family of corpora – a planned Brown-family-style diachronic sequence of snapshot corpora in Chinese from BFSU, starting with the ToRCH2009 edition. Xu rounded out the talk with some case studies of applications for ToRCH, looking first at recent lexical change in Chinese by comparing ToRCH2009 and LCMC, and then at features of translated language in Chinese by comparing ToRCH2009 and ZCTC.

The last presentation of the day was from Dr. Vittorio Tantucci, who has recently completed his PhD at the Department of Linguistics and English Language at Lancaster, and who specialises in a number of issues in cognitive linguistic analysis including intersubjectivity and evidentiality. His talk addressed specifically the Mandarin evidential marker 过 guo, and the path it took from a verb meaning ‘to get through, to pass by’ to becoming a verbal grammatical element. He argued that this exemplified a path for an evidential marker to originate from a traversative structure – a phenomenon not noted in the literature on this kind of grammaticalisation, which focuses on two other paths of development, from verbal constructions conveying a result or a completion. Vittorio’s work is extremely valuable, not only in its own right but as a demonstration of the role that corpus-based analysis, and cross-linguistic evidence, has to play in linguistic theory. Given Richard’s own work on the grammar and semantics of aspect in Chinese, a celebration of Richard’s career would not have been complete without an illustration of how this trend in current linguistics continues to develop.

All in all, the event was a magnificent tribute to Richard and his highly productive research career, and a potent reminder of how diverse his contributions to the field have actually been, and of their far-reaching impact among practitioners of Chinese corpus linguistics. The large and lively audience certainly seemed to agree with our assessment!

Our deep thanks go out to all the invited speakers, especially those who travelled long distances to attend – our speaker roster stretched from California in the west, to China in the east.

In memory: Professor Geoffrey Leech

It is with great sorrow that we report the death on 19th August of Professor Geoffrey Leech.

Geoff was not only the founder of the UCREL research centre for corpus linguistics at Lancaster University, he was also the first Professor and founding Head of the Department of Linguistics and English Language. His contributions to linguistics – not only in corpus linguistics, but also in English grammar, pragmatics and stylistics – were immense. After his retirement in 2002, he remained an active member of our department, not only continuing his own research but also, characteristically, providing advice, support and encouragement for students and junior colleagues.

All our thoughts are with Geoff’s wife Fanny, and with his family.

It is still hard for us to find the right words at this time. For many of us he was an inspirational teacher and mentor, but for all of us, he was a kind and generous friend.

The video below was recorded by Tony McEnery in conversation with Geoff in late 2013 for Lancaster’s online course in corpus linguistics. In it, Tony and Geoff discuss the history of the field. We present it now publicly as a first tribute to Geoff’s life and work.

(A transcript is available from this link.)

Log Ratio – an informal introduction

In the latest version of CQPweb (v 3.1.7) a new statistic for keywords, collocations and lockwords is introduced, called Log Ratio.

“Log Ratio” is actually my own made-up abbreviated title for something which is more precisely defined as either the binary log of the ratio of relative frequencies or the binary log of the relative risk. Over the months I’ve been building up to this addition, people have kept telling me that I need a nice, easy-to-understand label for this measurement, and they are quite right. Thus Log Ratio. But what is Log Ratio?

Log Ratio is my attempt to suggest a better statistic for keywords/key tags than log-likelihood, which is the statistic normally used. The problem with this accepted procedure is that log-likelihood is a statistical significance measure – it tells us how much evidence we have for a difference between two corpora. However, it doesn’t tell us how big / how important a given difference is. But we very often want to know how big a difference is!

For instance, if we look at the top 200 keywords in a list, we want to look at the “most key” words, i.e. the words where the difference in frequency is greatest. But sorting the list by log-likelihood doesn’t give us this – it gives us the words we have most evidence for, even if the actual difference is quite small.

The Log Ratio statistic is an “effect-size” statistic, not a significance statistic: it does represent how big the difference between two corpora is for a particular keyword. It’s also a very transparent statistic in that it is easy to understand how it is calculated and why it represents the size of the difference.

When we present corpus frequencies, we usually give a relative frequency (or a normalised frequency as it is sometimes called): this is equal to the absolute frequency, divided by the size of the corpus or subcorpus. We often then multiply by a normalisation factor – 1,000 or 1,000,000 being the most usual factors – but this is, strictly speaking, optional and merely for presentation purposes.

Once we have made a frequency into a relative frequency by dividing it by the corpus size, we can compare it to the relative frequency of the same item in a different corpus. The easiest way to do this is to say how many times bigger the relative frequency is in one corpus as opposed to the other, which we work out by dividing one relative frequency by another. For instance, if the relative frequency of a word is 0.0006 in Corpus A and 0.0002 in Corpus B, then we can say that the relative frequency in Corpus A is three times bigger than in Corpus B (0.0006 ÷ 0.0002 = 3).
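To make the arithmetic concrete, here is a minimal Python sketch of that calculation. The corpus sizes and frequencies are invented purely for illustration, chosen so that they reproduce the 0.0006 and 0.0002 figures above:

```python
# Relative frequency = absolute frequency / corpus size.
# Suppose a word occurs 60 times in a 100,000-word Corpus A
# and 20 times in a 100,000-word Corpus B.
freq_a, size_a = 60, 100_000
freq_b, size_b = 20, 100_000

rel_a = freq_a / size_a   # 0.0006
rel_b = freq_b / size_b   # 0.0002

# How many times bigger is the relative frequency in A than in B?
ratio = rel_a / rel_b
print(round(ratio, 10))   # 3.0
```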

Dividing one number by another gives us the ratio of two numbers, so we can call this measure of the difference between the two corpora the ratio of relative frequencies (statisticians often call it the relative risk, for reasons I won’t go into here), and, as I’ve explained, it simply tells us how many times more frequent the word is in Corpus A than in Corpus B – so it’s a very transparent and understandable statistic.

We could use the ratio of relative frequencies as a keyness statistic but, in my view, it is useful to convert it into a logarithm (“log” for short) first – specifically, the logarithm to base 2 or binary logarithm. Why do this? Well, here’s how taking the log of the ratio works:

  • A word has the same relative frequency in A and B – the binary log of the ratio is 0
  • A word is 2 times more common in A than in B – the binary log of the ratio is 1
  • A word is 4 times more common in A than in B – the binary log of the ratio is 2
  • A word is 8 times more common in A than in B – the binary log of the ratio is 3
  • A word is 16 times more common in A than in B – the binary log of the ratio is 4
  • A word is 32 times more common in A than in B – the binary log of the ratio is 5

That is, once we take a binary log, every point represents a doubling of the ratio. This is very useful to help us focus on the overall magnitude of the difference (4 vs. 8 vs. 16) rather than differences that are pretty close together (e.g. 4 vs. 5 vs. 6). This use of the binary log is very familiar in corpus linguistics – the commonly-used Mutual Information measure, which is closely related to the ratio of relative frequencies, is also calculated using a binary log.
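The doubling pattern in the list above can be checked directly in Python with the standard library’s math.log2:

```python
import math

# Every doubling of the frequency ratio adds one point of binary log:
ratios = [1, 2, 4, 8, 16, 32]
logs = [math.log2(r) for r in ratios]
print(logs)  # [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
```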

So now we’ve arrived at our measure – the binary log of the ratio of relative frequencies, or Log Ratio for short.

If you followed the explanation above, then you know everything you need to know in order to interpret Log Ratio scores. If you didn’t follow it, then here’s the crucial takeaway: every extra point of Log Ratio score represents a doubling in size of the difference between the two corpora, for the keyword under consideration.

When we use Log Ratio for collocation, it has exactly the same interpretation, but applied to the zone around the node: every extra point of Log Ratio score represents a doubling in size of the difference between the collocate’s frequency near the node and its frequency elsewhere. The outcome is a collocation measure very similar to Mutual Information.

Another advantage of Log Ratio is that it can be used for lockwords as well as keywords, which log-likelihood can’t. A Log Ratio of zero or nearly zero indicates a word that is “locked” between Corpus A and Corpus B. In consequence the new version of CQPweb allows you to look at lockwords – to my knowledge, the first general corpus tool that makes this possible.
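Putting the pieces together, the whole measure fits in a few lines of Python. This is my own illustrative sketch, not the exact CQPweb implementation; in particular, the add-0.5 adjustment for zero frequencies is just one common convention for avoiding division by zero, and other adjustments are possible:

```python
import math

def log_ratio(freq_a, size_a, freq_b, size_b):
    """Binary log of the ratio of relative frequencies of a word
    in Corpus A versus Corpus B."""
    # Simple adjustment so a zero frequency does not make the
    # ratio zero or undefined (one common convention, not the only one).
    if freq_a == 0:
        freq_a = 0.5
    if freq_b == 0:
        freq_b = 0.5
    rel_a = freq_a / size_a
    rel_b = freq_b / size_b
    return math.log2(rel_a / rel_b)

# A word 8 times more frequent (relatively) in A than in B scores 3:
print(log_ratio(80, 100_000, 10, 100_000))   # ≈ 3.0

# A word with the same relative frequency in both scores 0 -
# a candidate "lockword":
print(log_ratio(50, 100_000, 50, 100_000))   # 0.0
```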

A more formal discussion of Log Ratio will be at the core of my presentation at the ICAME conference later this week. A journal article will follow in due course.

Using version control software for corpus construction

There are two problems that often come up in collaborative efforts towards corpus construction. First, how do two or more people pool their efforts simultaneously on this kind of work – sharing the data as it develops without working at cross-purposes, repeating effort, or ending up with incompatible versions of the corpus? Second, how do we keep track of what changes in the corpus as it grows and approaches completion – and in particular, if mistakes get made, how do we make sure we can undo them?

Typically corpus linguists have used ad hoc solutions to these problems. To deal with the problem of collaboration, we email bundles of files back and forth, or use shared directories on our institutional networks, or rely on external cloud services like Dropbox. To deal with the problem of recording the history of the data, we often resort to saving multiple different versions of the data, creating a new copy of the whole corpus every time we make any tiny change, and adding an ever-growing pile of “v1”, “v2”, “v3”… suffixes to the filenames.

In this blog post I’d like to suggest a better way!

The problems of collaboration and version tracking also affect the work of software developers – with the difference that for them, these problems have been quite thoroughly solved. Though software development and corpus construction are quite different animals, in two critical respects they are similar. First, we are working mainly with very large quantities of plain text files: source code files in the case of software, natural-language text files in the case of corpora. Second, when we make a change, we typically do not change the whole collection of files but only, perhaps, some specific sections of a subset of the files. For this reason, the tools that software developers use to manage their source code – called version control software – are in my view eminently suitable for corpus construction.

So what is version control software?

Think of a computer filesystem – a hierarchy of folders, subfolders and files within those folders which represents all the various data stored on a disk or disks somewhere. This is basically a two-dimensional system: files and folders can be above or below one another in the hierarchy (first dimension), or they can be side-by-side in some particular location (second dimension). But there is also the dimension of time – the state of the filesystem at one point in time is different from its state at a subsequent point in time, as we add new files and folders or move, modify or delete existing ones. A standard traditional filesystem does not have any way to represent this third dimension. If you want to keep a record of a change, all you can do is create a copy of the data alongside the original, and modify the copy while leaving the original untouched. But it would be much better if the filesystem itself were able to keep a record of all the changes that have been made, and all of its previous states going back through history – and if it did this automatically, without the user needing to manage different versions of the data manually.

Windows and Mac OS X both now have filesystems that contain some features of this automatic record-keeping. Version control software does the same thing, but in a more thorough and systematic way. It implements a filesystem with a complete, automatic record of all the changes that are made over time, and provides users with easy ways to access the files, see the record of the changes, and add new changes.

I personally encountered version control software for the first time when I became a developer on the Corpus Workbench project back in 2009/2010. Most of the work on CWB is done by myself and Stefan Evert, and although we do have vaguely defined areas of individual responsibility for different bits of the project, there is also a lot of overlap. Without version control software, effective collaboration and tracking the changes we each make would be quite impossible. The whole of CWB including the core system, the supplementary tools, the CQPweb user interface, and the various manuals and tutorials, is all version-controlled. UCREL also uses version control software for the source code of tools such as CLAWS and USAS. And the more I’ve used version control tools for programming work, the more convinced I’ve become that the same tools will be highly useful for corpus development.

The version control system that I prefer is called Subversion, also known by the abbreviation SVN. This is quite an old-fashioned system, and many software developers now use newer systems such as Mercurial or Git (the latter is the brainchild of Linus Torvalds, the mastermind behind Linux). These newer and much more flexible systems are, however, quite a bit more complex and harder to use than Subversion. This is fine for computer programmers using the systems every day, but for corpus linguists who only work with version control every now and then, the simplicity of good old Subversion makes it – in my view – the better choice.

Subversion works like this. First, a repository is created. The repository is just a big database for storing the files you’re going to work with. When you access this database using Subversion tools, it looks like one big file system containing files, folders and subfolders. The person who creates and manages the repository (here at CASS that’s me) needs a fair bit of technical expertise, but the other users need only some very quick training. The repository needs to be placed somewhere where all members of the team can access it. The CASS Subversion repository lives on our application server, a virtual machine maintained by Lancaster University’s ISS; but you don’t actually need this kind of full-on setup, just an accessible place to put the database (and, needless to say, there needs to be a good backup policy for the database, wherever it is).

The repository manager then creates usernames that the rest of the team can use to work with the files in the repository. When you want to start working with one of the corpora in the repository, you begin by checking out a copy of the data. This creates a working copy of the repository’s contents on your local machine. It can be a copy of the whole repository, or just a section that you want to work on. Then, you make whatever additions, changes or deletions you want – no need to keep track of these manually! Once you’ve made a series of changes to your checked-out working copy, you commit it back into the repository. Whenever a user commits data, the repository creates a new, numbered version of its filesystem data. Each version is stored as a record of the changes made since the previous version. This means that (a) there is a complete record of the history of the filesystem, with every change to every file logged and noted; (b) there is also a record of who is responsible for every change. This complete record takes up less disk space than you might think, because only the changes are recorded. Subversion is clever enough not to create duplicate copies of the parts of its filesystem that have not changed.

Nothing is ever lost or deleted from this system. Even if a file is completely removed, it is only removed from the new version: all the old versions in the history still contain it. Moreover, it is always possible to check out a version other than the current one – allowing you to see the filesystem as it was at any point in time you choose. That means that all mistakes are reversible. Even if someone commits a version where they have accidentally wiped out nine-tenths of the corpus you are working on, it’s simplicity itself just to return to an earlier point in history and roll back the change.

The strength of this approach for collaboration is that more than one person can have a checked-out copy of a corpus at the same time, and everyone can make their own changes separately. To check whether someone else has committed changes while you’ve been working, you can update your working copy from the repository, getting the other person’s changes and merging them with yours. Even if you’ve made changes to the same file, they will be merged together automatically. Only if two of you have changed the same section of the same file is there a problem – and in this case the program will show you the two different versions, and allow you to pick one or the other or create a combination of the two manually.

While Subversion can do lots more than this, for most users these three actions – check out, update, and commit – are all that’s needed. You also have a choice of programs that you can use for these actions. Most people with Unix machines use a command-line tool called svn which lets you issue commands to Subversion by typing them into a shell terminal.
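To give a flavour of the workflow, here is a small Python sketch that builds those three everyday svn command lines using the standard-library subprocess module. The repository URL and commit message are invented placeholders, and with dry_run=True the commands are only constructed, not executed – a real session would simply type the equivalent commands into a terminal:

```python
import subprocess

# Hypothetical repository URL - substitute your own.
REPO = "https://svn.example.org/corpora/my-corpus"

def svn(*args, dry_run=True):
    """Build (and, when dry_run=False, actually run) an svn command line."""
    cmd = ["svn", *args]
    if not dry_run:
        subprocess.run(cmd, check=True)
    return cmd

# The three everyday actions:
checkout = svn("checkout", REPO, "my-corpus")   # get a working copy
update = svn("update", "my-corpus")             # merge in colleagues' changes
commit = svn("commit", "my-corpus", "-m", "Added ten newly transcribed texts")
```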

On Windows, on the other hand, the preferred tool is something called TortoiseSVN. This can be downloaded and installed in the same way as most Windows programs. However, once installed, you don’t have to start up a separate application to use Subversion. Instead, the Subversion commands are added to the right-click context menu in Windows Explorer. So you can simply go to an empty folder, right-click with the mouse, and select the “check out” option to get your working copy. Once you’ve got a working copy, right-clicking on any file or folder within it allows you to access the “update” and “commit” options. TortoiseSVN provides an additional sub-menu which lets you access the full range of Subversion commands – but, again, normal users only need those three most common commands.

The possibility of using TortoiseSVN on Windows means that even the least tech-savvy member of your team can become a productive user of Subversion with only a very little training. And the benefits of building your corpus in a Subversion repository are considerable:

  • The corpus is easily accessible and sharable between collaborators
  • A complete record of all changes made, plus who-did-what
  • Any change can be reversed if necessary, with no need to manually manage “old versions”
  • Full protection against accidental deletions and erroneous changes
  • A secure and reliable backup method is only needed for the repository itself, not for each person’s working copy

That’s not to mention other benefits, such as the ease of switching between computers (just check out another working copy on the new machine and carry on where you left off).

Here at CASS we are making it our standard policy to put corpus creation work into Subversion, and we’re now in the process of gradually transitioning the team’s corpus-building efforts across into that platform. I’m convinced this is the way of the future for effectively managing corpus construction.

A new version of EEBO on CQPweb

The version of the EEBO-TCP data that has been available on Lancaster University’s CQPweb server is now rather old (the TCP project adds text to the collection on a rolling basis), and, more importantly, does not contain any annotations. Recently I have devoted some time to running a newer version through UCREL’s standard annotation tools and then mounting the resulting dataset on CQPweb. The new version stands at 1.2 billion running tokens, each with eight different annotation fields.

Critically, the first layer of annotation is spelling regularisation, which means that the accuracy of the subsequent layers, including part-of-speech tagging and lemmatisation, is enhanced. Regularised spelling means that searches can be much more comprehensive. Once I had finished with the indexing process, one of the first searches that Paul Rayson did (in preparation for a presentation at the EEBO-TCP conference in Oxford this week) was to check on the word experiment and to compare the results returned by a search on original spelling as compared to regularised spelling.

[Graph 1: distribution of results for experiment, original-spelling search]

[Graph 2: distribution of results for experiment, regularised-spelling search]
The version with regularised spelling (the second graph) returns about three times as many results as the version without (the first graph). The distribution is also rather different. As the following graph shows, the use of a lemma search retrieves even more relevant examples:

[Graph 3: distribution of results for experiment, lemma search]
This illustrates the value of the standard annotations to the analysis of the EEBO-TCP data. The newly indexed corpus will be of use both for CASS purposes and for the CREME research group.