Log Ratio – an informal introduction

In the latest version of CQPweb (v 3.1.7) a new statistic for keywords, collocations and lockwords is introduced, called Log Ratio.

“Log Ratio” is actually my own made-up abbreviated title for something which is more precisely defined as either the binary log of the ratio of relative frequencies or the binary log of the relative risk. Over the months I’ve been building up to this addition, people have kept telling me that I need a nice, easy to understand label for this measurement, and they are quite right. Thus Log Ratio. But what is Log Ratio?

Log Ratio is my attempt to suggest a better statistic for keywords/key tags than log-likelihood, which is the statistic normally used. The problem with this accepted procedure is that log-likelihood is a statistical significance measure – it tells us how much evidence we have for a difference between two corpora. However, it doesn’t tell us how big / how important a given difference is. But we very often want to know how big a difference is!

For instance, if we look at the top 200 keywords in a list, we want to look at the “most key” words, i.e. the words where the difference in frequency is greatest. But sorting the list by log-likelihood doesn’t give us this – it gives us the words we have most evidence for, even if the actual difference is quite small.

The Log Ratio statistic is an “effect-size” statistic, not a significance statistic: it does represent how big the difference between two corpora are for a particular keyword. It’s also a very transparent statistic in that it is easy to understand how it is calculated and why it represents the size of the difference.

When we present corpus frequencies, we usually give a relative frequency (or a normalised frequency as it is sometimes called): this is equal to the absolute frequency, divided by the size of the corpus or subcorpus. We often then multiply by a normalisation factor – 1,000 or 1,000,000 being the most usual factors – but this is, strictly speaking, optional and merely for presentation purposes.

Once we have made a frequency into a relative frequency by dividing it by the corpus size, we can compare it to the relative frequency of the same item in a different corpus. The easiest way to do this is to say how many times bigger the relative frequency is in one corpus as opposed to the other, which we work out by dividing one relative frequency by another. For instance, if the relative frequency of a word is 0.0006 in Corpus A and 0.0002 in Corpus B, then we can say that the relative frequency in Corpus A is three times bigger than in Corpus B (0.0006 ÷ 0.0002 = 3).

Dividing one number by another gives us the ratio of two numbers, so we can call this measure of the difference between the two corpora the ratio of relative frequencies (statisticians often call it the relative risk, for reasons I won’t go into here), and, as I’ve explained, it simply tells us how many times more frequent the word is in Corpus A than in Corpus B – so it’s a very transparent and understandable statistic.

We could use the ratio of relative frequencies as a keyness statistic but, in my view, it is useful to convert it into a logarithm (“log” for short) first – specifically, the logarithm to base 2 or binary logarithm. Why do this? Well, here’s how taking the log of the ratio works:

  • A word has the same relative frequency in A and B – the binary log of the ratio is 0
  • A word is 2 times more common in A than in B – the binary log of the ratio is 1
  • A word is 4 times more common in A than in B – the binary log of the ratio is 2
  • A word is 8 times more common in A than in B – the binary log of the ratio is 3
  • A word is 16 times more common in A than in B – the binary log of the ratio is 4
  • A word is 32 times more common in A than in B – the binary log of the ratio is 5

That is, once we take a binary log, every point represents a doubling of the ratio. This is very useful to help us focus on the overall magnitude of the difference (4 vs. 8 vs. 16) rather than differences that are pretty close together (e.g. 4 vs. 5 vs. 6).  This use of the binary log is very familiar in corpus linguistics – the commonly-used Mutual Information measure, which is closely related to the ratio of relative frequencies, is also calculated using a binary log.

So now we’ve arrived at our measure – the binary log of the ratio of relative frequencies, or Log Ratio for short.

If you followed the explanation above, then you know everything you need to know in order to interpret Log Ratio scores. If you didn’t follow it, then here’s the crucial takeaway: every extra point of Log Ratio score represents a doubling in size of the difference between the two corpora, for the keyword under consideration.

When we use Log Ratio for collocation, it has exactly the same interpretation, but applied to the zone around the node: every extra point of Log Ratio Score represents a doubling in size of the difference between the collocate’s frequency near the node and its frequency elsewhere. The outcome is a collocation measure very similar to Mutual Information.

Another advantage of Log Ratio is that it can be used for lockwords as well as keywords, which log-likelihood can’t. A Log Ratio of zero or nearly zero indicates a word that is “locked” between Corpus A and Corpus B. In consequence the new version of CQPweb allows you to look at lockwords – to my knowledge, the first general corpus tool that makes this possible.

A more formal discussion of Log Ratio will be at the core of my presentation at the ICAME conference later this week. A journal article will follow in due course.

Using version control software for corpus construction

There are two problems that often come up in collaborative efforts towards corpus construction. First, how do two or more people pool their efforts simultaneously on this kind of work – sharing the data as it develops without working at cross-purposes, repeating effort, or ending up with incompatible versions of the corpus? Second, how do we keep track of what changes in the corpus as it grows and approaches completion – and in particular, if mistakes get made, how do we make sure we can undo them?

Typically corpus linguists have used ad hoc solutions to these problems. To deal with the problem of collaboration, we email bundles of files back and forth, or used shared directories on our institutional networks, or rely on external cloud services like Dropbox. To deal with the problem of recording the history of the data, we often resort to saving multiple different versions of the data, creating a new copy of the whole corpus every time we make any tiny change, and adding an ever-growing pile of “v1”, “v2” “v3”… suffixes to the filenames.

In this blog post I’d like to suggest a better way!

The problems of collaboration and version tracking also affect the work of software developers – with the difference that for them, these problems have been quite thoroughly solved. Though software development and corpus construction are quite different animals, in two critical respects they are similar. First, we are working mainly with very large quantities of plain text files: source code files in the case of software, natural-language text files in the case of corpora. Second, when we make a change, we typically do not change the whole collection of files but only, perhaps, some specific sections of a subset of the files. For this reason, the tools that software developers use to manage their source code – called version control software – are in my view eminently suitable for corpus construction.

So what is version control software?

Think of a computer filesystem – a hierarchy of folders, subfolders and files within those folders which represents all the various data stored on a disk or disks somewhere. This is basically a two-dimensional system: files and folders can be above or below one another in the hierarchy (first dimension), or they can be side-by-side in some particular location (second dimension). But there is also the dimension of time – the state of the filesystem at one point in time is different from its state at a subsequent point in time, as we add new files and folders or move, modify or delete existing ones. A standard traditional filesystem does not have any way to represent this third dimension. If you want to keep a record of a change, all you can do is create a copy of the data alongside the original, and modify the copy while leaving the original untouched. But it would be much better if the filesystem itself were able to keep a record of all the changes that have been made, and all of its previous states going back through history – and if it did this automatically, without the user needing to manage different versions of the data manually.

Windows and Mac OS X both now have filesystems that contain some features of this automatic record-keeping. Version control software does the same thing, but in a more thorough and systematic way. It implements a filesystem with a complete, automatic record of all the changes that are made over time, and provides users with easy ways to access the files, see the record of the changes, and add new changes.

I personally encountered version control software for the first time when I became a developer on the Corpus Workbench project back in 2009/2010. Most of the work on CWB is done by myself and Stefan Evert, and although we do have vaguely defined areas of individual responsibility for different bits of the project, there is also a lot of overlap. Without version control software, effective collaboration and tracking the changes we each make would be quite impossible. The whole of CWB including the core system, the supplementary tools, the CQPweb user interface, and the various manuals and tutorials, is all version-controlled. UCREL also uses version control software for the source code of tools such as CLAWS and USAS. And the more I’ve used version control tools for programming work, the more convinced I’ve become that the same tools will be highly useful for corpus development.

The version control system that I prefer is called Subversion, also known by the abbreviation SVN. This is quite an old-fashioned system, and many software developers now use newer systems such as Mercurial or Git (the latter is the brainchild of Linus Torvalds, the mastermind behind Linux). These newer and much more flexible systems are, however, quite a bit more complex and harder to use than Subversion. This is fine for computer programmers using the systems every day, but for corpus linguists who only work with version control every now and them, the simplicity of good old Subversion makes it – in my view – the better choice.

Subversion works like this. First, a repository is created. The repository is just a big database for storing the files you’re going to work with. When you access this database using Subversion tools, it looks like one big file system containing files, folders and subfolders. The person who creates and manages the repository (here at CASS that’s me) needs a fair bit of technical expertise, but the other users need only some very quick training. The repository needs to be placed somewhere where all members of the team can access it. The CASS Subversion repository lives on our application server, a virtual machine maintained by Lancaster University’s ISS; but you don’t actually need this kind of full-on setup, just an accessible place to put the database (and, needless to say, there needs to be a good backup policy for the database, wherever it is).

The repository manager then creates usernames that the rest of the team can use to work with the files in the repository. When you want to start working with one of the corpora in the repository, you begin by checking out a copy of the data. This creates a working copy of the repository’s contents on your local machine. It can be a copy of the whole repository, or just a section that you want to work on.  Then, you make whatever additions, changes or deletions you want – no need to keep track of these manually! Once you’ve made a series of changes to your checked-out working copy, you commit it back into the repository. Whenever a user commits data, the repository creates a new, numbered version of its filesystem data. Each version is stored as a record of the changes made since the previous version. This means that (a) there is a complete record of the history of the filesystem, with every change to every file logged and noted; (b) there is also a record of who is responsible for every change. This complete record takes up less disk space than you might think, because only the changes are recorded. Subversion is clever enough not to create duplicate copies of the parts of its filesystem that have not changed.

Nothing is ever lost or deleted from this system. Even if a file is completely removed, it is only removed from the new version: all the old versions in the history still complain it. Moreover, it is always possible to check out a version other than the current one – allowing you to see the filesystem as it was at any point in time you choose. That means that all mistakes are reversible. Even if someone commits a version where they have accidentally wiped out nine-tenths of the corpus you are working on, it’s simplicity itself just to return to an earlier point in history and roll back the change.

The strength of this approach for collaboration is that more than one person can have a checked-out copy of a corpus at the same time, and everyone can make their own changes separately. To check whether someone else has committed changes while you’ve been working, you can update your working copy from the repository, getting the other person’s changes and merging them with yours. Even if you’ve made changes to the same file, they will be merged together automatically. Only if two of you have changed the same section of the same file is there a problem – and in this case the program will show you the two different versions, and allow you to pick one or the other or create a combination of the two manually.

While Subversion can do lots more than this, for most users these three actions – check out, update, and commit – are all that’s needed. You also have a choice of programs that you can use for these actions. Most people with Unix machines use a command-line tool called svn which lets you issue commands to Subversion by typing them into a shell terminal.

On Windows, on the other hand, the preferred tool is something called TortoiseSVN. This can be downloaded and installed in the same way as most Windows programs. However, once installed, you don’t have to start up a separate application to use Subversion. Instead, the Subversion commands are added to the right-click context menu in Windows Explorer. So you can simply go to and empty folder, right-click with the mouse, and select the “check out” option to get your working copy. Once you’ve got a working copy, right-clicking on any file or folder within it allows you to access the “update” and “commit” options. TortoiseSVN provides an additional sub-menu which lets you access the full range of Subversion commands – but, again, normal users only need those three most common commands.

The possibility of using TortoiseSVN on Windows means that even the least tech-savvy member of your team can become a productive use of Subversion with only a very little training. And the benefits of building your corpus in a Subversion repository are considerable:

  • The corpus is easily accessible and sharable between collaborators
  • A complete record of all changes made, plus who-did-what
  • Any change can be reversed if necessary, with no need to manually manage “old versions”
  • Full protection against accidental deletions and erroneous changes
  • A secure and reliable backup method is only needed for the repository itself, not for each person’s working copy

That’s not to mention other benefits, such as the ease of switching between computers (just check out another working copy on the new machine and carry on where you left off).

Here at CASS we are making it our standard policy to put corpus creation work into Subversion, and we’re now in the process of gradually transitioning the team’s corpus-building efforts across into that platform. I’m convinced this is the way of the future for effectively managing corpus construction.

A new version of EEBO on CQPweb

The version of the EEBO-TCP data that has been available on Lancaster University’s CQPweb server is now rather old (the TCP project adds text to the collection on a rolling basis), and, more importantly, does not contain any annotations. Recently I have devoted some time to running a newer version through UCREL’s standard annotation tools and then mounting the resulting dataset on CQPweb. The new version stands at 1.2 billion running tokens, each with eight different annotation fields.

Critically, the first layer of annotation is spelling regularisation, which  means that the accuracy of the subsequent layers including part-of-speech tagging and lemmatisation is enhanced. Regularised spelling means that searches can be much more comprehensive. Once I had finished with the indexing process, one of the first searches that Paul Rayson did (in preparation for a presentation at the EEBO-TCP conference in Oxford this week) was to check on the word experiment and to compare the results returned by a search on original spelling as compared to regularised spelling.

eebo1

eebo2

eebo3

eebo4

The version with regularised spelling (the second graph) returns about three times as many results as the version without (the first graph). The distribution is also rather different. As the following graph shows, the use of a lemma search retrieves even more relevant examples:

eebo5

eebo6

This illustrates the value of the standard annotations to the analysis of the EEBO-TCP data. The newly indexed corpus will be of use both for CASS purposes and for the CREME research group.

Visiting With The Brown Family

In 2011 I gave a plenary talk on how American English is changing over time (contrasting it with British English), using the Brown Family of corpora. Each member of the Brown family consists of a corpus of 1 million words of written, published, standard English, divided into 500 files each of about 2000 words each. Fifteen genres of writing are represented – this framework being created decades ago when the original Brown corpus was compiled by Henry Kučera and W. Nelson Francis at Brown University, having the distinction of being the first publically available corpus ever built. Containing only American texts published in 1961, it originally went by the name of A Standard Corpus of Present-Day Edited American English for use with Digital Computers but later became known as just the Brown Corpus. It was followed by an equivalent British version, with later members representing English from the 1990s, the 2000s and the 1930s. A 1901 British version is in the pipeline.

Before I gave my talk, however, Mark Davies gave a brilliant presentation on the COHA (Corpus of Historical American English) which has 400 million words and covers the period from 1800 to the present day. It was the proverbial hard act to follow. Compared to the COHA, the Brown family are tiny, and the coverage occurs across 30 or 15 year snapshots, rather than representing every year. If we identify, say, that the word Mr is less frequent in 2006 than in 1991 then it is tempting to say that Mr is becoming less frequent over time. But we don’t know for certain what corpora from all the years in between would tell us. Having multiple sampling points presents a more convincing picture, but judicious hedging must be applied.

Also, being small, many words in the Brown family have tiny frequencies so it’s very difficult to make any claims about them. And the sampling could be viewed as rather outdated – the sorts of texts that people accessed in the 1960s are not necessarily the same as they access now. There are no online texts in the Brown family (although to ease collection, both the 2006 members involved texts that were originally published in written form, then placed online). Nor is there any advertising text. Or song lyrics. Or horror fiction. Or erotica (although there is a section on Romantic Fiction which could be pushed in that direction). Finally, the fact that all the texts are of the published variety means that they tend to represent a somewhat standardised, conservative form of English. A lot of the innovation in English happens in much more informal contexts, especially where young people or people from different backgrounds mix together – inner-city playgrounds and internet forums being two good examples. By the time such innovation gets into written published standard English, it’s no longer innovative. So the Brown family can’t tell us about the cutting edge of language use – they’ll always be a few years out of fashion.

So what are the Brown family good for, if anything?

Continue reading