In the latest version of CQPweb (v 3.1.7) a new statistic for keywords, collocations and lockwords is introduced, called Log Ratio.
“Log Ratio” is actually my own made-up abbreviated title for something which is more precisely defined as either the binary log of the ratio of relative frequencies or the binary log of the relative risk. Over the months I’ve been building up to this addition, people have kept telling me that I need a nice, easy to understand label for this measurement, and they are quite right. Thus Log Ratio. But what is Log Ratio?
Log Ratio is my attempt to suggest a better statistic for keywords/key tags than log-likelihood, which is the statistic normally used. The problem with this accepted procedure is that log-likelihood is a statistical significance measure – it tells us how much evidence we have for a difference between two corpora. However, it doesn’t tell us how big / how important a given difference is. But we very often want to know how big a difference is!
For instance, if we look at the top 200 keywords in a list, we want to look at the “most key” words, i.e. the words where the difference in frequency is greatest. But sorting the list by log-likelihood doesn’t give us this – it gives us the words we have most evidence for, even if the actual difference is quite small.
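To make the contrast concrete, here is a minimal sketch comparing the two kinds of statistic on invented counts. It assumes the two-corpus log-likelihood formulation commonly used for keywords (observed versus expected frequencies); this post does not spell out that formula, so treat the function `log_likelihood` and all of the numbers as illustrative assumptions rather than as what any particular tool computes.

```python
import math

def log_likelihood(count_a, size_a, count_b, size_b):
    """Two-corpus log-likelihood (assumed standard keyword formulation)."""
    total = count_a + count_b
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    return 2 * (count_a * math.log(count_a / expected_a)
                + count_b * math.log(count_b / expected_b))

def ratio_of_relative_frequencies(count_a, size_a, count_b, size_b):
    """How many times more frequent the word is in A than in B."""
    return (count_a / size_a) / (count_b / size_b)

SIZE = 1_000_000  # hypothetical: both corpora contain one million tokens

# Hypothetical frequent word: lots of evidence, tiny difference (ratio 1.05).
ll_frequent = log_likelihood(21_000, SIZE, 20_000, SIZE)
ratio_frequent = ratio_of_relative_frequencies(21_000, SIZE, 20_000, SIZE)

# Hypothetical rare word: less evidence, large difference (ratio 4).
ll_rare = log_likelihood(40, SIZE, 10, SIZE)
ratio_rare = ratio_of_relative_frequencies(40, SIZE, 10, SIZE)

print(f"frequent word: LL = {ll_frequent:.1f}, ratio = {ratio_frequent:.2f}")
print(f"rare word:     LL = {ll_rare:.1f}, ratio = {ratio_rare:.2f}")
```

With these made-up counts, sorting by log-likelihood ranks the frequent word first, even though the rare word is the one whose frequency differs most between the two corpora.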
The Log Ratio statistic is an “effect-size” statistic, not a significance statistic: it does represent how big the difference between two corpora is for a particular keyword. It’s also a very transparent statistic in that it is easy to understand how it is calculated and why it represents the size of the difference.
When we present corpus frequencies, we usually give a relative frequency (or a normalised frequency as it is sometimes called): this is equal to the absolute frequency, divided by the size of the corpus or subcorpus. We often then multiply by a normalisation factor – 1,000 or 1,000,000 being the most usual factors – but this is, strictly speaking, optional and merely for presentation purposes.
Once we have made a frequency into a relative frequency by dividing it by the corpus size, we can compare it to the relative frequency of the same item in a different corpus. The easiest way to do this is to say how many times bigger the relative frequency is in one corpus as opposed to the other, which we work out by dividing one relative frequency by another. For instance, if the relative frequency of a word is 0.0006 in Corpus A and 0.0002 in Corpus B, then we can say that the relative frequency in Corpus A is three times bigger than in Corpus B (0.0006 ÷ 0.0002 = 3).
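The arithmetic above can be sketched in a few lines of Python. The raw counts and corpus sizes are hypothetical, chosen only so that they reproduce the 0.0006 and 0.0002 figures in the example; `relative_frequency` is an illustrative helper name, not part of any tool.

```python
def relative_frequency(count: int, corpus_size: int) -> float:
    """Absolute frequency divided by the size of the corpus or subcorpus."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return count / corpus_size

# Hypothetical counts: 600 hits in Corpus A, 200 in Corpus B,
# each corpus containing one million tokens.
rf_a = relative_frequency(600, 1_000_000)   # 0.0006
rf_b = relative_frequency(200, 1_000_000)   # 0.0002

# Optional normalisation factor, purely for presentation.
per_million_a = rf_a * 1_000_000
per_million_b = rf_b * 1_000_000

# Ratio of relative frequencies: how many times bigger A is than B.
ratio = rf_a / rf_b
print(round(ratio, 6))
```

Note that the normalisation factor cancels out when one relative frequency is divided by the other, which is why it makes no difference to the ratio.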
Dividing one number by another gives us the ratio of two numbers, so we can call this measure of the difference between the two corpora the ratio of relative frequencies (statisticians often call it the relative risk, for reasons I won’t go into here), and, as I’ve explained, it simply tells us how many times more frequent the word is in Corpus A than in Corpus B – so it’s a very transparent and understandable statistic.
We could use the ratio of relative frequencies as a keyness statistic but, in my view, it is useful to convert it into a logarithm (“log” for short) first – specifically, the logarithm to base 2 or binary logarithm. Why do this? Well, here’s how taking the log of the ratio works:
- A word has the same relative frequency in A and B – the binary log of the ratio is 0
- A word is 2 times more common in A than in B – the binary log of the ratio is 1
- A word is 4 times more common in A than in B – the binary log of the ratio is 2
- A word is 8 times more common in A than in B – the binary log of the ratio is 3
- A word is 16 times more common in A than in B – the binary log of the ratio is 4
- A word is 32 times more common in A than in B – the binary log of the ratio is 5
That is, once we take a binary log, every point represents a doubling of the ratio. This is very useful to help us focus on the overall magnitude of the difference (4 vs. 8 vs. 16) rather than differences that are pretty close together (e.g. 4 vs. 5 vs. 6). This use of the binary log is very familiar in corpus linguistics – the commonly-used Mutual Information measure, which is closely related to the ratio of relative frequencies, is also calculated using a binary log.
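If you want to check the doubling pattern for yourself, a short sketch using Python's standard `math.log2` function reproduces the list above, and also shows how ratios that are close together (4, 5, 6) end up with scores that are close together.

```python
import math

# Each doubling of the ratio adds exactly one point to the binary log.
doubling_ratios = (1, 2, 4, 8, 16, 32)
doubling_scores = [math.log2(r) for r in doubling_ratios]
for r, score in zip(doubling_ratios, doubling_scores):
    print(f"ratio {r:>2} -> binary log {score}")

# Ratios that are close together stay within the same point or so.
nearby_scores = [round(math.log2(r), 2) for r in (4, 5, 6)]
print(nearby_scores)
```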
So now we’ve arrived at our measure – the binary log of the ratio of relative frequencies, or Log Ratio for short.
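Putting the two steps together gives a very short function. This is a minimal sketch of the definition given here, not CQPweb's actual code: the name `log_ratio` and its parameters are my own, and because this post does not say what should happen when a word has zero frequency in one corpus (where the ratio is undefined), the sketch simply refuses such input.

```python
import math

def log_ratio(count_a: int, size_a: int, count_b: int, size_b: int) -> float:
    """Binary log of the ratio of relative frequencies (A relative to B)."""
    if size_a <= 0 or size_b <= 0:
        raise ValueError("corpus sizes must be positive")
    if count_a <= 0 or count_b <= 0:
        # Assumption: zero counts are not handled in this sketch.
        raise ValueError("both counts must be positive")
    rf_a = count_a / size_a
    rf_b = count_b / size_b
    return math.log2(rf_a / rf_b)

# Hypothetical counts: 8 times more common in A than in B.
score = log_ratio(80, 1_000, 10, 1_000)
print(round(score, 3))

# Swapping the corpora flips the sign: the word is less common in B than in A.
reverse_score = log_ratio(10, 1_000, 80, 1_000)
print(round(reverse_score, 3))
```

A negative score simply means the word is more frequent in Corpus B than in Corpus A, by the same doubling scale.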
If you followed the explanation above, then you know everything you need to know in order to interpret Log Ratio scores. If you didn’t follow it, then here’s the crucial takeaway: every extra point of Log Ratio score represents a doubling in size of the difference between the two corpora, for the keyword under consideration.
When we use Log Ratio for collocation, it has exactly the same interpretation, but applied to the zone around the node: every extra point of Log Ratio score represents a doubling in size of the difference between the collocate’s frequency near the node and its frequency elsewhere. The outcome is a collocation measure very similar to Mutual Information.
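As a rough sketch of the collocation case, we can treat the zone around the node as one "corpus" and everything outside it as the other. The exact way the zone and the rest of the corpus are counted is my assumption for illustration; the function name, parameters, and counts below are all hypothetical.

```python
import math

def collocation_log_ratio(count_near: int, window_tokens: int,
                          count_total: int, corpus_tokens: int) -> float:
    """Log Ratio of a collocate: frequency near the node vs. elsewhere."""
    count_elsewhere = count_total - count_near
    tokens_elsewhere = corpus_tokens - window_tokens
    if window_tokens <= 0 or tokens_elsewhere <= 0:
        raise ValueError("both zones must contain at least one token")
    if count_near <= 0 or count_elsewhere <= 0:
        # Assumption: zero counts are not handled in this sketch.
        raise ValueError("collocate must occur both near the node and elsewhere")
    rf_near = count_near / window_tokens
    rf_elsewhere = count_elsewhere / tokens_elsewhere
    return math.log2(rf_near / rf_elsewhere)

# Hypothetical figures: 10,000 tokens fall inside windows around the node,
# in a corpus of 1,000,000 tokens; the collocate occurs 40 times near the
# node and 1,030 times overall (so 990 times elsewhere).
collocate_score = collocation_log_ratio(40, 10_000, 1_030, 1_000_000)
print(round(collocate_score, 3))
```

Here the collocate is four times more frequent near the node than elsewhere, so its score is two points.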
Another advantage of Log Ratio is that it can be used for lockwords as well as keywords, which log-likelihood can’t. A Log Ratio of zero or nearly zero indicates a word that is “locked” between Corpus A and Corpus B. In consequence the new version of CQPweb allows you to look at lockwords – to my knowledge, the first general corpus tool that makes this possible.
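A lockword search can be sketched as a filter for scores close to zero. How close counts as "nearly zero" is not specified here, so the `tolerance` of 0.1 below is an arbitrary assumption, as are the function name and the toy frequency lists.

```python
import math

def find_lockwords(counts_a: dict, size_a: int,
                   counts_b: dict, size_b: int,
                   tolerance: float = 0.1) -> dict:
    """Return words whose Log Ratio lies within +/- tolerance of zero."""
    lockwords = {}
    # Only words attested in both corpora can have a defined ratio.
    for word in counts_a.keys() & counts_b.keys():
        count_a, count_b = counts_a[word], counts_b[word]
        if count_a <= 0 or count_b <= 0:
            continue
        score = math.log2((count_a / size_a) / (count_b / size_b))
        if abs(score) <= tolerance:
            lockwords[word] = score
    return lockwords

# Hypothetical frequency lists for a 1M-token and a 2M-token corpus.
counts_a = {"the": 60_000, "of": 30_000, "corpus": 800}
counts_b = {"the": 121_000, "of": 59_000, "corpus": 100}

locked = find_lockwords(counts_a, 1_000_000, counts_b, 2_000_000)
print(sorted(locked))
```

In this toy data, "the" and "of" have almost identical relative frequencies in both corpora and so come out as lockwords, while "corpus" (16 times more frequent in A) does not.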
A more formal discussion of Log Ratio will be at the core of my presentation at the ICAME conference later this week. A journal article will follow in due course.