Corpus Linguistics, and why you might want to use it, despite what (you think) you know about it

As part of the Spatial Humanities project at Lancaster University, and in collaboration with the Centre for Corpus Approaches to Social Sciences, the central aim of my PhD research project is to investigate the potential of corpus linguistics to allow for the exploration of spatial patterns in large amounts of digitised historical texts. Since I come from a Sociology/Linguistics background, my personal aim since the start of my PhD journey has been to try and understand what historical scholarship practices look like, what kinds of questions historians are interested in (whether they are presently being asked or not), how historians may benefit from using corpus linguistics, and also what challenges historians might encounter when trying to take advantage of corpus linguistics’ affordances. I don’t think I can over-stress how helpful coming to the RSVP conferences has been in this respect, and how grateful I am to the welcoming and helpful community of scholars I have encountered there.

I have chosen to write this post as an introduction to corpus linguistics for several reasons. First, on many counts RSVP members have asked me to explain to them what corpus linguistics consists of; I hope this post begins to answer that question. Second, I have sometimes encountered a reluctance to consider computerised text analysis methods. This reluctance is understandable and should be taken seriously. There are indeed very real challenges to working with computers in the Humanities and it is worth considering them. Ultimately, I hope to help bring corpus linguistics to the attention of those scholars who may find it useful.

The Humanities unavoidably involve messy data and the messy, fluid, categories which we try to apply to them. Computers on the other hand are all about known quantities and a lack of ambiguity. So why use computers in the Humanities? Computers are bad at what humans are good at: understanding. But they are also good at what humans are bad at: performing large and accurate calculations at remarkable speed. Generating historical insight cannot be done at the press of a button, but computers can assist us in manipulating the large amounts of data which are relevant to the questions we care about.

The most fundamental tool in corpus linguistics – the area of linguistics devoted to developing tools and methods to facilitate the quantitative and qualitative analysis of large amounts of text – is the concordance: a method which has been in use for centuries. A concordance is simply a list of occurrences of a word (or expression) of interest, accompanied by some limited context (see figure 1). Drawing up such a concordance manually is a very lengthy process, which can occupy years of an individual’s life. In contrast, once the data has been prepared in certain ways, specialised computer-based corpus linguistic tools can draw up such a concordance within a few seconds, even for tens of thousands of lines of text drawn from a database containing millions of words. For just this simple feat, computers are invaluable for the historian. But why use corpus linguistics tools? After all, all historical digital collections come with interfaces which offer searchability through queries of some sort.

Figure 1: Search results for ‘police’ presented as a concordancefigure1-png

Read the rest from the RSVP website.