Encyclopaedia of Shakespeare’s Language Project: A methodological journey

Just before Christmas 2015, the AHRC announced that it was going to fund the £1 million Encyclopaedia of Shakespeare’s Language project. I actually had the idea for the project 20 years ago. The fact that it took so long has much to do with method.

The approach I envisaged for Shakespeare’s language is analogous to more recent developments in dictionaries of general English, and, specifically, the departure from the philological tradition that resulted in the Collins Cobuild Dictionary of the English Language, the first full corpus-based dictionary. Being corpus-based implies both a particular methodology for revealing meanings, and a particular theoretical approach to meaning. There is less reliance on the vagaries and biases of editors, and a greater focus on the evidence of actual usage. The question ‘what does X mean?’ is pursued through another question: ‘how is X used?’

But I wanted more from the encyclopaedia than this. I wanted it to be comparative, to reveal not just the usage of words and other linguistic units in Shakespeare but also in the general language of the period. This way, we can tap into issues such as what is distinctive about Shakespeare’s language, and, more particularly, how Shakespeare’s language would have been perceived by his contemporary audience.

For example, the play Henry V contains Welsh, Irish and Scottish characters. A pilot examination I conducted with Alison Findlay (English and Creative Writing) of the words Welsh, Irish and Scottish used in over 100 million words written in Shakespeare’s time revealed that: (1) the Welsh barely registered on the Elizabethan consciousness, being considered a harmless in-group, only noteworthy for their curious language, (2) the Irish were wild, savage, rebels, viewed positively only in relation to Irish rugs (an important colonial import), and (3) the Scottish, whilst also rebels, were respected for their political power. (Current Shakespearean dictionaries do not contain entries for any of these three words.)

The problem 20 years ago was the lack of comparative data. Back in the early 1990s, the leading historical corpus of English was without doubt the Helsinki Corpus of English Texts, completed in 1991. This corpus amounted to 1.5 million words – an impressive figure in those days! Moreover, it had been put together with great care; it was reliable. But those 1.5 million words covered the period 730 to 1710. The section contemporaneous with Shakespeare amounted to less than half a million words, and was thus far short of what is required for serious comparative work.

To solve the problem, I set about, with Merja Kytö, creating the Corpus of English Dialogues. The reason for the focus on dialogues is that this would provide an interesting comparison for the dialogues of Shakespeare’s plays. This project soaked up 10 or more years, not just in creating the corpus but also in publishing the various insights it afforded into early modern dialogues along the way.

I was then overtaken – in a positive way! – by other events, notably, the advent of a fully searchable 1.2 billion word transcribed version of Early English Books Online (EEBO) (i.e. EEBO-TCP). For years, EEBO, which contains pretty much all early modern printed output, had been of limited value to linguists because the texts were only available as images, and language searches relied on OCR, with all its inaccuracies. Now, however, I have a 321 million word fully searchable corpus of texts written by Shakespeare’s contemporaries.

In addition, solutions, or at least partial solutions, had evolved for the various problems associated with the computational analysis of historical language data. Early modern spelling variation had been a major stumbling block (e.g. the word ‘would’ could be spelt would, wold, wolde, woolde, wuld, vvold, etc.). This problem has been largely solved by the Variant Detector (VARD), devised by scholars at Lancaster, especially Alistair Baron. The Lancaster-developed CLAWS part-of-speech annotation system, which works well for present-day English, has been adapted for Early Modern English (though more work will be necessary). Similarly, semantic annotation has received attention from generations of researchers at Lancaster University, and has been (and is being) adapted for Early Modern English, most recently within the AHRC-funded SAMUELS project, involving a consortium of universities, including Lancaster.
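
To make the spelling problem concrete, here is a minimal sketch of why variation matters for corpus searches. It is not how VARD works: it assumes a small hand-made lookup table (the hypothetical VARIANTS below, built from the spellings of ‘would’ listed above), and the function names and sample sentence are invented for illustration. A real normalisation tool has to cope with variants it has never seen and with spellings that are ambiguous between words.

```python
import re

# Hypothetical lookup table (an assumption for illustration only):
# variant spelling -> modern form. The variants are those listed in the text.
VARIANTS = {
    "wold": "would",
    "wolde": "would",
    "woolde": "would",
    "wuld": "would",
    "vvold": "would",
}

def normalise(text, variants=VARIANTS):
    """Replace each known variant spelling with its modern form.

    Replacements come out in lower case, so the capitalisation of a
    sentence-initial variant would be lost in this simple sketch.
    """
    def swap(match):
        word = match.group(0)
        return variants.get(word.lower(), word)
    return re.sub(r"[A-Za-z]+", swap, text)

def count_word(text, target):
    """Count exact (case-insensitive) occurrences of target in text."""
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    return tokens.count(target)

# Invented sample sentence, not taken from any corpus.
sample = "I wold go if he wolde come, and she would stay."

before = count_word(sample, "would")            # search on the raw text
after = count_word(normalise(sample), "would")  # search after normalisation
print(before, after)
```

A search of the raw text finds only the one modern spelling, whereas the same search after normalisation finds all three instances. Scaled up to hundreds of millions of words, that difference is what makes frequency counts and comparisons unreliable without some form of spelling regularisation.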

I don’t doubt that there will be many more twists and turns, lumps and bumps in the future methodological journey. But I am cheered by the fact that I will not be facing them alone but in the company of a wonderful group of people who are part of the project: Andrew Hardie and Tony McEnery (both LAEL), Paul Rayson (Computing and Communications), Alison Findlay (English & Creative Writing) and Dawn Archer (Manchester Metropolitan).

