Encyclopaedia of Shakespeare’s Language Project: A methodological journey

Just before Christmas 2015, the AHRC announced that it was going to fund the £1 million Encyclopaedia of Shakespeare’s Language project. I actually had the idea for the project 20 years ago. The fact that it took so long has much to do with method.

The approach I envisaged for Shakespeare’s language is analogous to more recent developments in dictionaries of general English, and, specifically, the departure from the philological tradition that resulted in the Collins Cobuild Dictionary of the English Language, the first full corpus-based dictionary. Being corpus-based implies both a particular methodology for revealing meanings, and a particular theoretical approach to meaning. There is less reliance on the vagaries and biases of editors, and a greater focus on the evidence of actual usage. The question ‘what does X mean?’ is pursued through another question: ‘how is X used?’

But I wanted more from the encyclopaedia than this. I wanted it to be comparative, to reveal not just the usage of words and other linguistic units in Shakespeare but also in the general language of the period. This way, we can tap into issues such as what is distinctive about Shakespeare’s language, and, more particularly, how Shakespeare’s language would have been perceived by his contemporary audience.

For example, the play Henry V contains Welsh, Irish and Scottish characters. A pilot examination I conducted with Alison Findlay (English and Creative Writing) of the words Welsh, Irish and Scottish used in over 100 million words written in Shakespeare’s time revealed that: (1) the Welsh barely registered on the Elizabethan consciousness, being considered a harmless in-group, only noteworthy for their curious language; (2) the Irish were wild, savage, rebels, viewed positively only in relation to Irish rugs (an important colonial import); and (3) the Scottish, whilst also rebels, were respected for their political power. (Current Shakespearean dictionaries do not contain entries for any of these three words.)

The problem 20 years ago was the lack of comparative data. Back in the early 1990s, the leading historical corpus of English was without doubt the Helsinki Corpus of English Texts, completed in 1991. This corpus amounted to 1.5 million words – an impressive figure in those days! Moreover, it had been put together with great care; it was reliable. But those 1.5 million words covered the period 730 to 1710. The section contemporaneous with Shakespeare amounted to less than half a million words, and was thus far short of what is required for serious comparative work.

To solve the problem, I set about, with Merja Kytö, creating the Corpus of English Dialogues. The reason for the focus on dialogues is that this would provide an interesting comparison for the dialogues of Shakespeare’s plays. This project soaked up 10 or more years, not just in creating the corpus but also in publishing the various insights it afforded into early modern dialogues along the way.

I was then overtaken – in a positive way! – by other events, notably, the advent of a fully searchable, 1.2-billion-word transcribed version of Early English Books Online (EEBO) (i.e. EEBO-TCP). For years, EEBO, which contains pretty much all early modern printed output, had been of limited value to linguists because the texts were only available as images, and language searches relied on OCR, with all its inaccuracies. Now, however, I have a 321-million-word, fully searchable corpus of texts written by Shakespeare’s contemporaries.

In addition, solutions, or at least partial solutions, had evolved for the various problems associated with the computational analysis of historical language data. Early modern spelling variation had been a major stumbling block (e.g. the word would could be spelt would, wold, wolde, woolde, wuld, vvold, etc.). This problem has been largely solved by the Variant Detector (VARD), devised by scholars at Lancaster, especially Alistair Baron. The Lancaster-developed CLAWS part-of-speech annotation system, which works well for present-day English, has been adapted for Early Modern English (though more work will be necessary). Similarly, semantic annotation has received attention from generations of researchers at Lancaster University, and has been (and is being) adapted for Early Modern English, most recently within the AHRC-funded SAMUELS project, involving a consortium of universities, including Lancaster.
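To make the spelling problem concrete, here is a minimal sketch of lookup-based normalisation. It is not how VARD itself works; it simply maps the variants of would listed above onto the modern form. The names VARIANTS and normalise are hypothetical, chosen only for illustration.

```python
import re

# Hypothetical lookup table: variant spelling -> modern form.
# The variants are the ones listed in the text; a real tool such as VARD
# relies on far richer resources than a hand-written dictionary.
VARIANTS = {
    "wold": "would",
    "wolde": "would",
    "woolde": "would",
    "wuld": "would",
    "vvold": "would",
}

TOKEN = re.compile(r"[A-Za-z]+")


def normalise(text):
    """Replace known variant spellings with their modern form, word by word."""
    def swap(match):
        word = match.group(0)
        # Unknown words are returned unchanged; the original capitalisation
        # of a replaced variant is not preserved in this sketch.
        return VARIANTS.get(word.lower(), word)
    return TOKEN.sub(swap, text)


print(normalise("I wolde that he vvold come"))  # prints: I would that he would come
```

Once variants are collapsed in this way, a single search for would retrieves every spelling, which is what makes frequency counts across a large historical corpus meaningful.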

I don’t doubt that there will be many more twists and turns, lumps and bumps in the future methodological journey. But I am cheered by the fact that I will not be facing them alone but in the company of a wonderful group of people who are part of the project: Andrew Hardie and Tony McEnery (both LAEL), Paul Rayson (Computing and Communications), Alison Findlay (English & Creative Writing) and Dawn Archer (Manchester Metropolitan).

For a brief project description, see: AHRC award to create a new Encyclopaedia of Shakespeare’s Language

Rude Britannia – what our politeness says about our nation

Britain is still a nation of polite people, and the fear that texts, tweets and Facebook are making people ruder is a myth, according to research from Lancaster University’s Faculty of Arts and Social Sciences (FASS). The British are famous for their reserve, their indirect way of saying things and their love of queuing. However, new research shows that what we find polite and what we find rude are unique to our culture and can be very different to notions of rudeness in other cultures.

The research, carried out by Professor Jonathan Culpeper, an expert in linguistic politeness, will be presented at an event as part of the Economic and Social Research Council’s annual Festival of Social Science, which runs from 2 to 9 November 2013.


Politeness and impoliteness in digital communication: Corpus-related explorations

Post-event review of the one-day workshop at Lancaster University

Topics don’t come much hotter than the forms of impoliteness or aggression that are associated with digital communication – flaming, trolling, cyberbullying, and so on. Yet academia has done surprisingly little to pull together experts in social interaction (especially (im)politeness) and experts in the new media, let alone experts in corpus-related work. That is, until last Friday, when the Corpus Approaches to Social Science Centre (@CorpusSocialSci) invited fifteen such people from diverse backgrounds (from law to psychology) to gather together for an intense one-day workshop.


The scope of the workshop was broad. One cannot very well study impoliteness without considering politeness, since merely failing to be polite in a particular context could be taken as impoliteness. Similarly, the range of digital communication types – email, blogs, texts, tweets and so on – presents a varied terrain to navigate. And then there are plenty of corpus-related approaches and notions, including collocation, keywords, word sketches, etc.

Andrew Kehoe (@ayjaykay), Ursula Lutzky (@UrsulaLutzky) and Matt Gee (@mattbgee) kicked off the day with a talk on swearwords and swearing, based on their 628-million-word Birmingham Blog Corpus. Amongst other things, they showed how internet swearword/profanity filters would work rather better if they incorporated notions like collocation. For example, knowing the words that typically accompany items like balls and tart can help disambiguate neutral usages (e.g. “tennis balls”, “lemon tart”) from less salubrious usages! (See more research from Andrew here, from Ursula here, and from Matt here.)
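As a rough illustration of the collocation idea, and not the presenters’ actual method, the sketch below flags a watched word only when no neutral collocate appears within a small window around it. The table NEUTRAL_COLLOCATES and the function flag_tokens are hypothetical names; the two collocates come from the examples above, whereas a real filter would derive them from corpus collocation statistics.

```python
import re

# Hypothetical table: watched word -> collocates that signal a neutral sense.
# Only the two examples from the text are included here.
NEUTRAL_COLLOCATES = {
    "balls": {"tennis"},
    "tart": {"lemon"},
}


def flag_tokens(text, window=2):
    """Return (index, word) pairs for watched words with no neutral collocate nearby."""
    tokens = re.findall(r"[a-z]+", text.lower())
    flagged = []
    for i, tok in enumerate(tokens):
        if tok not in NEUTRAL_COLLOCATES:
            continue
        # Context = up to `window` tokens on each side of the watched word.
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        if not NEUTRAL_COLLOCATES[tok] & set(context):
            flagged.append((i, tok))
    return flagged


print(flag_tokens("She bought a lemon tart"))  # prints: []
print(flag_tokens("What a load of balls"))     # prints: [(4, 'balls')]
```

A plain word-list filter would block both sentences; taking the neighbouring words into account lets the first one through.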

With Ruth Page’s (@ruthtweetpage) presentation came a switch from blogs to Twitter. Using corpus-related techniques, Ruth revealed the characteristics of corporate tweets. Given that the word sorry turns out to be the seventh most characteristic word, or keyword, for corporate tweets, it was not surprising that Ruth focused on apologies. She revealed that corporate tweets tend to avoid stating a problem or giving an explanation (thus avoiding damage to their reputation), but are accompanied by offers of repair and attempts to build – at least superficially – rapport. (See more research from Ruth here.)
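For readers unfamiliar with keywords in the corpus sense, a word is “key” when it is unusually frequent in one corpus relative to a reference corpus. One widely used keyness statistic is log-likelihood; the talk summary does not say which statistic Ruth used, so the sketch below is an assumption about method, and the counts in the example are invented purely for illustration.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """Two-corpus log-likelihood keyness score for a single word.

    freq_a, freq_b: occurrences of the word in corpus A and corpus B.
    size_a, size_b: total number of tokens in each corpus.
    """
    total = freq_a + freq_b
    # Expected frequencies if the word were spread evenly across both corpora.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    score = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            score += observed * math.log(observed / expected)
    return 2 * score


# Invented counts: a word occurring 50 times in 1,000 tokens of corpus A
# against 10 times in 1,000 tokens of a reference corpus B.
print(log_likelihood(50, 1000, 10, 1000))
```

Ranking every word in the corpus by this score, highest first, gives a keyword list of the kind in which sorry came seventh. A score above roughly 3.84 corresponds to p < 0.05 at one degree of freedom.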

Last of the morning was Caroline Tagg’s (@carotagg) presentation, and with this came another shift in medium, from Twitter to text messages. Focusing on convention and creativity, Caroline pointed out that, contrary to popular opinion, heavily abbreviated messages are not in fact the norm, and that when abbreviations do occur, they are often driven by communicative needs, e.g. using creativity to foster interest and engagement. Surveying the functions of texts, Caroline established that maintenance of friendship is key. And corpus-related techniques revealed the supporting evidence: politeness formulae were particularly frequent, including the salutation have a good one, the hedge a bit for the invitation, and for further contact, give us a bell. (See more research from Caroline here.)

With participants refuelled by lunch, Claire Hardaker (@DrClaireH) and I presented a smorgasbord of relevant issues. As an opening shot, we displayed frequencies showing that the stereotypical emblems of British politeness, words such as please, thank you, sorry, excuse me, can you X, tend not to be frequent in any digital media variety, relative to spoken conversation (as represented in the British National Corpus). Perhaps this accounts for why at least some sectors of the British public find digital media barren of politeness. This is not to say that politeness does not take place, but it seems to take place through different means – consider the list of politeness items derived by Caroline above. And there was an exception: sorry was the only item that occurred with greater frequency in some digital media. This, of course, nicely ties in with Ruth’s focus on apologies. The bulk of Claire’s and my presentation revolved around using corpus techniques to help establish: (1) definitions (e.g. what is trolling?), (2) strategies and formulae (e.g. what is the linguistic substance of trolling?) and (3) evaluations (e.g. what or who is considered rude?). Importantly, we showed that corpus-related approaches are not just lists of numbers, but can integrate qualitative analyses. (See more research from me here, and from Claire here.)

With encroaching presentation fatigue, the group decamped and went to a computer lab. Paul Rayson (@perayson) introduced some corpus tools, notably WMatrix, of which he is the architect. Amanda Potts (@watchedpotts) then put everybody through their paces – gently of course! – giving everybody the opportunity of valuable hands-on experience.

Back in our discussion room and refreshed by various caffeinated beverages, we spent an hour reflecting on a range of issues. The conversation moved towards corpora that include annotations (interpretative information). Such annotations could be a way of helping to analyse images, context, etc., creating an incredibly rich dataset that could only be interrogated by computer (see here, for instance). I noted that this end of corpus work was not far removed from using Atlas or Nudist. Snapchat came up in discussion, not only because it involves images (though they can include text), but also because it raises issues of data accessibility (how do you get hold of a record of this communication, if one of its essential features is that it dissolves within a narrow timeframe?). The thorny problem of ethics was discussed (e.g. data being used in ways that were not signalled when original user agreements were completed).

Though exhausting, it was a hugely rewarding and enjoyable day. Often those rewards came in the form of vibrant contributions from each and every participant. Darren Reed, for example, pointed out that sometimes what we were dealing with is neither digital text nor digital image, but a digital act. Retweeting somebody, for example, could be taken as a “tweet act” with politeness implications.

CASS Q&A: “Part suspended” versus “Partly suspended” on the London Underground

Last month, I received an interesting email about some terms that London commuters might be very familiar with:

We at London Underground currently operate the electronic service update board which indicates the real-time status of each of our lines. Most of customers are familiar and use it daily. We currently use the phrases – good service, severe delays and part suspended. I wonder how correct these phrases are; in particular should part be partly or partially?

After querying some corpora, I responded:

Traditional grammar would have it that a verbal form should not be modified by an adjective but an adverb. Thus, only “part suspended” is problematic. Modern grammarians would normally recommend that people follow customary practices. I checked the frequencies of “part suspended” and “partially suspended” in several large electronic databases. “Partially suspended” is always more frequent. In conclusion, “good service” and “severe delays” are fine, but go for “partially suspended” rather than “part suspended”.

Do you have a question about language in use? Ask us at cass@lancs.ac.uk and we’ll try to post the answer here.

The neglected west: first-order politeness in Britain

“Teaching and Learning (Im)politeness: An International (Im)politeness Conference” will be held at SOAS, University of London, 8-10 July. I will be giving a talk with Jim O’Driscoll (Huddersfield) on the topic below:

Almost without exception, it is scholars based in “Western” locations that have introduced the ideas with pretensions to universal application which are commonly regarded as major milestones in the field of politeness studies: face (Goffman); politeness principle (Leech); politeness as redress to face and positive & negative faces (B&L); first versus second order politeness and politic behaviour (Watts 1992ff); impoliteness (Culpeper 1996ff); discursive politeness (Eelen 2001, Watts 2003, Mills 2005ff). Typically, the role of scholars from non-western areas has been to present culture-specific evidence to challenge or tinker with these ideas. Likewise, a perusal of merely the table of contents of edited collections (e.g. Watts et al 1992, Bargiela-Chiappini & Haugh 2009, Bargiela-Chiappini & Kádár 2011) suggests that data from western environments needs no specific labelling as such, while contributions from elsewhere have to indicate geographical specificity in their titles.

This decidedly western discursive deictic centre (O’Driscoll 2009) has a distorting effect. For one thing, there is a tendency to believe that the politeness2 and face2 conceptualisations emanating from western locations are actually accounts of politeness1 and face1 in these western cultures, so that, for example, Goffman’s face (second-order) is American face (first-order) or that B&L’s politeness (second-order) reflects ‘English’ politeness (first-order) (cf. Matsumoto 1988; Ide 1989; Gu 1990; Mao 1994; Nwoye 1992; Wierzbicka 1991 [2003]). For another, it has resulted in a relative paucity of emic studies of core western cultures, leading in turn to an unwisely unexamined acceptance of certain stereotypes of these cultures.

This paper probes English people’s understandings of politeness. More specifically, it investigates their usage of the term polite. Deploying methodologies from corpus linguistics, we report results from the 500 million-word subsection of the Oxford English Corpus. These results fly in the face of the large number of studies which have found evidence that present-day English politeness – by which English English politeness is meant – is often characterised by off-record or negative politeness (e.g. Blum-Kulka 1989; Stewart 2005; Wierzbicka 2006; Ogiermann 2009). We refine these results further by looking at variation across the social categories of the British National Corpus.

Check back after the event for access to slides.