Coming this year: Corpora and Discourse Studies (Palgrave Advances in Language and Linguistics)

Three members of CASS have contributed chapters to a new volume in the Palgrave Advances in Language and Linguistics series. Corpora and Discourse Studies will be released later this year.

The growing availability of large collections of language texts has expanded our horizons for language analysis, enabling the swift analysis of millions of words of data, aided by computational methods. This edited collection of chapters contains examples of such contemporary research which uses corpus linguistics to carry out discourse analysis. The book takes an inclusive view of the meaning of discourse, covering different text-types or modes of language, including discourse as both social practice and as ideology or representation. Authors examine a range of spoken, written, multimodal and electronic corpora covering themes which include health, academic writing, social class, ethnicity, gender, television narrative, news, Early Modern English and political speech. The chapters showcase the variety of qualitative and quantitative tools and methods that this new generation of discourse analysts is combining, offering a set of compelling models for future corpus-based research in discourse.

Table of Contents:

  1. Introduction; Paul Baker and Tony McEnery
  2. E-Language: Communication in the Digital Age; Dawn Knight
  3. Beyond Monomodal Spoken Corpora: Using a Field Tracker to Analyse Participants’ Speech at the British Art Show; Svenja Adolphs, Dawn Knight and Ronald Carter
  4. Corpus-assisted Multimodal Discourse Analysis of Television and Film Narratives; Monika Bednarek
  5. Analysing Discourse Markers in Spoken Corpora: Actually as a Case Study; Karin Aijmer
  6. Discursive Constructions of the Environment in American Presidential Speeches 1960-2013: A Diachronic Corpus-assisted Study; Cinzia Bevitori
  7. Health Communication and Corpus Linguistics: Using Corpus Tools to Analyse Eating Disorder Discourse Online; Daniel Hunt and Kevin Harvey
  8. Multi-Dimensional Analysis of Academic Discourse; Jack A. Hardy
  9. Thinking About the News: Thought Presentation in Early Modern English News Writing; Brian Walker and Dan McIntyre
  10. The Use of Corpus Analysis in a Multi-perspectival Study of Creative Practice; Darryl Hocking
  11. Corpus-assisted Comparative Case Studies of Representations of the Arab World; Alan Partington
  12. Who Benefits When Discourse Gets Democratised? Analysing a Twitter Corpus Around the British Benefits Street Debate; Paul Baker and Tony McEnery
  13. Representations of Gender and Agency in the Harry Potter Series; Sally Hunt
  14. Filtering the Flood: Semantic Tagging as a Method of Identifying Salient Discourse Topics in a Large Corpus of Hurricane Katrina Reportage; Amanda Potts

Brainstorming the Future of Corpus Tools

Since arriving at the Centre for Corpus Approaches to Social Science (CASS), I’ve been thinking a lot about corpus tools. As I wrote in my blog entry of June 3, I have been working on various software programs to help corpus linguists process and analyse texts, including VariAnt, SarAnt, and TagAnt. Since then, I’ve also updated my mono-corpus analysis toolkit, AntConc, as well as my desktop and web-based parallel corpus tools, including AntPConc and the interfaces to the ENEJE and EXEMPRAES corpora. I’ve even started working with Paul Baker of Lancaster University on a completely new tool that provides detailed analyses of keywords.

In preparation for my plenary talk on corpus tools, given at the Teaching and Language Corpora (TaLC 11) conference held at Lancaster University, I interviewed many corpus linguists about their uses of corpus tools and their views on the future of corpus tools. I also interviewed people from other fields about their views on tools, including Jim Wild, the Vice President of the Royal Astronomical Society.

From my investigations, it was clear that corpus linguists rely on and very much appreciate the importance of tools in their work. But it also became clear that corpus linguists can sometimes find it difficult to see beyond the features of their preferred concordancer or word frequency generator and to look at language data in completely new and interesting ways. An analogy I often use (and one I detailed in my plenary talk at TaLC 11) is that of an astronomer. Corpus linguists can sometimes find that their telescopes are not powerful enough or sophisticated enough to delve into the depths of their research space. But, rather than attempting to build new telescopes that would reveal what they hope to see (an analogy to programming) or working with others to build such a telescope (an analogy to working with a software developer), corpus linguists simply turn their telescopes to other areas of the sky where their existing telescopes will continue to suffice.

To raise the awareness of corpus tools in the field and also generate new ideas for corpus tools that might be developed by individual programmers or within team projects, I proposed the first corpus tools brainstorming session at the 2014 American Association of Corpus Linguistics (AACL 2014) conference. Randi Reppen and the other organizers of the conference strongly supported the idea, and it finally became a reality on September 25, 2014, the first day of the conference.

At the session, over 30 people participated, filling the room. After I gave a brief overview of the history of corpus tools development, the participants thought about the ways in which they currently use corpora and the tools needed to do their work. The usual suspects—frequency lists (and frequency list comparisons), keyword-in-context concordances and plots, clusters and n-grams, collocates, and keywords—were all mentioned. In addition, the participants talked about how they are increasingly using statistics tools and also starting programming to find dispersion measures. A summary of the ways people use corpora is given below:

  • find word/phrase patterns (KWIC)
  • find word/phrase positions (plot)
  • find collocates
  • find n-grams/lexical bundles
  • find clusters
  • generate word lists
  • generate keyword lists
  • match patterns in text (via scripting)
  • generate statistics (e.g. using R)
  • measure dispersion of word/phrase patterns
  • compare words/synonyms
  • identify characteristics of texts
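Most of the items on this list reduce to a few simple operations over a token stream. As a minimal illustration (a sketch of the general technique, not any particular tool's implementation; the function names are my own), a frequency list and a keyword-in-context view can be written in plain Python:

```python
from collections import Counter
import re

def tokenize(text):
    """Lowercase word tokens; a crude stand-in for a real corpus tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def frequency_list(tokens, top=10):
    """Rank word types by raw frequency."""
    return Counter(tokens).most_common(top)

def kwic(tokens, node, span=3):
    """Keyword-in-context lines: `span` tokens either side of each hit."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == node:
            left = " ".join(tokens[max(0, i - span):i])
            right = " ".join(tokens[i + 1:i + 1 + span])
            lines.append(f"{left} [{node}] {right}")
    return lines

text = "The cat sat on the mat. The dog sat on the cat."
toks = tokenize(text)
print(frequency_list(toks, 3))
for line in kwic(toks, "sat"):
    print(line)
```

Real concordancers add sorting, regular-expression search, and corpus-scale indexing on top of exactly this kind of core.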

Next, the participants formed groups, and began brainstorming ideas for new tools that they would like to see developed. Each group came up with many ideas, and explained these to the session as a whole. The ideas are summarised below:

  • compute distances between subsequent occurrences of search patterns (e.g. words, lemmas, POS)
  • quantify the degree of variability around search patterns
  • generate counts per text (in addition to corpus)
  • extract definitions
  • find patterns of range and frequency
  • work with private data but allow for powerful handling of annotation (e.g. comparing frequencies of sub-corpora)
  • carry out extensive move analysis over large texts
  • search corpora by semantic class
  • process audio data
  • carry out phonological analysis (e.g. neighbor density)
  • use tools to build a corpus (e.g. finding texts, annotating texts, converting non-ASCII characters to ASCII)
  • create new visualizations of data (e.g. a roman candle of words that ‘explode’ out of a text)
  • identify the encoding of corpus texts
  • compare two corpora along many dimensions
  • identify changes in language over time
  • disambiguate word senses
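The first two ideas on this list (inter-occurrence distances and variability around a search pattern) can be sketched directly. The code below is my own illustration, using token-position gaps and their standard deviation as a crude dispersion signal; it is not a proposal for how any finished tool should define these measures:

```python
import statistics

def occurrence_gaps(tokens, node):
    """Distances (in tokens) between successive occurrences of `node`."""
    positions = [i for i, t in enumerate(tokens) if t == node]
    return [b - a for a, b in zip(positions, positions[1:])]

def gap_variability(tokens, node):
    """Mean gap and standard deviation: evenly dispersed words have low
    deviation, 'bursty' words high deviation."""
    gaps = occurrence_gaps(tokens, node)
    if len(gaps) < 2:
        return None
    return statistics.mean(gaps), statistics.stdev(gaps)

toks = "a b x a c a d e f a".split()
print(occurrence_gaps(toks, "a"))
print(gap_variability(toks, "a"))
```

The same skeleton extends to lemmas or POS tags by matching on an annotated token tuple rather than the raw word form.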

From the list, it is clear that the field is moving towards more sophisticated analyses of data. People are also thinking of new and interesting ways to analyse corpora. But, perhaps the list also reveals a tendency for corpus linguists to think more in terms of what they can do rather than what they should do, an observation made by Douglas Biber, who also attended the session. As Jim Wild said when I interviewed him in July, “Research should be led by the science not the tool.” In corpus linguistics, clearly we should not be trapped into a particular research topic because of the limitations of the tools available to us. We should always strive to answer the questions that need to be answered. If the current tools cannot help us answer those questions, we may need to work with a software developer or perhaps even start learning to program ourselves so that new tools will emerge to help us tackle these difficult questions.

I am very happy that I was able to organize the corpus tools brainstorming session at AACL 2014, and I would like to thank all the participants for coming and sharing their ideas. I will continue thinking about corpus tools and working to make some of the ideas suggested at the session become a reality.

The complete slides for the AACL 2014 corpus tools brainstorming session can be found here. My personal website is here.

Call for Participation: ESRC Summer School in Corpus Approaches to Social Science

The ESRC Summer School in Corpus Approaches to Social Sciences was inaugurated in 2013; the 2014 event is the second in the series. It will take place 15th to 18th July 2014, at Lancaster University, UK.

This free-to-attend summer school takes place under the aegis of CASS, an ESRC research centre bringing a new method in the study of language – the corpus approach – to a range of social sciences. CASS is investigating the use and manipulation of language in society in a host of areas of pressing concern, including climate change, hate crime and education.

Who can attend?

A crucial part of the CASS remit is to provide researchers across the social sciences with the skills needed to apply the tools and techniques of corpus linguistics to the research questions that matter in their own discipline. This event is aimed at junior social scientists – especially PhD students and postdoctoral researchers – in any of the social science disciplines. Anyone with an interest in the analysis of social issues via text and discourse – especially on a large scale – will find this summer school of interest.


The programme consists of a series of intensive two-hour sessions, some involving practical work, others more discussion-oriented.

Topics include: Introduction to corpus linguistics; Corpus tools and techniques; Collecting corpus data; Foundational techniques for social science data – keywords and collocation; Understanding statistics for corpus analysis; Discourse analysis for the social sciences; Semantic annotation and key domains; Corpus-based approaches to metaphor in discourse; Pragmatics, politeness and impoliteness in the corpus.

Speakers include Tony McEnery, Paul Baker, Jonathan Culpeper, and Elena Semino.

The CASS Summer School is one of the three co-located Lancaster Summer Schools in Interdisciplinary Digital Methods; see the website for further information.

How to apply

The CASS Summer School is free to attend, but registration in advance is compulsory, as places are limited.

The deadline for registrations is Sunday 8th June 2014.

The application form is available on the event website as is further information on the programme.


Dispatch from YLMP2014


I recently had the pleasure of travelling to Poland to attend the Young Linguists’ Meeting in Poznań (YLMP), a congress for young linguists who are interested in interdisciplinary research and stepping beyond the realm of traditional linguistic study. Hosted over three days by the Faculty of English at Adam Mickiewicz University, the congress featured over 100 talks by linguists young and old, including plenary lectures by Lancaster’s very own Paul Baker and Jane Sunderland. I was one of three Lancaster students to attend the congress, along with undergraduate Agnes Szafranski and fellow MA student Charis Yang Zhang.

What struck me about the congress, aside from the warm hospitality of the organisers, was the sheer breadth of topics that were covered over the weekend. All of the presenters were more than qualified to describe their work as linguistics, but perhaps for the first time I saw within just how many domains such a discipline can be applied. At least four sessions ran in parallel at any given time, and themes ranged from gender and sexuality to EFL and even psycholinguistics. There were optional workshops as well as six plenary talks. On the second day of the conference, as part of the language and society stream, I presented a corpus-assisted critical discourse analysis of the UK national press reporting of the immediate aftermath of the May 2013 murder of soldier Lee Rigby. I was happy to have a lively and engaged audience who had some really interesting questions for me at the end, and I enjoyed the conversations that followed this at the reception in the evening!

What was most encouraging about the congress was the drive and enthusiasm shared by all of the ‘young linguists’ in attendance. I now feel part of a generation of young minds who are hungry to improve not only our own work but hopefully, in time, the field(s) of linguistics as a whole. After my fantastic experience at the Boya Forum at Beijing Foreign Studies University last autumn, I was happy to spend time again celebrating the work of undergraduate and postgraduate students, and early-career linguists. There was a willingness to listen, to share ideas, and to (constructively) criticise where appropriate, and as a result I left Poznań feeling very optimistic about the future of linguistic study. I look forward to returning to the next edition of YLMP, because from what I saw at this one, there is a new generation of linguists eager to push the investigation of language to the next level.

New CASS: Briefing now available — Opposing gay rights in UK Parliament: Then and now

Opposing gay rights in UK Parliament: Then and now. How has the expression of opposition to gay rights changed in Parliamentary speeches in recent years? How are discussions of gay people involved in these changes? To what extent could these arguments be seen as homophobic? Read this CASS: Briefing of a diachronic corpus-based discourse analysis to find out more.

New resources are being added regularly to the new CASS: Briefings tab above, so check back soon.

Introducing CASS 1+3 Research Student: Robbie Love

In 2013, the ESRC Centre for Corpus Approaches to Social Science was pleased to award its inaugural 1+3 (Masters to PhD) studentship to Robbie Love. Read a bit about the first year of his postgraduate experience, in Robbie’s own words below.

I am a Research Student at CASS in the first year of a 1+3 PhD studentship. My main role is to investigate methodological issues in the collection of spoken corpora, but I also have interests in corpus-assisted critical discourse analysis.

I grew up in the north east of England in Blyth, Northumberland and Forest Hall in the outskirts of Newcastle. At school I found equal enjoyment in studying both English language and mathematics, but when deciding what to take at university I couldn’t think of something that would satisfy both, so I went with language.

I moved to Lancaster in 2010 to study my BA in English Language, which I soon converted to Linguistics. It was only in my third year that I was introduced to corpus linguistics, and became fascinated with its potential for revealing things about the way we communicate which I would never have predicted. I also liked its combination of quantitative and qualitative analysis, so it seemed like the perfect way to reengage with my enjoyment of maths. I had always been open to the idea of postgraduate study so when the opportunity came up to join CASS under the supervision of Tony McEnery it felt like the best thing for me to do.

Since joining CASS in the summer last year I have worked on several interesting projects including the changing language of gay rights opposition in Parliamentary debates (with Paul Baker), comments on online newspaper articles (with Amanda Potts), and the representation of Muslim people and Islam in the press reaction to the 2013 Woolwich incident (with Tony McEnery). I will be presenting findings on the Woolwich project at the upcoming Young Linguists’ Meeting in Poznań.

When I’m not playing with words on a computer, I am usually found rehearsing for a play or musical, playing my keyboard or eating any and all varieties of hummus.

Visit our People page for a full list of the centre’s investigators, researchers, and students.

Using Corpora to Analyze Gender

I wrote UCAG during a sabbatical as a semi-sequel to a book I published in 2006 called Using Corpora for Discourse Analysis. Part of the reason for the second book was to update and expand some of my thinking around discourse- or social-related corpus linguistics. As time has passed, I haven’t become disenamoured of corpus methods, but I have become more reflective and critical of them, and I wanted to use the book to highlight what they can and can’t do, and how researchers need to guard against using tools which might send them down a particular analytical path with a set of pre-ordained answers. Part of this has involved reflecting on how interpretations and explanations of corpus findings often need to come from outside the texts themselves (one of the tenets of critical discourse analysis), and subsequently whether a corpus approach requires analysts to go further and critically evaluate their findings in terms of “who benefits”.

Another way in which my thinking around corpus linguistics has developed since 2006 is in considering the advantages of methodological triangulation (or approaching a research project in multiple ways). In one analysis chapter I take three small corpora of adverts from Craigslist and try out three methods of attempting to uncover something interesting about gender from them – one very broad involving an automated tagging of every word, one semi-automatic relying on a focus on a smaller set of words, and another much more qualitative, relying on looking at concordance lines only. In another chapter I look at “difficult” search terms – comparing two methods of finding all the cases where a lecturer indicates that a student has given an incorrect answer in a corpus of academic-related speech. Would it be better to just read the whole corpus from start to finish, or is it possible to devise search terms so concordancing would elicit pretty much the same set?

The book also gave me a chance to revisit older data, particularly a set of newspaper articles about gay people from the Daily Mail which I had first looked at in Public Discourses of Gay Men (2005). As a replication experiment I revisited that data and redid an analysis I had first carried out about 10 years ago. While the idea of an objective researcher is fictional, corpus methods have aimed to redress the issue of researcher bias to an extent – although in retreading my steps, I did not obtain exactly the same results. Fortunately, the overall outcome was the same, but there were a few important points that the 10 years younger version of me missed. Does that matter? I suspect it doesn’t invalidate the analysis although it is a useful reminder about how our own analytical abilities alter over time.

Part of the reason for writing the book was to address other researchers who are either from corpus linguistics and want to look at gender, or who do research in gender and want to use corpus methods. I sometimes feel that these two groups of people do not talk to each other very much and as a result the corpus research in this area is often based around the “gender differences” paradigm where the focus is on how men and women apparently differ from each other in language use (with attendant metaphors about Mars and Venus). Chapters 2 and to an extent 3, address this by trying a number of experiments to see just how much lexical variation there is in sets of spoken corpora of male and female language – and when difference is found, how can it be explained? I also warn against lumping all men together into a box to compare them with all women who are put in a second box. The variation within the boxes can actually be the more interesting story to tell and this is where corpus tools around dispersion can really come into their own. So even if, for example, men do swear more than women, it’s not all men and not all the time. On the other hand, some differences which are more consistent and widespread can be incredibly revealing, although not in ways you might think – chapter 2 took me down an analytical path that ended up at the word Christmas – not perhaps an especially interesting word relating to gender, but it produced a lovely punchline to the chapter.

It was also good to introduce different corpora, tools and techniques that weren’t available in 2006. Mark Davies has an amazing set of online corpora, mostly based around American English, and I took the opportunity to use the COHA (Corpus of Historical American English) to track changes in language which reflects male bias over time, from the start of the 19th century to the present day. Another chapter utilises Adam Kilgariff’s online tool Sketch Engine which allows collocates to be calculated in terms of their grammatical relationships to one another. This allowed for a comparison of the terms boy and girl which allowed me to consider verbs that positioned either as subject or object. So girls are more likely to be impressed while boys are more likely to be outperformed. On the other hand boys cry whereas girls scream.

It would be great if the book inspired other researchers to consider the potential of using corpora in discourse/social related subjects as well as showing how this potential has expanded in recent years. It’s been fun to explore a relatively unexplored field (or rather travel a route between two connecting fields) but it occasionally gets lonely. I hope to encounter a few more people heading in the same direction as me in the coming years.

CASS awarded £200,000 from landmark ESRC Urgency Grant Scheme

CASS is delighted to announce a successful ESRC application for funding on a project entitled “Twitter rape threats and the discourse of online misogyny” (ES/L008874/1). The award of £191,245.25 was one of the first (possibly even the first) to be made as part of the ESRC’s new Urgency Grants scheme. Under this scheme, applications are assessed very quickly, and projects also start within four weeks of a successful award. This particular project will begin in November and run for fourteen months. It will be part of the CASS Centre, and the team will be comprised of Claire Hardaker (PI), Tony McEnery (CI), Paul Baker (CI), Andrew Hardie (CI), Paul Iganski (CI), and two CASS-hosted research assistants.

This project will investigate the rape and death threats sent on Twitter in July and August 2013 to a number of high profile individuals, including MP Stella Creasy and journalist Caroline Criado-Perez. This project seeks to address the remarkable lack of research into such behaviour, especially in light of the fact that policymakers and legislators are under intense pressure to make quick, long-term decisions on relevant policy and procedure to allow enforcement agencies to act on this issue. Specifically, the project will investigate what the language used by those who send rape/death threats on Twitter reveals about…

  1. their concerns, interests, and ideologies; what concept do they seem to have of themselves and their role in society?
  2. their motivations and goals; what seems to trigger them? What do they seem to be seeking?
  3. the links between them and other individuals, topics, and behaviours; do they only produce misogynistic threats or do they engage in other hate-speech? Do they act alone or within networks?

The project will take a corpus approach, incorporating several innovative aspects, and it will produce results that should be relevant to several social sciences including sociology, criminology, politics, psychology, and law. It will also offer timely insight into an area where policy, practice, legislation, and enforcement is currently under intense scrutiny and requires such research to help shape future developments. As such, the results will likely be of interest to legislators, policymakers, investigative bodies, and law enforcement agencies, as well as the study participants, media, and general public.

CASS affiliated papers to be given at the upcoming 5th International Language in the Media Conference

In two weeks, several scholars affiliated with the Centre will be heading south to attend the 5th International Language in the Media Conference, taking place this year at Queen Mary, University of London. We are particularly excited about the theme — “Redefining journalism: Participation, practice, change” — as well as the conference’s continued prioritization of papers on “language and class, dis/ability, race/ethnicity, gender/sexuality and age; political discourse, commerce and global capitalism” (among other important themes). As a taster for those of you who will be joining us in London and an overview for those who are unfortunately unable to make it this year, abstracts of the CASS affiliated papers to be given at the conference are reproduced below.

“I hate that tranny look”: a corpus-based analysis of the representation of trans people in the national UK press

Paul Baker

In early 2013, two high-profile incidents involving press representation of trans people resulted in claims that the British press were transphobic. For example, Jane Fae wrote in The Independent that ‘the trans community… is now a stand-in for various minorities… and a useful whipping girl for the national press… trans stories are only of interest when trans folk star as villains’ (1/13/13). This paper examines Fae’s claims by using methods from corpus linguistics in order to identify the most frequent and salient representations of trans people in the national UK press. Corpus approaches use computational tools as an aid in human research, offering a good balance between quantitative and qualitative analyses. My analysis is based upon previous corpus-based research where I have examined the construction of gay people, refugees and asylum seekers, and Muslims in similar contexts.

Using a 660,000 word corpus of news articles about trans people published in 2012, I employ concordancing techniques to examine collocates and discourse prosodies of terms like transgender, transsexual and tranny, in order to identify repetitive patterns of representation that occur across newspapers. I compare such patterns to sets of guidelines on language use by groups like The Beaumont Society, and discuss how certain representations can be enabled by the Press Complaints Commission’s Code of Practice. While the analysis found that there are very different patterns of representation around the three labels under investigation, all of them showed a general preference for negative representations, with occasional glimpses of more positive journalism.

“I think we’d rather be called survivors”: A corpus-based critical discourse analysis of the semantic preferences of referential strategies in Hurricane Katrina news articles as indicators of ideology

Amanda Potts

In times of great crisis, people often rely upon the discourse of powerful institutions to help frame experiences and reinforce established ideologies (van Dijk 1985). Selection of referential strategies in such discourses can reveal much about our society; for instance, some words have the power to comfort addressees but further oppress the referents. Taking a corpus-based critical discourse analytical approach, in this paper I explore the discursive cues of underlying ideology (of both the publications and perhaps the assumed audience) with special attention on journalists’ referential and predicational strategies (Reisigl and Wodak 2000). Analysis is based on a custom-compiled 36.7-million-word corpus of American news print articles concerning Hurricane Katrina.

A variety of forms of reference have been identified in the corpus using part-of-speech tagged word lists. Collocates of each form of reference have been calculated and automatically assigned a semantic tag by the UCREL USAS tagger (Archer et al. 2002). Semantic categories represented by the highest proportion of collocates overall have been identified as the most salient indicators of ideology.

The semantic preferences of the referential strategies are found to be quite distinct. For instance, resident prefers the M: Movement semantic category, whereas collocates of evacuee tend to fall under N: Numbers. This may prime readers to interpret Gulf residents and evacuees as large, threatening, ‘invading’ masses (often in conjunction with negative water metaphors such as flood). The highest collocate semantic category for victim, displaced, and survivor is S: Social actions, states and processes, indicating that the [social] experiences of these referents—such as being helped or stranded, or linked to social identities such as wife—are foregrounded rather than their numbers or movement.

Finally, the plummeting frequency of refugee following a unique debate in the media over the word’s meaning and even its semantic preference will also be discussed as an illustrative example of how unconscious language patterns can sometimes come to the fore in contested usage and influence the journalistic lexicon. Following from this, a more considered use of referential strategies is recommended, particularly in the media, where this could encourage heightened compassion for, and understanding of, those gravely affected by catastrophic events.

Journalism through the Guardian’s goggles

Anna Marchi

‘Journalism is an intensely reflexive occupation, which constantly talks to and about itself’ (Aldridge and Evetts 2003: 560). Journalists create interpretative communities (Zelizer 2004) through the discourses they circulate about their profession, the meaning and role of journalism are constituted through daily performance (Matheson 2003) and can be studied by means of the self-reflexive traces in texts. That is, they can be detected and studied in a newspaper corpus.

This paper proposes a corpus-assisted discourse analysis (Partington 2009) of the ways journalists represent their trade in their own news-work. The focus of the research is on one newspaper in particular: the Guardian. Previous research (Marchi and Taylor 2009) suggested that among British broadsheets the Guardian is by far the most interested in other media, as well as the most inclined to talk about itself. Using newspaper data from 2005, a particularly relevant year in the newspaper’s biography (it changed format from traditional broadsheet to berliner) and one rich in self-reflexivity, I examine the discursive behaviour of media-related lexical items in the corpus (such as journalist, reporter, hack, media, newspaper, press, tabloid), exploring the ways in which the Guardian conceptualises the role of the news media, how it represents professional values and the divide between good and bad journalism, and, ultimately, how it constructs its own identity. The study relies on the typical tools of corpus linguistics research – collocation analysis, keywords analysis, concordance analysis – and aims at a comprehensive description of the data, following the principle of total accountability (McEnery and Hardie 2012: 17), while keeping track of the broader extralinguistic context. From a methodological point of view, this work encourages interdisciplinary contamination and a serendipitous approach to the data, and aims to offer an example of how corpus-based research can contribute to the academic investigation of journalism across disciplines.

Visit the conference website for more details, including a list of plenary speakers.

Visiting With The Brown Family

In 2011 I gave a plenary talk on how American English is changing over time (contrasting it with British English), using the Brown Family of corpora. Each member of the Brown family consists of a corpus of 1 million words of written, published, standard English, divided into 500 files of about 2,000 words each. Fifteen genres of writing are represented – a framework created decades ago when the original Brown corpus, which has the distinction of being the first publicly available corpus ever built, was compiled by Henry Kučera and W. Nelson Francis at Brown University. Containing only American texts published in 1961, it originally went by the name of A Standard Corpus of Present-Day Edited American English for use with Digital Computers but later became known as just the Brown Corpus. It was followed by an equivalent British version, with later members representing English from the 1990s, the 2000s and the 1930s. A 1901 British version is in the pipeline.

Before I gave my talk, however, Mark Davies gave a brilliant presentation on the COHA (Corpus of Historical American English) which has 400 million words and covers the period from 1800 to the present day. It was the proverbial hard act to follow. Compared to the COHA, the Brown family are tiny, and the coverage occurs across 30 or 15 year snapshots, rather than representing every year. If we identify, say, that the word Mr is less frequent in 2006 than in 1991 then it is tempting to say that Mr is becoming less frequent over time. But we don’t know for certain what corpora from all the years in between would tell us. Having multiple sampling points presents a more convincing picture, but judicious hedging must be applied.
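For a two-point comparison of this kind, a common keyness measure in corpus linguistics is the log-likelihood statistic (Rayson and Garside's formulation), which tests a word's observed frequencies in two corpora against the frequencies expected if the corpora did not differ. The sketch below uses invented counts for Mr, purely for illustration, in two hypothetical one-million-word Brown-family members:

```python
import math

def log_likelihood(freq1, size1, freq2, size2):
    """Log-likelihood keyness: 2 * sum of O * ln(O/E) over both corpora.
    Higher values mean a bigger frequency difference than chance predicts."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    for observed, expected in ((freq1, expected1), (freq2, expected2)):
        if observed > 0:
            ll += observed * math.log(observed / expected)
    return 2 * ll

# Invented counts for 'Mr' in two 1-million-word corpora (1991 vs 2006)
f1991, f2006 = 300, 200
print(f"LL = {log_likelihood(f1991, 1_000_000, f2006, 1_000_000):.2f}")
```

A value above the conventional 15.13 cut-off would be significant at p < 0.0001, but, as noted above, significance at two sampling points still says nothing about the years in between, so the hedging stands.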

Also, being small, many words in the Brown family have tiny frequencies so it’s very difficult to make any claims about them. And the sampling could be viewed as rather outdated – the sorts of texts that people accessed in the 1960s are not necessarily the same as they access now. There are no online texts in the Brown family (although to ease collection, both the 2006 members involved texts that were originally published in written form, then placed online). Nor is there any advertising text. Or song lyrics. Or horror fiction. Or erotica (although there is a section on Romantic Fiction which could be pushed in that direction). Finally, the fact that all the texts are of the published variety means that they tend to represent a somewhat standardised, conservative form of English. A lot of the innovation in English happens in much more informal contexts, especially where young people or people from different backgrounds mix together – inner-city playgrounds and internet forums being two good examples. By the time such innovation gets into written published standard English, it’s no longer innovative. So the Brown family can’t tell us about the cutting edge of language use – they’ll always be a few years out of fashion.

So what are the Brown family good for, if anything?
