About Paul Baker

CASS co-investigator Paul Baker is a Professor of Linguistics and English Language at Lancaster University. His research interests include corpus linguistics, language and gender/sexual identities, and critical discourse analysis.

A new addition to the Brown Family: BE21

Back in the 1960s, the Brown Corpus was the first corpus ever created: 1 million words of written standard American English from 15 registers, across 500 text samples, each around 2,000 words in size. Since then, there have been matched versions to cover the 1930s, the 1990s and the 2000s. Today’s reference corpora can be very large (enTenTen20 is 36.5 billion words), so 1 million words is not very big in corpus linguistics these days. However, because the members of the Brown family have used the same sampling frame, they can be usefully employed to carry out comparisons of UK and US English, as well as allowing us to track trends in language development over time, with the caveat that corpora of this size tend to be most effective when considering high-frequency phenomena.

In 2008 I built BE06, a British English version of the Brown corpus with texts from 2006, and I was also involved in building its American sibling, AmE06. For those corpora, we stipulated that texts could be taken from online sources, as long as they had also been published in paper format elsewhere. That way, the job of creating the corpus was made easier, but the texts would still be similar in form to those in the other corpora. The gap between the 1990s versions of the corpora and BE06 was 15 years, so 2021 is a good point to create new versions. I began collecting data for BE21 (British English from 2021) around the mid-point of 2021 and finished in early 2022.

As well as helping us to examine language change and variation, the corpora can also be used as reference corpora for projects which involve relatively small amounts of text. So if you have a corpus of recent British newspaper texts, for example, that is under a million words in size, then BE21 would be a reasonable option as a reference corpus.

As with BE06, for BE21 I collected texts from online sources. For most texts (around 80-90%), there are “on paper” equivalents somewhere. However, these days, some texts appear designed to exist only in online form, including things like online magazines or government documents. So I relaxed the stipulation a little to reflect how written language is increasingly migrating online rather than appearing on paper.

My memory of collecting the BE06 corpus was that it didn’t take that long. At the time I calculated that it took around 10 working days in total. For BE21, collecting the texts took about three times as long. That was a surprise: I’d assumed that with more people publishing online, compared to 15 years ago, there would be a wealth of texts to choose from and the task would be easier. However, compared to 15 years ago, many texts are now behind paywalls so cannot be freely accessed. This was especially the case for magazines: collecting texts for the Popular Lore and Skills, Trades and Hobbies categories was more difficult than expected.

For the fiction sections, back in 2006, many authors had set up their own websites, where they provided free samples of their latest novels. These days, this doesn’t seem to be as common, although Amazon does provide free samples of books via Kindle, so that was the route I took to collect the samples of fiction as well as the biographies.

Another complicating factor involved identifying and locating British authors, particularly when it came to the Academic Writing category. This was reasonably easy back in 2006, but today academic publishing is a more international activity, as well as a more team-based one, so I found that I was passing over many more possible articles than I remembered doing in 2006. Trying to locate journals that contained the word “British” was a bit of a red herring, as that was no guarantee that British academics would be publishing in them. To be certain I was sampling “British” English, I needed to make sure that everyone on the authorial team was from the UK, which meant quite a lot of Googling the names of academics and trying to get a sense of their background. I’ve erred on the side of caution, although this made the task of collecting the academic articles more difficult.

2021 will be remembered as a year when the main topic of conversation was COVID. If you obtain keywords from BE21, using BE06 as the reference corpus, the top 10 are COVID, pandemic, lockdown, I, vaccine, my, Brexit, care, people and coronavirus. The top one, COVID, appears in 114 out of the 500 text samples, a total of 446 times across the corpus. This was not because I actively sought out texts about COVID. It was just very hard to avoid them, particularly when I was collecting the non-fiction texts, because COVID had permeated almost every aspect of British society in 2021. Coincidentally, another corpus in the Brown family, B-LOB (Before LOB), also covers an international crisis. The year 1931 saw the Great Depression, which began in the United States, overwhelm Britain, with investors withdrawing their gold from London at the rate of £2.5 million a day. The UK went off the gold standard and, during the election of that year, the Labour party was virtually destroyed, leaving Labour’s Ramsay MacDonald as the Prime Minister for a National Government, an all-party coalition. This results in a few linguistic peculiarities in the B-LOB corpus, such as the high frequency of words like unemployment (unemployment in Great Britain increased by 129% between 1929 and 1932). Although such crises bring new lexical items and an increased focus on certain topics, the Great Depression does not make B-LOB ineffective as a reference corpus, just as COVID-19 does not render BE21 so idiosyncratic as to be useless.
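As a rough illustration of how a keyword score of this kind can be calculated, here is a minimal Python sketch using the log-likelihood statistic, one commonly used keyness measure. The post does not say which statistic or tool settings produced the list above, so the choice of measure is an assumption, as are the function name log_likelihood, the round corpus sizes of 1,000,000 words each and the count of zero for COVID in BE06. Only the figure of 446 occurrences in BE21 comes from the text.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood keyness of a word in a target corpus vs. a reference corpus.

    freq_* are raw counts of the word; size_* are corpus sizes in tokens.
    """
    total_freq = freq_target + freq_ref
    total_size = size_target + size_ref

    # Counts we would expect if the word were equally common in both corpora.
    expected_target = size_target * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size

    score = 0.0
    for observed, expected in ((freq_target, expected_target),
                              (freq_ref, expected_ref)):
        # A zero count contributes nothing (the limit of x * log(x) as x -> 0).
        if observed > 0:
            score += observed * math.log(observed / expected)
    return 2 * score


# 446 is the BE21 count reported in the post; the other three values are
# illustrative assumptions, not figures from the corpora themselves.
covid_score = log_likelihood(446, 1_000_000, 0, 1_000_000)
print(f"Log-likelihood for COVID: {covid_score:.1f}")
```

A word with the same relative frequency in both corpora scores 0, and the score grows as the observed counts move away from what the combined frequency would predict. Ranking every word in the target corpus by a score like this is one way of producing a keyword list.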

Today I had the pleasure of seeing students use BE21 for the first time, in one of my corpus linguistics seminars. We looked at the frequency of negation forms like “should not” and “shouldn’t” to determine the extent to which the latter form is gaining precedence. The “n’t” form has been increasing in frequency for the past century. It has not yet become the dominant form, but it is starting to get very close, indicating grammaticalization of “n’t” as a bound morpheme, linked to densification and colloquialisation of written English. Looking to the future, the next time I would expect to build another Brown Family corpus will be in 2036. I suspect that by then, the “n’t” form will have become more frequent than “not” and that increasingly, we will only see “not” occurring in archaic-sounding phrasing like “it mattered not”.
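The kind of comparison described above can be sketched in a few lines of Python. This is only an illustration of the counting logic, not the query the students ran in the seminar: the function name contraction_share and the sample sentence are made up for the example, and the regular expressions assume plain running text rather than a tagged corpus.

```python
import re


def contraction_share(text, modal="should"):
    """Count '<modal> not' and '<modal>n't' and return the contracted share.

    Returns (full_count, contracted_count, share), where share is the
    proportion of contracted forms, or None if neither form occurs.
    """
    stem = re.escape(modal)
    # Full form, e.g. "should not" (any whitespace between the two words).
    full = len(re.findall(rf"\b{stem}\s+not\b", text, flags=re.IGNORECASE))
    # Contracted form with a straight or curly apostrophe, e.g. "shouldn't".
    contracted = len(
        re.findall(rf"\b{stem}n['’]t\b", text, flags=re.IGNORECASE)
    )
    total = full + contracted
    share = contracted / total if total else None
    return full, contracted, share


# Invented sample text, purely for demonstration.
sample = "You should not go. She shouldn't stay. We shouldn’t wait."
print(contraction_share(sample))
```

Running the same function over each member of the Brown family in turn would give one share per time point, which is the trend being tracked. Note that this simple pattern only covers modals that contract regularly (should, would, could, must); pairs like “will not” and “won’t” or “cannot” and “can’t” would need their own patterns.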

The corpus is currently available through Lancaster University’s CQPweb, which is free to sign up for, and it will be coming to new versions of AntConc and #LancsBox soon. I have done some work examining changes in part-of-speech tag frequencies across the British Brown members, and a paper in the International Journal of Corpus Linguistics is forthcoming. I also gave a talk on the corpus and the results of that analysis, which can be found here. Compared to earlier corpora, BE21 has more first and second person pronouns, more use of -s and -ing forms of verbs, more genitives, and far fewer terms of address like Mr and Mrs, as well as fewer modal verbs, gradable adverbs, of-based noun phrases and male pronouns. The trends identified by Geoffrey Leech, like densification, colloquialisation and democratisation, are continuing. Another trend, Americanisation, also seems to be holding up, and when the AmE21 corpus is available (work is currently under way at Cardiff University to build it), we can start to further identify the ways that English is likely to shift in the future.

Anxiety support in an online forum

Anxiety is a growing, worldwide phenomenon. The World Health Organization estimates that there are 264 million people living with anxiety disorders, which are characterised by excessive fear and behavioural disturbances, and which include specific phobias, panic disorder, social anxiety disorder and generalized anxiety disorder. In this project, we investigate an online forum dedicated to providing anxiety support and hosted by Health Unlocked: the world’s largest social network for health (https://healthunlocked.com/). Like many online forums, Health Unlocked offers users a space to get the informational and emotional support they seek in relation to a range of health-related topics. By examining the contributions and interactions of the Anxiety Support forum, we set out to better understand the lived experiences of those with anxiety, including the coping strategies they adopt to mitigate the impact of anxiety disorders.

Our data comprises approximately 21 million words of text posted to the Anxiety Support forum between March 2012 and October 2020. We are using corpus-based methods to analyse this data with respect to the following areas:

Sketching Anxiety: Our analysis begins with a focus on the word anxiety, using the corpus analysis tool Sketch Engine to provide a detailed ‘Word Sketch’ of its use in the forum, e.g. looking at its occurrence in different grammatical patterns. In demonstrating how anxiety is discursively constructed, we aim to show how users perceive anxiety disorders and how they talk about strategies for coping with anxiety. We also compare anxiety to related terms such as depression, fear, panic and stress to investigate how users of the forum relate these aspects of their mental health and how they differentiate between these often co-occurring experiences.

The Lived Experience: Research has shown that the stories people tell about their illness experiences “restore a coherent self by providing a meaningful explanation for a being in the world burdened by illness” (Kleinman, 1988, p.48). We will investigate the narratives provided by contributors to the forum as a way of understanding how anxiety operates in the context of users’ lives and how the forum functions as a place for participants to share their stories.

Creating a Community: Online forums provide invaluable opportunities to engage with other people’s experiences in a way that facilitates relatability and empathy, ultimately fostering solidarity and a community that extends beyond geographical barriers. Our work will investigate the affordances of the online platform by looking at the ways that participants respond to each other’s posts and how users elicit informational and emotional support from others in the forum. Focusing more on interactional aspects of the forum, we consider how users reach consensus and deal with conflict, establishing the conventions for the nature and manner in which participants support each other.

Sex and gender: Diagnoses of anxiety disorders are more common among females than males (4.6% compared to 2.6% at the global level) (World Health Organization, 2017). However, researchers argue that the prevalence of anxiety among men is comparable to that among women and that normative gender ideologies affect how individuals talk about and seek help for experiences of anxiety (Gough et al. 2021). We will explore the forum in relation both to how posts made by female and male users compare in fulfilling particular kinds of support roles and to how participants refer to gender stereotypes that shape their experiences of anxiety, including how and where they find support.

Comparing cultures: The Anxiety Support forum includes contributions from participants around the globe, with 38.84% of posts made by people from the UK and 33.94% made by those from the USA. Our analysis will include a comparison of contributions from the US and the UK, highlighting cultural differences in the way that anxiety is understood, in addition to differences in spelling (e.g. favorite) and lexical choice (e.g. vacation). This investigation will help to highlight how the respective health services of these countries shape users’ experiences of anxiety and their interactions with support services.

Changing Times: Our corpus contains posts made over an 8-year period, offering us the opportunity to look at how language has changed over time, focussing on changes in how anxiety is conceptualised (e.g. as increasingly medicalised). Research has also shown that national and global events lead to increases in the prevalence of anxiety disorders. We can, for instance, examine the impact of Brexit on how users from the UK use the forum, or how participants discuss the impacts of the Covid-19 pandemic. The timespan of the data also enables us to investigate how posting behaviour ‘evolves’ over time. Because the forum is an online community, we can see how more established contributors demonstrate their expertise and negotiate the communicative practices of the forum with newer participants. The diachronic nature of the forum data will also help us to understand how we can support various stakeholders in living well with anxiety.

The project will run for 2 years and through our findings, we aim to demonstrate how important online spaces like the Anxiety Support forum are for individuals experiencing mental health issues, as well as to researchers who are interested in understanding lived experiences of health and illness.


Professor Paul Baker (Principal Investigator)
Dr Luke Collins (Senior Research Associate)


Kleinman, A. M. (1988). The Illness Narratives: Suffering, Healing, and the Human Condition. New York: Basic Books.

World Health Organization (2017). Depression and Other Common Mental Disorders: Global Health Estimates. Geneva: World Health Organization. Licence: CC BY-NC-SA.