A new addition to the Brown Family: BE21

Back in the 1960s, the Brown Corpus was the first machine-readable corpus of English – 1 million words of written standard American English from 15 registers, across 500 text samples of around 2,000 words each. Since then, matched versions have been created to cover the 1930s, the 1990s and the 2000s. Today’s reference corpora can be very large (enTenTen20 is 36.5 billion words), so 1 million words is not very big in corpus linguistics these days. However, because the members of the Brown family use the same sampling frame, they can be usefully employed to compare UK and US English, as well as to track trends in language change over time – with the caveat that corpora of this size are most effective for studying high-frequency phenomena.

In 2008 I built BE06, a British English version of the Brown Corpus containing texts from 2006, and I was also involved in building its American sibling, AmE06. For those corpora, we stipulated that texts could be collected online, as long as they had also been published in paper format elsewhere. That made the job of creating the corpus easier, while keeping the texts similar in form to those in the other corpora. The gap between the 1990s versions of the corpora and BE06 was 15 years, so 2021 was a good point to create new versions. I began collecting data for BE21 (British English from 2021) around the mid-point of 2021 and finished in early 2022.

As well as helping us to examine language change and variation, the corpora can also be used as reference corpora for projects which involve relatively small amounts of text. So if you have a corpus of recent British newspaper texts, for example, that is under a million words in size, then BE21 would be a reasonable option as a reference corpus.

As with BE06, for BE21 I collected texts from online sources. Most (around 80–90%) of the texts have “on paper” equivalents somewhere. These days, however, some texts are designed to exist only in online form – online magazines or government documents, for example. So I relaxed the stipulation a little to reflect how written language is increasingly migrating online rather than appearing on paper.

My memory of collecting the BE06 corpus was that it didn’t take that long – at the time I calculated around 10 working days in total. For BE21, collecting the texts took about three times longer. That was a surprise – I’d assumed that with more people publishing online than 15 years ago, there would be a wealth of texts to choose from and the task would be easier. However, many texts are now behind paywalls and cannot be freely accessed. This was especially the case for magazines – collecting texts for the Popular Lore and Skills, trades and hobbies categories was more difficult than expected.

For the fiction sections, back in 2006 many authors had set up their own websites where they provided free samples of their latest novels. These days this seems less common – although Amazon does provide free samples of books via Kindle – so that was the route I took to collect the samples of fiction, as well as the biographies.

Another complicating factor involved identifying and locating British authors, particularly for the Academic Writing category. This was reasonably easy back in 2006, but today academic publishing is a more international activity, as well as a team-based one, so I found myself passing over many more candidate articles than I remember doing in 2006. Looking for journals with the word “British” in their titles proved a red herring, as that was no guarantee that British academics were publishing in them. To be certain I was sampling British English, I needed to make sure that everyone on an author team was from the UK, which meant quite a lot of Googling academics’ names and trying to get a sense of their backgrounds. I erred on the side of caution, although this made the task of collecting the academic articles more difficult.

2021 will be remembered as a year when the main topic of conversation was COVID. If you extract keywords from BE21, using BE06 as the reference corpus, the top 10 are COVID, pandemic, lockdown, I, vaccine, my, Brexit, care, people and coronavirus. The top one, COVID, appears in 114 of the 500 text samples, a total of 446 times across the corpus. This was not because I actively sought out texts about COVID – it was simply very hard to avoid them, particularly when collecting the non-fiction texts; COVID had permeated almost every aspect of British society in 2021. Coincidentally, another corpus in the Brown family, B-LOB (Before LOB), also covers an international crisis. In 1931 the Great Depression, which had begun in the United States, overwhelmed Britain, with investors withdrawing their gold from London at the rate of £2.5 million a day. The UK went off the gold standard, and in that year’s election the Labour party was virtually destroyed, leaving Labour’s Ramsay MacDonald as Prime Minister of a National Government, an all-party coalition. This results in a few linguistic peculiarities in B-LOB – such as the high frequency of words like unemployment, which in Great Britain increased by 129% between 1929 and 1932. But the new lexical items and the increased focus on certain topics do not make B-LOB ineffective as a reference corpus, just as COVID-19 does not render BE21 so idiosyncratic as to be useless.
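
For readers who want to reproduce this kind of keyword list, here is a minimal sketch using Dunning’s log-likelihood, the keyness statistic most commonly used in corpus tools. The file names (be21.txt, be06.txt) and the crude tokeniser are illustrative assumptions rather than part of any released corpus, and real tools add frequency cut-offs and significance thresholds.

    import math
    import re
    from collections import Counter

    def tokenise(text):
        """Crude illustrative tokeniser: lowercased alphabetic strings,
        keeping internal apostrophes (so "shouldn't" stays one token)."""
        return re.findall(r"[a-z]+(?:['\u2019][a-z]+)?", text.lower())

    def log_likelihood(a, n1, b, n2):
        """Dunning's G2 for one word: a/n1 = study-corpus frequency/size,
        b/n2 = reference-corpus frequency/size."""
        e1 = n1 * (a + b) / (n1 + n2)
        e2 = n2 * (a + b) / (n1 + n2)
        g2 = 0.0
        if a:
            g2 += a * math.log(a / e1)
        if b:
            g2 += b * math.log(b / e2)
        return 2 * g2

    study = Counter(tokenise(open("be21.txt").read()))  # hypothetical path
    ref = Counter(tokenise(open("be06.txt").read()))    # hypothetical path
    n1, n2 = sum(study.values()), sum(ref.values())

    def keyness(w):
        return log_likelihood(study[w], n1, ref.get(w, 0), n2)

    # keep only positive keywords: relatively more frequent in the study corpus
    positive = [w for w in study if study[w] / n1 > ref.get(w, 0) / n2]
    for w in sorted(positive, key=keyness, reverse=True)[:10]:
        print(w, study[w], round(keyness(w), 1))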

Today I had the pleasure of seeing students use BE21 for the first time, in one of my corpus linguistics seminars. We looked at the frequency of negation forms like “should not” and “shouldn’t”, determining the extent to which the latter form is gaining ground. The “n’t” form has been increasing in frequency for the past century. It has not yet become the dominant form, but it is getting very close, indicating grammaticalisation of “n’t” as a bound morpheme, linked to the densification and colloquialisation of written English. Looking to the future, I would expect the next Brown family corpus to be built in 2036. I suspect that by then the “n’t” form will have overtaken “not”, and that increasingly we will only see “not” in archaic-sounding phrasing like “it mattered not”.
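
A rough sketch of the seminar exercise follows: it counts full and contracted negation after a set of auxiliaries whose contracted forms are regular (will/won’t and can/cannot are irregular, so they are excluded). The corpus path is an illustrative assumption, and a serious analysis would of course work on tokenised, tagged text rather than raw regular-expression matches.

    import re

    # Auxiliaries whose n't forms are regular; will -> won't and
    # can -> can't/cannot are irregular, so they are left out.
    AUXILIARIES = ["should", "would", "could", "must", "need", "do", "does",
                   "did", "is", "was", "were", "has", "have", "had"]

    text = open("be21.txt").read().lower()  # hypothetical path

    for aux in AUXILIARIES:
        full = len(re.findall(rf"\b{aux} not\b", text))
        contracted = len(re.findall(rf"\b{aux}n['\u2019]t\b", text))
        if full + contracted:
            share = contracted / (full + contracted)
            print(f"{aux:<8} not: {full:>4}  n't: {contracted:>4}  "
                  f"contracted share: {share:.0%}")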

The corpus is currently available through Lancaster University’s CQPweb, which is free to sign up for, and it will be coming to new versions of AntConc and #LancsBox soon. I have done some work examining changes in part-of-speech tag frequencies across the British members of the Brown family, and a paper in the International Journal of Corpus Linguistics is forthcoming. I also gave a talk on the corpus and the results of that analysis, which can be found here. Compared to the earlier corpora, BE21 has more first and second person pronouns, more -s and -ing verb forms and more genitives, but far fewer terms of address like Mr and Mrs, as well as fewer modal verbs, gradable adverbs, of-based noun phrases and male pronouns. The trends identified by Geoffrey Leech – densification, colloquialisation and democratisation – are continuing. Another trend, Americanisation, also seems to be holding up, and when the AmE21 corpus is available (work is currently under way at Cardiff University to build it), we can start to further identify the ways that English is likely to shift in the future.

Coming this year: Corpora and Discourse Studies (Palgrave Advances in Language and Linguistics)

Three members of CASS have contributed chapters to a new volume in the Palgrave Advances in Language and Linguistics series. Corpora and Discourse Studies will be released later this year.


The growing availability of large collections of language texts has expanded our horizons for language analysis, enabling the swift analysis of millions of words of data, aided by computational methods. This edited collection contains examples of contemporary research which uses corpus linguistics to carry out discourse analysis. The book takes an inclusive view of the meaning of discourse, covering different text-types and modes of language, and treating discourse as both social practice and as ideology or representation. Authors examine a range of spoken, written, multimodal and electronic corpora covering themes which include health, academic writing, social class, ethnicity, gender, television narrative, news, Early Modern English and political speech. The chapters showcase the variety of qualitative and quantitative tools and methods that this new generation of discourse analysts is combining, offering a set of compelling models for future corpus-based research in discourse.

Table of Contents:

  1. Introduction; Paul Baker and Tony McEnery
  2. E-Language: Communication in the Digital Age; Dawn Knight
  3. Beyond Monomodal Spoken Corpora: Using a Field Tracker to Analyse Participants’ Speech at the British Art Show; Svenja Adolphs, Dawn Knight and Ronald Carter
  4. Corpus-assisted Multimodal Discourse Analysis of Television and Film Narratives; Monika Bednarek
  5. Analysing Discourse Markers in Spoken Corpora: Actually as a Case Study; Karin Aijmer
  6. Discursive Constructions of the Environment in American Presidential Speeches 1960-2013: A Diachronic Corpus-assisted Study; Cinzia Bevitori
  7. Health Communication and Corpus Linguistics: Using Corpus Tools to Analyse Eating Disorder Discourse Online; Daniel Hunt and Kevin Harvey
  8. Multi-Dimensional Analysis of Academic Discourse; Jack A. Hardy
  9. Thinking About the News: Thought Presentation in Early Modern English News Writing; Brian Walker and Dan McIntyre
  10. The Use of Corpus Analysis in a Multi-perspectival Study of Creative Practice; Darryl Hocking
  11. Corpus-assisted Comparative Case Studies of Representations of the Arab World; Alan Partington
  12. Who Benefits When Discourse Gets Democratised? Analysing a Twitter Corpus Around the British Benefits Street Debate; Paul Baker and Tony McEnery
  13. Representations of Gender and Agency in the Harry Potter Series; Sally Hunt
  14. Filtering the Flood: Semantic Tagging as a Method of Identifying Salient Discourse Topics in a Large Corpus of Hurricane Katrina Reportage; Amanda Potts

Brainstorming the Future of Corpus Tools

Since arriving at the Centre for Corpus Approaches to Social Science (CASS), I’ve been thinking a lot about corpus tools. As I wrote in my blog entry of June 3, I have been working on various software programs to help corpus linguists process and analyse texts, including VariAnt, SarAnt, and TagAnt. Since then, I’ve also updated my mono-corpus analysis toolkit, AntConc, and updated my desktop and web-based parallel corpus tools, including AntPConc and the interfaces to the ENEJE and EXEMPRAES corpora. I’ve even started working with Paul Baker of Lancaster University on a completely new tool that provides detailed analyses of keywords.

In preparation for my plenary talk on corpus tools, given at the Teaching and Language Corpora (TaLC 11) conference held at Lancaster University, I interviewed many corpus linguists about their uses of corpus tools and their views on the future of corpus tools. I also interviewed people from other fields about their views on tools, including Jim Wild, the Vice President of the Royal Astronomical Society.

From my investigations, it was clear that corpus linguists rely on tools and very much appreciate their importance to their work. But it also became clear that corpus linguists can sometimes find it difficult to see beyond the features of their preferred concordancer or word frequency generator and to look at language data in completely new and interesting ways. An analogy I often use (and one I detailed in my plenary talk at TaLC 11) is that of an astronomer. Corpus linguists can sometimes find that their telescopes are not powerful or sophisticated enough to delve into the depths of their research space. But rather than attempting to build new telescopes that would reveal what they hope to see (an analogy for programming), or working with others to build such a telescope (an analogy for working with a software developer), corpus linguists simply turn their telescopes to other areas of the sky where their existing telescopes will continue to suffice.

To raise the awareness of corpus tools in the field and also generate new ideas for corpus tools that might be developed by individual programmers or within team projects, I proposed the first corpus tools brainstorming session at the 2014 American Association of Corpus Linguistics (AACL 2014) conference. Randi Reppen and the other organizers of the conference strongly supported the idea, and it finally became a reality on September 25, 2014, the first day of the conference.

At the session, over 30 people participated, filling the room. After I gave a brief overview of the history of corpus tools development, the participants thought about the ways in which they currently use corpora and the tools needed to do their work. The usual suspects—frequency lists (and frequency list comparisons), keyword-in-context concordances and plots, clusters and n-grams, collocates, and keywords—were all mentioned. In addition, the participants talked about how they are increasingly using statistics tools and also starting to program, for example to compute dispersion measures. A summary of the ways people use corpora is given below, followed by a minimal code sketch of the first and most familiar of these (the KWIC concordance):

  • find word/phrase patterns (KWIC)
  • find word/phrase positions (plot)
  • find collocates
  • find n-grams/lexical bundles
  • find clusters
  • generate word lists
  • generate keyword lists
  • match patterns in text (via scripting)
  • generate statistics (e.g. using R)
  • measure dispersion of word/phrase patterns
  • compare words/synonyms
  • identify characteristics of texts
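
To make the first and most familiar of these concrete, here is a minimal KWIC concordancer. It is a bare-bones sketch rather than any of the tools discussed at the session: it splits on whitespace (so punctuation stays attached to tokens), and the corpus path and search pattern are illustrative assumptions.

    import re

    def kwic(tokens, pattern, window=5):
        """Return concordance lines with `window` tokens of context
        either side of each token matching the regular expression."""
        lines = []
        for i, token in enumerate(tokens):
            if re.fullmatch(pattern, token, flags=re.IGNORECASE):
                left = " ".join(tokens[max(0, i - window):i])
                right = " ".join(tokens[i + 1:i + 1 + window])
                lines.append(f"{left:>40} | {token} | {right}")
        return lines

    tokens = open("corpus.txt").read().split()  # hypothetical path
    for line in kwic(tokens, r"telescopes?"):
        print(line)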

Next, the participants formed groups and began brainstorming ideas for new tools that they would like to see developed. Each group came up with many ideas and explained them to the session as a whole. The ideas are summarised below; the first of them (distances between successive occurrences of a search pattern) is sketched in code after the list:

  • compute distances between subsequent occurrences of search patterns (e.g. words, lemmas, POS)
  • quantify the degree of variability around search patterns
  • generate counts per text (in addition to corpus)
  • extract definitions
  • find patterns of range and frequency
  • work with private data but allow for powerful handling of annotation (e.g. comparing frequencies of sub-corpora)
  • carry out extensive move analysis over large texts
  • search corpora by semantic class
  • process audio data
  • carry out phonological analysis (e.g. neighbor density)
  • use tools to build a corpus (e.g. finding texts, annotating texts, converting non-ASCII characters to ASCII)
  • create new visualizations of data (e.g. a roman candle of words that ‘explode’ out of a text)
  • identify the encoding of corpus texts
  • compare two corpora along many dimensions
  • identify changes in language over time
  • disambiguate word senses
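
As a sketch of the first idea on the list, the snippet below computes the token distances between successive occurrences of a search pattern and summarises them; evenly spaced occurrences suggest good dispersion, while bursty gaps suggest concentration in a few texts. It uses only the standard library, and the corpus path and search word are illustrative assumptions.

    import re
    import statistics

    def occurrence_gaps(tokens, pattern):
        """Token distances between successive matches of `pattern`."""
        positions = [i for i, t in enumerate(tokens)
                     if re.fullmatch(pattern, t, flags=re.IGNORECASE)]
        return [b - a for a, b in zip(positions, positions[1:])]

    tokens = open("corpus.txt").read().lower().split()  # hypothetical path
    gaps = occurrence_gaps(tokens, r"unemployment")
    if len(gaps) > 1:
        print(f"{len(gaps) + 1} occurrences; mean gap "
              f"{statistics.mean(gaps):.1f} tokens "
              f"(sd {statistics.stdev(gaps):.1f})")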

From the list, it is clear that the field is moving towards more sophisticated analyses of data. People are also thinking of new and interesting ways to analyse corpora. But, perhaps the list also reveals a tendency for corpus linguists to think more in terms of what they can do rather than what they should do, an observation made by Douglas Biber, who also attended the session. As Jim Wild said when I interviewed him in July, “Research should be led by the science not the tool.” In corpus linguistics, clearly we should not be trapped into a particular research topic because of the limitations of the tools available to us. We should always strive to answer the questions that need to be answered. If the current tools cannot help us answer those questions, we may need to work with a software developer or perhaps even start learning to program ourselves so that new tools will emerge to help us tackle these difficult questions.

I am very happy that I was able to organize the corpus tools brainstorming session at AACL 2014, and I would like to thank all the participants for coming and sharing their ideas. I will continue thinking about corpus tools and working to make some of the ideas suggested at the session become a reality.

The complete slides for the AACL 2014 corpus tools brainstorming session can be found here. My personal website is here.

Call for Participation: ESRC Summer School in Corpus Approaches to Social Science

The ESRC Summer School in Corpus Approaches to Social Science was inaugurated in 2013; the 2014 event is the second in the series. It will take place from 15th to 18th July 2014 at Lancaster University, UK.

This free-to-attend summer school takes place under the aegis of CASS (https://cass.lancs.ac.uk), an ESRC research centre bringing a new method in the study of language – the corpus approach – to a range of social sciences. CASS is investigating the use and manipulation of language in society in a host of areas of pressing concern, including climate change, hate crime and education.

Who can attend?

A crucial part of the CASS remit is to provide researchers across the social sciences with the skills needed to apply the tools and techniques of corpus linguistics to the research questions that matter in their own discipline. This event is aimed at junior social scientists – especially PhD students and postdoctoral researchers – in any of the social science disciplines. Anyone with an interest in the analysis of social issues via text and discourse – especially on a large scale – will find this summer school of interest.

Programme

The programme consists of a series of intensive two-hour sessions, some involving practical work, others more discussion-oriented.

Topics include: Introduction to corpus linguistics; Corpus tools and techniques; Collecting corpus data; Foundational techniques for social science data – keywords and collocation; Understanding statistics for corpus analysis; Discourse analysis for the social sciences; Semantic annotation and key domains; Corpus-based approaches to metaphor in discourse; Pragmatics, politeness and impoliteness in the corpus.

Speakers include Tony McEnery, Paul Baker, Jonathan Culpeper, and Elena Semino.

The CASS Summer School is one of the three co-located Lancaster Summer Schools in Interdisciplinary Digital Methods; see the website for further information:

http://ucrel.lancs.ac.uk/summerschool

How to apply

The CASS Summer School is free to attend, but registration in advance is compulsory, as places are limited.

The deadline for registrations is Sunday 8th June 2014.

The application form is available on the event website, as is further information on the programme.


Dispatch from YLMP2014


I recently had the pleasure of travelling to Poland to attend the Young Linguists’ Meeting in Poznań (YLMP), a congress for young linguists who are interested in interdisciplinary research and stepping beyond the realm of traditional linguistic study. Hosted over three days by the Faculty of English at Adam Mickiewicz University, the congress featured over 100 talks by linguists young and old, including plenary lectures by Lancaster’s very own Paul Baker and Jane Sunderland. I was one of three Lancaster students to attend the congress, along with undergraduate Agnes Szafranski and fellow MA student Charis Yang Zhang.

What struck me about the congress, aside from the warm hospitality of the organisers, was the sheer breadth of topics covered over the weekend. All of the presenters were more than qualified to describe their work as linguistics, but perhaps for the first time I saw just how many domains the discipline can be applied in. At least four sessions ran in parallel at any given time, and themes ranged from gender and sexuality to EFL and even psycholinguistics. There were optional workshops as well as six plenary talks. On the second day of the conference, as part of the language and society stream, I presented a corpus-assisted critical discourse analysis of the UK national press reporting of the immediate aftermath of the May 2013 murder of soldier Lee Rigby. I was happy to have a lively and engaged audience who had some really interesting questions for me at the end, and I enjoyed the conversations that followed at the reception in the evening!

What was most encouraging about the congress was the drive and enthusiasm shared by all of the ‘young linguists’ in attendance. I now feel part of a generation of young minds who are hungry to improve not only our own work but hopefully, in time, the field(s) of linguistics as a whole. After my fantastic experience at the Boya Forum at Beijing Foreign Studies University last autumn, I was happy to spend time again celebrating the work of undergraduate and postgraduate students, and early-career linguists. There was a willingness to listen, to share ideas, and to (constructively) criticise where appropriate, and as a result I left Poznań feeling very optimistic about the future of linguistic study. I look forward to returning to the next edition of YLMP, because from what I saw at this one, there is a new generation of linguists eager to push the investigation of language to the next level.

New CASS: Briefing now available — Opposing gay rights in UK Parliament: Then and now

Opposing gay rights in UK Parliament: Then and now. How has the expression of opposition to gay rights changed in Parliamentary speeches in recent years? How are discussions of gay people involved in these changes? To what extent could these arguments be seen as homophobic? Read this CASS: Briefing, a diachronic corpus-based discourse analysis, to find out more.


New resources are being added regularly to the new CASS: Briefings tab above, so check back soon.

Introducing CASS 1+3 Research Student: Robbie Love

In 2013, the ESRC Centre for Corpus Approaches to Social Science was pleased to award its inaugural 1+3 (Masters to PhD) studentship to Robbie Love. Read a bit about the first year of his postgraduate experience, in Robbie’s own words below.


I am a Research Student at CASS in the first year of a 1+3 PhD studentship. My main role is to investigate methodological issues in the collection of spoken corpora, but I also have interests in corpus-assisted critical discourse analysis.

I grew up in the north east of England, in Blyth, Northumberland, and in Forest Hall on the outskirts of Newcastle. At school I found equal enjoyment in studying both English language and mathematics, but when deciding what to take at university I couldn’t think of something that would satisfy both, so I went with language.

I moved to Lancaster in 2010 to study for my BA in English Language, which I soon converted to Linguistics. It was only in my third year that I was introduced to corpus linguistics, and I became fascinated with its potential for revealing things about the way we communicate that I would never have predicted. I also liked its combination of quantitative and qualitative analysis, so it seemed like the perfect way to re-engage with my enjoyment of maths. I had always been open to the idea of postgraduate study, so when the opportunity came up to join CASS under the supervision of Tony McEnery it felt like the best thing for me to do.

Since joining CASS in the summer last year I have worked on several interesting projects including the changing language of gay rights opposition in Parliamentary debates (with Paul Baker), comments on online newspaper articles (with Amanda Potts), and the representation of Muslim people and Islam in the press reaction to the 2013 Woolwich incident (with Tony McEnery). I will be presenting findings on the Woolwich project at the upcoming Young Linguists’ Meeting in Poznań.

When I’m not playing with words on a computer, I am usually found rehearsing for a play or musical, playing my keyboard or eating any and all varieties of hummus.


Visit our People page for a full list of the centre’s investigators, researchers, and students.

Using Corpora to Analyze Gender

I wrote UCAG during a sabbatical as a semi-sequel to a book I published in 2006 called Using Corpora for Discourse Analysis. Part of the reason for the second book was to update and expand some of my thinking around discourse- or social-related corpus linguistics. As time has passed, I haven’t become disenamoured of corpus methods, but I have become more reflective and critical of them, and I wanted to use the book to highlight what they can and can’t do, and how researchers need to guard against using tools which might send them down a particular analytical path with a set of pre-ordained answers. Part of this has involved reflecting on how interpretations and explanations of corpus findings often need to come from outside the texts themselves (one of the tenets of critical discourse analysis), and subsequently whether a corpus approach requires analysts to go further and critically evaluate their findings in terms of “who benefits”.

Another way in which my thinking around corpus linguistics has developed since 2006 is in considering the advantages of methodological triangulation (approaching a research project in multiple ways). In one analysis chapter I take three small corpora of adverts from Craigslist and try out three methods of uncovering something interesting about gender from them – one very broad, involving automated tagging of every word; one semi-automatic, focusing on a smaller set of words; and one much more qualitative, relying on concordance lines only. In another chapter I look at “difficult” search terms – comparing two methods of finding all the cases where a lecturer indicates that a student has given an incorrect answer in a corpus of academic-related speech. Would it be better to just read the whole corpus from start to finish, or is it possible to devise search terms so that concordancing would elicit pretty much the same set?

The book also gave me a chance to revisit older data, particularly a set of newspaper articles about gay people from the Daily Mail which I had first looked at in Public Discourses of Gay Men (2005). As a replication experiment I revisited that data and redid an analysis I had first carried out about 10 years earlier. While the idea of an objective researcher is a fiction, corpus methods aim to redress researcher bias to an extent – although in retreading my steps, I did not obtain exactly the same results. Fortunately, the overall outcome was the same, but there were a few important points that the 10-years-younger version of me missed. Does that matter? I suspect it doesn’t invalidate the analysis, although it is a useful reminder of how our own analytical abilities alter over time.

Part of the reason for writing the book was to address other researchers who either come from corpus linguistics and want to look at gender, or who do research in gender and want to use corpus methods. I sometimes feel that these two groups do not talk to each other very much, and as a result the corpus research in this area is often based around the “gender differences” paradigm, where the focus is on how men and women apparently differ from each other in language use (with attendant metaphors about Mars and Venus). Chapter 2 and, to an extent, Chapter 3 address this by trying a number of experiments to see just how much lexical variation there is in sets of spoken corpora of male and female language – and when difference is found, asking how it can be explained. I also warn against lumping all men together into one box to compare them with all women, who are put in a second box. The variation within the boxes can actually be the more interesting story to tell, and this is where corpus tools around dispersion can really come into their own. So even if, for example, men do swear more than women, it’s not all men and not all the time. On the other hand, some differences which are more consistent and widespread can be incredibly revealing, although not in ways you might think – chapter 2 took me down an analytical path that ended up at the word Christmas – not perhaps an especially interesting word relating to gender, but it produced a lovely punchline to the chapter.

It was also good to introduce corpora, tools and techniques that weren’t available in 2006. Mark Davies has an amazing set of online corpora, mostly based around American English, and I took the opportunity to use COHA (the Corpus of Historical American English) to track changes in language reflecting male bias over time, from the start of the 19th century to the present day. Another chapter utilises Adam Kilgarriff’s online tool Sketch Engine, which allows collocates to be calculated in terms of their grammatical relationships to one another. This allowed for a comparison of the terms boy and girl, considering the verbs that position each as subject or object. So girls are more likely to be impressed while boys are more likely to be outperformed; on the other hand, boys cry whereas girls scream.
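
For readers curious how collocates mediated by grammatical relations can be computed, here is a rough approximation using spaCy’s dependency parser. To be clear, this is not Sketch Engine’s implementation – word sketches are generated by its own sketch grammar over very large parsed corpora – just a minimal stand-in; the corpus path is an illustrative assumption, and it requires the en_core_web_sm model to be installed.

    from collections import Counter

    import spacy

    nlp = spacy.load("en_core_web_sm")

    def verb_relations(texts, noun):
        """Count verbs governing `noun` as subject vs. direct object."""
        as_subject, as_object = Counter(), Counter()
        for doc in nlp.pipe(texts):
            for tok in doc:
                if tok.lemma_.lower() == noun and tok.head.pos_ == "VERB":
                    if tok.dep_ in ("nsubj", "nsubjpass"):
                        as_subject[tok.head.lemma_] += 1
                    elif tok.dep_ in ("dobj", "obj"):  # "obj" in UD-style models
                        as_object[tok.head.lemma_] += 1
        return as_subject, as_object

    lines = open("corpus.txt").read().splitlines()  # hypothetical path
    subj, obj = verb_relations(lines, "girl")
    print("girl as subject of:", subj.most_common(10))
    print("girl as object of:", obj.most_common(10))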

It would be great if the book inspired other researchers to consider the potential of using corpora in discourse/social related subjects as well as showing how this potential has expanded in recent years. It’s been fun to explore a relatively unexplored field (or rather travel a route between two connecting fields) but it occasionally gets lonely. I hope to encounter a few more people heading in the same direction as me in the coming years.

CASS awarded £200,000 from landmark ESRC Urgency Grant Scheme

CASS is delighted to announce a successful ESRC application for funding on a project entitled “Twitter rape threats and the discourse of online misogyny” (ES/L008874/1). The award of £191,245.25 was one of the first (possibly even the first) to be made as part of the ESRC’s new Urgency Grants scheme. Under this scheme, applications are assessed very quickly, and projects also start within four weeks of a successful award. This particular project will begin in November and run for fourteen months. It will be part of the CASS Centre, and the team will comprise Claire Hardaker (PI), Tony McEnery (CI), Paul Baker (CI), Andrew Hardie (CI), Paul Iganski (CI), and two CASS-hosted research assistants.

This project will investigate the rape and death threats sent on Twitter in July and August 2013 to a number of high profile individuals, including MP Stella Creasy and journalist Caroline Criado-Perez. This project seeks to address the remarkable lack of research into such behaviour, especially in light of the fact that policymakers and legislators are under intense pressure to make quick, long-term decisions on relevant policy and procedure to allow enforcement agencies to act on this issue. Specifically, the project will investigate what the language used by those who send rape/death threats on Twitter reveals about…

  1. their concerns, interests, and ideologies; what concept do they seem to have of themselves and their role in society?
  2. their motivations and goals; what seems to trigger them? What do they seem to be seeking?
  3. the links between them and other individuals, topics, and behaviours; do they only produce misogynistic threats or do they engage in other hate-speech? Do they act alone or within networks?

The project will take a corpus approach, incorporating several innovative aspects, and it will produce results that should be relevant to several social sciences, including sociology, criminology, politics, psychology, and law. It will also offer timely insight into an area where policy, practice, legislation, and enforcement are currently under intense scrutiny and require such research to help shape future developments. As such, the results will likely be of interest to legislators, policymakers, investigative bodies, and law enforcement agencies, as well as the study participants, media, and general public.

CASS affiliated papers to be given at the upcoming 5th International Language in the Media Conference

In two weeks, several scholars affiliated with the Centre will be heading south to attend the 5th International Language in the Media Conference, taking place this year at Queen Mary, University of London. We are particularly excited about the theme — “Redefining journalism: Participation, practice, change” — as well as the conference’s continued prioritization of papers on “language and class, dis/ability, race/ethnicity, gender/sexuality and age; political discourse, commerce and global capitalism” (among other important themes). As a taster for those of you who will be joining us in London and an overview for those who are unfortunately unable to make it this year, abstracts of the CASS affiliated papers to be given at the conference are reproduced below.


“I hate that tranny look”: a corpus-based analysis of the representation of trans people in the national UK press

Paul Baker

In early 2013, two high-profile incidents involving press representation of trans people resulted in claims that the British press were transphobic. For example, Jane Fae wrote in The Independent that ‘the trans community… is now a stand-in for various minorities… and a useful whipping girl for the national press… trans stories are only of interest when trans folk star as villains’ (1/13/13). This paper examines Fae’s claims by using methods from corpus linguistics to identify the most frequent and salient representations of trans people in the national UK press. Corpus approaches use computational tools as an aid in human research, offering a good balance between quantitative and qualitative analyses. My analysis builds on previous corpus-based research in which I examined the construction of gay people, refugees and asylum seekers, and Muslims in similar contexts.

Using a 660,000 word corpus of news articles about trans people published in 2012, I employ concordancing techniques to examine collocates and discourse prosodies of terms like transgender, transsexual and tranny, in order to identify repetitive patterns of representation that occur across newspapers. I compare such patterns to sets of guidelines on language use by groups like The Beaumont Society, and discuss how certain representations can be enabled by the Press Complaints Commission’s Code of Practice. While the analysis found very different patterns of representation around the three labels under investigation, all of them showed a general preference for negative representations, with occasional glimpses of more positive journalism.


“I think we’d rather be called survivors”: A corpus-based critical discourse analysis of the semantic preferences of referential strategies in Hurricane Katrina news articles as indicators of ideology

Amanda Potts

In times of great crisis, people often rely upon the discourse of powerful institutions to help frame experiences and reinforce established ideologies (van Dijk 1985). The selection of referential strategies in such discourses can reveal much about our society; for instance, some words have the power to comfort addressees but further oppress the referents. Taking a corpus-based critical discourse analytical approach, in this paper I explore the discursive cues of underlying ideology (of both the publications and perhaps the assumed audience), with special attention to journalists’ referential and predicational strategies (Reisigl and Wodak 2000). Analysis is based on a custom-compiled 36.7-million-word corpus of American print news articles concerning Hurricane Katrina.

A variety of forms of reference have been identified in the corpus using part-of-speech tagged word lists. Collocates of each form of reference have been calculated and automatically assigned a semantic tag by the UCREL USAS tagger (Archer et al. 2002). Semantic categories represented by the highest proportion of collocates overall have been identified as the most salient indicators of ideology.

The semantic preferences of the referential strategies are found to be quite distinct. For instance, resident prefers the M: Movement semantic category, whereas collocates of evacuee tend to fall under N: Numbers. This may prime readers to interpret Gulf residents and evacuees as large, threatening, ‘invading’ masses (often in conjunction with negative water metaphors such as flood). The highest collocate semantic category for victim, displaced, and survivor is S: Social actions, states and processes, indicating that the [social] experiences of these referents—such as being helped or stranded, or being linked to social identities such as wife—are foregrounded rather than their numbers or movement.

Finally, the plummeting frequency of refugee following a unique debate in the media over the word’s meaning, and even its semantic preference, will also be discussed as an illustrative example of how unconscious language patterns can sometimes come to the fore in contested usage and influence the journalistic lexicon. Following from this, a more considered use of referential strategies is recommended, particularly in the media, where it could encourage heightened compassion for, and understanding of, those gravely affected by catastrophic events.


Journalism through the Guardian’s goggles

Anna Marchi

‘Journalism is an intensely reflexive occupation, which constantly talks to and about itself’ (Aldridge and Evetts 2003: 560). Journalists create interpretative communities (Zelizer 2004) through the discourses they circulate about their profession, the meaning and role of journalism are constituted through daily performance (Matheson 2003) and can be studied by means of the self-reflexive traces in texts. That is, they can be detected and studied in a newspaper corpus.

This paper proposes a corpus-assisted discourse analysis (Partington 2009) of the ways journalists represent their trade in their own news-work. The focus of the research is on one newspaper in particular: the Guardian. Previous research (Marchi and Taylor 2009) suggested that among British broadsheets the Guardian is by far the most interested in other media, as well as the most inclined to talk about itself. Using newspaper data from 2005 – a particularly relevant year in the newspaper’s biography (it changed format from traditional broadsheet to Berliner) and one rich in self-reflexivity – I examine the discursive behaviour of media-related lexical items in the corpus (such as journalist, reporter, hack, media, newspaper, press, tabloid), exploring the ways in which the Guardian conceptualises the role of the news media, how it represents professional values and the divide between good and bad journalism, and, ultimately, how it constructs its own identity. The study relies on the typical tools of corpus linguistics research – collocation analysis, keyword analysis, concordance analysis – and aims at a comprehensive description of the data, following the principle of total accountability (McEnery and Hardie 2012: 17), while keeping track of the broader extralinguistic context. From a methodological point of view, this work encourages interdisciplinary contamination and a serendipitous approach to the data, and seeks to offer an example of how corpus-based research can contribute to the academic investigation of journalism across disciplines.


Visit the conference website for more details, including a list of plenary speakers.