About Paul Baker

CASS co-investigator Paul Baker is a Professor of Linguistics and English Language at Lancaster University. His research interests include corpus linguistics, language and gender/sexual identities, and critical discourse analysis.

CASS at Corpus Linguistics 2017

The biennial Corpus Linguistics conference first took place in 2001 at Lancaster, with the 2017 conference at Birmingham being its 9th outing. The conference lasted four days, with an additional day for workshops; this blog post details CASS participation at the event.

On Monday 24th July CASS ran two pre-conference workshops: Vaclav Brezina and Matt Timperley’s workshop was based around the tool #LancsBox, which has the capacity to create collocational networks, while Robbie Love and Andrew Hardie introduced the Spoken BNC2014 Corpus. Pre-conference workshop presentations were also given by CASS members in the Corpus Approaches to Health Communication workshop, which saw talks by Paul Baker (on NHS patient feedback), Elena Semino (on assessment of a diagnostic pain questionnaire) and Karen Kinloch, who gave two talks on discourses around IVF treatment and postnatal depression (her second talk was co-presented with Sylvia Jaworska).

On the first day of the conference proper, CASS Director Andrew Hardie gave a plenary entitled Exploratory analysis of word frequencies across corpus texts: towards a critical contrast of approaches, which involved a “for one night only” Topic Modelling analysis, demonstrating some of the problems and assumptions behind this approach. Key points were illustrated with a friendly picture of a Gigantopithecus (pictures of dinosaurs and other extinct creatures were used in several talks, perhaps suggesting a new theme for CL research). The plenary can be watched in full here: https://www.youtube.com/watch?v=ka4yDJLtSSc

A number of conference talks involved the creation and analysis of the new 2014 British National Corpus, with Abi Hawtin presenting on how she developed parameters for the written section and Robbie Love discussing swearing in the spoken section of the BNC2014. Vaclav Brezina and Matt Timperley discussed a proposal for standardised tokenization and word counting, using the new BNC as an exemplar, while Susan Reichelt examined ways of adapting the BNC for sociolinguistic research, taking negative concord as a case study.

In terms of other corpus creation projects, Paul Rayson, Scott Piao and a team from Cardiff University discussed the creation of a Welsh semantic tagger for use with the CorCenCC Project.

Two talks involved uses of corpus linguistics in teaching. First, Gillian Smith described the creation and analysis of a corpus of interactions in Special Educational Needs classrooms, with the goal of investigating teacher scaffolding, while Liam Blything, Kate Cain and Andrew Hardie analysed a half-million-word corpus of teacher-child interactions during guided reading sessions.

Regarding work examining discourse and representation using corpus approaches, Carmen Dayrell presented her work with Helen Baker and Tony McEnery on a diachronic analysis of newspaper articles about droughts, their research combining corpus approaches with GIS (Geographical Information Systems). GIS was also used by Laura Paterson and Ian Gregory to map text analysis of poverty in the UK, while Paul Baker and Mark McGlashan reported on their work looking at representations of Romanians in the Daily Express, comparing articles with online reader comments. A fourth paper by Jens Zinn and Daniel McDonald considered changing understandings around the concept of risk in English language newspapers.

Collocation was also a popular CASS topic in our presentations. Native and non-native processing of collocations was investigated by Jen Hughes, who carried out an experimental study using electroencephalography (EEG), which measures electrical potentials in the brain, while another approach to collocation was taken by Doğuş Can Öksüz and Vaclav Brezina, who examined adjective-noun collocations in Turkish and English. A third collocation study, by Dana Gablasova, Vaclav Brezina and Tony McEnery, involved empirical validation of MI-based collocation scores for language learning research.
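For readers unfamiliar with it, the MI score at issue in that last study is a simple association measure comparing how often a node word and a candidate collocate actually co-occur with how often chance would predict. A minimal sketch, using invented counts, might look like this:

```python
import math

def mi_score(cooccurrences, freq_node, freq_collocate, corpus_size):
    """Pointwise mutual information: how many times more often the
    pair co-occurs than chance would predict, on a log-2 scale."""
    expected = freq_node * freq_collocate / corpus_size
    return math.log2(cooccurrences / expected)

# Invented counts in a 1-million-word corpus: a node word occurring
# 2,000 times, a collocate occurring 300 times, co-occurring 50 times.
print(mi_score(50, 2000, 300, 1_000_000))
```

A score of 3 or above is a common (if much-debated) threshold for treating a pair as collocates; one known weakness, relevant to validation studies like the one above, is that MI rewards rare, exclusive pairings over frequent ones.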

Finally, Jonathan Culpeper and Amelia Joulain-Jay talked about an affiliated CASS project involving work on creating an Encyclopaedia of Shakespeare’s language. They discussed issues surrounding spelling variation, and part of speech tagging, and gave two case studies (involving the words I and good).

The conference brought together corpus linguists from dozens of countries (including Germany, Finland, Spain, Israel, Japan, Brazil, Iran, The Netherlands, USA, New Zealand, Taiwan, Ireland, China, Czech Republic, Italy, Sweden, Poland, Chile, UK, Hong Kong, Norway, Australia, Belgium, Canada, South Africa and Venezuela) and was a great opportunity to share and hear about developing work in the field. There was a lively Twitter presence throughout the conference, with the tag #CL2017bham. However, my favourite tag was #HardiePieChartWatch, which had me going back to my slides to see if I had used a pie chart appropriately. Be careful with your pie charts!

The next conference will be held (for the first time) in Cardiff – I hope to see you there in two years.

More pictures of the conference can be found at https://www.flickr.com/photos/artsatbirmingham/sets/72157684181373191

Beyond the checkbox – understanding what patients say in feedback on NHS services

In 2016 I will be working on a new project in CASS, which has received funding from the ESRC (£61,532 FEC). The purpose of this project is to help the National Health Service better understand the results of patient feedback so that they can improve their services. The NHS gathers a great deal of user feedback on its services from patients. Much of this is in “free text” format and represents a rich dataset, although the amount of text generated in the thousands of feedback forms patients fill in each year makes it unfeasible to undertake a close qualitative analysis of all of it. Categorisation-based approaches like sentiment analysis have been tried on the dataset but have not been found to be revealing. In this project we will be working with the NHS to first identify a set of research questions they would like to be answered from the data, and then we will use corpus-based discourse analysis to draw out the main themes and issues arising from the data. We will focus on four key NHS services – dentists, GP practices, hospitals and pharmacies. From these services alone we have around 423,418 comments to analyse, totalling 105,380,697 words. Some of the issues we are likely to be focussing on include: what matters most for patients, the key drivers for positive and negative feedback, indicators in comments that might trigger an alert or urgent review, and differences across providers/services or by socio-demographic group.

Sino-UK Corpus Linguistics Summer School

At the end of July, Tony McEnery and I taught at the second Sino-UK corpus linguistics summer school, arranged between CASS and Shanghai Jiao Tong University. It was my first time visiting China and we arrived during an especially warm season with temperatures hitting 40 degrees Celsius (we were grateful for the air conditioning in the room we taught in).

Tony opened the summer school, giving an introductory session on corpus linguistics, followed a few days later by a session on collocations, where he introduced CASS’s new tool for collocational networks, GraphColl. I gave a session on frequency and keywords, followed by later sessions on corpus linguistics and language teaching, and CL and discourse analysis. For the lab work components of our sessions, we didn’t use a computer lab. Instead the students brought along their own laptops and tablets, including a few who carried out BNCweb searches on their mobile phones! I was impressed by how much the students attending already knew, and had to think on my feet a couple of times – particularly when asked to explain some of the more arcane aspects of WordSmith (such as the “Standardised Type Token ratio standard deviation”).

At the end of the summer school, a symposium was held where Tony gave a talk on his work with Dana Gablasova and Vaclav Brezina on the Trinity Learner Language corpus. I talked about some research I’m currently doing with Amanda Potts on change and variation in British and American English.

Also presenting were Prof Gu Yuego (Beijing Foreign Studies University) who talked about building a corpus of texts on Chinese medicine, and Prof. Roger K Moore (University of Sheffield) who discussed adaptive speech recognition in noisy contexts.

We were made to feel very welcome by our host, Gavin Zhen, one of the lecturers at the university, who went out of his way to shuttle us on the 90-minute journey from the university to our hotel on the Bund.

It was a great event and it was nice to see students getting to grips with corpus linguistics so enthusiastically.

The Scottish referendum – did it unite the Guardian and the Mail?

The Guardian and the Mail are very different newspapers. The Guardian is a left-leaning liberal broadsheet while the Mail is a more popular right-leaning ‘middle-market’ newspaper. Generally, they can be relied on to disagree with one another on a range of social, economic and political issues. However, both newspapers supported the recent “No” campaign during the Scottish Independence referendum, which raises a few interesting questions – how did their discourse around Scottish independence contrast? Did they use similar arguments and language, or did they still manage to retain their individual identities?

To explore these questions, we built corpora of the Mail and Guardian (and their Sunday editions) from 18 June 2014 until 18 September 2014 (the three months leading up to the referendum on Scottish independence) by collecting all articles which contained the term Scottish directly followed by independence, referendum, vote or poll.

We then examined the keywords which emerged when each corpus of articles was compared against the 1 million word BE06 Corpus of general British English. A keyword is simply a word which occurs much more often in a corpus when compared against a larger reference corpus. Corpus tools (we used Antconc) can quickly calculate keywords by conducting statistical tests on all the words in the corpus. We looked at the strongest (in terms of statistical saliency) 100 or so keywords for each corpus, and then compared the two sets of keywords to see which occurred just in the Guardian or just in the Mail, but also which were shared by both. The table below shows the keywords that were found.

Guardian keywords: austerity, Britain, Brown, campaigners, Carrell, country, devolution, EU, festival, Holyrood, ISIS, nation, nationalism, north, oil, political, politicians, politics, polling, polls, powers, Saturday, says, secretary, Severin, voted, votes, voting, weather, YouGov

Keywords in both newspapers: Alex, Alistair, all, August, bank, BBC, better, border, Cameron, campaign, currency, Darling, David, debate, Ed, Edinburgh, election, former, Games, Glasgow, has, independence, independent, July, Labour, leader, London, Miliband, minister, MPs, nationalists, No, party, poll, prime, pro, referendum, Salmond, Scotland, Scots, Scottish, September, SNP, tax, Thursday, Together, Tory, UK, undecided, union, vote, voters, Westminster, will, would, Yes

Mail keywords: Balmoral, border, cabinet, CBI, chairman, crisis, investors, James, Kingdom, MP, PM, prince, Queen, said, shares, sterling, Tories, Tuesday, twitter, uncertainty, United, warned, week, year
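For readers curious what the statistical test behind these keyword lists actually involves, a rough sketch of the log-likelihood calculation (one of the keyness statistics offered by tools like AntConc; all counts below are invented for illustration) is:

```python
import math

def log_likelihood(freq_study, size_study, freq_ref, size_ref):
    """Dunning's log-likelihood keyness statistic for a single word,
    comparing a study corpus against a reference corpus."""
    total_freq = freq_study + freq_ref
    total_size = size_study + size_ref
    # Expected counts if the word occurred at the same rate in both corpora.
    expected_study = size_study * total_freq / total_size
    expected_ref = size_ref * total_freq / total_size
    ll = 0.0
    if freq_study:
        ll += freq_study * math.log(freq_study / expected_study)
    if freq_ref:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Invented counts: a word occurring 850 times in a 500,000-word
# newspaper corpus versus 12 times in a 1-million-word reference corpus.
print(log_likelihood(850, 500_000, 12, 1_000_000))
```

The higher the score, the stronger the evidence that the word’s frequency genuinely differs between the two corpora; a keyword tool computes this for every word and ranks the results to produce lists like the one above.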


This table isn’t really an analysis though – we need to explore the keywords in more detail by reading the articles that each keyword appears in and getting a sense for how and why they were used. This is achieved by looking at concordance lines, although we can also expand each line to read the entire article. Here are some of our preliminary findings.

The Mail was much more concerned than the Guardian about how the vote would impact on the Royal Family, with its keywords including Prince, Queen and Balmoral. Much is made of the queen’s ‘neutrality’, her relationship with David Cameron, her ‘soft power’ in influencing the vote, her carefully calculated comments, and characteristically, what she is wearing (“a turquoise outfit and hat” in one article). The Queen is also described as receiving daily updates from Balmoral.

The Mail also refers to the keyword uncertainty a lot more than the Guardian, particularly appearing concerned about how the progress of the campaign is bad for markets, investors, businesses and pension holders who don’t like uncertainty, e.g. “uncertainty is the enemy of investment”. The use of the Mail keyword crisis also pins the Scottish vote to the idea of a crisis – the vote could trigger an “EMU-style currency crisis within the UK” but there could also be a “leadership crisis” for both Labour and the Conservatives. Another somewhat worrying Mail keyword is warned, with the Mail reporting various people and businesses (Stagecoach, Paul Krugman, Goldman Sachs, Standard Life, John Major, Mark Carney, Doug Flint) issuing warnings about a range of dire consequences that could occur if Scotland gains independence.

Perhaps surprisingly, twitter is a keyword for the Mail, which is interesting given Mail editor Paul Dacre’s dismissal of the ‘firestorm’ of tweets around a previous Mail article by Jan Moir, which back in 2009 attracted the highest number of complaints ever made to the Press Complaints Commission. But the Mail now seems to have accepted the importance of Twitter and views tweets as newsworthy. To wit, it reports on Rupert Murdoch’s twitter behaviour, as well as tweets from people who disliked the Better Together advertising campaign #PatronisingBTLady. The Mail is especially disapproving of “tartan trolls” who use Twitter to attack celebrities like JK Rowling who endorse the Yes vote.

How about the Guardian? One keyword it used was nationalism, which at first glance may suggest that the Guardian wished to critique the “Yes” voters as nationalists. However, there were cases where writers like Billy Bragg and George Monbiot argued that the label of nationalism was unfairly used to obscure ‘self determination’. One journalist approvingly refers to the lack of ‘braveheart nationalism’ in the campaign, although other journalists do attribute nationalism to some Scottish people, but this is felt to be due to London being out of touch and inward looking. Nationalism either doesn’t exist in the campaign, or when it does, can be excused.

Another Guardian keyword is austerity, with some journalists citing views that the current government’s austerity programme was helping the yes camp. This could be an opportunity for the Guardian to blame the government’s economic policy for breaking up the union, but generally this is not done and instead, it is argued that a Yes vote would not end austerity, but merely impose it from Holyrood rather than Westminster.

Unlike the Mail, the Guardian doesn’t spend as much time reporting the warnings of ‘financial experts’, although the keyword oil was interesting, occurring with reference to North Sea Oil reserves and revenues. In a number of articles, the Guardian foregrounds claims by Sir Ian Wood that Alex Salmond has exaggerated North Sea Oil reserves by up to 60%. In terms of perspectivation, Sir Ian Wood’s position is given precedence over Salmond’s e.g. Wood is described as ‘one of the most influential figures in the Scottish oil industry’ and other people are described as quoting his position too. A woman who claims that the No campaigners have ‘downplayed the amount of oil we have left’ is subtly positioned as greedy: ‘It was “our oil”, she said…’ and thus her argument is weakened somewhat. At the end of the same article, another opinion, given by a local Lib Dem chairman who is described as a ‘marine engineer’ appears to be given more precedence: he says ‘Nobody knows how much oil is there’. The Guardian may not know how much oil there is, but it manages to do a good job of casting enough seeds of doubt to make us think that neither does Alex Salmond.

Finally, both newspapers had Yes as a keyword. How did they represent the yes campaigners? The Guardian made reference to yes voters who are starry-eyed, fierce, enterprising, determined, hardline, vocal and proud. It has very little to say about the no voters, indicating a somewhat subtle sense that the yes voters are a little pushy in their sentiments. The Mail doesn’t mention characteristics of the yes voters much, although it does refer to Alex Salmond as shouty and describes the no campaign as floundering and lacklustre.

So, while both newspapers generally supported Scotland staying within the UK, they each did it by using different strategies and in a way which helped them to maintain their own identities, reflecting the concerns and interests of their readers. From this admittedly preliminary analysis it is difficult to make a confident conclusion, but the Guardian did appear to make more of an effort to allow a range of positions to be represented, and was somewhat more subtle in its disapproval of the ‘yes’ campaign. The two newspapers did have different strategies on what they said about each other with respect to the campaigning. The Mail barely mentioned the Guardian, only referring a couple of times to a Guardian poll that put Alistair Darling as scoring a victory over Alex Salmond during a two hour debate. The Guardian was more critical of the Mail, however, using the campaigning to get in a few digs at the Mail. One writer sneeringly referred to ‘the Daily Mail’s insistence that anyone who wants to see a fairer society must be a Stalinist’. And another Guardian columnist expressed surprise that ‘I’m on the same side as the Daily Mail too! Which appears to be taking a short break from convincing us the UK has gone down the tubes to press home a slightly perplexing message of: hey, please don’t break up this wonderful hideous slutty drunken immoral country where women, gays and foreigners don’t know their place!’

Now the vote is over, the two newspapers can get back in their respective bunkers.

CASS visit to Ghana

On June 24th, three other members of CASS and I spent a week in Accra, Ghana, demonstrating corpus methods and our own research at two universities: the University of Ghana and the recently established Lancaster University Ghana campus in Accra. From the UK it’s just over a six-hour flight, although thankfully only one hour of time difference. However, travel did involve some advance preparation, with jabs for yellow fever (and a few other things), visa applications and taking anti-malarial pills for a month after the trip. Fortunately, we only encountered one mosquito during the whole trip and none of us were bitten.

Although close together, the two universities we visited have a very different feel to them: the former is a large university spread out over a lot of land, with many departments and buildings, while the latter is (at the moment) a three-storey, modern-looking grey and red building with the familiar Lancaster logo on it.


Our first trip was to the University of Ghana, where Andrew, Tony and I each gave a lecture to about 90 members of staff and students. Tony talked about the theoretical principles behind corpus linguistics, I discussed (and problematized) sex differences in the British National Corpus and Andrew showed applications of corpus linguistics to field linguistics using Corpus Workbench. The University of Ghana has some Lancaster University alumni among its staff, and it was great to run into Clement Appah and Grace Diabah (formerly Bota) again.


Over the following two days, we gave corpus linguistics workshops, which included a two hour lab session where Andrew walked students through setting up a CQPweb account and doing some analysis of the Brown Family of corpora. I suspect this was the highlight of the day for those who attended, who were pleased to get access to many of the corpora we have at Lancaster. Each day we taught about 35 people, including some who had travelled quite long distances to get to us. Four students had driven in that morning from Cape Coast – a journey that we did some of when we went to Kakum National Park on our day off, and that took us over three hours – so we were impressed by their dedication. Tony gave an introduction to corpus linguistics and Vaclav talked about the General Service List for English words and let the students use a tool he had developed for exploring it. I ended each day with a talk on corpus linguistics and discourse analysis.


As I’d mentioned, we had a day off, where we visited Kakum National Park. This gave us an opportunity to see more of Ghana on the drive there, and then we had a great experience in the park, walking across a 350m network of rope bridges (the Kakum Canopy Walk) that were suspended high above the ground – you literally got a bird’s eye view of the tropical rainforest below. It was one of the most memorable experiences I’ve had and I think we all came away with very positive feelings about our trip, and are looking forward to our next visit to Ghana. I also hope that we managed to inspire people to incorporate some corpus linguistics methods into their own research.

Using Corpora to Analyze Gender

I wrote UCAG during a sabbatical as a semi-sequel to a book I published in 2006 called Using Corpora for Discourse Analysis. Part of the reason for the second book was to update and expand some of my thinking around discourse- or social-related corpus linguistics. As time has passed, I haven’t become disenamoured of corpus methods, but I have become more reflective and critical of them and I wanted to use the book to highlight what they can and can’t do, and how researchers need to guard against using tools which might send them down a particular analytical path with a set of pre-ordained answers. Part of this has involved reflecting on how interpretations and explanations of corpus findings often need to come from outside the texts themselves (one of the tenets of critical discourse analysis), and subsequently whether a corpus approach requires analysts to go further and critically evaluate their findings in terms of “who benefits”.

Another way in which my thinking around corpus linguistics has developed since 2006 is in considering the advantages of methodological triangulation (or approaching a research project in multiple ways). In one analysis chapter I take three small corpora of adverts from Craigslist and try out three methods of attempting to uncover something interesting about gender from them – one very broad involving an automated tagging of every word, one semi-automatic relying on a focus on a smaller set of words, and another much more qualitative, relying on looking at concordance lines only. In another chapter I look at “difficult” search terms – comparing two methods of finding all the cases where a lecturer indicates that a student has given an incorrect answer in a corpus of academic-related speech. Would it be better to just read the whole corpus from start to finish, or is it possible to devise search terms so concordancing would elicit pretty much the same set?

The book also gave me a chance to revisit older data, particularly a set of newspaper articles about gay people from the Daily Mail which I had first looked at in Public Discourses of Gay Men (2005). As a replication experiment I revisited that data and redid an analysis I had first carried out about 10 years ago. While the idea of an objective researcher is fictional, corpus methods have aimed to redress the issue of researcher bias to an extent – although in retreading my steps, I did not obtain exactly the same results. Fortunately, the overall outcome was the same, but there were a few important points that the 10 years younger version of me missed. Does that matter? I suspect it doesn’t invalidate the analysis although it is a useful reminder about how our own analytical abilities alter over time.

Part of the reason for writing the book was to address other researchers who are either from corpus linguistics and want to look at gender, or who do research in gender and want to use corpus methods. I sometimes feel that these two groups of people do not talk to each other very much and as a result the corpus research in this area is often based around the “gender differences” paradigm where the focus is on how men and women apparently differ from each other in language use (with attendant metaphors about Mars and Venus). Chapter 2, and to an extent Chapter 3, addresses this by trying a number of experiments to see just how much lexical variation there is in sets of spoken corpora of male and female language – and when difference is found, how can it be explained? I also warn against lumping all men together into a box to compare them with all women who are put in a second box. The variation within the boxes can actually be the more interesting story to tell and this is where corpus tools around dispersion can really come into their own. So even if, for example, men do swear more than women, it’s not all men and not all the time. On the other hand, some differences which are more consistent and widespread can be incredibly revealing, although not in ways you might think – chapter 2 took me down an analytical path that ended up at the word Christmas – not perhaps an especially interesting word relating to gender, but it produced a lovely punchline to the chapter.
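The dispersion idea mentioned above can be made concrete with a measure such as Gries’s Deviation of Proportions (DP), which compares how a word’s occurrences are actually spread across subcorpora (say, individual speakers) with how they would be spread if the distribution were perfectly even. A sketch, with invented swearing counts, might look like this:

```python
def deviation_of_proportions(freqs, sizes):
    """Gries's DP: 0 means the word is spread across the subcorpora
    exactly in proportion to their sizes; values approaching 1 mean
    it is concentrated in very few of them."""
    total_freq = sum(freqs)
    total_size = sum(sizes)
    return sum(abs(f / total_freq - s / total_size)
               for f, s in zip(freqs, sizes)) / 2

# Invented counts for a swear word across four equal-sized speaker
# subcorpora: the same overall frequency, but very different stories.
print(deviation_of_proportions([40, 0, 0, 0], [1000] * 4))    # one prolific swearer
print(deviation_of_proportions([10, 10, 10, 10], [1000] * 4)) # evenly spread
```

Both invented corpora contain 40 swear words overall, so raw frequency alone cannot distinguish them; the dispersion score is what reveals that in the first case the swearing comes from a single speaker.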

It was also good to introduce different corpora, tools and techniques that weren’t available in 2006. Mark Davies has an amazing set of online corpora, mostly based around American English, and I took the opportunity to use the COHA (Corpus of Historical American English) to track changes in language which reflects male bias over time, from the start of the 19th century to the present day. Another chapter utilises Adam Kilgarriff’s online tool Sketch Engine, which allows collocates to be calculated in terms of their grammatical relationships to one another. This enabled a comparison of the terms boy and girl, considering verbs that position either as subject or object. So girls are more likely to be impressed while boys are more likely to be outperformed. On the other hand boys cry whereas girls scream.

It would be great if the book inspired other researchers to consider the potential of using corpora in discourse/social related subjects as well as showing how this potential has expanded in recent years. It’s been fun to explore a relatively unexplored field (or rather travel a route between two connecting fields) but it occasionally gets lonely. I hope to encounter a few more people heading in the same direction as me in the coming years.

Discourse, Gender and Sexuality South-South Dialogues Conference

Last week was spent at Witwatersrand (Wits) University in Johannesburg, where I had been invited to give a workshop on corpus methods, as well as a talk on some of my own research. The week was topped off by the first Discourse, Gender and Sexuality South-South Dialogues Conference, which was organised by Tommaso Milani. Many of the papers at the conference used qualitative methods (analyses of visual data seemed particularly popular) but there were a few papers, including my own, which used corpus methods.

These included a paper by Megan Edwards who combined a corpus approach with CDA and visual analysis to examine a small corpus of pamphlets found around Johannesburg – these pamphlets advertise remedies for sexual and relationship problems and Megan demonstrated that embedded within the adverts were gendered discourses – relating to notions of ideal masculinity and femininity. This is probably one of the few corpora in existence where the top lexical word is penis.

Another interesting paper was by Sally Hunt who examined corpora of articles about sex work in two South African newspapers, focussing on the period when SA hosted the World Cup. She found that while there was a more balanced set of representations of sex workers than expected, they were still largely represented as immoral and criminalised for their actions while the agency of their clients was largely obscured. Sally is a lecturer at Rhodes University, Grahamstown, and has recently completed the construction of a 1 million word South African corpus, using the Brown family sampling frame.

During the workshop that I hosted at the university I got participants to use AntConc to examine a small corpus of recent newspaper articles about feminists, and a number of interesting patterns emerged from the analyses of concordances and collocates that took place. For example, a representation of feminists as war-mongers or vocally annoying/fierce e.g. shrill, strident etc was very prevalent and perhaps expected, although we were surprised to see a sub-set of words which related feminists to Islam like feminist Taleban and feminist fatwas (killing two ideological birds with one stone). Additionally, it was interesting to see how these negative discourses shouldn’t always be taken at face value. They were sometimes quoted in order to be critical of them, although it was often only with expanded concordance lines that this could be seen. In all, a productive week, and it was good to meet so many people who were interested in finding out more about corpus linguistics.


Visiting With The Brown Family

In 2011 I gave a plenary talk on how American English is changing over time (contrasting it with British English), using the Brown Family of corpora. Each member of the Brown family consists of a corpus of 1 million words of written, published, standard English, divided into 500 files of about 2,000 words each. Fifteen genres of writing are represented. This framework was created decades ago when the original Brown corpus was compiled by Henry Kučera and W. Nelson Francis at Brown University; that corpus has the distinction of being the first publicly available corpus ever built. Containing only American texts published in 1961, it originally went by the name of A Standard Corpus of Present-Day Edited American English for use with Digital Computers but later became known as just the Brown Corpus. It was followed by an equivalent British version, with later members representing English from the 1990s, the 2000s and the 1930s. A 1901 British version is in the pipeline.

Before I gave my talk, however, Mark Davies gave a brilliant presentation on the COHA (Corpus of Historical American English), which has 400 million words and covers the period from 1800 to the present day. It was the proverbial hard act to follow. Compared to the COHA, the Brown family are tiny, and the coverage occurs across snapshots taken at 15- or 30-year intervals, rather than representing every year. If we identify, say, that the word Mr is less frequent in 2006 than in 1991 then it is tempting to say that Mr is becoming less frequent over time. But we don’t know for certain what corpora from all the years in between would tell us. Having multiple sampling points presents a more convincing picture, but judicious hedging must be applied.
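Comparisons like the Mr example rest on normalised frequencies. The Brown family members are all roughly one million words, so raw counts are nearly comparable, but the general safeguard when corpora differ in size is to convert to a rate per million words. A trivial sketch, with invented counts:

```python
def per_million(raw_count, corpus_size):
    """Normalise a raw frequency to occurrences per million words,
    so that corpora of different sizes can be compared directly."""
    return raw_count / corpus_size * 1_000_000

# Invented counts of 'Mr' in two snapshot corpora of slightly
# different sizes; the normalised rates are what we compare.
print(per_million(1500, 1_014_000))  # 1991 snapshot
print(per_million(900, 1_023_000))   # 2006 snapshot
```

Even with normalisation, two snapshots give only two data points; as noted above, the intervening years remain unobserved, so any claim about a trend needs careful hedging.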

Also, being small, many words in the Brown family have tiny frequencies so it’s very difficult to make any claims about them. And the sampling could be viewed as rather outdated – the sorts of texts that people accessed in the 1960s are not necessarily the same as they access now. There are no online texts in the Brown family (although to ease collection, both the 2006 members involved texts that were originally published in written form, then placed online). Nor is there any advertising text. Or song lyrics. Or horror fiction. Or erotica (although there is a section on Romantic Fiction which could be pushed in that direction). Finally, the fact that all the texts are of the published variety means that they tend to represent a somewhat standardised, conservative form of English. A lot of the innovation in English happens in much more informal contexts, especially where young people or people from different backgrounds mix together – inner-city playgrounds and internet forums being two good examples. By the time such innovation gets into written published standard English, it’s no longer innovative. So the Brown family can’t tell us about the cutting edge of language use – they’ll always be a few years out of fashion.

So what are the Brown family good for, if anything?


Two approaches to keywords

On July 4th, 2013, I gave a presentation on keywords at a meeting of the Keywords Project at Jesus College, Cambridge University. The Keywords Project uses Raymond Williams’ concept of keywords as socially prominent words (e.g. art, industry, media or society) that are capable of bearing interlocking, yet sometimes contradictory, contemporary meanings, and the group meets a couple of times each year to discuss new keywords that have emerged in society. The group carry out analysis using a variety of different methods, including deriving etymologies from the Oxford English Dictionary, making use of Google n-grams, referring to academic research on particular concepts and investigating corpora.

I was invited to give an alternative (or rather, complementary) perspective that was more focussed on corpus linguistics. I discussed how the concept of keywords differs greatly in CL, and how keyness can be extended to include tagged words, semantic or grammatical groups of words, multi-word units or even punctuation marks. Using various reference corpora, I showed how keyness techniques could aid the identification of potential emerging keywords, while concordancing and collocational analysis could help to identify the range of meanings around a word at a given point in time.

For more information, see http://keywords.pitt.edu/index.html