Using Corpora to Analyze Gender

I wrote UCAG during a sabbatical as a semi-sequel to a book I published in 2006 called Using Corpora for Discourse Analysis. Part of the reason for the second book was to update and expand some of my thinking around discourse- or social-related corpus linguistics. As time has passed, I haven’t become disenamoured of corpus methods, but I have become more reflective and critical of them and I wanted to use the book to highlight what they can and can’t do, and how researchers need to be guarded against using tools which might send them down a particular analytical path with a set of pre-ordained answers. Part of this has involved reflecting on how interpretations and explanations of corpus findings often need to come from outside the texts themselves (one of the tenets of critical discourse analysis), and subsequently whether a corpus approach requires analysts to go further and critically evaluate their findings in terms of “who benefits”.

Another way in which my thinking around corpus linguistics has developed since 2006 is in considering the advantages of methodological triangulation (or approaching a research project in multiple ways). In one analysis chapter I take three small corpora of adverts from Craigslist and try out three methods of attempting to uncover something interesting about gender from them – one very broad involving an automated tagging of every word, one semi-automatic relying on a focus on a smaller set of words, and another much more qualitative, relying on looking at concordance lines only. In another chapter I look at “difficult” search terms – comparing two methods of finding all the cases where a lecturer indicates that a student has given an incorrect answer in a corpus of academic-related speech. Would it be better to just read the whole corpus from start to finish, or is it possible to devise search terms so concordancing would elicit pretty much the same set?
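To make the idea of concordancing with a search term concrete, here is a minimal Python sketch of a keyword-in-context search. It is an illustration only, not the procedure used in the book: the function name `concordance`, the search pattern and the toy utterance are all invented for the example.

```python
import re

def concordance(tokens, pattern, width=4):
    """Return (left context, hit, right context) for every token matching pattern.

    `tokens` is a list of words; `pattern` is a regular expression that a
    whole token must match. Both are hypothetical inputs for illustration.
    """
    rx = re.compile(pattern, re.IGNORECASE)
    lines = []
    for i, tok in enumerate(tokens):
        if rx.fullmatch(tok):
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines

# Invented example of a lecturer responding to a student's answer.
utterance = "no that is not quite right but you are nearly there".split()
for left, hit, right in concordance(utterance, r"not|wrong|incorrect", width=3):
    print(f"{left:>20}  [{hit}]  {right}")
```

A search like this only finds the cases its pattern anticipates, which is exactly why comparing it against a full read-through of the corpus is worthwhile.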

The book also gave me a chance to revisit older data, particularly a set of newspaper articles about gay people from the Daily Mail which I had first looked at in Public Discourses of Gay Men (2005). As a replication experiment I revisited that data and redid an analysis I had first carried out about 10 years ago. While the idea of an objective researcher is fictional, corpus methods have aimed to redress the issue of researcher bias to an extent – although in retreading my steps, I did not obtain exactly the same results. Fortunately, the overall outcome was the same, but there were a few important points that the 10 years younger version of me missed. Does that matter? I suspect it doesn’t invalidate the analysis although it is a useful reminder about how our own analytical abilities alter over time.

Part of the reason for writing the book was to address other researchers who are either from corpus linguistics and want to look at gender, or who do research in gender and want to use corpus methods. I sometimes feel that these two groups of people do not talk to each other very much and as a result the corpus research in this area is often based around the “gender differences” paradigm where the focus is on how men and women apparently differ from each other in language use (with attendant metaphors about Mars and Venus). Chapters 2 and, to an extent, 3 address this by trying a number of experiments to see just how much lexical variation there is in sets of spoken corpora of male and female language – and when difference is found, how can it be explained? I also warn against lumping all men together into a box to compare them with all women who are put in a second box. The variation within the boxes can actually be the more interesting story to tell and this is where corpus tools around dispersion can really come into their own. So even if, for example, men do swear more than women, it’s not all men and not all the time. On the other hand, some differences which are more consistent and widespread can be incredibly revealing, although not in ways you might think – chapter 2 took me down an analytical path that ended up at the word Christmas – not perhaps an especially interesting word relating to gender, but it produced a lovely punchline to the chapter.
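The point about dispersion can be sketched in a few lines of Python. This is a rough illustration under stated assumptions, not the tool or measure used in the book: the function name `dispersion` and the four toy speakers are invented, and the two measures shown are range (the share of speakers who use the word at all) and Gries' DP (deviation of proportions), where 0 means the word is spread in line with how much each speaker says and values near 1 mean it is concentrated in very few speakers.

```python
def dispersion(parts, word):
    """Summarise how evenly `word` is spread across corpus parts.

    `parts` maps a speaker id to that speaker's list of tokens
    (a hypothetical data layout chosen for this example).
    """
    total_tokens = sum(len(toks) for toks in parts.values())
    counts = {spk: toks.count(word) for spk, toks in parts.items()}
    total_hits = sum(counts.values())
    if total_hits == 0 or total_tokens == 0:
        return {"frequency": 0, "range": 0.0, "dp": None}

    # Range: proportion of speakers who use the word at least once.
    speakers_using = sum(1 for c in counts.values() if c > 0)
    word_range = speakers_using / len(parts)

    # Gries' DP: half the summed absolute difference between each part's
    # share of the word's occurrences and its share of the whole corpus.
    dp = 0.5 * sum(
        abs(counts[spk] / total_hits - len(toks) / total_tokens)
        for spk, toks in parts.items()
    )
    return {"frequency": total_hits, "range": word_range, "dp": dp}

# Invented toy data: one speaker accounts for every occurrence.
speakers = {
    "m1": ["damn", "damn", "damn", "it"],
    "m2": ["the", "cat", "sat", "down"],
    "m3": ["we", "went", "out", "today"],
    "m4": ["it", "was", "very", "cold"],
}
print(dispersion(speakers, "damn"))
```

In the toy data the overall frequency looks healthy, but the range and DP values show that a single speaker is responsible for all of it, which is the kind of within-group variation a raw male versus female comparison would hide.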

It was also good to introduce different corpora, tools and techniques that weren’t available in 2006. Mark Davies has an amazing set of online corpora, mostly based around American English, and I took the opportunity to use the COHA (Corpus of Historical American English) to track changes in language which reflects male bias over time, from the start of the 19th century to the present day. Another chapter utilises Adam Kilgarriff’s online tool Sketch Engine, which allows collocates to be calculated in terms of their grammatical relationships to one another. This made it possible to compare the terms boy and girl by considering the verbs that positioned either as subject or object. So girls are more likely to be impressed while boys are more likely to be outperformed. On the other hand boys cry whereas girls scream.

It would be great if the book inspired other researchers to consider the potential of using corpora in discourse/social related subjects as well as showing how this potential has expanded in recent years. It’s been fun to explore a relatively unexplored field (or rather travel a route between two connecting fields) but it occasionally gets lonely. I hope to encounter a few more people heading in the same direction as me in the coming years.

CASS co-investigator Paul Baker is a Professor of Linguistics and English Language at Lancaster University. His research interests include corpus linguistics, language and gender/sexual identities, and critical discourse analysis.