Introductory Blog – Gavin Brookes

This is the second time I have been a part of CASS, which means that this is the second time I’ve written one of these introductory blog pieces. I first worked in CASS in 2016, on an eight-month project with Paul Baker where we looked at the feedback that patients gave about the NHS in England. This was a really fun project to work on – I enjoyed being a part of CASS and working with Paul and made some great friends in the Centre with whom I’m still in contact to this day. Since leaving CASS in October 2016, I completed my PhD in Applied Linguistics in the School of English at the University of Nottingham, which examined the ways that people with diabetes and eating disorders construct their illnesses and identities in online support groups. Following my PhD, I stayed in the School of English at Nottingham, working as a Postdoctoral Research Fellow in the School’s Professional Communication research and consultancy unit.

As you might have guessed from the topic of my doctoral project and my previous activity with CASS, my main research interests are in the areas of corpus linguistics and health communication. I am therefore very excited to return to the Centre now, with its new focus on the application of corpora to the study of health communication. I’m currently working on a new project within the Centre, Representations of Obesity in the News, which explores the ways that obesity and people affected by obesity are represented in the media, focussing in particular on news articles and readers’ responses. I’m very excited to be working on this important project. Obesity is a growing and seemingly ever-topical public health concern, not just in the UK but globally. However, the media’s treatment of the issue can often be stigmatising, making it quite deserving of scrutiny! Yet, our aim in this project isn’t just to take the media to task, but to eventually work with media outlets to advise them on how to cover obesity in a way that is more balanced and informative and, crucially, less stigmatising for people who are affected by it. In this project, we’re also working with obesity charities and campaign groups, which provides a great opportunity to make sure that the focus of our research is not just fit for academic journals but is relevant to people affected by this issue and so can be applied in the ‘real world’, as it were.

So, to finish on more of a personal note, the things I said about myself the last time I wrote one of these blog posts are still true: I still like walking, I still travel lots, I still read fantasy and science fiction, I still do pub quizzes, my football team are still rubbish, and I don’t think I’ve changed that much since the photo used in that piece was taken… Most of all, though, it still excites me to be a part of CASS and I am absolutely delighted to be back.

 

Using corpus methods to identify teacher strategies in guided reading: what questions do teachers ask, and what works?

In previous blogs on the CASS guided reading project, we have introduced our investigation into one of the most prevalent techniques recommended to engage children in discussion: strategic questioning. We can now reveal our key findings, which focus on the effectiveness of wh-word questioning techniques on children’s responses.

Background.

Guidelines encourage teachers to ask ‘high challenge’ or ‘open-ended’ questions. However, these are often considered too vague for teachers to implement.

How did we examine teacher questions? One way to specify the nature of the questions is to label them by their typical syntactic forms. There are two main question categories:

  • Wh-word questions (high challenge) pose few answering constraints, e.g. questions beginning with ‘how’, ‘when’, ‘where’, ‘why’, ‘which’, ‘what’ or ‘who’.
  • Confirmative questions (low challenge) presuppose more information and so pose greater constraints, e.g. ‘Does Mary prefer chocolate or fruit?’

Also, wh-word questions can be split into subcategories:

  • Wh-adverbs (high challenge: ‘how’, ‘why’)
  • Wh-determiners/pronouns (low challenge: ‘what’)

How did we measure children’s response quality? We used three indicators:

  • Grammatical complexity. We calculated the proportion of content words vs. function words. Content words carry real meaning, and include nouns, verbs, adverbs and adjectives. Function words do not carry real meaning and instead offer grammatical information (e.g., auxiliary ‘be’ verbs: ‘is’, ‘am’, ‘are’). A higher proportion of content words is an indicator of greater grammatical complexity.
  • MLU. Mean length of utterance in words is the most common indicator of syntactic complexity in children’s speech.
  • Causal language (e.g., ‘because’, ‘so’).
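The two lexical indicators above (content-word proportion and MLU) can be sketched in a few lines of Python. This is a minimal illustration only: the POS tag set and the example utterances are invented for the sketch, not taken from the project’s data.

```python
# Minimal sketch of two response-quality indicators, assuming utterances
# arrive as lists of (word, POS) pairs. Tags and examples are invented.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}  # content-word classes

def content_word_proportion(tagged_utterance):
    """Share of content words among all words in one utterance."""
    if not tagged_utterance:
        return 0.0
    content = sum(1 for _, tag in tagged_utterance if tag in CONTENT_TAGS)
    return content / len(tagged_utterance)

def mean_length_of_utterance(utterances):
    """MLU in words: average word count across a child's utterances."""
    if not utterances:
        return 0.0
    return sum(len(u) for u in utterances) / len(utterances)

child = [
    [("the", "DET"), ("dog", "NOUN"), ("ran", "VERB"), ("away", "ADV")],
    [("because", "CONJ"), ("he", "PRON"), ("was", "AUX"), ("scared", "ADJ")],
]
print(content_word_proportion(child[0]))  # 0.75
print(mean_length_of_utterance(child))    # 4.0
```

In practice the proportions would of course be computed over POS-tagged corpus transcripts rather than hand-built lists.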

What questions did teachers ask? Teachers appear to be following the recommended guidelines and asking a lot of wh-word questions: these typically make up around 20% of the total questions asked in ordinary adult conversation, but made up 40% of the total questions asked by teachers in our spoken classroom interactions.

How did questions influence children’s response quality?

We first demonstrated that wh-word questions typically increased response quality, whereas confirmative questions typically decreased it. However, in an examination of the subcategories of wh-word questions, we found that the positive influence of wh-word questions on children’s language was driven by wh-adverbs (predominantly ‘why’ and ‘how’), and not by wh-determiners and pronouns (predominantly ‘what’). These findings applied across the wide age and ability range of the study, indicating that even teachers of beginner readers target inferential-level skills through guided reading discussion.

Summary.

Our findings are informative about what it means to ask ‘high quality’, ‘high challenge’, and/or ‘open-ended’ questions. Specifically, teachers and teacher trainers should be made aware of the effect of various syntactic forms of questions, particularly the nuances of wh-word questions: our findings indicate that ‘why’ and ‘how’ wh-word questions are most effective in fostering complex language production in children.

What’s next for Liam?

I am now working as a postdoctoral fellow at the University of Alberta, Canada! My new work examines children’s understanding of sentences containing pronouns. Children who take part in our study will wear glasses that monitor their eye-movement patterns while a picture book is narrated to them. It has been a pleasure to work on the CASS guided reading project, and we are going to continue using the corpus for new investigations into classroom interactions!

 


 

ESRC Postdoctoral Fellowship: The psychological validity of non-adjacent collocations

Having recently completed my PhD in CASS, I am really excited to announce that I have been awarded an ESRC Postdoctoral Fellowship for the upcoming academic year.

My research focuses on finding neurophysiological evidence for the existence of collocations, i.e. sequences of two or more words where the words are statistically highly likely to occur together. There are many different types of collocation, and these vary along the dimensions of fixedness and compositionality. Idioms, for example, are highly fixed in the sense that one word cannot typically be substituted for another word. They are also non-compositional, which means that the meaning of the expression cannot be derived from knowing the meaning of the component words.

Previous studies investigating the psychological validity of collocation have tended to focus on idioms and other highly fixed expressions. However, this massively limits the generalizability of the findings. In my research, I therefore use a much more fluid conceptualization of collocation, where sequences of words can be considered to be collocational even if they are not fixed, and even if the meaning of the expression is highly transparent. For example, the word pair clinical trials is a collocation, despite lacking the properties of fixedness and non-compositionality, because the word trials is highly likely to follow the word clinical. In this way, I focus on the transition probabilities between words; the transition probability of clinical trials (as measured in a corpus) is much higher than the transition probability of clinical devices, even though the latter word pair is completely acceptable in English, both in terms of meaning and grammar.
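The notion of transition probability can be illustrated with a toy example. The sketch below estimates P(trials | clinical) from bigram and unigram counts; the miniature "corpus" is invented for illustration and is obviously nothing like the BNC in scale.

```python
# Illustrative sketch: the transition probability of "trials" following
# "clinical", estimated from bigram and unigram counts in a toy corpus.
from collections import Counter

tokens = ("clinical trials show that clinical trials work and "
          "clinical devices help").split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))

def transition_probability(w1, w2):
    """P(w2 | w1): how likely w2 is to follow w1 in the corpus."""
    return bigrams[(w1, w2)] / unigrams[w1] if unigrams[w1] else 0.0

print(transition_probability("clinical", "trials"))   # 2/3
print(transition_probability("clinical", "devices"))  # 1/3
```

Even in this toy corpus, ‘clinical trials’ has the higher transition probability, although ‘clinical devices’ is perfectly acceptable English.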

In my research, I extract collocational word pairs such as clinical trials from the written BNC1994. I then construct matched non-collocational word pairs such as clinical devices, embed the two sets of word pairs into corpus-derived sentences, and then ask participants to read these sentences on a computer screen while electrodes attached to their scalp detect some of their brain activity. This method of recording the electrical activity of the brain using scalp electrodes is known as electroencephalography, or EEG. More specifically, I use the event-related potential (ERP) technique of analysing brainwave data, where the brain activity is measured in response to a particular stimulus (in this case, collocational and non-collocational word pairs).

My PhD consisted of four ERP experiments. In the first two, I investigated whether or not collocations and non-collocations are processed differently (at the neural level) by native speakers of English. In the third experiment, I did the same with non-native speakers of English. Having found that there are indeed neurophysiological differences in the way that collocations and non-collocations are processed by both native and non-native speakers, I then conducted a fourth experiment to investigate which measures of collocation strength most closely correlate with the brain response. The results of this experiment have really important implications for the field of corpus linguistics, as I found that the two most widely used measures of collocation strength (namely log-likelihood and mutual information) are actually the two that seem to have the least psychological validity.

The ESRC Postdoctoral Fellowship is unique in that, although it allows for the completion of additional research, the main focus is actually on disseminating the results of the PhD. Thus, during my year as an ESRC Postdoctoral Fellow, I intend to publish the results of my PhD research in high-impact journals in the fields of corpus linguistics and cognitive neuroscience. I will also present my findings at conferences in both of these fields, and I will attend training workshops in other neuroscientific methods.

The additional research that I intend to do during the term of the Fellowship will build upon my PhD work by using the ERP technique to investigate whether or not the neurophysiological difference in the processing of collocations vs. non-collocations is still apparent when the (non-)collocations contain intervening words. For instance, I want to find out whether or not the collocation take seriously is still recognized as such by the brain when there is one intervening word (e.g. take something seriously) or two intervening words (e.g. take the matter seriously), and so on.

Investigating the processing of these non-adjacent collocations is important for the development of linguistic theory. While my PhD thesis focused on word pairs rather than longer sequences of words in order to reduce the number of factors that might influence how the word sequences were processed, making it feasible to conduct controlled experiments, this is actually a very narrow way of conceptualizing the notion of collocation; in practice, words are considered to form collocations when they occur in one another’s vicinity even if there are several intervening words, and even if the words do not always occur in the same order. I will therefore use the results of this additional research to inform the design of research questions and methods for future work engaging with yet more varied types of collocational pattern. This will have important implications for our understanding of how language works in the mind.

I would like to conclude by expressing my gratitude to the ESRC for providing funding for this Fellowship. I am very grateful to be given this opportunity to disseminate the results of my PhD thesis, and I am very excited to carry out further research on the psychological validity of collocation.

Compiling a trilingual corpus to examine the political and social representation(s) of ‘people’ and ‘democracy’

I am a visiting researcher at CASS from mid-October 2017 until the end of August 2018, coming from the University of Athens, where I am Associate Professor of Corpus Linguistics and Translation. My research aim is to investigate critical aspects of populist discourses in Europe and their variation, especially during and after the 2008 financial (and subsequently social and political) crisis, and to reveal patterns of similarity and difference (and, tentatively, of interconnectedness and intertextuality) across a wide spectrum of political parties, think tanks and organisations. This being essentially a Corpus-Assisted Discourse Study (CADS), a first way into the data is to identify and statistically analyse collocational patterns and networks built around key lexemes (e.g. ‘people’, ‘popular’, ‘democracy’), before moving on to critically correlate such quantitative findings with the social and political backdrop(s) and crucial milestones.

 

The first task of this complex corpus-driven effort, which is now complete, has been to compile a large-scale trilingual (EN, FR, EL) ‘focus’ corpus. This has been a tedious technical process: before the data could be examined in a consistent manner, several problems needed to be addressed and solutions had to be implemented, as outlined below.

 

  1. As a key primary aim was to gather as much data as possible from the websites of political parties, political organisations, think tanks and official party newspapers from the UK, France and Greece, it was clear from the outset that it would not be possible to cull the corpus data manually, given the sheer number of sources and texts. On the other hand, automatic corpus compilation tools (e.g. BootCaT and WebBootCaT in SketchEngine) could not handle the extent and diversification of the corpora. To address this problem, texts were culled using web crawling techniques (‘wget -r’ in Linux bash) and the HTTrack app, with a lot of tweaking and the necessary customisation of download parameters, to account for the (sometimes very tricky) batch download restrictions of some websites.
  2. The second task was to clean up html boilerplate (i.e., sections of code, advertising material, etc. that are included in html pages but are irrelevant to the corpus texts). This was accomplished using Justext (the app used by M. Davies to compile the NOW corpus), with a few tweaks so as to be able to handle some ‘malformed’ data, especially from Greek sources.

As I plan to specifically analyse the variation of key descriptors and qualifiers (‘key’ keywords and their c-collocates) as a first way into the “charting” of the discourses at hand, the (article or text) publication date is a critical part of the corpus metadata, one that needs to be retained for further processing. However, most if not all of this information is practically lost in the web crawling and boilerplating stages. Therefore, the html clean-up process was preceded by the identification and extraction of the articles’ publication dates, using a php script that was developed with the help of Dr. Matt Timperley (CASS, Lancaster) and Dr. Pantelis Mitropoulos (University of Athens). This script scans all files in a dataset, accounts for all possible date formats in all three languages, and then automatically creates a csv (tab-delimited) table that contains the extracted date(s), matched with the respective filenames. Its accuracy is estimated at ca. 95%, and can be improved further, by checking the output and rescanning the original data with a few code tweaks.
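As a rough illustration of what such a date-extraction step involves, here is a simplified sketch in Python. The project’s actual script was written in php and handled all possible date formats across the three languages; the two patterns, the file layout and the `.txt` extension below are assumptions for illustration only.

```python
# Simplified sketch of the date-extraction step: scan each file in a
# dataset for the first date-like string and write a tab-delimited table
# of filename/date pairs. Patterns and file layout are illustrative.
import csv
import re
from pathlib import Path

DATE_PATTERNS = [
    re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),            # e.g. 2018-05-12
    re.compile(r"\b\d{1,2}[/.]\d{1,2}[/.]\d{4}\b"),  # e.g. 12/05/2018, 12.05.2018
]

def first_date(text):
    """Return the first date-like string found in the text, or ''."""
    for pattern in DATE_PATTERNS:
        match = pattern.search(text)
        if match:
            return match.group(0)
    return ""

def extract_dates(dataset_dir, out_tsv):
    """Scan every file in the dataset and write filename<TAB>date rows."""
    with open(out_tsv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        for path in sorted(Path(dataset_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8", errors="replace")
            writer.writerow([path.name, first_date(text)])
```

The accuracy-checking loop described above would then amount to inspecting the rows with empty or implausible dates and rescanning with adjusted patterns.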

  3. Streamline the data by removing irrelevant stretches of text (e.g. “Share this article on Facebook”) that were possibly left behind during the boilerplating process. This step uses Linux commands (e.g. find, grep, sed, awk) and regular expressions, and greatly improves the accuracy of the following step.
  4. Remove duplicate files: since onion (ONe Instance ONly, the script used e.g. in SketchEngine) only looks for pattern repetitions within a single file and within somewhat short maximum paragraph intervals, I used FSLint, an application that takes account of the files’ MD5 signature and identifies duplicates. This is extremely accurate and practically eliminates all files with one hundred percent text repetition across various sections of the websites, regardless of file name or creation date (in fact, this was found to be the case mostly with political party websites, not newspapers). (NB: a similar process is also available in Mike Scott’s WordSmith Tools v7.)
  5. Order files by publication year for each subcorpus, calculate the corresponding metadata (files, tokens, types and average token count, by year) for each dataset, and filter the “focus corpus”, i.e. the files containing the node lemmas (lemmas related to the core question of this research: people*|popular|democr*|human* and their FR and EL equivalents), using grep and regular expressions. (Note that FAR, an open-source, java-based GUI app, combines these search options for large datasets.)
  6. Finally, prepare the data for uploading to LU’s CQPWeb by appending the text publication year, as extracted at stage 2, to the corresponding raw text file; this was done using yet another php script, kindly developed by Matt Timperley.
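The MD5-based duplicate detection described above can be sketched as follows. This is only an illustration of the principle: FSLint itself does more (e.g. pre-filtering candidates by file size before hashing), and the directory layout is assumed.

```python
# Sketch of the duplicate-removal step: like FSLint, group files by the
# MD5 digest of their contents, so that byte-identical files are flagged
# regardless of file name or creation date.
import hashlib
from collections import defaultdict
from pathlib import Path

def find_duplicates(dataset_dir):
    """Map each MD5 digest to the list of files sharing that content."""
    by_digest = defaultdict(list)
    for path in sorted(Path(dataset_dir).rglob("*")):
        if path.is_file():
            digest = hashlib.md5(path.read_bytes()).hexdigest()
            by_digest[digest].append(path)
    # keep only digests shared by two or more files
    return {d: ps for d, ps in by_digest.items() if len(ps) > 1}
```

Because the digest is computed over the raw bytes, this only catches files with one-hundred-percent repetition, which is exactly the behaviour described above.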

 

In a nutshell, texts were culled from a total of 68 sources (24 Greek, 26 British, and 18 French). This dataset is divided into three major corpora, as follows:

  1. Cumulative corpus (CC, all data): 746,798 files/465,180,684 tokens.
  2. Non-journalistic research corpus (RC): 419,493 files/307,231,559 tokens.
  3. Focus corpus (FC): 205,038 files/235,235,353 tokens.

Is Academic Writing Becoming More Colloquial?

Have you noticed that academic writing in books and journals seems less formal than it used to? Preliminary data from the Written BNC2014 shows that you may be right!

Some early data from the academic journals and academic books sections of the new corpus has been analysed to find out whether academic writing has become more colloquial since the 1990s. Colloquialisation is “a tendency for features of the conversational spoken language to infiltrate and spread in the written language” (Leech, 2002: 72). The colloquialisation of language can make messages more easily understood by the general public because, whilst not everybody is familiar with the specifics of academic language, everyone is familiar with spoken language. In order to investigate the colloquialisation of academic writing, the frequencies of several linguistic features which have been associated with colloquialisation were compared in academic writing in the BNC1994 and the BNC2014.

Results show that, of the eleven features studied, five have shown large changes in frequency between the BNC1994 and the BNC2014, pointing to the colloquialisation of academic writing. The use of first and second person pronouns, verb contractions, and negative contractions has previously been found to be strongly associated with spoken language. These features have all increased in academic language between 1994 and 2014. Passive constructions and relative pronouns have previously been found to be strongly associated with written language, and are not often used in spoken language. This analysis shows that both of these features have decreased in frequency in academic language in the BNC2014.
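For readers curious about the arithmetic behind such comparisons: raw counts are typically normalised per million words, so that corpora of different sizes can be compared directly. All counts in the sketch below are invented for illustration; they are not the study’s figures.

```python
# Sketch of a corpus frequency comparison: normalise raw counts per
# million words, then compute the relative change between two corpora.
# All numbers here are invented for illustration.
def per_million(raw_count, corpus_tokens):
    """Normalised frequency: occurrences per million words."""
    return raw_count * 1_000_000 / corpus_tokens

def percent_change(freq_old, freq_new):
    """Relative change between two normalised frequencies."""
    return (freq_new - freq_old) * 100 / freq_old

f_1994 = per_million(1_500, 10_000_000)  # e.g. a feature in the BNC1994
f_2014 = per_million(2_700, 12_000_000)  # the same feature in the BNC2014
print(round(f_1994), round(f_2014), round(percent_change(f_1994, f_2014)))
# prints: 150 225 50  (i.e. a 50% increase in normalised frequency)
```

Normalisation matters here because the 1994 and 2014 samples are not the same size: comparing raw counts directly would conflate corpus size with frequency change.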

Figure 1: Frequency increases indicating the colloquialisation of academic language.

Figure 2: Frequency decreases indicating the colloquialisation of academic language.

These frequency changes were also compared for each genre of academic writing separately. The genres studied were: humanities & arts, social science, politics, law & education, medicine, natural science, and technology & engineering. An interesting difference between some of these genres emerged. It seems that the ‘hard’ sciences (medicine, natural science, and technology & engineering) have shown much larger changes in some of the linguistic features studied than the other genres have. For example, Figure 3 shows the difference in the percentage increase of verb contractions for each genre, and clearly shows a difference between the ‘hard’ sciences and the social sciences and humanities subjects.


Figure 3: % increases in the frequency of the use of verb contractions between 1994 and 2014 for each genre of academic writing.

This may lead you to think that medicine, natural science, and technology & engineering writing has become more colloquial than the other genres, but this is in fact not the case. Looking more closely at the data shows us that these ‘hard’ science genres were actually much less colloquial than the other genres in the 1990s, and that the large change seen here is actually a symptom of all genres becoming more similar in their use of these features. In other words, some genres have not become more colloquial than others; they have simply had to change more in order for all of the genres to become more alike.

So it seems from this analysis that, in some respects at least, academic language has certainly become more colloquial since the 1990s. The following is a typical example of academic writing in the 1990s, taken from a sample of a natural sciences book in the BNC1994. It shows avoidance of using first or second person pronouns and contractions (which have increased in use in the BNC2014), and shows use of a passive construction (the use of which has decreased in the BNC2014).

Experimentally one cannot set up just this configuration because of the difficulty in imposing constant concentration boundary conditions (Section 14.3). In general, the most readily practicable experiments are ones in which an initial density distribution is set up and there is then some evolution of the configuration during the course of the experiment.

It is much more common nowadays to see examples such as the following, taken from an academic natural sciences book in the BNC2014. This example contains active sentence constructions, first person pronouns, verb contractions, negative contractions, and a question.

No doubt people might object in further ways, but in the end nearly all these replies boil down to the first one I discussed above. I’d like to return to it and ponder a somewhat more aggressive version, one that might reveal the stakes of this discussion even more clearly. Very well, someone might say. Not reproducing may make sense for most people, but my partner and I are well-educated, well-off, and capable of protecting our children from whatever happens down the road. Why shouldn’t we have children if we want to?

It will certainly be interesting to see if this trend of colloquialisation can be seen in other genres of writing in the BNC2014!


Would you like to contribute to the Written BNC2014?

We are looking for native speakers of British English to submit their student essays, emails, and Facebook and WhatsApp messages for inclusion in the corpus! To find out more, and to get involved, click here. All contributors will be fully credited in the corpus documentation.

British National Corpus 2014: A sociolinguistic book is out

Have you ever wondered what real spoken English looks like? Have you ever asked whether people from different backgrounds (based on gender, age, social class, etc.) use language differently? Have you ever thought it would be interesting to investigate how much English has changed over the last twenty years? All these questions can be answered by looking at language corpora such as the Spoken BNC2014 and analysing them from a sociolinguistic perspective. Corpus Approaches to Contemporary British Speech: Sociolinguistic Studies of the Spoken BNC2014 is a book which offers a series of studies providing a unique insight into a number of topics, ranging from Discourse, Pragmatics and Interaction to Morphology and Syntax.

This is, however, only the first step. We are hoping that there will be many more studies to come based on this wonderful dataset. If you want to start exploring the Spoken BNC2014 corpus, it is just three mouse clicks away:

Get access to the BNC2014 Spoken

  1. Register for free and log on to CQPweb.
  2. Sign-up for access to the BNC2014 Spoken.
  3. Select ‘BNC2014’ in the main CQPweb menu.

Also, right now there is a great opportunity to take part in the Written BNC2014 project, a written counterpart to the Spoken BNC2014. If you’d like to contribute to the Written BNC2014, please check out the project’s website for more information.

Learn about the BNC2014, scan a book sample and contribute to the corpus…

On Saturday 12 May 2018, CASS hosted a small training event at Lancaster University for a group of participants, who came from different universities in the UK. We talked about the BNC2014 project and discussed both the theoretical underpinnings and the practicalities of corpus design and compilation. Slides from the event are available as pdf here.

The participants then tried in practice what is involved in the compilation of a large general corpus such as the BNC2014. They selected and scanned samples of books from current British fiction, poetry and a range of non-fiction books (history, popular science, hobbies etc.). Once processed, these samples will become a part of the written BNC2014.

Here are some pictures from the event:

Carmen Dayrell and Vaclav Brezina before the event

Elena Semino welcoming participants

In the computer lab: Abi Hawtin helping participants


A box full of books

If you are interested in contributing to the written BNC2014, go to the project website to find out about different ways in which you can participate in this exciting project.

The event was supported by ESRC grant no. EP/P001559/1.

Triangulating findings from a corpus-based study: An interview with Adil Ray (creator of Citizen Khan)

My doctoral thesis investigated the ‘Construction of Identities in the BBC Sitcom Citizen Khan’ by analysing a corpus of over 40,000 words, which consisted of transcripts from all the episodes of the show within the first two seasons. In my analysis, there were a number of instances where I had made some assumptions in relation to the motivations of the scriptwriters when incorporating certain scenes into the programme. Therefore, in order to triangulate my findings, I decided to interview the content creator (Adil Ray) and ascertain from him if my assumptions had been correct. (Transcript of the full interview is available at: https://allaboutcorpora.com/adil-ray-interview)

It would not be feasible in this short article to highlight all the various points discussed within the interview and how they correlated with my findings. However, I aim to provide at least one very vivid example, which highlights the importance of triangulating especially the qualitative aspects of a corpus-based analysis. Readers who have viewed the sitcom Citizen Khan may have noticed the ‘running gag’ throughout the series, where the mosque manager Dave (a Caucasian convert to Islam) would offer the Islamic salutation (As Salaamu Alaikum) to Mr Khan (a Pakistani British Muslim) and Khan would respond to him by saying ‘hello Dave’.

In total, there were ten instances of this exchange in the corpus; and in my thesis (pp. 136-145), I discussed in some detail the various textual evidence from the Quran and Prophetic Traditions (Hadeeth) that indicates the importance of greetings within Islam and the etiquette involved when giving or responding to an Islamic salutation. Taking this contextual information into consideration, I concluded that the scriptwriters had incorporated this gag into the show to indicate that Mr Khan did not consider Dave to be a ‘real’ or ‘proper’ Muslim. Furthermore, I argued that they were also trying to highlight how Muslim converts are part of an ‘out-group’, which can sometimes be ostracised by those who are born Muslim.

During my interview with Ray, I questioned him on the significance of Mr Khan responding to Dave’s Islamic greeting with ‘hello’. Ray said that from his own point of view, it could be seen as signifying that Khan was not comfortable with ‘this white chap coming in and being the manager of the mosque and being a Muslim’. However, Ray then went on to state explicitly that he believed Khan’s main grievance was that Dave, rather than Khan himself, had the job of mosque manager:

‘I don’t think it was the fact that he was uncomfortable necessarily with his race, it’s not that, but he was uncomfortable that there was somebody who was a manager and probably had more authority than Khan and in a way he was better than Khan at managing and doing things. And that was the thing that riled Khan, Khan probably thought he would be that person, he would have that job.’

I then specifically mentioned that there is a Quranic instruction dictating how an Islamic greeting should be responded to, and thus the interaction could indicate that Khan did not see Dave as a full Muslim. However, Ray did not seem to fully entertain this notion and once again went on to stress that he believed Khan’s main gripe with Dave was over the mosque manager position, and that ultimately Khan would always be there for Dave.

The main premise of my argument in the thesis was that the scriptwriters chose to highlight Khan’s usage of ‘hello’ to clearly indicate that he did not fully accept Dave as a Muslim. However, after my discussion with Ray, it became abundantly clear that this was not in fact their primary motivation. Citizen Khan has three co-writers (Anil Gupta, Richard Pinto, Adil Ray), with Ray being the only Muslim amongst them; thus, if the intention was to signify a contravention of Islamic etiquette, it is presumed he would be the most aware of it of the three.

Thus, triangulating my research findings in this manner highlighted that, when engaging in linguistic analysis, an analyst may at times ‘over-analyse’ language usage in their pursuit of extracting ‘meaning behind the text’.


Bilal Kadiri is an Assistant Professor at King Khalid University and completed his PhD at Lancaster University under the supervision of CASS’s Paul Baker. He can be contacted by email at bilal(Replace this parenthesis with the @ sign)allaboutcorpora.com