A cognitive scientist’s perspective on taking the CorpusMOOC

Rose Hendricks, a researcher at the Frameworks Institute in Washington D.C., shares her experience of taking the CorpusMOOC:

‘I’m a social science researcher and have been curious for a while how we can learn more about human culture and cognition by looking at large collections of language — so I jumped at the opportunity to take the Corpus Linguistics online course by Lancaster University.

The course had a great mix of videos, readings, and activities, and covered topics in just the right amount of detail. There was enough information to get a good sense of how corpus linguistics methods can be used in a huge range of ways, from addressing questions in sociolinguistics to developing textbooks, dictionaries, and resources for language learners.

Conversations with researchers who use corpus linguistics methods gave us an even deeper sense of the interesting and important topics that benefit from tools to extract patterns from huge amounts of text.

Throughout the course, I came up with many ideas I plan to explore with the methods we learned about, especially #LancsBox, a tool that helps researchers analyze and visualize their language data.

I would recommend this course to people with any level of background knowledge on the topic — there’s something for everyone.’

Introductory Blog – Luke Collins

I am delighted to have joined the CASS team as Senior Research Associate and will be working across the new programme of studies in Corpus Approaches to Health(care) Communication. I have already begun working on a fascinating strand exploring the Narratives of Voice-hearers and I will be working closely with Professor Elena Semino in applying corpus methods to see what effects a therapeutic intervention has on the experiences of those who hear distressing voices – and how they articulate these experiences – over time. More broadly, we will be examining representations of mental health and illness in the media, looking to address issues of stigmatisation and support public awareness and understanding.

Working towards the application of corpus linguistics and the findings of corpus analysis to health services is a great motivation for me and I am thrilled to have the opportunity to build on my previous work in this area. I have published work on the experiences of people undergoing a therapeutic intervention and demonstrated how corpus approaches can help to capture some of the complexities of those experiences. I have also implemented corpus analyses to investigate discussions of complex global issues in the news media (specifically, climate change and antimicrobial resistance), thinking about public understanding and how media reporting can help readers to comprehend their role in such issues. I have recently been working on my volume in the Routledge 'Corpus Linguistics for...' series, focusing on applications of corpus tools for analysing different types of online communication, and hope to announce its release early next year. Throughout my work, I have endeavoured to raise awareness of corpus methods outside of the discipline and create opportunities to work with collaborators from various backgrounds. I am glad to find that in my role with CASS, this can continue!

Outside of my work, I have a reputation for hand-made greeting cards and I am an avid record collector. Since moving to Lancaster, I have been exploring the local area and discovering what a picturesque part of the country this is. I don't even mind the rain!

Statistics in (Higher) Education: A few thoughts at the beginning of the new academic year

As happens every year around this time, university campuses are buzzing with students who are starting their studies or returning after the summer break – this incredible transformation pours life into buildings as empty spaces become lecture theatres, seminar rooms and labs. Students have the opportunity to learn many new things about the subject they chose to study and also to engage with the academic environment more generally. Among the educational and development opportunities students have at university, one transferable skill stands out: statistical literacy.

Numbers are an essential part of our everyday life. We count the coins in our pocket, the minutes before the next bus arrives or the sunny days in a rainy year. Numbers and quantitative information are also very important for students and educators. Statistical literacy – the ability to produce and interpret quantitative information – belongs to the basic set of academic skills that, despite its importance, may not always receive the attention it deserves.

Many students (and academics) are afraid of statistics – think about what your first reaction is to the equation in Figure 1 below.

Figure 1: The equation of standard deviation (mathematical form)
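
In its standard textbook form (allowing for notational differences in the original figure), the equation reads:

\[ s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}} \]

where \(x_i\) are the individual values, \(\bar{x}\) is their mean and \(n\) is the number of values.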

 

This is because statistics is often misconstrued as the art of solving extremely complicated equations, or as some mysterious magic with numbers. Statistics, however, is first and foremost about understanding and making sense of numbers and quantitative information. For this, we need to learn the basic principles of collecting, organising and interpreting quantitative information. Critical thinking is thus much more important for statistics than number-crunching ability. After all, computers are very good at processing numbers and solving equations, and we can happily leave this task to them. For example, many statistical tasks, even complex ones, can be carried out using tools such as the Lancaster Stats Tool online, where the researcher can simply copy-paste their data (in an appropriate format) and press one button to receive the answer.

Humans, on the other hand, outperform computers in interpretation skills. This is because we have knowledge of the context in which numbers appear and can therefore evaluate the relative importance of different quantitative results. We as teachers, linguists, sociologists, scientists etc. can provide the underlying meaning to numbers and equations and relate them to our experience and knowledge of the field. For example, the equation in Figure 1 can be simplified as follows:

Figure 2: The equation of standard deviation (conceptual)
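
Roughly along the lines of the conceptual figure, the same equation can be spelt out in words:

\[ \text{standard deviation} = \sqrt{\frac{\text{sum of squared distances of each value from the mean}}{\text{number of values} - 1}} \]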

When we relate this to what we know about the world, we can see that the question we are asking in Figure 2 is how much variation there is in our data, a question about variability, difference in tendencies and preferences and overall diversity. This is something that we can relate to in our everyday experience: Will I ever find a twenty-pound note in my pocket? Is the wait for the bus longer in the evening? Is the number of sunny days different every year? When talking about statistics in education, I consider the following point crucial: as with any subject matter, it is important to connect statistical thinking and statistical literacy with our daily experience.

To read more about statistics for corpus linguistics, see Brezina, V. (2018). Statistics in Corpus Linguistics: A Practical Guide. Cambridge University Press.

Introductory Blog – Gavin Brookes

This is the second time I have been a part of CASS, which means that this is the second time I’ve written one of these introductory blog pieces. I first worked in CASS in 2016, on an eight-month project with Paul Baker where we looked at the feedback that patients gave about the NHS in England. This was a really fun project to work on – I enjoyed being a part of CASS and working with Paul and made some great friends in the Centre with whom I’m still in contact to this day. After leaving CASS in October 2016, I completed my PhD in Applied Linguistics in the School of English at the University of Nottingham, which examined the ways that people with diabetes and eating disorders construct their illnesses and identities in online support groups. Following my PhD, I stayed in the School of English at Nottingham, working as a Postdoctoral Research Fellow in the School’s Professional Communication research and consultancy unit.

As you might have guessed from the topic of my doctoral project and my previous activity with CASS, my main research interests are in the areas of corpus linguistics and health communication. I am therefore very excited to return to the Centre now, with its new focus on the application of corpora to the study of health communication. I’m currently working on a new project within the Centre, Representations of Obesity in the News, which explores the ways that obesity and people affected by obesity are represented in the media, focussing in particular on news articles and readers’ responses. I’m very excited to be working on this important project. Obesity is a growing and seemingly ever-topical public health concern, not just in the UK but globally. However, the media’s treatment of the issue can often be stigmatising, making it quite deserving of scrutiny! Yet, our aim in this project isn’t just to take the media to task, but to eventually work with media outlets to advise them on how to cover obesity in a way that is more balanced and informative and, crucially, less stigmatising for people who are affected by it. In this project, we’re also working with obesity charities and campaign groups, which provides a great opportunity to make sure that the focus of our research is not just fit for academic journals but is relevant to people affected by this issue and so can be applied in the ‘real world’, as it were.

So, to finish on more of a personal note, the things I said about myself the last time I wrote one of these blog posts are still true; I still like walking, I still travel lots, I still read fantasy and science fiction, I still do pub quizzes, my football team are still rubbish and I don’t think I’ve changed that much since the photo used in that piece was taken… Most of all, though, it still excites me to be a part of CASS and I am absolutely delighted to be back.

 

Using corpus methods to identify teacher strategies in guided reading: what questions do teachers ask, and what works?

In previous blogs on the CASS guided reading project, we have introduced our investigation into one of the most prevalent techniques recommended to engage children in discussion: strategic questioning. We can now reveal our key findings, which focus on the effect of wh-word questioning techniques on children’s responses.

Background

Guidelines encourage teachers to ask ‘high challenge’ or ‘open-ended’ questions. However, these are often considered too vague for teachers to implement.

How did we examine teacher questions? One way to be more specific about the nature of the questions is to label them by their typical syntactic forms. There are two main question categories:

  • Wh-word questions (high challenge) pose few constraints on the answer, e.g. ‘how’, ‘when’, ‘where’, ‘why’, ‘which’, ‘what’, ‘who’.
  • Confirmative questions (low challenge) presuppose more information and so pose greater constraints, e.g. ‘Does Mary prefer chocolate or fruit?’

Also, wh-word questions can be split into subcategories:

  • Wh-adverbs (high challenge: ‘how’, ‘why’)
  • Wh-determiners/pronouns (low challenge: ‘what’)

How did we measure children’s response quality? We used three indicators:

  • Grammatical complexity. We calculated the proportion of content vs. function words. Content words carry real meaning, and include nouns, verbs, adverbs and adjectives. Function words do not carry real meaning and instead offer grammatical information (e.g., auxiliary ‘be’ verbs: ‘is’, ‘am’, ‘are’). A higher proportion of content words is an indicator of greater grammatical complexity (a rough sketch of how this and MLU can be computed follows this list).
  • MLU. Mean length of utterance in words is the most common indicator of syntactic complexity in children’s speech.
  • Causal language (e.g., ‘because’, ‘so’).
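
As an illustration only (this is a minimal sketch, not the project’s analysis pipeline; the coarse tag labels and the example response are invented), the first two indicators can be computed from POS-tagged utterances roughly as follows:

```python
# Minimal sketch: content-word proportion and MLU for child responses.
from typing import List, Tuple

CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}  # content words carry "real" meaning

def content_word_proportion(tagged: List[Tuple[str, str]]) -> float:
    """Proportion of content words (nouns, verbs, adjectives, adverbs) in one response."""
    if not tagged:
        return 0.0
    content = sum(1 for _, tag in tagged if tag in CONTENT_TAGS)
    return content / len(tagged)

def mlu_in_words(utterances: List[List[str]]) -> float:
    """Mean length of utterance (MLU) in words across a child's responses."""
    if not utterances:
        return 0.0
    return sum(len(u) for u in utterances) / len(utterances)

# Invented example response, tagged with universal-style POS labels.
response = [("the", "DET"), ("dog", "NOUN"), ("ran", "VERB"), ("away", "ADV"),
            ("because", "SCONJ"), ("it", "PRON"), ("was", "AUX"), ("scared", "ADJ")]
print(content_word_proportion(response))               # 0.5 (4 of 8 words are content words)
print(mlu_in_words([[word for word, _ in response]]))  # 8.0
```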

What questions did teachers ask? Teachers are paying attention to the recommended guidelines to ask plenty of wh-word questions: such questions typically make up around 20% of the questions asked in normal adult conversation, but accounted for 40% of the questions teachers asked in our spoken classroom interactions.

How did questions influence children’s response quality?

We first demonstrated that wh-word questions typically increased response quality, whereas confirmative questions typically decreased it. However, when we examined the subcategories of wh-word questions, we found that the positive influence of wh-word questions on children’s language was driven by wh-adverbs (predominantly ‘why’ and ‘how’) rather than by wh-determiners and pronouns (predominantly ‘what’). These findings applied across the wide age and ability range of the study, indicating that even teachers of beginner readers target inferential-level skills through guided reading discussion.

Summary

Our findings are informative about what it means to ask ‘high quality’, ‘high challenge’, and/or ‘open-ended’ questions. Specifically, teachers and teacher trainers should be made aware of the effect of various syntactic forms of questions, particularly the nuances of wh-word questions: our findings indicate that ‘why’ and ‘how’ wh-word questions are most effective in fostering complex language production in children.

What’s next for Liam?

I am now working as a postdoctoral fellow at The University of Alberta, Canada! My new work examines children’s understanding of sentences containing pronouns. Children who take part in our study will wear glasses that monitor their eye-movement patterns while a picture book is narrated to them. It has been a pleasure to work on the CASS guided reading project, and we are going to continue using the corpus for new investigations into classroom interactions!

 


 

ESRC Postdoctoral Fellowship: The psychological validity of non-adjacent collocations

Having recently completed my PhD in CASS, I am really excited to announce that I have been awarded an ESRC Postdoctoral Fellowship for the upcoming academic year.

My research focuses on finding neurophysiological evidence for the existence of collocations, i.e. sequences of two or more words where the words are statistically highly likely to occur together. There are a lot of different types of collocation, and the different types vary along the dimensions of fixedness and compositionality. Idioms, for example, are highly fixed in the sense that one word cannot typically be substituted for another word. They are also non-compositional, which means that the meaning of the expression cannot be derived from knowing the meaning of the component words.

Previous studies investigating the psychological validity of collocation have tended to focus on idioms and other highly fixed expressions. However, this massively limits the generalizability of the findings. In my research, I therefore use a much more fluid conceptualization of collocation, where sequences of words can be considered to be collocational even if they are not fixed, and even if the meaning of the expression is highly transparent. For example, the word pair clinical trials is a collocation, despite lacking the properties of fixedness and non-compositionality, because the word trials is highly likely to follow the word clinical. In this way, I focus on the transition probabilities between words; the transition probability of clinical trials (as measured in a corpus) is much higher than the transition probability of clinical devices, even though the latter word pair is completely acceptable in English, both in terms of meaning and grammar.
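
As a rough sketch (not the procedure used in the study; the mini-corpus below is invented, whereas the real estimates come from corpus data such as the BNC1994), a transition probability of this kind can be estimated from unigram and bigram counts:

```python
# Sketch: estimating P(w2 | w1), i.e. how likely w2 is to immediately follow w1.
from collections import Counter
from typing import List

def transition_probability(tokens: List[str], w1: str, w2: str) -> float:
    """P(w2 | w1) estimated from bigram and unigram frequencies in a token list."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if unigrams[w1] == 0:
        return 0.0
    return bigrams[(w1, w2)] / unigrams[w1]

# Toy example with an invented mini-corpus.
corpus = ("the clinical trials were completed before the clinical devices "
          "were approved and further clinical trials began").split()
print(transition_probability(corpus, "clinical", "trials"))   # 2/3 ≈ 0.67
print(transition_probability(corpus, "clinical", "devices"))  # 1/3 ≈ 0.33
```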

In my research, I extract collocational word pairs such as clinical trials from the written BNC1994. I then construct matched non-collocational word pairs such as clinical devices, embed the two sets of word pairs into corpus-derived sentences, and then ask participants to read these sentences on a computer screen while electrodes attached to their scalp detect some of their brain activity. This method of recording the electrical activity of the brain using scalp electrodes is known as electroencephalography, or EEG. More specifically, I use the event-related potential (ERP) technique of analysing brainwave data, where the brain activity is measured in response to a particular stimulus (in this case, collocational and non-collocational word pairs).

My PhD consisted of four ERP experiments. In the first two experiments, I investigated whether or not collocations and non-collocations are processed differently (at the neural level) by native speakers of English. In the third experiment, I did the same but with non-native speakers of English. Having found that there are indeed neurophysiological differences in the way that collocations and non-collocations are processed by both native and non-native speakers, I then conducted a fourth experiment to investigate which measures of collocation strength most closely correlate with the brain response. The results of this experiment have really important implications for the field of corpus linguistics, as I found that the two most widely-used measures of collocation strength (namely log-likelihood and mutual information) are actually the two that seem to have the least psychological validity.
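
For readers less familiar with these two measures, they are conventionally computed from observed and expected co-occurrence frequencies; the sketch below uses the standard textbook definitions (not necessarily the exact variants used in the thesis) and invented counts:

```python
# Sketch: standard formulas for mutual information and log-likelihood (G2)
# from a 2x2 contingency table of co-occurrence frequencies.
import math

def mutual_information(o11: float, r1: float, c1: float, n: float) -> float:
    """Pointwise mutual information: log2 of observed vs. expected co-occurrence.

    o11 = frequency of the word pair, r1 = frequency of word 1,
    c1 = frequency of word 2, n = corpus size in tokens.
    """
    expected = r1 * c1 / n
    return math.log2(o11 / expected)

def log_likelihood(o11: float, r1: float, c1: float, n: float) -> float:
    """Dunning's log-likelihood statistic (G2) for the same 2x2 table."""
    o12, o21 = r1 - o11, c1 - o11
    o22 = n - r1 - c1 + o11
    g2 = 0.0
    for observed, expected in [
        (o11, r1 * c1 / n),
        (o12, r1 * (n - c1) / n),
        (o21, (n - r1) * c1 / n),
        (o22, (n - r1) * (n - c1) / n),
    ]:
        if observed > 0:
            g2 += observed * math.log(observed / expected)
    return 2 * g2

# Invented counts for illustration only.
print(mutual_information(o11=80, r1=1_000, c1=2_000, n=10_000_000))  # ≈ 8.6
print(log_likelihood(o11=80, r1=1_000, c1=2_000, n=10_000_000))
```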

The ESRC Postdoctoral Fellowship is unique in that, although it allows for the completion of additional research, the main focus is actually on disseminating the results of the PhD. Thus, during my year as an ESRC Postdoctoral Fellow, I intend to publish the results of my PhD research in high-impact journals in the fields of corpus linguistics and cognitive neuroscience. I will also present my findings at conferences in both of these fields, and I will attend training workshops in other neuroscientific methods.

The additional research that I intend to do during the term of the Fellowship will build upon my PhD work by using the ERP technique to investigate whether or not the neurophysiological difference in the processing of collocations vs. non-collocations is still apparent when the (non-)collocations contain intervening words. For instance, I want to find out whether or not the collocation take seriously is still recognized as such by the brain when there is one intervening word (e.g. take something seriously) or two intervening words (e.g. take the matter seriously), and so on.
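
To make the idea of an intervening-word window concrete, a simplified search of the following kind (an illustrative sketch, not the actual extraction method) would treat take … seriously as a candidate pair whenever at most two words intervene:

```python
# Sketch: finding non-adjacent occurrences of a word pair within a small window.
from typing import List, Tuple

def non_adjacent_pairs(tokens: List[str], w1: str, w2: str,
                       max_gap: int = 2) -> List[Tuple[int, int]]:
    """Return (i, j) index pairs where w2 follows w1 with at most max_gap intervening words."""
    hits = []
    for i, tok in enumerate(tokens):
        if tok != w1:
            continue
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if tokens[j] == w2:
                hits.append((i, j))
    return hits

sentence = "we must take the matter seriously and act now".split()
print(non_adjacent_pairs(sentence, "take", "seriously"))  # [(2, 5)]: two intervening words
```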

Investigating the processing of these non-adjacent collocations is important for the development of linguistic theory. While my PhD thesis focused on word pairs rather than longer sequences of words in order to reduce the number of factors that might influence how the word sequences were processed, making it feasible to conduct controlled experiments, this is actually a very narrow way of conceptualizing the notion of collocation; in practice, words are considered to form collocations when they occur in one another’s vicinity even if there are several intervening words, and even if the words do not always occur in the same order. I will therefore use the results of this additional research to inform the design of research questions and methods for future work engaging with yet more varied types of collocational pattern. This will have important implications for our understanding of how language works in the mind.

I would like to conclude by expressing my gratitude to the ESRC for providing funding for this Fellowship. I am very grateful to be given this opportunity to disseminate the results of my PhD thesis, and I am very excited to carry out further research on the psychological validity of collocation.

Compiling a trilingual corpus to examine the political and social representation(s) of ‘people’ and ‘democracy’

I am a visiting researcher at CASS from mid-October 2017 until the end of August 2018, coming from the University of Athens, where I am Associate Professor of Corpus Linguistics and Translation. My research aim is to investigate critical aspects of populist discourses in Europe and their variation, especially during and after the 2008 financial (and then social and political) crisis, and to reveal patterns of similarity and difference (and, tentatively, of interconnectedness and intertextuality) across a wide spectrum of political parties, think tanks and organisations. This being essentially a Corpus-Assisted Discourse Study (CADS), a first way into examining the data is to identify and statistically analyse collocational patterns and networks built around key lexemes (e.g. ‘people’, ‘popular’, ‘democracy’, in this scenario), before moving on to critically correlating such quantitative findings with the social and political backdrop(s) and crucial milestones.

 

The first task of this complex corpus-driven effort, which is now complete, has been to compile a large-scale trilingual (EN, FR, EL) ‘focus’ corpus. This has been a tedious technical process: before the data could be examined in a consistent manner, several problems needed to be addressed and solutions had to be implemented, as outlined below.

 

  1. As a key primary aim was to gather as much data as possible from the websites of political parties, political organisations, think tanks and official party newspapers from the UK, France and Greece, it was clear from the outset that it would not be possible to cull the corpus data manually, given the sheer number of sources and texts. On the other hand, automatic corpus compilation tools (e.g. BootCaT and WebBootCaT in SketchEngine) could not handle the extent and diversification of the corpora. To address this problem, texts were culled using web crawling techniques (‘wget -r’ in Linux bash) and the HTTrack app, with a lot of tweaking and the necessary customisation of download parameters to account for the (sometimes very tricky) batch download restrictions of some websites.
  2. Clean up HTML boilerplate (i.e. sections of code, advertising material, etc. included in HTML pages that are irrelevant to the corpus text). This was accomplished using Justext (the app used by M. Davies to compile the NOW corpus), with a few tweaks so as to be able to handle some ‘malformed’ data, especially from Greek sources; a sketch of the generic usage follows this list.
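
For illustration, generic Justext usage (without the project-specific tweaks for malformed Greek pages) looks roughly like this; the URL is a placeholder:

```python
# Minimal sketch of generic Justext boilerplate removal.
import requests
import justext

response = requests.get("https://example.org/some-article.html")  # hypothetical URL
paragraphs = justext.justext(response.content, justext.get_stoplist("English"))

# Keep only the paragraphs Justext classifies as real content.
content = [p.text for p in paragraphs if not p.is_boilerplate]
print("\n".join(content))
```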

As I plan to specifically analyse the variation of key descriptors and qualifiers (‘key’ keywords and their c-collocates) as a first way into the “charting” of the discourses at hand, the publication date of each article or text is a critical part of the corpus metadata, one that needs to be retained for further processing. However, most if not all of this information is practically lost in the web crawling and boilerplate-removal stages. Therefore, the HTML clean-up process was preceded by the identification and extraction of the articles’ publication dates, using a PHP script developed with the help of Dr. Matt Timperley (CASS, Lancaster) and Dr. Pantelis Mitropoulos (University of Athens). This script scans all files in a dataset, accounts for all possible date formats in all three languages, and then automatically creates a tab-delimited csv table that contains the extracted date(s), matched with the respective filenames. Its accuracy is estimated at ca. 95%, and it can be improved further by checking the output and rescanning the original data with a few code tweaks.
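
The original script is written in PHP and covers many date formats across the three languages; as a much-simplified illustration of the same idea, a Python equivalent handling just one date format might look like this (file paths and the date pattern are hypothetical):

```python
# Simplified illustration of the date-extraction step: scan each file in a
# dataset and write filename/date pairs to a tab-delimited table.
import csv
import re
from pathlib import Path

# One illustrative pattern only: ISO-style dates such as 2017-10-15.
DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")

def extract_dates(dataset_dir: str, output_csv: str) -> None:
    """Write a tab-delimited table of (filename, first date found) for every file."""
    with open(output_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter="\t")
        for path in Path(dataset_dir).rglob("*.txt"):
            text = path.read_text(encoding="utf-8", errors="ignore")
            match = DATE_PATTERN.search(text)
            writer.writerow([path.name, match.group(0) if match else ""])

extract_dates("corpus_raw", "dates.tsv")  # hypothetical paths
```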

  3. Streamline the data by removing irrelevant stretches of text (e.g. “Share this article on Facebook”) that were possibly left behind during the boilerplate-removal process – this step is carried out using Linux commands (e.g. find, grep, sed, awk) and regular expressions and greatly improves the accuracy of the following step.
  4. Remove duplicate files: since onion (ONe Instance ONly, the script used e.g. in SketchEngine) only looks for pattern repetitions within a single file and within somewhat short maximum paragraph intervals, I used FSLint – an application that takes account of the files’ MD5 signatures and identifies duplicates (see the sketch after this list). This is extremely accurate and practically eliminates all files whose text is repeated one hundred percent across various sections of the websites, regardless of the file name or creation date (in fact, this was found to be the case mostly with political party websites, not newspapers). (NB: a similar process is also available in Mike Scott’s WordSmith Tools v7.)
  5. Order files by publication year for each subcorpus, calculate the corresponding metadata (files, tokens, types and average token count, by year) for each dataset, and then filter the “focus corpus” by retaining only files containing the node lemmas (i.e., lemmas related to the core question of this research: people*|popular|democr*|human* and their FR and EL equivalents), using grep and regular expressions – note that FAR, an open-source, Java-based GUI app, combines these search options for large datasets.
  6. Finally, prepare the data for uploading to LU’s CQPWeb by appending the text publication year info, as extracted in stage 2, to the corresponding raw text file – this was done using yet another PHP script, kindly developed by Matt Timperley.
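
By way of illustration only, the duplicate-removal and node-lemma filtering steps could be approximated along the following lines (the actual workflow used FSLint and grep; the directory name and the regex rendering of the wildcard patterns are assumptions):

```python
# Illustrative sketch: exact-duplicate removal via MD5 signatures and filtering
# for files that contain at least one node lemma.
import hashlib
import re
from pathlib import Path
from typing import List

# Assumed regex rendering of the node lemmas people*|popular|democr*|human*
# (FR and EL equivalents would be handled analogously).
NODE_PATTERN = re.compile(r"\b(people\w*|popular|democr\w*|human\w*)\b", re.IGNORECASE)

def deduplicate_and_filter(dataset_dir: str) -> List[Path]:
    """Return files that are unique (by MD5 signature) and contain a node lemma."""
    seen_signatures = set()
    focus_files = []
    for path in sorted(Path(dataset_dir).rglob("*.txt")):
        data = path.read_bytes()
        signature = hashlib.md5(data).hexdigest()
        if signature in seen_signatures:  # 100% text repetition: skip the duplicate
            continue
        seen_signatures.add(signature)
        if NODE_PATTERN.search(data.decode("utf-8", errors="ignore")):
            focus_files.append(path)
    return focus_files

print(len(deduplicate_and_filter("corpus_clean")))  # hypothetical directory
```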

 

In a nutshell, texts were culled from a total of 68 sources (24 Greek, 26 British, and 18 French). This dataset is divided into three major corpora, as follows:

  1. Cumulative corpus (CC, all data): 746,798 files/465,180,684 tokens.
  2. Non-journalistic research corpus (RC): 419,493 files/307,231,559 tokens.
  3. Focus corpus (FC): 205,038 files/235,235,353 tokens.