Corpus methods and multimodal data: A new approach

By William Dance, Alex Christiansen and Alexander Wild

Within corpus linguistics, multimodality is a subject which is often overlooked.

While there are multiple projects tackling multimodal interactional elements in corpora, such as the French interaction corpus RECOLA and the video meeting repository REPERE, corpus linguistic approaches generally struggle when faced with extra-textual content such as images. Until now, the only viable approach to including such content in a corpus has been manual image annotation, but this runs into two overarching issues. First, the visual modality is the most labour-intensive form of multimodal corpus annotation when performed ‘by hand’. Second, multimodal corpora are often limited in scope and therefore remain highly specialised, relying on relatively small datasets.

However, as Twitter and other social media are quickly becoming popular sources of natural text, it is important to recognise that ignoring images means ignoring a large portion of potential meaning. In the worst instances, texts become entirely meaningless without the context supplied by the image – take for example this relatively innocuous tweet about superheroes:

without its image content.

As opposed to with it.

The image in the example above forms part of the meaning-making process; without it, meaning is lost. Although evidence on the number of posts that include images is scarce, an engagement analysis sampling 1,000,000 posts from Twitter tentatively noted that 42% included an image.

As a step towards fixing this omission, we are introducing a new methodological tool to the corpus linguistic toolbox, tentatively named Visual Constituent Analysis or simply VCA. As the name implies, the approach draws from the concept of grammatical constituencies, presenting images as a series of individual semiotic constituents, which can then be shown in-line with any co-text found in the tweet. Using Google’s Cloud Vision API, VCA seeks to redress the issues raised earlier of scalability and scope by automating the annotation process and consequently widening the research scope, allowing studies to be extended to a much larger portion of multimodal data with very little extra work involved.

In addition to extending the scale of analysis, Vision also supplies information that would otherwise be missed by most annotators. This includes a function called web entities, which retrieves the set of all indexed web pages using a particular image and extracts the most representative keywords from the context in which the image was used. As an example, note that in the sample image below Vision detects only that the image contains a ‘journalist’/‘commentator’ and that there is a ‘photo caption’, while web entities highlight that the people in question are Sean Hannity and Mitch McConnell, as well as the fact that the image relates to Fox News, the Speaker of the United States House and the United States Congress.

Input

Output

Labels: Journalist; Commentator; Facial Expression; Person; Forehead; Photo Caption; Chin; Official
Web Entities: Sean Hannity; Mitch McConnell; FOX News; Kentucky; United States Senate; Republican Party; Capigruppo al Senato degli Stati Uniti d’America; Speaker of the United States House; United States Congress; Election; President of the United States
Document: “STOP WHINING AND GET TO WORK”
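
To picture how output like this might be shown in-line with a tweet's co-text, here is a minimal Python sketch. It is an illustration under stated assumptions, not the VCA format itself: the function name `render_inline`, the `img_constituent` tag and its `layer` attribute are all hypothetical, the annotations are typed in by hand from the output above rather than fetched from Cloud Vision, and the tweet text is a placeholder.

```python
from xml.sax.saxutils import escape, quoteattr


def render_inline(tweet_text, annotations):
    """Append one pseudo-XML element per image constituent to the tweet text.

    `annotations` maps an annotation layer (e.g. "label") to a list of
    strings. The tag and attribute names are hypothetical, chosen only
    for illustration.
    """
    parts = [escape(tweet_text)]
    for layer, values in annotations.items():
        for value in values:
            parts.append(
                f"<img_constituent layer={quoteattr(layer)}>"
                f"{escape(value)}</img_constituent>"
            )
    return " ".join(parts)


# Hand-copied subset of the sample output; not a live API response.
sample = {
    "label": ["Journalist", "Commentator", "Photo Caption"],
    "web_entity": ["Sean Hannity", "Mitch McConnell", "FOX News"],
    "document": ["STOP WHINING AND GET TO WORK"],
}

# Placeholder co-text standing in for the tweet's own wording.
print(render_inline("Placeholder tweet text", sample))
```

Keeping each constituent as a separate tagged element means a concordancer that can treat tags as searchable tokens could, in principle, query the image layers alongside the words of the tweet.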

 

While we recognise that there are obvious issues with allowing an algorithm to take over the task of annotating images, we posit that the same issues are inherent to human annotation, perhaps to an even larger degree. Within the traditional annotation method, a human element is required to process the non-textual data by hand, with implications of scale, consistency and knowledge-base. Vision offers vast scalability as well as the web indexing power of Google and consequently can help to analyse large multimodal datasets that would require teams of human annotators to process.

To test the viability of the approach as well as the reliability of the data-labelling supplied by Google’s neural network, we will use VCA to analyse the use of images in hostile-state information operations on social media, drawing on Twitter’s recently released Internet Research Agency dataset (T-IRA). T-IRA includes all the users identified by Twitter as being connected to Russian state-backed information operations and comprises more than 9 million tweets, including a database of more than 1.4 million images.

This project will test the viability of VCA as a method of corpus construction but will also provide insights into how information operations weaponise images on social media. Using VCA, we will seek to identify the strategies used in T-IRA to try to influence people’s political and social views. Drawing on studies of online disinformation as well as linguistic studies of manipulation, we will codify these strategies into a typology of online image-based manipulation.

CASS is strengthening its links with colleagues at the University of Mosul in Iraq

As reported in the media, in recent months we have been delighted to support staff and students at the University of Mosul in Iraq who are rebuilding the Department of English after the devastation caused by the so-called Islamic State group. Via the CorpusMOOC and other forms of long-distance support, we have begun to interact with colleagues in Mosul, and to appreciate both the size of the task ahead of them and their determination to succeed. We are now in the process of arranging a month-long visit to Lancaster from two Mosul academics, so that we can strengthen our ties, including by exploring joint projects. Watch this space for updates on the visit and our future joint activities.

A cognitive scientist’s perspective on taking the CorpusMOOC

Rose Hendricks, a researcher at the Frameworks Institute in Washington D.C., shares her experience of taking the CorpusMOOC:

‘I’m a social science researcher and have been curious for a while how we can learn more about human culture and cognition by looking at large collections of language — so I jumped at the opportunity to take the Corpus Linguistics online course by Lancaster University.

The course had a great mix of videos, readings, and activities, and covered topics in just the right amount of detail. There was enough information to get a good sense of how corpus linguistics methods can be used in a huge range of ways, from addressing questions in sociolinguistics to developing textbooks, dictionaries, and resources for language learners.

Conversations with researchers who use corpus linguistics methods gave us an even deeper sense of the interesting and important topics that benefit from tools to extract patterns from huge amounts of text.

Throughout the course, I came up with many ideas I plan to explore with the methods we learned about, especially #LancsBox, a tool that helps researchers analyze and visualize their language data.

I would recommend this course to people with any level of background knowledge on the topic — there’s something for everyone.’

Introductory Blog – Luke Collins

I am delighted to have joined the CASS team as Senior Research Associate and will be working across the new programme of studies in Corpus Approaches to Health(care) Communication. I have already begun working on a fascinating strand exploring the Narratives of Voice-hearers and I will be working closely with Professor Elena Semino in applying corpus methods to see what effects a therapeutic intervention has on the experiences of those who hear distressing voices – and how they articulate these experiences – over time. More broadly, we will be examining representations of mental health and illness in the media, looking to address issues of stigmatisation and support public awareness and understanding.

Working towards the application of corpus linguistics and the findings of corpus analysis to health services is a great motivation to me and I am thrilled to have the opportunity to build on my previous work in this area. I have published work on the experiences of people undergoing a therapeutic intervention and demonstrated how corpus approaches can help to capture some of the complexities of those experiences. I have also implemented corpus analyses to investigate discussions of complex global issues in the news media (specifically, climate change and AMR), thinking about public understanding and how media reporting can help readers to comprehend their role in such issues. I have recently been working on my edition of the Routledge ‘Corpus Linguistics for..’ series, focusing on applications of corpus tools for analysing different types of online communication and hope to announce its release early next year. Throughout my work, I have endeavoured to raise awareness of corpus methods outside of the discipline and create opportunities to work with collaborators from various backgrounds. I am glad to find that in my role with CASS, this can continue!

Outside of my work, I have a reputation for hand-made greeting cards and I am an avid record collector. Since moving to Lancaster, I have been exploring the local area and discovering what a picturesque part of the country this is. I don’t even mind the rain!

What is corpus stats about? A new book on Statistics in Corpus Linguistics has been published

This practical guide will equip the reader to understand the key principles of statistical thinking and apply these concepts to their own research, without the need for prior statistical knowledge. The book provides step-by-step guidance through the process of statistical analysis and offers multiple examples of how statistical techniques can be used to analyse and visualize linguistic data. It also includes a useful selection of discussion questions and exercises. The book comes with a Companion website, which provides additional materials (answers to exercises, datasets, advanced materials, teaching slides etc.) and Lancaster Stats Tools online (http://corpora.lancs.ac.uk/stats), a free click-and-analyse statistical tool for easy calculation of the statistical measures discussed in the book.

Elena Semino appears on BBC World Service ‘Healthcheck’

CASS project affiliate (and head of department of Linguistics and English Language at Lancaster University) Elena Semino was interviewed about the findings of the ESRC-funded project ‘Metaphor in End-of-Life Care’ on the BBC World Service’s programme ‘Healthcheck’, presented by Claudia Hammond. The programme will air four times between 7th and 11th May 2014; the first 15 minutes of the programme focus on metaphors and cancer.