Recent Research into CEO Compensation

On Wednesday 18th January, the CFA Society United Kingdom (CFA UK) hosted a breakfast meeting at Innholders’ Court (London, EC4R 2RH) to discuss the findings of a recently completed CFA UK-funded research project examining CEO compensation across the FTSE-350 from 2003 to 2015. CFA UK represents the interests of around 12,000 investment professionals in the UK, and the report received widespread press coverage over the Christmas period, including pieces from the BBC, The Times, The Guardian, and the Financial Times.

The report (co-authored with Dr Weijia Li, Lancaster University Management School, and available to download) contributes to the executive remuneration debate by providing independent statistical evidence of a limited association between economic value creation and executive pay.

Among other findings, the research suggests that despite relentless pressure from regulators and governance reformers over the last two decades to ensure closer alignment between executive pay and performance, the association between CEO pay and fundamental value creation in the UK remains weak at best.

At the heart of the problem is the disconnect between the performance measures widely employed in executive remuneration contracts, such as earnings per share (EPS) growth and total shareholder return (TSR), and the extent to which these metrics provide reliable information on periodic value creation. Economic theory clearly demonstrates that EPS growth and TSR are poor proxies for value creation, and this insight is confirmed in the data: correlations below 30% are documented between these measures and more sophisticated value-based performance metrics, such as residual income and economic profit, that include an explicit charge for invested capital.
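To illustrate the distinction, residual income charges earnings for the capital employed to generate them. The sketch below uses hypothetical figures and a deliberately minimal definition, not the metric as operationalised in the report:

```python
# Illustrative sketch (hypothetical figures): why earnings growth can
# diverge from value creation once invested capital carries a charge.

def residual_income(operating_profit, invested_capital, cost_of_capital):
    """Economic profit: earnings less a charge for the capital employed."""
    return operating_profit - cost_of_capital * invested_capital

# A firm doubles its capital base and lifts profit by 50%, so
# EPS-style measures improve...
year1 = residual_income(operating_profit=100, invested_capital=1000,
                        cost_of_capital=0.08)
year2 = residual_income(operating_profit=150, invested_capital=2000,
                        cost_of_capital=0.08)

print(year1)  # 100 - 80  = 20: value created
print(year2)  # 150 - 160 = -10: value destroyed despite profit growth
```

The point of the example is simply that a profit-growth metric and a capital-charged metric can move in opposite directions, which is consistent with the weak correlations the report documents.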

The work also reveals that mandatory pay-related annual report disclosures designed to enhance the transparency of executive remuneration arrangements have become increasingly complicated and hard to read (measured by the Fog index), to the extent that even relatively sophisticated consumers of firms’ published reports struggle to identify basic information such as total compensation paid to the CEO during the reporting period.
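The Fog index mentioned above combines average sentence length with the share of polysyllabic words. A rough sketch follows, using a crude vowel-group syllable heuristic; this is an assumption for illustration, not the implementation used in the research:

```python
import re

def fog_index(text):
    """Approximate Gunning Fog index:
    0.4 * (average sentence length
           + percentage of 'complex' words of 3+ syllables)."""
    sentences = [s for s in re.split(r'[.!?]+', text) if s.strip()]
    words = re.findall(r"[A-Za-z]+", text)

    def syllables(word):
        # rough heuristic: count runs of consecutive vowels
        return max(1, len(re.findall(r'[aeiouy]+', word.lower())))

    complex_words = [w for w in words if syllables(w) >= 3]
    return 0.4 * (len(words) / len(sentences)
                  + 100 * len(complex_words) / len(words))

print(round(fog_index("The cat sat. The dog ran."), 2))  # 1.2
```

Higher scores indicate that more years of formal education are needed to follow the text on a first reading, which is the sense in which remuneration disclosures are described as increasingly hard to read.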

Attendees at the event comprised representatives from a range of City institutions, including CFA UK, The Investment Association, SVM Asset Management, RPMI Railpen, Schroders, PIRC, Aberdeen Asset Management, JP Morgan Asset Management, Kepler Cheuvreux, Legal & General Investment Management, Fidelity International, Willis Towers Watson, and the Pensions and Lifetime Savings Association.

Will Goodhart (Chief Executive, CFA UK) welcomed attendees and Natalie Winterfrost (Aberdeen Asset Management) provided context for the research. After a brief summary of the research purpose, methodology and main findings, plus follow-up comments from steering committee members Prof Brian Main (Edinburgh University), James Cooke (SVM Asset Management), and Alasdair Wood (Willis Towers Watson), attendees engaged in a lively discussion concerning the report’s conclusions and their implications for executive compensation policy in the UK. The discussions will help CFA UK to formulate its engagement strategy with companies and institutional investors to improve the degree of alignment between pay and value generation.

CASS goes to the Wellcome Trust!

Earlier this month I represented CASS in a workshop, hosted by the Wellcome Trust, which was designed to explore the language surrounding patient data. The remit of this workshop was to report back to the Trust on what might be the best ways to communicate to patients about their data, their rights respecting their data, and issues surrounding privacy and anonymity. The workshop comprised nine participants who all communicated with the public as part of their jobs, including journalists, bloggers, a speech writer, a poet, and a linguist (no prizes for guessing who the latter was…). On a personal note, I had prepared for this event from the perspective of a researcher of health communication. However, the backgrounds of the other participants meant that I realised very quickly that my role in this event would not be so specific, so niche, but was instead much broader, as “the linguist” or even “the academic”.

Our remit was to come up with a vocabulary for communication about patient data that would be easier for patients to understand. As it turned out, this wasn’t too difficult, since most of the language surrounding patient data is waffly at best, and overly technical and incomprehensible at worst. One of our most notable recommendations concerned the phrase ‘patient data’ itself, which we thought might carry connotations of science and research and perhaps disengage the public; we suggested that the phrase ‘patient health information’ might sound less technical and more transparent. We undertook a series of tasks which ranged from sticking post-it notes on whiteboards and windows, to role-play exercises and editing official documents and newspaper articles. What struck me, and what the diversity of these tasks demonstrated particularly well, was how the suitability of our suggested terms could only really be assessed once we took the words off the post-it notes and inserted them into real-life communicative situations, such as medical consultations, patient information leaflets, newspaper articles, and even talk shows.

The most powerful message I took away from the workshop was that close consideration of linguistic choices in the rhetoric surrounding health is vital if health care providers are to improve the ways that they communicate with the public. To this end, as a collection of methods that facilitate the analysis of large amounts of authentic language data in and across a variety of texts and contexts, corpus linguistics has an important role to play in providing such knowledge in the future. Corpus linguistic studies of health-related communication are currently small in number, but continue to grow apace. Although the health-related research being undertaken within CASS, such as Beyond the Checkbox and Metaphor in End of Life Care, goes some way to showcasing the rich fruits that corpus-based studies of health communication can bear, there is still a long way to go. In particular, future projects in this area should strive to engage consumers of health research not only with our findings, but also with the (corpus) methods that we have used to get there.

Birmingham ERP Boot Camp

Last week I attended a 5-day ERP Boot Camp at the University of Birmingham, which was an incredible opportunity for me to learn from ERP experts and get specific advice for running my next ERP experiments. The workshop was led by two of the most renowned ERP researchers in the world, Professor Steven Luck and Dr Emily Kappenman. Luck and Kappenman are both part of the Center for Mind and Brain at the University of California, Davis, one of the world’s leading centres for research into cognitive neuroscience. They are both among the set of researchers who set the publication guidelines and recommendations for conducting EEG research (Keil et al. 2014), and Luck is also the developer of ERPLAB, a MATLAB toolbox designed specifically for ERP data analysis. Moreover, Luck is the author of the authoritative book An Introduction to the Event-Related Potential Technique. Before attending the ERP Boot Camp, most of the knowledge that I had about ERPs came from this book. I am therefore extremely grateful to have had this opportunity to learn from the authorities in the field, especially since Luck and Kappenman bring the ERP Boot Camp to the University of Birmingham just once every three years.

There were two parts to the ERP Boot Camp: 2.5 days of lectures covering the theoretical aspects of ERP research (led by Steven Luck), and 2.5 days of practical workshops which involved demonstrations of the main data acquisition and analysis steps, followed by independent data analysis work using ERPLAB (led by Emily Kappenman). Day 1 of the Boot Camp provided an overview of different experimental paradigms and different ERP components, which are defined as voltage changes that reflect a particular neural or psychological process (e.g. the N400 component reflects the processing of meaning and the P600 component reflects the processing of structure). Most of the electrical activity in the brain that can be detected by scalp electrodes comes from the surface of the cortex but, in the lecture on ERP components, I was amazed to find out that there are some ERP components that actually reflect brain stem activity. These components are known as auditory brainstem responses. I also learnt about how individual differences between participants are typically the result of differences in cortical folding and differences in skull thickness, rather than reflecting any functional differences, and I learnt how ERP components from one domain such as language can be used to illuminate psychological processes in other domains such as memory. From this first day at the Boot Camp, I started to gain a much deeper conceptual understanding of the theoretical basis of ERP research, causing me to think of questions that hadn’t even occurred to me before.

Day 2 of the Boot Camp covered the principles of electricity and magnetism, the practical steps involved in processing an EEG dataset, and the most effective ways of circumventing and minimizing the problems that are inevitably faced by all ERP researchers. On this day I also learnt the importance of taking ERP measurements from difference waves rather than from the raw ERP waveforms. This is invaluable knowledge to have when analysing the data from my next experiments. In addition, I gained some concrete advice on stimulus presentation which I will take into account when editing my stimuli.
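The difference-wave idea can be shown in a few lines. The waveforms below are invented toy values, not data from the Boot Camp; the point is only that subtracting one condition’s averaged ERP from the other’s isolates the condition effect before any measurement is taken:

```python
import numpy as np

# Two averaged ERP waveforms (microvolts), one per condition,
# sampled at the same time points. Values are hypothetical.
target   = np.array([0.0, 1.0, 4.0, 2.0, 0.5])   # condition A
standard = np.array([0.0, 0.8, 1.5, 1.0, 0.4])   # condition B

# Measuring from the difference wave, rather than either raw
# waveform, removes activity common to both conditions.
difference_wave = target - standard

peak_idx = int(np.argmax(np.abs(difference_wave)))
print(peak_idx, difference_wave[peak_idx])  # peak at sample 2 (2.5 µV)
```

In practice the subtraction is done per participant before peak or mean-amplitude measurement, but the arithmetic is exactly this.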

On day 3 of the Boot Camp, we were shown examples of ‘bad’ experimental designs and were asked to identify the factors that made them problematic. Similarly, we discussed how to identify problematic results just by looking at the waveforms. These were really valuable exercises in helping me to critically evaluate ERP studies, which will be useful both when reading published articles and when designing my own experiments.

From the outset of the Boot Camp, we were encouraged to ask questions at any time, and this was particularly useful when it came to the practical sessions, as we were able to use our own data and ask specific questions relating to our own experiments. I came prepared with questions that I had wanted answered for a long time, as well as additional questions that occurred to me throughout the Boot Camp, and I was given clear answers to every one of them.

Furthermore, as well as acquiring both theoretical and practical knowledge from the scheduled lectures and workshops, I also gained a lot from talking to the other ERP researchers who were attending the Boot Camp. A large proportion of attendees focused on language as their main research area, while others focused on clinical psychology or other areas of psychology such as memory or perception. I found it really interesting to hear the differences of opinion between those who were primarily linguists and those who were primarily psychologists. For instance, when discussing the word-by-word presentation of sentences in ERP experiments, the psychologists stated that each word should immediately replace the previous word, whereas the linguists concluded that it is best to present a blank white screen between each word. Conversations such as this made it very apparent that many of the aspects of ERP research are not standardised, and so it is up to the researcher to decide what is best for their experiment based on what is known about ERPs and what is conventional in their particular area of research.

Attending this ERP Boot Camp was a fantastic opportunity to learn from some of the best ERP researchers in the world. I now have a much more thorough understanding of the theoretical basis of ERP research, and I have an extensive list of practical suggestions that I can apply to my next experiments. I thoroughly enjoyed every aspect of the workshop and I am very grateful to CASS for funding the trip.

CASS goes to Weihai!


Between the 28th July and the 2nd August, Carmen Dayrell and I represented CASS at the 3rd Sino-UK Summer School of Corpus Linguistics. The summer school was organised by Beijing Foreign Studies University and was hosted at the Weihai campus of Shandong University, China. A research symposium followed the summer school on the 3rd August, where we presented our research to representatives from both universities. The research symposium gave us a taste of how corpus linguistics is used in a different culture, and we heard papers on a range of topics, such as Alzheimer’s research, work on translations, Chinese medicine, and analyses of media discourse.

Our summer school sessions introduced students to corpus linguistics and gave them an overview of the discipline’s development within a UK context. We also discussed the range of projects ongoing at CASS and foregrounded the interdisciplinary focus of the Centre’s work. After the formal lectures, we ran hands-on sessions demonstrating how to use Graphcoll and CQPweb and conducted seminars using material from the Climate Change and Discourses of Distressed Communities projects to test the students’ frequency, keywords, and concordance analysis skills. The students really engaged with the sessions and were particularly taken with Graphcoll. They enjoyed doing the practical sessions, which they said were different to how they usually learned. Everyone in the classroom worked really hard and asked great questions that showed how interested they were in Lancaster’s tools.


Weihai is an absolutely beautiful place. The university sits with a sandy beach on one side and a mountain on the other; because of this, the Weihai campus is considered to have good Feng Shui. The place itself was described as a small city by those who live there, but ‘small’ is relative when compared to cities the size of Lancaster. Carmen and I enjoyed our time in China (despite a long journey involving flight cancellations and a trip to a Beijing hotel in the middle of the night) and loved seeing how well the students took to corpus linguistics and the materials that we prepared for them. The trip was a great success and we look forward to future collaborations between Lancaster and Beijing Foreign Studies University.


Textual analysis training for European doctoral researchers in accounting

Professor Steve Young (Lancaster University Management School and PI of the ESRC-funded CASS project Understanding Corporate Communications) was recently invited to the 6th Doctoral Summer Program in Accounting Research (SPAR) to deliver sessions specializing in textual analysis of financial reporting. The invitation reflects the increasing interest in narrative reporting among accounting researchers.

The summer program was held at WHU – Otto Beisheim School of Management (Vallendar, Germany) 11-14 July, 2016.

Professor Young was joined by Professors Mary Barth (Stanford University) and Wayne Landsman (University of North Carolina, Chapel Hill), whose sessions covered a range of current issues in empirical financial reporting research, including disclosure and the cost of capital, fair value accounting, and comparative international financial reporting. Students also benefitted from presentations by Prof. Dr. Andreas Barckow (President, Accounting Standards Committee of Germany) and Prof. Dr. Sven Hayn (Partner, EY Germany).

The annual SPAR training event was organised jointly by the Ludwig Maximilian University of Munich School of Management and the WHU – Otto Beisheim School of Management. The programme attracts the top PhD students in accounting from across Europe with the aim of introducing them to cutting-edge theoretical, methodological, and practical issues involved in conducting high-quality financial accounting research. This year’s cohort comprised 31 carefully selected students from Europe’s leading business schools.

Professor Young delivered four sessions on textual analysis. Sessions 1 & 2 focused on the methods currently applied in accounting research and the opportunities associated with applying more advanced approaches from computational linguistics and natural language processing. The majority of extant work in mainstream accounting research relies on bag-of-words methods (e.g., dictionaries, readability, and basic machine learning applications) to study the properties and usefulness of narrative aspects of financial communications; significant opportunities exist for accounting researchers applying more advanced textual analysis methods, including part-of-speech tagging, semantic analysis, topic models, summarization, text mining, and corpus methods.
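The dictionary-based bag-of-words approach described above can be sketched in a few lines. The word lists here are tiny placeholders for illustration only; published accounting work typically uses purpose-built financial lexicons such as Loughran and McDonald’s:

```python
# Minimal sketch of a dictionary ("bag-of-words") tone measure:
# count matches against negative and positive word lists and
# normalise by document length. Placeholder lexicons, not real ones.

NEGATIVE = {"loss", "impairment", "decline", "adverse"}
POSITIVE = {"growth", "improvement", "strong", "record"}

def tone(text):
    tokens = text.lower().split()
    neg = sum(t in NEGATIVE for t in tokens)
    pos = sum(t in POSITIVE for t in tokens)
    return (pos - neg) / max(1, len(tokens))

print(tone("strong growth despite impairment loss"))  # (2 - 2) / 5 = 0.0
```

The simplicity of this measure, which ignores word order, negation, and context entirely, is precisely why the sessions argued that more advanced methods offer such an opportunity.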

Sessions 3 & 4 reviewed the extant literature on automated textual analysis in accounting and financial communication. Session 3 concentrated on earnings announcements and annual reports. Research reveals that narrative disclosures are incrementally informative beyond quantitative data for stock market investors, particularly in circumstances where traditional accounting data provide an incomplete picture of firm performance and value. Nevertheless, evidence also suggests that management use narrative commentaries opportunistically when the incentives to do so are high. Session 4 reviewed research on other aspects of financial communication, including regulatory information [e.g., surrounding mergers and acquisitions (M&A) and initial public offerings (IPOs)], conference calls, analysts’ reports, financial media, and social media. Evidence consistently indicates that financial narratives contain information that is not captured by quantitative results.

Slides for all four sessions are available here.

The event was a great success. Students engaged actively in all sessions (including presentations and discussions of published research using textual analysis methods). New research opportunities were explored involving the analysis of new financial reporting corpora and the application of more advanced computational linguistics methods. Students also received detailed feedback from faculty on their research projects, a significant number of which involved application of textual analysis methods. Special thanks go to Professor Martin Glaum and his team at WHU for organizing and running the summer program.

40th Anniversary of the Language and Computation Group


Recently I was given the chance to attend the 40th anniversary of the Language and Computation (LAC) group at the University of Essex. As an Essex alumnus, I was invited to present my work with CASS on Financial Narrative Processing (FNP), part of an ESRC-funded project. Slides are available online here.

The event celebrated 40 years of the Language and Computation (LAC) group: an interdisciplinary group created to foster interaction between researchers working on computational linguistics within the University of Essex.

There were 16 talks by University of Essex alumni and connections, including Yorick Wilks, Patrick Hanks, Stephen Pulman and Anne de Roeck.

The two-day workshop started with Doug Arnold from the Department of Language and Linguistics at Essex. He began by presenting the history of the LAC group, which started with the arrival of Yorick Wilks in the late 70s, together with others from Language and Linguistics including Stephen Pulman, Mike Bray, Ray Turner and Anne de Roeck. According to Doug, the introduction of the cognitive studies centre and the Eurotra project in the 80s led to the introduction of the Computational Linguistics MA, paving the way towards the emergence of Language and Computation. Something I had always wondered about.

The workshop referred to the beginnings of some of the most influential conferences and associations in computational linguistics, such as CoLing, EACL and ESSLLI. It also showed the influence of world events around that period and the struggle researchers and academics had to go through, especially during the Cold War and the many university crises around the UK during the 80s and 90s. Having finished my PhD in 2012, it had never crossed my mind how difficult it would have been for researchers and academics to progress under such trying circumstances.

Doug went on to point out how the introduction of the World Wide Web in the mid 90s and the development of technology and computers helped to rapidly advance and reshape the field. This helped to close the gap between computation and linguistics, and to ease the problem of field identity between computational linguists coming from a computing or a linguistics background. We now live surrounded by rapid technologies and solid network infrastructure, which makes communications and data processing a problem no more. I was astonished when Stephen Pulman mentioned how they used to wait a few days for the only machine in the department to compile a few lines of LISP code.

The emergence of Big Data processing around 2010 and the pressing need for resourcing, crowd-sourcing and interpreting big data brought more challenges, but also interesting opportunities, for computational linguists. Something I very much agree with, considering the vast amount of data available online these days.

Doug ended his talk by pointing out that, in general, computational linguistics is a difficult field: computational linguists are expected to be experts in many areas, so training them is a challenging task. As a computational linguist, this rings a bell. For example, as someone from a computing background, I find it difficult to understand how part-of-speech taggers work without being versed in the grammatical aspects of the language under study.

Doug’s talk was followed by compelling and very informative talks from Yorick Wilks, Mike Rosner and Patrick Hanks.

Yorick opened with “Linguistics is still an interesting topic”, narrating his experience of moving from linguistics towards computing and the challenge imposed by the UK system compared to other countries such as France, Russia and Italy, where Chomsky had little influence. This reminded me of Peter Norvig’s response to Chomsky’s criticism of empirical theory, where he said, and I quote: “I think Chomsky is wrong to push the needle so far towards theory over facts”.

In his talk, Yorick referred to Lancaster University and the remarkable work by Geoffrey Leech on the building of the CLAWS tagger, one of the earliest statistical taggers ever to reach the USA.

“What is meaning?” was the opening of Patrick Hanks’ talk, which went on to discuss word ambiguity: “most words are hopelessly ambiguous!”. Patrick briefly discussed his ‘double helix’ rule system, the Theory of Norms and Exploitations (TNE), which enables creative use of language when speakers and writers make new meanings, while at the same time relying on a core of shared conventions for mutual understanding. His work on patterns and phraseology is of great interest in attempting to answer the question of why a perfectly valid English sentence fits a single pattern.

This was followed by interesting talks from ‘Essexians’ working in different universities and firms across the globe, covering recent work in Computational Linguistics (CL), Natural Language Processing (NLP) and Machine Learning (ML). One of these was a collaboration between the University of Essex and Signal, a startup company in London.

The event closed with more socialising, drinks and dinner at a Nepalese restaurant in Colchester, courtesy of the LAC group.

In general I found the event very interesting, well organised and rich in historical evidence on the beginnings of Language and Computation. It was also of great interest to learn about current and state-of-the-art work in CL, NLP and ML presented by the event’s attendees.

I would very much like to thank the Language and Computation group at the University of Essex for the invitation and for their time and effort in organising this wonderful event.

Mahmoud El-Haj

Senior Research Associate

CASS, Lancaster University


Corpus Data and Psycholinguistics Seminar

On the afternoon of Thursday 19th May 2016, CASS held its first ever psycholinguistics seminar which brought together researchers from both linguistics and psychology. The theme of the seminar was “Corpus Data and Psycholinguistics”, with a particular focus on experimental psycholinguistics.

The afternoon consisted of four 40-minute presentations which covered a range of different experimental methods including eye-tracking and EEG. Interestingly, the notion of collocation also emerged as a strong theme throughout the presentations. Different types of collocation were addressed, including bigrams, idioms, and compounds, and this prompted thought-provoking discussions about the nature of collocation and the relationship between psycholinguistic results and the different statistical measures of collocation strength.

The first presentation was delivered by Professor Padraic Monaghan from the Psychology Department at Lancaster University. In this presentation, Padraic provided an engaging introduction to computational modelling in psycholinguistics, focusing mainly on connectionist models where the input determines the structure of processing. This talk prompted a particularly interesting observation about the relationship between connectionist models and parts-of-speech tags in corpora.

In the second presentation, Dr Phil Durrant from the University of Exeter provided a critical perspective on his own earlier work into whether or not psycholinguistic priming is evident in collocations at different levels of frequency, and on the distinction between the related notions of collocation and psychological association. This presentation also provided a really interesting insight into the different ways in which corpus linguistics and psychological experimentation can be combined in psycholinguistic studies. This really helped to contextualise the studies reported in the other presentations within the field of psycholinguistics.

After a short break, I presented the results of the first of several studies which will make up my PhD thesis. This initial study pilots a procedure for using EEG to determine whether or not the brain is sensitive to the transition probabilities between words. This was an excellent opportunity for me to gain feedback on my work and I really appreciate the input and suggestions for further reading that I received from participants at this event.

The final presentation of the afternoon was delivered by Professor Michaela Mahlberg and Dr Gareth Carroll from the University of Birmingham. This presentation drew upon eye-tracking data from a study exploring literary reading in order to pinpoint the methodological issues associated with combining eye-tracking techniques with literary corpora, and with corpus data more generally.

With such an interesting series of talks sharing the theme of “Corpus Data and Psycholinguistics”, the CASS psycholinguistics seminar proved to be a very successful event. We would like to thank the presenters and all of the participants who attended the seminar for their contribution to the discussions, and we are really looking forward to hosting similar seminars in the near future.

CASS Changing Climates project presented at the University of Turin


It was a great honour and pleasure to present the CASS Changing Climates project to an engaging audience at the University of Turin last month, on 27th April 2016. This was the 8th symposium on ‘Energy and Power: Social ontology perspectives and energy transitions’, part of a UNESCO Chair programme in Sustainable Development and Territory Management currently hosted by the University of Turin (Italy), under the coordination of Professor Dario Padovan.


The symposium brought together academics and students from various disciplines – sociology, linguistics, history and environmental sciences – making for an enthusiastic audience and a lively debate. CASS would like to thank the organisers Professor Dario Padovan, Dr Maria Cristina Caimotto and Gabriela Cavaglià for this great opportunity to exchange experience and ideas. I very much enjoyed the event and, as expected, had a great time in lovely Torino.

Chinese Applied Corpus Linguistics Symposium

On Friday 29th April 2016, Lancaster University hosted a symposium which brought together researchers and practitioners interested in Chinese linguistics and the corpus method. The symposium was supported by the British Academy (International Mobility and Partnership Scheme IPM 2013) and was hosted by the ESRC Centre for Corpus Approaches to Social Science (CASS). The symposium introduced the Guangwai-Lancaster Chinese Learner Corpus, a 1.2-million-word corpus of spoken and written L2 Chinese produced by learners of Chinese at different proficiency levels; the corpus was built as part of a collaboration between Guangdong University of Foreign Studies (Prof. Hai Xu and his team) and Lancaster University. The project was initiated by Richard Xiao, who also obtained the funding from the British Academy. Richard’s vision to bring corpus linguistics to the analysis of L2 Chinese (both spoken and written) is now coming to fruition with the final stages of the project and the public release of the corpus planned for the end of this year.

The symposium showcased different areas of Chinese linguistics research through presentations by researchers from Lancaster and other UK universities (Coventry, Essex), with the topics ranging from the use of corpora as resources in the foreign language classroom to a cross-cultural comparison of performance evaluation in concert reviews, second language semasiology, and CQPweb as a tool for Chinese corpus data. As part of the symposium, the participants were also given an opportunity to search the Guangwai-Lancaster Chinese Learner Corpus and explore different features of the dataset. At the end of the symposium, we discussed the applications of corpus linguistics in Chinese language learning and teaching and the future of the field.

Thanks are due to the presenters and all participants for joining the symposium and for very engaging presentations and discussions. The following snapshots summarise the presentations – links to the slides are available below the images.




Hai Xu (Guangdong University of Foreign Studies): Guangwai-Lancaster Chinese Learner Corpus: A profile – via video conferencing from Guangzhou


Simon Smith (Coventry University): 语料酷!Corpora and online resources in the Mandarin classroom


Fong Wa Ha (University of Essex): A cross-cultural comparison of evaluation between concert reviews in Hong Kong and British newspapers


Vittorio Tantucci (Lancaster University): Second language semasiology (SLS): The case of the Mandarin sentence final particle 吧 ba


Andrew Hardie (Lancaster University): Using CQPweb to analyse Chinese corpus data


Vaclav Brezina (Lancaster University): Practical demonstration of the Guangwai-Lancaster Chinese Learner Corpus, followed by a general discussion

Clare Wright: Using Learner Corpora to analyse task effects on L2 oral interlanguage in English-Mandarin bilinguals




NewsHack 2016 Retrospective

The BBC’s multilingual NewsHACK event was run on the 15th and 16th of March as an opportunity for teams of language-technology researchers to work with multilingual data from the BBC’s Connected Studio. The theme was ‘multilingual journalism: tools for future news’, and teams were encouraged to bring some existing language technologies to apply to problems in this area. Nine teams attended from various news and research organisations. Lancaster University sent two teams with funding from CASS, CorCenCC, DSI, and UCREL: team ‘1’ consisting of Paul, Scott and Hugo, and team ‘A’ comprising Matt, Mahmoud, Andrew and Steve.


The brief from the newsHACK team suggested two possible directions: to provide a tool for the BBC’s journalists, or to create an audience-facing utility. To support us, the BBC provided access to a variety of APIs, but the Lancaster ‘A’ team were most interested to discover that something we’d thought would be available wasn’t — there is no service mapping news stories to their counterparts in other languages. We decided to remedy that.

The BBC is a major content creator, perhaps one of the largest multilingual media organisations in the world. This presents a great opportunity. Certain events are likely to be covered in every language the BBC publishes in, providing ‘translations’ of the news which are not merely literal translations at the word, sentence or paragraph level, but full-fledged contextual translations which identify the culturally appropriate ways to convey the same information. Linking these articles together could help the BBC create a deeply interesting multilingual resource for exploring questions about language, culture and journalism.

Interesting, but how do we make this into a tool for the BBC? Our idea was to take these linked articles directly to the users. Say you have a friend who would prefer to read the news in their native tongue, one different to your own: how would you share a story with them? Existing approaches seem to involve either using an external search engine (but then, without speaking the target language, how do you know the results are what you intend to share?) or using machine translation to offer your friend a barely-readable version of the exact article you have read. We came up with an idea that keeps content curation within the BBC and gives readers easy access to the existing high-quality translations being made by professional writers: a simple drop-down menu for articles which allows a user to ‘Read about this in…’ any of the BBC’s languages.


Implementing this in two days required a bit of creative engineering. We wanted to connect articles based on their content, but we didn’t have tools to extract and compare features in all the BBC’s languages. Instead, we translated small amounts of text — article summaries and a few other pieces of information — into English, which has some of the best NLP tool support (and was the only language all of our team spoke). Then we could use powerful existing solutions for named entity recognition and part-of-speech tagging to extract informative features from articles, and compare them using a few tricks from the record linkage literature. Of course, a lack of training data (and time to construct it!) meant that we couldn’t machine-learn our way to perfection for weighting these features, so a ‘human learning’ process was involved in manually tweaking the weights and thresholds until we got some nice-looking links between articles in different languages.
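The weighting scheme described above can be sketched roughly as follows. This is a minimal illustration, not the team’s actual code: the feature names, example articles, and hand-tuned weights are all hypothetical, and it uses simple set overlap (Jaccard) in place of whichever record-linkage comparisons were actually used.

```python
# Hypothetical sketch: score cross-language article similarity by
# combining per-feature overlaps with hand-tuned weights.

def jaccard(a, b):
    """Set overlap between two feature sets, in [0, 1]."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Manually tweaked weights per feature type -- the 'human learning' step.
WEIGHTS = {"entities": 0.6, "keywords": 0.4}

def article_similarity(art1, art2):
    """Weighted sum of per-feature overlaps between two articles."""
    return sum(w * jaccard(art1[f], art2[f]) for f, w in WEIGHTS.items())

# Features extracted from English translations of two articles (invented data).
en = {"entities": {"BBC", "Brussels"}, "keywords": {"referendum", "vote"}}
ar = {"entities": {"BBC", "Brussels"}, "keywords": {"referendum", "poll"}}
print(round(article_similarity(en, ar), 2))  # → 0.73
```

A pair of articles would then count as ‘linked’ when this score clears a manually chosen threshold.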

Data is only part of the battle, though. We needed a dazzling front-end to impress the judges.  We used a number of off-the-shelf web frameworks to quickly develop a prototype, drawing upon the BBC’s design to create something that could conceivably be worked into a reader’s workflow: users enter a URL at the top and are shown results from all languages in a single dashboard, from which they can read or link to the original articles or their identified translations.

Here we have retrieved a similar article in Arabic, as well as two only-vaguely-similar ones in Portuguese and Spanish (the number indicates a percentage similarity).  The original article text is reproduced, along with a translated English summary.


The judges were impressed — perhaps as much with our pun-filled presentation as our core concept — and our contribution, the spontaneously-titled ‘Super Mega Linkatron 5000’ was joint winner in the category of best audience-facing tool.

The BBC’s commitment to opening up their resources to outsiders seems to have paid off with a crop of high-quality concepts from all the competitors, and we’d like to thank them for the opportunity to attend (as well as the pastries!).

The code and presentation for the team ‘A’ entry is available via github at and images from Lancaster’s visit can be seen at .  Some of the team have also written their own blog posts on the subject: here’s Matt’s and Steve’s.

Team ‘1’ based their work around the BBC Reality Check service. This was part of the BBC News coverage of the 2015 UK general election and published news items on Twitter as well as contributing to TV and radio news. For example, in May 2015, when a politician claimed that the number of GPs had declined under the coalition government, BBC Reality Check produced a summary of data obtained from a variety of sources to enable the audience to make up their own mind about this claim. Reality Check is continuing in 2016 with a similar service for the EU referendum, providing, for example, a check on how many new EU regulations there are every year (1,269 rather than the 2,500 claimed by Boris Johnson!). After consulting with the BBC technology producer and journalist attending the newsHACK, Team ‘1’ realised that the current Reality Check service could only serve its purpose for English news stories, so set about making a new ‘BBC Multilingual Reality Check’ service to support journalists in their search for suitable sources. A multilingual system is particularly important for the EU referendum and other international news topics, because potential sources are often written in languages other than English.

In order to bridge related stories across different languages, we adopted the UCREL Semantic Analysis System (USAS) developed at Lancaster over the last 26 years. The system automatically assigns semantic fields (concepts or coarse-grained senses) to each word or phrase in a given text, and we reasoned that the frequency profile of these concepts would be similar for related stories even in different languages, e.g. the semantic profile could help distinguish between news stories about finance or education or health. Using the APIs that the BBC newsHACK team provided, we gathered stories in English, Spanish and Chinese (the native languages spoken by team ‘1’). Each story was then processed through the USAS tagger and a frequency profile was generated. Using a cosine distance measure, we ranked related stories across languages. Although we only used the BBC multilingual news stories during the newsHACK event, the approach could be extended to ingest text from other sources, e.g. UK Parliamentary Hansard and manifestos, proceedings of the European Parliament and archives of speeches from politicians (before they are removed from political party websites).
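The profile-comparison step above can be sketched as below. This is a hedged illustration rather than the team’s code: the tag sequences are invented stand-ins for USAS tagger output (USAS uses tags such as G1.1 for government and I1 for money, but these example profiles are made up), and only the cosine measure itself is shown.

```python
# Sketch: compare two stories via cosine similarity of their
# semantic-tag frequency profiles (assumed tagger output).
from collections import Counter
from math import sqrt

def cosine_similarity(p, q):
    """Cosine similarity between two tag-frequency profiles (Counters)."""
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm = sqrt(sum(v * v for v in p.values())) * sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# Invented USAS-style tag sequences for an English and a Spanish story.
story_en = Counter(["G1.1", "I1", "G1.1", "N1"])
story_es = Counter(["G1.1", "I1", "I1", "G1.1"])
print(round(cosine_similarity(story_en, story_es), 2))  # → 0.87
```

Ranking candidate stories by this score (or, equivalently, by ascending cosine distance, 1 minus the similarity) gives the cross-language ordering shown to journalists.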

The screenshot below shows our analysis focussed on some main topics of the day: UK and Catalonia referendums, economics, Donald Trump, and refugees. Journalists can click on news stories in the system and show related articles in the other languages, ranked by our distance measure (shown here in red).

Team ‘1’s Multilingual Reality Check system would not only allow fact checking such as the number of refugees and migrants over time entering the EU, but also allow journalists to observe different portrayals of the news about refugees and migrants in different countries.