Textual analysis training for European doctoral researchers in accounting

Professor Steve Young (Lancaster University Management School and PI of the CASS ESRC funded project Understanding Corporate Communications) was recently invited to the 6th Doctoral Summer Program in Accounting Research (SPAR) to deliver sessions specializing in textual analysis of financial reporting. The invitation reflects the increasing interest in narrative reporting among accounting researchers.

The summer program was held at WHU – Otto Beisheim School of Management (Vallendar, Germany) 11-14 July, 2016.

Professor Young was joined by Professors Mary Barth (Stanford University) and Wayne Landsman (University of North Carolina, Chapel Hill), whose sessions covered a range of current issues in empirical financial reporting research including disclosure and the cost of capital, fair value accounting, and comparative international financial reporting. Students also benefitted from presentations by Prof. Dr. Andreas Barckow (President, Accounting Standards Committee of Germany) and Prof. Dr. Sven Hayn (Partner, EY Germany).

The annual SPAR training event was organised jointly by the Ludwig Maximilian University of Munich School of Management and the WHU – Otto Beisheim School of Management. The programme attracts the top PhD students in accounting from across Europe with the aim of introducing them to cutting-edge theoretical, methodological, and practical issues involved in conducting high-quality financial accounting research. This year’s cohort comprised 31 carefully selected students from Europe’s leading business schools.

Professor Young delivered four sessions on textual analysis. Sessions 1 & 2 focused on the methods currently applied in accounting research and the opportunities associated with applying more advanced approaches from computational linguistics and natural language processing. The majority of extant work in mainstream accounting research relies on bag-of-words methods (e.g., dictionaries, readability, and basic machine learning applications) to study the properties and usefulness of narrative aspects of financial communications; significant opportunities exist for accounting researchers applying more mainstream textual analysis methods including part of speech tagging, semantic analysis, topic models, summarization, text mining, and corpus methods.
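To make the bag-of-words idea concrete, here is a minimal Python sketch of a dictionary-based tone score. The word lists and the `tone_score` name are illustrative assumptions rather than material from the sessions; published studies rely on much larger, domain-specific dictionaries.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists (for illustration only).
POSITIVE = {"growth", "improved", "strong", "gain"}
NEGATIVE = {"loss", "decline", "weak", "impairment"}

def tone_score(text):
    """Return (positive hits - negative hits) / total words.

    Word order is ignored entirely, which is what makes this a
    bag-of-words measure.
    """
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    counts = Counter(words)
    pos = sum(counts[w] for w in POSITIVE)
    neg = sum(counts[w] for w in NEGATIVE)
    return (pos - neg) / len(words)
```

Because the score discards syntax and context, a phrase such as "no growth" still counts as positive. That limitation is one reason the more advanced methods listed above are attractive.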

Sessions 3 & 4 reviewed the extant literature on automated textual analysis in accounting and financial communication. Session 3 concentrated on earnings announcements and annual reports. Research reveals that narrative disclosures are incrementally informative beyond quantitative data for stock market investors, particularly in circumstances where traditional accounting data provide an incomplete picture of firm performance and value. Nevertheless, evidence also suggests that management use narrative commentaries opportunistically when the incentives to do so are high.  Session 4 reviewed research on other aspects of financial communication including regulatory information [e.g., surrounding mergers and acquisitions (M&A) and initial public offerings (IPOs)], conference calls, analysts’ reports, financial media, and social media. Evidence consistently indicates that financial narratives contain information that is not captured by quantitative results.

Slides for all four sessions are available here.

The event was a great success. Students engaged actively in all sessions (including presentations and discussions of published research using textual analysis methods). New research opportunities were explored involving the analysis of new financial reporting corpora and the application of more advanced computational linguistics methods. Students also received detailed feedback from faculty on their research projects, a significant number of which involved application of textual analysis methods. Special thanks go to Professor Martin Glaum and his team at WHU for organizing and running the summer program.

40th Anniversary of the Language and Computation Group


Recently I was given the chance to attend the 40th anniversary of the Language and Computation (LAC) group at the University of Essex. As an Essex alumnus I was invited to present my work with CASS on Financial Narrative Processing (FNP), which is part of an ESRC-funded project. Slides are available online here.

The event celebrates 40 years of the Language and Computation (LAC) group: an interdisciplinary group created to foster interaction between researchers working on Computational Linguistics within the University of Essex.

There were 16 talks by Essex University alumni and connections including Yorick Wilks, Patrick Hanks, Stephen Pulman and Anne de Roeck. http://lac.essex.ac.uk/2016-computationallinguistics40

The two-day workshop opened with Doug Arnold from the Department of Language and Linguistics at Essex. He presented the history of the LAC group, which began with the arrival of Yorick Wilks in the late 70s along with others from Language and Linguistics, including Stephen Pulman, Mike Bray, Ray Turner and Anne de Roeck. According to Doug, the introduction of the cognitive studies centre and the Eurotra project in the 80s led to the creation of the Computational Linguistics MA, paving the way for the emergence of Language and Computation. This was something I had always wondered about.

The workshop referred to the beginnings of some of the most influential conferences and associations in computational linguistics, such as CoLing, EACL and ESSLLI. It also showed the influence of world events around that period and the struggles researchers and academics had to go through, especially during the Cold War and the many university crises around the UK during the 80s and the 90s. Having finished my PhD in 2012, it had never crossed my mind how difficult it would have been for researchers and academics to progress under such trying circumstances.

Doug went on to point out how the introduction of the World Wide Web in the mid 90s and the development of technology and computers helped to rapidly advance and reshape the field. This helped to close the gap between Computation and Linguistics and eased the problem of field identity among computational linguists coming from a Computing or Linguistics background. We now live surrounded by fast-moving technologies and solid network infrastructure, which make communication and data processing far less of a problem. I was astonished when Stephen Pulman mentioned how they used to wait a few days for the only machine in the department to compile a few lines of LISP code.

The arrival of Big Data processing around 2010 and the growing need for resourcing, crowd-sourcing and interpreting big data have brought more challenges, but also interesting opportunities, for computational linguists. This is something I very much agree with, considering the vast amount of data available online these days.

Doug ended his talk by pointing out that in general Computational Linguistics is a difficult field; computational linguists are expected to be experts in many areas, concluding that training computational linguists is deemed to be a challenging and difficult task. As a computational linguist this rings a bell. For example, and as someone from a computing background, I find it difficult to understand how part of speech taggers work without being versed in the grammatical aspect of the language of study.

Doug’s talk was followed by compelling and very informative talks from Yorick Wilks, Mike Rosner and Patrick Hanks.

Yorick opened with “Linguistics is still an interesting topic”, narrating his experience of moving from Linguistics towards Computing and the challenge imposed by the UK system compared to other countries such as France, Russia and Italy, where Chomsky had little influence. This reminded me of Peter Norvig’s response to Chomsky’s criticism of empirical theory, where he said, and I quote: “I think Chomsky is wrong to push the needle so far towards theory over facts”.

In his talk, Yorick referred to Lancaster University and the remarkable work by Geoffrey Leech and the build up of the CLAWS tagger, which was one of the earliest statistical taggers to ever reach the USA.

Patrick Hanks opened his talk with “What is meaning?” and went on to discuss word ambiguity, saying: “most words are hopelessly ambiguous!”. Patrick briefly discussed the ‘double helix’ rule system, or the Theory of Norms and Exploitations (TNE), which enables creative use of language when speakers and writers make new meanings, while at the same time relying on a core of shared conventions for mutual understanding. His work on patterns and phraseology is of great interest in attempting to answer the question “why this perfectly valid English sentence fits in a single pattern?”.

This was followed by interesting talks from ‘Essexians’ working in different universities and firms across the globe. These covered recent work on Computational Linguistics (CL), Natural Language Processing (NLP) and Machine Learning (ML). One of them was a collaboration between Essex University and Signal, a startup company in London.

The event closed with more socialising, drinks and dinner at a Nepalese restaurant in Colchester, courtesy of the LAC group.

In general I found the event very interesting, well organised and rich in historical evidence on the beginnings of Language and Computation. It was also of great interest to hear about current work and the state of the art in CL, NLP and ML presented by the event attendees.

I would very much like to thank the Language and Computation group at Essex University for the invitation and for their time and effort in organising this wonderful event.

Mahmoud El-Haj

Senior Research Associate

CASS, Lancaster University

@DocElhaj

http://www.lancaster.ac.uk/staff/elhaj

Corpus Data and Psycholinguistics Seminar

On the afternoon of Thursday 19th May 2016, CASS held its first ever psycholinguistics seminar which brought together researchers from both linguistics and psychology. The theme of the seminar was “Corpus Data and Psycholinguistics”, with a particular focus on experimental psycholinguistics.

The afternoon consisted of four 40-minute presentations which covered a range of different experimental methods including eye-tracking and EEG. Interestingly, the notion of collocation also emerged as a strong theme throughout the presentations. Different types of collocation were addressed, including bigrams, idioms, and compounds, and this prompted thought-provoking discussions about the nature of collocation and the relationship between psycholinguistic results and the different statistical measures of collocation strength.

The first presentation was delivered by Professor Padraic Monaghan from the Psychology Department at Lancaster University. In this presentation, Padraic provided an engaging introduction to computational modelling in psycholinguistics, focusing mainly on connectionist models where the input determines the structure of processing. This talk prompted a particularly interesting observation about the relationship between connectionist models and parts-of-speech tags in corpora.

In the second presentation, Dr Phil Durrant from the University of Exeter provided a critical perspective on his own earlier work into whether or not psycholinguistic priming is evident in collocations at different levels of frequency, and on the distinction between the related notions of collocation and psychological association. This presentation also provided a really interesting insight into the different ways in which corpus linguistics and psychological experimentation can be combined in psycholinguistic studies. This really helped to contextualise the studies reported in the other presentations within the field of psycholinguistics.

After a short break, I presented the results of the first of several studies which will make up my PhD thesis. This initial study pilots a procedure for using EEG to determine whether or not the brain is sensitive to the transition probabilities between words. This was an excellent opportunity for me to gain feedback on my work and I really appreciate the input and suggestions for further reading that I received from participants at this event.
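For readers unfamiliar with the term, the short Python sketch below shows one common way to estimate forward transition probabilities from a tokenised corpus, P(next | current) = count(current, next) / count(current). The function name and the toy sentence are assumptions made for illustration; the corpus and exact measure used in the study are not described in this post.

```python
from collections import Counter

def transition_probabilities(tokens):
    """Estimate P(w2 | w1) for every adjacent word pair in `tokens`.

    The denominator counts w1 only where it has a following word,
    so the probabilities out of each w1 sum to 1.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    firsts = Counter(tokens[:-1])
    return {(w1, w2): n / firsts[w1] for (w1, w2), n in bigrams.items()}
```

In the toy sentence "the cat sat on the mat", the word "the" is followed once by "cat" and once by "mat", so each of those transitions receives a probability of 0.5.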

The final presentation of the afternoon was delivered by Professor Michaela Mahlberg and Dr Gareth Carroll from the University of Birmingham. This presentation drew upon eye-tracking data from a study exploring literary reading in order to pinpoint the methodological issues associated with combining eye-tracking techniques with literary corpora, and with corpus data more generally.

With such an interesting series of talks sharing the theme of “Corpus Data and Psycholinguistics”, the CASS psycholinguistics seminar proved to be a very successful event. We would like to thank the presenters and all of the participants who attended the seminar for their contribution to the discussions, and we are really looking forward to hosting similar seminars in the near future.

CASS Changing Climates project presented at the University of Turin


It was a great honour and pleasure to present the CASS Changing Climates project to an engaged audience at the University of Turin last month, on 27th April 2016. This was the 8th symposium on ‘Energy and Power: Social ontology perspectives and energy transitions’ as part of a UNESCO Chair programme in Sustainable Development and Territory Management currently hosted by the University of Turin (Italy), under the coordination of Professor Dario Padovan.


The symposium brought together academics and students from various disciplines (sociology, linguistics, history and environmental sciences), which made for an enthusiastic audience and a lively debate. CASS would like to thank the organisers Professor Dario Padovan, Dr Maria Cristina Caimotto and Gabriela Cavaglià for this great opportunity to exchange experience and ideas. I very much enjoyed the event and, as expected, had a great time in lovely Torino.

Chinese Applied Corpus Linguistics Symposium

On Friday 29th April 2016, Lancaster University hosted a symposium which brought together researchers and practitioners interested in Chinese linguistics and the corpus method. The symposium was supported by the British Academy (International Mobility and Partnership Scheme IPM 2013) and was hosted by the ESRC Centre for Corpus Approaches to Social Science (CASS). The symposium introduced the Guangwai-Lancaster Chinese Learner Corpus, a 1.2-million-word corpus of spoken and written L2 Chinese produced by learners of Chinese at different proficiency levels; the corpus was built as part of a collaboration between Guangdong University of Foreign Studies (Prof. Hai Xu and his team) and Lancaster University. The project was initiated by Richard Xiao, who also obtained the funding from the British Academy. Richard’s vision to bring corpus linguistics to the analysis of L2 Chinese (both spoken and written) is now coming to fruition with the final stages of the project and the public release of the corpus planned for the end of this year.

The symposium showcased different areas of Chinese linguistics research through presentations by researchers from Lancaster and other UK universities (Coventry, Essex), with the topics ranging from the use of corpora as resources in the foreign language classroom to a cross-cultural comparison of performance evaluation in concert reviews, second language semasiology, and CQPweb as a tool for Chinese corpus data. As part of the symposium, the participants were also given an opportunity to search the Guangwai-Lancaster Chinese Learner Corpus and explore different features of the dataset. At the end of the symposium, we discussed the applications of corpus linguistics in Chinese language learning and teaching and the future of the field.

Thanks are due to the presenters and all participants for joining the symposium and for very engaging presentations and discussions. The following snapshots summarise the presentations; links to the slides are available below the images.


 

Hai Xu (Guangdong University of Foreign Studies): Guangwai-Lancaster Chinese Learner Corpus: A profile – via video conferencing from Guangzhou

Simon Smith (Coventry University): 语料酷!Corpora and online resources in the Mandarin classroom

Fong Wa Ha (University of Essex): A cross-cultural comparison of evaluation between concert reviews in Hong Kong and British newspapers

Vittorio Tantucci (Lancaster University): Second language semasiology (SLS): The case of the Mandarin sentence final particle 吧 ba

Andrew Hardie (Lancaster University): Using CQPweb to analyse Chinese corpus data

Vaclav Brezina (Lancaster University): Practical demonstration of the Guangwai-Lancaster Chinese Learner Corpus followed by a general discussion.

Clare Wright: Using Learner Corpora to analyse task effects on L2 oral interlanguage in English-Mandarin bilinguals


 

 

 

NewsHack 2016 Retrospective

The BBC’s multilingual NewsHACK event was run on the 15th and 16th of March as an opportunity for teams of language technology researchers to work with multilingual data from the BBC’s connected studio.  The theme was ‘multilingual journalism: tools for future news’, and teams were encouraged to bring some existing language technologies to apply to problems in this area. Nine teams attended from various news and research organisations. Lancaster University sent two teams with funding from CASS, CorCenCC, DSI, and UCREL: team ‘1’ consisting of Paul, Scott and Hugo, and team ‘A’ comprising Matt, Mahmoud, Andrew and Steve.


The brief from the newsHACK team suggested two possible directions: to provide a tool for the BBC’s journalist staff, or to create an audience-facing utility. To support us, the BBC provided access to a variety of APIs, but the Lancaster ‘A’ team were most interested to discover that something we’d thought would be available wasn’t — there is no service mapping news stories to their counterparts in other languages. We decided to remedy that.

The BBC is a major content creator, perhaps one of the largest multilingual media organisations in the world. This presents a great opportunity. Certain events are likely to be covered in every language the BBC publishes in, providing ‘translations’ of the news which are not merely literal translations at the word, sentence or paragraph level, but full-fledged contextual translations which identify the culturally appropriate ways to convey the same information. Linking these articles together could help the BBC create a deeply interesting multilingual resource for exploring questions about language, culture and journalism.

Interesting, but how do we make this into a tool for the BBC? Our idea was to take these linked articles directly to the users. Say you have a friend who would prefer to read the news in their native tongue (one different to your own): how would you share a story with them? Existing approaches seem to involve either using an external search engine (but then how do you know the results are what you intend to share, not speaking the target language?) or using machine translation to offer your friend a barely-readable version of the exact article you have read. We came up with an idea that keeps content curation within the BBC and provides readers with easy access to the existing high-quality translations being made by professional writers: a simple drop-down menu for articles which allows a user to ‘Read about this in…’ any of the BBC’s languages.


To implement this, in two days, required a bit of creative engineering. We wanted to connect articles based on their content, but we didn’t have tools to extract and compare features in all the BBC’s languages. Instead, we translated small amounts of text — article summaries and a few other pieces of information — into English, which has some of the best NLP tool support (and was the only language all of our team spoke). Then we could use powerful existing solutions to named entity recognition and part-of-speech tagging to extract informative features from articles, and compare them using a few tricks from record linkage literature. Of course, a lack of training data (and time to construct it!) meant that we couldn’t machine-learn our way to perfection for weighting these features, so a ‘human learning’ process was involved in manually tweaking the weights and thresholds until we got some nice-looking links between articles in different languages.
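The general shape of this approach can be sketched in a few lines of Python. This is not the team's actual code (that is in their repository); the feature names, weights and threshold below are made-up values, standing in for the hand-tuned ones described above, and the entity and noun sets are assumed to have been extracted already from the English summaries.

```python
def jaccard(a, b):
    """Set overlap: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

# Hypothetical hand-tuned weights and threshold (assumed values).
WEIGHTS = {"entities": 0.6, "nouns": 0.3, "date": 0.1}
THRESHOLD = 0.4

def link_score(article_a, article_b):
    """Weighted sum of per-feature similarities, between 0 and 1."""
    score = WEIGHTS["entities"] * jaccard(article_a["entities"], article_b["entities"])
    score += WEIGHTS["nouns"] * jaccard(article_a["nouns"], article_b["nouns"])
    score += WEIGHTS["date"] * (1.0 if article_a["date"] == article_b["date"] else 0.0)
    return score

def is_linked(article_a, article_b):
    """Treat two articles as covering the same story above the threshold."""
    return link_score(article_a, article_b) >= THRESHOLD
```

With labelled pairs of articles, the weights could be learned rather than tuned by eye, which is the step the team did not have the data or time for.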

Data is only part of the battle, though. We needed a dazzling front-end to impress the judges.  We used a number of off-the-shelf web frameworks to quickly develop a prototype, drawing upon the BBC’s design to create something that could conceivably be worked into a reader’s workflow: users enter a URL at the top and are shown results from all languages in a single dashboard, from which they can read or link to the original articles or their identified translations.


Here we have retrieved a similar article in Arabic, as well as two only-vaguely-similar ones in Portuguese and Spanish (the number indicates a percentage similarity).  The original article text is reproduced, along with a translated English summary.


The judges were impressed — perhaps as much with our pun-filled presentation as our core concept — and our contribution, the spontaneously-titled ‘Super Mega Linkatron 5000’ was joint winner in the category of best audience-facing tool.

The BBC’s commitment to opening up their resources to outsiders seems to have paid off with a crop of high-quality concepts from all the competitors, and we’d like to thank them for the opportunity to attend (as well as the pastries!).

The code and presentation for the team ‘A’ entry are available via GitHub at https://github.com/StephenWattam/LU-Newshack and images from Lancaster’s visit can be seen at https://flic.kr/s/aHskwHcpNH . Some of the team have also written their own blog posts on the subject: here’s Matt’s and Steve’s.

Team ‘1’ based their work around the BBC Reality Check service. This was part of the BBC News coverage of the 2015 UK general election; it published news items on Twitter and contributed to TV and radio news as well. For example, in May 2015 when a politician claimed that the number of GPs had declined under the coalition government, BBC Reality Check produced a summary of data obtained from a variety of sources to enable the audience to make up their own mind about this claim. Reality Check is continuing in 2016 with a similar service for the EU referendum, providing, for example, a check on how many new EU regulations there are every year (1,269 rather than the 2,500 claimed by Boris Johnson!). After consulting with the BBC technology producer and journalist attending the newsHACK, Team ‘1’ realised that the current Reality Check service could only serve its purpose for English news stories, so set about making a new ‘BBC Multilingual Reality Check’ service to support journalists in their search for suitable sources. Having a multilingual system is really important for the EU referendum and other international news topics because many potential sources are written in languages other than English.

In order to bridge related stories across different languages, we adopted the UCREL Semantic Analysis System (USAS) developed at Lancaster over the last 26 years. The system automatically assigns semantic fields (concepts or coarse-grained senses) to each word or phrase in a given text, and we reasoned that the frequency profile of these concepts would be similar for related stories even in different languages e.g. the semantic profile could help distinguish between news stories about finance or education or health. Using the APIs that the BBC newsHACK team provided, we gathered stories in English, Spanish and Chinese (the native languages spoken by team ‘1’). Each story was then processed through the USAS tagger and a frequency profile was generated. Using a cosine distance measure, we ranked related stories across languages. Although we only used the BBC multilingual news stories during the newsHACK event, it could be extended to ingest text from other sources e.g. UK Parliamentary Hansard and manifestos, proceedings of the European parliament and archives of speeches from politicians (before they are removed from political party websites).
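The ranking step can be illustrated with a short Python sketch. It assumes each story has already been reduced to a list of semantic tags; the tag labels used here (`MONEY`, `HEALTH` and so on) are placeholders rather than real USAS tag codes, and the function names are ours. It computes cosine similarity, so the corresponding cosine distance is simply one minus the score.

```python
import math
from collections import Counter

def cosine_similarity(profile_a, profile_b):
    """Cosine of the angle between two tag-frequency profiles (0 to 1)."""
    shared = profile_a.keys() & profile_b.keys()
    dot = sum(profile_a[t] * profile_b[t] for t in shared)
    norm_a = math.sqrt(sum(v * v for v in profile_a.values()))
    norm_b = math.sqrt(sum(v * v for v in profile_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def rank_related(query_tags, candidates):
    """Rank candidate stories by similarity to the query story.

    `candidates` maps a story identifier to its list of semantic tags.
    Returns (identifier, similarity) pairs, most similar first.
    """
    query = Counter(query_tags)
    scored = [
        (name, cosine_similarity(query, Counter(tags)))
        for name, tags in candidates.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Because the comparison is made on language-independent semantic tags rather than on words, an English story and a Spanish or Chinese story can be compared directly once each has been tagged in its own language.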

The screenshot below shows our analysis focussed on some main topics of the day: UK and Catalonia referendums, economics, Donald Trump, and refugees. Journalists can click on news stories in the system and show related articles in the other languages, ranked by our distance measure (shown here in red).

Team ‘1’s Multilingual Reality Check system would not only allow fact checking such as the number of refugees and migrants over time entering the EU, but also allow journalists to observe different portrayals of the news about refugees and migrants in different countries.


CASS receives Queen’s Anniversary Prize for Further and Higher Education

At the end of February, a team of CASS researchers attended the Presentation of the Queen’s Anniversary Prizes for Further and Higher Education, held at Buckingham Palace. The CASS team officially received the award from Their Royal Highnesses The Prince of Wales and The Duchess of Cornwall on 25th February 2016.

Back in November, it was announced that CASS received the esteemed Queen’s Anniversary Prize for its work in “computer analysis of world languages in print, speech, and online.” The Queen’s Anniversary Prizes are awarded every two years to universities and colleges who submit work judged to show excellence, innovation, impact, and benefit for the institution itself, and for the people and society generally in the wider world.

10 of us were selected to attend the ceremony itself, including the Chancellor, Vice-Chancellor, our Centre Director Tony McEnery, and three students. Buckingham Palace sent strict instructions about dress code and the possession of electronic devices, and we were well-read on royal etiquette by the time the big day arrived. I think all of us were a little nervous about what the day would have in store, but we met bright-eyed and bushy-tailed at 9:30am, and took a taxi to Green Park. We entered through the front gates into Buckingham Palace, and looked back at the crowd of adoring fans on the other side of the railings.

We showed our entry cards, and found ourselves being ushered across the courtyard and into the Palace itself. We dropped off our coats and bags, and then went up the grand staircase into the Ballroom where the ceremony was held. We began to relax as the Equerry told us what would be happening throughout the ceremony, and the Countess of Wessex’s String orchestra provided excellent music throughout the event. The score ranged from Handel, right through to John Lennon’s ‘Imagine’, and even a James Bond theme.

As the ceremony started, Vice-Chancellor Mark Smith and CASS Centre Director Tony McEnery passed through the guests, along with representatives from other universities and colleges, and then proceeded to form a line to receive the award. Chancellor Alan Milburn was seated at the front of the Ballroom, along with Anne, Princess Royal. While the Vice-Chancellor was receiving the award on behalf of CASS, The Prince of Wales asked him about our work, and was fascinated to discover what we have undertaken in the past 40 years. After a brief chat about our work, Mark Smith and Tony McEnery were presented with the Queen’s Anniversary Prize medal and certificate that will be displayed in the John Welch Room in University House.

After the ceremony, we filed through into the Picture Gallery for the reception. Over the course of the next 60-90 minutes, guests were free to mingle and network with each other whilst canapés were served. Dignitaries passed through and spoke to the visitors; Anne, Princess Royal, had a keen interest in the impact of our work on dictionary-making, and I must admit that Tony McEnery was excellent at giving a summary of what corpus research entails. He outlined how it is used in modern-day dictionary building, and discussed some of the historical texts that we now have access to.

The Duchess of Cornwall also visited our group over the course of the event, and made a point of speaking to both Gill Smith and Rosie Knight about the practical applications of their research. They discussed extensively why corpus research is such a useful method in the social sciences, and spoke of their personal connection to the research centre.

Having the opportunity to promote and discuss our research with royalty was a true honour, and I think it is fantastic to see the work of CASS recognised in this unique and special way.

FireAnt has officially launched!

Laurence Anthony and Claire Hardaker first introduced FireAnt at the CL2015 conference. In their talk, Claire explained that her work with the Discourse of Online Misogyny (DOOM) project had led her to realise that when working with Twitter data, you quickly encounter a large array of problems: how to easily collect data, how to arrange that data in a useful way, and how to then analyse that data effectively. It was these problems that had led to the creation of FireAnt, a freeware social media and data analysis toolkit. Laurence and Claire showed the CL2015 audience a beta version of FireAnt, and it’s safe to say it was very well received… the Q&A at the end of their talk went along the lines of ‘it will be publicly available, right?’, ‘when can I get my hands on it?’, and ‘can I sign up to help you trial the beta version?’.

Well, ladies and gentlemen, the wait is over. On Monday 22nd February, Laurence and Claire officially launched FireAnt; it became available to the public on Laurence Anthony’s website, and they held a launch event at Lancaster University to teach people how to use the tool. Here’s a little about what I learnt at this launch event…

FireAnt is not just for analysing social media data; it’s also for collecting it…

FireAnt makes collecting tweets incredibly easy. All you have to do is enter a search term, specify how long you want FireAnt to collect tweets for (or set a maximum number of tweets you want it to collect), and go away and have a cup of tea. To trial this, I instructed FireAnt to start collecting tweets that contained the hashtag #feminism. While I munched my way through two biscuits and two cups of tea, 675 tweets were posted on Twitter containing the hashtag #feminism; FireAnt collected all of these.

FireAnt helps you extract the data you’re actually interested in…

When you collect social media data, you don’t just collect texts that people have posted online. You also collect lots of information about these texts – for example, the date and time that each text was posted on the internet, the username of the person who posted the content, the location of that user, etc. This means that the file containing all your data is often very large, and you have to extract the bits you want to work with. This sounds simple but in reality it’s not, unless you’re a fairly capable programmer and have a computer with a decent amount of memory. However, with FireAnt, the process is much simpler. FireAnt automatically detects what information you have in your file, allows you to filter this, and creates new files with the information you’re interested in (without crashing your computer!).
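To show what doing this by hand involves, here is a rough Python sketch that reads a file of tweets one line at a time and keeps only three fields. It says nothing about how FireAnt works internally. It assumes one JSON object per line and the field names `created_at`, `text` and `user.screen_name`, which follow the classic Twitter API layout but may not match your own file.

```python
import io
import json

def extract_fields(lines, out):
    """Stream JSON-lines tweets and write one tab-separated row per tweet.

    Reading line by line keeps memory use low even for very large files.
    Returns the number of tweets written.
    """
    kept = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        tweet = json.loads(line)
        user = tweet.get("user") or {}
        row = [
            str(tweet.get("created_at") or ""),
            str(user.get("screen_name") or ""),
            str(tweet.get("text") or ""),
        ]
        # Flatten tabs and newlines so each tweet stays on one row.
        cleaned = [f.replace("\t", " ").replace("\n", " ") for f in row]
        out.write("\t".join(cleaned) + "\n")
        kept += 1
    return kept

# Small in-memory demonstration with a single made-up tweet.
sample = io.StringIO(
    '{"created_at": "Mon Feb 22", "text": "hello #feminism", '
    '"user": {"screen_name": "abc"}}\n'
)
result = io.StringIO()
count = extract_fields(sample, result)
```

With real data you would pass open file handles instead of the `io.StringIO` objects used in the demonstration.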

FireAnt can also help you analyse your data…

At the launch event, we experimented with three different analysis features of FireAnt. Firstly, FireAnt allows you to gather timeseries data, showing the usage of a particular word within your dataset across time. You can use this to produce pretty graphs, such as the one below, or export the data to Excel.
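As a rough idea of what sits behind a time series like this, the Python sketch below counts, for each day, how many posts contain a given word. The function name and the input format (pairs of timestamp and text) are assumptions for illustration and are not FireAnt's own interface.

```python
from collections import Counter
from datetime import datetime

def daily_counts(posts, word):
    """Count, per calendar day, the posts whose text contains `word`.

    `posts` is an iterable of (datetime, text) pairs. Matching is
    case-insensitive and on whole whitespace-separated tokens.
    """
    target = word.lower()
    counts = Counter()
    for timestamp, text in posts:
        if target in text.lower().split():
            counts[timestamp.date().isoformat()] += 1
    return dict(sorted(counts.items()))
```

The resulting dictionary maps each date to a count, which is the kind of table you could then plot or export to a spreadsheet.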

[Figure: FireAnt time-series graph]
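The idea behind this kind of time-series data can be sketched in a few lines of Python. This is only an illustration under assumptions: posts are taken to be (date, text) pairs, and the sample data is invented.

```python
import re
from collections import Counter


def daily_counts(posts, word):
    """Count, per date, the posts whose text contains `word` as a whole token."""
    pattern = re.compile(r"(?<!\w)" + re.escape(word) + r"(?!\w)", re.IGNORECASE)
    counts = Counter()
    for date, text in posts:
        if pattern.search(text):
            counts[date] += 1
    return sorted(counts.items())


# Invented sample posts as (date, text) pairs.
posts = [
    ("2016-02-20", "New tool for tweets"),
    ("2016-02-21", "Trying the tool today"),
    ("2016-02-21", "This TOOL is great"),
    ("2016-02-22", "Toolkit released"),
]

series = daily_counts(posts, "tool")
print(series)
```

The resulting list of (date, count) pairs is the sort of table that can be plotted or written to a CSV file for a spreadsheet.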
Secondly, FireAnt can produce Geoposition maps. For example, below is a picture of Abi Hawtin, one of CASS’ research students, who’s looking very excited because she used FireAnt to create a map showing the different locations that the tweets in her dataset were posted from:

[Figure: Abi Hawtin with a FireAnt geoposition map]

Thirdly, FireAnt allows you to easily produce network graphs. One great feature of these graphs is that they allow you to plot lots of different things. These types of graphs are typically hard to produce, but with a tool like FireAnt it’s easy.

What are you waiting for? Time to try FireAnt out for yourself!

 

Sino-UK Corpus Linguistics Summer School

At the end of July, Tony McEnery and I taught at the second Sino-UK corpus linguistics summer school, arranged between CASS and Shanghai Jiao Tong University. It was my first time visiting China and we arrived during an especially warm season with temperatures hitting 40 degrees Celsius (we were grateful for the air conditioning in the room we taught in).

Tony opened the summer school, giving an introductory session on corpus linguistics, followed a few days later by a session on collocations, where he introduced CASS’s new tool for collocational networks, GraphColl. I gave a session on frequency and keywords, followed by later sessions on corpus linguistics and language teaching, and CL and discourse analysis. For the lab work components of our sessions, we didn’t use a computer lab. Instead the students brought along their own laptops and tablets, including a few who carried out BNCweb searches on their mobile phones! I was impressed by how much the students attending already knew, and had to think on my feet a couple of times – particularly when asked to explain some of the more arcane aspects of WordSmith (such as the “Standardised Type Token ratio standard deviation”).

At the end of the summer school, a symposium was held where Tony gave a talk on his work with Dana Gablasova and Vaclav Brezina on the Trinity Learner Language corpus. I talked about some research I’m currently doing with Amanda Potts on change and variation in British and American English.

Also presenting were Prof. Gu Yueguo (Beijing Foreign Studies University), who talked about building a corpus of texts on Chinese medicine, and Prof. Roger K Moore (University of Sheffield), who discussed adaptive speech recognition in noisy contexts.

We were made to feel very welcome by our host, Gavin Zhen, one of the lecturers at the university, who went out of his way to shuttle us on the 90 minute journey from the university to our hotel on the Bund.

It was a great event and it was nice to see students getting to grips with corpus linguistics so enthusiastically.

#CL2015 social media roundup: Using Corpus Linguistics to investigate Corpus Linguists talking about Corpus Linguistics

Introduction

Corpus Linguistics 2015 – CL2015 – is the largest conference of its kind and this year drew over 250 attendees from all over the world to present work outlining the state of Corpus Linguistics (CL) at large, leading-edge technology and methods, and setting the agenda for years to come.

Of particular interest to me was a small but important streak of enquiry running through the conference, which is also becoming more prevalent in CL as a whole. That is, a focus on corpora collected from online sources such as blogs and social media (Elgesem & Salway 2015; Grieve, et al. 2015; Hardaker & McGlashan 2015; Knight 2015; Longhi & Wigham 2015; McGlashan & Hardaker 2015; Statache, et al. 2015). The Internet now enables great opportunities for the collection and interrogation of large amounts of data – big data, even – and the rapid compilation of specialised corpora in ways previously impossible.

I focus here on social media data, specifically data collected from Twitter. Sampling data from Twitter, like a lot of other online sources, offers the opportunity to collect what people are saying (the content of their posts; tweets) but also a huge amount of metadata about the date, time, user, shared content (e.g. hyperlinks, retweets), interactional information, etc. relating to those posts. As Corpus Linguists, we therefore get the data we sample for – posts containing the thing(s) we are interested in – as well as other social information about the content creators and their social networks that we may or may not be interested in. Indeed, the kinds of metadata included in and attached to online posts have sparked a great deal of debate about the ethics of collecting and using publicly posted online content, though these concerns are not discussed here. Instead, the potential for online ethnography is explored. In order to do this, I pair familiar CL research methods with methods from Social Network Analysis (SNA) that are more explicitly focussed on social networks and examining the myriad ways people affiliate with each other.

Theory & Methods: Corpus-assisted Community Analysis (CoCoA)

Corpus-assisted Community Analysis (CoCoA) is a multimethodological approach to the study of online discourse communities that combines methods from Discourse Analysis (DA), CL, and SNA.

Corpus-assisted Discourse Analysis

I predominantly draw on Baker (2006) in my approach to corpus-assisted DA, seeing discourse in a Foucauldian sense as forms of social practice: “practices which systematically form the objects of which they speak” (Foucault 1972: 49). Particularly, I am interested in the incremental effect of discourse. Baker suggests, “a single word, phrase or grammatical construction on its own may suggest the existence of a discourse” (2006: 13). However, in order to investigate how quantitatively typical or pervasive discourse is within a discourse community, numerous examples of linguistic instantiations of discourse are required to make a claim about its cumulative effect (ibid.). Following Baker, I argue here that corpora and CL techniques enable this kind of quantitative examination of discourse.

Social Network Analysis

SNA draws on notions from graph theory to formally model and describe the properties of relationships between objects of study such as people and institutions. A graph (or ‘sociogram’) is a representation of people or institutions of interest as ‘nodes’ and the relationships between them as a set of lines known as ‘edges’; a graph is built by representing “a set of lines [‘edges’] connecting points [‘nodes’]” (Scott 2013: 17). To interpret graphs, graph theory contributes “a body of mathematical axioms and formulae that describe the properties of the patterns formed by the lines [‘edges’]” (Scott 2013: 17). One of these axioms is ‘directionality’. Directed graphs can encode both symmetric and asymmetric relations (D’Andrea, et al. 2010: 12). Asymmetric relationships are those in which nodes are connected by an edge that has a single direction of flow from one node to another, as illustrated by the relations between A and C, and C and B in Fig. 1. Symmetric relationships are those in which an edge connects two nodes but is bidirectional – the direction of relation flows both ways – as illustrated by the relationship between A and B in Fig. 1. Directed relationships on Twitter include followership relations and the act of mentioning, i.e. including another user’s handle (e.g. @CorpusSocialSci) in a tweet.


Figure 1: A simple directed graph
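As a rough illustration of the symmetric/asymmetric distinction, the relations described for Fig. 1 can be written as a set of directed edges in Python. The symmetric tie between A and B follows the text; the particular directions chosen for the A–C and C–B edges are assumptions, since the figure itself is not reproduced here.

```python
# Directed edges as (source, target) pairs.
# A <-> B is symmetric per the text; the directions of the
# A-C and C-B edges are assumed for illustration.
edges = {("A", "B"), ("B", "A"), ("A", "C"), ("C", "B")}


def classify(edges):
    """Split directed edges into symmetric pairs and asymmetric edges."""
    symmetric, asymmetric = set(), set()
    for src, dst in edges:
        if (dst, src) in edges:
            # Both directions exist: store the pair once, order-free.
            symmetric.add(frozenset((src, dst)))
        else:
            asymmetric.add((src, dst))
    return symmetric, asymmetric


symmetric, asymmetric = classify(edges)
print("symmetric:", symmetric)
print("asymmetric:", asymmetric)
```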

Undirected graphs represent identical, symmetric relationships between nodes which might be the result of nodes sharing reciprocal attitudes or “because they have a common involvement in the same activity” (Scott 2013: 17). Fig. 2 gives a graphical representation of an undirected graph.


Figure 2: A simple undirected graph

Directed and undirected (‘ambient’) kinds of affiliation are both understood here as being distinct forms of discursively constructed social practices. Furthermore, I adopt the term ‘ambient affiliation’ from the work of Zappavigna on the use of social media in the formation of community and identity (Zappavigna 2012; Zappavigna 2013). Ambient affiliation is about the functionalities of social media platforms that enable users “to commune with others without necessarily engaging in direct conversational exchanges” (Zappavigna 2013: 223-4). Therefore, ambient affiliation is about people exhibiting the same behaviours or sharing the same qualities but without directly interacting with each other. This notion closely approximates the notion of an ‘undirected’ graph. In developing the theory of ambient affiliation, Zappavigna draws on Page’s work on hashtags. Page refers to hashtags as “a search term” (2012: 183). Hashtags – a string of characters (usually a word or short phrase) unbroken by spaces or non-alphabetic/non-numeric characters (excl. underscores ‘_’) preceded by ‘#’ (e.g. #YOLO) – are used as metadiscursive markers of the topic of a tweet. Page goes on to argue that, “the kind of talk which aggregate around hashtags […] involve multiple participants talking simultaneously about the same topic, rather than individuals necessarily talking with each other in dyadic exchanges that resemble a conversation” (2012: 196). As such, Page suggests that hashtags destabilise conventional adjacency pairs characteristic of many forms of human dialogue and give a new way for humans to interact on a topic of mutual interest.
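The definition of a hashtag given above translates fairly directly into a regular expression, and grouping users by shared hashtag gives a simple picture of ambient affiliation. The following Python sketch is an illustration only: the sample tweets are invented, and lower-casing hashtags so that #CL2015 and #cl2015 are grouped together is an assumption made for convenience.

```python
import re
from collections import defaultdict

# '#' followed by letters, digits or underscores; the lookbehind
# stops a '#' in the middle of a word from counting as a hashtag.
HASHTAG = re.compile(r"(?<!\w)#(\w+)")


def hashtags(text):
    """Return the hashtags in a tweet, lower-cased and without the '#'."""
    return [tag.lower() for tag in HASHTAG.findall(text)]


def users_by_hashtag(tweets):
    """Map each hashtag to the set of users who used it."""
    groups = defaultdict(set)
    for user, text in tweets:
        for tag in hashtags(text):
            groups[tag].add(user)
    return groups


# Invented sample tweets as (user, text) pairs.
tweets = [
    ("ann", "Off to #CL2015"),
    ("bob", "#cl2015 day 2 was great"),
    ("cat", "#YOLO"),
]

groups = users_by_hashtag(tweets)
print(dict(groups))
```

Here `ann` and `bob` never mention each other, yet they end up in the same group: an undirected tie created purely by the shared hashtag.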

Data

I collected all tweets and retweets including the official hashtag of the Corpus Linguistics 2015 conference – #CL2015 – posted from the date of the first pre-conference workshop (20/07/2015) until the final day of the conference (24/07/2015). To do this, I used the R-based Twitter client ‘twitteR’ to access the Twitter API. The resulting data amounted to:

           Total number
Tweets     671
Retweets   1025


Date         Tweets   Retweets
20/07/2015   57       76
21/07/2015   128      169
22/07/2015   152      370
23/07/2015   176      241
24/07/2015   158      169
Totals       671      1025

The tweets corpus contained around 10,000 words in total.

Issues

The data contained some ‘noise’ mainly caused by other people using the same #CL2015 hashtag to talk about another event occurring during the period of the conference. However, as I will show in the analysis, the methods enable researchers to focus only on the communities they are interested in.

Analysis

Tweets – what was being talked about?

To find out what people were talking about day-to-day, I created daily tweet corpora. With each of these daily corpora, I performed a keyword analysis using a reference corpus compiled from the tweets of the remaining days. So, for the tweets sent during the pre-conference workshop day (20/07/2015) I used the tweets sent during the rest of the conference (21/07/2015-24/07/2015) as a reference corpus, and so on. The resulting top 10 keywords for each day are given in the table below.

Rank  20/07/2015   21/07/2015   22/07/2015    23/07/2015   24/07/2015
1     CL2015       change       sealey        partington   illness
2     workshop     fireant      granger       duguid       mental
3     pre          biber        animals       gala         news
4     workshops    climate      sylviane      class        literature
5     conference   doom         campaign      dinner       yahoo
6     main         misogyny     heforshe      please       dickens
7     starting     academic     collocation   poster       csr
8     historian    assist       eeg           legal        health
9     day          biber’s      handford      alan         incelli
10    lancaster    bnc          learner       mock         jaworska

The keywords shown in each column outline the most distinctive topics tweeted about during the conference. Italics used here relate back to keywords in the table.
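The day-versus-rest keyword procedure can be sketched in Python as follows. The keyness statistic behind the table is not stated, so log-likelihood is assumed here as one common choice; the whitespace tokeniser and the sample tweets are likewise simplifications invented for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    """Very rough tokeniser: lower-case and split on whitespace."""
    return text.lower().split()


def log_likelihood(a, b, c, d):
    """Keyness of a word seen a times in a target corpus of c tokens
    and b times in a reference corpus of d tokens."""
    e1 = c * (a + b) / (c + d)  # expected frequency in target
    e2 = d * (a + b) / (c + d)  # expected frequency in reference
    ll = 0.0
    if a:
        ll += a * math.log(a / e1)
    if b:
        ll += b * math.log(b / e2)
    return 2 * ll


def daily_keywords(corpora, day, top_n=10):
    """Top keywords for one day, using all other days as the reference."""
    target = Counter(tok for text in corpora[day] for tok in tokenize(text))
    reference = Counter(
        tok
        for other, texts in corpora.items()
        if other != day
        for text in texts
        for tok in tokenize(text)
    )
    c, d = sum(target.values()), sum(reference.values())
    if c == 0 or d == 0:
        return []  # nothing to compare
    scored = []
    for word, a in target.items():
        b = reference[word]
        if a / c > b / d:  # keep only words over-represented on this day
            scored.append((log_likelihood(a, b, c, d), word))
    scored.sort(reverse=True)
    return [word for _, word in scored[:top_n]]


# Invented miniature corpora keyed by day.
corpora = {
    "20/07": ["workshop starting today", "pre conference workshop"],
    "21/07": ["biber plenary on change", "climate change talk"],
    "22/07": ["animals in the media talk"],
}

for day in corpora:
    print(day, daily_keywords(corpora, day, top_n=3))
```

Rotating the reference corpus in this way means each day is compared only against the rest of the same event, so words common to the whole conference tend not to surface as keywords unless they are concentrated on one day.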

On day 1, the pre-conference workshops, including @antlab’s pre-conference corpus tools brainstorming session and @stgries’s pre-conference #R workshop, were popular topics of conversation in the smallest subsample of tweets for the week.

Top favourited tweet from day 1:

On day 2, more diverse topics start to emerge. Change became a theme, relating both to Andrew Salway’s talk on discourse surrounding climate change and to a talk given by Doug Biber on historical linguistic change in ‘uptight’ academic texts. Fireant, a new user-friendly tool for efficiently dealing with large databases developed by Laurence Anthony, was also unveiled to the CL masses on day 2, which prompted a flurry of excited tweets [keep track of Laurence’s Twitter page for release]. DOOM and misogyny also became topical following talks by Claire Hardaker and Mark McGlashan on the Discourse of Online Misogyny project. Finally, some excitement followed a paper given by Robbie Love and Claire Dembry about the new Spoken BNC2014. For those interested, keep track of the CASS website for spoken data grants later in the year.

Top favourited tweet from day 2:

Day 3 saw another topic change, focussing most prominently on Alison Sealey’s talk on the discursive construction of animals in the media, Sylviane Granger’s plenary on learner corpora, a talk on the public’s online reaction to the #HeForShe campaign given by Rosie Knight, and Jen Hughes’ talk on the application of EEG (‘Electroencephalography’) to the study of collocation as a cognitive phenomenon.

Top favourited tweet from day 3:

After 3 days of incredibly interesting talks, corpus linguists were about ready for their gala dinner on day 4. But before all the cheesecake, the CL2015 attendees were excitedly tweeting about the all-important poster session; Alison Duguid’s talk on class; the Geoffrey Leech tribute panel, which included Charlotte Taylor’s paper on mock politeness and ‘bitchiness’ as well as Lynne Murphy and Rachele de Felice’s talk on the differential use of please in BrE and AmE; Alan Partington’s plenary speech on CADS; and papers given by Ruth Breeze, Amanda Potts, and Alex Trklja on the application of CL methods to the study of a broad range of legal language.

Top favourited tweet from day 4:

Day 5 brought #CL2015 to a close, but the number of tweets remained steady, with health on the agenda thanks to talks from Ersilia Incelli and Gillian Smith, who both focussed partly on the construction of mental illness/health in the news. News also featured in Monika Bednarek’s talk on news discourse and Antonio Fruttaldo’s analysis of news tickers. Other key topics related to Sylvia Jaworska and Anupam Nanda’s paper on the Corpus Linguistic analysis of Corporate Social Responsibility (CSR), Michaela Mahlberg’s work on the literature of Charles Dickens, and discussion of a corpus of Yahoo answers in the week’s penultimate panel on triangulating methodological approaches.

Top favourited tweet from day 5:

Approaching tweets in this way, it was possible to find out the most salient topics of each day. However, I was also interested in the retweeting behaviour of attendees.

Retweets – what was being talked about?

I looked at the top 10 most frequently retweeted tweets during the conference. Due to the intertextual nature of retweets – they are simply identical reposts of the same content – methods familiar to CL such as word frequency lists may not be as useful in their study. For example, if a few retweets are particularly frequently reposted, the most frequent words will be skewed by the content of the most frequent retweets. Instead, I suggest that retweets themselves should be conceptualised as being individual types in and of themselves that require more qualitative approaches to their interpretations (at least in this context). The top 10 most frequently retweeted tweets including the #CL2015 hashtag are given below:

  Retweet Date Freq
1 RT @EstrategiasEc: Concluimos este viernes con exitoso proceso de postulación @ECLideres VI Prom. #CL2015 con auspicio de @ucatolicagye. ht… 22/07/2015 218
2 RT @perayson: To access the new HT semantic tagger from the @SAMUELSProject see http://t.co/5LFWH8YGAH and http://t.co/BPxcC8pNNK #CL2015 23/07/2015 15
3 RT @UCREL_Lancaster: The #CL2015 abstract book is now available to download from the conference website http://t.co/px9hh3mMNe 21/07/2015 13
4 RT @duygucandarli: Important take-away messages about corpus research in Biber’s plenary talk at #CL2015! http://t.co/xm87Uo1umZ 21/07/2015 11
5 RT @lynneguist: Alan Partington looking at how quickly language changes in White House Press Briefings… #CL2015 http://t.co/jeVjvC8Ym3 23/07/2015 10
6 RT @CorpusSocialSci: .@_paulbaker_ reflecting on a number of approaches to the same data at the Triangulation panel at #CL2015 http://t.co/… 24/07/2015 10
7 RT @CorpusSocialSci: .@vaclavbrezina introduces Graphcoll, a new visualisation tool for collocational networks #CL2015 http://t.co/PM5FxS5N… 22/07/2015 9
8 RT @_ctaylor_: It’s a myth that reference corpora have to larger than target corpus says @antlabjp  #cl2015 22/07/2015 7
9 RT @Loopy63: #CL2015 Call for papers for Intl. Conference on  statistical analysis of textual data 2016 in Nice, France: http://t.co/3JpcAa… 23/07/2015 7
10 RT @vaclavbrezina: A great use of #GraphColl by @violawiegand – #CL2015 poster presentation @TonyMcEnery @StephenWattam http://t.co/uwlMGUY… 24/07/2015 7

The most frequent retweet was regarding a Latin American Youth Leadership programme that shared the same #CL2015 hashtag [nb. For next year, Corpus Linguistics conference organisers…]. As you will notice, this retweet occurred on 22/07/2015, but as retweets and tweets are dealt with separately, the retweet does not interfere with the keyword analysis done for the same day on the tweets.

What do the most frequent retweets highlight? Free tools (GraphColl, HT semantic tagger), free resources (abstract book), plenary talks and more conferences.
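Treating each retweet as a type in its own right amounts to counting whole texts rather than words. A minimal Python sketch, assuming retweets can be recognised by the "RT @" prefix seen in the table above and using invented sample posts:

```python
from collections import Counter


def split_posts(posts):
    """Separate original tweets from retweets using the 'RT @' prefix."""
    tweets = [p for p in posts if not p.startswith("RT @")]
    retweets = [p for p in posts if p.startswith("RT @")]
    return tweets, retweets


# Invented sample posts.
posts = [
    "Great plenary this morning #CL2015",
    "RT @some_user: Abstract book now online #CL2015",
    "RT @some_user: Abstract book now online #CL2015",
    "RT @other_user: New tool announced #CL2015",
]

tweets, retweets = split_posts(posts)

# Each distinct retweet text is one type; its count is its frequency.
top_retweets = Counter(retweets).most_common(10)
for text, freq in top_retweets:
    print(freq, text)
```

Keeping the two lists apart is also what stops a heavily reposted retweet from distorting word-level analyses run on the original tweets.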

Networks

With a general idea of what people are talking about and sharing using the #CL2015 hashtag, I was interested to examine the overall activity around #CL2015 and the emergence of discourse communities.

In terms of tweets, the gif below shows how relationships developed over the course of the conference. Every node represents a Twitter account that posted a tweet containing #CL2015 during the period of data collection. The size of each node is dictated by its ‘degree’, i.e. its number of edges. More edges = larger node. The colour of the nodes is determined by ‘betweenness centrality’, which indicates how central a node is in a network. Nodes with high betweenness centrality speed up the transfer of information through a network because they lie on many of the shortest paths between other nodes. Nodes with high betweenness centrality are coloured red, those with medium betweenness centrality yellow, and those with low betweenness centrality blue. Nodes with intermediary colours (orange, green) have a betweenness centrality somewhere between low and medium or between medium and high. Finally, the colour and size of edges are dictated by ‘weight’. In this example, weight is the frequency of tweets that exist between nodes. Thick red edges connect nodes that send tweets to each other frequently, or where one node mentions another frequently. Thin blue edges represent low-frequency mentioning relationships, and yellow edges medium-frequency ones. Again, blended colours represent intermediary frequencies and thus, in this case, weight.

CL2015 tweets

The tweets network shows that @CorpusSocialSci was – perhaps unsurprisingly – the most prolific and central account in the #CL2015 network. It had the most connections and joined the most individual accounts together. But other users were also very active in helping to disseminate information more widely; these are shown by the nodes in yellow and orange. The accounts on the periphery of the network are good examples of ambient affiliation. They use #CL2015 to affiliate but do not directly engage with others by mentioning other users. Moreover, the gif attempts to show the evolution and growth of the network over time, and also shows that new topics, and networks of interaction relating to those topics, emerged each day. As talks (and news of talks in the network) became topical, people tweeted and shared ideas and notes relevant to those talks. An example of this is the emergence of fireant on 21/07/2015. When it was introduced to delegates, an ad hoc online discourse community formed to spread the news of the new tool, add new information, and channel their enthusiasm back to source.

User Date Tweet
RachelleVessey 2015-07-21 16:50:43 Excellent end to the first day of #CL2015- FireAnt looks like a fantastic programme @antlabjp @DrClaireH can’t wait to try it out!
SLGlaas 2015-07-21 16:54:13 Stupidly excited about #Fireant from @antlabjp  #CL2015
CorpusSocialSci 2015-07-21 16:54:43 Everyone is eagerly wondering when FireAnt will be available. @antlabjp’s answer is hopefully within the next few months. #CL2015
Rosie_Knight 2015-07-21 16:56:40 Amazing talk about FireAnt- can’t wait to use this on my #HeForShe data! @DrClaireH @antlabjp @Mark_McGlashan #CL2015
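The three measures used to draw the tweets network (edge weight, degree and betweenness centrality) can be sketched in plain Python. This is an illustration under assumptions rather than the procedure used to produce the gif: mentions are treated as directed edges, betweenness is computed with Brandes’ algorithm on unweighted shortest paths, and the sample tweets and handles are invented.

```python
import re
from collections import Counter, defaultdict, deque


def mention_edges(tweets):
    """Map (author, mentioned) -> number of mentions, i.e. the edge weight."""
    weights = Counter()
    for author, text in tweets:
        for handle in re.findall(r"@(\w+)", text):
            if handle != author:
                weights[(author, handle)] += 1
    return weights


def degree(weights):
    """Number of distinct edges touching each node (incoming or outgoing)."""
    deg = Counter()
    for src, dst in weights:
        deg[src] += 1
        deg[dst] += 1
    return deg


def betweenness(weights):
    """Unnormalised betweenness centrality for a directed, unweighted graph."""
    adj = defaultdict(set)
    nodes = set()
    for src, dst in weights:
        adj[src].add(dst)
        nodes.update((src, dst))
    score = dict.fromkeys(nodes, 0.0)
    for s in nodes:
        # Breadth-first search from s, counting shortest paths.
        sigma = dict.fromkeys(nodes, 0)
        sigma[s] = 1
        dist = {s: 0}
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies from the farthest nodes back to s.
        delta = dict.fromkeys(nodes, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                score[w] += delta[w]
    return score


# Invented sample tweets as (author, text) pairs.
tweets = [
    ("a", "hello @hub"),
    ("hub", "thanks @b"),
    ("a", "again @hub"),
    ("c", "@hub hi"),
]

weights = mention_edges(tweets)
print(weights)
print(degree(weights))
print(betweenness(weights))
```

In this toy network `hub` has both the highest degree and the only non-zero betweenness, which is the pattern the text describes for the most central account.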

CL2015 retweets

The retweets network shows that @CorpusSocialSci was, again perhaps unsurprisingly, at the centre of #CL2015 retweeting activity. The retweet network gif shows 2 discrete networks. The right-hand network shows activity at the CL conference; the left-hand network shows the retweeting behaviour of the Latin American Youth Leadership programme mentioned above, which avid conference tweeters may have noticed when keeping track of the #CL2015 hashtag. The left-hand network – a graphic representation of the most retweeted tweet containing #CL2015 shown above – shows 218 users retweeting a single central account. In this network there is no interaction between the users engaged in retweeting this user. This kind of network formation is extremely typical of users retweeting news stories on Twitter. The right-hand network, however, shows a great deal of mutual retweeting, whereby users are engaged on a prolonged basis in sharing each other’s tweets and forming a network of sharing and resharing.
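The contrast between the two retweet networks can be approximated in Python by splitting the retweet graph into connected components and checking whether a component is a simple ‘star’ around one account. This is a sketch under assumptions: the retweeted account is read from the "RT @handle:" prefix, and the sample data and handles are invented.

```python
import re
from collections import defaultdict, deque

RT = re.compile(r"^RT @(\w+):")


def retweet_edges(retweets):
    """Return the set of (retweeter, retweeted account) pairs."""
    edges = set()
    for user, text in retweets:
        match = RT.match(text)
        if match:
            edges.add((user, match.group(1)))
    return edges


def components(edges):
    """Connected components, ignoring the direction of edges."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)
    seen, result = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = set(), deque([start])
        seen.add(start)
        while queue:
            v = queue.popleft()
            comp.add(v)
            for w in adj[v]:
                if w not in seen:
                    seen.add(w)
                    queue.append(w)
        result.append(comp)
    return result


def is_star(component, edges):
    """True if everyone retweets one account that itself retweets nobody."""
    sources = {src for src, dst in edges if src in component}
    targets = {dst for src, dst in edges if src in component}
    return len(targets) == 1 and not (targets & sources)


# Invented sample: one broadcast-style cluster and one mutual cluster.
retweets = [
    ("u1", "RT @news: big announcement"),
    ("u2", "RT @news: big announcement"),
    ("u3", "RT @news: big announcement"),
    ("x", "RT @y: nice slides"),
    ("y", "RT @x: thanks!"),
    ("z", "RT @x: thanks!"),
]

edges = retweet_edges(retweets)
for comp in components(edges):
    print(sorted(comp), "star" if is_star(comp, edges) else "mutual")
```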

Summary

Integrating methods from CL and SNA offers some really interesting possibilities for the analysis of large amounts of social data. Here, I have used keyword analysis to find the most salient topics for each day of the conference, used those topics to find and visualise small but coherent discourse communities, and situated those communities within the wider #CL2015 social network.

Contact

m.mcglashan(Replace this parenthesis with the @ sign)lancaster.ac.uk

@Mark_McGlashan

References

Baker, P. (2006) Using Corpora in Discourse Analysis. London: Continuum.

D’Andrea, A., Ferri, F. & Grifoni, P. (2010). An overview of methods for virtual social network analysis. In: A. Abraham, A.-E. Hassanien, & V. Snášel (eds.). Computational Social Network Analysis. London: Springer London, pp. 3–26.

Elgesem, D. & Salway, A. (2015) Traitor, whistleblower or hero? Moral evaluations of the Snowden affair in the blogosphere. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 99-101

Foucault, M. (1972) The Archaeology of Knowledge. London: Tavistock.

Grieve, J., Nini, A., Guo, D. & Kasakoff, A. (2015) Recent changes in word formation strategies in American social media. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 140-3

Knight, R. (2015) Tweet all about it: public views on the UN’s HeForShe campaign. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 201-3

Longhi, J. & Wigham, C. R. (2015) Structuring a CMC corpus of political tweets in TEI: corpus features, ethics and workflow. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 408-9

Hardaker, C. & McGlashan, M. (2015) Twitter rape threats and the discourse of online misogyny (DOOM): from discourses to networks. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 154-6

McGlashan, M. & Hardaker, C. (2015) Twitter rape threats and the discourse of online misogyny (DOOM): using corpus-assisted community analysis (COCOA) to detect abusive online discourse communities. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. pp. 234-5

Page, R. (2012). The linguistics of self-branding and micro-celebrity in Twitter: The role of hashtags. Discourse & Communication. 6 (2). pp. 181–201.

Scott, J. (2013). Social Network Analysis. 3rd Ed. London: Sage.

Statache, R., Adolphs, S., Carter, C. J., Koene, A., McAuley, D., O’Malley, C., Perez, E. & Rodden, T. (2015) Descriptive ethics on social media from the perspective of ideology as defined within systemic functional linguistics. In Formato, F. & Hardie, A. (Eds.) Corpus Linguistics 2015 Abstract Book. Paper presented at Corpus Linguistics 2015, Lancaster. Lancaster University. p. 433

Zappavigna, M. (2012). Discourse of Twitter and social media. London: Continuum.

Zappavigna, M. (2013). Enacting identity in microblogging through ambient affiliation. Discourse & Communication. 8 (2). pp. 209–228.