Ants On Fire

Being an honorary research fellow at CASS is not only a great honor but a great pleasure. In December
of 2015, my initial three-year fellowship at CASS was extended for a further three years, and this opened up the possibility of returning to Lancaster for a seven-week research stay between February and March of 2016.

The timing of this research stay was especially enjoyable as it coincided with CASS receiving the Queen’s Anniversary Prize for Higher and Further Education for its contributions to computer analysis of world languages in print, speech and online. As part of a week of celebrations at the Centre, I worked with Claire Hardaker of CASS to organize a launch event for our new FireAnt social media analysis toolkit on February 22. FireAnt is a tool that allows researchers to easily extract relevant data from social media data sources, visualize that data in the form of time-series plots, network graphs, and geolocation maps, and export results for further analysis using traditional corpus tools. At the event, 20 invited participants learned how to use the new tool to analyze Twitter and other social media data sets. They also gave us very valuable comments and suggestions that were immediately incorporated into the software before it was released to the public later on the same day.
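At its core, a time-series view like FireAnt's buckets timestamped posts into intervals and counts them. As a rough illustration of the idea only (not FireAnt's actual code; the field names and sample records below are invented), posts can be counted per day in a few lines of Python:

```python
from collections import Counter
from datetime import datetime

# Invented sample records standing in for an exported social media dataset
posts = [
    {"user": "a", "created_at": "2016-02-22 09:15:00"},
    {"user": "b", "created_at": "2016-02-22 17:40:00"},
    {"user": "a", "created_at": "2016-02-23 08:05:00"},
]

# Bucket posts by calendar day to build a simple time series
per_day = Counter(
    datetime.strptime(p["created_at"], "%Y-%m-%d %H:%M:%S").date()
    for p in posts
)

for day, count in sorted(per_day.items()):
    print(day, count)
```

The resulting day-to-count mapping is exactly what gets drawn as a time-series plot.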

Screenshot of FireAnt main display


Following the release of FireAnt, I then worked with Claire over the next few weeks on our first research project utilizing the software – a forensic corpus linguistic analysis of the Ashley Madison dataset. Here, we used FireAnt to identify the creation and activities of automated ‘Angel’ accounts on the site. We presented preliminary results from this analysis at a UCREL/Forge event on March 18 that was attended by a large number of forensic linguists, corpus linguists, computer scientists, and others from around the university.

Time-series analysis of account creation in the Ashley Madison data set


One of the great advantages of being at Lancaster is that it is home to excellent scholars who are interested in the entire span of linguistic fields. Since my one-year sabbatical at Lancaster, I’ve had the pleasure of working with Marije Michel in the Dept. of Linguistics and English Language, who uses eye-tracking methodology in her research into Second Language Acquisition (SLA), Task-Based Language Teaching (TBLT) and written Synchronous Computer Mediated Communication (SCMC). Returning to Lancaster allowed us to work together further to develop a new eye-tracking tool that has applications not only in SLA, TBLT and SCMC research, but also in corpus linguistics. Again, we presented these ideas at a UCREL event held on March 10, and we are now in the process of writing up the research for publication.

Although this research visit was mainly focused on FireAnt development, I fortunately had time to also continue work on some of the other projects that were initiated during my sabbatical year. Meeting up again with Paul Baker allowed us to consider the next stage of development of ProtAnt, which we will be presenting at the TaLC 12 conference. I also met up with Paul Rayson and CASS’s new lecturer in digital humanities, Steve Wattam, to discuss how we can promote an understanding of tools development and programming skills among corpus linguists (an area of interest that I have had for several years now). Sadly, my schedule prevented me from joining them at the BBC #newsHack, but I was very happy to hear that Steve’s team won the Editorial Prize at the event.

Nothing beats having an entire sabbatical year to focus on research and collaborate with the excellent members of CASS. But this seven-week research visit comes a very enjoyable second. I would like to thank Tony McEnery and his team for funding the visit and making me feel so welcome again at Lancaster. It was a true pleasure to be back. I look forward to continuing to work with Tony and the team over the next three years.


Laurence Anthony is Professor of Applied Linguistics at the Faculty of Science and Engineering, Waseda University, Japan, and an Honorary Research Fellow at the ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, UK. His main interests are in corpus linguistics, educational technology, and English for Specific Purposes (ESP) program design and teaching methodologies. He received the National Prize of the Japan Association for English Corpus Studies (JAECS) in 2012 for his work in corpus software tools design. He is the developer of various corpus tools including AntConc, AntWordProfiler, FireAnt, ProtAnt, and TagAnt.

Workshop on Corpus Linguistics in Ghana

Back in 2014, a team from CASS ran a well-received introductory workshop on Corpus Linguistics in Accra, Ghana – a country where Lancaster University has a number of longstanding academic partnerships and has recently established a campus.

We’re pleased to announce that in February of this year, we will be returning to Ghana and running two more introductory one-day events. Both events are free to attend, each consisting of a series of introductory lectures and practical sessions on topics in corpus linguistics and the use of corpus tools.

Since the 2014 workshop was attended by some participants from a long way away, this time we are running events in two different locations in Ghana. The first workshop, on Tuesday 23rd February 2016, will be in Cape Coast, organised jointly with the University of Cape Coast: click this link for details. The second workshop, on Friday 26th February 2016, will be in Legon (nr. Accra), organised jointly with the University of Ghana: click this link for details. The same material will be covered at both workshops.

The workshop in 2014 was built largely around the use of our online corpus tools, particularly CQPweb. In the 2016 events, we’re going to focus instead on a pair of programs that you can run on your own computer to analyse your own data: AntConc and GraphColl. For that reason we will be encouraging participants who have their own corpora to bring them along to analyse in the workshop. These can be in any language – not just English! Don’t worry however – we will also provide sample datasets that participants who don’t have their own data can work with.

We invite anyone in Ghana who wants to learn more about the versatile methodology for language analysis that is corpus linguistics to attend! While the events are free, registration in advance is required, as places are limited.

Brainstorming the Future of Corpus Tools

Since arriving at the Centre for Corpus Approaches to Social Science (CASS), I’ve been thinking a lot about corpus tools. As I wrote in my blog entry of June 3, I have been working on various software programs to help corpus linguists process and analyse texts, including VariAnt, SarAnt, and TagAnt. Since then, I’ve also updated my mono-corpus analysis toolkit, AntConc, as well as updated my desktop and web-based parallel corpus tools, including AntPConc and the interfaces to the ENEJE and EXEMPRAES corpora. I’ve even started working with Paul Baker of Lancaster University on a completely new tool that provides detailed analyses of keywords.

In preparation for my plenary talk on corpus tools, given at the Teaching and Language Corpora (TaLC 11) conference held at Lancaster University, I interviewed many corpus linguists about their uses of corpus tools and their views on the future of corpus tools. I also interviewed people from other fields about their views on tools, including Jim Wild, the Vice President of the Royal Astronomical Society.

From my investigations, it was clear that corpus linguists rely on tools and very much appreciate their importance in their work. But it also became clear that corpus linguists can sometimes find it difficult to see beyond the features of their preferred concordancer or word frequency generator and attempt to look at language data in completely new and interesting ways. An analogy I often use (and one I detailed in my plenary talk at TaLC 11) is that of an astronomer. Corpus linguists can sometimes find that their telescopes are not powerful enough or sophisticated enough to delve into the depths of their research space. But, rather than attempting to build new telescopes that would reveal what they hope to see (an analogy to programming) or working with others to build such a telescope (an analogy to working with a software developer), corpus linguists simply turn their telescopes to other areas of the sky where their existing telescopes will continue to suffice.

To raise the awareness of corpus tools in the field and also generate new ideas for corpus tools that might be developed by individual programmers or within team projects, I proposed the first corpus tools brainstorming session at the 2014 American Association of Corpus Linguistics (AACL 2014) conference. Randi Reppen and the other organizers of the conference strongly supported the idea, and it finally became a reality on September 25, 2014, the first day of the conference.

At the session, over 30 people participated, filling the room. After I gave a brief overview of the history of corpus tools development, the participants thought about the ways in which they currently use corpora and the tools needed to do their work. The usual suspects—frequency lists (and frequency list comparisons), keyword-in-context concordances and plots, clusters and n-grams, collocates, and keywords—were all mentioned. In addition, the participants talked about how they are increasingly using statistics tools and also starting programming to find dispersion measures. A summary of the ways people use corpora is given below:

  • find word/phrase patterns (KWIC)
  • find word/phrase positions (plot)
  • find collocates
  • find n-grams/lexical bundles
  • find clusters
  • generate word lists
  • generate keyword lists
  • match patterns in text (via scripting)
  • generate statistics (e.g. using R)
  • measure dispersion of word/phrase patterns
  • compare words/synonyms
  • identify characteristics of texts
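Several of the staples on this list can be sketched in a few lines of plain Python. The toy corpus and simple tokenisation below are placeholder assumptions for illustration, not what any particular tool does internally:

```python
import re
from collections import Counter

text = "the cat sat on the mat and the cat slept"  # placeholder corpus
tokens = re.findall(r"[a-z']+", text.lower())

# Word frequency list
freq = Counter(tokens)

# Keyword-in-context (KWIC) concordance for a node word
def kwic(tokens, node, span=2):
    return [(tokens[max(0, i - span):i], tokens[i], tokens[i + 1:i + 1 + span])
            for i, t in enumerate(tokens) if t == node]

# N-grams (here, bigrams)
bigrams = Counter(zip(tokens, tokens[1:]))

print(freq.most_common(2))
print(kwic(tokens, "cat"))
print(bigrams.most_common(1))
```

Real concordancers add indexing, sorting, and annotation handling on top of these basic operations, but the underlying logic is much the same.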

Next, the participants formed groups, and began brainstorming ideas for new tools that they would like to see developed. Each group came up with many ideas, and explained these to the session as a whole. The ideas are summarised below:

  • compute distances between subsequent occurrences of search patterns (e.g. words, lemmas, POS)
  • quantify the degree of variability around search patterns
  • generate counts per text (in addition to corpus)
  • extract definitions
  • find patterns of range and frequency
  • work with private data but allow for powerful handling of annotation (e.g. comparing frequencies of sub-corpora)
  • carry out extensive move analysis over large texts
  • search corpora by semantic class
  • process audio data
  • carry out phonological analysis (e.g. neighbor density)
  • use tools to build a corpus (e.g. finding texts, annotating texts, converting non-ASCII characters to ASCII)
  • create new visualizations of data (e.g. a roman candle of words that ‘explode’ out of a text)
  • identify the encoding of corpus texts
  • compare two corpora along many dimensions
  • identify changes in language over time
  • disambiguate word senses
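The first two ideas on the list, for instance, amount to collecting the positions of a search pattern and summarising the gaps between them. A minimal sketch, assuming an already tokenised toy corpus:

```python
import statistics

tokens = ["a", "b", "a", "c", "c", "a", "b", "a"]  # placeholder corpus

# Positions of each occurrence of the search pattern
positions = [i for i, t in enumerate(tokens) if t == "a"]

# Distances between subsequent occurrences
gaps = [j - i for i, j in zip(positions, positions[1:])]

# Variability around the pattern: mean and spread of the gaps
mean_gap = statistics.mean(gaps)
sd_gap = statistics.stdev(gaps)

print(positions, gaps, mean_gap, sd_gap)
```

The same position lists feed directly into dispersion measures, which compare how evenly a pattern is spread across a corpus rather than just how often it occurs.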

From the list, it is clear that the field is moving towards more sophisticated analyses of data. People are also thinking of new and interesting ways to analyse corpora. But, perhaps the list also reveals a tendency for corpus linguists to think more in terms of what they can do rather than what they should do, an observation made by Douglas Biber, who also attended the session. As Jim Wild said when I interviewed him in July, “Research should be led by the science not the tool.” In corpus linguistics, clearly we should not be trapped into a particular research topic because of the limitations of the tools available to us. We should always strive to answer the questions that need to be answered. If the current tools cannot help us answer those questions, we may need to work with a software developer or perhaps even start learning to program ourselves so that new tools will emerge to help us tackle these difficult questions.

I am very happy that I was able to organize the corpus tools brainstorming session at AACL 2014, and I would like to thank all the participants for coming and sharing their ideas. I will continue thinking about corpus tools and working to make some of the ideas suggested at the session become a reality.

The complete slides for the AACL 2014 corpus tools brainstorming session can be found here. My personal website is here.

Coming to CASS to code: The first two months


After working at Waseda University in Japan for exactly 10 years, I was granted a one-year sabbatical in 2014 to concentrate on my corpus linguistics research. As my first choice of destination was Lancaster University, I was overjoyed to hear from Tony McEnery that the Centre for Corpus Approaches to Social Science (CASS) would be able to offer me office space and access to some of the best corpus resources in the world. I have now been at CASS for two months and thought this would be a good time to report on my experience here to date.

Since arriving at CASS, I have been working on several projects. My main project here is the development of a new database architecture that will allow AntConc, my freeware corpus analysis toolkit, to process very large corpora in a fast and resource-light way. The strong connection between applied linguistics and computer science at Lancaster has allowed me to work closely with some excellent computer science faculty and graduate students, including Paul Rayson, John Mariani, Stephen Wattam, and John Vidler. We just presented our first results at LREC 2014 in Reykjavik.

I’ve also been working closely with the CASS members, including Amanda Potts and Robbie Love, to develop a set of ‘mini’ corpus tools to help with the collection, cleaning, and processing of corpora. I have now released VariAnt, which is a tool that finds spelling variants in a corpus, and SarAnt, which allows multiple search-and-replace operations to be carried out on a corpus as a batch process. I am also just about to release TagAnt, which will finally give corpus linguists a simple and intuitive interface to popular freeware Part-Of-Speech (POS) tagging tools such as TreeTagger. I am hoping to develop more of these tools to help corpus linguists in CASS and around the world with the complex and time-consuming tasks that they have to perform each day.
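The core of a batch search-and-replace pass of this kind can be sketched as follows. This is an illustrative simplification, not SarAnt's actual implementation, and the replacement table is invented:

```python
import re

# Invented replacement table: each entry is (regex pattern, replacement)
replacements = [
    (r"analyse", "analyze"),
    (r"colour", "color"),
]

def batch_replace(text, pairs):
    """Apply every (pattern, replacement) pair to the text in order."""
    for pattern, repl in pairs:
        text = re.sub(pattern, repl, text)
    return text

print(batch_replace("We analyse colour terms.", replacements))
# → We analyze color terms.
```

A real batch tool additionally walks a directory of corpus files, handles character encodings, and reports how many substitutions were made in each file.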

I always expected that I would enjoy the time at Lancaster, but did not anticipate that I would enjoy it as much as I am. Lancaster University has a great campus, the research facilities are some of the best in the world, the CASS members have treated me like family since the day I arrived, and even the weather has been kind to me, with sunny days throughout April and May. I look forward to writing more about my projects here at CASS.

Exploring the grammar of Netflix in The Atlantic

A journalist with The Atlantic has used AntConc — a concordance program by CASS affiliate scholar Laurence Anthony — to deconstruct and reconstruct the grammar of Netflix genre descriptions.

“If you use Netflix, you’ve probably wondered about the specific genres that it suggests to you. Some of them just seem so specific that it’s absurd. Emotional Fight-the-System Documentaries? Period Pieces About Royalty Based on Real Life? Foreign Satanic Stories from the 1980s? If Netflix can show such tiny slices of cinema to any given user, and they have 40 million users, how vast did their set of “personalized genres” need to be to describe the entire Hollywood universe?”

By compiling the genre labels used by Netflix into a corpus, Alexis C. Madrigal was able to identify patterns in the data, and to autogenerate new theoretical genres based on popular adjectives and subjects, such as:

  • Deep Sea Father-and-Son Period Pieces Based on Real Life Set in the Middle East For Kids
  • Assassination Bounty-Hunter Secret Society Dramas Based on Books Set in Europe About Fame For Ages 8 to 10
  • Post-Apocalyptic Comedies About Friendship
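Genre generation of this kind boils down to taking the cross-product of slot fillers in a grammar template. A toy sketch, with invented word lists standing in for the adjectives and subjects Madrigal extracted:

```python
import itertools

# Invented placeholder slots modelled on the genre grammar described above
adjectives = ["Post-Apocalyptic", "Emotional"]
subjects = ["Comedies", "Documentaries"]
qualifiers = ["About Friendship", "Set in Europe"]

# Every combination of one filler per slot yields a candidate genre
genres = [" ".join(parts)
          for parts in itertools.product(adjectives, subjects, qualifiers)]

print(len(genres))  # 2 * 2 * 2 = 8 combinations
```

Even a handful of fillers per slot multiplies into tens of thousands of genres, which is how such a "personalized genre" space can outgrow the catalogue it describes.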

To read more about this application of a concordancer and corpus linguistic methods, as well as the resulting interview with Todd Yellin, Netflix’s VP of Product and the man responsible for the creation of Netflix’s tagging system, read the full article on The Atlantic: How Netflix Reverse Engineered Hollywood.


Discourse, Gender and Sexuality South-South Dialogues Conference

Last week was spent at Witwatersrand (Wits) University in Johannesburg, where I had been invited to give a workshop on corpus methods, as well as a talk on some of my own research. The week was topped off by the first Discourse, Gender and Sexuality South-South Dialogues Conference, which was organised by Tommaso Milani. Many of the papers at the conference used qualitative methods (analyses of visual data seemed particularly popular) but there were a few papers, including my own, which used corpus methods.

These included a paper by Megan Edwards, who combined a corpus approach with CDA and visual analysis to examine a small corpus of pamphlets found around Johannesburg. These pamphlets advertise remedies for sexual and relationship problems, and Megan demonstrated that embedded within the adverts were gendered discourses relating to notions of ideal masculinity and femininity. This is probably one of the few corpora in existence where the top lexical word is penis.

Another interesting paper was by Sally Hunt who examined corpora of articles about sex work in two South African newspapers, focussing on the period when SA hosted the World Cup. She found that while there was a more balanced set of representations of sex workers than expected, they were still largely represented as immoral and criminalised for their actions while the agency of their clients was largely obscured. Sally is a lecturer at Rhodes University, Grahamstown, and has recently completed the construction of a 1 million word South African corpus, using the Brown family sampling frame.

During the workshop that I hosted at the university, I got participants to use AntConc to examine a small corpus of recent newspaper articles about feminists, and a number of interesting patterns emerged from the analyses of concordances and collocates that took place. For example, a representation of feminists as war-mongers or as vocally annoying/fierce (e.g. shrill, strident) was very prevalent and perhaps expected, although we were surprised to see a sub-set of words which related feminists to Islam, like feminist Taleban and feminist fatwas (killing two ideological birds with one stone). Additionally, it was interesting to see that these negative discourses shouldn’t always be taken at face value. They were sometimes quoted in order to be critical of them, although it was often only with expanded concordance lines that this could be seen. In all, a productive week, and it was good to meet so many people who were interested in finding out more about corpus linguistics.
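Collocate analysis of this kind can be approximated by counting the words that fall within a fixed window of a node word. A minimal sketch with a placeholder token list (not AntConc's actual algorithm, which also applies frequency thresholds and statistical association measures):

```python
from collections import Counter

tokens = ["the", "feminist", "movement", "and", "feminist", "writers"]  # placeholder
node, span = "feminist", 1

# Count words co-occurring within +/- span tokens of each node occurrence
collocates = Counter()
for i, t in enumerate(tokens):
    if t == node:
        window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
        collocates.update(window)

print(collocates.most_common())
```

In practice these raw co-occurrence counts are then weighted by a measure such as MI or log-likelihood so that frequent function words do not dominate the collocate list.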


A criminologist’s introduction to AntConc and concordance analysis

My name is Julian Hargreaves and I’m a newcomer to these parts: a non-linguist and an outsider. Okay, the last bit is a slight exaggeration. I’m a member of the CASS Challenge Panel (an advisory board within CASS) representing post-graduate students from disciplines other than linguistics. I’m also a PhD student at the Lancaster University Law School, where my research employs a mixture of quantitative and qualitative methods to study criminology, hate crime, British Muslim communities, and the concept of Islamophobia.

Recently, thanks to Professor Tony McEnery and the CASS team, I was introduced to some research tools for linguistics: a piece of software called AntConc and a research method known as concordance analysis. Before the linguistic experts amongst you start groaning, a quick health warning: I’m afraid what follows here may be of little use to those familiar with these basic tools. However, it is hoped that newcomers and non-linguists will be persuaded to approach, without anxiety, both the software and the research methods described below.
