Ants On Fire

Being an honorary research fellow at CASS is not only a great honor but a great pleasure. In December of 2015, my initial three-year fellowship at CASS was extended for a further three years, and this opened up the possibility of returning to Lancaster for a sabbatical-length seven-week research stay between February and March of 2016.

The timing of this research stay was especially enjoyable as it coincided with CASS receiving the Queen’s Anniversary Prize for Higher and Further Education for its contributions to computer analysis of world languages in print, speech and online. As part of a week of celebrations at the Centre, I worked with Claire Hardaker of CASS to organize a launch event for our new FireAnt social media analysis toolkit on February 22. FireAnt is a tool that allows researchers to easily extract relevant data from social media data sources, visualize that data in the form of time-series plots, network graphs, and geolocation maps, and export results for further analysis using traditional corpus tools. At the event, 20 invited participants learned how to use the new tool to analyze Twitter and other social media data sets. They also gave us very valuable comments and suggestions that were immediately incorporated into the software before it was released to the public later that same day.
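
FireAnt’s internal workings aren’t described in this post, but the kind of extraction step it automates is easy to sketch. The following is a minimal, hypothetical Python example that pulls a few fields from a file of line-delimited tweet JSON and writes them to CSV for use in a traditional corpus tool; the file name and field names are assumptions based on the classic Twitter API layout, not FireAnt’s actual formats:

```python
import csv
import json

# Hypothetical input: one tweet JSON object per line ("tweets.jsonl").
# The field names (created_at, user, text) follow the classic Twitter
# API layout; they are assumptions, not FireAnt's actual internals.
with open("tweets.jsonl", encoding="utf-8") as infile, \
        open("tweets.csv", "w", newline="", encoding="utf-8") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(["created_at", "screen_name", "text"])
    for line in infile:
        tweet = json.loads(line)
        writer.writerow([
            tweet.get("created_at", ""),
            tweet.get("user", {}).get("screen_name", ""),
            tweet.get("text", "").replace("\n", " "),  # keep one row per tweet
        ])
```

The resulting CSV can then be loaded into a spreadsheet or a concordancer, which is exactly the kind of hand-off to traditional corpus tools that FireAnt is designed to streamline.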

[Figure: Screenshot of the FireAnt main display]

Following the release of FireAnt, I worked with Claire over the next few weeks on our first research project utilizing the software – a forensic corpus-linguistic analysis of the Ashley Madison dataset. Here, we used FireAnt to identify the creation and activities of automated ‘Angel’ accounts on the site. We presented preliminary results from this analysis at a UCREL/Forge event on March 18 that was attended by a wide range of forensic linguists, corpus linguists, computer scientists, and others from around the university.

[Figure: Time-series analysis of account creation in the Ashley Madison data set]
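
The dataset itself obviously can’t be shared here, but the underlying time-series technique is straightforward: bucket account-creation timestamps by day and flag days whose counts sit far above the norm. Below is a minimal sketch under assumed inputs (a CSV with an ISO-format creation_date column, which is not the actual Ashley Madison schema):

```python
import csv
from collections import Counter
from statistics import mean, stdev

# Assumed input: a CSV with an ISO-format "creation_date" column
# (e.g. "2013-07-11T09:42:00"); not the real Ashley Madison schema.
daily = Counter()
with open("accounts.csv", encoding="utf-8") as f:
    for row in csv.DictReader(f):
        daily[row["creation_date"][:10]] += 1  # bucket by YYYY-MM-DD

counts = list(daily.values())
mu, sigma = mean(counts), stdev(counts)
for day in sorted(daily):
    if daily[day] > mu + 3 * sigma:  # crude burst threshold
        print(f"{day}: {daily[day]} accounts created (possible batch creation)")
```

Sudden spikes of this kind are what make automated, batch-created accounts stand out so clearly in a time-series plot.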

One of the great advantages of being at Lancaster is that it is home to excellent scholars who are interested in the entire span of linguistics. Since my one-year sabbatical at Lancaster, I’ve had the pleasure of working with Marije Michel in the Dept. of Linguistics and English Language, who uses eye-tracking methodology in her research into Second Language Acquisition (SLA), Task-Based Language Teaching (TBLT), and written Synchronous Computer Mediated Communication (SCMC). Returning to Lancaster allowed us to work together further to develop a new eye-tracking tool that has applications not only in SLA, TBLT, and SCMC research, but also in corpus linguistics. Again, we presented these ideas at a UCREL event held on March 10, and we are now in the process of writing up the research for publication.

Although this research visit was mainly focused on FireAnt development, I fortunately also had time to continue work on some of the other projects initiated during my sabbatical year. Meeting up again with Paul Baker allowed us to consider the next stage of development of ProtAnt, which we will be presenting at the TaLC 12 conference. I also met up with Paul Rayson and CASS’s new lecturer in digital humanities, Steve Wattam, to discuss how we can promote an understanding of tools development and programming skills among corpus linguists (an interest of mine for several years now). Sadly, my schedule prevented me from joining them at the BBC #newsHack, but I was delighted to hear that Steve’s team won the Editorial Prize at the event.

Nothing beats having an entire sabbatical year to focus on research and collaborate with the excellent members of CASS, but this seven-week research visit comes a very enjoyable second. I would like to thank Tony McEnery and his team for funding the visit and making me feel so welcome again at Lancaster. It was a true pleasure to be back, and I look forward to continuing to work with Tony and the team over the next three years.


Biography:

Laurence Anthony is Professor of Applied Linguistics at the Faculty of Science and Engineering, Waseda University, Japan, and an Honorary Research Fellow at the ESRC Centre for Corpus Approaches to Social Science (CASS), Lancaster University, UK. His main interests are in corpus linguistics, educational technology, and English for Specific Purposes (ESP) program design and teaching methodologies. He received the National Prize of the Japan Association for English Corpus Studies (JAECS) in 2012 for his work in corpus software tools design. He is the developer of various corpus tools including AntConc, AntWordProfiler, FireAnt, ProtAnt, and TagAnt.

Brainstorming the Future of Corpus Tools

Since arriving at the Centre for Corpus Approaches to Social Science (CASS), I’ve been thinking a lot about corpus tools. As I wrote in my blog entry of June 3, I have been working on various software programs to help corpus linguists process and analyse texts, including VariAnt, SarAnt, and TagAnt. Since then, I’ve also updated my mono-corpus analysis toolkit, AntConc, as well as my desktop and web-based parallel corpus tools, including AntPConc and the interfaces to the ENEJE and EXEMPRAES corpora. I’ve even started working with Paul Baker of Lancaster University on a completely new tool that provides detailed analyses of keywords.
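
The new keyword tool is described only in outline here, but keyword analysis itself is well established: it compares word frequencies in a target corpus against a reference corpus and flags words that are unusually frequent. As a point of reference, here is a minimal sketch using the log-likelihood statistic common in corpus linguistics; the two toy word lists stand in for real corpora:

```python
import math
from collections import Counter

def log_likelihood(freq1, freq2, size1, size2):
    """Log-likelihood score for a word observed freq1 times in a target
    corpus of size1 tokens and freq2 times in a reference of size2 tokens."""
    expected1 = size1 * (freq1 + freq2) / (size1 + size2)
    expected2 = size2 * (freq1 + freq2) / (size1 + size2)
    ll = 0.0
    if freq1:
        ll += freq1 * math.log(freq1 / expected1)
    if freq2:
        ll += freq2 * math.log(freq2 / expected2)
    return 2 * ll

# Toy corpora standing in for real tokenised texts.
target = Counter("the cat sat on the mat the cat purred".split())
reference = Counter("the dog ran in the park".split())
n1, n2 = sum(target.values()), sum(reference.values())

# Higher scores indicate stronger keyness in the target corpus.
for word in sorted(target, key=lambda w: -log_likelihood(target[w], reference[w], n1, n2)):
    print(f"{word}\t{log_likelihood(target[word], reference[word], n1, n2):.2f}")
```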

In preparation for my plenary talk on corpus tools, given at the Teaching and Language Corpora (TaLC 11) conference held at Lancaster University, I interviewed many corpus linguists about how they use corpus tools and where they see such tools heading. I also interviewed people from other fields about their views on tools, including Jim Wild, the Vice President of the Royal Astronomical Society.

From my investigations, it was clear that corpus linguists rely on and very much appreciate the importance of tools in their work. But it also became clear that corpus linguists can sometimes find it difficult to see beyond the features of their preferred concordancer or word-frequency generator and to look at language data in completely new and interesting ways. An analogy I often use (and one I detailed in my plenary talk at TaLC 11) is that of an astronomer. Corpus linguists can sometimes find that their telescopes are not powerful or sophisticated enough to delve into the depths of their research space. But rather than attempting to build new telescopes that would reveal what they hope to see (an analogy to programming), or working with others to build such a telescope (an analogy to working with a software developer), corpus linguists simply turn their telescopes to other areas of the sky where the existing instruments will suffice.

To raise awareness of corpus tools in the field and to generate new ideas for tools that might be developed by individual programmers or within team projects, I proposed the first corpus tools brainstorming session at the 2014 American Association of Corpus Linguistics (AACL 2014) conference. Randi Reppen and the other organizers of the conference strongly supported the idea, and it finally became a reality on September 25, 2014, the first day of the conference.

At the session, over 30 people participated, filling the room. After I gave a brief overview of the history of corpus tools development, the participants thought about the ways in which they currently use corpora and the tools needed to do their work. The usual suspects—frequency lists (and frequency-list comparisons), keyword-in-context concordances and plots, clusters and n-grams, collocates, and keywords—were all mentioned. In addition, the participants talked about how they are increasingly using statistics tools and also starting to program in order to compute dispersion measures. A summary of the ways people use corpora is given below:

  • find word/phrase patterns (KWIC; see the sketch after this list)
  • find word/phrase positions (plot)
  • find collocates
  • find n-grams/lexical bundles
  • find clusters
  • generate word lists
  • generate keyword lists
  • match patterns in text (via scripting)
  • generate statistics (e.g. using R)
  • measure dispersion of word/phrase patterns
  • compare words/synonyms
  • identify characteristics of texts
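
Most of these functions are standard fare in concordancers such as AntConc. To make the first item concrete, a bare-bones KWIC (keyword-in-context) display reduces to a few lines of Python; this sketch ignores the tokenisation and file-handling details that a real tool must deal with:

```python
import re

def kwic(text, pattern, width=30):
    """Print a keyword-in-context line for every match of a regex pattern."""
    for m in re.finditer(pattern, text, re.IGNORECASE):
        left = text[max(0, m.start() - width):m.start()]
        right = text[m.end():m.end() + width]
        print(f"{left:>{width}} [{m.group()}] {right:<{width}}")

sample = ("Corpus linguists rely on tools. A corpus tool such as a "
          "concordancer shows each hit of a search term in context.")
kwic(sample, r"\bcorpus\b")
```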

Next, the participants formed groups and began brainstorming ideas for new tools that they would like to see developed. Each group came up with many ideas and explained these to the session as a whole. The ideas are summarised below:

  • compute distances between subsequent occurrences of search patterns (e.g. words, lemmas, POS; see the sketch after this list)
  • quantify the degree of variability around search patterns
  • generate counts per text (in addition to corpus)
  • extract definitions
  • find patterns of range and frequency
  • work with private data but allow for powerful handling of annotation (e.g. comparing frequencies of sub-corpora)
  • carry out extensive move analysis over large texts
  • search corpora by semantic class
  • process audio data
  • carry out phonological analysis (e.g. neighbor density)
  • use tools to build a corpus (e.g. finding texts, annotating texts, converting non-ASCII characters to ASCII)
  • create new visualizations of data (e.g. a roman candle of words that ‘explode’ out of a text)
  • identify the encoding of corpus texts
  • compare two corpora along many dimensions
  • identify changes in language over time
  • disambiguate word senses
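
To make the first of these ideas concrete, the distances between successive occurrences of a search pattern are the raw material for many dispersion measures. A minimal sketch with deliberately naive whitespace tokenisation:

```python
def occurrence_gaps(tokens, word):
    """Return the token distances between successive occurrences of a word."""
    positions = [i for i, t in enumerate(tokens) if t == word]
    return [b - a for a, b in zip(positions, positions[1:])]

tokens = "the cat sat and the dog saw the cat".split()
print(occurrence_gaps(tokens, "the"))  # [4, 3]
```

Evenly spaced occurrences (small variance in the gaps) suggest a word is dispersed throughout a text, while large, uneven gaps suggest it clusters in particular sections.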

From the list, it is clear that the field is moving towards more sophisticated analyses of data. People are also thinking of new and interesting ways to analyse corpora. But perhaps the list also reveals a tendency for corpus linguists to think more in terms of what they can do rather than what they should do, an observation made by Douglas Biber, who also attended the session. As Jim Wild said when I interviewed him in July, “Research should be led by the science not the tool.” In corpus linguistics, we clearly should not be trapped in a particular research topic by the limitations of the tools available to us. We should always strive to answer the questions that need to be answered. If the current tools cannot help us answer those questions, we may need to work with a software developer, or perhaps even start learning to program ourselves, so that new tools emerge to help us tackle these difficult questions.

I am very happy that I was able to organize the corpus tools brainstorming session at AACL 2014, and I would like to thank all the participants for coming and sharing their ideas. I will continue thinking about corpus tools and working to make some of the ideas suggested at the session become a reality.

The complete slides for the AACL 2014 corpus tools brainstorming session can be found here. My personal website is here.

Coming to CASS to code: The first two months

After working at Waseda University in Japan for exactly 10 years, I was granted a one-year sabbatical in 2014 to concentrate on my corpus linguistics research. As my first choice of destination was Lancaster University, I was overjoyed to hear from Tony McEnery that the Centre for Corpus Approaches to Social Science (CASS) would be able to offer me office space and access to some of the best corpus resources in the world. I have now been at CASS for two months and thought this would be a good time to report on my experience here to date.

Since arriving at CASS, I have been working on several projects. My main project here is the development of a new database architecture that will allow AntConc, my freeware corpus analysis toolkit, to process very large corpora in a fast and resource-light way. The strong connection between applied linguistics and computer science at Lancaster has allowed me to work closely with some excellent computer science faculty and graduate students, including Paul Rayson, John Mariani, Stephen Wattam, and John Vidler. We have just presented our first results at LREC 2014 in Reykjavik.
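
The architecture we presented is more involved than can be shown here, but the general idea of trading an in-memory corpus for an indexed on-disk store can be sketched with SQLite from the Python standard library. This is a toy illustration of the principle, not AntConc’s actual design:

```python
import sqlite3

# Toy on-disk token store: one row per token position, indexed by word form.
conn = sqlite3.connect("corpus.db")
conn.execute("CREATE TABLE IF NOT EXISTS tokens (pos INTEGER, word TEXT)")
conn.execute("CREATE INDEX IF NOT EXISTS idx_word ON tokens (word)")

corpus = "the cat sat on the mat".split()
conn.executemany("INSERT INTO tokens VALUES (?, ?)", enumerate(corpus))
conn.commit()

# Frequency queries now use the index rather than scanning the whole
# corpus in RAM, so memory use stays flat as the corpus grows.
freq = conn.execute(
    "SELECT COUNT(*) FROM tokens WHERE word = ?", ("the",)
).fetchone()[0]
print(freq)  # 2
```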

I’ve also been working closely with CASS members, including Amanda Potts and Robbie Love, to develop a set of ‘mini’ corpus tools to help with the collection, cleaning, and processing of corpora. I have now released VariAnt, a tool that finds spelling variants in a corpus, and SarAnt, which allows multiple search-and-replace operations to be carried out on a corpus as a batch process. I am also just about to release TagAnt, which will finally give corpus linguists a simple and intuitive interface to popular freeware part-of-speech (POS) tagging tools such as TreeTagger. I hope to develop more of these tools to help corpus linguists in CASS and around the world with the complex and time-consuming tasks that they have to perform each day.
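
SarAnt’s interface and rule format are not reproduced here, but its core batch operation is simple to sketch: apply an ordered list of regular-expression replacements to every file in a corpus directory. The rules and paths below are made up for illustration:

```python
import re
from pathlib import Path

# Hypothetical cleaning rules, applied in order to every corpus file.
rules = [
    (re.compile(r"\r\n"), "\n"),    # normalise line endings
    (re.compile(r"[ \t]+"), " "),   # collapse runs of whitespace
    (re.compile(r"&amp;"), "&"),    # undo a common HTML escape
]

for path in Path("corpus").glob("*.txt"):
    text = path.read_text(encoding="utf-8")
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    # Write cleaned output alongside the original rather than in place.
    path.with_name(path.stem + "_clean.txt").write_text(text, encoding="utf-8")
```

One design choice worth noting: writing cleaned output to new files rather than overwriting the originals makes a batch process like this much safer to re-run.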

I always expected that I would enjoy my time at Lancaster, but I did not anticipate that I would enjoy it as much as I have. Lancaster University has a great campus, the research facilities are some of the best in the world, the CASS members have treated me like family since the day I arrived, and even the weather has been kind to me, with sunny days throughout April and May. I look forward to writing more about my projects here at CASS.