Coming to CASS to code: The first two months

After working at Waseda University in Japan for exactly 10 years, I was granted a one-year sabbatical in 2014 to concentrate on my corpus linguistics research. As my first choice of destination was Lancaster University, I was overjoyed to hear from Tony McEnery that the Centre for Corpus Approaches to Social Science (CASS) would be able to offer me office space and access to some of the best corpus resources in the world. I have now been at CASS for two months and thought this would be a good time to report on my experience here to date.

Since arriving at CASS, I have been working on several projects. My main project here is the development of a new database architecture that will allow AntConc, my freeware corpus analysis toolkit, to process very large corpora in a fast and resource-light way. The strong connection between applied linguistics and computer science at Lancaster has allowed me to work closely with some excellent computer science faculty and graduate students, including Paul Rayson, John Mariani, Stephen Wattam, and John Vidler. We recently presented our first results at LREC 2014 in Reykjavik.

I’ve also been working closely with CASS members, including Amanda Potts and Robbie Love, to develop a set of ‘mini’ corpus tools to help with the collection, cleaning, and processing of corpora. I have now released VariAnt, a tool that finds spelling variants in a corpus, and SarAnt, which allows multiple search-and-replace operations to be carried out on a corpus as a batch process. I am also just about to release TagAnt, which will finally give corpus linguists a simple and intuitive interface to popular freeware part-of-speech (POS) tagging tools such as TreeTagger. I am hoping to develop more of these tools to help corpus linguists in CASS and around the world with the complex and time-consuming tasks that they have to perform each day.
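For readers who have not come across the idea before, batch search-and-replace can be sketched in a few lines of Python. This is only an illustration of the general technique and not how SarAnt itself is implemented; the function name `batch_replace` and the two clean-up rules are made up for the example.

```python
import re

def batch_replace(text, rules):
    """Apply each (pattern, replacement) rule to the text, in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text)
    return text

# Hypothetical clean-up rules, for illustration only.
RULES = [
    (r"\s+", " "),             # collapse runs of whitespace into one space
    (r"\bcolour\b", "color"),  # normalise a single spelling variant
]

cleaned = batch_replace("The  colour\tred", RULES)
```

The rules run in the order they are listed, so a later rule sees the output of the earlier ones. Ordering matters whenever two rules could match the same stretch of text.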

I always expected that I would enjoy my time at Lancaster, but I did not anticipate enjoying it as much as I have. Lancaster University has a great campus, the research facilities are some of the best in the world, the CASS members have treated me like family since the day I arrived, and even the weather has been kind to me, with sunny days throughout April and May. I look forward to writing more about my projects here at CASS.

Exploring the grammar of Netflix in The Atlantic

A journalist with The Atlantic has used AntConc — a concordance program by CASS affiliate scholar Laurence Anthony — to deconstruct and reconstruct the grammar of Netflix genre descriptions.

“If you use Netflix, you’ve probably wondered about the specific genres that it suggests to you. Some of them just seem so specific that it’s absurd. Emotional Fight-the-System Documentaries? Period Pieces About Royalty Based on Real Life? Foreign Satanic Stories from the 1980s? If Netflix can show such tiny slices of cinema to any given user, and they have 40 million users, how vast did their set of “personalized genres” need to be to describe the entire Hollywood universe?”

By compiling the genres listed on Netflix into a corpus, Alexis C. Madrigal was able to identify patterns in the data, and to autogenerate new hypothetical genres based on popular adjectives and subjects, such as:

  • Deep Sea Father-and-Son Period Pieces Based on Real Life Set in the Middle East For Kids
  • Assassination Bounty-Hunter Secret Society Dramas Based on Books Set in Europe About Fame For Ages 8 to 10
  • Post-Apocalyptic Comedies About Friendship
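Genres like these follow a fairly regular slot pattern, which makes them easy to imitate. The sketch below fills one simple template at random. It is a toy illustration and not the method used in the article; the word lists and the `make_genre` function are invented here, loosely echoing the examples above.

```python
import random

# Hypothetical slot vocabularies, loosely echoing the examples above.
ADJECTIVES = ["Post-Apocalyptic", "Deep Sea", "Emotional"]
NOUNS = ["Comedies", "Dramas", "Period Pieces"]
SUBJECTS = ["Friendship", "Fame", "Royalty"]

def make_genre(rng):
    """Fill an 'ADJECTIVE NOUN About SUBJECT' template at random."""
    adjective = rng.choice(ADJECTIVES)
    noun = rng.choice(NOUNS)
    subject = rng.choice(SUBJECTS)
    return f"{adjective} {noun} About {subject}"

rng = random.Random(0)  # fixed seed so repeated runs give the same genres
genres = [make_genre(rng) for _ in range(3)]
```

In a real study, the slot vocabularies would come from frequency counts over the corpus rather than from hand-picked lists.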

To learn more about this application of a concordancer and corpus linguistic methods, as well as the resulting interview with Todd Yellin, Netflix’s VP of Product and the man responsible for the creation of Netflix’s tagging system, read the full article in The Atlantic: How Netflix Reverse Engineered Hollywood