The Spoken BNC2014 early access projects: Part 2

In January, we announced the recipients of the Spoken BNC2014 Early Access Data Grants. Over the next several months, they will use exclusive access to the first five million words of Spoken BNC2014 data to carry out a total of thirteen research projects.

In this series of blogs, we are excited to share more information about these projects, in the words of their authors.

In Part 2 of our series, read about the work of Chris Ryder et al., Andreea Calude and Barbara McGillivray et al.

Chris Ryder, Jacqueline Laws and Sylvia Jaworska

University of Reading, UK

From oldies to selfies: A diachronic corpus-based study into changing productivity patterns in British English suffixation

The data from the Spoken BNC2014 early access subset will provide a unique opportunity to examine changes that have occurred in affix use in spoken British English over a twenty-year period; for example, the word selfie has only entered general usage since the invention of the iPhone. Using the recently developed MorphoQuantics database containing complex word data for 222 word-final affixes from the demographically sampled subset of the original Spoken BNC, direct comparisons can be made between old and new datasets, focussing on suffixation patterns, changes in productivity, and trends that demonstrate the shifts in semantic scope of individual suffixes. These features will be analysed chiefly through an examination (both quantitative and qualitative) of neologisms within the data, specifically regarding their regularity of construction, occurrence, and meaning.

This study is just one example of the diachronic morphological analyses that will be made available through a comparison of the Spoken BNC2014 EAS and the Spoken BNC, by utilising the categorisation system provided by MorphoQuantics.

Andreea Calude

University of Waikato, New Zealand

Sociolinguistic variation in cleft constructions: a quantitative corpus study of spontaneous conversation

This project concerns links between the use of various grammatical constructions and sociolinguistic variation: for example, is grammar used differently by men and women, or by younger and older speakers? We know that such variation can be observed for certain phonological features (e.g., some vowel sounds) and for certain pragmatic constructions (e.g., discourse markers and new and given information), but as regards grammatical features, the answer remains largely unknown or at best vague.

I intend to use the Spoken BNC2014 early access subset to investigate cleft constructions from a sociolinguistic variationist perspective, with the aim of uncovering (potential) systematic syntactic variation across age, gender, dialect, and socio-economic status. Clefts constitute the most frequently used focusing strategy in English, with demonstrative clefts being among the most common in spontaneous conversation, for example: “That is what I want to study”, “This is where I was born”. Despite intense diachronic and synchronic study of the structure and function of clefts in English, virtually nothing is known about the relationship between cleft use and sociolinguistic variation.

The Spoken BNC2014 data will be coded for all demonstrative clefts using a combination of manual and automatic detection, and each construction identified will be attributed to a particular speaker profile (in terms of their sociolinguistic features). Three linguistic features will also be coded for each construction, namely discourse function, reference direction (cataphoric or anaphoric), and information structure (amount of new and given information included). The data will be analysed using a mixed effects generalised linear regression model.

Barbara McGillivray (1), Gard Buen Jenset (1) and Michael Rundell (2)

(1) University of Oxford, UK

(2) Lexicography MasterClass, UK

The dative alternation revisited: fresh insights from contemporary spoken data

A well-known feature of English grammar is the dative alternation, whereby a verb may be used in an SVOO construction (Give me the money) or in the pattern SVO followed by a PP with the preposition to (Give the money to me). This is quite a well-researched topic, and generalizations have been made about the factors influencing a writer’s choice of one construction or another, and about which verbs show a preference for one of these patterns over the other. However, most of the studies published to date draw either on introspection or on data from written sources. The availability of contemporary, unscripted spoken data takes us into new territory, and offers an exciting opportunity to revisit this topic.

Our plan is to use the data from the Early Access Scheme to investigate verbs whose argument structure preferences include the dative alternation. Once we have all the relevant corpus data from the Spoken BNC2014 early access subset, we will analyse it using state-of-the-art multivariate statistical techniques, in order to account for the interplay of all the potentially significant variables, whether lexical, semantic, syntactic, or social. The proposed study thus exploits many of the unique features of this dataset, including the metadata on speakers and the USAS semantic tagging, to answer questions concerning the possible influence of semantic categories, socio-economic factors, gender, dialect, and age, as well as linguistic features, on a speaker’s preferences. Once the study is complete, there will be opportunities for fresh comparative studies, either with the original Spoken BNC or with contemporary written data.

Check back soon for Part 3!

FireAnt has officially launched!

Laurence Anthony and Claire Hardaker first introduced FireAnt at the CL2015 conference. In their talk, Claire explained that her work with the Discourse of Online Misogyny (DOOM) project had led her to realise that when working with Twitter data, you quickly encounter a large array of problems: how to easily collect data, how to arrange that data in a useful way, and how to then analyse that data effectively. It was these problems that had led to the creation of FireAnt, a freeware social media and data analysis toolkit. Laurence and Claire showed the CL2015 audience a beta version of FireAnt, and it’s safe to say it was very well received… the Q&A at the end of their talk went along the lines of ‘it will be publicly available, right?’, ‘when can I get my hands on it?’, and ‘can I sign up to help you trial the beta version?’.

Well, ladies and gentlemen, the wait is over. On Monday 22nd February, Laurence and Claire officially launched FireAnt; it became available to the public on Laurence Anthony’s website, and they held a launch event at Lancaster University to teach people how to use the tool. Here’s a little about what I learnt at this launch event…

FireAnt is not just for analysing social media data; it’s also for collecting it…

FireAnt makes collecting tweets incredibly easy. All you have to do is enter a search term, specify how long you want FireAnt to collect tweets for (or set a maximum number of tweets you want it to collect), and go away and have a cup of tea. To trial this, I instructed FireAnt to start collecting tweets that contained the hashtag #feminism. While I munched my way through two biscuits and two cups of tea, 675 tweets were posted on Twitter containing the hashtag #feminism; FireAnt collected all of these.

FireAnt helps you extract the data you’re actually interested in…

When you collect social media data, you don’t just collect texts that people have posted online. You also collect lots of information about these texts – for example, the date and time that each text was posted on the internet, the username of the person who posted the content, the location of that user, etc. This means that the file containing all your data is often very large, and you have to extract the bits you want to work with. This sounds simple but in reality it’s not, unless you’re a fairly capable programmer and have a computer with a decent amount of memory. However, with FireAnt, the process is much simpler. FireAnt automatically detects what information you have in your file, allows you to filter this, and creates new files with the information you’re interested in (without crashing your computer!).
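To give a flavour of what doing this by hand involves, here is a minimal Python sketch. It is not how FireAnt works internally (the launch event did not cover that); it simply assumes your tweets are stored one JSON object per line, and that the fields you want are top-level keys called `created_at` and `text`. Both the file layout and the field names are assumptions for illustration.

```python
import json


def extract_fields(lines, fields=("created_at", "text")):
    """Yield one small dict per tweet, keeping only the requested fields.

    `lines` can be any iterable of strings, for example an open file,
    so the whole dataset never has to sit in memory at once.
    The default field names are assumptions, not a fixed schema.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweet = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that are not valid JSON
        if not isinstance(tweet, dict):
            continue  # skip JSON values that are not objects
        yield {name: tweet.get(name) for name in fields}


def write_subset(in_path, out_path, fields=("created_at", "text")):
    """Stream a large JSON-lines file into a smaller one, line by line."""
    with open(in_path, encoding="utf-8") as src, \
         open(out_path, "w", encoding="utf-8") as dst:
        for row in extract_fields(src, fields):
            dst.write(json.dumps(row, ensure_ascii=False) + "\n")
```

Calling `write_subset("tweets.jsonl", "tweets_small.jsonl")` (both file names hypothetical) would read the big file one line at a time and write a slimmed-down copy, which is the reason for streaming rather than loading everything at once.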

FireAnt can also help you analyse your data…

At the launch event, we experimented with three different analysis features of FireAnt. Firstly, FireAnt allows you to gather time-series data, showing the usage of a particular word within your dataset across time. You can use this to produce pretty graphs, such as the one below, or export the data to Excel.
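For the curious, the idea behind time-series data of this kind can be sketched in a few lines of Python. This is only an illustration, not FireAnt's own method: it assumes each tweet is a dict with a `text` field and an ISO 8601 timestamp in `created_at` (both hypothetical names), and it uses very crude whitespace tokenisation.

```python
from collections import Counter
from datetime import datetime

# Punctuation stripped from the edges of each token (a rough heuristic).
EDGE_PUNCTUATION = ".,!?;:\"'()"


def daily_counts(tweets, word):
    """Count, per calendar day, how many tweets contain `word`.

    Assumes each tweet is a dict with a 'text' string and an ISO 8601
    'created_at' timestamp; both field names are illustrative.
    """
    target = word.lower()
    counts = Counter()
    for tweet in tweets:
        tokens = [t.strip(EDGE_PUNCTUATION).lower() for t in tweet["text"].split()]
        if target in tokens:
            day = datetime.fromisoformat(tweet["created_at"]).date()
            counts[day.isoformat()] += 1
    # Return the days in chronological order, ready for plotting or export.
    return dict(sorted(counts.items()))
```

The resulting dictionary maps each date to a count, which could then be written to a CSV file and opened in Excel.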

Secondly, FireAnt can produce Geoposition maps. For example, below is a picture of Abi Hawtin, one of CASS’ research students, who’s looking very excited because she used FireAnt to create a map showing the different locations that the tweets in her dataset were posted from:


Thirdly, FireAnt allows you to easily produce network graphs, like the one below:

One great feature of these graphs is that they allow you to plot lots of different things. These types of graphs are typically hard to produce, but with a tool like FireAnt it’s easy.

What are you waiting for? Time to try FireAnt out for yourself!