A Journey into Transcription, Part 2: Getting Started

training:
MASS VERB:
The action of teaching a person or animal a particular skill or type of behaviour.

So how to begin?  With experts as our guides (and thankfully no animals in sight!)…

The Context: The first week was to be dedicated to training. We began by watching a short video clip of a Trinity examination in progress. Although our day-to-day work is based purely on audio recordings, we really appreciated having this quick peek into the world of the examination room. Being able to picture the scene when listening to exam recordings somehow brings the spoken language to life.

Picture this: a desk with a friendly examiner seated at one side; tape recorder in situ and possibly a fan whirring (quietly, we hope) in the background; a pile of papers (perhaps held down by a paperweight); and then, most importantly for us in this research into learner language, a student seated on the other side of the desk: some nervous, some shy, some confident, some excited, some reluctant to speak and a rare few who might even have felt quite at home seated on the other side of the desk!

Time spent viewing this clip was truly a valuable introduction to the context of this research and the real world to which the audio transcriber is privy on a daily basis.

What next?  Enthusiastic to get started, headsets on, foot pedals down…

Practice File:  We started with a practice recording that had been transcribed previously, applying to it our first set of transcription conventions.  (These have subsequently been altered and updated on numerous occasions.)  This was an extremely valuable process – in listening separately and together to sections of the recording and in comparing our own transcripts with each other and with the original, we quickly realised the range of subtleties that are involved in this task.  The aim, of course, is for transcribers to do as little interpretation as possible and to be able to apply the conventions in a more or less uniform manner, thus making the transcription process as straightforward as possible.  This, after all, is what will enable us to build a reliable corpus of words that are actually uttered.  Whilst the technology now exists to generate text from spoken words, the accuracy of the text produced does not come close to that produced by a real-life human transcriber.

Key to this task is the fact that it is unlike transcription in other working environments; we are not seeking to produce grammatically correct punctuated documents such as you might find on a BBC website when you want to review that radio programme you heard, or perhaps missed.  In spoken language there are only utterances and our job is to record every utterance precisely by following the given conventions, the only punctuation in sight being apostrophes and the odd question mark.  So is that syllable a word ending, a false start to another word, perhaps a filler used intentionally to maintain a turn in conversation, or perhaps an involuntary sound? All these are natural features of spoken discourse.  Tackling this challenge and striving to produce a document that represents as accurately as is humanly possible the words actually uttered by each individual speaker – once again, here is the challenge that makes our job enjoyable and rewarding.

And finally… A Transcriber’s Thought For The Day:

I tried to catch some fog.  I mist.

Reflections from the CASS student challenge panel member, part 3

Pamela Irwin, this year’s CASS student challenge panel member, is looking back on her past year of research. This is part 3 of her reflections — need to catch up on the others? Click here to read part 1, or here to read part 2


Lately, I have been examining sociolinguistics and its related sub-disciplines as part of my exploration of the synergy between the social sciences (sociology/social gerontology) and language (corpus linguistics) in relation to my research.

My first task was to compare sociolinguistics with the sociology of language. According to the literature, in brief, the focus of sociolinguistics is to ascertain the effect of society on language, whereas the sociology of language is oriented around the influence of language on society.

Even with this conceptual clarification, I still found it quite difficult to assimilate the vertical (layers) and horizontal (scope) dimensions of sociolinguistics and then to differentiate within and between the sociolinguistic sub-specialities. At this stage, it was a relief to discover that some of these social/linguistic links had already been mapped, including sociolinguistics and corpus linguistics (Baker, 2010), critical discourse analysis and corpus linguistics (Baker, Gabrielatos, Khosravinik, Krzyżanowski, McEnery & Wodak, 2008), realism and corpus linguistics (Sealey, 2010) and linguistics and ethnography (Rampton, Maybin & Tusting, 2007).

Linguistic ethnography has particular relevance to my study’s ethnographic methodology. During my ethnographic fieldwork in rural Australia, I obtained data from multiple sources: historical records, contemporary materials such as local newspapers and community notices, participant interviews and journals, and field notes. As I had naively assumed that all types of data are equally valid, Creese’s (2011) advocacy of a non-hierarchical balance between researcher fieldnotes and interactional data (interviews, conversations) was reassuring.

According to Rampton (2007), a distinctive linguistic ethnography is still evolving and as such, it remains open to wider interpretative approaches. Here, Sealey’s (2007) juxtaposition of linguistic ethnography and realism to address ‘what kinds of language in what circumstances and with what outcome?’ (p. 641) makes a valuable contribution to my analytical repertoire. For instance, my findings suggest that the older and late middle-aged women’s life history narratives vary significantly in terms of their depth (reflective/instrumental) and breadth (expansive/constrained). While these differences do not seem to be related to the type of data (written versus spoken accounts), the influence of temporal (age, period, cohort) and situational (rural/urban, ‘local’/newcomer) circumstances on the women’s accounts is less clear. Corpus linguistics provides an objective analytical method of unravelling these complex inter-relationships.

References:

Baker, P. (2010). Sociolinguistics and corpus linguistics. Edinburgh: Edinburgh University Press.

Baker, P., Gabrielatos, C., Khosravinik, M., Krzyżanowski, M., McEnery, T., & Wodak, R. (2008). A useful methodological synergy? Combining critical discourse analysis and corpus linguistics to examine discourses of refugees and asylum seekers in the UK press. Discourse & Society, 19(3), 273-306. doi: 10.1177/0957926508088962

Creese, A. (2011). Making local practices globally relevant in researching multilingual education. In F. M. Hult & K. A. King (Eds.), Educational linguistics in practice: Applying the local globally and the global locally (pp. 41-59). Bristol, UK: Multilingual Matters.

Rampton, B. (2007). Neo-Hymesian linguistic ethnography in the United Kingdom. Journal of Sociolinguistics, 11(5), 584-607. doi: 10.1111/j.1467-9841.2007.00341.x

Sealey, A. (2007). Linguistic ethnography in realist perspective. Journal of Sociolinguistics, 11(5), 641-660. doi: 10.1111/j.1467-9841.2007.00341.x

Sealey, A. (2010). Probabilities and surprises: A realist approach to identifying linguistic and social patterns, with reference to an oral history corpus. Applied Linguistics, 31(2), 215-235. doi: 10.1093/applin/amp023


Are you interested in becoming the next student challenge panel member? Apply to attend our free summer school to learn more.

Reflections from the CASS student challenge panel member, part 2

Pamela Irwin, this year’s CASS student challenge panel member, is looking back on her past year of research. This is part 2 of her reflections — did you miss part 1? Click here to catch up.


As my research is predicated on a realist ontology, I have been concerned that it is at odds with the constructivist perspective adopted by many studies investigating the use of language in society.

Very simplistically, realists believe in the existence of a reality that is external to a person, whereas for constructivists, reality is contingent on language and signification.

Different versions populate both ontologies. Realism is largely associated with the critical realists spearheaded by Bhaskar and Archer. Likewise, constructivism is noted for its variations, such as those associated with the sociocultural and critical constructivists.

As such, I am struggling with ‘if and how’ to reconcile these “incompatible meta-theories” (Chouliaraki, 2002, p. 83). Lichbach (2003) suggests that there are three ways to address this philosophical schism: ‘competitors’ exaggerate the differences between these perspectives, ‘lumpers’ try to synthesise them into one centre, and ‘pragmatists’ roll over and ignore discrepancies. Here, my view aligns with the competitors’ insistence on separate ontologies.

Interestingly, a lumper approach is deemed workable in an ontological/epistemological combination. For Chouliaraki (2002, pp. 97-98), this is “a discourse informed by realist elements”, where a constructivist ontology is combined with a realist epistemology to draw out conceptual, analytical and temporal effects. Conversely, Burawoy (2003, p. 655) “presumes an external ‘real world’ but it is one that we can only know through our constructed relation to it…realist and constructivist approaches provide each other’s corrective.” His sequence (a realist ontology and a constructivist epistemology) aligns with my conceptual position.

I am also intrigued by the potentiality of ‘critical’ as a hinge linking the critical realist and critical constructivist worldviews. (Incidentally, two recent papers address this realist/language divide: Elder-Vass (2013) with his seven classifications of linguistic realism and Lau and Morgan (2013) via discourse theory). When contextualised to my realist/constructivist framework and research data revealing inequalities in power relations and social structures in the rural community, a comparable option for me might be to underpin critical gerontology (ontology) with a critical discourse analysis (epistemology), mediated through corpus linguistics.

References:

Burawoy, M. (2003). Revisits: An outline of a theory of reflexive ethnography. American Sociological Review, 68(5), 645-679. Retrieved from: http://jstor.org/stable/1519757

Chouliaraki, L. (2002). ‘The contingency of universality’: Some thoughts on discourse and realism. Social Semiotics, 12(1), 83-114. doi: 10.1080/10350330220130386

Elder-Vass, D. (2013). Debate: Seven ways to be a realist about language. Journal for the Theory of Social Behaviour. doi: 10.1111/jtsb.12040

Lau, R.W.K., & Morgan, J. (2013). Integrating discourse, construction and objectivity: A contemporary realist approach. Sociology. doi: 10.1177/003803513491466

Lichbach, M.I. (2003). Is rational choice theory all of social science? Ann Arbor, MI: University of Michigan Press.


Return soon to read Pamela’s next instalment! Are you interested in becoming the next student challenge panel member? Apply to attend our free summer school to learn more.

A Journey into Transcription, Part 1: Our Approach

To Transcribe:
VERB:
to put (thoughts, speech, or data) into written or printed form
origin:
mid 16th century (in the sense ‘make a copy in writing’):
from Latin transcribere, from trans- ‘across’ + scribere ‘write’

In September 2013 we applied for the post of Audio Transcriber in the CASS Office in the Department of Linguistics and English Language here at Lancaster University.  The job description seemed straightforward: to transcribe audio tape materials according to a predefined scheme and to undertake other appropriate duties as directed.  And the person specification?  As you would expect, a list of essential/desirable skills including working effectively as part of a team; the ability to learn and apply schemes (more of that later); and the ability to work with a range of accents and dialects of English (this is the fun part!).

We say the post of Audio Transcriber since, as far as we knew, only one post was available.  How wonderful to find ourselves both appointed (long may the funding last!).  The opportunity to establish a slick working team, as well as to consult when problems arise and, not least, to celebrate the successes (yes, transcribing is a rewarding job!), is a huge benefit not only to ourselves in our work but also to the success of the project as a whole.  In the ESRC Centre for Corpus Approaches to Social Science, it must be the corpus that is at the heart of the centre.  Knowing that we play a key role within the team working together to develop this corpus, we take great pride in what we do.  After all, our listening skills, our focus on accuracy and our meticulous attention to detail have the potential to help develop a corpus of excellent quality, and this will make a vital contribution to the validity of all the research that will follow.  Quite simply, it is this which makes our job so enjoyable and rewarding.

Our day-to-day work involves transcribing recordings of oral examinations taken by learners of English as a second language at elementary, intermediate and advanced stages.  The examinations have been carried out by Trinity College London and have taken place in various countries: Spain, Mexico, Italy, China, India and Sri Lanka so far.  Each language and each stage has its own unique features.

Seven months and 1.5 million words later (Stage One completed and celebrated with colleagues and cake!), we were delighted to be invited to contribute a blog documenting our experience as transcribers.  Over the coming months we plan to describe and discuss various aspects of the job.  The aim is to offer an insight to other transcribers and researchers about this particular process.

Look out for the next instalment on Getting Started!

And finally… A Transcriber’s Thought For The Day:

They told me I had type A blood, but it was a type-O.

Reflections from the CASS student challenge panel member, part 1

Each year, one student from an outside institution is appointed to ‘challenge‘ CASS with concepts from their own novel research. Pamela Irwin, the 2013/2014 student challenge panel member, is beginning to wrap up her ‘term’, and has put together a series of reflections on the process. Read the first entry below.


I am a mature student with a background in health and higher education, and currently completing my PhD in gerontology. My research centres on the interaction between age, gender and the community in the context of resilience in older women living on their own in rural Australia.

Although ageing is informed by many disciplines, my research route is via the broad domain of social sciences. Serendipitously, a peer review of a journal article was responsible for my formal exposure to linguistics and corpus linguistics. The reviewers indicated that my paper reflected a sociological rather than the requisite social psychology orientation, and while I was aware that my topic crossed these disciplines, I was not fully cognisant of the critical importance of language in differentiating these subtleties. As a result, I enrolled in a corpus linguistic programme designed to improve academic language use, and through the inaugural CASS summer school, I was then able to consolidate, expand and apply this knowledge. This immersion in the world of linguistics stimulated a new and growing interest in the ‘function’ of language in academia and everyday life.

However, I soon realised that my grounding in the grammatical structures of the English language was extremely basic. While I could identify the fundamental parts of speech, I could not parse a sentence and any further analysis was well beyond my skill set. Since then, I have been introduced to new concepts (semiosis), terminology (concatenate), techniques (linguistic ‘friendly’ transcribing) and technology (Wmatrix) amongst others, as well as being challenged to rethink and change some of my preconceived ideas (metaphor).

Here, my understanding of figures of speech is particularly salient. Resilience, a key theme in my research, tends to have different meanings depending on both the subject and context. An overview of the literature suggests that resilience is often described metaphorically as ‘bouncing back’ in academic and popular psychology, whereas in an Australian setting, resilience is more likely to be associated with an image of ‘the (little) Aussie battler’ (Moore, 2010). In this context, resilience represents perseverance, with the ‘underdog’ battling against all odds to overcome hardship in adverse conditions. By contrast, at a systems (socio-ecological) level, resilience is not yet related to a specific metaphor or image. It is, however, closely linked to a related term, ‘panarchy’, that involves a dynamic process of adaptation and transformation.

Thus resilience is defined by a metaphor (a ball), an image (a battler) and a conceptual term (panarchy) in my study. These differences provide a rich ‘landscape’ to uncover with corpus linguistics.

Reference:

Moore, B. (2010). What’s their story? A history of Australian words. Melbourne: Oxford University Press.


Return soon to read Pamela’s next instalment! Are you interested in becoming the next student challenge panel member? Apply to attend our free summer school to learn more.

Dispatch from YLMP2014


I recently had the pleasure of travelling to Poland to attend the Young Linguists’ Meeting in Poznań (YLMP), a congress for young linguists who are interested in interdisciplinary research and stepping beyond the realm of traditional linguistic study. Hosted over three days by the Faculty of English at Adam Mickiewicz University, the congress featured over 100 talks by linguists young and old, including plenary lectures by Lancaster’s very own Paul Baker and Jane Sunderland. I was one of three Lancaster students to attend the congress, along with undergraduate Agnes Szafranski and fellow MA student Charis Yang Zhang.

What struck me about the congress, aside from the warm hospitality of the organisers, was the sheer breadth of topics that were covered over the weekend. All of the presenters were more than qualified to describe their work as linguistics, but perhaps for the first time I saw within just how many domains such a discipline can be applied. At least four sessions ran in parallel at any given time, and themes ranged from gender and sexuality to EFL and even psycholinguistics. There were optional workshops as well as six plenary talks. On the second day of the conference, as part of the language and society stream, I presented a corpus-assisted critical discourse analysis of the UK national press reporting of the immediate aftermath of the May 2013 murder of soldier Lee Rigby. I was happy to have a lively and engaged audience who had some really interesting questions for me at the end, and I enjoyed the conversations that followed this at the reception in the evening!

What was most encouraging about the congress was the drive and enthusiasm shared by all of the ‘young linguists’ in attendance. I now feel part of a generation of young minds who are hungry to improve not only our own work but hopefully, in time, the field(s) of linguistics as a whole. After my fantastic experience at the Boya Forum at Beijing Foreign Studies University last autumn, I was happy to spend time again celebrating the work of undergraduate and postgraduate students, and early-career linguists. There was a willingness to listen, to share ideas, and to (constructively) criticise where appropriate, and as a result I left Poznań feeling very optimistic about the future of linguistic study. I look forward to returning to the next edition of YLMP, because from what I saw at this one, there is a new generation of linguists eager to push the investigation of language to the next level.

How to be a PhD student (by someone who just was), Part 2: Managing your work and working relationships

After submitting and successfully defending my thesis a few months ago, I’ve decided to share some ‘lessons learnt’ over the course of my 38 months as a PhD student. 

In Part 2 of this series, I’ll talk about best practices for structuring your work, managing your relationship with your supervisor, and my experience with teaching undergraduates. If you missed “Part 1: Preparing for the programme”, you can read it here



Structuring your work

I believe it’s healthy to treat your PhD—as much as possible—like a job. Like any job, a PhD has physical, social, and temporal boundaries.

Try to create a PhD ‘space’. Make use of your office if you’ve been given one at your university, and create a space within your home that is a ‘work area’ if you haven’t been given one. Working from bed, from the sofa, or from a café means that your PhD is infiltrating all areas of your life. While some degree of this is inevitable, it’s best to keep physical boundaries as much as possible, even if you can only keep it to your desk.

By the same token, making friends outside of your department or your field is helpful in many ways. I adore my friends from Linguistics and I couldn’t have finished my doctorate without them, but you wouldn’t solely hang out with friends from work when you’re at home, and this is the same situation. In a group of people who have a similar background, you might end up talking about your field ‘outside of hours’. This can be stimulating, but also exhausting. You may want to vent about your department, or talk about something other than your PhD or field, even trashy TV! It’s easier with friends from other areas. As a nice extra feature, the connections that you make outside of your field can also help you inside your field. I’ve had very good advice from friends working in statistics, gotten ideas from historians, and been inspired by literary scholars, even though I might never venture into these areas in the library.

If you can, also create a routine for yourself, even if this isn’t 9-5. It’s best if this routine involves physically moving locations, but even if it doesn’t, physically change something: take a shower, get dressed for work. Pick 8 hours within the day that you work best, and work during those hours. Don’t be too hard on yourself if you have a short day or miss days out entirely…a PhD is ‘swings and roundabouts’ as they say around here…it’s long enough that you will make up the time to yourself. As much as possible, take the weekends and holidays off. This might mean working longer than 8 hours on weekdays, but personally, I think it’s worth it. Many people study in a place far from where they grew up, and a PhD is one time in life where you can be flexible enough with your time to enjoy a bit of sightseeing and tourism.

During this routine, set clear goals for yourself. I’ve seen people arguing for and against writing something every day. I found it very helpful to set a daily word count goal for myself, then sit in front of a computer until I at least came close. The number isn’t important: at the start of my PhD, I aimed to write 200 words per day; at the end of my PhD, I was able to write 1,000 words per day. What is important is getting into a routine. You will sit down some days and feel horrible. You’ll have writer’s block. You will struggle through each word of those 200, and know that you’ll delete most of them. But it’s much easier to get 40 great words out of 200 bad ones than to write 40 words completely cold. I’ve written entire chapters three times as long as they needed to be, and hated them. But paring them down is cathartic—it’s like sculpting. The bonus is that when you get into the habit of writing every day, you slowly get into the habit of writing something good every day. Soon, you’ll be writing 100 words and keeping 50 of them. Then you’ll be writing 1,000 words and keeping 900 of them. The important part is keeping the pace: just write! Your supervisor will also appreciate having something tangible to mark your progress (see next section).

As for the structure of my own work, there are three things that I would do differently, if I could do it all again:

  1. Decide on a reference manager and stick to it diligently from Day 1. At the start of my degree I used EndNote for reference management, as this was offered for free by my university and came in both desktop and web versions. For my whole first year, I used EndNote to create an annotated bibliography—an extremely useful tool when drafting your literature review. However, EndNote began crashing on me, and my stored papers were no longer accessible. In my second year, I stopped keeping track of references and just kept haphazard folders of PDFs. In my third year, I just used in-line citations, believing that sources would be easy to find later on. Not true! The month before submission I decided to make the leap to Mendeley, a truly amazing (free) reference manager that allows you to build and share libraries, store your PDFs, search other people’s collections, and select from a vast array of output styles (I favour APA 6th edition). The transition was extraordinarily painful. Exporting from EndNote was problematic and buggy, scanning PDFs in Mendeley was error-prone, and finding the corresponding works for those in-line references was impossible in some cases. I wasted a solid week just before submission sorting out my references, and this really should have been done all along. It would have been so painless!
  2. Master MS Word early on. In my final year, I finally got serious about standardising the numbering of my tables and figures, which means that in the eleventh hour, I was still panicking, trying to make sure that I had updated everything to the proper styles and made appropriate in-line references to my data. Had I set my styles earlier on and made the best use of MS Word’s quite intuitive counting and cross-referencing mechanisms, I would have saved myself days of close reading. If you are using MS Word (sorry, I can’t say anything about LaTeX) and you are not using the citation manager or cross-reference tool, learn how to do that immediately. Today. Your library might have a class on it, or, like me, you can brush up in an hour of web searching.
  3. Put down the books earlier. At a certain point, you need to generate new research and make a novel contribution to knowledge. Your first year and much of your second year will be dedicated to making sure that a research gap exists, and that you can pay tribute to all of the giants whose shoulders you will be standing on. However, burying yourself in a library for three years reading everyone else’s great works is a good way to paralyse yourself. Of course you will always need to keep up with the times, but at a certain point, your rate of writing will overtake your rate of reading. If I could do it again, I would follow a pattern more like this:

[Figure: the read/write balance over the course of the PhD]

After the first year, you won’t be missing anything totally fundamental. After the second year, you won’t be missing anything peripheral. If, in the third year, you’ve missed something very fresh, your examiners will point it out. But the more important thing is to make a contribution. Most of the PhD is research, not literature review. Your supervisor will be able to help you with this, and with some other things (but not with everything), as I discuss below.

Managing your relationship with your supervisor


“My research trip to the CASS centre” by visiting PhD student Anna Mattfeldt

Several times a year, the ESRC Centre for Corpus Approaches to Social Science welcomes visiting researchers, from PhD students to professors. Past visitors include Will Hamlin (Washington State University, USA) and Iuliia Rudych (Albert-Ludwigs-Universität Freiburg, Germany); current visitors include Laurence Anthony (Waseda University, Japan) and Anna Mattfeldt (Heidelberg University, Germany). Before returning to her home university, Anna wanted to share a few thoughts about her experience here at CASS:


I am a PhD student from Heidelberg who has just spent eight wonderful weeks at Lancaster University on a research trip. Before I went, some friends and colleagues asked me why I would go to so much trouble when I could just as easily write my thesis back home in Heidelberg. In the following post, I will try to answer why a research trip to another country and another university was the right decision for me – and why I can absolutely recommend it to other PhD students as well. I would also like to thank my main supervisor, Prof. Ekkehard Felder, for giving me the great chance to spend these eight weeks of research here at Lancaster.

I am doing my PhD at the German department of Heidelberg University. We have been doing corpus linguistic research in discourse analysis for quite some time, with big thematic corpora like HeideKo that were collected for research and teaching purposes. A bilingual corpus project, focusing on the depiction of Europe in German and Hungarian newspapers, is currently under way with the German department of ELTE in Budapest, Hungary.

We approach data from a mainly qualitative point of view, accompanied by quantitative analysis. We focus on so-called “semantic battles” in a pragma-semiotic approach, which means we try to find instances of disagreement or agreement between speakers and how they are played out on the linguistic surface level. Some may come up so often in specific discourses that they can be seen as central to the discourse. We are interested in the concepts behind the discourse, and how we can deduce them from the actual linguistic devices used in texts.

In my PhD, I am looking at environmental media discourses (especially concerning Hurricane Sandy and hydraulic fracturing, the so-called “fracking”, in the US, the UK and Germany), in order to do a linguistic discourse analysis. Moreover, I am trying to find a way to detect conflictive topics and concepts in the various discourses. So, for a project that focuses on different languages, corpora and research questions, I need corpus linguistic software, like Wmatrix, AntConc, CQPweb and WordSmith. My co-supervisor, Prof. Busse, recommended a stay with the ESRC Centre for Corpus Approaches to Social Science at Lancaster University. The CASS centre at Lancaster is known for its high scientific expertise with huge corpora and different kinds of software. This is why I came up with the idea to look for support somewhere else as well.

Hence I sent an email to Tony McEnery. To my great delight, after sending in a few documents, I was actually invited to come and do some research here. After figuring everything out at work, sending applications for scholarships to fund all this and chatting online with local property owners, I finally arrived on the 15th February and spent eight amazing weeks here.

The CASS centre has helped me a lot in my research, especially with tricky data. I was also confronted with lots of interesting ideas, and I loved the atmosphere of picking one another’s brains and inspiring one another. I liked the working atmosphere, the many interesting talks that were given, and the wonderful library with all the literature of the different fields, and last but not least the beautiful campus in an idyllic landscape. I was inspired to work more closely with quantitative approaches and to see how they could be used to see the bigger concepts “between the lines”. I also got a lot of my analysis done, made a lot of progress and still managed to see a bit of England as well during the weekends.

Thus, I can wholeheartedly recommend going abroad during a PhD for a research trip:

  • You get to talk to experts who can help you find solutions for the challenges you have been stuck with.
  • You get lots of new ideas just by talking to different people, being in a new environment or experiencing a different research philosophy.
  • Believe it or not, it immensely furthers the writing process to work in a new environment without any distractions.
  • If you are going to a country with a language different from your own, it is a great opportunity to brush up your language skills.
  • You broaden your horizons by living abroad, not only as far as your PhD is concerned.

So if you feel that you could benefit in any way from going abroad, I recommend you do so – and hopefully come to Heidelberg! If you have any further questions concerning my project or visiting Heidelberg University for your own research trip, just send me an email (anna.mattfeldt at gs.uni-heidelberg.de).


Are you interested in being a visiting researcher/scholar at CASS? Email us at cass at lancs.ac.uk to discuss research aims and availability.

How to be a PhD student (by someone who just was), Part 1: Preparing for the programme

In December 2013, after three years and two months of work, I submitted my PhD thesis. Last month, I successfully defended it, and made the (typographical) corrections in two nights. I’m a Doctor! It’s still exciting to say.

A PhD is certainly not easy — I’ve heard it compared to giving birth, starting and ending a relationship, riding a rollercoaster, making a lonely journey, and more. I relocated across the world from Australia to begin mine, and the start was marked by the sadness of a death in the family. It’s been a whirlwind ever since; throughout the course of my degree, I taught as much as possible, I researched and published outside the scope of my PhD, and in April 2013, I began full-time work in the ESRC Centre for Corpus Approaches to Social Science.

The question that I get most often is a question that I found myself asking for years: how? How do you do a PhD? How do you choose a programme and keep from looking back? How do you keep close to the minimum submission date (or at least keep from going beyond the maximum submission date)? How do you balance work and study? I’d like to share a short series (in three installments) about my degree and my lessons learned. There are many resources out there for people doing PhDs, but I wasn’t able to find any that described my experience. I hope that this might help some others who are [metaphorical representation of your choice] a PhD. Before beginning, I’d just like to stress that these resonate with my personal experience (and with those of many of my friends), but won’t align with everyone’s circumstances.

The first installment is five pointers about what to do when applying to a programme.


Using version control software for corpus construction

There are two problems that often come up in collaborative efforts towards corpus construction. First, how do two or more people pool their efforts simultaneously on this kind of work – sharing the data as it develops without working at cross-purposes, repeating effort, or ending up with incompatible versions of the corpus? Second, how do we keep track of what changes in the corpus as it grows and approaches completion – and in particular, if mistakes get made, how do we make sure we can undo them?

Typically, corpus linguists have used ad hoc solutions to these problems. To deal with the problem of collaboration, we email bundles of files back and forth, use shared directories on our institutional networks, or rely on external cloud services like Dropbox. To deal with the problem of recording the history of the data, we often resort to saving multiple different versions of the data, creating a new copy of the whole corpus every time we make any tiny change, and adding an ever-growing pile of “v1”, “v2”, “v3”… suffixes to the filenames.

In this blog post I’d like to suggest a better way!

The problems of collaboration and version tracking also affect the work of software developers – with the difference that for them, these problems have been quite thoroughly solved. Though software development and corpus construction are quite different animals, in two critical respects they are similar. First, we are working mainly with very large quantities of plain text files: source code files in the case of software, natural-language text files in the case of corpora. Second, when we make a change, we typically do not change the whole collection of files but only, perhaps, some specific sections of a subset of the files. For this reason, the tools that software developers use to manage their source code – called version control software – are in my view eminently suitable for corpus construction.

So what is version control software?

Think of a computer filesystem – a hierarchy of folders, subfolders and files within those folders which represents all the various data stored on a disk or disks somewhere. This is basically a two-dimensional system: files and folders can be above or below one another in the hierarchy (first dimension), or they can be side-by-side in some particular location (second dimension). But there is also the dimension of time – the state of the filesystem at one point in time is different from its state at a subsequent point in time, as we add new files and folders or move, modify or delete existing ones. A standard traditional filesystem does not have any way to represent this third dimension. If you want to keep a record of a change, all you can do is create a copy of the data alongside the original, and modify the copy while leaving the original untouched. But it would be much better if the filesystem itself were able to keep a record of all the changes that have been made, and all of its previous states going back through history – and if it did this automatically, without the user needing to manage different versions of the data manually.

Windows and Mac OS X both now have filesystems that contain some features of this automatic record-keeping. Version control software does the same thing, but in a more thorough and systematic way. It implements a filesystem with a complete, automatic record of all the changes that are made over time, and provides users with easy ways to access the files, see the record of the changes, and add new changes.

I personally encountered version control software for the first time when I became a developer on the Corpus Workbench project back in 2009/2010. Most of the work on CWB is done by myself and Stefan Evert, and although we do have vaguely defined areas of individual responsibility for different bits of the project, there is also a lot of overlap. Without version control software, effective collaboration and tracking the changes we each make would be quite impossible. The whole of CWB including the core system, the supplementary tools, the CQPweb user interface, and the various manuals and tutorials, is all version-controlled. UCREL also uses version control software for the source code of tools such as CLAWS and USAS. And the more I’ve used version control tools for programming work, the more convinced I’ve become that the same tools will be highly useful for corpus development.

The version control system that I prefer is called Subversion, also known by the abbreviation SVN. This is quite an old-fashioned system, and many software developers now use newer systems such as Mercurial or Git (the latter is the brainchild of Linus Torvalds, the mastermind behind Linux). These newer and much more flexible systems are, however, quite a bit more complex and harder to use than Subversion. This is fine for computer programmers using the systems every day, but for corpus linguists who only work with version control every now and then, the simplicity of good old Subversion makes it – in my view – the better choice.

Subversion works like this. First, a repository is created. The repository is just a big database for storing the files you’re going to work with. When you access this database using Subversion tools, it looks like one big file system containing files, folders and subfolders. The person who creates and manages the repository (here at CASS that’s me) needs a fair bit of technical expertise, but the other users need only some very quick training. The repository needs to be placed somewhere where all members of the team can access it. The CASS Subversion repository lives on our application server, a virtual machine maintained by Lancaster University’s ISS; but you don’t actually need this kind of full-on setup, just an accessible place to put the database (and, needless to say, there needs to be a good backup policy for the database, wherever it is).
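To make the setup step concrete, here is one way a repository might be created and served. This is only a sketch: the paths and hostname are illustrative assumptions, not the actual CASS configuration (a repository can equally well be exposed over HTTP via Apache, or simply placed on a shared filesystem).

```shell
# Hypothetical server-side setup (run once, by the repository manager).
# Paths and hostname are illustrative, not the real CASS arrangement.
svnadmin create /var/svn/corpus-repo   # create the repository database
svnserve -d -r /var/svn                # serve it with Subversion's own daemon

# A team member on another machine could then check out with:
#   svn checkout svn://server.example.ac.uk/corpus-repo
```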

The repository manager then creates usernames that the rest of the team can use to work with the files in the repository. When you want to start working with one of the corpora in the repository, you begin by checking out a copy of the data. This creates a working copy of the repository’s contents on your local machine. It can be a copy of the whole repository, or just a section that you want to work on.  Then, you make whatever additions, changes or deletions you want – no need to keep track of these manually! Once you’ve made a series of changes to your checked-out working copy, you commit it back into the repository. Whenever a user commits data, the repository creates a new, numbered version of its filesystem data. Each version is stored as a record of the changes made since the previous version. This means that (a) there is a complete record of the history of the filesystem, with every change to every file logged and noted; (b) there is also a record of who is responsible for every change. This complete record takes up less disk space than you might think, because only the changes are recorded. Subversion is clever enough not to create duplicate copies of the parts of its filesystem that have not changed.
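The check-out/commit cycle just described can be sketched on the command line. The following is a minimal, self-contained illustration using a throwaway local repository (a file:// URL); in real use the checkout URL would point at the team’s shared repository, and the file names here are invented.

```shell
# Throwaway local repository so the example is self-contained;
# in practice the URL would point at the shared server.
svnadmin create /tmp/corpus-repo
svn checkout file:///tmp/corpus-repo /tmp/corpus-wc   # get a working copy
cd /tmp/corpus-wc

echo "First interview transcript." > interview01.txt
svn add interview01.txt                          # schedule the new file
svn commit -m "Add first interview transcript"   # creates revision 1

svn update   # bring the working copy up to date
svn log      # the full, numbered history, with author and message
```

Every commit produces a new numbered revision, so `svn log` shows exactly who changed what, and when.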

Nothing is ever lost or deleted from this system. Even if a file is completely removed, it is only removed from the new version: all the old versions in the history still contain it. Moreover, it is always possible to check out a version other than the current one – allowing you to see the filesystem as it was at any point in time you choose. That means that all mistakes are reversible. Even if someone commits a version where they have accidentally wiped out nine-tenths of the corpus you are working on, it’s simplicity itself just to return to an earlier point in history and roll back the change.
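Here is a self-contained sketch of such a roll-back, again using a throwaway local repository; the revision numbers and file contents are invented. The key command is a reverse merge (`svn merge -c -N`), which applies revision N backwards to the working copy; committing the result restores the earlier state while leaving the mistaken revision in the history.

```shell
# Throwaway local repository to demonstrate undoing a mistaken commit.
svnadmin create /tmp/rollback-repo
svn checkout file:///tmp/rollback-repo /tmp/rollback-wc
cd /tmp/rollback-wc

echo "the correct corpus text" > text01.txt
svn add text01.txt
svn commit -m "r1: add corpus text"

echo "an accidental overwrite" > text01.txt
svn commit -m "r2: mistaken change"

svn update
svn merge -c -2 .                         # reverse-merge: undo revision 2 locally
svn commit -m "r3: roll back revision 2"  # r2 itself stays in the history

cat text01.txt                            # the correct text is back
```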

The strength of this approach for collaboration is that more than one person can have a checked-out copy of a corpus at the same time, and everyone can make their own changes separately. To check whether someone else has committed changes while you’ve been working, you can update your working copy from the repository, getting the other person’s changes and merging them with yours. Even if you’ve made changes to the same file, they will be merged together automatically. Only if two of you have changed the same section of the same file is there a problem – and in this case the program will show you the two different versions, and allow you to pick one or the other or create a combination of the two manually.
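The update step can be sketched by simulating two collaborators with two working copies of the same throwaway local repository (the names are invented):

```shell
svnadmin create /tmp/shared-repo
svn checkout file:///tmp/shared-repo /tmp/wc-alice   # Alice's working copy
svn checkout file:///tmp/shared-repo /tmp/wc-bob     # Bob's working copy

# Alice adds a file and commits it:
cd /tmp/wc-alice
echo "a new corpus text" > textA.txt
svn add textA.txt
svn commit -m "Alice adds textA"

# Bob's copy knows nothing of this until he updates,
# which pulls Alice's change into his working copy:
cd /tmp/wc-bob
svn update
```

Had both of them edited the same lines of the same file, the update would instead flag a conflict for Bob to resolve by hand.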

While Subversion can do lots more than this, for most users these three actions – check out, update, and commit – are all that’s needed. You also have a choice of programs that you can use for these actions. Most people with Unix machines use a command-line tool called svn which lets you issue commands to Subversion by typing them into a shell terminal.

On Windows, on the other hand, the preferred tool is something called TortoiseSVN. This can be downloaded and installed in the same way as most Windows programs. However, once installed, you don’t have to start up a separate application to use Subversion. Instead, the Subversion commands are added to the right-click context menu in Windows Explorer. So you can simply go to an empty folder, right-click with the mouse, and select the “check out” option to get your working copy. Once you’ve got a working copy, right-clicking on any file or folder within it allows you to access the “update” and “commit” options. TortoiseSVN provides an additional sub-menu which lets you access the full range of Subversion commands – but, again, normal users only need those three most common commands.

The possibility of using TortoiseSVN on Windows means that even the least tech-savvy member of your team can become a productive user of Subversion with only a very little training. And the benefits of building your corpus in a Subversion repository are considerable:

  • The corpus is easily accessible and sharable between collaborators
  • There is a complete record of all changes made, including who did what
  • Any change can be reversed if necessary, with no need to manually manage “old versions”
  • The corpus is fully protected against accidental deletions and erroneous changes
  • A secure and reliable backup method is only needed for the repository itself, not for each person’s working copy

That’s not to mention other benefits, such as the ease of switching between computers (just check out another working copy on the new machine and carry on where you left off).

Here at CASS we are making it our standard policy to put corpus creation work into Subversion, and we’re now in the process of gradually transitioning the team’s corpus-building efforts over to that platform. I’m convinced this is the way of the future for effectively managing corpus construction.