CASS visit to Ghana

On June 24th, three other members of CASS and I spent a week in Accra, Ghana, demonstrating corpus methods and our own research at two universities: the University of Ghana and the recently established Lancaster University Ghana campus. From the UK it’s just over a six-hour flight, although thankfully there is only a one-hour time difference. However, travel did involve some advance preparation, with jabs for yellow fever (and a few other things), visa applications and taking anti-malarial pills for a month after the trip. Fortunately, we only encountered one mosquito during the whole trip and none of us were bitten.

Although close together, the two universities we visited have a very different feel to them: the former is a large university spread out over a lot of land, with many departments and buildings, while the latter is (at the moment) a three-storey, modern-looking grey and red building with the familiar Lancaster logo on it.

Our first trip was to the University of Ghana, where Andrew, Tony and I each gave a lecture to about 90 members of staff and students. Tony talked about the theoretical principles behind corpus linguistics, I discussed (and problematized) sex differences in the British National Corpus and Andrew showed applications of corpus linguistics to field linguistics using Corpus Workbench. The University of Ghana has some alumni members of Lancaster University and it was great to run into Clement Appah and Grace Diabah (formerly Bota) again.

Over the following two days, we gave corpus linguistics workshops, which included a two-hour lab session where Andrew walked students through setting up a CQPweb account and doing some analysis of the Brown Family of corpora. I suspect this was the highlight of the day for those who attended, who were pleased to get access to many of the corpora we have at Lancaster. Each day we taught about 35 people, including some who had travelled quite long distances to get to us. Four students had driven in that morning from Cape Coast – a journey that we did some of when we went to Kakum National Park on our day off, and that took us over three hours – so we were impressed by their dedication. Tony gave an introduction to corpus linguistics and Vaclav talked about the General Service List for English words and let the students use a tool he had developed for exploring it. I ended each day with a talk on corpus linguistics and discourse analysis.

As I mentioned, we had a day off, when we visited Kakum National Park. This gave us an opportunity to see more of Ghana on the drive there, and then we had a great experience in the park, walking across a 350m network of rope bridges (the Kakum Canopy Walk) suspended high above the ground – you literally got a bird’s eye view of the tropical rainforest below. It was one of the most memorable experiences I’ve had and I think we all came away with very positive feelings about our trip, and are looking forward to our next visit to Ghana. I also hope that we managed to inspire people to incorporate some corpus linguistics methods into their own research.

Using version control software for corpus construction

There are two problems that often come up in collaborative efforts towards corpus construction. First, how do two or more people pool their efforts simultaneously on this kind of work – sharing the data as it develops without working at cross-purposes, repeating effort, or ending up with incompatible versions of the corpus? Second, how do we keep track of what changes in the corpus as it grows and approaches completion – and in particular, if mistakes get made, how do we make sure we can undo them?

Typically corpus linguists have used ad hoc solutions to these problems. To deal with the problem of collaboration, we email bundles of files back and forth, or use shared directories on our institutional networks, or rely on external cloud services like Dropbox. To deal with the problem of recording the history of the data, we often resort to saving multiple different versions of the data, creating a new copy of the whole corpus every time we make any tiny change, and adding an ever-growing pile of “v1”, “v2”, “v3”… suffixes to the filenames.

In this blog post I’d like to suggest a better way!

The problems of collaboration and version tracking also affect the work of software developers – with the difference that for them, these problems have been quite thoroughly solved. Though software development and corpus construction are quite different animals, in two critical respects they are similar. First, we are working mainly with very large quantities of plain text files: source code files in the case of software, natural-language text files in the case of corpora. Second, when we make a change, we typically do not change the whole collection of files but only, perhaps, some specific sections of a subset of the files. For this reason, the tools that software developers use to manage their source code – called version control software – are in my view eminently suitable for corpus construction.

So what is version control software?

Think of a computer filesystem – a hierarchy of folders, subfolders and files within those folders which represents all the various data stored on a disk or disks somewhere. This is basically a two-dimensional system: files and folders can be above or below one another in the hierarchy (first dimension), or they can be side-by-side in some particular location (second dimension). But there is also the dimension of time – the state of the filesystem at one point in time is different from its state at a subsequent point in time, as we add new files and folders or move, modify or delete existing ones. A standard traditional filesystem does not have any way to represent this third dimension. If you want to keep a record of a change, all you can do is create a copy of the data alongside the original, and modify the copy while leaving the original untouched. But it would be much better if the filesystem itself were able to keep a record of all the changes that have been made, and all of its previous states going back through history – and if it did this automatically, without the user needing to manage different versions of the data manually.

Windows and Mac OS X both now have filesystems that contain some features of this automatic record-keeping. Version control software does the same thing, but in a more thorough and systematic way. It implements a filesystem with a complete, automatic record of all the changes that are made over time, and provides users with easy ways to access the files, see the record of the changes, and add new changes.

I personally encountered version control software for the first time when I became a developer on the Corpus Workbench project back in 2009/2010. Most of the work on CWB is done by myself and Stefan Evert, and although we do have vaguely defined areas of individual responsibility for different bits of the project, there is also a lot of overlap. Without version control software, effective collaboration and tracking the changes we each make would be quite impossible. The whole of CWB, including the core system, the supplementary tools, the CQPweb user interface, and the various manuals and tutorials, is version-controlled. UCREL also uses version control software for the source code of tools such as CLAWS and USAS. And the more I’ve used version control tools for programming work, the more convinced I’ve become that the same tools will be highly useful for corpus development.

The version control system that I prefer is called Subversion, also known by the abbreviation SVN. This is quite an old-fashioned system, and many software developers now use newer systems such as Mercurial or Git (the latter is the brainchild of Linus Torvalds, the mastermind behind Linux). These newer and much more flexible systems are, however, quite a bit more complex and harder to use than Subversion. This is fine for computer programmers using the systems every day, but for corpus linguists who only work with version control every now and then, the simplicity of good old Subversion makes it – in my view – the better choice.

Subversion works like this. First, a repository is created. The repository is just a big database for storing the files you’re going to work with. When you access this database using Subversion tools, it looks like one big file system containing files, folders and subfolders. The person who creates and manages the repository (here at CASS that’s me) needs a fair bit of technical expertise, but the other users need only some very quick training. The repository needs to be placed somewhere where all members of the team can access it. The CASS Subversion repository lives on our application server, a virtual machine maintained by Lancaster University’s ISS; but you don’t actually need this kind of full-on setup, just an accessible place to put the database (and, needless to say, there needs to be a good backup policy for the database, wherever it is).

The repository manager then creates usernames that the rest of the team can use to work with the files in the repository. When you want to start working with one of the corpora in the repository, you begin by checking out a copy of the data. This creates a working copy of the repository’s contents on your local machine. It can be a copy of the whole repository, or just a section that you want to work on. Then, you make whatever additions, changes or deletions you want – no need to keep track of these manually! Once you’ve made a series of changes to your checked-out working copy, you commit it back into the repository. Whenever a user commits data, the repository creates a new, numbered version of its filesystem data. Each version is stored as a record of the changes made since the previous version. This means that (a) there is a complete record of the history of the filesystem, with every change to every file logged and noted; (b) there is also a record of who is responsible for every change. This complete record takes up less disk space than you might think, because only the changes are recorded. Subversion is clever enough not to create duplicate copies of the parts of its filesystem that have not changed.
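
To give a feel for what "only the changes are recorded" means, here is a small Python sketch using the standard difflib module. It is only an illustration of the idea: the file names and sentences are made up, and Subversion's own internal storage format is different from the text diff shown here.

```python
import difflib

# Two hypothetical versions of one corpus text file, as lists of lines.
before = [
    "The cat sat on the mat.\n",
    "It was a sunny day.\n",
    "The dog barked loudly.\n",
]
after = [
    "The cat sat on the mat.\n",
    "It was a very sunny day.\n",
    "The dog barked loudly.\n",
]

# A unified diff describes only what differs between the two versions.
# n=0 means "no surrounding context lines"; the file labels are invented.
delta = list(
    difflib.unified_diff(
        before, after, fromfile="r1/text001.txt", tofile="r2/text001.txt", n=0
    )
)

# Skip the two header lines, then keep only the removed (-) and added (+) lines.
changed = [line for line in delta[2:] if line[0] in "+-"]

print("".join(delta))
```

Although the file has three lines, the recorded change mentions only the one line that was edited, which is why a long history of small edits to a large corpus stays compact.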

Nothing is ever lost or deleted from this system. Even if a file is completely removed, it is only removed from the new version: all the old versions in the history still contain it. Moreover, it is always possible to check out a version other than the current one – allowing you to see the filesystem as it was at any point in time you choose. That means that all mistakes are reversible. Even if someone commits a version where they have accidentally wiped out nine-tenths of the corpus you are working on, it’s simplicity itself just to return to an earlier point in history and roll back the change.
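
The following toy model, written in Python, may help to picture this. It is emphatically not how Subversion is implemented: the class ToyRepository and the usernames are invented for illustration, and for simplicity it stores a full snapshot for each revision rather than just the changes.

```python
class ToyRepository:
    """A toy model of numbered revisions; not Subversion's real design."""

    def __init__(self):
        # Revision 0 is the empty filesystem.
        self.revisions = [{}]
        self.log = [("(nobody)", "initial empty repository")]

    def commit(self, author, message, files):
        """Store a new snapshot (path -> text) and return its revision number."""
        self.revisions.append(dict(files))
        self.log.append((author, message))
        return len(self.revisions) - 1

    def checkout(self, revision=None):
        """Return a copy of the files at a given revision (default: latest)."""
        if revision is None:
            revision = len(self.revisions) - 1
        return dict(self.revisions[revision])


repo = ToyRepository()
r1 = repo.commit("alice", "add two texts", {"a.txt": "first text", "b.txt": "second text"})

# A mistaken commit removes b.txt, but only from the newest revision.
working = repo.checkout()
del working["b.txt"]
r2 = repo.commit("bob", "tidy up", working)

# The old revision still contains the file, so the mistake can be undone
# by committing the earlier state again as a new revision.
r3 = repo.commit("alice", "restore b.txt", repo.checkout(r1))

print(sorted(repo.checkout(r2)))  # the files in the mistaken revision
print(sorted(repo.checkout()))    # the files after the rollback
```

Note that the rollback does not erase the mistaken revision. It stays in the history, together with the name of whoever committed it.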

The strength of this approach for collaboration is that more than one person can have a checked-out copy of a corpus at the same time, and everyone can make their own changes separately. To check whether someone else has committed changes while you’ve been working, you can update your working copy from the repository, getting the other person’s changes and merging them with yours. Even if you’ve made changes to the same file, they will be merged together automatically. Only if two of you have changed the same section of the same file is there a problem – and in this case the program will show you the two different versions, and allow you to pick one or the other or create a combination of the two manually.
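
The merging logic can be sketched in a few lines of Python. This is a deliberately simplified model, not Subversion's actual merge algorithm: the function merge_lines is hypothetical, and it assumes that edits only replace existing lines, so that all three versions of the file have the same number of lines.

```python
def merge_lines(base, mine, theirs):
    """Merge two edited copies of a file against their common ancestor.

    Simplifying assumption: edits only replace lines in place, so the
    three lists must have equal length.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this toy merge only handles in-place line edits")
    merged, conflicts = [], []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:        # both unchanged, or both made the same edit
            merged.append(m)
        elif m == b:      # only the other person changed this line
            merged.append(t)
        elif t == b:      # only I changed this line
            merged.append(m)
        else:             # both changed it differently: a human must decide
            merged.append(None)
            conflicts.append((number, m, t))
    return merged, conflicts


base = ["Helo world", "Second sentense", "Third line"]
mine = ["Hello world", "Second sentense", "Third line"]     # I fixed line 1
theirs = ["Helo world", "Second sentence", "Third line"]    # they fixed line 2

clean_merge, no_conflicts = merge_lines(base, mine, theirs)

# If we both edit line 3 in different ways, the merge reports a conflict.
mine_v2 = ["Helo world", "Second sentense", "Third line!"]
theirs_v2 = ["Helo world", "Second sentense", "Third line?"]
partial_merge, conflicts = merge_lines(base, mine_v2, theirs_v2)
```

In the first case both corrections end up in the merged file without anyone having to intervene. In the second, the conflicting line is left open and both candidate versions are reported, which mirrors the choice the program offers you.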

While Subversion can do lots more than this, for most users these three actions – check out, update, and commit – are all that’s needed. You also have a choice of programs that you can use for these actions. Most people with Unix machines use a command-line tool called svn which lets you issue commands to Subversion by typing them into a shell terminal.
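
For anyone who would like to script these three actions, here is a minimal Python sketch that builds the corresponding svn command lines. The repository URL, folder name and commit message are placeholders I have made up, and the run helper assumes that the svn command-line client is installed.

```python
import subprocess


def svn_command(action, *args):
    """Build the argument list for one call to the svn command-line tool."""
    return ["svn", action, *args]


# Hypothetical repository URL and local folder name: substitute your own.
REPO_URL = "https://svn.example.org/corpora/mycorpus"
WORKING_COPY = "mycorpus"

# The three everyday actions: check out, update, commit.
checkout = svn_command("checkout", REPO_URL, WORKING_COPY)
update = svn_command("update", WORKING_COPY)
commit = svn_command("commit", WORKING_COPY, "-m", "Fix typos in text 12")


def run(command):
    """Run one command; raises CalledProcessError if svn reports a failure."""
    subprocess.run(command, check=True)


# Show the commands rather than running them, since the URL is made up.
for command in (checkout, update, commit):
    print(" ".join(command))
```

Calling run(checkout) once, and then run(update) and run(commit) as you work, would reproduce the cycle described above.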

On Windows, on the other hand, the preferred tool is something called TortoiseSVN. This can be downloaded and installed in the same way as most Windows programs. However, once installed, you don’t have to start up a separate application to use Subversion. Instead, the Subversion commands are added to the right-click context menu in Windows Explorer. So you can simply go to an empty folder, right-click with the mouse, and select the “check out” option to get your working copy. Once you’ve got a working copy, right-clicking on any file or folder within it allows you to access the “update” and “commit” options. TortoiseSVN provides an additional sub-menu which lets you access the full range of Subversion commands – but, again, normal users only need those three most common commands.

The possibility of using TortoiseSVN on Windows means that even the least tech-savvy member of your team can become a productive user of Subversion with only a very little training. And the benefits of building your corpus in a Subversion repository are considerable:

  • The corpus is easily accessible and sharable between collaborators
  • A complete record of all changes made, plus who-did-what
  • Any change can be reversed if necessary, with no need to manually manage “old versions”
  • Full protection against accidental deletions and erroneous changes
  • A secure and reliable backup method is only needed for the repository itself, not for each person’s working copy

That’s not to mention other benefits, such as the ease of switching between computers (just check out another working copy on the new machine and carry on where you left off).

Here at CASS we are making it our standard policy to put corpus creation work into Subversion, and we’re now in the process of gradually transitioning the team’s corpus-building efforts across into that platform. I’m convinced this is the way of the future for effectively managing corpus construction.