Using version control software for corpus construction

There are two problems that often come up in collaborative efforts towards corpus construction. First, how do two or more people pool their efforts simultaneously on this kind of work – sharing the data as it develops without working at cross-purposes, repeating effort, or ending up with incompatible versions of the corpus? Second, how do we keep track of what changes in the corpus as it grows and approaches completion – and in particular, if mistakes get made, how do we make sure we can undo them?

Typically corpus linguists have used ad hoc solutions to these problems. To deal with the problem of collaboration, we email bundles of files back and forth, or use shared directories on our institutional networks, or rely on external cloud services like Dropbox. To deal with the problem of recording the history of the data, we often resort to saving multiple different versions of the data, creating a new copy of the whole corpus every time we make any tiny change, and adding an ever-growing pile of “v1”, “v2”, “v3”… suffixes to the filenames.

In this blog post I’d like to suggest a better way!

The problems of collaboration and version tracking also affect the work of software developers – with the difference that for them, these problems have been quite thoroughly solved. Though software development and corpus construction are quite different animals, in two critical respects they are similar. First, we are working mainly with very large quantities of plain text files: source code files in the case of software, natural-language text files in the case of corpora. Second, when we make a change, we typically do not change the whole collection of files but only, perhaps, some specific sections of a subset of the files. For this reason, the tools that software developers use to manage their source code – called version control software – are in my view eminently suitable for corpus construction.

So what is version control software?

Think of a computer filesystem – a hierarchy of folders, subfolders and files within those folders which represents all the various data stored on a disk or disks somewhere. This is basically a two-dimensional system: files and folders can be above or below one another in the hierarchy (first dimension), or they can be side-by-side in some particular location (second dimension). But there is also the dimension of time – the state of the filesystem at one point in time is different from its state at a subsequent point in time, as we add new files and folders or move, modify or delete existing ones. A standard traditional filesystem does not have any way to represent this third dimension. If you want to keep a record of a change, all you can do is create a copy of the data alongside the original, and modify the copy while leaving the original untouched. But it would be much better if the filesystem itself were able to keep a record of all the changes that have been made, and all of its previous states going back through history – and if it did this automatically, without the user needing to manage different versions of the data manually.

Windows and Mac OS X both now have filesystems that contain some features of this automatic record-keeping. Version control software does the same thing, but in a more thorough and systematic way. It implements a filesystem with a complete, automatic record of all the changes that are made over time, and provides users with easy ways to access the files, see the record of the changes, and add new changes.

I personally encountered version control software for the first time when I became a developer on the Corpus Workbench project back in 2009/2010. Most of the work on CWB is done by myself and Stefan Evert, and although we do have vaguely defined areas of individual responsibility for different bits of the project, there is also a lot of overlap. Without version control software, effective collaboration and tracking the changes we each make would be quite impossible. The whole of CWB including the core system, the supplementary tools, the CQPweb user interface, and the various manuals and tutorials, is all version-controlled. UCREL also uses version control software for the source code of tools such as CLAWS and USAS. And the more I’ve used version control tools for programming work, the more convinced I’ve become that the same tools will be highly useful for corpus development.

The version control system that I prefer is called Subversion, also known by the abbreviation SVN. This is quite an old-fashioned system, and many software developers now use newer systems such as Mercurial or Git (the latter is the brainchild of Linus Torvalds, the mastermind behind Linux). These newer and much more flexible systems are, however, quite a bit more complex and harder to use than Subversion. This is fine for computer programmers using the systems every day, but for corpus linguists who only work with version control every now and then, the simplicity of good old Subversion makes it – in my view – the better choice.

Subversion works like this. First, a repository is created. The repository is just a big database for storing the files you’re going to work with. When you access this database using Subversion tools, it looks like one big file system containing files, folders and subfolders. The person who creates and manages the repository (here at CASS that’s me) needs a fair bit of technical expertise, but the other users need only some very quick training. The repository needs to be placed somewhere where all members of the team can access it. The CASS Subversion repository lives on our application server, a virtual machine maintained by Lancaster University’s ISS; but you don’t actually need this kind of full-on setup, just an accessible place to put the database (and, needless to say, there needs to be a good backup policy for the database, wherever it is).

The repository manager then creates usernames that the rest of the team can use to work with the files in the repository. When you want to start working with one of the corpora in the repository, you begin by checking out a copy of the data. This creates a working copy of the repository’s contents on your local machine. It can be a copy of the whole repository, or just a section that you want to work on. Then, you make whatever additions, changes or deletions you want – no need to keep track of these manually! Once you’ve made a series of changes to your checked-out working copy, you commit it back into the repository. Whenever a user commits data, the repository creates a new, numbered version of its filesystem data. Each version is stored as a record of the changes made since the previous version. This means that (a) there is a complete record of the history of the filesystem, with every change to every file logged and noted; (b) there is also a record of who is responsible for every change. This complete record takes up less disk space than you might think, because only the changes are recorded. Subversion is clever enough not to create duplicate copies of the parts of its filesystem that have not changed.
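To make the idea of numbered versions stored as changes more concrete, here is a minimal Python sketch of a toy repository. It is an illustration only, not how Subversion is actually implemented: the class names `ToyRepository` and `Revision`, the usernames and the file names are all hypothetical, and changes are recorded per whole file, whereas a real system records finer-grained differences.

```python
from dataclasses import dataclass, field


@dataclass
class Revision:
    number: int
    author: str
    message: str
    # Only what changed: path -> new content, or None if the file was deleted.
    changes: dict


@dataclass
class ToyRepository:
    """Hypothetical toy model of a repository; not Subversion's real design."""
    revisions: list = field(default_factory=list)

    def commit(self, author, message, working_copy):
        """Record the difference between the latest revision and working_copy."""
        current = self.checkout()
        changes = {}
        for path, text in working_copy.items():
            if current.get(path) != text:
                changes[path] = text  # file was added or modified
        for path in current:
            if path not in working_copy:
                changes[path] = None  # file was deleted
        revision = Revision(len(self.revisions) + 1, author, message, changes)
        self.revisions.append(revision)
        return revision.number

    def checkout(self, number=None):
        """Rebuild the files as they were at a revision (default: the latest)."""
        if number is None:
            number = len(self.revisions)
        files = {}
        for revision in self.revisions[:number]:
            for path, text in revision.changes.items():
                if text is None:
                    files.pop(path, None)
                else:
                    files[path] = text
        return files


# Hypothetical users and corpus files, purely for illustration.
repo = ToyRepository()
repo.commit("alice", "Add two texts",
            {"a01.txt": "The cat sat.", "a02.txt": "The dog barked."})
repo.commit("bob", "Correct a01",
            {"a01.txt": "The cat sat down.", "a02.txt": "The dog barked."})
repo.commit("alice", "Remove a02", {"a01.txt": "The cat sat down."})

# The log shows who changed what; each revision lists only the changed files.
for revision in repo.revisions:
    print(revision.number, revision.author, revision.message,
          sorted(revision.changes))
```

In this sketch the second revision stores only the one file that changed, and checking out revision 2 still returns the file that was deleted in revision 3.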

Nothing is ever lost or deleted from this system. Even if a file is completely removed, it is only removed from the new version: all the old versions in the history still contain it. Moreover, it is always possible to check out a version other than the current one – allowing you to see the filesystem as it was at any point in time you choose. That means that all mistakes are reversible. Even if someone commits a version where they have accidentally wiped out nine-tenths of the corpus you are working on, it’s simplicity itself just to return to an earlier point in history and roll back the change.

The strength of this approach for collaboration is that more than one person can have a checked-out copy of a corpus at the same time, and everyone can make their own changes separately. To check whether someone else has committed changes while you’ve been working, you can update your working copy from the repository, getting the other person’s changes and merging them with yours. Even if you’ve made changes to the same file, they will be merged together automatically. Only if two of you have changed the same section of the same file is there a problem – and in this case the program will show you the two different versions, and allow you to pick one or the other or create a combination of the two manually.
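The merging logic described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not Subversion's own algorithm: the function `merge_lines` is hypothetical, and it assumes that both people only edited existing lines in place, without adding or removing lines. Real merge tools handle insertions and deletions as well.

```python
def merge_lines(base, mine, theirs):
    """Three-way merge of equal-length lists of lines (simplified sketch).

    base is the version both people started from. Returns a pair
    (merged, conflicts), where conflicts lists (line_index, my_version,
    their_version) for every line that both people changed differently.
    """
    if not (len(base) == len(mine) == len(theirs)):
        raise ValueError("this sketch only handles in-place line edits")
    merged, conflicts = [], []
    for index, (b, m, t) in enumerate(zip(base, mine, theirs)):
        if m == t:
            merged.append(m)      # identical (both unchanged or same edit)
        elif m == b:
            merged.append(t)      # only the other person changed this line
        elif t == b:
            merged.append(m)      # only I changed this line
        else:
            merged.append(None)   # both changed it: needs a manual decision
            conflicts.append((index, m, t))
    return merged, conflicts


base = ["The cat sat.", "The dog barked.", "The end."]
mine = ["The cat sat down.", "The dog barked.", "The end."]
theirs = ["The cat sat.", "The dog barked loudly.", "The end."]

# Different sections of the same file: merged automatically.
merged, conflicts = merge_lines(base, mine, theirs)
print(merged, conflicts)

# The same line changed in two different ways: reported as a conflict.
theirs_clash = ["The cat slept.", "The dog barked.", "The end."]
merged_clash, conflicts_clash = merge_lines(base, mine, theirs_clash)
print(merged_clash, conflicts_clash)
```

The first call combines both edits without any intervention; the second leaves the clashing line unresolved and reports both candidate versions, which corresponds to the point where the user picks one or writes a combination by hand.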

While Subversion can do lots more than this, for most users these three actions – check out, update, and commit – are all that’s needed. You also have a choice of programs that you can use for these actions. Most people with Unix machines use a command-line tool called svn which lets you issue commands to Subversion by typing them into a shell terminal.
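As a rough illustration of what those three actions look like with the svn command-line tool, the Python sketch below builds the corresponding command lines and prints them. The repository URL, the working copy folder and the commit message are hypothetical placeholders, and the helper functions `svn_command` and `run` are just for this example. By default nothing is executed; passing `dry_run=False` would actually call svn, which assumes the tool is installed and the repository exists.

```python
import shlex
import subprocess


def svn_command(action, *args):
    """Build the argument list for one call to the svn command-line tool."""
    return ["svn", action, *args]


def run(command, dry_run=True):
    """Print the command as it would be typed; execute only if dry_run is False."""
    print(shlex.join(command))
    if not dry_run:
        subprocess.run(command, check=True)


# Hypothetical repository location and local folder name.
REPO_URL = "https://svn.example.org/repos/corpora/mycorpus"
WORKING_COPY = "mycorpus"

# 1. Check out a working copy of the corpus.
checkout = svn_command("checkout", REPO_URL, WORKING_COPY)
# 2. Update the working copy with changes other people have committed.
update = svn_command("update", WORKING_COPY)
# 3. Commit your own changes, with a message describing them.
commit = svn_command("commit", WORKING_COPY, "-m", "Correct tagging errors")

for command in (checkout, update, commit):
    run(command)  # dry run: prints the commands without executing them
```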

On Windows, on the other hand, the preferred tool is something called TortoiseSVN. This can be downloaded and installed in the same way as most Windows programs. However, once installed, you don’t have to start up a separate application to use Subversion. Instead, the Subversion commands are added to the right-click context menu in Windows Explorer. So you can simply go to an empty folder, right-click with the mouse, and select the “check out” option to get your working copy. Once you’ve got a working copy, right-clicking on any file or folder within it allows you to access the “update” and “commit” options. TortoiseSVN provides an additional sub-menu which lets you access the full range of Subversion commands – but, again, normal users only need those three most common commands.

The possibility of using TortoiseSVN on Windows means that even the least tech-savvy member of your team can become a productive user of Subversion with only a very little training. And the benefits of building your corpus in a Subversion repository are considerable:

  • The corpus is easily accessible and sharable between collaborators
  • A complete record of all changes made, plus who-did-what
  • Any change can be reversed if necessary, with no need to manually manage “old versions”
  • Full protection against accidental deletions and erroneous changes
  • A secure and reliable backup method is only needed for the repository itself, not for each person’s working copy

That’s not to mention other benefits, such as the ease of switching between computers (just check out another working copy on the new machine and carry on where you left off).

Here at CASS we are making it our standard policy to put corpus creation work into Subversion, and we’re now in the process of gradually transitioning the team’s corpus-building efforts across into that platform. I’m convinced this is the way of the future for effectively managing corpus construction.