A new version of EEBO on CQPweb

The version of the EEBO-TCP data that has been available on Lancaster University’s CQPweb server is now rather old (the TCP project adds texts to the collection on a rolling basis), and, more importantly, does not contain any annotations. Recently I have devoted some time to running a newer version through UCREL’s standard annotation tools and then mounting the resulting dataset on CQPweb. The new version stands at 1.2 billion running tokens, each with eight different annotation fields.

Critically, the first layer of annotation is spelling regularisation, which improves the accuracy of the subsequent layers, including part-of-speech tagging and lemmatisation. Regularised spelling also makes searches much more comprehensive, because a single query retrieves the variant spellings that have been mapped to the same regularised form. Once I had finished the indexing process, one of the first searches that Paul Rayson ran (in preparation for a presentation at the EEBO-TCP conference in Oxford this week) was for the word experiment, comparing the results returned by a search on original spelling with those returned by a search on regularised spelling.

[Figures eebo1 to eebo4]

The version with regularised spelling (the second graph) returns about three times as many results as the version without (the first graph). The distribution is also rather different. As the following graph shows, the use of a lemma search retrieves even more relevant examples:

[Figures eebo5 and eebo6]
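The effect of searching on different annotation layers can be sketched with a toy example. This is only an illustration, not the way CQPweb or the UCREL tools work internally: the tokens are invented rather than drawn from EEBO-TCP, and the layer names `word`, `reg` and `lemma` are hypothetical labels, not the attribute names used in the indexed corpus.

```python
from typing import NamedTuple


class Token(NamedTuple):
    word: str   # original spelling as printed (hypothetical layer name)
    reg: str    # regularised spelling (hypothetical layer name)
    lemma: str  # headword of the regularised form (hypothetical layer name)


# Invented toy data for illustration only; not taken from EEBO-TCP.
TOKENS = [
    Token("experiment", "experiment", "experiment"),
    Token("experimente", "experiment", "experiment"),
    Token("experyment", "experiment", "experiment"),
    Token("experiments", "experiments", "experiment"),
    Token("experimentes", "experiments", "experiment"),
    Token("experimented", "experimented", "experiment"),
    Token("tryall", "trial", "trial"),
]


def count_hits(tokens, layer, query):
    """Count tokens whose value on the given layer equals the query (case-insensitive)."""
    return sum(1 for tok in tokens if getattr(tok, layer).lower() == query.lower())


# Run the same query against each layer in turn.
hits = {layer: count_hits(TOKENS, layer, "experiment") for layer in Token._fields}
for layer, n in hits.items():
    print(f"{layer:>5}: {n}")
```

In this toy data the same query matches more tokens on each successive layer: the original-spelling search finds only the exact form, the regularised search also finds variant spellings, and the lemma search adds inflected forms. The counts themselves mean nothing beyond the invented data.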

This illustrates the value of the standard annotations to the analysis of the EEBO-TCP data. The newly indexed corpus will be of use both for CASS purposes and for the CREME research group.