POS Tagging for Georgian is now available in #LancsBox

We are delighted to announce that part-of-speech tagging for Georgian is now available in #LancsBox. This is the very first Georgian POS tagger made available to a wide range of users and for a wide range of uses. It enables users to perform various linguistic analyses on their own texts or corpora in #LancsBox.

The POS tagger for Georgian was developed within my PhD project (Computational analysis of morphosyntactic categories in Georgian) at the University of Leeds. The tagset design part of my research was conducted in the Centre for Corpus Approaches to Social Science at Lancaster University and was supervised by Dr Andrew Hardie. Thanks to Dr Vaclav Brezina, the lead developer of #LancsBox, this tagger is now available in #LancsBox (Brezina et al., 2015, 2018, 2020).

I use a probabilistic methodology (TreeTagger) and an enclitic tokenisation approach to tag Georgian text. The accuracy of part-of-speech tagging is 92%. The tagger uses a new morphosyntactic language model (developed for POS tagging purposes) and the KATAG tagset (219 tags), which is based on this model. KATAG is a hierarchical-decomposable tagset, which allows the user to search for different sections of the paradigm.
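To illustrate what a hierarchical-decomposable tagset makes possible, here is a minimal Python sketch. The tag strings, the dot separator, and the `matches` and `search` helpers are all invented for illustration; they are not actual KATAG tags and this is not how #LancsBox or TreeTagger is implemented. The idea is only that a tag built from ordered components can be searched by any leading part of the hierarchy.

```python
# Hypothetical tags for illustration only: these are NOT actual KATAG tags.
# Each tag is assumed to decompose into ordered components:
# word class, then case or tense, then further features.
HYPOTHETICAL_TAGS = ["N.ERG.SG", "N.ERG.PL", "N.GEN.SG", "ADJ.GEN.SG", "V.AOR.1.SG"]


def matches(tag: str, pattern: str) -> bool:
    """Return True if the pattern's components match the tag's leading components.

    A "*" component in the pattern matches any value at that position.
    """
    tag_parts = tag.split(".")
    pattern_parts = pattern.split(".")
    if len(pattern_parts) > len(tag_parts):
        return False
    return all(p == "*" or p == t for p, t in zip(pattern_parts, tag_parts))


def search(tags: list[str], pattern: str) -> list[str]:
    """Return every tag that falls under the given section of the hierarchy."""
    return [t for t in tags if matches(t, pattern)]


nouns = search(HYPOTHETICAL_TAGS, "N")              # the whole noun paradigm
ergative_nouns = search(HYPOTHETICAL_TAGS, "N.ERG")  # one section of it
any_genitive = search(HYPOTHETICAL_TAGS, "*.GEN")    # genitive across word classes

print(nouns)
print(ergative_nouns)
print(any_genitive)
```

A short pattern selects a broad section of the paradigm and a longer pattern narrows it, which is the kind of search a decomposable tagset is designed to support.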

#LancsBox is a very powerful corpus analysis tool. It can be used at different levels of analysis of language data and corpora. It automatically annotates data for part of speech and can be used to find frequencies of different word classes (nouns, verbs, etc.), compute frequency and dispersion measures for POS tags, and find and visualise the co-occurrence of grammatical categories. It can also find complex linguistic structures using ‘smart searches’. For example, 60 ‘smart searches’ are available for Georgian in #LancsBox, including:

ADJECTIVES GENITIVE CASE     adjectives in genitive case
ADVERBS                      any adverbs
NOUNS ERGATIVE CASE          nouns in ergative case
PRONOUNS DEMONSTRATIVE       demonstrative pronouns
PRONOUNS INTERROGATIVE       interrogative pronouns
PRONOUNS PERSONAL            personal pronouns
VERBS AORIST TENSE           verbs in aorist tense
VERBS I PERSON               verbs with 1st person subject
VERBS II PERSON PLURAL       verbs with 2nd person plural subject
VERBS IMPERFECT TENSE        verbs in imperfect tense

To demonstrate how to use ‘smart searches’ in #LancsBox, I use a small Covid19 corpus (229,481 tokens). I want to find out which verbs immediately follow the word კორონავირუსი (coronavirus). Thus, my search term is: კორონავირუსი VERBS

This image displays alphabetically arranged concordance lines in #LancsBox, showing the immediate contexts in which the search term is used. This allows me to analyse the frequency and dispersion of the node კორონავირუსი (coronavirus) immediately followed by a verb. Here it occurs 37 times (1.612 per 10k) in the Covid19 corpus, in 10 out of 11 texts.
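The figures reported above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the function names are hypothetical, and the share of texts containing the pattern (the range) is just one simple way to summarise dispersion; #LancsBox may report other dispersion measures.

```python
def relative_frequency(hits: int, corpus_tokens: int, basis: int = 10_000) -> float:
    """Normalise a raw hit count to a rate per `basis` tokens."""
    return hits / corpus_tokens * basis


def range_percent(texts_with_hits: int, total_texts: int) -> float:
    """Percentage of texts in which the search pattern occurs at least once."""
    return texts_with_hits / total_texts * 100


# Values taken from the concordance result described above.
hits = 37
corpus_tokens = 229_481

rel_freq = relative_frequency(hits, corpus_tokens)
text_range = range_percent(10, 11)

print(f"{rel_freq:.3f} per 10k tokens")  # 1.612 per 10k tokens
print(f"{text_range:.1f}% of texts")     # 90.9% of texts
```

Normalising to a rate per 10,000 tokens makes the result comparable with counts from corpora of different sizes.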