POS Tagging for Georgian is now available in #LancsBox
We are delighted to announce that part-of-speech tagging for Georgian is now available in #LancsBox. This is the very first Georgian POS tagger made available for a wide range of users and uses. It enables users to perform various linguistic analyses on their own texts or corpora in #LancsBox.
The POS tagger for Georgian was developed within my PhD project (Computational analysis of morphosyntactic categories in Georgian) at the University of Leeds. The tagset design part of my research was conducted in the Centre for Corpus Approaches to Social Science at Lancaster University and was supervised by Dr Andrew Hardie. Thanks to Dr Vaclav Brezina, the lead developer of #LancsBox, the tagger is now available in #LancsBox (Brezina et al., 2015, 2018, 2020).
I use a probabilistic methodology (TreeTagger) and an enclitic tokenisation approach to tag Georgian text. The accuracy of part-of-speech tagging is 92%. The tagger uses a new morphosyntactic language model (developed for POS tagging purposes) and the KATAG tagset (219 tags), which is based on this model. KATAG is a hierarchical-decomposable tagset, which allows the user to search for different sections of the paradigm.
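To illustrate what a hierarchical-decomposable tagset makes possible, here is a minimal Python sketch. The tag strings, the dot-separated layout and the placeholder words are invented for illustration and are not the actual KATAG tags. The sketch only assumes that each tag is built from components running from general to specific, so that a wildcard pattern can select a whole section of the paradigm.

```python
from fnmatch import fnmatchcase

# Hypothetical tagged tokens. These tag strings are NOT real KATAG tags;
# they only mimic a tagset whose tags decompose into ordered components
# (word class, then case or tense, then person/number).
TAGGED = [
    ("w1", "N.ERG.SG"),
    ("w2", "N.GEN.PL"),
    ("w3", "V.AOR.1.SG"),
    ("w4", "ADJ.GEN.SG"),
]


def search(tagged, pattern):
    """Return the words whose tag matches a wildcard pattern.

    '*' stands for any run of characters, so 'N.*' selects every noun
    and '*.GEN.*' selects every genitive form regardless of word class.
    """
    return [word for word, tag in tagged if fnmatchcase(tag, pattern)]


all_nouns = search(TAGGED, "N.*")        # one whole word class
genitives = search(TAGGED, "*.GEN.*")    # one case across word classes
aorist_verbs = search(TAGGED, "V.AOR.*") # one tense within verbs
```

With a flat tagset, each of these queries would need an explicit list of every matching tag; with decomposable tags, one pattern covers the whole subsection.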
#LancsBox is a very powerful corpus analysis tool. It can be used at different levels of analysis of language data and corpora. It automatically annotates data for part-of-speech and can be used to find frequencies of different word classes such as nouns, verbs etc., compute frequency and dispersion measures for POS tags, and find and visualise the co-occurrence of grammatical categories. It can also find complex linguistic structures using ‘smart searches’. For example, there are 60 ‘smart searches’ available for Georgian in #LancsBox, such as:
ADJECTIVES GENITIVE CASE adjectives in genitive case
ADVERBS any adverbs
NOUNS ERGATIVE CASE nouns in ergative case
PRONOUNS DEMONSTRATIVE demonstrative pronouns
PRONOUNS INTERROGATIVE interrogative pronouns
PRONOUNS PERSONAL personal pronouns
VERBS AORIST TENSE verbs in aorist tense
VERBS I PERSON verbs with 1st person subject
VERBS II PERSON PLURAL verbs with 2nd person plural subject
VERBS IMPERFECT TENSE verbs in imperfect tense
To demonstrate how to use ‘smart searches’ in #LancsBox, I use a small Covid19 corpus (229,481 tokens). I am interested in finding out which verbs immediately follow the word coronavirus. Thus, my search term is: კორონავირუსი VERBS
This image displays alphabetically arranged concordance lines in #LancsBox, showing the immediate contexts in which the search term is used. This allows me to analyse the frequency and dispersion of the node კორონავირუსი (coronavirus) immediately followed by a verb. Here it occurs 37 times (1.612 per 10k) in the Covid19 corpus, in 10 out of 11 texts.
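For readers who want to check the normalised figure, the short Python sketch below reproduces the arithmetic from the counts reported above. The function names are my own, and the range is expressed as a simple percentage of texts, which may differ from the dispersion measures #LancsBox itself reports.

```python
def per_10k(hits, corpus_tokens):
    """Relative frequency: occurrences per 10,000 tokens."""
    return hits / corpus_tokens * 10_000


def range_percent(texts_with_hit, total_texts):
    """Share of texts containing at least one hit, as a percentage."""
    return texts_with_hit / total_texts * 100


# Counts taken from the search described above.
rel_freq = per_10k(37, 229_481)
text_range = range_percent(10, 11)

print(f"{rel_freq:.3f} per 10k")     # 1.612 per 10k
print(f"{text_range:.1f}% of texts") # 90.9% of texts
```

Normalising to a common base such as 10,000 tokens is what makes the frequency comparable with results from corpora of a different size.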