NATURAL LANGUAGE PROCESSING
PROBABILISTIC METHODS FOR PROCESSING HIGHLY INFLECTED
NATURAL LANGUAGES
We are applying our expertise in OA to the development of natural
language processing tools.
At this stage we concentrate on rare Semitic languages. The development of new
fundamental statistical and
probabilistic methodologies is an important aspect of our effort. The ultimate goal
is to create spell checkers, syntax and morphology analyzers,
electronic dictionaries and
machine translation systems based on stochastic methods. Our research might
also be useful for the automatic retrieval
of historical texts and for Internet search engines.
The example of a spell checker illustrates well the tremendous problems
created by the complex morphology of Semitic languages.
A spell checker should offer suggestions for incorrectly spelled words.
The suggestion list is made up of the words that are close to the misspelled
word. To measure closeness between words, one can use an
OA-score or a similar measure. For time efficiency, one needs to reduce words to
a skeletal form and apply a pattern-matching algorithm.
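The suggestion step described above can be sketched as follows. This is only a
minimal illustration under stated assumptions, not our actual tool: the
Levenshtein edit distance stands in for the OA-score, the "skeletal form" is
assumed to be the word with its vowels removed, and the vowel set, the function
names and the tiny lexicon are all hypothetical.

```python
# Assumed vowel inventory for a Latin transliteration (hypothetical).
VOWELS = set("aeiouä")


def skeleton(word):
    """Reduce a transliterated word to its consonant skeleton."""
    return "".join(ch for ch in word.lower() if ch not in VOWELS)


def edit_distance(a, b):
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def suggest(word, lexicon, max_skeleton_distance=1, limit=5):
    """Return up to `limit` lexicon words closest to `word`.

    The short skeletons act as a cheap filter; the surviving candidates
    are then ranked by edit distance on the full words.
    """
    target = skeleton(word)
    candidates = [w for w in lexicon
                  if edit_distance(skeleton(w), target) <= max_skeleton_distance]
    candidates.sort(key=lambda w: (edit_distance(w, word), w))
    return candidates[:limit]


# Toy lexicon built from the Amharic forms quoted in this text.
LEXICON = ["ifälgalähu", "ifälgäwalähu", "tfäligaläsh", "tfälgiwaläsh"]

if __name__ == "__main__":
    print(suggest("ifälgalahu", LEXICON))
```

A real system would precompute the skeletons once and index the lexicon by
them, instead of scanning every word for each query.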
One difficulty for Semitic languages is the complex morphology of
the verbs: the primary meaning of a verb is defined by its root, which consists
of three consonants. Complicated suffixes
and prefixes carry information about the gender, person and
number of the object and subject. For example, take the
root "flg" (want) in Amharic. Then "ifälgalähu" (I want) and
"ifälgäwalähu" (I want something) differ only because of the object.
Note that the suffix added for the object
changes with the subject: "tfäligaläsh" (you want)
but "tfälgiwaläsh" (you want something).
The vowels between the consonants of the root define various modes,
and the verb is also inflected for benefactive, malefactive, causative, transitive,
passive, dative and negative. For example, "I have" is expressed differently
in "I have money" and "I have a problem". The
difference arises because in the first sentence something
beneficial to the subject is meant, whereas in the second it is not.
Every verb has
many such different forms, so this is not an isolated
phenomenon but a systematic one.
To make matters worse, several suffixes and prefixes are often
added to the root at the same time. To summarize: a verb
can appear in thousands of different forms in a text. Since
many words are derived from a relatively small number of roots,
there are huge numbers of related and unrelated words
that have a high degree of similarity. This is the primary
reason why the spell-checking problem has not yet been solved
in a satisfactory manner for Arabic, for example.
TEXT CLASSIFICATION
Text classification is still a very important task: for example, when you
build a chatbot, you first need to determine the intention of
the message received. This can be viewed as a form of topic classification.
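To make the chatbot example concrete, here is a small sketch of intent
detection treated as classification. It uses a multinomial naive Bayes model
with add-one smoothing, chosen here only because it is a simple probabilistic
baseline; it is not necessarily the method used in our articles. The class
name, the intent labels and the training sentences are invented for
illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


class NaiveBayesIntentClassifier:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def __init__(self):
        self.class_counts = Counter()            # messages seen per intent
        self.word_counts = defaultdict(Counter)  # token counts per intent
        self.vocabulary = set()

    def fit(self, examples):
        for text, label in examples:
            self.class_counts[label] += 1
            for token in tokenize(text):
                self.word_counts[label][token] += 1
                self.vocabulary.add(token)
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        vocab_size = len(self.vocabulary)
        tokens = tokenize(text)
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(count / total)
            label_total = sum(self.word_counts[label].values())
            for token in tokens:
                smoothed = self.word_counts[label][token] + 1
                score += math.log(smoothed / (label_total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented training messages with hypothetical intent labels.
EXAMPLES = [
    ("what time do you open", "opening_hours"),
    ("are you open on sunday", "opening_hours"),
    ("i want to cancel my order", "cancel_order"),
    ("please cancel the order", "cancel_order"),
]

classifier = NaiveBayesIntentClassifier().fit(EXAMPLES)

if __name__ == "__main__":
    print(classifier.predict("can i cancel an order"))
```

With only four training messages the model is a toy, but the structure is the
same for a real intent classifier: count tokens per class, then pick the class
with the highest smoothed posterior score.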
Our published articles on the subject: