
How do computers analyse words?

Krzysztof Jassem November 30, 2020

What is morphological analysis and what is it used for?

Morphological analysis is used to determine how a word is built up. The result of such analysis might be the statement that, for example, the word classes is made up of the stem class and the ending -es, which is used in English to make the plural form of certain nouns and the third person singular form of certain verbs. From this information we may deduce that classes is likely the plural of a noun (or the third person singular of a verb) whose base form, or lemma¹, is class.

This type of inference is very useful in such tasks as context search. By decoding the probable context of a submitted query (establishing, for example, who the user is, where they are and what kind of information they are looking for), the search system can identify the document that most closely corresponds to their needs – even if that document does not contain the exact word forms that were used in the query.

For example, in response to the query “cheap laptops”, the Google search engine might display a snippet² with the text: “We offer the most popular laptop models … cheaper than anywhere else …”. Note that this snippet does not strictly contain either of the words used in the query. Rather than the word cheap it contains the comparative form cheaper, and rather than laptops it contains the base form laptop.
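The matching described above can be sketched in a few lines of Python. This is only an illustration, not how any real search engine works: the LEMMAS table and the functions lemma and matches are hypothetical names invented for this example, and the table contains just the handful of forms needed here.

```python
# Toy lemma table: maps an inflected form to its base form.
# Hypothetical data for illustration; a real system would use a full dictionary.
LEMMAS = {
    "laptops": "laptop",
    "cheaper": "cheap",
    "cheapest": "cheap",
    "models": "model",
}

def lemma(word: str) -> str:
    """Return the base form if it is known, otherwise the lowercased word."""
    word = word.lower()
    return LEMMAS.get(word, word)

def matches(query: str, text: str) -> bool:
    """True if every lemma of the query also occurs among the lemmas of the text."""
    text_lemmas = {lemma(w.strip(".,…")) for w in text.split()}
    return all(lemma(w) in text_lemmas for w in query.split())

snippet = "We offer the most popular laptop models … cheaper than anywhere else …"

# Exact word matching fails, because neither query word occurs literally.
print("cheap" in snippet.split(), "laptops" in snippet.split())
# Comparing lemmas instead of surface forms finds the match.
print(matches("cheap laptops", snippet))
```

Comparing base forms rather than surface forms is what lets the query and the snippet meet, even though they share no identical word.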

To be able to handle inflected forms correctly (as in the above example), a context search system needs to be supported by a dictionary that contains all word forms in a given language. In some applications, such a vast resource would just be redundant weight. For the purposes of classifying texts by subject matter, for example, it is enough to use a simplified version of morphological analysis called stemming. This involves cutting off the ending of a word (and sometimes also the beginning) to identify the stem. For example, the result of stemming for the words beauty, beautiful and beautician might be the same in each case: beaut. Stemming is a much faster process than dictionary-based analysis, and is quite sufficient for purposes of thematic classification.

Analysing words with a dictionary

Lemmatisation

Lemmatisation is a process whereby, for each word appearing in a document, the system identifies all of the lemmas (base forms) from which that word might potentially derive. For example, for the word bases, the lemmatiser (a program performing lemmatisation) will return the information that the word may come either from the noun or the verb base, or from the noun basis. Similarly, for the word number the program will say that it is either the base form of that noun (or verb), or else the comparative form of the adjective numb.
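As a minimal sketch, a dictionary-based lemmatiser can be pictured as a lookup table from word forms to all of their possible readings. The FORMS table and the lemmatise function below are hypothetical and cover only the two examples from the paragraph above; a real lemmatiser would load a full dictionary of the language.

```python
# Toy form dictionary: each word form maps to every reading it may have,
# given as (lemma, part of speech, grammatical description).
# Only the two example words from the text are included.
FORMS = {
    "bases": [
        ("base", "noun", "plural"),
        ("base", "verb", "third person singular"),
        ("basis", "noun", "plural"),
    ],
    "number": [
        ("number", "noun", "base form"),
        ("number", "verb", "base form"),
        ("numb", "adjective", "comparative"),
    ],
}

def lemmatise(word):
    """Return all candidate readings of a word form (empty list if unknown)."""
    return FORMS.get(word.lower(), [])

for form in ("bases", "number"):
    for lem, pos, description in lemmatise(form):
        print(f"{form}: {lem} ({pos}, {description})")
```

Note that the lemmatiser does not choose between the readings. It simply lists every lemma the form might derive from.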

POS-tagging

A slightly different type of analytical process is POS-tagging (Part of Speech Tagging). Here, based on context analysis, the POS-tagger (the program performing the task of tagging) is expected to determine unambiguously what part of speech is represented by a particular word used in a sentence. For example, in the sentence She bases her theory on a number of findings, the POS-tagger should identify the word bases as a verb, and the word number as a noun.
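The idea of choosing a part of speech from context can be illustrated with a deliberately simple sketch. Modern POS-taggers are typically statistical or neural models trained on annotated text; the version below is not one of them. It uses a hand-written table of candidate tags (CANDIDATES) and a few context preferences (PREFER_AFTER), all of which are assumptions made up for this example.

```python
# Possible parts of speech for each word form (toy data, hypothetical).
CANDIDATES = {
    "she": ["pronoun"],
    "her": ["determiner"],
    "a": ["determiner"],
    "on": ["preposition"],
    "of": ["preposition"],
    "bases": ["noun", "verb"],
    "number": ["noun", "verb", "adjective"],
    "theory": ["noun"],
    "findings": ["noun"],
}

# Hand-written context preferences: tag of the previous word -> preferred tag.
PREFER_AFTER = {"pronoun": "verb", "determiner": "noun", "preposition": "noun"}

def tag(sentence):
    """Assign one tag per word, using the previous tag to resolve ambiguity."""
    tags = []
    previous = None
    for word in sentence.lower().split():
        options = CANDIDATES.get(word, ["unknown"])
        preferred = PREFER_AFTER.get(previous)
        # Take the contextually preferred tag if the word allows it,
        # otherwise fall back to the first listed candidate.
        choice = preferred if preferred in options else options[0]
        tags.append((word, choice))
        previous = choice
    return tags

for word, pos in tag("She bases her theory on a number of findings"):
    print(word, pos)
```

Under these toy rules, bases follows a pronoun and is tagged as a verb, while number follows a determiner and is tagged as a noun.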

Lemmatisers, assuming they are equipped with suitably large dictionaries, do not make errors. Modern POS-taggers, on the other hand, operate with a success rate of around 95–98%, which means that occasionally they will return an incorrect result.

Today’s NLP solutions offer a combination of the two above-mentioned tools. Thus, when analysing a document, they can determine for each word both what part of speech it represents, and what its base form is.

Analysing words without a dictionary

If a computer system does not have a dictionary to check the existence of words, the way in which it analyses the structure of words (to perform stemming, for example) is fairly primitive. Most methods so far developed for this scenario involve cutting off successive letters from the end of a word for as long as suitably defined conditions are satisfied.
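A minimal sketch of this kind of dictionary-free analysis is shown below. It is not the Porter algorithm or any other published stemmer: the suffix list, the minimum stem length and the function name naive_stem are all assumptions chosen for illustration.

```python
# Suffixes to try, longest first (an arbitrary list chosen for this example).
SUFFIXES = sorted(("ician", "iful", "ful", "ness", "y"), key=len, reverse=True)

# Assumed condition: the stem left after stripping must keep at least 3 letters.
MIN_STEM_LENGTH = 3

def naive_stem(word):
    """Strip known suffixes repeatedly while the remaining stem is long enough."""
    stem = word.lower()
    stripped = True
    while stripped:
        stripped = False
        for suffix in SUFFIXES:
            candidate = stem[: -len(suffix)]
            if stem.endswith(suffix) and len(candidate) >= MIN_STEM_LENGTH:
                stem = candidate
                stripped = True
                break
    return stem

for word in ("beauty", "beautiful", "beautician", "hopefulness", "aness"):
    print(word, "->", naive_stem(word))
```

With these settings, beauty, beautiful and beautician all reduce to beaut, hopefulness loses two suffixes in turn, and aness is left alone because the remaining stem would be too short.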

The Porter algorithm

One of the most popular methods of stemming is the Porter algorithm, as illustrated in the above screenshot. As we can see, for the word isolation the algorithm identifies the stem isol by cutting off the ending -ation. However, for ration, the result of stemming is the whole of that word. This is because the potential stem remaining after removing the ending -ation in this case, namely the single letter r, is too short – according to the assumptions of the algorithm – for such reduction to be performed. The situation is similar with the words conditional and zonal. The ending -ness is removed from the word madness, and in hopefulness the algorithm removes two endings: the noun suffix -ness and the adjectival suffix -ful. However, in aness (a word invented by the user) the algorithm does not remove the ending, for exactly the same reason as in the case of ration.

Rules of the Porter algorithm

The algorithm is based on manually constructed rules. Each rule has two parts: the first part says what conditions must be fulfilled by a potential stem in order for the removal operation defined in the second part to be applied.

An example rule used by the Porter algorithm is the following:

(m>0) ational -> ate

The first part of this rule (m>0) imposes a condition on the size of the stem left after removal of the ending. This size, denoted m, counts the vowel-consonant sequences in the stem (which is roughly equivalent to the number of syllables) and must be greater than zero. The second part of the rule (ational -> ate) indicates how the ending is to be replaced.

According to this rule, the word relational will be replaced by relate, but the word rational will be left unchanged, since the stem r does not satisfy the defined condition on its size.
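The rule can be tried out with a short sketch. It implements only this single rule, not the full Porter algorithm, which has many rules applied in several steps. The function names is_consonant, measure and apply_rule are chosen for this example; the measure follows Porter's definition, in which a stem has the form [C](VC)^m[V] and the letter y counts as a vowel when it follows a consonant.

```python
def is_consonant(word, i):
    """True if the letter at position i counts as a consonant."""
    ch = word[i]
    if ch in "aeiou":
        return False
    if ch == "y":
        # y is a consonant at the start of a word or after a vowel,
        # and a vowel when it follows a consonant.
        return i == 0 or not is_consonant(word, i - 1)
    return True

def measure(stem):
    """Return m, the number of vowel-to-consonant transitions in the stem."""
    m = 0
    previous_is_vowel = False
    for i in range(len(stem)):
        vowel = not is_consonant(stem, i)
        if previous_is_vowel and not vowel:
            m += 1
        previous_is_vowel = vowel
    return m

def apply_rule(word, suffix="ational", replacement="ate"):
    """Apply the rule (m>0) suffix -> replacement; return the word unchanged otherwise."""
    if word.endswith(suffix):
        stem = word[: -len(suffix)]
        if measure(stem) > 0:
            return stem + replacement
    return word

print(apply_rule("relational"))  # stem "rel" has m = 1, so the rule fires
print(apply_rule("rational"))    # stem "r" has m = 0, so the word is unchanged
```

The stem rel contains one vowel-consonant sequence (el), so the condition holds, whereas the stem r contains none.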

Can a computer be creative?

One of the obstacles to effective automated morphological analysis using a dictionary is human lexical creativity. In any language, new words are being invented practically every day. For Polish, for example, a continuously updated ranking of neologisms can be found at https://polszczyzna.pl/neologizmy-ranking/.

At the time this post was written, the list was headed by the following words:

  1. plażing (“beaching”)
  2. urlop tacierzyński (“paternity leave”)
  3. alternatywka (“girl with alternative tastes”)
  4. jesieniara (“autumn girl”)
  5. smakówka (teen slang for “bon appétit”)

Moreover, the possibilities for creating new words by adding endings or prefixes – whether to native words or those borrowed from other languages (particularly from English, in the case of a language like Polish) – are practically unlimited. Consider such creations as megaweekend, macromaker or instagramming.

Machine algorithms should not be left behind in this regard. For a computer system to have the ability to perform dictionary-based analysis of newly formed words, those words ought to appear in its dictionary.

For this purpose, a system may itself generate potential neologisms, for example by combining common words with popular prefixes. This means that the system’s dictionary may contain words that are not to be found in traditional lexicographical sources – words like supermachine and eco-promotion.

A computer’s lexical creativity must nonetheless be kept under control. Above all, the process of combining lexical units will inevitably lead to the generation of words that are not used in the language. This problem can be effectively eliminated by subjecting each automatically generated word to appropriate verification, and adding to the system’s dictionary only those words that occur with suitably high frequency in a selected corpus of texts (for instance, in the content indexed by the Google search engine).
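The generate-then-verify approach might look roughly like the sketch below. Everything in it is an assumption made for illustration: the prefix and word lists, the frequency threshold, and above all the corpus counts, which are invented numbers rather than real measurements.

```python
from itertools import product

PREFIXES = ["super", "eco-", "mega"]
WORDS = ["machine", "promotion", "weekend"]

# Hypothetical corpus counts, invented for this illustration.
# A real system would look these up in a large corpus of texts.
CORPUS_FREQUENCY = {
    "supermachine": 120,
    "eco-promotion": 45,
    "megaweekend": 300,
    "eco-machine": 2,
}

MIN_FREQUENCY = 10  # assumed threshold for "suitably high frequency"

def generate_candidates(prefixes, words):
    """Combine every prefix with every word to form potential neologisms."""
    return [prefix + word for prefix, word in product(prefixes, words)]

def accept(candidates, frequency, min_frequency):
    """Keep only the candidates that occur often enough in the corpus."""
    return [w for w in candidates if frequency.get(w, 0) >= min_frequency]

candidates = generate_candidates(PREFIXES, WORDS)
accepted = accept(candidates, CORPUS_FREQUENCY, MIN_FREQUENCY)

print(len(candidates), "candidates generated")
print("added to the dictionary:", accepted)
```

Of the nine generated combinations, only those attested often enough in the (here fictional) corpus would be added to the dictionary. Forms such as superpromotion, which never occur in it, are discarded.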

Worse, though, is that words automatically generated by the system may turn out to be false derivatives – words with an entirely different meaning than would be suggested by their component elements. For example:

re+treat ≠ retreat
anti+gone ≠ Antigone
co+lander ≠ colander
ex+tractor ≠ extractor
e+vocation ≠ evocation

So to answer the question that forms the title of this section: Yes, a computer can be creative. Sometimes too much so!

Summary

Morphological analysis – the analysis of the structure of words – is an essential element of many systems for natural language processing. In some applications it is necessary to use a comprehensive dictionary of the language in question. In today’s solutions, the trend is to combine lemmatisation and tagging in a single process.

Sometimes, however, it is sufficient to apply methods where no dictionary is used – limited to determining the stems from which individual words are derived.

Human creativity means that new words are constantly being created. A computer can also be taught to be creative in word formation, although the results may be quite unexpected, and even comical.


¹ The lemma, or base form, of a word is the form that appears in dictionaries – for example, the singular form of a noun or the bare infinitive of a verb.
² A snippet is an extract from the text of a Web page which is returned by a search engine in response to a query.
