A natural language is a language used by humans to communicate with each other, like English, Polish, etc. It contrasts, for example, with a programming language (Java, Python, etc.), which is used by humans to give instructions to a computer.
Natural language processing, or NLP, refers to operations performed by a computer on natural language texts. These operations may serve to understand the text, to translate it into another language, or to generate a new text.
Natural language processing is being used in an ever wider range of computer systems, carrying out tasks such as machine translation, chatbots and question answering, virtual assistants, semantic search, information retrieval and extraction, text classification, sentiment analysis, language correction, and many others.
In this blog post I will briefly describe the above applications of NLP.
Automatic translation, or machine translation, has undergone a revolution in recent years, as a result of which the translations produced are of much higher quality than those of little more than a decade ago. If you are interested in the history of machine translation, I encourage you to read a separate post on that subject, available on our blog.
At the beginning of the 21st century, the most popular machine translation systems were based on dictionaries and on rules formulated by specialists. A popular product in Poland was the Translatica program, produced by POLENG (now PWN AI), which performed a rule-based syntactic analysis of the source text, on the basis of which it would generate a translated document.
However, in the first version of the system (there were seven versions in total), some “intriguing” cases occurred, such as the following:
Polish sentence: Misiu zaraz państwu łapkę poda.
Correct translation: The teddy bear will now give you its paw.
Machine translation: The Teddy bear of epidemics will pass a trap to the State.
While the whole of this attempted translation will no doubt be found amusing, perhaps the most surprising element is the appearance of the phrase “of epidemics”. This results from the word “zaraz” being treated not as an adverb meaning “right now”, as intended, but as an inflected grammatical form (the genitive plural) of the noun zaraza, meaning “epidemic”.
To convince yourself that later versions of the system were able to translate much more accurately, try it out at www.translatica.pl.
The development and subsequent correction of rule-based systems nonetheless requires a huge amount of work by experts. For this reason, rule-based methods were gradually replaced by other forms of machine translation.
The foundations of statistical translation were worked out in the 1990s by IBM researchers. Systems employing this methodology came into general use in 2006, when Google Translate was launched.
The idea of this translation method is based on the concept of a noisy channel.
In this model, the listener (Person B) tries to recreate the original text produced by the speaker (Person A) after it has been distorted during transmission. We may imagine such a situation, for example, when talking on the phone with someone who has a poor signal. When we hear something like:
What’s …p with …ou? Are you …ll?
we guess that the speaker meant to say:
What’s up with you? Are you well?
and not, for instance:
What’s op with Lou? Are you full?
WhatsApp with zoo? Are you Bill?
In this kind of situation we carry out a process consisting of two steps: first, we generate candidate utterances that the speaker might have intended; second, we choose the candidate that is most probable, given both what we actually heard and what a speaker is likely to say.
In statistical translation the translation process is treated as a noisy channel – we “hear” an utterance in one language, and try to reconstruct its “source” in another language. We do this by performing the same two steps: generating candidate sentences in the other language, and choosing the most probable one, as judged jointly by a translation model (how well the candidate corresponds to the input) and a language model (how natural the candidate sounds).
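The two steps can be sketched in code. In this toy noisy-channel decoder (all probabilities and strings are invented for illustration), a language model says how likely each candidate utterance is on its own, a crude channel model says how likely the distorted observation is given a candidate, and the decoder picks the candidate that maximises the product of the two:

```python
import math

# Toy noisy-channel decoder: choose the candidate that maximises
# P(candidate) * P(observed | candidate). All values are illustrative.

# Language model: how likely each candidate is on its own.
language_model = {
    "what's up with you":  0.6,
    "what's op with lou":  0.001,
    "whatsapp with zoo":   0.0005,
}

def channel_prob(observed, candidate):
    # Crude channel model: probability decays with the number of
    # character positions where the two strings differ.
    diffs = sum(1 for a, b in zip(observed, candidate) if a != b)
    diffs += abs(len(observed) - len(candidate))
    return math.exp(-diffs)

def decode(observed, candidates):
    # Bayes' rule: argmax over candidates of P(c) * P(observed | c).
    return max(candidates,
               key=lambda c: language_model[c] * channel_prob(observed, c))

# '?' marks the sounds lost in transmission.
print(decode("what's ?p with ?ou", list(language_model)))
# what's up with you
```

The language model alone would also pick the most common phrase here; the channel model matters when a less common candidate matches the observation much more closely.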
Neural network translation first came into use in 2014. It is based on the use of deep neural networks. In the first step (called encoding) the source text is changed to a numerical form (specifically, to a vector of real numbers, from 300 to 500 elements long). In the second step (decoding) this vector is converted into a text in the target language.
For a comparative analysis of the results obtained using statistical and neural network translation techniques, see the article by Krzysztof Jassem and Tomasz Dwojak titled Statistical versus neural machine translation – a case study for a medium size domain-specific bilingual corpus (2019).
Chatbots are systems that “talk” to users using natural language, trying to make them interested in a particular topic, usually related to the products of the firm supplying the system. Chatbots normally operate in a fairly simple manner – the system looks for key words in the user’s input, and then directs the user to Web pages associated with those words.
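The keyword-matching behaviour described above can be sketched in a few lines; the keywords and URLs here are invented for illustration:

```python
# Toy keyword-matching chatbot: scan the user's message for known
# key words and point the user to an associated page.

KEYWORD_PAGES = {
    "price":    "https://example.com/pricing",
    "delivery": "https://example.com/shipping",
    "return":   "https://example.com/returns",
}

def respond(message):
    words = message.lower().split()
    for keyword, url in KEYWORD_PAGES.items():
        if keyword in words:
            return f"You can find details about {keyword} here: {url}"
    # Fallback when no keyword is recognised.
    return "Could you tell me more about what you are looking for?"

print(respond("What is the price of shipping?"))
```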
There are also chatbots that are used mainly for entertainment purposes. For an example, try out the dialogue system available at https://www.elbot.com/.
Slightly more advanced conceptually are question answering systems. These aim to identify the information needs of a user asking a question in a natural language, and then look for an answer – either in their own database, or on external Web pages. An example of a solution of this type is START (start.csail.mit.edu), developed at MIT.
Virtual assistants are our constant friends, fulfilling our wishes here and now. Just say the words “OK Google” and your friend is ready and waiting to look for information on any subject you want. Google Assistant is now probably even more popular than the legendary Siri, which was launched on iPhones in 2011. Still lagging somewhat behind is Bixby, the assistant installed on Samsung phones – although it very much wants to help us in a wide range of situations, its affection is rarely reciprocated. In turn, Amazon’s Alexa is unwilling to travel around with us – she prefers to take the form of a small speaker and wait for us at home. There, however, she is irreplaceable, at least as far as household duties are concerned.
For the moment at least, these electronic genies are no replacement for our human buddies. They have a problem not only with empathy, but also with conducting a natural conversation. Their message seems to be “I will fulfil your every wish, but on condition that you state it clearly, audibly and concisely. And then I’m out of here.”
Work is currently under way to build systems that can conduct a dialogue with a user. Perhaps soon, when we phone a helpline, instead of hearing messages like “You have real problems – press 7”, we will get through to an automatic consultant that we can talk to as if it were another human.
One of the methods being tested is called reinforcement learning. In this technology, the machine perfects its operation using a system of rewards and punishments. In this way, a device may become an expert, for example, in the game of Breakout (https://youtu.be/V1eYniJ0Rnk) within literally just a few minutes of beginning training.
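The reward-and-punishment idea can be illustrated with a minimal tabular Q-learning sketch on a toy task (a five-cell corridor, not the Atari game above); all parameter values are illustrative:

```python
import random

# Tabular Q-learning on a toy task: an agent in a 5-cell corridor
# earns a reward only by reaching the rightmost cell. A small step
# penalty (the "punishment") teaches it to head right directly.

N_STATES, ACTIONS = 5, (-1, +1)        # move left / move right
q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}
alpha, gamma, epsilon = 0.5, 0.9, 0.1  # learning rate, discount, exploration

random.seed(0)
for episode in range(200):
    state = 0
    while state < N_STATES - 1:
        # Epsilon-greedy choice: mostly exploit, sometimes explore.
        if random.random() < epsilon:
            action = random.choice(ACTIONS)
        else:
            action = max(ACTIONS, key=lambda a: q[(state, a)])
        next_state = min(max(state + action, 0), N_STATES - 1)
        reward = 1.0 if next_state == N_STATES - 1 else -0.01
        best_next = max(q[(next_state, a)] for a in ACTIONS)
        # Q-learning update rule.
        q[(state, action)] += alpha * (reward + gamma * best_next
                                       - q[(state, action)])
        state = next_state

# The learned policy: the best action in each non-terminal state.
policy = [max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(N_STATES - 1)]
print(policy)  # after training: move right (+1) everywhere
```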
Mastering the art of conducting a dialogue has proved a more difficult challenge. An interesting description of this problem can be found in a master’s degree thesis by Weronika Sieińska, titled Use of Reinforcement Learning in Dialogue Modeling (2019).
The user of an Internet search engine expects to receive query results that meet his or her needs, even if those needs are often unclearly formulated. Semantic search is an information selection technique that tries to establish the intent behind a query and the contextual meaning of its words, rather than merely matching keywords. For example, in response to the instruction:

Find a restaurant nearby.

a semantic search engine will take into account the user’s current location and return restaurants in the immediate area, even though the query does not name any place.
From the flood of texts available online, it is not always possible, by means of a single query, to find exactly the information we are interested in – even using semantic search. Moreover, in many cases we would like to limit our search to particular sources, instead of combing through the entire Internet.
The process of obtaining documents containing information related to a formulated query is called information retrieval. For instance, in reply to a request for information on actors who have played James Bond, contained in a collection of film reviews, we will be supplied with a list of all those reviews which contain the information being sought.
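A toy sketch of this retrieval step, with invented reviews: every document containing all the words of the query is returned.

```python
# Toy information retrieval: return every document (here, film
# reviews) that contains all the words of the query.

reviews = {
    "review1": "Sean Connery is superb as James Bond in Goldfinger.",
    "review2": "A thriller with car chases but no spies at all.",
    "review3": "Daniel Craig reinvents James Bond in Casino Royale.",
}

def retrieve(query, documents):
    query_words = set(query.lower().split())
    # A document matches if the query words are a subset of its words.
    return [doc_id for doc_id, text in documents.items()
            if query_words <= set(text.lower().split())]

print(retrieve("james bond", reviews))  # ['review1', 'review3']
```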
To obtain information given in a specified format, we apply a process called information extraction. For example, in reply to a request for information on Bond actors, we may obtain a table containing the names of the actors and the titles of the films in which they appeared in the role.
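A toy sketch of this extraction step, using a single hand-written pattern; real systems are far more robust, and the sentences here are invented:

```python
import re

# Toy information extraction: pull (actor, film) pairs out of
# free text with a simple pattern.

texts = [
    "Sean Connery played James Bond in Goldfinger.",
    "Daniel Craig played James Bond in Casino Royale.",
    "The film has no spies in it at all.",
]

# Non-greedy groups capture the actor and the film title.
pattern = re.compile(r"(.+?) played James Bond in (.+?)\.")

table = []
for text in texts:
    match = pattern.search(text)
    if match:
        table.append((match.group(1), match.group(2)))

print(table)
# [('Sean Connery', 'Goldfinger'), ('Daniel Craig', 'Casino Royale')]
```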
Some interesting ideas for methods of information extraction can be found in a master’s degree thesis by Dawid Jurkiewicz titled Extracting information about church services start times (2018).
Text classification is an automatic process that serves to assign every analysed document to a certain class.
In supervised classification, a certain set of classes to which documents are to be assigned is determined in advance. A frequently cited example of this is spam detection – for each incoming e-mail, the system determines whether to classify it as spam, or whether to place it in the user’s inbox.
Human “supervision” of this process involves preparing the training data in a suitable way. For example, in the above example, for a certain set of e-mails (as many as possible) a human decides to which class each of them is to be assigned. Using the data prepared in this manner, the classification algorithm learns to imitate the human’s actions, so as to take a decision for each new e-mail analogously to the decisions taken by the human when evaluating the training data.
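The workflow above (humans label e-mails, then an algorithm learns to imitate those decisions) can be sketched with a tiny naive Bayes classifier, one common choice for spam filtering; the training data is invented and far smaller than any real set:

```python
import math
from collections import Counter

# Minimal supervised spam classifier: naive Bayes trained on
# hand-labelled e-mails (the "supervision" is in the labels).

training_data = [
    ("win a free prize now", "spam"),
    ("free offer click now", "spam"),
    ("meeting agenda for tomorrow", "ham"),
    ("lunch tomorrow with the team", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in training_data:
    word_counts[label].update(text.split())
    class_counts[label] += 1

vocab = {w for counter in word_counts.values() for w in counter}

def classify(text):
    best_label, best_score = None, -math.inf
    for label in word_counts:
        # log P(label) + sum of log P(word | label), Laplace-smoothed.
        score = math.log(class_counts[label] / sum(class_counts.values()))
        total = sum(word_counts[label].values())
        for word in text.split():
            score += math.log((word_counts[label][word] + 1)
                              / (total + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

print(classify("free prize now"))        # spam
print(classify("team meeting tomorrow")) # ham
```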
An example of an implementation of supervised classification is described in a master’s degree thesis by Dawid Klimek, titled Classification of Investment Funds by Means of Machine Learning Methods (2017).
The preparation of training data is nonetheless a time-consuming process. In practice, therefore, unsupervised learning is often used. This does not require a human to classify a training set; on the other hand, we cannot then expect the algorithm to divide the documents into classes according to some property specified in advance.
Most often, the user defines only the number of classes into which the set of documents is to be divided, and the system attempts to perform the classification such that the documents in each of the classes are similar to each other. The similarity of documents can be defined in a wide variety of ways – in the simplest case, documents are regarded as similar if they have a large number of words in common.
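This simplest notion of similarity (a large number of words in common) can be sketched as follows. Here the user asks for two classes, the first two documents serve as seeds, and the documents themselves are invented; real systems use algorithms such as k-means:

```python
# Toy unsupervised clustering: each remaining document joins the
# class whose seed document it shares more words with.

docs = [
    "the stock market rose sharply today",
    "the football match ended in a draw",
    "share prices on the market fell after the report",
    "the team won the match after extra time",
]

def shared_words(a, b):
    # Similarity = number of words the two documents have in common.
    return len(set(a.split()) & set(b.split()))

# The user asked for two classes; use the first two docs as seeds.
seeds = docs[:2]
clusters = {0: [docs[0]], 1: [docs[1]]}
for doc in docs[2:]:
    best = max((0, 1), key=lambda i: shared_words(doc, seeds[i]))
    clusters[best].append(doc)

print(clusters)  # finance docs in class 0, football docs in class 1
```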
Sentiment analysis is a type of supervised text classification that serves to categorise the opinions expressed in texts. In most cases a division is made into three classes: positive, negative and neutral.
In some applications, sentiment analysis is applied to specific aspects of the opinions. For instance, in the case of hotel reviews, the opinions contained in the analysed texts may be categorised with respect to such aspects as price, location, service quality, and quality of meals.
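A minimal sketch of aspect-based sentiment for hotel reviews, using a tiny hand-made lexicon rather than the supervised approach described above; the aspect keywords and sentiment words are invented:

```python
# Toy aspect-based sentiment: for each aspect keyword found in a
# sentence, count the positive and negative words in that sentence.

POSITIVE = {"great", "excellent", "friendly", "delicious"}
NEGATIVE = {"poor", "noisy", "expensive", "bland"}
ASPECTS = {
    "price":    {"price", "expensive", "cheap"},
    "location": {"location", "area"},
    "meals":    {"breakfast", "dinner", "food"},
}

def aspect_sentiment(review):
    result = {}
    for sentence in review.lower().split("."):
        words = set(sentence.split())
        score = len(words & POSITIVE) - len(words & NEGATIVE)
        for aspect, keywords in ASPECTS.items():
            if words & keywords:
                result[aspect] = ("positive" if score > 0
                                  else "negative" if score < 0
                                  else "neutral")
    return result

print(aspect_sentiment("The location is excellent. The breakfast was bland."))
# {'location': 'positive', 'meals': 'negative'}
```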
For example, in a master’s degree thesis by Kinga Kramer titled Sentiment Analysis of Students’ Comments on Lectures (2018), an experiment is described involving sentiment analysis of comments made by students concerning lectures given by the author of this blog.
Language correction is a task in which a computer system can play an advisory role, indicating potential errors in a text, and leaving the user to take the final decision on whether and how to correct them.
In this process, the system’s task is to indicate spelling mistakes, which may arise either through mistyping (pressing the wrong key) or through the writer’s unfamiliarity with the correct spelling of a word. A spelling mistake may lead to the appearance of a word that does not exist in the language (for example, cheese → chease); mistakes of this type are easier to detect. It may, however, result in a word that does exist in the language (for example, from → form), in which case the detection of the error is a much harder task.
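The easier case, words that do not exist in the language, can be sketched as follows: check each word against a dictionary and, for unknown words, suggest dictionary words one edit (insertion, deletion, replacement, or swap) away. The dictionary here is tiny and invented:

```python
import string

# Toy spelling checker for non-word errors.

DICTIONARY = {"cheese", "choose", "chase", "please", "the", "i", "like"}

def edits1(word):
    # All strings one edit away from `word`.
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes  = [a + b[1:] for a, b in splits if b]
    swaps    = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts  = [a + c + b for a, b in splits for c in letters]
    return set(deletes + swaps + replaces + inserts)

def check(word):
    if word in DICTIONARY:
        return []                              # no error detected
    return sorted(edits1(word) & DICTIONARY)   # candidate corrections

print(check("chease"))  # ['chase', 'cheese']
print(check("cheese"))  # []
```

A noisy-channel corrector would then rank these candidates by how probable each is and how likely the observed typo would be for it, instead of listing them alphabetically.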
One popular method for detecting spelling mistakes is the above-mentioned noisy channel method. It is assumed that the error is caused by a noisy channel, and the system’s task is to discover the most probable intention of the author. Such a method is described in a master’s degree thesis by Tomasz Posiadała, titled Probabilistic methods for spell-checking (2017).
One can list many types of errors committed by people writing in a foreign language. For example, in the NUCLE corpus (https://www.comp.nus.edu.sg/~nlp/corpora.html), containing English-language texts written by students from Singapore, 28 error types are identified, the most common being: incorrect use of article, use of incorrect collocation, punctuation error, noun countability error, incorrect use of tense, and use of incorrect preposition.

Today’s systems are well able to cope with these kinds of errors. One of the most effective solutions is the use of a neural network translation model to “translate” a text from incorrect to correct language. For more information on modern methods of grammatical correction, see the doctoral dissertation by Roman Grundkiewicz titled Algorithms for automatic verification of grammatical correctness (2018).
NLP has many more applications than those described above. I will now briefly refer to just a few of them.
The aim of this task is to make an automatic summary of a document, usually by selecting the sentences or pieces of text that carry the most information. An example of a system performing such a task is presented in a master’s degree thesis by Łukasz Pawluczuk, Automatic Summarization of Polish News Articles by Sentence Selection (2015).
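The sentence-selection idea can be sketched as follows: score each sentence by the average document-wide frequency of its words and keep the top scorers. The text and the scoring scheme are illustrative only:

```python
from collections import Counter

# Toy extractive summariser by sentence selection.

def summarise(text, n_sentences=1):
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    # Word frequencies over the whole document (periods stripped).
    word_freq = Counter(text.lower().replace(".", " ").split())

    def score(sentence):
        words = sentence.lower().split()
        # Average frequency of the sentence's words in the document.
        return sum(word_freq[w] for w in words) / len(words)

    ranked = sorted(sentences, key=score, reverse=True)
    return ". ".join(ranked[:n_sentences]) + "."

text = ("The parliament passed the new budget. "
        "The budget increases spending on schools. "
        "Weather was mild on the day of the vote.")
print(summarise(text))
```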
Based on information contained in texts, a machine learning system can attempt to predict trends in certain economic parameters. A master’s degree thesis by Marcin Kania, titled A System for Supporting Decisions on Investing at Polish Stock Exchange Based on Stock Exchange News (2014), describes a system for predicting the prices of shares listed on the Warsaw Stock Exchange based on press releases by the companies concerned.
An automatic prompter listens in on a human conversation, and on hearing a phrase indicating doubt or ignorance as to the meaning of a term (e.g. “I wonder…” or “What is…?”), it springs into action, offering full information about the topic that was unclear. Ugh…
When writing a job application, do you sometimes lack the words or an idea for effective self-promotion? You can seek help from an automatic “ghost writer” – your personal editorial assistant. Solutions of this kind are offered, for example, by Textio (https://textio.com/).
And now try to guess: Who or what was the author of this blog post?