06.10.2021 | Blog NLP: A Key Technology for Search Engines and Text Analytics
The term NLP (natural language processing) describes computer-based processing of natural language using artificial intelligence techniques. In the consumer sector, the growing importance of NLP is visible in the spread of digital assistants such as Siri, Cortana or the Google Assistant. In corporate applications, such technologies are finding their way into enterprise search applications and intelligent text analysis of documents in particular.
Simple examples show that processing and understanding natural language, with its complexity and ambiguity, is an immense challenge. The different roles of "spoon" and "noodles" in the sentences "I eat the soup with the noodles" and "I eat the soup with the spoon" cannot be captured correctly by a simple syntactic analysis. The importance of semantics and world knowledge is also shown by the following news headlines: "Scientists watch whales from space", "The pope's baby steps on gays".
NLP uses lexicons, rules, and machine learning
To meet this challenge, NLP uses various AI methods, especially lexicons and rule-based methods (classical computational linguistics) and machine learning (ML), for example in the form of deep learning.
Classical computational linguistics
The analysis and processing of text using classical computational linguistic methods consists of the following steps:
1. Recognition of words and their normalization
First, the text is decomposed into a sequence of words. It is also important to normalize these words and to determine their word type (noun, verb, etc.) for further processing. Both tasks are usually realized using lexicons. Examples are lemmatization, which makes it possible to find "children" when searching for "child", and compound decomposition (very important for languages such as Dutch and German), which makes it possible to find "landowner" when searching for "owner". In most cases, further normalization to the root word or stem (dancer -> dance) is also necessary. Similarly, semantic lexicons with thesauri are required so that related terms such as "car" and "motor vehicle" can be matched with each other.
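The normalization steps described above can be sketched in a few lines of Python. This is a minimal illustration, not a production analyzer: the lexicon entries and the function names (tokenize, lemmatize, decompose, index_terms) are assumptions made for this example, and real systems rely on far larger lexicons and more careful splitting rules.

```python
import re

# Toy lexicon (hypothetical entries): surface form -> (lemma, word type)
LEXICON = {
    "children": ("child", "noun"),
    "child": ("child", "noun"),
    "landowner": ("landowner", "noun"),
    "land": ("land", "noun"),
    "owner": ("owner", "noun"),
}

def tokenize(text):
    """Decompose text into a sequence of lowercase word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())

def lemmatize(token):
    """Return the lemma from the lexicon, or the token itself if unknown."""
    return LEXICON.get(token, (token, "unknown"))[0]

def decompose(token):
    """Split a compound into two lexicon words, or return [] if none fit."""
    for i in range(2, len(token) - 1):
        head, tail = token[:i], token[i:]
        if head in LEXICON and tail in LEXICON:
            return [head, tail]
    return []

def index_terms(text):
    """Collect lemmas and compound parts under which a text can be found."""
    terms = set()
    for token in tokenize(text):
        terms.add(lemmatize(token))
        terms.update(decompose(token))
    return terms

terms = index_terms("The children met the landowner")
print(sorted(terms))
```

With these index terms, a query for "child" or "owner" matches the example sentence although neither word occurs in it literally.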
2. Recognition of entities
Names, e.g. of persons, organizations and places (addresses), often play a special role in search. This also applies to numerical data such as prices or dates and to technical-scientific quantities such as areas, speeds or temperatures. The intention behind many search queries is a factual question or wh-question (Who?, When?, Where?, How much?). These information units are recognized on the basis of special lexicons (for example, typical first and last names) and typical contexts of these entities. For example, phrases such as "is located in" or "resides in" represent typical contexts for locations. Based on matches with the lexicons and the context descriptions, rule-based decisions are then made as to whether an entity can be identified in a specific text passage.
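As a rough sketch of this lexicon-and-context approach, the following Python example combines a tiny first-name lexicon, the two location contexts mentioned above, and a simple date pattern. The lexicon contents, the pattern definitions, the function name find_entities and the "warehouse" sentence are assumptions chosen for illustration only.

```python
import re

# Hypothetical mini-lexicon of first names (real lexicons are far larger).
FIRST_NAMES = {"franz", "bernhard"}
# Typical contexts that introduce a location, as named in the text.
LOCATION_CONTEXTS = ("is located in", "resides in")

MONTHS = ("January|February|March|April|May|June|July|August|"
          "September|October|November|December")
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS}) \d{{1,2}}, \d{{4}}\b")
CAP_WORD = r"[A-Z][\w-]*"  # a capitalized word
NAME_PATTERN = re.compile(rf"\b({CAP_WORD}) ({CAP_WORD})")

def find_entities(text):
    """Return (type, text) pairs found by lexicon and context rules."""
    entities = [("DATE", m.group()) for m in DATE_PATTERN.finditer(text)]
    # Rule: capitalized words right after a location context are a location.
    for context in LOCATION_CONTEXTS:
        pattern = re.escape(context) + rf" ({CAP_WORD}(?: {CAP_WORD})*)"
        for m in re.finditer(pattern, text):
            entities.append(("LOCATION", m.group(1)))
    # Rule: a lexicon first name followed by a capitalized word is a person.
    for m in NAME_PATTERN.finditer(text):
        if m.group(1).lower() in FIRST_NAMES:
            entities.append(("PERSON", m.group()))
    return entities

text = ("The company IntraFind was founded on October 1, 2000 by "
        "Franz Kögl and Bernhard Messer. The warehouse is located in Hamburg.")
for entity in find_entities(text):
    print(entity)
```

The rules are deliberately naive. A name that is missing from the lexicon, or a location introduced by a different phrase, is simply not found, which is why such systems depend on broad lexicons and many context descriptions.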
3. Recognition of word groups, sentence structures and relations
The most challenging task in computational linguistics is the syntactic-semantic analysis of complete sentences. Based on grammars and the information from the two previous analysis steps, complete sentences can be analyzed. In this process, larger groups of words - in the above examples "with the spoon" and "with the noodles" - are identified and categorized (in the first case as a modifier of the verb "eat", in the second case as part of the sentence object "soup"). In this way, relations can be recognized. For example, from the sentence "The company IntraFind was founded on October 1, 2000 by Franz Kögl and Bernhard Messer" a relation (founding relation) between the two named persons and the company IntraFind can be extracted. Based on recognized relations, questions can be answered reliably.
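The founding relation from the example sentence can be extracted with a single hand-written rule, sketched below in Python. A regular expression stands in for a real grammar here, which is a strong simplification. The rule, the function names extract_founding and who_founded, and the dictionary layout are assumptions made for this illustration.

```python
import re

# One hand-written rule for sentences of the form
# "<Company> was founded on <date> by <person> and <person>".
FOUNDING_RULE = re.compile(
    r"(?:The company )?(?P<company>[A-Z]\w+) was founded on "
    r"(?P<date>[A-Z][a-z]+ \d{1,2}, \d{4}) by (?P<founders>.+?)(?:\.|$)"
)

def extract_founding(sentence):
    """Return the founding relation as a dict, or None if the rule fails."""
    m = FOUNDING_RULE.search(sentence)
    if m is None:
        return None
    founders = re.split(r",\s*|\s+and\s+", m.group("founders"))
    return {
        "relation": "founded",
        "company": m.group("company"),
        "date": m.group("date"),
        "founders": [f for f in founders if f],
    }

def who_founded(company, relations):
    """Answer a 'Who founded X?' question from extracted relations."""
    for rel in relations:
        if rel["relation"] == "founded" and rel["company"] == company:
            return rel["founders"]
    return []

sentence = ("The company IntraFind was founded on October 1, 2000 "
            "by Franz Kögl and Bernhard Messer")
relation = extract_founding(sentence)
print(relation)
print(who_founded("IntraFind", [relation]))
```

Once the relation is stored in structured form, the question "Who founded IntraFind?" becomes a simple lookup instead of a text search.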
Machine learning
1. Classical machine learning
Over the past decades, classical computational linguistic methods have been supplemented or even replaced by ML. Tasks such as identifying the language of a text, its topic (classification) or its sentiment are almost exclusively solved by ML. ML is also frequently used for entity recognition. However, until recently ML methods had two fundamental drawbacks. First, large amounts of manually analyzed and annotated sample texts were needed as training sets. For text classification, for example, many manually classified example documents were needed for each topic. Second, until a few years ago, ML was a domain for experts, since feature engineering (transforming text into a form suitable for vector-based ML methods) was often a crucial prerequisite for success.
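To make the two drawbacks concrete, here is a small sketch of classical text classification in plain Python: a hand-written feature function (bag-of-words counts) feeding a multinomial naive Bayes classifier with add-one smoothing. Naive Bayes is only one of many classical methods and is chosen here for brevity. The four training sentences, their labels and all names in the code are made up for this example.

```python
import math
import re
from collections import Counter, defaultdict

def features(text):
    """Feature engineering: turn a text into bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

class NaiveBayes:
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(features(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for word, n in features(text).items():
                if word in self.vocab:  # ignore words never seen in training
                    score += n * math.log((counts[word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Manually labeled training set (hypothetical, far too small in practice).
train_texts = [
    "The team won the match in the final minute",
    "The striker scored two goals",
    "The bank raised interest rates",
    "Shares fell as the market closed",
]
train_labels = ["sports", "sports", "finance", "finance"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("The market reacted to interest rates"))
```

The classifier only knows words that occur in its labeled examples, which is why each topic needs many annotated documents, and its quality depends directly on how the features function represents the text.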
2. Deep learning
Deep learning could be a game changer. First, the feature engineering mentioned above is less important, as deep neural networks are able to learn semantic representations of text in the form of high-dimensional vectors (embeddings) from large text collections that have not been manually annotated. By reusing these representations for words, sentences and longer passages (transfer learning), many classical NLP tasks can be solved. Second, although manually annotated sample texts are still needed for the specific NLP tasks, fewer samples may suffice because the embeddings are semantically rich.
The task of a search engine, namely mapping search queries to documents, can be done entirely based on embeddings. This approach, also called neural search, could completely replace classical search engine technology, in which documents and search queries are represented as sets of words (bag-of-words). Neural search approaches are less susceptible to the vocabulary mismatch problem (describing the same things with different words), because their representations are to some extent independent of the actual words used. Classical search engine technology relies on word normalization and thesauri to handle vocabulary mismatch. Another advantage of neural search is that embeddings for larger text units such as sentences also capture certain semantic relations. A sentence and its negation can have very different embedding representations. For classical information retrieval, the difference may lie in only one word, or even in the position of one word, and thus has little influence on the search results.
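The vocabulary mismatch problem can be illustrated with a toy comparison in Python. The three-dimensional word vectors below are hand-picked, hypothetical values, not the output of a trained model, and averaging word vectors is only a crude stand-in for a learned sentence embedding. In particular, averaging ignores word order and so cannot show the negation effect described above. All function names are assumptions for this sketch.

```python
import math

# Hypothetical 3-dimensional word embeddings, hand-picked for illustration.
# Real embeddings are learned and have hundreds of dimensions.
EMBEDDINGS = {
    "car": [0.9, 0.1, 0.0],
    "automobile": [0.85, 0.15, 0.0],
    "repair": [0.1, 0.9, 0.0],
    "fix": [0.15, 0.85, 0.0],
    "soup": [0.0, 0.1, 0.9],
    "recipe": [0.0, 0.2, 0.8],
}

def embed(text):
    """Embed a text as the average of its known word vectors."""
    vectors = [EMBEDDINGS[w] for w in text.lower().split() if w in EMBEDDINGS]
    if not vectors:
        return [0.0, 0.0, 0.0]
    return [sum(component) / len(vectors) for component in zip(*vectors)]

def cosine(a, b):
    """Cosine similarity of two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def bag_of_words_score(query, doc):
    """Classical matching: number of words shared by query and document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def neural_search(query, docs):
    """Return the document whose embedding is closest to the query."""
    q = embed(query)
    return max(docs, key=lambda d: cosine(q, embed(d)))

docs = ["fix automobile", "soup recipe"]
query = "car repair"
print([bag_of_words_score(query, d) for d in docs])
print(neural_search(query, docs))
```

The query shares no word with either document, so plain word matching scores both at zero, while the embedding comparison ranks "fix automobile" first. A classical engine would need a thesaurus entry linking "car" to "automobile" and "repair" to "fix" to achieve the same result.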
However, neural search approaches are currently much less robust than classical search engine technology. Searching for product numbers, error codes, or account numbers is not reliably possible with neural search approaches. The same applies to searching for numerical values or telephone numbers. The hardware requirements of neural search approaches are at least an order of magnitude greater than those of classical search engine technology. In contrast to classical search engine technology, it is not possible to predict which content will be found by a given search query. The explainability of individual search results is also significantly worse. As with all deep learning approaches, the behavior of neural search largely eludes human analysis (black box).
The future will show whether we will see a complete change in technology or whether the combination of new and old approaches will be more successful.