25.09.2020 | Blog

Semantic-linguistic indexing for better search results

With its Linguistics Plugin, IntraFind enables users of the open source search engine Elasticsearch to get more complete and more relevant search results. The search specialist has now implemented several semantic extensions: search for entities, numbers and units.
IntraFind provides linguistic extension for Elasticsearch

Most of us know how annoying it can be when we urgently search for information and get useless answers from the search engine. In such cases it becomes obvious that most search engines have no understanding of the text at all. With the Linguistics Plugin, text is analyzed linguistically and semantically, and the results of this analysis are integrated into the Elasticsearch index. This avoids many of these frustrating experiences.

It may sound trivial, but searching becomes more comfortable when users no longer have to worry about word variants and spellings. For example, a search for "ball" with the IntraFind Linguistics Plugin for Elasticsearch will also return hits containing "basketball" or "football", but not "ballet". The user does not have to be an expert in formulating wildcard queries for this. A search for "Buch" (book) will also return hits for "Kinderbücher" (children's books), and a search for "Bundesforschungsministerium" (Federal Ministry of Research) will return hits for "Ministerium des Bundes für Forschung und Technologie", based solely on morphological normalization and compound decomposition.

Base forms and decomposed multi-word terms as key

The Linguistics Plugin combines different methods of linguistic text mining to normalize inflected words to their base forms (lemmatization) and to decompose compound terms into their basic components (compound decomposition). This is particularly important for the numerous German multi-word terms. As an example:

Lemmatization: mice -> mouse

Compound Decomposition: Bundesumweltminister -> (Bund, Umwelt, Minister), which also finds matches for "Bundesminister für Umwelt"
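
The two normalization steps above can be illustrated with a small sketch. This is not IntraFind's implementation: the lexicon, the list of linking elements and the function names `lemmatize` and `decompose` are assumptions made for illustration, and a production analyzer would rely on much larger lexicon resources and more careful rules. The sketch only shows the idea of computing a decomposition from base forms rather than looking up each compound.

```python
# Toy lexicon of base forms (hypothetical); real resources are far larger.
LEXICON = {"bund", "umwelt", "minister", "forschung", "ministerium"}
# Illustrative subset of German linking elements ("Fugenelemente").
LINKING_ELEMENTS = ("es", "s", "")
# Irregular inflections that cannot be derived by rule.
IRREGULAR_FORMS = {"mice": "mouse"}


def lemmatize(word):
    """Map an inflected form to its base form (toy lookup only)."""
    lowered = word.lower()
    return IRREGULAR_FORMS.get(lowered, lowered)


def decompose(word):
    """Split a compound into lexicon entries, or return None if impossible."""
    word = word.lower()
    if not word:
        return []
    # Try the longest possible first component, then recurse on the rest.
    for end in range(len(word), 0, -1):
        head = word[:end]
        for link in LINKING_ELEMENTS:
            if not head.endswith(link):
                continue
            stem = head[: len(head) - len(link)]
            if stem in LEXICON:
                rest = decompose(word[end:])
                if rest is not None:
                    return [stem] + rest
    return None


print(lemmatize("mice"))
print(decompose("Bundesumweltminister"))
```

Because the components are indexed alongside the full word, a query for "Minister" or "Umwelt" can match a document that only contains the compound.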

Search based on lemmatization and compound decomposition is much more precise than search based on stemming, which is what search engines usually use. In many cases, stemming tends to over-generalize, because it traces words with completely different meanings back to the same artificial word stem (examples: Messer → Mess, Messe → Mess).
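
The over-generalization can be reproduced with a deliberately naive suffix stripper. The function `naive_stem` and the small lemma table below are assumptions for illustration only; they are not a real stemming algorithm and not the plugin's lexicon. The point is that "Messer" (knife) and "Messe" (trade fair) collapse to the same stem, while a lexicon-based lemmatizer keeps them apart.

```python
def naive_stem(word):
    """Strip a common German suffix without any lexical knowledge (toy rule)."""
    for suffix in ("er", "e"):
        # Keep at least three characters so very short words stay intact.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma table: inflected form -> base form.
LEMMAS = {
    "Messer": "Messer",   # knife
    "Messern": "Messer",  # dative plural of "Messer"
    "Messe": "Messe",     # trade fair
    "Messen": "Messe",    # plural of "Messe"
}


def lemmatize(word):
    """Return the base form if known, otherwise the word unchanged."""
    return LEMMAS.get(word, word)


for word in ("Messer", "Messe"):
    print(word, "stem:", naive_stem(word), "lemma:", lemmatize(word))
```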

Semantic search

One speaks of semantic search when the content meaning of texts and search queries is taken into account in the search. In this context, thesauri, semantic networks or ontologies are used which contain background knowledge for the interpretation of the search query.

Software takes into account that languages change

The linguistic procedures of the plugin are based on extensive lexicon resources, but the analysis itself is strongly procedural. In contrast to comparable approaches, linguistic results, especially word decompositions, are therefore "computed" by procedures rather than "looked up" in the lexicon. Since languages develop dynamically and have great creative potential for new words and word combinations, it would be inefficient to rely only on rigid lexicons. The extensive lexicons are carefully prepared and compiled into an efficient form, so that the analysis runs quickly and uses little memory.

Use of thesauri

In numerous search projects, the quality of the search is significantly enhanced by the use of thesauri. Many of these improvements are already provided automatically by the Linguistics Plugin through the use of lemmatization and compound decomposition. However, lemmatization and compound decomposition do not replace real synonyms and abbreviations (such as car, automobile, motor vehicle), hyponyms (sub-concepts) or hyperonyms (generic terms).

The plugin therefore allows the use of arbitrary thesauri and supports common thesaurus formats, such as SKOS. Because lemmatization and compound decomposition are also applied to the thesaurus lookup, no inflected forms need to be maintained in the thesaurus resources. This greatly simplifies both the maintenance of customer-specific thesauri and the use of existing thesauri.
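
A minimal sketch of why lemmatized lookup keeps the thesaurus small: the thesaurus below is keyed by base forms only, and the query term is normalized before the lookup. The dictionaries and the function names `lemmatize` and `expand` are hypothetical stand-ins; a plain Python dict takes the place of a thesaurus loaded from a format such as SKOS.

```python
# Hypothetical base-form lookup standing in for real lemmatization.
BASE_FORMS = {"cars": "car", "automobiles": "automobile"}

# Thesaurus keyed by base forms only; no plural or inflected entries needed.
THESAURUS = {
    "car": {"automobile", "motor vehicle"},
    "automobile": {"car", "motor vehicle"},
}


def lemmatize(word):
    """Normalize a query term to its base form (toy lookup only)."""
    lowered = word.lower()
    return BASE_FORMS.get(lowered, lowered)


def expand(term):
    """Return the base form of the term plus its thesaurus synonyms."""
    base = lemmatize(term)
    return {base} | THESAURUS.get(base, set())


print(sorted(expand("Cars")))
```

The plural "Cars" finds the synonyms even though the thesaurus only contains the entry "car".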

A specific example of the benefits of linguistic normalization

The goal of a search tool is to find information, whether that means enabling a support employee to answer questions quickly or helping customers find every relevant product in an online store. An example: in an online store for automotive accessories, a user searches for "Wheels A6". The user does not know that this item is listed under a different name, and does not need to: based solely on linguistic normalization, the hit list contains the desired item under the term "Complete wheelset A6". No application-specific maintenance of thesaurus resources is necessary for this. The economic benefit of an efficient search should therefore not be underestimated.

Linguistics Practical Example

Let's have a look at how IntraFind's Linguistics Plugin analyzes the word "policemen":

  • The plural form "policemen" is reduced to its base form "policeman".

  • The word "policeman" is recognized as a compound and split into the single words "police" and "man".

Multilinguality

An advantage of the Linguistics Plugin that should not be overlooked is how little configuration it needs, particularly in a multilingual environment. It is no longer necessary to configure a language-specific field and analyzer in the schema and to control indexing with a separate language recognition component, since the IntraFind Analyzer already contains one. Even in mixed-language content, the parts written in different languages are recognized and automatically handled correctly.

Details on the search based on linguistic normalization and multilingualism can also be found in an older blog post [1].

Search for entities, numerical search and unit search

Names, e.g. of persons, organizations and places (addresses), often play a special role in search. This also applies to numerical data such as prices or dates or to technical-scientific units such as area specifications, speeds or temperatures. The intention behind many search queries is often a factual question or W question (Who?, When?, Where?, How much?).

The Linguistics Plugin now recognizes entities, numbers and units and integrates them into the search index. The integration is based on the typed index concept [2], [3], which IntraFind introduced many years ago. It allows different types of terms to be used for different search needs while retaining information about textual proximity (word distance).

With these semantic extensions it is possible, for example, to search for any person in the vicinity (small word distance or same sentence) of "founder" and "Doctors without Borders". If a document is found for this search query, the highlighting snippet will most likely contain the names of one or more founders of the association "Doctors without Borders". The semantic enrichment of the index also allows, for example, a targeted search for a person with the name "Black" while excluding hits for the color "black".
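
The idea of typed terms that share positions with ordinary words can be sketched as follows. This is a toy positional index, not the typed index described in [2] and [3]: the class `TypedIndex`, its methods and the sample documents (including the invented name "Jane Doe") are assumptions for illustration only.

```python
from collections import defaultdict


class TypedIndex:
    """Toy positional index whose terms carry a type such as 'word' or 'person'."""

    def __init__(self):
        # (term_type, value) -> list of (doc_id, position)
        self.postings = defaultdict(list)

    def add(self, doc_id, position, term_type, value):
        self.postings[(term_type, value)].append((doc_id, position))

    def docs(self, term_type, value):
        """Return the ids of all documents containing the typed term."""
        return {doc_id for doc_id, _ in self.postings.get((term_type, value), [])}

    def near(self, anchor, target_type, max_distance):
        """Find values of target_type within max_distance positions of anchor."""
        hits = []
        for doc_id, anchor_pos in self.postings.get(anchor, []):
            for (term_type, value), occurrences in self.postings.items():
                if term_type != target_type:
                    continue
                for other_doc, pos in occurrences:
                    if other_doc == doc_id and abs(pos - anchor_pos) <= max_distance:
                        hits.append((doc_id, value))
        return hits


index = TypedIndex()
# Invented sample texts; doc 1 mentions a person close to the word "founder".
texts = {1: "jane doe is the founder of example aid",
         2: "the black car",
         3: "mr black arrived"}
for doc_id, text in texts.items():
    for pos, word in enumerate(text.split()):
        index.add(doc_id, pos, "word", word)
# Entity terms share the position of the first token they cover.
index.add(1, 0, "person", "Jane Doe")
index.add(3, 1, "person", "Black")

print(index.near(("word", "founder"), "person", max_distance=5))
print(index.docs("person", "Black"), index.docs("word", "black"))
```

A query for the typed term `("person", "Black")` matches only the document in which "Black" was recognized as a name, while the plain word "black" matches both documents.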

Searching for any amount of money that appears close to the word "rent" and to address components as broader context could answer the question of how much rent is charged for a property.

The numerical search can not only provide answers to questions like "How much?". It is also possible to search for amounts in a specified interval. In addition, recognized units are automatically normalized. Thanks to unit normalization, a search for the words "boiling point" and a "temperature of 90 - 110 °C" in the same sentence also delivers hits for the sentence "The boiling point of water is 212 °F", because 212 °F corresponds to 100 °C. Searching for "Display 5 inch" will also return results for "Display diagonal 4.8 inch", since numeric searches use an exactness / fuzziness of 10% by default.
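
The arithmetic behind both examples can be checked with a short sketch. The function names are hypothetical, and the way the 10% fuzziness is applied (as a tolerance relative to the query value) is an assumption; the plugin's exact definition is not given here.

```python
def to_celsius(value, unit):
    """Normalize a temperature to degrees Celsius."""
    if unit == "°C":
        return value
    if unit == "°F":
        return (value - 32) * 5 / 9
    raise ValueError(f"unsupported unit: {unit}")


def in_interval(value, low, high):
    """Interval search: is the normalized value inside [low, high]?"""
    return low <= value <= high


def fuzzy_match(query, candidate, tolerance=0.10):
    """Assumed fuzziness rule: candidate within +/- tolerance of the query value."""
    return abs(candidate - query) <= tolerance * query


boiling_point = to_celsius(212, "°F")
print(boiling_point, in_interval(boiling_point, 90, 110))
print(fuzzy_match(5, 4.8), fuzzy_match(5, 4.4))
```

Under this assumption, a query for 5 inch accepts values from 4.5 to 5.5 inch, so 4.8 inch matches and 4.4 inch does not.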

[1] “New High Quality Search and Linguistics for iFinder5 elastic”, C. Goller, IntraFind Blog
[2] “The Typed Index”, C. Goller, FOSDEM 2015, https://archive.fosdem.org/2015/schedule/event/the_typed_index/
[3] “The Typed Index”, C. Goller, Lucene Revolution 2013, Dublin, https://www.youtube.com/watch?v=X93DaRfi790

The author

Dr. Christoph Goller
Head of Research
Christoph Goller received his PhD in computer science from the Technical University of Munich for pioneering work on Deep Learning. He is an Apache Lucene committer and has more than 20 years of experience in Information Retrieval, Natural Language Processing and AI. He has been head of IntraFind's research department since 2002.