25.09.2020 | Blog
Semantic-linguistic indexing for better search results
Most of us know how annoying it can be when we urgently search for information and get useless answers from the search engine. In such cases it becomes obvious that most search engines have no understanding of the text at all. With the IntraFind Linguistics Plugin, text is analyzed linguistically and semantically, and the results of this analysis are integrated into the Elasticsearch index. This avoids many of these frustrating experiences.
It may sound trivial, but searching becomes more comfortable for users when they no longer have to worry about word variants and spellings. For example, the IntraFind Linguistics Plugin for Elasticsearch will also return hits containing the word "basketball" or "football" when searching for "ball", but not hits containing "ballet". The user does not have to be an expert in formulating wildcard queries for this. A search for the word "landowner" will return hits for "owner of the land" based solely on morphological normalization and compound decomposition.
Base forms and decomposed multi-word terms as key
The Linguistics Plugin combines different methods of linguistic text mining to normalize inflected words to their base forms (lemmatization) and to decompose compound terms into their basic components (compound decomposition). This is particularly important for the numerous German multi-word terms, but also for simple cases in other languages. As an example:
Lemmatization: mice -> mouse
Compound Decomposition: moonlight -> (moon, light) also finds matches for "light of the moon"
Search based on lemmatization and compound decomposition is much more precise than search based on stemming, which is what search engines usually use. In many cases, stemming tends to over-generalize, because it traces words with completely different meanings back to the same artificial word stem.
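The difference between stemming and lemmatization can be illustrated with a small Python sketch. This is a toy illustration only: the lexicon, the suffix list, and the function names `lemmatize` and `naive_stem` are assumptions made for this example and do not reflect the plugin's actual implementation or resources.

```python
# Toy lexicon mapping inflected forms to base forms (hypothetical data).
LEMMAS = {
    "mice": "mouse",
    "universities": "university",
    "universes": "universe",
}

# Suffixes removed by the crude stemmer below, longest first (hypothetical list).
SUFFIXES = ("ities", "ity", "es", "e", "s")


def lemmatize(word: str) -> str:
    """Return the base form from the toy lexicon; unknown words stay unchanged."""
    w = word.lower()
    return LEMMAS.get(w, w)


def naive_stem(word: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


for form in ("universities", "universes", "mice"):
    print(form, "-> stem:", naive_stem(form), "| lemma:", lemmatize(form))
```

In this sketch the stemmer maps both "universities" and "universes" to the same artificial stem, so a search for one would also match the other, while the lemmatizer keeps "university" and "universe" apart.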
Software takes into account that languages change
The plugin's linguistic procedures draw on extensive lexicon resources, but the lexicon analysis itself is strongly procedural. In contrast to comparable approaches, linguistic results, especially word decompositions, are therefore "computed" by procedures rather than "looked up" in the lexicon. The extensive lexicons are carefully prepared and compiled into an efficient form, so the analysis runs quickly and uses little memory. Because languages develop dynamically and have great creative potential for new words and word combinations, relying only on rigid lexicons would be inefficient.
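To make the idea of "computing" a decomposition more concrete, here is a minimal Python sketch. It assumes a small lexicon of simple words and a recursive left-to-right split; the names `SIMPLE_WORDS` and `decompose` are hypothetical, and the real plugin certainly uses far richer resources and rules than this.

```python
from typing import List, Optional

# Small lexicon of simple (non-compound) words; hypothetical toy data.
SIMPLE_WORDS = {
    "moon", "light", "land", "owner", "police", "man",
    "basket", "foot", "ball", "ballet",
}


def decompose(word: str, min_len: int = 3) -> Optional[List[str]]:
    """Split a word into lexicon words, or return None if no full split exists."""
    word = word.lower()
    if word in SIMPLE_WORDS:
        # Known simple words are never split further.
        return [word]
    # Try every split point that leaves at least min_len characters on each side.
    for i in range(min_len, len(word) - min_len + 1):
        head = word[:i]
        if head in SIMPLE_WORDS:
            rest = decompose(word[i:], min_len)
            if rest is not None:
                return [head] + rest
    return None


for w in ("moonlight", "landowner", "policeman", "moonball", "ballet"):
    print(w, "->", decompose(w))
```

The compound "moonball" is not listed anywhere, yet the sketch still splits it into "moon" and "ball", because the result is derived from the parts rather than looked up. "ballet" is a simple word in the lexicon and is therefore not reduced to "ball".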
Use of thesauri
In numerous search projects, the quality of the search is significantly enhanced by the use of thesauri. Many of these improvements are already provided automatically by the Linguistics Plugin through the use of lemmatization and compound decomposition. However, lemmatization and compound decomposition do not replace real synonyms and abbreviations (such as car, automobile, motor vehicle), hyponyms (sub-concepts) or hyperonyms (generic terms).
The plugin therefore allows the use of arbitrary thesauri and supports common thesaurus formats, such as SKOS. Because lemmatization and compound decomposition are also applied to the lookup in the thesaurus, no inflected forms need to be maintained in the thesaurus resources. This greatly simplifies the maintenance of customer-specific thesauri and the use of existing thesauri.
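The following Python sketch shows why a thesaurus keyed by base forms is enough when the lookup itself is lemmatized. Everything here is an assumption for illustration: the thesaurus is a plain dictionary rather than a SKOS file, and `LEMMAS`, `SYNONYMS`, `lemmatize` and `expand` are hypothetical names.

```python
from typing import Dict, Set

# Toy lemma lexicon (hypothetical data).
LEMMAS: Dict[str, str] = {
    "cars": "car",
    "automobiles": "automobile",
}

# Thesaurus that lists base forms only; no plural entries are maintained.
SYNONYMS: Dict[str, Set[str]] = {
    "car": {"automobile", "motor vehicle"},
    "automobile": {"car", "motor vehicle"},
}


def lemmatize(word: str) -> str:
    """Return the base form from the toy lexicon; unknown words stay unchanged."""
    w = word.lower()
    return LEMMAS.get(w, w)


def expand(term: str) -> Set[str]:
    """Normalize the term first, then add its synonyms from the thesaurus."""
    lemma = lemmatize(term)
    return {lemma} | SYNONYMS.get(lemma, set())


print(sorted(expand("Cars")))
```

The query "Cars" is expanded although the thesaurus contains no entry for the plural form, which is the maintenance saving described above.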
A specific example of the benefits of linguistic normalization
The goal of a search tool is to find information - whether that means helping a support employee provide information quickly or helping customers find all products in an online store. An example: In an online store for automotive accessories, a user searches for "Wheels A6". The user does not know that this item is listed under a different name. This does not matter; the hit list will contain the item under the term "Complete wheelset A6", based solely on linguistic normalization. No application-specific maintenance of thesaurus resources is necessary for this. The economic aspect of an efficient search should therefore not be underestimated.
Linguistics Practical Example
Let's look at how IntraFind's Linguistics Plugin analyzes the word "policemen":
- The plural form "policemen" is reverted to its base form "policeman".
- The word "policeman" is recognized as a compound and separated into the single words "police" and "man".
Multilinguality
An advantage of the Linguistics Plugin that should not be overlooked is how little configuration is needed, particularly in a multilingual environment. It is no longer necessary to configure a language-specific field and analyzer in the schema and to control indexing with a language recognition component, since the IntraFind Analyzer already contains a language recognition component. Even in mixed-language content, the parts written in different languages are recognized and automatically handled correctly.
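As a rough illustration of per-segment language recognition in mixed-language content, the following Python sketch labels each sentence with a language by counting stopwords. The stopword lists and the functions `detect_language` and `split_by_language` are assumptions for this example; the language recognition built into the IntraFind Analyzer is not described in this post and will work differently.

```python
import re
from typing import List, Tuple

# Tiny stopword lists per language (hypothetical toy data).
STOPWORDS = {
    "en": {"the", "is", "of", "and", "a", "in"},
    "de": {"der", "die", "das", "ist", "und", "ein", "im"},
}


def detect_language(sentence: str, default: str = "en") -> str:
    """Pick the language whose stopwords occur most often in the sentence."""
    tokens = re.findall(r"\w+", sentence.lower())
    scores = {
        lang: sum(token in words for token in tokens)
        for lang, words in STOPWORDS.items()
    }
    best = max(scores, key=scores.get)
    # Fall back to the default when no stopword was found at all.
    return best if scores[best] > 0 else default


def split_by_language(text: str) -> List[Tuple[str, str]]:
    """Split text into sentences and label each one with a language code."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    return [(detect_language(s), s) for s in sentences]


for lang, sentence in split_by_language("The moon is bright. Der Mond ist hell."):
    print(lang, "|", sentence)
```

Each labeled segment could then be passed to the matching language-specific analysis, without a separate field or analyzer being configured per language.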
Details on the search based on linguistic normalization and multilingualism can also be found in an older blog post [1].
Search for entities, numerical search and unit search
Names, e.g. of persons, organizations and places (addresses), often play a special role in search. This also applies to numerical data such as prices or dates or to technical-scientific units such as area specifications, speeds or temperatures. The intention behind many search queries is often a factual question or W question (Who?, When?, Where?, How much?).
The Linguistics Plugin now offers the recognition of entities, numbers and units and their integration into the search index. The integration into the index uses the typed index concept [2], [3], which IntraFind introduced many years ago. It allows different types of terms to be used for different search needs while retaining information about textual proximity (word distance).
With these semantic extensions it is possible, for example, to search for any person in the vicinity (small word distance or same sentence) of "founder" and "Doctors without Borders". If a document is found for this search query, the highlighting snippet will most likely contain the names of one or more founders of the association "Doctors without Borders". The semantic enrichment of the index also allows, for example, a targeted search for a person with the name "Black" while excluding hits for the color "black".
Searching for any amount of money in a text close to rent and address components as a broader context could provide the answer to the rent amount of an object.
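The combination of typed terms and word distance can be sketched in a few lines of Python. This is only a simplified model of the idea, assuming that entity annotations are already available and that a typed term is stored at the same position as the word it annotates; `build_index`, `typed_near` and the sample tokens are hypothetical and do not represent the typed index implementation described in [2] and [3].

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Index = Dict[Tuple[str, str], List[int]]

# (surface word, entity type or "") in text order; hypothetical sample sentence.
TOKENS = [
    ("Ms", ""), ("Black", "person"), ("is", ""), ("the", ""),
    ("founder", ""), ("of", ""), ("the", ""), ("club", ""),
    ("with", ""), ("the", ""), ("black", ""), ("door", ""),
]


def build_index(tokens: List[Tuple[str, str]]) -> Index:
    """Index every word, plus a typed term at the same position for entities."""
    index: Index = defaultdict(list)
    for pos, (word, etype) in enumerate(tokens):
        index[("word", word.lower())].append(pos)
        if etype:
            index[(etype, word)].append(pos)
    return index


def typed_near(index: Index, etype: str, anchor: str, max_dist: int) -> List[str]:
    """Return entities of the given type within max_dist words of the anchor word."""
    anchor_positions = index.get(("word", anchor.lower()), [])
    hits = []
    for (term_type, value), positions in index.items():
        if term_type != etype:
            continue
        if any(abs(p - a) <= max_dist for p in positions for a in anchor_positions):
            hits.append(value)
    return sorted(hits)


INDEX = build_index(TOKENS)
print(typed_near(INDEX, "person", "founder", 5))  # any person close to "founder"
print(INDEX.get(("person", "Black")))             # the name only, not the color
```

Because the typed term shares the position of the original word, proximity conditions can mix plain words and entity types, and a lookup of the person "Black" ignores the later occurrence of the color.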
Numerical search does more than answer questions like "How much?". It is also possible to search for amounts within a specified interval. In addition, recognized units are automatically normalized. Thanks to unit normalization, a search for the words "boiling point" and a "temperature of 90 - 110 °C" in the same sentence also returns hits for the sentence "The boiling point of water is 212 °F", because 212 °F equals 100 °C. A search for "Display 5 inch" will also return results for "Display diagonal 4.8 inch", since numeric searches use a default exactness / fuzziness of 10%.
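The arithmetic behind these two examples can be checked with a short Python sketch. The function names are hypothetical, and the interpretation of the 10% fuzziness as a symmetric tolerance relative to the query value is an assumption; the post does not specify how the plugin defines it.

```python
def to_celsius(value: float, unit: str) -> float:
    """Normalize a temperature given in °C, °F or K to degrees Celsius."""
    unit = unit.strip().lstrip("°").upper()
    if unit == "C":
        return value
    if unit == "F":
        return (value - 32.0) * 5.0 / 9.0
    if unit == "K":
        return value - 273.15
    raise ValueError(f"unsupported temperature unit: {unit!r}")


def in_interval(value: float, low: float, high: float) -> bool:
    """Interval search: is the normalized value inside [low, high]?"""
    return low <= value <= high


def fuzzy_match(query: float, candidate: float, tolerance: float = 0.10) -> bool:
    """Assumed fuzziness: candidate deviates at most `tolerance` from the query."""
    return abs(candidate - query) <= tolerance * abs(query)


boiling_point = to_celsius(212, "°F")
print(boiling_point, in_interval(boiling_point, 90, 110))
print(fuzzy_match(5.0, 4.8), fuzzy_match(5.0, 4.4))
```

Under these assumptions, 212 °F normalizes to 100 °C and falls inside the queried interval of 90 to 110 °C, and a 4.8 inch display lies within 10% of the queried 5 inches while a 4.4 inch display does not.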
[1] “New High Quality Search and Linguistics for iFinder5 elastic”, C. Goller, IntraFind Blog
[2] “The Typed Index”, C. Goller, FOSDEM 2015, https://archive.fosdem.org/2015/schedule/event/the_typed_index/
[3] “The Typed Index”, C. Goller, Lucene Revolution 2013, Dublin, https://www.youtube.com/watch?v=X93DaRfi790