
05.02.2016 | Blog How search engines can be expanded by means of word families

What makes a good search engine? A basic search takes the sequence of letters you type and finds every occurrence of that combination in the document, which is a quick and easy way to locate specific passages. For a user who wants to find out more about a particular topic or the use of a particular word, however, such basic search functionality is certainly not enough.

What makes a good search engine? The minimum requirement is the kind of search functionality usually invoked in a word processor or editor with the shortcut “CTRL + F”: you type a sequence of letters, and the document is searched for all occurrences of this combination. This way you can quickly and easily find specific text passages within a document.

Words, however, do not always consist of the same letter sequence; they are dynamic entities, inflected for grammatical person, tense, case or gender. For a user who wants to find out more about a particular topic or the use of a particular word, such basic search functionality is therefore not enough.

One way to extend the search is therefore lemmatization. Alongside the decomposition of compounds, lemmatization forms the core of IntraFind’s linguistics and is based on the company's own full-form lexicon. For example, the lexicon maps the letter sequence "thought" to its base form "think" and thereby links it to all other possible forms of "think". In this way the entire spectrum of inflected word variants (the inflectional paradigm) can be included in a search for any one of these forms.
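The lexicon lookup described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not IntraFind's implementation: the tiny dictionary and the names FULL_FORM_LEXICON, build_paradigms and expand_by_lemma are hypothetical and were chosen for this example.

```python
# Toy full-form lexicon: inflected form -> base form (lemma).
# The entries are illustrative; a real lexicon would be far larger.
FULL_FORM_LEXICON = {
    "think": "think",
    "thinks": "think",
    "thinking": "think",
    "thought": "think",
}


def build_paradigms(lexicon):
    """Invert the lexicon so that each lemma maps to all of its listed forms."""
    paradigms = {}
    for form, lemma in lexicon.items():
        paradigms.setdefault(lemma, set()).add(form)
    return paradigms


PARADIGMS = build_paradigms(FULL_FORM_LEXICON)


def expand_by_lemma(term):
    """Return every form that shares the term's lemma.

    Terms missing from the lexicon are returned unchanged, so the search
    falls back to a plain letter-sequence match.
    """
    key = term.lower()
    lemma = FULL_FORM_LEXICON.get(key)
    if lemma is None:
        return {key}
    return set(PARADIGMS[lemma])
```

A search for "thought" would then be run against all four forms returned by expand_by_lemma("thought"), while an unknown term is searched for as typed.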

Words are dynamic entities not only because they can be inflected, but also because word-formation processes can turn them into new words. A newly formed word remains related to the word it was derived from. This relation may hold between words of the same part of speech, such as the nouns "chemistry" and "chemist", or between words of different parts of speech, such as "chemistry" and "chemical" or "buy" and "buyer".

A group of words related in this way is called a word family. In many situations it is useful for a search to include not only the inflectional paradigm but also the word family: someone searching for "chemistry" might also be interested in results that contain "chemist", "chemical" and "chemically". For this purpose, IntraFind developed a stem-thesaurus, which stores the relations between the members of a word family.

It is called a stem-thesaurus because the criterion for the relation is a common word stem. "Word stem" is a linguistically controversial term; for this application it can be defined as the part of the word that all members of the word family have in common and from which new words within the family can be generated by adding prefixes or suffixes. The word stem of our "chemistry" family would be "chem", to which we can add the endings "-ical", "-istry" and "-ist".
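As a rough sketch of how such a stem-thesaurus could be used at query time, the following Python snippet stores each word family under its stem and expands a search term to the whole family. The data structure and the names STEM_THESAURUS, build_word_index and expand_by_word_family are assumptions made for this illustration; the actual storage format of the thesaurus is not described here.

```python
# Toy stem-thesaurus: word stem -> members of the word family.
# The families are taken from the examples in the text.
STEM_THESAURUS = {
    "chem": {"chemistry", "chemist", "chemical", "chemically"},
    "buy": {"buy", "buyer"},
}


def build_word_index(thesaurus):
    """Map every family member back to its stem for fast lookup."""
    index = {}
    for stem, family in thesaurus.items():
        for word in family:
            index[word] = stem
    return index


WORD_TO_STEM = build_word_index(STEM_THESAURUS)


def expand_by_word_family(term):
    """Return the sorted word family of a term, or just the term if unknown."""
    key = term.lower()
    stem = WORD_TO_STEM.get(key)
    if stem is None:
        return [key]
    return sorted(STEM_THESAURUS[stem])
```

Looking up "chemistry" first finds the stem "chem" and then returns all four family members, so documents that mention only "chemist" or "chemical" can still be found.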

What is a Thesaurus?

"Thesaurus", or word network, is a term from the field of documentation science. A thesaurus contains synonyms as well as superordinate and subordinate terms that precisely describe and represent a subject area.

The study of word-formation processes shows that there are certain regularities in how words are derived from the word stem: for example, nouns denoting persons are often formed by adding the suffix "-er" or "-ist".

Rules of this kind can be implemented, so that part of the thesaurus can be created automatically. However, language has too many exceptions for them to be handled easily with just a few additional exception rules: while a "player" is simply someone who plays, a "liver" is not a person who lives.
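To make the "player" versus "liver" problem concrete, here is a minimal Python sketch of a single derivation rule for the suffix "-er" combined with an exception list. The verb list, the exception list and the function names are hypothetical; a real rule set would cover many more suffixes and spelling changes.

```python
# Hypothetical inputs: verbs known to the system and word pairs that look
# like a regular "-er" derivation but are not semantically related.
KNOWN_VERBS = {"play", "buy", "live"}
EXCEPTIONS = {("liver", "live")}


def candidate_base(noun, verbs):
    """Propose a verb for a noun that looks like verb + "-er".

    Two spellings are tried: "player" -> "play" (strip "er") and
    "liver" -> "live" (strip only "r" when the verb already ends in "e").
    """
    if not noun.endswith("er"):
        return None
    for base in (noun[:-2], noun[:-1]):
        if base in verbs:
            return base
    return None


def derive_relation(noun, verbs=KNOWN_VERBS, exceptions=EXCEPTIONS):
    """Return the related verb, or None if no rule applies or the pair is excluded."""
    base = candidate_base(noun, verbs)
    if base is None or (noun, base) in exceptions:
        return None
    return base
```

The rule alone would happily link "liver" to "live"; only the exception entry prevents that pair from entering the thesaurus, which is why the exception lists grow quickly in practice.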

Prefixes that strongly modify the meaning of the stem are particularly difficult, even when the words are still related: "underwrite" has something to do with "write", but it means something different and shouldn’t be mixed into the same search query.

Because of all these exceptions, it makes sense to store word families in a stem-thesaurus that is created with automatic support, refined with semi-automatically generated exception lists and completed by a final manual check. The finished thesaurus contains only those word families whose members are semantically related closely enough to be associated with one another.
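The three stages mentioned above (automatic candidates, exception lists, manual check) could be combined roughly as in the following Python sketch. It is an assumption about how such a workflow might look, not a description of the actual tooling; the function curate and its parameters are invented for this example, as is the choice to drop families that end up with a single member.

```python
def curate(candidates, exceptions, approved):
    """Build a stem-thesaurus from rule-generated candidates.

    candidates: iterable of (stem, word) pairs proposed automatically
    exceptions: set of (stem, word) pairs to reject
    approved:   set of (stem, word) pairs confirmed in the manual check
    """
    thesaurus = {}
    for stem, word in candidates:
        if (stem, word) in exceptions:
            continue  # rejected by the exception list
        if (stem, word) not in approved:
            continue  # not confirmed by a human reviewer
        thesaurus.setdefault(stem, set()).add(word)
    # Assumption: a family with only one member adds nothing to a search.
    return {stem: words for stem, words in thesaurus.items() if len(words) > 1}


candidates = [
    ("write", "write"), ("write", "writer"), ("write", "underwrite"),
    ("live", "live"), ("live", "liver"),
]
exceptions = {("write", "underwrite"), ("live", "liver")}
approved = {("write", "write"), ("write", "writer"), ("live", "live")}

result = curate(candidates, exceptions, approved)
```

With these toy inputs, only the family of "write" with the members "write" and "writer" survives: "underwrite" and "liver" are rejected by the exception list, and "live" is dropped because no other member of its family remains.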

It is then available as an established but expandable resource that extends the range of search functionalities by the respective word family. Such semi-automatically generated and manually curated stem-thesauri are a valuable resource for exploratory search. In combination with other features of the iFinder, they significantly expand the user's search and research possibilities.

The author

Pascal Zambito
Working Student IntraFind Software AG
Pascal Zambito, B.Sc., a graduate in Computational Linguistics from the Ludwig-Maximilians-Universität München (LMU), developed a stem-thesaurus for the German language as part of his thesis on behalf of IntraFind Software AG.