
09.06.2022 | Blog

Modern search software also takes gender forms into account

Gender-inclusive spellings pose entirely new challenges for search and text analytics software. Halyna Galanzina, Text Analytics Expert at IntraFind, shows what an intelligent search engine should be able to do in order to provide users with perfect results.

Efforts to adapt language for gender-appropriate expression have produced a lot of new word formations in German. Here are some of the most common spellings:

Table: gender forms

The phenomenon of such gendered expressions brings some challenges for text processing and search:

  1. Some stems of the newly formed expressions differ from their corresponding base forms and sometimes even have a different meaning: the stem Kolleg in Kolleg:innen is not the same word as Kollege.
  2. The affixes "-in" and "-innen" coincide with the standalone German words "in" and "innen" and can therefore turn up as false hits for search queries.
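
Both problems can be reproduced with a naive tokenizer. The following is a minimal sketch, assuming a tokenizer that simply splits on every non-letter character; `naive_tokenize` and the sample sentence are hypothetical and not part of iFinder.

```python
import re

def naive_tokenize(text):
    # Treat every run of letters as a word; ':' '*' '/' etc. act as separators.
    return re.findall(r"[^\W\d_]+", text.lower())

tokens = naive_tokenize("Liebe Kolleg:innen, wir treffen uns drinnen.")
print(tokens)

# Challenge 1: the stem "kolleg" is indexed instead of a form of "Kollege".
print("kolleg" in tokens)
# Challenge 2: the affix "innen" is indexed as if it were a standalone word.
print("innen" in tokens)
```

With such a tokenizer, a query for the standalone word "innen" would match this sentence even though the word never occurs in it on its own.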

In this blog post, we show how the iFinder software addresses these challenges.

Our search solution aims not only to return a set of documents matching a query, but also to interpret the intention behind each query correctly. To this end, texts are analyzed and normalized in various ways. One of the many techniques used is lemmatization, which determines which word forms can be traced back to a common base form. A good lemmatizer thus recognizes that words such as Mitarbeiter (English: employee), Mitarbeiterin, Mitarbeiters and Mitarbeiterinnen share a base form, and in a standard search, queries using any of these forms should return, as far as possible, the same set of hits. The exact search, in contrast, makes it possible to search specifically for one of these word forms. For example, an exact search for "Mitarbeiterin" returns only hits containing "Mitarbeiterin"; hits with "Mitarbeiter", "Mitarbeiterinnen" etc. do not appear in the result list.
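
The difference between standard and exact search can be illustrated with a toy example. This is a minimal sketch, assuming a hand-written lemma table in place of a real lemmatizer; the names `LEMMAS`, `standard_search` and `exact_search` are hypothetical and do not reflect the iFinder API.

```python
# Toy lemma table (assumption); a real lemmatizer relies on lexicons and rules.
LEMMAS = {
    "mitarbeiter": "mitarbeiter",
    "mitarbeiters": "mitarbeiter",
    "mitarbeiterin": "mitarbeiter",
    "mitarbeiterinnen": "mitarbeiter",
}

def lemma(word):
    """Return the base form of a word, or the lowercased word if unknown."""
    w = word.lower()
    return LEMMAS.get(w, w)

def standard_search(query, docs):
    """Match documents containing any word form with the same base form."""
    target = lemma(query)
    return [i for i, doc in enumerate(docs)
            if any(lemma(tok) == target for tok in doc.split())]

def exact_search(query, docs):
    """Match only documents containing exactly the queried word form."""
    q = query.lower()
    return [i for i, doc in enumerate(docs)
            if any(tok.lower() == q for tok in doc.split())]

DOCS = [
    "Die Mitarbeiterin kommt",
    "Alle Mitarbeiterinnen kommen",
    "Der Mitarbeiter kommt",
]

print(standard_search("Mitarbeiterin", DOCS))  # [0, 1, 2]
print(exact_search("Mitarbeiterin", DOCS))     # [0]
```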

Before the lemmatizer can analyze word forms, the boundaries of individual words have to be recognized correctly. The spelling of the new gender expressions poses a further challenge here, since it does not follow the spelling conventions of German: a wide variety of characters occur in the middle of words, most of which are normally, and rightly, interpreted as word separators. The first step is therefore to recognize the correct word boundaries for the many spellings of gender expressions. In the next step, the recognized word forms are normalized for the standard search so that their correct base forms can then be determined. These steps are applied both during indexing and during search, so a standard search arrives at the same hit sets for different queries. Here are hit counts, given in thousands, from an iFinder test system, comparing standard search with exact search:

Table: standard search vs. exact search
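
The two preprocessing steps described above, recognizing word boundaries and normalizing the recognized forms, could look roughly as follows. This is a minimal sketch and not iFinder's implementation: the set of separators (`*`, `:`, `_`, `/`, `/-`, `-/`) is an assumption, and `tokenize` and `normalize` are hypothetical names.

```python
import re

# One token is a run of letters, optionally followed by a gender separator
# and the affix "in" or "innen". The separator list is an assumption.
TOKEN = re.compile(r"[^\W\d_]+(?:(?:[*:_]|-?/-?)(?:innen|in)\b)?")
SEPARATOR = re.compile(r"[*:_]|-?/-?")

def tokenize(text):
    """Split text into words, keeping gendered spellings together."""
    return TOKEN.findall(text)

def normalize(token):
    """Drop the separator and lowercase: 'Kolleg:innen' -> 'kolleginnen'."""
    return SEPARATOR.sub("", token).lower()

print(tokenize("Liebe Kolleg:innen und Mitarbeiter*innen"))
# ['Liebe', 'Kolleg:innen', 'und', 'Mitarbeiter*innen']

print([normalize(t) for t in tokenize("Kolleg-/in Kolleg_innen Kolleg/innen")])
# ['kollegin', 'kolleginnen', 'kolleginnen']
```

The normalized forms are ordinary German words, so a lemmatizer can map them to their base form in the usual way. Running the same two functions over documents at indexing time and over queries at search time is what makes the different spellings land on the same hits.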

The corresponding hits are also highlighted correctly. Here is a small illustration:

Screenshot: iFinder, no hit

or, using the example "Kolleg-/in":

Screenshot: iFinder search for colleagues

and

Screenshot: iFinder search for colleagues, with highlighting

Conclusion: Our solution iFinder ensures that all common spellings of the new expressions can be found. False hits caused by the affixes (in, innen) are significantly reduced.

Linguistically, gender-related spelling gives rise to different phenomena in other languages. In English, for example, we see fewer new word forms and instead a change in language usage: terms that refer specifically to a man or a woman have simply been replaced by neutral expressions, e.g.:

fireman = firefighter

policeman = police officer

stewardess = flight attendant

In this case, iFinder uses a supplementary thesaurus function to ensure that corresponding synonyms are found.
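
A thesaurus-based synonym expansion of this kind can be sketched as follows. This is only an illustration under stated assumptions: the `THESAURUS` structure, the `expand` and `search` functions and the simple substring matching are hypothetical and say nothing about how the iFinder thesaurus function is implemented.

```python
# Synonym groups taken from the examples above (data structure is an assumption).
THESAURUS = [
    {"fireman", "firefighter"},
    {"policeman", "police officer"},
    {"stewardess", "flight attendant"},
]

def expand(query):
    """Return the query together with all of its known synonyms."""
    q = query.lower()
    for group in THESAURUS:
        if q in group:
            return sorted(group)
    return [q]

def search(query, docs):
    """Return indices of documents containing the query or a synonym."""
    terms = expand(query)
    return [i for i, doc in enumerate(docs)
            if any(term in doc.lower() for term in terms)]

DOCS = [
    "A firefighter rescued the cat.",
    "The flight attendant served coffee.",
    "A fireman arrived first.",
]

print(expand("fireman"))            # ['firefighter', 'fireman']
print(search("fireman", DOCS))      # [0, 2]
print(search("stewardess", DOCS))   # [1]
```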


The author

Halyna Galanzina
Text Analytics Expert
Halyna Galanzina has been working at IntraFind since 2008 and is an expert in text analytics, natural language processing and information extraction. Her passion: developing algorithms for automatic text processing and text understanding.