New High Quality Search and Linguistics for iFinder5 elastic

IntraFind has long been known for high quality information retrieval. For our new product generation, iFinder5 elastic, we completely overhauled our core search technology, which consists of our Lucene / Elasticsearch Analyzers and our Query Parser. In part 1 of this blog article, I talk about the advantages for the standard user and how we are able to reduce configuration effort.

Language Identification and Language Chunking

Identifying the language of a given text is a crucial preprocessing step for almost all text analysis methods. It has been considered a solved problem for more than 20 years. Available solutions build on the simple observation that every language has typical letter sequences (letter n-grams) that occur significantly more frequently in that language than in other languages.
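
To make the n-gram observation concrete, here is a minimal Python sketch. It is not the iFinder implementation. The two training sentences, the choice of trigrams, the cosine similarity scoring, and all function names (char_ngrams, cosine, identify_language) are assumptions made for illustration only. A real system would train its profiles on far larger corpora and many more languages.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Count the letter n-grams of a text (lowercased, whitespace normalized)."""
    padded = " " + " ".join(text.lower().split()) + " "
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy training data (assumption): one short sentence per language.
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and this is the house "
          "that they have seen with their friends",
    "de": "der schnelle braune fuchs springt über den faulen hund und das ist "
          "das haus das sie mit ihren freunden gesehen haben",
}

# One n-gram profile per language, built once up front.
PROFILES = {lang: char_ngrams(text) for lang, text in SAMPLES.items()}


def identify_language(text):
    """Return the language whose n-gram profile is most similar to the text."""
    grams = char_ngrams(text)
    return max(PROFILES, key=lambda lang: cosine(grams, PROFILES[lang]))
```

Calling identify_language("das ist der hund") picks the German profile, because trigrams such as "das", "der" and "und" are frequent in the German sample and rare or absent in the English one. With profiles this small the result is only reliable for text that shares vocabulary with the samples.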

The difference between stemming and lemmatization

"Stemming" as well as "Lemmatization" are commonly used buzzwords in the field of Information Retrieval (IR), particularly in the development of powerful search engines. [...]

So what exactly is the difference between these two methods? What are the advantages and disadvantages and which one should be preferred? [...]

Approximative data structures for natural language processing

Some say software developers draw their motivation from minimizing or maximizing numbers in any given problem. That's a smug innuendo. In my experience, developers are always on the lookout for beautiful solutions, of which numbers are but a symptom. The use of approximative data structures for language processing is one such example of a beautiful idea with nice numbers.

