05.06.2023 | Blog Always Trouble with Boolean Queries
This is a deep-dive expert-level article!
Many search engines support Boolean (logical) operators in their query language. So does Lucene, so do Lucene-based search engines such as OpenSearch (OS), Elasticsearch (ES), and Solr, and of course so does IntraFind’s Insight Plugin and the iFinder. This blog is about specific problems of Lucene’s classic query parser and how we handle / fix them. However, many aspects also generalize to other information retrieval systems. We don’t investigate Lucene’s flexible query parser since it is currently not supported by ES or OS. Scoring issues are also not topic of this article. I will focus on them in a follow up article. This article is about logic only.
Operator Precedence
Boolean Operators AND, OR, NOT
Lucene’s classic query parser and the respective parsers derived from it in OS, ES, and Solr support Boolean operators, namely AND, OR, and NOT. If you have a background in mathematics and logic you expect the standard operator precedence. The unary operator NOT should bind stronger than the binary operators, which actually works in Lucene. The operator AND should bind stronger than the operator OR[1]. For the following examples[2] you would probably expect:
A AND B OR C AND D <-> (A AND B) OR (C AND D)
A OR B AND C OR D <-> A OR (B AND C) OR D
A AND B AND C <-> (A AND B) AND C
A AND NOT B OR C <-> (A AND (NOT B)) OR C
Unfortunately, operator precedence is not correctly implemented, and it has been so for roughly 20 years. We will see later that this causes unexpected and incorrect search results. Operator precedence in Lucene’s query parser is rather weird in fact. I don’t know if it can be described in any reasonable way besides the source code of the actual implementation. One reason for that might be that the Lucene BooleanQuery implements a different semantics which by the way is also supported syntactically by classic query parser:
BooleanQuery: +A1 +A2 …. +An B1 B2 … Bn -C1 -C2 … -Cn
- A-clauses are required: a result-document must contain each A-clause; filters are A-clauses
- B-clauses are optional: if there are A-clauses, B-clauses only contribute to scoring, if there are no A-clauses, a result-document must contain at least one B-clause, additionally, min_should_match might be applied to B-clauses
- C-clauses are prohibited: only used for filtering results. Result-documents must not contain any C-clause
A BooleanQuery consisting exclusively of C-clauses [3]never matches in Lucene. In ES / OS this is fixed by adding a MatchAllDocsQuery (*:*).
Besides single-word queries, queries consisting of more than one word such as John Wick are probably the most common type of query because users usually don’t use Boolean operators or other specific query syntax. According to the BooleanQuery syntax introduced above these are B-clauses and they are optional. However, there is a parameter called default operator, which has 2 possible values. If it is set to OR, both words in John Wick remain optional (equivalent to John OR Wick), if set to AND, they become A-clauses (required): +John +Wick (equivalent to John AND Wick).
Classic Query Parser behavior: examples of queries translated (→) to BooleanQuery:
X1 AND X2 AND X3 → +X1 +X2 +X3
X1 OR X2 OR X3 → X1 X2 X3
X1 X2 X3 →+X1 +X2 +X3 or X1 X2 X3 depending on default operator
And now it becomes weird:
X1 OR X2 AND X3 → X1 +X2 +X3 not equivalent to the correct version X1 OR (X2 AND X3) → X1 (+X2 +X3)
X1 AND X2 OR X3 → +X1 X2 X3 not equivalent to correct version (X1 AND X2) OR X3 → (+X1 +X2) X3
X1 AND NOT X2 → +X1 -X2 equivalent to X1 AND (NOT X2) which is correct but
X1 OR NOT X2 → X1 -X2 equivalent to X1 AND (NOT X2)but should be X1 OR (NOT X2) → X1 (*:* -X2)
X1 OR X2 OR NOT X3 OR NOT X4 → X1 X2 -X3 -X4 equivalent to (X1 OR X2) AND (NOT X3) AND (NOT X4) but should be X1 OR X2 OR (NOT X3) OR (NOT X4) → X1 X2 (*:* -X3) (*:* -X4)
For our version of the classic query parser, we guarantee correct precedence of Boolean operators. To remain compatible with the BooleanQuery syntax without AND and OR, we assume that 2 words without an operator between them are combined by an “invisible” binary operator which has a higher precedence than AND and OR. We came up with this idea together with our partner and customer DATEV in the research project SEMIARID and we intend to commit our solution to Lucene’s classic query parser. Here are 2 example input queries translated to a BooleanQuery involving the “invisible” binary operator assuming OR as default operator:
X1 OR X2 X3 X4 → X1 OR (X2 X3 X4) → X1 (X2 X3 X4)
X1 OR X2 AND X3 X4 → X1 OR (X2 AND (X3 X4)) → X1 (+X2 +( X3 X4))
Split on Whitespace
The split_on_whitespace parameter was introduced with Lucene 6.2 released in 2016. If it is set to true (IntraFind default), splitting a Boolean query consisting of more than one word without any operators (no AND, OR, NOT, +, -) is done by the query parser. If set to false, the sequence of words is handed over to the (field-specific) Analyzer and the respective methods for generating queries base on a TokenStream are responsible for generating the complete (sub-)query.
Classic query parser basically tokenizes on whitespaces. If the field-specific Analyzer uses a similar tokenization and if the generated TokenStream does not do any fancy stuff concerning token positions, we get a BooleanQuery of the tokens/words in both cases and the value for split_on_whitespace does not make much of a difference.
However, if we search on a keyword field and therefore use KeywordAnalyzer, a sequence of words like John Wick will be concatenated to one single token John_Wick containing a whitespace if split_on_whitespace=false. If split_on_whitespace=true, we get a BooleanQuery John AND/OR Wick where AND/OR depends on the default operator. So, depending on the value of split_on_whitespace we get 2 completely different queries which most certainly will produce different result sets. More generally speaking, with split_on_whitespace=false the queries X1 OR X2 and X1X2 (default operator OR assumed) may no longer be equivalent for some Analyzers, that is they could deliver different result sets.
Why was the split_on_whitespace option introduced in the first place? The main reason was that Lucene’s synonym expansion (SynonymFilter) is plugged into the TokenStream. To be able to match multiword expressions such as New York -> NY, word sequences have to be handed over to the Analyzer. Splitting them within the query parser is no longer an option. The IntraFind thesaurus-expansion mechanism also needs split_on_whitespace=false to enable multiword matches. I think it is a good idea to delegate the responsibility for tokenization of word sequences without explicit operator to the Analyzer. The configured Analyzer should know best how to do the splitting and how to do the thesaurus lookup. But as shown on the simple example with a KeywordAnalyzer, this can have serious consequences. By the way, the query parsers of ES / OS no longer allow to set the split_on_whitespace parameter. They fix its value to false. In our query parser we still support setting the parameter.
We have seen that the split_on_whitespace parameter can have a big impact on the generated query. This becomes even more relevant if we search on multiple fields.
Searching on multiple fields
Documents usually consist of more than one field. E.g., they might have a title and a main content field. This allows us to restrict search, e.g. to the title to get documents that are probably more relevant. We can also search on multiple fields at the same time, and we can boost hits in some fields (e.g. the title) stronger than hits in others. One would expect the same result set as if the fields on which we are searching were concatenated into one field, except for scoring and result order. Unfortunately, that’s a goal not easily achieved.
Consider the query X1 AND X2. If we just rewrite it to title:( X1 AND X2) OR content:( X1 AND X2) we will lose documents that contain X1 in their title and X2 in their content field or vice versa. We should rather rewrite it to (title: X1 OR content: X1) AND (title X2 OR content: X2). This means that expanding a query to multiple fields should be done on the level of atomic queries (term or phrase queries), in other words it should be done on the leaves of the query parse tree. That’s exactly what ES / OS do, except that they use DisjunctionMaxQuery to combine queries for different fields which is just a Boolean OR with a fancy scoring.
So, you may wonder, where is the problem here? Unfortunately, the whole idea collapses for split_on_whitespace=false. We already saw that different Analyzers may lead to different numbers of tokens. The KeywordAnalyzer produces 1 token for any sequence of words. But the number of tokens may vary for other Analyzers too. One Analyzer may decide to split my iphone13 into the 3 tokens my iphone 13 while another one produces just the 2 tokens my iphone13. This makes it impossible to do expansion on the level of atomic queries since we might have a different number of atomic queries for different fields.
ES / OS allow to control the expansion of search to multiple fields with the type parameter. Its default (best_fields) does the obvious, it generates a query for the whole sequence of words separately for each field. So, X1 X2 becomes title:( X1 X2) OR content:( X1 X2). Consider the query John Wick again. What happens if we have 2 fields, one for the first name and one for the last name and our default operator is AND. Bad luck, then we won’t get the document containing this name in the respective fields because the query rewrites to (+firstName:John +firstName:Wick) OR (+lastName:John +lastName:Wick). Furthermore, there is a kind of inconsistency with this approach. The queries John AND Wick or +John +Wick will match because expansion to the fields is done on the atomic (term) query level again since the presence of explicit operators enforces splitting by the query parser.
ES / OS offer the option cross_fields for the type parameter. This groups fields according to their Analyzer and does expansion on the atomic / term level for each group. Furthermore, it does some term frequency blending in order to even out the differences between fields. In the IntraFind query parser we have a similar solution but without the frequency blending and it’s also compatible with our expansion of queries with a thesaurus, which by the way was not easy.
For the query X1 OR X2 it does not matter whether we expand the complete query to multiple fields or whether we do it on the leaf level. The queries (title: X1 OR title: X2) OR (content: X1 OR content: X2) and (title: X1 OR content: X1) OR (title: X2 OR content: X2) both produce the same result set [4]. If the default operator is set to OR, we don’t have to care about cross_fields, or do we? Unfortunately, we do, because there is the minimum_should_match option.
Minimum Should Match
I already mentioned the default operator several times. It determines whether all words typed by the user without any Boolean operator must occur in a result document or whether 1 word is enough. Usually both alternatives are bad, at least for a query consisting of more than 2 or 3 words. Therefore, Lucene’s BooleanQuery supports the minimum_should_match option. With this parameter, we can specify a minimum number of the B-clauses that has to match. Like AND the minimum_should_match parameter requires expanding queries to multiple fields on the leaf level. On the Lucene level, the parameter is an absolute value and if set to a value greater than the number of B-clauses of a query, this query never matches. That’s not very user friendly and therefore, ES / OS as well as our query parser support more reasonable ways to specify it (e.g.s a percentage). With minimum_should_match we get something that is an in-between between AND and OR.
In ES / OS minimum_should_match is applied to the complete query returned by the parser if it is a BooleanQuery. This is a bit dangerous. Under certain circumstances, this BooleanQuery could be a disjunction of e.g., synonyms and not originate from a user query consisting of several words. We have implemented a more sophisticated check to assure, that minimum_should_match is only applied if the user typed several words or a Boolean query.
Furthermore, we think that minimum_should_match is actually intended to improve search for queries consisting of several words without any explicit Boolean operators. If a user explicitly uses Boolean operators, he will not expect the Boolean logic of explicit optional clauses to be overwritten by such a parameter. Therefore, we (optionally) restrict the application of minimum_should_match to queries without explicit operators.
Dieser Artikel wurde nur in englischer Sprache veröffentlicht.
[1] This is like the precedence of multiplication over addition in calculus.
[2] For simplicity upper case letters should be regarded as single words here, though they could be arbitrary complex sub-queries.
[3] It should return all documents not containing any of the C-clauses.
[4] That’s because the operator OR as well as AND is commutative and associative.