
Splitting and Concatenation

Splitting and concatenation are key relevance features we use to help your users find what they're looking for. You can enable them with the typoTolerance parameter. Like typo tolerance, these features let your users find results even if what they're searching for doesn't exactly match your records.

If a user searches for parkbench, splitting allows for a match on park bench. Concatenation combines words that are separated by a space: it allows nano second to match nanosecond.

To get a deep understanding of these features, we recommend reading about tokenization first.

Splitting

Splitting is a technique we apply only at query time. For each non-separator token in a query, we try to split the token into two parts at each possible position. We do this up until the twelfth character, meaning that the first part can be up to 12 characters long. The second part can be any length.

For example, we split the query Katherinejohnson into the following tokens:

  • katherinejohnson
  • k, atherinejohnson
  • ka, therinejohnson
  • kat, herinejohnson
  • kath, erinejohnson
  • kathe, rinejohnson
  • kather, inejohnson
  • katheri, nejohnson
  • katherin, ejohnson
  • katherine, johnson
  • katherinej, ohnson
  • katherinejo, hnson
  • katherinejoh, nson

For both parts, we search the indexed tokens for matches. If a “valid” split is found, that is, one where both parts match tokens in the index, it is kept as an alternative to the original query. In this example, since an index may have the tokens katherine and johnson, but not katherinejohnson, these two parts are kept as terms to search on.

We only split query words into two parts, not more. For example, the query jamesearljones is split into james and earljones, and into jamesearl and jones, but not into the three tokens james, earl, and jones. We limit splitting to one split per query word for performance reasons.
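
As an illustration of these rules, here's a minimal Python sketch of query-time splitting. The set of index tokens and the function name are assumptions made for the example, not the engine's actual implementation.

    # A sketch of query-time splitting. Both `split_alternatives` and
    # `index_tokens` are illustrative names, not part of the engine.
    MAX_PREFIX = 12  # the first part can be at most 12 characters long

    def split_alternatives(word, index_tokens):
        """Return the "valid" two-part splits of `word`: splits where
        both parts match tokens already present in the index."""
        alternatives = []
        for i in range(1, min(len(word) - 1, MAX_PREFIX) + 1):
            first, second = word[:i], word[i:]
            if first in index_tokens and second in index_tokens:
                alternatives.append((first, second))
        return alternatives

    print(split_alternatives("katherinejohnson", {"katherine", "johnson"}))
    # [('katherine', 'johnson')]
    print(split_alternatives("jamesearljones", {"james", "earl", "jones"}))
    # [] -- words are split into two parts at most, never three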

Concatenation

At indexing time

The engine performs some concatenation at indexing time. This happens during tokenization.

We use the following separators to concatenate at indexing time: the period (.), the apostrophe ('), and the registered trademark (®) and copyright (©) symbols. This covers the most typical concatenation use cases, such as acronyms (e.g., B.C.E.) and contractions (e.g., don't, we're).

For example, the text hello.world forms the tokens hello, ., and world, as well as the concatenated token helloworld. Since . is a separator, we do not index it by default.

We also don't index non-separator tokens around these separators when they're shorter than three characters. For example, wasn't yields the tokens wasn, ', t, and wasnt, but we only index wasn and wasnt. Similarly, B.C.E. yields BCE but not B, C, or E.
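
As a rough illustration, the following Python sketch mimics these indexing-time rules. The names are assumptions, the separator handling is simplified, and the numeric special cases covered later on this page are ignored.

    import re

    # A simplified sketch of indexing-time concatenation. It ignores the
    # numeric special cases described later on this page.
    CONCAT_SEPARATORS = {".", "'", "\u00ae", "\u00a9"}  # . ' ® ©

    def indexed_tokens(word):
        # Split on the concatenation separators, keeping them as tokens
        parts = [p for p in re.split(r"([.'\u00ae\u00a9])", word) if p]
        words = [p for p in parts if p not in CONCAT_SEPARATORS]
        # Non-separator tokens shorter than three characters aren't indexed
        tokens = [w for w in words if len(w) >= 3]
        concatenated = "".join(words)  # e.g. wasn + t -> wasnt
        if len(words) > 1 and concatenated not in tokens:
            tokens.append(concatenated)
        return tokens

    print(indexed_tokens("hello.world"))  # ['hello', 'world', 'helloworld']
    print(indexed_tokens("wasn't"))       # ['wasn', 'wasnt']
    print(indexed_tokens("B.C.E."))       # ['BCE']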

At query time

At query time, we apply the same concatenation to queries as we do to records at indexing time. We also perform some additional concatenation on the query:

  • Bi-gram concatenation: We concatenate adjacent pairs of tokens in the query string for the first 5 words.
  • All-word concatenation: We concatenate all the words in the query if it has three or more words.

For example, the query a wonderful day in the neighborhood forms these tokens:

  • From initial tokenization: a, wonderful, day, in, the, neighborhood
  • From bi-gram concatenation: awonderful, wonderfulday, dayin, inthe
  • From all-word concatenation: awonderfuldayintheneighborhood
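
A minimal Python sketch of these two query-time rules might look like this. The function name is an assumption, and the numeric restriction covered below is left out:

    # A sketch of the additional query-time concatenation. The numeric
    # restriction described below is left out for clarity.
    def query_concatenations(query):
        words = query.split()
        extra = []
        # Bi-gram concatenation: adjacent pairs among the first 5 words
        for first, second in zip(words[:4], words[1:5]):
            extra.append(first + second)
        # All-word concatenation: queries of three or more words only
        if len(words) >= 3:
            extra.append("".join(words))
        return extra

    print(query_concatenations("a wonderful day in the neighborhood"))
    # ['awonderful', 'wonderfulday', 'dayin', 'inthe',
    #  'awonderfuldayintheneighborhood']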

Special considerations for numeric characters

In most cases, the engine does not distinguish between alphabetic and numeric characters. For example, the queries m55, mfivefive, and 5mm are all tokenized as a single word.

We introduce some special behavior when numbers, separators, and concatenation interact.

Short-form concatenation with numbers

Short-form concatenation (for example, turning B.C.E. into only the token BCE) is handled differently with numbers: if the first character of a token is numeric, we don't concatenate it with adjacent tokens. So m.55 forms the token m55, but 5.mm forms the tokens 5 and mm, not 5mm.

The reason for this special behavior is to handle floating-point numbers correctly. For example, you wouldn't want 1.3GB to be tokenized as 13GB.

Because periods (.) denote a decimal point in numeric text, even short (one or two character) non-separator tokens are indexed when numbers are involved. For example, 1.5 yields the tokens 1, ., and 5. Since . is a separator, we don't index it by default.

Additionally, the phrase 3.GB yields the tokens 3 and GB, even though GB is not numeric. As long as one of the characters surrounding a separator is numeric, we index any surrounding non-separator tokens, even if they are short (one or two characters) or alphabetic. We do not index the concatenated token 3GB, because 3 is a number and we do not concatenate tokens beginning with numbers.
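
Taken together, these numeric rules could be sketched as follows. This is a simplification, not the engine's implementation: it only handles periods, and it approximates "the characters surrounding a separator" by checking the first and last characters of each part.

    # A sketch of short-form concatenation around periods when numbers are
    # involved; a simplification, not the engine's implementation.
    def numeric_indexed_tokens(word):
        # Keep only the non-separator parts on either side of each period
        parts = [p for p in word.split(".") if p]
        numeric = any(p[0].isdigit() or p[-1].isdigit() for p in parts)
        # When numbers are involved, even 1-2 character tokens are indexed
        tokens = [p for p in parts if numeric or len(p) >= 3]
        # Never concatenate when the first token starts with a digit
        if len(parts) > 1 and not parts[0][0].isdigit():
            tokens.append("".join(parts))
        return tokens

    print(numeric_indexed_tokens("m.55"))   # ['m', '55', 'm55']
    print(numeric_indexed_tokens("5.mm"))   # ['5', 'mm'] -- no 5mm
    print(numeric_indexed_tokens("1.5"))    # ['1', '5'] -- no 15
    print(numeric_indexed_tokens("3.GB"))   # ['3', 'GB'] -- no 3GB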

Bi-gram concatenation with numbers

We do not perform bi-gram concatenation on adjacent tokens when the first token ends with a digit and the second starts with a digit. This is because for queries like XC90 2020 Volvo, you wouldn’t want to search for XC902020.

This restriction on concatenation can lead to some unexpected behavior when searching for an ISBN (International Standard Book Number) or other hyphenated numbers. An ISBN is a hyphenated, 13-digit identifier for books, for example 978-3-16-148410-0.

If you have indexed this as 9783161484100, we return the appropriate record even when a user searches for 978-3-16-148410-0 or 978 3 16 148410 0, thanks to all-word concatenation. However, the queries 978316148410-0 and 978316148410 0 don't return this record, because we don't apply bi-gram concatenation to adjacent tokens that end and start with digits.
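
To see why, here's a Python sketch that combines the query-time concatenation rules with the bi-gram restriction. Treating hyphens as plain separators is a simplification, and the names remain illustrative:

    # A sketch of why some ISBN query formats match and others don't,
    # treating hyphens as plain separators for simplicity.
    def query_concatenations(query):
        words = query.replace("-", " ").split()
        extra = []
        for first, second in zip(words[:4], words[1:5]):
            # No bi-gram when a digit meets a digit across the boundary
            if not (first[-1].isdigit() and second[0].isdigit()):
                extra.append(first + second)
        if len(words) >= 3:
            extra.append("".join(words))  # all-word concatenation
        return extra

    print(query_concatenations("978-3-16-148410-0"))
    # ['9783161484100'] -- found through all-word concatenation
    print(query_concatenations("978316148410 0"))
    # [] -- two words only: the bi-gram is blocked and there's
    #       no all-word token, so the record isn't found

Under these rules, only all-word concatenation can produce 9783161484100, which is why the two-word variants miss the record.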

This is why we recommend indexing all possible formats of such identifiers. This allows users to find the desired record regardless of the spacing and special characters they use when querying. To learn best practices for this use case, read our guide on searching hyphenated attributes such as SKUs, ISBNs, phone numbers, and serial numbers.
