Splitting and Concatenation
Splitting and concatenation are key relevance features we use to help your users find what they are looking for. You can enable these features via the `typoTolerance` parameter, and like typo tolerance, they let your users find results even if what they are searching for does not match your records exactly.
If a user searches for `parkbench`, splitting allows for a match on `park bench`. Concatenation is when we combine words that are separated by a space: it allows `nano second` to match with `nanosecond`.
To get a deep understanding of these features, we recommend reading about tokenization first.
Splitting
Splitting is a technique we apply only at query time. For each non-separator token in a query, we try to split the token into two parts at each possible position. We do this up until the twelfth character, meaning that the first part can be up to 12 characters long. The second part can be any length.
For example, we split the query `Katherinejohnson` into the following tokens:
- `katherinejohnson`
- `k`, `atherinejohnson`
- `ka`, `therinejohnson`
- `kat`, `herinejohnson`
- `kath`, `erinejohnson`
- `kathe`, `rinejohnson`
- `kather`, `inejohnson`
- `katheri`, `nejohnson`
- `katherin`, `ejohnson`
- `katherine`, `johnson`
- `katherinej`, `ohnson`
- `katherinejo`, `hnson`
- `katherinejoh`, `nson`
For both parts, we search the indexed tokens for matches. If a “valid” split is found, that is, one whose parts both match tokens in the index, it is kept as an alternative to the original query. In this example, since an index may have the tokens `katherine` and `johnson`, but not `katherinejohnson`, these two parts are kept as terms to search on.
We only split query words into two parts, not more. For example, the query `jamesearljones` is split into `james` and `earljones`, and into `jamesearl` and `jones`, but not into the three tokens `james`, `earl`, and `jones`. We limit splitting to one split per query word for performance reasons.
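To make the splitting logic concrete, here is a minimal Python sketch of the idea. It assumes a hypothetical set of indexed tokens (`index_tokens`); the names and structure are illustrative only, not how the engine is implemented.

```python
# Illustrative sketch of query-time splitting (not the engine's actual code).
# Each word is split into at most two parts, and the split point is only
# tried up to the twelfth character.
def candidate_splits(word, index_tokens, max_prefix=12):
    splits = []
    for i in range(1, min(len(word), max_prefix + 1)):
        left, right = word[:i], word[i:]
        # A split is "valid" only if both parts match tokens already in the index.
        if left in index_tokens and right in index_tokens:
            splits.append((left, right))
    return splits

index_tokens = {"katherine", "johnson", "james", "earl", "jones"}
print(candidate_splits("katherinejohnson", index_tokens))
# [('katherine', 'johnson')]
```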
Concatenation
At indexing time
The engine performs some concatenation at indexing time. This happens during tokenization.
We use the following separators to concatenate at indexing time: period (`.`), apostrophe (`'`), and the registered (`®`) and copyright (`©`) symbols. This covers the most typical concatenation use cases, such as acronyms (e.g., `B.C.E.`) and contractions (e.g., `don't`, `we're`).
For example, the text `hello.world` forms the tokens `hello`, `.`, `world`, and, due to concatenation, `helloworld`. Since `.` is a separator, we do not index it by default.
If non-separator tokens created from concatenation are less than three characters long, we also do not index them. For example, `wasn't` yields the tokens `wasn`, `'`, `t`, and `wasnt`, but we only index `wasn` and `wasnt`. Similarly, `B.C.E.` yields `BCE` but not `B`, `C`, or `E`.
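As a rough illustration, the sketch below mimics this indexing-time concatenation for a single word. The separator set and the three-character rule follow the description above; the function name and return value are assumptions made for the example, not an engine API.

```python
import re

# Separators that trigger concatenation at indexing time: . ' ® ©
CONCAT_SEPARATORS = {".", "'", "®", "©"}

def index_tokens_for(word):
    # Split on the concatenation separators, keeping them as their own tokens.
    parts = [p for p in re.split(r"([.'®©])", word) if p]
    # The concatenated form drops the separators entirely.
    concatenated = "".join(p for p in parts if p not in CONCAT_SEPARATORS)
    # Separators and non-separator tokens shorter than three characters aren't indexed.
    indexed = [p for p in parts if p not in CONCAT_SEPARATORS and len(p) >= 3]
    if len(concatenated) >= 3 and concatenated not in indexed:
        indexed.append(concatenated)
    return indexed

print(index_tokens_for("wasn't"))       # ['wasn', 'wasnt']
print(index_tokens_for("B.C.E."))       # ['BCE']
print(index_tokens_for("hello.world"))  # ['hello', 'world', 'helloworld']
```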
At query time
We apply the same concatenation to queries at query time as we do to records at indexing time. We also perform some additional concatenation on the query:
- Bi-gram concatenation: We concatenate adjacent pairs of tokens among the first five words of the query string.
- All-word concatenation: We concatenate all the words in the query, if the query has three or more words.
We form these tokens from the query `a wonderful day in the neighborhood`:
- From initial tokenization: `a`, `wonderful`, `day`, `in`, `the`, `neighborhood`
- From bi-gram concatenation: `awonderful`, `wonderfulday`, `dayin`, `inthe`
- From all-word concatenation: `awonderfuldayintheneighborhood`
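The following sketch shows how these two extra forms of query-time concatenation could be produced; the helper names are hypothetical.

```python
def bigram_concatenations(words, max_words=5):
    # Concatenate adjacent pairs among the first five words of the query.
    first = words[:max_words]
    return [first[i] + first[i + 1] for i in range(len(first) - 1)]

def all_word_concatenation(words):
    # Concatenate every word, but only for queries of three or more words.
    return ["".join(words)] if len(words) >= 3 else []

query = "a wonderful day in the neighborhood".split()
print(bigram_concatenations(query))   # ['awonderful', 'wonderfulday', 'dayin', 'inthe']
print(all_word_concatenation(query))  # ['awonderfuldayintheneighborhood']
```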
Special considerations for numeric characters
In most cases, the engine does not distinguish between alphabetic and numeric characters. For example, the queries `m55`, `mfivefive`, and `5mm` are all tokenized as a single word.
We introduce some special behavior when numbers, separators, and concatenation interact.
Short-form concatenation with numbers
Short-form concatenation, for example, turning `B.C.E.` into only the token `BCE`, is handled differently with numbers: if the first character of a token is numeric, we do not concatenate it with adjacent tokens. So `m.55` forms the token `m55`, but `5.mm` forms the tokens `5` and `mm`, not `5mm`.
The reason for this special behavior is to handle floating-point numbers correctly. For example, you wouldn’t want `1.3GB` to be tokenized as `13GB`.
Because periods (`.`) denote a decimal point in numerical text, even short (one- or two-character) non-separator tokens are indexed when numbers are involved. For example, `1.5` yields the tokens `1`, `.`, and `5`. Since `.` is a separator, we don’t index it by default.
Additionally, the phrase `3.GB` yields the tokens `3` and `GB`, even though `GB` is not numerical. As long as one of the characters surrounding a separator is numeric, we index any surrounding non-separator tokens, even if they are short (one or two characters) or alphabetic. We do not index the concatenated token `3GB`, because `3` is a number and we do not concatenate tokens beginning with numbers.
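The sketch below extends the earlier indexing-time example with these numeric exceptions. It assumes the digit check applies to the word's leading token, which matches the examples above; as before, the names are illustrative only.

```python
import re

SEPARATORS = {".", "'", "®", "©"}

def index_tokens_with_numbers(word):
    parts = [p for p in re.split(r"([.'®©])", word) if p]
    non_sep = [p for p in parts if p not in SEPARATORS]

    # Assumption: no concatenation when the leading token begins with a digit,
    # so 5.mm, 3.GB, and 1.3GB are never glued back together.
    can_concat = bool(non_sep) and not non_sep[0][0].isdigit()

    # When a character next to a separator is numeric, even short tokens are indexed.
    numeric_context = any(p[0].isdigit() or p[-1].isdigit() for p in non_sep)
    indexed = [p for p in non_sep if len(p) >= 3 or numeric_context]

    if can_concat:
        concatenated = "".join(non_sep)
        if len(concatenated) >= 3 and concatenated not in indexed:
            indexed.append(concatenated)
    return indexed

print(index_tokens_with_numbers("m.55"))  # ['m', '55', 'm55']
print(index_tokens_with_numbers("5.mm"))  # ['5', 'mm']
print(index_tokens_with_numbers("3.GB"))  # ['3', 'GB']
print(index_tokens_with_numbers("1.5"))   # ['1', '5']
```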
Bi-gram concatenation with numbers
We do not perform bi-gram concatenation on adjacent tokens when the first token ends with a digit and the second starts with a digit. This is because, for queries like `XC90 2020 Volvo`, you wouldn’t want to search for `XC902020`.
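The bi-gram helper from the earlier query-time sketch could enforce this restriction as shown below; again, the function name is illustrative only.

```python
def bigram_concatenations(words, max_words=5):
    first = words[:max_words]
    pairs = []
    for left, right in zip(first, first[1:]):
        # Skip the pair when the left word ends with a digit and the right
        # word starts with one (e.g. "xc90" + "2020").
        if left[-1].isdigit() and right[0].isdigit():
            continue
        pairs.append(left + right)
    return pairs

print(bigram_concatenations(["xc90", "2020", "volvo"]))  # ['2020volvo']
```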
This restriction on concatenation can lead to some unexpected behavior when searching for an ISBN (International Standard Book Number) or other hyphenated numbers. An ISBN is a hyphenated, 13-digit identifier for books, for example `978-3-16-148410-0`.
If you have indexed this as `9783161484100`, we return the appropriate record even when a user searches for `978-3-16-148410-0` or `978 3 16 148410 0`, because of all-word concatenation. However, the queries `978316148410-0` and `978316148410 0` don’t return this record, because we do not apply bi-gram concatenation to adjacent tokens ending and starting with numbers.
This is why we recommend indexing all possible formats of such identifiers. This allows users to find the desired record, regardless of the spacing and special characters they use when querying. You can read our guide on searching hyphenated attributes, such as SKUs, ISBNs, phone numbers, and serial numbers, to learn best practices for this use case.
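As a simple illustration, here is one way to precompute several formats of the same identifier before indexing a record. The attribute names and record shape are hypothetical, not a required schema.

```python
isbn = "978-3-16-148410-0"

record = {
    "title": "Example Book",
    "isbn": isbn,
    "isbn_variants": [
        isbn,                    # 978-3-16-148410-0
        isbn.replace("-", " "),  # 978 3 16 148410 0
        isbn.replace("-", ""),   # 9783161484100
    ],
}
```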