Unified Highlighter to support matched_fields (#107640)
Add support to the Unified highlighter to combine matches on multiple fields
to highlight a single field: "matched_fields".

Based on Lucene PR: apache/lucene#13268

The Lucene PR is based on the concept of masked fields, where the masked
fields differ from the original highlighted field. This Elasticsearch PR
instead reuses the existing highlighter parameter
"matched_fields".
mayya-sharipova committed May 9, 2024
1 parent a2c947e commit 2337eb0
Showing 9 changed files with 723 additions and 149 deletions.
6 changes: 6 additions & 0 deletions docs/changelog/107640.yaml
@@ -0,0 +1,6 @@
pr: 107640
summary: "Unified Highlighter to support matched_fields "
area: Highlighting
type: enhancement
issues:
- 5172
150 changes: 28 additions & 122 deletions docs/reference/search/search-your-data/highlighting.asciidoc
@@ -46,8 +46,9 @@ for each field.
The `unified` highlighter uses the Lucene Unified Highlighter. This
highlighter breaks the text into sentences and uses the BM25 algorithm to score
individual sentences as if they were documents in the corpus. It also supports
accurate phrase and multi-term (fuzzy, prefix, regex) highlighting. This is the
default highlighter.
accurate phrase and multi-term (fuzzy, prefix, regex) highlighting. The `unified`
highlighter can combine matches from multiple fields into one result (see
`matched_fields`). This is the default highlighter.

[discrete]
[[plain-highlighter]]
@@ -199,10 +200,27 @@ include the search query as part of the `highlight_query`.

matched_fields:: Combine matches on multiple fields to highlight a single field.
This is most intuitive for multifields that analyze the same string in different
ways. All `matched_fields` must have `term_vector` set to
`with_positions_offsets`, but only the field to which
the matches are combined is loaded so only that field benefits from having
`store` set to `yes`. Only valid for the `fvh` highlighter.
ways. Valid for the `unified` and `fvh` highlighters, but the behavior of this
option is different for each highlighter.

For the `unified` highlighter:

- The `matched_fields` array should **not** contain the original field that you want to highlight. The
original field is automatically added to `matched_fields`, and there is no
way to exclude its matches when highlighting.
- `matched_fields` and the original field can be indexed with different strategies (with or
without `offsets`, with or without `term_vectors`).
- Only the original field to which the matches are combined is loaded, so only that field
benefits from having `store` set to `yes`.
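
As a minimal sketch of the `unified` behavior listed above, the request below
highlights `comment` and pulls in matches from `comment.plain`. It assumes a
mapping in which `comment` uses the `english` analyzer and `comment.plain` is a
multifield using the `standard` analyzer; the query text is illustrative only.
`comment` itself is left out of `matched_fields` because the original field is
added automatically.

[source,console]
--------------------------------------------------
GET /_search
{
  "query": {
    "query_string": {
      "query": "running scissors",
      "fields": [ "comment", "comment.plain" ]
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "comment": {
        "matched_fields": [ "comment.plain" ],
        "type": "unified"
      }
    }
  }
}
--------------------------------------------------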

For the `fvh` highlighter:

- The `matched_fields` array may or may not contain the original field,
depending on your needs. If you want to include the original field's matches in
highlighting, add it to the `matched_fields` array.
- All `matched_fields` must have `term_vector` set to `with_positions_offsets`.
- Only the original field to which the matches are combined is loaded, so only that field
benefits from having `store` set to `yes`.
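
The `term_vector` requirement for `fvh` could be met with a mapping along the
lines of the sketch below. The index name `my-index-000001` is hypothetical, and
the `english`/`standard` analyzer pairing on `comment` and `comment.plain` is an
assumption chosen to illustrate a typical multifield setup, not something this
change prescribes.

[source,console]
--------------------------------------------------
PUT /my-index-000001
{
  "mappings": {
    "properties": {
      "comment": {
        "type": "text",
        "analyzer": "english",
        "term_vector": "with_positions_offsets",
        "fields": {
          "plain": {
            "type": "text",
            "analyzer": "standard",
            "term_vector": "with_positions_offsets"
          }
        }
      }
    }
  }
}
--------------------------------------------------

With such a mapping, an `fvh` request that lists both `comment` and
`comment.plain` in `matched_fields` can combine their matches into the
highlighted `comment` field.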

no_match_size:: The amount of text you want to return from the beginning
of the field if there are no matching fragments to highlight. Defaults
@@ -498,133 +516,21 @@ GET /_search
[discrete]
=== Combine matches on multiple fields

WARNING: This is only supported by the `fvh` highlighter
WARNING: Supported by the `unified` and `fvh` highlighters.

The Fast Vector Highlighter can combine matches on multiple fields to
The Unified and Fast Vector Highlighters can combine matches on multiple fields to
highlight a single field. This is most intuitive for multifields that
analyze the same string in different ways. All `matched_fields` must have
`term_vector` set to `with_positions_offsets` but only the field to which
the matches are combined is loaded so only that field would benefit from having
`store` set to `yes`.

In the following examples, `comment` is analyzed by the `english`
analyzer and `comment.plain` is analyzed by the `standard` analyzer.

[source,console]
--------------------------------------------------
GET /_search
{
"query": {
"query_string": {
"query": "comment.plain:running scissors",
"fields": [ "comment" ]
}
},
"highlight": {
"order": "score",
"fields": {
"comment": {
"matched_fields": [ "comment", "comment.plain" ],
"type": "fvh"
}
}
}
}
--------------------------------------------------
// TEST[setup:my_index]

The above matches both "run with scissors" and "running with scissors"
and would highlight "running" and "scissors" but not "run". If both
phrases appear in a large document then "running with scissors" is
sorted above "run with scissors" in the fragments list because there
are more matches in that fragment.

[source,console]
--------------------------------------------------
GET /_search
{
"query": {
"query_string": {
"query": "running scissors",
"fields": ["comment", "comment.plain^10"]
}
},
"highlight": {
"order": "score",
"fields": {
"comment": {
"matched_fields": ["comment", "comment.plain"],
"type" : "fvh"
}
}
}
}
--------------------------------------------------
// TEST[setup:my_index]
analyze the same string in different ways.

The above highlights "run" as well as "running" and "scissors" but
still sorts "running with scissors" above "run with scissors" because
the plain match ("running") is boosted.
include::{es-ref-dir}/tab-widgets/highlighting-multi-fields-widget.asciidoc[]

[source,console]
--------------------------------------------------
GET /_search
{
"query": {
"query_string": {
"query": "running scissors",
"fields": [ "comment", "comment.plain^10" ]
}
},
"highlight": {
"order": "score",
"fields": {
"comment": {
"matched_fields": [ "comment.plain" ],
"type": "fvh"
}
}
}
}
--------------------------------------------------
// TEST[setup:my_index]

The above query wouldn't highlight "run" or "scissor" but shows that
it is just fine not to list the field to which the matches are combined
(`comment`) in the matched fields.

[NOTE]
Technically it is also fine to add fields to `matched_fields` that
don't share the same underlying string as the field to which the matches
are combined. The results might not make much sense and if one of the
matches is off the end of the text then the whole query will fail.

[NOTE]
===================================================================
There is a small amount of overhead involved with setting
`matched_fields` to a non-empty array so always prefer
[source,js]
--------------------------------------------------
"highlight": {
"fields": {
"comment": {}
}
}
--------------------------------------------------
// NOTCONSOLE
to
[source,js]
--------------------------------------------------
"highlight": {
"fields": {
"comment": {
"matched_fields": ["comment"],
"type" : "fvh"
}
}
}
--------------------------------------------------
// NOTCONSOLE
===================================================================
@@ -0,0 +1,40 @@
++++
<div class="tabs" data-tab-group="highligther">
<div role="tablist" aria-label="Highlighting based on multi fields">
<button role="tab"
aria-selected="true"
aria-controls="unified-tab"
id="unified-highlighter">
Unified
</button>
<button role="tab"
aria-selected="false"
aria-controls="fvh-tab"
id="fvh-highlighter"
tabindex="-1">
FVH
</button>
</div>
<div tabindex="0"
role="tabpanel"
id="unified-tab"
aria-labelledby="unified-highlighter">
++++

include::highlighting-multi-fields.asciidoc[tag=unified]

++++
</div>
<div tabindex="0"
role="tabpanel"
id="fvh-tab"
aria-labelledby="fvh-highlighter"
hidden="">
++++

include::highlighting-multi-fields.asciidoc[tag=fvh]

++++
</div>
</div>
++++
