nik9000/expiremental-highlighter

Experimental Highlighter

This project has moved. Sorry for the inconvenience. I'm leaving this tombstone here, spelling mistakes and all so folks can find the new repository.

Old Readme

Text highlighter for Java designed to be pluggable enough for easy experimentation. The idea is that it should be possible to play with how hits are weighed or how they are grouped into snippets without knowing about the guts of Lucene or Elasticsearch.

Comes in three flavors:

  • Core: A jar with no dependencies containing most of the interesting logic
  • Lucene: A jar containing a bridge between the core and lucene
  • Elasticsearch: An Elasticsearch plugin

Elasticsearch value proposition

This highlighter:

  • Doesn't need offsets in postings or term enums with offsets but can use either to speed itself up.
  • Can fragment like the Postings Highlighter, the Fast Vector Highlighter, or it can highlight the entire field.
  • Can combine hits from multiple different fields (aka matched_fields support).
  • Can boost matches that appear early in the document.

This highlighter does not (currently):

  • Respect phrase matches at all (all phrases are reduced to terms)
  • Support require_field_match

Elasticsearch installation

Experimental Highlighter Plugin | ElasticSearch
master                          | 1.1.0 -> master

Install it like so:

./bin/plugin --install org.wikimedia.search.highlighter/experimental-highlighter-elasticsearch-plugin/0.0.1

Then you can use it by searching like so:

{
  "_source": false,
  "query": {
    "query_string": {
      "query": "hello world"
    }
  },
  "highlight": {
    "order": "score",
    "fields": {
      "title": {
        "number_of_fragments": 1,
        "type": "experimental"
      }
    }
  }
}

Elasticsearch options

The fragmenter field defaults to scan but can also be set to sentence or none. scan produces results that look like the Fast Vector Highlighter. sentence produces results that look like the Postings Highlighter. none won't fragment on anything, so it is cleaner if you have to highlight the whole field. Multi-valued fields will always fragment between each value, even on none. Example:

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "fragmenter": "sentence",
        "options": {
          "locale": "en_us"
        }
      }
    }
  }

If using the sentence fragmenter you can specify the locale used for sentence rules with the locale option as above.
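To make the role of the locale concrete, here is a minimal sketch of locale-aware sentence splitting in plain Java. It uses the JDK's java.text.BreakIterator; whether the plugin's sentence fragmenter is built on exactly this class is an assumption, and SentenceSplitSketch is a hypothetical name, not part of this project.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration only; not a class from this project.
final class SentenceSplitSketch {
    /** Splits text into sentences using the sentence rules for the given locale. */
    static List<String> sentences(String text, Locale locale) {
        BreakIterator it = BreakIterator.getSentenceInstance(locale);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        // Walk boundary to boundary; each [start, end) span is one sentence.
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                out.add(sentence);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Locale.US stands in for the "en_us" value used in the request above.
        for (String s : sentences("Hello world. How are you?", Locale.US)) {
            System.out.println(s);
        }
    }
}
```

Different locales can place sentence boundaries differently, which is why the option matters for text that is not in English.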

Each fragmenter has different no_match_size strategies based on the spirit of the fragmenter.

The top_scoring option can be set to true while sorting fragments by source to return only the top scoring fragments but leave them in source order. Example:

  "highlight": {
    "fields": {
      "text": {
        "type": "experimental",
        "number_of_fragments": 2,
        "fragmenter": "sentence",
        "sort": "source",
        "options": {
           "locale": "en_us",
           "top_scoring": true
        }
      }
    }
  }
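The selection that top_scoring describes can be sketched in a few lines of Java: pick the best number_of_fragments fragments by score, then put the survivors back in source order. This is only an illustration of the idea; the Fragment type and the method name are hypothetical and do not come from this project.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch for illustration only; not a class from this project.
final class TopScoringSketch {
    /** A fragment reduced to what matters here: where it starts and how it scored. */
    static final class Fragment {
        final int startOffset;
        final double score;

        Fragment(int startOffset, double score) {
            this.startOffset = startOffset;
            this.score = score;
        }
    }

    /** Keeps the highest scoring fragments, then returns them in source order. */
    static List<Fragment> topScoringInSourceOrder(List<Fragment> fragments, int numberOfFragments) {
        List<Fragment> byScore = new ArrayList<>(fragments);
        byScore.sort(Comparator.comparingDouble((Fragment f) -> f.score).reversed());
        int keep = Math.max(0, Math.min(numberOfFragments, byScore.size()));
        List<Fragment> picked = new ArrayList<>(byScore.subList(0, keep));
        picked.sort(Comparator.comparingInt((Fragment f) -> f.startOffset));
        return picked;
    }
}
```

With fragments at offsets 0, 50, 100 and 150 scoring 1.0, 3.0, 0.5 and 2.0, asking for two fragments keeps the ones at 50 and 150, in that order.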

The default_similarity option defaults to true for queries with more than one term. It will weigh each matched term using Lucene's default similarity model, similarly to how the Fast Vector Highlighter weighs terms. It can be set to false to leave out that weighing. If there is only a single term in the query it will never be used.

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "default_similarity": false
        }
      }
    }
  }

The hit_source option can force detecting matched terms from a particular source. It can be either postings, vectors, or analyze. If it is set to postings but the field isn't indexed with index_options set to offsets, or set to vectors but term_vector isn't set to with_positions_offsets, then the highlighter throws back an error. It defaults to using the first option that wouldn't throw an error.

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "options": {
          "hit_source": "analyze"
        }
      }
    }
  }
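The selection rule for hit_source can be sketched as a small decision function. This is a reading of the description above, not the plugin's code: the class, enum, and parameter names are hypothetical, and the fallback order postings, vectors, analyze is assumed from the order in which the options are listed.

```java
// Hypothetical sketch for illustration only; not a class from this project.
final class HitSourceSketch {
    enum HitSource { POSTINGS, VECTORS, ANALYZE }

    /**
     * @param requested the hit_source option, or null when it is left out
     * @param postingsHaveOffsets true if index_options is set to offsets
     * @param vectorsHaveOffsets true if term_vector is set to with_positions_offsets
     */
    static HitSource choose(HitSource requested, boolean postingsHaveOffsets, boolean vectorsHaveOffsets) {
        if (requested != null) {
            // An explicit request is honored or rejected, never silently replaced.
            if (requested == HitSource.POSTINGS && !postingsHaveOffsets) {
                throw new IllegalArgumentException("postings requested but index_options is not offsets");
            }
            if (requested == HitSource.VECTORS && !vectorsHaveOffsets) {
                throw new IllegalArgumentException("vectors requested but term_vector is not with_positions_offsets");
            }
            return requested;
        }
        // Assumed fallback order: the first option that would not throw an error.
        if (postingsHaveOffsets) {
            return HitSource.POSTINGS;
        }
        if (vectorsHaveOffsets) {
            return HitSource.VECTORS;
        }
        return HitSource.ANALYZE;
    }
}
```

Under these assumptions analyze always works, because it only needs the field value and its analyzer, so it is the last resort.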

The boost_before option lets you set up boosts before positions. For example, this will multiply the weight of matches before the 20th position by 5 and before the 100th position by 1.5.

  "highlight": {
    "fields": {
      "title": {
        "type": "experimental",
        "order": "score",
        "options": {
          "boost_before": {
            "20": 5,
            "100": 1.5
          }
        }
      }
    }
  }

Note that the position is not reset between multiple values of the same field but is handled independently for each of the matched_fields. Note also that boost_before works with top_scoring.
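One way to picture boost_before is as a lookup from a match's position to a multiplier. The sketch below assumes that only the nearest threshold above the position applies (so a match at position 5 is boosted by 5, not by 5 and then 1.5) and that positions at or past the last threshold are left unboosted; the text above does not spell this out, and BoostBeforeSketch is a hypothetical name, not part of this project.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch for illustration only; not a class from this project.
final class BoostBeforeSketch {
    // Threshold position -> boost for matches strictly before that position.
    private final TreeMap<Integer, Double> boosts = new TreeMap<>();

    BoostBeforeSketch add(int beforePosition, double boost) {
        boosts.put(beforePosition, boost);
        return this;
    }

    /** Returns the multiplier for a match at the given position. */
    double boostFor(int position) {
        // Assumption: the smallest threshold strictly greater than the position wins.
        Map.Entry<Integer, Double> entry = boosts.higherEntry(position);
        return entry == null ? 1.0 : entry.getValue();
    }

    public static void main(String[] args) {
        BoostBeforeSketch boost = new BoostBeforeSketch().add(20, 5).add(100, 1.5);
        for (int position : new int[] {0, 19, 20, 99, 100}) {
            System.out.println(position + " -> " + boost.boostFor(position));
        }
    }
}
```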

The matched_fields field turns on combining matches from multiple fields, just like the Fast Vector Highlighter. See the Elasticsearch documentation for more on it. The only real difference is that if hit_source is left out then each field's HitSource is determined independently. If one field is short, feel free to leave out any special settings for index_options or for term_vector.

A limitation in matched_fields: if the highlighter has to analyze the field value to find hits then you can't reuse analyzers in each matched field.

If you aren't using Elasticsearch, you can combine hits from multiple sources using:

new OverlapMergingHitEnumWrapper(new MergingHitEnum(hitsToMerge, HitEnum.LessThans.OFFSETS));
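For readers unfamiliar with the idea, here is a self-contained sketch of what merging hits by offset and then collapsing overlaps could look like. It does not use the project's classes and only infers their behavior from their names; the Hit type, the merge method, and the choice to keep the larger weight when hits overlap are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch for illustration only; it does not use the project's HitEnum classes.
final class MergeHitsSketch {
    /** A hit reduced to its character offsets and weight. */
    static final class Hit {
        final int start;
        final int end;
        final double weight;

        Hit(int start, int end, double weight) {
            this.start = start;
            this.end = end;
            this.weight = weight;
        }
    }

    /** Combines hits from several sources in offset order, collapsing overlapping hits. */
    static List<Hit> merge(List<List<Hit>> sources) {
        List<Hit> all = new ArrayList<>();
        for (List<Hit> source : sources) {
            all.addAll(source);
        }
        all.sort(Comparator.comparingInt((Hit h) -> h.start).thenComparingInt(h -> h.end));

        List<Hit> merged = new ArrayList<>();
        for (Hit hit : all) {
            Hit last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && hit.start < last.end) {
                // Overlap: widen the previous hit. Keeping the larger weight is an assumption.
                merged.set(merged.size() - 1,
                        new Hit(last.start, Math.max(last.end, hit.end), Math.max(last.weight, hit.weight)));
            } else {
                merged.add(hit);
            }
        }
        return merged;
    }
}
```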

Offsets in postings or term vectors

Adding offsets to the postings (set index_options to offsets in Elasticsearch) and creating term vectors with offsets (set term_vector to with_positions_offsets in Elasticsearch) both speed up this highlighter, so you have a choice of which to use. Unless you have a compelling reason, go with adding offsets to the postings. That is faster (by my tests, at least) and uses much less space.
