solr-sudachi

A Solr plugin for Sudachi, a Japanese morphological analyzer.

This plugin is based on elasticsearch-sudachi, which includes the common Lucene Tokenizer and TokenFilters.

Install

Edit properties/solr.version in pom.xml to match your Solr version (the default is 6.2.1).
Run mvn package to generate solr-sudachi-assembly-1.0.0-SNAPSHOT.jar in assembly/target/.
Put solr-sudachi-assembly-1.0.0-SNAPSHOT.jar in the ${SOLR_HOME}/lib directory.
Configure schema.xml or managed-schema with the settings shown under Tokenizer, then start Solr.
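
As a rough illustration of the first step, the relevant part of pom.xml might look like the fragment below. The element layout is inferred from the properties/solr.version path named above and is an assumption; the rest of the POM is omitted, and 6.2.1 is the default stated above.

```xml
<!-- Sketch of the relevant part of pom.xml; surrounding elements omitted. -->
<!-- Assumption: solr.version is a plain Maven property under <properties>. -->
<properties>
  <!-- Replace 6.2.1 (the default) with the version of your Solr installation. -->
  <solr.version>6.2.1</solr.version>
</properties>
```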

Version

Solr 6.2.1 or above.

Note that not all versions have been tested. If you find a problem with a particular version, please report it by opening an issue.

Tokenizer

SolrSudachiTokenizerFactory

Configuration

<fieldType name="text_ja" class="solr.TextField">
 <analyzer>
   <!-- Whatever Char Filters -->
   ...
   
   <tokenizer class="com.github.sh0nk.solr.sudachi.SolrSudachiTokenizerFactory"
     mode="NORMAL"
     discardPunctuation="true"
   />
   
   <!-- Whatever Token Filters -->
   ...
 </analyzer>
</fieldType>
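
A field type only takes effect once a field refers to it. As a minimal sketch, a field could be bound to text_ja as follows; the field name body is hypothetical, and the indexed and stored values are illustrative choices rather than requirements of solr-sudachi.

```xml
<!-- Hypothetical field named "body", analyzed with the text_ja field type above. -->
<field name="body" type="text_ja" indexed="true" stored="true" />
```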

solr-sudachi follows the configuration of elasticsearch-sudachi as closely as possible. This section explains only the differences.

By default, system_full.dic, which is provided by Sudachi, is used as the dictionary, and solr_sudachi.json is used as the tokenizer plugin settings. The configuration above is therefore all you need to get started. To customize it, the following properties are available.

settingsPath: Path to the Sudachi JSON configuration file, either relative to ${SOLR_HOME}/conf or absolute.
systemDictDir: Path to the directory containing the dictionary file, either relative to ${SOLR_HOME}/conf or absolute. All other Sudachi system files referenced in the settings file, such as char.def or rewrite.def, are resolved relative to systemDictDir. For example, if systemDictDir="sudachi" is given, they should be placed in that same "sudachi" directory.
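
A customized tokenizer might be declared as in the sketch below. The directory name sudachi comes from the example above; the file name sudachi/my_sudachi.json is hypothetical and stands for any settings file placed under ${SOLR_HOME}/conf.

```xml
<fieldType name="text_ja" class="solr.TextField">
  <analyzer>
    <!-- settingsPath: hypothetical settings file, relative to ${SOLR_HOME}/conf -->
    <!-- systemDictDir: directory relative to ${SOLR_HOME}/conf that holds the
         dictionary plus char.def, rewrite.def and similar files -->
    <tokenizer class="com.github.sh0nk.solr.sudachi.SolrSudachiTokenizerFactory"
      mode="NORMAL"
      discardPunctuation="true"
      settingsPath="sudachi/my_sudachi.json"
      systemDictDir="sudachi"
    />
  </analyzer>
</fieldType>
```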

system_full.dic and system_core.dic are bundled in the solr-sudachi jar. If one of these names is given in the system_dict property of the settings JSON, solr-sudachi extracts that file into systemDictDir. If systemDictDir is not given, the extracted file goes to ${SOLR_HOME}/conf. This extraction is needed for Sudachi to handle memory efficiently.

Token Filters

In addition to the core SolrSudachiTokenizerFactory, several token filters are available for post-processing the tokenizer's output.

SudachiSurfaceFormFilterFactory

solr-sudachi follows the behavior of elasticsearch-sudachi, which outputs the normalized form of each token instead of the surface form, that is, the token exactly as it appears in the input text. Like the base form, Sudachi's normalized form favors matching on the morpheme itself rather than on how it was written. If you want the query and the index to match on the exact input text instead, this filter is useful.

Example

Before the token filter

	1	2	3	4	5
Tokens	吾が輩	は	猫	だ	有る
Surface	吾輩	は	猫	で	ある

After the token filter

	1	2	3	4	5
Tokens	吾輩	は	猫	で	ある

Configuration

<fieldType name="text_ja" class="solr.TextField">
 <analyzer>
   <tokenizer class="com.github.sh0nk.solr.sudachi.SolrSudachiTokenizerFactory"
     mode="NORMAL"
     discardPunctuation="true"
   />
   
   <filter class="com.github.sh0nk.solr.sudachi.SudachiSurfaceFormFilterFactory" />
 </analyzer>
</fieldType>

Licenses

solr-sudachi is licensed under the Apache License, Version 2.0.

The original Sudachi and elasticsearch-sudachi are by Works Applications Co., Ltd. and are licensed under the Apache License, Version 2.0.

solr-sudachi

solr-sudachi, a Solr plugin for the Japanese morphological analyzer Sudachi

solr-sudachi is built on elasticsearch-sudachi, which provides the Lucene Tokenizer and TokenFilter interfaces.

Install

Change properties/solr.version in pom.xml to match the version of Solr you use. (The default version is 6.2.1.)
Running mvn package generates solr-sudachi-assembly-1.0.0-SNAPSHOT.jar in assembly/target/.
Copy the generated solr-sudachi-assembly-1.0.0-SNAPSHOT.jar to the ${SOLR_HOME}/lib directory of the Solr instance you use.
Edit schema.xml or managed-schema following the configuration shown under Tokenizer, then start Solr.

Version

Solr 6.2.1 and above are supported.

Not every version has been confirmed to work correctly. If a problem occurs with a particular version, please report it through a GitHub issue.

Tokenizer

SolrSudachiTokenizerFactory

Configuration

<fieldType name="text_ja" class="solr.TextField">
 <analyzer>
   <!-- Whatever Char Filters -->
   ...
   
   <tokenizer class="com.github.sh0nk.solr.sudachi.SolrSudachiTokenizerFactory"
     mode="NORMAL"
     discardPunctuation="true"
   />
   
   <!-- Whatever Token Filters -->
   ...
 </analyzer>
</fieldType>

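In general, solr-sudachi inherits the configuration of elasticsearch-sudachi, on which it is based. This section mainly describes the settings that differ from elasticsearch-sudachi.

By default, system_full.dic, which is generated by Sudachi, is used as the system dictionary. solr_sudachi.json is used as the settings file that specifies Sudachi's internal plugin chain. Because these are the defaults, the schema file needs only the configuration above to get started. To change the dictionary or the plugin chain, override them with the following properties.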

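settingsPath: Specifies the Sudachi JSON settings file, as a path relative to ${SOLR_HOME}/conf or as an absolute path.
systemDictDir: Specifies the directory containing the dictionary file, as a path relative to ${SOLR_HOME}/conf or as an absolute path. Every file that the JSON settings file refers to by relative path, such as char.def or rewrite.def, must be placed in this location. For example, with systemDictDir="sudachi", these files go in the "sudachi" directory.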

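The solr-sudachi jar file contains two dictionaries generated by Sudachi's dictionary builder: system_full.dic and system_core.dic. If either file name is set in the system_dict property of the Sudachi JSON settings file, solr-sudachi copies that file to systemDictDir. If systemDictDir is not specified, the file extracted from the jar is placed in ${SOLR_HOME}/conf. Extracting the file is necessary for Sudachi's memory efficiency.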

Token Filters

In addition to the core tokenizer, SolrSudachiTokenizerFactory, several token filters are provided for post-processing the tokenizer's output.

SudachiSurfaceFormFilterFactory

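solr-sudachi carries over the behavior of elasticsearch-sudachi as far as possible. By default, elasticsearch-sudachi outputs each segmented morpheme in its normalized form rather than its surface form, which is the form of the original input. Like the base form, the normalized form helps when matching on the morphemes themselves matters most. If you instead want to prioritize the characters as entered in the query and the index, this TokenFilter, which converts tokens to their surface form, may be useful.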

Example

Input to the token filter

	1	2	3	4	5
Tokens	吾が輩	は	猫	だ	有る
Surface	吾輩	は	猫	で	ある

Output of the token filter

	1	2	3	4	5
Tokens	吾輩	は	猫	で	ある

Configuration

<fieldType name="text_ja" class="solr.TextField">
 <analyzer>
   <tokenizer class="com.github.sh0nk.solr.sudachi.SolrSudachiTokenizerFactory"
     mode="NORMAL"
     discardPunctuation="true"
   />
   
   <filter class="com.github.sh0nk.solr.sudachi.SudachiSurfaceFormFilterFactory" />
 </analyzer>
</fieldType>

Licenses

solr-sudachi is licensed under the Apache License, Version 2.0.

The original Sudachi and elasticsearch-sudachi are by Works Applications Co., Ltd. and are licensed under the Apache License, Version 2.0.
