rosette-preprocessor-normalization

Normalizes text for the Rosette internationalization platform using the Unicode normalization algorithm.

Installation

gem install rosette-preprocessor-normalization

Then, somewhere in your project:

require 'rosette/preprocessors/normalization-preprocessor'

Introduction

This library is meant to be used with Rosette, an internationalization platform that extracts translatable phrases from git repositories. rosette-preprocessor-normalization runs the Unicode normalization algorithm over translations before they are serialized.

Usage with rosette-server

Let's assume you're configuring an instance of Rosette::Server. Adding normalization preprocessor support would make your configuration look something like this:

# config.ru
require 'rosette/core'
require 'rosette/serializer/json-serializer'
require 'rosette/extractors/json-extractor'
require 'rosette/preprocessors/normalization-preprocessor'

rosette_config = Rosette.build_config do |config|
  config.add_repo('my_awesome_repo') do |repo_config|
    repo_config.add_serializer('json/key-value') do |serializer_config|
      # Normalize translations before this serializer writes them out.
      serializer_config.add_preprocessor('normalization') do |pre_config|
        # One of :nfc, :nfd, :nfkc, or :nfkd.
        pre_config.set_normalization_form(:nfc)
      end
    end
  end
end

server = Rosette::Server::ApiV1.new(rosette_config)
run server

Supported normalization forms are :nfc, :nfd, :nfkc, and :nfkd. See Unicode Standard Annex #15 (formerly Unicode Technical Report 15) for more information.
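To get a feel for how the four forms differ, here is a minimal sketch that uses Ruby's built-in String#unicode_normalize rather than this gem. It is only an illustration of the forms themselves and makes no claim about how the preprocessor is implemented internally. The sample string is made up for the example: an "ñ" followed by the "ﬁ" ligature.

```ruby
# Illustration only: uses Ruby's standard String#unicode_normalize,
# not the rosette preprocessor.
sample = "\u00F1\uFB01" # "ñ" followed by the "ﬁ" ligature (hypothetical input)

forms = [:nfc, :nfd, :nfkc, :nfkd]
results = forms.map { |form| [form, sample.unicode_normalize(form)] }.to_h

results.each do |form, text|
  code_points = text.codepoints.map { |cp| format('U+%04X', cp) }.join(' ')
  puts format('%-5s %s', form, code_points)
end
# nfc   U+00F1 U+FB01
# nfd   U+006E U+0303 U+FB01
# nfkc  U+00F1 U+0066 U+0069
# nfkd  U+006E U+0303 U+0066 U+0069
```

The canonical forms (:nfc, :nfd) leave the ligature alone, while the compatibility forms (:nfkc, :nfkd) also replace it with plain "f" and "i". That makes the compatibility forms lossy, so :nfc is usually the safer choice for text that will be shown to users.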

It may not be immediately obvious why normalization matters, especially since in most cases it has no visible effect on translation text. Normalization works behind the scenes by ensuring that accents, composed characters (e.g. Korean Hangul), and similar sequences follow a common form. For example, the character "ñ" can be expressed using one or two Unicode code points. Normalization form NFC combines the "n" character and the combining tilde into a single code point (U+00F1), while normalization form NFD separates them into two code points (U+006E and U+0303). Most display systems (e.g. browsers and terminals) render both the same way, making the two forms visually indistinguishable. Normalization comes in handy when, for example, you need to compare two strings or use them to build a search index. In Ruby, the strings "\u00F1" and "\u006E\u0303" are not equal, even though they look identical.
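The comparison and search index cases can be sketched in plain Ruby. This again relies on the standard String#unicode_normalize method, not on this gem, and the word list and index structure are hypothetical.

```ruby
# Illustration only: standard Ruby, not the rosette preprocessor.
composed   = "\u00F1"        # "ñ" as a single code point
decomposed = "\u006E\u0303"  # "n" followed by a combining tilde

raw_equal = (composed == decomposed)
nfc_equal = (composed.unicode_normalize(:nfc) == decomposed.unicode_normalize(:nfc))

puts raw_equal # false
puts nfc_equal # true

# A tiny index keyed by the NFC form, so both spellings of the same
# word end up under one key.
words = ["ma\u00F1ana", "man\u0303ana"]
index = {}
words.each do |word|
  key = word.unicode_normalize(:nfc)
  (index[key] ||= []) << word
end

puts index.size # 1
```

Without the normalization step the index would hold two separate keys for what a reader sees as the same word.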

Requirements

This project must be run under JRuby. It uses expert to manage Java dependencies via Maven. Run bundle exec expert install in the project root to download and install the Java dependencies.

Running Tests

bundle exec rake or bundle exec rspec should do the trick.

Authors

Cameron C. Dutro: http://github.com/camertron
