Releases: google/sentencepiece

v0.2.0

19 Feb 16:08

Major changes

N/A

New features

  • [ALL] Added the SentencePieceNormalizer class in C++/Python. It supports almost the same features as spm_normalize.
  • [ALL] Added the SentencePieceProcessor::Normalize method in C++/Python.
  • [ALL] Added functionality to override the normalization spec before processing.
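
To give a rough feel for what a normalization step does to input text, here is a minimal sketch that uses only Python's standard `unicodedata` module. It is not the SentencePiece API: the helper name `nfkc_casefold` is hypothetical, and SentencePiece's own rules (for example `nfkc_cf`) are precompiled rule tables that also treat whitespace specially, so results can differ from this approximation.

```python
import unicodedata


def nfkc_casefold(text: str) -> str:
    """Approximate NFKC normalization followed by case folding.

    Hypothetical helper for illustration only; it does not call
    SentencePiece and does not reproduce its whitespace handling.
    """
    # NFKC maps compatibility characters (full-width letters, ligatures)
    # to their canonical equivalents.
    normalized = unicodedata.normalize("NFKC", text)
    # casefold() is a more aggressive, Unicode-aware lower().
    return normalized.casefold()


print(nfkc_casefold("ＫＡＤＯＫＡＷＡ ABC"))  # kadokawa abc
print(nfkc_casefold("ﬁne"))  # fine
```

Applying the same normalization both at training time and at inference time is what keeps the vocabulary consistent, which is why a standalone normalizer is useful.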

Bug fixes & minor changes

  • Improved support for using external abseil and protobuf. #869
  • Build a universal binary in the OSX release package. #892
  • Added the set_min_log_level function to the Python wrapper to change the log level. #893
  • Uses the log-sum-exp technique when computing marginal probabilities of n-best tokenization to avoid underflow.
  • Support Python 3.12. #932
  • Improves thread utilization in batch encoding/decoding.
  • Fixed a nasty bug in BPE position encoding.
  • Fixed bugs in the handling of duplicated bigrams.
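
The underflow fix mentioned above relies on the log-sum-exp technique. The following is a minimal sketch of that technique in plain Python, not the library's C++ implementation; the function name `logsumexp` and the n-best scores are made up for illustration.

```python
import math


def logsumexp(log_values):
    """Return log(sum(exp(v))) for a non-empty list of log-probabilities.

    Subtracting the maximum before exponentiating keeps every exponent
    at or below zero, so the largest term is exactly 1.0 and the sum
    cannot underflow to zero.
    """
    m = max(log_values)
    if m == -math.inf:
        # All inputs represent probability zero.
        return -math.inf
    return m + math.log(sum(math.exp(v - m) for v in log_values))


# Hypothetical log-probabilities of three n-best segmentations.
nbest_scores = [-1000.0, -1001.0, -1002.5]

# Naive summation in probability space underflows to 0.0,
# so taking its log afterwards would fail.
naive_sum = sum(math.exp(s) for s in nbest_scores)

# The stable version stays in log space throughout.
marginal_log_prob = logsumexp(nbest_scores)
print(naive_sum, marginal_log_prob)
```

The marginal here is slightly larger than the best single score (-1000.0), as expected when probability mass from the other candidates is added.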

v0.2.0pre1

16 Jan 06:37
Pre-release

Major changes

N/A

New features

  • [ALL] Added the SentencePieceNormalizer class in C++/Python. It supports almost the same features as spm_normalize.
  • [ALL] Added the SentencePieceProcessor::Normalize method in C++/Python.
  • [ALL] Added functionality to override the normalization spec before processing.

Bug fixes & minor changes

  • Improved support for using external abseil and protobuf. #869
  • Build a universal binary in the OSX release package. #892
  • Added the set_min_log_level function to the Python wrapper to change the log level. #893
  • Uses the log-sum-exp technique when computing marginal probabilities of n-best tokenization to avoid underflow.
  • Support Python 3.12. #932
  • Improves thread utilization in batch encoding/decoding.
  • Fixed a nasty bug in BPE position encoding.
  • Fixed bugs in the handling of duplicated bigrams.

v0.1.99

02 May 03:20

Major changes

N/A

New features

N/A

Bug fixes & minor changes

  • [ALL] Fixes the NaN issues in unigram model training: #851
  • [ALL] Fixes the bug in unigram loss computation: #628
  • [ALL] Fixes the minor bug in BPE token extraction algorithm: #318
  • [ALL] Increased the maximum number of threads from 128 to 1024. #857

v0.1.99pre1

28 Apr 22:54
Pre-release

Pre-release of v0.1.99 for testing.

v0.1.98

12 Apr 08:47

Major changes

  • Python 3.11 support (wheel packages for Python 3.11 are available).
  • Includes the full sources in the Python source package to reduce pip install problems.
  • Improves the algorithm that initializes the unigram seed vocabulary. Coverage is improved.

New features

  • [ALL] Added support for training the model with pre-tokenization boundary constraints (the --pretokenization_delimiter flag).

Bug fixes & minor changes

  • [ALL] Makes the error message more descriptive.
  • [ALL] Fixes the crash when std::random_device fails.
  • [ALL] Fixes the build error on Raspberry Pi related to atomic operations.
  • [ALL] Fixes minor bugs in n-best enumeration.
  • [ALL] Fixes the build error when using the external protobuf library.
  • [ALL] Fixes the build error on a big-endian machine.
  • [Windows] Use /MD build flag instead of /MT.

v0.1.97

06 Aug 16:03

Major changes

  • Migrated the C++ version from C++11 to C++17.
  • Migrated the CI environment from Travis-CI to GitHub Actions.
  • Started using cibuildwheel to build PyPI wheel packages.

New features

  • [ALL] Support differential privacy while training. https://aclanthology.org/2022.findings-acl.171.pdf
  • [ALL] Introduced APIs that return an ImmutableSentencePieceText struct, which holds the token string, id, and UTF-8 byte offsets together. The new API is available from both C++ and Python.
  • [ALL] Allow tab ‘\t’ to be included in user defined symbols.
  • [ALL] Added NFKD normalization rule. NFKD rule is provided as a TSV file.
  • [ALL] Added option to emit unknown symbol instead of raw symbol.
  • [Python] Batch encode/decode requests are performed with native multi-threading.
  • [Python] Supports passing a custom log stream during training.
  • [Python] Adds a module-level version variable: spm.__version__
  • [Python] Creates a wheel package as a Mac universal binary.

Bug fixes & minor changes

  • Uses the efficient encoding algorithm by default. Removed the functionality to switch the Viterbi tokenization algorithm.
  • Makes the output of Encode and the 1-best result of NBestEncode the same.
  • Use std::string_view as much as possible.
  • [Python] Removed the pip packages for the ppc64le and s390x architectures, as cibuildwheel doesn’t support them.

v0.1.96

17 Jun 16:55
d8711f5

Updates

  • Improves the performance of unigram training.
  • Updated the NFKC normalization with the latest ICU module.
  • Stopped treating the zero-width joiner as whitespace.

New features

  • Added a new sampling algorithm without replacement.
  • Added an API for the new sampling and for perplexity calculation.
  • Added the allow_whitespace_only_pieces mode.

v0.1.95

10 Jan 06:02

Updates

  • Support building sentencepiece with the external (official) abseil library.
  • Upgraded protobuf to 3.14.0.
  • Changed the type of input_sentence_size from int32 to uint64.

v0.1.94

24 Oct 02:01

Updates

  • Added the SetRandomGeneratorSeed function to set the seed value for the random generator. This makes sampling reproducible.
  • Validate the range of the vocab id in the Python module.
  • Changed the directory arrangement of the Python module.
  • Added the protobuf Python module.
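
To illustrate why a fixed seed makes sampling reproducible, here is a small sketch built on Python's standard `random` module. It does not call SetRandomGeneratorSeed or any other SentencePiece function; `sample_candidates` and the candidate segmentations are hypothetical.

```python
import random


def sample_candidates(candidates, k, seed):
    """Draw k candidates with a private generator seeded by `seed`."""
    # A dedicated Random instance avoids touching global random state.
    rng = random.Random(seed)
    return [rng.choice(candidates) for _ in range(k)]


# Made-up alternative segmentations of the same word.
candidates = ["▁hello", "▁hel lo", "▁h ello"]

first = sample_candidates(candidates, 5, seed=42)
second = sample_candidates(candidates, 5, seed=42)
print(first == second)  # True
```

Without a fixed seed, two runs would generally draw different segmentations, which makes experiments that rely on sampling hard to repeat.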

Bug fixes

  • Support building the Python wheel from the source package.

v0.1.93

14 Oct 04:38

Bug fix

  • Fixed the regression bug around the --minloglevel flag.
  • Fixed minor bugs.

Updates

  • Used manylinux2014 to build PyPI packages.
  • Support arm64, ppc64le, and s390x architectures in PyPI packages.
  • Support Python 3.9.

Removed

  • Discontinued tf-sentencepiece.
  • Dropped support for Python 2.x and Python 3.4.