
Introducing CodeGen Tokenizer #7139

Merged: 6 commits, May 2, 2024

Conversation

tarekgh (Member) commented Apr 23, 2024

This change implements the CodeGen tokenizer, which also supports the Phi-2 tokenizer.

codecov bot commented Apr 23, 2024

Codecov Report

Attention: Patch coverage is 81.63170%, with 394 lines in your changes missing coverage. Please review.

Project coverage is 68.65%. Comparing base (72cfdf6) to head (aa0e394).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7139      +/-   ##
==========================================
+ Coverage   68.55%   68.65%   +0.10%     
==========================================
  Files        1259     1262       +3     
  Lines      255844   257746    +1902     
  Branches    26434    26658     +224     
==========================================
+ Hits       175392   176956    +1564     
- Misses      73717    73979     +262     
- Partials     6735     6811      +76     
Flag Coverage Δ
Debug 68.65% <81.63%> (+0.10%) ⬆️
production 62.94% <70.47%> (+0.03%) ⬆️
test 88.85% <98.92%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...osoft.ML.Tokenizers/Utils/ByteToUnicodeEncoding.cs 100.00% <100.00%> (ø)
...ft.ML.TorchSharp/Extensions/TokenizerExtensions.cs 86.95% <100.00%> (ø)
test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs 100.00% <ø> (ø)
...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs 100.00% <100.00%> (+2.59%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs 99.52% <100.00%> (+0.10%) ⬆️
...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs 87.65% <81.25%> (-0.59%) ⬇️
...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs 76.59% <83.78%> (+4.66%) ⬆️
src/Microsoft.ML.Tokenizers/Tokenizer.cs 56.04% <61.11%> (+2.38%) ⬆️
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs 91.95% <91.35%> (+91.95%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/CodeGenTests.cs 98.77% <98.77%> (ø)
... and 4 more

... and 6 files with indirect coverage changes

tarekgh (Member, Author) commented Apr 30, 2024

@ericstj @michaelgsharp @stephentoub

I'd appreciate it if any of you could have a look at this PR. Thanks!

ericstj (Member) left a comment


I got about halfway through. I'll try to review more later. I'd also like others like @michaelgsharp and @stephentoub to have a look.

src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
string? unknownToken = DefaultSpecialToken,
string? beginningOfSentenceToken = DefaultSpecialToken,
string? endOfSentenceToken = DefaultSpecialToken) :
this(vocabularyStream, mergeStream, preTokenizer, normalizer, addedTokens, addPrefixSpace, addBeginningOfSentence, addEndOfSentence, unknownToken, beginningOfSentenceToken, endOfSentenceToken, disposeStream: false)
ericstj (Member):

Should we expose a bool for disposeStream? I can imagine that folks might want to open this on their own, but then delegate ownership to the tokenizer.

tarekgh (Member, Author):

I'm approaching this similarly to the File APIs. The tokenizer simply reads the stream without any further manipulation. I find it cleaner not to transfer ownership of the stream, especially since it's straightforward for the user to dispose of it after creating the tokenizer. They can even employ a using statement on the stream for simplicity.

src/Microsoft.ML.Tokenizers/Utils/Helpers.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Utils/ByteToUnicodeEncoding.cs (review thread, resolved)
{
using StreamReader reader = new StreamReader(mergeStream);

// We ignore the first and last line in the file
ericstj (Member):

Interesting. Is that just the format? There's no indication of an ignored line (like an empty line or a comment prefix)?

tarekgh (Member, Author):

Usually the line starts with # but I don't think it has to. You can see the Python code doing what we are doing too:

https://github.com/huggingface/transformers/blob/12c39e5693f7223be162a1e84de026a6545029eb/src/transformers/models/codegen/tokenization_codegen.py#L173

        with open(merges_file, encoding="utf-8") as merges_handle:
            bpe_merges = merges_handle.read().split("\n")[1:-1]
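
For illustration, the same skip-first-and-last behavior can be sketched in a few lines of Python (the file contents below are made up; the `#version` header is typical but, as noted above, not guaranteed):

```python
# Illustrative sketch of parsing a BPE merges file the way both
# implementations do: drop the first line (often a "#version" header)
# and the trailing empty string left by the file's final newline.
def parse_merges(merges_text: str) -> list:
    lines = merges_text.split("\n")[1:-1]  # ignore first and last line
    return [tuple(line.split(" ")) for line in lines]

sample = "#version: 0.2\nh e\nl l\nhe ll\n"
print(parse_merges(sample))  # [('h', 'e'), ('l', 'l'), ('he', 'll')]
```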

src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
@@ -16,5 +18,114 @@ internal static void ArrayPoolGrow<T>(ref T[] arrayPoolArray, int requiredCapaci
ArrayPool<T>.Shared.Return(arrayPoolArray);
arrayPoolArray = tmp;
}

public static int EncodeToUtf8(ReadOnlySpan<char> text, Span<byte> destination, Span<int> indexMapping)
ericstj (Member):

This helper is necessary even on .NET Core because of indexMapping?

tarekgh (Member, Author):

Yes. Currently we don't have any API that can give the mapping between the UTF-8 and UTF-16 representations during conversion. We may think about exposing something for that; it is very useful in scenarios like the one we need to support.
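
A hypothetical Python sketch of what such a helper computes follows; here each UTF-8 byte is mapped back to the index of its source character (the actual C# helper tracks UTF-16 code-unit indexes, which differ for characters outside the BMP):

```python
# Hypothetical illustration: encode text to UTF-8 while recording, for each
# output byte, the index of the character it came from. Python indexes code
# points; the C# helper maps to UTF-16 code-unit offsets instead.
def encode_to_utf8_with_mapping(text: str):
    out = bytearray()
    index_mapping = []
    for i, ch in enumerate(text):
        for b in ch.encode("utf-8"):
            out.append(b)
            index_mapping.append(i)
    return bytes(out), index_mapping

data, mapping = encode_to_utf8_with_mapping("aé€")
print(data)     # b'a\xc3\xa9\xe2\x82\xac'  (1 + 2 + 3 bytes)
print(mapping)  # [0, 1, 1, 2, 2, 2]
```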

src/Microsoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs (review thread, resolved)
try
{
JsonSerializerOptions options = new() { Converters = { StringSpanOrdinalKeyCustomConverter.Instance } };
vocab = JsonSerializer.Deserialize<Dictionary<StringSpanOrdinalKey, (int, string)>>(vocabularyStream, options) as Dictionary<StringSpanOrdinalKey, (int, string)>;
ericstj (Member):

I assume the expectation is this code is executing rarely / basically once per process?

tarekgh (Member, Author):

This code is executed once for each instantiation of the CodeGen Tokenizer. It reads the vocabulary file utilized by the tokenizer.
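
Conceptually, that once-per-instantiation load amounts to deserializing a single JSON object of token-to-id pairs, as in this sketch (the token strings and ids are invented for the example):

```python
import json

# Illustrative sketch: a tokenizer vocabulary file is one JSON object
# mapping token strings to integer ids, read once per tokenizer instance.
# A reverse map is typically built alongside it for decoding.
def load_vocab(vocab_json: str):
    vocab = json.loads(vocab_json)
    id_to_token = {v: k for k, v in vocab.items()}  # reverse lookup
    return vocab, id_to_token

vocab, id_to_token = load_vocab('{"hello": 123, "\u0120world": 456}')
print(vocab["hello"])    # 123
print(id_to_token[456])  # Ġworld
```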

tarekgh (Member, Author) commented May 2, 2024

I have addressed all reported feedback.

@tarekgh tarekgh merged commit e9097ce into dotnet:main May 2, 2024
25 checks passed