
Introducing CodeGen Tokenizer #7139

Merged: 6 commits, May 2, 2024

Conversation

tarekgh (Member) commented Apr 23, 2024

This change implements the CodeGen tokenizer, which also supports the Phi-2 tokenizer.

codecov bot commented Apr 23, 2024

Codecov Report

Attention: Patch coverage is 81.63170%, with 394 lines in your changes missing coverage. Please review.

Project coverage is 68.65%. Comparing base (72cfdf6) to head (aa0e394).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #7139      +/-   ##
==========================================
+ Coverage   68.55%   68.65%   +0.10%     
==========================================
  Files        1259     1262       +3     
  Lines      255844   257746    +1902     
  Branches    26434    26658     +224     
==========================================
+ Hits       175392   176956    +1564     
- Misses      73717    73979     +262     
- Partials     6735     6811      +76     
Flag Coverage Δ
Debug 68.65% <81.63%> (+0.10%) ⬆️
production 62.94% <70.47%> (+0.03%) ⬆️
test 88.85% <98.92%> (+0.15%) ⬆️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
...osoft.ML.Tokenizers/Utils/ByteToUnicodeEncoding.cs 100.00% <100.00%> (ø)
...ft.ML.TorchSharp/Extensions/TokenizerExtensions.cs 86.95% <100.00%> (ø)
test/Microsoft.ML.Tokenizers.Tests/BpeTests.cs 100.00% <ø> (ø)
...crosoft.ML.Tokenizers.Tests/EnglishRobertaTests.cs 100.00% <100.00%> (+2.59%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/TiktokenTests.cs 99.52% <100.00%> (+0.10%) ⬆️
...c/Microsoft.ML.Tokenizers/Utils/BytePairEncoder.cs 87.65% <81.25%> (-0.59%) ⬇️
...rosoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs 76.59% <83.78%> (+4.66%) ⬆️
src/Microsoft.ML.Tokenizers/Tokenizer.cs 56.04% <61.11%> (+2.38%) ⬆️
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs 91.95% <91.35%> (+91.95%) ⬆️
test/Microsoft.ML.Tokenizers.Tests/CodeGenTests.cs 98.77% <98.77%> (ø)
... and 4 more

... and 6 files with indirect coverage changes

tarekgh (Member, Author) commented Apr 30, 2024

@ericstj @michaelgsharp @stephentoub

I'd appreciate it if any of you could have a look at this PR. Thanks!

ericstj (Member) left a comment


I got about halfway through. I'll try to review more later. I'd also like others like @michaelgsharp and @stephentoub to have a look.

src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
string? unknownToken = DefaultSpecialToken,
string? beginningOfSentenceToken = DefaultSpecialToken,
string? endOfSentenceToken = DefaultSpecialToken) :
this(vocabularyStream, mergeStream, preTokenizer, normalizer, addedTokens, addPrefixSpace, addBeginningOfSentence, addEndOfSentence, unknownToken, beginningOfSentenceToken, endOfSentenceToken, disposeStream: false)
ericstj (Member):

Should we expose a bool for disposeStream? I can imagine that folks might want to open this on their own, but then delegate ownership to the tokenizer.

tarekgh (Member, Author):

I'm approaching this similarly to the File APIs. The tokenizer simply reads the stream without any further manipulation. I find it cleaner not to transfer ownership of the stream, especially since it's straightforward for the user to dispose of it after creating the tokenizer. They can even employ a using statement on the stream for simplicity.

src/Microsoft.ML.Tokenizers/Utils/Helpers.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Utils/Helpers.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Utils/ByteToUnicodeEncoding.cs (review thread, resolved)
{
using StreamReader reader = new StreamReader(mergeStream);

// We ignore the first and last line in the file
ericstj (Member):

Interesting. Is that just the format? There's no indication of an ignored line (like an empty line or a comment prefix)?

tarekgh (Member, Author):

Usually the line starts with # but I don't think it has to. You can see the Python code doing what we are doing too:

https://github.com/huggingface/transformers/blob/12c39e5693f7223be162a1e84de026a6545029eb/src/transformers/models/codegen/tokenization_codegen.py#L173

        with open(merges_file, encoding="utf-8") as merges_handle:
            bpe_merges = merges_handle.read().split("\n")[1:-1]
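
For illustration, the same skip-first-and-last behavior can be sketched in a few lines of Python (the file contents below are made up; the `#version` header is typical but, as noted above, not guaranteed):

```python
# Illustrative sketch of parsing a BPE merges file the way both
# implementations do: drop the first line (often a "#version" header)
# and the trailing empty string left by the file's final newline.
def parse_merges(merges_text: str) -> list:
    lines = merges_text.split("\n")[1:-1]  # ignore first and last line
    return [tuple(line.split(" ")) for line in lines]

sample = "#version: 0.2\nh e\nl l\nhe ll\n"
print(parse_merges(sample))  # [('h', 'e'), ('l', 'l'), ('he', 'll')]
```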

src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
src/Microsoft.ML.Tokenizers/Model/CodeGen.cs (review thread, resolved)
@@ -16,5 +18,114 @@ internal static void ArrayPoolGrow<T>(ref T[] arrayPoolArray, int requiredCapaci
ArrayPool<T>.Shared.Return(arrayPoolArray);
arrayPoolArray = tmp;
}

public static int EncodeToUtf8(ReadOnlySpan<char> text, Span<byte> destination, Span<int> indexMapping)
ericstj (Member):

This helper is necessary even on .NET Core because of indexMapping?

tarekgh (Member, Author):

Yes. Currently we don't have any API that can give the mapping between the UTF-8 and UTF-16 representations during conversion. We may think about exposing something for that; it is very useful in scenarios like the one we need to support.
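
A hypothetical Python sketch of what such a helper computes follows; here each UTF-8 byte is mapped back to the index of its source character (the actual C# helper tracks UTF-16 code-unit indexes, which differ for characters outside the BMP):

```python
# Hypothetical illustration: encode text to UTF-8 while recording, for each
# output byte, the index of the character it came from. Python indexes code
# points; the C# helper maps to UTF-16 code-unit offsets instead.
def encode_to_utf8_with_mapping(text: str):
    out = bytearray()
    index_mapping = []
    for i, ch in enumerate(text):
        for b in ch.encode("utf-8"):
            out.append(b)
            index_mapping.append(i)
    return bytes(out), index_mapping

data, mapping = encode_to_utf8_with_mapping("aé€")
print(data)     # b'a\xc3\xa9\xe2\x82\xac'  (1 + 2 + 3 bytes)
print(mapping)  # [0, 1, 1, 2, 2, 2]
```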

src/Microsoft.ML.Tokenizers/Utils/StringSpanOrdinalKey.cs (review thread, resolved)
try
{
JsonSerializerOptions options = new() { Converters = { StringSpanOrdinalKeyCustomConverter.Instance } };
vocab = JsonSerializer.Deserialize<Dictionary<StringSpanOrdinalKey, (int, string)>>(vocabularyStream, options) as Dictionary<StringSpanOrdinalKey, (int, string)>;
ericstj (Member):

I assume the expectation is this code is executing rarely / basically once per process?

tarekgh (Member, Author):

This code is executed once for each instantiation of the CodeGen Tokenizer. It reads the vocabulary file utilized by the tokenizer.
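
Conceptually, that once-per-instantiation load amounts to deserializing a single JSON object of token-to-id pairs, as in this sketch (the token strings and ids are invented for the example):

```python
import json

# Illustrative sketch: a tokenizer vocabulary file is one JSON object
# mapping token strings to integer ids, read once per tokenizer instance.
# A reverse map is typically built alongside it for decoding.
def load_vocab(vocab_json: str):
    vocab = json.loads(vocab_json)
    id_to_token = {v: k for k, v in vocab.items()}  # reverse lookup
    return vocab, id_to_token

vocab, id_to_token = load_vocab('{"hello": 123, "\u0120world": 456}')
print(vocab["hello"])    # 123
print(id_to_token[456])  # Ġworld
```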

tarekgh (Member, Author) commented May 2, 2024

I have addressed all reported feedback.

@tarekgh tarekgh merged commit e9097ce into dotnet:main May 2, 2024
25 checks passed