Introducing CodeGen Tokenizer #7139
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@            Coverage Diff             @@
##             main    #7139      +/-   ##
==========================================
+ Coverage   68.55%   68.65%   +0.10%
==========================================
  Files        1259     1262       +3
  Lines      255844   257746    +1902
  Branches    26434    26658     +224
==========================================
+ Hits       175392   176956    +1564
- Misses      73717    73979     +262
- Partials     6735     6811      +76
Flags with carried forward coverage won't be shown.
@ericstj @michaelgsharp @stephentoub I'd appreciate it if any of you could have a look at this PR. Thanks!
I got about halfway through. I'll try to review more later. I'd also like others, like @michaelgsharp and @stephentoub, to have a look.
string? unknownToken = DefaultSpecialToken,
string? beginningOfSentenceToken = DefaultSpecialToken,
string? endOfSentenceToken = DefaultSpecialToken) :
    this(vocabularyStream, mergeStream, preTokenizer, normalizer, addedTokens, addPrefixSpace, addBeginningOfSentence, addEndOfSentence, unknownToken, beginningOfSentenceToken, endOfSentenceToken, disposeStream: false)
Should we expose a bool for disposeStream? I can imagine that folks might want to open this on their own, but then delegate ownership to the tokenizer.
I'm approaching this similarly to the File APIs. The tokenizer simply reads the stream without any further manipulation. I find it cleaner not to transfer ownership of the stream, especially since it's straightforward for the user to dispose of it after creating the tokenizer. They can even employ a using statement on the stream for simplicity.
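The caller-retains-ownership pattern described here can be sketched in Python (the `TokenizerSketch` class is a hypothetical stand-in, not the real tokenizer): the object reads everything it needs eagerly in its constructor and keeps no reference to the stream, so the caller is free to close it right after construction.

```python
import io

class TokenizerSketch:
    """Hypothetical stand-in for a tokenizer that loads its data eagerly.

    It consumes the stream fully in the constructor and keeps no reference
    to it afterwards, so it never takes ownership of the stream.
    """
    def __init__(self, vocab_stream):
        self.vocab_text = vocab_stream.read()  # eager read, no ownership

# The caller opens, constructs, and then disposes on its own schedule,
# e.g. with a context manager (the analogue of a C# using statement).
with io.StringIO('{"hello": 0}') as stream:
    tokenizer = TokenizerSketch(stream)
# The stream is closed here, but the tokenizer remains fully usable.
print(tokenizer.vocab_text)  # '{"hello": 0}'
```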
{
    using StreamReader reader = new StreamReader(mergeStream);

    // We ignore the first and last lines in the file
Interesting, is that just the format? Is there no indication of an ignored line (like an empty line or a comment prefix)?
Usually the line starts with #, but I don't think it has to. You can see the Python code doing the same thing we do:

with open(merges_file, encoding="utf-8") as merges_handle:
    bpe_merges = merges_handle.read().split("\n")[1:-1]
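To make the slicing above concrete, here is a small self-contained demonstration with a toy merges file: splitting on "\n" yields the header line first and an empty string last (from the trailing newline), and [1:-1] drops both.

```python
# A toy merges file: a header line, two merge rules, and a trailing newline.
merges_text = "#version: 0.2\nhe llo\nwor ld\n"

lines = merges_text.split("\n")
# lines == ['#version: 0.2', 'he llo', 'wor ld', '']
bpe_merges = lines[1:-1]  # drop the header (first) and empty string (last)
print(bpe_merges)  # ['he llo', 'wor ld']
```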
@@ -16,5 +18,114 @@ internal static void ArrayPoolGrow<T>(ref T[] arrayPoolArray, int requiredCapaci
        ArrayPool<T>.Shared.Return(arrayPoolArray);
        arrayPoolArray = tmp;
    }

    public static int EncodeToUtf8(ReadOnlySpan<char> text, Span<byte> destination, Span<int> indexMapping)
Is this helper necessary even on .NET Core because of indexMapping?
Yes, currently we don't have any API that can give the mapping between UTF-8 and UTF-16 positions during conversion. We may consider exposing something for that; it's very useful in scenarios like the one we need to support.
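The kind of mapping such a helper provides can be illustrated with a Python sketch (a hypothetical function, not the actual .NET helper; note Python indexes code points where .NET indexes UTF-16 code units, so surrogate pairs would behave differently): each UTF-8 output byte is tagged with the index of the source character that produced it.

```python
def encode_to_utf8_with_mapping(text):
    """Encode text to UTF-8, recording for each output byte the index of
    the source character that produced it. Multi-byte characters map
    several consecutive bytes back to the same character index."""
    out = bytearray()
    index_mapping = []
    for i, ch in enumerate(text):
        encoded = ch.encode("utf-8")
        out.extend(encoded)
        index_mapping.extend([i] * len(encoded))
    return bytes(out), index_mapping

data, mapping = encode_to_utf8_with_mapping("aé")
print(data)     # b'a\xc3\xa9'
print(mapping)  # [0, 1, 1] -- both bytes of 'é' map back to index 1
```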
try
{
    JsonSerializerOptions options = new() { Converters = { StringSpanOrdinalKeyCustomConverter.Instance } };
    vocab = JsonSerializer.Deserialize<Dictionary<StringSpanOrdinalKey, (int, string)>>(vocabularyStream, options) as Dictionary<StringSpanOrdinalKey, (int, string)>;
I assume the expectation is that this code executes rarely, basically once per process?
This code is executed once for each instantiation of the CodeGen Tokenizer. It reads the vocabulary file utilized by the tokenizer.
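Since the vocabulary is parsed on every tokenizer instantiation, a caller that constructs many instances could amortize the cost by caching the parsed vocabulary (or the tokenizer itself) per file path. A hypothetical Python sketch of that idea, not part of this PR:

```python
import json
import os
import tempfile
from functools import lru_cache

@lru_cache(maxsize=None)
def load_vocab(path):
    """Parse the vocabulary file once per path and reuse the result."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

# Demo: two loads of the same path return the same cached object.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump({"hello": 0}, f)
    path = f.name
try:
    v1 = load_vocab(path)
    v2 = load_vocab(path)
    print(v1 == {"hello": 0}, v1 is v2)  # True True
finally:
    os.unlink(path)
```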
I have addressed all reported feedback.
This change implements the CodeGen tokenizer, which also supports the Phi-2 tokenizer.