Abstract
The current CP949 state machine produces false positives: it incorrectly reports some valid CP949 text as an error.
This PR rewrites the state transition table to comply with the CP949 specification.
Details
The following are cases in which the current implementation reports a false-positive error.
춉 (0xAD68): The first byte, 0xAD, is classified as class 8. In the START state, class 8 causes a transition to the ERROR state, even though this is a valid CP949 sequence.
힣 (0xC652): The first byte is classified as class 9 and the second byte as class 5. In the START state, class 9 causes a transition to State 6, and in State 6, class 5 causes a transition to the ERROR state, even though this is a valid CP949 sequence.
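As an independent sanity check, the two byte sequences above can be decoded with the `cp949` codec that ships with Python's standard library. This is only a minimal sketch and is separate from the state machine changed in this PR; the helper name `decodes_as_cp949` is made up for illustration.

```python
# Byte sequences quoted in the two cases above.
samples = [bytes([0xAD, 0x68]), bytes([0xC6, 0x52])]


def decodes_as_cp949(data: bytes) -> bool:
    """Return True if Python's built-in cp949 codec accepts the bytes (hypothetical helper)."""
    try:
        data.decode("cp949")
    except UnicodeDecodeError:
        return False
    return True


results = [decodes_as_cp949(sample) for sample in samples]
print(results)
```

If the codec accepts both sequences, that supports treating them as valid CP949 input that the state machine should not reject.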
Test
I tested the new state machine (To-Be) against all characters in CP949 with the following code, and it returned "Success". When I ran the same test against the current implementation (As-Is), it reported "Error! at byte 15479".
I couldn't add the CP949 character listing to the test fixtures folder, because it would make the tests fail: the frequency-based probing does not identify it as CP949, since it is just a plain listing of every possible CP949 character.
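An exhaustive check of this kind could be sketched roughly as below. This is not the test code used for this PR. It builds the listing from Python's built-in `cp949` codec and runs it through a simplified range check rather than the real state machine. The function names and the lead and trail byte ranges in the comments are assumptions made for illustration.

```python
def all_cp949_double_bytes() -> bytes:
    """Concatenate every two-byte sequence that Python's cp949 codec decodes to one character."""
    out = bytearray()
    for lead in range(0x81, 0xFF):
        for trail in range(0x41, 0xFF):
            pair = bytes([lead, trail])
            try:
                decoded = pair.decode("cp949")
            except UnicodeDecodeError:
                continue
            if len(decoded) == 1:
                out += pair
    return bytes(out)


def is_trail_byte(b: int) -> bool:
    # Assumed trail-byte ranges: 0x41-0x5A, 0x61-0x7A, 0x81-0xFE.
    return 0x41 <= b <= 0x5A or 0x61 <= b <= 0x7A or 0x81 <= b <= 0xFE


def first_error(data: bytes) -> int:
    """Return the offset of the first byte the simplified check rejects, or -1 if none."""
    i = 0
    while i < len(data):
        if data[i] < 0x80:
            i += 1  # single-byte (ASCII) character
        elif 0x81 <= data[i] <= 0xFE and i + 1 < len(data) and is_trail_byte(data[i + 1]):
            i += 2  # assumed lead-byte range 0x81-0xFE followed by a trail byte
        else:
            return i
    return -1


listing = all_cp949_double_bytes()
offset = first_error(listing)
print("Success" if offset < 0 else f"Error! at byte {offset}")
```

To exercise an actual state machine, `first_error` would be replaced by a loop that feeds each byte to it and stops at the first ERROR state.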