Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CODEC-308: change NYSIIS encoding to not remove the first character i… #189

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Ben-Waters
Copy link

With the current implementation of NYSIIS, it is possible to incorrectly remove the first character from the encoding.

According to the algorithm the first character of the string should be the first character of the encoding, then based on a bunch of other rules are applied to the string characters are removed. The implementation in commons-codec passes the entire string into the transcodeRemaining method which works for the most part and then afterwards, checks that there is at least 1 character before removing the final 'A' or 'S'.

The problem is, if you have a word like "ASH" you will end up with a single final character of "A". Similarly with "SSH" you would have "S" and the logic will currently remove it and return a blank string when it should still return at least the first letter of the original string.

@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
new String[] { "JILES", "JAL" },
// violates 6: if the last two characters are AY, remove A
new String[] { "CARRAWAY", "CARY" }, // Original: CARAY
new String[] { "YAMADA", "YANAD" });
new String[] { "YAMADA", "YANAD" },
new String[] { "ASH", "A"});
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Ben-Waters
Thank you for your PR.
Who is right? This or Fuzzy?

py
Python 3.11.2 (tags/v3.11.2:878ead1, Feb  7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.nysiis('ASH')
'AS'
>>>

Thoughts?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@garydgregory
Based on this other implementation it would be just A.

According to the algorithm, the final character should be removed if it is an 'S' so I would think it should be removed.

The current commons-codec implementation is removing it as well as the final 'A' but is ignoring the part about "The first character of the NYSIIS code is the first character of the name."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't used fuzzy before but looking at their implementation I don't see where they remove the trailing S or A.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had not used it before today either. I was looking for something easy to compare this PR's behavior.

Let's see if anybody else has feedback.

Do you know of any other software that is easily testable for this use-case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's pretty frustrating but it seems like the results are inconsistent depending on where you go.

Looking at this link the Java version returns AS but the python returns A. For my use-case, either one of those would be fine but it seems hard to get a consistent answer. Commons-codec is currently returning nothing though and I haven't seen another do that yet.

I don't know of a good official implementation to run it against.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would think that returning something like "A" is better than "". Let's see if anyone else suggests anything. @kinow ?

Copy link
Member

@kinow kinow Jun 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't find a C++ implementation nor a good source from Wikipedia. So I checked another implementation, phonics in R. It has a nysiis algorithm implementation that allows for "modified" key. For "A" it gives "" too, but the modified version gives "AS".

>install.packages('phonics')
>library('phonics')
>nysiis(c("ASH"))
[1] ""
> nysiis(c("ASH"), modified=TRUE)
[1] "AS"
> nysiis(c("ASHBURTON"))
[1] "ASBART"

Their docs (PDF) documents the basic algorithm, but links to this PDF that explains the whole package a lot better: James P. Howard, II, Phonetic Spelling Algorithm Implementations for R

image

Searching more about those papers after the text above, I found this issue that describes the same thing I just said 😬 : CODEC-235

So my think we should document that the current version in Commons Codec is the one from the first paper, and then maybe add the other implementation as a separate class/method and let users to pick which one they want to use.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question that comes up for me is: Is our current implementation "plain" NYSIIS, or, is any non-standard or any behavior from the "modified" L&A 1977 algorithm present in our current implementation? If our current code is "plain", then I could see us creating a ModifiedNysiis class, if not, then we are in a bit of a pickle.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Although the results seem to indicate we implement the plain NYSIIS, someone has indeed to verify it reading our code and the algorithm. A good point, +1!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the steps from wikipedia, the core of the issue is that in step 4, we are setting the pointer to the first character instead of the 2nd character. The 9th step on wikipedia is a bit ambiguous though as to whether that step includes the first character or not. If it does, then you would get a blank string like commons-codec currently gives. If not, then you would get 'A'.

Unfortunately I don't have access to the original source book for the algorithm to try to clarify. It looks like berkeley law has it but I don't have an account for that.

@garydgregory
Copy link
Member

garydgregory commented Jun 27, 2023

@Ben-Waters Not directly related, but do you have any thoughts on #36?

@Ben-Waters
Copy link
Author

Ben-Waters commented Jun 28, 2023

@Ben-Waters Not directly related, but do you have any thoughts on #36?

Hmmm I'm no expert on this since it isn't NYSIIS but some other algorithm but I can take a look.

@garydgregory
Copy link
Member

I just re-read the comments it seems like:

  • We are not sure if the current code implements the "plain" or original algorithm.
  • We don't have access to, or cannot find, the paper for the original algorithm.
  • If we implement the plain original algorithm, then we can add a new class for the newer "modifed" algorithm.
  • If we do not implement the plain original algorithm, then we need to talk about that.

Help needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants