CODEC-308: change NYSIIS encoding to not remove the first character i… #189

Ben-Waters · 2023-06-26T22:06:45Z

With the current implementation of NYSIIS, it is possible to incorrectly remove the first character from the encoding.

According to the algorithm, the first character of the string should be the first character of the encoding; the remaining rules are then applied to the string and characters are removed. The implementation in commons-codec passes the entire string into the transcodeRemaining method, which works for the most part, and afterwards checks that there is at least 1 character before removing the final 'A' or 'S'.

The problem is that a word like "ASH" ends up with a single final character of "A". Similarly, "SSH" ends up with "S". The logic currently removes that character and returns a blank string, when it should still return at least the first letter of the original string.

…f its an A or S

garydgregory · 2023-06-26T22:47:53Z

src/test/java/org/apache/commons/codec/language/NysiisTest.java

@@ -140,7 +140,8 @@ public void testDropBy() throws EncoderException {
                new String[] { "JILES", "JAL" },
                // violates 6: if the last two characters are AY, remove A
                new String[] { "CARRAWAY", "CARY" },       // Original: CARAY
-                new String[] { "YAMADA", "YANAD" });
+                new String[] { "YAMADA", "YANAD" },
+                new String[] { "ASH", "A"});


Hi @Ben-Waters
Thank you for your PR.
Who is right? This or Fuzzy?

py
Python 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import fuzzy
>>> fuzzy.nysiis('ASH')
'AS'
>>>

Thoughts?

@garydgregory
Based on this other implementation it would be just A.

According to the algorithm, the final character should be removed if it is an 'S' so I would think it should be removed.

The current commons-codec implementation is removing it as well as the final 'A' but is ignoring the part about "The first character of the NYSIIS code is the first character of the name."

I haven't used fuzzy before but looking at their implementation I don't see where they remove the trailing S or A.

I had not used it before today either. I was looking for something easy to compare this PR's behavior against.

Let's see if anybody else has feedback.

Do you know of any other software that is easily testable for this use-case?

It's pretty frustrating but it seems like the results are inconsistent depending on where you go.

Looking at this link, the Java version returns AS but the Python version returns A. For my use-case, either one of those would be fine, but it seems hard to get a consistent answer. Commons-codec is currently returning nothing, though, and I haven't seen another implementation do that yet.

I don't know of a good official implementation to run it against.

I would think that returning something like "A" is better than "". Let's see if anyone else suggests anything. @kinow ?

I couldn't find a C++ implementation nor a good source from Wikipedia. So I checked another implementation, phonics in R. It has a nysiis implementation that supports a "modified" option. For "ASH" it gives "" too, but the modified version gives "AS".

> install.packages('phonics')
> library('phonics')
> nysiis(c("ASH"))
[1] ""
> nysiis(c("ASH"), modified=TRUE)
[1] "AS"
> nysiis(c("ASHBURTON"))
[1] "ASBART"
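To make the disagreement easier to see, the results for "ASH" reported so far in this thread can be grouped by output. This is only a tabulation sketch: the class and method names are made up for illustration, and the values are copied from the comments above rather than computed by any NYSIIS implementation.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class AshResults {

    // Results for "ASH" as reported in this thread; nothing here is computed.
    static Map<String, String> reported() {
        Map<String, String> results = new LinkedHashMap<>();
        results.put("commons-codec (current)", "");
        results.put("commons-codec (this PR)", "A");
        results.put("fuzzy (Python)", "AS");
        results.put("phonics (R)", "");
        results.put("phonics (R, modified=TRUE)", "AS");
        return results;
    }

    // Groups implementation names by the key they return.
    static Map<String, List<String>> byResult(Map<String, String> results) {
        Map<String, List<String>> grouped = new TreeMap<>();
        results.forEach((impl, key) ->
                grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(impl));
        return grouped;
    }

    public static void main(String[] args) {
        byResult(reported()).forEach((key, impls) ->
                System.out.println("\"" + key + "\" <- " + impls));
    }
}
```

Grouped this way, three distinct answers appear: the empty string, "A", and "AS".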

Their docs (PDF) document the basic algorithm, but link to this PDF, which explains the whole package a lot better: James P. Howard, II, Phonetic Spelling Algorithm Implementations for R

After writing the text above, I searched more about those papers and found this issue, which describes the same thing I just said 😬: CODEC-235

So I think we should document that the current version in Commons Codec is the one from the first paper, and then maybe add the other implementation as a separate class/method and let users pick which one they want to use.

The question that comes up for me is: Is our current implementation "plain" NYSIIS, or is any non-standard behavior, or any behavior from the "modified" L&A 1977 algorithm, present in it? If our current code is "plain", then I could see us creating a ModifiedNysiis class. If not, then we are in a bit of a pickle.

Although the results seem to indicate we implement the plain NYSIIS, someone does indeed have to verify it by reading our code and the algorithm. A good point, +1!

Based on the steps from Wikipedia, the core of the issue is that in step 4, we are setting the pointer to the first character instead of the second character. The 9th step on Wikipedia is a bit ambiguous, though, as to whether that step includes the first character or not. If it does, then you would get a blank string like commons-codec currently gives. If not, then you would get 'A'.
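The two readings can be sketched as follows. This is a minimal illustration, not the commons-codec source: the class, the method, and the `protectFirst` flag are hypothetical, the order of the two trailing rules (final 'S', then final 'A') is an assumption, and "AS" is assumed to be the key for "ASH" just before those rules run.

```java
final class TrailingRuleReadings {

    // Hypothetical helper for illustration only.
    // Removes a final 'S', then a final 'A' (assumed order).
    // When protectFirst is true, neither rule may consume the first character.
    static String applyTrailingRules(String key, boolean protectFirst) {
        int minLength = protectFirst ? 1 : 0;
        StringBuilder sb = new StringBuilder(key);
        for (char c : new char[] {'S', 'A'}) {
            int last = sb.length() - 1;
            if (sb.length() > minLength && sb.charAt(last) == c) {
                sb.setLength(last);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Reading 1: the trailing rules cover the first character.
        System.out.println("[" + applyTrailingRules("AS", false) + "]"); // prints []
        // Reading 2: the first character is always kept.
        System.out.println("[" + applyTrailingRules("AS", true) + "]");  // prints [A]
    }
}
```

Under the first reading the key collapses to a blank string; under the second it stops at the first character.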

Unfortunately, I don't have access to the original source book for the algorithm to try to clarify. It looks like Berkeley Law has it, but I don't have an account for that.

garydgregory · 2023-06-27T12:24:22Z

@Ben-Waters Not directly related, but do you have any thoughts on #36?

Ben-Waters · 2023-06-28T03:02:28Z

> @Ben-Waters Not directly related, but do you have any thoughts on #36?

Hmmm, I'm no expert on this since it isn't NYSIIS but some other algorithm, but I can take a look.

garydgregory · 2023-08-11T23:22:36Z

I just re-read the comments. It seems like:

We are not sure if the current code implements the "plain" or original algorithm.
We don't have access to, or cannot find, the paper for the original algorithm.
If we implement the plain original algorithm, then we can add a new class for the newer "modified" algorithm.
If we do not implement the plain original algorithm, then we need to talk about that.

Help needed.

CODEC-308: change NYSIIS encoding to not remove the first character i…

caa21a8

…f its an A or S

Ben-Waters force-pushed the CODEC-308-NYSIIS-ASH branch from 3ae8399 to caa21a8 Compare June 26, 2023 22:08

Fix formatting

4c27929

garydgregory reviewed Jun 26, 2023


kinow mentioned this pull request Jun 27, 2023

Fix typo in nysiis.R k3jph/phonics-in-r#45

Open
