Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor(data): Refactor wordSearch and other methods for clarity and maintainability #1436

Open
wants to merge 30 commits into
base: main
Choose a base branch
from

Conversation

tora-pan
Copy link
Contributor

Refactor wordSearch to use lookup object

  • Added a lookup object to more cleanly convert from カタカナ to ひらがな.

  • Removed nested if/for block and replaced with a more concise convertToHiragana method.

  • Added a isKana method to return if whether or not the current character is indeed 日本語.

  • Renamed a few variables for code readability.

Testing

  • Cherry picked tests from separate branch to cover newly added methods and to make sure there were no regressions.

@tora-pan
Copy link
Contributor Author

@melink14 Don't mind the extra commits at the end. I accidentally did a git add . and pushed. (pushing my web-test-runner-config.js and some .vscode settings.

I did wan't to ask you about that. I have to update my executablePath to make sure pupetteer works correctly. Do you think I should just add it to my global .gitignore or something?

Let me know what you think about the contributions and please let me know if there is anything that you think could be done differently, refactored, etc...

ありがとうございます!

Copy link
Owner

@melink14 melink14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the new screenshots that were pushed it looks like there are some unexpected changes to how words are looked up (less words are found); we'll need to debug that.

extension/data.ts Outdated Show resolved Hide resolved
extension/test/data_test.ts Outdated Show resolved Hide resolved
extension/test/data_test.ts Show resolved Hide resolved
@codecov
Copy link

codecov bot commented Mar 1, 2023

Codecov Report

Merging #1436 (448994a) into main (b6e87d9) will increase coverage by 2.37%.
The diff coverage is 99.12%.

@@            Coverage Diff             @@
##             main    #1436      +/-   ##
==========================================
+ Coverage   79.62%   82.00%   +2.37%     
==========================================
  Files           7        8       +1     
  Lines        3004     3189     +185     
  Branches      189      189              
==========================================
+ Hits         2392     2615     +223     
+ Misses        607      570      -37     
+ Partials        5        4       -1     
Impacted Files Coverage Δ
extension/data.ts 85.94% <98.08%> (+3.61%) ⬆️
extension/character_info.ts 100.00% <100.00%> (ø)

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

Copy link
Owner

@melink14 melink14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Didn't look at the tests yet due to time constraints but mostly nits with some high level questions on if the new code is functionally equivilant to the old code.

Also, it seems there are still some unexpected screenshot diffs! (title translations aren't continuing past the first word?)

extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
Copy link
Owner

@melink14 melink14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another thought: I wonder if we should just use wanakana library to do all this stuff: https://www.npmjs.com/package/wanakana

Seems to have what we need and is small.

edit: maybe doesn't handle half width...

@melink14
Copy link
Owner

melink14 commented Mar 1, 2023

I did wan't to ask you about that. I have to update my executablePath to make sure pupetteer works correctly. Do you think I should just add it to my global .gitignore or something?

I forgot to answer this question but this can probably be updated to fallback to an environment variable or command line option.

I might also test if it's even required anymore (since there should be some logic to find chrome in a normal way) but I think I was debugging some problems in WSL which made me add it originally.

@tora-pan
Copy link
Contributor Author

tora-pan commented Mar 2, 2023

So I went ahead and commented out the executablePath: portion and the tests still run so I think I've found our solution. 👍

@tora-pan
Copy link
Contributor Author

tora-pan commented Mar 3, 2023

@melink14 all tests are passing now. Fixed a few logic issues. Going to do a bit more cleaning up of variable names and typing in the morning tomorrow.

🥂

@melink14
Copy link
Owner

melink14 commented Mar 4, 2023

I see the images are still there but that might be due to the presubmit not re-running to update them. (Since it doesn't re-run when github action pushes a commit)

I'll close and reopen the pull request to get it to re-run. (I should add a comment or label command for it...)

extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/data.ts Outdated Show resolved Hide resolved
extension/types/hiraganaLookupMap.ts Outdated Show resolved Hide resolved
extension/types/hiraganaLookupMap.ts Outdated Show resolved Hide resolved
extension/types/unicode-constants.ts Outdated Show resolved Hide resolved
extension/types/unicode-constants.ts Outdated Show resolved Hide resolved
extension/types/unicode-constants.ts Outdated Show resolved Hide resolved
@melink14 melink14 closed this Mar 4, 2023
@melink14 melink14 reopened this Mar 4, 2023
@melink14 melink14 changed the title Refactor(data) refactor word search refactor(data): Refactor wordSearch and other methods for clarity and maintainability Mar 4, 2023
@tora-pan
Copy link
Contributor Author

tora-pan commented Mar 4, 2023

Another thought: I wonder if we should just use wanakana library to do all this stuff: https://www.npmjs.com/package/wanakana

Seems to have what we need and is small.

edit: maybe doesn't handle half width...

This hasn't been updated in years but looks like it does what we need it to do?
https://github.com/yomotsu/japanese-string-utils
Maybe we could either use it or get "hints" from it?

@melink14
Copy link
Owner

melink14 commented Mar 8, 2023

This hasn't been updated in years but looks like it does what we need it to do?
yomotsu/japanese-string-utils
Maybe we could either use it or get "hints" from it?

Might be worth trying though if we found problems I guess we'd need to fork it.

Copy link
Owner

@melink14 melink14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still haven't looked at the tests yet but I can feel the implementation is pretty close probably.

for (let i = 0; i < str.length; i++) {
const char = str.charAt(i);
const nextChar = i + 1 <= str.length - 1 ? str.charAt(i + 1) : null;
const nextCharCode = nextChar?.charCodeAt(0);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more clear if we store the characters as unicode as discussed in the other thread.

extension/data.ts Outdated Show resolved Hide resolved
0x307c,
];
cs: number[] = [0x3071, 0x3074, 0x3077, 0x307a, 0x307d];
convertToHiragana(str: string): string {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might have said this elsewhere but definititely recommend a jsdoc for this method and maybe a tweak to the name since it also has a truncation function now (which we added back for consistency)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean by truncation portion? If you mean the stripping of the ZWJ and Tilde, etc... then that happens before it is sent to the convertToHiragana method.

Here is what I currently have using TS-Doc:

Screenshot 2023-03-08 at 7 06 05 AM

What do you think?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's at line 352; I left a comment there for easy reference.

The TS doc is solid but I'll leave some pedantic comments (mostly conforming to best practices developed in my day job but that have served me well)

currentCharCode <= KANA.HW_KATAKANA_END;
let key = '';
if (currentCharCode < 0x3000) {
break;
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the truncation I meant; with this line the string will be truncated at the first not kana/kanji character.

0x307c,
];
cs: number[] = [0x3071, 0x3074, 0x3077, 0x307a, 0x307d];
convertToHiragana(str: string): string {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's at line 352; I left a comment there for easy reference.

The TS doc is solid but I'll leave some pedantic comments (mostly conforming to best practices developed in my day job but that have served me well)

@@ -0,0 +1,186 @@
export const kanaToHiraganaNormalizationMap: Record<string, string> = {
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be constant case? (and readonly I guess)

);
}
/**
* Returns the input string converted into hiragana. If any characters are not
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: It's good to reference the inputs expliictly in the summary statement so that you don't need to use an @param. The @ markers can be useful for additional information that doesn't flow naturally in the summary statement but there's no need to repeat information there that can be read once at the beginning!

So in this case something like:

Returns the given `kanaWord` with katakana and half width characters converted to full-width hiragana. ヴ is not converted. If a non-kana characters is found in `kanaWord`, that and the following characters are omitted from the returned conversion.

}
/**
* Returns the input string converted into hiragana. If any characters are not
* found in the [NormalizationMap](./character_info.ts) then the character
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information is all true but it's more of an implementation detail vs a spec. Ideally, the doc string should be about the goal of the method and needn't go into detail about how we accomplished that. Basically, anything that could be changed without updating a test, shouldn't go in the doc string.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants