Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use tabs as the field separator for the dict.dat and names.dat dictionary files #1614

Open
bryanjenningz opened this issue Jun 12, 2023 · 3 comments
Assignees

Comments

@bryanjenningz
Copy link
Contributor

Is your feature request related to a problem? Please describe.
Currently, the dict.dat file and names.dat file use spaces to separate fields, which makes parsing logic complicated because there are spaces in the definitions field. To add to the complexity, some of the lines in dict.dat and names.dat leave out the pronunciation field, so this makes the parsing logic even more complicated.

Examples of the current format in dict.dat and names.dat which include the pronunciation field:

揚げ雲雀 [あげひばり] /(n) soaring skylark/high-flying skylark/
B'z [ビーズ] /(group) B'z (rock duo)/

Examples of the current format in dict.dat and names.dat which don't include the pronunciation field:

あーた /(pn) (sl) you/
あーみん /(f) A-min/

If we wanted to add another field to these dictionary files (e.g. pitch accent) it would require us to change the parsing logic to be even more complicated and even harder to understand.
The current format is also using extra characters to help differentiate fields which makes the dictionary files larger than they need to be (i.e. we use "[" and "]" characters to indicate the pronunciation field and we use the "/" character around the definitions field, which causes 4 extra characters to be used per line - since there are about 1,030,000 lines total in dict.dat and names.dat together, this means we add about 4MB extra space by using this dictionary format).

Describe the solution you'd like
A simple solution would be to use a different character to separate each field in dict.dat and names.dat - for example we could use the tab character (\t) or the pipe character (|) or any other character which is never used in the fields of dict.dat and names.dat.

If we used the tab or pipe character, parsing logic would be trivial, we could just split the line by tabs or by pipes (e.g. const [word, pronunciation, definitions] = line.split('\t');). This would also make it easier for us to add new fields in the future without making the parsing logic more complex (e.g. if we added pitch accent, we could just parse it like this const [word, pronunciation, definitions, pitchAccents] = line.split('\t');). This would also allow us to save memory by removing the "[", "]", and "/" characters around the pronunciation and definitions field and we wouldn't need to add any extra characters around any new fields we add in the future.

So the example lines from above would be formatted like this (separated by \t):

揚げ雲雀\tあげひばり\t(n) soaring skylark/high-flying skylark
B'z\tビーズ\t(group) B'z (rock duo)
あーた\t\t(pn) (sl) you
あーみん\t\t(f) A-min

Notice that the last 2 entries don't have a pronunciation so there are 2 tabs in a row, this makes parsing easier.

Describe alternatives you've considered
We could possibly just keep the format the same and have parsing logic in a single function and unit test that function very well so that it would prevent any bugs. If we keep the format the same then we can reuse this function in the update-db.ts file which is where the files get indexed and we could also use this function in the data.ts file where the entries are parsed. But if we do this, then this would still use up 4MB extra space (because of the "[", "]", and "/" characters around pronunciation and definitions) and it would still require us to write a separate parsing function in the future if we add an extra field (e.g. a pitch accents field), so this function would only be used in the update-db.ts file if we add any extra fields in the future.
Another possible solution would be to use a separate file for pitch accents and then index that file separately rather than adding pitch accents as a new field in dict.dat and names.dat. This way, we could reuse that parsing function but we would still need to parse the pitch accents file. But it might be more simple to just add the pitch accents as a new field so they're packaged together, so it is probably a better idea to just simplify the dict.dat and names.dat file format to use tabs or pipes as separators.

Additional context
Right now, dict.dat and names.dat are created in the update-db.ts file. They are fetched from a different source, then they're indexed and saved but the file format isn't changes, so this is where we would change the file format to use tab or pipes instead of the current format. If we want to add any fields in the future (e.g. pitch accents field), this is also where we would add them.

@melink14
Copy link
Owner

Using a better separator to make parsing logic easier seems good to me. I can't think of any downside to keeping the current logic the same in general. A couple thoughts:

  1. In the mid term, I think we'll need to switch to a new data format altogether (storing in memory might feel slow in MV3 depending on how often the service worker is taken out) and in general using a real schema with indexDB will make using modern dictionary files easier I think. See Create modern, flexible system for handling arbitrary dictionaries #185 for thoughts around that.

I only bring this up since if there were any expansive or hard changes to the data format, I would suggest waiting. In this case, the change seems easy so there's little lost.

  1. Essentially to implement this, you'd want to update the the parsing logic and then fix and run update-db.ts to regenerate the data files with the new separator. This is not so complicated but ideally that data format change would not have other data changes. But sense update-db.ts just fetches the latest files it will requrie some local customizations and/or timing to make sure the dictionary files used are the same as what's already at head.

Generally, the update dictionary script runs Every Monday night (for me) so let me know if you want to coordinate timing the release in case it's hard to align the dictionaries otherwise.

I'll assign this to you for now assuming you want to work on it!

@melink14
Copy link
Owner

@all-contributors Add @bryanjenningz for ideas.

@allcontributors
Copy link
Contributor

@melink14

I've put up a pull request to add @bryanjenningz! 🎉

mergify bot pushed a commit that referenced this issue Jun 12, 2023
Adds @bryanjenningz as a contributor for ideas.

This was requested by melink14 [in this comment](#1614 (comment))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants