Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update JavaScript stemmer code to the latest version of Snowball (v2.1.0) #8867

Merged
merged 5 commits into from Feb 14, 2021

Conversation

mitya57
Copy link
Contributor

@mitya57 mitya57 commented Feb 10, 2021

The stemmer code that is currently used in Sphinx was based on an old version of snowball (it was added in 059a74c in 2015). Furthermore, it is not pure JS code, but rather “JSX” (not the widely known React JSX, but https://github.com/jsx/JSX, which is also dead since 2015).

Since then, there were two snowball releases: v2.0.0 in October 2019 and v2.1.0 recently, in January 2021. In snowballstem/snowball@ddbe1fa9d5da7db0, JSX was replaced with native JS that is much easier to maintain and understand.

I should also note that our Python code uses snowballstemmer module which is also v2.1.0 now, as that module is based on the same snowball code base (so this pull request brings JS up to date with Python).

What I did in this pull request:

  • I updated the non-minified JS files to the latest version from snowball. To generate them, I cloned snowball repository and ran make dist_libstemmer_js, then copied them from generated archive to Sphinx repository.

  • The minified files were embedded in Python code. This makes the code ugly and difficult to maintain. For consistency with non-minified files, I created a separate directory (minified-js) and put the files there.

    The minified files were created from non-minified ones using these commands:

    npm install uglifyjs
    for f in $(cd sphinx/search/non-minified-js && ls); do ./node_modules/.bin/uglifyjs sphinx/search/non-minified-js/$f -m -o sphinx/search/minified-js/$f; done

    The -m option of uglifyjs enables name mangling, e.g. using short names for variables that are only used internally.

  • To make the page load faster, the JS code is still embedded into language_data.js, not copied as its own file.

  • Note: now there is base-stemmer.js which defines the base class for all stemmers. That file is copied/embedded alongside the language-specific code.

With this pull request, it will be also easier to add new languages. In addition to what Sphinx already has, Snowball now supports Arabic, Armenian, Basque, Catalan, Greek, Hindi, Indonesian, Irish, Lithuanian, Nepali, Serbian, Tamil and Yiddish languages. I did not add new languages here, just updated the existing ones.

Copy link
Member

@tk0miya tk0miya left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you as always. LGTM with nits.

sphinx/search/__init__.py Outdated Show resolved Hide resolved
@tk0miya tk0miya merged commit 52b5509 into sphinx-doc:3.x Feb 14, 2021
@tk0miya
Copy link
Member

tk0miya commented Feb 14, 2021

Thank you for update. Merged.

tk0miya added a commit that referenced this pull request Feb 14, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 15, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants