Make number_of_words respect CJK characters #7813

iBug · 2019-09-05T10:56:01Z

This is a 🙋 feature or enhancement.

Summary

CJK languages generally consider every single character a word.

This PR modifies Liquid filter "number_of_words" so it counts CJK characters correctly.

Context

My personal blog uses Minimal Mistakes theme, which relies on this filter to generate a "reading time". However, as I write both English and Chinese, the latter of which doesn't use spaces to separate words, my Chinese articles get a "1 minute read" as a result regardless of how much I write. This is wrong.

I am currently using a hand-crafted plugin to resolve this issue, but I hope that it can be solved on the Jekyll side. The changes is mostly a port of my own plugin.

CJK languages generally consider every single character a word. This commit changed Liquid filter "number_of_words" so it counts CJK characters correctly.

mattr- · 2019-09-05T14:13:18Z

@iBug Thanks for the contribution!

It looks like there's one more change to make for rubocop, and then this should be good to merge

lib/jekyll/filters.rb

Co-Authored-By: Matt Rogers <mattr-@github.com>

iBug · 2019-09-05T14:35:31Z

@mattr- Thank you for your guidance. I've reviewed RuboCop output and applied your suggested changes. I'm rather new to Ruby actually.

By the way, I've updated the documentation text as well. Hope it helps.

ashmaroli · 2019-09-05T16:02:06Z

@iBug Thank you for submitting this enhancement.
But I feel it'd be better if there were tests as well. The necessity being to add coverage for the regular expression used. If any future complaints arise regarding the filter with CJK chars, one can always refer to those tests to understand the expected results.

Requested changes were made

Code taken from iBug. jekyll/jekyll#7813

iBug · 2020-03-07T09:12:20Z

Sorry for the hibernation of this PR. I spent a lot of time learning Ruby development and Jekyll. With the last commit 360ecda, bundle exec rake test gives no errors. Is that enough?

ashmaroli · 2020-03-07T10:19:33Z

@iBug Thank you for returning to this. The test looks sufficient to me.

doersino · 2020-04-21T09:24:09Z

I'm not sure whether including Katakana, Hiragana and Hangul (which all encode syllables, not words) in the regex is an ideal solution, but I do agree that it's better than the status quo.

Also note that while Japanese (Katakana, Hiragana) doesn't make use of ASCII spaces in writing, Korean (Hangul) does.

iBug · 2020-04-21T09:44:25Z

@doersino I'm aware of that, but in general CJK languages don't count "words", but "characters" instead (think how Microsoft Word does it). For ASCII spaces, since CJK chars are normalized into spaces before western (Latin, Cyrillic etc.) "words" are split, they don't matter here.

iBug · 2020-05-12T14:47:56Z

@ashmaroli Any chance someone could take a look at this? I think this is whatsoever better than before and thus worth having.

ashmaroli · 2020-05-12T16:24:54Z

Yes, I believe this is better than current master.
Pinging @mattr- for input.

ashmaroli · 2020-05-21T17:30:04Z

Looking at this again...
Say, I were to have {{ page.content | number_of_words }} in my layout, with the page containing only latin characters.
Based on the current proposed implementation:

input.scan(cjk_regex).length + input.gsub(cjk_regex, " ").split.length

input.scan(cjk_regex).length would allocate an empty array.
input.gsub(cjk_regex, " ") would duplicate input or page.content in our sample case (String#gsub duplicates given string irrespective of substitution) and then
input.gsub(cjk_regex, " ").split will generate an array of all the words in page.

In other words, a situation that was already bloated with master version (page.content is usually a large string) is going to bloat further with no apparent gain for the user.

Since, the percentage of users that would have CJK chars in their site's contents would be much lower than the percentage of users without CJK chars in their site's contents, I'm withholding approval on the change.

@iBug, I recommend that you come up with a solution acceptable for both sides of the table.

doersino · 2020-05-21T17:35:44Z

Suggestion:

Check whether the input contains any CJK characters. This could be implemented relatively cheaply.
If so, do what this pull request proposes.
Otherwise, do what's currently implemented.

ashmaroli · 2020-05-21T17:44:46Z

Check whether the input contains any CJK characters

That is one solution.
Another would be to have an optional parameter (set to ignore CJK chars by default). That way, the input string need not be checked on each run (input string can be long; checking the parameter would be a constant effort; the number of times this filter is called can be large in large sites.)

Benchmarking (and optionally profiling) the two alternatives can help decide which would be the better choice.

iBug · 2020-05-21T17:45:07Z

@ashmaroli @doersino I've pushed a patch for this. Thank you for your precious analyses and suggestions!

A summary of the new commit:

According to @doersino 's suggestions, I first check for CJK characters, and go the old way if none is found
According to @ashmaroli 's analyses, String#gsub is heavily bloated for creating copies of (usually) large strings. Using String#split with a regex as the argument (presumably) solves this by avoiding the big copy.

- Split with CJK characters only when present - Avoid bloating from String#gub by using String#split with regex directly

iBug · 2020-05-22T07:49:24Z

@ashmaroli With the last commit 21bf5a5 pushed, I've changed the code to the "Overhead Mode" (optional but disabled by default).

I've also updated the tests so both the new behavior and the old behavior are tested (essentially testing the optional parameter). On my local environment both rake test and rubocop -D passes (of course preceded by bundle exec). Please let me know if any extra work is required.

Again, I'm very grateful to you two for being super helpful and patient on guiding me through this seemingly simple PR. It's certainly been a wonderful experience participating in this great open-source project and there's no denying that I've learned a lot from this.

ashmaroli · 2020-05-22T07:55:29Z

You're welcome @iBug
If you still have the benchmark script on hand, try the following version:

def number_of_words(input, mode = nil)
  cjk_regex = %r![\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}]!
  word_regex = %r![^\p{Han}\p{Katakana}\p{Hiragana}\p{Hangul}\s]+!
  case mode
  when "cjk"
    input.scan(cjk_regex).length + input.scan(word_regex).length
  when "auto"
    cjk_count = input.scan(cjk_regex).length
    cjk_count.zero? ? input.split.length : cjk_count + input.scan(word_regex).length
  else
    input.split.length
  end
end

iBug · 2020-05-22T08:14:06Z

@ashmaroli Sure. I've put the new test code on top of the original one in order to reduce clutter (avoid making this thread too long). The new results deserve a separate post so here they are.

(Keep in mind that each function is run 10000.times)

                 user     system      total        real
Original 1   4.099273   0.000369   4.099642 (  4.099745)
Original 2   4.565014   0.000003   4.565017 (  4.565135)
Original 3   1.514981   0.000000   1.514981 (  1.515023)
Auto 1      11.171122   0.000000  11.171122 ( 11.171594)
Auto 2      17.005940   0.000001  17.005941 ( 17.006658)
Auto 3      16.100305   0.000000  16.100305 ( 16.100708)
User1- 1     4.443806   0.000000   4.443806 (  4.444016)
User1- 2     4.111661   0.000000   4.111661 (  4.111847)
User1- 3     1.607525   0.000000   1.607525 (  1.607591)
User1+ 1    20.131011   0.000000  20.131011 ( 20.131919)
User1+ 2    16.699124   0.000000  16.699124 ( 16.699834)
User1+ 3    16.022139   0.000000  16.022139 ( 16.022594)
User2- 1     4.704363   0.000000   4.704363 (  4.704513)
User2- 2     3.944270   0.000000   3.944270 (  3.944377)
User2- 3     2.134202   0.000000   2.134202 (  2.134379)
User2* 1    11.197304   0.000000  11.197304 ( 11.197632)
User2* 2    16.641564   0.000000  16.641564 ( 16.642144)
User2* 3    16.273725   0.000000  16.273725 ( 16.274143)
User2+ 1    19.845742   0.000000  19.845742 ( 19.846762)
User2+ 2    16.796103   0.000000  16.796103 ( 16.796858)
User2+ 3    15.658962   0.003737  15.662699 ( 15.663274)

Though, I still prefer the "auto mode" and would like to have it as the default behavior. Benchmark results show an increase of less than 1 ms increase per 20 KB source processed. This level of performance penalty falls far behind the deviation from computers' intrinsics. To be exact, this is not going to be observable in real runs of Jekyll.

crispgm · 2020-05-22T08:51:38Z

@iBug i guess you should add some tests on CJK languages other than Chinese. :)

ashmaroli · 2020-05-22T08:55:13Z

@iBug The takeaway from my suggestion above is:

The default behavior will be the existing behavior.
I'm willing to ship the auto mode option as well.
Using %r!...! is better than Regexp.new('...') (the latter generates a regex on each call).

ashmaroli · 2020-05-22T08:57:31Z

@crispgm Will you be able to provide sample texts in Japanese or Korean?

iBug · 2020-05-22T09:39:21Z

Is it possible to make the default option a site-wide setting? Let's say, something like site.number_of_words_mode. So it works this way:

If the key is not present in _config.yml, Ruby defaults it to nil, which matches the old behavior (or the planned update with mode = nil. This would hold true for nearly all users (no one sets a random key in their configs).
If the key exists, then it's used as the default value for mode parameter.

ashmaroli · 2020-05-22T09:47:00Z

Is it possible to make the default option a site-wide setting?

That would delay this PR a lot.
Better alternative would be to use include files.

// sample include file

{{ include.text | number_of_words: 'auto' }}

// template or page

{% assign foobar = 'some content' %}
{% include number_of_words.html text = foobar %}

crispgm · 2020-05-22T09:48:37Z

@crispgm Will you be able to provide sample texts in Japanese or Korean?

@ashmaroli okay, here they are

こんにちは、世界！ // number_of_words = 9
안녕하세요 세상! // number_of_words = 8

and powered by Google Translate.

ashmaroli · 2020-05-22T10:19:21Z

@crispgm Ah! good ol' Google Translate. :) But you forgot to mention how many words they are.
We can't know if the output is correct until there's something to check against.. 😉

crispgm · 2020-05-22T10:42:31Z

@crispgm Ah! good ol' Google Translate. :) But you forgot to mention how many words they are.
We can't know if the output is correct until there's something to check against.. 😉

This is true. I updated on last posts. :)

ashmaroli · 2020-05-22T12:24:28Z

@iBug Please update the documentation as well so that users can differentiate when to use the two modes. Once again, thank you for your patience and effort.

iBug · 2020-05-22T12:50:43Z

@ashmaroli I was already planning to do so, but I failed to identify which pieces of files I should look into.

ashmaroli · 2020-05-22T13:20:10Z

which pieces of files

Adding to what you have in docs/_data/jekyll_filters.yml should be sufficient.

iBug · 2020-05-22T13:38:41Z

@ashmaroli With the latest commit, I think it's almost there.

ashmaroli · 2020-05-22T15:01:16Z

Thank you @iBug and everyone that participated here.
@jekyllbot: merge +minor

@ashmaroli

Thanks to @ashmaroli's help: jekyll/jekyll#7813 (comment)

@ashmaroli

[b3e0b34] Use new number_of_words plugin Thanks to @ashmaroli's help: jekyll/jekyll#7813 (comment)

@ashmaroli

Thanks to @ashmaroli's help: jekyll/jekyll#7813 (comment)

Make number_of_words respect CJK characters

afbee53

CJK languages generally consider every single character a word. This commit changed Liquid filter "number_of_words" so it counts CJK characters correctly.

iBug mentioned this pull request Sep 5, 2019

Allow per-page override of words_per_minute mmistakes/minimal-mistakes#2250

Merged

iBug added 2 commits September 5, 2019 21:06

Maybe use lowercase for that variable?

3cc01b1

RuboCop is hard to satisfy...

1d37203

mattr- previously requested changes Sep 5, 2019

View reviewed changes

lib/jekyll/filters.rb Outdated Show resolved Hide resolved

iBug and others added 2 commits September 5, 2019 22:30

Use exclamation marks for %r!regex! to make RuboCop happy

6a33254

Co-Authored-By: Matt Rogers <mattr-@github.com>

Update documentation to reflect changes

1a6af42

ashmaroli added the needs-tests label Sep 5, 2019

weirane added a commit to weirane/blog that referenced this pull request Jan 1, 2020

Use a plugin to calculate word count correctly

45d52b6

Code taken from iBug. jekyll/jekyll#7813

Add test for patched number_of_words filter

360ecda

iBug requested a review from mattr- March 8, 2020 05:04

ashmaroli removed the needs-tests label Mar 8, 2020

ashmaroli removed the request for review from mattr- May 21, 2020 17:13

ashmaroli marked this pull request as draft May 21, 2020 17:30

iBug marked this pull request as ready for review May 21, 2020 17:42

Fix according to ashmaroli's suggestions

28da2dc

- Split with CJK characters only when present - Avoid bloating from String#gub by using String#split with regex directly

Make CJK processing optional, adapt tests

21bf5a5

Support auto mode, add Japanese and Korean test cases

36fe3da

Update documentation to reflect #7813

758949b

ashmaroli added 2 commits May 22, 2020 19:54

Improve documentation of the change

676dd86

Reduce duplication in filter

25d1388

ashmaroli approved these changes May 22, 2020

View reviewed changes

jekyllbot merged commit 13b7291 into jekyll:master May 22, 2020

jekyllbot added the enhancement label May 22, 2020

jekyllbot added a commit that referenced this pull request May 22, 2020

Update history to reflect merge of #7813 [ci skip]

e761d06

iBug added a commit to iBug/iBug-source that referenced this pull request May 22, 2020

Use new number_of_words plugin

b3e0b34

Thanks to @ashmaroli's help: jekyll/jekyll#7813 (comment)

iBug pushed a commit to iBug-web/iBug-web.github.io that referenced this pull request May 22, 2020

Auto deploy from GitHub Actions

b7ddb2c

[b3e0b34] Use new number_of_words plugin Thanks to @ashmaroli's help: jekyll/jekyll#7813 (comment)

iBug added a commit to iBug/iBug-source that referenced this pull request Dec 18, 2020

Use new number_of_words plugin

6af52fb

Thanks to @ashmaroli's help: jekyll/jekyll#7813 (comment)

jekyll locked and limited conversation to collaborators May 22, 2021

jekyllbot added the frozen-due-to-age label May 22, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make number_of_words respect CJK characters #7813

Make number_of_words respect CJK characters #7813

iBug commented Sep 5, 2019

mattr- commented Sep 5, 2019

iBug commented Sep 5, 2019

ashmaroli commented Sep 5, 2019

iBug commented Mar 7, 2020

ashmaroli commented Mar 7, 2020

doersino commented Apr 21, 2020

iBug commented Apr 21, 2020

iBug commented May 12, 2020

ashmaroli commented May 12, 2020

ashmaroli commented May 21, 2020

doersino commented May 21, 2020

ashmaroli commented May 21, 2020

iBug commented May 21, 2020

iBug commented May 22, 2020 •

edited

ashmaroli commented May 22, 2020

iBug commented May 22, 2020

crispgm commented May 22, 2020

ashmaroli commented May 22, 2020

ashmaroli commented May 22, 2020

iBug commented May 22, 2020 •

edited

ashmaroli commented May 22, 2020

crispgm commented May 22, 2020 •

edited

ashmaroli commented May 22, 2020

crispgm commented May 22, 2020

ashmaroli commented May 22, 2020

iBug commented May 22, 2020

ashmaroli commented May 22, 2020

iBug commented May 22, 2020

ashmaroli commented May 22, 2020

Make number_of_words respect CJK characters #7813

Make number_of_words respect CJK characters #7813

Conversation

iBug commented Sep 5, 2019

Summary

Context

mattr- commented Sep 5, 2019

iBug commented Sep 5, 2019

ashmaroli commented Sep 5, 2019

iBug commented Mar 7, 2020

ashmaroli commented Mar 7, 2020

doersino commented Apr 21, 2020

iBug commented Apr 21, 2020

iBug commented May 12, 2020

ashmaroli commented May 12, 2020

ashmaroli commented May 21, 2020

doersino commented May 21, 2020

ashmaroli commented May 21, 2020

iBug commented May 21, 2020

iBug commented May 22, 2020 • edited

ashmaroli commented May 22, 2020

iBug commented May 22, 2020

crispgm commented May 22, 2020

ashmaroli commented May 22, 2020

ashmaroli commented May 22, 2020

iBug commented May 22, 2020 • edited

ashmaroli commented May 22, 2020

crispgm commented May 22, 2020 • edited

ashmaroli commented May 22, 2020

crispgm commented May 22, 2020

ashmaroli commented May 22, 2020

iBug commented May 22, 2020

ashmaroli commented May 22, 2020

iBug commented May 22, 2020

ashmaroli commented May 22, 2020

iBug commented May 22, 2020 •

edited

iBug commented May 22, 2020 •

edited

crispgm commented May 22, 2020 •

edited