Support UTF-8 characters in feature name again #2976

henry0312 · 2020-04-06T09:22:29Z

This commit reverts 0d59859.
Also see:

- #2226
- #2478
- #2229

I reproduced the issue and, as @kidotaka's great survey in #2226 shows,
I conclude that the cause is not UTF-8 but "an empty string (character)".
Therefore, I revert "throw error when meet non ascii (#2229)", whose commit hash
is 0d59859, and add support for UTF-8 feature names again.

Sample code

reproduce the issue

The code below raises lightgbm.basic.LightGBMError: Wrong size of feature_names

import lightgbm
import numpy
from matplotlib import pyplot

numpy.random.seed(42)

train_x = numpy.random.normal(size=(1000, 4))
valid_x = numpy.random.normal(size=(100, 4))
train_t = numpy.random.random(1000)
valid_t = numpy.random.random(100)

train_lgb = lightgbm.Dataset(train_x, train_t)
valid_lgb = lightgbm.Dataset(valid_x, valid_t, reference=train_lgb)

# Three non-ASCII names plus one empty string; the empty string is what triggers the error.
feature_names = ['F_零', 'F_一', 'F_二', '']

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0,
    'seed': 42,
}

print('Starting training...')
evals_result = {}
gbm = lightgbm.train(params,
                     train_lgb,
                     num_boost_round=20,
                     valid_sets=valid_lgb,
                     feature_name=feature_names,
                     evals_result=evals_result,
                     early_stopping_rounds=5)

print('Plotting feature importances...')
ax = lightgbm.plot_importance(gbm, ignore_zero=False)
pyplot.show()
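
Since the failure comes from the empty name rather than from the non-ASCII ones, a check before training can surface it early. The following is only a minimal sketch: `find_empty_feature_names` is a hypothetical helper written for illustration and is not part of LightGBM.

```python
def find_empty_feature_names(feature_names):
    """Return the indices of names that are empty or whitespace-only."""
    # Hypothetical helper: not a LightGBM API.
    return [i for i, name in enumerate(feature_names) if not name.strip()]


# Same list as in the reproduction: three non-ASCII names and one empty string.
feature_names = ['F_零', 'F_一', 'F_二', '']
empty_indices = find_empty_feature_names(feature_names)
print(empty_indices)
```

Running such a check on the list before passing it as feature_name would point at the offending position instead of leaving only the "Wrong size of feature_names" message to go on.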

use utf-8 characters

You can see there is no problem with using utf-8 in feature names.

import lightgbm
import numpy
from matplotlib import pyplot

numpy.random.seed(42)

train_x = numpy.random.normal(size=(1000, 4))
valid_x = numpy.random.normal(size=(100, 4))
train_t = numpy.random.random(1000)
valid_t = numpy.random.random(100)

train_lgb = lightgbm.Dataset(train_x, train_t)
valid_lgb = lightgbm.Dataset(valid_x, valid_t, reference=train_lgb)

# This has non-ascii strings.
feature_names = ['F_零', 'F_一', 'F_二', 'F_三']

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0,
    'seed': 42,
}

print('Starting training...')
evals_result = {}
gbm = lightgbm.train(params,
                     train_lgb,
                     num_boost_round=20,
                     valid_sets=valid_lgb,
                     feature_name=feature_names,
                     evals_result=evals_result,
                     early_stopping_rounds=5)

print('Plotting feature importances...')
ax = lightgbm.plot_importance(gbm, ignore_zero=False)
pyplot.show()

@kidotaka

This commit reverts 0d59859. Also see:
- microsoft#2226
- microsoft#2478
- microsoft#2229

I reproduced the issue and, as @kidotaka's great survey in microsoft#2226 shows, I conclude that the cause is not UTF-8 but "an empty string (character)". Therefore, I revert "throw error when meet non ascii (microsoft#2229)", whose commit hash is 0d59859, and add support for feature names as UTF-8 again.

guolinke · 2020-04-06T09:34:39Z

@henry0312 did you test this?

LightGBM/src/boosting/gbdt_model_text.cpp

Line 481 in e6de39a

feature_names_ = Common::Split(key_vals["feature_names"].c_str(), ' ');

You can test that by saving the model to a string/file and loading it back.
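
The referenced line splits the stored feature names on a space, so the question is whether names survive that round trip. The Python sketch below only mimics the idea; `join_names` and `split_names` are hypothetical stand-ins, and the assumption that the splitter drops empty tokens is not confirmed anywhere in this thread.

```python
def join_names(names):
    # Mimic writing feature names as one space-separated line.
    return ' '.join(names)


def split_names(line):
    # Mimic reading them back; empty tokens are dropped (assumed behavior).
    return [token for token in line.split(' ') if token]


utf8_names = ['F_零', 'F_一', 'F_二', 'F_三']
with_empty = ['F_零', 'F_一', 'F_二', '']

round_trip_ok = split_names(join_names(utf8_names))
round_trip_bad = split_names(join_names(with_empty))

print(round_trip_ok == utf8_names)
print(len(round_trip_bad), 'names recovered from', len(with_empty))
```

Under that assumption, multi-byte names come back unchanged, while an empty name disappears and leaves the name count out of step with the feature count.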

henry0312 · 2020-04-06T10:05:21Z

@guolinke I just tried it and confirmed that there is no problem.

import lightgbm
import numpy
from matplotlib import pyplot

numpy.random.seed(42)

train_x = numpy.random.normal(size=(1000, 4))
valid_x = numpy.random.normal(size=(100, 4))
train_t = numpy.random.random(1000)
valid_t = numpy.random.random(100)

train_lgb = lightgbm.Dataset(train_x, train_t)
valid_lgb = lightgbm.Dataset(valid_x, valid_t, reference=train_lgb)

# This has non-ascii strings.
feature_names = ['F_零', 'F_一', 'F_二', 'F_三']

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': {'l2', 'l1'},
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': 0,
    'seed': 42,
}

print('Starting training...')
evals_result = {}
gbm = lightgbm.train(params,
                     train_lgb,
                     num_boost_round=20,
                     valid_sets=valid_lgb,
                     feature_name=feature_names,
                     evals_result=evals_result,
                     early_stopping_rounds=5)

# feature names
print('Feature names:', gbm.feature_name())

print('Saving model...')
# save model to file
gbm.save_model('model.txt')

print('Loading model to predict...')
# load model to predict
gbm2 = lightgbm.Booster(model_file='model.txt')

# feature names
print('Feature names:', gbm2.feature_name())

❯ python test_utf8.py
Starting training...
[LightGBM] [Warning] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000196 seconds.
You can set `force_col_wise=true` to remove the overhead.
[1]	valid_0's l1: 0.234781	valid_0's l2: 0.0734969
Training until validation scores don't improve for 5 rounds
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[2]	valid_0's l1: 0.235442	valid_0's l2: 0.0736321
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[3]	valid_0's l1: 0.236002	valid_0's l2: 0.0738012
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[4]	valid_0's l1: 0.236452	valid_0's l2: 0.0738955
[LightGBM] [Warning] No further splits with positive gain, best gain: -inf
[5]	valid_0's l1: 0.236552	valid_0's l2: 0.0739692
[6]	valid_0's l1: 0.236756	valid_0's l2: 0.0739201
Early stopping, best iteration is:
[1]	valid_0's l1: 0.234781	valid_0's l2: 0.0734969
Feature names: ['F_零', 'F_一', 'F_二', 'F_三']
Saving model...
Loading model to predict...
Feature names: ['F_零', 'F_一', 'F_二', 'F_三']

EDIT: 2020-04-06 19:07

guolinke · 2020-04-06T10:16:49Z

great! I think we can add more test cases for utf8 feature names.

henry0312 · 2020-04-06T10:20:18Z

@guolinke Sure! Please wait a moment.

StrikerRUS · 2020-04-06T13:18:26Z

I think we should revisit R-package as well to ensure it supports non-ASCII names.
Ping @jameslamb and @Laurae2 for this.

henry0312 · 2020-04-06T13:37:45Z

> I think we should revisit R-package as well to ensure it supports non-ASCII names.

As I remember, we hadn't had any problems with feature names before 0d59859.

henry0312 · 2020-04-06T13:39:42Z

By the way, we should drop support for Python 2.x as soon as possible.
Its string handling makes working with UTF-8 really painful.

jameslamb · 2020-04-06T13:47:10Z

> > I think we should revisit R-package as well to ensure it supports non-ASCII names.
>
> As I remember, we hadn't had any problems with feature names before 0d59859.

@henry0312 I'm surprised by the R test errors you got, but they don't seem related to this PR and the fixes are ok with me generally (I'll leave a comment)

I'll add some test code for R here in a few minutes

jameslamb · 2020-04-06T16:40:49Z

I'm looking at the logs and really don't understand these errors, or what is different between the CI environment and my laptop. I'm running the tests from a shell so it's not like my local environment is benefitting from weird RStudio magic with the environment.

The new test is failing

But so are 8 others

henry0312 · 2020-04-06T16:54:02Z

F_<e9><9b><b6> is the name shown as its raw UTF-8 bytes.
F_\u96f6 is the same name shown as a Unicode escape.
https://www.compart.com/en/unicode/U+96F6
So you need to check against the actual character (零).
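
To make the two representations concrete, here is a small Python sketch (Python is used only for illustration; the failing test itself is in R). It shows that both forms denote the same character and differ only in how it is written out.

```python
name = 'F_零'

utf8_bytes = name.encode('utf-8')        # raw UTF-8 bytes of the name
escaped = name.encode('unicode_escape')  # ASCII-only escaped form

print(utf8_bytes)
print(escaped)

# Decoding either representation yields the same character again.
assert utf8_bytes.decode('utf-8') == name
assert escaped.decode('unicode_escape') == name
```

A test that compares against one of these written forms rather than the decoded character can fail even though the underlying name is correct.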

henry0312 · 2020-04-06T16:57:07Z

The JSON library in R may be encoding the strings, although I don't know much about R.

henry0312 · 2020-04-06T17:07:02Z

I think functions around https://rlang.r-lib.org/reference/chr_unserialise_unicode.html are related.

StrikerRUS · 2020-04-06T21:32:38Z

If it is not trivial to enable UTF-8 support in R, I think we can split the Python and R changes into separate PRs.

StrikerRUS · 2020-04-06T21:34:43Z

I found the same TeX issue: https://tug.org/pipermail/tex-live/2020-April/045230.html
It seems to be a temporary problem 😓

Do we need to report the current issue somewhere, or are they aware of it since it is an "every year" problem?

R-package/tests/testthat/test_basic.R

henry0312 · 2020-04-07T01:18:25Z

Finally, all tests have passed (though this includes some workarounds that are not related to Python 😅)
@guolinke @StrikerRUS @jameslamb this is ready for merge.

> I found the same TeX issue: https://tug.org/pipermail/tex-live/2020-April/045230.html
> It seems to be a temporary problem 😓
>
> Do we need to report the current issue somewhere, or are they aware of it since it is an "every year" problem?

No need, but we will continue to check the thread.

StrikerRUS · 2020-04-07T13:35:29Z

@henry0312 I extracted the workarounds not related to this PR in #2977 and just merged it into master.

henry0312 · 2020-04-07T15:06:15Z

Caught up with the current master/HEAD.

StrikerRUS

What about the following decodes?

LightGBM/python-package/lightgbm/basic.py

Line 2585 in 91185c3

ret = json.loads(string_buffer.value.decode())

LightGBM/python-package/lightgbm/basic.py

Lines 2956 to 2957 in 91185c3

    
self.__name_inner_eval = \
    [string_buffers[i].value.decode() for i in range_(self.__num_inner_eval)]

tests/python_package_test/test_engine.py

jameslamb · 2020-04-07T20:57:58Z

Looks ok to me from R side! But my approval shouldn't count towards a merge.

henry0312 · 2020-04-08T10:15:42Z

> What about the following decodes?
>
> LightGBM/python-package/lightgbm/basic.py
>
> Line 2585 in 91185c3
>
> ret = json.loads(string_buffer.value.decode())
>
> LightGBM/python-package/lightgbm/basic.py
>
> Lines 2956 to 2957 in 91185c3
>
> self.__name_inner_eval = \
>     [string_buffers[i].value.decode() for i in range_(self.__num_inner_eval)]

@StrikerRUS yes, you are right. An explicit .decode('utf-8') is needed there.
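
The reason the explicit codec matters: on Python 3, bytes.decode() defaults to UTF-8, but on Python 2, str.decode() defaults to ASCII and fails on multi-byte names. Below is a minimal sketch using a ctypes buffer as a stand-in for the buffers in basic.py; the buffer contents are an illustrative assumption, not the actual data LightGBM returns.

```python
import ctypes

# Stand-in for a C string buffer holding a UTF-8 encoded name (illustrative data).
string_buffer = ctypes.create_string_buffer('F_零'.encode('utf-8'))

# Naming the codec explicitly gives the same result on any interpreter default.
decoded = string_buffer.value.decode('utf-8')
print(decoded)
```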

> Looks ok to me from R side! But my approval shouldn't count towards a merge.

@jameslamb can you approve again?

StrikerRUS

LGTM except one minor comment!

tests/python_package_test/test_engine.py

he approved at #2976 (comment).

henry0312 requested a review from guolinke April 6, 2020 09:22

henry0312 requested review from chivee, jameslamb, Laurae2 and StrikerRUS as code owners April 6, 2020 09:22

henry0312 requested review from StrikerRUS and chivee and removed request for chivee, jameslamb, Laurae2 and StrikerRUS April 6, 2020 09:22

henry0312 added 6 commits April 6, 2020 20:26

add tests

d6c9e17

fix check-docs tests

1404c59

update

5e34022

fix tests

228c50a

update .travis.yml

227850b

fix tests

42fdcbe

henry0312 requested a review from wxchan as a code owner April 6, 2020 13:01

henry0312 added 2 commits April 6, 2020 22:27

update test_r_package.sh

00349e5

update test_r_package.sh

da076cb

henry0312 removed the request for review from wxchan April 6, 2020 13:33

jameslamb reviewed Apr 6, 2020

R-package/tests/testthat/test_basic.R

jameslamb mentioned this pull request Apr 7, 2020

Master failing for LaTex error uptake/pkgnet#275

Closed

update

a36e8c4

Merge remote-tracking branch 'upstream/master' into support_utf-8

f1a71b6

henry0312 self-assigned this Apr 7, 2020

StrikerRUS reviewed Apr 7, 2020

tests/python_package_test/test_engine.py

jameslamb self-requested a review April 7, 2020 20:57

jameslamb mentioned this pull request Apr 8, 2020

[R-package] Add support for non-ASCII feature names #2983

Closed

updte

1290e6a

update

f459dd2

StrikerRUS approved these changes Apr 8, 2020

tests/python_package_test/test_engine.py

remove unneeded comments

a6d5e64

henry0312 merged commit 44a9120 into microsoft:master Apr 10, 2020

henry0312 deleted the support_utf-8 branch April 10, 2020 03:54

StrikerRUS mentioned this pull request Apr 10, 2020

[feature requests] support utf-8 characters in feature name #2478

Closed

StrikerRUS added the feature label Apr 10, 2020

StrikerRUS mentioned this pull request May 19, 2020

[BUG]Does LightGBM support non-ASCII characters NOT THE FEATURE NAME #3102

Closed

lock bot locked as resolved and limited conversation to collaborators Jun 24, 2020
