Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] MNT Include all pxd files in the package #15626

Merged
merged 6 commits into from Nov 26, 2019

Conversation

jeremiedbb
Copy link
Member

@jeremiedbb jeremiedbb commented Nov 14, 2019

Fixes #14847

Alternative to #14896

We can specify, in the top level setup, the files that need to be included in the package. (in combination to manifest.in but .pxd are already included in it).

I just tested it and it works:

conda create -n tmp python=3.7 pip numpy scipy cython
git clean -xdf
python setup.py sdist
pip install dist/scikit-learn-0.22.dev0.tar.gz

then
find /path/to/conda/envs/tmp/lib/python3.7/site-packages/sklearn/ -name '*.pxd'
shows all 17 .pxd files as in the sklearn repo.

@rth @ogrisel @jnothman @adrinjalali @realead

@jeremiedbb jeremiedbb added this to the 0.22 milestone Nov 14, 2019
Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplicity of this solution better. Let's assume that package_data is properly tested.

Copy link
Member

@rth rth left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm happy with this solution as well (I have not double-checked that it works as expected).

Can we just recognize the contribution of the author in #14896 in some way? Either by merging part of their validation script under maint_tools in that PR, or by adding them a co-author to this PR (I think adding Signed-off-by: to the final commit message would do it). They did significant work on this, and didn't receive much timely feedback..

@jeremiedbb
Copy link
Member Author

or by adding them a co-author to this PR (I think adding Signed-off-by: to the final commit message would do it

Of course. I opened this PR because I was reviewing his PR and found a simpler method by chance.

I don't think the validation script is necessary. It's the role of setuptools to test that package_data works.

@ogrisel
Copy link
Member

ogrisel commented Nov 14, 2019

We could have a bit of bash script that checks the content of a generated sdist tarball as a CI job. I also think we should change at least one of our CI jobs to build and install from the sdist tarball and run the tests on the installed scikit-learn in a new empty env using the pytest --pyargs sklearn pattern.

But this can be done in a later PR.

@realead
Copy link

realead commented Nov 14, 2019

There is my reasoning why I didn't opt for this solution:

Having a pxd doesn't mean you can use it/cimport it, for example because of https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/svm/_libsvm.pxd#L42 _libsvm.pxd cannot be cimported/used: libsvm_helper.c is missing in the installation (but this is not the only problem).

It is not enough to tests that a pxd-file is present in the installation - one must really tests that an extension using it gets compiled/linked/loaded (see _libsvm.pxd as a problematic example) - the script in maint_tool checks exactly this.

Thus, I opted to include only pxd-files which really can be used when installed, and skipped others.

However, IMO using package_data is a good idea, if one checks that all pxd-files can be cimported/used (and not only whether they are present) and fixes the cases for which pxd-files don't work (which was a too large change for me).

@jeremiedbb
Copy link
Member Author

Can someone remind me why we want the .pxd files in the installed package ? @rth ?
They are all part of the private api, so my understanding is we include them but provide no support and no guaranty for their use. Am I wrong ?
Under that perspective, I don't think we need to test that they all can be used by third party libraries.

@jeremiedbb
Copy link
Member Author

We could have a bit of bash script that checks the content of a generated sdist tarball as a CI job.

@ogrisel they are already in the sdist (otherwise pip install could not build sklearn), what this pr does and what could be tested is that they are in the installed package.

@jeremiedbb
Copy link
Member Author

_libsvm.pxd cannot be cimported/used: libsvm_helper.c is missing in the installation (but this is not the only problem).

We could include the .c and .h files as well ?

@NicolasHug
Copy link
Member

Can someone remind me why we want the .pxd files in the installed package ? @rth ?

I assume it's for people to be able to use scikit-learn's private API in their own scripts without having to download the source? Sort of like a -dev linux package.
But if we do that, I agree it would make sense to also include any .c or .h that isn't generated by Cython, else only some parts of scikit-learn can be built.


I think I like this solution too. Worst case scenario, the installed package contains .pxd that cannot actually be used... which is OK right?

Can we just recognize the contribution of the author in #14896 in some way?

+1. We can add a what's new entry crediting both this PR and the former one.

@realead
Copy link

realead commented Nov 19, 2019

@jeremiedbb

We could include the .c and .h files as well ?

That is true, but usually one just would like to write "cimport xxxx" without hunting all dependencies down.

numpy.pxd is a little bit different, but at least there is np.get_includes().

After thinking about it, probably _libsvm.pxd and _liblinear.pxd should be actually pxi-files and not pxd-files.

@jeremiedbb
Copy link
Member Author

After thinking about it, probably _libsvm.pxd and _liblinear.pxd should be actually pxi-files and not pxd-files.

Indeed (although not mandatory). I renamed those files. Should be ok since it's private now. Thus they won't be included in the install

@realead I added a test to check that all pxd files included in the installation directory can be cimported. It's inspired from your test but a bit simpler because I don't check that all pxd files from the repo are actually in the install. Let me know what you think about it.

@realead
Copy link

realead commented Nov 21, 2019

@jeremiedbb looks good to me! I think we can assume setuptools will collect all *.pxd correctly.

Copy link
Member

@ogrisel ogrisel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the idea of a centralized Cython check. Here are a few cosmetic suggestions but feel free to ignore them ;)

Comment on lines 38 to 40
with open('tst.pyx', 'w') as f:
for file in pxd_files:
to_import = file.replace(os.path.sep, '.')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nipick: file -> pxd_file to avoid shadowing the file built-in type in the globals namespace.

check=True)

except subprocess.CalledProcessError:
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you need this except clause? Just a try / finally construct would work fine, no?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But actually we can get rid of the try / finally if you remove the os.chdir calls.


with tempfile.TemporaryDirectory() as tmpdir:
try:
os.chdir(tmpdir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid os.chdir and instead:

with open(tmpdir.name +os.path.sep + 'tst.pyx', 'w') as f:
   ...

setup_path = tmpdir.name + os.path.sep + "setup_tst.py"

and then:

        subprocess.run(["python", setup_path, "build_ext", "-i"],
                       check=True, cwd=tmpdir.name)

cdef extern from "math.h":
cdef extern double sqrt(double x)


Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch!



skl_dir = sys.argv[1]
skl_dir = os.path.abspath(skl_dir)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick, let's use a oneliner with with a more explicit variable name:

Suggested change
skl_dir = os.path.abspath(skl_dir)
sklearn_dir = os.path.abspath(sys.argv[1])

skl_dir = os.path.abspath(skl_dir)

# find all .pxd files
pxd_files = glob.glob(os.path.join(skl_dir, '**', '*.pxd'), recursive=True)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way for this kind of helper maintenance script we could use recent feature of Python 3.6+ such as pathlib:

import pathlib

tmpdir = pathlib.Path(tmpdir.name)
setup_path = tmpdir / "setup.py"

sklearn_dir = pathlib.Path(sys.argv[1])
pxd_files = sklearn_dir.glob("**/*.pxd")


print("> Found pxd files:")
for pxd_file in pxd_files:
    print(' -', str(pxd_file))

...

@adrinjalali
Copy link
Member

Is this ready to be merged now? I can't give a review, but it'd be nice to have in the RC3 (working on it now).

This was referenced Nov 25, 2019
@jeremiedbb
Copy link
Member Author

I addressed all comments from @ogrisel so I guess it should be good to go

Copy link
Member

@jnothman jnothman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be run by CI? If it's costly, it can go into the wheel build

@jeremiedbb
Copy link
Member Author

If it's costly, it can go into the wheel build

It's not costly at all (takes a few seconds) but its goal is to be run on scikit-learn installed from source distribution. it could be interesting to generate the sdist and install from that in one of the ci jobs however.

@adrinjalali
Copy link
Member

Can we merge this, have in the RC, and do the CI later?

@jeremiedbb
Copy link
Member Author

Ok for me

@adrinjalali adrinjalali merged commit 9e5819a into scikit-learn:master Nov 26, 2019
adrinjalali pushed a commit to adrinjalali/scikit-learn that referenced this pull request Nov 26, 2019
* include all pxd files in package

* _libsvm _liblinear to pxi + add a test

* simplify stuff + use pathlib

* cln

* more cln
jnothman pushed a commit that referenced this pull request Nov 28, 2019
* include all pxd files in package

* _libsvm _liblinear to pxi + add a test

* simplify stuff + use pathlib

* cln

* more cln
panpiort8 pushed a commit to panpiort8/scikit-learn that referenced this pull request Mar 3, 2020
* include all pxd files in package

* _libsvm _liblinear to pxi + add a test

* simplify stuff + use pathlib

* cln

* more cln
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Consider a consistent policy towards including pxd-files into the installation
7 participants