problematic licensing of /tests? #231
Comments
Unfortunately, we also have no good idea for a better source of test data. One possibility would be to download the test data on the fly from the original web pages when running the test suite, but that would mean keeping the links up to date whenever URLs change or go offline...
This is a very good point. The model retraining script I've been working on on-and-off for the past few years (ugh, that doesn't feel good to write out) scrapes Wikipedia on the fly so as to avoid some of those issues. That said, we could always just make sure the test data doesn't end up inside the wheel or tarball that gets uploaded to PyPI, couldn't we? Then the package itself isn't sharing anything that people might find problematic.
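One way to check that for a given release is to list what an sdist actually contains. The sketch below is only an illustration: the helper name `members_under` is hypothetical, and it assumes the usual sdist layout with a single root folder and the test data under a top-level `tests/` directory. It is not taken from chardet's build setup.

```python
import tarfile


def members_under(sdist_path, dirname="tests"):
    """Return archive member names that live under <root>/<dirname>/.

    Assumes one root folder inside the archive (for example
    "chardet-4.0.0/"); `dirname` is an illustrative default.
    """
    hits = []
    with tarfile.open(sdist_path, "r:*") as archive:
        for name in archive.getnames():
            parts = name.split("/")
            # parts[0] is the root folder, parts[1] the first real path component.
            if len(parts) > 1 and parts[1] == dirname:
                hits.append(name)
    return hits
```

An empty result for the tarball about to be uploaded would indicate that no test data is being shipped in that archive.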
Hmm, I'm in a team supporting many Open Source based products at Siemens and I've seen both: projects downloading the tarball from PyPI and projects taking it from GitHub. The next question is what Linux distributions do. Debian and openSUSE do seem to take the sources from PyPI, but I think I have also seen other Debian Python packages being built from a GitHub snapshot. So it would already be great if you could ensure that these files don't end up on PyPI, but the best way to keep users (and yourself ;) ) safe would be to get rid of them on GitHub, too...
A colleague just checked Arch and Gentoo. Arch takes your component from PyPI. Gentoo downloads from GitHub: https://github.com/gentoo/gentoo/blob/25c7cd62d731e2587924c50204671c48186d3005/dev-python/chardet/chardet-4.0.0.ebuild#L11. This also means that Gentoo mirrors the problematic files: http://distfiles.gentoo.org/distfiles/bd/chardet-4.0.0.tar.gz. So again, removing those files from the PyPI tarball would already be great and would fix it for many people and distributions, but to cover all cases it would be better to have these files removed from GitHub, too.
Hi, I'm the co-maintainer of the Debian package, many thanks for reporting this issue (also on the Debian BTS). In Debian we prefer PyPI but it's not a hard requirement: I maintain packages that use GitHub directly (for example because the documentation is missing from the sdist). The PyPI sdist is generally preferred because we think that the sdist contains what upstream developers intend to publish. At the moment chardet in Debian is taken from PyPI: https://salsa.debian.org/python-team/packages/chardet/-/blob/debian/master/debian/watch
@gernot-h a note about downloading test data: Debian build machines (and I suppose other distributions do the same) can't access the internet to download anything, so we could not run the tests. I agree on removing the test data from the PyPI sdist so that chardet no longer contains problematic data, but can we re-add some of the test data after an assessment? I looked at the Italian test data and both https://github.com/chardet/chardet/blob/master/tests/iso-8859-1/_ude_3.txt and https://github.com/chardet/chardet/blob/master/tests/iso-8859-1/_ude_4.txt are from Pirandello's work "Sei personaggi in cerca d'autore", which is in the public domain since Pirandello died in 1936: https://www.gutenberg.org/ebooks/18457 This was easy... some of the text is from websites that don't exist anymore...
@dan-blanchard what about using a copy of Wikipedia texts, I mean not downloading on the fly? Wikipedia is PD, no? I really would prefer to have offline tests.
Hi @eriol, thanks for your reply! I can perfectly understand the problem with online tests. About Wikipedia: it's not PD, it's CC-BY-SA, with the "SA" being somewhat comparable to the copyleft effect of the GPL. And the "SA" in CC licenses comes with big uncertainties for anything other than documents, which was for example the reason why OpenStreetMap moved away from CC-BY-SA. So I'm not completely sure what incorporating Wikipedia snippets could mean for chardet as a whole. As a technical guy, I would argue that your Python code is in no way a derived work of text snippets in test cases, but I will try to verify this with our licensing experts and let you know. Given the current vacation period, it might however take some time for me to get an answer.
Hi there, There is an obvious solution to that "problem": keep the test data in a separate repository and use it whenever the test suite needs to be run.
Like I have done here https://github.com/Ousret/char-dataset That would make those files optional. It would require a proper PR.
That has the same problems as downloading the data on-the-fly. People should be able to run the tests without internet access for reasons like @eriol pointed out.
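As a rough sketch of how a test suite could treat the data as optional without ever touching the network, the helper below only looks at what is already on disk. The name `collect_samples` is hypothetical, and the `<root>/<encoding>/<file>` layout is an assumption inferred from paths such as tests/iso-8859-1/_ude_3.txt mentioned in this thread.

```python
from pathlib import Path


def collect_samples(root):
    """Map each encoding directory under `root` to its sorted sample files.

    Returns an empty dict when `root` is missing, so callers can skip the
    data-driven tests instead of failing or reaching for the network.
    """
    root = Path(root)
    if not root.is_dir():
        return {}
    samples = {}
    for encoding_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        files = sorted(f for f in encoding_dir.iterdir() if f.is_file())
        if files:
            samples[encoding_dir.name] = files
    return samples
```

A test runner could call this once and mark the detection tests as skipped when the result is empty, which would keep builds on machines without internet access working whether or not the data is shipped.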
Yeah, I'm not a lawyer, but it would be hard for me to understand how chardet itself would be considered a derived work just because it is tested on Wikipedia data. That said, if everything we used was public domain, we could sidestep the issue entirely, so maybe rather than doubling down on Wikipedia we could look at more Project Gutenberg texts, assuming there are some for every language we support.
I thought that using two predefined sources would work, as I vaguely remember that it is possible. I could be wrong. Regards,
Regarding CC-BY-SA: as a first reaction, our licensing expert agreed that he also wouldn't see chardet code as a derived work of a CC-BY-SA test suite file. He did however point out a future risk if specific content in the test suite (think of a very rare combination of encodings in a test sample?) triggers a special feature or bugfix in chardet code. No idea whether such a scenario is realistic...
This is a note that I have just taken over primary maintenance of the We can’t include files that aren’t covered by a Fedora-approved license in our source RPMs, so we can’t keep using either the PyPI sdist or the GitHub archive directly. Instead, I’ve followed the usual process for dealing with this sort of issue. There is now a small script that can download the PyPI sdist and filter out Of course, this means we can’t run any tests except for verifying that all of the package’s modules can be imported, which is unfortunate. CC-BY-SA test data in the source RPM would be just fine for Fedora, but random things from random websites under no particular license are not.
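For readers wondering what such a filtering step might look like, here is a minimal sketch. It is not the Fedora script: the comment above does not say exactly what that script removes, so the function name `strip_directory` and the default of dropping a top-level `tests/` directory are assumptions made purely for illustration.

```python
import tarfile


def strip_directory(src_path, dst_path, dirname="tests"):
    """Copy a tar archive to a new .tar.gz, leaving out <root>/<dirname>/.

    Returns the names of the members that were dropped. `dirname` is an
    illustrative default, not a statement about any real packaging script.
    """
    dropped = []
    with tarfile.open(src_path, "r:*") as src, tarfile.open(dst_path, "w:gz") as dst:
        for member in src.getmembers():
            parts = member.name.split("/")
            if len(parts) > 1 and parts[1] == dirname:
                dropped.append(member.name)
                continue
            # Regular files carry data; directories and links do not.
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
    return dropped
```

The filtered archive keeps the original member metadata, so the only difference a downstream build should see is the missing directory.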
While checking the licensing of this Python library, we stumbled over this note in https://github.com/chardet/chardet/blob/master/tests/README.txt:
In our project team, there are major concerns that redistribution of those files might not be allowed. Looking into the history of test/, some other tests were also taken from https://github.com/errepi/ude/tree/master/src/Tests/Data which states that the tests were copied from Wikipedia (and thus CC-BY-SA, which might also be problematic) and Project Gutenberg (Public Domain?).