problematic licensing of /tests? #231

Open
gernot-h opened this issue Aug 6, 2021 · 11 comments

@gernot-h

gernot-h commented Aug 6, 2021

While checking the licensing of this Python library, we stumbled over this note in https://github.com/chardet/chardet/blob/master/tests/README.txt:

These test feeds were downloaded from random sites while I was developing the Universal Encoding Detector.
Each feed is copyright its respective publisher.

In our project team, there are major concerns that redistribution of those files might not be allowed. Looking into the history of tests/, we found that some other tests were taken from https://github.com/errepi/ude/tree/master/src/Tests/Data, which states that its tests are copied from Wikipedia (and thus CC-BY-SA, which might also be problematic) and Project Gutenberg (public domain?).

@gernot-h
Author

gernot-h commented Aug 6, 2021

Unfortunately, we also have no good idea for a better source of test data. One possibility could be to download test data on the fly from the original web pages when running the test suite, but this would mean keeping the links up to date when URLs change or go offline...

@dan-blanchard
Member

This is a very good point. The model retraining script I've been working on on-and-off for the past few years (ugh, that doesn't feel good to write out) scrapes Wikipedia on the fly so as to avoid some of those issues. That said, we could always just make sure the test data does not end up inside the wheel or tarball that gets uploaded to PyPI, couldn't we? Then the package itself isn't sharing anything that people might find problematic.
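
To check whether a built sdist still ships the test data, something like the following could be used. It is only a minimal sketch: the function name `find_test_members` is made up for illustration, and it assumes the sdist unpacks into a single top-level directory (for example chardet-4.0.0/) with the test data in a tests/ directory directly below it.

```python
import tarfile


def find_test_members(sdist_path, test_dir="tests"):
    """Return names of archive members that live under <top-level>/<test_dir>/."""
    found = []
    with tarfile.open(sdist_path, "r:*") as archive:
        for name in archive.getnames():
            parts = name.split("/")
            # parts[0] is the top-level directory, e.g. "chardet-4.0.0" (assumed layout).
            if len(parts) > 1 and parts[1] == test_dir:
                found.append(name)
    return found
```

An empty list would mean the tarball no longer redistributes anything from tests/.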

@gernot-h
Author

gernot-h commented Aug 6, 2021

Hmm, I'm in a team supporting many Open Source based products at Siemens and I've seen both: projects downloading the tarball from PyPI and projects taking it from GitHub.

The next question is what Linux distributions do. Debian and openSUSE do indeed seem to take the sources from PyPI, but I think I also saw other Debian Python packages being built from a GitHub snapshot.

So it would already be great if you could ensure that these files don't end up on PyPI, but the best way to keep users (and yourself ;) ) safe would be to get rid of them on GitHub, too...

@gernot-h
Author

gernot-h commented Aug 10, 2021

A colleague just checked Arch and Gentoo. Arch takes your component from PyPI.

Gentoo downloads from GitHub: https://github.com/gentoo/gentoo/blob/25c7cd62d731e2587924c50204671c48186d3005/dev-python/chardet/chardet-4.0.0.ebuild#L11. This also means that Gentoo mirrors the problematic files: http://distfiles.gentoo.org/distfiles/bd/chardet-4.0.0.tar.gz.

So again, removing those files from the PyPI tarball would already be great and would fix it for many people and distributions, but to catch all cases, it would be better to have these files removed from GitHub, too.

@eriol

eriol commented Aug 19, 2021

Hi, I'm the co-maintainer of the Debian package, many thanks for reporting this issue (also on the Debian BTS).

In Debian we prefer PyPI, but it's not a hard requirement: I maintain packages that use GitHub directly (for example because documentation is missing from the sdist). The PyPI sdist is generally preferred because we assume it contains what upstream developers intend to publish. At the moment chardet in Debian is taken from PyPI: https://salsa.debian.org/python-team/packages/chardet/-/blob/debian/master/debian/watch

@gernot-h a note about downloading test data: Debian build machines (and I suppose other distributions do the same) can't access the internet to download anything, so we could not run such tests.

I agree on removing the test data from the PyPI sdist so that chardet does not contain problematic data, but can we re-add some of the test data after an assessment?

I looked at the Italian test data and both https://github.com/chardet/chardet/blob/master/tests/iso-8859-1/_ude_3.txt and https://github.com/chardet/chardet/blob/master/tests/iso-8859-1/_ude_4.txt are from Pirandello's work "Sei personaggi in cerca d'autore". It's public domain since Pirandello died in 1936: https://www.gutenberg.org/ebooks/18457

This was easy... some of the text is from websites that don't exist anymore... @dan-blanchard what about using a copy of Wikipedia texts, I mean not downloading on the fly? Wikipedia is PD, no? I really would prefer to have offline tests.

@gernot-h
Author

gernot-h commented Aug 20, 2021

Hi @eriol, thanks for your reply!

I can perfectly understand the problem with online tests.

About Wikipedia: it's not PD, it's CC-BY-SA, with the "SA" being somewhat comparable to the copyleft effect of the GPL. And the "SA" in CC licenses comes with big uncertainties for anything other than documents, which was for example the reason why OpenStreetMap moved away from CC-BY-SA.

So I'm not completely sure what incorporating Wikipedia snippets could mean to chardet as a whole. As a technical guy, I would argue that your Python code is in no way a derived work of text snippets in test cases, but I will try to verify this with our licensing experts and let you know. Given the current vacation period, it might however take some time for me to get an answer.

@Ousret

Ousret commented Aug 21, 2021

Hi there,

There is an obvious solution to that "problem". Having a separate repository and using it whenever the test suite needs to be run.
Either:

  • a git submodule (lazily fetched)
  • a manual clone, or whatever gives the same result

Like I have done here https://github.com/Ousret/char-dataset

So that would make those files optional. That would require a proper PR.
Online tests are out of the question for many reasons, one of them being maintainers' sanity. The main thing should be consistent results across runs. Page content or encoding may change at any time.
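
As a rough sketch of what "optional" could look like on the test side: the helper below returns nothing when the data directory has not been fetched, so the caller can skip instead of fail. It assumes the layout already visible under tests/, where each directory name is the expected encoding (as in tests/iso-8859-1/_ude_3.txt). The name `collect_samples` is hypothetical and this is not chardet's actual test code.

```python
from pathlib import Path


def collect_samples(data_root):
    """Return sorted (expected_encoding, path) pairs; empty if the data is absent."""
    root = Path(data_root)
    if not root.is_dir():
        # For example, the submodule was never fetched: let the caller skip.
        return []
    samples = []
    for encoding_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        for sample in sorted(f for f in encoding_dir.iterdir() if f.is_file()):
            # The directory name doubles as the expected encoding (assumed layout).
            samples.append((encoding_dir.name.lower(), sample))
    return samples
```

A test runner could then parametrize over the returned pairs and report a skip when the list is empty.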

@dan-blanchard
Member

There is an obvious solution to that "problem". Having a separate repository and using it whenever the test suite needs to be run.

That has the same problems as downloading the data on-the-fly. People should be able to run the tests without internet access for reasons like @eriol pointed out.

So I'm not completely sure what incorporating Wikipedia snippets could mean to chardet as a whole. As a technical guy, I would argue that your Python code is in no way a derived work of text snippets in test cases, but I will try to verify this with our licensing experts and let you know. Given the current vacation period, it might however take some time for me to get an answer.

Yeah, I'm not a lawyer, but it would be hard for me to understand how chardet itself would be considered a derived work just for being tested on Wikipedia data. That said, if everything we used was public domain, we could sidestep the issue entirely, so maybe rather than doubling down on Wikipedia we could look at more Project Gutenberg texts, assuming there are some for every language we support.

@Ousret

Ousret commented Aug 24, 2021

That has the same problems as downloading the data on-the-fly. People should be able to run the tests without internet access for reasons like @eriol pointed out.

I thought that using two predefined sources would work, as I vaguely remember that it is possible. I could be wrong.
Maybe the unit tests should focus more on the functional aspects than on charset coverage, with a separate workflow, mainly on GitHub, to verify that chardet keeps meeting its accuracy target?
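
For the accuracy part, a minimal sketch of such a check could look like this. The detector is passed in as a callable returning a dict with an "encoding" key, which is the shape `chardet.detect` returns, so that function could be plugged in. The name `accuracy` and any threshold a workflow would compare against are assumptions for illustration.

```python
def accuracy(samples, detect):
    """Fraction of (expected_encoding, data) samples whose detected encoding matches."""
    if not samples:
        return 0.0
    hits = 0
    for expected, data in samples:
        # A detector may report None when it cannot decide; treat that as a miss.
        guessed = detect(data).get("encoding") or ""
        if guessed.lower() == expected.lower():
            hits += 1
    return hits / len(samples)
```

A separate workflow could call this over the external dataset and fail when the value drops below whatever target is agreed on.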

Regards,

@gernot-h
Author

gernot-h commented Sep 7, 2021

Regarding CC-BY-SA: in a first reaction, our licensing expert agreed that he also wouldn't see chardet code as a derived work of a CC-BY-SA test suite file. He did, however, point out a future risk if specific content in the test suite (think of a very rare combination of encodings in a test sample?) triggers a special feature or bugfix in chardet code. No idea whether such a scenario is realistic...

@musicinmybrain
Contributor

This is a note that I have just taken over primary maintenance of the python-chardet package in Fedora Linux.

We can’t include files that aren’t covered by a Fedora-approved license in our source RPMs, so we can’t keep using either the PyPI sdist or the GitHub archive directly. Instead, I’ve followed the usual process for dealing with this sort of issue. There is now a small script that can download the PyPI sdist and filter out tests/, and the resulting stripped-down source archive is what I upload to Fedora’s lookaside cache.
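
For anyone wanting to do something similar, the filtering step could be sketched as below. This is not the actual Fedora script: the download is left out, the name `strip_tests` is hypothetical, and it assumes a .tar.gz sdist with a single top-level directory that contains tests/.

```python
import tarfile


def strip_tests(src_path, dst_path, test_dir="tests"):
    """Copy a .tar.gz sdist to dst_path, leaving out <top-level>/<test_dir>/."""
    removed = 0
    with tarfile.open(src_path, "r:gz") as src, tarfile.open(dst_path, "w:gz") as dst:
        for member in src.getmembers():
            parts = member.name.split("/")
            if len(parts) > 1 and parts[1] == test_dir:
                removed += 1
                continue
            # Only regular files carry data; directories and links are header-only.
            fileobj = src.extractfile(member) if member.isfile() else None
            dst.addfile(member, fileobj)
    return removed
```

The return value is the number of archive members that were dropped, which makes it easy to notice if the layout changes and nothing gets filtered.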

Of course, this means we can’t run any tests except for verifying that all of the package’s modules can be imported, which is unfortunate.

CC-BY-SA test data in the source RPM would be just fine for Fedora, but random things from random websites under no particular license are not.
