Unnecessary dependency on FuzzyTM pulls in many libraries #3423

Open
osma opened this issue Jan 9, 2023 · 3 comments · May be fixed by #3437
Labels
bug (Issue described a bug); difficulty easy (Easy issue: required small fix); impact HIGH (Show-stopper for affected users); reach HIGH (Affects most or all Gensim users)

Comments

osma commented Jan 9, 2023

Problem description

I'm trying to upgrade to the new Gensim 4.3.0 release. My colleague @juhoinkinen noticed in NatLibFi/Annif#660 that Gensim 4.3.0 pulls in more dependencies than the previous release 4.2.0, including pandas. I suspect that at least the FuzzyTM dependency (which in turn pulls in pandas) is actually unused and thus unnecessary.

Steps/code/corpus to reproduce

Installing Gensim 4.2.0 into an empty venv (only four packages installed):

$ pip install gensim==4.2.0
Collecting gensim==4.2.0
  Downloading gensim-4.2.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.0 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.0/24.0 MB 2.0 MB/s eta 0:00:00
Collecting scipy>=0.18.1
  Downloading scipy-1.10.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (34.4 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 34.4/34.4 MB 3.3 MB/s eta 0:00:00
Collecting numpy>=1.17.0
  Downloading numpy-1.24.1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (17.3 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 17.3/17.3 MB 10.6 MB/s eta 0:00:00
Collecting smart-open>=1.8.1
  Downloading smart_open-6.3.0-py3-none-any.whl (56 kB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 56.8/56.8 KB 9.7 MB/s eta 0:00:00
Installing collected packages: smart-open, numpy, scipy, gensim
Successfully installed gensim-4.2.0 numpy-1.24.1 scipy-1.10.0 smart-open-6.3.0

Installing Gensim 4.3.0 into an empty venv (18 packages installed):

$ pip install gensim==4.3.0
Collecting gensim==4.3.0
  Downloading gensim-4.3.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (24.1 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 24.1/24.1 MB 6.9 MB/s eta 0:00:00

[...skipping downloads...]

Installing collected packages: pytz, urllib3, smart-open, six, numpy, idna, charset-normalizer, certifi, scipy, requests, python-dateutil, simpful, pandas, miniful, fst-pso, pyfume, FuzzyTM, gensim
  Running setup.py install for miniful ... done
  Running setup.py install for fst-pso ... done
Successfully installed FuzzyTM-2.0.5 certifi-2022.12.7 charset-normalizer-2.1.1 fst-pso-1.8.1 gensim-4.3.0 idna-3.4 miniful-0.0.6 numpy-1.24.1 pandas-1.5.2 pyfume-0.2.25 python-dateutil-2.8.2 pytz-2022.7 requests-2.28.1 scipy-1.10.0 simpful-2.9.0 six-1.16.0 smart-open-6.3.0 urllib3-1.26.13

The size of the venv has grown from 249MB to 318MB, an increase of 69MB.
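
The issue does not say how the venv sizes were measured. One way to get a comparable figure is to sum the file sizes under the venv directory, as in the rough sketch below. The function name `dir_size_bytes` and the `venv` path are illustrative assumptions, and the result will differ somewhat from `du`, which counts disk blocks rather than file lengths.

```python
import os


def dir_size_bytes(root: str) -> int:
    """Sum the sizes of all regular files under root.

    Symlinked files are skipped so that a venv's links to the system
    interpreter are not counted; os.walk does not descend into
    symlinked directories by default.
    """
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not os.path.islink(path):
                total += os.path.getsize(path)
    return total
```

Running `dir_size_bytes("venv") / 1e6` once per venv and subtracting the two results would give the growth in MB between the 4.2.0 and 4.3.0 installs.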

Here is what pipdeptree shows. FuzzyTM appears to be the main reason why so many libraries are pulled in:

gensim==4.3.0
  - FuzzyTM [required: >=0.4.0, installed: 2.0.5]
    - numpy [required: Any, installed: 1.24.1]
    - pandas [required: Any, installed: 1.5.2]
      - numpy [required: >=1.21.0, installed: 1.24.1]
      - python-dateutil [required: >=2.8.1, installed: 2.8.2]
        - six [required: >=1.5, installed: 1.16.0]
      - pytz [required: >=2020.1, installed: 2022.7]
    - pyfume [required: Any, installed: 0.2.25]
      - fst-pso [required: Any, installed: 1.8.1]
        - miniful [required: Any, installed: 0.0.6]
          - numpy [required: >=1.12.0, installed: 1.24.1]
          - scipy [required: >=1.0.0, installed: 1.10.0]
            - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
        - numpy [required: Any, installed: 1.24.1]
      - numpy [required: Any, installed: 1.24.1]
      - scipy [required: Any, installed: 1.10.0]
        - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
      - simpful [required: Any, installed: 2.9.0]
        - numpy [required: >=1.12.0, installed: 1.24.1]
        - requests [required: Any, installed: 2.28.1]
          - certifi [required: >=2017.4.17, installed: 2022.12.7]
          - charset-normalizer [required: >=2,<3, installed: 2.1.1]
          - idna [required: >=2.5,<4, installed: 3.4]
          - urllib3 [required: >=1.21.1,<1.27, installed: 1.26.13]
        - scipy [required: >=1.0.0, installed: 1.10.0]
          - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
    - scipy [required: Any, installed: 1.10.0]
      - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
  - numpy [required: >=1.18.5, installed: 1.24.1]
  - scipy [required: >=1.7.0, installed: 1.10.0]
    - numpy [required: >=1.19.5,<1.27.0, installed: 1.24.1]
  - smart-open [required: >=1.8.1, installed: 6.3.0]
pip==22.0.2
pipdeptree==2.3.3
setuptools==59.6.0
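
The same question (how many packages does one declared requirement drag in?) can also be answered from Python with the standard-library `importlib.metadata`. The following is a minimal sketch, not what pipdeptree does internally: the helper names are made up for illustration, requirements gated behind an extra are skipped, and other environment markers (Python version, platform) are ignored, so those requirements are treated as always needed.

```python
import re
from importlib import metadata


def requirement_name(requirement: str) -> str:
    """Extract the normalized project name from a requirement string."""
    match = re.match(r"\s*([A-Za-z0-9][A-Za-z0-9._-]*)", requirement)
    if match is None:
        raise ValueError(f"cannot parse requirement: {requirement!r}")
    # PEP 503 normalization: lowercase, runs of '-', '_', '.' become one '-'
    return re.sub(r"[-_.]+", "-", match.group(1)).lower()


def installed_requires(name: str) -> list:
    """Requirement strings declared by an installed distribution ([] if absent)."""
    try:
        return metadata.requires(name) or []
    except metadata.PackageNotFoundError:
        return []


def transitive_deps(root: str, requires=installed_requires) -> set:
    """Return normalized names of everything reachable from root, excluding root.

    `requires` maps a normalized name to its requirement strings; it is
    injectable so the walk can be tried on hand-written data.
    """
    root_name = requirement_name(root)
    seen = set()
    stack = [root_name]
    while stack:
        current = stack.pop()
        for req in requires(current):
            _, _, marker = req.partition(";")
            if re.search(r"\bextra\s*==", marker):
                continue  # optional: only installed when an extra is requested
            name = requirement_name(req)
            if name != root_name and name not in seen:
                seen.add(name)
                stack.append(name)
    return seen
```

In a venv with Gensim 4.3.0 installed, `sorted(transitive_deps("gensim"))` should list the packages reachable from Gensim's declared requirements, and `"fuzzytm" in transitive_deps("gensim")` is a quick check for whether a given release still declares the dependency.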

It appears that the FuzzyTM dependency was added in PR #3398 (Flsamodel) by @ERijck. The first commits in this PR depended on the library, but a subsequent commit, 9fec00b, reworked the code so that it no longer needs to import FuzzyTM at all. However, the dependency was never removed from setup.py and is still there: https://github.com/RaRe-Technologies/gensim/blob/f35faae7a7b0c3c8586fb61208560522e37e0e7e/setup.py#L347

I think the FuzzyTM dependency could be safely dropped, as the library is not actually imported. It would reduce the number of libraries Gensim pulls in and thus reduce the size of installations, including Docker images where minimal size is often required.
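
The claim that the library is not imported can be checked mechanically by parsing each source file and looking for import statements. The sketch below is one way to do that with the standard-library `ast` module; the function names are illustrative assumptions. It only sees static `import` and `from ... import` statements in `.py` files, so dynamic imports (for example via `importlib.import_module`) and Cython `.pyx` sources would need a separate check.

```python
import ast
from pathlib import Path


def imports_module(source: str, module: str) -> bool:
    """Return True if the Python source imports `module` or one of its submodules."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom) and node.level == 0 and node.module:
            # level == 0 means an absolute import, not a relative one
            names = [node.module]
        else:
            continue
        if any(n == module or n.startswith(module + ".") for n in names):
            return True
    return False


def files_importing(root: str, module: str) -> list:
    """List .py files under root that statically import `module`."""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        try:
            source = path.read_text(encoding="utf-8")
            if imports_module(source, module):
                hits.append(path)
        except (SyntaxError, ValueError):
            # skip files that are not valid UTF-8 or not parseable Python
            continue
    return hits
```

Against a checkout of the Gensim source tree, an empty result from `files_importing("gensim", "FuzzyTM")` would support dropping the requirement from setup.py.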

Versions

I'm using Ubuntu Linux 22.04.

Linux-5.15.0-56-generic-x86_64-with-glibc2.35
Python 3.10.6 (main, Nov 14 2022, 16:10:14) [GCC 11.3.0]
Bits 64
NumPy 1.24.1
SciPy 1.10.0
gensim 4.3.0
FAST_VERSION 0

piskvorky (Owner) commented Jan 9, 2023

Thanks for reporting!

@mpenkov Is FuzzyTM really a hard dependency? If so, that's terrible, definitely an omission / bug (or if intentional, done in very bad taste). Let's release a bug fix ASAP.

piskvorky added the bug, difficulty easy, impact HIGH, and reach HIGH labels on Jan 9, 2023

piskvorky (Owner) commented Jan 9, 2023

I tracked the change in setup.py down to #3398. @ERijck why do you think this was needed, why did you add that line?

ERijck (Contributor) commented Jan 9, 2023

I'm surprised this line is still there; it was part of my first PR. The dependency can be removed from setup.py.

3 participants