Reduce number of 3rd party packages required for a prediction-only setup #594

Open
a-recknagel opened this issue Jan 11, 2023 · 3 comments
Labels
enhancement New feature or request


a-recknagel commented Jan 11, 2023

My use case is that I'm running a trained causalml model on a server. I'm done with analysis, hyperopt, visualization, ...; none of that is necessary any more. So I pickled my model and moved it to a dedicated production environment, which I configured so that it can unpickle the model and run predictions with it.

But the way causalml is set up, many of those "non-core" packages that deal with training and analysis are still hard runtime dependencies, even if I were to install causalml with --no-deps (as suggested here #250 (comment), which I'd really like to avoid). Just to show an example, the model I'm using is causalml.inference.tree.causal.causalforest.CausalRandomForestRegressor, and causalml.inference.tree.__init__.py imports all of the local modules as well (e.g. causalml.inference.tree.plot), which pulls in a number of the 3rd party imports that I have an issue with, like seaborn, matplotlib, pydotplus, ...
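
One way such eager imports could be deferred is a module-level `__getattr__` (PEP 562), which imports a submodule only when it is first accessed. The following is a minimal sketch of that technique, not how causalml is actually structured; `make_lazy_package`, `tree_demo`, and the `plot` to `json` mapping are hypothetical stand-ins chosen so the example runs without any 3rd party package.

```python
import importlib
import types


def make_lazy_package(name, lazy_submodules):
    """Return a module object that imports the listed submodules on first access.

    `lazy_submodules` maps an attribute name to the dotted path of the module
    that should be imported when that attribute is first requested.
    """
    package = types.ModuleType(name)

    def __getattr__(attr):
        # Called only when normal attribute lookup on the module fails (PEP 562).
        if attr not in lazy_submodules:
            raise AttributeError(f"module {name!r} has no attribute {attr!r}")
        module = importlib.import_module(lazy_submodules[attr])
        # Cache the result so later lookups bypass __getattr__ entirely.
        setattr(package, attr, module)
        return module

    package.__getattr__ = __getattr__
    return package


# Hypothetical stand-in: in a real package "plot" would map to a plotting
# submodule; the stdlib "json" module is used here so the sketch runs anywhere.
tree = make_lazy_package("tree_demo", {"plot": "json"})
assert "plot" not in vars(tree)  # nothing has been imported yet
tree.plot                        # first access triggers the import
assert "plot" in vars(tree)      # cached for subsequent lookups
```

In a real package the `__getattr__` function would be defined directly in `__init__.py` instead of being attached to a synthetic module; the lookup behaviour is the same.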

Would it be possible to separate every dependency that isn't necessary to run predictions into extras? Or at least, restructure the code in a way where a manual install of the actual runtime dependencies won't lead to unrelated 3rd party package imports? I realize this is a massive ask, but it's a serious problem for me that I can't solve without forking your project and running my own builds (which I'd really, really like to avoid).
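
If optional dependencies were moved into extras, code that needs them would typically import them at call time and fail with a helpful message when they are missing. A minimal sketch of that guard follows; the helper `require_optional` and the extra name `plot` are assumptions for illustration, since the thread does not define any extras.

```python
import importlib


def require_optional(module_name, extra):
    """Import an optional dependency, or explain which extra provides it.

    `extra` is a hypothetical extras group name used only in the error message.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"install it with: pip install 'causalml[{extra}]'"
        ) from exc


# Usage: the import happens only when the feature is actually requested.
try:
    require_optional("module_that_is_not_installed", extra="plot")
except ImportError as exc:
    print(exc)
```

With this pattern, a prediction-only install never touches the plotting libraries unless a plotting function is called.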

Just to give an idea of why it's an issue:

base/Dockerfile

# need a builder since no wheels are released to pypi, except for a single python3.8 mac build?
FROM python:3.10-slim as builder

RUN apt-get update && \
    apt-get -y install build-essential
# quote the version specifier so the shell does not treat ">" as a redirect
RUN pip install "setuptools>=18.0" wheel cython numpy "scikit-learn<=1.0.2"
RUN pip install causalml --no-deps
RUN pip wheel -w wheels causalml --no-deps

FROM python:3.10-slim

COPY --from=builder wheels wheels
RUN pip install "scikit-learn<=1.0.2" packaging forestci tqdm pathos && \
    pip install wheels/causalml* --no-deps

This image contains the core set of 3rd party packages necessary to predict with a CausalRandomForestRegressor. I didn't investigate what other models would need, but numerical computation libraries don't have a massive disk footprint anyway: the whole image is 507MB, which is reasonable for a simple ML backend.

actual/Dockerfile

FROM python:3.10-slim as builder

RUN apt-get update && \
    apt-get -y install build-essential
# quote the version specifier so the shell does not treat ">" as a redirect
RUN pip install "setuptools>=18.0" wheel cython numpy "scikit-learn<=1.0.2"
RUN pip install causalml --no-deps
RUN pip wheel -w wheels causalml --no-deps

FROM python:3.10-slim

COPY --from=builder wheels wheels
RUN pip install wheels/causalml*

This is the whole package, and visualization libs do tend to eat up a fair share of disk space. Plus torch. The image clocks in at 6.54GB, so a difference of ~6GB which I do not need.

My CI/CD straight up refuses to run this build for me because it doesn't support artifacts of this size. I didn't even know that could happen.

@a-recknagel a-recknagel added the enhancement New feature or request label Jan 11, 2023
@a-recknagel
Author

I couldn't find similar issues in the tracker, apologies if I just missed them. If I didn't miss any, I'd be surprised though: am I actually the first user who has this issue? Is dockerizing / running causalml on a server a strange thing to do?

Regarding PRs, I might be able to write one, but wouldn't start unless the issue itself is green-flagged by the maintainers.

@a-recknagel a-recknagel changed the title from "Reduce number of dependencies for a prediction-only setup" to "Reduce number of 3rd party packages required for a prediction-only setup" on Jan 12, 2023
@jeongyoonlee
Collaborator

Thanks for submitting this, @a-recknagel. Addressing this will help many others who'd like to deploy the causalml models. Can you take a stab at it?

A couple of things I can think of are:

@a-recknagel
Author

a-recknagel commented Jan 22, 2023

Ok, that's good to know, I'd love to try. I hope to limit the changes to two areas, changing import paths and defining extras groups, but I'd consider either of these a breaking change. Not that that'll stop me, and the project is still in zero_ver so it won't matter much, but I guess I want to ask how careful I should be. Should I read up on custom importer overloads to try and keep existing import paths working, or would that be a wasted effort?
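
For reference, one lightweight way to keep an old import path alive without a custom importer is to register the relocated module under its old name in `sys.modules`. This is only a sketch of that idea; `alias_module` and the name `legacy_json_demo` are hypothetical, and the stdlib `json` module stands in for a relocated module.

```python
import importlib
import sys


def alias_module(old_name, new_name):
    """Make `import old_name` resolve to the module that lives at `new_name`."""
    module = importlib.import_module(new_name)
    # The import system consults sys.modules first, so the alias is found
    # before any finder searches for a module called `old_name`.
    sys.modules[old_name] = module
    return module


# Hypothetical example: pretend "json" is the new location of a moved module.
alias_module("legacy_json_demo", "json")

import legacy_json_demo  # resolves through the alias registered above

assert legacy_json_demo.dumps({"a": 1}) == '{"a": 1}'
```

A dotted old path would additionally need its parent package to stay importable, and `from parent import child` would need the alias set as an attribute on the parent, so this approach covers simple moves better than large restructurings.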

Also, I'll probably touch most files in the project due to moving folders. Are there any particular WIPs or branches that I should consider or wait for before starting? The merge conflicts would be spectacularly bad.
