
use PythonVirtualenvOperator with a prebuilt env #15286

Closed
json2d opened this issue Apr 8, 2021 · 15 comments
Labels
area:core-operators Operators, Sensors and hooks within Core Airflow kind:feature Feature Requests

Comments

@json2d

json2d commented Apr 8, 2021

Description
Instead of passing in the requirements and relying on Airflow to build the env, in some cases it would be more straightforward and desirable to just make Airflow use a prebuilt env.

This could be done by giving PythonVirtualenvOperator a param like env_path.

Use case / motivation

virtualenv_task = PythonVirtualenvOperator(
    task_id="virtualenv_python",
    python_callable=callable_virtualenv,
    env_path='./color-env', # the path to prebuilt env
    # requirements=["colorama==0.4.0"], # replaces this
    system_site_packages=False,
    dag=dag,
)

Are you willing to submit a PR?

Perhaps

Related Issues

@json2d json2d added the kind:feature Feature Requests label Apr 8, 2021
@boring-cyborg

boring-cyborg bot commented Apr 8, 2021

Thanks for opening your first issue here! Be sure to follow the issue template!

@uranusjr
Member

uranusjr commented Apr 8, 2021

Edit: I misread the issue. Thanks for the added example.


Hi, could you provide more context to the issue? Why do you want this feature? Is the current virtualenv approach not working? Are there technical advantages to the built-in venv over virtualenv?

From my understanding, virtual environments created by virtualenv should be technically identical to those created by venv, but virtualenv provides much better configurability and performance characteristics in its implementation. The only real advantage of venv is that it does not require installing a third-party dependency, but that is a non-issue for Airflow, which has a ton of them anyway.

@json2d
Author

json2d commented Apr 8, 2021

hey @uranusjr yeah sorry i hit submit by accident there

@uranusjr
Member

uranusjr commented Apr 8, 2021

I wonder if this should be made more generic as a new operator that can take any Python installation prefix (e.g. /home/uranusjr/.local/my-custom-compiled-python). It might not be too useful in general though, since installing the requirements every time the task is run is pretty wasteful if the environment is persisted. It is probably easier to populate the environment yourself and use BashOperator instead.
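
For illustration, a minimal sketch of that BashOperator approach, assuming a venv was prebuilt at /opt/venvs/color-env on every worker (the path and script name are hypothetical):

from airflow.operators.bash import BashOperator

run_in_prebuilt_env = BashOperator(
    task_id="run_in_prebuilt_env",
    # Call the prebuilt venv's interpreter directly; no activation step is needed.
    bash_command="/opt/venvs/color-env/bin/python /opt/dags/scripts/use_colorama.py",
    dag=dag,
)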

@json2d
Author

json2d commented Apr 8, 2021

it is pretty wasteful to install the requirements every time the task is run, if the environment is persisted

i agree

populate the environment yourself, and use BashOperator instead.

this could work, though in my experience one disadvantage with BashOperator vs PythonOperator/PythonVirtualenvOperator is that it only provides limited logging in Airflow for errors

@potiuk
Member

potiuk commented Apr 8, 2021

Just one comment - this is fine if you can make sure all your - distributed - venvs are present on all the workers (which might be tricky if you want to update those) - and you have to somehow link the "task" definition (expecting a certain venv with certain requirement versions) with the "deployment" (i.e. worker definition). Any kind of "upgrade" to such an env might be tricky. The "local" installation pattern has the advantage that you always get the requirements in the version you described in the task definition (via the requirements specification).

I think a better solution would be to add a caching mechanism to the task and modify PythonVirtualenvOperator to use it. However, this might be tricky to get right when you have multiple tasks of the same type running in the same runner in a Celery deployment.

@uranusjr
Member

I thought about this a bit and feel there are two things here to consider. The first is the overhead for PythonVirtualenvOperator to populate the virtual environment, which (as mentioned above) should be solved by introducing some caching mechanism, similar to how CI caches stuff between runs. This is very much worth doing.

There is another use case surrounding PythonVirtualenvOperator, however: people wanting more control over the environment used to run Python code. Maybe there are some dependencies that can't be covered by Python packaging, or that require special configuration of the environment. Or maybe the user is simply migrating from an existing cron setup and wants to reuse the environments first to avoid re-writing everything all at once. Currently people would need to "drop down" to BashOperator to achieve this, and while that definitely works, it kind of "wastes" the knowledge that the operator is running Python, and prevents nice things we could do with that knowledge.

I think two solutions are needed for the two problems. The first is probably more intuitive to design: we can add caching options to PythonVirtualenvOperator to make Airflow cache and reuse the environment (or a subset of it); we can steal some ideas from CI designs for this. The other is less straightforward; my current idea is to introduce an ExternalPythonOperator (please recommend better names) that, instead of taking requirements to create a virtual environment from, simply takes a path to a Python executable to run the Python callable with. The behaviour would otherwise be very similar to PythonVirtualenvOperator, including all the code generation and pickling caveats. This would be much easier to implement than the caching one (which, as also mentioned above, requires tricky considerations around parallelism). So I'll probably start with it and see what I can do.
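
Roughly, usage could look something like this (the operator name and its python parameter are only a proposal at this point):

external_python_task = ExternalPythonOperator(
    task_id="external_python",
    # Path to the interpreter of an environment that already exists on the worker.
    python="/home/uranusjr/.local/my-custom-compiled-python/bin/python",
    python_callable=callable_virtualenv,
    dag=dag,
)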

Any advice is very welcome!

@Jakobhenningjensen

If I may pitch an idea: instead of creating venvs for each task/operator, what about creating a virtual DAG, e.g.

with VenvDAG(
    default_args=default_args,
    schedule_interval=timedelta(days=1),
    start_date=days_ago(1),
    tags=['venv'],
    requirements="requirements.txt",
) as dag:

    t1 = PythonOperator()
    t2 = PythonOperator()

t1 >> t2

such that VenvDAG spins up a venv that t1 and t2 live in.

@uranusjr
Member

That’s an interesting idea, but would require much more change since we don’t currently have a hook point for DAG to do pre-processing before operators are run.

Maybe something like this would be easier:

with DAG(...) as dag:
    t0 = CreateVirtualEnvironmentOperator(task_id="init_venv", ...)

    python_prefix_template = "{{ ti.xcom_pull(task_ids='init_venv')['prefix'] }}"
    t1 = ExternalPythonOperator(..., python_prefix=python_prefix_template)
    t2 = ExternalPythonOperator(..., python_prefix=python_prefix_template)

    t0 >> t1 >> t2

There are probably abstractions available to make this easier, but that’s the basic idea.

@potiuk
Member

potiuk commented Jul 25, 2021

I came back to the discussion after the Summit, and it gave me an idea @uranusjr. The problem with this approach is that you cannot have a separate operator to prepare the env and another one to run in it, since they might run on different workers.

However, with the custom XCom backends, I think we are very close to being able to use those backends as a generic caching mechanism (that we could also use to store virtualenv caches). I think it is not that far-fetched to add a mechanism similar to what we see in many CI environments, where we could specify an ID for the cache (with some variations) and pull it from a shared location (if it exists) or push it there after the task succeeds. Then we could make PythonVirtualenvOperator build the venv if it is missing (using the requirements) and push it after completion. We would have to add some basic mechanism for invalidating the cache (for example when the hash of requirements.txt changes), as sketched below.
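
To illustrate the invalidation part, a minimal sketch of what such a cache key could look like (purely illustrative, not an existing Airflow API):

import hashlib
from pathlib import Path

def venv_cache_key(requirements_file: str, python_version: str) -> str:
    # The key changes whenever requirements.txt or the target interpreter changes,
    # so a stale cached venv is never reused.
    digest = hashlib.sha256(Path(requirements_file).read_bytes()).hexdigest()[:16]
    return f"venv-{python_version}-{digest}"

# e.g. venv_cache_key("requirements.txt", "3.8") -> "venv-3.8-<first 16 hex chars of the sha256>"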

WDYT @uranusjr ?

@ManikandanUV

Bump. Selecting my own existing venv, or at the very least reusing existing venvs instead of creating one every time, would be a great feature to have.

@HansBambel
Contributor

@ManikandanUV We are doing it the following way for now:

from airflow.operators.bash import BashOperator
import os

# vars_dict, abs_path_code, files_to_parse and db_conn are defined elsewhere in our DAG.
env = vars_dict.get("conda_env", None)
path_to_python = f"/home/username/.conda{'/envs/' + env if env is not None else ''}/bin/python"

parse_files = BashOperator(
    task_id="parse-files",
    bash_command=f"{path_to_python} {abs_path_code}/my_repo/parse.py {files_to_parse}",
    env={"PATH": os.environ["PATH"],
         "DB_CONN": db_conn},
)

We have an environment variable containing the conda env name, which is used to get the full path to the Python executable. Then, using a BashOperator, we can reuse the same environment for different tasks.

Additionally, we run an update to the environment if the requirements changed (note that we are using Poetry as the package manager):

update_repo = BashOperator(
    task_id=f"update-repo-{folder}",
    bash_command=f"cd {abs_path_code}/{folder}; "
                 "git checkout master; git stash; git stash drop; git pull",
)
install_dependencies = BashOperator(
    task_id=f"install-dependencies-{folder}",
    bash_command=f"cd {abs_path_code}/{folder}; conda activate {env_name}; poetry install",
)
update_repo >> install_dependencies

@gaoyibin0001


may use "conda run -n env_name python xxx.py

@gaoyibin0001

gaoyibin0001 commented Feb 21, 2022

One workaround may be to use DockerOperator or KubernetesPodOperator to isolate the env, depending on your deployment.
PythonVirtualenvOperator, in my understanding, is suitable for tasks requiring few dependencies.
For use cases that demand many packages, or a very big package like torch, the installation overhead is too much per task. Using prebuilt images is a good choice, since they can be maintained offline without affecting the online Airflow workers.
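
As a rough sketch of the DockerOperator variant, assuming an image with the heavy dependencies was built and pushed ahead of time (the image name and script path are hypothetical):

from airflow.providers.docker.operators.docker import DockerOperator

train_task = DockerOperator(
    task_id="train_in_prebuilt_image",
    # Prebuilt image with torch etc. baked in; maintained outside of Airflow.
    image="my-registry/torch-task:1.0",
    command="python /app/train.py",
    dag=dag,
)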

@uranusjr
Member

This use case is now covered by ExternalPythonOperator, introduced in #25780, which will be part of 2.4.0.
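
For reference, usage looks roughly like this (the venv path is just an example of a prebuilt environment you manage yourself):

from airflow.operators.python import ExternalPythonOperator

virtualenv_task = ExternalPythonOperator(
    task_id="external_python",
    # Path to the interpreter of the prebuilt environment.
    python="/opt/venvs/color-env/bin/python",
    python_callable=callable_virtualenv,
    dag=dag,
)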
