Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement ExternalPythonOperator #25780

Merged
merged 1 commit into from Sep 6, 2022

Conversation

potiuk
Copy link
Member

@potiuk potiuk commented Aug 17, 2022

This Operator works very similarly to PythonVirtualenvOperator - but
instead of creating a virtualenv dynamically, it expects the
Python binary to be available in the environment that Airlfow is run in.

The PR adds not only the implemenat the operator, but also
documents the operator's use and adds best-practices chapter
that explains the differences between different ways how you can
achieve separation of dependencies between different tasks. This
has been a question added many times in by our users, so adding
this operator and outlining future aspects of AIP-46 and AIP-43
that will make separate docker images another option is also
part of this change.


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg boring-cyborg bot added area:core-operators Operators, Sensors and hooks within Core Airflow area:providers kind:documentation labels Aug 17, 2022
@potiuk potiuk marked this pull request as draft August 17, 2022 23:50
@potiuk
Copy link
Member Author

potiuk commented Aug 17, 2022

I still have to add /fix tests. But for some strange reason I had to make a number of changes to our typing (MyPy complained) - those changes look rather reasonable but @uranusjr maybe you can take a look If I have not made some stupid mistake that led to it.

BTW. PythonExternalOperator seems like a good name overall

Copy link
Member

@uranusjr uranusjr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The main implementation looks fine to me in general but there are too much peripheral changes (mapping changed to mutablemapping etc.) that don’t need to happen.

airflow/decorators/__init__.pyi Outdated Show resolved Hide resolved
airflow/decorators/base.py Outdated Show resolved Hide resolved
airflow/decorators/__init__.pyi Outdated Show resolved Hide resolved
airflow/operators/python.py Outdated Show resolved Hide resolved
@potiuk
Copy link
Member Author

potiuk commented Aug 18, 2022

The main implementation looks fine to me in general but there are too much peripheral changes (mapping changed to mutablemapping etc.) that don’t need to happen.

Yeah. for some reason when I split - MyPy started to complain on those Mapping to be non Mutable - I just scraped it quickly and simply fixed MyPy, but yeah - I agree I need to find out why MyPy started to complain in the first place.

@potiuk potiuk force-pushed the add-external-python-operator branch 2 times, most recently from 602000a to c162133 Compare August 18, 2022 11:32
@potiuk
Copy link
Member Author

potiuk commented Aug 18, 2022

I pushed fixes (I still need to add more tests).

Unfortunately it seems that the mypy instabilitites are a bit different nature. As explained in the comment above for some reason it started to complain with those errors:

airflow/operators/python.py:277: error: Argument 3 to "skip" of "SkipMixin" has
incompatible type "Collection[Union[BaseOperator, MappedOperator]]"; expected
"Sequence[BaseOperator]"  [arg-type]
                    self.skip(dag_run, execution_date, downstream_tasks)
                                                       ^
airflow/operators/python.py:282: error: Argument 3 to "skip" of "SkipMixin" has
incompatible type "Iterable[DAGNode]"; expected "Sequence[BaseOperator]"
[arg-type]
    ...              self.skip(dag_run, execution_date, context["task"].get_d...
                                                        ^
Found 2 errors in 1 file (checked 1 source file)

I looked at it closely and I think the suggestions from MyPy were actually correct. I could not find any reason why get_direct_relatives should return DAGNode, as far as I can tell you cannot get TaskGroups - you only get tasks so `Union[BaseOperator, MappedOperator]' (and you cannot skip TaskGroup either). Also Collection was not right, because tasks[0] was used in the 'skip' method:

 DagRun.dag_id == tasks[0].dag_id,

So looks like somethign "masked" the problems from MyPy before and we should fix it here. Any insights and confirmation of my findings would be appreciated before I add more tests.

@potiuk potiuk force-pushed the add-external-python-operator branch from c162133 to 8988f54 Compare August 18, 2022 11:40
@raphaelauv
Copy link
Contributor

I think ExternalPythonOperator is a confusing name , that could give the impression to users that the task is going to run out of airflow itself ( it sounds stupid and magical , but so many airflow users are not aware of how things really works and where code is actually running depending the operator and executor )

Why not add a parameter to PythonVirtualenvOperator giving the possibility to the user to set the path of an existing venv ?

or name this new operator PythonExistingVirtualenvOperator ?

thanks

Copy link
Contributor

@o-nikolas o-nikolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Left a few minor comments.

Though I agree with @raphaelauv I don't love the name ExternalPythonOperator and like his proposal as an option. Or maybe PythonPreexistingVirtualenvOperator

airflow/example_dags/example_python_operator.py Outdated Show resolved Hide resolved
airflow/example_dags/example_python_operator.py Outdated Show resolved Hide resolved
docs/apache-airflow/howto/operator/python.rst Outdated Show resolved Hide resolved
docs/apache-airflow/howto/operator/python.rst Outdated Show resolved Hide resolved
docs/apache-airflow/tutorial_taskflow_api.rst Outdated Show resolved Hide resolved
@uranusjr uranusjr force-pushed the add-external-python-operator branch from 0b3294a to cf08e44 Compare August 19, 2022 05:29
@potiuk
Copy link
Member Author

potiuk commented Aug 20, 2022

I like the PythonPreexistingVirtualenvOperator name best

@potiuk potiuk force-pushed the add-external-python-operator branch 2 times, most recently from 9d0893f to 7248171 Compare August 20, 2022 18:23
@potiuk potiuk changed the title Implement ExternalPythonOperator Implement PythonPreexistingVirtualenvOperator Aug 20, 2022
@potiuk potiuk marked this pull request as ready for review August 20, 2022 18:23
@potiuk potiuk force-pushed the add-external-python-operator branch from 7248171 to b165d5f Compare August 20, 2022 18:41
@potiuk
Copy link
Member Author

potiuk commented Aug 20, 2022

I think this one is ready for review:

  • renamed the operator to to PythonPreexistingVirtualenvOperator
  • tests should be passing now
  • system Test is working (and uses external venv)
  • documentation is updated (howtos)
  • I even added Best Practices chapter about Handling Python Dependencies that provides more insights on how you can run your tasks using different sets of dependencies - it discusses pros/cons of different approaches

@potiuk potiuk force-pushed the add-external-python-operator branch from b165d5f to eb72962 Compare August 20, 2022 18:46
@potiuk
Copy link
Member Author

potiuk commented Sep 1, 2022

Ready for final review I think !

@raphaelauv
Copy link
Contributor

would it make sense to add a disclaimer/warning in the case this operator is run in KubernetesExecutor or kubernetesCeleryExecutor (k8S queue) ?

@potiuk
Copy link
Member Author

potiuk commented Sep 1, 2022

would it make sense to add a disclaimer/warning in the case this operator is run in KubernetesExecutor or kubernetesCeleryExecutor (k8S queue) ?

What disclaimer? It should work, there are no limits for that. It's perfectly fine to run the operator with KubernetesExecutor IMHO. I can easily imagine this being used when you have just one single image with multiple predefined envs and you want to choose which one to use.

What problems do you foresee with that @raphaelauv ?

@raphaelauv
Copy link
Contributor

Yeah it will work , I'm just concerned about "encouraging" users to create just one single image with multiple predefined envs . Sometimes users create chaotic stack design because something work out of box :)

@potiuk
Copy link
Member Author

potiuk commented Sep 1, 2022

Yeah it will work , I'm just concerned about "encouraging" users to create just one single image with multiple predefined envs . Sometimes users create chaotic stack design because something work out of box :)

I am not sure if we want to do it to be honest. I do not think we should encourage it at all (we should present it as an option and we do) because everyone's mileage is different. I spoke to a few users of Airlfow (my customers) and it really depends what stage and experience you have IMHO.

  1. some customers who do not have many "system" dependencies problems and are not "super" excited about infrastructure and build pipelinesf for it, will likely prefer single image with multiple venvs (actually the idea of this operator came precisely from that discussion - I met them yesterday and they cannot wait for 2.4 being available because it solves huge problem they have with multiple teams in a simple way).

  2. but there are more sophisticated customers who have multiple separate teams that have multiple complex requirements (including system dependencies). There only multiple container images will cut it. Extreme case of it - think GPU and ARM support for one team crossed with having to install Python 2.7 on an old debian because this is the only environment all dependencies will be installable (as much as we would not like it - Python 2.7 is still poptular in gaming industry apparently - Unity added Python binding few years ago with Python 2.7 only https://docs.unity3d.com/Packages/com.unity.scripting.python@2.0/manual/installation.html and it is still 2.7 only !!! 😱 ). This is of course extreme, but you get the idea.

So rather than advising the user to choose one over the other, I chose a different route, similar to the "installation" page - - https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html - if you look at the "best practices" chapter in my PR I simply describe all the options and explain pros and cons of each approach and consequences of choosing each one. This description is precisely targeted for the users who will attempt to ask us "which is the best approach".

Since we cannot answer this question authoritatively IMHO and we do not want to engage in long discussions with each user (this does not scale) to figure out which option is best for the particular user, we will simply send the user to that page, which they will be able to read and decide on their own. We simply cannot make the decisions for them, but we give them all the information in the way that they can make the decision themselves.

I tried to make this "best practices" chapter to be unbiased, factual and very precisely describing pros and cons of each approach and they are grouped in one chapter progresslvely going from the simplest (PythonVirtualenv) to the most complex and involved (Kubernetes). And I tried to avoid any "jugment" there - except the "objective" judgment (more resources used + why) / less resources used +why).

I think this is the best we can do.

@raphaelauv
Copy link
Contributor

thanks for your answer , it's really clear 👍

@potiuk
Copy link
Member Author

potiuk commented Sep 1, 2022

thanks for your answer , it's really clear +1

BTW. Funny thing - the customer uses Nomad not K8S, and CeleryExecutor, but it would not change a thing for them if they did use K8S.

@potiuk
Copy link
Member Author

potiuk commented Sep 1, 2022

Looking forward to get it merged :)

@potiuk
Copy link
Member Author

potiuk commented Sep 2, 2022

:D ?

@potiuk potiuk force-pushed the add-external-python-operator branch 2 times, most recently from 9c715ad to 299217d Compare September 2, 2022 20:22
@potiuk
Copy link
Member Author

potiuk commented Sep 5, 2022

Looking forward to merging that one, if there are no more comments. @ashb, you review is blocking here, and I believe the name change is addressed already.

Copy link
Member

@ashb ashb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few small things, pre-emptive approval though.

airflow/example_dags/example_python_operator.py Outdated Show resolved Hide resolved
airflow/operators/python.py Show resolved Hide resolved
airflow/operators/python.py Outdated Show resolved Hide resolved
tests/decorators/test_external_python.py Outdated Show resolved Hide resolved
tests/decorators/test_external_python.py Outdated Show resolved Hide resolved
@potiuk potiuk force-pushed the add-external-python-operator branch from 299217d to 4a49478 Compare September 5, 2022 17:15
@potiuk
Copy link
Member Author

potiuk commented Sep 5, 2022

Added all changes.

This Operator works very similarly to PythonVirtualenvOperator - but
instead of creating a virtualenv dynamically, it expects the
env to be available in the environment that Airlfow is run in.

The PR adds not only the implemenat the operator, but also
documents the operator's use and adds best-practices chapter
that explains the differences between different ways how you can
achieve separation of dependencies between different tasks. This
has been a question added many times in by our users, so adding
this operator and outlining future aspects of AIP-46 and AIP-43
that will make separate docker images another option is also
part of this change.
@potiuk potiuk force-pushed the add-external-python-operator branch from 4a49478 to c9a144b Compare September 5, 2022 18:18
@potiuk potiuk merged commit 55928b9 into apache:main Sep 6, 2022
@potiuk potiuk deleted the add-external-python-operator branch September 6, 2022 14:27
@jedcunningham jedcunningham added the type:new-feature Changelog: New Features label Sep 12, 2022
potiuk added a commit to potiuk/airflow that referenced this pull request Dec 30, 2022
The apache#25780 has accidentally bumped min airflow version for the
provider to 2.4.0, however the provider is fully capable to work
in Airflow 2.3+.
potiuk added a commit that referenced this pull request Dec 30, 2022
The #25780 has accidentally bumped min airflow version for the
provider to 2.4.0, however the provider is fully capable to work
in Airflow 2.3+.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:core-operators Operators, Sensors and hooks within Core Airflow area:providers kind:documentation type:new-feature Changelog: New Features
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants