Implement ExternalPythonOperator #25780
I still have to add/fix tests. For some strange reason I had to make a number of changes to our typing (MyPy complained). Those changes look rather reasonable, but @uranusjr, maybe you can take a look in case I made some stupid mistake that led to it. BTW, PythonExternalOperator seems like a good name overall. |
The main implementation looks fine to me in general, but there are too many peripheral changes (Mapping changed to MutableMapping, etc.) that don’t need to happen.
Yeah. For some reason, when I split it, MyPy started to complain about those Mappings not being Mutable. I just went through it quickly and simply fixed what MyPy reported, but yeah, I agree I need to find out why MyPy started to complain in the first place. |
I pushed fixes (I still need to add more tests). Unfortunately it seems that the
I looked at it closely and I think the suggestions from MyPy were actually correct. I could not find any reason why get_direct_relatives should return DAGNode: as far as I can tell, you cannot get TaskGroups, you only get tasks, so `Union[BaseOperator, MappedOperator]` (and you cannot skip a TaskGroup either). Also, Collection was not right, because tasks[0] was used in the 'skip' method.
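To illustrate the Collection point, here is a minimal standalone sketch. It is not the actual 'skip' method; the function names and task ids are made up for the example. It only shows why indexing needs a Sequence, while a Collection guarantees no more than sizing, iteration and membership.

```python
import collections.abc
from typing import Collection, Sequence


def first_by_index(tasks: Sequence[str]) -> str:
    # Sequence guarantees __getitem__, so tasks[0] type-checks.
    return tasks[0]


def first_by_iteration(tasks: Collection[str]) -> str:
    # Collection only guarantees __len__, __iter__ and __contains__,
    # so iteration is the portable way to reach an element.
    return next(iter(tasks))


# Hypothetical task ids, used only for this illustration.
task_ids = ["extract", "load"]
as_set = set(task_ids)

# A set is a valid Collection but not a Sequence, so as_set[0] would fail.
is_sequence = isinstance(as_set, collections.abc.Sequence)
is_collection = isinstance(as_set, collections.abc.Collection)
```

With a parameter annotated as Collection, MyPy rejects `tasks[0]`, which would match the complaint described above.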
So it looks like something "masked" the problems from MyPy before, and we should fix it here. Any insights and confirmation of my findings would be appreciated before I add more tests. |
I think Why not add a parameter to PythonVirtualenvOperator that lets the user set the path of an existing venv? or name this new operator thanks |
Nice work! Left a few minor comments.
Though I agree with @raphaelauv: I don't love the name ExternalPythonOperator
and like his proposal as an option. Or maybe PythonPreexistingVirtualenvOperator.
I like the |
I think this one is ready for review:
|
Ready for final review, I think! |
Would it make sense to add a disclaimer/warning in case this operator is run with KubernetesExecutor or CeleryKubernetesExecutor (K8s queue)? |
What disclaimer? It should work, there are no limits for that. It's perfectly fine to run the operator with KubernetesExecutor IMHO. I can easily imagine this being used when you have just one single image with multiple predefined envs and you want to choose which one to use. What problems do you foresee with that, @raphaelauv? |
Yeah, it will work, I'm just concerned about "encouraging" users to create |
I am not sure if we want to do it, to be honest. I do not think we should encourage it at all (we should present it as an option, and we do), because everyone's mileage is different. I spoke to a few users of Airflow (my customers), and it really depends on what stage you are at and how much experience you have, IMHO.
So rather than advising the user to choose one over the other, I chose a different route, similar to the "installation" page (https://airflow.apache.org/docs/apache-airflow/stable/installation/index.html). If you look at the "best practices" chapter in my PR, I simply describe all the options and explain the pros and cons of each approach and the consequences of choosing each one. This description is precisely targeted at the users who will ask us "which is the best approach". Since we cannot answer this question authoritatively IMHO, and we do not want to engage in long discussions with each user (this does not scale) to figure out which option is best for them, we will simply send the user to that page, which they can read and decide on their own. We simply cannot make the decisions for them, but we give them all the information in a way that lets them make the decision themselves. I tried to make this "best practices" chapter unbiased, factual, and very precise in describing the pros and cons of each approach. They are grouped in one chapter, progressively going from the simplest (PythonVirtualenv) to the most complex and involved (Kubernetes). And I tried to avoid any "judgment" there, except the "objective" judgment (more resources used + why / less resources used + why). I think this is the best we can do. |
Thanks for your answer, it's really clear 👍 |
BTW. Funny thing - the customer uses Nomad not K8S, and CeleryExecutor, but it would not change a thing for them if they did use K8S. |
Looking forward to getting it merged :) |
:D ? |
Looking forward to merging that one, if there are no more comments. @ashb, your review is blocking here, and I believe the name change is addressed already. |
Few small things, pre-emptive approval though.
Added all changes. |
This Operator works very similarly to PythonVirtualenvOperator - but instead of creating a virtualenv dynamically, it expects the env to be available in the environment that Airflow is run in. The PR adds not only the implementation of the operator, but also documents the operator's use and adds a best-practices chapter that explains the differences between the ways you can achieve separation of dependencies between different tasks. This question has been asked many times by our users, so adding this operator and outlining future aspects of AIP-46 and AIP-43 that will make separate docker images another option is also part of this change.
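The general idea of running a callable in an interpreter that already exists, rather than building a virtualenv first, can be sketched with the standard library alone. This is only an illustration under stated assumptions and not the operator's actual implementation: `run_in_external_python` and its parameters are hypothetical names, arguments and results are assumed to be JSON-serializable, and `sys.executable` stands in for the path of a pre-installed interpreter.

```python
import json
import subprocess
import sys
import textwrap

# Appended to the user's source: calls the named function with the
# JSON-decoded arguments and prints the JSON-encoded result.
RUNNER = (
    "\nimport json, sys\n"
    "print(json.dumps({name}(*json.loads(sys.argv[1]))))\n"
)


def run_in_external_python(python, source, func_name, *args):
    """Run func_name from source in the interpreter at `python` (hypothetical helper)."""
    script = textwrap.dedent(source) + RUNNER.format(name=func_name)
    completed = subprocess.run(
        [python, "-c", script, json.dumps(args)],
        capture_output=True,
        text=True,
        check=True,  # raise if the external interpreter exits with an error
    )
    # The result is the last line printed; earlier lines may be the callable's own output.
    return json.loads(completed.stdout.splitlines()[-1])


SOURCE = """
def add(a, b):
    return a + b
"""

# sys.executable is a stand-in; in practice this would be the path to the
# Python binary of an environment prepared ahead of time.
result = run_in_external_python(sys.executable, SOURCE, "add", 2, 3)
```

Because nothing is installed at run time, the cost of each call is just the subprocess start-up, which is the trade-off against creating a virtualenv dynamically.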
PR apache#25780 accidentally bumped the minimum Airflow version for the provider to 2.4.0; however, the provider is fully capable of working with Airflow 2.3+.
This Operator works very similarly to PythonVirtualenvOperator - but
instead of creating a virtualenv dynamically, it expects the
Python binary to be available in the environment that Airflow is run in.
The PR adds not only the implementation of the operator, but also
documents the operator's use and adds a best-practices chapter
that explains the differences between the ways you can
achieve separation of dependencies between different tasks. This
question has been asked many times by our users, so adding
this operator and outlining future aspects of AIP-46 and AIP-43
that will make separate docker images another option is also
part of this change.