Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/consistent get dag rest endpoint #16842

Closed
wants to merge 1 commit into from

Conversation

psg2
Copy link

@psg2 psg2 commented Jul 6, 2021

closes: #16839

I got a bit confused while updating the tests to the new behavior. Previously the tests were using the method _create_dag_models to generate test DAGs and when I changed to use the same DAGs from the tests on TestGetDagDetails the tests started failing because of tags for instance and then I noticed that tags are defined differently on dag_schema and dag_detail_schema and I felt like I needed to update dag_schema to use the same configuration.

There is also the behavior that using DAGs from DagBag instead of directly inserting a DagModel on the database through the session the attributes is_paused and is_active come up as None instead of False/True.

Another thing that I noticed is that on TestGetDag there is a test related to the details endpoint here. Should we update this as well?


^ Add meaningful description above

Read the Pull Request Guidelines for more information.
In case of fundamental code change, Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in UPDATING.md.

@boring-cyborg boring-cyborg bot added the area:API Airflow's REST/HTTP API label Jul 6, 2021
@boring-cyborg
Copy link

boring-cyborg bot commented Jul 6, 2021

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about any anything please check our Contribution Guide (https://github.com/apache/airflow/blob/main/CONTRIBUTING.rst)
Here are some useful points:

  • Pay attention to the quality of your code (flake8, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature add useful documentation (in docstrings or in docs/ directory). Adding a new operator? Check this short guide Consider adding an example DAG that shows how users should use it.
  • Consider using Breeze environment for testing locally, it’s a heavy docker but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or get the final approval from Committers.
  • Please follow ASF Code of Conduct for all communication including (but not limited to) comments on Pull Requests, Mailing list and Slack.
  • Be sure to read the Airflow Coding style.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@github-actions
Copy link

github-actions bot commented Jul 7, 2021

The PR is likely OK to be merged with just subset of tests for default Python and Database versions without running the full matrix of tests, because it does not modify the core of Airflow. If the committers decide that the full tests matrix is needed, they will add the label 'full tests needed'. Then you should rebase to the latest main or amend the last commit of the PR, and push it with --force-with-lease.

@github-actions github-actions bot added the okay to merge It's ok to merge this PR as it does not require more tests label Jul 7, 2021
update get_dag tests to handle the new behavior
update dag_schema to get tags the same way as dag_details_schema
@psg2 psg2 force-pushed the fix/consistent-get-dag-rest-endpoint branch from c0f7e78 to fcae7a2 Compare July 7, 2021 01:27
@mik-laj
Copy link
Member

mik-laj commented Jul 7, 2021

I am not sure about this change. Web UI displays DAGs that are deleted. This allows us to access the archival DAG Runes.

If you want to fetch a list of DAGs that have not been deleted, you should set the filter is_acttive = True.

if dag is None:
raise NotFound("DAG not found", detail=f"The DAG with dag_id: {dag_id} was not found")

return dag_schema.dump(dag)
Copy link
Member

@mik-laj mik-laj Jul 7, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This serializer accepts DagModel, so we should update typing on line 40 to reflect it. See:

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I'm not mistaken the get_dag method indeed returns a DAG object, so instead of updating the typing, should I query it from the database after validating its existence on the DagBag?

"is_active": True,
"fileloc": __file__,
"file_token": FILE_TOKEN,
"is_paused": None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_active is a filter parameter, so we should return it in response also for readability and not to expose ourselves to accidental leakage of information.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did remove those after changing the endpoint to return a DAG instead of a DagModel given that the API documentation says that the is_active field is nullable.

image

To fix this, should I transform from DAG to DagModel and leave the serializer to handle it, or take the approach from the DAGDetailSchema that has the following code:

    is_paused = fields.Method("get_is_paused", dump_only=True)
    is_active = fields.Method("get_is_active", dump_only=True)

    @staticmethod
    def get_is_paused(obj: DAG):
        """Checks entry in DAG table to see if this DAG is paused"""
        return obj.get_is_paused()

    @staticmethod
    def get_is_active(obj: DAG):
        """Checks entry in DAG table to see if this DAG is active"""
        return obj.get_is_active()

"is_active": False,
"fileloc": __file__,
"file_token": FILE_TOKEN,
"is_paused": None,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is_active is missing here.

@psg2
Copy link
Author

psg2 commented Jul 7, 2021

I am not sure about this change. Web UI displays DAGs that are deleted. This allows us to access the archival DAG Runes.

If you want to fetch a list of DAGs that have not been deleted, you should set the filter is_acttive = True.

I just tested it on Airflow 2.1.1 and if I remove the file for a DAG with existing runs the SerializedDag table is cleaned up and because of this the DAG no longer shows up on the UI.

@mik-laj
Copy link
Member

mik-laj commented Jul 7, 2021

I just tested it on Airflow 2.1.1 and if I remove the file for a DAG with existing runs the SerializedDag table is cleaned up and because of this the DAG no longer shows up on the UI.

Can you check the filter configurations at the top of the table - All/Active/Paused? Their status is saved in cookies and you can keep these DAGs hidden.

When you select "All", you should see deleted and active DAG in UI. If not, then we have a bug in the Web UI.

@psg2
Copy link
Author

psg2 commented Jul 7, 2021

I just tested it on Airflow 2.1.1 and if I remove the file for a DAG with existing runs the SerializedDag table is cleaned up and because of this the DAG no longer shows up on the UI.

Can you check the filter configurations at the top of the table - All/Active/Paused? Their status is saved in cookies and you can keep these DAGs hidden.

When you select "All", you should see deleted and active DAG in UI. If not, then we have a bug in the Web UI.

I'm selecting All, from what I understood given that the Web UI is now powered by Serialized Dags that table would be the source of truth.

Screen Shot 2021-07-06 at 22 49 59

Screen Shot 2021-07-06 at 22 49 55

@psg2
Copy link
Author

psg2 commented Jul 7, 2021

If I try to access the DAG after file deletion the Web UI I receive the SerializedDagNotFound exception

image

@mik-laj
Copy link
Member

mik-laj commented Jul 7, 2021

Web UI is now powered by Serialized Dags that table would be the source of truth.

Only some views use SerializedDAG because reading the data from this table is expensive. Apart from the cross-DAG dependency view, we never use the list operation on this table.

Most views still use the DagModel table including /home/ view:

dags = (
current_dags.order_by(DagModel.dag_id)
.options(joinedload(DagModel.tags))
.offset(start)
.limit(dags_per_page)
.all()
)

@mik-laj
Copy link
Member

mik-laj commented Jul 7, 2021

If I try to access the DAG after file deletion the Web UI I receive the SerializedDagNotFound exception

We can think of a clearer error message here, but from the home screen, you should also be able to access the DAG Runs

@mik-laj
Copy link
Member

mik-laj commented Jul 7, 2021

I have the impression that this change caused this change, but the Airflow 1.10 had a different behavior in this situation.
8b95e51
Admin had access to all DAGs: deleted and active.

We can change the API, but this will be a breaking change and it requires at least entries in the API documentation (section: Summary of Changes) and in UPDATING.MD. file

@psg2
Copy link
Author

psg2 commented Jul 7, 2021

When you select "All", you should see deleted and active DAG in UI. If not, then we have a bug in the Web UI.

What about those recent changes on the changelog?

@mik-laj
Copy link
Member

mik-laj commented Jul 7, 2021

Adding only_active parameter to /dags endpoint #14306 restored the ability to view the list of all DAGS, both active and inactive. If I understand correctly, now /dags/ only returns active DAGs by default. To return all DAGs, you need to pass is_active query parameter. /dags/{dag_id}/ returns information about all DAGs, both active and inactive, which is consistent with the /dags/ endpoint. The /dags/{dag_id}/ details endpoint, which is a bit specific because it operates on a serialized DAG, allows us to get only active DAGs. This is a certain limitation of this endpoint and I think it's worth documenting until we implement full versioning of the DAGs or other features that will make us have access to all DAG attributes in all cases.

I don't understand why you want to make this change? What do you mean by "stale DAGs"?

@kaxil @ephraimbuddy WDYT?

@psg2
Copy link
Author

psg2 commented Jul 7, 2021

Got it, I guess it was a miscommunication issue then, I asked on Slack about this when I noticed that /dags/{dag_id} returned a DAG that was inactive while the /dag/{dag_id}/details fails in such case that there is no more the associated dag on the SerializedDag table and I thought they should have the same behavior. Maybe I didn't express well my thoughts on Slack or misunderstood the response.

Slack Thread: https://apache-airflow.slack.com/archives/CCPRP7943/p1625504228302500

I don't understand why you want to make this change? What do you mean by "stale DAGs"?

I was thinking stale DAGs were inactive DAGs that no longer have the file on the filesystem at the time.

Now after reading the /dags/{dag_id} description which is Presents only information available in database (DAGModel). If you need detailed information, consider using GET /dags/{dag_id}/details. it actually looks correct as it is today.

Copy link
Contributor

@ephraimbuddy ephraimbuddy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the Schema documentation, this endpoint is fine as it is, See

get:
summary: Get basic information about a DAG
description: >
Presents only information available in database (DAGModel).
If you need detailed information, consider using GET /dags/{dag_id}/details.

I don't think we should be making this change @mik-laj , @uranusjr

@ephraimbuddy
Copy link
Contributor

Got it, I guess it was a miscommunication issue then, I asked on Slack about this when I noticed that /dags/{dag_id} returned a DAG that was inactive while the /dag/{dag_id}/details fails in such case that there is no more the associated dag on the SerializedDag table and I thought they should have the same behavior. Maybe I didn't express well my thoughts on Slack or misunderstood the response.

Slack Thread: https://apache-airflow.slack.com/archives/CCPRP7943/p1625504228302500

I don't understand why you want to make this change? What do you mean by "stale DAGs"?

I was thinking stale DAGs were inactive DAGs that no longer have the file on the filesystem at the time.

Now after reading the /dags/{dag_id} description which is Presents only information available in database (DAGModel). If you need detailed information, consider using GET /dags/{dag_id}/details. it actually looks correct as it is today.

Good

@psg2
Copy link
Author

psg2 commented Jul 9, 2021

For me, I think we can close this PR and issue now. 🤔

I'm still looking forward to giving my first contribution to Airflow, and I will take the time to watch the current Airflow Summit presentations on the subject. 😄

Thanks a lot for taking the time to review this carefully, @mik-laj @ephraimbuddy.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:API Airflow's REST/HTTP API okay to merge It's ok to merge this PR as it does not require more tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

/dag/{dag_id} returns stale dags without its serialized version unlike /dags/{dag_id}/details
5 participants