Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

exp: show: Include additional info in --json. #7690

Merged
merged 1 commit into from May 25, 2022

Conversation

daavoo
Copy link
Contributor

@daavoo daavoo commented May 3, 2022

Use relpath as keys in deps and outs.
Add use_cache and is_data_source for outs.

This allows differentiating git-tracked dependencies and
intermediate outputs (pseudo-code bellow).

Closes #7575
Closes #7790


Example code on how to use the new fields:

for dep_name, dep in res["deps"].items():
    for out_name, out in res["outs"].items():
   
        # dep and out are the same artifact
        if dep_name == out_name:

             if not out["is_data_source"]:
                 # This is an intermediate artifact
                 
             if not out["use_cache"]:
                 # This is git-tracked artifact
                 
        else:
            # This is a git-tracked dependency

@daavoo daavoo added product: VSCode Integration with VSCode extension A: experiments Related to dvc exp labels May 3, 2022
@daavoo daavoo self-assigned this May 3, 2022
@daavoo daavoo requested a review from a team as a code owner May 3, 2022 18:44
@daavoo daavoo requested review from skshetry, dberenbaum and mattseddon and removed request for skshetry and dberenbaum May 3, 2022 18:44
@daavoo daavoo marked this pull request as draft May 5, 2022 16:42
@daavoo daavoo added the release-blocker Blocks a release label May 5, 2022
@daavoo daavoo removed the request for review from mattseddon May 5, 2022 16:43
@daavoo daavoo force-pushed the 7575-exp-show-distinguish-between-different-files branch from 19fc2f7 to 0b60702 Compare May 5, 2022 18:31
@daavoo daavoo marked this pull request as ready for review May 5, 2022 19:18
@daavoo daavoo force-pushed the 7575-exp-show-distinguish-between-different-files branch from 0b60702 to cf8e636 Compare May 5, 2022 19:19
@daavoo daavoo changed the title exp: show: Include fs_path and use_cache in --json. exp: show: Include additional info in --json. May 5, 2022
@daavoo daavoo removed the release-blocker Blocks a release label May 5, 2022
@daavoo daavoo force-pushed the 7575-exp-show-distinguish-between-different-files branch from cf8e636 to 53f2507 Compare May 10, 2022 10:40
@daavoo daavoo force-pushed the 7575-exp-show-distinguish-between-different-files branch from 53f2507 to 1da3517 Compare May 23, 2022 09:39
@daavoo daavoo requested a review from pmrowla May 23, 2022 09:41
dvc/repo/experiments/show.py Outdated Show resolved Hide resolved
@daavoo daavoo force-pushed the 7575-exp-show-distinguish-between-different-files branch 2 times, most recently from 209baee to 33ecb88 Compare May 24, 2022 08:01
Use relpath to repo.root_dir as keys in deps and outs.
Add `use_cache` and `is_data_source` for outs.

This allows to differentiate git-tracked dependencies and
intermediate outputs.

Closes #7575
Closes #7790
@daavoo daavoo force-pushed the 7575-exp-show-distinguish-between-different-files branch from 33ecb88 to 48fe304 Compare May 24, 2022 08:04
Copy link
Contributor

@pmrowla pmrowla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM assuming this works for the vscode side

@daavoo daavoo requested a review from mattseddon May 24, 2022 09:43
"hash": out.hash_info.value,
"size": out.meta.size,
"nfiles": out.meta.nfiles,
"use_cache": out.use_cache,
"is_data_source": out.stage.is_data_source,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Q] Do I have everything I need to work this out:

for dep in res[“deps”].values():
    for out in res[“outs”].values():
        if dep[“fs_path”] == out[“fs_path”]:
            # THIS IS AN INTERMEDIATE ARTIFACT
            dep["intermediate"] = True
            out["intermediate"] = True
            if out[“use_cache”]:
                # DEP IS DVC TRACKED
                dep["use_cache"] = True
            else:
                # DEP IS GIT-TRACKED 
                dep["use_cache"] = False

The answer is yes

Given that the rules about deciding what files should be included appear to be currently unclear and product-dependent I would prefer not to include this logic in DVC.

This does make sense to me. However, there will be much less chance of providing consistency across dependencies if we push this logic out to the consumers.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mattseddon where does this snippet come from?

I see that is_data_source is included which is what we need I believe, @daavoo could you remind please what is the definition of the out.stage.is_data_source?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Correct code is in the description?

Copy link
Member

@mattseddon mattseddon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will be enough for us right now as we're working on implementing:

The first step is to get all of the deps from exp show shown in the columns tree and experiments webview.

as per iterative/vscode-dvc#1183 (comment)

I think we should revisit when we get to:

Once extra information has been added to the DVC output we will provide it to the user (in some way to be determined) so they can easily filter out what they need.

and/or when we redesign the data that the extension needs

@shcheklein
Copy link
Member

@daavoo thanks for addressing this!

Example code on how to use the new fields:

I'm not sure that example is 100% correct. What happens if I have only a few .dvc (dvc add) in the project? There will be no deps at all, right? But there will be outs that should be marked as "data sources".

Copy link
Member

@shcheklein shcheklein left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please review the comment regarding the .dvc files ... at least we should fix an example.

In general - why would I need that double cycle at all btw, why can't I just take all data source outs (I hope it includes imports) ... and for final outputs - take outs that don't have deps.

@daavoo
Copy link
Contributor Author

daavoo commented May 25, 2022

I'm not sure that example is 100% correct. What happens if I have only a few .dvc (dvc add) in the project? There will be no deps at all, right? But there will be outs that should be marked as "data sources".

The example is from the current DVC perspective where deps is what is shown in the table and artifacts not connected to the pipeline are not shown.

VSCode has access to both deps and outs and can process and decide differently.

Please review the comment regarding the .dvc files ... at least we should fix an example.

Reverse the loop. Iterate on outs:

for out_name, out in res["out"].items():
    if not out["is_data_source"]:
        if out_name in res["deps"]:
            # This is an intermediate artifact
        else:
            # This is a "final output"

why can't I just take all data source outs (I hope it includes imports)

You can, as per the example above.
Inclusion of imports is tested here:

dvc.imp(os.fspath(erepo_dir), "out", "out_imported")
outs = dvc.experiments.show()["workspace"]["baseline"]["data"]["outs"]
assert outs == {
"out": {
"hash": ANY,
"size": ANY,
"nfiles": None,
"use_cache": True,
"is_data_source": False,
},
"out_add": {
"hash": ANY,
"size": ANY,
"nfiles": None,
"use_cache": True,
"is_data_source": True,
},
"out_imported": {
"hash": ANY,
"size": ANY,
"nfiles": None,
"use_cache": True,
"is_data_source": True,
},
}

why would I need that double cycle at all btw,

You kind of will still double looping because of and for final outputs - take outs that don't have deps (the in is still a loop).

And without also iterating on deps, you won't be able to have a signal for updates on git-tracked dependencies.

@daavoo daavoo requested a review from shcheklein May 25, 2022 05:52
@shcheklein
Copy link
Member

Got it, thanks @daavoo !

@daavoo daavoo merged commit 34ef180 into main May 25, 2022
@daavoo daavoo deleted the 7575-exp-show-distinguish-between-different-files branch May 25, 2022 07:01
@daavoo daavoo added the feature is a feature label May 25, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: experiments Related to dvc exp feature is a feature product: VSCode Integration with VSCode extension
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

exp show --show -json: outs does not return full path for keys exp show: Distinguish between different files
4 participants