Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[DRAFT] Explore skipping graph checks #10425

Open
wants to merge 4 commits into
base: main
Choose a base branch
from

Conversation

Erotemic
Copy link
Contributor

A PR where I'm going to experiment with solutions to the issue discussed in #10415

The basic idea is that running dvc add (and other commands) walk the entire non-dvc indexed repo to look for DVC files. This is to handle granular modifications, but handling that use-case may not always be necessary, and it would be nice to disable these checks, which would allow the runtime of dvc add and other similar operations to have speed independent of other files indexed in the repo.

Not much in this PR yet, but I'm pushing it up as a DRAFT while I work on it.

@Erotemic
Copy link
Contributor Author

@dberenbaum You mentioned I would want to skip repo.find_outs_by_path in dvc/repo/add.py under get_or_create_stage:

def get_or_create_stage(
    repo: "Repo",
    target: str,
    out: Optional[str] = None,
    to_remote: bool = False,
    force: bool = False,
) -> StageInfo:
    if out:
        target = resolve_output(target, out, force=force)
    path, wdir, out = resolve_paths(repo, target, always_local=to_remote and not out)

    try:
        (out_obj,) = repo.find_outs_by_path(target, strict=False)
        stage = out_obj.stage
        if not stage.is_data_source:
            raise DvcException(...omit...)
        return StageInfo(stage, output_exists=True)
    except OutputNotFoundError:
        stage = repo.stage.create(...omit...)
        return StageInfo(stage, output_exists=False)

I don't have a strong understanding of what a "Stage" is, so its unclear what the correct mechanism for skipping that line is, as it defines the "stage" object used in the subsequent lines. I could just raise a OutputNotFoundError exception before calling that line, which will then call repo.stage.create. I'm not sure if that is desirable or not.

I found some docs describing stages.
I think in my use-case, there are never any stages, as they seem to correspond to steps in some pipeline. In the simple case where you only ever use add, push, and pull, is it correct that there are no stages? Any guidance on the right way to short circuit this?

dvc/repo/add.py Outdated
Comment on lines 57 to 61
try:
# How best to disable this line? With Skip Graph Checks Flag?
repo._skip_graph_checks = True
if getattr(repo, "_skip_graph_checks", False):
print("todo: skip graph checks")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong understanding of what a "Stage" is.

Internally, everything is a Stage in dvc.

What you want to do here is similar to following:

if not skip_graph_checks:
   try:
      ...  # existing inner block inside try
   except OutputNotFoundError:
     continue

...  # existing block from except

We might want to add a warning in case of skip_graph_checks=True saying that:

partial or virtual add does not work when --skip-graph-checks are enabled.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You may want to pass skip_graph_checks: bool to this function so to depend less on global state as possible (with default set to False unless --skip-graph-checks is used).

Copy link
Contributor Author

@Erotemic Erotemic May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree minimizing global state is a good idea. There seem to be existing tests for it, so I'm going to start with a global state implementation and then "fix it" once I get it working.

Is the "graph" referring to a graph of stages where edges represent dependencies between stages? Is the essence of what I want an optimization for the case where the graph is a disconnected set of nodes? EDIT: I see the docs in dvc.repo.graph.build_graph, so I think I'm correct here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm still unsure what the right thing to do in get_or_create_stage when this optimization is enabled. From what I understand we are attempting to lookup a "Stage" in repo.find_outs_by_path and that corresponds to a ".dvc" file? Can I use .dvc file and Stage interchangeably? I've been referring to them as "Sidecars", but I would like to understand the architecture of the repo better.

In find_outs_by_path, it calls self.index.outs_graph, which is one of the big time sinks, and that's what I'd like to avoid if possible. Is there an alternative way I can create the dvc.output.Output object when the "path" refers to a directory or file that corresponds to an existing or soon-to-be-created ".dvc" file?

As it stands my current implementation falls back to the case where it assumes the file is always a new .dvc file. But this optimization should apply in the case where the entire directory or file is tracked by the .dvc file. Guidance on what to do here would be helpful.

I'm also thinking that if there is any way to only crawl part of the graph without creating the entire thing, then perhaps every use-case can be made more efficient without needing to add a special case? (Although in the short-term I do just want to add the special case and move on).

Does it make sense to add docstrings to some of these functions? It would be nice there were docs explaining a bit about each argument and in which cases it might be specified, and perhaps points to a reference if it uses terminology the user is expected to be familiar with?

Something else that would be incredibly useful to have are doctests that would give me a quick entrypoint into any function. However, for things like DVC, that would require some concise helper function to setup a demo repo so the doctests don't get ridiculous. I happen to have some code that does this. I've been working on a DVC helper module called simple_dvc that improves my quality of life. An example that illustrates can be seen in the "request" function. Is there any interest in porting any of that work here?

I've pushed up a new commit that adds some of these docstrings, with questions where I don't know the answer. If someone wants to take a look and comment on what I've written / had questions about, perhaps we can consolidate that knowledge into the docstrings and make this codebase more approachable.

@@ -284,6 +284,7 @@ def index(self) -> "Index":
def check_graph(
self, stages: Iterable["Stage"], callback: Optional[Callable] = None
) -> None:
self._skip_graph_checks = True
Copy link
Member

@skshetry skshetry May 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done inside Repo.add() API (and only when skip_graph_checks=True is passed to Repo.add()).

We also need to think how to contain the modification within Repo.add(), i.e. rollback the value before exiting Repo.add().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants