Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

plots: grouping: stop using dpath.util.search #7811

Merged
merged 1 commit into from Jun 3, 2022

Conversation

pared
Copy link
Contributor

@pared pared commented May 25, 2022

Thank you for the contribution - we'll try to review it as soon as possible. πŸ™

Seems like dpath.util.search is really slow. Tried modifying the code to stop using search and use dpath.util.get instead. There was a significant improvement (4x) but still seemed unreasonably long. Ended up implementing this method myself.

We are still using dpath.util.new but its influence is orders of magnitude smaller than "reading" methods.

test repository:

#!/bin/bash

set -exu
pushd $TMPDIR

wsp=test_wspace
rep=test_repo

rm -rf $wsp && mkdir $wsp && pushd $wsp
main=$(pwd)

mkdir $rep && pushd $rep

git init
dvc init

echo "numpy" >> requirements.txt
pip install -r requirements.txt

git add -A
git commit -am "init"


echo -e "p: 1" > params.yaml

echo -e "from random import random
import numpy as np
import sys

num_files = int(sys.argv[1])
linspace_size = int(sys.argv[2])

def load_params():
    import yaml
    with open('params.yaml') as fd:
            return yaml.safe_load(fd)

params = load_params()
p = params['p']

def calc_metric(index):
    metric = np.power(np.linspace(0,1,linspace_size),p)

    if index % 2 == 0:
       metric = 1 - metric

    metric += np.random.random(linspace_size)/(index+1)
    result=[]
    for elem in list(metric):
        result.append({f'val_{index}': elem})
    return result

import json
if __name__ == '__main__':
    for i in range(num_files):
        metric = calc_metric(i)
        with open(f'metric_{i}.json', 'w') as fd:
            json.dump(metric, fd)
" > train.py

echo -e "
from subprocess import run
outputs = []
for i in range(40):
    outputs.append('--plots')
    outputs.append(f'metric_{i}.json')

run(['dvc', 'run', '-d', 'train.py', '--params', 'p', '-n', 'train'] + outputs + ['python train.py 40 1000'])
" > call_dvc.py

git add -A
git commit -am "initial"

python call_dvc.py

git add -A
git commit -am "initial"

echo -e "p: 3" > params.yaml
dvc repro

command: time dvc plots diff

Result for main:
14.88s user 0.22s system 99% cpu 15.182 total

Result for this change:
1.33s user 0.22s system 95% cpu 1.627 total

@pared pared requested a review from a team as a code owner May 25, 2022 12:21
@pared pared requested a review from karajan1001 May 25, 2022 12:21
@pared
Copy link
Contributor Author

pared commented May 25, 2022

I would love it if someone could confirm the results.
cc @mattseddon @dberenbaum @shcheklein

@pared pared force-pushed the plots_fix_slow_matching branch 2 times, most recently from 599eb19 to 3b22f9c Compare May 25, 2022 12:36
@efiop
Copy link
Member

efiop commented May 25, 2022

@pared Let's add test_plots_diff to dvc-bench. Seems like it is pretty fast by itself, won't cost us much.

@pared
Copy link
Contributor Author

pared commented May 25, 2022

@efiop yep, already working on that

@mattseddon
Copy link
Member

mattseddon commented May 26, 2022

I would love it if someone could confirm the results. cc @mattseddon @dberenbaum @shcheklein

I can confirm.

Last release:

dvc plots diff 14.96s user 0.38s system 94% cpu 16.241 total

main:

dvc plots diff 15.37s user 0.36s system 102% cpu 15.339 total

This:

dvc plots diff 2.06s user 0.46s system 107% cpu 2.353 total

Thanks @pared.

@mattseddon
Copy link
Member

image

Performance issues will be back with us again now:

Screen.Recording.2022-05-26.at.10.16.03.am.mov

cc @sroy3

Relates to iterative/vscode-dvc#1689 & iterative/vscode-dvc#1643

@sroy3
Copy link

sroy3 commented May 26, 2022

image

Performance issues will be back with us again now:

Screen.Recording.2022-05-26.at.10.16.03.am.mov
cc @sroy3

Relates to iterative/vscode-dvc#1689 & iterative/vscode-dvc#1643

We can easily adjust the number of "buffer" rows to be rendered below and above the scroll line to prepare for faster scrolling.

@dberenbaum
Copy link
Contributor

@karajan1001 @iterative/dvc Could someone review please? It's an important performance improvement for the VS Code release.

@efiop
Copy link
Member

efiop commented May 27, 2022

@dberenbaum Mostly waiting for iterative/dvc-bench#352 to be able to confirm and keep an eye on.

Comment on lines 40 to 50
revisions = list(plots_data)

grouped: Dict[str, Dict] = defaultdict(dict)

for revision in revisions:
for file in files:
path = [revision, "data", file]
content = _get(plots_data, path)
if content:
dpath.util.new(grouped[file], path, content)
return dict(grouped)
Copy link
Member

@efiop efiop May 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming that the code above works, looks like we could just

Suggested change
revisions = list(plots_data)
grouped: Dict[str, Dict] = defaultdict(dict)
for revision in revisions:
for file in files:
path = [revision, "data", file]
content = _get(plots_data, path)
if content:
dpath.util.new(grouped[file], path, content)
return dict(grouped)
grouped = {}
for revision in plots_data.keys():
data = plots_data[revision].get("data", {})
for file in data.keys():
content = data.get(file)
if content:
dpath.util.new(grouped, [file, revision, "data", file], content)
return grouped

and not need get_files(redundant walk and sort. Btw, not used anywhere, why wasn't it _get_files? same question about other functions here CC @daavoo ), not need _get, and not need revisions (were creating a list for no reason) and reduce complexity?

Unrelated: [file, revision, "data", file] kinda hints that our format is pretty odd here πŸ˜„

Copy link
Contributor

@daavoo daavoo May 31, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question about other functions here CC @daavoo )

I don't recall I remember refactoring when dvc-render extraction but I think that the logic was already there.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it was used somewhere else. Dropping it now.

@skshetry skshetry self-requested a review May 28, 2022 00:56
@daavoo daavoo added A: plots Related to the plots performance improvement over resource / time consuming tasks labels May 31, 2022
@pared pared changed the title plots: grouping: stop using dpath.util.search [WIP] plots: grouping: stop using dpath.util.search Jun 2, 2022
@dberenbaum dberenbaum added the release-blocker Blocks a release label Jun 2, 2022
@pared pared changed the title [WIP] plots: grouping: stop using dpath.util.search plots: grouping: stop using dpath.util.search Jun 3, 2022
@@ -28,9 +26,9 @@ def test_plots_order(tmp_dir, dvc):
name="stage2",
)

assert get_files(dvc.plots.show()) == [
Copy link
Contributor Author

@pared pared Jun 3, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not have used that here, as it was supposed to test the order of show results.

@efiop
Copy link
Member

efiop commented Jun 3, 2022

Thank you!

@efiop efiop merged commit 20c7b0e into iterative:main Jun 3, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: plots Related to the plots performance improvement over resource / time consuming tasks release-blocker Blocks a release
Projects
No open projects
Archived in project
Development

Successfully merging this pull request may close these issues.

None yet

6 participants