plots: grouping: stop using dpath.util.search #7811

pared · 2022-05-25T12:21:18Z

❗ I have followed the Contributing to DVC checklist.
📖 If this PR requires documentation updates, I have created a separate PR (or issue, at least) in dvc.org and linked it here.

dpath.util.search appears to be very slow. I first tried replacing search with dpath.util.get. That gave a significant improvement (about 4x), but the run time still seemed unreasonably long, so I ended up implementing the lookup myself.

We still use dpath.util.new, but its cost is orders of magnitude smaller than that of the "reading" methods.
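
For readers following along, a direct lookup of the kind described above might look like the sketch below. It is an illustration only, not the code from this PR: the name `get_by_path` and the sample `plots_data` are hypothetical. The idea is that when the full key path is already known, walking the nested dicts key by key avoids the pattern matching that a search over the whole structure has to do.

```python
from typing import Any, Dict, List, Optional


def get_by_path(data: Dict, path: List[str]) -> Optional[Any]:
    """Walk nested dicts along `path`; return None if any key is missing.

    Hypothetical helper for illustration, not the PR's actual `_get`.
    """
    current: Any = data
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Made-up sample shaped like {revision: {"data": {file: content}}}.
plots_data = {
    "workspace": {"data": {"metric_0.json": {"data": [{"val_0": 0.1}]}}},
}

found = get_by_path(plots_data, ["workspace", "data", "metric_0.json"])
missing = get_by_path(plots_data, ["workspace", "data", "nope.json"])
```

Each lookup touches only the keys on the path, so its cost grows with the path length rather than with the size of the whole structure.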

test repository:

#!/bin/bash

set -exu
pushd $TMPDIR

wsp=test_wspace
rep=test_repo

rm -rf $wsp && mkdir $wsp && pushd $wsp
main=$(pwd)

mkdir $rep && pushd $rep

git init
dvc init

echo "numpy" >> requirements.txt
pip install -r requirements.txt

git add -A
git commit -am "init"


echo -e "p: 1" > params.yaml

echo -e "from random import random
import numpy as np
import sys

num_files = int(sys.argv[1])
linspace_size = int(sys.argv[2])

def load_params():
    import yaml
    with open('params.yaml') as fd:
            return yaml.safe_load(fd)

params = load_params()
p = params['p']

def calc_metric(index):
    metric = np.power(np.linspace(0,1,linspace_size),p)

    if index % 2 == 0:
       metric = 1 - metric

    metric += np.random.random(linspace_size)/(index+1)
    result=[]
    for elem in list(metric):
        result.append({f'val_{index}': elem})
    return result

import json
if __name__ == '__main__':
    for i in range(num_files):
        metric = calc_metric(i)
        with open(f'metric_{i}.json', 'w') as fd:
            json.dump(metric, fd)
" > train.py

echo -e "
from subprocess import run
outputs = []
for i in range(40):
    outputs.append('--plots')
    outputs.append(f'metric_{i}.json')

run(['dvc', 'run', '-d', 'train.py', '--params', 'p', '-n', 'train'] + outputs + ['python train.py 40 1000'])
" > call_dvc.py

git add -A
git commit -am "initial"

python call_dvc.py

git add -A
git commit -am "initial"

echo -e "p: 3" > params.yaml
dvc repro

command: time dvc plots diff

Result for main:
14.88s user 0.22s system 99% cpu 15.182 total

Result for this change:
1.33s user 0.22s system 95% cpu 1.627 total
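
As a rough check on those two measurements, the sketch below parses the `time` summary lines quoted above and computes the ratios. The helper name `parse_time_line` is made up for this illustration, and the regular expression assumes the zsh-style `time` format shown here.

```python
import re

# Matches summaries such as "14.88s user 0.22s system 99% cpu 15.182 total".
TIME_RE = re.compile(
    r"([\d.]+)s user ([\d.]+)s system (\d+)% cpu ([\d.]+) total"
)


def parse_time_line(line: str) -> dict:
    """Extract user, system, CPU percentage and total time from one line."""
    match = TIME_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized time summary: {line!r}")
    user, system, cpu, total = match.groups()
    return {
        "user": float(user),
        "system": float(system),
        "cpu_percent": int(cpu),
        "total": float(total),
    }


main = parse_time_line("14.88s user 0.22s system 99% cpu 15.182 total")
change = parse_time_line("1.33s user 0.22s system 95% cpu 1.627 total")

wall_speedup = main["total"] / change["total"]
user_speedup = main["user"] / change["user"]
print(f"wall-clock speedup: {wall_speedup:.1f}x")
print(f"user-time speedup: {user_speedup:.1f}x")
```

On these numbers the wall-clock time drops by roughly a factor of nine, and the user CPU time by roughly a factor of eleven.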

pared · 2022-05-25T12:22:49Z

I would love it if someone could confirm the results.
cc @mattseddon @dberenbaum @shcheklein

efiop · 2022-05-25T13:31:25Z

@pared Let's add test_plots_diff to dvc-bench. Seems like it is pretty fast by itself, won't cost us much.

pared · 2022-05-25T13:35:28Z

@efiop yep, already working on that

mattseddon · 2022-05-26T00:10:46Z

> I would love it if someone could confirm the results. cc @mattseddon @dberenbaum @shcheklein

I can confirm.

Last release:

dvc plots diff 14.96s user 0.38s system 94% cpu 16.241 total

main:

dvc plots diff 15.37s user 0.36s system 102% cpu 15.339 total

This:

dvc plots diff 2.06s user 0.46s system 107% cpu 2.353 total

Thanks @pared.

mattseddon · 2022-05-26T00:18:21Z

Performance issues will be back with us again now:

Screen.Recording.2022-05-26.at.10.16.03.am.mov

cc @sroy3

Relates to iterative/vscode-dvc#1689 & iterative/vscode-dvc#1643

sroy3 · 2022-05-26T13:30:44Z

> Performance issues will be back with us again now:
>
> Screen.Recording.2022-05-26.at.10.16.03.am.mov
> cc @sroy3
>
> Relates to iterative/vscode-dvc#1689 & iterative/vscode-dvc#1643

We can easily adjust the number of "buffer" rows to be rendered below and above the scroll line to prepare for faster scrolling.

dberenbaum · 2022-05-27T19:53:18Z

@karajan1001 @iterative/dvc Could someone review please? It's an important performance improvement for the VS Code release.

efiop · 2022-05-27T21:09:57Z

@dberenbaum Mostly waiting for iterative/dvc-bench#352, so that we can confirm the improvement and keep an eye on it.

efiop · 2022-05-27T22:05:51Z

dvc/render/match.py

+    revisions = list(plots_data)
+
+    grouped: Dict[str, Dict] = defaultdict(dict)
+
+    for revision in revisions:
+        for file in files:
+            path = [revision, "data", file]
+            content = _get(plots_data, path)
+            if content:
+                dpath.util.new(grouped[file], path, content)
+    return dict(grouped)


Assuming that the code above works, looks like we could just

Suggested change

-    revisions = list(plots_data)
-
-    grouped: Dict[str, Dict] = defaultdict(dict)
-
-    for revision in revisions:
-        for file in files:
-            path = [revision, "data", file]
-            content = _get(plots_data, path)
-            if content:
-                dpath.util.new(grouped[file], path, content)
-    return dict(grouped)
+    grouped = {}
+
+    for revision in plots_data.keys():
+        data = plots_data[revision].get("data", {})
+        for file in data.keys():
+            content = data.get(file)
+            if content:
+                dpath.util.new(grouped, [file, revision, "data", file], content)
+    return grouped

and no longer need get_files (a redundant walk and sort; by the way, it is not used anywhere, so why wasn't it _get_files? Same question about the other functions here, CC @daavoo), no longer need _get, and no longer need revisions (we were creating a list for no reason), which would also reduce complexity?

Unrelated: [file, revision, "data", file] kinda hints that our format is pretty odd here 😄
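
To make the resulting shape concrete, here is a minimal, self-contained sketch of the grouping in the suggestion above. It is an illustration under stated assumptions, not the merged code: the function name `group_by_filename` and the sample data are hypothetical, and chained `dict.setdefault` calls stand in for `dpath.util.new` so that the example runs without third-party packages.

```python
from typing import Any, Dict


def group_by_filename(plots_data: Dict[str, Dict[str, Any]]) -> Dict[str, Dict]:
    """Regroup {revision: {"data": {file: content}}} by file name.

    The result has the shape {file: {revision: {"data": {file: content}}}}.
    Hypothetical name; setdefault replaces dpath.util.new in this sketch.
    """
    grouped: Dict[str, Dict] = {}
    for revision, rev_content in plots_data.items():
        data = rev_content.get("data", {})
        for file, content in data.items():
            if content:  # skip files with empty content, as in the suggestion
                per_revision = grouped.setdefault(file, {}).setdefault(revision, {})
                per_revision.setdefault("data", {})[file] = content
    return grouped


# Made-up sample input with two revisions and one empty entry.
plots_data = {
    "workspace": {"data": {"a.json": {"data": [1]}, "b.json": {}}},
    "HEAD": {"data": {"a.json": {"data": [2]}}},
}

grouped = group_by_filename(plots_data)
```

The file name appears twice on every path in the result, once as the top-level key and once under "data", which is the oddity the remark above points at.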

daavoo · 2022-05-31

> same question about other functions here CC @daavoo )

I don't recall. I remember refactoring this during the dvc-render extraction, but I think the logic was already there.

pared · 2022-06-03

I think it was used somewhere else. Dropping it now.

pared · 2022-06-03T10:22:26Z

tests/unit/test_plots.py

@@ -28,9 +26,9 @@ def test_plots_order(tmp_dir, dvc):
            name="stage2",
        )

-    assert get_files(dvc.plots.show()) == [


We should not have used that here, as it was supposed to test the order of show results.

efiop · 2022-06-03T13:48:28Z

Thank you!

pared requested a review from a team as a code owner May 25, 2022 12:21

pared requested a review from karajan1001 May 25, 2022 12:21

pared force-pushed the plots_fix_slow_matching branch 2 times, most recently from 599eb19 to 3b22f9c Compare May 25, 2022 12:36

pared mentioned this pull request May 25, 2022

[WIP] bench: add test_plots_diff iterative/dvc-bench#352

Closed

mattseddon mentioned this pull request May 26, 2022

Improve the plots webview performance iterative/vscode-dvc#1643

Closed

3 tasks

efiop reviewed May 27, 2022


skshetry self-requested a review May 28, 2022 00:56

daavoo assigned pared May 31, 2022

daavoo added A: plots performance labels May 31, 2022

pared changed the title ~~plots: grouping: stop using dpath.util.search~~ [WIP] plots: grouping: stop using dpath.util.search Jun 2, 2022

dberenbaum added the release-blocker label Jun 2, 2022

pared force-pushed the plots_fix_slow_matching branch from 3b22f9c to 8b8efb0 Compare June 3, 2022 09:05

pared changed the title ~~[WIP] plots: grouping: stop using dpath.util.search~~ plots: grouping: stop using dpath.util.search Jun 3, 2022

plots: grouping: stop using dpath.util.search

e881676

pared force-pushed the plots_fix_slow_matching branch from 8b8efb0 to e881676 Compare June 3, 2022 10:20

pared commented Jun 3, 2022


efiop approved these changes Jun 3, 2022


efiop merged commit 20c7b0e into iterative:main Jun 3, 2022

pared mentioned this pull request Jul 4, 2022

plots diff: performance issues iterative/vscode-dvc#1689

Closed

pared mentioned this pull request Oct 11, 2022

plots show: Cryptic error message when trying to plot csv files written from pandas #7844

Closed
