🐛 does not work with nbdiff (from https://github.com/jupyter/nbdime) #1256

Closed
infokiller opened this issue Dec 9, 2022 · 19 comments


@infokiller

Hey again!

So I'm not really sure if it's a bug in nbdiff (https://github.com/jupyter/nbdime) or in delta, but they don't seem to work together. Specifically, the diffs generated by nbdiff are missing from the output.

Here's my reproduction script:

build_oci_img () {
  docker build "$@" - <<'EOF'
  FROM archlinux
  RUN pacman -Sy > /dev/null
  RUN pacman -S --noconfirm --needed coreutils curl gcc bc git python-pip > /dev/null
  ENV CARGO_HOME=/usr/local
  RUN curl -fsSL 'https://sh.rustup.rs' | sh -s -- -y --profile minimal
  RUN /usr/local/bin/cargo install git-delta
  RUN pip install jupytext ipython_genutils nbdime
  WORKDIR /repo
EOF
}

build_oci_img
docker run --rm -i "$(build_oci_img -q)" bash <<'EOF'

main() {
  git config --global user.email test@mail.com && git config --global user.name test
  git -c init.defaultBranch=main init
  echo 'print(0)' > file.py
  jupytext --to ipynb file.py
  cat file.py
  cat file.ipynb
  git add file.ipynb
  git commit -m 'initial commit'
  echo 'print(1)' >| file.py
  jupytext --to ipynb file.py
  echo 'Running git diff:'
  git diff
  echo
  echo 'Running nbdiff:'
  nbdiff
  echo
  echo 'Piping nbdiff to delta:'
  nbdiff | delta
}

main
EOF

The output I'm getting:

Initialized empty Git repository in /repo/.git/
[jupytext] Reading file.py in format py
[jupytext] Writing file.ipynb
print(0)
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "9428c875",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(0)"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
[main (root-commit) 643ff7a] initial commit
 1 file changed, 23 insertions(+)
 create mode 100644 file.ipynb
[jupytext] Reading file.py in format py
[jupytext] Writing file.ipynb (destination file replaced [use --update to preserve cell outputs and ids])
Running git diff:
diff --git a/file.ipynb b/file.ipynb
index 3d9a299..5c5bbd4 100644
--- a/file.ipynb
+++ b/file.ipynb
@@ -3,11 +3,11 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "9428c875",
+   "id": "06311087",
    "metadata": {},
    "outputs": [],
    "source": [
-    "print(0)"
+    "print(1)"
    ]
   }
  ],

Running nbdiff:
nbdiff file.ipynb (HEAD) file.ipynb
--- file.ipynb (HEAD)  (no timestamp)
+++ file.ipynb  2022-12-09 17:25:45.285783
## modified /cells/0/id:
-  9428c875
+  06311087

## modified /cells/0/source:
-  print(0)
+  print(1)


Piping nbdiff to delta:
nbdiff file.ipynb (HEAD) file.ipynb

file.ipynb (HEAD)  (no timestamp) ⟶   file.ipynb  2022-12-09 17:25:45.285783
────────────────────────────────────────────────────────────────────────────────
@dandavison
Owner

dandavison commented Dec 10, 2022

Hi @infokiller !

Hm, your docker repro script is awesome but on first try I got an error (below). However -- I don't think you actually need to go to those lengths -- you can just provide the output from nbdiff, right? I.e. what is being sent down this pipe: nbdiff | delta

In other words: delta's contract is that it renders unified diff, and git output. So the possibilities are:

  1. nbdiff is not trying to output one of those. In that case delta doesn't currently promise to support it but we could see how hard it is.
  2. nbdiff thinks it is outputting one of those but the nbdiff output doesn't actually conform to spec; in that case it's a bug in nbdiff.
  3. nbdiff is outputting correct unified diff / git output but delta isn't rendering it correctly. In that case it's a bug in delta.

In cases (1) and (3) we just need the raw text input sent to delta, and a screenshot of delta's output.

[+] Building 4.8s (6/10)
 => [internal] load build definition from Dockerfile  0.1s
 => => transferring dockerfile: 388B  0.0s
 => [internal] load .dockerignore  0.0s
 => => transferring context: 2B  0.0s
 => [internal] load metadata for docker.io/library/archlinux:latest  0.0s
 => [1/7] FROM docker.io/library/archlinux  0.0s
 => CACHED [2/7] RUN pacman -Sy > /dev/null  0.0s
 => ERROR [3/7] RUN pacman -S --noconfirm --needed coreutils curl gcc bc git python-pip > /dev/null  4.7s
------
 > [3/7] RUN pacman -S --noconfirm --needed coreutils curl gcc bc git python-pip > /dev/null:
#6 0.960 warning: coreutils-9.0-2 is up to date -- skipping
#6 0.960 warning: curl-7.81.0-2 is up to date -- skipping
#6 4.605 error: failed retrieving file 'binutils-2.38-3-x86_64.pkg.tar.zst' from mirror.pkgbuild.com : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'gcc-11.2.0-3-x86_64.pkg.tar.zst' from mirror.pkgbuild.com : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'gcc-libs-11.2.0-3-x86_64.pkg.tar.zst' from mirror.pkgbuild.com : The requested URL returned error: 404
#6 4.605 warning: too many errors from mirror.pkgbuild.com, skipping for the remainder of this transaction
#6 4.605 error: failed retrieving file 'perl-5.34.0-3-x86_64.pkg.tar.zst' from mirror.pkgbuild.com : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'python-3.10.2-1-x86_64.pkg.tar.zst' from mirror.pkgbuild.com : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'perl-5.34.0-3-x86_64.pkg.tar.zst' from mirror.rackspace.com : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'python-3.10.2-1-x86_64.pkg.tar.zst' from mirror.rackspace.com : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'binutils-2.38-3-x86_64.pkg.tar.zst' from mirror.rackspace.com : The requested URL returned error: 404
#6 4.605 warning: too many errors from mirror.rackspace.com, skipping for the remainder of this transaction
#6 4.605 error: failed retrieving file 'gcc-11.2.0-3-x86_64.pkg.tar.zst' from mirror.rackspace.com : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'gcc-libs-11.2.0-3-x86_64.pkg.tar.zst' from mirror.rackspace.com : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'python-3.10.2-1-x86_64.pkg.tar.zst' from mirror.leaseweb.net : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'binutils-2.38-3-x86_64.pkg.tar.zst' from mirror.leaseweb.net : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'gcc-11.2.0-3-x86_64.pkg.tar.zst' from mirror.leaseweb.net : The requested URL returned error: 404
#6 4.605 warning: too many errors from mirror.leaseweb.net, skipping for the remainder of this transaction
#6 4.605 error: failed retrieving file 'gcc-libs-11.2.0-3-x86_64.pkg.tar.zst' from mirror.leaseweb.net : The requested URL returned error: 404
#6 4.605 error: failed retrieving file 'perl-5.34.0-3-x86_64.pkg.tar.zst' from mirror.leaseweb.net : The requested URL returned error: 404
#6 4.605 warning: failed to retrieve some files
#6 4.605 error: failed to commit transaction (failed to retrieve some files)
------
executor failed running [/bin/sh -c pacman -S --noconfirm --needed coreutils curl gcc bc git python-pip > /dev/null]: exit code: 1

@infokiller
Author

I think your archlinux image is stale so you need to run docker pull archlinux.
The output of nbdiff can be seen in my first comment after "Running nbdiff", but I can send it separately when I'm on my desktop

@infokiller
Author

Back at my desktop, here's the exact output of nbdiff:

nbdiff file.ipynb (HEAD) file.ipynb
--- file.ipynb (HEAD)  (no timestamp)
+++ file.ipynb  2022-12-09 17:25:45.285783
## modified /cells/0/id:
-  9428c875
+  06311087

## modified /cells/0/source:
-  print(0)
+  print(1)

@dandavison
Owner

Thanks! Right, that looks similar to unified diff, but it's missing the hunk header (the line that starts with @@):

https://www.gnu.org/software/diffutils/manual/html_node/Example-Unified.html
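
To make the missing piece concrete, here is a minimal Python sketch of how a parser might recognize a hunk header. The regular expression and the helper names (`parse_hunk_header`, `has_hunk_header`) are illustrative assumptions, not code taken from delta or nbdime. The only facts relied on are the `@@ -start,count +start,count @@` shape of the header and the convention that an omitted count means 1.

```python
import re

# Unified diff hunk header: "@@ -start[,count] +start[,count] @@ optional text".
# When the count is omitted it defaults to 1.
HUNK_HEADER = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def parse_hunk_header(line):
    """Return (old_start, old_count, new_start, new_count), or None if no match."""
    m = HUNK_HEADER.match(line)
    if m is None:
        return None
    old_start, old_count, new_start, new_count = m.groups()
    return (
        int(old_start),
        int(old_count) if old_count is not None else 1,
        int(new_start),
        int(new_count) if new_count is not None else 1,
    )


def has_hunk_header(diff_text):
    """True if at least one line of diff_text is a hunk header."""
    return any(parse_hunk_header(line) for line in diff_text.splitlines())
```

On the single-line nbdiff output quoted in this thread, `has_hunk_header` returns `False`, because the `## modified ...` line is followed directly by `-`/`+` lines with no `@@` line in between.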

Would you be able to look into why that is? I'm looking around briefly and I see https://nbdime.readthedocs.io/en/latest/cli.html#nbdiff

However, that example looks like it should include the @@ lines for a cells/<n>/source line, whereas your output does not (/cells/0/source). It looks like nbdiff output deliberately outputs some initial sections that do not use the @@ header, and these refer to image / notebook cell output etc -- that makes sense since those have no line-column coordinates in the code, so I think they are actually trying to make good use of the unified diff format by extending it in a compatible way to the notebook case.

image

@dandavison
Owner

it's missing the hunk header (the line that starts with @@

A separate question is whether delta should have output the lines between the +++ and the first @@: are such lines compatible with the spec and, even if not, might it be reasonable for delta to output them, especially since nbdiff is using them?

cc @WayneD who is the original creator of unified diff format (and a delta contributor) in case he has any thoughts on this issue and the way that nbdiff is using the format.

@infokiller
Author

Thanks a lot @dandavison for looking into this!
Yes, nbdiff can be configured to show some "difference types", like sources (code and text), output, metadata, and some other stuff. I agree it's not always clear what the diff line/column should be (even for source, you have multiple cells; should you treat them as a single long cell?).

I think it's reasonable to pass through anything that isn't recognized, or alternatively show some warning so that it doesn't look like the diff is empty. I guess it's best to add an option (for example --unknown-input=(passthrough|warn|ignore)) to control this behavior, since it won't break any existing use and would still support tools like nbdiff that may not follow the spec.
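
As a rough illustration of what such an option could mean, here is a Python sketch of the three proposed behaviors. Everything in it is hypothetical: the option is only a suggestion in this thread, the function name `apply_unknown_input_policy` is made up, and the simple state tracking below is an assumption about how "unrecognized" might be defined, not a description of delta's parser.

```python
POLICIES = ("passthrough", "warn", "ignore")


def apply_unknown_input_policy(lines, policy="passthrough"):
    """Return (kept_lines, warnings) for a hypothetical unknown-input policy.

    A line counts as "unknown" here if it is neither a file header, nor a
    hunk header, nor a +/-/space line inside a hunk. Simplification: a
    removed line whose text starts with "-- " would be mistaken for a
    "--- " file header.
    """
    if policy not in POLICIES:
        raise ValueError(f"unknown policy: {policy}")
    kept, warnings = [], []
    in_hunk = False
    for line in lines:
        if line.startswith(("diff ", "--- ", "+++ ")):
            in_hunk = False
            kept.append(line)
        elif line.startswith("@@"):
            in_hunk = True
            kept.append(line)
        elif in_hunk and (line == "" or line[0] in "+- "):
            kept.append(line)
        elif policy == "passthrough":
            kept.append(line)
        elif policy == "warn":
            warnings.append(f"unrecognized diff line: {line!r}")
        # policy == "ignore": drop the line silently
    return kept, warnings
```

With the single-line nbdiff output, `ignore` keeps only the `---`/`+++` header lines (which resembles the empty-looking result reported above), while `passthrough` keeps every line and `warn` reports the three dropped lines.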

I will check with the nbdiff developers, but from a quick look at the code it seems there is some support for unified diff, but it may not be the default:

https://github.com/jupyter/nbdime/blob/master/nbdime/prettyprint.py#L303

@dandavison
Owner

I will check with the nbdiff developers, but from a quick look at the code it seems there is some support for unified diff

Right, I noticed that code also. But that code includes @@ lines, and their example shows @@ lines. So, my current point of doubt/confusion is why your output doesn't contain the @@ on the */source/ line.

@WayneD
Contributor

WayneD commented Dec 11, 2022

Well, putting lines after the +++ that don't start with @, -, +, or space is clearly a violation of the unified diff format. What git has done is to make use of the traditional "diff ..." line as the beginning of the file's diff info and put extra lines in there, which I find superior. However, anyone is free to begin their own diff format or to try to get folks to agree to change the unified diff "standard" (which is not written down, but just codified in programs such as gnu diff & patch).
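
The rule stated above can be checked mechanically. The following Python sketch is only an illustration: the function name `nonconforming_lines` is made up, it handles a single-file diff only, and it makes two leniency assumptions that are not part of the statement above (blank lines are tolerated, and lines starting with a backslash are allowed so that the `\ No newline at end of file` marker passes).

```python
def nonconforming_lines(diff_text):
    """Return (line_number, line) pairs after the '+++' header that break the rule.

    Rule checked: every line after the '+++' line must start with '@', '-',
    '+', or a space. Assumptions: blank lines and backslash-marker lines are
    accepted; only a single-file diff is handled.
    """
    bad = []
    seen_new_file_header = False
    for number, line in enumerate(diff_text.splitlines(), start=1):
        if not seen_new_file_header:
            # Skip everything up to and including the '+++' header line.
            seen_new_file_header = line.startswith("+++ ")
            continue
        if line and line[0] not in "@-+ \\":
            bad.append((number, line))
    return bad
```

Applied to the nbdiff output from the first comment, this flags exactly the two `## modified ...` lines. It does not flag the `-`/`+` lines that follow them, even though they appear without a hunk header, because it only implements the prefix rule.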

Looking at the example, it appears that the notebook is a container of multiple files and they decided to use ## to indicate sub elements within the notebook. If so, I'd personally change the output to use normal --- & +++ lines for each element in the notebook (not ## lines). Perhaps something like:

nbdiff c.ipynb b.ipynb
diff c.ipynb/cells/9/outputs/0/data/text/plain b.ipynb/cells/9/outputs/0/data/text/plain
modified
--- c.ipynb/cells/9/outputs/0/data/text/plain
+++ b.ipynb/cells/9/outputs/0/data/text/plain
@@ -1 +1 @@
- <matplotlib.figure.Figure at 0x10ea05940>
+ <matplotlib.figure.Figure at 0x10eb21860>

diff etc
--- c.ipynb/cells/etc
+++ b.ipynb/cells/etc
@@ etc @@

However, not being familiar with what they're trying to convey, that may not be a good match for their use case. If that is true, then I'd think that this should be considered to be a different diff format that delta could consider supporting, which would require it to have special code for understanding ## lines and their comments. It should not be surprising that the unidiff parser doesn't handle it.
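
For illustration only, the suggestion above could be prototyped as a post-processing filter over nbdiff's text output. The Python sketch below is hypothetical: the function names, the `SECTION` pattern, and the choice to synthesize an `@@` header by counting lines are all assumptions, the synthesized line numbers carry no real position information, and it omits the extra `diff ...` and `modified` lines from the example above.

```python
import re

# Matches nbdiff section lines such as "## modified /cells/0/source:".
SECTION = re.compile(r"^## (.+?) (/\S*):$")


def split_sections(nbdiff_text):
    """Split nbdiff text into (action, path, body_lines) tuples.

    Lines before the first section (command echo, top-level ---/+++ header)
    and blank separator lines are dropped.
    """
    sections = []
    for line in nbdiff_text.splitlines():
        match = SECTION.match(line)
        if match:
            sections.append((match.group(1), match.group(2), []))
        elif sections and line != "":
            sections[-1][2].append(line)
    return sections


def to_per_element_diff(nbdiff_text, old_name, new_name):
    """Render each section with its own ---/+++ header and a hunk header."""
    out = []
    for action, path, body in split_sections(nbdiff_text):
        out.append(f"--- {old_name}{path}")
        out.append(f"+++ {new_name}{path}")
        if not (body and body[0].startswith("@@")):
            # No hunk header from nbdiff: synthesize one by counting lines.
            context = sum(1 for l in body if l.startswith(" "))
            old = context + sum(1 for l in body if l.startswith("-"))
            new = context + sum(1 for l in body if l.startswith("+"))
            out.append(
                f"@@ -{1 if old else 0},{old} +{1 if new else 0},{new} @@ {action}"
            )
        out.extend(body)
    return "\n".join(out) + "\n"
```

Sections that already carry an `@@` header are passed through unchanged below their new `---`/`+++` lines. Whether a layout like this suits nbdime's use case is for its maintainers to judge.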

@infokiller
Author

Thanks @WayneD, your response is very useful!

For context, a Jupyter Notebook is essentially a list of "cells", each containing code or markdown/text. Code cells can also have outputs, which can be simple textual outputs or things like figures/images.
The notebook is represented in a JSON file; you can see this overview to get a better sense of the format.

The problem nbdiff solves is that looking at a "raw" diff of notebooks is very unreadable because of the json representation and stuff like cell outputs, which can get very large (for example encoded images), and hide the code changes which are much smaller and often more interesting.
@WayneD I agree that it seems better to use ---/+++ for the sub-elements, and I will ask the nbdiff developers whether they will accept a PR to add this format.
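
To show the idea of diffing only the cell sources instead of the raw JSON, here is a small Python sketch using the standard library's `json` and `difflib` modules. It is not how nbdime works internally: the function names and the `a/`, `b/` path labels are assumptions, and cells are naively paired by index, so inserted or deleted cells are not handled.

```python
import difflib
import json


def cell_sources(notebook_json):
    """Return each cell's source as a list of lines (without line endings)."""
    nb = json.loads(notebook_json)
    sources = []
    for cell in nb["cells"]:
        src = cell["source"]
        # The notebook format allows "source" to be a string or a list of strings.
        text = src if isinstance(src, str) else "".join(src)
        sources.append(text.splitlines())
    return sources


def diff_cell_sources(old_json, new_json, name="file.ipynb"):
    """Yield unified-diff lines for same-index cells whose source differs."""
    old, new = cell_sources(old_json), cell_sources(new_json)
    for i, (a, b) in enumerate(zip(old, new)):
        # unified_diff yields nothing when the two line lists are identical.
        yield from difflib.unified_diff(
            a,
            b,
            fromfile=f"a/{name}/cells/{i}/source",
            tofile=f"b/{name}/cells/{i}/source",
            lineterm="",
        )
```

Because `difflib.unified_diff` always emits an `@@` hunk header, even a one-line change such as `print(0)` to `print(1)` produces a `---`/`+++` pair followed by `@@ -1 +1 @@`, which is the shape a unified diff renderer expects.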

@infokiller
Author

infokiller commented Dec 12, 2022

I will check with the nbdiff developers, but from a quick look at the code it seems there is some support for unified diff

Right, I noticed that code also. But that code includes @@ lines, and their example shows @@ lines. So, my current point of doubt/confusion is why your output doesn't contain the @@ on the */source/ line.

I think I figured out the discrepancy: nbdiff has different code paths for diff chunks that are multiline (which are properly formatted, as in the website example) and single-line ones: https://github.com/jupyter/nbdime/blob/master/nbdime/prettyprint.py#L830-L846

If you modify the reproduce shell command in the first comment to create a multiline diff chunk, it works as expected:

docker pull archlinux
img_digest="$(docker image inspect --format='{{index .RepoDigests 0}}' archlinux:latest)"
echo "Digest: ${img_digest}"
build_oci_img () {
  docker build --build-arg "img_digest=${img_digest}" "$@" - <<'EOF'
  ARG img_digest
  FROM archlinux:${img_digest}
  RUN pacman -Sy > /dev/null
  RUN pacman -S --noconfirm --needed coreutils curl gcc bc git python-pip > /dev/null
  ENV CARGO_HOME=/usr/local
  RUN curl -fsSL 'https://sh.rustup.rs' | sh -s -- -y --profile minimal
  RUN /usr/local/bin/cargo install git-delta
  RUN pip install jupytext ipython_genutils nbdime
  WORKDIR /repo
EOF
}

build_oci_img
docker run --rm -i "$(build_oci_img -q)" bash <<'EOF'

main() {
  git config --global user.email test@mail.com && git config --global user.name test
  git -c init.defaultBranch=main init
  printf '%s\n' 'print(0)' 'print(1)' > file.py
  jupytext --to ipynb file.py
  cat file.py
  cat file.ipynb
  git add file.ipynb
  git commit -m 'initial commit'
  printf '%s\n' 'print(0)' 'print(2)' > file.py
  jupytext --to ipynb file.py
  echo 'Running git diff:'
  git diff
  echo
  echo 'Running nbdiff:'
  nbdiff
  echo
  echo 'Piping nbdiff to delta:'
  nbdiff | delta
}

main
EOF

Output:

Initialized empty Git repository in /repo/.git/
[jupytext] Reading file.py in format py
[jupytext] Writing file.ipynb
print(0)
print(1)
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "d641ca06",
   "metadata": {},
   "outputs": [],
   "source": [
    "print(0)\n",
    "print(1)"
   ]
  }
 ],
 "metadata": {
  "jupytext": {
   "cell_metadata_filter": "-all",
   "main_language": "python",
   "notebook_metadata_filter": "-all"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
[main (root-commit) 9b14311] initial commit
 1 file changed, 24 insertions(+)
 create mode 100644 file.ipynb
[jupytext] Reading file.py in format py
[jupytext] Writing file.ipynb (destination file replaced [use --update to preserve cell outputs and ids])
Running git diff:
diff --git a/file.ipynb b/file.ipynb
index fd58c76..2c99176 100644
--- a/file.ipynb
+++ b/file.ipynb
@@ -3,12 +3,12 @@
   {
    "cell_type": "code",
    "execution_count": null,
-   "id": "d641ca06",
+   "id": "6326360b",
    "metadata": {},
    "outputs": [],
    "source": [
     "print(0)\n",
-    "print(1)"
+    "print(2)"
    ]
   }
  ],

Running nbdiff:
nbdiff file.ipynb (HEAD) file.ipynb
--- file.ipynb (HEAD)  (no timestamp)
+++ file.ipynb  2022-12-12 20:19:56.170635
## modified /cells/0/id:
-  d641ca06
+  6326360b

## modified /cells/0/source:
@@ -1,2 +1,2 @@
 print(0)
-print(1)

+print(2)



Piping nbdiff to delta:
nbdiff file.ipynb (HEAD) file.ipynb

file.ipynb (HEAD)  (no timestamp) ⟶   file.ipynb  2022-12-12 20:19:56.170635
────────────────────────────────────────────────────────────────────────────────

───┐
1: │
───┘
print(0)
print(1)

print(2)


@infokiller
Author

@dandavison should I open a separate issue with a feature request for passing through unknown diff formats?

@infokiller
Author

I submitted a PR to improve the compatibility of nbdime: jupyter/nbdime#647

@dandavison
Owner

dandavison commented Dec 12, 2022

I submitted a PR to improve the compatibility of nbdime: jupyter/nbdime#647

Great! Can you post the input to delta from nbdiff on your PR branch and a screenshot of what delta does ?

If you modify the reproduce shell command in the first comment to create a multiline diff chunk, it works as expected:

Hey, I'm sorry -- the docker script works now with the pull command, but I'm still spending time trying to understand the output of the docker script, and that's not actually relevant here -- delta doesn't care at all where its input comes from!

Can you just post the raw input to delta, and a screenshot of delta's output, so that we can immediately discuss whether delta is doing the right thing on its input? Here I think the raw input was

--- file.ipynb (HEAD)  (no timestamp)
+++ file.ipynb  2022-12-12 11:12:39.191136
## inserted before /cells/0:
+  code cell:
+    source:
+      print(0)

## deleted /cells/0:
-  code cell:
-    source:
-      print(0)
-      print(1)

whereas previously it was

--- file.ipynb (HEAD)  (no timestamp)
+++ file.ipynb  2022-12-09 17:25:45.285783
## modified /cells/0/id:
-  9428c875
+  06311087

## modified /cells/0/source:
-  print(0)
+  print(1)

I don't understand where we are exactly with those inputs -- neither contains the @@ lines of unified/git diff.

@dandavison
Owner

Also I don't get any output from delta on that input whereas the docker script produced a line like

file.ipynb (HEAD)  (no timestamp) ⟶   file.ipynb  2022-12-12 05:54:33.806495
────────────────────────────────────────────────────────────────────────────────

@dandavison
Owner

should I open a separate issue with a feature request for passing through unknown diff formats?

My vote is to solve it all in this ticket.

@infokiller
Author

I submitted a PR to improve the compatibility of nbdime: jupyter/nbdime#647

Great! Can you post the input to delta from nbdiff on your PR branch and a screenshot of what delta does ?

The default output of nbdiff changes to:

nbdiff /proc/self/fd/14 /proc/self/fd/16
--- /proc/self/fd/14  2022-12-12 22:10:49.585532
+++ /proc/self/fd/16  2022-12-12 22:10:49.585532
## modified /cells/0/source:
@@ -1 +1 @@
-print(0)
+print(1)

(yes, there's a trailing newline there, not sure why but that's unrelated to my PR)

And here's how delta renders it:

[image: screenshot of delta's output]

If you modify the reproduce shell command in the first comment to create a multiline diff chunk, it works as expected:

Hey, I'm sorry -- the docker script works now with the pull command, but I'm still spending time trying to understand the output of the docker script, and that's not actually relevant here -- delta doesn't care at all where its input comes from!

Can you just post the raw input to delta, and a screenshot of delta's output, so that we can immediately discuss whether delta is doing the right thing on its input? Here I think the raw input was

--- file.ipynb (HEAD)  (no timestamp)
+++ file.ipynb  2022-12-12 11:12:39.191136
## inserted before /cells/0:
+  code cell:
+    source:
+      print(0)

## deleted /cells/0:
-  code cell:
-    source:
-      print(0)
-      print(1)

whereas previously it was

--- file.ipynb (HEAD)  (no timestamp)
+++ file.ipynb  2022-12-09 17:25:45.285783
## modified /cells/0/id:
-  9428c875
+  06311087

## modified /cells/0/source:
-  print(0)
+  print(1)

I don't understand where we are exactly with those inputs -- neither contains the @@ lines of unified/git diff.

My bad, I copied the wrong command (edited the original comment to fix it). The raw input to delta for the single line case is (before my PR):

nbdiff /proc/self/fd/14 /proc/self/fd/17
--- /proc/self/fd/14  2022-12-12 22:16:29.507406
+++ /proc/self/fd/17  2022-12-12 22:16:29.497406
## modified /cells/0/source:
-  print(0)
+  print(1)

delta:

[image: screenshot of delta's output]

For the multiline change case:

nbdiff /proc/self/fd/14 /proc/self/fd/17
--- /proc/self/fd/14  2022-12-12 22:18:17.050549
+++ /proc/self/fd/17  2022-12-12 22:18:17.063882
## modified /cells/0/source:
@@ -1,2 +1,2 @@
 print(0)
-print(1)
+print(2)

delta:

[image: screenshot of delta's output]

@sanmai-NL

@infokiller What's the current status of this problem? Are you any wiser?

@infokiller
Author

@sanmai-NL IIRC the nbdime PR resolves it, but it hasn't been merged yet.

@dandavison
Owner

It might be helpful to open an issue in nbdime explaining the problem. I.e. asking them for an option to output valid unified diff format. The PR doesn't contain any explanation or tests etc, so I can understand that it got overlooked.

I think we can close this issue though; I don't think nbdime is prominent enough to warrant adding special support for its format to delta's parser unfortunately.
