Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DOC: Use vanilla MathJax distribution, fall back to vendored scipy-mathjax if not available #20518

Closed

Conversation

agriyakhetarpal
Copy link
Contributor

Reference issue

This change was proposed in the SciPy Community Call (Americas) on 17/04/2024 (please read through the meeting notes for more context about this change).

What does this implement/fix?

This PR adds a small code snippet to check for an internet connection by pinging Google's DNS at 8.8.8.8. If an internet connection is available, it will use the vanilla MathJax distribution, and if there isn't, then the local (vendored) MathJax distribution will be used. This ensures that the local one is used as just a fallback with a mechanism to check for the
availability of the usual distribution.

Additional information

Here are the three possible scenarios:

  1. Building the documentation for deployment via GitHub Pages or on CircleCI: the vanilla MathJax distribution will be used
  2. Building the documentation locally with an internet connection: the vanilla MathJax distribution will be used
  3. Building the documentation locally without an internet connection: the vendored MathJax distribution will be used

The third scenario is quite rare and will affect just contributors and developers for SciPy (and those without access to the internet). Therefore, the vendored MathJax (scipy-mathjax) will be used only without the presence of an internet connection – in the case of the first scenario, GitHub Actions runners used to deploy to GitHub Pages and runners on CircleCI both have access to the internet.

To not produce any breaking changes and keep behaviour consistent with what SciPy has been doing in the past (starting with gh-7376), I think it would be great to continue using the vendored distribution for the official SciPy docs, so if there is a handy environment variable that can be used to check if they are being deployed as a part of a release, I shall modify the logic here to include that (and therefore refine scenario 1).

xref: executablebooks/MyST-NB#590 for why this change is being proposed – it would be better to resolve this issue, so this PR is being opened just for visibility.

cc: @melissawm

This commit adds a small code snippet to check
for an internet connection by pinging Google's DNS.
If an internet connection is available, it will use
the vanilla MathJax distribution, and if there isn't,
then the local (vendored) MathJax distribution will
be used. This ensures that the local one is used as
just a fallback with a mechanism to check for the
availability of the usual distribution.
@github-actions github-actions bot added the Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org label Apr 18, 2024
@agriyakhetarpal
Copy link
Contributor Author

The sdist + wheel build on Windows (without Pythran acceleration) looks like it was a GitHub runner error and should be unrelated

Copy link
Member

@tupui tupui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR.

Do we know what is the impact on loading pages with a lot of maths (like in stats)? Could you do some timing? (With eg a trace from the dev tools of a browser.)

I am also concerned about the security implications of doing that. (We did have some CVE report about mathjax in the past.) Did other large projects have such discussions?

@tupui
Copy link
Member

tupui commented Apr 18, 2024

For the CI, please only run the doc part next time. See https://scipy.github.io/devdocs/dev/contributor/continuous_integration.html#continuous-integration

@agriyakhetarpal
Copy link
Contributor Author

Indeed, I'll run [docs only] next time (I usually remember to do so but forgot, in a jiffy).

@tupui
Copy link
Member

tupui commented Apr 18, 2024

No worries, I mostly pointed this out because you mentioned the failed job.

@tupui tupui added the maintenance Items related to regular maintenance tasks label Apr 18, 2024
@agriyakhetarpal agriyakhetarpal marked this pull request as draft April 18, 2024 18:51
INTERNET_AVAILABLE = False # assume no internet connection + use local MathJax

try:
# Attributed to https://stackoverflow.com/a/33117579/14001839
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit awkward because SciPy's license is not compatible with StackOverflow's.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes 👍 we cannot use this code and need to come up with our own solution or use a MIT/BSD existing code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the comment, I had not noticed that – let me try to find another code snippet that is compatible with SciPy's license and revise this functionality with that.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I added a much simpler check with the socket module in 4c18fef, which also resolves a ResourceWarning that I had missed earlier, by closing the socket. It works locally and should be compliant with SciPy's license. I'm not sure if pinging Google out of all is the best website, so we can switch to IANA-reserved example domains to test a connection with if that makes sense (https://www.iana.org/help/example-domains)

@DietBru
Copy link
Contributor

DietBru commented Apr 21, 2024

Please note that the default MathML renderer (CHTML) has an annoying bug: Depending on the zoom level the over-line markings are suppressed.
On my computer (Firefox in KDE Wayland on 1920x1080 display) it happens at 100% zoom state.
From the Short-Time Fourier Transform Section with CHTML renderer:
image

With SVG renderer:
image

The renderer can be changed by right clicking on the equation and opening the sub-menu Math SettingsMath Renderer.

In stable SciPy, the default is seems to be the SVG renderer.

@agriyakhetarpal
Copy link
Contributor Author

agriyakhetarpal commented Apr 22, 2024

Thanks for the comment, @DietBru – unfortunately on my macOS machine (resolution: 3024x1964 and the renderer being most likely Metal or XQuartz, not sure) and with the Chromium-based Brave Browser, full specification "Version 1.64.116 Chromium: 123.0.6312.105 (Official Build) (arm64)", I can't seem to reproduce that. I am able to view the over-line markings in the STFT section with both CHTML and SVG renderers:


CHTML

STFT mathematical representation with CHTML renderer


SVG

STFT mathematical representation with SVG renderer

The only difference that I can notice is that the SVG renderer is a bit less bold, but also slightly smaller, the latter may not be evident in the screenshots I posted because they are quite similar in dimensions (though I don't think that's relevant to what you mentioned).

P.S. I would also like to mention that the reproducer failed at multiple zoom levels; standard ones from the browser and when using the trackpad for a custom zoom.

@agriyakhetarpal
Copy link
Contributor Author

agriyakhetarpal commented Apr 22, 2024

Thank you for the PR.

Do we know what is the impact on loading pages with a lot of maths (like in stats)? Could you do some timing? (With eg a trace from the dev tools of a browser.)

I am also concerned about the security implications of doing that. (We did have some CVE report about mathjax in the past.) Did other large projects have such discussions?

@tupui, I had looked at the impact on loading pages but couldn't find a truly reliable estimate for this last week through the in-browser developer tools (see below), so I'll additionally try to profile this with other tools available online (perhaps, something like http://www.webpagetest.org/). I will be looking into past CVE reports and the history of use of MathJax across SciPy and other projects.


In the Sphinx logs for the docs built on the latest commit, the proposed change is working and we have the MathJax distribution being loaded from the CDN. I purged my browser's caches because I found that it was loading the MathJax fonts from disk, which wasn't useful, and that revealed that they get loaded between as low as 9ms to 30ms (most of them were on the higher side, though). On subsequent reloads of the same page or other pages, they are placed into either the memory or disk caches, and thus take between 0 to 2 ms to load.

To say more, I chose the Discrete Statistical Distributions page from scipy.stats in the developer documentation as a benchmark, and the local MathJax distribution indeed loads everything much faster, in 0–2 ms, so this says more about how efficiently these resources are getting cached on further accesses of the webpage and it is only the first time that one can experience a short delay, which should be more prominent on throttled devices.

I tested on a "Slow 3G" network, which took 10 seconds for the script to load, while the devdocs took 3.5 seconds, which is surely going to be noticeable. On a "Fast 3G" network with caches disabled, the CDN takes ~2.5 seconds, and the local MathJax takes just about a second (on some occasions a twentieth of a second less), which is more bearable. I think this is enough information for us to determine the impact of loading from the CDN, so I would be happy to implement something for this point as mentioned above and will be open to suggestions:

I think it would be great to continue using the vendored distribution for the official SciPy docs, so if there is a handy environment variable that can be used to check if they are being deployed as a part of a release, I shall modify the logic here to include that (and therefore refine scenario 1).

because the official docs are receiving the maximum number of views; on devdocs or CircleCI, we can use the MathJax CDN? I have yet to investigate plus verify the impacts of CVEs and use cases of the CDN by other projects, plus, this is only a temporary solution for the aforementioned MyST-NB issue anyway (which would be good to get fixed for other projects too), so the experiment has most likely been useful to understand how varying levels of network speeds can affect load times for scripts, regardless of whether this PR ends up being merged or not. It can be reiterated that this is for the first load and that these high load times do not show up for any type of throttled connection upon reloads, so the drawbacks highlighted may turn up to be not major.

P.S. Keeping the vendored MathJax distribution on the official SciPy docs would also reduce the risks of CVEs coming from that derived from the CDN.

@tupui
Copy link
Member

tupui commented Apr 22, 2024

Thank you for the analysis 👏

It could be fine as we mostly serve fast connections. Though adding more latency could make us lose more ground.

@melissawm what do you think? From the notes, I think this was discussed with you in the call.

@agriyakhetarpal
Copy link
Contributor Author

Here are some of my findings (should have posted this yesterday, I later noticed that I forgot to push the "Comment" button and this text remained as a draft, oops!):

Some notes about past vulnerability reports for MathJax

It is to be noted that MathJax v2.7.1, i.e., the version that SciPy currently vendors, also has two noted CVEs that were first detected in v2.7.4 and v2.7.9 respectively.

This webpage offers some details about the regular expression denial of service vulnerabilities that were unravelled up to v2.7.9.

A permalink to the first CVE is present at https://nvd.nist.gov/vuln/detail/CVE-2023-39663; OTOH, the other XSS vulnerability is as described in https://nvd.nist.gov/vuln/detail/CVE-2018-1999024.

Other pieces of security issues and relevant history

However, the report that led to the above (mathjax/MathJax#3074) is refuted by the maintainer(s) owing to the fact that they do not think this is a potential attack vector that can be exploited through the use of MathJax in a page directly. See also: GHSA-v638-q856-grg8 and the corresponding entry on Synk.

MathJax used to have issues with unsafe consent security policies, but this was 12 years back (mathjax/MathJax#256) and has been long resolved, permitting users to access the script when served over both HTTP and HTTPS. However, there is a comment mathjax/MathJax#256 (comment) from March 2023 that argues that the issue persists, even so, I do not think I follow the discussion that entailed further based on my non-expertise in web development, and I think it makes sense to say that this has been more or less fixed.

The other XSS vulnerability was also reported on SciPy publicly in the past: #13210 where #13210 (comment) highlighted that the lack of user-provided untrusted content plus the use of solely static content means that SciPy is most likely not affected by this issue. I notice that this was also highlighted in numpy/numpy#21557, which I don’t think in particular received another comment, since it was about Pygments, and SciPy subsequently dropped PDF docs builds discussed in #15635 owing to CI and docs build times concerns, in #15649.

There have been other tools that if used might have been able to exploit the XSS vulnerability mentioned above: volca/markdown-preview#60, but this is also quite old and they enabled XSS sanitisation by default, but disabled it if MathJax is being used. I do not know enough about the logistics of XSS attacks as such, just the basics.

IPython also had some discussions about this where they argued against the use of a local MathJax distribution, which isn’t valid at all for SciPy’s current custom, smaller-size one: ipython/ipython#6246. They also discussed using it over HTTPS to prevent MITM attack vectors (see also: https://seclists.org/oss-sec/2014/q3/272), which I have currently enabled for the distribution already by using the HTTPS protocol in the URI I added in mathjax_path in conf.py. Earlier, when the MathJax distribution was not vendored, #5499 had enabled the use of HTTPS to access the script.

sympy/sympy#14964 (comment) also discusses the MathJax CVE, but this is also quite old, and is from August 2018.

My recommendation(s)

For all versions starting v3.0.0 and later (v2.7.9 was the last version to have a security vulnerability), there do not seem to be any (detected) vulnerabilities at all, based on https://www.cvedetails.com/version-list/21911/70488/1/Mathjax-Mathjax.html.

Other projects such as RStudio’s bookdown have migrated to later versions to avoid jeopardy: rstudio/bookdown#1355 and JupyterLab to v3 in jupyterlab/jupyterlab#13877

#14508 was when SciPy was last updated, and it has been quite some time since then and the current vendored one is still v2.7.1 – regardless of where this PR goes, I would like to propose updating it to v3, at least. I would be happy to do so in this PR, or put it up in a new PR and xref this as a link to it for those intending to follow previous discussions. Please let me know what works. I shall mark this ready for review for the time being so that any other code owners or those interested can have a look.

@agriyakhetarpal agriyakhetarpal marked this pull request as ready for review April 24, 2024 12:56
@agriyakhetarpal
Copy link
Contributor Author

It is to be noted that the custom MathJax problem, which was the root of everything as noted in #20303, is fixed1 by adding the following line in conf.py:

myst_update_mathjax = False

This configuration value can be evaluated to see if it breaks something else (I don't think it does), but this also reveals the source of the upstream bug. While this PR may not serve much purpose afterwards, I would still recommend updating the custom MathJax distribution to v3 for reasons mentioned above.

cc: @melissawm

Footnotes

  1. This is as noted in https://myst-parser.readthedocs.io/en/latest/syntax/optional.html#mathjax-and-math-parsing

@melissawm
Copy link
Contributor

So, to summarize:

  • If we use myst_update_mathjax = False in our conf.py, the immediate "mathjax not rendering in html pages originating from notebooks" is solved now;
  • It is recommended that we update our custom mathjax package to use MathJax v3 at some point soon;
  • The full solution of this problem, involving a PR to upstream myst-nb, can be done at a later step.
    Is that right? If so, do we still need this PR?

@agriyakhetarpal
Copy link
Contributor Author

Thank you for summarising, @melissawm!

  • If we use myst_update_mathjax = False in our conf.py, the immediate "mathjax not rendering in html pages originating from notebooks" is solved now;

Yes ✅

  • It is recommended that we update our custom mathjax package to use MathJax v3 at some point soon;

Indeed – I shall open another PR for doing this.

  • The full solution of this problem, involving a PR to upstream myst-nb, can be done at a later step.
    Is that right? If so, do we still need this PR?

Yes. I shall add a comment to the existing issue on MyST-NB, notifying others, in case it has been triaged, that the bug has been isolated to a specific feature there. We do not need this PR anymore. I shall start working on updating MathJax to v3 – in the meantime, I will close this PR and link to these discussions in the newer PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Documentation Issues related to the SciPy documentation. Also check https://github.com/scipy/scipy.org maintenance Items related to regular maintenance tasks
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants