Breaking change to Python domain IDs #7301

mgeier · 2020-03-11T17:00:30Z

Describe the bug

Previously, anchors for Python functions were using underscores, #7236 changed this to dashes.

To Reproduce

Document some Python function whose name contains underscores:

.. py:function:: example_python_function(foo)

    Some function.

Expected behavior

This used to create a fragment identifier #example_python_function , but since #7236 this creates #example-python-function.

Your project

This breaks links to python functions when used with nbsphinx: https://nbsphinx.readthedocs.io/en/0.5.1/markdown-cells.html#Links-to-Domain-Objects

Apart from that all links (containing underscores) from external sites to Python API docs created by Sphinx (which I guess are a lot) will break!

The text was updated successfully, but these errors were encountered:

jakobandersen · 2020-03-11T17:17:42Z

Are you sure the old links are broken? While the permalink is indeed using a new ID generation scheme, the old ID should still be attached to the declaration (using <span id="theOldID"></span>).

tk0miya · 2020-03-11T17:29:37Z

Yes, I changed the style of node_ids in #7236. Therefore, the main hyperlink anchor will be changed in the next release. But old-styled node_ids are still available. So old hyperlinks are still working.

There are some reasons why I changed them. First is that their naming is very simple and getting conflicted often (refs: #6903). Second is the rule of naming is against docutils specification. Last is that it allows sharing one node_ids to multiple names. For example, it helps to represent io.StringIO and _io.StringIO are the same (or they have the same ID).

To improve both python domain and autodoc, we have to change the structure of the domain and the rule of naming IDs. I don't know the change is harmful. But it is needed to improve Sphinx, I believe.

mgeier · 2020-03-12T12:14:09Z

Thanks for the quick responses!

Are you sure the old links are broken?

I thought so, but links from external sites seem to be not yet broken.

However, it looks like they will be broken at some point in the future, according to this comment:

sphinx/sphinx/domains/python.py

Lines 367 to 370 in f85b870

    
           # Assign old styled node_id(fullname) not to break old hyperlinks (if possible) 
        
           # Note: Will removed in Sphinx-5.0  (RemovedInSphinx50Warning) 
        
           if node_id != fullname and fullname not in self.state.document.ids: 
        
               signode['ids'].append(fullname)

Whether that happens sooner or later, it will be quite bad.

But that's not actually the situation where I found the problem. The ID change is breaking my Sphinx extension nbsphinx because the code assumed that underscores are not changed to dashes:

https://github.com/spatialaudio/nbsphinx/blob/5be6da7b212e0cfed34ebd7da0ede5501549571d/src/nbsphinx.py#L1446-L1467

This causes a build warning (and a missing link) when running Sphinx.
You can reproduce this by building the nbsphinx docs, see https://nbsphinx.readthedocs.io/en/0.5.1/contributing.html.

I could of course change the implementation of nbsphinx (depending on the Sphinx version), but that would still break all the API links my users have made in their notebooks!

And it would break them depending on which Sphinx version they are using, wouldn't that be horrible?

First is that their naming is very simple and getting conflicted often (refs: #6903).

How does changing #example_python_function to #example-python-function help in this case?

Second is the rule of naming is against docutils specification.

Because underscores are not allowed?

I think it would be worth violating the specification for that.
I don't see any negative consequences to this.
And it has worked fine for many years.

Also, having dots and underscores in link to API docs just looks so much more sensible!

Here's an example: https://sfs-python.readthedocs.io/en/0.5.0/sfs.fd.source.html#sfs.fd.source.point_velocity

The link contains the correct Python function name: #sfs.fd.source.point_velocity.

When the ID is changed, this becomes: #sfs-fd-source-point-velocity, which doesn't really make any sense anymore.

Last is that it allows sharing one node_ids to multiple names. For example, it helps to represent io.StringIO and _io.StringIO are the same (or they have the same ID).

I don't understand. Are io and _io not different names in Python?

And what does that have to do with changing underscores to dashes?

tk0miya · 2020-03-15T07:57:56Z

First is that their naming is very simple and getting conflicted often (refs: #6903).

How does changing #example_python_function to #example-python-function help in this case?

No, this change is not only replacing underscores by hyphens. New ID Generator tries to generate node_id by following steps; 1) Generate node_id by the given string. But generated one is already used in the document, 2) Generate node_id by sequence number like id0.

It means the node_id is not guess-able by its name. Indeed, it would be almost working fine if we use hyphens and dots for the ID generation. But it will be sometimes broken.

Second is the rule of naming is against docutils specification.

Because underscores are not allowed?

I think it would be worth violating the specification for that.
I don't see any negative consequences to this.
And it has worked fine for many years.

Yes, docutils' spec does not allow to use hyphens and dots in node_id.

I know the current rule for node_id generation is not so wrong. But it surely contains problems. Have you ever try to use invalid characters to the signature? How about multibyte characters?

For example, this is an attacking code for the ID generator:

.. py:function:: "><script>alert('hello sphinx')</script>

I know this is a very mean example and not related to hyphens' problem directly. But our code and docutils do not expect to pass malicious characters as a node_id. I suppose dots and hyphens may not harm our code. But we need to investigate all of our code to prove the safety.

Last is that it allows sharing one node_ids to multiple names. For example, it helps to represent io.StringIO and _io.StringIO are the same (or they have the same ID).

I don't understand. Are io and _io not different names in Python?

And what does that have to do with changing underscores to dashes?

Indeed, io and _io are different names in python interpreter. But please read the python-doc. The latter one is not documented in it. We have some issues to document a python object as "public name" instead of "canonical name". see #4065. It is one of them. This feature is not implemented yet. But I'll do that in the (nearly) future. It tells us the real name of the living object does not match the documentation of it.

As you know, it is not related to hyphens problem. It also conflicts with the hyperlinks which human builds manually. It's no longer guess-able. If we'll keep using dots and hyphens for node_id, the cross-reference feature is needed to create references for nbsphinx, I think.

mgeier · 2020-03-18T18:10:45Z

How does changing #example_python_function to #example-python-function help in this case?

No, this change is not only replacing underscores by hyphens. New ID Generator tries to generate node_id by following steps; 1) Generate node_id by the given string. But generated one is already used in the document, 2) Generate node_id by sequence number like id0.

OK, that sounds great. So what about doing that, but also allow underscores (_) and dots (.)?

I know the current rule for node_id generation is not so wrong. But it surely contains problems. Have you ever try to use invalid characters to the signature? How about multibyte characters?

OK, I understand that it might be problematic to allow arbitrary characters/code points.

But what about just adding _ and . to the allowed characters?

For example, this is an attacking code for the ID generator:
.. py:function:: "><script>alert('hello sphinx')</script>

I'm not sure if that's really a problematic case, because the attack would have to come from the document content itself. I'm not a security specialist, so I'm probably wrong.

Anyway, I'm not suggesting to allow arbitrary characters.

_ and . should be safe from a security standpoint, right?

Last is that it allows sharing one node_ids to multiple names. For example, it helps to represent io.StringIO and _io.StringIO are the same (or they have the same ID).

I don't understand. Are io and _io not different names in Python?
And what does that have to do with changing underscores to dashes?

Indeed, io and _io are different names in python interpreter. But please read the python-doc.

I can see that those are not the same name.

What IDs are those supposed to get?

IMHO it would make perfect sense to give them the IDs #io and #_io, respectively, wouldn't it?

The latter one is not documented in it. We have some issues to document a python object as "public name" instead of "canonical name". see #4065. It is one of them. This feature is not implemented yet. But I'll do that in the (nearly) future. It tells us the real name of the living object does not match the documentation of it.

I don't really understand any of this, but would it make a difference if underscores (_) were allowed in IDs?

As you know, it is not related to hyphens problem. It also conflicts with the hyperlinks which human builds manually. It's no longer guess-able. If we'll keep using dots and hyphens for node_id, the cross-reference feature is needed to create references for nbsphinx, I think.

I don't understand.
Do you mean I should create my own custom IDs in nbsphinx and overwrite the ones generated by Sphinx?
I guess I will have to do something like that if you are not modifying the way IDs are generated by Sphinx.
I could probably do something similar to https://github.com/spatialaudio/nbsphinx/blob/559fc4e82bc9e2123e546e67b8032643c87cfaf6/src/nbsphinx.py#L1384-L1407.

I do understand that IDs should be unique per HTML page, and I don't mind if the second (and third etc.) duplicate is re-written to #id0 etc., but I would really like to have readable and understandable IDs (for Python API links) for the case that there are no duplicate IDs (and for the first one even if there are duplicates). And (probably more importantly?) I would like to avoid too much breakage in the projects of nbsphinx users.

tk0miya · 2020-03-20T06:01:33Z

OK, I understand that it might be problematic to allow arbitrary characters/code points.
But what about just adding _ and . to the allowed characters?

Surely, I don't think _ and . will not cause the problem, as you said. But are you satisfied if I agree to support _ and .? How about capital characters too. How about the latin-1 characters? If you need to change the charset, we need to define the allowed charset for node IDs (and confirm they are safe as possible). Is it okay to support only _ and .?

I don't understand.
Do you mean I should create my own custom IDs in nbsphinx and overwrite the ones generated by Sphinx?
I guess I will have to do something like that if you are not modifying the way IDs are generated by Sphinx.

So far, a node_id of a python object had been the same as its name. Since Sphinx-3.0, it will be changed. The new implementation is almost the same as other domains do for the cross-references.

To realize the new cross-reference feature, we use a "reference name" and location info. A reference name equals the name of the object. For example, io, io.StringIO, int, MyClass, MyClass.meth and so on. A location info is a pair of a docname and a node ID.

On building a document, the python domain goes the following steps:

Collects definitions of python objects written by their directives (ex. py:function::, py:class::, etc.) and creates a mapping from the reference-names to the location info.
Every time find cross-reference roles, look up the location info from the mapping using the reference name of the role. For example, when a role :py:mod:`io` found, the python domain look up the location info by the reference name; io. If succeeded, the role is converted to a reference to the specified node in the document.
Renders the references to the output in the arbitrary format.

This means generating URL manually is not recommended. The node_id is not guess-able because it is sometimes auto-generated (ex. id0). I still don't think about how to implement io and _io case. But it will obey the structure above.

Note: docutils spec says node_id should be starts with a letter [a-z]. So _io will be converted into io (if not already registered).

jakobandersen · 2020-03-20T11:22:53Z

It's a tricky issue, but I think it would be good to be a bit more permissive on the IDs, and ignore the docutils spec a bit as it is not enforced anyway.

My reasoning behind the ID generation for, primarily the C++ domain, but also the C domain:

IDs should be as stable as possible, as they may be used in external links.
That means, if one permutes declarations, then it should have no effect on their generated IDs. This rules out having counters.
The extreme cases are for example singlehtml and Latex where IDs need to be unique throughout the whole document, and not just per page.
Domains are independent and users may implement their own, so all IDs of a domain should have a unique prefix (as best as possible*). E.g., currently c. for C and _CPPv4 for C++.
Changes are sometimes needed to the ID generation scheme, so nodes should also be assigned their old IDs, though on a best-effort basis. If this gets out of hand we can deprecate the oldest IDs.

So

For most languages/domains a unique ID can be made up by identifiers (i.e., something like [a-zA-Z_][a-zA-Z_0-9]*), and then a scope separator (e.g., . in Python and C).
The docutils restrictions/guidelines (https://docutils.sourceforge.io/docs/ref/rst/directives.html#identifier-normalization) seems to be based on "old" technologies, HTML 4.1 and CSS 1. Though it doesn't look like this normalisation is applied to IDs coming from Sphinx. Anyway, HTML 5 loosened the restrictions: https://mathiasbynens.be/notes/html5-id-class
Therefore, if we use restrict IDs to non-empty strings matching [A-Za-z0-9_.]* I believe we can guarantee uniqueness almost always. Optionally, we can map . into - to make CSS selection easier (i.e., no escaping), though has this been a problem so far?
Regarding *, domain ID prefixes: if we get really strict on this, we can require domains to register their primary ID prefix they use and enforce uniqueness.

tk0miya · 2020-03-21T17:11:31Z

I can accept adding . and _ to the allowed character list. I suppose it will harm nobody.

The docutils restrictions/guidelines (https://docutils.sourceforge.io/docs/ref/rst/directives.html#identifier-normalization) seems to be based on "old" technologies, HTML 4.1 and CSS 1. Though it doesn't look like this normalisation is applied to IDs coming from Sphinx. Anyway, HTML 5 loosened the restrictions:

Unfortunately, we still support HTML4 technology... Especially HTMLHelp builder depends on it. I don't know how many users use it. But bug reports are sometimes filed. So it seems used.

So I hesitate to use . and _ to the first character.

Optionally, we can map . into - to make CSS selection easier (i.e., no escaping), though has this been a problem so far?

Unfortunately, I don't know. To be exact, I've never seen the CSS class converted from node ID.

In development 3.0, Sphinx has obeyed to the rule of "Identifier Normalization" of docutils. This extends it to allow dots(".") and underscores("_") for node identifier. It allows Sphinx to generate node identifier from source string as possible as it is (bacause dots and underscores are usually used in many programming langauges). This change will keep not to break hyperlinks as possible.

In development of 3.0, Sphinx starts to obey to the rule of "Identifier Normalization" of docutils. This extends it to allow dots(".") and underscores("_") for node identifier. It allows Sphinx to generate node identifier from source string as possible as it is (bacause dots and underscores are usually used in many programming langauges). This change will keep not to break hyperlinks as possible.

tk0miya · 2020-03-22T08:17:01Z

I made a PR #7356 to allow . and _ in node id generation. It seems nobody objects to support them for node ids. So I'll merge it soon.

Fix #7301: Allow . and _ for node_id

tk0miya · 2020-03-22T14:44:55Z

Now I merged #7356 for beta release. But reopened to continue this discussion. Please check it and give me comments. I'd like to refine it before 3.0 final (if needed).

jakobandersen · 2020-03-23T15:52:53Z

Thanks! As a last thing I believe it is ok to allow capital letters as well, which would make it much easier to guarantee uniqueness. Maybe I just missed the docutils rationale for converting to lowercase, but I haven't seen any, and I don't know of any output format where IDs are not case sensitive.

mgeier · 2020-03-24T10:20:11Z

Thanks @tk0miya, you are the best!

I agree with @jakobandersen that keeping capital letters would be great.

For example, right now both these links work:

... but only the first one creates the yellow highlighting.
It's not necessary to have both, I think the lowercase target should be removed.

The corresponding [source] link respects uppercase, which is good: https://www.sphinx-doc.org/en/master/_modules/sphinx/application.html#TemplateBridge

And the [docs] link back from there also respects uppercase, which is also good: https://www.sphinx-doc.org/en/master/extdev/appapi.html#sphinx.application.TemplateBridge

Now the last missing piece would be to remove the lower-case link target and replace it with the correct case.

How about capital characters too.

Yes, please!

How about the latin-1 characters?

I guess we'll have to draw the line somewhere. Python allows many Unicode characters in identifiers, but AFAICT, in practice most people still use only ASCII letters (lower and upper case) and numbers. And, very importantly, underscores. And the dot (.) is used for namespacing, also very important.

I don't care about any other latin-1 or Unicode characters.

tk0miya · 2020-03-24T16:56:30Z

@jakobandersen @mgeier Thank you for comment. I agree to use capital characters also.

After applying #7374, we can use node_ids that matches with [a-zA-Z][a-zA-Z0-9._-]*.
The difference from docutils' is:

Allow to use capital alphabet characters
Allow to use dots (".") and underscores ("_") for an identifier without a leading character.

Does this rule make sense?

jakobandersen · 2020-03-24T17:20:29Z

Does this rule make sense?

Yes, looks good to me. Thanks!

Fix #7301: capital characters are not allowed for node_id

tk0miya · 2020-03-29T15:19:49Z

Done. Closing.

mgeier added the type:bug label Mar 11, 2020

tk0miya mentioned this issue Mar 14, 2020

3.0.0 release plan #7124

Closed

tk0miya added the domain label Mar 15, 2020

tk0miya added this to the 3.0.0 milestone Mar 15, 2020

tk0miya mentioned this issue Mar 22, 2020

Fix #7301: Allow . and _ for node_id #7356

Merged

tk0miya closed this as completed in 7bbf79c Mar 22, 2020

tk0miya added a commit that referenced this issue Mar 22, 2020

Merge pull request #7356 from tk0miya/7301_node_id

8edae5f

Fix #7301: Allow . and _ for node_id

tk0miya reopened this Mar 22, 2020

tk0miya added a commit to tk0miya/sphinx that referenced this issue Mar 24, 2020

Fix sphinx-doc#7301: capital characters are not allowed for node_id

01f7719

tk0miya mentioned this issue Mar 24, 2020

Fix #7301: capital characters are not allowed for node_id #7374

Merged

tk0miya added a commit to tk0miya/sphinx that referenced this issue Mar 24, 2020

Fix sphinx-doc#7301: capital characters are not allowed for node_id

f0a9a34

tk0miya added a commit to tk0miya/sphinx that referenced this issue Mar 24, 2020

Fix sphinx-doc#7301: capital characters are not allowed for node_id

a9ab8c0

tk0miya added a commit to tk0miya/sphinx that referenced this issue Mar 24, 2020

Fix sphinx-doc#7301: capital characters are not allowed for node_id

963b320

tk0miya added a commit to tk0miya/sphinx that referenced this issue Mar 29, 2020

Fix sphinx-doc#7301: capital characters are not allowed for node_id

7aa5584

tk0miya added a commit to tk0miya/sphinx that referenced this issue Mar 29, 2020

Fix sphinx-doc#7301: epub: duplicated node_ids are generated

32b256d

tk0miya added a commit that referenced this issue Mar 29, 2020

Merge pull request #7374 from tk0miya/7301_node_id

9d32d5b

Fix #7301: capital characters are not allowed for node_id

tk0miya closed this as completed Mar 29, 2020

blueyed mentioned this issue Jun 8, 2020

[RFC] Revisit/improve generating of ids (_make_id) #7800

Closed

github-actions bot locked as resolved and limited conversation to collaborators Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Breaking change to Python domain IDs #7301

Breaking change to Python domain IDs #7301

mgeier commented Mar 11, 2020

jakobandersen commented Mar 11, 2020

tk0miya commented Mar 11, 2020

mgeier commented Mar 12, 2020

tk0miya commented Mar 15, 2020

mgeier commented Mar 18, 2020

tk0miya commented Mar 20, 2020

jakobandersen commented Mar 20, 2020

tk0miya commented Mar 21, 2020

tk0miya commented Mar 22, 2020

tk0miya commented Mar 22, 2020

jakobandersen commented Mar 23, 2020

mgeier commented Mar 24, 2020 •

edited

tk0miya commented Mar 24, 2020

jakobandersen commented Mar 24, 2020

tk0miya commented Mar 29, 2020

Breaking change to Python domain IDs #7301

Breaking change to Python domain IDs #7301

Comments

mgeier commented Mar 11, 2020

jakobandersen commented Mar 11, 2020

tk0miya commented Mar 11, 2020

mgeier commented Mar 12, 2020

tk0miya commented Mar 15, 2020

mgeier commented Mar 18, 2020

tk0miya commented Mar 20, 2020

jakobandersen commented Mar 20, 2020

tk0miya commented Mar 21, 2020

tk0miya commented Mar 22, 2020

tk0miya commented Mar 22, 2020

jakobandersen commented Mar 23, 2020

mgeier commented Mar 24, 2020 • edited

tk0miya commented Mar 24, 2020

jakobandersen commented Mar 24, 2020

tk0miya commented Mar 29, 2020

mgeier commented Mar 24, 2020 •

edited