Breaking change to Python domain IDs #7301
Comments
Are you sure the old links are broken? While the permalink is indeed using a new ID generation scheme, the old ID should still be attached to the declaration (using |
Yes, I changed the style of node_ids in #7236, so the main hyperlink anchor will change in the next release. But the old-style node_ids are still available, so old hyperlinks still work. There are several reasons why I changed them. First, the old naming is very simple and often causes conflicts (refs: #6903). Second, the naming rule goes against the docutils specification. Last, the new scheme allows one node_id to be shared by multiple names. For example, it helps to represent To improve both the Python domain and autodoc, we have to change the structure of the domain and the rule for naming IDs. I don't know whether the change is harmful. But I believe it is needed to improve Sphinx.
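As a rough illustration of how a declaration can carry both anchors, here is a minimal sketch. It is not the Sphinx implementation; `anchors_for` and its parameters are hypothetical names, and the only assumption taken from the comment above is that the old-style ID (the plain object name) is kept alongside the new one when possible.

```python
def anchors_for(fullname, new_id, used_ids):
    """Return the anchors to attach to one declaration (hypothetical helper)."""
    ids = [new_id]
    # Keep the old-style anchor (the plain object name) when it differs from
    # the new ID and is not already taken elsewhere in the document.
    if fullname != new_id and fullname not in used_ids:
        ids.append(fullname)
    used_ids.update(ids)
    return ids


used = set()
print(anchors_for("example_python_function", "example-python-function", used))
# ['example-python-function', 'example_python_function']
```

With both IDs on the same node, a link to either fragment lands on the declaration, which is why existing external links would keep working for as long as the old ID is emitted.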
Thanks for the quick responses!
I thought so, but links from external sites seem to be not yet broken. However, it looks like they will be broken at some point in the future, according to this comment: sphinx/sphinx/domains/python.py Lines 367 to 370 in f85b870
Whether that happens sooner or later, it will be quite bad. But that's not actually the situation where I found the problem. The ID change is breaking my Sphinx extension This causes a build warning (and a missing link) when running Sphinx. I could of course change the implementation of And it would break them depending on which Sphinx version they are using, wouldn't that be horrible?
How does changing
Because underscores are not allowed? I think it would be worth violating the specification for that. Also, having dots and underscores in link to API docs just looks so much more sensible! Here's an example: https://sfs-python.readthedocs.io/en/0.5.0/sfs.fd.source.html#sfs.fd.source.point_velocity The link contains the correct Python function name: When the ID is changed, this becomes:
I don't understand. Are And what does that have to do with changing underscores to dashes? |
No, this change does not only replace underscores with hyphens. The new ID generator produces a node_id in the following steps: 1) Generate a node_id from the given string. 2) If the generated one is already used in the document, generate a node_id from a sequence number like It means the node_id cannot be guessed from its name. Indeed, it would work fine most of the time if we used hyphens and dots for the ID generation. But it would sometimes break.
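The two steps can be sketched as follows. This is only an illustration under stated assumptions: `make_node_id` is a hypothetical function, the normalization is a simplified stand-in for the docutils rule, and the `id<N>` fallback format is assumed for the example rather than taken from the thread.

```python
import re


def make_node_id(name, used_ids):
    """Two-step ID generation sketch (hypothetical, not the Sphinx code)."""
    # Step 1: derive an ID from the given string (simplified normalization:
    # lowercase, runs of other characters collapsed to a hyphen).
    candidate = re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")
    if candidate and candidate not in used_ids:
        node_id = candidate
    else:
        # Step 2: the derived ID is unusable or already taken, so fall back
        # to a sequence number (the "id<N>" format is an assumption).
        serial = 0
        while f"id{serial}" in used_ids:
            serial += 1
        node_id = f"id{serial}"
    used_ids.add(node_id)
    return node_id


used = set()
print(make_node_id("point_velocity", used))  # point-velocity
print(make_node_id("point.velocity", used))  # id0
```

The second call shows why a hand-built URL can miss: two different names normalize to the same string, so the later one gets an ID that has nothing to do with its name.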
Yes, docutils' spec does not allow hyphens and dots in a node_id. I know the current rule for node_id generation is not so wrong. But it surely has problems. Have you ever tried using invalid characters in the signature? How about multibyte characters? For example, this is attack code for the ID generator:
I know this is a very mean example and not directly related to the hyphen problem. But our code and docutils do not expect malicious characters to be passed as a node_id. I suppose dots and hyphens may not harm our code. But we would need to investigate all of our code to prove that it is safe.
Indeed, As you know, it is not related to hyphens problem. It also conflicts with the hyperlinks which human builds manually. It's no longer guess-able. If we'll keep using dots and hyphens for node_id, the cross-reference feature is needed to create references for nbsphinx, I think. |
OK, that sounds great. So what about doing that, but also allow underscores (
OK, I understand that it might be problematic to allow arbitrary characters/code points. But what about just adding
I'm not sure if that's really a problematic case, because the attack would have to come from the document content itself. I'm not a security specialist, so I'm probably wrong. Anyway, I'm not suggesting to allow arbitrary characters.
I can see that those are not the same name. What IDs are those supposed to get? IMHO it would make perfect sense to give them the IDs
I don't really understand any of this, but would it make a difference if underscores (
I don't understand. I do understand that IDs should be unique per HTML page, and I don't mind if the second (and third etc.) duplicate is re-written to |
Surely, I don't think
So far, the node_id of a Python object has been the same as its name. Starting with Sphinx 3.0, this will change. The new implementation is almost the same as what other domains do for cross-references. To realize the new cross-reference feature, we use a "reference name" and location info. A reference name equals the name of the object. For example, On building a document, the python domain goes through the following steps:
This means generating URLs manually is not recommended. The node_id cannot be guessed because it is sometimes auto-generated (ex. Note: the docutils spec says a node_id should start with a letter |
It's a tricky issue, but I think it would be good to be a bit more permissive on the IDs, and ignore the docutils spec a bit as it is not enforced anyway. My reasoning behind the ID generation for, primarily the C++ domain, but also the C domain:
So
I can accept adding
Unfortunately, we still support HTML4 technology... In particular, the HTMLHelp builder depends on it. I don't know how many users use it, but bug reports are filed from time to time, so it seems to be in use. So I hesitate to use
Unfortunately, I don't know. To be exact, I've never seen the CSS class converted from node ID. |
In the development of 3.0, Sphinx started to obey the "Identifier Normalization" rule of docutils. This change extends it to allow dots (".") and underscores ("_") in node identifiers. It lets Sphinx generate a node identifier from the source string as faithfully as possible (because dots and underscores are commonly used in many programming languages). This change avoids breaking hyperlinks as far as possible.
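To make the effect of that commit concrete, the sketch below compares a strict normalization with a relaxed one that also keeps dots and underscores. Both functions are simplified, hypothetical approximations written for this illustration; the real docutils and Sphinx rules handle more cases (for example leading digits and non-ASCII characters). The sample name is taken from the link quoted earlier in the thread.

```python
import re


def normalize_strict(name):
    """Approximate docutils-style normalization (simplified assumption)."""
    # Lowercase, then collapse every run of non-alphanumerics to one hyphen.
    return re.sub(r"[^a-z0-9]+", "-", name.lower()).strip("-")


def normalize_relaxed(name):
    """Same idea, but dots and underscores are kept as they are."""
    return re.sub(r"[^a-z0-9._]+", "-", name.lower()).strip("-")


name = "sfs.fd.source.point_velocity"
print(normalize_strict(name))   # sfs-fd-source-point-velocity
print(normalize_relaxed(name))  # sfs.fd.source.point_velocity
```

For an all-lowercase dotted name like this one, the relaxed rule returns the name unchanged, so the anchor matches what earlier Sphinx versions produced.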
I made a PR #7356 to allow |
I have now merged #7356 for the beta release, but reopened this issue to continue the discussion. Please check it and give me comments. I'd like to refine it before 3.0 final (if needed).
Thanks! As a last thing I believe it is ok to allow capital letters as well, which would make it much easier to guarantee uniqueness. Maybe I just missed the docutils rationale for converting to lowercase, but I haven't seen any, and I don't know of any output format where IDs are not case sensitive. |
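The uniqueness point can be checked with a small sketch. The helper `collisions` and the sample names are hypothetical and chosen only for illustration; the code groups names by their normalized form and reports every group with more than one member.

```python
from collections import defaultdict


def collisions(names, normalize):
    """Group names by normalized ID and keep only groups that clash."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize(name)].append(name)
    return {key: group for key, group in groups.items() if len(group) > 1}


names = ["Point", "point", "POINT", "line"]  # hypothetical object names
print(collisions(names, str.lower))      # {'point': ['Point', 'point', 'POINT']}
print(collisions(names, lambda n: n))    # {}
```

When case is folded, three distinct names compete for one ID and two of them need a fallback; when case is preserved, each name keeps its own ID.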
Thanks @tk0miya, you are the best! I agree with @jakobandersen that keeping capital letters would be great. For example, right now both these links work:
... but only the first one creates the yellow highlighting. The corresponding And the Now the last missing piece would be to remove the lower-case link target and replace it with the correct case.
Yes, please!
I guess we'll have to draw the line somewhere. Python allows many Unicode characters in identifiers, but AFAICT, in practice most people still use only ASCII letters (lower and upper case) and numbers. And, very importantly, underscores. And the dot ( I don't care about any other latin-1 or Unicode characters. |
@jakobandersen @mgeier Thank you for the comments. I agree to allow capital characters as well. After applying #7374, we can use node_ids that match
Does this rule make sense? |
Yes, looks good to me. Thanks! |
Fix #7301: capital characters are not allowed for node_id
Done. Closing. |
Describe the bug
Previously, anchors for Python functions used underscores; #7236 changed this to dashes.
To Reproduce
Document some Python function whose name contains underscores:
Expected behavior
This used to create a fragment identifier `#example_python_function`, but since #7236 this creates `#example-python-function`.
Your project
This breaks links to python functions when used with `nbsphinx`: https://nbsphinx.readthedocs.io/en/0.5.1/markdown-cells.html#Links-to-Domain-Objects
Apart from that, all links (containing underscores) from external sites to Python API docs created by Sphinx (which I guess are a lot) will break!