♻️ REFACTOR: Parsing logic of Markdown links (#467)
This commit aims to clarify and improve the parsing of Markdown links, i.e. `[Text](link)`,
with the full logic described in `docs/syntax/syntax.md#markdown-links-and-referencing`.

1. The new `myst_all_links_external` configuration option allows all "smart" resolving to be turned off, so that all links are treated as external links.
2. Links containing a `#` are no longer automatically treated as external links.
3. For docutils-only, internal links are properly handled (rather than just returning a warning).
4. For sphinx, internal links now also match relative file paths that are not source files, and turn them into `download` references.
5. For sphinx, internal references can also be constrained to specific domains, using the new `myst_ref_domains` configuration option.
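
The external/internal decision described in points 1 and 2 can be sketched in isolation. This is a minimal illustration, not the code from this commit: the function name `classify_link` and the constant `DEFAULT_SCHEMES` are hypothetical, and `<...>` autolinks (which the renderer handles first) are not modelled.

```python
from typing import Iterable, Optional
from urllib.parse import urlparse

# Assumed to mirror the default value of `myst_url_schemes`.
DEFAULT_SCHEMES = ("http", "https", "mailto", "ftp")


def classify_link(
    href: str,
    all_links_external: bool = False,
    url_schemes: Optional[Iterable[str]] = DEFAULT_SCHEMES,
) -> str:
    """Return "external" or "internal" for a Markdown link destination."""
    # `myst_all_links_external = True` switches off all "smart" resolving.
    if all_links_external:
        return "external"
    scheme = urlparse(href).scheme
    # `None` means: any destination that has a scheme is external.
    if url_schemes is None:
        return "external" if scheme else "internal"
    # Otherwise only the configured schemes count as external.
    return "external" if scheme in url_schemes else "internal"
```

Under these assumptions a destination such as `syntax.md#markdown-links-and-referencing` has no scheme, so it is classified as internal even though it contains a `#`.
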
chrisjsewell committed Dec 28, 2021
1 parent 30c44d5 commit 90c98aa
Showing 23 changed files with 278 additions and 111 deletions.
2 changes: 1 addition & 1 deletion docs/api/reference.rst
@@ -36,7 +36,7 @@ Sphinx

.. autoclass:: myst_parser.sphinx_renderer.SphinxRenderer
:special-members: __output__
:members: handle_cross_reference, render_math_block_label
:members: render_internal_link, render_math_block_label
:undoc-members:
:member-order: alphabetical
:show-inheritance:
6 changes: 6 additions & 0 deletions docs/sphinx/reference.md
@@ -19,10 +19,16 @@ To do so, use the keywords beginning `myst_`.
* - `myst_enable_extensions`
- `["dollarmath"]`
- Enable Markdown extensions, [see here](../syntax/optional.md) for details.
* - `myst_all_links_external`
- `False`
- If `True`, all Markdown links `[text](link)` are treated as external.
* - `myst_url_schemes`
- `("http", "https", "mailto", "ftp")`
- [URI schemes](https://en.wikipedia.org/wiki/List_of_URI_schemes) that will be recognised as external URLs in `[](scheme:loc)` syntax, or set `None` to recognise all.
Other links will be resolved as internal cross-references.
* - `myst_ref_domains`
- `None`
- If a list, then only these [sphinx domains](sphinx:domain) will be searched when resolving Markdown links like `[text](reference)`.
* - `myst_linkify_fuzzy_links`
- `True`
- If `False`, only links that contain a scheme (such as `http`) will be recognised as external links.
1 change: 1 addition & 0 deletions docs/syntax/example.txt
@@ -0,0 +1 @@
Hallo!
2 changes: 1 addition & 1 deletion docs/syntax/reference.md
@@ -242,7 +242,7 @@ In addition to these summaries of inline syntax, see {ref}`extra-markdown-syntax
![alt](src "title")
```
* - Link
- Reference `LinkDefinitions`
- Reference `LinkDefinitions`. See {ref}`syntax/referencing` for more details.
- ```md
[text](target "title") or [text][key]
```
33 changes: 33 additions & 0 deletions docs/syntax/syntax.md
@@ -518,6 +518,39 @@ Is below, but it won't be parsed into the document.

+++

(syntax/referencing)=

## Markdown Links and Referencing

Markdown links are of the form: `[text](link)`.

If you set the configuration `myst_all_links_external = True` (`False` by default),
then all links will be treated simply as "external" links.
For example, in HTML outputs, `[text](link)` will be rendered as `<a href="link">text</a>`.

Otherwise, links will only be treated as "external" links if they are prefixed with a scheme,
configured with `myst_url_schemes` (by default, `http`, `https`, `ftp`, or `mailto`).
For example, `[example.com](https://example.com)` becomes [example.com](https://example.com).

:::{note}
The `text` will be parsed as nested Markdown, for example `[here's some *emphasised text*](https://example.com)` will be parsed as [here's some *emphasised text*](https://example.com).
:::

For "internal" links, myst-parser in Sphinx will attempt to resolve the reference to either a relative document path, or a cross-reference to a target (see [](syntax/targets)):

- `[this doc](syntax.md)` will link to a rendered source document: [this doc](syntax.md)
  - This is similar to `` {doc}`this doc <syntax>` ``; {doc}`this doc <syntax>`, but allows for document extensions, and parses nested Markdown text.
- `[example text](example.txt)` will link to a non-source (downloadable) file: [example text](example.txt)
  - The linked document itself will be copied to the build directory.
  - This is similar to `` {download}`example text <example.txt>` ``; {download}`example text <example.txt>`, but parses nested Markdown text.
- `[reference](syntax/referencing)` will link to an internal cross-reference: [reference](syntax/referencing)
  - This is similar to `` {any}`reference <syntax/referencing>` ``; {any}`reference <syntax/referencing>`, but parses nested Markdown text.
  - You can limit the scope of the cross-reference to specific [sphinx domains](sphinx:domain), by using the `myst_ref_domains` configuration.
    For example, `myst_ref_domains = ("std", "py")` will only allow cross-references to `std` and `py` domains.

Additionally, internal links to document headers can be used, but only if [](syntax/header-anchors) are enabled.
For example `[a header](syntax.md#markdown-links-and-referencing)` will link to a header anchor: [a header](syntax.md#markdown-links-and-referencing).
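
As an illustrative sketch, the options discussed in this section could be combined in a Sphinx `conf.py` as follows. The values are examples only and should be adapted to your project; `myst_heading_anchors = 2` is an assumed setting used here to enable header anchors.

```python
# conf.py (illustrative values only)
extensions = ["myst_parser"]

# Keep "smart" link resolution enabled (the default).
myst_all_links_external = False

# Schemes treated as external links; these mirror the defaults listed above.
myst_url_schemes = ("http", "https", "ftp", "mailto")

# Only search the `std` and `py` domains when resolving [text](reference).
myst_ref_domains = ("std", "py")

# Enable header anchors (assumed depth of 2) so that links such as
# [a header](syntax.md#markdown-links-and-referencing) can resolve.
myst_heading_anchors = 2
```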

(syntax/targets)=

## Targets and Cross-Referencing
1 change: 1 addition & 0 deletions myst_parser/__init__.py
@@ -35,6 +35,7 @@ def setup_sphinx(app: "Sphinx"):

for name, default in MdParserConfig().as_dict().items():
if not name == "renderer":
# TODO add types?
app.add_config_value(f"myst_{name}", default, "env")

app.connect("builder-inited", create_myst_config)
4 changes: 2 additions & 2 deletions myst_parser/docutils_.py
@@ -58,9 +58,9 @@ def __repr__(self):
"substitutions",
# we can't add substitutions so not needed
"sub_delimiters",
# heading anchors are currently sphinx only
# sphinx only options
"heading_anchors",
# sphinx.ext.mathjax only options
"ref_domains",
"update_mathjax",
"mathjax_classes",
# We don't want to set the renderer from docutils.conf
87 changes: 53 additions & 34 deletions myst_parser/docutils_renderer.py
@@ -19,6 +19,7 @@
Union,
cast,
)
from urllib.parse import urlparse

import jinja2
import yaml
@@ -526,51 +527,68 @@ def render_heading(self, token: SyntaxTreeNode) -> None:
self.current_node = section

def render_link(self, token: SyntaxTreeNode) -> None:
"""Parse `<http://link.com>` or `[text](link "title")` syntax to docutils AST:
- If `<>` autolink, forward to `render_autolink`
- If `myst_all_links_external` is True, forward to `render_external_url`
- If link is an external URL, forward to `render_external_url`
- External URLs start with a scheme (e.g. `http:`) in `myst_url_schemes`,
or any scheme if `myst_url_schemes` is None.
- Otherwise, forward to `render_internal_link`
"""
if token.markup == "autolink":
return self.render_autolink(token)

if self.config.get("myst_all_links_external", False):
return self.render_external_url(token)

# Check for external URL
url_scheme = urlparse(cast(str, token.attrGet("href") or "")).scheme
allowed_url_schemes = self.config.get("myst_url_schemes", None)
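# guard the membership test: `url_scheme in None` would raise a TypeError
# when all schemes are allowed but the link has no scheme
if (allowed_url_schemes is None and url_scheme) or (
allowed_url_schemes is not None and url_scheme in allowed_url_schemes
):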
return self.render_external_url(token)

return self.render_internal_link(token)

def render_external_url(self, token: SyntaxTreeNode) -> None:
"""Render link token `[text](link "title")`,
where the link has been identified as an external URL::
<reference refuri="link" title="title">
text
`text` can contain nested syntax, e.g. `[**bold**](url "title")`.
"""
ref_node = nodes.reference()
self.add_line_and_source_path(ref_node, token)
destination = cast(str, token.attrGet("href") or "")
ref_node["refuri"] = cast(str, token.attrGet("href") or "")
title = token.attrGet("title")
if title:
ref_node["title"] = title
with self.current_node_context(ref_node, append=True):
self.render_children(token)

if self.config.get(
"relative-docs", None
) is not None and destination.startswith(self.config["relative-docs"][0]):
# make the path relative to an "including" document
source_dir, include_dir = self.config["relative-docs"][1:]
destination = os.path.relpath(
os.path.join(include_dir, os.path.normpath(destination)), source_dir
)
def render_internal_link(self, token: SyntaxTreeNode) -> None:
"""Render link token `[text](link "title")`,
where the link has not been identified as an external URL::
<reference refname="link" title="title">
text
ref_node["refuri"] = destination
`text` can contain nested syntax, e.g. `[**bold**](link "title")`.
Note, this is overridden by `SphinxRenderer`, to use `pending_xref` nodes.
"""
ref_node = nodes.reference()
self.add_line_and_source_path(ref_node, token)
ref_node["refname"] = cast(str, token.attrGet("href") or "")
title = token.attrGet("title")
if title:
ref_node["title"] = title
next_node = ref_node

# TODO currently any reference with a fragment # is deemed external
# (if anchors are not enabled)
# This comes from recommonmark, but I am not sure of the rationale for it
if is_external_url(
destination,
self.config.get("myst_url_schemes", None),
"heading_anchors" not in self.config.get("myst_extensions", []),
):
self.current_node.append(next_node)
with self.current_node_context(ref_node):
self.render_children(token)
else:
self.handle_cross_reference(token, destination)

def handle_cross_reference(self, token: SyntaxTreeNode, destination: str) -> None:
if not self.config.get("ignore_missing_refs", False):
self.create_warning(
f"Reference not found: {destination}",
line=token_line(token),
subtype="ref",
append_to=self.current_node,
)
with self.current_node_context(ref_node, append=True):
self.render_children(token)

def render_autolink(self, token: SyntaxTreeNode) -> None:
refuri = target = escapeHtml(token.attrGet("href") or "") # type: ignore[arg-type]
@@ -594,6 +612,7 @@ def render_image(self, token: SyntaxTreeNode) -> None:
destination, None, True
):
# make the path relative to an "including" document
# this is set when using the `relative-images` option of the MyST `include` directive
destination = os.path.normpath(
os.path.join(
self.config.get("relative-images", ""),
15 changes: 14 additions & 1 deletion myst_parser/main.py
@@ -115,11 +115,23 @@ def check_extensions(self, attribute, value):
metadata={"help": "Disable syntax elements"},
)

all_links_external: bool = attr.ib(
default=False,
validator=instance_of(bool),
metadata={"help": "Parse all links as simple hyperlinks"},
)

# see https://en.wikipedia.org/wiki/List_of_URI_schemes
url_schemes: Optional[Iterable[str]] = attr.ib(
default=cast(Optional[Iterable[str]], ("http", "https", "mailto", "ftp")),
validator=optional(deep_iterable(instance_of(str), instance_of((list, tuple)))),
metadata={"help": "URL schemes to allow in links"},
metadata={"help": "URL scheme prefixes identified as external links"},
)

ref_domains: Optional[Iterable[str]] = attr.ib(
default=None,
validator=optional(deep_iterable(instance_of(str), instance_of((list, tuple)))),
metadata={"help": "Sphinx domain names to search in for references"},
)

heading_anchors: Optional[int] = attr.ib(
@@ -273,6 +285,7 @@ def default_parser(config: MdParserConfig) -> MarkdownIt:
list(config.enable_extensions)
+ (["heading_anchors"] if config.heading_anchors is not None else [])
),
"myst_all_links_external": config.all_links_external,
"myst_url_schemes": config.url_schemes,
"myst_substitutions": config.substitutions,
"myst_html_meta": config.html_meta,
66 changes: 38 additions & 28 deletions myst_parser/myst_refs.py
@@ -42,7 +42,6 @@ def run(self, **kwargs: Any) -> None:
contnode = cast(nodes.TextElement, node[0].deepcopy())
newnode = None

typ = node["reftype"]
target = node["reftarget"]
refdoc = node.get("refdoc", self.env.docname)
domain = None
@@ -54,23 +53,29 @@ def run(self, **kwargs: Any) -> None:
# but first we change the reftype to 'any'
# this means it is picked up by extensions like intersphinx
node["reftype"] = "any"
newnode = self.app.emit_firstresult(
"missing-reference",
self.env,
node,
contnode,
**(
{"allowed_exceptions": (NoUri,)}
if version_info[0] > 2
else {}
),
)
node["reftype"] = "myst"
try:
newnode = self.app.emit_firstresult(
"missing-reference",
self.env,
node,
contnode,
**(
{"allowed_exceptions": (NoUri,)}
if version_info[0] > 2
else {}
),
)
finally:
node["reftype"] = "myst"
# still not found? warn if node wishes to be warned about or
# we are in nit-picky mode
if newnode is None:
node["refdomain"] = ""
self.warn_missing_reference(refdoc, typ, target, node, domain)
# TODO ideally we would override the warning message here,
# to show the [ref.myst] for suppressing warning
self.warn_missing_reference(
refdoc, node["reftype"], target, node, domain
)
except NoUri:
newnode = contnode

@@ -109,25 +114,30 @@ def resolve_myst_ref(
if res:
results.append(("std:doc", res))

# get allowed domains for referencing
ref_domains = self.env.config.myst_ref_domains

# next resolve for any other standard reference objects
stddomain = cast(StandardDomain, self.env.get_domain("std"))
for objtype in stddomain.object_types:
key = (objtype, target)
if objtype == "term":
key = (objtype, target.lower())
if key in stddomain.objects:
docname, labelid = stddomain.objects[key]
domain_role = "std:" + stddomain.role_for_objtype(objtype)
ref_node = make_refnode(
self.app.builder, refdoc, docname, labelid, contnode
)
results.append((domain_role, ref_node))
if ref_domains is None or "std" in ref_domains:
stddomain = cast(StandardDomain, self.env.get_domain("std"))
for objtype in stddomain.object_types:
key = (objtype, target)
if objtype == "term":
key = (objtype, target.lower())
if key in stddomain.objects:
docname, labelid = stddomain.objects[key]
domain_role = "std:" + stddomain.role_for_objtype(objtype)
ref_node = make_refnode(
self.app.builder, refdoc, docname, labelid, contnode
)
results.append((domain_role, ref_node))

# finally resolve for any other type of reference
# TODO do we want to restrict this at all?
# finally resolve for any other type of allowed reference domain
for domain in self.env.domains.values():
if domain.name == "std":
continue # we did this one already
if ref_domains is not None and domain.name not in ref_domains:
continue
try:
results.extend(
domain.resolve_any_xref(
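
The domain filtering added to `resolve_myst_ref` can be sketched on its own. This is a simplified model under stated assumptions, not the code from this commit: the helper name `domains_to_search` is hypothetical, domains are represented by their names only, and the earlier document-path matching is not modelled.

```python
from typing import Iterable, List, Optional


def domains_to_search(
    available: Iterable[str], ref_domains: Optional[Iterable[str]] = None
) -> List[str]:
    """Return the domain names to search, `std` first, honouring `ref_domains`."""
    # `None` means no restriction: every available domain is searched.
    allowed = None if ref_domains is None else set(ref_domains)
    names = list(available)
    # The standard domain is resolved first, then all remaining domains.
    ordered = [n for n in names if n == "std"] + [n for n in names if n != "std"]
    return [n for n in ordered if allowed is None or n in allowed]
```

For example, `domains_to_search(["py", "std", "c"], ("std", "py"))` skips the `c` domain, which corresponds to setting `myst_ref_domains = ("std", "py")`.
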
