Allow auto-linkification of non-standard schemas without calling `mdurl.decode` #183

black-puppydog · 2022-01-05T14:05:07Z

Description / Summary

I propose to allow the unmodified handling of link text during auto-linkification.
Think something like this:

md.linkify.add("%", {"validate": message_regex, "normalize": normalize_message_sigil})

def normalize_message_sigil(obj, match):
  old_url = match.url
  match.url = urllib.parse.quote(old_url, safe="")
  match.text = f"&percnt;{old_url[1:6]}..."
  match.safe_decode = False  # means "don't touch match.text, render it as is in the HTML"

Value / benefit

I launched this as a discussion before but after thinking about it a little more I don't see a workaround for this.

I'm trying to implement some custom extensions to markdown for the scuttlebutt markdown flavour, as implemented in ssb-markdown, the JS implementation, which relies on markdown-it. Hence it makes sense to me to do the re-implementation using markdown-it-py. 🙂

One of the key features of ssb is that messages are referenced by ids like this: %9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256
(feeds have an @ identifier, and blobs a & so they may have similar issues, but let's talk about message ids only for the sake of this discussion)

Anyway, so these message ids should be linked to urls like this:

 <a href="#/msg/%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">
  %9eJYI...
 </a>

Note that the link text is an abbreviated version of the full id, but still begins with a % sigil.
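To make the target concrete, here is a minimal sketch of how the href and the abbreviated link text above can be derived from a message id with the standard library alone. The helper names `message_href` and `message_label` are made up for this illustration and are not part of markdown-it-py or ssb-markdown.

```python
from urllib.parse import quote

MSG_ID = "%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256"


def message_href(msg_id: str) -> str:
    # Hypothetical helper: percent-encode the whole id, including the
    # leading "%" sigil, "/" and "=", then prefix the "#/msg/" route.
    return "#/msg/" + quote(msg_id, safe="")


def message_label(msg_id: str) -> str:
    # Hypothetical helper: keep the sigil plus the first five characters.
    return msg_id[:6] + "..."


print(message_href(MSG_ID))
print(message_label(MSG_ID))
```

With `safe=""`, `quote` also encodes `/`, which is what turns the id into a single path segment.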

So I have this regex to match message IDs:

MESSAGE_SIGIL_REGEX = r'[a-zA-Z0-9+/=]{44}\.sha256'
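As a quick sanity check of that pattern, the sketch below applies it by hand to the text that follows a `%` sigil. It assumes (as I understand linkify-it's schema validators to work) that the validator only sees the text after the schema prefix; the `^` anchor and the variable names are choices made for this illustration.

```python
import re

MESSAGE_SIGIL_REGEX = r'[a-zA-Z0-9+/=]{44}\.sha256'
# Anchor at the start, since the tail begins right after the "%" sigil.
message_regex = re.compile(f"^{MESSAGE_SIGIL_REGEX}")

text = "Hey check out %9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256 it's epic"

sigil_pos = text.index("%")      # position of the schema prefix
tail = text[sigil_pos + 1:]      # what the validator would be given (assumption)
m = message_regex.match(tail)

matched_id = "%" + m.group(0)    # full id including the sigil
print(sigil_pos, sigil_pos + len(matched_id), matched_id)
```

The 44 base64 characters plus `.sha256` plus the sigil give a 52-character match.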

To automatically linkify these ids I set the % character up as a schema:

# this is in the main rendering method
md.linkify.add("%", {"validate": message_regex, "normalize": normalize_message_sigil})

def normalize_message_sigil(obj, match):
  old_url = match.url
  match.url = urllib.parse.quote(old_url, safe="")
  match.text = f"{old_url[:6]}..."

The problem I have with this is that once matched by linkify, the link text that results is actually interpreted as a url-encoded string, i.e. the %9e gets decoded to a (non-displayable) character.
The resulting link isn't exactly what I had hoped for:

  <a href="#/msg/%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">
   �JYI...
  </a>

I've stepped through this a while now and I haven't figured out yet whether this is a bug or just me holding this wrong...
The resulting text gets put through state.md.normalizeLinkText here:

markdown-it-py/markdown_it/rules_core/linkify.py

Line 104 in b1a74b4

urlText = state.md.normalizeLinkText(urlText)

That function in turn passes the whole thing through mdurl.decode(mdurl.format(parsed), mdurl.DECODE_DEFAULT_CHARS + "%"):

markdown-it-py/markdown_it/common/normalize_url.py

Line 63 in bb6cf6e

return mdurl.decode(mdurl.format(parsed), mdurl.DECODE_DEFAULT_CHARS + "%")

And I don't see any way to prevent it from doing so...

But I thought I could just try to replace the % with %25 and let mdurl.decode replace it back to %. Alas, if I try that, it indeed produces %259eJYI... as the output. Not what I wanted...
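The two outcomes described so far can be reproduced with a small stand-in for the decoding step. This is only a sketch: `decode_except` is a hypothetical, simplified substitute for `mdurl.decode` built on `urllib.parse.unquote`, it handles one escape at a time, and the literal value used for the default exclude characters is an assumption.

```python
import re
from urllib.parse import unquote

# Assumed value of mdurl.DECODE_DEFAULT_CHARS; normalizeLinkText adds "%".
DEFAULT_CHARS = ";/?:@&=+$,#"


def decode_except(text: str, exclude: str) -> str:
    """Hypothetical stand-in for mdurl.decode (single-byte escapes only)."""
    def replace(match: re.Match) -> str:
        escape = match.group(0)
        decoded = unquote(escape)  # invalid UTF-8 bytes become U+FFFD
        # Escapes that decode to an excluded character are left untouched.
        return escape if decoded in exclude else decoded

    return re.sub(r"%[0-9A-Fa-f]{2}", replace, text)


# "%9e" looks like an escape for byte 0x9E, which is not valid UTF-8.
print(repr(decode_except("%9eJYI...", DEFAULT_CHARS + "%")))
# "%25" decodes to "%", which is excluded, so it stays encoded.
print(repr(decode_except("%259eJYI...", DEFAULT_CHARS + "%")))
```

In this simplified model, leaving `"%"` out of the exclude set lets `%25` decode back to a literal `%`, so `"%259eJYI..."` would come out as `"%9eJYI..."`.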

Now, I realize I could just generate the text to escape the % into something like &percnt;, but then the & sign itself gets escaped to &amp;, so the rendered text reads &percnt;9eJYI... literally, which is also not quite what I want...

So... is this an issue of usage? Is there something obvious I'm missing?

Implementation details

As I said in the beginning, I think this would best be signalled while setting up the schema. But I'm not sure how to do this cleanly, since the matches themselves are actually directly added to a linkify instance, not a class of markdown-it-py.
So assigning the flag for "raw/pass-through" mode to the match instance seems a bit iffy...

Tasks to complete

No response


welcome · 2022-01-05T14:05:09Z

Thanks for opening your first issue here! Engagement like this is essential for open source projects! 🤗

If you haven't done so already, check out EBP's Code of Conduct. Also, please try to follow the issue template as it helps other community members to contribute more effectively.

If your issue is a feature request, others may react to it, to raise its prominence (see Feature Voting).

Welcome to the EBP community! 🎉

chrisjsewell · 2022-01-05T14:09:56Z

Heya, Perhaps @tsutsu3 (as the maintainer of linkify-it-py) and @hukkin would like to comment?

hukkin · 2022-01-05T14:19:37Z

There's a lot to take in here, but I'll start with a question: If there's a JS implementation, why not copy what it does? Are you trying to achieve something that the JS implementation does not do?

black-puppydog · 2022-01-05T20:51:12Z

Hey thanks folks for the quick replies!
Yeah, I did look at that, that's the reason I went with markdown-it-py 🙂

Thing is, they're doing pretty much what I (think I) am doing...
I'm looking at the JS code here where they call formatSigilText() which is simply this: return sigilText.replace(/^%/, '%25').slice(0, 8) + '...'
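For comparison, a direct Python port of that one-liner might look like the sketch below. The name `format_sigil_text` is just a snake_case rendering of the JS name; string slicing matches JS `slice(0, 8)` here because the ids are plain ASCII.

```python
import re


def format_sigil_text(sigil_text: str) -> str:
    # Replace a leading "%" with "%25", keep the first 8 characters,
    # then append the ellipsis, mirroring the JS expression above.
    return re.sub(r"^%", "%25", sigil_text)[:8] + "..."


print(format_sigil_text("%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256"))
```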

Inserting some console.log() calls in normalize() and then printing the final rendered text, I get this:

// the match object after modification
Match {
  schema: '%',
  index: 14,
  lastIndex: 66,
  raw: '%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256',
  text: '%259eJYI...',
  url: '%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256'
}

result = "<p>Hey check out <a href="#/msg/%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">%9eJYI...</a> to see my mad ssb skillz!</p>"

So this is my minimal example to reproduce this:

import json
import re
import urllib.parse
from markdown_it import MarkdownIt

MESSAGE_SIGIL_REGEX = r'[a-zA-Z0-9+/=]{44}\.sha256'
message_regex = re.compile(f"^{MESSAGE_SIGIL_REGEX}")


def normalize_message_sigil(obj, match):
  old_url = match.url
  match.url = urllib.parse.quote(old_url, safe="")
  match.text = f"%25{old_url[1:6]}..."
  print(json.dumps(match.__dict__, indent=2))
  print()


md = MarkdownIt("js-default", {
  "typographer": True,
  "linkify": True,
  "breaks": True,
})
md.linkify.add("%", {"validate": message_regex, "normalize": normalize_message_sigil})


markdown_str = "Hey check out %9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256 it's epic"
print(md.render(markdown_str))

This generates the same kind of match:

{
  "schema": "%",
  "index": 14,
  "last_index": 66,
  "raw": "%9eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd/E6CBCG5XY=.sha256",
  "text": "%259eJYI...",
  "url": "%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256"
}

But the result is this:

<p>Hey check out <a href="%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">%259eJYI...</a> it’s epic</p>

Whereas if I change the assignment to match.text = f"{old_url[0:6]}..." (so, no % encoding), then I get this:

<p>Hey check out <a href="%259eJYIT1HDNhWOeLK0EhhiHJTPwvDGZWGd%2FE6CBCG5XY%3D.sha256">�JYI...</a> it’s epic</p>

It works fine if I edit my local markdown_it to call mdurl.decode() without the extra %, so like this:

# in markdown-it-py/markdown_it/common/normalize_url.py
return mdurl.decode(mdurl.format(parsed))

But I understand that this was actually introduced for a reason, so not sure how to proceed here...

Sorry for the infodump... it's a bit late here, this is all free-time stuff for me 😆

black-puppydog added the enhancement New feature or request label Jan 5, 2022
