Invalid links returned with some chinese characters as delimiters #15

tommedema · 2015-05-26T19:44:10Z

Steps to reproduce:

linkify the following text

【视频奇志大兵《发烧友》在线观看 - 酷6视频】奇志大兵《发烧友》在线观看，奇志大兵搞笑双簧 _ 发烧友（追星族） http://t.cn/RZwjG7U（分享自 @酷6网）

the output link is

http://t.cn/RZwjG7U（分享自

whereas the output link should be

http://t.cn/RZwjG7U

The reason is that （ is not recognized as a separating delimiter, yet it is quite common in Chinese.

Out of 500 posts I gathered, about 20 to 30 of them had links like this, resulting in invalid links reported by linkify.

Note that I realize that these users are technically posting invalid URLs, but 20-30 out of 500 is very common and therefore there should be a way to deal with this. Any suggestion?

puzrin · 2015-05-26T19:48:17Z

Could you post a permalink to the demo at http://markdown-it.github.io/linkify-it/ ? You can enter all the examples there at once and see the results.

puzrin · 2015-05-26T19:54:48Z

Example

The problem is the ( used in your example. There is NO space before it. I'm not familiar with Asian punctuation. Is such a ( always a word terminator? Is it possible to have it inside links, as on a wiki?

tommedema · 2015-05-26T19:57:51Z

See http://markdown-it.github.io/linkify-it/#t1=http%3A%2F%2Ft.cn%2FRv4VRqQ%EF%BC%88%E5%88%86%E4%BA%AB%E8%87%AA%20%40%E7%88%B1%E5%A5%87%E8%89%BA%EF%BC%89%0A%0Ahttp%3A%2F%2Fhttp%3A%2F%2Ft.cn%2FR2t287H

I added another issue where a user mistakenly added http:// twice.

These are both (technically speaking) user behavior issues, but the first one especially is quite common.

Formally speaking, ( is not a word terminator. However, because the character already takes up so much white space, many Chinese writers don't put a space in front of it (it's a lazy habit, but quite common).

Perhaps I should do some string manipulation before I pass it to linkify-it?

puzrin · 2015-05-26T20:09:14Z

The example with the doubled http:// is technically correct. It's like http://localhost: any local domain is allowed. IMHO such typos do not need fixes.

The example with ( is more serious. It uses Chinese Unicode brackets, which can be distinguished from English brackets. But I need guidance on the grammar to be sure.

tommedema · 2015-05-26T20:25:08Z

I'm sorry, I'm not a Chinese speaker. Perhaps someone else can help out here. There are also other Chinese punctuation characters, like ，

puzrin · 2015-05-26T20:30:23Z

Let's leave it open until someone can formalize the rules for the Asian language group (Japanese probably has the same issues). I'm ready to fix this as soon as possible, but would like to avoid kludges and incomplete workarounds.

tommedema · 2015-05-26T20:31:58Z

More examples of how difficult these Chinese posts are to parse: link

What's your idea on this?

puzrin · 2015-05-26T20:33:03Z

It seems you gave the wrong link; it shows the default text.

tommedema · 2015-05-26T20:38:14Z

Sorry, I've fixed the link now.

puzrin · 2015-05-26T20:49:21Z

Thanks for the examples. I posted a question on the CommonMark forum: http://talk.commonmark.org/t/linkifier-lets-discuss-and-test/1045/9?u=vitaly

As far as I can see in the last example, spaces are not used at all. It's possible to track the locale change (non-English -> English), but that's not safe.

tommedema · 2015-05-26T20:50:39Z

That's correct, spaces are not used. Chinese writers use different punctuation marks, or none at all, because URLs are in English script and therefore the distinction is easily visible to the eye, but more difficult for machines.

puzrin · 2015-05-26T21:04:06Z

What about links that contain Chinese characters? I can find the start of a link somehow with [ any non-English ]http://, but I can't use this rule to find the end of the link.

PS. Anyway, it's worth starting to collect test samples for Chinese separately. Something like this
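The link-start rule could be sketched roughly as follows. This is only an illustration of the idea, not code from linkify-it: `LINK_START` and `findLinkStarts` are hypothetical names, and "non-English" is approximated here as "any non-ASCII character". As noted above, the same trick says nothing about where the link ends.

```javascript
// Matches "http://" or "https://" at the very start of the text or directly
// after a non-ASCII character (a rough stand-in for "any non-English" character).
const LINK_START = /(^|[^\x00-\x7f])(https?:\/\/)/g;

// Returns the offsets at which a scheme begins under that rule.
function findLinkStarts(text) {
  const starts = [];
  for (const m of text.matchAll(LINK_START)) {
    // m[1] is either empty (start of text) or the single preceding character.
    starts.push(m.index + m[1].length);
  }
  return starts;
}

const starts = findLinkStarts('发烧友http://t.cn/RZwjG7U（分享自）');
// starts holds the offset of the "h" in "http://", which directly follows "友"
```

A scheme preceded by an ASCII character such as a space is deliberately not reported by this sketch, since that case is the one the ordinary word-boundary rules already cover.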

tommedema · 2015-05-26T21:12:16Z

You're absolutely right. I think the first rule makes sense, but URLs can probably contain Chinese characters. Perhaps the first rule by itself would help though, along with some kind of smart processing for certain delimiters that are unlikely to appear within a URL, e.g.:

str.replace(/[\uff08\uff09\uff0c\uff01\u3002\uff1f\u3010\u3011\uff3b\uff3d\u3001\u300a\u300b\u2605]/g, ' ') //（），！。？【】［］、《》 ★
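A minimal sketch of that pre-processing step, assuming the text is cleaned before it is handed to the linkifier. The names `CJK_DELIMITERS` and `blankCjkDelimiters` are made up for illustration, and the character class is just the list from the snippet above, not a complete inventory of CJK punctuation.

```javascript
// Fullwidth/CJK punctuation that is unlikely to appear inside a URL
// (same list as in the snippet above): （），！。？【】［］、《》★
const CJK_DELIMITERS =
  /[\uff08\uff09\uff0c\uff01\u3002\uff1f\u3010\u3011\uff3b\uff3d\u3001\u300a\u300b\u2605]/g;

// Replace each delimiter with one ASCII space. Every character in the class is
// a single UTF-16 code unit, so the result has the same length as the input and
// offsets found in the cleaned text also point at the right place in the original.
function blankCjkDelimiters(text) {
  return text.replace(CJK_DELIMITERS, ' ');
}

const original = 'http://t.cn/RZwjG7U（分享自 @酷6网）';
const cleaned = blankCjkDelimiters(original);
// The URL is now followed by a plain space instead of "（".
```

Because the lengths match, one option is to run the link search on `cleaned` and then slice the matched ranges out of `original`, so the displayed text keeps its punctuation. The trade-off is that a URL which legitimately contains one of these characters would be cut short.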

LeonLiuY · 2015-06-09T02:40:07Z

I found that this URL doesn't match:

http://markdown-it.github.io/linkify-it/#t1=http%3A%2F%2F172.26.142.48%2Fviewerjs%2F%23..%2F0529%2Fslides.pdf

BTW, I found a website comparing different regexps for URLs:
https://mathiasbynens.be/demo/url-regex

tommedema · 2015-06-30T11:12:57Z

By the way, Facebook does this properly, but I'm not sure if they use a proprietary solution.

fengmk2 · 2016-07-26T14:52:22Z

I will try to resolve this problem.

puzrin · 2016-07-26T15:03:29Z

@fengmk2 The implementation may not be easy, but as a first step it would be enough to have a collection of fixtures with good coverage of Chinese edge cases.

See https://github.com/markdown-it/linkify-it/tree/master/test/fixtures

mikelambert · 2016-12-31T20:57:46Z

Here is my example page linkifying incorrectly:
http://www.dancedeets.com/events/896736620379772/2016-r16-taiwan-x-wbc

In this case, it's a vertical bar, though to be precise it's ｜ and not | (I hope github shows the difference properly). The page author uses it as a delimiter, but linkify sees it as part of the domain name (presumably it sees it as just another word character).

puzrin · 2017-01-01T09:15:48Z

@mikelambert Nothing to fix. Fuzzy mode is not safe; it's the author's mistake to use linkify-it in the wrong way.

mikelambert · 2017-01-01T20:23:14Z

I'm the website author using linkify-it to add links to raw text from a variety of sources which I didn't write myself (facebook events and other websites). And the individual source authors are not using linkify-it, and of course not writing their text with linkify-it in mind.

I recognize that fuzzy is not safe, and not perfect. However, it seems unfortunate that a vertical bar (what some authors are using as visual punctuation) is treated as part of a domain name. It seemed like the fuzziness could be smarter, even if it's still fuzzy.

So to clarify, is this a "it's not a bug, just user error" bug, or a "it's not a bug I care about fixing, but patches are welcome" bug?

puzrin · 2017-01-03T04:10:42Z

@mikelambert I mean that the problem with | in that example arises because the linkifier is applied after layout compose, instead of before it. But the linkifier's goal is to find links in natural text, not everywhere (that's impossible).

My personal opinion is that the linkifier is not being used as intended. So this example is not a good enough reason for changes. Maybe I don't understand something, but this is my opinion for now. If I had to build such a site, I would parse links first, then compose the header.

Also, I understand that people can have other opinions and may just want a quick fix via hacks. For this case, the linkifier allows overriding its regexps without the need to fork the project.

mikelambert · 2017-01-03T04:43:20Z

What do you mean by "layout compose" ? I assume you are referring to writing up the text and adding the | to lay things out?

https://www.facebook.com/events/896736620379772/ is the source text I am working with. (Notice that Facebook gets the linkification correct.) I receive the raw text from the FB API, and then am trying to linkify it for use on my own website.

I understand that linkify-it might be the wrong tool (since this is not natural text), and that I am using it incorrectly (since it is applied after the author's layout composition). But unfortunately, I am not the author of the text, and so I am not able to linkify before the | characters are added.

Thanks for your time!

puzrin · 2017-01-03T05:18:05Z

Then, if you have the source in a known format (|-separated), I would split it first, then apply the linkifier, then join it back. Maybe with some additional conditions (the line should be short and contain Chinese characters). Or I would try to get HTML if possible (I'm not familiar with the FB API).

A reason to add | support would be if you can say "humans write this way" or "that's a de-facto standard", with a lot of examples.
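The split / linkify / join idea might look roughly like this. It is a sketch under assumptions, not part of linkify-it: `linkifyPipeSeparated` is a hypothetical helper, `linkifySegment` stands for whatever function turns one piece of plain text into linkified output, and `demoLinkify` is a toy stand-in used only so the example runs on its own. Both the ASCII bar `|` (U+007C) and the fullwidth bar `｜` (U+FF5C) are treated as separators.

```javascript
// Split on ASCII "|" (U+007C) or fullwidth "｜" (U+FF5C). The capturing group
// keeps each separator in the split result so the text can be rejoined unchanged.
const PIPE_SEPARATORS = /([|\uff5c])/;

// linkifySegment is a hypothetical callback: it receives one segment of plain
// text and returns that segment with its links marked up.
function linkifyPipeSeparated(text, linkifySegment) {
  return text
    .split(PIPE_SEPARATORS)
    // Odd indices hold the captured separators; pass them through untouched.
    .map((part, i) => (i % 2 === 1 ? part : linkifySegment(part)))
    .join('');
}

// Toy stand-in for a real linkifier: wraps bare ASCII domains in <a> tags.
const demoLinkify = (s) =>
  s.replace(/[a-z0-9-]+(?:\.[a-z0-9-]+)+/gi, (m) => `<a href="http://${m}">${m}</a>`);

const out = linkifyPipeSeparated('台北｜example.com｜2016', demoLinkify);
// Only the middle segment is linked; both "｜" separators survive as-is.
```

Since each segment is linkified in isolation, a separator can no longer be absorbed into a neighbouring domain name. The extra conditions suggested above (short lines, presence of Chinese characters) could be added as a guard before splitting.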

puzrin · 2017-01-03T05:33:37Z

In other words, if you have auto-generated text, consider parsing/detecting its structure before applying the linkifier. Or, if you know of some new, uncovered de-facto patterns of human writing, create a new issue with proof (live examples), and I'll try to fix it if possible.

mikelambert · 2017-01-03T22:58:42Z

Thank you, that's a creative solution to this problem. I'll go with that for now.

I'll create a separate issue with the few examples I have, and you can decide there whether it's a justified "de-facto pattern of human writing" or not. :)

geyang · 2017-02-22T05:50:31Z

@mikelambert I'm wondering what Facebook's linkifier does with the following link?

link: https://zh.wikipedia.org/wiki/（

You can see where this link takes you here: https://zh.wikipedia.org/wiki/（

mikelambert · 2017-02-22T05:57:51Z

Feel free to create a dummy facebook event (or even a FB post on your wall) and see what happens?

It seems like it fails to parse that link properly, instead linking to https://zh.wikipedia.org/wiki/ .

Jeff-Tian · 2023-10-19T10:43:59Z

I am looking forward to the fix! It would be much more enjoyable if links pasted into a GitHub README and followed by Chinese characters were SMARTLY displayed correctly.

LeonLiuY mentioned this issue Jun 9, 2015

URL contains .. can't be parsed correctly in IM rhinobird-io/rhinobird-web#126

Closed

puzrin mentioned this issue Jun 9, 2015

URLs with ".." #16

Closed

puzrin mentioned this issue Jan 4, 2017

Vertical pipe separators getting included in domain names #46

Closed

puzrin mentioned this issue Feb 22, 2017

Softbreak rendering in CJK languanges markdown-it/markdown-it#334

Closed

This was referenced Nov 30, 2021

Invalid regular expression on Japanese site #65

Closed

url not detected correctly #57

Closed

xqdoo00o mentioned this issue Apr 26, 2023

When asking GPT for a website link, the characters after the link are also included in the displayed link. How can this be solved? xqdoo00o/chatgpt-web#38

Open
