URL crate is failing to parse these existing URLs #489

mixalturek · 2019-03-01T15:16:01Z

Hi there, I have these five URLs that exist but can't be parsed by url crate due to invalid domain name. This issue is similar to #483 with subdomains ending on - character, but these ones use punycode.

use url::Url;

fn main() {
    let urls = vec![
        "http://mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/indexx.php",
        "http://shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/sitemap.html",
        "http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/bvv",
        "http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/index.php",
        "http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php"
    ];

    for url in urls {
        match url.parse::<Url>() {
            Ok(_) => println!("Parsing successful: {}", url),
            Err(e) => println!("Parsing failed: {}, {}", e, url),
        }
    }
}

Parsing failed: invalid international domain name, http://mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/indexx.php
Parsing failed: invalid international domain name, http://shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/sitemap.html
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/bvv
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/index.php
Parsing failed: invalid international domain name, http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php

All of them ping and respond.

http get http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php
HTTP/1.1 200 OK
cache-control: max-age=0, private, must-revalidate
connection: close
content-length: 384
content-type: text/html; charset=utf-8
date: Fri, 01 Mar 2019 15:12:05 GMT
server: nginx
set-cookie: sid=603521ac-3c34-11e9-b489-941a3615a412; path=/; domain=xn----9mcjf9b4dbm09f.com; HttpOnly

<html><head><title>Loading...</title></head><body><script type='text/javascript'>window.location.replace('http://count.shdedgelanimailnoticeborad.count.mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/index.php?js=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJqcyI6MX0.fADWc9hUOlh58R9UzufQBROmie3I7c7vE835oE6YmU4&uuid=603521ac-3c34-11e9-b489-941a3615a412');</script></body></html>

The text was updated successfully, but these errors were encountered:

mixalturek · 2019-03-01T15:21:39Z

And one more note: All of them are marked as malware or phishing so be careful when accessing them, prefer command line to browser or use a sand boxed instance... ;-)

rtf-const · 2019-05-07T19:53:40Z

These URLs are reported as invalid because they violate rules for "Right-to-Left Scripts for Internationalized Domain Names" that are specified in RFC-5893
https://tools.ietf.org/html/rfc5893#section-2
In particular they violate rule #1 in section 2 that says that label in bidi domain name should not start with an ascii digit (Bidi property EN)

The first character must be a character with Bidi property L, R, or AL....

So, in the URL ""http://mail.163.com.xn----9mcjf9b4dbm09f.com/iloystgnjfrgthteawvo/indexx.php" domain name is the Bidi domain and it does contain label "163" that violate Rule #1, so the URL is reported as invalid.

But actually, I think that there is some misinterpretation of standard. Yes, label "163" violate Bidi rules, but the following paragraph of sections 2 of RFC-5893 says following:

The following guarantees can be made based on the above:
o In a domain name consisting of only labels that satisfy the rule, the requirements of Section 3 are satisfied. Note that even LTR labels and pure ASCII labels have to be tested.
o In a domain name consisting of only LDH labels ... and labels that satisfy the rule, the requirements of Section 3 are satisfied as long as a label that starts with an ASCII digit does not come after a right-to-left label.

So, in order to be a valid domain name Bidi domain all labels in the domain name should satisfy bidi rules with a only exception that LDH label (label that contains only an ascii letters, digits and hyphens) can start with a European digit if it comes before any RTL label.
The following parts of RFC-5893 make that even more clear and provide motivation for such decision.
Section 5. "Troublesome Situations and Guidelines"

There is a large current deployment of ASCII domain names starting with digits. These cannot possibly be invalidated.

Section 7. "Compatibility Considerations" is most straightforward

This specification does not forbid using leading European digits in ASCII-only labels, since this would conflict with a large installed base of such labels, and would increase the scope of the specification from RTL labels to all labels ... Generally, it is best to disallow registration of any right-to-left strings in a zone where the label at the level above begins with a digit.

So, in general it is not recommended (while not strictly forbidden) to use RTL labels as a subdomain of a digit-leading label, but not vice-versa.

But another issue it that while RFC-5893 does allow labels starting with a digit, the "UNICODE IDNA COMPATIBILITY PROCESSING" standard, which is used as a base for that crate implementation provides a publicly available test files that contains some domain names that are marked as violating Bidi rule #1, while they should be considered as valid according to RFC-5893. For example domain "0a.xn--4db" is a part of IDNA test as a domain that violate Bidi rule #1, while according to RFC-5893 it should be considered as valid because a label starting with a digit precedes RTL label.

I tried to look at other implementations of UTS46, for example python implementation
https://github.com/kjd/idna/blob/master/tests/test_idna_uts46.py
They explicitly skip the test that forces Bidi rules for not-bidi labels, saying that "These appear to be errors in the test vector".

Anyway I personally think that practicality aspect should outweigh aiming to following the specification, and if current realization considers existing domains as invalid, it's better to relax some parsing restrictions.

SimonSapin · 2019-07-22T16:47:04Z

Newer versions of https://www.unicode.org/reports/tr46/#ToASCII take a CheckBidi boolean flag as part of their input. https://url.spec.whatwg.org/#idna sets it to true.

Regarding practicality, if we have reason to believe the spec makes a particular choice wrong, it’s preferably to file a bug on the spec to discuss changing it (especially if it helps align with other implementations) rather than just unilaterally making a different choice.

vorner · 2019-07-25T16:37:27Z

I'm getting confused here a bit.

If I understand the previous comment, the old spec makes these URLs valid but not recommended and suggests not allowing their registration. But if I have to work with existing URLs, I can't un-register them so it makes sense to allow them. If I'm a registrar, I might decide not to allow them.

The new version would disallow them, but then, what happens to the existing-but-not-recommended URLs?

Do I understand it correctly?

rtf-const · 2019-07-25T17:49:53Z

If I understand the previous comment, the old spec makes these URLs valid but not recommended and suggests not allowing their registration

Not exactly. RFC-5893 says that URLS that contains labels starting with a digit are absolutly valid if such a label comes before any RTL label. While domain names that contains digit-leading labels after RTL label should be forbiden for registration.

Also I don't sure about term "new" and "old" spec. As I know RFC-5893 is an actual spec for bidirectional international domain names, and it wasn't replaced by any newer spec.
As I understand this crate is based on a quite different doc, "Unicode® Technical Standard #46" UNICODE IDNA COMPATIBILITY PROCESSING" https://unicode.org/reports/tr46/ It is not replacement for aforementioned RFC, instead it refers to RFC. But this doc says that domain that contains any RTL label cannot contain digit-leading label.
Frankly, I don't know the full relationshps between these two docs and can't say what doc should have a priority in any given case

vorner · 2019-09-10T09:41:15Z

I don't want to sound demanding, or anything, but is there some way forward about this? This seems to be somewhat stuck.

If it's matter of manpower (eg. needing to do some specific research, or implementing), I can try doing it when it's me needing it.

Even closing as wontfix would probably be better than this uncertainty, as I would know I have to look for a different solution for my problem.

Thanks

SimonSapin · 2019-09-10T11:38:49Z

What’s needed is research to find out whether an up-to-date (which this crate is known not to be: #290, #163) and correct implementation of https://url.spec.whatwg.org/ would also reject those URLs.

If not, what’s needed next is a PR to bring crate closer to the spec. (If starting to work on this, look at Pass "new" tests #537 if it’s not merged yet, in case it touches a similar area of the code.)
If so, this is a spec issue that should be resolved there. can't parse urls starting with xn-- whatwg/url#438 looks related.

zackw · 2019-09-11T14:56:43Z

Regardless of whether the crate is compliant with the current URL spec and whether the current URL spec should allow certain domain names, I think the crate needs an override mode in which it will accept any domain name that is technically possible, because we know for a fact that noncompliant domain names do exist in the real DNS. Google isn't going to stop using host names of the form r6---sn-q4f7sn7r.googlevideo.com just because they violate at least two RFCs.

SimonSapin · 2019-09-14T19:34:53Z

“Need” is an interesting word.

I’ll just leave this here: https://users.rust-lang.org/t/help-wanted-maintaining-rust-url/10707

vorner · 2019-09-15T18:53:47Z

I'll try to dig some time up to help. This post gives new light at why the pull request #497 sits there… at the time I thought there was disagreement about it and discussion happening here, not that you'd want someone else to take the review on them. It'll probably take some time, though (I need to get familiar with the crate internals), thanks for letting it be known.

vorner · 2019-10-11T15:27:54Z

I've tried digging through the standards. It seems to me that the section 2 of RFC-5893 disallows this URL. The bullet point says that if one allowed such domain as in here, it would still satisfy the goals in section 3, but the rules don't say one should allow them. The rules are referenced as „All six numbered properties in section 2 must be satisfied“, which rules out using the bullet point even if it said it should be allowed. The section 5 and 7 are not normative. However, the described bullet point and the text in section 7 kind of hints at the intention of allowing these URLs, even though the rules don't.

I'll try to reach the authors of the RFC or find if there was some discussion about this already.

SimonSapin · 2019-10-11T15:33:00Z

This crate implements https://url.spec.whatwg.org/ (which in turn references https://www.unicode.org/reports/tr46/) rather than IETF RFCs

vorner · 2019-10-11T15:43:38Z

Yes, but in depth 3 from there it references the RFC-5893. From the tr46:

If CheckBidi, and if the domain name is a Bidi domain name, then the label must satisfy all six of the numbered conditions in [IDNA2008] RFC 5893, Section 2.

SimonSapin · 2019-10-11T15:45:30Z

Ah sorry, never mind my previous comment.

annevk · 2020-08-25T09:32:44Z

See also https://github.com/whatwg/url/labels/topic%3A%20idna. This might warrant an additional URL Standard issue as these domain names do seem to work in browsers. Perhaps the URL Standard should take more direct ownership of these algorithms as UTR46 does not seem to cut it...

rtf-const mentioned this issue May 4, 2019

allowing bidi domain names to start with a digit #496

Closed

rtf-const mentioned this issue May 7, 2019

bidi rules #497

Open

nox added the bug label Jul 18, 2019

vorner mentioned this issue Sep 13, 2020

Refusing a mix of numeric-only and BIDI domains whatwg/url#543

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

URL crate is failing to parse these existing URLs #489

URL crate is failing to parse these existing URLs #489

mixalturek commented Mar 1, 2019

mixalturek commented Mar 1, 2019

rtf-const commented May 7, 2019

SimonSapin commented Jul 22, 2019

vorner commented Jul 25, 2019

rtf-const commented Jul 25, 2019

vorner commented Sep 10, 2019

SimonSapin commented Sep 10, 2019

zackw commented Sep 11, 2019

SimonSapin commented Sep 14, 2019 •

edited

vorner commented Sep 15, 2019

vorner commented Oct 11, 2019

SimonSapin commented Oct 11, 2019

vorner commented Oct 11, 2019

SimonSapin commented Oct 11, 2019

annevk commented Aug 25, 2020

URL crate is failing to parse these existing URLs #489

URL crate is failing to parse these existing URLs #489

Comments

mixalturek commented Mar 1, 2019

mixalturek commented Mar 1, 2019

rtf-const commented May 7, 2019

SimonSapin commented Jul 22, 2019

vorner commented Jul 25, 2019

rtf-const commented Jul 25, 2019

vorner commented Sep 10, 2019

SimonSapin commented Sep 10, 2019

zackw commented Sep 11, 2019

SimonSapin commented Sep 14, 2019 • edited

vorner commented Sep 15, 2019

vorner commented Oct 11, 2019

SimonSapin commented Oct 11, 2019

vorner commented Oct 11, 2019

SimonSapin commented Oct 11, 2019

annevk commented Aug 25, 2020

SimonSapin commented Sep 14, 2019 •

edited