Skip to content
This repository has been archived by the owner on Mar 7, 2021. It is now read-only.

Controls URL use urijs.normalize() #355

Open
wants to merge 2 commits into
base: master
Choose a base branch
from
Open

Controls URL use urijs.normalize() #355

wants to merge 2 commits into from

Conversation

zicjin
Copy link

@zicjin zicjin commented Mar 5, 2017

What this PR changes

Add a configuration to controls URL use URI.js normalize() process.

Rationale

some url will be wrong normalize process such as this page contained the url: http://www.baidu.com/baidu?cl=3&tn=SE_baiduhomet8_jmjb7mjw&fr=top1000&wd=%D6%D01.3%D2%DA%B7%D6%B9%EB%C3%DB

@fredrikekelund
Copy link
Collaborator

Hi @zicjin, thanks for filing a PR! Could you expand a bit on the rationale here - how is the URL you're encountering faulty? What kinds of problems does it cause?

It looks like eslint is throwing an error because of a missing semicolon in your code as well, would be great if you could update that and push again!

@zicjin
Copy link
Author

zicjin commented Mar 5, 2017

sorry for semicolon, it has been fixed. this url: http://www.baidu.com/baidu?cl=3&tn=SE_baiduhomet8_jmjb7mjw&fr=top1000&wd=%D6%D01.3%D2%DA%B7%D6%B9%EB%C3%DB is all right to visit. But when it passes URIjs.normalize() it will become another url and navigating another webpage. This is not site owner and my intention. I did not go to analyze the normalize() inside what did it do.

@fredrikekelund
Copy link
Collaborator

Alright, I see the issue here. It looks like the URL you're getting is ISO8859 encoded, while uri.js outputs a Unicode encoded version after normalize has been called. We've tried to address these kinds of situations with the crawler.urlEncoding setting, but it looks like our implementation doesn't account for a scenario where the incoming URL's are already ISO8859 encoded. I'll open an issue and investigate further.

I'll leave this PR open for now, but it's not unlikely that I'll close it later on, since the issue here isn't really that we're normalizing the URL's, but rather that we don't handle their encoding correctly.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants