Function to find the selector of a node #198

Open
sundy-li opened this issue Aug 9, 2017 · 5 comments

Comments

@sundy-li

sundy-li commented Aug 9, 2017

  // node is an *html.Node somewhere inside doc
  // if ok, sel is the selector string, such as `.sidebar-reviews article .content-block a`
  sel, ok := doc.FindSelector(node)

Could this function be possible?

@mna
Member

mna commented Aug 9, 2017

Hello,

It could be done, but there can be many valid selector strings for a given node, and there's no guarantee that this selector would be unique (that is, the selector could return many matches, not just the one for that specific node). I guess it could be made unique by adding :nth-child pseudo classes everywhere, but not sure that would be super useful.

What do you want to achieve exactly?

Martin

@sundy-li
Author

sundy-li commented Aug 12, 2017

Hello,
I have thought about it. What I want to achieve is this: when I search an HTML DOM tree and find a node that is useful to me according to some judging algorithms, I want to store its selector in a database for future use.

For example, I want to crawl the newest article URLs of thousands of blogs. The HTML DOM tree differs from blog to blog, so when a blog index URL is added, I want some algorithm to store in my db the selector of every <a href="{newest article url}"> node.

But I notice that

  1. goquery is a library for selecting HTML nodes, like jQuery, so this feature may fall a little outside goquery's goal.

  2. The selector is not unique for a given node. (For now, I use a recursive parent search to build the selector of a node, adding :nth-child wherever a parent has siblings to make the selector unique.) It would be nice to have something like what Chrome does.
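The recursive parent search described in point 2 could look roughly like the sketch below. It is only an illustration under stated assumptions: `Node` is a hypothetical stand-in that mirrors the parent and sibling links of `html.Node` from golang.org/x/net/html so the example has no dependencies, and `SelectorFor` is a made-up name, not a goquery function. It joins tag names with the child combinator and adds `:nth-child` only where an element has element siblings, which should be enough to match a single node.

```go
package main

import (
	"strconv"
	"strings"
)

// Node is a hypothetical stand-in for html.Node (golang.org/x/net/html).
// Only the fields needed for this sketch are included.
type Node struct {
	Parent, FirstChild, NextSibling *Node
	IsElement                       bool   // false for text, comment and document nodes
	Tag                             string // element name, e.g. "div"
}

// AppendChild attaches c as the last child of n.
func (n *Node) AppendChild(c *Node) {
	c.Parent = n
	if n.FirstChild == nil {
		n.FirstChild = c
		return
	}
	last := n.FirstChild
	for last.NextSibling != nil {
		last = last.NextSibling
	}
	last.NextSibling = c
}

// elementIndex returns the 1-based position of n among the element children
// of its parent (the number :nth-child expects) and the element child count.
// n must have a parent.
func elementIndex(n *Node) (index, total int) {
	for s := n.Parent.FirstChild; s != nil; s = s.NextSibling {
		if !s.IsElement {
			continue // :nth-child ignores text and comment nodes
		}
		total++
		if s == n {
			index = total
		}
	}
	return index, total
}

// SelectorFor (hypothetical name) walks from n up to the topmost element and
// builds a selector such as "html > body > div:nth-child(2) > a".
// It returns "" if n is not an element.
func SelectorFor(n *Node) string {
	var parts []string
	for cur := n; cur != nil && cur.IsElement; cur = cur.Parent {
		part := cur.Tag
		if cur.Parent != nil {
			// Disambiguate only when the element has element siblings.
			if idx, total := elementIndex(cur); total > 1 {
				part += ":nth-child(" + strconv.Itoa(idx) + ")"
			}
		}
		parts = append(parts, part)
	}
	// parts were collected leaf-first; reverse them into root-first order.
	for i, j := 0, len(parts)-1; i < j; i, j = i+1, j-1 {
		parts[i], parts[j] = parts[j], parts[i]
	}
	return strings.Join(parts, " > ")
}
```

A selector built this way is tied to the exact structure of the page, so it will stop matching if the blog inserts or reorders elements above the target node. Chrome's own output may differ, for example by preferring ids where they exist.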


@mna
Member

mna commented Aug 12, 2017

Thanks for the context, yeah I see what you mean, I think it makes sense. I'm gonna try to give this a shot, maybe this weekend (no promises :). I'll take a look at how Chrome handles this, but I think another option is to have an array of html.Node indices to traverse the tree (instead of a css selector string). Maybe offer both options.

@mna
Member

mna commented Aug 21, 2017

I implemented the PathForNode(*html.Node) []int and NodeAtPath([]int) *html.Node functions in the wip-selector branch. That's not exactly what you wanted, but that's the same-ish feature. It works well, though it's not so nice to use because it works with *html.Node instead of *goquery.Selection (however it can still be useful as it is probably more efficient to match, retrieve and to store than the string selector version will be).

I'll try to add the selector string thing at some point which will fit better with the rest of the goquery API.
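To illustrate the index-path idea, here is a minimal sketch. It is not the code from the wip-selector branch: `Node` is a hypothetical stand-in for `html.Node` so the example needs no dependencies, and `pathTo` and `nodeAt` are made-up names with assumed signatures (unlike the functions named above, `nodeAt` takes the root explicitly). Each entry in the path is the 0-based index of a child among all of its parent's children, text nodes included.

```go
package main

// Node is a hypothetical stand-in for html.Node (golang.org/x/net/html).
type Node struct {
	Parent, FirstChild, NextSibling *Node
	Data                            string
}

// AppendChild attaches c as the last child of n.
func (n *Node) AppendChild(c *Node) {
	c.Parent = n
	if n.FirstChild == nil {
		n.FirstChild = c
		return
	}
	last := n.FirstChild
	for last.NextSibling != nil {
		last = last.NextSibling
	}
	last.NextSibling = c
}

// pathTo returns the child indices to follow from the root of the tree to
// reach n. The path of the root itself is empty.
func pathTo(n *Node) []int {
	var path []int
	for cur := n; cur.Parent != nil; cur = cur.Parent {
		idx := 0
		for s := cur.Parent.FirstChild; s != cur; s = s.NextSibling {
			idx++
		}
		path = append(path, idx)
	}
	// Indices were collected leaf-first; reverse them into root-first order.
	for i, j := 0, len(path)-1; i < j; i, j = i+1, j-1 {
		path[i], path[j] = path[j], path[i]
	}
	return path
}

// nodeAt follows path down from root and returns the node it leads to,
// or nil if the tree has no node at that path.
func nodeAt(root *Node, path []int) *Node {
	cur := root
	for _, idx := range path {
		if idx < 0 {
			return nil
		}
		child := cur.FirstChild
		for i := 0; i < idx && child != nil; i++ {
			child = child.NextSibling
		}
		if child == nil {
			return nil
		}
		cur = child
	}
	return cur
}
```

A path like this is compact to store (a short list of integers) and cheap to resolve, but, like an :nth-child selector, it only stays valid while the document structure above the node is unchanged.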

@sundy-li
Author

sundy-li commented Oct 9, 2017

Thanks for the quick implementation. I saw the commit, and PathForNode gives a good path signature for an html.Node in a DOM tree. Though it's not exactly what I want, I could use it as a pointer that can be saved in a database, so it's useful. I will keep waiting for your better implementation~
