Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discard search results if no nodes are found #1311

Merged
merged 1 commit into from Aug 5, 2023

Conversation

bno1
Copy link
Contributor

@bno1 bno1 commented May 17, 2023

The docs are not very clear on this, but I think the search results should be discarded if getSearchResults is not going to be called in order to free up the result in Chrome.

@kenshaw
Copy link
Member

kenshaw commented May 17, 2023

@ZekeLu your thoughts on this one?

@bno1
Copy link
Contributor Author

bno1 commented May 17, 2023

I did a small test and it does seem that discarding the results leads to slightly less memory usage after many (100K) searches:

Test code
package main

import (
	"context"
	"log"
	"os"
	"time"

	"github.com/chromedp/cdproto/dom"
	"github.com/chromedp/chromedp"
)

func main() {
	ctx, cancel := chromedp.NewContext(context.Background())
	defer cancel()

	doDiscard := false

	if len(os.Args) > 1 {
		doDiscard = true
	}

	const STEPS = 100000

	log.Printf("Running %v steps with doDiscard=%v", STEPS, doDiscard)

	err := chromedp.Run(
		ctx,
		chromedp.Navigate("https://google.com"),
		chromedp.ActionFunc(func(ctx context.Context) error {
			for i := 0; i < STEPS; i++ {
				id, count, err := dom.PerformSearch("#foobarbaz").Do(ctx)
				if err != nil {
					return err
				}

				if count < 1 {
					if doDiscard {
						err = dom.DiscardSearchResults(id).Do(ctx)
						if err != nil {
							return err
						}
					}

					continue
				}

				nodes, err := dom.GetSearchResults(id, 0, count).Do(ctx)
				if err != nil {
					return err
				}

				log.Println(nodes)
			}

			return nil
		}),
	)
	if err != nil {
		panic(err)
	}

	log.Println("Done, sleeping...")

	time.Sleep(1 * time.Hour)
}

No discard:
no discard

With discard:
with discard

@ZekeLu
Copy link
Member

ZekeLu commented May 17, 2023

This is a good catch!

I just checked how the DevTools handle this command. It turns out that it always calls Dom.discardSearchResults. I think we should do the same thing here (that means we should call it when count > 0 too). And we have better add a test case for it.

image

@kenshaw
Copy link
Member

kenshaw commented May 17, 2023

When I originally wrote the query selectors, I purposefully did not discard the nodes after use; this is partially an issue with being able to maintain a somewhat "linear" logic. I'm not sure how this could be reworked without something like a chromedp.Discard() action that the user would have to add at the end of every action chain. The issue there is then that you couldn't break up actions. If anyone has thoughts on an API design that would work, I'm all for it. I realize chromedp is now used in much more complex projects, typically how I designed (and use) it for is to run a browser, extract some data, close the browser.

When I wrote the original code, it was with that mechanism in mind, so discarding the nodes seemed to not matter. Closing the search results when there are no results seems logically fine.

@ZekeLu
Copy link
Member

ZekeLu commented May 17, 2023

Regarding the unit test, I'm thinking sending a DOM.getSearchResults command after the search is discarded, and it should result in an error (I'm not sure yet). But the searchId is not exported, it seems there is no way to test it.

@ZekeLu
Copy link
Member

ZekeLu commented May 17, 2023

Ken, it seems that the search result is an internal state of the BySearch func, it's not exported, and the user code can not access it, so I think it's ok for BySearch to discard the search before it returns (means we can add a defer call to discard the search). Have I missed something?

@kenshaw
Copy link
Member

kenshaw commented May 17, 2023

I don't know if the logic has changed within Chrome, but originally the reason these needed to be kept around is that you would not get the right nodes back if you discarded it prior to using the node IDs later for performing actions on.

@ZekeLu
Copy link
Member

ZekeLu commented May 17, 2023

Ah, that makes sense! @bno1 Can you please have a test to see if the node IDs can be used after discarding the search?

@bno1
Copy link
Contributor Author

bno1 commented May 17, 2023

Updated to always discard using defer. The _ = dom.DiscardSearchResults syntax is to silence golangci-lint warning about unused err result.

@ZekeLu In my small test app everything seems to work fine with discarding the result. But I can't rule out the possibility of chrome reusing node IDs from discarded results. Do you think this is possible?

@kenshaw
Copy link
Member

kenshaw commented May 17, 2023

@bno1 my original version of the BySearch discarded the nodes (I have versions of the original chromedp that was never pushed, when I was still building out the first version, and I just went back and checked that). I changed it to not do a discard, because yes it ended up being incorrect.

@ZekeLu my thinking here is that we can by default at the end of an action run discard this, and collect the value in the passed action context. We can then create a context option that disables this, if for some reason someone wants to keep the search results for multiple action runs. By default, we would enable the discard, and then users who wish to change this, would be able to modify it. It'd keep the API the same, while possibly gaining a bit of less memory usage.

@ZekeLu
Copy link
Member

ZekeLu commented May 18, 2023

@kenshaw That would be a safe choice. Let me briefly summarize what will be changed in case I missed something.

  1. Add these fields to chromedp.Context:

    muSearch sync.Mutex
    // searchIDs collects search id from BySearch so that the search can be
    // discarded later in (Tasks).Do.
    searchIDs []string
    // keepSearchResult indicates whether the user wants to keep the search
    // result. If it's true, chromedp won't collect search ids. Its value can
    // be modified by the option WithKeepSearchResult.
    keepSearchResult bool

    type Context struct {

  2. Collect search id in BySearch:

    @@ -330,10 +330,12 @@ func BySearch(s *Selector) {
            }
    
            if count < 1 {
    +           // TODO: call dom.DiscardSearchResults()
                return []cdp.NodeID{}, nil
            }
    
            nodes, err := dom.GetSearchResults(id, 0, count).Do(ctx)
    +       // TODO: collect search id
            if err != nil {
                return nil, err
            }
  3. Discard collected searches on each (Tasks).Do (And I think we should document that calling action.Do directly won't be covered):

    @@ -731,6 +740,7 @@ func (t Tasks) Do(ctx context.Context) error {
                return err
            }
        }
    +   // TODO: discard collected searches.
        return nil
     }

    func (t Tasks) Do(ctx context.Context) error {

@bno1 Sorry that this requires quite some work. Please let us know if you encounter any issues during the implementation. Thank you very much!

@bno1
Copy link
Contributor Author

bno1 commented May 18, 2023

Sounds good to me. I think I will do some tests first to try to reproduce Ken's original issue, mostly out of curiosity.

@bno1
Copy link
Contributor Author

bno1 commented May 24, 2023

Update code according to @ZekeLu's suggestions. Also added a test for discard.

About trying to reproduce @kenshaw's original issue I couldn't achieve that on a static page. I think it can happen on a dynamic page (e.g. performSearch returns a node, the result is discarded immediately, the node is removed/replaced by page script, chrome reuses the node ID for some other node, chromedp tries to read the node and gets the new node instead of the one it found originally) but I didn't have the time to look into that

@ZekeLu
Copy link
Member

ZekeLu commented May 25, 2023

@bno1 Thank you very much! Just want to let you know that I need some time to review the changes.

And regarding the original issue mentioned by ken, I need some time to look into it (hopefully ken can shed some light on it). I want to apology in advance if it finally turns out that we can discard the search result early.

Thank you!

@ZekeLu
Copy link
Member

ZekeLu commented Jun 27, 2023

@bno1 Sorry for the soooo late reply! I was super busy recently.

I have checked the current source code of chromium, it turns out that discardSearchResults only removes the search result from search_results_:

protocol::Response InspectorDOMAgent::discardSearchResults(
    const String& search_id) {
  search_results_.erase(search_id);
  return protocol::Response::Success();
}

I'm not 100 percent sure that this won't affect the node ids. But comparing with DiscardFrontendBindings which will invalidate the node ids:

void InspectorDOMAgent::DiscardFrontendBindings() {
  if (history_)
    history_->Reset();
  search_results_.clear();
  document_node_to_id_map_->clear();
  id_to_node_.clear();
  id_to_nodes_map_.clear();
  ReleaseDanglingNodes();
  children_requested_.clear();
  cached_child_count_.clear();
  if (revalidate_task_)
    revalidate_task_->Reset();
}

I believe that the node ids are tracked by document_node_to_id_map_ (and/or id_to_node_, id_to_nodes_map_).

And based on your testing:

About trying to reproduce @kenshaw's original issue I couldn't achieve that on a static page.

In my opinion, always discarding the search result in the defer call is enough (that is, we only need this commit: d736122).

@kenshaw @bno1 What do you think?

@bno1
Copy link
Contributor Author

bno1 commented Aug 2, 2023

@ZekeLu thanks for pointing out the sources. I tried to look for those but I'm not familiar with chromium's source code. In addition to what you said, looking at the Bind method, it doesn't do anything smart like reusing ids, it's just incrementing a counter without checking for overflow or conflict.

So I agree with your conclusion. Discarding always seems to be the right thing to do.

@ZekeLu
Copy link
Member

ZekeLu commented Aug 2, 2023

Then let's always discard the search result. I will tag a new release after this PR is merged. In the case that it causes any error, we can revert the change. Thank you!

@bno1
Copy link
Contributor Author

bno1 commented Aug 4, 2023

@ZekeLu I updated the branch

@ZekeLu
Copy link
Member

ZekeLu commented Aug 5, 2023

Thank you @bno1 !

@ZekeLu ZekeLu merged commit d218b16 into chromedp:master Aug 5, 2023
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants