Skip to content
This repository has been archived by the owner on May 12, 2020. It is now read-only.

Use etld plus one matching for 3p #172

Open
wants to merge 13 commits into
base: master
Choose a base branch
from

Conversation

pes10k
Copy link
Collaborator

@pes10k pes10k commented Feb 4, 2019

fixes #171

@pes10k pes10k requested a review from bbondy February 4, 2019 16:58
'www.example.org'
)
assert.equal(queryResult.matches, true)
})
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i suggest adding some tests that use non-trivial etld+1's like, example.co.uk and example.githubusercontent.com

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are many, but there at the C++ level. They're in this file https://github.com/brave/ad-block/blob/use-etld-plus-one-matching-for-3p/test/etld_test.cc

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i see thanks

@diracdeltas
Copy link
Member

general question about this approach:

is it possible to just use the etld+1 parsing from Chromium? see https://cs.chromium.org/chromium/src/net/base/registry_controlled_domains/registry_controlled_domain.h?q=getdomainandregistry&dr=CSs

@pes10k
Copy link
Collaborator Author

pes10k commented Feb 6, 2019

general question about this approach:

is it possible to just use the etld+1 parsing from Chromium? see https://cs.chromium.org/chromium/src/net/base/registry_controlled_domains/registry_controlled_domain.h?q=getdomainandregistry&dr=CSs

We could, but then we'd loose the ability to run in node (which has been very valuable for crawling / measurement, getting other folks to use the code, debugging, etc)

etld/domain.cc Outdated
@@ -0,0 +1,51 @@
/* Copyright (c) 2018 The Brave Software Team. Distributed under the MPL2
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Should be 2019. This applies to all of the new files added in this PR.

Copy link
Collaborator Author

@pes10k pes10k Feb 9, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed with 81cc4f4 :)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I'm confused about how GitHub is displaying the latest version of your PR, but it seems like you fixed all files, except for etld/domain.cc?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, i goofed, apologies for the butter fingers. Fixed now

@fmarier
Copy link
Collaborator

fmarier commented Feb 9, 2019

I've got a question about the build process since I don't actually know when the build step takes place: if we pull down the list at build time, is there a reason to have it checked into the repo?

@pes10k
Copy link
Collaborator Author

pes10k commented Feb 9, 2019

I've got a question about the build process since I don't actually know when the build step takes place: if we pull down the list at build time, is there a reason to have it checked into the repo?

You're right no need for this. I removed it from the .gitignore previously, now also removed it from the set of tracked files. Should be good now

@pes10k
Copy link
Collaborator Author

pes10k commented Feb 9, 2019

@bbondy my code expects the public suffix list to be in a known location, and lazily parses the list on first use (i.e. at etld/data/<list>.dat). I have no idea if this will work when rolled into the larger browser. I'm just not familiar enough with the build process. Can you double check that aspect?

@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from be4c62e to 33ff5d9 Compare February 9, 2019 06:03
.gitignore Outdated
.vscode

# These files are either fetched at build time, or generated from the build
etld/data/public_suffix_list.h
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I usually try to make these absolute paths with respect to the location of the .gitignore file (i.e. /etld/data/public_suffix.h) in order to avoid unexpected matches somewhere else in the codebase. Not very likely in this case, so feel free to ignore this suggestion, but a good habit IMHO.

Makefile Outdated
@@ -5,7 +5,9 @@
.PHONY: clean

build:
./node_modules/.bin/node-gyp configure && ./node_modules/.bin/node-gyp build
curl -s https://publicsuffix.org/list/public_suffix_list.dat -o etld/data/public_suffix_list.dat
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kudos for not using -k here :)

nit: If you want to make that curl line even stricter, you could also throw in --tlsv1.2 to enforce a minimum level of TLS.

constructorArgs.push(isException ? "true" : "false");

const wrappedLabels = labels.map(JSON.stringify);
constructorArgs.push("{" + wrappedLabels.join(", ") + "}");
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the labels be surrounded by " in case they're not all bare words? Also, is it possible they include " characters that should be escaped?

}

if (previous != 0) {
labels_.push_back(string.substr(previous, current - previous));
Copy link
Collaborator

@fmarier fmarier Feb 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this call will fail if you get invalid input like: abcd.efgh. (note the trailing dot). Might be worth adding a test for it if there isn't one already.

@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch 2 times, most recently from 2919690 to 8d72272 Compare February 22, 2019 06:04
@pes10k
Copy link
Collaborator Author

pes10k commented Feb 22, 2019

@bbondy this is now ready for review again. The ways to enable the eTLD+1 checking (by parsing a public suffix list) are:

  1. when using the check.js script, use the new -P, --public-suffix-rules-path option, and point it to a text including public suffix rules.
  2. use the js AdBlockClient.parsePublicSuffixRules method and give it a string containing public suffix rules
  3. use the C++ AdBlockClient::parsePublicSuffixRules method with a char* / std::string of rules
  4. use either the C++ or JS deserialize methods with a dat file that includes public suffix rules data (serializing after doing 1, 2 or 3 will include the public suffix rule data in the .dat).

@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch 2 times, most recently from d1ada9e to 86fb8a9 Compare February 22, 2019 06:37
Copy link
Member

@bbondy bbondy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a commit to fix npm run perf, this is unfortunately currently regressing perf by 4-5x overall though. We'll need to optimize that so it's trivially the same in speed for matching.

@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from 0fcdfc4 to 76bdb66 Compare February 25, 2019 01:55
@pes10k pes10k requested a review from bbondy February 26, 2019 01:26
while (current != std::string::npos) {
current_label = label_text.substr(previous, current - previous);
if (current_label.length() == 0) {
throw PublicSuffixRuleInputException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Chromium is built with exceptions disabled, and this would be the first exception.
I'd recommend instead passing in a pointer to a vector and then filling it. And make the return value the result, and propagate failures.

// If don't include any trailing whitespace, if there is any.
current_label = label_text.substr(previous, current - previous);
if (current_label == "") {
throw PublicSuffixRuleInputException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto exceptions

PublicSuffixRule::PublicSuffixRule(const std::string& rule_text) {
std::string trimmed_rule_text(trim_to_whitespace(rule_text));
if (trimmed_rule_text.length() == 0) {
throw PublicSuffixRuleInputException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto exceptions

break;

case '/':
throw PublicSuffixRuleInputException(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ditto exceptions

before(function () {
this.client = new AdBlockClient()
this.client.parse('||bannersnack.com^$third-party')
const etldRules = fs.readFileSync('./test/data/public_suffix_list.dat', 'utf8')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we do some similar tests for when we don't call parsePublicSuffixRules and it falls back to the warning with FQDN?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can probably just add a for loop which loops twice using different clients around all the it calls.

this.client.parsePublicSuffixRules(etldRules)
})
it('consider eTLD+1 domains as 1p', function () {
const altSubDomainQuery = this.client.findMatchingFilters(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could we add tests for matches too?

Copy link
Member

@bbondy bbondy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please squash these 3 commits together by using git rebase -i

  • linter fixes / cleanup
  • further linter fixes / cleanup
  • even more lint cleanup

@pes10k pes10k force-pushed the use-etld-plus-one-matching-for-3p branch from d25f24d to ad0dab0 Compare February 28, 2019 22:03
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use eTLD+1 handling for determining whats 3p
4 participants