Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: network filter ending with right hat recognized as regex #3871

Closed

Conversation

seia-soto
Copy link
Member

@seia-soto seia-soto commented Apr 1, 2024

fixes #3694


Summary

This fixes a network filter ending with a hat sign (hostname^) not to be resolved as a regexp filter.

Internals

The NETWORK_FILTER_MASK.isRegex is being set at the following point. The filter has a legit length then checkIsRegex does false positives.

packages/adblocker/src/filters/network.ts

if (filterIndexEnd - filterIndexStart > 0) {
  filter = line.slice(filterIndexStart, filterIndexEnd).toLowerCase();

  mask = setNetworkMask(mask, NETWORK_FILTER_MASK.isUnicode, hasUnicode(filter));
  if (getBit(mask, NETWORK_FILTER_MASK.isRegex) === false) {
    mask = setNetworkMask(
      mask,
      NETWORK_FILTER_MASK.isRegex,
      checkIsRegex(filter, 0, filter.length),
    );
  }
}

checkIsRegex checks if the filter string has a hat sign or asterisk sign. However, this logic also triggers hostname^ pattern to be checked.

/**
 * Check if the sub-string contained between the indices start and end is a
 * regex filter (it contains a '*' or '^' char).
 */
function checkIsRegex(filter: string, start: number, end: number): boolean {
  const indexOfSeparator = filter.indexOf('^', start);
  if (indexOfSeparator !== -1 && indexOfSeparator < end) {
    return true;
  }

  const indexOfWildcard = filter.indexOf('*', start);
  return indexOfWildcard !== -1 && indexOfWildcard < end;
}

Suggested changes

I added a proper hostname checker to validate if the filter is a valid hostname in the checkIsRegex function. The suggested changes check if the filter string has a trailing hat because if a hat sign in other locations does present, it does mean it's often used as a regex symbol.

/**
 * Check if the sub-string contained between the indices start and end is a
 * regex filter (it contains a '*' or '^' char).
 */
function checkIsRegex(filter: string, start: number, end: number): boolean {
  // Check if the filter has a trailing hat without pathes (mostly for hostname) and then trim it
  if (filter.charCodeAt(end - 1) === 94 /* '^' */ && checkIsHostname(filter, start, end - 1)) {
    return false;
  }

  const indexOfSeparator = filter.indexOf('^', start);
  if (indexOfSeparator !== -1 && indexOfSeparator < end) {
    return true;
  }

  const indexOfWildcard = filter.indexOf('*', start);
  return indexOfWildcard !== -1 && indexOfWildcard < end;
}

Also some cases including glob pattern need to be judged. For example, hostname*^ pattern should be handled as regex because it includes glob pattern. Therefore, checking the trailing hat is preferred in this case.

@seia-soto seia-soto marked this pull request as ready for review April 2, 2024 08:34
@seia-soto seia-soto requested a review from remusao as a code owner April 2, 2024 08:34
@seia-soto seia-soto changed the title fix: network filter ending with right anchor recognized as regex fix: network filter ending with right hat recognized as regex Apr 2, 2024
@chrmod chrmod added the PR: Bug Fix 🐛 Increment patch version when merged label Apr 15, 2024
Comment on lines 778 to 786
// Remove leading '*' if the filter is not hostname anchored.
if (
getBit(mask, NETWORK_FILTER_MASK.isHostnameAnchor) === false &&
filterIndexEnd - filterIndexStart > 0 &&
line.charCodeAt(filterIndexStart) === 42 /* '*' */
) {
mask = clearBit(mask, NETWORK_FILTER_MASK.isLeftAnchor);
filterIndexStart += 1;
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this related to the fix of filters ending with ^?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is a kind of optimisation. There's a test relevant to this: *bar^. Also see https://ghostery.slack.com/archives/C06JX8NJVMF/p1712045816656759

Copy link
Collaborator

@remusao remusao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there might be issues with this change. For example the following three cases return false but I would expect all of them to return true instead:

const { Request, f } = require("@cliqz/adblocker");

console.log(f`foo^`.match(Request.fromRawDetails({
  url: "http://foo.com/bar",
  type: "script"
})));

console.log(f`foo^`.match(Request.fromRawDetails({
  url: "http://example.com/foo/bar",
  type: "script"
})));

console.log(f`foo^`.match(Request.fromRawDetails({
  url: "http://foo.com/foo/bar",
  type: "script"
})));

packages/adblocker/src/filters/network.ts Outdated Show resolved Hide resolved
The caret symbol is used to match the start and end of the given pattern. The leading asterisk symbol is effective without explicitly having it. Therefore, removing the leading asterisk is an effective optimisation.
@seia-soto
Copy link
Member Author

The transformation to regexp filter in the parsing logic is intended, Closing.

/**
 * Compiles a filter pattern to a regex. This is only performed *lazily* for
 * filters containing at least a * or ^ symbol. Because Regexes are expansive,
 * we try to convert some patterns to plain filters.
 */
function compileRegex(
  filter: string,
  isLeftAnchor: boolean,
  isRightAnchor: boolean,
  isFullRegex: boolean,
): RegExp

In the future, especially for optimisation, I think an update to matching mechanism should be a better path.

@seia-soto seia-soto closed this Apr 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
PR: Bug Fix 🐛 Increment patch version when merged
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Complete automated tests for Filter#toString
3 participants