Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Guard beacon script against saving not expected values into the database #6599

Open
piotrbak opened this issue Apr 26, 2024 · 22 comments · May be fixed by #6605
Open

Guard beacon script against saving not expected values into the database #6599

piotrbak opened this issue Apr 26, 2024 · 22 comments · May be fixed by #6605
Assignees
Labels
lcp needs: product direction priority: high Issues which should be resolved as quickly as possible severity: moderate Feature isn't working as expected but has work around to get same value type: bug Indicates an unexpected problem or unintended behavior
Milestone

Comments

@piotrbak
Copy link
Contributor

Before submitting an issue please check that you’ve completed the following steps:

  • Made sure you’re on the latest version
  • Used the search feature to ensure that the bug hasn’t been reported before

Describe the bug
In specific conditions it's possible that our beacon will add unexpected data to the db:

{"type":"img","src":"linear-gradient(160deg, rgb(255, 255, 255) 0%, rgb(248, 246, 243) 100%)"}
  {
    "type": "img",
    "src": "chrome-extension://eanggfilgoajaocelnaflolkadkeghjp/img/commands/main.svg",
    "srcset": "",
    "sizes": "",
    "sources": [],
    "bg_set": [],
    "current_src": "chrome-extension://eanggfilgoajaocelnaflolkadkeghjp/img/commands/main.svg",
    "label": "above-the-fold"
  }

Expected behavior
We should guard the src against incorrect values. Accepted values are:

  1. Absolute URLs of images, starting with http, https, possibly with //
  2. Relative URLs of images

Acceptance Criteria (for WP Media team use only)

  1. Unexpected values not added to the db
  2. Images starting with http, https, // can be added to the database correctly
  3. Relative URLs of images added correctly to the db
  4. No regressions in handling different type of implemented markups (background, responsive, picture, basic images)
@piotrbak piotrbak added type: bug Indicates an unexpected problem or unintended behavior priority: high Issues which should be resolved as quickly as possible needs: grooming severity: moderate Feature isn't working as expected but has work around to get same value lcp and removed needs: grooming labels Apr 26, 2024
@MathieuLamiot
Copy link
Contributor

MathieuLamiot commented Apr 26, 2024

AC 1 & 3 would need to be refined I believe, that's quite free to interpretation here. We would really need examples to reproduce, otherwise we won't be able to validate the development, nor assess exactly what we should guard against. Typically, it might not be straightforward to distinguish relative URLs from the rest of possible values in a src attribute. Without examples, we won't know what to work on.

@piotrbak
Copy link
Contributor Author

@DahmaniAdame Would you be able to provide steps to reproduce the problem with Chrome extension?

@DahmaniAdame
Copy link
Contributor

@DahmaniAdame Would you be able to provide steps to reproduce the problem with Chrome extension?

Use an extension that adds an overlay with an image on it, like HarpaAI for example, and run the script. There should be entries related to the icon used on the overlay.

There should be an entry similar to this:

chrome-extension://eanggfilgoajaocelnaflolkadkeghjp/img/commands/main.svg

@DahmaniAdame
Copy link
Contributor

DahmaniAdame commented Apr 27, 2024

Unexpected values not added to the db

Examples of what can be excluded:

https://domain.ext/file.php?url=img.jpg
https://domain.ext/file.js?url=img.jpg
https://domain.ext/file.php#url=img.jpg
https://domain.ext/file.js#url=img.jpg
chrome-extension://extension-hash/path/to/image/x.svg
linear-gradient(160deg, rgb(255, 255, 255) 0%, rgb(248, 246, 243) 100%)

About data URI they are accepted as LCP elements.

@MathieuLamiot
Copy link
Contributor

MathieuLamiot commented Apr 29, 2024

I guess there are 2 ways of handling those exceptions: preventing them to be identified in the first place as candidates, or filter out before storing in DB some given patterns.
I would highly recommend the 1st option, as it should allow at some point to get a refined detection system, whereas the second would just be a list of exceptions and it won't properly scale nor improve accuracy.
That being said, @DahmaniAdame & @piotrbak, for each above example, we should rather focus on finding examples reproducing them. Otherwise, we'll "just" filter out based on some patterns, and those patterns are yet to be determined anyway 🤔 Those examples just like that are unfortunately not useful for us :/

@piotrbak
Copy link
Contributor Author

@DahmaniAdame @benorfaz what do you think? I'd agree with Mathieu.

If so, @DahmaniAdame could you add exact steps for this one?

We'd be adding each case separately then.

@DahmaniAdame
Copy link
Contributor

DahmaniAdame commented Apr 29, 2024

Option 1 sounds right.

First thing, let's drop the ones with # as they are unlikely to be generally used. ? is the norm on passing arguments.

https://domain.ext/file.php?url=img.jpg

I couldn't find the plugin that served that, but you can use the below for testing:

https://call.adame.io/images/adaptive.php?image_url=https://image.civitai.com/xG1nkqKTMzGDvpLrqFT7WA/09af39a1-c481-4ad0-9283-aa4a4b6db0c9/original=true/00010-360343057.jpeg&width=200

https://domain.ext/file.js?url=img.jpg

More for external services using node to served adaptive images. I don't have anything on hand for now. We an dismiss it until we have a case.

chrome-extension://extension-hash/path/to/image/x.svg

Install the HarpaAI extension on Chrome and test - https://chromewebstore.google.com/detail/harpa-ai-automation-agent/eanggfilgoajaocelnaflolkadkeghjp?authuser=2

linear-gradient(160deg, rgb(255, 255, 255) 0%, rgb(248, 246, 243) 100%)

I don't have any example on this one. @piotrbak where did you experience that?

@piotrbak
Copy link
Contributor Author

@DahmaniAdame I experienced it once on the testing website, it wasn't there after cleaning the data anymore. I suspect it'll be reproducible with big div element that's set with background like this.

@piotrbak piotrbak changed the title [WIP] Guard beacon script against saving not expected values into the database Guard beacon script against saving not expected values into the database Apr 30, 2024
@DahmaniAdame
Copy link
Contributor

@piotrbak can you provide a sample the code for engineering/QA to take it into consideration?

@Khadreal
Copy link
Contributor

Khadreal commented May 1, 2024

Scope a solution ✅

src/wp-rocket/assets/js/lcp-beacon.js

In _getElementInfo() check if element_into.src is a valid url before returning the element_info. We could do something like this

const image_src = element_info.src
const valid_image_extensions = ['.jpg', '.jpeg', '.png', '.gif'];
const file_extension = image_src.split('.').pop().toLowerCase();

if (image_src.startsWith('http://') || image_src.startsWith('https://') &&  valid_image_extensions.includes('.' + file_extension) && !image_src.includes('?')) {
           return element_info;
} else if( image_src.startsWith('data:image')) {
          return element_info;
}

Estimate the effort ✅

[S]

@Khadreal
Copy link
Contributor

Khadreal commented May 1, 2024

We can also do the same thing on the backend should in case anything slip through the frontend

@wordpressfan
Copy link
Contributor

as a final decision, we need to exclude the following urls:

https://domain.ext/file.php?url=img.jpg
https://domain.ext/file.js?url=img.jpg
https://domain.ext/file.php#url=img.jpg
chrome-extension://extension-hash/path/to/image/x.svg
linear-gradient(160deg, rgb(255, 255, 255) 0%, rgb(248, 246, 243) 100%)

Correct?

@wordpressfan
Copy link
Contributor

I was thinking of doing that from the php side, exactly here:

$images = isset( $_POST['images'] ) ? json_decode( sanitize_text_field( wp_unslash( $_POST['images'] ) ) ) : [];

not from the JS side @Khadreal

@Khadreal
Copy link
Contributor

Khadreal commented May 1, 2024

I was thinking of doing that from the php side, exactly here:

$images = isset( $_POST['images'] ) ? json_decode( sanitize_text_field( wp_unslash( $_POST['images'] ) ) ) : [];

not from the JS side @Khadreal

Okay. I'm thinking we could that on both side. I'll add grooming for the backend

@Khadreal
Copy link
Contributor

Khadreal commented May 1, 2024

as a final decision, we need to exclude the following urls:

https://domain.ext/file.php?url=img.jpg
https://domain.ext/file.js?url=img.jpg
https://domain.ext/file.php#url=img.jpg
chrome-extension://extension-hash/path/to/image/x.svg
linear-gradient(160deg, rgb(255, 255, 255) 0%, rgb(248, 246, 243) 100%)

Correct?

@DahmaniAdame FYA

@wordpressfan wordpressfan linked a pull request May 2, 2024 that will close this issue
22 tasks
@wordpressfan
Copy link
Contributor

@wp-media/engineering-plugin-team

I created this PR #6605 to validate the idea and seems like it's working and covering many cases, can u plz take a look and once the idea is validated I'll fix the tests and may add some more test cases.

@MathieuLamiot
Copy link
Contributor

MathieuLamiot commented May 3, 2024

Thanks @wordpressfan ! The safeguard/bail-out with elementInfo and empty( $image->type ) look OK to me.
My main concern is about:

  $image_src_filetype_array = wp_check_filetype( $image_src_path );
  return ! empty( $image_src_filetype_array['type'] ) && str_starts_with( $image_src_filetype_array['type'], 'image/' );

@DahmaniAdame & @piotrbak, can you confirm that all elements that we should capture for LCP/ATF should always be URLs to a file of image type? Meaning, the URL we store should always be directly and explicitely pointing to a .jpg, .png or something like this?

@wordpressfan how does it behave for this image for instance? http://1.gravatar.com/avatar/d7a973c7dab26985da5f961be7b74480?s=120&d=mm&r=g
Currently, it is the LCP on http://mathieu.e2e.rocketlabsqa.ovh/hello-world

@DahmaniAdame
Copy link
Contributor

can you confirm that all elements that we should capture for LCP/ATF should always be URLs to a file of image type?

Yes, but that only applies to the collected src attribute. It needs to be a single image URL.

@MathieuLamiot
Copy link
Contributor

@DahmaniAdame
Copy link
Contributor

@MathieuLamiot this one should be accepted. It's a valid image.

The reason why I added .php and .js specifically as part of the exclusions is because they will process images on the fly without caching them.

If we preload them, the time needed to process will have negative effects on INP as well.

Anything else is fair game.

@DahmaniAdame
Copy link
Contributor

@MathieuLamiot checking the mime is still applicable if we have a filter to allow URLs that are still valid but don't have an image mime type, like gravatar.com.

I would consider them more of an edge case and they will usually come from a specific provider. We can maintain a list of auto exclusions if needed.

That way, we are sure that it's a valid image.

@wordpressfan
Copy link
Contributor

Done, now we have a filter to validate the image src

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
lcp needs: product direction priority: high Issues which should be resolved as quickly as possible severity: moderate Feature isn't working as expected but has work around to get same value type: bug Indicates an unexpected problem or unintended behavior
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants