
Checker: change image check #3312

Open
wants to merge 1 commit into
base: master

Conversation

dalf
Member

@dalf dalf commented Mar 9, 2024

What does this PR do?

On the master branch, the checker starts streaming the response and then cuts the connection. However, this produces many read errors, which are false negatives. I don't know how to fix that.

This commit changes the checker to download the whole image. Error reporting is also changed to print a single line instead of the whole stack trace.

Note that I have not tested the checker running in the background. That feature seems forgotten and lacks interest, despite the initial push a few years ago.

Why is this change important?

How to test this PR locally?

  • make search.checker.brave.images
  • make search.checker.google_images (written with _, not .)

Author's checklist

Related issues

Close #3311

On the master branch, the checker starts streaming the response
and then cuts the connection. However, this produces many read
errors, which are false negatives. I don't know how to fix that.

This commit changes the checker to download the whole image.
Error reporting is also changed to print a single line
instead of the whole stack trace.

Also, if a timeout occurs, the checker waits for one second
before retrying.

Note that I have not tested the checker running in the background.
That feature seems forgotten and lacks interest, despite the
initial push a few years ago.
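
The behaviour described in the commit message (download the whole body, report errors on one line, wait one second before retrying after a timeout) could look roughly like the sketch below. This is not the code from the PR: `is_url_image`, `FetchTimeout`, the `fetch` callable and the `(status, content_type, body)` response shape are all hypothetical names chosen so the example runs without network access.

```python
import time
from typing import Callable, Tuple

# (status_code, content_type, body): a hypothetical response shape for this sketch.
Response = Tuple[int, str, bytes]


class FetchTimeout(Exception):
    """Hypothetical stand-in for whatever timeout error the HTTP client raises."""


def is_url_image(
    url: str,
    fetch: Callable[[str], Response],
    attempts: int = 2,
    wait: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Download the whole body and decide whether the URL serves an image."""
    for attempt in range(attempts):
        try:
            # fetch() is assumed to return only after the full body was read.
            status, content_type, body = fetch(url)
        except FetchTimeout:
            # On timeout, wait before the next attempt (no wait after the last one).
            if attempt + 1 < attempts:
                sleep(wait)
            continue
        except Exception as exc:
            # Report a single line instead of a full stack trace.
            print(f"{url}: {type(exc).__name__}: {exc}")
            return False
        return status == 200 and content_type.startswith("image/") and len(body) > 0
    return False
```

Passing `fetch` and `sleep` as parameters is only a convenience for testing the retry logic in isolation; the real checker goes through `searx.network`.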
Member

@return42 return42 left a comment


Thanks for this PR. When I test:

make search.checker.artic

the checker terminates prematurely with this exception:

Engine artic                         Checking
Traceback (most recent call last):
  File "local/py3/bin/searxng-checker", line 33, in <module>
    sys.exit(load_entry_point('searxng', 'console_scripts', 'searxng-checker')())
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "searx/search/checker/__main__.py", line 115, in main
    run(args.engine_name_list, args.verbose)
  File "searx/search/checker/__main__.py", line 73, in run
    checker.run()
  File "searx/search/checker/impl.py", line 439, in run
    self.run_test(test_name)
  File "searx/search/checker/impl.py", line 425, in run_test
    rct_list = [self.get_result_container_tests(test_name, search_query) for search_query in search_query_list]
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "searx/search/checker/impl.py", line 419, in get_result_container_tests
    result_container_check.check_basic()
  File "searx/search/checker/impl.py", line 273, in check_basic
    self._check_results(results)
  File "searx/search/checker/impl.py", line 248, in _check_results
    self._check_result(result)
  File "searx/search/checker/impl.py", line 241, in _check_result
    elif not _is_url_image(result.get('img_src')):
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "searx/search/checker/impl.py", line 124, in _is_url_image
    return _download_and_check_if_image(image_url)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "searx/search/checker/impl.py", line 76, in _download_and_check_if_image
    r = network.get(
        ^^^^^^^^^^^^
  File "searx/network/__init__.py", line 165, in get
    return request('get', url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "searx/network/__init__.py", line 96, in request
    return future.result(timeout)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "~/.asdf/installs/python/3.12.0/lib/python3.12/concurrent/futures/_base.py", line 456, in result
    return self.__get_result()
           ^^^^^^^^^^^^^^^^^^^
  File "~/.asdf/installs/python/3.12.0/lib/python3.12/concurrent/futures/_base.py", line 401, in __get_result
    raise self._exception
  File "/searx/network/network.py", line 290, in request
    return await self.call_client(False, method, url, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "searx/network/network.py", line 273, in call_client
    return Network.patch_response(response, do_raise_for_httperror)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "searx/network/network.py", line 246, in patch_response
    raise_for_httperror(response)
  File "searx/network/raise_for_httperror.py", line 75, in raise_for_httperror
    raise SearxEngineAccessDeniedException(message='HTTP error ' + str(resp.status_code))
searx.exceptions.SearxEngineAccessDeniedException: HTTP error 403, suspended_time=86400
make: *** [Makefile:50: search.checker.artic] Error 1

@dalf
Member Author

dalf commented Mar 9, 2024

The issue existed before this PR: an exception raised in the async code is not caught by the sync code.
The fix is to remove the async code from the Flask app.
I will add a further comment later, after my meal.
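
To illustrate the failure mode, here is a minimal sketch, assuming a setup similar to what the traceback suggests: a coroutine runs on an event loop in another thread, and the sync caller waits on `future.result(timeout)`. An exception raised inside the coroutine is re-raised in the sync caller at that point, so it aborts the run unless it is caught there. `AccessDenied`, `fetch_status`, `start_loop` and `check_url` are hypothetical names, not SearXNG code.

```python
import asyncio
import concurrent.futures
import threading
from typing import Optional


class AccessDenied(Exception):
    """Hypothetical stand-in for an 'HTTP error 403' style exception."""


async def fetch_status(url: str) -> int:
    # Simulates async network code that raises on an HTTP error.
    raise AccessDenied("HTTP error 403")


def start_loop() -> asyncio.AbstractEventLoop:
    # Run an event loop in a background thread, as a sync app might do.
    loop = asyncio.new_event_loop()
    threading.Thread(target=loop.run_forever, daemon=True).start()
    return loop


def check_url(loop: asyncio.AbstractEventLoop, url: str, timeout: float = 5.0) -> Optional[str]:
    """Return None on success, or a one-line error description."""
    future = asyncio.run_coroutine_threadsafe(fetch_status(url), loop)
    try:
        # The coroutine's exception is re-raised here, in the sync caller.
        future.result(timeout)
    except concurrent.futures.TimeoutError:
        future.cancel()
        return "timeout"
    except AccessDenied as exc:
        return f"{type(exc).__name__}: {exc}"
    return None
```

Catching the exception at the `future.result()` boundary is only one option; the comment above proposes a different fix (removing the async code).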

@dalf
Member Author

dalf commented Mar 10, 2024

@return42 may I ask you to run make search.checker.artic on the master branch?

@return42
Member

NOTE: artic was the first engine at which make search.checker terminated for me.

@return42 may I ask you to run make search.checker.artic on the master branch?

On master the test works; at the end there is an issue report (with the false negatives):

== Results ======================================================================
Engine artic                         Error
    found languages: en eo fr
    simple         : img_src URL is invalid https://www.artic.edu/iiif/2//f8fd76e9-c396-5678-36ed-6a348c904d27/full/843,/0/default.jpg (query='life' lang='all' pageno=1 safesearch=0 time_range=None)
    simple         : img_src URL is invalid https://www.artic.edu/iiif/2//a38e2828-ec6f-ece1-a30f-70243449197b/full/843,/0/default.jpg (query='life' lang='all' pageno=1 safesearch=0 time_range=None)
...

@return42
Member

Would you mind helping with #3312?

What can I do, and how can I help further? Did you read my last comment above?

@dalf
Member Author

dalf commented Apr 19, 2024

What can I do, and how can I help further? Did you read my last comment above?

Maybe you assume that I know the code since I wrote it.
I have mostly forgotten how it works: we are basically at the same level of knowledge regarding the checker.

Successfully merging this pull request may close these issues.

make search.checker reports false positiv 'img_src URL is invalid'
2 participants