
data: currencies, engine descriptions and osm_keys_tags: use SQLite instead of JSON #3458

Open · wants to merge 3 commits into master from data_use_sqlite
Conversation

dalf (Member) commented May 4, 2024

What does this PR do?

To reduce memory usage, use a SQLite database to store the engine descriptions, currencies and OSM keys/tags.

Dumps of the databases are stored in Git to facilitate maintenance, especially for the pull requests made automatically every month.
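
For reference, a minimal sketch of how such a SQL dump can be produced from Python (the file names are illustrative; the PR may generate the dumps differently):

import sqlite3

con = sqlite3.connect("engine_descriptions.db")  # file name illustrative
with open("engine_descriptions.sql", "w") as f:
    for line in con.iterdump():  # yields schema and data as SQL statements
        f.write(line + "\n")
con.close()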

With this PR, searx.data provides functions to access the data:

def fetch_engine_descriptions(language) -> Dict[str, List[str]]:
  """Return engine description and source for each engine name"""

def fetch_iso4217_from_user(name: str) -> Optional[str]:
  """Currency: get the ISO4217 name from the user input"""

def fetch_name_from_iso4217(iso4217: str, language: str) -> Optional[str]:
  """Currency: get the localized name from the ISO4217"""

def fetch_osm_key_label(key_name: str, language: str) -> Optional[str]:
  """Get the OSM key label from the key id"""

def fetch_osm_tag_label(tag_key: str, tag_value: str, language: str) -> Optional[str]:
  """Get the OSM tag label from the tag key and value"""

The function names start with fetch instead of get to emphasize that the data is fetched from the databases.

With these functions, any part of the code or any engine can access the data without an awkward import like this one in the Apple Maps engine:

from searx.engines.openstreetmap import get_key_label
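
With this PR, the same label can be fetched directly from searx.data (a sketch; the exact call site in the Apple Maps engine may differ):

from searx.data import fetch_osm_key_label

label = fetch_osm_key_label("opening_hours", "en")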

Why is this change important?

It saves about 20 MB per worker, similar to #3443, but here the memory usage remains low even after some queries that use OSM data (for example).

SQLite caches some pages, but as far as I understand, this caching happens in the kernel page cache (see the sketch after the list):

  • it is shared between the processes
  • the kernel discards cache entries when memory is low
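
A minimal sketch of a connection setup that leans on the kernel page cache (path, mode and pragma values are assumptions for illustration, not necessarily what the PR uses):

import sqlite3

# Read-only URI connection: several workers can open the same file safely.
con = sqlite3.connect("file:searx/data/currencies.db?mode=ro", uri=True)

# With memory-mapped I/O, pages are read through the kernel page cache,
# which is shared between processes and reclaimed when memory is low.
con.execute("PRAGMA mmap_size = 2097152")  # 2 MB mapping, illustrative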

About load time: it takes 10 ms to load useragents.json, external_urls.json, wikidata_units.json, external_bangs.json, engine_traits.json and locales.json on my AMD 5750GE. Even ten times slower would still be reasonable IMO: the HTTP requests during initialization are far slower than that.
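
For anyone who wants to reproduce the measurement, a minimal timing sketch (file list shortened, paths illustrative):

import json
import time

start = time.perf_counter()
for name in ("useragents.json", "external_urls.json", "wikidata_units.json"):
    with open(f"searx/data/{name}") as f:  # paths illustrative
        json.load(f)
print(f"{(time.perf_counter() - start) * 1000:.1f} ms")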

How to test this PR locally?

Author's checklist

Related issues

Related to

@dalf dalf requested a review from return42 May 4, 2024 09:22
@dalf dalf changed the title data: engine descriptions: use SQLite instead of JSON data: currencies and engine descriptions: use SQLite instead of JSON May 4, 2024
@dalf dalf force-pushed the data_use_sqlite branch 2 times, most recently from 8557d79 to 42e1d92 on May 4, 2024 11:08
@dalf dalf changed the title data: currencies and engine descriptions: use SQLite instead of JSON data: currencies, engine descriptions and osm_keys_tags: use SQLite instead of JSON May 4, 2024
@dalf dalf force-pushed the data_use_sqlite branch 2 times, most recently from cab91d5 to a1a9156 on May 4, 2024 15:43
mrpaulblack added a commit to paulgoio/searxng that referenced this pull request May 6, 2024
return42 (Member) left a comment

@dalf can we merge this PR or are you waiting for more test results?

@dalf dalf force-pushed the data_use_sqlite branch 2 times, most recently from f5da9b4 to bf959dd on May 18, 2024 20:34
dalf added 3 commits May 18, 2024 20:50
To reduce memory usage, use a SQLite database to store the engine descriptions.
A dump of the database is stored in Git to facilitate maintenance,
especially the pull requests made automatically every month.

Related to
* searxng#2633
* searxng#3443
dalf (Member, Author) commented May 18, 2024

After some tests on Paul's instance, the memory usage increased to nearly its original value after a few days.


I have updated the code:

  • It is rebased on the latest master branch
  • SQLite connections are shared between the threads (the Python documentation seems misleading here: connections can be shared, and a safety check can be added)
  • the cache is reduced to 512 KB instead of 2 MB
  • sql_connection is a generator, so if the connection has to be closed, it can be done in one place (idea from @return42); see the sketch below
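
A minimal sketch of such a generator under the assumptions above (everything except the name sql_connection, including the table schema, is hypothetical; the real code likely differs):

import sqlite3
from typing import Dict, Iterator, Optional

_CONNECTIONS: Dict[str, sqlite3.Connection] = {}  # shared between threads

def sql_connection(db_path: str) -> Iterator[sqlite3.Connection]:
    """Yield a shared read-only connection; closing, if ever needed,
    can then be handled in this one place."""
    con = _CONNECTIONS.get(db_path)
    if con is None:
        # check_same_thread=False allows sharing across threads; safe here
        # because the databases are only read, never written.
        con = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True,
                              check_same_thread=False)
        con.execute("PRAGMA cache_size = -512")  # 512 KB instead of 2 MB
        _CONNECTIONS[db_path] = con
    yield con

def fetch_osm_key_label(key_name: str, language: str) -> Optional[str]:
    for con in sql_connection("searx/data/osm_keys_tags.db"):
        row = con.execute(
            "SELECT label FROM osm_keys WHERE key=? AND language=?",  # schema illustrative
            (key_name, language),
        ).fetchone()
        return row[0] if row else None
    return None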

@mrpaulblack can you try the last update?
