Better query parsing #7

Open
bnewbold opened this issue Oct 2, 2020 · 3 comments
Labels
enhancement New feature or request

Comments

@bnewbold
Member

bnewbold commented Oct 2, 2020

A particular user request is to be able to paste a citation string into the search box and have "the right thing" happen in most cases. The current query parser (Elasticsearch's built-in) doesn't work well for this; it expects a structured query string (with booleans, etc.).

A great solution would be a custom query parser with perfect detection of user intent that "does the expected thing". In the meantime, more practically, we could try to differentiate between regular queries and citation string queries, and have two code paths. The query string path would be the current behavior. The citation string path would use, e.g., GROBID and/or biblio-glutton to parse the raw citation into a structured citation, then try to do a fuzzy match against the live fatcat metadata index (generally faster than the scholar fulltext index), and if there is a hit, do an exact identifier lookup against scholar elasticsearch. The latter half of this code path would be similar to the current behavior for identifier lookups (e.g., remove all filters and sort order).
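The two code paths described above could be sketched roughly as follows. This is only a minimal sketch under stated assumptions: the heuristics and thresholds are illustrative and untuned, the names `looks_like_citation` and `citation_lookup` and all of the callable parameters are hypothetical, and falling back to the regular query path when there is no match is an assumption rather than something decided in this issue.

```python
import re
from typing import Any, Callable, Optional

# Hypothetical heuristics; patterns and thresholds are illustrative only.
YEAR_RE = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")
# "field:value" syntax, but not URL schemes such as "https://".
FIELD_RE = re.compile(r"\b[a-z_]+:(?!//)\S")
BOOLEAN_RE = re.compile(r"\b(AND|OR|NOT)\b")


def looks_like_citation(query: str) -> bool:
    """Guess whether a raw search box string is a pasted citation."""
    q = query.strip()
    # Explicit query syntax suggests the user wants the structured parser.
    if FIELD_RE.search(q) or BOOLEAN_RE.search(q) or '"' in q:
        return False
    # Very short inputs are treated as regular keyword queries.
    if len(q.split()) < 6:
        return False
    has_year = YEAR_RE.search(q) is not None
    has_punctuation = sum(q.count(c) for c in ",.;()") >= 2
    return has_year and has_punctuation


def citation_lookup(
    raw: str,
    parse_citation: Callable[[str], Optional[dict]],  # e.g. GROBID / biblio-glutton
    fuzzy_match: Callable[[dict], Optional[str]],     # against the fatcat metadata index
    lookup_by_ident: Callable[[str], Any],            # exact lookup, no filters or sort
    fallback: Callable[[str], Any],                   # assumed: regular query path
) -> Any:
    """Citation string path: parse, fuzzy match, then exact identifier lookup."""
    structured = parse_citation(raw)
    ident = fuzzy_match(structured) if structured else None
    if ident:
        return lookup_by_ident(ident)
    return fallback(raw)
```

Passing the parsing and lookup steps in as callables keeps the routing logic testable without a running GROBID or Elasticsearch instance.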

@bnewbold added the enhancement (New feature or request) label, and added then removed the help wanted (Extra attention is needed) label, Oct 2, 2020
@bnewbold
Member Author

Here is a Google Scholar blog post about detecting reference strings: https://scholar.googleblog.com/2016/01/quickly-lookup-references.html

The jargon-y term for this use case is "known item lookup".

@bnewbold
Member Author

An initial version of this has been implemented and is live. Testing and iteration are probably still needed.

@bnewbold
Member Author

Some user queries are getting re-written poorly with the current system:

"journal:" Post Communist Economies "year:" 2021
"Title:" A multi-speed fiscal "Europe?" Fiscal rules and fiscal performance in the EU former communist countries. It appears to be online content from Post Communist Economies 31Jan 2021. "Link:" "https://www-tandfonline-com.libproxy-imf.imf.org/doi/full/10.1080/14631377.2020.1867432"

The original query was probably:

journal: Post Communist Economies year: 2021

Some of this may be due to copy/paste from other sources? Eg, an email or multi-line record on a website.

For one thing, we probably shouldn't return the re-written (quoted) query; we should return the original query string (in the search box). Any time we rewrite or modify the query, we should indicate that it happened, though, and link to the query documentation.
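One way to keep the original string around while still flagging a rewrite might look like the sketch below. The `PreparedQuery` class, its field names, and `prepare_query` are hypothetical names introduced for illustration; the actual rewrite step is passed in as a function, since its details are not described here.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PreparedQuery:
    original: str   # exactly what the user typed; shown back in the search box
    effective: str  # what is actually sent to the search backend

    @property
    def was_rewritten(self) -> bool:
        # The UI could show a notice and a link to the query docs when True.
        return self.effective != self.original


def prepare_query(raw: str, rewrite: Callable[[str], str]) -> PreparedQuery:
    """Apply a rewrite function without losing the user's original input."""
    return PreparedQuery(original=raw, effective=rewrite(raw))
```

With this shape, a template can always render `original` in the search box and only mention the rewrite when `was_rewritten` is true.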

Other possible improvements or workarounds are to have an "advanced search" page, or to have separate search boxes/options for different types of query. I'd like to try a little more to stick with the "one simple box" experience, though.
