[META] Reports/Agendas - Ability to search attachment text #266

Closed
shrayshray opened this issue Mar 8, 2018 · 13 comments

@shrayshray
Collaborator

Implement a system that allows users to search for board reports based on the
text in attachments.
We are discussing this internally and will respond on whether this functionality should be the default search type or an optional addition to the standard search, e.g., a "search attachments" checkbox next to the search bar.

@shrayshray shrayshray added this to the March issues milestone Mar 8, 2018
@reginafcompton
Contributor

reginafcompton commented Mar 9, 2018

Steps to add this search functionality

Phase I: Conversion

  • write a custom script that iterates over all attachments and converts doc, docx, and PDF into plain text
  • test this script locally on full database
  • install textract on the Councilmatic server: Conversion script for LA Metro attachments datamade/django-councilmatic#193 (comment)
  • in the requirements, pin django-councilmatic to a commit (e.g., -e git://github.com/datamade/django-councilmatic.git@3693b75179bbabb93087a90cf36842119c04076f#egg=councilmatic_core), and deploy to staging site - then, run the script against the staging database
  • merge attachment-conversion branch, and cut a new version of django-councilmatic
  • for Metro production: turn off cron, pin Metro to new release of django-councilmatic, deploy, and then run the conversion script against the database (N.B. I logged the first run of the conversion script here: /tmp/lametro_attachment_conversion.log ... roughly 400 URLs returned a 404, which struck me as odd (a Legistar blip?), so I ran it again to catch those documents.)
  • turn on cron, and add the conversion script to regular data pipeline (i.e., run it every 15 minutes)
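
A rough sketch of the conversion loop described in Phase I, for illustration only. It assumes textract's `process()` function for the extraction itself. The `Attachment` record, the `convert_attachments` helper, and the use of local file paths (downloading is left out) are hypothetical stand-ins, not the actual script in django-councilmatic.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Iterable, List, Optional, Tuple

# Extensions named in the steps above.
SUPPORTED = {".doc", ".docx", ".pdf"}


@dataclass
class Attachment:
    """Hypothetical stand-in for an attachment row in the database."""
    path: str
    full_text: Optional[str] = None


def textract_extract(path: str) -> str:
    """Extract text with textract (imported lazily; assumed to be installed)."""
    import textract  # third-party package
    return textract.process(path).decode("utf-8", errors="replace")


def convert_attachments(
    attachments: Iterable[Attachment],
    extract: Callable[[str], str] = textract_extract,
) -> Tuple[int, List[str]]:
    """Fill in full_text for attachments that have not been converted yet.

    Returns (number converted, paths that failed), so a second run can
    pick up documents that failed the first time.
    """
    converted, failed = 0, []
    for attachment in attachments:
        if attachment.full_text is not None:
            continue  # already converted on an earlier run
        if Path(attachment.path).suffix.lower() not in SUPPORTED:
            continue  # skip formats this sketch does not handle
        try:
            attachment.full_text = extract(attachment.path)
            converted += 1
        except Exception:  # e.g., a document that could not be fetched
            failed.append(attachment.path)
    return converted, failed
```

Because already-converted attachments are skipped, running the function a second time only retries the documents that failed, which mirrors the "ran it again" step above.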

Phase II: Solr indexing

@reginafcompton
Contributor

Questions to answer:

  • How much of a performance hit will search take after we add all the new text?
  • Should users check or uncheck attachments, i.e., search everything (by default), search just attachments, or not search attachments?

DataMade will have a first implementation of this issue by the end of April.

@hancush
Collaborator

hancush commented Apr 23, 2018

FYI – We'll need to push the completion of this issue to the May milestone.

@hancush
Collaborator

hancush commented Apr 23, 2018

Road map:

  1. We should add attachment text as a distinct field for search (to enable optional searching), to the BillIndex search index.

    • What's the best way to do this: Concatenate all attachment text and search via bill, or create parallel index for bill documents? (The "best" way will be informed by the answers to the next two questions.)
    • Is the purpose of searching attachment text to surface the attachment or the bill?
    • What is the default search functionality (bill text only, attachment text only, or bill text and attachment text)?
  2. Then, most likely, we will need to extend / refactor the LAMetroCouncilmaticFacetedSearchView view to alter the SearchQueryset to search different fields, depending on user input.
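
To make the road map concrete, here is a toy model of the "concatenate all attachment text and search via bill" option, with a flag that decides which fields are searched. It is plain Python, not Haystack or Solr code. The `Bill` class and the `search` function are hypothetical and only illustrate keeping attachment text in a distinct field so that searching it can be optional.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Bill:
    """Hypothetical, simplified stand-in for an indexed board report."""
    identifier: str
    text: str
    attachment_texts: List[str] = field(default_factory=list)

    @property
    def attachment_text(self) -> str:
        # Concatenate all attachment text into one bill-level field.
        return " ".join(self.attachment_texts)


def search(bills: List[Bill], query: str, search_attachments: bool = True) -> List[str]:
    """Return identifiers of bills whose selected fields contain the query."""
    needle = query.lower()
    hits = []
    for bill in bills:
        fields = [bill.text]
        if search_attachments:
            fields.append(bill.attachment_text)
        if any(needle in value.lower() for value in fields):
            hits.append(bill.identifier)
    return hits


bills = [
    Bill("2018-0001", "Motion on bus service", ["Red Line ridership study"]),
    Bill("2018-0002", "Red Line maintenance contract"),
]
everything = search(bills, "red line")
reports_only = search(bills, "red line", search_attachments=False)
```

Here `everything` contains both identifiers, while `reports_only` contains only "2018-0002", because the first bill matches through its attachment alone. In this model a match in an attachment surfaces the bill, not the attachment.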

@hancush hancush modified the milestones: April issues, May issues Apr 24, 2018
@hancush
Collaborator

hancush commented May 25, 2018

Re: our earlier questions,

  1. I can see use cases for both of these alternatives, but the conversation which prompted this request on our side was about surfacing the report based on a related attachment.

  2. Search everything (board reports + attachments) by default, with an option to uncheck a box and search just reports.

@shrayshray
Collaborator Author

@hancush We've been testing this and have some feedback and questions:

  1. How are search results ranked? It looks like they're ranked by date instead of relevancy ... we definitely do want them to rank by relevancy first.

  2. Search for exact phrase using quotes isn't working. E.g., search for "Red Line" in screenshot below.
    searchwithquotes-staging

  3. When the search terms are found in the attachment, there is no preview text in the search results listing. Is it possible to display preview text from the attachment?
    missingtext-staging

  4. Tags on a report don't work from search results where the search terms are found in the attachment.
    a. Tags on the search result:
    tagnoresults1-staging

    b. What happens after clicking the tag "Annual Program Evaluation (APE)":
    tagnoresults2-staging

    c. Tags on results where the search term is found in the Report are somewhat inconsistent. In this case, the tag term is "Strategic Plan":
    tagnoresults3-staging

    d. Clicking the tag even retrieves results where the search term is in the attachment:
    tagnoresults4-staging

    e. But in another case, the tag term "Board of Directors"
    tagnoresults5-staging

    f. retrieves all sorts of reports which are not tagged with "Board of Directors":
    tagnoresults6-staging

  5. Is it possible for the search engine to ingest a synonym list? Metro's library has created several and they could be helpful here for connecting phrases, acronyms, etc., e.g., "Public-Private Partnership", "P3", "PPP".

Thanks for all your work on this, Hannah!
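
One way to picture the synonym question is query-time expansion from groups of equivalent terms. The sketch below is plain Python for illustration. The comma-separated line format and the `load_synonyms` and `expand_query` helpers are assumptions, not a description of how the search engine behind this site ingests synonym lists.

```python
from typing import Dict, List, Set


def load_synonyms(lines: List[str]) -> Dict[str, Set[str]]:
    """Map each term (lowercased) to its full group of equivalents."""
    groups: Dict[str, Set[str]] = {}
    for line in lines:
        terms = {term.strip() for term in line.split(",") if term.strip()}
        for term in terms:
            groups.setdefault(term.lower(), set()).update(terms)
    return groups


def expand_query(query: str, synonyms: Dict[str, Set[str]]) -> List[str]:
    """Return the listed equivalents for a query, or the query itself if none."""
    term = query.strip()
    equivalents = synonyms.get(term.lower())
    return sorted(equivalents) if equivalents else [term]


synonyms = load_synonyms(["Public-Private Partnership, P3, PPP"])
expanded = expand_query("p3", synonyms)
```

With the example group from the question, `expanded` holds all three terms, so a search for "p3" could also match documents that only say "PPP" or "Public-Private Partnership".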

@hancush
Collaborator

hancush commented Jun 6, 2018

Hi, @shrayshray! Thank you for your detailed feedback!

Re: point 1, search has always been ordered by date by default – would you mind opening a separate issue for that? Whatever default we use, I think we should also indicate it to the user, like this:

screen shot 2018-06-06 at 4 51 42 pm

I've started working on your remaining feedback over in the search_improvements branch – I'll open up a PR to keep track of progress in the AM.

@reginafcompton reginafcompton modified the milestones: May issues, July issues Jul 5, 2018
@reginafcompton
Contributor

reginafcompton commented Jul 11, 2018

@reginafcompton reginafcompton changed the title Reports/Agendas - Ability to search attachment text [META] Reports/Agendas - Ability to search attachment text Jul 11, 2018
@reginafcompton
Contributor

@shrayshray - We've implemented fixes for the issues noted above and deployed those changes to the staging site.

Could you let us know how this looks and if/when you'd like to deploy to production?

@shrayshray
Collaborator Author

@reginafcompton, thank you! I've been testing and will report back after Matt and Omar send feedback.

@shrayshray
Collaborator Author

@reginafcompton Just got approval -- we're ready to deploy to production!

@reginafcompton
Contributor

@shrayshray - wonderful! I'll deploy this morning.

@reginafcompton
Contributor

We're live! Closing this issue via PR #325


3 participants