Conversion script for LA Metro attachments #193

reginafcompton · 2018-03-22T21:41:30Z

This PR handles Metro issue Metro-Records/la-metro-councilmatic#266

The script should (1) convert .doc, .docx, and PDF attachments into plain text, and (2) save that text in the full_text field on the BillDocument model.

Unoconv can convert .doc and .docx files. However, it cannot convert PDFs into plain text. See what LibreOffice can and cannot export.

For PDFs, we could use something like PyPDF2 in combination with the requests library.

Note: Since we need something in addition to unoconv, we might consider whether to use a different tool for the doc and docx conversions as well (unoconv is a heavy dependency). The original scope mentions Excel, however, and unoconv could be good for those files.

reginafcompton · 2018-03-26T15:00:53Z

For PDF conversion, try this:
https://www.binpress.com/tutorial/pdfrw-the-other-python-PDF-library/171
https://github.com/datamade/data-making-guidelines/blob/master/styleguide.md#4-standard-toolkit

hancush · 2018-03-26T15:22:36Z

for posterity: piping in2csv output to txt could work for excel files!

reginafcompton · 2018-03-30T19:33:43Z

PyPDF2 is easy to install and use. However, it comes with a couple drawbacks:

(1) I noticed several instances where the extracted plain text omitted spaces. This seems to be a known issue with PyPDF2: py-pdf/pypdf#17

(2) PyPDF2 can only convert one page at a time (using extractText()).

I think pdfminer.six might be the best option - it comes with a nice pdf2txt.py script, although the pip install does not work as expected. The documentation suggests downloading it from source.
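
To illustrate drawback (2): with the PyPDF2 1.x API, getting the full text of a PDF means looping over the pages and joining the results. A minimal sketch, assuming a PdfFileReader-style object; the helper name extract_all_pages is hypothetical and not part of this PR:

```python
def extract_all_pages(reader):
    """Join the text of every page of a PyPDF2 1.x PdfFileReader.

    `reader` only needs `numPages` and `getPage(i).extractText()`,
    which is the page-at-a-time interface discussed above.
    """
    pages = (reader.getPage(i).extractText() for i in range(reader.numPages))
    return "\n".join(pages)


# Hypothetical usage (requires PyPDF2 1.x to be installed):
#
#     from PyPDF2 import PdfFileReader
#     with open("attachment.pdf", "rb") as pdf_file:
#         text = extract_all_pages(PdfFileReader(pdf_file))
```

This does nothing about the omitted spaces in (1); it only shows the extra loop that page-level extraction requires.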

reginafcompton · 2018-03-30T21:25:52Z

I ultimately landed on textract, since it converts pdf, doc, and docx files to plaintext, without the heft of unoconv or a second library. I did not have much difficulty installing it on MacOS, but I'd like to try it out on the Councilmatic server before confirming this solution: http://textract.readthedocs.io/en/stable/installation.html#ubuntu-debian

reginafcompton · 2018-04-04T20:43:23Z

Clarification regarding the abandonment of unoconv

unoconv struggles with converting PDF to txt. I tried these conversions both locally and on the Councilmatic server. In both cases, unoconv fails with "Unable to store document..." when calling storeToURL - a function in OpenOffice.

Server

# Command run 
# I also tried this with "text"
unoconv -f txt 8e6281f1-8342-42ae-b5a2-271ca6902d99.pdf

# Error
File "/usr/bin/unoconv", line 1118, in convert
    document.storeToURL(outputurl, tuple(outputprops) )
uno.IOException: SfxBaseModel::impl_store <file:///tmp/8e6281f1-8342-42ae-b5a2-271ca6902d99.txt> failed: 0xc10

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/bin/unoconv", line 1389, in <module>
    main()
  File "/usr/bin/unoconv", line 1305, in main
    convertor.convert(inputfn)
  File "/usr/bin/unoconv", line 1120, in convert
    raise UnoException("Unable to store document to %s (ErrCode %d)\n\nProperties: %s" % (outputurl, e.ErrCode, outputprops), None)
  File "/usr/lib/python3/dist-packages/uno.py", line 507, in _uno_struct__getattr__
    return getattr(self.__dict__["value"], name)
AttributeError: ErrCode

It may be that we need a different version of OpenOffice, but I would rather not go down that path, considering that we are happily using Unoconv for the RTF converter.

reginafcompton · 2018-04-04T21:18:12Z

Installing textract on an Ubuntu server

I installed textract on the staging server (i.e., the server with all our staging sites, not the Councilmatic server), and it worked well. It requires several lightweight dependencies and one small install hack.

First, install all the dependencies.

Second, textract fails when installing pocketsphinx, which is not actually necessary for converting PDFs or Word documents to txt. This issue provides a clever workaround: deanmalmgren/textract#178.

Third, run pip install textract.

reginafcompton · 2018-04-04T22:15:02Z

@hancush - I have a working solution for the text conversion!

The one point that requires further thought is the use of a NamedTemporaryFile in convert_document.

Ideally, we could do this without a temporary file, i.e., with a subprocess, since textract can be used as a CLI tool:

p = subprocess.Popen(['textract', '--stdin', '--stdout'], preexec_fn=os.setsid, stdout=subprocess.PIPE, stdin=subprocess.PIPE, stderr=subprocess.DEVNULL)
plain_text, stderr_data = p.communicate(input=RESPONSE_IN_SOME_FORMAT, timeout=15)

However, textract uses a variety of dependencies to open and convert files: docx, doc, pdf

Is there a way to pass pdf, doc, and docx files to textract through a subprocess that I am not seeing?

@fgregg - could I bring you into this conversation as a consultant/reviewer? The main parts of this are (1) why I used textract, rather than unoconv (see comments above), and (2) the question of using a subprocess with --stdin rather than a TempFile (see notes directly above).
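
For reference, the temporary-file approach can be sketched roughly as follows. This is an illustration, not the code in the PR: the function name and the converter argument are hypothetical. In practice the converter would be textract.process, which picks a parser based on the file extension, so the suffix of the temporary file matters.

```python
import tempfile


def convert_bytes_to_plaintext(content, extension, converter):
    """Write `content` to a named temporary file and convert that file.

    `content` is the raw bytes of the downloaded attachment, `extension`
    is something like ".pdf" or ".docx", and `converter` is any callable
    that takes a file path and returns the extracted text.
    """
    with tempfile.NamedTemporaryFile(suffix=extension) as tmp:
        tmp.write(content)
        # Flush so the converter sees the full contents when it reopens the path.
        tmp.flush()
        # Reopening the file by name while it is still open works on
        # Linux/macOS; the file is deleted when the `with` block exits.
        return converter(tmp.name)


# Hypothetical usage (requires requests and textract to be installed):
#
#     import requests, textract
#     response = requests.get(document_url)
#     text = convert_bytes_to_plaintext(response.content, ".pdf", textract.process)
```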

fgregg · 2018-04-05T16:32:57Z

Since unoconv remains a dependency, as we use it for converting the rtf files to html, what about just installing pdftotext and using that for pdfs, and unoconv for the doc files?

That would seem to be a much smaller footprint?
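
A rough sketch of what this two-tool dispatch could look like, choosing the command line by file extension. The function name is hypothetical, and the flags are assumptions to verify against the installed versions: "-" as the pdftotext output file for stdout, and unoconv's --stdout option.

```python
import os


def build_conversion_command(path):
    """Return the command (as an argument list) that converts `path` to text."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".pdf":
        # Assumption: "-" as the output file makes pdftotext write to stdout.
        return ["pdftotext", path, "-"]
    if extension in (".doc", ".docx"):
        # Assumption: --stdout makes unoconv print the converted file.
        return ["unoconv", "-f", "txt", "--stdout", path]
    raise ValueError("No converter configured for {!r}".format(extension))


# Hypothetical usage:
#
#     import subprocess
#     plain_text = subprocess.check_output(build_conversion_command("report.pdf"))
```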

reginafcompton · 2018-04-06T14:24:52Z

@fgregg, yes, indeed. One of my first implementations did just that (i.e., use Unoconv for doc and docx, then another tool for PDF) - see commented code. I am happy to revert back to something along these lines with pdftotext.

With that said...

(1) the two-tool solution seems unnecessarily verbose (when textract can do everything for us)

(2) it does not seem terribly efficient to open an unoconv listener every session, even if we do not need to convert doc and docx files

(3) ideally, we are going to isolate each Councilmatic instance on unique servers: do we want to have unoconv on the Metro server for this little task? in that case, it seems like textract would be a smaller footprint, no?

I am not married to either solution! Just getting my thoughts out there.

fgregg · 2018-04-06T14:45:00Z

ideally, we are going to isolate each Councilmatic instance on unique servers: do we want to have unoconv on the Metro server for this little task? in that case, it seems like textract would be a smaller footprint, no?

This is a compelling argument, as unoconv is a big dependency. However, textract has a lot of binary dependencies; it is very far from lightweight.

Okay, I'm okay with textract.

hancush

Thanks very much for finding the right tool! :-) A few comments inline...

hancush · 2018-04-06T17:00:54Z

councilmatic_core/management/commands/convert_attachment_text.py

+        self.add_plain_text()
+
+    def get_document_url(self):
+        self.connection.execute("SET local timezone to '{}'".format(settings.TIME_ZONE))


Why do this here, instead of in the transactional context below?

Why do this at all?

Yes - this is unnecessary: we are not adding any timestamps to the instances of BillDocument. We can safely remove it.

hancush · 2018-04-06T17:07:36Z

councilmatic_core/management/commands/convert_attachment_text.py

+
+    def handle(self, *args, **options):
+        self.update_all = options['update_all']
+        self.connection = engine.connect()


Opening a connection in this way, without subsequently closing it, relies on garbage collection to close it out for you. The SQLAlchemy docs caution against this approach, notably because it's unreliable and can lead to orphaned connections to your database hanging out forever until they ultimately cause a mysterious "You have too many connections open, sorry!!" error from Postgres.

Is there a reason you assigned connection as a class attribute rather than using the with engine.begin() context manager when you require a connection?
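
A minimal sketch of the context-manager pattern suggested here, assuming a SQLAlchemy engine. The function name and its arguments are hypothetical; engine.begin() is the SQLAlchemy call that yields a connection inside a transaction.

```python
def save_full_text(engine, statement, rows):
    """Run `statement` for every dict in `rows` inside one transaction.

    engine.begin() commits if the block finishes, rolls back if it raises,
    and releases the connection either way, so nothing is left for garbage
    collection to clean up.
    """
    with engine.begin() as connection:
        connection.execute(statement, rows)


# Hypothetical usage (requires SQLAlchemy and a configured engine):
#
#     import sqlalchemy as sa
#     update = sa.text("UPDATE some_table SET full_text = :full_text WHERE id = :id")
#     save_full_text(engine, update, [{"id": 1, "full_text": "..."}])
```

The table and column names in the usage comment are placeholders, not the actual schema.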

hancush · 2018-04-06T17:12:37Z

councilmatic_core/management/commands/convert_attachment_text.py

+
+        chunk = []
+
+        for doc_dict in plain_text_results:


It'd be a bit more readable here to do for doc_dict in self.convert_document() rather than assigning the generator function call to a variable, I think.

Could you be a bit more descriptive with the convert_document name, perhaps convert_document_to_plaintext?

hancush · 2018-04-06T17:14:07Z

councilmatic_core/management/commands/convert_attachment_text.py

+    def add_plain_text(self):
+        plain_text_results = self.convert_document()
+
+        self.connection.execute("SET local timezone to '{}'".format(settings.TIME_ZONE))


Why do we have to do this twice? Could we do it once, at the top of the handle method?

hancush · 2018-04-06T17:15:26Z

councilmatic_core/management/commands/convert_attachment_text.py

+
+        logger.info('Converting document to plain text...')
+
+        for document_data in documents:


Ditto the comment above: use for x in self.generator_method() rather than assigning the generator to a variable.

hancush · 2018-04-06T17:22:25Z

councilmatic_core/management/commands/convert_attachment_text.py

+
+        for doc_dict in plain_text_results:
+            chunk.append(doc_dict)
+            if len(chunk) == 20:


I get that you want to do these updates in batches of 20 or fewer, but this control flow feels a little bit clunky. @fgregg, is there a more streamlined way to do this? If not, @reginafcompton, maybe a comment stating intention would make this a little easier to grok. :-)

I'm pretty sure (about 90%) that

self.connection.execute(sa.text(update_statement), plain_texts)

will do the right thing if plain_texts is a generator.


hancush

This seems legit to me! This is an awesome revision! One comment inline.

hancush · 2018-04-06T20:42:22Z

councilmatic_core/management/commands/convert_attachment_text.py

+        plaintexts = self.convert_document_to_plaintext()
+
+        while True:
+            plaintexts_fetched_from_generator = list(itertools.islice(plaintexts, 20))


If @fgregg wants to backread this change just in case I've missed a subtlety, I'd be much obliged!

Sounds good @hancush ! I'll wait for final comments from @fgregg before merging.
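
For readers following along, the islice-based batching in the revision can be sketched on its own as below. This is a standalone illustration under the assumption of a batch size of 20; the helper name batches is hypothetical and the PR inlines the loop rather than using a helper.

```python
import itertools


def batches(iterable, size=20):
    """Yield lists of up to `size` items until `iterable` is exhausted."""
    iterator = iter(iterable)
    while True:
        # islice consumes at most `size` items from the shared iterator,
        # so each pass picks up where the previous one stopped.
        batch = list(itertools.islice(iterator, size))
        if not batch:
            return
        yield batch


# list(batches(range(5), 2)) returns [[0, 1], [2, 3], [4]]
```

Because the final batch is simply shorter, there is no separate step to flush a leftover partial chunk, which is what made the earlier append-and-count loop feel clunky.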

Initial pass at conversion script for LA Metro attachments

6ae8807

reginafcompton changed the title ~~Initial pass at conversion script for LA Metro attachments~~ [WIP] Initial pass at conversion script for LA Metro attachments Mar 22, 2018

Use textract to convert pdf, doc, and docx files

705fdb5

Merge branch 'master' into attachment-conversion

896da89

Working solution with textract and TempFile

373bbee

reginafcompton requested a review from hancush April 4, 2018 22:18

reginafcompton changed the title ~~[WIP] Initial pass at conversion script for LA Metro attachments~~ Conversion script for LA Metro attachments Apr 4, 2018

reginafcompton mentioned this pull request Apr 6, 2018

[META] Reports/Agendas - Ability to search attachment text Metro-Records/la-metro-councilmatic#266

Closed

hancush requested changes Apr 6, 2018


Use more context managers, and fetch from generator, rather than iterating and saving in chunks

3693b75

hancush approved these changes Apr 6, 2018


Better doc string with refactor suggestion

5077e3d

fgregg approved these changes Apr 9, 2018


reginafcompton added 5 commits April 9, 2018 10:52

Better logging and error catching for 404

8741959

Remove redundant log statement

31ce23b

Log bad response error

2b7f722

Log number of documents

55c5762

Remove count

b81a215

reginafcompton merged commit 8aa07f6 into master Apr 9, 2018

reginafcompton deleted the attachment-conversion branch April 9, 2018 18:01

hancush mentioned this pull request Jun 18, 2019

OCD model migration #240

Merged

11 tasks
