Improved document conversion

Current setup

Office and PDF documents are currently converted to HTML, using wvHtml, xlhtml, pdftohtml, and unrtf. Other formats are not currently supported for inline viewing / indexing.

Documents are converted when the "as HTML" link is followed. The converted text is cached on the filesystem, so subsequent requests for the same link are served directly from the cache.
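The convert-on-first-request behaviour can be sketched roughly as follows. This is a minimal illustration, not the actual Alaveteli code: the function name `cached_html`, the content-hash cache key, and the `convert` callable (standing in for a tool such as wvHtml or pdftohtml) are all assumptions made for the example.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def cached_html(source: bytes, convert: Callable[[bytes], str], cache_dir: Path) -> str:
    """Return the HTML for `source`, running `convert` only on a cache miss."""
    # Assumption: the cache is keyed on the attachment's content hash,
    # so identical attachments share a single cache entry.
    key = hashlib.sha256(source).hexdigest()
    cache_file = cache_dir / f"{key}.html"
    if cache_file.exists():
        # Cache hit: serve the stored conversion without re-running the converter.
        return cache_file.read_text(encoding="utf-8")
    html = convert(source)
    cache_dir.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(html, encoding="utf-8")
    return html


if __name__ == "__main__":
    calls = []

    def fake_convert(data: bytes) -> str:
        # Hypothetical stand-in for an external converter.
        calls.append(data)
        return "<p>" + data.decode("utf-8") + "</p>"

    with tempfile.TemporaryDirectory() as tmp:
        first = cached_html(b"hello", fake_convert, Path(tmp))
        second = cached_html(b"hello", fake_convert, Path(tmp))
        print(first == second, "conversions run:", len(calls))
```

The second call returns the stored file, so the converter runs only once per distinct attachment.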

The system works pretty well, except for occasional bugs in the conversion software, which cause hangs or empty HTML versions of the document. It doesn't particularly need upgrading, as it works well enough; this page is just to record a possible improvement.
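One way to contain hangs and empty output, whichever converter is used, is to run it with a timeout and treat blank output as a failure. The sketch below is an assumption about how that could be done, not a description of the existing code; `run_converter` is a hypothetical helper, and the demo uses the Python interpreter itself as a stand-in for a real converter command.

```python
import subprocess
import sys
from typing import Optional, Sequence


def run_converter(command: Sequence[str], timeout_seconds: float) -> Optional[str]:
    """Run a converter; return its stdout, or None on hang, failure, or empty output."""
    try:
        result = subprocess.run(
            list(command),
            capture_output=True,
            timeout=timeout_seconds,
            check=False,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising, so a hung
        # converter does not linger.
        return None
    if result.returncode != 0:
        return None
    text = result.stdout.decode("utf-8", errors="replace")
    # Treat whitespace-only output as a failed conversion.
    return text if text.strip() else None


if __name__ == "__main__":
    # Stand-in commands: a "converter" that works, and one that hangs.
    ok = run_converter([sys.executable, "-c", "print('<p>ok</p>')"], 10)
    hung = run_converter([sys.executable, "-c", "import time; time.sleep(5)"], 0.5)
    print(ok, hung)
```

A caller that receives `None` can fall back to offering the original file for download instead of caching a broken conversion.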

Problems

Conversion software bugs leading to corrupt data
Conversions are not pretty (e.g. images in documents are not preserved nicely; see samples below)
Quite limited range of supported source conversions (but 99% of those used are supported, i.e. doc and pdf)
No ability to annotate inline

Alternative

There's an alternative system used by US FOI site MuckRock, which displays the documents in a nice viewer.

Their system uses DocumentCloud (documentcloud.org), a source-document system for journalists. The software is open source and available at https://github.com/documentcloud. Alternatively, we could use the hosted DocumentCloud service, which (currently, at least) is free.

See also the discussion on the Alaveteli dev mailing list

The main components are:

docsplit, a Ruby frontend for OpenOffice (document conversion), Tesseract (OCR), pdftk (splitting a single PDF into one PDF per page), and GraphicsMagick (thumbnails/images of pages)
DocumentViewer from NYT. Most importantly, supports annotations on the document (e.g. this senate bill)
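A local pipeline built on docsplit might chain three steps: convert to PDF, extract text, and render page images. The sketch below only builds the command lines and does not run them. The subcommands (`pdf`, `text`, `images`) and flags (`--output`, `--size`, `--format`) are as recalled from docsplit's documentation and should be checked against the installed version; the function name, the `700x` page width, and the assumption that the PDF is written to the output directory under the source file's base name are illustrative.

```python
from pathlib import PurePosixPath
from typing import List


def docsplit_commands(source: str, output_dir: str) -> List[List[str]]:
    """Build (but do not run) docsplit command lines for one document."""
    # Assumption: `docsplit pdf` writes <output_dir>/<basename>.pdf.
    pdf = str(PurePosixPath(output_dir) / (PurePosixPath(source).stem + ".pdf"))
    return [
        # 1. Convert the office document to PDF (via OpenOffice).
        ["docsplit", "pdf", source, "--output", output_dir],
        # 2. Extract text from the PDF (OCR is used where no text layer exists).
        ["docsplit", "text", pdf, "--output", output_dir],
        # 3. Render page images, e.g. for thumbnails in a request thread.
        ["docsplit", "images", pdf, "--size", "700x", "--format", "png",
         "--output", output_dir],
    ]


if __name__ == "__main__":
    for command in docsplit_commands("uploads/response.doc", "cache/response"):
        print(" ".join(command))
```

Each command list could then be passed to `subprocess.run` with a timeout, so that a stuck OpenOffice process does not block the request.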

Benefits:

Much nicer-looking conversions
Reasonably good interface for navigating around documents
OCRed text wherever text extraction is not possible
Full support for all supported OpenOffice formats
All documents converted to PDF as part of process
Annotations possible
Could use the DocumentCloud service, thus dramatically reducing maintenance and hosting overheads
Thumbnails of documents suitable for including in request thread (see below)

Presentation of a document as a thumbnail within a request thread, with download links below, and a summary next to it

Comparison of OpenOffice (left) and wvHtml conversions

Drawbacks:

Requires new code
If we implement locally, likely to be higher processing overheads (needs a running headless OpenOffice, and always requires PDF extraction step)
Is it indexable by search engines? The NYT blog post promises to fix this.
