t3as-pdf

Introduction

This project augments itext with enhanced text extraction and new redaction capability.

The enhanced text extraction includes:

more accurate text placement using floating point comparisons with a specified tolerance rather than truncation to int and exact comparison; this corrects the placement of list bullet marks;
monitoring font size changes and vertical white space to generate a blank line after headings and between paragraphs; this helps with human readability and may help with NLP analysis of the text.
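
The difference between the two comparison styles can be illustrated with a minimal sketch. The object name, method names and the tolerance value below are hypothetical and are not taken from this project's source:

```scala
object BaselineCompare {
  // Hypothetical tolerance in PDF user-space units; the value used by the project is not stated here.
  val Tolerance = 0.5f

  // Truncation style: convert to Int and compare exactly.
  def sameLineTruncated(y1: Float, y2: Float): Boolean = y1.toInt == y2.toInt

  // Tolerance style: baselines closer than the tolerance count as the same line.
  def sameLineTolerant(y1: Float, y2: Float, tol: Float = Tolerance): Boolean =
    math.abs(y1 - y2) <= tol
}
```

With a bullet mark at y = 99.9 and its text at y = 100.1, truncation yields 99 and 100, so the two are placed on different lines, while the tolerance comparison keeps them together.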

The redaction capability provides:

removal of text specified as character offsets into the extracted text
removal of XMP metadata
removal of PDF annotations, which can store URLs of linked documents and email addresses
replacement of native PDF metadata with Producer = "iText® 5.5.2 ©2000-2014 iText Group NV (AGPL-version)" (this is a requirement of itext's AGPL license) and Creator = "Redact v0.1 ©2014 NICTA (AGPL)"

See also: t3as-redact which depends on this project.

Install Tools

To run the code install:

a JRE e.g. from openjdk-7 (preferred for Scala 2.11) or openjdk-8 (will be required for Scala 2.12);
the build tool sbt.

To develop Scala code install:

the above items (or the full JDK instead of just the JRE, but the JRE should be sufficient);
Eclipse IDE; and
the Scala plugin for Eclipse scala-ide.

Build

Automatic builds are provided at: https://social-watch.dev.etd.nicta.com.au/.

The command:

sbt clean test package oneJar publishLocal dumpLicenseReport

from the project's top level directory:

cleans out previous build products,
runs all tests,
creates a jar file (project code),
creates a one-jar file (project code with all 3rd party dependencies),
publishes build products to the Ivy repository at ~/.ivy2/ (allowing other projects to depend on this one) and
generates a license report.

Develop With Eclipse

The command:

sbt update-classifiers eclipse

downloads source code for 3rd party dependencies and uses the sbteclipse plugin to create the .project and .classpath files required by Eclipse.

Unit Tests in Eclipse

This section describes further configuration required to run unit tests within Eclipse. This is not necessary for running tests with sbt (as shown in the Build section above).

Our unit tests rely on picking up properties files preferentially from src/test/resources ahead of src/main/resources, but unfortunately sbteclipse does not place target/scala-2.11/test-classes (the destination for src/test/resources) ahead of target/scala-2.11/classes (the destination for src/main/resources) in the classpath. This has to be corrected in Eclipse's Build Path for unit tests to pass (move all src/test items ahead of all src/main items).

Release

The command:

sbt release

uses the sbt-release plugin to perform the fairly long (and otherwise error prone) sequence of steps required to properly release a version of the software.

Run

To run the CLI from sbt:

sbt
> run --help

To run the CLI from the one-jar:

java -jar target/scala-2.11/t3as-pdf_{scala-version}-{project-version}-one-jar.jar --help

Software Description

In the text below:

itext Java means we're talking about Java code provided by itext
new Java/Scala means we're talking about Java/Scala code provided by this project

StreamProcessor

itext Java: com.itextpdf.text.pdf.parser.PdfContentStreamProcessor
tokenizes a binary PDF content stream, calling a listener for each parsed PDF operator

new Java: com.itextpdf.text.pdf.parser.MyPdfContentStreamProcessor (needs package access)
This is a copy, not a subclass. It's a big source file and I've made minimal changes: some private fields changed to protected, and protected getFontResourceName/push/pop methods added to support …

new Scala: org.t3as.pdf.RedactionStreamProcessor extends MyPdfContentStreamProcessor
provides a way for the listener to obtain the start and end offsets into the binary stream that correspond to the PDF operator being processed

TextExtractionStrategy (a listener for a StreamProcessor)

itext Java: com.itextpdf.text.pdf.parser.TextExtractionStrategy
tracks coordinate transformations to calculate where text appears on the page

text chunks with close to the same baseline are assumed to be on the same line
a gap from the end of previous text is assumed to be a space
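
As a rough illustration of the gap rule, the sketch below joins the text runs of one line and inserts a space wherever the horizontal gap is wide enough. The Run type, the joinLine name and the half-space-width threshold are assumptions made for this example, not itext's actual code:

```scala
object GapSpaces {
  // A run of text on one line: its horizontal extent and the width of a space in its font.
  final case class Run(text: String, startX: Float, endX: Float, charSpaceWidth: Float)

  // Join runs left to right, inserting a space where the gap exceeds half a space
  // width (hypothetical threshold) and neither side already supplies a space.
  def joinLine(runs: Seq[Run]): String = {
    val ordered = runs.sortWith(_.startX < _.startX)
    val sb = new StringBuilder
    ordered.headOption.foreach(r => sb.append(r.text))
    for ((prev, cur) <- ordered.zip(ordered.drop(1))) {
      val gap = cur.startX - prev.endX
      val alreadySpaced = prev.text.endsWith(" ") || cur.text.startsWith(" ")
      if (gap > prev.charSpaceWidth / 2 && !alreadySpaced) sb.append(' ')
      sb.append(cur.text)
    }
    sb.toString
  }
}
```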

new Scala: org.t3as.pdf.MyExtractionStrategy extends TextExtractionStrategy

improves text placement (float rather than int coords improves same line detection - handles bullet point placement)
better gap detection
added blank line detection
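
Blank line detection could look roughly like the sketch below, which emits an extra newline when the font height changes or the vertical gap is large. The Line type, the render name and both thresholds are hypothetical; the values MyExtractionStrategy actually uses are not given in this document:

```scala
object BlankLines {
  // One assembled line of text: baseline y (PDF y grows upward) and font height.
  final case class Line(text: String, baselineY: Float, fontHeight: Float)

  // Hypothetical thresholds, chosen only for illustration.
  val GapFactor = 1.5f      // gap above this many font heights is treated as a paragraph break
  val FontChangeTol = 0.1f  // font height difference above this is treated as a heading/body change

  def render(lines: Seq[Line]): String = {
    val sb = new StringBuilder
    lines.headOption.foreach(l => sb.append(l.text))
    for ((prev, cur) <- lines.zip(lines.drop(1))) {
      val gap = prev.baselineY - cur.baselineY
      val fontChanged = math.abs(prev.fontHeight - cur.fontHeight) > FontChangeTol
      val bigGap = gap > GapFactor * math.max(prev.fontHeight, cur.fontHeight)
      sb.append("\n")
      if (fontChanged || bigGap) sb.append("\n") // the extra newline produces the blank line
      sb.append(cur.text)
    }
    sb.toString
  }
}
```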

During parsing, for each text chunk it saves:

ExtendedChunk(text: String, startLocation: Vector, endLocation: Vector, charSpaceWidth: Float, fontHeight: Float, streamOffset: StreamOffset(start: Long, end: Long))

Chunks can occur in any order in the stream. Only when parsing is complete can we work out their order on the page, and from that how to map between text offsets and stream offsets. This is done by the result method.
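
A simplified sketch of that final step follows. It uses a cut-down Chunk (a single x/y position instead of the Vector fields of ExtendedChunk), compares y values exactly rather than with a tolerance, and ignores inserted spaces and newlines. The names Chunk, Placed, place and streamOffsets are hypothetical:

```scala
object OffsetMapping {
  final case class StreamOffset(start: Long, end: Long)
  // Simplified stand-in for ExtendedChunk.
  final case class Chunk(text: String, x: Float, y: Float, streamOffset: StreamOffset)
  // A chunk plus the half-open range [textStart, textEnd) it occupies in the extracted text.
  final case class Placed(chunk: Chunk, textStart: Int, textEnd: Int)

  // Order chunks top to bottom (PDF y grows upward), then left to right,
  // and assign each one its range of offsets in the concatenated text.
  def place(chunks: Seq[Chunk]): Seq[Placed] = {
    val ordered = chunks.sortWith((a, b) => a.y > b.y || (a.y == b.y && a.x < b.x))
    val starts = ordered.scanLeft(0)((off, c) => off + c.text.length)
    ordered.zip(starts).map { case (c, s) => Placed(c, s, s + c.text.length) }
  }

  // Stream ranges of every chunk that overlaps the text range [start, end).
  def streamOffsets(placed: Seq[Placed], start: Int, end: Int): Seq[StreamOffset] =
    placed.filter(p => p.textStart < end && start < p.textEnd).map(_.chunk.streamOffset)
}
```

Note that a text range covering only part of a chunk still maps to that chunk's whole stream range in this sketch.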

Copy/Redact

itext Java: com.itextpdf.text.pdf.PdfCopy
copies pages from input files to output file

new Scala: org.t3as.pdf.PdfCopyRedact extends PdfCopy
overrides

getImportedPage: uses RedactionStreamProcessor and MyExtractionStrategy.
- input is text offsets: (page: Int, start: Int, end: Int)
- output is: MyResult which provides methods to convert from text offsets to binary content stream offsets
copyDictionary: avoids copying dictionaries of type ANNOT (these contain sticky notes, link URIs to email addresses, internal/external links etc.)
copyStream: if the stream is the main content stream, a redacted copy is stored; otherwise the input stream is used unmodified
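
One way to produce a redacted copy, once the binary stream offsets are known, is to copy the content stream while skipping the byte ranges to be removed. The sketch below shows only that idea; the redact name and the (start, end) tuple representation are assumptions, and PdfCopyRedact may rewrite the stream differently:

```scala
object StreamRedact {
  // Return a copy of `content` without the bytes in the given half-open ranges [start, end).
  // Ranges are assumed to lie within the array; they may be unordered or overlapping.
  def redact(content: Array[Byte], ranges: Seq[(Int, Int)]): Array[Byte] = {
    val out = new java.io.ByteArrayOutputStream(content.length)
    var pos = 0 // first byte not yet copied or skipped
    for ((start, end) <- ranges.sortBy(_._1)) {
      if (start > pos) out.write(content, pos, start - pos)
      pos = math.max(pos, end)
    }
    if (pos < content.length) out.write(content, pos, content.length - pos)
    out.toByteArray
  }
}
```

For example, removing bytes 14 to 25 of "BT (Hello) Tj (secret) Tj ET" drops the second text-showing operator and its operand, leaving "BT (Hello) Tj ET".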

Stream

itext Java: com.itextpdf.text.pdf.PRStream
copies the content stream to the output PDF

new Scala: com.itextpdf.text.pdf.MyPRStream extends PRStream (needs package access)
if a redacted copy of the stream has been stored, writes that instead of the input content stream

Legal

This software is released under the terms of the AGPL. Source code for all transitive dependencies is available at t3as-legal.
