Skip to content

tomrule007/pdf-text.js

Repository files navigation

pdf-text.js GitHub license PRs Welcome

A JavaScript frontend cross-browser compatible utility to extract text from pdf documents into useable organized data objects.

The user provides a pdf file (data buffer or url) and an optional template config. Without the template config the function will return an array of character objects with the text, the coordinates (x,y) and width measurement of text. With a template file the user can validate the parsing and pull out blocks of text and table data.

Current state (of the demo site )

  • Allows a pdf file to be selected
  • applies hardcoded sampleTables.pdf template to pull table data.
  • creates 3 html tables demonstrating correct parsing of sampleTable.pdf table data.
  • handles merging multi-line cells
  • renders the pdf canvas view with moveable box object. (shift clicking the box adds a column line)
  • shows the moveable box template config
  • creates a list of all text items inside the box with their coordinates
  • draws box around each character location
  • stores character location
  • uses pdf.js getTextContent to attempt to generate a html table of file.

ToDo

  • Readme
    • add installation instructions
    • add how to use instructions
  • create template generator to create:
    • table bounds
    • column bounds
    • validation boxes
  • create tests
  • clean up project
  • correct demo page text
  • create issues and project board

Technologies used 🛠️

  • React - Front-End JavaScript library
  • pdf.js - PDF Reader in JavaScript

Authors

License 📄

This project is licensed under the MIT License - see the LICENSE file for details

Available Scripts

In the project directory, you can run:

yarn start

Runs the app in the development mode.
Open http://localhost:3000 to view it in the browser.

About

Extract text from pdf documents into useable organized data objects.

Resources

License

Code of conduct

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published