DOC: Small improvements and corrections (#2631)
j-t-1 committed May 8, 2024
1 parent a584fb5 commit a435eaa
Showing 3 changed files with 15 additions and 15 deletions.
14 changes: 7 additions & 7 deletions docs/user/extract-text.md
@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.

### Example 1: Ignore header and footer

-The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y < 720) and footer (y > 50).
+The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y > 720) and footer (y < 50).
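
The corrected bounds mean that text is kept only when 50 < y < 720. As a minimal sketch of that filter in isolation (not necessarily the code in the documented example), the hypothetical helper `make_body_visitor` below builds a callback with the `(text, cm, tm, font_dict, font_size)` shape that `extract_text(visitor_text=...)` calls. Reading the vertical position from `cm[5]` is an assumption; in some documents it may sit in the text matrix `tm` instead. `extract_body` is likewise an illustrative name.

```python
from typing import Callable, List


def make_body_visitor(
    parts: List[str], footer_y: float = 50, header_y: float = 720
) -> Callable[..., None]:
    """Build a visitor that keeps text strictly between footer_y and header_y."""

    def visitor_body(text, cm, tm, font_dict, font_size):
        # Assumption: element 5 of the 6-element matrix is the vertical offset.
        y = cm[5]
        if footer_y < y < header_y:
            parts.append(text)

    return visitor_body


def extract_body(pdf_path: str, page_index: int = 3) -> str:
    """Extract the text of one page (default: page four) without header/footer."""
    # Imported lazily so the filter above can be exercised without pypdf.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    parts: List[str] = []
    reader.pages[page_index].extract_text(visitor_text=make_body_visitor(parts))
    return "".join(parts)
```

Passing the thresholds as parameters keeps the page-specific numbers (50 and 720) out of the callback, so the same visitor can be reused for pages with a different layout.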

```python
from pypdf import PdfReader
@@ -171,17 +171,17 @@ Then there are issues where most people would agree on the correct output, but
the way PDF stores information just makes it hard to achieve that:

1. **Tables**: Typically, tables are just absolutely positioned text. In the worst
-case, ever single letter could be absolutely positioned. That makes it hard
+case, every single letter could be absolutely positioned. That makes it hard
to tell where columns / rows are.
-2. **Images**: Sometimes PDFs do not contain the text as it's displayed, but
+2. **Images**: Sometimes PDFs do not contain the text as it is displayed, but
instead an image. You notice that when you cannot copy the text. Then there
are PDF files that contain an image and a text layer in the background.
That typically happens when a document was scanned. Although the scanning
software (OCR) is pretty good today, it still fails once in a while. pypdf
is no OCR software; it will not be able to detect those failures. pypdf
will also never be able to extract text from images.

-And finally there are issues that pypdf will deal with. If you find such a
+Finally there are issues that pypdf will deal with. If you find such a
text extraction bug, please share the PDF with us so we can work on it!

### Missing Semantic Layer
@@ -196,7 +196,7 @@ find heuristics to make educated guesses, but there is no way of being certain.

This is a shortcoming of the PDF file format, not of pypdf.

-It would be possible to apply machine learning on PDF documents to make good
+It is possible to apply machine learning on PDF documents to make good
heuristics, but that will not be part of pypdf. However, pypdf could be used to
feed such a machine learning system with the relevant information.

@@ -229,7 +229,7 @@ More information:
Optical Character Recognition (OCR) is the process of extracting text from
images. Software which does this is called *OCR software*. The
[tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) is the
-most commonly known Open Source OCR software.
+most commonly known open source OCR software.

pypdf is **not** OCR software.

@@ -279,7 +279,7 @@ pypdf also has an edge when it comes to characters which are rare, e.g.

## Attempts to prevent text extraction

-If people who share PDF documents want to prevent text extraction, there are
+If people who share PDF documents want to prevent text extraction, they have
multiple ways to do so:

1. Store the contents of the PDF as an image
10 changes: 5 additions & 5 deletions pypdf/_doc_common.py
@@ -308,7 +308,7 @@ def _repr_mimebundle_(
"""
Integration into Jupyter Notebooks.
-This method returns a dictionary that maps a mime-type to it's
+This method returns a dictionary that maps a mime-type to its
representation.
See https://ipython.readthedocs.io/en/stable/config/integrating.html
@@ -848,9 +848,9 @@ def threads(self) -> Optional[ArrayObject]:
"""
Read-only property for the list of threads.
-See §8.3.2 from PDF 1.7 spec.
+See §12.4.3 from the PDF 1.7 or 2.0 specification.
-It's an array of dictionaries with "/F" and "/I" properties or
+It is an array of dictionaries with "/F" and "/I" properties or
None if there are no articles.
"""
catalog = self.root_object
@@ -1005,9 +1005,9 @@ def pages(self) -> List[PageObject]:
For PdfWriter Only:
It provides also capability to remove a page/range of page from the list
-(through del operator)
+(using the del operator)
Note: only the page entry is removed. As the objects beneath can be used
-somewhere else.
+elsewhere.
A solution to completely remove them - if they are not used anywhere -
is to write to a buffer/temporary file and to load it into a new PdfWriter
object afterwards.
6 changes: 3 additions & 3 deletions pypdf/_page.py
@@ -1996,11 +1996,11 @@ def extract_text(
will change if this function is made more sophisticated.
Arabic and Hebrew are extracted in the correct order.
-If required an custom RTL range of characters can be defined;
+If required a custom RTL range of characters can be defined;
see function set_custom_rtl
-Additionally you can provide visitor-methods to get informed on all
-operations and all text-objects.
+Additionally you can provide visitor methods to get informed on all
+operations and all text objects.
For example in some PDF files this can be useful to parse tables.
Args: