DOC: Small improvements and corrections (#2631)

py-pdf · May 8, 2024 · a435eaa
1 parent a584fb5

Showing 3 changed files with 15 additions and 15 deletions.
diff --git a/docs/user/extract-text.md b/docs/user/extract-text.md
@@ -72,7 +72,7 @@ operator, operand-arguments, current transformation matrix and text matrix.
 
 ### Example 1: Ignore header and footer
 
-The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y < 720) and footer (y > 50).
+The following example reads the text of page four of [this PDF document](https://github.com/py-pdf/pypdf/blob/main/resources/GeoBase_NHNC1_Data_Model_UML_EN.pdf), but ignores the header (y > 720) and footer (y < 50).
 
 ```python
 from pypdf import PdfReader
@@ -171,17 +171,17 @@ Then there are issues where most people would agree on the correct output, but
 the way PDF stores information just makes it hard to achieve that:
 
 1. **Tables**: Typically, tables are just absolutely positioned text. In the worst
-   case, ever single letter could be absolutely positioned. That makes it hard
+   case, every single letter could be absolutely positioned. That makes it hard
    to tell where columns / rows are.
-2. **Images**: Sometimes PDFs do not contain the text as it's displayed, but
+2. **Images**: Sometimes PDFs do not contain the text as it is displayed, but
     instead an image. You notice that when you cannot copy the text. Then there
     are PDF files that contain an image and a text layer in the background.
     That typically happens when a document was scanned. Although the scanning
     software (OCR) is pretty good today, it still fails once in a while. pypdf
     is no OCR software; it will not be able to detect those failures. pypdf
     will also never be able to extract text from images.
 
-And finally there are issues that pypdf will deal with. If you find such a
+Finally there are issues that pypdf will deal with. If you find such a
 text extraction bug, please share the PDF with us so we can work on it!
 
 ### Missing Semantic Layer
@@ -196,7 +196,7 @@ find heuristics to make educated guesses, but there is no way of being certain.
 
 This is a shortcoming of the PDF file format, not of pypdf.
 
-It would be possible to apply machine learning on PDF documents to make good
+It is possible to apply machine learning on PDF documents to make good
 heuristics, but that will not be part of pypdf. However, pypdf could be used to
 feed such a machine learning system with the relevant information.
 
@@ -229,7 +229,7 @@ More information:
 Optical Character Recognition (OCR) is the process of extracting text from
 images. Software which does this is called *OCR software*. The
 [tesseract OCR engine](https://github.com/tesseract-ocr/tesseract) is the
-most commonly known Open Source OCR software.
+most commonly known open source OCR software.
 
 pypdf is **not** OCR software.
 
@@ -279,7 +279,7 @@ pypdf also has an edge when it comes to characters which are rare, e.g.
 
 ## Attempts to prevent text extraction
 
-If people who share PDF documents want to prevent text extraction, there are
+If people who share PDF documents want to prevent text extraction, they have
 multiple ways to do so:
 
 1. Store the contents of the PDF as an image

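Note on the header/footer correction in `docs/user/extract-text.md`: PDF user space normally has its origin at the bottom-left of the page, so larger `y` values are closer to the top. That is why the header is the region with `y > 720` and the footer the region with `y < 50`. The sketch below illustrates the filter with simulated visitor calls rather than a real PDF. The helper `is_body`, the two constants, and the choice to read `y` from element 5 of the text matrix are assumptions made for illustration, not necessarily what the documented example does.

```python
# Thresholds taken from the corrected sentence in the docs.
HEADER_MIN_Y = 720  # text above this y is treated as header
FOOTER_MAX_Y = 50   # text below this y is treated as footer


def is_body(y):
    """Return True when y lies strictly between footer and header (hypothetical helper)."""
    return FOOTER_MAX_Y < y < HEADER_MIN_Y


parts = []


def visitor_body(text, cm, tm, font_dict, font_size):
    # Same parameter list as a pypdf text visitor.
    # Assumption: the vertical position is element 5 of the text matrix.
    y = tm[5]
    if is_body(y):
        parts.append(text)


# Simulated visitor calls with hand-made matrices [a, b, c, d, e, f].
identity = [1, 0, 0, 1, 0, 0]
visitor_body("Page header", identity, [1, 0, 0, 1, 72, 750], None, 10)
visitor_body("Body text", identity, [1, 0, 0, 1, 72, 400], None, 10)
visitor_body("Page footer", identity, [1, 0, 0, 1, 72, 30], None, 10)

print("".join(parts))  # prints "Body text"
```

With a real document, a function like `visitor_body` would be passed as `visitor_text` to `page.extract_text(...)` instead of being called by hand.
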
diff --git a/pypdf/_doc_common.py b/pypdf/_doc_common.py
@@ -308,7 +308,7 @@ def _repr_mimebundle_(
         """
         Integration into Jupyter Notebooks.
 
-        This method returns a dictionary that maps a mime-type to it's
+        This method returns a dictionary that maps a mime-type to its
         representation.
 
         See https://ipython.readthedocs.io/en/stable/config/integrating.html
@@ -848,9 +848,9 @@ def threads(self) -> Optional[ArrayObject]:
         """
         Read-only property for the list of threads.
 
-        See §8.3.2 from PDF 1.7 spec.
+        See §12.4.3 from the PDF 1.7 or 2.0 specification.
 
-        It's an array of dictionaries with "/F" and "/I" properties or
+        It is an array of dictionaries with "/F" and "/I" properties or
         None if there are no articles.
         """
         catalog = self.root_object
@@ -1005,9 +1005,9 @@ def pages(self) -> List[PageObject]:
 
         For PdfWriter Only:
         It provides also capability to remove a page/range of page from the list
-        (through del operator)
+        (using the del operator)
         Note: only the page entry is removed. As the objects beneath can be used
-        somewhere else.
+        elsewhere.
         A solution to completely remove them - if they are not used anywhere -
         is to write to a buffer/temporary file and to load it into a new PdfWriter
         object afterwards.

diff --git a/pypdf/_page.py b/pypdf/_page.py
@@ -1996,11 +1996,11 @@ def extract_text(
         will change if this function is made more sophisticated.
 
         Arabic and Hebrew are extracted in the correct order.
-        If required an custom RTL range of characters can be defined;
+        If required a custom RTL range of characters can be defined;
         see function set_custom_rtl
 
-        Additionally you can provide visitor-methods to get informed on all
-        operations and all text-objects.
+        Additionally you can provide visitor methods to get informed on all
+        operations and all text objects.
         For example in some PDF files this can be useful to parse tables.
 
         Args: