Extract terms through OCR for non-text source documents #752

Open
clementbiron opened this issue Feb 22, 2022 · 3 comments

@clementbiron

With the following declaration (the dedicated branch is at OpenTermsArchive/france-declarations@5d1c1c3):

{
  "name": "Desigual",
  "documents": {
    "Commercial Terms": {
      "fetch": "https://www.desigual.com/on/demandware.static/-/Library-Sites-DsglSharedLibrary/default/dw98507c8d/docs/legal/Footer_legal_documents/Francia/FRANCIA-Condiciones_Generales_Venta_Vfinal_FR_230321.pdf"
    },
    "Privacy Policy": {
      "fetch": "https://www.desigual.com/on/demandware.static/-/Library-Sites-DsglSharedLibrary/default/dw77e5bf6a/docs/legal/Footer_legal_documents/Francia/FRANCIA-POLITICA_DE_PRIVACIDAD_Vfinal_FR_230321.pdf"
    }
  }
}

I get an empty version for Commercial Terms and the following wrong version for Privacy Policy:


2

   -

-

3

The snapshots are good.

@MattiSG

MattiSG commented Mar 4, 2022

Unfortunately, these documents are protected: if I open the PDF and try to copy its contents, I also get only spaces. I don't think this is an issue with Open Terms Archive (or rather, with the dependency @accordproject). However, it is worth reflecting on whether we can detect this automatically and how we should handle such cases, as it is pretty much the PDF equivalent of an HTTP 403.
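One way to detect such cases automatically could be a simple heuristic on the extracted text. The sketch below is only an illustration, not how Open Terms Archive or @accordproject behave: the function name `looks_unreadable` and the threshold `min_word_chars` are hypothetical, and the threshold value is an arbitrary assumption that would need tuning on real documents.

```python
import re


def looks_unreadable(extracted_text: str, min_word_chars: int = 50) -> bool:
    """Return True when the extracted text has too few letters or digits
    to plausibly be a real terms document.

    Hypothetical helper: the default threshold of 50 is an assumption.
    """
    # \w matches Unicode letters, digits and underscores; whitespace and
    # punctuation such as "-" are not counted.
    word_chars = re.findall(r"\w", extracted_text)
    return len(word_chars) < min_word_chars


# Text resembling the Privacy Policy version reported above:
# page numbers, dashes and whitespace only.
sample = "2  \n\n \n\n   - \n\n- \n\n \n\n3\n"
print(looks_unreadable(sample))
```

A version flagged this way could be reported as inaccessible rather than recorded as an empty or meaningless document.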

@MattiSG removed the bug label on Mar 4, 2022
@MattiSG changed the title from "Empty and wrong PDF" to "Handle unreadable PDF files" on Mar 4, 2022
@martinratinaud

And for the record, it is NOT fixed by #836

Considering how fast the answer from accordproject was on the whitespace matter, I suggest we create an issue in their repo to see if they can do something about it (even though I doubt it).

@MattiSG

MattiSG commented Apr 24, 2023

The source file has been vectorised. There is indeed no text in the PDF. The only way to obtain the content would be to use OCR. This could be useful. I'll rename this issue accordingly. Please add other example cases where this would enable extraction!
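As a rough sketch of how an OCR fallback might be wired in: try regular text extraction first and call OCR only when almost nothing comes back. Everything here is hypothetical: `extract_terms`, `extract_text`, `run_ocr` and `min_chars` are illustrative names, and both the extractor and the OCR engine are passed in as plain callables so that no particular library is assumed.

```python
from typing import Callable


def extract_terms(
    pdf_bytes: bytes,
    extract_text: Callable[[bytes], str],
    run_ocr: Callable[[bytes], str],
    min_chars: int = 50,
) -> str:
    """Return the text layer of a PDF, falling back to OCR when the
    text layer is (nearly) empty.

    Hypothetical sketch: `extract_text` and `run_ocr` stand in for
    whatever extractor and OCR engine would actually be used, and the
    default `min_chars` is an arbitrary assumption.
    """
    text = extract_text(pdf_bytes)
    # Count non-whitespace characters in the extracted text layer.
    visible_chars = sum(not c.isspace() for c in text)
    if visible_chars >= min_chars:
        return text
    # Vectorised documents have no usable text layer: use OCR instead.
    return run_ocr(pdf_bytes)
```

Keeping OCR as a fallback rather than the default would leave documents that already have a text layer untouched.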

@MattiSG changed the title from "Handle unreadable PDF files" to "Extract terms through OCR for non-text source documents" on Apr 24, 2023