How can I use chardet in this Python code to read files that have ANSI encoding? #277

Open
me-suzy opened this issue Mar 26, 2023 · 0 comments


me-suzy commented Mar 26, 2023

Traceback (most recent call last):
  File "D:\Convert docx to pdf.py", line 32, in <module>
    file_content = file_path.read_text(encoding='UTF-8')
  File "C:\Program Files\Python39\lib\pathlib.py", line 1133, in read_text
    return f.read()
  File "C:\Program Files\Python39\lib\codecs.py", line 322, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd2 in position 16: invalid continuation byte

I get the error above when I run the code below, which I found on the web. It converts all .docx files into PDF files.

import re
import os
from pathlib import Path
from docx import Document
from docx.shared import Inches
import sys
import chardet
from docx2pdf import convert

# The location where the files are located
input_path = r'c:\Folder7\input'
# The location where we will write the PDF files
output_path = r'c:\Folder7\output'
# Create the folder structure if it does not exist
os.makedirs(output_path, exist_ok=True)

# Check that the folder exists
directory_path = Path(input_path)
if directory_path.exists() and directory_path.is_dir():
    print(directory_path, "exists")
else:
    print(directory_path, "is invalid")
    sys.exit(1)

for file_path in directory_path.glob("*"):
    # file_path is a Path object

    print("Processing file:", file_path)
    document = Document()
    # file_path.name is the name of the file as str without the Path
    document.add_heading(file_path.name, 0)

    file_content = file_path.read_text(encoding='UTF-8')
    document.add_paragraph(file_content)

    # build the new path where we store the files
    output_file_path = os.path.join(output_path, file_path.name + ".pdf")

    document.save(output_file_path)
    print("Converted file:", file_path, "to:", output_file_path)

Can anyone update the code so that it reads all ANSI (ASCII) files with chardet and converts them to UTF-8?
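
One possible approach, as a minimal sketch rather than a definitive fix: read each file as raw bytes, try UTF-8 first, and only ask chardet for a guess when that fails. The helper name `read_text_any_encoding` and its `fallback` parameter are hypothetical and not part of the script above or of chardet. The `cp1252` fallback is an assumption: "ANSI" on Windows usually means a legacy code page, and the right one depends on the system that wrote the files. Detection is a statistical guess, so the result can be wrong for short files.

```python
from pathlib import Path

try:
    import chardet  # third-party package: pip install chardet
except ImportError:  # keep the sketch usable when chardet is not installed
    chardet = None


def read_text_any_encoding(path, fallback="cp1252"):
    """Return the text of `path`, guessing the encoding when it is not UTF-8.

    Hypothetical helper: the name and the `fallback` default are assumptions.
    """
    raw = Path(path).read_bytes()
    try:
        # Valid UTF-8 (including plain ASCII) decodes here without guessing.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    encoding = None
    if chardet is not None:
        # chardet.detect returns a dict with "encoding" and "confidence" keys;
        # "encoding" can be None when no guess is possible.
        encoding = chardet.detect(raw)["encoding"]

    try:
        # errors="replace" avoids a crash if the guess is slightly off.
        return raw.decode(encoding or fallback, errors="replace")
    except LookupError:
        # The detected name is not a codec Python knows: use the fallback.
        return raw.decode(fallback, errors="replace")
```

In the loop above, the line `file_content = file_path.read_text(encoding='UTF-8')` could then become `file_content = read_text_any_encoding(file_path)`. The returned value is an ordinary Python `str`, so no separate conversion to UTF-8 is needed before passing it to `document.add_paragraph`.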
