ArgumentError: invalid byte sequence in UTF-8 #7

Open
stephenlawrence opened this issue Nov 6, 2018 · 4 comments

Comments

@stephenlawrence

When I try to parse a document containing non-UTF-8 characters, I get this:

irb(main):283:0> content = Plaintext::Resolver.new(file, document_file.upload_content_type).text
ArgumentError: invalid byte sequence in UTF-8
from /var/deploy/sdm/web_head/shared/bundle/ruby/2.1.0/gems/activesupport-4.1.7/lib/active_support/multibyte/chars.rb:172:in `codepoints'
from /var/deploy/sdm/web_head/shared/bundle/ruby/2.1.0/gems/activesupport-4.1.7/lib/active_support/multibyte/chars.rb:172:in `compose'
from /var/deploy/sdm/web_head/shared/bundle/ruby/2.1.0/gems/plaintext-0.1.0/lib/plaintext/resolver.rb:37:in `text'
from (irb):283
from /var/deploy/sdm/web_head/shared/bundle/ruby/2.1.0/gems/railties-4.1.7/lib/rails/commands/console.rb:90:in `start'
from /var/deploy/sdm/web_head/shared/bundle/ruby/2.1.0/gems/railties-4.1.7/lib/rails/commands/console.rb:9:in `start'
from /var/deploy/sdm/web_head/shared/bundle/ruby/2.1.0/gems/railties-4.1.7/lib/rails/commands/commands_tasks.rb:69:in `console'
from /var/deploy/sdm/web_head/shared/bundle/ruby/2.1.0/gems/railties-4.1.7/lib/rails/commands/commands_tasks.rb:40:in `run_command!'
from /var/deploy/sdm/web_head/shared/bundle/ruby/2.1.0/gems/railties-4.1.7/lib/rails/commands.rb:17:in `<top (required)>'
from bin/rails:8:in `require'
from bin/rails:8:in
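
For reference, the same exception can be triggered in plain Ruby, without the gem. This is only a minimal sketch: the input string is made up for illustration (it is not taken from the failing document) and `codepoints_error` is a hypothetical helper name.

```ruby
# Returns the message of the ArgumentError raised while iterating the
# string's code points, or nil if the string can be iterated cleanly.
def codepoints_error(text)
  text.codepoints
  nil
rescue ArgumentError => e
  e.message
end

# Hypothetical input: "café" as ISO-8859-1 bytes, mislabelled as UTF-8.
broken = "caf\xE9".dup.force_encoding(Encoding::UTF_8)

puts broken.valid_encoding?    # the lone \xE9 byte is not valid UTF-8
puts codepoints_error(broken)
```

The backtrace above fails in `codepoints` in the same way, which suggests the extracted text is tagged as UTF-8 while holding bytes from another encoding.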

@jkraemer
Member

jkraemer commented Nov 8, 2018

What kind of document is that: plain text or CSV? If so, the cause is quite clear, because the PlaintextHandler assumes UTF-8 as the input encoding. We would either have to introduce a way to specify the encoding, leaving it up to the user to find out what it actually is for a given file, or find some way to make an educated guess.
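
A minimal sketch of what the "educated guess" option could look like, using only core Ruby (`String#scrub` needs Ruby 2.1 or later). This is not how Plaintext handles it; the method name `to_utf8` and the candidate list are assumptions for illustration.

```ruby
# Candidate encodings, tried in order. The list is an assumption:
# ISO-8859-1 maps every byte, so it acts as a catch-all at the end.
CANDIDATE_ENCODINGS = [
  Encoding::UTF_8,
  Encoding::Windows_1252,
  Encoding::ISO_8859_1
].freeze

# Hypothetical helper: returns a valid UTF-8 copy of +raw+.
def to_utf8(raw, candidates = CANDIDATE_ENCODINGS)
  candidates.each do |encoding|
    guess = raw.dup.force_encoding(encoding)
    next unless guess.valid_encoding?

    begin
      return guess.encode(Encoding::UTF_8)
    rescue EncodingError
      next # bytes are undefined in this encoding, try the next one
    end
  end

  # No candidate fit: keep what is readable, replace invalid bytes.
  raw.dup.force_encoding(Encoding::UTF_8).scrub('?')
end
```

Such a guess can be wrong, because single-byte encodings accept almost any input, so an explicit encoding option would still be useful alongside it.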

@stephenlawrence
Author

@jkraemer In this particular case it's an MS Word document. It also happens with PDFs.

@stephenlawrence
Author

I got this with a TIFF file as well:

content = Plaintext::Resolver.new(file, df.upload_content_type).
Tesseract Open Source OCR Engine v3.03 with Leptonica
Cannot open input file: -dutf-8

@jkraemer
Member

Would you mind providing sample files to reproduce these errors (the UTF-8 error as well as the Tesseract error)?
