This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Issue with change to chardet #305

Closed
mcarans opened this issue Apr 6, 2020 · 6 comments · Fixed by #306

@mcarans
Contributor

mcarans commented Apr 6, 2020

Overview

A script failed with the new Tabulator 1.38.1, and I narrowed the cause down to the change from cchardet to chardet. For this file: https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112 cchardet has no issues, but chardet gives:

  File "/.../tabulator/parsers/csv.py", line 108, in __prepare_dialect
    sample.append(next(stream))
  File "/usr/lib/python3.8/encodings/cp1254.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 35349: character maps to <undefined>

I saw issue #265, where someone experienced the opposite: chardet works but cchardet does not. Obviously I can set things up to use cchardet, but I'd like to better understand the discrepancies you've found between chardet and cchardet.
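
For reference, here is a minimal sketch of how an error like the one in the traceback can arise. It assumes the data is really UTF-8 and that the detector guessed cp1254 instead; the sample character and the helper `try_decode` are hypothetical and are not taken from the ACLED file or from Tabulator.

```python
# Hypothetical sample: "Á" (U+00C1) encodes in UTF-8 as b'\xc3\x81',
# so its second byte is 0x81, which cp1254 leaves undefined.
data = "\u00c1".encode("utf-8")


def try_decode(raw, encoding):
    """Return the decoded text, or the UnicodeDecodeError if decoding fails."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as exc:
        return exc


utf8_result = try_decode(data, "utf-8")      # decodes cleanly
cp1254_result = try_decode(data, "cp1254")   # fails on byte 0x81

print(repr(utf8_result))
print(repr(cp1254_result))
```

If this is what happened, the bytes themselves are fine and the failure comes only from the wrong encoding guess.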


Please preserve this line to notify @roll (lead of this repository)

@roll
Member

roll commented Apr 6, 2020

Thanks I'll investigate

@roll
Member

roll commented Apr 8, 2020

@mcarans
I've fixed the size of the sample for detection of remote sources and this now works fine:

$ tabulator 'https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112'

@mcarans
Contributor Author

mcarans commented Apr 8, 2020

@roll Thanks for fixing. I just wanted to ask about the change "Limit sample size for detection if remote": if the character that caused the issue with chardet is at the beginning of the file, will there still be a difference in behaviour between chardet and cchardet?

@roll
Member

roll commented Apr 8, 2020

@mcarans
TBH it's a very confusing issue, so I'm not sure. It would be great if we could understand what went wrong and report this to chardet. Could it be a problem with the server (e.g. some weird ending byte)?

@mcarans
Contributor Author

mcarans commented Apr 8, 2020

Yes, it is indeed confusing that it works as a local file but not as a remote URL. I can only presume that the sample sent to chardet somehow differs between the local file and the remote URL.

@mcarans
Contributor Author

mcarans commented Apr 9, 2020

@roll, it is odd that chardet and cchardet give the same results when tested on the URL outside of Tabulator:

from urllib.request import urlopen
import chardet
import cchardet

rawdata = urlopen('https://api.acleddata.com/acled/read.csv?limit=0&terms=accept&iso=112').read()
print(chardet.detect(rawdata))
print(cchardet.detect(rawdata))

gives:

{'encoding': 'utf-8', 'confidence': 0.7525, 'language': ''}
{'encoding': 'UTF-8', 'confidence': 0.7524999976158142}

I'm not sure how Tabulator, prior to your fix, was using chardet in such a way that it behaved differently from cchardet on the URL, so I cannot produce a cut-down example to report against chardet.
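
One way to look for a cut-down example might be to run a detector on prefixes of different lengths and see whether the guess changes with the sample size. The sketch below is only an illustration under that assumption: `detect_by_sample_size` and the stand-in detector `utf8_or_latin1` are hypothetical names, and the stand-in does not reflect how chardet or cchardet work internally.

```python
def detect_by_sample_size(raw, sizes, detect):
    """Run `detect` on prefixes of `raw` and return {size: detected encoding}."""
    results = {}
    for size in sizes:
        result = detect(raw[:size])
        results[size] = result.get("encoding")
    return results


def utf8_or_latin1(sample):
    """Stand-in detector for illustration only: UTF-8 if it decodes, else latin-1."""
    try:
        sample.decode("utf-8")
        return {"encoding": "utf-8"}
    except UnicodeDecodeError:
        return {"encoding": "latin-1"}


# 'é' (U+00E9) occupies bytes 3 and 4, so a 4-byte prefix cuts it in half.
raw = "caf\u00e9 au lait".encode("utf-8")
report = detect_by_sample_size(raw, [3, 4, len(raw)], utf8_or_latin1)
print(report)
```

Since `chardet.detect` and `cchardet.detect` both take bytes and return a dict with an `encoding` key (as in the output above), either could be passed as `detect` in place of the stand-in, with `raw` set to the downloaded bytes.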
