Add chunk_size + chunk_steps parameters and iterate by chunks by default #284

Open: wants to merge 8 commits into base: main

deedy5 (Contributor) commented Sep 12, 2023

Add a chunk_size parameter and iterate by chunks by default to improve performance and reduce memory consumption.

Currently, detect() processes all incoming bytes at once, which degrades performance and increases memory consumption on large files.
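To illustrate the idea of iterating by chunks, here is a minimal sketch. It is not the code from this pull request: the helper names `iter_chunks` and `feed_in_chunks`, the `feed` and `is_done` callbacks, and the 65,536-byte default are all assumptions made for illustration, and the sketch does not cover chunk_steps.

```python
from typing import Callable, Iterator


def iter_chunks(data: bytes, chunk_size: int = 65_536) -> Iterator[bytes]:
    """Yield consecutive slices of `data`, each at most `chunk_size` bytes.

    The default chunk size is an arbitrary, hypothetical value.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    view = memoryview(data)  # slicing a memoryview does not copy the buffer
    for start in range(0, len(view), chunk_size):
        # Only one chunk is materialised as bytes at a time.
        yield bytes(view[start:start + chunk_size])


def feed_in_chunks(
    data: bytes,
    feed: Callable[[bytes], object],
    is_done: Callable[[], bool],
    chunk_size: int = 65_536,
) -> int:
    """Feed `data` to an incremental consumer chunk by chunk.

    Stops as soon as `is_done()` reports that the consumer has seen enough.
    Returns the number of chunks that were actually fed.
    """
    chunks_fed = 0
    for chunk in iter_chunks(data, chunk_size):
        feed(chunk)
        chunks_fed += 1
        if is_done():
            break  # skip the rest of the input
    return chunks_fed
```

With chardet's incremental `UniversalDetector`, the callbacks could plausibly be `detector.feed` and `lambda: detector.done`, followed by `detector.close()` to obtain the result. Stopping early is where the savings on large files would come from, since the remaining bytes are never examined.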


memory-profiler results:

mprof run bench.py
mprof plot

before: [mprof plot image]
after: [mprof plot image]


bench.py results
before

Calls per second for each encoding:
ascii: 2447.037591645437
big5: 35.51794139326065
cp932: 12.423507221195562
cp949: 21.244845560559376
euc-jp: 12.328510174196076
euc-kr: 35.59465724220593
euc-tw: 272.6406656266251
gb2312: 39.814863859931705
ibm855: 73.3086160122448
ibm866: 90.1158594541804
iso-2022-jp: 2349.7501400560222
iso-2022-kr: 1553.2158198785364
iso-8859-1: 254.33820163058144
iso-8859-2-croatian: 82.71810042204078
iso-8859-2-czech: 209.77918265070846
iso-8859-2-polish: 126.04175243636014
iso-8859-2-slovak: 117.66268267743153
iso-8859-2-slovene: 108.08585425186766
iso-8859-5: 88.61599269306654
iso-8859-5-russian: 94.23265836359697
iso-8859-7: 157.81893267349142
iso-8859-9: 211.43682434220642
johab: 40.16530999134635
koi8-r: 83.07685329510834
maccyrillic: 83.47851823198715
macroman: 0.39496194284516384
shift_jis: 13.63912487563005
tis-620: 38.75953227639994
utf-16: 128857.26574500768
utf-16be: 631.3965286245465
utf-16le: 288.9007208218678
utf-32: 116025.00691562932
utf-32be: 309.48448815905493
utf-32le: 246.41067819711427
utf-8: 234.9924054529762
utf-8-sig: 103948.05452292442
windows-1250-croatian: 82.62115019058218
windows-1250-czech: 217.7473555789173
windows-1250-polish: 87.21616771572168
windows-1250-romanian: 134.06200157896586
windows-1250-slovak: 150.2474927729531
windows-1250-slovene: 184.95116390847477
windows-1251: 97.17230326728034
windows-1251-russian: 115.7745791769917
windows-1252: 208.4121033239669
windows-1255: 37.29806542149806

Total time: 122.76112413406372s (32.09485110039596 calls per second)

after

Calls per second for each encoding:
ascii: 2523.4967811804345
big5: 47.77261665350186
cp932: 12.921194028134163
cp949: 32.84667805853709
euc-jp: 14.983235589751134
euc-kr: 37.89441492031874
euc-tw: 260.94852954900364
gb2312: 55.766112549371
ibm855: 94.39026970015895
ibm866: 107.44998000367193
iso-2022-jp: 2339.3965084499973
iso-2022-kr: 4068.1901066925316
iso-8859-1: 262.001263894774
iso-8859-2-croatian: 63.82984810692931
iso-8859-2-czech: 143.21067070932628
iso-8859-2-polish: 102.61520131330109
iso-8859-2-slovak: 150.89867064812648
iso-8859-2-slovene: 109.10676288037625
iso-8859-5: 85.80995233283325
iso-8859-5-russian: 118.37714087523933
iso-8859-7: 149.9144025015266
iso-8859-9: 214.75716125332494
johab: 76.1939917562405
koi8-r: 120.28085292845729
maccyrillic: 125.27358324156383
macroman: 8.742236720694109
shift_jis: 14.03391495541973
tis-620: 38.43586941182292
utf-16: 156212.43947858474
utf-16be: 676.8445258478098
utf-16le: 681.3525345809271
utf-32: 145383.15424610052
utf-32be: 451.42272878929754
utf-32le: 449.3263271788444
utf-8: 287.8447486209477
utf-8-sig: 125766.23688155922
windows-1250-croatian: 85.22808136924006
windows-1250-czech: 208.2191454400135
windows-1250-polish: 115.77839732795981
windows-1250-romanian: 110.24123049102943
windows-1250-slovak: 235.30017203979355
windows-1250-slovene: 131.3716916716259
windows-1251: 99.52125476908557
windows-1251-russian: 139.13406383054132
windows-1252: 208.8126075040637
windows-1255: 71.8703884796382

Total time: 83.2015163898468s (47.354906147850016 calls per second)
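The two totals can be compared directly. The short script below uses only figures copied from the output above and computes the overall speedup, which comes to roughly 1.48x, plus the per-encoding ratio for two sample encodings. The variable names are made up for this illustration. Note that the per-encoding results are mixed: macroman improves a lot, while iso-8859-2-czech is slower after the change.

```python
# Totals copied from the bench.py output above (seconds).
before_total_s = 122.76112413406372
after_total_s = 83.2015163898468

speedup = before_total_s / after_total_s          # > 1 means "after" is faster
time_saved = 1 - after_total_s / before_total_s   # fraction of wall time saved

# Calls per second for two sample encodings: (before, after).
samples = {
    "macroman": (0.39496194284516384, 8.742236720694109),
    "iso-8859-2-czech": (209.77918265070846, 143.21067070932628),
}
# Ratio after/before: above 1 is an improvement, below 1 a regression.
ratios = {name: after / before for name, (before, after) in samples.items()}

print(f"overall speedup: {speedup:.2f}x, time saved: {time_saved:.1%}")
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```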

@deedy5 deedy5 changed the title Add chunk_size parameter and iterate by chunks by default Add chunk_size + chunk_steps parameters and iterate by chunks by default Sep 12, 2023