
[Proposal] Add module creation with mypyc to speed up #182

Closed
deedy5 opened this issue Apr 29, 2022 · 20 comments
Labels
enhancement New feature or request

Comments

deedy5 (Contributor) commented Apr 29, 2022

Hello.
I ran some tests to find bottlenecks and speed up the package.
The easiest option, since you are already using mypy, is to compile the module during installation using mypyc.
In this case the speedup is about 2x.
Here are the results of the tests using your bin/performance.py file:

------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.03485252343844548s
   --> 99th: 0.2629306570015615s
   --> 95th: 0.14874039799906313s
   --> 50th: 0.02182378301222343s
------------------------------
--> Charset-Normalizer_m Conclusions (Charset-Normalizer, compiled with mypyc )
   --> Avg: 0.01605459922575392s
   --> 99th: 0.12211546800972428s
   --> 95th: 0.06977643301070202s
   --> 50th: 0.009204783011227846s
------------------------------
--> Chardet Conclusions
   --> Avg: 0.12291852888552735s
   --> 99th: 0.6617688919941429s
   --> 95th: 0.17344348499318585s
   --> 50th: 0.023028297000564635s
------------------------------
--> Cchardet Conclusions
   --> Avg: 0.003174804929368931s
   --> 99th: 0.04868195200106129s
   --> 95th: 0.008641656007966958s
   --> 50th: 0.0005420649977168068s

test_log.txt
I think the speedup would be greater if all functions were annotated.
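For illustration only (a sketch, not code from charset_normalizer): mypyc uses the annotations to work with native, unboxed values, so a fully annotated function typically compiles to much faster code than an unannotated one.

def count_ascii_untyped(payload):
    # without annotations, mypyc must treat `payload` and `count` as generic Python objects
    count = 0
    for byte in payload:
        if byte < 0x80:
            count += 1
    return count


def count_ascii_typed(payload: bytes) -> int:
    # with annotations, mypyc can use native integer operations for the counter and the comparison
    count = 0
    for byte in payload:
        if byte < 0x80:
            count += 1
    return count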

deedy5 added the enhancement label Apr 29, 2022
deedy5 closed this as completed May 1, 2022
deedy5 (Contributor, Author) commented May 1, 2022

#183

Ousret (Owner) commented May 1, 2022

You went a bit too fast. I am reopening that thread.

The idea is tempting but needs a thorough analysis of its impacts.
First of all, I have never used mypyc, so I would need to catch up a bit on the subject (already started.)

Here are some major subjects we have to care about.

Dropping Python 3.5

There is a high chance that we will drop Python 3.5 and all the specific code associated with it.
I am in favor of dropping its support BEFORE attempting this optimization.

Inherent risks

Compiling the package means providing ready-to-use wheels, which is not a big problem using qemu and the like.
But are we capable of falling back to native Python code in case a user's architecture is not served?

Sub-package

Maybe, this should be published under another package name? To be discussed.

Types

I don't think the package has "perfect" typing, so a PR should address the remaining cases using strict mode. It should not be difficult to do.

Task ahead

  • Dropping Python 3.5
  • Improving typing
  • Making a solid proof of concept
  • Decide whether it should be published under a different package name (answer: no)
  • Writing the required actions (GHA) according to our needs
  • Heavy testing on every supported Python version (3.6 to 3.11)

Ousret reopened this May 1, 2022
deedy5 (Contributor, Author) commented May 1, 2022

Let's wait for the drop of Python 3.5

deedy5 (Contributor, Author) commented May 5, 2022

I ran some more tests to see how mypyc compilation affects performance.

mypyc_performance.xlsx

performance1.py
#!/bin/python
from glob import glob
from time import time_ns
import argparse
from sys import argv
from os.path import isdir

from charset_normalizer import detect
from chardet import detect as chardet_detect

from statistics import mean
from math import ceil


def calc_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    sorted_data = sorted(data)

    return sorted_data[int(p)] if p.is_integer() else sorted_data[int(ceil(p)) - 1]


def performance_compare(arguments):
    parser = argparse.ArgumentParser(
        description="Performance CI/CD check for Charset-Normalizer"
    )

    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")

    args = parser.parse_args(arguments)

    if not isdir("./char-dataset"):
        print("This script require https://github.com/Ousret/char-dataset to be cloned on package root directory")
        exit(1)

    charset_normalizer_results = []

    for tbt_path in sorted(glob("./char-dataset/**/*.*")):

        with open(tbt_path, "rb") as fp:
            content = fp.read() * args.size_coeff

        before = time_ns()
        detect(content)
        charset_normalizer_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print(str(charset_normalizer_results[-1]), tbt_path)

    charset_normalizer_avg_delay = mean(charset_normalizer_results)
    charset_normalizer_99p = calc_percentile(charset_normalizer_results, 99)
    charset_normalizer_95p = calc_percentile(charset_normalizer_results, 95)
    charset_normalizer_50p = calc_percentile(charset_normalizer_results, 50)

    print("------------------------------")
    print("--> Charset-Normalizer Conclusions")
    print("   --> Avg: " + str(charset_normalizer_avg_delay) + "s")
    print("   --> 99th: " + str(charset_normalizer_99p) + "s")
    print("   --> 95th: " + str(charset_normalizer_95p) + "s")
    print("   --> 50th: " + str(charset_normalizer_50p) + "s")
    
    # percentile / time plot
    print("Percentile data --------------")
    print()
    x_chardet, y_chardet = [], []
    for i in range(100):
        x_chardet.append(i)
        y_chardet.append(calc_percentile(charset_normalizer_results, i))
        print(calc_percentile(charset_normalizer_results, i))
    
    return


if __name__ == "__main__":
    exit(
        performance_compare(
            argv[1:]
        )
    )

[attached: mypyc comparison charts (mypyc_compare, screen2) and a matplotlib percentile plot (percentile_matplotlib)]

percentile-plot.py
#!/bin/python
from glob import glob
from time import time_ns
import argparse
from sys import argv
from os.path import isdir

from charset_normalizer import detect
from chardet import detect as chardet_detect
from cchardet import detect as cchardet_detect

from statistics import mean
from math import ceil

import matplotlib.pyplot as plt


def calc_percentile(data, percentile):
    n = len(data)
    p = n * percentile / 100
    sorted_data = sorted(data)

    return sorted_data[int(p)] if p.is_integer() else sorted_data[int(ceil(p)) - 1]


def performance_compare(arguments):
    parser = argparse.ArgumentParser(
        description="Performance CI/CD check for Charset-Normalizer"
    )

    parser.add_argument('-s', '--size-increase', action="store", default=1, type=int, dest='size_coeff',
                        help="Apply artificial size increase to challenge the detection mechanism further")

    args = parser.parse_args(arguments)

    if not isdir("./char-dataset"):
        print("This script require https://github.com/Ousret/char-dataset to be cloned on package root directory")
        exit(1)

    chardet_results = []
    cchardet_results = []
    charset_normalizer_results = []
    file_names_list = []

    for tbt_path in sorted(glob("./char-dataset/**/*.*")):
        print(tbt_path)
        file_names_list.append(tbt_path.split('/')[-1])
        
        # Read Bin file
        with open(tbt_path, "rb") as fp:
            content = fp.read() * args.size_coeff
        #Chardet
        before = time_ns()
        chardet_detect(content)
        chardet_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print("  --> Chardet: " + str(chardet_results[-1]) + "s")
        #Cchardet
        before = time_ns()
        cchardet_detect(content)
        cchardet_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print("  --> Cchardet: " + str(cchardet_results[-1]) + "s")
        #Charset_normalizer
        before = time_ns()
        detect(content)
        charset_normalizer_results.append(
            round((time_ns() - before) / 1000000000, 5)
        )
        print("  --> Charset-Normalizer: " + str(charset_normalizer_results[-1]) + "s")
        

    chardet_avg_delay = mean(chardet_results)
    chardet_99p = calc_percentile(chardet_results, 99)
    chardet_95p = calc_percentile(chardet_results, 95)
    chardet_50p = calc_percentile(chardet_results, 50)

    cchardet_avg_delay = mean(cchardet_results)
    cchardet_99p = calc_percentile(cchardet_results, 99)
    cchardet_95p = calc_percentile(cchardet_results, 95)
    cchardet_50p = calc_percentile(cchardet_results, 50)

    charset_normalizer_avg_delay = mean(charset_normalizer_results)
    charset_normalizer_99p = calc_percentile(charset_normalizer_results, 99)
    charset_normalizer_95p = calc_percentile(charset_normalizer_results, 95)
    charset_normalizer_50p = calc_percentile(charset_normalizer_results, 50)

    print("")

    print("------------------------------")
    print("--> Chardet Conclusions")
    print("   --> Avg: " + str(chardet_avg_delay) + "s")
    print("   --> 99th: " + str(chardet_99p) + "s")
    print("   --> 95th: " + str(chardet_95p) + "s")
    print("   --> 50th: " + str(chardet_50p) + "s")

    print("------------------------------")
    print("--> Cchardet Conclusions")
    print("   --> Avg: " + str(cchardet_avg_delay) + "s")
    print("   --> 99th: " + str(cchardet_99p) + "s")
    print("   --> 95th: " + str(cchardet_95p) + "s")
    print("   --> 50th: " + str(cchardet_50p) + "s")

    print("------------------------------")
    print("--> Charset-Normalizer Conclusions")
    print("   --> Avg: " + str(charset_normalizer_avg_delay) + "s")
    print("   --> 99th: " + str(charset_normalizer_99p) + "s")
    print("   --> 95th: " + str(charset_normalizer_95p) + "s")
    print("   --> 50th: " + str(charset_normalizer_50p) + "s")
    
    print("------------------------------")
    print("--> Charset-Normalizer / Chardet: Performance Сomparison")
    print("   --> Avg: " + str(round(((chardet_avg_delay / charset_normalizer_avg_delay - 1) * 100), 2)) + "%")        
    print("   --> 99th: " + str(round(((chardet_99p / charset_normalizer_99p - 1) * 100), 2)) + "%")
    print("   --> 95th: " + str(round(((chardet_95p / charset_normalizer_95p - 1) * 100), 2)) + "%")
    print("   --> 50th: " + str(round(((chardet_50p / charset_normalizer_50p - 1) * 100), 2)) + "%")

    '''
    # time / files plot
    x_chardet, y_chardet = [], []
    for i,v in enumerate(chardet_results):
        x_chardet.append(i)
        y_chardet.append(v)

    x_cchardet, y_cchardet = [], []
    for i,v in enumerate(cchardet_results):
        x_cchardet.append(i)
        y_cchardet.append(v)

    x_charset_normalizer, y_charset_normalizer = [], []
    for i,v in enumerate(charset_normalizer_results):
        x_charset_normalizer.append(i)
        y_charset_normalizer.append(v)
        
    plt.figure(figsize=(1000, 100), layout='constrained')
    plt.plot(x_chardet, y_chardet, label='Chardet') 
    plt.plot(x_cchardet, y_cchardet, label='Cchardet')
    plt.plot(x_charset_normalizer, y_charset_normalizer, label='Charset_normalizer')
    plt.xlabel('files')
    plt.ylabel('time')
    # Create names on the x axis
    plt.xticks(x_chardet, file_names_list, rotation=90)
    plt.title("Simple Plot")
    plt.legend()
    plt.show()
    '''

    # percentile / time plot
    x_chardet, y_chardet = [], []
    for i in range(100):
        x_chardet.append(i)
        y_chardet.append(calc_percentile(chardet_results, i))

    x_cchardet, y_cchardet = [], []
    for i in range(100):
        x_cchardet.append(i)
        y_cchardet.append(calc_percentile(cchardet_results, i))

    x_charset_normalizer, y_charset_normalizer = [], []
    for i in range(100):
        x_charset_normalizer.append(i)
        y_charset_normalizer.append(calc_percentile(charset_normalizer_results, i))
        
    plt.figure(figsize=(100, 100))
    plt.plot(x_chardet, y_chardet, label='Chardet') 
    plt.plot(x_cchardet, y_cchardet, label='Cchardet')
    plt.plot(x_charset_normalizer, y_charset_normalizer, label='Charset_normalizer')
    plt.xlabel('%')
    plt.ylabel('time')
    # Create names on the x axis
    plt.title("Percentile Plot")
    plt.legend()
    plt.show()
    
    return

if __name__ == "__main__":
    exit(
        performance_compare(
            argv[1:]
        )
    )

The effect is not huge: the speed roughly doubles.
But when processing a large number of files, I think it will be very noticeable.

deedy5 (Contributor, Author) commented May 5, 2022

Compilation is performed when the package is installed on the user's computer.
The source files are not deleted, and functionality will not be affected if a compilation error occurs.

You can test it yourself.
Mypyc docs


  1. mypy must be installed:
pip install -U mypy

  2. Add this to setup.py to compile during installation:
ext_modules=mypycify([
        "charset_normalizer/__init__.py",
        "charset_normalizer/api.py",
        "charset_normalizer/cd.py",
        "charset_normalizer/constant.py",
        "charset_normalizer/legacy.py",
        "charset_normalizer/md.py",
        "charset_normalizer/models.py",
        "charset_normalizer/utils.py",
        "charset_normalizer/assets/__init__.py",
        "charset_normalizer/cli/normalizer.py",
    ]),
full setup.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import io
import os
from re import search

from setuptools import find_packages, setup

from mypyc.build import mypycify


def get_version():
    with open('charset_normalizer/version.py') as version_file:
        return search(r"""__version__\s+=\s+(['"])(?P<version>.+?)\1""",
                      version_file.read()).group('version')


# Package meta-data.
NAME = 'charset-normalizer'
DESCRIPTION = 'The Real First Universal Charset Detector. Open, modern and actively maintained alternative to Chardet.'
URL = 'https://github.com/ousret/charset_normalizer'
EMAIL = 'ahmed.tahri@cloudnursery.dev'
AUTHOR = 'Ahmed TAHRI @Ousret'
REQUIRES_PYTHON = '>=3.5.0'
VERSION = get_version()

REQUIRED = []

EXTRAS = {
    'unicode_backport': ['unicodedata2']
}

here = os.path.abspath(os.path.dirname(__file__))

try:
    with io.open(os.path.join(here, 'README.md'), encoding='utf-8') as f:
        long_description = '\n' + f.read()
except FileNotFoundError:
    long_description = DESCRIPTION

setup(
    name=NAME,
    version=VERSION,
    description=DESCRIPTION,
    long_description=long_description.replace(':heavy_check_mark:', '✅'),
    long_description_content_type='text/markdown',
    author=AUTHOR,
    author_email=EMAIL,
    python_requires=REQUIRES_PYTHON,
    url=URL,
    keywords=['encoding', 'i18n', 'txt', 'text', 'charset', 'charset-detector', 'normalization', 'unicode', 'chardet'],
    packages=find_packages(exclude=["tests", "*.tests", "*.tests.*", "tests.*"]),
    install_requires=REQUIRED,
    extras_require=EXTRAS,
    include_package_data=True,
    package_data={"charset_normalizer": ["py.typed"]},
    license='MIT',
    entry_points={
        'console_scripts':
            [
                'normalizer = charset_normalizer.cli.normalizer:cli_detect'
            ]
    },
    classifiers=[
        'License :: OSI Approved :: MIT License',
        'Intended Audience :: Developers',
        'Topic :: Software Development :: Libraries :: Python Modules',
        'Operating System :: OS Independent',
        'Programming Language :: Python',
        'Programming Language :: Python :: 3',
        'Programming Language :: Python :: 3.5',
        'Programming Language :: Python :: 3.6',
        'Programming Language :: Python :: 3.7',
        'Programming Language :: Python :: 3.8',
        'Programming Language :: Python :: 3.9',
        'Programming Language :: Python :: 3.10',
        'Programming Language :: Python :: 3.11',
        'Topic :: Text Processing :: Linguistic',
        'Topic :: Utilities',
        'Programming Language :: Python :: Implementation :: PyPy',
        'Typing :: Typed'
    ],
    project_urls={
        'Bug Reports': 'https://github.com/Ousret/charset_normalizer/issues',
        'Documentation': 'https://charset-normalizer.readthedocs.io/en/latest',
    },
    ext_modules=mypycify([
        "charset_normalizer/__init__.py",
        "charset_normalizer/api.py",
        "charset_normalizer/cd.py",
        "charset_normalizer/constant.py",
        "charset_normalizer/legacy.py",
        "charset_normalizer/md.py",
        "charset_normalizer/models.py",
        "charset_normalizer/utils.py",
        "charset_normalizer/assets/__init__.py",
        "charset_normalizer/cli/normalizer.py",
    ]),
)

  3. Run:
python3 setup.py build_ext --inplace
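To verify that the compiled extensions are actually picked up afterwards (a quick sanity check of my own, not a step from the mypyc docs):

# after `build_ext --inplace`, a compiled module shadows its .py source,
# so its __file__ should point at a .so/.pyd rather than a .py file
import charset_normalizer.md

print(charset_normalizer.md.__file__)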

deedy5 (Contributor, Author) commented May 5, 2022

Compilation requires prerequisites

macOS

Install Xcode command line tools:

xcode-select --install
Linux

You need a C compiler and CPython headers and libraries. The specifics of how to install these varies by distribution. Here are instructions for Ubuntu 18.04, for example:

sudo apt install python3-dev
Windows

Install Visual C++.


Installing additional software can be a problem for the user, so compiling by default is not a good idea.
But it would be nice to offer compilation as an installation option.
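One possible shape for such an opt-in, as a sketch under my own assumptions (the CHARSET_NORMALIZER_USE_MYPYC variable and the md.py-only module list are illustrative, not part of the proposal): only call mypycify when explicitly requested, and fall back to a pure-Python build if mypyc is unavailable or rejects the sources.

import os

from setuptools import find_packages, setup

ext_modules = []
# hypothetical opt-in switch; a plain `pip install` stays pure Python
if os.getenv("CHARSET_NORMALIZER_USE_MYPYC") == "1":
    try:
        from mypyc.build import mypycify

        ext_modules = mypycify(["charset_normalizer/md.py"])
    except Exception as exc:  # mypyc not installed or the sources fail to compile
        print("mypyc compilation skipped:", exc)

setup(
    name="charset-normalizer",
    packages=find_packages(exclude=["tests"]),
    ext_modules=ext_modules,
)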

akx (Contributor) commented May 12, 2022

As long as charset_normalizer is a hard dependency for requests (see psf/requests#5875, psf/requests#5871 etc.), I really don't think this should be done.

As it is, installing requests does not install any packages with binary components (all .whls are -none-):

# pip install requests
  Downloading requests-2.27.1-py2.py3-none-any.whl (63 kB)
  Downloading urllib3-1.26.9-py2.py3-none-any.whl (138 kB)
  Downloading certifi-2021.10.8-py2.py3-none-any.whl (149 kB)
  Downloading charset_normalizer-2.0.12-py3-none-any.whl (39 kB)
  Downloading idna-3.3-py3-none-any.whl (61 kB)

That is, you can install requests wherever Python runs even if you don't have a C compiler.

If charset_normalizer starts including a binary module, then installing requests will require a C compiler, or the maintainers of charset_normalizer will need to start shipping binary wheels for multiple platforms and architectures (even more esoteric ones such as manylinux on arm64, since Raspberry Pis are a thing :) ), unless they wish to be inundated with issues asking why a particular installation fails with an obscure C compiler error.

deedy5 (Contributor, Author) commented May 12, 2022

It is not necessary to do this by default.
The average Internet user will not notice any difference whether the library is compiled or not.

But when you need to process a large number of files with an unknown encoding, there is a performance issue.
This package has the largest number of supported encodings, and today there is no alternative.

I tried to improve the processing speed and got some results (#183).
But I also found that compiling the library with mypyc speeds it up by more than 2x.
I suggest adding compilation as an option during installation. That is, when installing requests, no compilation would take place, but I would like to be able to compile the library using a command like

pip install charset_normalizer[mypyc]

I'm working on rewriting the code of this package in Cython, but so far I'm having trouble understanding the algorithm.

akx (Contributor) commented May 12, 2022

As far as I'm aware, the Setuptools extras syntax ([mypyc]) won't allow for optional compilation, just additional packages to be installed. The mypyc-compilable version could thus be packaged as a separate "charset-normalizer-speedups" package, and installed via the extra.
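A rough sketch of how that could be wired up (the charset-normalizer-speedups name is hypothetical, following the suggestion above): the extra only pulls in a separately built, pre-compiled companion package; the pure-Python package would then try to import the compiled modules at runtime and fall back when they are absent.

# hypothetical setup.py excerpt for the main package: the extra installs the
# optional pre-compiled companion package and triggers no compilation itself
from setuptools import find_packages, setup

setup(
    name="charset-normalizer",
    packages=find_packages(exclude=["tests"]),
    extras_require={
        # hypothetical companion package shipping mypyc-compiled modules
        "speedups": ["charset-normalizer-speedups"],
    },
)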

Ousret (Owner) commented May 12, 2022

I really don't think this should be done...unless they wish to be inundated by issues asking why their particular installation fails with an obscure C compiler error.

@akx
While I appreciate your concerns, there is next to no chance that this project would compromise our integrators. We are very much aware of the risks and opportunities.

There is a good, non-negligible chance that we eventually upload platform-specific wheels WHILE always providing the -none- wheel.

You just have to look at how mypy handles things. By the look of it, they manage it well, unless I am mistaken.

mypy-0.950-py3-none-any.whl
mypy-0.950-cp310-cp310-win_amd64.whl
....

mypy does not impose any compilation as far as I know; neither does coverage.py.
A proper study is required, and it is going to take some time.

deedy5 (Contributor, Author) commented May 12, 2022

These might be helpful:
psf/black#1009
psf/black#2431
mypyc/mypyc#886

deedy5 (Contributor, Author) commented May 12, 2022

As far as I'm aware, the Setuptools extras syntax ([mypyc]) won't allow for optional compilation, just additional packages to be installed. The mypyc-compilable version could thus be packaged as a separate "charset-normalizer-speedups" package, and installed via the extra.

Something like this:
https://github.com/psf/black/blob/main/setup.py

USE_MYPYC = False
# To compile with mypyc, a mypyc checkout must be present on the PYTHONPATH
if len(sys.argv) > 1 and sys.argv[1] == "--use-mypyc":
    sys.argv.pop(1)
    USE_MYPYC = True
if os.getenv("BLACK_USE_MYPYC", None) == "1":
    USE_MYPYC = True

if USE_MYPYC:
    from mypyc.build import mypycify

deedy5 (Contributor, Author) commented May 13, 2022

Surprisingly, mypyc almost catches up with Cython.

isprime_cython.py
import cython

@cython.cdivision(True)
@cython.ccall
def is_prime(n: cython.ulonglong) -> cython.bint:
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    isqrt: cython.ulong = int(n**0.5)
    sqrtn: cython.ulong = isqrt + 1
    i: cython.ulong = 0
    for i in range(5, sqrtn, 6):
        if n % i == 0 or n % (i + 2) == 0:
            return False
    return True
cythonize -a -i isprime_cython.py
isprime_mypyc.py
def is_prime(n: int) -> bool:
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    for i in range(5, int(n**0.5) + 1, 6):
        if n % i == 0 or n % (i + 2) == 0:
            return False
    return True
mypyc isprime_mypyc.py
test.py
from time import monotonic
from isprime_cython import is_prime as is_prime_cython
from isprime_mypyc import is_prime as is_prime_mypyc


def is_prime(n: int) -> bool:
    if n <= 1:
        return False
    if n <= 3:
        return True
    if n % 2 == 0 or n % 3 == 0:
        return False
    for i in range(5, int(n**0.5) + 1, 6):
        if n % i == 0 or n % (i + 2) == 0:
            return False
    return True

START, END = 0, 10_000_000

t0 = monotonic()
r = sum(x for x in range(START, END) if is_prime(x))
print(f"is_prime: {monotonic() - t0}")

t0 = monotonic()
r = sum(x for x in range(START, END) if is_prime_cython(x))
print(f"is_prime_cython: {monotonic() - t0}")

t0 = monotonic()
r = sum(x for x in range(START, END) if is_prime_mypyc(x))
print(f"is_prime_mypyc: {monotonic() - t0}")
python3 test.py

results:

is_prime: 21.9134322960017
is_prime_cython: 3.197920835002151
is_prime_mypyc: 3.577863503996923

Ousret (Owner) commented Jun 30, 2022

Well, charset-normalizer did drop Python 3.5.

Some thoughts need to be considered beforehand.
Python 3.11 roughly doubled performance and favored Chardet a bit, but not by much (probably due to the simplicity of the code in Chardet's sources).

If we engage in this, it would mean, by extrapolation, that we should be roughly 10x faster. I expect ~19ms on average with 3.11 and ~9ms or better with mypyc.
Mypy has more than half a million downloads per day, so mypyc as a whole inspires some confidence.

deedy5 (Contributor, Author) commented Jul 2, 2022

I used mypy 0.970+dev.914297e9486b141c01b3459393938fdf423d892cef, because mypy 0.961 does not support Python 3.11.

performance1.py (same script as in the May 5 comment above)

[attached comparison charts: comparison, comparison2]
mypyc_performance.xlsx

Ousret (Owner) commented Aug 14, 2022

I started to work on a potential v3 including optional Mypyc. See https://github.com/Ousret/charset_normalizer/tree/3.0

To start testing:

git clone https://github.com/Ousret/charset_normalizer.git
cd charset_normalizer
git checkout 3.0
pip install -r dev-requirements.txt
python setup.py --use-mypyc install

On average 10 ms per file. That is a good performance bump.
But I am worried about the final wheel size: charset_normalizer-3.0.0b1-cp310-cp310-win_amd64.whl is about 500 kB to 1 MB (depending on the configuration), which is heavier than the Chardet wheel.

I am doing some extra research on the subject.

deedy5 (Contributor, Author) commented Aug 14, 2022

3.0 python3.10

I. default 3.0

------------------------------
--> Chardet Conclusions
   --> Avg: 0.12321512765957447s
   --> 99th: 0.74804s
   --> 95th: 0.178s
   --> 50th: 0.01804s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.025958744680851065s
   --> 99th: 0.25946s
   --> 95th: 0.14132s
   --> 50th: 0.01095s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
   --> Avg: x4.75
   --> 99th: x2.88
   --> 95th: x1.26
   --> 50th: x1.65

II. BUILD: python3 setup.py --use-mypyc build_ext --inplace

------------------------------
--> Chardet Conclusions
   --> Avg: 0.1224901914893617s
   --> 99th: 0.73647s
   --> 95th: 0.17915s
   --> 50th: 0.01755s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.010322106382978723s
   --> 99th: 0.11215s
   --> 95th: 0.05355s
   --> 50th: 0.00428s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
   --> Avg: x11.87
   --> 99th: x6.57
   --> 95th: x3.35
   --> 50th: x4.1

III. Marking constants as Final (#208) + BUILD: python3 setup.py --use-mypyc build_ext --inplace

------------------------------
--> Chardet Conclusions
   --> Avg: 0.12217872340425531s
   --> 99th: 0.72175s
   --> 95th: 0.17481s
   --> 50th: 0.01731s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.01016231914893617s
   --> 99th: 0.1085s
   --> 95th: 0.05219s
   --> 50th: 0.00419s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
   --> Avg: x12.02
   --> 99th: x6.65
   --> 95th: x3.35
   --> 50th: x4.13
3.0 python3.11b4

I. default 3.0

------------------------------
--> Chardet Conclusions
   --> Avg: 0.09048142553191489s
   --> 99th: 0.39248s
   --> 95th: 0.10197s
   --> 50th: 0.01008s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.018134829787234043s
   --> 99th: 0.17814s
   --> 95th: 0.09568s
   --> 50th: 0.00761s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
   --> Avg: x4.99
   --> 99th: x2.2
   --> 95th: x1.07
   --> 50th: x1.32

II. Marking constants as Final (#208) + BUILD: python3.11 setup.py --use-mypyc build_ext --inplace

------------------------------
--> Chardet Conclusions
   --> Avg: 0.09024136170212767s
   --> 99th: 0.39338s
   --> 95th: 0.10097s
   --> 50th: 0.00999s
------------------------------
--> Charset-Normalizer Conclusions
   --> Avg: 0.009643191489361703s
   --> 99th: 0.10487s
   --> 95th: 0.04999s
   --> 50th: 0.004s
------------------------------
--> Charset-Normalizer / Chardet: Performance Comparison
   --> Avg: x9.36
   --> 99th: x3.75
   --> 95th: x2.02
   --> 50th: x2.5

Summary of Charset-Normalizer Conclusions:

version       mypyc   Avg (s)                 99th (s)   95th (s)   50th (s)
python3.10    -       0.025958744680851065    0.25946    0.14132    0.01095
python3.10    +       0.01016231914893617     0.1085     0.05219    0.00419
python3.11b4  -       0.018134829787234043    0.17814    0.09568    0.00761
python3.11b4  +       0.009643191489361703    0.10487    0.04999    0.004

Ousret (Owner) commented Aug 15, 2022

Optimizing md.py only is strictly sufficient; I could get the final wheel size down to 80 kB.
Initial benchmarks show an insignificant difference, which I expected.
Next step: generating the compiled wheels for as many platforms as possible.

Ousret (Owner) commented Aug 17, 2022

Update on the topic.

The first beta is available on https://pypi.org/project/charset-normalizer/3.0.0b1 and https://github.com/Ousret/charset_normalizer/releases/tag/3.0.0b1
First results, gathered from a personal server, are good. It is running 24/7 to challenge the solution; so far, nothing to report.

The wheel size is no longer an obstacle to pursuing this.

Ousret (Owner) commented Aug 19, 2022

For me, everything is OK.
Scheduled for release when mypy/mypyc is ready for 3.11.

Answered by #209

Ousret closed this as completed Aug 19, 2022