Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add type annotations to the project and run mypy on CI #261

Merged
merged 1 commit into from Jun 29, 2022
Merged

Add type annotations to the project and run mypy on CI #261

merged 1 commit into from Jun 29, 2022

Conversation

jdufresne
Copy link
Contributor

This helps consumers of chardet ensure that they are using the provided
API correctly. The project includes a py.typed file for PEP-561
compliance.

This also helps ensure internal correctness and consistency. Adding the
static type checking caught at least one suspect pattern in
chardet/hebrewprober.py where an int value was compared to
string (probably leftover from Python 2 support).

Type checking will run on all pull requests through GitHub actions and
pre-commit.

Copy link
Member

@dan-blanchard dan-blanchard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This would be a great contribution, but I think some of the types aren't quite right. Once those are fixed, we should be good to go though.

from .universaldetector import UniversalDetector
from .version import VERSION, __version__

__all__ = ["UniversalDetector", "detect", "detect_all", "__version__", "VERSION"]


def detect(byte_str):
def detect(byte_str: str) -> ResultDict:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be Union[bytes, bytearray], not str. This function explicitly does not accept str.

@@ -83,7 +86,7 @@ def reset(self):
# The number of characters whose frequency order is less than 512
self._freq_chars = 0

def feed(self, char, char_len):
def feed(self, char: Sequence[int], char_len: int) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most of the places you have Sequence[int] should be Union[bytes, bytearray].

if not self._best_guess_prober:
self.get_confidence()
if not self._best_guess_prober:
return None
return self._best_guess_prober.language

def feed(self, byte_str):
def feed(self, byte_str: bytes) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every byte_str argument should be Union[bytes, bytearray].

self._state = None
_state: int

def __init__(self, lang_filter: int = 0) -> None:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lang_filter should be LanguageFilter. Although, the default of 0 probably makes that tricky. Honestly, all the enums in the chardet.enums module should probably be overhauled and made into proper Python 3 Enum or Flag types.

self._freq_chars = None
# Mapping table to get frequency order from char order (get from
# GetOrder())
_char_to_freq_order: Tuple[int, ...]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you getting rid of all the attribute initializations from __init__ (here and elsewhere)? In local testing, that would make it an AttributeError to try to access any of those.

return "Japanese"

def feed(self, byte_str):
def feed(self, byte_str: bytes) -> int:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This int is really a ProbingState.

chardet/hebrewprober.py Show resolved Hide resolved
chardet/codingstatemachinedict.py Show resolved Hide resolved
@jdufresne
Copy link
Contributor Author

Thanks for the thorough review and great feedback! I believe I have applied all suggestions and this is ready for another round.

Copy link
Member

@dan-blanchard dan-blanchard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting so much time into this project! This is mostly looking good. Just a few minor complaints at this point.

Also, you probably want to rebase from master, because I added a new prober file.

self.probers = []
self._best_guess_prober = None
self.probers: List[CharSetProber] = []
self.active: Dict[CharSetProber, bool] = {}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this dictionary when the probers already all have an active property?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One was not defined in charsetprober.py, so mypy reported an error. I reverted this change and added the missing property to the parent class.

@@ -113,14 +118,13 @@ def remove_xml_tags(buf):
filtered = bytearray()
in_tag = False
prev = 0
buf = memoryview(buf).cast("c")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The way this was before was added specifically as a performance improvement in #252. Calling ord() every time will slow this down a little.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, thanks. This was done to workaround typeshed bug python/typeshed#8182

I'll revert the change and then ignore the false positive.

chardet/enums.py Show resolved Hide resolved
chardet/enums.py Outdated
@@ -31,7 +34,7 @@ class LanguageFilter:
CJK = CHINESE | JAPANESE | KOREAN


class ProbingState:
class ProbingState(Flag):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This (and the other ones in this file except for LanguageFilter) should inherit from Enum, because they're not meaningfully combinable by bitwise operations.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That makes sense, I changed this to an Enum in the latest revision.

This helps consumers of chardet ensure that they are using the provided
API correctly. The project includes a py.typed file for PEP-561
compliance.

This also helps ensure internal correctness and consistency. Adding the
static type checking caught at least one suspect pattern in
chardet/hebrewprober.py where an int value was compared to
string (probably leftover from Python 2 support).

Type checking will run on all pull requests through GitHub actions and
pre-commit.
@dan-blanchard dan-blanchard merged commit c4f7057 into chardet:master Jun 29, 2022
@jdufresne jdufresne deleted the mypy branch July 8, 2022 12:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants