Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Determining the computed MIME type of a resource #4

Merged
merged 26 commits into from Aug 4, 2021
Merged

Conversation

akshaysharmajs
Copy link
Collaborator

Adding the main mime sniffing rules to #2 based on section 7 of standards.

@akshaysharmajs
Copy link
Collaborator Author

akshaysharmajs commented Jul 2, 2021

@Gallaecio For the byte pattern in section 7.1 step1, Patterns end with a tag-terminating byte that includes 0x20 (SP), 0x3E (">")

how should I make the entires of these patterns in _patterns.py? Making two entries of all patterns one ending with 0x20 and other with 0x3E or can make some changes in the _is_match_mime_pattern or there is some precise way ??

xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/__init__.py Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Member

@Gallaecio For the byte pattern in section 7.1 step1, Patterns end with a tag-terminating byte that includes 0x20 (SP), 0x3E (">")

how should I make the entires of these patterns in _patterns.py? Making two entries of all patterns one ending with 0x20 and other with 0x3E or can make some changes in the _is_match_mime_pattern or there is some precise way ??

I think the duplicate rows approach may be the best here, being the simplest, and something that users can easily do themselves as well when wishing to do something similar for extra_types.

You could use a loop in _patterns.py to avoid the duplication in code, but since the end goals is to autogenerate _pattern.py, I think that approach would be a bit of a waste of time. For extra_types, users are free to use this approach, though, e.g.:

extra_types = tuple(
    (prefix + suffix, mask, WHITESPACE_BYTES, "text/xml")
    for prefix, mask,  in (
        (b"\x3C\x21\x44\x4F\x43\x54\x59\x50\x45\x20\x48\x54\x4D\x4C", b"\xFF\xFF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xFF\xDF\xDF\xDF\xDF\xFF"),
        (b"\x3C\x48\x54\x4D\x4C", b"\xFF\xDF\xDF\xDF\xDF\xFF"),
        # …
    )
    for suffix in (b"\x20", b"\x3E")
)

@akshaysharmajs
Copy link
Collaborator Author

akshaysharmajs commented Jul 3, 2021

I believe I am wrong but this is what I understand till step 5.1 in section 7.3.

def _sniff_mislabled_feed(input_bytes: bytes, supplied_type: Optional[Tuple[bytes]]) -> str:
    input_size = len(input_bytes)
    index = 0

    if input_size >= 3 and input_bytes[:3] == b"\xef\xbb\xbf":
        index += 3

    while index < input_size:
        while True:
            if input_bytes[index:index+1] == None:
                return supplied_type

            if input_bytes[index:index+1] == b"<":
                index += 1
                break

            if input_bytes[index:index+1] not in WHITESPACE_BYTES:
                return supplied_type

            index += 1
              

@Gallaecio
Copy link
Member

Sounds correct to me.

The first loop (while True) aims to return the supplied type unless the file starts with < (ignoring spacing characters).

if input_bytes[index:index+1] == None will not work, though. I’ve just tested it, and it should be if input_bytes[index:index+1] == b'', or simplified as if not input_bytes[index:index+1], since:

>>> data = b"a"
>>> data[:1]
b'a'
>>> data[1:2]
b''

You could also simplify input_size >= 3 and input_bytes[:3] == b"\xef\xbb\xbf" as input_bytes[:3] == b"\xef\xbb\xbf".

xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/__init__.py Outdated Show resolved Hide resolved
@akshaysharmajs
Copy link
Collaborator Author

I am shifting _is_match_mime_pattern to _utils.py, I think it's creating a reference cycle for imports

@codecov
Copy link

codecov bot commented Jul 9, 2021

Codecov Report

Merging #4 (882bc7d) into main (f656d9b) will increase coverage by 0.60%.
The diff coverage is 100.00%.

Impacted file tree graph

@@             Coverage Diff             @@
##             main        #4      +/-   ##
===========================================
+ Coverage   99.39%   100.00%   +0.60%     
===========================================
  Files           3         4       +1     
  Lines         164       353     +189     
===========================================
+ Hits          163       353     +190     
+ Misses          1         0       -1     
Impacted Files Coverage Δ
xtractmime/__init__.py 100.00% <100.00%> (+4.00%) ⬆️
xtractmime/_patterns.py 100.00% <100.00%> (ø)
xtractmime/_utils.py 100.00% <100.00%> (ø)
xtractmime/mimegroups.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f656d9b...882bc7d. Read the comment docs.

xtractmime/_patterns.py Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Member

Awesome.

I think we can merge as soon as you fix the renaming issues from the static checks and we merge #2, if you prefer to handle documentation in a separate pull request.

You might also want to do something like this to prevent 1 failure to stop an unrelated CI job: scrapy/scrapy#5200

@akshaysharmajs
Copy link
Collaborator Author

I think we can merge as soon as you fix the renaming issues from the static checks and we merge #2, if you prefer to handle documentation in a separate pull request.

Yes, handling documentation in a separate PR will be better and easier for me.

Also, please suggest a fix for the only error in typing checks in the latest commit. I think the type hint for leading bytes in extra_types requires a change.

@Gallaecio Gallaecio mentioned this pull request Jul 13, 2021
@Gallaecio Gallaecio changed the base branch from highlvl-API to main July 13, 2021 07:36
akshaysharmajs and others added 2 commits July 30, 2021 19:05
* Add mime groups

* Add tests

* Test all mime_types

* Add more tests

* Remove text mime

* All mime types
Copy link
Member

@elacuesta elacuesta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The only reason I'm not merging right away is because I wonder if it would be clearer to use bytes.fromhex more extensively instead of escaping hex sequences directly. It's not a blocker at all, let's just talk about it in our next meeting.

@elacuesta
Copy link
Member

It was decided that converting the escape sequences can be done later, in a separate PR.

@elacuesta elacuesta merged commit d600225 into main Aug 4, 2021
@elacuesta elacuesta deleted the compute-mime branch August 4, 2021 20:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants