Determining the computed MIME type of a resource #4

akshaysharmajs · 2021-07-02T11:02:07Z

Adding the main mime sniffing rules to #2 based on section 7 of standards.

akshaysharmajs · 2021-07-02T11:27:18Z

@Gallaecio For the byte pattern in section 7.1 step1, Patterns end with a tag-terminating byte that includes 0x20 (SP), 0x3E (">")

how should I make the entires of these patterns in _patterns.py? Making two entries of all patterns one ending with 0x20 and other with 0x3E or can make some changes in the _is_match_mime_pattern or there is some precise way ??

xtractmime/__init__.py

Gallaecio · 2021-07-02T13:28:11Z

@Gallaecio For the byte pattern in section 7.1 step1, Patterns end with a tag-terminating byte that includes 0x20 (SP), 0x3E (">")

how should I make the entires of these patterns in _patterns.py? Making two entries of all patterns one ending with 0x20 and other with 0x3E or can make some changes in the _is_match_mime_pattern or there is some precise way ??

I think the duplicate rows approach may be the best here, being the simplest, and something that users can easily do themselves as well when wishing to do something similar for extra_types.

You could use a loop in _patterns.py to avoid the duplication in code, but since the end goals is to autogenerate _pattern.py, I think that approach would be a bit of a waste of time. For extra_types, users are free to use this approach, though, e.g.:

extra_types = tuple(
    (prefix + suffix, mask, WHITESPACE_BYTES, "text/xml")
    for prefix, mask,  in (
        (b"\x3C\x21\x44\x4F\x43\x54\x59\x50\x45\x20\x48\x54\x4D\x4C", b"\xFF\xFF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xFF\xDF\xDF\xDF\xDF\xFF"),
        (b"\x3C\x48\x54\x4D\x4C", b"\xFF\xDF\xDF\xDF\xDF\xFF"),
        # …
    )
    for suffix in (b"\x20", b"\x3E")
)

akshaysharmajs · 2021-07-03T09:30:03Z

I believe I am wrong but this is what I understand till step 5.1 in section 7.3.

def _sniff_mislabled_feed(input_bytes: bytes, supplied_type: Optional[Tuple[bytes]]) -> str:
    input_size = len(input_bytes)
    index = 0

    if input_size >= 3 and input_bytes[:3] == b"\xef\xbb\xbf":
        index += 3

    while index < input_size:
        while True:
            if input_bytes[index:index+1] == None:
                return supplied_type

            if input_bytes[index:index+1] == b"<":
                index += 1
                break

            if input_bytes[index:index+1] not in WHITESPACE_BYTES:
                return supplied_type

            index += 1

Gallaecio · 2021-07-05T07:26:33Z

Sounds correct to me.

The first loop (while True) aims to return the supplied type unless the file starts with < (ignoring spacing characters).

if input_bytes[index:index+1] == None will not work, though. I’ve just tested it, and it should be if input_bytes[index:index+1] == b'', or simplified as if not input_bytes[index:index+1], since:

>>> data = b"a"
>>> data[:1]
b'a'
>>> data[1:2]
b''

You could also simplify input_size >= 3 and input_bytes[:3] == b"\xef\xbb\xbf" as input_bytes[:3] == b"\xef\xbb\xbf".

xtractmime/__init__.py

akshaysharmajs · 2021-07-08T20:19:55Z

I am shifting _is_match_mime_pattern to _utils.py, I think it's creating a reference cycle for imports

codecov · 2021-07-09T12:51:40Z

Codecov Report

Merging #4 (882bc7d) into main (f656d9b) will increase coverage by 0.60%.
The diff coverage is 100.00%.

@@             Coverage Diff             @@
##             main        #4      +/-   ##
===========================================
+ Coverage   99.39%   100.00%   +0.60%     
===========================================
  Files           3         4       +1     
  Lines         164       353     +189     
===========================================
+ Hits          163       353     +190     
+ Misses          1         0       -1

Impacted Files	Coverage Δ
xtractmime/__init__.py	`100.00% <100.00%> (+4.00%)`	⬆️
xtractmime/_patterns.py	`100.00% <100.00%> (ø)`
xtractmime/_utils.py	`100.00% <100.00%> (ø)`
xtractmime/mimegroups.py	`100.00% <100.00%> (ø)`

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f656d9b...882bc7d. Read the comment docs.

xtractmime/_patterns.py

xtractmime/__init__.py

Gallaecio · 2021-07-12T16:38:26Z

Awesome.

I think we can merge as soon as you fix the renaming issues from the static checks and we merge #2, if you prefer to handle documentation in a separate pull request.

You might also want to do something like this to prevent 1 failure to stop an unrelated CI job: scrapy/scrapy#5200

akshaysharmajs · 2021-07-12T17:29:49Z

I think we can merge as soon as you fix the renaming issues from the static checks and we merge #2, if you prefer to handle documentation in a separate pull request.

Yes, handling documentation in a separate PR will be better and easier for me.

Also, please suggest a fix for the only error in typing checks in the latest commit. I think the type hint for leading bytes in extra_types requires a change.

xtractmime/_utils.py

* Add mime groups * Add tests * Test all mime_types * Add more tests * Remove text mime * All mime types

elacuesta

LGTM. The only reason I'm not merging right away is because I wonder if it would be clearer to use bytes.fromhex more extensively instead of escaping hex sequences directly. It's not a blocker at all, let's just talk about it in our next meeting.

mime type split

elacuesta · 2021-08-04T20:58:36Z

It was decided that converting the escape sequences can be done later, in a separate PR.

Add main sniffing algo

9f65cc6

Gallaecio reviewed Jul 2, 2021

View reviewed changes

xtractmime/__init__.py Outdated Show resolved Hide resolved

xtractmime/__init__.py Outdated Show resolved Hide resolved

xtractmime/__init__.py Outdated Show resolved Hide resolved

xtractmime/__init__.py Outdated Show resolved Hide resolved

akshaysharmajs and others added 2 commits July 5, 2021 15:48

Add 7.1, 7.2, and 7.3

1d2895e

Merge branch 'highlvl-API' into compute-mime

8c1d006

Gallaecio reviewed Jul 5, 2021

View reviewed changes

xtractmime/__init__.py Outdated Show resolved Hide resolved

xtractmime/__init__.py Outdated Show resolved Hide resolved

xtractmime/__init__.py Outdated Show resolved Hide resolved

Complete 7.3

0c9424e

Gallaecio reviewed Jul 5, 2021

View reviewed changes

xtractmime/__init__.py Show resolved Hide resolved

Small Fix

0f27318

Update test_utils.py

981b9ab

akshaysharmajs and others added 3 commits July 9, 2021 18:25

Merge branch 'highlvl-API' into compute-mime

8edacda

Add tests for extract_mime

92317e9

Add more tests

de958a1

Gallaecio reviewed Jul 12, 2021

View reviewed changes

xtractmime/_patterns.py Outdated Show resolved Hide resolved

akshaysharmajs added 2 commits July 12, 2021 15:04

Add tests for mislabled feed

438ee50

Small fix

948a456

Gallaecio reviewed Jul 12, 2021

View reviewed changes

xtractmime/__init__.py Show resolved Hide resolved

Add left over tests

88afd26

Static fix

55823ee

Gallaecio mentioned this pull request Jul 13, 2021

API xtractmime #2

Merged

Gallaecio reviewed Jul 13, 2021

View reviewed changes

xtractmime/_utils.py Show resolved Hide resolved

Fix typing check

2371a4d

Gallaecio changed the base branch from highlvl-API to main July 13, 2021 07:36

akshaysharmajs and others added 2 commits July 13, 2021 13:10

Conflicts

4b503fc

Merge branch 'main' into compute-mime

8ec5c1a

Gallaecio approved these changes Jul 13, 2021

View reviewed changes

akshaysharmajs added 6 commits July 18, 2021 21:07

Add mime groups

75d202e

Add tests

15fe0ff

Test all mime_types

91ecd5a

Add more tests

9da49ba

Remove text mime

8f665af

All mime types

7645470

akshaysharmajs requested a review from elacuesta July 28, 2021 14:24

akshaysharmajs and others added 2 commits July 30, 2021 19:05

mime type split

88f540c

Add mime groups (#5)

d19d5c5

* Add mime groups * Add tests * Test all mime_types * Add more tests * Remove text mime * All mime types

elacuesta approved these changes Aug 1, 2021

View reviewed changes

akshaysharmajs and others added 2 commits August 1, 2021 23:30

Merge branch 'compute-mime' into html-fix

236029b

Merge pull request #6 from scrapy/html-fix

882bc7d

mime type split

elacuesta merged commit d600225 into main Aug 4, 2021

elacuesta deleted the compute-mime branch August 4, 2021 20:58

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Determining the computed MIME type of a resource #4

Determining the computed MIME type of a resource #4

akshaysharmajs commented Jul 2, 2021

akshaysharmajs commented Jul 2, 2021 •

edited

Gallaecio commented Jul 2, 2021

akshaysharmajs commented Jul 3, 2021 •

edited by Gallaecio

Gallaecio commented Jul 5, 2021

akshaysharmajs commented Jul 8, 2021

codecov bot commented Jul 9, 2021 •

edited

Gallaecio commented Jul 12, 2021

akshaysharmajs commented Jul 12, 2021

elacuesta left a comment

elacuesta commented Aug 4, 2021

Determining the computed MIME type of a resource #4

Determining the computed MIME type of a resource #4

Conversation

akshaysharmajs commented Jul 2, 2021

akshaysharmajs commented Jul 2, 2021 • edited

Gallaecio commented Jul 2, 2021

akshaysharmajs commented Jul 3, 2021 • edited by Gallaecio

Gallaecio commented Jul 5, 2021

akshaysharmajs commented Jul 8, 2021

codecov bot commented Jul 9, 2021 • edited

Codecov Report

Gallaecio commented Jul 12, 2021

akshaysharmajs commented Jul 12, 2021

elacuesta left a comment

Choose a reason for hiding this comment

elacuesta commented Aug 4, 2021

akshaysharmajs commented Jul 2, 2021 •

edited

akshaysharmajs commented Jul 3, 2021 •

edited by Gallaecio

codecov bot commented Jul 9, 2021 •

edited