Determining the computed MIME type of a resource #4
@Gallaecio For the byte patterns in section 7.1, step 1: some patterns end with a tag-terminating byte, which is either 0x20 (SP) or 0x3E (">"). How should I make the entries for these patterns in |
I think the duplicate-rows approach may be the best here: it is the simplest, and something that users can easily do themselves as well when wishing to do something similar for
You could use a loop in
extra_types = tuple(
    (prefix + suffix, mask, WHITESPACE_BYTES, "text/html")
    for prefix, mask in (
        (b"\x3C\x21\x44\x4F\x43\x54\x59\x50\x45\x20\x48\x54\x4D\x4C", b"\xFF\xFF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xFF\xDF\xDF\xDF\xDF\xFF"),
        (b"\x3C\x48\x54\x4D\x4C", b"\xFF\xDF\xDF\xDF\xDF\xFF"),
        # …
    )
    for suffix in (b"\x20", b"\x3E")
) |
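For readers following along, here is a minimal sketch of how such duplicated rows could be consumed by a masked pattern match. The names `PATTERNS`, `matches`, and `sniff` are hypothetical and not taken from this pull request, and `WHITESPACE_BYTES` is assumed to hold the standard's whitespace bytes (HT, LF, FF, CR, SP).

```python
# Assumption: whitespace bytes as defined by the MIME Sniffing standard.
WHITESPACE_BYTES = b"\t\n\x0c\r "

# Hypothetical table. Each row: (pattern, mask, ignored leading bytes, MIME type).
# Every prefix is duplicated once per tag-terminating byte (SP and ">").
PATTERNS = tuple(
    (prefix + suffix, mask, WHITESPACE_BYTES, "text/html")
    for prefix, mask in (
        (b"<!DOCTYPE HTML", b"\xFF\xFF\xDF\xDF\xDF\xDF\xDF\xDF\xDF\xFF\xDF\xDF\xDF\xDF\xFF"),
        (b"<HTML", b"\xFF\xDF\xDF\xDF\xDF\xFF"),
    )
    for suffix in (b"\x20", b"\x3E")
)


def matches(data, pattern, mask, ignored):
    """Return True if data matches pattern under mask, after skipping ignored leading bytes."""
    assert len(pattern) == len(mask)
    start = 0
    while start < len(data) and data[start] in ignored:
        start += 1
    window = data[start:start + len(pattern)]
    if len(window) < len(pattern):
        return False
    # A mask byte of 0xDF clears the ASCII case bit, so "h" matches "H".
    return all((d & m) == p for d, m, p in zip(window, mask, pattern))


def sniff(data):
    """Return the MIME type of the first matching row, or None."""
    for pattern, mask, ignored, mime_type in PATTERNS:
        if matches(data, pattern, mask, ignored):
            return mime_type
    return None
```

With this table, `sniff(b"  <html>")` and `sniff(b"<!doctype html ")` both return `"text/html"`, while `sniff(b"<htmlx")` returns `None` because the byte after the prefix is not a tag-terminating byte.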
I believe I am wrong but this is what I understand till step 5.1 in section 7.3.
def _sniff_mislabled_feed(input_bytes: bytes, supplied_type: Optional[Tuple[bytes]]) -> str:
    input_size = len(input_bytes)
    index = 0
    if input_size >= 3 and input_bytes[:3] == b"\xef\xbb\xbf":
        index += 3
    while index < input_size:
        while True:
            if input_bytes[index:index+1] == None:
                return supplied_type
            if input_bytes[index:index+1] == b"<":
                index += 1
                break
            if input_bytes[index:index+1] not in WHITESPACE_BYTES:
                return supplied_type
            index += 1
|
Sounds correct to me. The first loop (
>>> data = b"a"
>>> data[:1]
b'a'
>>> data[1:2]
b''
You could also simplify |
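As the interpreter session above shows, slicing a `bytes` object past its end yields `b""`, never `None`. The following is only a sketch of how the "skip whitespace up to the first `<`" step could test for end of input on that basis. The helper name `skip_to_first_tag` and the value of `WHITESPACE_BYTES` are assumptions, not code from this pull request.

```python
from typing import Optional

# Assumption: whitespace bytes as defined by the MIME Sniffing standard.
WHITESPACE_BYTES = b"\t\n\x0c\r "


def skip_to_first_tag(input_bytes: bytes) -> Optional[int]:
    """Return the index just past the first "<", or None if the input ends
    or a non-whitespace byte appears before any "<" is found."""
    index = 0
    # Skip a UTF-8 byte order mark, if present.
    if input_bytes[:3] == b"\xef\xbb\xbf":
        index += 3
    while True:
        byte = input_bytes[index:index + 1]
        if byte == b"":  # an out-of-range slice is empty, not None
            return None
        if byte == b"<":
            return index + 1
        if byte not in WHITESPACE_BYTES:
            return None
        index += 1
```

A caller would return the supplied type whenever this helper returns `None`, and otherwise continue sniffing from the returned index.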
I am shifting |
Codecov Report
@@ Coverage Diff @@
## main #4 +/- ##
===========================================
+ Coverage 99.39% 100.00% +0.60%
===========================================
Files 3 4 +1
Lines 164 353 +189
===========================================
+ Hits 163 353 +190
+ Misses 1 0 -1
|
Awesome. I think we can merge as soon as you fix the renaming issues from the static checks and we merge #2, if you prefer to handle documentation in a separate pull request. You might also want to do something like this to prevent one failure from stopping an unrelated CI job: scrapy/scrapy#5200 |
Yes, handling documentation in a separate PR will be better and easier for me. Also, please suggest a fix for the only error in typing checks in the latest commit. I think the type hint for leading bytes in |
* Add mime groups * Add tests * Test all mime_types * Add more tests * Remove text mime * All mime types
LGTM. The only reason I'm not merging right away is that I wonder if it would be clearer to use bytes.fromhex more extensively instead of escaping hex sequences directly. It's not a blocker at all; let's just talk about it in our next meeting.
It was decided that converting the escape sequences can be done later, in a separate PR. |
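For reference, a small illustration of the two spellings discussed in the review. Both produce identical `bytes` objects, so the choice is purely about readability. The variable names are hypothetical.

```python
# The same pattern written with escape sequences and with bytes.fromhex.
escaped = b"\x3C\x48\x54\x4D\x4C"
from_hex = bytes.fromhex("3C 48 54 4D 4C")  # spaces between byte pairs are ignored

assert escaped == from_hex == b"<HTML"

# The matching mask, in the same style.
mask = bytes.fromhex("FF DF DF DF DF FF")
```

The hex-string form stays close to how the standard prints its byte tables, which may make rows easier to compare against the spec.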
Adds the main MIME sniffing rules to #2, based on section 7 of the standard.