API xtractmime #2

akshaysharmajs · 2021-06-09T08:49:49Z

Created main function and a function for resource metadata(still need to handle multiple content-type headers).

Should I put all the required flags in a different file "flags.py"?

akshaysharmajs · 2021-06-09T08:52:56Z

Please check these flags or settings, maybe helpful later.

Will it be better to use directly python-magic or import C header file <magic.h> just as done in python-magic here?

elacuesta

I think it would be good to have a template for the project before doing more coding. A minimal template could consist on a package directory, setup.py and tox.ini files. https://github.com/scrapy/itemadapter/ is a fairly simple project you could look at for inspiration.

main.py

Gallaecio · 2021-06-09T20:26:24Z

Will it be better to use directly python-magic or import C header file <magic.h> just as done in python-magic here?

Unless there’s a good reason to use the C file, I would use the Python package API instead.

Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>

akshaysharmajs

I am working on the template will push that by tom and made some changes required please check them once.

main.py

akshaysharmajs · 2021-06-12T12:07:11Z

Please suggest what should be the value for python-requires and classifiers for setup.py

Gallaecio

Please, remove from Git: .DS_Store, .mypy_cache, .tox, xtractmime.egg-info.

.DS_Store is a macOS-specific file, and you should probably add it to your system-wide Git ignore list, because I don’t think there’s ever going to be a repository where keeping that file tracked by Git will be necessary. You can alternatively add it to your local, project specific exclude list (.git/info/exclude).

The rest are folders and files expected to get generated by developers working on this specific project. So they should probably be in the .gitignore file of the project, and that file should be committed.

Also, please consider turning your current testing code (e.g. sample_input.py) into automated tests. You can create a tests/test_extract_mime.py file, to be run with pytest tests/, where you can define different functions prefixes with test_ that define a test scenario. For example:

def test_empty_body_no_content_types():
    assert extract_mime(b'') == 'text/plain'

You can then extend this file with additional test functions as you extend the functionality of the library.

sample_input.py

setup.py

tox.ini

xtractmime/main.py

main.py

xtractmime/main.py

akshaysharmajs · 2021-06-16T20:38:20Z

I have added the pattern matching algorithm according to section 6 and added some tests. Please check once

Also, I am thinking of storing the byte tables in a separate file patterns.py which can store respective byte-pattern, pattern-mask, and mimetype as a tuple for each type of file.
for e.g, image_pattern = ((b"\x00\x00\x01\x00",b"\xff\xff\xff\xff","image/x-icon"), (b"\x00\x00\x02\x00", b"\xff\xff\xff\xff","image/x-icon"),...)

akshaysharmajs · 2021-06-18T10:25:46Z

@Gallaecio Can u please confirm, I think the algo mentioned in section 6.2.1 for the matching the signature of mp4 file is wrong.
pt 9 says while bytes-read < box-size. byte-read is 16 and box-size is 4 and it will never enter while loop??

it can be while box-size < bytes-read, if its a mistake??

Gallaecio · 2021-06-18T14:23:15Z

@Gallaecio Can u please confirm, I think the algo mentioned in section 6.2.1 for the matching the signature of mp4 file is wrong.
pt 9 says while bytes-read < box-size. byte-read is 16 and box-size is 4 and it will never enter while loop??

it can be while box-size < bytes-read, if its a mistake??

I believe that the algorithm is correct, but you might have misunderstood what box-size means here (e.g. you consider it to be 4?).

The first 4 bytes of the input data in MP4 files seem to indicate the size of some byte sequence they call “box”, which must contain MP4 metadata or something. The algorithm says that you must “Let box-size be the four bytes from sequence[0] to sequence[3], interpreted as a 32-bit unsigned big-endian integer”. So you are supposed to read those 4 bytes as an integer, I suspect using struct. And then read the input data from the 17th byte, 4 bytes by 4 bytes, up to the box size.

akshaysharmajs · 2021-06-18T14:41:15Z

yeah, I misunderstood it 😅, Thanks for the explanation 👍🏼

akshaysharmajs · 2021-06-25T20:04:58Z

This one it seems we should add to the allow-list:

xtractmime/__init__.py:47:40: E203 whitespace before ':'
xtractmime/__init__.py:53:53: E203 whitespace before ':'
xtractmime/_utils.py:67:34: E203 whitespace before ':'
xtractmime/_utils.py:99:29: E203 whitespace before ':'
xtractmime/_utils.py:111:33: E203 whitespace before ':'
xtractmime/_utils.py:123:26: E203 whitespace before ':'

We should also add xtractmime/__init__.py:47:13: W503 line break before binary operator to the list

Gallaecio

You seem to be going really well, so please keep it up!

Since we seem to be going a bit ahead of schedule, I’ll take the chance to bomb you with style feedback 🙂

xtractmime/__init__.py

xtractmime/_utils.py

akshaysharmajs · 2021-06-25T23:03:58Z

Thank you so much for the reviews and suggestions. I am learning a lot of new things and will try my best to work on them 😄

Gallaecio · 2021-06-26T06:43:53Z

I just had a thought for a stretch goal: automate the generation of _pattern.py. It’s not necessary for project completion, but it would be great since it would remove the chance of human error while transcribing and make it trivial to update the file as they add new patterns to the standard in the future.

akshaysharmajs · 2021-06-30T19:36:01Z

As per the discussion, I will open a new PR with the implementation of section 7 to merge into this branch highlvl-API just to make things less complicated.

xtractmime/_patterns.py

xtractmime/_utils.py

Gallaecio

I’m trying to figure out why the Windows tests are failing, but it’s a bit hard to navigate the test output because of the use of long parameters (file bodies) in parametrize.

I suggest you use the file name in the parameters (e.g. foo.gif), and move the file-reading code into the test functions themselves.

For example:

    @pytest.mark.parametrize(
        "input_bytes,expected",
        [
            ("foo_mp4", True),
            (b"\x00\x00\x00", False),
        ],
    )
    def test_is_mp4_signature(self, input_bytes, expected):
        if isinstance(input_bytes, str):
            with open(f"tests/files/{input_bytes}", "rb") as input_file:
                input_bytes = input_file.read()
        assert is_mp4_signature(input_bytes) == expected

Gallaecio · 2021-07-01T07:01:54Z

OK, I think the length of parametrize parameters is actually what’s causing the Windows errors. Pytests seems to try and put the parameters (as part of the test name) into an environment variable, and Windows does not support environment variables that long:

2021-06-30T19:27:04.8104969Z self = environ({'NUMBER_OF_PROCESSORS': '2', 'COMSPEC': 'C:\\Windows\\system32\\cmd.exe', 'PROCESSOR_ARCHITECTURE': 'AMD64', ...V_CORE_SOURCE': 'xtractmime', 'COV_CORE_CONFIG': ';', 'COV_CORE_DATAFILE': 'D:\\a\\xtractmime\\xtractmime\\.coverage'})
2021-06-30T19:27:04.8106169Z key = 'PYTEST_CURRENT_TEST'
2021-06-30T19:27:04.8107430Z value = 'tests/test_utils.py::TestUtils::test_is_mp4_signature[\\x00\\x00\\x00 ftypmp42\\x00\\x00\\x00\\x00mp42mp41isomavc1\\x...ZZZ]\\xe0\\x82\\x14\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xbc-True] (setup)'
2021-06-30T19:27:04.8108504Z 
2021-06-30T19:27:04.8108945Z     def __setitem__(self, key, value):
2021-06-30T19:27:04.8109481Z         key = self.encodekey(key)
2021-06-30T19:27:04.8110076Z         value = self.encodevalue(value)
2021-06-30T19:27:04.8110658Z >       self.putenv(key, value)
2021-06-30T19:27:04.8111370Z E       ValueError: the environment variable is longer than 32767 characters
2021-06-30T19:27:04.8111867Z 
2021-06-30T19:27:04.8112444Z c:\hostedtoolcache\windows\python\3.8.10\x64\lib\os.py:681: ValueError

Gallaecio · 2021-07-02T06:22:28Z

Regarding contexts: I think it would make sense to add an additional, enum parameter to the main function to allow users to select a context, and have it be the web browser context by default. That said, I would treat that as an stretch goal, and I would even prioritize automating the _patterns.py generation over it.

tests/requirements.txt

setup.cfg

tox.ini

xtractmime/_utils.py

akshaysharmajs · 2021-07-06T01:38:59Z

Should I make the changes for the type of input and return values in this PR or in another one?
The structure of extract_mime will be

def extract_mime(
    body: bytes,
    *,
    content_types: Optional[Tuple[Union[bytes, str]]] = None,
    http_origin: bool = True,
    no_sniff: bool = False,
    extra_types: Optional[Tuple[Tuple[bytes, bytes, Set[bytes], Union[bytes,str]], ...]] = None,
) -> bytes:

Also, I will have to change all the comparison with string mimetype to bytes and all the fourth column values in _patterns.py
to bytes. Same goes with all the type hints.

Maybe a check for string type in extract_mime will be better.

if type(supplied_type) == str:
    supplied_type = supplied_type.encode()

xtractmime/__init__.py

setup.py

xtractmime/_utils.py

Gallaecio · 2021-07-13T06:49:02Z

Going back to @elacuesta ’s #2 (comment), I believe that issue will be handled by black, so I think it would be OK to ignore the error code globally instead of line-by-line, which requires us to add the error code to every slice at the moment.

Gallaecio · 2021-07-13T06:50:39Z

@akshaysharmajs Since this needs to be merged before #4, let us know which remaining issues you will fix as part of #4, fix the rest here, and let us merge this soon.

akshaysharmajs · 2021-07-13T07:16:25Z

For this PR, I have fixed:

the version
global E203
extract_mime type hints
fail-fast: false

rest of the issues will be fixed as a part of #4 including the naming of functions

adds main function

f12fe58

elacuesta reviewed Jun 9, 2021

View reviewed changes

main.py Outdated Show resolved Hide resolved

main.py Outdated Show resolved Hide resolved

main.py Outdated Show resolved Hide resolved

main.py Outdated Show resolved Hide resolved

akshaysharmajs and others added 2 commits June 11, 2021 02:03

Update main.py

2697b31

Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>

small fixes

5841da2

akshaysharmajs commented Jun 11, 2021

View reviewed changes

main.py Outdated Show resolved Hide resolved

akshaysharmajs added 3 commits June 12, 2021 02:23

removes flag

b68bea3

handles null Content-Type

a0e5549

adds setup.py

ab021ab

adds tox.ini

a45436f

Gallaecio reviewed Jun 13, 2021

View reviewed changes

sample_input.py Outdated Show resolved Hide resolved

setup.py Outdated Show resolved Hide resolved

setup.py Outdated Show resolved Hide resolved

tox.ini Outdated Show resolved Hide resolved

xtractmime/main.py Outdated Show resolved Hide resolved

Gallaecio reviewed Jun 13, 2021

View reviewed changes

main.py Outdated Show resolved Hide resolved

akshaysharmajs added 2 commits June 14, 2021 02:50

removes unnecessary files

f78dcda

adds .gitignore

68226a9

akshaysharmajs force-pushed the highlvl-API branch from d738706 to 68226a9 Compare June 13, 2021 21:50

akshaysharmajs added 2 commits June 14, 2021 03:27

removes .DS_Store

a7f6a9c

adds license, adds toxenv- black

2e7828c

akshaysharmajs commented Jun 15, 2021

View reviewed changes

xtractmime/main.py Outdated Show resolved Hide resolved

akshaysharmajs added 5 commits June 15, 2021 05:37

small fixes

03c73d1

adds pattern matching algorithm

0da529b

adds tests

5da829f

changes .gitignore

c152451

removes sample_input.py

7b836ff

adds patterns

fbe8deb

Gallaecio reviewed Jun 25, 2021

View reviewed changes

Style changes

21a4a38

Gallaecio reviewed Jul 1, 2021

View reviewed changes

xtractmime/_patterns.py Outdated Show resolved Hide resolved

xtractmime/_patterns.py Outdated Show resolved Hide resolved

xtractmime/_utils.py Outdated Show resolved Hide resolved

Gallaecio reviewed Jul 1, 2021

View reviewed changes

akshaysharmajs added 2 commits July 1, 2021 17:56

Fix windows-latest tests

6c19bde

Remove pattern comments

dadd5ea

akshaysharmajs mentioned this pull request Jul 2, 2021

Determining the computed MIME type of a resource #4

Merged

elacuesta reviewed Jul 5, 2021

View reviewed changes

tests/requirements.txt Outdated Show resolved Hide resolved

elacuesta reviewed Jul 5, 2021

View reviewed changes

setup.cfg Show resolved Hide resolved

elacuesta reviewed Jul 5, 2021

View reviewed changes

tox.ini Outdated Show resolved Hide resolved

elacuesta reviewed Jul 5, 2021

View reviewed changes

xtractmime/_utils.py Outdated Show resolved Hide resolved

Style Fix

5a9337d

Gallaecio reviewed Jul 6, 2021

View reviewed changes

xtractmime/__init__.py Outdated Show resolved Hide resolved

Change str -> bytes

5b03608

elacuesta reviewed Jul 12, 2021

View reviewed changes

setup.py Outdated Show resolved Hide resolved

elacuesta reviewed Jul 12, 2021

View reviewed changes

xtractmime/_utils.py Show resolved Hide resolved

akshaysharmajs added 2 commits July 13, 2021 12:33

Small fix

512061e

Fix checks

ff1e1fb

Gallaecio merged commit f656d9b into main Jul 13, 2021

akshaysharmajs deleted the highlvl-API branch July 13, 2021 08:48

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

API xtractmime #2

API xtractmime #2

akshaysharmajs commented Jun 9, 2021

akshaysharmajs commented Jun 9, 2021 •

edited

elacuesta left a comment

Gallaecio commented Jun 9, 2021

akshaysharmajs left a comment

akshaysharmajs commented Jun 12, 2021

Gallaecio left a comment

akshaysharmajs commented Jun 16, 2021 •

edited

akshaysharmajs commented Jun 18, 2021 •

edited

Gallaecio commented Jun 18, 2021

akshaysharmajs commented Jun 18, 2021

akshaysharmajs commented Jun 25, 2021

Gallaecio left a comment

akshaysharmajs commented Jun 25, 2021

Gallaecio commented Jun 26, 2021 •

edited

akshaysharmajs commented Jun 30, 2021

Gallaecio left a comment

Gallaecio commented Jul 1, 2021

Gallaecio commented Jul 2, 2021

akshaysharmajs commented Jul 6, 2021 •

edited

Gallaecio commented Jul 13, 2021

Gallaecio commented Jul 13, 2021 •

edited

akshaysharmajs commented Jul 13, 2021

API xtractmime #2

API xtractmime #2

Conversation

akshaysharmajs commented Jun 9, 2021

akshaysharmajs commented Jun 9, 2021 • edited

elacuesta left a comment

Choose a reason for hiding this comment

Gallaecio commented Jun 9, 2021

akshaysharmajs left a comment

Choose a reason for hiding this comment

akshaysharmajs commented Jun 12, 2021

Gallaecio left a comment

Choose a reason for hiding this comment

akshaysharmajs commented Jun 16, 2021 • edited

akshaysharmajs commented Jun 18, 2021 • edited

Gallaecio commented Jun 18, 2021

akshaysharmajs commented Jun 18, 2021

akshaysharmajs commented Jun 25, 2021

Gallaecio left a comment

Choose a reason for hiding this comment

akshaysharmajs commented Jun 25, 2021

Gallaecio commented Jun 26, 2021 • edited

akshaysharmajs commented Jun 30, 2021

Gallaecio left a comment

Choose a reason for hiding this comment

Gallaecio commented Jul 1, 2021

Gallaecio commented Jul 2, 2021

akshaysharmajs commented Jul 6, 2021 • edited

Gallaecio commented Jul 13, 2021

Gallaecio commented Jul 13, 2021 • edited

akshaysharmajs commented Jul 13, 2021

akshaysharmajs commented Jun 9, 2021 •

edited

akshaysharmajs commented Jun 16, 2021 •

edited

akshaysharmajs commented Jun 18, 2021 •

edited

Gallaecio commented Jun 26, 2021 •

edited

akshaysharmajs commented Jul 6, 2021 •

edited

Gallaecio commented Jul 13, 2021 •

edited