Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API xtractmime #2

Merged
merged 41 commits into from Jul 13, 2021
Merged

API xtractmime #2

merged 41 commits into from Jul 13, 2021

Conversation

akshaysharmajs
Copy link
Collaborator

Created main function and a function for resource metadata(still need to handle multiple content-type headers).

Should I put all the required flags in a different file "flags.py"?

#1

@akshaysharmajs
Copy link
Collaborator Author

akshaysharmajs commented Jun 9, 2021

Please check these flags or settings, maybe helpful later.

Will it be better to use directly python-magic or import C header file <magic.h> just as done in python-magic here?

Copy link
Member

@elacuesta elacuesta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it would be good to have a template for the project before doing more coding. A minimal template could consist on a package directory, setup.py and tox.ini files. https://github.com/scrapy/itemadapter/ is a fairly simple project you could look at for inspiration.

main.py Outdated Show resolved Hide resolved
main.py Outdated Show resolved Hide resolved
main.py Outdated Show resolved Hide resolved
main.py Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Member

Will it be better to use directly python-magic or import C header file <magic.h> just as done in python-magic here?

Unless there’s a good reason to use the C file, I would use the Python package API instead.

akshaysharmajs and others added 2 commits June 11, 2021 02:03
Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>
Copy link
Collaborator Author

@akshaysharmajs akshaysharmajs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am working on the template will push that by tom and made some changes required please check them once.

main.py Outdated Show resolved Hide resolved
@akshaysharmajs
Copy link
Collaborator Author

Please suggest what should be the value for python-requires and classifiers for setup.py

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, remove from Git: .DS_Store, .mypy_cache, .tox, xtractmime.egg-info.

.DS_Store is a macOS-specific file, and you should probably add it to your system-wide Git ignore list, because I don’t think there’s ever going to be a repository where keeping that file tracked by Git will be necessary. You can alternatively add it to your local, project specific exclude list (.git/info/exclude).

The rest are folders and files expected to get generated by developers working on this specific project. So they should probably be in the .gitignore file of the project, and that file should be committed.

Also, please consider turning your current testing code (e.g. sample_input.py) into automated tests. You can create a tests/test_extract_mime.py file, to be run with pytest tests/, where you can define different functions prefixes with test_ that define a test scenario. For example:

def test_empty_body_no_content_types():
    assert extract_mime(b'') == 'text/plain'

You can then extend this file with additional test functions as you extend the functionality of the library.

sample_input.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
tox.ini Outdated Show resolved Hide resolved
xtractmime/main.py Outdated Show resolved Hide resolved
main.py Outdated Show resolved Hide resolved
xtractmime/main.py Outdated Show resolved Hide resolved
@akshaysharmajs
Copy link
Collaborator Author

akshaysharmajs commented Jun 16, 2021

I have added the pattern matching algorithm according to section 6 and added some tests. Please check once

Also, I am thinking of storing the byte tables in a separate file patterns.py which can store respective byte-pattern, pattern-mask, and mimetype as a tuple for each type of file.
for e.g, image_pattern = ((b"\x00\x00\x01\x00",b"\xff\xff\xff\xff","image/x-icon"), (b"\x00\x00\x02\x00", b"\xff\xff\xff\xff","image/x-icon"),...)

@akshaysharmajs
Copy link
Collaborator Author

akshaysharmajs commented Jun 18, 2021

@Gallaecio Can u please confirm, I think the algo mentioned in section 6.2.1 for the matching the signature of mp4 file is wrong.
pt 9 says while bytes-read < box-size. byte-read is 16 and box-size is 4 and it will never enter while loop??

it can be while box-size < bytes-read, if its a mistake??

@Gallaecio
Copy link
Member

@Gallaecio Can u please confirm, I think the algo mentioned in section 6.2.1 for the matching the signature of mp4 file is wrong.
pt 9 says while bytes-read < box-size. byte-read is 16 and box-size is 4 and it will never enter while loop??

it can be while box-size < bytes-read, if its a mistake??

I believe that the algorithm is correct, but you might have misunderstood what box-size means here (e.g. you consider it to be 4?).

The first 4 bytes of the input data in MP4 files seem to indicate the size of some byte sequence they call “box”, which must contain MP4 metadata or something. The algorithm says that you must “Let box-size be the four bytes from sequence[0] to sequence[3], interpreted as a 32-bit unsigned big-endian integer”. So you are supposed to read those 4 bytes as an integer, I suspect using struct. And then read the input data from the 17th byte, 4 bytes by 4 bytes, up to the box size.

@akshaysharmajs
Copy link
Collaborator Author

yeah, I misunderstood it 😅, Thanks for the explanation 👍🏼

@akshaysharmajs
Copy link
Collaborator Author

This one it seems we should add to the allow-list:

xtractmime/__init__.py:47:40: E203 whitespace before ':'
xtractmime/__init__.py:53:53: E203 whitespace before ':'
xtractmime/_utils.py:67:34: E203 whitespace before ':'
xtractmime/_utils.py:99:29: E203 whitespace before ':'
xtractmime/_utils.py:111:33: E203 whitespace before ':'
xtractmime/_utils.py:123:26: E203 whitespace before ':'

We should also add xtractmime/__init__.py:47:13: W503 line break before binary operator to the list

Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You seem to be going really well, so please keep it up!

Since we seem to be going a bit ahead of schedule, I’ll take the chance to bomb you with style feedback 🙂

xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/__init__.py Show resolved Hide resolved
xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/__init__.py Outdated Show resolved Hide resolved
xtractmime/_utils.py Outdated Show resolved Hide resolved
xtractmime/_utils.py Outdated Show resolved Hide resolved
xtractmime/_utils.py Outdated Show resolved Hide resolved
xtractmime/_utils.py Outdated Show resolved Hide resolved
xtractmime/_utils.py Show resolved Hide resolved
@akshaysharmajs
Copy link
Collaborator Author

Thank you so much for the reviews and suggestions. I am learning a lot of new things and will try my best to work on them 😄

@Gallaecio
Copy link
Member

Gallaecio commented Jun 26, 2021

I just had a thought for a stretch goal: automate the generation of _pattern.py. It’s not necessary for project completion, but it would be great since it would remove the chance of human error while transcribing and make it trivial to update the file as they add new patterns to the standard in the future.

@akshaysharmajs
Copy link
Collaborator Author

As per the discussion, I will open a new PR with the implementation of section 7 to merge into this branch highlvl-API just to make things less complicated.

xtractmime/_patterns.py Outdated Show resolved Hide resolved
xtractmime/_patterns.py Outdated Show resolved Hide resolved
xtractmime/_utils.py Outdated Show resolved Hide resolved
Copy link
Member

@Gallaecio Gallaecio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m trying to figure out why the Windows tests are failing, but it’s a bit hard to navigate the test output because of the use of long parameters (file bodies) in parametrize.

I suggest you use the file name in the parameters (e.g. foo.gif), and move the file-reading code into the test functions themselves.

For example:

    @pytest.mark.parametrize(
        "input_bytes,expected",
        [
            ("foo_mp4", True),
            (b"\x00\x00\x00", False),
        ],
    )
    def test_is_mp4_signature(self, input_bytes, expected):
        if isinstance(input_bytes, str):
            with open(f"tests/files/{input_bytes}", "rb") as input_file:
                input_bytes = input_file.read()
        assert is_mp4_signature(input_bytes) == expected

@Gallaecio
Copy link
Member

OK, I think the length of parametrize parameters is actually what’s causing the Windows errors. Pytests seems to try and put the parameters (as part of the test name) into an environment variable, and Windows does not support environment variables that long:

2021-06-30T19:27:04.8104969Z self = environ({'NUMBER_OF_PROCESSORS': '2', 'COMSPEC': 'C:\\Windows\\system32\\cmd.exe', 'PROCESSOR_ARCHITECTURE': 'AMD64', ...V_CORE_SOURCE': 'xtractmime', 'COV_CORE_CONFIG': ';', 'COV_CORE_DATAFILE': 'D:\\a\\xtractmime\\xtractmime\\.coverage'})
2021-06-30T19:27:04.8106169Z key = 'PYTEST_CURRENT_TEST'
2021-06-30T19:27:04.8107430Z value = 'tests/test_utils.py::TestUtils::test_is_mp4_signature[\\x00\\x00\\x00 ftypmp42\\x00\\x00\\x00\\x00mp42mp41isomavc1\\x...ZZZ]\\xe0\\x82\\x14\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xb4\\xbc-True] (setup)'
2021-06-30T19:27:04.8108504Z 
2021-06-30T19:27:04.8108945Z     def __setitem__(self, key, value):
2021-06-30T19:27:04.8109481Z         key = self.encodekey(key)
2021-06-30T19:27:04.8110076Z         value = self.encodevalue(value)
2021-06-30T19:27:04.8110658Z >       self.putenv(key, value)
2021-06-30T19:27:04.8111370Z E       ValueError: the environment variable is longer than 32767 characters
2021-06-30T19:27:04.8111867Z 
2021-06-30T19:27:04.8112444Z c:\hostedtoolcache\windows\python\3.8.10\x64\lib\os.py:681: ValueError

@Gallaecio
Copy link
Member

Regarding contexts: I think it would make sense to add an additional, enum parameter to the main function to allow users to select a context, and have it be the web browser context by default. That said, I would treat that as an stretch goal, and I would even prioritize automating the _patterns.py generation over it.

tests/requirements.txt Outdated Show resolved Hide resolved
setup.cfg Show resolved Hide resolved
tox.ini Outdated Show resolved Hide resolved
xtractmime/_utils.py Outdated Show resolved Hide resolved
@akshaysharmajs
Copy link
Collaborator Author

akshaysharmajs commented Jul 6, 2021

Should I make the changes for the type of input and return values in this PR or in another one?
The structure of extract_mime will be

def extract_mime(
    body: bytes,
    *,
    content_types: Optional[Tuple[Union[bytes, str]]] = None,
    http_origin: bool = True,
    no_sniff: bool = False,
    extra_types: Optional[Tuple[Tuple[bytes, bytes, Set[bytes], Union[bytes,str]], ...]] = None,
) -> bytes:

Also, I will have to change all the comparison with string mimetype to bytes and all the fourth column values in _patterns.py
to bytes. Same goes with all the type hints.

Maybe a check for string type in extract_mime will be better.

if type(supplied_type) == str:
    supplied_type = supplied_type.encode()

xtractmime/__init__.py Outdated Show resolved Hide resolved
setup.py Outdated Show resolved Hide resolved
@Gallaecio
Copy link
Member

Going back to @elacuesta ’s #2 (comment), I believe that issue will be handled by black, so I think it would be OK to ignore the error code globally instead of line-by-line, which requires us to add the error code to every slice at the moment.

@Gallaecio
Copy link
Member

Gallaecio commented Jul 13, 2021

@akshaysharmajs Since this needs to be merged before #4, let us know which remaining issues you will fix as part of #4, fix the rest here, and let us merge this soon.

@akshaysharmajs
Copy link
Collaborator Author

For this PR, I have fixed:

  • the version
  • global E203
  • extract_mime type hints
  • fail-fast: false

rest of the issues will be fixed as a part of #4 including the naming of functions

@Gallaecio Gallaecio merged commit f656d9b into main Jul 13, 2021
@akshaysharmajs akshaysharmajs deleted the highlvl-API branch July 13, 2021 08:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants