API xtractmime #2
Please check these flags or settings; they may be helpful later. Would it be better to use python-magic directly, or to import the C header file <magic.h>, as python-magic does here?
I think it would be good to have a template for the project before doing more coding. A minimal template could consist of a package directory, setup.py and tox.ini files. https://github.com/scrapy/itemadapter/ is a fairly simple project you could look at for inspiration.
Unless there’s a good reason to use the C file, I would use the Python package API instead. |
Co-authored-by: Eugenio Lacuesta <1731933+elacuesta@users.noreply.github.com>
I am working on the template and will push it by tomorrow. I have also made some of the required changes; please check them.
Please suggest what should be the value for |
Please, remove from Git: .DS_Store, .mypy_cache, .tox, xtractmime.egg-info.
.DS_Store is a macOS-specific file, and you should probably add it to your system-wide Git ignore list, because I don’t think there’s ever going to be a repository where keeping that file tracked by Git will be necessary. You can alternatively add it to your local, project-specific exclude list (.git/info/exclude).
The rest are folders and files expected to get generated by developers working on this specific project. So they should probably be in the .gitignore file of the project, and that file should be committed.
Also, please consider turning your current testing code (e.g. sample_input.py) into automated tests. You can create a tests/test_extract_mime.py file, to be run with pytest tests/, where you can define different functions prefixed with test_ that each define a test scenario. For example:
def test_empty_body_no_content_types():
    assert extract_mime(b'') == 'text/plain'
You can then extend this file with additional test functions as you extend the functionality of the library.
d738706
to
68226a9
Compare
I have added the pattern matching algorithm according to section 6 and added some tests. Please check. Also, I am thinking of storing the byte tables in a separate file.
@Gallaecio Can you please confirm? I think the algorithm in section 6.2.1 for matching the signature of an MP4 file is wrong. It can be |
I believe that the algorithm is correct, but you might have misunderstood what box-size means here (e.g. you consider it to be 4?). The first 4 bytes of the input data in MP4 files seem to indicate the size of some byte sequence they call “box”, which must contain MP4 metadata or something. The algorithm says that you must “Let box-size be the four bytes from sequence[0] to sequence[3], interpreted as a 32-bit unsigned big-endian integer”. So you are supposed to read those 4 bytes as an integer, I suspect using struct. And then read the input data from the 17th byte, 4 bytes by 4 bytes, up to the box size. |
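A minimal sketch of that reading, assuming the steps of section 6.2.1 are as summarized above (at least 12 bytes, a big-endian box-size that is a multiple of 4, an "ftyp" marker at bytes 4 to 7, then an "mp4" brand either at byte 8 or at any 4-byte step from byte 16 up to box-size). The constants are written from memory of the spec, so check them against the standard before relying on them:

```python
import struct


def is_mp4_signature(sequence: bytes) -> bool:
    """Sketch of the MP4 signature check; verify the steps against the spec."""
    length = len(sequence)
    if length < 12:
        return False

    # First 4 bytes: box-size as a 32-bit unsigned big-endian integer.
    (box_size,) = struct.unpack(">I", sequence[:4])
    if length < box_size or box_size % 4 != 0:
        return False

    # Bytes 4..7 must spell "ftyp".
    if sequence[4:8] != b"ftyp":
        return False

    # Major brand at bytes 8..10.
    if sequence[8:11] == b"mp4":
        return True

    # Compatible brands: step 4 bytes at a time from byte 16 up to box-size.
    bytes_read = 16
    while bytes_read < box_size:
        if sequence[bytes_read:bytes_read + 3] == b"mp4":
            return True
        bytes_read += 4
    return False
```

Note that box-size is read from the data itself, so it is not a fixed value such as 4.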
yeah, I misunderstood it 😅, Thanks for the explanation 👍🏼 |
We should also add |
You seem to be doing really well, so please keep it up!
Since we seem to be going a bit ahead of schedule, I’ll take the chance to bomb you with style feedback 🙂
Thank you so much for the reviews and suggestions. I am learning a lot of new things and will try my best to work on them 😄 |
I just had a thought for a stretch goal: automate the generation of |
As per the discussion, I will open a new PR with the implementation of section 7 to merge into this branch |
I’m trying to figure out why the Windows tests are failing, but it’s a bit hard to navigate the test output because of the use of long parameters (file bodies) in parametrize.
I suggest you use the file name in the parameters (e.g. foo.gif), and move the file-reading code into the test functions themselves.
For example:
@pytest.mark.parametrize(
    "input_bytes,expected",
    [
        ("foo_mp4", True),
        (b"\x00\x00\x00", False),
    ],
)
def test_is_mp4_signature(self, input_bytes, expected):
    # A str parameter is a file name under tests/files/; read its bytes here.
    if isinstance(input_bytes, str):
        with open(f"tests/files/{input_bytes}", "rb") as input_file:
            input_bytes = input_file.read()
    assert is_mp4_signature(input_bytes) == expected
OK, I think the length of the parametrize parameters is actually what’s causing the Windows errors. pytest seems to try to put the parameters (as part of the test name) into an environment variable, and Windows does not support environment variables that long:
|
Regarding contexts: I think it would make sense to add an additional, enum parameter to the main function to allow users to select a context, and have it be the web browser context by default. That said, I would treat that as a stretch goal, and I would even prioritize automating the |
Should I make the changes for the type of input and return values in this PR or in another one?
Also, I will have to change all the comparisons against string MIME types to bytes, and all the fourth-column values in Maybe a check for string type in
|
Going back to @elacuesta’s #2 (comment), I believe that issue will be handled by black, so I think it would be OK to ignore the error code globally instead of line by line, which currently requires us to add the error code to every slice.
@akshaysharmajs Since this needs to be merged before #4, let us know which remaining issues you will fix as part of #4, fix the rest here, and let us merge this soon. |
For this PR, I have fixed:
The rest of the issues will be fixed as part of #4, including the naming of functions.
Created the main function and a function for resource metadata (still need to handle multiple Content-Type headers).
Should I put all the required flags in a different file "flags.py"?
#1