Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework mpeg4 check, add support for some more mpeg4 and ISO base media formats #206

Merged
merged 9 commits into from May 4, 2019

Conversation

stroncium
Copy link
Contributor

@stroncium stroncium commented Apr 15, 2019

Fixes #165
Fixes #99

WIP: gathering links for readme

Added 3g2, m4a, m4b, m4p, m4v, f4a, f4b, f4p, f4v extensions.
Gathered all ISO base media file format detectors together and simplified the code.
Added wildcard check for various mp4 files, but specific enough for false positives to be rare, 4 symbols exact ftyp match followed by 4 printable symbols match. Theoretically we might want to add exact matching for the printable symbols, but many of the types aren't registered and last compiled list I found was from 2009 (http://www.ftyps.com/), file seems to use that same list.
If my vision doesn't fail me, that means we support all mp4 versions supported by file and more(from wildcard).

As for test cases, due to just parsing mpeg4 format and knowing that all the mentioned containers respect it, I don't think extensive testing is needed, and I didn't want to bloat repository with too much audio/video and waste time searching for samples, so I just generated minimal valid header boxes for each format.

For m4v and variants:
- file info tells that they have m4v extenstion and video/x-m4v mime
- we had a test case for mp4 extenstion for them
- as per http://www.ftyps.com/ at least M4V brand is registered, though mime type is not
- they may(and usually do, from what I gather) require additional capabilities compared to mp4
- so for the moment I set it m4v and video/x-m4v (which was actually stated in readme, but wasn't detected), but it's probably worth some discussion
- tests fixed to support it(fixture-m4v.mp4 -> fiture.m4v)

Copy link
Collaborator

@Borewit Borewit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried the PR with some of my samples, which are audio truncated real world files:

Sample master PR #206
issue-120.m4b mime=video/mp4, extension=mp4 ❔mime=video/mp4, extension=mp4
issue-127.m4b mime=audio/mp4, extension=m4a ❌ FAILED
issue-133.m4a mime=audio/mp4, extension=m4a ✔️mime=audio/mp4, extension=m4a
issue-151.m4a mime=video/mp4, extension=mp4 ✔️mime=video/mp4, extension=mp4

Samples can be found here

I disagree that it fixes #165, it is not able to detect audio books. See also my comments: 1 & 2

@stroncium Can you attach the original (untruncated) files you used for the fixtures?

Should heave read better:

As for test cases, due to just parsing mpeg4 format and knowing that all the mentioned containers respect it, I don't think extensive testing is needed, and I didn't want to bloat repository with too much audio/video and waste time searching for samples, so I just generated minimal valid header boxes for each format.

That way the tests become a self fulfilling prophecy. Based on the consistency demonstrated so far, I think it is worth to test against a reasonable amount of real world MPEG-4 files.

@stroncium
Copy link
Contributor Author

stroncium commented Apr 17, 2019

@Borewit I actually tested it on your samples too. But now it seems like I screwed up one value when gathering all the code together(audio/mp4 instead of audio/x-m4a), fixed that.

After fix, outputs for your samples.

file

id4.m4a:       ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
issue-120.m4b: ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]
issue-127.m4b: ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
issue-133.m4a: ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
issue-151.m4a: ISO Media, MP4 v2 [ISO 14496-14]
issue-74.m4a:  ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]
issue-79.m4a:  ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
Mr. Pickles S02E07 My Dear Boy.mp4: ISO Media, MP4 v2 [ISO 14496-14]

file --mime-type

id4.m4a:       audio/x-m4a
issue-120.m4b: video/mp4
issue-127.m4b: audio/x-m4a
issue-133.m4a: audio/x-m4a
issue-151.m4a: video/mp4
issue-74.m4a:  video/mp4
issue-79.m4a:  audio/x-m4a
Mr. Pickles S02E07 My Dear Boy.mp4: video/mp4

file-type

id4.m4a { ext: 'm4a', mime: 'audio/x-m4a' }
issue-120.m4b { ext: 'mp4', mime: 'video/mp4' }
issue-127.m4b null
issue-133.m4a { ext: 'm4a', mime: 'audio/x-m4a' }
issue-151.m4a { ext: 'mp4', mime: 'video/mp4' }
issue-74.m4a { ext: 'mp4', mime: 'video/mp4' }
issue-79.m4a { ext: 'm4a', mime: 'audio/x-m4a' }
Mr. Pickles S02E07 My Dear Boy.mp4 { ext: 'mp4', mime: 'video/mp4' }

All files match with file except for issue-127.m4b.

hexdump -C -n12 issue-127.m4b

00000000  00 00 00 18 66 74 79 70  4d 34 41 00              |....ftypM4A.|
0000000c

It is not actually valid mp4 file as per standard, as it uses not only non-registerred, but illegal brand M4A\x00 (\x00 is illegal character for brands), even though I suppose many players will go further than standard, ignore that and try to process it anyway.

Now, what to do with such files is a big question I forgot to raise.

Now, going for matching with extensions you supplied, issue-74.m4a, issue-151.m4a, issue-120.m4b don't match.

issue-74.m4a is clearly saved with incorrect extension. It isn't produced by Apple(quote comment: youtube rip) and is marked with isom brand. It contains not only audio, so proper mime is video/mp4.
issue-151.m4a is marked with brand mp42, also doesn't seem to be created by Apple(created by JRiver Media Center 22.0.32), the best guess we can do there is audio/mp4, but video/mp4 is also valid.
issue-120.m4b is also not produced by Apple(encode inAudible 1.97). The brand is isom. It also contains not only audio, so the only valid mime for it is video/mp4.

I mean, there is no standard which defines what m4a is. Only internal documents of Apple. There are so common marks we can gather from their practice, and none of this marks seem to apply to those files. IMO it will be a bug if we return the m4a/m4b extensions here.

Though it all depends on perspective. I was going by assumption we want to produce extensions files should be saved with. If we want to guess which extensions they were saved with before, then I don't know what to do about it except gathering all the software around capable of producing mp4 and making matches on metadata it outputs. But that won't help in most cases. For example in issue-74.m4a even if we were to detect that it was created by Lavf58.0.102 we wouldn't be able to detect user named it m4a, as lavf doesn't dictate that.

@Borewit
Copy link
Collaborator

Borewit commented Apr 17, 2019

@Borewit I actually tested it on your samples too. But now it seems like I screwed up one value when gathering all the code together(audio/mp4 instead of audio/x-m4a), fixed that.

Small mistake, well done.

After fix, outputs for your samples.

file

id4.m4a:       ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
issue-120.m4b: ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]
issue-127.m4b: ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
issue-133.m4a: ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
issue-151.m4a: ISO Media, MP4 v2 [ISO 14496-14]
issue-74.m4a:  ISO Media, MP4 Base Media v1 [IS0 14496-12:2003]
issue-79.m4a:  ISO Media, Apple iTunes ALAC/AAC-LC (.M4A) Audio
Mr. Pickles S02E07 My Dear Boy.mp4: ISO Media, MP4 v2 [ISO 14496-14]

file --mime-type

id4.m4a:       audio/x-m4a
issue-120.m4b: video/mp4
issue-127.m4b: audio/x-m4a
issue-133.m4a: audio/x-m4a
issue-151.m4a: video/mp4
issue-74.m4a:  video/mp4
issue-79.m4a:  audio/x-m4a
Mr. Pickles S02E07 My Dear Boy.mp4: video/mp4

file-type

id4.m4a { ext: 'm4a', mime: 'audio/x-m4a' }
issue-120.m4b { ext: 'mp4', mime: 'video/mp4' }
issue-127.m4b null
issue-133.m4a { ext: 'm4a', mime: 'audio/x-m4a' }
issue-151.m4a { ext: 'mp4', mime: 'video/mp4' }
issue-74.m4a { ext: 'mp4', mime: 'video/mp4' }
issue-79.m4a { ext: 'm4a', mime: 'audio/x-m4a' }
Mr. Pickles S02E07 My Dear Boy.mp4 { ext: 'mp4', mime: 'video/mp4' }

All files match with file except for issue-127.m4b.

hexdump -C -n12 issue-127.m4b

00000000  00 00 00 18 66 74 79 70  4d 34 41 00              |....ftypM4A.|
0000000c

It is not actually valid mp4 file as per standard, as it uses not only non-registerred, but illegal brand M4A\x00 (\x00 is illegal character for brands), even though I suppose many players will go further than standard, ignore that and try to process it anyway.

Now, what to do with such files is a big question I forgot to raise.

These files are send in by users, and caused issues in my library. So probably not a fair set of test files neither. I truncated the audio prior adding them to the repository.

Now, going for matching with extensions you supplied, issue-74.m4a, issue-151.m4a, issue-120.m4b don't match.

issue-74.m4a is clearly saved with incorrect extension. It isn't produced by Apple(quote comment: youtube rip) and is marked with isom brand. It contains not only audio, so proper mime is video/mp4.

I agree.

issue-151.m4a is marked with brand mp42, also doesn't seem to be created by Apple(created by JRiver Media Center 22.0.32), the best guess we can do there is audio/mp4, but video/mp4 is also valid.
issue-120.m4b is also not produced by Apple(encode inAudible 1.97). The brand is isom. It also contains not only audio, so the only valid mime for it is video/mp4.

This sample file is really based on an audio book. What do you mean with It also contains not only audio?

I mean, there is no standard which defines what m4a is. Only internal documents of Apple.

You make it sound like Apple owns the .m4a extension, it can be used for MPEG-4 audio files right? Does not necessary needs to be encoded by Apple right?

There are so common marks we can gather from their practice

Let's capture those common marks, that is good enough in my opinion. I got suspicious on the ftype, because the MPEG-4 i had around did not comply with that at all. Some of them were very complete, and complex file structures, my humble assumption is then, the file has not been put together by world's dumbest idiot.

and none of this marks seem to apply to those files.

But if you say this sample is non representative I believe you and in that case it has mislead me big time.
I am sorry if my samples were not reflecting reality.

Though it all depends on perspective. I was going by assumption we want to produce extensions files should be saved with. If we want to guess which extensions they were saved with before, then I don't know what to do about it except gathering all the software around capable of producing mp4 and making matches on metadata it outputs. But that won't help in most cases.

Lets be realistic. No one is expecting magic.

I can see that I have frustrated you, please don't be.

I will try to find better samples, if someone can borrow me some testing material, that would be fantastic

For example in issue-74.m4a even if we were to detect that it was created by Lavf58.0.102 we wouldn't be able to detect user named it m4a, as lavf doesn't dictate that.

Let's build the case on those common marks and patterns you seem to be able to recognize.

@stroncium
Copy link
Contributor Author

stroncium commented Apr 17, 2019

@sindresorhus I don't seem to find any good documentation on any of those format. The sources I used are iso media standard, mpeg4 standard, www.ftyps.com, file output and some random point about some tricky parts on the internet. So, I have no idea what links to put into readme. Should we just bundle all of them together and point to mpeg4 wiki article maybe?

@stroncium
Copy link
Contributor Author

issue-120.m4b is also not produced by Apple(encode inAudible 1.97). The brand is isom. It also contains not only audio, so the only valid mime for it is video/mp4.

This sample file is really based on an audio book. What do you mean with It also contains not only audio?

The file also contains video stream.
Per mpeg standard, recommended mime types are video/mp4 for any file and audio/mp4 for files with audio streams only. There is no registered mime for audiobooks. I found some references to usage of audio/x-m4b on the internet, but considering even file doesn't use it, I guess it's better to stay clear of it for the moment.

I mean, there is no standard which defines what m4a is. Only internal documents of Apple.

You make it sound like Apple owns the .m4a extension, it can be used for MPEG-4 audio files right? Does not necessary needs to be encoded by Apple right?

Sure doesn't, my point is was if we try to go with trying to gather some consistent part of it it will be things made by Apple. Otherwise we're doomed, if you google for something like how to make an audiobook, the answer will usually be just make mp4 file and rename it to m4b, it will play fine, so my guess would be any kind of file can end up with extension m4b(same for m4a). The thing is, no players watch at extension, because it is meaningless for mpeg4 file, so as long as player just have all the capabilities to render it and is associated with extension, it all goes fine. But, as apple formats, apple products are supposed to care about them(at least on level of file associations), and marking something not conforming to what is usually seen in apple m4a/m4b might lead to some frustrating results.

What I've tried to do here was to make a good matcher for all the files respecting standards and unofficial standards but also provide a good fallback in case we don't match, mp4 and video/mp4 as any player is supposed to be associated with this extension, and any browser is supposed to be able to play it. What I'd also like to see is for no files to be marked as files which are supposed to require less capabilities than actual file requires.

For example in issue-74.m4a even if we were to detect that it was created by Lavf58.0.102 we wouldn't be able to detect user named it m4a, as lavf doesn't dictate that.

Let's build the case on those common marks and patterns you seem to be able to recognize.

and none of this marks seem to apply to those files.

But if you say this sample is non representative I believe you and in that case it has mislead me big time.

I can not guarantee they aren't representative as I have no statistics on their usage. I'm just trying to think logically about it.
For example, looking at the comments issue-74.m4a seems to be some live concert recording, with audio track and video track. issue-120.m4b seems like audiobook, with audio and video track.
Not counting ads in comments, they both:

  • have long enough audio stream
  • have video stream
  • video stream seems to be one frame
  • have same brands

Only technical difference is audiobook got chapters in it, but if concert wasn't youtube rip and/or was crafted a bit better, it well might have had too.
The brands inside iso media file are there exactly to combat the whole problem of extensions and users renaming files, one might think of them as extension, but written inside file.

Gathering common traces of known encoders might be a good goal for some use cases, but I'm not sure how reachable it is. Also, introducing that will probably require a change to multiple mime types return format, as for many uses such types will be simply wrong. To illustrate, this lavf is part of ffmpeg. I'd bet more than half of published mp4 is encoded with ffmpeg, including the ones Apple does, as it is widely used both directly and as backend for other software. The problem is, it doesn't set this info up for you. It sets it to default set of brands(isom, iso2, mp41), and it's users responsibility to set up proper parameters for his use.

I think the first thing should be to match file performance and look for comments.

@stroncium
Copy link
Contributor Author

As for deeper detection, it's fairly easy to check if there is video stream with significant bitrate to separate video from audio when you have whole file.

Separating audiobooks from audio is much more complex problem. Only solution I can think of is searching for keywords in metadata.

But I think neither of those 2 features are in scope of file-type. Otherwise we may as well go for recognizing poems vs thesises in doc files.

@stroncium
Copy link
Contributor Author

@sindresorhus So, what should I do about readme?

I don't seem to find any good documentation on any of those format. The sources I used are iso media standard, mpeg4 standard, www.ftyps.com, file output and some random point about some tricky parts on the internet. So, I have no idea what links to put into readme. Should we just bundle all of them together and point to mpeg4 wiki article maybe?

@sindresorhus
Copy link
Owner

@stroncium I would prefer to keep one entry per file extension, but you can just link to the same wiki article for all of them.

@stroncium stroncium changed the title [WIP] Reworked mpeg4, added support for some mpeg4 formats, fixes #165, solves #99 Reworked mpeg4, added support for some mpeg4 formats, fixes #165, solves #99 May 4, 2019
@stroncium stroncium changed the title Reworked mpeg4, added support for some mpeg4 formats, fixes #165, solves #99 Rework mpeg4, added support for some mpeg4 and ISO base media formats, fixes #165, solves #99 May 4, 2019
@stroncium stroncium changed the title Rework mpeg4, added support for some mpeg4 and ISO base media formats, fixes #165, solves #99 Rework mpeg4, add support for some mpeg4 and ISO base media formats, fixes #165, solves #99 May 4, 2019
@stroncium
Copy link
Contributor Author

@sindresorhus Readme done. I've also added description to 3gp and m4v and reordered them a bit.

@Borewit
Copy link
Collaborator

Borewit commented May 4, 2019

Nice work @stroncium.

@sindresorhus & @stroncium: I would like to use checkString in favour of check if applicable. It is more readable. I don't think performance is an argument, checkString can be made at least as fast.

@sindresorhus sindresorhus changed the title Rework mpeg4, add support for some mpeg4 and ISO base media formats, fixes #165, solves #99 Rework mpeg4 check, add support for some more mpeg4 and ISO base media formats May 4, 2019
@sindresorhus
Copy link
Owner

I would like to use checkString in favour of check if applicable. It is more readable. I don't think performance is not an argument, checkString can be made at least as fast.

👍 Agreed, but can be done in a follow-up PR.

@sindresorhus sindresorhus merged commit feaa217 into sindresorhus:master May 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

m4b audio books maps to video/mp4 Many MP4 variants are not implemented
3 participants