Add support for additional media types #18054

arash77 · 2024-04-25T12:41:11Z

This pull request adds support for Ogg, Webm, Mpeg, Mov, Avi, Wmv, and Wma media types.
It also refactors the media datatype sniffing logic in media.py using magic numbers.

How to test the changes?

I've included appropriate automated tests.
This is a refactoring of components with existing test coverage.
Instructions for manual testing are as follows:
1. [add testing steps and prerequisites here if you didn't write automated tests covering all your changes]

License

I agree to license these and all my past contributions to the core galaxy codebase under the MIT license.

…dia types

…ile formats

hexylena

I really like this PR as adding a sensible 'fallback' that can work if ffprobe isn't installed. That's potentially really useful for making life easier for small admins.

but I'm a bit wary of some of the changes which stop relying on ffprobe when it is available, I trust ffprobe's correctness far more than a (potentially) non-exhaustive list of magic numbers. In the case when it isn't available, the provide numbers are great let's use them. But since a number of the functions still rely on ffprobe, we aren't removing the potential dependency, so we don't gain much with the ffprobe+magic number combination, unless it were to be an ffprobe-less fallback.

I've made a number of comments directly on the PR, I think this will be an improvement when it's merged!

lib/galaxy/datatypes/media.py

hexylena · 2024-04-25T13:35:37Z

lib/galaxy/datatypes/media.py

-        if which("ffprobe"):
-            metadata, streams = ffprobe(filename)
-            return "mp4" in metadata["format_name"].split(",")
-        return False
+        return _get_file_format_from_magic_number(filename, "mp4")


is there a reason to remove fallback here? (and other places) I think this would be better if you just replaced the return False line, if we've got ffprobe, let's use it and feel certain we're detecting 100% correctly even in 'odd' cases.

So, should we always look for ffprobe and if it doesn't work check with the magic number?

Also when using 'ffprobe' on 'mov' and 'mp4' files, it gives back the same result. One way to tell them apart is by looking at the magic number.

So, should we always look for ffprobe and if it doesn't work check with the magic number?

Yes, not everyone is going to install ffmpeg, and your magic numbesr are a good, efficient fallback. (Unless any other devs want to chime in here?)

Also when using 'ffprobe' on 'mov' and 'mp4' files, it gives back the same result. One way to tell them apart is by looking at the magic number.

Thank you for pointing this out, you are correct. When I was looking further into it, I notice this output, ffmpeg bundles a lot of formats together on that line.

{ "format": { "filename": "sample-5s.mp4", "nb_streams": 2, "nb_programs": 0, "format_name": "mov,mp4,m4a,3gp,3g2,mj2", "format_long_name": "QuickTime / MOV",

which is an interesting result. Probably due to this: https://en.wikipedia.org/wiki/QuickTime_File_Format#Relation_to_MP4

Which explains why ffmpeg will decode both of those with the same decoder (since it's an mp4 stream internally with slightly different header metadata), but will encode them as separate spec compliant containers.

E mov QuickTime / MOV D mov,mp4,m4a,3gp,3g2,mj2 QuickTime / MOV

I wonder how much of a distinction we need to make here, for some relatively ancient formats, but probably some microscope/image tooling cares I guess.

As for implementation, I'm not sure what's the optimal implementation here. magic numbers can confirm it's actually one of the datatypes, but I still have some (perhaps unfounded) concern that they are not fully exhaustive/forward compatible of all the flavours of these datatypes that can be seen in the wild, so, instead if ffprobe says it's mp4, then we'll accept that.

Given how that magic numbers are used to spell ftypisom / ftypMSNV, I think that worry is not unfounded, that they'll might eventually a new ftyp that's also somehow an mp4 file.

So I think the optimal implementation looks like:

in mp4:

if ffprobe says it's an mp4

if magic numbers say it's a mov, return false

otherwise it's probably an mp4
in mov

if the magic numbers say it's a mov it's a mov.

and based on their ordering in the datatypes to prefer mp4 to mov/qtff (given that it's a more common format).

Why is the ffprobe better than the magic bits? The idea is to cut down on the number of subprocess calls on the headnodes, which is a good thing. So, should we use just magic bits if it is possible to eliminate ffprobe subprocess to fasten the process?

The idea is to cut down on the number of subprocess calls on the headnodes, which is a good thing

We currently only do 1 subprocess call (or we should only be doing 1) for any media file due to the caching in this function:

galaxy/lib/galaxy/datatypes/media.py

Lines 22 to 33 in d7673ab

@lru_cache(maxsize=128)

def _ffprobe(path):

return subprocess.run(

["ffprobe", "-loglevel", "quiet", "-show_format", "-show_streams", "-of", "json", path], capture_output=True

)

def ffprobe(path):

completed_process = _ffprobe(path)

completed_process.check_returncode()

data = json.loads(completed_process.stdout.decode("utf-8"))

return data["format"], data["streams"]

so this will not achieve any improvement there. Did you have a metric which was showing a high number of subprocess calls? Perhaps the cache size needs to be larger (or configurable)

it would only speed it up if we completely eliminated the ffprobe calls in every media type, but, I think we will not since other metadata is fetched from ffprobe's output.

hexylena · 2024-04-25T13:44:18Z

lib/galaxy/datatypes/media.py

+            vp_check = any(
+                stream["codec_name"] in ["vp8", "vp9"] for stream in streams if stream["codec_type"] == "video"
+            )
+            return _get_file_format_from_magic_number(filename, "mkv") and not vp_check


I don't think this restriction makes sense for mkvs, they should be able to host vp8/9 encoded streams.

But this is also somewhat a general 'problem' with the datatypes in that we describe both containers and encodings, and we don't capture that metadata in galaxy... but on the other hand I'm not sure I've encountered many tools that are doing video processing and aren't using something like ffmpeg / libav / etc and would be able to handle different stream codecs within a container.

Do you have a reason to add this restriction here? If so, perhaps that would be best expressed as metadata of the file, and then annotate that metadata here, and whatever tools you use can make decisions based on that metadata.

Also I'm not sure of a benefit we get from this specific change; we're still doing the ffprobe so we aren't doing any less computation via using magic numbers (i.e. the usual reason for adding them), are magic numbers more accurate than checking for matroska?

The problem is that for mkv and webm formats, they look the same to ffprobe and have the same magic number. So the way that I found is that webm video codec could only be vp8 or vp9, That is why I am using this approach. And for computation you are right it doesn't make any difference that is why I put the mkv and webm at the end of the sniffer tags in datatypes_conf.xml.sample.

Yes, you are correct, webms are mkvs, just with restricted format lists. I guess there is precedent for datatypes that are subsets (bed)

But in this case we would potentially (wrongly?) detect a user's uploaded mkv that happens to use vp8 as webm format, which is an odd scenario to encounter.

I'm not sure what the best behaviour is here and I'd be happy to hear input from other folks who deal with video data.

There is also another way to find out the format in Python and it is with magic but I'm afraid it will slow the process.

>>> import magic >>> f = magic.Magic(mime=True) >>> f.from_file('testfile')

hexylena · 2024-04-25T13:49:36Z

lib/galaxy/datatypes/media.py

-        return False
+        return _get_file_format_from_magic_number(filename, "mp3")


but yes cases like this, instead of return False, just return _get_file_format_from_magic_number(filename, "mp3") and keeping the ffprobe would potentially be good.

hexylena · 2024-04-25T13:50:37Z

lib/galaxy/datatypes/media.py

+        if which("ffprobe"):
+            metadata, streams = ffprobe(filename)
+            vp_check = any(
+                stream["codec_name"] in ["vp8", "vp9"] for stream in streams if stream["codec_type"] == "video"


webm also supports av1.

So, should I check av1, vp8, and vp9 for webm or is this approach entirely incorrect?

Based on #18054 (comment), I don't know.

Perhaps yes, check for those formats and if they're present it's a "webm". I think we could add a set_peek method which says "WebM subset of MKV" (or a similar description) to communicate that to the user that it's still actually a matroshka container.

hexylena · 2024-04-29T07:52:03Z

lib/galaxy/datatypes/media.py

+            "66 74 79 70 69",
+            "66 74 79 70 6D",
+            "66 74 79 70 4D",


https://en.wikipedia.org/wiki/List_of_file_signatures

66 74 79 70 69 73 6F 6D ftypisom 4 mp4 ISO Base Media file (MPEG-4) 66 74 79 70 4D 53 4E 56 ftypMSNV 4 mp4 MPEG-4 video file 66 74 79 70 68 65 69 63 66 74 79 70 6d ftypheic 4 heic High Efficiency Image Container (HEIC)

6d is heic, not sure if you want to lump this in with mp4? (I assume you based this on files you had/have seen). Also might want to consider extending these magic numbers to include the whole string to avoid future false positives

I wanted to grab m4v file format.
https://www.garykessler.net/library/file_sigs.html

[4 byte offset]
66 74 79 70 6D 70 34 32 [4 byte offset]
ftypmp42
M4V MPEG-4 video|QuickTime file

But, you are right it will be missed with the heic I will change it to 66 74 79 70 6D 70 34 32.
Also, I will consider extending the magic numbers.

hexylena · 2024-04-29T07:53:42Z

lib/galaxy/datatypes/media.py

+            "66 74 79 70 4D",
+        ],
+    },
+    "flv": {"offset": 0, "hex": ["46 4C 56 01"]},


also by wikipedia's list, the magic should just be FLV, so 01 in the last byte might be too restrictive.

Since many of these are human readable strings (like ftypisom) it might make sense for your implementation to support something like

Suggested change

"flv": {"offset": 0, "hex": ["46 4C 56 01"]},

"flv": {"offset": 0, "string": ["FLV"]},

and then check the correct number of bytes based off of the length of the string or so, to not have to specify a 4th byte if it isn't relevant would reduce some of these long hex lists.

hexylena · 2024-04-29T07:56:31Z

lib/galaxy/datatypes/media.py

+            "00 00 01 B7",
+            "00 00 01 B8",
+            "00 00 01 B9",
+            "00 00 01 BA",


are these all seen in practice? Wikipedia's list suggests it should be a much reduced set

00 00 01 BA 00 00 01 B3

I was using this source:
https://www.garykessler.net/library/file_sigs.html

00 00 01 Bx ....
MPEG, MPG MPEG video file
Trailer:
00 00 01 B7 (...·)

github-actions · 2024-05-20T11:51:36Z

This PR was merged without a "kind/" label, please correct.

arash77 added 9 commits April 23, 2024 14:35

Add support for Ogg, Webm, Mpeg, Mpga, M4a, Mov, Avi, Wmv, and Wma me…

99e1fc8

…dia types

Add support for Ogg, Webm, Mpeg, Mpga, M4a, Mov, Avi, Wmv, and Wma me…

541939f

…dia types

Update media datatypes

498e095

Refactor media datatype sniffing logic in media.py

f420c97

Refactor media datatype sniffing logic in media.py

1cfc63e

Update media datatypes and using magic numbers as main idinifier of f…

e64cb60

…ile formats

Refactor media datatype sniffing logic in media.py

de6f00c

change the order of file check, change names to meaningful names

e50f829

Fix typo in datatype sniffing logic for H5 files

371e227

github-actions bot added the area/datatypes label Apr 25, 2024

github-actions bot added this to the 24.1 milestone Apr 25, 2024

hexylena reviewed Apr 25, 2024

View reviewed changes

arash77 added 2 commits April 25, 2024 17:35

Refactor media datatype sniffing logic

230b734

Sorting imports

d7673ab

hexylena reviewed Apr 29, 2024

View reviewed changes

arash77 added 3 commits April 29, 2024 17:08

Refactor media datatype sniffing logic in media.py

a7e9775

using magic bit as fallback

85c1db5

remove unnecessary list

ef45b49

mvdbeek requested a review from dannon May 14, 2024 14:34

dannon approved these changes May 20, 2024

View reviewed changes

dannon merged commit 967d39f into galaxyproject:dev May 20, 2024
51 of 52 checks passed

arash77 deleted the add_ogg_datatype_media branch May 21, 2024 07:15

jdavcs added the kind/enhancement label May 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add support for additional media types #18054

Add support for additional media types #18054

arash77 commented Apr 25, 2024

hexylena left a comment

hexylena Apr 25, 2024

arash77 Apr 25, 2024

arash77 Apr 25, 2024

hexylena Apr 29, 2024

arash77 Apr 29, 2024

hexylena May 2, 2024 •

edited

hexylena Apr 25, 2024

arash77 Apr 25, 2024

hexylena Apr 29, 2024

arash77 Apr 29, 2024

hexylena Apr 25, 2024

hexylena Apr 25, 2024

arash77 Apr 25, 2024

hexylena Apr 29, 2024

hexylena Apr 29, 2024 •

edited

arash77 Apr 29, 2024

hexylena Apr 29, 2024 •

edited

hexylena Apr 29, 2024

arash77 Apr 29, 2024

github-actions bot commented May 20, 2024

	@lru_cache(maxsize=128)
	def _ffprobe(path):
	return subprocess.run(
	["ffprobe", "-loglevel", "quiet", "-show_format", "-show_streams", "-of", "json", path], capture_output=True
	)


	def ffprobe(path):
	completed_process = _ffprobe(path)
	completed_process.check_returncode()
	data = json.loads(completed_process.stdout.decode("utf-8"))
	return data["format"], data["streams"]

		return False
		return _get_file_format_from_magic_number(filename, "mp3")

	"flv": {"offset": 0, "hex": ["46 4C 56 01"]},
	"flv": {"offset": 0, "string": ["FLV"]},

Add support for additional media types #18054

Add support for additional media types #18054

Conversation

arash77 commented Apr 25, 2024

How to test the changes?

License

hexylena left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hexylena May 2, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hexylena Apr 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hexylena Apr 29, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

github-actions bot commented May 20, 2024

hexylena May 2, 2024 •

edited

hexylena Apr 29, 2024 •

edited

hexylena Apr 29, 2024 •

edited