Looking for help solving a problem with thread-safety and XmpProperties::registerNs() #2064

mallman · 2022-02-02T01:40:28Z

mallman
Feb 2, 2022
Collaborator

First, let me state that I have indeed RTFM'd and found that the thread-"unsafety" of the XmpProperties::registerNs() method is documented in the readme. Nevertheless, I'm in a bit of a pickle trying to solve a practical problem deriving from this issue. To put it briefly, I'm finding in my usage that Image::readMetadata() is not always safe to call from multiple threads concurrently because XmpProperties::registerNs() is not thread-safe. It's also difficult to anticipate the circumstances in which it is and is not thread-safe, and take precautions. Let me describe my use case.

I am trying to read the metadata from a large number of image files, say tens of thousands or so. To improve performance, I read the metadata from multiple files concurrently on multiple threads. According to the documentation, I should serialize calls to XmpProperties::registerNs(). However, I don't see a simple way to do this, because the way Image::readMetadata() calls that method introduces a race condition under certain circumstances. I'll describe those now.

I happen to have some files with embedded XMP metadata which share some common xml namespace prefixes that map to different namespaces among the documents. To give my particular use case, one file (file "A") maps the prefix "vr" to "http://www.communicatingastronomy.org/repository/1.0/". Another (file "B") maps the same prefix to "http://www.communicatingastronomy.org/repository/1.1/". The same conflation is true among multiple files—the definition of "vr" bounces back and forth. When I call Image::readMetadata() concurrently on these files, the ad-hoc mapping and re-mapping of the "vr" namespace prefix creates a race condition. For example, consider this sequence of events assuming we have two threads independently reading the metadata from file A and file B:

1. Thread 1 reads file A and maps "vr" to "http://www.communicatingastronomy.org/repository/1.0/" globally.
2. Thread 2 reads file B and re-maps "vr" to "http://www.communicatingastronomy.org/repository/1.1/" globally.
3. Thread 1 looks for a namespace prefix for "http://www.communicatingastronomy.org/repository/1.0/" and doesn't find it (because Thread 2 re-mapped "vr" to "http://www.communicatingastronomy.org/repository/1.1/"). Now thread 1 throws an exception.
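The interleaving above can be replayed deterministically with a toy model. This is only a sketch under assumptions, not exiv2's actual implementation: `ToyRegistry`, its `registerNs` and `prefixFor` methods, and `thread1LookupSucceeds` are hypothetical names, and the model assumes a single shared prefix-to-namespace table in which registering a prefix overwrites any earlier mapping.

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy stand-in for a process-wide prefix -> namespace table
// (hypothetical; not exiv2's real data structure).
class ToyRegistry {
public:
    // Registering a prefix silently replaces any earlier mapping for it.
    void registerNs(const std::string& ns, const std::string& prefix) {
        assert(!prefix.empty());
        prefixToNs_[prefix] = ns;
    }

    // Reverse lookup: the prefix currently bound to `ns`, or "" if none.
    std::string prefixFor(const std::string& ns) const {
        for (const auto& entry : prefixToNs_) {
            if (entry.second == ns) return entry.first;
        }
        return std::string();
    }

private:
    std::map<std::string, std::string> prefixToNs_;
};

// Replays the three steps in order on one shared registry and reports
// whether "thread 1" can still find a prefix for its namespace.
inline bool thread1LookupSucceeds() {
    const std::string nsA = "http://www.communicatingastronomy.org/repository/1.0/";
    const std::string nsB = "http://www.communicatingastronomy.org/repository/1.1/";
    ToyRegistry shared;
    shared.registerNs(nsA, "vr");            // step 1: thread 1, file A
    shared.registerNs(nsB, "vr");            // step 2: thread 2, file B
    return !shared.prefixFor(nsA).empty();   // step 3: thread 1 looks up nsA
}
```

In this model the lookup in step 3 fails even if every individual call is guarded by a mutex, because the problem is the ordering of the calls across threads rather than a torn read or write.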

I don't see an easy/simple way to work around this issue. One simple means would be to enforce serial access to the Image::readMetadata() method, but that would mean giving up the performance benefits of concurrency.
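For reference, the serial-access workaround could look roughly like the sketch below. It uses only the standard library; `metadataMutex`, `withMetadataLock`, and `countUnderLock` are hypothetical helper names, not part of exiv2. The assumption is that every metadata read in the process goes through the same wrapper.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <utility>
#include <vector>

// One process-wide lock for every call that may touch the namespace registry.
inline std::mutex& metadataMutex() {
    static std::mutex m;
    return m;
}

// Runs `fn` while holding the lock and forwards its result (if any).
template <typename Fn>
auto withMetadataLock(Fn&& fn) -> decltype(fn()) {
    std::lock_guard<std::mutex> guard(metadataMutex());
    return std::forward<Fn>(fn)();
}

// Demo: several threads bump a plain (non-atomic) counter through the lock.
inline int countUnderLock(int threads, int perThread) {
    assert(threads > 0 && perThread >= 0);
    int counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, perThread] {
            for (int i = 0; i < perThread; ++i) {
                withMetadataLock([&counter] { ++counter; });
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

With exiv2, the callable would wrap the read itself, for example `withMetadataLock([&] { image->readMetadata(); });` with `image` obtained from `Exiv2::ImageFactory::open(path)`. Work that does not touch the registry can stay outside the lock and keep running concurrently.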

What do you think? Anyone have advice on this? Is this behavior a bug in exiv2?

Any help is much appreciated. Thank you.

(FYI, I've been building exiv2 from the main branch. The binary I'm using I built from commit 55712d4.)

(Also, let me add that the underlying call to XmpProperties::registerNs() made down the stack from Image::readMetadata() is this particular one, at exiv2/src/xmp.cpp, line 787 in 55712d4:

XmpProperties::registerNs(schemaNs, prefix);)

clanmills · 2022-02-02T06:03:39Z

clanmills
Feb 2, 2022
Maintainer

@mallman Michael. Thank you for raising this topic. I believe we've spoken in the past. Welcome back. I think you might be in trouble here.

The most obvious point to make is, perhaps you have to avoid multi-threading in this case. What about multi-processing combined with multi-threading? If you can segment your data into collections of 1.0 and 1.1 files, you could process the collections in separate processes (both of which can be multi-threaded).

Exiv2 is fast. I have 80,000 images on my web-site and have scripts that occasionally read them all. I'm not bothered if they take 2 or 3 hours to run. I've just run a little test on my 8-year-old Mac mini with a spinning disk: 50 images/second. My MacBook Pro, with the M1 chip and an SSD, achieves 1000+ images/second.

Another thought is to "fix" your files once and for all. XMP is xml and usually easy to spot in your images. You could use the utility sed to modify http://www.communicatingastronomy.org/repository/1.0/ to http://www.communicatingastronomy.org/repository/1.1/ in all your images and avoid the issue altogether.

I don't know much about the XMPsdk, however I know that Adobe changed the API concerning prefixes. You give the ns URI and preferred prefix and it returns the prefix to be used. If vr is already used, it returns vr1 (or something like that). I believe it does this to ensure that no prefix can be an alias for more than one URI. Building Exiv2 with Adobe XMP is documented in README-CONAN.md, however it's unlikely that Adobe's modified API is understood by the exiv2 code-base! To compound matters, if you're reading the ns prefix from the file you must respect that prefix, so I'm not sure that you can use vr1.
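To illustrate the behaviour described here (give a URI and a preferred prefix, get back the prefix actually in use), a minimal sketch follows. It is not Adobe's implementation or API: `UniquePrefixRegistry` and its `registerNs` method are hypothetical, and the numeric-suffix scheme simply follows the "vr1 (or something like that)" guess above.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical registry in which a prefix can never alias two URIs.
class UniquePrefixRegistry {
public:
    // Returns the prefix actually bound to `ns`:
    //  - the existing prefix if `ns` was registered before,
    //  - the suggestion if it is still free,
    //  - otherwise the suggestion plus the first free numeric suffix.
    std::string registerNs(const std::string& ns, const std::string& suggested) {
        assert(!ns.empty() && !suggested.empty());
        auto known = nsToPrefix_.find(ns);
        if (known != nsToPrefix_.end()) return known->second;

        std::string prefix = suggested;
        for (int n = 1; prefixToNs_.count(prefix) != 0; ++n) {
            prefix = suggested + std::to_string(n);
        }
        prefixToNs_[prefix] = ns;
        nsToPrefix_[ns] = prefix;
        return prefix;
    }

private:
    std::map<std::string, std::string> prefixToNs_;
    std::map<std::string, std::string> nsToPrefix_;
};
```

In this model a caller has to use the returned prefix rather than the one it asked for, which is exactly the difficulty raised above when the prefix is dictated by the file being read.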

4 replies

clanmills Feb 2, 2022
Maintainer

GitHub seems to think I have given you an answer. He's mad. I have responded, that's all.

mallman Feb 3, 2022
Collaborator Author

> GitHub seems to think I have given you an answer. He's mad. I have responded, that's all.

Yeah. I don't see a way to simply provide a comment. That's a shortcoming.

> I believe we've spoken in the past. Welcome back.

We have! And thank you!

> The most obvious point to make is, perhaps you have to avoid multi-threading in this case. What about multi-processing combined with multi-threading? If you can segment your data into collections of 1.0 and 1.1 files, you could process the collections in separate processes (both of which can be multi-threaded).

I'm trying to build a simpler, more general solution that does not make assumptions about the XMP metadata a priori. My system is intended to be updated on-demand with images with arbitrary metadata. I cannot say that this particular namespace collision is the only foreseeable one.

That being said, I'm measuring wall time in my application this morning, and I'm finding that writing the metadata to a sqlite database is actually the predominant factor in determining the throughput of my metadata indexing. And writing to sqlite is a single-threaded process. So in the end, in my particular use case, concurrently reading the metadata actually makes very little difference!
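A measurement of this kind can be sketched with `std::chrono::steady_clock`. The helper names `secondsFor` and `shareOf` are hypothetical, and the stages being timed (reading metadata, writing to the database) would be whatever callables the application already has.

```cpp
#include <cassert>
#include <chrono>

// Runs `fn` once and returns the elapsed wall time in seconds.
template <typename Fn>
double secondsFor(Fn&& fn) {
    const auto start = std::chrono::steady_clock::now();
    fn();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(stop - start).count();
}

// Fraction of the total wall time spent in one stage.
inline double shareOf(double stageSeconds, double totalSeconds) {
    assert(totalSeconds > 0.0);
    return stageSeconds / totalSeconds;
}
```

Timing the read stage and the write stage separately and comparing their shares shows which one actually bounds throughput before any effort goes into parallelizing the other.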

(Measure first, complain later, right?)

My system is actually even more general-purpose in the sense that it's supposed to run a variety of tasks aside from just metadata extraction. And some of those tasks do in fact benefit from concurrency. So the problem for me comes down to making the metadata extraction a serial process while allowing other (thread-safe) tasks to run concurrently.

FWIW, if I omit writing to the database, then I do see a significant throughput boost from concurrency. However, for me that's fantasy unless and until I can find a way to implement the write path to run concurrently. (And that's assuming that that's even worth the effort.)

All that being said, after I wrote this message I did in fact find a way to solve this problem by making a modification inside exiv2. I was inspired by the concept of thread-local scope. I am not a C++ programmer, however I have used thread-local scope in other programming languages. So I did a little digging and found that, if I understand correctly, making the prefix-to-namespace dictionary thread local is simply a matter of marking the declaration thread_local. And to my pleasant surprise, after making this source modification and rebuilding, I found my problem went away! In fact, I removed the mutex guarding access to that data structure and things still seem to work fine as long as that data structure has thread-local scope.
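The effect of marking the dictionary `thread_local` can be sketched outside exiv2 as follows. This is a standalone illustration, not the actual patch: `localRegistry`, `registerNs`, `nsFor`, and `eachThreadSeesOwnMapping` are hypothetical names, and the dictionary is assumed to be a single prefix-to-namespace map.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <thread>

// Each thread gets its own copy of this table, so no lock is needed.
// Hypothetical stand-in for the prefix -> namespace dictionary.
inline std::map<std::string, std::string>& localRegistry() {
    thread_local std::map<std::string, std::string> table;
    return table;
}

// Binds `prefix` to `ns` in the calling thread's table only.
inline void registerNs(const std::string& ns, const std::string& prefix) {
    assert(!prefix.empty());
    localRegistry()[prefix] = ns;
}

// Namespace bound to `prefix` in the calling thread, or "" if none.
inline std::string nsFor(const std::string& prefix) {
    auto& table = localRegistry();
    auto it = table.find(prefix);
    return it == table.end() ? std::string() : it->second;
}

// Two threads bind "vr" to different namespaces and read it back.
// Returns true if each thread still sees its own mapping.
inline bool eachThreadSeesOwnMapping() {
    const std::string nsA = "http://www.communicatingastronomy.org/repository/1.0/";
    const std::string nsB = "http://www.communicatingastronomy.org/repository/1.1/";
    std::string seenA, seenB;
    std::thread a([&] { registerNs(nsA, "vr"); seenA = nsFor("vr"); });
    std::thread b([&] { registerNs(nsB, "vr"); seenB = nsFor("vr"); });
    a.join();
    b.join();
    return seenA == nsA && seenB == nsB;
}
```

One consequence of this design, at least in the sketch, is that a namespace registered on one thread is invisible to every other thread, so any registration done once at startup would have to be repeated per worker thread.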

Now this is quite preliminary. I don't know the consequences to cpu and memory use, however it does seem to make reading metadata with namespace collisions thread-safe.

In the event, given that my usage is in effect single threaded, it will be simpler to avoid modifying the exiv2 codebase and simply make the metadata extraction process single threaded. And perhaps this is a rather large hammer to fix an uncommon scenario, but I'd like to ask your opinion. What do you think about making the prefix-to-namespace dictionary thread-local?

In hindsight, I could have and should have done a little more due diligence to identify that my use case is in practice single threaded. In that sense, my question is a wash.

Thank you, Robin.

clanmills Feb 3, 2022
Maintainer

Ah right. I also use sqlite3 to manage my 80,000 images. I don't care if my scripts run all night.

I don't recall using C++11 thread-local. My C++ skills are stuck in the dark ages before C++11. I retired from work in 2014 and am now 71. For me GC is not a Garbage Collector. It's a Grass Cutter. I'm more or less retired now from Exiv2 and taking care of the "dot" releases on branch 0.27-maintenance, which is written in C++98. Team Exiv2 (about 10 enthusiastic and great folks) are working on branch "main", which has been modernised to C++17.

Looking at Wikipedia, thread-local seems reasonable to me, and you've tested it in a realistically stressful environment. Can you either submit a PR or open a new issue to ask for this to be supported? If you go down the issue route, please provide your code patch as a starting point. Another team member will work with you to add this to the code base and you can test it.

> (Measure first, complain later, right?)

Hell, no. There are plenty of users who only complain. I call them ABs and discuss them in my book: https://exiv2.org/book/index.html#11

It's good to have treated this topic as a discussion. That's a calm and nice way to ask "what do you think?". If GitHub asks this time "Have you answered?", I'll say "Yes". The truth is that you have answered your own question.

mallman Feb 4, 2022
Collaborator Author

> Can you either submit a PR, or open a new issue to ask for this to be supported.

I will submit a simple draft PR as a proposal. If the maintainers feel it's a good direction to take, I'll do my best to fill it out with appropriate tests and documentation. However, I feel my lack of C++ skills will slow that process significantly. I will @ you for your reference.

> I call them ABs and discuss them in my book: https://exiv2.org/book/index.html#11.

Wow. An Exiv2 book. Wow. Ummm... wow. That's dedication. I rarely see something like this book in an open source project. Exiv2 is quite the rarity.

I definitely recognize some of the pain points of open source project management. Once you publish a project, it becomes a responsibility. That's a large reason I rarely publish open source.

> The truth is that you have answered your own question.

So much the better for your assistance. Thanks again.

clanmills · 2022-02-02T08:19:15Z

clanmills
Feb 2, 2022
Maintainer

I've had another thought about this. readMetadata() finds the XMP/xml "block" and then informs the Adobe XMPsdk. You could inject code to modify http://www.communicatingastronomy.org/repository/1.0/ to http://www.communicatingastronomy.org/repository/1.1/ at that point. Regrettably, you'll have to modify your source code because the library doesn't provide a callback for you. If all your images are JPEGs, you can do this in src/jpgimage.cpp. If you're processing files in many formats, there's a parse function in xmpsdk/src. You can "doctor" the parse function to perform this chore.

The benefit of putting this code in readMetadata() or the XMPsdk file parser is that you are certain that your code is working on your behalf. The crude approach using sed could cause collateral damage. Additionally, if you are reading (and not updating/modifying metadata), you do not need to modify your original images. If you modify the metadata, the XMP/xml will use the vr prefix with the URI http://www.communicatingastronomy.org/repository/1.1/.
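The rewrite itself, applied to the XMP packet in memory before it is handed to the parser, might look like the sketch below. `rewriteNamespace` is a hypothetical helper, not an exiv2 or XMPsdk function, and where exactly it would be called inside the library is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Replaces every occurrence of `from` with `to` in an XMP packet held in
// memory and returns the number of replacements made.
inline std::size_t rewriteNamespace(std::string& packet,
                                    const std::string& from,
                                    const std::string& to) {
    assert(!from.empty());
    std::size_t count = 0;
    std::size_t pos = 0;
    while ((pos = packet.find(from, pos)) != std::string::npos) {
        packet.replace(pos, from.size(), to);
        pos += to.size();  // continue after the replacement, never inside it
        ++count;
    }
    return count;
}
```

Because the two URIs in question differ only in one digit and have the same length, this particular substitution leaves the packet size unchanged.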

1 reply

mallman Feb 4, 2022
Collaborator Author

> You could inject code to modify http://www.communicatingastronomy.org/repository/1.0/ to http://www.communicatingastronomy.org/repository/1.1/ at that point.

I appreciate your thoughts in this direction, however my perfectionism will not allow me to modify a document's namespace, even one as trivial as this. Call me obsessive. I am.

However, I think there's another reason this is not the best tack. We're assuming we have a known conflict that we more-or-less manually patch up. However, I want my system to handle arbitrary cross-document prefix collisions and do so without manual intervention.

I think that either the thread-local source modification or the single-threaded approach is a better option.

Cheers.
