Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow extension with custom extractors #6

Open
jkraemer opened this issue Feb 10, 2018 · 3 comments
Open

Allow extension with custom extractors #6

jkraemer opened this issue Feb 10, 2018 · 3 comments

Comments

@jkraemer
Copy link
Member

Feature idea, relatively low priority:

Right now you have to modify the source in order to add a new extractor class. There should be a way to do this dynamically, i.e. by calling TextExtractor.register(SomeCustomFileHandlerClass) or similar. We will have to find a way to determine the ordering / precedence of extractors since the list will be dynamic.

@thegcat
Copy link
Member

thegcat commented Feb 10, 2018

As the order only becomes relevant if you have multiple extractors capable of extracting text from the same file types, I would suggest postponing the ordering discussion until someone has a use-case for that, this way the code stays simple until required, and you might have a better understanding of the problem once a use-case presents itself.

@wielinde
Copy link
Collaborator

While working on my last PR I thought about a registry as well. So yes, @jkraemer, there is some point. However, my decision was somewhat similar to the thoughts of @thegcat ("build stuff when you need it").

@meineerde
Copy link
Member

If you decide to implement such a order-based registry, some prior art for that would be:

These options allow you to define an ordered list of handlers, which will be tried one after another. With these APIs, you can implement something like

TextExtractor.register(OCRExtractor)
TextExtractor.register_after(OCRExtractor, FirstParagraphExtractor)

Alternatively, you could also ignore the multiple-extractors-per-file-type issue at the registry at all and instead allow to register a MultiExtractor performs does this transparently for the registry. Then, you could use something much simpler such as Rackstash::ClassRegistry which is used here to register classes to names and to build instances for them. See Rackstash::Encoder for how this can be used.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants