
Extend ItemLoader processors #31

Open
Matthijsy opened this issue Jan 11, 2019 · 6 comments
Labels
enhancement New feature or request

Comments

@Matthijsy
Contributor

Currently there are three ways to define ItemLoader processors:

  • The default_input_processor/default_output_processor attributes on the ItemLoader class
  • The field_name_in/field_name_out attributes on the ItemLoader class
  • The input_processor/output_processor keys on scrapy.Field

Personally I use input_processor/output_processor on scrapy.Field a lot, in combination with default_input_processor/default_output_processor. Often I just want to add one more processor after the default ones, but since input_processor/output_processor on scrapy.Field overrides the defaults, this is quite hard to do.
So I would propose adding another way to attach input/output processors: something like add_input/add_output on scrapy.Field, which would append the specified processor to the default processor.

I implemented this on my own ItemLoader class, but I think it would be useful in Scrapy core. My implementation is as follows (original source: https://github.com/scrapy/scrapy/blob/master/scrapy/loader/__init__.py#L69). Of course the same can be done for get_output_processor.

def get_input_processor(self, field_name):
    # Compose comes from scrapy.loader.processors
    proc = getattr(self, '%s_in' % field_name, None)
    if not proc:
        override_proc = self._get_item_field_attr(field_name, 'input_processor')
        extend_proc = self._get_item_field_attr(field_name, 'add_input')
        if override_proc and extend_proc:
            raise ValueError(f'Not allowed to define both input_processor and add_input for {field_name}')
        if override_proc:
            return override_proc
        elif extend_proc:
            return Compose(self.default_input_processor, extend_proc)
        return self.default_input_processor
    return proc
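For context, the branch above relies on Compose applying the default processor first and the field-specific one after it. A minimal sketch of those semantics, without a Scrapy dependency (the processor functions are illustrative, not Scrapy API):

```python
# Minimal stand-in mirroring scrapy.loader.processors.Compose: each
# function is applied to the output of the previous one.
class Compose:
    def __init__(self, *functions):
        self.functions = functions

    def __call__(self, value):
        for func in self.functions:
            value = func(value)
        return value

def default_input_processor(value):
    # loader-wide default, e.g. strip whitespace
    return value.strip()

def field_specific_processor(value):
    # processor a field would attach via the proposed add_input
    return value.upper()

combined = Compose(default_input_processor, field_specific_processor)
print(combined("  hello "))  # -> HELLO
```

So a field declaring add_input would get the default behavior first, then its own extra step, without restating the default.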

I am not sure add_input is a good name; extend_input_processor is probably clearer, but it is quite long. I would like to hear whether more people want this feature and what you all think the naming should be.

@Gallaecio
Member

I am having a hard time trying to picture what you want done, even after reading the documentation from your pull request.

Could you provide some sample code that shows the (convoluted) way to achieve your goal before your changes, and the (simpler) way to do the same after your suggested changes?

@Matthijsy
Contributor Author

Matthijsy commented Mar 26, 2019

We have a custom ItemLoader which specifies some default input processors. Some fields need an additional input processor that is specific to that field. Right now we have to copy the default input processors and add the new one. If we want to add a new default input processor in the future, we have to do this for all those fields again. I would like to keep the default input processors and only extend them with one new input processor per field.

Example:
As default input processors we have strip and remove_tags functions to clean the data.
Now we have a Scrapy item representing a person with two fields:

  • Name -> for this field the default input processors are fine
  • Height -> this data comes in a format like 1.50m, but we want it in cm as an integer (150). So we want to add a new input processor for this conversion while keeping the strip and remove_tags input processors

I hope this makes the use case a bit clearer.
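A sketch of what that field-specific step could look like today, where the defaults must be repeated by hand (parse_height, remove_tags, and the regexes here are illustrative stand-ins, not Scrapy API):

```python
import re

def remove_tags(value):
    # stand-in for the remove_tags default processor mentioned above
    return re.sub(r'<[^>]+>', '', value)

def parse_height(value):
    # hypothetical field-specific processor: "1.50m" -> 150 (cm)
    match = re.match(r'(\d+(?:\.\d+)?)m', value)
    return int(float(match.group(1)) * 100) if match else None

# Without add_input, the height processor has to restate the defaults
# (remove_tags, strip) before its own conversion step:
def height_in(value):
    return parse_height(remove_tags(value).strip())

print(height_in(' <b>1.50m</b> '))  # -> 150
```

The duplication in height_in is exactly what the proposal would remove: only parse_height would need to be declared on the field.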

@Gallaecio
Member

So, if I got it right, you are saying that you have:

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor
    some_field_in = MapCompose(some_input_processor, another_input_processor)

How would you like that to look like instead?

@Matthijsy
Contributor Author

No, we don't use the some_field_in approach; we use it like this (current style):

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor

class User(scrapy.Item):
    name = scrapy.Field()
    height = scrapy.Field(input_processor=MapCompose(some_input_processor, another_input_processor))

But I would like to do it in this style:

class MyItemLoader(ItemLoader):
    default_input_processor = some_input_processor

class User(scrapy.Item):
    name = scrapy.Field()
    height = scrapy.Field(add_input=another_input_processor)

This way I don't have to duplicate some_input_processor.
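A sketch of how such a lookup could resolve the proposed add_input key (plain Python, no Scrapy dependency; names follow the snippet earlier in the thread but are illustrative):

```python
# Simulates resolving a field's input processor under the proposal:
# input_processor overrides the default, add_input extends it, and
# declaring both is an error.
def resolve_input_processor(field_meta, default):
    override = field_meta.get('input_processor')
    extend = field_meta.get('add_input')
    if override and extend:
        raise ValueError('Not allowed to define both input_processor and add_input')
    if override:
        return override
    if extend:
        # equivalent to Compose(default, extend)
        return lambda value: extend(default(value))
    return default

strip_proc = str.strip
# a field declaring only the extra step, e.g. "1.50m" -> 150 (cm)
height_field = {'add_input': lambda v: int(float(v.rstrip('m')) * 100)}

proc = resolve_input_processor(height_field, strip_proc)
print(proc(' 1.50m '))  # -> 150
```

The field only declares its extra step; the default (strip here) is applied first without being restated.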

@Matthijsy
Contributor Author

@Gallaecio Do you understand the problem now? Or do I need to explain more?

@Gallaecio
Member

Gallaecio commented Jun 4, 2019

I think I understand, although I personally don’t like coupling Item’s Field class and the ItemLoader class that way.

@Gallaecio Gallaecio transferred this issue from scrapy/scrapy Oct 30, 2020
@Gallaecio Gallaecio added the enhancement New feature or request label Feb 21, 2021