
Item default value is appended to the processor_in output in ItemLoader #62

Open
sergiogup opened this issue Feb 8, 2023 · 2 comments

sergiogup commented Feb 8, 2023

Description

If I create an Item using a dataclass and define a default value, that default value ends up at the start of the list of values passed to the output processor. Shouldn't the default value be overridden, or at least placed at the end of the list when the input processor result is not None? Is this intended behavior?

If so, it means that a user who wants to provide a non-None default value has to write a new output processor, e.g. a TakeSecond() that works like TakeFirst() but takes the second value in the list when one is present, and falls back to the first value when the input processor received a None value (a sketch follows below).
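A minimal sketch of such a processor, assuming the standard itemloaders output-processor contract (a callable that receives the list of collected values); the name TakeSecond is hypothetical and not part of itemloaders:

# Hypothetical output processor: prefer the second collected value when one
# exists, otherwise fall back to the first (the dataclass default).
class TakeSecond:
    def __call__(self, values):
        values = [v for v in values if v is not None and v != '']
        if len(values) > 1:
            return values[1]
        if values:
            return values[0]

It would be wired up as user_rating_out = TakeSecond() in the loader.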

Steps to Reproduce

import scrapy
from dataclasses import dataclass
from typing import Optional

from itemloaders.processors import TakeFirst, MapCompose, Join, Compose, Identity
from scrapy.loader import ItemLoader


# Item definition
@dataclass
class ArticleItem:
    user_rating: Optional[float] = -999


# Item Loader definition
class RappiLoader(ItemLoader):

    default_output_processor = TakeFirst()

    user_rating_in = MapCompose(lambda x: x.get('score') if isinstance(x, dict) else None)

Expected behavior:
The resulting item should be ArticleItem(user_rating=4.5)

Actual behavior:
The result is ArticleItem(user_rating=-999)

If I inspect the value that gets fed to the output processor, it is [-999, 4.5]. That is why, when using TakeFirst(), it looks as if the input processor did not receive a valid value from the spider, which is not the case.
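For reference, a minimal driver along these lines reproduces the values above (the {'score': 4.5} input is an assumption, since the spider code is not included):

loader = RappiLoader(item=ArticleItem())
loader.add_value('user_rating', {'score': 4.5})    # assumed spider input
print(loader.get_collected_values('user_rating'))  # [-999, 4.5]
print(loader.load_item())                          # ArticleItem(user_rating=-999)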

Reproduces how often:
Every time

Versions

Scrapy : 2.6.2
lxml : 4.8.0.0
libxml2 : 2.9.10
cssselect : 1.1.0
parsel : 1.6.0
w3lib : 1.22.0
Twisted : 22.4.0
Python : 3.9.13 | packaged by conda-forge | (main, May 27 2022, 17:00:52) - [Clang 13.0.1 ]
pyOpenSSL : 22.0.0 (OpenSSL 1.1.1s 1 Nov 2022)
cryptography : 37.0.4
Platform : macOS-13.1-x86_64-i386-64bit

wRAR transferred this issue from scrapy/scrapy on Feb 8, 2023

wRAR commented Feb 8, 2023

Moving this to itemloaders, though on second thought (and without checking any code) it may be, though not likely, itemadapter-specific.


VMRuiz commented Feb 8, 2023

Hello @wRAR, this is related to how itemloaders builds items.

What's happening here is that ItemLoader creates an instance of ArticleItem without any values. Because of that, user_rating is set to -999 during instance creation, and ItemLoader treats it as the first value collected for that field. This is the same behaviour as passing an already pre-filled dictionary to an item loader:

>>> loader = RappiLoader(item={"user_rating": 2.5})
>>> loader.add_value("user_rating", 0)  # Add fallback value in case user_rating was not defined 
>>> loader.load_item()
{'user_rating': 2.5}

In this case, however, the expectation would be to keep the initial value that was already provided.

I think this shouldn't be fixed. Values are always added to the list in the same order; it is then up to the user to decide what to do with them, e.g. select the first one, the last one, all of them, apply some filters, or manipulate them.

You can implement a custom output processor that fits your needs (see the sketch below).
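For reference, a minimal sketch of one such processor (hypothetical name and logic; it assumes -999 is only ever used as the sentinel default):

# Hypothetical output processor: return the first collected value that is not
# the sentinel default, falling back to the default if nothing else was collected.
class TakeFirstNonDefault:
    def __init__(self, default):
        self.default = default

    def __call__(self, values):
        for value in values:
            if value is not None and value != '' and value != self.default:
                return value
        return self.default

It could then be used as user_rating_out = TakeFirstNonDefault(-999) in the loader.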
