Item loader missing values from base item #3046

Closed
stummjr opened this issue Dec 21, 2017 · 4 comments
@stummjr
Member

stummjr commented Dec 21, 2017

ItemLoader behaves oddly when it is given a pre-populated item and get_output_value() is called for one of the pre-populated fields before load_item().

Check this out:

>>> from scrapy.loader import ItemLoader
>>> item = {'url': 'http://example.com', 'summary': 'foo bar'}
>>> loader = ItemLoader(item)
>>> loader.load_item()
{'summary': 'foo bar', 'url': 'http://example.com'}

# so far, so good... what about now?
>>> item = {'url': 'http://example.com', 'summary': 'foo bar'}
>>> loader = ItemLoader(item)
>>> loader.get_output_value('url')
[]
>>> loader.load_item()
{'summary': 'foo bar', 'url': []}

There are 2 unexpected behaviors in this snippet (at least from my point of view):

1) loader.get_output_value() doesn't return the pre-populated values, even though they end up in the final item.

This seems to be intentional, though: get_output_value() only queries the _local_values defaultdict.

2) once we call loader.get_output_value('url'), the original value of that field no longer appears in the load_item() result; it is replaced by an empty list.

This one doesn't look right, IMHO.

It happens because the first time we call loader.get_output_value('url'), that field is not present in _local_values, so the lookup creates a new entry in the _local_values defaultdict holding an empty list. Then, when loader.load_item() is called, it overwrites the value already stored in the internal item, because get_output_value() returns [] rather than None.
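The mechanism can be reproduced without Scrapy. The following is a minimal sketch, assuming a simplified stand-in class (`TinyLoader` is hypothetical and is not Scrapy's actual code) that keeps only the defaultdict lookup and the "not None" check described above:

```python
from collections import defaultdict


class TinyLoader:
    """Hypothetical stand-in for ItemLoader; not Scrapy's actual code."""

    def __init__(self, item):
        self.item = item
        # Values collected through the loader. This starts empty even
        # when the item passed in is already populated.
        self._local_values = defaultdict(list)

    def get_output_value(self, field_name):
        # Indexing a defaultdict inserts the missing key as a side effect.
        return self._local_values[field_name]

    def load_item(self):
        for field_name in tuple(self._local_values):
            value = self.get_output_value(field_name)
            if value is not None:  # an empty list passes this check
                self.item[field_name] = value
        return self.item


loader = TinyLoader({'url': 'http://example.com', 'summary': 'foo bar'})
first_read = loader.get_output_value('url')    # []
keys_after_read = list(loader._local_values)   # ['url'], created by the read
result = loader.load_item()                    # 'url' is now []
print(first_read, keys_after_read, result)
```

In this sketch the read alone is enough to register 'url' in _local_values, which is why the later load_item() call replaces the original URL with an empty list.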

Any thoughts on this?

@cathalgarvey
Contributor

The first issue is a UX issue, but I agree it should be fixed. The second one is, I think, more of a bug: a side effect of a getter method is wiping valid data from the output.
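As an illustration of the second point, here is a minimal sketch of a getter with no side effects. `SafeLoader` is a hypothetical class written for this example, not Scrapy's implementation and not necessarily the approach any eventual fix takes; it assumes the loader keeps its collected values in a defaultdict named _local_values, as described in the issue.

```python
from collections import defaultdict


class SafeLoader:
    """Hypothetical loader whose getter does not mutate internal state."""

    def __init__(self, item):
        self.item = item
        self._local_values = defaultdict(list)

    def get_output_value(self, field_name):
        # dict.get() does not trigger the defaultdict factory, so reading
        # an unknown field leaves _local_values untouched.
        return list(self._local_values.get(field_name, []))

    def load_item(self):
        # Only fields that were actually added to the loader are written.
        for field_name in tuple(self._local_values):
            self.item[field_name] = self.get_output_value(field_name)
        return self.item


loader = SafeLoader({'url': 'http://example.com', 'summary': 'foo bar'})
read_value = loader.get_output_value('url')
loaded = loader.load_item()
print(read_value, loaded)
```

This only addresses the data loss. The getter still returns an empty list for pre-populated fields, so the first (UX) issue would need a separate change.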

@yashrsharma44
Contributor

Even if we call load_item() first, get_output_value() still returns an empty list for each field:

>>> from scrapy.loader import ItemLoader
>>> item = {'url':'http://example.com','summary':'foo bar'}
>>> loader = ItemLoader(item)
>>> loader.load_item()
{'summary': 'foo bar', 'url': 'http://example.com'}
>>> loader.get_output_value('url')
[]
>>> loader.get_output_value('summary')
[]

I think get_output_value('field_name') is unable to return the values already stored in the item for that field.

@yashrsharma44
Contributor

yashrsharma44 commented Mar 3, 2018

I have submitted a bug fix for this issue in #3149. Please have a look, @cathalgarvey.

@Gallaecio
Member

Fixed by #3819, as @elacuesta mentions in #3897.
