Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid missing base item fields in item loaders #3047

Closed
wants to merge 1 commit into from

Conversation

stummjr
Copy link
Member

@stummjr stummjr commented Dec 21, 2017

This is an attempt to fix the behavior described in #3046.

Instead of just checking if the value inside the loader is not None in order to decide if a field from the initial item should be overwritten or not, load_item() should also make sure that the value returned by get_output_value() is not an empty list.

That is because self._local_values , which stores the new values included via add_* or replace_* methods, is adefaultdict(list). Then, when we call get_output_value() for a field only available in the initial item, an empty list will be set for that field in self._local_values (because of this).

This way, we make sure we don't miss fields from the initial item, in case get_output_value() gets called for one of the pre-populated fields before load_item(), as described on #3046.

@codecov
Copy link

codecov bot commented Dec 27, 2017

Codecov Report

Merging #3047 into master will increase coverage by 0.15%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3047      +/-   ##
==========================================
+ Coverage   84.51%   84.67%   +0.15%     
==========================================
  Files         164      164              
  Lines        9270     9389     +119     
  Branches     1380     1404      +24     
==========================================
+ Hits         7835     7950     +115     
- Misses       1177     1181       +4     
  Partials      258      258
Impacted Files Coverage Δ
scrapy/loader/__init__.py 94.52% <100%> (ø) ⬆️
scrapy/core/downloader/handlers/http11.py 93.56% <0%> (+1.43%) ⬆️

@dangra dangra changed the title Avoid missing base item fields in item loaders [MRG+1] Avoid missing base item fields in item loaders Dec 27, 2017
@dangra dangra requested a review from kmike December 27, 2017 22:27
@dangra
Copy link
Member

dangra commented Dec 27, 2017

@stummjr it makes sense and I don't consider the change of behavior a backward incompatibility, more a gotcha removal. thanks

@kmike any comment before merging and including in 1.5.0 release?

@kmike
Copy link
Member

kmike commented Dec 29, 2017

Argh, item loaders! As I understand the code, this change affects not only get_output_value, but output processors as well: if an output processor returns an empty list, after the change load_item will be returning a default value instead of this empty list.

No idea how large is the issue. Example use case, rather theoretical: MapCompose output processor which drops some values from the result; when all results are dropped, after this change a default value is returned in .load_item() instead of an empty list.

That said, the way lists play with ItemLoaders is weird anyways. For example:

ld = ItemLoader({'colors': ['white', 'black']})
ld.replace_value('colors', ['red', 'yellow'])
ld.load_item()  
# {'colors': ['red', 'yellow']}

ld = ItemLoader({'colors': ['white', 'black']})
ld.replace_value('colors', [])
ld.load_item() 
# {'colors': ['white', 'black']}

ld = ItemLoader({'colors': ['white', 'black']})
ld.replace_value('colors', 'blue')
ld.load_item()  
# {'colors': ['blue']}

I can't find in ItemLoader docs that loader.replace_value(name, None) or loader.replace_value(name, []) or tuple() or set() doesn't replace the value, but sets it to default, unlike any other value, including empty dicts, empty strings, False, 0, etc.

So I'm not against merging this PR, as it fixes a real-world issue @stummjr had, and there is undocumented item loader behavior anyways. But at the same time, this PR seems to add more undocumented behavior to ItemLoaders.

@dangra
Copy link
Member

dangra commented Dec 29, 2017

What if load_items only attempt to set field's value if a call to _add_value was made for that field. It means loader has to keep track of "modified" fields and it can't rely only on _local_values dict.

>>> item = {'colors': ['white', 'black'], 'foo': 'bar'}
>>> ld = ItemLoader(item)
>>> ld.get_output_value('colors')
[]
>>> ld.load_item()
{'colors': ['white', 'black'], 'foo': 'bar'}
>>> ld.replace_value('colors', [])
>>> ld.get_output_value('colors')
[]
>>> ld.load_item()
{'colors': [], 'foo': 'bar'}

@dangra dangra changed the title [MRG+1] Avoid missing base item fields in item loaders Avoid missing base item fields in item loaders Dec 29, 2017
@@ -113,7 +113,7 @@ def load_item(self):
item = self.item
for field_name in tuple(self._values):
value = self.get_output_value(field_name)
if value is not None:
if value is not None and value != []:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can imagine cases where someone expects a failed load to still populate an empty list, and this change might break things for them. Perhaps instead the code should check whether adding an empty list would clobber an existing field? Because, per your issue in #3046 I think that's the more bug-like behaviour?

@Gallaecio
Copy link
Member

Closing given #3046 has been fixed.

@Gallaecio Gallaecio closed this Oct 30, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants