Avoid missing base item fields in item loaders #3047

stummjr · 2017-12-21T01:42:23Z

This is an attempt to fix the behavior described in #3046.

Instead of just checking if the value inside the loader is not None in order to decide if a field from the initial item should be overwritten or not, load_item() should also make sure that the value returned by get_output_value() is not an empty list.

That is because self._local_values, which stores the new values added via the add_* or replace_* methods, is a defaultdict(list). So when we call get_output_value() for a field that is only available in the initial item, an empty list is set for that field in self._local_values, since looking up a missing key in a defaultdict inserts the default value.

This way, we make sure we don't lose fields from the initial item when get_output_value() is called for one of the pre-populated fields before load_item(), as described in #3046.
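The side effect described above can be reproduced without Scrapy. The following is a minimal sketch under stated assumptions: `local_values`, `get_output_value` and `load_item` are simplified stand-ins that borrow the loader's names, with no input or output processors, and are not Scrapy's actual code.

```python
from collections import defaultdict

# Simplified stand-in for the loader's internal storage (not Scrapy's real code).
local_values = defaultdict(list)
item = {"name": "foo"}  # a field that exists only in the initial item


def get_output_value(field_name):
    # Indexing a defaultdict with a missing key inserts the default, here [].
    return local_values[field_name]


def load_item():
    # Mirrors the pre-patch check: only None is skipped.
    for field_name in tuple(local_values):
        value = get_output_value(field_name)
        if value is not None:
            item[field_name] = value
    return item


before = dict(item)
peeked = get_output_value("name")  # merely reading creates the key
loaded = load_item()               # the empty list overwrites "foo"
```

In this reduced model, a single read of "name" is enough for load_item() to replace the original value with an empty list.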

codecov · 2017-12-27T21:40:46Z

Codecov Report

Merging #3047 into master will increase coverage by 0.15%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #3047      +/-   ##
==========================================
+ Coverage   84.51%   84.67%   +0.15%     
==========================================
  Files         164      164              
  Lines        9270     9389     +119     
  Branches     1380     1404      +24     
==========================================
+ Hits         7835     7950     +115     
- Misses       1177     1181       +4     
  Partials      258      258

Impacted Files                               Coverage Δ
scrapy/loader/__init__.py                    94.52% <100%> (ø)
scrapy/core/downloader/handlers/http11.py    93.56% <0%> (+1.43%)

dangra · 2017-12-27T22:29:35Z

@stummjr it makes sense and I don't consider the change of behavior a backward incompatibility, more a gotcha removal. thanks

@kmike any comment before merging and including in 1.5.0 release?

kmike · 2017-12-29T01:37:55Z

Argh, item loaders! As I understand the code, this change affects not only get_output_value, but output processors as well: if an output processor returns an empty list, after the change load_item will return the default value instead of this empty list.

No idea how large the issue is. Example use case, rather theoretical: a MapCompose output processor which drops some values from the result; when all results are dropped, after this change the default value is returned by .load_item() instead of an empty list.
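To make that concern concrete, here is a rough sketch that models the two checks side by side. Everything in it is an assumption for illustration: `drop_short` is a hypothetical processor, and `load_item` is a simplified function, not the ItemLoader method.

```python
def drop_short(values, min_len=4):
    # Hypothetical output processor: keeps only values of at least min_len characters.
    return [v for v in values if len(v) >= min_len]


def load_item(item, collected, processor, skip_empty_lists):
    # Simplified model of load_item(); skip_empty_lists=True mimics the patched check.
    result = dict(item)
    for field_name, values in collected.items():
        value = processor(values)
        if value is None:
            continue
        if skip_empty_lists and value == []:
            continue
        result[field_name] = value
    return result


initial = {"tags": ["default"]}
collected = {"tags": ["a", "bc"]}  # every value is dropped by the processor

old_behaviour = load_item(initial, collected, drop_short, skip_empty_lists=False)
new_behaviour = load_item(initial, collected, drop_short, skip_empty_lists=True)
```

With the old check the field ends up as an empty list; with the patched check the initial value survives, even though the processor deliberately produced nothing.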

That said, the way lists play with ItemLoaders is weird anyways. For example:

from scrapy.loader import ItemLoader

# A non-empty list replaces the initial value.
ld = ItemLoader({'colors': ['white', 'black']})
ld.replace_value('colors', ['red', 'yellow'])
ld.load_item()
# {'colors': ['red', 'yellow']}

# An empty list does not replace it; the initial value is kept.
ld = ItemLoader({'colors': ['white', 'black']})
ld.replace_value('colors', [])
ld.load_item()
# {'colors': ['white', 'black']}

# A single value is wrapped in a list.
ld = ItemLoader({'colors': ['white', 'black']})
ld.replace_value('colors', 'blue')
ld.load_item()
# {'colors': ['blue']}

I can't find anything in the ItemLoader docs saying that loader.replace_value(name, None) or loader.replace_value(name, []) (or with tuple() or set()) doesn't replace the value but resets it to the default, unlike any other value, including empty dicts, empty strings, False, 0, etc.

So I'm not against merging this PR, as it fixes a real-world issue @stummjr had, and there is undocumented item loader behavior anyways. But at the same time, this PR seems to add more undocumented behavior to ItemLoaders.

dangra · 2017-12-29T15:34:37Z

What if load_item only attempted to set a field's value if a call to _add_value was made for that field? It means the loader has to keep track of "modified" fields and can't rely only on the _local_values dict.

>>> item = {'colors': ['white', 'black'], 'foo': 'bar'}
>>> ld = ItemLoader(item)
>>> ld.get_output_value('colors')
[]
>>> ld.load_item()
{'colors': ['white', 'black'], 'foo': 'bar'}
>>> ld.replace_value('colors', [])
>>> ld.get_output_value('colors')
[]
>>> ld.load_item()
{'colors': [], 'foo': 'bar'}
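One possible shape for that idea is sketched below. `TrackingLoader` and its `_modified` set are hypothetical names, processors are left out entirely, and this is not how Scrapy implements ItemLoader.

```python
from collections import defaultdict


class TrackingLoader:
    """Hypothetical, simplified loader; not Scrapy's ItemLoader."""

    def __init__(self, item):
        self.item = item
        self._local_values = defaultdict(list)
        self._modified = set()  # fields touched via add_value/replace_value

    def add_value(self, field_name, value):
        self._modified.add(field_name)
        values = value if isinstance(value, (list, tuple)) else [value]
        self._local_values[field_name].extend(values)

    def replace_value(self, field_name, value):
        self._local_values.pop(field_name, None)
        self.add_value(field_name, value)

    def get_output_value(self, field_name):
        # May still create an empty entry, but load_item() ignores it
        # unless the field was explicitly modified.
        return self._local_values[field_name]

    def load_item(self):
        for field_name in self._modified:
            self.item[field_name] = self.get_output_value(field_name)
        return self.item


ld = TrackingLoader({"colors": ["white", "black"], "foo": "bar"})
peeked = ld.get_output_value("colors")
first = dict(ld.load_item())    # nothing was modified, item is unchanged
ld.replace_value("colors", [])
second = dict(ld.load_item())   # explicit replacement with [] is honoured
```

The trade-off in this sketch is the extra state: the loader has to keep `_modified` in sync with every method that writes values.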

cathalgarvey · 2018-02-20T13:47:41Z

scrapy/loader/__init__.py

@@ -113,7 +113,7 @@ def load_item(self):
        item = self.item
        for field_name in tuple(self._values):
            value = self.get_output_value(field_name)
-            if value is not None:
+            if value is not None and value != []:


I can imagine cases where someone expects a failed load to still populate an empty list, and this change might break things for them. Perhaps instead the code should check whether adding an empty list would clobber an existing field? Because, per your issue in #3046 I think that's the more bug-like behaviour?
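A clobber check along those lines might look like the sketch below. `merge_field` is a hypothetical helper written for illustration, not part of Scrapy, and it assumes the item behaves like a plain dict.

```python
def merge_field(item, field_name, value):
    # Hypothetical helper: skip an empty list only when it would overwrite
    # a value already present in the item.
    if value is None:
        return item
    if value == [] and field_name in item:
        return item
    item[field_name] = value
    return item


kept = merge_field({"colors": ["white"]}, "colors", [])            # existing value survives
populated = merge_field({}, "colors", [])                          # empty list still populates
replaced = merge_field({"colors": ["white"]}, "colors", ["red"])   # real values overwrite
```

Note that this variant still cannot tell an accidental empty list from an explicit replace_value(name, []), so it narrows the problem without fully resolving it.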

Gallaecio · 2020-10-30T18:58:55Z

Closing given #3046 has been fixed.

Don't miss values when calling get_output_value before load_item

00b7f4f

dangra changed the title ~~Avoid missing base item fields in item loaders~~ [MRG+1] Avoid missing base item fields in item loaders Dec 27, 2017

dangra requested a review from kmike December 27, 2017 22:27

dangra changed the title ~~[MRG+1] Avoid missing base item fields in item loaders~~ Avoid missing base item fields in item loaders Dec 29, 2017

cathalgarvey reviewed Feb 20, 2018

Gallaecio added the item loaders label Apr 16, 2020

Gallaecio closed this Oct 30, 2020
