ItemLoaders can break if instantiated with pre-populated items #3897

Closed
elacuesta opened this issue Jul 22, 2019 · 5 comments · Fixed by #4036

@elacuesta (Member) commented Jul 22, 2019

Before I start, I know item loaders have been a big source of discussion for a long time; I'm only opening this issue because the latest release breaks some of our spiders.

In one of our projects, our Autounit tests fail under 1.7.1 due to some item loaders which are created from partially populated items. I suspect the relevant change is #3819 (which BTW I think inadvertently closes #3046).
Personally I think a better approach here would be something closer to the solution proposed in #3149, although not exactly the same.

Consider the following:

In [1]: import scrapy

In [2]: scrapy.__version__
Out[2]: '1.6.0'

In [3]: from scrapy.loader import ItemLoader
   ...: lo = ItemLoader(item={'key': 'value'})
   ...: lo.add_value('key', 'other value')
   ...: print(lo.load_item())
{'key': ['other value']}

In [1]: import scrapy

In [2]: scrapy.__version__
Out[2]: '1.7.1'

In [3]: from scrapy.loader import ItemLoader
   ...: lo = ItemLoader(item={'key': 'value'})
   ...: lo.add_value('key', 'other value')
   ...: print(lo.load_item())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-6aa64a41edb1> in <module>
      1 from scrapy.loader import ItemLoader
      2 lo = ItemLoader(item={'key': 'value'})
----> 3 lo.add_value('key', 'other value')
      4 print(lo.load_item())

~/venv-temporal/lib/python3.6/site-packages/scrapy/loader/__init__.py in add_value(self, field_name, value, *processors, **kw)
     77                 self._add_value(k, v)
     78         else:
---> 79             self._add_value(field_name, value)
     80
     81     def replace_value(self, field_name, value, *processors, **kw):

~/venv-temporal/lib/python3.6/site-packages/scrapy/loader/__init__.py in _add_value(self, field_name, value)
     93         processed_value = self._process_input_value(field_name, value)
     94         if processed_value:
---> 95             self._values[field_name] += arg_to_iter(processed_value)
     96
     97     def _replace_value(self, field_name, value):

TypeError: must be str, not list

I'm not directly opening a PR because I think this needs discussion. What if we changed

for field_name, value in item.items():
    self._values[field_name] = self._process_input_value(field_name, value)

to

for field_name, value in item.items():
    self._add_value(field_name, value)

which calls arg_to_iter internally?

With that change, the following happens, which is more reasonable IMHO:

In [3]: from scrapy.loader import ItemLoader
   ...: lo = ItemLoader(item={'key': 'value'})
   ...: lo.add_value('key', 'other value')
   ...: print(lo.load_item())
{'key': ['value', 'other value']}
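
In the meantime, here is a rough workaround sketch that mimics the proposed behaviour from outside Scrapy. The subclass name is made up, and it assumes the 1.7 internals visible in the traceback above (in particular that _add_value() appends to self._values and that self.item is set by the base constructor):

from scrapy.loader import ItemLoader


class PrepopulatedItemLoader(ItemLoader):
    """Hypothetical workaround subclass, not part of Scrapy: re-seed the
    pre-populated item's values through _add_value(), so that each field
    ends up holding a list of values instead of a bare string."""

    def __init__(self, item=None, **kwargs):
        super().__init__(item=item, **kwargs)
        # Drop the values stored by the 1.7 constructor loop...
        self._values.clear()
        # ...and add them back through _add_value(), which applies
        # arg_to_iter() and the field's input processor.
        for field_name, value in self.item.items():
            self._add_value(field_name, value)

With this in place, constructing the loader from {'key': 'value'} and calling add_value('key', 'other value') should produce {'key': ['value', 'other value']}, matching the output shown above, instead of raising the TypeError.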

Looking forward to reading your thoughts on the matter.

/cc @Gallaecio @kmike @andrewbaxter @fcanobrash @sortafreel

@elacuesta elacuesta changed the title ItemLoaders can break if instatiated with pre-populated items ItemLoaders can break if instantiated with pre-populated items Jul 22, 2019
@sortafreel (Contributor)

Looks like a fine idea. Some tests are needed to be sure.

@andrewbaxter commented Jul 23, 2019

So to summarize my understanding:

  • Before, the base item's fields were ignored by the loader and, when doing load_item, they'd be replaced wholesale
  • With the new change, the item's fields are also added to the loader's collected values, where they pass through the input + output processors and will naturally be combined with other added values before replacing the original value in load_item
  • With the change suggested in this ticket, the item's fields will be added in such a way that they skip the input processors but are merged with other added values in the output processors before replacing the original value in load_item (a concrete sketch of the processors involved follows below)
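
To make the input/output processor terminology above concrete, here is a minimal made-up loader (the Product item and the processor choices are purely illustrative, not taken from the project in question):

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join


class Product(scrapy.Item):
    name = scrapy.Field()


class ProductLoader(ItemLoader):
    default_item_class = Product
    # Input processor: runs on each value as it is added to the loader.
    name_in = MapCompose(str.strip)
    # Output processor: runs once over the whole list of collected values
    # when load_item() is called.
    name_out = Join(', ')


loader = ProductLoader(item=Product(name='old name'))
loader.add_value('name', ' new name ')
item = loader.load_item()

Where the pre-populated 'old name' enters this pipeline (and whether it goes through name_in, name_out, both, or neither) is exactly what differs between the three behaviours listed above.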

Although I don't have any examples myself, I can think of two other ways this might break:

  • A spider expects a second invocation to replace the values (just because the original behaviour suited that scenario)
  • An output processor relies on multiple values, e.g. doing enumerate(values) or len(values); passing the values through the output processor a second time gives invalid output (see the sketch after the next paragraph)

The second might be partially worked around by adding the base item's values only at load_item time, and only if other values were added (that is, if no new values were added for a field, don't run the output processors and just leave that field in the base item as is), but that leaves open the question of what to do when new values are added.
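
A made-up illustration of the second concern (the Page item and the links_out choice are purely hypothetical):

import scrapy
from scrapy.loader import ItemLoader


class Page(scrapy.Item):
    links = scrapy.Field()


class PageLoader(ItemLoader):
    default_item_class = Page
    # Output processor that depends on *all* collected values at once.
    links_out = len


# First pass: 'links' ends up holding the number of collected values.
first = PageLoader()
first.add_value('links', ['https://a.example', 'https://b.example'])
item = first.load_item()  # links == 2

# Second pass over the pre-populated item (this is where the concern applies):
#
#     second = PageLoader(item=item)
#     second.add_value('links', 'https://c.example')
#     second.load_item()
#
# Under the merge proposal, the already-computed count (2) would be fed back
# in as if it were just another collected value, so links_out would run over
# already-aggregated output and the final count would be silently wrong.
# (On 1.7.1 as released, this instead raises a TypeError similar to the one
# reported above.)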

@andrewbaxter

What about leaving the old behavior for item= and adding a new parameter called seed= or initial= or something, which does the _add_value thing?
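
A rough sketch of what that could look like, written as a subclass so it stays outside Scrapy; the seed= name comes from the comment above and the class name is made up:

from scrapy.loader import ItemLoader


class SeededItemLoader(ItemLoader):
    """Hypothetical sketch of the suggestion above: item= keeps the old
    behaviour of only serving as the base for load_item(), while seed=
    values are fed through _add_value() and hence through the processors."""

    def __init__(self, item=None, seed=None, **kwargs):
        super().__init__(item=item, **kwargs)
        # Undo the 1.7 pre-processing of item= so the pre-populated fields
        # behave as they did before (left alone until load_item()).
        self._values.clear()
        # Route the seed values through _add_value(), i.e. through
        # arg_to_iter() and each field's input processor.
        if seed:
            for field_name, value in seed.items():
                self._add_value(field_name, value)

Usage would then look like SeededItemLoader(item=base_item, seed={'key': 'value'}).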

@sortafreel (Contributor)

Sorry for not answering, I was quite ill (dentist); I'll look through it carefully in the next 3-5 days. Thanks @andrewbaxter for the detailed write-up :)

@sortafreel (Contributor)

Discussed with @andrewbaxter; I'm going on vacation (for about a week) and will apply a solution right after :)
