ItemLoaders can break if instantiated with pre-populated items #3897

Closed
elacuesta opened this issue Jul 22, 2019 · 5 comments · Fixed by #4036

@elacuesta (Member) commented Jul 22, 2019

Before I start, I know item loaders have been a big source of discussion for a long time; I'm only opening this issue because the latest release breaks some of our spiders.

In one of our projects, our Autounit tests fail under 1.7.1 due to some item loaders which are created from partially populated items. I suspect the relevant change is #3819 (which BTW I think inadvertently closes #3046).
Personally I think a better approach here would be something closer to the solution proposed in #3149, although not exactly the same.

Consider the following:

In [1]: import scrapy

In [2]: scrapy.__version__
Out[2]: '1.6.0'

In [3]: from scrapy.loader import ItemLoader
   ...: lo = ItemLoader(item={'key': 'value'})
   ...: lo.add_value('key', 'other value')
   ...: print(lo.load_item())
{'key': ['other value']}

In [1]: import scrapy

In [2]: scrapy.__version__
Out[2]: '1.7.1'

In [3]: from scrapy.loader import ItemLoader
   ...: lo = ItemLoader(item={'key': 'value'})
   ...: lo.add_value('key', 'other value')
   ...: print(lo.load_item())
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-6aa64a41edb1> in <module>
      1 from scrapy.loader import ItemLoader
      2 lo = ItemLoader(item={'key': 'value'})
----> 3 lo.add_value('key', 'other value')
      4 print(lo.load_item())

~/venv-temporal/lib/python3.6/site-packages/scrapy/loader/__init__.py in add_value(self, field_name, value, *processors, **kw)
     77                 self._add_value(k, v)
     78         else:
---> 79             self._add_value(field_name, value)
     80
     81     def replace_value(self, field_name, value, *processors, **kw):

~/venv-temporal/lib/python3.6/site-packages/scrapy/loader/__init__.py in _add_value(self, field_name, value)
     93         processed_value = self._process_input_value(field_name, value)
     94         if processed_value:
---> 95             self._values[field_name] += arg_to_iter(processed_value)
     96
     97     def _replace_value(self, field_name, value):

TypeError: must be str, not list

I'm not directly opening a PR because I think this needs discussion. What if we changed

for field_name, value in item.items():
    self._values[field_name] = self._process_input_value(field_name, value)

to

for field_name, value in item.items():
    self._add_value(field_name, value)

which calls arg_to_iter internally?

With that change, the following happens, which is more reasonable IMHO:

In [3]: from scrapy.loader import ItemLoader
   ...: lo = ItemLoader(item={'key': 'value'})
   ...: lo.add_value('key', 'other value')
   ...: print(lo.load_item())
{'key': ['value', 'other value']}
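
In the meantime, here is a rough workaround sketch that mimics the proposed behaviour from outside Scrapy. The subclass name is made up, and it assumes the 1.7 internals visible in the traceback above (in particular that _add_value() appends to self._values and that self.item is set by the base constructor):

from scrapy.loader import ItemLoader


class PrepopulatedItemLoader(ItemLoader):
    """Hypothetical workaround subclass, not part of Scrapy: re-seed the
    pre-populated item's values through _add_value(), so that each field
    ends up holding a list of values instead of a bare string."""

    def __init__(self, item=None, **kwargs):
        super().__init__(item=item, **kwargs)
        # Drop the values stored by the 1.7 constructor loop...
        self._values.clear()
        # ...and add them back through _add_value(), which applies
        # arg_to_iter() and the field's input processor.
        for field_name, value in self.item.items():
            self._add_value(field_name, value)

With this in place, constructing the loader from {'key': 'value'} and calling add_value('key', 'other value') should produce {'key': ['value', 'other value']}, matching the output shown above, instead of raising the TypeError.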

Looking forward to reading your thoughts on the matter.

/cc @Gallaecio @kmike @andrewbaxter @fcanobrash @sortafreel

@elacuesta elacuesta changed the title ItemLoaders can break if instatiated with pre-populated items ItemLoaders can break if instantiated with pre-populated items Jul 22, 2019
@sortafreel (Contributor)

Looks like a fine idea. Some tests are needed to be sure.

@andrewbaxter commented Jul 23, 2019

So to summarize my understanding:

  • Before, the base item's fields were ignored by the loader and, when doing load_item, they'd be replaced wholesale
  • With the new change, the item's fields are also added to the loader's collected values, where they pass through the input + output processors and will naturally be combined with other added values before replacing the original value in load_item
  • With the change suggested in this ticket, the item's fields will be added in such a way that they skip the input processors but are merged with other added values in the output processors before replacing the original value in load_item (a concrete sketch of the processors involved follows below)
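
To make the input/output processor terminology above concrete, here is a minimal made-up loader (the Product item and the processor choices are purely illustrative, not taken from the project in question):

import scrapy
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, Join


class Product(scrapy.Item):
    name = scrapy.Field()


class ProductLoader(ItemLoader):
    default_item_class = Product
    # Input processor: runs on each value as it is added to the loader.
    name_in = MapCompose(str.strip)
    # Output processor: runs once over the whole list of collected values
    # when load_item() is called.
    name_out = Join(', ')


loader = ProductLoader(item=Product(name='old name'))
loader.add_value('name', ' new name ')
item = loader.load_item()

Where the pre-populated 'old name' enters this pipeline (and whether it goes through name_in, name_out, both, or neither) is exactly what differs between the three behaviours listed above.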

Although I don't have any examples myself, I can think of two other ways this might break:

  • A spider expects a second invocation to replace the values (just because the original behaviour suited that scenario)
  • An output processor relies on multiple values, e.g. doing enumerate(values) or len(values); passing the values through the output processor a second time gives invalid output (see the sketch after the next paragraph)

The second might be partially worked around by adding the base item's values only at load_item time, and only if other values were added (that is, if no new values were added for a field, don't run the output processors and just leave that field in the base item as is), but that leaves open the question of what to do when new values are added.
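
A made-up illustration of the second concern (the Page item and the links_out choice are purely hypothetical):

import scrapy
from scrapy.loader import ItemLoader


class Page(scrapy.Item):
    links = scrapy.Field()


class PageLoader(ItemLoader):
    default_item_class = Page
    # Output processor that depends on *all* collected values at once.
    links_out = len


# First pass: 'links' ends up holding the number of collected values.
first = PageLoader()
first.add_value('links', ['https://a.example', 'https://b.example'])
item = first.load_item()  # links == 2

# Second pass over the pre-populated item (this is where the concern applies):
#
#     second = PageLoader(item=item)
#     second.add_value('links', 'https://c.example')
#     second.load_item()
#
# Under the merge proposal, the already-computed count (2) would be fed back
# in as if it were just another collected value, so links_out would run over
# already-aggregated output and the final count would be silently wrong.
# (On 1.7.1 as released, this instead raises a TypeError similar to the one
# reported above.)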

@andrewbaxter

What about leaving the old behavior for item= and adding a new parameter called seed= or initial= or something, which does the _add_value thing?
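
A rough sketch of what that could look like, written as a subclass so it stays outside Scrapy; the seed= name comes from the comment above and the class name is made up:

from scrapy.loader import ItemLoader


class SeededItemLoader(ItemLoader):
    """Hypothetical sketch of the suggestion above: item= keeps the old
    behaviour of only serving as the base for load_item(), while seed=
    values are fed through _add_value() and hence through the processors."""

    def __init__(self, item=None, seed=None, **kwargs):
        super().__init__(item=item, **kwargs)
        # Undo the 1.7 pre-processing of item= so the pre-populated fields
        # behave as they did before (left alone until load_item()).
        self._values.clear()
        # Route the seed values through _add_value(), i.e. through
        # arg_to_iter() and each field's input processor.
        if seed:
            for field_name, value in seed.items():
                self._add_value(field_name, value)

Usage would then look like SeededItemLoader(item=base_item, seed={'key': 'value'}).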

@sortafreel (Contributor)

Sorry for not answering, I was quite ill (dentist); I'll look through it carefully in the next 3-5 days. Thanks @andrewbaxter for the detailed write-up :)

@sortafreel (Contributor)

Discussed with @andrewbaxter; I'm going on vacation (for about a week) and will apply a solution right after :)
