
ItemLoader: improve handling of initial item #4036

Merged: 9 commits into scrapy:master from elacuesta:item-loader-initial-values, Oct 28, 2019

Conversation

@elacuesta (Member) commented Sep 25, 2019

Third time's the charm 🤞

Fixes #3897, fixes #3976, supersedes #3998.

Continuing the work of @sortafreel, updating the implementation and adding more granular tests. Please do let me know if you think I'm missing some test case.

@@ -61,7 +67,7 @@ def test_load_item_using_default_loader(self):
     il.add_value('name', u'marta')
     item = il.load_item()
     assert item is i
-    self.assertEqual(item['summary'], u'lala')
+    self.assertEqual(item['summary'], [u'lala'])
@elacuesta (Member, Author) commented on the diff:
I think this line technically makes the PR backward incompatible. I believe this is a reasonable price to pay, considering that the current code breaks if one wants to add more values to the same field: adding il.add_value('summary', u'foobar') after creating this loader results in

ERROR: test_load_item_using_default_loader (__main__.BasicItemLoaderTest)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "tests/test_loader.py", line 62, in test_load_item_using_default_loader
    il.add_value('summary', u'foobar')
  File "/.../scrapy/loader/__init__.py", line 79, in add_value
    self._add_value(field_name, value)
  File "/.../scrapy/loader/__init__.py", line 95, in _add_value
    self._values[field_name] += arg_to_iter(processed_value)
TypeError: can only concatenate str (not "list") to str
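
The setup that triggers it looks like this (a minimal sketch, assuming a pre-fix Scrapy version; the item definition is illustrative, not the exact one from tests/test_loader.py):

# Minimal reproduction sketch, assuming a Scrapy version before this PR.
from scrapy import Field, Item
from scrapy.loader import ItemLoader

class SummaryItem(Item):
    name = Field()
    summary = Field()

i = SummaryItem(name=u'marta', summary=u'lala')
il = ItemLoader(item=i)
# Before this PR, the initial u'lala' was stored internally as a plain str,
# so appending a list to it raised the TypeError shown above:
il.add_value('summary', u'foobar')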

codecov bot commented Sep 25, 2019

Codecov Report

Merging #4036 into master will increase coverage by 0.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #4036      +/-   ##
==========================================
+ Coverage   85.67%   85.68%   +0.01%     
==========================================
  Files         165      165              
  Lines        9726     9734       +8     
  Branches     1461     1463       +2     
==========================================
+ Hits         8333     8341       +8     
  Misses       1136     1136              
  Partials      257      257
Impacted Files                   Coverage Δ
scrapy/loader/__init__.py        95.36% <100%> (ø) ⬆️
scrapy/commands/parse.py         20.21% <0%> (-0.12%) ⬇️
scrapy/extensions/corestats.py   100% <0%> (ø) ⬆️
scrapy/item.py                   98.66% <0%> (+0.11%) ⬆️
scrapy/commands/runspider.py     75.36% <0%> (+1.07%) ⬆️

@Gallaecio (Member) commented:

Do you think you could increase coverage of the changes?

If I’m looking at it right, what’s missing is a case where the `if processed_value` condition evaluates to False, and a case where the `elif value` condition evaluates to False.
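
For illustration, a hypothetical sketch of inputs that would exercise those conditions (the item and field names are invented; the assertion reflects the simplified implementation that was eventually merged, which stores nothing when the input processor drops every value):

# Hypothetical coverage sketch; names are made up for illustration.
from scrapy import Field, Item
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose

class DropAllItem(Item):
    # An input processor that maps every value to None drops them all,
    # so processed_value ends up falsy.
    field = Field(input_processor=MapCompose(lambda v: None))

loader = ItemLoader(DropAllItem())
loader.add_value('field', 'anything')  # processed_value evaluates to False
loader.add_value('field', [])          # value itself evaluates to False
assert 'field' not in loader.load_item()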

@elacuesta (Member, Author) commented:

Thanks for looking into this @Gallaecio.
I simplified the implementation; test coverage is now complete 🎉

@elacuesta (Member, Author) commented:

Code samples from #3976 at 7c2ec90:

In [1]: from pprint import pprint 
   ...:  
   ...: from scrapy import Field, Item 
   ...: from scrapy.loader import ItemLoader 
   ...: from scrapy.loader.processors import TakeFirst 
   ...:  
   ...: class X(Item): 
   ...:     x = Field(output_processor=TakeFirst()) 
   ...:  
   ...: loader = ItemLoader(X()) 
   ...: loader.add_value("x", ["value1", "value2"]) 
   ...: x = loader.load_item() 
   ...:  
   ...: pprint(x) 
   ...: pprint(ItemLoader(x).load_item())                                                                                                                                                                                                     
{'x': 'value1'}
{'x': 'value1'}
In [2]: from scrapy.item import Item, Field 
   ...: from scrapy.loader import ItemLoader 
   ...: from scrapy.loader.processors import TakeFirst, Compose, Identity 
   ...:  
   ...: class Temp(Item): 
   ...:     temp = Field() 
   ...:     def __init__(self, *args, **kwargs): 
   ...:         super().__init__(self, *args, **kwargs) 
   ...:         self.setdefault('temp', 0.3) 
   ...:  
   ...: class Loader(ItemLoader): 
   ...:     default_item_class = Temp 
   ...:     default_input_processor = Identity() 
   ...:     default_output_processor = Compose(TakeFirst()) 
   ...:  
   ...: loader = Loader() 
   ...: loader.load_item()                                                                                                                                                                                                                    
Out[2]: {'temp': 0.3}

@kmike kmike modified the milestone: v1.8 Sep 26, 2019
@kmike (Member) commented Oct 1, 2019

Hey! I haven't checked the problem in detail, but I wonder if we can actually document the expected behavior, to avoid further confusion.

@kmike (Member) commented Oct 1, 2019

//cc @andrewbaxter @ejulio @alexander-matsievsky - could you please check if the PR solves the problem in your projects?

@andrewbaxter commented:
@kmike I'm on @elacuesta's team, so there is some bias here, but IIRC this will fix our problem, and I think re-running output processors but not input processors is a reasonable policy.

If you're able, though, a word in any of the issues on what level of backwards incompatibility is acceptable would be appreciated. I imagine there are many projects out there that may need to upgrade and will be bitten unless we revert the original change and make it a separate factory method (for example). If that's an expected cost, I'm on board with this solution.

item_class = NameItem


class NoInputReprocessingFromDictTest(unittest.TestCase):
A Contributor commented on the diff:
I'd say that NoInputReprocessingFromDictTest and NoInputReprocessingFromItemTest should use the same loader definition, to make the behavior clear.

@elacuesta (Member, Author) replied:
Updated. The loaders are still defined separately because they need different default_item_class attributes, but now they both inherit from BaseNoInputReprocessingLoader. Thanks!
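
The arrangement now looks roughly like this (a sketch: apart from BaseNoInputReprocessingLoader, the names and processors are illustrative and may differ from tests/test_loader.py):

# Sketch of the shared-base arrangement described above.
from scrapy import Field, Item
from scrapy.loader import ItemLoader
from scrapy.loader.processors import MapCompose, TakeFirst

class NoInputReprocessingItem(Item):
    title = Field()

class BaseNoInputReprocessingLoader(ItemLoader):
    title_in = MapCompose(str.upper)
    title_out = TakeFirst()

class NoInputReprocessingDictLoader(BaseNoInputReprocessingLoader):
    default_item_class = dict

class NoInputReprocessingItemLoader(BaseNoInputReprocessingLoader):
    default_item_class = NoInputReprocessingItem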

@elacuesta (Member, Author) left a comment:

> Hey! I haven't checked the problem in detail, but I wonder if we can actually document the expected behavior, to avoid further confusion.

Hi @kmike, I added a small note to the docs. I hope that's clear enough; let me know otherwise. Thanks!

@Gallaecio Gallaecio merged commit 7731814 into scrapy:master Oct 28, 2019
@elacuesta elacuesta deleted the item-loader-initial-values branch October 28, 2019 13:19
     for field_name, value in item.items():
-        self._values[field_name] = self._process_input_value(field_name, value)
+        self._values[field_name] += arg_to_iter(value)
@pawelmhm (Contributor) commented Feb 23, 2021:
I know this is an old commit, but I'm now in the process of understanding this whole issue. The chain of events started with #3819, and this is still a breaking change for many projects: people were certainly getting different data in many cases after all these changes. See the code below as an example; I have several spiders relying on the old behavior and am now looking for a workaround.

from scrapy import Item, Field
from scrapy.loader import ItemLoader


class AItem(Item):
    name = Field()


class ALoader(ItemLoader):
    default_item_class = AItem


# Breaking change between Scrapy 1.7 and 2.4.
# Shown below is behavior that several spiders rely on; it was the default
# behavior up until 1.7.4, and it is not how the loader works after 1.7.4.
def test_loader_breaking_change_scrapy_v17_v24():
    # To see this, run this test in two environments: one with Scrapy 1.7
    # and the old loader code, the other with Scrapy 2.4.1 and the itemloaders library.
    loader = ALoader()
    loader.add_value('name', 'hello world')
    l = loader.load_item()
    new_loader = ALoader(l)
    new_loader.add_value("name", "lovely")
    l = new_loader.load_item()
    # after Scrapy 1.7.4 it gives: {'name': ['hello world', 'lovely']} 
    # until Scrapy 1.7.4 it was {'name': ['lovely']}
    assert l['name'] == ['lovely']

A Member replied:
Would using a “TakeLast” output processor (e.g. lambda v: v[-1]) work for your scenario?
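
A sketch of that suggestion, assuming a post-fix version; “TakeLast” is not a built-in processor, just the inline lambda, and BItem/BLoader are invented names:

# Sketch of the suggested workaround: an output processor that keeps
# only the newest value, so reloading plus add_value overwrites it.
from scrapy import Field, Item
from scrapy.loader import ItemLoader

class BItem(Item):
    name = Field(output_processor=lambda values: values[-1])

class BLoader(ItemLoader):
    default_item_class = BItem

loader = BLoader()
loader.add_value('name', 'hello world')
new_loader = BLoader(loader.load_item())
new_loader.add_value('name', 'lovely')
assert new_loader.load_item()['name'] == 'lovely'  # newest value wins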

@pawelmhm (Contributor) replied Feb 23, 2021:
I have a TakeFirst processor defined for the specific field where I spotted this, and different spiders are doing different things; some of them are fine after this change, but some are not. Changing output processors would change several things. I wonder if we could add some option for the loader to keep the old behavior? It seems dirty, but it would make things easier. I could also refactor the spiders, but there are many of them, and even finding which ones are affected is not easy.

A Member replied:
What kind of change do you have in mind? Something like a parameter to ItemLoader that sets the old behavior? Or would it work for you if we simply extended the ItemLoader API with something like set_value to replace any value added so far?

@pawelmhm (Contributor) replied Feb 23, 2021:

> Something like a parameter to ItemLoader that sets the old behavior?

I was thinking about a parameter or attribute of the loader that I could set when defining my subclass of ItemLoader. But I see a potential problem from the library's point of view: it could be hard to maintain in the future, there would be two different logics, and I'm not sure how many people would even think about this attribute or parameter. If mine were the only project using it, it wouldn't make much sense; I can just subclass ItemLoader on my side and change some things. I could also stick to the old ItemLoader implementation, but there is no easy way to do that: the loaders were moved to a separate repo, so I cannot simply use some old version of ItemLoader with Scrapy 2.4.1. The first release of itemloaders came after this change, and Scrapy 2.4.1 imports itemloaders.

> Or would it work for you if we simply extended the ItemLoader API with something like set_value to replace any value added so far?

There is already replace_value, which does what I would need here. I guess the problem is that, in the past, loader.add_value was in many cases actually replacing the value when people intended to add one; code now relies on this, and that creates a mess when the situation is cleared up.
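
For reference, adapting the last lines of the test above to replace_value restores the old outcome:

# Adapting pawelmhm's test above: replace_value discards previously
# collected values instead of appending to them.
new_loader = ALoader(l)
new_loader.replace_value('name', 'lovely')
assert new_loader.load_item()['name'] == ['lovely']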
