Add reprocessing tests #3998

sortafreel · 2019-09-01T20:47:48Z

I tried to work out when an item is being processed for the second, third, or nth time, so that the processor could be skipped, but there's no clear way to find out how the fields were populated: initiated in an Item class object, or taken from an already loaded item (and so needing reprocessing).

To be sure, I added a lot of new tests simulating the logic of the issue (#3976) and ran all of them against Scrapy 1.6.0 to get the right results. It looks like the working solution is to check whether the item is initiated from a dict. If the item is initiated from an actual Item class object, it has its own "meta" and there is no need to reprocess it, while a dict is raw data and needs to go through the default processors to be sure it's valid.

If I'm being too naive and missing something, let me know what tests to add to simulate an error, because all current test variations are passing.

codecov · 2019-09-01T22:49:59Z

Codecov Report

Merging #3998 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #3998      +/-   ##
=========================================
+ Coverage   85.39%   85.4%   +<.01%     
=========================================
  Files         167     167              
  Lines        9726    9727       +1     
  Branches     1456    1457       +1     
=========================================
+ Hits         8306    8307       +1     
  Misses       1162    1162              
  Partials      258     258

Impacted Files	Coverage Δ
scrapy/loader/__init__.py	`95.39% <100%> (+0.03%)`	⬆️

elacuesta

Hi Alexander, thanks for taking care of this. I think it would be good to add some tests related to #3897, i.e. creating loaders from items/dicts and then adding more values to them.

scrapy/loader/__init__.py

sortafreel · 2019-09-02T12:06:03Z

@elacuesta @Gallaecio If an item is initiated from a dict, all single values are now converted to lists with a single element. I added more tests to cover both #3897 and #3976.

elacuesta · 2019-09-02T15:51:44Z

I think dicts are well covered here, however I believe values from Item subclasses should also be added to the loader. Maybe modifying _add_value to take an optional preprocess flag, or call arg_to_iter directly from __init__?

Gallaecio · 2019-09-03T08:30:46Z

scrapy/loader/__init__.py

+        if isinstance(item, dict):
+            for field_name, value in item.items():
+                # Convert all single values to lists because of following output processors
+                self._add_value(field_name, value)


I’m not familiar enough with the internals of ItemLoader, so this may be a stupid question: Why do we want _process_input_value applied to dictionaries? Shouldn’t we be assigning the values as is (but as lists), as it seems we do with item classes?

sortafreel · 2019-09-10

I'm not sure I understand the question correctly. If the question is "why preprocess dicts?", it was the solution for saving missing items (#3804): without preprocessing, Scrapy dropped values. If I misunderstood the question, please add more details 🙂

Gallaecio · 2019-09-16

I mean that, with the current implementation, input processors are applied to input dictionary values, but not to item objects. _add_value calls _process_input_value, which passes dictionary values through input processors.

I wonder whether it would be better, when getting values from dictionaries into _values, to use something similar to _add_value that does not apply input processing.

I may be getting things wrong, though, and even if this is an issue it would be an issue that already exists before this patch, so maybe it’s better to leave this topic for later, for a different issue or pull request.
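
To make the two options in this thread concrete, here is a minimal sketch under stated assumptions. `ToyLoader`, `to_list`, `add_suffix`, and the `preprocess` flag are all hypothetical stand-ins written for illustration; none of this is Scrapy's actual implementation. It only shows the difference between passing initial dict values through an input processor and storing them as lists as is.

```python
# Toy model of the discussion above; NOT Scrapy's ItemLoader.
from collections import defaultdict


def to_list(value):
    """Wrap a single value in a list (roughly what scrapy.utils.misc.arg_to_iter does)."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


class ToyLoader:
    """Hypothetical loader with an optional `preprocess` flag."""

    def __init__(self, item=None, input_processor=None, preprocess=True):
        # Default input processor returns the values unchanged.
        self.input_processor = input_processor or (lambda values: values)
        self._values = defaultdict(list)
        for field_name, value in (item or {}).items():
            self._add_value(field_name, value, preprocess=preprocess)

    def _add_value(self, field_name, value, preprocess=True):
        values = to_list(value)
        if preprocess:
            # Initial values go through the input processor (reprocessing).
            values = to_list(self.input_processor(values))
        self._values[field_name] += values


def add_suffix(values):
    """Hypothetical input processor that visibly changes each value."""
    return [v + "!" for v in values]


reprocessed = ToyLoader({"name": "foo"}, input_processor=add_suffix)
stored_as_is = ToyLoader({"name": "foo"}, input_processor=add_suffix, preprocess=False)
print(dict(reprocessed._values))   # {'name': ['foo!']}
print(dict(stored_as_is._values))  # {'name': ['foo']}
```

With `preprocess=False` the initial value is only wrapped in a list, which is the "as is (but as lists)" behaviour asked about above.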

sortafreel · 2019-09-10T10:12:49Z

I think dicts are well covered here, however I believe values from Item subclasses should also be added to the loader. Maybe modifying _add_value to take an optional preprocess flag, or call arg_to_iter directly from __init__?

I assume we had no problems with processing Item objects, only problems were with initiating Items from dicts. Can you add more details, please?

elacuesta · 2019-09-10T18:31:39Z

I assume we had no problems with processing Item objects, only problems were with initiating Items from dicts. Can you add more details, please?

With the current check in the ItemLoader initializer (if isinstance(item, dict):), values from the item variable will be added to the loader only if item is a dictionary, not if it's a scrapy.item.Item object.

sortafreel · 2019-09-10T21:52:33Z

I assume we had no problems with processing Item objects, only problems were with initiating Items from dicts. Can you add more details, please?

With the current check in the ItemLoader initializer (if isinstance(item, dict):), values from the item variable will be added to the loader only if item is a dictionary, not if it's a scrapy.item.Item object.

Can you provide a code example, please? I mean, the whole if isinstance(item, dict): block wasn't there in Scrapy 1.6; I added it only to handle items initiated from dicts. If items were processed correctly in Scrapy 1.6, they must be processed correctly now too.

elacuesta · 2019-09-11T19:20:38Z

Sure, here's a code sample showing that values from the initial item were not preserved correctly in 1.6.0:

In [1]: import scrapy

In [2]: scrapy.__version__
Out[2]: '1.6.0'

In [3]: from scrapy.item import Item, Field 
   ...: from scrapy.loader import ItemLoader 
   ...:  
   ...: class MyItem(Item): 
   ...:     name = Field() 
   ...:  
   ...: loader_item = ItemLoader(item=MyItem(name="foo")) 
   ...: print("From Item, before add_value:", loader_item.load_item()) 
   ...: loader_item.add_value("name", "bar") 
   ...: print("From Item, after add_value:", loader_item.load_item()) 
   ...:  
   ...: loader_dict = ItemLoader(item=dict(name="foo")) 
   ...: print("From dict, before add_value:", loader_dict.load_item()) 
   ...: loader_dict.add_value("name", "bar") 
   ...: print("From dict, after add_value:", loader_dict.load_item())

From Item, before add_value: {'name': 'foo'}
From Item, after add_value: {'name': ['bar']}
From dict, before add_value: {'name': 'foo'}
From dict, after add_value: {'name': ['bar']}

This case (adding values after creating the loader with an initial item) should be tested.
In short, maybe Items don't need to be reprocessed, but their contents should be added to the loader.
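
The following is a simplified sketch of why the initial value disappears in the session above, and of the behaviour being requested. `ToyLoader` and its `seed_from_item` parameter are hypothetical and invented for illustration; the real ItemLoader also runs input and output processors, which are left out here.

```python
class ToyLoader:
    """Hypothetical simplified loader; not Scrapy's ItemLoader."""

    def __init__(self, item, seed_from_item):
        self.item = item
        self._values = {}
        if seed_from_item:
            # Requested behaviour: copy the item's contents into the loader,
            # stored as lists, so that later add_value calls extend them.
            for field_name, value in item.items():
                self._values[field_name] = list(value) if isinstance(value, list) else [value]

    def add_value(self, field_name, value):
        self._values.setdefault(field_name, []).append(value)

    def load_item(self):
        # Collected values overwrite whatever the item held for that field.
        for field_name, values in self._values.items():
            self.item[field_name] = list(values)
        return self.item


unseeded = ToyLoader({"name": "foo"}, seed_from_item=False)
unseeded.add_value("name", "bar")
print(unseeded.load_item())  # {'name': ['bar']}

seeded = ToyLoader({"name": "foo"}, seed_from_item=True)
seeded.add_value("name", "bar")
print(seeded.load_item())  # {'name': ['foo', 'bar']}
```

The unseeded case mirrors the 1.6.0 output shown above, where "foo" is lost once "bar" is added.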

kmike · 2019-09-12T14:26:39Z

tests/test_loader.py

+        class TestItemLoader(ItemLoader):
+            default_item_class = TestItem
+
+        # Initiate from dict


could you please try to split this test into multiple tests? E.g. create a new ReprocessingTest case, and have methods for sections in this test (test_initiate_from_dict, test_add_values, etc.)
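
A possible shape for that split, as a sketch only. `ReprocessingTest`, `test_initiate_from_dict`, and `test_add_values` are the names suggested above; `StandInLoader` and `test_load_item_twice` are hypothetical, and the stand-in exists only so the example runs without Scrapy. The real tests would build an ItemLoader instead.

```python
import unittest


class StandInLoader:
    """Hypothetical stand-in so the sketch runs without Scrapy."""

    def __init__(self, item):
        # Store every initial value as a single-element list.
        self._values = {key: [value] for key, value in item.items()}

    def add_value(self, field_name, value):
        self._values.setdefault(field_name, []).append(value)

    def load_item(self):
        return {key: list(values) for key, values in self._values.items()}


class ReprocessingTest(unittest.TestCase):
    def setUp(self):
        # Each test method gets a fresh loader created from a dict.
        self.loader = StandInLoader({"name": "foo"})

    def test_initiate_from_dict(self):
        self.assertEqual(self.loader.load_item(), {"name": ["foo"]})

    def test_add_values(self):
        self.loader.add_value("name", "bar")
        self.assertEqual(self.loader.load_item(), {"name": ["foo", "bar"]})

    def test_load_item_twice(self):
        # Loading twice must not change the result (no reprocessing).
        first = self.loader.load_item()
        self.assertEqual(self.loader.load_item(), first)
```

One method per scenario means a failure report names the exact case that broke, instead of stopping at the first failed assertion in one long test.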

Gallaecio · 2019-10-28T09:54:36Z

Superseded by #4036

sortafreel added 3 commits September 1, 2019 23:00

Merge with scrapy/master

846e021

Update tests.

685484a

[fix_reprocessing] Add reprocessing tests. Avoid reprocessing.

d53d14b

elacuesta reviewed Sep 2, 2019

elacuesta reviewed Sep 2, 2019

sortafreel added 2 commits September 2, 2019 14:26

[fix_reprocessing] Check through is_instance.

39e77af

[fix_reprocessing] Convert all single values to list when initiating from dict. Add more tests.

91e2453

Gallaecio reviewed Sep 3, 2019

sortafreel mentioned this pull request Sep 10, 2019

ItemLoader fields initialized from item are reprocessed #3976

Closed

kmike reviewed Sep 12, 2019

elacuesta mentioned this pull request Sep 25, 2019

ItemLoader: improve handling of initial item #4036

Merged

Gallaecio closed this Oct 28, 2019
