Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add reprocessing tests #3998

Closed
wants to merge 5 commits into from
Closed

Conversation

sortafreel
Copy link
Contributor

@sortafreel sortafreel commented Sep 1, 2019

I tried to somehow understand when the item is processed the second-third-n time, so ignore the processor, but there's no clear way to find out how field processors were populated - initiated in a Class object, or taken from the ready item (so need to reprocess).

To be sure, I added a lot of new tests simulating the issue (#3976) logic and tested all of them through scrapy 1.6.0. to get the right results. And it looks like, the working solution is to check if the item is initiated from dict. Because, if item initiated from actual Item class object - it has own "meta", and no need to reprocess it, while dict == raw data and it needs to be processed through default processors to be sure it's valid.

If I'm too naive and missing something - let me know what tests to add to simulate an error, because all current test variations are passing.

@codecov
Copy link

codecov bot commented Sep 1, 2019

Codecov Report

Merging #3998 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff            @@
##           master   #3998      +/-   ##
=========================================
+ Coverage   85.39%   85.4%   +<.01%     
=========================================
  Files         167     167              
  Lines        9726    9727       +1     
  Branches     1456    1457       +1     
=========================================
+ Hits         8306    8307       +1     
  Misses       1162    1162              
  Partials      258     258
Impacted Files Coverage Δ
scrapy/loader/__init__.py 95.39% <100%> (+0.03%) ⬆️

Copy link
Member

@elacuesta elacuesta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi Alexander, thanks for taking care of this. I think it would be good to add some tests related to #3897, i.e. creating loaders from items/dicts and then adding more values to them.

scrapy/loader/__init__.py Outdated Show resolved Hide resolved
scrapy/loader/__init__.py Outdated Show resolved Hide resolved
@sortafreel
Copy link
Contributor Author

@elacuesta @Gallaecio If an item is initiated from a dict - all single values are converted to lists with a single element. Added more tests to cover both #3897 and #3976.

@elacuesta
Copy link
Member

elacuesta commented Sep 2, 2019

I think dicts are well covered here, however I believe values from Item subclasses should also be added to the loader. Maybe modifying _add_value to take an optional preprocess flag, or call arg_to_iter directly from __init__?

if isinstance(item, dict):
for field_name, value in item.items():
# Convert all single values to lists because of following output processors
self._add_value(field_name, value)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I’m not familiar enough with the internals of ItemLoader, so this may be a stupid question: Why do we want _process_input_value applied to dictionaries? Shouldn’t we be assigning the values as is (but as lists), as it seems we do with item classes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand the question correctly. If the question is "why to preprocess dicts" - it was the solution to save missing items (#3804), because without preprocessing scrapy dropped values. If I didn't understand the question - please, add more details 🙂

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean that, with the current implementation, input processors are applied to input dictionary values, but not to item objects. _add_value calls _process_input_value, which passes dictionary values through input processors.

I wonder if it wouldn’t be better, to get values from dictionaries into _values, to use something similar to _add_values that does not apply input processing.

I may be getting things wrong, though, and even if this is an issue it would be an issue that already exists before this patch, so maybe it’s better to leave this topic for later, for a different issue or pull request.

@sortafreel
Copy link
Contributor Author

I think dicts are well covered here, however I believe values from Item subclasses should also be added to the loader. Maybe modifying _add_value to take an optional preprocess flag, or call arg_to_iter directly from __init__?

I assume we had no problems with processing Item objects, only problems were with initiating Items from dicts. Can you add more details, please?

@elacuesta
Copy link
Member

I assume we had no problems with processing Item objects, only problems were with initiating Items from dicts. Can you add more details, please?

With the current check in the ItemLoader initializer (if isinstance(item, dict):), values from the item variable will be added to the loader only if item is a dictionary, not if it's a scrapy.item.Item object.

@sortafreel
Copy link
Contributor Author

I assume we had no problems with processing Item objects, only problems were with initiating Items from dicts. Can you add more details, please?

With the current check in the ItemLoader initializer (if isinstance(item, dict):), values from the item variable will be added to the loader only if item is a dictionary, not if it's a scrapy.item.Item object.

Can you provide a code example, please? I mean, whole if isinstance(item, dict): block wasn't there in scrapy 1.6, I added it only to handle items initiated from dicts. If items were processed correctly in scrapy 1.6 they must be processed correctly now too.

@elacuesta
Copy link
Member

elacuesta commented Sep 11, 2019

Sure, here's a code sample showing that values from the initial item were not preserved correctly in 1.6.0:

In [1]: import scrapy

In [2]: scrapy.__version__
Out[2]: '1.6.0'

In [3]: from scrapy.item import Item, Field 
   ...: from scrapy.loader import ItemLoader 
   ...:  
   ...: class MyItem(Item): 
   ...:     name = Field() 
   ...:  
   ...: loader_item = ItemLoader(item=MyItem(name="foo")) 
   ...: print("From Item, before add_value:", loader_item.load_item()) 
   ...: loader_item.add_value("name", "bar") 
   ...: print("From Item, after add_value:", loader_item.load_item()) 
   ...:  
   ...: loader_dict = ItemLoader(item=dict(name="foo")) 
   ...: print("From dict, before add_value:", loader_dict.load_item()) 
   ...: loader_dict.add_value("name", "bar") 
   ...: print("From dict, after add_value:", loader_dict.load_item())

From Item, before add_value: {'name': 'foo'}
From Item, after add_value: {'name': ['bar']}
From dict, before add_value: {'name': 'foo'}
From dict, after add_value: {'name': ['bar']}

This case (adding values after creating the loader with an initial item) should be tested.
In short, maybe Items don't need to be reprocessed, but their contents should be added to the loader.

class TestItemLoader(ItemLoader):
default_item_class = TestItem

# Initiate from dict
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you please try to split this test into multiple tests? E.g. create a new ReprocessingTest case, and have methods for sections in this test (test_initiate_from_dict, test_add_values, etc.)

@Gallaecio
Copy link
Member

Superseded by #4036

@Gallaecio Gallaecio closed this Oct 28, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants