ItemLoader should work with copy of item passed as an argument #616
Comments
I think handling the copy outside isn't an issue, since you have a shortcut.
What about something like `item_copy = ProductOfferItem(**item)`?
@illarion what's the point? `item.copy()` is a convenient shortcut for that.
@nramirezuy Yes, making the copy outside is not so hard, but I'm talking about ease of use and the unexpected results you can get without explicit copy creation. Another example: what will happen to the base item if I yield a variant item like in the example and then modify it in a spider middleware?

As far as I can see, modifying the variant item in a spider middleware will also modify the base item in the response handler, because it is the same object. And if I yield another variant item after that, it would still be the same object, with the fields already modified in the spider middleware and the response handler.
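The aliasing described above is easy to reproduce with plain dicts (Scrapy Items support the same dict-style mutation, so the behavior is identical). This is a toy sketch: the `make_variant` helper is hypothetical and stands in for the loader/middleware chain, not real Scrapy code.

```python
# Toy illustration of the aliasing problem: returning the same object
# for every "variant" means any later mutation shows up in all of them.
base = {"name": "Product"}

def make_variant(base_item, color):
    # No copy is made: the "variant" IS the base item.
    base_item["color"] = color
    return base_item

v1 = make_variant(base, "red")
v2 = make_variant(base, "blue")

# All three names refer to one object, so the last write wins everywhere.
assert v1 is v2 is base
assert v1["color"] == "blue"
```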
@chekunkov Throughout the item loaders documentation it says that the loader *populates* the item, so how can this be an unexpected result? http://doc.scrapy.org/en/latest/topics/loaders.html The same thing happens if you populate the item directly using the dict-like API. That's why the Item Loader asks for an instance, not a class, in its constructor. You can also do something like:

```python
>>> base_item = loader.load_item()
>>> for variant in variants:
...     loader = ItemLoader()
...     loader.add_value(None, base_item)
...     yield loader.load_item()
```
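The pattern above can be sketched with plain dicts standing in for Items and loaders: seed a fresh container from the base item on each iteration, so every yielded item is a new object. This is a simplified illustration, not real Scrapy code.

```python
# Each iteration seeds a brand-new dict from the base item, mirroring
# "loader.add_value(None, base_item)" on a fresh ItemLoader.
base_item = {"name": "Product", "specs": "Spec 1"}

items = []
for color in ["red", "blue"]:
    item = dict(base_item)   # fresh object pre-populated with base fields
    item["color"] = color
    items.append(item)

assert items[0] is not items[1]                        # distinct objects
assert [i["color"] for i in items] == ["red", "blue"]
assert "color" not in base_item                        # base item untouched
```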
@nramirezuy hm, good point. I reread the documentation and you are right, I misunderstood the concept... I still think it is error prone 😛 But you can close the issue if you want to.
I also found it non-obvious that `ItemLoader` populates a single item, because `ItemLoader` subclasses are often instantiated with a response or a selector, and the item is implicit in this case.
@kmike The main problem with implicit copies is the identity check, so if you want a copy of an item, do it yourself like with a normal dict. If you need a deep copy, we can add a `deepcopy` method to Item as a shortcut. I agree with the […]
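For reference, this is the difference between the shallow copy that `dict.copy()` gives you and the `deepcopy` shortcut mentioned above, shown with a plain dict (real Items behave the same way for nested values):

```python
import copy

item = {"specs": ["Spec 1", "Spec 2"]}

shallow = dict(item)        # same as item.copy(): the inner list is shared
deep = copy.deepcopy(item)  # the inner list is duplicated too

item["specs"].append("Spec 3")

assert shallow["specs"] == ["Spec 1", "Spec 2", "Spec 3"]  # sees the mutation
assert deep["specs"] == ["Spec 1", "Spec 2"]               # fully isolated
```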
@chekunkov, I'm closing this issue as you seem(ed) OK with it.
Sorry to comment on a long-closed issue; just noting a possible solution that worked for me, for others who haven't come up with their own, or in case it could be added to the next version of Scrapy (this was tested in v1.7.3). My project had the same issue while scraping a website that had variations on single-page responses. I have a custom `ItemLoader` subclass, but I didn't want to waste time reparsing the response or write more code to cache the results. I initially tried a couple of approaches to initializing a copy, but neither worked out.

Edit: Just noticed this is actually part of a recent bug, reported in #3976.

So I just made a new loader subclass with a `copy` method:

```python
import six
from collections import defaultdict

from scrapy.loader import ItemLoader


class ProductLoader(ItemLoader):

    def copy(self):
        cls = self.__class__
        loader = cls.__new__(cls)
        context = self.context.copy()
        loader.selector = self.selector
        loader.context = context
        loader.parent = self.parent
        loader._local_item = context['item'] = self._local_item.copy()
        loader._local_values = defaultdict(list)
        for key, values in six.iteritems(self._local_values):
            loader._local_values[key] += values
        return loader
```
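The essential part of the `copy` method above, duplicating each field's value list, can be demonstrated without Scrapy. `MiniLoader` below is a toy stand-in for `ItemLoader`'s internal value store, not real Scrapy code:

```python
from collections import defaultdict

class MiniLoader:
    """Toy stand-in for ItemLoader's _local_values store."""

    def __init__(self):
        self._local_values = defaultdict(list)

    def add_value(self, field, value):
        self._local_values[field].append(value)

    def copy(self):
        clone = MiniLoader()
        # Copy each value list so additions to the clone never leak
        # back into the original loader (the point of the fix above).
        for key, values in self._local_values.items():
            clone._local_values[key] += values
        return clone

base = MiniLoader()
base.add_value("specs", "Spec 1")

clone = base.copy()
clone.add_value("specs", "Spec A")

assert base._local_values["specs"] == ["Spec 1"]            # unchanged
assert clone._local_values["specs"] == ["Spec 1", "Spec A"]
```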
@LanetheGreat what about copying the item after you load it?

```python
>>> loader = ItemLoader()
>>> loader.add_xpath(...)  # load your item normally
>>> for variant in variants:
...     loader.add_xpath(...)  # load your item normally
...     yield loader.load_item().copy()
```
Actually, looking over my own code in my previous comment, I noticed an error in the logic which still modified the lists within the base loader's `_local_values` (corrected in the edit above).

@nramirezuy Even after my edit there's still an issue with doing that: it still modifies the base loader and simply adds new data onto each new item you yield out of your loop, because it just appends new data to the base loader's value lists. Let's say I have a page for a product, and it has some common specifications that get parsed from the description section into the base loader.

So when I gather my variations I need to have 2 items with specs like this:

```
Variation1['specs'] = "Spec 1, Spec 2, Spec A, Spec B"
Variation2['specs'] = "Spec 1, Spec 2, Spec C, Spec D"
```

If I used your example the items would load out as:

```
Variation1['specs'] = "Spec 1, Spec 2, Spec A, Spec B"
Variation2['specs'] = "Spec 1, Spec 2, Spec A, Spec B, Spec C, Spec D"
```
As you'll notice they're correctly separate items, since you used `.copy()`, but the variant data accumulates. If instead you snapshot the loader before adding the variant data:

```python
>>> loader = ProductLoader()
>>> loader.add_xpath('specs', '[description xpath]')       # load your item normally
>>> loader.add_xpath('specs_html', '[description xpath]')  # load your item normally
>>> for variant in variants:
...     sub_loader = loader.copy()
...     sub_loader.add_xpath('specs', '[variation xpath]')       # load your item normally
...     sub_loader.add_xpath('specs_html', '[variation xpath]')  # load your item normally
...     yield sub_loader.load_item()
```

it should yield these as the results from my example, after going through the output formatter for specs/specs_html (apologies for the long-winded examples and texts):

```
Variation1['specs'] =
"Spec 1
Spec 2
Spec A
Spec B"

Variation1['specs_html'] =
"<ul>
<li>Spec 1</li>
<li>Spec 2</li>
<li>Spec A</li>
<li>Spec B</li>
</ul>"

Variation2['specs'] =
"Spec 1
Spec 2
Spec C
Spec D"

Variation2['specs_html'] =
"<ul>
<li>Spec 1</li>
<li>Spec 2</li>
<li>Spec C</li>
<li>Spec D</li>
</ul>"
```
@LanetheGreat It's OK, you gave a lot of information to work with. What about:

```python
>>> loader = ItemLoader()
>>> loader.add_xpath(...)  # load your item normally
>>> for variant in variants:
...     loader.replace_xpath(...)  # load your item using replace
...     yield loader.load_item().copy()
```

Thoughts?
I think there is still a problem with doing that as well, and it would still be better to create a snapshot of the loader: if we simply replace the values with only the variant's data, we lose the original data we put in first. So (referencing my past examples), instead of the desired output:

```
Variation1['specs'] = "Spec 1, Spec 2, Spec A, Spec B"
Variation2['specs'] = "Spec 1, Spec 2, Spec C, Spec D"
```

we'd get this instead by just replacing, losing our Specs 1 and 2:

```
Variation1['specs'] = "Spec A, Spec B"
Variation2['specs'] = "Spec C, Spec D"
```

Though we could technically use `replace_xpath` to clear each field before re-adding:

```python
>>> loader = ItemLoader()
>>> loader.add_xpath('field_name', ...)  # load your item normally
>>> for variant in variants:
...     loader.replace_xpath('field_name', [])  # clear the field first
...     loader.add_xpath('field_name', ...)     # same add_xpath call from earlier
...     yield loader.load_item().copy()
```

But in larger projects with more fields you'd have to do that for each field, which could lead to mistakes if you forget one; plus it would start to slow down this section of the code, because each call to […]
Sometimes it is necessary to parse several variants of an item from a single page, where only a couple of fields differ. The obvious way to do this is to create a base item with all the common fields collected, and then create the new items (for example, in a loop) using a new ItemLoader instance with the base item passed to it.

The problem I'm facing now is that `ItemLoader.load_item()` modifies the base item, which is counterintuitive and can result in weird behavior (for example, if variant items can have different fields, then after those fields were added to the base item they would appear in all loaded items).

Now I'm using a workaround like this to suppress such behavior:

What do you think about using an item copy inside ItemLoader by default?
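A minimal sketch of the reported behavior, with a dict and a hypothetical `load_variant` function standing in for `ItemLoader.load_item()` (not real Scrapy code): passing an explicit copy is what keeps the base item clean today.

```python
base_item = {"name": "Product", "price": "10"}

def load_variant(item, color):
    # Mimics load_item() populating the item the loader was given.
    item["color"] = color
    return item

# Without a copy, loading a variant mutates the base item:
load_variant(base_item, "red")
assert base_item["color"] == "red"

# With an explicit copy, the base item stays clean:
base_item = {"name": "Product", "price": "10"}
variant = load_variant(dict(base_item), "blue")
assert "color" not in base_item
assert variant["color"] == "blue"
```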