
ItemLoader should work with copy of item passed as an argument #616

Closed
chekunkov opened this issue Feb 26, 2014 · 14 comments

@chekunkov
Contributor

Sometimes you need to parse several variants of an item from a single page, where only a couple of fields differ. The obvious way to do this is to create a base item with all the common fields collected, and then create the new items (for example, in a loop) using a new ItemLoader instance with the base item passed to it.

The problem I'm facing now is that ItemLoader.load_item() modifies the base item - which is counterintuitive and can result in weird behavior (for example, if variant items can have different fields, then once those fields are added to the base item they appear in all subsequently loaded items).

In [5]: item = Item()

In [7]: item['url'] = 'foo'

In [8]: item
Out[8]: {'url': 'foo'}

In [9]: l = ItemLoader(item)

In [12]: l.add_value('category', 'bar')

In [13]: item
Out[13]: {'url': 'foo'}

In [14]: item_copy = l.load_item()

In [15]: item_copy
Out[15]: {'category': 'bar', 'url': 'foo'}

In [16]: item
Out[16]: {'category': 'bar', 'url': 'foo'}

In [17]: id(item)
Out[17]: 49468304

In [18]: id(item_copy)
Out[18]: 49468304

Right now I'm using a workaround like this to suppress that behavior:

loader = ItemLoader(selector=sel)
...
item = loader.load_item()
for variant in variants:
    item_copy = item.copy()
    loader = ItemLoader(item_copy)
    ...
    yield loader.load_item()

What do you think about making ItemLoader work on a copy of the item by default?

@nramirezuy
Contributor

I think handling the copy outside isn't an issue since you have a shortcut.

@illarion

What about something like item_copy = ProductOfferItem(**item) ?

@chekunkov
Contributor Author

@illarion what's the point? item.copy() is a convenient shortcut for exactly that.

@chekunkov
Contributor Author

@nramirezuy Yes, making the copy outside isn't hard, but I'm talking about ease of use and the unexpected results you can get without explicit copy creation.

Another example - what happens to the base item if I yield a variant item as below and then modify it in a spider middleware?

base_item = loader.load_item()
for variant in variants:
    loader = ItemLoader(base_item)
    ...
    yield loader.load_item()

As far as I can see, modifying the variant item in a spider middleware also modifies the base item in the response handler - because it is the same object. And if I yield another variant item after that, it is still the same object, with the fields modified in the spider middleware and the response handler.
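To illustrate with a minimal sketch (the item class and field names here are made up): load_item() returns the very object that was passed in, so any later mutation is shared:

from scrapy.item import Item, Field
from scrapy.loader import ItemLoader

class VariantItem(Item):
    url = Field()
    color = Field()

base_item = VariantItem(url='foo')
loader = ItemLoader(item=base_item)
loader.add_value('color', 'red')
variant = loader.load_item()

print(variant is base_item)  # True - one object under two names
variant['color'] = ['blue']  # e.g. a spider middleware edits the item
print(base_item['color'])    # ['blue'] - the "base" item changed too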

@nramirezuy
Contributor

@chekunkov Throughout the item loaders documentation it says that the loader populates the item - how can this be an unexpected result? http://doc.scrapy.org/en/latest/topics/loaders.html

The same thing is going to happen if you populate the item directly, using the dict-like API. That's why the Item Loader asks for an instance, not a class, in its constructor.

You can also do something like:

>>> base_item = loader.load_item()
>>> for variant in variants:
...    loader = ItemLoader()
...    loader.add_value(None, base_item)
...    yield loader.load_item()

@chekunkov
Contributor Author

@nramirezuy hm, good point - I reread the documentation and you are right, I misunderstood the concept...

I still think it's error-prone 😛 But you can close the issue if you want to.

@kmike
Member

kmike commented Apr 8, 2014

I also found it non-obvious that ItemLoader populates a single item, because ItemLoader subclasses are often instantiated with a response or a selector, and the item is implicit in that case.

The item.copy() workaround is error-prone because it is not a deep copy.

loader.add_value(None, base_item) looks more like a trick to me; the API is quite strange, even though it is documented. It may also not do what you want: it doesn't perform a deep copy, and input/output processors may break things (the output is not necessarily in the same format as the input).
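For example, here is the shallow-copy pitfall in isolation (a minimal sketch; the item class is made up):

from scrapy.item import Item, Field

class PageItem(Item):
    urls = Field()

base = PageItem(urls=['http://example.com/a'])
shallow = base.copy()  # a new item, but the inner list is shared
shallow['urls'].append('http://example.com/b')
print(base['urls'])  # ['http://example.com/a', 'http://example.com/b'] - base mutated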

@nramirezuy
Contributor

@kmike The main problem with implicit copies is the identity check, so if you want a copy of an item, make it yourself as you would with a normal dict. If you need a deep copy, we can add a deepcopy method to Item as a shortcut.

I agree that .add_value(None, base_item) looks weird, but it works.
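Something like this, maybe (just a sketch - there is no such method on Item today):

import copy
from scrapy.item import Item, Field

class MyItem(Item):
    tags = Field()

    def deepcopy(self):
        # Deep copy, so mutable field values (lists, dicts) aren't shared either.
        return copy.deepcopy(self)

item = MyItem(tags=['a'])
clone = item.deepcopy()
clone['tags'].append('b')
print(item['tags'])  # ['a'] - the original is unaffected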

@redapple
Contributor

@chekunkov, I'm closing this issue as you seem(ed) OK with it.
@kmike, @chekunkov, feel free to object and re-open.

@LanetheGreat
Contributor

LanetheGreat commented Sep 4, 2019

Sorry to comment on a long-closed issue - just noting a possible solution that worked for me, in case others haven't come up with their own, or in case it could be added to the next version of Scrapy (this was tested in v1.7.3). My project had the same issue while scraping a website that had variations on single-page responses.

I have a custom ProductLoader that pulls one or more items from a page with products, as custom ProductItem items. The description of a product was in one part of the page and applied to each variation, so it needed to be duplicated, while fields like color, UPC, etc. were listed differently for each. The loader parses each line in the description, and the output processor then joins them together as a single HTML string formatted as paragraphs (also converting <br> to <p> to make them look more consistent):

<p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p>

but I didn't want to waste time re-parsing it, or write more code to cache the results.

I initially tried the .add_value(None, base_item) solution, which didn't work in my project because it ran the values through the output processors I had assigned to my loader a second time. E.g.:

<p><p>Paragraph 1</p>
<p>Paragraph 2</p>
<p>Paragraph 3</p></p>

Then I tried initializing a copy like ProductLoader(item=loader.item.copy(), response=response), so it would use a copy of my custom ProductItem. That didn't work either, because ItemLoader.__init__ re-runs the item's values through the input processors, treats the HTML as plain text again, and does this to it:

<p><</p>
<p>p</p>
<p>></p>
<p>P</p>
<p>a</p>
...

Edit: I just noticed this is actually part of a recent bug, reported in #3976.
Also, both methods were exception-prone because of the re-running of the input/output processors.

So I made a new ProductLoader.copy() method that bypasses ItemLoader.__init__ and creates shallow copies of the loader's internals. It works with no exceptions thrown, and it produces a proper new item instance with separate values when you finally call .load_item(), without affecting the base loader it was copied from. It also maintains the proper types for .item/._local_item (an instance of my custom ProductItem class) and for ._local_values (a defaultdict), since it uses the .copy method of both, so it should be compatible with any Item subclass and could probably work on the base scrapy.item.Item class itself. My ProductLoader.copy code:

import six
from collections import defaultdict
from scrapy.loader import ItemLoader

class ProductLoader(ItemLoader):

    def copy(self):
        # Bypass __init__ so the item's values aren't pushed through
        # the input processors a second time.
        cls = self.__class__
        loader = cls.__new__(cls)
        context = self.context.copy()

        loader.selector = self.selector
        loader.context = context
        loader.parent = self.parent
        # Shallow-copy the item and keep the context's 'item' entry in sync.
        loader._local_item = context['item'] = self._local_item.copy()

        # Copy the collected values into fresh lists, so adding values to
        # the new loader doesn't mutate the lists of the original one.
        loader._local_values = defaultdict(list)
        for key, values in six.iteritems(self._local_values):
            loader._local_values[key] += values

        return loader
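For context, usage inside a spider callback might look like this (a sketch - ProductItem, the field name, and the XPaths are made up):

def parse(self, response):
    base_loader = ProductLoader(item=ProductItem(), response=response)
    base_loader.add_xpath('specs', '//div[@id="description"]//li/text()')

    for variant in response.xpath('//div[@class="variant"]'):
        sub_loader = base_loader.copy()  # snapshot of the item and collected values
        sub_loader.add_value('specs', variant.xpath('.//li/text()').extract())
        yield sub_loader.load_item()     # independent item per variation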

@nramirezuy
Contributor

@LanetheGreat what about copying the item after you load it?

>>> loader = ItemLoader()
>>> loader.add_xpath(...)  # load your item normally
>>> for variant in variants:
...    loader.add_xpath(...)   # load your item normally
...    yield loader.load_item().copy()

@LanetheGreat
Contributor

Actually, looking over my code in my previous comment, I noticed an error in the logic that still modified the lists within the base loader's ._local_values, and I've edited the comment to fix that.

@nramirezuy Even after my edit there's still an issue with doing that: it still modifies the base loader and simply adds new data onto each item you yield out of your loop, because it keeps appending new data to the base loader's ._local_values lists. Let me demonstrate with another example from my project's ProductItem/ProductLoader classes, using their specs/specs_html fields (product specifications as text and as an HTML unordered <ul></ul> list).

Let's say I have a page for a product with some common specifications that get parsed from the description section into the base ProductLoader, plus a list of variations at the bottom of the page giving some unique specs for each variation of the product. Here's a list showing that tree structure:

  • Product Page:
    • Description:
      • Spec 1
      • Spec 2
    • Variations:
      • Variation 1
        • Spec A
        • Spec B
      • Variation 2
        • Spec C
        • Spec D

So when I gather my variations I need to have 2 items with specs like this:

  • Variation 1: Specs 1, 2, A, B
  • Variation 2: Specs 1, 2, C, D

If I used your example the items would load out as:

  • Variation 1: Specs 1, 2, A, B
  • Variation 2: Specs 1, 2, A, B, C, D

As you'll notice, they're correctly separate items, since you used Item's .copy(), but we still have the specs from the first iteration added to the second variation! This happens because after the first iteration of the loop, loader._local_values['specs'] equals ['Spec 1', 'Spec 2', 'Spec A', 'Spec B'], and after the second iteration it equals ['Spec 1', 'Spec 2', 'Spec A', 'Spec B', 'Spec C', 'Spec D']: we just added the next set of values to the same loader, and .load_item() doesn't empty the collected values between calls. That isn't correct for the second item, so we'd need to either a) take a snapshot of the loader with its current set of values (the .copy() method) and only append the current variation's fields, or b) try to remove the values added since the last variation, which is obviously more time-consuming to code and could still leave dirty values from each variation in the specs fields - or in other fields, if one variation uses fields another doesn't.
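To make the accumulation concrete, here's a minimal sketch (add_value stands in for the XPath calls; ProductItem is my item class):

loader = ProductLoader(item=ProductItem())
loader.add_value('specs', ['Spec 1', 'Spec 2'])  # common specs

items = []
for variant_specs in (['Spec A', 'Spec B'], ['Spec C', 'Spec D']):
    loader.add_value('specs', variant_specs)
    items.append(loader.load_item().copy())

# items[1]['specs'] now includes Specs A and B as well, because
# _local_values['specs'] kept growing and load_item() never clears it.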

But if you take the snapshot with .copy() first, each iteration stays cleanly unique, both in the item and in the fields modified after you call .load_item(), like so:

>>> loader = ProductLoader()
>>> loader.add_xpath('specs', '[description xpath]')      # load your item normally
>>> loader.add_xpath('specs_html', '[description xpath]') # load your item normally
>>> for variant in variants:
...    sub_loader = loader.copy()
...    sub_loader.add_xpath('specs', '[variation xpath]')       # load your item normally
...    sub_loader.add_xpath('specs_html', '[variation xpath]')  # load your item normally
...    yield sub_loader.load_item()

This should yield the following results from my example, after going through the output processor for specs/specs_html (apologies for the long-winded examples):

Variation1['specs'] = 
"Spec 1
Spec 2
Spec A
Spec B"
Variation1['specs_html'] = 
"<ul>
  <li>Spec 1</li>
  <li>Spec 2</li>
  <li>Spec A</li>
  <li>Spec B</li>
</ul>"
Variation2['specs'] = 
"Spec 1
Spec 2
Spec C
Spec D"
Variation2['specs_html'] = 
"<ul>
  <li>Spec 1</li>
  <li>Spec 2</li>
  <li>Spec C</li>
  <li>Spec D</li>
</ul>"

@nramirezuy
Contributor

nramirezuy commented Sep 4, 2019

@LanetheGreat It's OK, you gave a lot of information to work with.
I used the wrong method; there is a .replace_xpath method we should be using to clear the old values.

>>> loader = ItemLoader()
>>> loader.add_xpath(...)  # load your item normally
>>> for variant in variants:
...    loader.replace_xpath(...)   # load your item using replace
...    yield loader.load_item().copy()

Thoughts?

@LanetheGreat
Copy link
Contributor

I think there's still a problem with doing that as well, and it would still be better to create a snapshot of the loader: if we simply replace the values with only the variant's data, we lose the original data we put in first. So (referencing my past examples), instead of the desired output:

  • Variation 1: Specs 1, 2, A, B
  • Variation 2: Specs 1, 2, C, D

We'd get this instead by just replacing it, losing our Specs 1 and 2:

  • Variation 1: Specs A, B
  • Variation 2: Specs C, D

Though we could technically use .replace_xpath('specs', []) and then repeat our original .add_xpath()/.add_css()/.add_value() calls to reinitialize each variation in the loader, like so:

>>> loader = ItemLoader()
>>> loader.add_xpath('field_name', ...)  # load your item normally
>>> for variant in variants:
...    loader.replace_xpath('field_name', [])   # clear the field first
...    loader.add_xpath('field_name', ...)      # same add_xpath call from earlier
...    yield loader.load_item().copy()

But in larger projects with more fields you'd have to do that for every field, which could lead to errors if you forget one. It would also slow down this section of the code, because each call to .add_xpath()/.add_css()/.add_value() runs the input through the input processors again, so you'd be doing double the processing for no added benefit (I like to stick to the DRY principle in my coding). The .copy method, by contrast, skips the input processors, because they've already been run once on the ._local_values being copied from the loader.
