
ItemLoader fields initialized from item are reprocessed #3976

Closed
alexander-matsievsky opened this issue Aug 26, 2019 · 9 comments · Fixed by #4036

alexander-matsievsky commented Aug 26, 2019

Description

#3804 introduced a bug where ItemLoader fields are reprocessed.
Related: #3897.

Steps to Reproduce

from pprint import pprint

from scrapy import Field, Item
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class X(Item):
    x = Field(output_processor=TakeFirst())


loader = ItemLoader(X())
loader.add_value("x", ["value1", "value2"])
x = loader.load_item()
pprint(x)
# {'x': 'value1'}

pprint(ItemLoader(x).load_item())
# {'x': 'v'}

Expected behavior: ItemLoader initialized from the x item does not reprocess its fields and loads {'x': 'value1'}.

Actual behavior: ItemLoader initialized from the x item reprocesses its fields and loads {'x': 'v'}.
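To see where the stray 'v' comes from, here is a minimal plain-Python stand-in for a TakeFirst-style processor (a sketch, not Scrapy's actual implementation): the processor iterates its input and returns the first non-empty value, so when the stored value is the bare string 'value1' rather than the list ['value1'], iteration yields characters.

```python
# Minimal stand-in for a TakeFirst-style output processor: iterate
# the input and return the first non-None, non-empty value.
def take_first(values):
    for value in values:
        if value is not None and value != '':
            return value

print(take_first(["value1", "value2"]))  # -> 'value1'
print(take_first("value1"))              # -> 'v' (iterates characters)
```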

Versions

Scrapy       : 1.7.3
lxml         : 4.4.1.0
libxml2      : 2.9.9
cssselect    : 1.1.0
parsel       : 1.5.2
w3lib        : 1.21.0
Twisted      : 19.7.0
Python       : 3.6.5 (default, May  3 2018, 10:08:28) - [GCC 5.4.0 20160609]
pyOpenSSL    : 19.0.0 (OpenSSL 1.1.1c  28 May 2019)
cryptography : 2.7
Platform     : Linux-4.4.0-127-generic-x86_64-with-LinuxMint-18.1-serena

Additional context

Here's the behavior of the previous version:

Scrapy       : 1.6.0
lxml         : 4.4.0.0
libxml2      : 2.9.9
cssselect    : 1.0.3
parsel       : 1.5.1
w3lib        : 1.20.0
Twisted      : 19.7.0
Python       : 3.6.5 (default, May  3 2018, 10:08:28) - [GCC 5.4.0 20160609]
pyOpenSSL    : 19.0.0 (OpenSSL 1.1.1c  28 May 2019)
cryptography : 2.7
Platform     : Linux-4.4.0-127-generic-x86_64-with-LinuxMint-18.1-serena
# {'x': 'value1'}
# {'x': 'value1'}
ava7 commented Aug 26, 2019

@alexander-matsievsky Cheers, thanks for taking the time! I believe this issue is also related to the discussion in #3897, which is why I didn't open another issue myself.

I also couldn't figure out why this commit was discarded in favour of the next one. The only explanation I can think of is the failing tests that made the author of the pull request change it. But then again, it seems to me that the test itself does not assert correctly?

@sortafreel (Contributor)

Yeah, it was mostly because of failing test. Just came back from my vacation, will take a fresh look.

@alexander-matsievsky (Author)

@ava7 It seems there's a web of related issues around this one :) I've tried to make a focused reproduction; I hope it helps.
@sortafreel Welcome back, Alexander! Thanks for taking care of it!

@sortafreel (Contributor)

Checked — it's fixable. I'll ship a new pull request with new tests this week.

ejulio (Contributor) commented Sep 10, 2019

I ran into the same problem; I'll add my case here as an example.

from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst, Compose, Identity


class Temp(Item):
    temp = Field()
    def __init__(self, *args, **kwargs):
        # zero-argument super() (Python 3); note: no extra self argument
        super().__init__(*args, **kwargs)
        self.setdefault('temp', 0.3)


class Loader(ItemLoader):
    default_item_class = Temp
    default_input_processor = Identity()
    default_output_processor = Compose(TakeFirst())


loader = Loader()
loader.load_item()

It works in Scrapy 1.6 but breaks in 1.7.
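A quick way to see this failure mode without Scrapy (a sketch, assuming the 1.7 loader stores the item's initial 0.3 as-is rather than wrapped in a list): output processors built on TakeFirst iterate their input, and a bare float is not iterable.

```python
# The raw float stored from the item cannot be iterated, while the
# same value wrapped in a list can.
try:
    iter(0.3)
except TypeError as exc:
    print(exc)            # 'float' object is not iterable

print(list(iter([0.3])))  # -> [0.3]
```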

@sortafreel (Contributor)

@ejulio Can you add more details, please? I tested in Scrapy 1.6.0 and got this:

>>> import scrapy
>>> scrapy.__version__
u'1.6.0'
>>> from scrapy.item import Item, Field
>>> from scrapy.loader import ItemLoader
>>> from scrapy.loader.processors import TakeFirst, Compose, Identity
>>> 
>>> 
>>> class Temp(Item):
...     temp = Field()
...     def __init__(self, *args, **kwargs):
...         super().__init__(self, *args, **kwargs)
...         self.setdefault('temp', 0.3)
... 
>>> 
>>> class Loader(ItemLoader):
...     default_item_class = Temp
...     default_input_processor = Identity()
...     default_output_processor = Compose(TakeFirst())
... 
>>> 
>>> loader = Loader()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/sortafreel/venvotest/local/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 33, in __init__
    item = self.default_item_class()
  File "<stdin>", line 4, in __init__
TypeError: super() takes at least 1 argument (0 given)

@sortafreel (Contributor)

@ejulio I assume that if there's a problem, it's fixed in #3998. I can add more tests to be sure if you help me understand what's broken.

ejulio (Contributor) commented Sep 11, 2019

@sortafreel, it's Python 3 as well.
I think the error you got is because zero-argument super() is Python 3 syntax; in Python 2 it would need to be super(Temp, self).

@elacuesta (Member)

For the record, I think what we should do (and what is done in #3998 and #4036) is to avoid input reprocessing. The code samples posted by @alexander-matsievsky and @ejulio fail in 1.7.3 because the initial values are stored as they are, not as iterables.
It is reasonable, IMHO, to process the output each time load_item is called, but since TakeFirst works on an iterable, it takes the first character of a string in the first example and fails on a float in the second.
