ItemLoader: improve handling of initial item #4036

Merged: 9 commits, Oct 28, 2019
10 changes: 8 additions & 2 deletions docs/topics/loaders.rst
@@ -35,6 +35,12 @@ Then, you start collecting values into the Item Loader, typically using
the same item field; the Item Loader will know how to "join" those values later
using a proper processing function.

+.. note:: Collected data is internally stored as lists,
+   allowing several values to be added to the same field.
+   If an ``item`` argument is passed when creating a loader,
+   each of the item's values will be stored as-is if it is already
+   an iterable, or wrapped in a list if it is a single value.
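
For instance, a minimal sketch of the behavior this note documents (the dict
item and the field names below are illustrative):

from scrapy.loader import ItemLoader

loader = ItemLoader(item={'name': 'single', 'tags': ['a', 'b']})
print(loader.get_collected_values('name'))  # ['single'] -- wrapped in a list
print(loader.get_collected_values('tags'))  # ['a', 'b'] -- stored as-is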

Here is a typical Item Loader usage in a :ref:`Spider <topics-spiders>`, using
the :ref:`Product item <topics-items-declaring>` declared in the :ref:`Items
chapter <topics-items>`::
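
The example block itself is collapsed in this view; a rough sketch of that kind
of spider code (a reconstruction, not the exact docs snippet; the XPath and CSS
expressions are illustrative) looks like this:

def parse(self, response):
    # Product is the item declared in the Items chapter.
    l = ItemLoader(item=Product(), response=response)
    l.add_xpath('name', '//div[@class="product_name"]')
    l.add_xpath('price', '//p[@id="price"]')
    l.add_css('stock', 'p#stock')
    l.add_value('last_updated', 'today')  # literal values can be added too
    return l.load_item()
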
@@ -128,9 +134,9 @@ So what happens is:
It's worth noticing that processors are just callable objects, which are called
with the data to be parsed, and return a parsed value. So you can use any
function as input or output processor. The only requirement is that they must
-accept one (and only one) positional argument, which will be an iterator.
+accept one (and only one) positional argument, which will be an iterable.

-.. note:: Both input and output processors must receive an iterator as their
+.. note:: Both input and output processors must receive an iterable as their
   first argument. The output of those functions can be anything. The result of
   input processors will be appended to an internal list (in the Loader)
   containing the collected values (for that field). The result of the output
   processors is the value that will be finally assigned to the field.
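
Since this requirement is central to the loader, here is a hedged sketch of the
input/output split, using processors from scrapy.loader.processors as they
existed at the time (the item and loader names are made up):

from scrapy import Field, Item
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Join, MapCompose

class Product(Item):
    name = Field()

class ProductLoader(ItemLoader):
    default_item_class = Product
    name_in = MapCompose(str.strip)  # input: applied to each raw value
    name_out = Join(' ')             # output: applied to the collected list

loader = ProductLoader()
loader.add_value('name', ['  foo ', ' bar '])
print(loader.load_item())  # {'name': 'foo bar'}
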
31 changes: 15 additions & 16 deletions scrapy/loader/__init__.py
@@ -1,19 +1,19 @@
"""Item Loader
"""
Item Loader

See documentation in docs/topics/loaders.rst

"""
from collections import defaultdict

import six

from scrapy.item import Item
from scrapy.loader.common import wrap_loader_context
from scrapy.loader.processors import Identity
from scrapy.selector import Selector
from scrapy.utils.misc import arg_to_iter, extract_regex
from scrapy.utils.python import flatten

from .common import wrap_loader_context
from .processors import Identity


class ItemLoader(object):

@@ -33,10 +33,9 @@ def __init__(self, item=None, selector=None, response=None, parent=None, **context):
        self.parent = parent
        self._local_item = context['item'] = item
        self._local_values = defaultdict(list)
-       # Preprocess values if item built from dict
-       # Values need to be added to item._values if added them from dict (not with add_values)
+       # values from initial item
        for field_name, value in item.items():
-           self._values[field_name] = self._process_input_value(field_name, value)
+           self._values[field_name] += arg_to_iter(value)
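
The new line relies on scrapy.utils.misc.arg_to_iter to normalize the initial
values; roughly, its effect is:

from scrapy.utils.misc import arg_to_iter

print(list(arg_to_iter('scalar')))    # ['scalar']  (a single value is wrapped)
print(list(arg_to_iter(['a', 'b'])))  # ['a', 'b']  (iterables pass through)
print(list(arg_to_iter(None)))        # []          (None becomes empty)
# Strings, bytes, dicts and Items also count as single values here.
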
@pawelmhm (Contributor) commented on Feb 23, 2021:

I know this is an old commit, but I'm now in the process of understanding this whole issue. The chain of events started with #3819, and this is still a breaking change for many projects; people were certainly getting different data in many cases after all these changes. See the code below as an example. I have several spiders relying on the old behavior and am now looking for a workaround.

from scrapy import Item, Field
from scrapy.loader import ItemLoader


class AItem(Item):
    name = Field()


class ALoader(ItemLoader):
    default_item_class = AItem


# Breaking change between Scrapy 1.7 and 2.4:
# below is the behavior that several spiders rely on, which was the behavior
# up until 1.7.4; it is not how the loader works by default after 1.7.4.
def test_loader_breaking_change_scrapy_v17_v24():
    # To see this, run this test in two environments: one with Scrapy 1.7
    # and the old loader code, the other with Scrapy 2.4.1 and the
    # itemloaders library.
    loader = ALoader()
    loader.add_value('name', 'hello world')
    l = loader.load_item()
    new_loader = ALoader(l)
    new_loader.add_value("name", "lovely")
    l = new_loader.load_item()
    # after Scrapy 1.7.4 this gives: {'name': ['hello world', 'lovely']}
    # until Scrapy 1.7.4 it gave:    {'name': ['lovely']}
    assert l['name'] == ['lovely']

A maintainer (Member) replied:

Would using a “TakeLast” output processor (e.g. lambda v: v[-1]) work for your scenario?
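
A sketch of that suggestion (reusing AItem and ALoader from the test above;
Compose is used so the processor receives the collected list, and v[-1:] keeps
the list form that the old behavior produced):

from scrapy.loader.processors import Compose

class TakeLastLoader(ALoader):
    # "TakeLast": keep only the most recently added value.
    name_out = Compose(lambda v: v[-1:])

loader = TakeLastLoader()
loader.add_value('name', 'hello world')
item = loader.load_item()               # {'name': ['hello world']}
new_loader = TakeLastLoader(item)
new_loader.add_value('name', 'lovely')
assert new_loader.load_item()['name'] == ['lovely']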

@pawelmhm (Contributor) replied on Feb 23, 2021:

I have a TakeFirst processor defined for the specific field where I spotted this, and different spiders are doing different things; some of them are fine after this change, but some are not. Changing output processors would change several things. I wonder if we could add some option for the loader to keep the old behavior? It seems dirty, but it would make things easier. I could also refactor the spiders, but there are many of them, and even finding which ones are affected is not easy.

The maintainer (Member) replied:

What kind of change do you have in mind? Something like a parameter to ItemLoader that sets the old behavior? Or would it work for you if we simply extended the ItemLoader API with something like set_value to replace any value added so far?

@pawelmhm (Contributor) replied:

> Something like a parameter to ItemLoader that sets the old behavior?

I was thinking about a parameter or attribute of the loader that I could set when defining my subclass of ItemLoader. But I see a potential problem from the library's point of view: it could be hard to maintain in the future, since there would be two different logics, and I'm not sure how many people would even think about this attribute or parameter. If mine were the only project using it, it wouldn't make much sense; I could just subclass ItemLoader on my side and change some things. I could also stick to the old ItemLoader code, but there is no easy way to do that, because the loaders were moved to a separate repo, so I cannot simply use some old version of ItemLoader with Scrapy 2.4.1. The first release of itemloaders came after this change, and Scrapy 2.4.1 imports itemloaders.

> Or would it work for you if we simply extended the ItemLoader API with something like set_value to replace any value added so far?

There is already replace_value, which does what I would need here. I guess the problem is that, in the past, loader.add_value was in many cases actually replacing the value when people intended to add it; code now relies on that, and it creates a mess when the situation is cleaned up.
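
For reference, a minimal sketch of the replace_value path mentioned here (again
reusing AItem and ALoader from the test above):

loader = ALoader(AItem(name=['hello world']))
loader.replace_value('name', 'lovely')  # drops the collected values, then adds
assert loader.load_item()['name'] == ['lovely']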


    @property
    def _values(self):
@@ -132,24 +131,24 @@ def get_output_value(self, field_name):
        try:
            return proc(self._values[field_name])
        except Exception as e:
-            raise ValueError("Error with output processor: field=%r value=%r error='%s: %s'" % \
-                             (field_name, self._values[field_name], type(e).__name__, str(e)))
+            raise ValueError("Error with output processor: field=%r value=%r error='%s: %s'" %
+                             (field_name, self._values[field_name], type(e).__name__, str(e)))

    def get_collected_values(self, field_name):
        return self._values[field_name]

    def get_input_processor(self, field_name):
        proc = getattr(self, '%s_in' % field_name, None)
        if not proc:
-            proc = self._get_item_field_attr(field_name, 'input_processor', \
-                                             self.default_input_processor)
+            proc = self._get_item_field_attr(field_name, 'input_processor',
+                                             self.default_input_processor)
        return proc

    def get_output_processor(self, field_name):
        proc = getattr(self, '%s_out' % field_name, None)
        if not proc:
-            proc = self._get_item_field_attr(field_name, 'output_processor', \
-                                             self.default_output_processor)
+            proc = self._get_item_field_attr(field_name, 'output_processor',
+                                             self.default_output_processor)
        return proc

    def _process_input_value(self, field_name, value):
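
Both getters above resolve processors in the same order: a field-specific
loader attribute, then the item field's metadata, then the loader default.
A hedged illustration (all names made up):

from scrapy import Field, Item
from scrapy.loader import ItemLoader
from scrapy.loader.processors import Identity, TakeFirst

class Book(Item):
    price = Field(output_processor=TakeFirst())  # 2) field metadata
    title = Field()

class BookLoader(ItemLoader):
    default_item_class = Book
    price_out = Identity()                  # 1) loader attribute wins for 'price'
    default_output_processor = TakeFirst()  # 3) fallback, used for 'title'
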
@@ -174,8 +173,8 @@ def _get_item_field_attr(self, field_name, key, default=None):
    def _check_selector_method(self):
        if self.selector is None:
            raise RuntimeError("To use XPath or CSS selectors, "
-                "%s must be instantiated with a selector "
-                "or a response" % self.__class__.__name__)
+                               "%s must be instantiated with a selector "
+                               "or a response" % self.__class__.__name__)

    def add_xpath(self, field_name, xpath, *processors, **kw):
        values = self._get_xpathvalues(xpath, **kw)
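
And the selector requirement in action (a minimal sketch; the HTML and field
name are made up):

from scrapy.loader import ItemLoader
from scrapy.selector import Selector

sel = Selector(text='<p id="price">9.99</p>')
loader = ItemLoader(item={}, selector=sel)
loader.add_xpath('price', '//p[@id="price"]/text()')
print(loader.get_collected_values('price'))  # ['9.99']
# Without a selector or a response, add_xpath raises the RuntimeError above.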