Allow spiders to also return/yield ItemLoader. #3244

Closed
wants to merge 2 commits into from
9 changes: 9 additions & 0 deletions docs/topics/loaders.rst
@@ -71,6 +71,15 @@ called which actually returns the item populated with the data
previously extracted and collected with the :meth:`~ItemLoader.add_xpath`,
:meth:`~ItemLoader.add_css`, and :meth:`~ItemLoader.add_value` calls.

Note that the spider can also return/yield the loader itself, letting Scrapy
call :meth:`ItemLoader.load_item` behind the scenes::

def parse(self, response):
loader = ItemLoader(item=Product(), response=response)
# (...)
return loader


.. _topics-loaders-processors:

Input and Output processors
12 changes: 9 additions & 3 deletions scrapy/core/scraper.py
@@ -15,6 +15,7 @@
from scrapy import signals
from scrapy.http import Request, Response
from scrapy.item import BaseItem
from scrapy.loader import ItemLoader
from scrapy.core.spidermw import SpiderMiddlewareManager
from scrapy.utils.request import referer_str

@@ -176,9 +177,13 @@ def handle_spider_output(self, result, request, response, spider):
return dfd

  def _process_spidermw_output(self, output, request, response, spider):
-     """Process each Request/Item (given in the output parameter) returned
-     from the given spider
+     """Process each Request/Item/ItemLoader (given in the output parameter)
+     returned from the given spider
      """
# Allow an ItemLoader to be returned: convert it to an Item via load_item()
if isinstance(output, ItemLoader):
output = output.load_item()

if isinstance(output, Request):
self.crawler.engine.crawl(request=output, spider=spider)
elif isinstance(output, (BaseItem, dict)):
@@ -190,7 +195,8 @@ def _process_spidermw_output(self, output, request, response, spider):
pass
else:
typename = type(output).__name__
- logger.error('Spider must return Request, BaseItem, dict or None, '
+ logger.error('Spider must return Request, BaseItem, ItemLoader, '
+              'dict or None, '
'got %(typename)r in %(request)s',
{'request': request, 'typename': typename},
extra={'spider': spider})
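The dispatch this patch adds can be illustrated with a small self-contained sketch. `FakeItemLoader` and `process_output` below are hypothetical stand-ins, not Scrapy's real `ItemLoader` or `Scraper._process_spidermw_output`; they only mirror the patched logic, in which a loader appearing in the spider output is unwrapped with `load_item()` before the usual type checks run:

```python
class FakeItemLoader:
    """Stand-in for scrapy.loader.ItemLoader: collects values, builds a dict."""

    def __init__(self):
        self._values = {}

    def add_value(self, field, value):
        self._values[field] = value

    def load_item(self):
        # The real ItemLoader applies output processors here; this stand-in
        # simply returns the collected values as a plain dict.
        return dict(self._values)


def process_output(output):
    # Mirrors the patched dispatch: loaders are unwrapped first...
    if isinstance(output, FakeItemLoader):
        output = output.load_item()
    # ...then items (dicts here) pass through, None is ignored,
    # and anything else is rejected, as in the updated error message.
    if isinstance(output, dict):
        return output
    if output is None:
        return None
    raise TypeError(
        "Spider must return Request, BaseItem, ItemLoader, dict or None, "
        "got %r" % type(output).__name__
    )


loader = FakeItemLoader()
loader.add_value("name", "Example product")
print(process_output(loader))  # the loader is converted to an item
```

In the real patch the same unwrapping happens once per element of the spider output, so `yield loader` and `yield loader.load_item()` become equivalent from the spider's point of view.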