Items missing #3804

Closed
FAIZ428 opened this issue Jun 2, 2019 · 7 comments

@FAIZ428

FAIZ428 commented Jun 2, 2019

Hi,
I ran into this while using ItemLoader to populate items from a list created in Python. I have attached a working example of the suspected bug.

The problem appears once loader.get_output_value() has been called: after that call, one of the values in the initially populated item goes missing. If we only call loader.load_item(), the values are present, but after loader.get_output_value() the value(s) no longer show up in the data set I tested.
[Screenshot: scrapybug]

loader.get_output_value causes the item's value to be missing from School:

Gallaecio added the bug label Jun 3, 2019
@BurnzZ
Member

BurnzZ commented Jun 4, 2019

Confirmed this bug, but only when ItemLoader is being instantiated with an item. Here's an example:

import scrapy

from scrapy.http import TextResponse
from scrapy.loader import ItemLoader

class TestItem(scrapy.Item):
    title = scrapy.Field()

class TestItemLoader(ItemLoader):
    default_item_class = TestItem

body = '<html><title>This is a title</title></html>'
response = TextResponse('https://test.com', body=body)

loader_from_response = TestItemLoader(response=response)
loader_from_response.add_css('title', 'title::text')
loader_from_response.load_item()  # {'title': [u'This is a title']}
loader_from_response.get_output_value('title')  # [u'This is a title']
loader_from_response.load_item()  # {'title': [u'This is a title']}

# The loading above is the most common approach in parsing the website
# contents. The bug occurs below when `ItemLoader` is being instantiated
# with an 'item'. 

input_item = {'title': 'Title from dict-like item'}
loader_from_item = ItemLoader(item=input_item)
loader_from_item.load_item()  # {'title': 'Title from dict-like item'}
loader_from_item.get_output_value('title')  # []
loader_from_item.load_item()  # {'title': []}
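
One plausible way to read the output above, sketched without Scrapy. It assumes the loader keeps its collected values in a defaultdict(list); the traceback further down this thread shows self._values[field_name] += ..., which is consistent with that assumption but does not confirm it. ToyLoader is a hypothetical stand-in, not Scrapy's actual class.

```python
from collections import defaultdict


class ToyLoader:
    """Toy stand-in for an item loader; NOT Scrapy's real implementation."""

    def __init__(self, item):
        self.item = item
        # Assumption: collected values live in a defaultdict(list).
        self._values = defaultdict(list)

    def get_output_value(self, field_name):
        # Indexing a defaultdict creates the missing key as a side effect,
        # so a plain lookup leaves an empty list behind.
        return self._values[field_name]

    def load_item(self):
        # Every key present in _values overwrites the item's field.
        for field_name in tuple(self._values):
            self.item[field_name] = self.get_output_value(field_name)
        return self.item


loader = ToyLoader({'title': 'Title from dict-like item'})
before = dict(loader.load_item())            # nothing collected, item untouched
looked_up = loader.get_output_value('title')  # the lookup creates the key
after = dict(loader.load_item())             # the empty list now wins
print(before, looked_up, after)
```

In this toy model the item passed to the constructor is never copied into the collected values, so a read-only lookup is enough to clobber the field on the next load_item().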

@sortafreel
Contributor

Working on a pull request to fix it :)

sortafreel added a commit to sortafreel/scrapy that referenced this issue Jun 6, 2019
sortafreel added a commit to sortafreel/scrapy that referenced this issue Jun 7, 2019
@kmike
Member

kmike commented Jul 4, 2019

Fixed by #3819.

kmike closed this as completed Jul 4, 2019
@ava7

ava7 commented Aug 22, 2019

Hey, fellows, I am afraid that this bugfix introduces another problem.
Let me try to explain. If we take @BurnzZ's example (#3804 (comment)) and modify it slightly so that an already loaded item is passed via the item argument of TestItemLoader's constructor:
TestItemLoader(response=response, item=TestItem(loader.load_item()))
we end up with an error. Here is the full example:

import scrapy

from scrapy.http import TextResponse
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class TestItem(scrapy.Item):
    title = scrapy.Field(output_processor=TakeFirst())


class TestItemLoader(ItemLoader):
    default_item_class = TestItem


body = '<html><title>This is a title</title></html>'
response = TextResponse('https://test.com', body=body)

# Let's say we gathered some data before, and decided to load it into an item
loader = TestItemLoader()
loader.add_value('title', 'Hello')

# And then we decided to use that data to build another item loader by passing it directly to the constructor
loader_from_response = TestItemLoader(response=response, item=TestItem(loader.load_item()))
loader_from_response.add_css('title', 'title::text')

And the error itself

>>> loader_from_response.add_css('title', 'title::text')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/opt/miniconda2/envs/py27_scrapy1.7/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 199, in add_css
    self.add_value(field_name, values, *processors, **kw)
  File "/opt/miniconda2/envs/py27_scrapy1.7/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 79, in add_value
    self._add_value(field_name, value)
  File "/opt/miniconda2/envs/py27_scrapy1.7/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 95, in _add_value
    self._values[field_name] += arg_to_iter(processed_value)
TypeError: cannot concatenate 'str' and 'list' objects

Does anyone else have this issue as well?
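
The last frame of the traceback can be reproduced in isolation. This is a minimal sketch, assuming the already processed value (a plain string, as TakeFirst would return) ends up stored where a list of collected values is expected; the values dict here is a hypothetical stand-in for the loader's internal state.

```python
# Hypothetical stand-in for the loader's internal values: the field holds
# an already processed string instead of a list of collected values.
values = {'title': 'Hello'}

try:
    # Same shape of operation as the failing line in the traceback.
    values['title'] += ['This is a title']
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__)
```

The exact wording of the message differs between Python 2 and Python 3, but in both cases concatenating a str with a list raises TypeError.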

@ava7

ava7 commented Aug 22, 2019

PS I forgot to add the most essential part - it only happens when the output processor is set to TakeFirst so I adjusted the example in my previous comment.

Maybe we could change line 41 here: https://github.com/sortafreel/scrapy/blob/cdeccac6d6ccd0034a5f007ed371c1d481b32c26/scrapy/loader/__init__.py#L41 so that it does not apply any output or input processors and directly accepts the dict's value? Would that introduce any other problems?

for field_name, value in item.items():
    self._values[field_name] = value
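
For comparison, a variation on the idea above, sketched without Scrapy: instead of storing the dict's value as-is, wrap it in a list first so that a later += with a list still works. The to_list helper and the plain dicts are hypothetical and only illustrate the shape of the data; this is not the fix adopted by the project, and it does not address values being processed more than once.

```python
def to_list(value):
    """Hypothetical helper: wrap scalars so they can be extended like lists."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    return [value]


item = {'title': 'Hello'}

# Seed the collected values with list-wrapped copies of the item's fields.
values = {name: to_list(value) for name, value in item.items()}

# Extending the field with newly extracted values no longer fails.
values['title'] += ['This is a title']
print(values)
```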

@Gallaecio
Member

@ava7 Could you please open a separate issue for it?

@alexander-matsievsky

@AzharF @BurnzZ @sortafreel @kmike Hi!

I've just stumbled upon the same issue @ava7 mentioned. Seems to be related to double-processing of any kind, not just TakeFirst().


  1. The first round of processing (unwrapping the number from the list) finishes fine.
# IN_1
self._values[field_name]
# {'POPULARITY': ['7'], 'SCORE': ['41']}

# OUT_1
proc(self._values[field_name])
# {'POPULARITY': '7', 'SCORE': '41'}
  2. The second round incorrectly assumes the data is still raw (IN_2==IN_1), runs the processing and panics.
# IN_2
self._values[field_name]
# {'POPULARITY': '7', 'SCORE': '41'}

# OUT_2
proc(self._values[field_name])
# Error in Compose with <function process_object.<locals>.process at 0x7f708b2f6158> value={'POPULARITY': '7', 'SCORE': '41'} error='AttributeError: 'str' object has no attribute 'items''
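
The two rounds above can be imitated in miniature without Scrapy. In this sketch, take_first is a hand-written function that behaves like a first-non-empty-value processor, and unwrap_stats is a hypothetical processor that expects raw data of the form {'KEY': ['value']}; neither is taken from the code in this thread.

```python
def take_first(values):
    """Return the first value that is neither None nor an empty string."""
    for value in values:
        if value is not None and value != '':
            return value


def unwrap_stats(stats):
    """Hypothetical processor: expects {'KEY': ['value']} and unwraps it."""
    return {key: take_first(raw) for key, raw in stats.items()}


raw = {'POPULARITY': ['7'], 'SCORE': ['41']}

once = unwrap_stats(raw)    # first round: lists are unwrapped as intended
twice = unwrap_stats(once)  # second round: strings are iterated per character

try:
    # Feeding one already unwrapped value back in fails outright.
    unwrap_stats(once['SCORE'])
    failure = None
except AttributeError as exc:
    failure = exc

print(once, twice, type(failure).__name__)
```

Depending on where the second round happens, the result is either a loud AttributeError or a silently truncated value ('41' becoming '4'), which is why processing that is not idempotent should only ever run once per value.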

> not to apply any output or input processors and to directly accept the dict's value? Will that introduce any other problems?

@ava7 Unfortunately this does not work as the double processing happens anyway in other places, e.g. in get_output_value.

P.S.: I'll file a dedicated issue in a moment.
