Items missing #3804

FAIZ428 · 2019-06-02T02:01:57Z

Hi,
I ran into a suspected bug when using ItemLoader to populate items from a list created in Python. I have attached a working example of the suspected bug.

The problem appears once loader.get_output_value() has been executed. After that call, one of the values in the initially populated item is missing. If we only call loader.load_item(), the values are present; however, loader.get_output_value() fails to return the value(s) in the data set as tested.

loader.get_output_value() causes the item's value to be missing from School:

BurnzZ · 2019-06-04T13:59:29Z

Confirmed this bug, but only when ItemLoader is being instantiated with an item. Here's an example:

import scrapy

from scrapy.http import TextResponse
from scrapy.loader import ItemLoader

class TestItem(scrapy.Item):
    title = scrapy.Field()

class TestItemLoader(ItemLoader):
    default_item_class = TestItem

body = '<html><title>This is a title</title></html>'
response = TextResponse('https://test.com', body=body)

loader_from_response = TestItemLoader(response=response)
loader_from_response.add_css('title', 'title::text')
loader_from_response.load_item()  # {'title': [u'This is a title']}
loader_from_response.get_output_value('title')  # [u'This is a title']
loader_from_response.load_item()  # {'title': [u'This is a title']}

# The loading above is the most common approach in parsing the website
# contents. The bug occurs below when `ItemLoader` is being instantiated
# with an 'item'. 

input_item = {'title': 'Title from dict-like item'}
loader_from_item = ItemLoader(item=input_item)
loader_from_item.load_item()  # {'title': 'Title from dict-like item'}
loader_from_item.get_output_value('title')  # []
loader_from_item.load_item()  # {'title': []}
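
One plausible reading of the output above, sketched without Scrapy: assume the loader keeps collected values in a defaultdict(list) that is separate from the item it was given, and that load_item() copies every collected field back onto the item. MiniLoader and its attributes are hypothetical stand-ins for illustration, not Scrapy's actual implementation.

```python
from collections import defaultdict


class MiniLoader:
    """Hypothetical, heavily simplified stand-in for ItemLoader (not Scrapy code)."""

    def __init__(self, item=None):
        self.item = {} if item is None else item
        # Assumption: collected values live here, separately from the item.
        self._values = defaultdict(list)

    def add_value(self, field_name, value):
        self._values[field_name].append(value)

    def get_output_value(self, field_name):
        # Indexing a defaultdict creates the missing key as a side effect.
        return self._values[field_name]

    def load_item(self):
        # Every collected field overwrites the corresponding item field.
        for field_name in tuple(self._values):
            self.item[field_name] = self.get_output_value(field_name)
        return self.item


loader = MiniLoader(item={'title': 'Title from dict-like item'})
first = dict(loader.load_item())              # nothing collected, item untouched
looked_up = loader.get_output_value('title')  # creates an empty 'title' entry
second = dict(loader.load_item())             # the empty entry overwrites the item
print(first, looked_up, second)
```

Under these assumptions the lookup alone is enough to register an empty list for the field, which the next load_item() call then writes over the original value.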

sortafreel · 2019-06-06T21:46:51Z

Working on a pull request to fix it :)

kmike · 2019-07-04T08:05:42Z

Fixed by #3819.

ava7 · 2019-08-22T11:09:31Z

Hey, fellows, I am afraid that this bugfix introduces another problem...
Let me try to explain. If we take @BurnzZ's example (#3804 (comment)) and modify it just a little bit to pass an existing item via the item argument of TestItemLoader's constructor:
TestItemLoader(response=response, item=TestItem(loader.load_item()))
we end up with an error. Here is the code:

import scrapy

from scrapy.http import TextResponse
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class TestItem(scrapy.Item):
    title = scrapy.Field(output_processor=TakeFirst())


class TestItemLoader(ItemLoader):
    default_item_class = TestItem


body = '<html><title>This is a title</title></html>'
response = TextResponse('https://test.com', body=body)

# Let's say we gathered some data before, and decided to load it into an item
loader = TestItemLoader()
loader.add_value('title', 'Hello')

# And then we decided to use that data to build another item loader by passing it directly to the constructor
loader_from_response = TestItemLoader(response=response, item=TestItem(loader.load_item()))
loader_from_response.add_css('title', 'title::text')

And the error itself:

>>> loader_from_response.add_css('title', 'title::text')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/opt/miniconda2/envs/py27_scrapy1.7/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 199, in add_css
    self.add_value(field_name, values, *processors, **kw)
  File "/opt/miniconda2/envs/py27_scrapy1.7/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 79, in add_value
    self._add_value(field_name, value)
  File "/opt/miniconda2/envs/py27_scrapy1.7/lib/python2.7/site-packages/scrapy/loader/__init__.py", line 95, in _add_value
    self._values[field_name] += arg_to_iter(processed_value)
TypeError: cannot concatenate 'str' and 'list' objects

Does anyone else have this issue as well?
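
The last frame of the traceback can be reproduced without Scrapy. This is a minimal sketch that assumes the loader's internal store ends up holding the plain string that TakeFirst produced, and that adding values extends that entry with a list. The names values and new_values are hypothetical.

```python
# Hypothetical stand-in for the loader's internal value store (not Scrapy code).
values = {}

# Assumption: the loader was seeded from an item whose field had already
# been reduced to a plain string by TakeFirst().
values['title'] = 'Hello'

# Adding values then extends the stored entry with a list of new values.
new_values = ['This is a title']
try:
    values['title'] += new_values
    error = None
except TypeError as exc:
    error = exc

print(type(error).__name__, error)
```

The in-place addition only works when the stored entry is itself a list, which is why the problem shows up once an output processor has turned the list into a single value.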

ava7 · 2019-08-22T11:40:00Z

PS: I forgot to add the most essential part. It only happens when the output processor is set to TakeFirst, so I adjusted the example in my previous comment.

What if we change line 41 here: https://github.com/sortafreel/scrapy/blob/cdeccac6d6ccd0034a5f007ed371c1d481b32c26/scrapy/loader/__init__.py#L41 not to apply any output or input processors and to directly accept the dict's value? Will that introduce any other problems?

for field_name, value in item.items():
    self._values[field_name] = value

Gallaecio · 2019-08-22T15:57:39Z

@ava7 Could you please open a separate issue for it?

alexander-matsievsky · 2019-08-26T11:39:10Z

@AzharF @BurnzZ @sortafreel @kmike Hi!

I've just stumbled upon the same issue @ava7 mentioned. Seems to be related to double-processing of any kind, not just TakeFirst().

The first round of processing (unwrapping the number from the list) finishes fine.

# IN_1
self._values[field_name]
# {'POPULARITY': ['7'], 'SCORE': ['41']}

# OUT_1
proc(self._values[field_name])
# {'POPULARITY': '7', 'SCORE': '41'}

The second round incorrectly assumes the data is still raw (IN_2==IN_1), runs the processing and panics.

# IN_2
self._values[field_name]
# {'POPULARITY': '7', 'SCORE': '41'}

# OUT_2
proc(self._values[field_name])
# Error in Compose with <function process_object.<locals>.process at 0x7f708b2f6158> value={'POPULARITY': '7', 'SCORE': '41'} error='AttributeError: 'str' object has no attribute 'items''
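
The same effect can be shown with a much smaller processor. This sketch uses take_first, a simplified stand-in written for illustration rather than the processor from the report above, and it does not reproduce the exact AttributeError. It only shows that running a processor on its own output is not harmless.

```python
def take_first(values):
    """Simplified take-first processor: return the first non-empty element."""
    for value in values:
        if value is not None and value != '':
            return value


raw = ['41']
once = take_first(raw)    # first round: unwraps the list
twice = take_first(once)  # second round: iterates over the string instead
print(repr(once), repr(twice))
```

Here the second round does not even raise; it silently truncates the value, so double processing can corrupt data as well as crash.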

> not to apply any output or input processors and to directly accept the dict's value? Will that introduce any other problems?

@ava7 Unfortunately this does not work as the double processing happens anyway in other places, e.g. in get_output_value.

P.S.: I'll file a dedicated issue in a moment.

Gallaecio added the bug label Jun 3, 2019

sortafreel added a commit to sortafreel/scrapy that referenced this issue Jun 6, 2019

Add values (if there're any) when initiating items from dicts

bd8a103

scrapy#3804

sortafreel mentioned this issue Jun 6, 2019

[WIP] Fix missing values #3816

Closed

sortafreel added a commit to sortafreel/scrapy that referenced this issue Jun 7, 2019

Preprocess values if item built from dict.

754f52b

scrapy#3804

This was referenced Jun 7, 2019

[WIP] Fix missing values #3818

Closed

[MRG+1] Fix missing values #3819

Merged

kmike closed this as completed Jul 4, 2019

alexander-matsievsky mentioned this issue Aug 26, 2019

ItemLoader fields initialized from item are reprocessed #3976

Closed

sortafreel mentioned this issue Sep 10, 2019

Add reprocessing tests #3998

Closed

ejulio pushed a commit to scrapy/itemloaders that referenced this issue Apr 17, 2020

Add values (if there're any) when initiating items from dicts

4a9cba7

scrapy/scrapy#3804

ejulio pushed a commit to scrapy/itemloaders that referenced this issue Apr 17, 2020

Preprocess values if item built from dict.

a3240c7

scrapy/scrapy#3804
