Add option to retrieve text content #128

frederik-elwert · 2018-11-16T20:48:01Z

As a scrapy user, I often want to extract the text content of an element. The default option in parsel is to either use the ::text pseudo-element or XPath text(). Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this creates unwanted behavior. E.g.:

<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>

>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']

With a basic understanding of XML and XPath, this behavior is expected. But it requires overhead to work around it, and it often creates frustrations with new users. There is a series of questions on stackoverflow as well as on the scrapy bug tracker:

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:

Either, there could be additional .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
Or add a parameter to .extract*()/.get(), similar to the proposal in Add format_as to extract() methods #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.

Would such an addition be welcome? I could prepare a patch.


kmike · 2018-11-17T12:33:39Z

Hey @frederik-elwert! This is being worked on here: #127 :)

kamrankausar · 2020-02-07T09:30:51Z

Please consider this a basic feature and add it.

joecabezas · 2021-05-23T03:18:19Z

+1

bblanchon · 2022-02-04T10:39:03Z

Any progress on this issue?

kmike · 2022-02-10T11:18:18Z

Not much, but I've merged master to #127 yesterday, so the PR is up-to-date now. I think feature-wise it is ready; I'm happy with the implementation. But it needs some cleanup - more docs and tests.

celsofranssa · 2022-08-21T18:44:41Z

Any progress on this issue?

mhillebrand · 2023-05-03T19:30:23Z

This still hasn't been addressed?

GeorgeA92 · 2023-05-05T07:25:20Z

One working option is to chain css() calls, applying a *::text query to the selector that contains the text we want to scrape.
Applied to the example HTML from the issue description, it looks like this:

from parsel import Selector

text='''
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
'''

sel = Selector(text=text)

# All text
print(sel.css('h2').css('*::text').extract())
# ['This is the ', 'new', ' trend!']

print(sel.css('.post_info').css('*::text').extract())
# ['Published by newbie', 'on Sept 17']

print(sel.css('*::text').extract())
# ['\n', '\n', 'This is the ', 'new', ' trend!', '\n', 'Published by newbie', 'on Sept 17', '\n', '\n']

It is not perfect, but (at least for the use cases I had) it is already enough to cover this and similar cases, without digging deep into lxml internals.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes...

I just realized that Selector.root is the lxml HTML object created by parsel's create_root_node. This means that if the parser type is html, the text_content method mentioned above can be applied here (as can any other lxml method):

print(sel.root.text_content())
'''

This is the new trend!
Published by newbieon Sept 17


'''

Cases where a Selector query returns a SelectorList are a bit more complicated:

print([s.root.text_content() for s in sel.css('h2')])
# ['This is the new trend!']

print([s.root.text_content() for s in sel.css('.post_info')])
# ['Published by newbieon Sept 17']

Binding lxml's text_content to the Selector and SelectorList types looks like the most practical approach here.

As far as I understand, both options mentioned above were technically applicable in 2018, when this ticket was created.

mhillebrand · 2023-10-13T21:49:22Z

Ugh. I guess I'll just stick to the selectolax library. I'm a big fan of its text() method. It's got deep, separator, and strip parameters. It's also incredibly fast. The major drawback is that it doesn't support XPath.

Gallaecio added the enhancement label Aug 22, 2019

Gallaecio added the patch available label Sep 24, 2019
