Add option to retrieve text content #128

Open
frederik-elwert opened this issue Nov 16, 2018 · 9 comments

Comments

@frederik-elwert

As a Scrapy user, I often want to extract the text content of an element. The default option in parsel is to use either the ::text pseudo-element or XPath text(). Both options have the downside that they return all text nodes as individual elements. When the element contains child elements, this produces unwanted results. E.g.:

<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>

>>> response.css('h2::text').extract()
['This is the ', ' trend!']
>>> response.css('.post_info::text').extract()
['Published by newbie', 'on Sept 17']

With a basic understanding of XML and XPath, this behavior is expected. But working around it requires overhead, and it often frustrates new users. There is a series of questions about it on Stack Overflow as well as on the Scrapy bug tracker.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes. I could imagine two ways to approach the required API:

  • Either, there could be additional .extract_text()/.get_text() methods. This seems clean and easy to use, but would lead to potentially convoluted method names like .extract_first_text() (or .extract_text_first()?).
  • Or add a parameter to .extract*()/.get(), similar to the proposal in Add format_as to extract() methods #101. This could be .extract(format_as='text'). This is less intrusive, but maybe less easy to discover.

Would such an addition be welcome? I could prepare a patch.

@kmike
Member

kmike commented Nov 17, 2018

Hey @frederik-elwert! This is being worked on here: #127 :)

@kamrankausar

Please consider this a basic feature and add it.

@joecabezas

+1

@bblanchon

Any progress on this issue?

@kmike
Member

kmike commented Feb 10, 2022

Not much, but I've merged master to #127 yesterday, so the PR is up-to-date now. I think feature-wise it is ready; I'm happy with the implementation. But it needs some cleanup - more docs and tests.

@celsofranssa

Any progress on this issue?

@mhillebrand

This still hasn't been addressed?

@GeorgeA92
Contributor

One working option is to chain css() calls, applying a *::text query to the selector that contains the text we want to scrape.
Applied to the sample HTML from the issue description, it looks like this:

from parsel import Selector

text='''
<html>
<body>
<h2>This is the <em>new</em> trend!</h2>
<p class="post_info">Published by newbie<br>on Sept 17</p>
</body>
</html>
'''

sel = Selector(text=text)

# All text
print(sel.css('h2').css('*::text').extract())
# ['This is the ', 'new', ' trend!']

print(sel.css('.post_info').css('*::text').extract())
# ['Published by newbie', 'on Sept 17']

print(sel.css('*::text').extract())
# ['\n', '\n', 'This is the ', 'new', ' trend!', '\n', 'Published by newbie', 'on Sept 17', '\n', '\n']

It is not perfect, but (at least for the use cases I had) it is already enough to cover this and similar cases without digging deep into lxml internals.

lxml.html has the convenience method .text_content() that collects all of the text content of an element. Something similar could be added to the Selector and SelectorList classes...

I just realized that Selector.root is lxml's HTML element object, created by its create_root_node method. This means that if the parser type is html, the text_content method mentioned above can be applied here (as well as any other lxml method):

print(sel.root.text_content())
'''

This is the new trend!
Published by newbieon Sept 17


'''

Cases where a Selector query returns a SelectorList are a bit more complicated:

print([s.root.text_content() for s in sel.css('h2')])
# ['This is the new trend!']

print([s.root.text_content() for s in sel.css('.post_info')])
# ['Published by newbieon Sept 17']

Binding lxml's text_content to the Selector and SelectorList types looks like the most practical approach here.

As far as I understand, both options mentioned above were technically available in 2018, when this ticket was created.

@mhillebrand

Ugh. I guess I'll just stick to the selectolax library. I'm a big fan of its text() method. It's got deep, separator, and strip parameters. It's also incredibly fast. The major drawback is that it doesn't support XPath.
