[WIP] text extraction in Selector and SelectorList #127

kmike · 2018-11-02T09:18:00Z

I've opened it for discussion; it is not a finished solution (yet), but something one can install and try the API. See #34 for the original proposal.

Here ~~.text() methods~~ options to extract text are added for Selector and SelectorList, using https://github.com/TeamHG-Memex/html-text library.

Problems:

[RESOLVED] Naming issue. In Scrapy it can be convenient to have a response.text() shortcut, to use instead of response.css('body').text() or response.selector.text(). But we already have response.text, which is the unicode body. This makes the feature more confusing: the selector's .text() methods are very different from the response.text attribute. From this point of view, the .text_content() name sounds better. Any ideas for a shorter / alternative name? UPD: resolved by making text conversion a .get argument.
[RESOLVED] There is a circular package dependency: html_text requires parsel, and parsel requires html_text. This is not a problem code-wise, but I haven't checked how well pip can handle it. In a basic case it seems to work, but I wonder if we get issues related to this. It can be solved by changing html_text API and making its parsel dependency optional. UPD: this is fixed in Remove parsel dependency TeamHG-Memex/html-text#15
[RESOLVED] parsel imports private html_text methods. This can be solved by changing html_text API. UPD: fixed at Remove parsel dependency TeamHG-Memex/html-text#15
[RESOLVED] Cleaning is called for each Selector.text() call. So e.g. in case of sel.css('div').text() each div will be cleaned and copied, instead of cleaning the tree once. I'm not sure how large this problem is, though; it is probably inefficient when you need to extract text from nested elements (e.g. from all elements), because cleaning will be run multiple times on the same parts of the tree, making sel.xpath("*").text() O(N^2) instead of O(N). An alternative solution is to have sel.cleaned().text() or something like this; .cleaned() may allow lxml Cleaner arguments. But it looks like a separate feature; also, the Cleaner parameters which work best with html-text are not lxml's defaults. UPD: there is a .cleaned() method which supports different Cleaners; the O(N^2) caveat is mentioned in the docs.
[RESOLVED] When a user requests sel.text() from an element which is removed by Cleaner (e.g. sel.css('script')[0].text()), None is returned. Should it be an empty string? UPD: we (me and @dangra) think None is fine.
[RESOLVED] SelectorList.text() joins text. This is similar to what's proposed in Add method that allows joining the extracted result into a string scrapy#772, but different from SelectorList.get, which returns the first element. If needed, we can support both behaviors, by allowing sep=None, and probably using it by default (or join=None if we rename 'sep' argument to 'join'), meaning "don't join, take first" - or would it be too confusing? UPD: SelectorList no longer joins text; as there is text extraction support in .getall, it is easy to join text on user side.
[RESOLVED] Joining in SelectorList.text can be confusing if SelectorList selects nested elements. UPD: SelectorList no longer joins text.
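
To make the O(N^2) caveat from the list above concrete, here is a minimal sketch that only counts how many nodes get copied. It does not use parsel or lxml: `copy.deepcopy` on a standard-library `ElementTree` element stands in for "clean and copy", and the helper names (`build_chain`, `work_per_element`, `work_clean_once`) are made up for this illustration.

```python
import copy
import xml.etree.ElementTree as ET


def count_nodes(element):
    """Number of elements in the subtree rooted at `element`."""
    return sum(1 for _ in element.iter())


def build_chain(depth):
    """Build <div><div>...</div></div>, nested `depth` levels deep."""
    root = ET.Element("div")
    node = root
    for _ in range(depth - 1):
        node = ET.SubElement(node, "div")
    return root


def work_per_element(root):
    """Copy every element's subtree separately (one 'clean' per element)."""
    visited = 0
    for element in root.iter():
        subtree = copy.deepcopy(element)  # stand-in for Cleaner + copy
        visited += count_nodes(subtree)
    return visited


def work_clean_once(root):
    """Copy the whole tree a single time (one 'clean' for everything)."""
    cleaned = copy.deepcopy(root)  # stand-in for Cleaner + copy
    return count_nodes(cleaned)


chain = build_chain(100)
print(work_per_element(chain))  # 5050, i.e. N * (N + 1) / 2
print(work_clean_once(chain))   # 100, i.e. N
```

A chain of nested elements is the worst case; for flat documents the per-element cost is much closer to linear, which is probably why the caveat only matters when selecting nested elements.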

TODO:

codecov · 2018-11-02T09:20:50Z

Codecov Report

Merging #127 into master will decrease coverage by 1.43%.
The diff coverage is 42.85%.

@@            Coverage Diff             @@
##           master     #127      +/-   ##
==========================================
- Coverage   99.63%   98.19%   -1.44%     
==========================================
  Files           5        5              
  Lines         271      277       +6     
  Branches       48       49       +1     
==========================================
+ Hits          270      272       +2     
- Misses          1        5       +4

Impacted Files	Coverage Δ
parsel/selector.py	`97.2% <42.85%> (-2.8%)`	⬇️

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8fc608e...da7bb80. Read the comment docs.

kmike · 2018-11-17T10:55:29Z

parsel/selector.py

+        """
+        if isinstance(cleaner, six.string_types):
+            if cleaner not in {'html', 'text'}:
+                raise ValueError("cleaner must be 'html', 'text' or "


There is one gotcha: this exception is raised in .get as well, but .get accepts two more values: "auto" and None. Is it worth fixing?
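
One possible way to address the gotcha above is to build the error message from the set of values each caller actually accepts. This is only a sketch: `resolve_cleaner_name`, the two tuples, and the message wording are hypothetical and are not taken from the PR.

```python
# Hypothetical sketch: the names and the message wording are assumptions.
ALLOWED_FOR_CLEANED = ("html", "text")
ALLOWED_FOR_GET = ("auto", None, "html", "text")


def resolve_cleaner_name(cleaner, allowed, caller):
    """Validate a string/None cleaner value against the caller's allowed set.

    Any other object (e.g. a Cleaner instance) is passed through unchanged.
    """
    if cleaner is None or isinstance(cleaner, str):
        if cleaner not in allowed:
            names = ", ".join(repr(value) for value in allowed)
            raise ValueError(
                f"{caller}: cleaner must be one of {names} "
                f"or a Cleaner instance, got {cleaner!r}"
            )
    return cleaner


# "auto" is fine for .get() but rejected for .cleaned(), and each error
# message lists only the values valid for that caller.
print(resolve_cleaner_name("auto", ALLOWED_FOR_GET, ".get()"))
try:
    resolve_cleaner_name("auto", ALLOWED_FOR_CLEANED, ".cleaned()")
except ValueError as exc:
    print(exc)
```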

kmike · 2018-11-17T10:56:38Z

parsel/selector.py

+        if cleaner == 'html':
+            cleaner = self._html_cleaner
+        elif cleaner == 'text':
+            cleaner = self._text_cleaner


An alternative is to make these attributes public and ask users to pass them: sel.cleaned(sel.TEXT_CLEANER) instead of sel.cleaned('text').
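
A rough sketch of what that alternative could look like. Everything here is a stand-in: `FakeCleaner` replaces the lxml Cleaner, `SketchSelector` is not parsel's Selector, and `HTML_CLEANER` is assumed by analogy with the `TEXT_CLEANER` name used in the comment.

```python
class FakeCleaner:
    """Hypothetical stand-in for an lxml Cleaner instance."""

    def __init__(self, name):
        self.name = name


class SketchSelector:
    """Hypothetical selector exposing its cleaners as public attributes."""

    HTML_CLEANER = FakeCleaner("html")
    TEXT_CLEANER = FakeCleaner("text")

    def cleaned(self, cleaner):
        # Only cleaner objects are accepted, so there are no magic strings
        # to validate and no string-specific error message to maintain.
        if not isinstance(cleaner, FakeCleaner):
            raise TypeError(f"expected a cleaner object, got {cleaner!r}")
        return cleaner


sel = SketchSelector()
print(sel.cleaned(sel.TEXT_CLEANER).name)  # text
```

The trade-off is a slightly longer call in exchange for a single argument type; the string form is shorter to write but needs the validation discussed in the previous comment.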

codecov · 2019-05-30T14:13:21Z

Codecov Report

Attention: Patch coverage is 77.14286% with 8 lines in your changes are missing coverage. Please review.

Project coverage is 90.88%. Comparing base (780b6e6) to head (69456c1).
Report is 3 commits behind head on master.

❗ Current head 69456c1 differs from pull request most recent head 852bbef. Consider uploading reports for the commit 852bbef to get more accurate results

Files	Patch %	Lines
parsel/selector.py	77.14%	4 Missing and 4 partials ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #127      +/-   ##
==========================================
- Coverage   92.18%   90.88%   -1.30%     
==========================================
  Files           5        5              
  Lines         448      472      +24     
  Branches       91       99       +8     
==========================================
+ Hits          413      429      +16     
- Misses         26       30       +4     
- Partials        9       13       +4

Gallaecio

Looks good to me so far.

dangra · 2024-04-24T21:02:36Z

still 👍 from far away 🚢

kmike · 2024-05-08T18:32:20Z

docs/usage.rst

+To extract all text of one or more element and all their child elements, 
+formatted as plain text taking into account HTML tags (e.g. ``<br/>`` is 
+translated as a line break), set ``text=True`` in your call to 
+:meth:`~parsel.selector.Selector.get` or
+:meth:`~parsel.selector.Selector.getall` instead of including
+``::text`` (CSS) or ``/text()`` (XPath) in your query::
+
+    >>> selector.css('#images').get(text=True)
+    'Name: My image 1\nName: My image 2\nName: My image 3\nName: My image 4\nName: My image 5'
+
+See :meth:`Selector.get` for additional parameters that you can use to change
+how the extracted plain text is formatted.
+


It looks like for many use cases .get(text=True) could provide more reasonable behavior than /text() or ::text in a selector. From this point of view, I wonder if we should make it one of the first examples, and review many other examples as well. But it seems we can also do it separately, not as a part of this PR, so I'm not working on it.
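
To illustrate why tag-aware text extraction can be more reasonable than collecting raw text nodes, here is a small standard-library sketch. It is not html-text's algorithm and does not use parsel; `TextSketch` and `to_text` are made-up names, and the only formatting rule implemented is "`<br>` becomes a line break".

```python
import re
from html.parser import HTMLParser


class TextSketch(HTMLParser):
    """Collect text nodes, turning <br> into a line break."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "br":
            self.chunks.append("\n")

    def handle_data(self, data):
        # Collapse whitespace runs from the HTML source into single spaces.
        self.chunks.append(re.sub(r"\s+", " ", data))


def to_text(html):
    parser = TextSketch()
    parser.feed(html)
    parser.close()
    lines = (line.strip() for line in "".join(parser.chunks).split("\n"))
    return "\n".join(line for line in lines if line)


html = (
    '<div id="images">'
    '<a href="image1.html">Name: My image 1 <br/><img src="image1_thumb.jpg"/></a>'
    '<a href="image2.html">Name: My image 2 <br/><img src="image2_thumb.jpg"/></a>'
    "</div>"
)
print(repr(to_text(html)))  # 'Name: My image 1\nName: My image 2'
```

Collecting the raw text nodes of the same fragment would give two separate strings with trailing spaces and no indication of where the line breaks were, which is the kind of post-processing the `text=True` option is meant to spare users.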

[tmp] Selector.text and SelectorList.text methods

3c471b8

kmike mentioned this pull request Nov 13, 2018

Remove parsel dependency TeamHG-Memex/html-text#15

Merged

[wip] move converting to text to .get method, add getall support, .cl…

8dea4ce

…eaned

kmike commented Nov 17, 2018

View reviewed changes

kmike mentioned this pull request Nov 17, 2018

Add option to retrieve text content #128

Open

kmike changed the title ~~[WIP] Selector.text and SelectorList.text methods~~ [WIP] text extraction in Selector and SelectorList Dec 12, 2018

kmike mentioned this pull request Apr 11, 2019

add TextResponse.re() and .re_first() scrapy/scrapy#3741

Open

bump html-text required version number

da7bb80

Gallaecio mentioned this pull request Jul 10, 2019

Added text_content() method to selectors. #34

Closed

kmike and others added 4 commits February 10, 2022 02:31

Merge branch 'master' into selector-text

859044c

# Conflicts: # parsel/selector.py # setup.py

selector text unit tests

7bae279

code formtting

e4733ee

code formatting improvements

857ca72

Gallaecio reviewed Mar 17, 2022

View reviewed changes

parsel/selector.py Show resolved Hide resolved

shahidkarimi and others added 8 commits April 4, 2022 23:49

removed unwated tests

7941093

Merge pull request #236 from shahidkarimi/selector-text-tests

102f2e3

Selector text tests

Merge branch 'master' into selector-text

1f917bb

apply black

d87982d

fixed failing test

14dadbd

The assertion was wrong

Make new arguments keyword-only

af0d28a

documentation for selector .get() text

1737f83

suggested changes in the PR fixed

17ae5e0

kmike mentioned this pull request Nov 1, 2022

Issue #249 - Add strip to get() and getall() #260

Open

kmike and others added 3 commits November 10, 2022 17:22

Merge branch 'master' into selector-text

f8f1c66

# Conflicts: # parsel/selector.py # tests/test_selector.py

Update docs/usage.rst

c6580cc

Co-authored-by: Adrián Chaves <adrian@chaves.io>

Merge pull request #248 from shahidkarimi/selector-text-doc

419af4b

Selector text doc

bblanchon mentioned this pull request Aug 11, 2023

Adding a strip kwarg to get() and getall() #249

Open

Merge branch 'master' into selector-text

b8d0352

# Conflicts: # parsel/selector.py # tests/test_selector.py # tests/typing/selector.py # tox.ini

kmike added 8 commits May 1, 2024 18:56

fixed typing

ee3e734

fixed a refactoring issue

69456c1

document O(N^2) gotcha

a492278

make flake8 config compatible with black

8b4ae25

refactor text and cleaning tests; add more of them

ccaaa5b

fixed default .cleaned cleaner value

4eea4fa

fixed black formatting went wrong

27c9919

fix docs references

852bbef

kmike commented May 8, 2024

View reviewed changes
