Add support for regex flags in `.re()` and `.re_first()` methods #225

noviluni · 2021-08-07T20:49:19Z

In some cases I need to apply a regex across multiple lines, and the only workaround I found was compiling the expression and using a regex flag there.

Take this example, where I want to extract the body of the JavaScript function example() (I know it is not really a valid function; it is just an example):

>>> import re
>>> from parsel import Selector
>>> text = """
... <script>
...     function example() {
...         "name": "Adrian",
...         "points": 3,
...     }
... </script>
... """
>>> sel = Selector(text=text)

# Using a regex string doesn't work, because "." does not match newlines by default:
>>> sel.css('script').re_first(r"example\(\) ({.*})")

# I need to compile the regex with a flag:
>>> regex = re.compile(r"example\(\) ({.*})", flags=re.DOTALL)
>>> sel.css('script').re_first(regex)
'{\n        "name": "Adrian",\n        "points": 3,\n    }'

This requires extra steps that could be avoided by adding support for regex flags to the re_first() and re() methods, which is what this PR does. With the new implementation, you can pass the flags directly:

>>> sel.css('script').re_first(r"example\(\) ({.*})", flags=re.DOTALL)
'{\n        "name": "Adrian",\n        "points": 3,\n    }'
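
As a minimal sketch of why the flag matters, the same pattern can be tried with the standard `re` module alone, without parsel. The `script` string below is only a stand-in for the text that `sel.css('script')` would hand to the regex; that stand-in and the variable names are assumptions for illustration.

```python
import re

# Stand-in for the <script> element text from the example above (assumption).
script = (
    "<script>\n"
    "    function example() {\n"
    '        "name": "Adrian",\n'
    '        "points": 3,\n'
    "    }\n"
    "</script>"
)
pattern = r"example\(\) ({.*})"

# Without a flag, "." stops at newlines, so "{" and "}" on different
# lines can never be part of the same match.
no_flag = re.search(pattern, script)

# With re.DOTALL, "." also matches newlines and the whole body is captured.
with_flag = re.search(pattern, script, flags=re.DOTALL)

print(no_flag)
print(repr(with_flag.group(1)))
```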

This also helps when you need case-insensitive regexes:

>>> text = 'Price: 1000.00€'
>>> sel = Selector(text=text)

# This works:
>>> sel.re_first(r'Price: ([\d.]+)€')
'1000.00'

# However, once the text is lowercased it stops matching:
>>> text2 = 'price: 1000.00€'
>>> sel2 = Selector(text=text2)
>>> sel2.re_first(r'Price: ([\d.]+)€')

# With the new implementation you can pass the flag directly:
>>> sel2.re_first(r'Price: ([\d.]+)€', flags=re.I)
'1000.00'
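
The case-insensitive behavior can also be checked with the standard `re` module on its own. This is a small sketch, not parsel code; the sample strings and variable names are assumptions. It also shows that several flags can be combined with bitwise OR, which is how a `flags` argument would normally be used.

```python
import re

pattern = r"Price: ([\d.]+)€"
lowercase_text = "price: 1000.00€"

# Case-sensitive by default: "Price" does not match "price".
strict = re.search(pattern, lowercase_text)

# re.IGNORECASE (alias re.I) makes the literal part case-insensitive.
relaxed = re.search(pattern, lowercase_text, flags=re.IGNORECASE)

# Flags are combined with "|": here re.M lets "^" match at each line start.
multi = re.findall(
    r"^price: ([\d.]+)",
    "Price: 1.50\nPRICE: 2.75",
    flags=re.I | re.M,
)

print(strict, relaxed.group(1), multi)
```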

Let me know your thoughts :)

codecov · 2021-08-07T20:52:44Z

Codecov Report

Merging #225 (df0f41b) into master (d20db09) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##            master      #225   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files            5         5           
  Lines          291       292    +1     
  Branches        51        51           
=========================================
+ Hits           291       292    +1

Impacted Files	Coverage Δ
parsel/selector.py	`100.00% <100.00%> (ø)`
parsel/utils.py	`100.00% <100.00%> (ø)`

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

noviluni · 2021-08-07T20:53:21Z

@Gallaecio @wRAR, could you take a look? :)

noviluni · 2021-08-07T21:10:31Z

parsel/utils.py

    * if the regex contains a named group called "extract" that will be returned
    * if the regex contains multiple numbered groups, all those will be returned (flattened)
    * if the regex doesn't contain any group the entire regex matching is returned
    """
    if isinstance(regex, str):
-        regex = re.compile(regex, re.UNICODE)
+        flags |= re.UNICODE
+        regex = re.compile(regex, flags)


I kept re.UNICODE to preserve the old behavior, especially because some docstrings say:

Apply the given regex and return the first unicode string which matches

However, this also means that it won't be possible to override this, and re.UNICODE will always be applied. I don't think this is a big issue, and it can be changed in the future, for example when deprecating Python 2.7.

I've seen that Python 2.7 has already been deprecated...

The Unicode flag also applies to Python 3, though; it is not about Python 2 strings but about pattern behavior.

On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.

The Unicode flag also applies to Python 3, though; it is not about Python 2 strings but about pattern behavior.

Sorry, I think I wasn't clear enough and mixed things. What I meant was that we could do something like:

regex = re.compile(regex, flags or re.UNICODE)

or do what I did (always apply re.UNICODE). The first approach lets callers override re.UNICODE, but since we have always applied re.UNICODE so far, it could be confusing to stop applying it as soon as other flags are passed. My reference to Python 2.7 meant that we could change this in the future: it would be a breaking change, but it could ship together with the Python 2.7 deprecation, which would require a new major version anyway. I didn't know Python 2.7 had already been deprecated; once I saw that, I thought we might need to reconsider this decision.
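
The trade-off between the two options can be sketched with the standard `re` module. The two helper names below are hypothetical and exist only for this illustration; they are not parsel functions. On Python 3, `re.ASCII` and `re.UNICODE` cannot be combined for `str` patterns, so always OR-ing in `re.UNICODE` rules out `re.ASCII`, while the `flags or re.UNICODE` variant lets it through.

```python
import re


def compile_always_unicode(regex, flags=0):
    """Hypothetical helper: mirrors `flags |= re.UNICODE` from the diff."""
    return re.compile(regex, flags | re.UNICODE)


def compile_unicode_fallback(regex, flags=0):
    """Hypothetical helper: mirrors `flags or re.UNICODE`."""
    return re.compile(regex, flags or re.UNICODE)


# Option 1: re.ASCII can never be requested, because Python 3 rejects
# ASCII and UNICODE together for str patterns.
try:
    compile_always_unicode(r"\d+", re.ASCII)
    ascii_allowed = True
except ValueError:
    ascii_allowed = False

# Option 2: the caller's flags win, so re.ASCII works as expected.
ascii_pattern = compile_unicode_fallback(r"\d+", re.ASCII)

# U+0663 is ARABIC-INDIC DIGIT THREE: matched by \d in Unicode mode only.
arabic_three = "\u0663"
print(ascii_allowed)
print(ascii_pattern.search(arabic_three))
print(compile_always_unicode(r"\d+").search(arabic_three))
```

Note that on Python 3 a `str` pattern is matched in Unicode mode by default unless `re.ASCII` is given, so passing only `re.I` through the second helper still yields Unicode matching.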

On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.

It seems like a good idea, but if it's not required I would prefer to keep this PR as-is (shorter). We can open a new issue if you want :)
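
One possible shape of the idea discussed above (passing strings and flags straight to the `re` functions, and letting flags apply to compiled patterns too) is sketched below. `find_all` is a hypothetical helper, not part of parsel, and rebuilding a compiled pattern from its `.pattern` and `.flags` attributes is just one assumed way to do it. The standard `re` functions themselves raise `ValueError` when given a compiled pattern together with non-zero flags, which is why the sketch recompiles.

```python
import re


def find_all(regex, text, flags=0):
    """Hypothetical helper: accept a str or a compiled pattern, plus flags."""
    if isinstance(regex, str):
        # Module-level functions take flags directly and use re's own cache.
        return re.findall(regex, text, flags)
    if flags:
        # A compiled pattern cannot take new flags, so rebuild it from its
        # source text with the original and the extra flags merged.
        regex = re.compile(regex.pattern, regex.flags | flags)
    return regex.findall(text)


compiled = re.compile(r"price: (\d+)")

print(find_all(r"price: (\d+)", "Price: 10", re.I))
print(find_all(compiled, "Price: 10", re.I))
print(find_all(compiled, "Price: 10"))
```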

tests/test_selector.py

wRAR · 2021-08-09T07:48:55Z

docs/usage.rst

+    >>> selector.xpath('//a[contains(@href, "image")]/text()').re_first(regex)
+    'My image 1 '
+
+As well as adding regex flags with the ``flags`` argument.


I think this needs an example and/or a link to the Python doc about the flags.

I would also be OK with just appending (see :mod:`re`) to the first sentence of this entire section, right after the first mention of “regular expressions”, since users may already wonder about regular expressions at that point.

Gallaecio

the only workaround I found was compiling the expression and using a regex flag there

I think this feature makes sense nonetheless, but know that you can define flags in the pattern itself:

>>> from parsel import Selector
>>> text = """
... <script>
...     function example() {
...         "name": "Adrian",
...         "points": 3,
...     }
... </script>
... """
>>> sel = Selector(text=text)
>>> sel.css('script').re_first(r"(?s)example\(\) ({.*})")
'{\n        "name": "Adrian",\n        "points": 3,\n    }'
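
As a small sketch of the inline-flag alternative using only the standard `re` module: a leading `(?s)` has the same effect as passing `re.DOTALL`, and a scoped group such as `(?i:...)` (Python 3.6+) limits a flag to part of the pattern. The sample strings are assumptions for illustration.

```python
import re

text = "example() {\n  body\n}"

# Inline global flag: (?s) at the start is equivalent to flags=re.DOTALL.
inline = re.search(r"(?s)example\(\) ({.*})", text)
explicit = re.search(r"example\(\) ({.*})", text, flags=re.DOTALL)

# The inline flag shows up in the compiled pattern's flags.
has_dotall = bool(re.compile(r"(?s)x").flags & re.DOTALL)

# Scoped inline flag: only "price" is matched case-insensitively.
scoped = re.search(r"(?i:price): (\d+)", "PRICE: 10")

print(repr(inline.group(1)), has_dotall, scoped.group(1))
```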

Gallaecio · 2021-08-09T08:58:09Z

parsel/selector.py

        ``regex`` can be either a compiled regular expression or a string which
-        will be compiled to a regular expression using ``re.compile(regex)``.
+        will be compiled to a regular expression using ``re.compile()``.


Assuming we don’t actually compile regular expressions, which could be counter-productive performance-wise given Python’s regular expression caching, what about some rewording instead?

``regex`` is a regular expression (see :mod:`re`), either as a string or compiled.

Gallaecio · 2021-08-09T09:03:24Z

parsel/utils.py

    * if the regex contains a named group called "extract" that will be returned
    * if the regex contains multiple numbered groups, all those will be returned (flattened)
    * if the regex doesn't contain any group the entire regex matching is returned
    """
    if isinstance(regex, str):
-        regex = re.compile(regex, re.UNICODE)
+        flags |= re.UNICODE
+        regex = re.compile(regex, flags)


The Unicode flag also applies to Python 3, though; it is not about Python 2 strings but about pattern behavior.

On the other hand, I see that we actually compile strings into patterns here. It’s out of the scope of this change, so feel free to ignore, but I wonder if, instead of compiling the expressions, we could pass them (and flags) to the corresponding functions. That may even allow flags to work when passed along a compiled regular expression.
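
For reference, the three extraction rules quoted in the docstring of the diff above, together with the proposed `flags` handling, can be sketched as follows. `extract_regex_sketch` is a hypothetical, simplified stand-in written for this illustration; it is not the actual function in parsel/utils.py and leaves out anything the quoted diff does not show.

```python
import re


def extract_regex_sketch(regex, text, flags=0):
    """Hypothetical sketch of the three docstring rules (not parsel's code)."""
    if isinstance(regex, str):
        # Same approach as the diff: always add re.UNICODE to the given flags.
        regex = re.compile(regex, flags | re.UNICODE)

    # Rule 1: a named group called "extract" wins, first match only.
    if "extract" in regex.groupindex:
        match = regex.search(text)
        if match is None or match.group("extract") is None:
            return []
        return [match.group("extract")]

    # Rules 2 and 3: findall() yields tuples when there are several groups
    # and plain strings otherwise; flatten everything into one list.
    results = []
    for item in regex.findall(text):
        if isinstance(item, tuple):
            results.extend(item)
        else:
            results.append(item)
    return results


sample = "10 apples, 20 pears"
print(extract_regex_sketch(r"(?P<extract>\d+) (\w+)", sample))
print(extract_regex_sketch(r"(\d+) (\w+)", sample))
print(extract_regex_sketch(r"\d+", sample))
print(extract_regex_sketch(r"APPLES", sample, flags=re.I))
```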

tests/test_selector.py

noviluni added 2 commits August 7, 2021 22:30

add support for regex flags in .re() and .re_first() methods

4ea812f

fix docs typo

df0f41b
