Filter SelectorList with custom predicate? #142

Open
msukmanowsky opened this issue Jul 26, 2019 · 1 comment

Comments

@msukmanowsky

Somewhat related to #129: I'm curious whether there's a way to filter a selector list with a custom predicate, instead of the destructive removal that #129 is going for.

Something like this:

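# Assumed import: the original snippet omitted it (parsel exports both names)
from parsel import Selector, SelectorList


def is_good_node(node):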
    return node.xpath("./@class").get() != "taboola"


selector = Selector(text="""
    <html>
        <body>
            <div class="content">
                Content
            </div>
            <div class="taboola">
                Taboola
            </div>
        </body>
    </html>
""")
# Select every element except <a>, then keep only the nodes that pass the predicate
nodes = selector.xpath("//*[not(self::a)]")
nodes = [node for node in nodes if is_good_node(node)]
nodes = SelectorList(nodes)
print(''.join(nodes.extract()))
# this prints
'''
<html>
        <body>
            <div class="content">
                Content
            </div>
            <div class="taboola">
                Taboola
            </div>
        </body>
    </html><body>
            <div class="content">
                Content
            </div>
            <div class="taboola">
                Taboola
            </div>
        </body><div class="content">
                Content
            </div>
'''

Assume is_good_node is more complex than the example above; I realize that this particular predicate could be expressed as an XPath expression.

I've tried this approach and it doesn't work: the filtered-out nodes still end up in the output. I had a look at the source and didn't see an obvious way to do it with my limited lxml-fu.
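
A likely explanation for the output above: the <html> and <body> selectors pass the predicate, and each one serializes its whole subtree, so the "taboola" div is printed as part of its ancestors even though its own selector was dropped from the list. One non-destructive way around that is to prune a deep copy of the tree instead of filtering the list. The sketch below is only an illustration of that idea. It uses the standard library's xml.etree.ElementTree rather than parsel or lxml, and pruned_copy is a hypothetical helper, not an existing API. The same approach might carry over to the lxml element behind selector.root, but that is an assumption and untested here.

```python
import copy
import xml.etree.ElementTree as ET

DOC = (
    '<html><body>'
    '<div class="content">Content</div>'
    '<div class="taboola">Taboola</div>'
    '</body></html>'
)


def is_good_node(element):
    # Stand-in predicate; imagine arbitrarily complex Python logic here.
    return element.get("class") != "taboola"


def pruned_copy(root, predicate):
    """Return a deep copy of root with every element failing predicate removed.

    Hypothetical helper: the original tree is never modified.
    """
    clone = copy.deepcopy(root)
    # list(...) snapshots the traversal so removals do not disturb iteration.
    for parent in list(clone.iter()):
        for child in list(parent):
            if not predicate(child):
                parent.remove(child)
    return clone


root = ET.fromstring(DOC)
cleaned = pruned_copy(root, is_good_node)
print(ET.tostring(cleaned, encoding="unicode"))
# <html><body><div class="content">Content</div></body></html>
```

The original root still contains the "taboola" div afterwards, which is the non-destructive behavior asked for. Note that the root element itself is never tested against the predicate in this sketch.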

@Gallaecio
Member

So you are suggesting something like SelectorList.filter(my_filter_function)?

I guess it’s a valid feature request, although I’m not sure if I like it.

I assume the implementation would be as simple as:

def filter(self, filter):
    for selector in self:
        if filter(selector):
            yield selector
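
For comparison, here is a variant of the sketch above that returns a new list of the same class instead of a generator, so the result could still be chained like a selector list. It is only a sketch under assumptions: FilterableList is a hypothetical stand-in for SelectorList (which is itself a list subclass), plain strings stand in for selectors, and the parameter is named predicate to avoid shadowing the built-in filter.

```python
class FilterableList(list):
    """Hypothetical stand-in for SelectorList, which is also a list subclass."""

    def filter(self, predicate):
        # Build a new list of the same class; the original list is left untouched.
        return self.__class__(item for item in self if predicate(item))


nodes = FilterableList(["content", "taboola", "sidebar"])
good = nodes.filter(lambda name: name != "taboola")
print(good)                 # ['content', 'sidebar']
print(type(good).__name__)  # FilterableList
```

As with the list comprehension in the question, this only controls which items are in the list. It would not stop an ancestor that remains in the list from serializing a descendant that was filtered out.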
