Couldn't create custom Provider #86

Open
suspectinside opened this issue Sep 4, 2022 · 6 comments

Comments

@suspectinside

suspectinside commented Sep 4, 2022

Hi, here's a simple sample setup:

# ================= Providers pom/page_input_providers/providers.py
import logging
from collections.abc import Callable, Sequence
from scrapy_poet.page_input_providers import PageObjectInputProvider
from scrapy.settings import Settings

logger = logging.getLogger()
logger.setLevel(logging.INFO)

class Arq:
    async def enqueue_task(self, task: dict):
        logger.info('Arq.enqueue_task() enqueueing new task: %r', task)

class ArqProvider(PageObjectInputProvider):
    provided_classes = {Arq}
    name = 'ARQ_PROVIDER'

    async def __call__(self, to_provide: set[Callable]) -> Sequence[Callable]:
        return [Arq()]
# ================= Page Object Models
import attr
from web_poet.pages import Injectable, WebPage, ItemWebPage
from pom.page_input_providers.providers import Arq

@attr.define
class IndexPage(WebPage):
    arq: Arq

    @property
    async def page_titles(self):
        await self.arq.enqueue_task({'bla': 'bla!'})

        return [
            (el.attrib['href'], el.css('::text').get())
            for el in self.css('.selected a.reference.external')
        ]

The injectable entity here is arq: Arq, so I'd like to work with an Arq instance inside the page object.

# ================= the Spider
import uvloop, asyncio, pprint, logging
import scrapy
from scrapy.utils.reactor import install_reactor
from scrapy.http import HtmlResponse
from pom.util import stop_logging, wait
from pom.poms.pages import IndexPage
from pom.page_input_providers.providers import ArqProvider

import web_poet as wp

from scrapy_poet.page_input_providers import HttpClientProvider, PageParamsProvider

stop_logging()
uvloop.install()
install_reactor('twisted.internet.asyncioreactor.AsyncioSelectorReactor', 'uvloop.Loop')

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)

# ================= Actual Spider Code:

class TitlesLocalSpider(scrapy.Spider):
    name = 'titles.local'
    start_urls = ['http://localhost:8080/orm/join_conditions.html']
    
    custom_settings = {
        'SCRAPY_POET_PROVIDERS': {
            ArqProvider: 500,    # MY PROVIDER FOR INJECTABLE arq: Arq
        },
    }

    @classmethod
    def from_crawler(cls, crawler, *args, **kwargs):
        stop_logging()
        logger.info('=' * 30)
        return super().from_crawler(crawler, *args, **kwargs)

    async def parse(self, response, index_page: IndexPage, **kwargs):
        self.logger.info(await index_page.page_titles)

and I get an error like this:

Unhandled error in Deferred:

Traceback (most recent call last):
  File "~/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 205, in crawl
    return self._crawl(crawler, *args, **kwargs)
  File "~/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 209, in _crawl
    d = crawler.crawl(*args, **kwargs)
  File "~/.venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1946, in unwindGenerator
    return _cancellableInlineCallbacks(gen)
  File "~/.venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1856, in _cancellableInlineCallbacks
    _inlineCallbacks(None, gen, status, _copy_context())
--- <exception caught here> ---
  File "~/.venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1696, in _inlineCallbacks
    result = context.run(gen.send, result)
  File "~/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 101, in crawl
    self.engine = self._create_engine()
  File "~/.venv/lib/python3.10/site-packages/scrapy/crawler.py", line 115, in _create_engine
    return ExecutionEngine(self, lambda _: self.stop())
  File "~/.venv/lib/python3.10/site-packages/scrapy/core/engine.py", line 83, in __init__
    self.downloader = downloader_cls(crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy/core/downloader/__init__.py", line 83, in __init__
    self.middleware = DownloaderMiddlewareManager.from_crawler(crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy/middleware.py", line 59, in from_crawler
    return cls.from_settings(crawler.settings, crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy/middleware.py", line 41, in from_settings
    mw = create_instance(mwcls, settings, crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy/utils/misc.py", line 166, in create_instance
    instance = objcls.from_crawler(crawler, *args, **kwargs)
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/downloadermiddlewares.py", line 62, in from_crawler
    o = cls(crawler)
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/downloadermiddlewares.py", line 52, in __init__
    self.injector = Injector(
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 50, in __init__
    self.load_providers(default_providers)
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 63, in load_providers
    self.is_provider_requiring_scrapy_response = {
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 64, in <dictcomp>
    provider: is_provider_requiring_scrapy_response(provider)
  File "~/.venv/lib/python3.10/site-packages/scrapy_poet/injection.py", line 348, in is_provider_requiring_scrapy_response
    plan = andi.plan(
  File "~/.venv/lib/python3.10/site-packages/andi/andi.py", line 303, in plan
    plan, _ = _plan(class_or_func,
  File "~/.venv/lib/python3.10/site-packages/andi/andi.py", line 341, in _plan
    sel_cls, arg_overrides = _select_type(
  File "~/.venv/lib/python3.10/site-packages/andi/andi.py", line 395, in _select_type
    if is_injectable(candidate) or externally_provided(candidate):
  File "~/.venv/lib/python3.10/site-packages/web_poet/pages.py", line 34, in is_injectable
    return isinstance(cls, type) and issubclass(cls, Injectable)
  File "/usr/lib/python3.10/abc.py", line 123, in __subclasscheck__
    return _abc_subclasscheck(cls, subclass)
builtins.TypeError: issubclass() arg 1 must be a class

So, could you please explain why this error happens and how to fix it?

@BurnzZ
Member

BurnzZ commented Sep 5, 2022

Hi @suspectinside, I'm not able to reproduce this locally; the following minimal code derived from your example runs okay on my end.

I suspect that something else outside of your code example causes this issue. Unfortunately, the logs you've posted don't exactly pinpoint the problem.

Could you try copying the code below into 3 different modules in your project to see if it works?

# providers.py

import logging
from typing import Set
from collections.abc import Callable

from scrapy_poet.page_input_providers import PageObjectInputProvider

logger = logging.getLogger()


class Arq:
    async def enqueue_task(self, task: dict):
        logger.info('Arq.enqueue_task() enqueueing new task: %r', task)


class ArqProvider(PageObjectInputProvider):
    provided_classes = {Arq}
    name = 'ARQ_PROVIDER'

    async def __call__(self, to_provide: Set[Callable]):
        return [Arq()]
# pageobjects.py

import attr

from web_poet.pages import Injectable, WebPage, ItemWebPage
from .providers import Arq

@attr.define
class IndexPage(WebPage):
    arq: Arq

    @property
    async def page_titles(self):
        await self.arq.enqueue_task({'bla': 'bla!'})

        return [
            (el.attrib['href'], el.css('::text').get())
            for el in self.css('.selected a.reference.external')
        ]
# spiders/title_spider.py

import scrapy
from ..pageobjects import IndexPage
from ..providers import ArqProvider


class TitlesLocalSpider(scrapy.Spider):
    name = 'titles.local'
    start_urls = ["https://books.toscrape.com"]

    custom_settings = {
        "SCRAPY_POET_PROVIDERS": {
            ArqProvider: 600,  # MY PROVIDER FOR INJECTABLE arq: Arq
        },
        "DOWNLOADER_MIDDLEWARES": {
            "scrapy_poet.InjectionMiddleware": 543,
        },
    }

    async def parse(self, response, index_page: IndexPage):
        self.logger.info(await index_page.page_titles)
# ... omitted log lines
2022-09-05 11:57:31 [scrapy.core.engine] INFO: Spider opened
2022-09-05 11:57:31 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-09-05 11:57:31 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-09-05 11:57:34 [scrapy.core.engine] DEBUG: Crawled (404) <GET https://books.toscrape.com/robots.txt> (referer: None)
2022-09-05 11:57:35 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://books.toscrape.com> (referer: None)
2022-09-05 11:57:35 [root] INFO: Arq.enqueue_task() enqueueing new task: {'bla': 'bla!'}
2022-09-05 11:57:35 [titles.local] INFO: []
2022-09-05 11:57:35 [scrapy.core.engine] INFO: Closing spider (finished)
# ... omitted log lines

@Gallaecio
Member

Could **kwargs in parse be the cause?

@BurnzZ
Member

BurnzZ commented Sep 5, 2022

I've tried adding **kwargs, but that wasn't enough to reproduce the same issue.

@suspectinside
Author

suspectinside commented Sep 5, 2022

Yep! Thanks a lot, I found the source of the problem: it happens when I use the new builtin set with generics support (set[...], PEP 585) instead of typing.Set, which has been deprecated since 3.9.

So, if I change __call__'s declaration from this:

async def __call__(self, to_provide: set[Callable], settings: Settings) -> Sequence[Callable]:

into something like this:

from typing import Set
# ...
async def __call__(self, to_provide: Set[Callable], settings: Settings) -> Sequence[Callable]:

everything works correctly.

By the way, collections.abc.Set doesn't work either. On the other hand, the Python team has deprecated typing.Set, typing.Dict, typing.List and friends in favour of the builtins and collections.abc.* equivalents, so maybe it would be correct to add support for them to the IoC engine too?
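To illustrate the difference, here is a minimal standalone check (illustrative only, not scrapy-poet/andi internals; it only needs web_poet installed):

# Why the builtin generic trips the is_injectable() check from the traceback,
# while typing.Set does not, and how typing.get_origin() sees both spellings.
import typing
from web_poet.pages import Injectable

new_style = set[typing.Callable]         # builtin generic (PEP 585)
old_style = typing.Set[typing.Callable]  # typing generic (deprecated spelling)

print(isinstance(old_style, type))   # False -> is_injectable() short-circuits safely
print(isinstance(new_style, type))   # True on Python 3.10, so issubclass() is reached...
# issubclass(new_style, Injectable)  # ...and raises TypeError: issubclass() arg 1 must be a class

# typing.get_origin() resolves both spellings to the same runtime class,
# which is one way an IoC engine could accept either style:
print(typing.get_origin(old_style))  # <class 'set'>
print(typing.get_origin(new_style))  # <class 'set'>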

In any case, scrapy-poet (web-poet) is one of the best approaches I've ever seen, and the combination of IoC and the Page Object Model pattern for scraping really shines! Thanks a lot for it ;)

@suspectinside
Author

...and just one more quick question: what's the best (most correct) way to provide a singleton object instance using the scrapy-poet IoC infrastructure?
Let's say the abovementioned Arq should be a singleton service provider; what is the best way to return it from the __call__ method in this case (can I configure the IoC container somewhere, or something like that)?

@BurnzZ
Member

BurnzZ commented Sep 6, 2022

I see, great catch! I believe we can use the typing module as a short-term workaround since PEP 585 mentions:

The deprecated functionality will be removed from the typing module in the first Python version released 5 years after the release of Python 3.9.0.

I'm not quite sure how large of an undertaking it would be to completely move to the builtins, since web-poet and scrapy-poet still support 3.7 and 3.8. I'm guessing that if we drop support for them when they lose Python support, the switch would be much easier.

> In any case, scrapy-poet (web-poet) is one of the best approaches I've ever seen, and the combination of IoC and the Page Object Model pattern for scraping really shines! Thanks a lot for it ;)

💖 That'd be @kmike's work for you :)

> What's the best (most correct) way to provide a singleton object instance using the scrapy-poet IoC infrastructure?

Lots of approaches on this one, but I think the most convenient is to assign it as a class variable in the provider itself. Technically, it's not a true singleton in that case, since Arq could still be instantiated outside of the provider. However, that should still be okay, since the provider would ensure that the Arq it's providing is the same instance for every __call__() method call.
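For illustration, a minimal sketch of that class-variable idea, reusing the Arq and ArqProvider names from the examples above (just one possible shape, not an official API):

# Cache a single Arq instance on the provider class so every __call__()
# hands out the same object.
from typing import Set
from collections.abc import Callable

from scrapy_poet.page_input_providers import PageObjectInputProvider


class Arq:
    async def enqueue_task(self, task: dict):
        ...


class ArqProvider(PageObjectInputProvider):
    provided_classes = {Arq}
    name = 'ARQ_PROVIDER'

    _arq = None  # shared across all provider calls

    async def __call__(self, to_provide: Set[Callable]):
        if ArqProvider._arq is None:
            ArqProvider._arq = Arq()
        return [ArqProvider._arq]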
