startproject and override command line tool for Page Objects development #57

ivanprado · 2021-12-10T15:18:13Z

WARNING: This was developed on top of #56. Merge this PR only after #56 has been merged.

scrapy startproject is modified so that the new project is ready for scrapy-poet. It also creates folders for the Page Objects and for their tests and fixtures.
scrapy override creates a Page Object and a test case for a given web page, which makes development more convenient.
A template system allows customizing the generated Page Object and test code.
The override command can also be used to update the fixture data with fresh web data. It is also needed when the dependencies of a Page Object have changed: in that case, running the command again fetches the fixtures for the new dependencies.

TODO

Improve the code structure
Improve error messages
Documentation

Remaining work for the future:

There is no way to do garbage collection over the unused fixtures.

How the documentation could be structured

Rewrite the tutorial using the new startproject and override commands. The goal should be to create a generic spider with common crawling logic and then integrate different sites. The spider could for example extract books from categories in book review pages. The structure could be:

1. Explanation of what we are going to show: extracting books from different sites with different layouts while keeping a common crawling logic. Enumerate the different steps.
2. Creating a spider
2.1 Create a new project using startproject
2.2 Write a spider that relies on Page Objects (empty implementation)
2.3 Create the first override using the tool
2.3.1 Explain the handle_urls decorator and link to the web_poet and url-matcher documentation
2.4 Implement the extraction logic in the PO
2.5 Use the unit test to check that the logic is right
2.6 Do the same for the rest of the POs for the site
2.7 Run the spider
2.8 Integrate the second site
2.9 Summary of what happened
3. Rerunning the override command over the same PO and URL. When and why:
3.1 To get fresh data, e.g. because the layout of the site changed and the extraction code needs updating
3.2 When the PO has new dependencies, so the new resources need to be fetched
4. Templates. What they are and how to modify them.
4.1 Default templates vs specific ones
5. Listing the Page Objects using python -m web_poet
6. Existing pages/classes that can be used:
6.1 ItemPage
6.2 ItemWebPage
6.3 RequestData
6.4 Injectable

Keep in mind that the tutorial will be the entry point for many people. It is really important to have a tutorial that is good, simple, and convincing about the value of the tool.


sortafreel

Great job @ivanprado 👍 I've left a couple of comments here and there :)

setup.py

sortafreel · 2021-12-17T00:28:04Z

scrapy_poet/commands/override.py

+            po_path=po_path,
+            test_path=test_path,
+        )
+        self.context = context


I'm a bit confused here. Should we maybe declare self.context and self.po_path up front, with their type annotations, before assigning any values to them inside the methods?
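A minimal sketch of what this suggestion could look like. Context and OverrideCommand below are hypothetical stand-ins, not the classes from this PR; only the attribute names context, po_path and test_path come from the diff above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Optional


@dataclass
class Context:
    """Hypothetical stand-in for the context object built by the command."""

    po_path: Path
    test_path: Path


class OverrideCommand:
    """Hypothetical, trimmed-down command; not the class from this PR."""

    def __init__(self) -> None:
        # Declare every attribute up front, with its type, so readers and
        # type checkers know what exists before any method assigns it.
        self.context: Optional[Context] = None
        self.po_path: Optional[Path] = None

    def prepare(self, po_path: Path, test_path: Path) -> None:
        # Later methods only fill in attributes that __init__ already declared.
        self.po_path = po_path
        self.context = Context(po_path=po_path, test_path=test_path)


command = OverrideCommand()
command.prepare(Path("po/example.py"), Path("tests/test_example.py"))
```

With this layout, reading an attribute before prepare() has run yields None instead of raising AttributeError, and the expected types are visible in one place.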

scrapy_poet/commands/override.py

sortafreel · 2021-12-17T00:35:06Z

scrapy_poet/commands/override.py

+            print("Fixture saved successfully")
+
+            self.po_test_path = generate_test(self.context)
+            print()


Why do we stick with print instead of logging?
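For illustration, a minimal sketch of the logging alternative using only the standard library. save_fixture is a hypothetical helper standing in for the fixture-saving step; it is not a function from this PR, and the message format is an assumption.

```python
import logging

# Module-level logger, named after the module that emits the messages.
logger = logging.getLogger(__name__)


def save_fixture(name: str) -> str:
    """Hypothetical helper standing in for the fixture-saving step."""
    # logger.info replaces print("Fixture saved successfully"); the message
    # can now be filtered, redirected or silenced through logging configuration.
    logger.info("Fixture saved successfully: %s", name)
    return name


# The entry point decides where messages go and which levels are shown.
logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
save_fixture("example")
```

One trade-off to keep in mind: for a command line tool, some of the output is the user-facing result rather than diagnostics, so print may still be a reasonable choice for those lines.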

scrapy_poet/commands/override.py

scrapy_poet/po_tester.py


ivanprado added 7 commits December 9, 2021 11:47

Some progress towards the override command

9aa1c85

Adding example.po package

7f9150c

Progress towards override command

3368bfe

Functional version

3d63b4f

Automatic template creation

2fefb0e

startproject for scrapy-poet

79ee103

Preserving get_domain to be backwards compatible

c0ef97f

ivanprado requested review from kmike, BurnzZ and sortafreel December 10, 2021 15:20

ivanprado added 3 commits December 10, 2021 17:16

Templates in the build

c485fbb

Adding .py extension

1ea8b3a

Fix a bug that required to run the override command twice when new PO…

e47664c

… was created

ivanprado changed the title ~~[WIP] startproject and override command line tool for Page Objects development~~ startproject and override command line tool for Page Objects development Dec 13, 2021

sortafreel reviewed Dec 17, 2021


BurnzZ added 3 commits January 13, 2022 10:41

Merge branch 'url-matcher-integration' of github.com:scrapinghub/scra…

28b16e1

…py-poet into override-command-tool

update retrieval of override rules according to new web-poet updates

8e09e1f

code cleanup and improvements

133af64

BurnzZ mentioned this pull request Jan 13, 2022

handle_urls decorator using a new PageObjectRegistry scrapinghub/web-poet#16

Closed

1 task

BurnzZ added 3 commits January 14, 2022 15:25

update CHANGELOG to reflect new scrapy and commands

cbaa801

improve startproject code

161c5f9

improve consistent formatting

c91df4b

Base automatically changed from url-matcher-integration to master May 19, 2022 05:48
