scrapy-loader-upkeep migration #12

BurnzZ · 2020-05-23T07:42:44Z

Covers these items in #11 :

code migration
test migration

NOTES:

The tests for py3.5 are failing because of f-strings. Are we going to deprecate py3.5 soon?

codecov · 2020-05-23T07:43:42Z

Codecov Report

Merging #12 into master will increase coverage by 0.14%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master      #12      +/-   ##
==========================================
+ Coverage   98.00%   98.14%   +0.14%     
==========================================
  Files           4        4              
  Lines         251      270      +19     
==========================================
+ Hits          246      265      +19     
  Misses          5        5

Impacted Files	Coverage Δ
itemloaders/__init__.py	`97.67% <100.00%> (+0.28%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 37a63cd...dd96b7e. Read the comment docs.

kmike · 2020-05-25T13:35:03Z

itemloaders/__init__.py

+        if parsed_data in (None, []):
+            parser_label += "/missing"
+
+        self.stats.inc_value(parser_label)


This makes itemloaders depend on scrapy.stats, at least as far as interface is concerned. Do you think we can avoid it? I can see 2 main ways:

ask for a different stats object here, maybe a plain dictionary; allow to pass real stats in Scrapy's itemloader

move stats handling to Scrapy's wrapper instead of this package

We can also start decoupling scrapy Stats from scrapy, but I'd rather not do this now.

Hi @kmike , the itemloaders accept Scrapy's stats instance via DI and could work without it, as per this.

Accepting the actual Scrapy stats instance is crucial for the loader's usage to be logged at the end of a job (and be logged in Scrapy Cloud's stats-tab), like this:

2019-06-16 14:32:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats: { ... 'parser/QuotesItemLoader/author/css/1': 10, 'parser/QuotesItemLoader/quote/css/1/missing': 10, 'parser/QuotesItemLoader/quote/css/2': 10 ... }

I'm all good for making it like a defaultdict(int) to function like the stat's inc_value, as long as we can pass it out to Scrapy's logging system.

However, I might be missing something? 🤔Is there another way to circumvent injecting Scrapy's stats to the itemloaders and still log the parser's usage?

I guess @kmike 's point here is the implicit dependency.
Even though, scrapy is not a requirement, we require a direct dependency from scrapy..
So, maybe a solution would be to create some sort of adapter (another one).
Though, I somehow foresee a problem.

If we allow ItemLoader to be extended with upkeep, we'll need to make upkeep the default base class for scrapy, as scrapy users we'll keep using scrapy's ItemLoader which is an extension to this one.
So, if they import this extension instead of scrapy's, they'll be losing scrapy's behavior.

kmike · 2020-05-25T20:09:01Z

The tests for py3.5 are failing because of f-strings. Are we going to deprecate py3.5 soon?

Currently Scrapy supports Python 3.5.1+ because it is Python availlable in Ubuntu 16.04 LTS. itemloaders should follow that, because Scrapy requires itemloders. We dropped Python 3.4 support just 7 months ago.

That said, there can be an argument to switch to Python 3.6. The share of 3.5 users, as reported by https://pypistats.org/packages/scrapy, is around 2-3%, which is not that large. This is better to be discussed in Scrapy issue tracker.

BurnzZ added 5 commits May 23, 2020 13:42

initial migration of Burnzz/scrapy-loader-upkeep code

05ddbc0

add __init__.py to tests/

1df3891

fix failing tests by combining _get_xpathvalues() and _get_cssvalues()

42296f2

streamline get_selector_values() and add tests to it

57f5308

add tests to stat logging feature

d4d3671

BurnzZ requested review from ejulio, Gallaecio and kmike May 23, 2020 07:44

BurnzZ self-assigned this May 23, 2020

BurnzZ added the enhancement New feature or request label May 23, 2020

remove docstring in write_to_stats() causing sphinx build to fail

1713944

BurnzZ force-pushed the scrapy-loader-upkeep branch from fe2f1b8 to 1713944 Compare May 23, 2020 08:09

kmike reviewed May 25, 2020

View reviewed changes

rename field_tracker into field_position_tracker for more clarity

dd96b7e

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

scrapy-loader-upkeep migration #12

scrapy-loader-upkeep migration #12

BurnzZ commented May 23, 2020 •

edited

codecov bot commented May 23, 2020 •

edited

kmike May 25, 2020

BurnzZ May 26, 2020

ejulio May 27, 2020

kmike commented May 25, 2020

scrapy-loader-upkeep migration #12

Are you sure you want to change the base?

scrapy-loader-upkeep migration #12

Conversation

BurnzZ commented May 23, 2020 • edited

codecov bot commented May 23, 2020 • edited

Codecov Report

kmike May 25, 2020

Choose a reason for hiding this comment

BurnzZ May 26, 2020

Choose a reason for hiding this comment

ejulio May 27, 2020

Choose a reason for hiding this comment

kmike commented May 25, 2020

BurnzZ commented May 23, 2020 •

edited

codecov bot commented May 23, 2020 •

edited