Releases: scrapy/scrapy

1.7.0

18 Jul 14:28

Highlights:

  • Improvements for crawls targeting multiple domains
  • A cleaner way to pass arguments to callbacks
  • A new class for JSON requests
  • Improvements for rule-based spiders
  • New features for feed exports

See the full changelog
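
Of the highlights above, the new class for JSON requests (scrapy.http.JsonRequest) is the most code-visible: it serializes a payload into the request body and sets JSON headers for you. A minimal stdlib sketch of that idea — an illustration only, not Scrapy's JsonRequest implementation, and build_json_request is a hypothetical name:

```python
import json

# Hypothetical helper (build_json_request is not a Scrapy API) showing
# the idea behind a JSON request class: serialize the payload into the
# request body and set the appropriate Content-Type header.
def build_json_request(url, data):
    return {
        "url": url,
        "method": "POST",
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps(data),
    }

req = build_json_request("https://example.com/api", {"query": "scrapy"})
```

In Scrapy itself you would instead yield a JsonRequest from a spider and let the framework handle scheduling and downloading.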

1.6.0

11 Feb 13:48

Highlights:

  • Better Windows support
  • Python 3.7 compatibility
  • Big documentation improvements, including a switch from the .extract_first() + .extract() API to the .get() + .getall() API
  • Feed exports, FilePipeline and MediaPipeline improvements
  • Better extensibility: item_error and request_reached_downloader signals; from_crawler support for feed exporters, feed storages and dupefilters.
  • scrapy.contracts fixes and new features
  • Telnet console security improvements, first released as a backport in Scrapy 1.5.2 (2019-01-22)
  • Clean-up of the deprecated code
  • Various bug fixes, small new features and usability improvements across the codebase.

Full changelog is in the docs.

1.5.0

30 Dec 15:35

This release brings small new features and improvements across the codebase.
Some highlights:

  • Google Cloud Storage is supported in FilesPipeline and ImagesPipeline.
  • Crawling with proxy servers becomes more efficient, as connections to proxies can now be reused.
  • Warning, exception and logging messages are improved to make debugging easier.
  • The scrapy parse command now allows setting custom request meta via the --meta argument.
  • Compatibility with Python 3.6, PyPy and PyPy3 is improved; PyPy and PyPy3 are now officially supported, with tests run on CI.
  • Better default handling of HTTP 308, 522 and 524 status codes.
  • Documentation is improved, as usual.

Full changelog is in the docs.

1.4.0

29 Dec 15:39

1.3.3

29 Dec 15:40

1.2.2

08 Dec 09:56

Bug fixes

  • Fix a cryptic traceback when a pipeline fails on open_spider() (#2011)
  • Fix embedded IPython shell variables (fixing #396, which re-appeared in 1.2.0; fixed in #2418)
  • A couple of patches when dealing with robots.txt:
    • handle (non-standard) relative sitemap URLs (#2390)
    • handle non-ASCII URLs and User-Agents in Python 2 (#2373)

Documentation

  • Document the "download_latency" key in Request's meta dict (#2033)
  • Remove page on (deprecated & unsupported) Ubuntu packages from ToC (#2335)
  • A few fixed typos (#2346, #2369, #2380) and clarifications (#2354, #2325, #2414)

Other changes

  • Advertise conda-forge as Scrapy’s official conda channel (#2387)
  • More helpful error messages when trying to use .css() or .xpath() on non-Text Responses (#2264)
  • startproject command now generates a sample middlewares.py file (#2335)
  • Add more dependencies’ version info in scrapy version verbose output (#2404)
  • Remove all *.pyc files from source distribution (#2386)

1.2.1

08 Dec 09:53

Bug fixes

  • Include OpenSSL’s more permissive default ciphers when establishing TLS/SSL connections (#2314).
  • Fix “Location” HTTP header decoding on non-ASCII URL redirects (#2321).
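
The "Location" fix addresses a common class of bug: HTTP header bytes are nominally Latin-1, so a Location header that actually carries UTF-8 bytes turns into mojibake when decoded naively. A minimal stdlib illustration of the general problem, not Scrapy's actual fix:

```python
# A redirect target containing non-ASCII, as raw UTF-8 bytes on the wire.
raw_location = "/caf\u00e9/menu".encode("utf-8")

# HTTP headers are nominally Latin-1, so a naive decode produces mojibake.
naive = raw_location.decode("latin-1")

# Recover the intended text: undo the Latin-1 decode, then decode as UTF-8.
recovered = naive.encode("latin-1").decode("utf-8")
```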

Documentation

  • Fix JsonWriterPipeline example (#2302).
  • Various notes: #2330 on spider names, #2329 on middleware methods processing order, #2327 on getting multi-valued HTTP headers as lists.

Other changes

  • Removed www. from start_urls in built-in spider templates (#2299).

1.2.0

03 Oct 13:25

New Features

  • New FEED_EXPORT_ENCODING setting to customize the encoding
    used when writing items to a file.
    This can be used to turn off \uXXXX escapes in JSON output.
    It is also useful for those wanting something other than UTF-8
    for XML or CSV output (#2034).
  • startproject command now supports an optional destination directory
    to override the default one based on the project name (#2005).
  • New SCHEDULER_DEBUG setting to log requests serialization
    failures (#1610).
  • JSON encoder now supports serialization of set instances (#2058).
  • Interpret application/json-amazonui-streaming as TextResponse (#1503).
  • scrapy is imported by default when using shell tools (shell,
    inspect_response) (#2248).
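
The \uXXXX escapes and set serialization mentioned above are easy to see with the standard json module. A hedged sketch in plain stdlib — SetFriendlyEncoder is an illustrative stand-in for the technique, not Scrapy's own encoder:

```python
import json

# \uXXXX escaping in JSON output comes from ensure_ascii, on by default.
item = {"title": "Caf\u00e9 con leche"}
escaped = json.dumps(item)                       # non-ASCII becomes \u00e9
readable = json.dumps(item, ensure_ascii=False)  # keeps the raw characters

# Serializing sets requires a custom encoder hook via `default`
# (illustration of the general technique, not Scrapy's encoder).
class SetFriendlyEncoder(json.JSONEncoder):
    def default(self, o):
        if isinstance(o, set):
            return sorted(o)  # lists are JSON-serializable; sort for stable output
        return super().default(o)

tags = json.dumps({"tags": {"b", "a"}}, cls=SetFriendlyEncoder)
```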

Bug fixes

  • DefaultRequestHeaders middleware now runs before UserAgent middleware
    (#2088). Warning: this is technically backwards incompatible,
    though we consider it a bug fix.
  • HTTP cache extension and plugins that use the .scrapy data directory now
    work outside projects (#1581). Warning: this is technically
    backwards incompatible, though we consider it a bug fix.
  • Selector no longer allows passing both response and text (#2153).
  • Fixed logging of wrong callback name with scrapy parse (#2169).
  • Fix for an odd gzip decompression bug (#1606).
  • Fix for selected callbacks when using CrawlSpider with scrapy parse
    (#2225).
  • Fix for invalid JSON and XML files when spider yields no items (#872).
  • Implement flush() for StreamLogger, avoiding a warning in logs (#2125).

Tests & Requirements

Scrapy's new requirements baseline is Debian 8 "Jessie"; it was previously Ubuntu 12.04 Precise.
In practice this means we run continuous integration tests with at least these (main) package versions: Twisted 14.0, pyOpenSSL 0.14, lxml 3.4.

Scrapy may well work with older versions of these packages (the code base still has switches for older Twisted versions, for example), but this is not guaranteed, because it is no longer tested.

Documentation

  • Grammar fixes: #2128, #1566.
  • Download stats badge removed from README (#2160).
  • New scrapy architecture diagram (#2165).
  • Updated Response parameters documentation (#2197).
  • Reworded misleading RANDOMIZE_DOWNLOAD_DELAY description (#2190).
  • Add StackOverflow as a support channel (#2257).