Added constraint on lxml version to avoid error when using Python 3.4 (#3912) #3913

Closed
wants to merge 25 commits into from

Commits (25)
82d239f
docs for scrapy.logformatter
anubhavp28 Mar 6, 2019
924b674
move api docs to source code
anubhavp28 Mar 7, 2019
82049e9
make suggested changes.
anubhavp28 Mar 10, 2019
e9cd4ee
fix list alignment and line width
anubhavp28 Mar 10, 2019
66a502d
Merge branch 'master' into logFormatter-doc-patch
anubhavp28 Mar 14, 2019
69b1d5d
Log cipher, certificate and temp key info on establishing an SSL conn…
wRAR Oct 5, 2018
67a4000
Work around older pyOpenSSL not having get_cipher_name or get_protoco…
wRAR Jul 8, 2019
0b9dce3
Add DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING setting.
wRAR Jul 8, 2019
0de6ffc
Fix super() call.
wRAR Jul 11, 2019
98689b2
Improve the DOWNLOADER_CLIENTCONTEXTFACTORY doc.
wRAR Jul 11, 2019
a96a07b
Add a test for DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING.
wRAR Jul 12, 2019
42743fd
Move tls_verbose_logging extraction from __init__ to from_settings.
wRAR Jul 18, 2019
95dd2df
Drop an unused import.
wRAR Jul 18, 2019
c645380
Remove an unneeded if.
wRAR Jul 18, 2019
b8a4301
Cover Scrapy 1.7.1 in the release notes
Gallaecio Jul 18, 2019
43d5b5a
fix default RETRY_HTTP_CODES value in docs
KristobalJunta Jul 22, 2019
7e622af
Fix ConfigParser import in py2
elacuesta Jul 22, 2019
bc8672c
Merge pull request #3896 from elacuesta/fix_configparser_import
Gallaecio Jul 23, 2019
7843101
Cover Scrapy 1.7.2 in the release notes
Gallaecio Jul 23, 2019
c679aef
Merge pull request #3660 from anubhavp28/logFormatter-doc-patch
kmike Jul 23, 2019
9c514b9
Merge pull request #3450 from wRAR/tls-logging
kmike Jul 23, 2019
7551689
s3 file store should accept all supported headers
lucywang000 Jul 26, 2019
04bca6a
Merge pull request #3894 from KristobalJunta/fix_retry_docs
Gallaecio Jul 29, 2019
06c093f
Merge pull request #3905 from lucywang000/0.001
Gallaecio Jul 29, 2019
dffd163
Added constrain on lxml version based on Python version
rennerocha Jul 29, 2019
11 changes: 11 additions & 0 deletions docs/news.rst
@@ -6,6 +6,17 @@ Release notes
.. note:: Scrapy 1.x will be the last series supporting Python 2. Scrapy 2.0,
planned for Q4 2019 or Q1 2020, will support **Python 3 only**.

Scrapy 1.7.2 (2019-07-23)
-------------------------

Fix Python 2 support (:issue:`3889`, :issue:`3893`, :issue:`3896`).


Scrapy 1.7.1 (2019-07-18)
-------------------------

Re-packaging of Scrapy 1.7.0, which was missing some changes in PyPI.

.. _release-1.7.0:

Scrapy 1.7.0 (2019-07-18)
2 changes: 1 addition & 1 deletion docs/topics/downloader-middleware.rst
Expand Up @@ -963,7 +963,7 @@ precedence over the :setting:`RETRY_TIMES` setting.
RETRY_HTTP_CODES
^^^^^^^^^^^^^^^^

Default: ``[500, 502, 503, 504, 522, 524, 408]``
Default: ``[500, 502, 503, 504, 522, 524, 408, 429]``

Which HTTP response codes to retry. Other errors (DNS lookup issues,
connections lost, etc) are always retried.
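
A minimal sketch of overriding this setting in a project's ``settings.py`` (values illustrative; 429 is "Too Many Requests")::

    # settings.py (sketch)
    RETRY_ENABLED = True
    RETRY_TIMES = 3  # retries per request, on top of the first attempt
    RETRY_HTTP_CODES = [500, 502, 503, 504, 522, 524, 408, 429]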
11 changes: 11 additions & 0 deletions docs/topics/logging.rst
Expand Up @@ -193,6 +193,17 @@ to override some of the Scrapy settings regarding logging.
Module `logging.handlers <https://docs.python.org/2/library/logging.handlers.html>`_
Further documentation on available handlers

.. _custom-log-formats:

Custom Log Formats
------------------

A custom log format can be set for different actions by extending the
:class:`~scrapy.logformatter.LogFormatter` class and making
:setting:`LOG_FORMATTER` point to your new class.

.. autoclass:: scrapy.logformatter.LogFormatter
:members:
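
A minimal sketch of wiring a custom formatter (module and class names
hypothetical; the formatter mirrors the example in the class docs)::

    # myproject/logformatter.py (sketch)
    import logging
    import os

    from scrapy import logformatter

    class PoliteLogFormatter(logformatter.LogFormatter):
        """Demote dropped-item messages from WARNING to INFO."""
        def dropped(self, item, exception, response, spider):
            return {
                'level': logging.INFO,
                'msg': u"Dropped: %(exception)s" + os.linesep + "%(item)s",
                'args': {'exception': exception, 'item': item},
            }

Then, in ``settings.py``::

    LOG_FORMATTER = 'myproject.logformatter.PoliteLogFormatter'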

Advanced customization
----------------------

30 changes: 27 additions & 3 deletions docs/topics/settings.rst
@@ -440,9 +440,10 @@ or even enable client-side authentication (and various other things).
which uses the platform's certificates to validate remote endpoints.
**This is only available if you use Twisted>=14.0.**

If you do use a custom ContextFactory, make sure it accepts a ``method``
parameter at init (this is the ``OpenSSL.SSL`` method mapping
:setting:`DOWNLOADER_CLIENT_TLS_METHOD`).
If you do use a custom ContextFactory, make sure its ``__init__`` method
accepts a ``method`` parameter (this is the ``OpenSSL.SSL`` method mapping
:setting:`DOWNLOADER_CLIENT_TLS_METHOD`) and a ``tls_verbose_logging``
parameter (``bool``).
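
A minimal sketch of a compliant custom factory, assuming it subclasses the
default one (class name hypothetical)::

    from OpenSSL import SSL

    from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

    class MyContextFactory(ScrapyClientContextFactory):
        # Accept both parameters so Scrapy can pass them when it
        # instantiates the factory from the settings.
        def __init__(self, method=SSL.SSLv23_METHOD,
                     tls_verbose_logging=False, *args, **kwargs):
            super(MyContextFactory, self).__init__(
                method, tls_verbose_logging, *args, **kwargs)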

.. setting:: DOWNLOADER_CLIENT_TLS_METHOD

@@ -470,6 +471,20 @@ This setting must be one of these string values:
We recommend that you use PyOpenSSL>=0.13 and Twisted>=0.13
or above (Twisted>=14.0 if you can).

.. setting:: DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING

DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING
-------------------------------------

Default: ``False``

Setting this to ``True`` will enable DEBUG level messages about TLS connection
parameters after establishing HTTPS connections. The kind of information logged
depends on the versions of OpenSSL and pyOpenSSL.

This setting is only used for the default
:setting:`DOWNLOADER_CLIENTCONTEXTFACTORY`.
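
A minimal sketch of enabling it (the details logged come from the
``tls.py`` changes in this diff: protocol and cipher, certificate issuer
and subject, and temporary key information, depending on the pyOpenSSL
version)::

    # settings.py (sketch)
    DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING = True
    LOG_LEVEL = 'DEBUG'  # the connection parameters are logged at DEBUG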

.. setting:: DOWNLOADER_MIDDLEWARES

DOWNLOADER_MIDDLEWARES
@@ -870,6 +885,15 @@ directives.

.. _Python datetime documentation: https://docs.python.org/2/library/datetime.html#strftime-and-strptime-behavior

.. setting:: LOG_FORMATTER

LOG_FORMATTER
-------------

Default: :class:`scrapy.logformatter.LogFormatter`

The class to use for :ref:`formatting log messages <custom-log-formats>` for different actions.

.. setting:: LOG_LEVEL

LOG_LEVEL
3 changes: 2 additions & 1 deletion requirements-py3.txt
@@ -1,5 +1,6 @@
Twisted>=17.9.0
lxml>=3.2.4
lxml;python_version!="3.4"
lxml<=4.3.5;python_version=="3.4"
pyOpenSSL>=0.13.1
cssselect>=0.9
queuelib>=1.1.1
11 changes: 9 additions & 2 deletions scrapy/core/downloader/contextfactory.py
@@ -28,9 +28,15 @@ class ScrapyClientContextFactory(BrowserLikePolicyForHTTPS):
understand the SSLv3, TLSv1, TLSv1.1 and TLSv1.2 protocols.'
"""

def __init__(self, method=SSL.SSLv23_METHOD, *args, **kwargs):
def __init__(self, method=SSL.SSLv23_METHOD, tls_verbose_logging=False, *args, **kwargs):
super(ScrapyClientContextFactory, self).__init__(*args, **kwargs)
self._ssl_method = method
self.tls_verbose_logging = tls_verbose_logging

@classmethod
def from_settings(cls, settings, method=SSL.SSLv23_METHOD, *args, **kwargs):
tls_verbose_logging = settings.getbool('DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING')
return cls(method=method, tls_verbose_logging=tls_verbose_logging, *args, **kwargs)

def getCertificateOptions(self):
# setting verify=True will require you to provide CAs
@@ -56,7 +62,8 @@ def getContext(self, hostname=None, port=None):
return self.getCertificateOptions().getContext()

def creatorForNetloc(self, hostname, port):
return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext())
return ScrapyClientTLSOptions(hostname.decode("ascii"), self.getContext(),
verbose_logging=self.tls_verbose_logging)


@implementer(IPolicyForHTTPS)
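For context, ``create_instance`` (used by the handler changes below) prefers a
``from_settings`` classmethod when the target class defines one; a rough
sketch of the resulting call, under the assumption that no crawler is passed::

    from scrapy.settings import Settings
    from scrapy.utils.misc import create_instance
    from scrapy.core.downloader.contextfactory import ScrapyClientContextFactory

    settings = Settings({'DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING': True})
    # With crawler=None, create_instance falls back to from_settings(),
    # which reads the verbose-logging flag added in this file.
    factory = create_instance(ScrapyClientContextFactory,
                              settings=settings, crawler=None)
    assert factory.tls_verbose_logging is True
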
7 changes: 4 additions & 3 deletions scrapy/core/downloader/handlers/http10.py
@@ -1,7 +1,7 @@
"""Download handlers for http and https schemes
"""
from twisted.internet import reactor
from scrapy.utils.misc import load_object
from scrapy.utils.misc import load_object, create_instance
from scrapy.utils.python import to_unicode


@@ -11,6 +11,7 @@ class HTTP10DownloadHandler(object):
def __init__(self, settings):
self.HTTPClientFactory = load_object(settings['DOWNLOADER_HTTPCLIENTFACTORY'])
self.ClientContextFactory = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])
self._settings = settings

def download_request(self, request, spider):
"""Return a deferred for the HTTP download"""
@@ -21,7 +22,7 @@ def download_request(self, request, spider):
def _connect(self, factory):
host, port = to_unicode(factory.host), factory.port
if factory.scheme == b'https':
return reactor.connectSSL(host, port, factory,
self.ClientContextFactory())
client_context_factory = create_instance(self.ClientContextFactory, settings=self._settings, crawler=None)
return reactor.connectSSL(host, port, factory, client_context_factory)
else:
return reactor.connectTCP(host, port, factory)
11 changes: 6 additions & 5 deletions scrapy/core/downloader/handlers/http11.py
@@ -25,7 +25,7 @@
from scrapy.responsetypes import responsetypes
from scrapy.core.downloader.webclient import _parse
from scrapy.core.downloader.tls import openssl_methods
from scrapy.utils.misc import load_object
from scrapy.utils.misc import load_object, create_instance
from scrapy.utils.python import to_bytes, to_unicode
from scrapy import twisted_version

@@ -44,14 +44,15 @@ def __init__(self, settings):
self._contextFactoryClass = load_object(settings['DOWNLOADER_CLIENTCONTEXTFACTORY'])
# try method-aware context factory
try:
self._contextFactory = self._contextFactoryClass(method=self._sslMethod)
self._contextFactory = create_instance(self._contextFactoryClass, settings=settings, crawler=None,
method=self._sslMethod)
except TypeError:
# use context factory defaults
self._contextFactory = self._contextFactoryClass()
self._contextFactory = create_instance(self._contextFactoryClass, settings=settings, crawler=None)
msg = """
'%s' does not accept `method` argument (type OpenSSL.SSL method,\
e.g. OpenSSL.SSL.SSLv23_METHOD).\
Please upgrade your context factory class to handle it or ignore it.""" % (
e.g. OpenSSL.SSL.SSLv23_METHOD) and/or `tls_verbose_logging` argument.\
Please upgrade your context factory class to handle them or ignore them.""" % (
settings['DOWNLOADER_CLIENTCONTEXTFACTORY'],)
warnings.warn(msg)
self._default_maxsize = settings.getint('DOWNLOAD_MAXSIZE')
30 changes: 29 additions & 1 deletion scrapy/core/downloader/tls.py
@@ -2,6 +2,7 @@
from OpenSSL import SSL

from scrapy import twisted_version
from scrapy.utils.ssl import x509name_to_string, get_temp_key_info


logger = logging.getLogger(__name__)
@@ -20,6 +21,7 @@
METHOD_TLSv12: getattr(SSL, 'TLSv1_2_METHOD', 6), # TLS 1.2 only
}


if twisted_version >= (14, 0, 0):
# ClientTLSOptions requires a recent-enough version of Twisted.
# Not having ScrapyClientTLSOptions should not matter for older
@@ -65,13 +67,39 @@ class ScrapyClientTLSOptions(ClientTLSOptions):
Same as Twisted's private _sslverify.ClientTLSOptions,
except that VerificationError, CertificateError and ValueError
exceptions are caught, so that the connection is not closed, only
logging warnings.
logging warnings. Also, HTTPS connection parameters logging is added.
"""

def __init__(self, hostname, ctx, verbose_logging=False):
super(ScrapyClientTLSOptions, self).__init__(hostname, ctx)
self.verbose_logging = verbose_logging

def _identityVerifyingInfoCallback(self, connection, where, ret):
if where & SSL_CB_HANDSHAKE_START:
set_tlsext_host_name(connection, self._hostnameBytes)
elif where & SSL_CB_HANDSHAKE_DONE:
if self.verbose_logging:
if hasattr(connection, 'get_cipher_name'): # requires pyOpenSSL 0.15
if hasattr(connection, 'get_protocol_version_name'): # requires pyOpenSSL 16.0.0
logger.debug('SSL connection to %s using protocol %s, cipher %s',
self._hostnameASCII,
connection.get_protocol_version_name(),
connection.get_cipher_name(),
)
else:
logger.debug('SSL connection to %s using cipher %s',
self._hostnameASCII,
connection.get_cipher_name(),
)
server_cert = connection.get_peer_certificate()
logger.debug('SSL connection certificate: issuer "%s", subject "%s"',
x509name_to_string(server_cert.get_issuer()),
x509name_to_string(server_cert.get_subject()),
)
key_info = get_temp_key_info(connection._ssl)
if key_info:
logger.debug('SSL temp key: %s', key_info)

try:
verifyHostname(connection, self._hostnameASCII)
except verification_errors as e:
46 changes: 31 additions & 15 deletions scrapy/logformatter.py
@@ -12,26 +12,40 @@

class LogFormatter(object):
"""Class for generating log messages for different actions.

All methods must return a dictionary listing the parameters ``level``,
``msg`` and ``args`` which are going to be used for constructing the log
message when calling logging.log.
All methods must return a dictionary listing the parameters ``level``, ``msg``
and ``args`` which are going to be used for constructing the log message when
calling ``logging.log``.

Dictionary keys for the method outputs:
* ``level`` should be the log level for that action, you can use those
from the python logging library: logging.DEBUG, logging.INFO,
logging.WARNING, logging.ERROR and logging.CRITICAL.

* ``msg`` should be a string that can contain different formatting
placeholders. This string, formatted with the provided ``args``, is
going to be the log message for that action.
* ``level`` is the log level for that action; you can use those from the
`python logging library <https://docs.python.org/3/library/logging.html>`_:
``logging.DEBUG``, ``logging.INFO``, ``logging.WARNING``, ``logging.ERROR``
and ``logging.CRITICAL``.
* ``msg`` should be a string that can contain different formatting placeholders.
This string, formatted with the provided ``args``, is going to be the log message
for that action.
* ``args`` should be a tuple or dict with the formatting placeholders for ``msg``.
The final log message is computed as ``msg % args``.

* ``args`` should be a tuple or dict with the formatting placeholders
for ``msg``. The final log message is computed as output['msg'] %
output['args'].
"""
Here is an example of how to create a custom log formatter to lower the severity level of
the log message when an item is dropped from the pipeline::

class PoliteLogFormatter(logformatter.LogFormatter):
def dropped(self, item, exception, response, spider):
return {
'level': logging.INFO, # lowering the level from logging.WARNING
'msg': u"Dropped: %(exception)s" + os.linesep + "%(item)s",
'args': {
'exception': exception,
'item': item,
}
}
"""

def crawled(self, request, response, spider):
"""Logs a message when the crawler finds a webpage."""
request_flags = ' %s' % str(request.flags) if request.flags else ''
response_flags = ' %s' % str(response.flags) if response.flags else ''
return {
Expand All @@ -40,7 +54,7 @@ def crawled(self, request, response, spider):
'args': {
'status': response.status,
'request': request,
'request_flags' : request_flags,
'request_flags': request_flags,
'referer': referer_str(request),
'response_flags': response_flags,
# backward compatibility with Scrapy logformatter below 1.4 version
@@ -49,6 +63,7 @@ def crawled(self, request, response, spider):
}

def scraped(self, item, response, spider):
"""Logs a message when an item is scraped by a spider."""
if isinstance(response, Failure):
src = response.getErrorMessage()
else:
@@ -63,6 +78,7 @@ def scraped(self, item, response, spider):
}

def dropped(self, item, exception, response, spider):
"""Logs a message when an item is dropped while it is passing through the item pipeline."""
return {
'level': logging.WARNING,
'msg': DROPPEDMSG,
13 changes: 13 additions & 0 deletions scrapy/pipelines/files.py
@@ -189,6 +189,19 @@ def _headers_to_botocore_kwargs(self, headers):
'X-Amz-Grant-Read': 'GrantRead',
'X-Amz-Grant-Read-ACP': 'GrantReadACP',
'X-Amz-Grant-Write-ACP': 'GrantWriteACP',
'X-Amz-Object-Lock-Legal-Hold': 'ObjectLockLegalHoldStatus',
'X-Amz-Object-Lock-Mode': 'ObjectLockMode',
'X-Amz-Object-Lock-Retain-Until-Date': 'ObjectLockRetainUntilDate',
'X-Amz-Request-Payer': 'RequestPayer',
'X-Amz-Server-Side-Encryption': 'ServerSideEncryption',
'X-Amz-Server-Side-Encryption-Aws-Kms-Key-Id': 'SSEKMSKeyId',
'X-Amz-Server-Side-Encryption-Context': 'SSEKMSEncryptionContext',
'X-Amz-Server-Side-Encryption-Customer-Algorithm': 'SSECustomerAlgorithm',
'X-Amz-Server-Side-Encryption-Customer-Key': 'SSECustomerKey',
'X-Amz-Server-Side-Encryption-Customer-Key-Md5': 'SSECustomerKeyMD5',
'X-Amz-Storage-Class': 'StorageClass',
'X-Amz-Tagging': 'Tagging',
'X-Amz-Website-Redirect-Location': 'WebsiteRedirectLocation',
})
extra = {}
for key, value in six.iteritems(headers):
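A hedged illustration of the mapping above (the method is the one defined in
this file; header values hypothetical, and ``__new__`` is used only to skip
``__init__`` so the example stays self-contained)::

    from scrapy.pipelines.files import S3FilesStore

    store = S3FilesStore.__new__(S3FilesStore)  # bypass __init__ (illustration only)
    kwargs = store._headers_to_botocore_kwargs({
        'X-Amz-Storage-Class': 'STANDARD_IA',
        'X-Amz-Tagging': 'project=demo',
    })
    # -> {'StorageClass': 'STANDARD_IA', 'Tagging': 'project=demo'}
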
1 change: 1 addition & 0 deletions scrapy/settings/default_settings.py
@@ -87,6 +87,7 @@
DOWNLOADER_CLIENTCONTEXTFACTORY = 'scrapy.core.downloader.contextfactory.ScrapyClientContextFactory'
DOWNLOADER_CLIENT_TLS_METHOD = 'TLS' # Use highest TLS/SSL protocol version supported by the platform,
# also allowing negotiation
DOWNLOADER_CLIENT_TLS_VERBOSE_LOGGING = False

DOWNLOADER_MIDDLEWARES = {}

7 changes: 5 additions & 2 deletions scrapy/utils/conf.py
@@ -1,10 +1,13 @@
import os
import sys
import numbers
import configparser
from operator import itemgetter

import six
if six.PY2:
from ConfigParser import SafeConfigParser as ConfigParser
else:
from configparser import ConfigParser

from scrapy.settings import BaseSettings
from scrapy.utils.deprecate import update_classpath
@@ -94,7 +97,7 @@ def init_env(project='default', set_syspath=True):
def get_config(use_closest=True):
"""Get Scrapy config file as a ConfigParser"""
sources = get_sources(use_closest)
cfg = configparser.ConfigParser()
cfg = ConfigParser()
cfg.read(sources)
return cfg

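For reference, a minimal sketch of reading the conventional ``[settings]``
section of ``scrapy.cfg`` through ``get_config`` (option names as in a
standard Scrapy project file)::

    from scrapy.utils.conf import get_config

    cfg = get_config()
    # scrapy.cfg conventionally has a [settings] section whose "default"
    # option points at the project settings module.
    if cfg.has_option('settings', 'default'):
        print(cfg.get('settings', 'default'))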