Document persistency and equality #220

Merged
merged 6 commits into from Oct 26, 2020
3 changes: 2 additions & 1 deletion CHANGES.rst
@@ -5,7 +5,8 @@
5.1.3 (unreleased)
==================

- Nothing changed yet.
- Add documentation section ``Persistency and Equality``
(`#218 <https://github.com/zopefoundation/zope.interface/issues/218>`_).


5.1.2 (2020-10-01)
224 changes: 221 additions & 3 deletions docs/README.rst
@@ -1,6 +1,6 @@
==========
Interfaces
==========
============
Interfaces
============

.. currentmodule:: zope.interface

@@ -1046,6 +1046,208 @@ functionality for particular interfaces.
how to override functions in interface definitions and why, prior
to Python 3.6, the zero-argument version of `super` cannot be used.

.. _global_persistence:

Persistence, Sorting, Equality and Hashing
==========================================

.. tip:: For the practical implications of what's discussed below, and
some potential problems, see :ref:`spec_eq_hash`.

Just like Python classes, interfaces are designed to inexpensively
support persistence using Python's standard :mod:`pickle` module. This
means that one process can send a *reference* to an interface to another
process in the form of a byte string, and that other process can load
that byte string and get the object that is that interface. The processes
may be separated in time (one after the other), in space (running on
different machines) or even be parts of the same process communicating
with itself.

We can demonstrate this. Observe how small the byte string needed to
capture the reference is. Also note that since this is the same
process, the identical object is found and returned:

.. doctest::

  >>> import sys
  >>> import pickle
  >>> class Foo(object):
  ...     pass
  >>> sys.modules[__name__].Foo = Foo # XXX, see below
  >>> pickled_byte_string = pickle.dumps(Foo, 0)
  >>> len(pickled_byte_string)
  21
  >>> imported = pickle.loads(pickled_byte_string)
  >>> imported == Foo
  True
  >>> imported is Foo
  True
  >>> class IFoo(zope.interface.Interface):
  ...     pass
  >>> sys.modules[__name__].IFoo = IFoo # XXX, see below
  >>> pickled_byte_string = pickle.dumps(IFoo, 0)
  >>> len(pickled_byte_string)
  22
  >>> imported = pickle.loads(pickled_byte_string)
  >>> imported is IFoo
  True
  >>> imported == IFoo
  True

The eagle-eyed reader will have noticed the two funny lines like
``sys.modules[__name__].Foo = Foo``. What's that for? To understand,
we must know a bit about how Python "pickles" (``pickle.dump`` or
``pickle.dumps``) classes or interfaces.

When Python pickles a class or an interface, it does so as a "global
object" [#global_object]_. A global object is expected to already
exist in the receiving process, complete with all its state (for
classes and interfaces, things like the list of methods and defined
attributes). Contrast this with pickling a string or an object
instance, which creates a new object in the receiving process. The
pickled byte string therefore needs to contain only enough data to
look up that existing object; this is a *reference*. Not only does
this minimize the amount of data required to persist such an object,
it also facilitates changing the definition of the object over time:
if a class or interface gains or loses methods or attributes, loading
a previously pickled reference will use the *current definition* of
the object.
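
As a rough illustration of what such a reference looks like, the sketch below pickles a standard-library class and inspects the result with :mod:`pickletools`. The choice of ``fractions.Fraction`` and the helper name ``global_opcodes`` are assumptions made for this example only; nothing here is specific to interfaces.

```python
import fractions
import pickle
import pickletools


def global_opcodes(obj, protocol):
    """Return the names of the opcodes mentioning GLOBAL in the pickle of obj."""
    data = pickle.dumps(obj, protocol)
    return [
        opcode.name
        for opcode, _arg, _pos in pickletools.genops(data)
        if "GLOBAL" in opcode.name
    ]


# Collect the GLOBAL-style opcodes used by every protocol this Python supports.
opcodes_by_protocol = {
    protocol: global_opcodes(fractions.Fraction, protocol)
    for protocol in range(pickle.HIGHEST_PROTOCOL + 1)
}

# With protocol 0 the module and name are stored as readable text.
reference = pickle.dumps(fractions.Fraction, 0)

# Loading the reference looks up the existing class; no new class is created.
loaded = pickle.loads(reference)
```

The byte string carries the module and name of the class but none of its methods, which is why the loaded object is whatever definition currently lives under that name.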

The *reference* to a global object that's stored in the byte string
consists only of the object's ``__name__`` and ``__module__``. Before
a global object *obj* is pickled, Python makes sure that the object being
pickled is the same one that can be found at
``getattr(sys.modules[obj.__module__], obj.__name__)``; if there is no
such object, or it refers to a different object, pickling fails. The
two funny lines make sure that condition holds no matter how this
example is run (under some doctest runners it does not hold by
default, although it normally would).
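
The lookup described above can be approximated in a few lines. This is only a sketch: ``resolves_to_itself`` and ``make_impostor`` are hypothetical names invented here, and the check is a simplification (Python 3 consults ``__qualname__``, which matches ``__name__`` for top-level objects).

```python
import fractions
import pickle
import sys


def resolves_to_itself(obj):
    """Approximate pickle's check: does module.name lead back to obj?"""
    module = sys.modules.get(obj.__module__)
    return getattr(module, obj.__name__, None) is obj


def make_impostor():
    """Build a second class that claims to be fractions.Fraction."""
    class Fraction:
        pass
    Fraction.__module__ = "fractions"
    Fraction.__qualname__ = "Fraction"
    return Fraction


impostor = make_impostor()

# The real class is found under its own name; the impostor is not.
real_ok = resolves_to_itself(fractions.Fraction)
impostor_ok = resolves_to_itself(impostor)

# Pickling the impostor fails because the name resolves to a different object.
failure = None
try:
    pickle.dumps(impostor)
except pickle.PicklingError as exc:
    failure = exc
```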

We can show some examples of what happens when that condition doesn't
hold. First, what if we change the global object and try to pickle the
old one?

.. doctest::

  >>> sys.modules[__name__].Foo = 42
  >>> pickle.dumps(Foo)
  Traceback (most recent call last):
    ...
  _pickle.PicklingError: Can't pickle <class 'Foo'>: it's not the same object as builtins.Foo

A consequence of this is that only one object of the given name can be
defined and pickled at a time. If we were to try to define a new ``Foo``
class (remembering that normally the ``sys.modules[__name__].Foo =``
line is automatic), we still cannot pickle the old one:

.. doctest::

  >>> orig_Foo = Foo
  >>> class Foo(object):
  ...     pass
  >>> sys.modules[__name__].Foo = Foo # XXX, see below
  >>> pickle.dumps(orig_Foo)
  Traceback (most recent call last):
    ...
  _pickle.PicklingError: Can't pickle <class 'Foo'>: it's not the same object as builtins.Foo

Or what if there simply is no global object?

.. doctest::

  >>> del sys.modules[__name__].Foo
  >>> pickle.dumps(Foo)
  Traceback (most recent call last):
    ...
  _pickle.PicklingError: Can't pickle <class 'Foo'>: attribute lookup Foo on builtins failed

Interfaces and classes behave the same in all those ways.

.. rubric:: What's This Have To Do With Sorting, Equality and Hashing?

Another important design consideration for interfaces is that they
should be sortable. This permits them to be used, for example, as keys
in a (persistent) `BTree <https://btrees.readthedocs.io>`_. As such,
they define a total ordering, meaning that any given interface can
definitively be said to be greater than, less than, or equal to any
other interface. This relationship must be *stable* and hold the same
across any two processes.

An object becomes sortable by overriding the equality method
``__eq__`` and at least one of the comparison methods (such as
``__lt__``).

Classes, on the other hand, are not sortable [#class_sort]_.
Classes can only be tested for equality, and they implement this using
object identity: ``class_a == class_b`` is equivalent to ``class_a is class_b``.

In addition to being sortable, it's important for interfaces to be
hashable so they can be used as keys in dictionaries or members of
sets. This is done by implementing the ``__hash__`` method [#hashable]_.

Classes are hashable, and they also implement this based on object
identity, with the equivalent of ``id(class_a)``.

To be both hashable and sortable, the hash method and the equality and
comparison methods **must** `be consistent with each other
<https://docs.python.org/3/reference/datamodel.html#object.__hash__>`_.
That is, they must all be based on the same principle.

Classes use the principle of identity to implement equality and
hashing, but they don't implement sorting because identity isn't a
stable sorting method (it is different in every process).

Interfaces need to be sortable. In order for all three of hashing,
equality and sorting to be consistent, interfaces implement them using
the same principle as persistence. Interfaces are treated like "global
objects" and sort and hash using the same information a *reference* to
them would: their ``__name__`` and ``__module__``.
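
The idea can be sketched with a small stand-in class. ``GlobalLike`` is a hypothetical name used only for illustration and is not how interfaces are actually implemented; it simply derives equality, ordering and hashing from one shared key of name and module.

```python
from functools import total_ordering


@total_ordering
class GlobalLike:
    """Stand-in object that compares and hashes by (name, module)."""

    def __init__(self, module, name):
        self.module = module
        self.name = name

    def _key(self):
        # The single source of truth for equality, ordering and hashing.
        return (self.name, self.module)

    def __eq__(self, other):
        if not isinstance(other, GlobalLike):
            return NotImplemented
        return self._key() == other._key()

    def __lt__(self, other):
        if not isinstance(other, GlobalLike):
            return NotImplemented
        return self._key() < other._key()

    def __hash__(self):
        return hash(self._key())


# Two distinct objects with the same name and module behave as one key.
first = GlobalLike("pkg.interfaces", "IFoo")
second = GlobalLike("pkg.interfaces", "IFoo")
other = GlobalLike("pkg.interfaces", "IBar")

registry = {first: "value"}
ordered = sorted([first, other])
```

Because all three operations share ``_key()``, they cannot disagree with each other, and the ordering does not depend on anything that varies between processes.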

In this way, hashing, equality and sorting are consistent with each
other, and consistent with pickling:

.. doctest::

  >>> class IFoo(zope.interface.Interface):
  ...     pass
  >>> sys.modules[__name__].IFoo = IFoo
  >>> f1 = IFoo
  >>> pickled_f1 = pickle.dumps(f1)
  >>> class IFoo(zope.interface.Interface):
  ...     pass
  >>> sys.modules[__name__].IFoo = IFoo
  >>> IFoo == f1
  True
  >>> unpickled_f1 = pickle.loads(pickled_f1)
  >>> unpickled_f1 == IFoo == f1
  True

This isn't quite the case for classes; note how ``f1`` wasn't equal to
``Foo`` before pickling, but the unpickled value is:

.. doctest::

  >>> class Foo(object):
  ...     pass
  >>> sys.modules[__name__].Foo = Foo
  >>> f1 = Foo
  >>> pickled_f1 = pickle.dumps(Foo)
  >>> class Foo(object):
  ...     pass
  >>> sys.modules[__name__].Foo = Foo
  >>> f1 == Foo
  False
  >>> unpickled_f1 = pickle.loads(pickled_f1)
  >>> unpickled_f1 == Foo # Surprise!
  True
  >>> unpickled_f1 == f1
  False

For more information, and some rare potential pitfalls, see
:ref:`spec_eq_hash`.

.. rubric:: Footnotes

.. [#create] The main reason we subclass ``Interface`` is to cause the
Python class statement to create an interface, rather
than a class.
@@ -1070,3 +1272,19 @@ functionality for particular interfaces.

The interface implementation doesn't enforce this,
but maybe it should do some checks.

.. [#class_sort] In Python 2, classes could be sorted, but the sort
was not stable (it also used the identity principle)
and not useful for persistence; this was considered a
bug that was fixed in Python 3.

.. [#hashable] In order to be hashable, you must implement both
``__eq__`` and ``__hash__``. If you only implement
``__eq__``, Python makes sure the type cannot be
used in a dictionary, set, or with :func:`hash`. In
Python 2, this wasn't the case, and forgetting to
override ``__hash__`` was a constant source of bugs.

.. [#global_object] From the name of the pickle bytecode operator; it
varies depending on the protocol but always
includes "GLOBAL".
105 changes: 97 additions & 8 deletions docs/api/specifications.rst
@@ -161,11 +161,13 @@ Exmples for :meth:`.Specification.extends`:
>>> I2.extends(I2, strict=False)
True

.. _spec_eq_hash:

Equality, Hashing, and Comparisons
----------------------------------

Specifications (including their notable subclass `Interface`), are
hashed and compared based solely on their ``__name__`` and
hashed and compared (sorted) based solely on their ``__name__`` and
``__module__``, not including any information about their enclosing
scope, if any (e.g., their ``__qualname__``). This means that any two
objects created with the same name and module are considered equal and
@@ -191,13 +193,22 @@ map to the same value in a dictionary.
>>> I1 == orig_I1 == nested_I1
True

Because weak references hash the same as their underlying object,
this can lead to surprising results when weak references are involved,
especially if there are cycles involved or if the garbage collector is
not based on reference counting (e.g., PyPy). For example, if you
redefine an interface named the same as an interface being used in a
``WeakKeyDictionary``, you can get a ``KeyError``, even if you put the
new interface into the dictionary.
Caveats
~~~~~~~

While this behaviour works well with :ref:`pickling (persistence)
<global_persistence>`, it has some potential downsides to be aware of.

.. rubric:: Weak References

The first downside involves weak references. Because a weak reference
hashes the same as its underlying object, results can be surprising,
especially if there are reference cycles or if the garbage collector
is not based on reference counting (e.g., PyPy). For example, if you
redefine an interface named the same as an interface being used in a
``WeakKeyDictionary``, you can get a ``KeyError``, even if you put the
new interface into the dictionary.


.. doctest::
@@ -225,6 +236,84 @@ interfaces, you may find surprising ``KeyError`` exceptions. For this
reason, it is best to use distinct names for local interfaces within
the same test module.

.. rubric:: Providing Dynamic Interfaces

If you return an interface created inside a function or method, or
otherwise let it escape outside the bounds of that function (such as
by having an object provide it), it's important to be aware that it
will compare and hash equal to *any* other interface defined in that
same module with the same name. This includes interface objects
created by other invocations of that function.

This can lead to surprising results when querying against those
interfaces. We can demonstrate by creating a module-level interface
with a common name, and checking that it is provided by an object:

.. doctest::

  >>> from zope.interface import Interface, alsoProvides, providedBy
  >>> class ICommon(Interface):
  ...     pass
  >>> class Obj(object):
  ...     pass
  >>> obj = Obj()
  >>> alsoProvides(obj, ICommon)
  >>> len(list(providedBy(obj)))
  1
  >>> ICommon.providedBy(obj)
  True

Next, in the same module, we will define a function that dynamically
creates an interface of the same name and adds it to an object.

.. doctest::

  >>> def add_interfaces(obj):
  ...     class ICommon(Interface):
  ...         pass
  ...     class I2(Interface):
  ...         pass
  ...     alsoProvides(obj, ICommon, I2)
  ...     return ICommon
  ...
  >>> dynamic_ICommon = add_interfaces(obj)

The two instances are *not* identical, but they are equal, and *obj*
provides them both:

.. doctest::

  >>> ICommon is dynamic_ICommon
  False
  >>> ICommon == dynamic_ICommon
  True
  >>> ICommon.providedBy(obj)
  True
  >>> dynamic_ICommon.providedBy(obj)
  True

At this point, we've effectively called ``alsoProvides(obj, ICommon,
dynamic_ICommon, I2)``, where the last two interfaces were locally
defined in the function. So checking how many interfaces *obj* now
provides should return three, right?

.. doctest::

  >>> len(list(providedBy(obj)))
  2

Because ``ICommon == dynamic_ICommon`` due to having the same
``__name__`` and ``__module__``, only one of them is actually provided
by the object, for a total of two provided interfaces. (Exactly which
one is undefined.) Likewise, if we run the same function again, *obj*
will still only provide two interfaces:

.. doctest::

  >>> _ = add_interfaces(obj)
  >>> len(list(providedBy(obj)))
  2


Interface
=========