Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixes issue #159 and greatly speeds up iteration of PMap in Python3 #243

Merged
merged 3 commits into from Feb 13, 2022

Conversation

noahbenson
Copy link
Contributor

This pull request fixes issue #159 and greatly speeds up iteration of PMap in Python3—in fact it greatly speeds up creation of the iterator object alone when calling PMap.keys, PMap.values, and PMap.items.

Most of the info is in the commit messages, but I've repeated the relevant bits below (with some minor edits) for convenience!

Happy to iterate on this if there are issues I've missed.


Adds a new class PMapView to the pyrsistent._pmap namespace. This class is intended to be private and essentially mimics the built-in dict_values and dict_items pseudo-types. Under the hood they just keep a reference to the PMap object and use the iteritems and itervalues methods; this means that creating a view can be run in constant time instead of the current O(n) time.

The keys method of PMap does not use the PMapView class because PSet is aleady a PMapView-type class for PMap keys. I couldn't think of a reason not to just use PSet: it's the logical type for a set of keys of a persistent map,
and it can just be passed a PMap to create such a set in constant time.

The tests were also updated to reflect the reality that keys now returns a PSet instead of a PVector and that values and items return PMapViews.

As a benchmark, the following code:

from timeit import timeit

def make_setup(mapsize):
    return ('from pyrsistent import pmap, pvector; '
            'm = pmap({k:k for k in range(%d)})' % (mapsize,))

tests = [('len(m.keys())',   'len(pvector(m.iterkeys()))'),
         ('len(m.values())', 'len(pvector(m.itervalues()))'),
         ('len(m.items())',  'len(pvector(m.iteritems()))')]

for mapsize in [10, 100, 1000]:
    print("Map size:", str(mapsize))
    setupstr = make_setup(mapsize)
    for test_pair in tests:
        print('=' * 60)
        print('%-40s' % test_pair[0], '%6f us' % timeit(test_pair[0], setup=setupstr))
        print('%-40s' % test_pair[1], '%6f us' % timeit(test_pair[1], setup=setupstr))
        print('-' * 60)
    print('')

produces the following output on my desktop:

Map size: 10
============================================================
len(m.keys())                            1.330579 us
len(pvector(m.iterkeys()))               1.562125 us
------------------------------------------------------------
============================================================
len(m.values())                          0.653008 us
len(pvector(m.itervalues()))             1.584405 us
------------------------------------------------------------
============================================================
len(m.items())                           0.653144 us
len(pvector(m.iteritems()))              1.159099 us
------------------------------------------------------------

Map size: 100
============================================================
len(m.keys())                            1.350190 us
len(pvector(m.iterkeys()))               11.702080 us
------------------------------------------------------------
============================================================
len(m.values())                          0.650723 us
len(pvector(m.itervalues()))             11.739465 us
------------------------------------------------------------
============================================================
len(m.items())                           0.653189 us
len(pvector(m.iteritems()))              8.951215 us
------------------------------------------------------------

Map size: 1000
============================================================
len(m.keys())                            1.379429 us
len(pvector(m.iterkeys()))               114.202673 us
------------------------------------------------------------
============================================================
len(m.values())                          0.679722 us
len(pvector(m.itervalues()))             113.874931 us
------------------------------------------------------------
============================================================
len(m.items())                           0.680877 us
len(pvector(m.iteritems()))              87.645472 us
------------------------------------------------------------

These tests were run with the version in this pull request, so the first test is always the new approach and the second test is always the previous approach.

Adds a new class `PMapView` to the `pyrsistent._pmap` namespace. This class is
intended to be private and essentially mimics the built-in `dict_values` and
`dict_items` pseudo-types. Under the hood they just keep a reference to the
`PMap` object and use the `iteritems` and `itervalues` methods; this means that
creating a view can be run in constant time insteead of the current `O(n)` time.

The `keys` method of `PMap` does not use the `PMapView` class because `PSet` is
aleady a `PMapView`-type class for `PMap` keys. I couldn't think of a reason not
to just use `PSet`: it's the logical type for a set of keys of a persistent map,
and it can just be passed a `PMap` to create such a set in constant time.

The tests were also updated to reflect the reality that `keys` now returns a
`PSet` instead of a `PVector` and that `values` and `items` return `PMapView`s.

As a benchmark, the following code:

```python
from timeit import timeit

def make_setup(mapsize):
    return ('from pyrsistent import pmap, pvector; '
            'm = pmap({k:k for k in range(%d)})' % (mapsize,))

tests = [('len(m.keys())',   'len(pvector(m.iterkeys()))'),
         ('len(m.values())', 'len(pvector(m.itervalues()))'),
         ('len(m.items())',  'len(pvector(m.iteritems()))')]

for mapsize in [10, 100, 1000]:
    print("Map size:", str(mapsize))
    setupstr = make_setup(mapsize)
    for test_pair in tests:
        print('=' * 60)
        print('%-40s' % test_pair[0], '%6f us' % timeit(test_pair[0], setup=setupstr))
        print('%-40s' % test_pair[1], '%6f us' % timeit(test_pair[1], setup=setupstr))
        print('-' * 60)
    print('')
```

produces the following output on my desktop:

```
Map size: 10
============================================================
len(m.keys())                            1.330579 us
len(pvector(m.iterkeys()))               1.562125 us
------------------------------------------------------------
============================================================
len(m.values())                          0.653008 us
len(pvector(m.itervalues()))             1.584405 us
------------------------------------------------------------
============================================================
len(m.items())                           0.653144 us
len(pvector(m.iteritems()))              1.159099 us
------------------------------------------------------------

Map size: 100
============================================================
len(m.keys())                            1.350190 us
len(pvector(m.iterkeys()))               11.702080 us
------------------------------------------------------------
============================================================
len(m.values())                          0.650723 us
len(pvector(m.itervalues()))             11.739465 us
------------------------------------------------------------
============================================================
len(m.items())                           0.653189 us
len(pvector(m.iteritems()))              8.951215 us
------------------------------------------------------------

Map size: 1000
============================================================
len(m.keys())                            1.379429 us
len(pvector(m.iterkeys()))               114.202673 us
------------------------------------------------------------
============================================================
len(m.values())                          0.679722 us
len(pvector(m.itervalues()))             113.874931 us
------------------------------------------------------------
============================================================
len(m.items())                           0.680877 us
len(pvector(m.iteritems()))              87.645472 us
------------------------------------------------------------
```
@noahbenson
Copy link
Contributor Author

I should have mentioned that the commit also includes a definition for the __reversed__ method of PMap. The method just throws an error saying that pmaps are not reversible. Currently if you try to reverse one, the error is somewhat arcane. (I realize that reversing a mapping like this is unusual, but you can reverse dict objects in Python, so making a clear error message seemed wise.)

@noahbenson
Copy link
Contributor Author

Big fan of pyrsistent by the way—thanks for maintaining it! I use it all over my libraries pimms and neuropythy, and I maintain a similar library in Julia!

Copy link
Owner

@tobgu tobgu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

As a general comment and background why it looks the way it currently does.
keys/iterkeys, values/itervalues and items/iteritems mimic how the corresponding methods behave in Python 2 (which was the dominating Python version when pyrsistent was first written). I have not changed this since, partly due to backwards compatibility concerns.

The code in this PR brings keys/values/items closer to the semantics of the same methods in the corresponding types in the stdlib of Python 3 which is nice. Most notably it adds len and equality comparison which was not previously possible with the generators returned by the corresponding iter... methods.

If we decide to go this way (which I think is nice!) I also think we can ditch the iter... methods from the public interface of the PMap to fully align the interface with how things look and behave in Python 3.

That would have to be combined with bumping the version number accordingly and highlight the backwards incompability in the Changelog (I'll take care of that once I release a new version though).

What are your thoughts on this?

@@ -3,6 +3,68 @@
from pyrsistent._pvector import pvector
from pyrsistent._transformations import transform

class PMapView(object):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • No need to inherit from object in Python 3.
  • I think I'd prefer to split this class in two, one for the values and one for the items. A couple of methods would have to be duplicated in both classes but you'd get rid of a lot of testing on the boolean values in a bunch of methods.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did this by having the PMapValues and PMapItems types inherit from PMapView—less code duplication this way.

raise TypeError("PViewMap requires a Mapping object")
object.__setattr__(self, '_map', m)
object.__setattr__(self, '_values', values)
def __len__(self):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general nitpick I'd prefer one empty line between methods in classes to align the style with other classes in the code base.

def __len__(self):
return len(self._map)
def __setattr__(self, k, v):
raise TypeError("%s is immutable" % (type(self),))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a general nitpick I'd like to replace all uses of Python 2 like format strings (eg. "%s ..." % ...) with f-strings or possibly "{foo}".format(foo="bar") unless there is a special reason to use the Python 2 convention (pyrsistent claims compatibility with 3.7+).

return self._map.iteritems()
def __contains__(self, arg):
if self._values:
for v in self._map.itervalues():
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be possible to replace this loop with:
return v in self._map.itervalues()
I think?

if v == arg: return True
return False
else:
try: (k,v) = arg
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be possible to replace this with:
return v in self._map.iteritems()
I think?

def __eq__(self, x):
if x is self: return True
if not isinstance(x, PMapView): return False
# For whatever reason, dict_values always seem to return False for ==
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Weird, never though of that...

# If this method is not defined, then reversed(pmap) will attempt to reverse
# the map using len() and getitem, usually resulting in a mysterious
# KeyError.
def __reversed__(self):
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great!

assert set(['a', 'b']) == set(m(a=1, b=2))
assert ['a', 'b'] == sorted(m(a=1, b=2).keys())
assert isinstance(m().keys(), PVector)
assert isinstance(m().keys(), pyr.PSet)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comment on the tests:
I don't think we need to test against the type of the returned value anymore but just the properties of it. The reason it was done before was just to highlight that you got a proper PVector back (much like you'd get a true, materialized, list in Python 2 for the corresponding methods).

assert len(pm) == len(pm.keys())
assert len(pm) == len(pm.values())
assert len(pm) == len(pm.items())
ks = pm.keys()
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick: Perhaps move the declaration of these variables to right berfore the assertion where they are used. It would make it a bit easier to follow without having to jump back and forth between the declaration and the assert.

Eg.

vs = pm.values()
assert all(v in vs for v in vs)

@noahbenson
Copy link
Contributor Author

Yeah, I've been using pyrsistent since maybe 2014? Can't quite remember but I'm still used to thinking about the data structures in a fundamentally Python2 way.

I understand removing the itervalues()-like methods in order to align with the direction of Python 3, but I could also see the logic in keeping them around for compatibility. (Could always stick a deprecation warning on them for awhile.) IMHO this is a somewhat aesthetic choice—I would probably eventually do this if it were my library, but I don't have a really good reason why 🤷. My main concern about these methods arose from the fact that I often pass pyrsistent maps into functions that expect dict arguments (and thus use keys() or values() and expect certain speed/behavior). I think it's fine/expected for a dict variant type to have extra methods/features as long as the basic dict features work as expected, though.

Copy link
Owner

@tobgu tobgu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, thanks!

@tobgu
Copy link
Owner

tobgu commented Feb 13, 2022

Alright, lets keep the iter*-methods around for now. :-)

@tobgu tobgu merged commit dbb518b into tobgu:master Feb 13, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants