
[BEAM-7746] Add python type hints #9056

Closed
wants to merge 40 commits

Conversation

chadrik
Contributor

@chadrik chadrik commented Jul 12, 2019

EDIT: jump to the latest conversation here: #9056 (comment)


This is an early look at type annotations for the beam source. It does not attempt to address porting beam's runtime typing from its own internal typehints module to the official typing module, so perhaps I should open a new ticket just for type hinting the source.

EDIT: done. https://issues.apache.org/jira/browse/BEAM-7746

There's a lot left to do, including getting this hooked up to tests, but I wanted to get it out there so that A) I can get feedback before I get too far along, and B) in case anyone wants to join forces with me.

Some points for discussion / debate:

type annotation style

In python 2.7, we're limited to type comments, and it can be quite difficult to keep them under the 80 character limit, since they're not very flexible. I'm following the lead of typing-heavy projects like mypy and using the following style for functions with more than one or two arguments:

  def __init__(self,
               evaluation_context,  # type: EvaluationContext
               applied_ptransform,  # type: AppliedPTransform
               input_committed_bundle,
               side_inputs
               ):
    # type: (...) -> None

typing module import

Some projects establish a convention to import only the typing objects required for the current module, e.g.:

from typing import Iterable, Tuple

Since beam has its own internal typehints module I'm importing the typing module in the same fashion to avoid conflicts/confusion. However, this approach has the downside of making some type comments a lot longer and less legible, e.g. typing.Union[typing.Iterable[str], str], which may make it impossible to stay below 80 characters in some cases.
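A minimal sketch contrasting the two conventions (the helper functions are hypothetical, purely for illustration):

```python
import typing
from typing import Iterable, List, Union

def tokens_qualified(value):
    # type: (typing.Union[typing.Iterable[str], str]) -> typing.List[str]
    # Qualified style: no clash with apache_beam.typehints, but longer comments.
    return [value] if isinstance(value, str) else list(value)

def tokens_unqualified(value):
    # type: (Union[Iterable[str], str]) -> List[str]
    # Unqualified style: shorter comments, but names can shadow typehints'.
    return [value] if isinstance(value, str) else list(value)
```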

type checking pipeline construction

The type hints are useful first and foremost for developers of beam, but they should also prove useful for consumers of the beam API, i.e. developers of beam pipelines. Though it's a secondary concern at the moment, I'm trying to ensure that the use case of pipeline creation is covered, so that mypy will generate proper errors for invalid combinations of transforms prior to submission. Due to a bug/weakness in mypy this is harder than expected; I'm making progress writing a mypy plugin to patch up the problem, but it's not done yet. Using the results of this process for runtime typing (e.g. to look up coders) is out of scope for this PR.

str vs unicode

Currently I'm glossing over this problem. The correct way to annotate strings for both python 2 and 3 is to use typing.Text and typing.AnyStr, but it's a huge pain in the ass, in particular because it's often unclear where unicode is valid or invalid (I've also run into inaccuracies in the python typeshed wrt strings). In my own projects, I just annotate everything as str and handle the few cases where mypy complains about unicode by using typing.cast(str, whatever), since the problem will get a lot simpler when we switch to python3, and bytes becomes the exception. I'm open to suggestions here.
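The cast approach mentioned above can be sketched like this (the function names are hypothetical); note that typing.cast is a no-op at runtime and performs no conversion or checking:

```python
from typing import cast

def normalize(name):
    # type: (str) -> str
    return name.strip()

def handle(raw):
    # type: (object) -> str
    # Where mypy complains about a unicode/str mismatch, a cast silences it
    # without changing runtime behavior.
    return normalize(cast(str, raw))
```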


I'll probably think of more later, but I'm out of time and want to get this out the door!

Post-Commit Tests Status (on master branch)

(matrix of Jenkins build-status badges for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners)

Pre-Commit Tests Status (on master branch)

(matrix of Jenkins build-status badges for Java, Python, Go, and Website, split into non-portable and portable rows)

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@chadrik
Contributor Author

chadrik commented Jul 13, 2019

R: @aaltay
R: @charlesccychen

@charlesccychen
Contributor

CC: @udim

Contributor

@robertwb robertwb left a comment


Thanks for looking at this. Some brief comments:

typing module import Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly.

type checking pipeline construction I think the goal of BEAM-7060 is specifically to make type annotations useful to pipeline developers. Making tools like mypy work on the beam codebase, and on user pipelines in general, is another (worthy) goal.

str vs unicode +1 to your approach. There may be cases where we want an explicit bytes type as well (which would be an alias of str in python 2).

@@ -44,7 +52,8 @@ def __init__(self):
self.__class__, 'word_len_dist')
self.empty_line_counter = Metrics.counter(self.__class__, 'empty_lines')

def process(self, element):
def process(self, element, *args, **kwargs):
Contributor

Why are the *args and **kwargs parameters added here?

reveal_type(lines) # PCollection[unicode*]

def make_ones(x):
# type: (T) -> Tuple[T, int]
Contributor

Can this not be inferred? Similarly below.

Contributor Author

Can you explain what you mean?

Let me try to answer what I think you're asking about.

First, even for a simple function like make_ones, mypy does not guess at the arguments and return type.

Next, as a user annotating pipelines I have a few of choices (all of them are valid):

  1. annotate functions, and let mypy infer the types of the PTransforms they are passed to
  2. annotate PTransforms
  3. annotate both, as a way to guard against unintentional incompatibilities

Option 2 would look like this:

  def make_ones(x):
    return (x, 1)

  makemap = beam.Map(make_ones)  # type: beam.Map[T, Tuple[T, int]]

To be clear, in option 2, mypy is not using the type of beam.Map to infer the type of make_ones; it simply assumes the annotation is correct, since in our thought experiment make_ones is unannotated. In other words, the following would not generate an error because make_ones is untyped:

  def make_ones(x):
    return (x + 1, 1)  # x can only be a numeric type!

  makemap = beam.Map(make_ones)  # type: beam.Map[str, Tuple[str, int]]

This module is very messy right now, so I apologize for any confusion that it's generated.

Contributor

That's really unfortunate... I'd have expected more of mypy :(. In that case our own inference (which does handle this) is all the more important.

sdks/python/apache_beam/examples/wordcount.py (outdated; resolved)
def __init__(self, topic=None, subscription=None, id_label=None,
with_attributes=False, timestamp_attribute=None):
def __init__(self,
topic=None, # type: typing.Optional[str]
Contributor

If we keep these, I wonder if it'd be easier to read if the comments were all aligned.

Contributor Author

I'll leave that up to the Beam Team, but we're still going to have the line length limit to contend with.

@@ -972,14 +979,15 @@ def expand(self, pcoll):
'or be a PTransform. Received : %r' % self.sink)


class WriteImpl(ptransform.PTransform):
class WriteImpl(ptransform.PTransform[InT, None]):
Contributor

Or should this be a PTransform[pvalue.PCollection[InT], None]?

Contributor Author

IMHO, that would be redundant. It won't change the behavior of the type checks, and it would add verbosity to a lot of class definitions, since there are numerous PTransforms throughout the code.

Contributor Author

oh, also, it will require an import of the apache_beam.pvalue module into all of those modules as well. you won't be able to protect it inside a typing.TYPE_CHECKING block since it's used in the class definition.
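The TYPE_CHECKING distinction above can be sketched in miniature; this hypothetical example uses decimal.Decimal as a stand-in for apache_beam.pvalue.PCollection:

```python
from typing import TYPE_CHECKING, Generic, TypeVar

if TYPE_CHECKING:
    # Guarded import: only needed by type comments, never at runtime.
    from decimal import Decimal

T = TypeVar('T')

class Holder(Generic[T]):
    def get(self):
        # type: () -> 'Decimal'
        # OK: type comments are never evaluated, so the guarded import suffices.
        raise NotImplementedError

# By contrast, a base class like Holder[Decimal] *is* evaluated when the class
# statement executes, so Decimal would need a real, unguarded import there.
```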

Contributor

But some transforms take and return other things than PCollections. Maybe if T is shorthand for PCollection[T] but that seems ambiguous.

@chadrik
Contributor Author

chadrik commented Jul 15, 2019

typing module import: Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly.

Sorry, I don't completely follow this. IIUC, we can't replace apache_beam.typehints with typing until beam knows how to deal with typing internally, right?

@chadrik chadrik changed the title [BEAM-7060] Add python type hints [BEAM-7746] Add python type hints Jul 15, 2019
@chadrik
Contributor Author

chadrik commented Jul 15, 2019

type checking pipeline construction I think the goal of JIRA-7060 is specifically to make type annotations useful to pipeline developers. Making tools like mypy work on the beam codebase, and user pipelines in general, is another (worthy) goal.

I made a new ticket for the latter, and changed the subject of this PR: https://issues.apache.org/jira/browse/BEAM-7746

@aaltay aaltay requested review from robertwb and udim July 17, 2019 00:35
Member

@udim udim left a comment


@robertwb Please do not merge before I get a chance to understand exactly how this might affect Beam's type checking and hinting.
Specifically I'm worried about the use of typing.Generic and how that plays with IOTypeHints.
There's also the issue of using type comments instead of PEP 484 annotations that Python 3 parses and makes available at runtime.

mypy.ini Outdated
@@ -0,0 +1,19 @@
[mypy]
python_version = 2.7
Member

The tox test environment you added uses 3.7. Is 2.7 correct here?

Contributor Author

mypy needs Python 3 to run, but this flag sets the version of the code that it's checking. When the code is ready, we'll want to run this in 3.x mode.

sdks/python/apache_beam/io/gcp/pubsub.py (outdated; resolved)
"""Implements the writing of custom sinks."""

def __init__(self, sink):
super(WriteImpl, self).__init__()
self.sink = sink

def expand(self, pcoll):
# type: (pvalue.PCollection[InT]) -> pvalue.PCollection[None]
Member

IIUC, mypy will compare this annotation to the output of the plugin you wrote?

Contributor Author

the plugin only deals with the __or__ and __ror__ methods.

Contributor

But would it ensure that the __[r]or__ method agrees with the generic type of the PTransform, which agrees with this method?

Contributor Author

yes, I believe it could be used to achieve that.

sdks/python/apache_beam/typehints/decorators.py (outdated; resolved)
@@ -193,7 +200,7 @@ def __repr__(self):
self.input_types, self.output_types)


class WithTypeHints(object):
class WithTypeHints(Generic[InT, OutT]):
Member

I'm afraid that adding a second typing system here will create confusion. If I write a PTransform, do I specify an input type via Generic[] or do I use a @with_input_types decorator? Which one is enforced?

Contributor Author

@chadrik chadrik Jul 17, 2019

True. I think the somewhat unsatisfactory answer is that for the time being you need both: one for runtime checking and the other for static checking, until such a time as they can become the same. I think trying to do that all in one PR is going to be too much.

off the top of my head, the order this should probably be done is:

  1. support runtime type hints using typing module instead of typehints: https://issues.apache.org/jira/browse/BEAM-7713
  2. add static type hints to the beam code and begin enforcing it using mypy: this PR (https://issues.apache.org/jira/browse/BEAM-7746) and possibly https://issues.apache.org/jira/browse/BEAM-7712
  3. support static validation of user pipelines (mypy plugin, etc)
  4. support runtime validations based on typing annotations

There are a lot of "ifs" surrounding step 4. We may need to get to python3-only first to avoid the pitfalls of type comments. We may find that step 3 makes it less important.

Contributor

(1) is already mostly there, for the user-facing API at least. How much of (2) is required for (3) and how invasive it is seems to still be an open question, which this PR is giving good insight to. (Long term, it'd also be nice to have (2) for development as well.) Given that mypy can't even do beam.Map(lambda x: (x, 1)) I think we'll have to rely a lot on (4).

All will be much nicer once we can have python-3 style annotations in the code itself. We can't do that for our own codebase until Python 2 is LTS only, but user code could start using it much sooner.

Contributor Author

How much of (2) is required for (3) and how invasive it is seems to still be an open question, which this PR is giving good insight to.

agreed.

Given that mypy can't even do beam.Map(lambda x: (x, 1)) I think we'll have to rely a lot on (4).

I think people who want to use static typing for their pipelines will have to stop using lambdas. So there will definitely be a split between those who are willing to make that adjustment and those who want to continue relying on the current interface for dynamic typing.
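A minimal illustration, outside beam, of why lambdas limit static typing (make_ones mirrors the earlier example in this thread):

```python
from typing import Tuple, TypeVar

T = TypeVar('T')

def make_ones(x):
    # type: (T) -> Tuple[T, int]
    # A named function can carry a signature that mypy can check and propagate.
    return (x, 1)

# A lambda has nowhere to attach a type comment, so mypy treats its argument
# and return as Any, and static checking of the pipeline degrades.
make_ones_untyped = lambda x: (x, 1)
```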

All will be much nicer once we can have python-3 style annotations in the code itself. We can't do that for our own codebase until Python 2 is LTS only, but user code could start using it much sooner.

When we get to that stage, we can use this tool to convert comments to annotations: https://github.com/ilevkivskyi/com2ann

"""Abstract method that is required to be overwritten"""

def get_project_path(self, include_dists=[]):
# with_2to3 = six.PY3 and getattr(self.distribution, 'use_2to3', False)
Member

leftover comments?

Contributor Author

yeah, this command class is based on several in the distutils source, and it was unclear to me whether we'd need to support this other case, so it's there as a reminder to figure that out.

To be crystal clear, this PR is very much a WIP. Just wanted to get eyes on it before I get too far ahead of myself.

sdks/python/apache_beam/transforms/ptransform.py (outdated; resolved)
@chadrik
Contributor Author

chadrik commented Jul 17, 2019

I'm really curious to know what the general consensus is on this PR, implementation details aside.

Do you all like the idea of adding type annotations?

If I can get some subset of the current package passing, would you be willing to merge something like this in?

What do you see as the major blockers?

Contributor

@robertwb robertwb left a comment

typing module import: Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly.

Sorry, I don't completely follow this. IIUC, we can't replace apache_beam.typehints with typing until beam knows how to deal with typing internally, right?

Beam knows how to convert from typing to its internal representation (some fixes at #9109).

@robertwb Please do not merge before I get a chance to understand exactly how this might affect Beam's type checking and hinting.
Specifically I'm worried about the use of typing.Generic and how that plays with IOTypeHints.
There's also the issue of using type comments instead of PEP 484 annotations that Python 3 parses and makes available at runtime.

Yes, for sure.

@chadrik
Contributor Author

chadrik commented Jul 19, 2019

Beam knows how to convert from typing to its internal representation (some fixes at #9109).

Awesome. Getting that merged in will make this PR a lot easier.

@robertwb
Contributor

Beam knows how to convert from typing to its internal representation (some fixes at #9109).

Awesome. Getting that merged in will make this PR a lot easier.

It's in.

@chadrik
Contributor Author

chadrik commented Jul 24, 2019

It's in.

It's out :(

I see it got reverted: 8afee95

Did it end up causing a regression?

@aaltay
Member

aaltay commented Jul 24, 2019

It's in.

It's out :(

I see it got reverted: 8afee95

Did it end up causing a regression?

There is a JIRA link in the revert PR: https://issues.apache.org/jira/browse/BEAM-7798 -- It was causing test failures.

@udim
Member

udim commented Jul 24, 2019

Back in: #9125

@chadrik
Contributor Author

chadrik commented Jul 24, 2019

I left a note on that PR about transforms.core not being converted yet.

Separate question:

Are the following two examples functionally equivalent in beam?

coll = pipe | core.Map(lambda _, x: x.do_stuff(), self.thing)
coll = pipe | core.Map(lambda _, x=self.thing: x.do_stuff())

I would hope that they are, but I could see there being some difference in how the transforms get serialized.

The reason I ask is that it's going to be tricky to deal with the shifting signatures of methods like DoFn.process() and PTransform.expand(), since mypy can be a bit picky about this. I'm trying to work out some strategies for addressing this, and this is one.

@udim
Member

udim commented Jul 25, 2019

Are the following two examples functionally equivalent in beam?

coll = pipe | core.Map(lambda _, x: x.do_stuff(), self.thing)
coll = pipe | core.Map(lambda _, x=self.thing: x.do_stuff())

Here's my attempt at recreating what Beam does:

>>> import dill
>>> o = object()
>>> l = lambda x=o: x
>>> l_loaded = dill.loads(dill.dumps(l))
>>> l() == l_loaded()
False
>>> l() == l_loaded(o)
True

Passing self.thing as a default argument causes it to get pickled with the lambda and makes it a distinct object after unpickling.
I'm not sure, though, what pickling the enclosing scope would do.

@chadrik
Contributor Author

chadrik commented Jul 25, 2019

Passing self.thing as a default argument causes it to get pickled with the lambda and makes it a distinct object after unpickling.

yes, any object will be a distinct object after unpickling:

>>> import dill
>>> o = object()
>>> o_loaded = dill.loads(dill.dumps(o))
>>> o == o_loaded
False
>>> id(o)
4402307584
>>> id(o_loaded)
4405359896

But the "failure" of your test is just an implementation detail of object. object.__eq__ is likely doing an id() check, and this simply demonstrates that you should not use object with pickle/dill if equality checks are important to your code.

Luckily, it is possible for distinctly different objects to behave the same (otherwise beam wouldn't work at all):

>>> import dill
>>> o = [1, 2]
>>> o_loaded = dill.loads(dill.dumps(o))
>>> o == o_loaded
True

Anyway, IIUC, the function and the args passed to DoFn will both be serialized using dill, so the argument object in question should undergo a serialization/deserialization transformation either way. My question was more about the details of the beam internals, esp as it might pertain to portability, grpc model, and urns etc that I would not be familiar with as a beam newbie.

@chadrik
Contributor Author

chadrik commented Jul 27, 2019

I'm getting very close to getting this to pass, but I'm waiting on transforms.core to be converted from typehints to typing. Any updates on that?

@robertwb
Contributor

#9179 awaiting your review for core.py

Also, yes, the two examples are the same, unless self.thing is a deferred side input (e.g. pvalue.AsIterable(pcoll)), in which case only the first is valid (though an argument could be made that we should accept the second as well).

@chadrik
Contributor Author

chadrik commented Aug 28, 2019

I just pushed a bunch of changes. I saw some updates to tickets related to this on Jira, so I thought I'd give a little update.

This is a pretty big undertaking, and there are a lot of things in motion.

  • I wrote pyi stubs for python-future, which was a tedious little side escapade. I've asked the author for permission to add them to the python typeshed. If he/she doesn't get back to me, I'll just make a stub-only package and put it on pypi
  • I made a PR for mypy to add support for future.utils.with_metaclass, but it's waiting on the stubs to find a home
  • I've been grappling with how to properly type the callables for DoFns. I'm not satisfied with the options so might punt on that for now
  • I'm waiting on the latest version of mypy to drop, which adds official error codes. I'll then need to update mypy-runner to use these new error codes to do its filtering.

@chadrik chadrik force-pushed the python-static-typing branch 3 times, most recently from 70357d5 to b1a0478 Compare September 19, 2019 18:30
@chadrik
Contributor Author

chadrik commented Oct 27, 2019

Alright everyone, I'm happy with where this PR is now, and I think it's time to start figuring out how to get it merged.

This PR is admittedly huge, but the daunting scope is mitigated by the fact that the intent of the PR is to avoid making any behavioral changes. I recommend reviewing this PR one commit at a time, as I've tried to organize it into bite-sized chunks, and I've segregated changes with possible unintentional runtime side-effects and made note of them for extra scrutiny.

An additional consideration with respect to the size of this PR is that there is a critical mass required before mypy can be enabled in CI, so if we were to break up this PR, we wouldn't be able to enable the mypy lint until all of the changes are finally merged (the alternative is hundreds of # type: ignore comments throughout the code, which I'd rather not do). Once this PR goes in, it'll be easy to make smaller follow-ups.

I'm really excited about this PR and I think you all are going to find the code base much easier to navigate and grok once it's in (I already provide this branch to others on the team who are trying to learn beam), not to mention the deep structural analysis that you get during testing (it's helped me to identify a number of real and potential bugs already, which you'll see in my commits).

One final consideration: there will be a need for training and mentoring on static type analysis. I have some internal docs on the subject which I'm happy to share, and you can always drag me into a PR to help resolve issues.

@TheNeuralBit
Member

Thanks for doing this @chadrik! I'm very excited about this change :)

One quick question: when we switch to python 3, is there a tool that will convert all the type comments to pep 484 type hints?

@robertwb
Contributor

Thanks for the huge amount of work that's gone into this PR; I'm looking forward to the benefits it'll bring to the codebase.

In terms of getting this reviewed, I understand that mypy can't be enabled [1] until it's all in, but I think it's preferable to merge individual commits incrementally than to review commits incrementally and then try to land them all at once in a single PR. In my spot checking it seems these commits are relatively orthogonal; is that basically the case? Perhaps a reviewer could create PRs cherry-picking commits that they are able to review in one sitting and we could get them in that way? Any other ideas?

[1] Maybe we could enable it, but not make it a blocker, in which case we could at least manually track progress and slippage until we enforce it.

@chadrik
Contributor Author

chadrik commented Oct 28, 2019

when we switch to python 3, is there a tool that will convert all the type comments to pep 484 type hints?

Yes. https://github.com/ilevkivskyi/com2ann. I think we will need to manually fixup a bunch of style issues after running that, because when the comments become annotations they will exceed the line limit.

Side note: I recently experimented with enabling pylint's auto-fix features (like spotlessApply), but it won't work unless we're willing to change Beam's rather "unique" style. IIRC, the main deal-breaker is Beam's mixture of 2-space and 4-space indents depending on context (it's been a headache to set up properly in PyCharm as well; I have something that's close, but not perfect).

EDIT: I just saw the script includes an option to automatically wrap signatures, --wrap-signatures MAX_LENGTH, but even with this enabled there will be lines that exceed the length limit due to arguments with complex/compound types. But that feature will definitely save a lot of time!

In terms of getting this reviewed, I understand that mypy can't be enabled [1] until it's all in, but I think it's preferable to get individual commits merged and in incrementally than review commits incrementally and then try to get them all in at once in a single PR. In my spot checking it seems these commits are relatively orthogonal--is that basically the case? Perhaps a reviewer could create PRs cherry-picking commits that they are able to review in one sitting and we could get them in that way? Any other ideas?

I like the idea of adding the lint job but not failing until we've gotten everything in.

There are some commits that depend on others, but the relationship is not always obvious, even to me.

How about this plan: we first merge a PR that contains just type comments, so theoretically no chance of runtime changes. That would include the ignore comments, the base typing pass, and the mypy lint job with failure disabled (basically this PR up through c25792c). At that point I think it would be safe for reviewers to cherry-pick the more topical commits that remain, in any order they choose.

@robertwb
Contributor

Sounds good. That's still quite the big commit, but should be straightforward to review. (I'm not seeing that commit in the list though.)

@robertwb
Contributor

We also need to get buy-in on the list as this will be yet another "lint" that contributors will have to make happy (though generally a good thing, but it might require some education).

@chadrik
Contributor Author

chadrik commented Oct 29, 2019

First PR is here: #9915

@chadrik
Contributor Author

chadrik commented Oct 30, 2019 via email

Member

@udim udim left a comment

Partial review. I focused mainly on the test infrastructure changes.

@@ -1768,7 +1768,7 @@ class BeamModulePlugin implements Plugin<Project> {
project.exec { commandLine virtualenvCmd }
project.exec {
executable 'sh'
args '-c', ". ${project.ext.envdir}/bin/activate && pip install --retries 10 --upgrade tox==3.11.1 grpcio-tools==1.3.5"
args '-c', ". ${project.ext.envdir}/bin/activate && pip install --retries 10 --upgrade tox==3.11.1 -r ${project.rootDir}/sdks/python/build-requirements.txt"
Member

I'd rather limit the packages installed in the base virtualenv.
setup.py should contain the list of packages needed (under REQUIRED_TEST_PACKAGES in this case). Note that it already lists future as a dependency.
IIUC mypy is run via setup.py so listing the dependencies there makes sense.

Contributor Author

@chadrik chadrik Oct 30, 2019

I touched on this in your other comment, but the big picture here is that python lacks a good solution for specifying build-time requirements: i.e. deps required to run setup.py.

Entire ecosystems have been built up around solving this:

One note from the last link:

Pip 10 and later support build dependencies, but they have to be specified in a pyproject.toml file.

To use that we'd have to change python setup.py sdist to use pip, which makes me nervous. Certainly more than I'd want to change in this PR.

@robertwb may have some other input. I noticed from some comments there was an earlier attempt to use the setup_requires field of setup.py, and then it was dismantled. IIRC, setup_requires has a lot of crippling issues.

In my experience a build-requirements.txt is by far the simplest solution.

@@ -143,6 +143,7 @@ disable =
unnecessary-lambda,
unnecessary-pass,
unneeded-not,
unsubscriptable-object,
Member

Where does this trigger? Perhaps a newer version of pylint could help? (pylint-dev/pylint#2416)

Contributor Author

@chadrik chadrik Oct 30, 2019

class _ListBuffer(List[bytes]):
  """Used to support parititioning of a list."""
  def partition(self, n):
    # type: (int) -> List[List[bytes]]
    return [self[k::n] for k in range(n)]    # <------ ERROR HERE
class SdfProcessSizedElements(DoOperation):
  ...
  def progress_metrics(self):
    # type: () -> beam_fn_api_pb2.Metrics.PTransform
    with self.lock:
      metrics = super(SdfProcessSizedElements, self).progress_metrics()
      current_element_progress = self.current_element_progress()
    if current_element_progress:
      assert self.input_info is not None
      metrics.active_elements.measured.input_element_counts[
          self.input_info[1]] = 1    # <------ ERROR HERE
      metrics.active_elements.fraction_remaining = (
          current_element_progress.fraction_remaining)
    return metrics

Generally speaking, there are a handful of checks that overlap between mypy and pylint where pylint does a worse job. In the second example above, self.input_info is an Optional[Tuple] that's initialized to None, so pylint thinks it's unsubscriptable, but it's initialized to a tuple elsewhere and there's a line just above the error that says assert self.input_info is not None, so it obviously is in this context. These are all subtleties that mypy understands but pylint does not, so I think we'll see a trend of moving responsibility for these structural tests to mypy.
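For reference, the narrowing in question boils down to a few lines (a self-contained sketch with invented names, not actual Beam code): mypy treats the `assert` as proof that the attribute is no longer `None`, while pylint 2.4 still reports `unsubscriptable-object` on the line after it.

```python
from typing import Optional, Tuple


class Op(object):
  def __init__(self):
    # type: () -> None
    self.input_info = None  # type: Optional[Tuple[str, int]]

  def start(self):
    # type: () -> None
    self.input_info = ('transform_id', 0)

  def metrics_key(self):
    # type: () -> int
    # mypy narrows Optional[Tuple[str, int]] to Tuple[str, int] after this
    # assert; pylint 2.4 still flags the subscript below as unsubscriptable.
    assert self.input_info is not None
    return self.input_info[1]


op = Op()
op.start()
```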

Perhaps a newer version of pylint could help?

I'm already using the latest and greatest (one of the commits here ticks the version up from 2.4.2 to 2.4.3)

# TODO(BEAM-5414): latest grpcio-tools incompatible with latest protobuf 3.6.1.
grpcio-tools>=1.3.5,<=1.14.2
mypy-protobuf==1.12
future==0.16.0
Member

This addition of this file needs a discussion.

  • This file needs a description. Which dependencies should be included here?
    • Additionally, which dependencies should be installed by setup.py, the gradle setupVirtualenv task, gen_protos.py, etc.?
  • Ideally dependencies should only be listed once, so I am for this kind of change.
  • Why is future listed here? It's already in setup.py. I saw that some tox envs needed it, but it seems that if tox installs the apache-beam tarball in the virtualenv it creates then that should include future as well.
  • Why is mypy-protobuf here? Only the mypy command from setup.py seems to need it, so it could be listed under REQUIRED_TEST_PACKAGES there.
  • I do agree that the grpcio-tools dep should be listed in one place so maybe this file could serve that purpose.

Contributor Author

@chadrik chadrik Oct 30, 2019

This file needs a description. Which dependencies should be included here?

Good point. These are the requirements for building an sdist of beam. They happen to all be dependencies of gen_protos.py, but conceptually this file could contain other build deps like cython. I chose to omit that because it's a bit more complicated: it's only used on linux and only for certain tests.

Why is future listed here? It's already in setup.py.

gen_protos.py uses it to futurize the results of protoc. The packages in install_requires are not available at the time that gen_protos runs.

Why is mypy-protobuf here? Only the mypy command from setup.py seems to need it, so it could be listed under REQUIRED_TEST_PACKAGES there.

Also required for gen_protos

Contributor Author

Why is mypy-protobuf here?

To clarify, mypy-protobuf is required at the time that gen_protos runs (it's used by protoc), but it's not technically required by anything other than py37-mypy job. However, we'll want to include the .pyi files in sdist so that end users can get their benefits in PyCharm, mypy, etc. Which reminds me that I need to update setup.py to include pyi files in package_data!

@@ -49,6 +49,164 @@
]


def generate_urn_files(log, out_dir):
"""
Create a pyi stub for common_urns
Member

Please update the docstring.
IIUC, this does not generate a pyi stub, but several _urns.py files that are imported by common_urns.

Contributor Author

good catch!

@@ -160,16 +160,17 @@ commands =
[testenv:py27-lint]
# Checks for py2 syntax errors
deps =
-r build-requirements.txt
Member

Is this still necessary?

Contributor Author

@chadrik chadrik Oct 30, 2019

This is required if we want tox to work outside of gradle, and I really really do. Using gradle to run tox tasks adds a lot of overhead and I kind of hate it. gradle creates a complete virtualenv just to build an sdist, then it creates another virtualenv to run tox (and IIRC it does this every time). If I run tox directly it creates a single virtualenv for both sdist and the tox env and it reuses that virtualenv for subsequent runs of tox. Gradle also just seems slow to start up. Also, I can't pass individual tests to run when using gradle. Let's just say that as a python developer, the gradle experience is the thing I like least about the Beam project.

Yes, by adding the gen_proto requirements to build-requirements.txt and installing that in the setupVirtualenv task I'm making that overhead worse, but I certainly didn't start that problem: the code is currently installing tox every time that the sdist task is run, which is pointless. In other words, setupVirtualenv is currently installing the union of all possible task requirements, and I'm extending that pattern.

I could add a feature to setupVirtualenv to provide a requirements file, which would let us provide exactly the deps required for each task:

  • sdist: grpcio-tools, mypy-protobuf, future
  • tox: tox (tox will install the deps above if needed)

That would end up a net win, even with the deps added in this PR, because tox has a lot of deps. Should I do that here or in a followup PR?

I've been meaning to bring up some of these issues on the list. I assume the reason for the two step test process (sdist + tox, each with its own virtualenv) is to ensure that tox is using an sdist built in the same way every time, using python2, just as it would be for distribution?

Member

tox runs outside of gradle. (At least for me). What is the issue there?

I assume the reason for the two step test process (sdist + tox, each with its own virtualenv) is to ensure that tox is using an sdist built in the same way every time, using python2, just as it would be for distribution?

The sdist step is used for other things (building artifacts for release). That is the reason for it to exist on its own.

As for tox, I do not believe we spent time on configuring it to reuse an existing virtualenv. If that is an option we can try that.

Contributor Author

tox runs outside of gradle. (At least for me). What is the issue there?

When tox is run on jenkins, it is invoked via gradle. The tox task is defined in BeamModulePlugin.groovy like this:

      project.ext.toxTask = { name, tox_env ->
        project.tasks.create(name) {
          dependsOn 'setupVirtualenv'
          dependsOn ':sdks:python:sdist'

          doLast {
            // Python source directory is also tox execution workspace, We want
            // to isolate them per tox suite to avoid conflict when running
            // multiple tox suites in parallel.
            project.copy { from project.pythonSdkDeps; into copiedSrcRoot }

            def copiedPyRoot = "${copiedSrcRoot}/sdks/python"
            def distTarBall = "${pythonRootDir}/build/apache-beam.tar.gz"
            project.exec {
              executable 'sh'
              args '-c', ". ${project.ext.envdir}/bin/activate && cd ${copiedPyRoot} && scripts/run_tox.sh $tox_env $distTarBall"
            }
          }
          inputs.files project.pythonSdkDeps
          outputs.files project.fileTree(dir: "${pythonRootDir}/target/.tox/${tox_env}/log/")
        }
      }

and it's used here:

https://github.com/apache/beam/blob/master/sdks/python/test-suites/tox/py2/build.gradle

sdist step is used for other things (for building artifacts for release). That is the reason for it to exists on its own.

I'm not saying that the sdist task shouldn't exist; I'm saying it adds overhead to use the sdist task to generate the tarball for the tox task, instead of just letting tox generate its own sdist tarball, which it is designed to do and is its default behavior. (There's also the copying of source files, which adds more overhead, and it's unclear why it exists since tox is designed to create isolated virtualenvs per task environment.)

In the definition of toxTask above, you see that it depends on setupVirtualenv and also on sdist, and sdist itself depends on setupVirtualenv. That's why we end up running setupVirtualenv twice for each tox task run, and then tox creates yet another virtualenv to actually run the task. That's 3 virtualenvs per tox job! I've used Travis CI to set up a number of tox-based CI jobs, and the typical workflow is to ensure that tox is installed on the worker image; then there's only one virtualenv created per job, the one created by tox.

Anyway, long story short the build-requirements.txt line is required for the tox tasks to work.

Member

I agree that the gradle config is not the most efficient. The base requirements for running Python tests on gradle are pip, virtualenv, and 4 versions of the python interpreter. Tox is installed by setupVirtualenv. :sdks:python:sdist is run only once (IIUC) for all toxTasks, since running more than one sdist concurrently is racy and makes our tests flaky.
CC: @markflyhigh who set this up.

Thinking about this now, I have a hunch that the tarball in use doesn't ever include Cythonized sources (since they are created by sdist only when the cython package is present).

Source files are copied due to isolation requirements. The Cythonization process adds files which are automatically picked up if present. Perhaps a better way to do this is to exclude C files from sdist for the non-cython envs, but this is the way we do it currently.

Contributor Author

So we agree there's room for improvement. The question for the moment is, is my current solution ok?

@robertwb
Contributor

Yeah, it's a bit pedantic. By default mypy assigns Any to an unannotated type, but in the case of __init__ nothing would break or error if it were omitted. However, it's good for people to get into the habit of fully annotating their functions so that eventually we can reach complete coverage by enabling --disallow-untyped-defs:

  --disallow-untyped-defs   Disallow defining functions without type
                            annotations or with incomplete type annotations
                            (inverse: --allow-untyped-defs)

Hmm... Especially once we move to Python 3, writing the return type part of __init__(arg: str, arg2: int) -> None seems like we're crossing the point from types being helpful to types being noise. One of the main points of gradual typing is that types can be added where useful and omitted where not, and I am firmly in the camp that unneeded symbols in your code are a net negative. But possibly such calls are premature at this point.
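To make the trade-off concrete, here is what the two styles look like side by side (a sketch with invented names; the py2 type-comment form is what this PR uses, and the trailing `-> None` on `__init__` is exactly what `--disallow-untyped-defs` would demand even though mypy would infer it):

```python
class Worker2(object):
  """Python 2 style: type comments."""
  def __init__(self, name, retries=3):
    # type: (str, int) -> None
    self.name = name
    self.retries = retries


class Worker3(object):
  """Python 3 style: inline annotations.

  The `-> None` is only needed to satisfy --disallow-untyped-defs;
  without the flag, omitting it changes nothing at runtime or for mypy.
  """
  def __init__(self, name: str, retries: int = 3) -> None:
    self.name = name
    self.retries = retries
```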

@chadrik
Contributor Author

chadrik commented Oct 31, 2019

Hmm... Especially once we move to Python 3, writing the return type part of __init__(arg: str, arg2: int) -> None seems like we're crossing the point from types being helpful to types being noise.

Enabling --disallow-untyped-defs is always something we can choose not to do, if we don't feel that the pros outweigh the cons.

@chadrik
Contributor Author

chadrik commented Nov 5, 2019

Hello reviewers,
Thanks for the feedback so far! There are a few questions that I've responded to that are now in your court. I'd love to get these addressed soon so that I can minimize the merge conflicts that are piling up.

Once I have those answers your feedback will be integrated into the smaller "part 1" PR here: #9915

thanks!

@stale

stale bot commented Jan 5, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jan 5, 2020
@chadrik
Contributor Author

chadrik commented Jan 5, 2020

Keeping this as the master PR for typing. #10367 is currently under review.

@stale

stale bot commented Mar 5, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Mar 5, 2020
@stale

stale bot commented Mar 12, 2020

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Mar 12, 2020