
[BEAM-7746] Add python type hints #9056

Closed
wants to merge 40 commits

Conversation

chadrik
Contributor

@chadrik chadrik commented Jul 12, 2019

EDIT: jump to the latest conversation here: #9056 (comment)


This is an early look at type annotations for the beam source. It does not attempt to address porting beam's runtime typing from its own internal typehints module to the official typing module, so perhaps I should open a new ticket just for type hinting the source.

EDIT: done. https://issues.apache.org/jira/browse/BEAM-7746

There's a lot left to do, including getting this hooked up to tests, but I wanted to get it out there so that A) I can get feedback before I get too far along, and B) in case anyone wants to join forces with me.

Some points for discussion / debate:

type annotation style

In python 2.7, we're limited to type comments, and it can be quite difficult to keep them under the 80 character limit, since they're not very flexible. I'm following the lead of typing-heavy projects like mypy and using the following style for functions with more than one or two arguments:

  def __init__(self,
               evaluation_context,  # type: EvaluationContext
               applied_ptransform,  # type: AppliedPTransform
               input_committed_bundle,
               side_inputs
               ):
    # type: (...) -> None

typing module import

Some projects establish a convention to import only the typing objects required for the current module, e.g.:

from typing import Iterable, Tuple

Since beam has its own internal typehints module I'm importing the typing module in the same fashion to avoid conflicts/confusion. However, this approach has the downside of making some type comments a lot longer and less legible, e.g. typing.Union[typing.Iterable[str], str], which may make it impossible to stay below 80 characters in some cases.
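A minimal sketch contrasting the two conventions (the helper functions are hypothetical, purely for illustration):

```python
import typing
from typing import Iterable, List, Union

def tokens_qualified(value):
    # type: (typing.Union[typing.Iterable[str], str]) -> typing.List[str]
    # Qualified style: no clash with apache_beam.typehints, but longer comments.
    return [value] if isinstance(value, str) else list(value)

def tokens_unqualified(value):
    # type: (Union[Iterable[str], str]) -> List[str]
    # Unqualified style: shorter comments, but names can shadow typehints'.
    return [value] if isinstance(value, str) else list(value)
```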

type checking pipeline construction

The type hints are useful first and foremost for developers of beam, but they should also prove useful for consumers of the beam API, i.e. developers of beam pipelines. Though it's a secondary concern at the moment, I'm trying to ensure that the use case of pipeline creation is covered, so that mypy will generate proper errors for invalid combinations of transforms prior to submission. Due to a bug/weakness in mypy this is harder than expected; I'm making progress writing a mypy plugin to patch up the problem, but it's not done yet. Using the results of this process for runtime typing (e.g. to look up coders) is out of scope for this PR.

str vs unicode

Currently I'm glossing over this problem. The correct way to annotate strings for both python 2 and 3 is to use typing.Text and typing.AnyStr, but it's a huge pain in the ass, in particular because it's often unclear where unicode is valid or invalid (I've also run into inaccuracies in the python typeshed wrt strings). In my own projects, I just annotate everything as str and handle the few cases where mypy complains about unicode by using typing.cast(str, whatever), since the problem will get a lot simpler when we switch to python3, and bytes becomes the exception. I'm open to suggestions here.
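The cast approach mentioned above can be sketched like this (the function names are hypothetical); note that typing.cast is a no-op at runtime and performs no conversion or checking:

```python
from typing import cast

def normalize(name):
    # type: (str) -> str
    return name.strip()

def handle(raw):
    # type: (object) -> str
    # Where mypy complains about a unicode/str mismatch, a cast silences it
    # without changing runtime behavior.
    return normalize(cast(str, raw))
```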


I'll probably think of more later, but I'm out of time and want to get this out the door!

Post-Commit Tests Status (on master branch)

(matrix of Jenkins build-status badges for the Go, Java, and Python SDKs across the Apex, Dataflow, Flink, Gearpump, Samza, and Spark runners)

Pre-Commit Tests Status (on master branch)

(matrix of Jenkins build-status badges for Java, Python, Go, and Website, split into non-portable and portable rows)

See .test-infra/jenkins/README for trigger phrase, status and link of all Jenkins jobs.

@chadrik
Contributor Author

chadrik commented Jul 13, 2019

R: @aaltay
R: @charlesccychen

@charlesccychen
Contributor

CC: @udim

Contributor

@robertwb robertwb left a comment


Thanks for looking at this. Some brief comments:

typing module import Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly.

type checking pipeline construction I think the goal of BEAM-7060 is specifically to make type annotations useful to pipeline developers. Making tools like mypy work on the beam codebase, and on user pipelines in general, is another (worthy) goal.

str vs unicode +1 to your approach. There may be cases where we want an explicit bytes type as well (which would be an alias of str in python 2).

@@ -44,7 +52,8 @@ def __init__(self):
self.__class__, 'word_len_dist')
self.empty_line_counter = Metrics.counter(self.__class__, 'empty_lines')

def process(self, element):
def process(self, element, *args, **kwargs):
Contributor

Why are the *args and **kwargs parameters added here?

reveal_type(lines) # PCollection[unicode*]

def make_ones(x):
# type: (T) -> Tuple[T, int]
Contributor

Can this not be inferred? Similarly below.

Contributor Author

Can you explain what you mean?

Let me try to answer what I think you're asking about.

First, even for a simple function like make_ones, mypy does not guess at the arguments and return type.

Next, as a user annotating pipelines I have a few of choices (all of them are valid):

  1. annotate functions, and let mypy infer the types of the PTransforms they are passed to
  2. annotate PTransforms
  3. annotate both, as a way to guard against unintentional incompatibilities

Option 2 would look like this:

  def make_ones(x):
    return (x, 1)

  makemap = beam.Map(make_ones)  # type: beam.Map[T, Tuple[T, int]]

To be clear, in option 2, mypy is not using the type of beam.Map to infer the type of make_ones; it simply assumes the annotation is correct, since in our thought experiment make_ones is unannotated. In other words, the following would not generate an error because make_ones is untyped:

  def make_ones(x):
    return (x + 1, 1)  # x can only be a numeric type!

  makemap = beam.Map(make_ones)  # type: beam.Map[str, Tuple[str, int]]

This module is very messy right now, so I apologize for any confusion that it's generated.

Contributor

That's really unfortunate... I'd have expected more of mypy :(. In that case our own inference (which does handle this) is all the more important.

sdks/python/apache_beam/examples/wordcount.py (outdated; resolved)
def __init__(self, topic=None, subscription=None, id_label=None,
with_attributes=False, timestamp_attribute=None):
def __init__(self,
topic=None, # type: typing.Optional[str]
Contributor

If we keep these, I wonder if it'd be easier to read if the comments were all aligned.

Contributor Author

I'll leave that up to the Beam Team, but we're still going to have the line length limit to contend with.

@@ -972,14 +979,15 @@ def expand(self, pcoll):
'or be a PTransform. Received : %r' % self.sink)


class WriteImpl(ptransform.PTransform):
class WriteImpl(ptransform.PTransform[InT, None]):
Contributor

Or should this be a PTransform[pvalue.PCollection[InT], None]?

Contributor Author

IMHO, that would be redundant. It won't change the behavior of the type checks, and it would add verbosity to a lot of class definitions, since there are numerous PTransforms throughout the code.

Contributor Author

oh, also, it will require an import of the apache_beam.pvalue module into all of those modules as well. you won't be able to protect it inside a typing.TYPE_CHECKING block since it's used in the class definition.
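The TYPE_CHECKING distinction above can be sketched in miniature; this hypothetical example uses decimal.Decimal as a stand-in for apache_beam.pvalue.PCollection:

```python
from typing import TYPE_CHECKING, Generic, TypeVar

if TYPE_CHECKING:
    # Guarded import: only needed by type comments, never at runtime.
    from decimal import Decimal

T = TypeVar('T')

class Holder(Generic[T]):
    def get(self):
        # type: () -> 'Decimal'
        # OK: type comments are never evaluated, so the guarded import suffices.
        raise NotImplementedError

# By contrast, a base class like Holder[Decimal] *is* evaluated when the class
# statement executes, so Decimal would need a real, unguarded import there.
```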

Contributor

But some transforms take and return other things than PCollections. Maybe if T is shorthand for PCollection[T] but that seems ambiguous.

@chadrik
Contributor Author

chadrik commented Jul 15, 2019

typing module import: Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly.

Sorry, I don't completely follow this. IIUC, we can't replace apache_beam.typehints with typing until beam knows how to deal with typing internally, right?

@chadrik chadrik changed the title [BEAM-7060] Add python type hints [BEAM-7746] Add python type hints Jul 15, 2019
@chadrik
Contributor Author

chadrik commented Jul 15, 2019

type checking pipeline construction I think the goal of JIRA-7060 is specifically to make type annotations useful to pipeline developers. Making tools like mypy work on the beam codebase, and user pipelines in general, is another (worthy) goal.

I made a new ticket for the latter, and changed the subject of this PR: https://issues.apache.org/jira/browse/BEAM-7746

@aaltay aaltay requested review from robertwb and udim July 17, 2019 00:35
Member

@udim udim left a comment


@robertwb Please do not merge before I get a chance to understand exactly how this might affect Beam's type checking and hinting.
Specifically I'm worried about the use of typing.Generic and how that plays with IOTypeHints.
There's also the issue of using type comments instead of PEP 484 annotations that Python 3 parses and makes available at runtime.

mypy.ini Outdated
@@ -0,0 +1,19 @@
[mypy]
python_version = 2.7
Member

The tox test environment you added uses 3.7. Is 2.7 correct here?

Contributor Author

mypy needs Python 3 to run, but this flag sets the version of the code that it's checking. When the code is ready, we'll want to run this in 3.x mode.

sdks/python/apache_beam/io/gcp/pubsub.py (outdated; resolved)
"""Implements the writing of custom sinks."""

def __init__(self, sink):
super(WriteImpl, self).__init__()
self.sink = sink

def expand(self, pcoll):
# type: (pvalue.PCollection[InT]) -> pvalue.PCollection[None]
Member

IIUC, mypy will compare this annotation to the output of the plugin you wrote?

Contributor Author

the plugin only deals with the __or__ and __ror__ methods.

Contributor

But would it ensure that the __[r]or__ method agrees with the generic type of the PTransform, which agrees with this method?

Contributor Author

yes, I believe it could be used to achieve that.

sdks/python/apache_beam/typehints/decorators.py (outdated; resolved)
@@ -193,7 +200,7 @@ def __repr__(self):
self.input_types, self.output_types)


class WithTypeHints(object):
class WithTypeHints(Generic[InT, OutT]):
Member

I'm afraid that adding a second typing system here will create confusion. If I write a PTransform, do I specify an input type via Generic[] or do I use a @with_input_types decorator? Which one is enforced?

Contributor Author

@chadrik chadrik Jul 17, 2019

True. I think the somewhat unsatisfactory answer is that for the time being you need both: one for runtime checking and the other for static checking, until such a time as they can become the same. I think trying to do that all in one PR is going to be too much.

off the top of my head, the order this should probably be done is:

  1. support runtime type hints using typing module instead of typehints: https://issues.apache.org/jira/browse/BEAM-7713
  2. add static type hints to the beam code and begin enforcing it using mypy: this PR (https://issues.apache.org/jira/browse/BEAM-7746) and possibly https://issues.apache.org/jira/browse/BEAM-7712
  3. support static validation of user pipelines (mypy plugin, etc)
  4. support runtime validations based on typing annotations

There are a lot of "ifs" surrounding step 4. We may need to get to python3-only first to avoid the pitfalls of type comments. We may find that step 3 makes it less important.

Contributor

(1) is already mostly there, for the user-facing API at least. How much of (2) is required for (3) and how invasive it is seems to still be an open question, which this PR is giving good insight to. (Long term, it'd also be nice to have (2) for development as well.) Given that mypy can't even do beam.Map(lambda x: (x, 1)) I think we'll have to rely a lot on (4).

All will be much nicer once we can have python-3 style annotations in the code itself. We can't do that for our own codebase until Python 2 is LTS only, but user code could start using it much sooner.

Contributor Author

How much of (2) is required for (3) and how invasive it is seems to still be an open question, which this PR is giving good insight to.

agreed.

Given that mypy can't even do beam.Map(lambda x: (x, 1)) I think we'll have to rely a lot on (4).

I think people who want to use static typing for their pipelines will have to stop using lambdas. So there will definitely be a split between those who are willing to make that adjustment and those who want to continue relying on the current interface for dynamic typing.
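A minimal illustration, outside beam, of why lambdas limit static typing (make_ones mirrors the earlier example in this thread):

```python
from typing import Tuple, TypeVar

T = TypeVar('T')

def make_ones(x):
    # type: (T) -> Tuple[T, int]
    # A named function can carry a signature that mypy can check and propagate.
    return (x, 1)

# A lambda has nowhere to attach a type comment, so mypy treats its argument
# and return as Any, and static checking of the pipeline degrades.
make_ones_untyped = lambda x: (x, 1)
```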

All will be much nicer once we can have python-3 style annotations in the code itself. We can't do that for our own codebase until Python 2 is LTS only, but user code could start using it much sooner.

When we get to that stage, we can use this tool to convert comments to annotations: https://github.com/ilevkivskyi/com2ann

"""Abstract method that is required to be overwritten"""

def get_project_path(self, include_dists=[]):
# with_2to3 = six.PY3 and getattr(self.distribution, 'use_2to3', False)
Member

leftover comments?

Contributor Author

yeah, this command class is based on several in the distutils source, and it was unclear to me whether we'd need to support this other case, so it's there as a reminder to figure that out.

To be crystal clear, this PR is very much a WIP. Just wanted to get eyes on it before I get too far ahead of myself.

sdks/python/apache_beam/transforms/ptransform.py (outdated; resolved)
@chadrik
Contributor Author

chadrik commented Jul 17, 2019

I'm really curious to know what the general consensus is on this PR, implementation details aside.

Do you all like the idea of adding type annotations?

If I can get some subset of the current package passing, would you be willing to merge something like this in?

What do you see as the major blockers?

Contributor

@robertwb robertwb left a comment

typing module import: Given that we're deprecating the old type hints, I think we should take a sweep through the codebase and replace all non-qualified typehints with their typing equivalents (when possible) or qualified ones (where not). This would allow us to use the common convention of importing from the typing module directly.

Sorry, I don't completely follow this. IIUC, we can't replace apache_beam.typehints with typing until beam knows how to deal with typing internally, right?

Beam knows how to convert from typing to its internal representation (some fixes at #9109).

@robertwb Please do not merge before I get a chance to understand exactly how this might affect Beam's type checking and hinting.
Specifically I'm worried about the use of typing.Generic and how that plays with IOTypeHints.
There's also the issue of using type comments instead of PEP 484 annotations that Python 3 parses and makes available at runtime.

Yes, for sure.

@chadrik
Contributor Author

chadrik commented Jul 19, 2019

Beam knows how to convert from typing to its internal representation (some fixes at #9109).

Awesome. Getting that merged in will make this PR a lot easier.

@robertwb
Contributor

Beam knows how to convert from typing to its internal representation (some fixes at #9109).

Awesome. Getting that merged in will make this PR a lot easier.

It's in.

@chadrik
Contributor Author

chadrik commented Jul 24, 2019

It's in.

It's out :(

I see it got reverted: 8afee95

Did it end up causing a regression?

@aaltay
Member

aaltay commented Jul 24, 2019

It's in.

It's out :(

I see it got reverted: 8afee95

Did it end up causing a regression?

There is a JIRA link in the revert PR: https://issues.apache.org/jira/browse/BEAM-7798 -- It was causing test failures.

@udim
Member

udim commented Jul 24, 2019

Back in: #9125

@chadrik
Contributor Author

chadrik commented Jul 24, 2019

I left a note on that PR about transforms.core not being converted yet.

Separate question:

Are the following two examples functionally equivalent in beam?

coll = pipe | core.Map(lambda _, x: x.do_stuff(), self.thing)
coll = pipe | core.Map(lambda _, x=self.thing: x.do_stuff())

I would hope that they are, but I could see there being some difference in how the transforms get serialized.

The reason I ask is that it's going to be tricky to deal with the shifting signatures of methods like DoFn.process() and PTransform.expand(), since mypy can be a bit picky about this. I'm trying to work out some strategies for addressing this, and this is one.

@udim
Member

udim commented Jul 25, 2019

Are the following two examples functionally equivalent in beam?

coll = pipe | core.Map(lambda _, x: x.do_stuff(), self.thing)
coll = pipe | core.Map(lambda _, x=self.thing: x.do_stuff())

Here's my attempt at recreating what Beam does:

>>> import dill
>>> o = object()
>>> l = lambda x=o: x
>>> l_loaded = dill.loads(dill.dumps(l))
>>> l() == l_loaded()
False
>>> l() == l_loaded(o)
True

Passing self.thing as a default argument causes it to get pickled with the lambda and makes it a distinct object after unpickling.
I'm not sure, though, what pickling the enclosing scope would do.

@chadrik
Contributor Author

chadrik commented Jul 25, 2019

Passing self.thing as a default argument causes it to get pickled with the lambda and makes it a distinct object after unpickling.

yes, any object will be a distinct object after unpickling:

>>> import dill
>>> o = object()
>>> o_loaded = dill.loads(dill.dumps(o))
>>> o == o_loaded
False
>>> id(o)
4402307584
>>> id(o_loaded)
4405359896

But the "failure" of your test is just an implementation detail of object. object.__eq__ is likely doing an id() check, and this simply demonstrates that you should not use object with pickle/dill if equality checks are important to your code.

Luckily, it is possible for distinctly different objects to behave the same (otherwise beam wouldn't work at all):

>>> import dill
>>> o = [1, 2]
>>> o_loaded = dill.loads(dill.dumps(o))
>>> o == o_loaded
True

Anyway, IIUC, the function and the args passed to DoFn will both be serialized using dill, so the argument object in question should undergo a serialization/deserialization transformation either way. My question was more about the details of the beam internals, esp as it might pertain to portability, grpc model, and urns etc that I would not be familiar with as a beam newbie.

@chadrik
Contributor Author

chadrik commented Jul 27, 2019

I'm getting very close to getting this to pass, but I'm waiting on transforms.core to be converted from typehints to typing. Any updates on that?

@robertwb
Contributor

#9179 awaiting your review for core.py

Also, yes, the two examples are the same, unless self.thing is a deferred side input (e.g. pvalue.AsIterable(pcoll)), in which case only the first is valid (though an argument could be made that we should accept the second as well).

@chadrik
Contributor Author

chadrik commented Aug 28, 2019

I just pushed a bunch of changes. I saw some updates to tickets related to this on Jira, so I thought I'd give a little update.

This is a pretty big undertaking, and there are a lot of things in motion.

  • I wrote pyi stubs for python-future, which was a tedious little side escapade. I've asked the author for permission to add them to the python typeshed. If he/she doesn't get back to me, I'll just make a stub-only package and put it on pypi
  • I made a PR for mypy to add support for future.utils.with_metaclass, but it's waiting on the stubs to find a home
  • I've been grappling with how to properly type the callables for DoFns. I'm not satisfied with the options so might punt on that for now
  • I'm waiting on the latest version of mypy to drop, which adds official error codes. I'll then need to update mypy-runner to use these new error codes to do its filtering.

@chadrik chadrik force-pushed the python-static-typing branch 3 times, most recently from 70357d5 to b1a0478 Compare September 19, 2019 18:30
@chadrik
Contributor Author

chadrik commented Oct 27, 2019

Alright everyone, I'm happy with where this PR is now, and I think it's time to start figuring out how to get it merged.

This PR is admittedly huge, but the daunting scope is mitigated by the fact that the intent of the PR is to avoid making any behavioral changes. I recommend reviewing this PR one commit at a time, as I've tried to organize it into bite-sized chunks, and I've segregated changes with possible unintentional runtime side-effects and made note of them for extra scrutiny.

An additional consideration with respect to the size of this PR is that there is a critical mass required before mypy can be enabled in CI, so if we were to break up this PR, we wouldn't be able to enable the mypy lint until all of the changes are finally merged (the alternative is hundreds of # type: ignore comments throughout the code, which I'd rather not do). Once this PR goes in, it'll be easy to make smaller follow-ups.

I'm really excited about this PR and I think you all are going to find the code base much easier to navigate and grok once it's in (I already provide this branch to others on the team who are trying to learn beam), not to mention the deep structural analysis that you get during testing (it's helped me to identify a number of real and potential bugs already, which you'll see in my commits).

One final consideration: there will be a need for training and mentoring on static type analysis. I have some internal docs on the subject which I'm happy to share, and you can always drag me into a PR to help resolve issues.

@TheNeuralBit
Member

Thanks for doing this @chadrik! I'm very excited about this change :)

One quick question: when we switch to python 3, is there a tool that will convert all the type comments to pep 484 type hints?

@robertwb
Contributor

Thanks for the huge amount of work that's gone into this PR; I'm looking forward to the benefits it'll bring to the codebase.

In terms of getting this reviewed, I understand that mypy can't be enabled [1] until it's all in, but I think it's preferable to merge individual commits incrementally than to review commits incrementally and then try to land them all at once in a single PR. In my spot checking it seems these commits are relatively orthogonal; is that basically the case? Perhaps a reviewer could create PRs cherry-picking commits that they are able to review in one sitting and we could get them in that way? Any other ideas?

[1] Maybe we could enable it, but not make it a blocker, in which case we could at least manually track progress and slippage until we enforce it.

@chadrik
Contributor Author

chadrik commented Oct 28, 2019

when we switch to python 3, is there a tool that will convert all the type comments to pep 484 type hints?

Yes. https://github.com/ilevkivskyi/com2ann. I think we will need to manually fixup a bunch of style issues after running that, because when the comments become annotations they will exceed the line limit.

Side note: I recently experimented with enabling pylint's auto-fix features (like spotlessApply), but it won't work unless we're willing to change Beam's rather "unique" style. IIRC, the main deal-breaker is Beam's mixture of 2-space and 4-space indents depending on context (it's been a headache to set up properly in PyCharm as well; I have something that's close, but not perfect).

EDIT: I just saw the script includes an option to automatically wrap signatures, --wrap-signatures MAX_LENGTH, but even with this enabled there will be lines that exceed the length limit due to arguments with complex/compound types. But that feature will definitely save a lot of time!

In terms of getting this reviewed, I understand that mypy can't be enabled [1] until it's all in, but I think it's preferable to get individual commits merged and in incrementally than review commits incrementally and then try to get them all in at once in a single PR. In my spot checking it seems these commits are relatively orthogonal--is that basically the case? Perhaps a reviewer could create PRs cherry-picking commits that they are able to review in one sitting and we could get them in that way? Any other ideas?

I like the idea of adding the lint job but not failing until we've gotten everything in.

There are some commits that depend on others, but the relationship is not always obvious, even to me.

How about this plan: we first merge a PR that contains just type comments, so theoretically no chance of runtime changes. That would include the ignore comments, the base typing pass, and the mypy lint job with failure disabled (basically this PR up through c25792c). At that point I think it would be safe for reviewers to cherry-pick the more topical commits that remain, in any order they choose.

@robertwb
Contributor

Sounds good. That's still quite the big commit, but should be straightforward to review. (I'm not seeing that commit in the list though.)

@robertwb
Contributor

We also need to get buy-in on the list as this will be yet another "lint" that contributors will have to make happy (though generally a good thing, but it might require some education).

@chadrik
Contributor Author

chadrik commented Oct 29, 2019

First PR is here: #9915

@chadrik
Contributor Author

chadrik commented Oct 30, 2019 via email

Member

@udim udim left a comment

Partial review. I focused mainly on the test infrastructure changes.

@@ -1768,7 +1768,7 @@ class BeamModulePlugin implements Plugin<Project> {
project.exec { commandLine virtualenvCmd }
project.exec {
executable 'sh'
args '-c', ". ${project.ext.envdir}/bin/activate && pip install --retries 10 --upgrade tox==3.11.1 grpcio-tools==1.3.5"
args '-c', ". ${project.ext.envdir}/bin/activate && pip install --retries 10 --upgrade tox==3.11.1 -r ${project.rootDir}/sdks/python/build-requirements.txt"
Member

I'd rather limit the packages installed in the base virtualenv.
setup.py should contain the list of packages needed (under REQUIRED_TEST_PACKAGES in this case). Note that it already lists future as a dependency.
IIUC mypy is run via setup.py so listing the dependencies there makes sense.

Contributor Author

@chadrik chadrik Oct 30, 2019

I touched on this in your other comment, but the big picture here is that python lacks a good solution for specifying build-time requirements: i.e. deps required to run setup.py.

Entire ecosystems have been built up around solving this:

One note from the last link:

Pip 10 and later support build dependencies, but they have to be specified in a pyproject.toml file.

To use that we'd have to change python setup.py sdist to use pip, which makes me nervous. Certainly more than I'd want to change in this PR.

@robertwb may have some other input. I noticed from some comments there was an earlier attempt to use the setup_requires field of setup.py, and then it was dismantled. IIRC, setup_requires has a lot of crippling issues.

In my experience a build-requirements.txt is by far the simplest solution.

@@ -143,6 +143,7 @@ disable =
unnecessary-lambda,
unnecessary-pass,
unneeded-not,
unsubscriptable-object,
Member

Where does this trigger? Perhaps a newer version of pylint could help? (pylint-dev/pylint#2416)

Contributor Author

@chadrik chadrik Oct 30, 2019

class _ListBuffer(List[bytes]):
  """Used to support parititioning of a list."""
  def partition(self, n):
    # type: (int) -> List[List[bytes]]
    return [self[k::n] for k in range(n)]    # <------ ERROR HERE
class SdfProcessSizedElements(DoOperation):
  ...
  def progress_metrics(self):
    # type: () -> beam_fn_api_pb2.Metrics.PTransform
    with self.lock:
      metrics = super(SdfProcessSizedElements, self).progress_metrics()
      current_element_progress = self.current_element_progress()
    if current_element_progress:
      assert self.input_info is not None
      metrics.active_elements.measured.input_element_counts[
          self.input_info[1]] = 1    # <------ ERROR HERE
      metrics.active_elements.fraction_remaining = (
          current_element_progress.fraction_remaining)
    return metrics

Generally speaking, there are a handful of checks that overlap between mypy and pylint where pylint does a worse job. In the second example above, self.input_info is an Optional[Tuple] that's initialized to None, so pylint thinks it's unsubscriptable, but it's initialized to a tuple elsewhere and there's a line just above the error that says assert self.input_info is not None, so it obviously is in this context. These are all subtleties that mypy understands but pylint does not, so I think we'll see a trend of moving responsibility for these structural tests to mypy.
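For reference, the narrowing in question boils down to a few lines (a self-contained sketch with invented names, not actual Beam code): mypy treats the `assert` as proof that the attribute is no longer `None`, while pylint 2.4 still reports `unsubscriptable-object` on the line after it.

```python
from typing import Optional, Tuple


class Op(object):
  def __init__(self):
    # type: () -> None
    self.input_info = None  # type: Optional[Tuple[str, int]]

  def start(self):
    # type: () -> None
    self.input_info = ('transform_id', 0)

  def metrics_key(self):
    # type: () -> int
    # mypy narrows Optional[Tuple[str, int]] to Tuple[str, int] after this
    # assert; pylint 2.4 still flags the subscript below as unsubscriptable.
    assert self.input_info is not None
    return self.input_info[1]


op = Op()
op.start()
```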

Perhaps a newer version of pylint could help?

I'm already using the latest and greatest (one of the commits here ticks the version up from 2.4.2 to 2.4.3)

# TODO(BEAM-5414): latest grpcio-tools incompatible with latest protobuf 3.6.1.
grpcio-tools>=1.3.5,<=1.14.2
mypy-protobuf==1.12
future==0.16.0
Member

This addition of this file needs a discussion.

  • This file needs a description. Which dependencies should be included here?
    • Additionally, which dependencies should be installed by setup.py, the gradle setupVirtualenv task, gen_protos.py, etc.?
  • Ideally dependencies should only be listed once, so I am for this kind of change.
  • Why is future listed here? It's already in setup.py. I saw that some tox envs needed it, but it seems that if tox installs the apache-beam tarball in the virtualenv it creates then that should include future as well.
  • Why is mypy-protobuf here? Only the mypy command from setup.py seems to need it, so it could be listed under REQUIRED_TEST_PACKAGES there.
  • I do agree that the grpcio-tools dep should be listed in one place so maybe this file could serve that purpose.

Contributor Author

@chadrik chadrik Oct 30, 2019

This file needs a description. Which dependencies should be included here?

Good point. These are the requirements for building an sdist of beam. They happen to all be dependencies of gen_protos.py, but conceptually this file could contain other build deps like cython. I chose to omit that because it's a bit more complicated: it's only used on linux and only for certain tests.

Why is future listed here? It's already in setup.py.

gen_protos.py uses it to futurize the results of protoc. The packages in install_requires are not available at the time that gen_protos runs.

Why is mypy-protobuf here? Only the mypy command from setup.py seems to need it, so it could be listed under REQUIRED_TEST_PACKAGES there.

Also required for gen_protos

Contributor Author

Why is mypy-protobuf here?

To clarify, mypy-protobuf is required at the time that gen_protos runs (it's used by protoc), but it's not technically required by anything other than py37-mypy job. However, we'll want to include the .pyi files in sdist so that end users can get their benefits in PyCharm, mypy, etc. Which reminds me that I need to update setup.py to include pyi files in package_data!

@@ -49,6 +49,164 @@
]


def generate_urn_files(log, out_dir):
"""
Create a pyi stub for common_urns
Member

Please update the docstring.
IIUC, this does not generate a pyi stub, but several _urns.py files that are imported by common_urns.

Contributor Author

good catch!

@@ -160,16 +160,17 @@ commands =
[testenv:py27-lint]
# Checks for py2 syntax errors
deps =
-r build-requirements.txt
Member

Is this still necessary?

Contributor Author

@chadrik chadrik Oct 30, 2019

This is required if we want tox to work outside of gradle, and I really really do. Using gradle to run tox tasks adds a lot of overhead and I kind of hate it. gradle creates a complete virtualenv just to build an sdist, then it creates another virtualenv to run tox (and IIRC it does this every time). If I run tox directly it creates a single virtualenv for both sdist and the tox env and it reuses that virtualenv for subsequent runs of tox. Gradle also just seems slow to start up. Also, I can't pass individual tests to run when using gradle. Let's just say that as a python developer, the gradle experience is the thing I like least about the Beam project.

Yes, by adding the gen_proto requirements to build-requirements.txt and installing that in the setupVirtualenv task I'm making that overhead worse, but I certainly didn't start that problem: the code is currently installing tox every time that the sdist task is run, which is pointless. In other words, setupVirtualenv is currently installing the union of all possible task requirements, and I'm extending that pattern.

I could add a feature to setupVirtualenv to provide a requirements file, which would let us provide exactly the deps required for each task:

  • sdist: grpcio-tools, mypy-protobuf, future
  • tox: tox (tox will install the deps above if needed)

That would end up a net win, even with the deps added in this PR, because tox has a lot of deps. Should I do that here or in a followup PR?

I've been meaning to bring up some of these issues on the list. I assume the reason for the two step test process (sdist + tox, each with its own virtualenv) is to ensure that tox is using an sdist built in the same way every time, using python2, just as it would be for distribution?

Member

tox runs outside of gradle. (At least for me). What is the issue there?

I assume the reason for the two step test process (sdist + tox, each with its own virtualenv) is to ensure that tox is using an sdist built in the same way every time, using python2, just as it would be for distribution?

The sdist step is used for other things (building artifacts for release). That is the reason for it to exist on its own.

As for tox, I do not believe we spent time on configuring it to reuse an existing virtualenv. If that is an option we can try that.

Contributor Author

tox runs outside of gradle. (At least for me). What is the issue there?

When tox is run on jenkins, it is invoked via gradle. The tox task is defined in BeamModulePlugin.groovy like this:

      project.ext.toxTask = { name, tox_env ->
        project.tasks.create(name) {
          dependsOn 'setupVirtualenv'
          dependsOn ':sdks:python:sdist'

          doLast {
            // Python source directory is also tox execution workspace, We want
            // to isolate them per tox suite to avoid conflict when running
            // multiple tox suites in parallel.
            project.copy { from project.pythonSdkDeps; into copiedSrcRoot }

            def copiedPyRoot = "${copiedSrcRoot}/sdks/python"
            def distTarBall = "${pythonRootDir}/build/apache-beam.tar.gz"
            project.exec {
              executable 'sh'
              args '-c', ". ${project.ext.envdir}/bin/activate && cd ${copiedPyRoot} && scripts/run_tox.sh $tox_env $distTarBall"
            }
          }
          inputs.files project.pythonSdkDeps
          outputs.files project.fileTree(dir: "${pythonRootDir}/target/.tox/${tox_env}/log/")
        }
      }

and it's used here:

https://github.com/apache/beam/blob/master/sdks/python/test-suites/tox/py2/build.gradle

sdist step is used for other things (for building artifacts for release). That is the reason for it to exists on its own.

I'm not saying that the sdist task shouldn't exist; I'm saying it adds overhead to use the sdist task to generate the tarball for the tox task, instead of just letting tox generate its own sdist tarball, which it is designed to do and is its default behavior. (There's also the copying of source files, which adds more overhead, and it's unclear why it exists since tox is designed to create isolated virtualenvs per task environment.)

In the definition of toxTask above, you see that it depends on setupVirtualenv and also on sdist, and sdist itself depends on setupVirtualenv. That's why we end up running setupVirtualenv twice for each tox task run, and then tox creates yet another virtualenv to actually run the task. That's 3 virtualenvs per tox job! I've used Travis CI to set up a number of tox-based CI jobs, and the typical workflow is to ensure that tox is installed on the worker image; then there's only one virtualenv created per job, the one created by tox.

Anyway, long story short the build-requirements.txt line is required for the tox tasks to work.

Member

I agree that the gradle config is not the most efficient. The base requirements for running Python tests on gradle are pip, virtualenv, and 4 versions of the python interpreter. Tox is installed by setupVirtualenv. :sdks:python:sdist is run only once (IIUC) for all toxTasks, since running more than one sdist concurrently is racy and makes our tests flaky.
CC: @markflyhigh who set this up.

Thinking about this now, I have a hunch that the tarball in use doesn't ever include Cythonized sources (since they are created by sdist only when the cython package is present).

Source files are copied due to isolation requirements. The Cythonization process adds files which are automatically picked up if present. Perhaps a better way to do this is to exclude C files from sdist for the non-cython envs, but this is the way we do it currently.

Contributor Author

So we agree there's room for improvement. The question for the moment is, is my current solution ok?

@robertwb
Contributor

Yeah, it's a bit pedantic. By default mypy assigns Any to an unannotated type, but in the case of __init__ nothing would break or error if it were omitted. However, it's good for people to get into the habit of fully annotating their functions so that eventually we can reach complete coverage by enabling --disallow-untyped-defs:

  --disallow-untyped-defs   Disallow defining functions without type
                            annotations or with incomplete type annotations
                            (inverse: --allow-untyped-defs)

Hmm... Especially once we move to Python 3, writing the return type part of __init__(arg: str, arg2: int) -> None seems like we're crossing the point from types being helpful to types being noise. One of the main points of gradual typing is that types can be added where useful and omitted where not, and I am firmly in the camp that unneeded symbols in your code are a net negative. But possibly such calls are premature at this point.
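To make the trade-off concrete, here is what the two styles look like side by side (a sketch with invented names; the py2 type-comment form is what this PR uses, and the trailing `-> None` on `__init__` is exactly what `--disallow-untyped-defs` would demand even though mypy would infer it):

```python
class Worker2(object):
  """Python 2 style: type comments."""
  def __init__(self, name, retries=3):
    # type: (str, int) -> None
    self.name = name
    self.retries = retries


class Worker3(object):
  """Python 3 style: inline annotations.

  The `-> None` is only needed to satisfy --disallow-untyped-defs;
  without the flag, omitting it changes nothing at runtime or for mypy.
  """
  def __init__(self, name: str, retries: int = 3) -> None:
    self.name = name
    self.retries = retries
```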

@chadrik
Contributor Author

chadrik commented Oct 31, 2019

Hmm... Especially once we move to Python 3, writing the return type part of __init__(arg: str, arg2: int) -> None seems like we're crossing the point from types being helpful to types being noise.

Enabling --disallow-untyped-defs is always something we can choose not to do, if we don't feel that the pros outweigh the cons.

@chadrik
Contributor Author

chadrik commented Nov 5, 2019

Hello reviewers,
Thanks for the feedback so far! There are a few questions that I've responded to that are now in your court. I'd love to get these addressed soon so that I can minimize the merge conflicts that are piling up.

Once I have those answers your feedback will be integrated into the smaller "part 1" PR here: #9915

thanks!

@stale

stale bot commented Jan 5, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Jan 5, 2020
@chadrik
Contributor Author

chadrik commented Jan 5, 2020

Keeping this as the master PR for typing. #10367 is currently under review.

@stale

stale bot commented Mar 5, 2020

This pull request has been marked as stale due to 60 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the dev@beam.apache.org list. Thank you for your contributions.

@stale stale bot added the stale label Mar 5, 2020
@stale

stale bot commented Mar 12, 2020

This pull request has been closed due to lack of activity. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

@stale stale bot closed this Mar 12, 2020