Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Policy for docstring consistency among graph classes #5584

Open
rossbar opened this issue Apr 26, 2022 · 5 comments
Open

Policy for docstring consistency among graph classes #5584

rossbar opened this issue Apr 26, 2022 · 5 comments
Labels
Discussion Issues for discussion and decision making

Comments

@rossbar
Copy link
Contributor

rossbar commented Apr 26, 2022

This discussion was originally brought up in #5529 in the context of updating the degree method docstring for the MultiGraph class. The question then becomes: what should be done about the degree method for the other classes (Graph, DiGraph, etc.)? What is the best approach for keeping the class+method docstrings in sync (and correct from a typing perspective) while taking into account the differences between the classes. Original discussion from @dschult below:


The 4 graph classes have overlapping and intentionally repetitive doc_strings across the classes. Whenever we change one, we need to change -- or at least be aware of -- the others. If we improve one, we should make that change to the others. If we put special comments about one graph class into one doc_string, we should try to keep that comment easily identified so that readers can recognize the rest as being the same. For example, adding a "Note" to describe the differences rather than just changing one or two words in a paragraph in the doc_string.

There is a tension between keeping the doc_strings in all graph classes close to being the same, and adding the tweaks to make them specialized for each class. When the docs are the same for a method that belongs to e.g. Graph, DiGraph, MultiGraph and MultiDiGraph, there is a unified presentation. Maintainers have fewer doc_strings to parse (but more to remember to update), and users can more easily recognize the docs when they look at the same method in another class and understand that the methods do the same thing. When the docs differ, the presentation diverges over time but can be tailored to each class. Maintainers don't have to remember to go and change all 4 copies of the doc_string when one is changed (though maybe they should because it is probably better for all the classes). And users get docs that are specialized for that class. But they lose the ability to see what is the same across the classes.

You can see our attempts to keep the doc_strings the same in the Examples section just below the changes in this PR.
>>> G = nx.Graph() # or nx.DiGraph() or nx.MuiltiGraph() or nx.MultiDiGraph()
appears in all 4 graph classes. This way the user can understand that they all would give this same output after just reading one doc_string. We could move toward making each graph class use the examples with that graph class. But I would claim that the doc_string would be ever so slightly less useful.

Perhaps we should change the return type to MultiDegreeView or DegreeView or OutDegreeView or MultiOutDegreeView or int. But that certainly doesn't seem clear. And it might mess up typing. We could try to add another line to the doc_string stating that the return type varies from one class to another but that they are basically the same.

If we choose to make the doc_strings differ by graph class, then we should maybe make the examples differ by graph class as well. That would test all the graph classes for the example in this doc_string (which we currently don't do). But someone will have to keep track of the doc_strings in the 4 classes and make sure that improvements in one get updated in the others.

What is the better philosophy about the methods for doc_strings of similar subclasses? I am reminded of the 7 subclasses in scipy.sparse for the different types of sparse representations.

@rossbar rossbar added the Question A question about NetworkX or network science in general label Apr 26, 2022
@rossbar
Copy link
Contributor Author

rossbar commented Apr 28, 2022

I agree that consistency among the method docstrings for the various classes should be the goal. Based on my current understanding of how docstrings work with inheritance, there are two approaches, each with their own advantages and drawbacks:

  1. Define a single, generic docstring for the method in the base class which will be inherited by the others, or
  2. Define an explicit method docstring in each subclass.

Option 1. has the advantage that there is only one docstring that ever needs to be modified, and the method docstrings in any derived class are automatically in sync as they are inherited from the base class. The downside is then that extra effort needs to be made to make sure the docstring is generic enough to apply to all of the subclasses. This will likely have consequences in terms of typing (e.g. parameter/return types), and there will likely be cases where a docstring may have to have extra information that is only relevant for a subset of the derived classes (e.g. in the Examples section).

Option 2. has the advantage that it is completely flexible - each docstring can be perfectly tailored to the individual class. The downside is that it is more difficult to maintain consistency between the docstrings, as illustrated by #5529. For example, if there is a necessary change that should apply to the method docstring for all of the graph classes, it needs to be explicitly changed in 4 separate docstrings.

There is another option which I didn't enumerate, which is to try to combine the two approaches by developing a docstring generator that takes in a template docstring and automatically modifies it for each class. See e.g. #5416. Generally I think this really only works well for simple cases (e.g. __doc__.replace("Graph", "MultiGraph")) but as the needs become more nuanced my sense is that this can get quite complex and difficult to verify. Of the possible approaches this is the one I'm most wary of!

Without taking a detailed look at the method docstrings for the inherited classes (which is not only the graph classes by the way... this also applies to the various views etc.) I don't have a sense for whether one approach is better than another. In the case of degree in #5529 I think it's sort of the worst of both worlds - i.e. we have copied docstrings between the four classes but without really customizing each one to eliminate duplicate/non-specific info. My approach to answering this question would be to conduct a case study; i.e. take an example method docstring (e.g. degree) and try both approaches:

  1. In one branch, define only a single docstring for Graph.degree
  2. In a second branch, define a docstring for each individual class .degree method, with an emphasis on maintaining consistency in the style between each class.

and evaluate the procedure: how much information is generic? how significant is the improvement when the information is customized on a per-class basis? How much effort/review goes into ensuring that the docstrings are all consistent?

This exercise may also unlock new/better ideas involving templating/auto-generating docstrings.

@dschult
Copy link
Member

dschult commented Apr 28, 2022

Nice summary and statement of the problem and some possible solutions.

The current doc_strings were created as "have a single doc_string for all 4 graph cases and copy any changes to all 4 supposedly identical versions". Of course, over time they have diverged -- but I think only slightly. The idea behind copying the identical doc_strings was so that someone reading the code had the doc_string right there with the code. And no special code was needed to attache the inherited doc_strings to the class methods.

When the views and example subclasses arrived, no global effort was completed to update/unify the doc_strings.

So, it is not just degree that has 4 copies of a docstring that is not customized for the subclasses. It is actually most methods have docstrings that are not customized for the subclass. We were a small shop and made an effort to keep them synchronized while keeping them in the code of each subclass so people reading the code could see the docstring there in the code. We could keep all the doc_strings separate from their code, but we felt that keeping them close had merit -- and sometimes even helped us recall stuff without having to look at a separate module to find the docstring.

I think we didn't go with option 2 because

  • the users would have to look at all 4 docstrings to be able to tell what the differences are. The doc_strings mostly look the same, but where do they differ? (no surprises)
  • we worried about divergence being overlooked. (improvements in one that didn't get transferred to others would be hard to identify -- and might even be hard to even see how to transfer to others.

Now that we have the wonderful CI-magic, we could construct a test to ensure that we know every time a doc_string from one class is changed when the others were not updated. This would enforce synchrony without removing the docs from the code. For option 2, where the doc_strings differ, is there a CI-magic solution to ensure that they have similar style and explain the differences between subclass methods rather than just what one method does?

I like the idea of doing a case study with degree. That's a good case too because there are sufficient differences to make it useful and not always straight-forward.
:}

@rossbar
Copy link
Contributor Author

rossbar commented Apr 28, 2022

The idea behind copying the identical doc_strings was so that someone reading the code had the doc_string right there with the code.

This is a good point and something I didn't consider above - docstring inheritance works for the interactive-help use case (e.g. help() or ?/?? in IPython/Jupyter) and for the generated html docs (i.e. sphinx), but it doesn't help in the case of someone reading the source code which is obviously very important!

@dschult
Copy link
Member

dschult commented May 4, 2022

It looks like PR #5529 didn't fix the problem for InDegreeView and OutDegreeView and their counterparts in the other classes. I foresee this arising for other base class method doc_strings (e.g. edges, nodes, adj?, others?). Perhaps a broader in scope, yet more focused effort on getting all typing correct in the base class doc_strings is warranted.

@dschult
Copy link
Member

dschult commented Jun 8, 2022

I’m adding a pointer to #5699 because it relates strongly to this conversation. It started by updating the doc_string of remove_edges_from in MultiGraph. And those changes led to other improvements in related doc_strings. The doc_strings are now specific to the multigraph classes (and maybe they already were different from Graph/DiGraph, but maybe not). The method remove_edges_from is identical in both MultiGraph and MultiDiGraph (because it is inherited—so the methods are identical). The `remove_edge method is not similarly inherited, so can potentially be different for all 4 classes.

@rossbar rossbar added Discussion Issues for discussion and decision making and removed Question A question about NetworkX or network science in general labels Jun 27, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Discussion Issues for discussion and decision making
Development

No branches or pull requests

2 participants