GSoD application 2019

The application for Google Season of Docs (GSoD) 2019 asks about our experience with documentation writing and about the project. Sam Brice (NumFOCUS) helped us turn our bullet points into readable answers. Below are the answers in full sentences, together with the raw material for the application. Not all of the material was used, so we keep it here for future use.

We also keep the original list of ideas here. Not everything was used for this application.

We maintain a GSoD: Lessons Learned page to record what worked and what didn't.

Application

List previous experience with technical writers or documentation

If you or any of your mentors have worked with technical writers before, or have developed documentation, please mention this. Describe the documentation that you produced and the ways in which you worked with the technical writer. For example, describe any review processes that you used, or how the technical writer's skills were useful to your project. Explain how this previous experience may help you to work with a technical writer in Season of Docs.

MDAnalysis is widely used in the scientific community and is written by scientists for scientists. Our Python library provides an abstract and object-oriented interface to data from particle-based simulations.

The majority of our technical writing is in Python docstrings submitted through GitHub pull requests. Documentation is required before any new code or module is merged. Our repository contains a detailed Writing Documentation Guide with well-defined Documentation Standards, Style Guides, and Guidelines. In addition to static documentation within the codebase, we also have example Jupyter notebooks on GitHub and live executable binder notebooks that provide usage examples and allow easy experimentation.

As a result of presenting at scientific conferences such as SciPy and PyGrunn, we make educational videos available from our website. For our Workshops and Hackathons, we document and teach the use of MDAnalysis with other libraries such as NumPy and Dask using real-life data for experimentation. Our scientific paper, most recently published in the Proceedings of the 15th Python in Science Conference, describes in detail how MDAnalysis abstracts access to raw simulation data and presents a uniform object-oriented Python interface. This interface enables users to rapidly write code that is portable and immediately usable in virtually all biomolecular simulation communities.

Overall, we strive for consistency with the documentation styles of other major scientific Python libraries such as SciPy and NumPy to facilitate an intuitive learning experience.

Summarize your project's approach to documentation work

How does new documentation get written? How does existing documentation get reviewed and/or updated? How are documentation tasks shared among contributors? Is there anyone in the project in charge of documentation?

Developers write Python doc strings (API docs) and, where applicable, module-level overview and example docs.
During code review on a pull request, docs are also reviewed.
Docs are updated when code is updated or in response to user comments. Missing or outdated docs are logged on the issue tracker and tagged with the Component-Docs label.
API docs are reasonably complete because developers have to write them for all public-facing code according to our documentation guidelines. PRs are not merged without docs.
Documentation writers should follow the Writing Documentation style guide and rules.
Core developers share the responsibility for the documentation.

Previous experience with mentoring

Let us know whether you have participated in GSoC before. If yes, describe your achievements in that program and explain how this experience may influence the way you work in Season of Docs.

MDAnalysis has been a participant in GSoC since 2016, first from within the Python Software Foundation and then as a NumFOCUS sponsored project. Our most recent GSoC student, Ayush Suhane, integrated faster distance search algorithms for limited distances into our 0.19 Release and posted about the results on his blog. During GSoC 2017, Utkarsh Bansal ported our test framework from nose to pytest, increasing code coverage to 91% and significantly reducing total runtime. Our 2016 GSoC students Fiona Naughton and John Detlefs contributed innovative new features not found in other packages, such as a means to combine different data with trajectories and dimensionality reduction algorithms, to our biggest release ever.

Our mentors and core developers have a great deal of experience onboarding students and getting them familiar with our codebase. They are also very adept at explaining the scientific motivation for various parts of MDAnalysis in plain language that does not require deep knowledge of molecular simulations. With a clear understanding of our users' motivations and needs, GSoC students have been very successful at independently evaluating existing examples and collaborating with developers to make improvements.

MDAnalysis has participated in GSoC continuously since 2016 (first under the PSF and then under the NumFOCUS umbrella). We have mentored 5 students so far, all of whom successfully completed the program and contributed useful features to the project (2018: 2 students, see Release 0.19.x; 2017: 1 student, see Release 0.17.0; 2016: 2 students, see Release 0.16.0).
Our mentors have experience in introducing students to the code base and know how to explain the scientific motivation for the project to students who are not involved in molecular simulations. The latter is important for GSoD because a technical writer needs to understand what motivates our users and what they are looking for. We will teach the writer how to use MDAnalysis (with the help of our basic tutorial, our Workshop materials, and example Jupyter notebooks, which are also available as live binder notebooks) so that she or he is in a better position to evaluate existing examples and collaborate with the developers to develop better examples. We know how important it is to maintain ongoing communication and will be responsive via issue tracker/email/zoom teleconference.
We know that new developers need guidelines on the rules that everyone else follows. We have therefore created developer documentation on the wiki, including the Code Style Guide and documentation guidelines, which also describe how to build and preview the documentation.

Open source project activity

Open source projects should be active. For example, the Google program administrators may look at the number of contributors and the contribution trends on GitHub.

MDAnalysis has been downloaded more than 29,271 times on conda-forge. Over its lifetime our GitHub codebase has had 5,330 commits made by 121 contributors representing 65,357 lines of code. Details and results from the library have been published in the Proceedings of the 15th Python in Science Conference and the Journal of Computational Chemistry. Together, these two papers have been cited in the scientific literature more than 600 times, and the citation rate, currently 19 citations per month, is growing every year (source: Google Scholar April 2019).

Project ideas

Please fill out your ideas list describing in some detail the project(s) that you envision a technical writer could work on during the program: Similar to GSoC, the quality of the ideas list will matter most in terms of whether we are accepted. We recommend you focus on one or two ideas and give detailed information on them.

Remember: most likely, the writer you get will initially not be familiar with your project. You will need to teach them the basics to use it and start the documentation work. Keep this in mind when composing your ideas list. Google provides examples of ideas in the instructions: https://developers.google.com/season-of-docs/docs/project-ideas

The list of ideas for GSoD is maintained at https://github.com/numfocus/gsod/blob/master/2019/MDAnalysis_ideas_list.md (perform a PR).

Name: MDAnalysis

Description: Scientific Python library for the analysis of molecular dynamics simulations in biophysics, chemistry, and materials science.

website: https://www.mdanalysis.org/

repository: https://github.com/MDAnalysis/mdanalysis

email: mdnalysis-devel@googlegroups.com (all developers)

mentors: Oliver Beckstein orbeckst@gmail.com (primary contact) and Richard Gowers (richardjgowers@gmail.com)

Project name: Make it easy for new users to analyze their data

Description: MDAnalysis is a Python library that provides an abstract and object-oriented interface to data from particle-based simulations (primarily molecular dynamics simulations), which are widely used for simulating diverse systems such as the interaction of drugs with biomolecules or new materials. MDAnalysis is widely used in the scientific community and is written by scientists for scientists. Feedback from our users indicates that they like using MDAnalysis but wish that the documentation were easier to read and had more examples. The docs for scikit-learn and the PyTorch tutorials are generally cited as excellent examples of documentation and, taking these as models, we would like to improve our documentation to make it more accessible and more immediately useful for new users.

At the moment, the primary sources of information for users are

the package documentation https://www.mdanalysis.org/docs
the basic tutorial
the most recent scientific article on MDAnalysis: R. J. Gowers et al. MDAnalysis: A Python package for the rapid analysis of molecular dynamics simulations. In S. Benthall and S. Rostrup, editors, Proceedings of the 15th Python in Science Conference, pages 98-105, Austin, TX, 2016. SciPy, doi:10.25080/majora-629e541a-00e.
example Jupyter notebooks (also available as live binder notebooks)
Workshop materials
two videos from conference presentations

We identified two areas for improvement (in rank order of priority):

Restructure docs

We want to restructure our docs for user-friendliness (issue #1175) and refactor them so that they follow how the user interacts with the code rather than how the source code is organized (started in PR #1827). We envision a split into three major blocks:

introduction with examples (more like a tutorial) and explanation of the underlying principles and guiding concepts (see the 2016 MDAnalysis paper (doi:10.25080/majora-629e541a-00e), SciPy 2016 talk and the slides in the presentation scipy-MDAnalysis-Beckstein.pdf, which all outline the fundamentals)
API docs (similar to the majority of the current docs at https://www.mdanalysis.org/docs/)
developer docs (notes for developers, can be technical/arcane – e.g., some material from the wiki, details of the fundamental data structures, notes on file formats)

The current documentation is part of the code base and consists of:

Python doc strings that are directly embedded in the code and associated with functions, classes, methods, attributes, and constants. Many modules also directly contain overviews and examples.
Pages in the doc/sphinx/source directory, which consists of documents that combine multiple modules or give more general overviews. The documentation is written in restructured text and automatically processed with sphinx. As part of the continuous integration process, it is tested that these docs build correctly. Docs from the latest build are automatically and immediately published in HTML format as the "development docs" at https://www.mdanalysis.org/mdanalysis/.

We would like to maintain the ability to automatically build the docs and continue working in the sphinx framework outlined above. A technical writer would be trained in working with our current development process where changes to this documentation would be handled like other changes to the code base. This means that the writer would use git for version control and submit pull requests to the GitHub repository. As part of the standard review process, the mentors (and other developers and community members) would give rapid feedback on the contribution of the writer. Once a PR is approved, it will be merged and the docs will be autogenerated and immediately available.

Improve and expand tutorials

We have one "official" introductory tutorial and various other tutorials, but new users initially find it confusing to decide what they should look at, and the official tutorial is too long. We need to provide a better "road map" for new users and clearly lay out tutorials for different levels and with clear learning goals.

We need to split the current MDAnalysis Tutorial into multiple self-contained tutorials and sort them by level (introductory, intermediate, advanced). The tutorials can and should build on each other. There should be a top-level entry point that gives an overview of the tutorials. An initial outline would contain the following (not all content exists yet, especially at intermediate/advanced level):

Introductory level
1. Installation: installing MDAnalysis and testing trajectories (MDAnalysisTests for simple examples, MDAnalysisData for advanced examples)
2. Basic trajectory analysis: Loading data into a Universe, selecting atoms with Universe.select_atoms() as an AtomGroup, iterating through a trajectory, getting positions from AtomGroup.positions, and using numpy operations to calculate observables of interest from the positions.
3. Using analysis tools in MDAnalysis.analysis: Performing common analysis tasks such as RMSD calculation and fitting, hydrogen bond analysis, or dihedral analysis using the common analysis classes.
4. Working with AtomGroups: introduction to some often used methods of AtomGroup and how to work with multiple AtomGroups; slicing and fancy indexing of AtomGroup.
5. Writing trajectories: difference between "trajectories" and "single frame" file formats; standard code pattern for writing trajectories or single frames; writing single frames directly with AtomGroup.write()
Intermediate level
1. Selections (requires Basic trajectory analysis and Working with AtomGroups): in-depth tutorial on the selection language; dynamically updating selections
2. Working with Groups (requires Working with AtomGroups): The "container" hierarchy (Universe > Segment > Residue > Atom) and the corresponding groups SegmentGroup, ResidueGroup, AtomGroup: commonalities and differences, aggregating methods. How to work with fragments or molecules.
3. Writing selections: outputting selections for other codes
4. Working with topology information: introduction to the topology system; how to work with bonds; identify bonded atoms; working with angles and dihedrals; selections by type
5. Applying on-the-fly transformations: A unique capability of MDAnalysis is trajectory transformations that change the trajectory while it is being read and so avoid generating the intermediate files that are needed with other analysis packages. This tutorial would be based on the notebook on-the-fly-transformations.ipynb.
6. In-memory trajectories: how to use the MemoryReader to speed up analysis or generate temporary reduced system trajectories for analysis (see, e.g., Workshop notebook trajectory_magic.ipynb)
7. Visualization in notebooks with NGLView: how to use nglview with MDAnalysis (see Workshop notebook Visualisation_with_NGLView.ipynb and binder notebook nglview_drawframes.ipynb)
Advanced level
1. System building (requires Working with topology information): how to add atoms or bonds or create simple topologies from scratch; generating initial coordinates
2. Extending file reading with own code (requires System building): write a Reader for one's own custom file format and dynamically add it to MDAnalysis
3. Write your own analysis class: shows how to leverage the MDAnalysis.analysis.AnalysisBase class to create full-featured custom analysis tools.
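
As an illustration of what the introductory "Basic trajectory analysis" tutorial could cover, the following minimal sketch loads a Universe, selects an AtomGroup with Universe.select_atoms(), iterates through the trajectory, and computes an observable from AtomGroup.positions. The helper names radius_of_gyration and rgyr_per_frame and the choice of observable are hypothetical and chosen only for illustration; they are not part of the existing tutorial. The helper is written in plain Python so that it can be tried without any dependencies, although an actual tutorial would more likely use NumPy operations on the positions array.

```python
import math


def radius_of_gyration(positions, masses):
    """Mass-weighted radius of gyration for the coordinates of one frame.

    `positions` is a sequence of (x, y, z) rows and `masses` a sequence of
    the same length; NumPy arrays work as well as plain lists.
    """
    total_mass = sum(masses)
    # Center of mass, one coordinate axis at a time.
    center = [
        sum(m * p[axis] for p, m in zip(positions, masses)) / total_mass
        for axis in range(3)
    ]
    # Mass-weighted sum of squared distances from the center of mass.
    weighted_sq = sum(
        m * sum((p[axis] - center[axis]) ** 2 for axis in range(3))
        for p, m in zip(positions, masses)
    )
    return math.sqrt(weighted_sq / total_mass)


def rgyr_per_frame(topology, trajectory, selection="protein"):
    """Hypothetical helper: one radius of gyration value per trajectory frame."""
    # Imported here so that the helper above can be used without MDAnalysis.
    import MDAnalysis as mda

    u = mda.Universe(topology, trajectory)   # load data into a Universe
    atoms = u.select_atoms(selection)        # selection returns an AtomGroup
    values = []
    for ts in u.trajectory:                  # positions change with each frame
        values.append(radius_of_gyration(atoms.positions, atoms.masses))
    return values
```

Assuming the MDAnalysisTests package is installed, calling rgyr_per_frame(PSF, DCD) with PSF and DCD imported from MDAnalysisTests.datafiles would return one value per frame of the test trajectory. The topology must provide masses for the selected atoms.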

For the tutorials and other documents, we want to start adding example Jupyter notebooks (such as the first few example notebooks) to our sphinx-based restructured text documentation via the nbsphinx extension.

We also want to include more diagrams, pictures, and graphs to make clearer what the relationships between different parts of the code are and what output might look like.

The project suggestions below will not be included in the first proposal so that we can focus on our two initial priorities and present a stronger proposal.

How-to Library

We want to create a library of short how-tos (mini-tutorials) that quickly demonstrate how to obtain a specific result. The ideal format is a Jupyter notebook with runnable code, but reading notebooks can be cumbersome on the web. Therefore, the notebooks will be integrated with normal documentation in restructured text.

We already have a few example notebooks that can run on Binder but we are not covering a sufficient range of examples, and examples might be outdated. Ideally we would regularly test if the notebooks still work.

Developers need to identify a small number of FAQ cases that would make for good small self-contained how-tos
Initial selection of examples
- B-factor coloring
- PDB file manipulation (e.g., adding ChainIDs)
- selections with updating AtomGroups
- trajectory conversion
- RMSD analysis
- RMSF analysis
- dihedral analysis
Jupyter notebooks need to be written that solve these problems. These notebooks should use example data and should be executable on binder. They can use either data files from MDAnalysisTests or MDAnalysisData.
For each case, a short how-to document needs to be written (in restructured text and processed with sphinx) that
- links to installation instructions and introductory tutorials (basically: all the things one needs to know)
- integrates the notebook in the documentation via the nbsphinx extension so that there's no need to leave the docs and switch to a notebook viewer.
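
A minimal sketch of how notebooks could be wired into the sphinx build via nbsphinx is shown below. The extension list is an assumption made for illustration and is not the project's actual conf.py; only the settings relevant to notebooks are shown.

```python
# Excerpt of a sphinx conf.py (illustrative; not the actual MDAnalysis configuration)
extensions = [
    "sphinx.ext.autodoc",  # API docs generated from docstrings
    "nbsphinx",            # render Jupyter notebooks as documentation pages
]

# "never" uses the outputs stored in the notebooks instead of re-running them;
# "always" executes every notebook at build time, and "auto" executes only
# notebooks that have no stored outputs.
nbsphinx_execute = "never"

# Skip Jupyter checkpoint copies so that notebooks are not picked up twice.
exclude_patterns = ["_build", "**.ipynb_checkpoints"]
```

With such a configuration, a notebook placed in the documentation source directory can be listed in a toctree like any restructured text page. Setting nbsphinx_execute to "always" would be one way to check regularly that the notebooks still work, at the cost of longer builds.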

Background and Algorithms

Loosely connected to the introductory section of the tutorial, we would like to compile a more extensive and in-depth document on the algorithms and data structures. This would be more scholarly and could conceivably be expanded to a paper on fundamentals of analyzing molecular dynamics simulations. It would require more diagrams and graphs and more emphasis on citations. Topics to be included:

algorithms for distance calculations, including treatment of periodic boundaries: difference between distance_array and self_distance_array as well as capped_distance_array; how calc_bonds works.
Treatment of periodic boundaries in different contexts; wrapping/unwrapping of molecules
algorithms for commonly performed analyses
- hydrogen bonds and hydrogen bond correlations
- RMSD
- RMSF
- contacts
- dihedrals and Ramachandran plots
- PCA
- radial distribution functions and volumetric densities
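
To make the periodic-boundary item above concrete, here is a minimal plain-Python sketch of the minimum image convention for an orthorhombic box, together with the two shapes of distance calculation that the list distinguishes: all pairs between two sets of points versus all unique pairs within one set. The function names are hypothetical, and this is not the MDAnalysis implementation; it only illustrates the idea that the background document would explain.

```python
import math
from itertools import combinations


def minimum_image_distance(a, b, box):
    """Distance between points a and b in an orthorhombic periodic box.

    `box` holds the three box edge lengths. Each displacement component is
    shifted by a whole number of box lengths so that the nearest periodic
    image of b is used.
    """
    total = 0.0
    for xa, xb, length in zip(a, b, box):
        d = xb - xa
        d -= length * round(d / length)
        total += d * d
    return math.sqrt(total)


def pair_distances(set_a, set_b, box):
    """All distances between two sets: a len(set_a) x len(set_b) nested list."""
    return [[minimum_image_distance(a, b, box) for b in set_b] for a in set_a]


def self_pair_distances(points, box):
    """All unique pairs within one set: a flat list of n * (n - 1) / 2 values."""
    return [minimum_image_distance(a, b, box) for a, b in combinations(points, 2)]
```

For example, in a cubic box with edge length 10, the points (1, 1, 1) and (9, 1, 1) are 2 apart through the periodic boundary rather than 8 apart. A production implementation would operate on arrays and would also need to treat non-orthorhombic boxes, which this sketch does not.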
