GSoD Application 2020

Richard Gowers edited this page May 2, 2020 · 6 revisions
Google Season of Docs 2020

MDAnalysis is applying for Google Season of Docs 2020. We would love to work with a technical writer to improve and extend our documentation.

Previous experience with technical writers or documentation

If you or any of your mentors have worked with technical writers before, or have developed documentation, please mention this. Describe the documentation that you produced and the ways in which you worked with the technical writer. For example, describe any review processes that you used, or how the technical writer's skills were useful to your project. Explain how this previous experience may help you to work with a technical writer in Season of Docs.

MDAnalysis worked successfully with @lilyminium in GSoD 2019 to produce a user guide separate from the technical reference. This is now the first resource that users are pointed to. It includes a Quick Start Guide to get new users up-and-running quickly; explanations of core data structures, with illustrations; and individual tutorials for each analysis supported by MDAnalysis.

@lilyminium already had experience with molecular simulation, but not MDAnalysis; therefore, our discussion was spread between both the user guide and codebase repositories, as @lilyminium learned about the code. We also asked her to subscribe to the developer and user mailing lists and, during the community bonding period, to review them and compile a list of commonly asked user questions. We believe this exposure to the wider community was essential to the success of the project. The user guide now replaces much of the older documentation that @lilyminium used to learn about MDAnalysis, and it would play a large part in onboarding new technical writers.

After GSoD @lilyminium joined MDAnalysis as a core developer, so we can draw on her previous experience with documenting MDAnalysis to guide new projects. For example, we streamlined the review process by creating the user guide in a separate repository to the main codebase, and periodically invited opinions from developers via the developer mailing list. This saved a lot of time running continuous integration tests; however, the separation meant that many users and developers were unaware of the user guide as it was being constructed, so were unable to offer feedback or nominate topics to explore. Going forward, we will ask for feedback more frequently from the user mailing list.

The majority of our technical writing remains in the form of Python docstrings submitted through GitHub pull requests. Documentation is required before any new code or module is merged. We also make educational videos of presentations at scientific conferences, such as SciPy and PyGrunn, available on our website. For our workshops and hackathons, we document and teach the use of MDAnalysis with other libraries such as NumPy and Dask, using real-life data for experimentation. Our scientific paper, most recently published in the Proceedings of the 15th Python in Science Conference, details how MDAnalysis abstracts access to raw simulation data, presenting a uniform object-oriented Python interface: one that enables users to rapidly write code that is portable and immediately usable in virtually all biomolecular simulation communities.

--- additional paragraphs that I wrote but are probably not useful ---

The user guide was created with explicit consideration of accuracy, usability and maintainability. Where possible, example code is given in a stateful ipython shell directive, so that bugs or outdated code either raise errors when the documentation is built or warn a reader that something is wrong.

The tutorials are written with nbsphinx so they can be viewed as web pages or downloaded as notebooks from the repo. Each tutorial illustrates practical usage of an analysis method, from loading the data to plotting and explaining the results of the analysis. The implementation is described for the more complex methods, or those that cite potentially closed-access papers.

Previous experience with similar programs, such as Google Summer of Code or others

If you or any of your mentors have taken part in Google Summer of Code or a similar program, mention this in your application. Describe your achievements in that program. Explain how this experience may influence the way you work in Season of Docs.

MDAnalysis has been a participant in GSoC since 2016, first from within the Python Software Foundation, and then as a NumFOCUS-sponsored project. Our most recent student, Ninad Bhat, implemented periodic boundary handling in various AtomGroup methods. In 2018, Ayush Suhane integrated faster distance search algorithms for limited distances into our 0.19 release and posted about the results on his blog. In GSoC 2017, Utkarsh Bansal ported our test framework from nose to pytest, increasing code coverage to 91% and significantly reducing total runtime. Our 2016 GSoC students, Fiona Naughton and John Detlefs, contributed innovative features not found in other packages, such as a means to combine different data with trajectories and dimensionality reduction algorithms, to our biggest release ever.

Our mentors and core developers have a great deal of experience onboarding students and getting them familiar with our codebase. They are also very adept at explaining the scientific and technical motivation for various parts of MDAnalysis in plain language. With a clear understanding of our users' motivations and needs, GSoC students have been very successful at independently evaluating existing examples and collaborating with developers to make improvements.

Projects (GSoD 2020)

| # | Project Name | Description | Mentors |
|---|--------------|-------------|---------|
| 1 | Background and algorithms | Introduce and explain the algorithms behind MDAnalysis (starting from existing docs) | ?? |
| 2 | Migration guide and cheat sheet | Concisely summarise how to do common tasks in MDAnalysis, and create a guide for users from other analysis packages | ?? |
| 3 | Best practices for coding with MDAnalysis | Create a guide for new developers explaining best practices for writing analysis | ?? |
| 4 | Handling data from specific MD engines/formats | In-depth guide to working with specific formats | ?? |
| 5 | API Guide | | @richardjgowers |

Under the hood: the algorithms of analysis

MDAnalysis is frequently used for data analysis that is published in scientific literature. In turn, most of its analysis methods implement published algorithms. Users should be able to understand how the analysis is performed, without needing to look at the code or potentially closed-access publications. We envision a resource similar to scikit-learn's user guide: an in-depth document that explains how each algorithm works, which questions it can answer, what kind of data is best suited for it, and tips and potential pitfalls. This would require diagrams, graphs, and an emphasis on citations.

Most tutorials in the user guide already summarise the analysis scheme in question and link to further resources; this can be a starting point for the technical writer. However, the tutorials are necessarily streamlined to introduce the code to an audience already familiar with simulation concepts. A guide to the algorithms should focus on introducing and explaining the point of the analysis rather than how to do it.

Topics to be included:

  • algorithms for distance calculations, including treatment of periodic boundaries
  • treatment of periodic boundaries in different contexts; un/wrapping of molecules
  • algorithms for commonly performed analysis
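
As an illustration of the kind of explanation this guide could contain, the treatment of periodic boundaries in a distance calculation can be sketched with the minimum image convention. The following is a minimal plain-Python sketch that assumes an orthorhombic box; the function name `minimum_image_distance` is hypothetical, and this is not the implementation used by MDAnalysis.

```python
import math


def minimum_image_distance(a, b, box):
    """Distance between points a and b in an orthorhombic periodic box.

    a, b: sequences of three coordinates.
    box: the three box edge lengths (assumed orthorhombic).
    """
    total = 0.0
    for ai, bi, length in zip(a, b, box):
        delta = bi - ai
        # Shift the separation by whole box lengths so that it lies
        # within half a box length of zero (the nearest periodic image).
        delta -= length * round(delta / length)
        total += delta * delta
    return math.sqrt(total)


# Two atoms near opposite faces of a 10 x 10 x 10 box: the direct
# separation along x is 9.0, but the nearest periodic image is 1.0 away.
d = minimum_image_distance((0.5, 5.0, 5.0), (9.5, 5.0, 5.0), (10.0, 10.0, 10.0))
print(d)
```

A guide entry built around a sketch like this could then explain how triclinic boxes and the un/wrapping of molecules complicate the picture.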

Migration guide and cheat sheet

Many users of MDAnalysis come with previous experience with other analysis packages; they are already familiar with the standard analysis algorithms, so they are reading the documentation to understand the interface and syntax. We are looking for a 1-2 page "cheat sheet" of common tasks as a first resource; this can be followed by a longer migration guide that explains how to reproduce, in MDAnalysis, analyses that users currently perform in other packages.

A technical writer for this project would need to survey users on what the most common tasks are, and what would be useful on the cheat sheet. Developers can also help in identifying FAQs. The writer can adapt existing tutorials in the User Guide for the examples. The developers will work with them to identify popular analysis tasks from other packages, and how to perform the same in MDAnalysis.

Common tasks to be included:

  • labelling structures with B-factors and writing them out to PDB for visualisation
  • trajectory conversion
  • RMSD/RMSF analysis
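
For readers less familiar with the last item, the two quantities can be sketched as follows. This is a plain-Python illustration of the definitions only, assuming coordinates are given as lists of (x, y, z) tuples and that no structural superposition is performed; the function names `rmsd` and `rmsf` are hypothetical and are not the MDAnalysis API that a cheat sheet would document.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally sized lists of (x, y, z) positions, without fitting."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("coordinate sets must be non-empty and the same length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))


def rmsf(frames):
    """Per-atom RMSF over a list of frames, each a list of (x, y, z) positions."""
    n_frames = len(frames)
    n_atoms = len(frames[0])
    result = []
    for i in range(n_atoms):
        # Mean position of atom i over all frames
        mean = [sum(frame[i][k] for frame in frames) / n_frames for k in range(3)]
        # Mean squared deviation of atom i from its mean position
        msd = sum(
            sum((frame[i][k] - mean[k]) ** 2 for k in range(3)) for frame in frames
        ) / n_frames
        result.append(math.sqrt(msd))
    return result


# Toy data: the second frame is the first one translated by 2.0 along z
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]

print(rmsd(reference, shifted))
print(rmsf([reference, shifted]))
```

RMSD compares two whole structures and gives one number per pair of frames, whereas RMSF gives one number per atom describing its fluctuation over a trajectory.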

Best practices for coding with MDAnalysis

MDAnalysis provides a modular toolkit for easily implementing custom analysis methods. To do so, it relies on metaprogramming and data structures that are unintuitive for the average Python user. We would like to extend the existing developer guide by adding a best practices guide for writing performant analysis. This would include a list of antipatterns to avoid. It could also include other commonly overlooked considerations, such as when and how we "guess" information from data.

Handling data from specific MD engines

MDAnalysis supports reading data from a wide variety of formats, each with its own idiosyncrasies. User expectations often do not align with how MDAnalysis actually handles data from a particular format. In-depth documentation on how to deal with data from specific engines both clarifies what MDAnalysis is doing under the hood, and empowers users to write their own input parser if they need to.

The user guide can serve as a starting point. It has a page on each file format, along with notes on particular implementation quirks.

API Guide

The current documentation is a mix of API reference and user documentation built using Sphinx. The use of Sphinx has influenced the documentation to roughly follow the structure of the source code, which is not the most intuitive approach for users. Last year's GSoD introduced a user guide, which was designed to improve on the existing documentation by focusing on what users require irrespective of the layout within the software package. This project would build upon these efforts in an orthogonal manner, by creating developer API documentation which instead follows the layout of the project source code.

This project would focus on documentation for power users and developers, explaining how to expand the package.

Example topics to be included:

  • How to include a new file reader in the package
    • In-depth guide on how to write new submodules for topology/ and coordinates/
  • Adding new topology attributes in a system
    • Explaining the topology system and surrounding metaprogramming and including new attributes
  • Writing C extensions to work with coordinates
    • Style guide and best practices for building upon calc_distances.h

Current documentation

At the moment, the primary sources of information for users are:

Additional resources include:
