
GSoC 2020 Project Ideas

Lily Wang edited this page Jan 3, 2021 · 1 revision
Google Summer of Code 2020

A list of project ideas for Google Summer of Code 2020.

The currently proposed projects are:

  1. Molecular volume and surface analysis
  2. Extensions of the AnalysisBase Framework
  3. Interoperability with RDKit
  4. Improved atom selections
  5. Serialize Universes for parallel computation
  6. Implement support for Gromacs' TNG file format
  7. AtomGroupGroup and Rings

Or work on your own idea! Get in contact with us to propose an idea and we will work with you to flesh it out into a full project. Raise an issue in the Issue Tracker or contact us via the developer Google group.

You can find the list of all available mentors for MDAnalysis here.


Project summary

See below for long descriptions. The difficulty is a somewhat subjective ranking, where "easy" means that we know pretty much what needs to be done, "medium" requires some additional research into best solutions as part of the project, and "challenging" is high risk/high reward where we think a solution exists but we will have to work with the student to find it and implement it.

# | Project name | Difficulty | Description | Skills | Mentors
1 | Molecular volume and surface analysis | easy | use an existing package for molecular surface area calculations to build a new analysis module | Python, MDAnalysis.analysis | @orbeckst, @IAlibay, @richardjgowers
2 | Extensions of the AnalysisBase Framework | medium/challenging | make the AnalysisBase framework more easily detectable by third-party projects and enable an auto-generated CLI | Python, bash, Python packages, CI, MD | @PicoCentauri, @joaomcteixeira, @fiona-naughton, @orbeckst
3 | Interoperability with RDKit | medium | add capability to MDAnalysis to use the RDKit API to convert data structures between MDAnalysis and RDKit | MDAnalysis, RDKit, Python, C++ (?) | @richardjgowers, @IAlibay
4 | Improved atom selections | medium | replace the selection parsing code with a more flexible parser | Python | @orbeckst, @IAlibay, @fiona-naughton
5 | Serialize Universes for parallel computation | medium/challenging | make the central Universe data structure serializable with Pickle to enable simple parallelization with Dask or MPI | Python, MDAnalysis I/O, task-based parallelization | @richardjgowers, @orbeckst, @IAlibay, @fiona-naughton
6 | Implement TNG support | challenging | define, document, and test the TNG format; write a reference implementation; write Python bindings for use with MDAnalysis | C/C++, Cython | @acmnpv (Gromacs), @richardjgowers, @orbeckst, @IAlibay
7 | AtomGroupGroup and Rings | medium | write an analysis module for pi-stacking analysis | Python | @richardjgowers, @micaela-matta

Project descriptions

Project 1: Molecular volume and surface analysis

It is often necessary to measure the volume and surface area of a biomolecule, or parts of it, over an MD trajectory. MDAnalysis currently lacks this important functionality. In this project you will implement an analysis class that calculates the molecular volume and surface area for an AtomGroup as a function of time. See issue #2439.

The FreeSASA library appears to be a suitable tool to integrate into MDAnalysis. It is released under the MIT license and has a C core and Python bindings.

By default Lee & Richards' algorithm is used, but Shrake & Rupley's is also available.

Simon Mitternacht (2016) FreeSASA: An open source C library for solvent accessible surface area calculation. F1000Research 5:189 (doi: 10.12688/f1000research.7931.1)
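To make the Shrake & Rupley approach mentioned above more concrete, here is a minimal pure-Python sketch of the idea: place test points on each atom's probe-expanded sphere and count the points that are not buried inside any neighbouring expanded sphere. This is not FreeSASA's implementation; the function names, the golden-spiral point lattice, the 1.4 probe radius default, and the brute-force neighbour loop are all assumptions chosen for brevity.

```python
import math


def sphere_points(n):
    """Return n roughly evenly spaced unit vectors (golden-spiral lattice)."""
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        z = 1.0 - (2.0 * i + 1.0) / n          # z runs from just below 1 to just above -1
        r = math.sqrt(max(0.0, 1.0 - z * z))   # radius of the circle at height z
        phi = i * golden_angle
        points.append((r * math.cos(phi), r * math.sin(phi), z))
    return points


def shrake_rupley(coords, radii, probe=1.4, n_points=200):
    """Per-atom accessible surface area (illustrative helper, O(N^2) neighbours)."""
    unit = sphere_points(n_points)
    areas = []
    for i, (ci, ri) in enumerate(zip(coords, radii)):
        big_r = ri + probe                     # probe-expanded radius of atom i
        accessible = 0
        for ux, uy, uz in unit:
            px = ci[0] + big_r * ux
            py = ci[1] + big_r * uy
            pz = ci[2] + big_r * uz
            buried = False
            for j, (cj, rj) in enumerate(zip(coords, radii)):
                if j == i:
                    continue
                cut = rj + probe
                d2 = (px - cj[0]) ** 2 + (py - cj[1]) ** 2 + (pz - cj[2]) ** 2
                if d2 < cut * cut:             # test point lies inside a neighbour
                    buried = True
                    break
            if not buried:
                accessible += 1
        areas.append(4.0 * math.pi * big_r * big_r * accessible / n_points)
    return areas
```

In an analysis class, a routine like this (or, more realistically, a call into FreeSASA) would run once per trajectory frame on the current coordinates of the selected atoms, with the per-frame totals collected as a time series.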

Objectives

For this project you would

  1. figure out if freesasa and freesasa-python can be installed as pip and conda packages; if necessary create the conda packages (on conda-forge)
  2. create test cases (use existing files in MDA and run external implementation for reference)
  3. create an analysis module MDAnalysis.analysis.sasa using the MDAnalysis.analysis.base.AnalysisBase framework.
  4. benchmark performance

As a stretch goal and depending on the performance we might also want to implement a parallel version of the analysis class in PMDA, which is easy once we have a standard MDAnalysis analysis class.

Mentors

  • @richardjgowers
  • @IAlibay
  • @orbeckst

Project 2: Extensions of the AnalysisBase Framework

MDAnalysis already contains many powerful analysis classes, for example for calculating root mean square deviations (RMSDs). However, for batch processing of several MD systems, it can also be useful to call analysis methods from the command line (see issue #2377). Powerful command-line analysis tools already exist, for example in the GROMACS project, but they are bound to that project's file formats. MDAnalysis can handle many file formats but requires some knowledge of Python and of the library itself. There are also downstream tools, such as maicos and taurenmd, that make use of the MDAnalysis analysis classes. However, building such tools on the current implementation of the AnalysisBase framework requires a lot of work.

In this project, you will extend the AnalysisBase framework so that analysis classes are easier to detect and to incorporate into third-party projects. In the end, this should allow an automatic command-line interface to be built for the analysis classes and make it easier to develop new projects based on the AnalysisBase class.
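One possible direction, sketched below under stated assumptions, is to discover analysis classes by walking the subclasses of a common base and to derive command-line options from each constructor signature. The `AnalysisBase` here is a toy stand-in and not the real MDAnalysis class; the `RMSD` class, the `mda-cli` program name, and the helpers `discover`, `build_parser`, and `main` are hypothetical names used only for illustration.

```python
import argparse
import inspect


class AnalysisBase:
    """Toy stand-in for an analysis base class (not the MDAnalysis one)."""

    def run(self):
        raise NotImplementedError


class RMSD(AnalysisBase):
    """Toy analysis; only the constructor signature matters here."""

    def __init__(self, select="all", step=1):
        self.select = select
        self.step = step

    def run(self):
        return f"RMSD of '{self.select}' every {self.step} frame(s)"


def discover(base):
    """Map lowercase class name -> class for all direct and indirect subclasses."""
    found = {}
    stack = list(base.__subclasses__())
    while stack:
        cls = stack.pop()
        found[cls.__name__.lower()] = cls
        stack.extend(cls.__subclasses__())
    return found


def build_parser(base):
    """Create one sub-command per analysis class, one option per keyword argument."""
    parser = argparse.ArgumentParser(prog="mda-cli")
    sub = parser.add_subparsers(dest="analysis", required=True)
    for name, cls in discover(base).items():
        p = sub.add_parser(name, help=(cls.__doc__ or "").strip())
        for arg, param in inspect.signature(cls.__init__).parameters.items():
            # Skip 'self' and anything without a default (no type to infer from).
            if arg == "self" or param.default is inspect.Parameter.empty:
                continue
            p.add_argument(f"--{arg}", type=type(param.default),
                           default=param.default)
        p.set_defaults(cls=cls)
    return parser


def main(argv):
    """Parse argv, instantiate the chosen analysis, and run it."""
    args = vars(build_parser(AnalysisBase).parse_args(argv))
    cls = args.pop("cls")
    args.pop("analysis")
    return cls(**args).run()
```

Calling `main(["rmsd", "--select", "protein", "--step", "5"])` instantiates the toy `RMSD` class with the parsed options. A real solution would also need to handle required arguments such as the Universe or AtomGroup, richer type information, and documentation extracted from docstrings.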

Mentors

  • @PicoCentauri
  • @joaomcteixeira
  • @fiona-naughton
  • @orbeckst

Project 3: Interoperability with RDKit

The RDKit is a cheminformatics toolkit widely used in the pharmaceutical industry. The core library is written in C++, but it is exposed (and mostly used) via Python bindings. This project would aim to allow the conversion of MDAnalysis objects to RDKit objects (i.e., mda.Universe to RDKit.Mol) and vice versa.

Relevant skills

Doing this will likely require some knowledge of C++ as well as Python.

Knowledge of both MDAnalysis and RDKit will be required.

Mentors

  • @richardjgowers
  • @IAlibay

Project 4: Improved atom selections

Selecting atoms with the atom selection language is a key functionality in MDAnalysis. Virtually every MDAnalysis script uses it. However, the underlying code is old and inflexible. In this project you will replace the atom selection parser with a better approach, namely a formally defined grammar based on the pyparsing package. Once this replacement is in place you will add new features to the selection language (e.g., see issues #104 and #371) and document them.
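To illustrate what a formally defined grammar buys, the sketch below parses a tiny subset of a selection language (`name`, `resname`, `and`, `or`, `not`, parentheses) into a predicate. It is written as a hand-rolled recursive-descent parser using only the standard library so that it stays self-contained; in the project itself the grammar rules in the comments would instead be expressed with pyparsing. The `Parser` class, the `select` helper, and the dict-based atom records are assumptions for illustration, not MDAnalysis code.

```python
import re

TOKEN = re.compile(r"\(|\)|[^\s()]+")
KEYWORDS = {"name", "resname"}                 # attribute keywords in this toy grammar
RESERVED = KEYWORDS | {"and", "or", "not", "(", ")"}


class Parser:
    """Recursive-descent parser that returns a predicate over atom records."""

    def __init__(self, text):
        self.tokens = TOKEN.findall(text)
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def take(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expr(self):                            # expr := term ("or" term)*
        left = self.term()
        while self.peek() == "or":
            self.take()
            right = self.term()
            left = (lambda a, l=left, r=right: l(a) or r(a))
        return left

    def term(self):                            # term := factor ("and" factor)*
        left = self.factor()
        while self.peek() == "and":
            self.take()
            right = self.factor()
            left = (lambda a, l=left, r=right: l(a) and r(a))
        return left

    def factor(self):                          # factor := "not" factor | "(" expr ")" | keyword value+
        tok = self.take()
        if tok == "not":
            inner = self.factor()
            return lambda a: not inner(a)
        if tok == "(":
            inner = self.expr()
            if self.take() != ")":
                raise ValueError("expected ')'")
            return inner
        if tok in KEYWORDS:
            values = set()
            while self.peek() is not None and self.peek() not in RESERVED:
                values.add(self.take())
            if not values:
                raise ValueError(f"'{tok}' needs at least one value")
            return lambda a: a[tok] in values
        raise ValueError(f"unexpected token {tok!r}")


def select(atoms, text):
    """Return the atoms (dicts with 'name' and 'resname') matching the selection."""
    parser = Parser(text)
    predicate = parser.expr()
    if parser.peek() is not None:
        raise ValueError(f"unexpected token {parser.peek()!r}")
    return [a for a in atoms if predicate(a)]
```

Because precedence (`not` binds tighter than `and`, which binds tighter than `or`) is encoded in the grammar rules rather than in ad hoc string handling, adding a new keyword or operator means adding a rule, which is the flexibility the project is after.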

As an extension of the project you could also investigate teaching MDAnalysis the selection languages of other popular packages such as VMD, Chimera, or CHARMM so that users can easily use MDAnalysis if they are already familiar with these other tools.

A good starting issue is #2436, which you could fix as part of your GSoC application.

Relevant skills

  • Python with test-driven development (the new parser must not break the old syntax as tested in the unit tests)
  • parsing formal languages

Mentors

  • @orbeckst
  • @IAlibay
  • @fiona-naughton

Project 5: Serialize Universes for parallel computation

The Universe is the central data structure in MDAnalysis: virtually everything in MDAnalysis starts with it. It contains information about the particles in the system and the trajectory. In order to use parallel analysis approaches in a transparent manner (e.g., using multiprocessing, Dask, or MPI) we want to be able to serialize a Universe into a representation that can be sent over the network or stored to disk. Based on this representation, another process should be able to recreate a copy of the Universe. Once this is possible, it is easy to write code that parallelizes analysis over trajectories with any of these approaches.

Python data structures can be serialized as "pickles" but pickles have limitations in what they can contain. In particular, access to trajectory files is difficult and will require extra code to make pickling work.

In this project you will implement serialization support for MDAnalysis.Universe. This will require code changes to large parts of the core code of MDAnalysis (in particular, the coordinate readers and the universe code), writing tests, and documenting the new functionality. As a stretch goal, you could then apply the new functionality for parallelizing analysis with a task-parallel approach.
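The core difficulty, that open file handles cannot be pickled, and one common way around it can be sketched with the standard `__getstate__`/`__setstate__` hooks: drop the handle when pickling, remember the filename and offset, and reopen on unpickling. The `LineReader` class and `roundtrip_demo` function below are hypothetical stand-ins for a trajectory reader, not MDAnalysis code, and this is only one of several possible designs.

```python
import os
import pickle
import tempfile


class LineReader:
    """Toy stand-in for a trajectory reader that keeps a file handle open."""

    def __init__(self, filename):
        self.filename = filename
        self._file = open(filename, "rb")

    def readline(self):
        return self._file.readline()

    def close(self):
        self._file.close()

    def __getstate__(self):
        # File objects cannot be pickled: store the current offset instead.
        state = self.__dict__.copy()
        state["_offset"] = self._file.tell()
        del state["_file"]
        return state

    def __setstate__(self, state):
        # Recreate the handle in the receiving process and restore the position.
        offset = state.pop("_offset")
        self.__dict__.update(state)
        self._file = open(self.filename, "rb")
        self._file.seek(offset)


def roundtrip_demo():
    """Pickle a reader mid-file; the copy resumes where the original was."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "traj.txt")
        with open(path, "wb") as fh:
            fh.write(b"frame 0\nframe 1\nframe 2\n")
        reader = LineReader(path)
        reader.readline()                      # consume "frame 0"
        clone = pickle.loads(pickle.dumps(reader))
        result = (reader.readline(), clone.readline())
        reader.close()
        clone.close()
        return result
```

Note that this approach assumes the file is reachable under the same path in the receiving process, which holds for multiprocessing on one machine but needs a shared filesystem for distributed Dask or MPI workers.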

Completion of the project will enable the seamless use of MDAnalysis in parallel workflows and substantially increase the usability of the library.

Some initial work has been completed in PR 2140 so you don't have to start from zero.

Relevant skills

  • Python
  • testing with pytest
  • MDAnalysis I/O
  • parallelization (e.g., multiprocessing, Dask, MPI)

Mentors

  • @richardjgowers
  • @orbeckst
  • @IAlibay
  • @fiona-naughton

Project 6: Implement TNG support

NOTE: This project is co-mentored between the Gromacs Development Team and MDAnalysis. Development will take place in the Gromacs Gitlab org https://gitlab.com/gromacs and discussions will primarily take place on the Gromacs developer mailing list.

The Gromacs simulation engine can generate TNG (Trajectory New Generation) files. The TNG format is more powerful than the traditional XTC and TRR formats and is expected to replace them in the next few years. However, the TNG format is still not widely adopted, not even in Gromacs. The current status of the format:

  1. TNG has all major features defined (although a more polished API might be desirable).
  2. The current C code in https://gitlab.com/gromacs/tng is considered the reference implementation by Gromacs developers.
  3. More tests are needed.
  4. At the moment no one is actively looking after the code. (At a cursory glance, documentation is hard to find; presumably https://gitlab.com/gromacs/tng/-/blob/master/Trajectoryformatspecification.mk is important, as are the JCC 2014 paper and the 2011 one.)
  5. The Gromacs devs consider porting the existing C code to C++ to be the next important step so that at least Gromacs itself can move towards TNG adoption.

In this project you would work towards finalizing the TNG format (together with mentors from Gromacs and MDAnalysis) with the goal of making it usable in any MD analysis code (including MDAnalysis). The preliminary objectives would be:

  1. Generate extensive tests.
  2. Port existing TNG C code to C++ (using test-driven development)
  3. Write the API documentation (doxygen strings + minimal example for how to use the library)
  4. Stretch goal: Create a Python wrapper (based on https://github.com/MDAnalysis/pytng or other approaches).

Note (M. Eric Irrgang): Consider Eigen/mdspan/mdarray buffer layout (and descriptors) to allow zero-copy data interchange across library binaries and allow super-fast numpy compatibility.

Relevant skills

This project would be a good way to build skills in:

  • C++ for I/O
  • Dealing with binary file formats
  • Test driven development
  • Connecting a C++ library to Python using Cython

Mentors

  • @acmnpv (Paul Bauer from the Gromacs project)
  • @richardjgowers
  • @orbeckst

Consultation:

  • M. Eric Irrgang
  • @IAlibay

Project 7: AtomGroupGroup and Rings

It is common to want to consider a group of atoms as a single site/particle, for example defining the position of a water molecule (or a larger solvent) as its center of mass. It then follows that it is useful to consider many such groupings as an array of quasi-particles, leading to something like an AtomGroup-Group or BeadGroup.

For systems with aromatic rings (e.g., benzene-like structures), each ring can be described by a position (i.e., the center of the ring) and also by a vector representing the direction it is facing. This could be implemented as a special case of AtomGroupGroup which also defines a directionality.

This project would involve

  • Implementing a class to represent this array of grouped atoms
  • Implementing a RingClass, which is the special case of the array of grouped atoms
  • Implementing ring finding functions to quickly define these groups
  • Implementing basic RingGroup-based analysis, e.g., angle between rings, pi-stacking identification.
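As a rough sketch of the geometry behind the ring objectives above, the functions below compute a ring's center, its unit normal (the direction it is facing), and the angle between two rings. The function names are hypothetical, the atoms are assumed to be given as (x, y, z) tuples in order around the ring, and Newell's method is just one convenient choice for the normal of a nearly planar ring.

```python
import math


def centroid(points):
    """Geometric center of a list of (x, y, z) points."""
    n = len(points)
    return tuple(sum(p[k] for p in points) / n for k in range(3))


def ring_normal(points):
    """Unit normal of a (near-)planar ring via Newell's method.

    The points must be listed in order around the ring.
    """
    pts = list(points)
    nx = ny = nz = 0.0
    for (x1, y1, z1), (x2, y2, z2) in zip(pts, pts[1:] + pts[:1]):
        nx += (y1 - y2) * (z1 + z2)
        ny += (z1 - z2) * (x1 + x2)
        nz += (x1 - x2) * (y1 + y2)
    length = math.sqrt(nx * nx + ny * ny + nz * nz)
    return (nx / length, ny / length, nz / length)


def ring_angle(normal_a, normal_b):
    """Angle in degrees between two ring planes, folded into [0, 90]."""
    dot = abs(sum(a * b for a, b in zip(normal_a, normal_b)))
    return math.degrees(math.acos(min(1.0, dot)))
```

A pi-stacking criterion could then combine the distance between two ring centers with the angle between their normals (small angles for parallel stacking, angles near 90 degrees for T-shaped arrangements); the specific cutoffs would be part of the project.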

Related issues:

Relevant skills

  • Python
  • Graph theory (e.g., the networkx package)

Mentors

  • @richardjgowers
  • @micaela-matta