This repository has been archived by the owner on Nov 8, 2021. It is now read-only.

Refactoring Code

Maurits Kok edited this page Jun 3, 2021 · 1 revision

The current satay code base contains large scripts (such as transposon-mapping.py) that perform many tasks, together with some modules that are called from these scripts. In order to keep the code base easy to understand and accessible for testing, the code needs to be refactored. Refactoring is the technique of changing an application (either the code or the architecture) so that it behaves the same way on the outside, but is improved internally. The improvements can be in stability, performance, or reduced complexity. Generally, refactoring an existing code base with large scripts aims at breaking the code into smaller modules.

Below are some general considerations and an example of refactoring and writing modular code. If you would like more hands-on training, please have a look at these links:

Modular Code

Modular programming refers to the process of breaking a large, unwieldy programming task into separate, smaller, more manageable subtasks or modules. Individual modules can then be combined like building blocks to create a larger application.

There are several advantages to modularizing code in a large application:

  • Simplicity: Rather than focusing on the entire problem at hand, a module typically focuses on one relatively small portion of the problem. If you’re working on a single module, you’ll have a smaller problem domain to wrap your head around. This makes development easier and less error-prone.

  • Maintainability: Modules are typically designed so that they enforce logical boundaries between different problem domains. If modules are written in a way that minimizes interdependency, there is decreased likelihood that modifications to a single module will have an impact on other parts of the program. (You may even be able to make changes to a module without having any knowledge of the application outside that module.) This makes it more viable for a team of many programmers to work collaboratively on a large application.

  • Reusability: Functionality defined in a single module can be easily reused (through an appropriately defined interface) by other parts of the application (or other applications entirely). This eliminates the need to duplicate code.

  • Scoping: Modules typically define a separate namespace, which helps avoid collisions between identifiers in different areas of a program.

From: https://realpython.com/python-modules-packages/

Example workflow

The first task in the script transposon-mapping.py is loading a BAM file.

#%% LOADING BAM FILE
if bamfile is None:
    path = os.path.join('/home', 'gregoryvanbeek', 'Documents', 'data_processing')
    # filename = 'E-MTAB-4885.WT2.bam'
    filename = 'SRR062634.filt_trimmed.sorted.bam'
    bamfile = os.path.join(path,filename)
else:
    filename = os.path.basename(bamfile)
    path = bamfile.replace(filename,'')

assert os.path.isfile(bamfile), 'Bam file not found at: %s' % bamfile #check if given bam file exists

This single task would be easier to maintain and test if it was a separate function. Some observations:

  • The code is not actually loading a BAM file, only verifying its presence.
  • The code contains references to local files and folders.

This can be turned into a separate function with one task:

import os

def verify_presence_bamfile(path=None, filename=None):
    """Check that the BAM file `filename` exists in the folder `path`."""
    if not path or not filename:
        # A default behaviour, such as looking for the .bam extension in the
        # current folder, would add a second functionality to this function.
        # That task belongs in a separate function, e.g. find_bamfile.
        raise ValueError('Both path and filename are required')

    bamfile = os.path.join(path, filename)
    assert os.path.isfile(bamfile), 'Bam file not found at: %s' % bamfile

This new function can then be added to a script load_bamfile.py in a folder containing all scripts pertaining to data importing. This script could later also contain the import routines for BAM files. The existing code in transposon-mapping.py can then be replaced with:

# Add to module import
from satay.importing import verify_presence_bamfile

# Set parameters, preferably in a separate config file
PATH = "path/to/file"
FILENAME = "filename.bam"

# Run code
verify_presence_bamfile(path=PATH, filename=FILENAME)

Of course, if more file types need to be verified, the refactored function can be generalized further.
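As a rough sketch of such a generalization, the function below checks for any file and optionally for its extension. The name verify_presence_file and the extension parameter are hypothetical and not part of the satay code base; raising exceptions instead of using assert is also a design choice made here, not something the original script does.

```python
import os


def verify_presence_file(path, filename, extension=None):
    """Return the full file path after checking that the file exists.

    `extension` (e.g. ".bam") is an optional, hypothetical extra check.
    """
    filepath = os.path.join(path, filename)
    if extension is not None and not filename.endswith(extension):
        raise ValueError(f"Expected a {extension} file, got: {filename}")
    if not os.path.isfile(filepath):
        raise FileNotFoundError(f"File not found at: {filepath}")
    return filepath
```

Unlike assert statements, these exceptions are not skipped when Python runs with the -O flag, and a test can check for each failure mode separately.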

Code improvements

While you're refactoring the code base to make it more modular, you might find better ways of writing the code altogether. Might as well take the opportunity! Below are some pointers to think of.

Don't Repeat Yourself (DRY)
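When the same logic appears in several places, it can usually be moved into one function that is called where needed. A minimal sketch, assuming output files are named after the BAM file; the helper output_path and the file names are hypothetical:

```python
import os


def output_path(bamfile, extension):
    """Build an output file name next to the BAM file (hypothetical helper)."""
    return os.path.splitext(bamfile)[0] + extension


bamfile = os.path.join("data", "sample.bam")  # illustrative file name

# The naming rule lives in one place and is reused for every output file.
bed_file = output_path(bamfile, ".bed")
wig_file = output_path(bamfile, ".wig")
```

If the naming rule changes later, only output_path has to be updated.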

Remove commented out code

Commented-out code is confusing. It bloats the code base and makes it difficult to follow the true execution paths of the code. It also gets in the way of genuine comments, which can become lost in a sea of out-of-date code lines. Since the code base is under version control, there is no risk of losing the code. Just give the commit in which you delete the commented-out code a descriptive message and it will be easy to find again.

Remove unneeded comments

Use comprehensions instead of for loops
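A loop that only builds a new list can often be written as a list comprehension. The sketch below uses made-up read lengths to show both forms side by side:

```python
read_lengths = [75, 0, 101, 36, 0, 50]  # illustrative values

# Loop version: create an empty list and append in the loop body.
nonzero_loop = []
for length in read_lengths:
    if length > 0:
        nonzero_loop.append(length)

# Comprehension version: same result in a single expression.
nonzero_comp = [length for length in read_lengths if length > 0]
```

A regular for loop is still the clearer choice when the loop body has side effects or several steps.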

Ternary Operator
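Python's conditional expression (value_if_true if condition else value_if_false) can replace a short if/else block that only assigns one of two values. A small sketch with hypothetical function names:

```python
def strand_symbol_long(is_reverse):
    """if/else version: four lines to pick one of two values."""
    if is_reverse:
        symbol = "-"
    else:
        symbol = "+"
    return symbol


def strand_symbol(is_reverse):
    """Conditional-expression version of the same logic."""
    return "-" if is_reverse else "+"
```

This works best for simple choices; nested conditional expressions quickly become harder to read than a plain if/else.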

Use enumerate() instead of range()
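When both the index and the item are needed, enumerate() avoids indexing into the list by hand. The chromosome labels below are only illustrative:

```python
chromosomes = ["I", "II", "III"]

# Index-based version: the item has to be looked up via range(len(...)).
labels_range = []
for i in range(len(chromosomes)):
    labels_range.append(f"{i + 1}: {chromosomes[i]}")

# enumerate version: index and item arrive together; start=1 sets the
# first counter value.
labels_enum = []
for number, name in enumerate(chromosomes, start=1):
    labels_enum.append(f"{number}: {name}")
```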

Join iterables with zip()
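To loop over two related sequences at the same time, zip() pairs up their items. A minimal sketch with made-up gene names and counts:

```python
gene_names = ["geneA", "geneB", "geneC"]  # illustrative names
insertion_counts = [12, 0, 7]             # illustrative values

# Iterate over both lists in parallel instead of indexing each one.
for name, count in zip(gene_names, insertion_counts):
    print(f"{name}: {count} insertions")

# The pairs can also be turned directly into a dictionary.
counts_by_gene = dict(zip(gene_names, insertion_counts))
```

Note that zip() stops at the end of the shortest input, so lists of unequal length are silently truncated.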

Use assignment expression
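An assignment expression (the := operator, available from Python 3.8) assigns a value and tests it in the same statement. The sketch below parses a made-up text line; the line format is an assumption for illustration only:

```python
import re

line = "chr=IV pos=1234"  # illustrative input line

# The match object is assigned and checked in one step, so no separate
# "match = ..." line is needed before the if statement.
if (match := re.search(r"pos=(\d+)", line)) is not None:
    position = int(match.group(1))
else:
    position = None
```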

Resources