Python De Novo DNA Assembly

Usage

Get the requirements

pip install -r requirements.txt

Assemble away!

python assembly.py fasta.txt

Run unit tests

python test.py

This DNA sequence assembler works under the assumption that there are no read errors, and that there is overlap for each of the sequences in the file.

I took an iterative approach. First, I take one sequence and find the sequence in the list that has the best overlap score. Merge the sequences and repeat with the remainder of the list until all sequences have been combined.

If n is the average length of sequence, and m is the number of sequences, this has O(mn^2) complexity. Finding best overlap of two strings is O(n^2) in order to explore all the offsets and check overlap matching. We have to repeat this m times for each sequence. In this example with 50 sequences of ~1000 length, this naive implementation suffices, however if we want to scale to many sequences of longer length, a more efficient algorithm (i.e De Bruijn graph) should be considered.

TODO:

Improve efficiency, using graph-based implementation
Handle sequences with no overlap and imperfect overlap (i.e read errors)

Name		Name	Last commit message	Last commit date
Latest commit History 9 Commits
.gitignore		.gitignore
README.md		README.md
assembly.py		assembly.py
fasta.txt		fasta.txt
requirements.txt		requirements.txt
result.txt		result.txt
test.py		test.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.gitignore

.gitignore

README.md

README.md

assembly.py

assembly.py

fasta.txt

fasta.txt

requirements.txt

requirements.txt

result.txt

result.txt

test.py

test.py

Repository files navigation

Python De Novo DNA Assembly

Usage

About

Releases

Packages

Languages

peteryin21/dnaassembly

Folders and files

Latest commit

History

Repository files navigation

Python De Novo DNA Assembly

Usage

About

Resources

Stars

Watchers

Forks

Languages