Bioinformatics

DNA Sequence Analyzer

Python, Biopython, FASTA, CSV

The problem

Bench biologists often need a complete summary of a FASTA sequence (nucleotide counts, GC content, restriction sites, and the encoded protein) without stitching together five separate tools.

What I built

A Python toolkit that parses multi-sequence FASTA files with Biopython SeqIO and computes, per sequence: nucleotide counts, GC content, reverse complement, start-codon positions, restriction-enzyme cut sites (EcoRI, BamHI, HindIII, NotI, SpeI), and standard-genetic-code protein translation that terminates at the first stop codon. The output is a flat CSV ready for downstream work.
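The per-sequence calculations can be sketched in a few lines. This is a minimal, standard-library-only illustration, not the project's actual code. The project parses FASTA with Biopython SeqIO, whereas this sketch takes a plain string so it runs without dependencies. The function names (`find_all`, `reverse_complement`, `translate_to_stop`, `analyze`) are hypothetical. Two behaviours are assumptions: translation starts at the first base in frame 1, and start codons are counted as every `ATG` in any frame. Reported site positions are 0-based starts of the recognition sequence, not the exact cut position within it.

```python
from itertools import product

# Recognition sequences for the five enzymes named above.
RESTRICTION_SITES = {
    "EcoRI": "GAATTC",
    "BamHI": "GGATCC",
    "HindIII": "AAGCTT",
    "NotI": "GCGGCCGC",
    "SpeI": "ACTAGT",
}

# Standard genetic code, with codons enumerated in T, C, A, G order.
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product("TCAG", repeat=3), _AMINO)}


def find_all(seq, motif):
    """Return 0-based start positions of every (possibly overlapping) match."""
    positions = []
    i = seq.find(motif)
    while i != -1:
        positions.append(i)
        i = seq.find(motif, i + 1)
    return positions


def reverse_complement(seq):
    """Complement each base, then reverse the strand."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate_to_stop(seq):
    """Translate frame 1 codon by codon, stopping at the first stop codon."""
    protein = []
    for i in range(0, len(seq) - 2, 3):
        aa = CODON_TABLE.get(seq[i:i + 3], "X")  # "X" marks ambiguous codons
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)


def analyze(seq):
    """Build one summary record for a single nucleotide sequence."""
    seq = seq.upper()
    counts = {base: seq.count(base) for base in "ATCG"}
    gc = 100 * (counts["G"] + counts["C"]) / len(seq) if seq else 0.0
    return {
        "counts": counts,
        "gc_percent": round(gc, 2),
        "reverse_complement": reverse_complement(seq),
        "start_positions": find_all(seq, "ATG"),
        "cut_sites": {n: find_all(seq, s) for n, s in RESTRICTION_SITES.items()},
        "protein": translate_to_stop(seq),
    }


summary = analyze("ATGGAATTCAAATAGGGATCC")
print(summary["protein"], summary["gc_percent"], summary["cut_sites"]["EcoRI"])
```

Each record returned by `analyze` maps naturally onto one CSV row, with the list-valued fields joined into a single cell.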

Data preview

Sequence	Nucleotides	GC%	Start codons	Cut sites	Protein
CDS-1	A2938 T4245 C2008 G2734	39.77	338	EcoRI, BamHI, HindIII×6, SpeI×2	MANQYVLRVADCTNVYYTRLWSSREAVSVYGAAAACGF…3,974 aa
CDS-2	A2129 T2736 C1410 G1741	39.31	236	EcoRI×2, HindIII×4	EPCSEHHVIRAFDIYNKDVACITKFPKINCVRFRNTGM…2,671 aa
CDS-3	A894 T1147 C632 G708	39.63	96	SpeI×2	MALIFVLMLITLYRCPFVLCNFQVCTDQLRQQEVYLPN…1,126 aa
CDS-4	A147 T235 C130 G133	40.78	10	none	MIGGLFSVGFEQFIQHANVTTGGALTALAAQPLINYGT…214 aa
CDS-5	A58 T98 C41 G40	34.18	4	SpeI	MLPSFLRVFNDEGVVLSVLFWLLFIIILLLFSIAMLKT…78 aa
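As a quick consistency check, the GC% column can be recomputed from the nucleotide counts in the same row, using GC% = 100 × (G + C) / (A + T + C + G). The sketch below copies the counts and protein lengths from the preview. The variable and function names are illustrative only. It also shows that each protein length equals the nucleotide total divided by three, minus one. That is what one would expect if each sequence is a complete reading frame ending in a single stop codon, though the preview alone does not confirm it.

```python
# Counts, GC%, and protein lengths copied from the data preview above.
PREVIEW = {
    "CDS-1": ({"A": 2938, "T": 4245, "C": 2008, "G": 2734}, 39.77, 3974),
    "CDS-2": ({"A": 2129, "T": 2736, "C": 1410, "G": 1741}, 39.31, 2671),
    "CDS-3": ({"A": 894, "T": 1147, "C": 632, "G": 708}, 39.63, 1126),
    "CDS-4": ({"A": 147, "T": 235, "C": 130, "G": 133}, 40.78, 214),
    "CDS-5": ({"A": 58, "T": 98, "C": 41, "G": 40}, 34.18, 78),
}


def gc_percent(counts):
    """GC content as a percentage of all counted bases, to two decimals."""
    total = sum(counts.values())
    return round(100 * (counts["G"] + counts["C"]) / total, 2)


for name, (counts, reported_gc, protein_len) in PREVIEW.items():
    total = sum(counts.values())
    gc_ok = gc_percent(counts) == reported_gc
    # One codon per residue plus one stop codon would account for every base.
    length_ok = total == 3 * (protein_len + 1)
    print(f"{name}: {total} nt, GC {gc_percent(counts)}%, "
          f"GC matches={gc_ok}, length consistent={length_ok}")
```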

The outcome

Validated on the SARS-related coronavirus reference genome (NC_034972.1). One run produces a full bioinformatics summary per sequence, ready for cell-line work, restriction cloning, or expression-construct design.

Source

github.com/Sleepytimebaby/DNA-Sequence-Analyzer-with-Protein-Translation
