Project-Sequence Assembly(Shortest common superstring)

Description: You are to implement the graph-theoretic Shortest Superstring problem. Specifications: Your input is a set of DNA « fragments » (motivated by snippets that may result from Next Generation Sequencing technologies). Your goal is to reconstruct the larger contiguous DNA strand that the fragments were sequenced from. There is a good degree of expected overlaps between fragments that guides the problem formulation. In fact, you want to find the shortest contiguous DNA sequence that contains all input fragments. This is equivalent to finding the DNA sequence containing all input fragments that maximizes the TOTAL OVERLAP between consecutive fragments.

However, be careful to first DISCARD any fragment that is a proper SUBSTRING of any other fragment. After doing so and computing the OVERLAP matrix, you may view each remaining fragment as a NODE i in a directed graph G. And G has all possible edges (i,j) weighted such that the SCORE of edge (i,j) is the value Overlap(i,j). A Hamiltonian Path in such a graph is a path that visits all vertices exactly once. Shortest superstring reduces to finding a Hamiltonian path with greatest total Overlap sum. And, following the path, you may reconstruct the Contig (the re-assembled superstring DNA strand).

Although finding Hamiltonian paths in general graphs is a hard problem (and such paths need not exist in general graphs), the types of graphs based on possible overlaps (including allowance for overlaps of zero) are completely connected and permit many possible Hamiltonian paths. The more difficult optimization is the Ham-path of maximal total overlap. However, we are contending with a heuristic approach here, based on a greedy approximation : Sort edges in descending order of their overlap scores, and proceeding in the order from maximal to minimal overlap, select the current edge as long as it is feasible to do so. All this means is that selecting the edge would NOT induce any of the following :

• A « fork » situation where there are two out-edges from a single node

• An inverse « fork » situation where there are two in-edges to a single node

• A cycle, meaning that there is already a path from the « to edge » back to the « from edge »

Remembering that a Hamiltonian path has |V|-1 edges, you know when to stop the loop.

Skills: C++ Programming, Biotechnology

See more: java shortest common superstring, shortest common superstring, shortest common superstring java code, superstring shortest common theory code, project design assembly language, java project library management system design documentation description, project java find shortest distance cities dijkastra algorithm, project templates assembly, project report assembly unit led lights, simple project mips assembly, shortest common superstring code, fibonacci sequence assembly code lmc, hi i bidded to your project but my offer wasnt exact if you already have a cloud based student attendance system already ill wri, hi! i bidded to your project but my offer wasnt exact if you already have a cloud based student attendance system already ill wr, alan parsons project wouldn t want to be like you live, project using assembly language, paulo basic desk - project 62 assembly instructions, mini project using assembly language, project 62 assembly instructions

About the Employer:
( 5 reviews ) Edwardsville, United States

Project ID: #30041998

Awarded to:


Hi, there I am here. I am ready. ..........................................................................

$35 USD in 3 days
(2 Reviews)