DNA Alignments and Graphing over WWW

I have a bioinformatics program written in C that I would like to serve over the web. The application is currently run on the command line via a Perl script that can process multiple files with different options and emits R code to generate PostScript graphs from the dotplot files. The input is two DNA sequences and 4 options (flags); the output is two text files: a dotplot file (for graphing the alignment) and an alignment file (showing where the two sequences match).

I want to serve this app over the web using the standard CGI/Perl paradigm. The challenge is to make the graphs dynamic, so that mousing over a feature on the graph shows the corresponding region in the alignment text file (or something similar). Clever Perl scripting could be used to generate an HTML image map for the graphs as needed. I am open to other solutions, though.
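As a rough illustration of the image-map idea, here is a minimal sketch in Python (the same logic would port to the Perl wrapper). It assumes each match has already been read from the dotplot file as a tuple of an identifier plus start/end coordinates on each sequence; that tuple layout, the function name `image_map`, and the `#id` link targets are hypothetical, since the actual file format is not described here. Each match is approximated by its bounding rectangle.

```python
from html import escape

def image_map(segments, seq1_len, seq2_len, width_px, height_px, name="dotplot"):
    """Build an HTML <map> with one rectangular <area> per matching segment.

    segments: iterable of (seg_id, s1_start, s1_end, s2_start, s2_end),
    a hypothetical in-memory form of the dotplot coordinate file.
    """
    sx = width_px / seq1_len   # pixels per base along seq 1 (x axis)
    sy = height_px / seq2_len  # pixels per base along seq 2 (y axis)
    areas = []
    for seg_id, a1, a2, b1, b2 in segments:
        left, right = round(a1 * sx), round(a2 * sx)
        # Assumption: the plot's y axis grows upward, image pixels grow downward.
        top, bottom = round(height_px - b2 * sy), round(height_px - b1 * sy)
        target = escape(str(seg_id), quote=True)
        areas.append(
            f'<area shape="rect" coords="{left},{top},{right},{bottom}" '
            f'href="#{target}" alt="match {target}">'
        )
    map_name = escape(name, quote=True)
    return f'<map name="{map_name}">\n' + "\n".join(areas) + "\n</map>"
```

The resulting markup would be attached to the graph image through its `usemap` attribute; a script could then react to mouse-over events on each `<area>`.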

The other challenge is that once the graphs are generated for different combinations of parameters they need to be looked at in sequence one after another and back and forth. This means they need to load quickly. More details about the interface for uploading the sequences and the layout and navigation relative to the graphs can be found below.

## Deliverables

A screenshot showing the two text output files (alignment and dot-plot coordinates), a dot-plot graph, and a bit of documentation from the Perl script that orchestrates everything is available here:

[url removed, login to view]

Mousing over parts of the graph will bring up the corresponding DNA sequence alignment region either through a pop-up or highlighting. Note, the coordinates are also given in the alignment file.

I am open to creative solutions!

Application Details

1. DNA Input Interface

Letting users input DNA sequences over the web is an important part of this tool's functionality. Due to computational complexity, users can compare at most 5,000bp (5kb) of one sequence against at most 5,000bp of another at a time. Because the regions of interest are typically much larger than this, we need a way for users to efficiently analyze subsequences of their DNA input. This section describes the essentials of this function.

Users upload 2 sequence files via a "browse" button on the html form in "strict FASTA format":

Strict FASTA format:

>DNA sequence name

ATTTGCTGCTGCTGCTCGTCGCTCGCTCGTCCGCTCG...

Note: all the DNA sequence is on one line, immediately below the header line that starts with a ">"
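A minimal sketch of how an upload could be checked against this strict format, written in Python for illustration. The allowed alphabet comes from the Abbreviations section and the 500kb limit from the upload rule below; the function name, the tolerance for blank lines, and the acceptance of lower-case input are assumptions.

```python
VALID_BASES = set("ATCGNX")  # alphabet listed under Abbreviations
MAX_UPLOAD_BP = 500_000      # default upload limit; meant to be easy to change

def parse_strict_fasta(text):
    """Return (name, sequence) for a strict-FASTA upload or raise ValueError."""
    # Assumption: blank lines are ignored, so only two content lines remain.
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    if len(lines) != 2:
        raise ValueError("expected one header line and one sequence line")
    header, seq = lines
    if not header.startswith(">"):
        raise ValueError("header line must start with '>'")
    seq = seq.upper()  # assumption: lower-case input is accepted
    bad = set(seq) - VALID_BASES
    if bad:
        raise ValueError("invalid characters: " + "".join(sorted(bad)))
    if len(seq) > MAX_UPLOAD_BP:
        raise ValueError("sequence exceeds the upload limit")
    return header[1:].strip(), seq
```

Rejecting multi-line sequences up front keeps the later subsequence arithmetic simple, because every position maps directly to an index in a single string.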

Users can submit up to 500kb* for each sequence, which can be analyzed up to 5kb* at a time using an interface something like this:

-choose sequence size to compare (max = 5kb):

seq 1____ (browse)

seq 2____ (browse)

-choose sequence regions to compare:

V++++++++ bar1 ++++++++++V

seq 1 ---------------------------------------------------------------------------

V++++++++ bar2 ++++++++++V

seq 2 ---------------------------------------------------------------------------

Note: The V+++++++V features are 'selection bars' whose size is specified in the input above. This feature should be 'drag and drop' so that new subsequences can be specified by moving each V++++++++V bar over each sequence on the page. The bars are ALWAYS the same length (as given above) but the positions of each bar can change independently. For example, bar1 could grab 0-1kb whereas bar2 could grab 1-2kb.
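On the server side, a dropped bar position has to be turned into a valid subsequence. The sketch below, in Python for illustration, clamps a bar so it stays inside the sequence and never exceeds the 5kb limit. The function names and the 0-based coordinate convention are assumptions.

```python
MAX_WINDOW_BP = 5_000  # default comparison limit; meant to be easy to change

def clamp_bar(start, window, seq_len):
    """Return (start, window) adjusted so the bar lies inside the sequence."""
    window = min(window, MAX_WINDOW_BP, seq_len)
    start = max(0, min(start, seq_len - window))
    return start, window

def extract_window(seq, start, window):
    """Return (start, subsequence) for a bar dropped at `start` (0-based)."""
    start, window = clamp_bar(start, window, len(seq))
    return start, seq[start:start + window]
```

For example, with a 1kb bar, `extract_window(seq1, 0, 1000)` and `extract_window(seq2, 1000, 1000)` would select the 0-1kb and 1-2kb regions mentioned above.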

2. Graph Presentation

The exact settings to use when analyzing a particular DNA sequence alignment are often not known, so easy parameter optimization is an important part of the user interface. From the C code, the options are:

-o gap opening penalty

-e gap extension penalty

-l minimum _length_ for a significant match

-o and -e are typically both set to 4*. This should be the default in the HTML form, but users must be able to change it.

The -l parameter is the one that needs the most tuning. The perl wrapper to the C code automatically sets up multiple, independent runs of a user-specified range of the parameter l. A typical run should only have 10* different sizes, maximum. For example, l = 10-15 would be typical.
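To make the run setup concrete, here is a small Python sketch that expands a range of l into one command line per run. The flags -o, -e, and -l are the ones listed above; how the program receives its two input files is not specified here, so the positional file arguments and the function name are assumptions.

```python
DEFAULT_GAP_OPEN = 4    # default for -o; meant to be easy to change
DEFAULT_GAP_EXTEND = 4  # default for -e
MAX_RUNS = 10           # maximum number of l values per submission

def build_runs(l_min, l_max, gap_open=DEFAULT_GAP_OPEN,
               gap_extend=DEFAULT_GAP_EXTEND, program="smm",
               inputs=("seq1.fa", "seq2.fa")):
    """Return one argument list per value of l in [l_min, l_max]."""
    if l_min > l_max:
        raise ValueError("l_min must not exceed l_max")
    if l_max - l_min + 1 > MAX_RUNS:
        raise ValueError(f"at most {MAX_RUNS} values of l per submission")
    return [
        # Assumption: input files are passed as trailing positional arguments.
        [program, "-o", str(gap_open), "-e", str(gap_extend), "-l", str(l), *inputs]
        for l in range(l_min, l_max + 1)
    ]
```

Returning argument lists rather than shell strings means user-supplied values never pass through a shell, which matters for a public CGI form.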

Each run of smm with a particular l produces one graph. Users should be able to view these graphs one after another on the webpage using a slider bar and clickable arrows that move one l at a time. The slider and arrows must quickly cycle back and forth between the graphs corresponding to each run of smm with a particular l.

------------   ^ forward
|          |   |
|  graph   |   |  <- slider
|          |   |
|          |   |
------------   V back

Since running the smm-view C code takes a while, I thought the best approach would be to generate all of the HTML pages at once so that they are quick to page through and load, but I am open to other ideas about how best to do this!
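One way to organize pre-generated pages is to build an index that links each page to its neighbors, which the forward/back arrows and the slider can then follow. The Python sketch below shows the idea; the file naming scheme and the function name are hypothetical.

```python
def build_page_index(l_values):
    """Return one entry per l with the page file and its prev/next neighbors."""
    ordered = sorted(l_values)
    index = []
    for i, l in enumerate(ordered):
        index.append({
            "l": l,
            "file": f"dotplot_l{l}.html",  # hypothetical naming scheme
            "prev": f"dotplot_l{ordered[i - 1]}.html" if i > 0 else None,
            "next": f"dotplot_l{ordered[i + 1]}.html" if i < len(ordered) - 1 else None,
        })
    return index
```

The same index could instead drive a single page that preloads every graph image and swaps them in the browser, which would avoid a page load on each arrow click.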

3. Graph and Alignment View

When the dot plot is moused-over, the corresponding DNA alignment regions will be highlighted both in a global sequence overview showing the relative positions of the homologous DNA and in an alignment view which shows the actual DNA sequences in the alignment.

The alignment view could be a pop-up window or object, but it would be best to have it pop up in the same place each time. Something like this:

dotplot        slider/arrows      alignment view
------------   ^ forward          -----------------
|          |   |                  | aattacgggaccc |
|  \  \    |   |  <- slider       | ||||| ||||||| |
|   \      |   |                  | aatta-gggaccc |
|          |   |                  -----------------
------------   V back

global view

seq 1 ---------------[-----------]----------------------------------------------------------

seq 2 ---------------[-----------]--[-----------]--------------------------------------------

The box under 'alignment view' cycles through the corresponding DNA alignment based on where the user mouses over the dotplot.

Note: a DNA segment in seq1 may match more than one segment of seq2. All of the matching segments are given in the dotplot coordinate file. The coordinates are also given in the alignment file, so the two files can be matched up.
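Because one seq1 segment can have several partners in seq2, the mouse-over lookup has to return a list rather than a single hit. A minimal Python sketch of that lookup follows; the tuple layout `(s1_start, s1_end, s2_start, s2_end)` is an assumed in-memory form of the dotplot coordinate file, and the function names are hypothetical.

```python
from collections import defaultdict

def group_matches(segments):
    """Map each seq1 interval to the list of seq2 intervals it matches."""
    by_seq1 = defaultdict(list)
    for s1_start, s1_end, s2_start, s2_end in segments:
        by_seq1[(s1_start, s1_end)].append((s2_start, s2_end))
    return dict(by_seq1)

def matches_at(grouped, pos1):
    """Return every (seq1 interval, seq2 interval) pair covering pos1 on seq1."""
    hits = []
    for (start, end), partners in grouped.items():
        if start <= pos1 <= end:
            hits.extend(((start, end), partner) for partner in partners)
    return hits
```

Each returned pair could be highlighted in the global view and used to pick the matching block out of the alignment file by its coordinates.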

4. Download Results

Users should be able to save their analyses on the website for up to 30 days and have the option of downloading each run as a folder with clearly labeled pairs of files:

- smma alignment files

- dotplot graphs
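A sketch of the download and retention logic in Python, for illustration: one helper decides whether a saved analysis is past the 30-day limit, and another bundles a run's directory into a zip archive. The directory layout and function names are assumptions; the archive is expected to be written outside the run directory.

```python
import time
import zipfile
from pathlib import Path

RETENTION_DAYS = 30  # how long saved analyses are kept; meant to be easy to change

def is_expired(created_ts, now=None):
    """True if a run created at created_ts (epoch seconds) is past retention."""
    now = time.time() if now is None else now
    return now - created_ts > RETENTION_DAYS * 86400

def bundle_run(run_dir, archive_path):
    """Zip every file in run_dir under a top-level folder named after the run."""
    run_dir = Path(run_dir)
    with zipfile.ZipFile(archive_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for f in sorted(run_dir.iterdir()):
            if f.is_file():
                zf.write(f, arcname=f"{run_dir.name}/{f.name}")
    return archive_path
```

Keeping the run name as the top-level folder inside the archive means the alignment file and graph for each l stay together as a clearly labeled pair after unpacking.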

5. Server Access

You will have ssh/sftp access to the server and directories where the site will reside.

I can give you a user account and password when you are ready.

Abbreviations

bp = basepair = 1 nucleotide (A,T,C,G,N, or X)

kb = kilobases = 1000 nucleotides (A,T,C,G,N, or X)

* Means that the number listed should be defined as a named variable whose default setting can easily be changed by changing the variable default at the beginning of the code.

Skills: CSS, Engineering, JavaScript, MySQL, Perl, PHP, Project Management, Software Architecture, Software Testing, Web Hosting, Website Management, Website Testing
