|
.npa
|
||
|
.left margin1
|
||
|
@-1. TX 0 @General
|
||
|
.sp
|
||
|
@-2. T 0 @Screen control
|
||
|
.sp
|
||
|
@-2. X 0 @Screen
|
||
|
.sp
|
||
|
@-3. TX 0 @Modification
|
||
|
.sp
|
||
|
@0. TX -1 @BAP
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
This is an interactive program whose primary use is
|
||
|
for managing shotgun sequencing projects, but it can also be used for
|
||
|
handling alignments of other sequences, including those of proteins.
|
||
|
Currently the maximum 'gel reading' length is set to 4096 characters.
|
||
|
Almost all of the information below describes the use of the program for
|
||
|
shotgun projects, but those using the programs for handling other
|
||
|
sequence
|
||
|
alignments should interpret it accordingly.
|
||
|
The data for such a project is stored in a special type of database. The
|
||
|
program
|
||
|
contains the tools that are required to screen gel readings
|
||
|
against vector sequences and restriction sites, and to assemble
|
||
|
new gel
|
||
|
readings into the database (automatically comparing and aligning
|
||
|
them). In addition it contains editors and functions to examine the quality
|
||
|
of the aligned sequences.
|
||
|
.para
|
||
|
There are three main menus: "general", "screen" and "modification",
|
||
|
and some functions have submenus.
|
||
|
.left margin2
|
||
|
.lit
|
||
|
The general menu contains the following options:
|
||
|
|
||
|
Open a database
|
||
|
Display a contig
|
||
|
List a text file
|
||
|
Direct output
|
||
|
Calculate a consensus
|
||
|
Screen against restriction enzymes
|
||
|
Screen against vector
|
||
|
Check logical consistency
|
||
|
Copy database
|
||
|
Show relationships
|
||
|
set parameters
|
||
|
Highlight disagreements
|
||
|
Examine quality
|
||
|
Check Assembly
|
||
|
Find read pairs
|
||
|
|
||
|
The graphics menu contains:
|
||
|
|
||
|
Clear graphics
|
||
|
Clear text
|
||
|
Draw ruler
|
||
|
Use cross hair
|
||
|
Change margins
|
||
|
Label diagram
|
||
|
Plot map
|
||
|
Plot single contig
|
||
|
Plot all contigs
|
||
|
|
||
|
|
||
|
The modification menu contains:
|
||
|
|
||
|
Edit contig
|
||
|
Auto assemble
|
||
|
Join contigs
|
||
|
Complement a contig
|
||
|
Alter relationships
|
||
|
Extract gel readings
|
||
|
Find internal joins
|
||
|
Disassemble readings
|
||
|
Shuffle pads
|
||
|
Auto-select oligos
|
||
|
Double strand
|
||
|
|
||
|
The alter relationships menu contains:
|
||
|
|
||
|
Cancel
|
||
|
Line change
|
||
|
Check logical consistency
|
||
|
Remove contig
|
||
|
Shift
|
||
|
Move gel reading
|
||
|
Rename gel reading
|
||
|
Break a contig
|
||
|
Remove a gel reading
|
||
|
Alter raw data parameters
|
||
|
|
||
|
.END LIT
|
||
|
.SK1
|
||
|
.para
|
||
|
Overview of the methodology
|
||
|
.para
|
||
|
The shotgun sequencing strategy
|
||
|
.para
|
||
|
In the shotgun sequencing procedure
|
||
|
the sequence to be determined is randomly broken into fragments of
|
||
|
about
|
||
|
1000 nucleotides in length. These fragments are cloned and then
|
||
|
selected randomly and their
|
||
|
|
||
|
sequences determined. The relationship between any pair of
|
||
|
|
||
|
fragments is not known beforehand
|
||
|
but is found by comparing their sequences.
|
||
|
|
||
|
If the sequence of one is found to be wholly or partially contained
|
||
|
|
||
|
within that of another for sufficient length to distinguish an
|
||
|
|
||
|
overlap from a repeat then those two fragments can be joined.
|
||
|
The
|
||
|
|
||
|
process of select, sequence and compare is continued until the
|
||
|
whole
|
||
|
|
||
|
of the DNA to be sequenced is in one continuous, well-determined
|
||
|
|
||
|
piece.
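.para
The comparison step can be illustrated with a short sketch. It is not the
method used by this program: it only looks for an exact overlap between the
right end of one sequence and the left end of another, and the names
end_overlap and min_length are invented for the illustration. The minimum
length plays the role described above, rejecting matches too short to be
distinguished from a chance or repeated sequence.
.lit
```python
def end_overlap(left, right, min_length):
    """Length of the longest suffix of `left` that equals a prefix of `right`.

    Returns 0 when no exact overlap of at least `min_length` characters exists.
    """
    longest = min(len(left), len(right))
    # Try the longest candidate first so the first hit is the best one.
    for size in range(longest, min_length - 1, -1):
        if left.endswith(right[:size]):
            return size
    return 0


print(end_overlap("ACGTTGCA", "TGCAGGT", 3))  # prints 4
print(end_overlap("ACGT", "TTTT", 2))         # prints 0
```
.end lit
.para
With min_length set to 1 the second pair would be reported as overlapping by
a single T, which shows why a sensible minimum is needed.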
|
||
|
|
||
|
.para
|
||
|
Definition of a contig
|
||
|
|
||
|
.para
|
||
|
A CONTIG is a set of gel readings that are related to one
|
||
|
another by overlap of their sequences. All gel readings belong to
|
||
|
a contig and each contig contains at least one gel
|
||
|
reading. The gel readings in a contig can be summed to produce
|
||
|
a continuous consensus sequence and the length of this sequence is
|
||
|
the length of the contig. The rules used to perform this summation are
|
||
|
given under "the consensus algorithm".
|
||
|
At any stage
|
||
|
of a sequencing project the data will comprise a number of
|
||
|
contigs;
|
||
|
when a project is
|
||
|
|
||
|
complete there should be only one contig and its consensus will be
|
||
|
the finished sequence. Note that since being introduced and
|
||
|
defined as above the word "contig" has been taken up by those involved in
|
||
|
genomic mapping. In that context the consensus with a precise length is,
|
||
|
of course, not
|
||
|
defined.
|
||
|
|
||
|
.SK1
|
||
|
.LEFT MARGIN2
|
||
|
Introduction to the computer method
|
||
|
.LEFT margin2
|
||
|
.PARA
|
||
|
It is useful to consider the objectives of a sequencing project before
|
||
|
outlining how we use the computer to help achieve them. The aim of a
|
||
|
shotgun sequencing project is to
|
||
|
produce an accurate consensus sequence from many overlapping gel
|
||
|
readings.
|
||
|
It is necessary to know, particularly at the later
|
||
|
stages of the project, how accurate the
|
||
|
consensus sequence is. This enables us to know which regions of the
|
||
|
sequence require further work and also to know when the project is
|
||
|
finished.
|
||
|
To show the quality of the consensus, the programs described here
|
||
|
produce displays like that shown below.
|
||
|
.sk1
|
||
|
.lit
|
||
|
|
||
|
10 20 30 40 50
|
||
|
-6 HINW.010 GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
||
|
CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
||
|
|
||
|
60 70 80 90 100
|
||
|
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
|
||
|
-3 HINW.007 GGCACA*GTC
|
||
|
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC
|
||
|
|
||
|
110 120 130 140 150
|
||
|
-6 HINW.010 GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
|
||
|
-3 HINW.007 GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
|
||
|
-5 HINW.009 GGCAGCGACCGTCG
|
||
|
17 HINW.999 AGCGACCGTCG
|
||
|
CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG
|
||
|
|
||
|
160 170 180 190 200
|
||
|
-6 HINW.010 TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
-3 HINW.007 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
-5 HINW.009 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
17 HINW.999 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
|
||
|
12 HINW.017 GTAGAGC*
|
||
|
CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
.END LIT
|
||
|
.para
|
||
|
This is an example showing the left end of a contig from
|
||
|
position 1 to 200. Overlapping this region are gel readings
|
||
|
numbered 6, 3, 5, 17 and 12;
|
||
|
6, 3 and 5
|
||
|
are in reverse orientation to their original reading (denoted by a minus
|
||
|
sign). Each gel reading also has a name (e.g. HINW.010). It can be seen that
|
||
|
in a number of places the sequences contain characters other than A,C,G
|
||
|
and
|
||
|
T. Some of these extra characters have been used by the sequencer to
|
||
|
indicate regions of uncertainty in the initial interpretation of the gel
|
||
|
reading, but the asterisks (*) have been inserted by the automatic
|
||
|
assembly function in order to align the sequences. Underneath each 50
|
||
|
character block of gel reading sequences is the consensus derived from
|
||
|
the
|
||
|
sequences aligned above (the line labelled CONSENSUS). For most of its
|
||
|
length the consensus has a definite nucleotide assignment but in a few
|
||
|
positions there is insufficient agreement between the gel readings and
|
||
|
so a dash (-) appears in the sequence. This display contains all the
|
||
|
evidence needed to assess the quality of the consensus: the number of
|
||
|
times
|
||
|
the sequence has been determined on each strand of the DNA, and the
|
||
|
individual nucleotide assignments given for each gel reading.
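.para
The way a dash arises in the consensus can be sketched as follows. The real
rules are those given under "the consensus algorithm"; the sketch below is a
deliberately simplified stand-in that accepts a character only when a chosen
fraction of the readings agree on it. The function names and the cutoff of
0.75 are assumptions made for the illustration.
.lit
```python
from collections import Counter


def consensus_column(chars, cutoff=0.75):
    """Consensus character for one aligned column, or '-' if agreement is poor.

    `cutoff` is a hypothetical fraction of the readings that must agree.
    """
    if not chars:
        return "-"
    base, count = Counter(chars).most_common(1)[0]
    return base if count / len(chars) >= cutoff else "-"


def consensus(rows, cutoff=0.75):
    """Consensus of aligned rows; a space in a row means 'no data here'."""
    width = max(len(row) for row in rows)
    result = []
    for i in range(width):
        column = [row[i] for row in rows if i < len(row) and row[i] != " "]
        result.append(consensus_column(column, cutoff))
    return "".join(result)


# Positions 91 to 100 of HINW.010 and HINW.007 from the display above.
print(consensus(["CGGACACGTC", "GGCACA*GTC"]))  # prints -G-ACA-GTC
```
.end lit
.para
For these two readings the sketch reproduces the dashes seen at positions 91,
93 and 97 of the display, but it takes no account of uncertainty codes or of
the strand on which each reading lies.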
|
||
|
.para
|
||
|
So the aim is to produce the consensus sequence and, equally important,
|
||
|
a display of the experimental results from which it was derived.
|
||
|
.para
|
||
|
In order to achieve this the following operations need to be performed:
|
||
|
.left margin2
|
||
|
1) Put individual gel readings into the computer.
|
||
|
This might involve the manual interpretation of autoradiographs
or the transfer and processing of machine-readable files from fluorescent
|
||
|
sequencing machines.
|
||
|
.left margin2
|
||
|
2) Check each gel reading to make sure it is not simply part of one of the
|
||
|
vectors used to clone the sequence.
|
||
|
.left margin2
|
||
|
3) Check each gel reading to make sure that those fragments that span
|
||
|
the
|
||
|
ligation point used prior to sonication are not assembled as single
|
||
|
sequences.
|
||
|
.left margin2
|
||
|
4) Compare all the remaining gel readings with one another to assemble
|
||
|
them
|
||
|
to produce the consensus sequence.
|
||
|
.left margin2
|
||
|
5) Check the quality of the consensus and edit the sequences.
|
||
|
.left margin2
|
||
|
6) When all the consensus is sufficiently well determined, produce a copy
|
||
|
of
|
||
|
it for processing by other analysis programs.
|
||
|
.para
|
||
|
It is very unlikely that this procedure will only be passed through once.
|
||
|
Usually steps 1 to 5 are cycled through repeatedly, with step 4 just
|
||
|
adding
|
||
|
new sequences to those already assembled. Generally step 6 is also used
|
||
|
in
|
||
|
order to analyse imperfect sequence to check if it is the one the project
|
||
|
intended to sequence, or to look for interesting features. Analysis of
|
||
|
the consensus, such as
|
||
|
searches for protein coding regions,
|
||
|
can also help to find errors in the sequence. The display of the
|
||
|
overlapping gel readings shown above can be used to indicate, not only
|
||
|
the
|
||
|
poorly determined regions, but also which clones should be resequenced
|
||
|
to
|
||
|
resolve ambiguities, or those which can usefully be extended or
|
||
|
sequenced
|
||
|
in the reverse direction, to cover
|
||
|
difficult regions.
|
||
|
|
||
|
.PARA
|
||
|
The original
|
||
|
individual gel readings for a sequencing project are each stored in
|
||
|
separate files. As the gel readings are entered into the computer
|
||
|
(usually in batches, say 10
|
||
|
from a film), the file names they are given are stored in
|
||
|
a further file, called a file of file names. Files of file names
|
||
|
enable gel readings to be processed in batches.
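.para
As an illustration of batch processing, the sketch below reads a file of
file names and then each gel reading it lists. The layout assumed here (one
name per line, each gel reading file holding plain sequence text in the same
directory) and the function names are assumptions for the example, not a
description of the program's own routines.
.lit
```python
from pathlib import Path


def read_file_of_file_names(path):
    """Return the gel reading file names listed in a file of file names.

    Assumes one name per line; blank lines are skipped.
    """
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]


def read_batch(path):
    """Yield (name, sequence) for each gel reading named in the file.

    Assumes each gel reading file sits beside the file of file names and
    holds plain sequence text; white space inside it is discarded.
    """
    folder = Path(path).parent
    for name in read_file_of_file_names(path):
        text = (folder / name).read_text()
        yield name, "".join(text.split())
```
.end lit
.para
A call such as read_batch("demo.nam") would then hand the readings on, one
at a time, to whatever function processes the batch.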
|
||
|
.para
|
||
|
For each sequencing project
|
||
|
we start a project database. This database has a structure specifically
|
||
|
designed for
|
||
|
dealing with shotgun sequence data.
|
||
|
In order to arrive at the final consensus sequence many operations will
|
||
|
be
|
||
|
performed on the sequence data. Individual fragments must be
|
||
|
sequenced and
|
||
|
compared in both senses (i.e. both orientations) with all the other
|
||
|
sequences. When an overlap between a new gel reading and a contig is found
|
||
|
they must be aligned and the new gel reading added to the contig. If a
|
||
|
new
|
||
|
gel reading overlaps two contigs they must be aligned and joined. Before
|
||
|
the two contigs are joined one of them may need to be turned around
|
||
|
(reversed and complemented) so they are both in the same orientation.
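.para
Turning a sequence around can be sketched in a few lines. The sketch
complements only A, C, G and T and leaves every other symbol, including the
padding character, unchanged; how the program itself complements the
uncertainty codes is not covered here, so that part is an assumption.
.lit
```python
# Translation table for the four definite bases; other symbols are untouched.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Return the sequence as read from the opposite strand."""
    return sequence.translate(COMPLEMENT)[::-1]


print(reverse_complement("GGCACA*GTC"))  # prints GAC*TGTGCC
```
.end lit
.para
Applying the function twice returns the original sequence, which is a useful
check on any such routine.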
|
||
|
.para
|
||
|
Clearly, keeping track of all these manipulations is quite complicated,
|
||
|
and to be able to perform the operations
|
||
|
quickly requires careful choice of data
|
||
|
structure and algorithms. For these reasons it is not practicable to store
|
||
|
the gel readings aligned as shown in the display above. Rather, it is more
|
||
|
convenient to store the sequences unassembled, and to record sufficient
|
||
|
information for programs to assemble them during processing. The
|
||
|
data used to assemble the sequences is called relational information.
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
The database comprises five files and they are described under the
|
||
|
section entitled "open database".
|
||
|
.PARA
|
||
|
Before entry into the project database
|
||
|
each new gel reading must be compared to look for overlaps
|
||
|
with all the data already contained
|
||
|
within the database. This last point is
|
||
|
important: all searching for overlaps is between individual new gel
|
||
|
readings and the data already in the database. There is no searching for
|
||
|
overlaps between sequences within the database; overlaps must be found
|
||
|
before new gel readings are entered into the database.
|
||
|
.para
|
||
|
Below I give an introduction to how the sequences are processed by
|
||
|
being
|
||
|
passed from one function to the next.
|
||
|
.para
|
||
|
This program is used to start a
|
||
|
database for the project and
|
||
|
then the following procedure is used.
|
||
|
.para
|
||
|
Data in the form of individual gel readings are entered into the computer
|
||
|
|
||
|
and stored in separate files (possibly using the digitizer
|
||
|
|
||
|
program GIP). Batches
|
||
|
of these gel readings
|
||
|
are passed to the screening functions in this program to search for overlaps
|
||
|
|
||
|
with vector sequences (see VEP and "screen against vector") or for matches to
|
||
|
|
||
|
restriction enzyme sites that should not be
|
||
|
|
||
|
present ("screen against enzymes").
|
||
|
Each run of these screening functions passes on only those gel
|
||
|
|
||
|
readings that do not contain unwanted sequences. Sequences are passed
|
||
|
|
||
|
via
|
||
|
files of file names and eventually are processed by the automatic
|
||
|
assembly function ("auto assemble"). This function compares each gel
|
||
|
reading with a consensus of all the previous gel readings
|
||
|
stored in the database.
|
||
|
If it finds any
|
||
|
overlaps
|
||
|
it aligns the overlapping sequences by inserting padding characters,
|
||
|
and then adds the new gel reading to the database.
|
||
|
Gels that overlap are added to existing contigs and gels that do not
|
||
|
overlap any data in the database start
|
||
|
new contigs. If a new gel overlaps two contigs they are joined.
|
||
|
Any gel readings that appear to overlap but which
|
||
|
cannot be aligned sufficiently well are not entered and have
|
||
|
their names written to a file of failed gel reading names.
|
||
|
.PARA
|
||
|
Generally data is entered
|
||
|
into the database in batches as just described. The program
|
||
|
is also used to examine
|
||
|
|
||
|
the data in the database, to enter gel readings that the automatic
|
||
|
|
||
|
assembly function cannot align ("auto assemble"),
|
||
|
|
||
|
and to make final edits. Edits to whole contigs
|
||
|
|
||
|
can be made using a
|
||
|
mouse-driven editor ("edit contig").
|
||
|
|
||
|
.PARA
|
||
|
Editing the sequences is obviously an essential part of managing a
|
||
|
|
||
|
sequencing project.
|
||
|
Editing is required when new
|
||
|
|
||
|
sequences are added, when contigs are joined, and when sequences are
|
||
|
|
||
|
corrected.
|
||
|
A basic part of the strategy
|
||
|
|
||
|
used here is that new
|
||
|
|
||
|
gel readings should be correctly aligned throughout their whole length
|
||
|
|
||
|
when
|
||
|
they are entered into the database, and that when contigs are joined they
|
||
|
|
||
|
are edited so that they are well aligned in the region of overlap.
|
||
|
|
||
|
Alignment can be achieved by
|
||
|
|
||
|
adding padding characters to the sequences, and this is the way "auto
|
||
|
|
||
|
assemble"
|
||
|
operates when adding new sequences to the database.
|
||
|
|
||
|
.para
|
||
|
In order to search
|
||
|
for overlaps that may have been missed or may be hidden in the "unused data"
|
||
|
the function "find internal joins" can be used.
|
||
|
|
||
|
.para
|
||
|
Generally the users need not concern themselves with how the relational
|
||
|
information is used by the program, but it is necessary to know
|
||
|
how contigs are identified. Because contigs are constantly being changed and
|
||
|
reordered the program identifies them by the numbers of the gel readings
|
||
|
they contain. Whenever users need to identify a contig they need only
|
||
|
know
|
||
|
the number or name of one of the gel readings it contains. Whenever the
|
||
|
program asks users to identify a contig or gel reading they can type its
|
||
|
number or its archive name. If they type its archive name they must precede
|
||
|
the name by a slash "/" symbol to denote that it is a name rather than a
|
||
|
number. E.g. if the archive
|
||
|
name is fred.gel with number 99, users should
|
||
|
type /fred.gel or 99 when asked to identify the contig. Generally,
|
||
|
when it asks for the gel reading to be identified,
|
||
|
the program will offer the user a default name,
|
||
|
and if the user types only return, that
|
||
|
contig will be accessed. When a database is opened the default contig will
|
||
|
be the longest one, but if another is accessed, it will subsequently become
|
||
|
the current default.
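.para
The convention for identifying a contig or gel reading can be summarised by
a small sketch. The function name and the form of the returned values are
invented for the illustration; only the rules themselves (a leading slash
for a name, a bare number otherwise, and return alone for the default) come
from the description above.
.lit
```python
def parse_identifier(reply, default=None):
    """Interpret what the user typed when asked for a contig or gel reading.

    Returns ("name", text) for a reply such as /fred.gel, ("number", n) for
    a reply such as 99, and `default` when only return was typed.
    """
    reply = reply.strip()
    if reply == "":
        return default
    if reply.startswith("/"):
        return ("name", reply[1:])
    return ("number", int(reply))


print(parse_identifier("/fred.gel"))  # prints ('name', 'fred.gel')
print(parse_identifier("99"))         # prints ('number', 99)
```
.end lit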
|
||
|
.para
|
||
|
Further information is located in the following places.
|
||
|
The database files are described under "open database". The format
|
||
|
for
|
||
|
vector and consensus sequences is given under "calculate a consensus", as are
|
||
|
the
|
||
|
uncertainty codes used in gel readings.
|
||
|
.left margin2
|
||
|
.para
|
||
|
The digitizer program
|
||
|
is used for the initial input of gel readings
|
||
|
and for writing a file of file names. The program
|
||
|
uses a digitizer for data entry.
|
||
|
A digitizer is
|
||
|
a two dimensional surface such as a light box
|
||
|
arranged so that if a special pen is pressed onto it, the pen's
|
||
|
coordinates are recorded by a computer.
|
||
|
These coordinates
|
||
|
can be interpreted by a program.
|
||
|
.para
|
||
|
In order to read an autoradiograph placed on the light box
|
||
|
the user need only define the bottom of
|
||
|
the four sequencing lanes and the bases
|
||
|
to which they correspond and then use the pen to point to each
|
||
|
successive band progressing up the gel. The program examines
|
||
|
the
|
||
|
coordinates of each pen position to see in which of the four
|
||
|
lanes
|
||
|
it lies and assigns the corresponding base to be stored in the
|
||
|
computer. Each time the pen tip is depressed to point to a position
|
||
|
on the surface of the digitizer the program sounds the bell on the
|
||
|
terminal to indicate to the user that a point has been recorded. As
|
||
|
the sequence is read the program displays it on the screen.
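.para
The assignment of pen positions to bases can be pictured with the sketch
below. It assumes that the lanes run straight up the gel, so that each pen
position belongs to the lane whose defined bottom is nearest in the
horizontal direction. That rule, the coordinate values and the function
names are assumptions for the example; the digitizer program may decide
lane membership differently.
.lit
```python
def assign_base(pen_x, lanes):
    """Return the base of the lane whose x position is nearest to pen_x.

    `lanes` is a list of (x, base) pairs, one for each of the four lanes,
    as defined by the user at the bottom of the gel.
    """
    _x, base = min(lanes, key=lambda lane: abs(lane[0] - pen_x))
    return base


def read_gel(pen_points, lanes):
    """Turn successive (x, y) pen positions, read up the gel, into a sequence."""
    return "".join(assign_base(x, lanes) for x, _y in pen_points)


lanes = [(10, "A"), (20, "C"), (30, "G"), (40, "T")]
points = [(11, 5), (38, 9), (21, 14), (29, 20)]
print(read_gel(points, lanes))  # prints ATCG
```
.end lit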
|
||
|
|
||
|
.left margin1
|
||
|
@17. TX 1 @Screen against enzymes
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Used to compare gel readings against any restriction enzyme recognition
|
||
|
|
||
|
sequences that may have been used during cloning and which should not
|
||
|
|
||
|
be present in the data. Works on single gel readings or processes batches
|
||
|
|
||
|
accessed through files of file names. The algorithm looks for exact
|
||
|
|
||
|
matches to recognition sequences stored in a file.
|
||
|
|
||
|
.para
|
||
|
The file containing the recognition sequences must be identified. The
|
||
|
user
|
||
|
must choose between employing a file of file names, or typing in the
|
||
|
|
||
|
|
||
|
names of individual gel reading files. If a file of file names is used the
|
||
|
|
||
|
|
||
|
program will also create a new file of file names. When the option has
|
||
|
|
||
|
finished operating this new file will contain the names of all those gel
|
||
|
|
||
|
readings that did not match any of the recognition sequences. Hence it
|
||
|
can
|
||
|
be used for further processing of the batch. The recognition sequences
|
||
|
|
||
|
should be stored in a simple text file with one recognition sequence per
|
||
|
|
||
|
line.
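.para
The screening step amounts to an exact string search, and can be sketched as
below. The function names are invented for the example, the readings are
held in memory rather than read from files, and only the strand as read is
searched; whether the program also searches the complementary strand is not
stated here.
.lit
```python
def load_recognition_sequences(text):
    """Parse the contents of a recognition sequence file, one site per line."""
    return [line.strip().upper() for line in text.splitlines() if line.strip()]


def screen_against_enzymes(readings, sites):
    """Split readings into those free of every site and those containing one.

    `readings` maps a gel reading name to its sequence. Returns two lists of
    names, (passed, rejected); the first corresponds to the new file of file
    names that is handed on for further processing.
    """
    passed, rejected = [], []
    for name, sequence in readings.items():
        sequence = sequence.upper()
        if any(site in sequence for site in sites):
            rejected.append(name)
        else:
            passed.append(name)
    return passed, rejected


sites = load_recognition_sequences("GAATTC\nGGATCC\n")
readings = {"a.gel": "TTGAATTCAA", "b.gel": "ACGTACGT"}
print(screen_against_enzymes(readings, sites))  # prints (['b.gel'], ['a.gel'])
```
.end lit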
|
||
|
.left margin1
|
||
|
@18. TX 1 @Screen against vector
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Used to compare gel readings against any vector sequences that may have
|
||
|
|
||
|
been picked up during cloning and which have not been removed by VEP.
It works on single gel readings or processes
|
||
|
|
||
|
batches accessed through files of file names. The algorithm looks for
|
||
|
exact
|
||
|
matches of length "minimum match length" and displays the overlapping
|
||
|
|
||
|
sequences.
|
||
|
.para
|
||
|
The file containing the vector sequence must be identified. The user must
|
||
|
|
||
|
choose between employing a file of file names, or typing in the names of
|
||
|
|
||
|
individual gel reading files. If a file of file names is used the program
|
||
|
will
|
||
|
also create a new file of file names. When the option has finished
|
||
|
|
||
|
operating this new file will contain the names of all those gel readings
|
||
|
|
||
|
that did not match the vector sequence. Hence it can be used for further
|
||
|
|
||
|
processing of the batch. The vector sequence should be stored in a simple
|
||
|
|
||
|
text file with up to 80 characters of data per line. More than one vector
|
||
|
|
||
|
can be stored in a single file. If so each should be preceded by a 20
|
||
|
|
||
|
character title of the form <---m13mp8.0001----> where the < and >
|
||
|
signs
|
||
|
and the number like .0001 are obligatory. The number must be preceded
|
||
|
|
||
|
by a dot (.) and be 4 digits long. The total sequence in the file must not
exceed 500,000 characters in length.
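.para
The vector file layout and the exact-match search can be illustrated as
follows. The sketch is an assumption-laden outline, not the program's code:
the function names are invented, the second title in the example is
hypothetical, and the search simply reports the first word of length
"minimum match length" that the gel reading shares with the vector.
.lit
```python
import re


def is_title(line):
    """True for a 20 character title such as <---m13mp8.0001---->."""
    return (len(line) == 20 and line.startswith("<") and line.endswith(">")
            and re.search(r"\.\d{4}", line) is not None)


def parse_vector_file(text):
    """Return {title: sequence}; an untitled single vector gets the key ''."""
    vectors, title = {}, ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if is_title(line):
            title = line
            vectors[title] = ""
        else:
            vectors[title] = vectors.get(title, "") + line
    return vectors


def find_vector_match(reading, vector, min_match):
    """First exact match of length min_match as (reading_pos, vector_pos).

    Positions count from 0. Returns None when there is no such match.
    """
    words = {}
    for j in range(len(vector) - min_match + 1):
        words.setdefault(vector[j:j + min_match], j)
    for i in range(len(reading) - min_match + 1):
        j = words.get(reading[i:i + min_match])
        if j is not None:
            return i, j
    return None


text = "<---m13mp8.0001---->\nACGTACGTAA\nGGCC\n<---pUC18-.0002---->\nTTTT\n"
vectors = parse_vector_file(text)
print(find_vector_match("NNNGTAAGGNN", vectors["<---m13mp8.0001---->"], 6))
# prints (3, 6)
```
.end lit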
|
||
|
|
||
|
.left margin1
|
||
|
@20. TX 3 @Auto assemble
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Compares gel readings against the current contents of the database and
|
||
|
|
||
|
produces alignments. In its normal mode of operation
|
||
|
("entry permitted"), the function
|
||
|
will automatically enter the gel readings into the database.
|
||
|
.para
|
||
|
New assembly suboption.
|
||
|
However, if entry is not permitted the readings will not be entered, but the program
|
||
|
will produce alignments and (optionally) save each reading name and its best
|
||
|
alignment score (percentage mismatch) in a file. When used in
|
||
|
this mode, the program will include in the alignment the poor quality data
|
||
|
for each reading. These files of names can then be sorted into score order
|
||
|
and then used for assembly, hence forcing the readings that align best to
|
||
|
be entered into the database first.
|
||
|
End of new suboption.
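.para
Sorting the saved names into score order might be done as in the sketch
below. The layout of the saved file is not described here, so the sketch
assumes one reading per line with the name followed by its percentage
mismatch, separated by white space; that layout and the function name are
assumptions.
.lit
```python
def sort_names_by_score(lines):
    """Return reading names ordered from lowest to highest percentage mismatch.

    Each line is assumed to hold a name and a score separated by white space;
    lines with fewer than two fields are ignored.
    """
    scored = []
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:
            scored.append((float(fields[1]), fields[0]))
    scored.sort()
    return [name for _score, name in scored]


lines = ["HINW.004 1.8", "HINW.018 0.4", "HINW.017 5.2"]
print(sort_names_by_score(lines))
# prints ['HINW.018', 'HINW.004', 'HINW.017']
```
.end lit
.para
Written out one per line, the sorted names form a file of file names in
which the best aligned readings come first.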
|
||
|
.para
|
||
|
The routine works on
|
||
|
|
||
|
single gel readings or processes batches of gel readings accessed through
|
||
|
|
||
|
files of file names. It is the only way to enter data into the database.
|
||
|
|
||
|
.para
|
||
|
The function will check the database for logical consistency and will
|
||
|
only
|
||
|
proceed if it is OK. Choose whether gel readings should be entered into the
|
||
|
|
||
|
database, or if they should only be compared. Choose between using a file
|
||
|
|
||
|
of file names or typing file names on the keyboard. If so selected, supply
|
||
|
|
||
|
the file of file names. Also supply a file of file names to contain the names of
|
||
|
|
||
|
all the gel readings that fail to get entered.
|
||
|
Select the entry mode. Normal assembly is appropriate for all but special
|
||
|
cases, as is "permit joins". Uses for the other modes are not documented
|
||
|
here.
|
||
|
Define a minimum initial
|
||
|
|
||
|
match length.
|
||
|
Define the maximum number
|
||
|
|
||
|
of padding characters allowed to be used in each gel reading to help
|
||
|
|
||
|
achieve alignment, and the same for the number allowed in the contig for
|
||
|
|
||
|
each gel reading. Finally define the maximum percentage mismatch to
|
||
|
be allowed for any gel reading to be entered into the database. If
|
||
|
|
||
|
for any gel reading, any of these last three values is exceeded the gel
|
||
|
|
||
|
reading will not be entered into the database.
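.para
The test applied to each alignment can be written down as a short sketch.
The names below are invented for the example, and the default limits are
simply the defaults offered in the sample dialogue for this function. How
the percentage mismatch itself is calculated is not described in this
section, so the sketch takes it as a given value.
.lit
```python
from dataclasses import dataclass


@dataclass
class AssemblyLimits:
    """The three limits a gel reading must not exceed to be entered."""
    max_pads_in_gel: int = 8
    max_pads_in_contig: int = 8
    max_percent_mismatch: float = 8.0


def may_enter(pads_in_gel, pads_in_contig, percent_mismatch,
              limits=AssemblyLimits()):
    """True when none of the three limits is exceeded."""
    return (pads_in_gel <= limits.max_pads_in_gel
            and pads_in_contig <= limits.max_pads_in_contig
            and percent_mismatch <= limits.max_percent_mismatch)


print(may_enter(1, 0, 1.8))  # prints True
print(may_enter(9, 0, 1.8))  # prints False
```
.end lit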
|
||
|
|
||
|
.para
|
||
|
In operation the function takes a batch of gel readings (probably passed
|
||
|
|
||
|
on as a file of file names from one of the screening routines) and
|
||
|
enters them into a
|
||
|
database for a sequencing project. It takes each gel reading
|
||
|
in turn,
|
||
|
compares it with the current consensus for the database, and then
|
||
|
produces an alignment for any regions of the consensus it
|
||
|
overlaps; if this alignment is sufficiently good it then edits
|
||
|
both the new gel reading and the sequences it overlaps and adds
|
||
|
the
|
||
|
new gel reading to the database. The program then updates the
|
||
|
consensus
|
||
|
accordingly and carries on to the next gel reading.
|
||
|
.para
|
||
|
All alignments are displayed and any gel readings
|
||
|
that do match but that
|
||
|
|
||
|
cannot be aligned sufficiently well have their names written to a
|
||
|
file of failed gel reading names. The function works without any
|
||
|
|
||
|
user intervention and can process any number of gel readings in a
|
||
|
single run. Those gel readings that fail can be recompared using
|
||
|
|
||
|
the same function (to find the current overlap position) and the
|
||
|
|
||
|
user can enter them into the database
|
||
|
|
||
|
using the "put all sequences in new contigs"
assembly option and then join them using "join contigs".
|
||
|
.para
|
||
|
Typical dialogue and output from the function is shown below. (Note that
|
||
|
output for gel readings 2 - 9 has been deleted to save space).
|
||
|
.lit
|
||
|
Automatic sequence assembler
|
||
|
Database is logically consistent
|
||
|
? (y/n) (y) Permit entry
|
||
|
? (y/n) (y) Use file of file names
|
||
|
? File of gel reading names=demo.nam
|
||
|
? File for names of failures=demo.fail
|
||
|
Select entry mode
|
||
|
X 1 Perform normal shotgun assembly
|
||
|
2 Put all sequences in one contig
|
||
|
3 Put all sequences in new contigs
|
||
|
? Selection (1-3) (1) =
|
||
|
? (y/n) (y) Permit joins
|
||
|
? Minimum initial match (12-4097) (15) =
|
||
|
? Maximum pads per gel (0-25) (8) =
|
||
|
? Maximum pads per gel in contig (0-25) (8) =
|
||
|
? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
|
||
|
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
|
||
|
Processing 1 in batch
|
||
|
Gel reading name=HINW.004
|
||
|
Gel reading length= 283
|
||
|
Searching for overlaps
|
||
|
Strand 1
|
||
|
Strand 2
|
||
|
No matches found
|
||
|
Total matches found 1
|
||
|
Padding in contig= 0 and in gel= 1
|
||
|
Percentage mismatch after alignment = 1.8
|
||
|
Best alignment found
|
||
|
1 11 21 31 41 51
|
||
|
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
|
||
|
********** ********** ********** ********** ********** **********
|
||
|
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
|
||
|
1 11 21 31 41 51
|
||
|
61 71 81 91 101 111
|
||
|
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
|
||
|
********** ********** ********** ********** ********** ***** ****
|
||
|
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
|
||
|
61 71 81 91 101 111
|
||
|
121 131 141 151 161 171
|
||
|
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
|
||
|
********** ********** ********** ********** ********** **********
|
||
|
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
|
||
|
121 131 141 151 161 171
|
||
|
181 191 201 211 221 231
|
||
|
GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
|
||
|
********** ********** * ******** ********** ********** **********
|
||
|
GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
|
||
|
181 191 201 211 221 231
|
||
|
241 251 261 271 281
|
||
|
CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
|
||
|
**** **** ********** ** * ***** ********** ****
|
||
|
CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
|
||
|
241 251 261 271 281
|
||
|
Batch finished
|
||
|
9 sequences processed
|
||
|
0 sequences entered into database
|
||
|
0 joins made
|
||
|
|
||
|
.end lit
|
||
|
|
||
|
.para
|
||
|
Note that "auto assemble" cannot align protein sequences.
|
||
|
.left margin1
|
||
|
@28. TX 1 @Highlight disagreements
|
||
|
.left margin2
|
||
|
.para
|
||
|
Used in the later stages of a project
|
||
|
to highlight disagreements between individual gel readings
|
||
|
and their consensus sequences. This display is also available in the
|
||
|
contig editor.
|
||
|
Characters that agree with the
|
||
|
|
||
|
consensus are shown as : symbols for the plus strand and . for the minus
|
||
|
|
||
|
strand. Characters that disagree with the consensus are left unchanged
|
||
|
|
||
|
and so stand out clearly. The results of this analysis are written to a
|
||
|
file.
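.para
The substitution performed on each line of the display can be sketched as
below. The function name and arguments are invented for the illustration;
the symbols are the defaults named above, and positions where a reading has
no data (spaces) are left as they are.
.lit
```python
def highlight(reading, consensus, reverse, plus=":", minus="."):
    """Replace characters that agree with the consensus by a strand symbol.

    `reading` and `consensus` are aligned, equal-length rows of the display.
    `reverse` is True for a reading in reverse orientation (minus strand).
    """
    symbol = minus if reverse else plus
    result = []
    for r, c in zip(reading, consensus):
        if r == " " or r != c:
            result.append(r)      # no data, or a disagreement: leave as is
        else:
            result.append(symbol)
    return "".join(result)


print(highlight("GCTATTTTGT", "G-TATTTTGT", reverse=False))  # prints :C::::::::
```
.end lit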
|
||
|
|
||
|
.para
|
||
|
Before selecting this option create a file of the display of the contig to
|
||
|
be
|
||
|
"highlighted". The option will ask for the name of this file. Select
|
||
|
symbols
|
||
|
to denote "agreeing" characters on each strand; the defaults are : and .,
|
||
|
|
||
|
but any others can be used. Supply the name of a file in which to put
|
||
|
|
||
|
the output.
|
||
|
.para
|
||
|
The display file needed as input for this option is created by selecting
|
||
|
|
||
|
"Redirect output", followed immediately by "display contig", and then
|
||
|
"Redirect output" again. The
|
||
|
|
||
|
cutoff score used in the consensus calculation can be set by option "set
|
||
|
|
||
|
display parameters". Note that for the highlight function
|
||
|
there is a limit of 50 for the number of gel
|
||
|
readings that are aligned at any position - i.e. the contig must be less
|
||
|
than 51 gel readings deep at its thickest point. I hope that those performing
|
||
|
shotgun sequencing never reach this limit, but those using the program for
|
||
|
comparing sequence families might.
|
||
|
.para
|
||
|
Typical output from this function is shown below.
|
||
|
.lit
|
||
|
|
||
|
210 220 230 240 250
|
||
|
1 HINW.004 :C::::::::::::::::::::::::::::::::::::::::::AC::::
|
||
|
7 HINW.018 :*::::::::::::::::::::::::::::::::::::::::::CA::::
|
||
|
-4 HINW.017 ...............AC....
|
||
|
G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG
|
||
|
|
||
|
260 270 280 290 300
|
||
|
1 HINW.004 ::::::::::::*:D:::::::::::::::::::
|
||
|
7 HINW.018 ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
|
||
|
-4 HINW.017 ..............................................A...
|
||
|
3 HINW.009 :::::::::::::::V::::::::::::::::::::::::::::*AV:::
|
||
|
-6 HINW.028 ......................A...
|
||
|
AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@32. TX 3 @Extract gel readings
|
||
|
.left margin2
|
||
|
.para
|
||
|
Used to make copies of the aligned gel readings in a database,
|
||
|
to write them into separate files, and to write a
|
||
|
|
||
|
corresponding file of file names. It operates in two modes: either all gel
|
||
|
|
||
|
readings are extracted, or only those at the ends of contigs.
|
||
|
|
||
|
.para
|
||
|
Choose which mode of operation is required and supply a file of file
|
||
|
|
||
|
names.
|
||
|
.para
|
||
|
The gel readings are given their original
|
||
|
|
||
|
names.
|
||
|
.para
|
||
|
If the option is used to extract all the gel readings from a database, a
|
||
|
|
||
|
subsequent run of "auto assemble" can reconstitute a database which has
|
||
|
|
||
|
been corrupted. This rarely occurs and is usually necessitated by a
|
||
|
|
||
|
user employing "alter relationships" incorrectly without first having
|
||
|
|
||
|
made a copy.
|
||
|
.left margin1
|
||
|
@1. TX 0 @Help
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Help is available on the following topics :
|
||
|
|
||
|
.LEFT MARGIN1
|
||
|
@2. TX 0 @Quit
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
This command stops the program and is the only safe way to terminate a
|
||
|
|
||
|
run
|
||
|
of the program that has altered the contents of the database in any way.
|
||
|
|
||
|
.left margin1
|
||
|
@3. TX 1 @Open a database
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
Opens existing databases or allows new ones to be started. The function
|
||
|
is
|
||
|
automatically called into operation
|
||
|
when the program is started but can also be selected
|
||
|
|
||
|
from the general menu.
|
||
|
.para
|
||
|
Choose to open an existing database or start a new one, or if ! is typed
|
||
|
when the program is first started, enter the program without opening a
|
||
|
database. Supply a project
|
||
|
|
||
|
database name, and if it already exists, the "version". If starting a new
|
||
|
|
||
|
database, define the database size and whether it is for DNA or protein sequences.
|
||
|
The database size is an initial size for the database. It can be increased
|
||
|
later during the project. It is the sum of the number of gel
|
||
|
readings plus the number of contigs. The current maximum size is 8000.
|
||
|
.para
|
||
|
Database names can have from one to 12 letters and must not include a full
|
||
|
|
||
|
stop (.). The database is made from five separate files. If the database
|
||
|
is
|
||
|
called FRED then version 0 of database FRED comprises files FRED.AR0,
|
||
|
|
||
|
FRED.RL0, FRED.SQ0, FRED.TG0 and FRED.CC0. The version is the last symbol in the file names.
|
||
|
|
||
|
Only this program
|
||
|
can read these files. If the "copy database" option is used it
|
||
|
|
||
|
will ask the user to define a new "version".
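.para
The naming scheme above can be sketched in a few lines of Python. This is
an illustration only: the function name database_files and the validation
rules are assumptions drawn from the FRED example, not part of the program.
.lit

```python
import re

# Suffix stems copied from the FRED example; what each stem stands for
# is not stated in this help text.
SUFFIXES = ("AR", "RL", "SQ", "TG", "CC")


def database_files(name, version):
    """Return the five file names that make up one version of a database."""
    # Assumed check: one to 12 letters, which also excludes the full stop.
    if not re.fullmatch(r"[A-Za-z]{1,12}", name):
        raise ValueError("database names have 1 to 12 letters and no full stop")
    # The version is the last symbol of each file name.
    if len(version) != 1:
        raise ValueError("the version is a single symbol")
    return [f"{name}.{stem}{version}" for stem in SUFFIXES]


print(database_files("FRED", "0"))
```

.end lit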
|
||
|
.para
|
||
|
For normal use the maximum gel reading length is set to 512 characters,
|
||
|
|
||
|
but when a database is started the user may choose lengths of either
|
||
|
|
||
|
512,
|
||
|
1024, 1536, ..., 4096. Normally the program is used to handle DNA
|
||
|
|
||
|
sequences but many of the functions also work on protein sequences. The
|
||
|
|
||
|
choice of sequence type is made when the database is started.
|
||
|
|
||
|
.para
|
||
|
The contigs are not stored on the disk as the user sees them displayed on
|
||
|
|
||
|
the screen. Each gel reading is stored with sufficient information about
|
||
|
|
||
|
how it overlaps other gel readings so that the program can work out how
|
||
|
|
||
|
to
|
||
|
present them aligned on the screen. We refer to this extra data as "the
|
||
|
relationships" and it is explained below.
|
||
|
|
||
|
The database comprises 5 separate files.
|
||
|
|
||
|
.left margin2
|
||
|
1. a working version of each gel reading. This is the version of
|
||
|
the gel reading
|
||
|
that is in the database and initially it is an exact copy of
|
||
|
the original sequence (known as the archive)
|
||
|
but it is edited and manipulated to align it
|
||
|
with other gel readings.
|
||
|
|
||
|
.left margin2
|
||
|
2. the file of relationships. This file contains all of the
|
||
|
|
||
|
information that is required to assemble the working versions
|
||
|
into
|
||
|
|
||
|
contigs during processing; any manipulations on the data use this
|
||
|
|
||
|
file and it is automatically updated at any time that the
|
||
|
|
||
|
relationships are changed. The information in this file is as
|
||
|
|
||
|
follows:
|
||
|
.left margin2
|
||
|
(A) Facts about each gel reading and its relationship to
|
||
|
others
|
||
|
("gel
|
||
|
|
||
|
descriptor lines"):
|
||
|
|
||
|
.left margin2
|
||
|
(a) the number of the gel
|
||
|
reading (each gel reading is given a number as it is
|
||
|
|
||
|
entered into the database)
|
||
|
|
||
|
.left margin2
|
||
|
(b) the length of the sequence from this gel reading
|
||
|
|
||
|
.left margin2
|
||
|
(c) the position of the left end of this gel
|
||
|
reading relative to the left
|
||
|
|
||
|
end of the contig of which it is a member
|
||
|
|
||
|
.left margin2
|
||
|
(d) the number of the next gel
|
||
|
reading to the left of this gel reading
|
||
|
|
||
|
.left margin2
|
||
|
(e) the number of the next gel reading to the right
|
||
|
|
||
|
.left margin2
|
||
|
(f) the relative strandedness of this gel
|
||
|
reading, i.e. whether it is in
|
||
|
|
||
|
the same sense or the complementary sense as its archive.
|
||
|
|
||
|
.left margin2
|
||
|
(B) Facts about each contig ("contig descriptor lines"):
|
||
|
|
||
|
.left margin2
|
||
|
(a) the length of this contig
|
||
|
|
||
|
.left margin2
|
||
|
(b) the number of the leftmost gel
|
||
|
reading of this contig
|
||
|
|
||
|
.left margin2
|
||
|
(c) the number of the rightmost gel reading of this contig.
|
||
|
|
||
|
.left margin2
|
||
|
(C) General facts:
|
||
|
|
||
|
.left margin2
|
||
|
(a) the number of gel readings in the database
|
||
|
|
||
|
.left margin2
|
||
|
(b) the number of contigs in the database.
|
||
|
|
||
|
.left margin2
|
||
|
3. the file of archive names. This is simply a list of the names
|
||
|
|
||
|
of each of the archive files in the database.
|
||
|
|
||
|
.left margin2
|
||
|
4. the file of tags (annotation).
|
||
|
This consists of linked lists of tag information for each sequence in the
|
||
|
database.
|
||
|
Tags are created by the user as annotation, or by xdap as records of edits or
|
||
|
for storing cutoff information.
|
||
|
As the number of tags can grow without limit, so can this file.
|
||
|
For each gel there is a header record, which contains the record number of
|
||
|
the start of the linked list for that gel. On line IDBSIZ there is a record
|
||
|
containing information about the file such as its present length and if there
|
||
|
are any free "tag" slots to be reused in the file.
|
||
|
|
||
|
5. the file of comments (annotation).
|
||
|
This consists of linked lists of comment fragments.
|
||
|
Comments are created by the user as a message attached to annotation,
|
||
|
or by the system to store cutoff information.
|
||
|
Comments are character strings of any length.
|
||
|
Comments longer than 40 characters are broken up into fragments, each 40
|
||
|
characters long, and are chained together in a linked list.
|
||
|
As the number of comments can grow without limit, so can this file.
|
||
|
|
||
|
.para
|
||
|
Structure of the database files
|
||
|
.para
|
||
|
1. The file of relationships
|
||
|
.para
|
||
|
The file contains IDBSIZ lines of data:
|
||
|
the general data are stored on line IDBSIZ; data about gel
|
||
|
readings are
|
||
|
stored from line 1 downwards; data about contigs are stored from
|
||
|
line IDBSIZ-1 upwards. A database of 500 lines containing 25 gel
|
||
|
readings and 4 contigs would have a file
|
||
|
of relationships as is shown below.
|
||
|
.lit
|
||
|
|
||
|
|
||
|
---------------------------------------------
|
||
|
0 Info about the database size
|
||
|
1 Gel descriptor record
|
||
|
2 " " "
|
||
|
3 " " "
|
||
|
4 " " "
|
||
|
5 " " "
|
||
|
' ' ' '
|
||
|
' ' ' '
|
||
|
25 " " "
|
||
|
26 Empty record
|
||
|
' ' '
|
||
|
|
||
|
' ' '
|
||
|
495 ' '
|
||
|
496 Contig descriptor record
|
||
|
497 " " "
|
||
|
498 " " "
|
||
|
499 " " "
|
||
|
500 Number of gel readings=25, Number of contigs=4
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the file of relationships
|
||
|
|
||
|
.end lit
|
||
|
As each new gel reading is added into the database a new line is added
|
||
|
to the end of the list of gel descriptor
|
||
|
lines. If this new gel reading does not
|
||
|
overlap with any gel readings
|
||
|
already in the database a new contig line is
|
||
|
added to the top of the list of contig lines. If it overlaps with
|
||
|
one contig then no new contig line need be added but if it overlaps
|
||
|
with two contigs then these two contigs must be joined and the
|
||
|
number of contig lines will be reduced by one. Then the list of
|
||
|
contig
|
||
|
lines is compressed to leave the empty line at the top of the list.
|
||
|
Initially the two types of line will move towards one another but
|
||
|
eventually, as contigs are joined, the contig descriptor lines will
|
||
|
move in the same direction as the gel descriptor
|
||
|
lines. At the end of a
|
||
|
project there should be only one contig line. The database is thus
|
||
|
capable of handling a project of IDBSIZ-2 gels (498 for a database of 500 lines).
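.para
The way the two kinds of line share the file can be modelled with a pair of
counters. The sketch below is a toy model, not the program's code: the class
Relationships and its methods are hypothetical names, and only the line
bookkeeping described above is represented.
.lit

```python
class Relationships:
    """Toy model of line usage in the file of relationships."""

    def __init__(self, idbsiz):
        self.idbsiz = idbsiz   # total number of lines in the file
        self.ngels = 0         # gel lines occupy 1 .. ngels
        self.ncontigs = 0      # contig lines occupy idbsiz-ncontigs .. idbsiz-1

    def free_lines(self):
        # Line idbsiz holds the general data, so it is never free.
        return self.idbsiz - 1 - self.ngels - self.ncontigs

    def add_gel(self, overlaps):
        """Enter a gel reading that overlaps 0, 1 or 2 existing contigs."""
        if overlaps not in (0, 1, 2) or overlaps > self.ncontigs:
            raise ValueError("a reading overlaps 0, 1 or 2 existing contigs")
        needed = 2 if overlaps == 0 else 1   # a new contig needs its own line
        if self.free_lines() < needed:
            raise OverflowError("no free lines left in the database")
        self.ngels += 1
        if overlaps == 0:
            self.ncontigs += 1               # reading starts a new contig
        elif overlaps == 2:
            self.ncontigs -= 1               # reading joins two contigs
        return self.ngels

    def contig_lines(self):
        return list(range(self.idbsiz - self.ncontigs, self.idbsiz))


db = Relationships(500)
for overlaps in [0, 0, 0, 0] + [1] * 21:
    db.add_gel(overlaps)
print(db.ngels, db.contig_lines())
```

.end lit
With 25 readings in 4 contigs the contig lines are 496 to 499, as in the
diagram above.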
|
||
|
.para
|
||
|
2. Structure of the working versions file
|
||
|
.para
|
||
|
The working versions of gel readings are stored in a file of
|
||
|
NGELS lines each containing MAXGEL characters. Gel reading
|
||
|
number 1 is stored on line
|
||
|
1, gel reading number 2 on line 2 and so on. NGELS is the
|
||
|
current number of readings and MAXGEL the maximum reading length.
|
||
|
.para
|
||
|
3. Structure of the archive names file
|
||
|
.para
|
||
|
This file has NGELS lines of 16 characters.
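.para
Because every line in these two files has the same length, line N can be
found by arithmetic alone. The sketch below assumes one byte per character,
blank padding and no separators between lines; none of these details is
stated above, and the function names are hypothetical.
.lit

```python
import io


def record_offset(line_number, record_length):
    """Byte offset of a line (numbered from 1) in a fixed-length file."""
    if line_number < 1:
        raise ValueError("lines are numbered from 1")
    return (line_number - 1) * record_length


def read_record(handle, line_number, record_length):
    """Read one whole line from an open binary file."""
    handle.seek(record_offset(line_number, record_length))
    return handle.read(record_length)


NAME_LENGTH = 16  # line length of the archive names file
names = ["HINW.004", "HINW.018", "HINW.017"]
# An in-memory stand-in for an archive names file with three lines.
archive = io.BytesIO(b"".join(n.ljust(NAME_LENGTH).encode("ascii") for n in names))
print(read_record(archive, 2, NAME_LENGTH).decode("ascii").rstrip())
```

.end lit
The same functions serve for the working versions file if MAXGEL is passed
as the record length.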
|
||
|
|
||
|
.para
|
||
|
4. Structure of the tag file
|
||
|
.para
|
||
|
This file initially starts with IDBSIZ lines, and is expanded as new tags are
|
||
|
created.
|
||
|
Information about the length of the file, and which tag records are reusable
|
||
|
is stored on line IDBSIZ.
|
||
|
A database of 500 lines would have a file of tags as shown below.
|
||
|
.lit
|
||
|
|
||
|
---------------------------------------------
|
||
|
1 Tag descriptor record
|
||
|
2 " " "
|
||
|
3 " " "
|
||
|
4 " " "
|
||
|
5 " " "
|
||
|
' ' ' '
|
||
|
' ' ' '
|
||
|
497 " " "
|
||
|
498 " " "
|
||
|
499 " " "
|
||
|
500 Length of file=N, Free list=0
|
||
|
501 Tag record
|
||
|
502 " "
|
||
|
503 " "
|
||
|
' ' '
|
||
|
' ' '
|
||
|
N-2 " "
|
||
|
N-1 " "
|
||
|
N Tag record
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the tag file
|
||
|
|
||
|
.end lit
|
||
|
As each new tag is added to the database, a check is made in the
|
||
|
file descriptor record at line IDBSIZ. If the list of reusable records is 0,
|
||
|
the file is extended by one line. Otherwise the new tag is assigned to
|
||
|
the record at the head of the free list.
|
||
|
When tags are deleted, they are added to the free list in the file descriptor
|
||
|
record.
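.para
The reuse of deleted tag records can be sketched as follows. How the free
list is chained on disk is not described here, so the next_free table and
the class name TagFile are assumptions made for illustration.
.lit

```python
class TagFile:
    """Toy model of record allocation in the tag file."""

    def __init__(self, idbsiz):
        self.idbsiz = idbsiz
        self.length = idbsiz   # the file starts with IDBSIZ lines
        self.free_head = 0     # 0 means there are no reusable records
        self.next_free = {}    # freed record -> next record on the free list

    def allocate(self):
        """Return the record number to use for a new tag."""
        if self.free_head == 0:
            self.length += 1   # extend the file by one line
            return self.length
        record = self.free_head
        self.free_head = self.next_free.pop(record)
        return record

    def release(self, record):
        """Put the record of a deleted tag on the free list."""
        if record <= self.idbsiz:
            raise ValueError("header records are not reusable")
        self.next_free[record] = self.free_head
        self.free_head = record


tags = TagFile(500)
first = tags.allocate()    # file grows to 501 lines
second = tags.allocate()   # file grows to 502 lines
tags.release(first)
print(tags.allocate(), tags.length)
```

.end lit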
|
||
|
.para
|
||
|
5. Structure of the comment file
|
||
|
.para
|
||
|
This file initially starts with 1 line, and is expanded as new annotation is
|
||
|
created.
|
||
|
Information about the length of the file, and which comment records are reusable
|
||
|
is stored on the first line.
|
||
|
.lit
|
||
|
|
||
|
---------------------------------------------
|
||
|
1 Length of file=N, Free list=0
|
||
|
2 Comment fragment
|
||
|
3 " "
|
||
|
4 " "
|
||
|
' ' '
|
||
|
' ' '
|
||
|
N-2 " "
|
||
|
N-1 " "
|
||
|
N Comment fragment
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the comment file
|
||
|
|
||
|
.end lit
|
||
|
As each new comment is added to the database, a check is made in the file
|
||
|
descriptor record at line 1. If the list of reusable records is 0,
|
||
|
the file is extended to hold the new comment. Otherwise the new comment is
|
||
|
assigned to records starting with the head of the free list.
|
||
|
When comments are deleted, the discarded records are added to the free list in
|
||
|
the file descriptor record.
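.para
Breaking a comment into 40 character fragments and chaining them can be
sketched as below. The record layout (a fragment paired with the number of
the next record, 0 ending the chain) is an assumption, the last fragment is
left unpadded, and reuse of the free list is omitted for brevity.
.lit

```python
FRAGMENT_LENGTH = 40


def split_comment(text):
    """Cut a comment into fragments of at most 40 characters."""
    pieces = [text[i:i + FRAGMENT_LENGTH]
              for i in range(0, len(text), FRAGMENT_LENGTH)]
    return pieces or [""]


def store_comment(records, text):
    """Append the fragments to `records` and return the first record number.

    `records` maps a record number to (fragment, next record number).
    Record 1 is the file descriptor, so fragments start at record 2.
    """
    fragments = split_comment(text)
    first = max(records, default=1) + 1
    for offset, fragment in enumerate(fragments):
        number = first + offset
        is_last = offset == len(fragments) - 1
        records[number] = (fragment, 0 if is_last else number + 1)
    return first


def load_comment(records, first):
    """Follow the chain from `first` and rebuild the comment."""
    parts = []
    number = first
    while number != 0:
        fragment, number = records[number]
        parts.append(fragment)
    return "".join(parts)


records = {}
start = store_comment(records, "x" * 95)
print(start, len(records), len(load_comment(records, start)))
```

.end lit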
|
||
|
.para
|
||
|
There are various checks within the programs to
|
||
|
protect users from themselves:-
|
||
|
.left margin2
|
||
|
1. All user input is checked for errors - e.g. reference to
|
||
|
non-existent gel
|
||
|
readings or contigs, incorrect positions in the
|
||
|
contig or gel readings.
|
||
|
.left margin2
|
||
|
2. Before entering a gel reading the system checks to see if a
|
||
|
file of the same name has already been entered.
|
||
|
.left margin2
|
||
|
3. Join will not allow the circularising of a contig.
|
||
|
|
||
|
.left margin2
|
||
|
5. Users may escape from any point in the program.
|
||
|
.left margin2
|
||
|
6. Help is available from all points in the program.
|
||
|
.SK2
|
||
|
.LEFT MARGIN2
|
||
|
IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS
|
||
|
DOING
|
||
|
ANYTHING THAT INVOLVES CHANGING THE CONTENTS OF THE
|
||
|
DATABASE, I.E. DURING AUTO ASSEMBLE,
|
||
|
COMPLETE JOIN, COMPLEMENT CONTIG, SAVE EDIT CONTIG.
|
||
|
|
||
|
This could
|
||
|
corrupt the database so badly that it is impossible to fix. The program
|
||
|
should always be left using the QUIT option.
|
||
|
|
||
|
.left margin1
|
||
|
@4. TX 3 @Edit contig
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
The Contig Editor is a mouse-driven editor that can insert,
|
||
|
delete and change gel reading sequences.
|
||
|
.para
|
||
|
The Contig Editor allows scrolling from one end of a contig to the other
|
||
|
using the scroll bar and scroll buttons. Action of mouse button presses
|
||
|
when the mouse pointer is in the scroll bar:
|
||
|
.sk1
|
||
|
.lit
|
||
|
Middle Mouse Button Set editor position
|
||
|
Left Mouse Button Scroll forward one screenful
|
||
|
Right Mouse Button Scroll backwards one screenful
|
||
|
.end lit
|
||
|
.sk1
|
||
|
The four scroll buttons operate as follows:
|
||
|
.sk1
|
||
|
.lit
|
||
|
"<<" Scroll left half a screenful
|
||
|
"<" Scroll left one character
|
||
|
">" Scroll right one character
|
||
|
">>" Scroll right half a screenful
|
||
|
.end lit
|
||
|
.para
|
||
|
The Editor cursor can be positioned anywhere in the edit window by
|
||
|
moving the mouse pointer over the character of interest, then pressing the
|
||
|
left mouse button. The Editor cursor can also be moved by using the
|
||
|
direction arrow keys.
|
||
|
.para
|
||
|
The editor operates in two main edit modes - Replace and Insert. Replace allows
|
||
|
a character to be replaced by another. Insert allows characters to be
|
||
|
inserted into a gel reading sequence. Characters are entered by typing
|
||
|
them from the keyboard. Only valid characters are permitted.
|
||
|
Characters can be deleted by positioning the cursor one character to the right,
|
||
|
then pressing the delete key.
|
||
|
Normally Insert and Delete apply to the consensus line of the contig ONLY.
|
||
|
This restraint can be overridden by using the "Super Edit" mode of
|
||
|
operation, THOUGH IT IS NOT RECOMMENDED.
|
||
|
.para
|
||
|
Edits can also be performed on the consensus, though they are
|
||
|
restricted to insertion and deletion of padding characters ("*").
|
||
|
These edits also have special meanings.
|
||
|
A deletion will delete ALL characters at the position to the left
|
||
|
of the cursor in the contig, and move the relative positions of all
|
||
|
sequences starting to the right of the cursor position left one
|
||
|
character.
|
||
|
An insertion will insert the character typed ("*") into ALL gel
|
||
|
reading sequences at the cursor's position in the contig, and move the
|
||
|
relative positions of all sequences starting to the right of the cursor
|
||
|
position right one character.
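.para
The effect of these two consensus edits on the aligned readings can be
sketched as below. Each reading is held as a (start position, sequence)
pair with positions counted from 1. The treatment of a reading that starts
exactly at the edited column is an assumption, and delete_column takes the
column to remove, which is the one to the left of the cursor.
.lit

```python
def insert_pad(readings, column):
    """Insert "*" at `column` in every reading that covers it."""
    edited = []
    for start, seq in readings:
        end = start + len(seq) - 1
        if start > column:
            edited.append((start + 1, seq))      # starts to the right: shift
        elif end >= column:
            cut = column - start
            edited.append((start, seq[:cut] + "*" + seq[cut:]))
        else:
            edited.append((start, seq))          # ends to the left: untouched
    return edited


def delete_column(readings, column):
    """Delete the character at `column` from every reading that covers it."""
    edited = []
    for start, seq in readings:
        end = start + len(seq) - 1
        if start > column:
            edited.append((start - 1, seq))      # starts to the right: shift
        elif end >= column:
            cut = column - start
            edited.append((start, seq[:cut] + seq[cut + 1:]))
        else:
            edited.append((start, seq))
    return edited


readings = [(1, "ACGT"), (3, "GTAA"), (6, "CC")]
padded = insert_pad(readings, 3)
print(padded)
print(delete_column(padded, 3) == readings)
```

.end lit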
|
||
|
.para
|
||
|
The effect of the last edit can be undone by pressing the "Undo" button
|
||
|
at the top of the editor window.
|
||
|
.para
|
||
|
The cursor will automatically be positioned at the next problem when the
|
||
|
"Find Next Problem" button is selected. The next problem is where the
|
||
|
consensus shows either an ambiguity ("-") or a pad ("*") character.
|
||
|
.para
|
||
|
The edits to the contig can be saved by pressing the "Leave Editor"
|
||
|
button and replying "Yes" to the prompt to "Save changes?". As no changes
|
||
|
are made to the working copy of your database until this point it
|
||
|
is possible to abort the editor if
|
||
|
the edit session ends up in an unsatisfactory state (ie if you've
|
||
|
stuffed it up!)
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Displaying Traces
|
||
|
.left margin2
|
||
|
.para
|
||
|
The original data from which the gel reading sequences were derived can
|
||
|
be seen by double clicking (two quick clicks) with the middle mouse button
|
||
|
on the area of interest. The trace will be displayed with the point
|
||
|
clicked at the centre of the trace viewport.
|
||
|
.para
|
||
|
All traces that are displayed are maintained in one window, called the Trace
|
||
|
Manager. The Trace Manager will only display four traces maximum. When four
|
||
|
traces are already being managed and a new one is requested, the one at the top
|
||
|
of the Trace Manager is removed and the new one is added to the bottom.
|
||
|
Traces can be removed individually by using the "quit" button in the panel next
|
||
|
to the trace.
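.para
The four trace limit behaves like a fixed-length queue. The sketch below
uses Python's collections.deque with maxlen=4 to show the behaviour. The
class is a hypothetical stand-in for the Trace Manager window, and the
trace names are taken from examples elsewhere in this help.
.lit

```python
from collections import deque


class TraceManager:
    """Holds at most four traces; index 0 is the top of the window."""

    MAX_TRACES = 4

    def __init__(self):
        self.traces = deque(maxlen=self.MAX_TRACES)

    def show(self, name):
        # Appending to a full deque discards the entry at the top.
        self.traces.append(name)

    def quit(self, name):
        # Remove a single trace, as the "quit" button does.
        self.traces.remove(name)


manager = TraceManager()
for name in ["HINW.004", "HINW.018", "HINW.017", "HINW.009", "HINW.028"]:
    manager.show(name)
print(list(manager.traces))
```

.end lit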
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Extending Reads Using Cutoff Information
|
||
|
.left margin2
|
||
|
.para
|
||
|
Sequence data read in from automated fluorescent sequencing machine
|
||
|
trace files processed through the program ted
|
||
|
will have the discarded sequence (vector at start and poor read at
|
||
|
end) available to the contig editor. To display the cutoff
|
||
|
information, press the "Display Cutoff" button at the top of the
|
||
|
editor window.
|
||
|
The cutoff sequence appears in grey. This sequence can be incorporated
|
||
|
into the editable sequence, by moving the cutoff position. This is
|
||
|
done by positioning the cursor at the end of the gel sequence, and
|
||
|
using Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of cutoff.
|
||
|
The Meta key is a diamond on the Sun keyboard.
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Pop-up menu
|
||
|
.left margin2
|
||
|
.para
|
||
|
A pop-up menu is revealed by depressing the "Control" key on the keyboard
|
||
|
and at the same time pressing the left mouse button. The menu has the following
|
||
|
functions:
|
||
|
.lit
|
||
|
|
||
|
Search
|
||
|
Highlight Disagreements
|
||
|
Save Contig
|
||
|
Create Tag
|
||
|
Edit Tag
|
||
|
Delete Tag
|
||
|
Select Oligo
|
||
|
|
||
|
.end lit
|
||
|
.left margin2
|
||
|
"Highlight Disagreements" simply toggles between the normal display showing
|
||
|
the current base assignments and one in which only those assignments that
|
||
|
differ from the consensus are shown.
|
||
|
|
||
|
.left margin2
|
||
|
"Save Contig" is described above.
|
||
|
Searching and operations on tags are described below.
|
||
|
.left margin2
|
||
|
.sk3
|
||
|
Searching
|
||
|
.left margin2
|
||
|
.para
|
||
|
Selecting "Search" brings up a
|
||
|
window which can remain present during normal editor operation. The
|
||
|
window allows the user to select the direction of search, the type of
|
||
|
search and a value to search on. The value is entered into the value
|
||
|
text window. Then pressing the "search" button
|
||
|
performs the search. If successful, the cursor is positioned and
|
||
|
centred accordingly. An audible tone indicates failure. Pressing the
|
||
|
"ok" button removes the search window. The search window is
|
||
|
automatically removed when the contig editor is exited.
|
||
|
.sk1
|
||
|
There are seven different search modes:
|
||
|
.sk1
|
||
|
1. Search by position
|
||
|
.sk1
|
||
|
This positions the cursor at the numeric position specified in the
|
||
|
value text window. Eg a value of "1234" causes the cursor to be placed
|
||
|
at base number 1234 in the contig. Positioning within a gel reading is
|
||
|
achieved by prefixing the number with the "@" character, eg "@123"
|
||
|
positions the cursor at base 123 of the sequence in which the cursor
|
||
|
lies. Relative positions can be specified by prefixing the number with
|
||
|
a plus or minus character. Eg "+1234" will advance the cursor 1234
|
||
|
bases. If possible, the cursor is positioned within the same sequence.
|
||
|
The direction buttons have no effect on the operation of "search
|
||
|
by position".
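.para
The three forms of value accepted by "search by position" can be sketched
as a small parser. The function name and its arguments are hypothetical,
and no attempt is made here to keep the result inside the contig or inside
the same sequence.
.lit

```python
def resolve_position(value, cursor, reading_start):
    """Turn a "search by position" value into a contig position.

    cursor        -- current cursor position in the contig
    reading_start -- contig position of base 1 of the reading under the cursor
    """
    value = value.strip()
    if value.startswith("@"):
        # "@123": base 123 of the reading in which the cursor lies.
        return reading_start + int(value[1:]) - 1
    if value[:1] in ("+", "-"):
        # "+1234" or "-1234": relative to the current cursor position.
        return cursor + int(value)
    # "1234": absolute position in the contig.
    return int(value)


print(resolve_position("1234", cursor=10, reading_start=100))
print(resolve_position("@123", cursor=150, reading_start=100))
print(resolve_position("+1234", cursor=10, reading_start=100))
```

.end lit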
|
||
|
.sk1
|
||
|
2. Search by reading name
|
||
|
.sk1
|
||
|
This positions the cursor at the left end of the gel reading specified
|
||
|
in the value text window. If the value is prefixed with a slash it is
|
||
|
assumed to be a gel reading name. Otherwise it is assumed to be a gel
|
||
|
reading number. Eg "123" positions the cursor at the left end of gel
|
||
|
reading number 123. "/a16a12.s1" positions at the start of reading
|
||
|
a16a12.s1. If the value was "/a16" the cursor is positioned at the
|
||
|
first reading which starts with "a16". The direction buttons have no
|
||
|
effect on the operation of "search by reading name".
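.para
The rule for telling names from numbers can be sketched as below. The
readings are given as (number, name) pairs, and "first" is taken to mean
first in the order supplied, which is an assumption.
.lit

```python
def find_reading(value, readings):
    """Return the number of the reading selected by `value`, or None."""
    if value.startswith("/"):
        # A leading slash marks a reading name; a prefix is enough.
        prefix = value[1:]
        for number, name in readings:
            if name.startswith(prefix):
                return number
        return None
    # Without a slash the value is a gel reading number.
    wanted = int(value)
    for number, _name in readings:
        if number == wanted:
            return number
    return None


readings = [(121, "a15h3.s1"), (122, "a16a9.s1"), (123, "a16a12.s1")]
print(find_reading("123", readings))
print(find_reading("/a16a12.s1", readings))
print(find_reading("/a16", readings))
```

.end lit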
|
||
|
.sk1
|
||
|
3. Search by tag type.
|
||
|
.sk1
|
||
|
This positions the cursor at the start of the next tag which has the
|
||
|
same type as specified by the type value menu. To change the type,
|
||
|
select from the menu that pops up when the mouse is clicked on the
|
||
|
button labeled "Type:". The search can be performed either forwards
|
||
|
or backwards from the current cursor position. To find all tags, use
|
||
|
"search by annotation", with a null text value string.
|
||
|
.sk1
|
||
|
4. Search by annotation.
|
||
|
.sk1
|
||
|
This positions the cursor at the start of the next tag which has a
|
||
|
comment containing the string specified in the value text window. The
|
||
|
search performed is a regular expression search, and certain
|
||
|
characters have special meaning. Be careful when your value string
|
||
|
contains ".", "*", "[", "^" or "$". The search can be performed either
|
||
|
forwards or backwards from the current cursor position.
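.para
The effect of the special characters can be seen with any regular expression
library. The sketch below uses Python's re module purely as an illustration.
The dialect used by the program is not stated here and may differ, and the
comments are invented examples.
.lit

```python
import re

comments = ["primer a16a9.s1", "compression at 3.5 kb", "a16a9xs1 rerun"]
value = "a16a9.s1"

# As typed, "." matches any character, so "a16a9xs1" is found as well.
loose = [c for c in comments if re.search(value, c)]

# With the special characters escaped, only the literal text matches.
strict = [c for c in comments if re.search(re.escape(value), c)]

# A null value matches every comment.
everything = [c for c in comments if re.search("", c)]

print(loose)
print(strict)
print(len(everything))
```

.end lit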
|
||
|
.sk1
|
||
|
5. Search by sequence.
|
||
|
.sk1
|
||
|
This positions the cursor at the start of the next piece of sequence
|
||
|
that matches the value specified in the text value window. The search
|
||
|
is for an exact match, which means the case of value string is
|
||
|
important. The search is performed on the gel readings themselves,
|
||
|
rather than the consensus sequence. The search can be performed either
|
||
|
forwards or backwards from the current cursor position.
|
||
|
.sk1
|
||
|
6. Search by problem.
|
||
|
.sk1
|
||
|
This positions the cursor at the next place in the consensus sequence
|
||
|
which is not an "A", "C", "G" or "T". The search can be performed
|
||
|
either forwards or backwards from the current cursor position.
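.para
The test applied by "search by problem" can be sketched as a scan along the
consensus. Positions are counted from 1, the function name is hypothetical,
and None stands in for the audible tone that signals failure.
.lit

```python
def next_problem(consensus, cursor, forwards=True):
    """Return the next position that is not A, C, G or T, or None."""
    step = 1 if forwards else -1
    position = cursor + step
    while 1 <= position <= len(consensus):
        if consensus[position - 1] not in "ACGT":
            return position
        position += step
    return None


consensus = "ACG-TT*A"
print(next_problem(consensus, 1))
print(next_problem(consensus, 7, forwards=False))
```

.end lit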
|
||
|
.sk1
|
||
|
7. Search by quality
|
||
|
.sk1
|
||
|
This positions the cursor at the next place in the consensus sequence
|
||
|
where the consensus calculations for the two strands disagree. When only
|
||
|
sequences on one strand are present, the search will stop at every
|
||
|
base. The search can be performed either forwards or backwards from the
|
||
|
current cursor position.
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Annotation
|
||
|
.left margin2
|
||
|
.para
|
||
|
Parts of a sequence can be annotated, to record the positions of primers used
|
||
|
for walking, or to mark sites, such as compressions that have caused problems
|
||
|
during sequencing.
|
||
|
The consensus sequence CANNOT be annotated.
|
||
|
.para
|
||
|
To annotate a piece of sequence first select the part of sequence
|
||
|
using the mouse buttons. Use the left mouse button to position the start of the
|
||
|
selection, and while this button is being held down, move the mouse to extend.
|
||
|
The selection can be extended further using the right mouse button.
|
||
|
.para
|
||
|
To create annotation, invoke the pop-up menu, and select the "Create Tag"
|
||
|
function. A small "tag editor" will appear which
|
||
|
allows you to select the type of the
|
||
|
annotation from a pull-down menu, and specify a comment if desired.
|
||
|
To select a new type pull down the Type menu, and select the entry desired.
|
||
|
To enter a comment, simply type into the text window in the tag editor.
|
||
|
The annotation is created when the "Leave" button on the tag editor is pressed,
|
||
|
and is displayed in the colour defined in the tag database file (TAGDB).
|
||
|
.para
|
||
|
To edit existing annotation,
|
||
|
position the cursor with the left mouse button
|
||
|
on the tag, and select the
|
||
|
"Edit Tag"
|
||
|
function from the pop-up menu.
|
||
|
This invokes the tag editor, and changes to the type and comment of the
|
||
|
annotation can be made. The tag is updated when the "Leave" button is pressed.
|
||
|
.para
|
||
|
To delete an existing annotation,
|
||
|
position the cursor with the left mouse button
|
||
|
on the tag, and select the
|
||
|
"Delete Tag"
|
||
|
function from the pop-up menu.
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
NOTE:
|
||
|
.left margin2
|
||
|
.para
|
||
|
As the Contig Editor is a very powerful tool, it is possible that the alignment
|
||
|
of the gel reading sequences has unexpectedly been disrupted.
|
||
|
This can easily happen to parts of the contig that lie to the right
|
||
|
of the screen if excessive use has been made of the "Super Edit" facility.
|
||
|
Until familiar with "Super Edit" it would benefit the sequencer to quickly
|
||
|
scan through the contig after editing to check that bad alignments have not
|
||
|
been created.
|
||
|
.sp
|
||
|
.left margin2
|
||
|
Selecting Oligos
|
||
|
----------------
|
||
|
.sk1
|
||
|
.left margin2
|
||
|
1. Open the oligo selection window, by selecting "Select Oligo" from
|
||
|
the contig editor popup menu.
|
||
|
|
||
|
.left margin2
|
||
|
2. Position the cursor to where you want the oligo to be chosen. While
|
||
|
the oligo selection window is visible, you will still have complete
|
||
|
control over positioning and editing within the contig editor.
|
||
|
|
||
|
.left margin2
|
||
|
3. Indicate the strand for which you require an oligo. This is done by
|
||
|
toggling the direction arrow ("----->" or "<------"), if necessary.
|
||
|
|
||
|
.left margin2
|
||
|
3. Press the "Find Oligos" button to find all suitable oligos (See
|
||
|
"Oligo selection" below.) Information for the closest oligo to the
|
||
|
cursor position is given in the output text window. In the contig
|
||
|
editor the position of the oligo is marked by a temporary tag on the
|
||
|
consensus. The window is recentered if the oligo is off the screen.
|
||
|
Selecting "Display Selection Information" will print a short report on
|
||
|
the numbers of oligos considered and rejected during oligo selection.
|
||
|
|
||
|
.left margin2
|
||
|
4. If this oligo is not suitable (it may have been previously chosen,
|
||
|
and found to be unsuitable by experimentation, say), the next closest
|
||
|
oligo can be viewed by pressing "Select Next".
|
||
|
|
||
|
.left margin2
|
||
|
5. Suitable templates are automatically identified for the currently
|
||
|
displayed oligo (See "Template selection" below.) By default, the
|
||
|
template is that closest to the oligo site. If the choice is not
|
||
|
suitable (it may be known to be a poor quality template, say) another
|
||
|
can be chosen from the "Choose Template for this Oligo" menu.
|
||
|
Templates that do not appear on the menu can be specified by selecting
|
||
|
"other". However, the template must be on the correct strand and be
|
||
|
upstream of the oligo.
|
||
|
|
||
|
.left margin2
|
||
|
6. A tag can be created for the current oligo by pressing the button
|
||
|
"Create a tag for this oligo". The annotation for this tag holds the
|
||
|
name of the template and the oligo primer sequence. There are fields
|
||
|
to allow the user to specify their own primer name ("serial#") and
|
||
|
comments ("flags") for this tag. An example of oligo tag annotation:
|
||
|
.lit
|
||
|
serial#=
|
||
|
template=a16a9.s1
|
||
|
sequence=CGTTATGACCTATATTTTGTATG
|
||
|
flags=
|
||
|
|
||
|
.end lit
|
||
|
.left margin2
|
||
|
7. The oligo selection window is closed when "Create a tag for this
|
||
|
oligo" or "Quit" is selected.
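.para
The oligo tag annotation shown in step 6 is a set of name=value lines, so it
can be read back with very little code. The parser below is a sketch with a
hypothetical function name. It assumes one field per line, as in the example.
.lit

```python
def parse_oligo_annotation(text):
    """Split an oligo tag annotation into a dictionary of fields."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Everything before the first "=" is the field name.
        key, _, value = line.partition("=")
        fields[key] = value
    return fields


annotation = """serial#=
template=a16a9.s1
sequence=CGTTATGACCTATATTTTGTATG
flags=
"""
fields = parse_oligo_annotation(annotation)
print(fields["template"], len(fields["sequence"]))
```

.end lit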
|
||
|
|
||
|
|
||
|
.left margin2
|
||
|
Oligo selection:
|
||
|
.left margin2
|
||
|
----------------
|
||
|
|
||
|
.left margin2
|
||
|
The oligo selection engine is the one used in the program OSP. It is
|
||
|
described in some detail in:
|
||
|
|
||
|
.left margin2
|
||
|
Hillier, L., and Green, P. (1991). "OSP: an oligonucleotide
|
||
|
selection program," PCR Methods and Applications, 1:124-128.
|
||
|
|
||
|
.left margin2
|
||
|
The parameters controlling the selection of oligos can be changed in
|
||
|
the "Oligo Selection Parameters" window. The weights controlling the
|
||
|
scoring of selected oligos can be changed in the "Oligo Selection
|
||
|
Weights" window.
|
||
|
|
||
|
.left margin2
|
||
|
By default, the oligos are selected from a window that extends 40
|
||
|
bases either side of the cursor. The size and location of this window
|
||
|
relative to the cursor position can be changed in the "Parameters"
|
||
|
window.
|
||
|
|
||
|
.left margin2
|
||
|
In xbap oligos are ranked according to their proximity to the cursor
|
||
|
position, rather than by their scores.
|
||
|
|
||
|
|
||
|
.left margin2
|
||
|
Template selection:
|
||
|
.left margin2
|
||
|
-------------------
|
||
|
|
||
|
.left margin2
|
||
|
For simplicity, each reading is considered to represent a template. In
|
||
|
practice, many readings can be made of the same template. Suitable
|
||
|
templates that are identified are those that:
|
||
|
.lit
|
||
|
|
||
|
1. are in the appropriate sense,
|
||
|
2. have 5' ends that start upstream of the oligo,
|
||
|
and 3. are sufficiently close to the oligo to be useful.
|
||
|
|
||
|
.end lit
|
||
|
.left margin2
|
||
|
|
||
|
This last criterion relates to the insert size for the subclones used
|
||
|
for sequencing and the average reading length. A template is
|
||
|
considered useful if a full reading can be made from it, taking into
|
||
|
account both of these factors. The default insert size is 1000 bases,
|
||
|
and the default average reading length is 400 bases. These values can
|
||
|
be changed in the "Parameters" window.
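.para
One possible reading of the three criteria is sketched below. The exact test
used by the program is not given here, so the inequality for "sufficiently
close" is an assumption: a full reading starting at the oligo must end
within one insert length of the 5' end of the template. The function name
and arguments are hypothetical.
.lit

```python
def is_suitable_template(template_sense, template_start,
                         oligo_sense, oligo_pos,
                         insert_size=1000, read_length=400):
    """Apply the three template criteria under the stated assumptions.

    template_start is the contig position of the 5' end of the reading;
    senses are "+" or "-".
    """
    # 1. The template must be in the appropriate sense.
    if template_sense != oligo_sense:
        return False
    if oligo_sense == "+":
        # 2. The 5' end must lie upstream of the oligo.
        upstream = template_start < oligo_pos
        # 3. A full reading from the oligo must fit inside the insert.
        fits = oligo_pos + read_length <= template_start + insert_size
    else:
        # On the other strand the same tests are mirrored.
        upstream = template_start > oligo_pos
        fits = oligo_pos - read_length >= template_start - insert_size
    return upstream and fits


print(is_suitable_template("+", 100, "+", 500))
print(is_suitable_template("+", 100, "+", 800))
```

.end lit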
|
||
|
|
||
|
.left margin1
|
||
|
@5. TX 1 @Display a contig
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to show the aligned gel readings for any part of a contig. The
|
||
|
|
||
|
number, name and strandedness of each gel reading is shown and the
|
||
|
|
||
|
consensus is written below.
|
||
|
.para
|
||
|
If required identify the contig, and then the start and end points of the
|
||
|
|
||
|
region to display.
|
||
|
.para
|
||
|
The display can be directed to a disk file using "direct output to disk".
|
||
|
|
||
|
.para
|
||
|
Below is an example showing the left end of a contig from
|
||
|
position 1 to 200. Overlapping this region are gels 6, 3, 5, 17 and 12;
|
||
|
6, 3 and 5
|
||
|
are in reverse orientation to their archives (denoted by a minus sign).
|
||
|
There are a few uncertainty codes and a few padding
|
||
|
characters in the working versions, but the consensus (shown
|
||
|
below
|
||
|
each page width) has a definite assignment for almost every
|
||
|
position.
|
||
|
.lit
|
||
|
|
||
|
10 20 30 40 50
|
||
|
-6 HINW.010 GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
||
|
CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
||
|
|
||
|
60 70 80 90 100
|
||
|
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
|
||
|
-3 HINW.007 GGCACA*GTC
|
||
|
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC
|
||
|
|
||
|
110 120 130 140 150
|
||
|
-6 HINW.010 GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
|
||
|
-3 HINW.007 GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
|
||
|
-5 HINW.009 GGCAGCGACCGTCG
|
||
|
17 HINW.999 AGCGACCGTCG
|
||
|
CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG
|
||
|
|
||
|
160 170 180 190 200
|
||
|
-6 HINW.010 TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
-3 HINW.007 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
-5 HINW.009 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
17 HINW.999 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
|
||
|
12 HINW.017 GTAGAGC*
|
||
|
CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
.END LIT
|
||
|
.left margin1
|
||
|
@6. TX 1 @List a text file
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
This option allows users to list text files on the screen. It can be used
|
||
|
to read a file containing notes, for checking files written to disk etc. The
|
||
|
user is asked to type the name of the file to list.
|
||
|
.left margin1
|
||
|
@8. TX 1 @Calculate a consensus
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Calculates a consensus sequence either for the whole database or
|
||
|
|
||
|
for selected contigs. The consensus is written to a file named by the
|
||
|
user.
|
||
|
.left margin2
|
||
|
Supply a file name and choose between the whole database or selected contigs.
|
||
|
.para
|
||
|
Symbols for uncertainty in gel readings
|
||
|
.para
|
||
|
In order to record uncertainties when reading gels the codes shown
|
||
|
|
||
|
below can be used. Use of these codes permits us to extract the
|
||
|
|
||
|
maximum amount of data from each gel and yet record any doubts by
|
||
|
|
||
|
choice of code. The program can deal with all of these codes and any
|
||
|
|
||
|
other characters in a sequence are treated as dash (-) characters.
|
||
|
|
||
|
|
||
|
.lit
|
||
|
|
||
|
SYMBOL     MEANING

  1        PROBABLY C
  2           "     T
  3           "     A
  4           "     G
  D           "     C   POSSIBLY CC
  V           "     T      "     TT
  B           "     A      "     AA
  H           "     G      "     GG
  K           "     C      "     C-
  L           "     T      "     T-
  M           "     A      "     A-
  N           "     G      "     G-
  R        A OR G
  Y        C OR T
  5        A OR C
  6        G OR T
  7        A OR T
  8        G OR C
  -        A OR G OR C OR T
  a        A
  c        C
  g        G
  t        T
  *        padding character placed by auto assembler
 else      = -
|
||
|
|
||
|
.end lit
|
||
|
|
||
|
.LEFT MARGIN2
|
||
|
The DNA consensus algorithm
|
||
|
.para
|
||
|
The "calculate consensus" function, the "display contig" routine and the
|
||
|
|
||
|
"show quality" option use the rules outlined here to calculate a
|
||
|
|
||
|
consensus from aligned gel readings. Note that "display contig"
|
||
|
calculates
|
||
|
a consensus for each page width it displays (it does not use the
|
||
|
|
||
|
consensus sequence file calculated by the consensus function).
|
||
|
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
We have 6 possible symbols in the consensus sequence: A,C,G,T,* and -. The
|
||
|
last symbol is assigned if none of the others makes up a sufficient
|
||
|
proportion of the aligned characters at any position in the contig. The
|
||
|
following calculation is used to decide which symbol to place in the
|
||
|
consensus at each position.
|
||
|
.para
|
||
|
Each uncertainty code contributes a score
|
||
|
to one of A,C,G,T,* and also to the total at each point. Symbols like R
|
||
|
and Y which don't correspond to a single base type contribute only to the
|
||
|
total at each point. The scores are shown below.
|
||
|
.lit
|
||
|
definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1
|
||
|
|
||
|
probable assignments ie 1,2,3,4 = 0.75
|
||
|
|
||
|
other uncertainty codes including R,Y,5,6,7,8,- = 0.1
|
||
|
.end lit
|
||
|
.para
|
||
|
A cutoff score of 51% to 100% is supplied by the user. (When the program
|
||
|
starts this is set to 75%. See "set display parameters").
|
||
|
At each position in the contig we calculate the total score for each of
|
||
|
the 5 symbols
|
||
|
A,C,G,T and * (denote these by Xi, where i=A,C,G,T or *),
|
||
|
and also the sum of these totals
|
||
|
(denote this by S). Then if 100 Xi / S > the cutoff for any i, symbol i is
|
||
|
placed in the consensus; otherwise - is assigned.
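.para
The calculation above can be sketched in Python as follows. This is only an
illustration of the stated rules, not the program's own code: the function and
table names are hypothetical, S is assumed to include the 0.1 scores of codes
such as R and Y, and the comparison is taken as a strict ">" as written.
.lit
```python
# Code-to-symbol mapping taken from the uncertainty table in this section.
DEFINITE = {"A": "A", "C": "C", "G": "G", "T": "T",
            "a": "A", "c": "C", "g": "G", "t": "T",
            "D": "C", "V": "T", "B": "A", "H": "G",
            "K": "C", "L": "T", "M": "A", "N": "G", "*": "*"}
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}
AMBIGUOUS_SCORE = 0.1  # R, Y, 5, 6, 7, 8, - and any unrecognised character


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for one column of aligned characters."""
    totals = {s: 0.0 for s in "ACGT*"}  # the Xi of the text
    s_total = 0.0                       # S (assumed to include the 0.1 scores)
    for ch in column:
        if ch in DEFINITE:
            totals[DEFINITE[ch]] += 1.0
            s_total += 1.0
        elif ch in PROBABLE:
            totals[PROBABLE[ch]] += 0.75
            s_total += 0.75
        else:
            s_total += AMBIGUOUS_SCORE
    if s_total == 0:
        return "-"
    for symbol, x in totals.items():
        if 100.0 * x / s_total > cutoff:
            return symbol
    return "-"
```
.end lit
For example, a column holding the codes 3 and A scores 1.75 for A out of a
total of 1.75 and is assigned A, whereas a column holding * and C gives each
symbol only 50% and is assigned a dash.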
|
||
|
.para
|
||
|
Notice that S does not equal the number of times the sequence has been
|
||
|
determined, but is the score total, and hence we are less likely to put a -
|
||
|
in the consensus. For the "examine quality" algorithm each strand is
|
||
|
treated separately but the calculation is the same. (It was originally
|
||
|
different).
|
||
|
.para
|
||
|
Format of the consensus sequence (and vector sequences).
|
||
|
.para
|
||
|
A consensus sequence file may contain the consensus for several contigs
|
||
|
|
||
|
and so we identify each of them by preceding them by a 20 character
|
||
|
|
||
|
title. The title is of the form <---LAMBDA.0076----> (where LAMBDA is
|
||
|
|
||
|
the project name and gel reading number
|
||
|
|
||
|
|
||
|
76 is the leftmost gel
|
||
|
reading to contribute to this consensus sequence).
|
||
|
|
||
|
|
||
|
The angle brackets <> and the 4 digit number preceded by a full stop (.)
|
||
|
|
||
|
are important to some processing programs.
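.para
As an illustration of the title format, the following Python sketch builds and
parses a 20 character title. The function names are hypothetical, and the way
the dashes are split between the two sides (any odd dash going on the right)
is an assumption inferred from the single example <---LAMBDA.0076---->; the
parser also assumes that project names contain no dashes or full stops.
.lit
```python
import re

TITLE_WIDTH = 20
TITLE_RE = re.compile(r"<-*([^-.<>]+)\.(\d{4})-*>")


def make_title(project, gel_number):
    """Build a title such as <---LAMBDA.0076----> (20 characters)."""
    core = f"{project}.{gel_number:04d}"
    dashes = TITLE_WIDTH - 2 - len(core)
    if dashes < 0:
        raise ValueError("project name too long for a 20 character title")
    left = dashes // 2  # assumed split: the extra dash goes on the right
    return "<" + "-" * left + core + "-" * (dashes - left) + ">"


def parse_title(title):
    """Return (project name, leftmost gel reading number) from a title."""
    match = TITLE_RE.fullmatch(title)
    if match is None:
        raise ValueError("not a consensus title: " + title)
    return match.group(1), int(match.group(2))
```
.end lit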
|
||
|
.left margin1
|
||
|
@25. TX 1 @Show relationships
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to show the relationships of the gel readings in the database in
|
||
|
|
||
|
three ways -
|
||
|
.LEFT MARGIN2
|
||
|
(a) All contig descriptor lines followed by all gel descriptor
|
||
|
lines.
|
||
|
.LEFT MARGIN2
|
||
|
(b) All contigs one after the other sorted, i.e. for each
|
||
|
contig show its contig descriptor line followed by all its
|
||
|
gel descriptor lines sorted on position from left to right
|
||
|
.LEFT MARGIN2
|
||
|
(c) Selected contigs: show the contig line and, in order,
|
||
|
those gel readings that cover a user-defined region.
|
||
|
Note that this output can be directed to a disk file by
|
||
|
prior selection of "redirect output".
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Below is an example showing a contig from position
|
||
|
1 to 689. The left gel reading is number 6 and has archive
|
||
|
name HINW.010, the
|
||
|
rightmost gel reading is number 2 and has archive name HINW.004.
|
||
|
On each gel descriptor line is shown:
|
||
|
the name of the archive version, the gel number, the position of the
|
||
|
left end of the gel reading relative to the left end of the contig, the
|
||
|
length of the gel
|
||
|
reading (if this is negative it means that the gel reading is in
|
||
|
the opposite orientation to its archive), the number of the gel
|
||
|
reading to
|
||
|
the left and the number of the gel reading to the right.
|
||
|
.lit
|
||
|
|
||
|
|
||
|
CONTIG LINES
CONTIG LINE  LENGTH           ENDS
                       LEFT  RIGHT
         48     689       6      2
GEL LINES
NAME      NUMBER  POSITION  LENGTH     NEIGHBOURS
                                      LEFT  RIGHT
HINW.010       6         1    -279       0      3
HINW.007       3        91    -265       6      5
HINW.009       5       137    -299       3     17
HINW.999      17       140     273       5     12
HINW.017      12       193     265      17     18
HINW.031      18       385    -245      12      2
HINW.004       2       401    -289      18      0
|
||
|
|
||
|
.end lit
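.para
The gel descriptor lines in the example can be read by a short script. The
Python sketch below uses hypothetical names and assumes the whitespace
separated field order shown in the listing. It parses the seven gel lines,
recovers the contig length as the largest right end, and lists the readings
that are reversed relative to their archives.
.lit
```python
from typing import NamedTuple


class GelLine(NamedTuple):
    name: str
    number: int
    position: int   # left end relative to the left end of the contig
    length: int     # negative: opposite orientation to the archive
    left: int       # number of left neighbour, 0 if none
    right: int      # number of right neighbour, 0 if none


def parse_gel_line(text):
    name, *fields = text.split()
    return GelLine(name, *(int(f) for f in fields))


def right_end(gel):
    return gel.position + abs(gel.length) - 1


lines = """HINW.010 6 1 -279 0 3
HINW.007 3 91 -265 6 5
HINW.009 5 137 -299 3 17
HINW.999 17 140 273 5 12
HINW.017 12 193 265 17 18
HINW.031 18 385 -245 12 2
HINW.004 2 401 -289 18 0""".splitlines()

gels = [parse_gel_line(line) for line in lines]
contig_length = max(right_end(g) for g in gels)
reversed_names = [g.name for g in gels if g.length < 0]
```
.end lit
For this example the largest right end is 689, which agrees with the length
given on the contig line.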
|
||
|
.left margin1
|
||
|
@23. TX 3 @Complement a contig
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
This function will complement and reverse all of the gel
|
||
|
readings in a
|
||
|
contig. It automatically reverses and complements each gel
|
||
|
reading sequence, reorders left and right neighbours, recalculates
|
||
|
relative
|
||
|
positions and changes each strandedness.
|
||
|
.PARA
|
||
|
The only user input required is to identify the contig to
|
||
|
complement by the number or name of a gel reading it contains.
|
||
|
DO NOT KILL THE
|
||
|
PROGRAM DURING THIS STEP!
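.para
The bookkeeping involved can be sketched in Python as below. This is an
illustration under stated assumptions rather than the program's own routine:
the names are hypothetical, only A, C, G, T (in either case) are complemented,
padding and dash characters are left unchanged, and the other uncertainty
codes, which would need their own complement table, are not handled.
.lit
```python
# Only plain bases are complemented; * and - pass through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse and complement one gel reading sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def complement_reading(position, length, left, right, contig_length):
    """Return (position, length, left, right) after complementing the contig.

    The old right end becomes the new left end, the sign of the length
    (the strandedness) flips, and the neighbours swap sides.
    """
    old_right_end = position + abs(length) - 1
    new_position = contig_length - old_right_end + 1
    return new_position, -length, right, left
```
.end lit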
|
||
|
.left margin1
|
||
|
@22. TX 3 @ Join contigs
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
This function joins contigs interactively using a mouse driven editor.
|
||
|
The operation of this editor is very similar to the Contig Editor
|
||
|
described in "Edit".
|
||
|
|
||
|
.para
|
||
|
It allows the
|
||
|
user to align the ends of the two contigs by editing each
|
||
|
contig separately. It is important that the alignment achieved is
|
||
|
correct because once the join is completed the alignment is fixed.
|
||
|
The program needs to know which two contigs to join.
|
||
|
.para
|
||
|
First identify the two contigs that are to be joined.
|
||
|
The program checks that the two contig numbers are different (it will not
|
||
|
allow circles to be formed!)
|
||
|
.para
|
||
|
The Join Editor consists of two Contig Editors in between which is sandwiched
|
||
|
a disagreement box. This disagreement box shows exclamation marks to
|
||
|
denote mismatches between the two consensuses.
|
||
|
.para
|
||
|
For example, the display will look something like this:
|
||
|
.lit
|
||
|
|
||
|
                    1460      1470      1480      1490      1500
  56 HINW.100 TCT*GAGCAGTGTGGGCGCTG*CCGG
  33 HINW.300 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
 -25 HINW.090 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
  19 HINW.123 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
    CONSENSUS TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
     MISMATCH                      !                      !!!!!!
                      10        20        30        40        50
  -6 HINW.010 TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
  -3 HINW.007             TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
  -5 HINW.009                           GCTCGGAGGGCATGAAGT*AGAGC
    CONSENSUS TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
|
||
|
|
||
|
.END LIT
|
||
|
.para
|
||
|
The overlap must be of at least one character.
|
||
|
Use the scroll bar and the scroll buttons (`<<', `<', `>' and `>>')
to adjust the relative positions of the two contigs.
|
||
|
.para
|
||
|
The join position can be fixed
|
||
|
by pressing the `lock' button at the top of the Join Editor.
|
||
|
Locking allows the two contigs to be scrolled as one when using the scroll bar
|
||
|
and buttons, the left ends always in the same position relative to each
|
||
|
other.
|
||
|
.para
|
||
|
Once locked, it is best to proceed to the right along the contigs, inserting
|
||
|
padding characters (`*') into the consensuses to minimise the
|
||
|
disagreements.
|
||
|
.para
|
||
|
It is essential that the user aligns the two contigs throughout the whole
|
||
|
region of overlap before completing the join because it is only at this
|
||
|
stage that the two contigs can be edited independently. Once the join is
|
||
|
completed the alignment can only be altered using the routines supplied
|
||
|
by "alter relationships".
|
||
|
.para
|
||
|
The join can be completed by pressing the `Leave Editor' button. The
|
||
|
percentage mismatch is displayed, and the user is required to confirm that
|
||
|
they want to perform the join.
|
||
|
.left margin1
|
||
|
@24. TX 1 @ Copy the database
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
Used to make a copy of the database. If required the database size can be
|
||
|
|
||
|
altered using this option. The "version" of a database is encoded as the
|
||
|
|
||
|
last letter in the names of the five files that contain the database.
|
||
|
|
||
|
.para
|
||
|
Supply a "version" number (the default is version 1), and if required
|
||
|
|
||
|
select a new size for the database. The size of a database is the number
|
||
|
of
|
||
|
lines of information it can hold. It needs a line for each gel reading and
|
||
|
|
||
|
another for each contig.
|
||
|
.left margin1
|
||
|
@19. TX 1 @ Check database
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to perform a check on the logical consistency of the
|
||
|
database. No user intervention is required. If selected "with
|
||
|
dialogue" the program also checks for any sections of the consensus that
|
||
|
contain 15 dashes in 20 characters.
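.para
The dash check can be pictured as a sliding window over the consensus. The
Python sketch below is an illustration only: the function name is
hypothetical, and "15 dashes in 20 characters" is assumed to mean at least 15
dashes in any window of 20 consecutive consensus characters.
.lit
```python
def dashed_regions(consensus, window=20, min_dashes=15):
    """Return the 1-based start of each window with >= min_dashes dashes."""
    starts = []
    for i in range(len(consensus) - window + 1):
        if consensus[i:i + window].count("-") >= min_dashes:
            starts.append(i + 1)
    return starts
```
.end lit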
|
||
|
.para
|
||
|
The following relationships are checked:
|
||
|
.LEFT MARGIN2
|
||
|
1. If gel reading A thinks gel reading B is its left
|
||
|
neighbour
|
||
|
|
||
|
does B think A is
|
||
|
its right neighbour?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
"Hand holding problem for gel reading A"
|
||
|
.left margin2
|
||
|
followed by the
|
||
|
gel descriptor lines for gel readings A and B.
|
||
|
.LEFT MARGIN2
|
||
|
2. Are there any contig lines with no left or right
|
||
|
end gel readings?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
"Bad contig line number A"
|
||
|
.LEFT MARGIN2
|
||
|
3. Do the gel readings that are described as left ends on
|
||
|
contig
|
||
|
lines agree that they are left ends?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
"The end gel readings of contig A have outward neighbours"
|
||
|
.LEFT MARGIN2
|
||
|
4. Are there gel readings that are in more than one contig?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
" Gel number A is used N times"
|
||
|
.LEFT MARGIN2
|
||
|
5. Are there gel readings that are not in any contig?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
" Gel number A is not used"
|
||
|
.LEFT MARGIN2
|
||
|
6. Do the relative positions of gel readings agree with
|
||
|
their
|
||
|
position as defined by left and right neighbourliness?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
" Gel number A with position X is left neighbour of gel number B with
|
||
|
position Y"
|
||
|
.LEFT MARGIN2
|
||
|
7. Are there any loops in contigs? If so no further
|
||
|
checking is done.
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
" Loop in contig n no further checking done, but gel reading numbers follow"
|
||
|
.left margin2
|
||
|
The
|
||
|
program then prints the gel reading numbers in the looped
|
||
|
contig up
|
||
|
to
|
||
|
the start of the loop.
|
||
|
.LEFT MARGIN2
|
||
|
8. Are there any contigs of length <1? The error message is
|
||
|
.left margin2
|
||
|
" The contig on line
|
||
|
number x has zero length"
|
||
|
.LEFT MARGIN2
|
||
|
9. Are there any gel readings (used in only one contig) that have zero
|
||
|
|
||
|
length? The error
|
||
|
message is
|
||
|
.left margin2
|
||
|
" Gel number N has zero length"
|
||
|
.left margin2
|
||
|
Note that "auto assemble" also uses this logical consistency check and
|
||
|
will
|
||
|
only tolerate a "Gel number N
|
||
|
is not used" error. Any other error will cause it to
|
||
|
|
||
|
give up.
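.para
Checks 1 and 6 above can be sketched in Python as follows. The data layout
(a dictionary from gel number to its position and neighbours) and the function
name are assumptions made for illustration; the program itself works on the
database files.
.lit
```python
def check_neighbours(gels):
    """gels maps gel number -> dict with 'position', 'left' and 'right'.

    Returns a list of error messages for check 1 (hand holding) and
    check 6 (positions agree with neighbour order). 0 means no neighbour.
    """
    errors = []
    for a, rec in gels.items():
        b = rec["left"]
        if b == 0:
            continue
        if b not in gels or gels[b]["right"] != a:
            errors.append(f"Hand holding problem for gel reading {a}")
        elif gels[b]["position"] > rec["position"]:
            errors.append(
                f"Gel number {b} with position {gels[b]['position']} is left "
                f"neighbour of gel number {a} with position {rec['position']}")
    return errors
```
.end lit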
|
||
|
|
||
|
.left margin1
|
||
|
@29. TX 1 @ Examine quality
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Analyses the quality of the data in a contig. It reports on the proportion
|
||
|
|
||
|
of the consensus that is "well determined" and will display a sequence of
|
||
|
|
||
|
symbols that indicate the quality of the consensus at each position.
|
||
|
|
||
|
.para
|
||
|
Identify the contig to analyse, and the section of interest. The current
|
||
|
|
||
|
consensus calculation cutoff score will be used to decide if each position
|
||
|
is
|
||
|
"well determined". In general the quality of a reading deteriorates along
|
||
|
the length of the gel and so it is also possible to use a length cutoff for
|
||
|
the quality calculation. Only the data from the first section of each reading
|
||
|
will be included in the quality calculation. The length is altered under
|
||
|
"set parameters" and is initially set to the maximum reading length.
|
||
|
A summary showing the percentage of the consensus
|
||
|
that falls into each category of quality is shown. Choose whether or not to
|
||
|
have the quality codes for each position of the consensus displayed.
|
||
|
They can be displayed as either graphics or text.
|
||
|
.para
|
||
|
The quality of the data depends on the number of times it has been
|
||
|
|
||
|
sequenced and the particular uncertainty codes used in each gel
|
||
|
|
||
|
reading. This function divides the data into five categories, assigning
|
||
|
|
||
|
each
|
||
|
a symbol or code:
|
||
|
.LEFT MARGIN2
|
||
|
1. Well determined on both strands and they agree. code=0
|
||
|
.LEFT MARGIN2
|
||
|
2. Well determined on the plus strand only. code=1
|
||
|
.LEFT MARGIN2
|
||
|
3. Well determined on the minus strand only. code=2
|
||
|
.LEFT MARGIN2
|
||
|
4. Not well determined on either strand. code=3
|
||
|
.LEFT MARGIN2
|
||
|
5. Well determined on both strands but they disagree. code=4
|
||
|
.LEFT MARGIN2
|
||
|
A position is "well determined" if it is assigned one of the symbols
A,C,G,T by the algorithm described in the section "calculate a
consensus".
|
||
|
The calculation is performed
|
||
|
separately for each strand.
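.para
The five categories can be expressed compactly. In the Python sketch below,
which uses hypothetical names, each strand is assumed to have already been
reduced to one consensus symbol per position by the consensus calculation;
the sketch then assigns the quality code and summarises the percentages.
.lit
```python
WELL_DETERMINED = set("ACGT")


def quality_code(plus, minus):
    """Return the quality code (0-4) from the two strand consensus symbols."""
    plus_ok = plus in WELL_DETERMINED
    minus_ok = minus in WELL_DETERMINED
    if plus_ok and minus_ok:
        return 0 if plus == minus else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3


def summarise(codes):
    """Return the percentage of positions falling into each category."""
    n = len(codes)
    return {c: 100.0 * codes.count(c) / n for c in range(5)}
```
.end lit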
|
||
|
.para
|
||
|
If the user chooses to have the data displayed graphically the following
|
||
|
scheme is used. A rectangular box is drawn so that the x coordinate
|
||
|
represents the length of the contig. The box is notionally
|
||
|
divided vertically into
|
||
|
5 possible levels which are given the y values: -2,-1,0,1,2.
|
||
|
The quality codes attributed to each base position are plotted as
|
||
|
rectangles.
|
||
|
Each rectangle represents a region in
|
||
|
which the quality codes are identical, so a single base having a different
|
||
|
code from its immediate neighbours will appear as a very narrow rectangle.
|
||
|
.lit
|
||
|
|
||
|
Rectangle bottom and top y values
|
||
|
|
||
|
Quality 0 rectangle from 0 to 0
|
||
|
Quality 1 rectangle from 0 to 1
|
||
|
Quality 2 rectangle from 0 to -1
|
||
|
Quality 3 rectangle from -1 to 1
|
||
|
Quality 4 rectangle from -2 to 2
|
||
|
.end lit
|
||
|
.para
|
||
|
Obviously a single line at the midheight shows a perfect sequence.
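.para
The grouping of identical neighbouring codes into rectangles amounts to a run
length encoding. The Python sketch below uses hypothetical names and takes the
y values from the table above; it turns a string of quality codes into one
rectangle per run.
.lit
```python
from itertools import groupby

# y values from the table above: quality code -> (from, to)
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def quality_rectangles(codes, start=1):
    """Return (x_first, x_last, y_from, y_to) for each run of equal codes."""
    rects = []
    x = start
    for code, run in groupby(codes):
        n = len(list(run))
        y_from, y_to = Y_RANGE[int(code)]
        rects.append((x, x + n - 1, y_from, y_to))
        x += n
    return rects
```
.end lit
A single base whose code differs from its neighbours gives a rectangle whose
first and last x values are equal, which is the very narrow rectangle
mentioned above.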
|
||
|
.para
|
||
|
Typical dialogue is shown below.
|
||
|
.lit
|
||
|
|
||
|
41.47% OK on both strands and they agree(0)
|
||
|
55.48% OK on plus strand only(1)
|
||
|
2.08% OK on minus strand only(2)
|
||
|
0.97% Bad on both strands(3)
|
||
|
0.00% OK on both strands but they disagree(4)
|
||
|
? (y/n) (y) Show sequence of codes
|
||
|
|
||
|
        10         20         30         40         50
1111111111 1111111111 1111111111 1111111111 1111111111

        60         70         80         90        100
1111111111 1111111111 1111111111 3111111111 1111111111

       110        120        130        140        150
1111111111 1111131111 1111111111 1111111111 1111111111

       160        170        180        190        200
1111111111 1111111111 1111111111 1111111111 1111111133

       210        220        230        240        250
1311111111 1111111111 1111111110 0000000000 0000220000

       260        270        280        290        300
0000000000 0020000000 2200000202 0002000000 0000222200
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@26. TX 3 @ Alter relationships
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to make what are normally illegal changes to the database. That is
|
||
|
|
||
|
the normal checks are not done and any item in the database can be
|
||
|
changed independently of all others. Users need to know what they are
|
||
|
|
||
|
doing because it is very easy to make a horrible mess. Always start by
|
||
|
|
||
|
making a copy!
|
||
|
.para
|
||
|
By using the options here users can
|
||
|
move one section of a contig relative to another, break contigs, remove
|
||
|
contigs, remove gel readings, etc. To give flexibility most
|
||
|
of the commands do only one thing. This means that several commands
|
||
|
may
|
||
|
have to be executed to complete any change.
|
||
|
.para
|
||
|
The following options are offered:
|
||
|
.lit
|
||
|
|
||
|
Cancel
|
||
|
Line change
|
||
|
Check logical consistency
|
||
|
Remove contig
|
||
|
Shift
|
||
|
Move gel reading
|
||
|
Rename gel reading
|
||
|
Break a contig
|
||
|
Remove a gel reading
|
||
|
Alter raw data parameters
|
||
|
|
||
|
.end lit
|
||
|
.left margin2
|
||
|
1. QUIT returns to the main options of BAP.
|
||
|
.left margin2
|
||
|
|
||
|
3. Line change
|
||
|
.left margin2
|
||
|
allows the user to change the contents of any line in the
|
||
|
|
||
|
file of relationships. The line is selected by number, the
|
||
|
program prints the current line and prompts for the new line.
|
||
|
|
||
|
.left margin2
|
||
|
4. Check logical consistency
|
||
|
.left margin2
|
||
|
5. Remove a contig
|
||
|
.left margin2
|
||
|
This function removes a contig and all its gel readings. The user specifies
|
||
|
any reading in the contig.
|
||
|
.left margin2
|
||
|
6. Shift
|
||
|
.left margin2
|
||
|
allows the user to change all the relative positions of a
|
||
|
set of neighbouring gel
|
||
|
readings by some fixed value, i.e. it will
|
||
|
shift related gel readings
|
||
|
either left or right. It can therefore
|
||
|
be used to change the alignment of the gel
|
||
|
readings in a contig.
|
||
|
It prompts for the number of the first gel
|
||
|
reading to
|
||
|
shift and then for the distance to move them (Note a
|
||
|
negative value will move the gel readings
|
||
|
left and a positive value
|
||
|
right). It then chains rightwards (ie follows right
|
||
|
neighbours) and shifts each gel
|
||
|
reading, in turn, up to the end
|
||
|
of the contig. (This means that only those gel readings
|
||
|
from the first
|
||
|
to shift to the rightmost are moved). It updates the length of
|
||
|
the contig accordingly.
|
||
|
|
||
|
.left margin2
|
||
|
7. Move gel reading
|
||
|
.left margin2
|
||
|
is a function to renumber a gel reading. It moves all the information
|
||
|
about a gel
|
||
|
reading on to another line. The user must specify the
|
||
|
number
|
||
|
of the gel reading
|
||
|
to move and the number of the line to place it. It
|
||
|
takes care of all the relationships. Of course gel
|
||
|
readings must not be
|
||
|
moved to lines occupied by other gel
|
||
|
readings!
|
||
|
|
||
|
.left margin2
|
||
|
8. Rename gel reading
|
||
|
.left margin2
|
||
|
is a function that is used to rename the archive names of
|
||
|
gel
|
||
|
readings in the database; it only changes the name in the .ARN
|
||
|
file of the database.
|
||
|
|
||
|
.sk1
|
||
|
.LEFT MARGIN2
|
||
|
9. Break contig
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
Occasionally it is necessary to break a contig into two parts and this can be
|
||
|
achieved using this option. The program needs only the number of a gel
|
||
|
reading. This is the gel reading that will become a left end after the
|
||
|
break. That
|
||
|
is, the break is made between this gel
|
||
|
reading and its left neighbour. A new contig
|
||
|
line is created so ensure that there is sufficient space in the database.
|
||
|
.left margin2
|
||
|
10. Removing gel readings from contigs
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Gel
|
||
|
readings can be removed from contigs. If they are essential for holding the
|
||
|
contig together (ie are the only gel reading covering a particular region),
|
||
|
the program will create a new contig.
|
||
|
.sk1
|
||
|
.LEFT MARGIN2
|
||
|
11. Alter raw data parameters
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
Allows the user to edit the individual raw data parameters, such as
|
||
|
the left and right cutoff lengths and the name of the machine readable trace
|
||
|
file.
|
||
|
The user must specify the gel line to modify, and provide new values for
|
||
|
the length of the raw sequence including cutoff lengths, the left cutoff position, the length of the original working sequence, the machine type, and the name
|
||
|
of the raw data file, where these values change.
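.para
As an illustration of how the Shift option (6) chains rightwards, the Python
sketch below moves a reading and everything reached through its right
neighbours. The names and data layout are hypothetical, the dictionary is
assumed to hold only the readings of one contig, and the contig length is
assumed to be recomputed as the largest right end.
.lit
```python
def shift_readings(gels, contig, first, distance):
    """Shift `first` and every reading reached via right neighbours.

    gels maps gel number -> dict with 'position', 'length' and 'right'
    (0 means no right neighbour). A negative distance moves readings left.
    """
    n = first
    while n != 0:
        gels[n]["position"] += distance
        n = gels[n]["right"]
    # Assumed rule: the contig length is the largest right end.
    contig["length"] = max(g["position"] + abs(g["length"]) - 1
                           for g in gels.values())
```
.end lit
Readings to the left of the first one named are not touched, which is why the
alignment within the contig changes.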
|
||
|
.left margin1
|
||
|
@27. TX 1 @ Set display parameters
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to redefine the parameters that control the cutoff employed by the
|
||
|
|
||
|
consensus calculation and quality examiner, the maximum length of each
|
||
|
reading to include in the quality calculation, the line length used by
|
||
|
|
||
|
the display function, the text window length used by the graphics
|
||
|
options, and the graphics window length used by the graphics options.
|
||
|
.para
|
||
|
The default cutoff score is 75%. The default line length is 50 characters.
|
||
|
For protein sequences the cutoff is always 100%.
|
||
|
.para
|
||
|
The text window used by the graphics options controls the amount of
|
||
|
sequence listed at the crosshair position. The graphics window controls the
|
||
|
"zoom" function. Both these windows are defined as the number of bases that
|
||
|
should be shown, to both left and right of the crosshair.
|
||
|
.left margin1
|
||
|
@30. TX 3 @ Shuffle pads
|
||
|
.left margin2
|
||
|
.para
|
||
|
One weakness of the alignment strategy used is that padding
|
||
|
characters are not always aligned by the assembly routine. This function
|
||
|
attempts to align padding characters using a very simple strategy. It
|
||
|
does not solve all pad alignment problems but is a useful first step during
|
||
|
cleaning-up operations.
|
||
|
.LEFT MARGIN1
|
||
|
@10. TX 2 @Clear graphics
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Clears graphics from the screen.
|
||
|
.left margin1
|
||
|
@11. TX 2 @Clear text
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Clears text from the screen.
|
||
|
.left margin1
|
||
|
@12. TX 2 @Draw a ruler.
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This option
|
||
|
allows the user to draw a ruler or scale along the x axis of the screen to
|
||
|
help identify the coordinates of points of interest. The user can define
|
||
|
the position of the first base to be marked (for example if the active
|
||
|
region is 1501 to 8000, the user might wish to mark every 1000th base
|
||
|
starting at either 1501 or 2000 - it depends if the user wishes to treat
|
||
|
the active region as an independent unit with its own numbering starting
|
||
|
at
|
||
|
its left edge, or as part of the whole sequence). The user can also define
|
||
|
the separation of the ticks on the scale and their height. If required the
|
||
|
labelling routine can be used to add numbers to the ticks.
|
||
|
.left margin1
|
||
|
@14. TX 2 @Reposition plots
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
The position of each of the plots is defined relative to a user's drawing
|
||
|
board which has size 1-10,000 in x and 1-10,000 in y.
|
||
|
Plots for
|
||
|
each option are drawn in a window defined by x0,y0 and xlength,ylength,
where x0,y0 is the position of the bottom left hand corner of the window,
xlength is the width of the window and ylength is
the height of the window.
|
||
|
.lit
|
||
|
--------------------------------------------------------- 10,000
1                                                       1
1      --------------------------------------    ^      1
1      1                                    1    1      1
1      1                                    1    1      1
1      1                                    1 ylength   1
1      1                                    1    1      1
1      1                                    1    1      1
1      --------------------------------------    v      1
1 x0,y0^                                                1
1      <---------------xlength-------------->           1
--------------------------------------------------------- 1
1                                                  10,000
|
||
|
|
||
|
.end lit
|
||
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
||
|
The default window positions are read from a file "ANALMARG" when the
|
||
|
program is started. Users can have their own file if required.
|
||
|
As all the plots start
|
||
|
at the same position in x and have the same width, x0 and xlength are the
|
||
|
same for all options. Generally users will only want to change the start
|
||
|
level of the window y0 and its height ylength.
|
||
|
This option
|
||
|
allows users to change window positions whilst running the program.
|
||
|
The routine prompts first for the number of the option that the user
wishes
|
||
|
to reposition; then for the y start and height; then for the x start and
|
||
|
length. Note that changes to the x values affect all options. If the user
|
||
|
types only carriage return for any value it will remain unchanged.
|
||
|
Note that, unlike in all the other programs, the boxes used to contain
|
||
|
analytical results (eg plot quality) should not be made to overlap one
|
||
|
another, as the function of the crosshair routine depends on which box the
|
||
|
crosshair is in!
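.para
A simple way to picture the window rules is the Python sketch below. It is an
illustration only, with hypothetical names: it checks that a window lies on
the 1-10,000 drawing board and that two windows do not overlap in y, which is
the case to avoid for the boxes holding analytical results.
.lit
```python
from typing import NamedTuple

BOARD = 10_000  # drawing board runs from 1 to 10,000 in x and in y


class Window(NamedTuple):
    x0: int        # bottom left hand corner
    y0: int
    xlength: int   # width
    ylength: int   # height


def fits_on_board(w):
    return (w.x0 >= 1 and w.y0 >= 1
            and w.x0 + w.xlength <= BOARD
            and w.y0 + w.ylength <= BOARD)


def overlap_in_y(a, b):
    """True if the two windows share any range of y values."""
    return a.y0 < b.y0 + b.ylength and b.y0 < a.y0 + a.ylength
```
.end lit
Since x0 and xlength are the same for all options, checking the y ranges alone
is enough to detect overlapping boxes.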
|
||
|
.LEFT MARGIN1
|
||
|
@15. TX 2 @Label a diagram
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This routine allows users to label any diagrams they have produced. They
|
||
|
are asked to type in a label. When the user types carriage return to finish
|
||
|
typing the label the cross-hair appears on the screen. The user can
|
||
|
position it anywhere on the screen. If the user types R (for right justify)
|
||
|
the label will be
|
||
|
written on the diagram with its right end at the cross-hair position.
|
||
|
If the user types L (for left justify) the label will be written on the
|
||
|
diagram with its left end at the cross hair position.
|
||
|
The cross-hair will then immediately reappear. The user may put the same
label on another part of the diagram in the same way or, on hitting the
space bar, will be asked whether to type in another label.
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
? Menu or option number=15
|
||
|
Type label then drive cross hair to left or right end
|
||
|
of label position then hit "L" to write label left
|
||
|
justified or "R" to write label right justified or
|
||
|
the space bar to quit
|
||
|
|
||
|
|
||
|
? Label=delta gene
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
? Label=
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@16. TX 2 @Display a map
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This is disabled!
|
||
|
.left margin1
|
||
|
@7. TX 1 @Redirect output
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to direct output that would normally appear on the screen to a file and
|
||
|
to create postscript output.
|
||
|
.para
|
||
|
Select redirection of either text or graphics, and
|
||
|
supply the name of the file that the output should be written to.
|
||
|
.para
|
||
|
The results from the next options selected will not appear on the screen
|
||
|
but will be written to the file. When option 7 is selected again
|
||
|
the file will be
|
||
|
closed and output will again appear on the screen.
|
||
|
.left margin1
|
||
|
@13. TX 2 @Use crosshair
|
||
|
.left margin2
|
||
|
.para
|
||
|
This option puts a steerable cross on the screen which the user
|
||
|
drives around
|
||
|
by using the arrow keys (or mouse). When the crosshair is
|
||
|
visible a number of options are available if the user types one of a
|
||
|
set of special keyboard characters. Any other characters will cause
|
||
|
an exit from the crosshair option. The special keys are:
|
||
|
.lit
|
||
|
|
||
|
I = Identify the nearest gel reading
|
||
|
Z = Zoom in
|
||
|
Q = plot Quality
|
||
|
S = display the aligned Sequences at the crosshair position
|
||
|
N = list the Names and Numbers of the sequences at the crosshair
|
||
|
.end lit
|
||
|
.para
|
||
|
In order for any of these special keys to operate, the crosshair
|
||
|
must be in an appropriate display box, and the precise function of
|
||
|
the keys will also depend on which box the crosshair is in.
|
||
|
.para
|
||
|
If the
|
||
|
crosshair is in the "plot all contigs" box, Z will cause a new box to
|
||
|
appear showing all the readings for the nearest contig; Q will give
|
||
|
the same as Z but will also produce an extra box showing the
|
||
|
"quality" plot.
|
||
|
.para
|
||
|
If Z is hit in the "plot single contig" box, the contig will be zoomed
|
||
|
to the current graphics window size. The zoom will be roughly
|
||
|
centred on the crosshair position. Because of this it is possible to
|
||
|
step along a contig by repeatedly zooming with the crosshair near
|
||
|
to one end of the single contig display box. If I is hit the crosshair
|
||
|
must be close to a gel reading line. If Q is hit, the quality plot will
|
||
|
be produced for the region shown in the plot single contig box. In
|
||
|
all cases when the "plot all contigs" box is shown, a vertical line will
|
||
|
bisect the line that represents the relevant contig, at the current
|
||
|
position.
|
||
|
.para
|
||
|
If the crosshair is in the plot quality box only the character "s" will operate
|
||
|
as a special symbol.
|
||
|
.para
|
||
|
The number of bases shown in the N and S options is controlled by
|
||
|
the current graphics text window size, and the size of the zoom
|
||
|
window by the current graphics window size. Both are set by the
|
||
|
parameter setting function of the general menu.
|
||
|
.left margin1
|
||
|
@33. TX 2 @Plot single contig
|
||
|
.left margin2
|
||
|
.para
|
||
|
This option produces a schematic of a selected region of a single
|
||
|
contig by drawing a horizontal line to represent each of its gel
|
||
|
readings. The lines show the relative positions of each reading and
|
||
|
also their sense. The plot is divided vertically into two sections by
|
||
|
a line that is identified by an asterisk drawn at each end. All lines
|
||
|
that lie above this line represent readings that are in their original
|
||
|
sense, all lines below show readings that are in the
|
||
|
complementary sense to their original. By use of the crosshair
|
||
|
function the plot can be stepped through and examined in more
|
||
|
detail. See help on crosshair.
|
||
|
.left margin1
|
||
|
@34. TX 2 @Plot all contigs
|
||
|
.left margin2
|
||
|
.para
|
||
|
This option produces a schematic of all the contigs in a database. It
|
||
|
does this by drawing a horizontal line to represent each of them.
|
||
|
In order to show the ends of each contig it draws the lines for
|
||
|
contigs at alternate heights: the first at height one, the
|
||
|
second at height two, the third at height one, etc. The order of the
|
||
|
contigs in the display is the same as their order in the database.
|
||
|
By use of the crosshair function the plot can be stepped
|
||
|
through and examined in more detail. See help on crosshair.
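.para
The alternating layout described above can be sketched roughly as follows. This is only an illustration, not the program's own code: the function name layout_contigs is hypothetical, and it assumes the contigs are drawn end to end along the horizontal axis in database order.

```python
def layout_contigs(lengths):
    """Return (start, end, height) for each contig, in database order.

    Assumption: contigs are placed end to end, and heights alternate
    1, 2, 1, 2, ... so that neighbouring contig ends are visible.
    """
    segments = []
    offset = 0
    for index, length in enumerate(lengths):
        height = 1 if index % 2 == 0 else 2
        segments.append((offset + 1, offset + length, height))
        offset += length
    return segments


if __name__ == "__main__":
    for start, end, height in layout_contigs([100, 50, 30]):
        print(start, end, height)
```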
|
||
|
.left margin1
|
||
|
@31. TX 3 @ Disassemble readings
|
||
|
.left margin2
|
||
|
.para
|
||
|
This function is used to remove a list of readings from a database, or
|
||
|
to create a new contig from a single reading moved from an existing contig.
|
||
|
This latter mode is useful for repositioning a reading in a repeat:
|
||
|
once separated it can be placed in the join editor and scrolled by the
|
||
|
other copies.
|
||
|
Removal of sets of readings works in two modes:
|
||
|
1. A set of adjacent readings in a
|
||
|
contig can be removed by the user naming the two end ones; or 2. A batch
|
||
|
of readings from any number of contigs can be defined by the user naming
|
||
|
a file containing a list of reading names. The program cleans up the
|
||
|
database by moving data to fill up any holes made in the files.
|
||
|
.para
|
||
|
For both modes of operation the program will ask for a file of file names.
|
||
|
If users create their own file (i.e. mode 2) each reading NAME must be on
|
||
|
a separate line. For mode 1 the user types the NAMES of the leftmost
|
||
|
and rightmost readings to be removed. They and all intervening readings
|
||
|
will be removed. Note that the routine operates on reading names - not
|
||
|
numbers. For both modes, if necessary, new contigs will be created.
|
||
|
.left margin1
|
||
|
@35. TX 1 3 @Find internal joins
|
||
|
.left margin2
|
||
|
.para
|
||
|
The purpose of this function is to use data already in the database to
|
||
|
find possible joins between contigs.
|
||
|
Joins may have been missed due to poor data or may not have been made
|
||
|
due to repeated sequences. Where appropriate, it may be
|
||
|
possible to find potential
|
||
|
joins by using the "unused data" derived from sequencing machines.
|
||
|
.left margin2
|
||
|
For all overlaps found when the X version is used,
|
||
|
the contig editor (in join mode) will be
|
||
|
called up with the two contigs aligned.
|
||
|
.left margin2
|
||
|
The database is checked for logical consistency. Supply a minimum initial
|
||
|
match length, a minimum alignment block, the maximum pads per sequence,
|
||
|
the maximum percent mismatch after alignment, and the probe length. Choose
|
||
|
whether clipped data is to be used; if so, define the window size for finding good
|
||
|
data and the number of dashes allowed in the window. Processing will commence.
|
||
|
Most of these values are used in an identical way in the autoassemble
|
||
|
function. The others are defined below.
|
||
|
.left margin2
|
||
|
The program strategy
|
||
|
.left margin2
|
||
|
Take the first contig and calculate its consensus. If clipped data is being
|
||
|
used examine all readings that
|
||
|
are in the complementary orientation, and sufficiently near to the contig's left
|
||
|
end, to see if they have good clipped sequence which, if present, would
|
||
|
protrude
|
||
|
from the left end of the contig. If found add the longest such sequence to the
|
||
|
left end of the consensus. Do the same for the right end by examining
|
||
|
readings that are in their
|
||
|
original orientation. If any are found add the longest extension to the
|
||
|
right end of
|
||
|
the consensus. Repeat the consensus calculations and extensions
|
||
|
for all contigs hence producing an extended consensus. If clipped data is not
|
||
|
being used simply calculate the consensus for the whole database. Now
|
||
|
look for possible joins by processing the extended consensus in the following
|
||
|
way. Take the last, say 100, bases (termed the "probe length" by the program)
|
||
|
of the rightmost consensus and compare it in both
|
||
|
orientations with the extended consensus of all the other contigs. Display
|
||
|
any sufficiently good alignments. Repeat with the left end of the rightmost
|
||
|
contig. Do the same for the ends of all the extended contigs, always only
|
||
|
comparing with the contigs to their left, so that the same matches do not
|
||
|
appear twice.
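.para
The order of comparisons in the strategy above can be sketched as below. This is a simplified illustration under stated assumptions, not the program's implementation: the names reverse_complement and probe_comparisons are hypothetical, the extended consensuses are assumed to be held as a list ordered left to right, and the alignment step itself is left out.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse and complement a sequence; non-ACGT symbols are kept unchanged."""
    return seq.translate(_COMPLEMENT)[::-1]


def probe_comparisons(consensuses, probe_length=100):
    """Yield (probe_contig, end, sense, target_contig, probe) tuples.

    Probes are taken from both ends of each contig, starting with the
    rightmost, and are paired only with contigs to their left, so no
    pair of contigs is examined twice.
    """
    for i in range(len(consensuses) - 1, 0, -1):
        seq = consensuses[i]
        ends = (("right", seq[-probe_length:]), ("left", seq[:probe_length]))
        for end, probe in ends:
            senses = (("+", probe), ("-", reverse_complement(probe)))
            for sense, oriented in senses:
                for j in range(i):
                    yield i, end, sense, j, oriented


if __name__ == "__main__":
    for item in probe_comparisons(["AAAA", "CCCC", "ACGT"], probe_length=2):
        print(item)
```

Each tuple yielded would then be passed to whatever alignment routine is in use; only alignments passing all tests would be displayed.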
|
||
|
.left margin2
|
||
|
Good clipped data is defined by sliding a window of "Window size for good data
|
||
|
scan" bases outwards
|
||
|
along the sequence and stopping when "Maximum number of dashes in scan window"
|
||
|
or more dashes appear in the window.
|
||
|
Note that
|
||
|
it is advisable to have some sort of cutoff because if we simply take all the
|
||
|
data it might be so full of rubbish that we won't find any good matches. For
|
||
|
the same reason it is worth trying the procedure with different cutoffs. An
|
||
|
initial run using no clipped data is also recommended.
|
||
|
Sufficiently good
|
||
|
alignments are defined by criteria equivalent to those used in autoassemble,
|
||
|
however here we only display alignments that pass all tests.
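.para
The window scan used to define good clipped data can be sketched as follows. This is a minimal sketch: the function name good_clipped_length is hypothetical, the clipped data is assumed to be given as a string running outwards from the end of the used data, and it is assumed that the data kept is everything before the first window that fails the test.

```python
def good_clipped_length(clipped, window_size, max_dashes):
    """Return how many bases of clipped data are treated as good.

    The window slides outwards one base at a time. The scan stops at
    the first window holding max_dashes or more dashes; the bases
    before that window are kept (an assumption of this sketch).
    """
    last_start = max(len(clipped) - window_size + 1, 1)
    for start in range(last_start):
        window = clipped[start:start + window_size]
        if window.count("-") >= max_dashes:
            return start
    return len(clipped)


if __name__ == "__main__":
    print(good_clipped_length("ACGTACGT--A-CG", 4, 2))
```

Trying several values of window_size and max_dashes, as recommended above, changes how much of the clipped data is added to the consensus.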
|
||
|
.left margin2
|
||
|
Bugs
|
||
|
.left margin2
|
||
|
If a small contig is wholly contained within a larger one, such that its
|
||
|
ends are further than ("Probe length" - "Minimum initial match length")
|
||
|
from the ends of the larger contig, and the consensus for the small
|
||
|
contig lies to the left
|
||
|
of the consensus for the large contig, the overlap will not be discovered. (See
|
||
|
the search strategy).
|
||
|
.left margin2
|
||
|
All numbering is
|
||
|
relative to base number one in the contig: matches to the left (i.e. in
|
||
|
the clipped data) have negative
|
||
|
positions, matches off the right end of the contig (i.e. in the clipped
|
||
|
data) have positions
|
||
|
greater than that of the contig length.
|
||
|
The convention for reporting the positions of overlaps is as follows: if neither
|
||
|
contig needs to be complemented the positions are as shown. If the program says
|
||
|
"contig x in the - sense" then the positions shown assume contig x has been
|
||
|
complemented. For example in the results given below the positions for the
|
||
|
first overlap are as reported, but those for the second assume that the contig
|
||
|
in the minus sense (i.e. 443) has been complemented.
|
||
|
.lit
|
||
|
|
||
|
|
||
|
Possible join between contig 445 in the + sense and contig 405
|
||
|
Percentage mismatch after alignment = 4.9
|
||
|
412 422 432 442 452 462
|
||
|
405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
|
||
|
********* * ******** ***** *** ********** ********** **********
|
||
|
445 -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA
|
||
|
-127 -117 -107 -97 -87 -77
|
||
|
472 482 492 502 512
|
||
|
405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
|
||
|
********** ********** ********** ********** **
|
||
|
445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
|
||
|
-67 -57 -47 -37 -27
|
||
|
Possible join between contig 443 in the - sense and contig 423
|
||
|
Percentage mismatch after alignment = 10.4
|
||
|
64 74 84 94 104 114
|
||
|
423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG
|
||
|
**** ***** ********** ********** ****** ** ***** **** *********
|
||
|
443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
|
||
|
3610 3620 3630 3640 3650 3660
|
||
|
124 134 144 154 164
|
||
|
423 TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA
|
||
|
*** ****** ********** ** ******* *** ***** ** **
|
||
|
443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-
|
||
|
3670 3680 3690 3700 3710
|
||
|
|
||
|
|
||
|
.end lit
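.para
The "Percentage mismatch after alignment" figures in the example above are consistent with counting every aligned column in which the two symbols differ (including padding and dash characters) and dividing by the alignment length. The sketch below makes that assumption explicit; the function name percent_mismatch is hypothetical and this is not taken from the program source.

```python
def percent_mismatch(top, bottom):
    """Percentage of aligned columns in which the two sequences differ.

    Spaces are ignored so that the ten-base blocks of the display can
    be pasted in directly. Any differing column counts as a mismatch.
    """
    top = top.replace(" ", "")
    bottom = bottom.replace(" ", "")
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    mismatches = sum(a != b for a, b in zip(top, bottom))
    return 100.0 * mismatches / len(top)


if __name__ == "__main__":
    contig_405 = ("TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT "
                  "AGCTCACTCA TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT")
    contig_445 = ("-TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT "
                  "AGCTCACTCA TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT")
    print(round(percent_mismatch(contig_405, contig_445), 1))
```

For the first overlap shown above this gives 5 differing columns out of 102, which rounds to the reported 4.9.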
|
||
|
.left margin1
|
||
|
@36. TX 3 @Double strand
|
||
|
.left margin2
|
||
|
.para
|
||
|
PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS IT HAS
|
||
|
CURRENTLY HAD VERY LITTLE TESTING.
|
||
|
.para
|
||
|
Uses the cutoff data to change single stranded sections of a contig into
|
||
|
double stranded sections. Data is used carefully to try and minimise the
|
||
|
number of data disagreements created. However it must be noted that an overall
|
||
|
slight degradation in quality will still occur.
|
||
|
.para
|
||
|
When using this option you will be prompted for a contig and a region within
|
||
|
that contig. The default region is the entire contig. The option will then
|
||
|
search through the region for areas of good data on one strand and cutoff data
|
||
|
on the opposite strand, extending the cutoff data. The criteria for evaluating
|
||
|
the amount of cutoff data to be used are based upon a maximum number of
|
||
|
mismatches and a score (derived by accumulating points for mismatches (-8),
|
||
|
matches (+1) and insertions (-5) over the length of an alignment). The defaults
|
||
|
are:
|
||
|
.lit
|
||
|
|
||
|
maximum mismatches : 6
|
||
|
|
||
|
score for mismatch : -8
|
||
|
score for correct match : +1
|
||
|
score for insertion : -5
|
||
|
.end lit
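.para
The scoring with the default values above can be illustrated as follows. How the program combines the score with the mismatch limit is not described here, so this sketch makes an assumption: it accumulates the score along the alignment, stops once the mismatch limit would be exceeded, and keeps the prefix with the highest score. The name best_extension and the one-letter column codes are hypothetical.

```python
# Default points per alignment column: M = match, X = mismatch, I = insertion.
SCORES = {"M": 1, "X": -8, "I": -5}


def best_extension(columns, max_mismatches=6):
    """Return (length, score) of the best-scoring prefix of an alignment.

    Assumption of this sketch: scanning stops as soon as the number of
    mismatches would exceed max_mismatches.
    """
    score = 0
    best_score = 0
    best_length = 0
    mismatches = 0
    for position, column in enumerate(columns, start=1):
        if column == "X":
            mismatches += 1
            if mismatches > max_mismatches:
                break
        score += SCORES[column]
        if score > best_score:
            best_score = score
            best_length = position
    return best_length, best_score


if __name__ == "__main__":
    print(best_extension("M" * 10 + "X" + "M" * 10))
```

With these weights a single mismatch costs as much as eight matches gain, which is why only long runs of agreement justify extending past a disagreement.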
|
||
|
.para
|
||
|
Note that with successive calls to this option it is possible to double strand
|
||
|
more and more data. Naturally however the quality of the data generated will
|
||
|
diminish each time.
|
||
|
.left margin1
|
||
|
@37. TX 3 @Auto-select oligos
|
||
|
.left margin2
|
||
|
.para
|
||
|
PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS IT HAS
|
||
|
CURRENTLY HAD VERY LITTLE TESTING.
|
||
|
.para
|
||
|
Generates a file (default "primers") of suggested primers to use for covering
|
||
|
a single stranded section or for walking off the end of a contig. The file
|
||
|
generated contains the gel reading name, the primer sequence, its offset in
|
||
|
the contig and the orientation. An example file would be:
|
||
|
.lit
|
||
|
|
||
|
c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +
|
||
|
c98a10.s1 ATTATCACTTTACGGGTC (@ 6959 ) +
|
||
|
c81c1.s1 CAAGAAGGCGATAGAAG (@ 7643 ) +
|
||
|
c76a10.s1 CCTCATCCTGTCTCTTG (@ 8441 ) +
|
||
|
c81g4.s1 ATGAAACCTGGGCGTTG (@ 16156 ) +
|
||
|
c91e6.s1 GTTTTCAGATGTCGGAG (@ 18249 ) +
|
||
|
c81e12.s1 GCTACCGTAAAACACTTC (@ 18737 ) +
|
||
|
c93h11.s1 GCTGCTTTTTGTTTTATCC (@ 19158 ) +
|
||
|
c81h6.s1 CTTCCACTTCTTTCTTATC (@ 21210 ) +
|
||
|
c86a12.s1 CGAATGATAAAGACAAATCAG (@ 22122 ) +
|
||
|
c98b1.s1 GCCACTTTATCCGAGAC (@ 3048 ) -
|
||
|
c97c5.s1 GTGTTTTGGGTATATTGTG (@ 3371 ) -
|
||
|
c83d2.s1 CTACACAGAATGAACCC (@ 3768 ) -
|
||
|
c78h10.s1 GGCGGTGAAGATTGAAG (@ 4200 ) -
|
||
|
c98h9.s2dt CTCGTTTAAATTTCAAACTTCC (@ 7419 ) -
|
||
|
c95a9.s1 ATTGGAAGGAAGGAGGG (@ 22996 ) -
|
||
|
c82b4.s1 TGTAGCCGAAATCTTCC (@ 23369 ) -
|
||
|
.end lit
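.para
A file in the format shown above could be read back with a short script such as the one below. This is a sketch only: the names Primer and parse_primers are hypothetical, and the exact column spacing of the real file is assumed to be arbitrary whitespace.

```python
import re
from typing import NamedTuple


class Primer(NamedTuple):
    reading: str   # gel reading name
    sequence: str  # primer sequence
    offset: int    # offset in the contig
    sense: str     # "+" or "-"


# One line looks like:  c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +
_LINE = re.compile(
    r"^\s*(\S+)\s+([ACGTacgt]+)\s+\(@\s*(\d+)\s*\)\s*([+-])\s*$"
)


def parse_primers(text):
    """Parse the text of a primers file; lines that do not match are skipped."""
    primers = []
    for line in text.splitlines():
        match = _LINE.match(line)
        if match:
            name, seq, offset, sense = match.groups()
            primers.append(Primer(name, seq, int(offset), sense))
    return primers


if __name__ == "__main__":
    sample = (
        "c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +\n"
        "c98b1.s1 GCCACTTTATCCGAGAC (@ 3048 ) -\n"
    )
    for primer in parse_primers(sample):
        print(primer)
```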
|
||
|
.para
|
||
|
This is best employed after having previously used the 'Double strand' option.
|
||
|
When selecting the option you will be asked for the contig, a region within
|
||
|
this contig and the file to write the list of primers to. For each primer
|
||
|
suggested a tag is automatically created containing details of the gel reading
|
||
|
name and the sequence. Preferably the tag will be created on the gel reading
|
||
|
from which the primer was selected. However this is not always possible so
|
||
|
failing that the tag will be on another sequence overlapping the primer
|
||
|
position.
|
||
|
.para
|
||
|
When invoked with the dialogue option you will be asked a couple more
|
||
|
questions relating to the position and size of the consensus checked for
|
||
|
suitable oligos. You will be prompted for the start and end of a region
|
||
|
(default 40-140) at a relative position to the left of our initial region.
|
||
|
.para
|
||
|
For example:
|
||
|
.lit
|
||
|
|
||
|
? Menu or option number=d37
|
||
|
Auto-select oligos
|
||
|
Default Contig identfier=/e97f2.s1
|
||
|
? Contig identfier=
|
||
|
? Start position in contig (1-20942) (1) =10000
|
||
|
? End position in contig (10000-20942) (20942) =11000
|
||
|
Default Name of file for primers=primers
|
||
|
? Name of file for primers=
|
||
|
? Start of oligo choice region (1-1024) (40) =50
|
||
|
? End of oligo choice region (50-1024) (150) =150
|
||
|
|
||
|
.end lit
|
||
|
.para
|
||
|
This implies that we are going to look for oligos to use as primers covering
|
||
|
the region 10000 to 11000. For each single stranded section in this region we
|
||
|
search for oligos between 50 and 150 bases to the left. So if we had a single
|
||
|
stranded section from 10121 to 10295 we would search for oligos in the region
|
||
|
9971 to 10071.
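.para
The arithmetic in this example can be written out as below. The function name oligo_search_region is hypothetical, and the sketch covers only the case described above, where the search region lies to the left of the single stranded section.

```python
def oligo_search_region(section_start, choice_start, choice_end):
    """Return (first, last) contig positions searched for oligos.

    choice_start and choice_end are the distances to the left of the
    start of the single stranded section, as given in the dialogue.
    """
    return section_start - choice_end, section_start - choice_start


if __name__ == "__main__":
    # Single stranded section starting at 10121, oligo choice region 50 to 150.
    print(oligo_search_region(10121, 50, 150))
```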
|
||
|
.left margin1
|
||
|
@38. TX 1 @Check assembly
|
||
|
.left margin2
|
||
|
.para
|
||
|
This new function is used for checking the positioning of assembled readings.
|
||
|
It is useful for checking sequences that contain repeats
|
||
|
of length similar to that of a single gel reading. It takes the poor
|
||
|
quality data for each reading and compares it to the segment of the consensus
|
||
|
to which it should align.
|
||
|
If the extension of the
|
||
|
read does not match the consensus then the read (or its neighbours) has
|
||
|
probably been assembled into the wrong place.
|
||
|
The program displays the bad alignments.
|
||
|
The quality of an alignment is defined by the percentage mismatch.
|
||
|
Naturally the user should select a value that takes into account
|
||
|
the poor quality of the data being aligned.
|
||
|
.para
|
||
|
When the routine is used from the X version the
|
||
|
user is offered the editor to examine poor alignments.
|
||
|
If alignments are reported as poor, but on inspection are OK, the user
|
||
|
can set a tag so that the poor quality data is ignored on subsequent passes
|
||
|
through the routine. Note however such data will then also be ignored by
|
||
|
the automatic double stranding routine!
|
||
|
.para
|
||
|
The user defines the percentage mismatch; the window size and number of
|
||
|
dashes allowed in the window used for selecting the amount of the poor data
|
||
|
to be employed; can choose to save the names of the poorly aligned reads
|
||
|
in a file; can select an individual contig or scan the whole database.
|
||
|
The file containing the names of the poorly aligned reads can be used by
|
||
|
the disassembly routine to remove them from the database, and then can be used
|
||
|
to reassemble them. Note that the routine complements each contig twice
|
||
|
during processing.
|
||
|
|
||
|
.left margin1
|
||
|
@39. TX 1 @Find read pairs
|
||
|
.left margin2
|
||
|
.para
|
||
|
This new function is used to check the positions of readings taken from each
|
||
|
end of the same template. For each forward read it searches for a corresponding
|
||
|
reverse reading. The search can be over the whole database or over a single contig.
|
||
|
The results can be presented graphically for single contig searches and the crosshair
|
||
|
function can be used to identify the readings displayed.
|
||
|
.para
|
||
|
Note that at present the function only knows that two reads are from the same template
|
||
|
by comparing reading names. For our local projects we use the following naming
|
||
|
convention: forward reads are named abcdefgh.s1 and reverse reads abcdefgh.r1. The
|
||
|
program expects this naming convention and so if it finds reads fred.s1 and fred.r1 it
|
||
|
assumes they are the forward and reverse reads for template fred. In the future we
|
||
|
will make the routine more general!
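.para
The name matching described above amounts to grouping readings by the part of the name before the final dot. A minimal sketch, assuming only the .s1 and .r1 suffixes are recognised; the function name pair_reads is hypothetical and the program's own matching may differ in detail.

```python
def pair_reads(names):
    """Map each template name to its (forward, reverse) reading names.

    Only templates that have both a .s1 and a .r1 reading are returned.
    """
    forward = {}
    reverse = {}
    for name in names:
        template, _, suffix = name.rpartition(".")
        if suffix == "s1":
            forward[template] = name
        elif suffix == "r1":
            reverse[template] = name
    return {
        template: (forward[template], reverse[template])
        for template in forward
        if template in reverse
    }


if __name__ == "__main__":
    print(pair_reads(["fred.s1", "fred.r1", "jim.s1"]))
```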
|
||
|
.para
|
||
|
If a single contig is selected and the output is listed the program displays two
|
||
|
lines for each pair: the first line shows the reading name, its position and length,
|
||
|
and the distance between the extremities of the two reads; the second line shows the
|
||
|
other read name, its position and length. If there are pairs that are in separate contigs
|
||
|
or are facing away from one another they are listed after the pairs that face inwards.
|
||
|
Is this true?
|
||
|
.para
|
||
|
If the results are plotted the full length of the template is drawn with arrows
|
||
|
indicating the direction of reads and the extent of each reading. Those reads that have
|
||
|
their partner in another contig are marked by asterisks.
|
||
|
.para
|
||
|
Typical dialogue is shown below.
|
||
|
.lit
|
||
|
|
||
|
? Select contigs (y/n) (y) =
|
||
|
Default Contig identifier=/i55d8.s1
|
||
|
? Contig identifier=
|
||
|
? Start position in contig (1-15227) (1) =
|
||
|
? End position in contig (1-15227) (15227) =
|
||
|
? Plot results (y/n) (y) = n
|
||
|
852 k23a1.r1 249 238 1615
|
||
|
806 k23a1.s1 1529 -335
|
||
|
238 i68e6.s1 422 193 1632
|
||
|
868 i68e6.r1 1756 -298
|
||
|
576 k17a2.s1 2370 213 1676
|
||
|
885 k17a2.r1 3790 -256
|
||
|
84 k27g6.s1 3456 291 1777
|
||
|
867 k27g6.r1 4905 -328
|
||
|
453 k01g10.s1 5805 142 1251
|
||
|
881 k01g10.r1 6909 -147
|
||
|
781 i98b8.r1 6754 338 1079
|
||
|
10 i98b8.s1 7653 -180
|
||
|
883 k02d11.r1 7327 276 1597
|
||
|
283 k02d11.s1 8726 -198
|
||
|
269 i68f9.s1 8191 169 1055
|
||
|
777 i68f9.r1 8891 -355
|
||
|
710 i91c6.s1 8245 95 1516
|
||
|
780 i91c6.r1 9403 -358
|
||
|
596 k27d12.s1 136 329 -329
|
||
|
219 k27d12.r1 1 -116
|
||
|
159 k27d11.r1 1830 -263 -263
|
||
|
317 k27d11.s1 2902 343
|
||
|
886 k17g11.r1 7107 -123 -123
|
||
|
647 k17g11.s1 1867 265
|
||
|
851 i69g10.r1 8045 -137 -137
|
||
|
277 i69g10.s1 4658 174
|
||
|
.end lit
|
||
|
.para
|
||
|
If contigs are not selected the pairs are sorted on their separations.
|
||
|
.lit
|
||
|
|
||
|
? Select contigs (y/n) (y) = n
|
||
|
i68f2.s1 27 1781 1777
|
||
|
i68f2.r1 776 111 1777
|
||
|
k17f6.s1 601 60 1706
|
||
|
k17f6.r1 856 1405 1706
|
||
|
k17a2.s1 576 2370 1676
|
||
|
k17a2.r1 885 3790 1676
|
||
|
k27g3.s1 177 14985 1664
|
||
|
k27g3.r1 889 13564 1664
|
||
|
.
|
||
|
.
|
||
|
k27b12.s1 764 1 1086
|
||
|
k27b12.r1 857 932 1086
|
||
|
i98b8.s1 10 7653 1079
|
||
|
i98b8.r1 781 6754 1079
|
||
|
k16a3.s1 748 1276 1070
|
||
|
k16a3.r1 784 472 1070
|
||
|
k17b7.r1 786 14937 18942*
|
||
|
k17b7.s1 787 3601 18942*
|
||
|
k27d12.r1 219 1 15208*
|
||
|
k27d12.s1 596 136 15208*
|
||
|
k01g2.s1 502 87 14754*
|
||
|
k01g2.r1 782 9224 14754*
|
||
|
|
||
|
.end lit
|
||
|
|
||
|
.left margin1
|
||
|
@ end of help
|