|
.npa
|
||
|
.left margin1
|
||
|
@-1. TX 0 @General
|
||
|
.sp
|
||
|
@-2. T 0 @Screen control
|
||
|
.sp
|
||
|
@-2. X 0 @Screen
|
||
|
.sp
|
||
|
@-3. TX 0 @Modification
|
||
|
.sp
|
||
|
@0. TX -1 @SAP
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
This is help information for the X Windows version of SAP.
|
||
|
Currently it is being brought up to date with the new features in XDAP.
|
||
|
The accuracy of this help should therefore not be assumed.
|
||
|
.PARA
|
||
|
This is an interactive program whose primary use is
|
||
|
for managing shotgun sequencing projects, but it can also be used for
|
||
|
handling alignments of other sequences, including those of proteins.
|
||
|
Currently the maximum 'gel reading' length is set to 4096 characters.
|
||
|
Almost all of the information below describes the use of the program for
|
||
|
shotgun projects, but those using the programs for handling other
|
||
|
sequence
|
||
|
alignments should interpret it accordingly.
|
||
|
The data for such a project is stored in a special type of database. The
|
||
|
program
|
||
|
contains the tools that are required to type in gel readings,
|
||
|
screen them against vector sequences and restriction sites, and
|
||
|
enter new gel
|
||
|
readings into the database (automatically comparing and aligning
|
||
|
them). In addition it contains editors and functions to examine the quality
|
||
|
of the aligned sequences.
|
||
|
.para
|
||
|
There are three main menus: "general", "screen" and "modification",
|
||
|
and some functions have submenus.
|
||
|
.left margin2
|
||
|
.lit
|
||
|
The general menu contains the following options:
|
||
|
|
||
|
Open a database
|
||
|
Display a contig
|
||
|
List a text file
|
||
|
Direct output
|
||
|
Calculate a consensus
|
||
|
Screen against restriction enzymes
|
||
|
Screen against vector
|
||
|
Check database
|
||
|
Copy database
|
||
|
Show relationships
|
||
|
set parameters
|
||
|
Highlight disagreements
|
||
|
Examine quality
|
||
|
Find internal joins
|
||
|
|
||
|
The graphics menu contains:
|
||
|
|
||
|
Clear graphics
|
||
|
Clear text
|
||
|
Draw ruler
|
||
|
Use cross hair
|
||
|
Change margins
|
||
|
Label diagram
|
||
|
Plot map
|
||
|
Plot single contig
|
||
|
Plot all contigs
|
||
|
|
||
|
|
||
|
The modification menu contains:
|
||
|
|
||
|
Edit contig
|
||
|
Auto assemble
|
||
|
Join contigs
|
||
|
Complement a contig
|
||
|
Alter relationships
|
||
|
Extract gel readings
|
||
|
|
||
|
|
||
|
The alter relationships menu contains:
|
||
|
|
||
|
Cancel
|
||
|
Line change
|
||
|
Edit single gel reading
|
||
|
Delete contig
|
||
|
Shift
|
||
|
Move gel reading
|
||
|
Rename gel reading
|
||
|
Break contig
|
||
|
Alter raw data parameters
|
||
|
|
||
|
.END LIT
|
||
|
.SK1
|
||
|
.para
|
||
|
Overview of the methodology
|
||
|
.para
|
||
|
The shotgun sequencing strategy
|
||
|
.para
|
||
|
In the shotgun sequencing procedure
|
||
|
the sequence to be determined is randomly broken into fragments of
|
||
|
about
|
||
|
400 nucleotides in length. These fragments are cloned and then
|
||
|
selected randomly and their
|
||
|
|
||
|
sequences determined. The relationship between any pair of
|
||
|
|
||
|
fragments is not known beforehand
|
||
|
but is found by comparing their sequences.
|
||
|
|
||
|
If the sequence of one is found to be wholly or partially contained
|
||
|
|
||
|
within that of another for sufficient length to distinguish an
|
||
|
|
||
|
overlap from a repeat then those two fragments can be joined.
|
||
|
The
|
||
|
|
||
|
process of select, sequence and compare is continued until the
|
||
|
whole
|
||
|
|
||
|
of the DNA to be sequenced is in one continuous well
|
||
|
determined
|
||
|
|
||
|
piece.
|
||
|
|
||
|
.para
|
||
|
Definition of a contig
|
||
|
|
||
|
.para
|
||
|
A CONTIG is a set of gel readings that are related to one
|
||
|
another by overlap of their sequences. All gel readings belong to
|
||
|
a contig and each contig contains at least one gel
|
||
|
reading. The gel readings in a contig can be summed to produce
|
||
|
a continuous consensus sequence and the length of this sequence is
|
||
|
the length of the contig. The rules used to perform this summation are
|
||
|
given under "the consensus algorithm".
|
||
|
At any stage
|
||
|
of a sequencing project the data will comprise a number of
|
||
|
contigs;
|
||
|
when a project is
|
||
|
|
||
|
complete there should be only one contig and its consensus will be
|
||
|
the finished sequence. Note that since being introduced and
|
||
|
defined as above the word "contig" has been taken up by those involved in
|
||
|
genomic mapping. In that context the consensus with a precise length is not
|
||
|
defined.
|
||
|
|
||
|
.SK1
|
||
|
.LEFT MARGIN2
|
||
|
Introduction to the computer method
|
||
|
.LEFT margin2
|
||
|
.PARA
|
||
|
It is useful to consider the objectives of a sequencing project before
|
||
|
outlining how we use the computer to help achieve them. The aim of a
|
||
|
shotgun sequencing project is to
|
||
|
produce an accurate consensus sequence from many overlapping gel
|
||
|
readings.
|
||
|
It is necessary to know, particularly at the latter
|
||
|
stages of the project, how accurate the
|
||
|
consensus sequence is. This enables us to know which regions of the
|
||
|
sequence require further work and also to know when the project is
|
||
|
finished.
|
||
|
To show the quality of the consensus, the programs described here
|
||
|
produce displays like that shown below.
|
||
|
.sk1
|
||
|
.lit
|
||
|
|
||
|
                        10        20        30        40        50
   -6 HINW.010  GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
      CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                        60        70        80        90       100
   -6 HINW.010  CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
   -3 HINW.007                                          GGCACA*GTC
      CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                       110       120       130       140       150
   -6 HINW.010  GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
   -3 HINW.007  GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
   -5 HINW.009                                      GGCAGCGACCGTCG
   17 HINW.999                                         AGCGACCGTCG
      CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                       160       170       180       190       200
   -6 HINW.010  TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
   -3 HINW.007  TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
   -5 HINW.009  TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
   17 HINW.999  TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
   12 HINW.017                                            GTAGAGC*
      CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
.END LIT
|
||
|
.para
|
||
|
This is an example showing the left end of a contig from
|
||
|
position 1 to 200. Overlapping this region are gel readings
|
||
|
numbered 6, 3, 5, 17 and 12;
|
||
|
6, 3 and 5
|
||
|
are in reverse orientation to their original reading (denoted by a minus
|
||
|
sign). Each gel reading also has a name (e.g. HINW.010). It can be seen that
|
||
|
in a number of places the sequences contain characters other than A,C,G
|
||
|
and
|
||
|
T. Some of these extra characters have been used by the sequencer to
|
||
|
indicate regions of uncertainty in the initial interpretation of the gel
|
||
|
reading, but the asterisks (*) have been inserted by the automatic
|
||
|
assembly function in order to align the sequences. Underneath each 50
|
||
|
character block of gel reading sequences is the consensus derived from
|
||
|
the
|
||
|
sequences aligned above (the line labelled CONSENSUS). For most of its
|
||
|
length the consensus has a definite nucleotide assignment but in a few
|
||
|
positions there is insufficient agreement between the gel readings and
|
||
|
so a dash (-) appears in the sequence. This display contains all the
|
||
|
evidence needed to assess the quality of the consensus: the number of
|
||
|
times
|
||
|
the sequence has been determined on each strand of the DNA, and the
|
||
|
individual nucleotide assignments given for each gel reading.
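.para
The way a consensus line relates to the readings above it can be sketched
in a few lines of Python. This is only an illustration: the majority rule
and the 0.75 cutoff are assumptions made for the example (the program's
actual rules are those given under "the consensus algorithm"), and the
names consensus_column and consensus are hypothetical. The test data are
positions 91 to 100 of the display above.
.lit
```python
from collections import Counter

UNKNOWN = "-"  # consensus symbol used when agreement is insufficient


def consensus_column(column, cutoff=0.75):
    """Return the consensus symbol for one aligned column.

    `column` holds the characters of every gel reading covering the
    position. The commonest character is accepted only if its share of
    the column reaches `cutoff` (an assumed threshold, not the program's
    documented rule).
    """
    if not column:
        return UNKNOWN
    symbol, count = Counter(column).most_common(1)[0]
    return symbol if count / len(column) >= cutoff else UNKNOWN


def consensus(readings, length, cutoff=0.75):
    """Build a consensus from (start, sequence) pairs; starts are 1-based."""
    columns = [[] for _ in range(length)]
    for start, seq in readings:
        for offset, char in enumerate(seq):
            pos = start - 1 + offset
            if 0 <= pos < length:
                columns[pos].append(char)
    return "".join(consensus_column(col, cutoff) for col in columns)


if __name__ == "__main__":
    # Positions 91-100 of HINW.010 and HINW.007 from the display.
    block = [(1, "CGGACACGTC"), (1, "GGCACA*GTC")]
    print(consensus(block, 10))  # prints -G-ACA-GTC
```
.end lit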
|
||
|
.para
|
||
|
So the aim is to produce the consensus sequence and, equally important,
|
||
|
a display of the experimental results from which it was derived.
|
||
|
.para
|
||
|
In order to achieve this the following operations need to be performed:
|
||
|
.left margin2
|
||
|
1) Put individual gel readings into the computer.
|
||
|
This might involve the manual interpretation of autoradiographs
or the transfer and processing of machine-readable files from fluorescent
|
||
|
sequencing machines.
|
||
|
.left margin2
|
||
|
2) Check each gel reading to make sure it is not simply part of one of the
|
||
|
vectors used to clone the sequence.
|
||
|
.left margin2
|
||
|
3) Check each gel reading to make sure that those fragments that span
|
||
|
the
|
||
|
ligation point used prior to sonication are not assembled as single
|
||
|
sequences.
|
||
|
.left margin2
|
||
|
4) Compare all the remaining gel readings with one another to assemble
|
||
|
them
|
||
|
to produce the consensus sequence.
|
||
|
.left margin2
|
||
|
5) Check the quality of the consensus and edit the sequences.
|
||
|
.left margin2
|
||
|
6) When all the consensus is sufficiently well determined, produce a copy
|
||
|
of
|
||
|
it for processing by other analysis programs.
|
||
|
.para
|
||
|
It is very unlikely that this procedure will only be passed through once.
|
||
|
Usually steps 1 to 5 are cycled through repeatedly, with step 4 just
|
||
|
adding
|
||
|
new sequences to those already assembled. Generally step 6 is also used
|
||
|
in
|
||
|
order to analyse imperfect sequence to check if it is the one the project
|
||
|
intended to sequence, or to look for interesting features. Analysis of
|
||
|
the consensus, such as
|
||
|
searches for protein coding regions,
|
||
|
can also help to find errors in the sequence. The display of the
|
||
|
overlapping gel readings shown above can be used to indicate, not only
|
||
|
the
|
||
|
poorly determined regions, but also which clones should be resequenced
|
||
|
to
|
||
|
resolve ambiguities, or those which can usefully be extended or
|
||
|
sequenced
|
||
|
in the reverse direction, to cover
|
||
|
difficult regions.
|
||
|
|
||
|
.PARA
|
||
|
The original
|
||
|
individual gel readings for a sequencing project are each stored in
|
||
|
separate files. As the gel readings are entered into the computer
|
||
|
(usually in batches, say 10
|
||
|
from a film), the file names they are given are stored in
|
||
|
a further file, called a file of file names. Files of file names
|
||
|
enable gel readings to be processed in batches.
|
||
|
.para
|
||
|
For each sequencing project
|
||
|
we start a project database. This database has a structure specifically
|
||
|
designed for
|
||
|
dealing with shotgun sequence data.
|
||
|
In order to arrive at the final consensus sequence many operations will
|
||
|
be
|
||
|
performed on the sequence data. Individual fragments must be
|
||
|
sequenced and
|
||
|
compared in both senses (i.e. both orientations) with all the other
|
||
|
sequences. When an overlap between a new gel reading and a contig is
|
||
|
found
|
||
|
they must be aligned and the new gel reading added to the contig. If a
|
||
|
new
|
||
|
gel reading overlaps two contigs they must be aligned and joined. Before
|
||
|
the two contigs are joined one of them may need to be turned around
|
||
|
(reversed and complemented) so they are both in the same orientation.
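.para
Turning a sequence around means reversing it and replacing each base by
its complement. A minimal Python sketch follows; the function name is
hypothetical, and leaving padding and uncertainty characters unchanged is
an assumption of the example rather than a description of the program.
.lit
```python
# Translation table for the four bases; any other character (padding
# '*', dash, uncertainty codes) is passed through unchanged.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("GGCACA*GTC"))  # prints GAC*TGTGCC
```
.end lit
Applying the function twice returns the original sequence, which is why a
contig can be complemented as often as needed without loss of information.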
|
||
|
.para
|
||
|
Clearly, keeping track of all these manipulations is quite complicated,
|
||
|
and to be able to perform the operations
|
||
|
quickly requires careful choice of data
|
||
|
structure and algorithms. For these reasons it is not practicable to store
|
||
|
the gel readings aligned as shown in the display above. Rather, it is more
|
||
|
convenient to store the sequences unassembled, and to record sufficient
|
||
|
information for programs to assemble them during processing. The
|
||
|
data used to assemble the sequences is called relational information.
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
The database comprises five files and they are described under the
|
||
|
section entitled "open database".
|
||
|
.PARA
|
||
|
Before entry into the project database
|
||
|
each new gel reading must be compared to look for overlaps
|
||
|
with all the data already contained
|
||
|
within the database. This last point is
|
||
|
important: all searching for overlaps is between individual new gel
|
||
|
readings and the data already in the database. There is no searching for
|
||
|
overlaps between sequences within the database; overlaps must be found
|
||
|
before new gel readings are entered into the database.
|
||
|
.para
|
||
|
Below I give an introduction to how the sequences are processed by
|
||
|
being
|
||
|
passed from one function to the next.
|
||
|
.para
|
||
|
This program is used to start a
|
||
|
database for the project and
|
||
|
then the following procedure is used.
|
||
|
.para
|
||
|
Data in the form of individual gel readings are entered into the computer
|
||
|
|
||
|
and stored in separate files using either this program or the digitizer
|
||
|
|
||
|
program. Batches
|
||
|
of these gel readings
|
||
|
are passed to the screening functions in this program to search for overlaps
|
||
|
|
||
|
with vector sequences ("screen against vector") or for matches to
|
||
|
|
||
|
restriction enzyme sites that should not be
|
||
|
|
||
|
present ("screen against enzymes").
|
||
|
Each run of these screening functions passes on only those gel
|
||
|
|
||
|
readings that do not contain unwanted sequences. Sequences are passed
|
||
|
|
||
|
via
|
||
|
files of file names and eventually are processed by the automatic
|
||
|
assembly function ("auto assemble"). This function compares each gel
|
||
|
reading with a consensus of all the previous gel readings
|
||
|
stored in the database.
|
||
|
If it finds any
|
||
|
overlaps
|
||
|
it aligns the overlapping sequences by inserting padding characters,
|
||
|
and then adds the new gel reading to the database.
|
||
|
Gels that overlap are added to existing contigs and gels that do not
|
||
|
overlap any data in the database start
|
||
|
new contigs. If a new gel overlaps two contigs they are joined.
|
||
|
Any gel readings that appear to overlap but which
|
||
|
cannot be aligned sufficiently well are not entered and have
|
||
|
their names written to a file of failed gel reading names.
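.para
The outcomes just described can be summarised as a small decision
function. This Python sketch is illustrative only: the function name and
the returned strings are hypothetical, and the case of a gel reading
overlapping more than two contigs is not covered by the text, so the
sketch refuses to guess.
.lit
```python
def assembly_action(overlapped_contigs, aligned_ok=True):
    """Decide what happens to one new gel reading.

    `overlapped_contigs` lists the contigs the reading overlaps;
    `aligned_ok` says whether the alignment was good enough.
    """
    count = len(set(overlapped_contigs))
    if count == 0:
        return "start new contig"
    if not aligned_ok:
        # Overlaps were found but could not be aligned well enough.
        return "record in failure file"
    if count == 1:
        return "add to contig"
    if count == 2:
        return "join contigs"
    raise ValueError("more than two overlapped contigs: not described")


if __name__ == "__main__":
    for contigs in ([], [4], [4, 9]):
        print(contigs, "->", assembly_action(contigs))
```
.end lit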
|
||
|
.PARA
|
||
|
Generally data is entered
|
||
|
into the database in batches as just described. The program
|
||
|
is also used to examine
|
||
|
|
||
|
the data in the database, to enter gel readings that the automatic
|
||
|
|
||
|
assembly function cannot align ("auto assemble"),
|
||
|
|
||
|
and to make final edits. Edits to whole contigs
|
||
|
|
||
|
can be made in several ways.
|
||
|
A mouse-driven editor ("edit contig") is used to perform all edits manually.
|
||
|
Disagreements between gel readings
|
||
|
|
||
|
in contigs and their consensus
|
||
|
|
||
|
sequences can be highlighted by use of the function "highlight
|
||
|
|
||
|
disagreements".
|
||
|
.PARA
|
||
|
Editing the sequences is obviously an essential part of managing a
|
||
|
|
||
|
sequencing project.
|
||
|
Editing is required when new
|
||
|
|
||
|
sequences are added, when contigs are joined, and when sequences are
|
||
|
|
||
|
corrected.
|
||
|
A basic part of the strategy
|
||
|
|
||
|
used here is that new
|
||
|
|
||
|
gel readings should be correctly aligned throughout their whole length
|
||
|
|
||
|
when
|
||
|
they are entered into the database, and that when contigs are joined they
|
||
|
|
||
|
are edited so that they are well aligned in the region of overlap.
|
||
|
|
||
|
Alignment can be achieved by
|
||
|
|
||
|
adding padding characters to the sequences, and this is the way "auto
|
||
|
|
||
|
assemble"
|
||
|
operates when adding new sequences to the database.
|
||
|
|
||
|
.para
|
||
|
In order to search
|
||
|
for overlaps that may have been missed due to errors in
|
||
|
|
||
|
the gel readings, the function "extract gel readings" can be used to take
|
||
|
|
||
|
copies of the gel
|
||
|
|
||
|
readings at the ends of contigs, and write them out as separate files.
|
||
|
|
||
|
These can then be compared with the database consensus using the "auto
|
||
|
|
||
|
assemble" function in a mode that forbids entry of data into the
|
||
|
database,
|
||
|
and any gel reading matching two contigs will indicate a join that has
|
||
|
|
||
|
been
|
||
|
missed. The joins can then be made interactively using "join contigs".
|
||
|
|
||
|
Missed matches can be
|
||
|
|
||
|
found at this stage because the errors in the sequences may have been
|
||
|
|
||
|
corrected by new data.
|
||
|
|
||
|
.para
|
||
|
Generally the users need not concern themselves with how the relational
|
||
|
information is used by the program, but it is necessary to know
|
||
|
how contigs are identified. Because contigs are constantly being changed and
|
||
|
reordered the program identifies them by the numbers of the gel readings
|
||
|
they contain. Whenever users need to identify a contig they need only
|
||
|
know
|
||
|
the number or name of one of the gel readings it contains. Whenever the
|
||
|
program asks users to identify a contig or gel reading they can type its
|
||
|
number or its archive name. If they type its archive name they must precede
|
||
|
the name by a slash "/" symbol to denote that it is a name rather than a
|
||
|
number. E.g. if the archive
|
||
|
name is fred.gel with number 99, users should
|
||
|
type /fred.gel or 99 when asked to identify the contig. Generally,
|
||
|
when it asks for the gel reading to be identified,
|
||
|
the program will offer the user a default name,
|
||
|
and if the user types only return, that
|
||
|
contig will be accessed. When a database is opened the default contig will
|
||
|
be the longest one, but if another is accessed, it will subsequently become
|
||
|
the current default.
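.para
The convention for identifying a contig or gel reading (a number, or an
archive name preceded by a slash, or return alone for the default) can be
sketched in Python as follows. The function name and the tuples it
returns are hypothetical and are used only to make the rule concrete.
.lit
```python
def parse_identifier(text, default=None):
    """Interpret what the user typed when asked for a contig or reading.

    Returns a (kind, value) pair where kind is "default", "name" or
    "number".
    """
    text = text.strip()
    if not text:
        # Only return was typed: use the offered default.
        return ("default", default)
    if text.startswith("/"):
        # A leading slash marks an archive name rather than a number.
        return ("name", text[1:])
    return ("number", int(text))


if __name__ == "__main__":
    print(parse_identifier("/fred.gel"))  # prints ('name', 'fred.gel')
    print(parse_identifier("99"))         # prints ('number', 99)
```
.end lit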
|
||
|
.para
|
||
|
Further information is located in the following places.
|
||
|
The database files are described under "open database". The format
|
||
|
for
|
||
|
vector and consensus sequences is given under "calculate a consensus", as are
|
||
|
the
|
||
|
uncertainty codes used in gel readings.
|
||
|
.left margin2
|
||
|
.para
|
||
|
There are two other programs
relevant to sequencing: the digitizer
program and the trace editor program. Both are outlined briefly below.
|
||
|
.para
|
||
|
The digitizer program
|
||
|
is used for the initial input of gel readings
|
||
|
and for writing a file of file names. The program
|
||
|
uses a digitizer for data entry.
|
||
|
A digitizer is
|
||
|
a two dimensional surface such as a light box
|
||
|
which is such that if a special pen is pressed onto it, the pen's
|
||
|
coordinates are recorded by a computer.
|
||
|
These coordinates
|
||
|
can be interpreted by a program.
|
||
|
.para
|
||
|
In order to read an autoradiograph placed on the light box
|
||
|
the user need only define the bottom of
|
||
|
the four sequencing lanes and the bases
|
||
|
to which they correspond and then use the pen to point to each
|
||
|
successive band progressing up the gel. The program examines
|
||
|
the
|
||
|
coordinates of each pen position to see in which of the four
|
||
|
lanes
|
||
|
it lies and assigns the corresponding base to be stored in the
|
||
|
computer. Each time the pen tip is depressed to point to a position
|
||
|
on the surface of the digitizer the program sounds the bell on the
|
||
|
terminal to indicate to the user that a point has been recorded. As
|
||
|
the sequence is read the program displays it on the screen.
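.para
The assignment of pen positions to bases can be pictured with the short
Python sketch below. It assumes that each lane is represented by the
horizontal coordinate of its bottom and that a pen position belongs to
the nearest lane; the digitizer program's actual test is not described
here, and all names in the sketch are hypothetical.
.lit
```python
def make_lane_caller(lane_x, bases):
    """Return a function mapping an x coordinate to the base of a lane.

    `lane_x` gives the x coordinate defined for each of the four lanes
    and `bases` gives the base each lane corresponds to, in the same
    order.
    """
    lanes = list(zip(lane_x, bases))

    def call(x):
        # Assumed rule: the nearest lane centre wins.
        return min(lanes, key=lambda lane: abs(lane[0] - x))[1]

    return call


def read_bands(pen_positions, lane_x, bases):
    """Convert successive (x, y) pen positions, read up the gel, to bases."""
    call = make_lane_caller(lane_x, bases)
    return "".join(call(x) for x, _y in pen_positions)


if __name__ == "__main__":
    points = [(11, 0), (29, 5), (41, 9), (19, 12)]
    print(read_bands(points, [10, 20, 30, 40], "ACGT"))  # prints AGTC
```
.end lit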
|
||
|
.para
|
||
|
The trace editor program
|
||
|
is used for the initial processing of data obtained from
|
||
|
fluorescent sequencing machines. It allows the user to visually
|
||
|
select left and right cutoff positions to denote the start and end of good
|
||
|
data. Users may also edit the sequence at this point.
|
||
|
Output from ted is a sequence file in Staden format with headers that
|
||
|
describe to xdap the cutoff information.
|
||
|
|
||
|
.left margin1
|
||
|
@17. TX 1 @Screen against enzymes
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Used to compare gel readings against any restriction enzyme recognition
|
||
|
|
||
|
sequences that may have been used during cloning and which should not
|
||
|
|
||
|
be present in the data. Works on single gel readings or processes batches
|
||
|
|
||
|
accessed through files of file names. The algorithm looks for exact
|
||
|
|
||
|
matches to recognition sequences stored in a file.
|
||
|
|
||
|
.para
|
||
|
The file containing the recognition sequences must be identified. The
|
||
|
user
|
||
|
must choose between employing a file of file names, or typing in the
|
||
|
|
||
|
|
||
|
names of individual gel reading files. If a file of file names is used the
|
||
|
|
||
|
|
||
|
program will also create a new file of file names. When the option has
|
||
|
|
||
|
finished operating this new file will contain the names of all those gel
|
||
|
|
||
|
readings that did not match any of the recognition sequences. Hence it
|
||
|
can
|
||
|
be used for further processing of the batch. The recognition sequences
|
||
|
|
||
|
should be stored in a simple text file with one recognition sequence per
|
||
|
|
||
|
line.
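.para
The screening step can be sketched in Python as below. The sketch works
on data held in memory rather than on files, and its function names are
hypothetical; it shows only the exact-match test and the separation of a
batch into readings that pass and readings that match a recognition
sequence. Converting everything to upper case is an assumption of the
example.
.lit
```python
def load_recognition_sequences(lines):
    """Read recognition sequences, one per line, ignoring blank lines."""
    return [line.strip().upper() for line in lines if line.strip()]


def screen_against_enzymes(readings, sites):
    """Split a batch into readings that pass and readings that match.

    `readings` maps gel reading names to sequences. Returns the list of
    names with no exact match (the contents of the new file of file
    names) and a dictionary of the sites found in each rejected reading.
    """
    passed = []
    hits = {}
    for name, seq in readings.items():
        found = [site for site in sites if site in seq.upper()]
        if found:
            hits[name] = found
        else:
            passed.append(name)
    return passed, hits


if __name__ == "__main__":
    sites = load_recognition_sequences(["GAATTC\n", "GGATCC\n"])
    batch = {"a.gel": "TTGAATTCAA", "b.gel": "ACGTACGT"}
    print(screen_against_enzymes(batch, sites))
```
.end lit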
|
||
|
.left margin1
|
||
|
@18. TX 1 @Screen against vector
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Used to compare gel readings against any vector sequences that may have
|
||
|
|
||
|
been picked up during cloning. Works on single gel readings or processes
|
||
|
|
||
|
batches accessed through files of file names. The algorithm looks for
|
||
|
exact
|
||
|
matches of length "minimum match length" and displays the overlapping
|
||
|
|
||
|
sequences.
|
||
|
.para
|
||
|
The file containing the vector sequence must be identified. The user must
|
||
|
|
||
|
choose between employing a file of file names, or typing in the names of
|
||
|
|
||
|
individual gel reading files. If a file of file names is used the program
|
||
|
will
|
||
|
also create a new file of file names. When the option has finished
|
||
|
|
||
|
operating this new file will contain the names of all those gel readings
|
||
|
|
||
|
that did not match the vector sequence. Hence it can be used for further
|
||
|
|
||
|
processing of the batch. The vector sequence should be stored in a simple
|
||
|
|
||
|
text file with up to 80 characters of data per line. More than one vector
|
||
|
|
||
|
can be stored in a single file. If so each should be preceded by a 20
|
||
|
|
||
|
character title of the form <---m13mp8.001-----> where the < and >
|
||
|
signs
|
||
|
and the number like .001 are obligatory. The number must be preceded
|
||
|
|
||
|
by a dot (.) and be 3 digits long. The total sequence in the file must not
exceed 50,000 characters in length.
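.para
The vector file layout and the exact-match test can be illustrated with
the Python sketch below. It is not the program's code: the function
names, the regular expression used to recognise a 20 character title and
the name "untitled" for a file holding a single vector without a title
are all assumptions of the example, and only one orientation of the gel
reading is compared.
.lit
```python
import re

# A title such as <---m13mp8.001----->: angle brackets, dashes as
# filler, and a name ending in a dot followed by three digits.
TITLE = re.compile(r"^<-*(?P<name>[^<>-][^<>]*?\.\d{3})-*>$")


def parse_vector_file(lines):
    """Return a dictionary of vector name -> sequence."""
    parts = {}
    current = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TITLE.match(line)
        if match and len(line) == 20:
            current = match.group("name")
            parts[current] = []
        else:
            if current is None:
                current = "untitled"  # single vector with no title line
                parts[current] = []
            parts[current].append(line.upper())
    return {name: "".join(chunks) for name, chunks in parts.items()}


def has_exact_match(reading, vector, min_match):
    """True if the two share an identical stretch of min_match characters."""
    reading = reading.upper()
    words = {vector[i:i + min_match]
             for i in range(len(vector) - min_match + 1)}
    return any(reading[i:i + min_match] in words
               for i in range(len(reading) - min_match + 1))


if __name__ == "__main__":
    text = ["<---m13mp8.001----->", "ACGTACGTTTGACC", "GGATCCAA"]
    vectors = parse_vector_file(text)
    print(has_exact_match("NNTTGACCGGATNN", vectors["m13mp8.001"], 10))
```
.end lit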
|
||
|
|
||
|
.left margin1
|
||
|
@20. TX 3 @Auto assemble
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Compares gel readings against the current contents of the database and
|
||
|
|
||
|
produces alignments. In its normal mode of operation
|
||
|
("entry permitted"), the function
|
||
|
will automatically enter the gel readings into the database, but if entry
|
||
|
is not permitted it will only produce alignments. It works on
|
||
|
|
||
|
single gel readings or processes batches of gel readings accessed through
|
||
|
|
||
|
files of file names. It is the usual way to enter data into the database.
|
||
|
|
||
|
.para
|
||
|
The function will check the database for logical consistency and will
|
||
|
only
|
||
|
proceed if it is OK. Choose whether gel readings should be entered into the
database, or whether they should only be compared. Choose between using a file
|
||
|
|
||
|
of file names or typing file names on the keyboard. If so selected, supply
|
||
|
|
||
|
the file of file names. Also supply a file of file names to contain the names of
|
||
|
|
||
|
all the gel readings that fail to get entered.
|
||
|
Select the entry mode. Normal assembly is appropriate for all but special
|
||
|
cases, as is "permit joins". Uses for the other modes are not documented
|
||
|
here.
|
||
|
Define a minimum initial
|
||
|
|
||
|
match length. Define a minimum alignment block (the default value is
|
||
|
|
||
|
taken in all but exceptional circumstances). Define the maximum number
|
||
|
|
||
|
of padding characters allowed to be used in each gel reading to help
|
||
|
|
||
|
achieve alignment, and the same for the number allowed in the contig for
|
||
|
|
||
|
each gel reading. Finally define the maximum percentage mismatch to
|
||
|
be allowed for any gel reading to be entered into the database. If
|
||
|
|
||
|
for any gel reading, any of these last three values is exceeded the gel
|
||
|
|
||
|
reading will not be entered into the database.
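.para
The three limits act together as a simple acceptance test, sketched here
in Python. The class and function names are hypothetical; the default
values are the ones offered by the program's prompts (8 pads per gel, 8
pads per gel in the contig, 8.00 percent mismatch).
.lit
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssemblyLimits:
    """User-defined limits applied to each alignment."""
    max_pads_in_gel: int = 8
    max_pads_in_contig: int = 8
    max_percent_mismatch: float = 8.0


def may_enter(pads_in_gel, pads_in_contig, percent_mismatch,
              limits=AssemblyLimits()):
    """True if none of the three limits is exceeded."""
    return (pads_in_gel <= limits.max_pads_in_gel
            and pads_in_contig <= limits.max_pads_in_contig
            and percent_mismatch <= limits.max_percent_mismatch)


if __name__ == "__main__":
    # One pad in the gel, none in the contig, 1.8 percent mismatch.
    print(may_enter(1, 0, 1.8))  # prints True
```
.end lit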
|
||
|
|
||
|
.para
|
||
|
In operation the function takes a batch of gel readings (probably passed
|
||
|
|
||
|
on as a file of file names from one of the screening routines) and
|
||
|
enters them into a
|
||
|
database for a sequencing project. It takes each gel reading
|
||
|
in turn,
|
||
|
compares it with the current consensus for the database, and then
|
||
|
produces an alignment for any regions of the consensus it
|
||
|
overlaps; if this alignment is sufficiently good it then edits
|
||
|
both the new gel reading and the sequences it overlaps and adds
|
||
|
the
|
||
|
new gel reading to the database. The program then updates the
|
||
|
consensus
|
||
|
accordingly and carries on to the next gel reading.
|
||
|
.para
|
||
|
All alignments are displayed and any gel readings
|
||
|
that do match but that
|
||
|
|
||
|
cannot be aligned sufficiently well have their names written to a
|
||
|
file of failed gel reading names. The function works without any
|
||
|
|
||
|
user intervention and can process any number of gel readings in a
|
||
|
single run. Those gel readings that fail can be recompared using
|
||
|
|
||
|
the same function (to find the current overlap position) and the
|
||
|
|
||
|
user can enter them into the database
|
||
|
|
||
|
manually using the "enter new gel reading" option.
|
||
|
.para
|
||
|
Typical dialogue and output from the function is shown below. (Note that
|
||
|
output for gel readings 2 - 9 has been deleted to save space).
|
||
|
.lit
|
||
|
Automatic sequence assembler
|
||
|
Database is logically consistent
|
||
|
? (y/n) (y) Permit entry
|
||
|
? (y/n) (y) Use file of file names
|
||
|
? File of gel reading names=demo.nam
|
||
|
? File for names of failures=demo.fail
|
||
|
Select entry mode
|
||
|
X 1 Perform normal shotgun assembly
|
||
|
2 Put all sequences in one contig
|
||
|
3 Put all sequences in new contigs
|
||
|
? Selection (1-3) (1) =
|
||
|
? (y/n) (y) Permit joins
|
||
|
? Minimum initial match (12-4097) (15) =
|
||
|
? Minimum alignment block (2-5) (3) =
|
||
|
? Maximum pads per gel (0-25) (8) =
|
||
|
? Maximum pads per gel in contig (0-25) (8) =
|
||
|
? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
|
||
|
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
|
||
|
Processing 1 in batch
|
||
|
Gel reading name=HINW.004
|
||
|
Gel reading length= 283
|
||
|
Searching for overlaps
|
||
|
Strand 1
|
||
|
Strand 2
|
||
|
No matches found
|
||
|
Total matches found 1
|
||
|
Padding in contig= 0 and in gel= 1
|
||
|
Percentage mismatch after alignment = 1.8
|
||
|
Best alignment found
|
||
|
1          11         21         31         41         51
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
********** ********** ********** ********** ********** **********
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
1          11         21         31         41         51
61         71         81         91         101        111
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
********** ********** ********** ********** ********** ***** ****
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
61         71         81         91         101        111
121        131        141        151        161        171
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
********** ********** ********** ********** ********** **********
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
121        131        141        151        161        171
181        191        201        211        221        231
GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
********** ********** * ******** ********** ********** **********
GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
181        191        201        211        221        231
241        251        261        271        281
CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
****  **** ********** ** * ***** ********** ****
CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
241        251        261        271        281
|
||
|
Batch finished
|
||
|
9 sequences processed
|
||
|
0 sequences entered into database
|
||
|
0 joins made
|
||
|
|
||
|
.end lit
|
||
|
|
||
|
.para
|
||
|
Note that "auto assemble" cannot align protein sequences.
|
||
|
.left margin1
|
||
|
@28. TX 1 @Highlight disagreements
|
||
|
.left margin2
|
||
|
.para
|
||
|
Used in the latter stages of a project
|
||
|
to highlight disagreements between individual gel readings
|
||
|
and their consensus sequences. Characters that agree with the
|
||
|
|
||
|
consensus are shown as : symbols for the plus strand and . for the minus
|
||
|
|
||
|
strand. Characters that disagree with the consensus are left unchanged
|
||
|
|
||
|
and so stand out clearly. The results of this analysis are written to a
|
||
|
file.
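.para
The substitution performed on each gel reading can be sketched in Python
as follows. The function name and its arguments are hypothetical, and the
sketch works on a single reading already lined up against its stretch of
consensus rather than on a display file.
.lit
```python
def highlight(reading, consensus, reverse=False,
              plus_symbol=":", minus_symbol="."):
    """Replace characters that agree with the consensus by a symbol.

    `reverse` is True for a reading on the minus strand. Characters that
    disagree with the consensus are left unchanged.
    """
    symbol = minus_symbol if reverse else plus_symbol
    return "".join(symbol if r == c else r
                   for r, c in zip(reading, consensus))


if __name__ == "__main__":
    print(highlight("GCTATTTTGT", "G-TATTTTGT"))  # prints :C::::::::
```
.end lit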
|
||
|
|
||
|
.para
|
||
|
Before selecting this option create a file of the display of the contig to
|
||
|
be
|
||
|
"highlighted". The option will ask for the name of this file. Select
|
||
|
symbols
|
||
|
to denote "agreeing" characters on each strand; the defaults are : and .,
|
||
|
|
||
|
but any others can be used. Supply the name of a file in which to put
|
||
|
|
||
|
the output.
|
||
|
.para
|
||
|
The display file needed as input for this option is created by selecting
|
||
|
|
||
|
"Redirect output", followed immediately by "display contig", and then
|
||
|
"Redirect output" again. The
|
||
|
|
||
|
cutoff score used in the consensus calculation can be set by option "set
|
||
|
|
||
|
display parameters". Note that for the highlight function
|
||
|
there is a limit of 50 for the number of gel
|
||
|
readings that are aligned at any position, i.e. the contig must be less
|
||
|
than 51 gel readings deep at its thickest point. I hope that those performing
|
||
|
shotgun sequencing never reach this limit, but those using the program for
|
||
|
comparing sequence families might.
|
||
|
.para
|
||
|
Typical output from this function is shown below.
|
||
|
.lit
|
||
|
|
||
|
                       210       220       230       240       250
    1 HINW.004  :C::::::::::::::::::::::::::::::::::::::::::AC::::
    7 HINW.018  :*::::::::::::::::::::::::::::::::::::::::::CA::::
   -4 HINW.017                               ...............AC....
                G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG

                       260       270       280       290       300
    1 HINW.004  ::::::::::::*:D:::::::::::::::::::
    7 HINW.018  ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
   -4 HINW.017  ..............................................A...
    3 HINW.009  :::::::::::::::V::::::::::::::::::::::::::::*AV:::
   -6 HINW.028                          ......................A...
                AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@32. TX 3 @Extract gel readings
|
||
|
.left margin2
|
||
|
.para
|
||
|
Used to make copies of the aligned gel readings in a database,
|
||
|
to write them into separate files, and to write a
|
||
|
|
||
|
corresponding file of file names. It operates in two modes: either all gel
|
||
|
|
||
|
readings are extracted, or only those at the ends of contigs.
|
||
|
|
||
|
.para
|
||
|
Choose which mode of operation is required and supply a file of file
|
||
|
|
||
|
names.
|
||
|
.para
|
||
|
The gel readings are given their original
|
||
|
|
||
|
names.
|
||
|
If used to extract the gel readings from the ends of contigs the function
|
||
|
is
|
||
|
useful for checking for missed contig joins: the file of file names can be
|
||
|
|
||
|
used with the auto assemble function to recompare these gel readings,
|
||
|
|
||
|
and each should only overlap one contig. Any that overlap two contigs
|
||
|
|
||
|
will identify possible joins.
|
||
|
.para
|
||
|
If the option is used to extract all the gel readings from a database, a
|
||
|
|
||
|
subsequent run of "auto assemble" can reconstitute a database which has
|
||
|
|
||
|
been corrupted. This rarely occurs and is usually necessitated by a
|
||
|
|
||
|
user employing "alter relationships" incorrectly without first having
|
||
|
|
||
|
made a copy.
|
||
|
.left margin1
|
||
|
@1. TX 0 @Help
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Help is available on the following topics :
|
||
|
|
||
|
.LEFT MARGIN1
|
||
|
@2. TX 0 @Quit
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
This command stops the program and is the only safe way to terminate a
|
||
|
|
||
|
run
|
||
|
of the program that has altered the contents of the database in any way.
|
||
|
|
||
|
.left margin1
|
||
|
@3. TX 1 @Open a database
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
Opens existing databases or allows new ones to be started. The function
|
||
|
is
|
||
|
automatically called into operation
|
||
|
when the program is started but can also be selected
|
||
|
|
||
|
from the general menu.
|
||
|
.para
|
||
|
Choose to open an existing database or start a new one, or if ! is typed
|
||
|
when the program is first started, enter the program without opening a
|
||
|
database. Supply a project
|
||
|
|
||
|
database name, and if it already exists, the "version". If starting a new
|
||
|
|
||
|
database, define the database size and whether it is for DNA or protein sequences.
|
||
|
The database size is an initial size for the database. It can be increased
|
||
|
later during the project. It is the sum of the number of gel
|
||
|
readings plus the number of contigs.
|
||
|
.para
|
||
|
Database names can have from one to 12 letters and must not include a full
|
||
|
|
||
|
stop (.). The database is made from five separate files. If the database
|
||
|
is
|
||
|
called FRED then version 0 of database FRED comprises files FRED.AR0,
|
||
|
|
||
|
FRED.RL0, FRED.SQ0, FRED.TG0 and FRED.CC0. The version is the last symbol in the file names.
|
||
|
|
||
|
Only this program
|
||
|
can read these files. If the "copy database" option is used it
|
||
|
|
||
|
will ask the user to define a new "version".
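.para
The naming scheme can be illustrated with a short Python sketch. The function
database_files is hypothetical and is shown only to make the rule concrete;
it assumes the version is a single symbol appended to each of the five
extensions listed above.
.lit
```python
# The five file types that make up one version of a database.
EXTENSIONS = ("AR", "RL", "SQ", "TG", "CC")


def database_files(name, version):
    """Return the five file names for one version of a database."""
    if not 1 <= len(name) <= 12 or "." in name:
        raise ValueError("names have 1 to 12 letters and no full stop")
    return [f"{name}.{ext}{version}" for ext in EXTENSIONS]


print(database_files("FRED", "0"))
```
.end lit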
|
||
|
.para
|
||
|
For normal use the maximum gel reading length is set to 512 characters,
|
||
|
|
||
|
but when a database is started the user may choose lengths of either
|
||
|
|
||
|
512,
|
||
|
1024, 1536..., 4096. Normally the program is used to handle DNA
|
||
|
|
||
|
sequences but many of the functions also work on protein sequences. The
|
||
|
|
||
|
choice of sequence type is made when the database is started.
|
||
|
|
||
|
.para
|
||
|
The contigs are not stored on the disk as the user sees them displayed on
|
||
|
|
||
|
the screen. Each gel reading is stored with sufficient information about
|
||
|
|
||
|
how it overlaps other gel readings so that the program can work out how
|
||
|
|
||
|
to
|
||
|
present them aligned on the screen. We refer to this extra data as "the
|
||
|
relationships" and it is explained below.
|
||
|
|
||
|
The database comprises 5 separate files.
|
||
|
|
||
|
.left margin2
|
||
|
1. a working version of each gel reading. This is the version of
|
||
|
the gel reading
|
||
|
that is in the database and initially it is an exact copy of
|
||
|
the original sequence (known as the archive)
|
||
|
but it is edited and manipulated to align it
|
||
|
with other gel readings.
|
||
|
|
||
|
.left margin2
|
||
|
2. the file of relationships. This file contains all of the
|
||
|
|
||
|
information that is required to assemble the working versions
|
||
|
into
|
||
|
|
||
|
contigs during processing; any manipulations on the data use this
|
||
|
|
||
|
file and it is automatically updated at any time that the
|
||
|
|
||
|
relationships are changed. The information in this file is as
|
||
|
|
||
|
follows:
|
||
|
.left margin2
|
||
|
(A) Facts about each gel reading and its relationship to
|
||
|
others
|
||
|
("gel
|
||
|
|
||
|
descriptor lines"):
|
||
|
|
||
|
.left margin2
|
||
|
(a) the number of the gel
|
||
|
reading (each gel reading is given a number as it is
|
||
|
|
||
|
entered into the database)
|
||
|
|
||
|
.left margin2
|
||
|
(b) the length of the sequence from this gel reading
|
||
|
|
||
|
.left margin2
|
||
|
(c) the position of the left end of this gel
|
||
|
reading relative to the left
|
||
|
|
||
|
end of the contig of which it is a member
|
||
|
|
||
|
.left margin2
|
||
|
(d) the number of the next gel
|
||
|
reading to the left of this gel reading
|
||
|
|
||
|
.left margin2
|
||
|
(e) the number of the next gel reading to the right
|
||
|
|
||
|
.left margin2
|
||
|
(f) the relative strandedness of this gel
|
||
|
reading, ie whether it is in
|
||
|
|
||
|
the same sense or the complementary sense as its archive.
|
||
|
|
||
|
.left margin2
|
||
|
(B) Facts about each contig ("contig descriptor lines"):
|
||
|
|
||
|
.left margin2
|
||
|
(a) the length of this contig
|
||
|
|
||
|
.left margin2
|
||
|
(b) the number of the leftmost gel
|
||
|
reading of this contig
|
||
|
|
||
|
.left margin2
|
||
|
(c) the number of the rightmost gel reading of this contig.
|
||
|
|
||
|
.left margin2
|
||
|
(C) General facts:
|
||
|
|
||
|
.left margin2
|
||
|
(a) the number of gel readings in the database
|
||
|
|
||
|
.left margin2
|
||
|
(b) the number of contigs in the database.
|
||
|
|
||
|
.left margin2
|
||
|
3. the file of archive names. This is simply a list of the names
|
||
|
|
||
|
of each of the archive files in the database but on line number
|
||
|
|
||
|
1000 we also store the size of the database, ie the number of lines
|
||
|
|
||
|
of information allowed in the database files. This file always has
|
||
|
|
||
|
1000 lines but the length of the file of relationships and the file
|
||
|
|
||
|
of working versions can be set by the user when creating a
|
||
|
database
|
||
|
|
||
|
or when copying from one to another.
|
||
|
.left margin2
|
||
|
4. the file of tags (annotation).
|
||
|
This consists of linked lists of tag information for each sequence in the
|
||
|
database.
|
||
|
Tags are created by the user as annotation, or by xdap as records of edits or
|
||
|
for storing cutoff information.
|
||
|
As the number of tags can grow without limit, so can this file.
|
||
|
For each gel there is a header record, which contains the record number of
|
||
|
the start of the linked list for that gel. On line IDBSIZ there is a record
|
||
|
containing information about the file such as its present length and if there
|
||
|
are any free "tag" slots to be reused in the file.
|
||
|
|
||
|
5. the file of comments (annotation).
|
||
|
This consists of linked lists of comment fragments.
|
||
|
Comments are created by the user as a message attached to annotation,
|
||
|
or by the system to store cutoff information.
|
||
|
Comments are character strings of any length.
|
||
|
Comments longer than 40 characters are broken up into fragments, each 40
|
||
|
characters long, and are chained together in a linked list.
|
||
|
As the number of comments can grow without limit, so can this file.
|
||
|
|
||
|
.para
|
||
|
Structure of the database files
|
||
|
.para
|
||
|
1. The file of relationships
|
||
|
.para
|
||
|
The file contains IDBSIZ lines of data:
|
||
|
the general data are stored on line IDBSIZ; data about gel
|
||
|
readings are
|
||
|
stored from line 1 downwards; data about contigs are stored from
|
||
|
line IDBSIZ-1 upwards. A database of 500 lines containing 25 gel
|
||
|
readings and 4 contigs would have a file
|
||
|
of relationships as is shown below.
|
||
|
.lit
|
||
|
|
||
|
|
||
|
---------------------------------------------
|
||
|
1 Gel descriptor record
|
||
|
2 " " "
|
||
|
3 " " "
|
||
|
4 " " "
|
||
|
5 " " "
|
||
|
' ' ' '
|
||
|
' ' ' '
|
||
|
25 " " "
|
||
|
26 Empty record
|
||
|
' ' '
|
||
|
|
||
|
' ' '
|
||
|
495 ' '
|
||
|
496 Contig descriptor record
|
||
|
497 " " "
|
||
|
498 " " "
|
||
|
499 " " "
|
||
|
500 Number of gel readings=25, Number of contigs=4
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the file of relationships
|
||
|
|
||
|
.end lit
|
||
|
As each new gel reading is added into the database a new line is added
|
||
|
to the end of the list of gel descriptor
|
||
|
lines. If this new gel reading does not
|
||
|
overlap with any gel readings
|
||
|
already in the database a new contig line is
|
||
|
added to the top of the list of contig lines. If it overlaps with
|
||
|
one contig then no new contig line need be added but if it overlaps
|
||
|
with two contigs then these two contigs must be joined and the
|
||
|
number of contig lines will be reduced by one. Then the list of
|
||
|
contig
|
||
|
lines is compressed to leave the empty line at the top of the list.
|
||
|
Initially the two types of line will move towards one another but
|
||
|
eventually, as contigs are joined, the contig descriptor lines will
|
||
|
move in the same direction as the gel descriptor
|
||
|
lines. At the end of a
|
||
|
project there should be only one contig line. The database is thus
|
||
|
capable of handling a project of 998 gels when the database size is 1000.
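.para
The line numbering described above can be checked with a small Python sketch.
The function relationship_layout is hypothetical; it only restates the
arrangement in the figure (gel lines from 1 downwards, contig lines from
IDBSIZ-1 upwards, general data on line IDBSIZ) and is not the program's code.
.lit
```python
def relationship_layout(idbsiz, n_gels, n_contigs):
    """Return the line ranges used in a file of relationships."""
    # Line IDBSIZ is reserved for the general data, so gel and contig
    # descriptor lines together may use at most IDBSIZ - 1 lines.
    if n_gels + n_contigs > idbsiz - 1:
        raise ValueError("database is full")
    return {
        "gel_lines": range(1, n_gels + 1),
        "empty_lines": range(n_gels + 1, idbsiz - n_contigs),
        "contig_lines": range(idbsiz - n_contigs, idbsiz),
        "general_line": idbsiz,
    }


layout = relationship_layout(500, 25, 4)
print(list(layout["contig_lines"]))  # prints [496, 497, 498, 499]
```
.end lit
With 25 gel readings and 4 contigs in a database of 500 lines this gives
empty lines 26 to 495, as in the figure.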
|
||
|
.para
|
||
|
2. Structure of the working versions file
|
||
|
.para
|
||
|
The working versions of gel readings are stored in a file of
|
||
|
IDBSIZ lines each containing 512 characters. Gel reading
|
||
|
number 1 is stored on line
|
||
|
1, gel reading number 2 on line 2 and so on.
|
||
|
.para
|
||
|
3. Structure of the archive names file
|
||
|
.para
|
||
|
This file, unlike the others, always has 1000 lines each 10
|
||
|
characters in length. Its length is fixed because line 1000 is used
|
||
|
to store IDBSIZ the database size and the programs need a definite
|
||
|
location from which to read this number.
|
||
|
.para
|
||
|
4. Structure of the tag file
|
||
|
.para
|
||
|
This file initially starts with IDBSIZ lines, and is expanded as new tags are
|
||
|
created.
|
||
|
Information about the length of the file, and which tag records are reusable
|
||
|
is stored on line IDBSIZ.
|
||
|
A database of 500 lines would have a file of tags as shown below.
|
||
|
.lit
|
||
|
|
||
|
---------------------------------------------
|
||
|
1 Tag descriptor record
|
||
|
2 " " "
|
||
|
3 " " "
|
||
|
4 " " "
|
||
|
5 " " "
|
||
|
' ' ' '
|
||
|
' ' ' '
|
||
|
497 " " "
|
||
|
498 " " "
|
||
|
499 " " "
|
||
|
500 Length of file=N, Free list=0
|
||
|
501 Tag record
|
||
|
502 " "
|
||
|
503 " "
|
||
|
' ' '
|
||
|
' ' '
|
||
|
N-2 " "
|
||
|
N-1 " "
|
||
|
N Tag record
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the file of tags
|
||
|
|
||
|
.end lit
|
||
|
As each new tag is added to the database, a check is made in the
|
||
|
file descriptor record at line IDBSIZ. If the list of reusable records is 0,
|
||
|
the file is extended by one line. Otherwise the new tag is assigned to
|
||
|
the record at the head of the free list.
|
||
|
When tags are deleted, they are added to the free list in the file descriptor
|
||
|
record.
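.para
The free list handling can be sketched in Python as follows. The class
TagFile and its attribute names are hypothetical, and the sketch assumes
that deleted records are pushed onto the head of the free list; the
on-disk record format is not modelled.
.lit
```python
class TagFile:
    """In-memory model of the tag file's length and free list."""

    def __init__(self, idbsiz):
        self.length = idbsiz   # the file starts with IDBSIZ lines
        self.free_head = 0     # 0 means there are no reusable records
        self.next_free = {}    # free record number -> next free record

    def allocate(self):
        """Return the record number to use for a new tag."""
        if self.free_head == 0:
            self.length += 1   # extend the file by one line
            return self.length
        record = self.free_head
        self.free_head = self.next_free.pop(record)
        return record

    def release(self, record):
        """Add a deleted tag record to the free list."""
        self.next_free[record] = self.free_head
        self.free_head = record


tags = TagFile(500)
first = tags.allocate()   # 501: the file is extended
tags.release(first)
print(tags.allocate())    # prints 501: the freed record is reused
```
.end lit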
|
||
|
.para
|
||
|
5. Structure of the comment file
|
||
|
.para
|
||
|
This file initially starts with 1 line, and is expanded as new annotation is
|
||
|
created.
|
||
|
Information about the length of the file, and which comment records are reusable
|
||
|
is stored on the first line.
|
||
|
.lit
|
||
|
|
||
|
---------------------------------------------
|
||
|
1 Length of file=N, Free list=0
|
||
|
2 Comment fragment
|
||
|
3 " "
|
||
|
4 " "
|
||
|
' ' '
|
||
|
' ' '
|
||
|
N-2 " "
|
||
|
N-1 " "
|
||
|
N Comment fragment
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the file of comments
|
||
|
|
||
|
.end lit
|
||
|
As each new comment is added to the database, a check is made in the file
|
||
|
descriptor record at line 1. If the list of reusable records is 0,
|
||
|
the file is extended to hold the new comment. Otherwise the new comment is
|
||
|
assigned to records starting with the head of the free list.
|
||
|
When comments are deleted, the discarded records are added to the free list in
|
||
|
the file descriptor record.
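.para
Breaking a comment into chained fragments can be sketched in Python. The
function chain_fragments is hypothetical. It assumes the fragments are
written to consecutive records at the end of the file, that a link of 0
marks the last fragment, and that the final fragment may be shorter than
40 characters; the real file may differ on each of these points.
.lit
```python
FRAGMENT_LENGTH = 40


def chain_fragments(comment, first_record):
    """Split a comment into (record, text, next_record) fragments."""
    pieces = [comment[i:i + FRAGMENT_LENGTH]
              for i in range(0, len(comment), FRAGMENT_LENGTH)]
    records = []
    for offset, text in enumerate(pieces):
        is_last = offset == len(pieces) - 1
        # Assumption: 0 is used as the end-of-chain marker.
        next_record = 0 if is_last else first_record + offset + 1
        records.append((first_record + offset, text, next_record))
    return records


for record in chain_fragments("x" * 90, first_record=2):
    print(record[0], len(record[1]), record[2])
```
.end lit
A comment of 90 characters therefore occupies three records, the first two
holding 40 characters each and the last holding 10.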
|
||
|
.para
|
||
|
There are various checks within the programs to
|
||
|
protect users from themselves:-
|
||
|
.left margin2
|
||
|
1. All user input is checked for errors - e.g. reference to
|
||
|
non-existent gel
|
||
|
readings or contigs, incorrect positions in the
|
||
|
contig or gel readings.
|
||
|
.left margin2
|
||
|
2. Before entering a gel reading the system checks to see if a
|
||
|
file of the same name has already been entered.
|
||
|
.left margin2
|
||
|
3. Join will not allow the circularising of a contig.
|
||
|
.left margin2
|
||
|
4. Both enter and join functions restrict the region
|
||
|
that the user is allowed to edit (using edit contig) to the
|
||
|
region of overlap.
|
||
|
.left margin2
|
||
|
5. Users may escape from any point in the program.
|
||
|
.left margin2
|
||
|
6. Help is available from all points in the program.
|
||
|
.SK2
|
||
|
.LEFT MARGIN2
|
||
|
IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS
|
||
|
DOING
|
||
|
ANYTHING THAT INVOLVES CHANGING THE CONTENTS OF THE
|
||
|
DATABASE, I.E. DURING AUTO ASSEMBLE,
|
||
|
COMPLETE ENTRY, COMPLETE JOIN, COMPLEMENT CONTIG, EDIT CONTIG, AND SCREEN
|
||
|
EDIT.
|
||
|
This could
|
||
|
corrupt the database so badly that it is impossible to fix. The program
|
||
|
should always be left using the QUIT option.
|
||
|
|
||
|
.left margin1
|
||
|
@4. TX 3 @Edit contig
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
The Contig Editor is a mouse-driven editor that can insert,
|
||
|
delete and change gel reading sequences.
|
||
|
.para
|
||
|
The Contig Editor allows scrolling from one end of a contig to the other
|
||
|
using the scroll bar and scroll buttons. Action of mouse button presses
|
||
|
when the mouse pointer is in the scroll bar:
|
||
|
.sk1
|
||
|
.lit
|
||
|
Middle Mouse Button Set editor position
|
||
|
Left Mouse Button Scroll forward one screenful
|
||
|
Right Mouse Button Scroll backwards one screenful
|
||
|
.end lit
|
||
|
.sk1
|
||
|
The four scroll buttons operate as follows:
|
||
|
.sk1
|
||
|
.lit
|
||
|
"<<" Scroll left half a screenful
|
||
|
"<" Scroll left one character
|
||
|
">" Scroll right one character
|
||
|
">>" Scroll right half a screenful
|
||
|
.end lit
|
||
|
.para
|
||
|
The Editor cursor can be positioned anywhere in the edit window by
|
||
|
moving the mouse pointer over the character of interest, then pressing the
|
||
|
left mouse button. The Editor cursor can also be moved by using the
|
||
|
direction arrow keys.
|
||
|
.para
|
||
|
The editor operates in two main edit modes - Replace and Insert. Replace allows
|
||
|
a character to be replaced by another. Insert allows characters to be
|
||
|
inserted into a gel reading sequence. Characters are entered by typing
|
||
|
them from the keyboard. Only valid characters are permitted.
|
||
|
Characters can be deleted by positioning the cursor one character to the right,
|
||
|
then pressing the delete key.
|
||
|
Normally Insert and Delete apply to the consensus line of the contig ONLY.
|
||
|
This restraint can be overridden by using the "Super Edit" mode of
|
||
|
operation, THOUGH IT IS NOT RECOMMENDED.
|
||
|
.para
|
||
|
Edits can also be performed on the consensus, though they are
|
||
|
restricted to insertion and deletion of padding characters ("*").
|
||
|
These edits also have special meanings.
|
||
|
A deletion will delete ALL characters at the position to the left
|
||
|
of the cursor in the contig, and move the relative positions of all
|
||
|
sequences starting to the right of the cursor position left one
|
||
|
character.
|
||
|
An insertion will insert the character typed ("*") into ALL gel
|
||
|
reading sequences at the cursor's position in the contig, and move the
|
||
|
relative positions of all sequences starting to the right of the cursor
|
||
|
position right one character.
|
||
|
.para
|
||
|
The effect of the last edit can be undone by pressing the "Undo" button
|
||
|
at the top of the editor window.
|
||
|
.para
|
||
|
The cursor will automatically be positioned at the next problem when the
|
||
|
"Find Next Problem" button is selected. The next problem is where the
|
||
|
consensus shows either an ambiguity ("-") or a pad ("*") character.
|
||
|
.para
|
||
|
The edits to the contig can be saved by pressing the "Leave Editor"
|
||
|
button and replying "Yes" to the prompt to "Save changes?". As no changes
|
||
|
are made to the working copy of your database until this point, it
|
||
|
is possible to abort the editor if
|
||
|
the edit session ends up in an unsatisfactory state (ie if you've
|
||
|
stuffed it up!)
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Displaying Traces
|
||
|
.left margin2
|
||
|
.para
|
||
|
The original data from which the gel reading sequences were derived can
|
||
|
be seen by double clicking (two quick clicks) with the middle mouse button
|
||
|
on the area of interest. The trace will be displayed with the point
|
||
|
clicked at the centre of the trace viewport.
|
||
|
.para
|
||
|
All traces that are displayed are maintained in one window, called the Trace
|
||
|
Manager. The Trace Manager will display at most four traces. When four
|
||
|
traces are already being managed and a new one is requested, the one at the top
|
||
|
of the Trace Manager is removed and the new one is added to the bottom.
|
||
|
Traces can be removed individually by using the "quit" button in the panel next
|
||
|
to the trace.
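.para
The replacement rule of the Trace Manager can be modelled in Python with a
bounded queue. The class below is a hypothetical sketch, not the editor's
code; the left end of the queue stands for the top of the window.
.lit
```python
from collections import deque


class TraceManager:
    """Keeps at most four traces, dropping the oldest one first."""

    def __init__(self, max_traces=4):
        # A deque with maxlen discards items from the left (the top of
        # the window) when a new item is appended on the right (bottom).
        self.traces = deque(maxlen=max_traces)

    def show(self, name):
        self.traces.append(name)

    def quit(self, name):
        self.traces.remove(name)


manager = TraceManager()
for name in ["a", "b", "c", "d", "e"]:
    manager.show(name)
print(list(manager.traces))  # prints ['b', 'c', 'd', 'e']
```
.end lit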
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Extending Reads Using Cutoff Information
|
||
|
.left margin2
|
||
|
.para
|
||
|
Sequence data read in from automated fluorescent sequencing machine
|
||
|
trace files processed through the program ted
|
||
|
will have the discarded sequence (vector at start and poor read at
|
||
|
end) available to the contig editor. To display the cutoff
|
||
|
information, press the "Display Cutoff" button at the top of the
|
||
|
editor window.
|
||
|
The cutoff sequence appears in grey. This sequence can be incorporated
|
||
|
into the editable sequence, by moving the cutoff position. This is
|
||
|
done by positioning the cursor at the end of the gel sequence, and
|
||
|
using Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of cutoff.
|
||
|
The Meta key is a diamond on the Sun keyboard.
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Pop-up menu
|
||
|
.left margin2
|
||
|
.para
|
||
|
A pop-up menu is revealed by depressing the "Control" key on the keyboard
|
||
|
and at the same time pressing the left mouse button. The menu has the following
|
||
|
functions:
|
||
|
.lit
|
||
|
|
||
|
Search
|
||
|
Save Contig
|
||
|
Create Tag
|
||
|
Edit Tag
|
||
|
Delete Tag
|
||
|
|
||
|
.end lit
|
||
|
"Save Contig" is described above.
|
||
|
Searching and operations on tags are described below.
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Searching
|
||
|
.left margin2
|
||
|
.para
|
||
|
Selecting "Search" brings up a
|
||
|
window which can remain present during normal editor operation. The
|
||
|
window allows the user to select the direction of search, the type of
|
||
|
search and a value to search on. The value is entered into the value
|
||
|
text window. Then pressing the "search" button
|
||
|
performs the search. If successful, the cursor is positioned and
|
||
|
centred accordingly. An audible tone indicates failure. Pressing the
|
||
|
"ok" button removes the search window. The search window is
|
||
|
automatically removed when the contig editor is exited.
|
||
|
.sk1
|
||
|
There are seven different search modes:
|
||
|
.sk1
|
||
|
1. Search by position
|
||
|
.sk1
|
||
|
This positions the cursor at the numeric position specified in the
|
||
|
value text window. Eg a value of "1234" causes the cursor to be placed
|
||
|
at base number 1234 in the contig. Positioning within a gel reading is
|
||
|
achieved by prefixing the number with the "@" character, eg "@123"
|
||
|
positions the cursor at base 123 of the sequence in which the cursor
|
||
|
lies. Relative positions can be specified by prefixing the number with
|
||
|
a plus or minus character. Eg "+1234" will advance the cursor 1234
|
||
|
bases. If possible, the cursor is positioned within the same sequence.
|
||
|
The direction buttons have no effect on the operation of "search
|
||
|
by position".
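.sk1
The three forms of value can be summarised by a short Python sketch. The
function resolve_position and its arguments are hypothetical: cursor is the
current contig position and reading_start is the contig position of the
first base of the reading in which the cursor lies, both counted from 1.
.lit
```python
def resolve_position(value, cursor, reading_start):
    """Turn a "search by position" value into a contig position."""
    value = value.strip()
    if value.startswith("@"):
        # "@123": base 123 of the reading under the cursor.
        return reading_start + int(value[1:]) - 1
    if value.startswith(("+", "-")):
        # "+1234" or "-1234": relative to the current cursor position.
        return cursor + int(value)
    # "1234": absolute position in the contig.
    return int(value)


print(resolve_position("@123", cursor=150, reading_start=100))  # prints 222
```
.end lit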
|
||
|
.sk1
|
||
|
2. Search by reading name
|
||
|
.sk1
|
||
|
This positions the cursor at the left end of the gel reading specified
|
||
|
in the value text window. If the value is prefixed with a slash it is
|
||
|
assumed to be a gel reading name. Otherwise it is assumed to be a gel
|
||
|
reading number. Eg "123" positions the cursor at the left end of gel
|
||
|
reading number 123. "/a16a12.s1" positions at the start of reading
|
||
|
a16a12.s1. If the value was "/a16" the cursor is positioned at the
|
||
|
first reading which starts with "a16". The direction buttons have no
|
||
|
effect on the operation of "search by reading name".
|
||
|
.sk1
|
||
|
3. Search by tag type.
|
||
|
.sk1
|
||
|
This positions the cursor at the start of the next tag which has the
|
||
|
same type as specified by the type value menu. To change the type,
|
||
|
select from the menu that pops up when the mouse is clicked on the
|
||
|
button labeled "Type:". The search can be performed either forwards
|
||
|
or backwards from the current cursor position. To find all tags, use
|
||
|
"search by annotation", with a null text value string.
|
||
|
.sk1
|
||
|
4. Search by annotation.
|
||
|
.sk1
|
||
|
This positions the cursor at the start of the next tag which has a
|
||
|
comment containing the string specified in the value text window. The
|
||
|
search performed is a regular expression search, and certain
|
||
|
characters have special meaning. Be careful when your value string
|
||
|
contains ".", "*", "[", "^" or "$". The search can be performed either
|
||
|
forwards or backwards from the current cursor position.
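.sk1
The effect of the special characters can be seen in the following Python
sketch. Python's re module is used only as a stand-in, on the assumption
that the editor's regular expressions treat "." in the same way; the
comment strings are invented for the example.
.lit
```python
import re

comments = ["primer 1.2", "primer 1x2", "compression"]


def matching(pattern):
    """Return the comments that contain a match for the pattern."""
    return [text for text in comments if re.search(pattern, text)]


# "." matches any character, so "1.2" also finds "primer 1x2".
print(matching("1.2"))
# Escaping the value restricts the search to a literal full stop.
print(matching(re.escape("1.2")))
# A null value matches every comment.
print(len(matching("")))
```
.end lit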
|
||
|
.sk1
|
||
|
5. Search by sequence.
|
||
|
.sk1
|
||
|
This positions the cursor at the start of the next piece of sequence
|
||
|
that matches the value specified in the text value window. The search
|
||
|
is for an exact match, which means the case of the value string is
|
||
|
important. The search is performed on the gel readings themselves,
|
||
|
rather than the consensus sequence. The search can be performed either
|
||
|
forwards or backwards from the current cursor position.
|
||
|
.sk1
|
||
|
6. Search by problem.
|
||
|
.sk1
|
||
|
This positions the cursor at the next place in the consensus sequence
|
||
|
which is not an "A", "C", "G" or "T". The search can be performed
|
||
|
either forwards or backwards from the current cursor position.
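.sk1
A minimal Python sketch of this search is given below. The function
next_problem is hypothetical; it assumes positions are counted from 1 and
that the search starts at the position next to the cursor.
.lit
```python
def next_problem(consensus, cursor, forwards=True):
    """Return the next position that is not A, C, G or T, or None."""
    step = 1 if forwards else -1
    position = cursor + step
    while 1 <= position <= len(consensus):
        if consensus[position - 1] not in "ACGT":
            return position
        position += step
    return None  # no problem found in this direction


print(next_problem("GATTAG-AG*C", cursor=1))  # prints 7
```
.end lit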
|
||
|
.sk1
|
||
|
7. Search by quality
|
||
|
.sk1
|
||
|
This positions the cursor at the next place in the consensus sequence
|
||
|
where the consensus calculations for the two strands disagree. When only
|
||
|
sequence on one strand is present, the search will stop at every
|
||
|
base. The search can be performed either forwards or backwards from the
|
||
|
current cursor position.
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
Annotation
|
||
|
.left margin2
|
||
|
.para
|
||
|
Parts of a sequence can be annotated, to record the positions of primers used
|
||
|
for walking, or to mark sites, such as compressions that have caused problems
|
||
|
during sequencing.
|
||
|
The consensus sequence CANNOT be annotated.
|
||
|
.para
|
||
|
To annotate a piece of sequence first select the part of sequence
|
||
|
using the mouse buttons. Use the left mouse button to position the start of the
|
||
|
selection, and while this button is being held down, move the mouse to extend.
|
||
|
The selection can be extended further using the right mouse button.
|
||
|
.para
|
||
|
To create annotation, invoke the pop-up menu, and select the "Create Tag"
|
||
|
function. A small "tag editor" will appear which
|
||
|
allows you to select the type of the
|
||
|
annotation from a pull-down menu, and specify a comment if desired.
|
||
|
To select a new type pull down the Type menu, and select the entry desired.
|
||
|
To enter a comment, simply type into the text window in the tag editor.
|
||
|
The annotation is created when the "Leave" button on the tag editor is pressed,
|
||
|
and is displayed in the colour defined in the tag database file (TAGDB).
|
||
|
.para
|
||
|
To edit existing annotation,
|
||
|
position the cursor with the left mouse button
|
||
|
on the tag, and select the
|
||
|
"Edit Tag"
|
||
|
function from the pop-up menu.
|
||
|
This invokes the tag editor, and changes to the type and comment of the
|
||
|
annotation can be made. The tag is updated when the "Leave" button is pressed.
|
||
|
.para
|
||
|
To delete an existing annotation,
|
||
|
position the cursor with the left mouse button
|
||
|
on the tag, and select the
|
||
|
"Delete Tag"
|
||
|
function from the pop-up menu.
|
||
|
.left margin1
|
||
|
.sk3
|
||
|
NOTE:
|
||
|
.left margin2
|
||
|
.para
|
||
|
As the Contig Editor is a very powerful tool, it is possible that the alignment
|
||
|
of the gel reading sequences has unexpectedly been disrupted.
|
||
|
This can easily happen to parts of the contig that lie to the right
|
||
|
of the screen if excessive use has been made of the "Super Edit" facility.
|
||
|
Until familiar with "Super Edit" it would benefit the sequencer to quickly
|
||
|
scan through the contig after editing to check that bad alignments have not
|
||
|
been created.
|
||
|
.left margin1
|
||
|
@9. T 3 @Screen edit
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
THIS OPTION IS NO LONGER AVAILABLE IN XDAP. USE EDIT CONTIG
|
||
|
.para
|
||
|
Gives access to the system editor on the machine (for example EDT on a VAX)
|
||
|
and allows users to edit contigs. The contigs are presented as for
|
||
|
"display contig" and the program will
|
||
|
reconstitute the contig's sequences and relationships when the editor is
|
||
|
exited.
|
||
|
.para
|
||
|
To screen edit a contig set the line length to 50 characters,
|
||
|
select the contig to edit, and supply the name of a temporary file in which
|
||
|
the editing will be performed.
|
||
|
After a short pause the system
|
||
|
editor will present the first page of the file. Edit the file obeying the
|
||
|
rules given below. Exit from the editor and affirm the intention of
|
||
|
returning the contig to the database. The program will put the contig
|
||
|
back into the database.
|
||
|
.para
|
||
|
Rules for screen editing
|
||
|
.para
|
||
|
There are some limitations on the changes that can be made to the contigs
|
||
|
when using the screen editor. Users are unlikely to want to break the
|
||
|
rules
|
||
|
in order to achieve changes to contigs, but nevertheless the
|
||
|
constraints need to be defined and they are given below.
|
||
|
.para
|
||
|
Alignments must be maintained during editing.
|
||
|
Whole lines of sequence should not be deleted or added unless the
|
||
|
order
|
||
|
of the gel readings in the contig is preserved.
|
||
|
Each line in the
|
||
|
contig display consists of gel reading numbers, their names and 50
|
||
|
character sections of sequence. Insertions are limited in the following
|
||
|
way.
|
||
|
No line of sequence can be extended rightwards more than 10 characters
|
||
|
beyond the end of a full length line (a full length line is 50 characters
|
||
|
long). Only one character can be added to the left end of full length
|
||
|
lines, but sections of sequence beginning further into a line
|
||
|
can be extended leftwards up to an equivalent position. Do not delete any
|
||
|
non-sequence lines in the file.
|
||
|
.para
|
||
|
Before returning the contig to the database the program checks that the
|
||
|
rules have been obeyed. If an error is found the number of the erroneous
|
||
|
line in the
|
||
|
file is displayed and the contig will not be changed.
|
||
|
.left margin1
|
||
|
@5. TX 1 @Display a contig
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to show the aligned gel readings for any part of a contig. The
|
||
|
|
||
|
number, name and strandedness of each gel reading is shown and the
|
||
|
|
||
|
consensus is written below.
|
||
|
.para
|
||
|
If required identify the contig, and then the start and end points of the
|
||
|
|
||
|
region to display.
|
||
|
.para
|
||
|
The display can be directed to a disk file using "direct output to disk".
|
||
|
|
||
|
These files are required by options: "screen edit" and "highlight
|
||
|
|
||
|
disagreements", and printed copies of them
|
||
|
are very useful for marking corrections prior to
|
||
|
|
||
|
using the editors.
|
||
|
.para
|
||
|
Below is an example showing the left end of a contig from
|
||
|
position 1 to 200. Overlapping this region are gels 6, 3, 5, 17 and 12;
|
||
|
6, 3 and 5
|
||
|
are in reverse orientation to their archives (denoted by a minus sign).
|
||
|
There are a few uncertainty codes and a few padding
|
||
|
characters in the working versions, but the consensus (shown
|
||
|
below
|
||
|
each page width) has a definite assignment for almost every
|
||
|
position.
|
||
|
.lit
|
||
|
|
||
|
10 20 30 40 50
|
||
|
-6 HINW.010 GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
||
|
CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
||
|
|
||
|
60 70 80 90 100
|
||
|
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
|
||
|
-3 HINW.007 GGCACA*GTC
|
||
|
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC
|
||
|
|
||
|
110 120 130 140 150
|
||
|
-6 HINW.010 GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
|
||
|
-3 HINW.007 GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
|
||
|
-5 HINW.009 GGCAGCGACCGTCG
|
||
|
17 HINW.999 AGCGACCGTCG
|
||
|
CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG
|
||
|
|
||
|
160 170 180 190 200
|
||
|
-6 HINW.010 TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
-3 HINW.007 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
-5 HINW.009 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
17 HINW.999 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
|
||
|
12 HINW.017 GTAGAGC*
|
||
|
CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
.END LIT
|
||
|
.left margin1
|
||
|
@6. TX 1 @List a text file
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
This option allows users to list text files on the screen. It can be used
|
||
|
to read a file containing notes, for checking files written to disk etc. The
|
||
|
user is asked to type the name of the file to list.
|
||
|
.left margin1
|
||
|
@8. TX 1 @Calculate a consensus
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Calculates a consensus sequence either for the whole database or
|
||
|
|
||
|
for selected contigs. The consensus is written to a file named by the
|
||
|
user.
|
||
|
.left margin2
|
||
|
Supply a file name, choose between whole database or selected contigs.
|
||
|
.para
|
||
|
Symbols for uncertainty in gel readings
|
||
|
.para
|
||
|
In order to record uncertainties when reading gels the codes shown
|
||
|
|
||
|
below can be used. Use of these codes permits us to extract the
|
||
|
|
||
|
maximum amount of data from each gel and yet record any doubts by
|
||
|
|
||
|
choice of code. The program can deal with all of these codes and any
|
||
|
|
||
|
other characters in a sequence are treated as dash (-) characters.
|
||
|
|
||
|
|
||
|
.lit
|
||
|
|
||
|
SYMBOL          MEANING

   1          PROBABLY  C
   2             "      T
   3             "      A
   4             "      G
   D             "      C    POSSIBLY  CC
   V             "      T       "      TT
   B             "      A       "      AA
   H             "      G       "      GG
   K             "      C       "      C-
   L             "      T       "      T-
   M             "      A       "      A-
   N             "      G       "      G-
   R          A OR G
   Y          C OR T
   5          A OR C
   6          G OR T
   7          A OR T
   8          G OR C
   -          A OR G OR C OR T
   a          A  set by auto edit
   c          C  set by auto edit
   g          G  set by auto edit
   t          T  set by auto edit
   *          padding character placed by auto assembler
 else         = -
|
||
|
|
||
|
.end lit
|
||
|
|
||
|
.LEFT MARGIN2
|
||
|
The DNA consensus algorithm
|
||
|
.para
|
||
|
The "calculate consensus" function, the "display contig" routine and the
|
||
|
|
||
|
"show quality" option use the rules outlined here to calculate a
|
||
|
|
||
|
consensus from aligned gel readings. Note that "display contig"
|
||
|
calculates
|
||
|
a consensus for each page width it displays (it does not use the
|
||
|
|
||
|
consensus sequence file calculated by the consensus function).
|
||
|
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
We have 6 possible symbols in the consensus sequence: A,C,G,T,* and -. The
|
||
|
last symbol is assigned if none of the others makes up a sufficient
|
||
|
proportion of the aligned characters at any position in the contig. The
|
||
|
following calculation is used to decide which symbol to place in the
|
||
|
consensus at each position.
|
||
|
.para
|
||
|
Each uncertainty code contributes a score
|
||
|
to one of A,C,G,T,* and also to the total at each point. Symbols like R
|
||
|
and Y which don't correspond to a single base type contribute only to the
|
||
|
total at each point. The scores are shown below.
|
||
|
.lit
|
||
|
definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1
|
||
|
|
||
|
probable assignments ie 1,2,3,4 = 0.75
|
||
|
|
||
|
other uncertainty codes including R,Y,5,6,7,8,- = 0.1
|
||
|
.end lit
|
||
|
.para
|
||
|
A cutoff score of 51% to 100% is supplied by the user. (When the program
|
||
|
starts this is set to 75%. See "set display parameters").
|
||
|
At each position in the contig we calculate the total score for each of
|
||
|
the 5 symbols
|
||
|
A,C,G,T and * (denote these by Xi, where i=A,C,G,T or *),
|
||
|
and also the sum of these totals
|
||
|
(denote this by S). Then if 100 Xi / S > the cutoff for any i, symbol i is
|
||
|
placed in the consensus; otherwise - is assigned.
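.para
The calculation can be sketched in Python for a single column of aligned
characters. This is only an illustration: the names used are hypothetical
and are not taken from the program source. It assumes that S includes the
0.1 contributions from codes such as R and Y, as stated above, and that any
unrecognised character is treated as a dash.
.lit
```python
# Illustrative sketch only; names are hypothetical, not from the SAP source.

# Codes scoring 1, mapped to the symbol they contribute to.
DEFINITE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "*": "*",
    "a": "A", "c": "C", "g": "G", "t": "T",
    "B": "A", "D": "C", "H": "G", "V": "T",
    "K": "C", "L": "T", "M": "A", "N": "G",
}
# Codes scoring 0.75, mapped to the symbol they contribute to.
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for one column of aligned characters."""
    totals = dict.fromkeys("ACGT*", 0.0)   # the Xi values
    s = 0.0                                # the overall total S
    for ch in column:
        if ch in DEFINITE:
            totals[DEFINITE[ch]] += 1.0
            s += 1.0
        elif ch in PROBABLE:
            totals[PROBABLE[ch]] += 0.75
            s += 0.75
        else:
            # R, Y, 5, 6, 7, 8, - and anything else: total only (assumption)
            s += 0.1
    if s == 0.0:
        return "-"
    for symbol, x in totals.items():
        if 100.0 * x / s > cutoff:
            return symbol
    return "-"
```
.end lit
With the default cutoff of 75% a column holding C and * gives a dash,
because each symbol has only 50% of the total, whereas a column holding 3
and A gives A.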
|
||
|
.para
|
||
|
Notice that S does not equal the number of times the sequence has been
|
||
|
determined, but is the score total, and hence we are less likely to put a -
|
||
|
in the consensus. For the "examine quality" algorithm each strand is
|
||
|
treated separately but the calculation is the same. (It was originally
|
||
|
different).
|
||
|
.para
|
||
|
Format of the consensus sequence (and vector sequences).
|
||
|
.para
|
||
|
A consensus sequence file may contain the consensus for several contigs
|
||
|
|
||
|
and so we identify each of them by preceding them by a 20 character
|
||
|
|
||
|
title. The title is of the form <---LAMBDA.076-----> (where LAMBDA is
the project name and gel reading number 76 is the leftmost gel
reading to contribute to this consensus sequence).
The angle brackets <> and the three digit number preceded by a full stop (.)
are important to some processing programs.
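.para
A processing program might pick the project name and gel reading number
out of such a title as sketched below in Python. The function name and the
pattern are hypothetical; in particular the sketch assumes that project
names contain only letters, digits and underscores.
.lit
```python
import re

# Assumption: project names consist of letters, digits and underscores only.
TITLE_RE = re.compile(r"^<-*([A-Za-z0-9_]+)\.(\d{3})-*>$")


def parse_title(title):
    """Return (project, gel_number) for a 20 character title, else None."""
    if len(title) != 20:
        return None
    match = TITLE_RE.match(title)
    if match is None:
        return None
    return match.group(1), int(match.group(2))
```
.end lit
Applied to the title shown above this returns the project name LAMBDA and
the gel reading number 76.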
|
||
|
.left margin1
|
||
|
@25. TX 1 @Show relationships
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to show the relationships of the gel readings in the database in
|
||
|
|
||
|
three ways -
|
||
|
.LEFT MARGIN2
|
||
|
(a) All contig descriptor lines followed by all gel descriptor
|
||
|
lines.
|
||
|
.LEFT MARGIN2
|
||
|
(b) All contigs one after the other sorted, i.e. for each
|
||
|
contig show its contig descriptor line followed by all its
|
||
|
gel descriptor lines sorted on position from left to right
|
||
|
.LEFT MARGIN2
|
||
|
(c) Selected contigs: show the contig line and, in order,
|
||
|
those gel readings that cover a user-defined region.
|
||
|
Note that this output can be directed to a disk file by
|
||
|
prior selection of "disk output".
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Below is an example showing a contig from position
|
||
|
1 to 689. The left gel reading is number 6 and has archive
|
||
|
name HINW.010, the
|
||
|
rightmost gel reading is number 2 and has archive name HINW.004.
|
||
|
On each gel descriptor line is shown:
|
||
|
the name of the archive version, the gel number, the position of the
|
||
|
left end of the gel reading relative to the left end of the contig, the
|
||
|
length of the gel
|
||
|
reading (if this is negative it means that the gel reading is in
|
||
|
the opposite orientation to its archive), the number of the gel
|
||
|
reading to
|
||
|
the left and the number of the gel reading to the right.
|
||
|
.lit
|
||
|
|
||
|
|
||
|
CONTIG LINES
 CONTIG LINE   LENGTH        ENDS
                          LEFT  RIGHT
      48         689        6      2
GEL LINES
 NAME       NUMBER  POSITION  LENGTH    NEIGHBOURS
                                       LEFT  RIGHT
 HINW.010      6         1     -279      0      3
 HINW.007      3        91     -265      6      5
 HINW.009      5       137     -299      3     17
 HINW.999     17       140      273      5     12
 HINW.017     12       193      265     17     18
 HINW.031     18       385     -245     12      2
 HINW.004      2       401     -289     18      0
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@21. TX 3 @Enter new gel reading
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
THIS OPTION IS NO LONGER AVAILABLE IN XDAP. USE AUTO ASSEMBLE
|
||
|
.para
|
||
|
Used to enter new gel readings into the
|
||
|
database. The new gel reading must have previously been compared with
|
||
|
the
|
||
|
contents of the database by use of "auto assemble" in order to ascertain
|
||
|
if it overlaps any previously entered data.
|
||
|
.para
|
||
|
The user is expected to know: if
|
||
|
the gel reading overlaps; if so which contig it overlaps; if so where it
|
||
|
overlaps. The program takes the user through a series of questions to
|
||
|
establish the nature of the overlap and then displays the overlap. The
|
||
|
user
|
||
|
is then offered a number of options, including editors for the new gel
|
||
|
reading and the contig, to enable the correct alignment of the gel reading
|
||
|
throughout its whole length.
|
||
|
.left margin2
|
||
|
|
||
|
Supply the name of the gel reading file.
|
||
|
If the gel
|
||
|
reading has been entered before the program will not permit
|
||
|
|
||
|
entry.
|
||
|
The program gives the gel reading a unique number and asks if the
|
||
|
|
||
|
sequence overlaps any data already in the database (reported by "auto
|
||
|
|
||
|
assemble").
|
||
|
|
||
|
If it does not, entry is complete.
|
||
|
If it does overlap the
|
||
|
|
||
|
dialogue
|
||
|
continues with the program asking if the gel reading overlaps "in the
|
||
|
|
||
|
normal sense", if not it will automatically complement the sequence.
|
||
|
|
||
|
Then supply the number of the contig the gel reading overlaps (as
|
||
|
|
||
|
reported by "auto assemble").
|
||
|
.para
|
||
|
Overlaps are divided into two types: those for which the new gel reading
|
||
|
|
||
|
protrudes from the left end of the contig it overlaps, and those for which
|
||
|
|
||
|
it does not. The program asks about this with the question "Left end of
|
||
|
gel
|
||
|
reading is inside contig". If this is true the program will go on to ask for
|
||
|
|
||
|
the position in the contig of the left end of the new gel reading. If it is
|
||
|
not
|
||
|
true the program will ask for the position in the new gel reading of the
|
||
|
|
||
|
left end of the contig.
|
||
|
.para
|
||
|
Once this is completed the program will display the first 50 bases of
|
||
|
|
||
|
the overlap.
|
||
|
The gel readings in the contig and their consensus are displayed with the
|
||
|
|
||
|
new gel reading underneath. The mismatches are shown by *'s on the
|
||
|
next
|
||
|
line down.
|
||
|
For example:
|
||
|
.lit
|
||
|
|
||
|
|
||
|
60 70 80 90 100
|
||
|
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
|
||
|
-3 HINW.007 GGCACA*GTC
|
||
|
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACACGTC
|
||
|
NEWGEL CACAAGCGAGCGAGAGGGGCACCGTGACGTGGTCACGCCGGGGACACGTC
|
||
|
MISMATCH * * *
|
||
|
10 20 30 40 50
|
||
|
|
||
|
.end lit
|
||
|
.para
|
||
|
The program then needs to know if the position of the left end of the
|
||
|
overlap is correct.
|
||
|
|
||
|
If it is, the user should type return; if not, the user should type 1 and
the program will ask for the new position and display it.
|
||
|
|
||
|
.LEFT MARGIN2
|
||
|
The program now offers a number of options to allow the
|
||
|
user to align the new gel reading
|
||
|
correctly over its whole length with
|
||
|
the data already in the contig. It is important that
|
||
|
sufficient edits are made to the new gel reading
|
||
|
or the sequences in the
|
||
|
contig at this stage to get the alignment correct, because once
|
||
|
entry is completed, the alignment is fixed and cannot easily be
|
||
|
changed (see "alter relationships").
|
||
|
Alignment can be achieved
|
||
|
by making
|
||
|
insertions or deletions but deletion of data requires the
|
||
|
original gels to be checked. For this reason at entry we
|
||
|
usually make only insertions to achieve alignment. We use X or
|
||
|
asterisks (*) as padding characters to achieve alignment and
|
||
|
so can, if required,
|
||
|
distinguish padding characters from characters assigned from
|
||
|
reading gels.
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
The options available are:
|
||
|
.lit
|
||
|
? = HELP
|
||
|
! = Give up
|
||
|
3 = Complete entry
|
||
|
4 = Edit contig
|
||
|
5 = Display overlap
|
||
|
6 = Edit new gel reading
|
||
|
|
||
|
.end lit
|
||
|
|
||
|
.sk1
|
||
|
.para
|
||
|
1. HELP gives this information.
|
||
|
.para
|
||
|
2. Give up allows users to change their minds about entering the new gel
|
||
|
reading. The program will ask the user to
|
||
|
confirm this choice.
|
||
|
.para
|
||
|
3. Complete entry is the command to add the new gel reading to the
|
||
|
contig. The
|
||
|
program updates the relationships accordingly. The user is asked to
|
||
|
confirm
|
||
|
this command.
|
||
|
.para
|
||
|
4. Edit contig gives the user access to a simple editor that allows
|
||
|
insertions, deletions and changes to be made to the contig. The editor
|
||
|
maintains alignments by making the same number of insertions or
|
||
|
deletions
|
||
|
in all sequences covering the edit position.
|
||
|
The program
|
||
|
protects the user by allowing edits only within
|
||
|
the region of overlap.
|
||
|
.para
|
||
|
5. Display allows display of the region of overlap only. This
|
||
|
is defined by the relative positions in the contig. The
|
||
|
default is the whole of the region of overlap.
|
||
|
.para
|
||
|
6. Edit new gel reading allows the new gel reading to be edited using a
|
||
|
simple editor.
|
||
|
.left margin1
|
||
|
@23. TX 3 @ Complement a contig
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
This function will complement and reverse all of the gel
|
||
|
readings in a
|
||
|
contig. It automatically reverses and complements each gel
|
||
|
reading sequence, reorders left and right neighbours, recalculates
|
||
|
relative
|
||
|
positions and changes each strandedness.
|
||
|
.PARA
|
||
|
The only user input required is to identify the contig to
|
||
|
complement by the number or name of a gel reading it contains.
|
||
|
DO NOT KILL THE
|
||
|
PROGRAM DURING THIS STEP!
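.para
The bookkeeping involved can be sketched in Python as follows. This is an
illustration only, with hypothetical names, and is not the program's own
code. It assumes that positions are counted from 1, that a negative length
marks a reading held in the opposite orientation to its archive, and that
neighbours are found by sorting on the new positions. The sequence function
complements only A, C, G and T (and their lowercase forms) and passes every
other symbol through unchanged, which is a simplification.
.lit
```python
# Illustrative sketch only; names are hypothetical, not from the SAP source.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Reverse and complement a sequence (A, C, G, T only; others kept)."""
    return seq.translate(COMPLEMENT)[::-1]


def complement_contig(contig_length, gels):
    """gels: list of (name, number, position, length) tuples.

    Returns (name, number, position, length, left, right) tuples for the
    complemented contig, ordered from left to right.
    """
    flipped = []
    for name, number, position, length in gels:
        right_end = position + abs(length) - 1
        # The old right end becomes the new left end; the strand flips.
        flipped.append((name, number, contig_length - right_end + 1, -length))
    flipped.sort(key=lambda gel: gel[2])
    numbers = [gel[1] for gel in flipped]
    result = []
    for i, (name, number, position, length) in enumerate(flipped):
        left = numbers[i - 1] if i > 0 else 0
        right = numbers[i + 1] if i + 1 < len(numbers) else 0
        result.append((name, number, position, length, left, right))
    return result


# The contig shown under "show relationships" (length 689).
EXAMPLE_GELS = [
    ("HINW.010", 6, 1, -279), ("HINW.007", 3, 91, -265),
    ("HINW.009", 5, 137, -299), ("HINW.999", 17, 140, 273),
    ("HINW.017", 12, 193, 265), ("HINW.031", 18, 385, -245),
    ("HINW.004", 2, 401, -289),
]
```
.end lit
Note that the new order need not be the exact reverse of the old one,
because readings are ordered by their left ends and the old right ends
become the new left ends.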
|
||
|
.left margin1
|
||
|
@22. TX 3 @ Join contigs
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
This function joins contigs interactively using a mouse driven editor.
|
||
|
The operation of this editor is very similar to the Contig Editor
|
||
|
described in "@4 Edit".
|
||
|
|
||
|
.para
|
||
|
It allows the
|
||
|
user to align the ends of the two contigs by editing each
|
||
|
contig separately. It is important that the alignment achieved is
|
||
|
correct because once the join is completed the alignment is fixed.
|
||
|
.para
Identify the two contigs to be joined: first the left contig and then the right.
|
||
|
The program checks that the two contig numbers are different (it will not
|
||
|
allow circles to be formed!)
|
||
|
.para
|
||
|
The Join Editor consists of two Contig Editors in between which is sandwiched
|
||
|
a disagreement box. This disagreement box shows exclamation marks to
|
||
|
denote mismatches between the two consensuses.
|
||
|
.para
|
||
|
For example, the display will look something like this:
|
||
|
.lit
|
||
|
|
||
|
1460 1470 1480 1490 1500
|
||
|
56 HINW.100 TCT*GAGCAGTGTGGGCGCTG*CCGG
|
||
|
33 HINW.300 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
|
||
|
-25 HINW.090 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
|
||
|
19 HINW.123 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
|
||
|
CONSENSUS TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
|
||
|
MISMATCH ! !!!!!!
|
||
|
10 20 30 40 50
|
||
|
-6 HINW.010 TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
|
||
|
-3 HINW.007 TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
|
||
|
-5 HINW.009 GCTCGGAGGGCATGAAGT*AGAGC
|
||
|
CONSENSUS TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
|
||
|
|
||
|
.END LIT
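.para
The disagreement line in the example can be reproduced by comparing the two
consensuses character by character, as in the Python sketch below. The
function names are hypothetical. The sketch assumes that any two differing
characters count as a mismatch, including a dash against a base, which is
what the example shows. How the program itself calculates the percentage
mismatch is not stated here, so the percentage function is a guess.
.lit
```python
# Illustrative sketch only; names are hypothetical, not from the SAP source.

def disagreement_line(upper, lower):
    """Return a line with '!' wherever the two consensuses differ."""
    return "".join("!" if a != b else " " for a, b in zip(upper, lower))


def percent_mismatch(upper, lower):
    """Percentage of compared positions that differ (assumed definition)."""
    compared = min(len(upper), len(lower))
    if compared == 0:
        return 0.0
    return 100.0 * disagreement_line(upper, lower).count("!") / compared


UPPER = "TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG"
LOWER = "TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC"
```
.end lit
For the two consensuses shown above this marks position 22 and positions
45 to 50, that is 7 mismatches in 50 characters.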
|
||
|
.para
|
||
|
The best strategy for joining is to
|
||
|
identify the exact position of overlap. This is defined as
|
||
|
the position in the left contig that the leftmost character of the right
|
||
|
contig overlaps.
|
||
|
The overlap must be of at least one character.
|
||
|
Use the scroll bar and the scroll buttons (`<<',`<',`>',and`>>')
|
||
|
to adjust the relative positions of the two contigs.
|
||
|
.para
|
||
|
The join position can be fixed
|
||
|
by pressing the `lock' button at the top of the Join Editor.
|
||
|
Locking allows the two contigs to be scrolled as one when using the scroll bar
|
||
|
and buttons, the left ends always in the same position relative to each
|
||
|
other.
|
||
|
.para
|
||
|
Once locked, it is best to proceed to the right along the contigs, inserting
|
||
|
padding characters (`*') into the consensuses to minimise the
|
||
|
disagreements.
|
||
|
.para
|
||
|
It is essential that the user aligns the two contigs throughout the whole
|
||
|
region of overlap before completing the join because it is only at this
|
||
|
stage that the two contigs can be edited independently. Once the join is
|
||
|
completed the alignment can only be altered using the routines supplied
|
||
|
by "alter relationships".
|
||
|
.para
|
||
|
The join can be completed by pressing the `Leave Editor' button. The
|
||
|
percentage mismatch is displayed, and the user is required to confirm that
|
||
|
they want to perform the join.
|
||
|
.left margin1
|
||
|
@24. TX 1 @ Copy the database
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
Used to make a copy of the database. If required the database size can be
|
||
|
|
||
|
altered using this option. The "version" of a database is encoded as the
|
||
|
|
||
|
last letter in the names of the five files that contain the database.
|
||
|
|
||
|
.para
|
||
|
Supply a "version" number (the default is version 1), and if required
|
||
|
|
||
|
select a new size for the database. The size of a database is the number
|
||
|
of
|
||
|
lines of information it can hold. It needs a line for each gel reading and
|
||
|
|
||
|
another for each contig.
|
||
|
.left margin1
|
||
|
@19. TX 1 @ Check database
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to perform a check on the logical consistency of the
|
||
|
database. No user intervention is required.
|
||
|
.para
|
||
|
The following relationships are checked:
|
||
|
.LEFT MARGIN2
|
||
|
1. If gel reading A thinks gel reading B is its left
|
||
|
neighbour
|
||
|
|
||
|
does B think A is
|
||
|
its right neighbour?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
"Hand holding problem for gel reading A"
|
||
|
.left margin2
|
||
|
followed by the
|
||
|
gel descriptor lines for gel readings A and B.
|
||
|
.LEFT MARGIN2
|
||
|
2. Are there any contig lines with no left or right
|
||
|
end gel readings?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
"Bad contig line number A"
|
||
|
.LEFT MARGIN2
|
||
|
3. Do the gel readings that are described as left ends on
|
||
|
contig
|
||
|
lines agree that they are left ends?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
"The end gel readings of contig A have outward neighbours"
|
||
|
.LEFT MARGIN2
|
||
|
4. Are there gel readings that are in more than one contig?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
" Gel number A is used N times"
|
||
|
.LEFT MARGIN2
|
||
|
5. Are there gel readings that are not in any contig?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
" Gel number A is not used"
|
||
|
.LEFT MARGIN2
|
||
|
6. Do the relative positions of gel readings agree with
|
||
|
their
|
||
|
position as defined by left and right neighbourliness?
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
" Gel number A with position X is left neighbour of gel number B with
|
||
|
position Y"
|
||
|
.LEFT MARGIN2
|
||
|
7. Are there any loops in contigs? If so no further
|
||
|
checking is done.
|
||
|
The error message is
|
||
|
.left margin2
|
||
|
" Loop in contig n no further checking done, but gel reading numbers follow"
|
||
|
.left margin2
|
||
|
The
|
||
|
program then prints the gel reading numbers in the looped
|
||
|
contig up
|
||
|
to
|
||
|
the start of the loop.
|
||
|
.LEFT MARGIN2
|
||
|
8. Are there any contigs of length <1? The error message is
|
||
|
.left margin2
|
||
|
" The contig on line
|
||
|
number x has zero length"
|
||
|
.LEFT MARGIN2
|
||
|
9. Are there any gel readings (used in only one contig) that have zero
|
||
|
|
||
|
length? The error
|
||
|
message is
|
||
|
.left margin2
|
||
|
" Gel number N has zero length"
|
||
|
.left margin2
|
||
|
Note that "auto assemble" also uses this logical consistency check and
|
||
|
will
|
||
|
only tolerate a "Gel number N
|
||
|
is not used" error. Any other error will cause it to
|
||
|
|
||
|
give up.
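.para
Checks 1 and 6 can be sketched in Python as shown below. The data layout
and the function name are hypothetical and the program's own checks may be
organised differently. Each gel reading is assumed to be held as a record
with its position and the numbers of its left and right neighbours, with 0
meaning no neighbour.
.lit
```python
# Illustrative sketch only; names are hypothetical, not from the SAP source.

def check_neighbours(gels):
    """Return error messages for checks 1 and 6.

    gels maps a gel number to a dict with keys "position", "left", "right".
    """
    errors = []
    for number in sorted(gels):
        gel = gels[number]
        left = gel["left"]
        if left == 0:
            continue
        neighbour = gels.get(left)
        if neighbour is None or neighbour["right"] != number:
            # Check 1: the left neighbour does not point back.
            errors.append(
                "Hand holding problem for gel reading {}".format(number))
        elif neighbour["position"] > gel["position"]:
            # Check 6: positions disagree with the neighbour order.
            errors.append(
                "Gel number {} with position {} is left neighbour of "
                "gel number {} with position {}".format(
                    left, neighbour["position"], number, gel["position"]))
    return errors


# The contig shown under "show relationships".
EXAMPLE = {
    6: {"position": 1, "left": 0, "right": 3},
    3: {"position": 91, "left": 6, "right": 5},
    5: {"position": 137, "left": 3, "right": 17},
    17: {"position": 140, "left": 5, "right": 12},
    12: {"position": 193, "left": 17, "right": 18},
    18: {"position": 385, "left": 12, "right": 2},
    2: {"position": 401, "left": 18, "right": 0},
}
```
.end lit
The example contig passes both checks. Changing the right neighbour of gel
reading 3 to 17 would produce a hand holding problem for gel reading 5.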
|
||
|
|
||
|
.left margin1
|
||
|
@29. TX 1 @ Examine quality
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Analyses the quality of the data in a contig. It reports on the proportion
|
||
|
|
||
|
of the consensus that is "well determined" and will display a sequence of
|
||
|
|
||
|
symbols that indicate the quality of the consensus at each position.
|
||
|
|
||
|
.para
|
||
|
Identify the contig to analyse, and the section of interest. The current
|
||
|
|
||
|
consensus calculation cutoff score will be used to decide if each position
|
||
|
is
|
||
|
"well determined". In general the quality of a reading deteriorates along
|
||
|
the length of the gel and so it is also possible to use a length cutoff for
|
||
|
the quality calculation. Only the data from the first section of each reading
|
||
|
will be included in the quality calculation. The length is altered under
|
||
|
"set parameters" and is initially set to the maximum reading length.
|
||
|
A summary showing the percentage of the consensus
|
||
|
that falls into each category of quality is shown. Choose whether or not to
|
||
|
have the quality codes for each position of the consensus displayed.
|
||
|
They can be displayed as either graphics or text.
|
||
|
.para
|
||
|
The quality of the data depends on the number of times it has been
|
||
|
|
||
|
sequenced and the particular uncertainty codes used in each gel
|
||
|
|
||
|
reading. This function divides the data into five categories, assigning
|
||
|
|
||
|
each
|
||
|
a symbol or code:
|
||
|
.LEFT MARGIN2
|
||
|
1. Well determined on both strands and they agree. code=0
|
||
|
.LEFT MARGIN2
|
||
|
2. Well determined on the plus strand only. code=1
|
||
|
.LEFT MARGIN2
|
||
|
3. Well determined on the minus strand only. code=2
|
||
|
.LEFT MARGIN2
|
||
|
4. Not well determined on either strand. code=3
|
||
|
.LEFT MARGIN2
|
||
|
5. Well determined on both strands but they disagree. code=4
|
||
|
.LEFT MARGIN2
|
||
|
A position is "well determined" if it is assigned one of the symbols
A,C,G,T when the algorithm described in the section "calculate a
consensus" is applied.
|
||
|
The calculation is performed
|
||
|
separately for each strand.
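.para
The assignment of the five codes can be sketched in Python as follows. The
function name is hypothetical. The sketch assumes that the consensus symbol
for each strand has already been calculated, and that a strand with no data
at a position is given a dash.
.lit
```python
# Illustrative sketch only; names are hypothetical, not from the SAP source.
WELL_DETERMINED = {"A", "C", "G", "T"}


def quality_code(plus_symbol, minus_symbol):
    """Return the quality code (0 to 4) for one position of the contig."""
    plus_ok = plus_symbol in WELL_DETERMINED
    minus_ok = minus_symbol in WELL_DETERMINED
    if plus_ok and minus_ok:
        # Both strands well determined: agree (0) or disagree (4).
        return 0 if plus_symbol == minus_symbol else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3
```
.end lit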
|
||
|
.para
|
||
|
If the user chooses to have the data displayed graphically the following
|
||
|
scheme is used. A rectangular box is drawn so that the x coordinate
|
||
|
represents the length of the contig. The box is notionally
|
||
|
divided vertically into
|
||
|
5 possible levels which are given the y values: -2,-1,0,1,2.
|
||
|
The quality codes attributed to each base position are plotted as
|
||
|
rectangles.
|
||
|
Each rectangle represents a region in
|
||
|
which the quality codes are identical, so a single base having a different
|
||
|
code from its immediate neighbours will appear as a very narrow rectangle.
|
||
|
.lit
|
||
|
|
||
|
Rectangle bottom and top y values
|
||
|
|
||
|
Quality 0 rectangle from 0 to 0
|
||
|
Quality 1 rectangle from 0 to 1
|
||
|
Quality 2 rectangle from 0 to -1
|
||
|
Quality 3 rectangle from -1 to 1
|
||
|
Quality 4 rectangle from -2 to 2
|
||
|
.end lit
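.para
Grouping the codes into rectangles amounts to collecting runs of identical
codes, as in the Python sketch below. The names are hypothetical, and the
sketch only works out the coordinates; it does no drawing. The bottom and
top values are those given in the table above.
.lit
```python
from itertools import groupby

# Bottom and top y values for each quality code, from the table above.
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def quality_rectangles(codes):
    """Turn a string of quality codes into (first, last, bottom, top) tuples.

    first and last are base positions counted from 1.
    """
    rectangles = []
    start = 1
    for code, run in groupby(codes):
        width = len(list(run))
        bottom, top = Y_RANGE[int(code)]
        rectangles.append((start, start + width - 1, bottom, top))
        start += width
    return rectangles
```
.end lit
A single base of quality 3 among bases of quality 1 gives a rectangle that
starts and ends at the same position.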
|
||
|
.para
|
||
|
Obviously a single line at the midheight shows a perfect sequence.
|
||
|
.para
|
||
|
Typical dialogue is shown below.
|
||
|
.lit
|
||
|
|
||
|
41.47% OK on both strands and they agree(0)
|
||
|
55.48% OK on plus strand only(1)
|
||
|
2.08% OK on minus strand only(2)
|
||
|
0.97% Bad on both strands(3)
|
||
|
0.00% OK on both strands but they disagree(4)
|
||
|
? (y/n) (y) Show sequence of codes
|
||
|
|
||
|
10 20 30 40 50
|
||
|
1111111111 1111111111 1111111111 1111111111 1111111111
|
||
|
|
||
|
60 70 80 90 100
|
||
|
1111111111 1111111111 1111111111 3111111111 1111111111
|
||
|
|
||
|
110 120 130 140 150
|
||
|
1111111111 1111131111 1111111111 1111111111 1111111111
|
||
|
|
||
|
160 170 180 190 200
|
||
|
1111111111 1111111111 1111111111 1111111111 1111111133
|
||
|
|
||
|
210 220 230 240 250
|
||
|
1311111111 1111111111 1111111110 0000000000 0000220000
|
||
|
|
||
|
260 270 280 290 300
|
||
|
0000000000 0020000000 2200000202 0002000000 0000222200
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@26. TX 3 @ Alter relationships
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to make what are normally illegal changes to the database. That is
|
||
|
|
||
|
the normal checks are not done and any item in the database can be
|
||
|
changed independently of all others. Users need to know what they are
|
||
|
|
||
|
doing because it is very easy to make a horrible mess. Always start by
|
||
|
|
||
|
making a copy!
|
||
|
.para
|
||
|
By using the options here users can edit individual gel readings in contigs,
|
||
|
move one section of a contig relative to another, break contigs, remove
|
||
|
contigs, remove gel readings, etc. To give flexibility most
|
||
|
of the commands do only one thing. This means that several commands
|
||
|
may
|
||
|
have to be executed to complete any change. At the end of this help
|
||
|
section
|
||
|
there are notes on removing gel readings from the database.
|
||
|
.para
|
||
|
The following options are offered:
|
||
|
.lit
|
||
|
|
||
|
Cancel
|
||
|
Line change
|
||
|
Edit single gel reading
|
||
|
Delete contig
|
||
|
Shift
|
||
|
Move gel reading
|
||
|
Rename gel reading
|
||
|
Break a contig
|
||
|
Alter raw data parameters
|
||
|
|
||
|
.end lit
|
||
|
.left margin2
|
||
|
1. QUIT returns to the main options of SAP.
|
||
|
.left margin2
|
||
|
|
||
|
2. Line change
|
||
|
.left margin2
|
||
|
allows the user to change the contents of any line in the
|
||
|
|
||
|
file of relationships. The line is selected by number, the
|
||
|
|
||
|
program prints the current line and prompts for the new line.
|
||
|
|
||
|
.left margin2
|
||
|
3. Edit
|
||
|
.left margin2
|
||
|
allows the user to edit an individual gel reading
|
||
|
independently of any others it may be related to. The edit
|
||
|
positions are relative to
|
||
|
the contig. The effect of this editing on the length of the
|
||
|
gel reading is taken care of but, if it changes the length of
|
||
|
a contig,
|
||
|
or its relationship to others, this must be accounted for (if
|
||
|
necessary) by use of the "line change" function.
|
||
|
|
||
|
.left margin2
|
||
|
4. Delete contig
|
||
|
.left margin2
|
||
|
is a function that deletes a contig line by moving down
|
||
|
all the contig lines above by one position. It prompts only
|
||
|
for the line to delete. It does not delete any of the gel
|
||
|
readings
|
||
|
or gel reading
|
||
|
lines for the deleted contig but it does reduce the
|
||
|
number of contigs on line IDBSIZ by 1.
|
||
|
|
||
|
.left margin2
|
||
|
5. Shift
|
||
|
.left margin2
|
||
|
allows the user to change all the relative positions of a
|
||
|
set of neighbouring gel
|
||
|
readings by some fixed value, i.e. it will
|
||
|
shift related gel readings
|
||
|
either left or right. It can therefore
|
||
|
be used to change the alignment of the gel
|
||
|
readings in a contig
|
||
|
or as part of the process of breaking a contig into two parts
|
||
|
(see below). It prompts for the number of the first gel
|
||
|
reading to
|
||
|
shift and then for the distance to move them (Note a
|
||
|
negative value will move the gel readings
|
||
|
left and a positive value
|
||
|
right). It then chains rightwards (ie follows right
|
||
|
neighbours) and shifts each gel
|
||
|
reading, in turn, up to the end
|
||
|
of the contig. (This means that only those gel readings
|
||
|
from the first
|
||
|
to shift to the rightmost are moved). It updates the length of
|
||
|
the contig accordingly.
|
||
|
|
||
|
.left margin2
|
||
|
6. Move gel reading
|
||
|
.left margin2
|
||
|
is a function to renumber a gel reading. It moves all the information
|
||
|
about a gel
|
||
|
reading on to another line. The user must specify the
|
||
|
number
|
||
|
of the gel reading
|
||
|
to move and the number of the line to place it. It
|
||
|
takes care of all the relationships. Of course gel
|
||
|
readings must not be
|
||
|
moved to lines occupied by other gel
|
||
|
readings! It can be used as part
|
||
|
of the process of removing a gel
|
||
|
reading from the database (see below).
|
||
|
|
||
|
.left margin2
|
||
|
7. Rename gel reading
|
||
|
.left margin2
|
||
|
is a function that is used to rename the archive names of
|
||
|
gel
|
||
|
readings in the database; it only changes the name in the .ARN
|
||
|
file of the database.
|
||
|
|
||
|
.sk1
|
||
|
.LEFT MARGIN2
|
||
|
8. Break contig
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
Occasionally it is necessary to break a contig into two parts and this can be
|
||
|
achieved using this option. The program needs only the number of a gel
|
||
|
reading. This is the gel reading that will become a left end after the
|
||
|
break. That
|
||
|
is, the break is made between this gel
|
||
|
reading and its left neighbour. A new contig
|
||
|
line is created so ensure that there is sufficient space in the database.
|
||
|
.left margin2
|
||
|
Removing gel readings from contigs
|
||
|
.left margin2
|
||
|
.PARA
|
||
|
Gel
|
||
|
readings can be removed from contigs if they are not essential for holding the
|
||
|
contig together (ie are not the only gel reading covering a particular region).
|
||
|
Suppose the gel reading to remove is gel number
|
||
|
b with left neighbour a and right
|
||
|
neighbour c.
|
||
|
Using "line change" change the right neighbour of a to c, and the left
|
||
|
neighbour of c to a. To tidy things up: suppose there are x gel
|
||
|
readings in the
|
||
|
database; then, using "move gel reading" move gel x to line b; then, using
|
||
|
"line change"
|
||
|
decrease the number of gel
|
||
|
readings in the database (stored in the last line) by 1.
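.para
The effect of these steps on the neighbour information can be sketched in
Python as follows. The data layout and names are hypothetical. The sketch
assumes that gel readings are numbered 1 to x without gaps, that the
reading to remove has both a left and a right neighbour, and it does not
check that the reading is inessential for holding the contig together. The
contig lines are not modelled.
.lit
```python
# Illustrative sketch only; names are hypothetical, not from the SAP source.

def remove_reading(gels, b):
    """Unlink gel b, move the last gel onto line b, return the new count.

    gels maps a gel number to a dict with keys "left" and "right",
    where 0 means no neighbour.
    """
    a, c = gels[b]["left"], gels[b]["right"]
    if a == 0 or c == 0:
        raise ValueError("end readings also need a change to the contig line")
    # "line change": join a and c directly.
    gels[a]["right"] = c
    gels[c]["left"] = a
    # "move gel reading": put the last gel reading (x) on line b.
    x = max(gels)
    if x == b:
        del gels[b]
    else:
        gels[b] = gels.pop(x)
        for gel in gels.values():
            if gel["left"] == x:
                gel["left"] = b
            if gel["right"] == x:
                gel["right"] = b
    # "line change": the number of gel readings falls by 1.
    return len(gels)
```
.end lit
For a contig made of gel readings 1, 2, 3 and 4 in that order, removing
reading 2 leaves the chain 1, 3, 2, where line 2 now holds what was gel
reading 4.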
|
||
|
.sk1
|
||
|
.LEFT MARGIN2
|
||
|
9. Alter raw data parameters
|
||
|
.LEFT MARGIN2
|
||
|
.PARA
|
||
|
Allows the user to edit the individual raw data parameters, such as
|
||
|
the left and right cutoff lengths and the name of the machine readable trace
|
||
|
file.
|
||
|
The user must specify the gel line to modify, and provide new values for
|
||
|
the length of the raw sequence including cutoff lengths, the left cutoff
position, the length of the original working sequence, the machine type,
and the name
|
||
|
of the raw data file, where these values change.
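.para
The rightward chaining performed by the "shift" option described earlier in
this section can be sketched in Python as follows. The data layout and
names are hypothetical. The sketch assumes that the contig length is the
rightmost position covered by any reading; how the program itself updates
the length is not described here.
.lit
```python
# Illustrative sketch only; names are hypothetical, not from the SAP source.

def shift_readings(gels, first, distance):
    """Shift gel `first` and all gels to its right; return those moved.

    gels maps a gel number to a dict with keys "position", "length"
    and "right", where a right neighbour of 0 means the end of the contig.
    """
    moved = []
    number = first
    while number != 0 and number not in moved:   # guard against loops
        gels[number]["position"] += distance
        moved.append(number)
        number = gels[number]["right"]
    return moved


def contig_length(gels):
    """Rightmost position covered by any reading (assumed definition)."""
    return max(g["position"] + abs(g["length"]) - 1 for g in gels.values())


# The contig shown under "show relationships".
EXAMPLE = {
    6: {"position": 1, "length": -279, "right": 3},
    3: {"position": 91, "length": -265, "right": 5},
    5: {"position": 137, "length": -299, "right": 17},
    17: {"position": 140, "length": 273, "right": 12},
    12: {"position": 193, "length": 265, "right": 18},
    18: {"position": 385, "length": -245, "right": 2},
    2: {"position": 401, "length": -289, "right": 0},
}
```
.end lit
Shifting from gel reading 12 by 10 moves readings 12, 18 and 2 only, and
the contig length grows from 689 to 699.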
|
||
|
.left margin1
|
||
|
@27. TX 1 @ Set display parameters
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to redefine the parameters that control the cutoff employed by the
|
||
|
|
||
|
consensus calculation and quality examiner, the maximum length of each
|
||
|
reading to include in the quality calculation, the line length used by
|
||
|
|
||
|
the display function, the text window length used by the graphics
|
||
|
options, and the graphics window length used by the graphics options.
|
||
|
.para
|
||
|
The default cutoff score is 75%. The default line length is 50 characters.
|
||
|
For protein sequences the cutoff is always 100%.
|
||
|
.para
|
||
|
The text window used by the graphics options controls the amount of
|
||
|
sequence listed at the crosshair position. The graphics window controls the
|
||
|
"zoom" function. Both these windows are defined as the number of bases that
|
||
|
should be shown, to both left and right of the crosshair.
|
||
|
.left margin1
|
||
|
@30. TX 3 @ Auto edit a contig
|
||
|
.left margin2
|
||
|
.para
|
||
|
This function automatically changes characters in gel readings to make
|
||
|
|
||
|
them agree with the consensus sequence. If employed as is intended, use
|
||
|
|
||
|
of this function is not a criminal activity but a method that saves a large
|
||
|
|
||
|
amount of work. All characters changed by the auto editor will appear in
|
||
|
|
||
|
the gel readings as lowercase letters. The current consensus calculation
|
||
|
cutoff score is used.
|
||
|
.para
|
||
|
Identify the contig and the section to edit. The program will display a
|
||
|
|
||
|
summary of changes made. Note that it is important to understand both
|
||
|
|
||
|
what the auto editor does and the order in which it does it. Before
|
||
|
|
||
|
employing the auto editor users should note all the corrections that they
|
||
|
require, so that after it has been used the corrections can be checked.
|
||
|
|
||
|
.para
|
||
|
The
|
||
|
general strategy employed when collecting shotgun sequence data is to let
|
||
|
the contigs get fairly deep, to get a printout of a contig,
|
||
|
check problems against the
|
||
|
films, note corrections on the printout, and
|
||
|
make the changes using an interactive editor.
|
||
|
In general the consensus is correct except for places where padding
|
||
|
characters have been used to accommodate a single gel with an extra
|
||
|
character, or where the consensus is dash. The important point for the
|
||
|
auto
|
||
|
editor is that
|
||
|
most edits simply make the
|
||
|
gel readings conform to the consensus, or remove columns of pads.
|
||
|
.para
The new editor does the following.
.para
1) calculates a consensus for the contig (or part of a contig) to be
edited, and then uses this consensus to direct the editing of the contig
in 3 stages.
.para
2) stage 1: find and correct all places where, if the order of two adjacent
characters is swapped, they will both agree with the consensus (given that
they did not match the consensus before). These corrections are termed
"transpositions".
.para
3) stage 2: find and correct all places where there is a definite consensus
but the gel reading has a different character. These corrections are termed
"changes".
.para
4) stage 3: delete all positions in which padding is the consensus. These
corrections are termed "deletions".
.para
All changed characters are shown in lowercase letters so it will be
obvious which characters have been assigned by the program (except for
deletions). The number of each type of correction will be displayed.
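.para
The three stages can be illustrated with a minimal Python sketch. It is not
the program's own code: it edits a single aligned reading against a
precomputed consensus, whereas the program works on every reading in the
region. The function name auto_edit and the choice of "*" as the padding
symbol are assumptions made for illustration; "-" stands for a position with
no definite consensus.
.lit
```python
PAD = "*"   # assumed padding symbol, for illustration only
DASH = "-"  # consensus symbol where there is no definite call


def auto_edit(reading, consensus):
    """Apply the three stages to one reading aligned to its consensus.

    Both strings must have the same length and be in uppercase.
    Returns the edited reading and a count of each correction type.
    """
    chars = list(reading)
    counts = {"transpositions": 0, "changes": 0, "deletions": 0}

    # Stage 1: swap adjacent characters when neither matched before
    # and both match the consensus after the swap.
    i = 0
    while i < len(chars) - 1:
        a, b = chars[i], chars[i + 1]
        ca, cb = consensus[i], consensus[i + 1]
        if a != ca and b != cb and b == ca and a == cb:
            chars[i], chars[i + 1] = b.lower(), a.lower()
            counts["transpositions"] += 1
            i += 2
        else:
            i += 1

    # Stage 2: where the consensus is definite, make the reading conform.
    for i, c in enumerate(chars):
        cons = consensus[i]
        if cons not in (PAD, DASH) and c.upper() != cons:
            chars[i] = cons.lower()
            counts["changes"] += 1

    # Stage 3: delete every position whose consensus is padding.
    kept = [c for c, cons in zip(chars, consensus) if cons != PAD]
    counts["deletions"] = len(chars) - len(kept)
    return "".join(kept), counts


edited, counts = auto_edit("ACTGC*T", "ACGTA*T")
print(edited, counts)
```
.end lit
In this example the swapped pair and the changed base come out in lowercase,
while the deleted pad column leaves no visible trace, which is why the
counts are reported separately.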
.LEFT MARGIN1
@10. TX 2 @Clear graphics
.LEFT MARGIN2
.para
Clears graphics from the screen.
.left margin1
@11. TX 2 @Clear text
.LEFT MARGIN2
.para
Clears text from the screen.
.left margin1
@12. TX 2 @Draw a ruler.
.LEFT MARGIN2
.para
This option allows the user to draw a ruler or scale along the x axis of
the screen to help identify the coordinates of points of interest. The user
can define the position of the first base to be marked (for example if the
active region is 1501 to 8000, the user might wish to mark every 1000th base
starting at either 1501 or 2000 - it depends on whether the user wishes to
treat the active region as an independent unit with its own numbering
starting at its left edge, or as part of the whole sequence). The user can
also define the separation of the ticks on the scale and their height. If
required the labelling routine can be used to add numbers to the ticks.
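.para
The effect of the two choices of first base can be seen in a small Python
sketch. The function name ruler_ticks and its arguments are hypothetical;
it only computes which bases would be marked, not how they are drawn.
.lit
```python
def ruler_ticks(region_start, region_end, first_tick, separation):
    """Return the bases marked by a ruler over the active region."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return [base
            for base in range(first_tick, region_end + 1, separation)
            if base >= region_start]


# Active region 1501 to 8000, a tick every 1000th base.
print(ruler_ticks(1501, 8000, 1501, 1000))  # numbering from the left edge
print(ruler_ticks(1501, 8000, 2000, 1000))  # numbering of the whole sequence
```
.end lit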
.left margin1
@14. TX 2 @Reposition plots
.LEFT MARGIN2
.para
The position of each of the plots is defined relative to a user's drawing
board which has size 1-10,000 in x and 1-10,000 in y.
Plots for each option are drawn in a window defined by x0,y0 and
xlength,ylength, where x0,y0 is the position of the bottom left hand corner
of the window, xlength is the width of the window and ylength the
height of the window.
.lit
---------------------------------------------------------  10,000
1                                                       1
1      --------------------------------------     ^     1
1      1                                    1     1     1
1      1                                    1     1     1
1      1                                    1  ylength  1
1      1                                    1     1     1
1      1                                    1     1     1
1      --------------------------------------     v     1
1 x0,y0^                                                1
1      <---------------xlength-------------->           1
---------------------------------------------------------  1
1                                                  10,000

.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "ANALMARG" when the
program is started. Users can have their own file if required.
As all the plots start at the same position in x and have the same width,
x0 and xlength are the same for all options. Generally users will only want
to change the start level of the window y0 and its height ylength.
This option allows users to change window positions whilst running the
program. The routine prompts first for the number of the option that the
user wishes to reposition; then for the y start and height; then for the
x start and length. Note that changes to the x values affect all options.
If the user types only carriage return for any value it will remain
unchanged. Note that, unlike all the other programs, the boxes used to
contain analytical results (e.g. plot quality) should not be made to overlap
one another, as the function of the crosshair routine depends on which box
the crosshair is in!
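.para
The reason overlapping boxes cause trouble can be sketched in Python. The
Window class, the box names and the function box_under_crosshair are all
hypothetical and only model the geometry described above in drawing board
units; they are not the program's data structures.
.lit
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Window:
    """A plot window in drawing board units (1-10,000 in x and y)."""
    x0: int       # left edge
    y0: int       # bottom edge
    xlength: int  # width
    ylength: int  # height

    def contains(self, x, y):
        return (self.x0 <= x <= self.x0 + self.xlength
                and self.y0 <= y <= self.y0 + self.ylength)

    def overlaps(self, other):
        return (self.x0 < other.x0 + other.xlength
                and other.x0 < self.x0 + self.xlength
                and self.y0 < other.y0 + other.ylength
                and other.y0 < self.y0 + self.ylength)


def box_under_crosshair(windows, x, y):
    """Return the name of the single box holding the point, else None."""
    hits = [name for name, w in windows.items() if w.contains(x, y)]
    return hits[0] if len(hits) == 1 else None


# x0 and xlength are shared; only y0 and ylength differ between options.
windows = {
    "plot all contigs": Window(500, 6000, 9000, 3000),
    "plot quality": Window(500, 1000, 9000, 3000),
}
print(box_under_crosshair(windows, 2000, 7000))
```
.end lit
If two boxes overlapped, a crosshair position inside the shared area would
match both of them and the sketch would return None, which mirrors the
ambiguity the warning above is about.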
.LEFT MARGIN1
@15. TX 2 @Label a diagram
.LEFT MARGIN2
.para
This routine allows users to label any diagrams they have produced. They
are asked to type in a label. When the user types carriage return to finish
typing the label the cross-hair appears on the screen. The user can
position it anywhere on the screen. If the user types R (for right justify)
the label will be written on the diagram with its right end at the
cross-hair position. If the user types L (for left justify) the label will
be written on the diagram with its left end at the cross-hair position.
The cross-hair will then immediately reappear. The user may put the same
label on another part of the diagram as before or, on hitting the space bar,
will be asked whether to type in another label.
.para
Typical dialogue follows.
.lit
? Menu or option number=15
Type label then drive cross hair to left or right end
of label position then hit "L" to write label left
justified or "R" to write label right justified or
the space bar to quit


? Label=delta gene

missing graphics

? Label=

.end lit
.left margin1
@16. TX 2 @Display a map
.LEFT MARGIN2
.para
This draws a map of any sequence features selected by the user.
These features may be protein coding regions (CDS), tRNA genes (TRNA),
promoter positions (PRM), etc. Users may define their own feature table
key names. For example I find it convenient to split CDS lines into CDS1,
CDS2 and CDS3, each of which contains only those sequences that code in the
reading frames 1, 2 or 3. Then I can plot them at different heights on
the screen (suitable heights can be determined by using the cross-hair).
The coordinates must be stored in a file in the format of an EMBL feature
table.
.para
Typical dialogue follows.
.lit
? Menu or option number=16
Display a map using an EMBL feature table file
? map file name=hsegl1.ft
? feature code(e.g. CDS) =CDS
X 1 + strand
  2 - strand
  3 both strands
? 0,1,2,3 =
? level (0-9480) (256) =4000

missing graphics

? feature code(e.g. CDS) =

.end lit
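.para
The split of CDS entries into CDS1, CDS2 and CDS3 can be sketched as
follows. This is an illustration only: the function name split_by_frame is
hypothetical, the input is a plain list of (start, end) pairs rather than a
parsed feature table, and the frame is assumed to be given by the start
position on the + strand, so that a region starting at base 1 is in frame 1.
.lit
```python
def split_by_frame(cds_ranges):
    """Group (start, end) coding regions by reading frame 1, 2 or 3."""
    groups = {"CDS1": [], "CDS2": [], "CDS3": []}
    for start, end in cds_ranges:
        frame = (start - 1) % 3 + 1  # assumed frame convention
        groups[f"CDS{frame}"].append((start, end))
    return groups


regions = [(1, 300), (302, 601), (1000, 1299), (6, 95)]
for key, ranges in split_by_frame(regions).items():
    print(key, ranges)
```
.end lit
Each group could then be written out under its own key name and drawn at its
own level.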
.left margin1
@7. TX 1 @Redirect output
.LEFT MARGIN2
.para
Used to direct output that would normally appear on the screen to a file.
.para
Select redirection of either text or graphics, and
supply the name of the file that the output should be written to.
.para
The results from the next options selected will not appear on the screen
but will be written to the file. When option 7 is selected again
the file will be closed and output will again appear on the screen.
.left margin1
@13. TX 2 @Use crosshair
.left margin2
.para
This option puts a steerable cross on the screen which the user
drives around by using the arrow keys (or mouse). When the crosshair is
visible a number of options are available if the user types one of a
set of special keyboard characters. Any other characters will cause
an exit from the crosshair option. The special keys are:
.lit

I = Identify the nearest gel reading
Z = Zoom in
Q = plot Quality
S = display the aligned Sequences at the crosshair position
N = list the Names and Numbers of the sequences at the crosshair
.end lit
.para
In order for any of these special keys to operate, the crosshair
must be in an appropriate display box, and the precise function of
the keys will also depend on which box the crosshair is in.
.para
If the crosshair is in the "plot all contigs" box, Z will cause a new box
to appear showing all the readings for the nearest contig; Q will give
the same as Z but will also produce an extra box showing the
"quality" plot.
.para
If Z is hit in the "plot single contig" box, the contig will be zoomed
to the current graphics window size. The zoom will be roughly
centred on the crosshair position. Because of this it is possible to
step along a contig by repeatedly zooming with the crosshair near
to one end of the single contig display box. If I is hit the crosshair
must be close to a gel reading line. If Q is hit, the quality plot will
be produced for the region shown in the plot single contig box. In
all cases when the "plot all contigs" box is shown, a vertical line will
bisect the line that represents the relevant contig, at the current
position.
.para
If the crosshair is in the plot quality box only the character "s" will
operate as a special symbol.
.para
The number of bases shown in the N and S options is controlled by
the current graphics text window size, and the size of the zoom
window by the current graphics window size. Both are set by the
parameter setting function of the general menu.
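.para
How a zoom roughly centred on the crosshair lets the user step along a
contig can be shown with a short Python sketch. The function zoom_region is
hypothetical, and clamping the window to the ends of the contig is an
assumption about behaviour at the edges, not something stated above.
.lit
```python
def zoom_region(contig_length, crosshair_base, window_size):
    """Return (first, last) base of a zoom centred on the crosshair."""
    window_size = min(window_size, contig_length)
    start = crosshair_base - window_size // 2
    # Assumed: keep the whole window inside the contig.
    start = max(1, min(start, contig_length - window_size + 1))
    return start, start + window_size - 1


# Zoom in the middle of a 10000 base contig with a 2000 base window,
# then zoom again with the crosshair near the right end of the display.
first, last = zoom_region(10000, 5000, 2000)
print(first, last)
print(zoom_region(10000, last - 100, 2000))
```
.end lit
Each repeat with the crosshair near one end moves the displayed region by
a little under half a window in that direction.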
.left margin1
@33. TX 2 @Plot single contig
.left margin2
.para
This option produces a schematic of a selected region of a single
contig by drawing a horizontal line to represent each of its gel
readings. The lines show the relative positions of each reading and
also their sense. The plot is divided vertically into two sections by
a line that is identified by an asterisk drawn at each end. All lines
that lie above this line represent readings that are in their original
sense, all lines below show readings that are in the
complementary sense to their original. By use of the crosshair
function the plot can be stepped through and examined in more
detail. See help on crosshair.
.left margin1
@34. TX 2 @Plot all contigs
.left margin2
.para
This option produces a schematic of all the contigs in a database. It
does this by drawing a horizontal line to represent each of them.
In order to show the ends of each contig it draws the lines for
contigs at alternate heights: the first at height one, the
second at height two, the third at height one, etc. The order of the
contigs in the display is the same as their order in the database.
By use of the crosshair function the plot can be stepped
through and examined in more detail. See help on crosshair.
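.para
The alternating heights can be illustrated with a minimal Python sketch.
The function layout_contigs is hypothetical, and it assumes the contigs are
laid end to end along the x axis in database order, which is what makes the
alternation necessary for their ends to be visible.
.lit
```python
def layout_contigs(contig_lengths):
    """Return (x_start, x_end, height) for each contig, in order."""
    lines = []
    x = 1
    for index, length in enumerate(contig_lengths):
        height = 1 if index % 2 == 0 else 2  # one, two, one, ...
        lines.append((x, x + length - 1, height))
        x += length  # assumed: next contig starts where this one ends
    return lines


for line in layout_contigs([100, 50, 200]):
    print(line)
```
.end lit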
.left margin1
@31. TX 3 @ Type in gel readings
.left margin2
.para
THIS OPTION IS NO LONGER AVAILABLE IN XDAP.
.para
This option allows gel readings to be typed in at the keyboard. It creates
a separate file for each gel reading and a file of file names for the
batch. The sequences from each batch may be listed when they have all been
entered. Users may choose to employ special keys to identify the 4 bases
A,C,G and T. By default these special keys are N M , . but any other four
characters may be used. If special keys are used the characters are
automatically translated to A C G T before being stored on the disk.
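.para
The translation of special keys can be sketched in a few lines of Python.
The function translate_keys is hypothetical, and the sketch assumes that
the default keys N M , . map to A C G T in the order listed, and that
lowercase typing is accepted; neither detail is confirmed above.
.lit
```python
def translate_keys(typed, keys="NM,."):
    """Translate special keys to A, C, G, T (assumed positional mapping)."""
    if len(keys) != 4:
        raise ValueError("exactly four special keys are required")
    table = str.maketrans(keys, "ACGT")
    return typed.upper().translate(table)


print(translate_keys("nm,.mn"))
print(translate_keys("1234", keys="1234"))
```
.end lit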
.left margin1
@35. TX 1 3 @Find internal joins
.left margin2
.para
The purpose of this function is to use data already in the database to
find possible joins between contigs.
Joins may have been missed due to poor data or may not have been made
due to repeated sequences. Where appropriate, it may be
possible to find potential joins by using the data clipped off readings
prior to their entry into the database.
.left margin2
The database is checked for logical consistency. Supply a minimum initial
match length, a minimum alignment block, the maximum pads per sequence,
the maximum percent mismatch after alignment, and the probe length. Choose
whether clipped data is to be used; if so, define the window size for
finding good data and the number of dashes allowed in the window.
Processing will commence.
Most of these values are used in an identical way in the autoassemble
function. The others are defined below.
.left margin2
The program strategy
.left margin2
Take the first contig and calculate its consensus. If clipped data is being
used examine all readings that are in the complementary orientation, and
sufficiently near to the contig's left end, to see if they have good clipped
sequence which, if present, would protrude from the left end of the contig.
If found add the longest such sequence to the left end of the consensus. Do
the same for the right end by examining readings that are in their
original orientation. If any are found add the longest extension to the
right end of the consensus. Repeat the consensus calculations and extensions
for all contigs hence producing an extended consensus. If clipped data is not
being used simply calculate the consensus for the whole database. Now
look for possible joins by processing the extended consensus in the following
way. Take the last, say 100, bases (termed the "probe length" by the program)
of the rightmost consensus, compare it in both orientations with the
extended consensus of all the other contigs. Display
any sufficiently good alignments. Repeat with the left end of the rightmost
contig. Do the same for the ends of all the extended contigs, always only
comparing with the contigs to their left, so that the same matches do not
appear twice.
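.para
The order of comparisons in this strategy can be sketched in Python. This
is an illustration only: the function candidate_joins is hypothetical, and
an exact substring test stands in for the program's alignment and scoring,
which are not reproduced here. The list holds one extended consensus per
contig, in left to right order.
.lit
```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def candidate_joins(consensuses, probe_length=100):
    """List (probe contig, end, sense, target contig) for each match.

    Probes are taken from both ends of each contig, starting with the
    rightmost, and compared only with contigs to the left of it.
    """
    hits = []
    for i in range(len(consensuses) - 1, 0, -1):
        probes = (("right", consensuses[i][-probe_length:]),
                  ("left", consensuses[i][:probe_length]))
        for end, probe in probes:
            for sense, query in (("+", probe),
                                 ("-", reverse_complement(probe))):
                for j in range(i):  # contigs to the left only
                    # Stand-in for alignment: exact match of the probe.
                    if query in consensuses[j]:
                        hits.append((i, end, sense, j))
    return hits


contigs = ["TTTTTTTTTTACGTACGGAT",
           "GAGAGAGAGAGAGAGAGAGA",
           "ACGGATCCCCCCCCCCCCCC"]
print(candidate_joins(contigs, probe_length=6))
```
.end lit
Because each contig is compared only with those to its left, every pair of
contigs is examined from one side only. That is also why a small contig
lying wholly inside a larger one to its right can be missed.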
.left margin2
Good clipped data is defined by sliding a window of "Window size for good
data scan" bases outwards along the sequence and stopping when "Maximum
number of dashes in scan window" or more dashes appear in the window.
Note that it is advisable to have some sort of cutoff because if we simply
take all the data it might be so full of rubbish that we won't find any good
matches. For the same reason it is worth trying the procedure with different
cutoffs. An initial run using no clipped data is also recommended.
Sufficiently good alignments are defined by criteria equivalent to those
used in autoassemble, however here we only display alignments that pass all
tests.
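.para
The window scan can be illustrated with a minimal Python sketch. The
function good_clipped is hypothetical. It takes the clipped sequence
written in the outward direction, and it assumes that the good data ends
where the first failing window begins; the exact cut point used by the
program is not stated above.
.lit
```python
def good_clipped(seq, window, max_dashes):
    """Return the leading part of seq that counts as good data.

    The window slides outwards one base at a time and the scan stops
    when max_dashes or more dashes appear in the window.
    """
    if window <= 0 or max_dashes <= 0:
        raise ValueError("window and max_dashes must be positive")
    if len(seq) <= window:
        return "" if seq.count("-") >= max_dashes else seq
    dashes = seq[:window].count("-")
    start = 0
    while True:
        if dashes >= max_dashes:
            return seq[:start]  # assumed cut point: start of bad window
        if start + window >= len(seq):
            return seq          # reached the end without failing
        # Slide by one base: add the entering base, drop the leaving one.
        dashes += (seq[start + window] == "-") - (seq[start] == "-")
        start += 1


print(good_clipped("ACGTACGT--A-CG", window=4, max_dashes=2))
```
.end lit
Running the same sequence with different window sizes and dash limits shows
how the cutoff trades extension length against data quality.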
.left margin2
Bugs
.left margin2
If a small contig is wholly contained within a larger one, such that its
ends are further than ("Probe length" - "Minimum initial match length")
from the ends of the larger contig, and the consensus for the small
contig lies to the left of the consensus for the large contig, the overlap
will not be discovered. (See the search strategy).
.left margin2
All numbering is relative to base number one in the contig: matches to the
left (i.e. in the clipped data) have negative positions, matches off the
right end of the contig (i.e. in the clipped data) have positions
greater than the contig length.
The convention for reporting the positions of overlaps is as follows: if
neither contig needs to be complemented the positions are as shown. If the
program says "contig x in the - sense" then the positions shown assume
contig x has been complemented. For example in the results given below the
positions for the first overlap are as reported, but those for the second
assume that the contig in the minus sense (i.e. 443) has been complemented.
.lit


Possible join between contig 445 in the + sense and contig 405
Percentage mismatch after alignment = 4.9
            412        422        432        442        452        462
 405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
      ********* * ********  ***** *** ********** ********** **********
 445 -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA
           -127       -117       -107        -97        -87        -77
            472        482        492        502        512
 405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
     ********** ********** ********** ********** **
 445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
            -67        -57        -47        -37        -27
Possible join between contig 443 in the - sense and contig 423
Percentage mismatch after alignment = 10.4
             64         74         84         94        104        114
 423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG
     **** ***** ********** ********** ******  ** ***** **** *********
 443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
           3610       3620       3630       3640       3650       3660
            124        134        144        154        164
 423 TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA
     *** ****** ********** ** *******  *** ***** ** **
 443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-
           3670       3680       3690       3700       3710


.end lit
.left margin1
@ end of help