@-1. TX 0 @General
|
|
|
|
@-2. T 0 @Screen control
|
|
|
|
@-2. X 0 @Screen
|
|
|
|
@-3. TX 0 @Modification
|
|
|
|
@0. TX -1 @SAP
|
|
|
|
This is help information for the X Windows version of SAP.
|
|
Currently it is being brought up to date with the new features in
|
|
XDAP. The accuracy of this help should therefore not be assumed.
|
|
|
|
This is an interactive program whose primary use is for
|
|
managing shotgun sequencing projects, but it can also be used for
|
|
handling alignments of other sequences, including those of proteins.
|
|
Currently the maximum 'gel reading' length is set to 4096
|
|
characters. Almost all of the information below describes the use of
|
|
the program for shotgun projects, but those using the programs for
|
|
handling other sequence alignments should interpret it accordingly.
|
|
The data for such a project is stored in a special type of database.
|
|
The program contains the tools that are required to type in gel
|
|
readings, screen them against vector sequences and restriction
|
|
sites; enter new gel readings into the database (automatically
|
|
comparing and aligning them). In addition it contains editors and
|
|
functions to examine the quality of the aligned sequences.
|
|
|
|
There are three main menus: "general", "screen" and
|
|
"modification", and some functions have submenus.
|
|
The general menu contains the following options:
|
|
|
|
Open a database
|
|
Display a contig
|
|
List a text file
|
|
Direct output
|
|
Calculate a consensus
|
|
Screen against restriction enzymes
|
|
Screen against vector
|
|
Check database
|
|
Copy database
|
|
Show relationships
|
|
Set parameters
|
|
Highlight disagreements
|
|
Examine quality
|
|
Find internal joins
|
|
|
|
The screen (graphics) menu contains:
|
|
|
|
Clear graphics
|
|
Clear text
|
|
Draw ruler
|
|
Use cross hair
|
|
Change margins
|
|
Label diagram
|
|
Plot map
|
|
Plot single contig
|
|
Plot all contigs
|
|
|
|
|
|
The modification menu contains:
|
|
|
|
Edit contig
|
|
Auto assemble
|
|
Join contigs
|
|
Complement a contig
|
|
Alter relationships
|
|
Extract gel readings
|
|
|
|
|
|
The alter relationships menu contains:
|
|
|
|
Cancel
|
|
Line change
|
|
Edit single gel reading
|
|
Delete contig
|
|
Shift
|
|
Move gel reading
|
|
Rename gel reading
|
|
Break contig
|
|
Alter raw data parameters
|
|
|
|
|
|
|
|
Overview of the methodology
|
|
|
|
The shotgun sequencing strategy
|
|
|
|
In the shotgun sequencing procedure the sequence to be
|
|
determined is randomly broken into fragments of about 400
|
|
nucleotides in length. These fragments are cloned and then selected
|
|
randomly and their sequences determined. The relationship
|
|
between any pair of fragments is not known beforehand but is
|
|
found by comparing their sequences. If the sequence of one
|
|
is found to be wholly or partially contained within that of another
|
|
for sufficient length to distinguish an overlap from a repeat
|
|
then those two fragments can be joined. The process of select,
|
|
sequence and compare is continued until the whole of the DNA to
|
|
be sequenced is in one continuous well determined piece.
|
|
|
|
Definition of a contig
|
|
|
|
A CONTIG is a set of gel readings that are related to
|
|
one another by overlap of their sequences. All gel readings
|
|
belong to a contig and each contig contains at least one gel
|
|
reading. The gel readings in a contig can be summed to produce a
|
|
continuous consensus sequence and the length of this sequence is the
|
|
length of the contig. The rules used to perform this summation are
|
|
given under "the consensus algorithm". At any stage of a
|
|
sequencing project the data will comprise a number of contigs; when
|
|
a project is complete there should be only one contig and its
|
|
consensus will be the finished sequence. Note that since being
|
|
introduced and defined as above the word "contig" has been taken up
|
|
by those involved in genomic mapping. In that context the consensus
|
|
with a precise length is not defined.
|
|
|
|
Introduction to the computer method
|
|
|
|
It is useful to consider the objectives of a sequencing
|
|
project before outlining how we use the computer to help achieve
|
|
them. The aim of a shotgun sequencing project is to produce an
|
|
accurate consensus sequence from many overlapping gel readings. It
|
|
is necessary to know, particularly at the later stages of the
|
|
project, how accurate the consensus sequence is. This enables us to
|
|
know which regions of the sequence require further work and also to
|
|
know when the project is finished. To show the quality of the
|
|
consensus, the programs described here produce displays like that
|
|
shown below.
|
|
|
|
|
|
10 20 30 40 50
|
|
-6 HINW.010 GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
|
CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
|
|
|
60 70 80 90 100
|
|
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
|
|
-3 HINW.007 GGCACA*GTC
|
|
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC
|
|
|
|
110 120 130 140 150
|
|
-6 HINW.010 GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
|
|
-3 HINW.007 GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
|
|
-5 HINW.009 GGCAGCGACCGTCG
|
|
17 HINW.999 AGCGACCGTCG
|
|
CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG
|
|
|
|
160 170 180 190 200
|
|
-6 HINW.010 TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
|
|
-3 HINW.007 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
|
-5 HINW.009 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
|
17 HINW.999 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
|
|
12 HINW.017 GTAGAGC*
|
|
CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
|
|
|
This is an example showing the left end of a contig from
|
|
position 1 to 200. Overlapping this region are gel readings
|
|
numbered 6, 3, 5, 17 and 12; 6, 3 and 5 are in reverse orientation
|
|
to their original reading (denoted by a minus sign). Each gel
|
|
reading also has a name (e.g. HINW.010). It can be seen that in a
|
|
number of places the sequences contain characters other than A,C,G
|
|
and T. Some of these extra characters have been used by the
|
|
sequencer to indicate regions of uncertainty in the initial
|
|
interpretation of the gel reading, but the asterisks (*) have been
|
|
inserted by the automatic assembly function in order to align the
|
|
sequences. Underneath each 50 character block of gel reading
|
|
sequences is the consensus derived from the sequences aligned above
|
|
(the line labelled CONSENSUS). For most of its length the consensus
|
|
has a definite nucleotide assignment but in a few positions there is
|
|
insufficient agreement between the gel readings and so a dash (-)
|
|
appears in the sequence. This display contains all the evidence
|
|
needed to assess the quality of the consensus: the number of times
|
|
the sequence has been determined on each strand of the DNA, and the
|
|
individual nucleotide assignments given for each gel reading.
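
The way a CONSENSUS line relates to the aligned gel readings above it can be sketched in Python. This is a simplified majority-vote illustration only, not the program's own rules (those are given under "the consensus algorithm"); the function names, the (offset, sequence) input layout and the 0.75 cutoff are assumptions made for the example.

```python
from collections import Counter

def column_consensus(column, cutoff=0.75):
    """Return the consensus character for one aligned column.

    `column` holds the character from every gel reading covering the
    position. A dash is returned when no character reaches the cutoff
    (the cutoff value here is an assumption, not the program's default).
    """
    char, votes = Counter(column).most_common(1)[0]
    return char if votes / len(column) >= cutoff else "-"

def consensus(readings, cutoff=0.75):
    """Build a consensus from (offset, aligned_sequence) pairs.

    Offsets are 0-based positions of each reading's left end in the contig.
    """
    length = max(off + len(seq) for off, seq in readings)
    out = []
    for pos in range(length):
        column = [seq[pos - off] for off, seq in readings
                  if off <= pos < off + len(seq)]
        out.append(column_consensus(column, cutoff) if column else "-")
    return "".join(out)
```

With readings that disagree at one position, for example two showing a padding character and one showing C, the sketch writes a dash at that position, much as the display above does.
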
|
|
|
|
So the aim is to produce the consensus sequence and, equally
|
|
important, a display of the experimental results from which it was
|
|
derived.
|
|
|
|
In order to achieve this the following operations need to be
|
|
performed:
|
|
1) Put individual gel readings into the computer. This might
|
|
involve the manual interpretation of autoradiographs or the
|
|
transfer and processing of machine-readable files from fluorescent
|
|
sequencing machines.
|
|
2) Check each gel reading to make sure it is not simply part of one
|
|
of the vectors used to clone the sequence.
|
|
3) Check each gel reading to make sure that those fragments that
|
|
span the ligation point used prior to sonication are not assembled
|
|
as single sequences.
|
|
4) Compare all the remaining gel readings with one another to
|
|
assemble them to produce the consensus sequence.
|
|
5) Check the quality of the consensus and edit the sequences.
|
|
6) When all the consensus is sufficiently well determined, produce a
|
|
copy of it for processing by other analysis programs.
|
|
|
|
It is very unlikely that this procedure will only be passed
|
|
through once. Usually steps 1 to 5 are cycled through repeatedly,
|
|
with step 4 just adding new sequences to those already assembled.
|
|
Generally step 6 is also used in order to analyse imperfect sequence
|
|
to check if it is the one the project intended to sequence, or to
|
|
look for interesting features. Analysis of the consensus, such as
|
|
searches for protein coding regions, can also help to find errors in
|
|
the sequence. The display of the overlapping gel readings shown
|
|
above can be used to indicate, not only the poorly determined
|
|
regions, but also which clones should be resequenced to resolve
|
|
ambiguities, or those which can usefully be extended or sequenced in
|
|
the reverse direction, to cover difficult regions.
|
|
|
|
The original individual gel readings for a sequencing project
|
|
are each stored in separate files. As the gel readings are entered
|
|
into the computer (usually in batches, say 10 from a film), the file
|
|
names they are given are stored in a further file, called a file of
|
|
file names. Files of file names enable gel readings to be processed
|
|
in batches.
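
A minimal sketch of batch processing through files of file names is given here. It assumes that a file of file names is a plain text file with one gel reading file name per line; the function names are hypothetical and are not part of the program.

```python
from pathlib import Path

def read_file_of_file_names(path):
    """Return the gel reading file names listed in `path`.

    Assumes one name per line; blank lines are ignored.
    """
    lines = Path(path).read_text().splitlines()
    return [line.strip() for line in lines if line.strip()]

def write_file_of_file_names(path, names):
    """Write `names` to `path`, one name per line."""
    Path(path).write_text("".join(name + "\n" for name in names))

def filter_batch(in_path, out_path, keep):
    """Pass on only those names for which keep(name) is true.

    The surviving names are written to a new file of file names so the
    next function in the chain can work on the reduced batch.
    """
    passed = [n for n in read_file_of_file_names(in_path) if keep(n)]
    write_file_of_file_names(out_path, passed)
    return passed
```

Each screening step can then be thought of as a call to `filter_batch` whose output file becomes the input of the next step.
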
|
|
|
|
For each sequencing project we start a project database. This
|
|
database has a structure specifically designed for dealing with
|
|
shotgun sequence data. In order to arrive at the final consensus
|
|
sequence many operations will be performed on the sequence data.
|
|
Individual fragments must be sequenced and compared in both senses
|
|
(i.e. both orientations) with all the other sequences. When an
|
|
overlap between a new gel reading and a contig is found they must
|
|
be aligned and the new gel reading added to the contig. If a new gel
|
|
reading overlaps two contigs they must be aligned and joined. Before
|
|
the two contigs are joined one of them may need to be turned around
|
|
(reversed and complemented) so they are both in the same
|
|
orientation.
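
Turning a sequence around can be sketched as follows. This is an illustration rather than the program's own code: padding characters and uncertainty codes are simply left unchanged, which is a simplification, and both function names are hypothetical.

```python
# Only the four definite bases are complemented; every other character
# (padding "*", dashes, uncertainty codes) is left as it is. That is a
# simplifying assumption of this sketch.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]

def flipped_left_end(contig_length, left_end, length):
    """New left end (1-based) of a reading after its contig is turned around.

    A reading covering left_end .. left_end + length - 1 covers the mirror
    image of that range once the contig is reversed and complemented.
    """
    right_end = left_end + length - 1
    return contig_length - right_end + 1
```

For a contig of length 200, a reading covering positions 191 to 200 starts at position 1 after the contig is turned around.
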
|
|
|
|
Clearly, keeping track of all these manipulations is quite
|
|
complicated, and to be able to perform the operations quickly
|
|
requires careful choice of data structure and algorithms. For these
|
|
reasons it is not practicable to store the gel readings aligned as
|
|
shown in the display above. Rather, it is more convenient to store
|
|
the sequences unassembled, and to record sufficient information for
|
|
programs to assemble them during processing. The data used to
|
|
assemble the sequences is called relational information.
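
As a rough illustration of how relational information lets a program lay out a contig without storing it aligned, the sketch below keeps one record per gel reading, using the facts listed under "open database", and walks the chain of right-hand neighbours. The class and function names, and the use of 0 to mean "no neighbour", are assumptions of the sketch and not a description of the real file format.

```python
from dataclasses import dataclass

@dataclass
class GelDescriptor:
    number: int           # number given when the reading was entered
    length: int           # length of the working version
    left_end: int         # 1-based position of the left end in the contig
    left_neighbour: int   # number of the next reading to the left (0 = none, assumed)
    right_neighbour: int  # number of the next reading to the right (0 = none, assumed)
    sense: int            # +1 same sense as the archive, -1 complemented

def layout(gels, leftmost):
    """Return (signed number, left end, right end) for each reading,
    starting at the leftmost reading and following right neighbours."""
    rows = []
    current = leftmost
    while current != 0:
        g = gels[current]
        rows.append((g.sense * g.number, g.left_end, g.left_end + g.length - 1))
        current = g.right_neighbour
    return rows
```

Only the positions and links change when readings are added or contigs are joined, so the stored sequences never need to be rewritten in an aligned form.
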
|
|
|
|
The database comprises five files and they are described under
|
|
the section entitled "open database".
|
|
|
|
Before entry into the project database each new gel reading
|
|
must be compared to look for overlaps with all the data already
|
|
contained within the database. This last point is important: all
|
|
searching for overlaps is between individual new gel readings and
|
|
the data already in the database. There is no searching for overlaps
|
|
between sequences within the database; overlaps must be found before
|
|
new gel readings are entered into the database.
|
|
|
|
Below I give an introduction to how the sequences are
|
|
processed by being passed from one function to the next.
|
|
|
|
This program is used to start a database for the project and
|
|
then the following procedure is used.
|
|
|
|
Data in the form of individual gel readings are entered into
|
|
the computer and stored in separate files using either this
|
|
program or the digitizer program. Batches of these gel readings are
|
|
passed to the screening functions in this program to search for
|
|
overlaps with vector sequences ("screen against vector") or for
|
|
matches to restriction enzyme sites that should not be present
|
|
("screen against enzymes"). Each run of these screening functions
|
|
passes on only those gel readings that do not contain unwanted
|
|
sequences. Sequences are passed via files of file names and
|
|
eventually are processed by the automatic assembly function ("auto
|
|
assemble"). This function compares each gel reading with a consensus
|
|
of all the previous gel readings stored in the database. If it
|
|
finds any overlaps it aligns the overlapping sequences by inserting
|
|
padding characters, and then adds the new gel reading to the
|
|
database. Gels that overlap are added to existing contigs and gels
|
|
that do not overlap any data in the database start new contigs. If a
|
|
new gel overlaps two contigs they are joined. Any gel readings that
|
|
appear to overlap but which cannot be aligned sufficiently well are
|
|
not entered and have their names written to a file of failed gel
|
|
reading names.
|
|
|
|
Generally data is entered into the database in batches as just
|
|
described. The program is also used to examine the data in the
|
|
database, to enter gel readings that the automatic assembly function
|
|
cannot align ("auto assemble"), and to make final edits. Edits to
|
|
whole contigs can be made in several ways. A mouse-driven editor
|
|
("edit contig") is used to perform all edits manually.
|
|
Disagreements between gel readings in contigs and their consensus
|
|
sequences can be highlighted by use of the function "highlight
|
|
disagreements".
|
|
|
|
Editing the sequences is obviously an essential part of
|
|
managing a sequencing project. Editing is required when new
|
|
sequences are added, when contigs are joined, and when sequences are
|
|
corrected. A basic part of the strategy used here is that new gel
|
|
readings should be correctly aligned throughout their whole length
|
|
when they are entered into the database, and that when contigs are
|
|
joined they are edited so that they are well aligned in the region
|
|
of overlap. Alignment can be achieved by adding padding characters
|
|
to the sequences, and this is the way "auto assemble" operates when
|
|
adding new sequences to the database.
|
|
|
|
In order to search for overlaps that may have been missed due
|
|
to errors in the gel readings, the function "extract gel readings"
|
|
can be used to take copies of the gel readings at the ends of
|
|
contigs, and write them out as separate files. These can then be
|
|
compared with the database consensus using the "auto assemble"
|
|
function in a mode that forbids entry of data into the database, and
|
|
any gel reading matching two contigs will indicate a join that has
|
|
been missed. The joins can then be made interactively using "join
|
|
contigs". Missed matches can be found at this stage because the
|
|
errors in the sequences may have been corrected by new data.
|
|
|
|
Generally the users need not concern themselves with how the
|
|
relational information is used by the program, but it is necessary
|
|
to know how contigs are identified. Because contigs are constantly
|
|
being changed and reordered the program identifies them by the
|
|
numbers of the gel readings they contain. Whenever users need to
|
|
identify a contig they need only know the number or name of one of
|
|
the gel readings it contains. Whenever the program asks users to
|
|
identify a contig or gel reading they can type its number or its
|
|
archive name. If they type its archive name they must precede the
|
|
name by a slash "/" symbol to denote that it is a name rather than a
|
|
number. E.g. if the archive name is fred.gel with number 99, users
|
|
should type /fred.gel or 99 when asked to identify the contig.
|
|
Generally, when it asks for the gel reading to be identified, the
|
|
program will offer the user a default name, and if the user types
|
|
only return, that contig will be accessed. When a database is opened
|
|
the default contig will be the longest one, but if another is
|
|
accessed, it will subsequently become the current default.
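
The identification rule just described (a number, or an archive name preceded by a slash, or return alone for the default) can be sketched as below. The function name and the dictionary of archive names are hypothetical.

```python
def resolve_identifier(text, archive_names, default):
    """Turn what the user typed into a gel reading number.

    text          -- the user's reply, e.g. "99", "/fred.gel" or "" (return only)
    archive_names -- mapping of archive name to gel reading number (assumed layout)
    default       -- number offered by the program as the current default
    """
    text = text.strip()
    if not text:
        return default                   # return alone accepts the default
    if text.startswith("/"):
        return archive_names[text[1:]]   # a slash marks a name, not a number
    return int(text)
```

An unknown archive name raises `KeyError` in this sketch; how the real program responds to an unknown name is not covered here.
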
|
|
|
|
Further information is located in the following places. The
|
|
database files are described under "open database". The format for
|
|
vector and consensus sequences is given under "calculate a
|
|
consensus", as are the uncertainty codes used in gel readings.
|
|
|
|
Two programs other than this one are relevant to sequencing:
the digitizer program and the trace editor program. Both are
outlined briefly below.
|
|
|
|
The digitizer program is used for the initial input of gel
|
|
readings and for writing a file of file names. The program uses a
|
|
digitizer for data entry. A digitizer is a two-dimensional
surface, such as a light box, arranged so that when a special pen is
pressed onto it, the pen's coordinates are recorded by a computer.
|
|
These coordinates can be interpreted by a program.
|
|
|
|
In order to read an autoradiograph placed on the light box the
|
|
user need only define the bottom of the four sequencing lanes and
|
|
the bases to which they correspond and then use the pen to point
|
|
to each successive band progressing up the gel. The program
|
|
examines the coordinates of each pen position to see in which of the
|
|
four lanes it lies and assigns the corresponding base to be
|
|
stored in the computer. Each time the pen tip is depressed to point
|
|
to a position on the surface of the digitizer the program sounds
|
|
the bell on the terminal to indicate to the user that a point has
|
|
been recorded. As the sequence is read the program displays it on
|
|
the screen.
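
The lane test can be pictured with a small sketch. It assumes straight, vertical lanes, so that only the horizontal pen coordinate matters, and it assigns each point to the lane whose centre is nearest; the real digitizer program may decide differently. All names are hypothetical.

```python
def assign_base(x, lane_centres):
    """Return the base whose lane centre is nearest to pen coordinate x.

    lane_centres maps each base to the x coordinate defined by the user
    at the bottom of the corresponding lane.
    """
    return min(lane_centres, key=lambda base: abs(lane_centres[base] - x))

def read_gel(pen_points, lane_centres):
    """Convert pen positions, entered while progressing up the gel,
    into a sequence. The y coordinate is not used in this sketch."""
    return "".join(assign_base(x, lane_centres) for x, _y in pen_points)
```

For example, with lanes centred at 10, 20, 30 and 40 for T, C, A and G, a pen press at x = 38.5 is stored as G.
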
|
|
|
|
The trace editor program is used for the initial processing of
|
|
data obtained from fluorescent sequencing machines. It allows the
|
|
user to visually select left and right cutoff positions to denote
|
|
the start and end of good data. Users may also edit the sequence at
|
|
this point. Output from ted is a sequence file in Staden format
|
|
with headers that describe to xdap the cutoff information.
|
|
@17. TX 1 @Screen against enzymes
|
|
|
|
Used to compare gel readings against any restriction enzyme
|
|
recognition sequences that may have been used during cloning and
|
|
which should not be present in the data. Works on single gel
|
|
readings or processes batches accessed through files of file names.
|
|
The algorithm looks for exact matches to recognition sequences
|
|
stored in a file.
|
|
|
|
The file containing the recognition sequences must be
|
|
identified. The user must choose between employing a file of file
|
|
names, or typing in the names of individual gel reading files. If a
|
|
file of file names is used the program will also create a new file
|
|
of file names. When the option has finished operating this new file
|
|
will contain the names of all those gel readings that did not match
|
|
any of the recognition sequences. Hence it can be used for further
|
|
processing of the batch. The recognition sequences should be stored
|
|
in a simple text file with one recognition sequence per line.
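
A minimal sketch of this screening step follows. It looks for exact matches only, as described above, on the reading as typed; the function names and the in-memory dictionary of readings are assumptions made to keep the example short.

```python
def load_recognition_sequences(text):
    """Parse the contents of a recognition sequence file:
    one recognition sequence per line, blank lines ignored."""
    return [line.strip().upper() for line in text.splitlines() if line.strip()]

def screen_against_enzymes(readings, sites):
    """Split a batch into readings that pass and readings that match.

    readings -- mapping of gel reading name to sequence
    sites    -- list of recognition sequences
    Returns (names that matched nothing, {name: [matching sites]}).
    """
    passed, hits = [], {}
    for name, seq in readings.items():
        found = [site for site in sites if site in seq.upper()]
        if found:
            hits[name] = found
        else:
            passed.append(name)
    return passed, hits
```

The list of names that pass corresponds to the new file of file names written by the option.
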
|
|
@18. TX 1 @Screen against vector
|
|
|
|
Used to compare gel readings against any vector sequences that
|
|
may have been picked up during cloning. Works on single gel readings
|
|
or processes batches accessed through files of file names. The
|
|
algorithm looks for exact matches of length "minimum match length"
|
|
and displays the overlapping sequences.
|
|
|
|
The file containing the vector sequence must be identified.
|
|
The user must choose between employing a file of file names, or
|
|
typing in the names of individual gel reading files. If a file of
|
|
file names is used the program will also create a new file of file
|
|
names. When the option has finished operating this new file will
|
|
contain the names of all those gel readings that did not match the
|
|
vector sequence. Hence it can be used for further processing of the
|
|
batch. The vector sequence should be stored in a simple text file
|
|
with up to 80 characters of data per line. More than one vector can
|
|
be stored in a single file. If so each should be preceded by a 20
|
|
character title of the form <---m13mp8.001-----> where the < and >
|
|
signs and the number like .001 are obligatory. The number must be
|
|
preceded by a dot (.) and be 3 digits long. The total sequence in
|
|
the file must be < 50,001 characters in length.
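
The vector file layout described above can be checked with a short sketch. It assumes that the 50,000 character limit applies to sequence characters only (titles are not counted) and that a file with no title holds a single vector, stored here under the made-up name "untitled"; the function name is also hypothetical.

```python
import re

# A title is 20 characters: "<", dashes, a name ending in a dot and
# three digits, more dashes, ">".  Example: <---m13mp8.001----->
TITLE = re.compile(r"^<-*(?P<name>[^<>]*?\.\d{3})-*>$")

def parse_vector_file(text, max_total=50000):
    """Return {vector name: sequence} for the contents of a vector file."""
    vectors = {}
    current = None
    for line in text.splitlines():
        line = line.rstrip()
        match = TITLE.match(line)
        if match and len(line) == 20:
            current = match.group("name")
            vectors[current] = ""
        elif line:
            if len(line) > 80:
                raise ValueError("more than 80 characters of data on a line")
            if current is None:
                current = "untitled"   # assumed label for a lone, untitled vector
                vectors[current] = ""
            vectors[current] += line
    if sum(len(seq) for seq in vectors.values()) > max_total:
        raise ValueError("total sequence exceeds the allowed length")
    return vectors
```
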
|
|
@20. TX 3 @Auto assemble
|
|
|
|
Compares gel readings against the current contents of the
|
|
database and produces alignments. In its normal mode of operation
|
|
("entry permitted"), the function will automatically enter the gel
|
|
readings into the database, but if entry is not permitted it will
|
|
only produce alignments. It works on single gel readings or
|
|
processes batches of gel readings accessed through files of file
|
|
names. It is the usual way to enter data into the database.
|
|
|
|
The function will check the database for logical consistency
|
|
and will only proceed if it is OK. Choose if gel readings should be
|
|
entered into the database, or if they should only be compared.
|
|
Choose between using a file of file names or typing file names on
|
|
the keyboard. If so selected, supply the file of file names. Also
|
|
supply a file of file names to contain the names of all the gel
|
|
readings that fail to get entered. Select the entry mode. Normal
|
|
assembly is appropriate for all but special cases, as is "permit
|
|
joins". Uses for the other modes are not documented here. Define a
|
|
minimum initial match length. Define a minimum alignment block (the
|
|
default value is taken in all but exceptional circumstances). Define
|
|
the maximum number of padding characters allowed to be used in each
|
|
gel reading to help achieve alignment, and the same for the number
|
|
allowed in the contig for each gel reading. Finally define the
|
|
maximum percentage mismatch to be allowed for any gel reading to be
|
|
entered into the database. If, for any gel reading, any of these
|
|
last three values is exceeded the gel reading will not be entered
|
|
into the database.
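
The three acceptance limits can be summarised in a sketch like the one below. The default values are those offered in the sample dialogue later in this section; the way the percentage mismatch is computed here (differing positions over aligned length) is an assumption, as are all of the names.

```python
from dataclasses import dataclass

@dataclass
class Limits:
    max_pads_in_gel: int = 8           # padding characters allowed in the gel reading
    max_pads_in_contig: int = 8        # padding characters allowed in the contig
    max_percent_mismatch: float = 8.0  # mismatch allowed after alignment

def percent_mismatch(gel, consensus):
    """Percentage of aligned positions where gel and consensus differ
    (an assumed definition, for illustration only)."""
    if len(gel) != len(consensus):
        raise ValueError("aligned sequences must have equal length")
    differing = sum(1 for a, b in zip(gel, consensus) if a != b)
    return 100.0 * differing / len(gel)

def may_enter(pads_in_gel, pads_in_contig, mismatch, limits=None):
    """True only if none of the three limits is exceeded."""
    limits = limits or Limits()
    return (pads_in_gel <= limits.max_pads_in_gel
            and pads_in_contig <= limits.max_pads_in_contig
            and mismatch <= limits.max_percent_mismatch)
```

A reading needing 1 pad in the gel, none in the contig and showing 1.8 percent mismatch would be accepted under the default limits.
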
|
|
|
|
In operation the function takes a batch of gel readings
|
|
(probably passed on as a file of file names from one of the
|
|
screening routines) and enters them into a database for a sequencing
|
|
project. It takes each gel reading in turn, compares it with the
|
|
current consensus for the database, and then produces an alignment
|
|
for any regions of the consensus it overlaps; if this
|
|
alignment is sufficiently good it then edits both the new gel
|
|
reading and the sequences it overlaps and adds the new gel
|
|
reading to the database. The program then updates the consensus
|
|
accordingly and carries on to the next gel reading.
|
|
|
|
All alignments are displayed and any gel readings that do
|
|
match but that cannot be aligned sufficiently well have their names
|
|
written to a file of failed gel reading names. The function works
|
|
without any user intervention and can process any number of gel
|
|
readings in a single run. Those gel readings that fail can be
|
|
recompared using the same function (to find the current overlap
|
|
position) and the user can enter them into the database manually
|
|
using the "enter new gel reading" option.
|
|
|
|
Typical dialogue and output from the function is shown below.
|
|
(Note that output for gel readings 2 - 9 has been deleted to save
|
|
space).
|
|
Automatic sequence assembler
|
|
Database is logically consistent
|
|
? (y/n) (y) Permit entry
|
|
? (y/n) (y) Use file of file names
|
|
? File of gel reading names=demo.nam
|
|
? File for names of failures=demo.fail
|
|
Select entry mode
|
|
X 1 Perform normal shotgun assembly
|
|
2 Put all sequences in one contig
|
|
3 Put all sequences in new contigs
|
|
? Selection (1-3) (1) =
|
|
? (y/n) (y) Permit joins
|
|
? Minimum initial match (12-4097) (15) =
|
|
? Minimum alignment block (2-5) (3) =
|
|
? Maximum pads per gel (0-25) (8) =
|
|
? Maximum pads per gel in contig (0-25) (8) =
|
|
? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
|
|
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
|
|
Processing 1 in batch
|
|
Gel reading name=HINW.004
|
|
Gel reading length= 283
|
|
Searching for overlaps
|
|
Strand 1
|
|
Strand 2
|
|
No matches found
|
|
Total matches found 1
|
|
Padding in contig= 0 and in gel= 1
|
|
Percentage mismatch after alignment = 1.8
|
|
Best alignment found
|
|
1 11 21 31 41 51
|
|
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
|
|
********** ********** ********** ********** ********** **********
|
|
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
|
|
1 11 21 31 41 51
|
|
61 71 81 91 101 111
|
|
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
|
|
********** ********** ********** ********** ********** ***** ****
|
|
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
|
|
61 71 81 91 101 111
|
|
121 131 141 151 161 171
|
|
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
|
|
********** ********** ********** ********** ********** **********
|
|
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
|
|
121 131 141 151 161 171
|
|
181 191 201 211 221 231
|
|
GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
|
|
********** ********** * ******** ********** ********** **********
|
|
GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
|
|
181 191 201 211 221 231
|
|
241 251 261 271 281
|
|
CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
|
|
**** **** ********** ** * ***** ********** ****
|
|
CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
|
|
241 251 261 271 281
|
|
Batch finished
|
|
9 sequences processed
|
|
0 sequences entered into database
|
|
0 joins made
|
|
|
|
|
|
Note that "auto assemble" cannot align protein sequences.
|
|
@28. TX 1 @Highlight disagreements
|
|
|
|
Used in the later stages of a project to highlight
|
|
disagreements between individual gel readings and their consensus
|
|
sequences. Characters that agree with the consensus are shown as :
|
|
symbols for the plus strand and . for the minus strand. Characters
|
|
that disagree with the consensus are left unchanged and so stand out
|
|
clearly. The results of this analysis are written to a file.
|
|
|
|
Before selecting this option create a file of the display of
|
|
the contig to be "highlighted". The option will ask for the name of
|
|
this file. Select symbols to denote "agreeing" characters on each
|
|
strand; the defaults are : and ., but any others can be used. Supply
|
|
the name of a file in which to put the output.
|
|
|
|
The display file needed as input for this option is created by
|
|
selecting "Redirect output", followed immediately by "display
|
|
contig", and then "Redirect output" again. The cutoff score used in
|
|
the consensus calculation can be set by option "set display
|
|
parameters". Note that for the highlight function there is a limit
|
|
of 50 for the number of gel readings that are aligned at any
|
|
position, i.e. the contig must be less than 51 gel readings deep at
|
|
its thickest point. I hope that those performing shotgun sequencing
|
|
never reach this limit, but those using the program for comparing
|
|
sequence families might.
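
The substitution performed on each line of the display can be sketched as follows. The function name is hypothetical, and the sketch works on one already-aligned reading and the matching stretch of consensus rather than on a display file.

```python
def highlight(reading, consensus, reversed_strand, plus=":", minus="."):
    """Replace characters that agree with the consensus by a strand symbol.

    reading         -- one aligned gel reading line (spaces where it has no data)
    consensus       -- the consensus for the same positions
    reversed_strand -- True if the reading number carries a minus sign
    Characters that disagree are left unchanged so that they stand out.
    """
    symbol = minus if reversed_strand else plus
    out = []
    for r, c in zip(reading, consensus):
        if r == " ":
            out.append(" ")          # position not covered by this reading
        elif r == c:
            out.append(symbol)
        else:
            out.append(r)
    return "".join(out)
```

Note that a reading character opposite a dash in the consensus is always shown, since it cannot agree with an undetermined position.
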
|
|
|
|
Typical output from this function is shown below.
|
|
|
|
210 220 230 240 250
|
|
1 HINW.004 :C::::::::::::::::::::::::::::::::::::::::::AC::::
|
|
7 HINW.018 :*::::::::::::::::::::::::::::::::::::::::::CA::::
|
|
-4 HINW.017 ...............AC....
|
|
G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG
|
|
|
|
260 270 280 290 300
|
|
1 HINW.004 ::::::::::::*:D:::::::::::::::::::
|
|
7 HINW.018 ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
|
|
-4 HINW.017 ..............................................A...
|
|
3 HINW.009 :::::::::::::::V::::::::::::::::::::::::::::*AV:::
|
|
-6 HINW.028 ......................A...
|
|
AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG
|
|
|
|
@32. TX 3 @Extract gel readings
|
|
|
|
Used to make copies of the aligned gel readings in a database,
|
|
to write them into separate files, and to write a corresponding file
|
|
of file names. It operates in two modes: either all gel readings are
|
|
extracted, or only those at the ends of contigs.
|
|
|
|
Choose which mode of operation is required and supply a file
|
|
of file names.
|
|
|
|
The gel readings are given their original names. If used to
|
|
extract the gel readings from the ends of contigs the function is
|
|
useful for checking for missed contig joins: the file of file names
|
|
can be used with the auto assemble function to recompare these gel
|
|
readings, and each should only overlap one contig. Any that overlap
|
|
two contigs will identify possible joins.
|
|
|
|
If the option is used to extract all the gel readings from a
|
|
database, a subsequent run of "auto assemble" can reconstitute a
|
|
database which has been corrupted. This rarely occurs and is
|
|
usually necessitated by a user employing "alter relationships"
|
|
incorrectly without first having made a copy.
|
|
@1. TX 0 @Help
|
|
|
|
Help is available on the following topics :
|
|
@2. TX 0 @Quit
|
|
|
|
This command stops the program and is the only safe way to
|
|
terminate a run of the program that has altered the contents of the
|
|
database in any way.
|
|
@3. TX 1 @Open a database
|
|
|
|
Opens existing databases or allows new ones to be started. The
|
|
function is automatically called into operation when the program is
|
|
started but can also be selected from the general menu.
|
|
|
|
Choose to open an existing database or start a new one, or if
|
|
! is typed when the program is first started, enter the program
|
|
without opening a database. Supply a project database name, and if
|
|
it already exists, the "version". If starting a new database define
|
|
the database size and if it is for DNA or protein sequences. The
|
|
database size is an initial size for the database. It can be
|
|
increased later during the project. It is the sum of the number of
|
|
gel readings plus the number of contigs.
|
|
|
|
Database names can have from one to 12 letters and must not
|
|
include a full stop (.). The database is made from five separate
|
|
files. If the database is called FRED then version 0 of database
|
|
FRED comprises files FRED.AR0, FRED.RL0, FRED.SQ0, FRED.TG0 and
|
|
FRED.CC0. The version is the last symbol in the file names. Only
|
|
this program can read these files. If the "copy database" option is
|
|
used it will ask the user to define a new "version".
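
The naming scheme can be written out as a short sketch. The function name is hypothetical, and the check covers only the length of the name and the absence of a full stop.

```python
EXTENSIONS = ("AR", "RL", "SQ", "TG", "CC")

def database_files(name, version):
    """Return the five file names making up one version of a database.

    name    -- project database name, 1 to 12 characters, no full stop
    version -- single symbol appended to each file name extension
    """
    if not 1 <= len(name) <= 12 or "." in name:
        raise ValueError("name must be 1 to 12 characters with no full stop")
    if len(version) != 1:
        raise ValueError("version must be a single symbol")
    return [f"{name}.{ext}{version}" for ext in EXTENSIONS]
```

Calling `database_files("FRED", "0")` gives the five names listed above for version 0 of database FRED.
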
|
|
|
|
For normal use the maximum gel reading length is set to 512
|
|
characters, but when a database is started the user may choose
|
|
a length of 512, 1024, 1536, ..., 4096. Normally the program is
|
|
used to handle DNA sequences but many of the functions also work on
|
|
protein sequences. The choice of sequence type is made when the
|
|
database is started.
|
|
|
|
The contigs are not stored on the disk as the user sees them
|
|
displayed on the screen. Each gel reading is stored with sufficient
|
|
information about how it overlaps other gel readings so that the
|
|
program can work out how to present them aligned on the screen. We
|
|
refer to this extra data as "the relationships" and it is explained
|
|
below. The database comprises 5 separate files.
|
|
1. a working version of each gel reading. This is the version of
|
|
the gel reading that is in the database and initially it is an
|
|
exact copy of the original sequence (known as the archive) but it is
|
|
edited and manipulated to align it with other gel readings.
|
|
2. the file of relationships. This file contains all of the
|
|
information that is required to assemble the working versions into
|
|
contigs during processing; any manipulations on the data use this
|
|
file and it is automatically updated at any time that the
|
|
relationships are changed. The information in this file is as
|
|
follows:
|
|
(A) Facts about each gel reading and its relationship to
|
|
others ("gel descriptor lines"):
|
|
(a) the number of the gel reading (each gel reading is given a
|
|
number as it is entered into the database)
|
|
(b) the length of the sequence from this gel reading
|
|
(c) the position of the left end of this gel reading relative to
|
|
the left end of the contig of which it is a member
|
|
(d) the number of the next gel reading to the left of this gel
|
|
reading
|
|
(e) the number of the next gel reading to the right
|
|
(f) the relative strandedness of this gel reading, i.e. whether it
|
|
is in the same sense or the complementary sense as its archive.
|
|
(B) Facts about each contig ("contig descriptor lines"):
|
|
(a) the length of this contig
|
|
(b) the number of the leftmost gel reading of this contig
|
|
(c) the number of the rightmost gel reading of this contig.
|
|
(C) General facts:
|
|
(a) the number of gel readings in the database
|
|
(b) the number of contigs in the database.
|
|
3. the file of archive names. This is simply a list of the names
|
|
of each of the archive files in the database but on line number 1000
|
|
we also store the size of the database, i.e. the number of lines of
|
|
information allowed in the database files. This file always has 1000
|
|
lines but the length of the file of relationships and the file of
|
|
working versions can be set by the user when creating a database or
|
|
when copying from one to another.
|
|
4. the file of tags (annotation). This consists of linked lists of
|
|
tag information for each sequence in the database. Tags are
|
|
created by the user as annotation, or by xdap as records of edits or
|
|
for storing cutoff information. As the number of tags can grow
|
|
without limit, so can this file. For each gel there is a header
|
|
record, which contains the record number of the start of the linked
|
|
list for that gel. On line IDBSIZ there is a record containing
|
|
information about the file such as its present length and if there
|
|
are any free "tag" slots to be reused in the file.
5. the file of
|
|
comments (annotation). This consists of linked lists of comment
|
|
fragments. Comments are created by the user as a message attached
|
|
to annotation, or by the system to store cutoff information.
|
|
Comments are character strings of any length. Comments longer than
|
|
40 characters are broken up into fragments, each 40 characters long,
|
|
and are chained together in a linked list. As the number of comments
|
|
can grow without limit, so can this file.
|
|
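The fragmenting of comments described for the comment file can be sketched as follows. The names split_comment and chain are hypothetical, and two details are assumptions rather than facts from this help: a next-record value of 0 marks the end of a chain, and the final fragment is left short instead of being padded to 40 characters.

```python
FRAGMENT_LEN = 40  # fragment size stated in the text


def split_comment(comment):
    """Break a comment into fragments of at most 40 characters."""
    pieces = [comment[i:i + FRAGMENT_LEN]
              for i in range(0, len(comment), FRAGMENT_LEN)]
    return pieces or [""]  # an empty comment still occupies one fragment


def chain(fragments, first_record):
    """Place fragments in consecutive records linked by record number.

    Returns {record: (text, next_record)}; 0 ends the chain (assumed).
    """
    records = {}
    for offset, text in enumerate(fragments):
        record = first_record + offset
        is_last = offset == len(fragments) - 1
        records[record] = (text, 0 if is_last else record + 1)
    return records


fragments = split_comment("x" * 100)
print([len(f) for f in fragments])
print(chain(["first", "second"], 2))
```

In the real file the records of one comment need not be consecutive, since reused records come from the free list.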
|
|
Structure of the database files
|
|
|
|
1. The file of relationships
|
|
|
|
The file contains IDBSIZ lines of data: the general data are
|
|
stored on line IDBSIZ; data about gel readings are stored from
|
|
line 1 downwards; data about contigs are stored from line IDBSIZ-1
|
|
upwards. A database of 500 lines containing 25 gel readings and 4
|
|
contigs would have a file of relationships as is shown below.
|
|
|
|
|
|
---------------------------------------------
|
|
1 Gel descriptor record
|
|
2 " " "
|
|
3 " " "
|
|
4 " " "
|
|
5 " " "
|
|
' ' ' '
|
|
' ' ' '
|
|
25 " " "
|
|
26 Empty record
|
|
' ' '
|
|
|
|
' ' '
|
|
495 ' '
|
|
496 Contig descriptor record
|
|
497 " " "
|
|
498 " " "
|
|
499 " " "
|
|
500 Number of gel readings=25, Number of contigs=4
|
|
---------------------------------------------
|
|
|
|
The arrangement of the data in the file of relationships
|
|
|
|
As each new gel reading is added into the database a new line is
|
|
added to the end of the list of gel descriptor lines. If this
|
|
new gel reading does not overlap with any gel readings already in
|
|
the database a new contig line is added to the top of the list
|
|
of contig lines. If it overlaps with one contig then no new contig
|
|
line need be added but if it overlaps with two contigs then
|
|
these two contigs must be joined and the number of contig lines
|
|
will be reduced by one. Then the list of contig lines is compressed
|
|
to leave the empty line at the top of the list. Initially the two
|
|
types of line will move towards one another but eventually, as
|
|
contigs are joined, the contig descriptor lines will move in the
|
|
same direction as the gel descriptor lines. At the end of a
|
|
project there should be only one contig line. The database is
|
|
thus capable of handling a project of 998 gels.
|
|
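The layout of the file of relationships can be checked with a short sketch. The function names layout and max_gels are hypothetical. The arithmetic follows the description above: gel lines grow from line 1, contig lines occupy the lines just before line IDBSIZ, and line IDBSIZ holds the general data.

```python
def layout(idbsiz, n_gels, n_contigs):
    """Return (gel lines, empty lines, contig lines) as ranges of line numbers."""
    if n_gels + n_contigs > idbsiz - 1:
        raise ValueError("gel and contig lines would collide")
    gel_lines = range(1, n_gels + 1)
    empty_lines = range(n_gels + 1, idbsiz - n_contigs)
    contig_lines = range(idbsiz - n_contigs, idbsiz)
    return gel_lines, empty_lines, contig_lines


def max_gels(idbsiz):
    """Largest finished project: one contig line and one general line remain."""
    return idbsiz - 2


gels, empty, contigs = layout(500, 25, 4)
print(list(contigs))          # contig descriptor lines
print(empty[0], empty[-1])    # first and last empty line
print(max_gels(1000))
```

For the 500 line example this gives contig lines 496 to 499 and empty lines 26 to 495, as in the diagram. The figure of 998 gels corresponds to a database size of 1000 lines.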
|
|
2. Structure of the working versions file
|
|
|
|
The working versions of gel readings are stored in a file
|
|
of IDBSIZ lines each containing 512 characters. Gel reading number
|
|
1 is stored on line 1, gel reading number 2 on line 2 and so on.
|
|
|
|
3. Structure of the archive names file
|
|
|
|
This file, unlike the others, always has 1000 lines each 10
|
|
characters in length. Its length is fixed because line 1000 is used
|
|
to store IDBSIZ the database size and the programs need a definite
|
|
location from which to read this number.
|
|
|
|
4. Structure of the tag file
|
|
|
|
This file initially starts with IDBSIZ lines, and is expanded
|
|
as new tags are created. Information about the length of the file,
|
|
and which tag records are reusable is stored on line IDBSIZ. A
|
|
database of 500 lines would have a file of tags as shown below.
|
|
|
|
---------------------------------------------
|
|
1 Tag descriptor record
|
|
2 " " "
|
|
3 " " "
|
|
4 " " "
|
|
5 " " "
|
|
' ' ' '
|
|
' ' ' '
|
|
497 " " "
|
|
498 " " "
|
|
499 " " "
|
|
500 Length of file=N, Free list=0
|
|
501 Tag record
|
|
502 " "
|
|
503 " "
|
|
' ' '
|
|
' ' '
|
|
N-2 " "
|
|
N-1 " "
|
|
N Tag record
|
|
---------------------------------------------
|
|
|
|
The arrangement of the data in the file of tags
|
|
|
|
As each new tag is added to the database, a check is made in the
|
|
file descriptor record at line IDBSIZ. If the list of reusable
|
|
records is 0, the file is extended by one line. Otherwise the new
|
|
tag is assigned to the record at the head of the free list. When tags
|
|
are deleted, they are added to the free list in the file descriptor
|
|
record.
|
|
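The reuse of tag records can be sketched as a simple free list. This is a model of the behaviour described above, not the program's own code. The class name TagFile is hypothetical, and it is an assumption that each free record stores the number of the next free record.

```python
class TagFile:
    """Minimal model of the tag file's descriptor record and free list."""

    def __init__(self, idbsiz):
        self.length = idbsiz   # the file starts with IDBSIZ lines
        self.free_head = 0     # 0 means there are no reusable records
        self.next_free = {}    # free record -> next free record (assumed)

    def allocate(self):
        """Return the record number to use for a new tag."""
        if self.free_head == 0:
            self.length += 1   # extend the file by one line
            return self.length
        record = self.free_head
        self.free_head = self.next_free.pop(record)
        return record

    def release(self, record):
        """Put a deleted tag's record at the head of the free list."""
        self.next_free[record] = self.free_head
        self.free_head = record


tags = TagFile(500)
first = tags.allocate()      # extends the file
second = tags.allocate()
tags.release(first)          # deleted tag becomes reusable
print(first, second, tags.allocate(), tags.length)
```

The comment file works the same way, except that its descriptor record is on line 1.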
|
|
5. Structure of the comment file
|
|
|
|
This file initially starts with 1 line, and is expanded as new
|
|
annotation is created. Information about the length of the file,
|
|
and which comment records are reusable is stored on the first line.
|
|
|
|
---------------------------------------------
|
|
1 Length of file=N, Free list=0
|
|
2 Comment fragment
|
|
3 " "
|
|
4 " "
|
|
' ' '
|
|
' ' '
|
|
N-2 " "
|
|
N-1 " "
|
|
N Comment fragment
|
|
---------------------------------------------
|
|
|
|
The arrangement of the data in the file of comments
|
|
|
|
As each new comment is added to the database, a check is made in the
|
|
file descriptor record at line 1. If the list of reusable records is
|
|
0, the file is extended to hold the new comment. Otherwise the new
|
|
comment is assigned to records starting with the head of the
|
|
free list. When comments are deleted, the discarded records are
|
|
added to the free list in the file descriptor record.
|
|
|
|
There are various checks within the programs to protect
|
|
users from themselves:-
|
|
1. All user input is checked for errors - e.g. reference to
|
|
non-existent gel readings or contigs, incorrect positions in the
|
|
contig or gel readings.
|
|
2. Before entering a gel reading the system checks to see if a file
|
|
of the same name has already been entered.
|
|
3. Join will not allow the circularising of a contig.
|
|
4. Both enter and join functions restrict the region that
|
|
the user is allowed to edit (using edit contig) to the region of
|
|
overlap.
|
|
5. Users may escape from any point in the program.
|
|
6. Help is available from all points in the program.
|
|
|
|
|
|
IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS DOING
|
|
ANYTHING THAT INVOLVES CHANGING THE CONTENTS OF THE DATABASE, I.E.
|
|
DURING AUTO ASSEMBLE, COMPLETE ENTRY, COMPLETE JOIN, COMPLEMENT
|
|
CONTIG, EDIT CONTIG, AND SCREEN EDIT. This could corrupt the
|
|
database so badly that it is impossible to fix. The program should
|
|
always be left using the QUIT option.
|
|
@4. TX 3 @Edit contig
|
|
|
|
The Contig Editor is a mouse-driven editor that can insert,
|
|
delete and change gel reading sequences.
|
|
|
|
The Contig Editor allows scrolling from one end of a contig to
|
|
the other using the scroll bar and scroll buttons. Action of mouse
|
|
button presses when the mouse pointer is in the scroll bar:
|
|
|
|
Middle Mouse Button Set editor position
|
|
Left Mouse Button Scroll forward one screenful
|
|
Right Mouse Button Scroll backwards one screenful
|
|
|
|
The four scroll buttons operate as follows:
|
|
|
|
"<<" Scroll left half a screenful
|
|
"<" Scroll left one character
|
|
">" Scroll right one character
|
|
">>" Scroll right half a screenful
|
|
|
|
The Editor cursor can be positioned anywhere in the edit
|
|
window by moving the mouse pointer over the character of interest,
|
|
then pressing the left mouse button. The Editor cursor can also be
|
|
moved by using the direction arrow keys.
|
|
|
|
The editor operates in two main edit modes - Replace and
|
|
Insert. Replace allows a character to be replaced by another. Insert
|
|
allows characters to be inserted into a gel reading sequence.
|
|
Characters are entered by typing them from the keyboard. Only valid
|
|
characters are permitted. Characters can be deleted by positioning
|
|
the cursor one character to the right, then pressing the delete key.
|
|
Normally Insert and Delete apply to the consensus line of the contig
|
|
ONLY. This restraint can be overridden by using the "Super Edit"
|
|
mode of operation, THOUGH IT IS NOT RECOMMENDED.
|
|
|
|
Edits can also be performed on the consensus, though they are
|
|
restricted to insertion and deletion of padding characters ("*").
|
|
These edits also have special meanings. A deletion will delete ALL
|
|
characters at the position to the left of the cursor in the contig,
|
|
and move the relative positions of all sequences starting to the
|
|
right of the cursor position left one character. An insertion will
|
|
insert the character typed ("*") into ALL gel reading sequences at
|
|
the cursor's position in the contig, and move the relative positions
|
|
of all sequences starting to the right of the cursor position right
|
|
one character.
|
|
|
|
The effect of the last edit can be undone by pressing the
|
|
"Undo" button at the top of the editor window.
|
|
|
|
The cursor will automatically be positioned at the next
|
|
problem when the "Find Next Problem" button is selected. The next
|
|
problem is where the consensus shows either an ambiguity ("-") or a
|
|
pad ("*") character.
|
|
|
|
The edits to the contig can be saved by pressing the "Leave
|
|
Editor" button and replying "Yes" to the prompt to "Save changes?".
|
|
As no changes are made to the working copy of your database until this
|
|
point it is possible to abort the editor if the edit session ends up
|
|
in an unsatisfactory state (ie if you've stuffed it up!)
|
|
|
|
|
|
|
|
Displaying Traces
|
|
|
|
The original data from which the gel reading sequences were
|
|
derived can be seen by double clicking (two quick clicks) with the
|
|
middle mouse button on the area of interest. The trace will be
|
|
displayed with the point clicked at the centre of the trace
|
|
viewport.
|
|
|
|
All traces that are displayed are maintained in one window,
|
|
called the Trace Manager. The Trace Manager will only display four
|
|
traces maximum. When four traces are already being managed and a new
|
|
one is requested, the one at the top of the Trace Manager is removed
|
|
and the new one is added to the bottom. Traces can be removed
|
|
individually by using the "quit" button in the panel next to the
|
|
trace.
|
|
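The four-trace limit behaves like a small queue. The sketch below models it. The class name TraceManager and its method names are hypothetical, and what happens when an already displayed trace is requested again is not covered here.

```python
from collections import deque

MAX_TRACES = 4  # the Trace Manager displays at most four traces


class TraceManager:
    def __init__(self):
        self.traces = deque()  # index 0 is the top of the window

    def show(self, name):
        """Add a trace at the bottom, removing the top one if full."""
        if len(self.traces) == MAX_TRACES:
            self.traces.popleft()
        self.traces.append(name)

    def quit(self, name):
        """Remove one trace, as the "quit" button next to it does."""
        self.traces.remove(name)


manager = TraceManager()
for reading in ["a", "b", "c", "d", "e"]:
    manager.show(reading)
print(list(manager.traces))
```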
|
|
|
|
|
|
Extending Reads Using Cutoff Information
|
|
|
|
Sequence data read in from Automated Fluorescent sequencing
|
|
machine trace files processed through the program ted will have the
|
|
discarded sequence (vector at start and poor read at end) available
|
|
to the contig editor. To display the cutoff information, press the
|
|
"Display Cutoff" button at the top of the editor window. The cutoff
|
|
sequence appears in grey. This sequence can be incorporated into the
|
|
editable sequence, by moving the cutoff position. This is done by
|
|
positioning the cursor at the end of the gel sequence, and using
|
|
Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of cutoff.
|
|
The Meta key is a diamond on the Sun keyboard.
|
|
|
|
|
|
|
|
Pop-up menu
|
|
|
|
A pop-up menu is revealed by depressing the "Control" key on
|
|
the keyboard and at the same time pressing the left mouse button.
|
|
The menu has the following functions:
|
|
|
|
Search
|
|
Save Contig
|
|
Create Tag
|
|
Edit Tag
|
|
Delete Tag
|
|
|
|
"Save Contig" is described above. Searching and operations on tags
|
|
are described below.
|
|
|
|
|
|
|
|
Searching
|
|
|
|
Selecting "Search" brings up a window which can remain present
|
|
during normal editor operation. The window allows the user to select
|
|
the direction of search, the type of search and a value to search
|
|
on. The value is entered into the value text window. Then pressing
|
|
the "search" button performs the search. If successful, the cursor
|
|
is positioned and centred accordingly. An audible tone indicates
|
|
failure. Pressing the "ok" button removes the search window. The
|
|
search window is automatically removed when the contig editor is
|
|
exited.
|
|
|
|
There are seven different search modes:
|
|
|
|
1. Search by position
|
|
|
|
This positions the cursor at the numeric position specified in the
|
|
value text window. Eg a value of "1234" causes the cursor to be
|
|
placed at base number 1234 in the contig. Positioning within a gel
|
|
reading is achieved by prefixing the number with the "@" character,
|
|
eg "@123" positions the cursor at base 123 of the sequence in which
|
|
the cursor lies. Relative positions can be specified by prefixing
|
|
the number with a plus or minus character. Eg "+1234" will advance
|
|
the cursor 1234 bases. If possible, the cursor is positioned within
|
|
the same sequence. The direction buttons have no effect on the
|
|
operation of "search by position".
|
|
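The three forms of value accepted by "search by position" can be sketched as below. The function name resolve_position and its arguments are hypothetical. It is also an assumption that base 1 of a gel reading sits at the reading's left end position in the contig.

```python
def resolve_position(value, cursor, gel_start):
    """Translate a search value into a contig position.

    cursor    -- current cursor position in the contig
    gel_start -- contig position of the left end of the gel reading
                 in which the cursor lies
    """
    value = value.strip()
    if value.startswith("@"):
        # "@123": base 123 of the current gel reading
        return gel_start + int(value[1:]) - 1
    if value.startswith(("+", "-")):
        # "+1234" or "-10": relative to the cursor
        return cursor + int(value)
    # "1234": absolute position in the contig
    return int(value)


print(resolve_position("1234", cursor=10, gel_start=91))
print(resolve_position("@123", cursor=100, gel_start=91))
print(resolve_position("+1234", cursor=10, gel_start=91))
```

A value that is not a number raises ValueError here; the program itself signals failure with an audible tone.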
|
|
2. Search by reading name
|
|
|
|
This positions the cursor at the left end of the gel reading
|
|
specified in the value text window. If the value is prefixed with a
|
|
slash it is assumed to be a gel reading name. Otherwise it is
|
|
assumed to be a gel reading number. Eg "123" positions the cursor at
|
|
the left end of gel reading number 123. "/a16a12.s1" positions at
|
|
the start of reading a16a12.s1. If the value was "/a16" the cursor
|
|
is positioned at the first reading which starts with "a16". The
|
|
direction buttons have no effect on the operation of "search by
|
|
reading name".
|
|
|
|
3. Search by tag type.
|
|
|
|
This positions the cursor at the start of the next tag which has the
|
|
same type as specified by the type value menu. To change the
|
|
type, select off the menu that pops up when the mouse is clicked on
|
|
the button labeled "Type:". The search can be performed either
|
|
forwards or backwards from the current cursor position. To find all
|
|
tags, use "search by annotation", with a null text value string.
|
|
|
|
4. Search by annotation.
|
|
|
|
This positions the cursor at the start of the next tag which has a
|
|
comment containing the string specified in the value text window.
|
|
The search performed is a regular expression search, and certain
|
|
characters have special meaning. Be careful when your value string
|
|
contains ".", "*", "[", "^" or "$". The search can be performed
|
|
either forwards or backwards from the current cursor position.
|
|
|
|
5. Search by sequence.
|
|
|
|
This positions the cursor at the start of the next piece of sequence
|
|
that matches the value specified in the text value window. The
|
|
search is for an exact match, which means the case of value string
|
|
is important. The search is performed on the gel readings
|
|
themselves, rather than the consensus sequence. The search can be
|
|
performed either forwards or backwards from the current cursor
|
|
position.
|
|
|
|
6. Search by problem.
|
|
|
|
This positions the cursor at the next place in the consensus
|
|
sequence which is not an "A", "C", "G" or "T". The search can be
|
|
performed either forwards or backwards from the current cursor
|
|
position.
|
|
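"Search by problem" amounts to scanning the consensus for the next character that is not A, C, G or T. A minimal sketch follows. The function name find_problem is hypothetical, positions are counted from 1, and it is an assumption that the search starts one character beyond the cursor.

```python
def find_problem(consensus, cursor, forwards=True):
    """Return the next problem position from the cursor, or None."""
    if forwards:
        candidates = range(cursor + 1, len(consensus) + 1)
    else:
        candidates = range(cursor - 1, 0, -1)
    for position in candidates:
        if consensus[position - 1] not in "ACGT":
            return position
    return None  # the program would sound its tone here


consensus = "ACG-T*A"
print(find_problem(consensus, 1))
print(find_problem(consensus, 6, forwards=False))
```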
|
|
7. Search by quality
|
|
|
|
This positions the cursor at the next place in the consensus
|
|
sequence where the consensus calculation for each strand disagrees.
|
|
When only sequences on one strand are present, the search will stop
|
|
at every base. The search can be performed either forwards or
|
|
backwards from the current cursor position.
|
|
|
|
|
|
|
|
Annotation
|
|
|
|
Parts of a sequence can be annotated, to record the positions
|
|
of primers used for walking, or to mark sites, such as compressions
|
|
that have caused problems during sequencing. The consensus sequence
|
|
CANNOT be annotated.
|
|
|
|
To annotate a piece of sequence first select the part of
|
|
sequence using the mouse buttons. Use the left mouse button to
|
|
position the start of the selection, and while this button is being
|
|
held down, move the mouse to extend. The selection can be extended
|
|
further using the right mouse button.
|
|
|
|
To create annotation, invoke the pop-up menu, and select the
|
|
"Create Tag" function. A small "tag editor" will appear which allows
|
|
you to select the type of the annotation from a pull-down menu, and
|
|
specify a comment if desired. To select a new type pull down the
|
|
Type menu, and select the entry desired. To enter a comment, simply
|
|
type into the text window in the tag editor. The annotation is
|
|
created when the "Leave" button on the tag editor is pressed, and is displayed
|
|
in the colour defined in the tag database file (TAGDB).
|
|
|
|
To edit existing annotation, position the cursor with the left
|
|
mouse button on the tag, and select "Edit Tag" from the pop-up
|
|
menu. This invokes the tag editor, and changes to the type and
|
|
comment of the annotation can be made. The tag is updated when the
|
|
"Leave" button is pressed.
|
|
|
|
To delete an existing annotation, position the cursor with the
|
|
left mouse button on the tag, and select "Delete Tag" from the
|
|
pop-up menu.
|
|
|
|
|
|
|
|
NOTE:
|
|
|
|
As the Contig Editor is a very powerful tool, it is possible
|
|
that the alignment of the gel reading sequences has unexpectedly
|
|
been disrupted. This can easily happen to parts of the contig that
|
|
lie to the right of the screen if excessive use has been made of the
|
|
"Super Edit" facility. Until familiar with "Super Edit" it would
|
|
benefit the sequencer to quickly scan through the contig after
|
|
editing to check that bad alignments have not been created.
|
|
@9. T 3 @Screen edit
|
|
|
|
THIS OPTION IS NO LONGER AVAILABLE IN XDAP. USE EDIT CONTIG
|
|
|
|
Gives access to the system editor on the machine (for example
|
|
EDT on a VAX) and allows users to edit contigs. The contigs are
|
|
presented as for "display contig" and the program will reconstitute
|
|
the contig's sequences and relationships when the editor is exited.
|
|
|
|
To screen edit a contig set the line length to 50 characters,
|
|
select the contig to edit, and supply the name of a temporary file
|
|
in which the editing will be performed. After a short pause the
|
|
system editor will present the first page of the file. Edit the file
|
|
obeying the rules given below. Exit from the editor and affirm the
|
|
intention of returning the contig to the database. The program will
|
|
put the contig back into the database.
|
|
|
|
Rules for screen editing
|
|
|
|
There are some limitations on the changes that can be made to
|
|
the contigs when using the screen editor. Users are unlikely to want
|
|
to break the rules in order to achieve changes to contigs, but
|
|
nevertheless the constraints need to be defined and they are given
|
|
below.
|
|
|
|
Alignments must be maintained during editing. Whole lines of
|
|
sequence should not be deleted or added unless the order of the gel
|
|
readings in the contig is preserved. Each line in the contig
|
|
display consists of gel reading numbers, their names and 50
|
|
character sections of sequence. Insertions are limited in the
|
|
following way. No line of sequence can be extended rightwards more
|
|
than 10 characters beyond the end of a full length line (a full
|
|
length line is 50 characters long). Only one character can be added
|
|
to the left end of full length lines, but sections of sequence
|
|
beginning further into a line can be extended leftwards up to an
|
|
equivalent position. Do not delete any non-sequence lines in the
|
|
file.
|
|
|
|
Before returning the contig to the database the program checks
|
|
that the rules have been obeyed. If an error is found the number of
|
|
the erroneous line in the file is displayed and the contig will not
|
|
be changed.
|
|
@5. TX 1 @Display a contig
|
|
|
|
Used to show the aligned gel readings for any part of a
|
|
contig. The number, name and strandedness of each gel reading are
|
|
shown and the consensus is written below.
|
|
|
|
If required identify the contig, and then the start and end
|
|
points of the region to display.
|
|
|
|
The display can be directed to a disk file using "direct
|
|
output to disk". These files are required by options: "screen edit"
|
|
and "highlight disagreements", and printed copies of them are very
|
|
useful for marking corrections prior to using the editors.
|
|
|
|
Below is an example showing the left end of a contig from
|
|
position 1 to 200. Overlapping this region are gels 6, 3, 5, 17 and
|
|
12; 6, 3 and 5 are in reverse orientation to their archives (denoted
|
|
by a minus sign). There are a few uncertainty codes and a few
|
|
padding characters in the working versions, but the consensus
|
|
(shown below each page width) has a definite assignment for almost
|
|
every position.
|
|
|
|
10 20 30 40 50
|
|
-6 HINW.010 GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
|
CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
|
|
|
|
60 70 80 90 100
|
|
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
|
|
-3 HINW.007 GGCACA*GTC
|
|
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC
|
|
|
|
110 120 130 140 150
|
|
-6 HINW.010 GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
|
|
-3 HINW.007 GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
|
|
-5 HINW.009 GGCAGCGACCGTCG
|
|
17 HINW.999 AGCGACCGTCG
|
|
CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG
|
|
|
|
160 170 180 190 200
|
|
-6 HINW.010 TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
|
|
-3 HINW.007 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
|
-5 HINW.009 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
|
17 HINW.999 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
|
|
12 HINW.017 GTAGAGC*
|
|
CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
|
@6. TX 1 @List a text file
|
|
|
|
This option allows users to list text files on the screen. It
|
|
can be used to read a file containing notes, for checking files
|
|
written to disk etc. The user is asked to type the name of the file
|
|
to list.
|
|
@8. TX 1 @Calculate a consensus
|
|
|
|
Calculates a consensus sequence either for the whole
|
|
database or for selected contigs. The consensus is written to a file
|
|
named by the user.
|
|
Supply a file name, choose between whole database or selected
|
|
contigs.
|
|
|
|
Symbols for uncertainty in gel readings
|
|
|
|
In order to record uncertainties when reading gels the
|
|
codes shown below can be used. Use of these codes permits us to
|
|
extract the maximum amount of data from each gel and yet record any
|
|
doubts by choice of code. The program can deal with all of
|
|
these codes and any other characters in a sequence are treated
|
|
as dash (-) characters.
|
|
|
|
SYMBOL MEANING
|
|
|
|
1 PROBABLY C
|
|
2 " T
|
|
3 " A
|
|
4 " G
|
|
D " C POSSIBLY CC
|
|
V " T " TT
|
|
B " A " AA
|
|
H " G " GG
|
|
K " C " C-
|
|
L " T " T-
|
|
M " A " A-
|
|
N " G " G-
|
|
R A OR G
|
|
Y C OR T
|
|
5 A OR C
|
|
6 G OR T
|
|
7 A OR T
|
|
8 G OR C
|
|
- A OR G OR C OR T
|
|
a A set by auto edit
|
|
c C set by auto edit
|
|
g G set by auto edit
|
|
t T set by auto edit
|
|
* padding character placed by auto assembler
|
|
else = -
|
|
|
|
The DNA consensus algorithm
|
|
|
|
The "calculate consensus" function, the "display contig"
|
|
routine and the "show quality" option use the rules outlined here
|
|
to calculate a consensus from aligned gel readings. Note that
|
|
"display contig" calculates a consensus for each page width it
|
|
displays (it does not use the consensus sequence file calculated
|
|
by the consensus function).
|
|
|
|
We have 6 possible symbols in the consensus sequence:
|
|
A,C,G,T,* and -. The last symbol is assigned if none of the others
|
|
makes up a sufficient proportion of the aligned characters at any
|
|
position in the contig. The following calculation is used to decide
|
|
which symbol to place in the consensus at each position.
|
|
|
|
Each uncertainty code contributes a score to one of A,C,G,T,*
|
|
and also to the total at each point. Symbols like R and Y which
|
|
don't correspond to a single base type contribute only to the total
|
|
at each point. The scores are shown below.
|
|
definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1
|
|
|
|
probable assignments ie 1,2,3,4 = 0.75
|
|
|
|
other uncertainty codes including R,Y,5,6,7,8,- = 0.1
|
|
|
|
A cutoff score of 51% to 100% is supplied by the user. (When
|
|
the program starts this is set to 75%. See "set display
|
|
parameters"). At each position in the contig we calculate the total
|
|
score for each of the 5 symbols A,C,G,T and * (denote these by Xi,
|
|
where i=A,C,G,T or *), and also the sum of these totals (denote this
|
|
by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed
|
|
in the consensus; otherwise - is assigned.
|
|
|
|
Notice that S does not equal the number of times the sequence
|
|
has been determined, but is the score total, and hence we are less
|
|
likely to put a - in the consensus. For the "examine quality"
|
|
algorithm each strand is treated separately but the calculation is
|
|
the same. (It was originally different).
|
|
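The calculation for one position of the contig can be sketched as follows. This is a reading of the rules above, not the program's own code: the function name consensus_symbol is hypothetical, and S is taken to include the 0.1 contributions of codes such as R and Y, since they are said to contribute to the total.

```python
# Codes scoring 1 for a base, from the table of symbols above.
DEFINITE = {"A": "A", "C": "C", "G": "G", "T": "T", "*": "*",
            "D": "C", "V": "T", "B": "A", "H": "G",
            "K": "C", "L": "T", "M": "A", "N": "G",
            "a": "A", "c": "C", "g": "G", "t": "T"}
# Codes scoring 0.75 for a base.
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for the characters aligned at one position."""
    totals = dict.fromkeys("ACGT*", 0.0)   # the scores Xi
    s = 0.0                                # the score total S
    for ch in column:
        if ch in DEFINITE:
            totals[DEFINITE[ch]] += 1.0
            s += 1.0
        elif ch in PROBABLE:
            totals[PROBABLE[ch]] += 0.75
            s += 0.75
        else:
            s += 0.1   # R, Y, 5, 6, 7, 8, - and any other character
    if s == 0.0:
        return "-"
    for symbol, x in totals.items():
        if 100.0 * x / s > cutoff:         # strict comparison, as in the text
            return symbol
    return "-"


print(consensus_symbol("GG"), consensus_symbol("C*"), consensus_symbol("G4"))
```

Because the comparison is strict, three definite calls out of four score exactly 75% and do not pass a cutoff of 75%; they do pass any lower cutoff.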
|
|
Format of the consensus sequence ( and vector sequences).
|
|
|
|
A consensus sequence file may contain the consensus for
|
|
several contigs and so we identify each of them by preceding them by
|
|
a 20 character title. The title is of the form <---LAMBDA.076----->
|
|
( where LAMBDA is the project name and gel reading number 76 is the
|
|
leftmost gel reading to contribute to this consensus sequence).
|
|
The angle brackets <> and the three digit number preceded by a full stop
|
|
are important to some processing programs.
|
|
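A processing program might recognise the title as sketched below. The pattern is generalised from the single example given above, so it is a guess rather than a definition of the format. The function name parse_title is hypothetical, and project names containing dashes are not handled.

```python
import re

# "<", dashes, project name, ".", three digits, dashes, ">"
TITLE = re.compile(r"<-+([^-.<>]+)\.(\d{3})-+>")


def parse_title(title):
    """Return (project name, leftmost gel reading number) or None."""
    if len(title) != 20:          # titles are 20 characters long
        return None
    match = TITLE.fullmatch(title)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


print(parse_title("<---LAMBDA.076----->"))
```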
@25. TX 1 @Show relationships
|
|
|
|
Used to show the relationships of the gel readings in the
|
|
database in three ways -
|
|
(a) All contig descriptor lines followed by all gel descriptor
|
|
lines.
|
|
(b) All contigs one after the other sorted, i.e. for each
|
|
contig show its contig descriptor line followed by all its gel
|
|
descriptor lines sorted on position from left to right
|
|
(c) Selected contigs: show the contig line and, in order, those
|
|
gel readings that cover a user-defined region. Note that this
|
|
output can be directed to a disk file by prior selection of "disk
|
|
output".
|
|
|
|
Below is an example showing a contig from position 1 to 689.
|
|
The left gel reading is number 6 and has archive name HINW.010, the
|
|
rightmost gel reading is number 2 and has archive name HINW.004.
|
|
On each gel descriptor line is shown: the name of the archive
|
|
version, the gel number, the position of the left end of the gel
|
|
reading relative to the left end of the contig, the length of
|
|
the gel reading (if this is negative it means that the gel reading
|
|
is in the opposite orientation to its archive), the number of the
|
|
gel reading to the left and the number of the gel reading to the
|
|
right.
|
|
|
|
|
|
CONTIG LINES
|
|
CONTIG LINE LENGTH ENDS
|
|
LEFT RIGHT
|
|
48 689 6 2
|
|
GEL LINES
|
|
NAME NUMBER POSITION LENGTH NEIGHBOURS
|
|
LEFT RIGHT
|
|
HINW.010 6 1 -279 0 3
|
|
HINW.007 3 91 -265 6 5
|
|
HINW.009 5 137 -299 3 17
|
|
HINW.999 17 140 273 5 12
|
|
HINW.017 12 193 265 17 18
|
|
HINW.031 18 385 -245 12 2
|
|
HINW.004 2 401 -289 18 0
|
|
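The gel lines form a linked list through their neighbour numbers, with 0 meaning no neighbour. The sketch below copies the example above into a Python dictionary, walks the contig from its left end, and recomputes the contig length. The function names walk and contig_length are hypothetical.

```python
# number: (name, position, signed length, left neighbour, right neighbour)
GELS = {
    6: ("HINW.010", 1, -279, 0, 3),
    3: ("HINW.007", 91, -265, 6, 5),
    5: ("HINW.009", 137, -299, 3, 17),
    17: ("HINW.999", 140, 273, 5, 12),
    12: ("HINW.017", 193, 265, 17, 18),
    18: ("HINW.031", 385, -245, 12, 2),
    2: ("HINW.004", 401, -289, 18, 0),
}


def walk(gels, leftmost):
    """Follow right neighbours from the leftmost gel reading."""
    order = []
    gel = leftmost
    while gel != 0:
        order.append(gel)
        gel = gels[gel][4]
    return order


def contig_length(gels):
    """Rightmost position covered by any gel reading."""
    return max(position + abs(length) - 1
               for _, position, length, _, _ in gels.values())


print(walk(GELS, 6))
print(contig_length(GELS))
```

The walk ends at gel reading 2 and the computed length is 689, in agreement with the contig line.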
|
|
@21. TX 3 @Enter new gel reading
|
|
|
|
THIS OPTION IS NO LONGER AVAILABLE IN XDAP. USE AUTO ASSEMBLE
|
|
|
|
Used to enter new gel readings into the database. The new gel
|
|
reading must have previously been compared with the contents of the
|
|
database by use of "auto assemble" in order to ascertain if it
|
|
overlaps any previously entered data.
|
|
|
|
The user is expected to know: if the gel reading overlaps; if
|
|
so which contig it overlaps; if so where it overlaps. The program
|
|
takes the user through a series of questions to establish the nature
|
|
of the overlap and then displays the overlap. The user is then
|
|
offered a number of options, including editors for the new gel
|
|
reading and the contig, to enable the correct alignment of the gel
|
|
reading throughout its whole length.
|
|
Supply the name of the gel reading file. If the gel reading has
|
|
been entered before the program will not permit entry. The program
|
|
gives the gel reading a unique number and asks if the sequence
|
|
overlaps any data already in the database (reported by "auto
|
|
assemble"). If it does not, entry is complete. If it does overlap
|
|
the dialogue continues with the program asking if the gel reading
|
|
overlaps "in the normal sense", if not it will automatically
|
|
complement the sequence. Then supply the number of the contig the
|
|
gel reading overlaps (as reported by "auto assemble").
|
|
|
|
Overlaps are divided into two types: those for which the new
|
|
gel reading protrudes from the left end of the contig it overlaps,
|
|
and those for which it does not. The program asks about this with
|
|
the question "Left end of gel reading is inside contig". If this is
|
|
true the program will go on to ask for the position in the contig of
|
|
the left end of the new gel reading. If it is not true the program
|
|
will ask for the position in the new gel reading of the left end of
|
|
the contig.
|
|
|
|
Once this is completed the program will display the first 50
|
|
bases of the overlap. The gel readings in the contig and their
|
|
consensus are displayed with the new gel reading underneath. The
|
|
mismatches are shown by *'s on the next line down. For example:
|
|
|
|
|
|
60 70 80 90 100
|
|
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
|
|
-3 HINW.007 GGCACA*GTC
|
|
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACACGTC
|
|
NEWGEL CACAAGCGAGCGAGAGGGGCACCGTGACGTGGTCACGCCGGGGACACGTC
|
|
MISMATCH * * *
|
|
10 20 30 40 50
|
|
|
|
|
|
The program then needs to know if the position of the left
|
|
end of the overlap is correct. If it is the user should type
|
|
return, if not, 1 and the program will ask for the new position and
|
|
display it.
|
|
The program now offers a number of options to allow the user to
|
|
align the new gel reading correctly over its whole length with the
|
|
data already in the contig. It is important that
|
|
sufficient edits are made to the new gel reading or the
|
|
sequences in the contig at this stage to get the alignment correct,
|
|
because once entry is completed, the alignment is fixed and cannot
|
|
easily be changed (see "alter relationships"). Alignment can be
|
|
achieved by making insertions or deletions but deletion of
|
|
data requires the original gels to be checked. For this reason
|
|
at entry we usually make only insertions to achieve alignment. We
|
|
use X or asterisks (*) as padding characters to achieve alignment
|
|
and so can, if required, distinguish padding characters from
|
|
characters assigned from reading gels.
|
|
|
|
The options available are:
|
|
? = HELP
|
|
! = Give up
|
|
3 = Complete entry
|
|
4 = Edit contig
|
|
5 = Display overlap
|
|
6 = Edit new gel reading
|
|
|
|
|
|
|
|
1. HELP gives this information.
|
|
|
|
2. Give up allows users to change their minds about entering
|
|
the new gel reading. The program will ask the user to confirm this
|
|
choice.
|
|
|
|
3. Complete entry is the command to add the new gel reading to
|
|
the contig. The program updates the relationships accordingly. The
|
|
user is asked to confirm this command.
|
|
|
|
4. Edit contig gives the user access to a simple editor that
|
|
allows insertions, deletions and changes to be made to the contig.
|
|
The editor maintains alignments by making the same number of
|
|
insertions or deletions in all sequences covering the edit position.
|
|
The program protects the user by allowing edits only
|
|
within the region of overlap.
|
|
|
|
5. Display allows display of the region of overlap only. This
|
|
is defined by the relative positions in the contig. The default is
|
|
the whole of the region of overlap.
|
|
|
|
6. Edit new gel reading allows the new gel reading to be
|
|
edited using a simple editor.
|
|
@23. TX 3 @ Complement a contig
|
|
|
|
This function will complement and reverse all of the gel
|
|
readings in a contig. It automatically reverses and
|
|
complements each gel reading sequence, reorders left and right
|
|
neighbours, recalculates relative positions and changes each
|
|
strandedness.
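As an illustration only, the bookkeeping described above can be sketched in Python. The record layout (`seq`, `position`, `sense`) and the function names are assumptions made for this sketch; they are not the program's actual data structures.

```python
# Hypothetical sketch: complement and reverse every reading in a contig.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base and reverse the string; pads and dashes are kept."""
    return seq.translate(COMPLEMENT)[::-1]


def complement_contig(readings, contig_length):
    """readings: list of dicts with 'seq', 'position' (1-based left end in the
    contig) and 'sense' (+1 or -1). Returns the readings of the complemented
    contig, ordered from left to right."""
    flipped = []
    for r in readings:
        right_end = r["position"] + len(r["seq"]) - 1
        flipped.append({
            "seq": reverse_complement(r["seq"]),
            # the old right end becomes the new left end
            "position": contig_length - right_end + 1,
            "sense": -r["sense"],
        })
    # sorting by the new positions reorders left and right neighbours
    flipped.sort(key=lambda r: r["position"])
    return flipped
```

A reading that was the rightmost in the contig becomes the leftmost after the call, and its sense is inverted.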
|
|
|
|
The only user input required is to identify the contig
|
|
to complement by the number or name of a gel reading it contains.
|
|
DO NOT KILL THE PROGRAM DURING THIS STEP!
|
|
@22. TX 3 @ Join contigs
|
|
|
|
This function joins contigs interactively using a mouse driven
|
|
editor. The operation of this editor is very similar to the Contig
|
|
Editor described in "@4 Edit".
|
|
|
|
It allows the user to align the ends of the two contigs by
|
|
editing each contig separately. It is important that the alignment
|
|
achieved is correct because once the join is completed the
|
|
alignment is fixed. The program needs to know which two contigs to
|
|
join.
|
|
|
|
First specify which two contigs are to be joined, identifying
|
|
the left contig first and then the right. The program checks that
|
|
the two contig numbers are different (it will not allow circles to
|
|
be formed!).
|
|
|
|
The Join Editor consists of two Contig Editors in between
|
|
which is sandwiched a disagreement box. This disagreement box shows
|
|
exclamation marks to denote mismatches between the two consensuses.
|
|
|
|
For example, the display will look something like this:
|
|
|
|
1460 1470 1480 1490 1500
|
|
56 HINW.100 TCT*GAGCAGTGTGGGCGCTG*CCGG
|
|
33 HINW.300 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
|
|
-25 HINW.090 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
|
|
19 HINW.123 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
|
|
CONSENSUS TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
|
|
MISMATCH ! !!!!!!
|
|
10 20 30 40 50
|
|
-6 HINW.010 TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
|
|
-3 HINW.007 TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
|
|
-5 HINW.009 GCTCGGAGGGCATGAAGT*AGAGC
|
|
CONSENSUS TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
|
|
|
|
|
|
|
|
The best strategy for joining is to identify the exact
|
|
position of overlap. This is defined as the position in the left
|
|
contig that the leftmost character of the right contig overlaps.
|
|
The overlap must be of at least one character. Use the scroll bar
|
|
and the scroll buttons (`<<', `<', `>' and `>>') to set the
|
|
relative positions of the two contigs.
|
|
|
|
The join position can be fixed in position by pressing the
|
|
`lock' button at the top of the Join Editor. Locking allows the two
|
|
contigs to be scrolled as one when using the scroll bar and buttons,
|
|
the left ends always in the same position relative to each other.
|
|
|
|
Once locked, it is best to proceed to the right along the
|
|
contigs, inserting padding characters (`*') into the consensuses to
|
|
minimise the disagreements.
|
|
|
|
It is essential that the user aligns the two contigs
|
|
throughout the whole region of overlap before completing the join
|
|
because it is only at this stage that the two contigs can be edited
|
|
independently. Once the join is completed the alignment can only be
|
|
altered using the routines supplied by "alter relationships".
|
|
|
|
The join can be completed by pressing the `Leave Editor'
|
|
button. The percentage mismatch is displayed, and the user is
|
|
required to confirm that they want to perform the join.
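The disagreement box and the percentage mismatch can be pictured with a small Python sketch. It assumes that the percentage is the number of differing columns divided by the overlap length, which this help text does not state, and the function names are invented for the example.

```python
# Hypothetical sketch of the disagreement box and the mismatch percentage.

def overlap_columns(left_consensus, right_consensus, join_position):
    """Pair up the overlapping characters of the two consensuses.
    join_position is the 1-based position in the left contig that the
    leftmost character of the right contig overlaps."""
    if join_position < 1:
        raise ValueError("join position must be 1 or greater")
    pairs = list(zip(left_consensus[join_position - 1:], right_consensus))
    if not pairs:
        raise ValueError("the overlap must be of at least one character")
    return pairs


def disagreement_line(left_consensus, right_consensus, join_position):
    """One '!' per mismatching column, a space otherwise."""
    pairs = overlap_columns(left_consensus, right_consensus, join_position)
    return "".join("!" if a != b else " " for a, b in pairs)


def percent_mismatch(left_consensus, right_consensus, join_position):
    """Assumed definition: mismatching columns as a percentage of the overlap."""
    pairs = overlap_columns(left_consensus, right_consensus, join_position)
    mismatches = sum(1 for a, b in pairs if a != b)
    return 100.0 * mismatches / len(pairs)
```

Inserting a padding character into one consensus shifts everything to its right by one column, which is why the disagreements are best removed working from left to right.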
|
|
@24. TX 1 @ Copy the database
|
|
|
|
Used to make a copy of the database. If required the database
|
|
size can be altered using this option. The "version" of a database
|
|
is encoded as the last letter in the names of the five files that
|
|
contain the database.
|
|
|
|
Supply a "version" number (the default is version 1), and if
|
|
required select a new size for the database. The size of a database
|
|
is the number of lines of information it can hold. It needs a line
|
|
for each gel reading and another for each contig.
|
|
@19. TX 1 @ Check database
|
|
|
|
Used to perform a check on the logical consistency of the
|
|
database. No user intervention is required.
|
|
|
|
The following relationships are checked:
|
|
1. If gel reading A thinks gel reading B is its left neighbour
|
|
does B think A is its right neighbour? The error message is
|
|
"Hand holding problem for gel reading A"
|
|
followed by the gel descriptor lines for gel readings A and B.
|
|
2. Are there any contig lines with no left or right end gel
|
|
readings? The error message is
|
|
"Bad contig line number A"
|
|
3. Do the gel readings that are described as left ends on
|
|
contig lines agree that they are left ends? The error message is
|
|
"The end gel readings of contig A have outward neighbours"
|
|
4. Are there gel readings that are in more than one contig?
|
|
The error message is
|
|
" Gel number A is used N times"
|
|
5. Are there gel readings that are not in any contig? The
|
|
error message is
|
|
" Gel number A is not used"
|
|
6. Do the relative positions of gel readings agree with
|
|
their position as defined by left and right neighbourliness? The
|
|
error message is
|
|
" Gel number A with position X is left neighbour of gel number B
|
|
with position Y"
|
|
7. Are there any loops in contigs? If so no further
|
|
checking is done. The error message is
|
|
" Loop in contig n no further checking done, but gel reading numbers
|
|
follow"
|
|
The program then prints the gel reading numbers in the looped
|
|
contig up to the start of the loop.
|
|
8. Are there any contigs of length <1? The error message is
|
|
" The contig on line number x has zero length"
|
|
9. Are there any gel readings (used in only one contig) that have
|
|
zero length? The error message is
|
|
" Gel number N has zero length"
|
|
Note that "auto assemble" also uses this logical consistency check
|
|
and will only tolerate a "Gel number N is not used" error. Any other
|
|
error will cause it to give up.
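A few of these checks can be sketched in Python as follows. The in-memory layout (a dictionary of gel readings with `left` and `right` neighbour numbers, 0 meaning no neighbour, and a list of contig lines) is an assumption made for the illustration and is not the database file format; only checks 1, 3, 4, 5 and 7 are shown.

```python
# Hypothetical sketch of some of the logical consistency checks.

def check_database(gels, contigs):
    """gels: {number: {"left": n, "right": n}}, 0 meaning no neighbour.
    contigs: list of {"left_end": n, "right_end": n}.
    Returns a list of error messages (empty if nothing was found)."""
    errors = []
    # Check 1: if A thinks B is its left neighbour, B must name A on its right.
    for a, rec in gels.items():
        b = rec["left"]
        if b and gels[b]["right"] != a:
            errors.append(f"Hand holding problem for gel reading {a}")
    # Check 3: the end readings of a contig must have no outward neighbours.
    for line, contig in enumerate(contigs, start=1):
        if gels[contig["left_end"]]["left"] or gels[contig["right_end"]]["right"]:
            errors.append(
                f"The end gel readings of contig {line} have outward neighbours")
    # Checks 4, 5 and 7: chain rightwards through each contig counting uses.
    used = dict.fromkeys(gels, 0)
    for line, contig in enumerate(contigs, start=1):
        seen = set()
        g = contig["left_end"]
        while g:
            if g in seen:
                errors.append(f"Loop in contig {line} no further checking done")
                return errors
            seen.add(g)
            used[g] += 1
            g = gels[g]["right"]
    for n, count in used.items():
        if count > 1:
            errors.append(f"Gel number {n} is used {count} times")
        elif count == 0:
            errors.append(f"Gel number {n} is not used")
    return errors


def acceptable_for_auto_assemble(errors):
    """Only 'is not used' errors are tolerated, as noted above."""
    return all(e.endswith("is not used") for e in errors)
```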
|
|
@29. TX 1 @ Examine quality
|
|
|
|
Analyses the quality of the data in a contig. It reports on
|
|
the proportion of the consensus that is "well determined" and will
|
|
display a sequence of symbols that indicate the quality of the
|
|
consensus at each position.
|
|
|
|
Identify the contig to analyse, and the section of interest.
|
|
The current consensus calculation cutoff score will be used to
|
|
decide if each position is "well determined". In general the quality
|
|
of a reading deteriorates along the length of the gel and so it is
|
|
also possible to use a length cutoff for the quality calculation.
|
|
Only the data from the first section of each reading will be
|
|
included in the quality calculation. The length is altered under
|
|
"set parameters" and is initially set to the maximum reading length.
|
|
A summary showing the percentage of the consensus that falls into
|
|
each category of quality is shown. Choose whether or not to have the
|
|
quality codes for each position of the consensus displayed. They can
|
|
be displayed as either graphics or text.
|
|
|
|
The quality of the data depends on the number of times it has
|
|
been sequenced and the particular uncertainty codes used in each
|
|
gel reading. This function divides the data into five categories,
|
|
assigning each a symbol or code:
|
|
1. Well determined on both strands and they agree. code=0
|
|
2. Well determined on the plus strand only. code=1
|
|
3. Well determined on the minus strand only. code=2
|
|
4. Not well determined on either strand. code=3
|
|
5. Well determined on both strands but they disagree. code=4
|
|
A position is "well determined" if it is assigned one of the symbols
|
|
A,C,G,T when the algorithm described in the section "calculate a
|
|
consensus" is applied. The calculation is performed separately for each
|
|
strand.
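The five categories can be expressed as a small decision function. The sketch below is in Python and assumes that the per-strand consensus symbols have already been calculated; the function names and the use of "-" for an undetermined position are choices made for the example.

```python
# Hypothetical sketch: assign a quality code to one consensus position.
BASES = ("A", "C", "G", "T")


def quality_code(plus_symbol, minus_symbol):
    """plus_symbol and minus_symbol are the consensus symbols calculated
    separately for each strand; anything other than A, C, G or T counts
    as not well determined."""
    plus_ok = plus_symbol in BASES
    minus_ok = minus_symbol in BASES
    if plus_ok and minus_ok:
        return 0 if plus_symbol == minus_symbol else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3


def quality_summary(codes):
    """Percentage of positions that fall into each of the five categories."""
    total = len(codes)
    return {code: 100.0 * codes.count(code) / total for code in range(5)}
```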
|
|
|
|
If the user chooses to have the data displayed graphically the
|
|
following scheme is used. A rectangular box is drawn so that the x
|
|
coordinate represents the length of the contig. The box is
|
|
notionally divided vertically into 5 possible levels which are given
|
|
the y values: -2,-1,0,1,2. The quality codes attributed to each
|
|
base position are plotted as rectangles. Each rectangle represents
|
|
a region in which the quality codes are identical, so a single base
|
|
having a different code from its immediate neighbours will appear as
|
|
a very narrow rectangle.
|
|
|
|
Rectangle bottom and top y values
|
|
|
|
Quality 0 rectangle from 0 to 0
|
|
Quality 1 rectangle from 0 to 1
|
|
Quality 2 rectangle from 0 to -1
|
|
Quality 3 rectangle from -1 to 1
|
|
Quality 4 rectangle from -2 to 2
|
|
|
|
Obviously a single line at the midheight shows a perfect
|
|
sequence.
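As a rough illustration of the plotting scheme, the following Python sketch groups a sequence of quality codes into runs and looks up the y values from the table above. The tuple layout of the result is an assumption for the example; no drawing is done.

```python
# Hypothetical sketch: turn quality codes into rectangles for plotting.
from itertools import groupby

# y values ("from", "to") for each quality code, as in the table above
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def quality_rectangles(codes):
    """codes: one quality code per base position, starting at position 1.
    Returns (first_position, last_position, y_from, y_to) for each run of
    identical codes."""
    rectangles = []
    position = 1
    for code, run in groupby(codes):
        run_length = len(list(run))
        y_from, y_to = Y_RANGE[code]
        rectangles.append((position, position + run_length - 1, y_from, y_to))
        position += run_length
    return rectangles
```

A single base with a different code from its neighbours gives a rectangle whose first and last positions are equal, which is why it appears very narrow.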
|
|
|
|
Typical dialogue is shown below.
|
|
|
|
41.47% OK on both strands and they agree(0)
|
|
55.48% OK on plus strand only(1)
|
|
2.08% OK on minus strand only(2)
|
|
0.97% Bad on both strands(3)
|
|
0.00% OK on both strands but they disagree(4)
|
|
? (y/n) (y) Show sequence of codes
|
|
|
|
10 20 30 40 50
|
|
1111111111 1111111111 1111111111 1111111111 1111111111
|
|
|
|
60 70 80 90 100
|
|
1111111111 1111111111 1111111111 3111111111 1111111111
|
|
|
|
110 120 130 140 150
|
|
1111111111 1111131111 1111111111 1111111111 1111111111
|
|
|
|
160 170 180 190 200
|
|
1111111111 1111111111 1111111111 1111111111 1111111133
|
|
|
|
210 220 230 240 250
|
|
1311111111 1111111111 1111111110 0000000000 0000220000
|
|
|
|
260 270 280 290 300
|
|
0000000000 0020000000 2200000202 0002000000 0000222200
|
|
|
|
@26. TX 3 @ Alter relationships
|
|
|
|
Used to make what are normally illegal changes to the
|
|
database. That is, the normal checks are not done and any item in the
|
|
database can be changed independently of all others. Users need to
|
|
know what they are doing because it is very easy to make a horrible
|
|
mess. Always start by making a copy!
|
|
|
|
By using the options here users can edit individual gel
|
|
readings in contigs, move one section of a contig relative to
|
|
another, break contigs, remove contigs, remove gel readings, etc. To
|
|
give flexibility most of the commands do only one thing. This means
|
|
that several commands may have to be executed to complete any
|
|
change. At the end of this help section there are notes on removing
|
|
gel readings from the database.
|
|
|
|
The following options are offered:
|
|
|
|
Cancel
|
|
Line change
|
|
Edit single gel reading
|
|
Delete contig
|
|
Shift
|
|
Move gel reading
|
|
Rename gel reading
|
|
Break a contig
|
|
Alter raw data parameters
|
|
|
|
1. Cancel returns to the main options of SAP.
|
|
2. Line change
|
|
allows the user to change the contents of any line in the file of
|
|
relationships. The line is selected by number, the program prints
|
|
the current line and prompts for the new line.
|
|
3. Edit
|
|
allows the user to edit an individual gel reading
|
|
independently of any others it may be related to. The edit positions
|
|
are relative to the contig. The effect of this editing on the length
|
|
of the gel reading is taken care of but, if it changes the length of
|
|
a contig, or its relationship to others, this must be accounted for
|
|
(if necessary) by use of the "line change" function.
|
|
4. Delete contig
|
|
is a function that deletes a contig line by moving down all the
|
|
contig lines above by one position. It prompts only for the line to
|
|
delete. It does not delete any of the gel readings or gel
|
|
reading lines for the deleted contig but it does reduce the number
|
|
of contigs on line IDBSIZ by 1.
|
|
5. Shift
|
|
allows the user to change all the relative positions of a set of
|
|
neighbouring gel readings by some fixed value, i.e. it will shift
|
|
related gel readings either left or right. It can therefore be
|
|
used to change the alignment of the gel readings in a contig or as
|
|
part of the process of breaking a contig into two parts (see below).
|
|
It prompts for the number of the first gel reading to shift and
|
|
then for the distance to move them (Note a negative value will
|
|
move the gel readings left and a positive value right). It then
|
|
chains rightwards (i.e. follows right neighbours) and shifts each gel
|
|
reading, in turn, up to the end of the contig. (This means that
|
|
only those gel readings from the first to shift to the rightmost are
|
|
moved). It updates the length of the contig accordingly.
|
|
6. Move gel reading
|
|
is a function to renumber a gel reading. It moves all the
|
|
information about a gel reading on to another line. The user must
|
|
specify the number of the gel reading to move and the number of the
|
|
line to place it. It takes care of all the relationships. Of course
|
|
gel readings must not be moved to lines occupied by other gel
|
|
readings! It can be used as part of the process of removing a gel
|
|
reading from the database (see below).
|
|
7. Rename gel reading
|
|
is a function that is used to rename the archive names of gel
|
|
readings in the database; it only changes the name in the .ARN
|
|
file of the database.
|
|
|
|
8. Break contig
|
|
|
|
Occasionally it is necessary to break a contig into two parts
|
|
and this can be achieved using this option. The program needs only
|
|
the number of a gel reading. This is the gel reading that will
|
|
become a left end after the break. That is, the break is made
|
|
between this gel reading and its left neighbour. A new contig line
|
|
is created so ensure that there is sufficient space in the database.
|
|
Removing gel readings from contigs
|
|
|
|
Gel readings can be removed from contigs if they are not
|
|
essential for holding the contig together (i.e. are not the only gel
|
|
reading covering a particular region). Suppose the gel reading to
|
|
remove is gel number b with left neighbour a and right neighbour c.
|
|
Using "line change" change the right neighbour of a to c, and the
|
|
left neighbour of c to a. To tidy things up: suppose there are x gel
|
|
readings in the database; then, using "move gel reading" move gel x
|
|
to line b; then, using "line change" decrease the number of gel
|
|
readings in the database (stored in the last line) by 1.
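The recipe above amounts to unlinking an element from a doubly linked list and then filling the gap with the last record. A minimal Python sketch follows, assuming an in-memory dictionary of readings keyed by line number with `left` and `right` neighbour numbers (0 meaning none). It assumes gel b is an interior reading, and it does not update contig lines, which would also be needed if the moved reading were a contig end.

```python
# Hypothetical sketch of the removal recipe described above.

def remove_gel_reading(gels, b):
    """gels: {line_number: {"left": n, "right": n}} for lines 1..x.
    Removes reading b and returns the new number of gel readings."""
    a, c = gels[b]["left"], gels[b]["right"]
    # "line change": a and c become neighbours of each other
    if a:
        gels[a]["right"] = c
    if c:
        gels[c]["left"] = a
    # "move gel reading": the last reading x takes over line b
    x = max(gels)
    if x != b:
        moved = gels[x]
        gels[b] = moved
        if moved["left"]:
            gels[moved["left"]]["right"] = b
        if moved["right"]:
            gels[moved["right"]]["left"] = b
    del gels[x]
    # "line change": the count of gel readings is now one smaller
    return len(gels)
```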
|
|
|
|
9. Alter raw data parameters
|
|
|
|
Allows the user to edit the individual raw data parameters,
|
|
such as the left and right cutoff lengths and the name of the
|
|
machine readable trace file. The user must specify the gel line to
|
|
modify, and provide new values for the length of the raw sequence
|
|
including cutoff lengths, the left cutoff position, the length of
|
|
the original working sequence, the machine type, and the name of the
|
|
raw data file, where these values change.
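The behaviour of the Shift option described earlier in this section can be pictured with the following Python sketch. The record layout (`position`, `length`, `right`) and the extra `left_end` argument used to recompute the contig length are assumptions made for the illustration.

```python
# Hypothetical sketch of the Shift option: chain rightwards and move readings.

def shift_readings(gels, left_end, first, distance):
    """gels: {number: {"position": p, "length": n, "right": r}}, where r is
    the right neighbour (0 for none). Shifts reading `first` and everything
    to its right by `distance` (negative moves left, positive right) and
    returns the new contig length."""
    g = first
    while g:
        gels[g]["position"] += distance
        g = gels[g]["right"]
    # recompute the contig length from the rightmost base of any reading
    contig_length = 0
    g = left_end
    while g:
        right_base = gels[g]["position"] + gels[g]["length"] - 1
        contig_length = max(contig_length, right_base)
        g = gels[g]["right"]
    return contig_length
```

Readings to the left of `first` keep their positions, which is what makes the option useful as a step in breaking a contig.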
|
|
@27. TX 1 @ Set display parameters
|
|
|
|
Used to redefine the parameters that control the cutoff
|
|
employed by the consensus calculation and quality examiner, the
|
|
maximum length of each reading to include in the quality
|
|
calculation, the line length used by the display function, the text
|
|
window length used by the graphics options, and the graphics window
|
|
length used by the graphics options.
|
|
|
|
The default cutoff score is 75%. The default line length is 50
|
|
characters. For protein sequences the cutoff is always 100%.
|
|
|
|
The text window used by the graphics options controls the
|
|
amount of sequence listed at the crosshair position. The graphics
|
|
window controls the "zoom" function. Both these windows are defined
|
|
as the number of bases that should be shown, to both left and right
|
|
of the crosshair.
|
|
@30. TX 3 @ Auto edit a contig
|
|
|
|
This function automatically changes characters in gel readings
|
|
to make them agree with the consensus sequence. If employed as is
|
|
intended, use of this function is not a criminal activity but a
|
|
method that saves a large amount of work. All characters changed by
|
|
the auto editor will appear in the gel readings as lowercase
|
|
letters. The current consensus calculation cutoff score is used.
|
|
|
|
Identify the contig and the section to edit. The program will
|
|
display a summary of changes made. Note that it is important to
|
|
understand both what the auto editor does and the order in which it
|
|
does it. Before employing the auto editor users should note all the
|
|
corrections that they require, so that after it has been used the
|
|
corrections can be checked.
|
|
|
|
The general strategy employed when collecting shotgun sequence
|
|
data is to let the contigs get fairly deep, to get a printout of a
|
|
contig, check problems against the films, note corrections on the
|
|
printout, and make the changes using an interactive editor. In
|
|
general the consensus is correct except for places where padding
|
|
characters have been used to accommodate a single gel with an extra
|
|
character, or where the consensus is dash. The important point for
|
|
the auto editor is that most edits simply make the gel readings
|
|
conform to the consensus, or remove columns of pads.
|
|
|
|
The new editor does the following.
|
|
|
|
1) calculates a consensus for the contig (or part of a contig)
|
|
to be edited, and then uses this consensus to direct the editing of
|
|
the contig in 3 stages.
|
|
|
|
2) stage 1: find and correct all places where, if the order of
|
|
two adjacent characters is swapped, they will both agree with the
|
|
consensus (given that they did not match the consensus before).
|
|
These corrections are termed "transpositions"
|
|
|
|
3) stage 2: find and correct all places where there is a
|
|
definite consensus but the gel reading has a different character.
|
|
These corrections are termed "changes".
|
|
|
|
4) stage 3: delete all positions in which padding is the
|
|
consensus. These corrections are termed "deletions".
|
|
|
|
All changed characters are shown in lowercase letters so it
|
|
will be obvious which characters have been assigned by the program
|
|
(except for deletions). The number of each type of correction will
|
|
be displayed.
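The three stages can be sketched for a single gel reading as follows. This Python example assumes the reading and the consensus are already aligned strings of equal length, that "*" in the consensus marks a padding column and that only A, C, G and T count as a definite consensus. The real function works on all readings of a contig; the names used here are invented.

```python
# Hypothetical sketch of the three auto edit stages for one aligned reading.
BASES = "ACGT"


def auto_edit(reading, consensus):
    """Returns the edited reading and the number of each type of correction."""
    chars = list(reading)
    counts = {"transpositions": 0, "changes": 0, "deletions": 0}

    # Stage 1: swap adjacent characters when both then agree with the consensus.
    i = 0
    while i < len(chars) - 1:
        a, b = chars[i].upper(), chars[i + 1].upper()
        ca, cb = consensus[i], consensus[i + 1]
        if ca in BASES and cb in BASES and a != ca and b != cb and a == cb and b == ca:
            chars[i], chars[i + 1] = b.lower(), a.lower()
            counts["transpositions"] += 1
            i += 2
        else:
            i += 1

    # Stage 2: where the consensus is definite, make the reading agree with it.
    for i, c in enumerate(consensus):
        if c in BASES and chars[i].upper() != c:
            chars[i] = c.lower()
            counts["changes"] += 1

    # Stage 3: delete positions in which padding is the consensus.
    kept = [ch for ch, c in zip(chars, consensus) if c != "*"]
    counts["deletions"] = len(chars) - len(kept)
    return "".join(kept), counts
```

The order matters: running stage 2 first would turn a transposition into two separate changes.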
|
|
@10. TX 2 @Clear graphics
|
|
|
|
Clears graphics from the screen.
|
|
@11. TX 2 @Clear text
|
|
|
|
Clears text from the screen.
|
|
@12. TX 2 @Draw a ruler.
|
|
|
|
This option allows the user to draw a ruler or scale along the
|
|
x axis of the screen to help identify the coordinates of points of
|
|
interest. The user can define the position of the first base to be
|
|
marked (for example if the active region is 1501 to 8000, the user
|
|
might wish to mark every 1000th base starting at either 1501 or 2000
|
|
- it depends if the user wishes to treat the active region as an
|
|
independent unit with its own numbering starting at its left edge,
|
|
or as part of the whole sequence). The user can also define the
|
|
separation of the ticks on the scale and their height. If required
|
|
the labelling routine can be used to add numbers to the ticks.
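The choice of first tick in the example above can be made concrete with a few lines of Python. The function name and arguments are invented for this sketch.

```python
# Hypothetical sketch: positions of the ruler ticks within the active region.

def ruler_ticks(region_start, region_end, first_tick, separation):
    """Base positions that receive a tick, from first_tick to region_end."""
    return [t for t in range(first_tick, region_end + 1, separation)
            if t >= region_start]
```

For the active region 1501 to 8000, starting at 1501 numbers the region as an independent unit (ticks at 1501, 2501 and so on), while starting at 2000 keeps the ticks on round numbers of the whole sequence.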
|
|
@14. TX 2 @Reposition plots
|
|
|
|
The position of each of the plots is defined relative to a
|
|
user's drawing board which has size 1-10,000 in x and 1-10,000 in y.
|
|
Plots for each option are drawn in a window defined by x0,y0 and
|
|
xlength,ylength, where x0,y0 is the position of the bottom left hand
|
|
corner of the window, and xlength is the width of the window and
|
|
ylength the height of the window.
|
|
--------------------------------------------------------- 10,000
|
|
1 1
|
|
1 -------------------------------------- ^ 1
|
|
1 1 1 1 1
|
|
1 1 1 1 1
|
|
1 1 1 ylength 1
|
|
1 1 1 1 1
|
|
1 1 1 1 1
|
|
1 -------------------------------------- v 1
|
|
1 x0,y0^ 1
|
|
1 <---------------xlength--------------> 1
|
|
--------------------------------------------------------- 1
|
|
1 10,000
|
|
|
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
|
The default window positions are read from a file "ANALMARG" when
|
|
the program is started. Users can have their own file if required.
|
|
As all the plots start at the same position in x and have the same
|
|
width, x0 and xlength are the same for all options. Generally users
|
|
will only want to change the start level of the window y0 and its
|
|
height ylength. This option allows users to change window positions
|
|
whilst running the program. The routine prompts first for the
|
|
number of the option that the user wishes to reposition; then for
|
|
the y start and height; then for the x start and length. Note that
|
|
changes to the x values affect all options. If the user types only
|
|
carriage return for any value it will remain unchanged. Note that,
|
|
unlike all the other programs, the boxes used to contain analytical
|
|
results (eg plot quality) should not be made to overlap one another,
|
|
as the function of the crosshair routine depends on which box the
|
|
crosshair is in!
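The reason the boxes must not overlap can be seen from a small Python sketch of how a crosshair position might be matched to a window. The `Window` class, the box names and the coordinates are all assumptions made for the example; values are in drawing board units.

```python
# Hypothetical sketch: windows on the drawing board and crosshair lookup.
from dataclasses import dataclass


@dataclass
class Window:
    x0: int       # bottom left hand corner, x
    y0: int       # bottom left hand corner, y
    xlength: int  # width of the window
    ylength: int  # height of the window

    def contains(self, x, y):
        return (self.x0 <= x <= self.x0 + self.xlength
                and self.y0 <= y <= self.y0 + self.ylength)


def overlap_in_y(a, b):
    """True if two windows share any range of y values. As x0 and xlength
    are the same for all options, this decides whether the boxes overlap."""
    return a.y0 < b.y0 + b.ylength and b.y0 < a.y0 + a.ylength


def box_at(windows, x, y):
    """Name of the first window containing the crosshair, or None."""
    for name, window in windows.items():
        if window.contains(x, y):
            return name
    return None
```

If two boxes overlapped, a crosshair position inside the shared area would match both, and the meaning of a key press there would be ambiguous.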
|
|
@15. TX 2 @Label a diagram
|
|
|
|
This routine allows users to label any diagrams they have
|
|
produced. They are asked to type in a label. When the user types
|
|
carriage return to finish typing the label the cross-hair appears on
|
|
the screen. The user can position it anywhere on the screen. If the
|
|
user types R (for right justify) the label will be written on the
|
|
diagram with its right end at the cross-hair position. If the user
|
|
types L (for left justify) the label will be written on the diagram
|
|
with its left end at the cross hair position. The cross-hair will
|
|
then immediately reappear. The user may put the same label on
|
|
another part of the diagram as before, or hit the space bar to be
|
|
asked whether to type in another label.
|
|
|
|
Typical dialogue follows.
|
|
? Menu or option number=15
|
|
Type label then drive cross hair to left or right end
|
|
of label position then hit "L" to write label left
|
|
justified or "R" to write label right justified or
|
|
the space bar to quit
|
|
|
|
|
|
? Label=delta gene
|
|
|
|
missing graphics
|
|
|
|
? Label=
|
|
|
|
@16. TX 2 @Display a map
|
|
|
|
This draws a map of any sequence features selected by the
|
|
user. These features may be protein coding regions (CDS), tRNA
|
|
genes (TRNA), promoter positions (PRM), etc. Users may define their
|
|
own feature table key names. For example I find it convenient to
|
|
split CDS lines into CDS1, CDS2 and CDS3 each of which contains only
|
|
those sequences that code in the reading frames 1, 2 or 3. Then I
|
|
can plot them at different heights on the screen (suitable heights
|
|
can be determined by using the cross-hair). The coordinates must be
|
|
stored in a file in the format of an EMBL feature table.
|
|
|
|
Typical dialogue follows.
|
|
? Menu or option number=16
|
|
Display a map using an EMBL feature table file
|
|
? map file name=hsegl1.ft
|
|
? feature code(e.g. CDS) =CDS
|
|
X 1 + strand
|
|
2 - strand
|
|
3 both strands
|
|
? 0,1,2,3 =
|
|
? level (0-9480) (256) =4000
|
|
|
|
missing graphics
|
|
|
|
? feature code(e.g. CDS) =
|
|
|
|
@7. TX 1 @Redirect output
|
|
|
|
Used to direct output that would normally appear on the screen
|
|
to a file.
|
|
|
|
Select redirection of either text or graphics, and supply the
|
|
name of the file that the output should be written to.
|
|
|
|
The results from the next options selected will not appear on
|
|
the screen but will be written to the file. When option 7 is
|
|
selected again the file will be closed and output will again appear
|
|
on the screen.
|
|
@13. TX 2 @Use crosshair
|
|
|
|
This option puts a steerable cross on the screen which the
|
|
user drives around by using the arrow keys (or mouse). When the
|
|
crosshair is visible a number of options are available if the user
|
|
types one of a set of special keyboard characters. Any other
|
|
characters will cause an exit from the crosshair option. The special
|
|
keys are:
|
|
|
|
I = Identify the nearest gel reading
|
|
Z = Zoom in
|
|
Q = plot Quality
|
|
S = display the aligned Sequences at the crosshair position
|
|
N = list the Names and Numbers of the sequences at the crosshair
|
|
|
|
In order for any of these special keys to operate, the
|
|
crosshair must be in an appropriate display box, and the precise
|
|
function of the keys will also depend on which box the crosshair is
|
|
in.
|
|
|
|
If the crosshair is in the "plot all contigs" box, Z will
|
|
cause a new box to appear showing all the readings for the nearest
|
|
contig; Q will give the same as Z but will also produce an extra box
|
|
showing the "quality" plot.
|
|
|
|
If Z is hit in the "plot single contig" box, the contig will
|
|
be zoomed to the current graphics window size. The zoom will be
|
|
roughly centred on the crosshair position. Because of this it is
|
|
possible to step along a contig by repeatedly zooming with the
|
|
crosshair near to one end of the single contig display box. If I is
|
|
hit the crosshair must be close to a gel reading line. If Q is hit,
|
|
the quality plot will be produced for the region shown in the plot
|
|
single contig box. In all cases when the "plot all contigs" box is
|
|
shown, a vertical line will bisect the line that represents the
|
|
relevant contig, at the current position.
|
|
|
|
If the crosshair is in the plot quality box only the character
|
|
"s" will operate as a special symbol.
|
|
|
|
The number of bases shown in the N and S options is controlled
|
|
by the current graphics text window size, and the size of the zoom
|
|
window by the current graphics window size. Both are set by the
|
|
parameter setting function of the general menu.
|
|
@33. TX 2 @Plot single contig
|
|
|
|
This option produces a schematic of a selected region of a
|
|
single contig by drawing a horizontal line to represent each of its
|
|
gel readings. The lines show the relative positions of each reading
|
|
and also their sense. The plot is divided vertically into two
|
|
sections by a line that is identified by an asterisk drawn at each
|
|
end. All lines that lie above this line represent readings that are
|
|
in their original sense, all lines below show readings that are in
|
|
the complementary sense to their original. By use of the crosshair
|
|
function the plot can be stepped through and examined in more
|
|
detail. See help on crosshair.
|
|
@34. TX 2 @Plot all contigs
|
|
|
|
This option produces a schematic of all the contigs in a
|
|
database. It does this by drawing a horizontal line to represent
|
|
each of them. In order to show the ends of each contig it draws the
|
|
lines for contigs at alternate heights: the first at height one, the
|
|
second at height two, the third at height one, etc. The order of the
|
|
contigs in the display is the same as their order in the database.
|
|
By use of the crosshair function the plot can be stepped through and
|
|
examined in more detail. See help on crosshair.
|
|
@31. TX 3 @ Type in gel readings
|
|
|
|
THIS OPTION IS NO LONGER AVAILABLE IN XDAP.
|
|
|
|
This option allows gel readings to be typed in at the
|
|
keyboard. It creates a separate file for each gel reading and a file
|
|
of file names for the batch. The sequences from each batch may be
|
|
listed when they have all been entered. Users may choose to employ
|
|
special keys to identify the 4 bases A,C,G and T. By default these
|
|
special keys are N M , . but any other four characters may be used.
|
|
If special keys are used the characters are automatically translated
|
|
to A C G T before being stored on the disk.
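The translation of special keys can be sketched in Python as below. The help text lists the default keys as N M , . but does not say which key stands for which base; the sketch assumes they map in order to A, C, G and T.

```python
# Hypothetical sketch: translate special keys to bases before storing.

def translate_typed_reading(typed, keys="NM,."):
    """Replace the four special keys by A, C, G and T (assumed to be in
    that order); all other characters are left unchanged."""
    if len(keys) != 4 or len(set(keys)) != 4:
        raise ValueError("exactly four different special keys are needed")
    return typed.translate(str.maketrans(keys, "ACGT"))
```

Any other four characters can be supplied through the `keys` argument, for example `translate_typed_reading("1234", keys="1234")`.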
|
|
@35. TX 1 3 @Find internal joins
|
|
|
|
The purpose of this function is to use data already in the
|
|
database to find possible joins between contigs. Joins may have
|
|
been missed due to poor data or may have not been made due to
|
|
repeated sequences. Where appropriate, it may be possible to find
|
|
potential joins by using the data clipped off readings prior to
|
|
their entry into the database.
|
|
The database is checked for logical consistency. Supply a minimum
|
|
initial match length, a minimum alignment block, the maximum pads
|
|
per sequence, the maximum percent mismatch after alignment, and the
|
|
probe length. Choose whether clipped data is to be used; if so, define the
|
|
window size for finding good data and the number of dashes allowed
|
|
in the window. Processing will commence. Most of these values are
|
|
used in an identical way in the autoassemble function. The others
|
|
are defined below.
|
|
The program strategy
|
|
Take the first contig and calculate its consensus. If clipped data
|
|
is being used examine all readings that are in the complementary
|
|
orientation, and sufficiently near to the contig's left end, to see
|
|
if they have good clipped sequence which, if present, would protrude
|
|
from the left end of the contig. If found add the longest such
|
|
sequence to the left end of the consensus. Do the same for the right
|
|
end by examining readings that are in their original orientation. If
|
|
any are found add the longest extension to the right end of the
|
|
consensus. Repeat the consensus calculations and extensions for all
|
|
contigs hence producing an extended consensus. If clipped data is
|
|
not being used simply calculate the consensus for the whole
|
|
database. Now look for possible joins by processing the extended
|
|
consensus in the following way. Take the last, say 100, bases
|
|
(termed the "probe length" by the program) of the rightmost
|
|
consensus, compare it in both orientations with the extended consensus
|
|
of all the other contigs. Display any sufficiently good alignments.
|
|
Repeat with the left end of the rightmost contig. Do the same for
|
|
the ends of all the extended contigs, always only comparing with the
|
|
contigs to their left, so that the same matches do not appear twice.
|
|
Good clipped data is defined by sliding a window of "Window size for
|
|
good data scan" bases outwards along the sequence and stopping when
|
|
"Maximum number of dashes in scan window" or more dashes appear in
|
|
the window. Note that it is advisable to have some sort of cutoff
|
|
because if we simply take all the data it might be so full of
|
|
rubbish that we won't find any good matches. For the same reason it
|
|
is worth trying the procedure with different cutoffs. An initial run
|
|
using no clipped data is also recommended. Sufficiently good
|
|
alignments are defined by criteria equivalent to those used in
|
|
autoassemble, however here we only display alignments that pass all
|
|
tests.
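The window scan used to define good clipped data can be sketched in Python as follows. It assumes that the clipped sequence is given in the order in which the window slides outwards from the contig end, and that the good data stops at the start of the first window that reaches the dash limit; the exact stopping point is not stated above.

```python
# Hypothetical sketch of the good data scan on clipped sequence.

def good_clipped_length(clipped, window_size, max_dashes):
    """clipped: clipped sequence, ordered outwards from the contig end.
    Returns how many leading characters are treated as good data."""
    for start in range(len(clipped) - window_size + 1):
        window = clipped[start:start + window_size]
        # stop when max_dashes or more dashes appear in the window
        if window.count("-") >= max_dashes:
            return start
    # no window reached the limit (or the sequence is shorter than a window)
    return len(clipped)
```

Trying several values of `window_size` and `max_dashes` corresponds to the advice above to repeat the procedure with different cutoffs.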
|
|
Bugs
|
|
If a small contig is wholly contained within a larger one, such that
|
|
its ends are further than ("Probe length" - "Minimum initial match
|
|
length") from the ends of the larger contig, and the consensus for
|
|
the small contig lies to the left of the consensus for large contig,
|
|
the overlap will not be discovered. (See the search strategy).
|
|
All numbering is relative to base number one in the contig: matches
|
|
to the left (i.e. in the clipped data) have negative positions,
|
|
matches off the right end of the contig (i.e. in the clipped data)
|
|
have positions greater than the contig length. The
|
|
convention for reporting the positions of overlaps is as follows: if
|
|
neither contig needs to be complemented the positions are as shown.
|
|
If the program says "contig x in the - sense" then the positions
|
|
shown assume contig x has been complemented. For example in the
|
|
results given below the positions for the first overlap are as
|
|
reported, but those for the second assume that the contig in the
|
|
minus sense (i.e. 443) has been complemented.
|
|
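The numbering convention can be illustrated with the short sketch below. The
function names are hypothetical. Treating every position below 1 as lying in
the left clipped data is an assumption, and the mapping between a contig and
its complement is the usual reverse-complement arithmetic, given here for
positions inside the contig only. It is not taken from the program.

```python
def region(position, contig_length):
    """Say where a reported position lies relative to the contig."""
    if position < 1:
        return "left clipped data"    # assumption: anything below base 1
    if position > contig_length:
        return "right clipped data"
    return "contig"


def complemented_position(position, contig_length):
    """Map a position in a contig to the same base in its complement.

    Standard reverse-complement arithmetic: base 1 becomes the last base
    and the last base becomes base 1. Restricted to positions inside the
    contig, since the handling of clipped data is not known.
    """
    if not 1 <= position <= contig_length:
        raise ValueError("position lies outside the contig")
    return contig_length - position + 1
```

A position reported for a contig "in the - sense" refers to the complemented
contig, so a function of this kind would be needed to relate it to the
contig as it is currently stored.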
|
|
|
|
Possible join between contig 445 in the + sense and contig 405
|
|
Percentage mismatch after alignment = 4.9
|
|
412 422 432 442 452 462
|
|
405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
|
|
********* * ******** ***** *** ********** ********** **********
|
|
445 -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA
|
|
-127 -117 -107 -97 -87 -77
|
|
472 482 492 502 512
|
|
405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
|
|
********** ********** ********** ********** **
|
|
445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
|
|
-67 -57 -47 -37 -27
|
|
Possible join between contig 443 in the - sense and contig 423
|
|
Percentage mismatch after alignment = 10.4
|
|
64 74 84 94 104 114
|
|
423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG
|
|
**** ***** ********** ********** ****** ** ***** **** *********
|
|
443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
|
|
3610 3620 3630 3640 3650 3660
|
|
124 134 144 154 164
|
|
423 TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA
|
|
*** ****** ********** ** ******* *** ***** ** **
|
|
443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-
|
|
3670 3680 3690 3700 3710
|
|
|
|
|
|
@ end of help
|
|