@-1. TX 0 @General

@-2. T 0 @Screen control

@-2. X 0 @Screen

@-3. TX 0 @Modification

@0. TX -1 @SAP

This is an interactive program whose primary use is for
managing shotgun sequencing projects, but it can also be used for
handling alignments of other sequences, including those of proteins.
Currently the maximum 'gel reading' length is set to 4096
characters. Almost all of the information below describes the use of
the program for shotgun projects, but those using the program for
handling other sequence alignments should interpret it accordingly.
The data for such a project is stored in a special type of database.
The program contains the tools that are required to type in gel
readings, screen them against vector sequences and restriction
sites, and enter new gel readings into the database (automatically
comparing and aligning them). In addition it contains editors and
functions to examine the quality of the aligned sequences.

There are three main menus: "general", "graphics" and
"modification", and some functions have submenus.
The general menu contains the following options:

0 = List of menus
? = Help
! = Quit
3 = Open a database
4 = Edit contig
5 = Display a contig
6 = List a text file
7 = Direct output to disk
8 = Calculate a consensus
17 = Screen against restriction enzymes
18 = Screen against vector
19 = Check consistency
25 = Show relationships
27 = set parameters
28 = Highlight disagreements
29 = Examine quality

The graphics menu contains:

0 = List of menus
? = Help
! = Quit
10 = Clear graphics
11 = Clear text
12 = Draw ruler
13 = Use cross hair
14 = Change margins
15 = Label diagram
16 = Plot map
33 = Plot single contig
34 = Plot all contigs


The modification menu contains:

0 = List of menus
? = Help
! = Quit
4 = Edit a contig
9 = Screen edit
20 = Auto assemble
21 = Enter new gel reading
22 = Join contigs
23 = Complement a contig
24 = Copy database
26 = Alter relationships
30 = Auto edit a contig
31 = Type in gel readings
32 = Extract gel readings

The enter new gel reading menu contains:

? = Help
! = Quit
3 = Complete entry
4 = Edit contig...
5 = Display overlap
6 = Edit new gel reading...

The join contig menu contains:

? = Help
! = Quit
3 = Complete join
4 = Edit left contig...
5 = Display joint
6 = Edit right contig...
7 = Move join

The alter relationships menu contains:

? = Help
! = Quit
3 = Line change
4 = Edit single gel reading...
5 = Delete contig
6 = Shift
7 = Move gel reading
8 = Rename gel reading
9 = Break contig

The edit menu contains:

? = Help
! = Quit
3 = Insert
4 = Delete
5 = Change


Overview of the methodology

The shotgun sequencing strategy

In the shotgun sequencing procedure the sequence to be
determined is randomly broken into fragments of about 400
nucleotides in length. These fragments are cloned and then selected
randomly and their sequences determined. The relationship
between any pair of fragments is not known beforehand but is
found by comparing their sequences. If the sequence of one is
found to be wholly or partially contained within that of another
for sufficient length to distinguish an overlap from a repeat
then those two fragments can be joined. The process of select,
sequence and compare is continued until the whole of the DNA to
be sequenced is in one continuous well determined piece.

Definition of a contig

A CONTIG is a set of gel readings that are related to
one another by overlap of their sequences. All gel readings
belong to a contig and each contig contains at least one gel
reading. The gel readings in a contig can be summed to produce a
continuous consensus sequence and the length of this sequence is the
length of the contig. The rules used to perform this summation are
given under "the consensus algorithm". At any stage of a
sequencing project the data will comprise a number of contigs; when
a project is complete there should be only one contig and its
consensus will be the finished sequence. Note that since being
introduced and defined as above the word "contig" has been taken up
by those involved in genomic mapping. In that context the consensus
with a precise length is not defined.

Introduction to the computer method

It is useful to consider the objectives of a sequencing
project before outlining how we use the computer to help achieve
them. The aim of a shotgun sequencing project is to produce an
accurate consensus sequence from many overlapping gel readings. It
is necessary to know, particularly at the latter stages of the
project, how accurate the consensus sequence is. This enables us to
know which regions of the sequence require further work and also to
know when the project is finished. To show the quality of the
consensus, the programs described here produce displays like that
shown below.


                        10        20        30        40        50
   -6 HINW.010  GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
      CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                        60        70        80        90       100
   -6 HINW.010  CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
   -3 HINW.007                                          GGCACA*GTC
      CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                       110       120       130       140       150
   -6 HINW.010  GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
   -3 HINW.007  GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
   -5 HINW.009                                      GGCAGCGACCGTCG
   17 HINW.999                                         AGCGACCGTCG
      CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                       160       170       180       190       200
   -6 HINW.010  TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
   -3 HINW.007  TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
   -5 HINW.009  TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
   17 HINW.999  TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
   12 HINW.017                                            GTAGAGC*
      CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*

This is an example showing the left end of a contig from
position 1 to 200. Overlapping this region are gel readings
numbered 6, 3, 5, 17 and 12; 6, 3 and 5 are in reverse orientation
to their original reading (denoted by a minus sign). Each gel
reading also has a name (e.g. HINW.010). It can be seen that in a
number of places the sequences contain characters other than A,C,G
and T. Some of these extra characters have been used by the
sequencer to indicate regions of uncertainty in the initial
interpretation of the gel reading, but the asterisks (*) have been
inserted by the automatic assembly function in order to align the
sequences. Underneath each 50 character block of gel reading
sequences is the consensus derived from the sequences aligned above
(the line labelled CONSENSUS). For most of its length the consensus
has a definite nucleotide assignment but in a few positions there is
insufficient agreement between the gel readings and so a dash (-)
appears in the sequence. This display contains all the evidence
needed to assess the quality of the consensus: the number of times
the sequence has been determined on each strand of the DNA, and the
individual nucleotide assignments given for each gel reading.

So the aim is to produce the consensus sequence and, equally
important, a display of the experimental results from which it was
derived.

In order to achieve this the following operations need to be
performed:
1) Interpret autoradiographs and put individual gel readings into
the computer.
2) Check each gel reading to make sure it is not simply part of one
of the vectors used to clone the sequence.
3) Check each gel reading to make sure that those fragments that
span the ligation point used prior to sonication are not assembled
as single sequences.
4) Compare all the remaining gel readings with one another to
assemble them to produce the consensus sequence.
5) Check the quality of the consensus and edit the sequences.
6) When all the consensus is sufficiently well determined, produce a
copy of it for processing by other analysis programs.

It is very unlikely that this procedure will only be passed
through once. Usually steps 1 to 5 are cycled through repeatedly,
with step 4 just adding new sequences to those already assembled.
Generally step 6 is also used in order to analyse imperfect sequence
to check if it is the one the project intended to sequence, or to
look for interesting features. Analysis of the consensus, such as
searches for protein coding regions, can also help to find errors in
the sequence. The display of the overlapping gel readings shown
above can be used to indicate not only the poorly determined
regions, but also which clones should be resequenced to resolve
ambiguities, or those which can usefully be extended or sequenced in
the reverse direction, to cover difficult regions.

The original individual gel readings for a sequencing project
are each stored in separate files. As the gel readings are entered
into the computer (usually in batches, say 10 from a film), the file
names they are given are stored in a further file, called a file of
file names. Files of file names enable gel readings to be processed
in batches.

For each sequencing project we start a project database. This
database has a structure specifically designed for dealing with
shotgun sequence data. In order to arrive at the final consensus
sequence many operations will be performed on the sequence data.
Individual fragments must be sequenced and compared in both senses
(i.e. both orientations) with all the other sequences. When an
overlap between a new gel reading and a contig is found they must
be aligned and the new gel reading added to the contig. If a new gel
reading overlaps two contigs they must be aligned and joined. Before
the two contigs are joined one of them may need to be turned around
(reversed and complemented) so they are both in the same
orientation.

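The "reversed and complemented" operation mentioned above can be
illustrated with a minimal Python sketch. The function name is
hypothetical, and leaving padding and uncertainty symbols unchanged
is an assumption made for illustration; the program's own treatment
of those symbols is not described in this section.

```python
# Illustrative sketch only. The name reverse_complement and the choice
# to leave non-ACGT symbols untouched are assumptions, not the
# program's documented behaviour.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the sequence as it would read on the opposite strand.

    A, C, G and T are complemented; any other symbol (padding '*',
    dash '-', uncertainty codes) is kept as it is but still reversed.
    """
    return seq.translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(reverse_complement("GGCACA*GTC"))  # GAC*TGTGCC
```

Applying the function twice returns the original string, which is a
convenient check when experimenting with it.
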
Clearly, keeping track of all these manipulations is quite
complicated, and to be able to perform the operations quickly
requires careful choice of data structure and algorithms. For these
reasons it is not practicable to store the gel readings aligned as
shown in the display above. Rather, it is more convenient to store
the sequences unassembled, and to record sufficient information for
programs to assemble them during processing. The data used to
assemble the sequences is called relational information.

The database comprises three files and they are described
under the section entitled "open database".

Before entry into the project database each new gel reading
must be compared to look for overlaps with all the data already
contained within the database. This last point is important: all
searching for overlaps is between individual new gel readings and
the data already in the database. There is no searching for overlaps
between sequences within the database; overlaps must be found before
new gel readings are entered into the database.

Below I give an introduction to how the sequences are
processed by being passed from one function to the next.

This program is used to start a database for the project and
then the following procedure is used.

Data in the form of individual gel readings are entered into
the computer and stored in separate files using either this
program or the digitizer program. Batches of these gel readings are
passed to the screening functions in this program to search for
overlaps with vector sequences ("screen against vector") or for
matches to restriction enzyme sites that should not be present
("screen against enzymes"). Each run of these screening functions
passes on only those gel readings that do not contain unwanted
sequences. Sequences are passed via files of file names and
eventually are processed by the automatic assembly function ("auto
assemble"). This function compares each gel reading with a consensus
of all the previous gel readings stored in the database. If it
finds any overlaps it aligns the overlapping sequences by inserting
padding characters, and then adds the new gel reading to the
database. Gels that overlap are added to existing contigs and gels
that do not overlap any data in the database start new contigs. If a
new gel overlaps two contigs they are joined. Any gel readings that
appear to overlap but which cannot be aligned sufficiently well are
not entered and have their names written to a file of failed gel
reading names.

Generally data is entered into the database in batches as just
described. The program is also used to examine the data in the
database, to enter gel readings that the automatic assembly function
cannot align ("enter new gel reading"), and to make final edits.
Edits to whole contigs can be made in several ways. An automatic
editor ("auto edit") will perform almost all edits without any user
intervention, but the program also gives access to the system editor
(EDT on the VAX), through the function "screen edit", and to simple
command driven editors ("edit contig" and "edit new gel reading").
Disagreements between gel readings in contigs and their consensus
sequences can be highlighted by use of the function "highlight
disagreements".

Editing the sequences is obviously an essential part of
managing a sequencing project. Editing is required when new
sequences are added, when contigs are joined, and when sequences are
corrected. A basic part of the strategy used here is that new gel
readings should be correctly aligned throughout their whole length
when they are entered into the database, and that when contigs are
joined they are edited so that they are well aligned in the region
of overlap. Alignment can be achieved by adding padding characters
to the sequences, and this is the way "auto assemble" operates when
adding new sequences to the database.

In order to search for overlaps that may have been missed due
to errors in the gel readings, the function "extract gel readings"
can be used to take copies of the gel readings at the ends of
contigs, and write them out as separate files. These can then be
compared with the database consensus using the "auto assemble"
function in a mode that forbids entry of data into the database, and
any gel reading matching two contigs will indicate a join that has
been missed. The joins can then be made interactively using "join
contigs". Missed matches can be found at this stage because the
errors in the sequences may have been corrected by new data.

Generally the users need not concern themselves with how the
relational information is used by the program, but it is necessary
to know how contigs are identified. Because contigs are constantly
being changed and reordered the program identifies them by the
numbers of the gel readings they contain. Whenever users need to
identify a contig they need only know the number or name of one of
the gel readings it contains. Whenever the program asks users to
identify a contig or gel reading they can type its number or its
archive name. If they type its archive name they must precede the
name by a slash "/" symbol to denote that it is a name rather than a
number. E.g. if the archive name is fred.gel with number 99, users
should type /fred.gel or 99 when asked to identify the contig.
Generally, when it asks for the gel reading to be identified, the
program will offer the user a default name, and if the user types
only return, that contig will be accessed. When a database is opened
the default contig will be the longest one, but if another is
accessed, it will subsequently become the current default.

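The identification rule just described (a bare number, or an archive
name preceded by a slash, with an empty reply selecting the offered
default) might be sketched in Python as follows. The function name
and return shape are hypothetical and are chosen only to make the
rule concrete.

```python
from typing import Optional, Tuple, Union


def parse_identifier(reply: str,
                     default: Optional[str] = None
                     ) -> Tuple[str, Union[int, str]]:
    """Classify a typed reply as a gel reading number or archive name.

    Returns ("number", int) or ("name", str). An empty reply falls
    back to the offered default, as when the user types only return.
    This is an illustrative sketch, not the program's own parser.
    """
    text = reply.strip()
    if not text:
        if default is None:
            raise ValueError("no reply given and no default offered")
        text = default
    if text.startswith("/"):
        name = text[1:]
        if not name:
            raise ValueError("a name must follow the slash")
        return ("name", name)
    # int() raises ValueError if the reply is not a whole number.
    return ("number", int(text))
```

For the example in the text, both `parse_identifier("/fred.gel")` and
`parse_identifier("99")` would refer to the same gel reading, the
first by name and the second by number.
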
Further information is located in the following places. The
database files are described under "open database". The format for
vector and consensus sequences is given under "calculate a
consensus", as are the uncertainty codes used in gel readings.

The only program, other than this one, relevant to sequencing is
the digitizer program and it is outlined briefly below.

The digitizer program is used for the initial input of gel
readings and for writing a file of file names. The program uses a
digitizer for data entry. A digitizer is a two dimensional
surface, such as a light box, on which the coordinates of a special
pen are recorded by a computer whenever the pen is pressed onto it.
These coordinates can be interpreted by a program.

In order to read an autoradiograph placed on the light box the
user need only define the bottom of the four sequencing lanes and
the bases to which they correspond and then use the pen to point
to each successive band progressing up the gel. The program
examines the coordinates of each pen position to see in which of the
four lanes it lies and assigns the corresponding base to be
stored in the computer. Each time the pen tip is depressed to point
to a position on the surface of the digitizer the program sounds
the bell on the terminal to indicate to the user that a point has
been recorded. As the sequence is read the program displays it on
the screen.
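
The lane assignment performed by the digitizer program can be
pictured with the short Python sketch below. It assumes that each
lane is described only by the x coordinate marked at its bottom and
that a pen position belongs to the nearest lane; the real program
may well use lane boundaries or a different geometry, and the
function name is hypothetical.

```python
from typing import Dict, Iterable, Tuple


def read_bands(lane_x: Dict[str, float],
               pen_points: Iterable[Tuple[float, float]]) -> str:
    """Turn successive pen positions into a gel reading.

    lane_x maps each base to the x coordinate marked at the bottom of
    its lane. Each pen point (x, y) is given the base of the nearest
    lane; points are taken in the order in which they were recorded.
    Nearest-lane assignment is an assumption made for illustration.
    """
    if len(lane_x) != 4:
        raise ValueError("expected exactly four lanes")
    bases = []
    for x, _y in pen_points:
        base = min(lane_x, key=lambda b: abs(lane_x[b] - x))
        bases.append(base)
    return "".join(bases)
```

With lanes at x = 10, 20, 30 and 40 for A, C, G and T, pen positions
near x = 29, 11.5, 41 and 19 would be read as G, A, T and C.
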
@17. TX 1 @Screen against restriction enzymes

Used to compare gel readings against any restriction enzyme
recognition sequences that may have been used during cloning and
which should not be present in the data. Works on single gel
readings or processes batches accessed through files of file names.
The algorithm looks for exact matches to recognition sequences
stored in a file.

The file containing the recognition sequences must be
identified. The user must choose between employing a file of file
names, or typing in the names of individual gel reading files. If a
file of file names is used the program will also create a new file
of file names. When the option has finished operating this new file
will contain the names of all those gel readings that did not match
any of the recognition sequences. Hence it can be used for further
processing of the batch. The recognition sequences should be stored
in a simple text file with one recognition sequence per line.
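
The screening step described above (exact matches against a list of
recognition sequences, passing on only the readings that match none)
can be sketched in Python as follows. The function name and the
in-memory dictionary of readings are assumptions made to keep the
example self-contained; the program itself works through files of
file names, and whether it also checks the complementary strand is
not stated here.

```python
from typing import Dict, Iterable, List


def screen_against_sites(readings: Dict[str, str],
                         recognition_sites: Iterable[str]) -> List[str]:
    """Return the names of readings that contain none of the sites.

    recognition_sites holds one recognition sequence per entry, as in
    a text file with one sequence per line; blank entries are ignored.
    Only the strand as given is searched (an assumption).
    """
    sites = [s.strip().upper() for s in recognition_sites if s.strip()]
    passed = []
    for name, sequence in readings.items():
        seq = sequence.upper()
        if not any(site in seq for site in sites):
            passed.append(name)
    return passed
```

The returned list plays the role of the new file of file names: it
holds only the readings that can go on to the next processing step.
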
@18. TX 1 @Screen against vector

Used to compare gel readings against any vector sequences that
may have been picked up during cloning. Works on single gel readings
or processes batches accessed through files of file names. The
algorithm looks for exact matches of length "minimum match length"
and displays the overlapping sequences.

The file containing the vector sequence must be identified.
The user must choose between employing a file of file names, or
typing in the names of individual gel reading files. If a file of
file names is used the program will also create a new file of file
names. When the option has finished operating this new file will
contain the names of all those gel readings that did not match the
vector sequence. Hence it can be used for further processing of the
batch. The vector sequence should be stored in a simple text file
with up to 80 characters of data per line. More than one vector can
be stored in a single file. If so each should be preceded by a 20
character title of the form <---m13mp8.001-----> where the < and >
signs and the number like .001 are obligatory. The number must be
preceded by a dot (.) and be 3 digits long. The total sequence in
the file must be no more than 50,000 characters in length.
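
Two of the rules above lend themselves to a short Python sketch: the
form of the 20 character title, and the search for an exact match of
a given minimum length. Both function names are hypothetical. The
match test looks at one strand only, which is an assumption; the
section does not say whether the program also compares the
complementary strand.

```python
import re

TITLE_LENGTH = 20


def is_vector_title(line: str) -> bool:
    """Check the stated title rules: 20 characters, enclosed in < and >,
    and containing a dot followed by exactly three digits."""
    line = line.rstrip("\n")
    return (len(line) == TITLE_LENGTH
            and line.startswith("<") and line.endswith(">")
            and re.search(r"\.\d{3}(?!\d)", line) is not None)


def shares_exact_match(reading: str, vector: str, min_match: int) -> bool:
    """True if any stretch of min_match characters of the reading
    occurs exactly in the vector (same strand only, by assumption)."""
    if min_match < 1:
        raise ValueError("min_match must be positive")
    # Index every word of length min_match found in the vector.
    words = {vector[i:i + min_match]
             for i in range(len(vector) - min_match + 1)}
    return any(reading[i:i + min_match] in words
               for i in range(len(reading) - min_match + 1))
```

A reading for which `shares_exact_match` is true would be left out of
the new file of file names.
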
@20. TX 2 @Auto assemble

Compares gel readings against the current contents of the
database and produces alignments. In its normal mode of operation
("entry permitted"), the function will automatically enter the gel
readings into the database, but if entry is not permitted it will
only produce alignments. It works on single gel readings or
processes batches of gel readings accessed through files of file
names. It is the usual way to enter data into the database.

The function will check the database for logical consistency
and will only proceed if it is OK. Choose if gel readings should be
entered into the database, or if they should only be compared.
Choose between using a file of file names or typing file names on
the keyboard. If so selected, supply the file of file names. Also
supply a file of file names to contain the names of all the gel
readings that fail to get entered. Select the entry mode. Normal
assembly is appropriate for all but special cases, as is "permit
joins". Uses for the other modes are not documented here. Define a
minimum initial match length. Define a minimum alignment block (the
default value is taken in all but exceptional circumstances). Define
the maximum number of padding characters allowed to be used in each
gel reading to help achieve alignment, and the same for the number
allowed in the contig for each gel reading. Finally define the
maximum percentage mismatch to be allowed for any gel reading to be
entered into the database. If for any gel reading, any of these
last three values is exceeded the gel reading will not be entered
into the database.

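The acceptance rule in the last sentence (a reading is refused if
any of the three limits is exceeded) can be written as a small
Python sketch. The class and function names are hypothetical; the
default values are the ones offered in the prompts of the sample
dialogue for this function, and how the program itself counts pads
and mismatches is not reproduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EntryLimits:
    # Defaults as offered in the sample dialogue prompts.
    max_pads_in_gel: int = 8
    max_pads_in_contig: int = 8
    max_percent_mismatch: float = 8.0


def may_enter(pads_in_gel: int, pads_in_contig: int,
              percent_mismatch: float,
              limits: EntryLimits = EntryLimits()) -> bool:
    """A reading is entered only if none of the three limits is exceeded."""
    return (pads_in_gel <= limits.max_pads_in_gel
            and pads_in_contig <= limits.max_pads_in_contig
            and percent_mismatch <= limits.max_percent_mismatch)
```

For instance, an alignment needing 1 pad in the gel reading, none in
the contig, and showing 1.8 percent mismatch would pass with the
default limits.
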
In operation the function takes a batch of gel readings
(probably passed on as a file of file names from one of the
screening routines) and enters them into a database for a sequencing
project. It takes each gel reading in turn, compares it with the
current consensus for the database, and then produces an alignment
for any regions of the consensus it overlaps; if this
alignment is sufficiently good it then edits both the new gel
reading and the sequences it overlaps and adds the new gel
reading to the database. The program then updates the consensus
accordingly and carries on to the next gel reading.

All alignments are displayed and any gel readings that do
match but that cannot be aligned sufficiently well have their names
written to a file of failed gel reading names. The function works
without any user intervention and can process any number of gel
readings in a single run. Those gel readings that fail can be
recompared using the same function (to find the current overlap
position) and the user can enter them into the database manually
using the "enter new gel reading" option.

Typical dialogue and output from the function is shown below.
(Note that output for gel readings 2 - 9 has been deleted to save
space).
Automatic sequence assembler
Database is logically consistent
? (y/n) (y) Permit entry
? (y/n) (y) Use file of file names
? File of gel reading names=demo.nam
? File for names of failures=demo.fail
Select entry mode
X 1 Perform normal shotgun assembly
  2 Put all sequences in one contig
  3 Put all sequences in new contigs
? Selection (1-3) (1) =
? (y/n) (y) Permit joins
? Minimum initial match (12-4097) (15) =
? Minimum alignment block (2-5) (3) =
? Maximum pads per gel (0-25) (8) =
? Maximum pads per gel in contig (0-25) (8) =
? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing 1 in batch
Gel reading name=HINW.004
Gel reading length= 283
Searching for overlaps
Strand 1
Strand 2
No matches found
Total matches found 1
Padding in contig= 0 and in gel= 1
Percentage mismatch after alignment = 1.8
Best alignment found
1          11         21         31         41         51
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
********** ********** ********** ********** ********** **********
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
1          11         21         31         41         51
61         71         81         91         101        111
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
********** ********** ********** ********** ********** ***** ****
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
61         71         81         91         101        111
121        131        141        151        161        171
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
********** ********** ********** ********** ********** **********
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
121        131        141        151        161        171
181        191        201        211        221        231
GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
********** ********** * ******** ********** ********** **********
GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
181        191        201        211        221        231
241        251        261        271        281
CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
****  **** ********** ** * ***** ********** ****
CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
241        251        261        271        281
Batch finished
9 sequences processed
0 sequences entered into database
0 joins made


Note that "auto assemble" cannot align protein sequences.
@28. TX 1 @Highlight disagreements

Used in the latter stages of a project to highlight
disagreements between individual gel readings and their consensus
sequences. Characters that agree with the consensus are shown as :
symbols for the plus strand and . for the minus strand. Characters
that disagree with the consensus are left unchanged and so stand out
clearly. The results of this analysis are written to a file.

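The substitution rule described above can be illustrated with a few
lines of Python. This is a sketch of the idea only: the function
name and its arguments are hypothetical, and it works on one aligned
reading at a time rather than on a display file as the program does.

```python
def highlight(reading: str, consensus: str, reverse: bool,
              plus_symbol: str = ":", minus_symbol: str = ".") -> str:
    """Replace characters that agree with the consensus by a strand symbol.

    reading and consensus are aligned strings of equal length; reverse
    is True for a reading displayed with a minus sign (minus strand).
    Characters that disagree are left unchanged so they stand out.
    """
    if len(reading) != len(consensus):
        raise ValueError("reading and consensus must be the same length")
    agree = minus_symbol if reverse else plus_symbol
    return "".join(agree if r == c else r
                   for r, c in zip(reading, consensus))
```

For example, a plus strand reading GCTATT aligned against the
consensus G-TATT gives :C::::, with the disagreeing C left visible.
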
Before selecting this option create a file of the display of
the contig to be "highlighted". The option will ask for the name of
this file. Select symbols to denote "agreeing" characters on each
strand; the defaults are : and ., but any others can be used. Supply
the name of a file in which to put the output.

The display file needed as input for this option is created by
selecting "Redirect output", followed immediately by "display
contig", and then "Redirect output" again. The cutoff score used in
the consensus calculation can be set by option "set display
parameters". Note that for the highlight function there is a limit
of 50 for the number of gel readings that are aligned at any
position - i.e. the contig must be less than 51 gel readings deep at
its thickest point. I hope that those performing shotgun sequencing
never reach this limit, but those using the program for comparing
sequence families might.

Typical output from this function is shown below.

                       210       220       230       240       250
    1 HINW.004  :C::::::::::::::::::::::::::::::::::::::::::AC::::
    7 HINW.018  :*::::::::::::::::::::::::::::::::::::::::::CA::::
   -4 HINW.017                               ...............AC....
                G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG

                       260       270       280       290       300
    1 HINW.004  ::::::::::::*:D:::::::::::::::::::
    7 HINW.018  ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
   -4 HINW.017  ..............................................A...
    3 HINW.009  :::::::::::::::V::::::::::::::::::::::::::::*AV:::
   -6 HINW.028                          ......................A...
                AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG

@32. TX 3 @Extract gel readings

Used to make copies of the aligned gel readings in a database,
to write them into separate files, and to write a corresponding file
of file names. It operates in two modes: either all gel readings are
extracted, or only those at the ends of contigs.

Choose which mode of operation is required and supply a file
of file names.

The gel readings are given their original names. If used to
extract the gel readings from the ends of contigs the function is
useful for checking for missed contig joins: the file of file names
can be used with the auto assemble function to recompare these gel
readings, and each should only overlap one contig. Any that overlap
two contigs will identify possible joins.

If the option is used to extract all the gel readings from a
database, a subsequent run of "auto assemble" can reconstitute a
database which has been corrupted. This rarely occurs and is
usually necessitated by a user employing "alter relationships"
incorrectly without first having made a copy.
@1. TX 0 @Help
|
||
|
|
||
|
Help is available on the following topics :
|
||
|
@2. TX 0 Quit
|
||
|
|
||
|
This command stops the program and is the only safe way to
|
||
|
terminate a run of the program that has altered the contents of the
|
||
|
database in any way.
|
||
|
@3. TX 1 @Open a database
|
||
|
|
||
|
Opens existing databases or allows new ones to be started. The
|
||
|
function is automatically called into operation when the program is
|
||
|
started but can also be selected from the general menu.
|
||
|
|
||
|
Choose to open an existing database or start a new one, or if
|
||
|
! is typed when the program is first started, enter the program
|
||
|
without opening a database. Supply a project database name, and if
|
||
|
it already exists, the "version". If starting a new database define
|
||
|
the database size and if it is for DNA or protein sequences. The
|
||
|
database size is an initial size for the database. It can be
|
||
|
increased later during the project. It is the sum of the number of
|
||
|
gel readings plus the number of contigs.
|
||
|
|
||
|
Database names can have from one to 12 letters and must not
|
||
|
include full stop (.). The database is made from three separate
|
||
|
files. If the database is called FRED then version 0 of database
|
||
|
FRED comprises files FRED.AR0, FRED.RL0 and FRED.SQ0. The version is
|
||
|
the last symbol in the file names. Only this program can read these
|
||
|
files. If the "copy database" option is used it will ask the user to
|
||
|
define a new "version".
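
The naming scheme above can be illustrated with a short Python sketch.
The function name database_files and its error messages are assumptions
made for this example; only the AR, RL and SQ extensions and the naming
rules come from the text.

```python
def database_files(name, version):
    """Return the three file names that make up one version of a database.

    Hypothetical helper: the rules checked here (1 to 12 characters, no
    full stop, version as the last symbol) are taken from the text above.
    """
    if not 1 <= len(name) <= 12:
        raise ValueError("database names have from one to 12 letters")
    if "." in name:
        raise ValueError("database names must not include a full stop")
    version = str(version)
    if len(version) != 1:
        raise ValueError("the version is a single symbol")
    return [f"{name}.{ext}{version}" for ext in ("AR", "RL", "SQ")]


print(database_files("FRED", 0))
```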
|
||
|
|
||
|
For normal use the maximum gel reading length is set to 512
|
||
|
characters, but when a database is started the user may choose
|
||
|
lengths of 512, 1024, 1536, ..., 4096. Normally the program is
|
||
|
used to handle DNA sequences but many of the functions also work on
|
||
|
protein sequences. The choice of sequence type is made when the
|
||
|
database is started.
|
||
|
|
||
|
The contigs are not stored on the disk as the user sees them
|
||
|
displayed on the screen. Each gel reading is stored with sufficient
|
||
|
information about how it overlaps other gel readings so that the
|
||
|
program can work out how to present them aligned on the screen. We
|
||
|
refer to this extra data as "the relationships" and it is explained
|
||
|
below. The database comprises 3 separate files.
|
||
|
1. a working version of each gel reading. This is the version of
|
||
|
the gel reading that is in the database and initially it is an
|
||
|
exact copy of the original sequence (known as the archive) but it is
|
||
|
edited and manipulated to align it with other gel readings.
|
||
|
2. the file of relationships. This file contains all of the
|
||
|
information that is required to assemble the working versions into
|
||
|
contigs during processing; any manipulations on the data use this
|
||
|
file and it is automatically updated at any time that the
|
||
|
relationships are changed. The information in this file is as
|
||
|
follows:
|
||
|
(A) Facts about each gel reading and its relationship to
|
||
|
others ("gel descriptor lines"):
|
||
|
(a) the number of the gel reading (each gel reading is given a
|
||
|
number as it is entered into the database)
|
||
|
(b) the length of the sequence from this gel reading
|
||
|
(c) the position of the left end of this gel reading relative to
|
||
|
the left end of the contig of which it is a member
|
||
|
(d) the number of the next gel reading to the left of this gel
|
||
|
reading
|
||
|
(e) the number of the next gel reading to the right
|
||
|
(f) the relative strandedness of this gel reading, i.e. whether it
|
||
|
is in the same sense or the complementary sense as its archive.
|
||
|
(B) Facts about each contig ("contig descriptor lines"):
|
||
|
(a) the length of this contig
|
||
|
(b) the number of the leftmost gel reading of this contig
|
||
|
(c) the number of the rightmost gel reading of this contig.
|
||
|
(C) General facts:
|
||
|
(a) the number of gel readings in the database
|
||
|
(b) the number of contigs in the database.
|
||
|
3. the file of archive names. This is simply a list of the names
|
||
|
of each of the archive files in the database but on line number 1000
|
||
|
we also store the size of the database, i.e. the number of lines of
|
||
|
information allowed in the database files. This file always has 1000
|
||
|
lines but the length of the file of relationships and the file of
|
||
|
working versions can be set by the user when creating a database or
|
||
|
when copying from one to another.
|
||
|
|
||
|
Structure of the database files
|
||
|
|
||
|
1. The file of relationships
|
||
|
|
||
|
The file contains IDBSIZ lines of data: the general data are
|
||
|
stored on line IDBSIZ; data about gel readings are stored from
|
||
|
line 1 downwards; data about contigs are stored from line IDBSIZ-1
|
||
|
upwards. A database of 500 lines containing 25 gel readings and 4
|
||
|
contigs would have a file of relationships as is shown below.
|
||
|
|
||
|
|
||
|
   ---------------------------------------------
     1   Gel descriptor record
     2    "      "        "
     3    "      "        "
     4    "      "        "
     5    "      "        "
     '    '      '        '
     '    '      '        '
    25    "      "        "
    26   Empty record
     '     '      '

     '     '      '
   495     '      '
   496   Contig descriptor record
   497     "        "        "
   498     "        "        "
   499     "        "        "
   500   Number of gel readings=25, Number of contigs=4
   ---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the file of relationships
|
||
|
|
||
|
As each new gel reading is added into the database a new line is
|
||
|
added to the end of the list of gel descriptor lines. If this
|
||
|
new gel reading does not overlap with any gel readings already in
|
||
|
the database a new contig line is added to the top of the list
|
||
|
of contig lines. If it overlaps with one contig then no new contig
|
||
|
line need be added but if it overlaps with two contigs then
|
||
|
these two contigs must be joined and the number of contig lines
|
||
|
will be reduced by one. Then the list of contig lines is compressed
|
||
|
to leave the empty line at the top of the list. Initially the two
|
||
|
types of line will move towards one another but eventually, as
|
||
|
contigs are joined, the contig descriptor lines will move in the
|
||
|
same direction as the gel descriptor lines. At the end of a
|
||
|
project there should be only one contig line. The database is
|
||
|
thus capable of handling a project of 998 gels.
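
As a rough illustration of the layout just described, the Python sketch
below works out which lines of the file of relationships are occupied
for a given database size. The function name relationship_layout and the
returned dictionary are assumptions for the example, not part of the
program; line numbers are 1-based as in the text.

```python
def relationship_layout(idbsiz, n_gels, n_contigs):
    """Describe which lines of the file of relationships are in use.

    Hypothetical helper. Gel lines grow downwards from line 1, contig
    lines grow upwards from line idbsiz - 1, and line idbsiz holds the
    general data.
    """
    first_contig_line = idbsiz - n_contigs
    if n_gels >= first_contig_line:
        raise ValueError("database is full: gel and contig lines collide")

    def span(first, last):
        # An empty range is reported as None.
        return (first, last) if first <= last else None

    return {
        "gel_lines": span(1, n_gels),
        "empty_lines": span(n_gels + 1, first_contig_line - 1),
        "contig_lines": span(first_contig_line, idbsiz - 1),
        "general_line": idbsiz,
    }


# The example from the text: 500 lines, 25 gel readings, 4 contigs.
print(relationship_layout(500, 25, 4))
```

With a size of 1000 and a single contig the same arithmetic leaves room
for 998 gel readings, which agrees with the figure quoted above.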
|
||
|
|
||
|
Structure of the working versions file
|
||
|
|
||
|
The working versions of gel readings are stored in a file
|
||
|
of IDBSIZ lines each containing 512 characters. Gel reading number
|
||
|
1 is stored on line 1, gel reading number 2 on line 2 and so on.
|
||
|
|
||
|
Structure of the archive names file
|
||
|
|
||
|
This file, unlike the others, always has 1000 lines each 10
|
||
|
characters in length. Its length is fixed because line 1000 is used
|
||
|
to store IDBSIZ the database size and the programs need a definite
|
||
|
location from which to read this number.
|
||
|
|
||
|
Safeguarding the database
|
||
|
|
||
|
It is advisable to copy regularly (using the copy function of
|
||
|
DS) from say copy 0 to copy 1 in case of errors.
|
||
|
|
||
|
I also recommend setting the protection codes on copy 0 of
|
||
|
each database so that users cannot delete the files without first
|
||
|
resetting the protection codes. This will protect you from
|
||
|
accidentally deleting the files. Users at LMB can use the PROTECT
|
||
|
command for this purpose.
|
||
|
|
||
|
The give-up options allow users to change their minds about
|
||
|
entering a new gel reading or joining two contigs without
|
||
|
affecting the file of relationships. BUT if the edit contig
|
||
|
option from either of these two functions has been used the
|
||
|
edits will remain even though the user has "given up". To leave the
|
||
|
files completely unaffected the user could, if required, undo
|
||
|
any edits before "giving up".
|
||
|
|
||
|
There are various checks within the programs to protect
|
||
|
users from themselves:-
|
||
|
1. All user input is checked for errors - e.g. reference to
|
||
|
non-existent gel readings or contigs, incorrect positions in the
|
||
|
contig or gel readings.
|
||
|
2. Before entering a gel reading the system checks to see if a file
|
||
|
of the same name has already been entered.
|
||
|
3. Join will not allow the circularising of a contig.
|
||
|
4. Both enter and join functions restrict the region that
|
||
|
the user is allowed to edit (using edit contig) to the region of
|
||
|
overlap.
|
||
|
5. Users may escape from any point in the program.
|
||
|
6. Help is available from all points in the program.
|
||
|
|
||
|
|
||
|
IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS DOING
|
||
|
ANYTHING THAT INVOLVES CHANGING THE CONTENTS OF THE DATABASE, I.E.
|
||
|
DURING AUTO ASSEMBLE, COMPLETE ENTRY, COMPLETE JOIN, COMPLEMENT
|
||
|
CONTIG, EDIT CONTIG, AND SCREEN EDIT. This could corrupt the
|
||
|
database so badly that it is impossible to fix. The program should
|
||
|
always be left using the QUIT option.
|
||
|
@4. TX 3 @Edit
|
||
|
|
||
|
A simple command-driven editor that can insert, delete and
|
||
|
change gel reading sequences. Insert, delete and change commands
|
||
|
will request the position at which the edit is required and the
|
||
|
number of characters to insert, delete or change. The default
|
||
|
character for insertions is *.
|
||
|
|
||
|
There are three modes of editing offered by this editor
|
||
|
depending where it is selected from. New gel readings can be edited
|
||
|
as they are being entered into the database, contigs can be edited
|
||
|
with alignments being automatically maintained, or gel readings in
|
||
|
contigs can be edited without the maintenance of alignments.
|
||
|
The following commands can be used.
|
||
|
|
||
|
? = Help
|
||
|
! = Quit
|
||
|
3 = Insert
|
||
|
4 = Delete
|
||
|
5 = Change
|
||
|
|
||
|
|
||
|
All commands request the position at which the edit should be
|
||
|
made. (Note that the position refers to the position in the contig
|
||
|
for gel readings in the database, but to the position in the gel
|
||
|
reading if you are editing a new gel reading while entering it into
|
||
|
the database.)
|
||
|
|
||
|
All commands request the number of characters to operate on.
|
||
|
(Note that if you are editing a contig the program will ask for the
|
||
|
characters to insert into each separate gel reading, hence allowing
|
||
|
different changes to be made to each. Also the default character is
|
||
|
asterisk (*), i.e. if you include a space in the string it will be
|
||
|
replaced by an asterisk, or if you simply type return the whole
|
||
|
string inserted will be asterisks.)
|
||
|
"Change" allows characters in individual gel readings to be
|
||
|
replaced. If the user is not editing a new gel reading during
|
||
|
"enter new gel reading" the program will request the number of the
|
||
|
gel reading to edit. (When editing gel readings in contigs the
|
||
|
program responds with the relative position and length of the
|
||
|
selected gel reading in case the user only knows the edit
|
||
|
position relative to the gel reading. (The edit position must
|
||
|
be relative to the contig.))
|
||
|
Further notes on editing
|
||
|
|
||
|
When you are editing a contig the program maintains the
|
||
|
alignments of the gel readings by always making the same number of
|
||
|
insertions or deletions in all the gels. Note that these edits are
|
||
|
immediately carried out and the "Quit" options of "enter new gel
|
||
|
reading" and "join contigs" do not undo them. Users must undo them
|
||
|
themselves. Note that if this option has been entered from either
|
||
|
"enter new gel reading" or "join contigs" the program will restrict
|
||
|
edits to the region of overlap. DO NOT KILL THE PROGRAM DURING
|
||
|
EDIT CONTIG!
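
The way an insertion keeps the gel readings aligned can be sketched in
Python as follows. This is only an illustration under stated
assumptions: the dictionary keys, the function name insert_padding, and
the choice to shift (rather than pad) a gel reading whose left end lies
exactly at the edit position are all hypothetical and are not taken
from the program.

```python
def insert_padding(gels, position, count, pad="*"):
    """Insert count padding characters at a contig position.

    gels is a list of dicts with "position" (1-based left end in the
    contig) and "sequence". Readings that span the position receive the
    padding; readings that start at or to the right of it are shifted.
    """
    for gel in gels:
        left = gel["position"]
        right = left + len(gel["sequence"]) - 1
        if left >= position:
            # Entirely to the right of the edit: only its position moves.
            gel["position"] = left + count
        elif right >= position:
            # Covers the edit position: pad the sequence itself.
            offset = position - left
            seq = gel["sequence"]
            gel["sequence"] = seq[:offset] + pad * count + seq[offset:]
    return gels


gels = [
    {"position": 1, "sequence": "ACGTACGT"},
    {"position": 5, "sequence": "ACGT"},
    {"position": 3, "sequence": "GT"},
]
insert_padding(gels, position=5, count=2)
for gel in gels:
    print(gel["position"], gel["sequence"])
```

Every reading to the right of the edit still lines up with the same
characters as before, because all of them have moved by the same amount.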
|
||
|
|
||
|
When editing a single gel reading in a contig from "alter
|
||
|
relationships" (which you should not normally need to do) the
|
||
|
program will correct the length of the individual gel reading, but
|
||
|
it will not update the length of the contig if it has changed.
|
||
|
|
||
|
The program contains better methods than this simple command
|
||
|
driven editor, for making multiple edits to contigs. "Screen edit",
|
||
|
gives access to the system editor on your machine, and "auto edit"
|
||
|
will edit a whole contig automatically.
|
||
|
@9. TX 3 @Screen edit
|
||
|
|
||
|
Gives access to the system editor on the machine (for example
|
||
|
EDT on a VAX) and allows users to edit contigs. The contigs are
|
||
|
presented as for "display contig" and the program will reconstitute
|
||
|
the contig's sequences and relationships when the editor is exited.
|
||
|
|
||
|
To screen edit a contig set the line length to 50 characters,
|
||
|
select the contig to edit, and supply the name of a temporary file
|
||
|
in which the editing will be performed. After a short pause the
|
||
|
system editor will present the first page of the file. Edit the file
|
||
|
obeying the rules given below. Exit from the editor and affirm the
|
||
|
intention of returning the contig to the database. The program will
|
||
|
put the contig back into the database.
|
||
|
|
||
|
Rules for screen editing
|
||
|
|
||
|
There are some limitations on the changes that can be made to
|
||
|
the contigs when using the screen editor. Users are unlikely to want
|
||
|
to break the rules in order to achieve changes to contigs, but
|
||
|
nevertheless the constraints need to be defined and they are given
|
||
|
below.
|
||
|
|
||
|
Alignments must be maintained during editing. Whole lines of
|
||
|
sequence should not be deleted or added unless the order of the gel
|
||
|
readings in the contig is preserved. Each line in the contig
|
||
|
display consists of gel reading numbers, their names and 50
|
||
|
character sections of sequence. Insertions are limited in the
|
||
|
following way. No line of sequence can be extended rightwards more
|
||
|
than 10 characters beyond the end of a full length line (a full
|
||
|
length line is 50 characters long). Only one character can be added
|
||
|
to the left end of full length lines, but sections of sequence
|
||
|
beginning further into a line can be extended leftwards up to an
|
||
|
equivalent position. Do not delete any non-sequence lines in the
|
||
|
file.
|
||
|
|
||
|
Before returning the contig to the database the program checks
|
||
|
that the rules have been obeyed. If an error is found the number of
|
||
|
the erroneous line in the file is displayed and the contig will not
|
||
|
be changed.
|
||
|
@5. TX 1 @Display a contig
|
||
|
|
||
|
Used to show the aligned gel readings for any part of a
|
||
|
contig. The number, name and strandedness of each gel reading is
|
||
|
shown and the consensus is written below.
|
||
|
|
||
|
If required identify the contig, and then the start and end
|
||
|
points of the region to display.
|
||
|
|
||
|
The display can be directed to a disk file using "direct
|
||
|
output to disk". These files are required by options: "screen edit"
|
||
|
and "highlight disagreements", and printed copies of them are very
|
||
|
useful for marking corrections prior to using the editors.
|
||
|
|
||
|
Below is an example showing the left end of a contig from
|
||
|
position 1 to 200. Overlapping this region are gels 6, 3, 5, 17 and
|
||
|
12; 6, 3 and 5 are in reverse orientation to their archives (denoted
|
||
|
by a minus sign). There are a few uncertainty codes and a few
|
||
|
padding characters in the working versions, but the consensus
|
||
|
(shown below each page width) has a definite assignment for almost
|
||
|
every position.
|
||
|
|
||
|
                          10        20        30        40        50
   -6 HINW.010    GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
      CONSENSUS   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                          60        70        80        90       100
   -6 HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
   -3 HINW.007                                            GGCACA*GTC
      CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                         110       120       130       140       150
   -6 HINW.010    GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
   -3 HINW.007    GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
   -5 HINW.009                                        GGCAGCGACCGTCG
   17 HINW.999                                           AGCGACCGTCG
      CONSENSUS   GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                         160       170       180       190       200
   -6 HINW.010    TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
   -3 HINW.007    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
   -5 HINW.009    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
   17 HINW.999    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
   12 HINW.017                                              GTAGAGC*
      CONSENSUS   TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
@6. TX 1 @List a text file
|
||
|
|
||
|
This option allows users to list text files on the screen. It
|
||
|
can be used to read a file containing notes, for checking files
|
||
|
written to disk etc. The user is asked to type the name of the file
|
||
|
to list.
|
||
|
@8. TX 1 @Calculate a consensus
|
||
|
|
||
|
Calculates a consensus sequence either for the whole
|
||
|
database or for selected contigs. The consensus is written to a file
|
||
|
named by the user.
|
||
|
Supply a file name, choose between whole database or selected
|
||
|
contigs.
|
||
|
|
||
|
Symbols for uncertainty in gel readings
|
||
|
|
||
|
In order to record uncertainties when reading gels the
|
||
|
codes shown below can be used. Use of these codes permits us to
|
||
|
extract the maximum amount of data from each gel and yet record any
|
||
|
doubts by choice of code. The program can deal with all of
|
||
|
these codes and any other characters in a sequence are treated
|
||
|
as dash (-) characters.
|
||
|
|
||
|
     SYMBOL     MEANING

       1        PROBABLY C
       2           "     T
       3           "     A
       4           "     G
       D           "     C    POSSIBLY CC
       V           "     T       "     TT
       B           "     A       "     AA
       H           "     G       "     GG
       K           "     C       "     C-
       L           "     T       "     T-
       M           "     A       "     A-
       N           "     G       "     G-
       R        A OR G
       Y        C OR T
       5        A OR C
       6        G OR T
       7        A OR T
       8        G OR C
       -        A OR G OR C OR T
       a        A  set by auto edit
       c        C  set by auto edit
       g        G  set by auto edit
       t        T  set by auto edit
       *        padding character placed by auto assembler
      else      = -
|
||
|
|
||
|
The DNA consensus algorithm
|
||
|
|
||
|
The "calculate consensus" function, the "display contig"
|
||
|
routine and the "examine quality" option use the rules outlined here
|
||
|
to calculate a consensus from aligned gel readings. Note that
|
||
|
"display contig" calculates a consensus for each page width it
|
||
|
displays (it does not use the consensus sequence file calculated
|
||
|
by the consensus function).
|
||
|
|
||
|
We have 6 possible symbols in the consensus sequence: A,C,G,T,*
|
||
|
and -. The last symbol is assigned if none of the others makes up a
|
||
|
sufficient proportion of the aligned characters at any position in
|
||
|
the contig. The following calculation is used to decide which symbol
|
||
|
to place in the consensus at each position.
|
||
|
|
||
|
Each uncertainty code contributes a score to one of A,C,G,T,*
|
||
|
and also to the total at each point. Symbols like R and Y which
|
||
|
don't correspond to a single base type contribute only to the total
|
||
|
at each point. The scores are shown below.
|
||
|
definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1
|
||
|
|
||
|
probable assignments ie 1,2,3,4 = 0.75
|
||
|
|
||
|
other uncertainty codes including R,Y,5,6,7,8,- = 0.1
|
||
|
|
||
|
|
||
|
A cutoff score of 51% to 100% is supplied by the user. (When
|
||
|
the program starts this is set to 75%. See "set display
|
||
|
parameters"). At each position in the contig we calculate the total
|
||
|
score for each of the 5 symbols A,C,G,T and * (denote these by Xi,
|
||
|
where i=A,C,G,T or *), and also the sum of these totals (denote this
|
||
|
by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed
|
||
|
in the consensus; otherwise - is assigned.
|
||
|
|
||
|
Notice that S does not equal the number of times the sequence
|
||
|
has been determined, but is the score total, and hence we are less
|
||
|
likely to put a - in the consensus. For the "examine quality"
|
||
|
algorithm each strand is treated separately but the calculation is
|
||
|
the same. (It was originally different).
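
The calculation described above can be sketched in Python for a single
column of aligned characters. This is a reading of the rules in the
text, not the program's own code: the names DEFINITE, PROBABLE and
consensus_symbol are hypothetical, the total S is assumed to include
the 0.1 contributions from codes such as R and Y, and the comparison
with the cutoff is taken to be strictly "greater than" as written.

```python
# Base that each code scores for. Codes absent from both tables
# (R, Y, 5, 6, 7, 8, - and any other character) add 0.1 to the total only.
DEFINITE = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "a": "A", "c": "C", "g": "G", "t": "T",
    "D": "C", "V": "T", "B": "A", "H": "G",
    "K": "C", "L": "T", "M": "A", "N": "G",
    "*": "*",
}
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for one column of aligned characters."""
    scores = {symbol: 0.0 for symbol in "ACGT*"}
    total = 0.0
    for ch in column:
        if ch in DEFINITE:
            scores[DEFINITE[ch]] += 1.0
            total += 1.0
        elif ch in PROBABLE:
            scores[PROBABLE[ch]] += 0.75
            total += 0.75
        else:
            total += 0.1
    for symbol, score in scores.items():
        if total > 0 and 100.0 * score / total > cutoff:
            return symbol
    return "-"


print(consensus_symbol("3A"))   # a probable A aligned with a definite A
print(consensus_symbol("C*"))   # one C against one padding character
```

Because an R or a Y adds only 0.1 to the total, a single ambiguous
reading does little to stop a well supported base from reaching the
cutoff.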
|
||
|
|
||
|
Format of the consensus sequence (and vector sequences).
|
||
|
|
||
|
A consensus sequence file may contain the consensus for
|
||
|
several contigs and so we identify each of them by preceding them by
|
||
|
a 20 character title. The title is of the form <---LAMBDA.076----->
|
||
|
(where LAMBDA is the project name and gel reading number 76 is the
|
||
|
leftmost gel reading to contribute to this consensus sequence).
|
||
|
The angle brackets <> and the three-digit number preceded by a .
|
||
|
are important to some processing programs.
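
A processing program might recognise such a title along the following
lines. This Python sketch is an assumption about how the title could be
parsed, based only on the single example given above; the pattern and
the function name parse_title are hypothetical, and project names that
themselves contain a dash or a full stop are not handled.

```python
import re

# "<", dashes, project name, ".", three-digit gel number, dashes, ">"
TITLE = re.compile(r"^<-*([^-.<>]+)\.(\d{3})-*>$")


def parse_title(title):
    """Split a 20 character consensus title into (project, gel number)."""
    if len(title) != 20:
        raise ValueError("a title is 20 characters long")
    match = TITLE.match(title)
    if match is None:
        raise ValueError("not a consensus title")
    return match.group(1), int(match.group(2))


print(parse_title("<---LAMBDA.076----->"))
```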
|
||
|
@25. TX 1 @Show relationships
|
||
|
|
||
|
Used to show the relationships of the gel readings in the
|
||
|
database in three ways -
|
||
|
(a) All contig descriptor lines followed by all gel descriptor
|
||
|
lines.
|
||
|
(b) All contigs one after the other sorted, i.e. for each
|
||
|
contig show its contig descriptor line followed by all its gel
|
||
|
descriptor lines sorted on position from left to right
|
||
|
(c) Selected contigs: show the contig line and, in order, those
|
||
|
gel readings that cover a user-defined region. Note that this
|
||
|
output can be directed to a disk file by prior selection of "disk
|
||
|
output".
|
||
|
|
||
|
Below is an example showing a contig from position 1 to 689.
|
||
|
The left gel reading is number 6 and has archive name HINW.010, the
|
||
|
rightmost gel reading is number 2 and has archive name HINW.004.
|
||
|
On each gel descriptor line is shown: the name of the archive
|
||
|
version, the gel number, the position of the left end of the gel
|
||
|
reading relative to the left end of the contig, the length of
|
||
|
the gel reading (if this is negative it means that the gel reading
|
||
|
is in the opposite orientation to its archive), the number of the
|
||
|
gel reading to the left and the number of the gel reading to the
|
||
|
right.
|
||
|
|
||
|
|
||
|
                 CONTIG LINES
 CONTIG LINE   LENGTH          ENDS
                           LEFT   RIGHT
      48          689         6       2
                  GEL LINES
 NAME       NUMBER  POSITION  LENGTH    NEIGHBOURS
                                        LEFT   RIGHT
 HINW.010        6         1    -279       0       3
 HINW.007        3        91    -265       6       5
 HINW.009        5       137    -299       3      17
 HINW.999       17       140     273       5      12
 HINW.017       12       193     265      17      18
 HINW.031       18       385    -245      12       2
 HINW.004        2       401    -289      18       0
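
The gel lines in this example can be followed from left to right by
chasing the right neighbours, as the Python sketch below shows. The
tuple layout and the function names are assumptions for the example;
the use of 0 for "no neighbour" is inferred from the listing, and the
contig length is taken to be the rightmost position reached by any gel
reading, which agrees with the 689 shown on the contig line.

```python
# (name, number, position, length, left neighbour, right neighbour)
GEL_LINES = [
    ("HINW.010", 6, 1, -279, 0, 3),
    ("HINW.007", 3, 91, -265, 6, 5),
    ("HINW.009", 5, 137, -299, 3, 17),
    ("HINW.999", 17, 140, 273, 5, 12),
    ("HINW.017", 12, 193, 265, 17, 18),
    ("HINW.031", 18, 385, -245, 12, 2),
    ("HINW.004", 2, 401, -289, 18, 0),
]


def walk_contig(gels, leftmost):
    """Yield gel numbers from left to right by following right neighbours."""
    by_number = {gel[1]: gel for gel in gels}
    seen = set()
    number = leftmost
    while number != 0:
        if number in seen:
            raise ValueError(f"loop in contig at gel reading {number}")
        seen.add(number)
        yield number
        number = by_number[number][5]


def contig_length(gels):
    """Rightmost contig position covered by any gel reading."""
    return max(pos + abs(length) - 1 for _, _, pos, length, _, _ in gels)


print(list(walk_contig(GEL_LINES, 6)))
print(contig_length(GEL_LINES))
```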
|
||
|
|
||
|
@21. TX 3 @Enter new gel reading
|
||
|
|
||
|
Used to enter new gel readings into the database. The new gel
|
||
|
reading must have previously been compared with the contents of the
|
||
|
database by use of "auto assemble" in order to ascertain if it
|
||
|
overlaps any previously entered data.
|
||
|
|
||
|
The user is expected to know: if the gel reading overlaps; if
|
||
|
so which contig it overlaps; if so where it overlaps. The program
|
||
|
takes the user through a series of questions to establish the nature
|
||
|
of the overlap and then displays the overlap. The user is then
|
||
|
offered a number of options, including editors for the new gel
|
||
|
reading and the contig, to enable the correct alignment of the gel
|
||
|
reading throughout its whole length.
|
||
|
Supply the name of the gel reading file. If the gel reading has
|
||
|
been entered before the program will not permit entry. The program
|
||
|
gives the gel reading a unique number and asks if the sequence
|
||
|
overlaps any data already in the database (reported by "auto
|
||
|
assemble"). If it does not, entry is complete. If it does overlap
|
||
|
the dialogue continues with the program asking if the gel reading
|
||
|
overlaps "in the normal sense", if not it will automatically
|
||
|
complement the sequence. Then supply the number of the contig the
|
||
|
gel reading overlaps (as reported by "auto assemble").
|
||
|
|
||
|
Overlaps are divided into two types: those for which the new
|
||
|
gel reading protrudes from the left end of the contig it overlaps,
|
||
|
and those for which it does not. The program asks about this with
|
||
|
the question "Left end of gel reading is inside contig". If this is
|
||
|
true the program will go on to ask for the position in the contig of
|
||
|
the left end of the new gel reading. If it is not true the program
|
||
|
will ask for the position in the new gel reading of the left end of
|
||
|
the contig.
|
||
|
|
||
|
Once this is completed the program will display the first 50
|
||
|
bases of the overlap. The gel readings in the contig and their
|
||
|
consensus are displayed with the new gel reading underneath. The
|
||
|
mismatches are shown by *'s on the next line down. For example:
|
||
|
|
||
|
|
||
|
60 70 80 90 100
|
||
|
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
|
||
|
-3 HINW.007 GGCACA*GTC
|
||
|
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACACGTC
|
||
|
NEWGEL CACAAGCGAGCGAGAGGGGCACCGTGACGTGGTCACGCCGGGGACACGTC
|
||
|
MISMATCH * * *
|
||
|
10 20 30 40 50
|
||
|
|
||
|
|
||
|
The program then needs to know if the position of the left
|
||
|
end of the overlap is correct. If it is the user should type
|
||
|
return, if not, 1 and the program will ask for the new position and
|
||
|
display it.
|
||
|
The program now offers a number of options to allow the user to
|
||
|
align the new gel reading correctly over its whole length with the
|
||
|
data already in the contig. It is important that
|
||
|
sufficient edits are made to the new gel reading or the
|
||
|
sequences in the contig at this stage to get the alignment correct,
|
||
|
because once entry is completed, the alignment is fixed and cannot
|
||
|
easily be changed (see "alter relationships"). Alignment can be
|
||
|
achieved by making insertions or deletions but deletion of
|
||
|
data requires the original gels to be checked. For this reason
|
||
|
at entry we usually make only insertions to achieve alignment. We
|
||
|
use X or asterisks (*) as padding characters to achieve alignment
|
||
|
and so can, if required, distinguish padding characters from
|
||
|
characters assigned from reading gels.
|
||
|
|
||
|
The options available are:
|
||
|
? = HELP
|
||
|
! = Give up
|
||
|
3 = Complete entry
|
||
|
4 = Edit contig
|
||
|
5 = Display overlap
|
||
|
6 = Edit new gel reading
|
||
|
|
||
|
|
||
|
|
||
|
1. HELP gives this information.
|
||
|
|
||
|
2. Give up allows users to change their minds about entering
|
||
|
the new gel reading. The program will ask the user to confirm this
|
||
|
choice.
|
||
|
|
||
|
3. Complete entry is the command to add the new gel reading to
|
||
|
the contig. The program updates the relationships accordingly. The
|
||
|
user is asked to confirm this command.
|
||
|
|
||
|
4. Edit contig gives the user access to a simple editor that
|
||
|
allows insertions, deletions and changes to be made to the contig.
|
||
|
The editor maintains alignments by making the same number of
|
||
|
insertions or deletions in all sequences covering the edit position.
|
||
|
The program protects the user by allowing edits only
|
||
|
within the region of overlap.
|
||
|
|
||
|
5. Display allows display of the region of overlap only. This
|
||
|
is defined by the relative positions in the contig. The default is
|
||
|
the whole of the region of overlap.
|
||
|
|
||
|
6. Edit new gel reading allows the new gel reading to be
|
||
|
edited using a simple editor.
|
||
|
@23. TX 3 @Complement a contig
|
||
|
|
||
|
This function will complement and reverse all of the gel
|
||
|
readings in a contig. It automatically reverses and
|
||
|
complements each gel reading sequence, reorders left and right
|
||
|
neighbours, recalculates relative positions and changes each
|
||
|
strandedness.
|
||
|
|
||
|
The only user input required is to identify the contig
|
||
|
to complement by the number or name of a gel reading it contains.
|
||
|
DO NOT KILL THE PROGRAM DURING THIS STEP!
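
The bookkeeping involved can be sketched in Python. The sketch covers
only the relationships (positions, strandedness and neighbours), not
the reversal and complementing of the sequences themselves, and it is an
illustration rather than the program's own method: the function name,
the tuple layout and the rule that neighbours are reassigned by sorting
on the new left-end positions are assumptions.

```python
def complement_relationships(gels, contig_length):
    """Return the relationships of a contig after it has been complemented.

    gels is a list of (number, position, length) tuples, where a negative
    length marks a reading that is complementary to its archive.
    """
    flipped = []
    for number, position, length in gels:
        right_end = position + abs(length) - 1
        flipped.append({
            "number": number,
            # The old right end becomes the new left end.
            "position": contig_length - right_end + 1,
            # Changing the sign changes the strandedness.
            "length": -length,
        })
    # Neighbours must follow the new left-end order, which is not always
    # the old order reversed.
    flipped.sort(key=lambda gel: gel["position"])
    for i, gel in enumerate(flipped):
        gel["left"] = flipped[i - 1]["number"] if i > 0 else 0
        gel["right"] = flipped[i + 1]["number"] if i + 1 < len(flipped) else 0
    return flipped


# Number, position and length of the gel readings in the example contig
# shown under "show relationships" (contig length 689).
GELS = [(6, 1, -279), (3, 91, -265), (5, 137, -299), (17, 140, 273),
        (12, 193, 265), (18, 385, -245), (2, 401, -289)]

for gel in complement_relationships(GELS, 689):
    print(gel)
```

Note that gel readings 5 and 17 keep their relative order after the
contig is complemented, because reading 5 extends further to the right
than reading 17 in the original orientation.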
|
||
|
@22. TX 3 @Join contigs
|
||
|
|
||
|
This function joins contigs interactively. It allows the user
|
||
|
to align the ends of the two contigs by editing each contig
|
||
|
separately. It is important that the alignment achieved is
|
||
|
correct because once the join is completed the alignment is fixed.
|
||
|
The program needs to know which two contigs to join and where they
|
||
|
overlap.
|
||
|
|
||
|
First which two contigs are to be joined. The user should
|
||
|
identify the two contigs: first the left contig and then the right.
|
||
|
The program checks that the two contig numbers are different (it
|
||
|
will not allow circles to be formed!)
|
||
|
|
||
|
Now identify the exact position of overlap. This is defined as
|
||
|
the position in the left contig that the leftmost character of the
|
||
|
right contig overlaps. Normally the position is established by
|
||
|
employing the end gel reading for option "auto assemble". The
|
||
|
overlap must be of at least one character. The program then
|
||
|
displays the join showing all the gel readings overlapping the
|
||
|
join from the left contig, their consensus, all the gel readings
|
||
|
from the right contig that overlap the join, their consensus
|
||
|
and then asterisks to denote mismatches between the two
|
||
|
consensuses. For example:
|
||
|
|
||
|
                        1460      1470      1480      1490      1500
   56 HINW.100    TCT*GAGCAGTGTGGGCGCTG*CCGG
   33 HINW.300    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
  -25 HINW.090    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
   19 HINW.123    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
      CONSENSUS   TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
   -6 HINW.010    TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
   -3 HINW.007                TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
   -5 HINW.009                              GCTCGGAGGGCATGAAGT*AGAGC
      CONSENSUS   TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
      MISMATCH                         *                      ******
                          10        20        30        40        50
|
||
|
|
||
|
|
||
|
It is essential that the user aligns the two contigs
|
||
|
throughout the whole region of overlap before completing the join
|
||
|
because it is only at this stage that the two contigs can be edited
|
||
|
independently. Once the join is completed the alignment can only be
|
||
|
altered using the routines supplied by "alter relationships". The
|
||
|
program offers the user options to facilitate the alignment of the
|
||
|
two contigs. These options are:-
|
||
|
|
||
|
? = Help
|
||
|
! = Give up
|
||
|
3 = Complete join
|
||
|
4 = Edit left contig
|
||
|
5 = Display joint
|
||
|
6 = Edit right contig
|
||
|
7 = Move join
|
||
|
|
||
|
1. Help gives this information.
|
||
|
2. Give up allows the user to return to the main options without
|
||
|
completing the join. Note any edits made will remain.
|
||
|
3. Complete join instructs the program to update the relationships
|
||
|
so that the two contigs are joined. DO NOT KILL THE PROGRAM DURING
|
||
|
COMPLETE JOIN!
|
||
|
4. Edit left contig and edit right contig give access to a simple
|
||
|
editor that allows insertions, deletions and changes to be made to
|
||
|
the contigs. Help is available on editing once the editing option
|
||
|
is selected. The user is only allowed to edit within the region of
|
||
|
overlap and should make sure that the positions used correspond to
|
||
|
the correct contig.
|
||
|
5. Display join displays the joint as shown above.
|
||
|
6. See above.
|
||
|
7. Move join allows the position of the joint to be changed.
|
||
|
@24. TX 1 @ Copy the database
|
||
|
|
||
|
Used to make a copy of the database. If required the database
|
||
|
size can be altered using this option. The "version" of a database
|
||
|
is encoded as the last letter in the names of the three files that
|
||
|
contain the database.
|
||
|
|
||
|
Supply a "version" number (the default is version 1), and if
|
||
|
required select a new size for the database. The size of a database
|
||
|
is the number of lines of information it can hold. It needs a line
|
||
|
for each gel reading and another for each contig.
|
||
|
@19. TX 1 @ Check database
|
||
|
|
||
|
Used to perform a check on the logical consistency of the
|
||
|
database. No user intervention is required.
|
||
|
|
||
|
The following relationships are checked:
|
||
|
1. If gel reading A thinks gel reading B is its left neighbour
|
||
|
does B think A is its right neighbour? The error message is
|
||
|
"Hand holding problem for gel reading A"
|
||
|
followed by the gel descriptor lines for gel readings A and B.
|
||
|
2. Are there any contig lines with no left or right end gel
|
||
|
readings? The error message is
|
||
|
"Bad contig line number A"
|
||
|
3. Do the gel readings that are described as left ends on
|
||
|
contig lines agree that they are left ends? The error message is
|
||
|
"The end gel readings of contig A have outward neighbours"
|
||
|
4. Are there gel readings that are in more than one contig?
|
||
|
The error message is
|
||
|
" Gel number A is used N times"
|
||
|
5. Are there gel readings that are not in any contig? The
|
||
|
error message is
|
||
|
" Gel number A is not used"
|
||
|
6. Do the relative positions of gel readings agree with
|
||
|
their position as defined by left and right neighbourliness? The
|
||
|
error message is
|
||
|
" Gel number A with position X is left neighbour of gel number B
|
||
|
with position Y"
|
||
|
7. Are there any loops in contigs? If so no further
|
||
|
 checking is done. The error message is
|
||
|
" Loop in contig n no further checking done, but gel reading numbers
|
||
|
follow"
|
||
|
The program then prints the gel reading numbers in the looped
|
||
|
contig up to the start of the loop.
|
||
|
8. Are there any contigs of length <1? The error message is
|
||
|
" The contig on line number x has zero length"
|
||
|
9. Are there any gel readings (used in only one contig) that have
|
||
|
zero length? The error message is
|
||
|
" Gel number N has zero length"
|
||
|
Note that "auto assemble" also uses this logical consistency check
|
||
|
and will only tolerate a "Gel number N is not used" error. Any other
|
||
|
error will cause it to give up.
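
The checks listed above can be pictured with a short sketch. It is not the program's own code: the Gel and Contig records, their field names and the function check_database are hypothetical, and only checks 1, 2, 4, 5 and 7 are shown. A neighbour number of 0 is assumed to mean "no neighbour".

```python
from dataclasses import dataclass


@dataclass
class Gel:
    left: int   # number of the left neighbour, 0 if none (assumed convention)
    right: int  # number of the right neighbour, 0 if none (assumed convention)


@dataclass
class Contig:
    left_end: int   # gel number of the leftmost reading, 0 if missing
    right_end: int  # gel number of the rightmost reading, 0 if missing


def check_database(gels, contigs):
    """Return a list of error messages; an empty list means no errors found."""
    errors = []
    # Check 1: if A names B as its left neighbour, B must name A as its right.
    for n, g in sorted(gels.items()):
        if g.left and gels[g.left].right != n:
            errors.append(f"Hand holding problem for gel reading {n}")
    used = {n: 0 for n in gels}
    for line, c in enumerate(contigs, start=1):
        # Check 2: every contig line needs a left and a right end reading.
        if not c.left_end or not c.right_end:
            errors.append(f"Bad contig line number {line}")
            continue
        # Check 7: chain rightwards; meeting a reading twice means a loop.
        seen = set()
        n = c.left_end
        while n:
            if n in seen:
                errors.append(f"Loop in contig {line}")
                return errors  # no further checking is done
            seen.add(n)
            used[n] += 1
            n = gels[n].right
    # Checks 4 and 5: each reading should be used exactly once.
    for n in sorted(used):
        if used[n] == 0:
            errors.append(f"Gel number {n} is not used")
        elif used[n] > 1:
            errors.append(f"Gel number {n} is used {used[n]} times")
    return errors
```

In this sketch a caller that behaves like "auto assemble" would carry on only if every returned message ends in "is not used".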
|
||
|
@29. TX 1 @ Examine quality
|
||
|
|
||
|
Analyses the quality of the data in a contig. It reports on
|
||
|
the proportion of the consensus that is "well determined" and will
|
||
|
display a sequence of symbols that indicate the quality of the
|
||
|
consensus at each position.
|
||
|
|
||
|
Identify the contig to analyse, and the section of interest.
|
||
|
The current consensus calculation cutoff score will be used to
|
||
|
decide if each position is "well determined". In general the quality
|
||
|
of a reading deteriorates along the length of the gel and so it is
|
||
|
also possible to use a length cutoff for the quality calculation.
|
||
|
Only the data from the first section of each reading will be
|
||
|
included in the quality calculation. The length is altered under
|
||
|
"set parameters" and is initially set to the maximum reading length.
|
||
|
A summary showing the percentage of the consensus that falls into
|
||
|
each category of quality is shown. Choose whether or not to have the
|
||
|
quality codes for each position of the consensus displayed. They can
|
||
|
be displayed as either graphics or text.
|
||
|
|
||
|
The quality of the data depends on the number of times it has
|
||
|
been sequenced and the particular uncertainty codes used in each
|
||
|
gel reading. This function divides the data into five categories,
|
||
|
assigning each a symbol or code:
|
||
|
1. Well determined on both strands and they agree. code=0
|
||
|
2. Well determined on the plus strand only. code=1
|
||
|
3. Well determined on the minus strand only. code=2
|
||
|
4. Not well determined on either strand. code=3
|
||
|
5. Well determined on both strands but they disagree. code=4
|
||
|
A position is "well determined" if it is assigned one of the symbols
|
||
|
A,C,G,T when the algorithm described in the section "calculate a
|
||
|
consensus" is applied. The calculation is performed separately for each
|
||
|
strand.
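
As an illustration of the five categories, the sketch below assigns a code from the two per-strand consensus symbols. The function names are hypothetical, and it is assumed that the minus strand symbol has already been expressed in the same sense as the plus strand and that any symbol other than A,C,G,T means "not well determined".

```python
GOOD = set("ACGT")  # symbols that count as "well determined"


def quality_code(plus_call, minus_call):
    """Return the quality code (0-4) for one consensus position."""
    plus_ok = plus_call in GOOD
    minus_ok = minus_call in GOOD
    if plus_ok and minus_ok:
        # Both strands are well determined: code 0 if they agree, else 4.
        return 0 if plus_call == minus_call else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3


def quality_summary(codes):
    """Return the percentage of positions that fall into each code."""
    total = len(codes)
    return {code: 100.0 * codes.count(code) / total for code in range(5)}
```

For example, quality_summary applied to the codes of a whole contig gives figures of the kind shown in the summary that the function prints.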
|
||
|
|
||
|
If the user chooses to have the data displayed graphically the
|
||
|
following scheme is used. A rectangular box is drawn so that the x
|
||
|
coordinate represents the length of the contig. The box is
|
||
|
notionally divided vertically into 5 possible levels which are given
|
||
|
the y values: -2,-1,0,1,2. The quality codes attributed to each
|
||
|
base position are plotted as rectangles. Each rectangle represents
|
||
|
a region in which the quality codes are identical, so a single base
|
||
|
having a different code from its immediate neighbours will appear as
|
||
|
a very narrow rectangle.
|
||
|
|
||
|
Rectangle bottom and top y values
|
||
|
|
||
|
Quality 0 rectangle from 0 to 0
|
||
|
Quality 1 rectangle from 0 to 1
|
||
|
Quality 2 rectangle from 0 to -1
|
||
|
Quality 3 rectangle from -1 to 1
|
||
|
Quality 4 rectangle from -2 to 2
|
||
|
|
||
|
Obviously a single line at the midheight shows a perfect
|
||
|
sequence.
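
The grouping of identical codes into rectangles can be sketched as follows. The y values are taken from the table above; the function name and the tuple layout are illustrative assumptions, not the program's own routine.

```python
from itertools import groupby

# Bottom and top y values for each quality code, as listed in the table.
LEVELS = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def quality_rectangles(codes, start=1):
    """Return one (x_first, x_last, y_bottom, y_top) tuple per run of codes."""
    rectangles = []
    x = start
    for code, run in groupby(codes):
        run_length = len(list(run))
        bottom, top = LEVELS[code]
        rectangles.append((x, x + run_length - 1, bottom, top))
        x += run_length
    return rectangles
```

A single base whose code differs from its neighbours gives a tuple with x_first equal to x_last, which is the very narrow rectangle mentioned above.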
|
||
|
|
||
|
Typical dialogue is shown below.
|
||
|
|
||
|
41.47% OK on both strands and they agree(0)
|
||
|
55.48% OK on plus strand only(1)
|
||
|
2.08% OK on minus strand only(2)
|
||
|
0.97% Bad on both strands(3)
|
||
|
0.00% OK on both strands but they disagree(4)
|
||
|
? (y/n) (y) Show sequence of codes
|
||
|
|
||
|
10 20 30 40 50
|
||
|
1111111111 1111111111 1111111111 1111111111 1111111111
|
||
|
|
||
|
60 70 80 90 100
|
||
|
1111111111 1111111111 1111111111 3111111111 1111111111
|
||
|
|
||
|
110 120 130 140 150
|
||
|
1111111111 1111131111 1111111111 1111111111 1111111111
|
||
|
|
||
|
160 170 180 190 200
|
||
|
1111111111 1111111111 1111111111 1111111111 1111111133
|
||
|
|
||
|
210 220 230 240 250
|
||
|
1311111111 1111111111 1111111110 0000000000 0000220000
|
||
|
|
||
|
260 270 280 290 300
|
||
|
0000000000 0020000000 2200000202 0002000000 0000222200
|
||
|
|
||
|
@26. TX 3 @ Alter relationships
|
||
|
|
||
|
Used to make what are normally illegal changes to the
|
||
|
database. That is the normal checks are not done and any item in the
|
||
|
database can be changed independently of all others. Users need to
|
||
|
know what they are doing because it is very easy to make a horrible
|
||
|
mess. Always start by making a copy!
|
||
|
|
||
|
By using the options here users can edit individual gel
|
||
|
readings in contigs, move one section of a contig relative to
|
||
|
another, break contigs, remove contigs, remove gel readings, etc. To
|
||
|
give flexibility most of the commands do only one thing. This means
|
||
|
that several commands may have to be executed to complete any
|
||
|
change. At the end of this help section there are notes on removing
|
||
|
gel readings from the database.
|
||
|
|
||
|
The following options are offered:
|
||
|
|
||
|
? = HELP
|
||
|
! = QUIT
|
||
|
3 = Line change
|
||
|
4 = Edit single gel reading
|
||
|
5 = Delete contig
|
||
|
6 = Shift
|
||
|
7 = Move gel reading
|
||
|
8 = Rename gel reading
|
||
|
9 = Break a contig
|
||
|
|
||
|
1. HELP gives this information.
|
||
|
2. QUIT returns to the main options of SAP.
|
||
|
3. Line change
|
||
|
allows the user to change the contents of any line in the file of
|
||
|
relationships. The line is selected by number, the program prints
|
||
|
the current line and prompts for the new line.
|
||
|
4. Edit
|
||
|
allows the user to edit an individual gel reading
|
||
|
independently of any others it may be related to. The edit positions
|
||
|
are relative to the contig. The effect of this editing on the length
|
||
|
of the gel reading is taken care of but, if it changes the length of
|
||
|
a contig, or its relationship to others, this must be accounted for
|
||
|
(if necessary) by use of the "line change" function.
|
||
|
5. Delete contig
|
||
|
is a function that deletes a contig line by moving down all the
|
||
|
contig lines above by one position. It prompts only for the line to
|
||
|
delete. It does not delete any of the gel readings or gel
|
||
|
reading lines for the deleted contig but it does reduce the number
|
||
|
of contigs on line IDBSIZ by 1.
|
||
|
6. Shift
|
||
|
allows the user to change all the relative positions of a set of
|
||
|
neighbouring gel readings by some fixed value, i.e. it will shift
|
||
|
related gel readings either left or right. It can therefore be
|
||
|
used to change the alignment of the gel readings in a contig or as
|
||
|
part of the process of breaking a contig into two parts (see below).
|
||
|
It prompts for the number of the first gel reading to shift and
|
||
|
then for the distance to move them (Note a negative value will
|
||
|
move the gel readings left and a positive value right). It then
|
||
|
chains rightwards (ie follows right neighbours) and shifts each gel
|
||
|
reading, in turn, up to the end of the contig. (This means that
|
||
|
only those gel readings from the first to shift to the rightmost are
|
||
|
moved). It updates the length of the contig accordingly.
|
||
|
7. Move gel reading
|
||
|
is a function to renumber a gel reading. It moves all the
|
||
|
information about a gel reading on to another line. The user must
|
||
|
specify the number of the gel reading to move and the number of the
|
||
|
line to place it. It takes care of all the relationships. Of course
|
||
|
gel readings must not be moved to lines occupied by other gel
|
||
|
readings! It can be used as part of the process of removing a gel
|
||
|
reading from the database (see below).
|
||
|
8. Rename gel reading
|
||
|
is a function that is used to rename the archive names of gel
|
||
|
readings in the database; it only changes the name in the .ARN
|
||
|
file of the database.
|
||
|
|
||
|
9. Break contig
|
||
|
|
||
|
     Occasionally it is necessary to break a contig into two parts
|
||
|
and this can be achieved using this option. The program needs only
|
||
|
the number of a gel reading. This is the gel reading that will
|
||
|
become a left end after the break. That is, the break is made
|
||
|
between this gel reading and its left neighbour. A new contig line
|
||
|
is created so ensure that there is sufficient space in the database.
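
The rightward chaining performed by the Shift option (item 6 above) can be sketched as below. The dictionary layout, the key names and both function names are hypothetical; a right neighbour of 0 is assumed to mark the end of the contig.

```python
def shift(gels, first, distance):
    """Shift gel `first` and every reading to its right by `distance`.

    A negative distance moves the readings left, a positive one right.
    Returns the numbers of the readings that were moved.
    """
    moved = []
    n = first
    while n:  # follow right neighbours up to the end of the contig
        gels[n]["position"] += distance
        moved.append(n)
        n = gels[n]["right"]
    return moved


def contig_length(gels, left_end):
    """Recompute the contig length as the rightmost base covered."""
    end = 0
    n = left_end
    while n:
        g = gels[n]
        end = max(end, g["position"] + g["length"] - 1)
        n = g["right"]
    return end
```

Readings to the left of the first one named are untouched, which is why the same operation can open a gap at the point where a contig is to be broken.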
|
||
|
Removing gel readings from contigs
|
||
|
|
||
|
Gel readings can be removed from contigs if they are not
|
||
|
essential for holding the contig together (ie are not the only gel
|
||
|
reading covering a particular region). Suppose the gel reading to
|
||
|
remove is gel number b with left neighbour a and right neighbour c.
|
||
|
Using "line change" change the right neighbour of a to c, and the
|
||
|
left neighbour of c to a. To tidy things up: suppose there are x gel
|
||
|
readings in the database; then, using "move gel reading" move gel x
|
||
|
to line b; then, using "line change" decrease the number of gel
|
||
|
readings in the database (stored in the last line) by 1.
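
The procedure amounts to unlinking an element from a doubly linked list and then filling the hole with the highest numbered reading. A minimal sketch follows; the dictionary layout and the function name are assumptions, and the contig lines (which the real "move gel reading" also keeps up to date) are left out.

```python
def remove_gel(gels, b):
    """Remove reading b, then renumber the last reading to take its place."""
    a, c = gels[b]["left"], gels[b]["right"]
    # "Line change": join the neighbours of b to each other.
    if a:
        gels[a]["right"] = c
    if c:
        gels[c]["left"] = a
    # "Move gel reading": move the last reading x on to line b.
    x = max(gels)
    if x != b:
        moved = gels[x]
        if moved["left"]:
            gels[moved["left"]]["right"] = b
        if moved["right"]:
            gels[moved["right"]]["left"] = b
        gels[b] = moved
    # Decrease the number of gel readings by 1.
    del gels[x]
```

After the call the readings are again numbered 1 to x-1 with no gaps.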
|
||
|
@27. TX 1 @ Set display parameters
|
||
|
|
||
|
Used to redefine the parameters that control the cutoff
|
||
|
employed by the consensus calculation and quality examiner, the
|
||
|
maximum length of each reading to include in the quality
|
||
|
calculation, the line length used by the display function, the text
|
||
|
window length used by the graphics options, and the graphics window
|
||
|
length used by the graphics options.
|
||
|
|
||
|
The default cutoff score is 75%. The default line length is 50
|
||
|
characters. For protein sequences the cutoff is always 100%.
|
||
|
|
||
|
The text window used by the graphics options controls the
|
||
|
amount of sequence listed at the crosshair position. The graphics
|
||
|
window controls the "zoom" function. Both these windows are defined
|
||
|
as the number of bases that should be shown, to both left and right
|
||
|
of the crosshair.
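
As a small worked example of the window definition, the sketch below turns a crosshair position and a window size into the range of bases shown. The function name is hypothetical, and the clipping at the contig ends is an assumption about how the edges are treated.

```python
def window_range(crosshair, window, contig_length):
    """Return (first, last) base shown: `window` bases either side."""
    first = max(1, crosshair - window)
    last = min(contig_length, crosshair + window)
    return first, last
```

With a window of 100 and the crosshair at base 500 of a 2000 base contig this gives bases 400 to 600.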
|
||
|
@30. TX 3 @ Auto edit a contig
|
||
|
|
||
|
This function automatically changes characters in gel readings
|
||
|
to make them agree with the consensus sequence. If employed as is
|
||
|
intended, use of this function is not a criminal activity but a
|
||
|
method that saves a large amount of work. All characters changed by
|
||
|
the auto editor will appear in the gel readings as lowercase
|
||
|
letters. The current consensus calculation cutoff score is used.
|
||
|
|
||
|
Identify the contig and the section to edit. The program will
|
||
|
display a summary of changes made. Note that it is important to
|
||
|
understand both what the auto editor does and the order in which it
|
||
|
does it. Before employing the auto editor users should note all the
|
||
|
corrections that they require, so that after it has been used the
|
||
|
corrections can be checked.
|
||
|
|
||
|
The general strategy employed when collecting shotgun sequence
|
||
|
data is to let the contigs get fairly deep, to get a printout of a
|
||
|
contig, check problems against the films, note corrections on the
|
||
|
printout, and make the changes using an interactive editor. In
|
||
|
general the consensus is correct except for places where padding
|
||
|
characters have been used to accommodate a single gel with an extra
|
||
|
character, or where the consensus is dash. The important point for
|
||
|
the auto editor is that most edits simply make the gel readings
|
||
|
conform to the consensus, or remove columns of pads.
|
||
|
|
||
|
The new editor does the following.
|
||
|
|
||
|
1) calculates a consensus for the contig (or part of a contig)
|
||
|
to be edited, and then uses this consensus to direct the editing of
|
||
|
the contig in 3 stages
|
||
|
|
||
|
2) stage 1: find and correct all places where, if the order of
|
||
|
two adjacent characters is swapped, they will both agree with the
|
||
|
consensus (given that they did not match the consensus before).
|
||
|
These corrections are termed "transpositions"
|
||
|
|
||
|
3) stage 2: find and correct all places where there is a
|
||
|
definite consensus but the gel reading has a different character.
|
||
|
These corrections are termed "changes".
|
||
|
|
||
|
4) stage 3: delete all positions in which padding is the
|
||
|
consensus. These corrections are termed "deletions".
|
||
|
|
||
|
All changed characters are shown in lowercase letters so it
|
||
|
will be obvious which characters have been assigned by the program
|
||
|
(except for deletions). The number of each type of correction will
|
||
|
be displayed.
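
The order of the three stages matters, and a sketch may make it easier to follow. The code below edits one gel reading against a consensus of the same length. It is only an illustration under stated assumptions: the padding character is taken to be "*", a definite consensus is taken to be one of A,C,G,T, and the function name is hypothetical. The real editor works on all the readings in the contig.

```python
PAD = "*"            # assumed padding character
DEFINITE = set("ACGT")


def auto_edit(reading, consensus):
    """Return (edited reading, edited consensus, counts of corrections)."""
    r = list(reading)
    counts = {"transpositions": 0, "changes": 0, "deletions": 0}

    # Stage 1: swap adjacent characters if neither matched the consensus
    # before and both match it after the swap.
    i = 0
    while i < len(r) - 1:
        a, b = r[i], r[i + 1]
        ca, cb = consensus[i], consensus[i + 1]
        if a != ca and b != cb and a == cb and b == ca:
            r[i], r[i + 1] = b.lower(), a.lower()
            counts["transpositions"] += 1
            i += 2
        else:
            i += 1

    # Stage 2: where the consensus is definite, make the reading agree.
    for i, c in enumerate(consensus):
        if c in DEFINITE and r[i].upper() != c:
            r[i] = c.lower()
            counts["changes"] += 1

    # Stage 3: delete every position at which padding is the consensus.
    keep = [i for i, c in enumerate(consensus) if c != PAD]
    counts["deletions"] = len(consensus) - len(keep)
    edited = "".join(r[i] for i in keep)
    new_consensus = "".join(consensus[i] for i in keep)
    return edited, new_consensus, counts
```

For example, auto_edit("ACGTT*AG", "ACTGT*AC") makes one transposition, one change and one deletion, and the characters it assigned appear in lowercase.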
|
||
|
@10. TX 2 @Clear graphics
|
||
|
|
||
|
Clears graphics from the screen.
|
||
|
@11. TX 2 @Clear text
|
||
|
|
||
|
Clears text from the screen.
|
||
|
@12. TX 2 @Draw a ruler.
|
||
|
|
||
|
This option allows the user to draw a ruler or scale along the
|
||
|
x axis of the screen to help identify the coordinates of points of
|
||
|
interest. The user can define the position of the first base to be
|
||
|
marked (for example if the active region is 1501 to 8000, the user
|
||
|
might wish to mark every 1000th base starting at either 1501 or 2000
|
||
|
- it depends if the user wishes to treat the active region as an
|
||
|
independent unit with its own numbering starting at its left edge,
|
||
|
or as part of the whole sequence). The user can also define the
|
||
|
separation of the ticks on the scale and their height. If required
|
||
|
the labelling routine can be used to add numbers to the ticks.
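
The choice of first base and tick separation can be illustrated with the example in the text. The function name and its arguments are hypothetical; the sketch only lists the base positions that would be marked.

```python
def ruler_ticks(region_start, region_end, first_tick, separation):
    """Return the base positions marked on the ruler for the active region."""
    if first_tick < region_start:
        raise ValueError("first tick lies to the left of the active region")
    return list(range(first_tick, region_end + 1, separation))
```

For the active region 1501 to 8000 with a separation of 1000, starting at 2000 marks 2000, 3000 and so on up to 8000, whereas starting at 1501 marks 1501, 2501 and so on up to 7501.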
|
||
|
@14. TX 2 @Reposition plots
|
||
|
|
||
|
     The position of each of the plots is defined relative to a
|
||
|
user's drawing board which has size 1-10,000 in x and 1-10,000 in y.
|
||
|
Plots for each option are drawn in a window defined by x0,y0 and
|
||
|
xlength,ylength, where x0,y0 is the position of the bottom left hand
|
||
|
corner of the window, and xlength is the width of the window and
|
||
|
ylength the height of the window.
|
||
|
--------------------------------------------------------- 10,000
|
||
|
1 1
|
||
|
1 -------------------------------------- ^ 1
|
||
|
1 1 1 1 1
|
||
|
1 1 1 1 1
|
||
|
1 1 1 ylength 1
|
||
|
1 1 1 1 1
|
||
|
1 1 1 1 1
|
||
|
1 -------------------------------------- v 1
|
||
|
1 x0,y0^ 1
|
||
|
1 <---------------xlength--------------> 1
|
||
|
--------------------------------------------------------- 1
|
||
|
1 10,000
|
||
|
|
||
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
||
|
The default window positions are read from a file "ANALMARG" when
|
||
|
the program is started. Users can have their own file if required.
|
||
|
As all the plots start at the same position in x and have the same
|
||
|
width, x0 and xlength are the same for all options. Generally users
|
||
|
will only want to change the start level of the window y0 and its
|
||
|
height ylength. This option allows users to change window positions
|
||
|
whilst running the program. The routine prompts first for the
|
||
|
number of the option that the user wishes to reposition; then for
|
||
|
the y start and height; then for the x start and length. Note that
|
||
|
changes to the x values affect all options. If the user types only
|
||
|
carriage return for any value it will remain unchanged. Note that,
|
||
|
unlike all the other programs, the boxes used to contain analytical
|
||
|
results (eg plot quality) should not be made to overlap one another,
|
||
|
as the function of the crosshair routine depends on which box the
|
||
|
crosshair is in!
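
A window in drawing board units can be modelled as below. The class and method names are hypothetical, and the linear mapping from base position to x is an assumption made for illustration. The overlap test shows the condition that the boxes for analytical results should avoid.

```python
from dataclasses import dataclass


@dataclass
class Window:
    x0: int       # x of the bottom left hand corner, drawing board units
    y0: int       # y of the bottom left hand corner, drawing board units
    xlength: int  # width of the window
    ylength: int  # height of the window

    def base_to_x(self, base, first, last):
        """Map a base in the range first..last to a drawing board x value."""
        return self.x0 + (base - first) * self.xlength / (last - first)

    def overlaps_vertically(self, other):
        """True if the two windows share any range of y values."""
        return (self.y0 < other.y0 + other.ylength
                and other.y0 < self.y0 + self.ylength)
```

Since x0 and xlength are the same for all options, only the vertical test is needed to decide whether two boxes overlap.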
|
||
|
@15. TX 2 @Label a diagram
|
||
|
|
||
|
This routine allows users to label any diagrams they have
|
||
|
produced. They are asked to type in a label. When the user types
|
||
|
carriage return to finish typing the label the cross-hair appears on
|
||
|
the screen. The user can position it anywhere on the screen. If the
|
||
|
user types R (for right justify) the label will be written on the
|
||
|
diagram with its right end at the cross-hair position. If the user
|
||
|
types L (for left justify) the label will be written on the diagram
|
||
|
with its left end at the cross hair position. The cross-hair will
|
||
|
then immediately reappear. The user may put the same label on
|
||
|
another part of the diagram as before or if he hits the space bar he
|
||
|
will be asked if he wishes to type in another label.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=15
|
||
|
Type label then drive cross hair to left or right end
|
||
|
of label position then hit "L" to write label left
|
||
|
justified or "R" to write label right justified or
|
||
|
the space bar to quit
|
||
|
|
||
|
|
||
|
? Label=delta gene
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
? Label=
|
||
|
|
||
|
@16. TX 2 @Display a map.
|
||
|
|
||
|
This draws a map of any sequence features selected by the
|
||
|
user. These features may be protein coding regions (CDS), tRNA
|
||
|
genes (TRNA), promoter positions (PRM), etc. Users may define their
|
||
|
own feature table key names. For example I find it convenient to
|
||
|
split CDS lines into CDS1, CDS2 and CDS3 each of which contains only
|
||
|
those sequences that code in the reading frames 1, 2 or 3. Then I
|
||
|
can plot them at different heights on the screen ( suitable heights
|
||
|
can be determined by using the cross-hair). The coordinates must be
|
||
|
stored in a file in the format of an EMBL feature table.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=16
|
||
|
Display a map using an EMBL feature table file
|
||
|
? map file name=hsegl1.ft
|
||
|
? feature code(e.g. CDS) =CDS
|
||
|
X 1 + strand
|
||
|
2 - strand
|
||
|
3 both strands
|
||
|
? 0,1,2,3 =
|
||
|
? level (0-9480) (256) =4000
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
? feature code(e.g. CDS) =
|
||
|
|
||
|
@7. TX 1 @Redirect output
|
||
|
|
||
|
Used to direct output that would normally appear on the screen
|
||
|
to a file.
|
||
|
|
||
|
Select redirection of either text or graphics, and supply the
|
||
|
name of the file that the output should be written to.
|
||
|
|
||
|
The results from the next options selected will not appear on
|
||
|
the screen but will be written to the file. When option 7 is
|
||
|
selected again the file will be closed and output will again appear
|
||
|
on the screen.
|
||
|
@13. TX 2 @Use crosshair
|
||
|
This option puts a steerable cross on the screen which the user
|
||
|
drives around by using the arrow keys (or mouse). When the crosshair
|
||
|
is visible a number of options are available if the user types one
|
||
|
of a set of special keyboard characters. Any other characters will
|
||
|
cause an exit from the crosshair option. The special keys are:
|
||
|
|
||
|
I = Identify the nearest gel reading
|
||
|
Z = Zoom in
|
||
|
Q = plot Quality
|
||
|
S = display the aligned Sequences at the crosshair position
|
||
|
N = list the Names and Numbers of the sequences at the crosshair
|
||
|
|
||
|
In order for any of these special keys to operate, the
|
||
|
crosshair must be in an appropriate display box, and the precise
|
||
|
function of the keys will also depend on which box the crosshair is
|
||
|
in.
|
||
|
|
||
|
If the crosshair is in the "plot all contigs" box, Z will
|
||
|
cause a new box to appear showing all the readings for the nearest
|
||
|
contig; Q will give the same as Z but will also produce an extra box
|
||
|
showing the "quality" plot.
|
||
|
|
||
|
If Z is hit in the "plot single contig" box, the contig will
|
||
|
be zoomed to the current graphics window size. The zoom will be
|
||
|
roughly centred on the crosshair position. Because of this it is
|
||
|
possible to step along a contig by repeatedly zooming with the
|
||
|
crosshair near to one end of the single contig display box. If I is
|
||
|
hit the crosshair must be close to a gel reading line. If Q is hit,
|
||
|
the quality plot will be produced for the region shown in the plot
|
||
|
single contig box. In all cases when the "plot all contigs" box is
|
||
|
shown, a vertical line will bisect the line that represents the
|
||
|
relevant contig, at the current position.
|
||
|
|
||
|
If the crosshair is in the plot quality box only the character
|
||
|
"s" will operate as a special symbol.
|
||
|
|
||
|
The number of bases shown in the N and S options is controlled
|
||
|
by the current graphics text window size, and the size of the zoom
|
||
|
window by the current graphics window size. Both are set by the
|
||
|
parameter setting function of the general menu.
|
||
|
@33. TX 2 @Plot single contig
|
||
|
This option produces a schematic of a selected region of a single
|
||
|
contig by drawing a horizontal line to represent each of its gel
|
||
|
readings. The lines show the relative positions of each reading and
|
||
|
also their sense. The plot is divided vertically into two sections
|
||
|
by a line that is identified by an asterisk drawn at each end. All
|
||
|
lines that lie above this line represent readings that are in their
|
||
|
original sense, all lines below show readings that are in the
|
||
|
complementary sense to their original. By use of the crosshair
|
||
|
function the plot can be stepped through and examined in more
|
||
|
detail. See help on crosshair.
|
||
|
@34. TX 2 @Plot all contigs
|
||
|
This option produces a schematic of all the contigs in a database.
|
||
|
It does this by drawing a horizontal line to represent each of them.
|
||
|
In order to show the ends of each contig it draws the lines for
|
||
|
contigs at alternate heights: the first at height one, the second at
|
||
|
height two, the third at height one, etc. The order of the contigs
|
||
|
in the display is the same as their order in the database. By use of
|
||
|
the crosshair function the plot can be stepped through and examined
|
||
|
in more detail. See help on crosshair.
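
The alternating heights can be sketched as follows. It is assumed here that the contigs are laid end to end along the x axis in database order; the function name and the tuple layout are hypothetical.

```python
def layout_contigs(lengths):
    """Return one (x_first, x_last, height) tuple per contig, in order."""
    segments = []
    x = 1
    for index, length in enumerate(lengths):
        height = 1 if index % 2 == 0 else 2  # first at one, second at two, ...
        segments.append((x, x + length - 1, height))
        x += length
    return segments
```

Because neighbouring lines are at different heights, the point at which one contig ends and the next begins is visible even though the lines abut.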
|
||
|
@31. TX 3 @ Type in gel readings
|
||
|
This option allows gel readings to be typed in at the keyboard. It
|
||
|
creates a separate file for each gel reading and a file of file
|
||
|
names for the batch. The sequences from each batch may be listed
|
||
|
when they have all been entered. Users may choose to employ special
|
||
|
keys to identify the 4 bases A,C,G and T. By default these special
|
||
|
keys are N M , . but any other four characters may be used. If
|
||
|
special keys are used the characters are automatically translated to
|
||
|
A C G T before being stored on the disk.
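
The translation of special keys can be sketched with a simple lookup table. The function names are hypothetical, and it is assumed that the four keys correspond to A, C, G and T in the order given, and that upper and lower case keys are treated alike.

```python
def make_translator(keys="NM,."):
    """Build a translation table from four special keys to A, C, G, T."""
    if len(keys) != 4 or len(set(keys.upper())) != 4:
        raise ValueError("exactly four different keys are needed")
    table = {}
    for key, base in zip(keys, "ACGT"):
        table[ord(key.upper())] = base
        table[ord(key.lower())] = base
    return table


def translate_reading(typed, keys="NM,."):
    """Translate the typed characters; other characters are left unchanged."""
    return typed.translate(make_translator(keys))
```

With the default keys, typing NM,. gives ACGT.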
|
||
|
@35. TX 1 @ Find internal joins
|
||
|
The purpose of this function is to use data already in the database
|
||
|
to find possible joins between contigs. Joins may have been missed
|
||
|
due to poor data or may not have been made due to repeated
|
||
|
sequences. Where appropriate, it may be possible to find potential
|
||
|
joins by using the data clipped off readings prior to their entry
|
||
|
into the database.
|
||
|
The database is checked for logical consistency. Supply a minimum
|
||
|
initial match length, a minimum alignment block, the maximum pads
|
||
|
per sequence, the maximum percent mismatch after alignment, and the
|
||
|
probe length. Choose if clipped data is to be used; if so, define the
|
||
|
window size for finding good data and the number of dashes allowed
|
||
|
in the window. Processing will commence. Most of these values are
|
||
|
used in an identical way in the autoassemble function. The others
|
||
|
are defined below.
|
||
|
The program strategy
|
||
|
Take the first contig and calculate its consensus. If clipped data
|
||
|
is being used examine all readings that are in the complementary
|
||
|
orientation, and sufficiently near to the contig's left end, to see
|
||
|
if they have good clipped sequence which, if present, would protrude
|
||
|
from the left end of the contig. If found add the longest such
|
||
|
sequence to the left end of the consensus. Do the same for the right
|
||
|
end by examining readings that are in their original orientation. If
|
||
|
any are found add the longest extension to the right end of the
|
||
|
consensus. Repeat the consensus calculations and extensions for all
|
||
|
contigs hence producing an extended consensus. If clipped data is
|
||
|
not being used simply calculate the consensus for the whole
|
||
|
database. Now look for possible joins by processing the extended
|
||
|
consensus in the following way. Take the last, say 100, bases
|
||
|
(termed the "probe length" by the program) of the rightmost
|
||
|
consensus, compare it in both orientations with the extended consensus
|
||
|
of all the other contigs. Display any sufficiently good alignments.
|
||
|
Repeat with the left end of the rightmost contig. Do the same for
|
||
|
the ends of all the extended contigs, always only comparing with the
|
||
|
contigs to their left, so that the same matches do not appear twice.
|
||
|
Good clipped data is defined by sliding a window of "Window size for
|
||
|
good data scan" bases outwards along the sequence and stopping when
|
||
|
"Maximum number of dashes in scan window" or more dashes appear in
|
||
|
the window. Note that it is advisable to have some sort of cutoff
|
||
|
because if we simply take all the data it might be so full of
|
||
|
rubbish that we won't find any good matches. For the same reason it
|
||
|
is worth trying the procedure with different cutoffs. An initial run
|
||
|
using no clipped data is also recommended. Sufficiently good
|
||
|
alignments are defined by criteria equivalent to those used in
|
||
|
autoassemble, however here we only display alignments that pass all
|
||
|
tests.
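
The scan for good clipped data can be sketched as below. The clipped sequence is assumed to be given in the outward direction, starting at the base next to the used part of the reading. The function name is hypothetical, and keeping everything before the first failing window is an assumption about exactly where the program stops.

```python
def good_clipped_length(clipped, window, max_dashes):
    """Return the number of clipped bases, counted outwards, judged good."""
    last_start = max(1, len(clipped) - window + 1)
    for start in range(last_start):
        # Stop when max_dashes or more dashes appear in the window.
        if clipped[start:start + window].count("-") >= max_dashes:
            return start
    return len(clipped)
```

Running it with several values of window and max_dashes is one way of trying the procedure with different cutoffs, as recommended above.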
|
||
|
Bugs
|
||
|
If a small contig is wholly contained within a larger one, such that
|
||
|
its ends are further than ("Probe length" - "Minimum initial match
|
||
|
length") from the ends of the larger contig, and the consensus for
|
||
|
the small contig lies to the left of the consensus for the large contig,
|
||
|
the overlap will not be discovered. (See the search strategy).
|
||
|
All numbering is relative to base number one in the contig: matches
|
||
|
to the left (i.e. in the clipped data) have negative positions,
|
||
|
matches off the right end of the contig (i.e. in the clipped data)
|
||
|
have positions greater than that of the contig length. A typical
|
||
|
result is shown below.
|
||
|
|
||
|
Right end of contig 22 in the - sense and contig 96
|
||
|
Percentage mismatch after alignment = 3.0
|
||
|
628 638 648 658 668 678
|
||
|
GTGAGATGAG CATATTTAAA ATGAACCGAG CAGTTAGGAG ATATGTTGGG AGGACAAGAA
|
||
|
********* ********** ********** ********** ********** **********
|
||
|
-TGAGATGAG CATATTTAAA ATGAACCGAG CAGTTAGGAG ATATGTTGGG AGGACAAGAA
|
||
|
-86 -76 -66 -56 -46 -36
|
||
|
688 698 708 718
|
||
|
ACATCCGGGA TACAGTCAAT AAATGAAAAA TTAATGAATT
|
||
|
********** ********** ****** *** ***** ****
|
||
|
ACATCCGGGA TACAGTCAAT AAATGA-AAA TTAATTAATT
|
||
|
-26 -16 -6 4
|
||
|