|
@-1. TX 0 @General
|
||
|
|
||
|
@-2. T 0 @Screen control
|
||
|
|
||
|
@-2. X 0 @Screen
|
||
|
|
||
|
@-3. TX 0 @Modification
|
||
|
|
||
|
@0. TX -1 @BAP
|
||
|
|
||
|
This is an interactive program whose primary use is for
|
||
|
managing shotgun sequencing projects, but it can also be used for
|
||
|
handling alignments of other sequences, including those of proteins.
|
||
|
Currently the maximum 'gel reading' length is set to 4096
|
||
|
characters. Almost all of the information below describes the use of
|
||
|
the program for shotgun projects, but those using the programs for
|
||
|
handling other sequence alignments should interpret it accordingly.
|
||
|
The data for such a project is stored in a special type of database.
|
||
|
The program contains the tools that are required to screen gel
|
||
|
readings against vector sequences and restriction sites, and to
|
||
|
assemble new gel readings into the database (automatically comparing
|
||
|
and aligning them). In addition it contains editors and functions to
|
||
|
examine the quality of the aligned sequences.
|
||
|
|
||
|
There are three main menus: "general", "screen" and
|
||
|
"modification", and some functions have submenus.
|
||
|
The general menu contains the following options:
|
||
|
|
||
|
Open a database
|
||
|
Display a contig
|
||
|
List a text file
|
||
|
Direct output
|
||
|
Calculate a consensus
|
||
|
Screen against restriction enzymes
|
||
|
Screen against vector
|
||
|
Check logical consistency
|
||
|
Copy database
|
||
|
Show relationships
|
||
|
Set parameters
|
||
|
Highlight disagreements
|
||
|
Examine quality
|
||
|
Check Assembly
|
||
|
Find read pairs
|
||
|
|
||
|
The screen (graphics) menu contains:
|
||
|
|
||
|
Clear graphics
|
||
|
Clear text
|
||
|
Draw ruler
|
||
|
Use cross hair
|
||
|
Change margins
|
||
|
Label diagram
|
||
|
Plot map
|
||
|
Plot single contig
|
||
|
Plot all contigs
|
||
|
|
||
|
|
||
|
The modification menu contains:
|
||
|
|
||
|
Edit contig
|
||
|
Auto assemble
|
||
|
Join contigs
|
||
|
Complement a contig
|
||
|
Alter relationships
|
||
|
Extract gel readings
|
||
|
Find internal joins
|
||
|
Disassemble readings
|
||
|
Shuffle pads
|
||
|
Auto-select oligos
|
||
|
Double strand
|
||
|
|
||
|
The alter relationships menu contains:
|
||
|
|
||
|
Cancel
|
||
|
Line change
|
||
|
Check logical consistency
|
||
|
Remove contig
|
||
|
Shift
|
||
|
Move gel reading
|
||
|
Rename gel reading
|
||
|
Break a contig
|
||
|
Remove a gel reading
|
||
|
Alter raw data parameters
|
||
|
|
||
|
|
||
|
|
||
|
Overview of the methodology
|
||
|
|
||
|
The shotgun sequencing strategy
|
||
|
|
||
|
In the shotgun sequencing procedure the sequence to be
|
||
|
determined is randomly broken into fragments of about 1000
|
||
|
nucleotides in length. These fragments are cloned and then selected
|
||
|
randomly and their sequences determined. The relationship
|
||
|
between any pair of fragments is not known beforehand but is
|
||
|
found by comparing their sequences. If the sequence of one is
|
||
|
found to be wholly or partially contained within that of another
|
||
|
for sufficient length to distinguish an overlap from a repeat
|
||
|
then those two fragments can be joined. The process of select,
|
||
|
sequence and compare is continued until the whole of the DNA to
|
||
|
be sequenced is in one continuous well determined piece.
|
||
|
|
||
|
Definition of a contig
|
||
|
|
||
|
A CONTIG is a set of gel readings that are related to
|
||
|
one another by overlap of their sequences. All gel readings
|
||
|
belong to a contig and each contig contains at least one gel
|
||
|
reading. The gel readings in a contig can be summed to produce a
|
||
|
continuous consensus sequence and the length of this sequence is the
|
||
|
length of the contig. The rules used to perform this summation are
|
||
|
given under "the consensus algorithm". At any stage of a
|
||
|
sequencing project the data will comprise a number of contigs; when
|
||
|
a project is complete there should be only one contig and its
|
||
|
consensus will be the finished sequence. Note that since being
|
||
|
introduced and defined as above the word "contig" has been taken up
|
||
|
by those involved in genomic mapping. In that context the consensus
|
||
|
with a precise length is, of course, not defined.
|
||
|
|
||
|
Introduction to the computer method
|
||
|
|
||
|
It is useful to consider the objectives of a sequencing
|
||
|
project before outlining how we use the computer to help achieve
|
||
|
them. The aim of a shotgun sequencing project is to produce an
|
||
|
accurate consensus sequence from many overlapping gel readings. It
|
||
|
is necessary to know, particularly at the latter stages of the
|
||
|
project, how accurate the consensus sequence is. This enables us to
|
||
|
know which regions of the sequence require further work and also to
|
||
|
know when the project is finished. To show the quality of the
|
||
|
consensus, the programs described here produce displays like that
|
||
|
shown below.
|
||
|
|
||
|
|
||
|
                        10        20        30        40        50
  -6 HINW.010   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
     CONSENSUS  GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                        60        70        80        90       100
  -6 HINW.010   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
  -3 HINW.007                                           GGCACA*GTC
     CONSENSUS  CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                       110       120       130       140       150
  -6 HINW.010   GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
  -3 HINW.007   GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
  -5 HINW.009                                       GGCAGCGACCGTCG
  17 HINW.999                                          AGCGACCGTCG
     CONSENSUS  GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                       160       170       180       190       200
  -6 HINW.010   TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
  -3 HINW.007   TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
  -5 HINW.009   TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
  17 HINW.999   TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
  12 HINW.017                                             GTAGAGC*
     CONSENSUS  TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
|
||
|
|
||
|
This is an example showing the left end of a contig from
|
||
|
position 1 to 200. Overlapping this region are gel readings
|
||
|
numbered 6, 3, 5, 17 and 12; 6, 3 and 5 are in reverse orientation
|
||
|
to their original reading (denoted by a minus sign). Each gel
|
||
|
reading also has a name (eg HINW.010). It can be seen that in a
|
||
|
number of places the sequences contain characters other than A,C,G
|
||
|
and T. Some of these extra characters have been used by the
|
||
|
sequencer to indicate regions of uncertainty in the initial
|
||
|
interpretation of the gel reading, but the asterisks (*) have been
|
||
|
inserted by the automatic assembly function in order to align the
|
||
|
sequences. Underneath each 50 character block of gel reading
|
||
|
sequences is the consensus derived from the sequences aligned above
|
||
|
(the line labelled CONSENSUS). For most of its length the consensus
|
||
|
has a definite nucleotide assignment but in a few positions there is
|
||
|
insufficient agreement between the gel readings and so a dash (-)
|
||
|
appears in the sequence. This display contains all the evidence
|
||
|
needed to assess the quality of the consensus: the number of times
|
||
|
the sequence has been determined on each strand of the DNA, and the
|
||
|
individual nucleotide assignments given for each gel reading.
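
The way a consensus line with dashes can arise from padded, aligned
readings may be sketched in Python as follows. This is an illustration
only: the simple majority rule, the 0.75 cutoff and the function names are
assumptions made for the sketch, and uncertainty codes are treated as
ordinary characters. The rules the program really uses are those given
under "the consensus algorithm".

```python
from collections import Counter


def consensus_column(chars, cutoff=0.75):
    """Return the agreed character for one aligned column, or '-' when
    the most common character falls below the (assumed) cutoff."""
    if not chars:
        return "-"
    best, count = Counter(chars).most_common(1)[0]
    return best if count / len(chars) >= cutoff else "-"


def consensus(reads, length, cutoff=0.75):
    """Build a consensus of the given length.

    reads is a list of (start, sequence) pairs, where start is the
    1-based position of the left end of the reading in the contig.
    """
    out = []
    for pos in range(1, length + 1):
        # Collect the characters of every reading that covers this position.
        column = [seq[pos - start] for start, seq in reads
                  if start <= pos < start + len(seq)]
        out.append(consensus_column(column, cutoff))
    return "".join(out)


# Positions 91 to 100 of the display above, renumbered from 1.
print(consensus([(1, "CGGACACGTC"), (1, "GGCACA*GTC")], 10))
```

With only two readings covering a position, any disagreement falls below
the cutoff, so the sketch prints -G-ACA-GTC for this stretch.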
|
||
|
|
||
|
So the aim is to produce the consensus sequence and, equally
|
||
|
important, a display of the experimental results from which it was
|
||
|
derived.
|
||
|
|
||
|
In order to achieve this the following operations need to be
|
||
|
performed:
|
||
|
1) Put individual gel readings into the computer. This might
|
||
|
involve the manual interpretation of autoradiographs or the
|
||
|
transfer and processing of machine-readable files from fluorescent
|
||
|
sequencing machines.
|
||
|
2) Check each gel reading to make sure it is not simply part of one
|
||
|
of the vectors used to clone the sequence.
|
||
|
3) Check each gel reading to make sure that those fragments that
|
||
|
span the ligation point used prior to sonication are not assembled
|
||
|
as single sequences.
|
||
|
4) Compare all the remaining gel readings with one another to
|
||
|
assemble them to produce the consensus sequence.
|
||
|
5) Check the quality of the consensus and edit the sequences.
|
||
|
6) When all the consensus is sufficiently well determined, produce a
|
||
|
copy of it for processing by other analysis programs.
|
||
|
|
||
|
It is very unlikely that this procedure will only be passed
|
||
|
through once. Usually steps 1 to 5 are cycled through repeatedly,
|
||
|
with step 4 just adding new sequences to those already assembled.
|
||
|
Generally step 6 is also used in order to analyse imperfect sequence
|
||
|
to check if it is the one the project intended to sequence, or to
|
||
|
look for interesting features. Analysis of the consensus, such as
|
||
|
searches for protein coding regions, can also help to find errors in
|
||
|
the sequence. The display of the overlapping gel readings shown
|
||
|
above can be used to indicate, not only the poorly determined
|
||
|
regions, but also which clones should be resequenced to resolve
|
||
|
ambiguities, or those which can usefully be extended or sequenced in
|
||
|
the reverse direction, to cover difficult regions.
|
||
|
|
||
|
The original individual gel readings for a sequencing project
|
||
|
are each stored in separate files. As the gel readings are entered
|
||
|
into the computer (usually in batches, say 10 from a film), the file
|
||
|
names they are given are stored in a further file, called a file of
|
||
|
file names. Files of file names enable gel readings to be processed
|
||
|
in batches.
|
||
|
|
||
|
For each sequencing project we start a project database. This
|
||
|
database has a structure specifically designed for dealing with
|
||
|
shotgun sequence data. In order to arrive at the final consensus
|
||
|
sequence many operations will be performed on the sequence data.
|
||
|
Individual fragments must be sequenced and compared in both senses
|
||
|
(i.e. both orientations) with all the other sequences. When an
|
||
|
overlap between a new gel reading and a contig is found they must
|
||
|
be aligned and the new gel reading added to the contig. If a new gel
|
||
|
reading overlaps two contigs they must be aligned and joined. Before
|
||
|
the two contigs are joined one of them may need to be turned around
|
||
|
(reversed and complemented) so they are both in the same
|
||
|
orientation.
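
Turning a sequence around (reversing and complementing it) can be sketched
in Python as below. The function name is hypothetical, and leaving padding
characters and uncertainty codes unchanged is an assumption of the sketch;
how the program itself treats those characters is not stated in this
section.

```python
# Translation table for the four bases; every other character
# (pads such as '*', uncertainty codes) is passed through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence as read from the opposite strand."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("GGCACA*GTC"))
```

Applying the function twice returns the original sequence, which is why a
contig can be turned around as often as needed without loss.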
|
||
|
|
||
|
Clearly, keeping track of all these manipulations is quite
|
||
|
complicated, and to be able to perform the operations quickly
|
||
|
requires careful choice of data structure and algorithms. For these
|
||
|
reasons it is not practicable to store the gel readings aligned as
|
||
|
shown in the display above. Rather, it is more convenient to store
|
||
|
the sequences unassembled, and to record sufficient information for
|
||
|
programs to assemble them during processing. The data used to
|
||
|
assemble the sequences is called relational information.
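
As a rough picture of what relational information looks like, the Python
sketch below models the facts listed under "open a database" (gel and
contig descriptor lines) and walks one contig from its leftmost reading.
The class and field names, and the use of 0 to mean "no neighbour", are
assumptions made for the sketch; the real file layout is not described
here.

```python
from dataclasses import dataclass


@dataclass
class GelDescriptor:
    number: int         # number given when the reading was entered
    length: int         # length of the sequence from this reading
    position: int       # left end, relative to the left end of the contig
    left: int           # next reading to the left (0 = none, assumed)
    right: int          # next reading to the right (0 = none, assumed)
    complemented: bool  # True if in the complementary sense to its archive


@dataclass
class ContigDescriptor:
    length: int
    leftmost: int       # number of the leftmost gel reading
    rightmost: int      # number of the rightmost gel reading


def readings_in_contig(contig, gels):
    """Return the reading numbers of a contig in left-to-right order."""
    order, current = [], contig.leftmost
    while current != 0:
        order.append(current)
        current = gels[current].right
    return order


# Reading numbers and order follow the display above; lengths are made up.
gels = {
    6: GelDescriptor(6, 200, 1, 0, 3, True),
    3: GelDescriptor(3, 110, 91, 6, 5, True),
    5: GelDescriptor(5, 64, 137, 3, 17, True),
    17: GelDescriptor(17, 61, 140, 5, 12, False),
    12: GelDescriptor(12, 8, 193, 17, 0, False),
}
print(readings_in_contig(ContigDescriptor(200, 6, 12), gels))
```

Because each reading records only its neighbours and its position, joining
or complementing a contig changes a few descriptor fields rather than the
stored sequences.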
|
||
|
|
||
|
The database comprises five files and they are described under
|
||
|
the section entitled "open database".
|
||
|
|
||
|
Before entry into the project database each new gel reading
|
||
|
must be compared to look for overlaps with all the data already
|
||
|
contained within the database. This last point is important: all
|
||
|
searching for overlaps is between individual new gel readings and
|
||
|
the data already in the database. There is no searching for overlaps
|
||
|
between sequences within the database; overlaps must be found before
|
||
|
new gel readings are entered into the database.
|
||
|
|
||
|
Below I give an introduction to how the sequences are
|
||
|
processed by being passed from one function to the next.
|
||
|
|
||
|
This program is used to start a database for the project and
|
||
|
then the following procedure is used.
|
||
|
|
||
|
Data in the form of individual gel readings are entered into
|
||
|
the computer and stored in separate files (possibly using the
|
||
|
digitizer program GIP). Batches of these gel readings are passed to
|
||
|
the screening functions in this program to search for overlaps with
|
||
|
vector sequences (see VEP and "screen against vector") or for
|
||
|
matches to restriction enzyme sites that should not be present
|
||
|
("screen against enzymes"). Each run of these screening functions
|
||
|
passes on only those gel readings that do not contain unwanted
|
||
|
sequences. Sequences are passed via files of file names and
|
||
|
eventually are processed by the automatic assembly function ("auto
|
||
|
assemble"). This function compares each gel reading with a consensus
|
||
|
of all the previous gel readings stored in the database. If it
|
||
|
finds any overlaps it aligns the overlapping sequences by inserting
|
||
|
padding characters, and then adds the new gel reading to the
|
||
|
database. Gels that overlap are added to existing contigs and gels
|
||
|
that do not overlap any data in the database start new contigs. If a
|
||
|
new gel overlaps two contigs they are joined. Any gel readings that
|
||
|
appear to overlap but which cannot be aligned sufficiently well are
|
||
|
not entered and have their names written to a file of failed gel
|
||
|
reading names.
|
||
|
|
||
|
Generally data is entered into the database in batches as just
|
||
|
described. The program is also used to examine the data in the
|
||
|
database, to enter gel readings that the automatic assembly function
|
||
|
cannot align ("auto assemble"), and to make final edits. Edits to
|
||
|
whole contigs can be made using a mouse-driven editor ("edit
|
||
|
contig").
|
||
|
|
||
|
Editing the sequences is obviously an essential part of
|
||
|
managing a sequencing project. Editing is required when new
|
||
|
sequences are added, when contigs are joined, and when sequences are
|
||
|
corrected. A basic part of the strategy used here is that new gel
|
||
|
readings should be correctly aligned throughout their whole length
|
||
|
when they are entered into the database, and that when contigs are
|
||
|
joined they are edited so that they are well aligned in the region
|
||
|
of overlap. Alignment can be achieved by adding padding characters
|
||
|
to the sequences, and this is the way "auto assemble" operates when
|
||
|
adding new sequences to the database.
|
||
|
|
||
|
In order to search for overlaps that may have been missed or
|
||
|
may be hidden in the "unused data" the function "find internal
|
||
|
joins" can be used.
|
||
|
|
||
|
Generally the users need not concern themselves with how the
|
||
|
relational information is used by the program, but it is necessary
|
||
|
to know how contigs are identified. Because contigs are constantly
|
||
|
being changed and reordered the program identifies them by the
|
||
|
numbers of the gel readings they contain. Whenever users need to
|
||
|
identify a contig they need only know the number or name of one of
|
||
|
the gel readings it contains. Whenever the program asks users to
|
||
|
identify a contig or gel reading they can type its number or its
|
||
|
archive name. If they type its archive name they must precede the
|
||
|
name by a slash "/" symbol to denote that it is a name rather than a
|
||
|
number. E.g. if the archive name is fred.gel with number 99, users
|
||
|
should type /fred.gel or 99 when asked to identify the contig.
|
||
|
Generally, when it asks for the gel reading to be identified, the
|
||
|
program will offer the user a default name, and if the user types
|
||
|
only return, that contig will be accessed. When a database is opened
|
||
|
the default contig will be the longest one, but if another is
|
||
|
accessed, it will subsequently become the current default.
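
The rule for identifying a contig or gel reading (a number, or an archive
name preceded by a slash, or return for the default) can be sketched in
Python as follows. The function name and the form of the returned value
are hypothetical and are not taken from the program.

```python
def parse_identifier(text, default=None):
    """Interpret what the user typed when asked to identify a contig.

    Returns ("name", archive_name) for input beginning with "/",
    ("number", n) for a number, or the default if only return was typed.
    """
    text = text.strip()
    if not text:
        return default
    if text.startswith("/"):
        return ("name", text[1:])
    return ("number", int(text))


print(parse_identifier("/fred.gel"))
print(parse_identifier("99"))
```

Both calls refer to the same gel reading in the example given above; the
slash is what tells the two kinds of input apart.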
|
||
|
|
||
|
Further information is located in the following places. The
|
||
|
database files are described under "open database". The format for
|
||
|
vector and consensus sequences is given under "calculate a
|
||
|
consensus", as are the uncertainty codes used in gel readings.
|
||
|
|
||
|
The digitizer program is used for the initial input of gel
|
||
|
readings and for writing a file of file names. The program uses a
|
||
|
digitizer for data entry. A digitizer is a two dimensional
|
||
|
surface such as a light box which is such that if a special pen is
|
||
|
pressed onto it, the pen's coordinates are recorded by a computer.
|
||
|
These coordinates can be interpreted by a program.
|
||
|
|
||
|
In order to read an autoradiograph placed on the light box the
|
||
|
user need only define the bottom of the four sequencing lanes and
|
||
|
the bases to which they correspond and then use the pen to point
|
||
|
to each successive band progressing up the gel. The program
|
||
|
examines the coordinates of each pen position to see in which of the
|
||
|
four lanes it lies and assigns the corresponding base to be
|
||
|
stored in the computer. Each time the pen tip is depressed to point
|
||
|
to a position on the surface of the digitizer the program sounds
|
||
|
the bell on the terminal to indicate to the user that a point has
|
||
|
been recorded. As the sequence is read the program displays it on
|
||
|
the screen.
|
||
|
@17. TX 1 @Screen against enzymes
|
||
|
|
||
|
Used to compare gel readings against any restriction enzyme
|
||
|
recognition sequences that may have been used during cloning and
|
||
|
which should not be present in the data. Works on single gel
|
||
|
readings or processes batches accessed through files of file names.
|
||
|
The algorithm looks for exact matches to recognition sequences
|
||
|
stored in a file.
|
||
|
|
||
|
The file containing the recognition sequences must be
|
||
|
identified. The user must choose between employing a file of file
|
||
|
names, or typing in the names of individual gel reading files. If a
|
||
|
file of file names is used the program will also create a new file
|
||
|
of file names. When the option has finished operating this new file
|
||
|
will contain the names of all those gel readings that did not match
|
||
|
any of the recognition sequences. Hence it can be used for further
|
||
|
processing of the batch. The recognition sequences should be stored
|
||
|
in a simple text file with one recognition sequence per line.
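
The screening step just described can be sketched in Python as below. The
function names are hypothetical, readings are held in a dictionary rather
than read through a file of file names, and ignoring letter case is an
assumption of the sketch. Whether the program also searches the
complementary strand is not stated here, so the sketch searches only the
strand as given.

```python
def load_recognition_sequences(lines):
    """One recognition sequence per line; blank lines are skipped."""
    return [line.strip().upper() for line in lines if line.strip()]


def matching_sites(sequence, sites):
    """Return the recognition sequences found exactly in the reading."""
    seq = sequence.upper()
    return [site for site in sites if site in seq]


def screen_batch(readings, sites):
    """Return the names of the readings that match no recognition
    sequence, i.e. the contents of the new file of file names."""
    return [name for name, seq in readings.items()
            if not matching_sites(seq, sites)]


sites = load_recognition_sequences(["GAATTC\n", "\n", "ggatcc\n"])
batch = {"a.gel": "TTGAATTCAA", "b.gel": "ACGTACGT"}
print(screen_batch(batch, sites))
```

Only b.gel is passed on, since a.gel contains the first recognition
sequence.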
|
||
|
@18. TX 1 @Screen against vector
|
||
|
|
||
|
Used to compare gel readings against any vector sequences that
|
||
|
may have been picked up during cloning and which have not been
|
||
|
removed by VEP. It works on single gel readings or processes batches
|
||
|
accessed through files of file names. The algorithm looks for exact
|
||
|
matches of length "minimum match length" and displays the
|
||
|
overlapping sequences.
|
||
|
|
||
|
The file containing the vector sequence must be identified.
|
||
|
The user must choose between employing a file of file names, or
|
||
|
typing in the names of individual gel reading files. If a file of
|
||
|
file names is used the program will also create a new file of file
|
||
|
names. When the option has finished operating this new file will
|
||
|
contain the names of all those gel readings that did not match the
|
||
|
vector sequence. Hence it can be used for further processing of the
|
||
|
batch. The vector sequence should be stored in a simple text file
|
||
|
with up to 80 characters of data per line. More than one vector can
|
||
|
be stored in a single file. If so each should be preceded by a 20
|
||
|
character title of the form <---m13mp8.0001----> where the < and >
|
||
|
signs and the number like .0001 are obligatory. The number must be
|
||
|
preceded by a dot (.) and be 4 digits long. The total sequence in
|
||
|
the file must not exceed 500,000 characters in length.
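
A vector file laid out as described (20 character titles in angle
brackets, each followed by lines of sequence) could be read with a Python
sketch such as the one below. The regular expression, the function name
and the "untitled" key used for a single vector stored without a title
are assumptions made for illustration.

```python
import re

# A title is 20 characters: '<', dashes, a name ending in a dot and
# four digits, more dashes, '>'.  The pattern itself is an assumption.
TITLE = re.compile(r"<-*(.+?\.\d{4})-*>")


def parse_vector_file(lines):
    """Return a dictionary mapping each vector title to its sequence."""
    vectors, current = {}, None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = TITLE.fullmatch(line) if len(line) == 20 else None
        if match:
            current = match.group(1)
            vectors[current] = ""
        else:
            if current is None:
                current = "untitled"  # hypothetical key for a lone vector
                vectors[current] = ""
            vectors[current] += line
    return vectors


example = ["<---m13mp8.0001---->", "ACGT", "GGCC",
           "<---pUC18.0002----->", "TTAA"]
print(parse_vector_file(example))
```

The sequences in the example are placeholders, not real vector data.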
|
||
|
@20. TX 3 @Auto assemble
|
||
|
|
||
|
Compares gel readings against the current contents of the
|
||
|
database and produces alignments. In its normal mode of operation
|
||
|
("entry permitted"), the function will automatically enter the gel
|
||
|
readings into the database.
|
||
|
|
||
|
New assembly suboption. However if entry is not permitted the
|
||
|
reads won't be entered but the program will produce alignments and
|
||
|
(optionally) save each reading name and its best alignment score
|
||
|
(percentage mismatch) in a file. When used in this mode, the program
|
||
|
will include in the alignment the poor quality data for each
|
||
|
reading. These files of names can then be sorted into score order
|
||
|
and then used for assembly, hence forcing the readings that align
|
||
|
best to be entered into the database first. End of new suboption.
|
||
|
|
||
|
The routine works on single gel readings or processes batches
|
||
|
of gel readings accessed through files of file names. It is the only
|
||
|
way to enter data into the database.
|
||
|
|
||
|
The function will check the database for logical consistency
|
||
|
and will only proceed if it is OK. Choose if gel readings should be
|
||
|
entered into the database, or if they should only be compared.
|
||
|
Choose between using a file of file names or typing file names on
|
||
|
the keyboard. If so selected, supply the file of file names. Also
|
||
|
supply a file of file names to contain the names of all the gel
|
||
|
readings that fail to get entered. Select the entry mode. Normal
|
||
|
assembly is appropriate for all but special cases, as is "permit
|
||
|
joins". Uses for the other modes are not documented here. Define a
|
||
|
minimum initial match length. Define the maximum number of padding
|
||
|
characters allowed to be used in each gel reading to help achieve
|
||
|
alignment, and the same for the number allowed in the contig for
|
||
|
each gel reading. Finally define the maximum percentage mismatch to
|
||
|
be allowed for any gel reading to be entered into the database. If
|
||
|
for any gel reading, any of these last three values is exceeded
|
||
|
the gel reading will not be entered into the database.
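
The final acceptance test on the three limits can be sketched in Python as
follows. The class and function names are hypothetical; the default values
are those offered in the dialogue shown further below (8 pads per gel, 8
pads per gel in the contig, 8.00 percent mismatch).

```python
from dataclasses import dataclass


@dataclass
class AssemblyLimits:
    max_pads_in_gel: int = 8
    max_pads_in_contig: int = 8
    max_percent_mismatch: float = 8.0


def accept_alignment(pads_in_gel, pads_in_contig, percent_mismatch,
                     limits=None):
    """Return True if none of the three limits is exceeded."""
    limits = limits or AssemblyLimits()
    return (pads_in_gel <= limits.max_pads_in_gel
            and pads_in_contig <= limits.max_pads_in_contig
            and percent_mismatch <= limits.max_percent_mismatch)


# Figures of the kind reported for an alignment: 1 pad in the gel,
# none in the contig, 1.8 percent mismatch.
print(accept_alignment(1, 0, 1.8))
```

A reading that fails this test is the kind whose name is written to the
file of failed gel reading names.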
|
||
|
|
||
|
In operation the function takes a batch of gel readings
|
||
|
(probably passed on as a file of file names from one of the
|
||
|
screening routines) and enters them into a database for a sequencing
|
||
|
project. It takes each gel reading in turn, compares it with the
|
||
|
current consensus for the database, and then produces an alignment
|
||
|
for any regions of the consensus it overlaps; if this
|
||
|
alignment is sufficiently good it then edits both the new gel
|
||
|
reading and the sequences it overlaps and adds the new gel
|
||
|
reading to the database. The program then updates the consensus
|
||
|
accordingly and carries on to the next gel reading.
|
||
|
|
||
|
All alignments are displayed and any gel readings that do
|
||
|
match but that cannot be aligned sufficiently well have their names
|
||
|
written to a file of failed gel reading names. The function works
|
||
|
without any user intervention and can process any number of gel
|
||
|
readings in a single run. Those gel readings that fail can be
|
||
|
recompared using the same function (to find the current overlap
|
||
|
position) and the user can enter them into the database using the
"put all sequences in new contigs" assembly option and then join them
|
||
|
using "join contigs".
|
||
|
|
||
|
Typical dialogue and output from the function is shown below.
|
||
|
(Note that output for gel readings 2 - 9 has been deleted to save
|
||
|
space).
|
||
|
Automatic sequence assembler
|
||
|
Database is logically consistent
|
||
|
? (y/n) (y) Permit entry
|
||
|
? (y/n) (y) Use file of file names
|
||
|
? File of gel reading names=demo.nam
|
||
|
? File for names of failures=demo.fail
|
||
|
Select entry mode
|
||
|
X 1 Perform normal shotgun assembly
|
||
|
2 Put all sequences in one contig
|
||
|
3 Put all sequences in new contigs
|
||
|
? Selection (1-3) (1) =
|
||
|
? (y/n) (y) Permit joins
|
||
|
? Minimum initial match (12-4097) (15) =
|
||
|
? Maximum pads per gel (0-25) (8) =
|
||
|
? Maximum pads per gel in contig (0-25) (8) =
|
||
|
? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
|
||
|
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
|
||
|
Processing 1 in batch
|
||
|
Gel reading name=HINW.004
|
||
|
Gel reading length= 283
|
||
|
Searching for overlaps
|
||
|
Strand 1
|
||
|
Strand 2
|
||
|
No matches found
|
||
|
Total matches found 1
|
||
|
Padding in contig= 0 and in gel= 1
|
||
|
Percentage mismatch after alignment = 1.8
|
||
|
Best alignment found
|
||
|
1 11 21 31 41 51
|
||
|
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
|
||
|
********** ********** ********** ********** ********** **********
|
||
|
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
|
||
|
1 11 21 31 41 51
|
||
|
61 71 81 91 101 111
|
||
|
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
|
||
|
********** ********** ********** ********** ********** ***** ****
|
||
|
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
|
||
|
61 71 81 91 101 111
|
||
|
121 131 141 151 161 171
|
||
|
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
|
||
|
********** ********** ********** ********** ********** **********
|
||
|
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
|
||
|
121 131 141 151 161 171
|
||
|
181 191 201 211 221 231
|
||
|
GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
|
||
|
********** ********** * ******** ********** ********** **********
|
||
|
GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
|
||
|
181 191 201 211 221 231
|
||
|
241 251 261 271 281
|
||
|
CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
|
||
|
**** **** ********** ** * ***** ********** ****
|
||
|
CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
|
||
|
241 251 261 271 281
|
||
|
Batch finished
|
||
|
9 sequences processed
|
||
|
0 sequences entered into database
|
||
|
0 joins made
|
||
|
|
||
|
|
||
|
Note that "auto assemble" cannot align protein sequences.
|
||
|
@28. TX 1 @Highlight disagreements
|
||
|
|
||
|
Used in the latter stages of a project to highlight
|
||
|
disagreements between individual gel readings and their consensus
|
||
|
sequences. This display is also available in the contig editor.
|
||
|
Characters that agree with the consensus are shown as : symbols for
|
||
|
the plus strand and . for the minus strand. Characters that disagree
|
||
|
with the consensus are left unchanged and so stand out clearly. The
|
||
|
results of this analysis are written to a file.
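
The substitution itself can be sketched in Python as below. The program
works from a saved display file; this sketch instead takes one aligned
gel reading and the matching stretch of consensus as strings, which is a
simplification, and the function name is hypothetical.

```python
def highlight(read, consensus, reverse=False,
              plus_symbol=":", minus_symbol="."):
    """Replace characters that agree with the consensus by a symbol.

    Readings in reverse orientation use the minus strand symbol.
    Characters that disagree are left unchanged so they stand out.
    """
    symbol = minus_symbol if reverse else plus_symbol
    return "".join(symbol if r == c else r
                   for r, c in zip(read, consensus))


print(highlight("GCTATTTTGT", "G-TATTTTGT"))
print(highlight("GTGG", "GTGG", reverse=True))
```

The first call leaves only the C that faces a dash in the consensus; the
second shows a fully agreeing stretch on the minus strand.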
|
||
|
|
||
|
Before selecting this option create a file of the display of
|
||
|
the contig to be "highlighted". The option will ask for the name of
|
||
|
this file. Select symbols to denote "agreeing" characters on each
|
||
|
strand; the defaults are : and ., but any others can be used. Supply
|
||
|
the name of a file in which to put the output.
|
||
|
|
||
|
The display file needed as input for this option is created by
|
||
|
selecting "Redirect output", followed immediately by "display
|
||
|
contig", and then "Redirect output" again. The cutoff score used in
|
||
|
the consensus calculation can be set by option "set display
|
||
|
parameters". Note that for the highlight function there is a limit
|
||
|
of 50 for the number of gel readings that are aligned at any
|
||
|
position - i.e. the contig must be less than 51 gel readings deep at
|
||
|
its thickest point. I hope that those performing shotgun sequencing
|
||
|
never reach this limit, but those using the program for comparing
|
||
|
sequence families might.
|
||
|
|
||
|
Typical output from this function is shown below.
|
||
|
|
||
|
210 220 230 240 250
|
||
|
1 HINW.004 :C::::::::::::::::::::::::::::::::::::::::::AC::::
|
||
|
7 HINW.018 :*::::::::::::::::::::::::::::::::::::::::::CA::::
|
||
|
-4 HINW.017 ...............AC....
|
||
|
G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG
|
||
|
|
||
|
260 270 280 290 300
|
||
|
1 HINW.004 ::::::::::::*:D:::::::::::::::::::
|
||
|
7 HINW.018 ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
|
||
|
-4 HINW.017 ..............................................A...
|
||
|
3 HINW.009 :::::::::::::::V::::::::::::::::::::::::::::*AV:::
|
||
|
-6 HINW.028 ......................A...
|
||
|
AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG
|
||
|
|
||
|
@32. TX 3 @Extract gel readings
|
||
|
|
||
|
Used to make copies of the aligned gel readings in a database,
|
||
|
to write them into separate files, and to write a corresponding file
|
||
|
of file names. It operates in two modes: either all gel readings are
|
||
|
extracted, or only those at the ends of contigs.
|
||
|
|
||
|
Choose which mode of operation is required and supply a file
|
||
|
of file names.
|
||
|
|
||
|
The gel readings are given their original names.
|
||
|
|
||
|
If the option is used to extract all the gel readings from a
|
||
|
database, a subsequent run of "auto assemble" can reconstitute a
|
||
|
database which has been corrupted. This rarely occurs and is
|
||
|
usually necessitated by a user employing "alter relationships"
|
||
|
incorrectly without first having made a copy.
|
||
|
@1. TX 0 @Help
|
||
|
|
||
|
Help is available on the following topics:
|
||
|
@2. TX 0 @Quit
|
||
|
|
||
|
This command stops the program and is the only safe way to
|
||
|
terminate a run of the program that has altered the contents of the
|
||
|
database in any way.
|
||
|
@3. TX 1 @Open a database
|
||
|
|
||
|
Opens existing databases or allows new ones to be started. The
|
||
|
function is automatically called into operation when the program is
|
||
|
started but can also be selected from the general menu.
|
||
|
|
||
|
Choose to open an existing database or start a new one, or if
|
||
|
! is typed when the program is first started, enter the program
|
||
|
without opening a database. Supply a project database name, and if
|
||
|
it already exists, the "version". If starting a new database define
|
||
|
the database size and if it is for DNA or protein sequences. The
|
||
|
database size is an initial size for the database. It can be
|
||
|
increased later during the project. It is the sum of the number of
|
||
|
gel readings plus the number of contigs. The current maximum size is
|
||
|
8000.
|
||
|
|
||
|
Database names can have from one to 12 letters and must not
|
||
|
include a full stop (.). The database is made from five separate
|
||
|
files. If the database is called FRED then version 0 of database
|
||
|
FRED comprises files FRED.AR0, FRED.RL0, FRED.SQ0, FRED.TG0 and
|
||
|
FRED.CC0. The version is the last symbol in the file names. Only
|
||
|
this program can read these files. If the "copy database" option is
|
||
|
used it will ask the user to define a new "version".
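
The naming scheme for the five files can be sketched in Python as follows.
The function name is hypothetical, and the check covers only the length
of the name and the absence of a full stop, as stated above.

```python
def database_files(name, version="0"):
    """Return the five file names for one version of a project database."""
    if not 1 <= len(name) <= 12 or "." in name:
        raise ValueError("name must have 1 to 12 characters and no full stop")
    # The version is the last symbol of each file name.
    return [f"{name}.{ext}{version}" for ext in ("AR", "RL", "SQ", "TG", "CC")]


print(database_files("FRED"))
```

For database FRED, version 0, this gives FRED.AR0, FRED.RL0, FRED.SQ0,
FRED.TG0 and FRED.CC0.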
|
||
|
|
||
|
For normal use the maximum gel reading length is set to 512
|
||
|
characters, but when a database is started the user may choose
|
||
|
lengths of 512, 1024, 1536, ..., 4096. Normally the program is
|
||
|
used to handle DNA sequences but many of the functions also work on
|
||
|
protein sequences. The choice of sequence type is made when the
|
||
|
database is started.
|
||
|
|
||
|
The contigs are not stored on the disk as the user sees them
|
||
|
displayed on the screen. Each gel reading is stored with sufficient
|
||
|
information about how it overlaps other gel readings so that the
|
||
|
program can work out how to present them aligned on the screen. We
|
||
|
refer to this extra data as "the relationships" and it is explained
|
||
|
below. The database comprises 5 separate files.
|
||
|
1. a working version of each gel reading. This is the version of
|
||
|
the gel reading that is in the database and initially it is an
|
||
|
exact copy of the original sequence (known as the archive) but it is
|
||
|
edited and manipulated to align it with other gel readings.
|
||
|
2. the file of relationships. This file contains all of the
|
||
|
information that is required to assemble the working versions into
|
||
|
contigs during processing; any manipulations on the data use this
|
||
|
file and it is automatically updated at any time that the
|
||
|
relationships are changed. The information in this file is as
|
||
|
follows:
|
||
|
(A) Facts about each gel reading and its relationship to
|
||
|
others ("gel descriptor lines"):
|
||
|
(a) the number of the gel reading (each gel reading is given a
|
||
|
number as it is entered into the database)
|
||
|
(b) the length of the sequence from this gel reading
|
||
|
(c) the position of the left end of this gel reading relative to
|
||
|
the left end of the contig of which it is a member
|
||
|
(d) the number of the next gel reading to the left of this gel
|
||
|
reading
|
||
|
(e) the number of the next gel reading to the right
|
||
|
(f) the relative strandedness of this gel reading, i.e. whether it
|
||
|
is in the same sense or the complementary sense as its archive.
|
||
|
(B) Facts about each contig ("contig descriptor lines"):
|
||
|
(a) the length of this contig
|
||
|
(b) the number of the leftmost gel reading of this contig
|
||
|
(c) the number of the rightmost gel reading of this contig.
|
||
|
(C) General facts:
|
||
|
(a) the number of gel readings in the database
|
||
|
(b) the number of contigs in the database.
|
||
|
3. the file of archive names. This is simply a list of the names
|
||
|
of each of the archive files in the database.
|
||
|
4. the file of tags (annotation). This consists of linked lists of
|
||
|
tag information for each sequence in the database. Tags are
|
||
|
created by the user as annotation, or by xdap as records of edits or
|
||
|
for storing cutoff information. As the number of tags can grow
|
||
|
without limit, so can this file. For each gel there is a header
|
||
|
record, which contains the record number of the start of the linked
|
||
|
list for that gel. On line IDBSIZ there is a record containing
|
||
|
information about the file such as its present length and if there
|
||
|
are any free "tag" slots to be reused in the file.
5. the file of comments (annotation). This consists of linked lists
of comment
|
||
|
fragments. Comments are created by the user as a message attached
|
||
|
to annotation, or by the system to store cutoff information.
|
||
|
Comments are character strings of any length. Comments longer than
|
||
|
40 characters are broken up into fragments, each 40 characters long,
|
||
|
and are chained together in a linked list. As the number of comments
|
||
|
can grow without limit, so can this file.
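
The fragment chaining described for the comment file can be sketched
as follows. This is only an illustration: the record numbering, the
use of 0 to mark the end of a chain, and the function names are
assumptions, not the program's actual record layout.

```python
FRAGMENT_LEN = 40  # comments are stored in 40-character fragments


def split_comment(text, first_record):
    """Break `text` into fragments chained by record number.

    Returns a list of (record, next_record, fragment) tuples.
    A next_record of 0 marks the end of the chain (an assumption).
    """
    pieces = [text[i:i + FRAGMENT_LEN]
              for i in range(0, len(text), FRAGMENT_LEN)] or [""]
    records = []
    for offset, piece in enumerate(pieces):
        record = first_record + offset
        is_last = offset == len(pieces) - 1
        records.append((record, 0 if is_last else record + 1, piece))
    return records


def join_comment(records):
    """Follow the chain from the first record and rebuild the comment."""
    by_number = {record: (nxt, piece) for record, nxt, piece in records}
    record = records[0][0]
    parts = []
    while record != 0:
        record, piece = by_number[record]
        parts.append(piece)
    return "".join(parts)
```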
|
||
|
|
||
|
Structure of the database files
|
||
|
|
||
|
1. The file of relationships
|
||
|
|
||
|
The file contains IDBSIZ lines of data: the general data are
|
||
|
stored on line IDBSIZ; data about gel readings are stored from
|
||
|
line 1 downwards; data about contigs are stored from line IDBSIZ-1
|
||
|
upwards. A database of 500 lines containing 25 gel readings and 4
|
||
|
contigs would have a file of relationships as is shown below.
|
||
|
|
||
|
|
||
|
---------------------------------------------
|
||
|
0 Info about the database size
|
||
|
1 Gel descriptor record
|
||
|
2 " " "
|
||
|
3 " " "
|
||
|
4 " " "
|
||
|
5 " " "
|
||
|
' ' ' '
|
||
|
' ' ' '
|
||
|
25 " " "
|
||
|
26 Empty record
|
||
|
' ' '
|
||
|
|
||
|
' ' '
|
||
|
495 ' '
|
||
|
496 Contig descriptor record
|
||
|
497 " " "
|
||
|
498 " " "
|
||
|
499 " " "
|
||
|
500 Number of gel readings=25, Number of contigs=4
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the file of relationships
|
||
|
|
||
|
As each new gel reading is added into the database a new line is
|
||
|
added to the end of the list of gel descriptor lines. If this
|
||
|
new gel reading does not overlap with any gel readings already in
|
||
|
the database a new contig line is added to the top of the list
|
||
|
of contig lines. If it overlaps with one contig then no new contig
|
||
|
line need be added but if it overlaps with two contigs then
|
||
|
these two contigs must be joined and the number of contig lines
|
||
|
will be reduced by one. Then the list of contig lines is compressed
|
||
|
to leave the empty line at the top of the list. Initially the two
|
||
|
types of line will move towards one another but eventually, as
|
||
|
contigs are joined, the contig descriptor lines will move in the
|
||
|
same direction as the gel descriptor lines. At the end of a
|
||
|
project there should be only one contig line. A database of IDBSIZ
lines is thus capable of handling a project of IDBSIZ-2 gels (498
for the 500-line example above).
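
The way gel lines and contig lines share the file can be followed
with a small sketch. The class below only does the line bookkeeping
described above; the class and method names are invented for this
example, and real assembly of course does much more.

```python
class Relationships:
    """Line bookkeeping for a relationships file of `idbsiz` lines."""

    def __init__(self, idbsiz):
        self.idbsiz = idbsiz
        self.ngels = 0      # gel descriptor lines occupy lines 1..ngels
        self.ncontigs = 0   # contig lines end at line idbsiz-1

    def free_lines(self):
        # Line idbsiz holds the general data, so idbsiz-1 lines are shared.
        return (self.idbsiz - 1) - self.ngels - self.ncontigs

    def add_gel(self, overlaps):
        """Add a reading that overlaps 0, 1 or 2 existing contigs."""
        if overlaps not in (0, 1, 2) or overlaps > self.ncontigs:
            raise ValueError("overlaps must be 0, 1 or 2 existing contigs")
        needed = 2 if overlaps == 0 else 1  # a new contig needs a line too
        if self.free_lines() < needed:
            raise OverflowError("database is full")
        self.ngels += 1
        if overlaps == 0:
            self.ncontigs += 1   # reading starts a new contig
        elif overlaps == 2:
            self.ncontigs -= 1   # reading joins two contigs into one
        return self.ngels        # line number of the new gel descriptor

    def contig_lines(self):
        return list(range(self.idbsiz - self.ncontigs, self.idbsiz))
```

With a 500-line database, 25 readings and 4 contigs, this gives contig
lines 496 to 499 and leaves lines 26 to 495 empty, as in the diagram.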
|
||
|
|
||
|
2. Structure of the working versions file
|
||
|
|
||
|
The working versions of gel readings are stored in a file
|
||
|
of NGELS lines each containing MAXGEL characters. Gel reading
|
||
|
number 1 is stored on line 1, gel reading number 2 on line 2 and so
|
||
|
on. NGELS is the current number of readings and MAXGEL the maximum
|
||
|
reading length.
|
||
|
|
||
|
3. Structure of the archive names file
|
||
|
|
||
|
This file has NGELS lines of 16 characters.
|
||
|
|
||
|
4. Structure of the tag file
|
||
|
|
||
|
This file initially starts with IDBSIZ lines, and is expanded
|
||
|
as new tags are created. Information about the length of the file,
|
||
|
and which tag records are reusable is stored on line IDBSIZ. A
|
||
|
database of 500 lines would have a file of tags as shown below.
|
||
|
|
||
|
---------------------------------------------
|
||
|
1 Tag descriptor record
|
||
|
2 " " "
|
||
|
3 " " "
|
||
|
4 " " "
|
||
|
5 " " "
|
||
|
' ' ' '
|
||
|
' ' ' '
|
||
|
497 " " "
|
||
|
498 " " "
|
||
|
499 " " "
|
||
|
500 Length of file=N, Free list=0
|
||
|
501 Tag record
|
||
|
502 " "
|
||
|
503 " "
|
||
|
' ' '
|
||
|
' ' '
|
||
|
N-2 " "
|
||
|
N-1 " "
|
||
|
N Tag record
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the tag file
|
||
|
|
||
|
As each new tag is added to the database, a check is made in the
|
||
|
file descriptor record at line IDBSIZ. If the list of reusable
|
||
|
records is 0, the file is extended by one line. Otherwise the new
|
||
|
tag is assigned to the record at the head of the free list. When tags
|
||
|
are deleted, they are added to the free list in the file descriptor
|
||
|
record.
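
The free list handling can be sketched like this. The sketch keeps
the free list as a Python list rather than as linked records in the
file, and the class name TagFile is invented for the example.

```python
class TagFile:
    """Record allocation for a tag file that starts with IDBSIZ lines."""

    def __init__(self, idbsiz):
        self.length = idbsiz  # current number of lines in the file
        self.free = []        # reusable records; empty means "Free list=0"

    def allocate(self):
        """Return the record number to use for a new tag."""
        if self.free:
            return self.free.pop()  # reuse the record at the head of the list
        self.length += 1            # otherwise extend the file by one line
        return self.length

    def release(self, record):
        """Put the record of a deleted tag onto the free list."""
        self.free.append(record)
```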
|
||
|
|
||
|
5. Structure of the comment file
|
||
|
|
||
|
This file initially starts with 1 line, and is expanded as new
|
||
|
annotation is created. Information about the length of the file,
|
||
|
and which comment records are reusable is stored on the first line.
|
||
|
|
||
|
---------------------------------------------
|
||
|
1 Length of file=N, Free list=0
|
||
|
2 Comment fragment
|
||
|
3 " "
|
||
|
4 " "
|
||
|
' ' '
|
||
|
' ' '
|
||
|
N-2 " "
|
||
|
N-1 " "
|
||
|
N Comment fragment
|
||
|
---------------------------------------------
|
||
|
|
||
|
The arrangement of the data in the comment file
|
||
|
|
||
|
As each new comment is added to the database, a check is made in the
|
||
|
file descriptor record at line 1. If the list of reusable records is
|
||
|
0, the file is extended to hold the new comment. Otherwise the new
|
||
|
comment is assigned to records starting with the head of the
|
||
|
freelist. When comments are deleted, the discarded records are
|
||
|
added to the free list in the file descriptor record.
|
||
|
|
||
|
There are various checks within the programs to protect
|
||
|
users from themselves:-
|
||
|
1. All user input is checked for errors - e.g. reference to
|
||
|
non-existent gel readings or contigs, incorrect positions in the
|
||
|
contig or gel readings.
|
||
|
2. Before entering a gel reading the system checks to see if a file
|
||
|
of the same name has already been entered.
|
||
|
3. Join will not allow the circularising of a contig.
|
||
|
5. Users may escape from any point in the program.
|
||
|
6. Help is available from all points in the program.
|
||
|
|
||
|
|
||
|
IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS DOING
|
||
|
ANYTHING THAT INVOLVES CHANGING THE CONTENTS OF THE DATABASE. I.E
|
||
|
DURING AUTO ASSEMBLE, COMPLETE JOIN, COMPLEMENT CONTIG, SAVE EDIT
|
||
|
CONTIG. This could corrupt the database so badly that it is
|
||
|
impossible to fix. The program should always be left using the QUIT
|
||
|
option.
|
||
|
@4. TX 3 @Edit contig
|
||
|
|
||
|
The Contig Editor is a mouse-driven editor that can insert,
|
||
|
delete and change gel reading sequences.
|
||
|
|
||
|
The Contig Editor allows scrolling from one end of a contig to
|
||
|
the other using the scroll bar and scroll buttons. Action of mouse
|
||
|
button presses when the mouse pointer is in the scroll bar:
|
||
|
|
||
|
Middle Mouse Button Set editor position
|
||
|
Left Mouse Button Scroll forward one screenful
|
||
|
Right Mouse Button Scroll backwards one screenful
|
||
|
|
||
|
The four scroll buttons operate as follows:
|
||
|
|
||
|
"<<" Scroll left half a screenful
|
||
|
"<" Scroll left one character
|
||
|
">" Scroll right one character
|
||
|
">>" Scroll right half a screenful
|
||
|
|
||
|
The Editor cursor can be positioned anywhere in the edit
|
||
|
window by moving the mouse pointer over the character of interest,
|
||
|
then pressing the left mouse button. The Editor cursor can also be
|
||
|
moved by using the direction arrow keys.
|
||
|
|
||
|
The editor operates in two main edit modes - Replace and
|
||
|
Insert. Replace allows a character to be replaced by another. Insert
|
||
|
allows characters to be inserted into a gel reading sequence.
|
||
|
Characters are entered by typing them from the keyboard. Only valid
|
||
|
characters are permitted. Characters can be deleted by positioning
|
||
|
the cursor one character to the right, then pressing the delete key.
|
||
|
Normally Insert and Delete apply to the consensus line of the contig
|
||
|
ONLY. This restraint can be overridden by using the "Super Edit"
|
||
|
mode of operation, THOUGH IT IS NOT RECOMMENDED.
|
||
|
|
||
|
Edits can also be performed on the consensus, though they are
|
||
|
restricted to insertion and deletion of padding characters ("*").
|
||
|
These edits also have special meanings. A deletion will delete ALL
|
||
|
characters at the position to the left of the cursor in the contig,
|
||
|
and move the relative positions of all sequences starting to the
|
||
|
right of the cursor position left one character. An insertion will
|
||
|
insert the character typed ("*") into ALL gel reading sequences at
|
||
|
the cursor's position in the contig, and move the relative positions
|
||
|
of all sequences starting to the right of the cursor position right
|
||
|
one character.
|
||
|
|
||
|
The effect of the last edit can be undone by pressing the
|
||
|
"Undo" button at the top of the editor window.
|
||
|
|
||
|
The cursor will automatically be positioned at the next
|
||
|
problem when the "Find Next Problem" button is selected. The next
|
||
|
problem is where the consensus shows either an ambiguity ("-") or a
|
||
|
pad ("*") character.
|
||
|
|
||
|
The edits to the contig can be saved by pressing the "Leave
|
||
|
Editor" button and replying "Yes" to the prompt to "Save changes?".
|
||
|
As no changes are made to the working copy of your database until this
|
||
|
point it is possible to abort the editor if the edit session ends up
|
||
|
in an unsatisfactory state (ie if you've stuffed it up!)
|
||
|
|
||
|
|
||
|
|
||
|
Displaying Traces
|
||
|
|
||
|
The original data from which the gel reading sequences were
|
||
|
derived can be seen by double clicking (two quick clicks) with the
|
||
|
middle mouse button on the area of interest. The trace will be
|
||
|
displayed with the point clicked at the centre of the trace
|
||
|
viewport.
|
||
|
|
||
|
All traces that are displayed are maintained in one window,
|
||
|
called the Trace Manager. The Trace Manager will only display four
|
||
|
traces maximum. When four traces are already being managed and a new
|
||
|
one is requested, the one at the top of the Trace Manager is removed
|
||
|
and the new one is added to the bottom. Traces can be removed
|
||
|
individually by using the "quit" button in the panel next to the
|
||
|
trace.
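
The four-trace limit behaves like a short queue. A minimal sketch,
with invented class and trace names, might be:

```python
from collections import deque

MAX_TRACES = 4  # the Trace Manager shows at most four traces


class TraceManager:
    def __init__(self):
        # Index 0 is the top of the window, the last item is the bottom.
        self.traces = deque(maxlen=MAX_TRACES)

    def show(self, name):
        # When four traces are present, deque drops the top one itself.
        self.traces.append(name)

    def quit(self, name):
        # Equivalent of the "quit" button next to one trace.
        self.traces.remove(name)
```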
|
||
|
|
||
|
|
||
|
|
||
|
Extending Reads Using Cutoff Information
|
||
|
|
||
|
Sequence data read in from trace files of automated fluorescent
sequencing machines and processed through the program ted will have the
|
||
|
discarded sequence (vector at start and poor read at end) available
|
||
|
to the contig editor. To display the cutoff information, press the
|
||
|
"Display Cutoff" button at the top of the editor window. The cutoff
|
||
|
sequence appears in grey. This sequence can be incorporated into the
|
||
|
editable sequence, by moving the cutoff position. This is done by
|
||
|
positioning the cursor at the end of the gel sequence, and using
|
||
|
Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of cutoff.
|
||
|
The Meta key is a diamond on the Sun keyboard.
|
||
|
|
||
|
|
||
|
|
||
|
Pop-up menu
|
||
|
|
||
|
A pop-up menu is revealed by depressing the "Control" key on
|
||
|
the keyboard and at the same time pressing the left mouse button.
|
||
|
The menu has the following functions:
|
||
|
|
||
|
Search
|
||
|
Highlight Disagreements
|
||
|
Save Contig
|
||
|
Create Tag
|
||
|
Edit Tag
|
||
|
Delete Tag
|
||
|
Select Oligo
|
||
|
|
||
"Highlight Disagreements" simply toggles between the normal display
|
||
|
showing the current base assignments and one in which only those
|
||
|
assignments that differ from the consensus are shown.
|
||
|
"Save Contig" is described above. Searching and operations on tags
|
||
|
are described below.
|
||
|
|
||
|
|
||
|
|
||
|
Searching
|
||
|
|
||
|
Selecting "Search" brings up a window which can remain present
|
||
|
during normal editor operation. The window allows the user to select
|
||
|
the direction of search, the type of search and a value to search
|
||
|
on. The value is entered into the value text window. Then pressing
|
||
|
the "search" button performs the search. If successful, the cursor
|
||
|
is positioned and centred accordingly. An audible tone indicates
|
||
|
failure. Pressing the "ok" button removes the search window. The
|
||
|
search window is automatically removed when the contig editor is
|
||
|
exited.
|
||
|
|
||
|
There are seven different search modes:
|
||
|
|
||
|
1. Search by position
|
||
|
|
||
|
This positions the cursor at the numeric position specified in the
|
||
|
value text window. Eg a value of "1234" causes the cursor to be
|
||
|
placed at base number 1234 in the contig. Positioning within a gel
|
||
|
reading is achieved by prefixing the number with the "@" character,
|
||
|
eg "@123" positions the cursor at base 123 of the sequence in which
|
||
|
the cursor lies. Relative positions can be specified by prefixing
|
||
|
the number with a plus or minus character. Eg "+1234" will advance
|
||
|
the cursor 1234 bases. If possible, the cursor is positioned within
|
||
|
the same sequence. The direction buttons have no effect on the
|
||
|
operation of "search by position".
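
The three forms of value can be told apart as in the sketch below.
The function and argument names are assumptions, positions are taken
to be 1-based, and the orientation of the reading is ignored.

```python
def resolve_position(value, cursor, reading_start):
    """Turn a "search by position" value into a contig position.

    cursor        -- current contig position of the cursor
    reading_start -- contig position of base 1 of the reading
                     in which the cursor lies
    """
    value = value.strip()
    if not value:
        raise ValueError("no position given")
    if value.startswith("@"):
        # "@123": base 123 of the reading under the cursor
        return reading_start + int(value[1:]) - 1
    if value[0] in "+-":
        # "+1234" or "-50": relative to the current cursor position
        return cursor + int(value)
    # "1234": absolute base number in the contig
    return int(value)
```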
|
||
|
|
||
|
2. Search by reading name
|
||
|
|
||
|
This positions the cursor at the left end of the gel reading
|
||
|
specified in the value text window. If the value is prefixed with a
|
||
|
slash it is assumed to be a gel reading name. Otherwise it is
|
||
|
assumed to be a gel reading number. Eg "123" positions the cursor at
|
||
|
the left end of gel reading number 123. "/a16a12.s1" positions at
|
||
|
the start of reading a16a12.s1. If the value was "/a16" the cursor
|
||
|
is positioned at the first reading which starts with "a16". The
|
||
|
direction buttons have no effect on the operation of "search by
reading name".
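
A sketch of how the value might be interpreted is given below. The
function name is invented, and "first reading" is assumed to mean
first in the order the readings are held in the dictionary.

```python
def find_reading(value, names):
    """Return the number of the reading selected by `value`, or None.

    names -- dict mapping gel reading number to gel reading name
    """
    if value.startswith("/"):
        # "/a16": first reading whose name starts with the given text
        prefix = value[1:]
        for number, name in names.items():
            if name.startswith(prefix):
                return number
        return None
    # Otherwise the value is taken to be a gel reading number.
    number = int(value)
    return number if number in names else None
```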
|
||
|
|
||
|
3. Search by tag type.
|
||
|
|
||
|
This positions the cursor at the start of the next tag which has the
same type as specified by the type value menu. To change the
|
||
|
type, select off the menu that pops up when the mouse is clicked on
|
||
|
the button labeled "Type:". The search can be performed either
|
||
|
forwards or backwards from the current cursor position. To find all
|
||
|
tags, use "search by annotation", with a null text value string.
|
||
|
|
||
|
4. Search by annotation.
|
||
|
|
||
|
This positions the cursor at the start of the next tag which has a
|
||
|
comment containing the string specified in the value text window.
|
||
|
The search performed is a regular expression search, and certain
|
||
|
characters have special meaning. Be careful when your value string
|
||
|
contains ".", "*", "[", "^" or "$". The search can be performed
|
||
|
either forwards or backwards from the current cursor position.
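
The effect of the special characters can be seen with Python's re
module, used here as a stand-in; the program's own regular expression
syntax may differ in detail. The function name and the tag list
format are assumptions for the example.

```python
import re


def find_tag(tags, pattern, start, backwards=False):
    """Return the position of the next tag whose comment matches.

    tags -- list of (position, comment) pairs sorted by position
    """
    regex = re.compile(pattern)
    if backwards:
        candidates = [t for t in reversed(tags) if t[0] < start]
    else:
        candidates = [t for t in tags if t[0] > start]
    for position, comment in candidates:
        if regex.search(comment):
            return position
    return None


tags = [(10, "a16a9.s1"), (50, "a16a9xs1"), (90, "primer 2")]
# "." matches any character, so this also finds "a16a9xs1":
print(find_tag(tags, "a9.s1", start=20))
# Escaping the value makes the full stop literal:
print(find_tag(tags, re.escape("a9.s1"), start=20))
```

An empty pattern matches every comment, which is why a null value
string finds all tags.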
|
||
|
|
||
|
5. Search by sequence.
|
||
|
|
||
|
This positions the cursor at the start of the next piece of sequence
|
||
|
that matches the value specified in the text value window. The
|
||
|
search is for an exact match, which means the case of value string
|
||
|
is important. The search is performed on the gel readings
|
||
|
themselves, rather than the consensus sequence. The search can be
|
||
|
performed either forwards or backwards from the current cursor
|
||
|
position.
|
||
|
|
||
|
6. Search by problem.
|
||
|
|
||
|
This positions the cursor at the next place in the consensus
|
||
|
sequence which is not an "A", "C", "G" or "T". The search can be
|
||
|
performed either forwards or backwards from the current cursor
|
||
|
position.
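
A minimal sketch of this search, assuming 1-based positions and an
invented function name:

```python
def next_problem(consensus, cursor, backwards=False):
    """Return the next position that is not A, C, G or T, or None."""
    step = -1 if backwards else 1
    position = cursor + step
    while 1 <= position <= len(consensus):
        if consensus[position - 1] not in "ACGT":
            return position
        position += step
    return None
```

The search starts at the position next to the cursor, so repeated
searches step from one problem to the next.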
|
||
|
|
||
|
7. Search by quality
|
||
|
|
||
|
This positions the cursor at the next place in the consensus
|
||
|
sequence where the consensus calculation for each strand disagrees.
|
||
|
When only sequences on one strand are present, the search will stop
|
||
|
at every base. The search can be performed either forwards or
|
||
|
backwards from the current cursor position.
|
||
|
|
||
|
|
||
|
|
||
|
Annotation
|
||
|
|
||
|
Parts of a sequence can be annotated, to record the positions
|
||
|
of primers used for walking, or to mark sites, such as compressions
|
||
|
that have caused problems during sequencing. The consensus sequence
|
||
|
CANNOT be annotated.
|
||
|
|
||
|
To annotate a piece of sequence first select the part of
|
||
|
sequence using the mouse buttons. Use the left mouse button to
|
||
|
position the start of the selection, and while this button is being
|
||
|
held down, move the mouse to extend. The selection can be extended
|
||
|
further using the right mouse button.
|
||
|
|
||
|
To create annotation, invoke the pop-up menu, and select the
|
||
|
"Create Tag" function. A small "tag editor" will appear which allows
|
||
|
you to select the type of the annotation from a pull-down menu, and
|
||
|
specify a comment if desired. To select a new type pull down the
|
||
|
Type menu, and select the entry desired. To enter a comment, simply
|
||
|
type into the text window in the tag editor. The annotation is
|
||
|
created when the "Leave" button on the tag editor is pressed, and is displayed
|
||
|
in the colour defined in the tag database file (TAGDB).
|
||
|
|
||
|
To edit existing annotation, position the cursor with the left
|
||
|
mouse button on the tag, and select the "Edit Tag" off the pop-up
|
||
|
menu. This invokes the tag editor, and changes to the type and
|
||
|
comment of the annotation can be made. The tag is updated when the
|
||
|
"Leave" button is pressed.
|
||
|
|
||
|
To delete an existing annotation, position the cursor with the
|
||
|
left mouse button on the tag, and select the "Delete Tag" off the
|
||
|
pop-up menu.
|
||
|
|
||
|
|
||
|
|
||
|
NOTE:
|
||
|
|
||
|
As the Contig Editor is a very powerful tool, it is possible
|
||
|
that the alignment of the gel reading sequences has unexpectedly
|
||
|
been disrupted. This can easily happen to parts of the contig that
|
||
|
lie to the right of the screen if excessive use has been made of the
|
||
|
"Super Edit" facility. Until familiar with "Super Edit" it would
|
||
|
benefit the sequencer to quickly scan through the contig after
|
||
|
editing to check that bad alignments have not been created.
|
||
|
|
||
|
Selecting Oligos
----------------

1. Open the oligo selection window, by selecting "Select Oligo" from
the contig editor popup menu.
2. Position the cursor where you want the oligo to be chosen.
While the oligo selection window is visible, you will still have
complete control over positioning and editing within the contig
editor.
3. Indicate the strand for which you require an oligo. This is done
by toggling the direction arrow ("----->" or "<------"), if
necessary.
4. Press the "Find Oligos" button to find all suitable oligos (see
"Oligo selection" below). Information for the closest oligo to the
cursor position is given in the output text window. In the contig
editor the position of the oligo is marked by a temporary tag on the
consensus. The window is recentred if the oligo is off the screen.
Selecting "Display Selection Information" will print a short report
on the numbers of oligos considered and rejected during oligo
selection.
5. If this oligo is not suitable (it may have been previously
chosen, and found to be unsuitable by experimentation, say), the
next closest oligo can be viewed by pressing "Select Next".
6. Suitable templates are automatically identified for the currently
displayed oligo (see "Template selection" below). By default, the
template is that closest to the oligo site. If the choice is not
suitable (it may be known to be a poor quality template, say)
another can be chosen from the "Choose Template for this Oligo"
menu. Templates that do not appear on the menu can be specified by
selecting "other". However, the template must be on the correct
strand and be upstream of the oligo.
7. A tag can be created for the current oligo by pressing the button
"Create a tag for this oligo". The annotation for this tag holds the
name of the template and the oligo primer sequence. There are
fields to allow the user to specify their own primer name
("serial#") and comments ("flags") for this tag. An example of oligo
tag annotation:
serial#=
template=a16a9.s1
sequence=CGTTATGACCTATATTTTGTATG
flags=

8. The oligo selection window is closed when "Create a tag for this
oligo" or "Quit" is selected.
|
||
|
Oligo selection:
|
||
|
----------------
|
||
|
The oligo selection engine is the one used in the program OSP. It is
|
||
|
described in some detail in:
|
||
|
Hillier, L., and Green, P. (1991). "OSP: an oligonucleotide
|
||
|
selection program," PCR Methods and Applications, 1:124-128.
|
||
|
The parameters controlling the selection of oligos can be changed in
|
||
|
the "Oligo Selection Parameters" window. The weights controlling the
|
||
|
scoring of selected oligos can be changed in the "Oligo Selection
|
||
|
Weights" window.
|
||
|
By default, the oligos are selected from a window that extends 40
|
||
|
bases either side of the cursor. The size and location of this
|
||
|
window relative to the cursor position can be changed in the
|
||
|
"Parameters" window.
|
||
|
In xbap oligos are ranked according to their proximity to the cursor
|
||
|
position, rather than by their scores.
|
||
|
Template selection:
|
||
|
-------------------
|
||
|
For simplicity, each reading is considered to represent a template.
|
||
|
In practice, many readings can be made of the same template.
|
||
|
Suitable templates that are identified are those that:
|
||
|
|
||
|
1. are in the appropriate sense,
2. have 5' ends that start upstream of the oligo, and
3. are sufficiently close to the oligo to be useful.
|
||
|
|
||
|
This last criterion relates to the insert size for the subclones
|
||
|
used for sequencing and the average reading length. A template is
|
||
|
considered useful if a full reading can be made from it, taking into
|
||
|
account both of these factors. The default insert size is 1000
|
||
|
bases, and the default average reading length is 400 bases. These
|
||
|
values can be changed in the "Parameters" window.
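
The three criteria can be sketched as a simple filter. The exact test
the program uses for "sufficiently close" is not given here, so the
rule below (a full reading started at the oligo must end within one
insert length of the template's 5' end) is an assumption, as are all
of the names.

```python
from collections import namedtuple

# sense is +1 for one strand and -1 for the other (an assumed encoding)
Reading = namedtuple("Reading", "name five_prime sense")


def suitable_templates(readings, oligo_pos, sense,
                       insert_size=1000, read_length=400):
    """Return names of usable templates, closest to the oligo first."""
    chosen = []
    for r in readings:
        if r.sense != sense:
            continue                      # 1. wrong sense
        distance = (oligo_pos - r.five_prime) * sense
        if distance <= 0:
            continue                      # 2. 5' end is not upstream
        if distance + read_length > insert_size:
            continue                      # 3. too far for a full reading
        chosen.append((distance, r.name))
    return [name for distance, name in sorted(chosen)]
```

The first name returned corresponds to the default choice, the
template closest to the oligo site.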
|
||
|
@5. TX 1 @Display a contig
|
||
|
|
||
|
Used to show the aligned gel readings for any part of a
|
||
|
contig. The number, name and strandedness of each gel reading is
|
||
|
shown and the consensus is written below.
|
||
|
|
||
|
If required identify the contig, and then the start and end
|
||
|
points of the region to display.
|
||
|
|
||
|
The display can be directed to a disk file using "direct
|
||
|
output to disk".
|
||
|
|
||
|
Below is an example showing the left end of a contig from
|
||
|
position 1 to 200. Overlapping this region are gels 6, 3, 5, 17 and
12; 6, 3 and 5 are in reverse orientation to their archives (denoted
by a minus sign). There are a few uncertainty codes and a few
|
||
|
padding characters in the working versions, but the consensus
|
||
|
(shown below each page width) has a definite assignment for almost
|
||
|
every position.
|
||
|
|
||
|
                        10        20        30        40        50
   -6 HINW.010  GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
      CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                        60        70        80        90       100
   -6 HINW.010  CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
   -3 HINW.007                                          GGCACA*GTC
      CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                       110       120       130       140       150
   -6 HINW.010  GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
   -3 HINW.007  GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
   -5 HINW.009                                      GGCAGCGACCGTCG
   17 HINW.999                                         AGCGACCGTCG
      CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                       160       170       180       190       200
   -6 HINW.010  TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
   -3 HINW.007  TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
   -5 HINW.009  TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
   17 HINW.999  TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
   12 HINW.017                                            GTAGAGC*
      CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*

|
||
|
@6. TX 1 @List a text file
|
||
|
|
||
|
This option allows users to list text files on the screen. It
|
||
|
can be used to read a file containing notes, for checking files
|
||
|
written to disk etc. The user is asked to type the name of the file
|
||
|
to list.
|
||
|
@8. TX 1 @Calculate a consensus
|
||
|
|
||
|
Calculates a consensus sequence either for the whole
|
||
|
database or for selected contigs. The consensus is written to a file
|
||
|
named by the user.
|
||
|
Supply a file name, choose between whole database or selected
|
||
|
contigs.
|
||
|
|
||
|
Symbols for uncertainty in gel readings
|
||
|
|
||
|
In order to record uncertainties when reading gels the
|
||
|
codes shown below can be used. Use of these codes permits us to
|
||
|
extract the maximum amount of data from each gel and yet record any
|
||
|
doubts by choice of code. The program can deal with all of
|
||
|
these codes and any other characters in a sequence are treated
|
||
|
as dash (-) characters.
|
||
|
|
||
|
SYMBOL    MEANING

1         PROBABLY C
2            "     T
3            "     A
4            "     G
D            "     C   POSSIBLY CC
V            "     T      "     TT
B            "     A      "     AA
H            "     G      "     GG
K            "     C      "     C-
L            "     T      "     T-
M            "     A      "     A-
N            "     G      "     G-
R         A OR G
Y         C OR T
5         A OR C
6         G OR T
7         A OR T
8         G OR C
-         A OR G OR C OR T
a         A
c         C
g         G
t         T
*         padding character placed by auto assembler
else      = -
|
||
|
|
||
|
The DNA consensus algorithm
|
||
|
|
||
|
The "calculate consensus" function, the "display contig"
|
||
|
routine and the "show quality" option use the rules outlined here
|
||
|
to calculate a consensus from aligned gel readings. Note that
|
||
|
"display contig" calculates a consensus for each page width it
|
||
|
displays (it does not use the consensus sequence file calculated
|
||
|
by the consensus function).
|
||
|
|
||
|
We have 6 possible symbols in the consensus sequence:
|
||
|
A,C,G,T,* and -. The last symbol is assigned if none of the others
|
||
|
makes up a sufficient proportion of the aligned characters at any
|
||
|
position in the contig. The following calculation is used to decide
|
||
|
which symbol to place in the consensus at each position.
|
||
|
|
||
|
Each uncertainty code contributes a score to one of A,C,G,T,*
|
||
|
and also to the total at each point. Symbols like R and Y which
|
||
|
don't correspond to a single base type contribute only to the total
|
||
|
at each point. The scores are shown below.
|
||
|
definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1
|
||
|
|
||
|
probable assignments ie 1,2,3,4 = 0.75
|
||
|
|
||
|
other uncertainty codes including R,Y,5,6,7,8,- = 0.1
|
||
|
|
||
|
A cutoff score of 51% to 100% is supplied by the user. (When
|
||
|
the program starts this is set to 75%. See "set display
|
||
|
parameters"). At each position in the contig we calculate the total
|
||
|
score for each of the 5 symbols A,C,G,T and * (denote these by Xi,
|
||
|
where i=A,C,G,T or *), and also the sum of these totals (denote this
|
||
|
by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed
|
||
|
in the consensus; otherwise - is assigned.
|
||
|
|
||
|
Notice that S does not equal the number of times the sequence
|
||
|
has been determined, but is the score total, and hence we are less
|
||
|
likely to put a - in the consensus. For the "examine quality"
|
||
|
algorithm each strand is treated separately but the calculation is
|
||
|
the same. (It was originally different).
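
The calculation for a single contig position can be sketched as
below. The sketch follows the scores and the "greater than the
cutoff" rule given above, and takes S to include the 0.1
contributions of codes such as R and Y; that reading of S, and all
of the names, are assumptions rather than the program's own code.

```python
# Map each symbol to (base it supports, score it contributes).
BASE_OF = {"*": ("*", 1.0)}
for base, definite, probable in (("C", "CDKc", "1"), ("T", "TVLt", "2"),
                                 ("A", "ABMa", "3"), ("G", "GHNg", "4")):
    for symbol in definite:
        BASE_OF[symbol] = (base, 1.0)    # definite assignments
    BASE_OF[probable] = (base, 0.75)     # probable assignments
OTHER_SCORE = 0.1                        # R, Y, 5, 6, 7, 8, - and the rest


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for the characters aligned in `column`."""
    scores = dict.fromkeys("ACGT*", 0.0)   # the totals Xi
    total = 0.0                            # the score total S
    for symbol in column:
        if symbol in BASE_OF:
            base, score = BASE_OF[symbol]
            scores[base] += score
            total += score
        else:
            total += OTHER_SCORE           # counts towards S only
    if total == 0.0:
        return "-"
    for base, score in scores.items():
        if 100.0 * score / total > cutoff:
            return base
    return "-"
```

For example, a column holding C and G gives each base 50% and so
yields "-", while a column holding G and 4 yields G.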
|
||
|
|
||
|
Format of the consensus sequence ( and vector sequences).
|
||
|
|
||
|
A consensus sequence file may contain the consensus for
|
||
|
several contigs and so we identify each of them by preceding them by
|
||
|
a 20 character title. The title is of the form <---LAMBDA.0076---->
|
||
|
( where LAMBDA is the project name and gel reading number 76 is the
|
||
|
leftmost gel reading to contribute to this consensus sequence).
|
||
|
The angle brackets <> and the 4 digit number preceded by a . are
|
||
|
important to some processing programs.
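
A processing program might read or write such titles along the lines
of the sketch below. Only the single example title above is known, so
the way the dashes are divided between the two sides in make_title is
an assumption chosen to reproduce that example; both function names
are invented.

```python
import re

TITLE = re.compile(r"<-*([^-.<>]+)\.(\d{4})-*>")


def parse_title(title):
    """Return (project name, gel reading number) or None if malformed."""
    match = TITLE.fullmatch(title)
    if match is None or len(title) != 20:
        return None
    return match.group(1), int(match.group(2))


def make_title(project, gel):
    """Build a 20-character title such as <---LAMBDA.0076---->."""
    body = f"{project}.{gel:04d}"
    pad = 18 - len(body)          # 20 characters less the two brackets
    if pad < 0:
        raise ValueError("project name too long for a 20-character title")
    left = pad // 2               # assumed split of the dashes
    return "<" + "-" * left + body + "-" * (pad - left) + ">"
```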
|
||
|
@25. TX 1 @Show relationships
|
||
|
|
||
|
Used to show the relationships of the gel readings in the
|
||
|
database in three ways -
|
||
|
(a) All contig descriptor lines followed by all gel descriptor
|
||
|
lines.
|
||
|
(b) All contigs one after the other sorted, i.e. for each
|
||
|
contig show its contig descriptor line followed by all its gel
|
||
|
descriptor lines sorted on position from left to right
|
||
|
(c) Selected contigs: show the contig line and, in order, those
|
||
|
gel readings that cover a user-defined region. Note that this
|
||
|
output can be directed to a disk file by prior selection of
|
||
|
"redirect output".
|
||
|
|
||
|
Below is an example showing a contig from position 1 to 689.
|
||
|
The left gel reading is number 6 and has archive name HINW.010, the
|
||
|
rightmost gel reading is number 2 and has archive name HINW.004.
|
||
|
On each gel descriptor line is shown: the name of the archive
|
||
|
version, the gel number, the position of the left end of the gel
|
||
|
reading relative to the left end of the contig, the length of
|
||
|
the gel reading (if this is negative it means that the gel reading
|
||
|
is in the opposite orientation to its archive), the number of the
|
||
|
gel reading to the left and the number of the gel reading to the
|
||
|
right.
|
||
|
|
||
|
|
||
|
CONTIG LINES
|
||
|
CONTIG LINE LENGTH ENDS
|
||
|
LEFT RIGHT
|
||
|
48 689 6 2
|
||
|
GEL LINES
|
||
|
NAME NUMBER POSITION LENGTH NEIGHBOURS
|
||
|
LEFT RIGHT
|
||
|
HINW.010 6 1 -279 0 3
|
||
|
HINW.007 3 91 -265 6 5
|
||
|
HINW.009 5 137 -299 3 17
|
||
|
HINW.999 17 140 273 5 12
|
||
|
HINW.017 12 193 265 17 18
|
||
|
HINW.031 18 385 -245 12 2
|
||
|
HINW.004 2 401 -289 18 0
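The gel lines above can be read back into a simple structure, for instance to recompute the contig length from the readings. This is only a sketch: the `GelLine` type and the helper names are hypothetical, and it assumes the six fields are separated by white space as in the listing.

```python
from typing import NamedTuple


class GelLine(NamedTuple):
    name: str
    number: int
    position: int   # left end relative to the left end of the contig
    length: int     # negative: opposite orientation to the archive
    left: int       # number of left neighbour, 0 if none
    right: int      # number of right neighbour, 0 if none


def parse_gel_line(text):
    name, *fields = text.split()
    return GelLine(name, *map(int, fields))


def right_end(gel):
    """Position of the last base of the reading within the contig."""
    return gel.position + abs(gel.length) - 1


EXAMPLE = """\
HINW.010 6 1 -279 0 3
HINW.007 3 91 -265 6 5
HINW.009 5 137 -299 3 17
HINW.999 17 140 273 5 12
HINW.017 12 193 265 17 18
HINW.031 18 385 -245 12 2
HINW.004 2 401 -289 18 0
"""

gels = [parse_gel_line(line) for line in EXAMPLE.splitlines()]
contig_length = max(right_end(g) for g in gels)
print(contig_length)
```

For the example contig the furthest right end is that of HINW.004 (401 + 289 - 1), which agrees with the length of 689 shown on the contig line.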
|
||
|
|
||
|
@23. TX 3 @Complement a contig
|
||
|
|
||
|
This function will complement and reverse all of the gel
|
||
|
readings in a contig. It automatically reverses and
|
||
|
complements each gel reading sequence, reorders left and right
|
||
|
neighbours, recalculates relative positions and changes each
|
||
|
strandedness.
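The bookkeeping involved can be sketched as follows. This is not the program's own code: the dictionary layout is hypothetical, symbols other than A, C, G and T (such as pads) are simply left uncomplemented, and neighbours are rebuilt from the new left-end order rather than swapped one by one.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def revcomp(seq):
    """Reverse and complement; other symbols (pads, dashes) are only reversed."""
    return seq.translate(COMPLEMENT)[::-1]


def complement_contig(gels, contig_length):
    """gels: list of dicts with keys number, position, length and seq."""
    flipped = []
    for g in gels:
        span = abs(g["length"])
        old_right_end = g["position"] + span - 1
        flipped.append({
            "number": g["number"],
            "position": contig_length - old_right_end + 1,
            "length": -g["length"],          # strandedness changes sign
            "seq": revcomp(g["seq"]),
        })
    # Re-derive left and right neighbours from the new positions.
    flipped.sort(key=lambda g: g["position"])
    for i, g in enumerate(flipped):
        g["left"] = flipped[i - 1]["number"] if i > 0 else 0
        g["right"] = flipped[i + 1]["number"] if i + 1 < len(flipped) else 0
    return flipped
```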
|
||
|
|
||
|
The only user input required is to identify the contig
|
||
|
to complement by the number or name of a gel reading it contains.
|
||
|
DO NOT KILL THE PROGRAM DURING THIS STEP!
|
||
|
@22. TX 3 @ Join contigs
|
||
|
|
||
|
This function joins contigs interactively using a mouse driven
|
||
|
editor. The operation of this editor is very similar to the Contig
|
||
|
Editor described in "Edit".
|
||
|
|
||
|
It allows the user to align the ends of the two contigs by
|
||
|
editing each contig separately. It is important that the alignment
|
||
|
achieved is correct because once the join is completed the
|
||
|
alignment is fixed. The program needs to know which two contigs to
|
||
|
join.
|
||
|
|
||
|
First identify the two contigs that are to be joined.
|
||
|
The program checks that the two
|
||
|
contig numbers are different (it will not allow circles to be
|
||
|
formed!)
|
||
|
|
||
|
The Join Editor consists of two Contig Editors in between
|
||
|
which is sandwiched a disagreement box. This disagreement box shows
|
||
|
exclamation marks to denote mismatches between the two consensuses.
|
||
|
|
||
|
For example, the display will look something like this:
|
||
|
|
||
|
1460 1470 1480 1490 1500
|
||
|
56 HINW.100 TCT*GAGCAGTGTGGGCGCTG*CCGG
|
||
|
33 HINW.300 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
|
||
|
-25 HINW.090 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
|
||
|
19 HINW.123 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
|
||
|
CONSENSUS TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
|
||
|
MISMATCH ! !!!!!!
|
||
|
10 20 30 40 50
|
||
|
-6 HINW.010 TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
|
||
|
-3 HINW.007 TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
|
||
|
-5 HINW.009 GCTCGGAGGGCATGAAGT*AGAGC
|
||
|
CONSENSUS TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
|
||
|
|
||
|
|
||
|
The overlap must be of at least one character. Use the scroll
|
||
|
bar and the scroll buttons (`<<', `<', `>' and `>>') for positioning
|
||
|
the relative positions of the two contigs.
|
||
|
|
||
|
The join position can be fixed by pressing the
|
||
|
`lock' button at the top of the Join Editor. Locking allows the two
|
||
|
contigs to be scrolled as one when using the scroll bar and buttons,
|
||
|
the left ends always in the same position relative to each other.
|
||
|
|
||
|
Once locked, it is best to proceed to the right along the
|
||
|
contigs, inserting padding characters (`*') into the consensuses to
|
||
|
minimise the disagreements.
|
||
|
|
||
|
It is essential that the user aligns the two contigs
|
||
|
throughout the whole region of overlap before completing the join
|
||
|
because it is only at this stage that the two contigs can be edited
|
||
|
independently. Once the join is completed the alignment can only be
|
||
|
altered using the routines supplied by "alter relationships".
|
||
|
|
||
|
The join can be completed by pressing the `Leave Editor'
|
||
|
button. The percentage mismatch is displayed, and the user is
|
||
|
required to confirm that they want to perform the join.
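The disagreement box and the percentage mismatch can be illustrated with a short sketch. How the program itself computes the percentage is not stated here; the code below assumes it is the number of differing positions divided by the length of the overlap, and the function names are hypothetical.

```python
def mismatch_line(upper, lower):
    """'!' wherever the two consensuses differ within the overlap, else a space."""
    n = min(len(upper), len(lower))
    return "".join(" " if upper[i] == lower[i] else "!" for i in range(n))


def percent_mismatch(upper, lower):
    marks = mismatch_line(upper, lower)
    if not marks:
        raise ValueError("the overlap must be of at least one character")
    return 100.0 * marks.count("!") / len(marks)


# The two consensus lines from the Join Editor example above.
upper = "TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG"
lower = "TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC"
print(mismatch_line(upper, lower))
print(percent_mismatch(upper, lower))
```

Applied to the example display, this marks the same seven positions as the MISMATCH line: one at the dash and six at the right hand end, where a pad is still needed.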
|
||
|
@24. TX 1 @ Copy the database
|
||
|
|
||
|
Used to make a copy of the database. If required the database
|
||
|
size can be altered using this option. The "version" of a database
|
||
|
is encoded as the last letter in the names of the five files that
|
||
|
contain the database.
|
||
|
|
||
|
Supply a "version" number (the default is version 1), and if
|
||
|
required select a new size for the database. The size of a database
|
||
|
is the number of lines of information it can hold. It needs a line
|
||
|
for each gel reading and another for each contig.
|
||
|
@19. TX 1 @ Check database
|
||
|
|
||
|
Used to perform a check on the logical consistency of the
|
||
|
database. No user intervention is required. If selected "with
|
||
|
dialogue" the program also checks for any sections of the consensus
|
||
|
that contain 15 dashes in 20 characters.
|
||
|
|
||
|
The following relationships are checked:
|
||
|
1. If gel reading A thinks gel reading B is its left neighbour
|
||
|
does B think A is its right neighbour? The error message is
|
||
|
"Hand holding problem for gel reading A"
|
||
|
followed by the gel descriptor lines for gel readings A and B.
|
||
|
2. Are there any contig lines with no left or right end gel
|
||
|
readings? The error message is
|
||
|
"Bad contig line number A"
|
||
|
3. Do the gel readings that are described as left ends on
|
||
|
contig lines agree that they are left ends? The error message is
|
||
|
"The end gel readings of contig A have outward neighbours"
|
||
|
4. Are there gel readings that are in more than one contig?
|
||
|
The error message is
|
||
|
" Gel number A is used N times"
|
||
|
5. Are there gel readings that are not in any contig? The
|
||
|
error message is
|
||
|
" Gel number A is not used"
|
||
|
6. Do the relative positions of gel readings agree with
|
||
|
their position as defined by left and right neighbourliness? The
|
||
|
error message is
|
||
|
" Gel number A with position X is left neighbour of gel number B
|
||
|
with position Y"
|
||
|
7. Are there any loops in contigs? If so no further
|
||
|
checking is done. The error message is
|
||
|
" Loop in contig n no further checking done, but gel reading numbers
|
||
|
follow"
|
||
|
The program then prints the gel reading numbers in the looped
|
||
|
contig up to the start of the loop.
|
||
|
8. Are there any contigs of length <1? The error message is
|
||
|
" The contig on line number x has zero length"
|
||
|
9. Are there any gel readings (used in only one contig) that have
|
||
|
zero length? The error message is
|
||
|
" Gel number N has zero length"
|
||
|
Note that "auto assemble" also uses this logical consistency check
|
||
|
and will only tolerate a "Gel number N is not used" error. Any other
|
||
|
error will cause it to give up.
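Three of the checks listed above (1, 6 and 7) can be sketched as follows. The data layout (a dictionary of gel number to position and neighbours) is hypothetical and the code is an illustration of the relationships, not the program's own routine.

```python
def check_neighbours(gels):
    """gels: dict of gel number -> dict with keys position, left and right."""
    errors = []
    for num, gel in gels.items():
        left = gel["left"]
        if not left:
            continue
        # Check 1: does the left neighbour agree that we are on its right?
        if left not in gels or gels[left]["right"] != num:
            errors.append(f"Hand holding problem for gel reading {num}")
            continue
        # Check 6: a left neighbour must not start to the right of us.
        left_pos, pos = gels[left]["position"], gel["position"]
        if left_pos > pos:
            errors.append(
                f"Gel number {left} with position {left_pos} is left "
                f"neighbour of gel number {num} with position {pos}"
            )
    return errors


def find_loop(gels, left_end):
    """Check 7: follow right neighbours; return the readings seen before a
    repeat is met, or None if the chain ends normally."""
    seen = []
    num = left_end
    while num:
        if num in seen:
            return seen
        seen.append(num)
        num = gels[num]["right"]
    return None
```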
|
||
|
@29. TX 1 @ Examine quality
|
||
|
|
||
|
Analyses the quality of the data in a contig. It reports on
|
||
|
the proportion of the consensus that is "well determined" and will
|
||
|
display a sequence of symbols that indicate the quality of the
|
||
|
consensus at each position.
|
||
|
|
||
|
Identify the contig to analyse, and the section of interest.
|
||
|
The current consensus calculation cutoff score will be used to
|
||
|
decide if each position is "well determined". In general the quality
|
||
|
of a reading deteriorates along the length of the gel and so it is
|
||
|
also possible to use a length cutoff for the quality calculation.
|
||
|
Only the data from the first section of each reading will be
|
||
|
included in the quality calculation. The length is altered under
|
||
|
"set parameters" and is initially set to the maximum reading length.
|
||
|
A summary showing the percentage of the consensus that falls into
|
||
|
each category of quality is shown. Choose whether or not to have the
|
||
|
quality codes for each position of the consensus displayed. They can
|
||
|
be displayed as either graphics or text.
|
||
|
|
||
|
The quality of the data depends on the number of times it has
|
||
|
been sequenced and the particular uncertainty codes used in each
|
||
|
gel reading. This function divides the data into five categories,
|
||
|
assigning each a symbol or code:
|
||
|
1. Well determined on both strands and they agree. code=0
|
||
|
2. Well determined on the plus strand only. code=1
|
||
|
3. Well determined on the minus strand only. code=2
|
||
|
4. Not well determined on either strand. code=3
|
||
|
5. Well determined on both strands but they disagree. code=4
|
||
|
A position is "well determined" if it is assigned one of the symbols
|
||
|
A,C,G,T when the algorithm described in the section "calculate a
|
||
|
consensus" is applied. The calculation is performed separately for each
|
||
|
strand.
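The five categories can be expressed as a small function. This sketch assumes that a consensus symbol has already been calculated for each strand at every position and that any symbol other than A, C, G or T (for example a dash) means "not well determined"; the names are hypothetical.

```python
WELL_DETERMINED = set("ACGT")


def quality_code(plus, minus):
    """Quality code for one position from the two per-strand consensus symbols."""
    plus_ok = plus in WELL_DETERMINED
    minus_ok = minus in WELL_DETERMINED
    if plus_ok and minus_ok:
        return 0 if plus == minus else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3


def quality_codes(plus_consensus, minus_consensus):
    return [quality_code(p, m) for p, m in zip(plus_consensus, minus_consensus)]


print(quality_codes("ACG-T", "AC-GA"))
```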
|
||
|
|
||
|
If the user chooses to have the data displayed graphically the
|
||
|
following scheme is used. A rectangular box is drawn so that the x
|
||
|
coordinate represents the length of the contig. The box is
|
||
|
notionally divided vertically into 5 possible levels which are given
|
||
|
the y values: -2,-1,0,1,2. The quality codes attributed to each
|
||
|
base position are plotted as rectangles. Each rectangle represents
|
||
|
a region in which the quality codes are identical, so a single base
|
||
|
having a different code from its immediate neighbours will appear as
|
||
|
a very narrow rectangle.
|
||
|
|
||
|
Rectangle bottom and top y values
|
||
|
|
||
|
Quality 0 rectangle from 0 to 0
|
||
|
Quality 1 rectangle from 0 to 1
|
||
|
Quality 2 rectangle from 0 to -1
|
||
|
Quality 3 rectangle from -1 to 1
|
||
|
Quality 4 rectangle from -2 to 2
|
||
|
|
||
|
Obviously a single line at the midheight shows a perfect
|
||
|
sequence.
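The grouping of identical codes into rectangles can be sketched like this. The y values are copied from the table above in the order listed; the function name and the 1-based base numbering are assumptions made for the illustration.

```python
from itertools import groupby

# Quality code -> the pair of y values listed in the table above.
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def rectangles(codes):
    """Return (first_base, last_base, y_from, y_to) for each run of equal codes."""
    rects = []
    start = 1
    for code, run in groupby(codes):
        n = len(list(run))
        y_from, y_to = Y_RANGE[code]
        rects.append((start, start + n - 1, y_from, y_to))
        start += n
    return rects


print(rectangles([1, 1, 1, 3, 1, 0, 0]))
```

A single base whose code differs from its neighbours (the 3 in the example) produces a rectangle one base wide, which is why it appears very narrow on the plot.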
|
||
|
|
||
|
Typical dialogue is shown below.
|
||
|
|
||
|
41.47% OK on both strands and they agree(0)
|
||
|
55.48% OK on plus strand only(1)
|
||
|
2.08% OK on minus strand only(2)
|
||
|
0.97% Bad on both strands(3)
|
||
|
0.00% OK on both strands but they disagree(4)
|
||
|
? (y/n) (y) Show sequence of codes
|
||
|
|
||
|
10 20 30 40 50
|
||
|
1111111111 1111111111 1111111111 1111111111 1111111111
|
||
|
|
||
|
60 70 80 90 100
|
||
|
1111111111 1111111111 1111111111 3111111111 1111111111
|
||
|
|
||
|
110 120 130 140 150
|
||
|
1111111111 1111131111 1111111111 1111111111 1111111111
|
||
|
|
||
|
160 170 180 190 200
|
||
|
1111111111 1111111111 1111111111 1111111111 1111111133
|
||
|
|
||
|
210 220 230 240 250
|
||
|
1311111111 1111111111 1111111110 0000000000 0000220000
|
||
|
|
||
|
260 270 280 290 300
|
||
|
0000000000 0020000000 2200000202 0002000000 0000222200
|
||
|
|
||
|
@26. TX 3 @ Alter relationships
|
||
|
|
||
|
Used to make what are normally illegal changes to the
|
||
|
database. That is, the normal checks are not done and any item in the
|
||
|
database can be changed independently of all others. Users need to
|
||
|
know what they are doing because it is very easy to make a horrible
|
||
|
mess. Always start by making a copy!
|
||
|
|
||
|
By using the options here users can move one section of a
|
||
|
contig relative to another, break contigs, remove contigs, remove
|
||
|
gel readings, etc. To give flexibility most of the commands do only
|
||
|
one thing. This means that several commands may have to be executed
|
||
|
to complete any change.
|
||
|
|
||
|
The following options are offered:
|
||
|
|
||
|
Cancel
|
||
|
Line change
|
||
|
Check logical consistency
|
||
|
Remove contig
|
||
|
Shift
|
||
|
Move gel reading
|
||
|
Rename gel reading
|
||
|
Break a contig
|
||
|
Remove a gel reading
|
||
|
Alter raw data parameters
|
||
|
|
||
|
1. QUIT returns to the main options of BAP.
|
||
|
3. Line change
|
||
|
allows the user to change the contents of any line in the file of
|
||
|
relationships. The line is selected by number, the program prints
|
||
|
the current line and prompts for the new line.
|
||
|
4. Check logical consistency
|
||
|
5. Remove a contig
|
||
|
This function removes a contig and all its gel readings. The user
|
||
|
specifies any reading in the contig.
|
||
|
6. Shift
|
||
|
allows the user to change all the relative positions of a set of
|
||
|
neighbouring gel readings by some fixed value, i.e. it will shift
|
||
|
related gel readings either left or right. It can therefore be
|
||
|
used to change the alignment of the gel readings in a contig. It
|
||
|
prompts for the number of the first gel reading to shift and then
|
||
|
for the distance to move them (note that a negative value will move
|
||
|
the gel readings left and a positive value right). It then chains
|
||
|
rightwards (i.e. follows right neighbours) and shifts each gel
|
||
|
reading, in turn, up to the end of the contig. (This means that
|
||
|
only those gel readings from the first to shift to the rightmost are
|
||
|
moved). It updates the length of the contig accordingly.
|
||
|
7. Move gel reading
|
||
|
is a function to renumber a gel reading. It moves all the
|
||
|
information about a gel reading on to another line. The user must
|
||
|
specify the number of the gel reading to move and the number of the
|
||
|
line to place it. It takes care of all the relationships. Of course
|
||
|
gel readings must not be moved to lines occupied by other gel
|
||
|
readings!
|
||
|
8. Rename gel reading
|
||
|
is a function that is used to rename the archive names of gel
|
||
|
readings in the database; it only changes the name in the .ARN
|
||
|
file of the database.
|
||
|
|
||
|
9. Break contig
|
||
|
|
||
|
Occasionally it is necessary to break a contig into two parts
|
||
|
and this can be achieved using this option. The program needs only
|
||
|
the number of a gel reading. This is the gel reading that will
|
||
|
become a left end after the break. That is, the break is made
|
||
|
between this gel reading and its left neighbour. A new contig line
|
||
|
is created so ensure that there is sufficient space in the database.
|
||
|
10. Removing gel readings from contigs
|
||
|
|
||
|
Gel readings can be removed from contigs. If they are
|
||
|
essential for holding the contig together (i.e. are the only gel
|
||
|
reading covering a particular region), the program will create a new
|
||
|
contig.
|
||
|
|
||
|
11. Alter raw data parameters
|
||
|
|
||
|
Allows the user to edit the individual raw data parameters,
|
||
|
such as the left and right cutoff lengths and the name of the
|
||
|
machine readable trace file. The user must specify the gel line to
|
||
|
modify, and provide new values for the length of the raw sequence
|
||
|
including cutoff lengths, the left cutoff position, the length of
|
||
|
the original working sequence, the machine type, and the name of the
|
||
|
raw data file, where these values change.
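The "Shift" and "Break contig" options described above both work by chaining along right neighbours. The sketch below illustrates that idea on a hypothetical in-memory structure (one dictionary per contig, keyed by gel number); it assumes a consistent database with no loops, and the renumbering of positions after a break is an assumption, not something stated here.

```python
def chain_rightwards(gels, first):
    """Yield gel numbers from `first` to the right end of the contig."""
    num = first
    while num:
        yield num
        num = gels[num]["right"]


def shift(gels, first, distance):
    """Move `first` and all readings to its right; return the new contig length."""
    for num in chain_rightwards(gels, first):
        gels[num]["position"] += distance   # negative moves left, positive right
    return max(g["position"] + abs(g["length"]) - 1 for g in gels.values())


def break_contig(gels, new_left_end):
    """Break between `new_left_end` and its left neighbour.

    Returns the gel numbers of the new right hand contig, whose positions
    are renumbered so that the new left end starts at 1 (assumed behaviour).
    """
    old_left = gels[new_left_end]["left"]
    if not old_left:
        raise ValueError("reading is already a left end")
    gels[old_left]["right"] = 0
    gels[new_left_end]["left"] = 0
    offset = gels[new_left_end]["position"] - 1
    right_part = list(chain_rightwards(gels, new_left_end))
    for num in right_part:
        gels[num]["position"] -= offset
    return right_part
```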
|
||
|
@27. TX 1 @ Set display parameters
|
||
|
|
||
|
Used to redefine the parameters that control the cutoff
|
||
|
employed by the consensus calculation and quality examiner, the
|
||
|
maximum length of each reading to include in the quality
|
||
|
calculation, the line length used by the display function, the text
|
||
|
window length used by the graphics options, and the graphics window
|
||
|
length used by the graphics options.
|
||
|
|
||
|
The default cutoff score is 75%. The default line length is 50
|
||
|
characters. For protein sequences the cutoff is always 100%.
|
||
|
|
||
|
The text window used by the graphics options controls the
|
||
|
amount of sequence listed at the crosshair position. The graphics
|
||
|
window controls the "zoom" function. Both these windows are defined
|
||
|
as the number of bases that should be shown, to both left and right
|
||
|
of the crosshair.
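As an illustration of how such a window translates into a range of bases, here is a minimal sketch. The function name is hypothetical, and clamping the range to the ends of the contig is an assumption.

```python
def window(crosshair, half_width, contig_length):
    """Bases shown when `half_width` bases are wanted either side of the crosshair."""
    left = max(1, crosshair - half_width)
    right = min(contig_length, crosshair + half_width)
    return left, right


print(window(100, 20, 689))
```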
|
||
|
@30. TX 3 @ Shuffle pads
|
||
|
|
||
|
One weakness of the alignment strategy used is that padding
|
||
|
characters are not always aligned by the assembly routine. This
|
||
|
function attempts to align padding characters using a very simple
|
||
|
strategy. It does not solve all pad alignment problems but is a
|
||
|
useful first step during cleaning-up operations.
|
||
|
@10. TX 2 @Clear graphics
|
||
|
|
||
|
Clears graphics from the screen.
|
||
|
@11. TX 2 @Clear text
|
||
|
|
||
|
Clears text from the screen.
|
||
|
@12. TX 2 @Draw a ruler.
|
||
|
|
||
|
This option allows the user to draw a ruler or scale along the
|
||
|
x axis of the screen to help identify the coordinates of points of
|
||
|
interest. The user can define the position of the first base to be
|
||
|
marked (for example if the active region is 1501 to 8000, the user
|
||
|
might wish to mark every 1000th base starting at either 1501 or 2000
|
||
|
- it depends if the user wishes to treat the active region as an
|
||
|
independent unit with its own numbering starting at its left edge,
|
||
|
or as part of the whole sequence). The user can also define the
|
||
|
separation of the ticks on the scale and their height. If required
|
||
|
the labelling routine can be used to add numbers to the ticks.
|
||
|
@14. TX 2 @Reposition plots
|
||
|
|
||
|
The position of each of the plots is defined relative to a
|
||
|
user's drawing board which has size 1-10,000 in x and 1-10,000 in y.
|
||
|
Plots for each option are drawn in a window defined by x0,y0 and
|
||
|
xlength,ylength, where x0,y0 is the position of the bottom left hand
|
||
|
corner of the window, and xlength is the width of the window and
|
||
|
ylength the height of the window.
|
||
|
--------------------------------------------------------- 10,000
|
||
|
1 1
|
||
|
1 -------------------------------------- ^ 1
|
||
|
1 1 1 1 1
|
||
|
1 1 1 1 1
|
||
|
1 1 1 ylength 1
|
||
|
1 1 1 1 1
|
||
|
1 1 1 1 1
|
||
|
1 -------------------------------------- v 1
|
||
|
1 x0,y0^ 1
|
||
|
1 <---------------xlength--------------> 1
|
||
|
--------------------------------------------------------- 1
|
||
|
1 10,000
|
||
|
|
||
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
||
|
The default window positions are read from a file "ANALMARG" when
|
||
|
the program is started. Users can have their own file if required.
|
||
|
As all the plots start at the same position in x and have the same
|
||
|
width, x0 and xlength are the same for all options. Generally users
|
||
|
will only want to change the start level of the window y0 and its
|
||
|
height ylength. This option allows users to change window positions
|
||
|
whilst running the program. The routine prompts first for the
|
||
|
number of the option that the user wishes to reposition; then for
|
||
|
the y start and height; then for the x start and length. Note that
|
||
|
changes to the x values affect all options. If the user types only
|
||
|
carriage return for any value it will remain unchanged. Note that,
|
||
|
unlike all the other programs, the boxes used to contain analytical
|
||
|
results (e.g. plot quality) should not be made to overlap one another,
|
||
|
as the function of the crosshair routine depends on which box the
|
||
|
crosshair is in!
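The reason overlapping boxes cause trouble can be seen in a small sketch of the window geometry. The `Window` type, the function names and the example window values are all hypothetical (they are not the contents of "ANALMARG"); coordinates are in drawing board units.

```python
from typing import NamedTuple


class Window(NamedTuple):
    x0: int
    y0: int
    xlength: int
    ylength: int

    def contains(self, x, y):
        return (self.x0 <= x <= self.x0 + self.xlength
                and self.y0 <= y <= self.y0 + self.ylength)


def overlap_in_y(a, b):
    """True if two windows share any range of y (x is the same for all plots)."""
    return a.y0 <= b.y0 + b.ylength and b.y0 <= a.y0 + a.ylength


def box_under_crosshair(windows, x, y):
    """Name of the first window containing the crosshair, or None."""
    for name, win in windows.items():
        if win.contains(x, y):
            return name
    return None


windows = {
    "plot single contig": Window(1000, 1000, 8000, 3000),
    "plot quality": Window(1000, 5000, 8000, 1000),
}
print(box_under_crosshair(windows, 2000, 5500))
```

If two boxes overlapped, a crosshair position inside the shared region would match both, and the choice of box (and hence the meaning of the key pressed) would be ambiguous.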
|
||
|
@15. TX 2 @Label a diagram
|
||
|
|
||
|
This routine allows users to label any diagrams they have
|
||
|
produced. They are asked to type in a label. When the user types
|
||
|
carriage return to finish typing the label the cross-hair appears on
|
||
|
the screen. The user can position it anywhere on the screen. If the
|
||
|
user types R (for right justify) the label will be written on the
|
||
|
diagram with its right end at the cross-hair position. If the user
|
||
|
types L (for left justify) the label will be written on the diagram
|
||
|
with its left end at the cross hair position. The cross-hair will
|
||
|
then immediately reappear. The user may put the same label on
|
||
|
another part of the diagram as before or if he hits the space bar he
|
||
|
will be asked if he wishes to type in another label.
|
||
|
|
||
|
Typical dialogue follows.
|
||
|
? Menu or option number=15
|
||
|
Type label then drive cross hair to left or right end
|
||
|
of label position then hit "L" to write label left
|
||
|
justified or "R" to write label right justified or
|
||
|
the space bar to quit
|
||
|
|
||
|
|
||
|
? Label=delta gene
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
? Label=
|
||
|
|
||
|
@16. TX 2 @Display a map
|
||
|
|
||
|
This is disabled!
|
||
|
@7. TX 1 @Redirect output
|
||
|
|
||
|
Used to direct output that would normally appear on the screen
|
||
|
to a file and to create postscript output.
|
||
|
|
||
|
Select redirection of either text or graphics, and supply the
|
||
|
name of the file that the output should be written to.
|
||
|
|
||
|
The results from the next options selected will not appear on
|
||
|
the screen but will be written to the file. When option 7 is
|
||
|
selected again the file will be closed and output will again appear
|
||
|
on the screen.
|
||
|
@13. TX 2 @Use crosshair
|
||
|
|
||
|
This option puts a steerable cross on the screen which the
|
||
|
user drives around by using the arrow keys (or mouse). When the
|
||
|
crosshair is visible a number of options are available if the user
|
||
|
types one of a set of special keyboard characters. Any other
|
||
|
characters will cause an exit from the crosshair option. The special
|
||
|
keys are:
|
||
|
|
||
|
I = Identify the nearest gel reading
|
||
|
Z = Zoom in
|
||
|
Q = plot Quality
|
||
|
S = display the aligned Sequences at the crosshair position
|
||
|
N = list the Names and Numbers of the sequences at the crosshair
|
||
|
|
||
|
In order for any of these special keys to operate, the
|
||
|
crosshair must be in an appropriate display box, and the precise
|
||
|
function of the keys will also depend on which box the crosshair is
|
||
|
in.
|
||
|
|
||
|
If the crosshair is in the "plot all contigs" box, Z will
|
||
|
cause a new box to appear showing all the readings for the nearest
|
||
|
contig; Q will give the same as Z but will also produce an extra box
|
||
|
showing the "quality" plot.
|
||
|
|
||
|
If Z is hit in the "plot single contig" box, the contig will
|
||
|
be zoomed to the current graphics window size. The zoom will be
|
||
|
roughly centred on the crosshair position. Because of this it is
|
||
|
possible to step along a contig by repeatedly zooming with the
|
||
|
crosshair near to one end of the single contig display box. If I is
|
||
|
hit the crosshair must be close to a gel reading line. If Q is hit,
|
||
|
the quality plot will be produced for the region shown in the plot
|
||
|
single contig box. In all cases when the "plot all contigs" box is
|
||
|
shown, a vertical line will bisect the line that represents the
|
||
|
relevant contig, at the current position.
|
||
|
|
||
|
If the crosshair is in the plot quality box only the character
|
||
|
"s" will operate as a special symbol.
|
||
|
|
||
|
The number of bases shown in the N and S options is controlled
|
||
|
by the current graphics text window size, and the size of the zoom
|
||
|
window by the current graphics window size. Both are set by the
|
||
|
parameter setting function of the general menu.
|
||
|
@33. TX 2 @Plot single contig
|
||
|
|
||
|
This option produces a schematic of a selected region of a
|
||
|
single contig by drawing a horizontal line to represent each of its
|
||
|
gel readings. The lines show the relative positions of each reading
|
||
|
and also their sense. The plot is divided vertically into two
|
||
|
sections by a line that is identified by an asterisk drawn at each
|
||
|
end. All lines that lie above this line represent readings that are
|
||
|
in their original sense, all lines below show readings that are in
|
||
|
the complementary sense to their original. By use of the crosshair
|
||
|
function the plot can be stepped through and examined in more
|
||
|
detail. See help on crosshair.
|
||
|
@34. TX 2 @Plot all contigs
|
||
|
|
||
|
This option produces a schematic of all the contigs in a
|
||
|
database. It does this by drawing a horizontal line to represent
|
||
|
each of them. In order to show the ends of each contig it draws the
|
||
|
lines for contigs at alternate heights: the first at height one, the
|
||
|
second at height two, the third at height one, etc. The order of the
|
||
|
contigs in the display is the same as their order in the database.
|
||
|
By use of the crosshair function the plot can be stepped through and
|
||
|
examined in more detail. See help on crosshair.
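The alternating heights can be sketched as below. The function name is hypothetical, and placing the contigs end to end along x is an assumption suggested by the need to distinguish their ends.

```python
def layout(contig_lengths):
    """Return (start, end, height) for each contig, in database order."""
    segments = []
    start = 1
    for index, length in enumerate(contig_lengths):
        height = 1 if index % 2 == 0 else 2   # heights alternate: one, two, one, ...
        segments.append((start, start + length - 1, height))
        start += length
    return segments


print(layout([689, 120, 300]))
```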
|
||
|
@31. TX 3 @ Disassemble readings
|
||
|
|
||
|
This function is used to remove a list of readings from a
|
||
|
database, or to create a new contig from a single reading moved from
|
||
|
an existing contig. This latter mode is useful for repositioning a
|
||
|
reading in a repeat: once separated it can be placed in the join
|
||
|
editor and scrolled by the other copies. Removal of sets of
|
||
|
readings works in two modes: 1. A set of adjacent readings in a
|
||
|
contig can be removed by the user naming the two end ones; or 2. A
|
||
|
batch of readings from any number of contigs can be defined by the
|
||
|
user naming a file containing a list of reading names. The program
|
||
|
cleans up the database by moving data to fill up any holes made in
|
||
|
the files.
|
||
|
|
||
|
For both modes of operation the program will ask for a file of
|
||
|
file names. If users create their own file (i.e. mode 2) each reading
|
||
|
NAME must be on a separate line. For mode 1 the user types the NAMES
|
||
|
of the leftmost and rightmost readings to be removed. They and all
|
||
|
intervening readings will be removed. Note that the routine operates
|
||
|
on reading names - not numbers. For both modes, if necessary, new
|
||
|
contigs will be created.
|
||
|
@35. TX 1 3 @Find internal joins
|
||
|
|
||
|
The purpose of this function is to use data already in the
|
||
|
database to find possible joins between contigs. Joins may have
|
||
|
been missed due to poor data or may have not been made due to
|
||
|
repeated sequences. Where appropriate, it may be possible to find
|
||
|
potential joins by using the "unused data" derived from sequencing
|
||
|
machines.
|
||
|
For all overlaps found when the X version is used, the contig editor
|
||
|
(in join mode) will be called up with the two contigs aligned.
|
||
|
The database is checked for logical consistency. Supply a minimum
|
||
|
initial match length, a minimum alignment block, the maximum pads
|
||
|
per sequence, the maximum percent mismatch after alignment, and the
|
||
|
probe length. Choose if clipped data is to be used; if so, define the
|
||
|
window size for finding good data and the number of dashes allowed
|
||
|
in the window. Processing will commence. Most of these values are
|
||
|
used in an identical way in the autoassemble function. The others
|
||
|
are defined below.
|
||
|
The program strategy
|
||
|
Take the first contig and calculate its consensus. If clipped data
|
||
|
is being used examine all readings that are in the complementary
|
||
|
orientation, and sufficiently near to the contig's left end, to see
|
||
|
if they have good clipped sequence which, if present, would protrude
|
||
|
from the left end of the contig. If found add the longest such
|
||
|
sequence to the left end of the consensus. Do the same for the right
|
||
|
end by examining readings that are in their original orientation. If
|
||
|
any are found add the longest extension to the right end of the
|
||
|
consensus. Repeat the consensus calculations and extensions for all
|
||
|
contigs hence producing an extended consensus. If clipped data is
|
||
|
not being used simply calculate the consensus for the whole
|
||
|
database. Now look for possible joins by processing the extended
|
||
|
consensus in the following way. Take the last, say 100, bases
|
||
|
(termed the "probe length" by the program) of the rightmost
|
||
|
consensus, and compare it in both orientations with the extended consensus
|
||
|
of all the other contigs. Display any sufficiently good alignments.
|
||
|
Repeat with the left end of the rightmost contig. Do the same for
|
||
|
the ends of all the extended contigs, always only comparing with the
|
||
|
contigs to their left, so that the same matches do not appear twice.
|
||
|
Good clipped data is defined by sliding a window of "Window size for
|
||
|
good data scan" bases outwards along the sequence and stopping when
|
||
|
"Maximum number of dashes in scan window" or more dashes appear in
|
||
|
the window. Note that it is advisable to have some sort of cutoff
|
||
|
because if we simply take all the data it might be so full of
|
||
|
rubbish that we won't find any good matches. For the same reason it
|
||
|
is worth trying the procedure with different cutoffs. An initial run
|
||
|
using no clipped data is also recommended. Sufficiently good
|
||
|
alignments are defined by criteria equivalent to those used in
|
||
|
autoassemble, however here we only display alignments that pass all
|
||
|
tests.
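The scan for good clipped data and the choice of probes can be sketched as follows. The function names are hypothetical. The clipped sequence is assumed to be ordered outwards from the contig, and the good data is assumed to end where the first offending window starts; the text above does not say exactly where the cut is made.

```python
def good_clipped_length(clipped, window_size, max_dashes):
    """Number of leading bases of `clipped` judged good.

    Slides a window of `window_size` bases outwards and stops at the first
    window that holds `max_dashes` or more dashes.
    """
    last_start = max(1, len(clipped) - window_size + 1)
    for start in range(last_start):
        if clipped[start:start + window_size].count("-") >= max_dashes:
            return start
    return len(clipped)


def end_probes(extended_consensus, probe_length):
    """The left end and right end probes of one extended consensus."""
    return extended_consensus[:probe_length], extended_consensus[-probe_length:]


print(good_clipped_length("ACGTACGTACA-C-G-T-A-", 5, 2))
```

Trying several values for the two scan parameters, as recommended above, simply means calling the scan again with a different window size or dash limit before rebuilding the extended consensus.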
|
||
|
Bugs
|
||
|
If a small contig is wholly contained within a larger one, such that
|
||
|
its ends are further than ("Probe length" - "Minimum initial match
|
||
|
length") from the ends of the larger contig, and the consensus for
|
||
|
the small contig lies to the left of the consensus for the large contig,
|
||
|
the overlap will not be discovered. (See the search strategy).
|
||
|
All numbering is relative to base number one in the contig: matches
|
||
|
to the left (i.e. in the clipped data) have negative positions,
|
||
|
matches off the right end of the contig (i.e. in the clipped data)
|
||
|
have positions greater than that of the contig length. The
|
||
|
convention for reporting the positions of overlaps is as follows: if
|
||
|
neither contig needs to be complemented the positions are as shown.
|
||
|
If the program says "contig x in the - sense" then the positions
|
||
|
shown assume contig x has been complemented. For example in the
|
||
|
results given below the positions for the first overlap are as
|
||
|
reported, but those for the second assume that the contig in the
|
||
|
minus sense (i.e. 443) has been complemented.
|
||
|
|
||
|
|
||
|
Possible join between contig 445 in the + sense and contig 405
|
||
|
Percentage mismatch after alignment = 4.9
|
||
|
412 422 432 442 452 462
|
||
|
405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
|
||
|
********* * ******** ***** *** ********** ********** **********
|
||
|
445 -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA
|
||
|
-127 -117 -107 -97 -87 -77
|
||
|
472 482 492 502 512
|
||
|
405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
|
||
|
********** ********** ********** ********** **
|
||
|
445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
|
||
|
-67 -57 -47 -37 -27
|
||
|
Possible join between contig 443 in the - sense and contig 423
|
||
|
Percentage mismatch after alignment = 10.4
|
||
|
64 74 84 94 104 114
|
||
|
423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG
|
||
|
**** ***** ********** ********** ****** ** ***** **** *********
|
||
|
443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
|
||
|
3610 3620 3630 3640 3650 3660
|
||
|
124 134 144 154 164
|
||
|
423 TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA
|
||
|
*** ****** ********** ** ******* *** ***** ** **
|
||
|
443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-
|
||
|
3670 3680 3690 3700 3710
|
||
|
|
||
|
|
||
|
@36. TX 3 @Double strand

PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS
IT HAS CURRENTLY HAD VERY LITTLE TESTING.

Uses the cutoff data to change single stranded sections of a
contig into double stranded sections. The data is used cautiously,
to minimise the number of disagreements created. Note, however,
that a slight overall degradation in quality will still occur.

When using this option you will be prompted for a contig and a
region within that contig. The default region is the entire contig.
The option then searches the region for areas of good data on one
strand and cutoff data on the opposite strand, and extends the
cutoff data. The amount of cutoff data used is determined by a
maximum number of mismatches and a score (derived by accumulating
points for mismatches (-8), matches (+1) and insertions (-5) over
the length of an alignment). The defaults are:

maximum mismatches : 6

score for mismatch : -8
score for correct match : +1
score for insertion : -5

Note that with successive calls to this option it is possible
to double strand more and more data. However, the quality of the
data generated will diminish each time.
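
The scoring scheme described in this section can be sketched as
follows. This is only an illustration and not the program's code:
the function names, the use of '-' to mark an insertion, and the
rule of keeping the highest-scoring prefix are assumptions made for
the example. Only the point values and the mismatch limit are taken
from the defaults listed.

```python
def score_alignment(cutoff, consensus, match=1, mismatch=-8, insertion=-5):
    """Score two equal-length aligned strings; '-' marks an insertion (assumed)."""
    score = 0
    mismatches = 0
    for a, b in zip(cutoff, consensus):
        if a == "-" or b == "-":
            score += insertion
        elif a == b:
            score += match
        else:
            score += mismatch
            mismatches += 1
    return score, mismatches


def best_extension(cutoff, consensus, max_mismatches=6):
    """Length of the best-scoring prefix within the mismatch limit.

    Hypothetical selection rule, used here only to show how a score and
    a mismatch limit could bound the amount of cutoff data used.
    """
    best_len, best_score = 0, 0
    for n in range(1, min(len(cutoff), len(consensus)) + 1):
        score, mismatches = score_alignment(cutoff[:n], consensus[:n])
        if mismatches > max_mismatches:
            break
        if score > best_score:
            best_len, best_score = n, score
    return best_len
```

With these point values a single mismatch cancels eight matches, so
an extension is only worthwhile when the cutoff data agrees with the
other strand over a long run.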
@37. TX 3 @Auto-select oligos

PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS
IT HAS CURRENTLY HAD VERY LITTLE TESTING.

Generates a file (default "primers") of suggested primers to
use for covering a single stranded section or for walking off the
end of a contig. The file generated contains the gel reading name,
the primer sequence, its offset in the contig and the orientation.
An example file would be:

c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +
c98a10.s1 ATTATCACTTTACGGGTC (@ 6959 ) +
c81c1.s1 CAAGAAGGCGATAGAAG (@ 7643 ) +
c76a10.s1 CCTCATCCTGTCTCTTG (@ 8441 ) +
c81g4.s1 ATGAAACCTGGGCGTTG (@ 16156 ) +
c91e6.s1 GTTTTCAGATGTCGGAG (@ 18249 ) +
c81e12.s1 GCTACCGTAAAACACTTC (@ 18737 ) +
c93h11.s1 GCTGCTTTTTGTTTTATCC (@ 19158 ) +
c81h6.s1 CTTCCACTTCTTTCTTATC (@ 21210 ) +
c86a12.s1 CGAATGATAAAGACAAATCAG (@ 22122 ) +
c98b1.s1 GCCACTTTATCCGAGAC (@ 3048 ) -
c97c5.s1 GTGTTTTGGGTATATTGTG (@ 3371 ) -
c83d2.s1 CTACACAGAATGAACCC (@ 3768 ) -
c78h10.s1 GGCGGTGAAGATTGAAG (@ 4200 ) -
c98h9.s2dt CTCGTTTAAATTTCAAACTTCC (@ 7419 ) -
c95a9.s1 ATTGGAAGGAAGGAGGG (@ 22996 ) -
c82b4.s1 TGTAGCCGAAATCTTCC (@ 23369 ) -
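
A file in this layout can be read by other software with a few
lines of code. The sketch below assumes each line has exactly the
four fields shown in the example (reading name, sequence, offset in
"(@ n )" form, and orientation); the function name and the regular
expression are illustrative and are not part of the program.

```python
import re

# One primer per line: name, sequence, "(@ offset )", then "+" or "-".
PRIMER_LINE = re.compile(
    r"^\s*(\S+)\s+([A-Za-z]+)\s+\(@\s*(\d+)\s*\)\s*([+-])\s*$"
)


def parse_primer_line(line):
    """Return (reading, sequence, offset, sense), or None if the line does not match."""
    m = PRIMER_LINE.match(line)
    if m is None:
        return None
    reading, sequence, offset, sense = m.groups()
    return reading, sequence, int(offset), sense
```

Lines that do not fit the pattern return None rather than raising
an error, so blank or unexpected lines can simply be skipped.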

This is best employed after having previously used the 'Double
strand' option. When selecting the option you will be asked for the
contig, a region within this contig and the file to write the list
of primers to. For each primer suggested a tag is automatically
created containing details of the gel reading name and the sequence.
Preferably the tag is created on the gel reading from which the
primer was selected. When this is not possible the tag is placed on
another reading that overlaps the primer position.

When invoked with the dialogue option you will be asked two
further questions relating to the position and size of the stretch
of consensus checked for suitable oligos. You will be prompted for
the start and end of a region (default 40-140), given as a position
relative to, and to the left of, the section to be covered.

For example:

? Menu or option number=d37
Auto-select oligos
Default Contig identfier=/e97f2.s1
? Contig identfier=
? Start position in contig (1-20942) (1) =10000
? End position in contig (10000-20942) (20942) =11000
Default Name of file for primers=primers
? Name of file for primers=
? Start of oligo choice region (1-1024) (40) =50
? End of oligo choice region (50-1024) (150) =150

This means that we are going to look for oligos to use as
primers covering the region 10000 to 11000. For each single stranded
section in this region we search for oligos between 50 and 150
positions to its left. So if we had a single stranded section from
10121 to 10295 we would search for oligos in the region 9971 to
10071.
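
The arithmetic in this example can be written out as a short
sketch. The function name and argument names are hypothetical; the
calculation simply subtracts the two oligo choice offsets from the
start of the single stranded section, as in the worked figures.

```python
def oligo_search_region(section_start, choice_start=40, choice_end=140):
    """Return the (left, right) consensus positions searched for oligos.

    section_start is the first position of the single stranded section;
    choice_start and choice_end are the offsets to its left.
    """
    if choice_start > choice_end:
        raise ValueError("choice_start must not exceed choice_end")
    return section_start - choice_end, section_start - choice_start


# Section starting at 10121 with offsets 50 and 150, as in the example.
print(oligo_search_region(10121, 50, 150))
```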
@38. TX 1 @Check assembly

This new function is used for checking the positioning of
assembled readings. It is useful for checking sequences that
contain repeats of length similar to that of a single gel reading.
It takes the poor quality data for each reading and compares it to
the segment of the consensus to which it should align. If the
extension of the read does not match the consensus then the read (or
its neighbours) has probably been assembled into the wrong place.
The program displays the bad alignments. The quality of an
alignment is defined by the percentage mismatch. Naturally the user
should select a value that takes into account the poor quality of
the data being aligned.
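
As a rough guide to what a percentage mismatch figure means, the
sketch below counts every aligned column in which the two characters
differ, padding characters included, and divides by the alignment
length. This counting rule is an assumption: it reproduces the 10.4
reported for the possible join between contigs 443 and 423 shown
earlier in this help, but it is not taken from the program itself.
The function name is hypothetical.

```python
def percent_mismatch(seq_a, seq_b):
    """Percentage of aligned columns whose two characters differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    if not seq_a:
        return 0.0
    differing = sum(1 for a, b in zip(seq_a, seq_b) if a != b)
    return 100.0 * differing / len(seq_a)
```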

When the routine is used from the X version the user is
offered the editor to examine poor alignments. If alignments are
reported as poor, but on inspection are OK, the user can set a tag
so that the poor quality data is ignored on subsequent passes
through the routine. Note however such data will then also be
ignored by the automatic double stranding routine!

The user defines the percentage mismatch, and the window size
and number of dashes allowed in the window used for selecting the
amount of the poor data to be employed. The user can also choose to
save the names of the poorly aligned reads in a file, and can select
an individual contig or scan the whole database. The file containing
the names of the poorly aligned reads can be used by the disassembly
routine to remove them from the database, and can then be used to
reassemble them. Note that the routine complements each contig twice
during processing.
@39. TX 1 @Find read pairs

This new function is used to check the positions of readings
taken from each end of the same template. For each forward read it
searches for a corresponding reverse reading. The search can be over
the whole database or over a single contig. The results can be
presented graphically for single contig searches and the crosshair
function can be used to identify the readings displayed.

Note that at present the function only knows that two reads
are from the same template by comparing reading names. For our local
projects we use the following naming convention: forward reads are
named abcdefgh.s1 and reverse reads abcdefgh.r1. The program expects
this naming convention and so if it finds reads fred.s1 and fred.r1
it assumes they are the forward and reverse reads for template fred.
In the future we will make the routine more general!
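
The naming convention can be illustrated with a short sketch
that groups reading names by template. The function name and the
use of a dictionary are assumptions made for the example; only the
.s1 and .r1 suffixes come from the convention described.

```python
def find_read_pairs(reading_names, forward_suffix=".s1", reverse_suffix=".r1"):
    """Map each template name to its (forward, reverse) reading names.

    Templates with only one of the two reads present are left out.
    """
    forward, reverse = {}, {}
    for name in reading_names:
        if name.endswith(forward_suffix):
            forward[name[: -len(forward_suffix)]] = name
        elif name.endswith(reverse_suffix):
            reverse[name[: -len(reverse_suffix)]] = name
    return {t: (forward[t], reverse[t]) for t in forward if t in reverse}
```

Readings whose names follow some other scheme are simply never
paired under this rule.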

If a single contig is selected and the output is listed the
program displays two lines for each pair: the first line shows the
reading name, its position and length, and the distance between the
extremities of the two reads; the second line shows the other read
name, its position and length. If there are pairs that are in
separate contigs or are facing away from one another they are listed
after the pairs that face inwards. Is this true?

If the results are plotted the full length of the template is
drawn with arrows indicating the direction of reads and the extent
of each reading. Those reads that have their partner in another
contig are marked by asterisks.

Typical dialogue is shown below.

? Select contigs (y/n) (y) =
Default Contig identifier=/i55d8.s1
? Contig identifier=
? Start position in contig (1-15227) (1) =
? End position in contig (1-15227) (15227) =
? Plot results (y/n) (y) = n
852 k23a1.r1 249 238 1615
806 k23a1.s1 1529 -335
238 i68e6.s1 422 193 1632
868 i68e6.r1 1756 -298
576 k17a2.s1 2370 213 1676
885 k17a2.r1 3790 -256
84 k27g6.s1 3456 291 1777
867 k27g6.r1 4905 -328
453 k01g10.s1 5805 142 1251
881 k01g10.r1 6909 -147
781 i98b8.r1 6754 338 1079
10 i98b8.s1 7653 -180
883 k02d11.r1 7327 276 1597
283 k02d11.s1 8726 -198
269 i68f9.s1 8191 169 1055
777 i68f9.r1 8891 -355
710 i91c6.s1 8245 95 1516
780 i91c6.r1 9403 -358
596 k27d12.s1 136 329 -329
219 k27d12.r1 1 -116
159 k27d11.r1 1830 -263 -263
317 k27d11.s1 2902 343
886 k17g11.r1 7107 -123 -123
647 k17g11.s1 1867 265
851 i69g10.r1 8045 -137 -137
277 i69g10.s1 4658 174
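
For the pairs that face one another, the separation in the last
column of the listing is consistent with the position of the right
hand read, plus its length, minus the position of the left hand
read. The sketch below is inferred from the figures listed and is
not taken from the program; the function name is hypothetical, and
the pairs at the end of the listing with negative values are not
covered by it.

```python
def template_separation(left_position, right_position, right_length):
    """Distance from the start of the left read to the far end of the right read.

    right_length may be negative (as listed for a read in the reverse
    sense); only its magnitude is used.
    """
    return right_position + abs(right_length) - left_position


# k23a1: left read at 249, right read at 1529 with length -335.
print(template_separation(249, 1529, -335))
```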

If contigs are not selected the pairs are sorted on their
separations.

? Select contigs (y/n) (y) = n
i68f2.s1 27 1781 1777
i68f2.r1 776 111 1777
k17f6.s1 601 60 1706
k17f6.r1 856 1405 1706
k17a2.s1 576 2370 1676
k17a2.r1 885 3790 1676
k27g3.s1 177 14985 1664
k27g3.r1 889 13564 1664
k27b12.s1 764 1 1086
k27b12.r1 857 932 1086
i98b8.s1 10 7653 1079
i98b8.r1 781 6754 1079
k16a3.s1 748 1276 1070
k16a3.r1 784 472 1070
k17b7.r1 786 14937 18942*
k17b7.s1 787 3601 18942*
k27d12.r1 219 1 15208*
k27d12.s1 596 136 15208*
k01g2.s1 502 87 14754*
k01g2.r1 782 9224 14754*

@ end of help