.npa
.left margin1
@-1. TX 0 @General
.sp
@-2. T 0 @Screen control
.sp
@-2. X 0 @Screen
.sp
@-3. TX 0 @Modification
.sp
@0. TX -1 @BAP
.left margin2
.PARA
This is an interactive program whose primary use is
for managing shotgun sequencing projects, but it can also be used for
handling alignments of other sequences, including those of proteins.
Currently the maximum 'gel reading' length is set to 4096 characters.
Almost all of the information below describes the use of the program for
shotgun projects, but those using the programs for handling other
sequence
alignments should interpret it accordingly.
The data for such a project is stored in a special type of database. The
program
contains the tools that are required to screen gel readings
against vector sequences and restriction sites, and to assemble
new gel
readings into the database (automatically comparing and aligning
them). In addition it contains editors and functions to examine the quality
of the aligned sequences.
.para
There are three main menus: "general", "screen" and "modification",
and some functions have submenus.
.left margin2
.lit
The general menu contains the following options:
    Open a database
    Display a contig
    List a text file
    Direct output
    Calculate a consensus
    Screen against restriction enzymes
    Screen against vector
    Check logical consistency
    Copy database
    Show relationships
    Set parameters
    Highlight disagreements
    Examine quality
    Check assembly
    Find read pairs

The graphics menu contains:
    Clear graphics
    Clear text
    Draw ruler
    Use cross hair
    Change margins
    Label diagram
    Plot map
    Plot single contig
    Plot all contigs

The modification menu contains:
    Edit contig
    Auto assemble
    Join contigs
    Complement a contig
    Alter relationships
    Extract gel readings
    Find internal joins
    Disassemble readings
    Shuffle pads
    Auto-select oligos
    Double strand

The alter relationships menu contains:
    Cancel
    Line change
    Check logical consistency
    Remove contig
    Shift
    Move gel reading
    Rename gel reading
    Break a contig
    Remove a gel reading
    Alter raw data parameters
.END LIT
.SK1
.para
Overview of the methodology
.para
The shotgun sequencing strategy
.para
In the shotgun sequencing procedure
the sequence to be determined is randomly broken into fragments of
about
1000 nucleotides in length. These fragments are cloned and then
selected randomly and their
sequences determined. The relationship between any pair of
fragments is not known beforehand
but is found by comparing their sequences.
If the sequence of one is found to be wholly or partially contained
within that of another for sufficient length to distinguish an
overlap from a repeat then those two fragments can be joined.
The
process of select, sequence and compare is continued until the
whole
of the DNA to be sequenced is in one continuous
well-determined
piece.
.para
Definition of a contig
.para
A CONTIG is a set of gel readings that are related to one
another by overlap of their sequences. All gel readings belong to
a contig and each contig contains at least one gel
reading. The gel readings in a contig can be summed to produce
a continuous consensus sequence and the length of this sequence is
the length of the contig. The rules used to perform this summation are
given under "the consensus algorithm".
At any stage
of a sequencing project the data will comprise a number of
contigs;
when a project is
complete there should be only one contig and its consensus will be
the finished sequence. Note that since being introduced and
defined as above the word "contig" has been taken up by those involved in
genomic mapping. In that context the consensus with a precise length is,
of course, not
defined.
.SK1
.LEFT MARGIN2
Introduction to the computer method
.LEFT margin2
.PARA
It is useful to consider the objectives of a sequencing project before
outlining how we use the computer to help achieve them. The aim of a
shotgun sequencing project is to
produce an accurate consensus sequence from many overlapping gel
readings.
It is necessary to know, particularly at the later
stages of the project, how accurate the
consensus sequence is. This enables us to know which regions of the
sequence require further work and also to know when the project is
finished.
To show the quality of the consensus, the programs described here
produce displays like that shown below.
.sk1
.lit
                         10        20        30        40        50
  -6 HINW.010    GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
     CONSENSUS   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                         60        70        80        90       100
  -6 HINW.010    CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
  -3 HINW.007                                            GGCACA*GTC
     CONSENSUS   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                        110       120       130       140       150
  -6 HINW.010    GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
  -3 HINW.007    GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
  -5 HINW.009                                        GGCAGCGACCGTCG
  17 HINW.999                                           AGCGACCGTCG
     CONSENSUS   GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                        160       170       180       190       200
  -6 HINW.010    TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
  -3 HINW.007    TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
  -5 HINW.009    TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
  17 HINW.999    TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
  12 HINW.017                                              GTAGAGC*
     CONSENSUS   TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
.END LIT
.para
This is an example showing the left end of a contig from
position 1 to 200. Overlapping this region are gel readings
numbered 6, 3, 5, 17 and 12;
6, 3 and 5
are in reverse orientation to their original reading (denoted by a minus
sign). Each gel reading also has a name (e.g. HINW.010). It can be seen that
in a number of places the sequences contain characters other than A,C,G
and
T. Some of these extra characters have been used by the sequencer to
indicate regions of uncertainty in the initial interpretation of the gel
reading, but the asterisks (*) have been inserted by the automatic
assembly function in order to align the sequences. Underneath each 50
character block of gel reading sequences is the consensus derived from
the
sequences aligned above (the line labelled CONSENSUS). For most of its
length the consensus has a definite nucleotide assignment but in a few
positions there is insufficient agreement between the gel readings and
so a dash (-) appears in the sequence. This display contains all the
evidence needed to assess the quality of the consensus: the number of
times
the sequence has been determined on each strand of the DNA, and the
individual nucleotide assignments given for each gel reading.
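.para
The way a CONSENSUS line relates to the readings above it can be sketched
in a few lines of Python. This is only an illustration and not the rules
the program uses (those are given under "the consensus algorithm"): the
75% cutoff, the function names, and the treatment of the padding
character as an ordinary symbol are all assumptions made for the example.
.lit
```python
from collections import Counter


def consensus_column(symbols, cutoff=0.75):
    """Return the agreed symbol for one aligned column, or '-' if none.

    symbols: the characters from each gel reading covering this position.
    cutoff:  assumed fraction of readings that must agree (hypothetical).
    """
    if not symbols:
        return "-"
    symbol, count = Counter(symbols).most_common(1)[0]
    return symbol if count / len(symbols) >= cutoff else "-"


def consensus(rows):
    """rows: equal-length strings, with ' ' where a reading has no data."""
    width = len(rows[0])
    out = []
    for i in range(width):
        column = [row[i] for row in rows if row[i] != " "]
        out.append(consensus_column(column))
    return "".join(out)
```
.end lit
Applied to positions 91 to 100 of the display above (readings CGGACACGTC
and GGCACA*GTC) this sketch gives -G-ACA-GTC, the same as the consensus
line shown.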
.para
So the aim is to produce the consensus sequence and, equally important,
a display of the experimental results from which it was derived.
.para
In order to achieve this the following operations need to be performed:
.left margin2
1) Put individual gel readings into the computer.
This might involve the manual interpretation of autoradiographs
or the transfer and processing of machine-readable files from fluorescent
sequencing machines.
.left margin2
2) Check each gel reading to make sure it is not simply part of one of the
vectors used to clone the sequence.
.left margin2
3) Check each gel reading to make sure that those fragments that span
the
ligation point used prior to sonication are not assembled as single
sequences.
.left margin2
4) Compare all the remaining gel readings with one another to assemble
them
to produce the consensus sequence.
.left margin2
5) Check the quality of the consensus and edit the sequences.
.left margin2
6) When all the consensus is sufficiently well determined, produce a copy
of
it for processing by other analysis programs.
.para
It is very unlikely that this procedure will be passed through only once.
Usually steps 1 to 5 are cycled through repeatedly, with step 4 just
adding
new sequences to those already assembled. Generally step 6 is also used
in
order to analyse imperfect sequence to check if it is the one the project
intended to sequence, or to look for interesting features. Analysis of
the consensus, such as
searches for protein coding regions,
can also help to find errors in the sequence. The display of the
overlapping gel readings shown above can be used to indicate, not only
the
poorly determined regions, but also which clones should be resequenced
to
resolve ambiguities, or those which can usefully be extended or
sequenced
in the reverse direction, to cover
difficult regions.
.PARA
The original
individual gel readings for a sequencing project are each stored in
separate files. As the gel readings are entered into the computer
(usually in batches, say 10
from a film), the file names they are given are stored in
a further file, called a file of file names. Files of file names
enable gel readings to be processed in batches.
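.para
A file of file names can be handled with very little code. The sketch
below assumes the simplest possible layout, one gel reading file name per
line; that layout and the function names are assumptions for illustration
and are not taken from the program.
.lit
```python
from pathlib import Path


def read_file_of_file_names(path):
    """Return the gel reading file names listed in `path`, one per line."""
    names = []
    for line in Path(path).read_text().splitlines():
        name = line.strip()
        if name:  # skip blank lines
            names.append(name)
    return names


def write_file_of_file_names(path, names):
    """Write one file name per line."""
    Path(path).write_text("".join(f"{name}\n" for name in names))
```
.end lit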
.para
For each sequencing project
we start a project database. This database has a structure specifically
designed for
dealing with shotgun sequence data.
In order to arrive at the final consensus sequence many operations will
be
performed on the sequence data. Individual fragments must be
sequenced and
compared in both senses (i.e. both orientations) with all the other
sequences. When an overlap between a new gel reading and a contig is
found
they must be aligned and the new gel reading added to the contig. If a
new
gel reading overlaps two contigs they must be aligned and joined. Before
the two contigs are joined one of them may need to be turned around
(reversed and complemented) so they are both in the same orientation.
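.para
Turning a sequence around is a small operation, sketched here in Python.
The sketch complements only A, C, G and T; passing padding characters and
uncertainty codes through unchanged is a simplifying assumption, since a
full version would need a complement for each uncertainty code.
.lit
```python
# Complement table for the four bases. Any other symbol (pads '*',
# uncertainty codes, '-') is passed through unchanged in this sketch.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(sequence):
    """Complement each base, then reverse the order."""
    return sequence.translate(COMPLEMENT)[::-1]
```
.end lit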
.para
Clearly, keeping track of all these manipulations is quite complicated,
and to be able to perform the operations
quickly requires careful choice of data
structure and algorithms. For these reasons it is not practicable to store
the gel readings aligned as shown in the display above. Rather, it is more
convenient to store the sequences unassembled, and to record sufficient
information for programs to assemble them during processing. The
data used to assemble the sequences is called relational information.
.left margin2
.PARA
The database comprises five files and they are described under the
section entitled "open database".
.PARA
Before entry into the project database
each new gel reading must be compared to look for overlaps
with all the data already contained
within the database. This last point is
important: all searching for overlaps is between individual new gel
readings and the data already in the database. There is no searching for
overlaps between sequences within the database; overlaps must be found
before new gel readings are entered into the database.
.para
Below I give an introduction to how the sequences are processed by
being
passed from one function to the next.
.para
This program is used to start a
database for the project and
then the following procedure is used.
.para
Data in the form of individual gel readings are entered into the computer
and stored in separate files (possibly using the digitizer
program GIP). Batches
of these gel readings
are passed to the screening functions in this program to search for overlaps
with vector sequences (see VEP and "screen against vector") or for matches to
restriction enzyme sites that should not be
present ("screen against enzymes").
Each run of these screening functions passes on only those gel
readings that do not contain unwanted sequences. Sequences are passed
via
files of file names and eventually are processed by the automatic
assembly function ("auto assemble"). This function compares each gel
reading with a consensus of all the previous gel readings
stored in the database.
If it finds any
overlaps
it aligns the overlapping sequences by inserting padding characters,
and then adds the new gel reading to the database.
Gels that overlap are added to existing contigs and gels that do not
overlap any data in the database start
new contigs. If a new gel overlaps two contigs they are joined.
Any gel readings that appear to overlap but which
cannot be aligned sufficiently well are not entered and have
their names written to a file of failed gel reading names.
.PARA
Generally data is entered
into the database in batches as just described. The program
is also used to examine
the data in the database, to enter gel readings that the automatic
assembly function cannot align ("auto assemble"),
and to make final edits. Edits to whole contigs
can be made using a
mouse-driven editor ("edit contig").
.PARA
Editing the sequences is obviously an essential part of managing a
sequencing project.
Editing is required when new
sequences are added, when contigs are joined, and when sequences are
corrected.
A basic part of the strategy
used here is that new
gel readings should be correctly aligned throughout their whole length
when
they are entered into the database, and that when contigs are joined they
are edited so that they are well aligned in the region of overlap.
Alignment can be achieved by
adding padding characters to the sequences, and this is the way "auto
assemble"
operates when adding new sequences to the database.
.para
In order to search
for overlaps that may have been missed or may be hidden in the "unused data"
the function "find internal joins" can be used.
.para
Generally the users need not concern themselves with how the relational
information is used by the program, but it is necessary to know
how contigs are identified. Because contigs are constantly being changed and
reordered the program identifies them by the numbers of the gel readings
they contain. Whenever users need to identify a contig they need only
know
the number or name of one of the gel readings it contains. Whenever the
program asks users to identify a contig or gel reading they can type its
number or its archive name. If they type its archive name they must precede
the name by a slash "/" symbol to denote that it is a name rather than a
number. E.g. if the archive
name is fred.gel with number 99, users should
type /fred.gel or 99 when asked to identify the contig. Generally,
when it asks for the gel reading to be identified,
the program will offer the user a default name,
and if the user types only return, that
contig will be accessed. When a database is opened the default contig will
be the longest one, but if another is accessed, it will subsequently become
the current default.
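.para
The rule for identifying a contig or gel reading (a bare number, or an
archive name preceded by a slash) can be sketched as follows. The
function name and the returned pairs are hypothetical; only the "/"
convention comes from the description above.
.lit
```python
def parse_identifier(text):
    """Classify a typed contig or gel reading identifier.

    Returns ("name", archive_name) for input starting with "/",
    otherwise ("number", gel_number).
    """
    text = text.strip()
    if text.startswith("/"):
        name = text[1:]
        if not name:
            raise ValueError("archive name missing after '/'")
        return ("name", name)
    return ("number", int(text))  # raises ValueError if not a number
```
.end lit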
.para
Further information is located in the following places.
The database files are described under "open database". The format
for
vector and consensus sequences is given under "calculate a consensus", as are
the
uncertainty codes used in gel readings.
.left margin2
.para
The digitizer program
is used for the initial input of gel readings
and for writing a file of file names. The program
uses a digitizer for data entry.
A digitizer is
a two dimensional surface, such as a light box,
on which the coordinates of a special pen are recorded by a computer
whenever the pen is pressed onto it.
These coordinates
can be interpreted by a program.
.para
In order to read an autoradiograph placed on the light box
the user need only define the bottom of
the four sequencing lanes and the bases
to which they correspond and then use the pen to point to each
successive band progressing up the gel. The program examines
the
coordinates of each pen position to see in which of the four
lanes
it lies and assigns the corresponding base to be stored in the
computer. Each time the pen tip is depressed to point to a position
on the surface of the digitizer the program sounds the bell on the
terminal to indicate to the user that a point has been recorded. As
the sequence is read the program displays it on the screen.
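.para
The lane test described above can be sketched as below. The way the lanes
are represented (a left edge, a right edge and a base for each lane) and
the function names are assumptions for illustration; the digitizer
program itself may define the lanes differently.
.lit
```python
def assign_base(x, lanes):
    """Return the base whose lane contains pen x-coordinate `x`.

    lanes: list of (left_edge, right_edge, base) tuples, one per lane,
           as defined by the user at the bottom of the gel
           (hypothetical representation).
    Returns None if the pen was outside every lane.
    """
    for left, right, base in lanes:
        if left <= x < right:
            return base
    return None


def read_gel(pen_positions, lanes):
    """Turn successive (x, y) pen positions into a sequence string."""
    bases = []
    for x, _y in pen_positions:
        base = assign_base(x, lanes)
        if base is not None:
            bases.append(base)
    return "".join(bases)
```
.end lit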
.left margin1
@17. TX 1 @Screen against enzymes
.left margin2
.PARA
Used to compare gel readings against any restriction enzyme recognition
sequences that may have been used during cloning and which should not
be present in the data. Works on single gel readings or processes batches
accessed through files of file names. The algorithm looks for exact
matches to recognition sequences stored in a file.
.para
The file containing the recognition sequences must be identified. The
user
must choose between employing a file of file names, or typing in the
names of individual gel reading files. If a file of file names is used the
program will also create a new file of file names. When the option has
finished operating this new file will contain the names of all those gel
readings that did not match any of the recognition sequences. Hence it
can
be used for further processing of the batch. The recognition sequences
should be stored in a simple text file with one recognition sequence per
line.
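.para
The screening step amounts to an exact substring search, sketched below.
The function names are hypothetical, and the sketch searches only the
strand as given; whether the program also searches the complementary
strand is not stated here.
.lit
```python
def load_recognition_sequences(lines):
    """One recognition sequence per line; blank lines are ignored."""
    return [line.strip().upper() for line in lines if line.strip()]


def matching_sites(reading, sites):
    """Return the recognition sequences found exactly in `reading`."""
    reading = reading.upper()
    return [site for site in sites if site in reading]


def screen_batch(readings, sites):
    """readings: dict of file name -> sequence.

    Returns the names that matched no site, i.e. the contents of the
    new file of file names.
    """
    return [name for name, seq in readings.items()
            if not matching_sites(seq, sites)]
```
.end lit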
.left margin1
@18. TX 1 @Screen against vector
.left margin2
.PARA
Used to compare gel readings against any vector sequences that may have
been picked up during cloning and which have not been removed by VEP.
It works on single gel readings or processes
batches accessed through files of file names. The algorithm looks for
exact
matches of length "minimum match length" and displays the overlapping
sequences.
.para
The file containing the vector sequence must be identified. The user must
choose between employing a file of file names, or typing in the names of
individual gel reading files. If a file of file names is used the program
will
also create a new file of file names. When the option has finished
operating this new file will contain the names of all those gel readings
that did not match the vector sequence. Hence it can be used for further
processing of the batch. The vector sequence should be stored in a simple
text file with up to 80 characters of data per line. More than one vector
can be stored in a single file. If so each should be preceded by a 20
character title of the form <---m13mp8.0001----> where the < and >
signs
and the number like .0001 are obligatory. The number must be preceded
by a dot (.) and be 4 digits long. The total sequence in the file must
not exceed 500,000 characters in length.
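.para
One simple way to look for exact matches of at least "minimum match
length" is to compare all substrings of exactly that length, as sketched
below. This is a sketch of the idea and not necessarily the method the
program uses; the function names are hypothetical and only the strand as
given is searched.
.lit
```python
def kmers(sequence, k):
    """Set of all substrings of length k."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def matches_vector(reading, vector, min_match):
    """True if reading and vector share an exact match of >= min_match.

    Any exact match of at least min_match characters contains a shared
    substring of exactly min_match characters, so comparing those is
    enough.
    """
    if min_match <= 0:
        raise ValueError("min_match must be positive")
    vector_words = kmers(vector.upper(), min_match)
    reading = reading.upper()
    return any(reading[i:i + min_match] in vector_words
               for i in range(len(reading) - min_match + 1))
```
.end lit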
.left margin1
@20. TX 3 @Auto assemble
.left margin2
.PARA
Compares gel readings against the current contents of the database and
produces alignments. In its normal mode of operation
("entry permitted"), the function
will automatically enter the gel readings into the database.
.para
New assembly suboption.
However
if entry is not permitted the reads won't be entered but the program
will produce alignments and (optionally) save each reading name and its best
alignment score (percentage mismatch) in a file. When used in
this mode, the program will include in the alignment the poor quality data
for each reading. These files of names can then be sorted into score order
and then used for assembly, hence forcing the readings that align best to
be entered into the database first.
End of new suboption.
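.para
Sorting the saved names into score order could be done as sketched below.
The layout of the saved file is not described here, so the sketch assumes
one "name score" pair per line separated by white space; that layout and
the function name are assumptions.
.lit
```python
def sort_names_by_score(lines):
    """Return reading names ordered from lowest to highest mismatch.

    lines: text lines of the assumed form "<name> <percentage mismatch>".
    """
    scored = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        name, score = parts
        scored.append((float(score), name))
    scored.sort()  # lowest mismatch first, ties broken by name
    return [name for _score, name in scored]
```
.end lit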
.para
The routine works on
single gel readings or processes batches of gel readings accessed through
files of file names. It is the only way to enter data into the database.
.para
The function will check the database for logical consistency and will
only
proceed if it is OK. Choose if gel readings should be entered into the
database, or if they should only be compared. Choose between using a file
of file names or typing file names on the keyboard. If so selected, supply
the file of file names. Also supply a file of file names to contain the names of
all the gel readings that fail to get entered.
Select the entry mode. Normal assembly is appropriate for all but special
cases, as is "permit joins". Uses for the other modes are not documented
here.
Define a minimum initial
match length.
Define the maximum number
of padding characters allowed to be used in each gel reading to help
achieve alignment, and the same for the number allowed in the contig for
each gel reading. Finally define the maximum percentage mismatch to
be allowed for any gel reading to be entered into the database. If,
for any gel reading, any of these last three values is exceeded the gel
reading will not be entered into the database.
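.para
The entry test on these three limits can be sketched as below. The
default values are the ones offered in the sample dialogue; the class and
function names are hypothetical.
.lit
```python
from dataclasses import dataclass


@dataclass
class AssemblyLimits:
    # Defaults copied from the prompts in the sample dialogue.
    max_pads_per_gel: int = 8
    max_pads_per_gel_in_contig: int = 8
    max_percent_mismatch: float = 8.0


def may_enter(pads_in_gel, pads_in_contig, percent_mismatch,
              limits=AssemblyLimits()):
    """True only if none of the three limits is exceeded."""
    return (pads_in_gel <= limits.max_pads_per_gel
            and pads_in_contig <= limits.max_pads_per_gel_in_contig
            and percent_mismatch <= limits.max_percent_mismatch)
```
.end lit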
.para
In operation the function takes a batch of gel readings (probably passed
on as a file of file names from one of the screening routines) and
enters them into a
database for a sequencing project. It takes each gel reading
in turn,
compares it with the current consensus for the database, and then
produces an alignment for any regions of the consensus it
overlaps; if this alignment is sufficiently good it then edits
both the new gel reading and the sequences it overlaps and adds
the
new gel reading to the database. The program then updates the
consensus
accordingly and carries on to the next gel reading.
.para
All alignments are displayed and any gel readings
that do match but that
cannot be aligned sufficiently well have their names written to a
file of failed gel reading names. The function works without any
user intervention and can process any number of gel readings in a
single run. Those gel readings that fail can be recompared using
the same function (to find the current overlap position) and the
user can enter them into the database
using the "put all readings in new contigs"
assembly option and then join them using "join contigs".
.para
Typical dialogue and output from the function is shown below. (Note that
output for gel readings 2 - 9 has been deleted to save space).
.lit
Automatic sequence assembler
Database is logically consistent
? (y/n) (y) Permit entry
? (y/n) (y) Use file of file names
? File of gel reading names=demo.nam
? File for names of failures=demo.fail
Select entry mode
X 1 Perform normal shotgun assembly
2 Put all sequences in one contig
3 Put all sequences in new contigs
? Selection (1-3) (1) =
? (y/n) (y) Permit joins
? Minimum initial match (12-4097) (15) =
? Maximum pads per gel (0-25) (8) =
? Maximum pads per gel in contig (0-25) (8) =
? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing 1 in batch
Gel reading name=HINW.004
Gel reading length= 283
Searching for overlaps
Strand 1
Strand 2
No matches found
Total matches found 1
Padding in contig= 0 and in gel= 1
Percentage mismatch after alignment = 1.8
Best alignment found
1          11         21         31         41         51
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
********** ********** ********** ********** ********** **********
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
1          11         21         31         41         51
61         71         81         91         101        111
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
********** ********** ********** ********** ********** ***** ****
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
61         71         81         91         101        111
121        131        141        151        161        171
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
********** ********** ********** ********** ********** **********
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
121        131        141        151        161        171
181        191        201        211        221        231
GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
********** ********** * ******** ********** ********** **********
GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
181        191        201        211        221        231
241        251        261        271        281
CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
****  **** ********** ** * ***** ********** ****
CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
241        251        261        271        281
Batch finished
9 sequences processed
0 sequences entered into database
0 joins made
.end lit
.para
Note that "auto assemble" cannot align protein sequences.
.left margin1
@28. TX 1 @Highlight disagreements
.left margin2
.para
Used in the later stages of a project
to highlight disagreements between individual gel readings
and their consensus sequences. This display is also available in the
contig editor.
Characters that agree with the
consensus are shown as : symbols for the plus strand and . for the minus
strand. Characters that disagree with the consensus are left unchanged
and so stand out clearly. The results of this analysis are written to a
file.
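.para
The substitution itself can be sketched as follows. The function name is
hypothetical, and treating a padding character that matches the consensus
like any other agreeing character is an assumption of the sketch.
.lit
```python
def highlight(reading, consensus, minus_strand=False,
              plus_symbol=":", minus_symbol="."):
    """Replace characters that agree with the consensus.

    reading and consensus are aligned strings of equal length; blanks in
    the reading (no data at that position) are kept as blanks.
    """
    agree = minus_symbol if minus_strand else plus_symbol
    out = []
    for r, c in zip(reading, consensus):
        if r == " ":
            out.append(" ")
        elif r == c:
            out.append(agree)
        else:
            out.append(r)
    return "".join(out)
```
.end lit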
.para
Before selecting this option create a file of the display of the contig to
be
"highlighted". The option will ask for the name of this file. Select
symbols
to denote "agreeing" characters on each strand, the defaults are : and .,
but any others can be used. Supply the name of a file in which to put
the output.
.para
The display file needed as input for this option is created by selecting
"Redirect output", followed immediately by "display contig", and then
"Redirect output" again. The
cutoff score used in the consensus calculation can be set by option "set
display parameters". Note that for the highlight function
there is a limit of 50 for the number of gel
readings that are aligned at any position, i.e. the contig must be less
than 51 gel readings deep at its thickest point. I hope that those performing
shotgun sequencing never reach this limit, but those using the program for
comparing sequence families might.
.para
Typical output from this function is shown below.
.lit
210 220 230 240 250
1 HINW.004 :C::::::::::::::::::::::::::::::::::::::::::AC::::
7 HINW.018 :*::::::::::::::::::::::::::::::::::::::::::CA::::
-4 HINW.017 ...............AC....
G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG
260 270 280 290 300
1 HINW.004 ::::::::::::*:D:::::::::::::::::::
7 HINW.018 ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
-4 HINW.017 ..............................................A...
3 HINW.009 :::::::::::::::V::::::::::::::::::::::::::::*AV:::
-6 HINW.028 ......................A...
AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG
.end lit
.left margin1
@32. TX 3 @Extract gel readings
.left margin2
.para
Used to make copies of the aligned gel readings in a database,
to write them into separate files, and to write a
corresponding file of file names. It operates in two modes: either all gel
readings are extracted, or only those at the ends of contigs.
.para
Choose which mode of operation is required and supply a file of file
names.
.para
The gel readings are given their original
names.
.para
If the option is used to extract all the gel readings from a database, a
subsequent run of "auto assemble" can reconstitute a database which has
been corrupted. This rarely occurs and is usually necessitated by a
user employing "alter relationships" incorrectly without first having
made a copy.
.left margin1
@1. TX 0 @Help
.left margin2
.PARA
Help is available on the following topics :
.LEFT MARGIN1
@2. TX 0 @Quit
.LEFT MARGIN2
.PARA
This command stops the program and is the only safe way to terminate a
run
of the program that has altered the contents of the database in any way.
.left margin1
@3. TX 1 @Open a database
.LEFT MARGIN2
.PARA
Opens existing databases or allows new ones to be started. The function
is
automatically called into operation
when the program is started but can also be selected
from the general menu.
.para
Choose to open an existing database or start a new one, or if ! is typed
when the program is first started, enter the program without opening a
database. Supply a project
database name, and if it already exists, the "version". If starting a new
database define the database size and if it is for DNA or protein sequences.
The database size is an initial size for the database. It can be increased
later during the project. It is the sum of the number of gel
readings plus the number of contigs. The current maximum size is 8000.
.para
Database names can have from one to 12 letters and must not include full
stop (.). The database is made from five separate files. If the database
is
called FRED then version 0 of database FRED comprises files FRED.AR0,
FRED.RL0, FRED.SQ0, FRED.TG0 and FRED.CC0. The version is the last symbol in the file names.
Only this program
can read these files. If the "copy database" option is used it
will ask the user to define a new "version".
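.para
The naming scheme can be sketched as below. The function name is
hypothetical, and the sketch checks only the length of the database name
and the absence of a full stop.
.lit
```python
SUFFIXES = ("AR", "RL", "SQ", "TG", "CC")


def database_files(name, version):
    """Return the five file names for one version of a database."""
    if not 1 <= len(name) <= 12 or "." in name:
        raise ValueError("name must be 1 to 12 characters with no full stop")
    version = str(version)
    if len(version) != 1:
        raise ValueError("version must be a single symbol")
    return [f"{name}.{suffix}{version}" for suffix in SUFFIXES]
```
.end lit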
.para
For normal use the maximum gel reading length is set to 512 characters,
but when a database is started the user may choose any multiple of 512
up to 4096
(512, 1024, 1536, ..., 4096). Normally the program is used to handle DNA
sequences but many of the functions also work on protein sequences. The
choice of sequence type is made when the database is started.
.para
The contigs are not stored on the disk as the user sees them displayed on
the screen. Each gel reading is stored with sufficient information about
how it overlaps other gel readings so that the program can work out how
to
present them aligned on the screen. We refer to this extra data as "the
relationships" and it is explained below.
The database comprises 5 separate files.
.left margin2
1. a working version of each gel reading. This is the version of
the gel reading
that is in the database and initially it is an exact copy of
the original sequence (known as the archive)
but it is edited and manipulated to align it
with other gel readings.
.left margin2
2. the file of relationships. This file contains all of the
information that is required to assemble the working versions
into
contigs during processing; any manipulations on the data use this
file and it is automatically updated at any time that the
relationships are changed. The information in this file is as
follows:
.left margin2
(A) Facts about each gel reading and its relationship to
others
("gel
descriptor lines"):
.left margin2
(a) the number of the gel
reading (each gel reading is given a number as it is
entered into the database)
.left margin2
(b) the length of the sequence from this gel reading
.left margin2
(c) the position of the left end of this gel
reading relative to the left
end of the contig of which it is a member
.left margin2
(d) the number of the next gel
reading to the left of this gel reading
.left margin2
(e) the number of the next gel reading to the right
.left margin2
(f) the relative strandedness of this gel
reading, i.e. whether it is in
the same sense or the complementary sense as its archive.
.left margin2
(B) Facts about each contig ("contig descriptor lines"):
.left margin2
(a) the length of this contig
.left margin2
(b) the number of the leftmost gel
reading of this contig
.left margin2
(c) the number of the rightmost gel reading of this contig.
.left margin2
(C) General facts:
.left margin2
(a) the number of gel readings in the database
.left margin2
(b) the number of contigs in the database.
.left margin2
3. the file of archive names. This is simply a list of the names
of each of the archive files in the database.
.left margin2
4. the file of tags (annotation).
This consists of linked lists of tag information for each sequence in the
database.
Tags are created by the user as annotation, or by xdap as records of edits or
for storing cutoff information.
As the number of tags can grow without limit, so can this file.
For each gel there is a header record, which contains the record number of
the start of the linked list for that gel. On line IDBSIZ there is a record
containing information about the file such as its present length and if there
are any free "tag" slots to be reused in the file.
.left margin2
5. the file of comments (annotation).
This consists of linked lists of comment fragments.
Comments are created by the user as a message attached to annotation,
or by the system to store cutoff information.
Comments are character strings of any length.
Comments longer than 40 characters are broken up into fragments, each 40
characters long, and are chained together in a link list.
As the number of comments can grow without limit, so can this file.
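.para
The gel and contig descriptor lines of the file of relationships form a
doubly linked list, and walking a contig from left to right can be
sketched as below. The field names, and the use of 0 to mean "no
neighbour", are assumptions made for the example and are not taken from
the program.
.lit
```python
from dataclasses import dataclass


@dataclass
class GelDescriptor:
    number: int
    length: int
    position: int        # left end, relative to the contig's left end
    left: int            # gel to the left (0 = none, an assumption)
    right: int           # gel to the right (0 = none, an assumption)
    complemented: bool   # True if reversed relative to its archive


@dataclass
class ContigDescriptor:
    length: int
    leftmost: int
    rightmost: int


def gels_in_contig(contig, gels):
    """Yield the contig's gel descriptors from left to right.

    gels: dict mapping gel number -> GelDescriptor.
    """
    number = contig.leftmost
    while number != 0:
        gel = gels[number]
        yield gel
        number = gel.right
```
.end lit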
.para
Structure of the database files
.para
1. The file of relationships
.para
The file contains IDBSIZ lines of data:
the general data are stored on line IDBSIZ; data about gel
readings are
stored from line 1 downwards; data about contigs are stored from
line IDBSIZ-1 upwards. A database of 500 lines containing 25 gel
readings and 4 contigs would have a file
of relationships as is shown below.
.lit
---------------------------------------------
0 Info about the database size
1 Gel descriptor record
2 " " "
3 " " "
4 " " "
5 " " "
' ' ' '
' ' ' '
25 " " "
26 Empty record
' ' '
' ' '
495 ' '
496 Contig descriptor record
497 " " "
498 " " "
499 " " "
500 Number of gel readings=25, Number of contigs=4
---------------------------------------------
The arrangement of the data in the file of relationships
.end lit
As each new gel reading is added into the database a new line is added
to the end of the list of gel descriptor
lines. If this new gel reading does not
overlap with any gel readings
already in the database a new contig line is
added to the top of the list of contig lines. If it overlaps with
one contig then no new contig line need be added but if it overlaps
with two contigs then these two contigs must be joined and the
number of contig lines will be reduced by one. Then the list of
contig
lines is compressed to leave the empty line at the top of the list.
Initially the two types of line will move towards one another but
eventually, as contigs are joined, the contig descriptor lines will
move in the same direction as the gel descriptor
lines. At the end of a
project there should be only one contig line. The database is thus
capable of handling a project of 998 gels.
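.para
The layout described above can be sketched as follows. This is only an
illustration in Python, not the program's own code: the function name
"layout" and its arguments are hypothetical, and the sketch assumes that
gel lines fill upwards from line 1 and contig lines sit immediately
before line IDBSIZ, as in the diagram.
.lit
```python
def layout(idbsiz, ngels, ncontigs):
    """Return (gel lines, contig lines, number of empty lines).

    Line idbsiz itself holds the general data, so it is never counted
    as a gel line, a contig line or an empty line.
    """
    empty = idbsiz - 1 - ngels - ncontigs
    if empty < 0:
        raise ValueError("database is full")
    gel_lines = range(1, ngels + 1)                  # lines 1..ngels
    contig_lines = range(idbsiz - ncontigs, idbsiz)  # lines just below idbsiz
    return gel_lines, contig_lines, empty


gels, contigs, empty = layout(500, 25, 4)
print(list(contigs), empty)
```
.end lit
For the 500 line example with 25 gel readings and 4 contigs, the contig
lines come out as 496 to 499, matching the diagram.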
.para
2. Structure of the working versions file
.para
The working versions of gel readings are stored in a file of
NGELS lines each containing MAXGEL characters. Gel reading
number 1 is stored on line
1, gel reading number 2 on line 2 and so on. NGELS is the
current number of readings and MAXGEL the maximum reading length.
.para
3. Structure of the archive names file
.para
This file has NGELS lines of 16 characters.
.para
4. Structure of the tag file
.para
This file initially starts with IDBSIZ lines, and is expanded as new tags are
created.
Information about the length of the file, and which tag records are reusable
is stored on line IDBSIZ.
A database of 500 lines would have a file of tags as shown below.
.lit
---------------------------------------------
1 Tag descriptor record
2 " " "
3 " " "
4 " " "
5 " " "
' ' ' '
' ' ' '
497 " " "
498 " " "
499 " " "
500 Length of file=N, Free list=0
501 Tag record
502 " "
503 " "
' ' '
' ' '
N-2 " "
N-1 " "
N Tag record
---------------------------------------------
The arrangement of the data in the tag file
.end lit
As each new tag is added to the database, a check is made in the
file descriptor record at line IDBSIZ. If the list of reusable records is 0,
the file is extended by one line. Otherwise the new tag is assigned to the
record at the head of the free list.
When tags are deleted, they are added to the free list in the file descriptor
record.
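.para
The allocation rule for tag records can be sketched as follows. This is a
minimal Python illustration, not the program's own code: the class
"TagFile", its list-based storage, and the way a free record stores the
number of the next free record are all assumptions made for the example.
.lit
```python
class TagFile:
    """Toy model of the tag file; records are numbered from 1."""

    def __init__(self, idbsiz):
        # Index 0 is unused so that list indices match line numbers.
        # Lines 1..idbsiz-1 are per-gel headers, line idbsiz the descriptor.
        self.records = [None] * (idbsiz + 1)
        self.free_head = 0  # 0 means "no reusable records"

    def length(self):
        return len(self.records) - 1

    def allocate(self, tag):
        if self.free_head == 0:
            # Free list empty: extend the file by one line.
            self.records.append(tag)
            return self.length()
        # Otherwise reuse the record at the head of the free list.
        rec = self.free_head
        self.free_head = self.records[rec]  # assumed link to next free record
        self.records[rec] = tag
        return rec

    def delete(self, rec):
        # Put the discarded record at the head of the free list.
        self.records[rec] = self.free_head
        self.free_head = rec
```
.end lit
With a database of 500 lines the first two tags would occupy records 501
and 502; deleting the first and creating another tag would reuse 501.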
.para
5. Structure of the comment file
.para
This file initially starts with 1 line, and is expanded as new annotation is
created.
Information about the length of the file, and which comment records are reusable
is stored on the first line.
.lit
---------------------------------------------
1 Length of file=N, Free list=0
2 Comment fragment
3 " "
4 " "
' ' '
' ' '
N-2 " "
N-1 " "
N Comment fragment
---------------------------------------------
The arrangement of the data in the comment file
.end lit
As each new comment is added to the database, a check is made in the file
descriptor record at line 1. If the list of reusable records is 0,
the file is extended to hold the new comment. Otherwise the new comment is
assigned to records starting with the head of the free list.
When comments are deleted, the discarded records are added to the free list in
the file descriptor record.
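.para
The breaking of a comment into 40 character fragments can be sketched as
follows. This is an illustration only: the function names are
hypothetical, the record numbers are assumed to be consecutive, and a
link value of 0 is assumed to mark the end of a chain. Whether the
program pads the final fragment to 40 characters is not stated here, so
the sketch leaves it short.
.lit
```python
FRAGMENT_LEN = 40  # fragment size given in the text


def split_comment(comment):
    """Return the comment as a list of fragments of at most 40 characters."""
    return [comment[i:i + FRAGMENT_LEN]
            for i in range(0, len(comment), FRAGMENT_LEN)]


def chain_fragments(fragments, first_record):
    """Link the fragments into a list: {record: (text, next_record)}."""
    chain = {}
    for k, text in enumerate(fragments):
        rec = first_record + k
        nxt = rec + 1 if k < len(fragments) - 1 else 0  # 0 ends the chain
        chain[rec] = (text, nxt)
    return chain
```
.end lit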
.para
There are various checks within the programs to
protect users from themselves:-
.left margin2
1. All user input is checked for errors - e.g. reference to
non-existent gel
readings or contigs, incorrect positions in the
contig or gel readings.
.left margin2
2. Before entering a gel reading the system checks to see if a
file of the same name has already been entered.
.left margin2
3. Join will not allow the circularising of a contig.
.left margin2
5. Users may escape from any point in the program.
.left margin2
6. Help is available from all points in the program.
.SK2
.LEFT MARGIN2
IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS
DOING
ANYTHING THAT INVOLVES CHANGING THE CONTENTS OF THE
DATABASE, I.E. DURING AUTO ASSEMBLE,
COMPLETE JOIN, COMPLEMENT CONTIG, SAVE EDIT CONTIG.
This could
corrupt the database so badly that it is impossible to fix. The program
should always be left using the QUIT option.
.left margin1
@4. TX 3 @Edit contig
.LEFT MARGIN2
.PARA
The Contig Editor is a mouse-driven editor that can insert,
delete and change gel reading sequences.
.para
The Contig Editor allows scrolling from one end of a contig to the other
using the scroll bar and scroll buttons. Action of mouse button presses
when the mouse pointer is in the scroll bar:
.sk1
.lit
Middle Mouse Button Set editor position
Left Mouse Button Scroll forward one screenful
Right Mouse Button Scroll backwards one screenful
.end lit
.sk1
The four scroll buttons operate as follows:
.sk1
.lit
"<<" Scroll left half a screenful
"<" Scroll left one character
">" Scroll right one character
">>" Scroll right half a screenful
.end lit
.para
The Editor cursor can be positioned anywhere in the edit window by
moving the mouse pointer over the character of interest, then pressing the
left mouse button. The Editor cursor can also be moved by using the
direction arrow keys.
.para
The editor operates in two main edit modes - Replace and Insert. Replace allows
a character to be replaced by another. Insert allows characters to be
inserted into a gel reading sequence. Characters are entered by typing
them from the keyboard. Only valid characters are permitted.
Characters can be deleted by positioning the cursor one character to the right,
then pressing the delete key.
Normally Insert and Delete apply to the consensus line of the contig ONLY.
This restraint can be overridden by using the "Super Edit" mode of
operation, THOUGH IT IS NOT RECOMMENDED.
.para
Edits can also be performed on the consensus, though they are
restricted to insertion and deletion of padding characters ("*").
These edits also have special meanings.
A deletion will delete ALL characters at the position to the left
of the cursor in the contig, and move the relative positions of all
sequences starting to the right of the cursor position left one
character.
An insertion will insert the character typed ("*") into ALL gel
reading sequences at the cursor's position in the contig, and move the
relative positions of all sequences starting to the right of the cursor
position right one character.
.para
The effect of the last edit can be undone by pressing the "Undo" button
at the top of the editor window.
.para
The cursor will automatically be positioned at the next problem when the
"Find Next Problem" button is selected. The next problem is where the
consensus shows either an ambiguity ("-") or a pad ("*") character.
.para
The edits to the contig can be saved by pressing the "Leave Editor"
button and replying "Yes" to the prompt to "Save changes?". As no changes
are made to the working copy of your database until this point, it
is possible to abort the editor if
the edit session ends up in an unsatisfactory state (ie if you've
stuffed it up!)
.left margin1
.sk3
Displaying Traces
.left margin2
.para
The original data from which the gel reading sequences were derived can
be seen by double clicking (two quick clicks) with the middle mouse button
on the area of interest. The trace will be displayed with the point
clicked at the centre of the trace viewport.
.para
All traces that are displayed are maintained in one window, called the Trace
Manager. The Trace Manager will only display four traces maximum. When four
traces are already being managed and a new one is requested, the one at the top
of the Trace Manager is removed and the new one is added to the bottom.
Traces can be removed individually by using the "quit" button in the panel next
to the trace.
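.para
The behaviour of the Trace Manager can be pictured as a queue holding at
most four traces. The sketch below is an illustration in Python only; the
class and method names are hypothetical and are not taken from the
program.
.lit
```python
from collections import deque

MAX_TRACES = 4  # the Trace Manager displays four traces at most


class TraceManager:
    def __init__(self):
        self.traces = deque()  # index 0 is the top of the window

    def show(self, name):
        if len(self.traces) == MAX_TRACES:
            self.traces.popleft()   # remove the trace at the top
        self.traces.append(name)    # the new trace goes at the bottom

    def quit(self, name):
        self.traces.remove(name)    # remove one trace individually
```
.end lit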
.left margin1
.sk3
Extending Reads Using Cutoff Information
.left margin2
.para
Sequence data read from the trace files of automated fluorescent sequencing
machines and processed through the program ted
will have the discarded sequence (vector at start and poor read at
end) available to the contig editor. To display the cutoff
information, press the "Display Cutoff" button at the top of the
editor window.
The cutoff sequence appears in grey. This sequence can be incorporated
into the editable sequence, by moving the cutoff position. This is
done by positioning the cursor at the end of the gel sequence, and
using Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of cutoff.
The Meta key is a diamond on the Sun keyboard.
.left margin1
.sk3
Pop-up menu
.left margin2
.para
A pop-up menu is revealed by depressing the "Control" key on the keyboard
and at the same time pressing the left mouse button. The menu has the following
functions:
.lit
Search
Highlight Disagreements
Save Contig
Create Tag
Edit Tag
Delete Tag
Select Oligo
.end lit
.left margin2
"Highlight Disagreements" simply toggles between the normal display showing
the current base assignments and one in which only those assignments that
differ from the consensus are shown.
.left margin2
"Save Contig" is described above.
Searching and operations on tags are described below.
.left margin2
.sk3
Searching
.left margin2
.para
Selecting "Search" brings up a
window which can remain present during normal editor operation. The
window allows the user to select the direction of search, the type of
search and a value to search on. The value is entered into the value
text window. Then pressing the "search" button
performs the search. If successful, the cursor is positioned and
centred accordingly. An audible tone indicates failure. Pressing the
"ok" button removes the search window. The search window is
automatically removed when the contig editor is exited.
.sk1
There are seven different search modes:
.sk1
1. Search by position
.sk1
This positions the cursor at the numeric position specified in the
value text window. Eg a value of "1234" causes the cursor to be placed
at base number 1234 in the contig. Positioning within a gel reading is
achieved by prefixing the number with the "@" character, eg "@123"
positions the cursor at base 123 of the sequence in which the cursor
lies. Relative positions can be specified by prefixing the number with
a plus or minus character. Eg "+1234" will advance the cursor 1234
bases. If possible, the cursor is positioned within the same sequence.
The direction buttons have no effect on the operation of "search
by position".
.sk1
2. Search by reading name
.sk1
This positions the cursor at the left end of the gel reading specified
in the value text window. If the value is prefixed with a slash it is
assumed to be a gel reading name. Otherwise it is assumed to be a gel
reading number. Eg "123" positions the cursor at the left end of gel
reading number 123. "/a16a12.s1" positions at the start of reading
a16a12.s1. If the value was "/a16" the cursor is positioned at the
first reading which starts with "a16". The direction buttons have no
effect on the operation of "search by reading name".
.sk1
3. Search by tag type.
.sk1
This positions the cursor at the start of the next tag which has the
same type as specified by the type value menu. To change the type,
select off the menu that pops up when the mouse is clicked on the
button labeled "Type:". The search can be performed either forwards
or backwards from the current cursor position. To find all tags, use
"search by annotation", with a null text value string.
.sk1
4. Search by annotation.
.sk1
This positions the cursor at the start of the next tag which has a
comment containing the string specified in the value text window. The
search performed is a regular expression search, and certain
characters have special meaning. Be careful when your value string
contains ".", "*", "[", "^" or "$". The search can be performed either
forwards or backwards from the current cursor position.
.sk1
5. Search by sequence.
.sk1
This positions the cursor at the start of the next piece of sequence
that matches the value specified in the text value window. The search
is for an exact match, which means the case of the value string is
important. The search is performed on the gel readings themselves,
rather than the consensus sequence. The search can be performed either
forwards or backwards from the current cursor position.
.sk1
6. Search by problem.
.sk1
This positions the cursor at the next place in the consensus sequence
which is not an "A", "C", "G" or "T". The search can be performed
either forwards or backwards from the current cursor position.
.sk1
7. Search by quality
.sk1
This positions the cursor at the next place in the consensus sequence
where the consensus calculations for the two strands disagree. When
sequence from only one strand is present, the search will stop at every
base. The search can be performed either forwards or backwards from the
current cursor position.
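.para
Three of the search modes above can be sketched in Python as follows.
This is an illustration under stated assumptions, not the editor's code:
all function names are hypothetical, "reading_start" is assumed to be the
contig position of base 1 of the reading under the cursor, and Python's
"re" module stands in for whatever regular expression library the program
uses. The second function shows how a value containing ".", "*", "[",
"^" or "$" can be matched literally by escaping it first.
.lit
```python
import re


def resolve_position(value, cursor, reading_start):
    """Search by position: turn a value into a contig position."""
    value = value.strip()
    if value.startswith("@"):
        # "@123": base 123 of the reading in which the cursor lies.
        return reading_start + int(value[1:]) - 1
    if value.startswith(("+", "-")):
        # "+1234" or "-1234": relative to the current cursor position.
        return cursor + int(value)
    return int(value)  # "1234": absolute position in the contig


def comment_matches(comment, text, literal=True):
    """Search by annotation: regular expression match on a comment."""
    pattern = re.escape(text) if literal else text
    return re.search(pattern, comment) is not None


def next_problem(consensus, cursor, forwards=True):
    """Search by problem: next position that is not A, C, G or T.

    Positions are 1-based; None is returned if nothing is found.
    """
    if forwards:
        candidates = range(cursor + 1, len(consensus) + 1)
    else:
        candidates = range(cursor - 1, 0, -1)
    for pos in candidates:
        if consensus[pos - 1] not in "ACGT":
            return pos
    return None
```
.end lit
Note that an empty value matches every comment, which is consistent with
using a null text string to find all tags.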
.left margin1
.sk3
Annotation
.left margin2
.para
Parts of a sequence can be annotated, to record the positions of primers used
for walking, or to mark sites, such as compressions that have caused problems
during sequencing.
The consensus sequence CANNOT be annotated.
.para
To annotate a piece of sequence first select the part of sequence
using the mouse buttons. Use the left mouse button to position the start of the
selection, and while this button is being held down, move the mouse to extend.
The selection can be extended further using the right mouse button.
.para
To create annotation, invoke the pop-up menu, and select the "Create Tag"
function. A small "tag editor" will appear which
allows you to select the type of the
annotation from a pull-down menu, and specify a comment if desired.
To select a new type pull down the Type menu, and select the entry desired.
To enter a comment, simply type into the text window in the tag editor.
The annotation is created when the "Leave" button on the tag editor is pressed,
and is displayed in the colour defined in the tag database file (TAGDB).
.para
To edit existing annotation,
position the cursor with the left mouse button
on the tag, and select
"Edit Tag"
from the pop-up menu.
This invokes the tag editor, and changes to the type and comment of the
annotation can be made. The tag is updated when the "Leave" button is pressed.
.para
To delete an existing annotation,
position the cursor with the left mouse button
on the tag, and select
"Delete Tag"
from the pop-up menu.
.left margin1
.sk3
NOTE:
.left margin2
.para
As the Contig Editor is a very powerful tool, it is possible that the alignment
of the gel reading sequences has unexpectedly been disrupted.
This can easily happen to parts of the contig that lie to the right
of the screen if excessive use has been made of the "Super Edit" facility.
Until familiar with "Super Edit" it would benefit the sequencer to quickly
scan through the contig after editing to check that bad alignments have not
been created.
.sp
.left margin2
Selecting Oligos
----------------
.sk1
.left margin2
1. Open the oligo selection window, by selecting "Select Oligo" from
the contig editor popup menu.
.left margin2
2. Position the cursor to where you want the oligo to be chosen. While
the oligo selection window is visible, you will still have complete
control over positioning and editing within the contig editor.
.left margin2
3. Indicate the strand for which you require an oligo. This is done by
toggling the direction arrow ("----->" or "<------"), if necessary.
.left margin2
4. Press the "Find Oligos" button to find all suitable oligos (see
"Oligo selection" below). Information for the closest oligo to the
cursor position is given in the output text window. In the contig
editor the position of the oligo is marked by a temporary tag on the
consensus. The window is recentered if the oligo is off the screen.
Selecting "Display Selection Information" will print a short report on
the numbers of oligos considered and rejected during oligo selection.
.left margin2
5. If this oligo is not suitable (it may have been previously chosen,
and found to be unsuitable by experimentation, say), the next closest
oligo can be viewed by pressing "Select Next".
.left margin2
6. Suitable templates are automatically identified for the currently
displayed oligo (see "Template selection" below). By default, the
template is that closest to the oligo site. If the choice is not
suitable (it may be known to be a poor quality template, say) another
can be chosen from the "Choose Template for this Oligo" menu.
Templates that do not appear on the menu can be specified by selecting
"other". However, the template must be on the correct strand and be
upstream of the oligo.
.left margin2
7. A tag can be created for the current oligo by pressing the button
"Create a tag for this oligo". The annotation for this tag holds the
name of the template and the oligo primer sequence. There are fields
to allow the user to specify their own primer name ("serial#") and
comments ("flags") for this tag. An example of oligo tag annotation:
.lit
serial#=
template=a16a9.s1
sequence=CGTTATGACCTATATTTTGTATG
flags=
.end lit
.left margin2
8. The oligo selection window is closed when "Create a tag for this
oligo" or "Quit" is selected.
.left margin2
Oligo selection:
.left margin2
----------------
.left margin2
The oligo selection engine is the one used in the program OSP. It is
described in some detail in:
.left margin2
Hillier, L., and Green, P. (1991). "OSP: an oligonucleotide
selection program," PCR Methods and Applications, 1:124-128.
.left margin2
The parameters controlling the selection of oligos can be changed in
the "Oligo Selection Parameters" window. The weights controlling the
scoring of selected oligos can be changed in the "Oligo Selection
Weights" window.
.left margin2
By default, the oligos are selected from a window that extends 40
bases either side of the cursor. The size and location of this window
relative to the cursor position can be changed in the "Parameters"
window.
.left margin2
In xbap oligos are ranked according to their proximity to the cursor
position, rather than by their scores.
.left margin2
Template selection:
.left margin2
-------------------
.left margin2
For simplicity, each reading is considered to represent a template. In
practice, many readings can be made of the same template. Suitable
templates that are identified are those that:
.lit
1. are in the appropriate sense,
2. have 5' ends that start upstream of the oligo,
and 3. are sufficiently close to the oligo to be useful.
.end lit
.left margin2
This last criterion relates to the insert size for the subclones used
for sequencing and the average reading length. A template is
considered useful if a full reading can be made from it, taking into
account both of these factors. The default insert size is 1000 bases,
and the default average reading length is 400 bases. These values can
be changed in the "Parameters" window.
.left margin1
@5. TX 1 @Display a contig
.LEFT MARGIN2
.para
Used to show the aligned gel readings for any part of a contig. The
number, name and strandedness of each gel reading is shown and the
consensus is written below.
.para
If required identify the contig, and then the start and end points of the
region to display.
.para
The display can be directed to a disk file using "direct output to disk".
.para
Below is an example showing the left end of a contig from
position 1 to 200. Overlapping this region are gels 6, 3, 5, 17 and 12;
6, 3 and 5
are in reverse orientation to their archives (denoted by a minus sign).
There are a few uncertainty codes and a few padding
characters in the working versions, but the consensus (shown
below
each page width) has a definite assignment for almost every
position.
.lit
10 20 30 40 50
-6 HINW.010 GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
CONSENSUS GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
60 70 80 90 100
-6 HINW.010 CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
-3 HINW.007 GGCACA*GTC
CONSENSUS CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC
110 120 130 140 150
-6 HINW.010 GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
-3 HINW.007 GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
-5 HINW.009 GGCAGCGACCGTCG
17 HINW.999 AGCGACCGTCG
CONSENSUS GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG
160 170 180 190 200
-6 HINW.010 TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
-3 HINW.007 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
-5 HINW.009 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
17 HINW.999 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
12 HINW.017 GTAGAGC*
CONSENSUS TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
.END LIT
.left margin1
@6. TX 1 @List a text file
.LEFT MARGIN2
.PARA
This option allows users to list text files on the screen. It can be used
to read a file containing notes, for checking files written to disk etc. The
user is asked to type the name of the file to list.
.left margin1
@8. TX 1 @Calculate a consensus
.LEFT MARGIN2
.para
Calculates a consensus sequence either for the whole database or
for selected contigs. The consensus is written to a file named by the
user.
.left margin2
Supply a file name, choose between whole database or selected contigs.
.para
Symbols for uncertainty in gel readings
.para
In order to record uncertainties when reading gels the codes shown
below can be used. Use of these codes permits us to extract the
maximum amount of data from each gel and yet record any doubts by
choice of code. The program can deal with all of these codes and any
other characters in a sequence are treated as dash (-) characters.
.lit
SYMBOL MEANING
1 PROBABLY C
2 " T
3 " A
4 " G
D " C POSSIBLY CC
V " T " TT
B " A " AA
H " G " GG
K " C " C-
L " T " T-
M " A " A-
N " G " G-
R A OR G
Y C OR T
5 A OR C
6 G OR T
7 A OR T
8 G OR C
- A OR G OR C OR T
a A
c C
g G
t T
* padding character placed by auto assembler
else = -
.end lit
.LEFT MARGIN2
The DNA consensus algorithm
.para
The "calculate consensus" function, the "display contig" routine and the
"show quality" option use the rules outlined here to calculate a
consensus from aligned gel readings. Note that "display contig"
calculates
a consensus for each page width it displays (it does not use the
consensus sequence file calculated by the consensus function).
.LEFT MARGIN2
.para
We have 6 possible symbols in the consensus sequence: A,C,G,T,* and -. The
last symbol is assigned if none of the others makes up a sufficient
proportion of the aligned characters at any position in the contig. The
following calculation is used to decide which symbol to place in the
consensus at each position.
.para
Each uncertainty code contributes a score
to one of A,C,G,T,* and also to the total at each point. Symbols like R
and Y which don't correspond to a single base type contribute only to the
total at each point. The scores are shown below.
.lit
definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1
probable assignments ie 1,2,3,4 = 0.75
other uncertainty codes including R,Y,5,6,7,8,- = 0.1
.end lit
.para
A cutoff score of 51% to 100% is supplied by the user. (When the program
starts this is set to 75%. See "set parameters").
At each position in the contig we calculate the total score for each of
the 5 symbols
A,C,G,T and * (denote these by Xi, where i=A,C,G,T or *),
and also the sum of these totals
(denote this by S). Then if 100 Xi / S > the cutoff for any i, symbol i is
placed in the consensus; otherwise - is assigned.
.para
Notice that S does not equal the number of times the sequence has been
determined, but is the score total, and hence we are less likely to put a -
in the consensus. For the "examine quality" algorithm each strand is
treated separately but the calculation is the same. (It was originally
different).
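.para
The calculation for a single position can be sketched in Python as shown
below. This is one reading of the rules above, not the program's source:
the function name is hypothetical, S is taken to be the sum of every
score at the position (including the 0.1 contributed by codes such as R
and Y), and the codes D, V, B, H, K, L, M and N are credited to the base
listed for them in the table of uncertainty symbols.
.lit
```python
# Definite assignments score 1 for the base shown.
DEFINITE = {"A": "A", "C": "C", "G": "G", "T": "T",
            "a": "A", "c": "C", "g": "G", "t": "T",
            "D": "C", "V": "T", "B": "A", "H": "G",
            "K": "C", "L": "T", "M": "A", "N": "G", "*": "*"}
# Probable assignments score 0.75 for the base shown.
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for one column of aligned characters."""
    scores = dict.fromkeys("ACGT*", 0.0)   # the totals Xi
    total = 0.0                            # the score total S
    for ch in column:
        if ch in DEFINITE:
            scores[DEFINITE[ch]] += 1.0
            total += 1.0
        elif ch in PROBABLE:
            scores[PROBABLE[ch]] += 0.75
            total += 0.75
        else:
            total += 0.1   # R, Y, 5, 6, 7, 8, "-" and any other character
    if total == 0:
        return "-"
    for symbol, x in scores.items():
        if 100.0 * x / total > cutoff:
            return symbol
    return "-"


print(consensus_symbol("3A"), consensus_symbol("C*"))
```
.end lit
Because the test is "greater than the cutoff", a column such as AAAC
(exactly 75% A) gives a dash at the initial cutoff of 75% but an A at a
cutoff of 51%.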
.para
Format of the consensus sequence (and vector sequences).
.para
A consensus sequence file may contain the consensus for several contigs
and so we identify each of them by preceding them by a 20 character
title. The title is of the form <---LAMBDA.0076----> ( where LAMBDA is
the project name and gel reading number
76 is the leftmost gel
reading to contribute to this consensus sequence).
The angle brackets <> and the 4 digit number preceded by a full stop
are important to some processing programs.
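.para
A processing program might pick a title apart along the following lines.
This Python sketch is hypothetical: the function name and the regular
expression are not taken from any of the programs, and it assumes that
the project name contains no full stop, dash or angle bracket.
.lit
```python
import re

# "<", dashes, project name, ".", 4 digit gel number, dashes, ">"
TITLE = re.compile(r"<-+([^.<>-]+)\.(\d{4})-+>")


def parse_title(title):
    """Return (project name, leftmost gel reading number) from a title."""
    if len(title) != 20:
        raise ValueError("title must be 20 characters long")
    match = TITLE.fullmatch(title)
    if match is None:
        raise ValueError("unrecognised title: %r" % title)
    return match.group(1), int(match.group(2))


print(parse_title("<---LAMBDA.0076---->"))
```
.end lit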
.left margin1
@25. TX 1 @Show relationships
.LEFT MARGIN2
.para
Used to show the relationships of the gel readings in the database in
three ways -
.LEFT MARGIN2
(a) All contig descriptor lines followed by all gel descriptor
lines.
.LEFT MARGIN2
(b) All contigs one after the other sorted, i.e. for each
contig show its contig descriptor line followed by all its
gel descriptor lines sorted on position from left to right
.LEFT MARGIN2
(c) Selected contigs: show the contig line and, in order,
those gel readings that cover a user-defined region.
Note that this output can be directed to a disk file by
prior selection of "redirect output".
.LEFT MARGIN2
.para
Below is an example showing a contig from position
1 to 689. The left gel reading is number 6 and has archive
name HINW.010, the
rightmost gel reading is number 2 and has archive name HINW.004.
On each gel descriptor line is shown:
the name of the archive version, the gel number, the position of the
left end of the gel reading relative to the left end of the contig, the
length of the gel
reading (if this is negative it means that the gel reading is in
the opposite orientation to its archive), the number of the gel
reading to
the left and the number of the gel reading to the right.
.lit
CONTIG LINES
CONTIG LINE LENGTH ENDS
LEFT RIGHT
48 689 6 2
GEL LINES
NAME NUMBER POSITION LENGTH NEIGHBOURS
LEFT RIGHT
HINW.010 6 1 -279 0 3
HINW.007 3 91 -265 6 5
HINW.009 5 137 -299 3 17
HINW.999 17 140 273 5 12
HINW.017 12 193 265 17 18
HINW.031 18 385 -245 12 2
HINW.004 2 401 -289 18 0
.end lit
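.para
The way the gel lines chain together can be checked with a short Python
sketch using the values from the example above. The function names are
hypothetical, and a neighbour number of 0 is taken to mean "no
neighbour", as in the example.
.lit
```python
# (name, number, position, length, left neighbour, right neighbour)
GELS = [
    ("HINW.010", 6, 1, -279, 0, 3),
    ("HINW.007", 3, 91, -265, 6, 5),
    ("HINW.009", 5, 137, -299, 3, 17),
    ("HINW.999", 17, 140, 273, 5, 12),
    ("HINW.017", 12, 193, 265, 17, 18),
    ("HINW.031", 18, 385, -245, 12, 2),
    ("HINW.004", 2, 401, -289, 18, 0),
]


def gels_in_order(gels, left_end):
    """Follow right neighbours from the left end gel of a contig."""
    by_number = {g[1]: g for g in gels}
    order, seen, current = [], set(), left_end
    while current != 0:
        if current in seen:
            raise ValueError("loop in contig at gel %d" % current)
        seen.add(current)
        order.append(current)
        current = by_number[current][5]
    return order


def contig_length(gels):
    """Rightmost base covered; a negative length only records orientation."""
    return max(pos + abs(length) - 1 for _, _, pos, length, _, _ in gels)


print(gels_in_order(GELS, 6), contig_length(GELS))
```
.end lit
Starting from gel 6 the chain visits gels 6, 3, 5, 17, 12, 18 and 2, and
the rightmost base covered is 689, in agreement with the contig line.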
.left margin1
@23. TX 3 @Complement a contig
.LEFT MARGIN2
.PARA
This function will complement and reverse all of the gel
readings in a
contig. It automatically reverses and complements each gel
reading sequence, reorders left and right neighbours, recalculates
relative
positions and changes each strandedness.
.PARA
The only user input required is to identify the contig to
complement by the number or name of a gel reading it contains.
DO NOT KILL THE
PROGRAM DURING THIS STEP!
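.para
The arithmetic involved can be sketched as follows. This is an
illustration in Python only, with hypothetical function names. It
complements the four bases and leaves every other character, including
padding and uncertainty codes, unchanged, which is a simplification and
not a statement of what the program does with those codes.
.lit
```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement A, C, G and T, then reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def complement_gel(position, length, contig_length):
    """New (position, length) of a gel reading after complementing.

    The old right end becomes the new left end, measured from the other
    end of the contig, and the sign of the length (strandedness) flips.
    """
    right_end = position + abs(length) - 1
    new_position = contig_length - right_end + 1
    return new_position, -length


print(complement_gel(401, -289, 689))
```
.end lit
In a contig of length 689, a gel reading at position 401 with length
-289 moves to position 1 with length 289. Left and right neighbours are
simply exchanged.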
.left margin1
@22. TX 3 @ Join contigs
.LEFT MARGIN2
.PARA
This function joins contigs interactively using a mouse driven editor.
The operation of this editor is very similar to the Contig Editor
described in "Edit contig".
.para
It allows the
user to align the ends of the two contigs by editing each
contig separately. It is important that the alignment achieved is
correct because once the join is completed the alignment is fixed.
.para
First identify the two contigs to be joined.
The program checks that the two contig numbers are different (it will not
allow circles to be formed!).
.para
The Join Editor consists of two Contig Editors in between which is sandwiched
a disagreement box. This disagreement box shows exclamation marks to
denote mismatches between the two consensuses.
.para
For example, the display will look something like this:
.lit
1460 1470 1480 1490 1500
56 HINW.100 TCT*GAGCAGTGTGGGCGCTG*CCGG
33 HINW.300 TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
-25 HINW.090 TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
19 HINW.123 TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
CONSENSUS TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
MISMATCH ! !!!!!!
10 20 30 40 50
-6 HINW.010 TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
-3 HINW.007 TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
-5 HINW.009 GCTCGGAGGGCATGAAGT*AGAGC
CONSENSUS TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
.END LIT
.para
The overlap must be of at least one character.
Use the scroll bar and the scroll buttons (`<<', `<', `>' and `>>')
to adjust the relative positions of the two contigs.
.para
The relative position of the two contigs can be fixed
by pressing the `lock' button at the top of the Join Editor.
Locking allows the two contigs to be scrolled as one when using the scroll bar
and buttons, the left ends always in the same position relative to each
other.
.para
Once locked, it is best to proceed to the right along the contigs, inserting
padding characters (`*') into the consensuses to minimise the
disagreements.
.para
It is essential that the user aligns the two contigs throughout the whole
region of overlap before completing the join because it is only at this
stage that the two contigs can be edited independently. Once the join is
completed the alignment can only be altered using the routines supplied
by "alter relationships".
.para
The join can be completed by pressing the `Leave Editor' button. The
percentage mismatch is displayed, and the user is required to confirm that
they want to perform the join.
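.para
The disagreement box and the percentage mismatch can be pictured with
the Python sketch below. How the program itself counts mismatches is not
described here, so this is an assumption: two consensus characters are
treated as a mismatch whenever they differ, and the percentage is taken
over the overlapping columns only. The function names are hypothetical.
.lit
```python
def mismatch_line(upper, lower):
    """Return "!" under each overlapping column where the two differ."""
    return "".join("!" if a != b else " " for a, b in zip(upper, lower))


def percent_mismatch(upper, lower):
    """Percentage of overlapping columns in which the consensuses differ."""
    overlap = min(len(upper), len(lower))
    if overlap == 0:
        raise ValueError("the overlap must be of at least one character")
    return 100.0 * mismatch_line(upper, lower).count("!") / overlap


print(repr(mismatch_line("ACGT", "ACCT")), percent_mismatch("ACGT", "ACCT"))
```
.end lit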
.left margin1
@24. TX 1 @ Copy the database
.LEFT MARGIN2
.PARA
Used to make a copy of the database. If required the database size can be
altered using this option. The "version" of a database is encoded as the
last letter in the names of the five files that contain the database.
.para
Supply a "version" number (the default is version 1), and if required
select a new size for the database. The size of a database is the number
of
lines of information it can hold. It needs a line for each gel reading and
another for each contig.
.left margin1
@19. TX 1 @ Check database
.LEFT MARGIN2
.para
Used to perform a check on the logical consistency of the
database. No user intervention is required. If selected "with
dialogue" the program also checks for any sections of the consensus that
contain 15 dashes in 20 characters.
.para
The following relationships are checked:
.LEFT MARGIN2
1. If gel reading A thinks gel reading B is its left
neighbour
does B think A is
its right neighbour?
The error message is
.left margin2
"Hand holding problem for gel reading A"
.left margin2
followed by the
gel descriptor lines for gel readings A and B.
.LEFT MARGIN2
2. Are there any contig lines with no left or right
end gel readings?
The error message is
.left margin2
"Bad contig line number A"
.LEFT MARGIN2
3. Do the gel readings that are described as left ends on
contig
lines agree that they are left ends?
The error message is
.left margin2
"The end gel readings of contig A have outward neighbours"
.LEFT MARGIN2
4. Are there gel readings that are in more than one contig?
The error message is
.left margin2
" Gel number A is used N times"
.LEFT MARGIN2
5. Are there gel readings that are not in any contig?
The error message is
.left margin2
" Gel number A is not used"
.LEFT MARGIN2
6. Do the relative positions of gel readings agree with
their
position as defined by left and right neighbourliness?
The error message is
.left margin2
" Gel number A with position X is left neighbour of gel number B with
position Y"
.LEFT MARGIN2
7. Are there any loops in contigs? If so no further
checking is done.
The error message is
.left margin2
" Loop in contig n no further checking done, but gel reading numbers follow"
.left margin2
The
program then prints the gel reading numbers in the looped
contig up
to
the start of the loop.
.LEFT MARGIN2
8. Are there any contigs of length <1? The error message is
.left margin2
" The contig on line
number x has zero length"
.LEFT MARGIN2
9. Are there any gel readings (used in only one contig) that have zero
length? The error
message is
.left margin2
" Gel number N has zero length"
.left margin2
Note that "auto assemble" also uses this logical consistency check and
will
only tolerate a "Gel number N
is not used" error. Any other error will cause it to
give up.
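.para
The first of these checks can be sketched in Python as follows. The
function name and the dictionary used to hold the neighbour numbers are
hypothetical; only the error message is taken from the text above, with
the gel number substituted for A.
.lit
```python
def hand_holding_errors(gels):
    """Check 1: if A names B as its left neighbour, B must name A as its right.

    gels maps gel number -> (left neighbour, right neighbour); 0 = none.
    """
    errors = []
    for a, (left, _right) in sorted(gels.items()):
        if left != 0 and gels[left][1] != a:
            errors.append("Hand holding problem for gel reading %d" % a)
    return errors


good = {6: (0, 3), 3: (6, 5), 5: (3, 0)}
bad = {6: (0, 3), 3: (6, 5), 5: (6, 0)}   # gel 5 wrongly claims 6 on its left
print(hand_holding_errors(good), hand_holding_errors(bad))
```
.end lit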
.left margin1
@29. TX 1 @ Examine quality
.LEFT MARGIN2
.para
Analyses the quality of the data in a contig. It reports on the proportion
of the consensus that is "well determined" and will display a sequence of
symbols that indicate the quality of the consensus at each position.
.para
Identify the contig to analyse, and the section of interest. The current
consensus calculation cutoff score will be used to decide if each position
is
"well determined". In general the quality of a reading deteriorates along
the length of the gel and so it is also possible to use a length cutoff for
the quality calculation. Only the data from the first section of each reading
will be included in the quality calculation. The length is altered under
"set parameters" and is initially set to the maximum reading length.
A summary showing the percentage of the consensus
that falls into each category of quality is shown. Choose whether or not to
have the quality codes for each position of the consensus displayed.
They can be displayed as either graphics or text.
.para
The quality of the data depends on the number of times it has been
sequenced and the particular uncertainty codes used in each gel
reading. This function divides the data into five categories, assigning
each
a symbol or code:
.LEFT MARGIN2
1. Well determined on both strands and they agree. code=0
.LEFT MARGIN2
2. Well determined on the plus strand only. code=1
.LEFT MARGIN2
3. Well determined on the minus strand only. code=2
.LEFT MARGIN2
4. Not well determined on either strand. code=3
.LEFT MARGIN2
5. Well determined on both strands but they disagree. code=4
.LEFT MARGIN2
A position is "well determined" if it is assigned one of the symbols
A,C,G,T when the algorithm described in the section "calculate a
consensus" is applied.
The calculation is performed
separately for each strand.
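.para
The five categories can be expressed as a small decision function. This is a
sketch only, not the program's code: it assumes that the consensus for each
strand has already been calculated and that any symbol other than A,C,G,T
(here "-") means "not well determined". The function name is hypothetical.
.lit
```python
def quality_code(plus_base, minus_base):
    """Map the per-strand consensus symbols at one position to a code 0-4."""
    good = set("ACGT")
    plus_ok = plus_base in good
    minus_ok = minus_base in good
    if plus_ok and minus_ok:
        # both strands well determined: code 0 if they agree, 4 if not
        return 0 if plus_base == minus_base else 4
    if plus_ok:
        return 1  # plus strand only
    if minus_ok:
        return 2  # minus strand only
    return 3      # neither strand


# One code per position, e.g. for a short stretch of consensus:
codes = [quality_code(p, m) for p, m in zip("ACG-T", "AC--A")]
```
.end lit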
.para
If the user chooses to have the data displayed graphically the following
scheme is used. A rectangular box is drawn so that the x coordinate
represents the length of the contig. The box is notionally
divided vertically into
5 possible levels which are given the y values: -2,-1,0,1,2.
The quality codes attributed to each base position are plotted as
rectangles.
Each rectangle represents a region in
which the quality codes are identical, so a single base having a different
code from its immediate neighbours will appear as a very narrow rectangle.
.lit
Rectangle bottom and top y values
Quality 0 rectangle from 0 to 0
Quality 1 rectangle from 0 to 1
Quality 2 rectangle from 0 to -1
Quality 3 rectangle from -1 to 1
Quality 4 rectangle from -2 to 2
.end lit
.para
Obviously a single line at the midheight shows a perfect sequence.
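.para
The grouping of identical neighbouring codes into rectangles can be sketched
as follows. The table of y values is the one given above; the function name
and the tuple layout (x start, x end, y from, y to) are assumptions made for
illustration.
.lit
```python
from itertools import groupby

# bottom and top y values for each quality code, as tabulated above
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def quality_rectangles(codes):
    """Turn a string of quality codes into one rectangle per run of codes."""
    rectangles = []
    pos = 1  # base positions are numbered from 1
    for code, run in groupby(codes):
        length = len(list(run))
        y_from, y_to = Y_RANGE[int(code)]
        rectangles.append((pos, pos + length - 1, y_from, y_to))
        pos += length
    return rectangles
```
.end lit
For the code string "1113100" this gives four rectangles, the single base
with code 3 appearing as a rectangle one base wide.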
.para
Typical dialogue is shown below.
.lit
41.47% OK on both strands and they agree(0)
55.48% OK on plus strand only(1)
2.08% OK on minus strand only(2)
0.97% Bad on both strands(3)
0.00% OK on both strands but they disagree(4)
? (y/n) (y) Show sequence of codes
10 20 30 40 50
1111111111 1111111111 1111111111 1111111111 1111111111
60 70 80 90 100
1111111111 1111111111 1111111111 3111111111 1111111111
110 120 130 140 150
1111111111 1111131111 1111111111 1111111111 1111111111
160 170 180 190 200
1111111111 1111111111 1111111111 1111111111 1111111133
210 220 230 240 250
1311111111 1111111111 1111111110 0000000000 0000220000
260 270 280 290 300
0000000000 0020000000 2200000202 0002000000 0000222200
.end lit
.left margin1
@26. TX 3 @ Alter relationships
.LEFT MARGIN2
.para
Used to make what are normally illegal changes to the database. That is,
the normal checks are not done and any item in the database can be
changed independently of all others. Users need to know what they are
doing because it is very easy to make a horrible mess. Always start by
making a copy!
.para
By using the options here users can
move one section of a contig relative to another, break contigs, remove
contigs, remove gel readings, etc. To give flexibility most
of the commands do only one thing. This means that several commands
may
have to be executed to complete any change.
.para
The following options are offered:
.lit
Cancel
Line change
Check logical consistency
Remove contig
Shift
Move gel reading
Rename gel reading
Break a contig
Remove a gel reading
Alter raw data parameters
.end lit
.left margin2
1. QUIT returns to the main options of BAP.
.left margin2
3. Line change
.left margin2
allows the user to change the contents of any line in the
file of relationships. The line is selected by number, the
program prints the current line and prompts for the new line.
.left margin2
4. Check logical consistency
.left margin2
5. Remove a contig
.left margin2
This function removes a contig and all its gel readings. The user specifies
any reading in the contig.
.left margin2
6. Shift
.left margin2
allows the user to change all the relative positions of a
set of neighbouring gel
readings by some fixed value, i.e. it will
shift related gel readings
either left or right. It can therefore
be used to change the alignment of the gel
readings in a contig.
It prompts for the number of the first gel
reading to
shift and then for the distance to move them (Note a
negative value will move the gel readings
left and a positive value
right). It then chains rightwards (ie follows right
neighbours) and shifts each gel
reading, in turn, up to the end
of the contig. (This means that only those gel readings
from the first
to shift to the rightmost are moved). It updates the length of
the contig accordingly.
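.para
The chaining performed by Shift can be sketched as below. This is an
illustration, not the program's code: the dictionary layout and the use of 0
for "no right neighbour" are assumptions, and the update of the contig length
is left out.
.lit
```python
def shift_readings(readings, first, distance):
    """Shift `first` and every reading reached by following right neighbours.

    readings: {gel_number: {"pos": int, "right": int}}, right == 0 at the
              right end of the contig (an assumption of this sketch).
    distance: negative moves the readings left, positive moves them right.
    Returns the gel numbers that were moved, in chaining order.
    """
    moved = []
    gel = first
    while gel != 0:
        readings[gel]["pos"] += distance
        moved.append(gel)
        gel = readings[gel]["right"]
    return moved
```
.end lit
Readings to the left of the first one named are never visited, which is why
the alignment of one part of a contig changes relative to the rest.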
.left margin2
7. Move gel reading
.left margin2
is a function to renumber a gel reading. It moves all the information
about a gel reading on to another line. The user must specify the
number of the gel reading to move and the number of the line on which
to place it. It takes care of all the relationships. Of course gel
readings must not be moved to lines occupied by other gel readings!
.left margin2
8. Rename gel reading
.left margin2
is a function that is used to change the archive names of
gel
readings in the database; it only changes the name in the .ARN
file of the database.
.sk1
.LEFT MARGIN2
9. Break contig
.LEFT MARGIN2
.PARA
Occasionally it is necessary to break a contig into two parts and this can be
achieved using this option. The program needs only the number of a gel
reading. This is the gel reading that will become a left end after the
break. That
is, the break is made between this gel
reading and its left neighbour. A new contig
line is created so ensure that there is sufficient space in the database.
.left margin2
10. Removing gel readings from contigs
.left margin2
.PARA
Gel
readings can be removed from contigs. If they are essential for holding the
contig together (ie are the only gel reading covering a particular region),
the program will create a new contig.
.sk1
.LEFT MARGIN2
11. Alter raw data parameters
.LEFT MARGIN2
.PARA
Allows the user to edit the individual raw data parameters, such as
the left and right cutoff lengths and the name of the machine readable trace
file.
The user must specify the gel line to modify, and provide new values,
where they change, for
the length of the raw sequence including cutoff lengths, the left cutoff
position, the length of the original working sequence, the machine type,
and the name of the raw data file.
.left margin1
@27. TX 1 @ Set display parameters
.LEFT MARGIN2
.para
Used to redefine the parameters that control the cutoff employed by the
consensus calculation and quality examiner, the maximum length of each
reading to include in the quality calculation, the line length used by
the display function, the text window length used by the graphics
options, and the graphics window length used by the graphics options.
.para
The default cutoff score is 75%. The default line length is 50 characters.
For protein sequences the cutoff is always 100%.
.para
The text window used by the graphics options controls the amount of
sequence listed at the crosshair position. The graphics window controls the
"zoom" function. Both these windows are defined as the number of bases that
should be shown, to both left and right of the crosshair.
.left margin1
@30. TX 3 @ Shuffle pads
.left margin2
.para
One weakness of the alignment strategy used is that padding
characters are not always aligned by the assembly routine. This function
attempts to align padding characters using a very simple strategy. It
does not solve all pad alignment problems but is a useful first step during
cleaning-up operations.
.LEFT MARGIN1
@10. TX 2 @Clear graphics
.LEFT MARGIN2
.para
Clears graphics from the screen.
.left margin1
@11. TX 2 @Clear text
.LEFT MARGIN2
.para
Clears text from the screen.
.left margin1
@12. TX 2 @Draw a ruler.
.LEFT MARGIN2
.para
This option
allows the user to draw a ruler or scale along the x axis of the screen to
help identify the coordinates of points of interest. The user can define
the position of the first base to be marked (for example if the active
region is 1501 to 8000, the user might wish to mark every 1000th base
starting at either 1501 or 2000 - it depends if the user wishes to treat
the active region as an independent unit with its own numbering starting
at
its left edge, or as part of the whole sequence). The user can also define
the separation of the ticks on the scale and their height. If required the
labelling routine can be used to add numbers to the ticks.
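.para
The choice of the first marked base can be illustrated with a short sketch.
The function name and arguments are hypothetical; it simply lists the base
positions that would receive a tick.
.lit
```python
def ruler_ticks(first_mark, separation, region_end):
    """List the base positions ticked between first_mark and region_end."""
    return list(range(first_mark, region_end + 1, separation))


# Active region 1501 to 8000, a tick every 1000 bases:
independent = ruler_ticks(1501, 1000, 8000)  # numbering from the left edge
whole = ruler_ticks(2000, 1000, 8000)        # numbering of the whole sequence
```
.end lit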
.left margin1
@14. TX 2 @Reposition plots
.LEFT MARGIN2
.para
The position of each of the plots is defined relative to a user's drawing
board which has size 1-10,000 in x and 1-10,000 in y.
Plots for
each option are drawn in a window defined by x0,y0 and xlength,ylength,
where x0,y0 is the position of the bottom left hand corner of the window,
xlength is the width of the window and ylength is the
height of the window.
.lit
--------------------------------------------------------- 10,000
1 1
1 -------------------------------------- ^ 1
1 1 1 1 1
1 1 1 1 1
1 1 1 ylength 1
1 1 1 1 1
1 1 1 1 1
1 -------------------------------------- v 1
1 x0,y0^ 1
1 <---------------xlength--------------> 1
--------------------------------------------------------- 1
1 10,000
.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "ANALMARG" when the
program is started. Users can have their own file if required.
As all the plots start
at the same position in x and have the same width, x0 and xlength are the
same for all options. Generally users will only want to change the start
level of the window y0 and its height ylength.
This option
allows users to change window positions whilst running the program.
The routine prompts first for the number of the option that the user
wishes
to reposition; then for the y start and height; then for the x start and
length. Note that changes to the x values affect all options. If the user
types only carriage return for any value it will remain unchanged.
Note that, unlike all the other programs, the boxes used to contain
analytical results (eg plot quality) should not be made to overlap one
another, as the function of the crosshair routine depends on which box the
crosshair is in!
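.para
A sketch of how a set of windows might be checked before use is given below.
It is not part of the program: the class and method names are assumptions.
Because x0 and xlength are common to all options, two boxes overlap exactly
when their y ranges overlap.
.lit
```python
from dataclasses import dataclass

BOARD_MIN, BOARD_MAX = 1, 10000  # drawing board units in both x and y


@dataclass
class Window:
    x0: int
    y0: int
    xlength: int
    ylength: int

    def on_board(self):
        """True if the whole window lies on the drawing board."""
        return (BOARD_MIN <= self.x0 and BOARD_MIN <= self.y0
                and self.x0 + self.xlength <= BOARD_MAX
                and self.y0 + self.ylength <= BOARD_MAX)

    def overlaps(self, other):
        """True if the two windows share any vertical extent."""
        return (self.y0 < other.y0 + other.ylength
                and other.y0 < self.y0 + self.ylength)


contigs_box = Window(1000, 1000, 8000, 2000)
quality_box = Window(1000, 3500, 8000, 2000)
```
.end lit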
.LEFT MARGIN1
@15. TX 2 @Label a diagram
.LEFT MARGIN2
.para
This routine allows users to label any diagrams they have produced. They
are asked to type in a label. When the user types carriage return to finish
typing the label the cross-hair appears on the screen. The user can
position it anywhere on the screen. If the user types R (for right justify)
the label will be
written on the diagram with its right end at the cross-hair position.
If the user types L (for left justify) the label will be written on the
diagram with its left end at the cross hair position.
The
cross-hair will then immediately reappear. The user may put the same
label
on another part of the diagram in the same way or, on hitting the space bar,
will be asked whether another label is to be typed in.
.para
Typical dialogue follows.
.lit
? Menu or option number=15
Type label then drive cross hair to left or right end
of label position then hit "L" to write label left
justified or "R" to write label right justified or
the space bar to quit
? Label=delta gene
missing graphics
? Label=
.end lit
.left margin1
@16. TX 2 @Display a map
.LEFT MARGIN2
.para
This is disabled!
.left margin1
@7. TX 1 @Redirect output
.LEFT MARGIN2
.para
Used to direct output that would normally appear on the screen to a file and
to create postscript output.
.para
Select redirection of either text or graphics, and
supply the name of the file that the output should be written to.
.para
The results from the next options selected will not appear on the screen
but will be written to the file. When option 7 is selected again
the file will be
closed and output will again appear on the screen.
.left margin1
@13. TX 2 @Use crosshair
.left margin2
.para
This option puts a steerable cross on the screen which the user
drives around
by using the arrow keys (or mouse). When the crosshair is
visible a number of options are available if the user types one of a
set of special keyboard characters. Any other characters will cause
an exit from the crosshair option. The special keys are:
.lit
I = Identify the nearest gel reading
Z = Zoom in
Q = plot Quality
S = display the aligned Sequences at the crosshair position
N = list the Names and Numbers of the sequences at the crosshair
.end lit
.para
In order for any of these special keys to operate, the crosshair
must be in an appropriate display box, and the precise function of
the keys will also depend on which box the crosshair is in.
.para
If the
crosshair is in the "plot all contigs" box, Z will cause a new box to
appear showing all the readings for the nearest contig; Q will give
the same as Z but will also produce an extra box showing the
"quality" plot.
.para
If Z is hit in the "plot single contig" box, the contig will be zoomed
to the current graphics window size. The zoom will be roughly
centred on the crosshair position. Because of this it is possible to
step along a contig by repeatedly zooming with the crosshair near
to one end of the single contig display box. If I is hit the crosshair
must be close to a gel reading line. If Q is hit, the quality plot will
be produced for the region shown in the plot single contig box. In
all cases when the "plot all contigs" box is shown, a vertical line will
bisect the line that represents the relevant contig, at the current
position.
.para
If the crosshair is in the plot quality box only the character "s" will operate
as a special symbol.
.para
The number of bases shown in the N and S options is controlled by
the current graphics text window size, and the size of the zoom
window by the current graphics window size. Both are set by the
parameter setting function of the general menu.
.left margin1
@33. TX 2 @Plot single contig
.left margin2
.para
This option produces a schematic of a selected region of a single
contig by drawing a horizontal line to represent each of its gel
readings. The lines show the relative positions of each reading and
also their sense. The plot is divided vertically into two sections by
a line that is identified by an asterisk drawn at each end. All lines
that lie above this line represent readings that are in their original
sense, all lines below show readings that are in the
complementary sense to their original. By use of the crosshair
function the plot can be stepped through and examined in more
detail. See help on crosshair.
.left margin1
@34. TX 2 @Plot all contigs
.left margin2
.para
This option produces a schematic of all the contigs in a database. It
does this by drawing a horizontal line to represent each of them.
In order to show the ends of each contig it draws the lines for
contigs at alternate heights: the first at height one, the
second at height two, the third at height one, etc. The order of the
contigs in the display is the same as their order in the database.
By use of the crosshair function the plot can be stepped
through and examined in more detail. See help on crosshair.
.left margin1
@31. TX 3 @ Disassemble readings
.left margin2
.para
This function is used to remove a list of readings from a database, or
to create a new contig from a single reading moved from an existing contig.
This latter mode is useful for repositioning a reading in a repeat:
once separated it can be placed in the join editor and scrolled by the
other copies.
Removal of sets of readings works in two modes:
1. A set of adjacent readings in a
contig can be removed by the user naming the two end ones; or 2. A batch
of readings from any number of contigs can be defined by the user naming
a file containing a list of reading names. The program cleans up the
database by moving data to fill up any holes made in the files.
.para
For both modes of operation the program will ask for a file of file names.
If users create their own file (ie mode 2) each reading NAME must be on
a separate line. For mode 1 the user types the NAMES of the leftmost
and rightmost readings to be removed. They and all intervening readings
will be removed. Note that the routine operates on reading names - not
numbers. For both modes, if necessary, new contigs will be created.
.left margin1
@35. TX 1 3 @Find internal joins
.left margin2
.para
The purpose of this function is to use data already in the database to
find possible joins between contigs.
Joins may have been missed due to poor data or may have not been made
due to repeated sequences. Where appropriate, it may be
possible to find potential
joins by using the "unused data" derived from sequencing machines.
.left margin2
For all overlaps found when the X version is used,
the contig editor (in join mode) will be
called up with the two contigs aligned.
.left margin2
The database is checked for logical consistency. Supply a minimum initial
match length, a minimum alignment block, the maximum pads per sequence,
the maximum percent mismatch after alignment, and the probe length. Choose
whether clipped data is to be used; if so, define the window size for finding good
data and the number of dashes allowed in the window. Processing will then commence.
Most of these values are used in an identical way in the autoassemble
function. The others are defined below.
.left margin2
The program strategy
.left margin2
Take the first contig and calculate its consensus. If clipped data is being
used, examine all readings that
are in the complementary orientation, and sufficiently near to the contig's left
end, to see if they have good clipped sequence which, if present, would
protrude
from the left end of the contig. If found, add the longest such sequence to the
left end of the consensus. Do the same for the right end by examining
readings that are in their
original orientation. If any are found add the longest extension to the
right end of
the consensus. Repeat the consensus calculations and extensions
for all contigs hence producing an extended consensus. If clipped data is not
being used simply calculate the consensus for the whole database. Now
look for possible joins by processing the extended consensus in the following
way. Take the last, say 100, bases (termed the "probe length" by the program)
of the rightmost consensus, and compare it in both
orientations with the extended consensus of all the other contigs. Display
any sufficiently good alignments. Repeat with the left end of the rightmost
contig. Do the same for the ends of all the extended contigs, always only
comparing with the contigs to their left, so that the same matches do not
appear twice.
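.left margin2
The order of comparisons described above can be sketched as follows. This is
an illustration only: it lists which probe is compared with which contig and
does no alignment. The function names, the 0-based contig indices and the
plain A,C,G,T alphabet are assumptions.
.lit
```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq):
    """Reverse and complement a sequence of A, C, G, T."""
    return seq.translate(_COMPLEMENT)[::-1]


def probe_comparisons(consensuses, probe_length):
    """Yield (probe_contig, end, sense, target_contig, probe) tuples.

    consensuses: the extended consensus of each contig, leftmost first.
    Each contig is probed only against contigs to its left, so no pair
    of contigs is examined twice.
    """
    for i in range(len(consensuses) - 1, 0, -1):  # rightmost contig first
        seq = consensuses[i]
        ends = (("right", seq[-probe_length:]), ("left", seq[:probe_length]))
        for end, probe in ends:
            for sense, oriented in (("+", probe), ("-", revcomp(probe))):
                for j in range(i):  # only contigs to the left
                    yield i, end, sense, j, oriented
```
.end lit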
.left margin2
Good clipped data is defined by sliding a window of "Window size for good data
scan" bases outwards
along the sequence and stopping when "Maximum number of dashes in scan window"
or more dashes appear in the window.
Note that
it is advisable to have some sort of cutoff because if we simply take all the
data it might be so full of rubbish that we won't find any good matches. For
the same reason it is worth trying the procedure with different cutoffs. An
initial run using no clipped data is also recommended.
Sufficiently good
alignments are defined by criteria equivalent to those used in autoassemble,
however here we only display alignments that pass all tests.
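.left margin2
The window scan for good clipped data can be sketched as below. It is a
sketch under stated assumptions, not the program's code: the clipped sequence
is taken to be ordered outwards from the used part of the reading, and the
data kept is taken to be everything before the first window that reaches the
dash limit. The function name is hypothetical.
.lit
```python
def good_clipped_length(clipped, window_size, max_dashes):
    """Return how many bases of clipped data are treated as good.

    The window slides outwards one base at a time; the scan stops at the
    first window containing max_dashes or more dashes.
    """
    last_start = max(1, len(clipped) - window_size + 1)
    for start in range(last_start):
        window = clipped[start:start + window_size]
        if window.count("-") >= max_dashes:
            return start  # assumption: keep only what precedes this window
    return len(clipped)
```
.end lit
With a window of 4 and a limit of 2 dashes, the sequence ACGTACGT--A- yields
6 good bases, because the window starting at the seventh base is the first to
contain two dashes.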
.left margin2
Bugs
.left margin2
If a small contig is wholly contained within a larger one, such that its
ends are further than ("Probe length" - "Minimum initial match length")
from the ends of the larger contig, and the consensus for the small
contig lies to the left
of the consensus for the large contig, the overlap will not be discovered. (See
the search strategy).
.left margin2
All numbering is
relative to base number one in the contig: matches to the left (i.e. in
the clipped data) have negative
positions, matches off the right end of the contig (i.e. in the clipped
data) have positions
greater than that of the contig length.
The convention for reporting the positions of overlaps is as follows: if neither
contig needs to be complemented the positions are as shown. If the program says
"contig x in the - sense" then the positions shown assume contig x has been
complemented. For example in the results given below the positions for the
first overlap are as reported, but those for the second assume that the contig
in the minus sense (i.e. 443) has been complemented.
.lit
Possible join between contig 445 in the + sense and contig 405
Percentage mismatch after alignment = 4.9
412 422 432 442 452 462
405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
********* * ******** ***** *** ********** ********** **********
445 -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA
-127 -117 -107 -97 -87 -77
472 482 492 502 512
405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
********** ********** ********** ********** **
445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
-67 -57 -47 -37 -27
Possible join between contig 443 in the - sense and contig 423
Percentage mismatch after alignment = 10.4
64 74 84 94 104 114
423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG
**** ***** ********** ********** ****** ** ***** **** *********
443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
3610 3620 3630 3640 3650 3660
124 134 144 154 164
423 TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA
*** ****** ********** ** ******* *** ***** ** **
443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-
3670 3680 3690 3700 3710
.end lit
.left margin1
@36. TX 3 @Double strand
.left margin2
.para
PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS IT HAS
CURRENTLY HAD VERY LITTLE TESTING.
.para
Uses the cutoff data to change single stranded sections of a contig into
double stranded sections. Data is used carefully to try and minimise the
number of data disagreements created. However it must be noted that an overall
slight degradation in quality will still occur.
.para
When using this option you will be prompted for a contig and a region within
that contig. The default region is the entire contig. The option will then
search through the region for areas of good data on one strand and cutoff data
on the opposite strand, extending the cutoff data. The criteria for evaluating
the amount of cutoff data to be used are based upon a maximum number of
mismatches and a score (derived by accumulating points for mismatches (-8),
matches (+1) and insertions (-5) over the length of an alignment). The defaults
are:
.lit
maximum mismatches : 6
score for mismatch : -8
score for correct match : +1
score for insertion : -5
.end lit
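.para
The accumulation of the score can be illustrated with the default values
above. This is a sketch only: the use of "*" to mark an insertion and the
function name are assumptions, and how the program combines the score with
the mismatch limit is not modelled.
.lit
```python
MISMATCH, MATCH, INSERTION = -8, 1, -5  # default scores listed above


def score_alignment(good, cutoff, pad="*"):
    """Return (score, mismatches) for two aligned strings of equal length.

    A position where either string holds the pad symbol counts as an
    insertion (the pad symbol is an assumption of this sketch).
    """
    score = 0
    mismatches = 0
    for a, b in zip(good, cutoff):
        if a == pad or b == pad:
            score += INSERTION
        elif a == b:
            score += MATCH
        else:
            score += MISMATCH
            mismatches += 1
    return score, mismatches
```
.end lit
For example, aligning ACGTACGT with ACGAAC*T gives six matches, one mismatch
and one insertion, so the score is 6 - 8 - 5 = -7.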
.para
Note that with successive calls to this option it is possible to double strand
more and more data. Naturally however the quality of the data generated will
diminish each time.
.left margin1
@37. TX 3 @Auto-select oligos
.left margin2
.para
PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS IT HAS
CURRENTLY HAD VERY LITTLE TESTING.
.para
Generates a file (default "primers") of suggested primers to use for covering
a single stranded section or for walking off the end of a contig. The file
generated contains the gel reading name, the primer sequence, its offset in
the contig and the orientation. An example file would be :
.lit
c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +
c98a10.s1 ATTATCACTTTACGGGTC (@ 6959 ) +
c81c1.s1 CAAGAAGGCGATAGAAG (@ 7643 ) +
c76a10.s1 CCTCATCCTGTCTCTTG (@ 8441 ) +
c81g4.s1 ATGAAACCTGGGCGTTG (@ 16156 ) +
c91e6.s1 GTTTTCAGATGTCGGAG (@ 18249 ) +
c81e12.s1 GCTACCGTAAAACACTTC (@ 18737 ) +
c93h11.s1 GCTGCTTTTTGTTTTATCC (@ 19158 ) +
c81h6.s1 CTTCCACTTCTTTCTTATC (@ 21210 ) +
c86a12.s1 CGAATGATAAAGACAAATCAG (@ 22122 ) +
c98b1.s1 GCCACTTTATCCGAGAC (@ 3048 ) -
c97c5.s1 GTGTTTTGGGTATATTGTG (@ 3371 ) -
c83d2.s1 CTACACAGAATGAACCC (@ 3768 ) -
c78h10.s1 GGCGGTGAAGATTGAAG (@ 4200 ) -
c98h9.s2dt CTCGTTTAAATTTCAAACTTCC (@ 7419 ) -
c95a9.s1 ATTGGAAGGAAGGAGGG (@ 22996 ) -
c82b4.s1 TGTAGCCGAAATCTTCC (@ 23369 ) -
.end lit
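.para
A file in the format shown above can be read by other software with a few
lines of code. The sketch below is an assumption based only on the example
lines; the regular expression and function name are not part of the program.
.lit
```python
import re

# name, primer sequence, "(@ offset )", then the orientation + or -
PRIMER_LINE = re.compile(
    r"^\s*(\S+)\s+([A-Za-z]+)\s+\(@\s*(\d+)\s*\)\s+([+-])\s*$"
)


def parse_primer_line(line):
    """Split one line of the primers file into (name, sequence, offset, sense)."""
    match = PRIMER_LINE.match(line)
    if match is None:
        raise ValueError(f"unrecognised primer line: {line!r}")
    name, sequence, offset, sense = match.groups()
    return name, sequence, int(offset), sense
```
.end lit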
.para
This is best employed after having previously used the 'Double strand' option.
When selecting the option you will be asked for the contig, a region within
this contig and the file to write the list of primers to. For each primer
suggested a tag is automatically created containing details of the gel reading
name and the sequence. Preferably the tag will be created on the gel reading
from which the primer was selected. However this is not always possible so
failing that the tag will be on another sequence overlapping the primer
position.
.para
When invoked with the dialogue option you will be asked a couple more
questions relating to the position and size of the consensus checked for
suitable oligos. You will be prompted for the start and end of a region
(default 40-140) at a relative position to the left of our initial region.
.para
For example:
.lit
? Menu or option number=d37
Auto-select oligos
Default Contig identfier=/e97f2.s1
? Contig identfier=
? Start position in contig (1-20942) (1) =10000
? End position in contig (10000-20942) (20942) =11000
Default Name of file for primers=primers
? Name of file for primers=
? Start of oligo choice region (1-1024) (40) =50
? End of oligo choice region (50-1024) (150) =150
.end lit
.para
This implies that we are going to look for oligos to use as primers covering
the region 10000 to 11000. For each single stranded section in this region we
search for oligos between 50 and 150 bases to the left. So if we had a single
stranded section from 10121 to 10295 we would search for oligos in the region
9971 to 10071.
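.para
The arithmetic of the example can be written out as a short sketch. The
function name and argument names are hypothetical.
.lit
```python
def oligo_search_region(section_start, region_start, region_end):
    """Return the (left, right) consensus positions searched for oligos.

    region_start and region_end are the distances, to the left of the
    single stranded section, given in answer to the two prompts.
    """
    return section_start - region_end, section_start - region_start


# Single stranded section starting at 10121, choice region 50 to 150:
left, right = oligo_search_region(10121, 50, 150)
```
.end lit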
.left margin1
@38. TX 1 @Check assembly
.left margin2
.para
This new function is used for checking the positioning of assembled readings.
It is useful for checking sequences that contain repeats
of length similar to that of a single gel reading. It takes the poor
quality data for each reading and compares it to the segment of the consensus
to which it should align.
If the extension of the
read does not match the consensus then the read (or its neighbours) has
probably been assembled into the wrong place.
The program displays the bad alignments.
The quality of an alignment is defined by the percentage mismatch.
Naturally the user should select a value that takes into account
the poor quality of the data being aligned.
.para
When the routine is used from the X version the
user is offered the editor to examine poor alignments.
If alignments are reported as poor, but on inspection are OK, the user
can set a tag so that the poor quality data is ignored on subsequent passes
through the routine. Note however such data will then also be ignored by
the automatic double stranding routine!
.para
The user defines the percentage mismatch; the window size and number of
dashes allowed in the window used for selecting the amount of the poor data
to be employed; can choose to save the names of the poorly aligned reads
in a file; can select an individual contig or scan the whole database.
The file containing the names of the poorly aligned reads can be used by
the disassembly routine to remove them from the database, and then can be used
to reassemble them. Note that the routine complements each contig twice
during processing.
.left margin1
@39. TX 1 @Find read pairs
.left margin2
.para
This new function is used to check the positions of readings taken from each
end of the same template. For each forward read it searches for a corresponding
reverse reading. The search can be over the whole database or over a single contig.
The results can be presented graphically for single contig searches and the crosshair
function can be used to identify the readings displayed.
.para
Note that at present the function only knows that two reads are from the same template
by comparing reading names. For our local projects we use the following naming
convention: forward reads are named abcdefgh.s1 and reverse reads abcdefgh.r1. The
program expects this naming convention and so if it finds read fred.s1 and fred.r1 it
assumes they are the forward and reverse reads for template fred. In the future we
will make the routine more general!
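.para
Pairing by name under this convention can be sketched as below. It is an
illustration only, assuming that the suffixes are exactly ".s1" and ".r1";
the function name is hypothetical.
.lit
```python
def pair_reads(names):
    """Return (forward, reverse) name pairs that share a template name."""
    forward, reverse = {}, {}
    for name in names:
        template, _, suffix = name.rpartition(".")
        if suffix == "s1":
            forward[template] = name
        elif suffix == "r1":
            reverse[template] = name
    # keep only templates for which both reads are present
    return [(forward[t], reverse[t]) for t in sorted(forward) if t in reverse]
```
.end lit
Given fred.s1, fred.r1, k23a1.r1 and i68e6.s1 only the pair for template fred
is returned, as the other two reads have no partner.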
.para
If a single contig is selected and the output is listed the program displays two
lines for each pair: the first line shows the reading name, its position and length,
and the distance between the extremities of the two reads; the second line shows the
other read name, its position and length. If there are pairs that are in separate contigs
or are facing away from one another they are listed after the pairs that face inwards.
.para
If the results are plotted the full length of the template is drawn with arrows
indicating the direction of reads and the extent of each reading. Those reads that have
their partner in another contig are marked by asterisks.
.para
Typical dialogue is shown below.
.lit
? Select contigs (y/n) (y) =
Default Contig identifier=/i55d8.s1
? Contig identifier=
? Start position in contig (1-15227) (1) =
? End position in contig (1-15227) (15227) =
? Plot results (y/n) (y) = n
852 k23a1.r1 249 238 1615
806 k23a1.s1 1529 -335
238 i68e6.s1 422 193 1632
868 i68e6.r1 1756 -298
576 k17a2.s1 2370 213 1676
885 k17a2.r1 3790 -256
84 k27g6.s1 3456 291 1777
867 k27g6.r1 4905 -328
453 k01g10.s1 5805 142 1251
881 k01g10.r1 6909 -147
781 i98b8.r1 6754 338 1079
10 i98b8.s1 7653 -180
883 k02d11.r1 7327 276 1597
283 k02d11.s1 8726 -198
269 i68f9.s1 8191 169 1055
777 i68f9.r1 8891 -355
710 i91c6.s1 8245 95 1516
780 i91c6.r1 9403 -358
596 k27d12.s1 136 329 -329
219 k27d12.r1 1 -116
159 k27d11.r1 1830 -263 -263
317 k27d11.s1 2902 343
886 k17g11.r1 7107 -123 -123
647 k17g11.s1 1867 265
851 i69g10.r1 8045 -137 -137
277 i69g10.s1 4658 174
.end lit
.para
If contigs are not selected the pairs are sorted on their separations.
.lit
? Select contigs (y/n) (y) = n
i68f2.s1 27 1781 1777
i68f2.r1 776 111 1777
k17f6.s1 601 60 1706
k17f6.r1 856 1405 1706
k17a2.s1 576 2370 1676
k17a2.r1 885 3790 1676
k27g3.s1 177 14985 1664
k27g3.r1 889 13564 1664
.
.
k27b12.s1 764 1 1086
k27b12.r1 857 932 1086
i98b8.s1 10 7653 1079
i98b8.r1 781 6754 1079
k16a3.s1 748 1276 1070
k16a3.r1 784 472 1070
k17b7.r1 786 14937 18942*
k17b7.s1 787 3601 18942*
k27d12.r1 219 1 15208*
k27d12.s1 596 136 15208*
k01g2.s1 502 87 14754*
k01g2.r1 782 9224 14754*
.end lit
.left margin1
@ end of help