@-1. TX 0 @General
@-2. T 0 @Screen control
@-2. X 0 @Screen
@-3. TX 0 @Modification
@0. TX -1 @BAP
This is an interactive program whose primary use is for
managing shotgun sequencing projects, but it can also be used for
handling alignments of other sequences, including those of proteins.
Currently the maximum 'gel reading' length is set to 4096
characters. Almost all of the information below describes the use of
the program for shotgun projects, but those using the programs for
handling other sequence alignments should interpret it accordingly.
The data for such a project is stored in a special type of database.
The program contains the tools that are required to screen gel
readings against vector sequences and restriction sites, and to
assemble new gel readings into the database (automatically comparing
and aligning them). In addition it contains editors and functions to
examine the quality of the aligned sequences.
There are three main menus: "general", "screen" and
"modification", and some functions have submenus.
The general menu contains the following options:
Open a database
Display a contig
List a text file
Direct output
Calculate a consensus
Screen against restriction enzymes
Screen against vector
Check logical consistency
Copy database
Show relationships
Set parameters
Highlight disagreements
Examine quality
Check Assembly
Find read pairs
The screen (graphics) menu contains:
Clear graphics
Clear text
Draw ruler
Use cross hair
Change margins
Label diagram
Plot map
Plot single contig
Plot all contigs
The modification menu contains:
Edit contig
Auto assemble
Join contigs
Complement a contig
Alter relationships
Extract gel readings
Find internal joins
Disassemble readings
Shuffle pads
Auto-select oligos
Double strand
The alter relationships menu contains:
Cancel
Line change
Check logical consistency
Remove contig
Shift
Move gel reading
Rename gel reading
Break a contig
Remove a gel reading
Alter raw data parameters
Overview of the methodology
The shotgun sequencing strategy
In the shotgun sequencing procedure the sequence to be
determined is randomly broken into fragments of about 1000
nucleotides in length. These fragments are cloned and then selected
randomly and their sequences determined. The relationship
between any pair of fragments is not known beforehand but is
found by comparing their sequences. If the sequence of one is
found to be wholly or partially contained within that of another
for sufficient length to distinguish an overlap from a repeat
then those two fragments can be joined. The cycle of selecting,
sequencing and comparing is continued until the whole of the DNA to
be sequenced is in one continuous well determined piece.
Definition of a contig
A CONTIG is a set of gel readings that are related to
one another by overlap of their sequences. All gel readings
belong to a contig and each contig contains at least one gel
reading. The gel readings in a contig can be summed to produce a
continuous consensus sequence and the length of this sequence is the
length of the contig. The rules used to perform this summation are
given under "the consensus algorithm". At any stage of a
sequencing project the data will comprise a number of contigs; when
a project is complete there should be only one contig and its
consensus will be the finished sequence. Note that since being
introduced and defined as above the word "contig" has been taken up
by those involved in genomic mapping. In that context the consensus
with a precise length is, of course, not defined.
Introduction to the computer method
It is useful to consider the objectives of a sequencing
project before outlining how we use the computer to help achieve
them. The aim of a shotgun sequencing project is to produce an
accurate consensus sequence from many overlapping gel readings. It
is necessary to know, particularly at the later stages of the
project, how accurate the consensus sequence is. This enables us to
know which regions of the sequence require further work and also to
know when the project is finished. To show the quality of the
consensus, the programs described here produce displays like that
shown below.
                        10        20        30        40        50
  -6 HINW.010   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
     CONSENSUS  GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA

                        60        70        80        90       100
  -6 HINW.010   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
  -3 HINW.007                                           GGCACA*GTC
     CONSENSUS  CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC

                       110       120       130       140       150
  -6 HINW.010   GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
  -3 HINW.007   GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
  -5 HINW.009                                       GGCAGCGACCGTCG
  17 HINW.999                                          AGCGACCGTCG
     CONSENSUS  GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG

                       160       170       180       190       200
  -6 HINW.010   TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
  -3 HINW.007   TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
  -5 HINW.009   TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
  17 HINW.999   TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
  12 HINW.017                                             GTAGAGC*
     CONSENSUS  TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
This is an example showing the left end of a contig from
position 1 to 200. Overlapping this region are gel readings
numbered 6, 3, 5, 17 and 12; 6, 3 and 5 are in reverse orientation
to their original reading (denoted by a minus sign). Each gel
reading also has a name (e.g. HINW.010). It can be seen that in a
number of places the sequences contain characters other than A,C,G
and T. Some of these extra characters have been used by the
sequencer to indicate regions of uncertainty in the initial
interpretation of the gel reading, but the asterisks (*) have been
inserted by the automatic assembly function in order to align the
sequences. Underneath each 50 character block of gel reading
sequences is the consensus derived from the sequences aligned above
(the line labelled CONSENSUS). For most of its length the consensus
has a definite nucleotide assignment but in a few positions there is
insufficient agreement between the gel readings and so a dash (-)
appears in the sequence. This display contains all the evidence
needed to assess the quality of the consensus: the number of times
the sequence has been determined on each strand of the DNA, and the
individual nucleotide assignments given for each gel reading.
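How a consensus line of this kind can be derived from a block of
aligned readings is sketched below in Python. This is an illustration
only: the rules the program really uses are those given under "the
consensus algorithm", whereas the sketch assumes a simple vote in
each column with a 75% cutoff, treats the padding character like any
other symbol, and ignores the uncertainty codes. The function name
column_consensus and the cutoff value are assumptions made for the
example.

```python
from collections import Counter


def column_consensus(rows, cutoff=0.75):
    """Return a consensus string for equal-width aligned rows.

    A space in a row means "this reading does not cover this column".
    A column receives its most common character if that character's
    share of the covering readings is at least `cutoff`, otherwise '-'.
    The cutoff of 0.75 is an assumption, not the program's documented rule.
    """
    width = len(rows[0])
    out = []
    for col in range(width):
        chars = [row[col] for row in rows if row[col] != " "]
        if not chars:
            out.append("-")
            continue
        symbol, count = Counter(chars).most_common(1)[0]
        out.append(symbol if count / len(chars) >= cutoff else "-")
    return "".join(out)


# Columns 191 to 200 of the display above (readings 6, 3, 5, 17 and 12).
block = [
    "AAGTAGAGC*",
    "AAGTAGAGC*",
    "AAGTAGAGC*",
    "AAGTAGAGCG",
    "  GTAGAGC*",
]
print(column_consensus(block))
```

With this cutoff the sketch gives AAGTAGAGC* for the block shown,
which agrees with the CONSENSUS line of the display for those
columns.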
So the aim is to produce the consensus sequence and, equally
important, a display of the experimental results from which it was
derived.
In order to achieve this the following operations need to be
performed:
1) Put individual gel readings into the computer. This might
involve the manual interpretation of autoradiographs or the
transfer and processing of machine-readable files from fluorescent
sequencing machines.
2) Check each gel reading to make sure it is not simply part of one
of the vectors used to clone the sequence.
3) Check each gel reading to make sure that those fragments that
span the ligation point used prior to sonication are not assembled
as single sequences.
4) Compare all the remaining gel readings with one another to
assemble them to produce the consensus sequence.
5) Check the quality of the consensus and edit the sequences.
6) When all the consensus is sufficiently well determined, produce a
copy of it for processing by other analysis programs.
It is very unlikely that this procedure will only be passed
through once. Usually steps 1 to 5 are cycled through repeatedly,
with step 4 just adding new sequences to those already assembled.
Generally step 6 is also used in order to analyse imperfect sequence
to check if it is the one the project intended to sequence, or to
look for interesting features. Analysis of the consensus, such as
searches for protein coding regions, can also help to find errors in
the sequence. The display of the overlapping gel readings shown
above can be used to indicate, not only the poorly determined
regions, but also which clones should be resequenced to resolve
ambiguities, or those which can usefully be extended or sequenced in
the reverse direction, to cover difficult regions.
The original individual gel readings for a sequencing project
are each stored in separate files. As the gel readings are entered
into the computer (usually in batches, say 10 from a film), the file
names they are given are stored in a further file, called a file of
file names. Files of file names enable gel readings to be processed
in batches.
For each sequencing project we start a project database. This
database has a structure specifically designed for dealing with
shotgun sequence data. In order to arrive at the final consensus
sequence many operations will be performed on the sequence data.
Individual fragments must be sequenced and compared in both senses
(i.e. both orientations) with all the other sequences. When an
overlap between a new gel reading and a contig is found they must
be aligned and the new gel reading added to the contig. If a new gel
reading overlaps two contigs they must be aligned and joined. Before
the two contigs are joined one of them may need to be turned around
(reversed and complemented) so they are both in the same
orientation.
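Turning a sequence around can be sketched in a few lines of Python.
The sketch assumes that only A, C, G and T need complementing and
that padding characters (*), dashes and uncertainty codes are simply
carried over unchanged; the program's own treatment of uncertainty
codes is not described here. The name reverse_complement is chosen
for the example.

```python
# Complement table for the four bases; any other symbol (pads,
# dashes, uncertainty codes) is left as it is in this sketch.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence reversed and complemented."""
    return seq.translate(_COMPLEMENT)[::-1]


print(reverse_complement("GGCACA*GTC"))
```

Applying the function twice gives back the original sequence, which
is a convenient check that no symbol is lost when a contig is turned
around.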
Clearly, keeping track of all these manipulations is quite
complicated, and to be able to perform the operations quickly
requires careful choice of data structure and algorithms. For these
reasons it is not practicable to store the gel readings aligned as
shown in the display above. Rather, it is more convenient to store
the sequences unassembled, and to record sufficient information for
programs to assemble them during processing. The data used to
assemble the sequences is called relational information.
The database comprises five files and they are described under
the section entitled "open database".
Before entry into the project database each new gel reading
must be compared to look for overlaps with all the data already
contained within the database. This last point is important: all
searching for overlaps is between individual new gel readings and
the data already in the database. There is no searching for overlaps
between sequences within the database; overlaps must be found before
new gel readings are entered into the database.
Below I give an introduction to how the sequences are
processed by being passed from one function to the next.
This program is used to start a database for the project and
then the following procedure is used.
Data in the form of individual gel readings are entered into
the computer and stored in separate files (possibly using the
digitizer program GIP). Batches of these gel readings are passed to
the screening functions in this program to search for overlaps with
vector sequences (see VEP and "screen against vector") or for
matches to restriction enzyme sites that should not be present
("screen against enzymes"). Each run of these screening functions
passes on only those gel readings that do not contain unwanted
sequences. Sequences are passed via files of file names and
eventually are processed by the automatic assembly function ("auto
assemble"). This function compares each gel reading with a consensus
of all the previous gel readings stored in the database. If it
finds any overlaps it aligns the overlapping sequences by inserting
padding characters, and then adds the new gel reading to the
database. Gels that overlap are added to existing contigs and gels
that do not overlap any data in the database start new contigs. If a
new gel overlaps two contigs they are joined. Any gel readings that
appear to overlap but which cannot be aligned sufficiently well are
not entered and have their names written to a file of failed gel
reading names.
Generally data is entered into the database in batches as just
described. The program is also used to examine the data in the
database, to enter gel readings that the automatic assembly function
cannot align ("auto assemble"), and to make final edits. Edits to
whole contigs can be made using a mouse-driven editor ("edit
contig").
Editing the sequences is obviously an essential part of
managing a sequencing project. Editing is required when new
sequences are added, when contigs are joined, and when sequences are
corrected. A basic part of the strategy used here is that new gel
readings should be correctly aligned throughout their whole length
when they are entered into the database, and that when contigs are
joined they are edited so that they are well aligned in the region
of overlap. Alignment can be achieved by adding padding characters
to the sequences, and this is the way "auto assemble" operates when
adding new sequences to the database.
In order to search for overlaps that may have been missed or
may be hidden in the "unused data" the function "find internal
joins" can be used.
Generally the users need not concern themselves with how the
relational information is used by the program, but it is necessary
to know how contigs are identified. Because contigs are constantly
being changed and reordered the program identifies them by the
numbers of the gel readings they contain. Whenever users need to
identify a contig they need only know the number or name of one of
the gel readings it contains. Whenever the program asks users to
identify a contig or gel reading they can type its number or its
archive name. If they type its archive name they must precede the
name by a slash "/" symbol to denote that it is a name rather than a
number. E.g. if the archive name is fred.gel with number 99, users
should type /fred.gel or 99 when asked to identify the contig.
Generally, when it asks for the gel reading to be identified, the
program will offer the user a default name, and if the user types
only return, that contig will be accessed. When a database is opened
the default contig will be the longest one, but if another is
accessed, it will subsequently become the current default.
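The convention for identifying a contig or gel reading can be
sketched as follows. The function name parse_identifier and the way
the default is passed in are assumptions made for the example; only
the slash rule and the use of return to accept the default come from
the description above.

```python
def parse_identifier(text, default=None):
    """Interpret what the user typed when asked for a contig or reading.

    Returns ("name", archive_name) if the input starts with a slash,
    ("number", n) if it is a number, and `default` if only return
    was typed.
    """
    text = text.strip()
    if not text:
        return default
    if text.startswith("/"):
        return ("name", text[1:])
    return ("number", int(text))


print(parse_identifier("/fred.gel"))
print(parse_identifier("99"))
```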
Further information is located in the following places. The
database files are described under "open database". The format for
vector and consensus sequences is given under "calculate a
consensus", as are the uncertainty codes used in gel readings.
The digitizer program is used for the initial input of gel
readings and for writing a file of file names. The program uses a
digitizer for data entry. A digitizer is a two-dimensional surface,
such as a light box, on which the coordinates of a special pen are
recorded by a computer whenever the pen is pressed onto it. These
coordinates can be interpreted by a program.
In order to read an autoradiograph placed on the light box the
user need only define the bottom of the four sequencing lanes and
the bases to which they correspond and then use the pen to point
to each successive band progressing up the gel. The program
examines the coordinates of each pen position to see in which of the
four lanes it lies and assigns the corresponding base to be
stored in the computer. Each time the pen tip is depressed to point
to a position on the surface of the digitizer the program sounds
the bell on the terminal to indicate to the user that a point has
been recorded. As the sequence is read the program displays it on
the screen.
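The lane test described above can be sketched in Python. The sketch
assumes that the four lanes are defined by five ascending x
coordinates and that only the x coordinate of the pen decides the
lane; how the digitizer program really stores the lane definitions
is not stated here, and the names base_for_pen and read_gel are
invented for the example.

```python
from bisect import bisect_right


def base_for_pen(x, lane_edges, lane_bases):
    """Return the base for a pen x coordinate, or None outside the lanes.

    lane_edges: five ascending x coordinates bounding the four lanes
    lane_bases: four bases, one per lane from left to right
    """
    if not lane_edges[0] <= x < lane_edges[-1]:
        return None
    return lane_bases[bisect_right(lane_edges, x) - 1]


def read_gel(pen_points, lane_edges, lane_bases):
    """Turn (x, y) pen positions, in the order pointed to, into a sequence."""
    bases = (base_for_pen(x, lane_edges, lane_bases) for x, _y in pen_points)
    return "".join(b for b in bases if b is not None)


edges = [0, 10, 20, 30, 40]
points = [(5, 1), (25, 2), (12, 3), (38, 4), (50, 5)]
print(read_gel(points, edges, "TCGA"))
```

The last point lies outside the lanes and is ignored, so the
sequence read in this example is TGCA.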
@17. TX 1 @Screen against enzymes
Used to compare gel readings against any restriction enzyme
recognition sequences that may have been used during cloning and
which should not be present in the data. Works on single gel
readings or processes batches accessed through files of file names.
The algorithm looks for exact matches to recognition sequences
stored in a file.
The file containing the recognition sequences must be
identified. The user must choose between employing a file of file
names, or typing in the names of individual gel reading files. If a
file of file names is used the program will also create a new file
of file names. When the option has finished operating this new file
will contain the names of all those gel readings that did not match
any of the recognition sequences. Hence it can be used for further
processing of the batch. The recognition sequences should be stored
in a simple text file with one recognition sequence per line.
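The screening step can be sketched as below. For brevity the sketch
works on sequences held in memory rather than on files of file
names, and it searches each gel reading only as given; whether the
program also searches the complementary strand is not stated here.
The name screen_readings is an assumption made for the example.

```python
def screen_readings(readings, sites):
    """Split gel readings into those free of every site and those that match.

    readings: dict mapping gel reading name to its sequence
    sites:    recognition sequences, one per line as in the site file
    Returns (passed, rejected) as two lists of names.
    """
    sites = [s.strip().upper() for s in sites if s.strip()]
    passed, rejected = [], []
    for name, seq in readings.items():
        hit = any(site in seq.upper() for site in sites)
        (rejected if hit else passed).append(name)
    return passed, rejected


readings = {"a.gel": "TTGAATTCAA", "b.gel": "ACGTACGT"}
passed, rejected = screen_readings(readings, ["GAATTC\n"])
print(passed, rejected)
```

The list of names that passed plays the part of the new file of file
names, and can be handed on to the next function.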
@18. TX 1 @Screen against vector
Used to compare gel readings against any vector sequences that
may have been picked up during cloning and which have not been
removed by VEP. It works on single gel readings or processes batches
accessed through files of file names. The algorithm looks for exact
matches of length "minimum match length" and displays the
overlapping sequences.
The file containing the vector sequence must be identified.
The user must choose between employing a file of file names, or
typing in the names of individual gel reading files. If a file of
file names is used the program will also create a new file of file
names. When the option has finished operating this new file will
contain the names of all those gel readings that did not match the
vector sequence. Hence it can be used for further processing of the
batch. The vector sequence should be stored in a simple text file
with up to 80 characters of data per line. More than one vector can
be stored in a single file. If so each should be preceded by a 20
character title of the form <---m13mp8.0001----> where the < and >
signs and the number like .0001 are obligatory. The number must be
preceded by a dot (.) and be 4 digits long. The total sequence in
the file must be < 500,001 characters in length.
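A reader for a vector file of this form might look like the sketch
below. The regular expression, the key "untitled" used for a file
with no title line, and the name parse_vector_file are assumptions
made for the example; the 20 character title, the dot followed by 4
digits and the limit on total length are taken from the description
above.

```python
import re

# A title is 20 characters: '<', dashes, a name ending in a dot and
# four digits, dashes, '>'.
TITLE = re.compile(r"^<-*([^<>]*?\.\d{4})-*>$")
MAX_TOTAL = 500_000


def parse_vector_file(lines):
    """Return a dict mapping each vector title to its sequence."""
    vectors = {}
    current = "untitled"  # hypothetical key for a file with no title line
    for line in lines:
        line = line.strip()
        if not line:
            continue
        match = TITLE.match(line) if len(line) == 20 else None
        if match:
            current = match.group(1)
            vectors[current] = ""
        else:
            vectors[current] = vectors.get(current, "") + line
    if sum(len(seq) for seq in vectors.values()) > MAX_TOTAL:
        raise ValueError("total vector sequence is too long")
    return vectors


example = ["<---m13mp8.0001---->", "ACGT", "TTGA", "<----puc18.0002---->", "GGCC"]
print(parse_vector_file(example))
```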
@20. TX 3 @Auto assemble
Compares gel readings against the current contents of the
database and produces alignments. In its normal mode of operation
("entry permitted"), the function will automatically enter the gel
readings into the database.
New assembly suboption. However if entry is not permitted the
reads won't be entered but the program will produce alignments and
(optionally) save each reading name and its best alignment score
(percentage mismatch) in a file. When used in this mode, the program
will include in the alignment the poor quality data for each
reading. These files of names can then be sorted into score order
and then used for assembly, hence forcing the readings that align
best to be entered into the database first. End of new suboption.
The routine works on single gel readings or processes batches
of gel readings accessed through files of file names. It is the only
way to enter data into the database.
The function will check the database for logical consistency
and will only proceed if it is OK. Choose if gel readings should be
entered into the database, or if they should only be compared.
Choose between using a file of file names or typing file names on
the keyboard. If so selected, supply the file of file names. Also
supply a file of file names to contain the names of all the gel
readings that fail to get entered. Select the entry mode. Normal
assembly is appropriate for all but special cases, as is "permit
joins". Uses for the other modes are not documented here. Define a
minimum initial match length. Define the maximum number of padding
characters allowed to be used in each gel reading to help achieve
alignment, and the same for the number allowed in the contig for
each gel reading. Finally define the maximum percentage mismatch to
be allowed for any gel reading to be entered into the database. If
for any gel reading, either of these last three values is exceeded
the gel reading will not be entered into the database.
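The test applied to each alignment can be sketched as follows. The
default limits are those offered in the program's dialogue (8 pads
per gel, 8 pads per gel in the contig, 8.00 percent mismatch); the
names Limits and may_enter are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    max_pads_in_gel: int = 8
    max_pads_in_contig: int = 8
    max_percent_mismatch: float = 8.0


def may_enter(pads_in_gel, pads_in_contig, percent_mismatch, limits=Limits()):
    """Return True if none of the three limits is exceeded."""
    return (
        pads_in_gel <= limits.max_pads_in_gel
        and pads_in_contig <= limits.max_pads_in_contig
        and percent_mismatch <= limits.max_percent_mismatch
    )


# One pad in the gel, none in the contig, 1.8 percent mismatch.
print(may_enter(1, 0, 1.8))
```

A reading that needs one pad in the gel, none in the contig and
shows 1.8 percent mismatch passes the default limits, so the call
above prints True.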
In operation the function takes a batch of gel readings
(probably passed on as a file of file names from one of the
screening routines) and enters them into a database for a sequencing
project. It takes each gel reading in turn, compares it with the
current consensus for the database, and produces an alignment
for any regions of the consensus it overlaps. If this
alignment is sufficiently good it then edits both the new gel
reading and the sequences it overlaps and adds the new gel
reading to the database. The program then updates the consensus
accordingly and carries on to the next gel reading.
All alignments are displayed and any gel readings that do
match but that cannot be aligned sufficiently well have their names
written to a file of failed gel reading names. The function works
without any user intervention and can process any number of gel
readings in a single run. Those gel readings that fail can be
recompared using the same function (to find the current overlap
position); the user can then enter them into the database using the
"put all sequences in new contigs" assembly option and join them
using "join contigs".
Typical dialogue and output from the function is shown below.
(Note that output for gel readings 2 - 9 has been deleted to save
space).
Automatic sequence assembler
Database is logically consistent
? (y/n) (y) Permit entry
? (y/n) (y) Use file of file names
? File of gel reading names=demo.nam
? File for names of failures=demo.fail
Select entry mode
X 1 Perform normal shotgun assembly
2 Put all sequences in one contig
3 Put all sequences in new contigs
? Selection (1-3) (1) =
? (y/n) (y) Permit joins
? Minimum initial match (12-4097) (15) =
? Maximum pads per gel (0-25) (8) =
? Maximum pads per gel in contig (0-25) (8) =
? Maximum percent mismatch after alignment (0.00-15.00) (8.00) =
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
Processing 1 in batch
Gel reading name=HINW.004
Gel reading length= 283
Searching for overlaps
Strand 1
Strand 2
No matches found
Total matches found 1
Padding in contig= 0 and in gel= 1
Percentage mismatch after alignment = 1.8
Best alignment found
1 11 21 31 41 51
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
********** ********** ********** ********** ********** **********
TTTTCCAGCG TGCGTCTGAC GCTGTCTTGC TTAATGATCT CCATCGTGTG CCTAGGTCTG
1 11 21 31 41 51
61 71 81 91 101 111
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACA-GTTG
********** ********** ********** ********** ********** ***** ****
TTGCGTTGGG CCGAGCCCAA CTTTCCCAAA AACGTATGGA TCTTACTGAC GTACACGTTG
61 71 81 91 101 111
121 131 141 151 161 171
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
********** ********** ********** ********** ********** **********
CTTACCAGCG TGGCTGTCAC GGCGTCAGGC TTCCACTTTA GTCATCGTTC AGTCATTTAT
121 131 141 151 161 171
181 191 201 211 221 231
GCCATGGTGG CCACAGTGAC G-TATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
********** ********** * ******** ********** ********** **********
GCCATGGTGG CCACAGTGAC GCTATTTTGT TTCCTCACGC TCGCTACGTA TCTGTTTGCC
181 191 201 211 221 231
241 251 261 271 281
CGCG--GTGG AATTACAGCG TTCCCTATTG ACGGGCGCAT CCAC
**** **** ********** ** * ***** ********** ****
CGCGACGTGG AATTACAGCG TT,CDTATTG ACGGGCGCAT CCAC
241 251 261 271 281
Batch finished
9 sequences processed
0 sequences entered into database
0 joins made
Note that "auto assemble" cannot align protein sequences.
@28. TX 1 @Highlight disagreements
Used in the later stages of a project to highlight
disagreements between individual gel readings and their consensus
sequences. This display is also available in the contig editor.
Characters that agree with the consensus are shown as : symbols for
the plus strand and . for the minus strand. Characters that disagree
with the consensus are left unchanged and so stand out clearly. The
results of this analysis are written to a file.
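The substitution itself is simple and can be sketched in Python. The
sketch works on one aligned reading and the matching stretch of
consensus held as strings, whereas the option reads a saved display
file; the name highlight is an assumption made for the example.

```python
def highlight(reading, consensus, reverse=False, plus=":", minus="."):
    """Replace characters that agree with the consensus by a marker.

    reverse: True for a reading on the minus strand
    Columns not covered by the reading (spaces) are left as spaces.
    """
    mark = minus if reverse else plus
    return "".join(
        mark if (r == c and r != " ") else r
        for r, c in zip(reading, consensus)
    )


print(highlight("CGCGACGTGG", "CGCG--GTGG"))
print(highlight("ACGT", "ACCT", reverse=True))
```

The first call gives ::::AC::::, leaving the two characters that
disagree with the consensus in view; the second gives ..G. for a
reading on the minus strand.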
Before selecting this option create a file of the display of
the contig to be "highlighted". The option will ask for the name of
this file. Select symbols to denote "agreeing" characters on each
strand, the defaults are : and ., but any others can be used. Supply
the name of a file in which to put the output.
The display file needed as input for this option is created by
selecting "Redirect output", followed immediately by "display
contig", and then "Redirect output" again. The cutoff score used in
the consensus calculation can be set by option "set display
parameters". Note that for the highlight function there is a limit
of 50 for the number of gel readings that are aligned at any
position - i.e. the contig must be less than 51 gel readings deep at
its thickest point. I hope that those performing shotgun sequencing
never reach this limit, but those using the program for comparing
sequence families might.
Typical output from this function is shown below.
210 220 230 240 250
1 HINW.004 :C::::::::::::::::::::::::::::::::::::::::::AC::::
7 HINW.018 :*::::::::::::::::::::::::::::::::::::::::::CA::::
-4 HINW.017 ...............AC....
G-TATTTTGTTTCCTCACGCTCGCTACGTATCTGTTTGCCCGCG--GTGG
260 270 280 290 300
1 HINW.004 ::::::::::::*:D:::::::::::::::::::
7 HINW.018 ::::::::::::::::::::CA:::::T:*:::*::::::::::::CA:
-4 HINW.017 ..............................................A...
3 HINW.009 :::::::::::::::V::::::::::::::::::::::::::::*AV:::
-6 HINW.028 ......................A...
AATTACAGCGTTCCCTATTGACGGGCGCATCCACGCTGATTCTCTT-CTG
@32. TX 3 @Extract gel readings
Used to make copies of the aligned gel readings in a database,
to write them into separate files, and to write a corresponding file
of file names. It operates in two modes: either all gel readings are
extracted, or only those at the ends of contigs.
Choose which mode of operation is required and supply a file
of file names.
The gel readings are given their original names.
If the option is used to extract all the gel readings from a
database, a subsequent run of "auto assemble" can reconstitute a
database which has been corrupted. This rarely occurs and is
usually necessitated by a user employing "alter relationships"
incorrectly without first having made a copy.
@1. TX 0 @Help
Help is available on the following topics :
@2. TX 0 @Quit
This command stops the program and is the only safe way to
terminate a run of the program that has altered the contents of the
database in any way.
@3. TX 1 @Open a database
Opens existing databases or allows new ones to be started. The
function is automatically called into operation when the program is
started but can also be selected from the general menu.
Choose to open an existing database or start a new one, or if
! is typed when the program is first started, enter the program
without opening a database. Supply a project database name, and if
it already exists, the "version". If starting a new database define
the database size and if it is for DNA or protein sequences. The
database size is an initial size for the database. It can be
increased later during the project. It is the sum of the number of
gel readings plus the number of contigs. The current maximum size is
8000.
Database names can have from one to 12 letters and must not
include a full stop (.). The database is made from five separate
files. If the database is called FRED then version 0 of database
FRED comprises files FRED.AR0, FRED.RL0, FRED.SQ0, FRED.TG0 and
FRED.CC0. The version is the last symbol in the file names. Only
this program can read these files. If the "copy database" option is
used it will ask the user to define a new "version".
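The naming scheme can be sketched as follows. The function name
database_files is invented for the example, and the check made on
the name covers only its length and the absence of a full stop.

```python
SUFFIXES = ("AR", "RL", "SQ", "TG", "CC")


def database_files(name, version):
    """Return the five file names for one version of a database."""
    if not 1 <= len(name) <= 12 or "." in name:
        raise ValueError("name must have 1 to 12 characters and no full stop")
    return [f"{name}.{suffix}{version}" for suffix in SUFFIXES]


print(database_files("FRED", 0))
```

For database FRED and version 0 this lists FRED.AR0, FRED.RL0,
FRED.SQ0, FRED.TG0 and FRED.CC0, as given above.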
For normal use the maximum gel reading length is set to 512
characters, but when a database is started the user may choose
lengths of either 512, 1024, 1536..., 4096. Normally the program is
used to handle DNA sequences but many of the functions also work on
protein sequences. The choice of sequence type is made when the
database is started.
The contigs are not stored on the disk as the user sees them
displayed on the screen. Each gel reading is stored with sufficient
information about how it overlaps other gel readings so that the
program can work out how to present them aligned on the screen. We
refer to this extra data as "the relationships" and it is explained
below. The database comprises 5 separate files.
1. a working version of each gel reading. This is the version of
the gel reading that is in the database and initially it is an
exact copy of the original sequence (known as the archive) but it is
edited and manipulated to align it with other gel readings.
2. the file of relationships. This file contains all of the
information that is required to assemble the working versions into
contigs during processing; any manipulations on the data use this
file and it is automatically updated at any time that the
relationships are changed. The information in this file is as
follows:
(A) Facts about each gel reading and its relationship to
others ("gel descriptor lines"):
(a) the number of the gel reading (each gel reading is given a
number as it is entered into the database)
(b) the length of the sequence from this gel reading
(c) the position of the left end of this gel reading relative to
the left end of the contig of which it is a member
(d) the number of the next gel reading to the left of this gel
reading
(e) the number of the next gel reading to the right
(f) the relative strandedness of this gel reading, i.e. whether it
is in the same sense or the complementary sense as its archive.
(B) Facts about each contig ("contig descriptor lines"):
(a) the length of this contig
(b) the number of the leftmost gel reading of this contig
(c) the number of the rightmost gel reading of this contig.
(C) General facts:
(a) the number of gel readings in the database
(b) the number of contigs in the database.
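The relational information listed above can be pictured as two kinds
of record, and a contig can be traversed from its leftmost gel
reading by following the right-hand neighbours. The Python sketch
below is a model of the idea and not the layout of the file: the
field names, the use of 0 to mean "no neighbour" and the use of +1
and -1 for the sense are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class GelRecord:
    number: int
    length: int
    left_position: int    # position of the left end within the contig
    left_neighbour: int   # 0 is assumed to mean "none"
    right_neighbour: int  # 0 is assumed to mean "none"
    sense: int            # +1 same sense as the archive, -1 complemented


@dataclass
class ContigRecord:
    length: int
    leftmost: int
    rightmost: int


def readings_in_contig(contig, gels):
    """Yield gel reading numbers from left to right along a contig."""
    number = contig.leftmost
    while number != 0:
        yield number
        number = gels[number].right_neighbour


gels = {
    6: GelRecord(6, 200, 1, 0, 3, -1),
    3: GelRecord(3, 110, 91, 6, 5, -1),
    5: GelRecord(5, 64, 137, 3, 0, -1),
}
contig = ContigRecord(length=200, leftmost=6, rightmost=5)
print(list(readings_in_contig(contig, gels)))
```

Any gel reading number therefore leads to the whole contig: the
neighbour links give the order, and the left positions give the
alignment.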
3. the file of archive names. This is simply a list of the names
of each of the archive files in the database.
4. the file of tags (annotation). This consists of linked lists of
tag information for each sequence in the database. Tags are
created by the user as annotation, or by xdap as records of edits or
for storing cutoff information. As the number of tags can grow
without limit, so can this file. For each gel there is a header
record, which contains the record number of the start of the linked
list for that gel. On line IDBSIZ there is a record containing
information about the file such as its present length and if there
are any free "tag" slots to be reused in the file.
5. the file of
comments (annotation). This consists of linked lists of comment
fragments. Comments are created by the user as a message attached
to annotation, or by the system to store cutoff information.
Comments are character strings of any length. Comments longer than
40 characters are broken up into fragments, each 40 characters long,
and are chained together in a linked list. As the number of comments
can grow without limit, so can this file.
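The chaining of comment fragments can be sketched as below. The
records are held in a Python list, each as a pair of fragment text
and the index of the next fragment, with None marking the end of the
chain; the real file uses record numbers, and whether the last
fragment is padded out to 40 characters is not stated here. The
names store_comment and load_comment are invented for the example.

```python
FRAGMENT = 40


def store_comment(records, text, size=FRAGMENT):
    """Append the comment to `records` as chained fragments.

    Returns the index of the first fragment.
    """
    pieces = [text[i:i + size] for i in range(0, len(text), size)] or [""]
    head = len(records)
    for offset, piece in enumerate(pieces):
        is_last = offset == len(pieces) - 1
        records.append((piece, None if is_last else head + offset + 1))
    return head


def load_comment(records, head):
    """Follow the chain from `head` and rebuild the comment."""
    parts = []
    index = head
    while index is not None:
        piece, index = records[index]
        parts.append(piece)
    return "".join(parts)


records = []
head = store_comment(records, "x" * 95)
print(len(records), load_comment(records, head) == "x" * 95)
```

A comment of 95 characters occupies three fragments in this sketch,
two of 40 characters and one of 15.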
Structure of the database files
1. The file of relationships
The file contains IDBSIZ lines of data: the general data are
stored on line IDBSIZ; data about gel readings are stored from
line 1 downwards; data about contigs are stored from line IDBSIZ-1
upwards. A database of 500 lines containing 25 gel readings and 4
contigs would have a file of relationships as is shown below.
---------------------------------------------
0 Info about the database size
1 Gel descriptor record
2 " " "
3 " " "
4 " " "
5 " " "
' ' ' '
' ' ' '
25 " " "
26 Empty record
' ' '
' ' '
495 ' '
496 Contig descriptor record
497 " " "
498 " " "
499 " " "
500 Number of gel readings=25, Number of contigs=4
---------------------------------------------
The arrangement of the data in the file of relationships
As each new gel reading is added into the database a new line is
added to the end of the list of gel descriptor lines. If this
new gel reading does not overlap with any gel readings already in
the database a new contig line is added to the top of the list
of contig lines. If it overlaps with one contig then no new contig
line need be added but if it overlaps with two contigs then
these two contigs must be joined and the number of contig lines
will be reduced by one. Then the list of contig lines is compressed
to leave the empty line at the top of the list. Initially the two
types of line will move towards one another but eventually, as
contigs are joined, the contig descriptor lines will move in the
same direction as the gel descriptor lines. At the end of a
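project there should be only one contig line. Because one line is
used for the general data and at least one for a contig, a database
of IDBSIZ lines can hold at most IDBSIZ-2 gel readings; a database
of 1000 lines is thus capable of handling a project of 998 gels.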
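The arrangement of lines can be worked out with a little arithmetic,
sketched below in Python. The sketch follows the description above:
gel descriptor lines from line 1 downwards, contig descriptor lines
from line IDBSIZ-1 upwards, and the general data on line IDBSIZ. The
function names are invented for the example, and max_gels assumes
that a finished project needs one contig line and one general line.

```python
def relationship_layout(idbsiz, n_gels, n_contigs):
    """Return the line ranges used in the file of relationships."""
    first_contig_line = idbsiz - n_contigs
    if n_gels >= first_contig_line:
        raise ValueError("database is full")
    return {
        "gel_lines": range(1, n_gels + 1),
        "empty_lines": range(n_gels + 1, first_contig_line),
        "contig_lines": range(first_contig_line, idbsiz),
        "general_line": idbsiz,
    }


def max_gels(idbsiz):
    """Largest number of gel readings when only one contig remains."""
    return idbsiz - 2


layout = relationship_layout(500, 25, 4)
print(layout["empty_lines"], layout["contig_lines"], max_gels(1000))
```

For a database of 500 lines holding 25 gel readings and 4 contigs
the empty lines run from 26 to 495 and the contig lines from 496 to
499, as in the table above.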
2. Structure of the working versions file
The working versions of gel readings are stored in a file
of NGELS lines each containing MAXGEL characters. Gel reading
number 1 is stored on line 1, gel reading number 2 on line 2 and so
on. NGELS is the current number of readings and MAXGEL the maximum
reading length.
3. Structure of the archive names file
This file has NGELS lines of 16 characters.
4. Structure of the tag file
This file initially starts with IDBSIZ lines, and is expanded
as new tags are created. Information about the length of the file,
and which tag records are reusable is stored on line IDBSIZ. A
database of 500 lines would have a file of tags as shown below.
---------------------------------------------
1 Tag descriptor record
2 " " "
3 " " "
4 " " "
5 " " "
' ' ' '
' ' ' '
497 " " "
498 " " "
499 " " "
500 Length of file=N, Free list=0
501 Tag record
502 " "
503 " "
' ' '
' ' '
N-2 " "
N-1 " "
N Tag record
---------------------------------------------
The arrangement of the data in the tag file
As each new tag is added to the database, a check is made in the
file descriptor record at line IDBSIZ. If the list of reusable
records is 0, the file is extended by one line. Otherwise the new
tag is assigned to the record at the head of the free list. When tags
are deleted, they are added to the free list in the file descriptor
record.
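A minimal sketch of this allocation scheme is given below. The text
does not say how reusable records are chained together, so the sketch
assumes that each free record remembers the next free record and that
the most recently deleted record is reused first. The class and
attribute names are invented for the illustration.

```python
# Hypothetical model of the tag file's free list. The on-disk record
# format is not described in the text, so plain dictionaries are used.

class TagFile:
    def __init__(self, idbsiz: int):
        self.length = idbsiz      # "Length of file", kept on line IDBSIZ
        self.free_head = 0        # "Free list"; 0 means nothing to reuse
        self.records = {}         # line number -> tag data
        self._next_free = {}      # free line -> next free line (assumed)

    def allocate(self, tag) -> int:
        """Store a tag, reusing a freed record if one is available."""
        if self.free_head == 0:
            self.length += 1              # extend the file by one line
            line = self.length
        else:
            line = self.free_head         # take the head of the free list
            self.free_head = self._next_free.pop(line)
        self.records[line] = tag
        return line

    def release(self, line: int) -> None:
        """Delete a tag and put its record on the free list."""
        del self.records[line]
        self._next_free[line] = self.free_head
        self.free_head = line
```

With a database of 500 lines the first two tags go to lines 501 and
502; if the tag on line 501 is deleted, the next tag created reuses
line 501 before the file grows again.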
5. Structure of the comment file
This file initially starts with 1 line, and is expanded as new
annotation is created. Information about the length of the file,
and which comment records are reusable is stored on the first line.
---------------------------------------------
1 Length of file=N, Free list=0
2 Comment fragment
3 " "
4 " "
' ' '
' ' '
N-2 " "
N-1 " "
N Comment fragment
---------------------------------------------
The arrangement of the data in the comment file
As each new comment is added to the database, a check is made in the
file descriptor record at line 1. If the list of reusable records is
0, the file is extended to hold the new comment. Otherwise the new
comment is assigned to records starting with the head of the
free list. When comments are deleted, the discarded records are
added to the free list in the file descriptor record.
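The chaining of 40 character fragments can be sketched as follows.
The sketch only ever extends the file (free list reuse is left out),
and the record representation, a fragment paired with the line number
of the next fragment with 0 ending the chain, is an assumption made
for the illustration.

```python
# Hypothetical sketch: a comment stored as a chain of fragments.
# records maps line number -> (fragment, next_line); line 1 is taken to
# be the file descriptor record, so fragments start at line 2.

FRAGMENT_SIZE = 40


def store_comment(records: dict, text: str) -> int:
    """Append text as a chain of fragments; return the first line used."""
    pieces = [text[i:i + FRAGMENT_SIZE]
              for i in range(0, len(text), FRAGMENT_SIZE)] or [""]
    first = max(records, default=1) + 1
    for offset, piece in enumerate(pieces):
        line = first + offset
        is_last = offset == len(pieces) - 1
        records[line] = (piece, 0 if is_last else line + 1)
    return first


def read_comment(records: dict, head: int) -> str:
    """Follow the chain from head and rebuild the comment."""
    parts = []
    line = head
    while line != 0:
        piece, line = records[line]
        parts.append(piece)
    return "".join(parts)
```

A comment of 95 characters stored in an empty file would occupy lines
2, 3 and 4, the last fragment holding the remaining 15 characters.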
There are various checks within the programs to protect
users from themselves:-
1. All user input is checked for errors - e.g. reference to
non-existent gel readings or contigs, incorrect positions in the
contig or gel readings.
2. Before entering a gel reading the system checks to see if a file
of the same name has already been entered.
3. Join will not allow the circularising of a contig.
5. Users may escape from any point in the program.
6. Help is available from all points in the program.
IT IS ESSENTIAL THAT USERS DO NOT KILL THE PROGRAM WHILE IT IS DOING
ANYTHING THAT INVOLVES CHANGING THE CONTENTS OF THE DATABASE. I.E
DURING AUTO ASSEMBLE, COMPLETE JOIN, COMPLEMENT CONTIG, SAVE EDIT
CONTIG. This could corrupt the database so badly that it is
impossible to fix. The program should always be left using the QUIT
option.
@4. TX 3 @Edit contig
The Contig Editor is a mouse-driven editor that can insert,
delete and change gel reading sequences.
The Contig Editor allows scrolling from one end of a contig to
the other using the scroll bar and scroll buttons. Action of mouse
button presses when the mouse pointer is in the scroll bar:
Middle Mouse Button Set editor position
Left Mouse Button Scroll forward one screenful
Right Mouse Button Scroll backwards one screenful
The four scroll buttons operate as follows:
"<<" Scroll left half a screenful
"<" Scroll left one character
">" Scroll right one character
">>" Scroll right half a screenful
The Editor cursor can be positioned anywhere in the edit
window by moving the mouse pointer over the character of interest,
then pressing the left mouse button. The Editor cursor can also be
moved by using the direction arrow keys.
The editor operates in two main edit modes - Replace and
Insert. Replace allows a character to be replaced by another. Insert
allows characters to be inserted into a gel reading sequence.
Characters are entered by typing them from the keyboard. Only valid
characters are permitted. Characters can be deleted by positioning
the cursor one character to the right, then pressing the delete key.
Normally Insert and Delete apply to the consensus line of the contig
ONLY. This restriction can be overridden by using the "Super Edit"
mode of operation, THOUGH IT IS NOT RECOMMENDED.
Edits can also be performed on the consensus, though they are
restricted to insertion and deletion of padding characters ("*").
These edits also have special meanings. A deletion will delete ALL
characters at the position to the left of the cursor in the contig,
and move the relative positions of all sequences starting to the
right of the cursor position left one character. An insertion will
insert the character typed ("*") into ALL gel reading sequences at
the cursor's position in the contig, and move the relative positions
of all sequences starting to the right of the cursor position right
one character.
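The effect of these two consensus edits on the gel readings can be
sketched as below. The representation of a reading (a start position
in the contig, counting from 1, and a sequence string) is invented for
the illustration, and the treatment of a reading that starts exactly
at the edited position is an assumption: on insertion it is shifted
rather than padded.

```python
# Hypothetical sketch of consensus pad insertion and deletion.
# Each reading is a dict {"start": int, "seq": str}.

def insert_pad(reads: list, pos: int) -> None:
    """Insert '*' at contig position pos in every reading covering it;
    readings starting at or after pos move right by one."""
    for read in reads:
        start, seq = read["start"], read["seq"]
        end = start + len(seq) - 1
        if start >= pos:
            read["start"] = start + 1
        elif end >= pos:
            i = pos - start
            read["seq"] = seq[:i] + "*" + seq[i:]


def delete_column(reads: list, pos: int) -> None:
    """Delete the character at contig position pos from every reading
    covering it; readings starting after pos move left by one."""
    for read in reads:
        start, seq = read["start"], read["seq"]
        end = start + len(seq) - 1
        if start > pos:
            read["start"] = start - 1
        elif end >= pos:
            i = pos - start
            read["seq"] = seq[:i] + seq[i + 1:]
```

Readings that end before the edited position are untouched in both
cases, which is why only the alignment to the right can be disturbed.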
The effect of the last edit can be undone by pressing the
"Undo" button at the top of the editor window.
The cursor will automatically be positioned at the next
problem when the "Find Next Problem" button is selected. The next
problem is where the consensus shows either an ambiguity ("-") or a
pad ("*") character.
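As a rough sketch, the test applied by "Find Next Problem" amounts to
a forward scan of the consensus. The function name and the convention
that the cursor is a position counted from 1 (0 meaning "before the
first base") are assumptions of this example.

```python
# Hypothetical sketch of the "next problem" scan on a consensus string.

def find_next_problem(consensus: str, cursor: int):
    """Return the position (counting from 1) of the next '-' or '*'
    to the right of cursor, or None if there is none."""
    for index in range(cursor, len(consensus)):
        if consensus[index] in "-*":
            return index + 1
    return None
```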
The edits to the contig can be saved by pressing the "Leave
Editor" button and replying "Yes" to the prompt to "Save changes?".
As no changes are made to the working copy of your database until this
point it is possible to abort the editor if the edit session ends up
in an unsatisfactory state (ie if you've stuffed it up!)
Displaying Traces
The original data from which the gel reading sequences were
derived can be seen by double clicking (two quick clicks) with the
middle mouse button on the area of interest. The trace will be
displayed with the point clicked at the centre of the trace
viewport.
All traces that are displayed are maintained in one window,
called the Trace Manager. The Trace Manager will only display four
traces maximum. When four traces are already being managed and a new
one is requested, the one at the top of the Trace Manager is removed
and the new one is added to the bottom. Traces can be removed
individually by using the "quit" button in the panel next to the
trace.
Extending Reads Using Cutoff Information
Sequence data read in from the trace files of automated fluorescent
sequencing machines and processed through the program ted will have the
discarded sequence (vector at start and poor read at end) available
to the contig editor. To display the cutoff information, press the
"Display Cutoff" button at the top of the editor window. The cutoff
sequence appears in grey. This sequence can be incorporated into the
editable sequence, by moving the cutoff position. This is done by
positioning the cursor at the end of the gel sequence, and using
Meta-Left-Arrow and Meta-Right-Arrow to adjust the point of cutoff.
The Meta key is a diamond on the Sun keyboard.
Pop-up menu
A pop-up menu is revealed by depressing the "Control" key on
the keyboard and at the same time pressing the left mouse button.
The menu has the following functions:
Search
Highlight Disagreements
Save Contig
Create Tag
Edit Tag
Delete Tag
Select Oligo
"Highlight Disagreements" simply toggles between the normal display
showing the current base assignments and one in which only those
assignments that differ from the consensus are shown.
"Save Contig" is described above. Searching and operations on tags
are described below.
Searching
Selecting "Search" brings up a window which can remain present
during normal editor operation. The window allows the user to select
the direction of search, the type of search and a value to search
on. The value is entered into the value text window. Then pressing
the "search" button performs the search. If successful, the cursor
is positioned and centred accordingly. An audible tone indicates
failure. Pressing the "ok" button removes the search window. The
search window is automatically removed when the contig editor is
exited.
There are seven different search modes:
1. Search by position
This positions the cursor at the numeric position specified in the
value text window. Eg a value of "1234" causes the cursor to be
placed at base number 1234 in the contig. Positioning within a gel
reading is achieved by prefixing the number with the "@" character,
eg "@123" positions the cursor at base 123 of the sequence in which
the cursor lies. Relative positions can be specified by prefixing
the number with a plus or minus character. Eg "+1234" will advance
the cursor 1234 bases. If possible, the cursor is positioned within
the same sequence. The direction buttons have no effect on the
operation of "search by position".
2. Search by reading name
This positions the cursor at the left end of the gel reading
specified in the value text window. If the value is prefixed with a
slash it is assumed to be a gel reading name. Otherwise it is
assumed to be a gel reading number. Eg "123" positions the cursor at
the left end of gel reading number 123. "/a16a12.s1" positions at
the start of reading a16a12.s1. If the value was "/a16" the cursor
is positioned at the first reading which starts with "a16". The
direction buttons have no effect on the operation of "search by
reading name".
3. Search by tag type.
This positions the cursor at the start of the next tag which has the
same type as specified by the type value menu. To change the
type, select from the menu that pops up when the mouse is clicked on
the button labeled "Type:". The search can be performed either
forwards or backwards from the current cursor position. To find all
tags, use "search by annotation", with a null text value string.
4. Search by annotation.
This positions the cursor at the start of the next tag which has a
comment containing the string specified in the value text window.
The search performed is a regular expression search, and certain
characters have special meaning. Be careful when your value string
contains ".", "*", "[", "^" or "$". The search can be performed
either forwards or backwards from the current cursor position.
5. Search by sequence.
This positions the cursor at the start of the next piece of sequence
that matches the value specified in the text value window. The
search is for an exact match, which means the case of the value string
is important. The search is performed on the gel readings
themselves, rather than the consensus sequence. The search can be
performed either forwards or backwards from the current cursor
position.
6. Search by problem.
This positions the cursor at the next place in the consensus
sequence which is not an "A", "C", "G" or "T". The search can be
performed either forwards or backwards from the current cursor
position.
7. Search by quality
This positions the cursor at the next place in the consensus
sequence where the consensus calculations for the two strands disagree.
When sequence is present on only one strand, the search will stop
at every base. The search can be performed either forwards or
backwards from the current cursor position.
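The value formats accepted by "search by position" and "search by
reading name" can be summarised by a small parser. This is a sketch
of the rules as stated above, not the editor's own code; the function
names and the returned labels are invented.

```python
# Hypothetical parser for the search value text window.
import re


def parse_position(value: str):
    """Classify a 'search by position' value.

    "1234"  -> ("absolute", 1234)    position in the contig
    "@123"  -> ("in_reading", 123)   position in the current reading
    "+1234" -> ("relative", 1234)    advance the cursor
    "-12"   -> ("relative", -12)     move the cursor back
    """
    match = re.fullmatch(r"\s*([@+-]?)(\d+)\s*", value)
    if match is None:
        raise ValueError(f"not a position: {value!r}")
    prefix, digits = match.groups()
    number = int(digits)
    if prefix == "@":
        return ("in_reading", number)
    if prefix == "+":
        return ("relative", number)
    if prefix == "-":
        return ("relative", -number)
    return ("absolute", number)


def parse_reading(value: str):
    """Classify a 'search by reading name' value: a leading slash marks
    a name (or the start of a name), anything else is a number."""
    if value.startswith("/"):
        return ("name_prefix", value[1:])
    return ("number", int(value))
```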
Annotation
Parts of a sequence can be annotated, to record the positions
of primers used for walking, or to mark sites, such as compressions
that have caused problems during sequencing. The consensus sequence
CANNOT be annotated.
To annotate a piece of sequence first select the part of
sequence using the mouse buttons. Use the left mouse button to
position the start of the selection, and while this button is being
held down, move the mouse to extend. The selection can be extended
further using the right mouse button.
To create annotation, invoke the pop-up menu, and select the
"Create Tag" function. A small "tag editor" will appear which allows
you to select the type of the annotation from a pull-down menu, and
specify a comment if desired. To select a new type pull down the
Type menu, and select the entry desired. To enter a comment, simply
type into the text window in the tag editor. The annotation is
created when the "Leave" button on the tag editor is pressed, and is displayed
in the colour defined in the tag database file (TAGDB).
To edit existing annotation, position the cursor with the left
mouse button on the tag, and select "Edit Tag" from the pop-up
menu. This invokes the tag editor, and changes to the type and
comment of the annotation can be made. The tag is updated when the
"Leave" button is pressed.
To delete an existing annotation, position the cursor with the
left mouse button on the tag, and select "Delete Tag" from the
pop-up menu.
NOTE:
As the Contig Editor is a very powerful tool, it is possible
that the alignment of the gel reading sequences has unexpectedly
been disrupted. This can easily happen to parts of the contig that
lie to the right of the screen if excessive use has been made of the
"Super Edit" facility. Until familiar with "Super Edit" it would
benefit the sequencer to quickly scan through the contig after
editing to check that bad alignments have not been created.
Selecting Oligos
----------------
1. Open the oligo selection window, by selecting "Select Oligo" from
the contig editor popup menu.
2. Position the cursor to where you want the oligo to be chosen.
While the oligo selection window is visible, you will still have
complete control over positioning and editing within the contig
editor.
3. Indicate the strand for which you require an oligo. This is done
by toggling the direction arrow ("----->" or "<------"), if
necessary.
4. Press the "Find Oligos" button to find all suitable oligos (See
"Oligo selection" below.) Information for the closest oligo to the
cursor position is given in the output text window. In the contig
editor the position of the oligo is marked by a temporary tag on the
consensus. The window is recentered if the oligo is off the screen.
Selecting "Display Selection Information" will print a short report
on the numbers of oligos considered and rejected during oligo
selection.
5. If this oligo is not suitable (it may have been previously
chosen, and found to be unsuitable by experimentation, say), the
next closest oligo can be viewed by pressing "Select Next".
6. Suitable templates are automatically identified for the currently
displayed oligo (See "Template selection" below.) By default, the
template is that closest to the oligo site. If the choice is not
suitable (it may be known to be a poor quality template, say)
another can be chosen from the "Choose Template for this Oligo"
menu. Templates that do not appear on the menu can be specified by
selecting "other". However, the template must be on the correct
strand and be upstream of the oligo.
7. A tag can be created for the current oligo by pressing the button
"Create a tag for this oligo". The annotation for this tag holds the
name of the template and the oligo primer sequence. There are
fields to allow the user to specify their own primer name
("serial#") and comments ("flags") for this tag. An example of oligo
tag annotation:
serial#=
template=a16a9.s1
sequence=CGTTATGACCTATATTTTGTATG
flags=
8. The oligo selection window is closed when "Create a tag for this
oligo" or "Quit" is selected.
Oligo selection:
----------------
The oligo selection engine is the one used in the program OSP. It is
described in some detail in:
Hillier, L., and Green, P. (1991). "OSP: an oligonucleotide
selection program," PCR Methods and Applications, 1:124-128.
The parameters controlling the selection of oligos can be changed in
the "Oligo Selection Parameters" window. The weights controlling the
scoring of selected oligos can be changed in the "Oligo Selection
Weights" window.
By default, the oligos are selected from a window that extends 40
bases either side of the cursor. The size and location of this
window relative to the cursor position can be changed in the
"Parameters" window.
In xbap oligos are ranked according to their proximity to the cursor
position, rather than by their scores.
Template selection:
-------------------
For simplicity, each reading is considered to represent a template.
In practice, many readings can be made of the same template.
Suitable templates that are identified are those that:
1. are in the appropriate sense,
2. have 5' ends that start upstream of the oligo,
and 3. are sufficiently close to the oligo to be useful.
This last criterion relates to the insert size for the subclones
used for sequencing and the average reading length. A template is
considered useful if a full reading can be made from it, taking into
account both of these factors. The default insert size is 1000
bases, and the default average reading length is 400 bases. These
values can be changed in the "Parameters" window.
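The three criteria can be sketched as a filter over the readings.
The text does not give the exact test for "sufficiently close", so
the rule used here is an assumption: a full reading of the average
length, started at the oligo, must end within one insert size of the
5' end of the template. The names Reading, five_prime and
suitable_templates are invented, and the sign of the reading length
is taken to give its sense, as in "show relationships".

```python
# Hypothetical sketch of template selection for an oligo.
from dataclasses import dataclass


@dataclass
class Reading:
    name: str
    start: int     # left end in the contig, counting from 1
    length: int    # negative if reversed relative to its archive


def five_prime(r: Reading) -> int:
    """Contig position of the 5' end of the reading."""
    return r.start if r.length > 0 else r.start + abs(r.length) - 1


def suitable_templates(readings, oligo_pos, forward,
                       insert_size=1000, read_length=400):
    """Return names of usable templates, closest to the oligo first."""
    chosen = []
    for r in readings:
        if (r.length > 0) != forward:
            continue                      # 1. wrong sense
        end = five_prime(r)
        offset = oligo_pos - end if forward else end - oligo_pos
        if offset < 0:
            continue                      # 2. 5' end not upstream
        if offset + read_length > insert_size:
            continue                      # 3. too far away (assumed rule)
        chosen.append((offset, r.name))
    return [name for _, name in sorted(chosen)]
```

The first name returned corresponds to the default choice described
above, the template closest to the oligo site.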
@5. TX 1 @Display a contig
Used to show the aligned gel readings for any part of a
contig. The number, name and strandedness of each gel reading is
shown and the consensus is written below.
If required identify the contig, and then the start and end
points of the region to display.
The display can be directed to a disk file using "direct
output to disk".
Below is an example showing the left end of a contig from
position 1 to 200. Overlapping this region are gels 6, 3, 5, 17 and
12; 6, 3 and 5 are in reverse orientation to their archives (denoted
by a minus sign). There are a few uncertainty codes and a few
padding characters in the working versions, but the consensus
(shown below each page width) has a definite assignment for almost
every position.
                        10        20        30        40        50
  -6 HINW.010   GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
     CONSENSUS  GCGACGGTCTCGGCACAAAGCCGCTGCGGCGCACCTACCCTTCTCTTATA
                        60        70        80        90       100
  -6 HINW.010   CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCGCGGACACGTC
  -3 HINW.007                                           GGCACA*GTC
     CONSENSUS  CACAAGCGAGCGAGTGGGGCACGGTGACGTGGTCACGCCG-G-ACA-GTC
                       110       120       130       140       150
  -6 HINW.010   GATTAGGAGACGAACTGGGGCG3CGCC*GCTGCTGTGGCAGCGACCGTCG
  -3 HINW.007   GATTAG4AGACGAACTGGGGCGACGCCCG*TGCTGTGGCAGCGACCGTCG
  -5 HINW.009                                       GGCAGCGACCGTCG
  17 HINW.999                                          AGCGACCGTCG
     CONSENSUS  GATTAGGAGACGAACTGGGGCGACGCC-G-TGCTGTGGCAGCGACCGTCG
                       160       170       180       190       200
  -6 HINW.010   TCT*GAGCAGTGTGGGCGCTG*CCGGGCTCGGAGGGCATGAAGTAGAGC*
  -3 HINW.007   TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGGCATGAAGTAGAGC*
  -5 HINW.009   TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGGCATGAAGTAGAGC*
  17 HINW.999   TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
  12 HINW.017                                             GTAGAGC*
     CONSENSUS  TCT*GAGCAGTGTGGGCGCTG-*CGGGCTCGGAGGGCATGAAGTAGAGC*
@6. TX 1 @List a text file
This option allows users to list text files on the screen. It
can be used to read a file containing notes, for checking files
written to disk etc. The user is asked to type the name of the file
to list.
@8. TX 1 @Calculate a consensus
Calculates a consensus sequence either for the whole
database or for selected contigs. The consensus is written to a file
named by the user.
Supply a file name, choose between whole database or selected
contigs.
Symbols for uncertainty in gel readings
In order to record uncertainties when reading gels the
codes shown below can be used. Use of these codes permits us to
extract the maximum amount of data from each gel and yet record any
doubts by choice of code. The program can deal with all of
these codes and any other characters in a sequence are treated
as dash (-) characters.
SYMBOL  MEANING
1       PROBABLY C
2          "     T
3          "     A
4          "     G
D          "     C  POSSIBLY CC
V          "     T     "     TT
B          "     A     "     AA
H          "     G     "     GG
K          "     C     "     C-
L          "     T     "     T-
M          "     A     "     A-
N          "     G     "     G-
R       A OR G
Y       C OR T
5       A OR C
6       G OR T
7       A OR T
8       G OR C
-       A OR G OR C OR T
a       A
c       C
g       G
t       T
*       padding character placed by auto assembler
else    = -
The DNA consensus algorithm
The "calculate consensus" function, the "display contig"
routine and the "show quality" option use the rules outlined here
to calculate a consensus from aligned gel readings. Note that
"display contig" calculates a consensus for each page width it
displays (it does not use the consensus sequence file calculated
by the consensus function).
We have 6 possible symbols in the consensus sequence:
A,C,G,T,* and -. The last symbol is assigned if none of the others
makes up a sufficient proportion of the aligned characters at any
position in the contig. The following calculation is used to decide
which symbol to place in the consensus at each position.
Each uncertainty code contributes a score to one of A,C,G,T,*
and also to the total at each point. Symbols like R and Y which
don't correspond to a single base type contribute only to the total
at each point. The scores are shown below.
definite assignments ie A,C,G,T,B,D,H,V,K,L,M,N,a,c,g,t,* =1
probable assignments ie 1,2,3,4 = 0.75
other uncertainty codes including R,Y,5,6,7,8,- = 0.1
A cutoff score of 51% to 100% is supplied by the user. (When
the program starts this is set to 75%. See "set display
parameters"). At each position in the contig we calculate the total
score for each of the 5 symbols A,C,G,T and * (denote these by Xi,
where i=A,C,G,T or *), and also the sum of these totals (denote this
by S). Then if 100 Xi / S > the cutoff for any i, symbol i is placed
in the consensus; otherwise - is assigned.
Notice that S does not equal the number of times the sequence
has been determined, but is the score total, and hence we are less
likely to put a - in the consensus. For the "examine quality"
algorithm each strand is treated separately but the calculation is
the same. (It was originally different).
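The calculation for a single position can be sketched as follows.
This is a reading of the rules above rather than the program's own
code: the function and table names are invented, the assignment of
D, V, B, H, K, L, M and N to bases is taken from the table of
uncertainty codes, and the comparison with the cutoff is the strict
"greater than" given in the text.

```python
# Hypothetical sketch of the consensus rule for one contig position.

# Codes scoring 1 towards a symbol ("definite assignments").
DEFINITE = {
    "A": "A", "C": "C", "G": "G", "T": "T", "*": "*",
    "a": "A", "c": "C", "g": "G", "t": "T",
    "D": "C", "K": "C", "V": "T", "L": "T",
    "B": "A", "M": "A", "H": "G", "N": "G",
}
# Codes scoring 0.75 towards a symbol ("probable assignments").
PROBABLE = {"1": "C", "2": "T", "3": "A", "4": "G"}


def consensus_symbol(column, cutoff=75.0):
    """Return the consensus symbol for the aligned characters in column."""
    scores = dict.fromkeys("ACGT*", 0.0)   # the totals Xi
    total = 0.0                            # the score total S
    for ch in column:
        if ch in DEFINITE:
            scores[DEFINITE[ch]] += 1.0
            total += 1.0
        elif ch in PROBABLE:
            scores[PROBABLE[ch]] += 0.75
            total += 0.75
        else:
            total += 0.1    # R, Y, 5, 6, 7, 8, - and anything else
    if total == 0.0:
        return "-"
    for symbol, score in scores.items():
        if 100.0 * score / total > cutoff:
            return symbol
    return "-"
```

Because the cutoff is at least 51%, at most one symbol can pass the
test at any position. A whole consensus could be built by applying
the function to each column of the aligned readings in turn.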
Format of the consensus sequence (and vector sequences).
A consensus sequence file may contain the consensus for
several contigs and so we identify each of them by preceding them by
a 20 character title. The title is of the form <---LAMBDA.0076---->
(where LAMBDA is the project name and gel reading number 76 is the
leftmost gel reading to contribute to this consensus sequence).
The angle brackets <> and the 4 digit number preceded by a "." are
important to some processing programs.
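A program reading such a file might recognise the titles as sketched
below. The pattern is an assumption based on the single example
above: it supposes that dashes pad the title to 20 characters and
that the project name itself contains no dash, dot or angle bracket.

```python
# Hypothetical sketch: recognising a consensus title such as
# <---LAMBDA.0076---->
import re

TITLE_LENGTH = 20
TITLE_PATTERN = re.compile(r"<-*([^.<>-]+)\.(\d{4})-*>")


def parse_title(title: str):
    """Return (project name, leftmost gel reading number)."""
    if len(title) != TITLE_LENGTH:
        raise ValueError("a title must be 20 characters long")
    match = TITLE_PATTERN.fullmatch(title)
    if match is None:
        raise ValueError(f"not a consensus title: {title!r}")
    return match.group(1), int(match.group(2))
```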
@25. TX 1 @Show relationships
Used to show the relationships of the gel readings in the
database in three ways -
(a) All contig descriptor lines followed by all gel descriptor
lines.
(b) All contigs one after the other sorted, i.e. for each
contig show its contig descriptor line followed by all its gel
descriptor lines sorted on position from left to right.
(c) Selected contigs: show the contig line and, in order, those
gel readings that cover a user-defined region. Note that this
output can be directed to a disk file by prior selection of
"redirect output".
Below is an example showing a contig from position 1 to 689.
The left gel reading is number 6 and has archive name HINW.010, the
rightmost gel reading is number 2 and has archive name HINW.004.
On each gel descriptor line is shown: the name of the archive
version, the gel number, the position of the left end of the gel
reading relative to the left end of the contig, the length of
the gel reading (if this is negative it means that the gel reading
is in the opposite orientation to its archive), the number of the
gel reading to the left and the number of the gel reading to the
right.
CONTIG LINES
CONTIG LINE  LENGTH          ENDS
                      LEFT  RIGHT
         48     689      6      2
GEL LINES
NAME      NUMBER POSITION LENGTH    NEIGHBOURS
                                   LEFT  RIGHT
HINW.010       6        1   -279      0      3
HINW.007       3       91   -265      6      5
HINW.009       5      137   -299      3     17
HINW.999      17      140    273      5     12
HINW.017      12      193    265     17     18
HINW.031      18      385   -245     12      2
HINW.004       2      401   -289     18      0
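The gel lines in a listing like this are easy to read back into a
program. The sketch below is an illustration only (the names GelLine,
parse_gel_line and contig_length are invented); it shows how the
contig length of 689 follows from the positions and lengths of the
readings, and how the sign of the length gives the orientation.

```python
# Hypothetical sketch: reading gel descriptor lines from the listing.
from typing import NamedTuple


class GelLine(NamedTuple):
    name: str
    number: int
    position: int   # left end relative to the left end of the contig
    length: int     # negative if reversed relative to the archive
    left: int       # left neighbour, 0 if none
    right: int      # right neighbour, 0 if none


def parse_gel_line(line: str) -> GelLine:
    """Split one whitespace-separated gel line into its six fields."""
    name, *numbers = line.split()
    return GelLine(name, *map(int, numbers))


def contig_length(gels) -> int:
    """The contig ends where the rightmost-reaching reading ends."""
    return max(g.position + abs(g.length) - 1 for g in gels)


def is_reversed(gel: GelLine) -> bool:
    """True if the reading is in the opposite sense to its archive."""
    return gel.length < 0
```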
@23. TX 3 @Complement a contig
This function will complement and reverse all of the gel
readings in a contig. It automatically reverses and
complements each gel reading sequence, reorders left and right
neighbours, recalculates relative positions and changes each
strandedness.
The only user input required is to identify the contig
to complement by the number or name of a gel reading it contains.
DO NOT KILL THE PROGRAM DURING THIS STEP!
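The bookkeeping involved can be sketched as follows, leaving aside
the reversal and complementing of the sequences themselves. The
function name and the use of dictionaries are inventions of this
example. Note that the neighbours cannot simply be swapped: readings
are ordered by their left ends, and after complementing the order is
decided by what were their right ends, so the sketch sorts on the new
positions and rebuilds the neighbours from that order.

```python
# Hypothetical sketch of the descriptor changes made when a contig is
# complemented. Each gel is a dict with "number", "position" and
# "length" (negative length = reversed relative to the archive).

def complement_descriptors(gels, contig_len):
    """Return new gel descriptors for the complemented contig."""
    flipped = []
    for g in gels:
        right_end = g["position"] + abs(g["length"]) - 1
        flipped.append({
            "number": g["number"],
            "position": contig_len - right_end + 1,  # old right end
            "length": -g["length"],                  # sense is changed
        })
    flipped.sort(key=lambda g: g["position"])
    for i, g in enumerate(flipped):
        g["left"] = flipped[i - 1]["number"] if i > 0 else 0
        g["right"] = flipped[i + 1]["number"] if i + 1 < len(flipped) else 0
    return flipped
```

Applied to the contig shown under "show relationships", gel reading 2
becomes the left end at position 1 with length 289, and gel readings
5 and 17 change places in the order.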
@22. TX 3 @ Join contigs
This function joins contigs interactively using a mouse driven
editor. The operation of this editor is very similar to the Contig
Editor described in "Edit contig".
It allows the user to align the ends of the two contigs by
editing each contig separately. It is important that the alignment
achieved is correct because once the join is completed the
alignment is fixed. The program needs to know which two contigs to
join.
First identify the two contigs that are to be joined. The
program checks that the two contig numbers are different (it will
not allow circles to be formed!)
The Join Editor consists of two Contig Editors in between
which is sandwiched a disagreement box. This disagreement box shows
exclamation marks to denote mismatches between the two consensuses.
For example, the display will look something like this:
                      1460      1470      1480      1490      1500
  56 HINW.100   TCT*GAGCAGTGTGGGCGCTG*CCGG
  33 HINW.300   TCT*GAGCAGTGTGGGCGCTGC*CGGGCTCGGAGGG
 -25 HINW.090   TCT*GAGCAGTGTGGGCG*T*G*CGGGCTCGGAGGG
  19 HINW.123   TCTCGAGCAGTGTGGGCGCTG**CGGGCTCGGAGGGCATGAAGTAGAGCG
     CONSENSUS  TCTCGAGCAGTGTGGGCGCTG-CCGGGCTCGGAGGGCATGAAGTAGAGCG
     MISMATCH                        !                      !!!!!!
                        10        20        30        40        50
  -6 HINW.010   TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
  -3 HINW.007               TGGGCGCTGCCCGGGCTCGGAGGGCATGAAGT*AGAGC
  -5 HINW.009                             GCTCGGAGGGCATGAAGT*AGAGC
     CONSENSUS  TCTCGAGCAGTGTGGGCGCTGCCCGGGCTCGGAGGGCATGAAGTTAGAGC
The overlap must be of at least one character. Use the scroll
bar and the scroll buttons (`<<', `<', `>' and `>>') for positioning
the relative positions of the two contigs.
The join can be fixed in position by pressing the
`lock' button at the top of the Join Editor. Locking allows the two
contigs to be scrolled as one when using the scroll bar and buttons,
the left ends always in the same position relative to each other.
Once locked, it is best to proceed to the right along the
contigs, inserting padding characters (`*') into the consensuses to
minimise the disagreements.
It is essential that the user aligns the two contigs
throughout the whole region of overlap before completing the join
because it is only at this stage that the two contigs can be edited
independently. Once the join is completed the alignment can only be
altered using the routines supplied by "alter relationships".
The join can be completed by pressing the `Leave Editor'
button. The percentage mismatch is displayed, and the user is
required to confirm that they want to perform the join.
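The contents of the disagreement box, and a mismatch figure of the
kind reported on leaving, can be sketched as below. The text does not
say how the program computes its percentage, so the definition used
here (mismatching positions as a percentage of the overlap) is an
assumption, and the function names are invented.

```python
# Hypothetical sketch of the disagreement box in the Join Editor.

def mismatch_line(top: str, bottom: str) -> str:
    """'!' where the two consensuses differ, ' ' where they agree,
    over the region in which they overlap."""
    return "".join("!" if a != b else " " for a, b in zip(top, bottom))


def percent_mismatch(top: str, bottom: str) -> float:
    """Mismatches as a percentage of the overlap (assumed definition)."""
    marks = mismatch_line(top, bottom)
    if not marks:
        raise ValueError("the overlap must be of at least one character")
    return 100.0 * marks.count("!") / len(marks)
```

For the two consensus lines in the example display there are seven
mismatches in an overlap of 50 characters.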
@24. TX 1 @ Copy the database
Used to make a copy of the database. If required the database
size can be altered using this option. The "version" of a database
is encoded as the last letter in the names of the five files that
contain the database.
Supply a "version" number (the default is version 1), and if
required select a new size for the database. The size of a database
is the number of lines of information it can hold. It needs a line
for each gel reading and another for each contig.
@19. TX 1 @ Check database
Used to perform a check on the logical consistency of the
database. No user intervention is required. If selected "with
dialogue" the program also checks for any sections of the consensus
that contain 15 dashes in 20 characters.
The following relationships are checked:
1. If gel reading A thinks gel reading B is its left neighbour
does B think A is its right neighbour? The error message is
"Hand holding problem for gel reading A"
followed by the gel descriptor lines for gel readings A and B.
2. Are there any contig lines with no left or right end gel
readings? The error message is
"Bad contig line number A"
3. Do the gel readings that are described as left ends on
contig lines agree that they are left ends? The error message is
"The end gel readings of contig A have outward neighbours"
4. Are there gel readings that are in more than one contig?
The error message is
" Gel number A is used N times"
5. Are there gel readings that are not in any contig? The
error message is
" Gel number A is not used"
6. Do the relative positions of gel readings agree with
their position as defined by left and right neighbourliness? The
error message is
" Gel number A with position X is left neighbour of gel number B
with position Y"
7. Are there any loops in contigs? If so no further
checking is done. The error message is
" Loop in contig n no further checking done, but gel reading numbers
follow"
The program then prints the gel reading numbers in the looped
contig up to the start of the loop.
8. Are there any contigs of length <1? The error message is
" The contig on line number x has zero length"
9. Are there any gel readings (used in only one contig) that have
zero length? The error message is
" Gel number N has zero length"
Note that "auto assemble" also uses this logical consistency check
and will only tolerate a "Gel number N is not used" error. Any other
error will cause it to give up.
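Two of these checks (numbers 1 and 6) can be sketched as follows.
The data structure is invented for the example: a dictionary from gel
number to its neighbours and position, with 0 meaning "no neighbour".
The messages follow the wording given above.

```python
# Hypothetical sketch of two logical consistency checks.

def check_relationships(gels: dict) -> list:
    """Return error messages for hand holding and position problems.

    gels maps gel number -> {"left": int, "right": int, "position": int}.
    """
    errors = []
    for number in sorted(gels):
        gel = gels[number]
        left, right = gel["left"], gel["right"]
        # Check 1: my left neighbour must name me as its right neighbour.
        if left and gels.get(left, {}).get("right") != number:
            errors.append(f"Hand holding problem for gel reading {number}")
        # Check 6: a left neighbour may not start to the right.
        if right in gels and gel["position"] > gels[right]["position"]:
            errors.append(
                f"Gel number {number} with position {gel['position']} "
                f"is left neighbour of gel number {right} "
                f"with position {gels[right]['position']}"
            )
    return errors
```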
@29. TX 1 @ Examine quality
Analyses the quality of the data in a contig. It reports on
the proportion of the consensus that is "well determined" and will
display a sequence of symbols that indicate the quality of the
consensus at each position.
Identify the contig to analyse, and the section of interest.
The current consensus calculation cutoff score will be used to
decide if each position is "well determined". In general the quality
of a reading deteriorates along the length of the gel and so it is
also possible to use a length cutoff for the quality calculation.
Only the data from the first section of each reading will be
included in the quality calculation. The length is altered under
"set parameters" and is initially set to the maximum reading length.
A summary showing the percentage of the consensus that falls into
each category of quality is shown. Choose whether or not to have the
quality codes for each position of the consensus displayed. They can
be displayed as either graphics or text.
The quality of the data depends on the number of times it has
been sequenced and the particular uncertainty codes used in each
gel reading. This function divides the data into five categories,
assigning each a symbol or code:
1. Well determined on both strands and they agree. code=0
2. Well determined on the plus strand only. code=1
3. Well determined on the minus strand only. code=2
4. Not well determined on either strand. code=3
5. Well determined on both strands but they disagree. code=4
A position is "well determined" if it is assigned one of the symbols
A,C,G,T when the algorithm described in the section "calculate a
consensus" is applied. The calculation is performed separately for
each strand.
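Given the consensus symbol calculated for each strand at a position,
the five categories can be assigned as sketched below. The function
names are invented, and a strand with no data at the position is
assumed to be represented by a symbol such as "-", which counts as
not well determined.

```python
# Hypothetical sketch of the quality codes 0 to 4.
from collections import Counter

WELL_DETERMINED = {"A", "C", "G", "T"}


def quality_code(plus: str, minus: str) -> int:
    """Quality code from the per-strand consensus symbols."""
    plus_ok = plus in WELL_DETERMINED
    minus_ok = minus in WELL_DETERMINED
    if plus_ok and minus_ok:
        return 0 if plus == minus else 4
    if plus_ok:
        return 1
    if minus_ok:
        return 2
    return 3


def quality_summary(codes) -> dict:
    """Percentage of positions falling into each of the five codes."""
    codes = list(codes)
    counts = Counter(codes)
    return {code: 100.0 * counts[code] / len(codes) for code in range(5)}
```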
If the user chooses to have the data displayed graphically the
following scheme is used. A rectangular box is drawn so that the x
coordinate represents the length of the contig. The box is
notionally divided vertically into 5 possible levels which are given
the y values: -2,-1,0,1,2. The quality codes attributed to each
base position are plotted as rectangles. Each rectangle represents
a region in which the quality codes are identical, so a single base
having a different code from its immediate neighbours will appear as
a very narrow rectangle.
Rectangle bottom and top y values
Quality 0 rectangle from 0 to 0
Quality 1 rectangle from 0 to 1
Quality 2 rectangle from 0 to -1
Quality 3 rectangle from -1 to 1
Quality 4 rectangle from -2 to 2
Obviously a single line at the midheight shows a perfect
sequence.
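The grouping of identical neighbouring codes into rectangles can be
sketched with a run-length pass over the codes. The function name
and the tuple layout are inventions of this example; the y values
are those of the table above.

```python
# Hypothetical sketch: turning per-base quality codes into rectangles.
from itertools import groupby

# Bottom and top y values for each quality code.
Y_RANGE = {0: (0, 0), 1: (0, 1), 2: (0, -1), 3: (-1, 1), 4: (-2, 2)}


def quality_rectangles(codes):
    """Return (first base, last base, bottom y, top y) for each run of
    identical codes, with bases counted from 1."""
    rectangles = []
    position = 1
    for code, run in groupby(codes):
        run_length = len(list(run))
        bottom, top = Y_RANGE[code]
        rectangles.append((position, position + run_length - 1, bottom, top))
        position += run_length
    return rectangles
```

A single base whose code differs from its neighbours gives a
rectangle whose first and last base are the same, which is the very
narrow rectangle mentioned above.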
Typical dialogue is shown below.
41.47% OK on both strands and they agree(0)
55.48% OK on plus strand only(1)
2.08% OK on minus strand only(2)
0.97% Bad on both strands(3)
0.00% OK on both strands but they disagree(4)
? (y/n) (y) Show sequence of codes
10 20 30 40 50
1111111111 1111111111 1111111111 1111111111 1111111111
60 70 80 90 100
1111111111 1111111111 1111111111 3111111111 1111111111
110 120 130 140 150
1111111111 1111131111 1111111111 1111111111 1111111111
160 170 180 190 200
1111111111 1111111111 1111111111 1111111111 1111111133
210 220 230 240 250
1311111111 1111111111 1111111110 0000000000 0000220000
260 270 280 290 300
0000000000 0020000000 2200000202 0002000000 0000222200
@26. TX 3 @ Alter relationships
Used to make what are normally illegal changes to the
database. That is, the normal checks are not done and any item in the
database can be changed independently of all others. Users need to
know what they are doing because it is very easy to make a horrible
mess. Always start by making a copy!
By using the options here users can move one section of a
contig relative to another, break contigs, remove contigs, remove
gel readings, etc. To give flexibility most of the commands do only
one thing. This means that several commands may have to be executed
to complete any change.
The following options are offered:
Cancel
Line change
Check logical consistency
Remove contig
Shift
Move gel reading
Rename gel reading
Break a contig
Remove a gel reading
Alter raw data parameters
1. QUIT returns to the main options of BAP.
3. Line change
allows the user to change the contents of any line in the file of
relationships. The line is selected by number, the program prints
the current line and prompts for the new line.
4. Check logical consistency
5. Remove a contig
This function removes a contig and all its gel readings. The user
specifies any reading in the contig.
6. Shift
allows the user to change all the relative positions of a set of
neighbouring gel readings by some fixed value, i.e. it will shift
related gel readings either left or right. It can therefore be
used to change the alignment of the gel readings in a contig. It
prompts for the number of the first gel reading to shift and then
for the distance to move them (note that a negative value will move
the gel readings left and a positive value right). It then chains
rightwards (i.e. follows right neighbours) and shifts each gel
reading, in turn, up to the end of the contig. (This means that
only those gel readings from the first to shift to the rightmost are
moved). It updates the length of the contig accordingly.
7. Move gel reading
is a function to renumber a gel reading. It moves all the
information about a gel reading on to another line. The user must
specify the number of the gel reading to move and the number of the
line on which to place it. It takes care of all the relationships.
Of course gel readings must not be moved to lines occupied by other
gel readings!
8. Rename gel reading
is a function that is used to change the archive names of gel
readings in the database; it only changes the name in the .ARN
file of the database.
9. Break contig
Occasionally it is necessary to break a contig into two parts
and this can be achieved using this option. The program needs only
the number of a gel reading. This is the gel reading that will
become a left end after the break. That is, the break is made
between this gel reading and its left neighbour. A new contig line
is created so ensure that there is sufficient space in the database.
10. Removing gel readings from contigs
Gel readings can be removed from contigs. If they are
essential for holding the contig together (ie are the only gel
reading covering a particular region), the program will create a new
contig.
11. Alter raw data parameters
Allows the user to edit the individual raw data parameters,
such as the left and right cutoff lengths and the name of the
machine readable trace file. The user must specify the gel line to
modify, and provide new values for the length of the raw sequence
including cutoff lengths, the left cutoff position, the length of
the original working sequence, the machine type, and the name of the
raw data file, where these values change.
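The chaining performed by the Shift option (option 6 above) can be
sketched as follows. This is a Python illustration, not the program's
code: the two dictionaries are hypothetical stand-ins for the file of
relationships, and the use of 0 to mean "no right neighbour" is an
assumption. The sketch does not update the contig length.

```python
def shift_readings(positions, right_neighbour, first, distance):
    """Shift `first` and every reading reached by following right
    neighbours by `distance` (negative = left, positive = right).

    positions:       reading number -> relative position in the contig
    right_neighbour: reading number -> right neighbour number (0 = none)
    Returns the list of reading numbers that were shifted, in order.
    """
    shifted = []
    reading = first
    while reading != 0:
        positions[reading] += distance
        shifted.append(reading)
        reading = right_neighbour[reading]
    return shifted
```

Readings to the left of the first one named are not visited, which is
why only the readings from the first to the rightmost are moved.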
@27. TX 1 @ Set display parameters
Used to redefine the parameters that control the cutoff
employed by the consensus calculation and quality examiner, the
maximum length of each reading to include in the quality
calculation, the line length used by the display function, the text
window length used by the graphics options, and the graphics window
length used by the graphics options.
The default cutoff score is 75%. The default line length is 50
characters. For protein sequences the cutoff is always 100%.
The text window used by the graphics options controls the
amount of sequence listed at the crosshair position. The graphics
window controls the "zoom" function. Both these windows are defined
as the number of bases that should be shown, to both left and right
of the crosshair.
@30. TX 3 @ Shuffle pads
One weakness of the alignment strategy used is that padding
characters are not always aligned by the assembly routine. This
function attempts to align padding characters using a very simple
strategy. It does not solve all pad alignment problems but is a
useful first step during cleaning-up operations.
@10. TX 2 @Clear graphics
Clears graphics from the screen.
@11. TX 2 @Clear text
Clears text from the screen.
@12. TX 2 @Draw a ruler.
This option allows the user to draw a ruler or scale along the
x axis of the screen to help identify the coordinates of points of
interest. The user can define the position of the first base to be
marked (for example if the active region is 1501 to 8000, the user
might wish to mark every 1000th base starting at either 1501 or 2000
- it depends if the user wishes to treat the active region as an
independent unit with its own numbering starting at its left edge,
or as part of the whole sequence). The user can also define the
separation of the ticks on the scale and their height. If required
the labelling routine can be used to add numbers to the ticks.
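The two numbering choices in the example above can be compared with a
short sketch. It is written in Python for illustration only, and the
function name and arguments are hypothetical.

```python
def ruler_ticks(region_start, region_end, first_tick, separation):
    """Return the base positions at which ticks would be drawn.

    Ticks run from first_tick in steps of separation and are kept only
    if they fall inside the active region (inclusive at both ends).
    """
    return [p for p in range(first_tick, region_end + 1, separation)
            if p >= region_start]
```

For the active region 1501 to 8000 with a separation of 1000, starting
at 1501 gives ticks at 1501, 2501, ... 7501, whereas starting at 2000
gives ticks at 2000, 3000, ... 8000.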
@14. TX 2 @Reposition plots
The position of each of the plots is defined relative to a
user's drawing board which has size 1-10,000 in x and 1-10,000 in y.
Plots for each option are drawn in a window defined by x0,y0 and
xlength,ylength, where x0,y0 is the position of the bottom left hand
corner of the window, xlength is the width of the window and
ylength the height of the window.
--------------------------------------------------------- 10,000
1 1
1 -------------------------------------- ^ 1
1 1 1 1 1
1 1 1 1 1
1 1 1 ylength 1
1 1 1 1 1
1 1 1 1 1
1 -------------------------------------- v 1
1 x0,y0^ 1
1 <---------------xlength--------------> 1
--------------------------------------------------------- 1
1 10,000
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "ANALMARG" when
the program is started. Users can have their own file if required.
As all the plots start at the same position in x and have the same
width, x0 and xlength are the same for all options. Generally users
will only want to change the start level of the window y0 and its
height ylength. This option allows users to change window positions
whilst running the program. The routine prompts first for the
number of the option that the user wishes to reposition; then for
the y start and height; then for the x start and length. Note that
changes to the x values affect all options. If the user types only
carriage return for any value it will remain unchanged. Note that,
unlike all the other programs, the boxes used to contain analytical
results (eg plot quality) should not be made to overlap one another,
as the function of the crosshair routine depends on which box the
crosshair is in!
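Before changing window positions it can help to check the proposed
values by hand. The Python sketch below shows one way to do so; it is
not part of the program, the function names are hypothetical, and the
treatment of windows that merely touch (counted here as not
overlapping) is an assumption.

```python
BOARD = 10000  # the drawing board runs from 1 to 10,000 in x and y


def window_fits(x0, y0, xlength, ylength):
    """True if the window lies wholly on the drawing board."""
    return (x0 >= 1 and y0 >= 1
            and x0 + xlength <= BOARD and y0 + ylength <= BOARD)


def windows_overlap(y0_a, ylength_a, y0_b, ylength_b):
    """True if two windows overlap vertically.

    All plots share x0 and xlength, so only the y ranges are compared.
    """
    return y0_a < y0_b + ylength_b and y0_b < y0_a + ylength_a
```

For example, windows with y0=1000, ylength=2000 and y0=3500,
ylength=2000 do not overlap, but moving the second to y0=2500 makes
them overlap.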
@15. TX 2 @Label a diagram
This routine allows users to label any diagrams they have
produced. They are asked to type in a label. When the user types
carriage return to finish typing the label the cross-hair appears on
the screen. The user can position it anywhere on the screen. If the
user types R (for right justify) the label will be written on the
diagram with its right end at the cross-hair position. If the user
types L (for left justify) the label will be written on the diagram
with its left end at the cross hair position. The cross-hair will
then immediately reappear. The user may put the same label on
another part of the diagram in the same way, or hit the space bar
to be asked whether to type in another label.
Typical dialogue follows.
? Menu or option number=15
Type label then drive cross hair to left or right end
of label position then hit "L" to write label left
justified or "R" to write label right justified or
the space bar to quit
? Label=delta gene
missing graphics
? Label=
@16. TX 2 @Display a map
This is disabled!
@7. TX 1 @Redirect output
Used to direct output that would normally appear on the screen
to a file and to create postscript output.
Select redirection of either text or graphics, and supply the
name of the file that the output should be written to.
The results from the next options selected will not appear on
the screen but will be written to the file. When option 7 is
selected again the file will be closed and output will again appear
on the screen.
@13. TX 2 @Use crosshair
This option puts a steerable cross on the screen which the
user drives around by using the arrow keys (or mouse). When the
crosshair is visible a number of options are available if the user
types one of a set of special keyboard characters. Any other
characters will cause an exit from the crosshair option. The special
keys are:
I = Identify the nearest gel reading
Z = Zoom in
Q = plot Quality
S = display the aligned Sequences at the crosshair position
N = list the Names and Numbers of the sequences at the crosshair
In order for any of these special keys to operate, the
crosshair must be in an appropriate display box, and the precise
function of the keys will also depend on which box the crosshair is
in.
If the crosshair is in the "plot all contigs" box, Z will
cause a new box to appear showing all the readings for the nearest
contig; Q will give the same as Z but will also produce an extra box
showing the "quality" plot.
If Z is hit in the "plot single contig" box, the contig will
be zoomed to the current graphics window size. The zoom will be
roughly centred on the crosshair position. Because of this it is
possible to step along a contig by repeatedly zooming with the
crosshair near to one end of the single contig display box. If I is
hit the crosshair must be close to a gel reading line. If Q is hit,
the quality plot will be produced for the region shown in the plot
single contig box. In all cases when the "plot all contigs" box is
shown, a vertical line will bisect the line that represents the
relevant contig, at the current position.
If the crosshair is in the plot quality box only the character
"s" will operate as a special symbol.
The number of bases shown in the N and S options is controlled
by the current graphics text window size, and the size of the zoom
window by the current graphics window size. Both are set by the
parameter setting function of the general menu.
@33. TX 2 @Plot single contig
This option produces a schematic of a selected region of a
single contig by drawing a horizontal line to represent each of its
gel readings. The lines show the relative positions of each reading
and also their sense. The plot is divided vertically into two
sections by a line that is identified by an asterisk drawn at each
end. All lines that lie above this line represent readings that are
in their original sense, all lines below show readings that are in
the complementary sense to their original. By use of the crosshair
function the plot can be stepped through and examined in more
detail. See help on crosshair.
@34. TX 2 @Plot all contigs
This option produces a schematic of all the contigs in a
database. It does this by drawing a horizontal line to represent
each of them. In order to show the ends of each contig it draws the
lines for contigs at alternate heights: the first at height one, the
second at height two, the third at height one, etc. The order of the
contigs in the display is the same as their order in the database.
By use of the crosshair function the plot can be stepped through and
examined in more detail. See help on crosshair.
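The alternating heights can be sketched as follows. This is a Python
illustration only; the function name is hypothetical, and laying the
contigs end to end along x is an assumption about the display.

```python
def contig_lines(contig_lengths):
    """Return (x_start, x_end, height) for each contig.

    Contigs are taken in database order and laid end to end along x
    (assumption). Heights alternate 1, 2, 1, ... so that the ends of
    neighbouring contigs can be told apart.
    """
    lines = []
    x = 1
    for index, length in enumerate(contig_lengths):
        height = 1 if index % 2 == 0 else 2
        lines.append((x, x + length - 1, height))
        x += length
    return lines
```

Three contigs of lengths 100, 50 and 30 would be drawn at heights one,
two and one respectively.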
@31. TX 3 @ Disassemble readings
This function is used to remove a list of readings from a
database, or to create a new contig from a single reading moved from
an existing contig. This latter mode is useful for repositioning a
reading in a repeat: once separated it can be placed in the join
editor and scrolled by the other copies. Removal of sets of
readings works in two modes: 1. A set of adjacent readings in a
contig can be removed by the user naming the two end ones; or 2. A
batch of readings from any number of contigs can be defined by the
user naming a file containing a list of reading names. The program
cleans up the database by moving data to fill up any holes made in
the files.
For both modes of operation the program will ask for a file of
file names. If users create their own file (ie mode 2) each reading
NAME must be on a separate line. For mode 1 the user types the NAMES
of the leftmost and rightmost readings to be removed. They and all
intervening readings will be removed. Note that the routine operates
on reading names - not numbers. For both modes, if necessary, new
contigs will be created.
@35. TX 1 3 @Find internal joins
The purpose of this function is to use data already in the
database to find possible joins between contigs. Joins may have
been missed due to poor data or may have not been made due to
repeated sequences. Where appropriate, it may be possible to find
potential joins by using the "unused data" derived from sequencing
machines.
For all overlaps found when the X version is used, the contig editor
(in join mode) will be called up with the two contigs aligned.
The database is checked for logical consistency. Supply a minimum
initial match length, a minimum alignment block, the maximum pads
per sequence, the maximum percent mismatch after alignment, the
probe length. Choose if clipped data is to be used, if so define the
window size for finding good data and the number of dashes allowed
in the window. Processing will commence. Most of these values are
used in an identical way in the autoassemble function. The others
are defined below.
The program strategy
Take the first contig and calculate its consensus. If clipped data
is being used examine all readings that are in the complementary
orientation, and sufficiently near to the contig's left end, to see
if they have good clipped sequence which, if present, would protrude
from the left end of the contig. If found add the longest such
sequence to the left end of the consensus. Do the same for the right
end by examining readings that are in their original orientation. If
any are found add the longest extension to the right end of the
consensus. Repeat the consensus calculations and extensions for all
contigs hence producing an extended consensus. If clipped data is
not being used simply calculate the consensus for the whole
database. Now look for possible joins by processing the extended
consensus in the following way. Take the last, say 100, bases
(termed the "probe length" by the program) of the rightmost
consensus, and compare it in both orientations with the extended
consensus of all the other contigs. Display any sufficiently good
alignments. Repeat with the left end of the rightmost contig. Do the
same for the ends of all the extended contigs, always comparing only
with the contigs to their left, so that the same matches do not
appear twice.
Good clipped data is defined by sliding a window of "Window size for
good data scan" bases outwards along the sequence and stopping when
"Maximum number of dashes in scan window" or more dashes appear in
the window. Note that it is advisable to have some sort of cutoff
because if we simply take all the data it might be so full of
rubbish that we won't find any good matches. For the same reason it
is worth trying the procedure with different cutoffs. An initial run
using no clipped data is also recommended. Sufficiently good
alignments are defined by criteria equivalent to those used in
autoassemble, however here we only display alignments that pass all
tests.
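The window scan used to define good clipped data can be sketched as
follows. This is a Python illustration under stated assumptions, not
the program's code: the function name is hypothetical, the clipped
sequence is assumed to be ordered outwards from the contig with "-"
marking uncertain bases, and keeping only the bases before the
offending window is an assumption, since the text does not say exactly
where the cut falls.

```python
def good_clipped_length(clipped, window_size, max_dashes):
    """Return how many bases of `clipped` are taken as good data.

    The window slides outwards one base at a time and the scan stops
    at the first window holding `max_dashes` or more dashes.
    """
    last_start = max(len(clipped) - window_size, 0)
    for start in range(0, last_start + 1):
        window = clipped[start:start + window_size]
        if window.count("-") >= max_dashes:
            return start
    # No window was bad enough, so all of the clipped data is kept.
    return len(clipped)
```

Running the function with several values of window_size and
max_dashes shows how the amount of clipped data used changes with the
cutoff, which is the reason for trying the procedure with different
settings.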
Bugs
If a small contig is wholly contained within a larger one, such that
its ends are further than ("Probe length" - "Minimum initial match
length") from the ends of the larger contig, and the consensus for
the small contig lies to the left of the consensus for the large contig,
the overlap will not be discovered. (See the search strategy).
All numbering is relative to base number one in the contig: matches
to the left (i.e. in the clipped data) have negative positions,
matches off the right end of the contig (i.e. in the clipped data)
have positions greater than the contig length. The
convention for reporting the positions of overlaps is as follows: if
neither contig needs to be complemented the positions are as shown.
If the program says "contig x in the - sense" then the positions
shown assume contig x has been complemented. For example in the
results given below the positions for the first overlap are as
reported, but those for the second assume that the contig in the
minus sense (i.e. 443) has been complemented.
Possible join between contig 445 in the + sense and contig 405
Percentage mismatch after alignment = 4.9
412 422 432 442 452 462
405 TTTCCCGACT GGAAAGCGGG CAGTGAGCGC AACGCAATTA ATGTGAG,TT AGCTCACTCA
********* * ******** ***** *** ********** ********** **********
445 -TTCCCGACT G,AAAGCGGG TAGTGA,CGC AACGCAATTA ATGTGAG-TT AGCTCACTCA
-127 -117 -107 -97 -87 -77
472 482 492 502 512
405 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
********** ********** ********** ********** **
445 TTAGGCACCC CAGGCTTTAC ACTTTATGCT TCCGGCTCGT AT
-67 -57 -47 -37 -27
Possible join between contig 443 in the - sense and contig 423
Percentage mismatch after alignment = 10.4
64 74 84 94 104 114
423 ATCGAAGAAA GAAAAGGAGG AGAAGATGAT TTTAAAAATG AAACG-CGAT GTCAGATGGG
**** ***** ********** ********** ****** ** ***** **** *********
443 ATCG,AGAAA GAAAAGGAGG AGAAGATGAT TTTAAA,,TG AAACGACGAT GTCAGATGG,
3610 3620 3630 3640 3650 3660
124 134 144 154 164
423 TTG-ATGAAG TAGAAGTAGG AG-AGGTGGA AGAGAAGAGA GTGGGA
*** ****** ********** ** ******* *** ***** ** **
443 TTGGATGAAG TAGAAGTAGG AGGAGGTGGA ,GAG,AGAGA GTTGG-
3670 3680 3690 3700 3710
@36. TX 3 @Double strand
PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS
IT HAS CURRENTLY HAD VERY LITTLE TESTING.
Uses the cutoff data to change single stranded sections of a
contig into double stranded sections. Data is used carefully to try
and minimise the number of data disagreements created. However it
must be noted that an overall slight degradation in quality will
still occur.
When using this option you will be prompted for a contig and a
region within that contig. The default region is the entire contig.
The option will then search through the region for areas of good
data on one strand and cutoff data on the opposite strand, extending
the cutoff data. The criteria for evaluating the amount of cutoff
data to be used are based upon a maximum number of mismatches and a
score (derived by accumulating points for mismatches (-8),
matches(+1) and insertions (-5) over the length of an alignment).
The defaults are:
maximum mismatches : 6
score for mismatch : -8
score for correct match : +1
score for insertion : -5
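The score accumulation can be sketched as follows. This is a Python
illustration using the default values above; the function name and
the one-letter encoding of alignment columns are hypothetical, and
how the program uses the score to decide where to stop extending is
not described here, so the sketch only computes the two quantities.

```python
MISMATCH, MATCH, INSERTION = -8, 1, -5  # default scores listed above
MAX_MISMATCHES = 6                      # default maximum mismatches


def alignment_score(events):
    """Accumulate the score over an alignment.

    `events` has one character per aligned column (hypothetical
    encoding): "m" = match, "x" = mismatch, "i" = insertion.
    Returns (score, number of mismatches).
    """
    points = {"m": MATCH, "x": MISMATCH, "i": INSERTION}
    score = sum(points[e] for e in events)
    return score, events.count("x")
```

An alignment of 20 matches, one mismatch and one insertion scores
20 - 8 - 5 = 7 with one mismatch, so a single mismatch cancels out
eight matching bases.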
Note that with successive calls to this option it is possible
to double strand more and more data. Naturally however the quality
of the data generated will diminish each time.
@37. TX 3 @Auto-select oligos
PLEASE MAKE A COPY OF THE DATABASE BEFORE USING THIS OPTION AS
IT HAS CURRENTLY HAD VERY LITTLE TESTING.
Generates a file (default "primers") of suggested primers to
use for covering a single stranded section or for walking off the
end of a contig. The file generated contains the gel reading name,
the primer sequence, its offset in the contig and the orientation.
An example file would be :
c81d12.s1 TTGTCTGTAAGCGGATG (@ 6449 ) +
c98a10.s1 ATTATCACTTTACGGGTC (@ 6959 ) +
c81c1.s1 CAAGAAGGCGATAGAAG (@ 7643 ) +
c76a10.s1 CCTCATCCTGTCTCTTG (@ 8441 ) +
c81g4.s1 ATGAAACCTGGGCGTTG (@ 16156 ) +
c91e6.s1 GTTTTCAGATGTCGGAG (@ 18249 ) +
c81e12.s1 GCTACCGTAAAACACTTC (@ 18737 ) +
c93h11.s1 GCTGCTTTTTGTTTTATCC (@ 19158 ) +
c81h6.s1 CTTCCACTTCTTTCTTATC (@ 21210 ) +
c86a12.s1 CGAATGATAAAGACAAATCAG (@ 22122 ) +
c98b1.s1 GCCACTTTATCCGAGAC (@ 3048 ) -
c97c5.s1 GTGTTTTGGGTATATTGTG (@ 3371 ) -
c83d2.s1 CTACACAGAATGAACCC (@ 3768 ) -
c78h10.s1 GGCGGTGAAGATTGAAG (@ 4200 ) -
c98h9.s2dt CTCGTTTAAATTTCAAACTTCC (@ 7419 ) -
c95a9.s1 ATTGGAAGGAAGGAGGG (@ 22996 ) -
c82b4.s1 TGTAGCCGAAATCTTCC (@ 23369 ) -
This is best employed after having previously used the 'Double
strand' option. When selecting the option you will be asked for the
contig, a region within this contig and the file to write the list
of primers to. For each primer suggested a tag is automatically
created containing details of the gel reading name and the sequence.
Preferably the tag will be created on the gel reading from which the
primer was selected. However this is not always possible so failing
that the tag will be on another sequence overlapping the primer
position.
When invoked with the dialogue option you will be asked a
couple more questions relating to the position and size of the
consensus checked for suitable oligos. You will be prompted for the
start and end of a region (default 40-140) at a relative position to
the left of our initial region.
For example:
? Menu or option number=d37
Auto-select oligos
Default Contig identfier=/e97f2.s1
? Contig identfier=
? Start position in contig (1-20942) (1) =10000
? End position in contig (10000-20942) (20942) =11000
Default Name of file for primers=primers
? Name of file for primers=
? Start of oligo choice region (1-1024) (40) =50
? End of oligo choice region (50-1024) (150) =150
This implies that we are going to look for oligos to use as
primers covering the region 10000 to 11000. For each single stranded
section in this region we search for oligos between 50 and 150
bases to the left. So if we had a single stranded section from 10121
to 10295 we would search for oligos in the region 9971 to 10071.
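The arithmetic in this example can be checked with a short sketch. It
is written in Python for illustration only, and the function name is
hypothetical.

```python
def oligo_search_region(section_start, choice_start, choice_end):
    """Return (left, right) consensus positions searched for oligos.

    The region lies between choice_end and choice_start bases to the
    left of the start of a single stranded section.
    """
    return section_start - choice_end, section_start - choice_start
```

oligo_search_region(10121, 50, 150) returns (9971, 10071), the region
quoted above.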
@38. TX 1 @Check assembly
This new function is used for checking the positioning of
assembled readings. It is useful for checking sequences that
contain repeats of length similar to that of a single gel reading.
It takes the poor quality data for each reading and compares it to
the segment of the consensus to which it should align. If the
extension of the read does not match the consensus then the read (or
its neighbours) has probably been assembled into the wrong place.
The program displays the bad alignments. The quality of an
alignment is defined by the percentage mismatch. Naturally the user
should select a value that takes into account the poor quality of
the data being aligned.
When the routine is used from the X version the user is
offered the editor to examine poor alignments. If alignments are
reported as poor, but on inspection are OK, the user can set a tag
so that the poor quality data is ignored on subsequent passes
through the routine. Note however such data will then also be
ignored by the automatic double stranding routine!
The user defines the percentage mismatch; the window size and
number of dashes allowed in the window used for selecting the amount
of the poor data to be employed; can choose to save the names of the
poorly aligned reads in a file; can select an individual contig or
scan the whole database. The file containing the names of the
poorly aligned reads can be used by the disassembly routine to
remove them from the database, and then can be used to reassemble
them. Note that the routine complements each contig twice during
processing.
@39. TX 1 @Find read pairs
This new function is used to check the positions of readings
taken from each end of the same template. For each forward read it
searches for a corresponding reverse reading. The search can be over
the whole database or over a single contig. The results can be
presented graphically for single contig searches and the crosshair
function can be used to identify the readings displayed.
Note that at present the function only knows that two reads
are from the same template by comparing reading names. For our local
projects we use the following naming convention: forward reads are
named abcdefgh.s1 and reverse reads abcdefgh.r1. The program expects
this naming convention and so if it finds read fred.s1 and fred.r1
it assumes they are the forward and reverse reads for template fred.
In the future we will make the routine more general!
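The name matching can be sketched as follows. This is a Python
illustration of the naming convention only, not the program's code,
and the function name is hypothetical.

```python
def find_read_pairs(names):
    """Pair forward (.s1) and reverse (.r1) reads by template name.

    Returns a sorted list of (forward_name, reverse_name) tuples for
    every template that has both a forward and a reverse read.
    """
    forward = {n[:-3] for n in names if n.endswith(".s1")}
    reverse = {n[:-3] for n in names if n.endswith(".r1")}
    return sorted((t + ".s1", t + ".r1") for t in forward & reverse)
```

Given the names fred.s1, fred.r1, k23a1.s1 and i68e6.r1, only the pair
for template fred is reported, because the other two reads have no
partner.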
If a single contig is selected and the output is listed the
program displays two lines for each pair: the first line shows the
reading name, its position and length, and the distance between the
extremities of the two reads; the second line shows the other read
name, its position and length. If there are pairs that are in
separate contigs or are facing away from one another they are listed
after the pairs that face inwards. Is this true?
If the results are plotted the full length of the template is
drawn with arrows indicating the direction of reads and the extent
of each reading. Those reads that have their partner in another
contig are marked by asterisks.
Typical dialogue is shown below.
? Select contigs (y/n) (y) =
Default Contig identifier=/i55d8.s1
? Contig identifier=
? Start position in contig (1-15227) (1) =
? End position in contig (1-15227) (15227) =
? Plot results (y/n) (y) = n
852 k23a1.r1 249 238 1615
806 k23a1.s1 1529 -335
238 i68e6.s1 422 193 1632
868 i68e6.r1 1756 -298
576 k17a2.s1 2370 213 1676
885 k17a2.r1 3790 -256
84 k27g6.s1 3456 291 1777
867 k27g6.r1 4905 -328
453 k01g10.s1 5805 142 1251
881 k01g10.r1 6909 -147
781 i98b8.r1 6754 338 1079
10 i98b8.s1 7653 -180
883 k02d11.r1 7327 276 1597
283 k02d11.s1 8726 -198
269 i68f9.s1 8191 169 1055
777 i68f9.r1 8891 -355
710 i91c6.s1 8245 95 1516
780 i91c6.r1 9403 -358
596 k27d12.s1 136 329 -329
219 k27d12.r1 1 -116
159 k27d11.r1 1830 -263 -263
317 k27d11.s1 2902 343
886 k17g11.r1 7107 -123 -123
647 k17g11.s1 1867 265
851 i69g10.r1 8045 -137 -137
277 i69g10.s1 4658 174
If contigs are not selected the pairs are sorted on their
separations.
? Select contigs (y/n) (y) = n
i68f2.s1 27 1781 1777
i68f2.r1 776 111 1777
k17f6.s1 601 60 1706
k17f6.r1 856 1405 1706
k17a2.s1 576 2370 1676
k17a2.r1 885 3790 1676
k27g3.s1 177 14985 1664
k27g3.r1 889 13564 1664
k27b12.s1 764 1 1086
k27b12.r1 857 932 1086
i98b8.s1 10 7653 1079
i98b8.r1 781 6754 1079
k16a3.s1 748 1276 1070
k16a3.r1 784 472 1070
k17b7.r1 786 14937 18942*
k17b7.s1 787 3601 18942*
k27d12.r1 219 1 15208*
k27d12.s1 596 136 15208*
k01g2.s1 502 87 14754*
k01g2.r1 782 9224 14754*
@ end of help