|
.NPA
|
||
|
.SP 1
|
||
|
.left margin1
|
||
|
@-1. TX 0 @General
|
||
|
.sp
|
||
|
@-2. T 0 @Screen control
|
||
|
.sp
|
||
|
@-2. X 0 @Screen
|
||
|
.sp
|
||
|
@-3. TX 0 @Set parameters
|
||
|
.sp
|
||
|
@-4. TX 0 @Comparison
|
||
|
.sp
|
||
|
@0. TX -1 @SIP
|
||
|
.PARA
|
||
|
This is a program for comparing and aligning nucleic acid or protein
|
||
|
sequences. It can produce optimal alignments using a dynamic
|
||
|
programming algorithm, and has several ways of producing "dot matrix"
|
||
|
diagrams.
|
||
|
.PARA
|
||
|
The following analyses (preceded by their option numbers) are included:
|
||
|
.sp
|
||
|
.para
|
||
|
The program is used on a simple graphics terminal, i.e. a
|
||
|
keyboard with a screen on which points and lines can be drawn.
|
||
|
The
|
||
|
user works at the terminal and produces plots for various
|
||
|
combinations of values for the span length and minimum scores.
|
||
|
However large or small a region the user
|
||
|
elects to compare, the program expands or contracts the diagram
|
||
|
so
|
||
|
that the plot always fills the screen. This allows the user to gain
|
||
|
an overall impression or to "home-in" on particular regions and
|
||
|
examine them in more detail. Having found a region that looks
|
||
|
interesting the user can determine its coordinates in terms of
|
||
|
sequence positions by use of a crosshair facility.
|
||
|
.para
|
||
|
The program has two statistical options to help the user
|
||
|
choose score levels for plotting and to assess the significance of
|
||
|
any similarity found. It can produce a cumulative histogram of
|
||
|
observed scores for the current span length and region and it can
|
||
|
calculate the "double matching probability" of McLachlan.
|
||
|
The
|
||
|
"double matching probability" is the probability of finding
|
||
|
particular scores given two infinitely long sequences of the
|
||
|
composition of those being compared, with the current span
|
||
|
length
|
||
|
and score matrix. By using these options the user can choose to
|
||
|
plot all the matches for which the score exceeds a given
|
||
|
significance level (such as 1%), using either empirical or
|
||
|
theoretical probability values. Generally it is best to begin at a
|
||
|
low level to avoid an overcrowded diagram.
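.para
As a rough illustration of the empirical route, the sketch below (written in Python, which is not the language of the program) builds a cumulative histogram of observed span scores and reads off the lowest score that no more than a given fraction of spans reach. The function names, the plain identity scoring and the random test sequences are assumptions made for the example; the program's own options may differ in detail.

```python
import random


def span_scores(a, b, span):
    """Identity score of every full-length span on every diagonal."""
    scores = []
    for i in range(len(a) - span + 1):
        for j in range(len(b) - span + 1):
            scores.append(sum(a[i + k] == b[j + k] for k in range(span)))
    return scores


def cumulative_histogram(scores, span):
    """Fraction of spans scoring at least s, for s = 0..span."""
    n = len(scores)
    return [sum(x >= s for x in scores) / n for s in range(span + 1)]


def cutoff_for_level(cum, level):
    """Smallest score whose observed frequency is at or below `level`."""
    for s, frac in enumerate(cum):
        if frac <= level:
            return s
    return None  # no score is rare enough at this level


# Hypothetical data: two unrelated random DNA sequences.
rng = random.Random(0)
a = "".join(rng.choice("ACGT") for _ in range(200))
b = "".join(rng.choice("ACGT") for _ in range(200))
cum = cumulative_histogram(span_scores(a, b, 11), 11)
print(cutoff_for_level(cum, 0.01))  # candidate minimum score at the 1% level
```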
|
||
|
.para
|
||
|
If the user finds that the two sequences do contain stretches
|
||
|
of homology he will often want to align the sequences by
|
||
|
inserting
|
||
|
padding characters at deletion points. The program has a
|
||
|
selection
|
||
|
of options for this purpose: it contains an alignment routine; it
|
||
|
can display on the screen the two
|
||
|
sequences, one above the other, with asterisks marking
|
||
|
identities;
|
||
|
it has inbuilt editing functions and can save the aligned sequences
|
||
|
on disk files.
|
||
|
.para
|
||
|
The basic principle of dot matrices was first
|
||
|
described by Gibbs and McIntyre and involves producing a diagram
|
||
|
that contains a representation of all the matches between a pair
|
||
|
of
|
||
|
sequences. This diagram is then scanned by eye and the human
|
||
|
ability to recognise patterns used to detect any similarities that
|
||
|
might be present. The diagram consists of a two dimensional plot
|
||
|
in
|
||
|
which the x axis represents one sequence (A) and the y axis the
|
||
|
other (B). Every point (i,j) on the plane x,y is assigned a score
|
||
|
which corresponds to the level of similarity between sequence
|
||
|
characters A(i) and B(j). In the simplest use of the method a score
|
||
|
of 1 could be assigned to every point (i,j) where A(i) = B(j), and a
|
||
|
score of 0 to every other point. If a plot of the points in the
|
||
|
plane was made in which all scores of 1 were marked with a dot
|
||
|
and
|
||
|
all those of 0 left blank then regions of identity would appear as
|
||
|
diagonal lines. With the comparison displayed in this form the
|
||
|
human eye is very good at detecting regions of homology even if
|
||
|
they
|
||
|
are imperfect. The effects of mismatches, insertions or
|
||
|
deletions
|
||
|
can be seen: matches interrupted by insertions or deletions will
|
||
|
appear as parallel diagonals, and matches interrupted by the odd
|
||
|
mismatching pair of characters will appear as broken collinear
|
||
|
diagonal lines. This diagram is a very useful representation but
|
||
|
simply placing a dot for every identity is of limited value for the
|
||
|
following reasons.
|
||
|
.para
|
||
|
For nucleic acid sequences around 25% of the plot will contain
|
||
|
points and it will often be very difficult to distinguish
|
||
|
significant homologies from chance matches. For proteins
|
||
|
many
|
||
|
significant alignments of sequences contain almost no identities
|
||
|
but
|
||
|
are formed from chemically and structurally similar amino acids
|
||
|
so
|
||
|
that simply looking for identity would be insufficient. What is
|
||
|
required is to first find those points that correspond to fairly
|
||
|
strong local similarities and then to use the diagram of these
|
||
|
points so that the human eye can be used to look for larger scale
|
||
|
homologies. The program uses a number of different algorithms to
|
||
|
calculate the
|
||
|
score for each point and the user defines a minimum score so
|
||
|
that
|
||
|
only those points in the diagram for which the score is at least
|
||
|
this value will be marked with a dot.
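.para
The simplest form of the diagram, and the reason it becomes crowded for nucleic acids, can be seen in a small Python sketch. The helper names are hypothetical and the rendering is plain text, with sequence A along the x axis and B listed from the top down; it is only meant to show the idea, not how the program draws its plots.

```python
import random


def identity_dots(a, b):
    """Set of 0-based (i, j) positions where a[i] == b[j]."""
    return {(i, j) for i, x in enumerate(a) for j, y in enumerate(b) if x == y}


def render(a, b, dots):
    """One text row per character of b, one column per character of a."""
    return "\n".join(
        "".join("*" if (i, j) in dots else "." for i in range(len(a)))
        for j in range(len(b))
    )


print(render("GATTACA", "GATCACA", identity_dots("GATTACA", "GATCACA")))

# For unrelated random DNA about a quarter of all positions match by chance.
rng = random.Random(1)
a = "".join(rng.choice("ACGT") for _ in range(300))
b = "".join(rng.choice("ACGT") for _ in range(300))
density = len(identity_dots(a, b)) / (len(a) * len(b))
print(round(density, 2))
```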
|
||
|
.para
|
||
|
The first scoring method finds the longest uninterrupted sections of
|
||
|
perfect identity i.e.
|
||
|
those that contain no mismatches, insertions or deletions.
|
||
|
Generally this method, termed "the identities algorithm", is of little
|
||
|
value, but runs very quickly.
|
||
|
.para
|
||
|
The
|
||
|
second method looks for sections where a proportion of the
|
||
|
characters in the sequence are similar, again allowing no
|
||
|
insertions
|
||
|
or deletions. For a thorough analysis this method, termed "the
|
||
|
proportional algorithm", is the best.
|
||
|
.para
|
||
|
The original method of this type was first
|
||
|
described by McLachlan and involves calculating a score for
|
||
|
each position in the matrix by summing points found when
|
||
|
looking
|
||
|
forwards and backwards along a diagonal line of a given length.
|
||
|
This length, called the span, must be an odd number so that the dot
|
||
|
marking matches can be precisely placed at its centre.
|
||
|
The algorithm does not simply look for identity but uses a
|
||
|
score matrix that contains scores for every possible pair of
|
||
|
characters. For comparing amino acid sequences we usually
|
||
|
use the score
|
||
|
matrix shown below which was calculated by adding 10 (to make
|
||
|
every term >0) to each term of the relatedness odds matrix MDM78
|
||
|
of
|
||
|
Dayhoff. This matrix MDM78 was calculated by looking at accepted
|
||
|
point mutations in 71 families of closely related proteins and, of
|
||
|
those tested by Dayhoff, was found to be the most powerful
|
||
|
score
|
||
|
matrix for finding distant relationships between amino acid
|
||
|
sequences.
|
||
|
.left margin1
|
||
|
.lit
|
||
|
|
||
|
AMINO ACID SCORE MATRIX
-----------------------

   C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?
C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
|
||
|
.end lit
|
||
|
.para
|
||
|
It is also possible to use other matrices, including an identity matrix for
|
||
|
proteins. For nucleic acids we usually use the matrix shown below.
|
||
|
.lit
|
||
|
|
||
|
DNA SCORE MATRIX

   A  C  G  T  X
A  1  0  0  0  0
C  0  1  0  0  0
G  0  0  1  0  0
T  0  0  0  1  0
X  0  0  0  0  0
|
||
|
.end lit
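.para
The span scoring described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the program's implementation: the function names are hypothetical, positions are 0-based, and the scoring function simply mirrors the DNA matrix (1 for a matching A, C, G or T, 0 otherwise). Any other score matrix could be passed in as the `score` argument.

```python
def dna_score(x, y):
    """Mirror of the DNA score matrix: identities of A, C, G, T score 1."""
    return 1 if x == y and x in "ACGT" else 0


def proportional_dots(a, b, span, cutoff, score=dna_score):
    """Centres (i, j) of diagonal spans whose summed score reaches the cutoff."""
    if span % 2 == 0:
        raise ValueError("span must be odd so that the dot has a centre")
    half = span // 2
    dots = []
    for i in range(half, len(a) - half):
        for j in range(half, len(b) - half):
            # Sum scores looking backwards and forwards along the diagonal.
            total = sum(score(a[i + k], b[j + k]) for k in range(-half, half + 1))
            if total >= cutoff:
                dots.append((i, j))
    return dots


print(proportional_dots("ACGTACGT", "TTGTACAA", span=5, cutoff=4))
```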
|
||
|
.left margin2
|
||
|
.para
|
||
|
Plotting dots at the centres of spans that reach the cutoff leads to a
|
||
|
persistence effect that, to some extent, can be mitigated by a variation
|
||
|
on the method. If, for example, all the high scoring amino acids are
|
||
|
clustered at the left end of a particular diagonal segment, dots will
|
||
|
continue to be plotted to their right until the span score drops below the
|
||
|
cutoff. Instead of plotting a single point for each span that reaches the
|
||
|
cutoff score, the variant method plots points for all the identities that
|
||
|
lie in spans that reach the cutoff. Obviously the persistence effect can be
|
||
|
more pronounced for long spans and low cutoff scores, but note that the
|
||
|
variant method will not plot anything if there are no identities present,
|
||
|
and so similar regions could be missed!
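.para
A short Python sketch may make the variant clearer. It assumes a toy scoring function (2 for an identity, 1 for any pair drawn from I, L and V, 0 otherwise) that stands in for a real score matrix; the names are hypothetical. The second call shows the caveat mentioned above: a span of similar but non-identical residues reaches the cutoff yet contributes no points.

```python
HYDROPHOBIC = set("ILV")  # toy similarity group, an assumption for this example


def toy_score(x, y):
    if x == y:
        return 2
    if x in HYDROPHOBIC and y in HYDROPHOBIC:
        return 1
    return 0


def variant_dots(a, b, span, cutoff, score=toy_score):
    """Identities lying inside any span whose summed score reaches the cutoff."""
    half = span // 2
    offsets = range(-half, half + 1)
    dots = set()
    for i in range(half, len(a) - half):
        for j in range(half, len(b) - half):
            if sum(score(a[i + k], b[j + k]) for k in offsets) >= cutoff:
                dots.update((i + k, j + k) for k in offsets if a[i + k] == b[j + k])
    return dots


print(sorted(variant_dots("ACGTA", "ACCTA", span=5, cutoff=4)))
print(sorted(variant_dots("ILVIL", "VIILV", span=5, cutoff=4)))
```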
|
||
|
.para
|
||
|
A further variant, useful for comparing a sequence against itself, ignores
|
||
|
the main diagonal.
|
||
|
.para
|
||
|
The third comparison method called "quick scan" is really a combination
|
||
|
of the first two, and is similar to the FASTP program of Lipman and
|
||
|
Pearson, but produces a dot matrix diagram. The algorithm is as follows.
|
||
|
The dot matrix positions are found for all words of some minimum length
|
||
|
(obviously length 1 is most sensitive) that are common to both
|
||
|
sequences. Imagine a diagonal line running from corner to corner of the
|
||
|
diagram, at right angles to the diagonals in the dot matrix. The scores
|
||
|
for the common words (according to the current score matrix, e.g.
|
||
|
MDM78) are accumulated at the appropriate positions on
|
||
|
that imaginary line, hence producing a
|
||
|
histogram. The histogram is analysed to find its mean and standard
|
||
|
deviation. The diagonals that lie above some cutoff score (defined in
|
||
|
standard deviation units), are rescanned using the proportional
|
||
|
algorithm, and a diagram produced. The method is very fast, and is also
|
||
|
employed by the library comparison program.
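.para
The steps of the quick scan can be sketched in Python as below. This is an illustration under assumptions rather than the program's code: diagonals are identified by d = i - j, the histogram is summarised with the population standard deviation, and all function names are hypothetical. The rescanning step applies a span score only along the selected diagonals.

```python
from collections import defaultdict
from statistics import mean, pstdev


def diagonal_histogram(a, b, word, score):
    """Accumulate the scores of common words on each diagonal d = i - j."""
    index = defaultdict(list)
    for j in range(len(b) - word + 1):
        index[b[j:j + word]].append(j)
    hist = {d: 0 for d in range(-(len(b) - 1), len(a))}
    for i in range(len(a) - word + 1):
        w = a[i:i + word]
        for j in index.get(w, ()):
            hist[i - j] += sum(score(c, c) for c in w)
    return hist


def high_diagonals(hist, sd_cutoff):
    """Diagonals whose total lies more than sd_cutoff deviations above the mean."""
    values = list(hist.values())
    threshold = mean(values) + sd_cutoff * pstdev(values)
    return sorted(d for d, v in hist.items() if v > threshold)


def rescan(a, b, diagonals, span, cutoff, score):
    """Apply the span score only along the chosen diagonals."""
    half = span // 2
    dots = []
    for d in diagonals:
        for i in range(max(half, d + half), min(len(a), len(b) + d) - half):
            j = i - d
            total = sum(score(a[i + k], b[j + k]) for k in range(-half, half + 1))
            if total >= cutoff:
                dots.append((i, j))
    return dots


def identity(x, y):
    return 1 if x == y else 0


a, b = "CCCCCGATTACA", "GATTACATTTTT"
hist = diagonal_histogram(a, b, word=4, score=identity)
best = high_diagonals(hist, sd_cutoff=3.0)
print(best, rescan(a, b, best, span=5, cutoff=5, score=identity))
```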
|
||
|
.para
|
||
|
The dynamic programming alignment algorithm contained in the program
|
||
|
is based on that of Miller and Myers (). It guarantees to produce
|
||
|
alignments with the optimum score given a score matrix, a gap start
|
||
|
penalty, and a gap extension penalty. That is, starting a gap costs a fixed
|
||
|
penalty (IG) and each residue added to the gap incurs a further penalty
|
||
|
(IH) so that for each gap of length k residues the penalty is IG + k*IH.
|
||
|
Gaps at the ends of sequences incur no penalty.
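.para
The penalty scheme can be made concrete with a small Python sketch. It does not find an optimal alignment; it only scores an alignment that has already been written out with padding characters, applying IG + k*IH to each internal gap and nothing to gaps at the ends. The function names and the use of "-" as the padding character are assumptions for the example.

```python
from itertools import groupby


def gap_penalty(k, ig, ih):
    """Penalty for one gap of k residues: IG to start it plus IH per residue."""
    return ig + k * ih if k > 0 else 0


def alignment_score(top, bottom, score, ig, ih, pad="-"):
    """Score a finished alignment; gaps at either end incur no penalty."""
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have equal length")
    cols = list(zip(top, bottom))
    # Drop leading and trailing columns that contain padding (free end gaps).
    while cols and pad in cols[0]:
        cols.pop(0)
    while cols and pad in cols[-1]:
        cols.pop()
    total = sum(score(x, y) for x, y in cols if pad not in (x, y))
    # Charge each maximal run of padding in either sequence as one gap.
    for row in zip(*cols):
        for is_pad, run in groupby(row, key=lambda c: c == pad):
            if is_pad:
                total -= gap_penalty(len(list(run)), ig, ih)
    return total


def identity(x, y):
    return 1 if x == y else 0


print(alignment_score("ACGT--ACGT", "ACGTTTACGT", identity, ig=3, ih=1))
```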
|
||
|
.para
|
||
|
It is very useful to have the dot matrix methods and the alignment
|
||
|
routine together in the same program because it allows users to produce
|
||
|
a dot matrix diagram to help select which regions of the sequence they
|
||
|
wish to align. Selection is made by use of the crosshair. First the
|
||
|
crosshair is positioned at the bottom left hand end of the segment to be
|
||
|
aligned. The crosshair function is quit and immediately selected again,
|
||
|
the crosshair positioned at the top right of the segment, and the
|
||
|
crosshair function quit. When the alignment routine is selected the
|
||
|
segment will be aligned.
|
||
|
.para
|
||
|
The alignment can replace the original segment of the sequence. By
|
||
|
repeated plotting of dot matrices, followed by alignment, very long
|
||
|
sequences can easily be aligned.
|
||
|
.LEFT MARGIN1
|
||
|
@1. TX 0 @Help
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This option gives online help. The user should select option numbers and
|
||
|
the current documentation will be given.
|
||
|
.PARA
|
||
|
The following analyses (preceded by their option numbers) are included:
|
||
|
.lit
|
||
|
? = Help
|
||
|
! = Quit
|
||
|
3 = read a new sequence
|
||
|
4 = define active region
|
||
|
5 = list the sequence
|
||
|
6 = list a text file
|
||
|
7 = direct output to disk
|
||
|
8 = write active sequence to disk
|
||
|
9 = edit the sequences
|
||
|
10 = clear graphics screen
|
||
|
11 = clear text screen
|
||
|
12 = draw a ruler
|
||
|
13 = use cross hair
|
||
|
14 = reposition plots
|
||
|
15 = label diagram
|
||
|
16 = display a map
|
||
|
17 = apply identities algorithm
|
||
|
18 = apply proportional algorithm
|
||
|
19 = list matching spans
|
||
|
20 = set span length
|
||
|
21 = set proportional score
|
||
|
22 = set identities score
|
||
|
23 = calculate expected scores
|
||
|
24 = calculate observed scores
|
||
|
25 = show current parameter settings
|
||
|
26 = quick scan
|
||
|
27 = draw a /
|
||
|
28 = align the sequences
|
||
|
29 = complement the sequences
|
||
|
30 = switch main diagonal
|
||
|
31 = switch identities
|
||
|
32 = change score matrix
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@2. TX 0 @Quit
|
||
|
.left margin2
|
||
|
.para
|
||
|
This function stops the program.
|
||
|
.left margin1
|
||
|
@3. TX 1 @Read a new sequence
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This option allows users to read in new sequences, browse through annotations,
|
||
|
or search sequence
|
||
|
libraries for keywords. Sequences can be read from "personal"
|
||
|
sequence files or from sequence libraries. These are referred to as the
|
||
|
sequence "source". Personal files can be stored in several formats:
|
||
|
Staden, PIR, EMBL, GENBANK and GCG.
|
||
|
At LMB we use "Staden" format for sequencing and all
|
||
|
the
|
||
|
libraries are stored in their original formats. Note, however, that libraries
|
||
|
such as EMBL or GenBank that are divided into several files (e.g. GenBank has
|
||
|
13 separate files) are indexed as a whole. This means that users do not need
|
||
|
to know which file contains an entry, only which library.
|
||
|
When the user selects to read in a sequence the program first asks for the
|
||
|
sequence "source".
|
||
|
.para
|
||
|
If the user selects "personal" the program will ask for
|
||
|
the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
|
||
|
the file. For PIR format the user will also be required to know the entry
|
||
|
name of the sequence as the file can contain several. For the other formats
|
||
|
only a single entry is expected. The file will be read, its length and
|
||
|
composition will be displayed and the option left.
|
||
|
.para
|
||
|
If the user selects "library" as the sequence source the program will display a
|
||
|
list of available libraries. The programs are capable of handling all current
|
||
|
libraries but which ones are available will vary from site to site. At LMB we
|
||
|
have several libraries and also weekly updates of data gathered between releases.
|
||
|
The program will ask users to select a library and then give a list of options:
|
||
|
.lit
|
||
|
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entrynames from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
|
||
|
.end lit
|
||
|
If "Get a sequence" or "Get annotations" is selected users will be asked to
|
||
|
type the entry name. The option will be left when a sequence is selected or
|
||
|
! is typed. The composition and length will be displayed.
|
||
|
.para
|
||
|
The text index contains all words from feature tables, reference titles,
|
||
|
definition lines, keywords lists and comments, so the text index search
|
||
|
is most useful. It is also the fastest. Up to 5 words can be searched for
|
||
|
at once. The words should be typed separated by spaces, for example
|
||
|
.lit
|
||
|
? Keywords=P53 mouse murine tumo
|
||
|
|
||
|
.end lit
|
||
|
will search for all entries that contain words starting with p53, mouse,
|
||
|
murine and tumo. Only the unique entries that contain ALL words will be
|
||
|
listed. Before listing the matching entries
|
||
|
the program will show the number of 'hits' for each word and ring the bell.
|
||
|
Escape is possible at this point, or after each screenful of entries.
|
||
|
In addition to the entry names the text search displays the primary accession
|
||
|
number, the sequence length and up to 80 characters of description.
|
||
|
(The search of 'titles' is now redundant because the full text index
|
||
|
contains all the title words and the search is much faster. It will probably
|
||
|
be removed from the program.)
|
||
|
All searches are independent of case. Where
|
||
|
possible the program will offer default entry names.
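.para
The behaviour of the text index search (prefix matching, independence of case, a per-word hit count, and listing only entries that contain all words) can be sketched in Python as follows. The index structure, the function name and the entry names E1 to E5 are invented for the illustration and do not reflect the real index files.

```python
def search_index(index, query, max_words=5):
    """index maps each indexed word to the set of entry names containing it."""
    words = query.lower().split()
    if not 1 <= len(words) <= max_words:
        raise ValueError("between 1 and %d words are required" % max_words)
    hits = []
    for w in words:
        entries = set()
        for key, names in index.items():
            if key.lower().startswith(w):  # words need only start with w
                entries |= names
        hits.append((w.upper(), entries))
    # Only entries that contain ALL of the words are listed.
    common = set.intersection(*(e for _, e in hits))
    return [(w, len(e)) for w, e in hits], sorted(common)


# Made-up toy index for demonstration only.
toy_index = {
    "p53": {"E1", "E2"},
    "mouse": {"E2", "E3"},
    "murine": {"E2"},
    "tumour": {"E2", "E4"},
    "tumor": {"E5"},
}
counts, entries = search_index(toy_index, "P53 mouse murine tumo")
for word, n in counts:
    print(word, "hits", n)
print(entries)
```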
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
Select sequence source
|
||
|
X 1 Personal file
|
||
|
2 Sequence library
|
||
|
? Selection (1-2) (1) =
|
||
|
Select sequence file format
|
||
|
X 1 Staden
|
||
|
2 EMBL
|
||
|
3 GenBank
|
||
|
4 PIR
|
||
|
5 GCG
|
||
|
? Selection (1-5) (1) =
|
||
|
? Sequence file name=M13MP7.SEQ
|
||
|
Contig title removed
|
||
|
Sequence length= 7238
|
||
|
Sequence composition
|
||
|
T C A G -
|
||
|
2405. 1539. 1765. 1527. 2.
|
||
|
33.2% 21.3% 24.4% 21.1% 0.0%
|
||
|
.
|
||
|
.
|
||
|
.
|
||
|
|
||
|
|
||
|
Select sequence source
|
||
|
X 1 Personal file
|
||
|
2 Sequence library
|
||
|
? Selection (1-2) (1) =2
|
||
|
Select a library
|
||
|
X 1 EMBL 29 nucleotide library Dec 91
|
||
|
2 SWISSPROT 20 protein library Nov 91
|
||
|
3 PIR 31 protein library Dec 91
|
||
|
4 NRL3D 58 From Brookhaven protein library Dec 91
|
||
|
5 GenBank
|
||
|
? Selection (1-5) (1) =
|
||
|
Library is in EMBL format with indexes
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =5
|
||
|
Search for keywords
|
||
|
? Keywords=P53 mouse
|
||
|
P53 hits 68
|
||
|
MOUSE hits 8180
|
||
|
|
||
|
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
|
||
|
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
|
||
|
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
|
||
|
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
|
||
|
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
|
||
|
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
|
||
|
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
|
||
|
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
|
||
|
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
|
||
|
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
|
||
|
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
|
||
|
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
|
||
|
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
|
||
|
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
|
||
|
22 different entries found
|
||
|
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =4
|
||
|
Search for keywords
|
||
|
? Keywords=alpha
|
||
|
Searching for alpha
|
||
|
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
|
||
|
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
|
||
|
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
|
||
|
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
|
||
|
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
|
||
|
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
|
||
|
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
|
||
|
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
|
||
|
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
|
||
|
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
|
||
|
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
|
||
|
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
|
||
|
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
||
|
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
||
|
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
||
|
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
|
||
|
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
|
||
|
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
|
||
|
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
|
||
|
!
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =3
|
||
|
? Accession number=v00636
|
||
|
Entry name LAMBDA
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =2
|
||
|
Default Entry name=LAMBDA
|
||
|
? Entry name=
|
||
|
ID LAMBDA standard; DNA; PHG; 48502 BP.
|
||
|
XX
|
||
|
AC V00636; J02459; M17233; X00906;
|
||
|
XX
|
||
|
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
|
||
|
DT 09-JUN-1982 (Rel. 1, Created)
|
||
|
XX
|
||
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
||
|
XX
|
||
|
KW circular; coat protein; DNA binding protein; genome;
|
||
|
KW origin of replication.
|
||
|
XX
|
||
|
OS Bacteriophage lambda
|
||
|
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
|
||
|
XX
|
||
|
RN [1]
|
||
|
RP 1-48502
|
||
|
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
|
||
|
RT "Nucleotide sequence of bacteriophage lambda DNA";
|
||
|
RL J. Mol. Biol. 162:729-773(1982).
|
||
|
XX
|
||
|
!
|
||
|
Select a task
|
||
|
X 1 Get a sequence
|
||
|
2 Get annotations
|
||
|
3 Get entry names from accession numbers
|
||
|
4 Search titles for keywords
|
||
|
5 Search text index for keywords
|
||
|
? Selection (1-5) (1) =
|
||
|
Default Entry name=LAMBDA
|
||
|
? Entry name=
|
||
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
||
|
Sequence length 48502
|
||
|
Sequence composition
|
||
|
T C A G -
|
||
|
11988. 11360. 12336. 12818. 0.
|
||
|
24.7% 23.4% 25.4% 26.4% 0.0%
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@4. TX 1 @Define active region
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
For its analytic functions
|
||
|
the program always works on a region of the sequence called the active
|
||
|
region. When a new sequence is read into the program the active region is
|
||
|
automatically set to start at the beginning of the sequence and go
|
||
|
up to the
|
||
|
maximum allowed size of active region the program can
|
||
|
handle. The positions are shown on the screen.
|
||
|
On most machines this will be to the end of the sequence.
|
||
|
This option allows the user to define a different region.
|
||
|
.left margin1
|
||
|
@5. TX 1 @List a sequence
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
The sequence can be listed with line lengths from
|
||
|
10 to 120 in multiples of 10. The output looks like:
|
||
|
.lit
|
||
|
|
||
|
87 97 107 117 127 137
|
||
|
KVKCTGRILE VPVGRGLLGR VVNTLGAPID GKGPLDHDGF SAVEAIAPGV IERQSVDQPV
|
||
|
** * **** *** * ** * * ** * ** *
|
||
|
DVKDLEHPIE VPVGKATLGR IMNVLGEPVD MKGEIGEEER WAIHRAAPSY EELSNSQELL
|
||
|
68 78 88 98 108 118
|
||
|
147 157 167 177 187 197
|
||
|
QTGYKAVDSM IPIGRGQREL IIGDRQTGKT ALAIDAIINQ RDSGIKCIYV AIGQ
|
||
|
** * * * * * * *** * * *
|
||
|
ETGIKVIDLM CPFAKGGKVG LFGGAGVGKT VNMMELIRNI AIEHSGYSVF AGVG
|
||
|
128 138 148 158 168 178
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@6. TX 1 @List a text file
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Allows the user to have a text file displayed on the screen. It will appear
|
||
|
one page at a time.
|
||
|
.left margin1
|
||
|
@7. TX 1 @Direct output to disk
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Used to direct output that would normally appear on the screen to a file.
|
||
|
.para
|
||
|
Select redirection of either text or graphics, and
|
||
|
supply the name of the file that the output should be written to.
|
||
|
.para
|
||
|
The results from the next options selected will not appear on the screen
|
||
|
but will be written to the file. When option 7 is selected again
|
||
|
the file will be
|
||
|
closed and output will again appear on the screen.
|
||
|
.left margin1
|
||
|
@8. TX 1 @Write active region to disk
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This option allows users to
|
||
|
write the current active sequence to a disk file in Staden format.
|
||
|
.left margin1
|
||
|
@9. TX 1 @Edit the sequences
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This function allows the user to insert or delete parts of either sequence
|
||
|
to help align them. The inserted characters are dashes.
|
||
|
.left margin1
|
||
|
@10. TX 2 @Clear graphics
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Clears the screen of both text and graphics.
|
||
|
.left margin1
|
||
|
@11. TX 2 @Clear text
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
Clears only text from the screen.
|
||
|
.left margin1
|
||
|
@12. TX 2 @Draw a ruler
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This option
|
||
|
allows the user to draw a ruler or scale along the axes of the screen to
|
||
|
help identify the coordinates of points of interest. The user can define
|
||
|
the position of the first sequence element to be marked
|
||
|
(for example if the active
|
||
|
region is 1501 to 8000, the user might wish to mark every 1000th
|
||
|
element
|
||
|
starting at either 1501 or 2000 - it depends if the user wishes to treat
|
||
|
the active region as an independent unit with its own numbering starting
|
||
|
at
|
||
|
its left edge, or as part of the whole sequence). The user can also define
|
||
|
the separation of the ticks on the scale and their height. If required the
|
||
|
labelling routine can be used to add numbers to the ticks.
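.para
The choice of first marked element in the example above can be shown with a few lines of Python. The function name and its argument order are hypothetical; it simply lists the sequence positions at which ticks would fall.

```python
def tick_positions(region_start, region_end, first_tick, separation):
    """Sequence positions of the ruler ticks within the active region."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    if not region_start <= first_tick <= region_end:
        raise ValueError("first tick must lie inside the active region")
    return list(range(first_tick, region_end + 1, separation))


# Active region 1501 to 8000, one tick every 1000 elements.
print(tick_positions(1501, 8000, 1501, 1000))  # region numbered from its left edge
print(tick_positions(1501, 8000, 2000, 1000))  # region as part of the whole sequence
```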
|
||
|
.PARA
|
||
|
To escape type !
|
||
|
.left margin1
|
||
|
@13. TX 2 @Use cross hair
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This function puts
|
||
|
a steerable cross on the screen that can be used to find the
|
||
|
coordinates of points in the sequence. The user can move the cross
|
||
|
around using the directional keys; when he hits the space bar the
|
||
|
program will write out the coordinates of the cross in sequence units and
|
||
|
the option will be exited.
|
||
|
.para
|
||
|
If instead,
|
||
|
the user hits a , the position will be displayed but the cross will remain on
|
||
|
the screen.
|
||
|
.para
|
||
|
If a letter s is hit the sequences around the cross hair are displayed as a
|
||
|
short alignment (as shown below) and the cross remains on the screen.
|
||
|
.lit
|
||
|
        97        107
VPVGRGLLGR VVNTLGAPID
****   ***   * ** * *
VPVGKATLGR IMNVLGEPVD
        78         88
|
||
|
|
||
|
.end lit
|
||
|
.PARA
|
||
|
If a letter m is hit the sequences around the cross hair are displayed in
|
||
|
the form of a matrix (as shown below) and the cross remains on the screen.
|
||
|
|
||
|
.lit
|
||
|
|
||
|
 VPVGKATLGRIMNVLGEPVD
|
||
|
D...................DD
|
||
|
I..........I.........I
|
||
|
P.P...............P..P
|
||
|
A.....A..............A
|
||
|
G...G....G......G....G
|
||
|
L.......L......L.....L
|
||
|
T......T.............T
|
||
|
N............N.......N
|
||
|
VV.V..........V....V.V
|
||
|
VV.V..........V....V.V
|
||
|
R.........R..........R
|
||
|
G...G....G......G....G
|
||
|
L.......L......L.....L
|
||
|
L.......L......L.....L
|
||
|
G...G....G......G....G
|
||
|
R.........R..........R
|
||
|
G...G....G......G....G
|
||
|
VV.V..........V....V.V
|
||
|
P.P...............P..P
|
||
|
VV.V..........V....V.V
|
||
|
 VPVGKATLGRIMNVLGEPVD
|
||
|
|
||
|
.end lit
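.para
The matrix display can be reproduced with a few lines of Python, which may help in reading it. In this sketch (the function name is hypothetical) one sequence runs along the top and bottom, the other is written from the bottom upwards at both sides, and each identity is marked with the matching character.

```python
def identity_matrix_view(horizontal, vertical):
    """Text matrix of identities; the vertical sequence reads bottom to top."""
    rows = [" " + horizontal]
    for c in reversed(vertical):
        cells = "".join(c if c == h else "." for h in horizontal)
        rows.append(c + cells + c)
    rows.append(" " + horizontal)
    return "\n".join(rows)


print(identity_matrix_view("VPVGKATLGRIMNVLGEPVD", "VPVGRGLLGRVVNTLGAPID"))
```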
|
||
|
.para
|
||
|
The function is also used prior to "align sequences" in order to delineate the
|
||
|
region to be aligned. The crosshair is positioned at the bottom left of the
|
||
|
region and the crosshair option quit. Then the crosshair option is selected
|
||
|
again, and the crosshair moved to the top right of the region to be
|
||
|
aligned.
|
||
|
.left margin1
|
||
|
@14. TX 2 @Reposition plots
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
The position of the plots is defined relative to a user's drawing
|
||
|
board which has size 1-10,000 in x and 1-10,000 in y.
|
||
|
Plots
|
||
|
are drawn in a window defined by x0,y0 and xlength,ylength.
|
||
|
Here x0,y0 is the position of the bottom left hand corner of the window,
|
||
|
and xlength is the width of the window and ylength the
|
||
|
height of the window.
|
||
|
.lit
|
||
|
--------------------------------------------------------- 10,000
1                                                       1
1    --------------------------------------    ^        1
1    1                                    1    1        1
1    1                                    1    1        1
1    1                                    1 ylength     1
1    1                                    1    1        1
1    1                                    1    1        1
1    --------------------------------------    v        1
1    x0,y0^                                             1
1    <---------------xlength-------------->             1
--------------------------------------------------------- 1
1                                                  10,000
|
||
|
|
||
|
.end lit
|
||
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
||
|
The default window positions are read from a file "DIAGMARG" when the
|
||
|
program is started. Users can have their own file if required.
|
||
|
This option
|
||
|
allows users to change window positions whilst running the program.
|
||
|
If the user
|
||
|
types only carriage return for any value it will remain unchanged.
|
||
|
The cross-hair can be used to choose suitable heights.
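.para
How a window in drawing board units relates to sequence positions can be sketched as below. This Python fragment is an assumption-laden illustration: it supposes a simple linear mapping of the active regions onto the window (so the plot fills the window whatever the size of the region), and the function names and argument layout are hypothetical.

```python
def to_board(pos_a, pos_b, region_a, region_b, window):
    """Map sequence positions to drawing board coordinates (units 1-10,000)."""
    x0, y0, xlength, ylength = window
    fx = (pos_a - region_a[0]) / max(region_a[1] - region_a[0], 1)
    fy = (pos_b - region_b[0]) / max(region_b[1] - region_b[0], 1)
    return x0 + fx * xlength, y0 + fy * ylength


def to_sequence(x, y, region_a, region_b, window):
    """Inverse mapping, as a crosshair read-out would need."""
    x0, y0, xlength, ylength = window
    pos_a = region_a[0] + round((x - x0) / xlength * (region_a[1] - region_a[0]))
    pos_b = region_b[0] + round((y - y0) / ylength * (region_b[1] - region_b[0]))
    return pos_a, pos_b


window = (1000, 1000, 8000, 8000)  # x0, y0, xlength, ylength (example values)
region_a, region_b = (1501, 8000), (1, 100)
print(to_board(1501, 1, region_a, region_b, window))
print(to_board(8000, 100, region_a, region_b, window))
```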
|
||
|
.LEFT MARGIN1
|
||
|
@15. TX 2 @Label a diagram
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This routine allows users to label any diagrams they have produced. They
|
||
|
are asked to type in a label. When the user types carriage return to finish
|
||
|
typing the label the cross-hair appears on the screen. The user can
|
||
|
position it anywhere on the screen. If the user types R (for right justify)
|
||
|
the label will be
|
||
|
written on the diagram with its right end at the cross-hair position.
|
||
|
If the user types L (for left justify) the label will be written with its
|
||
|
left end at the cross-hair position.
The cross-hair will then immediately reappear. The user may put the same
label on another part of the diagram as before or, by hitting the space
bar, be asked whether to type in another label.
|
||
|
.left margin1
|
||
|
@16. TX 2 @Display a map
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
NOT AVAILABLE.
|
||
|
This draws a map
|
||
|
of any sequence features selected by the user.
|
||
|
These features may be protein coding regions (CDS), tRNA genes (TRNA),
|
||
|
promoter positions (PRM), etc. Users may define their own feature table
|
||
|
key
|
||
|
names.
|
||
|
The coordinates must be stored in a file in the format of an EMBL feature
|
||
|
table.
|
||
|
.left margin1
|
||
|
@17. TX 4 @Apply identities algorithm
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
The identities algorithm finds runs of identical characters
|
||
|
in the two sequences. Its main value is speed, being hundreds of times faster than
|
||
|
the proportional algorithm. It is of course not very sensitive, and should
|
||
|
only be used for a quick scan. The cutoff score is the minimum number of
|
||
|
consecutive matching characters.
|
||
|
All runs of identical characters that are at least as long as the cutoff
|
||
|
score will produce a dot on the screen.
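.para
The idea can be sketched as follows. This is only an illustration under
stated assumptions: the function name identity_runs is hypothetical, each
maximal run is reported once with its 1-based start positions, and the
simple scan shown here is not the fast method the program uses.
.lit
```python
def identity_runs(seq_h, seq_v, cutoff):
    """Return (i, j, length) for each maximal diagonal run of identical
    characters whose length is at least cutoff.

    i and j are the 1-based start positions in seq_h and seq_v.
    """
    runs = []
    n, m = len(seq_h), len(seq_v)
    for d in range(-(m - 1), n):            # d = i - j labels a diagonal
        i, j = max(d, 0), max(-d, 0)
        run = 0
        while i < n and j < m:
            if seq_h[i] == seq_v[j]:
                run += 1
            else:
                if run >= cutoff:
                    runs.append((i - run + 1, j - run + 1, run))
                run = 0
            i += 1
            j += 1
        if run >= cutoff:                   # run that reaches the edge
            runs.append((i - run + 1, j - run + 1, run))
    return runs


print(identity_runs("ABCDE", "XBCDY", 3))
```
.end lit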
|
||
|
.para
|
||
|
See also quick scan.
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
? Menu or option number=d17
|
||
|
? Identity score (1-20) (2) =3
|
||
|
Working
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@18. TX 4 @Apply proportional algorithm
|
||
|
.para
|
||
|
This method, generally the most useful, was first
|
||
|
described by McLachlan and involves calculating a score for
|
||
|
each position in the matrix by summing points found when
|
||
|
looking
|
||
|
forwards and backwards along a diagonal line of a given length.
|
||
|
This length, called the span, must be an odd number.
|
||
|
The algorithm does not simply look for identity but uses a
|
||
|
score matrix that contains scores for every possible pair of
|
||
|
characters. At each point that a threshold score is achieved the
|
||
|
program marks the screen in one of two ways. It will either place a
|
||
|
single
|
||
|
dot at the position corresponding to the centre of the matching span, or
|
||
|
it
|
||
|
will plot a dot for each identical residue within each matching span.
|
||
|
Alternatively, the "list matching spans"
|
||
|
option will list the segments that match.
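.para
A minimal sketch of the span scoring described above is given here. It
assumes that a point qualifies when its summed score is at least the
threshold, and that only complete spans are scored. The names
proportional_scan and identity are hypothetical, and the identity score
function stands in for a full score matrix such as MDM78.
.lit
```python
def proportional_scan(seq_h, seq_v, span, cutoff, score):
    """Return (i, j, total) for each span centre whose summed score
    reaches cutoff. i and j are 1-based centre positions and
    score(a, b) returns the score for one pair of characters."""
    if span % 2 == 0:
        raise ValueError("span must be an odd number")
    half = span // 2
    hits = []
    for i in range(half, len(seq_h) - half):
        for j in range(half, len(seq_v) - half):
            # Look backwards and forwards along the diagonal.
            total = sum(score(seq_h[i + k], seq_v[j + k])
                        for k in range(-half, half + 1))
            if total >= cutoff:
                hits.append((i + 1, j + 1, total))
    return hits


def identity(a, b):
    """Identity score matrix: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0


print(proportional_scan("ACGTACGT", "TTGTAAAA", 3, 3, identity))
```
.end lit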
|
||
|
.para
|
||
|
For comparing amino acid sequences we usually use the score
|
||
|
matrix shown below which was calculated by adding 10 (to make
|
||
|
every term >0) to each term of the relatedness odds matrix MDM78
|
||
|
of
|
||
|
Dayhoff. This matrix MDM78 was calculated by looking at accepted
|
||
|
point mutations in 71 families of closely related proteins and, of
|
||
|
those tested by Dayhoff, was found to be the most powerful
|
||
|
score
|
||
|
matrix for finding distant relationships between amino acid
|
||
|
sequences.
|
||
|
.left margin1
|
||
|
.lit
|
||
|
|
||
|
AMINO ACID SCORE MATRIX
|
||
|
-----------------------
|
||
|
|
||
|
   C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?
C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
|
||
|
.end lit
|
||
|
One alternative for proteins is to use an identity matrix.
|
||
|
For comparing nucleic acids we usually use the matrix shown below.
|
||
|
.lit
|
||
|
|
||
|
DNA SCORE MATRIX
|
||
|
|
||
|
   A  C  G  T  X
A  1  0  0  0  0
C  0  1  0  0  0
G  0  0  1  0  0
T  0  0  0  1  0
X  0  0  0  0  0
|
||
|
.end lit
|
||
|
See option 32 for how to change the score matrices.
|
||
|
.para
|
||
|
When a sequence is compared against itself to look for repeats it is
|
||
|
possible to use the proportional algorithm in a mode such that the main
|
||
|
diagonal is not shown. See option 30.
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
|
||
|
? Menu or option number=d18
|
||
|
? Odd span length (1-401) (11) =
|
||
|
? Proportional score (1-297) (132) =
|
||
|
Working
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@19. TX 4 @List matching spans
|
||
|
.LEFT MARGIN2
|
||
|
This option applies the proportional algorithm using the current span and
|
||
|
cut-off score, but instead of drawing a dot matrix it lists all the
|
||
|
matching spans. When a sequence is compared against itself to look for
|
||
|
repeats it is
|
||
|
possible to use this algorithm in a mode such that the main
|
||
|
diagonal is not listed. See option 30.
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
? Menu or option number=d19
|
||
|
? Odd span length (1-401) (11) =
|
||
|
? Proportional score (1-297) (132) =148
|
||
|
List matching spans
|
||
|
Working
|
||
|
76
|
||
|
IEVPVGKATLG
|
||
|
LEVPVGRGLLG
|
||
|
95
|
||
|
77
|
||
|
EVPVGKATLGR
|
||
|
EVPVGRGLLGR
|
||
|
96
|
||
|
78
|
||
|
VPVGKATLGRI
|
||
|
VPVGRGLLGRV
|
||
|
97
|
||
|
79
|
||
|
PVGKATLGRIM
|
||
|
PVGRGLLGRVV
|
||
|
98
|
||
|
85
|
||
|
LGRIMNVLGEP
|
||
|
LGRVVNTLGAP
|
||
|
104
|
||
|
86
|
||
|
GRIMNVLGEPV
|
||
|
GRVVNTLGAPI
|
||
|
105
|
||
|
87
|
||
|
RIMNVLGEPVD
|
||
|
RVVNTLGAPID
|
||
|
106
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@20. TX 3 @Set span length
|
||
|
.para
|
||
|
The proportional algorithm
|
||
|
calculates a score for
|
||
|
each position in the matrix by summing the
|
||
|
points found when looking
|
||
|
forwards and backwards along a diagonal line of a given length.
|
||
|
This length, called the span, should be an odd number so that the
|
||
|
score for any point is correctly positioned at the centre of the
|
||
|
span. This option allows the user to define the span length. It
|
||
|
should be noted that short spans can produce noisy diagrams, but are less
|
||
|
affected by insertions and deletions than are long spans. However, long
spans can detect more distant relationships. Long spans can suffer from a
persistence problem by plotting dots when all the "signal" is to one side
of the span's central position. To help avoid this, the option that plots
the position of all matching residues within a matching span can be tried.
|
||
|
This is most useful if an identity matrix is being used.
|
||
|
.left margin1
|
||
|
@21. TX 3 @Set proportional score
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
The proportional algorithm
|
||
|
calculates a score for
|
||
|
each position in the matrix by summing the
|
||
|
scores for the individual pairs of residues found when looking
|
||
|
forwards and backwards along a diagonal line of a given length.
|
||
|
All points at which the proportional score is achieved will produce a dot
|
||
|
on the diagram. (The same score is used for the 'LIST MATCHING SPANS'
|
||
|
option.)
|
||
|
.para
|
||
|
Before choosing a score the user can apply the routine that will calculate
|
||
|
the expected score, or can calculate a histogram of observed scores. It is
|
||
|
best to start with a high score to avoid an overcrowded diagram.
|
||
|
.left margin1
|
||
|
@22. TX 3 @Set identities score
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
The identities algorithm is of limited value as it only finds runs of
|
||
|
matching characters; however, it has the virtue of being very fast.
|
||
|
This option allows the user to set the minimum length
|
||
|
of run that will produce a dot on the screen.
|
||
|
.left margin1
|
||
|
@23. TX 3 @Calculate expected scores
|
||
|
.left margin2
|
||
|
.para
|
||
|
This function calculates the "double matching probability" of McLachlan.
|
||
|
The
|
||
|
"double matching probability" is the probability of finding
|
||
|
particular scores given two infinitely long sequences of the
|
||
|
composition of those being compared, with the current span
|
||
|
length
|
||
|
and score matrix. By using this option the user can choose to
|
||
|
plot all the matches for which the score exceeds a given
|
||
|
significance level (such as 1%).
|
||
|
Generally it is best to begin at a
|
||
|
low level to avoid an overcrowded diagram.
|
||
|
.para
|
||
|
When the calculation of the expected scores
|
||
|
is finished the program offers
|
||
|
the user 3 ways of examining the results:
|
||
|
.LEFT MARGIN2
|
||
|
"Show probability for a score" allows the user to type in a
|
||
|
score and the
|
||
|
program responds with the probability of achieving that level
|
||
|
of score.
|
||
|
.LEFT MARGIN2
|
||
|
"Show score for a probability" allows the user to type in a
|
||
|
probability value and
|
||
|
the program types the score that corresponds to that level of
|
||
|
probability.
|
||
|
.LEFT MARGIN2
|
||
|
"List scores and probabilities" is the command to list out the
|
||
|
scores and their
|
||
|
corresponding probabilities. The user is asked to supply a
|
||
|
further parameter, the "number of steps between scores", and
|
||
|
the program only lists
|
||
|
every stepsize-th point, e.g. a stepsize of 5 will get every 5th
|
||
|
score listed.
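.para
One straightforward way to obtain a distribution of this kind is sketched
here. It assumes that the characters of two random sequences are independent
and drawn with the observed composition of each sequence, so that the score
distribution for one pair can be convolved with itself once per span
position. The function name score_tail_probabilities is hypothetical, and
the program's own calculation may differ in detail.
.lit
```python
from collections import Counter


def score_tail_probabilities(seq_h, seq_v, span, score):
    """Return {s: P(span score >= s)} for random sequences that have
    the compositions of seq_h and seq_v."""
    freq_h = {a: c / len(seq_h) for a, c in Counter(seq_h).items()}
    freq_v = {b: c / len(seq_v) for b, c in Counter(seq_v).items()}

    # Distribution of the score for one random pair of characters.
    single = {}
    for a, pa in freq_h.items():
        for b, pb in freq_v.items():
            s = score(a, b)
            single[s] = single.get(s, 0.0) + pa * pb

    # Convolve once per span position to get the span score distribution.
    dist = {0: 1.0}
    for _ in range(span):
        nxt = {}
        for s1, p1 in dist.items():
            for s2, p2 in single.items():
                nxt[s1 + s2] = nxt.get(s1 + s2, 0.0) + p1 * p2
        dist = nxt

    # Accumulate from the highest score downwards.
    tail, running = {}, 0.0
    for s in sorted(dist, reverse=True):
        running += dist[s]
        tail[s] = running
    return tail


def identity(a, b):
    return 1 if a == b else 0


print(score_tail_probabilities("ACGT", "ACGT", 2, identity))
```
.end lit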
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
? Menu or option number=d23
|
||
|
? Odd span length (1-401) (11) =
|
||
|
? Proportional score (1-297) (132) =
|
||
|
|
||
|
Working
|
||
|
Average score= 103.18557
|
||
|
RMS deviation= 7.85276
|
||
|
X 1 Show probability for a score
|
||
|
2 Show score for a probability
|
||
|
3 List scores and probabilities
|
||
|
? 0,1,2,3 =
|
||
|
|
||
|
? Show probability for score (1-165) (134) =160
|
||
|
Probability of score 160 is 0.0000000008
|
||
|
X 1 Show probability for a score
|
||
|
2 Show score for a probability
|
||
|
3 List scores and probabilities
|
||
|
? 0,1,2,3 =2
|
||
|
? Show score for probability (0.0000000001-1.) (0.00001) =0.0000001
|
||
|
Score for probability 0.0000001000 is 153
|
||
|
1 Show probability for a score
|
||
|
X 2 Show score for a probability
|
||
|
3 List scores and probabilities
|
||
|
? 0,1,2,3 =3
|
||
|
? Number of steps between scores (1-10) (5) =
|
||
|
|
||
|
0 0.10000E+01 100 0.67232E+00 200 0.18977E-20
|
||
|
5 0.10000E+01 105 0.42119E+00 205 0.42561E-22
|
||
|
10 0.10000E+01 110 0.20671E+00 210 0.87767E-24
|
||
|
15 0.10000E+01 115 0.78860E-01 215 0.16651E-25
|
||
|
20 0.10000E+01 120 0.23515E-01 220 0.27300E-27
|
||
|
25 0.10000E+01 125 0.55406E-02 225 0.00000E+00
|
||
|
30 0.10000E+01 130 0.10443E-02 230 0.00000E+00
|
||
|
35 0.10000E+01 135 0.15935E-03 235 0.00000E+00
|
||
|
40 0.10000E+01 140 0.19906E-04 240 0.00000E+00
|
||
|
45 0.10000E+01 145 0.20569E-05 245 0.00000E+00
|
||
|
50 0.10000E+01 150 0.17758E-06 250 0.00000E+00
|
||
|
55 0.10000E+01 155 0.12938E-07 255 0.00000E+00
|
||
|
60 0.10000E+01 160 0.80360E-09 260 0.00000E+00
|
||
|
65 0.10000E+01 165 0.43009E-10 265 0.00000E+00
|
||
|
70 0.10000E+01 170 0.20049E-11 270 0.00000E+00
|
||
|
75 0.99997E+00 175 0.82263E-13 275 0.00000E+00
|
||
|
80 0.99949E+00 180 0.29998E-14 280 0.00000E+00
|
||
|
85 0.99448E+00 185 0.98050E-16 285 0.00000E+00
|
||
|
90 0.96543E+00 190 0.28934E-17 290 0.00000E+00
|
||
|
95 0.86836E+00 195 0.77556E-19 295 0.00000E+00
|
||
|
1 Show probability for a score
|
||
|
2 Show score for a probability
|
||
|
X 3 List scores and probabilities
|
||
|
? 0,1,2,3 =!
|
||
|
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@24. TX 3 @Calculate observed scores
|
||
|
.left margin2
|
||
|
.para
|
||
|
This option applies the proportional algorithm to the currently active
|
||
|
sequences but instead of producing a
|
||
|
dot matrix it calculates a histogram of observed scores.
|
||
|
The speed of this calculation
|
||
|
of course depends on the size of the active
|
||
|
regions, but when it
|
||
|
is completed the program offers the user 3 ways of examining
|
||
|
the results:
|
||
|
.para
|
||
|
"Show percentage for score" allows the user to type in a score and the
|
||
|
program
|
||
|
responds with the percentage of points that achieve this
|
||
|
value.
|
||
|
.para
|
||
|
"Show score for a percentage" allows the user to type in a percentage and
|
||
|
the
|
||
|
program responds with the corresponding score. Values of
|
||
|
this score and above are only achieved by the given
|
||
|
percentage of points.
|
||
|
.para
|
||
|
"List scores and percentages" is the command to list out the scores
|
||
|
and the
|
||
|
percentage of points achieving them.
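.para
A minimal sketch of such a cumulative table follows. It assumes that only
complete spans are counted and that each row gives the number and percentage
of points scoring at least the listed value; the program's own handling of
points near the edges is not described here and may differ. The name
observed_score_table is hypothetical.
.lit
```python
from collections import Counter


def observed_score_table(seq_h, seq_v, span, score):
    """Return rows (score, count, percent), where count is the number of
    span positions scoring at least that value."""
    counts = Counter()
    for i in range(len(seq_h) - span + 1):
        for j in range(len(seq_v) - span + 1):
            total = sum(score(seq_h[i + k], seq_v[j + k])
                        for k in range(span))
            counts[total] += 1

    n_points = sum(counts.values())
    rows, running = [], 0
    for s in sorted(counts, reverse=True):
        running += counts[s]
        rows.append((s, running, 100.0 * running / n_points))
    return rows[::-1]                       # lowest score first


def identity(a, b):
    return 1 if a == b else 0


for row in observed_score_table("AAB", "AB", 1, identity):
    print(row)
```
.end lit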
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
? Menu or option number=24
|
||
|
Working
|
||
|
Maximum observed score is 152
|
||
|
X 1 Show percentage reaching a score
|
||
|
2 Show score for a percentage
|
||
|
3 List scores and percentages
|
||
|
? 0,1,2,3 =
|
||
|
|
||
|
? Show percentage for score (1-152) (114) =144
|
||
|
Percentage of points with score 144 is 0.005486297
|
||
|
X 1 Show percentage reaching a score
|
||
|
2 Show score for a percentage
|
||
|
3 List scores and percentages
|
||
|
? 0,1,2,3 =2
|
||
|
|
||
|
? Show score for percentage (0.00001-1.) (0.001) =0.01
|
||
|
Score for percentage 0.010000000 is 143
|
||
|
1 Show percentage reaching a score
|
||
|
X 2 Show score for a percentage
|
||
|
3 List scores and percentages
|
||
|
? 0,1,2,3 =
|
||
|
|
||
|
? Show score for percentage (0.00001-1.) (0.001) =1.
|
||
|
Score for percentage 1.000000000 is 124
|
||
|
1 Show percentage reaching a score
|
||
|
X 2 Show score for a percentage
|
||
|
3 List scores and percentages
|
||
|
? 0,1,2,3 =3
|
||
|
? Number of steps between scores (1-10) (5) =1
|
||
|
|
||
|
73 236953 0.10000E+03
|
||
|
74 236951 0.99999E+02
|
||
|
75 236951 0.99999E+02
|
||
|
76 236950 0.99998E+02
|
||
|
77 236945 0.99996E+02
|
||
|
78 236942 0.99995E+02
|
||
|
79 236929 0.99989E+02
|
||
|
80 236900 0.99977E+02
|
||
|
|
||
|
missing data here
|
||
|
|
||
|
130 384 0.16206E+00
|
||
|
131 307 0.12956E+00
|
||
|
132 239 0.10086E+00
|
||
|
133 180 0.75964E-01
|
||
|
134 134 0.56551E-01
|
||
|
135 103 0.43468E-01
|
||
|
136 78 0.32918E-01
|
||
|
137 67 0.28276E-01
|
||
|
138 46 0.19413E-01
|
||
|
139 40 0.16881E-01
|
||
|
140 33 0.13927E-01
|
||
|
141 29 0.12239E-01
|
||
|
142 24 0.10129E-01
|
||
|
143 19 0.80184E-02
|
||
|
144 13 0.54863E-02
|
||
|
145 10 0.42202E-02
|
||
|
146 8 0.33762E-02
|
||
|
147 7 0.29542E-02
|
||
|
148 7 0.29542E-02
|
||
|
149 6 0.25321E-02
|
||
|
150 5 0.21101E-02
|
||
|
151 3 0.12661E-02
|
||
|
152 3 0.12661E-02
|
||
|
1 Show percentage reaching a score
|
||
|
2 Show score for a percentage
|
||
|
X 3 List scores and percentages
|
||
|
? 0,1,2,3 =!
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@25. TX 3 @Show current parameter settings
|
||
|
.LEFT MARGIN2
|
||
|
.para
|
||
|
This function lists the names of the current sequences, their total
|
||
|
lengths, the start
|
||
|
and end points of the active sequence and the current values of span and
|
||
|
cut-off scores. It also shows if the main diagonal will be shown, or if
|
||
|
the
|
||
|
proportional algorithm will mark all identities in matching spans.
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
? Menu or option number=25
|
||
|
Horizontal sequence
|
||
|
ALPHA.PRT
|
||
|
Positions
|
||
|
1 TO 514
|
||
|
Vertical sequence
|
||
|
BETA.PRT
|
||
|
Positions
|
||
|
1 TO 461
|
||
|
Span length= 11
|
||
|
Scores
|
||
|
Proportional= 132
|
||
|
Identities= 3
|
||
|
Identites off
|
||
|
Main diagonal shown
|
||
|
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@27. TX 2 @Draw a /
|
||
|
.left margin2
|
||
|
.para
|
||
|
This option simply draws a diagonal line from the bottom left of the
|
||
|
diagram to the top right. It can be an aid when trying to align the
|
||
|
sequences.
|
||
|
.left margin1
|
||
|
@26. TX 4 @Quick scan
|
||
|
.left margin2
|
||
|
.para
|
||
|
The algorithm is as follows. The dot matrix positions are found for all
|
||
|
words of some minimum length (obviously length 1 is most sensitive)
|
||
|
that are common to both sequences. Imagine a diagonal line running from
|
||
|
corner to corner of the diagram, at right angles to the diagonals in the
|
||
|
dot matrix. The scores for the common words (according to the current
score matrix, e.g. MDM78) are accumulated at the appropriate positions
|
||
|
on that imaginary line, hence
|
||
|
producing a histogram. The histogram is analysed to find its mean and
|
||
|
standard deviation. The diagonals that lie above some cutoff score
|
||
|
(defined in standard deviation units), are rescanned using the
|
||
|
proportional algorithm, and a diagram produced. The method is very fast,
|
||
|
and is also employed by the library comparison program.
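.para
The first stage of the algorithm can be sketched as below. This is a sketch
under stated assumptions: each common word adds the score of its characters
matched against themselves to the histogram cell for its diagonal, empty
diagonals are included in the mean and standard deviation, and the
population standard deviation is used. The name quick_scan_diagonals is
hypothetical. The diagonals returned would then be rescanned with the
proportional algorithm.
.lit
```python
from collections import defaultdict
from statistics import mean, pstdev


def quick_scan_diagonals(seq_h, seq_v, word_len, n_sd, score):
    """Return the diagonals (offsets i - j) whose accumulated word score
    lies more than n_sd standard deviations above the mean."""
    # Index every word of seq_v by its start position.
    index = defaultdict(list)
    for j in range(len(seq_v) - word_len + 1):
        index[seq_v[j:j + word_len]].append(j)

    # One histogram cell per diagonal.
    hist = {d: 0 for d in range(-(len(seq_v) - 1), len(seq_h))}
    for i in range(len(seq_h) - word_len + 1):
        word = seq_h[i:i + word_len]
        for j in index.get(word, ()):
            hist[i - j] += sum(score(c, c) for c in word)

    values = list(hist.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return sorted(d for d, v in hist.items() if v > cutoff)


def identity(a, b):
    return 1 if a == b else 0


print(quick_scan_diagonals("WWWWACDEFG", "ACDEFGYYYY", 2, 2.0, identity))
```
.end lit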
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
|
||
|
? Menu or option number=d26
|
||
|
? Identity score (1-20) (3) =
|
||
|
? Odd span length (1-401) (11) =
|
||
|
? Proportional score (1-297) (132) =
|
||
|
? Number of sd above mean (0.00-10.00) (5.00) =
|
||
|
|
||
|
missing graphics
|
||
|
|
||
|
|
||
|
.end lit
|
||
|
.left margin2
|
||
|
.para
|
||
|
SIPL the library searching version of SIP
|
||
|
.para
|
||
|
This program compares a probe sequence against a library of sequences using
|
||
|
the quick scan algorithm, sorts the matches into descending order of score,
|
||
|
and produces optimal alignments of the best scores using the Myers and
|
||
|
Miller method. It is very rapid.
|
||
|
.para
|
||
|
Use of lists of entry names
|
||
|
.para
|
||
|
SIPL has the ability to
|
||
|
restrict searches to subsets of the libraries. This does not require
|
||
|
sublibraries to be created but instead is achieved by using files
|
||
|
containing a list of the entry names of sequences. The user may choose to
|
||
|
search only those entries on the list or, alternatively to search all but
|
||
|
those on the list (i.e. in the latter case
|
||
|
the list contains the names of those to be excluded).
|
||
|
The programs can search libraries that have indexes and those that
|
||
|
do not.
|
||
|
If a list of names for inclusion is used,
|
||
|
then the search will be faster if the index is present. In all other
|
||
|
circumstances the whole library will be read.
|
||
|
The list must be in library order except when it is used
|
||
|
to include entries, and an index is available.
|
||
|
The list must contain each entry name on a separate line, with the name
|
||
|
starting in column 1 of the line, i.e. there must be no spaces at the start
|
||
|
of the line.
|
||
|
The list of entry names
|
||
|
can be produced by the keyword searches of nip, pip, sip, etc as long
|
||
|
as the listings produced have a space character separating the entry name
|
||
|
from the entry description. This will depend on how well the library
|
||
|
reformatting programs work. For example swissprot entry names tend to run
|
||
|
into the beginning of the descriptions, but other libraries are generally
|
||
|
OK.
|
||
|
|
||
|
.left margin1
|
||
|
@28. TX 4 @Align sequences
|
||
|
.left margin2
|
||
|
.para
|
||
|
This function will produce an optimal alignment of two segments of the
|
||
|
sequence.
|
||
|
The dynamic programming alignment algorithm is based on that of Miller
|
||
|
and Myers (). It guarantees to produce alignments with the optimum score
|
||
|
given a score matrix, a gap start penalty, and a gap extension penalty.
|
||
|
That is, starting a gap costs a fixed penalty (F) and each residue added
|
||
|
to the gap incurs a further penalty (E) so that for each gap of length K
|
||
|
residues the penalty is F + K*E. Gaps at the ends of sequences incur no
|
||
|
penalty.
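.para
To make the penalty scheme concrete, the sketch below scores an alignment
that has already been produced; it does not implement the alignment
algorithm itself. It assumes that the padding character is "-", that
end gaps are the columns before the first and after the last pair of
aligned residues, and that no column has padding in both rows. The name
alignment_score is hypothetical.
.lit
```python
def alignment_score(row_a, row_b, score, gap_start, gap_extend, pad="-"):
    """Score two padded rows of equal length. A gap of K residues costs
    gap_start + K * gap_extend; gaps at either end cost nothing."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have equal length")
    cols = list(zip(row_a, row_b))
    paired = [k for k, (a, b) in enumerate(cols) if a != pad and b != pad]
    if not paired:
        return 0

    total = 0
    open_in = None                  # row that holds the currently open gap
    for a, b in cols[paired[0]:paired[-1] + 1]:
        if a == pad or b == pad:
            row = "a" if a == pad else "b"
            if row != open_in:      # a new gap starts here: penalty F
                total -= gap_start
                open_in = row
            total -= gap_extend     # penalty E for each residue in the gap
        else:
            total += score(a, b)
            open_in = None
    return total


def identity(a, b):
    return 1 if a == b else 0


# Four matches minus one gap of length 2: 4 - (10 + 2 * 2)
print(alignment_score("AC--GT", "ACTTGT", identity, 10, 2))
```
.end lit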
|
||
|
|
||
|
.para
|
||
|
The routine can only handle segments of sequence of maximum
|
||
|
length 5000 residues. When the sequences are read in the alignment
|
||
|
segment
|
||
|
will be set to the first 5000 residues. A different segment can be
|
||
|
selected by prefixing the option number by the letter D, in which case the
|
||
|
cross hair can be used to identify the two ends. The cross hair will
|
||
|
appear.
|
||
|
First position the
|
||
|
crosshair at
|
||
|
the bottom left of the
|
||
|
segment and type a character other than s
|
||
|
or m or ",". When the crosshair reappears, position it at the top right
|
||
|
of the segment, and type a keyboard character.
|
||
|
The aligned sequences will replace the active sequence if the user
|
||
|
confirms "keep alignment". By alternate use of the
|
||
|
plotting and alignment routines it is possible to rapidly produce an
|
||
|
alignment of quite long sequences.
|
||
|
.para
|
||
|
Typical dialogue follows.
|
||
|
.lit
|
||
|
|
||
|
28 = Align sequences
|
||
|
? Menu or option number=d28
|
||
|
Define the region to align using the cross-hair.
|
||
|
First identify the bottom left position and exit
|
||
|
the cross-hair routine. Then the top right.
|
||
|
|
||
|
(Bell rings, type return, cross hair appears)
|
||
|
|
||
|
? Penalty for starting a gap (1-100) (10) =
|
||
|
? Penalty for each residue in gap (1-100) (10) =
|
||
|
|
||
|
Aligning region 1 to 461
|
||
|
with region 1 to 514
|
||
|
1 11 21 31 41 51
|
||
|
MA--TGKIVQ VIGA------ VVDVEFPQDA VPRVYDALEV QNG------N ERLVL-----
|
||
|
* * * ** * * * * *
|
||
|
MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
|
||
|
1 11 21 31 41 51
|
||
|
61 71 81 91 101 111
|
||
|
EVQQQLGGGI VRTIAMGSSD GLRRGLDVKD LEHPIEVPVG KATLGRIMNV LGEPVDMKGE
|
||
|
* * ** * * ** ***** *** * ** * * **
|
||
|
AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
|
||
|
61 71 81 91 101 111
|
||
|
121 131 141 151 161 171
|
||
|
IGEEERWAIH RAAPSYEELS NSQELLETGI KVIDLMCPFA KGGKVGLFGG AGVGKTVNMM
|
||
|
* ** * ** * * * * * * ***
|
||
|
LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
|
||
|
121 131 141 151 161 171
|
||
|
181 191 201 211 221 231
|
||
|
ELIRNIAIEH SGYS-VFAGV GERTREGNDF YHEMTDSNVI DKVSLVYGQM NEPPGNRLRV
|
||
|
* * ** * * *
|
||
|
DAI--INQRD SGIKCIYVAI GQKASTISNV VRKLEEHGAL ANTIVVVATA SESAALQYLA
|
||
|
181 191 201 211 221 231
|
||
|
241 251 261 271 281 291
|
||
|
ALTGLTMAEK FRDEGRDVLL FVDNIYRYTL AGTEVSALLG RMPSAVGYQP TLAEEMGVLQ
|
||
|
* * *** * * * * * * ** * * *
|
||
|
RMPVALMGEY FRDRGEDALI IYDDLSKQAV AYRQISLLLR RPPGREAFPG DVFYLHSRLL
|
||
|
241 251 261 271 281 291
|
||
|
301 311 321 331 341 351
|
||
|
ERITST---- ---------- -KTGSITSVQ AVYVPADDLT DPSPATTFAH LDATVVLSRQ
|
||
|
** **** * * * * * *
|
||
|
ERAARVNAEY VEAFTKGEVK GKTGSLTALP IIETQAGDVS AFVPTNVISI TDGQIFLETN
|
||
|
301 311 321 331 341 351
|
||
|
361 371 381 391 401 411
|
||
|
IASLGIYPAV DPLDSTSRQL DPLVVGQEHY DTAR----GV QSILQRYQEL KDIIAILGMD
|
||
|
** *** * * ** * * * * * **
|
||
|
LFNAGIRPAV NPGISVSR-- ---VGGAAQT KIMKKLSGGI RTALAQYREL AAFSQFAS--
|
||
|
361 371 381 391 401 411
|
||
|
421 431 441 451 461 471
|
||
|
ELSEEDKLVV ARARKIQRFL SQ----PFFV AE----VFTG SPGKYVSLKD --TIRGFKGI
|
||
|
* * * * * * * * * * * *
|
||
|
DLDDATRKQL DHGQKVTELL KQKQYAPMSV AQQSLVLFAA ERG-YLADVE LSKIGSFEAA
|
||
|
421 431 441 451 461 471
|
||
|
481 491 501 511 521
|
||
|
MEG--EYDHL P-EQAFYMVG SIEEAVE--- --------KA KKL*
|
||
|
** * * * * *
|
||
|
LLAYVDRDHA PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
|
||
|
481 491 501 511 521
|
||
|
Conservation 22.5%
|
||
|
Number of padding characters inserted 63 and 10
|
||
|
? (y/n) (y) Keep alignment n
|
||
|
|
||
|
|
||
|
.end lit
|
||
|
.left margin1
|
||
|
@29. TX 1 @Complement the sequences
|
||
|
.left margin2
|
||
|
.para
|
||
|
This function allows users to reverse and complement nucleic acid
|
||
|
sequences.
|
||
|
.left margin1
|
||
|
@30. TX 3 @Switch main diagonal
|
||
|
.left margin2
|
||
|
.para
|
||
|
If a sequence is being compared against itself to look for repeats it is
|
||
|
sometimes convenient if the main diagonal is not included in the
|
||
|
comparison. This function allows users to set a switch that determines
|
||
|
whether or not to include the main
|
||
|
diagonal for all the comparison methods.
|
||
|
If the switch is set, and the active regions for both sequences have
|
||
|
the same start position, then the main diagonal will not be compared.
|
||
|
.left margin1
|
||
|
@31. TX 3 @Switch identities
|
||
|
.left margin2
|
||
|
.para
|
||
|
This function allows a switch to be set or unset. The switch determines
|
||
|
which of two forms of plot will be produced by the proportional
|
||
|
algorithm.
|
||
|
One form of output (the original method) plots a dot at the centre of each
|
||
|
span that reaches the threshold score; whereas the other form will plot
|
||
|
dots for all matching residues that lie within spans that reach the
|
||
|
threshold.
|
||
|
.left margin1
|
||
|
@32. TX 3 @Change score matrix
|
||
|
.left margin2
|
||
|
.para
|
||
|
This option allows users to select their
|
||
|
own score matrix for use with the proportional algorithm. The choices
|
||
|
are:
|
||
|
.lit
|
||
|
|
||
|
1 = MDM78
|
||
|
2 = identity
|
||
|
3 = your own matrix
|
||
|
|
||
|
.end lit
|
||
|
.para
|
||
|
MDM78 is the standard matrix that is used for proteins and an
|
||
|
identity matrix is the default matrix for nucleic acids. However an
|
||
|
identity
|
||
|
matrix is also useful for protein comparisons. "Your own matrix" allows
|
||
|
users to apply any other matrix, as long as the matrix file is in the same
|
||
|
format as MDM78.
|
||
|
For comparisons of DNA it might be useful to try one that gave, say, 3 for
exact matches, 1 for purine-purine (R-R) or pyrimidine-pyrimidine (Y-Y)
pairs, and 0 otherwise.
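.para
The scores for such a DNA matrix could be generated as in the sketch below.
The names dna_score and matrix are hypothetical, and the result is held as
a table in memory only: writing it out in the same file format as MDM78 is
not shown, because that format is not described here.
.lit
```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b, match=3, same_class=1):
    """Return match for identical bases, same_class for a purine-purine
    or pyrimidine-pyrimidine pair, and 0 for anything else."""
    bases = PURINES | PYRIMIDINES
    if a not in bases or b not in bases:
        return 0                    # e.g. the unknown character X
    if a == b:
        return match
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return same_class
    return 0


symbols = "ACGTX"
matrix = {a: {b: dna_score(a, b) for b in symbols} for a in symbols}

print("   " + "  ".join(symbols))
for a in symbols:
    print(a + "".join("%3d" % matrix[a][b] for b in symbols))
```
.end lit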
|
||
|
.left margin1
|
||
|
@33. TX 3 @Set number of sd's for Quickscan
|
||
|
.left margin2
|
||
|
.para
|
||
|
The quickscan
|
||
|
algorithm is as follows. The dot matrix positions are found for all
|
||
|
words of some minimum length (obviously length 1 is most sensitive)
|
||
|
that are common to both sequences. Imagine a diagonal line running from
|
||
|
corner to corner of the diagram, at right angles to the diagonals in the
|
||
|
dot matrix. The scores for the common words (according to the current
score matrix, e.g. MDM78) are accumulated at the appropriate positions
|
||
|
on that imaginary line, hence
|
||
|
producing a histogram. The histogram is analysed to find its mean and
|
||
|
standard deviation. The diagonals that lie above some cutoff score
|
||
|
(defined in standard deviation units), are rescanned using the
|
||
|
proportional algorithm, and a diagram produced.
|
||
|
.para
|
||
|
This option allows the number of sd's to be set.
|
||
|
.left margin1
|
||
|
@34. TX 3 @Set gap penalties
|
||
|
.left margin2
|
||
|
.para
|
||
|
The alignment
|
||
|
function will produce an optimal alignment of two segments of the
|
||
|
sequence.
|
||
|
The dynamic programming alignment algorithm is based on that of Miller
|
||
|
and Myers (). It guarantees to produce alignments with the optimum score
|
||
|
given a score matrix, a gap start penalty, and a gap extension penalty.
|
||
|
That is, starting a gap costs a fixed penalty (F) and each residue added
|
||
|
to the gap incurs a further penalty (E) so that for each gap of length K
|
||
|
residues the penalty is F + K*E. Gaps at the ends of sequences incur no
|
||
|
penalty.
|
||
|
.para
|
||
|
This option allows the gap penalties to be set.
|
||
|
.left margin1
|
||
|
@ end of help
|