@-1. TX 0 @General
|
|
|
|
@-2. T 0 @Screen control
|
|
|
|
@-2. X 0 @Screen
|
|
|
|
@-3. TX 0 @Set parameters
|
|
|
|
@-4. TX 0 @Comparison
|
|
|
|
@0. TX -1 @SIP
|
|
|
|
This is a program for comparing and aligning nucleic acid or
protein sequences. It can produce optimal alignments using a dynamic
programming algorithm, and has several ways of producing "dot matrix"
diagrams.
|
|
|
|
The following analyses (preceded by their option numbers) are
|
|
included:
|
|
|
|
|
|
The program is used on a simple graphics terminal, i.e. a
keyboard with a screen on which points and lines can be drawn. The
user works at the terminal and produces plots for various
combinations of values for the span length and minimum scores.
However large or small a region the user elects to compare, the
program expands or contracts the diagram so that the plot always
fills the screen. This allows the user to gain an overall
impression or to "home in" on particular regions and examine them
in more detail. Having found a region that looks interesting,
the user can determine its coordinates in terms of sequence
positions by use of a crosshair facility.
|
|
|
|
The program has two statistical options to help the user
choose score levels for plotting and to assess the significance of
any similarity found. It can produce a cumulative histogram of
observed scores for the current span length and region and it can
calculate the "double matching probability" of McLachlan. The "double
matching probability" is the probability of finding particular
scores given two infinitely long sequences of the composition
of those being compared, with the current span length and score
matrix. By using these options the user can choose to plot all
the matches for which the score exceeds a given significance
level (such as 1%), using either empirical or theoretical
probability values. Generally it is best to begin at a low level to
avoid an overcrowded diagram.
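
As a rough illustration of the empirical option, the sketch below
builds a cumulative histogram from a list of observed span scores and
picks the lowest score whose cumulative fraction does not exceed a
chosen level. The function names and the treatment of ties (scores
greater than or equal to the cutoff) are assumptions made for
illustration; they are not taken from the program.

```python
from collections import Counter


def cumulative_score_histogram(scores):
    """Map each observed score s to the fraction of scores >= s."""
    counts = Counter(scores)
    total = len(scores)
    cumulative = {}
    running = 0
    # Walk from the highest score downwards, accumulating counts.
    for s in sorted(counts, reverse=True):
        running += counts[s]
        cumulative[s] = running / total
    return cumulative


def cutoff_for_level(cumulative, level):
    """Lowest score whose cumulative fraction is at or below `level`.

    Returns None if no observed score is rare enough.
    """
    rare_enough = [s for s, frac in cumulative.items() if frac <= level]
    return min(rare_enough) if rare_enough else None
```

With a 1% level, `cutoff_for_level` returns the score that only the top
1% of observed spans reach, which can then be used as the minimum score
for plotting.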
|
|
|
|
If the user finds that the two sequences do contain
stretches of homology he will often want to align the sequences by
inserting padding characters at deletion points. The program has a
selection of options for this purpose: it contains an alignment
routine; it can display on the screen the two sequences, one above
the other, with asterisks marking identities; it has inbuilt
editing functions; and it can save the aligned sequences in disk files.
|
|
|
|
The basic principle of dot matrices was first described by
Gibbs and McIntyre and involves producing a diagram that contains a
representation of all the matches between a pair of sequences. This
diagram is then scanned by eye and the human ability to
recognise patterns is used to detect any similarities that might be
present. The diagram consists of a two dimensional plot in which the
x axis represents one sequence (A) and the y axis the other
(B). Every point (i,j) on the plane x,y is assigned a score which
corresponds to the level of similarity between sequence
characters A(i) and B(j). In the simplest use of the method a score
of 1 could be assigned to every point (i,j) where A(i) = B(j), and a
score of 0 to every other point. If a plot of the points in the
plane were made in which all scores of 1 were marked with a dot and
all those of 0 left blank, then regions of identity would appear as
diagonal lines. With the comparison displayed in this form the
human eye is very good at detecting regions of homology even if they
are imperfect. The effects of mismatches, insertions or deletions
can be seen: matches interrupted by insertions or deletions will
appear as parallel diagonals, and matches interrupted by the odd
mismatching pair of characters will appear as broken collinear
diagonal lines. This diagram is a very useful representation but
simply placing a dot for every identity is of limited value for the
following reasons.
|
|
|
|
For nucleic acid sequences around 25% of the plot will contain
points and it will often be very difficult to distinguish
significant homologies from chance matches. For proteins many
significant alignments of sequences contain almost no identities but
are formed from chemically and structurally similar amino acids so
that simply looking for identity would be insufficient. What is
required is to first find those points that correspond to fairly
strong local similarities and then to use the diagram of these
points so that the human eye can be used to look for larger scale
homologies. The program uses a number of different algorithms to
calculate the score for each point and the user defines a minimum
score so that only those points in the diagram for which the
score is at least this value will be marked with a dot.
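
The simplest scheme described above (score 1 where A(i) = B(j), score
0 elsewhere) can be sketched in a few lines of Python. This is only an
illustration of the principle: the function name and the use of "*"
and "." as plotting characters are assumptions, and the program itself
draws on a graphics screen rather than printing text.

```python
def identity_dot_matrix(seq_a, seq_b):
    """Return one text row per character of seq_b.

    seq_a runs along the x axis and seq_b along the y axis; a "*"
    marks every point where the two characters are identical.
    """
    rows = []
    for b in seq_b:
        rows.append("".join("*" if a == b else "." for a in seq_a))
    return rows


if __name__ == "__main__":
    # Regions of identity show up as diagonal runs of "*".
    for row in identity_dot_matrix("GATTACA", "GATCACA"):
        print(row)
```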
|
|
|
|
The first scoring method finds the longest uninterrupted
sections of perfect identity, i.e. those that contain no mismatches,
insertions or deletions. Generally this method, termed "the
identities algorithm", is of little value, but runs very quickly.

The second method looks for sections where a
proportion of the characters in the sequence are similar, again
allowing no insertions or deletions. For a thorough analysis this
method, termed "the proportional algorithm", is the best.
|
|
|
|
The original method of this type was first described by
McLachlan and involves calculating a score for each position in the
matrix by summing points found when looking forwards and
backwards along a diagonal line of a given length. This length,
called the span, must be an odd number so that the dot marking
matches can be precisely placed at its centre. The algorithm does
not simply look for identity but uses a score matrix that
contains scores for every possible pair of characters. For
comparing amino acid sequences we usually use the score matrix
shown below which was calculated by adding 10 (to make every term
>0) to each term of the relatedness odds matrix MDM78 of Dayhoff.
This matrix MDM78 was calculated by looking at accepted point
mutations in 71 families of closely related proteins and, of those
tested by Dayhoff, was found to be the most powerful score matrix
for finding distant relationships between amino acid
sequences.
|
|
|
|
AMINO ACID SCORE MATRIX
-----------------------

   C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?
C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
|
|
|
|
It is also possible to use other matrices, including an
identity matrix for proteins. For nucleic acids we usually use the
matrix shown below.

DNA SCORE MATRIX

   A  C  G  T  X
A  1  0  0  0  0
C  0  1  0  0  0
G  0  0  1  0  0
T  0  0  0  1  0
X  0  0  0  0  0
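
The span scoring described above can be sketched as follows, using the
DNA score matrix for brevity. This is a minimal, unoptimised
illustration: the function names, the 0-based coordinates and the
"score >= cutoff" test are assumptions, not the program's own code.

```python
def dna_score(a, b):
    """DNA score matrix from the text: 1 for identical A, C, G or T,
    0 for everything else (including X against X)."""
    return 1 if a == b and a in "ACGT" else 0


def span_scores(seq_a, seq_b, span, score=dna_score):
    """Yield (i, j, total) for every span centred at 0-based (i, j).

    The total is the sum of pair scores along the diagonal, looking
    span // 2 characters backwards and forwards from the centre.
    """
    if span % 2 == 0:
        raise ValueError("span must be an odd number")
    half = span // 2
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            total = sum(score(seq_a[i + k], seq_b[j + k])
                        for k in range(-half, half + 1))
            yield i, j, total


def proportional_dots(seq_a, seq_b, span, cutoff, score=dna_score):
    """Centres of all spans whose score reaches the cutoff."""
    return [(i, j) for i, j, total in span_scores(seq_a, seq_b, span, score)
            if total >= cutoff]
```

Passing a different `score` function (for example a lookup into the
amino acid matrix) changes the scoring without changing the scan.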
|
|
|
|
Plotting dots at the centres of spans that reach the cutoff
leads to a persistence effect that, to some extent, can be mitigated
by a variation on the method. If, for example, all the high scoring
amino acids are clustered at the left end of a particular diagonal
segment, dots will continue to be plotted to their right until the
span score drops below the cutoff. Instead of plotting a single
point for each span that reaches the cutoff score, the variant
method plots points for all the identities that lie in spans that
reach the cutoff. Obviously the persistence effect can be more
pronounced for long spans and low cutoff scores, but note that the
variant method will not plot anything if there are no identities
present, and so similar regions could be missed!
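
A sketch of the variant method, under the same assumptions as any
simple span scan (0-based coordinates, a cutoff test of "score >=
cutoff"). The function name and the default identity scoring are
illustrative only; a score matrix lookup can be passed in as `score`.

```python
def identity_points_in_spans(seq_a, seq_b, span, cutoff,
                             score=lambda a, b: int(a == b)):
    """Plot points for identities lying inside spans that reach the cutoff.

    Instead of one dot at the centre of each qualifying span, every
    identical pair of characters within such a span is returned.
    """
    if span % 2 == 0:
        raise ValueError("span must be an odd number")
    half = span // 2
    offsets = range(-half, half + 1)
    points = set()
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in offsets)
            if total >= cutoff:
                # Only identities are plotted, so a span that qualifies
                # purely through similar (non-identical) pairs adds nothing.
                points.update((i + k, j + k) for k in offsets
                              if seq_a[i + k] == seq_b[j + k])
    return sorted(points)
```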
|
|
|
|
A further variant, useful for comparing a sequence against
|
|
itself, ignores the main diagonal.
|
|
|
|
The third comparison method, called "quick scan", is really a
combination of the first two, and is similar to the FASTP program of
Lipman and Pearson, but produces a dot matrix diagram. The algorithm
is as follows. The dot matrix positions are found for all words of
some minimum length (obviously length 1 is most sensitive) that are
common to both sequences. Imagine a diagonal line running from
corner to corner of the diagram, at right angles to the diagonals in
the dot matrix. The scores for the common words (according to the
current score matrix, e.g. MDM78) are accumulated at the
appropriate positions on that imaginary line, hence producing a
histogram. The histogram is analysed to find its mean and standard
deviation. The diagonals that lie above some cutoff score (defined
in standard deviation units) are rescanned using the proportional
algorithm, and a diagram produced. The method is very fast, and is
also employed by the library comparison program.
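
The first stage of the quick scan (the diagonal histogram) can be
sketched as below. This is a simplification made for illustration:
each common word simply adds 1 to its diagonal, whereas the program
accumulates scores from the current score matrix, and the function
names and the use of the population standard deviation are
assumptions. The rescan with the proportional algorithm is not shown.

```python
from collections import defaultdict
from statistics import mean, pstdev


def diagonal_histogram(seq_a, seq_b, word_len=1):
    """Count common words on each diagonal d = i - j (0-based)."""
    # Index every word of seq_b by its start position.
    positions = defaultdict(list)
    for j in range(len(seq_b) - word_len + 1):
        positions[seq_b[j:j + word_len]].append(j)
    # One histogram bin per diagonal of the dot matrix.
    hist = {d: 0 for d in range(-(len(seq_b) - 1), len(seq_a))}
    for i in range(len(seq_a) - word_len + 1):
        for j in positions.get(seq_a[i:i + word_len], ()):
            hist[i - j] += 1
    return hist


def high_scoring_diagonals(hist, n_sd):
    """Diagonals whose count lies more than n_sd standard deviations
    above the mean of the histogram."""
    values = list(hist.values())
    mu, sd = mean(values), pstdev(values)
    return [d for d, v in sorted(hist.items()) if v > mu + n_sd * sd]
```

Only the diagonals returned by `high_scoring_diagonals` would then be
rescanned, which is where the speed of the method comes from.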
|
|
|
|
The dynamic programming alignment algorithm contained in the
program is based on that of Miller and Myers (). It guarantees to
produce alignments with the optimum score given a score matrix, a
gap start penalty, and a gap extension penalty. That is, starting a
gap costs a fixed penalty (IG) and each residue added to the gap
incurs a further penalty (IH) so that for each gap of length K
residues the penalty is IG + K*IH. Gaps at the ends of sequences
incur no penalty.
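
The gap penalty rule can be written out directly. The sketch below
computes IG + K*IH for each internal gap in a pair of already padded
sequences and ignores gaps at either end. The function names, the "-"
padding character and the example penalty values are assumptions for
illustration; the alignment search itself is not shown.

```python
import re


def gap_penalty(k, ig, ih):
    """Penalty for a single gap of k residues: IG + K*IH."""
    return ig + k * ih if k > 0 else 0


def total_gap_penalty(aligned_a, aligned_b, ig, ih, pad="-"):
    """Sum the penalties of all internal gaps in two padded sequences.

    Runs of padding that touch either end of a sequence are free.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    total = 0
    for aligned in (aligned_a, aligned_b):
        for match in re.finditer(re.escape(pad) + "+", aligned):
            at_end = match.start() == 0 or match.end() == len(aligned)
            if not at_end:
                total += gap_penalty(len(match.group()), ig, ih)
    return total
```

For example, with IG = 10 and IH = 2 a gap of three residues costs
10 + 3*2 = 16.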
|
|
|
|
It is very useful to have the dot matrix methods and the
alignment routine together in the same program because it allows
users to produce a dot matrix diagram to help select which regions
of the sequence they wish to align. Selection is made by use of the
crosshair. First the crosshair is positioned at the bottom left hand
end of the segment to be aligned. The crosshair function is quit and
immediately selected again, the crosshair positioned at the top
right of the segment, and the crosshair function quit. When the
alignment routine is selected the segment will be aligned.

The alignment can replace the original segment of the
sequence. By repeated plotting of dot matrices, followed by
alignment, very long sequences can easily be aligned.
|
|
@1. TX 0 @Help
|
|
|
|
This option gives online help. The user should select option
|
|
numbers and the current documentation will be given.
|
|
|
|
The following analyses (preceded by their option numbers) are
included:
 ? = Help
 ! = Quit
 3 = read a new sequence
 4 = define active region
 5 = list the sequence
 6 = list a text file
 7 = direct output to disk
 8 = write active sequence to disk
 9 = edit the sequences
10 = clear graphics screen
11 = clear text screen
12 = draw a ruler
13 = use cross hair
14 = reposition plots
15 = label diagram
16 = display a map
17 = apply identities algorithm
18 = apply proportional algorithm
19 = list matching spans
20 = set span length
21 = set proportional score
22 = set identities score
23 = calculate expected scores
24 = calculate observed scores
25 = show current parameter settings
26 = quick scan
27 = draw a /
28 = align the sequences
29 = complement the sequences
30 = switch main diagonal
31 = switch identities
32 = change score matrix
|
|
@2. TX 0 @Quit
|
|
|
|
This function stops the program.
|
|
@3. TX 1 @Read a new sequence
|
|
|
|
This option allows users to read in new sequences, browse
through annotations, or search sequence libraries for keywords.
Sequences can be read from "personal" sequence files or from
sequence libraries. These are referred to as the sequence "source".
Personal files can be stored in several formats: Staden, PIR, EMBL,
GENBANK and GCG. At LMB we use "Staden" format for sequencing and
all the libraries are stored in their original formats. Note,
however, that libraries such as EMBL or GenBank that are divided
into several files (e.g. GenBank has 13 separate files) are indexed as
a whole. This means that users do not need to know which file
contains an entry, only which library. When the user chooses to
read in a sequence the program first asks for the sequence "source".
|
|
|
|
If the user selects "personal" the program will ask for the
|
|
format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
|
|
the file. For PIR format the user will also be required to know the
|
|
entry name of the sequence as the file can contain several. For the
|
|
other formats only a single entry is expected. The file will be
|
|
read, its length and composition will be displayed and the option
|
|
left.
|
|
|
|
If the user selects "library" as the sequence source the
|
|
program will display a list of available libraries. The programs are
|
|
capable of handling all current libraries but which ones are
|
|
available will vary from site to site. At LMB we have several
|
|
libraries and also weekly updates of data gathered between releases.
|
|
The program will ask users to select a library and then give a list
|
|
of options:
|
|
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entrynames from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
|
|
If get a sequence or get annotations is selected users will be asked
|
|
to type the entry name. The option will be left when a sequence is
|
|
selected or ! is typed. The composition and length will be
|
|
displayed.
|
|
|
|
The text index contains all words from feature tables,
reference titles, definition lines, keywords lists and comments, so
the text index search is the most useful. It is also the fastest. Up to
5 words can be searched for at once. The words should be typed
separated by spaces, for example
? Keywords=P53 mouse murine tumo

will search for all entries that contain words starting with p53,
mouse, murine and tumo. Only the unique entries that contain ALL
words will be listed. Before listing the matching entries the
program will show the number of 'hits' for each word and ring the
bell. Escape is possible at this point, or after each screenful of
entries. In addition to the entry names the text search displays
the primary accession number, the sequence length and up to 80
characters of description. (The search of 'titles' is now redundant
because the full text index contains all the title words and the
search is much faster. It will probably be removed from the
program.) All searches are independent of case. Where possible the
program will offer default entry names.
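
The matching rule for the text index search (case-independent prefix
match, every word required) can be sketched as follows. The in-memory
dictionary standing in for the index, the function name and the way
per-word hits are counted (number of entries) are assumptions made for
illustration; the real index is held in library files.

```python
def search_text_index(index, keywords, max_words=5):
    """Search a toy text index.

    index:    dict mapping entry name -> iterable of indexed words
    keywords: words typed on one line, separated by spaces
    Returns (hits per word, sorted entry names containing ALL words).
    """
    prefixes = [w.lower() for w in keywords.split()][:max_words]
    hits = {}
    for prefix in prefixes:
        # An entry matches if any of its words starts with the prefix.
        hits[prefix] = {name for name, words in index.items()
                        if any(w.lower().startswith(prefix) for w in words)}
    matching = set.intersection(*hits.values()) if hits else set()
    counts = {prefix: len(names) for prefix, names in hits.items()}
    return counts, sorted(matching)


if __name__ == "__main__":
    # Toy index built from three entry titles shown in this help text.
    toy_index = {
        "MMP53A": ["Mouse", "p53", "mRNA", "complete", "cds"],
        "MMLYN": ["Mouse", "lyn", "protein", "mRNA", "complete", "cds"],
        "LAMBDA": ["Genome", "bacteriophage", "lambda"],
    }
    print(search_text_index(toy_index, "P53 mouse"))
```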
|
|
|
|
Typical dialogue follows.
|
|
Select sequence source
|
|
X 1 Personal file
|
|
2 Sequence library
|
|
? Selection (1-2) (1) =
|
|
Select sequence file format
|
|
X 1 Staden
|
|
2 EMBL
|
|
3 GenBank
|
|
4 PIR
|
|
5 GCG
|
|
? Selection (1-5) (1) =
|
|
? Sequence file name=M13MP7.SEQ
|
|
Contig title removed
|
|
Sequence length= 7238
|
|
Sequence composition
|
|
T C A G -
|
|
2405. 1539. 1765. 1527. 2.
|
|
33.2% 21.3% 24.4% 21.1% 0.0%
|
|
.
|
|
.
|
|
.
|
|
|
|
|
|
Select sequence source
|
|
X 1 Personal file
|
|
2 Sequence library
|
|
? Selection (1-2) (1) =2
|
|
Select a library
|
|
X 1 EMBL 29 nucleotide library Dec 91
|
|
2 SWISSPROT 20 protein library Nov 91
|
|
3 PIR 31 protein library Dec 91
|
|
4 NRL3D 58 From Brookhaven protein library Dec 91
|
|
5 GenBank
|
|
? Selection (1-5) (1) =
|
|
Library is in EMBL format with indexes
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =5
|
|
Search for keywords
|
|
? Keywords=P53 mouse
|
|
P53 hits 68
|
|
MOUSE hits 8180
|
|
|
|
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
|
|
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
|
|
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
|
|
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
|
|
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
|
|
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
|
|
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
|
|
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
|
|
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
|
|
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
|
|
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
|
|
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
|
|
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
|
|
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
|
|
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
|
|
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
|
|
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
|
|
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
|
|
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
|
|
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
|
|
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
|
|
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
|
|
22 different entries found
|
|
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =4
|
|
Search for keywords
|
|
? Keywords=alpha
|
|
Searching for alpha
|
|
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
|
|
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
|
|
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
|
|
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
|
|
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
|
|
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
|
|
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
|
|
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
|
|
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
|
|
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
|
|
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
|
|
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
|
|
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
|
|
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
|
|
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
|
|
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
|
|
!
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =3
|
|
? Accession number=v00636
|
|
Entry name LAMBDA
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =2
|
|
Default Entry name=LAMBDA
|
|
? Entry name=
|
|
ID LAMBDA standard; DNA; PHG; 48502 BP.
|
|
XX
|
|
AC V00636; J02459; M17233; X00906;
|
|
XX
|
|
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
|
|
DT 09-JUN-1982 (Rel. 1, Created)
|
|
XX
|
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
|
XX
|
|
KW circular; coat protein; DNA binding protein; genome;
|
|
KW origin of replication.
|
|
XX
|
|
OS Bacteriophage lambda
|
|
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
|
|
XX
|
|
RN [1]
|
|
RP 1-48502
|
|
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
|
|
RT "Nucleotide sequence of bacteriophage lambda DNA";
|
|
RL J. Mol. Biol. 162:729-773(1982).
|
|
XX
|
|
!
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =
|
|
Default Entry name=LAMBDA
|
|
? Entry name=
|
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
|
Sequence length 48502
|
|
Sequence composition
|
|
T C A G -
|
|
11988. 11360. 12336. 12818. 0.
|
|
24.7% 23.4% 25.4% 26.4% 0.0%
|
|
|
|
@4. TX 1 @Define active region
|
|
|
|
For its analytic functions the program always works on a
region of the sequence called the active region. When a new sequence
is read into the program the active region is automatically set to
start at the beginning of the sequence and extend up to the maximum
size of active region the program can handle. On most machines this
will be the end of the sequence. The positions are shown on the
screen. This option allows the user to define a different region.
|
|
@5. TX 1 @List a sequence
|
|
|
|
The sequence can be listed with line lengths from 10 to 120 in
|
|
multiples of 10. The output looks like:
|
|
|
|
87 97 107 117 127 137
|
|
KVKCTGRILE VPVGRGLLGR VVNTLGAPID GKGPLDHDGF SAVEAIAPGV IERQSVDQPV
|
|
** * **** *** * ** * * ** * ** *
|
|
DVKDLEHPIE VPVGKATLGR IMNVLGEPVD MKGEIGEEER WAIHRAAPSY EELSNSQELL
|
|
68 78 88 98 108 118
|
|
147 157 167 177 187 197
|
|
QTGYKAVDSM IPIGRGQREL IIGDRQTGKT ALAIDAIINQ RDSGIKCIYV AIGQ
|
|
** * * * * * * *** * * *
|
|
ETGIKVIDLM CPFAKGGKVG LFGGAGVGKT VNMMELIRNI AIEHSGYSVF AGVG
|
|
128 138 148 158 168 178
|
|
|
|
@6. TX 1 @List a text file
|
|
|
|
Allows the user to have a text file displayed on the screen.
|
|
It will appear one page at a time.
|
|
@7. TX 1 @Direct output to disk
|
|
|
|
Used to direct output that would normally appear on the screen
|
|
to a file.
|
|
|
|
Select redirection of either text or graphics, and supply the
|
|
name of the file that the output should be written to.
|
|
|
|
The results from the next options selected will not appear on
|
|
the screen but will be written to the file. When option 7 is
|
|
selected again the file will be closed and output will again appear
|
|
on the screen.
|
|
@8. TX 1 @Write active region to disk
|
|
|
|
This option allows users to write the current active sequence
|
|
to a disk file in Staden format.
|
|
@9. TX 1 @Edit the sequences
|
|
|
|
This function allows the user to insert or delete parts of
|
|
either sequence to help align them. The inserted characters are
|
|
dashes.
|
|
@10. TX 2 @Clear graphics
|
|
|
|
Clears the screen of both text and graphics.
|
|
@11. TX 2 @Clear text
|
|
|
|
Clears only text from the screen.
|
|
@12. TX 2 @Draw a ruler
|
|
|
|
This option allows the user to draw a ruler or scale along the
|
|
axes of the screen to help identify the coordinates of points of
|
|
interest. The user can define the position of the first sequence
|
|
element to be marked (for example if the active region is 1501 to
|
|
8000, the user might wish to mark every 1000th element starting at
|
|
either 1501 or 2000 - it depends if the user wishes to treat the
|
|
active region as an independent unit with its own numbering starting
|
|
at its left edge, or as part of the whole sequence). The user can
|
|
also define the separation of the ticks on the scale and their
|
|
height. If required the labelling routine can be used to add numbers
|
|
to the ticks.
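
The two numbering choices in the example above can be illustrated
with a short sketch that lists the sequence positions at which ticks
would be drawn. The function name and argument names are hypothetical.

```python
def ruler_ticks(region_start, region_end, first_tick, separation):
    """Sequence positions of ruler ticks within the active region."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    if not region_start <= first_tick <= region_end:
        raise ValueError("first tick must lie inside the active region")
    return list(range(first_tick, region_end + 1, separation))


if __name__ == "__main__":
    # Active region 1501 to 8000, a tick every 1000th element.
    print(ruler_ticks(1501, 8000, 1501, 1000))  # numbering from the left edge
    print(ruler_ticks(1501, 8000, 2000, 1000))  # numbering of the whole sequence
```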
|
|
|
|
To escape type !
|
|
@13. TX 2 @Use cross hair
|
|
|
|
This function puts a steerable cross on the screen that can be
|
|
used to find the coordinates of points in the sequence. The user can
|
|
move the cross around using the directional keys; when he hits the
|
|
space bar the program will write out the coordinates of the cross in
|
|
sequence units and the option will be exited.
|
|
|
|
If instead, the user hits a , the position will be displayed
|
|
but the cross will remain on the screen.
|
|
|
|
If a letter s is hit the sequences around the cross hair are
|
|
displayed as a short alignment (as shown below) and the cross
|
|
remains on the screen.
|
|
97 107
|
|
VPVGRGLLGR VVNTLGAPID
|
|
**** *** * ** * *
|
|
VPVGKATLGR IMNVLGEPVD
|
|
78 88
|
|
|
|
|
|
If a letter m is hit the sequences around the cross hair are
|
|
displayed in the form of a matrix (as shown below) and the cross
|
|
remains on the screen.
|
|
|
|
VPVGKATLGRIMNVLGEPVD
|
|
D...................DD
|
|
I..........I.........I
|
|
P.P...............P..P
|
|
A.....A..............A
|
|
G...G....G......G....G
|
|
L.......L......L.....L
|
|
T......T.............T
|
|
N............N.......N
|
|
VV.V..........V....V.V
|
|
VV.V..........V....V.V
|
|
R.........R..........R
|
|
G...G....G......G....G
|
|
L.......L......L.....L
|
|
L.......L......L.....L
|
|
G...G....G......G....G
|
|
R.........R..........R
|
|
G...G....G......G....G
|
|
VV.V..........V....V.V
|
|
P.P...............P..P
|
|
VV.V..........V....V.V
|
|
VPVGKATLGRIMNVLGEPVD
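
The layout of the matrix display can be reproduced with a short
sketch: the horizontal sequence is written above and below, the
vertical sequence is written bottom to top at both ends of each row,
and each identity is marked with the matching character. The function
name is hypothetical and this is not the program's own code.

```python
def text_matrix(seq_x, seq_y):
    """Return the lines of a text matrix comparing seq_x with seq_y.

    seq_x runs left to right; seq_y runs from the bottom row upwards,
    as on the dot matrix plot.
    """
    lines = [" " + seq_x]
    for y in reversed(seq_y):
        body = "".join(y if x == y else "." for x in seq_x)
        lines.append(y + body + y)
    lines.append(" " + seq_x)
    return lines


if __name__ == "__main__":
    for line in text_matrix("VPVGKATLGRIMNVLGEPVD", "VPVGRGLLGRVVNTLGAPID"):
        print(line)
```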
|
|
|
|
|
|
The function is also used prior to "align sequences" in order
|
|
to delineate the region to be aligned. The crosshair is positioned
|
|
at the bottom left of the region, the crosshair option quit. Then
|
|
the crosshair option is selected again, and the crosshair moved to
|
|
the top right of the region to be aligned.
|
|
@14. TX 2 @Reposition plots
|
|
|
|
The position of the plots is defined relative to a user's
drawing board which has size 1-10,000 in x and 1-10,000 in y. Plots
are drawn in a window defined by x0,y0 and xlength,ylength, where
x0,y0 is the position of the bottom left hand corner of the window,
xlength is the width of the window and ylength is its height.
|
|
   --------------------------------------------------------- 10,000
   1                                                       1
   1    --------------------------------------   ^         1
   1    1                                    1   1         1
   1    1                                    1   1         1
   1    1                                    1 ylength     1
   1    1                                    1   1         1
   1    1                                    1   1         1
   1    --------------------------------------   v         1
   1    x0,y0^                                             1
   1    <---------------xlength-------------->             1
   --------------------------------------------------------- 1
   1                                                  10,000
|
|
|
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
|
The default window positions are read from a file "DIAGMARG" when
|
|
the program is started. Users can have their own file if required.
|
|
This option allows users to change window positions whilst running
|
|
the program. If the user types only carriage return for any value
|
|
it will remain unchanged. The cross-hair can be used to choose
|
|
suitable heights.
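
How a sequence position might be mapped into such a window is
sketched below. The linear scaling, the function name and the example
window values are assumptions made for illustration; only the meaning
of x0, y0, xlength and ylength is taken from the description above.

```python
BOARD_MIN, BOARD_MAX = 1, 10000


def sequence_to_board(pos_x, pos_y, region_x, region_y, window):
    """Map sequence positions to drawing board coordinates.

    region_x, region_y: (first, last) positions of the active regions
    window:             (x0, y0, xlength, ylength) in board units
    """
    x0, y0, xlength, ylength = window
    if not (BOARD_MIN <= x0 and x0 + xlength <= BOARD_MAX and
            BOARD_MIN <= y0 and y0 + ylength <= BOARD_MAX):
        raise ValueError("window does not fit on the drawing board")
    (first_x, last_x), (first_y, last_y) = region_x, region_y
    # Fraction of the way along each active region, scaled to the window.
    frac_x = (pos_x - first_x) / (last_x - first_x)
    frac_y = (pos_y - first_y) / (last_y - first_y)
    return x0 + frac_x * xlength, y0 + frac_y * ylength
```

Because the region is always scaled to the window, the plot fills the
window whatever the size of the region being compared.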
|
|
@15. TX 2 @Label a diagram
|
|
|
|
This routine allows users to label any diagrams they have
|
|
produced. They are asked to type in a label. When the user types
|
|
carriage return to finish typing the label the cross-hair appears on
|
|
the screen. The user can position it anywhere on the screen. If the
|
|
user types R (for right justify) the label will be written on the
|
|
diagram with its right end at the cross-hair position. If the user
|
|
types L (for left justify) the label will be written with its left
|
|
end at the cross hair position. The cross-hair will then
|
|
immediately reappear. The user may put the same label on another
|
|
part of the diagram as before or if he hits the space bar he will be
|
|
asked if he wishes to type in another label.
|
|
@16. TX 2 @Display a map
|
|
|
|
NOT AVAILABLE. This draws a map of any sequence features
|
|
selected by the user. These features may be protein coding regions
|
|
(CDS), tRNA genes (TRNA), promoter positions (PRM), etc. Users may
|
|
define their own feature table key names. The coordinates must be
|
|
stored in a file in the format of an EMBL feature table.
|
|
@17. TX 4 @Apply identities algorithm
|
|
|
|
The identities algorithm finds runs of identical characters in
the sequences. Its main value is speed, being hundreds of times faster
than the proportional algorithm. It is of course not very sensitive,
and should only be used for a quick scan. The cutoff score is the
minimum number of consecutive matching characters. All runs of
identical characters that are at least as long as the cutoff score
will produce a dot on the screen.
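
What the identities algorithm reports can be illustrated with a naive
sketch that finds every maximal run of identical characters at least
as long as the cutoff. It is written for clarity, not speed, and says
nothing about how the program achieves its performance; the function
name and the 0-based (i, j, length) result format are assumptions.

```python
def identity_runs(seq_a, seq_b, cutoff):
    """Return (i, j, length) for each maximal diagonal run of
    identities of at least `cutoff` characters (0-based starts)."""
    runs = []
    for i in range(len(seq_a)):
        for j in range(len(seq_b)):
            if seq_a[i] != seq_b[j]:
                continue
            # Skip positions that are inside a run started earlier.
            if i > 0 and j > 0 and seq_a[i - 1] == seq_b[j - 1]:
                continue
            length = 0
            while (i + length < len(seq_a) and j + length < len(seq_b)
                   and seq_a[i + length] == seq_b[j + length]):
                length += 1
            if length >= cutoff:
                runs.append((i, j, length))
    return runs
```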
|
|
|
|
See also quick scan.
|
|
|
|
Typical dialogue follows.
|
|
? Menu or option number=d17
|
|
? Identity score (1-20) (2) =3
|
|
Working
|
|
|
|
missing graphics
|
|
|
|
@18. TX 4 @Apply proportional algorithm
|
|
|
|
This method, generally the most useful, was first
|
|
described by McLachlan and involves calculating a score for each
|
|
position in the matrix by summing points found when looking
|
|
forwards and backwards along a diagonal line of a given length.
|
|
This length, called the span, must be an odd number. The algorithm
|
|
does not simply look for identity but uses a score matrix that
|
|
contains scores for every possible pair of characters. At each
|
|
point that a threshold score is achieved the program marks the screen
|
|
in one of two ways. It will either place a single dot at the position
|
|
corresponding to the centre of the matching span, or it will plot a
|
|
dot for each identical residue within each matching span.
|
|
Alternatively, the "list matching spans" option will list the
|
|
segments that match.
|
|
|
|
For comparing amino acid sequences we usually use the
|
|
score matrix shown below which was calculated by adding 10 (to make
|
|
every term >0) to each term of the relatedness odds matrix MDM78 of
|
|
Dayhoff. This matrix MDM78 was calculated by looking at accepted
|
|
point mutations in 71 families of closely related proteins and, of
|
|
those tested by Dayhoff, was found to be the most powerful score
|
|
matrix for finding distant relationships between amino acid
|
|
sequences.
|
|
|
|
AMINO ACID SCORE MATRIX
-----------------------

   C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?
C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
|
|
One alternative for proteins is to use an identity matrix. For
|
|
comparing nucleic acids we usually use the matrix shown below.
|
|
|
|
DNA SCORE MATRIX
|
|
|
|
A C G T X
|
|
A 1 0 0 0 0
|
|
C 0 1 0 0 0
|
|
G 0 0 1 0 0
|
|
T 0 0 0 1 0
|
|
X 0 0 0 0 0
|
|
See option 32 for how to change the score matrices.
|
|
|
|
When a sequence is compared against itself to look for repeats
|
|
it is possible to use the proportional algorithm in a mode such that
|
|
the main diagonal is not shown. See option 30.
|
|
|
|
Typical dialogue follows.
|
|
|
|
? Menu or option number=d18
|
|
? Odd span length (1-401) (11) =
|
|
? Proportional score (1-297) (132) =
|
|
Working
|
|
|
|
missing graphics
|
|
|
|
@19. TX 4 @List matching spans
|
|
This option applies the proportional algorithm using the current
|
|
span and cut-off score, but instead of drawing a dot matrix it lists
|
|
all the matching spans. When a sequence is compared against itself
|
|
to look for repeats it is possible to use this algorithm in a mode
|
|
such that the main diagonal is not listed. See option 30.
|
|
|
|
Typical dialogue follows.
|
|
? Menu or option number=d19
|
|
? Odd span length (1-401) (11) =
|
|
? Proportional score (1-297) (132) =148
|
|
List matching spans
|
|
Working
|
|
76
|
|
IEVPVGKATLG
|
|
LEVPVGRGLLG
|
|
95
|
|
77
|
|
EVPVGKATLGR
|
|
EVPVGRGLLGR
|
|
96
|
|
78
|
|
VPVGKATLGRI
|
|
VPVGRGLLGRV
|
|
97
|
|
79
|
|
PVGKATLGRIM
|
|
PVGRGLLGRVV
|
|
98
|
|
85
|
|
LGRIMNVLGEP
|
|
LGRVVNTLGAP
|
|
104
|
|
86
|
|
GRIMNVLGEPV
|
|
GRVVNTLGAPI
|
|
105
|
|
87
|
|
RIMNVLGEPVD
|
|
RVVNTLGAPID
|
|
106
|
|
|
|
@20. TX 3 @Set span length
|
|
|
|
The proportional algorithm calculates a score for each
|
|
position in the matrix by summing the points found when looking
|
|
forwards and backwards along a diagonal line of a given length.
|
|
This length, called the span, should be an odd number so that the
|
|
score for any point is correctly positioned at the centre of the
|
|
span. This option allows the user to define the span length. It
|
|
should be noted that short spans can produce noisy diagrams, but are
|
|
less affected by insertions and deletions than are long spans.
|
|
However, long spans can detect more distant relationships. Long spans
|
|
can suffer from a persistence problem by plotting dots when all the
|
|
"signal" is to one side of the span's central position. To help avoid
|
|
this, the option that plots the position of all matching residues
|
|
within a matching span can be tried. This is most useful if an
|
|
identity matrix is being used.
|
|
@21. TX 3 @Set proportional score
|
|
|
|
The proportional algorithm calculates a score for each
|
|
position in the matrix by summing the scores for the individual
|
|
amino acids found when looking forwards and backwards along a
|
|
diagonal line of a given length. All points at which the
|
|
proportional score is achieved will produce a dot on the diagram.
|
|
(The same score is used for the 'LIST MATCHING SPANS' option.)
|
|
|
|
Before choosing a score the user can apply the routine that
|
|
will calculate the expected score, or can calculate a histogram of
|
|
observed scores. It is best to start with a high score to avoid an
|
|
overcrowded diagram.
|
|
@22. TX 3 @Set identities score
|
|
|
|
The identities algorithm is of limited value as it only finds
|
|
runs of matching characters; however, it has the virtue of being very
|
|
fast. This option allows the user to set the minimum length of run
|
|
that will produce a dot on the screen.
|
|
@23. TX 3 @Calculate expected scores
|
|
|
|
This function calculates the "double matching probability" of
|
|
McLachlan. The "double matching probability" is the
|
|
probability of finding particular scores given two infinitely
|
|
long sequences of the composition of those being compared,
|
|
with the current span length and score matrix. By using this option
|
|
the user can choose to plot all the matches for which
|
|
the score exceeds a given significance level (such as 1%).
|
|
Generally it is best to begin at a low level to avoid an overcrowded
|
|
diagram.
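One way to picture such a calculation is sketched below. It is an illustration only and makes assumptions that are not taken from this text: that the score at each position of a span is drawn independently according to the compositions of the two sequences, and that the reported probability is that of reaching at least a given score. The names `span_score_tail` and `identity_score` are hypothetical.

```python
from collections import Counter


def identity_score(x, y):
    """Stand-in score matrix: 1 for identical characters, 0 otherwise."""
    return 1 if x == y else 0


def span_score_tail(seq_a, seq_b, span, score=identity_score):
    """Return {s: P(span score >= s)} for random sequences having the
    compositions of seq_a and seq_b."""
    freq_a, freq_b = Counter(seq_a), Counter(seq_b)
    # Distribution of the score at a single position of the span.
    single = {}
    for x, count_x in freq_a.items():
        for y, count_y in freq_b.items():
            p = (count_x / len(seq_a)) * (count_y / len(seq_b))
            s = score(x, y)
            single[s] = single.get(s, 0.0) + p
    # Convolve the single-position distribution once per span position.
    dist = {0: 1.0}
    for _ in range(span):
        nxt = {}
        for total, p in dist.items():
            for s, q in single.items():
                nxt[total + s] = nxt.get(total + s, 0.0) + p * q
        dist = nxt
    # Accumulate from the highest score downwards to obtain tail values.
    tail, running = {}, 0.0
    for total in sorted(dist, reverse=True):
        running += dist[total]
        tail[total] = running
    return tail
```

The score whose tail value first drops below the chosen significance level would then serve as the plotting threshold.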
|
|
|
|
When the calculation of the expected scores is finished the
|
|
program offers the user 3 ways of examining the results:
|
|
"Show probability for a score" allows the user to type in a
|
|
score and the program responds with the probability of achieving
|
|
that level of score.
|
|
"Show score for a probability" allows the user to type in a
|
|
probability value and the program types the score that corresponds
|
|
to that level of probability.
|
|
"List scores and probabilities" is the command to list out the
|
|
scores and their corresponding probabilities. The user is asked
|
|
to supply a further parameter, the "number of steps between scores",
|
|
and the program lists only every stepsize-th score. E.g. a stepsize of
|
|
5 will get every 5th score listed.
|
|
|
|
Typical dialogue follows.
|
|
? Menu or option number=d23
|
|
? Odd span length (1-401) (11) =
|
|
? Proportional score (1-297) (132) =
|
|
|
|
Working
|
|
Average score= 103.18557
|
|
RMS deviation= 7.85276
|
|
X 1 Show probability for a score
|
|
2 Show score for a probability
|
|
3 List scores and probabilities
|
|
? 0,1,2,3 =
|
|
|
|
? Show probability for score (1-165) (134) =160
|
|
Probability of score 160 is 0.0000000008
|
|
X 1 Show probability for a score
|
|
2 Show score for a probability
|
|
3 List scores and probabilities
|
|
? 0,1,2,3 =2
|
|
? Show score for probability (0.0000000001-1.) (0.00001) =0.0000001
|
|
Score for probability 0.0000001000 is 153
|
|
1 Show probability for a score
|
|
X 2 Show score for a probability
|
|
3 List scores and probabilities
|
|
? 0,1,2,3 =3
|
|
? Number of steps between scores (1-10) (5) =
|
|
|
|
0 0.10000E+01 100 0.67232E+00 200 0.18977E-20
|
|
5 0.10000E+01 105 0.42119E+00 205 0.42561E-22
|
|
10 0.10000E+01 110 0.20671E+00 210 0.87767E-24
|
|
15 0.10000E+01 115 0.78860E-01 215 0.16651E-25
|
|
20 0.10000E+01 120 0.23515E-01 220 0.27300E-27
|
|
25 0.10000E+01 125 0.55406E-02 225 0.00000E+00
|
|
30 0.10000E+01 130 0.10443E-02 230 0.00000E+00
|
|
35 0.10000E+01 135 0.15935E-03 235 0.00000E+00
|
|
40 0.10000E+01 140 0.19906E-04 240 0.00000E+00
|
|
45 0.10000E+01 145 0.20569E-05 245 0.00000E+00
|
|
50 0.10000E+01 150 0.17758E-06 250 0.00000E+00
|
|
55 0.10000E+01 155 0.12938E-07 255 0.00000E+00
|
|
60 0.10000E+01 160 0.80360E-09 260 0.00000E+00
|
|
65 0.10000E+01 165 0.43009E-10 265 0.00000E+00
|
|
70 0.10000E+01 170 0.20049E-11 270 0.00000E+00
|
|
75 0.99997E+00 175 0.82263E-13 275 0.00000E+00
|
|
80 0.99949E+00 180 0.29998E-14 280 0.00000E+00
|
|
85 0.99448E+00 185 0.98050E-16 285 0.00000E+00
|
|
90 0.96543E+00 190 0.28934E-17 290 0.00000E+00
|
|
95 0.86836E+00 195 0.77556E-19 295 0.00000E+00
|
|
1 Show probability for a score
|
|
2 Show score for a probability
|
|
X 3 List scores and probabilities
|
|
? 0,1,2,3 =!
|
|
|
|
|
|
@24. TX 3 @Calculate observed scores
|
|
|
|
This option applies the proportional algorithm to the
|
|
currently active sequence but instead of producing a dot matrix it
|
|
calculates a histogram of observed scores. The speed of this
|
|
calculation of course depends on the size of the active regions, but
|
|
when it is completed the program offers the user 3 ways of
|
|
examining the results:
|
|
|
|
"Show percentage for score" allows the user to type in a score
|
|
and the program responds with the percentage of points that
|
|
achieve this value.
|
|
|
|
"Show score for percentage" allows the user to type in a
|
|
percentage and the program responds with the corresponding
|
|
score. Values of this score and above are only achieved by
|
|
the given percentage of points.
|
|
|
|
"List scores and percentages" is the command to list out
|
|
the scores and the percentage of points achieving them. Typical
|
|
dialogue follows.
|
|
? Menu or option number=24
|
|
Working
|
|
Maximum observed score is 152
|
|
X 1 Show percentage reaching a score
|
|
2 Show score for a percentage
|
|
3 List scores and percentages
|
|
? 0,1,2,3 =
|
|
|
|
? Show percentage for score (1-152) (114) =144
|
|
Percentage of points with score 144 is 0.005486297
|
|
X 1 Show percentage reaching a score
|
|
2 Show score for a percentage
|
|
3 List scores and percentages
|
|
? 0,1,2,3 =2
|
|
|
|
? Show score for percentage (0.00001-1.) (0.001) =0.01
|
|
Score for percentage 0.010000000 is 143
|
|
1 Show percentage reaching a score
|
|
X 2 Show score for a percentage
|
|
3 List scores and percentages
|
|
? 0,1,2,3 =
|
|
|
|
? Show score for percentage (0.00001-1.) (0.001) =1.
|
|
Score for percentage 1.000000000 is 124
|
|
1 Show percentage reaching a score
|
|
X 2 Show score for a percentage
|
|
3 List scores and percentages
|
|
? 0,1,2,3 =3
|
|
? Number of steps between scores (1-10) (5) =1
|
|
|
|
73 236953 0.10000E+03
|
|
74 236951 0.99999E+02
|
|
75 236951 0.99999E+02
|
|
76 236950 0.99998E+02
|
|
77 236945 0.99996E+02
|
|
78 236942 0.99995E+02
|
|
79 236929 0.99989E+02
|
|
80 236900 0.99977E+02
|
|
|
|
missing data here
|
|
|
|
130 384 0.16206E+00
|
|
131 307 0.12956E+00
|
|
132 239 0.10086E+00
|
|
133 180 0.75964E-01
|
|
134 134 0.56551E-01
|
|
135 103 0.43468E-01
|
|
136 78 0.32918E-01
|
|
137 67 0.28276E-01
|
|
138 46 0.19413E-01
|
|
139 40 0.16881E-01
|
|
140 33 0.13927E-01
|
|
141 29 0.12239E-01
|
|
142 24 0.10129E-01
|
|
143 19 0.80184E-02
|
|
144 13 0.54863E-02
|
|
145 10 0.42202E-02
|
|
146 8 0.33762E-02
|
|
147 7 0.29542E-02
|
|
148 7 0.29542E-02
|
|
149 6 0.25321E-02
|
|
150 5 0.21101E-02
|
|
151 3 0.12661E-02
|
|
152 3 0.12661E-02
|
|
1 Show percentage reaching a score
|
|
2 Show score for a percentage
|
|
X 3 List scores and percentages
|
|
? 0,1,2,3 =!
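The listing above is consistent with cumulative counts: each line gives a score, the number of points reaching at least that score, and that number as a percentage of all points (13 of 236953 points is about 0.0055%). A minimal Python sketch of that bookkeeping follows; the function names are hypothetical and the input is assumed to be a plain list holding one span score per matrix point.

```python
from collections import Counter


def cumulative_percentages(scores):
    """Return rows (score, points reaching at least score, percentage),
    in ascending order of score."""
    counts = Counter(scores)
    total = len(scores)
    rows, reaching = [], 0
    for s in sorted(counts, reverse=True):
        reaching += counts[s]
        rows.append((s, reaching, 100.0 * reaching / total))
    rows.reverse()
    return rows


def score_for_percentage(rows, percent):
    """Return the lowest score reached by no more than `percent` of the
    points, or None if every listed score is reached by more than that."""
    for s, _, pct in rows:
        if pct <= percent:
            return s
    return None
```

Read against the dialogue, a request for 0.01 returns 143 because 0.0080% of points reach 143 whereas 0.0101% reach 142.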
|
|
|
|
@25. TX 3 @Show current parameter settings
|
|
|
|
This function lists the names of the current sequences, their
|
|
total lengths, the start and end points of the active sequence and
|
|
the current values of span and cut-off scores. It also shows if the
|
|
main diagonal will be shown, or if the proportional algorithm will
|
|
mark all identities in matching spans.
|
|
|
|
Typical dialogue follows.
|
|
? Menu or option number=25
|
|
Horizontal sequence
|
|
ALPHA.PRT
|
|
Positions
|
|
1 TO 514
|
|
Vertical sequence
|
|
BETA.PRT
|
|
Positions
|
|
1 TO 461
|
|
Span length= 11
|
|
Scores
|
|
Proportional= 132
|
|
Identities= 3
|
|
Identites off
|
|
Main diagonal shown
|
|
|
|
|
|
@27. TX 2 @Draw a /
|
|
|
|
This option simply draws a diagonal line from the bottom left
|
|
of the diagram to the top right. It can be an aid when trying to
|
|
align the sequences.
|
|
@26. TX 4 @Quick scan
|
|
|
|
The algorithm is as follows. The dot matrix positions are
|
|
found for all words of some minimum length (obviously length 1 is
|
|
most sensitive) that are common to both sequences. Imagine a
|
|
diagonal line running from corner to corner of the diagram, at right
|
|
angles to the diagonals in the dot matrix. The scores for the common
|
|
words (according to the current score matrix, e.g. MDM78) are
|
|
accumulated at the appropriate positions on that imaginary line,
|
|
hence producing a histogram. The histogram is analysed to find its
|
|
mean and standard deviation. The diagonals that lie above some
|
|
cutoff score (defined in standard deviation units) are rescanned
|
|
using the proportional algorithm, and a diagram produced. The method
|
|
is very fast, and is also employed by the library comparison
|
|
program.
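The first stage of the quick scan (finding the diagonals worth rescanning) can be sketched as follows. This is a Python illustration under assumptions, not the program's code: the names are hypothetical, diagonals are labelled by the offset d = j - i with 0-based positions, each common word contributes the sum of the scores of its characters paired with themselves, and the population standard deviation is used.

```python
from statistics import mean, pstdev


def identity_score(x, y):
    """Stand-in score matrix: 1 for identical characters, 0 otherwise."""
    return 1 if x == y else 0


def quick_scan_diagonals(seq_a, seq_b, word, n_sd, score=identity_score):
    """Return the offsets d = j - i of the diagonals whose accumulated word
    score lies more than n_sd standard deviations above the mean."""
    n, m = len(seq_a), len(seq_b)
    # Index the start positions of every word in seq_b.
    positions = {}
    for j in range(m - word + 1):
        positions.setdefault(seq_b[j:j + word], []).append(j)
    # One histogram bin per diagonal; bin index is d + (n - 1).
    hist = [0.0] * (n + m - 1)
    for i in range(n - word + 1):
        w = seq_a[i:i + word]
        for j in positions.get(w, []):
            hist[j - i + n - 1] += sum(score(c, c) for c in w)
    cutoff = mean(hist) + n_sd * pstdev(hist)
    return [b - (n - 1) for b, h in enumerate(hist) if h > cutoff]
```

Only the diagonals returned here would be passed on to the proportional algorithm, which is what makes the method fast.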
|
|
|
|
Typical dialogue follows.
|
|
|
|
? Menu or option number=d26
|
|
? Identity score (1-20) (3) =
|
|
? Odd span length (1-401) (11) =
|
|
? Proportional score (1-297) (132) =
|
|
? Number of sd above mean (0.00-10.00) (5.00) =
|
|
|
|
missing graphics
|
|
|
|
|
|
|
|
SIPL the library searching version of SIP
|
|
|
|
This program compares a probe sequence against a library of
|
|
sequences using the quick scan algorithm, sorts the matches into
|
|
descending order of score, and produces optimal alignments of the
|
|
best scores using the Myers and Miller method. It is very rapid.
|
|
|
|
Use of lists of entry names
|
|
|
|
SIPL has the ability to restrict searches to subsets of the
|
|
libraries. This does not require sublibraries to be created but
|
|
instead is achieved by using files containing a list of the entry
|
|
names of sequences. The user may choose to search only those entries
|
|
on the list or, alternatively, to search all but those on the list
|
|
(i.e. in the latter case the list contains the names of those to be
|
|
excluded). The programs can search libraries that have indexes and
|
|
those that do not. If a list of names for inclusion is used, then
|
|
the search will be faster if the index is present. In all other
|
|
circumstances the whole library will be read. The list must be in
|
|
library order except when it is used to include entries and an
|
|
index is available. The list must contain each entry name on a
|
|
separate line, with the name starting in column 1 of the line, i.e.
|
|
there must be no spaces at the start of the line. The list of entry
|
|
names can be produced by the keyword searches of nip, pip, sip, etc.,
|
|
as long as the listings produced have a space character separating
|
|
the entry name from the entry description. This will depend on how
|
|
well the library reformatting programs work. For example swissprot
|
|
entry names tend to run into the beginning of the descriptions, but
|
|
other libraries are generally OK.
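The list format described above (one entry name per line, starting in column 1, optionally followed by a space and a description) can be illustrated with a short Python sketch. The function names are hypothetical and this is not how SIPL itself reads the file.

```python
def read_entry_names(lines):
    """Extract the entry name from each line of a list of entry names."""
    names = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line[0].isspace():
            raise ValueError("entry name must start in column 1: %r" % line)
        # Anything after the first space is treated as the description.
        names.append(line.split()[0])
    return names


def select_entries(library_names, listed, include=True):
    """Keep the listed entries (include=True) or all but those listed
    (include=False), preserving library order."""
    wanted = set(listed)
    return [name for name in library_names if (name in wanted) == include]
```

A listing in which the name runs straight into the description would yield a wrong name here, which is the problem noted above for some reformatted libraries.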
|
|
@28. TX 4 @Align sequences
|
|
|
|
This function will produce an optimal alignment of two
|
|
segments of the sequence. The dynamic programming alignment
|
|
algorithm is based on that of Miller and Myers (). It guarantees to
|
|
produce alignments with the optimum score given a score matrix, a
|
|
gap start penalty, and a gap extension penalty. That is, starting a
|
|
gap costs a fixed penalty (F) and each residue added to the gap
|
|
incurs a further penalty (E) so that for each gap of length K
|
|
residues the penalty is F + K*E. Gaps at the ends of sequences incur
|
|
no penalty.
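The gap penalty rule above can be written out as a short Python sketch. The function names are hypothetical, and the padding character "-" is taken from the alignment listings in this help text.

```python
import re


def gap_penalty(length, start, extend):
    """Penalty for one gap of `length` residues: F + K*E."""
    return start + length * extend


def total_gap_penalty(aligned, start, extend, pad="-"):
    """Sum the penalties of all internal gaps in one aligned sequence.
    Gaps at either end of the sequence are free."""
    core = aligned.strip(pad)
    gaps = re.findall(re.escape(pad) + "+", core)
    return sum(gap_penalty(len(gap), start, extend) for gap in gaps)
```

With both penalties at their default of 10, a gap of three residues costs 10 + 3*10 = 40, so one long gap is cheaper than several short gaps of the same total length.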
|
|
|
|
The routine can only handle segments of sequence of maximum
|
|
length 5000 residues. When the sequences are read in the alignment
|
|
segment will be set to the first 5000 residues. A different segment
|
|
can be selected by prefixing the option number by the letter D, in
|
|
which case the cross hair can be used to identify the two ends. The
|
|
cross hair will appear. First position the crosshair at the bottom
|
|
left of the segment and type a character other than s or m or ",".
|
|
When the crosshair reappears, position it at the top right of the
|
|
segment, and type a keyboard character. The aligned sequences will
|
|
replace the active sequence if the user confirms "keep alignment".
|
|
By alternate use of the plotting and alignment routines it is
|
|
possible to rapidly produce an alignment of quite long sequences.
|
|
|
|
Typical dialogue follows.
|
|
|
|
28 = Align sequences
|
|
? Menu or option number=d28
|
|
Define the region to align using the cross-hair.
|
|
First identify the bottom left position and exit
|
|
the cross-hair routine. Then the top right.
|
|
|
|
(Bell rings, type return, cross hair appears)
|
|
|
|
? Penalty for starting a gap (1-100) (10) =
|
|
? Penalty for each residue in gap (1-100) (10) =
|
|
|
|
Aligning region 1 to 461
|
|
with region 1 to 514
|
|
1 11 21 31 41 51
|
|
MA--TGKIVQ VIGA------ VVDVEFPQDA VPRVYDALEV QNG------N ERLVL-----
|
|
* * * ** * * * * *
|
|
MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
|
|
1 11 21 31 41 51
|
|
61 71 81 91 101 111
|
|
EVQQQLGGGI VRTIAMGSSD GLRRGLDVKD LEHPIEVPVG KATLGRIMNV LGEPVDMKGE
|
|
* * ** * * ** ***** *** * ** * * **
|
|
AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
|
|
61 71 81 91 101 111
|
|
121 131 141 151 161 171
|
|
IGEEERWAIH RAAPSYEELS NSQELLETGI KVIDLMCPFA KGGKVGLFGG AGVGKTVNMM
|
|
* ** * ** * * * * * * ***
|
|
LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
|
|
121 131 141 151 161 171
|
|
181 191 201 211 221 231
|
|
ELIRNIAIEH SGYS-VFAGV GERTREGNDF YHEMTDSNVI DKVSLVYGQM NEPPGNRLRV
|
|
* * ** * * *
|
|
DAI--INQRD SGIKCIYVAI GQKASTISNV VRKLEEHGAL ANTIVVVATA SESAALQYLA
|
|
181 191 201 211 221 231
|
|
241 251 261 271 281 291
|
|
ALTGLTMAEK FRDEGRDVLL FVDNIYRYTL AGTEVSALLG RMPSAVGYQP TLAEEMGVLQ
|
|
* * *** * * * * * * ** * * *
|
|
RMPVALMGEY FRDRGEDALI IYDDLSKQAV AYRQISLLLR RPPGREAFPG DVFYLHSRLL
|
|
241 251 261 271 281 291
|
|
301 311 321 331 341 351
|
|
ERITST---- ---------- -KTGSITSVQ AVYVPADDLT DPSPATTFAH LDATVVLSRQ
|
|
** **** * * * * * *
|
|
ERAARVNAEY VEAFTKGEVK GKTGSLTALP IIETQAGDVS AFVPTNVISI TDGQIFLETN
|
|
301 311 321 331 341 351
|
|
361 371 381 391 401 411
|
|
IASLGIYPAV DPLDSTSRQL DPLVVGQEHY DTAR----GV QSILQRYQEL KDIIAILGMD
|
|
** *** * * ** * * * * * **
|
|
LFNAGIRPAV NPGISVSR-- ---VGGAAQT KIMKKLSGGI RTALAQYREL AAFSQFAS--
|
|
361 371 381 391 401 411
|
|
421 431 441 451 461 471
|
|
ELSEEDKLVV ARARKIQRFL SQ----PFFV AE----VFTG SPGKYVSLKD --TIRGFKGI
|
|
* * * * * * * * * * * *
|
|
DLDDATRKQL DHGQKVTELL KQKQYAPMSV AQQSLVLFAA ERG-YLADVE LSKIGSFEAA
|
|
421 431 441 451 461 471
|
|
481 491 501 511 521
|
|
MEG--EYDHL P-EQAFYMVG SIEEAVE--- --------KA KKL*
|
|
** * * * * *
|
|
LLAYVDRDHA PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
|
|
481 491 501 511 521
|
|
Conservation 22.5%
|
|
Number of padding characters inserted 63 and 10
|
|
? (y/n) (y) Keep alignment n
|
|
|
|
|
|
@29. TX 1 @Complement the sequences
|
|
|
|
This function allows users to reverse and complement nucleic
|
|
acid sequences.
|
|
@30. TX 3 @Switch main diagonal
|
|
|
|
If a sequence is being compared against itself to look for
|
|
repeats it is sometimes convenient if the main diagonal is not
|
|
included in the comparison. This function allows users to set a
|
|
switch that determines whether or not to include the main diagonal
|
|
for all the comparison methods. If the switch is set, and the
|
|
active regions for both sequences have the same start position, then
|
|
the main diagonal will not be compared.
|
|
@31. TX 3 @Switch identities
|
|
|
|
This function allows a switch to be set or unset. The switch
|
|
determines which of two forms of plot will be produced by the
|
|
proportional algorithm. One form of output (the original method)
|
|
plots a dot at the centre of each span that reaches the threshold
|
|
score; whereas the other form will plot dots for all matching
|
|
residues that lie within spans that reach the threshold.
|
|
@32. TX 3 @Change score matrix
|
|
|
|
This option allows users to select their own score matrix for
|
|
use with the proportional algorithm. The choices are:
|
|
|
|
1 = MDM78
|
|
2 = identity
|
|
3 = your own matrix
|
|
|
|
|
|
MDM78 is the standard matrix that is used for proteins and an
|
|
identity matrix is the default matrix for nucleic acids. However, an
|
|
identity matrix is also useful for protein comparisons. "Your own
|
|
matrix" allows users to apply any other matrix, as long as the
|
|
matrix file is in the same format as MDM78. For comparisons of DNA
|
|
it might be useful to try one that gives, say, 3 for exact matches and
|
|
1 for purine-purine (R-R) or pyrimidine-pyrimidine (Y-Y) pairs, and 0 otherwise.
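The DNA scoring scheme suggested above can be sketched in Python as follows. The function name `dna_score` is hypothetical, and the sketch only shows the scores themselves; it does not reproduce the matrix file format that the program expects.

```python
PURINES = {"A", "G"}        # R
PYRIMIDINES = {"C", "T"}    # Y


def dna_score(x, y):
    """3 for an exact match, 1 for two different purines or two different
    pyrimidines, 0 for anything else (including unknown characters)."""
    if x == y and x in "ACGT":
        return 3
    if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
        return 1
    return 0


# Print the scores as a small table in the style of the DNA score matrix.
bases = "ACGT"
print("  " + " ".join(bases))
for x in bases:
    print(x + " " + " ".join(str(dna_score(x, y)) for y in bases))
```

Transitions (A-G, C-T) then score above transversions, which may help when comparing diverged sequences.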
|
|
@33. TX 3 @Set number of sd's for Quickscan
|
|
|
|
The quickscan algorithm is as follows. The dot matrix
|
|
positions are found for all words of some minimum length (obviously
|
|
length 1 is most sensitive) that are common to both sequences.
|
|
Imagine a diagonal line running from corner to corner of the
|
|
diagram, at right angles to the diagonals in the dot matrix. The
|
|
scores for the common words (according to the current score matrix,
|
|
e.g. MDM78) are accumulated at the appropriate positions on that
|
|
imaginary line, hence producing a histogram. The histogram is
|
|
analysed to find its mean and standard deviation. The diagonals that
|
|
lie above some cutoff score (defined in standard deviation units),
|
|
are rescanned using the proportional algorithm, and a diagram
|
|
produced.
|
|
|
|
This option allows the number of sd's to be set.
|
|
@34. TX 3 @Set gap penalties
|
|
|
|
The alignment function will produce an optimal alignment of
|
|
two segments of the sequence. The dynamic programming alignment
|
|
algorithm is based on that of Miller and Myers (). It guarantees to
|
|
produce alignments with the optimum score given a score matrix, a
|
|
gap start penalty, and a gap extension penalty. That is, starting a
|
|
gap costs a fixed penalty (F) and each residue added to the gap
|
|
incurs a further penalty (E) so that for each gap of length K
|
|
residues the penalty is F + K*E. Gaps at the ends of sequences incur
|
|
no penalty.
|
|
|
|
This option allows the gap penalties to be set.
|
|
@ end of help
|
|
|
|
|
|
|
|
|
|
|
|
|