.NPA
.SP 1
.left margin1
@-1. TX 0 @General
.sp
@-2. T 0 @Screen control
.sp
@-2. X 0 @Screen
.sp
@-3. TX 0 @Set parameters
.sp
@-4. TX 0 @Comparison
.sp
@0. TX -1 @SIP
.PARA
This is a program for comparing and aligning nucleic acid or protein
sequences. It can produce optimal alignments using a dynamic
programming algorithm, and has several ways of producing "dot matrix"
diagrams.
.PARA
The analyses included are listed, with their option numbers, under Help.
.sp
.para
The program is used on a simple graphics terminal, i.e. a
keyboard with a screen on which points and lines can be drawn.
The
user works at the terminal and produces plots for various
combinations of values for the span length and minimum scores.
However large or small a region the user
elects to compare the program expands or contracts the diagram
so
that the plot always fills the screen. This allows the user to gain
an overall impression or to "home-in" on particular regions and
examine them in more detail. Having found a region that looks
interesting the user can determine its coordinates in terms of
sequence positions by use of a crosshair facility.
.para
The program has two statistical options to help the user
choose score levels for plotting and to assess the significance of
any similarity found. It can produce a cumulative histogram of
observed scores for the current span length and region and it can
calculate the "double matching probability" of McLachlan.
The
"double matching probability" is the probability of finding
particular scores given two infinitely long sequences of the
composition of those being compared, with the current span
length
and score matrix. By using these options the user can choose to
plot all the matches for which the score exceeds a given
significance level (such as 1%), using either empirical or
theoretical probability values. Generally it is best to begin at a
low level to avoid an overcrowded diagram.
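.para
As a rough illustration of the empirical route, the following sketch builds a
cumulative histogram of observed scores and picks the lowest score that is
reached by no more than a chosen fraction of the observations. It is not the
program's own routine: the function names and the sample scores are
hypothetical, and the "double matching probability" calculation is not
attempted.
.lit
```python
from collections import Counter

def cumulative_histogram(scores):
    """Return {score: fraction of observed scores >= score}."""
    counts = Counter(scores)
    total = len(scores)
    running = 0
    cumulative = {}
    for score in sorted(counts, reverse=True):
        running += counts[score]
        cumulative[score] = running / total
    return cumulative

def cutoff_for_level(scores, level):
    """Lowest score whose cumulative fraction does not exceed `level`."""
    cumulative = cumulative_histogram(scores)
    eligible = [s for s, fraction in cumulative.items() if fraction <= level]
    return min(eligible) if eligible else None

# Hypothetical span scores for one region and span length.
observed = [1, 1, 1, 2, 2, 3, 3, 4, 5, 9]
print(cutoff_for_level(observed, 0.10))
```
.end lit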
.para
If the user finds that the two sequences do contain stretches
of homology he will often want to align the sequences by
inserting
padding characters at deletion points. The program has a
selection
of options for this purpose: it contains an alignment routine; it
can display the two sequences on the screen, one above the other, with
asterisks marking identities; it has inbuilt editing functions; and it can
save the aligned sequences on disk files.
.para
The basic principle of dot matrices was first
described by Gibbs and McIntyre and involves producing a diagram
that contains a representation of all the matches between a pair
of
sequences. This diagram is then scanned by eye and the human
ability to recognise patterns used to detect any similarities that
might be present. The diagram consists of a two dimensional plot
in
which the x axis represents one sequence (A) and the y axis the
other (B). Every point (i,j) on the plane x,y is assigned a score
which corresponds to the level of similarity between sequence
characters A(i) and B(j). In the simplest use of the method a score
of 1 could be assigned to every point (i,j) where A(i) = B(j), and a
score of 0 to every other point. If a plot of the points in the
plane was made in which all scores of 1 were marked with a dot
and
all those of 0 left blank then regions of identity would appear as
diagonal lines. With the comparison displayed in this form the
human eye is very good at detecting regions of homology even if
they
are imperfect. The effects of mismatches, insertions or
deletions
can be seen: matches interrupted by insertions or deletions will
appear as parallel diagonals, and matches interrupted by the odd
mismatching pair of characters will appear as broken collinear
diagonal lines. This diagram is a very useful representation but
simply placing a dot for every identity is of limited value for the
following reasons.
.para
For nucleic acid sequences around 25% of the plot will contain
points and it will often be very difficult to distinguish
significant homologies from chance matches. For proteins
many
significant alignments of sequences contain almost no identities
but
are formed from chemically and structurally similar amino acids
so
that simply looking for identity would be insufficient. What is
required is to first find those points that correspond to fairly
strong local similarities and then to use the diagram of these
points so that the human eye can be used to look for larger scale
homologies. The program uses a number of different algorithms to
calculate the
score for each point and the user defines a minimum score so
that
only those points in the diagram for which the score is at least
this value will be marked with a dot.
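.para
The simplest form of the diagram described above (a score of 1 for each
identity and 0 otherwise) can be sketched as follows. This is an illustration
only, with hypothetical function names; it draws the matrix as text with
sequence A along the x axis and sequence B up the y axis.
.lit
```python
def identity_dots(seq_a, seq_b):
    """Return 1-based (i, j) pairs where seq_a[i] equals seq_b[j]."""
    return [(i + 1, j + 1)
            for j, b in enumerate(seq_b)
            for i, a in enumerate(seq_a)
            if a == b]

def render(seq_a, seq_b):
    """Draw the matrix as text: x axis is seq_a, y axis is seq_b (upwards)."""
    dots = set(identity_dots(seq_a, seq_b))
    rows = []
    for j in range(len(seq_b), 0, -1):
        rows.append("".join("*" if (i, j) in dots else "."
                            for i in range(1, len(seq_a) + 1)))
    return "\n".join(rows)

print(render("GATTACA", "GATCACA"))
```
.end lit
.para
Regions of identity show up as runs of asterisks rising from bottom left to
top right; the scattered asterisks elsewhere are the chance matches that make
this simple form hard to read for nucleic acids.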
.para
The first scoring method finds the longest uninterrupted sections of
perfect identity i.e.
those that contain no mismatches, insertions or deletions.
Generally this method, termed "the identities algorithm", is of little
value, but runs very quickly.
.para
The
second method looks for sections where a proportion of the
characters in the sequence are similar, again allowing no
insertions
or deletions. For a thorough analysis this method, termed "the
proportional algorithm", is the best.
.para
The original method of this type was first
described by McLachlan and involves calculating a score for
each position in the matrix by summing points found when
looking
forwards and backwards along a diagonal line of a given length.
This length, called the span, must be an odd number so that the dot
marking matches can be precisely placed at its centre.
The algorithm does not simply look for identity but uses a
score matrix that contains scores for every possible pair of
characters. For comparing amino acid sequences we usually
use the score
matrix shown below which was calculated by adding 10 (to make
every term >0) to each term of the relatedness odds matrix MDM78
of
Dayhoff. This matrix MDM78 was calculated by looking at accepted
point mutations in 71 families of closely related proteins and, of
those tested by Dayhoff, was found to be the most powerful
score
matrix for finding distant relationships between amino acid
sequences.
.left margin1
.lit
AMINO ACID SCORE MATRIX
-----------------------
   C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?
C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
.end lit
.para
It is also possible to use other matrices, including an identity matrix for
proteins. For nucleic acids we usually use the matrix shown below.
.lit
DNA SCORE MATRIX
   A  C  G  T  X
A  1  0  0  0  0
C  0  1  0  0  0
G  0  0  1  0  0
T  0  0  0  1  0
X  0  0  0  0  0
.end lit
.left margin2
.para
Plotting dots at the centres of spans that reach the cutoff leads to a
persistence effect that, to some extent, can be mitigated by a variation
on the method. If, for example, all the high scoring amino acids are
clustered at the left end of a particular diagonal segment, dots will
continue to be plotted to their right until the span score drops below the
cutoff. Instead of plotting a single point for each span that reaches the
cutoff score, the variant method plots points for all the identities that
lie in spans that reach the cutoff. Obviously the persistence effect can be
more pronounced for long spans and low cutoff scores, but note that the
variant method will not plot anything if there are no identities present,
and so similar regions could be missed!
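.para
The difference between the two ways of plotting can be sketched as follows.
This is a minimal illustration, not the program's code: the function names are
hypothetical, positions are 0-based, and a simple identity score stands in for
the score matrix.
.lit
```python
def span_scores(seq_a, seq_b, span, score):
    """Yield (i, j, total) for each span start (0-based) on every diagonal."""
    for i in range(len(seq_a) - span + 1):
        for j in range(len(seq_b) - span + 1):
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in range(span))
            yield i, j, total

def centre_dots(seq_a, seq_b, span, cutoff, score):
    """One dot at the centre of each span that reaches the cutoff."""
    half = span // 2
    return {(i + half, j + half)
            for i, j, total in span_scores(seq_a, seq_b, span, score)
            if total >= cutoff}

def identity_dots_in_spans(seq_a, seq_b, span, cutoff, score):
    """Dots for every identity lying inside a span that reaches the cutoff."""
    dots = set()
    for i, j, total in span_scores(seq_a, seq_b, span, score):
        if total >= cutoff:
            dots.update((i + k, j + k) for k in range(span)
                        if seq_a[i + k] == seq_b[j + k])
    return dots

def identity(a, b):
    """Stand-in score: 1 for an identity, 0 otherwise."""
    return 1 if a == b else 0

# The three matches are clustered at the left end of the main diagonal.
a, b = "ACGAAAA", "ACGTTTT"
print(sorted(centre_dots(a, b, 5, 2, identity)))
print(sorted(identity_dots_in_spans(a, b, 5, 2, identity)))
```
.end lit
.para
With these inputs the centre method also plots a dot at a mismatching position
to the right of the cluster (the persistence effect), whereas the variant plots
only the three identities.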
.para
A further variant, useful for comparing a sequence against itself, ignores
the main diagonal.
.para
The third comparison method called "quick scan" is really a combination
of the first two, and is similar to the FASTP program of Lipman and
Pearson, but produces a dot matrix diagram. The algorithm is as follows.
The dot matrix positions are found for all words of some minimum length
(obviously length 1 is most sensitive) that are common to both
sequences. Imagine a diagonal line running from corner to corner of the
diagram, at right angles to the diagonals in the dot matrix. The scores
for the common words (according to the current score matrix, e.g.
MDM78) are accumulated at the appropriate positions on
that imaginary line, hence producing a
histogram. The histogram is analysed to find its mean and standard
deviation. The diagonals that lie above some cutoff score (defined in
standard deviation units), are rescanned using the proportional
algorithm, and a diagram produced. The method is very fast, and is also
employed by the library comparison program.
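.para
The first stages of the quick scan described above (finding common words,
accumulating their scores on diagonals, and selecting diagonals by standard
deviation) might be sketched as follows. This is an illustration under stated
assumptions rather than the program's code: the function names are
hypothetical, a diagonal is identified by the offset i - j, the population
standard deviation is used, and the final rescan with the proportional
algorithm is omitted.
.lit
```python
from collections import defaultdict
from statistics import mean, pstdev

def diagonal_histogram(seq_a, seq_b, word_length, score):
    """Accumulate scores of common words on each diagonal (offset i - j)."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - word_length + 1):
        positions[seq_b[j:j + word_length]].append(j)
    histogram = {d: 0 for d in range(-(len(seq_b) - 1), len(seq_a))}
    for i in range(len(seq_a) - word_length + 1):
        word = seq_a[i:i + word_length]
        for j in positions.get(word, []):
            # The words are identical, so each character is scored with itself.
            histogram[i - j] += sum(score(c, c) for c in word)
    return histogram

def significant_diagonals(histogram, sd_cutoff):
    """Diagonals whose total exceeds the mean by sd_cutoff standard deviations."""
    values = list(histogram.values())
    threshold = mean(values) + sd_cutoff * pstdev(values)
    return sorted(d for d, total in histogram.items() if total > threshold)

def identity(a, b):
    """Stand-in for the current score matrix."""
    return 1 if a == b else 0

hist = diagonal_histogram("XXXXABCDEF", "ABCDEFYYYY", 2, identity)
print(significant_diagonals(hist, 3.0))
```
.end lit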
.para
The dynamic programming alignment algorithm contained in the program
is based on that of Miller and Myers (). It guarantees to produce
alignments with the optimum score given a score matrix, a gap start
penalty, and a gap extension penalty. That is, starting a gap costs a fixed
penalty (IG) and each residue added to the gap incurs a further penalty
(IH) so that for each gap of length K residues the penalty is IG + K*IH.
Gaps at the ends of sequences incur no penalty.
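.para
The gap penalty rule can be illustrated by scoring an alignment that has
already been made. The sketch below does not perform the dynamic programming
itself; the function names are hypothetical, padding is assumed to be the dash
character, and a simple identity score stands in for the score matrix.
.lit
```python
def gap_penalty(length, gap_start, gap_extend):
    """Penalty IG + K*IH for one gap of K residues."""
    return gap_start + length * gap_extend

def alignment_score(aligned_a, aligned_b, score, gap_start, gap_extend, pad="-"):
    """Score a finished alignment; gaps at either end cost nothing."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    columns = list(zip(aligned_a, aligned_b))
    # Trim leading and trailing columns that contain padding (end gaps are free).
    while columns and pad in columns[0]:
        columns.pop(0)
    while columns and pad in columns[-1]:
        columns.pop()
    total = 0
    gap_row, gap_len = None, 0  # which sequence holds the current gap, if any
    for a, b in columns:
        row = 0 if a == pad else 1 if b == pad else None
        if row is None:
            total += score(a, b)
        if row != gap_row and gap_len:
            # The gap has just ended (or moved to the other sequence).
            total -= gap_penalty(gap_len, gap_start, gap_extend)
            gap_len = 0
        gap_row = row
        if row is not None:
            gap_len += 1
    return total

def identity(a, b):
    """Stand-in score: 1 for an identity, 0 otherwise."""
    return 1 if a == b else 0

print(alignment_score("AC--GT", "ACTTGT", identity, 10, 1))
```
.end lit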
.para
It is very useful to have the dot matrix methods and the alignment
routine together in the same program because it allows users to produce
a dot matrix diagram to help select which regions of the sequence they
wish to align. Selection is made by use of the crosshair. First the
crosshair is positioned at the bottom left hand end of the segment to be
aligned. The crosshair function is quit and immediately selected again,
the crosshair positioned at the top right of the segment, and the
crosshair function quit. When the alignment routine is selected the
segment will be aligned.
.para
The alignment can replace the original segment of the sequence. By
repeated plotting of dot matrices, followed by alignment, very long
sequences can easily be aligned.
.LEFT MARGIN1
@1. TX 0 @Help
.LEFT MARGIN2
.para
This option gives online help. The user should select option numbers and
the current documentation will be given.
.PARA
The following analyses (preceded by their option numbers) are included:
.lit
? = Help
! = Quit
3 = read a new sequence
4 = define active region
5 = list the sequence
6 = list a text file
7 = direct output to disk
8 = write active sequence to disk
9 = edit the sequences
10 = clear graphics screen
11 = clear text screen
12 = draw a ruler
13 = use cross hair
14 = reposition plots
15 = label diagram
16 = display a map
17 = apply identities algorithm
18 = apply proportional algorithm
19 = list matching spans
20 = set span length
21 = set proportional score
22 = set identities score
23 = calculate expected scores
24 = calculate observed scores
25 = show current parameter settings
26 = quick scan
27 = draw a /
28 = align the sequences
29 = complement the sequences
30 = switch main diagonal
31 = switch identities
32 = change score matrix
.end lit
.left margin1
@2. TX 0 @Quit
.left margin2
.para
This function stops the program.
.left margin1
@3. TX 1 @Read a new sequence
.LEFT MARGIN2
.para
This option allows users to read in new sequences, browse through annotations,
or search sequence
libraries for keywords. Sequences can be read from "personal"
sequence files or from sequence libraries. These are referred to as the
sequence "source". Personal files can be stored in several formats:
Staden, PIR, EMBL, GENBANK and GCG.
At LMB we use "Staden" format for sequencing and all
the
libraries are stored in their original formats. Note, however, that libraries
such as EMBL or GenBank that are divided into several files (e.g. GenBank has
13 separate files) are indexed as a whole. This means that users do not need
to know which file contains an entry, only which library.
When the user selects to read in a sequence the program first asks for the
sequence "source".
.para
If the user selects "personal" the program will ask for
the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
the file. For PIR format the user will also be required to know the entry
name of the sequence as the file can contain several. For the other formats
only a single entry is expected. The file will be read, its length and
composition will be displayed and the option left.
.para
If the user selects "library" as the sequence source the program will display a
list of available libraries. The programs are capable of handling all current
libraries but which ones are available will vary from site to site. At LMB we
have several libraries and also weekly updates of data gathered between releases.
The program will ask users to select a library and then give a list of options:
.lit
X 1 Get a sequence
2 Get annotations
3 Get entrynames from accession numbers
4 Search titles for keywords
5 Search text index for keywords
.end lit
If "Get a sequence" or "Get annotations" is selected users will be asked to
type the entry name. The option will be left when a sequence is selected or
! is typed. The composition and length will be displayed.
.para
The text index contains all words from feature tables, reference titles,
definition lines, keywords lists and comments, so the text index search
is most useful. It is also the fastest. Up to 5 words can be searched for
at once. The words should be typed separated by spaces, for example
.lit
? Keywords=P53 mouse murine tumo
.end lit
will search for all entries that contain words starting with p53, mouse,
murine and tumo. Only the unique entries that contain ALL words will be
listed. Before listing the matching entries
the program will show the number of 'hits' for each word and ring the bell.
Escape is possible at this point, or after each screenful of entries.
In addition to the entry names the text search displays the primary accession
number, the sequence length and up to 80 characters of description.
(The search of 'titles' is now redundant because the full text index
contains all the title words and the search is much faster. It will probably
be removed from the program.)
All searches are independent of case. Where
possible the program will offer default entry names.
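.para
The matching rule of the text index search (words match by their starting
letters, case is ignored, and an entry must contain ALL the words) can be
sketched as follows. This is an illustration only: the function name is
hypothetical and the toy index holds just three description lines, whereas the
real text index also covers feature tables, keywords and comments.
.lit
```python
def search_index(entries, query, max_words=5):
    """Return names of entries having a word that starts with every query word."""
    words = query.lower().split()
    if len(words) > max_words:
        raise ValueError("up to %d words can be searched for at once" % max_words)
    hits = []
    for name, text in entries.items():
        entry_words = text.lower().split()
        if all(any(w.startswith(q) for w in entry_words) for q in words):
            hits.append(name)
    return sorted(hits)

# A toy index built from three description lines.
entries = {
    "MMP53A": "Mouse p53 mRNA, complete cds, clone pcD53.",
    "MMANT11": "Murine p53 gene 3' region with exon 11",
    "MMLYN": "Mouse lyn protein mRNA, complete cds.",
}
print(search_index(entries, "P53 mouse"))
```
.end lit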
.para
Typical dialogue follows.
.lit
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =
Select sequence file format
X 1 Staden
2 EMBL
3 GenBank
4 PIR
5 GCG
? Selection (1-5) (1) =
? Sequence file name=M13MP7.SEQ
Contig title removed
Sequence length= 7238
Sequence composition
T C A G -
2405. 1539. 1765. 1527. 2.
33.2% 21.3% 24.4% 21.1% 0.0%
.
.
.
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =2
Select a library
X 1 EMBL 29 nucleotide library Dec 91
2 SWISSPROT 20 protein library Nov 91
3 PIR 31 protein library Dec 91
4 NRL3D 58 From Brookhaven protein library Dec 91
5 GenBank
? Selection (1-5) (1) =
Library is in EMBL format with indexes
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =5
Search for keywords
? Keywords=P53 mouse
P53 hits 68
MOUSE hits 8180
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
22 different entries found
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =4
Search for keywords
? Keywords=alpha
Searching for alpha
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =3
? Accession number=v00636
Entry name LAMBDA
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =2
Default Entry name=LAMBDA
? Entry name=
ID LAMBDA standard; DNA; PHG; 48502 BP.
XX
AC V00636; J02459; M17233; X00906;
XX
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
DT 09-JUN-1982 (Rel. 1, Created)
XX
DE Genome of the bacteriophage lambda (Styloviridae).
XX
KW circular; coat protein; DNA binding protein; genome;
KW origin of replication.
XX
OS Bacteriophage lambda
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
XX
RN [1]
RP 1-48502
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
RT "Nucleotide sequence of bacteriophage lambda DNA";
RL J. Mol. Biol. 162:729-773(1982).
XX
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =
Default Entry name=LAMBDA
? Entry name=
DE Genome of the bacteriophage lambda (Styloviridae).
Sequence length 48502
Sequence composition
T C A G -
11988. 11360. 12336. 12818. 0.
24.7% 23.4% 25.4% 26.4% 0.0%
.end lit
.left margin1
@4. TX 1 @Define active region
.LEFT MARGIN2
.para
For its analytic functions
the program always works on a region of the sequence called the active
region. When a new sequence is read into the program the active region is
automatically set to start at the beginning of the sequence and go
up to the
maximum allowed size of active region the program can
handle. The positions are shown on the screen.
On most machines this will be to the end of the sequence.
This option allows the user to define a different region.
.left margin1
@5. TX 1 @List a sequence
.LEFT MARGIN2
.para
The sequence can be listed with line lengths from
10 to 120 in multiples of 10. The output looks like:
.lit
87         97         107        117        127        137
KVKCTGRILE VPVGRGLLGR VVNTLGAPID GKGPLDHDGF SAVEAIAPGV IERQSVDQPV
 **      * ****   ***   * ** * *  **         *    **    *
DVKDLEHPIE VPVGKATLGR IMNVLGEPVD MKGEIGEEER WAIHRAAPSY EELSNSQELL
68         78         88         98         108        118
147        157        167        177        187        197
QTGYKAVDSM IPIGRGQREL IIGDRQTGKT ALAIDAIINQ RDSGIKCIYV AIGQ
 ** *  * *  *   *       *    ***       * *             *
ETGIKVIDLM CPFAKGGKVG LFGGAGVGKT VNMMELIRNI AIEHSGYSVF AGVG
128        138        148        158        168        178
.end lit
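.para
A listing of this kind could be produced along the following lines. This is a
sketch, not the program's routine: the function names are hypothetical, and
placing each number over the first residue of its block of ten is an
assumption about the layout.
.lit
```python
def _blocks(text, block):
    """Split text into space-separated blocks of `block` characters."""
    return " ".join(text[k:k + block] for k in range(0, len(text), block))

def _numbers(first, count, block):
    """Position numbers, one per block, left-aligned over each block start."""
    return "".join(str(first + k).ljust(block + 1)
                   for k in range(0, count, block)).rstrip()

def list_pair(seq_a, seq_b, start_a=1, start_b=1, line_length=60, block=10):
    """Format two equal-length sequences, marking identities with asterisks."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length")
    if line_length % block or not block <= line_length <= 120:
        raise ValueError("line length must be a multiple of %d up to 120" % block)
    lines = []
    for offset in range(0, len(seq_a), line_length):
        a = seq_a[offset:offset + line_length]
        b = seq_b[offset:offset + line_length]
        marks = "".join("*" if x == y else " " for x, y in zip(a, b))
        lines += [_numbers(start_a + offset, len(a), block),
                  _blocks(a, block),
                  _blocks(marks, block).rstrip(),
                  _blocks(b, block),
                  _numbers(start_b + offset, len(b), block)]
    return "\n".join(lines)

print(list_pair("VPVGRGLLGRVVNTLGAPID", "VPVGKATLGRIMNVLGEPVD", 97, 78))
```
.end lit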
.left margin1
@6. TX 1 @List a text file
.LEFT MARGIN2
.para
Allows the user to have a text file displayed on the screen. It will appear
one page at a time.
.left margin1
@7. TX 1 @Direct output to disk
.LEFT MARGIN2
.para
Used to direct output that would normally appear on the screen to a file.
.para
Select redirection of either text or graphics, and
supply the name of the file that the output should be written to.
.para
The results from the next options selected will not appear on the screen
but will be written to the file. When option 7 is selected again
the file will be
closed and output will again appear on the screen.
.left margin1
@8. TX 1 @Write active region to disk
.LEFT MARGIN2
.para
This option allows users to
write the current active sequence to a disk file in Staden format.
.left margin1
@9. TX 1 @Edit the sequences
.LEFT MARGIN2
.para
This function allows the user to insert or delete parts of either sequence
to help align them. The inserted characters are dashes.
.left margin1
@10. TX 2 @Clear graphics
.LEFT MARGIN2
.para
Clears the screen of both text and graphics.
.left margin1
@11. TX 2 @Clear text
.LEFT MARGIN2
.para
Clears only text from the screen.
.left margin1
@12. TX 2 @Draw a ruler
.LEFT MARGIN2
.para
This option
allows the user to draw a ruler or scale along the axes of the screen to
help identify the coordinates of points of interest. The user can define
the position of the first sequence element to be marked
(for example if the active
region is 1501 to 8000, the user might wish to mark every 1000th
element
starting at either 1501 or 2000 - it depends if the user wishes to treat
the active region as an independent unit with its own numbering starting
at
its left edge, or as part of the whole sequence). The user can also define
the separation of the ticks on the scale and their height. If required the
labelling routine can be used to add numbers to the ticks.
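.para
The choice of first mark in the example above can be illustrated with a short
sketch (the function name is hypothetical; it simply lists the sequence
positions at which ticks would fall inside the active region).
.lit
```python
def ruler_ticks(region_start, region_end, first_mark, separation):
    """Sequence positions of the ticks that fall inside the active region."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return [position
            for position in range(first_mark, region_end + 1, separation)
            if position >= region_start]

# Active region 1501 to 8000, a tick every 1000 elements.
print(ruler_ticks(1501, 8000, 2000, 1000))  # numbered as part of the whole sequence
print(ruler_ticks(1501, 8000, 1501, 1000))  # numbered from the left edge of the region
```
.end lit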
.PARA
To escape type !
.left margin1
@13. TX 2 @Use cross hair
.LEFT MARGIN2
.para
This function puts
a steerable cross on the screen that can be used to find the
coordinates of points in the sequence. The user can move the cross
around using the directional keys; when he hits the space bar the
program will write out the coordinates of the cross in sequence units and
the option will be exited.
.para
If instead,
the user hits a , the position will be displayed but the cross will remain on
the screen.
.para
If a letter s is hit the sequences around the cross hair are displayed as a
short alignment (as shown below) and the cross remains on the screen.
.lit
97         107
VPVGRGLLGR VVNTLGAPID
****   ***   * ** * *
VPVGKATLGR IMNVLGEPVD
78         88
.end lit
.PARA
If a letter m is hit the sequences around the cross hair are displayed in
the form of a matrix (as shown below) and the cross remains on the screen.
.lit
 VPVGKATLGRIMNVLGEPVD
D...................DD
I..........I.........I
P.P...............P..P
A.....A..............A
G...G....G......G....G
L.......L......L.....L
T......T.............T
N............N.......N
VV.V..........V....V.V
VV.V..........V....V.V
R.........R..........R
G...G....G......G....G
L.......L......L.....L
L.......L......L.....L
G...G....G......G....G
R.........R..........R
G...G....G......G....G
VV.V..........V....V.V
P.P...............P..P
VV.V..........V....V.V
 VPVGKATLGRIMNVLGEPVD
.end lit
.para
The function is also used prior to "align sequences" in order to delineate the
region to be aligned. The crosshair is positioned at the bottom left of the
region, the crosshair option quit. Then the crosshair option is selected
again, and the crosshair moved to the top right of the region to be
aligned.
.left margin1
@14. TX 2 @Reposition plots
.LEFT MARGIN2
.para
The position of the plots is defined relative to a user's drawing
board which has size 1-10,000 in x and 1-10,000 in y.
Plots
are drawn in a window defined by x0,y0 and xlength,ylength,
where x0,y0 is the position of the bottom left hand corner of the window,
xlength is the width of the window and ylength is the
height of the window.
.lit
---------------------------------------------------------  10,000
1                                                       1
1       --------------------------------------  ^       1
1       1                                    1  1       1
1       1                                    1  1       1
1       1                                    1 ylength  1
1       1                                    1  1       1
1       1                                    1  1       1
1       --------------------------------------  v       1
1  x0,y0^                                               1
1       <---------------xlength-------------->          1
---------------------------------------------------------  1
1                                                  10,000
.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "DIAGMARG" when the
program is started. Users can have their own file if required.
This option
allows users to change window positions whilst running the program.
If the user
types only carriage return for any value it will remain unchanged.
The cross-hair can be used to choose suitable heights.
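.para
How a sequence position might be converted to drawing board units can be
sketched as follows. The linear scaling is an assumption based on the statement
that the plot always fills its window; the function names and the window values
used in the example are hypothetical.
.lit
```python
def to_board(position, region_start, region_end, origin, length):
    """Map a sequence position to drawing board units along one axis."""
    if region_end == region_start:
        return float(origin)
    fraction = (position - region_start) / (region_end - region_start)
    return origin + fraction * length

def dot_position(i, j, region_a, region_b, x0, y0, xlength, ylength):
    """Board coordinates of the dot for elements i (x axis) and j (y axis)."""
    x = to_board(i, region_a[0], region_a[1], x0, xlength)
    y = to_board(j, region_b[0], region_b[1], y0, ylength)
    return x, y

# Hypothetical window: x0,y0 = 1000,2000 and xlength,ylength = 8000,6000.
print(dot_position(8000, 501, (1501, 8000), (1, 501), 1000, 2000, 8000, 6000))
```
.end lit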
.LEFT MARGIN1
@15. TX 2 @Label a diagram
.LEFT MARGIN2
.para
This routine allows users to label any diagrams they have produced. They
are asked to type in a label. When the user types carriage return to finish
typing the label the cross-hair appears on the screen. The user can
position it anywhere on the screen. If the user types R (for right justify)
the label will be
written on the diagram with its right end at the cross-hair position.
If the user types L (for left justify) the label will be written with its
left end at the cross hair position.
The
cross-hair will then immediately reappear. The user may put the same
label
on another part of the diagram in the same way, or, on hitting the space bar,
will be asked whether another label is to be typed in.
.left margin1
@16. TX 2 @Display a map
.LEFT MARGIN2
.para
NOT AVAILABLE.
This draws a map
of any sequence features selected by the user.
These features may be protein coding regions (CDS), tRNA genes (TRNA),
promoter positions (PRM), etc. Users may define their own feature table
key
names.
The coordinates must be stored in a file in the format of an EMBL feature
table.
.left margin1
@17. TX 4 @Apply identities algorithm
.LEFT MARGIN2
.para
The identities algorithm finds runs of identical characters
in the sequence. Its main value is speed, being hundreds of times faster than
the proportional algorithm. It is of course not very sensitive, and should
only be used for a quick scan. The cutoff score is the minimum number of
consecutive matching characters.
All runs of identical characters that are at least as long as the cutoff
score will produce a dot on the screen.
.para
See also quick scan.
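.para
A minimal sketch of the idea follows: each diagonal is walked once and every
maximal run of identities at least as long as the cutoff is reported. The
function name is hypothetical, positions are 1-based, and the sketch reports
runs rather than drawing dots.
.lit
```python
def identity_runs(seq_a, seq_b, min_length):
    """Return (start_a, start_b, length) for maximal runs of identities."""
    runs = []
    len_a, len_b = len(seq_a), len(seq_b)
    # Each diagonal is identified by the offset d = i - j (0-based indices).
    for d in range(-(len_b - 1), len_a):
        i, j = max(d, 0), max(-d, 0)
        run = 0
        while i < len_a and j < len_b:
            if seq_a[i] == seq_b[j]:
                run += 1
            else:
                if run >= min_length:
                    runs.append((i - run + 1, j - run + 1, run))
                run = 0
            i += 1
            j += 1
        # A run may extend to the end of the diagonal.
        if run >= min_length:
            runs.append((i - run + 1, j - run + 1, run))
    return runs

print(identity_runs("GATTACA", "CATTAG", 3))
```
.end lit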
.para
Typical dialogue follows.
.lit
? Menu or option number=d17
? Identity score (1-20) (2) =3
Working
missing graphics
.end lit
.left margin1
@18. TX 4 @Apply proportional algorithm
.para
This method, generally the most useful, was first
described by McLachlan and involves calculating a score for
each position in the matrix by summing points found when
looking
forwards and backwards along a diagonal line of a given length.
This length, called the span, must be an odd number.
The algorithm does not simply look for identity but uses a
score matrix that contains scores for every possible pair of
characters. At each point that a threshold score is achieved the
program marks the screen in one of two ways. It will either place a
single
dot at the position corresponding to the centre of the matching span, or
it
will plot a dot for each identical residue within each matching span.
Alternatively, the "list matching spans"
option will list the segments that match.
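.para
The span scoring described above might be sketched as follows. It is an
illustration, not the program's code: the function names are hypothetical,
only a few pair scores (copied from the amino acid score matrix given in this
document) are included, and the result is a list of span centres with their
scores rather than a plot.
.lit
```python
# Scores for a few residue pairs, copied from the amino acid score matrix.
SUBSET = {("V", "V"): 14, ("I", "I"): 15, ("L", "L"): 16,
          ("V", "I"): 14, ("V", "L"): 12, ("I", "L"): 12}

def pair_score(a, b, table=SUBSET):
    """Look up a symmetric pair score; KeyError for residues not in the table."""
    return table[(a, b)] if (a, b) in table else table[(b, a)]

def matching_spans(seq_a, seq_b, span, cutoff, score=pair_score):
    """Return (centre_a, centre_b, total), 1-based, for spans scoring >= cutoff."""
    if span % 2 == 0:
        raise ValueError("span length must be odd")
    half = span // 2
    hits = []
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            # Sum the pair scores looking backwards and forwards from the centre.
            total = sum(score(seq_a[i + k], seq_b[j + k])
                        for k in range(-half, half + 1))
            if total >= cutoff:
                hits.append((i + 1, j + 1, total))
    return hits

print(matching_spans("VILVL", "VLLIV", span=3, cutoff=42))
```
.end lit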
.para
For comparing amino acid sequences we usually use the score
matrix shown below which was calculated by adding 10 (to make
every term >0) to each term of the relatedness odds matrix MDM78
of
Dayhoff. This matrix MDM78 was calculated by looking at accepted
point mutations in 71 families of closely related proteins and, of
those tested by Dayhoff, was found to be the most powerful
score
matrix for finding distant relationships between amino acid
sequences.
.left margin1
.lit
AMINO ACID SCORE MATRIX
-----------------------
   C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?
C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
.end lit
One alternative for proteins is to use an identity matrix.
For comparing nucleic acids we usually use the matrix shown below.
.lit
DNA SCORE MATRIX
   A  C  G  T  X
A  1  0  0  0  0
C  0  1  0  0  0
G  0  0  1  0  0
T  0  0  0  1  0
X  0  0  0  0  0
.end lit
See option 32 for how to change the score matrices.
.para
When a sequence is compared against itself to look for repeats, the
proportional algorithm can be used in a mode in which the main
diagonal is not shown. See option 30.
.para
Typical dialogue follows.
.lit
? Menu or option number=d18
? Odd span length (1-401) (11) =
? Proportional score (1-297) (132) =
Working
missing graphics
.end lit
.left margin1
@19. TX 4 @List matching spans
.LEFT MARGIN2
This option applies the proportional algorithm using the current span and
cut-off score, but instead of drawing a dot matrix it lists all the
matching spans. When a sequence is compared against itself to look for
repeats, this algorithm can be used in a mode in which the main
diagonal is not listed. See option 30.
.para
Typical dialogue follows.
.lit
? Menu or option number=d19
? Odd span length (1-401) (11) =
? Proportional score (1-297) (132) =148
List matching spans
Working
76
IEVPVGKATLG
LEVPVGRGLLG
95
77
EVPVGKATLGR
EVPVGRGLLGR
96
78
VPVGKATLGRI
VPVGRGLLGRV
97
79
PVGKATLGRIM
PVGRGLLGRVV
98
85
LGRIMNVLGEP
LGRVVNTLGAP
104
86
GRIMNVLGEPV
GRVVNTLGAPI
105
87
RIMNVLGEPVD
RVVNTLGAPID
106
.end lit
.left margin1
@20. TX 3 @Set span length
.left margin2
.para
The proportional algorithm
calculates a score for
each position in the matrix by summing the
points found when looking
forwards and backwards along a diagonal line of a given length.
This length, called the span, should be an odd number so that the
score for any point is correctly positioned at the centre of the
span. This option allows the user to define the span length.
Short spans can produce noisy diagrams, but are less
affected by insertions and deletions than are long spans. Long
spans, however, can detect more distant relationships. Long spans can
also suffer from a
persistence problem: dots are plotted even when all the "signal" is to
one side of the span's central position. To help avoid this, try the
option that plots
the positions of all matching residues within a matching span (see
option 31).
This is most useful if an identity matrix is being used.
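.para
The span calculation described above can be illustrated with a short
Python sketch. It is not the program's own code: the function names,
the use of 0-based positions, and the "score" callable standing in for
the current score matrix are all assumptions made for illustration.
.lit
```python
def span_scores(seq1, seq2, span, score):
    """Yield (i, j, total) for every span centre (0-based positions).

    `score` is a hypothetical callable score(a, b) that stands in for
    the current score matrix.
    """
    if span % 2 == 0:
        raise ValueError("span length must be odd")
    half = span // 2
    for i in range(half, len(seq1) - half):
        for j in range(half, len(seq2) - half):
            # Sum the points found looking backwards and forwards
            # along the diagonal through (i, j).
            total = sum(score(seq1[i + k], seq2[j + k])
                        for k in range(-half, half + 1))
            yield i, j, total


def identity(a, b):
    # Identity matrix: 1 for a match, 0 otherwise.
    return 1 if a == b else 0


def matching_centres(seq1, seq2, span, cutoff, score=identity):
    # Each centre whose span score reaches the cut-off would give a dot.
    return [(i, j) for i, j, s in span_scores(seq1, seq2, span, score)
            if s >= cutoff]
```
.end lit
.para
For example, comparing "ACGTA" with itself using a span of 3 and a
cut-off of 3 gives centres only on the main diagonal.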
.left margin1
@21. TX 3 @Set proportional score
.LEFT MARGIN2
.para
The proportional algorithm
calculates a score for
each position in the matrix by summing the
scores for the individual amino acids found when looking
forwards and backwards along a diagonal line of a given length.
All points at which the proportional score is achieved will produce a dot
on the diagram. (The same score is used for the 'LIST MATCHING SPANS'
option.)
.para
Before choosing a score the user can apply the routine that calculates
the expected scores (option 23), or can calculate a histogram of observed
scores (option 24). It is
best to start with a high score to avoid an overcrowded diagram.
.left margin1
@22. TX 3 @Set identities score
.LEFT MARGIN2
.para
The identities algorithm is of limited value as it only finds runs of
matching characters; however, it has the virtue of being very fast.
This option allows the user to set the minimum length
of run that will produce a dot on the screen.
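.para
As an illustration of what "runs of matching characters" means, the
following Python sketch reports every maximal run of identities that
is at least a minimum length. The function name and the 0-based
positions are assumptions, and this simple version makes no attempt to
reproduce the speed of the program's own method.
.lit
```python
def identity_runs(seq1, seq2, min_run):
    """Return (i, j, length) for maximal runs of identical characters.

    i and j are 0-based start positions; only runs of at least
    `min_run` characters are reported.
    """
    runs = []
    for i in range(len(seq1)):
        for j in range(len(seq2)):
            # Start only where a run begins, so each run is reported once.
            if i > 0 and j > 0 and seq1[i - 1] == seq2[j - 1]:
                continue
            length = 0
            while (i + length < len(seq1) and j + length < len(seq2)
                   and seq1[i + length] == seq2[j + length]):
                length += 1
            if length >= min_run:
                runs.append((i, j, length))
    return runs
```
.end lit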
.left margin1
@23. TX 3 @Calculate expected scores
.left margin2
.para
This function calculates the "double matching probability" of McLachlan.
The
"double matching probability" is the probability of finding
particular scores given two infinitely long sequences of the
composition of those being compared, with the current span
length
and score matrix. By using this option the user can choose to
plot all the matches for which the score exceeds a given
significance level (such as 1%).
Generally it is best to begin at a
low probability level (that is, a high score) to avoid an overcrowded
diagram.
.para
When the calculation of the expected scores
is finished the program offers
the user 3 ways of examining the results:
.LEFT MARGIN2
"Show probability for a score" allows the user to type in a
score and the
program responds with the probability of achieving that level
of score.
.LEFT MARGIN2
"Show score for a probability" allows the user to type in a
probability value and
the program types the score that corresponds to that level of
probability.
.LEFT MARGIN2
"List scores and probabilities" lists the
scores and their
corresponding probabilities. The user is asked to supply a
further parameter, the "number of steps between scores", and
the program lists only
the scores at that step size, e.g. a step size of 5 lists every 5th
score.
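.para
One way to picture such a probability is sketched below in Python. It
assumes that the positions within a span are independent, derives the
distribution of single-pair scores from the two compositions, and
combines it over the span length. This is only an illustration under
those assumptions and is not claimed to be the program's exact
calculation; all function names are hypothetical.
.lit
```python
from collections import Counter, defaultdict


def pair_score_distribution(seq1, seq2, score):
    """Probability of each single-pair score, from the two compositions."""
    freq1, freq2 = Counter(seq1), Counter(seq2)
    dist = defaultdict(float)
    for a, na in freq1.items():
        for b, nb in freq2.items():
            dist[score(a, b)] += (na / len(seq1)) * (nb / len(seq2))
    return dict(dist)


def span_score_distribution(pair_dist, span):
    """Combine `span` independent pair scores into one distribution."""
    dist = {0: 1.0}
    for _ in range(span):
        nxt = defaultdict(float)
        for s, p in dist.items():
            for t, q in pair_dist.items():
                nxt[s + t] += p * q
        dist = dict(nxt)
    return dist


def probability_of_at_least(dist, threshold):
    # Chance that a random span scores `threshold` or more.
    return sum(p for s, p in dist.items() if s >= threshold)
```
.end lit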
.para
Typical dialogue follows.
.lit
? Menu or option number=d23
? Odd span length (1-401) (11) =
? Proportional score (1-297) (132) =
Working
Average score= 103.18557
RMS deviation= 7.85276
X 1 Show probability for a score
2 Show score for a probability
3 List scores and probabilities
? 0,1,2,3 =
? Show probability for score (1-165) (134) =160
Probability of score 160 is 0.0000000008
X 1 Show probability for a score
2 Show score for a probability
3 List scores and probabilities
? 0,1,2,3 =2
? Show score for probability (0.0000000001-1.) (0.00001) =0.0000001
Score for probability 0.0000001000 is 153
1 Show probability for a score
X 2 Show score for a probability
3 List scores and probabilities
? 0,1,2,3 =3
? Number of steps between scores (1-10) (5) =
0 0.10000E+01 100 0.67232E+00 200 0.18977E-20
5 0.10000E+01 105 0.42119E+00 205 0.42561E-22
10 0.10000E+01 110 0.20671E+00 210 0.87767E-24
15 0.10000E+01 115 0.78860E-01 215 0.16651E-25
20 0.10000E+01 120 0.23515E-01 220 0.27300E-27
25 0.10000E+01 125 0.55406E-02 225 0.00000E+00
30 0.10000E+01 130 0.10443E-02 230 0.00000E+00
35 0.10000E+01 135 0.15935E-03 235 0.00000E+00
40 0.10000E+01 140 0.19906E-04 240 0.00000E+00
45 0.10000E+01 145 0.20569E-05 245 0.00000E+00
50 0.10000E+01 150 0.17758E-06 250 0.00000E+00
55 0.10000E+01 155 0.12938E-07 255 0.00000E+00
60 0.10000E+01 160 0.80360E-09 260 0.00000E+00
65 0.10000E+01 165 0.43009E-10 265 0.00000E+00
70 0.10000E+01 170 0.20049E-11 270 0.00000E+00
75 0.99997E+00 175 0.82263E-13 275 0.00000E+00
80 0.99949E+00 180 0.29998E-14 280 0.00000E+00
85 0.99448E+00 185 0.98050E-16 285 0.00000E+00
90 0.96543E+00 190 0.28934E-17 290 0.00000E+00
95 0.86836E+00 195 0.77556E-19 295 0.00000E+00
1 Show probability for a score
2 Show score for a probability
X 3 List scores and probabilities
? 0,1,2,3 =!
.end lit
.left margin1
@24. TX 3 @Calculate observed scores
.left margin2
.para
This option applies the proportional algorithm to the currently active
regions of the two sequences, but instead of producing a
dot matrix it calculates a histogram of observed scores.
The speed of this calculation
depends on the size of the active
regions. When it
is completed the program offers the user 3 ways of examining
the results:
.para
"Show percentage reaching a score" allows the user to type in a score
and the
program
responds with the percentage of points that reach at least this
value.
.para
"Show score for a percentage" allows the user to type in a percentage and
the
program responds with the corresponding score. Only the given
percentage of points achieve this score or above.
.para
"List scores and percentages" is the command to list out the scores
and the
percentage of points achieving them.
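.para
The relationship between scores and percentages can be sketched in
Python as follows. The sketch assumes that the percentages are
cumulative (points reaching a score or above); the function names are
hypothetical.
.lit
```python
from collections import Counter


def observed_percentages(scores):
    """Map each observed score to the percentage of points reaching it."""
    counts = Counter(scores)
    total = len(scores)
    result = {}
    running = 0
    # Work down from the highest score, accumulating the counts.
    for s in sorted(counts, reverse=True):
        running += counts[s]
        result[s] = 100.0 * running / total
    return result


def score_for_percentage(percentages, percent):
    """Lowest score reached by no more than `percent` of the points."""
    eligible = [s for s, p in percentages.items() if p <= percent]
    return min(eligible) if eligible else None
```
.end lit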
.para
Typical dialogue follows.
.lit
? Menu or option number=24
Working
Maximum observed score is 152
X 1 Show percentage reaching a score
2 Show score for a percentage
3 List scores and percentages
? 0,1,2,3 =
? Show percentage for score (1-152) (114) =144
Percentage of points with score 144 is 0.005486297
X 1 Show percentage reaching a score
2 Show score for a percentage
3 List scores and percentages
? 0,1,2,3 =2
? Show score for percentage (0.00001-1.) (0.001) =0.01
Score for percentage 0.010000000 is 143
1 Show percentage reaching a score
X 2 Show score for a percentage
3 List scores and percentages
? 0,1,2,3 =
? Show score for percentage (0.00001-1.) (0.001) =1.
Score for percentage 1.000000000 is 124
1 Show percentage reaching a score
X 2 Show score for a percentage
3 List scores and percentages
? 0,1,2,3 =3
? Number of steps between scores (1-10) (5) =1
73 236953 0.10000E+03
74 236951 0.99999E+02
75 236951 0.99999E+02
76 236950 0.99998E+02
77 236945 0.99996E+02
78 236942 0.99995E+02
79 236929 0.99989E+02
80 236900 0.99977E+02
missing data here
130 384 0.16206E+00
131 307 0.12956E+00
132 239 0.10086E+00
133 180 0.75964E-01
134 134 0.56551E-01
135 103 0.43468E-01
136 78 0.32918E-01
137 67 0.28276E-01
138 46 0.19413E-01
139 40 0.16881E-01
140 33 0.13927E-01
141 29 0.12239E-01
142 24 0.10129E-01
143 19 0.80184E-02
144 13 0.54863E-02
145 10 0.42202E-02
146 8 0.33762E-02
147 7 0.29542E-02
148 7 0.29542E-02
149 6 0.25321E-02
150 5 0.21101E-02
151 3 0.12661E-02
152 3 0.12661E-02
1 Show percentage reaching a score
2 Show score for a percentage
X 3 List scores and percentages
? 0,1,2,3 =!
.end lit
.left margin1
@25. TX 3 @Show current parameter settings
.LEFT MARGIN2
.para
This function lists the names of the current sequences, their total
lengths, the start
and end points of the active sequence and the current values of span and
cut-off scores. It also shows if the main diagonal will be shown, or if
the
proportional algorithm will mark all identities in matching spans.
.para
Typical dialogue follows.
.lit
? Menu or option number=25
Horizontal sequence
ALPHA.PRT
Positions
1 TO 514
Vertical sequence
BETA.PRT
Positions
1 TO 461
Span length= 11
Scores
Proportional= 132
Identities= 3
Identites off
Main diagonal shown
.end lit
.left margin1
@27. TX 2 @Draw a /
.left margin2
.para
This option simply draws a diagonal line from the bottom left of the
diagram to the top right. It can be an aid when trying to align the
sequences.
.left margin1
@26. TX 4 @Quick scan
.left margin2
.para
The algorithm is as follows. The dot matrix positions are found for all
words of some minimum length (length 1 is the most sensitive)
that are common to both sequences. Imagine a diagonal line running from
corner to corner of the diagram, at right angles to the diagonals in the
dot matrix. The scores for the common words (according to the current
score matrix, e.g. MDM78) are accumulated at the appropriate positions
on that imaginary line, hence
producing a histogram. The histogram is analysed to find its mean and
standard deviation. The diagonals that lie above some cut-off score
(defined in standard deviation units) are rescanned using the
proportional algorithm, and a diagram is produced. The method is very
fast, and is also employed by the library comparison program.
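.para
The first stage of the method can be sketched in Python as below. This
is a simplified illustration, not the program's code: diagonals are
identified by the difference between the two 0-based positions, each
common word is scored by scoring its characters against themselves
with a hypothetical "score" callable, and all names are assumptions.
.lit
```python
from collections import defaultdict
from statistics import mean, pstdev


def diagonal_histogram(seq1, seq2, word_len, score):
    """Accumulate scores of common words on each diagonal (key i - j)."""
    positions = defaultdict(list)
    for j in range(len(seq2) - word_len + 1):
        positions[seq2[j:j + word_len]].append(j)
    hist = defaultdict(int)
    for i in range(len(seq1) - word_len + 1):
        word = seq1[i:i + word_len]
        for j in positions.get(word, ()):
            hist[i - j] += sum(score(c, c) for c in word)
    return hist


def diagonals_to_rescan(seq1, seq2, word_len, score, n_sd):
    """Diagonals whose accumulated score exceeds mean + n_sd * sd."""
    hist = diagonal_histogram(seq1, seq2, word_len, score)
    # Every diagonal counts, including those with no common word.
    all_diags = range(-(len(seq2) - 1), len(seq1))
    values = [hist.get(d, 0) for d in all_diags]
    cutoff = mean(values) + n_sd * pstdev(values)
    return [d for d in all_diags if hist.get(d, 0) > cutoff]
```
.end lit
.para
The diagonals returned would then be rescanned with the proportional
algorithm.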
.para
Typical dialogue follows.
.lit
? Menu or option number=d26
? Identity score (1-20) (3) =
? Odd span length (1-401) (11) =
? Proportional score (1-297) (132) =
? Number of sd above mean (0.00-10.00) (5.00) =
missing graphics
.end lit
.left margin2
.para
SIPL, the library searching version of SIP
.para
This program compares a probe sequence against a library of sequences using
the quick scan algorithm, sorts the matches into descending order of score,
and produces optimal alignments of the best scores using the Myers and
Miller method. It is very rapid.
.para
Use of lists of entry names
.para
SIPL has the ability to
restrict searches to subsets of the libraries. This does not require
sublibraries to be created but instead is achieved by using files
containing a list of the entry names of sequences. The user may choose to
search only those entries on the list or, alternatively, to search all but
those on the list (i.e. in the latter case
the list contains the names of those to be excluded).
The program can search libraries that have indexes and those that
do not.
If a list of names for inclusion is used,
then the search will be faster if the index is present. In all other
circumstances the whole library will be read.
The list must be in library order, except when it is used
to include entries and an index is available.
The list must contain each entry name on a separate line, with the name
starting in column 1 of the line, i.e. there must be no spaces at the start
of the line.
The list of entry names
can be produced by the keyword searches of nip, pip, sip, etc., as long
as the listings produced have a space character separating the entry name
from the entry description. This will depend on how well the library
reformatting programs work. For example, swissprot entry names tend to run
into the beginning of the descriptions, but other libraries are generally
OK.
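.para
The list format and the include/exclude choice described above can be
illustrated with the following Python sketch. The function names are
hypothetical, and the sketch takes everything after the first space on
a line to be the entry description.
.lit
```python
def read_entry_names(lines):
    """Parse a list of entry names, one per line, starting in column 1."""
    names = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line[0].isspace():
            raise ValueError("entry name must start in column 1: %r" % line)
        # Anything after the first space is treated as a description.
        names.append(line.split()[0])
    return names


def select_entries(library_names, listed, include=True):
    """Keep only the listed entries, or all but the listed ones."""
    wanted = set(listed)
    return [n for n in library_names if (n in wanted) == include]
```
.end lit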
.left margin1
@28. TX 4 @Align sequences
.left margin2
.para
This function will produce an optimal alignment of two segments of the
sequence.
The dynamic programming alignment algorithm is based on that of Myers
and Miller (). It guarantees to produce alignments with the optimum score
given a score matrix, a gap start penalty, and a gap extension penalty.
That is, starting a gap costs a fixed penalty (F) and each residue added
to the gap incurs a further penalty (E) so that for each gap of length K
residues the penalty is F + K*E. Gaps at the ends of sequences incur no
penalty.
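.para
As a worked example of the penalty formula, the small Python function
below computes the cost of a single gap. The function and parameter
names are hypothetical; the formula F + K*E and the rule for end gaps
are taken from the text above.
.lit
```python
def gap_penalty(length, start_penalty, extend_penalty, at_end=False):
    """Penalty for one gap of `length` residues: F + K*E.

    Gaps at the ends of sequences incur no penalty.
    """
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0 or at_end:
        return 0
    return start_penalty + length * extend_penalty
```
.end lit
.para
With F = 10 and E = 10, a gap of 4 residues inside the alignment costs
10 + 4*10 = 50, whereas the same gap at the end of a sequence costs
nothing.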
.para
The routine can only handle segments of sequence of maximum
length 5000 residues. When the sequences are read in, the alignment
segment
is set to the first 5000 residues. A different segment can be
selected by prefixing the option number with the letter D, in which case
the crosshair is used to identify the two ends.
First position the
crosshair at
the bottom left of the
segment and type a character other than s
or m or ",". When the crosshair reappears, position it at the top right
of the segment, and type a keyboard character.
The aligned sequences will replace the active sequence if the user
confirms "keep alignment". By alternate use of the
plotting and alignment routines it is possible to rapidly produce an
alignment of quite long sequences.
.para
Typical dialogue follows.
.lit
28 = Align sequences
? Menu or option number=d28
Define the region to align using the cross-hair.
First identify the bottom left position and exit
the cross-hair routine. Then the top right.
(Bell rings, type return, cross hair appears)
? Penalty for starting a gap (1-100) (10) =
? Penalty for each residue in gap (1-100) (10) =
Aligning region 1 to 461
with region 1 to 514
1 11 21 31 41 51
MA--TGKIVQ VIGA------ VVDVEFPQDA VPRVYDALEV QNG------N ERLVL-----
* * * ** * * * * *
MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
1 11 21 31 41 51
61 71 81 91 101 111
EVQQQLGGGI VRTIAMGSSD GLRRGLDVKD LEHPIEVPVG KATLGRIMNV LGEPVDMKGE
* * ** * * ** ***** *** * ** * * **
AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
61 71 81 91 101 111
121 131 141 151 161 171
IGEEERWAIH RAAPSYEELS NSQELLETGI KVIDLMCPFA KGGKVGLFGG AGVGKTVNMM
* ** * ** * * * * * * ***
LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
121 131 141 151 161 171
181 191 201 211 221 231
ELIRNIAIEH SGYS-VFAGV GERTREGNDF YHEMTDSNVI DKVSLVYGQM NEPPGNRLRV
* * ** * * *
DAI--INQRD SGIKCIYVAI GQKASTISNV VRKLEEHGAL ANTIVVVATA SESAALQYLA
181 191 201 211 221 231
241 251 261 271 281 291
ALTGLTMAEK FRDEGRDVLL FVDNIYRYTL AGTEVSALLG RMPSAVGYQP TLAEEMGVLQ
* * *** * * * * * * ** * * *
RMPVALMGEY FRDRGEDALI IYDDLSKQAV AYRQISLLLR RPPGREAFPG DVFYLHSRLL
241 251 261 271 281 291
301 311 321 331 341 351
ERITST---- ---------- -KTGSITSVQ AVYVPADDLT DPSPATTFAH LDATVVLSRQ
** **** * * * * * *
ERAARVNAEY VEAFTKGEVK GKTGSLTALP IIETQAGDVS AFVPTNVISI TDGQIFLETN
301 311 321 331 341 351
361 371 381 391 401 411
IASLGIYPAV DPLDSTSRQL DPLVVGQEHY DTAR----GV QSILQRYQEL KDIIAILGMD
** *** * * ** * * * * * **
LFNAGIRPAV NPGISVSR-- ---VGGAAQT KIMKKLSGGI RTALAQYREL AAFSQFAS--
361 371 381 391 401 411
421 431 441 451 461 471
ELSEEDKLVV ARARKIQRFL SQ----PFFV AE----VFTG SPGKYVSLKD --TIRGFKGI
* * * * * * * * * * * *
DLDDATRKQL DHGQKVTELL KQKQYAPMSV AQQSLVLFAA ERG-YLADVE LSKIGSFEAA
421 431 441 451 461 471
481 491 501 511 521
MEG--EYDHL P-EQAFYMVG SIEEAVE--- --------KA KKL*
** * * * * *
LLAYVDRDHA PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
481 491 501 511 521
Conservation 22.5%
Number of padding characters inserted 63 and 10
? (y/n) (y) Keep alignment n
.end lit
.left margin1
@29. TX 1 @Complement the sequences
.left margin2
.para
This function allows users to reverse and complement nucleic acid
sequences.
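.para
For illustration, reversing and complementing a DNA sequence can be
sketched in Python as below. The function name is hypothetical, and in
this sketch any character other than A, C, G or T (in either case) is
left unchanged.
.lit
```python
# Translation table for the four bases, upper and lower case.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the order of the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]
```
.end lit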
.left margin1
@30. TX 3 @Switch main diagonal
.left margin2
.para
If a sequence is being compared against itself to look for repeats it is
sometimes convenient if the main diagonal is not included in the
comparison. This function allows users to set a switch that determines
whether or not to include the main
diagonal for all the comparison methods.
If the switch is set, and the active regions for both sequences have
the same start position, then the main diagonal will not be compared.
.left margin1
@31. TX 3 @Switch identities
.left margin2
.para
This function allows a switch to be set or unset. The switch determines
which of two forms of plot will be produced by the proportional
algorithm.
One form of output (the original method) plots a dot at the centre of each
span that reaches the threshold score; whereas the other form will plot
dots for all matching residues that lie within spans that reach the
threshold.
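.para
The difference between the two forms of plot can be sketched in Python
as follows. The function name, the "mark_identities" flag, the 0-based
positions and the "score" callable are assumptions made for
illustration and are not taken from the program.
.lit
```python
def plot_points(seq1, seq2, span, cutoff, score, mark_identities=False):
    """Return the sorted (i, j) points that would be plotted."""
    half = span // 2
    offsets = range(-half, half + 1)
    points = set()
    for i in range(half, len(seq1) - half):
        for j in range(half, len(seq2) - half):
            total = sum(score(seq1[i + k], seq2[j + k]) for k in offsets)
            if total < cutoff:
                continue
            if mark_identities:
                # Plot every matching residue inside the qualifying span.
                points.update((i + k, j + k) for k in offsets
                              if seq1[i + k] == seq2[j + k])
            else:
                # Original method: one dot at the centre of the span.
                points.add((i, j))
    return sorted(points)
```
.end lit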
.left margin1
@32. TX 3 @Change score matrix
.left margin2
.para
This option allows users to select their
own score matrix for use with the proportional algorithm. The choices
are:
.lit
1 = MDM78
2 = identity
3 = your own matrix
.end lit
.para
MDM78 is the standard matrix that is used for proteins and an
identity matrix is the default matrix for nucleic acids. However an
identity
matrix is also useful for protein comparisons. "Your own matrix" allows
users to apply any other matrix, as long as the matrix file is in the same
format as MDM78.
For comparisons of DNA it might be useful to try a matrix that gives, say,
3 for exact matches, 1 for purine-purine (R-R) or pyrimidine-pyrimidine
(Y-Y) pairs, and 0 otherwise.
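.para
Such a DNA matrix could be generated as in the Python sketch below. It
builds the scores only and does not write the MDM78 file format, which
is not described here; the function names are hypothetical and the
values 3 and 1 are the illustrative ones suggested above.
.lit
```python
PURINES = {"A", "G"}       # R
PYRIMIDINES = {"C", "T"}   # Y


def dna_score(a, b, match=3, same_class=1):
    """3 for an exact match, 1 for R-R or Y-Y pairs, otherwise 0."""
    a, b = a.upper(), b.upper()
    bases = PURINES | PYRIMIDINES
    if a not in bases or b not in bases:
        return 0
    if a == b:
        return match
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return same_class
    return 0


def build_matrix(symbols="ACGTX"):
    """Return the scores as a nested dictionary indexed by symbol."""
    return {a: {b: dna_score(a, b) for b in symbols} for a in symbols}
```
.end lit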
.left margin1
@33. TX 3 @Set number of sd's for Quickscan
.left margin2
.para
The quickscan
algorithm is as follows. The dot matrix positions are found for all
words of some minimum length (length 1 is the most sensitive)
that are common to both sequences. Imagine a diagonal line running from
corner to corner of the diagram, at right angles to the diagonals in the
dot matrix. The scores for the common words (according to the current
score matrix, e.g. MDM78) are accumulated at the appropriate positions
on that imaginary line, hence
producing a histogram. The histogram is analysed to find its mean and
standard deviation. The diagonals that lie above some cut-off score
(defined in standard deviation units) are rescanned using the
proportional algorithm, and a diagram is produced.
.para
This option allows the number of sd's to be set.
.left margin1
@34. TX 3 @Set gap penalties
.left margin2
.para
The alignment
function will produce an optimal alignment of two segments of the
sequence.
The dynamic programming alignment algorithm is based on that of Myers
and Miller (). It guarantees to produce alignments with the optimum score
given a score matrix, a gap start penalty, and a gap extension penalty.
That is, starting a gap costs a fixed penalty (F) and each residue added
to the gap incurs a further penalty (E) so that for each gap of length K
residues the penalty is F + K*E. Gaps at the ends of sequences incur no
penalty.
.para
This option allows the gap penalties to be set.
.left margin1
@ end of help