staden-lg/help/SIP.RNO

.NPA
.SP 1
.left margin1
@-1. TX  0 @General
.sp
@-2. T   0 @Screen control
.sp
@-2. X   0 @Screen
.sp
@-3. TX  0 @Set parameters
.sp
@-4. TX  0 @Comparison
.sp
@0.  TX -1 @SIP
.PARA
This is a program for comparing and aligning nucleic acid or protein 
sequences. It can produce optimal alignments using a dynamic 
programming algorithm, and has several ways of producing "dot matrix" 
diagrams. 
.PARA
The analyses available, with their option numbers, are listed under Help.
.sp
.para
The program is used on a simple graphics terminal, i.e. a keyboard with
a screen on which points and lines can be drawn. The user works at the
terminal and produces plots for various combinations of values for the
span length and minimum scores. However large or small a region the user
elects to compare, the program expands or contracts the diagram so that
the plot always fills the screen. This allows the user to gain an overall
impression or to "home in" on particular regions and examine them in more
detail. Having found a region that looks interesting, the user can
determine its coordinates in terms of sequence positions by use of a
crosshair facility.
.para
The program has two statistical options to help the user choose score
levels for plotting and to assess the significance of any similarity
found. It can produce a cumulative histogram of observed scores for the
current span length and region, and it can calculate the "double matching
probability" of McLachlan. The "double matching probability" is the
probability of finding particular scores given two infinitely long
sequences of the composition of those being compared, with the current
span length and score matrix. By using these options the user can choose
to plot all the matches for which the score exceeds a given significance
level (such as 1%), using either empirical or theoretical probability
values. Generally it is best to begin at a low level to avoid an
overcrowded diagram.
.para
If the user finds that the two sequences do contain stretches of
homology, the user will often want to align the sequences by inserting
padding characters at deletion points. The program has a selection of
options for this purpose: it contains an alignment routine; it can
display the two sequences on the screen, one above the other, with
asterisks marking identities; it has inbuilt editing functions; and it
can save the aligned sequences in disk files.
.para
The basic principle of dot matrices was first described by Gibbs and
McIntyre and involves producing a diagram that contains a representation
of all the matches between a pair of sequences. This diagram is then
scanned by eye and the human ability to recognise patterns used to detect
any similarities that might be present. The diagram consists of a two
dimensional plot in which the x axis represents one sequence (A) and the
y axis the other (B). Every point (i,j) on the plane x,y is assigned a
score which corresponds to the level of similarity between sequence
characters A(i) and B(j). In the simplest use of the method a score of 1
could be assigned to every point (i,j) where A(i) = B(j), and a score of
0 to every other point. If a plot of the points in the plane was made in
which all scores of 1 were marked with a dot and all those of 0 left
blank then regions of identity would appear as diagonal lines. With the
comparison displayed in this form the human eye is very good at detecting
regions of homology even if they are imperfect. The effects of
mismatches, insertions or deletions can be seen: matches interrupted by
insertions or deletions will appear as parallel diagonals, and matches
interrupted by the odd mismatching pair of characters will appear as
broken collinear diagonal lines. This diagram is a very useful
representation but simply placing a dot for every identity is of limited
value for the following reasons.
.para
For nucleic acid sequences around 25% of the plot will contain points and
it will often be very difficult to distinguish significant homologies
from chance matches. For proteins many significant alignments of
sequences contain almost no identities but are formed from chemically and
structurally similar amino acids so that simply looking for identity
would be insufficient. What is required is to first find those points
that correspond to fairly strong local similarities and then to use the
diagram of these points so that the human eye can be used to look for
larger scale homologies. The program uses a number of different
algorithms to calculate the score for each point and the user defines a
minimum score so that only those points in the diagram for which the
score is at least this value will be marked with a dot.
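.para
The simplest form of the diagram, with a dot wherever A(i) = B(j), can be
sketched in Python as follows. This is an illustration only: the function
names and the example sequences are hypothetical and are not taken from
the program. It shows why about a quarter of a nucleic acid plot fills
with dots when the four bases are equally common.
.lit

```python
def identity_dots(seq_a, seq_b):
    """Return the set of 1-based points (i, j) where A(i) == B(j)."""
    return {
        (i, j)
        for i, a in enumerate(seq_a, start=1)
        for j, b in enumerate(seq_b, start=1)
        if a == b
    }


def dot_fraction(seq_a, seq_b):
    """Fraction of the whole plot that is occupied by dots."""
    return len(identity_dots(seq_a, seq_b)) / (len(seq_a) * len(seq_b))


# Each base occurs equally often in this made-up example, so one cell in
# four holds a dot although only the main diagonal is a genuine match.
print(dot_fraction("ACGTACGT", "ACGTACGT"))  # prints 0.25
```

.end lit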
.para
The first scoring method finds the longest uninterrupted sections of
perfect identity, i.e. those that contain no mismatches, insertions or
deletions. Generally this method, termed "the identities algorithm", is of
little value, but it runs very quickly.
.para
The second method looks for sections where a proportion of the characters
in the sequence are similar, again allowing no insertions or deletions.
For a thorough analysis this method, termed "the proportional algorithm",
is the best.
.para
The original method of this type was first described by McLachlan and
involves calculating a score for each position in the matrix by summing
points found when looking forwards and backwards along a diagonal line of
a given length. This length, called the span, must be an odd number so
that the dot marking matches can be precisely placed at its centre. The
algorithm does not simply look for identity but uses a score matrix that
contains scores for every possible pair of characters. For comparing
amino acid sequences we usually use the score matrix shown below, which
was calculated by adding 10 (to make every term >0) to each term of the
relatedness odds matrix MDM78 of Dayhoff. This matrix MDM78 was
calculated by looking at accepted point mutations in 71 families of
closely related proteins and, of those tested by Dayhoff, was found to be
the most powerful score matrix for finding distant relationships between
amino acid sequences.
.left margin1
.lit

                           AMINO ACID SCORE MATRIX
                           -----------------------

   C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?  
C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
.end lit
.para
It is also possible to use other matrices, including an identity matrix for 
proteins. For nucleic acids we usually use the matrix shown below.
.lit

         DNA SCORE MATRIX

             A C G T X 
           A 1 0 0 0 0 
           C 0 1 0 0 0 
           G 0 0 1 0 0 
           T 0 0 0 1 0 
           X 0 0 0 0 0 
.end lit
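.para
The span scoring described above can be sketched in Python as follows.
This is a minimal illustration, not the program's own code: the function
names are hypothetical, spans that would overhang the ends of either
sequence are simply skipped (an assumption), and the example uses the DNA
score matrix shown above.
.lit

```python
def span_scores(seq_a, seq_b, span, score):
    """Yield (i, j, total) for the centre of every complete span.

    i and j are the 1-based positions of the span centre in seq_a and
    seq_b; score(a, b) looks up the score matrix for a pair of characters.
    """
    if span % 2 == 0:
        raise ValueError("span must be odd so that it has a centre")
    half = span // 2
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            total = sum(
                score(seq_a[i + k], seq_b[j + k])
                for k in range(-half, half + 1)
            )
            yield i + 1, j + 1, total


def proportional_dots(seq_a, seq_b, span, cutoff, score):
    """Centres of the spans whose summed score is at least the cutoff."""
    return [
        (i, j)
        for i, j, total in span_scores(seq_a, seq_b, span, score)
        if total >= cutoff
    ]


def dna_score(a, b):
    """DNA score matrix: 1 for identical A, C, G or T, otherwise 0."""
    return 1 if a == b and a in "ACGT" else 0


print(proportional_dots("GATTACA", "CATTAGA", span=3, cutoff=3,
                        score=dna_score))  # prints [(3, 3), (4, 4)]
```

.end lit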
.left margin2
.para
Plotting dots at the centres of spans that reach the cutoff leads to a 
persistence effect that, to some extent, can be mitigated by a  variation 
on the method. If, for example, all the high scoring amino acids are 
clustered at the left end of a particular diagonal segment, dots will 
continue to be plotted to their right until the span score drops below the 
cutoff. Instead of plotting a single point for each span that reaches the 
cutoff score, the variant  method  plots points for all the identities that 
lie in spans that reach the cutoff. Obviously the persistence effect can be 
more pronounced for long spans and low cutoff scores, but note that the 
variant method will not plot anything if there are no identities present, 
and so similar regions could be missed!
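.para
The variant method, and the caveat that it plots nothing where there are
no identities, can be illustrated with the following Python sketch. The
function names and the small scoring function are hypothetical and are
used only to keep the example short; they are not the program's score
matrix.
.lit

```python
def identity_dots_in_spans(seq_a, seq_b, span, cutoff, score):
    """Plot every identity that lies in a span reaching the cutoff."""
    half = span // 2
    offsets = range(-half, half + 1)
    dots = set()
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in offsets)
            if total >= cutoff:
                dots.update(
                    (i + k + 1, j + k + 1)
                    for k in offsets
                    if seq_a[i + k] == seq_b[j + k]
                )
    return sorted(dots)


def toy_score(a, b):
    """Hypothetical scores: 2 for identity, 1 within the group I, L, V."""
    if a == b:
        return 2
    if a in "ILV" and b in "ILV":
        return 1
    return 0


# The span reaches the cutoff and two of its three pairs are identities.
print(identity_dots_in_spans("AIL", "AIV", 3, 3, toy_score))
# The span reaches the cutoff on similarity alone, so nothing is plotted.
print(identity_dots_in_spans("ILV", "VIL", 3, 3, toy_score))
```

.end lit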
.para
A further variant, useful for comparing a sequence against itself, ignores 
the main diagonal.
.para
The third comparison method, called "quick scan", is really a combination
of the first two, and is similar to the FASTP program of Lipman and
Pearson, but produces a dot matrix diagram. The algorithm is as follows.
The dot matrix positions are found for all words of some minimum length
(obviously length 1 is most sensitive) that are common to both
sequences. Imagine a diagonal line running from corner to corner of the
diagram, at right angles to the diagonals in the dot matrix. The scores
for the common words (according to the current score matrix, e.g.
MDM78) are accumulated at the appropriate positions on that imaginary
line, hence producing a histogram. The histogram is analysed to find its
mean and standard deviation. The diagonals that lie above some cutoff
score (defined in standard deviation units) are rescanned using the
proportional algorithm, and a diagram produced. The method is very fast,
and is also employed by the library comparison program.
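.para
The first stages of quick scan (finding common words, accumulating their
scores by diagonal, and selecting diagonals by standard deviation) might
be sketched in Python as below. This is an assumption-laden illustration:
the function names are hypothetical, empty diagonals are counted when the
mean and standard deviation are taken, and the final rescan with the
proportional algorithm is left out.
.lit

```python
from collections import defaultdict
from statistics import mean, pstdev


def diagonal_histogram(seq_a, seq_b, word_length, score):
    """Accumulate the scores of common words on each diagonal (j - i)."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - word_length + 1):
        positions[seq_b[j:j + word_length]].append(j)
    # Every diagonal has an entry, so empty ones count towards the mean.
    histogram = {d: 0 for d in range(-(len(seq_a) - 1), len(seq_b))}
    for i in range(len(seq_a) - word_length + 1):
        word = seq_a[i:i + word_length]
        for j in positions.get(word, ()):
            histogram[j - i] += sum(score(c, c) for c in word)
    return histogram


def high_diagonals(histogram, sd_cutoff):
    """Diagonals scoring more than sd_cutoff deviations above the mean."""
    values = list(histogram.values())
    threshold = mean(values) + sd_cutoff * pstdev(values)
    return sorted(d for d, v in histogram.items() if v > threshold)


def identity_score(a, b):
    """Hypothetical score matrix: 1 for identity, otherwise 0."""
    return 1 if a == b else 0


hist = diagonal_histogram("GATTACA", "CCGATTACA", 3, identity_score)
print(high_diagonals(hist, sd_cutoff=3))  # prints [2]
```

.end lit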
.para
The dynamic programming alignment algorithm contained in the program 
is based on that of Miller and Myers (). It guarantees to produce 
alignments with the optimum score given a score matrix, a gap start 
penalty, and a gap extension penalty. That is, starting a gap costs a fixed 
penalty (IG) and each residue added to the gap incurs a further penalty 
(IH) so that for each gap of length K residues the penalty is IG + K*IH. 
Gaps at the ends of sequences incur no penalty.
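.para
The gap penalty rule above can be written as a small Python function. The
function and parameter names are hypothetical; gap_start and gap_extend
stand for IG and IH, and the values in the example are invented.
.lit

```python
def gap_penalty(length, gap_start, gap_extend, at_sequence_end=False):
    """Penalty IG + K*IH for a gap of K residues; end gaps are free."""
    if length < 0:
        raise ValueError("gap length cannot be negative")
    if length == 0 or at_sequence_end:
        return 0
    return gap_start + length * gap_extend


print(gap_penalty(3, gap_start=10, gap_extend=2))  # prints 16
print(gap_penalty(3, 10, 2, at_sequence_end=True))  # prints 0
```

.end lit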
.para
It is very useful to have the dot matrix methods and the alignment 
routine together in the same program because it allows users to produce 
a dot matrix diagram to help select which regions of the sequence they 
wish to align. Selection is made by use of the crosshair. First the 
crosshair is positioned at the bottom left hand end of the segment to be 
aligned. The crosshair function is quit and immediately selected again, 
the crosshair positioned at the top right of the segment, and the 
crosshair function quit. When the alignment routine is selected the 
segment will be aligned. 
.para
The alignment can replace the original segment of the sequence. By 
repeated plotting of dot matrices, followed by alignment, very long 
sequences can easily be aligned.
.LEFT MARGIN1
@1. TX 0 @Help
.LEFT MARGIN2
.para
This option gives online help. The user should select option numbers and
the current documentation will be given. 
.PARA
The following analyses (preceded by their option numbers) are included:
.lit
 ? = Help
 ! = Quit
 3 = read a new sequence
 4 = define active region
 5 = list the sequence
 6 = list a text file
 7 = direct output to disk
 8 = write active sequence to disk
 9 = edit the sequences
10 = clear graphics screen
11 = clear text screen
12 = draw a ruler
13 = use cross hair
14 = reposition plots
15 = label diagram
16 = display a map
17 = apply identities algorithm
18 = apply proportional algorithm
19 = list matching spans
20 = set span length
21 = set proportional score
22 = set identities score
23 = calculate expected scores
24 = calculate observed scores
25 = show current parameter settings
26 = quick scan
27 = draw a /
28 = align the sequences
29 = complement the sequences
30 = switch main diagonal
31 = switch identities
32 = change score matrix
.end lit
.left margin1
@2. TX 0 @Quit
.left margin2
.para
This function stops the program.
.left margin1
@3. TX 1 @Read a new sequence
.LEFT MARGIN2
.para
This option allows users to read in new sequences, browse through
annotations, or search sequence libraries for keywords. Sequences can be
read from "personal" sequence files or from sequence libraries. These are
referred to as the sequence "source". Personal files can be stored in
several formats: Staden, PIR, EMBL, GENBANK and GCG. At LMB we use
"Staden" format for sequencing and all the libraries are stored in their
original formats. Note, however, that libraries such as EMBL or GenBank
that are divided into several files (e.g. GenBank has 13 separate files)
are indexed as a whole. This means that users do not need to know which
file contains an entry, only which library. When the user selects to read
in a sequence the program first asks for the sequence "source".
.para
If the user selects "personal" the program will ask for 
the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of 
the file. For PIR format the user will also be required to know the entry 
name of the sequence as the file can contain several. For the other formats
only a single entry is expected. The file will be read, its length and
composition will be displayed and the option left.
.para
If the user selects "library" as the sequence source the program will display a
list of available libraries. The programs are capable of handling all current
libraries but which ones are available will vary from site to site. At LMB we
have several libraries and also weekly updates of data gathered between releases.
The program will ask users to select a library and then give a list of options:
.lit

 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords

.end lit
If "Get a sequence" or "Get annotations" is selected users will be asked to 
type the entry name. The option will be left when a sequence is selected or 
! is typed. The composition and length will be displayed.
.para
The text index contains all words from feature tables, reference titles,
definition lines, keywords lists and comments, so the text index search
is most useful. It is also the fastest. Up to 5 words can be searched for
at once. The words should be typed separated by spaces, for example
.lit
 ? Keywords=P53 mouse murine tumo

.end lit
will search for all entries that contain words starting with p53, mouse,
murine and tumo. Only the unique entries that contain ALL words will be 
listed. Before listing the matching entries
the program will show the number of 'hits' for each word and ring the bell.
Escape is possible at this point, or after each screenful of entries.
In addition to the entry names the text search displays the primary accession 
number, the sequence length and up to 80 characters of description.
(The search of 'titles' is now redundant because the full text index
contains all the title words and the search is much faster. It will probably
be removed from the program.)
All searches are independent of case. Where
possible the program will offer default entry names.
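.para
The matching rule described above (every keyword must match the start of
some word in the entry, ignoring case) can be sketched in Python as
follows. This is an illustration only: the index here is a hypothetical
dictionary from entry name to text, whereas the program uses its own
index files, and punctuation is not stripped from the words.
.lit

```python
def matches_all(text, keywords):
    """True if every keyword begins some word of text, ignoring case."""
    words = text.lower().split()
    return all(
        any(word.startswith(key.lower()) for word in words)
        for key in keywords
    )


def search(index, query, max_words=5):
    """Return the hits for each keyword and the entries matching all."""
    keywords = query.split()
    if len(keywords) > max_words:
        raise ValueError("too many keywords")
    hits = {
        key: sum(matches_all(text, [key]) for text in index.values())
        for key in keywords
    }
    entries = sorted(
        name for name, text in index.items() if matches_all(text, keywords)
    )
    return hits, entries


sample_index = {
    "MMP53A": "Mouse p53 mRNA, complete cds, clone pcD53.",
    "MMLYN": "Mouse lyn protein mRNA, complete cds.",
    "MMANT11": "Murine p53 gene 3' region with exon 11",
}
print(search(sample_index, "P53 mouse"))
```

.end lit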
.para
Typical dialogue follows.
.lit
Select sequence source
X  1 Personal file
   2 Sequence library
? Selection  (1-2) (1) =
Select sequence file format
X  1 Staden
   2 EMBL
   3 GenBank
   4 PIR
   5 GCG
? Selection  (1-5) (1) =
? Sequence file name=M13MP7.SEQ
 Contig title removed
Sequence length=  7238
 Sequence composition
          T          C          A          G          -
      2405.      1539.      1765.      1527.         2.
        33.2%      21.3%      24.4%      21.1%       0.0%
  .
  .
  .


 Select sequence source
 X  1 Personal file
    2 Sequence library
 ? Selection  (1-2) (1) =2
 Select a library
 X  1 EMBL 29 nucleotide library Dec 91
    2 SWISSPROT 20 protein library Nov 91
    3 PIR 31 protein library Dec 91
    4 NRL3D 58 From Brookhaven protein library Dec 91
    5 GenBank
 ? Selection  (1-5) (1) =
Library is in EMBL format with indexes
 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =5
 Search for keywords
 ? Keywords=P53 mouse
P53 hits  68
MOUSE hits  8180

 MMANT01    X00875         536 Murine gene fragment for cellular tumour antigen
 MMANT02    X00876          83 Murine gene fragment for cellular tumour antigen
 MMANT03    X00877          21 Murine gene fragment for cellular tumour antigen
 MMANT04    X00878         261 Murine gene fragment for cellular tumour antigen
 MMANT05    X00879         184 Murine gene fragment for cellular tumour antigen
 MMANT06    X00880         113 Murine gene fragment for cellular tumour antigen
 MMANT07    X00881         110 Murine gene fragment for cellular tumour antigen
 MMANT08    X00882         137 Murine gene fragment for cellular tumour antigen
 MMANT09    X00883          74 Murine gene fragment for cellular tumour antigen
 MMANT10    X00884         107 Murine gene for cellular tumour antigen p53 (exon
 MMANT11    X00885         562 Murine p53 gene 3' region with exon 11
 MMANTP53   M26862         536 Mouse tumor antigen p53 gene, 5' end.
 MMLYN      M64608        2044 Mouse lyn protein mRNA, complete cds.
 MMP53      X00741        1377 Mouse mRNA for transformation associated protein
 MMP53A     M13872        1285 Mouse p53 mRNA, complete cds, clone pcD53.
 MMP53B     M13873        1241 Mouse p53 mRNA, complete cds, clone p53-m11.
 MMP53C     M13874        1322 Mouse p53 mRNA, complete cds, clone p53-m8.
 MMP53G1    X01235         554 Mouse genomic DNA for 5' region of cellular tumou
 MMP53IN4   X60470         729 M.musculus p53 gene for p53 protein, intron 4
 MMP53P     X01236        2132 Mouse pseudogene for cellular tumour antigen p53
 MMP53R     X01237        1773 Mouse mRNA for cellular tumour antigen p53
 MMRSB2P5   M64597         196 Mouse B2 repeat in the 3' flank of protein 53 (p5
      22 different entries found

 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =4
 Search for keywords
 ? Keywords=alpha
 Searching for alpha
 AAGHA          623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
 AAMALI        3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
 AAMALIA       1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
 AAMALIB       1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
 ACA13GT        371 alouatta caraya alpha-1,3gt gene, 3' flank.
 ADHBADA1       102 duck alpha-d-globin gene, exon 1.
 ADHBADA2      1145 duck alpha-a-globin gene and 5' flank
 ADHBADWP       513 duck (white pekin) alpha ii (minor) globin mrna, complete co
 AEACOXABC     5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
 AGA13GT        371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
 AGAAAGFP       282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
 AGAABL         138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
 AGAFAMYA        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
 AGAFAMYB        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
 AGAFAMYC        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
 AGAFCOXA        98 synthetic alpha-factor/cox iv fusion gene signal peptide.
 AGAGABA       7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
 AGAMYLS        120 synthetic alpha-amylase gene, 5' end.
 AGANPS          95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
!
 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =3
 ? Accession number=v00636
Entry name LAMBDA
 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =2
 Default Entry name=LAMBDA
 ? Entry name=
ID   LAMBDA     standard; DNA; PHG; 48502 BP.
XX
AC   V00636; J02459; M17233; X00906;
XX
DT   03-JUL-1991 (Rel. 28, Last updated, Version 3)
DT   09-JUN-1982 (Rel. 1, Created)
XX
DE   Genome of the bacteriophage lambda (Styloviridae).
XX
KW   circular; coat protein; DNA binding protein; genome;
KW   origin of replication.
XX
OS   Bacteriophage lambda
OC   Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
XX
RN   [1]
RP   1-48502
RA   Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
RT   "Nucleotide sequence of bacteriophage lambda DNA";
RL   J. Mol. Biol. 162:729-773(1982).
XX
!
 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =
 Default Entry name=LAMBDA
 ? Entry name=
DE   Genome of the bacteriophage lambda (Styloviridae).
 Sequence length  48502
 Sequence composition
           T          C          A          G          -
      11988.     11360.     12336.     12818.         0.
         24.7%      23.4%      25.4%      26.4%       0.0%

.end lit
.left margin1
@4. TX 1 @Define active region
.LEFT MARGIN2
.para
For its analytic functions the program always works on a region of the
sequence called the active region. When a new sequence is read into the
program the active region is automatically set to start at the beginning
of the sequence and to extend up to the maximum size of active region
that the program can handle. On most machines this will be to the end of
the sequence. The positions are shown on the screen. This option allows
the user to define a different region.
.left margin1
@5. TX 1 @List a sequence
.LEFT MARGIN2
.para
The sequence can be listed with line lengths from 
10 to 120 in multiples of 10.  The output looks like:
.lit

    87         97        107        117        127        137
     KVKCTGRILE VPVGRGLLGR VVNTLGAPID GKGPLDHDGF SAVEAIAPGV IERQSVDQPV
      **      * ****   ***   * ** * *  **         *    **    *        
     DVKDLEHPIE VPVGKATLGR IMNVLGEPVD MKGEIGEEER WAIHRAAPSY EELSNSQELL
    68         78         88         98        108        118
   147        157        167        177        187        197
     QTGYKAVDSM IPIGRGQREL IIGDRQTGKT ALAIDAIINQ RDSGIKCIYV AIGQ
      ** *  * *  *   *       *    ***       * *             *   
     ETGIKVIDLM CPFAKGGKVG LFGGAGVGKT VNMMELIRNI AIEHSGYSVF AGVG
   128        138        148        158        168        178

.end lit
.left margin1
@6. TX 1 @List a text file
.LEFT MARGIN2
.para
Allows the user to have a text file displayed on the screen. It will appear 
one page at a time.
.left margin1
@7. TX 1 @Direct output to disk
.LEFT MARGIN2
.para
Used to direct output that would normally appear on the screen to a file. 
.para
Select redirection of either text or graphics, and 
supply the name of the file that the output should be written to.
.para
The results from the next options selected will not appear on the screen
but will be written to the file. When option 7 is selected again the file
will be closed and output will again appear on the screen.
.left margin1
@8. TX 1 @Write active region to disk
.LEFT MARGIN2
.para
This option allows users to 
write the current active sequence to a disk file in Staden format. 
.left margin1
@9. TX 1 @Edit the sequences
.LEFT MARGIN2
.para
This function allows the user to insert or delete parts of either sequence 
to help align them. The inserted characters are dashes.
.left margin1
@10. TX 2 @Clear graphics
.LEFT MARGIN2
.para
 Clears the screen of both text and graphics.
.left margin1
@11. TX 2 @Clear text
.LEFT MARGIN2
.para
 Clears only text from the screen.
.left margin1
@12. TX 2 @Draw a ruler
.LEFT MARGIN2
.para
This option allows the user to draw a ruler or scale along the axes of
the screen to help identify the coordinates of points of interest. The
user can define the position of the first sequence element to be marked.
For example, if the active region is 1501 to 8000, the user might wish to
mark every 1000th element starting at either 1501 or 2000, depending on
whether the active region is to be treated as an independent unit with
its own numbering starting at its left edge, or as part of the whole
sequence. The user can also define the separation of the ticks on the
scale and their height. If required the labelling routine can be used to
add numbers to the ticks.
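.para
The two choices of first tick in the example above give the following
tick positions. The Python function is a hypothetical sketch of the
arithmetic only and is not the program's ruler routine.
.lit

```python
def tick_positions(region_start, region_end, first_tick, separation):
    """Sequence positions marked on a ruler over the active region."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    if not region_start <= first_tick <= region_end:
        raise ValueError("first tick must lie in the active region")
    return list(range(first_tick, region_end + 1, separation))


# Active region treated as an independent unit numbered from its left edge.
print(tick_positions(1501, 8000, 1501, 1000))
# Active region treated as part of the whole sequence.
print(tick_positions(1501, 8000, 2000, 1000))
```

.end lit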
.PARA
To escape type !
.left margin1
@13. TX 2 @Use cross hair
.LEFT MARGIN2
.para
This function puts a steerable cross on the screen that can be used to
find the coordinates of points in the sequence. The user can move the
cross around using the directional keys; when the space bar is hit the
program will write out the coordinates of the cross in sequence units and
the option will be exited.
.para
If instead, 
the user hits a , the position will be displayed but the cross will remain on 
the screen.
.para
If a letter s is hit the sequences around the cross hair are displayed as a 
short alignment (as shown below) and the cross remains on the screen.
.lit
        97        107
         VPVGRGLLGR VVNTLGAPID
         ****   ***   * ** * *
         VPVGKATLGR IMNVLGEPVD
        78         88

.end lit
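.para
The line of asterisks in the short alignment above marks the positions at
which the two sequences are identical. A minimal Python sketch follows;
the function name is hypothetical, and the sequences are those of the
example with the spaces between blocks of ten removed.
.lit

```python
def identity_line(seq_a, seq_b):
    """Return a line with * under each position where the residues agree."""
    return "".join("*" if a == b else " " for a, b in zip(seq_a, seq_b))


top = "VPVGRGLLGRVVNTLGAPID"
bottom = "VPVGKATLGRIMNVLGEPVD"
print(top)
print(identity_line(top, bottom))
print(bottom)
```

.end lit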
.PARA
If a letter m is hit the sequences around the cross hair are displayed in 
the form of a matrix (as shown below) and the cross remains on the screen.

.lit

   VPVGKATLGRIMNVLGEPVD
  D...................DD
  I..........I.........I
  P.P...............P..P
  A.....A..............A
  G...G....G......G....G
  L.......L......L.....L
  T......T.............T
  N............N.......N
  VV.V..........V....V.V
  VV.V..........V....V.V
  R.........R..........R
  G...G....G......G....G
  L.......L......L.....L
  L.......L......L.....L
  G...G....G......G....G
  R.........R..........R
  G...G....G......G....G
  VV.V..........V....V.V
  P.P...............P..P
  VV.V..........V....V.V
   VPVGKATLGRIMNVLGEPVD

.end lit
.para
The function is also used prior to "align sequences" in order to delineate
the region to be aligned. The crosshair is positioned at the bottom left of
the region and the crosshair option quit. Then the crosshair option is
selected again, the crosshair is moved to the top right of the region to be
aligned, and the option is quit again.
.left margin1
@14. TX 2 @Reposition plots
.LEFT MARGIN2
.para
The position of the plots is defined relative to a user's drawing board
which has size 1-10,000 in x and 1-10,000 in y. Plots are drawn in a
window defined by x0,y0 and xlength,ylength, where x0,y0 is the position
of the bottom left hand corner of the window, xlength is the width of the
window and ylength is its height.
.lit
   --------------------------------------------------------- 10,000
   1                                                       1
   1       --------------------------------------   ^      1
   1       1                                    1   1      1
   1       1                                    1   1      1
   1       1                                    1 ylength  1
   1       1                                    1   1      1
   1       1                                    1   1      1
   1       --------------------------------------   v      1
   1  x0,y0^                                               1
   1       <---------------xlength-------------->          1
   ---------------------------------------------------------      1
   1                                                   10,000

.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "DIAGMARG" when the
program is started. Users can have their own file if required. This
option allows users to change window positions whilst running the
program. If the user types only carriage return for any value it will
remain unchanged. The cross-hair can be used to choose suitable heights.
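.para
As an illustration of how a window is used, the Python sketch below maps
a sequence position onto a drawing board x coordinate. It assumes that
the active region is stretched linearly across the width of the window;
that assumption, the function names and the example values are not taken
from the program.
.lit

```python
BOARD_MIN, BOARD_MAX = 1, 10000


def window_fits(x0, xlength):
    """True if a window starting at x0 stays on the drawing board."""
    return BOARD_MIN <= x0 and x0 + xlength <= BOARD_MAX


def to_board_x(position, region_start, region_end, x0, xlength):
    """Map a sequence position linearly onto the window (an assumption)."""
    if region_end <= region_start:
        raise ValueError("region must contain at least two positions")
    fraction = (position - region_start) / (region_end - region_start)
    return x0 + fraction * xlength


print(window_fits(1000, 8000))                 # prints True
print(to_board_x(501, 1, 1001, 1000, 8000))    # prints 5000.0
```

.end lit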
.LEFT MARGIN1
@15. TX 2 @Label a diagram
.LEFT MARGIN2
.para
This routine allows users to label any diagrams they have produced. They
are asked to type in a label. When the user types carriage return to
finish typing the label the cross-hair appears on the screen. The user can
position it anywhere on the screen. If the user types R (for right
justify) the label will be written on the diagram with its right end at
the cross-hair position. If the user types L (for left justify) the label
will be written with its left end at the cross-hair position. The
cross-hair will then immediately reappear. The user may then put the same
label on another part of the diagram in the same way or, by hitting the
space bar, be asked whether to type in another label.
.left margin1
@16. TX 2 @Display a map
.LEFT MARGIN2
.para
NOT AVAILABLE.
This draws a map of any sequence features selected by the user. These
features may be protein coding regions (CDS), tRNA genes (TRNA), promoter
positions (PRM), etc. Users may define their own feature table key names.
The coordinates must be stored in a file in the format of an EMBL feature
table.
.left margin1
@17. TX 4 @Apply identities algorithm
.LEFT MARGIN2
.para
                The identities algorithm finds runs of identical characters 
in the sequence. Its main value is speed, being 100's of times faster than 
the proportional algorithm. It is of course not very sensitive, and should 
only be used for a quick scan. The cutoff score is the minimum number of 
consecutive matching characters.
All runs of identical characters that are at least as long as the cutoff 
score will produce a dot on the screen.
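.para
As an illustration only (this is not the program's own code), the run 
search described above can be sketched in Python. The function name 
identity_runs and the simple diagonal-by-diagonal scan are assumptions 
made for this example; the real routine is presumably far faster.
.lit
```python
def identity_runs(seq_h, seq_v, cutoff):
    """Return (i, j, length) for every maximal run of identical characters
    of at least `cutoff` residues. Positions are 0-based start points in
    the horizontal and vertical sequences."""
    runs = []
    n, m = len(seq_h), len(seq_v)
    # Diagonal d holds the cells (i, j) with i - j == d.
    for d in range(-(m - 1), n):
        i, j = max(d, 0), max(-d, 0)
        run = 0
        while i < n and j < m:
            if seq_h[i] == seq_v[j]:
                run += 1
            else:
                if run >= cutoff:
                    runs.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run may reach the edge of the matrix.
        if run >= cutoff:
            runs.append((i - run, j - run, run))
    return runs


print(identity_runs("ACGTAC", "TTGTAA", 3))
```
.end lit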
.para
See also quick scan.
.para
Typical dialogue follows.
.lit
? Menu or option number=d17
? Identity score (1-20) (2) =3
Working

 missing graphics

.end lit
.left margin1
@18. TX 4 @Apply proportional algorithm
.para
                        This method, generally  the  most  useful,  was  first
          described  by  McLachlan  and involves calculating a score for
          each position in the matrix by summing  points  found  when  
looking
          forwards  and  backwards  along  a  diagonal line of a given length.
          This length, called the span, must be an odd number.
The algorithm does not simply look for identity  but  uses  a
          score  matrix  that  contains  scores  for  every  possible  pair of
          characters.  At each point that a threshold score is achieved the 
program marks the screen in one of two ways. It will either place a 
single 
dot at the position corresponding to the centre of the matching span, or 
it 
will plot a dot for each identical residue within each matching span.
Alternatively, the "list matching spans" 
option will list the segments that match.
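.para
The span scoring described above can be sketched as follows. This is a 
minimal Python illustration, not the program's implementation: the names 
proportional_dots and identity are hypothetical, positions are 0-based, 
and any function returning a score for a pair of characters can stand in 
for the score matrix.
.lit
```python
def proportional_dots(seq_h, seq_v, span, threshold, score):
    """Return (i, j, total) for every span centre whose summed score
    reaches the threshold. `score(a, b)` plays the role of the matrix."""
    if span % 2 == 0:
        raise ValueError("span must be an odd number")
    half = span // 2
    dots = []
    for i in range(half, len(seq_h) - half):
        for j in range(half, len(seq_v) - half):
            # Sum the pair scores looking backwards and forwards
            # along the diagonal through (i, j).
            total = sum(score(seq_h[i + k], seq_v[j + k])
                        for k in range(-half, half + 1))
            if total >= threshold:
                dots.append((i, j, total))
    return dots


def identity(a, b):
    """Identity matrix: 1 for a match, 0 otherwise."""
    return 1 if a == b else 0


print(proportional_dots("GATTACA", "CATTAGA", 3, 3, identity))
```
.end lit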
.para
For comparing amino acid sequences  we  usually use  the  score
          matrix  shown  below which was calculated by adding 10 (to make
          every term >0) to each term of the relatedness odds matrix MDM78  
of
          Dayhoff.  This matrix MDM78 was calculated by looking at accepted
          point mutations in 71 families of closely related proteins  and,  of
          those  tested  by  Dayhoff,  was found to be the most powerful 
score
          matrix  for  finding  distant  relationships  between   amino   acid
          sequences. 
.left margin1
.lit

                           AMINO ACID SCORE MATRIX
                           -----------------------

   C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?  
C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
- 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
  10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
.end lit
One alternative for proteins is to use an identity matrix.
For comparing nucleic acids we usually use the matrix shown below.
.lit

         DNA SCORE MATRIX

             A C G T X 
           A 1 0 0 0 0 
           C 0 1 0 0 0 
           G 0 0 1 0 0 
           T 0 0 0 1 0 
           X 0 0 0 0 0 
.end lit
See option 32 for how to change the score matrices.
.para
When a sequence is compared against itself to look for repeats it is 
possible to use the proportional algorithm in a mode such that the main 
diagonal is not shown. See option 30.
.para
Typical dialogue follows.
.lit

? Menu or option number=d18
? Odd span length (1-401) (11) =
? Proportional score (1-297) (132) =
Working

 missing graphics

.end lit
.left margin1
@19. TX 4 @List matching spans
.LEFT MARGIN2
This option applies the proportional algorithm using the current span and 
cut-off score, but instead of drawing a dot matrix it lists all the 
matching spans. When a sequence is compared against itself to look for 
repeats it is 
possible to use this algorithm in a mode such that the main 
diagonal is not listed. See option 30.
.para
Typical dialogue follows.
.lit
? Menu or option number=d19
? Odd span length (1-401) (11) =
? Proportional score (1-297) (132) =148
List matching spans
Working
     76
IEVPVGKATLG
LEVPVGRGLLG
     95
     77
EVPVGKATLGR
EVPVGRGLLGR
     96
     78
VPVGKATLGRI
VPVGRGLLGRV
     97
     79
PVGKATLGRIM
PVGRGLLGRVV
     98
     85
LGRIMNVLGEP
LGRVVNTLGAP
    104
     86
GRIMNVLGEPV
GRVVNTLGAPI
    105
     87
RIMNVLGEPVD
RVVNTLGAPID
    106

.end lit
.left margin1
@20. TX 3 @Set span length
.para
                        The proportional algorithm
calculates a score for
          each position in the matrix by summing  the 
points  found  when  looking
          forwards  and  backwards  along  a  diagonal line of a given length.
          This length, called the span, should be an odd number  so  that  the
          score  for  any  point  is correctly positioned at the centre of the
          span.  This option allows the user to define the span length. 
Short spans can produce noisy diagrams, but are less 
affected by insertions and deletions than long spans. Long 
spans, however, can detect more distant relationships. Long spans can 
also suffer from a persistence problem: dots are plotted even when all 
the "signal" lies to one side of the span's central position. To help 
avoid this, try the option that plots the position of every matching 
residue within a matching span (see option 31). 
This is most useful when an identity matrix is being used.
.left margin1
@21. TX 3 @Set proportional score
.LEFT MARGIN2
.para
                        The proportional algorithm
calculates a score for
          each position in the matrix by summing  the 
scores for the individual amino acids found  when  looking
          forwards  and  backwards  along  a  diagonal line of a given length.
All points at which the proportional score is achieved will produce a dot 
on the diagram. (The same score is used for the 'LIST MATCHING SPANS' 
option.)
.para
Before choosing a score the user can apply the routine that will calculate 
the expected score, or can calculate a histogram of observed scores. It is 
best to start with a high score to avoid an overcrowded diagram.
.left margin1
@22. TX 3 @Set identities score
.LEFT MARGIN2
.para
The identities algorithm is of limited value as it only finds runs of 
matching characters; however, it has the virtue of being very fast.
 This option allows the user to set the minimum length 
of run that will produce a dot on the screen.
.left margin1
@23. TX 3 @Calculate expected scores
.left margin2
.para
This function calculates the "double matching probability" of McLachlan.
The
          "double   matching   probability"  is  the  probability  of  finding
          particular  scores  given  two  infinitely  long  sequences  of  the
          composition  of  those  being compared, with the current span 
length
          and score matrix.  By using this option the  user  can  choose  to
          plot   all   the  matches  for  which  the  score  exceeds  a  given
          significance  level  (such  as  1%). 
Generally it is best to begin at a
          low probability level (that is, a high score) to avoid an overcrowded diagram.
.para   
When the calculation of the expected scores
is  finished  the program offers
               the user 3 ways of examining the results:
.LEFT MARGIN2   
                "Show probability for a score" allows the user  to  type  in  a  
score  and  the
                 program responds with the probability of achieving that level
                 of score.
.LEFT MARGIN2   
                "Show score for a probability" allows the user to type in a 
probability value and
                 the program types the score that corresponds to that level of
                 probability.
.LEFT MARGIN2   
                "List scores and probabilities" is the command to list out the  
scores  and  their
                 corresponding  probabilities.   The user is asked to supply a
                 further parameter, the "number of steps between scores", and 
the program only lists
                 every stepsize-th score, e.g. a stepsize of 5 will get every 5th
                 score listed.
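.para
One plausible way to compute such probabilities is sketched below in 
Python. It is an assumption-laden illustration, not the program's code: 
it treats every position in a span as independent, takes residue 
frequencies from the two sequences, and reports the probability of 
reaching at least a given score (the listing in the dialogue, which 
starts at 1.0 for low scores, suggests this cumulative form). All 
function names are hypothetical.
.lit
```python
from collections import Counter, defaultdict


def pair_score_distribution(seq_h, seq_v, score):
    """Probability of each single-pair score, given the two compositions."""
    freq_h, freq_v = Counter(seq_h), Counter(seq_v)
    dist = defaultdict(float)
    for a, na in freq_h.items():
        for b, nb in freq_v.items():
            dist[score(a, b)] += (na / len(seq_h)) * (nb / len(seq_v))
    return dict(dist)


def span_score_distribution(pair_dist, span):
    """Convolve the single-pair distribution with itself `span` times."""
    dist = {0: 1.0}
    for _ in range(span):
        nxt = defaultdict(float)
        for s, p in dist.items():
            for t, q in pair_dist.items():
                nxt[s + t] += p * q
        dist = dict(nxt)
    return dist


def tail_probability(span_dist, threshold):
    """Probability that a span scores at least `threshold`."""
    return sum(p for s, p in span_dist.items() if s >= threshold)


def identity(a, b):
    return 1 if a == b else 0


pair = pair_score_distribution("AACC", "ACGT", identity)
spans = span_score_distribution(pair, 3)
print(tail_probability(spans, 3))
```
.end lit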
.para
Typical dialogue follows.
.lit
? Menu or option number=d23
? Odd span length (1-401) (11) =
? Proportional score (1-297) (132) =
 
Working
Average score=   103.18557
RMS deviation=     7.85276
X 1 Show probability for a score
  2 Show score for a probability
  3 List scores and probabilities
? 0,1,2,3 =
 
? Show probability for score (1-165) (134) =160
Probability of score    160 is 0.0000000008
X 1 Show probability for a score
  2 Show score for a probability
  3 List scores and probabilities
? 0,1,2,3 =2
? Show score for probability (0.0000000001-1.) (0.00001) =0.0000001
Score for probability 0.0000001000 is   153
  1 Show probability for a score
X 2 Show score for a probability
  3 List scores and probabilities
? 0,1,2,3 =3
? Number of steps between scores (1-10) (5) =
 
     0  0.10000E+01    100  0.67232E+00    200  0.18977E-20
     5  0.10000E+01    105  0.42119E+00    205  0.42561E-22
    10  0.10000E+01    110  0.20671E+00    210  0.87767E-24
    15  0.10000E+01    115  0.78860E-01    215  0.16651E-25
    20  0.10000E+01    120  0.23515E-01    220  0.27300E-27
    25  0.10000E+01    125  0.55406E-02    225  0.00000E+00
    30  0.10000E+01    130  0.10443E-02    230  0.00000E+00
    35  0.10000E+01    135  0.15935E-03    235  0.00000E+00
    40  0.10000E+01    140  0.19906E-04    240  0.00000E+00
    45  0.10000E+01    145  0.20569E-05    245  0.00000E+00
    50  0.10000E+01    150  0.17758E-06    250  0.00000E+00
    55  0.10000E+01    155  0.12938E-07    255  0.00000E+00
    60  0.10000E+01    160  0.80360E-09    260  0.00000E+00
    65  0.10000E+01    165  0.43009E-10    265  0.00000E+00
    70  0.10000E+01    170  0.20049E-11    270  0.00000E+00
    75  0.99997E+00    175  0.82263E-13    275  0.00000E+00
    80  0.99949E+00    180  0.29998E-14    280  0.00000E+00
    85  0.99448E+00    185  0.98050E-16    285  0.00000E+00
    90  0.96543E+00    190  0.28934E-17    290  0.00000E+00
    95  0.86836E+00    195  0.77556E-19    295  0.00000E+00
  1 Show probability for a score
  2 Show score for a probability
X 3 List scores and probabilities
? 0,1,2,3 =!
 

.end lit
.left margin1
@24. TX 3 @Calculate observed scores
.left margin2
.para
This option applies the proportional algorithm to the currently active 
sequence but instead of producing a 
dot matrix it calculates a histogram of observed scores.
             The speed of this calculation
               of course depends on the size of the active 
regions, but  when  it
               is  completed  the  program offers the user 3 ways of examining
               the results:
.para   
 "Show percentage for score" allows the user to type in a score and the 
program
                 responds  with  the  percentage  of  points that achieve this
                 value.
.para
 "Show score for percentage" allows the user to type in a percentage and  
the
                 program  responds  with  the  corresponding score.  Values of
                 this  score  and  above  are  only  achieved  by  the   given
                 percentage of points.
.para
 "List scores and percentages" is the command to  list  out  the  scores  
and  the
                 percentage of points achieving them.
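.para
The table of observed scores can be sketched as below. This is a minimal 
Python illustration under stated assumptions: the name 
observed_score_table is hypothetical, and the counts are cumulative 
(points reaching a score or better), which is how the listing in the 
dialogue appears to be organised.
.lit
```python
from collections import Counter


def observed_score_table(seq_h, seq_v, span, score):
    """Return rows (score, points reaching it, percentage of all points),
    in ascending order of score."""
    half = span // 2
    counts = Counter()
    for i in range(half, len(seq_h) - half):
        for j in range(half, len(seq_v) - half):
            total = sum(score(seq_h[i + k], seq_v[j + k])
                        for k in range(-half, half + 1))
            counts[total] += 1
    n_points = sum(counts.values())
    table, running = [], 0
    # Accumulate from the highest score downwards.
    for s in sorted(counts, reverse=True):
        running += counts[s]
        table.append((s, running, 100.0 * running / n_points))
    return table[::-1]


def identity(a, b):
    return 1 if a == b else 0


for row in observed_score_table("GATTACA", "CATTAGA", 3, identity):
    print(row)
```
.end lit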
.para
Typical dialogue follows.
.lit
? Menu or option number=24
Working
Maximum observed score is    152
X 1 Show percentage reaching a score
  2 Show score for a percentage
  3 List scores and percentages
? 0,1,2,3 =
 
? Show percentage for score (1-152) (114) =144
Percentage of points with score    144 is   0.005486297
X 1 Show percentage reaching a score
  2 Show score for a percentage
  3 List scores and percentages
? 0,1,2,3 =2
 
? Show score for percentage (0.00001-1.) (0.001) =0.01
Score for percentage   0.010000000 is   143
  1 Show percentage reaching a score
X 2 Show score for a percentage
  3 List scores and percentages
? 0,1,2,3 =
 
? Show score for percentage (0.00001-1.) (0.001) =1.
Score for percentage   1.000000000 is   124
  1 Show percentage reaching a score
X 2 Show score for a percentage
  3 List scores and percentages
? 0,1,2,3 =3
? Number of steps between scores (1-10) (5) =1
 
   73   236953  0.10000E+03
   74   236951  0.99999E+02
   75   236951  0.99999E+02
   76   236950  0.99998E+02
   77   236945  0.99996E+02
   78   236942  0.99995E+02
   79   236929  0.99989E+02
   80   236900  0.99977E+02
  
  missing data here

  130      384  0.16206E+00
  131      307  0.12956E+00
  132      239  0.10086E+00
  133      180  0.75964E-01
  134      134  0.56551E-01
  135      103  0.43468E-01
  136       78  0.32918E-01
  137       67  0.28276E-01
  138       46  0.19413E-01
  139       40  0.16881E-01
  140       33  0.13927E-01
  141       29  0.12239E-01
  142       24  0.10129E-01
  143       19  0.80184E-02
  144       13  0.54863E-02
  145       10  0.42202E-02
  146        8  0.33762E-02
  147        7  0.29542E-02
  148        7  0.29542E-02
  149        6  0.25321E-02
  150        5  0.21101E-02
  151        3  0.12661E-02
  152        3  0.12661E-02
  1 Show percentage reaching a score
  2 Show score for a percentage
X 3 List scores and percentages
? 0,1,2,3 =!
 
.end lit
.left margin1
@25. TX 3 @Show current parameter settings
.LEFT MARGIN2
.para
This function lists the names of the current sequences, their total
lengths, the start 
and end points of the active sequence and the current values of span and 
cut-off scores. It also shows if the main diagonal will be shown, or if 
the 
proportional algorithm will mark all identities in matching spans.
.para
Typical dialogue follows.
.lit
? Menu or option number=25
Horizontal sequence
ALPHA.PRT
Positions
     1 TO    514
Vertical sequence
BETA.PRT
Positions
     1 TO    461
Span length=    11
Scores
Proportional=   132
Identities=     3
Identites off
Main diagonal shown


.end lit
.left margin1
@27. TX 2 @Draw a /
.left margin2
.para
This option simply draws a diagonal line from the bottom left of the 
diagram to the top right. It can be an aid when trying to align the 
sequences.
.left margin1
@26. TX 4 @Quick scan
.left margin2
.para
The algorithm is as follows. The dot matrix positions are found for all 
words of some minimum length (length 1 being the most sensitive) 
that are common to both sequences. Imagine a diagonal line running from 
corner to corner of the diagram, at right angles to the diagonals in the 
dot matrix. The scores for the common words (according to the current 
score matrix, e.g. MDM78) are accumulated at the appropriate positions
on that imaginary line, hence  
producing a histogram. The histogram is analysed to find its mean and 
standard deviation. The diagonals that lie above some cutoff score 
(defined in standard deviation units), are rescanned using the 
proportional algorithm, and a diagram produced. The method is very fast, 
and is also employed  by the library comparison program.
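.para
The first stage of the quick scan can be sketched as follows. This is a 
Python illustration only; the names diagonal_histogram and 
select_diagonals are hypothetical, diagonals are numbered by the 
difference between the two positions, and the use of the population 
standard deviation is an assumption. The selected diagonals would then 
be rescanned with the proportional algorithm.
.lit
```python
from collections import defaultdict
from statistics import mean, pstdev


def diagonal_histogram(seq_h, seq_v, word_len, score):
    """Accumulate the scores of common words on each diagonal i - j."""
    positions = defaultdict(list)
    for j in range(len(seq_v) - word_len + 1):
        positions[seq_v[j:j + word_len]].append(j)
    hist = defaultdict(float)
    for i in range(len(seq_h) - word_len + 1):
        word = seq_h[i:i + word_len]
        for j in positions.get(word, ()):
            hist[i - j] += sum(score(c, c) for c in word)
    return dict(hist)


def select_diagonals(hist, len_h, len_v, n_sd):
    """Return the diagonals lying more than n_sd standard deviations
    above the mean of the histogram."""
    diagonals = range(-(len_v - 1), len_h)
    values = [hist.get(d, 0.0) for d in diagonals]
    cutoff = mean(values) + n_sd * pstdev(values)
    return [d for d, v in zip(diagonals, values) if v > cutoff]


def identity(a, b):
    return 1 if a == b else 0


h, v = "QQQQHELLOWORLD", "HELLOWORLDPPPP"
hist = diagonal_histogram(h, v, 3, identity)
print(select_diagonals(hist, len(h), len(v), 3.0))
```
.end lit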
.para
Typical dialogue follows.
.lit

? Menu or option number=d26
? Identity score (1-20) (3) =
? Odd span length (1-401) (11) =
? Proportional score (1-297) (132) =
? Number of sd above mean (0.00-10.00) (5.00) =

 missing graphics
 

.end lit
.left margin2
.para
SIPL, the library searching version of SIP
.para
This program compares a probe sequence against a library of sequences using 
the quick scan algorithm, sorts the matches into descending order of score, 
and produces optimal alignments of the best scores using the Myers and 
Miller method. It is very rapid.
.para
Use of lists of entry names 
.para
SIPL has the ability to 
restrict searches to subsets of the libraries. This does not require 
sublibraries to be created but instead is achieved by using files 
containing a list of the entry names of sequences. The user may choose to 
search only those entries on the list or, alternatively to search all but 
those on the list (i.e. in the latter case
the list contains the names of those to be excluded).
 The programs can search libraries that have indexes and those that 
do not.
 If a list of names for inclusion is used,
then the search will be faster if the index is present. In all other 
circumstances the whole library will be read. 
The list must be in library order except when it is used
to include entries, and an index is available.
The list must contain each entry name on a separate line, with the name 
starting in column 1 of the line, i.e. there must be no spaces at the start 
of the line.
The list of entry names
can be produced by the keyword searches of nip, pip, sip, etc as long 
as the listings produced have a space character separating the entry name 
from the entry description. This will depend on how well the library 
reformatting programs work. For example swissprot entry names tend to run 
into the beginning of the descriptions, but other libraries are generally 
OK.
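.para
The list format and the include/exclude choice can be illustrated with 
the short Python sketch below. It does not show how SIPL reads its 
files; the function names and the entry names ENTRYA, ENTRYB and ENTRYC 
are invented for the example.
.lit
```python
def read_entry_names(lines):
    """Take the entry name from each line. The name must start in
    column 1; anything after the first space is ignored."""
    names = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line[0].isspace():
            raise ValueError("entry name must start in column 1: %r" % line)
        names.append(line.split()[0])
    return names


def select_entries(library_names, listed, exclude=False):
    """Keep only the listed entries, or all but those listed."""
    listed = set(listed)
    return [name for name in library_names if (name in listed) != exclude]


names = read_entry_names(["ENTRYA\n", "ENTRYC some description\n"])
library = ["ENTRYA", "ENTRYB", "ENTRYC"]
print(select_entries(library, names))
print(select_entries(library, names, exclude=True))
```
.end lit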

.left margin1
@28. TX 4 @Align sequences
.left margin2
.para
This function will produce an optimal alignment of two segments of the 
sequence. 
The dynamic programming alignment algorithm is based on that of Miller 
and Myers (). It guarantees to produce alignments with the optimum score 
given a score matrix, a gap start penalty, and a gap extension penalty. 
That is, starting a gap costs a fixed penalty (F) and each residue added 
to the gap incurs a further penalty (E) so that for each gap of length K 
residues the penalty is F + K*E. Gaps at the ends of sequences incur no 
penalty.
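.para
The penalty rule above can be checked with a small Python sketch. It 
scores an alignment that has already been padded, rather than finding 
the optimal one, and the names gap_penalty and alignment_score are 
hypothetical. Gaps that touch either end of the alignment are not 
penalised, as stated above.
.lit
```python
def gap_penalty(length, start, extend):
    """Penalty F + K*E for a gap of K residues."""
    return start + length * extend


def alignment_score(row_a, row_b, score, start, extend, pad="-"):
    """Score two padded rows of equal length; end gaps are free."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    n, total, col = len(row_a), 0, 0
    while col < n:
        a, b = row_a[col], row_b[col]
        if a != pad and b != pad:
            total += score(a, b)
            col += 1
            continue
        # Find where this run of padding characters ends.
        gapped = row_a if a == pad else row_b
        end = col
        while end < n and gapped[end] == pad:
            end += 1
        if col > 0 and end < n:  # internal gap only
            total -= gap_penalty(end - col, start, extend)
        col = end
    return total


def identity(a, b):
    return 1 if a == b else 0


print(alignment_score("MA--TGK", "MQLNSTE", identity, 10, 10))
```
.end lit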
                                                                               
.para
The routine can only handle segments of sequence of maximum 
length 5000 residues. When the sequences are read in the alignment 
segment 
will be set to the first 5000 residues. A different segment can be 
selected by prefixing the option number by the letter D, in which case the 
cross-hair is used to identify the two ends. When the cross-hair 
appears, first position it at the bottom left of the 
segment and type a character other than s 
or m or ",". When the cross-hair reappears, position it at the top right 
of the segment, and type a keyboard character.
The aligned sequences will replace the active sequence if the user 
confirms "keep alignment". By alternate use of the 
plotting and alignment routines it is possible to rapidly produce an 
alignment of quite long sequences.
.para
Typical dialogue follows.
.lit

28 = Align sequences
? Menu or option number=d28
Define the region to align using the cross-hair.
First identify the bottom left position and exit
the cross-hair routine. Then the top right.

(Bell rings, type return, cross hair appears)

? Penalty for starting a gap (1-100) (10) =
? Penalty for each residue in gap (1-100) (10) =
 
Aligning region           1 to         461
    with region           1 to         514
         1         11         21         31         41         51
         MA--TGKIVQ VIGA------ VVDVEFPQDA VPRVYDALEV QNG------N ERLVL-----
         *      *    *         **            * *       *        *   *
         MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
         1         11         21         31         41         51
        61         71         81         91        101        111
         EVQQQLGGGI VRTIAMGSSD GLRRGLDVKD LEHPIEVPVG KATLGRIMNV LGEPVDMKGE
              *     *    **     *  *  **       *****    ***  *  ** * * **
         AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
        61         71         81         91        101        111
       121        131        141        151        161        171
         IGEEERWAIH RAAPSYEELS NSQELLETGI KVIDLMCPFA KGGKVGLFGG AGVGKTVNMM
                *     **   *          **  *  * * *    *      *     ***
         LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
       121        131        141        151        161        171
       181        191        201        211        221        231
         ELIRNIAIEH SGYS-VFAGV GERTREGNDF YHEMTDSNVI DKVSLVYGQM NEPPGNRLRV
           *  *     **         *                          *      *
         DAI--INQRD SGIKCIYVAI GQKASTISNV VRKLEEHGAL ANTIVVVATA SESAALQYLA
       181        191        201        211        221        231
       241        251        261        271        281        291
         ALTGLTMAEK FRDEGRDVLL FVDNIYRYTL AGTEVSALLG RMPSAVGYQP TLAEEMGVLQ
               * *  *** * * *    *        *    * **  * *                *
         RMPVALMGEY FRDRGEDALI IYDDLSKQAV AYRQISLLLR RPPGREAFPG DVFYLHSRLL
       241        251        261        271        281        291
       301        311        321        331        341        351
         ERITST---- ---------- -KTGSITSVQ AVYVPADDLT DPSPATTFAH LDATVVLSRQ
         **                     **** *         * *      *        *    *
         ERAARVNAEY VEAFTKGEVK GKTGSLTALP IIETQAGDVS AFVPTNVISI TDGQIFLETN
       301        311        321        331        341        351
       361        371        381        391        401        411
         IASLGIYPAV DPLDSTSRQL DPLVVGQEHY DTAR----GV QSILQRYQEL KDIIAILGMD
             ** ***  *  * **      * *             *     *  * **
         LFNAGIRPAV NPGISVSR-- ---VGGAAQT KIMKKLSGGI RTALAQYREL AAFSQFAS--
       361        371        381        391        401        411
       421        431        441        451        461        471
         ELSEEDKLVV ARARKIQRFL SQ----PFFV AE----VFTG SPGKYVSLKD --TIRGFKGI
          *             *    *  *    *  * *      *     * *         *  *
         DLDDATRKQL DHGQKVTELL KQKQYAPMSV AQQSLVLFAA ERG-YLADVE LSKIGSFEAA
       421        431        441        451        461        471
       481        491        501        511        521
         MEG--EYDHL P-EQAFYMVG SIEEAVE--- --------KA KKL*
                **  *  *     *       *                  *
         LLAYVDRDHA PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
       481        491        501        511        521
Conservation  22.5%
Number of padding characters inserted    63 and    10
? (y/n) (y) Keep alignment n
 

.end lit
.left margin1
@29. TX 1 @Complement the sequences
.left margin2
.para
This function allows users to reverse and complement nucleic acid 
sequences.
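.para
For reference, reversing and complementing can be sketched in Python as 
below. The function name is hypothetical, and leaving characters other 
than A, C, G and T unchanged is an assumption of this example.
.lit
```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the order."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACGT"))
```
.end lit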
.left margin1
@30. TX 3 @Switch main diagonal
.left margin2
.para
If a sequence is being compared against itself to look for repeats it is
sometimes convenient if the main diagonal is not included in the 
comparison. This function allows users to set a switch that determines 
whether or not to include the main 
diagonal for all the comparison methods.
If the switch is set, and the active regions for both sequences have 
the same start position, then the main diagonal will not be compared. 
.left margin1
@31. TX 3 @Switch identities
.left margin2
.para
This function allows a switch to be set or unset. The switch determines 
which of two forms of plot will be produced by the proportional 
algorithm. 
One form of output (the original method) plots a dot at the centre of each 
span that reaches the threshold score; whereas the other form will plot
dots for all matching residues that lie within spans that reach the 
threshold.
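.para
The second form of plot can be sketched as follows. This is a Python 
illustration with hypothetical names, using 0-based positions; it is 
not the program's own routine.
.lit
```python
def identity_dots_in_spans(seq_h, seq_v, span, threshold, score):
    """Return a dot for every identical residue pair lying inside a
    span that reaches the threshold score."""
    half = span // 2
    offsets = range(-half, half + 1)
    dots = set()
    for i in range(half, len(seq_h) - half):
        for j in range(half, len(seq_v) - half):
            total = sum(score(seq_h[i + k], seq_v[j + k]) for k in offsets)
            if total >= threshold:
                dots.update((i + k, j + k) for k in offsets
                            if seq_h[i + k] == seq_v[j + k])
    return sorted(dots)


def identity(a, b):
    return 1 if a == b else 0


print(identity_dots_in_spans("GATTACA", "CATTAGA", 3, 3, identity))
```
.end lit
.para
For these two short sequences the original form of plot would mark only 
the two span centres, whereas this form marks the four matching 
residues covered by those spans.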
.left margin1
@32. TX 3 @Change score matrix
.left margin2
.para
This option allows users to select their 
own score matrix for use with the proportional algorithm. The choices 
are:
.lit

 1 = MDM78
 2 = identity
 3 = your own matrix

.end lit
.para
MDM78 is the standard matrix that is used for proteins and an 
identity matrix is the default matrix for nucleic acids. However an 
identity 
matrix is also useful for protein comparisons. "Your own matrix" allows 
users to apply any other matrix, as long as the matrix file is in the same
format as MDM78.
For comparisons of DNA it might be useful to try a matrix that gives, say, 
3 for exact matches, 1 for purine-purine (R-R) or pyrimidine-pyrimidine 
(Y-Y) pairs, and 0 otherwise.
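.para
Such a DNA matrix can be written down as in the Python sketch below. The 
function name dna_score is hypothetical and the printed layout is for 
inspection only; it is not claimed to be the file format that the 
program expects.
.lit
```python
BASES = "ACGT"
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def dna_score(a, b):
    """3 for an exact match, 1 for R-R or Y-Y pairs, otherwise 0."""
    if a == b and a in set(BASES):
        return 3
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return 1
    return 0


symbols = BASES + "X"
print("  " + " ".join(symbols))
for a in symbols:
    print(a + " " + " ".join(str(dna_score(a, b)) for b in symbols))
```
.end lit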
.left margin1
@33. TX 3 @Set number of sd's for Quickscan
.left margin2
.para
The quickscan 
algorithm is as follows. The dot matrix positions are found for all 
words of some minimum length (length 1 being the most sensitive) 
that are common to both sequences. Imagine a diagonal line running from 
corner to corner of the diagram, at right angles to the diagonals in the 
dot matrix. The scores for the common words (according to the current 
score matrix, e.g. MDM78) are accumulated at the appropriate positions
on that imaginary line, hence  
producing a histogram. The histogram is analysed to find its mean and 
standard deviation. The diagonals that lie above some cutoff score 
(defined in standard deviation units), are rescanned using the 
proportional algorithm, and a diagram produced.
.para
This option allows the number of sd's to be set.
.left margin1
@34. TX 3 @Set gap penalties
.left margin2
.para
The alignment 
function will produce an optimal alignment of two segments of the 
sequence. 
The dynamic programming alignment algorithm is based on that of Miller 
and Myers (). It guarantees to produce alignments with the optimum score 
given a score matrix, a gap start penalty, and a gap extension penalty. 
That is, starting a gap costs a fixed penalty (F) and each residue added 
to the gap incurs a further penalty (E) so that for each gap of length K 
residues the penalty is F + K*E. Gaps at the ends of sequences incur no 
penalty.
.para
This option allows the gap penalties to be set.
.left margin1
@ end of help