staden-lg/help/sip_help


 @-1. TX  0 @General

 @-2. T   0 @Screen control

 @-2. X   0 @Screen

 @-3. TX  0 @Set parameters

 @-4. TX  0 @Comparison

 @0.  TX -1 @SIP

        This is a program for comparing and aligning nucleic  acid  or
 protein  sequences. It can produce optimal alignments using a dynamic
 programming algorithm, and has several ways of producing "dot matrix"
 diagrams.

        The analyses included, with their option numbers,  are  listed
 under the Help option.


        The program is used  on  a  simple  graphics  terminal,  i.e.  a
 keyboard   with  a screen on which points and lines can be drawn. The
 user  works  at  the  terminal  and  produces  plots   for    various
 combinations   of  values  for  the  span  length and minimum scores.
 However large or small a region  the  user elects  to   compare   the
 program  expands  or  contracts  the  diagram so that the plot always
 fills  the  screen.   This  allows  the  user  to  gain  an   overall
 impression   or  to  "home-in" on particular regions and examine them
 in more detail.   Having  found  a  region  that   looks  interesting
 the   user   can   determine   its   coordinates in terms of sequence
 positions by use of a crosshair facility.

        The program has two  statistical  options  to  help  the  user
 choose   score  levels for plotting and to assess the significance of
 any similarity found.  It can  produce  a  cumulative  histogram   of
 observed   scores  for  the current span length and region and it can
 calculate the "double matching probability" of McLachlan. The "double
 matching   probability"  is  the  probability  of  finding particular
 scores  given  two  infinitely  long  sequences  of  the  composition
 of   those   being  compared,  with the current span length and score
 matrix.  By using these options the  user  can  choose  to plot   all
 the   matches  for  which  the  score  exceeds  a  given significance
 level  (such  as  1%),  using  either    empirical    or  theoretical
 probability  values.  Generally  it  is  best  to  begin  at  a   low
 probability level (a stringent cutoff) to avoid an overcrowded diagram.
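
        The empirical route to a plotting level can be sketched in Python
  as below. This is only an illustration of the idea, not the program's
  own code: the function names are invented here, and the input is
  assumed to be a plain list of the span scores observed for the current
  span length and region.

```python
from collections import Counter


def cumulative_histogram(scores):
    """Map each observed score s to the fraction of points scoring >= s."""
    counts = Counter(scores)
    total = len(scores)
    running = 0
    cumulative = {}
    for s in sorted(counts, reverse=True):
        running += counts[s]
        cumulative[s] = running / total
    return cumulative


def cutoff_for_level(scores, level):
    """Lowest observed score whose tail fraction does not exceed `level`.

    Returns None when even the highest observed score is too common.
    """
    cumulative = cumulative_histogram(scores)
    eligible = [s for s, fraction in cumulative.items() if fraction <= level]
    return min(eligible) if eligible else None
```

        With a level of 0.01 the function returns the score that only 1%
  of the observed points reach or exceed, which corresponds to plotting
  at the 1% level using empirical values.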

        If  the  user  finds  that  the  two  sequences  do    contain
 stretches  of  homology  he will often want to align the sequences by
 inserting padding characters at deletion points.  The program  has  a
 selection  of   options   for this purpose:  it contains an alignment
 routine; it can display on the screen the two  sequences,  one  above
 the  other,  with  asterisks   marking  identities;  it  has  inbuilt
 editing functions; and it can save the aligned sequences on disk files.

        The basic principle of dot matrices was  first  described   by
 Gibbs  and  McIntyre and involves producing a diagram that contains a
 representation of all the matches between a pair of sequences.   This
 diagram   is   then   scanned   by   eye  and  the  human  ability to
 recognise patterns used to detect any  similarities   that  might  be
 present.  The diagram consists of a two dimensional plot in which the
 x axis represents one sequence (A)  and   the   y   axis   the  other
 (B).    Every  point (i,j) on the plane x,y is assigned a score which
 corresponds   to   the   level   of   similarity   between   sequence
 characters  A(i) and B(j).  In the simplest use of the method a score
 of 1 could be assigned to every point (i,j) where A(i) = B(j), and  a
 score   of  0  to  every other point.  If a plot of the points in the
 plane was made in which all scores of 1 were marked with a   dot  and
 all   those  of 0 left blank then regions of identity would appear as
 diagonal lines.  With the comparison  displayed  in  this  form   the
 human  eye is very good at detecting regions of homology even if they
 are imperfect.  The effects of mismatches, insertions   or  deletions
 can   be  seen:   matches interrupted by insertions or deletions will
 appear as parallel diagonals, and matches  interrupted  by  the   odd
 mismatching   pair  of  characters  will  appear  as broken collinear
 diagonal lines. This diagram is  a  very  useful  representation  but
 simply   placing a dot for every identity is of limited value for the
 following reasons.

        For nucleic acid sequences around 25% of the plot will contain
 points    and   it  will  often  be  very  difficult  to  distinguish
 significant homologies  from  chance  matches.   For   proteins  many
 significant  alignments of sequences contain almost no identities but
 are formed from chemically and structurally similar amino   acids  so
 that   simply  looking  for  identity would be insufficient.  What is
 required is to first find those points  that  correspond  to   fairly
 strong   local  similarities  and  then  to  use the diagram of these
 points so that the human eye can be used to look  for  larger   scale
 homologies.    The  program  uses a number of different algorithms to
 calculate the score for each point and the  user  defines  a  minimum
 score   so  that  only   those  points  in  the diagram for which the
 score is at least this value will be marked with a dot.
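
        The simplest form of the diagram described above (a score of 1
  where A(i) = B(j) and 0 elsewhere) can be sketched in Python as
  follows. The function names are illustrative only, positions are
  0-based, and the text rendering stands in for the graphics screen.

```python
def identity_dots(seq_a, seq_b):
    """Return the set of (i, j) with seq_a[i] == seq_b[j] (0-based)."""
    return {
        (i, j)
        for i, a in enumerate(seq_a)
        for j, b in enumerate(seq_b)
        if a == b
    }


def render(seq_a, seq_b, dots):
    """Draw the plot as text: x axis is sequence A, y axis is sequence B.

    Rows are emitted top-down, so the last row corresponds to B position 0.
    """
    rows = []
    for j in range(len(seq_b) - 1, -1, -1):
        rows.append(
            "".join("*" if (i, j) in dots else "." for i in range(len(seq_a)))
        )
    return "\n".join(rows)
```

        Comparing a sequence with itself gives an unbroken diagonal; for
  two unrelated nucleic acid sequences roughly a quarter of the cells
  are marked, which is why a minimum score over a span is needed.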

        The first scoring  method  finds  the   longest  uninterrupted
 sections of perfect identity i.e.  those that contain no  mismatches,
 insertions   or   deletions.  Generally  this  method,  termed   "the
 identities algorithm" is of little value, but runs very quickly.

        The  second    method   looks    for    sections    where    a
 proportion   of   the  characters  in the sequence are similar, again
 allowing no insertions or deletions. For  a  thorough  analysis  this
 method, termed "the proportional algorithm", is the best.

        The original method of this type  was   first  described   by
 McLachlan   and involves calculating a score for each position in the
 matrix  by  summing   points   found   when  looking  forwards    and
 backwards   along   a  diagonal line of a given length.  This length,
 called the span, must be an  odd  number  so  that  the  dot  marking
 matches  can  be  precisely placed at its centre.  The algorithm does
 not simply look for  identity   but   uses   a  score   matrix   that
 contains   scores   for   every   possible   pair of characters.  For
 comparing amino acid sequences  we  usually use   the   score  matrix
 shown   below  which  was calculated by adding 10 (to make every term
 >0) to each term of the relatedness odds  matrix  MDM78  of  Dayhoff.
 This  matrix  MDM78  was  calculated  by  looking  at  accepted point
 mutations in 71 families of closely related proteins  and,  of  those
 tested   by  Dayhoff,  was found to be the most powerful score matrix
 for   finding   distant   relationships    between     amino     acid
 sequences.

                            AMINO ACID SCORE MATRIX
                            -----------------------

    C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?
 C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
 S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
 T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
 P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
 A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
 G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
 N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
 D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
 E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
 Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
 B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
 Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
 H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
 R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
 K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
 M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
 I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
 L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
 V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
 F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
 Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
 W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
 - 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 ? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
   10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10

        It is also  possible  to  use  other  matrices,  including  an
 identity  matrix  for  proteins. For nucleic acids we usually use the
 matrix shown below.

          DNA SCORE MATRIX

              A C G T X
            A 1 0 0 0 0
            C 0 1 0 0 0
            G 0 0 1 0 0
            T 0 0 0 1 0
            X 0 0 0 0 0
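
        The span scoring described above can be sketched in Python as
  below, using the DNA score matrix as the scoring function. This is a
  minimal illustration under stated assumptions, not the program's
  implementation: the names are invented, positions are 0-based, and
  spans that would run off either end of a sequence are simply skipped.

```python
def dna_score(a, b):
    """DNA score matrix from the text: 1 for identical A, C, G or T, else 0."""
    return 1 if a == b and a in "ACGT" else 0


def span_scores(seq_a, seq_b, span, score=dna_score):
    """Sum the scores along a diagonal window of odd length `span`.

    Returns {(i, j): total}, where (i, j) is the centre of the window.
    """
    if span % 2 == 0:
        raise ValueError("span must be an odd number")
    half = span // 2
    totals = {}
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            totals[(i, j)] = sum(
                score(seq_a[i + k], seq_b[j + k]) for k in range(-half, half + 1)
            )
    return totals


def dots(seq_a, seq_b, span, minimum, score=dna_score):
    """Centres of all spans whose total is at least `minimum`."""
    totals = span_scores(seq_a, seq_b, span, score)
    return sorted(point for point, total in totals.items() if total >= minimum)
```

        Because the span is odd, each window has a single central
  position at which the dot can be placed.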

        Plotting dots at the centres of spans that  reach  the  cutoff
  leads to a persistence effect that, to some extent, can be mitigated
  by a  variation on the method. If, for example, all the high scoring
  amino  acids  are clustered at the left end of a particular diagonal
  segment, dots will continue to be plotted to their right  until  the
  span  score  drops  below  the  cutoff. Instead of plotting a single
  point for each span that  reaches  the  cutoff  score,  the  variant
  method   plots  points for all the identities that lie in spans that
  reach the cutoff. Obviously  the  persistence  effect  can  be  more
  pronounced  for  long spans and low cutoff scores, but note that the
  variant method will not plot anything if  there  are  no  identities
  present, and so similar regions could be missed!
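
        The variant method can be sketched in Python as follows. The
  function name and the `score` callback are assumptions made for this
  illustration; the sketch only shows the plotting rule, namely that a
  dot is placed on each identity inside any span reaching the cutoff.

```python
def identity_dots_in_matching_spans(seq_a, seq_b, span, minimum, score):
    """Plot identities that lie inside spans whose total reaches `minimum`.

    `score(a, b)` supplies the score matrix lookup; positions are 0-based.
    """
    if span % 2 == 0:
        raise ValueError("span must be an odd number")
    half = span // 2
    offsets = range(-half, half + 1)
    dots = set()
    for i in range(half, len(seq_a) - half):
        for j in range(half, len(seq_b) - half):
            total = sum(score(seq_a[i + k], seq_b[j + k]) for k in offsets)
            if total >= minimum:
                dots.update(
                    (i + k, j + k)
                    for k in offsets
                    if seq_a[i + k] == seq_b[j + k]
                )
    return dots
```

        A span made up entirely of similar but non-identical residues can
  reach the cutoff and still yield an empty set, which is the case the
  warning above refers to.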

        A further variant, useful for  comparing  a  sequence  against
  itself, ignores the main diagonal.

        The third comparison method called "quick scan"  is  really  a
  combination of the first two, and is similar to the FASTP program of
  Lipman and Pearson, but produces a dot matrix diagram. The algorithm
  is  as  follows. The dot matrix positions are found for all words of
  some minimum length (obviously length 1 is most sensitive) that  are
  common  to  both  sequences.  Imagine  a  diagonal line running from
  corner to corner of the diagram, at right angles to the diagonals in
  the  dot matrix. The  scores for the common words (according to  the
  current  score  matrix,  e.g.  MDM78)  are   accumulated    at   the
  appropriate  positions  on  that  imaginary line, hence  producing a
  histogram. The histogram is analysed to find its mean  and  standard
  deviation.  The  diagonals that lie above some cutoff score (defined
  in standard deviation units), are rescanned using  the  proportional
  algorithm,  and  a diagram produced. The method is very fast, and is
  also employed  by the library comparison program.
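
        The first stage of the quick scan (the diagonal histogram and the
  choice of diagonals to rescan) might be sketched in Python as below.
  Several details are assumptions made for the sake of a runnable
  example and are not taken from the program: every diagonal, including
  empty ones, contributes to the mean and standard deviation; the
  population standard deviation is used; and overlapping words each add
  their full score.

```python
from collections import defaultdict
from statistics import mean, pstdev


def diagonal_histogram(seq_a, seq_b, word_len, score):
    """Accumulate, per diagonal d = i - j, the scores of common words."""
    positions = defaultdict(list)
    for j in range(len(seq_b) - word_len + 1):
        positions[seq_b[j:j + word_len]].append(j)
    histogram = {d: 0 for d in range(-(len(seq_b) - 1), len(seq_a))}
    for i in range(len(seq_a) - word_len + 1):
        word = seq_a[i:i + word_len]
        for j in positions.get(word, ()):
            histogram[i - j] += sum(score(c, c) for c in word)
    return histogram


def diagonals_to_rescan(histogram, cutoff_sd):
    """Diagonals scoring more than `cutoff_sd` standard deviations above the mean."""
    values = list(histogram.values())
    mu, sd = mean(values), pstdev(values)
    return sorted(d for d, v in histogram.items() if v > mu + cutoff_sd * sd)
```

        Only the diagonals returned by the second function would then be
  passed to the proportional algorithm.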

        The dynamic programming alignment algorithm contained  in  the
  program  is  based  on that of Miller and Myers (). It guarantees to
  produce alignments with the optimum score given a  score  matrix,  a
  gap  start penalty, and a gap extension penalty. That is, starting a
  gap costs a fixed penalty (IG) and each residue  added  to  the  gap
  incurs  a  further  penalty  (IH)  so  that for each gap of length k
  residues the penalty is IG + k*IH. Gaps at  the  ends  of  sequences
  incur no penalty.
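
        The penalty scheme can be illustrated in Python by scoring an
  alignment that has already been padded. This sketch does not perform
  the alignment itself; the function names are invented here, and "end
  gaps" are taken to mean any padding before the first or after the
  last column in which both sequences have a residue.

```python
def gap_penalty(k, ig, ih):
    """Penalty for one gap of k residues: IG + k*IH (zero if there is no gap)."""
    if k < 0:
        raise ValueError("gap length cannot be negative")
    return 0 if k == 0 else ig + k * ih


def alignment_score(row_a, row_b, score, ig, ih, pad="-"):
    """Score two padded rows; gaps at the ends incur no penalty."""
    if len(row_a) != len(row_b):
        raise ValueError("padded rows must have equal length")
    paired = [c for c in range(len(row_a)) if row_a[c] != pad and row_b[c] != pad]
    if not paired:
        return 0
    total = 0
    run_len, run_row = 0, None  # current internal gap run and the row it is in
    for c in range(paired[0], paired[-1] + 1):
        a, b = row_a[c], row_b[c]
        gap_row = "a" if a == pad else "b" if b == pad else None
        if gap_row != run_row and run_len:
            total -= gap_penalty(run_len, ig, ih)  # close the finished gap
            run_len = 0
        run_row = gap_row
        if gap_row is None:
            total += score(a, b)
        else:
            run_len += 1
    return total
```

        For example, with IG = 3 and IH = 1 an internal gap of two
  residues costs 3 + 2*1 = 5.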

        It is very useful to have  the  dot  matrix  methods  and  the
  alignment  routine  together  in  the same program because it allows
  users to produce a dot matrix diagram to help select  which  regions
  of  the sequence they wish to align. Selection is made by use of the
  crosshair. First the crosshair is positioned at the bottom left hand
  end of the segment to be aligned. The crosshair function is quit and
  immediately selected again, the  crosshair  positioned  at  the  top
  right  of  the  segment,  and  the crosshair function quit. When the
  alignment routine is selected the segment will be aligned.

        The  alignment  can  replace  the  original  segment  of   the
  sequence.   By  repeated  plotting  of  dot  matrices,  followed  by
  alignment, very long sequences can easily be aligned.
 @1. TX 0 @Help

        This option gives online help. The user should  select  option
  numbers and the current documentation will be given.

        The following analyses (preceded by their option numbers)  are
  included:
   ? = Help
   ! = Quit
   3 = read a new sequence
   4 = define active region
   5 = list the sequence
   6 = list a text file
   7 = direct output to disk
   8 = write active sequence to disk
   9 = edit the sequences
  10 = clear graphics screen
  11 = clear text screen
  12 = draw a ruler
  13 = use cross hair
  14 = reposition plots
  15 = label diagram
  16 = display a map
  17 = apply identities algorithm
  18 = apply proportional algorithm
  19 = list matching spans
  20 = set span length
  21 = set proportional score
  22 = set identities score
  23 = calculate expected scores
  24 = calculate observed scores
  25 = show current parameter settings
  26 = quick scan
  27 = draw a /
  28 = align the sequences
  29 = complement the sequences
  30 = switch main diagonal
  31 = switch identities
  32 = change score matrix
 @2. TX 0 @Quit

        This function stops the program.
 @3. TX 1 @Read a new sequence

        This option allows users to  read  in  new  sequences,  browse
  through  annotations,  or  search  sequence  libraries for keywords.
  Sequences can  be  read  from  "personal"  sequence  files  or  from
  sequence  libraries. These are referred to as the sequence "source".
  Personal files can be stored in several formats:  Staden, PIR, EMBL,
  GENBANK  and  GCG.  At LMB we use "Staden" format for sequencing and
  all the libraries  are  stored  in  their  original  formats.  Note,
  however,  that  libraries  such  as EMBL or GenBank that are divided
  into several files (eg GenBank has 13 separate files) are indexed as
  a  whole.  This  means  that  users  do  not need to know which file
  contains an entry, only which library.  When  the  user  selects  to
  read in a sequence the program first asks for the sequence "source".

        If the user selects "personal" the program will  ask  for  the
  format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
  the file. For PIR format the user will also be required to know  the
  entry  name of the sequence as the file can contain several. For the
  other formats only a single entry is  expected.  The  file  will  be
  read,  its  length  and composition will be displayed and the option
  left.

        If the user selects  "library"  as  the  sequence  source  the
  program will display a list of available libraries. The programs are
  capable of  handling  all  current  libraries  but  which  ones  are
  available  will  vary  from  site  to  site.  At LMB we have several
  libraries and also weekly updates of data gathered between releases.
  The  program will ask users to select a library and then give a list
  of options:

   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords

  If "Get a sequence" or "Get annotations" is selected users will be asked
  to  type  the entry name. The option will be left when a sequence is
  selected  or  !  is  typed.  The  composition  and  length  will  be
  displayed.

        The  text  index  contains  all  words  from  feature  tables,
  reference  titles, definition lines, keywords lists and comments, so
  the text index search is most useful. It is also the fastest. Up  to
  5  words  can  be  searched  for  at once. The words should be typed
  separated by spaces, for example
   ? Keywords=P53 mouse murine tumo

  will search for all entries that contain words  starting  with  p53,
  mouse,  murine  and  tumo.  Only the unique entries that contain ALL
  words will be  listed.  Before  listing  the  matching  entries  the
  program  will  show  the number of 'hits' for each word and ring the
  bell.  Escape is possible at this point, or after each screenful  of
  entries.   In  addition  to the entry names the text search displays
  the primary accession number, the  sequence  length  and  up  to  80
  characters of description.  (The search of 'titles' is now redundant
  because the full text index contains all the  title  words  and  the
  search  is  much  faster.  It  will  probably  be  removed  from the
  program.)  All searches are independent of case. Where possible  the
  program will offer default entry names.
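
        The matching rules of the text index search (prefix matching,
  case independence, up to 5 words, and listing only entries that
  contain ALL words) can be sketched in Python as below. The index
  layout, a dictionary from each indexed word to a set of entry names,
  is purely an assumption for illustration and says nothing about how
  the real index files are organised.

```python
def search_text_index(index, query, max_words=5):
    """Return (hits per word, sorted entries containing ALL words).

    `index` maps an indexed word to the set of entry names containing it.
    Each query word matches every indexed word that starts with it.
    """
    words = query.split()
    if len(words) > max_words:
        raise ValueError("at most %d words can be searched for at once" % max_words)
    matches = {}
    for word in words:
        prefix = word.upper()
        entries = set()
        for indexed_word, names in index.items():
            if indexed_word.upper().startswith(prefix):
                entries |= names
        matches[prefix] = entries
    common = set.intersection(*matches.values()) if matches else set()
    hits = {word: len(entries) for word, entries in matches.items()}
    return hits, sorted(common)
```

        The per-word hit counts correspond to the numbers shown before
  the matching entries are listed.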

        Typical dialogue follows.
  Select sequence source
  X  1 Personal file
     2 Sequence library
  ? Selection  (1-2) (1) =
  Select sequence file format
  X  1 Staden
     2 EMBL
     3 GenBank
     4 PIR
     5 GCG
  ? Selection  (1-5) (1) =
  ? Sequence file name=M13MP7.SEQ
   Contig title removed
  Sequence length=  7238
   Sequence composition
            T          C          A          G          -
        2405.      1539.      1765.      1527.         2.
          33.2%      21.3%      24.4%      21.1%       0.0%
    .
    .
    .


   Select sequence source
   X  1 Personal file
      2 Sequence library
   ? Selection  (1-2) (1) =2
   Select a library
   X  1 EMBL 29 nucleotide library Dec 91
      2 SWISSPROT 20 protein library Nov 91
      3 PIR 31 protein library Dec 91
      4 NRL3D 58 From Brookhaven protein library Dec 91
      5 GenBank
   ? Selection  (1-5) (1) =
  Library is in EMBL format with indexes
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =5
   Search for keywords
   ? Keywords=P53 mouse
  P53 hits  68
  MOUSE hits  8180

   MMANT01    X00875         536 Murine gene fragment for cellular tumour antigen
   MMANT02    X00876          83 Murine gene fragment for cellular tumour antigen
   MMANT03    X00877          21 Murine gene fragment for cellular tumour antigen
   MMANT04    X00878         261 Murine gene fragment for cellular tumour antigen
   MMANT05    X00879         184 Murine gene fragment for cellular tumour antigen
   MMANT06    X00880         113 Murine gene fragment for cellular tumour antigen
   MMANT07    X00881         110 Murine gene fragment for cellular tumour antigen
   MMANT08    X00882         137 Murine gene fragment for cellular tumour antigen
   MMANT09    X00883          74 Murine gene fragment for cellular tumour antigen
   MMANT10    X00884         107 Murine gene for cellular tumour antigen p53 (exon
   MMANT11    X00885         562 Murine p53 gene 3' region with exon 11
   MMANTP53   M26862         536 Mouse tumor antigen p53 gene, 5' end.
   MMLYN      M64608        2044 Mouse lyn protein mRNA, complete cds.
   MMP53      X00741        1377 Mouse mRNA for transformation associated protein
   MMP53A     M13872        1285 Mouse p53 mRNA, complete cds, clone pcD53.
   MMP53B     M13873        1241 Mouse p53 mRNA, complete cds, clone p53-m11.
   MMP53C     M13874        1322 Mouse p53 mRNA, complete cds, clone p53-m8.
   MMP53G1    X01235         554 Mouse genomic DNA for 5' region of cellular tumou
   MMP53IN4   X60470         729 M.musculus p53 gene for p53 protein, intron 4
   MMP53P     X01236        2132 Mouse pseudogene for cellular tumour antigen p53
   MMP53R     X01237        1773 Mouse mRNA for cellular tumour antigen p53
   MMRSB2P5   M64597         196 Mouse B2 repeat in the 3' flank of protein 53 (p5
        22 different entries found

   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =4
   Search for keywords
   ? Keywords=alpha
   Searching for alpha
   AAGHA          623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
   AAMALI        3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
   AAMALIA       1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
   AAMALIB       1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
   ACA13GT        371 alouatta caraya alpha-1,3gt gene, 3' flank.
   ADHBADA1       102 duck alpha-d-globin gene, exon 1.
   ADHBADA2      1145 duck alpha-a-globin gene and 5' flank
   ADHBADWP       513 duck (white pekin) alpha ii (minor) globin mrna, complete co
   AEACOXABC     5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
   AGA13GT        371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
   AGAAAGFP       282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
   AGAABL         138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
   AGAFAMYA        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFAMYB        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFAMYC        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFCOXA        98 synthetic alpha-factor/cox iv fusion gene signal peptide.
   AGAGABA       7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
   AGAMYLS        120 synthetic alpha-amylase gene, 5' end.
   AGANPS          95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
  !
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =3
   ? Accession number=v00636
  Entry name LAMBDA
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =2
   Default Entry name=LAMBDA
   ? Entry name=
  ID   LAMBDA     standard; DNA; PHG; 48502 BP.
  XX
  AC   V00636; J02459; M17233; X00906;
  XX
  DT   03-JUL-1991 (Rel. 28, Last updated, Version 3)
  DT   09-JUN-1982 (Rel. 1, Created)
  XX
  DE   Genome of the bacteriophage lambda (Styloviridae).
  XX
  KW   circular; coat protein; DNA binding protein; genome;
  KW   origin of replication.
  XX
  OS   Bacteriophage lambda
  OC   Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
  XX
  RN   [1]
  RP   1-48502
  RA   Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
  RT   "Nucleotide sequence of bacteriophage lambda DNA";
  RL   J. Mol. Biol. 162:729-773(1982).
  XX
  !
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =
   Default Entry name=LAMBDA
   ? Entry name=
  DE   Genome of the bacteriophage lambda (Styloviridae).
   Sequence length  48502
   Sequence composition
             T          C          A          G          -
        11988.     11360.     12336.     12818.         0.
           24.7%      23.4%      25.4%      26.4%       0.0%

 @4. TX 1 @Define active region

        For its analytic functions  the  program  always  works  on  a
  region of the sequence called the active region. When a new sequence
  is read into the program the active region is automatically  set  to
  start  at  the  beginning  of  the sequence and go up to the maximum
  allowed size of active region the  program can handle. The positions
  are  shown  on the screen.  On most machines this will be to the end
  of the sequence.  This option allows the user to define a  different
  region.
 @5. TX 1 @List a sequence

        The sequence can be listed with line lengths from 10 to 120 in
  multiples of 10.  The output looks like:

      87         97        107        117        127        137
       KVKCTGRILE VPVGRGLLGR VVNTLGAPID GKGPLDHDGF SAVEAIAPGV IERQSVDQPV
        **      * ****   ***   * ** * *  **         *    **    *
       DVKDLEHPIE VPVGKATLGR IMNVLGEPVD MKGEIGEEER WAIHRAAPSY EELSNSQELL
      68         78         88         98        108        118
     147        157        167        177        187        197
       QTGYKAVDSM IPIGRGQREL IIGDRQTGKT ALAIDAIINQ RDSGIKCIYV AIGQ
        ** *  * *  *   *       *    ***       * *             *
       ETGIKVIDLM CPFAKGGKVG LFGGAGVGKT VNMMELIRNI AIEHSGYSVF AGVG
     128        138        148        158        168        178
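
        The row of asterisks and the grouping into blocks of ten shown in
  the listing above can be reproduced with the short Python sketch
  below. The function names are illustrative, and the position numbers
  printed above and below the sequences are left out.

```python
def identity_markers(row_a, row_b):
    """Return '*' under each column where the two rows are identical."""
    return "".join("*" if a == b else " " for a, b in zip(row_a, row_b))


def blocks_of_ten(text):
    """Split a row into space-separated blocks of ten characters."""
    return " ".join(text[i:i + 10] for i in range(0, len(text), 10))


def list_pair(row_a, row_b, line_length=60):
    """Return listing lines for two aligned rows, without position numbers."""
    if line_length % 10 or not 10 <= line_length <= 120:
        raise ValueError("line length must be a multiple of 10 from 10 to 120")
    lines = []
    for start in range(0, max(len(row_a), len(row_b)), line_length):
        a = row_a[start:start + line_length]
        b = row_b[start:start + line_length]
        lines.append(blocks_of_ten(a))
        lines.append(blocks_of_ten(identity_markers(a, b)))
        lines.append(blocks_of_ten(b))
    return lines
```

        Applied to the segments VPVGRGLLGR VVNTLGAPID and VPVGKATLGR
  IMNVLGEPVD from the listing, the marker row has asterisks in the same
  columns as the output shown above.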

 @6. TX 1 @List a text file

        Allows the user to have a text file displayed on  the  screen.
  It will appear one page at a time.
 @7. TX 1 @Direct output to disk

        Used to direct output that would normally appear on the screen
  to a file.

        Select redirection of either text or graphics, and supply  the
  name of the file that the output should be written to.

        The results from the next options selected will not appear  on
  the  screen  but  will  be  written  to  the  file. When option 7 is
  selected again the file will be closed and output will again  appear
  on the screen.
 @8. TX 1 @Write active region to disk

        This option allows users to write the current active  sequence
  to a disk file in Staden format.
 @9. TX 1 @Edit the sequences

        This function allows the user to insert  or  delete  parts  of
  either  sequence  to  help  align  them. The inserted characters are
  dashes.
 @10. TX 2 @Clear graphics

        Clears the screen of both text and graphics.
 @11. TX 2 @Clear text

        Clears only text from the screen.
 @12. TX 2 @Draw a ruler

        This option allows the user to draw a ruler or scale along the
  axes  of  the  screen  to help identify the coordinates of points of
  interest. The user can define the position  of  the  first  sequence
  element  to  be  marked (for example if the active region is 1501 to
  8000, the user might wish to mark every 1000th element  starting  at
  either  1501  or  2000  - it depends if the user wishes to treat the
  active region as an independent unit with its own numbering starting
  at  its  left  edge, or as part of the whole sequence). The user can
  also define the separation of the  ticks  on  the  scale  and  their
  height. If required the labelling routine can be used to add numbers
  to the ticks.
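
        The two numbering choices in the example above can be made
  concrete with a small Python sketch. The function name and argument
  names are invented for this illustration.

```python
def tick_positions(region_start, region_end, first_tick, separation):
    """Sequence positions at which ruler ticks fall within the active region."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    if not region_start <= first_tick <= region_end:
        raise ValueError("first tick must lie inside the active region")
    return list(range(first_tick, region_end + 1, separation))
```

        For an active region of 1501 to 8000 marked every 1000 elements,
  starting at 1501 gives ticks at 1501, 2501 and so on up to 7501,
  whereas starting at 2000 gives ticks at 2000, 3000 and so on up to
  8000.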

        To escape type !
 @13. TX 2 @Use cross hair

        This function puts a steerable cross on the screen that can be
  used to find the coordinates of points in the sequence. The user can
  move the cross around using the directional keys; when he  hits  the
  space bar the program will write out the coordinates of the cross in
  sequence units and the option will be exited.

        If instead, the user hits a , the position will  be  displayed
  but the cross will remain on the screen.

        If a letter s is hit the sequences around the cross  hair  are
  displayed  as  a  short  alignment  (as  shown  below) and the cross
  remains on the screen.
          97        107
           VPVGRGLLGR VVNTLGAPID
           ****   ***   * ** * *
           VPVGKATLGR IMNVLGEPVD
          78         88


        If a letter m is hit the sequences around the cross  hair  are
  displayed  in  the  form  of a matrix (as shown below) and the cross
  remains on the screen.

     VPVGKATLGRIMNVLGEPVD
    D...................DD
    I..........I.........I
    P.P...............P..P
    A.....A..............A
    G...G....G......G....G
    L.......L......L.....L
    T......T.............T
    N............N.......N
    VV.V..........V....V.V
    VV.V..........V....V.V
    R.........R..........R
    G...G....G......G....G
    L.......L......L.....L
    L.......L......L.....L
    G...G....G......G....G
    R.........R..........R
    G...G....G......G....G
    VV.V..........V....V.V
    P.P...............P..P
    VV.V..........V....V.V
     VPVGKATLGRIMNVLGEPVD
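
        The layout of the matrix display above can be reproduced with the
  Python sketch below. The layout rules are inferred from the example
  rather than from the program: the horizontal sequence is printed
  above and below, the vertical sequence runs from bottom to top, each
  row carries its own character at both ends, and a cell shows the
  character where the two sequences agree and a dot otherwise.

```python
def matrix_display(horizontal, vertical):
    """Return the lines of a character matrix for two short segments."""
    lines = [" " + horizontal]
    for v in reversed(vertical):  # vertical sequence is drawn bottom-up
        cells = "".join(h if h == v else "." for h in horizontal)
        lines.append(v + cells + v)
    lines.append(" " + horizontal)
    return lines
```

        Calling matrix_display("VPVGKATLGRIMNVLGEPVD",
  "VPVGRGLLGRVVNTLGAPID") and joining the result with newlines gives a
  picture of the same form as the one shown above.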


        The function is also used prior to "align sequences" in  order
  to  delineate  the region to be aligned. The crosshair is positioned
  at the bottom left of the region and the crosshair option is quit.  Then
  the  crosshair  option is selected again, and the crosshair moved to
  the top right of the region to be aligned.
 @14. TX 2 @Reposition plots

        The position of the plots is defined  relative  to  a  user's
  drawing board which has size 1-10,000 in x and 1-10,000 in y.  Plots
  are drawn in a window defined by x0,y0  and  xlength,ylength,  where
  x0,y0  is the position of the bottom left hand corner of the window,
  xlength is the width of the window and ylength is the height of  the
  window.
     --------------------------------------------------------- 10,000
     1                                                       1
     1       --------------------------------------   ^      1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       1                                    1 ylength  1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       --------------------------------------   v      1
     1  x0,y0^                                               1
     1       <---------------xlength-------------->          1
     ---------------------------------------------------------      1
     1                                                   10,000

  All values are in drawing board  units  (i.e.  1-10,000,  1-10,000).
  The  default  window  positions are read from a file "DIAGMARG" when
  the program is started. Users can have their own file  if  required.
  This  option  allows users to change window positions whilst running
  the program.  If the user types only carriage return for  any  value
  it  will  remain  unchanged.  The  cross-hair  can be used to choose
  suitable heights.
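
        How a pair of sequence positions might be placed inside such a
  window can be sketched in Python as below. A simple linear scaling of
  the active regions onto the window is assumed here, in keeping with
  the statement that the plot always fills the available area; the
  function and argument names are hypothetical.

```python
BOARD_MIN, BOARD_MAX = 1, 10000  # drawing board limits in x and y


def to_board(pos_a, pos_b, region_a, region_b, x0, y0, xlength, ylength):
    """Map sequence positions to drawing board units by linear scaling.

    region_a and region_b are (start, end) pairs for the active regions
    plotted along the x and y axes respectively.
    """
    if x0 < BOARD_MIN or y0 < BOARD_MIN:
        raise ValueError("window origin lies outside the drawing board")
    if x0 + xlength > BOARD_MAX or y0 + ylength > BOARD_MAX:
        raise ValueError("window extends beyond the drawing board")
    (a_start, a_end), (b_start, b_end) = region_a, region_b
    if a_end <= a_start or b_end <= b_start:
        raise ValueError("each region must span more than one position")
    x = x0 + (pos_a - a_start) * xlength / (a_end - a_start)
    y = y0 + (pos_b - b_start) * ylength / (b_end - b_start)
    return x, y
```

        With a window at x0,y0 = 1000,1000 and xlength,ylength =
  8000,8000, the start of each region maps to 1000 and the end to 9000,
  whatever the length of the region.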
 @15. TX 2 @Label a diagram

        This routine allows users to  label  any  diagrams  they  have
  produced.  They  are  asked  to type in a label. When the user types
  carriage return to finish typing the label the cross-hair appears on
  the  screen. The user can position it anywhere on the screen. If the
  user types R (for right justify) the label will be  written  on  the
  diagram  with  its right end at the cross-hair position. If the user
  types L (for left justify) the label will be written with  its  left
  end   at   the  cross  hair  position.   The  cross-hair  will  then
  immediately reappear. The user may put the  same  label  on  another
  part of the diagram as before or if he hits the space bar he will be
  asked if he wishes to type in another label.
 @16. TX 2 @Display a map

        NOT AVAILABLE.  This draws a  map  of  any  sequence  features
  selected  by the user.  These features may be protein coding regions
  (CDS), tRNA genes (TRNA), promoter positions (PRM), etc.  Users  may
  define  their  own  feature table key names. The coordinates must be
  stored in a file in the format of an EMBL feature table.
 @17. TX 4 @Apply identities algorithm

        The identities algorithm finds runs of identical characters in
  the  sequence.  Its main value is speed, being hundreds of times faster
  than the proportional algorithm. It is of course not very sensitive,
  and  should  only  be used for a quick scan. The cutoff score is the
  minimum number of consecutive  matching  characters.   All  runs  of
  identical  characters  that are at least as long as the cutoff score
  will produce a dot on the screen.
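
        The search for runs can be sketched  as  follows.  This  is  a
  minimal illustration in Python, not the program's own code: the name
  identity_runs and the zero-based positions are assumptions made  for
  the example.

```python
def identity_runs(seq_h, seq_v, cutoff):
    """Return (h_start, v_start, length) for each maximal run of identical
    characters along a diagonal whose length is at least `cutoff`.
    Positions are zero-based (an assumption of this sketch)."""
    runs = []
    n, m = len(seq_h), len(seq_v)
    # Each diagonal is identified by its offset d = i - j.
    for d in range(-(m - 1), n):
        i, j = max(d, 0), max(-d, 0)
        run = 0
        while i < n and j < m:
            if seq_h[i] == seq_v[j]:
                run += 1
            else:
                if run >= cutoff:
                    runs.append((i - run, j - run, run))
                run = 0
            i += 1
            j += 1
        # A run may extend to the end of the diagonal.
        if run >= cutoff:
            runs.append((i - run, j - run, run))
    return runs


print(identity_runs("ACGTAC", "TTCGTA", 3))
```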

        See also quick scan.

        Typical dialogue follows.
  ? Menu or option number=d17
  ? Identity score (1-20) (2) =3
  Working

   missing graphics

 @18. TX 4 @Apply proportional algorithm

        This  method,  generally   the   most   useful,   was    first
 described   by   McLachlan  and involves calculating a score for each
 position in the  matrix  by  summing   points   found   when  looking
 forwards   and  backwards  along  a  diagonal line of a given length.
 This length, called the span, must be an odd number.   The  algorithm
 does  not  simply look for identity  but  uses  a score  matrix  that
 contains  scores  for  every  possible  pair of characters.  At  each
 point that a threshold score is achieved the program marks the screen
 in one of two ways. It will either place a single dot at the position
 corresponding  to  the centre of the matching span, or it will plot a
 dot  for  each  identical  residue   within   each   matching   span.
 Alternatively,  the  "list  matching  spans"  option  will  list  the
 segments that match.
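
        The span scoring described above can be sketched  as  follows.
 This is an illustrative Python outline only: the function  names, the
 zero-based positions and the handling of the sequence ends (positions
 closer than half a span to an end are skipped) are  assumptions,  and
 the direct summation shown here is slower than a real implementation.

```python
def identity_score(a, b):
    """A stand-in score matrix: 1 for identical characters, else 0."""
    return 1 if a == b else 0


def proportional_matches(seq_h, seq_v, span, threshold, score):
    """Return (i, j, total) for every span centre (i, j) at which the
    summed pair scores along the diagonal reach `threshold`."""
    if span % 2 == 0:
        raise ValueError("span must be an odd number")
    half = span // 2
    hits = []
    for i in range(half, len(seq_h) - half):
        for j in range(half, len(seq_v) - half):
            # Look backwards and forwards along the diagonal from (i, j).
            total = sum(score(seq_h[i + k], seq_v[j + k])
                        for k in range(-half, half + 1))
            if total >= threshold:
                hits.append((i, j, total))
    return hits


print(proportional_matches("ACGTT", "ACGAT", 3, 3, identity_score))
```

 Any function of two characters can be passed as the score, so a table
 such as the amino acid matrix below could be used in the same way.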

        For comparing amino  acid  sequences   we   usually  use   the
 score matrix  shown  below which was calculated by adding 10 (to make
 every term >0) to each term of the relatedness odds matrix  MDM78  of
 Dayhoff.   This  matrix  MDM78  was calculated by looking at accepted
 point mutations in 71 families of closely related proteins  and,   of
 those   tested  by  Dayhoff,  was found to be the most powerful score
 matrix  for  finding  distant  relationships  between   amino    acid
 sequences.

                            AMINO ACID SCORE MATRIX
                            -----------------------

    C  S  T  P  A  G  N  D  E  Q  B  Z  H  R  K  M  I  L  V  F  Y  W  -  X  ?
 C 22 10  8  7  8  7  6  5  5  5  5  5  7  6  5  5  8  4  8  6 10  2 10 10 10 10
 S 10 12 11 11 11 11 11 10 10  9 10 10  9 10 10  8  9  7  9  7  7  8 10 10 10 10
 T  8 11 13 10 11 10 10 10 10  9 10 10  9  9 10  9 10  8 10  7  7  5 10 10 10 10
 P  7 11 10 16 11  9  9  9  9 10  9 10 10 10  9  8  8  7  9  5  5  4 10 10 10 10
 A  8 11 11 11 12 11 10 10 10 10 10 10  9  8  9  9  9  8 10  6  7  4 10 10 10 10
 G  7 11 10  9 11 15 10 11 10  9 10 10  8  7  8  7  7  6  9  5  5  3 10 10 10 10
 N  6 11 10  9 10 10 12 12 11 11 12 11 12 10 11  8  8  7  8  6  8  6 10 10 10 10
 D  5 10 10  9 10 11 12 14 13 12 13 12 11  9 10  7  8  6  8  4  6  3 10 10 10 10
 E  5 10 10  9 10 10 11 13 14 12 12 13 11  9 10  8  8  7  8  5  6  3 10 10 10 10
 Q  5  9  9 10 10  9 11 12 12 14 11 13 13 11 11  9  8  8  8  5  6  5 10 10 10 10
 B  5 10 10  9 10 10 12 13 12 11 13 11 11 10 10  8  8  6  8  5  7  4 10 10 10 10
 Z  5 10 10 10 10 10 11 12 13 13 11 14 12 10 10  8  8  8  8  5  6  4 10 10 10 10
 H  7  9  9 10  9  8 12 11 11 13 11 12 16 12 10  8  8  8  8  8 10  7 10 10 10 10
 R  6 10  9 10  8  7 10  9  9 11 10 10 12 16 13 10  8  7  8  6  6 12 10 10 10 10
 K  5 10 10  9  9  8 11 10 10 11 10 10 10 13 15 10  8  7  8  5  6  7 10 10 10 10
 M  5  8  9  8  9  7  8  7  8  9  8  8  8 10 10 16 12 14 12 10  8  6 10 10 10 10
 I  8  9 10  8  9  7  8  8  8  8  8  8  8  8  8 12 15 12 14 11  9  5 10 10 10 10
 L  4  7  8  7  8  6  7  6  7  8  6  8  8  7  7 14 12 16 12 12  9  8 10 10 10 10
 V  8  9 10  9 10  9  8  8  8  8  8  8  8  8  8 12 14 12 14  9  8  4 10 10 10 10
 F  6  7  7  5  6  5  6  4  5  5  5  5  8  6  5 10 11 12  9 19 17 10 10 10 10 10
 Y 10  7  7  5  7  5  8  6  6  6  7  6 10  6  6  8  9  9  8 17 20 10 10 10 10 10
 W  2  8  5  4  4  3  6  3  3  5  4  4  7 12  7  6  5  8  4 10 10 27 10 10 10 10
 - 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 X 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 ? 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
   10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 One alternative for proteins is  to  use  an  identity  matrix.   For
 comparing nucleic acids we usually use the matrix shown below.

          DNA SCORE MATRIX

              A C G T X
            A 1 0 0 0 0
            C 0 1 0 0 0
            G 0 0 1 0 0
            T 0 0 0 1 0
            X 0 0 0 0 0
 See option 32 for how to change the score matrices.

        When a sequence is compared against itself to look for repeats
 it  is possible to use the proportional algorithm in a mode such that
 the main diagonal is not shown. See option 30.

        Typical dialogue follows.

 ? Menu or option number=d18
 ? Odd span length (1-401) (11) =
 ? Proportional score (1-297) (132) =
 Working

  missing graphics

 @19. TX 4 @List matching spans
  This option applies the proportional  algorithm  using  the  current
  span and cut-off score, but instead of drawing a dot matrix it lists
  all the matching spans. When a sequence is compared  against  itself
  to  look  for repeats it is possible to use this algorithm in a mode
  such that the main diagonal is not listed. See option 30.

        Typical dialogue follows.
  ? Menu or option number=d19
  ? Odd span length (1-401) (11) =
  ? Proportional score (1-297) (132) =148
  List matching spans
  Working
       76
  IEVPVGKATLG
  LEVPVGRGLLG
       95
       77
  EVPVGKATLGR
  EVPVGRGLLGR
       96
       78
  VPVGKATLGRI
  VPVGRGLLGRV
       97
       79
  PVGKATLGRIM
  PVGRGLLGRVV
       98
       85
  LGRIMNVLGEP
  LGRVVNTLGAP
      104
       86
  GRIMNVLGEPV
  GRVVNTLGAPI
      105
       87
  RIMNVLGEPVD
  RVVNTLGAPID
      106

 @20. TX 3 @Set span length

        The  proportional  algorithm  calculates  a  score  for   each
 position  in  the matrix by summing  the points  found  when  looking
 forwards  and  backwards  along  a  diagonal line of a given  length.
 This  length, called the span, should be an odd number  so  that  the
 score  for  any  point  is correctly positioned at the centre of  the
 span.   This  option  allows  the  user to define the span length. It
 should be noted that short spans can produce noisy diagrams, but  are
 less  affected  by  insertions  and  deletions  than  are long spans.
 However, long spans can detect more distant relationships. Long spans
 can  suffer  from a persistence problem by plotting dots when all the
 "signal" is to one side of the span's central position. To help avoid
 this,  the  option  that  plots the position of all matching residues
 within a matching span, can be tried.  This  is  most  useful  if  an
 identity matrix is being used.
 @21. TX 3 @Set proportional score

        The  proportional  algorithm  calculates  a  score  for   each
  position  in  the  matrix  by summing  the scores for the individual
  amino acids found  when  looking forwards  and  backwards  along   a
  diagonal   line  of  a  given  length.   All  points  at  which  the
  proportional score is achieved will produce a dot  on  the  diagram.
  (The same score is used for the 'LIST MATCHING SPANS' option.)

        Before choosing a score the user can apply  the  routine  that
  will  calculate  the expected score, or can calculate a histogram of
  observed scores. It is best to start with a high score to  avoid  an
  overcrowded diagram.
 @22. TX 3 @Set identities score

        The identities algorithm is of limited value as it only  finds
  runs of matching characters, however it has the virtue of being very
  fast.  This option allows the user to set the minimum length of  run
  that will produce a dot on the screen.
 @23. TX 3 @Calculate expected scores

        This function calculates the "double matching probability"  of
  McLachlan.    The   "double     matching     probability"   is   the
  probability  of  finding particular  scores  given  two   infinitely
  long   sequences   of   the  composition  of  those  being compared,
  with the current span length and score matrix.  By using this option
  the   user   can   choose   to plot   all   the  matches  for  which
  the  score  exceeds  a  given significance  level  (such   as   1%).
  Generally it is best to begin with a stringent  level  (a  low
  probability, and so a high score) to avoid an overcrowded diagram.
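
        One way to approximate such a probability is sketched below in
  Python. It is not claimed to be the calculation used by the program:
  it assumes that every position in a span is independent, takes   the
  residue frequencies from the two sequences, and convolves the single
  pair score distribution span times. All names are hypothetical.

```python
from collections import Counter


def pair_score_distribution(seq_a, seq_b, score):
    """Probability of each single-pair score, assuming residues are drawn
    independently with the observed composition of each sequence."""
    freq_a, freq_b = Counter(seq_a), Counter(seq_b)
    dist = {}
    for a, count_a in freq_a.items():
        for b, count_b in freq_b.items():
            p = (count_a / len(seq_a)) * (count_b / len(seq_b))
            s = score(a, b)
            dist[s] = dist.get(s, 0.0) + p
    return dist


def span_score_distribution(pair_dist, span):
    """Distribution of the sum of `span` independent pair scores."""
    dist = {0: 1.0}
    for _ in range(span):
        nxt = {}
        for s1, p1 in dist.items():
            for s2, p2 in pair_dist.items():
                nxt[s1 + s2] = nxt.get(s1 + s2, 0.0) + p1 * p2
        dist = nxt
    return dist


def probability_of_reaching(span_dist, threshold):
    """Probability that a span scores at least `threshold`."""
    return sum(p for s, p in span_dist.items() if s >= threshold)


pair = pair_score_distribution("ACGT", "ACGT", lambda a, b: int(a == b))
spans = span_score_distribution(pair, 3)
print(probability_of_reaching(spans, 3))
```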

        When the calculation of the expected scores is  finished   the
  program offers the user 3 ways of examining the results:
  "Show probability for a score" allows the  user   to   type   in   a
  score   and   the program responds with the probability of achieving
  that level of score.
  "Show score for  a  probability"  allows  the  user  to  type  in  a
  probability  value  and the program types the score that corresponds
  to that level of probability.
  "List scores and probabilities" is  the  command  to  list  out  the
  scores  and  their corresponding  probabilities.   The user is asked
  to supply a further parameter, the "number of steps between scores",
  and the program lists only every nth score, where n is the step size.
  For example, a step size of 5 lists every 5th score.

        Typical dialogue follows.
  ? Menu or option number=d23
  ? Odd span length (1-401) (11) =
  ? Proportional score (1-297) (132) =

  Working
  Average score=   103.18557
  RMS deviation=     7.85276
  X 1 Show probability for a score
    2 Show score for a probability
    3 List scores and probabilities
  ? 0,1,2,3 =

  ? Show probability for score (1-165) (134) =160
  Probability of score    160 is 0.0000000008
  X 1 Show probability for a score
    2 Show score for a probability
    3 List scores and probabilities
  ? 0,1,2,3 =2
  ? Show score for probability (0.0000000001-1.) (0.00001) =0.0000001
  Score for probability 0.0000001000 is   153
    1 Show probability for a score
  X 2 Show score for a probability
    3 List scores and probabilities
  ? 0,1,2,3 =3
  ? Number of steps between scores (1-10) (5) =

       0  0.10000E+01    100  0.67232E+00    200  0.18977E-20
       5  0.10000E+01    105  0.42119E+00    205  0.42561E-22
      10  0.10000E+01    110  0.20671E+00    210  0.87767E-24
      15  0.10000E+01    115  0.78860E-01    215  0.16651E-25
      20  0.10000E+01    120  0.23515E-01    220  0.27300E-27
      25  0.10000E+01    125  0.55406E-02    225  0.00000E+00
      30  0.10000E+01    130  0.10443E-02    230  0.00000E+00
      35  0.10000E+01    135  0.15935E-03    235  0.00000E+00
      40  0.10000E+01    140  0.19906E-04    240  0.00000E+00
      45  0.10000E+01    145  0.20569E-05    245  0.00000E+00
      50  0.10000E+01    150  0.17758E-06    250  0.00000E+00
      55  0.10000E+01    155  0.12938E-07    255  0.00000E+00
      60  0.10000E+01    160  0.80360E-09    260  0.00000E+00
      65  0.10000E+01    165  0.43009E-10    265  0.00000E+00
      70  0.10000E+01    170  0.20049E-11    270  0.00000E+00
      75  0.99997E+00    175  0.82263E-13    275  0.00000E+00
      80  0.99949E+00    180  0.29998E-14    280  0.00000E+00
      85  0.99448E+00    185  0.98050E-16    285  0.00000E+00
      90  0.96543E+00    190  0.28934E-17    290  0.00000E+00
      95  0.86836E+00    195  0.77556E-19    295  0.00000E+00
    1 Show probability for a score
    2 Show score for a probability
  X 3 List scores and probabilities
  ? 0,1,2,3 =!


 @24. TX 3 @Calculate observed scores

        This  option  applies  the  proportional  algorithm   to   the
  currently  active  sequence but instead of producing a dot matrix it
  calculates a histogram  of  observed  scores.   The  speed  of  this
  calculation of course depends on the size of the active regions, but
  when  it is  completed  the  program  offers  the  user  3  ways  of
  examining the results:

        "Show percentage reaching a score" allows the user to type  in
  a score and the program responds with the percentage of points  that
  achieve this value or above.

        "Show score for a percentage" allows the user to  type   in  a
  percentage and the program responds with the corresponding score.
  Only the given percentage of points achieve this score or above.

        "List scores and percentages" is the  command  to   list   out
  the   scores  and  the percentage of points achieving them.

        Typical dialogue follows.
  ? Menu or option number=24
  Working
  Maximum observed score is    152
  X 1 Show percentage reaching a score
    2 Show score for a percentage
    3 List scores and percentages
  ? 0,1,2,3 =

  ? Show percentage for score (1-152) (114) =144
  Percentage of points with score    144 is   0.005486297
  X 1 Show percentage reaching a score
    2 Show score for a percentage
    3 List scores and percentages
  ? 0,1,2,3 =2

  ? Show score for percentage (0.00001-1.) (0.001) =0.01
  Score for percentage   0.010000000 is   143
    1 Show percentage reaching a score
  X 2 Show score for a percentage
    3 List scores and percentages
  ? 0,1,2,3 =

  ? Show score for percentage (0.00001-1.) (0.001) =1.
  Score for percentage   1.000000000 is   124
    1 Show percentage reaching a score
  X 2 Show score for a percentage
    3 List scores and percentages
  ? 0,1,2,3 =3
  ? Number of steps between scores (1-10) (5) =1

     73   236953  0.10000E+03
     74   236951  0.99999E+02
     75   236951  0.99999E+02
     76   236950  0.99998E+02
     77   236945  0.99996E+02
     78   236942  0.99995E+02
     79   236929  0.99989E+02
     80   236900  0.99977E+02

    missing data here

    130      384  0.16206E+00
    131      307  0.12956E+00
    132      239  0.10086E+00
    133      180  0.75964E-01
    134      134  0.56551E-01
    135      103  0.43468E-01
    136       78  0.32918E-01
    137       67  0.28276E-01
    138       46  0.19413E-01
    139       40  0.16881E-01
    140       33  0.13927E-01
    141       29  0.12239E-01
    142       24  0.10129E-01
    143       19  0.80184E-02
    144       13  0.54863E-02
    145       10  0.42202E-02
    146        8  0.33762E-02
    147        7  0.29542E-02
    148        7  0.29542E-02
    149        6  0.25321E-02
    150        5  0.21101E-02
    151        3  0.12661E-02
    152        3  0.12661E-02
    1 Show percentage reaching a score
    2 Show score for a percentage
  X 3 List scores and percentages
  ? 0,1,2,3 =!
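
        The cumulative table listed above can be reproduced  from  any
  collection of span scores with a sketch such as the following.   The
  function names are hypothetical. The lookup rule  (the  lowest score
  reached by no more than the requested percentage of  points)  is  an
  assumption, chosen because it agrees with the dialogue  above, where
  0.01 gives a score of 143.

```python
def observed_score_table(scores):
    """Return rows (score, count, percent), where count is the number of
    points scoring at least `score`, as in the listing above."""
    total = len(scores)
    rows = []
    for s in range(min(scores), max(scores) + 1):
        count = sum(1 for x in scores if x >= s)
        rows.append((s, count, 100.0 * count / total))
    return rows


def score_for_percentage(rows, percent):
    """Lowest score reached by no more than `percent` of the points."""
    for s, _, pct in rows:
        if pct <= percent:
            return s
    return None


rows = observed_score_table([1, 2, 2, 3])
for row in rows:
    print(row)
print(score_for_percentage(rows, 50.0))
```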

 @25. TX 3 @Show current parameter settings

        This function lists the names of the current sequences,  their
  total  lengths,  the start and end points of the active sequence and
  the current values of span and cut-off scores. It also shows if  the
  main  diagonal  will be shown, or if the proportional algorithm will
  mark all identities in matching spans.

        Typical dialogue follows.
  ? Menu or option number=25
  Horizontal sequence
  ALPHA.PRT
  Positions
       1 TO    514
  Vertical sequence
  BETA.PRT
  Positions
       1 TO    461
  Span length=    11
  Scores
  Proportional=   132
  Identities=     3
  Identites off
  Main diagonal shown


 @27. TX 2 @Draw a /

        This option simply draws a diagonal line from the bottom  left
  of  the  diagram  to  the top right. It can be an aid when trying to
  align the sequences.
 @26. TX 4 @Quick scan

        The algorithm is as follows.  The  dot  matrix  positions  are
  found  for  all  words of some minimum length (obviously length 1 is
  most sensitive)  that  are  common  to  both  sequences.  Imagine  a
  diagonal line running from corner to corner of the diagram, at right
  angles to the diagonals in the dot matrix. The scores for the common
  words  (according  to  the  current  score  matrix,  e.g. MDM78) are
  accumulated at the appropriate positions on  that  imaginary   line,
  hence  producing  a histogram. The histogram is analysed to find its
  mean and standard deviation.  The  diagonals  that  lie  above  some
  cutoff  score  (defined  in standard deviation units), are rescanned
  using the proportional algorithm, and a diagram produced. The method
  is  very  fast,  and  is  also  employed   by the library comparison
  program.
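
        The first stage of the method  (finding  common  words,  accu-
  mulating their scores per diagonal and picking the  diagonals  above
  the cutoff) can be sketched as follows. This is an illustration, not
  the program's code: the names are hypothetical, empty diagonals  are
  assumed to count towards the mean and standard deviation,  and   the
  rescanning of the chosen diagonals is left out.

```python
from collections import defaultdict
from statistics import mean, pstdev


def quick_scan_diagonals(seq_h, seq_v, word_len, n_sd, score):
    """Return the diagonal offsets (i - j) whose accumulated word score
    exceeds mean + n_sd * sd, taken over all diagonals."""
    # Index every word of seq_v by its start position.
    words = defaultdict(list)
    for j in range(len(seq_v) - word_len + 1):
        words[seq_v[j:j + word_len]].append(j)
    # One histogram bin per diagonal, including diagonals with no hits.
    hist = {d: 0 for d in range(-(len(seq_v) - 1), len(seq_h))}
    for i in range(len(seq_h) - word_len + 1):
        word = seq_h[i:i + word_len]
        for j in words.get(word, ()):
            # A common word is identical in both sequences, so each
            # character is scored against itself.
            hist[i - j] += sum(score(c, c) for c in word)
    values = list(hist.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return sorted(d for d, v in hist.items() if v > cutoff)


identity = lambda a, b: 1 if a == b else 0
print(quick_scan_diagonals("ACGTGCA", "TTACGTGCATT", 4, 3.0, identity))
```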

        Typical dialogue follows.

  ? Menu or option number=d26
  ? Identity score (1-20) (3) =
  ? Odd span length (1-401) (11) =
  ? Proportional score (1-297) (132) =
  ? Number of sd above mean (0.00-10.00) (5.00) =

   missing graphics


        SIPL the library searching version of SIP

        This program compares a probe sequence against  a  library  of
  sequences  using  the  quick  scan algorithm, sorts the matches into
  descending order of score, and produces optimal  alignments  of  the
  best scores using the Myers and Miller method. It is very rapid.

        Use of lists of entry names

        SIPL has the ability to restrict searches to  subsets  of  the
  libraries.  This  does  not  require  sublibraries to be created but
  instead is achieved by using files containing a list  of  the  entry
  names of sequences. The user may choose to search only those entries
  on the list or, alternatively to search all but those  on  the  list
  (i.e.  in the latter case the list contains the names of those to be
  excluded).  The programs can search libraries that have indexes  and
  those  that  do not.  If a list of names for inclusion is used, then
  the search will be faster if the index  is  present.  In  all  other
  circumstances  the  whole  library will be read. The list must be in
  library order except when it is used  to  include  entries,  and  an
  index  is  available.   The  list  must contain each entry name on a
  separate line, with the name starting in column 1 of the line,  i.e.
  there must be no spaces at the start of the line.  The list of entry
  names can be produced by the keyword searches of nip, pip, sip, etc.
  as  long  as the listings produced have a space character separating
  the entry name from the entry description. This will depend  on  how
  well  the  library reformatting programs work. For example swissprot
  entry names tend to run into the beginning of the descriptions,  but
  other libraries are generally OK.
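
        The list format and the include/exclude choice can be  illust-
  rated with the short Python sketch below. The function names and the
  example entry names are hypothetical, and the sketch does not   read
  real library or index files.

```python
def read_entry_names(lines):
    """Take the entry name from each line: it must start in column 1 and
    is separated from any description by a space."""
    names = []
    for line in lines:
        if not line.strip():
            continue  # ignore empty lines
        if line[0].isspace():
            raise ValueError("entry name must start in column 1: %r" % line)
        names.append(line.split()[0])
    return names


def select_entries(library_names, listed, exclude=False):
    """Keep only the listed entries, or all but those if `exclude`."""
    listed = set(listed)
    return [name for name in library_names if (name in listed) != exclude]


library = ["ENTRY_A", "ENTRY_B", "ENTRY_C"]   # made-up entry names
wanted = read_entry_names(["ENTRY_A first description", "ENTRY_C"])
print(select_entries(library, wanted))
print(select_entries(library, wanted, exclude=True))
```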
 @28. TX 4 @Align sequences

        This  function  will  produce  an  optimal  alignment  of  two
  segments   of   the  sequence.  The  dynamic  programming  alignment
  algorithm is based on that of Miller and Myers (). It guarantees  to
  produce  alignments  with  the optimum score given a score matrix, a
  gap start penalty, and a gap extension penalty. That is, starting  a
  gap  costs  a  fixed  penalty  (F) and each residue added to the gap
  incurs a further penalty (E) so  that  for  each  gap  of  length  K
  residues the penalty is F + K*E. Gaps at the ends of sequences incur
  no penalty.
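
        The gap penalty rule can be illustrated by scoring an existing
  alignment, as in the Python sketch below. It does not perform    the
  alignment itself, and the function names and the use of "-"  as  the
  padding character are assumptions made for the example.

```python
def gap_penalty(length, gap_start, gap_extend):
    """Penalty F + K*E for one internal gap of K residues."""
    return gap_start + length * gap_extend


def alignment_score(row_a, row_b, score, gap_start, gap_extend, pad="-"):
    """Score two equal-length aligned rows. Gaps that touch either end
    of the alignment are free."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    n = len(row_a)
    total = 0
    col = 0
    while col < n:
        a, b = row_a[col], row_b[col]
        if a != pad and b != pad:
            total += score(a, b)
            col += 1
            continue
        # Measure the run of padding in whichever row is gapped here.
        gapped = row_a if a == pad else row_b
        start = col
        while col < n and gapped[col] == pad:
            col += 1
        if start != 0 and col != n:
            total -= gap_penalty(col - start, gap_start, gap_extend)
    return total


identity = lambda a, b: 1 if a == b else 0
print(alignment_score("AC--GT", "ACTTGT", identity, 10, 1))
print(alignment_score("--ACGT", "TTACGT", identity, 10, 1))
```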

        The routine can only handle segments of  sequence  of  maximum
  length  5000  residues. When the sequences are read in the alignment
  segment will be set to the first 5000 residues. A different  segment
  can  be  selected by prefixing the option number by the letter D, in
  which case the cross hair can be used to identify the two ends.  The
  cross  hair will appear.  First position the crosshair at the bottom
  left of the segment and type a character other than s or m  or  ",".
  When  the  crosshair  reappears,  position it at the top right of the
  segment, and type a keyboard character.  The aligned sequences  will
  replace  the  active sequence if the user confirms "keep alignment".
  By alternate use of  the  plotting  and  alignment  routines  it  is
  possible to rapidly produce an alignment of quite long sequences.

        Typical dialogue follows.

  28 = Align sequences
  ? Menu or option number=d28
  Define the region to align using the cross-hair.
  First identify the bottom left position and exit
  the cross-hair routine. Then the top right.

  (Bell rings, type return, cross hair appears)

  ? Penalty for starting a gap (1-100) (10) =
  ? Penalty for each residue in gap (1-100) (10) =

  Aligning region           1 to         461
      with region           1 to         514
           1         11         21         31         41         51
           MA--TGKIVQ VIGA------ VVDVEFPQDA VPRVYDALEV QNG------N ERLVL-----
           *      *    *         **            * *       *        *   *
           MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
           1         11         21         31         41         51
          61         71         81         91        101        111
           EVQQQLGGGI VRTIAMGSSD GLRRGLDVKD LEHPIEVPVG KATLGRIMNV LGEPVDMKGE
                *     *    **     *  *  **       *****    ***  *  ** * * **
           AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
          61         71         81         91        101        111
         121        131        141        151        161        171
           IGEEERWAIH RAAPSYEELS NSQELLETGI KVIDLMCPFA KGGKVGLFGG AGVGKTVNMM
                  *     **   *          **  *  * * *    *      *     ***
           LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
         121        131        141        151        161        171
         181        191        201        211        221        231
           ELIRNIAIEH SGYS-VFAGV GERTREGNDF YHEMTDSNVI DKVSLVYGQM NEPPGNRLRV
             *  *     **         *                          *      *
           DAI--INQRD SGIKCIYVAI GQKASTISNV VRKLEEHGAL ANTIVVVATA SESAALQYLA
         181        191        201        211        221        231
         241        251        261        271        281        291
           ALTGLTMAEK FRDEGRDVLL FVDNIYRYTL AGTEVSALLG RMPSAVGYQP TLAEEMGVLQ
                 * *  *** * * *    *        *    * **  * *                *
           RMPVALMGEY FRDRGEDALI IYDDLSKQAV AYRQISLLLR RPPGREAFPG DVFYLHSRLL
         241        251        261        271        281        291
         301        311        321        331        341        351
           ERITST---- ---------- -KTGSITSVQ AVYVPADDLT DPSPATTFAH LDATVVLSRQ
           **                     **** *         * *      *        *    *
           ERAARVNAEY VEAFTKGEVK GKTGSLTALP IIETQAGDVS AFVPTNVISI TDGQIFLETN
         301        311        321        331        341        351
         361        371        381        391        401        411
           IASLGIYPAV DPLDSTSRQL DPLVVGQEHY DTAR----GV QSILQRYQEL KDIIAILGMD
               ** ***  *  * **      * *             *     *  * **
           LFNAGIRPAV NPGISVSR-- ---VGGAAQT KIMKKLSGGI RTALAQYREL AAFSQFAS--
         361        371        381        391        401        411
         421        431        441        451        461        471
           ELSEEDKLVV ARARKIQRFL SQ----PFFV AE----VFTG SPGKYVSLKD --TIRGFKGI
            *             *    *  *    *  * *      *     * *         *  *
           DLDDATRKQL DHGQKVTELL KQKQYAPMSV AQQSLVLFAA ERG-YLADVE LSKIGSFEAA
         421        431        441        451        461        471
         481        491        501        511        521
           MEG--EYDHL P-EQAFYMVG SIEEAVE--- --------KA KKL*
                  **  *  *     *       *                  *
           LLAYVDRDHA PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
         481        491        501        511        521
  Conservation  22.5%
  Number of padding characters inserted    63 and    10
  ? (y/n) (y) Keep alignment n


 @29. TX 1 @Complement the sequences

        This function allows users to reverse and  complement  nucleic
  acid sequences.
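
        As an illustration, reversing and  complementing  a  DNA  seq-
  uence can be written in Python as below. The function name is  hypo-
  thetical, and leaving characters other than A, C, G and T  unchanged
  is an assumption of the sketch.

```python
# Translation table for the four bases, in upper and lower case.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the order of the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


print(reverse_complement("AACGT"))
```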
 @30. TX 3 @Switch main diagonal

        If a sequence is being compared against  itself  to  look  for
  repeats  it  is  sometimes  convenient  if  the main diagonal is not
  included in the comparison. This function  allows  users  to  set  a
  switch  that  determines whether or not to include the main diagonal
  for all the comparison methods.  If  the  switch  is  set,  and  the
  active regions for both sequences have the same start position, then
  the main diagonal will not be compared.
 @31. TX 3 @Switch identities

        This function allows a switch to be set or unset.  The  switch
  determines  which  of  two  forms  of  plot  will be produced by the
  proportional algorithm. One form of  output  (the  original  method)
  plots  a  dot  at the centre of each span that reaches the threshold
  score; whereas the other  form  will  plot  dots  for  all  matching
  residues that lie within spans that reach the threshold.
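
        The second form of output can be sketched in Python as follows.
  The names, the zero-based positions and the treatment of the sequence
  ends are assumptions made for this illustration.

```python
def identity_dots(seq_h, seq_v, span, threshold, score):
    """Return the positions of all identical residues that lie within
    spans whose summed score reaches `threshold`."""
    half = span // 2
    offsets = range(-half, half + 1)
    dots = set()
    for i in range(half, len(seq_h) - half):
        for j in range(half, len(seq_v) - half):
            total = sum(score(seq_h[i + k], seq_v[j + k]) for k in offsets)
            if total >= threshold:
                # Mark each identity inside the span, not the span centre.
                dots.update((i + k, j + k) for k in offsets
                            if seq_h[i + k] == seq_v[j + k])
    return sorted(dots)


identity = lambda a, b: 1 if a == b else 0
print(identity_dots("ACGTT", "ACGAT", 3, 2, identity))
```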
 @32. TX 3 @Change score matrix

        This option allows users to select their own score matrix  for
  use with the proportional algorithm. The choices are:

   1 = MDM78
   2 = identity
   3 = your own matrix


        MDM78 is the standard matrix that is used for proteins and  an
  identity  matrix is the default matrix for nucleic acids. However an
  identity matrix is also useful for protein  comparisons.  "Your  own
  matrix"  allows  users  to  apply  any  other matrix, as long as the
  matrix file is in the same format as MDM78.  For comparisons of  DNA
  it  might be useful to try one that gave say 3 for exact matches and
  1 for R-R or Y-Y, else=0.
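
        The DNA scoring scheme suggested above can be written  out  as
  a table with the Python sketch below (R = purine, A or G;   Y = pyr-
  imidine, C or T). The function names are hypothetical and  the  out-
  put only imitates the layout of the DNA score matrix shown in   this
  help; it is not claimed to be the matrix file format.

```python
PURINES = {"A", "G"}        # R
PYRIMIDINES = {"C", "T"}    # Y
BASES = PURINES | PYRIMIDINES


def dna_score(a, b):
    """3 for an exact match, 1 for R-R or Y-Y, otherwise 0."""
    if a not in BASES or b not in BASES:
        return 0
    if a == b:
        return 3
    if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
        return 1
    return 0


def build_matrix(alphabet="ACGTX"):
    """Return the scores as a nested dictionary: matrix[a][b]."""
    return {a: {b: dna_score(a, b) for b in alphabet} for a in alphabet}


matrix = build_matrix()
print("  " + " ".join("ACGTX"))
for a in "ACGTX":
    print(a + " " + " ".join(str(matrix[a][b]) for b in "ACGTX"))
```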
 @33. TX 3 @Set number of sd's for Quickscan

        The  quickscan  algorithm  is  as  follows.  The  dot   matrix
  positions  are found for all words of some minimum length (obviously
  length 1 is most sensitive)  that  are  common  to  both  sequences.
  Imagine  a  diagonal  line  running  from  corner  to  corner of the
  diagram, at right angles to the diagonals in the dot  matrix.    The
  scores  for the common words (according to the current score matrix,
  e.g. MDM78) are accumulated at the appropriate  positions   on  that
  imaginary  line,  hence  producing  a  histogram.  The  histogram is
  analysed to find its mean and standard deviation. The diagonals that
  lie  above  some cutoff score (defined in standard deviation units),
  are rescanned  using  the  proportional  algorithm,  and  a  diagram
  produced.

        This option allows the number of sd's to be set.
 @34. TX 3 @Set gap penalties

        The alignment function will produce an  optimal  alignment  of
  two  segments  of  the  sequence.  The dynamic programming alignment
  algorithm is based on that of Miller and Myers (). It guarantees  to
  produce  alignments  with  the optimum score given a score matrix, a
  gap start penalty, and a gap extension penalty. That is, starting  a
  gap  costs  a  fixed  penalty  (F) and each residue added to the gap
  incurs a further penalty (E) so  that  for  each  gap  of  length  K
  residues the penalty is F + K*E. Gaps at the ends of sequences incur
  no penalty.

        This option allows the gap penalties to be set.
 @ end of help