@-1. TX 0 @General

 @-2. T   0 @Screen control

 @-2. X   0 @Screen

 @-3. T   0 @Statistical analysis of content

 @-3. X   0 @Statistics

 @-4. T   0 @Structures and repeats

 @-4. X   0 @Structures

 @-5. TX  0 @Search

 @0.  TX -1 @PIP

        This is a program  for analysing individual protein sequences.
 It  can  read  sequences  stored  in  many  of the most commonly used
 formats, and performs all of the usual simple analyses.  In  addition
 it  has  very  flexible search procedures  and   presents many of its
 results graphically.

        The following analyses (preceded by their option numbers)  are
 included:
  ? = Help
  ! = Quit
  3 = read a new sequence
  4 = define active region
  5 = list the sequence
  6 = list a text file
  7 = direct output to disk
  8 = write active sequence to disk
  9 = edit the sequence
 10 = clear graphics screen
 11 = clear text screen
 12 = draw a ruler
 13 = use cross hair
 14 = reposition plots
 15 = label diagram
 16 = display a map
 17 = search for short sequences
 18 = compare a sequence
 19 = compare a sequence using a score matrix
 20 = search for a sequence using a weight matrix
 21 = calculate amino acid composition
 22 = plot hydrophobicity
 23 = plot charge
 24 = plot Robson prediction
 25 = plot hydrophobic moment
 26 = draw helix wheel
 27 = back translate
 28 = search for patterns of motifs

        Some of these methods produce graphical  results  and  so  the
 program  is  generally  used from a graphics terminal (a vdu on which
 lines and points can be drawn as well as characters).

        For users of VT640s or their equivalents the terminal must be
 set nowrap (type NOWRAP) prior to running the program.

  The position of each plot is defined relative to a user's
  drawing board which has size 1-10,000 in x and 1-10,000 in y.  Plots
  for each option are drawn in a window defined by x0,y0 and
  xlength,ylength, where x0,y0 is the position of the bottom left hand
  corner of the window, xlength is the width of the window and
  ylength is the height of the window.
     --------------------------------------------------------- 10,000
     1                                                       1
     1       --------------------------------------   ^      1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       1                                    1 ylength  1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       --------------------------------------   v      1
     1  x0,y0^                                               1
     1       <---------------xlength-------------->          1
     ---------------------------------------------------------      1
     1                                                   10,000

  All values are in drawing board  units  (i.e.  1-10,000,  1-10,000).
  The  default  window  positions are read from a file "ANALPMRG" when
  the program is started. Users can have their own file if required.

        The program can handle sequences stored in several formats:
  Staden, EMBL, GENBANK, PIR (also known as NBRF) and GCG. These are
  described in the help for 'READ NEW SEQUENCE'.

        The options for the program are accessed from  5  main  menus:
  general, screen control, statistical analysis of content, structure,
  search.  Both menus and options are selected by number.
 @1. TX 0 @Help

        This option gives online help. The user should select option
  numbers and the current documentation will be given. Note that
  option 0 gives an introduction to the program, and that ? will get
  help from anywhere in the program.  The analyses that are included
  are listed, preceded by their option numbers, in the introduction.

 @2. TX 0 @Quit

        This function stops the program.
 @3. TX 1 @Read a new sequence

        This option allows users to  read  in  new  sequences,  browse
  through  annotations,  or  search  sequence  libraries for keywords.
  Sequences can  be  read  from  "personal"  sequence  files  or  from
  sequence  libraries. These are referred to as the sequence "source".
  Personal files can be stored in several formats:  Staden, PIR, EMBL,
  GENBANK  and  GCG.  At LMB we use "Staden" format for sequencing and
  all the libraries  are  stored  in  their  original  formats.  Note,
  however, that libraries such as EMBL or GenBank that are divided
  into several files (e.g. GenBank has 13 separate files) are indexed
  as a whole. This means that users do not need to know which file
  contains an entry, only which library. When the user chooses to
  read in a sequence the program first asks for the sequence "source".

        If the user selects "personal" the program will  ask  for  the
  format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
  the file. For PIR format the user will also be required to know  the
  entry  name of the sequence as the file can contain several. For the
  other formats only a single entry is  expected.  The  file  will  be
  read,  its  length  and composition will be displayed and the option
  left.

        If the user selects  "library"  as  the  sequence  source  the
  program will display a list of available libraries. The programs are
  capable of  handling  all  current  libraries  but  which  ones  are
  available  will  vary  from  site  to  site.  At LMB we have several
  libraries and also weekly updates of data gathered between releases.
  The  program will ask users to select a library and then give a list
  of options:

   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords

  If "Get a sequence" or "Get annotations" is selected users will be
  asked to type the entry name. The option will be left when a
  sequence is selected or ! is typed. The composition and length of
  the selected sequence will be displayed.

        The  text  index  contains  all  words  from  feature  tables,
  reference titles, definition lines, keyword lists and comments, so
  the text index search is most useful. It is also the fastest. Up  to
  5  words  can  be  searched  for  at once. The words should be typed
  separated by spaces, for example
   ? Keywords=P53 mouse murine tumo

  will search for all entries that contain words  starting  with  p53,
  mouse,  murine  and  tumo.  Only the unique entries that contain ALL
  words will be  listed.  Before  listing  the  matching  entries  the
  program  will  show  the number of 'hits' for each word and ring the
  bell.  Escape is possible at this point, or after each screenful of
  entries.   In  addition  to the entry names the text search displays
  the primary accession number, the  sequence  length  and  up  to  80
  characters of description.  (The search of 'titles' is now redundant
  because the full text index contains all the  title  words  and  the
  search  is  much  faster.  It  will  probably  be  removed  from the
  program.)  All searches are independent of case. Where possible  the
  program will offer default entry names.
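
        The search rules described above (words match by prefix, all
  words must be present, case is ignored, at most 5 words) can be
  illustrated with a minimal Python sketch. The index layout, the name
  search_text_index and the tiny sample index are assumptions made for
  illustration only; the format of the program's real index files is
  not described here. The sample index holds only the description
  words of three entries from the dialogue below, so its hit counts
  are not those of a real library.

```python
MAX_WORDS = 5  # limit stated in the help text


def search_text_index(index, keywords):
    """Prefix search over a word index (hypothetical layout).

    index    : dict mapping entry name -> iterable of words for that entry
    keywords : words separated by spaces, as typed at the Keywords prompt
    Returns (hits per word, sorted entry names containing ALL words).
    """
    words = keywords.upper().split()
    if not 1 <= len(words) <= MAX_WORDS:
        raise ValueError("supply between 1 and 5 words")
    per_word = {}
    for word in words:
        # An entry is a hit if any of its words STARTS with the keyword.
        per_word[word] = {
            name
            for name, entry_words in index.items()
            if any(w.upper().startswith(word) for w in entry_words)
        }
    hits = {word: len(names) for word, names in per_word.items()}
    # Only entries that contain every word are listed.
    matching = set.intersection(*per_word.values())
    return hits, sorted(matching)


# Toy index built from three description lines (illustrative only).
SAMPLE_INDEX = {
    "MMP53A": "Mouse p53 mRNA complete cds clone pcD53".split(),
    "MMLYN": "Mouse lyn protein mRNA complete cds".split(),
    "MMANT11": "Murine p53 gene 3' region with exon 11".split(),
}

if __name__ == "__main__":
    print(search_text_index(SAMPLE_INDEX, "P53 mouse"))
```

  Counting the hits for each word before intersecting mirrors the way
  the program reports per-word hits before it lists the entries.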

        Typical dialogue follows.
  Select sequence source
  X  1 Personal file
     2 Sequence library
  ? Selection  (1-2) (1) =
  Select sequence file format
  X  1 Staden
     2 EMBL
     3 GenBank
     4 PIR
     5 GCG
  ? Selection  (1-5) (1) =
  ? Sequence file name=M13MP7.SEQ
   Contig title removed
  Sequence length=  7238
   Sequence composition
            T          C          A          G          -
        2405.      1539.      1765.      1527.         2.
          33.2%      21.3%      24.4%      21.1%       0.0%
    .
    .
    .


   Select sequence source
   X  1 Personal file
      2 Sequence library
   ? Selection  (1-2) (1) =2
   Select a library
   X  1 EMBL 29 nucleotide library Dec 91
      2 SWISSPROT 20 protein library Nov 91
      3 PIR 31 protein library Dec 91
      4 NRL3D 58 From Brookhaven protein library Dec 91
      5 GenBank
   ? Selection  (1-5) (1) =
  Library is in EMBL format with indexes
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =5
   Search for keywords
   ? Keywords=P53 mouse
  P53 hits  68
  MOUSE hits  8180

   MMANT01    X00875         536 Murine gene fragment for cellular tumour antigen
   MMANT02    X00876          83 Murine gene fragment for cellular tumour antigen
   MMANT03    X00877          21 Murine gene fragment for cellular tumour antigen
   MMANT04    X00878         261 Murine gene fragment for cellular tumour antigen
   MMANT05    X00879         184 Murine gene fragment for cellular tumour antigen
   MMANT06    X00880         113 Murine gene fragment for cellular tumour antigen
   MMANT07    X00881         110 Murine gene fragment for cellular tumour antigen
   MMANT08    X00882         137 Murine gene fragment for cellular tumour antigen
   MMANT09    X00883          74 Murine gene fragment for cellular tumour antigen
   MMANT10    X00884         107 Murine gene for cellular tumour antigen p53 (exon
   MMANT11    X00885         562 Murine p53 gene 3' region with exon 11
   MMANTP53   M26862         536 Mouse tumor antigen p53 gene, 5' end.
   MMLYN      M64608        2044 Mouse lyn protein mRNA, complete cds.
   MMP53      X00741        1377 Mouse mRNA for transformation associated protein
   MMP53A     M13872        1285 Mouse p53 mRNA, complete cds, clone pcD53.
   MMP53B     M13873        1241 Mouse p53 mRNA, complete cds, clone p53-m11.
   MMP53C     M13874        1322 Mouse p53 mRNA, complete cds, clone p53-m8.
   MMP53G1    X01235         554 Mouse genomic DNA for 5' region of cellular tumou
   MMP53IN4   X60470         729 M.musculus p53 gene for p53 protein, intron 4
   MMP53P     X01236        2132 Mouse pseudogene for cellular tumour antigen p53
   MMP53R     X01237        1773 Mouse mRNA for cellular tumour antigen p53
   MMRSB2P5   M64597         196 Mouse B2 repeat in the 3' flank of protein 53 (p5
        22 different entries found

   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =4
   Search for keywords
   ? Keywords=alpha
   Searching for alpha
   AAGHA          623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
   AAMALI        3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
   AAMALIA       1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
   AAMALIB       1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
   ACA13GT        371 alouatta caraya alpha-1,3gt gene, 3' flank.
   ADHBADA1       102 duck alpha-d-globin gene, exon 1.
   ADHBADA2      1145 duck alpha-a-globin gene and 5' flank
   ADHBADWP       513 duck (white pekin) alpha ii (minor) globin mrna, complete co
   AEACOXABC     5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
   AGA13GT        371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
   AGAAAGFP       282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
   AGAABL         138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
   AGAFAMYA        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFAMYB        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFAMYC        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFCOXA        98 synthetic alpha-factor/cox iv fusion gene signal peptide.
   AGAGABA       7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
   AGAMYLS        120 synthetic alpha-amylase gene, 5' end.
   AGANPS          95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
  !
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =3
   ? Accession number=v00636
  Entry name LAMBDA
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =2
   Default Entry name=LAMBDA
   ? Entry name=
  ID   LAMBDA     standard; DNA; PHG; 48502 BP.
  XX
  AC   V00636; J02459; M17233; X00906;
  XX
  DT   03-JUL-1991 (Rel. 28, Last updated, Version 3)
  DT   09-JUN-1982 (Rel. 1, Created)
  XX
  DE   Genome of the bacteriophage lambda (Styloviridae).
  XX
  KW   circular; coat protein; DNA binding protein; genome;
  KW   origin of replication.
  XX
  OS   Bacteriophage lambda
  OC   Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
  XX
  RN   [1]
  RP   1-48502
  RA   Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
  RT   "Nucleotide sequence of bacteriophage lambda DNA";
  RL   J. Mol. Biol. 162:729-773(1982).
  XX
  !
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =
   Default Entry name=LAMBDA
   ? Entry name=
  DE   Genome of the bacteriophage lambda (Styloviridae).
   Sequence length  48502
   Sequence composition
             T          C          A          G          -
        11988.     11360.     12336.     12818.         0.
           24.7%      23.4%      25.4%      26.4%       0.0%

 @4. TX 1 @Redefine active region

        For its analytic functions the program always works on a
  region of the sequence called the active region. When a new sequence
  is read into the program the active region is automatically set to
  start at the beginning of the sequence and to extend up to the
  maximum size of active region that the version of the program can
  handle. On most machines this will be to the end of the sequence.
  The positions are shown on the screen. This option allows the user
  to define a different region. Note that for convenience in the
  listing and translation functions the user is given access to
  regions outside the active region.
 @5. TX 1 @List a sequence

        The sequence can be listed with line lengths from 10 to 120 in
  multiples  of  10.  Output  can  be directed to a disk file by first
  selecting disk output. The output looks like:

            10         20         30         40         50         60
    MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY

            70         80         90        100        110        120
    AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP

           130        140        150        160        170        180
    LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI

           190        200        210        220        230        240
    DAIINQRDSG IKCIYVAIGQ KASTISNVVR KLEEHGALAN TIVVVATASE SAALQYLARM

           250        260        270        280        290        300
    PVALMGEYFR DRGEDALIIY DDLSKQAVAY RQISLLLRRP PGREAFPGDV FYLHSRLLER

           310        320        330        340        350        360
    AARVNAEYVE AFTKGEVKGK TGSLTALPII ETQAGDVSAF VPTNVISITD GQIFLETNLF

           370        380        390        400        410        420
    NAGIRPAVNP GISVSRVGGA AQTKIMKKLS GGIRTALAQY RELAAFSQFA SDLDDATRKQ

           430        440        450        460        470        480
    LDHGQKVTEL LKQKQYAPMS VAQQSLVLFA AERGYLADVE LSKIGSFEAA LLAYVDRDHA

           490        500        510        520        530        540
    PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
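
        The layout shown above (blocks of 10 residues, each block
  numbered at its right-hand end) can be reproduced with a short
  Python sketch. The function name list_sequence and the exact column
  widths are assumptions inferred from the example listing, not taken
  from the program itself.

```python
def list_sequence(sequence, line_length=60):
    """Format a sequence in numbered blocks of 10 (illustrative sketch)."""
    # The help text allows line lengths from 10 to 120 in multiples of 10.
    if line_length % 10 or not 10 <= line_length <= 120:
        raise ValueError("line length must be 10..120 in multiples of 10")
    out = []
    for start in range(0, len(sequence), line_length):
        chunk = sequence[start:start + line_length]
        # One label per block position on the line, right-justified
        # so that it ends above the last residue of its block.
        labels = [start + 10 * (i + 1) for i in range(line_length // 10)]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        out.append("    " + " ".join(str(n).rjust(10) for n in labels))
        out.append("    " + " ".join(blocks))
        out.append("")
    return "\n".join(out)


if __name__ == "__main__":
    print(list_sequence("MQLNSTEISELIKQRIAQFNVVSEAHNEGT", line_length=20))
```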

 @6. TX 1 @List a text file

        Allows the user to have a text file displayed on  the  screen.
  It will appear one page at a time.
 @7. TX 1 @Direct output to disk

        Used to direct output that would normally appear on the screen
  to a file.

        Select redirection of either text or graphics, and supply  the
  name of the file that the output should be written to.

        The results from the next options selected will not appear  on
  the  screen  but  will  be  written  to  the  file. When option 7 is
  selected again the file will be closed and output will again  appear
  on the screen.
 @8. TX 1 @Write active region to disk

        The program has the capability of reading  in  EMBL,  GENBANK,
  NBRF,  GCG  and  Staden  formats  and of reversing and complementing
  sequences. This option allows users  to  write  the  current  active
  sequence  to  a  disk  file in Staden format. Hence it allows format
  conversion and crude sequence cutting.
 @9. TX 1 @Edit the sequence

        Used to edit sequences or any other files by giving access to
  the computer's system editor. For editing sequences the input file
  should have already been created using the listing function "list
  the sequence".

        Supply the name of the file to edit. Wait while the system
  editor is made ready (this can take a while on a VAX). Use the
  editor, then exit from it. If a sequence has been edited, and you
  want to process it, affirm that the sequence should be "made
  active". The edited sequence will replace the original sequence.

        This editing method is designed to give  users  access  to  an
  editor with which they are familiar - i.e. the one on their machine,
  and yet to  allow  them  to  edit  a  sequence  which  contains  the
  landmarks  they  need  in  order  to  know where they are. Users can
  create files containing simple listings with numbering, using  "list
  the  sequence",  and  then edit them with their system editor, using
  the numbering to know where they are within the sequence.  When  the
  edits  are  complete  they  exit  from  the  editor  and the program
  "analyses" the edited file to extract only the sequence  characters.
  The permitted set of characters is defined to be:
  ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz.  All permitted
  characters found in the file will become part of the sequence, and
  all others will be removed.
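
        The filtering step can be sketched in a few lines of Python.
  The name extract_sequence is hypothetical, and the sketch assumes
  that the full stop following the character set above is sentence
  punctuation and not itself a permitted character.

```python
# Permitted characters as given in the help text (the trailing full
# stop is assumed to be punctuation, not part of the set).
PERMITTED = set("ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz")


def extract_sequence(edited_text):
    """Keep only permitted characters; numbering, spaces and newlines go."""
    return "".join(ch for ch in edited_text if ch in PERMITTED)


if __name__ == "__main__":
    listing = (
        "            10         20\n"
        "    MQLNSTEISE LIKQRIAQFN\n"
    )
    print(extract_sequence(listing))
```

  Because every permitted letter is kept, any ordinary text left in
  the file, such as a title line, would also end up in the sequence.
  Under this rule the file should hold only the numbered listing.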
 @10. TX 2 @Clear graphics

        Clears the screen of both text and graphics.
 @11. TX 2 @Clear text

        Clears only text from the screen.
 @12. TX 2 @Draw a ruler

        This option allows the user to draw a ruler or scale along the
  x  axis  of the screen to help identify the coordinates of points of
  interest. The user can define the position of the first  amino  acid
  to  be marked (for example if the active region is 1501 to 8000, the
  user might wish to mark every 1000th amino acid starting  at  either
  1501  or  2000  -  it depends if the user wishes to treat the active
  region as an independent unit with its own numbering starting at its
  left  edge,  or  as  part  of the whole sequence). The user can also
  define the separation of the ticks on the scale and their height. If
  required  the  labelling  routine  can be used to add numbers to the
  ticks.
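
        The two numbering choices in the example can be made concrete
  with a small Python sketch. The function name ruler_ticks and its
  argument names are hypothetical; it is assumed that ticks are placed
  at a fixed separation from the first marked position up to the end
  of the active region.

```python
def ruler_ticks(region_start, region_end, first_tick, separation):
    """Positions of ruler ticks within the active region (sketch)."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return [
        pos
        for pos in range(first_tick, region_end + 1, separation)
        if pos >= region_start
    ]


if __name__ == "__main__":
    # Active region 1501 to 8000, as in the example in the text.
    print(ruler_ticks(1501, 8000, 2000, 1000))  # numbered as part of the whole sequence
    print(ruler_ticks(1501, 8000, 1501, 1000))  # numbered from the left edge of the region
```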
 @13. TX 2 @Use cross hair

        This function puts a steerable cross on the screen that can be
  used to find the coordinates of points in the sequence. The user can
  move the cross around using the directional keys; when he  hits  the
  space bar the program will print out the coordinates of the cross in
  sequence units and the option will be exited.

        If instead, you hit a , the position will be displayed but the
  cross will remain on the screen.

        If a letter s is hit the sequence around  the  cross  hair  is
  displayed and the cross remains on the screen.
 @14. TX 2 @Reset margins

        The position of each plot is defined relative to a user's
  drawing board which has size 1-10,000 in x and 1-10,000 in y.
  Plots for each option are drawn in a window defined by x0,y0 and
  xlength,ylength, where x0,y0 is the position of the bottom left hand
  corner of the window, xlength is the width of the window and
  ylength is the height of the window.
     --------------------------------------------------------- 10,000
     1                                                       1
     1       --------------------------------------   ^      1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       1                                    1 ylength  1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       --------------------------------------   v      1
     1  x0,y0^                                               1
     1       <---------------xlength-------------->          1
     ---------------------------------------------------------      1
     1                                                   10,000

  All values are in drawing board  units  (i.e.  1-10,000,  1-10,000).
  The  default  window  positions are read from a file "ANALMARG" when
  the program is started. Users can have their own file  if  required.
  As  all  the plots start at the same position in x and have the same
  width, x0 and xlength are the same for all options. Generally  users
  will  only  want  to change the start level of the window y0 and its
  height ylength. This option allows users to change window  positions
  whilst running the program.  The routine prompts first for the
  number of the option that the user wishes to reposition; then for
  the y start and height; then for the x start and length. Note that
  changes to the x values affect all options. If the user types only
  carriage return for any value it will remain unchanged. The
  cross-hair can be used to choose suitable heights.
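
        The window rules above can be summarised in a minimal Python
  sketch. The Window class, the reposition function and the sample
  window values are all hypothetical, and the layout of the margins
  file is not modelled. A value of None stands for typing only
  carriage return, which leaves that value unchanged.

```python
from dataclasses import dataclass, replace

BOARD_MIN, BOARD_MAX = 1, 10_000  # drawing board units in x and y


@dataclass(frozen=True)
class Window:
    x0: int       # bottom left hand corner, x
    y0: int       # bottom left hand corner, y
    xlength: int  # width of the window
    ylength: int  # height of the window

    def fits_board(self):
        """True if the whole window lies on the drawing board."""
        return (
            self.xlength > 0
            and self.ylength > 0
            and BOARD_MIN <= self.x0
            and BOARD_MIN <= self.y0
            and self.x0 + self.xlength <= BOARD_MAX
            and self.y0 + self.ylength <= BOARD_MAX
        )


def reposition(windows, option, y0=None, ylength=None, x0=None, xlength=None):
    """Return new windows; y changes hit one option, x changes hit all."""
    if option not in windows:
        raise KeyError(option)
    updated = {}
    for opt, win in windows.items():
        # x values are shared by every option.
        new = replace(
            win,
            x0=win.x0 if x0 is None else x0,
            xlength=win.xlength if xlength is None else xlength,
        )
        if opt == option:
            new = replace(
                new,
                y0=new.y0 if y0 is None else y0,
                ylength=new.ylength if ylength is None else ylength,
            )
        updated[opt] = new
    return updated


if __name__ == "__main__":
    # Hypothetical windows for options 22 and 23.
    windows = {22: Window(1000, 1000, 8000, 2000),
               23: Window(1000, 4000, 8000, 2000)}
    print(reposition(windows, 22, y0=6000))
```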
 @15. TX 2 @Label a diagram

        This routine allows users to  label  any  diagrams  they  have
  produced.  They  are  asked  to type in a label. When the user types
  carriage return to finish typing the label the cross-hair appears on
  the  screen. The user can position it anywhere on the screen. If the
  user types R (for right justify) the label will be  written  on  the
  diagram  with  its right end at the cross-hair position. If the user
  types L (for left justify) the label will be written on the  diagram
  with its left end at the cross-hair position.  The cross-hair will
  then immediately reappear. The user may then put the same label on
  another part of the diagram in the same way, or hit the space bar,
  in which case the program asks whether another label is to be typed
  in.
 @16. TX 2 @Display a map

        It is  often  convenient  to  plot  a  map  alongside  graphed
  analysis  in  order  to  indicate features within the sequence. This
  function allows users to draw maps using files arranged in the form
  of EMBL feature tables. Of course EMBL feature tables are usually
  only used for nucleic acid sequence annotation but, as long as the
  features are written in the correct format, they can be employed by
  this routine. The map is composed of a line representing the
  sequence and then further lines denoting the endpoints of each
  feature the user identifies. The user is asked to define the height
  at which the line representing the sequence should be drawn; then
  for the feature height; then for the features to plot.
 @17. TX 1 5 @Short sequence search

        This routine is used to search for exact matches to short
  sequences. It is equivalent to the restriction enzyme search in
  program NIP. It can either list matches or present the results
  graphically.

        Select from searching, screen clearing or file listing. Choose
  a file of strings and the mode of output required.

        The files of short sequences (strings) and their names need to
  be arranged in a particular way. For example
  ACID/D/E//
  BASIC/R/K/H//
  HYDRO/F/L/I/V/Y//
  GLYCO/N-S/N-T//
  +/R/K/H//
  -/D/E//
  defines various groups of amino acids.  Each string or set of
  strings must be preceded by a name, each string must be preceded and
  terminated with a slash (/), and each set of strings must be
  terminated by 2 slashes. These collections of strings and their
  names can be read from disk or entered from the keyboard. Two files
  containing sequences are currently available. One contains named
  groups of amino acids. The other simply contains the names of all
  amino acids and gives a convenient way of producing a plot of the
  positions of all the different amino acids in the sequence.  The
  user can select strings by name from these collections. Results can
  be displayed name by name or all together. Strings entered from the
  keyboard need to be separated by slash characters (/).  For the name
  by name search the output looks like:
    MATCHES=    12
   NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
   ACID                  E                          7       7       1
   ACID                  E                         10       3       1
   ACID                  E                         24      14       1
   ACID                  E                         28       4       1
   ACID                  D                         36       8       1
   ACID                  D                         46      10       2
   ACID                  E                         51       5       2
   ACID                  E                         67      16       2
   ACID                  D                         69       2       2
   ACID                  D                         81      12       2
   ACID                  E                         84       3       2
   ACID                  E                         96      12       3
    MATCHES=    10
   NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
   BASIC                 K                         13      13       1
   BASIC                 R                         15       2       1
   BASIC                 H                         26      11       1
   BASIC                 R                         40      14       1
   BASIC                 H                         42       2       2
   BASIC                 R                         59      17       2
   BASIC                 R                         68       9       2
   BASIC                 K                         87      19       2
   BASIC                 K                         89       2       2
   BASIC                 R                         93       4       2
    MATCHES=     1
   NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
   GLYCO                 NST                        4       4       3

   or when the results are ordered only on position the output looks like:

   NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
   GLYCO                 NST                        4       3
   ACID                  E                          7       3
   ACID                  E                         10       3
   BASIC                 K                         13       3
   BASIC                 R                         15       2
   ACID                  E                         24       9
   BASIC                 H                         26       2
   ACID                  E                         28       2
   ACID                  D                         36       8
   BASIC                 R                         40       4
   BASIC                 H                         42       2
   ACID                  D                         46       4
   ACID                  E                         51       5
   BASIC                 R                         59       8
  Graphical output marks the position of each string by a short
  vertical line and gives its name at the left end of the line. If the
  top of the screen is reached the program gives the user the
  opportunity to take a hard copy and then will clear the screen and
  restart plotting results at the original start position.  Note that
  any character in the string that is not a recognisable protein
  symbol will be treated as a wild card character and will match with
  all characters in the searched sequence.
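
        The string file format and the wild card rule can be
  illustrated with a minimal Python sketch. The function names and
  the set PROTEIN_SYMBOLS are assumptions for illustration; in
  particular the exact set of symbols the program treats as
  recognisable is not stated here. The sketch reports positions only
  and does not attempt the fragment length columns.

```python
# Assumed set of recognisable protein symbols (an assumption, not
# taken from the program); anything else in a string is a wild card.
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWXYZ")


def parse_string_file(text):
    """Parse lines such as ACID/D/E// into {name: [strings]}."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if not line.endswith("//"):
            raise ValueError(f"set not terminated by //: {line!r}")
        # Drop the closing // then split name and strings on single slashes.
        name, *strings = line[:-2].split("/")
        groups[name] = strings
    return groups


def matches_at(sequence, start, string):
    """True if string matches sequence at 0-based offset start."""
    window = sequence[start:start + len(string)]
    if len(window) < len(string):
        return False
    return all(s not in PROTEIN_SYMBOLS or s == w
               for s, w in zip(string, window))


def find_matches(sequence, strings):
    """Return (1-based position, matched residues) in position order."""
    sequence = sequence.upper()
    hits = []
    for start in range(len(sequence)):
        for string in strings:
            if matches_at(sequence, start, string.upper()):
                hits.append((start + 1, sequence[start:start + len(string)]))
    return hits


if __name__ == "__main__":
    groups = parse_string_file("ACID/D/E//\nGLYCO/N-S/N-T//\n")
    seq = "MQLNSTEISELIKQRIAQFN"
    for name, strings in groups.items():
        for position, residues in find_matches(seq, strings):
            print(name, residues, position)
```

  In the string N-T the character - is not a protein symbol, so it
  matches any residue. This is why the GLYCO group finds NST.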

        Typical dialogue follows.

  Menus and their numbers are
  m0 = This menu
  m1 = General
  m2 = Screen control
  m3 = Statistical analysis of content
  m4 = Structure
  m5 = Search
   ? = Help
   ! = Quit
  ? Menu or option number=17
   Search for short sequences
  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =2
    1 All acids
  X 2 Named groups
    3 Personal file
    4 Keyboard
  ? 0,1,2,3,4 =

  ACID/D/E//
  BASIC/R/K/H//
  HYDRO/F/L/I/V/Y//
  GLYCO/N-S/N-T//
  +/R/K/H//
  -/D/E//
  DIBASIC/RR/KK/RK/KR//
  TURN/N/D/G/P/S//
  BLOCK/A/Q/E/I/L/M/F/W/V//
  INDIF/R/C/H/K/T/Y//
  End of file


  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =

    1 All acids
  X 2 Named groups
    3 Personal file
    4 Keyboard
  ? 0,1,2,3,4 =

  ? (y/n) (y) All names n
  ? Name=acid
  ? Name=basic
  ? Name=glyco
  ? Name=

  ? (y/n) (y) Show results name by name
  ? (y/n) (y) List matches

   searching
   matches=    59
  NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
  ACID                  E                          7       7       1
  ACID                  E                         10       3       1
  ACID                  E                         24      14       1
  ACID                  E                         28       4       1
  ACID                  D                         36       8       1
  ACID                  D                         46      10       2
  ACID                  E                         51       5       2
  ACID                  E                         67      16       2
  ACID                  D                         69       2       2
  ACID                  D                         81      12       2
  ACID                  E                         84       3       2
  ACID                  E                         96      12       3
  ACID                  D                        116      20       3
   matches=    61
  NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
  BASIC                 K                         13      13       1
  BASIC                 R                         15       2       1
  BASIC                 H                         26      11       1
  BASIC                 R                         40      14       1
  BASIC                 H                         42       2       2
  BASIC                 R                         59      17       2
   ...etc
   matches=     2
  NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
  GLYCO                 NST                        4       4       3
  GLYCO                 NQT                      487     483      28
                                                          28     483


  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =

    1 All acids
  X 2 Named groups
    3 Personal file
    4 Keyboard
  ? 0,1,2,3,4 =

  ? (y/n) (y) Selected names

  ? Name=basic
  ? Name=glyco
  ? Name=

  ? (y/n) (y) Show results name by name n
  ? (y/n) (y) List matches

   searching
  NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
  GLYCO                 NST                        4       3
  BASIC                 K                         13       9
  BASIC                 R                         15       2
  BASIC                 H                         26      11
  BASIC                 R                         40      14
  BASIC                 H                         42       2
  BASIC                 R                         59      17
  BASIC                 R                         68       9
  BASIC                 K                         87      19
   ...etc
  BASIC                 R                        477      14
  BASIC                 H                        479       2
  GLYCO                 NQT                      487       8
  BASIC                 K                        499      12
  BASIC                 K                        501       2
  BASIC                 K                        508       7
                                                           7

  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =
    1 All acids
  X 2 Named groups
    3 Personal file
    4 Keyboard
  ? 0,1,2,3,4 =4
  Define search strings by typing a string name
  followed by the string(s)
  ? Name=MARY
  ? String(s)=AL/VI
  ? Name=
  ? (y/n) (y) All names
  ? (y/n) (y) Show results name by name
  ? (y/n) (y) List matches

   searching
   matches=    12
  NAME                  SEQUENCE            POSITION  FRAGMENT LENGTHS
  MARY                  VI                        38      38      10
  MARY                  AL                        63      25      13
  MARY                  VI                       136      73      16
  MARY                  AL                       177      41      19
  MARY                  AL                       217      40      25
  MARY                  AL                       233      16      37
  MARY                  AL                       243      10      40
  MARY                  AL                       256      13      41
  MARY                  AL                       326      70      45
  MARY                  VI                       345      19      51
  MARY                  AL                       396      51      70
  MARY                  AL                       470      74      73


 @18. TX 1 5 @Compare a sequence

        This  routine  slides  a  short  sequence  along  the  current
  sequence  and finds all positions at which a given percentage of the
  amino acids match.  Output is in both graphical and listed forms.

        If  users call for dialogue when the routine is selected  they
  will  be  given  the  choice  of  keyboard or file input. Define the
  string, and the percentage match. Matches will be  plotted  out  and
  then  the  user  can  select  to  have them listed. Then the routine
  cycles around.

        The routine slides the search string along the   sequence  and
  marks  the positions at which a minimum percentage score is reached.
  The graphical output draws a vertical line at  the  match  position;
  the  height  of the line represents the percentage score, so that if
  the line reaches the top of the box the score is 100%.
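
        The sliding comparison described above can be sketched in a few
  lines of Python. This is only an illustration, not the program's own
  code: the function name percent_matches and the  test  sequence  are
  hypothetical, and the way the program rounds the percentage  is  not
  specified here, so a plain "greater than or equal" test is assumed.

```python
def percent_matches(sequence, string, min_percent):
    """Return (position, matches, percent) for each window reaching min_percent.

    Positions are 1-based, as in the program's listings.
    """
    sequence = sequence.upper()
    string = string.upper()
    n = len(string)
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        # Count identical residues at corresponding positions.
        matches = sum(1 for a, b in zip(window, string) if a == b)
        percent = 100.0 * matches / n
        if percent >= min_percent:
            hits.append((start + 1, matches, percent))
    return hits


# Hypothetical sequence; the search string is taken from the dialogue below.
hits = percent_matches("GAIAKALAG", "aaa", 60.0)
for position, matches, percent in hits:
    print(position, matches, round(percent, 1))
```

        The percent value is what would set the height of each  plotted
  line, with 100 corresponding to the top of the box.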

        Typical dialogue follows.

  ? Menu or option number=18
   Find percentage matches
  ? (y/n) (y) Keep picture

  ? String=aaa
  ? Percent match (1.00-100.00) (70.00) =

   missing graphics

  Total scoring positions above 70.000 percent =  19
  Scores          2      2      2      2      2      2      2      2      2      2
  Positions      61    131    177    217    226    231    232    267    300    301

  ? Number to list (0-19) (0) =3

          61
           AIA
           * *
           aaa
           1

         131
           AIA
           * *
           aaa
           1

         177
           ALA
           * *
           aaa
           1
  ? (y/n) (y) Keep picture n

  Default String=aaa
  ? String=!

 @19. TX 1 5 @Compare a sequence using a score matrix

        This  routine  slides  a  short  sequence  along  the  current
  sequence  and  finds  all  positions  at  which  a  given  level  of
  similarity (a cutoff score) is reached. The score is defined by  use
  of  a  score  matrix (MDM78). Output is in both graphical and listed
  forms.

        If  users call for dialogue when the routine is selected  they
  will  be  given  the  choice  of  keyboard or file input. Define the
  string and the cutoff score. Matches will be plotted  out  and  then
  the  user  can  select  to have them listed. Then the routine cycles
  around.

        The routine slides the search string along the   sequence  and
  marks  the  positions  at  which  the  cutoff score is achieved. The
  graphical output draws a vertical line at the  match  position;  the
  height  of  the  line  represents  the   score,  so that if the line
  reaches the top of the box the score is the maximum possible.
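
        A minimal sketch of scoring with a matrix follows.  The  scores
  in TOY_SCORES are invented for a three letter alphabet and  are  NOT
  the MDM78 values; the function names are hypothetical.  It  is  also
  an assumption that the minimum and maximum reported by  the  program
  are the lowest and highest totals the string could reach.

```python
# Hypothetical symmetric scores for a three-letter alphabet; NOT MDM78.
TOY_SCORES = {
    ("A", "A"): 4, ("S", "S"): 4, ("W", "W"): 9,
    ("A", "S"): 2, ("A", "W"): -3, ("S", "W"): -2,
}


def pair_score(a, b):
    """Look up a score regardless of the order of the two residues."""
    if (a, b) in TOY_SCORES:
        return TOY_SCORES[(a, b)]
    return TOY_SCORES[(b, a)]


def score_range(string, alphabet="ASW"):
    """Lowest and highest total score the string could obtain."""
    lowest = sum(min(pair_score(c, x) for x in alphabet) for c in string)
    highest = sum(max(pair_score(c, x) for x in alphabet) for c in string)
    return lowest, highest


def matrix_matches(sequence, string, cutoff):
    """Return (1-based position, score) for windows scoring at least cutoff."""
    n = len(string)
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        score = sum(pair_score(a, b) for a, b in zip(window, string))
        if score >= cutoff:
            hits.append((start + 1, score))
    return hits


low, high = score_range("AAA")
hits = matrix_matches("SAAWA", "AAA", 8)
print(low, high, hits)
```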

        Typical dialogue follows.

  Menus and their numbers are
  m0 = This menu
  m1 = General
  m2 = Screen control
  m3 = Statistical analysis of content
  m4 = Structure
  m5 = Search
   ? = Help
   ! = Quit
  ? Menu or option number=19
   Find matches using a score matrix
  ? (y/n) (y) Keep picture

  ? String=aaa
  Minimum score=    12 Maximum score=    36
  ? Score (12-36) (36) =

   missing graphics

  For score    24 the number of matches=   507
  scores         35     35     35     34     34     34     34     34     34     34
  positions     226    231    379    112    133    202    227    267    378    380

  ? Number to list (0-507) (0) =3

         226
           ATA
           * *
           aaa
           1

         231
           SAA
             **
           aaa
           1

         379
           GAA
            **
           aaa
           1
  ? (y/n) (y) Keep picture n

  Default String=aaa
  ? String=!
 @20. TX 5 @Search for a motif using a weight matrix

        This function performs  searches  for  short  sequence  motifs
  using  an  appropriate  weight matrix. In addition it can be used to
  create or modify weight matrices. In order to perform a  search  the
  only  input  required  is the name of the file containing the weight
  matrix.  The results can be presented  graphically  or  listed.  The
  graphical presentation will draw a line at the position of any matches
  found; the height of the line is proportional to the score.

        For a search, select "use weight matrix", supply the  name  of
  the  file  containing  the  weight matrix, and choose between having
  results plotted  or  listed.  If  dialogue  is  requested  when  the
  function is selected users can alter the cutoff score employed.

        To create a weight matrix several steps are involved.  A  file
  containing an alignment of known motifs is required. (This file must
  be created before the current option is selected. The format  is  as
  follows:  each  sequence is written on a separate line with at least
  one space at the beginning; each sequence is terminated by  a  space
  character,  and  can  be  followed  by a name. The sequences must be
  aligned.) Supply the name of the  file  of  aligned  sequences.  The
  program  reads  and  displays the sequences. Choose between "summing
  logs of weights" or summing weights (i.e. whether to multiply or add
  weights).  If  logs  are used all scores will be negative. Choose if
  all positions in the set of aligned sequences should be used or if a
  mask should be applied. If so selected, define a mask as a string of
  symbols, in which symbol - means ignore and any other  symbol  means
  use. E.g. xx-x--abc means use all positions except 3,5 and 6.
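
        The mask convention can be illustrated with a short sketch.  The
  function names are hypothetical; only the rule "symbol - means ignore
  and any other symbol means use" is taken from the text.

```python
def mask_positions(mask):
    """Return the 1-based positions that the mask marks as 'use'."""
    return [i for i, symbol in enumerate(mask, start=1) if symbol != "-"]


def apply_mask(sequence, mask):
    """Keep only the residues at unmasked positions."""
    if len(sequence) != len(mask):
        raise ValueError("mask and sequence must have the same length")
    return "".join(res for res, symbol in zip(sequence, mask) if symbol != "-")


used = mask_positions("xx-x--abc")
kept = apply_mask("QESVADKMGMGQSGVGALFN", "--xxxxxxxxxxxx------")
print(used, kept)
```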

        The program will calculate weights as the frequencies of  each
  amino  acid  at  each  unmasked  position  in  the  set  of  aligned
  sequences. These weights are then applied  to  the  set  of  aligned
  sequences  to  give  a  range  of  "observed"  scores.  The mean and
  standard deviation of these scores are displayed. The user is  asked
  to  supply  several  values  to  be  used  when the weight matrix is
  applied to other sequences: a cutoff score  (by  default,  the  mean
  minus  3  standard  deviations),  a  top score for scaling graphical
  results (by default, the mean plus 3  standard  deviations),  and  a
  position  to  identify  (this  means that if a particular amino acid
  within the motif is used as a "landmark",  such  as  the  G  of  the
  helix-turn-helix  motif, then its position will be marked in plots).
  All these values are stored along with the  weight  matrix.  Finally
  supply the name of a file to contain the weight matrix.
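
        The default cutoff and top score can be checked with  a  short
  calculation. The sketch below uses the 21 observed scores listed  in
  the typical dialogue further on; the function name is  hypothetical.
  The population standard deviation is used because it appears to agree
  with the values the program prints, but that is an inference from the
  listing rather than a documented fact.

```python
import statistics

# Observed scores copied from the "Applying weights to input sequences" listing.
OBSERVED_SCORES = [
    -57.143, -55.087, -58.079, -54.986, -55.181, -55.874, -56.692,
    -57.722, -55.363, -55.769, -56.786, -55.833, -56.279, -53.125,
    -55.833, -58.651, -56.749, -56.986, -60.618, -58.988, -58.002,
]


def default_limits(scores):
    """Mean, standard deviation and the mean -/+ 3 s.d. limits."""
    mean = statistics.mean(scores)
    sd = statistics.pstdev(scores)  # population s.d. (assumed)
    return {"mean": mean, "sd": sd,
            "cutoff": mean - 3 * sd, "top": mean + 3 * sd}


limits = default_limits(OBSERVED_SCORES)
for name, value in limits.items():
    print(f"{name:7s} {value:9.3f}")
```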

        Weight matrices can be  "rescaled"  using  a  set  of  aligned
  sequences  in much the same way as a matrix is created.  The purpose
  is to redefine the cutoff scores, and rescaling does not  alter  any
  other values in the weight matrix file.

        The methods have changed considerably but were first  outlined
  in Staden, R. Nucl. Acids Res. 12, 505-519 (1984), and  Staden,  R.
  Genetic engineering: principles and methods vol 7,  Edited  by  J.K.
  Setlow and A. Hollaender, Plenum publishing corp., 1985.

        The methods have always had to deal with the problem of zeroes
  in  the  matrices.  The  current  versions  employ "Laplace's Law of
  Succession" in which 1 is added to each term.
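
        The effect of adding 1 to each term can be seen  in  the  small
  sketch below. It is an illustration only: the normalisation  (counts
  plus 1, divided by the number of sequences plus 20) is an assumption
  and will not reproduce the program's actual scores, and the function
  names and the tiny alignment are hypothetical.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def make_weights(aligned):
    """One dict of weights per column, with 1 added to every count."""
    weights = []
    total = len(aligned) + len(AMINO_ACIDS)  # assumed normalisation
    for column in range(len(aligned[0])):
        counts = Counter(seq[column] for seq in aligned)
        weights.append({aa: (counts[aa] + 1) / total for aa in AMINO_ACIDS})
    return weights


def log_score(weights, segment):
    """Sum the logs of the weights (equivalent to multiplying the weights)."""
    return sum(math.log(w[aa]) for w, aa in zip(weights, segment))


weights = make_weights(["AC", "AC", "AD"])
seen = log_score(weights, "AC")
unseen = log_score(weights, "WW")  # residues never observed in the alignment
print(round(seen, 3), round(unseen, 3))
```

        Because no weight is ever zero, the log of every term is finite,
  and because every weight is below 1 all log scores are negative.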

        It is now possible to  apply  a  mask  to  a  set  of  aligned
  sequences  in  order  to  give  weight  to  selected positions only.
  Sequences have superimposed functions: some parts may be of  general
  structural  importance  and  give  rise to an overall framework, and
  other parts give specificity and hence are not common; we  may  want
  to use a set of aligned sequences to define a motif, but want to use
  only the framework positions.  Alternatively we may want to pick out
  only  those  parts  of  a  set  of  aligned  sequences  that  give a
  particular property, and to ignore other similarities that  are  due
  to  some  other  property and which could obscure the pattern we are
  interested in. The ability to define a mask allows certain positions
  to  be  used  in  the  motif and others to be ignored, and yet still
  permits the use of a set of aligned sequences to calculate weights.

        Typical dialogue is shown below.
  ? Menu or option number=20
  X 1 Use weight matrix
    2 Make weight matrix
    3 Rescale weight matrix
  ? 0,1,2,3 =2
  ? Name of aligned sequences file=[rs.motifs]hth.seq
       1 QESVADKMGMGQSGVGALFN LAMBDA.REP
       2 QTKTAKDLGVYQSAINKAIH LAMBDA.CRO
       3 QAALGKMVGVSNVAISQWQR P22.REP
       4 QRAVAKALGISDAAVSQWKE P22.CRO
       5 QAELAQKVGTTQQSIEQLEN 434.REP
       6 QTELATKAGVKQQSIQLIEA 434.CRO
       7 RQEIGQIVGCSRETVGRILK CAP
       8 RGDIGNYLGLTVETISRLLG Fnr
       9 LYDVAEYAGVSYQTVSRVVN LAC.R
      10 IKDVARLAGVSVATVSRVIN GAL.R
      11 TEKTAEAVGVDKSQISRWKR LAMBDA.CII
      12 QRKVADALGINESQISRWKG P22.CI
      13 KEEVAKKCGITPLQVRVWCN MAT.ALPHA
      14 TRKLAQKLGVEQPTLYWHVK TETR.TN10
      15 TRRLAERLGVQQPALYWHFK TETR.pSC1
      16 QRELKNELGAGIATITRGSN TRP.REP
      17 RQQLAIIFGIGVSTLYRYFP H-INVERSN
      18 ATEIAHQLSIARSTVYKILE TN3.RESOL
      19 ASHISKTMNIARSTVYKVIN GD.RESOLV
      20 IASVAQHVCLSPSRLSHLFR ARA.C
      21 RAEIAQRLGFRSPNAAEEHL LEX.R
  Length of motif    20
  ? (y/n) (y) Sum logs of weights
  ? (y/n) (y) Use all motif positions n
  x means use, - means ignore
  e.g. xx-x---x-x means use positions 1,2,4,8,10
  ? Mask=--xxxxxxxxxxxx------
   Applying weights to input sequences
     1      -57.143 QESVADKMGMGQSGVGALFN
     2      -55.087 QTKTAKDLGVYQSAINKAIH
     3      -58.079 QAALGKMVGVSNVAISQWQR
     4      -54.986 QRAVAKALGISDAAVSQWKE
     5      -55.181 QAELAQKVGTTQQSIEQLEN
     6      -55.874 QTELATKAGVKQQSIQLIEA
     7      -56.692 RQEIGQIVGCSRETVGRILK
     8      -57.722 RGDIGNYLGLTVETISRLLG
     9      -55.363 LYDVAEYAGVSYQTVSRVVN
    10      -55.769 IKDVARLAGVSVATVSRVIN
    11      -56.786 TEKTAEAVGVDKSQISRWKR
    12      -55.833 QRKVADALGINESQISRWKG
    13      -56.279 KEEVAKKCGITPLQVRVWCN
    14      -53.125 TRKLAQKLGVEQPTLYWHVK
    15      -55.833 TRRLAERLGVQQPALYWHFK
    16      -58.651 QRELKNELGAGIATITRGSN
    17      -56.749 RQQLAIIFGIGVSTLYRYFP
    18      -56.986 ATEIAHQLSIARSTVYKILE
    19      -60.618 ASHISKTMNIARSTVYKVIN
    20      -58.988 IASVAQHVCLSPSRLSHLFR
    21      -58.002 RAEIAQRLGFRSPNAAEEHL
  Top score     -53.125  Bottom score     -60.618
  Mean     -56.655  Standard deviation       1.617
  Mean minus 3.sd     -61.505  Mean plus 3.sd     -51.804
  ? Cutoff score (-999.00-9999.00) (-61.51) =
  ? Top score for scaling plots (-61.51-999.00) (-51.80) =
  ? Position to identify (0-20) (1) =9
  ? Title=hth
  ? Name for new weight matrix file=1.wts

  Menus and their numbers are
  m0 = This menu
  m1 = General
  m2 = Screen control
  m3 = Statistical analysis of content
  m4 = Structure
  m5 = Search
   ? = Help
   ! = Quit
  ? Menu or option number=20
  X 1 Use weight matrix
    2 Make weight matrix
    3 Rescale weight matrix
  ? 0,1,2,3 =

  ? Motif weight matrix file=1.wts
   hth
  ? (y/n) (y) Use frequencies as weights
  ? (y/n) (y) Plot results n
        5    -61.46 STEISELIKQRIAQFNVVSE
       13    -58.93 KQRIAQFNVVSEAHNEGTIV
       21    -60.42 VVSEAHNEGTIVSVSDGVIR
       57    -59.39 GNRYAIALNLERDSVGAVVM
       59    -61.47 RYAIALNLERDSVGAVVMGP
       79    -59.90 YADLAEGMKVKCTGRILEVP
       88    -61.41 VKCTGRILEVPVGRGLLGRV
      104    -60.38 LGRVVNTLGAPIDGKGPLDH
      127    -60.13 SAVEAIAPGVIERQSVDQPV
      129    -59.91 VEAIAPGVIERQSVDQPVQT
      133    -60.79 APGVIERQSVDQPVQTGYKA
      139    -61.12 RQSVDQPVQTGYKAVDSMIP
      175    -58.90 KTALAIDAIINQRDSGIKCI
      191    -60.95 IKCIYVAIGQKASTISNVVR
      195    -60.94 YVAIGQKASTISNVVRKLEE
      215    -60.66 HGALANTIVVVATASESAAL
      254    -60.56 EDALIIYDDLSKQAVAYRQI
      260    -60.08 YDDLSKQAVAYRQISLLLRR
      297    -61.00 LLERAARVNAEYVEAFTKGE
      314    -61.29 KGEVKGKTGSLTALPIIETQ
      330    -60.49 IETQAGDVSAFVPTNVISIT
      363    -57.63 GIRPAVNPGISVSRVGGAAQ
      365    -61.48 RPAVNPGISVSRVGGAAQTK
      371    -61.02 GISVSRVGGAAQTKIMKKLS
      382    -57.90 QTKIMKKLSGGIRTALAQYR
      394    -60.07 RTALAQYRELAAFSQFASDL
      424    -59.95 GQKVTELLKQKQYAPMSVAQ
      430    -58.89 LLKQKQYAPMSVAQQSLVLF
      432    -61.14 KQKQYAPMSVAQQSLVLFAA
      438    -58.58 PMSVAQQSLVLFAAERGYLA
      458    -61.06 DVELSKIGSFEAALLAYVDR
      466    -61.00 SFEAALLAYVDRDHAPLMQE
      483    -60.48 MQEINQTGGYNDEIEGKLKG
      494    -60.61 DEIEGKLKGILDSFKATQSW

  Menus and their numbers are
  m0 = This menu
  m1 = General
  m2 = Screen control
  m3 = Statistical analysis of content
  m4 = Structure
  m5 = Search
   ? = Help
   ! = Quit
  ? Menu or option number=d20
  X 1 Use weight matrix
    2 Make weight matrix
    3 Rescale weight matrix
  ? 0,1,2,3 =

  ? Motif weight matrix file=1.wts
   hth
  ? (y/n) (y) Use frequencies as weights
  ? Cutoff score (-9999.00-9999.00) (-61.51) =-56.
  ? (y/n) (y) Plot results n


 @21. TX 3 @Calculate amino acid composition

        This  function  calculates  the  amino  acid  composition  and
  molecular weight for the active region.
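
        The calculation can be sketched as follows.  The residue masses
  are standard average values and are an assumption, as is the addition
  of one water molecule (18.02); the program's own table may differ  in
  the last digit. The counts are those shown in the listing below,  and
  the function names are hypothetical.

```python
from collections import Counter

# Average residue masses (amino acid minus water); assumed, not the program's table.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02


def composition(sequence):
    """Map each symbol to (count, percentage of the whole sequence)."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    return {aa: (n, 100.0 * n / total) for aa, n in counts.items()}


def molecular_weight(sequence):
    """Sum of residue masses plus one water; unknown symbols count as 0."""
    return sum(RESIDUE_MASS.get(aa, 0.0) for aa in sequence.upper()) + WATER


# Counts copied from the listing; "?" stands for the one unknown symbol.
COUNTS = {"C": 3, "S": 32, "T": 23, "P": 18, "A": 57, "G": 47, "N": 16,
          "D": 28, "E": 31, "Q": 28, "H": 7, "R": 30, "K": 24, "M": 11,
          "I": 40, "L": 47, "V": 41, "F": 14, "Y": 15, "W": 1, "?": 1}
sequence = "".join(aa * n for aa, n in COUNTS.items())
comp = composition(sequence)
weight = molecular_weight(sequence)
print(round(comp["A"][1], 1), round(weight))
```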
  ? Menu or option number=21
   Sequence composition

  A   C     S     T     P     A     G     N     D     E     Q     B     Z     H
  N   3.   32.   23.   18.   57.   47.   16.   28.   31.   28.    0.    0.    7.
  %   0.6   6.2   4.5   3.5  11.1   9.1   3.1   5.4   6.0   5.4   0.0   0.0   1.4
  W  309. 2786. 2325. 1748. 4051. 2682. 1826. 3222. 4003. 3588.    0.    0.  960.

  A   R     K     M     I     L     V     F     Y     W     -     X     ?
  N  30.   24.   11.   40.   47.   41.   14.   15.    1.    0.    0.    0.    1.
  %   5.8   4.7   2.1   7.8   9.1   8.0   2.7   2.9   0.2   0.0   0.0   0.0   0.2
  W 4686. 3076. 1443. 4527. 5319. 4065. 2060. 2448.  186.    0.    0.    0.    0.
  Total molecular weight=    55328.

 @22. TX 3 4 @Plot hydrophobicity

        This routine plots the hydrophobicity of each section  of  the
  sequence  using  the hydrophobicity values of Kyte and Doolittle (J.
  Mol. Biol. 157, 105-132 (1982)).  A window  of  size  span  is  slid
  along the sequence and a sum calculated for each position.
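
        A minimal sketch of the window sum is given below.  The  values
  are those of the Kyte and Doolittle scale; assigning each sum to the
  centre of its window is an assumption, and the function name is
  hypothetical.

```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_sums(sequence, span=11, interval=3):
    """Return (1-based centre position, window sum) every `interval` residues."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    half = span // 2
    results = []
    for centre in range(half, len(sequence) - half, interval):
        window = sequence[centre - half:centre + half + 1]
        results.append((centre + 1, sum(KYTE_DOOLITTLE[aa] for aa in window)))
    return results


# Hypothetical short sequence, span 3, every position plotted.
profile = hydropathy_sums("AAAAA", span=3, interval=1)
print(profile)
```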

        If dialogue is requested select  a  span  length  and  a  plot
  interval.

        The diagrams are  on the same scale as Fig. 6 of the Kyte  and
  Doolittle  paper  and  values of + and - 50 could be assigned to the
  top and bottom of the diagram with corresponding values  in  between
  (-40,-20,0,20,40 are shown in the paper).
  ? Menu or option number=d22
   Plot hydrophobicity
  ? odd span length (1-101) (11) =
  ? plot interval (1-101) (3) =

   missing graphics
 @23. TX 3 4 @Plot charge

        This routine plots the charge of each section of the sequence.
  A  window  of  size  span  is  slid  along  the  sequence  and a sum
  calculated for each position. Amino acids are assigned charges of 1,
  -1 or 0.
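
        The same kind of window sum can be sketched  for  charge.  Which
  amino acids receive which charge is not stated here, so the  table
  below (K and R as 1, D and E as -1, everything else 0) is an assumption,
  as are the function name and the centring of each window.

```python
# Assumed charge assignments; all other residues count as 0.
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def charge_sums(sequence, span=11, interval=3):
    """Return (1-based centre position, summed charge) for each window."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    half = span // 2
    results = []
    for centre in range(half, len(sequence) - half, interval):
        window = sequence[centre - half:centre + half + 1]
        results.append((centre + 1, sum(CHARGE.get(aa, 0) for aa in window)))
    return results


# Hypothetical short sequence, span 3, every position plotted.
charges = charge_sums("KKDEA", span=3, interval=1)
print(charges)
```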

        If dialogue is requested select  a  span  length  and  a  plot
  interval.

        Typical dialogue follows.

  ? Menu or option number=d23
   Plot charge
  ? odd span length (1-101) (11) =
  ? plot interval (1-101) (3) =

   missing graphics

 @24. TX 4 @Plot Robson prediction

        This routine uses the method of Garnier J, Osguthorpe D J, and
  Robson  B.  (1978)  J.  Mol.  Biol. 120, 97-120 to predict secondary
  structures. The method divides protein secondary structures  into  4
  classes:  helix,  extended  (usually referred to as sheet), turn and
  coil. The routine calculates the likelihood that each segment of the
  sequence  lies  in  each  of  these  classes.  Results are presented
  graphically or listed.

        If dialogue is requested  choose  between  plotted  or  listed
  output.

        Each residue has a certain probability of being found in  each
  of  the  4  classes.  This probability depends both on its own amino
  acid type and also the 8 amino acids found to either side along  the
  protein  chain.  Four  tables of weights, each 20 by 17 elements are
  used to calculate the likelihood that each residue along  the  chain
  falls  into  one  of  the four classes of structure. The most likely
  structure at each point is the one with the highest score.  The four
  values are plotted in strips labelled H, E, T and C.  Below, a strip
  labelled  D  for  decision  is  divided  into  four   levels,   each
  corresponding  to  one  of  the  four  structure types. Their top to
  bottom order is the same as that for the strips above, i.e. C, T, E,
  and H. For each residue  the  program  determines  which of the four
  likelihoods is highest. It places a single dot at the  mid-point  of
  the  corresponding  strip,  and also at the appropriate level in the
  strip labelled D.
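
        The decision step amounts to taking the largest of four values.
  The sketch below illustrates this with three rows from  the  listing
  further on; it is an assumption that the four listed columns are  in
  the order H, E, T, C, and the function name is hypothetical.

```python
CLASSES = ("H", "E", "T", "C")  # assumed column order of the listing


def decide(scores):
    """Return the class with the highest score (first one wins a tie)."""
    best = max(range(len(CLASSES)), key=lambda i: scores[i])
    return CLASSES[best]


rows = {
    9: (217, -7, -39, 15),
    19: (-34, 86, 47, 74),
    23: (96, 38, 26, 155),
}
decisions = {position: decide(scores) for position, scores in rows.items()}
print(decisions)
```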

        It should be noted that the method, when tested  by  Kabsch  W
  and  Sander C, (1983) FEBS Lett. 155, 179-182, although one of  the
  better ones, was correct for only about 56% of residues.

        Typical dialogue follows.
  ? Menu or option number=d24
   Plot Robson secondary structure predictions
  ? (y/n) (y) Plot results n

       9 S   217   -7  -39   15
      10 E   226    5  -27  -39
      11 L   233   -7  -26  -15
      12 I   229  -23    9    4
      13 K   214   -8   10   -8
      14 Q   178   42   19    5
      15 R   131   54   16    3
      16 I    86   42  -31  -23
      17 A    55   52  -30  -15
      18 Q    15   67    4   25
      19 F   -34   86   47   74
      20 N   -41   74   17  106
      21 V   -16  118   -5  100
      22 V    64   88    5  115
      23 S    96   38   26  155
      24 E   133  -25   13   96
      25 A   118  -98   25  100
      26 H   110 -150   37   86
      27 N    57 -201   37   66
      28 E    51 -140   11   -4
      29 G     2  -77   37    9
      30 T     2   28   28    7
      31 I   -11  117  -21   22
      32 V   -23  178  -55    5
      33 S   -54  193  -14   35
      34 V   -46  123    5   30
      35 S   -54   53   51   80
      36 D   -60    1   86   55
      37 G   -66    8   57   49
      38 V    -1  128  -30   -5
      39 I    11  212  -56  -33
      40 R    16  204  -44  -57
   ...etc

 @26. TX 4 @Draw a helix wheel

        A helical representation of segments of the sequence is shown.
  The  display  includes  a  schematic  of the helix showing the links
  between residues, with each vertex numbered according  to  position;
  the   sequence   element   at  each  vertex;  a  symbol  denoting  a
  classification as hydrophobic(.), positively charged(+),  negatively
  charged(-),  or  otherwise(  ).  The  residue  number  of  the first
  sequence element in the current window  is  displayed  at  the  top-
  left-hand  corner  of  the  diagram. Also at the top-left corner the
  sequence in the current window is listed. Below this  is  the  total
  hydrophobicity  and  hydrophobic  moment  for  the window calculated
  according to Eisenberg et al J. Mol. Biol. 179, 125-142 (1984).

        If dialogue is requested the user is asked for  the  angle  to
  define  the  turn  between residues as seen looking along the helix,
  and a window length. The window length can be up to 60, with default
  18,  and  the angle has a default of 100 degrees. Note that 18 x 100
  is 5 turns. When the option is selected the  first  segment  in  the
  current  active region is displayed then the bell rings. If the user
  types only return, the display will click  on  by  one  residue;  if
  another number is typed, say N, then the display will click forwards
  (or backwards if N is negative) by N residues. If the wheel runs off
  either end of the sequence the option will be exited.
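
        The geometry of the wheel can be sketched as follows. With  the
  default angle of 100 degrees and window of 18, the 18 residues occupy
  18 distinct positions around the circle and span exactly  5  turns.
  The function names and the unit radius are hypothetical.

```python
import math


def wheel_angles(window=18, angle=100.0):
    """Angle in degrees (0-360) of each residue as seen along the helix."""
    return [(i * angle) % 360.0 for i in range(window)]


def wheel_points(window=18, angle=100.0, radius=1.0):
    """Cartesian coordinates of each vertex on a circle of the given radius."""
    return [(radius * math.cos(math.radians(a)),
             radius * math.sin(math.radians(a)))
            for a in wheel_angles(window, angle)]


angles = wheel_angles()
turns = 18 * 100.0 / 360.0
print(angles[:5], turns)
```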

        Typical dialogue follows.
  ? Menu or option number=d26
  ? Angle (1-130) (100) =
  ? Window (1-60) (18) =

   missing graphics

 @25. TX 3 4 @Plot hydrophobic moment

        This  routine  plots  hydrophobic  moment  and  hydrophobicity
  according  to Eisenberg et al J. Mol. Biol. 179, 125-142 (1984). The
  mean hydrophobicity per residue in the window is plotted on a  scale
  -1.0  to 1.5, and the mean hydrophobic moment per residue on a scale
  0.0 to 1.5. The hydrophobicity is shown in the top  frame  with  the
  hydrophobic  moment  below.   The plot is arranged so that the value
  shown at position x represents the  mean  value  for  residues  from
  x-window+1 to x, where window is the window length.
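
        A sketch of the calculation for one window, and of the way  the
  values are assigned to the last residue of each window,  is  given
  below. The moment is taken as the length of the vector sum  of  the
  hydrophobicities, each rotated by the chosen angle per residue. The
  caller supplies the per-residue hydrophobicity values (the Eisenberg
  scale is not reproduced here), and the function names are
  hypothetical.

```python
import math


def window_stats(values, angle_degrees=100.0):
    """Mean hydrophobicity and mean hydrophobic moment for one window."""
    delta = math.radians(angle_degrees)
    n = len(values)
    sin_sum = sum(h * math.sin(i * delta) for i, h in enumerate(values))
    cos_sum = sum(h * math.cos(i * delta) for i, h in enumerate(values))
    return sum(values) / n, math.hypot(sin_sum, cos_sum) / n


def moment_profile(hydrophobicities, window=18, angle_degrees=100.0):
    """Return (x, <H>, <M>) where x is the LAST residue (1-based) of the window."""
    out = []
    for end in range(window, len(hydrophobicities) + 1):
        mean_h, mean_m = window_stats(hydrophobicities[end - window:end],
                                      angle_degrees)
        out.append((end, mean_h, mean_m))
    return out


# A uniform window has no preferred side, so its moment vanishes.
uniform = window_stats([1.0] * 18)
profile = moment_profile([0.5] * 20)
print(uniform, [x for x, _, _ in profile])
```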

        If dialogue is requested the user can select a window  length,
  and the  angle used for the hydrophobic moment calculation.

        Note that according  to  Eisenberg  et  al,  in  transmembrane
  proteins   an  "initiator"  is  required.  This  is  either  a  very
  hydrophobic  single  helix  with  <H>  >=0.68,   or   a   moderately
  hydrophobic  pair  of helices whose <H> sum to >= 1.1. Other helices
  are then accepted as transmembrane if their <H> >= 0.42.

        The following rules are claimed: if <H> < 0.51 and points  lie
  below  the  line  <M>  = -0.392 + 0.603x <H> they are "globular", if
  they lie above this line they are "surface". If <H> > 0.51 and  they
  lie  below  the  line <M> = 0.6 - 0.342x<H> they are "monomeric", if
  above "multimeric".

        Typical dialogue follows.

  ? Menu or option number=d25
  ? Angle (1-130) (100) =
  ? Window (1-60) (18) =
  ? Plot interval (1-101) (3) =

   missing graphics


 @27. TX 1 @Back translate to DNA

        This routine back translates protein sequences into DNA  using
  the  standard  genetic  code. The level of redundancy can be plotted
  and the backtranslation saved to a file.

        The translation can use either the IUB symbols shown below, or
  a  set  of codon preferences. If a set of codon preferences is  used
  it  must  conform  to  the format of codon tables produced  by  the
  nucleotide  analysis  program, and the back translation will contain
  the favoured codons. If there is no favoured codon the  IUB  symbols
  will  be  employed. The window length for plotting the redundancy is
  in codons.

        The program will plot the redundancy along  the  sequence  and
  hence can be used to find the best sequences to use as primers. Note
  that the program plots the inverse, and so the higher the  plot  the
  LESS  redundant the sequence. For primers look for peaks rather than
  troughs.

        The DNA sequence can be saved to a file and analysed using the
  nucleotide  analysis  program.   Depending  on the application it is
  often useful to produce two back translations, one using a table  of
  codon preferences and one using the IUB symbols. This is because the
  restriction enzyme search program can distinguish  between  definite
  and  possible  cuts  in  the  sequence.  "Definite matches" are ones
  in which the specification of the recognition sequence  corresponds
  exactly  to  that of the back translation.  "Possible matches"  are
  ones that depend on the particular codons  chosen  for  each  amino
  acid.  These  are  sites  at  which  recognition sequences could be
  engineered to produce a cut in the DNA without changing  the  amino
  acid,  but  which  are  not  necessarily  found  in  the   original
  sequence.


              NC-IUB SYMBOLS

        A,C,G,T
        R        (A,G)        'puRine'
        Y        (T,C)        'pYrimidine'
        W        (A,T)        'Weak'
        S        (C,G)        'Strong'
        M        (A,C)        'aMino'
        K        (G,T)        'Keto'
        H        (A,T,C)      'not G'
        B        (G,C,T)      'not A'
        V        (G,A,C)      'not T'
        D        (G,A,T)      'not C'
        N        (G,A,C,T)    'aNy'
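
   Back translation with the symbols above can be sketched as follows.
   The codon table is the one implied by the output in  the  typical
   dialogue below; writing --- for an unknown symbol is inferred from
   the end of that output. The redundancy measure (the number of DNA
   triplets each degenerate codon stands for, and its inverse) is an
   assumption, since the exact quantity plotted is not specified here.
   All function names are hypothetical.

```python
IUB_CODON = {
    "A": "GCN", "R": "MGN", "N": "AAY", "D": "GAY", "C": "TGY",
    "Q": "CAR", "E": "GAR", "G": "GGN", "H": "CAY", "I": "ATH",
    "L": "YTN", "K": "AAR", "M": "ATG", "F": "TTY", "P": "CCN",
    "S": "WSN", "T": "ACN", "W": "TGG", "Y": "TAY", "V": "GTN",
}

# Number of bases each symbol stands for.
DEGENERACY = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "W": 2, "S": 2, "M": 2, "K": 2,
    "H": 3, "B": 3, "V": 3, "D": 3, "N": 4,
}


def back_translate(protein):
    """Replace each amino acid by its degenerate codon (--- if unknown)."""
    return "".join(IUB_CODON.get(aa, "---") for aa in protein.upper())


def codon_redundancy(codon):
    """Number of triplets matched by a degenerate codon."""
    product = 1
    for symbol in codon:
        product *= DEGENERACY.get(symbol, 1)
    return product


def inverse_redundancy(dna):
    """One value per codon; higher means LESS redundant."""
    codons = [dna[i:i + 3] for i in range(0, len(dna), 3)]
    return [1.0 / codon_redundancy(c) for c in codons]


dna = back_translate("MQLNSTEISELIKQRIAQFN")
print(dna)
print(inverse_redundancy(dna)[:4])
```

   Note that the symbols can stand for more triplets than  the  genetic
   code allows for an amino acid: YTN and WSN match 8 and 16  triplets,
   whereas leucine and serine each have 6 codons.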

   Typical dialogue follows.

  ? Menu or option number=d27
   Back translate
  ? (y/n) (y) No codon preference
  ? (y/n) (y) Plot redundancy n
  ? (y/n) (y) Save DNA to disk
  ? File name for DNA sequence=tt:
  ATGCARYTNAAYWSNACNGARATHWSNGARYTNATHAARCARMGNATHGCNCARTTYAAY
  GTNGTNWSNGARGCNCAYAAYGARGGNACNATHGTNWSNGTNWSNGAYGGNGTNATHMGN
  ATHCAYGGNYTNGCNGAYTGYATGCARGGNGARATGATHWSNYTNCCNGGNAAYMGNTAY
  GCNATHGCNYTNAAYYTNGARMGNGAYWSNGTNGGNGCNGTNGTNATGGGNCCNTAYGCN
  GAYYTNGCNGARGGNATGAARGTNAARTGYACNGGNMGNATHYTNGARGTNCCNGTNGGN
  MGNGGNYTNYTNGGNMGNGTNGTNAAYACNYTNGGNGCNCCNATHGAYGGNAARGGNCCN
  YTNGAYCAYGAYGGNTTYWSNGCNGTNGARGCNATHGCNCCNGGNGTNATHGARMGNCAR
  WSNGTNGAYCARCCNGTNCARACNGGNTAYAARGCNGTNGAYWSNATGATHCCNATHGGN
  MGNGGNCARMGNGARYTNATHATHGGNGAYMGNCARACNGGNAARACNGCNYTNGCNATH
  GAYGCNATHATHAAYCARMGNGAYWSNGGNATHAARTGYATHTAYGTNGCNATHGGNCAR
  AARGCNWSNACNATHWSNAAYGTNGTNMGNAARYTNGARGARCAYGGNGCNYTNGCNAAY
  ACNATHGTNGTNGTNGCNACNGCNWSNGARWSNGCNGCNYTNCARTAYYTNGCNMGNATG
  CCNGTNGCNYTNATGGGNGARTAYTTYMGNGAYMGNGGNGARGAYGCNYTNATHATHTAY
  GAYGAYYTNWSNAARCARGCNGTNGCNTAYMGNCARATHWSNYTNYTNYTNMGNMGNCCN
  CCNGGNMGNGARGCNTTYCCNGGNGAYGTNTTYTAYYTNCAYWSNMGNYTNYTNGARMGN
  GCNGCNMGNGTNAAYGCNGARTAYGTNGARGCNTTYACNAARGGNGARGTNAARGGNAAR
  ACNGGNWSNYTNACNGCNYTNCCNATHATHGARACNCARGCNGGNGAYGTNWSNGCNTTY
  GTNCCNACNAAYGTNATHWSNATHACNGAYGGNCARATHTTYYTNGARACNAAYYTNTTY
  AAYGCNGGNATHMGNCCNGCNGTNAAYCCNGGNATHWSNGTNWSNMGNGTNGGNGGNGCN
  GCNCARACNAARATHATGAARAARYTNWSNGGNGGNATHMGNACNGCNYTNGCNCARTAY
  MGNGARYTNGCNGCNTTYWSNCARTTYGCNWSNGAYYTNGAYGAYGCNACNMGNAARCAR
  YTNGAYCAYGGNCARAARGTNACNGARYTNYTNAARCARAARCARTAYGCNCCNATGWSN
  GTNGCNCARCARWSNYTNGTNYTNTTYGCNGCNGARMGNGGNTAYYTNGCNGAYGTNGAR
  YTNWSNAARATHGGNWSNTTYGARGCNGCNYTNYTNGCNTAYGTNGAYMGNGAYCAYGCN
  CCNYTNATGCARGARATHAAYCARACNGGNGGNTAYAAYGAYGARATHGARGGNAARYTN
  AARGGNATHYTNGAYWSNTTYAARGCNACNCARWSNTGG---


 @28. TX 5 @Search for patterns of motifs

        This option searches for patterns of motifs. Patterns  can  be
  defined  interactively  or read from files. Results can be displayed
  in several ways in both graphical and textual form. The  option  can
  also be used to create pattern files for  searching  libraries.   It
  is  extremely  flexible and consequently the following documentation
  is quite lengthy.  However the routine is capable of  searching  for
  almost  any  known  pattern.  In  addition  the flexibility does not
  make the option difficult to use, and the user  interface  has  been
  simplified considerably since the methods were first published.

        Users should refer to the "typical dialogue" shown  below  for
  the most helpful information on using the program.

        There  are  currently  four  ways  to  display  the   matching
  patterns:  1=each individual motif and its position is listed; 2=all
  the sequence between, and including  the  two  outermost  motifs  is
  listed;  3=graphical,  with  a vertical line marking the position of
  the leftmost motif; 4 = EMBL feature table format, where the  KEYNAM
  field  is  the motif name, the FROM and TO fields denote the ends of
  the match, and the DESCRIPTION field is "Program".

        When it is defined for  the  first  time  a  pattern  must  be
  entered  interactively  at the keyboard, but the pattern description
  can be saved to a file. This file can be  used  for  all  subsequent
  searches.

        When defining a pattern interactively select a motif class and
  the program will request the required inputs.

        The program gives each motif an identifying name  and  number.
  For  motifs  other than the first, a range of allowed positions must
  be defined (Note that sets of motifs included using the OR  operator
  will  all  be  given  the  same  range, and so the program will only
  request range values for the first  motif  in  any  such  set).   To
  specify  the  allowed  range  for  a  motif the user must supply the
  following: the identifying number of the motif,  relative  to  which
  the  current  motif's  positions  are  to  be  defined  (termed  the
  "reference motif"); a "relative start position"  and  a  range.  The
  relative  start  position  can  be  negative or positive. A negative
  start position means that although the reference motif  is  searched
  for  first,  the  current  motif  can  be found to its left.  A zero
  relative start position means their left ends are superimposed.  The
  default start position is to butt-join the motif to  the  right-hand
  end of the "reference motif". The range  is  "the  number  of  extra
  positions" that the motif can take.

        The program will  display  the  probability  of  finding  each
  motif.  These  values  are presented in the following form: .1234E-5
  means 0.1234 times 10 to the power -5.

        After the pattern has been defined, the program  will  type  a
  description of it on the screen. It will then allow the user to give
  an overall cutoff score and overall probability cutoff.

        Typical dialogue  for  all  the  different  motif  classes  is
  displayed below.

  ? Menu or option number=28
    Pattern searcher
  ? (y/n) (y) Read pattern from keyboard
  X 1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Direct repeat
    6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =
  ? Motif name=aa
  ? String=aa
  Probability of score     2.0000 = 0.123E-01
  X 1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Direct repeat
    6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =2
  ? Motif name=pmatch
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-1) (1) =
  ? Relative start position (-1000-1000) (3) =
  ? Number of extra positions (0-1000) (0) =
  ? String=qqq
  ? Minimum matches (1.00-3.00) (3.00) =2
  Probability of score     2.0000 = 0.858E-02
    1 Exact match
  X 2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Direct repeat
    6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =3
  ? Motif name=sm
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-2) (2) =
  ? Relative start position (-1000-1000) (4) =
  ? Number of extra positions (0-1000) (0) =
  ? String=wqa
  ? Minimum score (11.00-53.00) (53.00) =36
  Probability of score    36.0000 = 0.531E-02
    1 Exact match
    2 Percentage match
  X 3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Direct repeat
    6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =4
  ? Motif name=hth
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-3) (3) =
  ? Relative start position (-1000-1000) (4) =
  ? Number of extra positions (0-1000) (0) =
  ? Weight matrix file name=hth
   HELIX TURN HELIX PABO SAUER WEIGHTS 17-11-87
  Probability of score   -51.5860 = 0.230E-04
    1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
  X 4 Cut-off score and weight matrix
    5 Direct repeat
    6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =5
  ? Motif name=repeat
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-4) (4) =
  ? Relative start position (-1000-1000) (21) =
  ? Number of extra positions (0-1000) (0) =3
  ? Repeat length (1-60) (6) =3
  ? Minimum gap (0-60) (0) =
  ? Maximum gap (0-60) (0) =2
  ? Minimum score (11.00-60.00) (36.00) =
  Probability of score    36.0000 = 0.445E-01
    1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
  X 5 Direct repeat
    6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =6
  ? Motif name=mset
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-5) (5) =
  ? Relative start position (-1000-1000) (1) =
  ? Number of extra positions (0-1000) (0) =
  X 1 Keyboard input
    2 File input
  ? 0,1,2 =
  Separate sets with commas
  ? String=AVL,AST,,WYRF
  ? Minimum matches (1.00-4.00) (4.00) =3
  Probability of score     3.0000 = 0.718E-02
    1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Direct repeat
  X 6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =7
  ? (y/n) (y) Save pattern in a file
  ? Pattern definition file=EXAM.PAT
  Motif  6 needs a file name to store set as a weight matrix
  ? Weight matrix file name=DEMO.WTS
  Weight matrix needs a title
  ? Title=Demonstration class 6 weight matrix

  Pattern description

  Motif  1 named aa       is of class    1
  Which is an exact match to the string
  aa
  Motif  2 named pmatch   is of class    2
  which is a match of score     2. to the string
  qqq
  and the N-terminal residue can take positions      3 to       3
  relative to the N-terminal end of motif   1
  It is anded with the previous motif.
  Motif  3 named sm       is of class    3
  which is a match of score    36. to the string
  wqa
  and the N-terminal residue can take positions      4 to       4
  relative to the N-terminal end of motif   2
  It is anded with the previous motif.
  Motif  4 named hth      is of class    4
  Which is a match to a weight matrix with score -51.586
  and the N-terminal residue can take positions      4 to       4
  relative to the N-terminal end of motif   3
  It is anded with the previous motif.
  Motif  5 named repeat   is of class    5
  Which is a repeat with repeat length    3 and score    36.
  The loop-out can have sizes      0 to      2
  and the N-terminal residue can take positions     21 to      24
  relative to the N-terminal end of motif   4
  It is anded with the previous motif.
  Motif  6 named mset     is of class    6
  Which is membership of a set with score   3.000
  It is anded with the previous motif.
  Probability of finding pattern = 0.4109E-14
  Expected number of matches  = 0.2539E-10
  ? Maximum pattern probability (0.00-1.00) (1.00) =
  ? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
   Select display mode
  X 1 Motif by motif
    2 Inclusive
    3 Graphical
    4 EMBL feature table
  ? 0,1,2,3,4 =
   Searching

  Total matches found      0
  Menus and their numbers are
  m0 = This menu
  m1 = General
  m2 = Screen control
  m3 = Statistical analysis of content
  m4 = Structure
  m5 = Search
   ? = Help
   ! = Quit
  ? Menu or option number=6
  Page through text files
  ? Name of file to read=exam.pat
   A1          aa       Class
   aa
   @ End of string
   A2          pmatch   Class
        1      Relative motif
        3      Relative start position
        0      Number of extra positions
   qqq
   @ End of string
     2.00000   Cutoff
   A3          sm       Class
        2      Relative motif
        4      Relative start position
        0      Number of extra positions
   wqa
   @ End of string
    36.00000   Cutoff
   A4          hth      Class
        3      Relative motif
        4      Relative start position
        0      Number of extra positions
  hth                                      File name
   A5          repeat   Class
        4      Relative motif
       21      Relative start position
        3      Number of extra positions
        3      Length
        0      Minimum loop
        2      Maximum loop
    36.00000   Cutoff
   A6          mset     Class
        5      Relative motif
        1      Relative start position
        0      Number of extra positions
  DEMO.WTS                                 File name
  End of file
  Menus and their numbers are
  m0 = This menu
  m1 = General
  m2 = Screen control
  m3 = Statistical analysis of content
  m4 = Structure
  m5 = Search
   ? = Help
   ! = Quit
  ? Menu or option number=6
  Page through text files
  ? Name of file to read=demo.wts
   Demonstration class 6 weight matrix
        4     0     3.000     4.000
   P   1   2   3   4
   N   0   0   0   0
   C   0   0   0   0
   S   0   1   0   0
   T   0   1   0   0
   P   0   0   0   0
   A   1   1   0   0
   G   0   0   0   0
   N   0   0   0   0
   D   0   0   0   0
   E   0   0   0   0
   Q   0   0   0   0
   B   0   0   0   0
   Z   0   0   0   0
   H   0   0   0   0
   R   0   0   0   1
   K   0   0   0   0
   M   0   0   0   0
   I   0   0   0   0
   L   1   0   0   0
   V   1   0   0   0
   F   0   0   0   1
   Y   0   0   0   1
   W   0   0   0   1
  End of file
  Menus and their numbers are
  m0 = This menu
  m1 = General
  m2 = Screen control
  m3 = Statistical analysis of content
  m4 = Structure
  m5 = Search
   ? = Help
   ! = Quit
  ? Menu or option number=28
    Pattern searcher
  ? (y/n) (y) Read pattern from keyboard
  X 1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Direct repeat
    6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =2
  ? Motif name=avlst
  ? String=avlst
  ? Minimum matches (1.00-5.00) (5.00) =3
  Probability of score     3.0000 = 0.394E-02
    1 Exact match
  X 2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Direct repeat
    6 Membership of set
    7 Pattern complete
  ? 0,1,2,3,4,5,6,7 =7
  ? (y/n) (y) Save pattern in a file n

  Pattern description

  Motif  1 named avlst    is of class    2
  which is a match of score     3. to the string
  avlst
  Probability of finding pattern = 0.3941E-02
  Expected number of matches  = 0.2030E+01
  ? Maximum pattern probability (0.00-1.00) (1.00) =
  ? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
   Select display mode
  X 1 Motif by motif
    2 Inclusive
    3 Graphical
    4 EMBL feature table
  ? 0,1,2,3,4 =4
   Searching

  FT   avlst       152    156       Program
  Total matches found      1
  Minimum and maximum observed scores        3.00        3.00


        General notes

        These methods allow users to define  and  search  for  complex
  patterns  of  motifs  defined as single objects.  The programs allow
  individual DNA motifs to be defined in  eight  different  ways,  and
  protein  motifs  in  six.  Motifs  are  combined,  using the logical
  operators AND, OR and NOT, to describe a pattern. The  pattern  also
  specifies   the  ranges  of  allowed  relative  separations  of  the
  individual motifs.

        First some definitions.

        A MOTIF is a contiguous subsequence of fixed length.   At  its
  simplest  it  could  be a single definite base or amino acid; a more
  complex motif might be better represented as a consensus or a weight
  matrix;  two  more-abstract  types  of motif are direct and inverted
  repeats.

        A PATTERN is a higher order of structure defined by a list  of
  motifs.  The  motifs  in  a  pattern  are combined using the logical
  operators AND, OR  and  NOT.  The  list  also  defines  the  allowed
  relative  separations  of the motifs. In the current versions of the
  programs up to 50 motifs can be combined into a single  pattern.  So
  using these definitions there are two differences between motifs and
  patterns: 1) the distances between  all  elements  of  a  motif  are
  fixed,  but  the  separations  of parts of patterns can vary; 2) all
  characters in a motif are defined using the same method (class), but
  different  parts of a pattern can be defined in completely different
  ways.

        Each motif can be represented in 9 ways (known  as  the  motif
  class):

             MOTIF CLASSES
  CLASS           DESCRIPTION
   1       Exact match to a short defined sequence. The IUB symbols
           can be used for DNA sequences.
   2       Percentage match to a defined short sequence. In nucleic acids,
           the IUB symbols can be used.
   3       Match to a defined sequence, using a score matrix and cutoff
           score. The DNA matrix (see option 18) gives scores to IUB symbols
           depending on their level of redundancy. MDM78 is used for proteins.
   4       Match to a weight matrix with cutoff score.
   5       As class 4 but on the complementary strand.
   6       Inverted repeat or stem-loop. Fixed stem length, range of
           loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
   7       Exact match to short sequence but with a defined step size.
   8       Direct repeat. Fixed repeat length, range of loop-out sizes,
           cutoff score, and score matrix (for protein sequences MDM78 and
           for nucleic acids an identity matrix).
   9       Membership of a set. A list of sets of allowed amino acids for
           each position in the motif. The sets are separated by commas(,).
           For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
           acids in which one of I,V or L must be found in the first position,
           then anything in the next two positions, D,E,K or R in the fourth
           position and F,Y,W,I,L,V or M in the fifth. This class only applies
           to protein sequences because for nucleic acids "membership of a
           set" can be achieved using IUB symbols.

      Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
      nucleic acids.


        Class 1: exact match.

        The motif is defined by a short sequence, which, for nucleic
  acids, may include IUB symbols. All symbols must match.

        Class 2: percentage match

        The motif is defined by a short sequence, which, for nucleic
  acids,  may  include  IUB  symbols.  The  minimum number of matching
  characters must also be specified.
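
        As an illustration only, the counting described for class 2
  can be sketched in Python. The function name and the test strings
  are hypothetical, IUB symbols are not handled, and this is not the
  program's own code.

```python
def count_match_positions(sequence, motif, min_matches):
    """Return 1-based start positions where at least min_matches
    characters of motif agree with the sequence (no IUB handling)."""
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        # Count identical characters at aligned positions.
        matches = sum(1 for a, b in zip(window, motif) if a == b)
        if matches >= min_matches:
            hits.append(start + 1)
    return hits
```

  Setting min_matches equal to the motif length gives the same result
  as a class 1 exact match in this sketch.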

        Class 3: match using a score matrix

        The motif is defined by a short sequence, which, for nucleic
  acids,  may  include IUB symbols. The motif is not compared directly
  with the sequence  to  count  the  number  of  matching  characters.
  Instead  a  matrix is used to provide a score for all possible pairs
  of characters. The motif score for any position along  the  sequence
  is  the  sum  of  the scores found by looking-up the scores for each
  pair of aligned characters. A match  is  declared  if  some  minimum
  score is achieved.
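
        The summing of pair scores can be sketched as follows. This is
  a minimal sketch, not the program's code: the function name is
  hypothetical and the small TOY matrix is invented for the example
  (it is NOT MDM78).

```python
def score_matrix_hits(sequence, motif, matrix, min_score):
    """Slide motif along sequence; at each position sum matrix[(a, b)]
    over aligned pairs and report (1-based position, score) for hits."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        score = sum(matrix.get((a, b), 0) for a, b in zip(window, motif))
        if score >= min_score:
            hits.append((start + 1, score))
    return hits

# Toy scores invented for this sketch (NOT MDM78): identical pairs
# score 2, the pair I/L scores 1, everything else scores 0.
TOY = {(a, a): 2 for a in "ACDEFGHIKLMNPQRSTVWY"}
TOY[("I", "L")] = TOY[("L", "I")] = 1
```

  With these toy scores the motif LLA matches ILA with a score of 5,
  although only two of the three characters are identical.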

        Class 4: weight matrix

        The motif is defined by a table of values (called  weights  or
  scores). The table gives a score for finding each possible character
  at each position along the length of the  motif.  It  therefore  has
  dimension  motif-length  x character-set-size, and allows us to give
  different  scores  for  each  character  at  each  position.  It  is
  equivalent  to  having  a  different  score matrix for each position
  along the motif, and provides the most flexible and specific  method
  of  defining  motifs. The weight matrices are created by program PIP
  option 20 and stored as files. The file contains the values for each
  position, as well as an overall minimum score. There are two ways in
  which these values can be used to calculate an overall score for any
  section  of  the  sequence. The simplest way is to add the values in
  the file. (This  means  that  the  highest  possible  score  can  be
  calculated  by adding the top value at each column position, and the
  lowest by adding the bottom value.)  The normal  way  of  using  the
  values  in  the  file  is  as follows. First the programs divide the
  values in each column by the column total so that they sum to 1.0.
  Then  the  natural logs of these values are used as scores. When the
  matrix is applied to a sequence these logarithmic values are  summed
  (which  is  of  course  equivalent  to multiplying the frequencies).
  Note that using the natural logs of the frequencies as  weights  and
  adding  them  means  that the overall cutoff score must be less than
  zero, whereas if the original values in the weight matrix  file  are
  added,  the  cutoff  score  will  be  greater  than zero. The search
  routines therefore decide whether the user wants to  add  values  or
  multiply  frequencies by examining the value of the cutoff score: it
  will add if the  cutoff  is  greater  than  zero  and  add  logs  of
  frequencies  if  it is less than zero.  Hence we effectively get two
  motif classes in one. The program PIP, when creating  weight  matrix
  files,  will  ask  the  user  whether  the scores should be added or
  multiplied. If the values in the table  have  been  defined  without
  using a set of aligned sequences it is easier for the user to choose
  a cutoff score if the values are added.
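
        The two ways of using the values can be sketched as below. The
  function name and the data layout (one dictionary per motif
  position) are assumptions made for the sketch, as is the treatment
  of a zero frequency and of a cutoff of exactly zero, neither of
  which is specified above.

```python
import math

def weight_matrix_score(window, columns, cutoff):
    """Score one window against a weight matrix.

    columns holds one dict per motif position, mapping a character to
    its value in the file. A cutoff above zero selects plain addition;
    otherwise natural logs of the column frequencies are summed.
    """
    if cutoff > 0:
        return sum(col.get(ch, 0) for col, ch in zip(columns, window))
    total = 0.0
    for col, ch in zip(columns, window):
        # Divide by the column total so the column sums to 1.0.
        freq = col.get(ch, 0) / sum(col.values())
        if freq == 0:
            return -math.inf  # assumption: zero frequency never matches
        total += math.log(freq)
    return total
```

  In the logarithmic mode every term is at most zero, which is why
  the cutoff for that mode must be negative.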

        Class 5: complement of weight matrix

        The motif is defined by  a  weight  matrix,  but  the  program
  searches for its complement.

        Class 6: inverted repeat, or stem-loop

        The motif is defined by a repeat length, a minimum score and a
  range  of  loop  sizes.  The scores are A-T=2, G-C=2, G-T=1, else=0.
  The loop sizes are defined by a minimum and  maximum  distance  from
  the  3'  end  of  the  stem.  For a stem-loop these will be positive
  numbers. For example to define a stem of length  8  and  loop  sizes
  varying  from  3 to 5, the stem would be set to 8, the minimum start
  distance to 3 and the maximum to 5. To define an inverted repeat the
  minimum  distance  will  be  negative.  For  example  stem length=9,
  minimum distance=-9, and  maximum  distance=-8  will  find  inverted
  repeats  of lengths 9 and 10. E.g. AAAAATTTT and AAAAATTTTT would be
  found, the first having a base at  its  centre,  the  second  having
  none.
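
        One reading of this geometry can be sketched as follows. It is
  an assumption of the sketch that the second arm of the stem begins
  "distance" bases after the 3' end of the first arm and that the two
  arms are paired antiparallel; the names are hypothetical.

```python
PAIR_SCORE = {("A", "T"): 2, ("T", "A"): 2, ("G", "C"): 2,
              ("C", "G"): 2, ("G", "T"): 1, ("T", "G"): 1}

def stem_score(seq, start, stem_len, distance):
    """Score the stem whose first arm begins at 0-based index start.

    Assumed geometry: the second arm begins distance bases after the
    3' end of the first arm, and the arms are paired antiparallel.
    Returns None if the second arm runs off the sequence.
    """
    second = start + stem_len + distance
    if second < 0 or second + stem_len > len(seq):
        return None
    return sum(
        PAIR_SCORE.get((seq[start + k], seq[second + stem_len - 1 - k]), 0)
        for k in range(stem_len)
    )
```

  Under this reading AAAAATTTT with stem length 9 and distance -9
  scores 16, because the centre base is paired with itself and scores
  nothing, and AAAAATTTTT with distance -8 scores the maximum of 18.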

        Class 7: exact match, defined step size.

        The motif is defined by a short sequence, which, for nucleic
  acids,  may  include  IUB symbols. All symbols must match. The class
  differs from class 1 in that searches will move  in  steps  of  some
  given  size. For example we could search for a certain codon and use
  a step size of 3 and hence keep in a single reading frame.
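
        A stepped search of this kind might look like the sketch
  below. The function name and the "first" parameter (the index at
  which stepping begins) are hypothetical, and IUB symbols are not
  handled.

```python
def stepped_exact_matches(sequence, motif, step, first=0):
    """Return 1-based positions of exact matches, testing only the
    0-based indexes first, first + step, first + 2*step, ..."""
    return [
        i + 1
        for i in range(first, len(sequence) - len(motif) + 1, step)
        if sequence[i:i + len(motif)] == motif
    ]
```

  With a step of 3 the search stays in one reading frame, so an
  out-of-frame occurrence of the codon is not reported.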

        Class 8: direct repeat

        The motif is defined by a repeat length, a minimum score and a
  range  of loop sizes. The scores are defined using MDM78 for protein
  sequences and an identity matrix for nucleic acids.  The loop  sizes
  are defined by a minimum and maximum distance from the 3' end of the
  stem.

        Class 9: membership of a set

        This motif class is for protein sequences. It  is  defined  by
  lists  of  allowed amino acids for each position in the motif, and a
  cut-off score.  Positions at which any amino acid can occur are left
  blank.  All allowed amino acids for each position give a score of 1.
  The motifs can be defined in two ways: either typed at the  keyboard
  or  read in as a weight-matrix-like file.  When the motif is defined
  at the keyboard the sets of allowed amino  acids  are  separated  by
  commas(,).  For example IVL,,,DEKR,FYWILVM defines a motif of length
  5 amino acids in which one of I,V or L must be found  in  the  first
  position, then anything in the next two positions, D,E,K or R in the
  fourth position and F,Y,W,I,L,V or M in the fifth.  To specify  that
  the whole motif must match, a score of 3 would be required (i.e. one
  of the allowed amino acids must be  found  for  each  of  the  three
  defined  positions).  If the motif is read from a file the file must
  have been written by program PIP, or have been saved by the  pattern
  searching  routines.  If  the  user elects to save a pattern, and it
  includes class 9 motifs typed at the keyboard, then the program will
  save  the  class  9 motifs as weight matrix files. Therefore it will
  request file names for each motif of this class. If the motif  given
  above  as  an example were saved the weight matrix file would have 5
  columns.  The first column would contain zeroes except for the I,  V
  and  L  rows which would be set to 1; the next two columns would all
  be zero; the next would be zero except for  the  D,E,K  and  R  rows
  which  would  be  1;  the  final  column  would  contain 1's in rows
  F,Y,W,I,L,V and M, with the rest zero.
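
        The keyboard form and its weight-matrix equivalent can be
  sketched as follows. The function names are hypothetical, and the
  alphabetical row order used in as_columns is an assumption; the
  files written by the program may order the rows differently.

```python
def parse_sets(definition):
    """Split a keyboard definition such as 'IVL,,,DEKR,FYWILVM' into
    one set of allowed residues per position (empty set = anything)."""
    return [set(field.strip().upper()) for field in definition.split(",")]

def set_score(window, sets):
    """Each defined position holding an allowed residue scores 1."""
    return sum(1 for ch, allowed in zip(window.upper(), sets) if ch in allowed)

def as_columns(sets, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Build 0/1 weight-matrix columns equivalent to the sets."""
    return [[1 if aa in allowed else 0 for aa in alphabet] for allowed in sets]
```

  For the example motif the five columns contain 3, 0, 0, 4 and 7
  ones respectively, and a window such as VAGKF reaches the full
  score of 3.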

        The logical operator (AND, OR or NOT) used to add  each  motif
  to  the  pattern  is  specified by preceding the class number by the
  letters A, O or N. A = AND, O = OR, N = NOT.  The default is  A,  so
  N2  means include, using the NOT operator, a class 2 motif; O2 means
  include, using the OR operator, a class 2 motif; both A2 and 2  mean
  include, using the AND operator, a class 2 motif.
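
        The rule for reading these codes can be sketched as below; the
  function name is hypothetical.

```python
OPERATORS = {"A": "AND", "O": "OR", "N": "NOT"}

def parse_class_code(code):
    """Split a code such as 'N2', 'O2', 'A2' or '2' into
    (operator, class number); a missing letter means AND."""
    code = code.strip().upper()
    if code and code[0] in OPERATORS:
        return OPERATORS[code[0]], int(code[1:])
    return "AND", int(code)
```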

        Range setting.

        The motifs in a pattern are numbered according to their  order
  in  the list. Apart from the first motif in a pattern all motifs are
  given a range of allowed positions relative to a  motif  further  up
  the  list.  For example suppose we have a pattern defined by A AND B
  AND C AND D.  Motif A can occur anywhere, but B must have its  range
  of  allowed  positions  defined relative to the position of motif A,
  and C's positions  can  be  defined  relative  to  either  A  or  B,
  depending  on  which  is most convenient, and likewise D's positions
  can be relative to A or B or C.

        Notice that the positions of motifs can be defined relative to
  more  than one motif. Suppose we have a pattern consisting of motifs
  A, B and C, and that B occurs 5-10 residues right of A, C occurs  5-
  10  residues  right  of B, and also C is never more than 15 residues
  from A. Then it is quite consistent  with  the  methods  to  include
  motif C into the pattern twice using the AND operator: once relative
  to A and once relative to B. This will define the  relative  spacing
  and  the  ORDER  of the motifs in the pattern. (If we simply defined
  the position of C relative to A it could be found to the left of B).
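
        The three-motif example can be sketched as a pair of range
  tests. The function names are hypothetical, and the separations are
  plain differences between start positions; how the program itself
  numbers relative positions is not modelled here.

```python
def within(ref_pos, pos, min_sep, max_sep):
    """True if pos lies min_sep..max_sep residues right of ref_pos."""
    return min_sep <= pos - ref_pos <= max_sep

def pattern_ok(a, b, c):
    """A AND B AND C AND C: B relative to A, then C included twice,
    once relative to B and once relative to A."""
    return (within(a, b, 5, 10)
            and within(b, c, 5, 10)
            and within(a, c, 0, 15))
```

  With A at 100 and B at 108, a C at 114 satisfies both ranges, but a
  C at 118 is rejected: it is within range of B yet 18 residues from
  A.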

        Motifs combined together using the OR operator are  all  given
  the  same range. For example suppose we had a pattern A AND (B OR C)
  AND (D OR E), then B and C each have the same range,  and  D  and  E
  also  have  the same range as one another. The range for D and E can
  be relative to A or to B.

        Motifs cannot have their ranges  defined  relative  to  motifs
  that  are included using the NOT operator. For example if we had the
  pattern A NOT B AND C, then the range for  C  can  only  be  defined
  relative to motif A.

        Speed can be gained by arranging the order of  the  motifs  so
  that  those higher up the list are of types that can be searched for
  rapidly and that are also unlikely to be found.

        Motifs combined by the OR operator are  alternatives:  if  any
  one  of a set of motifs combined by the OR operator is found, then a
  match is declared. All alternatives will be reported. For example if
  we  had a pattern defined by A AND (B OR C), then all places where A
  occurs and B is found within range, and all places where A is  found
  and C is found within range will be reported. A typical use would be
  where we might allow a motif to appear on either strand of  the  DNA
  sequence.  For  example  a  weight matrix representing the heatshock
  element could be used in a pattern which  included  heatshock  as  a
  motif  class  4  combined  using the OR operator with heatshock as a
  motif class 5.

        The probability calculations are performed for each  motif  as
  it  is  defined.  If  an  overall  probability  cut-off is given the
  calculation is repeated for each match  found.  To  achieve  maximum
  searching  speed do not give an overall probability cut-off. Overall
  cut-off scores should only be used if the  motif  classes  used  are
  compatible.

        There are currently several ways to display the matches:  1  =
  each  motif and its position is listed; 2 = all the sequence between
  the two outermost motifs is listed; 3  =  graphical,  with  a  spike
  marking  the  position  of  the leftmost motif. The library versions
  also give entry names, and a one line title; in addition they can be
  used  to  produce  aligned  families of sequences. When this mode of
  output is selected the program will write a separate file  for  each
  match. The files will be called ENTRYNAME.DAT where ENTRYNAME is the
  name of the entry in the library.  The  matching  sequence  will  be
  written  out so that the spacing between motifs is constant, and set
  to the maximum allowed by the pattern definition. Any gaps  will  be
  filled with dashes (-). If the individual sequences were
  subsequently written one above the other they should line up so that
  all motifs are in register. There are two types of output of this
  sort: one, option 4, writes out whole sequences; the other, option 5,
  writes out only the sequences between the two outermost motifs.
  Note that for option 4 users are asked to type the position of the
  first motif, for the following reason. Consider a pattern found in several
  sequences.  Consider only the first motif in the pattern and suppose
  that it was found in different positions  in  these  sequences.  Say
  that  of  these  positions  the  one  furthest from the left end was
  position 100. Then, in order to ensure that all the sequences  would
  align,  we must specify that motif 1 must start at position 100. Any
  sequences in which motif 1 started  nearer  to  the  left  end  than
  position  100  would  be  padded accordingly.  These modes of output
  should only be used when the  position  of  each  motif  is  defined
  relative to its immediate neighbour.
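
        The left padding described for option 4 can be sketched as
  below. The function name is hypothetical, and only the padding in
  front of the first motif is shown, not the gaps between motifs.

```python
def pad_to_align(sequence, motif_start, target_start):
    """Left-pad with dashes so that the motif, found at 1-based
    motif_start, begins at 1-based target_start."""
    if motif_start > target_start:
        raise ValueError("target_start must be at least motif_start")
    return "-" * (target_start - motif_start) + sequence
```

  Choosing target_start as the largest motif 1 position seen in any
  of the sequences means that no sequence needs to be truncated.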

        The pattern descriptions can be saved to  files.  These  files
  can  be used instead of typing definitions again at the keyboard. As
  the files are annotated, they can easily  be  changed  using  system
  editors,  and  the  modified  versions  used  to  define the variant
  patterns for the programs.


        Use of lists of entry names

        The two programs that operate on libraries have the ability to
  restrict  their  searches to subsets of the libraries. This does not
  require sublibraries to be created but instead is achieved by  using
  files  containing  a  list of the entry names of sequences. The user
  may  choose  to  search  only  those  entries  on   the   list   or,
  alternatively  to  search  all  but  those  on the list (i.e. in the
  latter case the list contains the names of those  to  be  excluded).
  The  programs  can search libraries that have indexes and those that
  do not.  If a list of names for inclusion is used, then  the  search
  will  be  faster if the index is present. In all other circumstances
  the whole library will be read. The list must be  in  library  order
  except  when  it  is  used  to  include  entries,  and  an  index is
  available.  The list must contain each  entry  name  on  a  separate
  line, with the name starting in column 1 of the line, i.e. there must
  be no spaces at the start of the line.  The list of entry names  can
  be  produced by the keyword searches of nip, pip, etc as long as the
  listings produced have a space character separating the  entry  name
  from the entry description. This will depend on how well the library
  reformatting programs work. For example swissprot entry  names  tend
  to  run  into the beginning of the descriptions, but other libraries
  are generally OK.
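
        The include/exclude use of such a list can be sketched as
  follows. The function names and the entry names in the usage note
  are illustrative only, and skipping lines that begin with a space is
  an assumption of the sketch rather than documented behaviour.

```python
def read_entry_names(lines):
    """Collect entry names: one per line, starting in column 1, with
    anything after the first space treated as description."""
    names = set()
    for line in lines:
        # Skip empty lines and lines that do not start in column 1.
        if line and not line[0].isspace():
            names.add(line.split()[0])
    return names

def should_search(entry_name, names, include=True):
    """include=True: search only listed entries; False: all but listed."""
    return (entry_name in names) == include
```

  For example, a list holding CYC_HUMAN and HBA_HUMAN restricts an
  inclusive search to those two entries, while the same list used for
  exclusion searches everything else.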

        One use of the programs  is  to  look  for  patterns  that  we
  already  know  about, but in new sequences. However it is hoped that
  they will also be useful for finding new motifs. For example several
  known   control   regions  in  nucleic  acid  sequences  consist  of
  particular direct or inverted repeats; the inclusion of  direct  and
  inverted  repeats  as  motif  classes  makes  it  possible  to  find
  previously unknown motifs of these types. Using these  new  programs
  we can ask questions like: "are there any inverted or direct repeats
  near to sections of sequence that contain both a  CCAAT  box  and  a
  TATA  box?"; and to search for such things throughout the libraries.
  In addition, the mode of output in which all  the  sequence  between
  the  two outermost motifs found is printed out, allows us to extract
  sequences and  examine  them  in  more  detail  for  further  common
  subsequences.  For example we might want to collect together all the
  sequences between putative CCAAT and TATA boxes.

        A further use of  the  inverted  repeat  motif  class  is  the
  following.  If  a  regulatory  sequence in DNA is poorly defined but
  also an inverted repeat, then it might be an advantage to specify it
  both  as a consensus sequence and a superimposed inverted repeat. In
  this way two weak definitions can be combined to produce a  stronger
  pattern.

        Given only a few examples of a motif it should be possible  to
  perform  initial  searches  using  a  class 3 motif, and then, using
  plausible matching sequences, create a more specific  weight  matrix
  for the same motif.

        If motifs are combined with  the  first  motif  using  the  OR
  operator  they  will  be ignored until all permutations that include
  the first motif have been looked for. The whole search will then  be
  repeated,  in  turn, for each of those motifs that are combined with
  the first motif using the OR operator.  An  interesting  consequence
  of  this is that the program can be used, without change, to compare
  any newly determined sequence with all known individual  motifs.  We
  achieve  this by having a pattern in which all known relevant motifs
  are combined using the OR operator.  If we ask to use  this  pattern
  with  a  sequence,  the  program  will  automatically  compare  each
  individual motif in  the  pattern  with  the  whole  length  of  the
  sequence.  As the number of known motifs grows this should become an
  increasingly useful standard procedure.

        The  NOT  operator  is  obviously  useful  for   making   sure
  particular  motifs  are  not  present,  but  it  can also be used to
  bracket the levels of matches found. We may want a degree  of  match
  that  lies  between  two  limits - binding should occur, but not too
  strongly; or base-pairs should  form,  but  not  too  many.  We  can
  specify  this by asking for a match with a low score, in combination
  with a match with a high score, both for the same motif, but with the
  high score included using the NOT operator.
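
        In terms of a single score the bracketing amounts to the test
  sketched below. The function name is hypothetical and the position
  ranges of the two motifs are ignored.

```python
def bracketed(score, low_cutoff, high_cutoff):
    """Motif AND-ed at low_cutoff, same motif NOT-ed at high_cutoff:
    accept scores that reach the low cutoff but not the high one."""
    return score >= low_cutoff and not (score >= high_cutoff)
```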

        The algorithm is designed to find all sections of  a  sequence
  that   satisfy   the  pattern  rather  than  only  the  best  match.
  Particularly if some of the  motifs  in  a  pattern  are  less  well
  defined  than  others, this can often result in the same region of a
  sequence being reported as having several matches,  but  which  only
  vary in the positions of the weakest motifs.

        General remarks on motif searching

        Generally motifs are short subsequences that are thought to be
  associated  with particular functions in some known sequences. Often
  we  search  for  them  to  try  to  understand  or  interpret  other
  sequences.  Sometimes  we  search  for motifs and patterns to test a
  hypothesis  about  their  role:  are  they  found  in  the  expected
  positions  in the expected sequences. In doing so we should remember
  that, in both proteins and nucleic acids, what we are really looking
  for  is  a  particular  three  dimensional  structure  with  certain
  affinities for other structures, and that we are assuming  that  the
  sequence of the motif alone defines the 3D structure we are searching
  for. The overall structure may be completely different to  those  in
  which  the  motif  is  functional,  and  hence  the motif may have a
  different shape or be  inaccessible.  We  should  be  aware  of  the
  importance  of  the context in which a motif is found. Where does it
  lie relative to the overall structure,  is  it  accessible,  is  the
  three  dimensional  spacing between it and other motifs correct? For
  example, is it on the same side of the double helix, and the correct
  distance  from  some  other  motif?  How  does  context  affect  our
  assessment of the significance of finding  a  motif?  Finding  false
  mammalian  mRNA splice junctions in non-coding sequences is far less
  important than  finding  false  sites  in  pre-mRNA  sequences,  but
  finding  them  in  the  correct  places  is most important! In other
  words, it is often the case that when we are searching for  a  motif
  that  is  known  to  be necessary for some function, then a positive
  result in the form of a match in  the  required  position,  is  more
  important  than a high background of matches in the wrong positions.
  Being able to write down the probability of finding  a  motif  in  a
  random  sequence  tells  us how well it is defined. In nucleic acids
  the DNA may contain many superimposed types of information  such  as
  those  concerned  with  histone  phasing,  protein  coding  or  mRNA
  secondary structure. These overlapping "codes"  may  interfere  with
  one  another  causing  matches to motifs to be poorer than expected.
  In general we will only have a limited number  of  examples  of  the
  motif and we do not know how representative they are.

        Sequences have superimposed functions: some parts  may  be  of
  general structural importance and give rise to an overall framework,
  and other parts give specificity and hence are not  common;  we  may
  want  to  use a set of aligned sequences to define a motif, but want
  to use only the framework positions.  Alternatively we may  want  to
  pick  out only those parts of a set of aligned sequences that give a
  particular property, and to ignore other similarities that  are  due
  to  some  other  property and which could obscure the pattern we are
  interested in.  It is possible to apply a mask to a set  of  aligned
  sequences  in  order to give weight to selected positions only.  The
  ability to define a mask allows certain positions to be used in  the
  motif  and  others to be ignored, and yet still permits the use of a
  set of aligned sequences to calculate weights. The mask is requested
  and applied by the program and results in the masked positions being
  zero in the weight matrix. The mask is defined in the following way.
  Suppose  we  had a motif of length 15, then the mask x--x--xx-x will
  give zero weights to positions 2,3,5,6 and 9 (note it is the  dashes
  (-)  that  are significant and that positions 1,4,7,8,10,11,12,13,14
  and 15 will be non-zero). Of course the same set of sequences  could
  be used with several alternative masks in order to extract different
  features and create corresponding weight matrices.
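
        The effect of the mask in the example can be sketched as
  follows. The function names and the list-of-columns layout are
  assumptions of the sketch.

```python
def masked_positions(mask, motif_length):
    """Return the 1-based positions given zero weight: those under a
    dash. Positions beyond the end of the mask stay non-zero."""
    return [i + 1 for i, ch in enumerate(mask[:motif_length]) if ch == "-"]

def apply_mask(columns, mask):
    """Zero every weight in the masked columns (a list of lists)."""
    zeroed = set(masked_positions(mask, len(columns)))
    return [[0] * len(col) if i + 1 in zeroed else list(col)
            for i, col in enumerate(columns)]
```

  For a motif of length 15 the mask x--x--xx-x gives positions 2, 3,
  5, 6 and 9, in agreement with the example above.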

        The programs are described in Staden,R. CABIOS 4, 53-60, 1988;
  Staden,R. CABIOS 5, 89-96, 1989, and a forthcoming Methods in
  Enzymology.
 @ end of help