@-1. TX  0 @General

 @-2. T   0 @Screen control

 @-2. X   0 @Screen

 @-3. T   0 @Statistical analysis of content

 @-3. X   0 @Statistics

 @-4. T   0 @Structures and repeats

 @-4. X   0 @Structures

 @-5. TX  0 @Translation and codons

 @-6. TX  0 @Gene search by content

 @-7. TX  0 @General signals

 @-8. TX  0 @Specific signals

 @0.  TX  -1 @NIP


        This  is  a  program   for  analysing  individual   nucleotide
 sequences.  It can read sequences stored in many of the most commonly
 used formats, and performs all of the usual simple analyses.  However
 the  main  purpose  of the program is to provide  methods for finding
 the function of each section of a  sequence.  In  general  no  single
 method  can   give  an unequivocal interpretation of a sequence so we
 need to use many techniques together and to combine   their  results.
 For   this   reason   the  program  presents  many  of  its   results
 graphically.

        General information is contained in the user interface. Online
 documentation for any function follows a consistent pattern: summary,
 list of inputs, list of outputs, details, example.
 @1. TX 0 @ Help

        This option gives online help. The user should  select  option
  numbers  and  the  current  documentation  will  be given. Note that
  option 0 gives an introduction to the program, and that ?  will  get
  help  from  anywhere  in  the  program.  The following functions are
  included:
 @2. TX 0 @ Quit

        This function stops the program.
 @3. TX 1 @ Read a new sequence

        This option allows users to  read  in  new  sequences,  browse
  through  annotations,  or  search  sequence  libraries for keywords.
  Sequences can  be  read  from  "personal"  sequence  files  or  from
  sequence  libraries. These are referred to as the sequence "source".
  Personal files can be stored in several formats:  Staden, PIR, EMBL,
  GENBANK  and  GCG.  At LMB we use "Staden" format for sequencing and
  all the libraries  are  stored  in  their  original  formats.  Note,
  however,  that  libraries  such  as EMBL or GenBank that are divided
  into several files (eg GenBank has 13 separate files) are indexed as
  a  whole.  This  means  that  users  do  not need to know which file
  contains an entry, only which library.  When  the  user  selects  to
  read in a sequence the program first asks for the sequence "source".

        If the user selects "personal" the program will  ask  for  the
  format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
  the file. For PIR format the user will also be required to know  the
  entry  name of the sequence as the file can contain several. For the
  other formats only a single entry is  expected.  The  file  will  be
  read,  its  length  and composition will be displayed and the option
  left.
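
        The length and composition report can be sketched as  follows.
  This is only an illustration in Python, not the program's own  code:
  the function name is hypothetical, and counting every symbol that is
  not T, C, A or G in the "-" column is an  assumption  based  on  the
  sample output shown in the typical dialogue.

```python
from collections import Counter

def composition(seq):
    """Return (length, counts, percentages) for the columns T, C, A, G and '-'."""
    # Assumption: anything that is not T, C, A or G is counted under '-'.
    counts = Counter(base if base in "TCAG" else "-" for base in seq.upper())
    length = len(seq)
    totals = {c: counts.get(c, 0) for c in "TCAG-"}
    percents = {c: (100.0 * totals[c] / length if length else 0.0) for c in "TCAG-"}
    return length, totals, percents

length, totals, percents = composition("GAATTCGGTTNN")
print("Sequence length=", length)
for column in "TCAG-":
    print(column, totals[column], f"{percents[column]:.1f}%")
```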

        If the user selects  "library"  as  the  sequence  source  the
  program will display a list of available libraries. The programs are
  capable of  handling  all  current  libraries  but  which  ones  are
  available  will  vary  from  site  to  site.  At LMB we have several
  libraries and also weekly updates of data gathered between releases.
  The  program will ask users to select a library and then give a list
  of options:

   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords

  If "Get a sequence" or "Get annotations" is selected users will be asked
  to  type  the entry name. The option will be left when a sequence is
  selected  or  !  is  typed.  The  composition  and  length  will  be
  displayed.

        The  text  index  contains  all  words  from  feature  tables,
  reference  titles, definition lines, keywords lists and comments, so
  the text index search is most useful. It is also the fastest. Up  to
  5  words  can  be  searched  for  at once. The words should be typed
  separated by spaces, for example
   ? Keywords=P53 mouse murine tumo

  will search for all entries that contain words  starting  with  p53,
  mouse,  murine  and  tumo.  Only the unique entries that contain ALL
  words will be  listed.  Before  listing  the  matching  entries  the
  program  will  show  the number of 'hits' for each word and ring the
  bell.  Escape is possible at this point, or after each screenful  of
  entries.   In  addition  to the entry names the text search displays
  the primary accession number, the  sequence  length  and  up  to  80
  characters of description.  (The search of 'titles' is now redundant
  because the full text index contains all the  title  words  and  the
  search  is  much  faster.  It  will  probably  be  removed  from the
  program.)  All searches are independent of case. Where possible  the
  program will offer default entry names.
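
        The matching rule described above (words are prefixes, case is
  ignored, and only entries containing ALL words are  listed)  can  be
  sketched as follows. This is a toy illustration in Python: the index
  is a hypothetical in-memory dictionary with made-up entries, not the
  library's real index, and silently dropping words beyond  the  fifth
  is an assumption.

```python
def search_text_index(index, query, max_words=5):
    """Return per-word hit counts and the sorted entries matching ALL words.

    index maps an entry name to the set of words indexed for it
    (a hypothetical stand-in for the library's text index).
    """
    words = query.upper().split()[:max_words]   # assumption: extra words are ignored
    hits = {}
    matching = None
    for word in words:
        # A word matches an entry if any indexed word starts with it.
        found = {name for name, entry_words in index.items()
                 if any(w.upper().startswith(word) for w in entry_words)}
        hits[word] = len(found)
        matching = found if matching is None else matching & found
    return hits, sorted(matching or [])

# Made-up toy index, for illustration only.
toy_index = {
    "ENTRY1": {"mouse", "p53", "mRNA"},
    "ENTRY2": {"murine", "mouse", "p53", "tumour"},
    "ENTRY3": {"human", "p53"},
}
hits, entries = search_text_index(toy_index, "P53 mouse")
print(hits)      # number of entries found for each word
print(entries)   # entries that contain every word
```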

        Typical dialogue follows.
  Select sequence source
  X  1 Personal file
     2 Sequence library
  ? Selection  (1-2) (1) =
  Select sequence file format
  X  1 Staden
     2 EMBL
     3 GenBank
     4 PIR
     5 GCG
  ? Selection  (1-5) (1) =
  ? Sequence file name=M13MP7.SEQ
   Contig title removed
  Sequence length=  7238
   Sequence composition
            T          C          A          G          -
        2405.      1539.      1765.      1527.         2.
          33.2%      21.3%      24.4%      21.1%       0.0%
    .
    .
    .


   Select sequence source
   X  1 Personal file
      2 Sequence library
   ? Selection  (1-2) (1) =2
   Select a library
   X  1 EMBL 29 nucleotide library Dec 91
      2 SWISSPROT 20 protein library Nov 91
      3 PIR 31 protein library Dec 91
      4 NRL3D 58 From Brookhaven protein library Dec 91
      5 GenBank
   ? Selection  (1-5) (1) =
  Library is in EMBL format with indexes
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =5
   Search for keywords
   ? Keywords=P53 mouse
  P53 hits  68
  MOUSE hits  8180

   MMANT01    X00875         536 Murine gene fragment for cellular tumour antigen
   MMANT02    X00876          83 Murine gene fragment for cellular tumour antigen
   MMANT03    X00877          21 Murine gene fragment for cellular tumour antigen
   MMANT04    X00878         261 Murine gene fragment for cellular tumour antigen
   MMANT05    X00879         184 Murine gene fragment for cellular tumour antigen
   MMANT06    X00880         113 Murine gene fragment for cellular tumour antigen
   MMANT07    X00881         110 Murine gene fragment for cellular tumour antigen
   MMANT08    X00882         137 Murine gene fragment for cellular tumour antigen
   MMANT09    X00883          74 Murine gene fragment for cellular tumour antigen
   MMANT10    X00884         107 Murine gene for cellular tumour antigen p53 (exon
   MMANT11    X00885         562 Murine p53 gene 3' region with exon 11
   MMANTP53   M26862         536 Mouse tumor antigen p53 gene, 5' end.
   MMLYN      M64608        2044 Mouse lyn protein mRNA, complete cds.
   MMP53      X00741        1377 Mouse mRNA for transformation associated protein
   MMP53A     M13872        1285 Mouse p53 mRNA, complete cds, clone pcD53.
   MMP53B     M13873        1241 Mouse p53 mRNA, complete cds, clone p53-m11.
   MMP53C     M13874        1322 Mouse p53 mRNA, complete cds, clone p53-m8.
   MMP53G1    X01235         554 Mouse genomic DNA for 5' region of cellular tumou
   MMP53IN4   X60470         729 M.musculus p53 gene for p53 protein, intron 4
   MMP53P     X01236        2132 Mouse pseudogene for cellular tumour antigen p53
   MMP53R     X01237        1773 Mouse mRNA for cellular tumour antigen p53
   MMRSB2P5   M64597         196 Mouse B2 repeat in the 3' flank of protein 53 (p5
        22 different entries found

   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =4
   Search for keywords
   ? Keywords=alpha
   Searching for alpha
   AAGHA          623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
   AAMALI        3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
   AAMALIA       1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
   AAMALIB       1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
   ACA13GT        371 alouatta caraya alpha-1,3gt gene, 3' flank.
   ADHBADA1       102 duck alpha-d-globin gene, exon 1.
   ADHBADA2      1145 duck alpha-a-globin gene and 5' flank
   ADHBADWP       513 duck (white pekin) alpha ii (minor) globin mrna, complete co
   AEACOXABC     5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
   AGA13GT        371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
   AGAAAGFP       282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
   AGAABL         138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
   AGAFAMYA        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFAMYB        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFAMYC        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
   AGAFCOXA        98 synthetic alpha-factor/cox iv fusion gene signal peptide.
   AGAGABA       7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
   AGAMYLS        120 synthetic alpha-amylase gene, 5' end.
   AGANPS          95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
  !
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =3
   ? Accession number=v00636
  Entry name LAMBDA
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =2
   Default Entry name=LAMBDA
   ? Entry name=
  ID   LAMBDA     standard; DNA; PHG; 48502 BP.
  XX
  AC   V00636; J02459; M17233; X00906;
  XX
  DT   03-JUL-1991 (Rel. 28, Last updated, Version 3)
  DT   09-JUN-1982 (Rel. 1, Created)
  XX
  DE   Genome of the bacteriophage lambda (Styloviridae).
  XX
  KW   circular; coat protein; DNA binding protein; genome;
  KW   origin of replication.
  XX
  OS   Bacteriophage lambda
  OC   Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
  XX
  RN   [1]
  RP   1-48502
  RA   Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
  RT   "Nucleotide sequence of bacteriophage lambda DNA";
  RL   J. Mol. Biol. 162:729-773(1982).
  XX
  !
   Select a task
   X  1 Get a sequence
      2 Get annotations
      3 Get entry names from accession numbers
      4 Search titles for keywords
      5 Search text index for keywords
   ? Selection  (1-5) (1) =
   Default Entry name=LAMBDA
   ? Entry name=
  DE   Genome of the bacteriophage lambda (Styloviridae).
   Sequence length  48502
   Sequence composition
             T          C          A          G          -
        11988.     11360.     12336.     12818.         0.
           24.7%      23.4%      25.4%      26.4%       0.0%

 @4. TX 1 @ Define active region

        For its analytic functions  the  program  always  works  on  a
  region  of  the  sequence  called the "active region". This function
  allows the start and end points of the active region to be reset.

        Define  the required start and end points.

        When a new sequence is read into the program the active region
  is  automatically  set to start at the beginning of the sequence and
  extend  to the maximum the program can handle. On most machines this
  will  be  to the end of the sequence. The positions are shown on the
  screen.  Note that for convenience, in the listing  and  translation
  functions,  the  user  is given access to regions outside the active
  region.
 @5. TX 1 @ List a sequence

        The sequence can be listed single or double stranded with line
  lengths from 10 to 120 in multiples of 10.

        Define the region to list, the line length required and choose
  between a single or double stranded display.  The output looks like:

    GTTAATGTAG CTTAATAACA AAGCAAAGCA CTGAAAATGC TTAGATGGAT
    CAATTACATC GAATTATTGT TTCGTTTCGT GACTTTTACG AATCTACCTA
            10         20         30         40         50

    AATTGTATCC CATAAACACA AAGGTTTGGT CCTGGCCTTA TAATTAATTA
    TTAACATAGG GTATTTGTGT TTCCAAACCA GGACCGGAAT ATTAATTAAT
            60         70         80         90        100

    GAGGTAAAAT TACACATGCA AACCTCCATA GACCGGTGTA AAATCCCTTA
    CTCCATTTTA ATGTGTACGT TTGGAGGTAT CTGGCCACAT TTTAGGGAAT
           110        120        130        140        150

    AACATTTACT TAAAATTTAA GGAGAGGGTA TCAAGCACAT TAAAATAGCT
    TTGTAAATGA ATTTTAAATT CCTCTCCCAT AGTTCGTGTA ATTTTATCGA
           160        170        180        190        200
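
        The layout of the listing above can be reproduced with a short
  sketch. This is an illustration in Python, not  the  program's  own
  code: the function name and the four-column indent  are  assumptions
  read off the sample output, as is the choice to leave an  incomplete
  final block unnumbered.

```python
COMPLEMENT = str.maketrans("ACGTU", "TGCAA")

def list_sequence(seq, line_length=50, double_stranded=True, indent=4):
    """Format seq in blocks of 10 with base numbers under the end of each block."""
    if line_length % 10 or not 10 <= line_length <= 120:
        raise ValueError("line length must be a multiple of 10 from 10 to 120")
    seq = seq.upper()
    pad = " " * indent
    lines = []
    for start in range(0, len(seq), line_length):
        chunk = seq[start:start + line_length]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        lines.append(pad + " ".join(blocks))
        if double_stranded:
            # The complementary strand is written base for base under the top strand.
            lines.append(pad + " ".join(b.translate(COMPLEMENT) for b in blocks))
        # Number only complete blocks of 10 (an assumption for the last line).
        numbers = [str(start + i + 10).rjust(10)
                   for i in range(0, len(chunk), 10) if i + 10 <= len(chunk)]
        lines.append(pad + " ".join(numbers))
        lines.append("")
    return "\n".join(lines)

print(list_sequence("GTTAATGTAGCTTAATAACAAAGCAAAGCACTGAAAATGCTTAGATGGAT"))
```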

 @6. TX 1 @ List a text file.

        Allows the user to have a text file displayed on  the  screen.
  It will appear one page at a time.

        Supply the name of the file to be displayed.
 @7. TX 1 @ Direct output to disk

        Used to direct output that would normally appear on the screen
  to a file.

        Select redirection of either text or graphics, and supply  the
  name of the file that the output should be written to.

        The results from the next options selected will not appear  on
  the  screen  but  will  be  written  to  the  file. When option 7 is
  selected again the file will be closed and output will again  appear
  on the screen.
 @8. TX 1 @ Write active region to disk

        Used to write the current active section of sequence to a disk
  file in "Staden format".

        Supply a file name and an optional title.

        The program has the capability of reading sequences stored  in
  several formats and so, in conjunction with this option, can be used
  to reformat them.
 @9. TX 1 @ Edit the sequence

        Used to edit sequences or any other files by giving access  to
  the  computer's  system editor. For editing sequences the input file
  should have already been created using one of the listing  functions
  such  as  "list  sequence",  "list translation" or "list restriction
  sites above the sequence".

        Supply the name of the file to edit.  Wait  while  the  system
  editor is made ready (this can take a while on a VAX). Use the editor.
  Exit from the editor. If a sequence has been edited, and you want to
  process  it,  affirm  that the sequence should be "made active". The
  edited sequence will replace the original sequence.

        This editing method is designed to give  users  access  to  an
  editor with which they are familiar - i.e. the one on their machine,
  and yet to allow them to edit a  sequence  which  contains  all  the
  landmarks  they  need  in  order  to  know where they are. Users can
  create files  containing  simple  listings  (single  stranded)  with
  numbering,  using "list the sequence", and then edit them with their
  system editor, using the numbering to know where they are within the
  sequence.  When the edits are complete they exit from the editor and
  the program "analyses" the edited file to extract only the  sequence
  characters.  Similarly  a file containing a three phase  translation
  can be edited, or a file containing a sequence plus its three  phase
  translation,  plus  its restriction sites marked above the sequence.
  In order to be able  to  "analyse"  such  complicated  listings  and
  correctly  extract  the  sequence the following simple rule is used:
  all lines in the file that contain a character that is  not  A,C,T,G
  or U are deleted. It is obviously important to be aware of this rule
  and its implications.
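
        The extraction rule can be sketched as follows.  This  is  an
  illustration in Python, not the program's own code; it assumes  that
  blanks within a line are ignored and that case does not matter.

```python
def extract_sequence(listing_text):
    """Keep only lines made up entirely of A, C, G, T or U and join them."""
    kept = []
    for line in listing_text.splitlines():
        symbols = "".join(line.split())          # assumption: blanks are ignored
        if symbols and set(symbols.upper()) <= set("ACGTU"):
            kept.append(symbols)
    return "".join(kept)

listing = """\
 ECORI
 .
GAATTCGGTT TGGGCTTGGT
        10         20
 E  F  G  L  G  L  V
"""
print(extract_sequence(listing))
```

  Under this reading of the rule, a translation line that happened  to
  contain only the one-letter codes A, C, G and T (alanine,  cysteine,
  glycine, threonine) would be kept as sequence, which is one  of  the
  implications to watch for.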
 @10. TX 2 @ Clear graphics

        Clears graphics from the screen.
 @11. TX 2 @ Clear text

        Clears  text from the screen.
 @12. TX 2 @ Draw a ruler

        This option allows the user to draw a ruler or scale along the
  x  axis  of the screen to help identify the coordinates of points of
  interest. The user can define the position of the first base  to  be
  marked  (for  example if the active region is 1501 to 8000, the user
  might wish to mark every 1000th base starting at either 1501 or 2000
  -  it  depends  if  the user wishes to treat the active region as an
  independent unit with its own numbering starting at its  left  edge,
  or  as  part  of  the  whole sequence). The user can also define the
  separation of the ticks on the scale and their height.  If  required
  the labelling routine can be used to add numbers to the ticks.
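
        The two numbering choices in the example above can be  written
  out as a short sketch. The function name is hypothetical  and  this
  is an illustration in Python, not the program's own code.

```python
def ruler_ticks(first_tick, region_end, separation):
    """Return the base positions at which ticks would be drawn."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return list(range(first_tick, region_end + 1, separation))

# Active region 1501 to 8000, a tick every 1000 bases.
print(ruler_ticks(1501, 8000, 1000))  # numbering from the left edge of the active region
print(ruler_ticks(2000, 8000, 1000))  # numbering as part of the whole sequence
```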
 @13. TX 2 @ Use crosshair

        This function puts a steerable cross on the screen that can be
  used to find the coordinates of points in the sequence. The user can
  move the cross around using the directional keys; when he  hits  the
  space bar the program will print out the coordinates of the cross in
  sequence units and the option will be exited.

        If instead, you hit a , the position will be displayed but the
  cross will remain on the screen.

        If a letter s is hit the program  will  display  the  sequence
  around the crosshair position, and leave the cross on the screen.
 @14. TX 2 @ Reposition plots

        The position of each plot is defined relative  to  the  user's
  drawing board, which has size 1-10,000 in x and 1-10,000 in y. Plots
  for each option  are  drawn  in  a  window  defined  by  x0,y0   and
  xlength,ylength, where x0,y0 is the position of the bottom left hand
  corner of the  window,  xlength is the width  of  the  window  and
  ylength the height of the window.
     --------------------------------------------------------- 10,000
     1                                                       1
     1       --------------------------------------   ^      1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       1                                    1 ylength  1
     1       1                                    1   1      1
     1       1                                    1   1      1
     1       --------------------------------------   v      1
     1  x0,y0^                                               1
     1       <---------------xlength-------------->          1
     ---------------------------------------------------------      1
     1                                                   10,000

  All values are in drawing board  units  (i.e.  1-10,000,  1-10,000).
  The default window positions are read from a file "NIPMARG" when the
  program is started. Users can have their own file if  required.   As
  all  the  plots  start  at  the same position in x and have the same
  width, x0 and xlength are the same for all options. Generally  users
  will  only  want  to change the start level of the window y0 and its
  height ylength. This option allows users to change window  positions
  whilst  running  the  program.   The  routine  prompts first for the
  number of the option that the user wishes to  reposition;  then  for
  the  y  start and height; then for the x start and length. Note that
  changes to the x values affect all options. If the user  types  only
  carriage  return  for any value it will remain unchanged. The cross-
  hair can be used to choose suitable heights.
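
        As a rough illustration of how a window relates  to  drawing
  board units, the sketch below maps a base position and a plot  value
  into a window. It is written in Python and is  not  taken  from  the
  program: linear scaling is assumed, and the function names  and  the
  window values used in the example are hypothetical.

```python
def to_board_x(position, region_start, region_end, x0, xlength):
    """Map a base position in the active region to a drawing-board x value."""
    span = max(region_end - region_start, 1)
    return x0 + (position - region_start) * xlength / span

def to_board_y(value, vmin, vmax, y0, ylength):
    """Map a plotted value in the range vmin..vmax to a drawing-board y value."""
    span = (vmax - vmin) or 1
    return y0 + (value - vmin) * ylength / span

# Hypothetical window: x0=1000, xlength=8000, y0=4000, ylength=2000.
print(to_board_x(1501, 1501, 8000, 1000, 8000))   # left edge of the window
print(to_board_x(8000, 1501, 8000, 1000, 8000))   # right edge of the window
print(to_board_y(0.5, 0.0, 1.0, 4000, 2000))      # halfway up the window
```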
 @15. TX 2 @ Label a diagram

        This routine allows users to  label  any  diagrams  they  have
  produced.  They  are  asked  to type in a label. When the user types
  carriage return to finish typing the label the cross-hair appears on
  the  screen. The user can position it anywhere on the screen. If the
  user types R (for right justify) the label will be  written  on  the
  diagram  with  its right end at the cross-hair position. If the user
  types L (for left justify) the label will be written on the  diagram
  with  its  left end at the cross-hair position.  The cross-hair will
  then immediately reappear, so the user can place the same  label  on
  another part of the diagram in the same way. Hitting the  space  bar
  instead makes the program ask whether another label is to be typed.

        Typical dialogue follows.
  ? Menu or option number=15
  Type label then drive cross hair to left or right end
  of label position then hit  "L"  to  write label left
  justified or  "R"  to  write label right justified or
  the space bar to quit


  ? Label=delta gene

   missing graphics

  ? Label=

 @16. TX 2 @Display a map

        This draws a map of any  sequence  features  selected  by  the
  user.   These  features  may  be  protein coding regions (CDS), tRNA
  genes (TRNA), promoter positions (PRM), etc. Users may define  their
  own  feature  table  key  names. For example I find it convenient to
  split CDS lines into CDS1, CDS2 and CDS3 each of which contains only
  those  sequences  that  code in the reading frames 1, 2 or 3. Then I
  can plot them at different heights on the screen (suitable   heights
  can be determined by using the cross-hair).

        The coordinates must be stored in a file in the format  of  an
  EMBL  or  GenBank  feature table. Note that this means that the file
  must include either EMBL or GenBank headers, and a suitable  "tail".
  The simplest header is the word FEATURES starting in column 1 of the
  first line of the file. The simplest tail is 2 empty  lines  at  the
  end  of  the  file. These lines are not included when nip writes out
  results in feature table format.

        Typical dialogue follows.
  ? Menu or option number=16
   Display a map using an EMBL feature table file
  ? map file name=hsegl1.ft
  ? feature code(e.g. CDS) =CDS
  X 1 + strand
    2 - strand
    3 both strands
  ? 0,1,2,3 =
  ? level (0-9480) (256) =4000

   missing graphics

  ? feature code(e.g. CDS) =

 @17. TX 1 @ Search for restriction enzymes

        This routine is used  to  search  for  short  sequences,  like
  restriction  enzyme  recognition sequences, and can either list  the
  results or present them graphically. Listings can take several forms
  and can include the sequence and its translation. Examples are given
  below. The program will also display the names of enzymes  that  cut
  the  sequence  infrequently.  Users  can select from sets of enzymes
  stored in files or can enter them from the keyboard.

        The short sequences (strings)  and  their  names  need  to  be
  arranged  in  a particular way. See below. Select to search, list an
  enzyme file or clear the screen. Choose either a file of enzymes  or
  to  enter  their  recognition  sequences  at the keyboard. Choose to
  search for all the enzymes in the list or to select from  the  list.
  Select  a mode of output. Define the sequence as circular or linear.
  Select to search for "definite" or "possible"  matches.  The  search
  starts,  and after the results have been displayed, further searches
  can be performed.

        When the enzymes and their recognition sequences are stored in
  a  file  they  must  be  defined  in  the following way. We call the
  recognition sequences "strings".  The format  is  as  follows:  each
  string  or  set  of  strings must be preceded by a name, each string
  must be preceded and terminated with a slash (/), and  each  set  of
  strings must end with 2 slashes. For example AATII/GACGT'C// defines the name
  AATII, its recognition sequence GACGTC and its cut site with  the  '
  symbol;  ACCI/GT'MKAC//  defines  the  name ACCI and its recognition
  sequence includes IUB symbols for incompletely  specified  bases  in
  nucleic   acid  sequences;  BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
  defines the name BBVI and this time two  recognition  sequences  and
  cut  sites  are  specified  in  order  to correctly show the cutting
  position relative to the recognition sequence. If  no  cut  site  is
  included  the first base of the recognition sequence is displayed as
  being on the 3' side of the cut.
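
        The file format just described can be  read  with  a  parser
  along the following lines. This is an illustration  in  Python,  not
  the program's own code: the function name is hypothetical,  and  the
  choice to record a cut offset of 0 when no ' symbol  is  present  is
  one reading of the rule for strings without a cut site.

```python
def parse_enzyme_entry(entry):
    """Parse one entry such as AATII/GACGT'C// into (name, [(sequence, cut), ...]).

    cut is the number of bases of the string that lie to the left of the cut.
    """
    entry = entry.strip()
    if not entry.endswith("//"):
        raise ValueError("entry must end with two slashes")
    name, *strings = entry[:-2].split("/")
    if not name or not strings or "" in strings:
        raise ValueError("expected NAME/STRING[/STRING...]//")
    sites = []
    for string in strings:
        cut = string.find("'")
        # Assumption: a string with no cut mark is given a cut offset of 0.
        sites.append((string.replace("'", ""), cut if cut >= 0 else 0))
    return name, sites

print(parse_enzyme_entry("AATII/GACGT'C//"))
print(parse_enzyme_entry("BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//"))
```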

        These collections of strings and their names can be read  from
  disk  or  entered  from  the  keyboard.   When names and strings are
  entered from the keyboard the program will ask for the name and then
  the  string(s).  If more than one string is typed per name they must
  be separated by slash (/) characters.  See  the  "Typical  dialogue"
  below.    Three  files  containing  restriction  enzyme  recognition
  sequences are currently available. The "all enzymes"  file  contains
  the  Rich  Roberts  REBASE  restriction  enzyme  database,  which is
  updated monthly.

        The user can select strings by name from these collections. If
  so  the  program  will prompt for the names, one at a time. The user
  can continue to select names until a blank name is entered  (by  the
  user typing only return).

        Listed output can be displayed in  several  ways:  it  can  be
  ordered  enzyme by enzyme, or on cut positions, or with enzyme names
  written above a listing of the sequence. This last listing can  also
  include  a  three phase translation of the sequence. In addition the
  program will display only infrequent cutters (the user  defines  the
  minimum number of cuts), or can plot the positions of matches.

        Listings sorted "enzyme by enzyme" have the following form:

   Matches found=     1
       Name                  Sequence            Position  Fragment lengths
     1 AATII                 GACGT'C                  112     111     111
                                                              912     912
   Matches found=     2
       Name                  Sequence            Position  Fragment lengths
     1 ACCI                  GT'CGAC                  112     111     111
     2 ACCI                  GT'AGAC                  420     308     308
                                                              604     604
   Matches found=     2
       Name                  Sequence            Position  Fragment lengths
     1 AHAII                 GA'CGTC                  109     108      90
     2 AHAII                 GG'CGTC                  199      90     108
                                                              825     825
   Matches found=     2
       Name                  Sequence            Position  Fragment lengths
     1 AVAII                 G'GACC                    84      83      51
     2 AVAII                 G'GTCC                   973     889      83
                                                               51     889
   Matches found=     1
       Name                  Sequence            Position  Fragment lengths
     1 BALI                  TGG'CCA                  258     257     257
                                                              766     766
   Matches found=     1
       Name                  Sequence            Position  Fragment lengths
     1 BAMHI                 G'GATCC                   92      91      91

     ......   etc

  Listings sorted on cut position have the following form:

   Searching
       Name                  Sequence            Position  Fragment lengths
     1 ECORI                 G'AATTC                    2       1
     2 BANI                  G'GTGCC                   26      24
     3 BSP1286               GTGCC'C                   31       5
     4 BBVI                  'TACTGCGCCGCAGCTGC        38       7
     5 NSPBII                CAG'CTG                   51      13
     6 PVUII                 CAG'CTG                   51       0
     7 BBVI                  GCAGCTGCTGGTG'            60       9
     8 HINCII                GTC'AAC                   80      20
     9 AVAII                 G'GACC                    84       4
    10 BINI                  'CCAGGGATCC               87       3
    11 BSTNI                 CC'AGG                    89       2
    12 BAMHI                 G'GATCC                   92       3
    13 XHOII                 G'GATCC                   92       0
    14 NSPBII                CCG'CTG                   98       6
    15 BINI                  GGATCCGCT'               100       2
    16 AHAII                 GA'CGTC                  109       9
    17 SALI                  G'TCGAC                  111       2
    18 AATII                 GACGT'C                  112       1
    19 ACCI                  GT'CGAC                  112       0
    20 HINCII                GTC'GAC                  113       1
    21 BBVI                  GCAGCGACTGATT'           166      53
    22 BINI                  'ACTCAGATCC              178      12
    23 XHOII                 A'GATCC                  183       5
    24 HGAI                  'GGCGGCGGAGGCGTC         188       5

    .....etc
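
        The positions in the listing above are consistent with  being
  the number of the first base on the 3' side of each  cut.  A  search
  of one strand for one string, with  IUB  symbols  expanded,  can  be
  sketched as follows. This is an illustration  in  Python,  not  the
  program's own code; the function name is hypothetical and a  string
  with no ' symbol is assumed to have a cut offset of 0.

```python
import re

IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

def find_cut_positions(seq, site):
    """Return 1-based positions of the first base 3' of each cut for one string."""
    cut = max(site.find("'"), 0)      # assumption: no cut mark means offset 0
    symbols = site.replace("'", "").upper()
    pattern = "".join("[" + IUB[s] + "]" for s in symbols)
    seq = seq.upper().replace("U", "T")
    # The lookahead lets overlapping matches be found.
    return [m.start() + cut + 1 for m in re.finditer("(?=" + pattern + ")", seq)]

seq = "GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAG"   # start of the example sequence
print(find_cut_positions(seq, "G'AATTC"))   # ECORI
print(find_cut_positions(seq, "G'GYRCC"))   # BANI
```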

  Lists of infrequent cutters have the following form:

       0 AFLII
       0 AFLIII
       0 APAI
       0 APALI
       0 ASUII
       0 AVAI
       0 AVRII
       0 BCLI
       0 BGLI
       0 BGLII
       0 BSMI
       0 BSPMII
       0 BSTEII
    ...... etc

   Listings showing names above the sequence, and a translation have the
  following form:


   ECORI                   BANI BSP1286
   .                       .    .      BBVI         NSPBII
   .                       .    .      .            PVUII    BBVI
  GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
          10        20        30        40        50        60
   E  F  G  L  G  L  V  *  G  A  Q  R  L  L  R  R  S  C  W  C
    N  S  V  W  A  W  C  E  V  P  R  D  Y  C  A  A  A  A  G  A
     I  R  F  G  L  G  V  R  C  P  E  I  T  A  P  Q  L  L  V  L

                     HINCII
                     .   AVAII
                     .   .  BINI
                     .   .  . BSTNI
                     .   .  . .  BAMHI
                     .   .  . .  XHOII NSPBII
                     .   .  . .  .     . BINI     AHAII
                     .   .  . .  .     . .        . SALI
                     .   .  . .  .     . .        . .AATII
                     .   .  . .  .     . .        . .ACCI
                     .   .  . .  .     . .        . ..HINCII
  TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
          70        80        90       100       110       120
   W  R  C  G  G  R  Q  R  T  Q  G  S  A  G  R  G  R  R  R  R
    G  G  A  E  V  V  N  G  P  R  D  P  L  D  E  D  V  D  D  E
     A  V  R  R  S  S  T  D  P  G  I  R  W  T  R  T  S  T  T  R

                                               BBVI        BINI
  GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
         130       140       150       160       170       180
   G  G  G  G  *  R  I  A  G  G  W  Q  R  L  I  *  V  L  T  T
    E  E  V  D  S  A  L  L  V  A  G  S  D  *  F  E  F  *  P  L
     R  R  W  I  A  H  C  W  W  L  A  A  T  D  L  S  S  D  H  S

    XHOII
    .    HGAI       AHAII                      PFIMI
    .    .          .                          .   BBVI
  CAGATCCGGCGGCGGAGGCGTCGAGGCTCCCGAAACTCCCAGTGGCTGGCCTGCTAGATT
         190       200       210       220       230       240
   Q  I  R  R  R  R  R  R  G  S  R  N  S  Q  W  L  A  C  *  I
    R  S  G  G  G  G  V  E  A  P  E  T  P  S  G  W  P  A  R  F
     D  P  A  A  E  A  S  R  L  P  K  L  P  V  A  G  L  L  D  S

     .........etc
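
        The three phase translation shown under the sequence  can  be
  reproduced with the standard genetic  code.  The  sketch  below  is
  an illustration in Python, not the program's own  code;  the  names
  are hypothetical and incomplete codons at the end are  simply  left
  untranslated.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in T, C, A, G order.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(seq, frame):
    """Translate seq in reading frame 0, 1 or 2; unknown codons become '?'."""
    seq = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "?")
                   for i in range(frame, len(seq) - 2, 3))

seq = "GAATTCGGTTTGGGCTTGGTGTGAGGTGCC"
print(seq)
for frame in range(3):
    # Each successive frame is indented by one more column, residues 3 columns apart.
    print(" " * frame + "  ".join(translate(seq, frame)))
```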


        The terms "possible" and "definite" matches are important only
  for  back  translations  of  protein  into  DNA,  which  include IUB
  redundancy codes.  The matches that  the  program  terms  "definite
  matches"  are  ones  in  which the specification of the recognition
  sequence corresponds exactly to that of the  back  translation,  and
  consequently  are  definitely  in the DNA sequence. The program will
  also find what it terms  'possible  matches'  which  are  ones  that
  depend  on  the particular codons chosen for each amino acid.  These
  are sites at which recognition  sequences  could  be  engineered  to
  produce  a cut in the DNA without changing the amino acid, but which
  are not necessarily found in the original sequence.
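
        The distinction can be sketched in a few lines of Python. This
  is an illustration only, not the program's own code: the function name
  classify_site is hypothetical, and it assumes that a match is
  "definite" when every base allowed by the back translation is accepted
  by the recognition sequence, and "possible" when only some are.

```python
# IUB redundancy codes as sets of bases (standard NC-IUB definitions).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def classify_site(window, recognition):
    """Return 'definite', 'possible' or None for one alignment.

    window      -- stretch of back-translated DNA (may hold IUB codes)
    recognition -- recognition sequence of the same length
    (hypothetical helper; the subset rule is an assumption)
    """
    definite = True
    for seq_sym, rec_sym in zip(window, recognition):
        seq_bases, rec_bases = set(IUB[seq_sym]), set(IUB[rec_sym])
        if not seq_bases & rec_bases:
            return None          # no codon choice can produce this site
        if not seq_bases <= rec_bases:
            definite = False     # site exists only for some codon choices
    return "definite" if definite else "possible"


# Asp-Val back-translates to GAYGTN; an AATII site (GACGTC) needs the
# codons GAC and GTC to be chosen, so the match is only "possible".
print(classify_site("GAYGTN", "GACGTC"))
# Met-Trp has a single back translation, so a match to it is "definite".
print(classify_site("ATGTGG", "ATGTGG"))
```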

        The routine will handle both linear  and  circular  sequences,
  and  so  finds  cutsites  spanning the "ends" of circular sequences.
  The program will only find cutsites spanning the ends  of  sequences
  if  the  sequence  is declared as circular.  This includes sites for
  recognition sequences containing leading or trailing N  symbols,  in
  which  the  actual  recognition sequence does not span the join. For
  example if the recognition sequence was 'NNNNACGT and  the  first  4
  characters  in  the sequence were ACGT, then the match would only be
  found if the sequence was declared as circular. If the  sequence  is
  linear then the first fragment starts at base number 1, and the last
  ends at the last base. If the sequence is circular then  the  length
  of the first fragment is the clockwise distance from the last cut to
  the first.
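
        The fragment length rules above can be sketched as follows. This
  is a minimal illustration, not the program's code: the function name
  fragment_lengths is hypothetical, a cut position is assumed to be the
  number of the first base to the right of the cut, and the sequence
  length of 1023 is inferred from the fragment lengths in the example
  listing (111 + 912).

```python
def fragment_lengths(cut_positions, seq_length, circular=False):
    """Fragment lengths, in sequence order, for a set of cut positions.

    A position is taken to be the number of the first base to the
    right of the cut (an assumption based on the example listings).
    """
    cuts = sorted(cut_positions)
    if not cuts:
        return [seq_length]
    inner = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # One fragment runs clockwise from the last cut round to the first.
        return [seq_length - cuts[-1] + cuts[0]] + inner
    # Linear: the first fragment starts at base 1 and the last one
    # ends at the last base.
    return [cuts[0] - 1] + inner + [seq_length - cuts[-1] + 1]


# ACCI cuts at 112 and 420 in the example listing.
print(fragment_lengths([112, 420], 1023))                 # linear
print(fragment_lengths([112, 420], 1023, circular=True))  # circular
```

  For the linear case this gives 111, 308 and 604, as in the ACCI
  listing; for the circular case the two end pieces merge into one.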

        Graphical output marks the position of each string by a  short
  vertical  line  and  gives the name of the enzyme at the left end of
  the line. If the top of the screen is reached the program gives  the
  user  the  opportunity to  take  a hard copy and then will clear the
  screen and restart plotting results at the original start position.

        Below is an edited piece of dialogue from use  of  the  search
  option:
  ? Menu or option number=17

  Search for restriction enzyme sites
  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 = 2

    1 All enzymes
  X 2 Six cutters
    3 Four cutters
    4 Personal file
    5 Keyboard
  ? 0,1,2,3,4,5 =

  AATII/GACGT'C//
  ACCI/GT'MKAC//
  AFLII/C'TTAAG//
  AFLIII/A'CRYGT//
  AHAII/GR'CGYC//
  APAI/GGGCC'C//
  APALI/G'TGCAC//
  ASUII/TT'CGAA//
  AVAI/C'YCGRG//
  AVAII/G'GWCC//
  AVRII/C'CTAGG//
  BALI/TGG'CCA//
  BAMHI/G'GATCC//
  BANI/G'GYRCC//
  BANII/GRGCY'C//
  BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
  BCLI/T'GATCA//
  BGLI/GCCNNNN'NGGC//
  BGLII/A'GATCT//
  BINI/GGATCNNNN'/'NNNNNGATCC//
  BSMI/GAATGCN'/NG'CATTC//
  BSP1286/GDGCH'C//

  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =
    1 All enzymes
  X 2 Six cutters
    3 Four cutters
    4 Personal file
    5 Keyboard
  ? 0,1,2,3,4,5 =
  ? (y/n) (y) Search for all names
  X 1 Order results enzyme by enzyme
    2 Order results by position
    3 Show only infrequent cutters
    4 Show names above the sequence
  ? 0,1,2,3,4 =
  ? (y/n) (y) List matches
  ? (y/n) (y) The sequence is linear
  ? (y/n) (y) Search for definite matches

   Searching
   Matches found=     1
       Name                  Sequence            Position  Fragment lengths
     1 AATII                 GACGT'C                  112     111     111
                                                              912     912
   Matches found=     2
       Name                  Sequence            Position  Fragment lengths
     1 ACCI                  GT'CGAC                  112     111     111
     2 ACCI                  GT'AGAC                  420     308     308
                                                              604     604
   Matches found=     2
       Name                  Sequence            Position  Fragment lengths
     1 AHAII                 GA'CGTC                  109     108      90
     2 AHAII                 GG'CGTC                  199      90     108
                                                              825     825
   Matches found=     2
       Name                  Sequence            Position  Fragment lengths
     1 AVAII                 G'GACC                    84      83      51
     2 AVAII                 G'GTCC                   973     889      83
                                                               51     889
   Matches found=     1
       Name                  Sequence            Position  Fragment lengths
     1 BALI                  TGG'CCA                  258     257     257
                                                              766     766
   Matches found=     1
       Name                  Sequence            Position  Fragment lengths
     1 BAMHI                 G'GATCC                   92      91      91
                                                              932     932
   Matches found=     1
       Name                  Sequence            Position  Fragment lengths
     1 BANI                  G'GTGCC                   26      25      25
                                                              998     998
   Matches found=     1
       Name                  Sequence            Position  Fragment lengths
     1 BANII                 GAGCC'C                  490     489     489
                                                              534     534
   Matches found=    11
       Name                  Sequence            Position  Fragment lengths
     1 BBVI                  'TACTGCGCCGCAGCTGC        38      37       3
     2 BBVI                  GCAGCTGCTGGTG'            60      22      22
     3 BBVI                  GCAGCGACTGATT'           166     106      28
     4 BBVI                  'CCTGCTAGATTCGCTGC       230      64      37
     5 BBVI                  GCAGCGGTACGTA'           452     222      50
     6 BBVI                  'CTCGCCAACGTTGCTGC       502      50      55
     7 BBVI                  GCAGCCTTCAACT'           606     104      64
     8 BBVI                  'GAGGTATTCCTGGCTGC       634      28      97
     9 BBVI                  'CTGGCCGCCGCCGCTGC       869     235     104
    10 BBVI                  'GCCGCCGCCGCTGCTGC       872       3     106
    11 BBVI                  GCAGCGATGAGGA'           927      55     222

    ....etc

   X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =

    1 All enzymes
  X 2 Six cutters
    3 Four cutters
    4 Personal file
    5 Keyboard
  ? 0,1,2,3,4,5 =

  ? (y/n) (y) Search for all names

  X 1 Order results enzyme by enzyme
    2 Order results by position
    3 Show only infrequent cutters
    4 Show names above the sequence
  ? 0,1,2,3,4 = 2

  ? (y/n) (y) List matches
  ? (y/n) (y) The sequence is linear
  ? (y/n) (y) Search for definite matches

   Searching
       Name                  Sequence            Position  Fragment lengths
     1 ECORI                 G'AATTC                    2       1
     2 BANI                  G'GTGCC                   26      24
     3 BSP1286               GTGCC'C                   31       5
     4 BBVI                  'TACTGCGCCGCAGCTGC        38       7
     5 NSPBII                CAG'CTG                   51      13
     6 PVUII                 CAG'CTG                   51       0
     7 BBVI                  GCAGCTGCTGGTG'            60       9
     8 HINCII                GTC'AAC                   80      20
     9 AVAII                 G'GACC                    84       4
    10 BINI                  'CCAGGGATCC               87       3
    11 BSTNI                 CC'AGG                    89       2
    12 BAMHI                 G'GATCC                   92       3
    13 XHOII                 G'GATCC                   92       0
    14 NSPBII                CCG'CTG                   98       6
    15 BINI                  GGATCCGCT'               100       2
    16 AHAII                 GA'CGTC                  109       9
    17 SALI                  G'TCGAC                  111       2
    18 AATII                 GACGT'C                  112       1
    19 ACCI                  GT'CGAC                  112       0
    20 HINCII                GTC'GAC                  113       1

    .....etc

  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =

    1 All enzymes
  X 2 Six cutters
    3 Four cutters
    4 Personal file
    5 Keyboard
  ? 0,1,2,3,4,5 =

  ? (y/n) (y) Search for all names

    1 Order results enzyme by enzyme
  X 2 Order results by position
    3 Show only infrequent cutters
    4 Show names above the sequence
  ? 0,1,2,3,4 =3
  ? Maximum number of cuts (0-100) (0) =
  ? (y/n) (y) The sequence is linear
  ? (y/n) (y) Search for definite matches

   Searching
       0 AFLII
       0 AFLIII
       0 APAI
       0 APALI
       0 ASUII
       0 AVAI
       0 AVRII
       0 BCLI
       0 BGLI
       0 BGLII
       0 BSMI
       0 BSPMII
       0 BSTEII
       0 CLAI
       0 DRAI
       0 DRAII
       0 ECOB
       0 ECOK
       0 ECORV
       0 ESPI

     ......etc

  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =

    1 All enzymes
  X 2 Six cutters
    3 Four cutters
    4 Personal file
    5 Keyboard
  ? 0,1,2,3,4,5 =

  ? (y/n) (y) Search for all names

    1 Order results enzyme by enzyme
    2 Order results by position
  X 3 Show only infrequent cutters
    4 Show names above the sequence
  ? 0,1,2,3,4 =4
  ? (y/n) (y) Hide translation n
  ? (y/n) (y) Use 1 letter codes
  ? Line length (30-90) (60) =
  ? (y/n) (y) The sequence is linear
  ? (y/n) (y) Search for definite matches

   Searching
   ECORI                   BANI BSP1286
   .                       .    .      BBVI         NSPBII
   .                       .    .      .            PVUII    BBVI
  GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
          10        20        30        40        50        60
   E  F  G  L  G  L  V  *  G  A  Q  R  L  L  R  R  S  C  W  C
    N  S  V  W  A  W  C  E  V  P  R  D  Y  C  A  A  A  A  G  A
     I  R  F  G  L  G  V  R  C  P  E  I  T  A  P  Q  L  L  V  L

                     HINCII
                     .   AVAII
                     .   .  BINI
                     .   .  . BSTNI
                     .   .  . .  BAMHI
                     .   .  . .  XHOII NSPBII
                     .   .  . .  .     . BINI     AHAII
                     .   .  . .  .     . .        . SALI
                     .   .  . .  .     . .        . .AATII
                     .   .  . .  .     . .        . .ACCI
                     .   .  . .  .     . .        . ..HINCII
  TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
          70        80        90       100       110       120
   W  R  C  G  G  R  Q  R  T  Q  G  S  A  G  R  G  R  R  R  R
    G  G  A  E  V  V  N  G  P  R  D  P  L  D  E  D  V  D  D  E
     A  V  R  R  S  S  T  D  P  G  I  R  W  T  R  T  S  T  T  R

                                               BBVI        BINI
  GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
         130       140       150       160       170       180
   G  G  G  G  *  R  I  A  G  G  W  Q  R  L  I  *  V  L  T  T
    E  E  V  D  S  A  L  L  V  A  G  S  D  *  F  E  F  *  P  L
     R  R  W  I  A  H  C  W  W  L  A  A  T  D  L  S  S  D  H  S

   .......etc

  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =

    1 All enzymes
  X 2 Six cutters
    3 Four cutters
    4 Personal file
    5 Keyboard
  ? 0,1,2,3,4,5 =5
  Define search strings by typing a string name
  followed by the string(s)
  ? Name=FRED
  ? String(s)=AAAAAA/TTTTTT
  ? Name=MARY
  ? String(s)=CCCC/GGGG/GCGCT
  ? Name=
  ? (y/n) (y) Search for all names
  X 1 Order results enzyme by enzyme
    2 Order results by position
    3 Show only infrequent cutters
    4 Show names above the sequence
  ? 0,1,2,3,4 =
  ? (y/n) (y) List matches
  ? (y/n) (y) The sequence is linear
  ? (y/n) (y) Search for definite matches
   Searching
   Matches found=     9
       Name                  Sequence            Position  Fragment lengths
     1 FRED                  'TTTTTT                 1557    1556       1
     2 FRED                  'TTTTTT                 1558       1       1
     3 FRED                  'TTTTTT                 1559       1       1
     4 FRED                  'TTTTTT                 1560       1      22
     5 FRED                  'AAAAAA                 1582      22     529
     6 FRED                  'AAAAAA                 3160    1578    1019
     7 FRED                  'AAAAAA                 4204    1044    1044
     8 FRED                  'AAAAAA                 5691    1487    1487
     9 FRED                  'AAAAAA                 6710    1019    1556
                                                              529    1578
   Matches found=    36
       Name                  Sequence            Position  Fragment lengths
     1 MARY                  'CCCC                     47      46       1
     2 MARY                  'GGGG                    486     439       1
     3 MARY                  'GGGG                    487       1       1
     4 MARY                  'CCCC                    557      70       1
     5 MARY                  'CCCC                    558       1       1
     6 MARY                  'GCGCT                  1177     619       1

    ... etc

  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =
    1 All enzymes
  X 2 Six cutters
    3 Four cutters
    4 Personal file
    5 Keyboard
  ? 0,1,2,3,4,5 =5
  Define search strings by typing a string name
  followed by the string(s)
  ? Name=JANE
  ? String(s)=A'TTTT/CC'GGG
  ? Name=
  ? (y/n) (y) Search for all names
  X 1 Order results enzyme by enzyme
    2 Order results by position
    3 Show only infrequent cutters
    4 Show names above the sequence
  ? 0,1,2,3,4 =
  ? (y/n) (y) List matches
  ? (y/n) (y) The sequence is linear
  ? (y/n) (y) Search for definite matches
   Searching
   Matches found=    30
       Name                  Sequence            Position  Fragment lengths
     1 JANE                  A'TTTT                   437     436       6
     2 JANE                  A'TTTT                   546     109      33
     3 JANE                  A'TTTT                   597      51      43
     4 JANE                  A'TTTT                   777     180      51
     5 JANE                  A'TTTT                  1274     497      60
     6 JANE                  A'TTTT                  1571     297      62
     7 JANE                  CC'GGG                  1926     355      75
     8 JANE                  A'TTTT                  2403     477      81
     9 JANE                  A'TTTT                  2586     183      82
    10 JANE                  A'TTTT                  2731     145     101
    11 JANE                  A'TTTT                  2812      81     103

   ... etc


  X 1 Search
    2 List enzyme file
    3 Clear text
    4 Clear graphics
  ? 0,1,2,3,4 =!
 @18. TX 1 7 @ Compare a short sequence

        This  routine  slides  a  short  sequence  along  the  current
  sequence  and finds all positions at which a given percentage of the
  bases match.  Output is in both graphical and listed forms.

        If  users call for dialogue when the routine is selected  they
  will  be  given  the  choice  of  keyboard or file input. Define the
  string, select the "sense" to use and the percentage match.  Matches
  will  be  plotted  out  and  then  the  user can select to have them
  listed. Then the routine cycles around.

        The routine slides the search string along the   sequence  and
  marks  the positions at which a minimum percentage score is reached.
  The graphical output draws a vertical line at  the  match  position;
  the  height  of the line represents the percentage score, so that if
  the line reaches the top of the box the score is 100%.   The  NC-IUB
  symbols  may  be  used  in  the  search sequence to encode uncertain
  characters. Any other symbols will not match.


              NC-IUB SYMBOLS

        A,C,G,T
        R        (A,G)        'puRine'
        Y        (T,C)        'pYrimidine'
        W        (A,T)        'Weak'
        S        (C,G)        'Strong'
        M        (A,C)        'aMino'
        K        (G,T)        'Keto'
        H        (A,T,C)      'not G'
        B        (G,C,T)      'not A'
        V        (G,A,C)      'not T'
        D        (G,A,T)      'not C'
        N        (G,A,C,T)    'aNy'
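
        The sliding comparison can be sketched in Python as below. This
  is an illustration only, not the program's code: the function names
  are hypothetical. The threshold is rounded down to a whole number of
  bases because the example dialogue lists scores of 6 out of 9 for a 70
  percent cutoff; that rounding is an inference from the listing.

```python
# NC-IUB symbols as sets of bases, as in the table above.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}
COMPLEMENT = str.maketrans("ACGTRYWSMKHBVDN", "TGCAYRWSKMDVBHN")


def reverse_complement(probe):
    """The probe as read on the other strand (the other "sense")."""
    return probe.translate(COMPLEMENT)[::-1]


def percent_matches(sequence, probe, min_percent=70.0):
    """Return (position, score) for each window reaching the cutoff.

    Positions are 1-based; the score is the number of matching bases.
    Probe symbols outside the IUB table never match.
    """
    n = len(probe)
    needed = int(n * min_percent / 100.0)   # rounded down (assumption)
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        score = sum(base in IUB.get(sym, "")
                    for base, sym in zip(window, probe))
        if score >= needed:
            hits.append((start + 1, score))
    return hits


print(percent_matches("GGACATTTCGCGG", "AAATTTCCC"))
print(reverse_complement("AAATTTCCC"))
```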

   Typical dialogue is shown below.


  ? Menu or option number=18
   Find percentage matches
  ? (y/n) (y) Keep picture
  ? String=AAATTTCCC
  STRING=AAATTTCCC
  ? (y/n) (y) This sense
  ? Percent match (1.00-100.00) (70.00) =

   Missing graphics display here

  Total scoring positions above 70.000 percent =   7
  Scores         7      6      6      6      6      6      6
  Positions    365    212    213    292    311    358    627
  ? Display (0-7) (0) =3

         365
           ACATTTCGC
           * ***** *
           AAATTTCCC
           1

         212
           GAAACTCCC
            **  ****
           AAATTTCCC
           1

         213
           AAACTCCCA
           *** * **
           AAATTTCCC
           1
  ? (y/n) (y) Keep picture
  Default String=AAATTTCCC
  ? String=
  STRING=AAATTTCCC
  ? (y/n) (y) This sense n
  STRING=GGGAAATTT
  ? Percent match (1.00-100.00) (70.00) =

   Missing graphics display here

  Total scoring positions above 70.000 percent =   7
  Scores         6      6      6      6      6      6      6
  Positions    269    270    271    288    354    624    853
  ? Display (0-7) (0) =3

         269
           GAGGGATTT
           * *  ****
           GGGAAATTT
           1

         270
           AGGGATTTT
            ** * ***
           GGGAAATTT
           1

         271
           GGGATTTTC
           ****  **
           GGGAAATTT
           1
  ? (y/n) (y) Keep picture !

 @19. TX 7 @ Compare a short sequence using a score matrix

        This  routine  slides  a  short  sequence  along  the  current
  sequence  and  finds  all  positions  at  which  a  given  level  of
  similarity (a cutoff score) is reached. The score is defined by  use
  of a score matrix. Output is in both graphical and listed forms.

        If  users call for dialogue when the routine is selected  they
  will  be  given  the  choice  of  keyboard or file input. Define the
  string, select the "sense" to use and the cutoff score. Matches will
  be  plotted  out  and  then the user can select to have them listed.
  Then the routine cycles around.

        The routine slides the search string along the   sequence  and
  marks  the  positions  at  which  the cutoff score is achieved. The
  graphical output draws a vertical line at the  match  position;  the
  height  of  the  line  represents  the   score,  so that if the line
  reaches the top of the box the score is the maximum  possible.   The
  NC-IUB  symbols  may  be  used  in  the  search  sequence  to encode
  uncertain characters.

        The score matrix reflects the level of redundancy in the probe
  sequence  and  hence will put more emphasis on those characters that
  are better defined. The score matrix is:
               DNA SCORE MATRIX USING IUB SYMBOLS

          T  C  A  G  -  R  Y  W  S  M  K  H  B  V  D  N  ?

     T   36  0  0  0  9  0 18 18  0  0 18 12 12  0 12  9  0
     C    0 36  0  0  9  0 18  0 18 18  0 12 12 12  0  9  0
     A    0  0 36  0  9 18  0 18  0 18  0 12  0 12 12  9  0
     G    0  0  0 36  9 18  0  0 18  0 18  0 12 12 12  9  0
     -    9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
     R    0  0 18 18 18 36  0  9  9  9  9  6  6 12 12 18  0
     Y   18 18  0  0 18  0 36  9  9  9  9 12 12  6  6 18  0
     W   18  0 18  0 18  9  9 36  0  9  9 12  6  6 12 18  0
     S    0 18  0 18 18  9  9  0 36  9  9  6 12 12  6 18  0
     M    0 18 18  0 18  9  9  9  9 36  0 12  6 12  6 18  0
     K   18  0  0 18 18  9  9  9  9  0 36  6 12  6 12 18  0
     H   12 12 12  0 27  6 12 12  6 12  6 36  8  8  8 27  0
     B   12 12  0 12 27  6 12  6 12  6 12  8 36  8  8 27  0
     V    0 12 12 12 27 12  6  6 12 12  6  8  8 36  8 27  0
     D   12  0 12 12 27 12  6 12  6  6 12  8  8  8 36 27  0
     N    9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
     ?    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

    ? is any unrecognised character.
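
        Scoring with this matrix can be sketched in Python as follows.
  The sketch is illustrative only: the function names are hypothetical,
  and the matrix is simply copied from the table above and used as a
  lookup table.

```python
MATRIX_TEXT = """
     T  C  A  G  -  R  Y  W  S  M  K  H  B  V  D  N  ?
T   36  0  0  0  9  0 18 18  0  0 18 12 12  0 12  9  0
C    0 36  0  0  9  0 18  0 18 18  0 12 12 12  0  9  0
A    0  0 36  0  9 18  0 18  0 18  0 12  0 12 12  9  0
G    0  0  0 36  9 18  0  0 18  0 18  0 12 12 12  9  0
-    9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
R    0  0 18 18 18 36  0  9  9  9  9  6  6 12 12 18  0
Y   18 18  0  0 18  0 36  9  9  9  9 12 12  6  6 18  0
W   18  0 18  0 18  9  9 36  0  9  9 12  6  6 12 18  0
S    0 18  0 18 18  9  9  0 36  9  9  6 12 12  6 18  0
M    0 18 18  0 18  9  9  9  9 36  0 12  6 12  6 18  0
K   18  0  0 18 18  9  9  9  9  0 36  6 12  6 12 18  0
H   12 12 12  0 27  6 12 12  6 12  6 36  8  8  8 27  0
B   12 12  0 12 27  6 12  6 12  6 12  8 36  8  8 27  0
V    0 12 12 12 27 12  6  6 12 12  6  8  8 36  8 27  0
D   12  0 12 12 27 12  6 12  6  6 12  8  8  8 36 27  0
N    9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
?    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
"""


def load_matrix(text):
    """Turn the printed table into a nested dict: SCORES[row][column]."""
    lines = [line.split() for line in text.strip().splitlines()]
    columns = lines[0]
    return {row[0]: dict(zip(columns, map(int, row[1:])))
            for row in lines[1:]}


SCORES = load_matrix(MATRIX_TEXT)


def symbol(ch):
    """Map any unrecognised character to '?'."""
    return ch if ch in SCORES else "?"


def window_score(window, probe):
    """Sum the matrix entries for one alignment of probe and window."""
    return sum(SCORES[symbol(p)][symbol(s)] for s, p in zip(window, probe))


def matrix_matches(sequence, probe, cutoff):
    """Return (position, score) for each window scoring at least cutoff."""
    n = len(probe)
    hits = []
    for start in range(len(sequence) - n + 1):
        score = window_score(sequence[start:start + n], probe)
        if score >= cutoff:
            hits.append((start + 1, score))
    return hits


print(window_score("ACATTTCGC", "AAATTTCCC"))   # 7 exact matches of 36
print(window_score("AAATTTCCC", "AAATTTCCC"))   # maximum for 9 bases
```

  The first value is 252 and the second 324, which agree with the score
  and the maximum score shown in the dialogue for the string AAATTTCCC.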

    Typical dialogue is shown below.

  ? Menu or option number=19
   Find matches using a score matrix
  ? (y/n) (y) Keep picture
  ? String=AAATTTCCC
  STRING=AAATTTCCC
  ? (y/n) (y) This sense
  Minimum score=     0 Maximum score=   324
  ? Score (0-324) (280) =250

   Missing graphics display here

  For score   250 the number of matches=     1
  Scores       252
  Positions    365
  ? Display (0-1) (0) =1

         365
           ACATTTCGC
           * ***** *
           AAATTTCCC
           1
  ? (y/n) (y) Keep picture
  Default String=AAATTTCCC
  ? String=
  STRING=AAATTTCCC
  ? (y/n) (y) This sense n
  STRING=GGGAAATTT
  Minimum score=     0 Maximum score=   324
  ? Score (0-324) (222) = 200

   Missing graphics display here

  For score   200 the number of matches=     7
  Scores       216    216    216    216    216    216    216
  Positions    269    270    271    288    354    624    853
  ? Display (0-7) (0) =3

         269
           GAGGGATTT
           * *  ****
           GGGAAATTT
           1

         270
           AGGGATTTT
            ** * ***
           GGGAAATTT
           1

         271
           GGGATTTTC
           ****  **
           GGGAAATTT
           1
  ? (y/n) (y) Keep picture !

 @20. TX 7 @ Search for a motif using a weight matrix

        This function performs  searches  for  short  sequence  motifs
  using  an  appropriate  weight matrix. In addition it can be used to
  create or modify weight matrices. In order to perform a  search  the
  only  input  required  is the name of the file containing the weight
  matrix.  The results can be presented  graphically  or  listed.  The
  graphical presentation will draw a line at the position of any matches
  found; the height of the line is proportional to the score.

        For a search, select "use weight matrix", supply the  name  of
  the  file  containing  the  weight matrix, and choose between having
  results plotted  or  listed.  If  dialogue  is  requested  when  the
  function is selected users can alter the cutoff score employed.
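
        The search itself can be sketched in Python as below. This is an
  illustration only: the function name scan_with_weights and the small
  three-position matrix are hypothetical, and the sketch sums weights
  rather than logs of weights to keep it short.

```python
def scan_with_weights(sequence, weights, cutoff):
    """Slide a weight matrix along a sequence.

    weights -- one dict per motif position, mapping base to weight
    Returns (position, score, window) for each window whose summed
    weight reaches the cutoff; positions are 1-based.
    """
    width = len(weights)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(base, 0) for base, col in zip(window, weights))
        if score >= cutoff:
            hits.append((start + 1, score, window))
    return hits


# A made-up matrix for a three-base motif built from four sequences.
toy_weights = [{"T": 4}, {"A": 3, "C": 1}, {"G": 4}]
for position, score, window in scan_with_weights("CCTAGGTCGA",
                                                 toy_weights, 9):
    print(position, score, window)
```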

        To create a weight matrix several steps are involved.  A  file
  containing an alignment of known motifs is required. (This file must
  be created before the current option is selected. The  format is as
  follows:  each  sequence is written on a separate line with at least
  one space at the beginning; each sequence is terminated by  a  space
  character,  and  can  be  followed  by a name. The sequences must be
  aligned.) Supply the name of the  file  of  aligned  sequences.  The
  program  reads  and  displays the sequences. Choose between "summing
  logs of weights" or summing weights (i.e. whether to multiply or add
  weights).  If  logs  are used all scores will be negative. Choose if
  all positions in the set of aligned sequences should be used or if a
  mask should be applied. If so selected, define a mask as a string of
  symbols, in which symbol - means ignore and any other  symbol  means
  use. E.g. xx-x--abc means use all positions except 3,5 and 6.

        The program will calculate weights as the frequencies of  each
  base  at  each  unmasked  position  in the set of aligned sequences.
  These weights are then applied to the set of  aligned  sequences  to
  give  a range  of "observed" scores. The mean and standard deviation
  of these scores are displayed. The user is asked to  supply  several
  values  to  be  used  when  the  weight  matrix  is applied to other
  sequences: a cutoff score (by default, the  mean  minus  3  standard
  deviations),  a top score for scaling graphical results (by default,
  the mean plus 3 standard deviations), and  a  position  to  identify
  (this  means that if a particular base within the motif is used as a
  "landmark", such as the A of the AG in splice acceptor  sites,  then
  its  position  will be marked in plots). All these values are stored
  along with the weight matrix. Finally supply the name of a  file  to
  contain the weight matrix.

        Weight matrices can be  "rescaled"  using  a  set  of  aligned
  sequences  in much the same way as a matrix is created. The  purpose
  is to redefine the cutoff scores, and rescaling does not  alter  any
  other values in the weight matrix file.

        The methods have changed considerably but were first  outlined
  in  Staden,  R.  Nucl.  Acids Res.  12  505-519 1984, and Staden, R.
  Genetic engineering: principles and methods vol 7,  Edited  by  J.K.
  Setlow and A. Hollaender, Plenum publishing corp., 1985.

        The methods have always had to deal with the problem of zeroes
  in  the  matrices.  The  current  versions  employ "Laplace's Law of
  Succession" in which 1 is added to each term.

        It is now possible to  apply  a  mask  to  a  set  of  aligned
  sequences  in  order  to  give  weight  to  selected positions only.
  Sequences have superimposed functions: some parts may be of  general
  structural  importance  and  give  rise to an overall framework, and
  other parts give specificity and hence are not common; we  may  want
  to use a set of aligned sequences to define a motif, but want to use
  only the framework positions.  Alternatively we may want to pick out
  only  those  parts  of  a  set  of  aligned  sequences  that  give a
  particular property, and to ignore other similarities that  are  due
  to  some  other  property and which could obscure the pattern we are
  interested in. The ability to define a mask allows certain positions
  to  be  used  in  the  motif and others to be ignored, and yet still
  permits the use of a set of aligned sequences to calculate weights.
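
        The calculation described above (count bases at each unmasked
  position, add 1 to each term, then sum weights or logs of weights) can
  be sketched in Python as follows. This is an illustration, not the
  program's code. The function names are hypothetical, and the mask is
  written out to the full motif length. The divisor of 21 (the 16
  sequences plus 5) and the use of the population standard deviation are
  inferences: they were chosen because they agree with the scores listed
  in the example dialogue, and the text does not state them.

```python
import math
import statistics

# The GCN4 alignment used in the example dialogue.
ALIGNED = [
    "AGCGTGACTCTTCCCGGAA", "GAGGTGACTCACTTGGAAG", "CGGATGACTCTTTTTTTTT",
    "ACAGTGACTCACGTTTTTT", "GTCGTGACTCATATGCTTT", "TGAATGACTCACTTTTTGG",
    "TTCTTGACTCGTCTTTTCT", "CGAATGACTCTTATTGATG", "AGAATGACTAATTTTACTA",
    "TCGTTGACTCATTCTAATC", "TTGCTGACTCATTACGATT", "GAGATGACTCTTTTTCTTT",
    "GCGATGATTCATTTCTCTG", "TAGATGACTCAGTTTAGTC", "TAAGTGACTCAGTTCTTTC",
    "ATGATGACTCTTAAGCATG",
]


def count_matrix(aligned, mask=None):
    """Count each base at each position; '-' in the mask means ignore."""
    width = len(aligned[0])
    mask = mask or "x" * width
    counts = [dict.fromkeys("TCAG", 0) for _ in range(width)]
    for seq in aligned:
        for pos, base in enumerate(seq):
            if mask[pos] != "-":
                counts[pos][base] += 1
    return counts


def sum_score(window, counts):
    """Score by adding the weights (counts) together."""
    return sum(col.get(base, 0) for base, col in zip(window, counts))


def log_score(window, counts, n_seqs):
    """Score by summing logs, with 1 added to every count.

    The divisor n_seqs + 5 is an assumption inferred from the listing.
    """
    return sum(math.log((col.get(base, 0) + 1) / (n_seqs + 5))
               for base, col in zip(window, counts))


counts = count_matrix(ALIGNED, mask="----" + "x" * 15)
scores = [log_score(seq, counts, len(ALIGNED)) for seq in ALIGNED]
mean = statistics.mean(scores)
sd = statistics.pstdev(scores)          # population standard deviation
cutoff, top = mean - 3 * sd, mean + 3 * sd
print(round(scores[2], 3), round(mean, 3), round(cutoff, 3))
print(sum_score(ALIGNED[2], counts))
```

  Under these assumptions the third sequence scores about -20.890 with
  logs and 172 without, in line with the example dialogue.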

        Typical dialogue is shown below.

  ? Menu or option number=20
  X 1 Use weight matrix
    2 Make weight matrix
    3 Rescale weight matrix
  ? 0,1,2,3 =2
  ? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ

       1 AGCGTGACTCTTCCCGGAA HIS1
       2 GAGGTGACTCACTTGGAAG HIS1
       3 CGGATGACTCTTTTTTTTT HIS3
       4 ACAGTGACTCACGTTTTTT HIS4
       5 GTCGTGACTCATATGCTTT ARG3
       6 TGAATGACTCACTTTTTGG ARG4
       7 TTCTTGACTCGTCTTTTCT CPA1
       8 CGAATGACTCTTATTGATG CPA2
       9 AGAATGACTAATTTTACTA TRP5
      10 TCGTTGACTCATTCTAATC TRP3
      11 TTGCTGACTCATTACGATT TRP2
      12 GAGATGACTCTTTTTCTTT IV1
      13 GCGATGATTCATTTCTCTG IV2
      14 TAGATGACTCAGTTTAGTC LEU1
      15 TAAGTGACTCAGTTCTTTC LEU4
      16 ATGATGACTCTTAAGCATG ILS1
  Length of motif    19
  ? (y/n) (y) Sum logs of weights

  ? (y/n) (y) Use all motif positions n
  x means use, - means ignore
  e.g. xx-x---x-x means use positions 1,2,4,8,10
  ? Mask=----XXXXXXXX
   Applying weights to input sequences
     1      -27.979 AGCGTGACTCTTCCCGGAA
     2      -24.543 GAGGTGACTCACTTGGAAG
     3      -20.890 CGGATGACTCTTTTTTTTT
     4      -23.087 ACAGTGACTCACGTTTTTT
     5      -22.771 GTCGTGACTCATATGCTTT
     6      -23.408 TGAATGACTCACTTTTTGG
     7      -25.159 TTCTTGACTCGTCTTTTCT
     8      -22.679 CGAATGACTCTTATTGATG
     9      -24.751 AGAATGACTAATTTTACTA
    10      -23.157 TCGTTGACTCATTCTAATC
    11      -23.067 TTGCTGACTCATTACGATT
    12      -21.449 GAGATGACTCTTTTTCTTT
    13      -24.191 GCGATGATTCATTTCTCTG
    14      -23.770 TAGATGACTCAGTTTAGTC
    15      -22.923 TAAGTGACTCAGTTCTTTC
    16      -25.285 ATGATGACTCTTAAGCATG
  Top score     -20.890  Bottom score     -27.979
  Mean     -23.694  Standard deviation       1.613
  Mean minus 3.sd     -28.534  Mean plus 3.sd     -18.854
  ? Cutoff score (-999.00-9999.00) (-28.53) =
  ? Top score for scaling plots (-28.53-999.00) (-18.85) =
  ? Position to identify (0-19) (1) =
  ? Title=GCN4 SEQUENCES
  ? Name for new weight matrix file=1.WTS


  ? Menu or option number=20
  X 1 Use weight matrix
    2 Make weight matrix
    3 Rescale weight matrix
  ? 0,1,2,3 =3
  ? Name of existing weight matrix file=1.WTS
   GCN4 SEQUENCES
  ? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
  Length of motif    19
  ? (y/n) (y) Sum logs of weights n
  ? (y/n) (y) Use all motif positions

   Applying weights to input sequences
     1      128.000 AGCGTGACTCTTCCCGGAA
     2      148.000 GAGGTGACTCACTTGGAAG
     3      172.000 CGGATGACTCTTTTTTTTT
     4      160.000 ACAGTGACTCACGTTTTTT
     5      161.000 GTCGTGACTCATATGCTTT
     6      157.000 TGAATGACTCACTTTTTGG
     7      149.000 TTCTTGACTCGTCTTTTCT
     8      160.000 CGAATGACTCTTATTGATG
     9      151.000 AGAATGACTAATTTTACTA
    10      159.000 TCGTTGACTCATTCTAATC
    11      158.000 TTGCTGACTCATTACGATT
    12      169.000 GAGATGACTCTTTTTCTTT
    13      152.000 GCGATGATTCATTTCTCTG
    14      157.000 TAGATGACTCAGTTTAGTC
    15      160.000 TAAGTGACTCAGTTCTTTC
    16      143.000 ATGATGACTCTTAAGCATG
  Top score     172.000  Bottom score     128.000
  Mean     155.250  Standard deviation      10.034
  Mean minus 3.sd     125.147  Mean plus 3.sd     185.353
  ? Cutoff score (-999.00-9999.00) (125.15) =
  ? Top score for scaling plots (125.15-999.00) (185.35) =
  ? Position to identify (0-19) (1) =
  ? Title=GCN4 SEQUENCES
  ? Name for new weight matrix file=2.WTS


  ? Menu or option number=20
  X 1 Use weight matrix
    2 Make weight matrix
    3 Rescale weight matrix
  ? 0,1,2,3 =
  ? Motif weight matrix file=1.WTS
   GCN4 SEQUENCES
  ? (y/n) (y) Plot results n

      153    -22.61 GCAGCGACTGATTTGAGTT
      169    -28.53 GTTCTGACCACTCAGATCC
      172    -27.27 CTGACCACTCAGATCCGGC
      219    -27.35 CCAGTGGCTGGCCTGCTAG
      268    -27.82 CGAGGGATTTTCGATCTTG
      274    -26.99 ATTTTCGATCTTGTGGATG
      283    -25.79 CTTGTGGATGATTTTCACG
      287    -27.50 TGGATGATTTTCACGTGCG
      298    -28.17 CACGTGCGCCGTCATATTG
      332    -28.27 TCTTTGAAGCAGAAGGGAC
      351    -28.27 AGGGGTACACTTTCACATT
      357    -25.05 ACACTTTCACATTTCGCTT
      364    -28.51 CACATTTCGCTTATGGGAG
      400    -23.77 GAAGTTACTAATGTGCGTG
      451    -26.22 ATGCTCGCCCTCTTTGGTG
      476    -28.00 TCCCTCACTGAGCCCTCCG
      480    -28.33 TCACTGAGCCCTCCGCCTC
      517    -23.46 GCTAAGATTCAGCTTGGTT
      556    -27.27 TCCAGCACTCAGGTTCGGC
      602    -27.01 AACTTGAATCCATCGTTGC
      648    -28.45 TGCTAAACACAGCCGGTTT
      679    -28.18 CTGTTTGCCCAGTTTGGGC
      691    -28.51 TTTGGGCCGCTTCTGGACG
      713    -27.67 GGCTTGACCGTGGCTGTGG
      803    -25.47 ATGCTGACCATGCTTTTCA
      848    -28.11 ATAATGTTAAGTTTGATTC
      857    -25.97 AGTTTGATTCCGCTGGCCG
      879    -27.85 CCGCTGCTGCTGTTTCCAC
      917    -27.77 GCGATGAGGAAGGCTTGTT
      931    -27.81 TTGTTGGCGCGCCTGCTCG
      952    -23.52 GAGGTGACTACCATCCGTG
      977    -28.40 TGCGTGGGTGAGCTGTTGT




  ? Menu or option number=6
  Page through text files
  ? Name of file to read=1.WTS
   GCN4 SEQUENCES
       19     1   -28.534   -18.854
   P   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
   N  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16
   T   0   0   0   0  16   0   0   1  16   0   5  11  10  12   9   6   7  12   6
   C   0   0   0   0   0   0   0  15   0  15   0   3   2   2   4   3   2   1   3
   A   0   0   0   0   0   0  16   0   0   1  10   0   3   2   0   3   5   2   2
   G   0   0   0   0   0  16   0   0   0   0   1   2   1   0   3   4   2   1   5
  End of file

 @21. TX 3 @ Count base composition

        This routine calculates the base  composition  of  the  active
  region of the sequence as both totals and percentages.
 @22. TX 3 @ Count dinucleotide frequencies

        This routine simply counts dinucleotide  frequencies  for  the
  currently  active  region  of  the  sequence.  It also calculates an
  expected distribution based on the  base  composition.   The  output
  looks like:
                T             C             A             G
          obs  expected obs  expected obs  expected obs  expected

       T   8.44   8.25   6.67   7.01  10.35   9.92   3.27   3.54
       C   7.49   7.01   6.76   5.95   8.39   8.43   1.76   3.01
       A  10.13   9.92   7.78   8.43  11.74  11.93   4.89   4.26
       G   2.67   3.54   3.19   3.01   4.06   4.26   2.42   1.52
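
        The figures in such a table can be sketched in Python as below.
  This is an illustration only: the function name dinucleotide_table is
  hypothetical, and the expected value is taken to be the product of
  the two base frequencies, which agrees with the expected columns in
  the table above.

```python
from collections import Counter


def dinucleotide_table(seq):
    """Return {pair: (observed %, expected %)} for the 16 dinucleotides.

    Expected values assume the two bases occur independently, so each
    is the product of the two single-base frequencies (an assumption).
    """
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    total_pairs = sum(pairs.values())
    bases = Counter(seq)
    table = {}
    for first in "TCAG":
        for second in "TCAG":
            observed = 100.0 * pairs[first + second] / total_pairs
            expected = (100.0 * (bases[first] / len(seq))
                        * (bases[second] / len(seq)))
            table[first + second] = (observed, expected)
    return table


for pair, (obs, exp) in dinucleotide_table("ACGTACGGTTAC").items():
    print(pair, round(obs, 2), round(exp, 2))
```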

 @23. TX 3 5 @ Count codons and amino acids

        This function counts codons, amino acid  composition,  protein
  molecular  weights,  and base composition. Users select the segments
  of the sequence that the program should analyse.

        Choose  between  being  shown  observed   counts   or   counts
  normalised so that the totals for each amino acid sum to 100. Select
  to define segments using either the  keyboard  or  an  EMBL  feature
  table.   Define  the  segments to count over. Select strand for each
  segment. Stop selecting segments by typing a zero  for  "Count  from
  ()".  The  results are displayed a screenful at a time, and the bell
  is sounded to show there is more to come. A zero start position,  or
  the  end  of an EMBL feature table, signals the routine to print out
  totals for all values.

        The  counts  are  broken  down  into  several  figures.   Base
  composition  by  position in codon expressed as a percentage of each
  base's  own  frequency;  base  composition  by  position  in   codon
  expressed  as  a  percentage  of the overall base composition of the
  section; base composition expected for this amino  acid  composition
  if  there  was  no  codon  preference;  percentage deviations of the
  observed  amino  acid  composition  from  an  average   amino   acid
  composition.
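
        The first two figures can be sketched in Python as below. In
  the first table each base sums to 100 over the three codon positions;
  in the second each codon position sums to 100 over the four bases.
  The function name is hypothetical and the sketch assumes a single
  in-frame segment on the + strand.

```python
def codon_position_composition(seq, bases="TCAG"):
    """Return (per_base, per_position) percentage tables for one segment."""
    seq = seq.upper()
    counts = [{b: 0 for b in bases} for _ in range(3)]
    usable = len(seq) - len(seq) % 3      # ignore a trailing partial codon
    for i in range(usable):
        if seq[i] in counts[i % 3]:
            counts[i % 3][seq[i]] += 1
    base_totals = {b: sum(c[b] for c in counts) for b in bases}
    # Percentage of each base's own total found at each codon position.
    per_base = [{b: 100.0 * c[b] / (base_totals[b] or 1) for b in bases}
                for c in counts]
    # Percentage of each codon position occupied by each base.
    per_position = [{b: 100.0 * c[b] / (sum(c.values()) or 1) for b in bases}
                    for c in counts]
    return per_base, per_position
```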

        The output looks like:

        ===========================================
        F TTT   1. S TCT   2. Y TAT   2. C TGT   1.
        F TTC   1. S TCC   1. Y TAC   3. C TGC   2.
        L TTA   7. S TCA   4. * TAA   9. * TGA   1.
        L TTG   2. S TCG   1. * TAG   2. W TGG   2.
        ===========================================
        L CTT   3. P CCT   2. H CAT   4. R CGT   1.
        L CTC   2. P CCC   3. H CAC   1. R CGC   0.
        L CTA   3. P CCA   2. Q CAA   4. R CGA   0.
        L CTG   2. P CCG   2. Q CAG   1. R CGG   2.
        ===========================================
        I ATT   9. T ACT   1. N AAT   7. S AGT   3.
        I ATC   2. T ACC   2. N AAC   4. S AGC   2.
        I ATA   4. T ACA   5. K AAA  13. R AGA   5.
        M ATG   1. T ACG   2. K AAG   4. R AGG   1.
        ===========================================
        V GTT   2. A GCT   2. D GAT   1. G GGT   3.
        V GTC   2. A GCC   2. D GAC   1. G GGC   1.
        V GTA   4. A GCA   3. E GAA   2. G GGA   1.
        V GTG   2. A GCG   0. E GAG   1. G GGG   1.
        ===========================================
    total codons=      166.
            T          C          A          G

    1     31.06      33.68      34.03      35.00
    2     35.61      35.79      30.89      32.50
    3     33.33      30.53      35.08      32.50

    1     24.70      19.28      39.16      16.87
    2     28.31      20.48      35.54      15.66
    3     26.51      17.47      40.36      15.66
    %     26.51      19.08      38.35      16.06  observed, overall totals
    %     25.00      22.26      33.10      19.65  expected, even codons per acid

            A    C    D    E    F    G    H    I    K    L
            7.   3.   2.   3.   2.   6.   5.  15.  17.  19.
   o-e %  -47. -33. -76. -68. -64. -54.  62. 116.  67.  67.

            M    N    P    Q    R    S    T    V    W    Y
            1.  11.   9.   5.   9.  13.  10.  10.   2.   5.
   o-e %  -62.  66.  12. -17.  19.  21.   6.  -2.   0.  -5.
   total acids=  154. molecular weight=    17421.

   Typical dialogue follows.

  ? Menu or option number=23
   Calculate codon usage, base composition
   and amino acid composition
  ? (y/n) (y) Show observed counts
  ? (y/n) (y) Define segments using keyboard
  ? Count from (0-1023) (0) =1
  ? Count to (1-1023) (1023) =1000
  ? (y/n) (y) + strand

       ===========================================
       F TTT  13. S TCT   1. Y TAT   1. C TGT   3.
       F TTC   4. S TCC  10. Y TAC   1. C TGC   7.
       L TTA   1. S TCA   0. * TAA   1. * TGA   4.
       L TTG   4. S TCG   1. * TAG   3. W TGG   5.
       ===========================================
       L CTT   9. P CCT   1. H CAT   3. R CGT  14.
       L CTC   7. P CCC   0. H CAC   7. R CGC  14.
       L CTA   0. P CCA   0. Q CAA   4. R CGA   9.
       L CTG  12. P CCG   1. Q CAG   9. R CGG   8.
       ===========================================
       I ATT   7. T ACT   4. N AAT   4. S AGT   1.
       I ATC   4. T ACC   5. N AAC   3. S AGC   7.
       I ATA   1. T ACA   1. K AAA   3. R AGA   2.
       M ATG   2. T ACG   1. K AAG   2. R AGG   2.
       ===========================================
       V GTT  11. A GCT  13. D GAT   6. G GGT   9.
       V GTC   5. A GCC  10. D GAC   9. G GGC  11.
       V GTA   6. A GCA   5. E GAA   6. G GGA  12.
       V GTG   8. A GCG   5. E GAG   3. G GGG   8.
       ===========================================


   Total codons=      333.
           T          C          A          G

   1     23.32      37.69      28.99      40.06
   2     37.15      22.31      38.46      36.59
   3     39.53      40.00      32.54      23.34
         -----      -----      -----      -----
   =     100%       100%       100%       100%

   1     17.72      29.43      14.71      38.14  = 100%
   2     28.23      17.42      19.52      34.83  = 100%
   3     30.03      31.23      16.52      22.22  = 100%
   %     25.33      26.03      16.92      31.73  Observed, overall totals
   %     24.44      22.31      20.90      32.35  Expected, even codons per acid

           A    C    D    E    F    G    H    I    K    L
          33.  10.  15.   9.  17.  40.  10.  12.   5.  33.
  O-E %   22.  81. -13. -55.  34.  71.  40. -29. -73.  13.

           M    N    P    Q    R    S    T    V    W    Y
           2.   7.   2.  13.  49.  20.  11.  30.   5.   2.
  O-E %  -74. -51. -88.   0. 165. -11. -42.  40.  18. -81.
  Total acids=  325. Molecular weight=    35831. Hydrophobicity= -17.8


  ? Count from (0-1023) (0) =

      Codon totals over all genes
       ===========================================
       F TTT  13. S TCT   1. Y TAT   1. C TGT   3.
       F TTC   4. S TCC  10. Y TAC   1. C TGC   7.
       L TTA   1. S TCA   0. * TAA   1. * TGA   4.
       L TTG   4. S TCG   1. * TAG   3. W TGG   5.
       ===========================================
       L CTT   9. P CCT   1. H CAT   3. R CGT  14.
       L CTC   7. P CCC   0. H CAC   7. R CGC  14.
       L CTA   0. P CCA   0. Q CAA   4. R CGA   9.
       L CTG  12. P CCG   1. Q CAG   9. R CGG   8.
       ===========================================
       I ATT   7. T ACT   4. N AAT   4. S AGT   1.
       I ATC   4. T ACC   5. N AAC   3. S AGC   7.
       I ATA   1. T ACA   1. K AAA   3. R AGA   2.
       M ATG   2. T ACG   1. K AAG   2. R AGG   2.
       ===========================================
       V GTT  11. A GCT  13. D GAT   6. G GGT   9.
       V GTC   5. A GCC  10. D GAC   9. G GGC  11.
       V GTA   6. A GCA   5. E GAA   6. G GGA  12.
       V GTG   8. A GCG   5. E GAG   3. G GGG   8.
       ===========================================


   Total codons=      333.
           T          C          A          G

   1     23.32      37.69      28.99      40.06
   2     37.15      22.31      38.46      36.59
   3     39.53      40.00      32.54      23.34
         -----      -----      -----      -----
   =     100%       100%       100%       100%

   1     17.72      29.43      14.71      38.14  = 100%
   2     28.23      17.42      19.52      34.83  = 100%
   3     30.03      31.23      16.52      22.22  = 100%
   %     25.33      26.03      16.92      31.73  Observed, overall totals
   %     24.44      22.31      20.90      32.35  Expected, even codons per acid

           A    C    D    E    F    G    H    I    K    L
          33.  10.  15.   9.  17.  40.  10.  12.   5.  33.
  O-E %   22.  81. -13. -55.  34.  71.  40. -29. -73.  13.

           M    N    P    Q    R    S    T    V    W    Y
           2.   7.   2.  13.  49.  20.  11.  30.   5.   2.
  O-E %  -74. -51. -88.   0. 165. -11. -42.  40.  18. -81.
  Total acids=  325. Molecular weight=    35831. Hydrophobicity= -17.8

 @24. TX 3 @ Plot base composition

        This option plots the base composition of  the  sequence.  The
  counts for any combination of bases can be plotted.

        If dialogue is requested the user is presented  with  a  check
  box for selecting which bases should be counted, and then allowed to
  define a window length, and a "plot  interval".  Otherwise,  the  AT
  composition  is  plotted with a window of 101 and a plot interval of
  5.

        Typical dialogue follows.
  ? Menu or option number=d24
   Plot base composition

  checkbox: those set are marked X
  X 1 T
    2 C
  X 3 A
    4 G
  ? 0,1,2,3,4 =1

  checkbox: those set are marked X
    1 T
    2 C
  X 3 A
    4 G
  ? 0,1,2,3,4 =3

  checkbox: those set are marked X
    1 T
    2 C
    3 A
    4 G
  ? 0,1,2,3,4 =2

  checkbox: those set are marked X
    1 T
  X 2 C
    3 A
    4 G
  ? 0,1,2,3,4 =4

  checkbox: those set are marked X
    1 T
  X 2 C
    3 A
  X 4 G
  ? 0,1,2,3,4 =

  ? odd span length (1-201) (31) =
  ? plot interval (1-11) (5) =

   missing graphics



 @25. TX 3 @ Plot local deviations in base composition

        The "local deviation" routines are designed  to  indicate  the
  similarity  of  the compositions of different parts of the sequence.
  The composition of every segment of the sequence is compared with  a
  standard  composition.  The levels of similarity are plotted as chi
  squared values. The standard can be the  composition  of  the  whole
  sequence,  or  alternatively  that of a small segment defined by the
  user.

        If dialogue is forced, define the standard region, the  window
  length and the plot interval. Otherwise the composition of the whole
  sequence is taken as a standard. The maximum  and  minimum  observed
  value  of  the  chi squared calculation is displayed, and plots will
  always exactly fill the available box. Any unusual regions will show
  as peaks.

        The following  measure  is  used:  for  each  window  position
  calculate  (sum((obs-exp)*(obs-exp))/(exp*exp))  where  obs  is  the
  observed composition  and  exp  is  the  expected  composition  (the
  composition  of the standard).  The calculation is performed once to
  find out the range of values and is then  repeated  and  plotted  so
  that the plot exactly fills the allocated screen space.
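
        A minimal Python sketch of this two pass calculation is given
  below. The measure is read here as a sum over the four bases of
  (obs-exp)*(obs-exp)/(exp*exp); that per-term reading, and all of the
  names used, are assumptions made for illustration.

```python
def composition(segment, alphabet="TCAG"):
    """Fractional base composition of a non-empty segment."""
    n = len(segment)
    return {b: segment.count(b) / n for b in alphabet}


def local_deviation(seq, window, interval, standard=None):
    """Return [(window start, score)] using 1-based start positions."""
    exp = composition(seq if standard is None else standard)
    points = []
    for start in range(0, len(seq) - window + 1, interval):
        obs = composition(seq[start:start + window])
        score = sum((obs[b] - exp[b]) ** 2 / (exp[b] * exp[b])
                    for b in exp if exp[b] > 0)
        points.append((start + 1, score))
    return points


def scale_to_box(points):
    """Second pass: rescale so the minimum maps to 0 and the maximum to 1."""
    values = [v for _, v in points]
    lo, hi = min(values), max(values)
    spread = (hi - lo) or 1.0
    return [(pos, (v - lo) / spread) for pos, v in points]
```

        Passing a short segment as standard corresponds to the user
  defined standard region; leaving it out uses the whole sequence.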
 @26. TX 3 @ Plot local deviations from dinucleotide composition

        The "local deviation" routines are designed  to  indicate  the
  similarity  of  the compositions of different parts of the sequence.
  The dinucleotide composition of every segment  of  the  sequence  is
  compared  with  a standard composition. The levels of similarity are
  plotted  as  chi squared values. The standard can be the composition
  of  the  whole  sequence,  or  alternatively that of a small segment
  defined by the user.

        If dialogue is forced, define the standard region, the  window
  length and the plot interval. Otherwise the composition of the whole
  sequence is taken as a standard. The maximum  and  minimum  observed
  value  of  the  chi squared calculation is displayed, and plots will
  always exactly fill the available box. Any unusual regions will show
  as peaks.

        The following  measure  is  used:  for  each  window  position
  calculate  (sum((obs-exp)*(obs-exp))/(exp*exp))  where  obs  is  the
  observed composition  and  exp  is  the  expected  composition  (the
  composition  of the standard).  The calculation is performed once to
  find out the range of values and is then  repeated  and  plotted  so
  that the plot exactly fills the allocated screen space.
 @27. TX 3 @ Plot local deviations from trinucleotide composition

        The "local deviation" routines are designed  to  indicate  the
  similarity  of  the compositions of different parts of the sequence.
  The trinucleotide composition of every segment of  the  sequence  is
  compared  with  a standard composition. The levels of similarity are
  plotted  as  chi squared values. The standard can be the composition
  of  the  whole  sequence,  or  alternatively that of a small segment
  defined by the user.

        If dialogue is forced, define the standard region, the  window
  length and the plot interval. Otherwise the composition of the whole
  sequence is taken as a standard. The maximum  and  minimum  observed
  value  of  the  chi squared calculation is displayed, and plots will
  always exactly fill the available box. Any unusual regions will show
  as peaks.

        The following  measure  is  used:  for  each  window  position
  calculate  (sum((obs-exp)*(obs-exp))/(exp*exp))  where  obs  is  the
  observed composition  and  exp  is  the  expected  composition  (the
  composition  of the standard).  The calculation is performed once to
  find out the range of values and is then  repeated  and  plotted  so
  that the plot exactly fills the allocated screen space.
 @28. TX 5 @ Calculate codon constraint

        The purpose of this option (which is somewhat specialised)  is
  to measure the level of constraint imposed on the sequence by coding
  for a protein of the observed composition. It measures the  strength
  of  the  codon  bias averaged over windows of 99 codons and displays
  the values observed.

        Select between defining segments at the keyboard or  using  an
  EMBL  feature  table.  Finish  selecting  segments  by typing a zero
  start. The value for each segment is displayed:

        Mean (W-EW) / EWD, window 99      10.5

        The codon constraint is the difference  between  the  observed
  codon improbability and the mean improbability for a sequence of the
  same composition.  See McLachlan, Staden and  Boswell,  Nucl.  Acids
  Res. 1984.
 @59. TX 3 @ Plot negentropy

        This routine is designed to show regions of the sequence  that
  differ  in  composition  from  others,  and  hence is like the "plot
  deviation.." routines.

        Negentropy or information is defined in the following way: let
  Pi  be  the  probability  of observing base i, where i = A,C,G or T,
  then the average information per base is  I=-sum(Pi.Log(Pi))    (sum
  over  all  i). This routine calculates Pi by calculating the overall
  composition for the sequence and then plots I for windows of  length
  defined by the user.
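
        How I is evaluated for each window is not spelled out above.
  The Python sketch below assumes that every base in a window
  contributes -Log(Pi), with Pi taken from the whole sequence, and that
  the window value is the mean of those contributions. Under that
  assumption a window covering the whole sequence gives exactly I. The
  function name is hypothetical.

```python
import math


def negentropy_profile(seq, window, interval=1):
    """Return (I for the whole sequence, [(window start, window value)])."""
    seq = seq.upper()
    n = len(seq)
    # Pi: overall probability of each base in the sequence.
    p = {b: seq.count(b) / n for b in set(seq)}
    overall = -sum(pi * math.log(pi) for pi in p.values())
    # Assumed per-base contribution: -log(Pi) of the base at that position.
    contribution = [-math.log(p[b]) for b in seq]
    profile = []
    for start in range(0, n - window + 1, interval):
        mean = sum(contribution[start:start + window]) / window
        profile.append((start + 1, mean))
    return overall, profile
```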
 @30. TX 4 @ Search for hairpin loops

        Used to find simple  inverted  repeats  or  potential  hairpin
  loops. The loops are defined by a range of sizes for the loop and  a
  minimum number of consecutive base pairs in the  stem.  The  results
  can  be  presented graphically or listed. A-T, G-C and G-T basepairs
  are counted.

        Define the range of loop  sizes  and  the  minimum  number  of
  consecutive  basepairs  required.  Choose  between plotted or listed
  results.

        The loops found are plotted as blips on a horizontal line that
  represents  the  sequence; the heights of the lines are proportional
  to  the  number  of  basepairs  in  the  stems.   Note   that   only
  uninterrupted  stems are found - i.e. all basepairs must be made. To
  look for stems with some unpaired bases (or for palindromes) use the
  inverted repeat motif class in the pattern searching option.
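
        The search for uninterrupted stems can be sketched in Python as
  follows. The defaults mirror those in the dialogue below; the
  function name, and the choice to report the position of the last base
  before the loop, are assumptions made for illustration.

```python
# Base pairs counted in a stem: A-T, G-C and G-T, in either orientation.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def find_hairpins(seq, min_loop=1, max_loop=3, min_stem=6):
    """Return [(position before loop (1-based), loop size, stem length)]."""
    seq = seq.upper()
    n = len(seq)
    hits = []
    for left in range(n):                       # last base of the 5' arm
        for loop in range(min_loop, max_loop + 1):
            right = left + loop + 1             # first base of the 3' arm
            stem = 0
            # Walk outwards while consecutive bases still pair.
            while (left - stem >= 0 and right + stem < n
                   and (seq[left - stem], seq[right + stem]) in PAIRS):
                stem += 1
            if stem >= min_stem:
                hits.append((left + 1, loop, stem))
    return hits
```

        For example, find_hairpins("GGGCGCAAAGCGCCC") reports a single
  stem of 6 basepairs closing a loop of 3 bases.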

        Typical dialogue follows.
  ? Menu or option number=30
   Search for hairpin loops
  Define the range of loop sizes
  ? Minimum loop size (1-30) (1) =
  ? Maximum loop size (3-60) (3) =
  ? Minimum number of basepairs (2-20) (6) =
  ? (y/n) (y) Plot results n
   Searching

            T.G
            G-C
            G.T
            T.G
            C-G
            G-C
            T.G
            C-G
            G.T
       GCCGCA GCGGAGG
           49

             G
            G-C
            T.G
            C-G
            G.T
            T.G
            G-C
       CTGCTG GGAGGTC
           56


             G
            T.G
            G-C
            G.T
            T.G
            C-G
            G-C
            T-A
            T.G
       AGCGCA CGACTGA
          139

            A C
            G.T
            C-G
            G.T
            C-G
            C-G
            G-C
       TTCGCT CAACGCC
          244

 @31. TX 4 @ Search for long range inverted repeats

        Searches for inverted repeats. The  repeats  found  are  exact
  matches  of  at  least 6 consecutive bases. Results can be presented
  graphically or listed.  Plotted  results  show  the  end  points  of
  repeats joined by rectangular lines.

        If dialogue is not  requested  the  defaults  will  be  taken.
  Otherwise  choose  between  plotted  or  listed results. If required
  select to analyse a  restricted  segment  of  the  currently  active
  region. Choose a repeat length.

        Typical dialogue follows.
  ? Menu or option number=D31
   Plot long-range inverted repeats
  ? (y/n) (y) Plot results n
  Define restricted region
  ? start (1-1023) (1) =
  ? end (2-1023) (1023) =
  ? Minimum inverted repeat (6-30) (12) =10
   Searching
      27     909      10  TGCCCAGAGA

 @32. TX 4 @ Search for repeats

        Searches for direct  repeats.  The  repeats  found  are  exact
  matches  of  at  least 6 consecutive bases. Results can be presented
  graphically or listed.  Plotted  results  show  the  end  points  of
  repeats joined by rectangular lines.

        If dialogue is not  requested  the  defaults  will  be  taken.
  Otherwise  choose  between  plotted  or  listed results. If required
  select to analyse a  restricted  segment  of  the  currently  active
  region. Choose a repeat length.
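
        A minimal Python sketch of an exact direct repeat search is
  given below. It indexes every word of the minimum length and extends
  each matching pair to the right. Unlike the listing below, which
  also shows shorter matches shifted by one base, the sketch drops a
  pair that merely continues a match starting one base earlier. The
  function name and that filtering choice are assumptions.

```python
from collections import defaultdict


def direct_repeats(seq, min_len=12):
    """Return sorted [(pos1, pos2, length, word)] with 1-based positions."""
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        index[seq[i:i + min_len]].append(i)
    repeats = []
    for positions in index.values():
        for k, a in enumerate(positions):
            for b in positions[k + 1:]:
                # Skip pairs already covered by a match one base to the left.
                if a > 0 and seq[a - 1] == seq[b - 1]:
                    continue
                length = min_len
                # Extend to the right while the two copies still agree.
                while b + length < len(seq) and seq[a + length] == seq[b + length]:
                    length += 1
                repeats.append((a + 1, b + 1, length, seq[a:a + length]))
    return sorted(repeats)
```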

        Typical dialogue follows.
   ? Menu or option number=D32
   Plot repeats
  ? (y/n) (y) Plot results n
  Define restricted region
  ? start (1-1023) (1) =
  ? end (2-1023) (1023) =
  ? Minimum repeat (6-30) (12) =8
   Searching
     619     988       8  GCTGTTGT
     514     646       8  GCTGCTAA
      94     865       8  TCCGCTGG
     146     222       9  GTGGCTGGC
     455     497       8  TCGCCCTC
     454     496       9  CTCGCCCTC
     872     875       8  GCCGCCGC
     510     615       8  CGTTGCTG
     152     913       8  GGCAGCGA
     199     265       8  CGTCGAGG
     689     794       8  AGTTTGGG
     147     223       8  TGGCTGGC
     101     116       8  GACGAGGA
       8     690       8  GTTTGGGC
      52     141       8  TGCTGGTG

 @33. TX 4 @ Search for z dna (total ry, yr)

        Searches for segments of the sequence that might form Z DNA. A
  window  length  is  chosen and the number of RY and YR dinucleotides
  within each window is plotted. The top of the box corresponds to all
  RY or YR, the bottom to zero RY or YR.

        If dialogue is requested, select  a  window  length  and  plot
  interval. Otherwise the defaults will be used.

        The  program  contains  three  separate  ways  of  doing  this
  (options 33,34,35).
 @34. TX 4 @ Search for z dna (runs of ry, yr)

        Searches for segments of the sequence that might form  Z  DNA.
  Results are plotted.

        If dialogue is requested  define  a  window  length  and  plot
  interval.  Otherwise  the defaults will be used.  The routine counts
  the number of R in positions 1,3,5 etc  =R1,  the  number  of  Y  in
  positions  2,4,6 etc =Y1, the number of Y in positions 1,3,5 etc =Y2
  and the number of R in positions 2,4,6 etc =R2 for a window  length.
  It  plots  the  maximum  of R1+Y1 and R2+Y2 relative to a minimum of
  (window length)/2 and a maximum of (window length). (see 33,35).
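
        The counting rule can be sketched in Python as below. For a
  window containing only A, C, G and T the two sums R1+Y1 and R2+Y2 add
  up to the window length, which is why the larger of them can never
  fall below (window length)/2. The function names are hypothetical.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")


def alternation_score(window):
    """Return max(R1+Y1, R2+Y2) for one window of sequence."""
    window = window.upper()
    r1 = y1 = r2 = y2 = 0
    for i, base in enumerate(window):
        odd_position = (i % 2 == 0)          # positions 1,3,5.. counting from 1
        if base in PURINES:
            if odd_position:
                r1 += 1
            else:
                r2 += 1
        elif base in PYRIMIDINES:
            if odd_position:
                y2 += 1
            else:
                y1 += 1
    return max(r1 + y1, r2 + y2)


def z_dna_profile(seq, window, interval):
    """Return [(window start (1-based), score)] along the sequence."""
    return [(start + 1, alternation_score(seq[start:start + window]))
            for start in range(0, len(seq) - window + 1, interval)]
```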
 @35. TX 4 @ Search for z dna (best phased value)

        Searches for segments of the sequence that might form  Z  DNA.
  Results are plotted.

        If dialogue is requested define a window  length  and  a  plot
  interval. Otherwise the default values will be used.

        The  routine  counts  the  number  of  consecutive  RY  or  YR
  dinucleotides  in  phase. It moves through the sequence counting the
  number of RY or YR dinucleotides; when the next dinucleotide is  not
  of  the  correct  type  the score is set back to zero and the search
  restarted using the current base to set the  phase.  The  plots  are
  done  relative  to  a  minimum  of zero and a maximum defined by the
  user. (See 33,34).
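
        One possible reading of this rule is sketched in Python below:
  the score at each base is the number of consecutive purine/pyrimidine
  alternations ending there, and it drops to zero whenever two
  neighbouring bases are of the same type. Counting single alternations
  rather than whole dinucleotides is an assumption, as are the names.

```python
def phased_run_scores(seq):
    """Return the running alternation count at every base of the sequence."""
    is_purine = [base in "AG" for base in seq.upper()]
    scores = [0] * len(seq)
    for i in range(1, len(seq)):
        if is_purine[i] != is_purine[i - 1]:
            scores[i] = scores[i - 1] + 1    # run of RY/YR steps continues
        else:
            scores[i] = 0                    # run broken: restart the phase here
    return scores


def scale_scores(scores, maximum):
    """Plot heights between 0 and 1, clipped at the user defined maximum."""
    return [min(s, maximum) / maximum for s in scores]
```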
 @36. TX 4 @ Local similarity or complementarity search

        This function is designed to find segments of local similarity
  or  complementarity.  It  is therefore like performing a DIAGON plot
  that is restricted to regions near the main diagonal.   Results  can
  be presented graphically or listed.

        Users define a region to search  through,  a  span  length,  a
  range  for  searching through and a cut-off score. The program takes
  all sections of sequence of length span within  the  defined  region
  and compares them to all other sequences within the region and range
  specified. If a match above the cutoff is found we need to show  the
  position of the two sections of sequence and the score, and we do it
  in the following way.  If we have a 70%  match  between  a  sequence
  that starts at p1 and a sequence that starts at p2 the program draws
  a diagonal line that starts at p1 with height 70%  of  the  box  and
  which finishes at p2 with height 0.  The matches can also be listed.

        Here I define the terms range, region, and span  and  what  is
  compared.   Suppose we have a defined region j1 to j2, a range of i1
  to i2 and a span of s; the program will take, in turn, all  sections
  of  sequence  of  length  s within j1 and j2 and compare them to all
  sequences that start a distance i1+s-1 to  i2+s-1  away  from  them.
  First  it  will  take  the  sequence  of length s starting at j1 and
  compare it with the sequence of length s starting at j1+s-1+i1, then
  j1+s-1+i1+1,  etc up to j1+s-1+i2; then it will take the sequence of
  length s starting at j1+1 and compare it with the sequence  starting
  at  j1+s-1+1+i1  etc. This continues until we hit the right hand end
  of the sequence as defined by j2.  Note 1)  that  sequences  are  not
  compared  with themselves: the nearest sequence compared to a span s
  starting at j starts at j+s; 2) ranges i1 and i2 are ranges of start
  positions;  3)  by  choosing  a range greater than the length of the
  sequence this routine will do a  full  DIAGON  analysis  except  for
  those  points  within a distance span of the main diagonal (see note
  1).
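
        The comparison scheme just described can be sketched in Python
  as follows, for direct repeats only. Positions are numbered from 1 as
  in the text; the function name and the use of a plain percentage of
  identical bases as the score are assumptions made for illustration.

```python
def local_similarity(seq, span, range_start, range_end, percent, region=None):
    """Return [(p1, p2, percent identity)] for spans matching above the cutoff."""
    j1, j2 = region if region is not None else (1, len(seq))
    hits = []
    for p1 in range(j1, j2 - span + 2):
        a = seq[p1 - 1:p1 - 1 + span]
        for offset in range(range_start, range_end + 1):
            # The nearest compared span (offset 1) starts at p1 + span.
            p2 = p1 + span - 1 + offset
            if p2 + span - 1 > j2:
                break                         # ran off the end of the region
            b = seq[p2 - 1:p2 - 1 + span]
            matches = sum(x == y for x, y in zip(a, b))
            score = 100.0 * matches / span
            if score >= percent:
                hits.append((p1, p2, score))
    return hits
```

        With a span of 15 and a range of 1 to 5, a span starting at 100
  is compared with spans starting at 115 to 119.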

        Typical dialogue follows.

  ? Menu or option number=36
   Search for local similarity or complementarity
  ? (y/n) (y) Find direct repeats
  ? (y/n) (y) Keep picture n
  ? Span (5-200) (15) =
  Define restricted region
  ? start (0-1023) (1) =
  ? end (2-1023) (1023) =
  ? Percent match (1.00-100.00) (70.00) =
  ? Range start (1-50) (1) =
  ? Range end (1-50) (1) =5
  ? (y/n) (y) Plot results n
   Working


         118        128
           CGAGGAGGAG GTGGA
            ** *****  ** **
           GGACGAGGAC GTCGA
         100        110


         119        129
           GAGGAGGAGG TGGAT
           ** ***** * * **
           GACGAGGACG TCGAC
         101        111
  ? (y/n) (y) Find direct repeats n
  ? (y/n) (y) Keep picture
  ? Span (5-200) (15) =
  Define restricted region
  ? start (0-1023) (1) =
  ? end (2-1023) (1023) =
  ? Percent match (1.00-100.00) (70.00) =
  ? Range start (1-50) (1) =
  ? Range end (1-50) (5) =8
  ? (y/n) (y) List results

   Working


         178        188
           ACTCAGATCC GGCGG
           ***** ***  * **
           ACTCAAATCA GTCGC
         156        166


         177        187
           CACTCAGATC CGGCG
            ***** ***  * **
           AACTCAAATC AGTCG
         157        167
  ? (y/n) (y) Find inverted repeats !
 @37. TX 5 @ Set genetic code

        This function allows the user to  change  the  current  active
  genetic  code for all the options. The user may select: the standard
  code, the mammalian mitochondrial code, the yeast mitochondrial code
  or a personal code (define your own).

        Select code. If personal, define a codon and select  an  amino
  acid. When all codons have been reset define a blank codon.

        The code differences are:
            Mammalian        Yeast
    Codon  Mitochondrial  Mitochondrial  Standard
     UGA       W              W            STOP
     AUA       M              M             I
     CUA       L              T             L
     AGA      STOP            R             R
     AGG      STOP            R             R
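
        The table can be expressed in Python as a standard code plus
  a small set of changes, as sketched below. Codons are written with T
  in place of U and STOP as "*". Only the differences listed in the
  table are applied; the variable and function names are hypothetical.

```python
BASES = "TCAG"
# Standard code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

# Differences taken from the table above.
MAMMALIAN_MITO = {**STANDARD, "TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"}
YEAST_MITO = {**STANDARD, "TGA": "W", "ATA": "M", "CTA": "T"}


def personal_code(changes):
    """Return a copy of the standard code with user defined codons replaced."""
    return {**STANDARD, **changes}
```

        For instance, personal_code({"TTT": "W"}) corresponds to the
  personal code defined in the dialogue below.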

        Typical dialogue follows.
  ? Menu or option number=37
  X 1 Standard code
    2 Mammalian mitochondrial code
    3 Yeast mitochondrial code
    4 Personal code
  ? 0,1,2,3,4 =2

  ? Menu or option number=37
  X 1 Standard code
    2 Mammalian mitochondrial code
    3 Yeast mitochondrial code
    4 Personal code
  ? 0,1,2,3,4 =4
  Define genetic code by typing a codon
  followed by a 1 letter amino acid symbol
  ? Codon=TTT
  Default Amino acid symbol=F
  ? Amino acid symbol=W
  ? Codon=
 @38. T 3 4 @ Examine repeats

        This function can  be  used  to  examine  the  frequencies  of
  repeated words within a sequence. It finds all words that occur more
  than once. The user selects a minimum word length  and  the  program
  finds  all  words  of that length that occur more than once; then it
  "follows" each repeated word until it becomes unique. For each  word
  length  it  can  report  the number of different repeated words, the
  number of occurrences of each word, and their actual  positions  and
  sequences.

        It is possible that the  algorithm  may  run  out  of  memory,
  particularly if a short  minimum  word  length  is chosen, or if the
  sequence is very long or very repetitive. If this occurs the longest
  reported  word  length  will  not  necessarily be the longest in the
  sequence: the memory will have been consumed before the longest word
  is found.
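
        The idea of following each repeated word until it becomes
  unique can be sketched in Python as below. This is an illustration
  only: it keeps every repeated word in memory, uses positions counted
  from 0, and the function name is hypothetical.

```python
from collections import defaultdict


def repeated_words(seq, min_len):
    """Return {word length: {word: [start positions]}} for repeated words."""
    groups = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        groups[seq[i:i + min_len]].append(i)
    current = {w: p for w, p in groups.items() if len(p) > 1}
    by_length = {}
    length = min_len
    while current:
        by_length[length] = current
        # Extend every occurrence by one base and regroup.
        longer = defaultdict(list)
        for positions in current.values():
            for i in positions:
                if i + length < len(seq):
                    longer[seq[i:i + length + 1]].append(i)
        current = {w: p for w, p in longer.items() if len(p) > 1}
        length += 1
    return by_length
```

        The number of different repeated words for a given length is
  then len(by_length[length]), and the longest repeat is the largest
  key.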

        Typical dialogue and output are shown below.

   Expected length of longest repeat    14
   ? Minumim word length (1-6) (6) =6
   Working
   ? Show repeat frequencies for words of at least length (6-15) (15) =10
   For length    10 the number of different repeated words is  2035
   For length    11 the number of different repeated words is   613
   For length    12 the number of different repeated words is   161
   For length    13 the number of different repeated words is    37
   For length    14 the number of different repeated words is    10
   For length    15 the number of different repeated words is     1
   ? Show repeats for words of length (6-15) (15) =14
   ? Show repeats for words occuring with frequency (2-9999) (2) =2

   ggtgctcatgccca
   occurs at  21611
   occurs at  21851
   ttatccggtgatga
   occurs at   4604
   occurs at   8806
   agcaccacgctgac
   occurs at   5954
   occurs at   9486
   catgacggaggatg
   occurs at  10480
   occurs at  19925
   aaagacgggaaaat
   occurs at  11820
   occurs at  43157
   tacaaaaccaattt
   occurs at  26797
   occurs at  31369
   cgagaaagagtgcg
   occurs at   4260
   occurs at  44305
   gccggatgatggcg
   occurs at   7893
   occurs at  16638
   atgacggaggatga
   occurs at  10481
   occurs at  19926
   gcggcgaacgaggc
   occurs at  11352
   occurs at  18718
   ? Show repeats for words of length (6-15) (15) =!

  Example of not enough memory
  ----------------------------

   Expected length of longest repeat    14
   ? Minumim word length (1-6) (6) =1
   Working
   Not enough memory
   Memory used in bytes 1125996. Length of longest repeat     5
   ? Show repeat frequencies for words of at least length (1-5) (5) =!

 @39. TX 5 @ Translate and list in up to six phases

        This  is  a  general  listing  function  that   will   perform
  translations  and produce several forms of output. The possibilities
  are:
  1) no translation, list one or two strands, two ways of numbering the
  sequence.
  2) translation, one or two strands, one or three letter codes.
   Positions defined by:
    a) open reading frames of some minimum length l, l can be 0, hence giving
  a complete six phase translation.
    b) positions typed on keyboard, again 1 to 6 phases, translations appearing
  above and below the dna.
    c) positions read from a feature table.

  It should be used in preference to option 5. For publication
  without a translation, the option to number ends of lines is more compact
  than option 5. Some examples and typical dialogue are given below. Note
  that the option must be selected with dialogue requested (d39).
  ? Menu or option number=D39
  Find open reading frames, translate and list
  ? (y/n) (y) Show translation

  The segments to translate can be
     1 Typed on the keyboard
     2 Read from a feature table
  X  3 Open reading frames
  ? 1,2,3 =
  ? Minimum open frame in amino acids (0-7238) (30) =
  ? (y/n) (y) Use 1 letter codes
  Define section of DNA to display
  ? start (1-7238) (1) =
  ? end (2-7238) (7238) =300
  ? Line length (30-120) (60) =
  Which strands should be shown
  X  1 + strand only
     2 - strand only
     3 Both strands
  ? 1,2,3 =3
  ? (y/n) (y) Number ends of lines


      N  A  T  T  I  S  R  I  D  A  T  F  S  A  R  A  P  N  E  N
     AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT      60
         .    :    .    :    .    :    .    :    .    :    .    :
     TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
                                          *  S  A  G  W  I  F  I
        A  V  V  I  L  L  I  S  A  V  K  E  A  R  A  G  F  S  F

      I  A  K  Q  V  I  D  H  L  R  N  V  S  N  G  Q  T  K  S  T
          L  N  R  L  L  T  I  C  E  M  Y  L  M  V  K  L  N  L  L
     ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT     120
         .    :    .    :    .    :    .    :    .    :    .    :
     TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
      Y  S  F  L  N  N  V  M  Q  S  I  Y  R  I  T  L  S  F  R  S
     I  A  L  C  T  I  S  W  K  R  F  T  D  L  P  *  V  L  D  V

      R  S  Q  N  W  E  S  T  V  T  W  N  E  T  S  R  H  R  T  L
       V  R  R  I  G  N  Q  L  L  H  G  M  K  L  P  D  T  V  L  *
     CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA     180
         .    :    .    :    .    :    .    :    .    :    .    :
     GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
      T  R  L  I  P  F
     R  E  C  F  Q  S  D  V  T  V  H  F  S  V  E  L  C  R  V  K

      V  A  Y  L  K  H  V  E  L  Q  H  Q  I  Q  Q  L  S  S  K  P
     GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA     240
         .    :    .    :    .    :    .    :    .    :    .    :
     CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
     T  A  Y  K  F  C  T  S  S  C  C  W  I

      S  A  K  M  T  S  Y  Q  K  E  Q  L  K  V  L  S  N  P  D  L
     TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG     300
         .    :    .    :    .    :    .    :    .    :    .    :
     AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC


  ? Menu or option number=D39
  Find open reading frames, translate and list
  ? (y/n) (y) Show translation N
  Define section of DNA to display
  ? start (1-7238) (1) =
  ? end (2-7238) (7238) =300
  ? Line length (30-120) (60) =
  Which strands should be shown
  X  1 + strand only
     2 - strand only
     3 Both strands
  ? 1,2,3 =
  ? (y/n) (y) Number ends of lines


     AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT      60

     ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT     120

     CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA     180

     GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA     240

     TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG     300


  ? Menu or option number=D39
  Find open reading frames, translate and list
  ? (y/n) (y) Show translation
  The segments to translate can be
     1 Typed on the keyboard
     2 Read from a feature table
  X  3 Open reading frames
  ? 1,2,3 =
  ? Minimum open frame in amino acids (0-7238) (30) =0
  ? (y/n) (y) Use 1 letter codes N
  Define section of DNA to display
  ? start (1-7238) (1) =
  ? end (2-7238) (7238) =300
  ? Line length (30-120) (60) =
  Which strands should be shown
  X  1 + strand only
     2 - strand only
     3 Both strands
  ? 1,2,3 =3
  ? (y/n) (y) Number ends of lines


     AsnAlaThrThrIleSerArgIleAspAlaThrPheSerAlaArgAlaProAsnGluAsn
      ThrLeuLeuLeuLeuValGluLeuMetProProPheGlnLeuAlaProGlnMetLysIle
       ArgTyrTyrTyr******Asn***CysHisLeuPheSerSerArgProLys***Lys
     AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT      60
         .    :    .    :    .    :    .    :    .    :    .    :
     TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
     ValSerSerSerAsnThrSerAsnIleGlyGlyLys***SerAlaGlyTrpIlePheIle
      Arg************TyrPheGlnHisTrpArgLysLeuGluArgGlyLeuHisPheTyr
       AlaValValIleLeuLeuIleSerAlaValLysGluAlaArgAlaGlyPheSerPhe

     IleAlaLysGlnValIleAspHisLeuArgAsnValSerAsnGlyGlnThrLysSerThr
      ***LeuAsnArgLeuLeuThrIleCysGluMetTyrLeuMetValLysLeuAsnLeuLeu
    TyrSer***ThrGlyTyr***ProPheAlaLysCysIle***TrpSerAsn***IleTyr
     ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT     120
         .    :    .    :    .    :    .    :    .    :    .    :
     TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
     TyrSerPheLeuAsnAsnValMetGlnSerIleTyrArgIleThrLeuSerPheArgSer
      Leu***ValPro***GlnGlyAsnAlaPheHisIle***HisAspPhe***Ile***Glu
    IleAlaLeuCysThrIleSerTrpLysArgPheThrAspLeuPro***ValLeuAspVal

     ArgSerGlnAsnTrpGluSerThrValThrTrpAsnGluThrSerArgHisArgThrLeu
      ValArgArgIleGlyAsnGlnLeuLeuHisGlyMetLysLeuProAspThrValLeu***
    SerPheAlaGluLeuGlyIleAsnCysTyrMetGlu***AsnPheGlnThrProTyrPhe
     CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA     180
         .    :    .    :    .    :    .    :    .    :    .    :
     GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
     ThrArgLeuIleProPhe***SerAsnCysProIlePheSerGlySerValThrSer***
      AsnAlaSerAsnProIleLeuGln***MetSerHisPheLysTrpValGlyTyrLysLeu
    ArgGluCysPheGlnSerAspValThrValHisPheSerValGluLeuCysArgValLys

     ValAlaTyrLeuLysHisValGluLeuGlnHisGlnIleGlnGlnLeuSerSerLysPro
      LeuHisIle***AsnMetLeuSerTyrSerThrArgPheSerAsn***AlaLeuSerHis
    SerCysIlePheLysThrCys***AlaThrAlaProAspSerAlaIleLysLeu***Ala
     GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA     240
         .    :    .    :    .    :    .    :    .    :    .    :
     CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
     AsnCysIle***PheMetAsnLeu***LeuValLeuAsnLeuLeu***AlaArgLeuTrp
      GlnMetAsnLeuValHisGlnAlaValAlaGlySerGluAlaIleLeuSer***AlaMet
    ThrAlaTyrLysPheCysThrSerSerCysCysTrpIle***CysAsnLeuGluLeuGly

     SerAlaLysMetThrSerTyrGlnLysGluGlnLeuLysValLeuSerAsnProAspLeu
      ProGlnLys***ProLeuIleLysArgSerAsn***ArgTyrSerLeuIleLeuThrCys
    IleArgLysAsnAspLeuLeuSerLysGlyAlaIleLysGlyThrLeu***Ser***Pro
     TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG     300
         .    :    .    :    .    :    .    :    .    :    .    :
     AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
     GlyCysPheHisGlyArgIleLeuLeuLeuLeu***LeuTyrGluArgIleArgValGln
      ArgLeuPheSerArgLysAspPheProAlaIleLeuProValArg***AspGlnGlyThr
    AspAlaPheIleValGlu******PheSerCysAsnPheThrSerGluLeuGlySerArg


  ? Menu or option number=D39
  Find open reading frames, translate and list
  ? (y/n) (y) Show translation
  The segments to translate can be
     1 Typed on the keyboard
     2 Read from a feature table
  X  3 Open reading frames
  ? 1,2,3 =1
  ? (y/n) (y) Use 1 letter codes
  Define section of DNA to display
  ? start (1-7238) (1) =
  ? end (2-7238) (7238) =300
  ? Line length (30-120) (60) =
  Which strands should be shown
  X  1 + strand only
     2 - strand only
     3 Both strands
  ? 1,2,3 =
  ? (y/n) (y) Number ends of lines N
  Translate
  ? From (0-300) (0) =101
  ? To (1-300) (300) =300
  Translate
  ? From (0-300) (0) =102
  ? To (1-300) (300) =200
  Translate
  ? From (0-300) (0) =


     AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT
             10        20        30        40        50        60

                                              M  V  K  L  N  L  L
                                               W  S  N  *  I  Y
     ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT
             70        80        90       100       110       120

       V  R  R  I  G  N  Q  L  L  H  G  M  K  L  P  D  T  V  L  *
     S  F  A  E  L  G  I  N  C  Y  M  E  *  N  F  Q  T  P  Y  F
     CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA
            130       140       150       160       170       180

       L  H  I  *  N  M  L  S  Y  S  T  R  F  S  N  *  A  L  S  H
     S  C  I  F  K  T  C
     GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA
            190       200       210       220       230       240

       P  Q  K  *  P  L  I  K  R  S  N  *  R  Y  S  L  I  L  T  C
     TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG
            250       260       270       280       290       300


  ? Menu or option number=D39
  Find open reading frames, translate and list
  ? (y/n) (y) Show translation
  The segments to translate can be
     1 Typed on the keyboard
     2 Read from a feature table
  X  3 Open reading frames
  ? 1,2,3 =2
  ? Embl feature table file=1.FT
  ? (y/n) (y) Use 1 letter codes
  Define section of DNA to display
  ? start (1-7238) (1) =
  ? end (2-7238) (7238) =300
  ? Line length (30-120) (60) =
  Which strands should be shown
  X  1 + strand only
     2 - strand only
     3 Both strands
  ? 1,2,3 =3
  ? (y/n) (y) Number ends of lines


      N  A  T  T  I  S  R  I  D  A  T  F  S  A  R  A  P  N  E  N
     AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT      60
         .    :    .    :    .    :    .    :    .    :    .    :
     TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
                                          *  S  A  G  W  I  F  I
        A  V  V  I  L  L  I  S  A  V  K  E  A  R  A  G  F  S  F

      I  A  K  Q  V  I  D  H  L  R  N  V  S  N  G  Q  T  K  S  T
          L  N  R  L  L  T  I  C  E  M  Y  L  M  V  K  L  N  L  L
     ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT     120
         .    :    .    :    .    :    .    :    .    :    .    :
     TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
      Y  S  F  L  N  N  V  M  Q  S  I  Y  R  I  T  L  S  F  R  S
     I  A  L  C  T  I  S  W  K  R  F  T  D  L  P  *  V  L  D  V

      R  S  Q  N  W  E  S  T  V  T  W  N  E  T  S  R  H  R  T  L
       V  R  R  I  G  N  Q  L  L  H  G  M  K  L  P  D  T  V  L  *
     CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA     180
         .    :    .    :    .    :    .    :    .    :    .    :
     GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
      T  R  L  I  P  F
     R  E  C  F  Q  S  D  V  T  V  H  F  S  V  E  L  C  R  V  K

      V  A  Y  L  K  H  V  E  L  Q  H  Q  I  Q  Q  L  S  S  K  P
     GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA     240
         .    :    .    :    .    :    .    :    .    :    .    :
     CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
     T  A  Y  K  F  C  T  S  S  C  C  W  I

      S  A  K  M  T  S  Y  Q  K  E  Q  L  K  V  L  S  N  P  D  L
     TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG     300
         .    :    .    :    .    :    .    :    .    :    .    :
     AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
                                       *  L  Y  E  R  I  R  V  Q
                          *  F  S  C  N  F  T  S  E  L  G  S  R
 @40. TX 5 @ Translate and write the protein sequence to disk

        This routine allows the user  to  translate  sections  of  the
  sequence  into the 1 letter amino acid codes and store the resulting
  amino acid sequences in a disk file.  Two modes of use are possible.
  Either  all open reading frames of at least some minimum length will
  automatically be found and translated, or the user can specify  that
  particular segments be translated.

        Mode 1: the  user  selects  to  translate  all  open  reading
  frames.

        Either, or both, strands can be translated.  The  output  file
  is  in  the  same format as a PIR .seq file. Each protein segment is
  given an entry name that is its start base in the DNA, and  a  title
  that  includes  its  end  position,  reading frame and strand (+ for
  plus, - for minus). Each segment is terminated by * whether  or  not
  there is a stop codon in the DNA. The file is therefore suitable for
  input to FASTA, ALIGNL and ANALYSEPL.

        Mode  2:  the  user  selects  to  identify  the  segments   to
  translate.

        Either, or both,  strands  can  be  translated.   If  multiple
  coding  regions  are  translated  each  will  be  separated from the
  previous one by  a  gap  of  5  dashes  (-----).   The  sections  to
  translate  can be defined from the keyboard or by supplying the name
  of the appropriate EMBL library feature table.
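
        The open reading frame mode can be illustrated with  a  short
  Python sketch. It is not the program's own code. It assumes the
  standard genetic code, treats an open frame as a stop-to-stop
  stretch, handles the + strand only, and counts the stop codon in
  the end position. The names open_frames and pir_entry are
  hypothetical, and the column widths of the entry are approximate.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def open_frames(dna, min_len=30):
    """Yield (start, end, frame, protein) for + strand open frames.

    Positions are 1-based. An open frame is any run of sense codons
    between stops (an assumption of this sketch); unknown codons give X.
    """
    dna = dna.upper()
    for frame in range(3):
        protein, start, end = [], None, None
        for i in range(frame, len(dna) - 2, 3):
            aa = CODON_TABLE.get(dna[i:i + 3], "X")
            if aa != "*" and start is None:
                start = i + 1
            end = i + 3
            if aa == "*":
                if protein and len(protein) >= min_len:
                    yield start, end, frame + 1, "".join(protein)
                protein, start = [], None
            else:
                protein.append(aa)
        # A frame still open at the end of the sequence is reported too.
        if protein and len(protein) >= min_len:
            yield start, end, frame + 1, "".join(protein)


def pir_entry(start, end, frame, protein, strand="+", width=60):
    """Format one segment in a PIR-like layout, always terminated by *."""
    lines = [f">P1;{start:6d}", f"{end:10d}{frame:6d} {strand}"]
    seq = protein + "*"
    lines += [" " + seq[i:i + width] for i in range(0, len(seq), width)]
    return "\n".join(lines)
```

        For example, print(pir_entry(*next(open_frames(dna, 30)))) would
  show the first qualifying segment of a sequence held in dna.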

        Typical dialogue follows.
  ? Menu or option number=40
   Translate and write protein sequence to disk
  ? (y/n) (y) Translate selected regions
  ? (y/n) (y) Define segments using keyboard
  Translate
  ? From (0-1023) (0) =1
  ? To (1-1023) (1023) =111
  ? (y/n) (y) + strand
  Translate
  ? From (0-1023) (0) =
  ? Output file name=1.OUT

   ? Menu or option number=40
   Translate and write protein sequence to disk
  ? (y/n) (y) Translate selected regions n
  ? Minimum open frame in amino acids (5-1000) (30) =

  X 1 + strand only
    2 - strand only
    3 Both strands
  ? 0,1,2,3 =3
  ? File name for translation=1.OUT

  ? Menu or option number=6
  Page through text files
  ? Name of file to read=1.OUT
  >P1;    25
      135     1 +
   GAQRLLRRSCWCWRCGGRQRTQGSAGRGRRRRGGGG*
  >P1;   238
      486     1 +
   IRCRDCGQRRRGIFDLVDDFHVRRHIVLARKLFEAEGTGVHFHISLMGGNIVTAEVTNVR
   VDAGADFAAVRMLALFGAVVPH*
  >P1;   556
      795     1 +

   SSTQVRRASAQTSSLQLESIVAVVNVEVFLAAKHSRFYIAVLFAQFGPLLDARLDRGCGK
   GAGRRDQWRGGGVDLANGR*
  >P1;   796
      987     1 +

   FGYADHAFHLRSTSRHSDNVKFDSAGRRRCCCFHLVFSLGSDEEGLLARLLVEVTTIRVV
   LRG*
  >P1;     2
      163     2 +
   NSVWAWCEVPRDYCAAAAGAGGAEVVNGPRDPLDEDVDDEEEVDSALLVAGSD*
  >P1;   176
      391     2 +
   PLRSGGGGVEAPETPSGWPARFAAATVANAVEGFSILWMIFTCAVILSLRVNSLKQKGQG
   YTFTFRLWEVT*
  >P1;   476
      628     2 +
   SLTEPSASPSPTLLLRFSLVLTEGVPNPALRFGVLPLRPAAFNLNPSLLL*
  >P1;   629
      958     2 +
   MSRYSWLLNTAGFTSPFCLPSLGRFWTRGLTVAVEKEPAGETNGVEAALTLPMGVSLGML
   TMLFTCAPPAAIPIMLSLIPLAAAAAAVSTWCFLWAAMRKACWRACSLR*
  >P1;     3
      293     3 +
   IRFGLGVRCPEITAPQLLVLAVRRSSTDPGIRWTRTSTTRRRWIAHCWWLAATDLSSDHS
   DPAAEASRLPKLPVAGLLDSLPRLWPTPSRDFRSCG*
  >P1;   411
      521     3 +
   CACRRGSRLCSGTYARPLWCSSPSLSPPPRPRQRCC*
  >P1;  1020
       37     1 -
   EFGKYNPLTDNSSPTQDHTDGSHLNEQARQQAFLIAAQRKHQVETAAAAAASGIKLNIIG
   MAAGGAQVKSMVSIPKLTPIGKVNAASTPLVSPAGSFSTATVKPRVQKRPKLGKQNGDVK
   PAVFSSQEYLDIYNSNDGFKLKAAGLSGSTPNLSAGLGTPSVKTKLNLSSNVGEGEAEGS
   VRDYCTKEGEHTYRCKVCSRVYTHISNFCRHYVTSHKRNVKVYPCPFCFKEFTRKDNMTA
   HVKIIHKIENPSTALATVAAANLAGQPLGVSGASTPPPPDLSGQNSNQSLPATSNALSTS
   SSSSTSSSSGSLGPLTTSAPPAPAAAAQ*
  >P1;   373
       -1     2 -
   AKCESVPLSLLLQRVYAQGQYDGARENHPQDRKSLDGVGHSRGSESSRPATGSFGSLDAS
   AAGSEWSELKSVAASHQQCAIHLLLVVDVLVQRIPGSVDDLRTASTSSCGAVISGHLTPS
   PNRI*
  >P1;   517
      407     2 -
   QQRWRGRGGGLSEGLLHQRGRAYVPLQSLLPRLHAH*
  >P1;   649
      518     2 -
   QPGIPRHLQQQRWIQVEGCWSERKHAEPECWIRNSLCQNQAES*
  >P1;   853
      650     2 -
   HYRNGGWWSAGEKHGQHTQTNAHWQGQRRLHAIGLACRLLFHSHGQAARPEAAQTQTER
   RCKTGCV*
  >P1;   958
      854     2 -
   SPQRAGAPTSLPHRCPEKTPGGNSSSGGGQRNQT*
  >P1;   179
       78     3 -
   VVRTQISRCQPPAMRYPPPPRRRRPRPADPWVR*
  >P1;   479
      363     3 -
   GTTAPKRASIRTAAKSAPASTRTLVTSAVTMLPPISEM*
  >P1;   791
      666     3 -
   RPLARSTPPPRHWSRLPAPFPQPRSSRASRSGPNWANRTAM*
  >P1;  1022
      819     3 -
   SNSASTTRSPTTAHPRRTTRMVVTSTSRRANKPSSSLPRENTRWKQQQRRRPAESNLTLS
   EWRLVERR*
  End of file
 @41. TX 5 @ Calculate and write codon table to disk

        This routine calculates codon usage tables for sections of the
  sequence  and  stores the resulting tables on disk.  The sections to
  translate can be defined from the keyboard or by supplying the  name
  of the appropriate EMBL library feature table.

        If required users can add to an existing codon table stored as
  a  disk  file. Choose between storing observed counts or having them
  normalised so that the totals for each amino acid sum to 100. Select
  between  defining  segments at the keyboard or using an EMBL feature
  table. Define segments. Signal completion with a zero start.  Supply
  a  file  name. For each segment the program will display the counts,
  at the end it will display the accumulated totals.
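
        The counting and normalisation steps can be sketched in Python
  as follows. This is an illustration under stated assumptions, not
  the routine itself: it assumes the standard genetic code, reads
  each segment in the frame of its first base on the + strand, and
  skips codons containing ambiguous bases. The function names
  count_codons and normalise are hypothetical.

```python
from collections import Counter
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def count_codons(dna, start, end):
    """Count codons in the 1-based inclusive segment start..end."""
    segment = dna.upper()[start - 1:end]
    codons = (segment[i:i + 3] for i in range(0, len(segment) - 2, 3))
    return Counter(c for c in codons if c in CODON_TABLE)


def normalise(counts):
    """Scale counts so the codons of each amino acid sum to 100."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODON_TABLE[codon]] += n
    return {codon: 100.0 * n / totals[CODON_TABLE[codon]]
            for codon, n in counts.items() if n}
```

        Adding to an existing table then corresponds to summing  two
  Counter objects, for example count_codons(dna, 1, 111) +
  count_codons(dna, 238, 486), before any normalisation is applied.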

        Typical dialogue follows.
  ? Menu or option number=41
   Calculate and write codon table to disk
  ? (y/n) (y) Start with empty table
  ? (y/n) (y) Show observed counts
  ? (y/n) (y) Define segments using keyboard
  ? Count from (0-1023) (0) =1
  ? Count to (1-1023) (1023) =111
  ? (y/n) (y) + strand

       ===========================================
       F TTT   0. S TCT   0. Y TAT   0. C TGT   0.
       F TTC   1. S TCC   1. Y TAC   0. C TGC   3.
       L TTA   1. S TCA   0. * TAA   0. * TGA   1.
       L TTG   2. S TCG   0. * TAG   0. W TGG   2.
       ===========================================
       L CTT   0. P CCT   0. H CAT   0. R CGT   2.
       L CTC   0. P CCC   0. H CAC   0. R CGC   2.
       L CTA   0. P CCA   0. Q CAA   1. R CGA   1.
       L CTG   1. P CCG   0. Q CAG   2. R CGG   2.
       ===========================================
       I ATT   0. T ACT   0. N AAT   0. S AGT   0.
       I ATC   0. T ACC   1. N AAC   0. S AGC   1.
       I ATA   0. T ACA   0. K AAA   0. R AGA   1.
       M ATG   0. T ACG   0. K AAG   0. R AGG   0.
       ===========================================
       V GTT   0. A GCT   1. D GAT   0. G GGT   3.
       V GTC   0. A GCC   1. D GAC   0. G GGC   1.
       V GTA   0. A GCA   0. E GAA   1. G GGA   4.
       V GTG   1. A GCG   0. E GAG   0. G GGG   0.
       ===========================================
  ? Count from (0-1023) (0) =

      Codon totals over all genes
       ===========================================
       F TTT   0. S TCT   0. Y TAT   0. C TGT   0.
       F TTC   1. S TCC   1. Y TAC   0. C TGC   3.
       L TTA   1. S TCA   0. * TAA   0. * TGA   1.
       L TTG   2. S TCG   0. * TAG   0. W TGG   2.
       ===========================================
       L CTT   0. P CCT   0. H CAT   0. R CGT   2.
       L CTC   0. P CCC   0. H CAC   0. R CGC   2.
       L CTA   0. P CCA   0. Q CAA   1. R CGA   1.
       L CTG   1. P CCG   0. Q CAG   2. R CGG   2.
       ===========================================
       I ATT   0. T ACT   0. N AAT   0. S AGT   0.
       I ATC   0. T ACC   1. N AAC   0. S AGC   1.
       I ATA   0. T ACA   0. K AAA   0. R AGA   1.
       M ATG   0. T ACG   0. K AAG   0. R AGG   0.
       ===========================================
       V GTT   0. A GCT   1. D GAT   0. G GGT   3.
       V GTC   0. A GCC   1. D GAC   0. G GGC   1.
       V GTA   0. A GCA   0. E GAA   1. G GGA   4.
       V GTG   1. A GCG   0. E GAG   0. G GGG   0.
       ===========================================
  ? (y/n) (y) Save table in a file n
 @42. TX 6 @ Codon usage method

        Used to find protein coding regions. For each window length of
  the sequence the routine measures the closeness to an expected codon
  usage. Results are plotted for each of  the  three  reading  frames.
  Stop  and start codons are also marked on the plots. Has the highest
  resolution of all such methods, but makes the strongest  assumption,
  i.e.  that the codon usage is known. The latest version is described
  in Methods in Enzymology 183, 193-211.

        Choose whether to use an internal standard (i.e. part  of  the
  current  sequence known to code for a protein). If so define its end
  points, and those of any others. Otherwise supply the name of a disk
  file  containing  a  table of codon usage. Tables are listed. Choose
  between using the observed counts, or two  types  of  normalisation:
  normalised  to give an average amino acid composition; normalised to
  no amino  acid  bias.  The  first  normalisation  is  clearly  often
  sensible,  but  the  second removes valuable information and is only
  made available for special circumstances. The final  table  will  be
  displayed, followed by the expected scores for window lengths 21, 31
  and 41 codons. The scores for each of the three reading  frames  are
  shown  (they  are  logarithmic values) to help users choose a window
  length for the analysis. Define a window length and  plot  interval.
  Plotting will start.

        The method was first described in Staden and  McLachlan  Nucl.
  Acid  Res.  10  141-156 (1982) and the following is a summary of the
  initial ideas.  The method makes the following main assumptions: the
  codon  preferences of all the genes in the sequence we are examining
  are similar to  those  of  the  standard;  the  sequence  is  coding
  throughout its whole length in only one reading frame; in the coding
  frame the frequency of codon abc has a definite value Fabc.
  If we select a  sequence   a1b1c1a2b2c2a3b3c3,...,anbncnan+1bn+1cn+1
  then the probability of selecting it in each of the three frames is:
               frame 1: p1=Fa1b1c1.Fa2b2c2....Fanbncn
               frame 2: p2=Fb1c1a2.Fb2c2a3...Fbncnan+1
               frame 3: p3=Fc1a2b2.Fc2a3b3...Fcnan+1bn+1
  The probability that selection of a particular sequence was "caused"
  by it being a coding sequence is:
  P1=p1/(p1+p2+p3), P2=p2/(p1+p2+p3), P3=p3/(p1+p2+p3).
  The program calculates these values for the given window length  but
  plots Log(P/(1-P)) for each of the three frames. At each point along
  the sequence that the program has a point to plot it finds which  of
  the  three  values  is  highest and places a single point at the 50%
  level for the corresponding frame. These single points will join  to
  form  a solid line if one frame is consistently the highest scoring.
  In addition stop codons are  shown  as  short  vertical  lines  that
  bisect the 50% level of probability. When looking for coding regions
  the user should look for solid horizontal lines  at  the  50%  level
  that are not interrupted by these short vertical lines.
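
        The calculation of P1, P2 and P3 for one window can be sketched
  in Python as follows. This is a minimal illustration rather than
  the program's code. It assumes that freq maps each codon to its
  frequency in the standard, that the logarithm is natural, and that
  zero frequencies are replaced by a small floor so that logarithms
  stay finite. The names frame_probabilities and log_odds, and the
  floor and eps parameters, are hypothetical.

```python
import math


def frame_probabilities(window, freq, floor=1e-6):
    """Return [P1, P2, P3] for a window of at least 3n+2 bases.

    Each frame uses n codons, starting at offsets 0, 1 and 2. Products
    of frequencies are accumulated as sums of logs to avoid underflow.
    """
    n = (len(window) - 2) // 3
    logp = []
    for offset in range(3):
        codons = (window[offset + 3 * k: offset + 3 * k + 3] for k in range(n))
        logp.append(sum(math.log(max(freq.get(c, 0.0), floor)) for c in codons))
    top = max(logp)
    weights = [math.exp(lp - top) for lp in logp]  # p_i rescaled by the largest
    total = sum(weights)
    return [w / total for w in weights]            # P_i = p_i / (p1 + p2 + p3)


def log_odds(P, eps=1e-12):
    """Return log(P / (1 - P)), clamping P away from 0 and 1."""
    P = min(max(P, eps), 1.0 - eps)
    return math.log(P / (1.0 - P))
```

        The frame whose P is largest at a given window position is  the
  one that would receive the single point at the 50% level.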

        Changes.  Two normalisations are offered:  1)  to  remove  all
  amino  acid  compositional components from the tables, hence leaving
  only the  codon  preference  components.  In  general  this  is  not
  recommended as the amino acid component alone is often sufficient to
  choose correctly between  frames,  but  may  be  useful  in  special
  circumstances. 2) to change the amino acid composition components to
  give an average amino acid composition rather than the one contained
  in  the  standard  (this  leaves  the  codon  preference  components
  unchanged). In general this should be useful as  the  average  amino
  acid  composition  is  likely to be closer to the composition of the
  genes being hunted, than is that of  the  standard  table  of  codon
  preferences.  The  average composition is that recently published by
  Argos, not the Dayhoff one that we have used before.

        Typical dialogue follows.

  ? Menu or option number=42
  Staden and McLachlan codon usage method
  Codon tables for standards may be read from disk
  or calculated from parts of the current sequence
  ? (y/n) (y) Define internal standard
  Define standard
  ? start (0-1023) (0) =1
  ? end (2-1023) (1023) =1000
       ===========================================
       F TTT  13. S TCT   1. Y TAT   1. C TGT   3.
       F TTC   4. S TCC  10. Y TAC   1. C TGC   7.
       L TTA   1. S TCA   0. * TAA   1. * TGA   4.
       L TTG   4. S TCG   1. * TAG   3. W TGG   5.
       ===========================================
       L CTT   9. P CCT   1. H CAT   3. R CGT  14.
       L CTC   7. P CCC   0. H CAC   7. R CGC  14.
       L CTA   0. P CCA   0. Q CAA   4. R CGA   9.
       L CTG  12. P CCG   1. Q CAG   9. R CGG   8.
       ===========================================
       I ATT   7. T ACT   4. N AAT   4. S AGT   1.
       I ATC   4. T ACC   5. N AAC   3. S AGC   7.
       I ATA   1. T ACA   1. K AAA   3. R AGA   2.
       M ATG   2. T ACG   1. K AAG   2. R AGG   2.
       ===========================================
       V GTT  11. A GCT  13. D GAT   6. G GGT   9.
       V GTC   5. A GCC  10. D GAC   9. G GGC  11.
       V GTA   6. A GCA   5. E GAA   6. G GGA  12.
       V GTG   8. A GCG   5. E GAG   3. G GGG   8.
       ===========================================
  Define standard
  ? start (0-1023) (0) =
  Total codons in standard=     333.
  X 1 Use observed frequencies
    2 Normalize to average amino acid composition
    3 Normalize to no amino acid bias
  ? 0,1,2,3 =2
       ===========================================
       F TTT  19. S TCT   2. Y TAT  10. C TGT   3.
       F TTC   6. S TCC  22. Y TAC  10. C TGC   8.
       L TTA   2. S TCA   0. * TAA   0. * TGA   0.
       L TTG   7. S TCG   2. * TAG   0. W TGG   8.
       ===========================================
       L CTT  16. P CCT  16. H CAT   4. R CGT  10.
       L CTC  12. P CCC   0. H CAC  10. R CGC  10.
       L CTA   0. P CCA   0. Q CAA   8. R CGA   7.
       L CTG  21. P CCG  16. Q CAG  18. R CGG   6.
       ===========================================
       I ATT  19. T ACT  13. N AAT  16. S AGT   2.
       I ATC  11. T ACC  17. N AAC  12. S AGC  15.
       I ATA   3. T ACA   3. K AAA  22. R AGA   1.
       M ATG  15. T ACG   3. K AAG  15. R AGG   1.
       ===========================================
       V GTT  15. A GCT  21. D GAT  14. G GGT  10.
       V GTC   7. A GCC  16. D GAC  20. G GGC  13.
       V GTA   8. A GCA   8. E GAA  26. G GGA  14.
       V GTG  11. A GCG   8. E GAG  13. G GGG   9.
       ===========================================
  Span length  21 expected mean values:   4.8  -5.7  -4.8
  Span length  31 expected mean values:   7.1  -8.4  -7.2
  Span length  41 expected mean values:   9.5 -11.1  -9.5
  ? odd span length (11-101) (25) =41
  ? plot interval (1-11) (5) =

   Missing graphics display here

 @43. TX 6 @ Positional base preference method.

        Used to find protein coding regions. For each window length of
  the  sequence  the  routine  measures  the  closeness to an expected
  pattern of base frequencies. Results are plotted  for  each  of  the
  three  reading  frames. Stop and start codons are also marked on the
  plots.  The method is particularly useful for showing which  reading
  frame  is  the  most  likely  to  be  coding.  The latest version is
  described in a forthcoming issue of Methods in Enzymology,  but  the
  original  ideas  were given in Staden, R. Nucl. Acid Res. 12 551-567
  (1984).

        If dialogue is requested  the  following  inputs  are  needed,
  otherwise  the  standard  analysis  is  performed.  Choose between a
  "global" standard, or a selected one.  If  the  global  standard  is
  selected  the  expected  scores  are displayed and the user asked to
  define a span length and a plot interval. Then users choose  between
  plotting  relative  or  absolute  scores,  and can reset the scaling
  values employed  for  plotting.   If  the  global  standard  is  not
  selected  users  must  define  a  region of the sequence to use as a
  standard, or they can read in a codon table from which  the  program
  will calculate one. Then they can either, use the values observed in
  this standard,  or  they  can  combine  its  values  for  the  third
  positions  in codons, with those from the global standard. Next they
  can give different weightings to each  of  the  three  positions  in
  codons.

        In its original form the method took advantage of  the  uneven
  use of amino acids by proteins and the structure of the genetic code
  table and assumed that there is  a  typical  ("global")  amino  acid
  composition   and  no  codon  preference.  The  typical  amino  acid
  composition is the average composition found by Argos  (see  below).
  This composition and no codon preference determines the frequency of
  each of the four bases in each of the three  codon  positions.  This
  3x4 frequency table shows unequal use of the bases and in particular
  a marked use of G in position 1 and of  A  in  position  2  (at  the
  expense  of  G).  The routine slides a window along the sequence and
  calculates a score for each of the  three  reading  frames  at  each
  window  position.  It  assumes the sequence is coding throughout its
  whole length and calculates the probability that  it  is  coding  in
  each  of  the  three  frames.  When  tested  against all the E. coli
  sequences in the EMBL sequence library it correctly  identified  the
  coding  frame  for  91% of window positions.  (The E. coli sequences
  were chosen only for technical reasons: I have no  reason  to  think
  the method would work less well on other organisms with roughly even
  base composition.)  The routine can plot either absolute or relative
  values: i.e. the absolute values are the values found by summing the
  scores for each frame (say p1, p2 and p3), and the  relative  values
  are then p1/(p1+p2+p3), p2/(p1+p2+p3) and p3/(p1+p2+p3).
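
        Two of the steps above can be sketched in Python: deriving  the
  3x4 table of base frequencies at each codon position from a table
  of codon counts, and converting three absolute frame scores  into
  relative values. This is an illustration only. It assumes that all
  codons supplied are to be counted (the program may treat  stop
  codons differently) and it does not reproduce the window scoring
  itself. The names positional_frequencies, row_range and relative
  are hypothetical.

```python
def positional_frequencies(codon_counts):
    """Return {position: {base: frequency}} for codon positions 1 to 3."""
    table = {pos: dict.fromkeys("TCAG", 0.0) for pos in (1, 2, 3)}
    total = sum(codon_counts.values())
    for codon, n in codon_counts.items():
        for pos, base in enumerate(codon, start=1):
            table[pos][base] += n / total
    return table


def row_range(row):
    """Spread between the most and least used base at one position."""
    return max(row.values()) - min(row.values())


def relative(p1, p2, p3):
    """Convert absolute frame scores into relative values summing to 1."""
    total = p1 + p2 + p3
    return p1 / total, p2 / total, p3 / total
```

        The row_range value corresponds to the "Range" column shown in
  the typical dialogue for this option.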

        At each point along the sequence that the program has a  point
  to  plot  it finds which of the three values is highest and places a
  single point at the 50% level for  the  corresponding  frame.  These
  single  points  will  join  to  form  a  solid  line if one frame is
  consistently the highest scoring. In addition stop codons are  shown
  as  short  vertical  lines that bisect the 50% level of probability.
  When looking for coding regions  the  user  should  look  for  solid
  horizontal  lines at the 50% level that are not interrupted by these
  short vertical lines.  The absolute  mean  values  expected  on  the
  complement of the coding strand (and in the same frame) are 5% lower
  than those on the coding strand but the relative values are the same
  on  both  strands.  Although the relative values give smoother plots
  and tend to emphasize the coding  frame,  they  therefore  cannot  be
  used  to  decide  which  strand  is coding. The absolute values plot
  should be used for this purpose, bearing  in  mind  that  the
  differences between strands are quite small.

        The method has been improved in two overall ways: first it now
  allows  users  to define their own typical amino acid composition by
  selecting a standard sequence from  within  the  sequence  they  are
  analysing or from a codon table; secondly it allows the inclusion of
  third position preferences. Again these third  position  preferences
  are  defined  by  the use of an internal standard sequence. Not only
  can users define their own standards but they can also give  weights
  to  each  of  the  three  positions in codons. This allows different
  emphasis to be used for each of the three positions. As  an  example
  of  its  use, by giving, in turn, weights of 1.0, 0.0, 0.0, and 0.0,
  1.0, 0.0, and finally 0.0, 0.0, 1.0,  you  could  see  the  separate
  contribution  made  by  each  of  the  three  positions.  It is also
  possible to use the third position preferences with the  values  for
  the  first  two  positions  taken  from  the  "global"   amino  acid
  composition. In all cases users  may  choose  to  plot  absolute  or
  relative  values.  The  expected  scores  are  displayed before each
  analysis and scales are drawn on the plots.  At present this  method
  does  not  give probabilities of coding; it has only been tested for
  its ability to choose the correct  reading  frame  (see  above).  It
  could be used to give probabilities of coding if it was applied to all
  known coding and non-coding sequences in the  way  that  the  uneven
  positional base frequencies method was. It is designed to be used in
  conjunction with this method. Note that the average amino acid
  composition used to derive the  base  frequencies  was  changed  on
  17-11-1988, to be the new average given by  McCaldon  and  Argos  in
  Proteins  4  99-122  (1988).   A further change is to allow users to
  select their own scales for producing the plots. It can  be  helpful
  if they want to emphasise or diminish certain features.

        Typical dialogue follows.
  ? Menu or option number=D43
  Positional base preferences method to find protein genes
  Select standard source
  X  1 Use global standard
     2 Use internal standard
     3 Use codon usage table
  ? Selection  (1-3) (1) =2
  Define region for standard
  ? start (0-8134) (0) =3171
  ? end (3172-8134) (8134) =4700
  Select normalisation
  X  1 Use observed frequencies
     2 Combine with global standard
  ? Selection  (1-2) (1) =1
            T      C      A      G      Range
        1  0.125  0.249  0.230  0.397  0.272
        2  0.298  0.245  0.292  0.165  0.132
        3  0.288  0.313  0.169  0.230  0.144
  ? (y/n) (y) Use 1.0 for positional weights
  Give weights between 0.0 and 1.0
  to each of the 3 codon positions
  ? Position 1 (0.00-1.00) (1.00) =
  ? Position 2 (0.00-1.00) (1.00) =
  ? Position 3 (0.00-1.00) (1.00) =
  Expected scores per codon in each frame
         0.136     0.122     0.123
  ? odd span length (31-101) (67) =
  ? plot interval (1-11) (5) =
  ? (y/n) (y) Plot relative scores
  Scaling values:
     Minimum  maximum    range
      0.3121   0.3656   0.0382
  ? (y/n) (y) Leave scaling values unchanged

    Graphics not shown

  ? Menu or option number=D43
  Positional base preferences method to find protein genes
  Select standard source
  X  1 Use global standard
     2 Use internal standard
     3 Use codon usage table
  ? Selection  (1-3) (1) =3
  ? File name of standard=atpase.cods
       ===========================================
       F TTT  21. S TCT  33. Y TAT  15. C TGT   5.
       F TTC  55. S TCC  40. Y TAC  40. C TGC   4.
       L TTA   8. S TCA   7. * TAA   8. * TGA   0.
       L TTG  19. S TCG  12. * TAG   1. W TGG  17.
       ===========================================
       L CTT  22. P CCT  17. H CAT   6. R CGT  73.
       L CTC  21. P CCC   4. H CAC  30. R CGC  23.
       L CTA   1. P CCA  10. Q CAA  19. R CGA   5.
       L CTG 168. P CCG  48. Q CAG  80. R CGG   3.
       ===========================================
       I ATT  47. T ACT  14. N AAT  17. S AGT   8.
       I ATC  98. T ACC  54. N AAC  52. S AGC  26.
       I ATA   6. T ACA   7. K AAA  85. R AGA   0.
       M ATG  75. T ACG  13. K AAG  28. R AGG   0.
       ===========================================
       V GTT  67. A GCT  56. D GAT  41. G GGT  90.
       V GTC  29. A GCC  53. D GAC  66. G GGC  66.
       V GTA  49. A GCA  59. E GAA 101. G GGA   5.
       V GTG  57. A GCG  64. E GAG  41. G GGG   8.
       ===========================================
  Select normalisation
  X  1 Use observed frequencies
     2 Combine with global standard
  ? Selection  (1-2) (1) =2
            T      C      A      G      Range
        1  0.177  0.211  0.277  0.336  0.159
        2  0.271  0.238  0.310  0.182  0.128
        3  0.242  0.301  0.168  0.289  0.132
  ? (y/n) (y) Use 1.0 for positional weights
  Expected scores per codon in each frame
         0.785     0.736     0.736
  ? odd span length (31-101) (67) =
  ? plot interval (1-11) (5) =
  ? (y/n) (y) Plot relative scores
  Scaling values:
     Minimum  maximum    range
      0.3219   0.3519   0.0214
  ? (y/n) (y) Leave scaling values unchanged

    Graphics not shown
 @44. TX 6 @ Uneven positional base frequencies.

        Used to find regions of a sequence that might be coding for  a
  protein.  The method looks for sections of the sequence in which the
  frequency at which  each  of  the  four  bases  occupies  the  three
  positions  in  codons  is  nonrandom.  The level of nonrandomness is
  plotted on a scale that shows the probability that the  sequence  is
  coding.  At each position along a sequence the calculation gives the
  same value for all six possible reading frames, so only one value is
  plotted.

        Define the window length and plot interval.

        The results are plotted in a box divided by a horizontal  line
  marked  "76%".  76% of coding regions achieve values above this line
  and 76% of noncoding regions achieve scores below the line.

        This method, first described in  Staden R. Nucl. Acid Res.  12
  551-567  1984, looks for uneven positional usage of bases in codons.
  It looks through the sequence in one  fixed  phase  and  counts  the
  number  of  times  each  base  appears in  each  of  the three codon
  positions: for each window position it counts A1,A2,A3 and  C1,C2,C3
  and  G1,G2,G3  and  T1,T2,T3  and calculates AMEAN=(A1+A2+A3)/3, and
  similarly CMEAN, GMEAN and TMEAN; it  then  calculates  ADIF=abs(A1-
  AMEAN)+abs(A2-AMEAN)+abs(A3-AMEAN) and similarly CDIF, GDIF and TDIF
  to measure the differences  between  an  even  base  usage  for  all
  positions  in  the  codons  and the observed usage. The routine then
  calculates the sum ADIF+CDIF+GDIF+TDIF and plots this value  on  the
  following  scale:  the  base level is such that no known window in a
  coding region has a lower value, whereas 14% of windows in noncoding
  sequences  score  below  it. The top of the scale is not achieved by
  any known noncoding region, but is reached by 16%  of  known  coding
  regions.  The  bar drawn across the plot corresponds to a level that
  is exceeded by 76% of windows in known coding regions but is reached
  by only 24% of windows in known noncoding regions, i.e. 76% of coding
  windows score above and 76% of noncoding windows score below.   This
  is  similar  to  Fickett's method  but without the probabilities and
  weightings from the Los Alamos sequence  library:  it  is  therefore
  unbiased but may well give very similar results.
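
        The calculation described above can be sketched as follows. This
  is a minimal illustration, not the program's own code: the function
  name is hypothetical, raw counts are used for A1, A2, A3 etc., and the
  conversion of the sum onto the plotted scale is not reproduced.

```python
def positional_unevenness(window):
    """Return ADIF+CDIF+GDIF+TDIF for one window read in a fixed phase."""
    window = window.upper()
    total = 0.0
    for base in "ACGT":
        # counts[k] = occurrences of `base` at codon position k+1
        counts = [0, 0, 0]
        for i, b in enumerate(window):
            if b == base:
                counts[i % 3] += 1
        mean = sum(counts) / 3
        total += sum(abs(c - mean) for c in counts)
    return total
```

        Reading the same window in a different phase only permutes the
  three position counts of each base, so the sum is unchanged; this is
  why a single value serves for every reading frame.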
 @45. TX 6 @ Codon improbability on base composition

        Used to find regions of a  sequence  that  might  code  for  a
  protein.

        If dialogue is requested  define  a  window  length  and  plot
  interval.

        The idea of the method is, that of all sequence features  that
  we  know,  it  is  only  coding regions that will give rise to codon
  biases well above those expected from the base  composition.   If  a
  region  of  sequence  shows  sufficiently  strong codon bias then we
  conclude that it is coding for a  protein.   Using  the  multinomial
  distribution we have derived a function to measure the improbability
  of  observing  a  set  of  codons  from  a  sequence  of  the  given
  composition.  Using  the Poisson distribution we have worked out the
  distribution of the improbability. The program  plots  the  observed
  improbability   minus   the  expected  improbability  (the  mean  as
  calculated from the Poisson distribution). The plots  are  presented
  against  a scale of units of standard deviation as measured from the
  Poisson distribution. As with the other Staden and McLachlan  method
  the  program puts an extra point at a fixed level for the highest of
  the three probabilities; for this function this point is  placed  at
  six  standard  deviations  above the mean expected level. The top of
  each plot corresponds to 12 standard deviations above  the  expected
  level and the bottom corresponds to the expected value.

        Analysis of the application of the method to the EMBL sequence
  library  indicates  that the method does work for most sequences and
  that the levels of improbability roughly correlate  with  levels  of
  expression.  Coding regions will show high peaks in all three frames
  making interpretation more difficult than  for  some  of  the  other
  methods.
 @46. TX 6 @ Codon improbability on amino acid composition

        Used to find regions of a  sequence  that  might  code  for  a
  protein.

        If dialogue is requested define a window  length  and  a  plot
  interval.

        The idea of the method is, that of all sequence features  that
  we  know,  it  is  only  coding regions that will give rise to codon
  biases such that, for each amino acid, some codons are used far more
  frequently  than  others. The method is independent of what the bias
  actually is, requiring only that it is  present.   If  a  region  of
  sequence  shows sufficiently strong codon bias then we conclude that
  it is coding for a protein.  Using the multinomial  distribution  we
  have  derived a function to measure the improbability of observing a
  set of codons from a sequence of the given  composition.  Using  the
  Poisson  distribution  we  have  worked  out the distribution of the
  improbability. The program plots the  observed  improbability  minus
  the  expected improbability (the mean as calculated from the Poisson
  distribution). The plots are presented against a scale of  units  of
  standard  deviation  as  measured  from the Poisson distribution. As
  with the other Staden and McLachlan method the program puts an extra
  point  at  a fixed level for the highest of the three probabilities;
  for this function this point is placed at  six  standard  deviations
  above  the  mean expected level. The top of each plot corresponds to
  12 standard deviations above  the  expected  level  and  the  bottom
  corresponds to the expected value.
 @47. TX 6 @ Shepherd RNY preference method

        Used to find regions of a  sequence  that  might  code  for  a
  protein. Based on the method of Shepherd (PNAS 78 1596-1600, 1981).

        If dialogue is requested  define  a  window  length  and  plot
  interval.

        Shepherd has found that many genes have a preference  for  the
  use of codons of the form RNY where R=purine, Y=pyrimidine and N=any
  base.  He  has attributed this to the remnants  of  a  primitive
  genetic  code. The calculation is similar to that for the Staden and
  McLachlan method, the p1's being simply the  number  of  RNY  codons
  found in frame 1 etc and the P's being p/(p1+p2+p3).
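
        A minimal sketch of the counting step, assuming a window given
  as a plain string of A, C, G and T. The function name is hypothetical
  and the plotting of the P values is not shown.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")

def rny_frame_preferences(window):
    """Return ([p1, p2, p3], [P1, P2, P3]) for one window."""
    window = window.upper()
    counts = []
    for frame in range(3):
        n = 0
        # step through complete codons of this frame
        for i in range(frame, len(window) - 2, 3):
            # R at codon position 1, any base at 2, Y at 3
            if window[i] in PURINES and window[i + 2] in PYRIMIDINES:
                n += 1
        counts.append(n)
    total = sum(counts)
    if total == 0:
        return counts, [0.0, 0.0, 0.0]
    return counts, [c / total for c in counts]
```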
 @48. TX 6 @ Ficketts method

        Used to find regions of a  sequence  that  might  code  for  a
  protein.  Based  on  the method of Fickett (Nucl. Acid Res.10 1982),
  but plots values for fixed window lengths rather than over the whole
  of open reading frames.

        If dialogue is requested  define  a  window  length  and  plot
  interval.  The  results  are  plotted  in  a  box divided into three
  horizontal strips.

        Sections of the sequence with values plotted in the top  strip
  of  the box are adjudged to be coding, those in the middle strip "no
  decision", and those in the bottom "not coding".

        The program performs the following calculations: let A1 =  the
  number of occurrences of base A in position  1  of  codons,  A2  for
  position 2 etc. Similarly for bases  C,G  and  T.  For  each  window
  position calculate Apos=max(A1,A2,A3)/(min(A1,A2,A3)+1).   Similarly
  for C,G and T to give  4  positional  values.  Also  count  the  base
  composition for the window to give Acomp, Ccomp etc. Fickett  tested
  each of these 8 parameters singly as to their ability to distinguish
  coding from noncoding regions and arrived at probabilities of coding
  for the range of values each can take = Pcod. He also measured their
  relative abilities and gave weightings to each of the  8  parameters
  = Pw. To calculate the "TESTCODE" for a window we first look up  the
  Pcod  for each of the calculated compositional and positional values
  and  then  calculate  TESTCODE=sum(Pcod*Pw).  TESTCODE  is   plotted
  relative to three levels of decision: the top division="coding", the
  middle="no opinion" and the bottom division="non coding".
 @49. TX 6 @ tRNA gene search.

        Used to find segments of a sequence that might code for tRNAs.
  Looks  for  potential cloverleaf forming structures and then for the
  presence of expected conserved bases. Presents  results  graphically
  or draws out the cloverleafs.

        If dialogue is requested a large number of parameters need  to
  be given values, including some loop lengths, scores for each of the
  four stems, and scores for the conserved bases.

        The program was first  described  in  Staden  Nucl.  Acid  Res
  817-825  (1980). The tRNAs  that  have  been  sequenced  so far have
  two characteristics that can be used to locate  their  genes  within
  long  DNA  sequences.   Firstly  they   have   a  common   secondary
  structure  -  the  cloverleaf  -  and   secondly,  particular  bases
  almost always appear at certain  positions  in the cloverleaf.   The
  cloverleaf  is composed of four base-paired stems  and  four  loops.
  Three  of  the  stems  are  of  fixed  length  but  the fourth,  the
  dhu  stem which usually has four  base  pairs,  sometimes  has  only
  three.   All  of  the  loops  can  vary  in  size.    The  following
  relationships between the stems in the cloverleaf are assumed in the
  program:   (a) there are no bases between one end  of  the aminoacyl
  stem  and  the  adjoining tuc stem;  (b) there are two bases between
  the  aminoacyl stem and the dhu stem;  (c) there is one base between
  the  dhu  stem and the anticodon stem;  (d) there are at least three
  bases  between  the  anticodon  stem  and the tuc stem.  The program
  looks first for cloverleaf structure and then,   if  required,   for
  conserved bases.  The sizes of the loops, the number of basepairs in
  the stems and the required conserved bases  may  all  be   specified
  by  the  user.  The process of looking for the presence of conserved
  bases  can  reduce  the   number   of   potential  structures  found
  considerably.   The  user  may  also  specify  that an intron may be
  present in the anticodon loop.

        The user may define a minimum number of base  pairs  for  each
  stem  using  the  scoring system G-C, A-T=2 and G-T=1 and scores for
  each of the conserved bases. Recommended values for the stem  scores
  are given by the prompts, together with the percentage  conservation
  of  each  conserved  base as reported by Gauss, Gruter  and  Sprinzl
  (Nucl.  Acid  Res  1979),  but  the  user  must  decide  which
  bases are most  likely  to  be  conserved  for  the  sequence  being
  examined.  The output shows the position of the possible gene in the
  sequence by a vertical line the height of which shows the number  of
  basepairs  made in the stems. The cloverleaf structure is also drawn
  but will scroll up off the screen. Output of  the  cloverleafs  will
  look like:

         6942
                      A
                    A-U
                    A-U
                    G-C
                    A-U
                    U-A
                    A-U
                    U-A      AAU
                    U   UAUCU
            AA    A    !!!!!
              AAUG     AUAGA   A
           U  !!!!     U    UCA
           C  UUAC      U
            AA    A
                   U-AA A
                   A-U
                   A-U
                   C-G
                   U-A
                  U   A
                  U   A
                   GUC
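
        The stem scoring system (G-C and A-T score 2, G-T scores 1) can
  be sketched as below. The function name is hypothetical; the two
  arguments are the two sides of a stem written so that base i of one
  pairs with base i of the other, and U is treated as T.

```python
PAIR_SCORES = {
    frozenset("GC"): 2,
    frozenset("AT"): 2,
    frozenset("GT"): 1,
}

def stem_score(left, right):
    """Sum the pair scores of a stem; unlisted pairs score 0."""
    left = left.upper().replace("U", "T")
    right = right.upper().replace("U", "T")
    score = 0
    for a, b in zip(left, right):
        score += PAIR_SCORES.get(frozenset((a, b)), 0)
    return score
```

        For the aminoacyl stem drawn above (seven pairs read from top to
  bottom) stem_score("AAGAUAU", "UUCUAUA") gives 14, the maximum allowed
  by the aminoacyl stem score prompt.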

   Typical dialogue follows.

  ? Menu or option number=D49
   tRNA search
  ? Maximum trna length (70-130) (92) =
  ? Aminoacyl stem score (0-14) (11) =
  ? Tu stem score (0-10) (8) =
  ? Anticodon stem score (0-10) (8) =
  ? D stem score (0-8) (3) =
  ? Minimum base pairing total (30-32) (32) =
  ? Minimum intron length (0-30) (0) =
  ? Minimum length for TU loop (4-12) (6) =
  ? Maximum length for TU loop (6-12) (9) =
  ? (y/n) (y) Skip search for conserved bases n
  Give a score for each base, then a minimum total at the end
  ? Base  8, T is 100% conserved. Score (0-100) (0) =
  ? Base 10, G is  95% conserved. Score (0-100) (0) =
  ? Base 11, Y is  96% conserved. Score (0-100) (0) =
  ? Base 14, A is 100% conserved. Score (0-100) (0) =
  ? Base 15, R is 100% conserved. Score (0-100) (0) =
  ? Base 21, A is  97% conserved. Score (0-100) (0) =
  ? Base 32, Y is 100% conserved. Score (0-100) (0) =
  ? Base 33, T is  98% conserved. Score (0-100) (0) =
  ? Base 37, A is  91% conserved. Score (0-100) (0) =
  ? Base 48, Y is 100% conserved. Score (0-100) (0) =
  ? Base 53, G is 100% conserved. Score (0-100) (0) =
  ? Base 54, T is  95% conserved. Score (0-100) (0) =
  ? Base 55, T is  97% conserved. Score (0-100) (0) =
  ? Base 56, C is 100% conserved. Score (0-100) (0) =
  ? Base 57, R is 100% conserved. Score (0-100) (0) =
  ? Base 58, A is 100% conserved. Score (0-100) (0) =
  ? Base 60, Y is  92% conserved. Score (0-100) (0) =
  ? Base 61, C is 100% conserved. Score (0-100) (0) =
  ? Minimum total conserved base score (0-0) (0) =
  ? (y/n) (y) Plot results n

   Searching

         306
                     C
                   C-G
                   C-G
                   G-C
                   T-A
                   C-G
                   A-T
                   T+G     AT
                  A   ATACA
          TTC    T    !!!!   G
             CTGT     TATGG  G
         G    ! !     T    GA
         C   TAAA      C
          GCG    C      G
                  T+GA   C
                  C-G C   T
                  T+G  A   T
                  T-A   G   T
                  T-A    G   A
                 G   G    G   C
                 A   A     G   A
                  AGC       T   C
                             A   T
                              C   T
                               A
                                C T


 @50. TX 7 @ Plot start codons

        This function plots the positions of all start codons for each
  of the three reading frames.
 @51. TX 7 @ Plot stop codons

        This function plots the positions of all stop codons for  each
  of the three reading frames.
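
        A minimal sketch of locating stop codons by frame. It assumes
  the standard genetic code (TAA, TAG, TGA) and numbers frames by the
  position of the first base of the codon; the function name is
  hypothetical and the plotting is not shown.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def stop_codon_positions(seq):
    """Return {frame: [1-based positions of the first base of each stop]}."""
    seq = seq.upper()
    frames = {1: [], 2: [], 3: []}
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOP_CODONS:
            frames[i % 3 + 1].append(i + 1)
    return frames
```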
 @52. TX 7 @ Plot stop codons on the complementary strand

        This function plots the positions of all stop codons for  each
  of the three reading frames on the complementary strand.
 @53. TX 7 @ Plot stop codons on both strands

        This function plots the positions of all stop codons for  each
  of the three reading frames on both strands.
 @54. TX 5 @ Search for longest open reading frames

        This function will report the positions of  the  ends  of  all
  sections  of  sequence  that contain no stop codons. All six reading
  frames are examined. Results are presented in the form  of  an  EMBL
  feature  table.  Hence if the results are stored in a file by use of
  "direct output to disk", the file can be used to translate the  open
  reading frames in a sequence.  Note that in order for the file to be
  used as a feature table it  must  include  either  EMBL  or  GenBank
  headers,  and  a  suitable  "tail".  The simplest header is the word
  FEATURES starting in column 1 of the first line  of  the  file.  The
  simplest  tail  is 2 empty lines at the end of the file. These lines
  are not included when  nip  writes  out  results  in  feature  table
  format.

        Define the minimum length of open reading frame to report  (in
  amino  acids).  Choose to search either or both strands. The program
  displays the end points, the reading frame and strand.

        Typical dialogue follows.

  ? Menu or option number=D54
   Find open reading frames
  ? Minimum open frame in amino acids (5-1000) (30) =100

  X 1 + strand only
    2 - strand only
    3 Both strands
  ? 0,1,2,3 =3

  FT   CDS           1    831       1    831
  FT   CDS        1540   2853       1   1314
  FT   CDS        3130   4242       1   1113
  FT   CDS        5761   6114       1    354
  FT   CDS        6187   6711       1    525
  FT   CDS        1766   2077       2    312
  FT   CDS        2078   2446       2    369
  FT   CDS        4136   5500       2   1365
  FT   CDS        1335   1637       3    303
  FT   CDS        2844   3194       3    351
  FT   CDS        6819   7238       3    420
  FT   CDS        2073   1711  C    1    363
  FT   CDS        2469   2149  C    1    321
  FT   CDS        6542   6144  C    3    399
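
        The search can be sketched for the + strand as below. This is an
  illustration only: the function name is hypothetical, the standard
  stop codons are assumed, and the complementary strand and the feature
  table formatting are not handled. Each tuple holds start, end, frame
  and length in bases, in the same order as the columns listed above.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code

def open_frames(seq, min_aa=30):
    """List (start, end, frame, length) for stop-free stretches, + strand."""
    seq = seq.upper()
    found = []

    def report(start, end, frame):
        # start and end are 0-based, end exclusive
        if end > start and (end - start) // 3 >= min_aa:
            found.append((start + 1, end, frame + 1, end - start))

    for frame in range(3):
        start = end = frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                report(start, i, frame)   # stretch ends before the stop
                start = i + 3             # next stretch begins after it
            end = i + 3
        report(start, end, frame)         # stretch running to the end
    return found
```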

 @55. TX 8 @ Search for E. coli promoter (general)

        Searches for E. coli promoter-like sequences using a  standard
  weight matrix. The positions of the matches are plotted. No dialogue
  is required.

        The method was first described in Staden R. Nucl. Acid Res. 12
  505-519  1984.   This  search  uses  a  weight matrix taken from the
  frequency tables contained in Hawley, D. K. and McClure, R.,  Nucl.
  Acid Res. 11 2237-2255 (1983).  The weight matrix is divided into  3
  sections that are separated by varying sizes of gap: the -35 region,
  the -10 region and
  the  +1  region.   The algorithm first looks for a sufficiently good
  -35 region, then for the best -10 region within range and  then  for
  the  best  +1  region  within range of the -10; each separate region
  must score above  the  lowest  known  score  for  the  corresponding
  section. The gap penalty is then applied and two plots produced: one
  with gap penalties, one without.  Scaling  is  such  that  no  known
  promoter  scores below the bottom level and no known promoter scores
  above the top level when the weight matrix is applied.

        Two other functions also look for E. coli promoters: 56  looks
  for  sites  on  the complementary strand and 57 looks for individual
  -35 and -10 regions and plots them on a scale such that the  top  is
  the highest known value +10% and the bottom is the lowest known -10%.
 weights for E. coli promoters
 -35 region:
 P -50-49-48-47-46-45-44-43-42-41-40-39-38-37-36-35-34-33-32-31-30-29-28-27-26
   107109109110110110110110110111111110111112112112112112112112112112112112112
 T  41 33 32 25 34 22 35 35 42 27 32 42 47 14 92 94 11 19 15 37 46 34 38 48 34
 C  22 27 18 29 20 14 20 12 22 23 16 25 10 43  7  6 11 18 60  8 25 23 23 17 20
 A  28 38 30 37 35 56 42 42 37 42 39 18 25 26  2  6  2 72 26 50 26 34 25 26 31
 G  16 11 29 19 21 18 13 21  9 19 24 26 29 29 11  6 88  3 11 17 15 21 26 21 27
 -10 region:
 P -23-22-21-20-19-18-17-16-15-14-13-12-11-10 -9 -8 -7 -6 -5
   112112112112112112112112112112112112112112112112112112112
 T  35 28 28 27 39 51 34 43 26 31 89  3 49 15 19108 31 29 21
 C  34 21 24 27 12 25 20 25 20 27 10  2 16 14 22  3 13 16 30
 A  20 39 33 33 39 23 29 16 23 19  2106 29 66 57  1 35 23 31
 G  23 24 27 25 22 13 29 28 43 35 11  1 18 17 14  0 33 24 30
 +1 region:
 P -2 -1  1  2  3  4  5  6  7  8  9 10
   86 88 85 88 88 88 88 88 88 88 88 88
 T 16 22  2 42 27 23 20 25 27 15 16 29
 C 29 49  4 25 25 13 18 22 17 17 16 17
 A 20  9 45 16 24 25 28 24 24 32 35 26
 G 21  8 37  5 12 27 22 17 20 24 21 16
 Notes:  E. coli promoters have been shown to  contain  2  regions  of
 conserved  sequence  located  about  10  and 35 bases upstream of the
 transcription startsite. These are TATAAT and TTGACA with an  allowed
 spacing  of  15  to  21  bases  between.  The  spacing  with  maximum
 efficiency was 17 bases and all but 12 of the 112 sequences could  be
 aligned  with  a  separation of 17 +/- 1 bases. The standard promoter
 has spacing 7 and 17 bases between the startsite and the -10  region,
 and  the  -10  and -35 regions, respectively. The spacing between the
 -10 region and the startsite is usually  6  or  7  bases  but  varies
 between  4  and 8 bases.  There is an AT rich region of 8 to 10 bases
 upstream of the  -35  region.   Initiation with  a  purine  is  highly
 preferred, with G being used if A is not present.
 Gap penalties:
         15 0.02   (only exists as mutant)
         16 0.2
         17 1.0
         18 0.2
         19 0.05   (guess)
         20 0.02   (guess)
         21 0.01   (guess)
 @56. TX  8 @ Search for E. coli promoter (general) strand

        This  function  searches  for  E.  coli   promoters   on   the
  complementary strand of the sequence. See the notes on option 55.
 @57. TX 8 @ Search for E. coli promoter sequences. (-35 and -10)

        This  function  searches  separately  for  the  -35  and   -10
  sequences of an E. coli promoter. See the notes on option 55.
 @58. TX 8 @ Search for procaryotic ribosome binding sites

        This function searches for the 5' ends  of  prokaryotic  genes
  using  an  unusual  weight  matrix.  The  search  is relatively slow
  because the matrix is 101 bases in length. No dialogue is required.

        The method was first described in Staden Nucl.  Acid  Res.  12
  505-519  1984.  It actually looks for more than a  ribosome  binding
  site, as is explained below.  It uses the weight  matrix  w101  of
  Stormo and Schneider (NAR 10 2971-3024, 1982), which with  a  cutoff
  score of 2 finds all gene starts in their library.
  P-60-59-58-57-56-55-54-53-52-51-50-49-48-47-46-45-44-43-42-41-40-39-38-37-36
  T  5  1 -3  9-14  7 15 -5  3-16-17  4 18  5 -3 -1  2  4  5 -5  7  8 -5-15  6
  C-21 -6-11-21  0  8 -7-12 -1  1  0-19 12 -3 -1 10  2 -8 -5-11  8  1 23  6 -5
  A  7 -2 13 -2 -8-13-18  5  0 -5 13  8-15  9 -4 -7  9  0 -8-11-10 -6 -7 -5 -6
  G -6 -9 -7  0  8-16 -4 -2-16  1 -4  8-14  5 11-13-24  3  7 22-11 -9-15 10 -4

  P-35-34-33-32-31-30-29-28-27-26-25-24-23-22-21-20-19-18-17-16-15-14-13-12-11
  T  3  4 16 -4  7 11 -4 -1 12  8 10 -1  1  8  2-10-16 11  1 -3 16 -3-36 -8-27
  C  2-14 -3 -8-10-21  2  0 -2 -1-11 -3 -1  5-11 -4  7  0-14  6 -8-20 -7-36-44
  A-12 -1-27 -3 -6  0-12 -3 -4 -7 14 -2 -4 -6  0 12  5 -9  0-11-11 10  8  2  8
  G  4 -5 -6 -3 -1 -4 -1 -4-15  0-14  3 10-19 -3-10 -7 -7  7  1 -8 -6 15 21 42

  P-10 -9 -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
  T-53-27-26-23  2 -7-14-40-28  0-53 75-62-20-40-10-35 -5-12 -1  4 14-23  7 -2
  C-15-50-43-35-38-29-29  1 -9  1-87-55-64-45 11-22-14-20-15-15-10-22 -5  2  6
  A  0 -3 -5  4-20-11  5  6 -2-15 66-69-52 -5 -4  6  8-24 -7-10 -7 13 14 -9-18
  G 35 22 16 -6 -5-15-25-33-28-53-36-50107 -5-37-44-27-15-23-16-29-47-17-29-15

  P 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
  T-26  1  4 -7  3 -4  0-10  8-18  7-22-21  8  4 -3 -6  7 -8  1 -5-16-16  7 -6
  C  6 -8 19 -7  9 -3 17 -2  3 -9  5 22 22  8 -1  1 18  6 11-10 -8  7 10  0  7
  A 14-12-42  1 -5 -4-32 12-10 20 -6 -1  3 -4  4-10 -1 -2-14 11 14 -3  2-13  5
  G-23 -7 -1 -6-17 -4  0-15-14 -4-17-10 -5-13 -8 10-13-13  9 -4 -3 10  2  4 -8

  P 40
  T  0
  C 14
  A  5
  G-21
 These come from w101 of Stormo, Schneider, Gold and Ehrenfeucht Nucl.
 Acid  Res.  10 2997- 3011, 1982. They report that this matrix gives a
 score of at least 2 for all gene starts in their library whereas  all
 other sequences score 1 or less.
 @29. TX 1 @ Reverse and complement the sequence

        Reverses and complements the  current  active  region  of  the
  sequence.
 @60. TX 7 @ Search using a dinucleotide weight matrix

        This function performs  searches  for  short  sequence  motifs
  using an appropriate  dinucleotide weight matrix. In addition it can
  be used to create or modify weight matrices. In order to  perform  a
  search  the  only  input required is the name of the file containing
  the weight matrix.  The results  can  be  presented  graphically  or
  listed. The graphical presentation will draw a line at the position of
  any matches found; the height of the line  is  proportional  to  the
  score. The method is identical to that using weight matrices derived
  from nucleotide frequencies, except that here we use the frequencies
  of dinucleotides.

        For a search, select "use weight matrix", supply the  name  of
  the  file  containing  the  weight matrix, and choose between having
  results plotted  or  listed.  If  dialogue  is  requested  when  the
  function is selected users can alter the cutoff score employed.

        To create a weight matrix several steps are involved.  A  file
  containing an alignment of known motifs is required. (This file must
  be created before the current option is selected. The format  is  as
  follows:  each  sequence is written on a separate line with at least
  one space at the beginning; each sequence is terminated by  a  space
  character,  and  can  be  followed  by a name. The sequences must be
  aligned.) Supply the name of the  file  of  aligned  sequences.  The
  program  reads  and  displays the sequences. Choose between "summing
  logs of weights" or summing weights (i.e. whether to multiply or add
  weights).  If  logs  are used all scores will be negative. Choose if
  all positions in the set of aligned sequences should be used or if a
  mask should be applied. If so selected, define a mask as a string of
  symbols, in which symbol - means ignore and any other  symbol  means
  use. E.g. xx-x--abc means use all positions except 3,5 and 6.

        The program will calculate weights as the frequencies  of  the
  dinucleotides  at  each  unmasked  position  in  the  set of aligned
  sequences. These weights are then applied  to  the  set  of  aligned
  sequences  to  give  a  range   of  "observed"  scores. The mean and
  standard deviation of these scores is displayed. The user  is  asked
  to  supply  several  values  to  be  used  when the weight matrix is
  applied to other sequences: a cutoff score  (by  default,  the  mean
  minus  3  standard  deviations),  a  top score for scaling graphical
  results (by default, the mean plus 3  standard  deviations),  and  a
  position  to  identify  (this means that if a particular base within
  the motif is used as a "landmark", such as the A of the AG in splice
  acceptor  sites,  then  its  position  will be marked in plots). All
  these values are stored along with the weight matrix. Finally supply
  the name of a file to contain the weight matrix.

        Weight matrices can be  "rescaled"  using  a  set  of  aligned
  sequences  in much the same way  as a matrix is created. The purpose
  is to redefine the cutoff scores, and rescaling does not  alter  any
  other values in the weight matrix file.

        The methods have always had to deal with the problem of zeroes
  in  the  matrices.  The  current  versions  employ "Laplace's Law of
  Succession" in which 1 is added to each term.

  Typical dialogue follows.

  ? Menu or option number=D60

   Motif search using dinucleotide weight matrix
  X 1 Use weight matrix
    2 Make weight matrix
    3 Rescale weight matrix
  ? 0,1,2,3 = 2
  ? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ


       1 AGCGTGACTCTTCCCGGAA HIS1
       2 GAGGTGACTCACTTGGAAG HIS1
       3 CGGATGACTCTTTTTTTTT HIS3
       4 ACAGTGACTCACGTTTTTT HIS4
       5 GTCGTGACTCATATGCTTT ARG3
       6 TGAATGACTCACTTTTTGG ARG4
       7 TTCTTGACTCGTCTTTTCT CPA1
       8 CGAATGACTCTTATTGATG CPA2
       9 AGAATGACTAATTTTACTA TRP5
      10 TCGTTGACTCATTCTAATC TRP3
      11 TTGCTGACTCATTACGATT TRP2
      12 GAGATGACTCTTTTTCTTT IV1
      13 GCGATGATTCATTTCTCTG IV2
      14 TAGATGACTCAGTTTAGTC LEU1
      15 TAAGTGACTCAGTTCTTTC LEU4
      16 ATGATGACTCTTAAGCATG ILS1
  Length of motif    18
  ? (y/n) (y) Sum logs of weights n
  ? (y/n) (y) Use all motif positions n
  x means use, - means ignore
  e.g. xx-x---x-x means use positions 1,2,4,8,10
  ? Mask=----XXXXXXXX--------
   Applying weights to input sequences
     1       89.000 AGCGTGACTCTTCCCGGA
     2       91.000 GAGGTGACTCACTTGGAA
     3       93.000 CGGATGACTCTTTTTTTT
     4       90.000 ACAGTGACTCACGTTTTT
     5       94.000 GTCGTGACTCATATGCTT
     6       91.000 TGAATGACTCACTTTTTG
     7       81.000 TTCTTGACTCGTCTTTTC
     8       90.000 CGAATGACTCTTATTGAT
     9       75.000 AGAATGACTAATTTTACT
    10       97.000 TCGTTGACTCATTCTAAT
    11       97.000 TTGCTGACTCATTACGAT
    12       93.000 GAGATGACTCTTTTTCTT
    13       69.000 GCGATGATTCATTTCTCT
    14       90.000 TAGATGACTCAGTTTAGT
    15       90.000 TAAGTGACTCAGTTCTTT
    16       90.000 ATGATGACTCTTAAGCAT
  Top score      97.000  Bottom score      69.000
  Mean      88.750  Standard deviation       7.319
  Mean minus 3.sd      66.794  Mean plus 3.sd     110.706
  ? Cutoff score (-999.00-9999.00) (66.79) =
  ? Top score for scaling plots (66.79-999.00) (110.71) =
  ? Position to identify (0-18) (1) =
  ? Title=GCN4 DI WTS
  ? Name for new weight matrix file=3.WTS

  ? Menu or option number=D60
   Motif search using dinucleotide weight matrix
  X 1 Use weight matrix
    2 Make weight matrix
    3 Rescale weight matrix
  ? 0,1,2,3 =
  ? Motif weight matrix file=3.WTS
   GCN4 DI WTS
  ? Cutoff score (-9999.00-9999.00) (66.79) =40
  ? (y/n) (y) Plot results n
       15     42.00 CAACCCGCTCACCGACAA
       29     42.00 ACAACAGCTCACCCACGC
       93     46.00 AGCCTTCCTCATCGCTGC
      153     40.00 CAGCGGAATCAAACTTAA
      408     42.00 CGATGGATTCAAGTTGAA
      469     47.00 TTAGGAACTCCCTCTGTC
      493     60.00 AAGCTGAATCTTAGCAGC
      530     43.00 CGGAGGGCTCAGTGAGGG
      542     47.00 TGAGGGACTACTGCACCA
      678     41.00 CTTCTGCTTCAAAGAGTT
      709     47.00 AATATGACGGCGCACGTG
      848     54.00 GTCAGAACTCAAATCAGT
      940     49.00 CCGTTGACGACCTCCGCA
      992     42.00 TGGGCACCTCACACCAAG
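
        The "Make weight matrix" step in the dialogue above can be
  sketched as follows. Function names and the pseudocount parameter are
  hypothetical. With raw dinucleotide counts (pseudocount 0), weights
  summed rather than logged, and the mask shown in the dialogue, the
  sketch reproduces the scores, mean, standard deviation and default
  cutoff printed there; setting pseudocount to 1 would correspond to
  adding 1 to each term.

```python
from collections import Counter
from statistics import mean, pstdev

def make_dinucleotide_weights(seqs, mask, pseudocount=0):
    """Count dinucleotides at each unmasked position of aligned sequences."""
    n_pos = min(len(s) for s in seqs) - 1
    weights = {}
    for p in range(n_pos):
        if p < len(mask) and mask[p] == "-":
            continue  # "-" means ignore this position
        counts = Counter(s[p:p + 2] for s in seqs)
        weights[p] = {d: c + pseudocount for d, c in counts.items()}
    return weights

def score(seq, weights, pseudocount=0):
    """Sum the weights of the dinucleotides found at the used positions."""
    return sum(w.get(seq[p:p + 2], pseudocount) for p, w in weights.items())

GCN4 = [
    "AGCGTGACTCTTCCCGGAA", "GAGGTGACTCACTTGGAAG", "CGGATGACTCTTTTTTTTT",
    "ACAGTGACTCACGTTTTTT", "GTCGTGACTCATATGCTTT", "TGAATGACTCACTTTTTGG",
    "TTCTTGACTCGTCTTTTCT", "CGAATGACTCTTATTGATG", "AGAATGACTAATTTTACTA",
    "TCGTTGACTCATTCTAATC", "TTGCTGACTCATTACGATT", "GAGATGACTCTTTTTCTTT",
    "GCGATGATTCATTTCTCTG", "TAGATGACTCAGTTTAGTC", "TAAGTGACTCAGTTCTTTC",
    "ATGATGACTCTTAAGCATG",
]

weights = make_dinucleotide_weights(GCN4, "----XXXXXXXX--------")
scores = [score(s, weights) for s in GCN4]
m, sd = mean(scores), pstdev(scores)      # population standard deviation
cutoff, top = m - 3 * sd, m + 3 * sd      # default cutoff and plot top
```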


 @61. TX 8 @ Search for eukaryotic ribosome binding sites

        Searches  for  eukaryotic   ribosome   binding   sites   using
  weightings derived from Sargan, Gregory and Butterworth, FEBS Lett. 147
  133-136 1982.  No dialogue is required. First  described  in  Staden
  Nucl. Acid Res. 12 505-519 1984.
 mRNA WTS FOR EUKARYOTES SARGAN,GREGORY,BUTTERWORTH FEBS LET
 147 133-136 1982
 P  -7 -6 -5 -4 -3 -2 -1  1  2  3
   102102102102102102102102102102
 T  19 24 31 12  0 18  5  0102  0
 C  20 15 32 65  5 42 52  0  0  0
 A  50 27 27 19 86 36 34102  0  0
 G   6 29 12  6 11  6 11  0  0102
 VIRAL ONLY
 P  -7 -6 -5 -4 -3 -2 -1  1  2  3
    41 41 41 41 41 41 41 41 41 41
 T  14 12 16  4  2 13  9  0 41  0
 C   7  3 13 17  7  9 14  0  0  0
 A  15 10  6 10 27 15  9 41  0  0
 G   5 16  6 10  5  4  9  0  0 41
 The Sargan et al paper puts forward the hypothesis that there  is  an
 interaction between some mRNA leader sequences and a highly conserved
 structure in the 18S rRNA of eukaryotic  ribosomes.  The  attempt  to
 substantiate  the hypothesis includes a table of base frequencies for
 sequences  immediately  5'  to  start  codons.   They  examined   102
 sequences and I have used the base frequencies they found as a weight
 matrix for searching for eukaryotic gene starts. I don't yet know how
 good  this  method  is. The viral sequences were found to be slightly
 different but the separate table  shown  here  is  not  used  in  the
 program.
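
        A sketch of a search with the first (non-viral) table above. It
  assumes that a window is scored by summing the tabulated counts for
  the base found at each position; the program's actual scoring and
  cutoff are not stated here, so both the scoring rule and the names
  below are illustrative assumptions.

```python
POSITIONS = [-7, -6, -5, -4, -3, -2, -1, 1, 2, 3]
COUNTS = {
    "T": [19, 24, 31, 12, 0, 18, 5, 0, 102, 0],
    "C": [20, 15, 32, 65, 5, 42, 52, 0, 0, 0],
    "A": [50, 27, 27, 19, 86, 36, 34, 102, 0, 0],
    "G": [6, 29, 12, 6, 11, 6, 11, 0, 0, 102],
}
ZERO = [0] * len(POSITIONS)  # used for any symbol other than T, C, A, G

def window_score(window):
    """Sum the table entries for a 10-base window (assumed scoring rule)."""
    return sum(COUNTS.get(b, ZERO)[i] for i, b in enumerate(window.upper()))

def scan(seq, cutoff):
    """Return (1-based position of table position 1, score) for each hit."""
    n = len(POSITIONS)
    hits = []
    for start in range(len(seq) - n + 1):
        s = window_score(seq[start:start + n])
        if s >= cutoff:
            hits.append((start + 8, s))  # table position 1 is window base 8
    return hits
```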
 @62. TX 8 @ Search for splice junctions

        Used to search  for  mRNA  splice  junctions  using  a  weight
  matrix.  The  default  weight  matrix is still that derived from the
  paper of Mount (Nucl. Acids Res. 10,  459-472).  However  users  may
  employ  their  own  tables.   By  default  the positions of possible
  junctions will be plotted rather than listed.   The  diagram  splits
  the  donor plot into 3 horizontal boxes so that all the sites marked
  in any box are from  the  same  reading  frame.  The  acceptor  plot
  appears  above  the donor plot and is split in an equivalent way. So
  sites marked  as  donors  and  acceptors  in  equivalent  boxes  are
  compatible.  i.e.  donors  from  donor  box  1  are  compatible with
  acceptors from acceptor box 1, etc. Of course it is the  combination
  of  reading  frame  and splice sites that really matters, and donors
  from box 1 can be compatible with acceptors in box 3 if the  reading
  frame switches.

        If dialogue is selected users can employ  their  own  file  of
  weights  (see  below  for the format), can change the cutoff scores,
  and can elect to have the results listed rather than plotted. Listed
  results  show  the position (of the last or first base in the exon),
  the frame and the matching  sequence.   The  frequency  table  shown
  below  is  used  as  a  default  weight  matrix  and  AG  and GT are
  obligatory at the appropriate positions.  The plots are scaled so
  that the top and bottom of the scale are the highest and lowest
  values achieved by the junction sequences in the set used to
  compile the frequency table.

        In the light of current knowledge it  would  be  sensible  for
  users to use the weight matrix search option (20) to create matrices
  that define  more specific splice junctions. If so it  is  important
  that  the positions "marked" are the last base in the donor exon and
  the first base in  the  acceptor  exon.  To  make  a  weight  matrix
  suitable  for  use  with  this  function follow the instructions for
  option 20 and create files for both donor and acceptor  sites.  Then
  concatenate  the  two  matrix files with the donor file first.  Note
  that any positions in the weight matrix that are 100% conserved will
  be made obligatory (normally the AG and GT).

  Mount donors redone 16-4-91
      12     3   -16.085    -7.500
  P  -2  -1   0   1   2   3   4   5   6   7   8   9
  N 136 136 136 136 136 136 136 136 136 136 136 136
  T  28   8  15  17   0 136   9  16   7  84  30  36
  C  41  60  16   7   0   0   3  13   3  17  28  39
  A  40  56  89  12   0   0  83  91  12  23  53  33
  G  27  12  16 100 136   0  41  16 114  12  25  28
  Mount acceptors redone 16-4-91
      18    15   -26.142   -14.400
  P -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
  N 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113
  T  58  50  57  59  67  56  58  49  47  66  64  31  34   0   0  11  41  31
  C  21  28  34  25  29  33  35  32  42  40  33  25  74   0   0  23  28  41
  A  17  11  11  18   7  17  12  23  15   3  10  29   5 113   0  24  21  21
  G  17  24  11  11  10   7   8   9   9   4   6  28   0   0 113  55  23  20
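
        The layout of the tables above can be read with a few lines of
  code.  The following Python sketch is illustrative only and is not
  part of the program.  It assumes that the first line is a title, that
  the second line holds the motif length, the column number of position
  0, and a lower and an upper score, and that the rows labelled P, N, T,
  C, A and G follow.  The function name parse_weight_matrix is
  hypothetical.

```python
def parse_weight_matrix(text):
    """Parse one weight matrix table laid out like the Mount tables."""
    lines = [line for line in text.splitlines() if line.strip()]
    title = lines[0].strip()
    # Assumed meaning of the header line: motif length, column number of
    # position 0, lower score bound, upper score bound.
    length, mark, low, high = lines[1].split()
    rows = {}
    for line in lines[2:8]:
        fields = line.split()
        rows[fields[0]] = [int(value) for value in fields[1:]]
    # Every row must supply one value per motif position.
    for name, values in rows.items():
        if len(values) != int(length):
            raise ValueError("row %s has %d values" % (name, len(values)))
    return {
        "title": title,
        "length": int(length),
        "mark": int(mark),
        "low": float(low),
        "high": float(high),
        "positions": rows["P"],
        "totals": rows["N"],
        "counts": {base: rows[base] for base in "TCAG"},
    }


DONORS = """\
Mount donors redone 16-4-91
    12     3   -16.085    -7.500
P  -2  -1   0   1   2   3   4   5   6   7   8   9
N 136 136 136 136 136 136 136 136 136 136 136 136
T  28   8  15  17   0 136   9  16   7  84  30  36
C  41  60  16   7   0   0   3  13   3  17  28  39
A  40  56  89  12   0   0  83  91  12  23  53  33
G  27  12  16 100 136   0  41  16 114  12  25  28
"""

donors = parse_weight_matrix(DONORS)
```

        A file made by concatenating a donor and an acceptor table
  would need to be split into its two tables before each is passed to
  this function.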
 @63. TX 7 @ Search using a weight matrix (complementary)

        This  function  searches  the  complementary  strand  of   the
  sequence   using  a  weight  matrix.  Many motifs can bind to either
  strand of the DNA and this  function  allows  users  to  search  the
  complementary strand without having to change the orientation of the
  sequence. See option 20 for more details.
 @64. TX 3 @ Plot observed-expected word frequencies

        This  option is designed to examine the  abundances  of  short
  words  in  a  sequence to see if particular ones are either under or
  over represented. It compares the observed and expected  frequencies
  and  plots  them along the sequence. There has been some work on the
  relative amounts of CG dinucleotides  in  eukaryotic  sequences  (e.g.
  Bird,  Nature  321, 209-213 (1986)) and this new routine can be used
  to examine such biases, or any others that might be interesting.

        The user selects a word - say CG -, a  window  length,  and  a
  maximum  and  minimum  scale  for  plotting the results. The program
  examines each successive window length along the sequence, with each
  window  overlapping  the previous one by windowlength-1. The program
  counts the base frequencies  in  each  window,  and  the  number  of
  occurrences  of  the  chosen  word within the window. Using the base
  frequencies it calculates an expected number of occurrences for  the
  chosen  word  (simply  by  multiplying the relevant frequencies). It
  plots observed-expected, and hence will show regions that  are  rich
  or  depleted  in  the  chosen  word.  The  longest allowed word is 9
  characters, but the calculation of the expected frequencies  becomes
  less appropriate as the word length increases above 2.
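
        The calculation described above can be sketched in Python as
  follows.  This is an illustration, not the program's own code.  The
  function name obs_minus_exp is hypothetical, and the sketch assumes
  that the expected count is the product of the base frequencies in the
  window multiplied by the number of positions at which the word could
  start; the exact scaling used by the program is not stated here.

```python
def obs_minus_exp(sequence, word, span):
    """Return observed minus expected counts of word for each window."""
    sequence = sequence.upper()
    word = word.upper()
    starts_per_window = span - len(word) + 1
    differences = []
    # Successive windows overlap the previous one by span - 1.
    for start in range(len(sequence) - span + 1):
        window = sequence[start:start + span]
        observed = sum(
            1 for i in range(starts_per_window)
            if window[i:i + len(word)] == word
        )
        # Expected probability: product of the base frequencies in the window.
        probability = 1.0
        for base in word:
            probability *= window.count(base) / span
        expected = probability * starts_per_window
        differences.append(observed - expected)
    return differences
```

        Positive values mark windows that are rich in the word and
  negative values mark windows that are depleted in it.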

        Typical dialogue follows.

  ? Menu or option number=D64
  Plot composition differences (obs-exp))
  Default String=CG
  ? String=
  ? odd span length (3-401) (101) =
  ? plot interval (1-20) (5) =
  ? Maximum plot value (-6.31-25.25) (6.31) =
  ? Minimum plot value (-25.25-6.31) (-6.31) =

   Missing graphics display here

 @65. TX 9 @ Search for polya sites

        This function simply searches for the sequence AATAAA
  (Proudfoot and Brownlee, Nature 263, 211-214, 1976) and marks each
  occurrence with a short vertical line.
 @66. TX 1 @ Interconvert t and u

        This function interconverts T and U characters in  the  active
  sequence, i.e. between DNA and RNA.
 @67. TX 7 @ Search for patterns of motifs

        This option searches for patterns of motifs. Patterns  can  be
  defined  interactively  or read from files. Results can be displayed
  in several ways in both graphical and textual form. The option can
  also be used to create pattern files for searching libraries. It is
  extremely
  flexible and  consequently  the  following  documentation  is  quite
  lengthy.  However the routine is capable of searching for almost any
  known pattern. In addition the flexibility does not make the option
  difficult to use, and the user interface has been simplified
  considerably since the methods were first published.

        Users should refer to the "typical dialogue" shown  below  for
  the most helpful information on using the program.

        There  are  currently  four  ways  to  display  the   matching
  patterns:  1=each individual motif and its position is listed; 2=all
  the sequence between, and including  the  two  outermost  motifs  is
  listed;  3=graphical,  with  a vertical line marking the position of
  the leftmost motif; 4 = EMBL feature table format, where the  KEYNAM
  field is the motif name, the FROM and TO fields denote the ends of
  the match, and the DESCRIPTION field is "Program".

        When it is defined for  the  first  time  a  pattern  must  be
  entered  interactively  at the keyboard, but the pattern description
  can be saved to a file. This file can be  used  for  all  subsequent
  searches.

        When defining a pattern interactively select a motif class and
  the program will request the required inputs.

        The program gives each motif an identifying name  and  number.
  For  motifs  other than the first, a range of allowed positions must
  be defined (Note that sets of motifs included using the OR  operator
  will  all  be  given  the  same  range, and so the program will only
  request range values for the first  motif  in  any  such  set).   To
  specify  the  allowed  range  for  a  motif the user must supply the
  following: the identifying number of the motif,  relative  to  which
  the  current  motif's  positions  are  to  be  defined  (termed  the
  "reference motif"); a "relative start position"  and  a  range.  The
  relative  start  position  can  be  negative or positive. A negative
  start position means that although the reference motif  is  searched
  for  first,  the  current  motif  can  be found to its left.  A zero
  relative start position means their left ends are superimposed.  The
  default start position is to butt-joint the motif to the right-hand end
  of the  "reference  motif".  The  range  is  "the  number  of  extra
  positions" that the motif can take.

        The program will  display  the  probability  of  finding  each
  motif.  These  values  are presented in the following form: .1234E-5
  means 0.1234 times 10 to the power -5.

        After the pattern has been defined, the program  will  type  a
  description of it on the screen. It will then allow the user to give
  an overall cutoff score and overall probability cutoff.

        Typical dialogue  for  all  the  different  motif  classes  is
  displayed below.

  ? Menu or option number=67
    Pattern searcher
  ? (y/n) (y) Read pattern from keyboard
  X 1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =
  ? Motif name=Ematch
  ? String=AA
  Probability of score     2.0000 = 0.595E-01
  X 1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =2
  ? Motif name=AAA
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-1) (1) =
  ? Relative start position (-1000-1000) (3) =
  ? Number of extra positions (0-1000) (0) =
  ? string=AAA
  ? Minimum matches (1.00-3.00) (3.00) =2
  Probability of score     2.0000 = 0.149E+00
    1 Exact match
  X 2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =3
  ? Motif name=T'S
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-2) (2) =
  ? Relative start position (-1000-1000) (4) =
  ? Number of extra positions (0-1000) (0) =
  ? String=TTT
  ? Minimum score (0.00-108.00) (108.00) =72
  Probability of score    72.0000 = 0.258E+00
    1 Exact match
    2 Percentage match
  X 3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =4
  ? Motif name=GCN4
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-3) (3) =
  ? Relative start position (-1000-1000) (4) =
  ? Number of extra positions (0-1000) (0) =
  ? Weight matrix file name=GCN4
   GCN4 FROM WEIGHTS 17-11-87
  Probability of score   -22.0020 = 0.139E-02
    1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
  X 4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =5
  ? Motif name=GCN4
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-4) (4) =
  ? Relative start position (-1000-1000) (20) =
  ? Number of extra positions (0-1000) (0) =
  ? Weight matrix file name=GCN4
   GCN4 FROM WEIGHTS 17-11-87
  Probability of score   -22.0020 = 0.606E-03
    1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
  X 5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =6
  ? Motif name=LOOP
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-5) (5) =
  ? Relative start position (-1000-1000) (20) =
  ? Number of extra positions (0-1000) (0) =
  ? Stem length (1-60) (6) =
  ? Minimum loop length (-6-60) (0) =
  ? Maximum loop length (0-60) (0) =5
  ? Minimum score (1.00-12.00) (12.00) =10
  Probability of score    10.0000 = 0.598E-02
    1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
  X 6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =7
  ? Motif name=Tstep
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-6) (6) =
  ? (y/n) (y) Relative to 5 prime end
  ? Relative start position (-1000-1000) (1) =
  ? Number of extra positions (0-1000) (0) =
  ? String=TTT
  ? Step (1-20) (3) =
  Probability of score     3.0000 = 0.367E-01
    1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
  X 7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =8
  ? Motif name=REPEAT
  X 1 And
    2 Or
    3 Not
  ? 0,1,2,3 =
  ? Number of reference motif (1-7) (7) =
  ? Relative start position (-1000-1000) (4) =
  ? Number of extra positions (0-1000) (0) =2
  ? Repeat length (1-60) (6) =
  ? Minimum gap (0-60) (0) =
  ? Maximum gap (0-60) (0) =4
  ? Minimum score (1.00-6.00) (6.00) =5
  Probability of score     5.0000 = 0.554E-02
    1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
  X 8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =9
  ? (y/n) (y) Save pattern in a file N

  Pattern description

  Motif  1 named Ematch   is of class    1
  Which is an exact match to the string
  AA
  Motif  2 named AAA      is of class    2
  which is a match of score     2. to the string
  AAA
  and the 5 prime base can take positions      3 to       3
  relative to the 5 prime end of motif   1
  It is anded with the previous motif.
  Motif  3 named T'S      is of class    3
  which is a match of score    72. to the string
  TTT
  and the 5 prime base can take positions      4 to       4
  relative to the 5 prime end of motif   2
  It is anded with the previous motif.
  Motif  4 named GCN4     is of class    4
  Which is a match to a weight matrix with score -22.002
  and the 5 prime base can take positions      4 to       4
  relative to the 5 prime end of motif   3
  It is anded with the previous motif.
  Motif  5 named GCN4     is of class    5
  Which is a match to the complement of a weight matrix with score -22.002
  and the 5 prime base can take positions     20 to      20
  relative to the 5 prime end of motif   4
  It is anded with the previous motif.
  Motif  6 named LOOP     is of class    6
  Which is a stem-loop structure with stem length    6 and score    10.
  The loop can have sizes      0 to      5
  and the 5 prime base can take positions     20 to      20
  relative to the 5 prime end of motif   5
  It is anded with the previous motif.
  Motif  7 named Tstep    is of class    7
  Which is an exact match to the string
  TTT
  with a step size of     3
  and the 5 prime base can take positions      1 to       1
  relative to the 5 prime end of motif   6
  It is anded with the previous motif.
  Motif  8 named REPEAT   is of class    8
  Which is a repeat with repeat length    6 and score     5.
  The loop-out can have sizes      0 to      4
  and the 5 prime base can take positions      4 to       6
  relative to the 5 prime end of motif   7
  It is anded with the previous motif.
  Probability of finding pattern = 0.2348E-14
  Expected number of matches  = 0.5100E-09
  ? Maximum pattern probability (0.00-1.00) (1.00) =
  ? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
   Select display mode
  X 1 Motif by motif
    2 Inclusive
    3 Graphical
    4 EMBL feature table
  ? 0,1,2,3,4 =4
   Searching


  Total matches found      0

  Menus and their numbers are
  m0 = This menu
  m1 = General
  m2 = Screen control
  m3 = Statistical analysis of content
  m4 = Structures and repeats
  m5 = Translation and codons
  m6 = Gene search by content
  m7 = Prokaryotic signal search
  m8 = Eukaryotic signal search
   ? = Help
   ! = Quit
  ? Menu or option number=67
    Pattern searcher
  ? (y/n) (y) Read pattern from keyboard
  X 1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =
  ? Motif name=Arun
  ? String=AAAAAA
  Probability of score     6.0000 = 0.210E-03
  X 1 Exact match
    2 Percentage match
    3 Cut-off score and score matrix
    4 Cut-off score and weight matrix
    5 Complement of weight matrix
    6 Inverted repeat or stem-loop
    7 Exact match, defined step
    8 Direct repeat
    9 Pattern complete
  ? 0,1,2,3,4,5,6,7,8,9 =9
  ? (y/n) (y) Save pattern in a file N

  Pattern description

  Motif  1 named Arun     is of class    1
  Which is an exact match to the string
  AAAAAA
  Probability of finding pattern = 0.2103E-03
  Expected number of matches  = 0.1522E+01
  ? Maximum pattern probability (0.00-1.00) (1.00) =
  ? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
   Select display mode
  X 1 Motif by motif
    2 Inclusive
    3 Graphical
    4 EMBL feature table
  ? 0,1,2,3,4 =4
   Searching


  FT   Arun       1582   1587       Program
  FT   Arun       3160   3165       Program
  FT   Arun       4204   4209       Program
  FT   Arun       5691   5696       Program
  FT   Arun       6710   6715       Program
  Total matches found      5
  Minimum and maximum observed scores        6.00        6.00


        These methods allow users to define  and  search  for  complex
  patterns  of  motifs  defined as single objects.  The programs allow
  individual DNA motifs to be defined in  eight  different  ways,  and
  protein  motifs  in  six.  Motifs  are  combined,  using the logical
  operators AND, OR and NOT, to describe a pattern. The  pattern  also
  specifies   the  ranges  of  allowed  relative  separations  of  the
  individual motifs.

        First some definitions.

        A MOTIF is a contiguous subsequence of fixed length.   At  its
  simplest  it  could  be a single definite base or amino acid; a more
  complex motif might be better represented as a consensus or a weight
  matrix;  two  more-abstract  types  of motif are direct and inverted
  repeats.

        A PATTERN is a higher order of structure defined by a list  of
  motifs.  The  motifs  in  a  pattern  are combined using the logical
  operators AND, OR  and  NOT.  The  list  also  defines  the  allowed
  relative  separations  of the motifs. In the current versions of the
  programs up to 50 motifs can be combined into a single  pattern.  So
  using these definitions there are two differences between motifs and
  patterns: 1) the distances between  all  elements  of  a  motif  are
  fixed,  but  the  separations  of parts of patterns can vary; 2) all
  characters in a motif are defined using the same method (class), but
  different  parts of a pattern can be defined in completely different
  ways.

        Each motif can be represented in 9 ways (known  as  the  motif
  class):

             MOTIF CLASSES
  CLASS           DESCRIPTION
   1       Exact match to a short defined sequence. The IUB symbols
           can be used for DNA sequences.
   2       Percentage match to a defined short sequence. In nucleic acids,
           the IUB symbols can be used.
   3       Match to a defined sequence, using a score matrix and cutoff
           score. The DNA matrix (see option 18) gives scores to IUB symbols
           depending on their level of redundancy. MDM78 is used for proteins.
   4       Match to a weight matrix with cutoff score.
   5       As class 4 but on the complementary strand.
   6       Inverted repeat or stem-loop. Fixed stem length, range of
           loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
   7       Exact match to short sequence but with a defined step size.
   8       Direct repeat. Fixed repeat length, range of loop-out sizes,
           cutoff score, and score matrix (for protein sequences MDM78 and
           for nucleic acids an identity matrix).
   9       Membership of a set. A list of sets of allowed amino acids for
           each position in the motif. The sets are separated by commas(,).
           For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
           acids in which one of I,V or L must be found in the first position,
           then anything in the next two positions, D,E,K or R in the fourth
           position and F,Y,W,I,L,V or M in the fifth. This class only applies
           to protein sequences because for nucleic acids "membership of a
           set" can be achieved using IUB symbols.

      Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
      nucleic acids.


        Class 1: exact match.

        The motif is defined by a short sequence,  which  for  nucleic
  acids, may include IUB symbols. All symbols must match.

        Class 2: percentage match

        The motif is defined by a short sequence,  which  for  nucleic
  acids,  may  include  IUB  symbols.  The  minimum number of matching
  characters must also be specified.
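
        A minimal Python sketch of a percentage match follows.  It is
  illustrative only.  The names IUB, count_matches and
  find_percentage_matches are hypothetical, the table holds the standard
  IUB ambiguity codes, and the sketch assumes that the sequence being
  searched contains only A, C, G and T.  A class 1 (exact) match is the
  special case in which the minimum number of matches equals the motif
  length.

```python
# Standard IUB ambiguity codes: each motif symbol stands for a set of bases.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_matches(motif, segment):
    """Count positions where the sequence base is allowed by the motif symbol."""
    return sum(1 for symbol, base in zip(motif, segment) if base in IUB[symbol])


def find_percentage_matches(sequence, motif, min_matches):
    """Return (1-based position, matches) for every window reaching min_matches."""
    sequence = sequence.upper()
    motif = motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        matches = count_matches(motif, sequence[start:start + len(motif)])
        if matches >= min_matches:
            hits.append((start + 1, matches))
    return hits
```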

        Class 3: match using a score matrix

        The motif is defined by a short sequence,  which  for  nucleic
  acids,  may  include IUB symbols. The motif is not compared directly
  with the sequence  to  count  the  number  of  matching  characters.
  Instead  a  matrix is used to provide a score for all possible pairs
  of characters. The motif score for any position along  the  sequence
  is  the  sum  of  the scores found by looking-up the scores for each
  pair of aligned characters. A match  is  declared  if  some  minimum
  score is achieved.

        Class 4: weight matrix

        The motif is defined by a table of values (called  weights  or
  scores). The table gives a score for finding each possible character
  at each position along the length of the  motif.  It  therefore  has
  dimension  motif-length  x character-set-size, and allows us to give
  different  scores  for  each  character  at  each  position.  It  is
  equivalent  to  having  a  different  score matrix for each position
  along the motif, and provides the most flexible and specific  method
  of  defining  motifs. The weight matrices are created by program NIP
  option 20 and stored as files. The file contains the values for each
  position, as well as an overall minimum score. There are two ways in
  which these values can be used to calculate an overall score for any
  section  of  the  sequence. The simplest way is to add the values in
  the file. (This  means  that  the  highest  possible  score  can  be
  calculated  by adding the top value at each column position, and the
  lowest by adding the bottom value.)  The normal  way  of  using  the
  values  in  the  file  is  as follows. First the programs divide the
  values in each column by the column total so that they sum to 1.0.
  Then  the  natural logs of these values are used as scores. When the
  matrix is applied to a sequence these logarithmic values are  summed
  (which  is  of  course  equivalent  to multiplying the frequencies).
  Note that using the natural logs of the frequencies as  weights  and
  adding  them  means  that the overall cutoff score must be less than
  zero, whereas if the original values in the weight matrix  file  are
  added,  the  cutoff  score  will  be  greater  than zero. The search
  routines therefore decide whether the user wants to  add  values  or
  multiply frequencies by examining the value of the cutoff score: they
  will add the values in the file if the cutoff is greater than zero,
  and add the logs of the frequencies if it is less than zero.  Hence
  we effectively get two
  motif classes in one. The program NIP, when creating  weight  matrix
  files,  will  ask  the  user  whether  the scores should be added or
  multiplied. If the values in the table  have  been  defined  without
  using a set of aligned sequences it is easier for the user to choose
  a cutoff score if the values are added.
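
        The two scoring modes can be sketched in Python as shown below,
  using the donor counts listed under option 62.  This is a sketch
  under stated assumptions, not the program's code.  The names
  score_segment and search are hypothetical.  The sketch assumes that
  the sequence contains only T, C, A and G, and it treats a zero count
  in log mode as an impossible match (minus infinity); how the program
  itself handles zero counts is not described here.

```python
import math

DONOR_COUNTS = {
    "T": [28, 8, 15, 17, 0, 136, 9, 16, 7, 84, 30, 36],
    "C": [41, 60, 16, 7, 0, 0, 3, 13, 3, 17, 28, 39],
    "A": [40, 56, 89, 12, 0, 0, 83, 91, 12, 23, 53, 33],
    "G": [27, 12, 16, 100, 136, 0, 41, 16, 114, 12, 25, 28],
}


def score_segment(counts, segment, cutoff):
    """Score one segment; the sign of the cutoff selects the scoring mode."""
    use_logs = cutoff < 0  # negative cutoff: sum natural logs of frequencies
    total = 0.0
    for column, base in enumerate(segment):
        value = counts[base][column]
        if use_logs:
            if value == 0:
                return float("-inf")  # assumption: a zero count cannot match
            column_total = sum(counts[b][column] for b in counts)
            total += math.log(value / column_total)
        else:
            total += value  # positive cutoff: add the values from the file
    return total


def search(counts, sequence, cutoff):
    """Return (1-based position, score) wherever the score reaches the cutoff."""
    length = len(next(iter(counts.values())))
    hits = []
    for start in range(len(sequence) - length + 1):
        score = score_segment(counts, sequence[start:start + length], cutoff)
        if score >= cutoff:
            hits.append((start + 1, score))
    return hits
```

        In this sketch the position reported is the first base of the
  matching segment, not the "marked" position of the matrix.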

        Class 5: complement of weight matrix

        The motif is defined by  a  weight  matrix,  but  the  program
  searches for its complement.

        Class 6: inverted repeat, or stem-loop

        The motif is defined by a repeat length, a minimum score and a
  range  of  loop  sizes.  The scores are A-T=2, G-C=2, G-T=1, else=0.
  The loop sizes are defined by a minimum and  maximum  distance  from
  the  3'  end  of  the  stem.  For a stem-loop these will be positive
  numbers. For example to define a stem of length  8  and  loop  sizes
  varying  from  3 to 5, the stem would be set to 8, the minimum start
  distance to 3 and the maximum to 5. To define an inverted repeat the
  minimum  distance  will  be  negative.  For  example  stem length=9,
  minimum distance=-9, and  maximum  distance=-8  will  find  inverted
  repeats  of lengths 9 and 10. E.g. AAAAATTTT and AAAAATTTTT would be
  found, the first having a base at  its  centre,  the  second  having
  none.
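
        The stem-loop scoring can be illustrated with the Python sketch
  below.  The names PAIR_SCORES, stem_score and find_stem_loops are
  hypothetical.  The sketch covers only loops of zero or more bases
  between the two arms of the stem; the negative distances used for
  inverted repeats are not handled.

```python
# Pair scores from the text: A-T=2, G-C=2, G-T=1, anything else 0.
PAIR_SCORES = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}


def stem_score(sequence, start, stem, loop):
    """Score the stem whose 5' arm begins at 0-based index start."""
    end = start + 2 * stem + loop  # one past the last base of the 3' arm
    total = 0
    for i in range(stem):
        # Base i of the 5' arm pairs with base i counted back from the 3' end.
        pair = (sequence[start + i], sequence[end - 1 - i])
        total += PAIR_SCORES.get(pair, 0)
    return total


def find_stem_loops(sequence, stem, min_loop, max_loop, min_score):
    """Return (1-based position, loop size, score) for each acceptable stem-loop."""
    hits = []
    for start in range(len(sequence)):
        for loop in range(min_loop, max_loop + 1):
            if start + 2 * stem + loop > len(sequence):
                break  # larger loops would run off the end of the sequence
            score = stem_score(sequence, start, stem, loop)
            if score >= min_score:
                hits.append((start + 1, loop, score))
    return hits
```

        With a stem length of 6 the maximum score is 12, which agrees
  with the range offered in the typical dialogue.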

        Class 7: exact match, defined step size.

        The motif is defined by a short sequence,  which  for  nucleic
  acids,  may  include  IUB symbols. All symbols must match. The class
  differs from class 1 in that searches will move  in  steps  of  some
  given  size. For example we could search for a certain codon and use
  a step size of 3 and hence keep in a single reading frame.
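
        A stepped search can be sketched in Python as follows.  The
  function name find_stepped and its parameter names are hypothetical,
  and for brevity the sketch compares characters directly and does not
  interpret IUB symbols.

```python
def find_stepped(sequence, motif, step, first=1):
    """Return 1-based positions of exact matches, testing every step-th position."""
    hits = []
    # Only positions first, first + step, first + 2 * step, ... are examined.
    for start in range(first - 1, len(sequence) - len(motif) + 1, step):
        if sequence[start:start + len(motif)] == motif:
            hits.append(start + 1)
    return hits
```

        For example, with a step of 3 and first=1 only codons in the
  first reading frame are examined, so a TTT that straddles two codons
  of that frame is not reported.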

        Class 8: direct repeat

        The motif is defined by a repeat length, a minimum score and a
  range  of loop sizes. The scores are defined using MDM78 for protein
  sequences and an identity matrix for nucleic acids.  The loop  sizes
  are defined by a minimum and maximum distance from the 3' end of the
  stem.

        Class 9: membership of a set

        This motif class is for protein sequences. It  is  defined  by
  lists  of  allowed amino acids for each position in the motif, and a
  cut-off score.  Positions at which any amino acid can occur are left
  blank.  All allowed amino acids for each position give a score of 1.
  The motifs can be defined in two ways: either typed at the  keyboard
  or  read in as a weight-matrix-like file.  When the motif is defined
  at the keyboard the sets of allowed amino  acids  are  separated  by
  commas(,).  For example IVL,,,DEKR,FYWILVM defines a motif of length
  5 amino acids in which one of I,V or L must be found  in  the  first
  position, then anything in the next two positions, D,E,K or R in the
  fourth position and F,Y,W,I,L,V or M in the fifth.  To specify  that
  the whole motif must match, a score of 3 would be required (i.e. one
  of the allowed amino acids must be  found  for  each  of  the  three
  defined  positions).  If the motif is read from a file the file must
  have been written by program NIP, or have been saved by the  pattern
  searching  routines.  If  the  user elects to save a pattern, and it
  includes class 9 motifs typed at the keyboard, then the program will
  save  the  class  9 motifs as weight matrix files. Therefore it will
  request file names for each motif of this class. If the motif  given
  above  as  an example were saved the weight matrix file would have 5
  columns.  The first column would contain zeroes except for the I,  V
  and  L  rows which would be set to 1; the next two columns would all
  be zero; the next would be zero except for  the  D,E,K  and  R  rows
  which  would  be  1;  the  final  column  would  contain 1's in rows
  F,Y,W,I,L,V and M, with the rest zero.
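
        The keyboard form of a class 9 motif, its score, and the
  equivalent weight matrix columns can be illustrated with the Python
  sketch below.  The names parse_sets, set_score and to_weight_columns
  are hypothetical, and the order of the amino acid rows is an
  assumption; the order used in the program's files is not given here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # row order is an assumption


def parse_sets(definition):
    """Split a definition such as IVL,,,DEKR,FYWILVM into one set per position."""
    return [set(field) for field in definition.upper().split(",")]


def set_score(sets, segment):
    """Score 1 for each defined position whose residue is in the allowed set."""
    return sum(
        1 for allowed, residue in zip(sets, segment)
        if allowed and residue in allowed
    )


def to_weight_columns(sets):
    """Build one column per position: 1 for allowed residues, 0 otherwise."""
    return [
        {residue: (1 if residue in allowed else 0) for residue in AMINO_ACIDS}
        for allowed in sets
    ]


sets = parse_sets("IVL,,,DEKR,FYWILVM")
columns = to_weight_columns(sets)
```

        Blank positions produce empty sets, so they add nothing to the
  score and give columns that are all zero, as described above.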

        The logical operator (AND, OR or NOT) used to add  each  motif
  to  the  pattern  is  specified by preceding the class number by the
  letters A, O or N. A = AND, O = OR, N = NOT.  The default is  A,  so
  N2  means include, using the NOT operator, a class 2 motif; O2 means
  include, using the OR operator, a class 2 motif; both A2 and 2  mean
  include, using the AND operator, a class 2 motif.
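
        The prefix convention can be sketched in Python as follows.
  The function name parse_selection is hypothetical and the sketch
  covers only the letter-plus-class-number form described above.

```python
OPERATORS = {"A": "AND", "O": "OR", "N": "NOT"}


def parse_selection(text):
    """Split a selection such as N2 into its operator and motif class number."""
    text = text.strip().upper()
    operator = "AND"  # the default when no letter is given
    if text and text[0] in OPERATORS:
        operator = OPERATORS[text[0]]
        text = text[1:]
    return operator, int(text)
```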

        Range setting.

        The motifs in a pattern are numbered according to their  order
  in  the list. Apart from the first motif in a pattern all motifs are
  given a range of allowed positions relative to a  motif  further  up
  the  list.  For example suppose we have a pattern defined by A AND B
  AND C AND D.  Motif A can occur anywhere, but B must have its  range
  of  allowed  positions  defined relative to the position of motif A,
  and C's positions  can  be  defined  relative  to  either  A  or  B,
  depending  on  which  is most convenient, and likewise D's positions
  can be relative to A or B or C.

        Notice that the positions of motifs can be defined relative to
  more  than one motif. Suppose we have a pattern consisting of motifs
  A, B and C, and that B occurs 5-10 residues right of A, C occurs  5-
  10  residues  right  of B, and also C is never more than 15 residues
  from A. Then it is quite consistent  with  the  methods  to  include
  motif C into the pattern twice using the AND operator: once relative
  to A and once relative to B. This will define the  relative  spacing
  and  the  ORDER  of the motifs in the pattern. (If we simply defined
  the position of C relative to A it could be found to the left of B).

        Motifs combined together using the OR operator are  all  given
  the  same range. For example suppose we had a pattern A AND (B OR C)
  AND (D OR E), then B and C each have the same range,  and  D  and  E
  also  have  the same range as one another. The range for D and E can
  be relative to A or to B.

        Motifs cannot have their ranges  defined  relative  to  motifs
  that  are included using the NOT operator. For example if we had the
  pattern A NOT B AND C, then the range for  C  can  only  be  defined
  relative to motif A.

        Speed can be gained by arranging the order of  the  motifs  so
  that  those higher up the list are of types that can be searched for
  rapidly and that are also unlikely to be found.

        Motifs combined by the OR operator are  alternatives:  if  any
  one  of a set of motifs combined by the OR operator is found, then a
  match is declared. All alternatives will be reported. For example if
  we  had a pattern defined by A AND (B OR C), then all places where A
  occurs and B is found within range, and all places where A is  found
  and C is found within range will be reported. A typical use would be
  where we might allow a motif to appear on either strand of  the  DNA
  sequence.  For  example  a  weight matrix representing the heatshock
  element could be used in a pattern which  included  heatshock  as  a
  motif  class  4  combined  using the OR operator with heatshock as a
  motif class 5.

        The probability calculations are performed for each  motif  as
  it  is  defined.  If  an  overall  probability  cut-off is given the
  calculation is repeated for each match  found.  To  achieve  maximum
  searching  speed do not give an overall probability cut-off. Overall
  cut-off scores should only be used if the  motif  classes  used  are
  compatible.

        There are currently several ways to display the matches:  1  =
  each  motif and its position is listed; 2 = all the sequence between
  the two outermost motifs is listed; 3  =  graphical,  with  a  spike
  marking  the  position  of  the leftmost motif. The library versions
  also give entry names, and a one line title; in addition they can be
  used  to  produce  aligned  families of sequences. When this mode of
  output is selected the program will write a separate file  for  each
  match. The files will be called ENTRYNAME.DAT where ENTRYNAME is the
  name of the entry in the library.  The  matching  sequence  will  be
  written  out so that the spacing between motifs is constant, and set
  to the maximum allowed by the pattern definition. Any gaps  will  be
  filled with dashes (-). If the individual sequences were
  subsequently written one above the other they should line up so that
  all motifs are in register. There are two types of output of this
  sort: one, option 4, writes out whole sequences; the other, option 5,
  writes out only the sequences between the two outermost motifs. Note
  that for option 4 users are
  asked  to  type  the position of the first motif, and the reason for
  this is  explained  below.  Consider  a  pattern  found  in  several
  sequences.  Consider only the first motif in the pattern and suppose
  that it was found in different positions  in  these  sequences.  Say
  that  of  these  positions  the  one  furthest from the left end was
  position 100. Then, in order to ensure that all the sequences  would
  align,  we must specify that motif 1 must start at position 100. Any
  sequences in which motif 1 started  nearer  to  the  left  end  than
  position  100  would  be  padded accordingly.  These modes of output
  should only be used when the  position  of  each  motif  is  defined
  relative to its immediate neighbour.
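
        The padding described above can be sketched as follows.  This
  is only an illustration and not the program's own code: the function
  name, its arguments and the choice to put the dashes after the bases
  that lie between two motifs are all assumptions.

```python
def aligned_output(seq, motifs, max_gaps, first_pos=None):
    """Return one match written out with constant motif spacing.

    seq       -- the sequence as a string
    motifs    -- (start, length) for each motif of the match, left to
                 right, with 1-based start positions
    max_gaps  -- maximum gap allowed by the pattern between each pair of
                 neighbouring motifs (one value fewer than motifs)
    first_pos -- if given, the whole sequence is written and padded so
                 that motif 1 starts here (as for option 4); if None,
                 only the region between the outermost motifs is
                 written (as for option 5)
    """
    parts = []
    first_start = motifs[0][0]
    if first_pos is not None:
        if first_pos < first_start:
            raise ValueError("first_pos is left of motif 1 in this sequence")
        # Pad the left end so that motif 1 lands on first_pos.
        parts.append("-" * (first_pos - first_start))
        parts.append(seq[:first_start - 1])
    for i, (start, length) in enumerate(motifs):
        parts.append(seq[start - 1:start - 1 + length])
        if i + 1 < len(motifs):
            gap = seq[start - 1 + length:motifs[i + 1][0] - 1]
            # Assumption: the dashes follow the intervening bases.
            parts.append(gap + "-" * (max_gaps[i] - len(gap)))
    if first_pos is not None:
        last_start, last_length = motifs[-1]
        parts.append(seq[last_start - 1 + last_length:])
    return "".join(parts)


# Two made-up sequences, each with motif 1 = CC and motif 2 = TTT.
a = aligned_output("AAACCGTTTGG", [(4, 2), (7, 3)], [3], first_pos=6)
b = aligned_output("ACCGGGTTTA", [(2, 2), (7, 3)], [3], first_pos=6)
print(a)  # --AAACCG--TTTGG
print(b)  # ----ACCGGGTTTA
```

        Written one above the other, the two lines have both motifs in
  register. Here 6 plays the role of position 100 in the text: it must
  be at least as large as the rightmost start of motif 1.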

        The pattern descriptions can be saved to  files.  These  files
  can  be used instead of typing definitions again at the keyboard. As
  the files are annotated, they can easily  be  changed  using  system
  editors,  and  the  modified  versions  used  to  define the variant
  patterns for the programs.

        Use of lists of entry names

        The two programs that operate on libraries have the ability to
  restrict  their  searches to subsets of the libraries. This does not
  require sublibraries to be created but instead is achieved by  using
  files  containing  a  list of the entry names of sequences. The user
  may  choose  to  search  only  those  entries  on   the   list   or,
  alternatively, to search all but those on the list (i.e. in the
  latter case the list contains the names of those  to  be  excluded).
  The  programs  can search libraries that have indexes and those that
  do not.  If a list of names for inclusion is used, then  the  search
  will  be  faster if the index is present. In all other circumstances
  the whole library will be read. The list must be  in  library  order
  except  when  it  is  used  to  include  entries,  and  an  index is
  available.  The list must contain each  entry  name  on  a  separate
  line, with the name starting in column 1 of the line, i.e. there must
  be no spaces at the start of the line.  The list of entry names  can
  be  produced by the keyword searches of nip, pip, etc as long as the
  listings produced have a space character separating the  entry  name
  from the entry description. This will depend on how well the library
  reformatting programs work. For example swissprot entry  names  tend
  to  run  into the beginning of the descriptions, but other libraries
  are generally OK.
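
        The way such a list restricts a search can be  sketched  as
  follows. This is an illustration only: the function names and the
  entry names are made up, blank lines are skipped as an assumption,
  and the library is taken to be a simple series of (name, sequence)
  pairs read in full, so neither the index nor the library-order rule
  is modelled.

```python
def read_entry_names(lines):
    """Return the entry names from the lines of a list file.

    Each name must start in column 1 and is taken to end at the first
    space, so a keyword-search listing can be used directly.
    """
    names = []
    for line in lines:
        if not line.strip():
            continue  # assumption: blank lines are ignored
        if line[0].isspace():
            raise ValueError("entry name must start in column 1: %r" % line)
        names.append(line.split()[0])
    return names


def select_entries(library, names, exclude=False):
    """Yield the library entries that are on the list, or, if exclude
    is true, all the entries that are not on it."""
    wanted = set(names)
    for entry_name, sequence in library:
        if (entry_name in wanted) != exclude:
            yield entry_name, sequence


library = [("ENTRYA", "ACGT"), ("ENTRYB", "GGCC"), ("ENTRYC", "TTAA")]
names = read_entry_names(["ENTRYA  first made-up title\n",
                          "ENTRYC  second made-up title\n"])
included = list(select_entries(library, names))
excluded = list(select_entries(library, names, exclude=True))
```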

        One use of the programs  is  to  look  for  patterns  that  we
  already  know  about, but in new sequences. However it is hoped that
  they will also be useful for finding new motifs. For example several
  known   control   regions  in  nucleic  acid  sequences  consist  of
  particular direct or inverted repeats; the inclusion of  direct  and
  inverted  repeats  as  motif  classes  makes  it  possible  to  find
  previously unknown motifs of these types. Using these  new  programs
  we can ask questions like: "are there any inverted or direct repeats
  near to sections of sequence that contain both a  CCAAT  box  and  a
  TATA  box?"; and to search for such things throughout the libraries.
  In addition, the mode of output in which all  the  sequence  between
  the two outermost motifs found is printed out allows us  to  extract
  sequences and  examine  them  in  more  detail  for  further  common
  subsequences.  For example we might want to collect together all the
  sequences between putative CCAAT and TATA boxes.

        A further use of  the  inverted  repeat  motif  class  is  the
  following.  If  a  regulatory  sequence in DNA is poorly defined but
  also an inverted repeat, then it might be an advantage to specify it
  both  as a consensus sequence and a superimposed inverted repeat. In
  this way two weak definitions can be combined to produce a  stronger
  pattern.

        Given only a few examples of a motif it should be possible  to
  perform  initial  searches  using  a  class 3 motif, and then, using
  plausible matching sequences, create a more specific  weight  matrix
  for the same motif.

        If motifs are combined with  the  first  motif  using  the  OR
  operator  they  will  be ignored until all permutations that include
  the first motif have been looked for. The whole search will then  be
  repeated,  in  turn, for each of those motifs that are combined with
  the first motif using the OR operator.  An  interesting  consequence
  of  this is that the program can be used, without change, to compare
  any newly determined sequence with all known individual  motifs.  We
  achieve  this by having a pattern in which all known relevant motifs
  are combined using the OR operator.  If we ask to use  this  pattern
  with  a  sequence,  the  program  will  automatically  compare  each
  individual motif in  the  pattern  with  the  whole  length  of  the
  sequence.  As the number of known motifs grows this should become an
  increasingly useful standard procedure.
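
        The effect of such an all-OR pattern, in which every motif is
  compared in turn with the whole sequence, can be sketched as below.
  Exact string matching stands in for the  program's  motif  classes,
  and the function name and the example sequence are made up.

```python
def scan_all_motifs(seq, motifs):
    """Compare each motif with the whole sequence and return, for each
    motif name, the 1-based positions of every match found."""
    hits = {}
    for name, motif in motifs.items():
        positions = []
        start = seq.find(motif)
        while start != -1:
            positions.append(start + 1)
            # Move on by one base so overlapping matches are found too.
            start = seq.find(motif, start + 1)
        hits[name] = positions
    return hits


known_motifs = {"CCAAT box": "CCAAT", "TATA box": "TATA"}
hits = scan_all_motifs("GGCCAATCTATAAAGGCCAAT", known_motifs)
print(hits)  # {'CCAAT box': [3, 17], 'TATA box': [9]}
```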

        The  NOT  operator  is  obviously  useful  for   making   sure
  particular  motifs  are  not  present,  but  it  can also be used to
  bracket the levels of matches found. We may want a degree  of  match
  that  lies  between  two  limits - binding should occur, but not too
  strongly; or base-pairs should  form,  but  not  too  many.  We  can
  specify  this by asking for a match with a low score, in combination
  with a match with a high score, both for the same motif, but with the
  high score included using the NOT operator.
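
        The bracketing can be sketched as below. The score used here,
  a count of positions that agree with a consensus, is only a stand-in
  for whichever motif class is being used, and the function names are
  hypothetical.

```python
def match_score(window, consensus):
    """Count the positions at which window agrees with consensus."""
    return sum(a == b for a, b in zip(window, consensus))


def bracketed_matches(seq, consensus, low, high):
    """Return (position, score) for every window that reaches the low
    score AND does NOT reach the high score; positions are 1-based."""
    n = len(consensus)
    hits = []
    for i in range(len(seq) - n + 1):
        score = match_score(seq[i:i + n], consensus)
        reaches_low = score >= low
        reaches_high = score >= high
        if reaches_low and not reaches_high:
            hits.append((i + 1, score))
    return hits


# The perfect match at position 1 is rejected by the NOT condition.
weak_sites = bracketed_matches("TATAAATATTTTAT", "TATAAA", low=4, high=6)
print(weak_sites)  # [(3, 4)]
```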

        The algorithm is designed to find all sections of  a  sequence
  that   satisfy   the  pattern  rather  than  only  the  best  match.
  Particularly if some of the  motifs  in  a  pattern  are  less  well
  defined  than  others, this can often result in the same region of a
  sequence being reported as having several matches that vary only  in
  the positions of the weakest motifs.

        General remarks on motif searching

        Generally motifs are short subsequences that are thought to be
  associated  with particular functions in some known sequences. Often
  we  search  for  them  to  try  to  understand  or  interpret  other
  sequences.  Sometimes  we  search  for motifs and patterns to test a
  hypothesis  about  their  role:  are  they  found  in  the  expected
  positions in the expected sequences? In doing so we should  remember
  that, in both proteins and nucleic acids, what we are really looking
  for  is  a  particular  three  dimensional  structure  with  certain
  affinities for other structures, and that we are assuming  that  the
  sequence of the motif alone defines the 3D structure we are searching
  for. The overall structure may be completely different to  those  in
  which  the  motif  is  functional,  and  hence  the motif may have a
  different shape or be  inaccessible.  We  should  be  aware  of  the
  importance  of  the context in which a motif is found. Where does it
  lie relative to the overall structure,  is  it  accessible,  is  the
  three  dimensional  spacing between it and other motifs correct? For
  example, is it on the same side of the double helix, and the correct
  distance  from  some  other  motif?  How  does  context  affect  our
  assessment of the significance of finding  a  motif?  Finding  false
  mammalian  mRNA splice junctions in non-coding sequences is far less
  important than  finding  false  sites  in  pre-mRNA  sequences,  but
  finding  them  in  the  correct  places  is most important! In other
  words, it is often the case that when we are searching for  a  motif
  that  is  known  to  be necessary for some function, then a positive
  result in the form of a match in  the  required  position,  is  more
  important  than a high background of matches in the wrong positions.
  Being able to write down the probability of finding  a  motif  in  a
  random  sequence  tells  us how well it is defined. In nucleic acids
  the DNA may contain many superimposed types of information  such  as
  those  concerned  with  histone  phasing,  protein  coding  or  mRNA
  secondary structure. These overlapping "codes"  may  interfere  with
  one  another  causing  matches to motifs to be poorer than expected.
  In general we will only have a limited number  of  examples  of  the
  motif and we do not know how representative they are.
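
        For the simplest case the probability mentioned above can  be
  written down directly. The sketch below assumes an exact motif and a
  random sequence whose bases are independent with fixed frequencies;
  it does not reproduce the program's own calculation for weight
  matrices or other motif classes, and the function names are made up.

```python
def motif_probability(motif, freqs):
    """Probability that an exact motif matches at one given position of
    a random sequence with independent bases of frequency freqs."""
    p = 1.0
    for base in motif:
        p *= freqs[base]
    return p


def expected_matches(motif, freqs, seq_length):
    """Expected number of matches in a random sequence of seq_length."""
    windows = max(seq_length - len(motif) + 1, 0)
    return windows * motif_probability(motif, freqs)


uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(motif_probability("TATA", uniform))       # 0.00390625
print(expected_matches("TATA", uniform, 1027))  # 4.0
```

        A motif this short is expected about once  in  every  256
  positions by chance alone, which is one way of seeing why the context
  of a match matters so much.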

        Sequences have superimposed functions: some parts  may  be  of
  general structural importance and give rise to an overall framework,
  and other parts give specificity and hence are not  common;  we  may
  want  to  use a set of aligned sequences to define a motif, but want
  to use only the framework positions.  Alternatively we may  want  to
  pick  out only those parts of a set of aligned sequences that give a
  particular property, and to ignore other similarities that  are  due
  to  some  other  property and which could obscure the pattern we are
  interested in.  It is possible to apply a mask to a set  of  aligned
  sequences  in  order to give weight to selected positions only.  The
  ability to define a mask allows certain positions to be used in  the
  motif  and  others to be ignored, and yet still permits the use of a
  set of aligned sequences to calculate weights. The mask is requested
  and applied by the program and results in the masked positions being
  zero in the weight matrix. The mask is defined in the following way.
  Suppose  we  had a motif of length 15, then the mask x--x--xx-x will
  give zero weights to positions 2,3,5,6 and 9 (note it is the  dashes
  (-)  that  are significant and that positions 1,4,7,8,10,11,12,13,14
  and 15 will be non-zero). Of course the same set of sequences  could
  be used with several alternative masks in order to extract different
  features and create corresponding weight matrices.
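
        The effect of the mask in the example above can  be  sketched
  as follows. The weight matrix is represented here as a list with one
  dictionary of weights per motif position; that representation, the
  weight values and the function name are assumptions made for the
  illustration.

```python
def apply_mask(weights, mask):
    """Return a copy of the weight matrix in which every position that
    has a dash (-) in the mask is set to zero. Positions marked with any
    other character, or lying beyond the end of the mask, are kept."""
    masked = []
    for i, column in enumerate(weights):
        if i < len(mask) and mask[i] == "-":
            masked.append({base: 0 for base in column})
        else:
            masked.append(dict(column))
    return masked


# A made-up matrix for a motif of length 15.
weights = [{"A": 1, "C": 2, "G": 3, "T": 4} for _ in range(15)]
masked = apply_mask(weights, "x--x--xx-x")
zeroed = [i + 1 for i, column in enumerate(masked)
          if all(value == 0 for value in column.values())]
print(zeroed)  # [2, 3, 5, 6, 9]
```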

        The programs are described in Staden, R., CABIOS 4, 53-60 (1988);
  Staden, R., CABIOS 5, 89-96 (1989); and Methods in Enzymology 183,
  193-211 (1990).
 @ end of help