staden-lg/help/NIP.RNO

.NPA
.SP 1
.left margin1
@-1. TX  0 @General
.sp
@-2. T   0 @Screen control
.sp
@-2. X   0 @Screen
.sp
@-3. T   0 @Statistical analysis of content
.sp
@-3. X   0 @Statistics
.sp
@-4. T   0 @Structures and repeats
.sp
@-4. X   0 @Structures
.sp
@-5. TX  0 @Translation and codons
.sp
@-6. TX  0 @Gene search by content
.sp
@-7. TX  0 @General signals
.sp
@-8. TX  0 @Specific signals
.sp
@0.  TX  -1 @NIP
.para
This is a program for analysing individual nucleotide sequences. It can
read sequences stored in many of the most commonly used formats, and
performs all of the usual simple analyses. However, the main purpose of
the program is to provide methods for finding the function of each
section of a sequence. In general no single method can give an
unequivocal interpretation of a sequence, so we need to use many
techniques together and to combine their results. For this reason the
program presents many of its results graphically.
.para
General information is contained in the user interface. Online
documentation for any function follows a consistent pattern: summary,
list of inputs, list of outputs, details, example.
.LEFT MARGIN1
@1. TX 0 @ Help
.LEFT MARGIN2
.para
This option gives online help. The user should select option numbers and
the current documentation will be given. Note that option 0 gives an
introduction to the program, and that ? will get help from anywhere in
the
program.
The following functions are included:
.left margin1
@2. TX 0 @ Quit
.left margin2
.para
This function stops the program.
.left margin1
@3. TX 1 @ Read a new sequence
.LEFT MARGIN2
.para
This option allows users to read in new sequences, browse through annotations,
 or search sequence
libraries for keywords. Sequences can be read from "personal"
sequence files or from sequence libraries. These are referred to as the
sequence "source". Personal files can be stored in several formats:
Staden, PIR, EMBL, GENBANK and GCG.
At LMB we use "Staden" format for sequencing, and all
the libraries are stored in their original formats. Note, however, that libraries
such as EMBL or GenBank that are divided into several files (e.g. GenBank has
13 separate files) are indexed as a whole. This means that users do not need
to know which file contains an entry, only which library.
When the user selects to read in a sequence the program first asks for the
sequence "source".
.para
If the user selects "personal" the program will ask for
the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
the file. For PIR format the user will also be required to know the entry
name of the sequence as the file can contain several. For the other formats
only a single entry is expected. The file will be read, its length and
composition will be displayed and the option left.
.para
If the user selects "library" as the sequence source the program will display a
list of available libraries. The programs are capable of handling all current
libraries but which ones are available will vary from site to site. At LMB we
have several libraries and also weekly updates of data gathered between releases.
The program will ask users to select a library and then give a list of options:
.lit

 X  1 Get a sequence
    2 Get annotations
    3 Get entrynames from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords

.end lit
If "Get a sequence" or "Get annotations" is selected, users will be asked to
type the entry name. The option will be left when a sequence is selected or
! is typed. The composition and length will be displayed.
.para
The text index contains all words from feature tables, reference titles,
definition lines, keywords lists and comments, so the text index search
is most useful. It is also the fastest. Up to 5 words can be searched for
at once. The words should be typed separated by spaces, for example
.lit
 ? Keywords=P53 mouse murine tumo

.end lit
will search for all entries that contain words starting with p53, mouse,
murine and tumo. Only the unique entries that contain ALL words will be
listed. Before listing the matching entries
the program will show the number of 'hits' for each word and ring the bell.
Escape is possible at this point, or after each screenful of entries.
In addition to the entry names the text search displays the primary accession
number, the sequence length and up to 80 characters of description.
(The search of 'titles' is now redundant because the full text index
contains all the title words and the search is much faster. It will probably
be removed from the program.)
All searches are independent of case. Where
possible the program will offer default entry names.
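.para
The matching rule described above (each word is treated as a prefix, case
is ignored, and only entries containing ALL the words are listed) can be
sketched in Python. This is an illustration only and not the program's own
code: the function name search_text_index and the toy index, with made-up
entry names, are assumptions.
.lit
```python
def search_text_index(index, keywords):
    """Return the sorted entry names that match ALL keyword prefixes.

    index is a hypothetical mapping: entry name -> list of indexed words.
    keywords is one line of words separated by spaces, as typed by a user.
    """
    prefixes = [word.upper() for word in keywords.split()]
    if len(prefixes) > 5:
        raise ValueError("up to 5 words can be searched for at once")
    hits = []
    for name, words in index.items():
        upper = [w.upper() for w in words]
        # Every prefix must start at least one word of the entry.
        if all(any(w.startswith(p) for w in upper) for p in prefixes):
            hits.append(name)
    return sorted(hits)


# Toy index with invented entry names, for illustration only.
toy_index = {
    "ENTRYA": ["mouse", "p53", "mRNA"],
    "ENTRYB": ["murine", "tumour", "antigen", "p53"],
    "ENTRYC": ["Mouse", "tumor", "P53"],
}
both = search_text_index(toy_index, "P53 mouse")
prefix = search_text_index(toy_index, "p53 tumo")
```
.end lit
Here "tumo" matches both "tumour" and "tumor", which is why a truncated
word can be a useful search term.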
.para
Typical dialogue follows.
.lit
Select sequence source
X  1 Personal file
   2 Sequence library
? Selection  (1-2) (1) =
Select sequence file format
X  1 Staden
   2 EMBL
   3 GenBank
   4 PIR
   5 GCG
? Selection  (1-5) (1) =
? Sequence file name=M13MP7.SEQ
 Contig title removed
Sequence length=  7238
 Sequence composition
          T          C          A          G          -
      2405.      1539.      1765.      1527.         2.
        33.2%      21.3%      24.4%      21.1%       0.0%
  .
  .
  .


 Select sequence source
 X  1 Personal file
    2 Sequence library
 ? Selection  (1-2) (1) =2
 Select a library
 X  1 EMBL 29 nucleotide library Dec 91
    2 SWISSPROT 20 protein library Nov 91
    3 PIR 31 protein library Dec 91
    4 NRL3D 58 From Brookhaven protein library Dec 91
    5 GenBank
 ? Selection  (1-5) (1) =
Library is in EMBL format with indexes
 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =5
 Search for keywords
 ? Keywords=P53 mouse
P53 hits  68
MOUSE hits  8180

 MMANT01    X00875         536 Murine gene fragment for cellular tumour antigen
 MMANT02    X00876          83 Murine gene fragment for cellular tumour antigen
 MMANT03    X00877          21 Murine gene fragment for cellular tumour antigen
 MMANT04    X00878         261 Murine gene fragment for cellular tumour antigen
 MMANT05    X00879         184 Murine gene fragment for cellular tumour antigen
 MMANT06    X00880         113 Murine gene fragment for cellular tumour antigen
 MMANT07    X00881         110 Murine gene fragment for cellular tumour antigen
 MMANT08    X00882         137 Murine gene fragment for cellular tumour antigen
 MMANT09    X00883          74 Murine gene fragment for cellular tumour antigen
 MMANT10    X00884         107 Murine gene for cellular tumour antigen p53 (exon
 MMANT11    X00885         562 Murine p53 gene 3' region with exon 11
 MMANTP53   M26862         536 Mouse tumor antigen p53 gene, 5' end.
 MMLYN      M64608        2044 Mouse lyn protein mRNA, complete cds.
 MMP53      X00741        1377 Mouse mRNA for transformation associated protein
 MMP53A     M13872        1285 Mouse p53 mRNA, complete cds, clone pcD53.
 MMP53B     M13873        1241 Mouse p53 mRNA, complete cds, clone p53-m11.
 MMP53C     M13874        1322 Mouse p53 mRNA, complete cds, clone p53-m8.
 MMP53G1    X01235         554 Mouse genomic DNA for 5' region of cellular tumou
 MMP53IN4   X60470         729 M.musculus p53 gene for p53 protein, intron 4
 MMP53P     X01236        2132 Mouse pseudogene for cellular tumour antigen p53
 MMP53R     X01237        1773 Mouse mRNA for cellular tumour antigen p53
 MMRSB2P5   M64597         196 Mouse B2 repeat in the 3' flank of protein 53 (p5
      22 different entries found

 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =4
 Search for keywords
 ? Keywords=alpha
 Searching for alpha
 AAGHA          623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
 AAMALI        3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
 AAMALIA       1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
 AAMALIB       1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
 ACA13GT        371 alouatta caraya alpha-1,3gt gene, 3' flank.
 ADHBADA1       102 duck alpha-d-globin gene, exon 1.
 ADHBADA2      1145 duck alpha-a-globin gene and 5' flank
 ADHBADWP       513 duck (white pekin) alpha ii (minor) globin mrna, complete co
 AEACOXABC     5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
 AGA13GT        371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
 AGAAAGFP       282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
 AGAABL         138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
 AGAFAMYA        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
 AGAFAMYB        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
 AGAFAMYC        57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
 AGAFCOXA        98 synthetic alpha-factor/cox iv fusion gene signal peptide.
 AGAGABA       7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
 AGAMYLS        120 synthetic alpha-amylase gene, 5' end.
 AGANPS          95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
!
 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =3
 ? Accession number=v00636
Entry name LAMBDA
 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =2
 Default Entry name=LAMBDA
 ? Entry name=
ID   LAMBDA     standard; DNA; PHG; 48502 BP.
XX
AC   V00636; J02459; M17233; X00906;
XX
DT   03-JUL-1991 (Rel. 28, Last updated, Version 3)
DT   09-JUN-1982 (Rel. 1, Created)
XX
DE   Genome of the bacteriophage lambda (Styloviridae).
XX
KW   circular; coat protein; DNA binding protein; genome;
KW   origin of replication.
XX
OS   Bacteriophage lambda
OC   Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
XX
RN   [1]
RP   1-48502
RA   Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
RT   "Nucleotide sequence of bacteriophage lambda DNA";
RL   J. Mol. Biol. 162:729-773(1982).
XX
!
 Select a task
 X  1 Get a sequence
    2 Get annotations
    3 Get entry names from accession numbers
    4 Search titles for keywords
    5 Search text index for keywords
 ? Selection  (1-5) (1) =
 Default Entry name=LAMBDA
 ? Entry name=
DE   Genome of the bacteriophage lambda (Styloviridae).
 Sequence length  48502
 Sequence composition
           T          C          A          G          -
      11988.     11360.     12336.     12818.         0.
         24.7%      23.4%      25.4%      26.4%       0.0%

.end lit
.left margin1
@4. TX 1 @ Define active region
.LEFT MARGIN2
.para
For its analytic functions
the program always works on a region of the sequence called the "active
region". This function allows the start and end points of the active region
to be reset.
.para
Define  the required start and end points.
.para
When a new sequence is read into the program the active region is
automatically set to start at the beginning of the sequence and extend  to
the
maximum the program can
handle. On most machines this will be to the end of the sequence. The
positions are shown on the screen.
 Note that for
convenience, in the
listing and translation functions, the user is given access to regions
outside the active region.
.left margin1
@5. TX 1 @ List a sequence
.LEFT MARGIN2
.para
The sequence can be listed single or double stranded with line lengths
from
10 to 120 in multiples of 10.
.para
Define the region to list, the line length required and choose between a
single or double stranded display.
The output looks like:
.lit

  GTTAATGTAG CTTAATAACA AAGCAAAGCA CTGAAAATGC TTAGATGGAT
  CAATTACATC GAATTATTGT TTCGTTTCGT GACTTTTACG AATCTACCTA
          10         20         30         40         50

  AATTGTATCC CATAAACACA AAGGTTTGGT CCTGGCCTTA TAATTAATTA
  TTAACATAGG GTATTTGTGT TTCCAAACCA GGACCGGAAT ATTAATTAAT
          60         70         80         90        100

  GAGGTAAAAT TACACATGCA AACCTCCATA GACCGGTGTA AAATCCCTTA
  CTCCATTTTA ATGTGTACGT TTGGAGGTAT CTGGCCACAT TTTAGGGAAT
         110        120        130        140        150

  AACATTTACT TAAAATTTAA GGAGAGGGTA TCAAGCACAT TAAAATAGCT
  TTGTAAATGA ATTTTAAATT CCTCTCCCAT AGTTCGTGTA ATTTTATCGA
         160        170        180        190        200

.end lit
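.para
The layout shown above (blocks of ten bases, the complementary strand
written underneath, and the position printed under the last base of each
block) can be reproduced with a short Python sketch. The function name
list_sequence and its parameters are assumptions made for illustration;
the program itself may build its listings differently.
.lit
```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def list_sequence(seq, line_length=50, double_stranded=True):
    """Format seq in blocks of 10 with numbering under each block."""
    if line_length % 10 != 0 or not 10 <= line_length <= 120:
        raise ValueError("line length must be 10 to 120 in multiples of 10")
    out = []
    for i in range(0, len(seq), line_length):
        chunk = seq[i:i + line_length]
        strands = [chunk]
        if double_stranded:
            # Complement only (not reversed), so it aligns base by base.
            strands.append(chunk.translate(COMPLEMENT))
        for strand in strands:
            blocks = [strand[j:j + 10] for j in range(0, len(strand), 10)]
            out.append("  " + " ".join(blocks))
        # One number per complete block, right aligned under its last base.
        ends = range(i + 10, i + len(chunk) + 1, 10)
        out.append(" " + "".join("%11d" % e for e in ends))
        out.append("")
    return "\n".join(out)


listing = list_sequence("GTTAATGTAGCTTAATAACA", line_length=20)
```
.end lit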
.left margin1
@6. TX 1 @ List a text file.
.LEFT MARGIN2
.para
Allows the user to have a text file displayed on the screen. It will appear
one page at a time.
.para
Supply the name of the file to be displayed.
.left margin1
@7. TX 1 @ Direct output to disk
.LEFT MARGIN2
.para
Used to direct output that would normally appear on the screen to a file.
.para
Select redirection of either text or graphics, and
supply the name of the file that the output should be written to.
.para
 The results from the next options selected will not appear on the screen
but will be written to the file. When option 7 is selected again
the file will be
closed and output will again appear on the screen.
.left margin1
@8. TX 1 @ Write active region to disk
.LEFT MARGIN2
.para
Used to write the current active section of sequence to a disk file in
"Staden format".
.para
Supply a file name and an optional title.
.para
The program has the capability of reading sequences stored in several
formats and so, in conjunction with this option, can be used to reformat
them.
.left margin1
@9. TX 1 @ Edit the sequence
.LEFT MARGIN2
.para
Used to edit sequences or any other files by giving access to the
computer's system editor. For editing sequences the input file should
have already been created using one of the listing functions such as "list
sequence", "list translation" or "list restriction sites above the
sequence".
.para
Supply the name of the file to edit. Wait while the system editor is made
ready (this can take a while on a VAX). Use the editor. Exit from the editor. If a
sequence has been edited, and you want to process it, affirm that the
sequence should be "made active". The edited sequence will replace the
original sequence.
.para
This editing method is designed to give users access to an editor with
which they are familiar - i.e. the one on their machine, and yet to allow
them to edit a sequence which contains all the landmarks they need in
order to know where they are. Users can create files containing simple
listings (single stranded) with numbering, using "list the sequence", and
then edit them with their system editor, using the numbering to know
where they are within the sequence. When the edits are complete they
exit from the editor and the program "analyses" the edited file to extract
only the sequence characters. Similarly a file containing a three phase
translation can be edited, or a file containing a sequence plus its three
phase translation, plus its restriction sites marked above the sequence.
In order to be able to "analyse" such complicated listings and correctly
extract the sequence the following simple rule is used: all lines in the
file that contain a character that is not A,C,T,G or U are deleted. It is
obviously important to be aware of this rule and its implications.
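.para
The extraction rule can be sketched in Python as follows. This is a
minimal illustration, not the program's code. It assumes that blanks
within a line are ignored (otherwise the spaced blocks of a listing would
themselves be rejected) and that only upper case symbols count; both are
assumptions, as is the function name extract_sequence.
.lit
```python
ALLOWED = set("ACGTU")


def extract_sequence(listing_text):
    """Keep only lines made up entirely of A, C, G, T or U (blanks ignored)."""
    kept = []
    for line in listing_text.splitlines():
        symbols = "".join(line.split())  # assumption: blanks are ignored
        if symbols and set(symbols) <= ALLOWED:
            kept.append(symbols)
    return "".join(kept)


edited_file = (
    "  GTTAATGTAG CTTAATAACA\n"
    "          10         20\n"
    " V  N  V  A  *  *  Q\n"
)
sequence = extract_sequence(edited_file)
```
.end lit
One implication of the rule is that any line made up solely of the letters
A, C, G, T or U is kept, whatever it was meant to be. The complementary
strand of a double stranded listing, for example, would be read as
sequence.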
.left margin1
@10. TX 2 @ Clear graphics
.LEFT MARGIN1
.para
 Clears graphics from the screen.
.left margin1
@11. TX 2 @ Clear text
.LEFT MARGIN1
.para
 Clears  text from the screen.
.left margin1
@12. TX 2 @ Draw a ruler
.LEFT MARGIN2
.para
This option
allows the user to draw a ruler or scale along the x axis of the screen to
help identify the coordinates of points of interest. The user can define
the position of the first base to be marked (for example if the active
region is 1501 to 8000, the user might wish to mark every 1000th base
starting at either 1501 or 2000 - it depends if the user wishes to treat
the active region as an independent unit with its own numbering starting
at
its left edge, or as part of the whole sequence). The user can also define
the separation of the ticks on the scale and their height. If required the
labelling routine can be used to add numbers to the ticks.
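.para
The two numbering choices in the example above can be made concrete with a
small Python sketch. The function name ruler_ticks is hypothetical; it
only lists the base positions that would be marked and does no plotting.
.lit
```python
def ruler_ticks(first_tick, region_end, separation):
    """Return the base positions marked, from first_tick up to region_end."""
    return list(range(first_tick, region_end + 1, separation))


# Active region 1501 to 8000, a tick every 1000 bases.
independent = ruler_ticks(1501, 8000, 1000)  # numbered from the left edge
whole = ruler_ticks(2000, 8000, 1000)        # numbered as the whole sequence
```
.end lit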
.left margin1
@13. TX 2 @ Use crosshair
.LEFT MARGIN2
.para
This function puts
a steerable cross on the screen that can be used to find the
coordinates of points in the sequence. The user can move the cross
around using the directional keys; when the space bar is hit the
program will print out the coordinates of the cross in sequence units and
the option will be exited.
.PARA
If instead,
you hit a , the position will be displayed but the cross will remain on
the screen.
.PARA
If a letter s is hit the program will display the sequence around the
crosshair
position, and leave the cross on the screen.
.left margin1
@14. TX 2 @ Reposition plots
.LEFT MARGIN2
.para
The position of each plot is defined relative to a user's drawing
board, which has size 1-10,000 in x and 1-10,000 in y.
Plots for
each option are drawn in a window defined by x0,y0 and xlength,ylength,
where x0,y0 is the position of the bottom left hand corner of the window,
xlength is the width of the window and ylength is the
height of the window.
.lit
   --------------------------------------------------------- 10,000
   1                                                       1
   1       --------------------------------------   ^      1
   1       1                                    1   1      1
   1       1                                    1   1      1
   1       1                                    1 ylength  1
   1       1                                    1   1      1
   1       1                                    1   1      1
   1       --------------------------------------   v      1
   1  x0,y0^                                               1
   1       <---------------xlength-------------->          1
   ---------------------------------------------------------      1
   1                                                   10,000

.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "NIPMARG" when the
program is started. Users can have their own file if required.
As all the plots start
at the same position in x and have the same width, x0 and xlength are the
same for all options. Generally users will only want to change the start
level of the window y0 and its height ylength.
 This option
allows users to change window positions whilst running the program.
The routine prompts first for the number of the option that the user
wishes
to reposition; then for the y start and height; then for the x start and
length. Note that changes to the x values affect all options. If the user
types only carriage return for any value it will remain unchanged.
The cross-hair can be used to choose suitable heights.
.LEFT MARGIN1
@15. TX 2 @ Label a diagram
.LEFT MARGIN2
.para
This routine allows users to label any diagrams they have produced. They
are asked to type in a label. When the user types carriage return to finish
typing the label the cross-hair appears on the screen. The user can
position it anywhere on the screen. If the user types R (for right justify)
the label will be
written on the diagram with its right end at the cross-hair position.
If the user types L (for left justify) the label will be written on the
diagram with its left end at the cross hair position.
The
cross-hair will then immediately reappear. The user may put the same
label
on another part of the diagram in the same way, or hit the space bar
to be asked whether another label is to be typed in.
.para
Typical dialogue follows.
.lit
? Menu or option number=15
Type label then drive cross hair to left or right end
of label position then hit  "L"  to  write label left
justified or  "R"  to  write label right justified or
the space bar to quit


? Label=delta gene

 missing graphics

? Label=

.end lit
.left margin1
@16. TX 2 @Display a map
.LEFT MARGIN2
.para
This draws a map
of any sequence features selected by the user.
These features may be protein coding regions (CDS), tRNA genes (TRNA),
promoter positions (PRM), etc. Users may define their own feature table
key
names. For example I find it convenient to split CDS lines into CDS1,
CDS2
and CDS3 each of which contains only those sequences that code in the
reading frames 1, 2 or 3. Then I can plot them at different heights on
the screen ( suitable heights can be determined by using the cross-hair).
.para
The coordinates must be stored in a file in the format of an EMBL or GenBank
feature table. Note that this means that the file must include either EMBL
or GenBank headers, and a suitable "tail". The simplest header is the word
FEATURES starting in column 1 of the first line of the file. The simplest
tail is 2 empty lines at the end of the file. These lines are not included
when nip writes out results in feature table format.
.para
Typical dialogue follows.
.lit
? Menu or option number=16
 Display a map using an EMBL feature table file
? map file name=hsegl1.ft
? feature code(e.g. CDS) =CDS
X 1 + strand
  2 - strand
  3 both strands
? 0,1,2,3 =
? level (0-9480) (256) =4000

 missing graphics

? feature code(e.g. CDS) =

.end lit
.left margin1
@17. TX 1 @ Search for restriction enzymes
.LEFT MARGIN2
.para
This routine is used to search for short sequences, like restriction
enzyme
recognition sequences,
and can either list  the results or present them graphically. Listings can
take several forms and can include the sequence and its translation.
Examples are given below. The program will also display the names of
enzymes that cut the sequence infrequently. Users can select from sets
of enzymes stored in files or can enter them from the keyboard.
.para
The short
sequences (strings) and their names need to be arranged in a particular
way. See below. Select to search, list an enzyme file or clear the screen.
Choose either a file of enzymes or to enter their recognition sequences at the
keyboard. Choose to search for all the enzymes in the list or to select
from the list. Select a mode of output. Define the sequence as circular or
linear. Select to search for "definite" or "possible" matches. The search
starts, and after the results have been displayed, further searches can be
performed.
.para
When the enzymes and their recognition sequences are stored in a file
they must be defined in the following way. We
call the recognition sequences "strings".
The format is as follows: each string or set of strings must be
preceded by a name, each string must be preceded and
terminated by a slash (/), and
each set of strings must end with 2 slashes (//).
For example
AATII/GACGT'C// defines the name AATII, its recognition sequence
GACGTC
and its cut site with the ' symbol; ACCI/GT'MKAC// defines the name
ACCI
and its recognition sequence includes IUB symbols for incompletely
defined
symbols in nucleic acid sequences;
BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
defines the name BBVI and this time two recognition sequences and cut
sites
are specified in order to correctly show the cutting position relative to
the recognition sequence. If no cut site is included the first base of the
recognition sequence is displayed as being on the 3' side of the
recognition sequence.
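.para
As an illustration of the file format just described, the following Python
sketch splits one entry into its name and its strings, and records where
the cut symbol (') falls in each string. The function name
parse_enzyme_entry and the returned structure are assumptions; this is not
the parser used by the program.
.lit
```python
def parse_enzyme_entry(entry):
    """Parse one entry such as AATII/GACGT'C// into (name, sites).

    Each site is (recognition sequence without the cut symbol, offset),
    where offset is the number of bases to the left of the cut, or None
    if no cut site is marked.
    """
    entry = entry.strip()
    if not entry.endswith("//"):
        raise ValueError("an entry must end with 2 slashes")
    name, *strings = entry[:-2].split("/")
    sites = []
    for string in strings:
        cut = string.find("'")  # -1 when no cut site is given
        sites.append((string.replace("'", ""), cut if cut >= 0 else None))
    return name, sites


aatii = parse_enzyme_entry("AATII/GACGT'C//")
bbvi = parse_enzyme_entry("BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//")
```
.end lit
For BBVI the two strings give one cut 13 bases after the start of the
first string and one cut immediately before the second string.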
.para
These collections of strings and their
names can be read from disk or entered from the keyboard.
When names and strings are entered from the keyboard the program will ask
for the name and then the string(s). If more than one string is typed per
name they must be separated by slash (/) characters. See the "Typical
dialogue" below.
 Three files
containing restriction enzyme recognition sequences are currently
available. The "all enzymes" file contains the Rich Roberts REBASE
restriction enzyme database, which is updated monthly.
.para
The user can select strings
by name from these collections. If so the program will prompt for the
names, one at a time. The user can continue to select names until a blank
name is entered (by the user typing only return).
.para
 Listed output can be displayed in several ways: it
can be ordered enzyme by enzyme, or on cut positions, or with enzyme
names
written above a listing of the sequence. This last listing can also include
a three phase translation of the sequence. In addition the program will
display only infrequent cutters (the user defines the minimum number of
cuts), or can plot the positions of matches.
.para
Listings sorted "enzyme by enzyme" have the following form:
.lit

 Matches found=     1
     Name                  Sequence            Position  Fragment lengths
   1 AATII                 GACGT'C                  112     111     111
                                                            912     912
 Matches found=     2
     Name                  Sequence            Position  Fragment lengths
   1 ACCI                  GT'CGAC                  112     111     111
   2 ACCI                  GT'AGAC                  420     308     308
                                                            604     604
 Matches found=     2
     Name                  Sequence            Position  Fragment lengths
   1 AHAII                 GA'CGTC                  109     108      90
   2 AHAII                 GG'CGTC                  199      90     108
                                                            825     825
 Matches found=     2
     Name                  Sequence            Position  Fragment lengths
   1 AVAII                 G'GACC                    84      83      51
   2 AVAII                 G'GTCC                   973     889      83
                                                             51     889
 Matches found=     1
     Name                  Sequence            Position  Fragment lengths
   1 BALI                  TGG'CCA                  258     257     257
                                                            766     766
 Matches found=     1
     Name                  Sequence            Position  Fragment lengths
   1 BAMHI                 G'GATCC                   92      91      91

   ......   etc

Listings sorted on cut position have the following form:

 Searching
     Name                  Sequence            Position  Fragment lengths
   1 ECORI                 G'AATTC                    2       1
   2 BANI                  G'GTGCC                   26      24
   3 BSP1286               GTGCC'C                   31       5
   4 BBVI                  'TACTGCGCCGCAGCTGC        38       7
   5 NSPBII                CAG'CTG                   51      13
   6 PVUII                 CAG'CTG                   51       0
   7 BBVI                  GCAGCTGCTGGTG'            60       9
   8 HINCII                GTC'AAC                   80      20
   9 AVAII                 G'GACC                    84       4
  10 BINI                  'CCAGGGATCC               87       3
  11 BSTNI                 CC'AGG                    89       2
  12 BAMHI                 G'GATCC                   92       3
  13 XHOII                 G'GATCC                   92       0
  14 NSPBII                CCG'CTG                   98       6
  15 BINI                  GGATCCGCT'               100       2
  16 AHAII                 GA'CGTC                  109       9
  17 SALI                  G'TCGAC                  111       2
  18 AATII                 GACGT'C                  112       1
  19 ACCI                  GT'CGAC                  112       0
  20 HINCII                GTC'GAC                  113       1
  21 BBVI                  GCAGCGACTGATT'           166      53
  22 BINI                  'ACTCAGATCC              178      12
  23 XHOII                 A'GATCC                  183       5
  24 HGAI                  'GGCGGCGGAGGCGTC         188       5

  .....etc

Lists of infrequent cutters have the following form:

     0 AFLII
     0 AFLIII
     0 APAI
     0 APALI
     0 ASUII
     0 AVAI
     0 AVRII
     0 BCLI
     0 BGLI
     0 BGLII
     0 BSMI
     0 BSPMII
     0 BSTEII
  ...... etc

 Listings showing names above the sequence, and a translation have the
following form:


 ECORI                   BANI BSP1286
 .                       .    .      BBVI         NSPBII
 .                       .    .      .            PVUII    BBVI
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
        10        20        30        40        50        60
 E  F  G  L  G  L  V  *  G  A  Q  R  L  L  R  R  S  C  W  C
  N  S  V  W  A  W  C  E  V  P  R  D  Y  C  A  A  A  A  G  A
   I  R  F  G  L  G  V  R  C  P  E  I  T  A  P  Q  L  L  V  L

                   HINCII
                   .   AVAII
                   .   .  BINI
                   .   .  . BSTNI
                   .   .  . .  BAMHI
                   .   .  . .  XHOII NSPBII
                   .   .  . .  .     . BINI     AHAII
                   .   .  . .  .     . .        . SALI
                   .   .  . .  .     . .        . .AATII
                   .   .  . .  .     . .        . .ACCI
                   .   .  . .  .     . .        . ..HINCII
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
        70        80        90       100       110       120
 W  R  C  G  G  R  Q  R  T  Q  G  S  A  G  R  G  R  R  R  R
  G  G  A  E  V  V  N  G  P  R  D  P  L  D  E  D  V  D  D  E
   A  V  R  R  S  S  T  D  P  G  I  R  W  T  R  T  S  T  T  R

                                             BBVI        BINI
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
       130       140       150       160       170       180
 G  G  G  G  *  R  I  A  G  G  W  Q  R  L  I  *  V  L  T  T
  E  E  V  D  S  A  L  L  V  A  G  S  D  *  F  E  F  *  P  L
   R  R  W  I  A  H  C  W  W  L  A  A  T  D  L  S  S  D  H  S

  XHOII
  .    HGAI       AHAII                      PFIMI
  .    .          .                          .   BBVI
CAGATCCGGCGGCGGAGGCGTCGAGGCTCCCGAAACTCCCAGTGGCTGGCCTGCTAGATT
       190       200       210       220       230       240
 Q  I  R  R  R  R  R  R  G  S  R  N  S  Q  W  L  A  C  *  I
  R  S  G  G  G  G  V  E  A  P  E  T  P  S  G  W  P  A  R  F
   D  P  A  A  E  A  S  R  L  P  K  L  P  V  A  G  L  L  D  S

   .........etc

.end lit
.para
The terms "possible" and "definite" matches are important only for back
translations of protein into DNA, which include IUB redundancy codes.
Those matches that the program terms "definite matches" are ones in
which the specification of the recognition sequence corresponds
exactly to that of the back translation, and which consequently are
definitely in the DNA sequence. The program will also find what it
terms "possible matches", which are ones that depend on the particular
codons
chosen for each amino acid.
These are sites at which recognition
sequences could be engineered to produce a cut in the DNA
without changing the amino
acid, but which are not
necessarily found in the original sequence.
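.para
One way to picture the distinction is in terms of the sets of bases that
each IUB symbol stands for. The Python sketch below is an interpretation
made for illustration and is not taken from the program: it calls a match
"definite" when every base the sequence symbol allows is accepted by the
recognition sequence, and "possible" when at least one is. The function
name classify_match is hypothetical.
.lit
```python
# Standard IUB (IUPAC) nucleotide codes and the bases they stand for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def classify_match(site, window):
    """Return 'definite', 'possible' or None for site against window."""
    if len(site) != len(window):
        raise ValueError("site and window must have the same length")
    definite = True
    for s, w in zip(site, window):
        accepted, present = set(IUB[s]), set(IUB[w])
        if not (present & accepted):
            return None           # no choice of bases can match here
        if not (present <= accepted):
            definite = False      # match depends on which base is chosen
    return "definite" if definite else "possible"


plain = classify_match("GTMKAC", "GTCGAC")       # ACCI on ordinary DNA
engineered = classify_match("GAATTC", "GARTTY")  # back translated Glu-Phe
```
.end lit
In the second case an EcoRI site (GAATTC) could be made by choosing the
codons GAA and TTC, but it need not be present in the original DNA.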
.para
The routine will handle both linear and circular sequences, and
so finds cutsites spanning the "ends" of circular sequences.
 The program will only find cutsites spanning the
ends of sequences if the sequence is declared as circular.
This includes sites for
recognition sequences containing leading or trailing N symbols, in which
the actual recognition sequence does not span the join. For example if the
recognition sequence was 'NNNNACGT and the first 4 characters in the
sequence were ACGT, then the match would only be found if the sequence
was
declared as circular. If the sequence is linear then the first fragment
starts at base number 1, and the last ends at the last base. If the
sequence is circular then the length of the first fragment is the
clockwise
distance from the last cut to the first.
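.para
The fragment lengths in the listings above follow from the cut positions
by simple arithmetic, sketched here in Python. The function name
fragment_lengths is an assumption, and the sequence length of 1023 used in
the example is inferred from the listing (111 + 912), not stated in it.
.lit
```python
def fragment_lengths(cut_positions, seq_length, circular=False):
    """Return fragment lengths in order along the sequence.

    A cut position is taken to be the first base to the right of the cut.
    """
    cuts = sorted(cut_positions)
    if not cuts:
        return [seq_length]
    inner = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # First fragment: clockwise distance from the last cut to the first.
        return [seq_length - cuts[-1] + cuts[0]] + inner
    # Linear: first fragment starts at base 1, last ends at the last base.
    return [cuts[0] - 1] + inner + [seq_length - cuts[-1] + 1]


acci = fragment_lengths([112, 420], 1023)
avaii = fragment_lengths([84, 973], 1023)
acci_circular = fragment_lengths([112, 420], 1023, circular=True)
```
.end lit
Sorting the result, as in sorted(avaii), gives the second column of
fragment lengths shown in the "enzyme by enzyme" listing.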
.para
Graphical output marks the position of each string by a
short vertical line and gives the name of the enzyme at the left end of
the
line. If the top of the screen is reached the program gives the user the
opportunity to take a hard copy and then will clear the screen and restart
plotting results at the original start position.
.para
Below is an edited piece of dialogue from use of the search option:
.lit
? Menu or option number=17

Search for restriction enzyme sites
X 1 Search
  2 List enzyme file
  3 Clear text
  4 Clear graphics
? 0,1,2,3,4 = 2

  1 All enzymes
X 2 Six cutters
  3 Four cutters
  4 Personal file
  5 Keyboard
? 0,1,2,3,4,5 =

AATII/GACGT'C//
ACCI/GT'MKAC//
AFLII/C'TTAAG//
AFLIII/A'CRYGT//
AHAII/GR'CGYC//
APAI/GGGCC'C//
APALI/G'TGCAC//
ASUII/TT'CGAA//
AVAI/C'YCGRG//
AVAII/G'GWCC//
AVRII/C'CTAGG//
BALI/TGG'CCA//
BAMHI/G'GATCC//
BANI/G'GYRCC//
BANII/GRGCY'C//
BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
BCLI/T'GATCA//
BGLI/GCCNNNN'NGGC//
BGLII/A'GATCT//
BINI/GGATCNNNN'/'NNNNNGATCC//
BSMI/GAATGCN'/NG'CATTC//
BSP1286/GDGCH'C//

X 1 Search
  2 List enzyme file
  3 Clear text
  4 Clear graphics
? 0,1,2,3,4 =
  1 All enzymes
X 2 Six cutters
  3 Four cutters
  4 Personal file
  5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
  2 Order results by position
  3 Show only infrequent cutters
  4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches

 Searching
 Matches found=     1
     Name                  Sequence            Position  Fragment lengths
   1 AATII                 GACGT'C                  112     111     111
                                                            912     912
 Matches found=     2
     Name                  Sequence            Position  Fragment lengths
   1 ACCI                  GT'CGAC                  112     111     111
   2 ACCI                  GT'AGAC                  420     308     308
                                                            604     604
 Matches found=     2
     Name                  Sequence            Position  Fragment lengths
   1 AHAII                 GA'CGTC                  109     108      90
   2 AHAII                 GG'CGTC                  199      90     108
                                                            825     825
 Matches found=     2
     Name                  Sequence            Position  Fragment lengths
   1 AVAII                 G'GACC                    84      83      51
   2 AVAII                 G'GTCC                   973     889      83
                                                             51     889
 Matches found=     1
     Name                  Sequence            Position  Fragment lengths
   1 BALI                  TGG'CCA                  258     257     257
                                                            766     766
 Matches found=     1
     Name                  Sequence            Position  Fragment lengths
   1 BAMHI                 G'GATCC                   92      91      91
                                                            932     932
 Matches found=     1
     Name                  Sequence            Position  Fragment lengths
   1 BANI                  G'GTGCC                   26      25      25
                                                            998     998
 Matches found=     1
     Name                  Sequence            Position  Fragment lengths
   1 BANII                 GAGCC'C                  490     489     489
                                                            534     534
 Matches found=    11
     Name                  Sequence            Position  Fragment lengths
   1 BBVI                  'TACTGCGCCGCAGCTGC        38      37       3
   2 BBVI                  GCAGCTGCTGGTG'            60      22      22
   3 BBVI                  GCAGCGACTGATT'           166     106      28
   4 BBVI                  'CCTGCTAGATTCGCTGC       230      64      37
   5 BBVI                  GCAGCGGTACGTA'           452     222      50
   6 BBVI                  'CTCGCCAACGTTGCTGC       502      50      55
   7 BBVI                  GCAGCCTTCAACT'           606     104      64
   8 BBVI                  'GAGGTATTCCTGGCTGC       634      28      97
   9 BBVI                  'CTGGCCGCCGCCGCTGC       869     235     104
  10 BBVI                  'GCCGCCGCCGCTGCTGC       872       3     106
  11 BBVI                  GCAGCGATGAGGA'           927      55     222

  ....etc

X 1 Search
  2 List enzyme file
  3 Clear text
  4 Clear graphics
? 0,1,2,3,4 =

  1 All enzymes
X 2 Six cutters
  3 Four cutters
  4 Personal file
  5 Keyboard
? 0,1,2,3,4,5 =

? (y/n) (y) Search for all names

X 1 Order results enzyme by enzyme
  2 Order results by position
  3 Show only infrequent cutters
  4 Show names above the sequence
? 0,1,2,3,4 = 2

? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches

 Searching
     Name                  Sequence            Position  Fragment lengths
   1 ECORI                 G'AATTC                    2       1
   2 BANI                  G'GTGCC                   26      24
   3 BSP1286               GTGCC'C                   31       5
   4 BBVI                  'TACTGCGCCGCAGCTGC        38       7
   5 NSPBII                CAG'CTG                   51      13
   6 PVUII                 CAG'CTG                   51       0
   7 BBVI                  GCAGCTGCTGGTG'            60       9
   8 HINCII                GTC'AAC                   80      20
   9 AVAII                 G'GACC                    84       4
  10 BINI                  'CCAGGGATCC               87       3
  11 BSTNI                 CC'AGG                    89       2
  12 BAMHI                 G'GATCC                   92       3
  13 XHOII                 G'GATCC                   92       0
  14 NSPBII                CCG'CTG                   98       6
  15 BINI                  GGATCCGCT'               100       2
  16 AHAII                 GA'CGTC                  109       9
  17 SALI                  G'TCGAC                  111       2
  18 AATII                 GACGT'C                  112       1
  19 ACCI                  GT'CGAC                  112       0
  20 HINCII                GTC'GAC                  113       1

  .....etc

X 1 Search
  2 List enzyme file
  3 Clear text
  4 Clear graphics
? 0,1,2,3,4 =

  1 All enzymes
X 2 Six cutters
  3 Four cutters
  4 Personal file
  5 Keyboard
? 0,1,2,3,4,5 =

? (y/n) (y) Search for all names

  1 Order results enzyme by enzyme
X 2 Order results by position
  3 Show only infrequent cutters
  4 Show names above the sequence
? 0,1,2,3,4 =3
? Maximum number of cuts (0-100) (0) =
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches

 Searching
     0 AFLII
     0 AFLIII
     0 APAI
     0 APALI
     0 ASUII
     0 AVAI
     0 AVRII
     0 BCLI
     0 BGLI
     0 BGLII
     0 BSMI
     0 BSPMII
     0 BSTEII
     0 CLAI
     0 DRAI
     0 DRAII
     0 ECOB
     0 ECOK
     0 ECORV
     0 ESPI

   ......etc

X 1 Search
  2 List enzyme file
  3 Clear text
  4 Clear graphics
? 0,1,2,3,4 =

  1 All enzymes
X 2 Six cutters
  3 Four cutters
  4 Personal file
  5 Keyboard
? 0,1,2,3,4,5 =

? (y/n) (y) Search for all names

  1 Order results enzyme by enzyme
  2 Order results by position
X 3 Show only infrequent cutters
  4 Show names above the sequence
? 0,1,2,3,4 =4
? (y/n) (y) Hide translation n
? (y/n) (y) Use 1 letter codes
? Line length (30-90) (60) =
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches

 Searching
 ECORI                   BANI BSP1286
 .                       .    .      BBVI         NSPBII
 .                       .    .      .            PVUII    BBVI
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTG
GTGC
        10        20        30        40        50        60
 E  F  G  L  G  L  V  *  G  A  Q  R  L  L  R  R  S  C  W  C
  N  S  V  W  A  W  C  E  V  P  R  D  Y  C  A  A  A  A  G  A
   I  R  F  G  L  G  V  R  C  P  E  I  T  A  P  Q  L  L  V  L

                   HINCII
                   .   AVAII
                   .   .  BINI
                   .   .  . BSTNI
                   .   .  . .  BAMHI
                   .   .  . .  XHOII NSPBII
                   .   .  . .  .     . BINI     AHAII
                   .   .  . .  .     . .        . SALI
                   .   .  . .  .     . .        . .AATII
                   .   .  . .  .     . .        . .ACCI
                   .   .  . .  .     . .        . ..HINCII
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACG
ACGA
        70        80        90       100       110       120
 W  R  C  G  G  R  Q  R  T  Q  G  S  A  G  R  G  R  R  R  R
  G  G  A  E  V  V  N  G  P  R  D  P  L  D  E  D  V  D  D  E
   A  V  R  R  S  S  T  D  P  G  I  R  W  T  R  T  S  T  T  R

                                             BBVI        BINI
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGAC
CACT
       130       140       150       160       170       180
 G  G  G  G  *  R  I  A  G  G  W  Q  R  L  I  *  V  L  T  T
  E  E  V  D  S  A  L  L  V  A  G  S  D  *  F  E  F  *  P  L
   R  R  W  I  A  H  C  W  W  L  A  A  T  D  L  S  S  D  H  S

 .......etc

X 1 Search
  2 List enzyme file
  3 Clear text
  4 Clear graphics
? 0,1,2,3,4 =

  1 All enzymes
X 2 Six cutters
  3 Four cutters
  4 Personal file
  5 Keyboard
? 0,1,2,3,4,5 =5
Define search strings by typing a string name
followed by the string(s)
? Name=FRED
? String(s)=AAAAAA/TTTTTT
? Name=MARY
? String(s)=CCCC/GGGG/GCGCT
? Name=
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
  2 Order results by position
  3 Show only infrequent cutters
  4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
 Searching
 Matches found=     9
     Name                  Sequence            Position  Fragment lengths
   1 FRED                  'TTTTTT                 1557    1556       1
   2 FRED                  'TTTTTT                 1558       1       1
   3 FRED                  'TTTTTT                 1559       1       1
   4 FRED                  'TTTTTT                 1560       1      22
   5 FRED                  'AAAAAA                 1582      22     529
   6 FRED                  'AAAAAA                 3160    1578    1019
   7 FRED                  'AAAAAA                 4204    1044    1044
   8 FRED                  'AAAAAA                 5691    1487    1487
   9 FRED                  'AAAAAA                 6710    1019    1556
                                                            529    1578
 Matches found=    36
     Name                  Sequence            Position  Fragment lengths
   1 MARY                  'CCCC                     47      46       1
   2 MARY                  'GGGG                    486     439       1
   3 MARY                  'GGGG                    487       1       1
   4 MARY                  'CCCC                    557      70       1
   5 MARY                  'CCCC                    558       1       1
   6 MARY                  'GCGCT                  1177     619       1

  ... etc

X 1 Search
  2 List enzyme file
  3 Clear text
  4 Clear graphics
? 0,1,2,3,4 =
  1 All enzymes
X 2 Six cutters
  3 Four cutters
  4 Personal file
  5 Keyboard
? 0,1,2,3,4,5 =5
Define search strings by typing a string name
followed by the string(s)
? Name=JANE
? String(s)=A'TTTT/CC'GGG
? Name=
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
  2 Order results by position
  3 Show only infrequent cutters
  4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
 Searching
 Matches found=    30
     Name                  Sequence            Position  Fragment lengths
   1 JANE                  A'TTTT                   437     436       6
   2 JANE                  A'TTTT                   546     109      33
   3 JANE                  A'TTTT                   597      51      43
   4 JANE                  A'TTTT                   777     180      51
   5 JANE                  A'TTTT                  1274     497      60
   6 JANE                  A'TTTT                  1571     297      62
   7 JANE                  CC'GGG                  1926     355      75
   8 JANE                  A'TTTT                  2403     477      81
   9 JANE                  A'TTTT                  2586     183      82
  10 JANE                  A'TTTT                  2731     145     101
  11 JANE                  A'TTTT                  2812      81     103

 ... etc


X 1 Search
  2 List enzyme file
  3 Clear text
  4 Clear graphics
? 0,1,2,3,4 =!
.end lit
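.para
As a rough illustration of the search itself, the Python sketch below
parses lines in the enzyme file layout shown above and reports cut
positions. It is not the program's code. The function names are
hypothetical, the position convention (first base after the cut) is
inferred from the listings, and strings given without a cut marker are
assumed to cut at their start, as the FRED and MARY listings suggest.
The test sequence is the first 120 bases of the "names above the
sequence" listing.
.lit
```python
# IUB codes as given in the NC-IUB table of this document.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}


def parse_enzyme_line(line):
    """Split 'NAME/SITE/SITE//' into the name and its recognition strings."""
    name, *sites = line.strip().split("/")
    return name, [site for site in sites if site]


def find_cuts(seq, site, circular=False):
    """Return sorted 1-based cut positions (first base after each cut)."""
    pattern = site.replace("'", "")
    # Number of pattern symbols before the cut; an unmarked string is
    # treated as cutting at its start (assumption).
    offset = max(site.find("'"), 0)
    n = len(seq)
    n_starts = n if circular else max(n - len(pattern) + 1, 0)
    cuts = []
    for start in range(n_starts):
        # On a circular sequence the window may wrap around the join.
        window = (seq[(start + i) % n] for i in range(len(pattern)))
        if all(base in IUB.get(sym, "") for sym, base in zip(pattern, window)):
            pos = start + offset
            cuts.append(pos % n + 1 if circular else pos + 1)
    return sorted(cuts)


SEQ = (
    "GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC"
    "TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA"
)
for line in ("ACCI/GT'MKAC//", "BAMHI/G'GATCC//",
             "BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//"):
    name, sites = parse_enzyme_line(line)
    print(name, [find_cuts(SEQ, site) for site in sites])
```
.end lit
.para
With circular=True the window is allowed to wrap around the join, so a
string such as 'NNNNACGT is found when the sequence starts with ACGT;
with the default linear setting it is not.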

.left margin1
@18. TX 1 7 @ Compare a short sequence
.LEFT MARGIN2
.para
This routine slides a short sequence along the current sequence and finds
all positions at which a given percentage of the bases match.
Output is in both graphical and listed forms.
.para
If  users call for dialogue when the routine is selected they will be given
the choice of keyboard or file input. Define the string, select the "sense"
to use and the percentage match. Matches will be plotted out and then the
user can select to have them listed. Then the routine cycles around.
.para
 The routine slides the search string
along the  sequence and marks the positions at which a minimum
percentage score is reached. The graphical output draws a vertical line at
the match position; the height of the line represents the percentage
score,
so that if the line reaches the top of the box the score is 100%.
The NC-IUB symbols may be used in the search sequence to encode
uncertain
characters. Any other symbols will not match.
.LIT


            NC-IUB SYMBOLS

      A,C,G,T
      R        (A,G)        'puRine'
      Y        (T,C)        'pYrimidine'
      W        (A,T)        'Weak'
      S        (C,G)        'Strong'
      M        (A,C)        'aMino'
      K        (G,T)        'Keto'
      H        (A,T,C)      'not G'
      B        (G,C,T)      'not A'
      V        (G,A,C)      'not T'
      D        (G,A,T)      'not C'
      N        (G,A,C,T)    'aNy'

 Typical dialogue is shown below.


? Menu or option number=18
 Find percentage matches
? (y/n) (y) Keep picture
? String=AAATTTCCC
STRING=AAATTTCCC
? (y/n) (y) This sense
? Percent match (1.00-100.00) (70.00) =

 Missing graphics display here

Total scoring positions above 70.000 percent =   7
Scores         7      6      6      6      6      6      6
Positions    365    212    213    292    311    358    627
? Display (0-7) (0) =3

       365
         ACATTTCGC
         * ***** *
         AAATTTCCC
         1

       212
         GAAACTCCC
          **  ****
         AAATTTCCC
         1

       213
         AAACTCCCA
         *** * **
         AAATTTCCC
         1
? (y/n) (y) Keep picture
Default String=AAATTTCCC
? String=
STRING=AAATTTCCC
? (y/n) (y) This sense n
STRING=GGGAAATTT
? Percent match (1.00-100.00) (70.00) =

 Missing graphics display here

Total scoring positions above 70.000 percent =   7
Scores         6      6      6      6      6      6      6
Positions    269    270    271    288    354    624    853
? Display (0-7) (0) =3

       269
         GAGGGATTT
         * *  ****
         GGGAAATTT
         1

       270
         AGGGATTTT
          ** * ***
         GGGAAATTT
         1

       271
         GGGATTTTC
         ****  **
         GGGAAATTT
         1
? (y/n) (y) Keep picture !

.end lit
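.para
The sliding comparison can be sketched as follows. This is a minimal
Python illustration with hypothetical function names, not the program's
code. One assumption is made about rounding: the dialogue lists scores of
6 out of 9 at a 70 percent cutoff, so the sketch truncates the required
number of matches to a whole number.
.lit
```python
# IUB codes as given in the NC-IUB table above.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}

# Complement of each symbol, used for searching the other sense.
COMPLEMENT = str.maketrans("ACGTRYWSMKHDBVN", "TGCAYRWSKMDHVBN")


def reverse_complement(probe):
    """Return the probe as it reads on the opposite strand."""
    return probe.translate(COMPLEMENT)[::-1]


def percent_matches(seq, probe, min_percent=70.0):
    """Return (1-based position, matching symbols) for each hit."""
    m = len(probe)
    # Assumption: the cutoff is truncated to a whole number of matches.
    needed = int(min_percent * m / 100)
    hits = []
    for start in range(len(seq) - m + 1):
        window = seq[start:start + m]
        # A probe symbol matches if the base is in its IUB set; symbols
        # outside the table never match.
        score = sum(base in IUB.get(sym, "")
                    for sym, base in zip(probe, window))
        if score >= needed:
            hits.append((start + 1, score))
    return hits


print(reverse_complement("AAATTTCCC"))           # GGGAAATTT
print(percent_matches("ACATTTCGC", "AAATTTCCC"))  # [(1, 7)]
```
.end lit
.para
The window ACATTTCGC is the one displayed at position 365 in the dialogue
above, where it is also reported with a score of 7.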
.left margin1
@19. TX 7 @ Compare a short sequence using a score matrix
.LEFT MARGIN2
.para
This routine slides a short sequence along the current sequence and finds
all positions at which a given level of similarity (a cutoff score) is
reached. The score is defined by use of a score matrix. Output is in both
graphical and listed forms.
.para
If  users call for dialogue when the routine is selected they will be given
the choice of keyboard or file input. Define the string, select the "sense"
to use and the cutoff score. Matches will be plotted out and then the user
can select to have them listed. Then the routine cycles around.
.para
 The routine slides the search string
along the  sequence and marks the positions at which the cutoff score
is achieved. The graphical output draws a vertical line at
the match position; the height of the line represents the  score,
so that if the line reaches the top of the box the score is the maximum
possible.
The NC-IUB symbols may be used in the search sequence to encode
uncertain
characters.
.para
 The score matrix reflects the level of
redundancy in the probe sequence and hence will put more emphasis on
those
characters that are better defined. The score matrix is:
.lit
             DNA SCORE MATRIX USING IUB SYMBOLS

        T  C  A  G  -  R  Y  W  S  M  K  H  B  V  D  N  ?

   T   36  0  0  0  9  0 18 18  0  0 18 12 12  0 12  9  0
   C    0 36  0  0  9  0 18  0 18 18  0 12 12 12  0  9  0
   A    0  0 36  0  9 18  0 18  0 18  0 12  0 12 12  9  0
   G    0  0  0 36  9 18  0  0 18  0 18  0 12 12 12  9  0
   -    9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
   R    0  0 18 18 18 36  0  9  9  9  9  6  6 12 12 18  0
   Y   18 18  0  0 18  0 36  9  9  9  9 12 12  6  6 18  0
   W   18  0 18  0 18  9  9 36  0  9  9 12  6  6 12 18  0
   S    0 18  0 18 18  9  9  0 36  9  9  6 12 12  6 18  0
   M    0 18 18  0 18  9  9  9  9 36  0 12  6 12  6 18  0
   K   18  0  0 18 18  9  9  9  9  0 36  6 12  6 12 18  0
   H   12 12 12  0 27  6 12 12  6 12  6 36  8  8  8 27  0
   B   12 12  0 12 27  6 12  6 12  6 12  8 36  8  8 27  0
   V    0 12 12 12 27 12  6  6 12 12  6  8  8 36  8 27  0
   D   12  0 12 12 27 12  6 12  6  6 12  8  8  8 36 27  0
   N    9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
   ?    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0

  ? is any unrecognised character.

  Typical dialogue is shown below.

? Menu or option number=19
 Find matches using a score matrix
? (y/n) (y) Keep picture
? String=AAATTTCCC
STRING=AAATTTCCC
? (y/n) (y) This sense
Minimum score=     0 Maximum score=   324
? Score (0-324) (280) =250

 Missing graphics display here

For score   250 the number of matches=     1
Scores       252
Positions    365
? Display (0-1) (0) =1

       365
         ACATTTCGC
         * ***** *
         AAATTTCCC
         1
? (y/n) (y) Keep picture
Default String=AAATTTCCC
? String=
STRING=AAATTTCCC
? (y/n) (y) This sense n
STRING=GGGAAATTT
Minimum score=     0 Maximum score=   324
? Score (0-324) (222) = 200

 Missing graphics display here

For score   200 the number of matches=     7
Scores       216    216    216    216    216    216    216
Positions    269    270    271    288    354    624    853
? Display (0-7) (0) =3

       269
         GAGGGATTT
         * *  ****
         GGGAAATTT
         1

       270
         AGGGATTTT
          ** * ***
         GGGAAATTT
         1

       271
         GGGATTTTC
         ****  **
         GGGAAATTT
         1
? (y/n) (y) Keep picture !

.end lit
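.para
A small Python sketch of scoring with the matrix is given below. The
matrix rows are transcribed from the table above; the function names are
hypothetical and the code is an illustration rather than the program's
implementation. Each window score is the sum of the matrix entries for
the aligned pairs of symbols.
.lit
```python
SYMBOLS = "TCAG-RYWSMKHBVDN?"
ROWS = """
T   36  0  0  0  9  0 18 18  0  0 18 12 12  0 12  9  0
C    0 36  0  0  9  0 18  0 18 18  0 12 12 12  0  9  0
A    0  0 36  0  9 18  0 18  0 18  0 12  0 12 12  9  0
G    0  0  0 36  9 18  0  0 18  0 18  0 12 12 12  9  0
-    9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
R    0  0 18 18 18 36  0  9  9  9  9  6  6 12 12 18  0
Y   18 18  0  0 18  0 36  9  9  9  9 12 12  6  6 18  0
W   18  0 18  0 18  9  9 36  0  9  9 12  6  6 12 18  0
S    0 18  0 18 18  9  9  0 36  9  9  6 12 12  6 18  0
M    0 18 18  0 18  9  9  9  9 36  0 12  6 12  6 18  0
K   18  0  0 18 18  9  9  9  9  0 36  6 12  6 12 18  0
H   12 12 12  0 27  6 12 12  6 12  6 36  8  8  8 27  0
B   12 12  0 12 27  6 12  6 12  6 12  8 36  8  8 27  0
V    0 12 12 12 27 12  6  6 12 12  6  8  8 36  8 27  0
D   12  0 12 12 27 12  6 12  6  6 12  8  8  8 36 27  0
N    9  9  9  9 36 18 18 18 18 18 18 27 27 27 27 36  0
?    0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0
"""

# MATRIX[(row symbol, column symbol)] -> score
MATRIX = {}
for line in ROWS.strip().splitlines():
    row, *values = line.split()
    for col, value in zip(SYMBOLS, values):
        MATRIX[(row, col)] = int(value)


def window_score(window, probe):
    """Sum the matrix entries for each aligned probe/sequence pair."""
    total = 0
    for p, s in zip(probe, window):
        # Unrecognised characters are scored as '?'.
        p = p if p in SYMBOLS else "?"
        s = s if s in SYMBOLS else "?"
        total += MATRIX[(p, s)]
    return total


def matrix_matches(seq, probe, cutoff):
    """Return (1-based position, score) where the cutoff is reached."""
    m = len(probe)
    scores = ((i + 1, window_score(seq[i:i + m], probe))
              for i in range(len(seq) - m + 1))
    return [(pos, sc) for pos, sc in scores if sc >= cutoff]


print(window_score("AAATTTCCC", "AAATTTCCC"))  # 324
print(window_score("ACATTTCGC", "AAATTTCCC"))  # 252
```
.end lit
.para
The two printed values agree with the maximum score of 324 and the score
of 252 at position 365 reported in the dialogue above.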
.left margin1
@20. TX 7 @ Search for a motif using a weight matrix
.LEFT MARGIN2
.para
This function performs searches for short sequence
motifs using an appropriate  weight matrix. In addition it can be used to
create or modify weight matrices. In order to perform a search the only
input
required is the name of the file containing the weight matrix.
The results can be presented graphically or listed. The graphical
presentation will draw a line at the position of any matches found; the
height of the line is proportional to the score.
.para
For a search, select "use weight matrix", supply the name of the file
containing the weight matrix, and choose between having results plotted
or listed. If dialogue is requested when the function is selected users can
alter the cutoff score employed.
.para
To create a weight matrix several steps are involved. A file containing an
alignment of known motifs is required. (This file must be created before
the current option is selected. The format is as follows: each sequence is
written on a separate line with at least one space at the beginning; each
sequence is terminated by a space character, and can be followed by a
name. The sequences must be aligned.) Supply the name of the file of
aligned sequences. The program reads and displays the sequences. Choose
between "summing logs of weights" or summing weights (i.e. whether to
multiply or add weights). If logs are used all scores will be negative.
Choose if all positions in the set of aligned sequences should be used or
if a mask should be applied. If so selected, define a mask as a string of
symbols, in which symbol - means ignore and any other symbol means
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
.para
The program will calculate weights as the frequencies of each base at
each unmasked position in the set of aligned sequences. These weights
are then applied to the set of aligned sequences to give a range  of
"observed" scores. The mean and standard deviation of these scores are
displayed. The user is asked to supply several values to be used when the
weight matrix is applied to other sequences: a cutoff score (by default,
the mean minus 3 standard deviations), a top score for scaling graphical
results (by default, the mean plus 3 standard deviations), and a position
to identify (this means that if a particular base within the motif is used
as a "landmark", such as the A of the AG in splice acceptor sites, then its
position will be marked in plots). All these values are stored along with
the weight matrix. Finally supply the name of a file to contain the weight
matrix.
.para
Weight matrices can be "rescaled" using a set of aligned sequences in
much the same way as a matrix is created. The purpose is to redefine
the cutoff scores, and rescaling does not alter any other values in the
weight matrix file.
.para
The methods have changed considerably but were first outlined in
Staden, R. Nucl. Acids Res. 12 505-519 1984, and
Staden, R. Genetic
engineering: principles and methods vol 7, Edited by J.K. Setlow and A.
Hollaender, Plenum publishing corp., 1985.
.para
 The methods have always had to deal with the problem of zeroes in the
matrices. The current versions
employ "Laplace's Law of Succession" in which 1 is
added to each term.
.para
It is now possible to apply a mask to a set of aligned sequences in
order to give weight to selected positions only.
Sequences have superimposed functions: some parts may be of general
structural
importance and give rise to an overall framework, and other parts give
specificity and hence are not common; we may want to use a set of
aligned
sequences to define a motif, but want to use only the framework
positions.
 Alternatively we may want to pick out
only those parts of a set of aligned sequences that give a particular
property, and to ignore other similarities that are due to some other
property
and which could obscure the pattern
we are interested in. The ability to define a mask allows certain
positions
to be used in the motif and others to be ignored, and yet still permits the
use of a set of aligned sequences to calculate weights.
.para
Typical dialogue is shown below.
.lit

? Menu or option number=20
X 1 Use weight matrix
  2 Make weight matrix
  3 Rescale weight matrix
? 0,1,2,3 =2
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ

     1 AGCGTGACTCTTCCCGGAA HIS1
     2 GAGGTGACTCACTTGGAAG HIS1
     3 CGGATGACTCTTTTTTTTT HIS3
     4 ACAGTGACTCACGTTTTTT HIS4
     5 GTCGTGACTCATATGCTTT ARG3
     6 TGAATGACTCACTTTTTGG ARG4
     7 TTCTTGACTCGTCTTTTCT CPA1
     8 CGAATGACTCTTATTGATG CPA2
     9 AGAATGACTAATTTTACTA TRP5
    10 TCGTTGACTCATTCTAATC TRP3
    11 TTGCTGACTCATTACGATT TRP2
    12 GAGATGACTCTTTTTCTTT IV1
    13 GCGATGATTCATTTCTCTG IV2
    14 TAGATGACTCAGTTTAGTC LEU1
    15 TAAGTGACTCAGTTCTTTC LEU4
    16 ATGATGACTCTTAAGCATG ILS1
Length of motif    19
? (y/n) (y) Sum logs of weights

? (y/n) (y) Use all motif positions n
x means use, - means ignore
e.g. xx-x---x-x means use positions 1,2,4,8,10
? Mask=----XXXXXXXX
 Applying weights to input sequences
   1      -27.979 AGCGTGACTCTTCCCGGAA
   2      -24.543 GAGGTGACTCACTTGGAAG
   3      -20.890 CGGATGACTCTTTTTTTTT
   4      -23.087 ACAGTGACTCACGTTTTTT
   5      -22.771 GTCGTGACTCATATGCTTT
   6      -23.408 TGAATGACTCACTTTTTGG
   7      -25.159 TTCTTGACTCGTCTTTTCT
   8      -22.679 CGAATGACTCTTATTGATG
   9      -24.751 AGAATGACTAATTTTACTA
  10      -23.157 TCGTTGACTCATTCTAATC
  11      -23.067 TTGCTGACTCATTACGATT
  12      -21.449 GAGATGACTCTTTTTCTTT
  13      -24.191 GCGATGATTCATTTCTCTG
  14      -23.770 TAGATGACTCAGTTTAGTC
  15      -22.923 TAAGTGACTCAGTTCTTTC
  16      -25.285 ATGATGACTCTTAAGCATG
Top score     -20.890  Bottom score     -27.979
Mean     -23.694  Standard deviation       1.613
Mean minus 3.sd     -28.534  Mean plus 3.sd     -18.854
? Cutoff score (-999.00-9999.00) (-28.53) =
? Top score for scaling plots (-28.53-999.00) (-18.85) =
? Position to identify (0-19) (1) =
? Title=GCN4 SEQUENCES
? Name for new weight matrix file=1.WTS


? Menu or option number=20
X 1 Use weight matrix
  2 Make weight matrix
  3 Rescale weight matrix
? 0,1,2,3 =3
? Name of existing weight matrix file=1.WTS
 GCN4 SEQUENCES
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
Length of motif    19
? (y/n) (y) Sum logs of weights n
? (y/n) (y) Use all motif positions

 Applying weights to input sequences
   1      128.000 AGCGTGACTCTTCCCGGAA
   2      148.000 GAGGTGACTCACTTGGAAG
   3      172.000 CGGATGACTCTTTTTTTTT
   4      160.000 ACAGTGACTCACGTTTTTT
   5      161.000 GTCGTGACTCATATGCTTT
   6      157.000 TGAATGACTCACTTTTTGG
   7      149.000 TTCTTGACTCGTCTTTTCT
   8      160.000 CGAATGACTCTTATTGATG
   9      151.000 AGAATGACTAATTTTACTA
  10      159.000 TCGTTGACTCATTCTAATC
  11      158.000 TTGCTGACTCATTACGATT
  12      169.000 GAGATGACTCTTTTTCTTT
  13      152.000 GCGATGATTCATTTCTCTG
  14      157.000 TAGATGACTCAGTTTAGTC
  15      160.000 TAAGTGACTCAGTTCTTTC
  16      143.000 ATGATGACTCTTAAGCATG
Top score     172.000  Bottom score     128.000
Mean     155.250  Standard deviation      10.034
Mean minus 3.sd     125.147  Mean plus 3.sd     185.353
? Cutoff score (-999.00-9999.00) (125.15) =
? Top score for scaling plots (125.15-999.00) (185.35) =
? Position to identify (0-19) (1) =
? Title=GCN4 SEQUENCES
? Name for new weight matrix file=2.WTS


? Menu or option number=20
X 1 Use weight matrix
  2 Make weight matrix
  3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=1.WTS
 GCN4 SEQUENCES
? (y/n) (y) Plot results n

    153    -22.61 GCAGCGACTGATTTGAGTT
    169    -28.53 GTTCTGACCACTCAGATCC
    172    -27.27 CTGACCACTCAGATCCGGC
    219    -27.35 CCAGTGGCTGGCCTGCTAG
    268    -27.82 CGAGGGATTTTCGATCTTG
    274    -26.99 ATTTTCGATCTTGTGGATG
    283    -25.79 CTTGTGGATGATTTTCACG
    287    -27.50 TGGATGATTTTCACGTGCG
    298    -28.17 CACGTGCGCCGTCATATTG
    332    -28.27 TCTTTGAAGCAGAAGGGAC
    351    -28.27 AGGGGTACACTTTCACATT
    357    -25.05 ACACTTTCACATTTCGCTT
    364    -28.51 CACATTTCGCTTATGGGAG
    400    -23.77 GAAGTTACTAATGTGCGTG
    451    -26.22 ATGCTCGCCCTCTTTGGTG
    476    -28.00 TCCCTCACTGAGCCCTCCG
    480    -28.33 TCACTGAGCCCTCCGCCTC
    517    -23.46 GCTAAGATTCAGCTTGGTT
    556    -27.27 TCCAGCACTCAGGTTCGGC
    602    -27.01 AACTTGAATCCATCGTTGC
    648    -28.45 TGCTAAACACAGCCGGTTT
    679    -28.18 CTGTTTGCCCAGTTTGGGC
    691    -28.51 TTTGGGCCGCTTCTGGACG
    713    -27.67 GGCTTGACCGTGGCTGTGG
    803    -25.47 ATGCTGACCATGCTTTTCA
    848    -28.11 ATAATGTTAAGTTTGATTC
    857    -25.97 AGTTTGATTCCGCTGGCCG
    879    -27.85 CCGCTGCTGCTGTTTCCAC
    917    -27.77 GCGATGAGGAAGGCTTGTT
    931    -27.81 TTGTTGGCGCGCCTGCTCG
    952    -23.52 GAGGTGACTACCATCCGTG
    977    -28.40 TGCGTGGGTGAGCTGTTGT


? Menu or option number=6
Page through text files
? Name of file to read=1.WTS
 GCN4 SEQUENCES
     19     1   -28.534   -18.854
 P   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
 N  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16  16
 T   0   0   0   0  16   0   0   1  16   0   5  11  10  12   9   6   7  12   6
 C   0   0   0   0   0   0   0  15   0  15   0   3   2   2   4   3   2   1   3
 A   0   0   0   0   0   0  16   0   0   1  10   0   3   2   0   3   5   2   2
 G   0   0   0   0   0  16   0   0   0   0   1   2   1   0   3   4   2   1   5
End of file

.end lit
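.para
The weight calculation can be illustrated with the GCN4 alignment from the
dialogue above. The Python sketch below is not the program's code and its
function names are hypothetical. Two details are assumptions inferred from
the listings rather than taken from the program: positions beyond the end
of a short mask are treated as used, and the log score divides each
count plus 1 by the number of sequences plus 5, with masked positions
contributing a count of zero. With these choices the sketch gives the
listed scores for the two sequences checked (lines 1 and 3), but other
conventions might do so as well.
.lit
```python
import math

ALIGNED = [
    "AGCGTGACTCTTCCCGGAA", "GAGGTGACTCACTTGGAAG", "CGGATGACTCTTTTTTTTT",
    "ACAGTGACTCACGTTTTTT", "GTCGTGACTCATATGCTTT", "TGAATGACTCACTTTTTGG",
    "TTCTTGACTCGTCTTTTCT", "CGAATGACTCTTATTGATG", "AGAATGACTAATTTTACTA",
    "TCGTTGACTCATTCTAATC", "TTGCTGACTCATTACGATT", "GAGATGACTCTTTTTCTTT",
    "GCGATGATTCATTTCTCTG", "TAGATGACTCAGTTTAGTC", "TAAGTGACTCAGTTCTTTC",
    "ATGATGACTCTTAAGCATG",
]


def make_counts(aligned, mask=""):
    """Count T, C, A and G at each position; masked positions stay zero."""
    length = len(aligned[0])
    # Assumption: a mask shorter than the motif leaves the rest in use.
    mask = mask.ljust(length, "x")
    counts = []
    for pos in range(length):
        column = {base: 0 for base in "TCAG"}
        if mask[pos] != "-":
            for seq in aligned:
                if seq[pos] in column:
                    column[seq[pos]] += 1
        counts.append(column)
    return counts


def sum_score(counts, window):
    """Score a window by adding the weights (raw counts)."""
    return sum(col.get(base, 0) for col, base in zip(counts, window))


def log_score(counts, window, n_seqs):
    """Score a window by summing logs of weights, adding 1 to each term.

    The denominator n_seqs + 5 is an assumption (see the text above).
    """
    return sum(math.log((col.get(base, 0) + 1) / (n_seqs + 5))
               for col, base in zip(counts, window))


counts = make_counts(ALIGNED, "----XXXXXXXX")
his3 = ALIGNED[2]
print(sum_score(counts, his3))                            # 172
print(f"{log_score(counts, his3, len(ALIGNED)):.3f}")     # -20.890
```
.end lit
.para
To search another sequence, the same scoring functions would be applied to
every window of motif length and the positions reaching the cutoff score
reported.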


.left margin1
@21. TX 3 @ Count base composition
.LEFT MARGIN2
.para
This routine
calculates the base composition of the
active region of the sequence as both totals and percentages.
.left margin1
@22. TX 3 @ Count dinucleotide frequencies
.LEFT MARGIN2
.para
This routine simply counts dinucleotide frequencies for the currently
active region of the sequence. It also calculates an expected distribution
based on the base composition.
The output looks like:
.LIT
              T             C             A             G
        obs  expected obs  expected obs  expected obs  expected

     T   8.44   8.25   6.67   7.01  10.35   9.92   3.27   3.54
     C   7.49   7.01   6.76   5.95   8.39   8.43   1.76   3.01
     A  10.13   9.92   7.78   8.43  11.74  11.93   4.89   4.26
     G   2.67   3.54   3.19   3.01   4.06   4.26   2.42   1.52

.END LIT
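.para
A minimal Python sketch of this calculation follows. It is an
illustration with a hypothetical function name, not the program's code. It
assumes the expected value for a pair is the product of the two single
base frequencies, which is consistent with the sample output above (for
example the expected TC value of 7.01 is close to the square root of
8.25 times 5.95).
.lit
```python
from collections import Counter


def dinucleotide_table(seq):
    """Return {(x, y): (observed %, expected %)} for bases T, C, A, G."""
    bases = "TCAG"
    # Overlapping pairs; pairs containing other symbols are skipped.
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    pairs = [p for p in pairs if p[0] in bases and p[1] in bases]
    pair_counts = Counter(pairs)
    base_counts = Counter(b for b in seq if b in bases)
    n_pairs = len(pairs)
    n_bases = sum(base_counts.values())
    table = {}
    for x in bases:
        for y in bases:
            observed = 100.0 * pair_counts[x + y] / n_pairs if n_pairs else 0.0
            # Assumption: expected = product of single base frequencies.
            expected = (100.0 * base_counts[x] * base_counts[y] / n_bases ** 2
                        if n_bases else 0.0)
            table[(x, y)] = (observed, expected)
    return table


table = dinucleotide_table("AACC")
print(table[("A", "C")])
```
.end lit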
.left margin1
@23. TX 3 5 @ Count codons and amino acids
.LEFT MARGIN2
.para
This function
counts codons and calculates amino acid composition, protein molecular
weights, and base
composition. Users select the segments of the sequence that the program
should analyse.
.para
Choose between being shown observed counts or counts normalised so
that the totals for each amino acid sum to 100. Select to define
segments using either the keyboard or an EMBL feature table.
Define the segments to count over. Select strand for each segment. Stop
selecting segments by typing a zero for "Count from ()". The results are
displayed a screenful at a time, and the bell is sounded to show there is
more to come. A zero start position,  or the end of an EMBL feature table,
signals
the routine to print out totals for all values.

.para
The counts are broken down into several figures:
base
composition by position in codon, expressed as a percentage of each base's
own frequency;  base composition by position in codon, expressed as a
percentage of the overall base composition of the section; the base
composition
expected for this amino acid composition if there were no codon
preference; and the
percentage deviations of the observed amino acid composition from an
average amino acid composition.
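.para
The first two of these figures can be sketched in Python as shown below.
The function name is hypothetical and the code is only an illustration of
one reading of the description: in the first table each base's counts at
the three codon positions are divided by that base's total, and in the
second table the counts at each codon position are divided by the number
of codons.
.lit
```python
from collections import Counter


def codon_position_tables(seq):
    """Return (codon counts, by_base, by_position) for one reading frame.

    by_base[k][b]     : percent of all occurrences of base b that fall
                        at codon position k (0, 1 or 2)
    by_position[k][b] : percent of codon position k occupied by base b
    """
    codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
    codon_counts = Counter(codons)
    # One counter of bases for each of the three codon positions.
    pos_counts = [Counter(c[k] for c in codons) for k in range(3)]
    base_totals = Counter()
    for counts in pos_counts:
        base_totals.update(counts)
    by_base, by_position = [], []
    for counts in pos_counts:
        by_base.append({
            b: 100.0 * counts[b] / base_totals[b] if base_totals[b] else 0.0
            for b in "TCAG"})
        by_position.append({
            b: 100.0 * counts[b] / len(codons) if codons else 0.0
            for b in "TCAG"})
    return codon_counts, by_base, by_position


codon_counts, by_base, by_position = codon_position_tables("ATGAAATTT")
print(codon_counts["AAA"], by_base[0]["A"])
```
.end lit
.para
Under this reading each column of the first table sums to 100 and each
row of the second table sums to 100.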
.para
The output looks like:
.LIT

      ===========================================
      F TTT   1. S TCT   2. Y TAT   2. C TGT   1.
      F TTC   1. S TCC   1. Y TAC   3. C TGC   2.
      L TTA   7. S TCA   4. * TAA   9. * TGA   1.
      L TTG   2. S TCG   1. * TAG   2. W TGG   2.
      ===========================================
      L CTT   3. P CCT   2. H CAT   4. R CGT   1.
      L CTC   2. P CCC   3. H CAC   1. R CGC   0.
      L CTA   3. P CCA   2. Q CAA   4. R CGA   0.
      L CTG   2. P CCG   2. Q CAG   1. R CGG   2.
      ===========================================
      I ATT   9. T ACT   1. N AAT   7. S AGT   3.
      I ATC   2. T ACC   2. N AAC   4. S AGC   2.
      I ATA   4. T ACA   5. K AAA  13. R AGA   5.
      M ATG   1. T ACG   2. K AAG   4. R AGG   1.
      ===========================================
      V GTT   2. A GCT   2. D GAT   1. G GGT   3.
      V GTC   2. A GCC   2. D GAC   1. G GGC   1.
      V GTA   4. A GCA   3. E GAA   2. G GGA   1.
      V GTG   2. A GCG   0. E GAG   1. G GGG   1.
      ===========================================
  total codons=      166.
          T          C          A          G

  1     31.06      33.68      34.03      35.00
  2     35.61      35.79      30.89      32.50
  3     33.33      30.53      35.08      32.50

  1     24.70      19.28      39.16      16.87
  2     28.31      20.48      35.54      15.66
  3     26.51      17.47      40.36      15.66
  %     26.51      19.08      38.35      16.06  observed, overall totals
  %     25.00      22.26      33.10      19.65  expected, even codons per acid

          A    C    D    E    F    G    H    I    K    L
          7.   3.   2.   3.   2.   6.   5.  15.  17.  19.
 o-e %  -47. -33. -76. -68. -64. -54.  62. 116.  67.  67.

          M    N    P    Q    R    S    T    V    W    Y
          1.  11.   9.   5.   9.  13.  10.  10.   2.   5.
 o-e %  -62.  66.  12. -17.  19.  21.   6.  -2.   0.  -5.
 total acids=  154. molecular weight=    17421.

 Typical dialogue follows.

? Menu or option number=23
 Calculate codon usage, base composition
 and amino acid composition
? (y/n) (y) Show observed counts
? (y/n) (y) Define segments using keyboard
? Count from (0-1023) (0) =1
? Count to (1-1023) (1023) =1000
? (y/n) (y) + strand

     ===========================================
     F TTT  13. S TCT   1. Y TAT   1. C TGT   3.
     F TTC   4. S TCC  10. Y TAC   1. C TGC   7.
     L TTA   1. S TCA   0. * TAA   1. * TGA   4.
     L TTG   4. S TCG   1. * TAG   3. W TGG   5.
     ===========================================
     L CTT   9. P CCT   1. H CAT   3. R CGT  14.
     L CTC   7. P CCC   0. H CAC   7. R CGC  14.
     L CTA   0. P CCA   0. Q CAA   4. R CGA   9.
     L CTG  12. P CCG   1. Q CAG   9. R CGG   8.
     ===========================================
     I ATT   7. T ACT   4. N AAT   4. S AGT   1.
     I ATC   4. T ACC   5. N AAC   3. S AGC   7.
     I ATA   1. T ACA   1. K AAA   3. R AGA   2.
     M ATG   2. T ACG   1. K AAG   2. R AGG   2.
     ===========================================
     V GTT  11. A GCT  13. D GAT   6. G GGT   9.
     V GTC   5. A GCC  10. D GAC   9. G GGC  11.
     V GTA   6. A GCA   5. E GAA   6. G GGA  12.
     V GTG   8. A GCG   5. E GAG   3. G GGG   8.
     ===========================================


 Total codons=      333.
         T          C          A          G

 1     23.32      37.69      28.99      40.06
 2     37.15      22.31      38.46      36.59
 3     39.53      40.00      32.54      23.34
       -----      -----      -----      -----
 =     100%       100%       100%       100%

 1     17.72      29.43      14.71      38.14  = 100%
 2     28.23      17.42      19.52      34.83  = 100%
 3     30.03      31.23      16.52      22.22  = 100%
 %     25.33      26.03      16.92      31.73  Observed, overall totals
 %     24.44      22.31      20.90      32.35  Expected, even codons per acid

         A    C    D    E    F    G    H    I    K    L
        33.  10.  15.   9.  17.  40.  10.  12.   5.  33.
O-E %   22.  81. -13. -55.  34.  71.  40. -29. -73.  13.

         M    N    P    Q    R    S    T    V    W    Y
         2.   7.   2.  13.  49.  20.  11.  30.   5.   2.
O-E %  -74. -51. -88.   0. 165. -11. -42.  40.  18. -81.
Total acids=  325. Molecular weight=    35831. Hydrophobicity= -17.8


? Count from (0-1023) (0) =

    Codon totals over all genes
     ===========================================
     F TTT  13. S TCT   1. Y TAT   1. C TGT   3.
     F TTC   4. S TCC  10. Y TAC   1. C TGC   7.
     L TTA   1. S TCA   0. * TAA   1. * TGA   4.
     L TTG   4. S TCG   1. * TAG   3. W TGG   5.
     ===========================================
     L CTT   9. P CCT   1. H CAT   3. R CGT  14.
     L CTC   7. P CCC   0. H CAC   7. R CGC  14.
     L CTA   0. P CCA   0. Q CAA   4. R CGA   9.
     L CTG  12. P CCG   1. Q CAG   9. R CGG   8.
     ===========================================
     I ATT   7. T ACT   4. N AAT   4. S AGT   1.
     I ATC   4. T ACC   5. N AAC   3. S AGC   7.
     I ATA   1. T ACA   1. K AAA   3. R AGA   2.
     M ATG   2. T ACG   1. K AAG   2. R AGG   2.
     ===========================================
     V GTT  11. A GCT  13. D GAT   6. G GGT   9.
     V GTC   5. A GCC  10. D GAC   9. G GGC  11.
     V GTA   6. A GCA   5. E GAA   6. G GGA  12.
     V GTG   8. A GCG   5. E GAG   3. G GGG   8.
     ===========================================


 Total codons=      333.
         T          C          A          G

 1     23.32      37.69      28.99      40.06
 2     37.15      22.31      38.46      36.59
 3     39.53      40.00      32.54      23.34
       -----      -----      -----      -----
 =     100%       100%       100%       100%

 1     17.72      29.43      14.71      38.14  = 100%
 2     28.23      17.42      19.52      34.83  = 100%
 3     30.03      31.23      16.52      22.22  = 100%
 %     25.33      26.03      16.92      31.73  Observed, overall totals
 %     24.44      22.31      20.90      32.35  Expected, even codons per acid

         A    C    D    E    F    G    H    I    K    L
        33.  10.  15.   9.  17.  40.  10.  12.   5.  33.
O-E %   22.  81. -13. -55.  34.  71.  40. -29. -73.  13.

         M    N    P    Q    R    S    T    V    W    Y
         2.   7.   2.  13.  49.  20.  11.  30.   5.   2.
O-E %  -74. -51. -88.   0. 165. -11. -42.  40.  18. -81.
Total acids=  325. Molecular weight=    35831. Hydrophobicity= -17.8

.END LIT
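.para
As an illustration of how the figures in the listing relate to one another,
the following Python sketch counts codons and derives the two
position-in-codon tables. It is not the program's own code: the function
names are hypothetical, the sequence is assumed to start in frame, and any
trailing incomplete codon is ignored.
.lit

```python
from collections import Counter

BASES = "TCAG"


def codon_counts(seq):
    """Count the codons of an in-frame sequence, ignoring a trailing partial codon."""
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return Counter(seq[i:i + 3] for i in range(0, usable, 3))


def position_tables(seq):
    """Return two tables indexed [codon position 0-2][base], both in percent.

    by_base:     share of each base's own total found at that position
                 (each base column sums to 100).
    by_position: share of that codon position occupied by each base
                 (each position row sums to 100).
    """
    counts = [Counter() for _ in range(3)]
    for codon, n in codon_counts(seq).items():
        for pos, base in enumerate(codon):
            counts[pos][base] += n
    n_codons = sum(counts[0].values())
    by_base, by_position = [], []
    for pos in range(3):
        by_base.append({})
        by_position.append({})
        for b in BASES:
            total_b = sum(counts[p][b] for p in range(3))
            by_base[pos][b] = 100.0 * counts[pos][b] / total_b if total_b else 0.0
            by_position[pos][b] = 100.0 * counts[pos][b] / n_codons if n_codons else 0.0
    return by_base, by_position
```

.end lit
.para
In the listing above the first table corresponds to by_base (each column
totals 100%) and the second to by_position (each row totals 100%).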
.LEFT MARGIN1
@24. TX 3 @ Plot base composition
.LEFT MARGIN2
.para
This option plots the base composition of the sequence. The counts for
any combination of bases can be plotted.
.para
If dialogue is requested the user is presented with a check box for
selecting which bases should be counted, and then allowed to define a
window length, and a "plot interval". Otherwise, the AT composition is
plotted with a window of 101 and a plot interval of 5.
.para
Typical dialogue follows.
.lit
? Menu or option number=d24
 Plot base composition

checkbox: those set are marked X
X 1 T
  2 C
X 3 A
  4 G
? 0,1,2,3,4 =1

checkbox: those set are marked X
  1 T
  2 C
X 3 A
  4 G
? 0,1,2,3,4 =3

checkbox: those set are marked X
  1 T
  2 C
  3 A
  4 G
? 0,1,2,3,4 =2

checkbox: those set are marked X
  1 T
X 2 C
  3 A
  4 G
? 0,1,2,3,4 =4

checkbox: those set are marked X
  1 T
X 2 C
  3 A
X 4 G
? 0,1,2,3,4 =

? odd span length (1-201) (31) =
? plot interval (1-11) (5) =

 missing graphics


.end lit
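.para
The values behind such a plot can be sketched in Python as follows. This is
only an illustration under stated assumptions: the function name is
hypothetical, the window is assumed to be centred on each plotted position,
and the default span and interval are the ones offered in the dialogue above.
.lit

```python
def window_composition(seq, bases="AT", span=31, interval=5):
    """Return (position, fraction) pairs for the chosen bases.

    position is the 1-based centre of each window; fraction is the share of
    the window made up of the selected bases. Centring the window is an
    assumption of this sketch.
    """
    if span % 2 == 0:
        raise ValueError("span length must be odd")
    seq = seq.upper()
    half = span // 2
    points = []
    for centre in range(half, len(seq) - half, interval):
        window = seq[centre - half:centre + half + 1]
        count = sum(window.count(b) for b in bases)
        points.append((centre + 1, count / span))
    return points
```

.end lit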
.left margIN1
@25. TX 3 @ Plot local deviations in base composition
.LEFT MARGIN2
.para
The "local deviation" routines are designed to indicate the  similarity of
the compositions of different parts of the sequence. The composition of
every segment of the sequence is compared with  a standard composition.
The levels of similarity are plotted as chi squared values. The standard
can be the composition of the whole sequence, or alternatively that of a
small segment defined by the user.
.para
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as the standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
.para
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
where obs is the observed composition
and exp is the expected composition (the composition of the standard).
 The calculation is performed once to find out the range of values and is
then repeated and
plotted so that the plot exactly fills the allocated screen space.
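.para
A minimal Python sketch of this two-pass procedure is given below. It is an
illustration, not the program's code. The function names are hypothetical,
compositions are taken as fractions, and the measure is read term by term,
i.e. as the sum over all words of (obs-exp)*(obs-exp)/(exp*exp); that reading
is an assumption. Setting k to 2 or 3 gives the dinucleotide and
trinucleotide variants (options 26 and 27).
.lit

```python
from collections import Counter


def composition(segment, k=1):
    """Fractional composition of a segment in overlapping words of length k."""
    n = len(segment) - k + 1
    counts = Counter(segment[i:i + k] for i in range(n))
    return {word: c / n for word, c in counts.items()}


def local_deviation(seq, span=31, interval=5, k=1, standard=None):
    """Return (start position, score) for each window, positions 1-based.

    The standard is the whole sequence unless a standard segment is given.
    Words absent from the standard are ignored (their exp would be zero).
    """
    seq = seq.upper()
    exp = composition(standard.upper() if standard else seq, k)
    values = []
    for start in range(0, len(seq) - span + 1, interval):
        obs = composition(seq[start:start + span], k)
        score = sum((obs.get(w, 0.0) - e) ** 2 / (e * e) for w, e in exp.items())
        values.append((start + 1, score))
    return values


def scale_to_box(scores):
    """Second pass: map scores onto 0..1 so that the plot fills the box."""
    low, high = min(scores), max(scores)
    if high == low:
        return [0.0 for _ in scores]
    return [(s - low) / (high - low) for s in scores]
```

.end lit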
.left margIN1
@26. TX 3 @ Plot local deviations from dinucleotide composition
.LEFT MARGIN2
.para
The "local deviation" routines are designed to indicate the  similarity of
the compositions of different parts of the sequence. The dinucleotide
composition of every segment of the sequence is compared with  a
standard composition. The levels of similarity are plotted as chi
squared values. The standard can be the composition of the whole
sequence, or alternatively that of a small segment defined by the user.
.para
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as the standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
.para
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
where obs is the observed composition
and exp is the expected composition (the composition of the standard).
 The calculation is performed once to find out the range of values and is
then repeated and
plotted so that the plot exactly fills the allocated screen space.
.left margin1
@27. TX 3 @ Plot local deviations from trinucleotide composition
.LEFT MARGIN2
.para
The "local deviation" routines are designed to indicate the  similarity of
the compositions of different parts of the sequence. The trinucleotide
composition of every segment of the sequence is compared with  a
standard composition. The levels of similarity are plotted as chi
squared values. The standard can be the composition of the whole
sequence, or alternatively that of a small segment defined by the user.
.para
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as the standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
.para
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
where obs is the observed composition
and exp is the expected composition (the composition of the standard).
 The calculation is performed once to find out the range of values and is
then repeated and
plotted so that the plot exactly fills the allocated screen space.
.left margin1
@28. TX 5 @ Calculate codon constraint
.left margin2
.para
The purpose of this option (which is somewhat specialised) is to measure
the level of constraint imposed on the sequence by coding for a protein of
the observed composition. It measures the strength of the codon bias
averaged over windows of 99 codons and displays the values observed.
.para
Select between defining segments at the keyboard or using an EMBL
feature table. Finish selecting segments by typing a zero start. The value
for each segment is displayed:
.lit
 Mean (W-EW) / EWD, window 99      10.5
.end lit
.para
The codon constraint is the
difference between the observed codon improbability and the mean
improbability for
a sequence of the same composition. See McLachlan, Staden and Boswell,
Nucl. Acids Res. 1984.

.left margin1
@29. TX 3 @ Plot negentropy
.LEFT MARGIN2
.para
This routine is designed to show regions of the sequence that differ in
composition from others, and hence is similar to the "Plot local
deviations" routines.
.para
Negentropy or information is defined in the following way: let Pi be the
probability of observing base i, where i = A,C,G or T, then the average
information per base is
I=-sum(Pi.Log(Pi))   (sum over all i). This routine calculates Pi by
calculating the overall composition for the sequence and then plots I for
windows of length defined by the user.
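.para
The formula can be sketched in Python as below. This is an illustration
only. The function names are hypothetical, natural logarithms are assumed,
and, because the description does not spell out how the overall composition
enters the windowed value, the sketch simply applies the formula to the
composition of each window.
.lit

```python
import math
from collections import Counter


def information(probabilities):
    """I = -sum(Pi * log(Pi)); terms with Pi = 0 contribute nothing."""
    return -sum(p * math.log(p) for p in probabilities if p > 0)


def window_information(seq, span=31, interval=5):
    """Return (start position, I) for each window, positions 1-based."""
    seq = seq.upper()
    points = []
    for start in range(0, len(seq) - span + 1, interval):
        counts = Counter(seq[start:start + span])
        probs = [counts[b] / span for b in "ACGT"]
        points.append((start + 1, information(probs)))
    return points
```

.end lit
.para
With this definition a window containing a single base type scores zero and
a window with all four bases equally represented scores log(4).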
.left margin1
@30. TX 4 @ Search for hairpin loops
.LEFT MARGIN2
.para
Used to find simple inverted repeats or potential hairpin loops.
The loops are defined by a range of sizes for
the loop and a minimum number of consecutive base pairs in the stem.
The results can be presented graphically or listed.
A-T, G-C and G-T basepairs are counted.
.para
Define the range of loop sizes and the minimum number of consecutive
basepairs required. Choose between plotted or listed results.
.para
The loops found are plotted as blips on a
horizontal line that represents the sequence; the heights of the blips are
proportional to the number of basepairs in the stems. Note that only
uninterrupted stems are found - i.e. all basepairs must be made. To look
for stems with some unpaired bases (or for palindromes) use the inverted
repeat motif class in the pattern searching option.
.para
Typical dialogue follows.
.lit
? Menu or option number=30
 Search for hairpin loops
Define the range of loop sizes
? Minimum loop size (1-30) (1) =
? Maximum loop size (3-60) (3) =
? Minimum number of basepairs (2-20) (6) =
? (y/n) (y) Plot results n
 Searching

          T.G
          G-C
          G.T
          T.G
          C-G
          G-C
          T.G
          C-G
          G.T
     GCCGCA GCGGAGG
         49

           G
          G-C
          T.G
          C-G
          G.T
          T.G
          G-C
     CTGCTG GGAGGTC
         56


           G
          T.G
          G-C
          G.T
          T.G
          C-G
          G-C
          T-A
          T.G
     AGCGCA CGACTGA
        139

          A C
          G.T
          C-G
          G.T
          C-G
          C-G
          G-C
     TTCGCT CAACGCC
        244

.end lit
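.para
The search described above can be sketched in Python as follows. This is an
illustration under stated assumptions, not the program's own algorithm: the
function name and the form of the returned tuples are hypothetical, and the
defaults are the ones offered in the dialogue. A-T, G-C and G-T pairs are
accepted and the stem must be uninterrupted.
.lit

```python
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def find_hairpins(seq, min_loop=1, max_loop=3, min_pairs=6):
    """Return (position, loop size, stem length) for each hairpin found.

    position is the 1-based position of the last base of the 5' arm of the
    stem, i.e. the base immediately before the loop.
    """
    seq = seq.upper()
    hits = []
    for left in range(len(seq)):
        for loop in range(min_loop, max_loop + 1):
            right = left + loop + 1  # first base after the loop
            pairs = 0
            # Extend the stem outwards while consecutive bases can pair.
            while (left - pairs >= 0 and right + pairs < len(seq)
                   and (seq[left - pairs], seq[right + pairs]) in PAIRS):
                pairs += 1
            if pairs >= min_pairs:
                hits.append((left + 1, loop, pairs))
    return hits
```

.end lit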
.LEFT MARGIN1
@31. TX 4 @ Search for long range inverted repeats
.LEFT MARGIN2
.para
Searches for inverted repeats. The repeats found are exact matches of at
least 6 consecutive bases. Results can be presented graphically or listed.
Plotted results show the end points of repeats joined by rectangular
lines.
.para
If dialogue is not requested the defaults will be taken. Otherwise choose
between plotted or listed results, optionally restrict the analysis to a
segment of the currently active region, and choose a minimum repeat length.
.para
Typical dialogue follows.
.lit
? Menu or option number=D31
 Plot long-range inverted repeats
? (y/n) (y) Plot results n
Define restricted region
? start (1-1023) (1) =
? end (2-1023) (1023) =
? Minimum inverted repeat (6-30) (12) =10
 Searching
    27     909      10  TGCCCAGAGA

.end lit
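.para
One way to find exact inverted repeats is to index every word of the minimum
length and look up the reverse complement of each. The Python sketch below
illustrates this idea. It is an assumption about a workable method, not a
description of the program's own algorithm, and the function names are
hypothetical.
.lit

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    """Reverse complement of a DNA word written with A, C, G and T."""
    return word.translate(COMPLEMENT)[::-1]


def inverted_repeats(seq, min_len=12):
    """Return (position 1, position 2, word) for each exact inverted repeat.

    Positions are 1-based starts on the + strand; word is the sequence at
    position 1, and its reverse complement starts at position 2.
    """
    seq = seq.upper()
    index = {}
    for i in range(len(seq) - min_len + 1):
        index.setdefault(seq[i:i + min_len], []).append(i)
    hits = []
    for word, starts in index.items():
        partner = reverse_complement(word)
        for i in starts:
            for j in index.get(partner, []):
                if i < j:  # report each pair once and skip self matches
                    hits.append((i + 1, j + 1, word))
    return sorted(hits)
```

.end lit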
.LEFT MARGIN1
@32. TX 4 @ Search for repeats
.LEFT MARGIN2
.para
Searches for direct repeats. The repeats found are exact matches of at
least 6 consecutive bases. Results can be presented graphically or listed.
Plotted results show the end points of repeats joined by rectangular
lines.
.para
If dialogue is not requested the defaults will be taken. Otherwise choose
between plotted or listed results, optionally restrict the analysis to a
segment of the currently active region, and choose a minimum repeat length.
.para
Typical dialogue follows.

.lit
 ? Menu or option number=D32
 Plot repeats
? (y/n) (y) Plot results n
Define restricted region
? start (1-1023) (1) =
? end (2-1023) (1023) =
? Minimum repeat (6-30) (12) =8
 Searching
   619     988       8  GCTGTTGT
   514     646       8  GCTGCTAA
    94     865       8  TCCGCTGG
   146     222       9  GTGGCTGGC
   455     497       8  TCGCCCTC
   454     496       9  CTCGCCCTC
   872     875       8  GCCGCCGC
   510     615       8  CGTTGCTG
   152     913       8  GGCAGCGA
   199     265       8  CGTCGAGG
   689     794       8  AGTTTGGG
   147     223       8  TGGCTGGC
   101     116       8  GACGAGGA
     8     690       8  GTTTGGGC
    52     141       8  TGCTGGTG

.end lit
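.para
A listing of this kind can be produced by indexing every word of the minimum
length and extending each matching pair to the right. The Python sketch below
illustrates the idea. It is not the program's own algorithm and the function
name is hypothetical. As in the listing above, a long repeat is also reported
from each later start position at which a word of the minimum length still
matches.
.lit

```python
def direct_repeats(seq, min_len=12):
    """Return (position 1, position 2, length, word) for each direct repeat.

    Positions are 1-based. Each pair of identical words of length min_len
    is extended to the right for as long as the two copies keep matching.
    """
    seq = seq.upper()
    index = {}
    for i in range(len(seq) - min_len + 1):
        index.setdefault(seq[i:i + min_len], []).append(i)
    hits = []
    for starts in index.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                i, j = starts[a], starts[b]
                n = min_len
                while j + n < len(seq) and seq[i + n] == seq[j + n]:
                    n += 1
                hits.append((i + 1, j + 1, n, seq[i:i + n]))
    return sorted(hits)
```

.end lit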
.left margin1
@33. TX 4 @ Search for z dna (total ry, yr)
.LEFT MARGIN2
.para
Searches for segments of the sequence that might form Z DNA. A window
length is chosen and the number of RY and YR dinucleotides within each
window is plotted. The top of the box corresponds to all RY or YR, the
bottom to zero RY or YR.
.para
If dialogue is requested, select a window length and plot interval.
Otherwise the defaults will be used.
.para
The program contains three
separate ways of doing this (options 33,34,35).
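.para
The count plotted by this option can be sketched in Python as below. This is
an illustration only. The function names are hypothetical, R is taken as A or
G and Y as C or T, and the value is returned as a fraction of the
dinucleotides in the window so that 1.0 corresponds to the top of the box.
.lit

```python
PURINES = frozenset("AG")
PYRIMIDINES = frozenset("CT")


def is_ry_or_yr(a, b):
    """True if the dinucleotide ab is purine-pyrimidine or pyrimidine-purine."""
    return ((a in PURINES and b in PYRIMIDINES)
            or (a in PYRIMIDINES and b in PURINES))


def ry_yr_profile(seq, window, interval):
    """Return (start position, fraction of RY or YR dinucleotides) per window."""
    if window < 2:
        raise ValueError("window must hold at least one dinucleotide")
    seq = seq.upper()
    points = []
    for start in range(0, len(seq) - window + 1, interval):
        w = seq[start:start + window]
        n = sum(is_ry_or_yr(w[i], w[i + 1]) for i in range(window - 1))
        points.append((start + 1, n / (window - 1)))
    return points
```

.end lit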
.left margin1
@34. TX 4 @ Search for z dna (runs of ry, yr)
.LEFT MARGIN2
.para
Searches for segments of the sequence that might form Z DNA. Results
are plotted.
.para
If dialogue is requested define a window length and plot interval.
Otherwise the defaults will be used.
.para
The routine
counts the number of R in positions 1,3,5 etc =R1, the number of Y in
positions 2,4,6 etc =Y1, the number of Y in positions 1,3,5 etc =Y2 and
the number of R in positions 2,4,6 etc =R2 for a window length. It plots the
maximum of R1+Y1 and R2+Y2 relative to a minimum of (window length)/2 and a
maximum of (window length). (See 33,35).
.LEFT MARGIN1
@35. TX 4 @ Search for z dna (best phased value)
.LEFT MARGIN2
.para
Searches for segments of the sequence that might form Z DNA. Results
are plotted.
.para
If dialogue is requested define a window length and a plot interval.
Otherwise the default values will be used.
.para
 The routine
counts the number of consecutive RY or YR dinucleotides in phase. It
moves
through the sequence counting the number of RY or YR dinucleotides; when
the next dinucleotide is not of the correct type the score is set back to
zero and the search restarted using the current base to set the phase. The
plots are done relative to a minimum of zero and a maximum defined by
the
user. (See 33,34).
.LEFT MARGIN1
@36. TX 4 @ Local similarity or complementarity search
.LEFT MARGIN2
.PARA
This function is designed to find segments of
local similarity or complementarity. It is therefore like performing
a DIAGON
plot that is
restricted to regions near the main diagonal.  Results can be presented
graphically or listed.
.para
Users define
a region to search through,
a span length, a range for searching through and a cut-off score. The
program takes all sections of sequence
of length span within the defined region
 and compares them to
all other sequences within the region and
range specified.
If a match above the cutoff is found we
need to show the position
of the two sections of sequence and the score, and we do it in the
following way.
If we have a 70%
match between
a sequence that starts at p1 and a sequence that starts at p2
the program draws a
diagonal line that starts at p1 with height 70% of the box and which
finishes at p2 with
height 0.
The matches can also be listed.
.para
Here I define the terms range, region, and span and what is compared.
Suppose we have a defined region j1 to j2, a range of i1 to i2 and a span
of
s; the program will take, in turn, all sections of sequence of length s
within j1 and j2 and compare them to all sequences that start a distance
i1+s-1
to i2+s-1 away from them. First it will take the sequence of length s
starting
at j1 and compare it
with the sequence of length s starting at
j1+s-1+i1, then j1+s-1+i1+1, etc up to j1+s-1+i2; then it will take the
sequence of length s starting at j1+1 and compare it with the sequence
starting at j1+s-1+1+i1 etc. This continues until we hit
 the right hand end of the
sequence as defined by j2. Note 1) that sequences are not compared with
themselves: the nearest sequence compared to a span s starting at j
starts
at j+s; 2) ranges i1 and i2 are ranges of start positions; 3) by choosing a
range greater than the length of the sequence this routine will do a full
DIAGON analysis except for those points within a distance span of
 the main diagonal (see note 1).
.para
Typical dialogue follows.
.lit

? Menu or option number=36
 Search for local similarity or complementarity
? (y/n) (y) Find direct repeats
? (y/n) (y) Keep picture n
? Span (5-200) (15) =
Define restricted region
? start (0-1023) (1) =
? end (2-1023) (1023) =
? Percent match (1.00-100.00) (70.00) =
? Range start (1-50) (1) =
? Range end (1-50) (1) =5
? (y/n) (y) Plot results n
 Working


       118        128
         CGAGGAGGAG GTGGA
          ** *****  ** **
         GGACGAGGAC GTCGA
       100        110


       119        129
         GAGGAGGAGG TGGAT
         ** ***** * * **
         GACGAGGACG TCGAC
       101        111
? (y/n) (y) Find direct repeats n
? (y/n) (y) Keep picture
? Span (5-200) (15) =
Define restricted region
? start (0-1023) (1) =
? end (2-1023) (1023) =
? Percent match (1.00-100.00) (70.00) =
? Range start (1-50) (1) =
? Range end (1-50) (5) =8
? (y/n) (y) List results

 Working


       178        188
         ACTCAGATCC GGCGG
         ***** ***  * **
         ACTCAAATCA GTCGC
       156        166


       177        187
         CACTCAGATC CGGCG
          ***** ***  * **
         AACTCAAATC AGTCG
       157        167
? (y/n) (y) Find inverted repeats !
.end lit
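.para
The definitions of region, range and span given above can be sketched in
Python as follows. This is an illustration only. The function name is
hypothetical, the defaults are those offered in the dialogue, and only the
direct (similarity) comparison is shown, not the search for complementarity.
.lit

```python
def local_similarity(seq, span=15, start=1, end=None, percent=70.0,
                     range_start=1, range_end=1):
    """Return (p1, p2, score) for each pair of spans scoring at least percent.

    p1 and p2 are the 1-based start positions of the two spans. The span
    starting at p1 is compared with the spans starting at p1+span-1+i for
    i from range_start to range_end, provided they lie within the region.
    """
    seq = seq.upper()
    end = len(seq) if end is None else end
    hits = []
    for p1 in range(start, end - span + 2):
        first = seq[p1 - 1:p1 - 1 + span]
        for i in range(range_start, range_end + 1):
            p2 = p1 + span - 1 + i
            if p2 + span - 1 > end:
                break  # the second span would run past the region
            second = seq[p2 - 1:p2 - 1 + span]
            matches = sum(x == y for x, y in zip(first, second))
            score = 100.0 * matches / span
            if score >= percent:
                hits.append((p1, p2, score))
    return hits
```

.end lit
.para
With a span of 15 and a range of 1 to 5, a span starting at 100 is compared
with the spans starting at 115 to 119, which covers the first match listed
in the dialogue above (100 against 118).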

.left margin1
@37. TX 5 @ Set genetic code
.LEFT MARGIN2
.para
This function allows the user to change the current active genetic code
for
all the options. The user may select: the standard code, the mammalian
mitochondrial code, the yeast mitochondrial code or a personal code
(define
your own).
.para
Select code. If personal, define a codon and select an amino acid. When all
the required codons have been reset, enter a blank codon to finish.
.para
The code differences are:
.lit
          Mammalian        Yeast
  Codon  Mitochondrial  Mitochondrial  Standard
   UGA       W              W            STOP
   AUA       M              M             I
   CUA       L              T             L
   AGA      STOP            R             R
   AGG      STOP            R             R
.END LIT
.para
Typical dialogue follows.

.lit
? Menu or option number=37
X 1 Standard code
  2 Mammalian mitochondrial code
  3 Yeast mitochondrial code
  4 Personal code
? 0,1,2,3,4 =2

? Menu or option number=37
X 1 Standard code
  2 Mammalian mitochondrial code
  3 Yeast mitochondrial code
  4 Personal code
? 0,1,2,3,4 =4
Define genetic code by typing a codon
followed by a 1 letter amino acid symbol
? Codon=TTT
Default Amino acid symbol=F
? Amino acid symbol=W
? Codon=
.end lit
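.para
The table of differences can be expressed as a set of overrides applied to
the standard code, as in the Python sketch below. The names are hypothetical
and the sketch is an illustration, not the program's internal
representation. Only the differences listed in the table above are applied,
and codons are written with T in place of U.
.lit

```python
BASES = "TCAG"
# Standard code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

STANDARD = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}

# Differences from the standard code, taken from the table above.
MAMMALIAN_MITOCHONDRIAL = dict(STANDARD, TGA="W", ATA="M", AGA="*", AGG="*")
YEAST_MITOCHONDRIAL = dict(STANDARD, TGA="W", ATA="M", CTA="T")


def translate(seq, code=STANDARD):
    """Translate an in-frame sequence; unknown codons are shown as '?'."""
    seq = seq.upper().replace("U", "T")
    usable = len(seq) - len(seq) % 3
    return "".join(code.get(seq[i:i + 3], "?") for i in range(0, usable, 3))


# A personal code is a copy of an existing code with individual codons
# reset, as in the dialogue above where TTT is changed from F to W.
PERSONAL = dict(STANDARD, TTT="W")
```

.end lit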

.left margin1
@38. T 3 4 @ Examine repeats
.left margin2
.para
This function can be used to examine the frequencies of repeated words
within a sequence. It finds all words that occur more than once. The
user selects a minimum word length and the program finds all words of that
length that occur more than once; then it "follows" each repeated word until it
becomes unique. For each word length it can report the number of different
repeated words, the number of occurrences of each word, and their actual
positions and sequences.
.para
It is possible that the algorithm may run out of memory, particularly if a short
minimum word length is chosen, or if the sequence is very long or very
repetitive. If this occurs the longest reported word length will not
necessarily be the longest in the sequence: the memory will have been consumed
before the longest word is found.
.para
Typical dialogue and output are shown below.
.lit

 Expected length of longest repeat    14
 ? Minumim word length (1-6) (6) =6
 Working
 ? Show repeat frequencies for words of at least length (6-15) (15) =10
 For length    10 the number of different repeated words is  2035
 For length    11 the number of different repeated words is   613
 For length    12 the number of different repeated words is   161
 For length    13 the number of different repeated words is    37
 For length    14 the number of different repeated words is    10
 For length    15 the number of different repeated words is     1
 ? Show repeats for words of length (6-15) (15) =14
 ? Show repeats for words occuring with frequency (2-9999) (2) =2

 ggtgctcatgccca
 occurs at  21611
 occurs at  21851
 ttatccggtgatga
 occurs at   4604
 occurs at   8806
 agcaccacgctgac
 occurs at   5954
 occurs at   9486
 catgacggaggatg
 occurs at  10480
 occurs at  19925
 aaagacgggaaaat
 occurs at  11820
 occurs at  43157
 tacaaaaccaattt
 occurs at  26797
 occurs at  31369
 cgagaaagagtgcg
 occurs at   4260
 occurs at  44305
 gccggatgatggcg
 occurs at   7893
 occurs at  16638
 atgacggaggatga
 occurs at  10481
 occurs at  19926
 gcggcgaacgaggc
 occurs at  11352
 occurs at  18718
 ? Show repeats for words of length (6-15) (15) =!

Example of not enough memory
----------------------------

 Expected length of longest repeat    14
 ? Minumim word length (1-6) (6) =1
 Working
 Not enough memory
 Memory used in bytes 1125996. Length of longest repeat     5
 ? Show repeat frequencies for words of at least length (1-5) (5) =!

.end lit
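.para
The figures reported by this function can be reproduced on a small scale with
the brute-force Python sketch below. It is an illustration only: the function
names are hypothetical, and the sketch recounts the words at every length
rather than "following" each repeated word as the program does, so it is
suitable only for short sequences.
.lit

```python
def repeated_words(seq, length):
    """Map each word of the given length that occurs more than once
    to the list of its 1-based start positions."""
    seq = seq.lower()
    positions = {}
    for i in range(len(seq) - length + 1):
        positions.setdefault(seq[i:i + length], []).append(i + 1)
    return {word: where for word, where in positions.items() if len(where) > 1}


def repeat_frequencies(seq, min_len=6):
    """Map each word length, from min_len upwards, to the number of
    different repeated words, stopping when no repeats remain."""
    result = {}
    length = min_len
    while True:
        n = len(repeated_words(seq, length))
        if n == 0:
            break
        result[length] = n
        length += 1
    return result
```

.end lit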
.left margin1
@39. TX 5 @ Translate and list in upto six phases
.LEFT MARGIN2
.para
This is a general listing function that will perform translations and
produce several forms of output. The possibilities are:
.lit
1) no translation, list one or two strands, two ways of numbering the
sequence.
2) translation, one or two strands, one or three letter codes.
 Positions defined by:
  a) open reading frames of some minimum length l, l can be 0, hence giving
a complete six phase translation.
  b) positions typed on keyboard, again 1 to 6 phases, translations appearing
above and below the dna.
  c) positions read from a feature table.

.end lit
.para
It should be used in preference to option 5. For publication
without a translation, the option to number ends of lines is more compact
than option 5. Some examples and typical dialogue are given below. Note the
requirement for d39: the option must be selected with dialogue.
.lit

? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation

The segments to translate can be
   1 Typed on the keyboard
   2 Read from a feature table
X  3 Open reading frames
? 1,2,3 =
? Minimum open frame in amino acids (0-7238) (30) =
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X  1 + strand only
   2 - strand only
   3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines


    N  A  T  T  I  S  R  I  D  A  T  F  S  A  R  A  P  N  E  N
   AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT      60
       .    :    .    :    .    :    .    :    .    :    .    :
   TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
                                        *  S  A  G  W  I  F  I
      A  V  V  I  L  L  I  S  A  V  K  E  A  R  A  G  F  S  F

    I  A  K  Q  V  I  D  H  L  R  N  V  S  N  G  Q  T  K  S  T
        L  N  R  L  L  T  I  C  E  M  Y  L  M  V  K  L  N  L  L
   ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT     120
       .    :    .    :    .    :    .    :    .    :    .    :
   TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
    Y  S  F  L  N  N  V  M  Q  S  I  Y  R  I  T  L  S  F  R  S
   I  A  L  C  T  I  S  W  K  R  F  T  D  L  P  *  V  L  D  V

    R  S  Q  N  W  E  S  T  V  T  W  N  E  T  S  R  H  R  T  L
     V  R  R  I  G  N  Q  L  L  H  G  M  K  L  P  D  T  V  L  *
   CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA     180
       .    :    .    :    .    :    .    :    .    :    .    :
   GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
    T  R  L  I  P  F
   R  E  C  F  Q  S  D  V  T  V  H  F  S  V  E  L  C  R  V  K

    V  A  Y  L  K  H  V  E  L  Q  H  Q  I  Q  Q  L  S  S  K  P
   GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA     240
       .    :    .    :    .    :    .    :    .    :    .    :
   CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
   T  A  Y  K  F  C  T  S  S  C  C  W  I

    S  A  K  M  T  S  Y  Q  K  E  Q  L  K  V  L  S  N  P  D  L
   TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG     300
       .    :    .    :    .    :    .    :    .    :    .    :
   AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC


? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation N
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X  1 + strand only
   2 - strand only
   3 Both strands
? 1,2,3 =
? (y/n) (y) Number ends of lines


   AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT      60

   ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT     120

   CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA     180

   GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA     240

   TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG     300


? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
   1 Typed on the keyboard
   2 Read from a feature table
X  3 Open reading frames
? 1,2,3 =
? Minimum open frame in amino acids (0-7238) (30) =0
? (y/n) (y) Use 1 letter codes N
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X  1 + strand only
   2 - strand only
   3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines


   AsnAlaThrThrIleSerArgIleAspAlaThrPheSerAlaArgAlaProAsnGluAsn
    ThrLeuLeuLeuLeuValGluLeuMetProProPheGlnLeuAlaProGlnMetLysIle
     ArgTyrTyrTyr******Asn***CysHisLeuPheSerSerArgProLys***Lys
   AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT      60
       .    :    .    :    .    :    .    :    .    :    .    :
   TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
   ValSerSerSerAsnThrSerAsnIleGlyGlyLys***SerAlaGlyTrpIlePheIle
    Arg************TyrPheGlnHisTrpArgLysLeuGluArgGlyLeuHisPheTyr
     AlaValValIleLeuLeuIleSerAlaValLysGluAlaArgAlaGlyPheSerPhe

   IleAlaLysGlnValIleAspHisLeuArgAsnValSerAsnGlyGlnThrLysSerThr
    ***LeuAsnArgLeuLeuThrIleCysGluMetTyrLeuMetValLysLeuAsnLeuLeu
  TyrSer***ThrGlyTyr***ProPheAlaLysCysIle***TrpSerAsn***IleTyr
   ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT     120
       .    :    .    :    .    :    .    :    .    :    .    :
   TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
   TyrSerPheLeuAsnAsnValMetGlnSerIleTyrArgIleThrLeuSerPheArgSer
    Leu***ValPro***GlnGlyAsnAlaPheHisIle***HisAspPhe***Ile***Glu
  IleAlaLeuCysThrIleSerTrpLysArgPheThrAspLeuPro***ValLeuAspVal

   ArgSerGlnAsnTrpGluSerThrValThrTrpAsnGluThrSerArgHisArgThrLeu
    ValArgArgIleGlyAsnGlnLeuLeuHisGlyMetLysLeuProAspThrValLeu***
  SerPheAlaGluLeuGlyIleAsnCysTyrMetGlu***AsnPheGlnThrProTyrPhe
   CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA     180
       .    :    .    :    .    :    .    :    .    :    .    :
   GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
   ThrArgLeuIleProPhe***SerAsnCysProIlePheSerGlySerValThrSer***
    AsnAlaSerAsnProIleLeuGln***MetSerHisPheLysTrpValGlyTyrLysLeu
  ArgGluCysPheGlnSerAspValThrValHisPheSerValGluLeuCysArgValLys

   ValAlaTyrLeuLysHisValGluLeuGlnHisGlnIleGlnGlnLeuSerSerLysPro
    LeuHisIle***AsnMetLeuSerTyrSerThrArgPheSerAsn***AlaLeuSerHis
  SerCysIlePheLysThrCys***AlaThrAlaProAspSerAlaIleLysLeu***Ala
   GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA     240
       .    :    .    :    .    :    .    :    .    :    .    :
   CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
   AsnCysIle***PheMetAsnLeu***LeuValLeuAsnLeuLeu***AlaArgLeuTrp
    GlnMetAsnLeuValHisGlnAlaValAlaGlySerGluAlaIleLeuSer***AlaMet
  ThrAlaTyrLysPheCysThrSerSerCysCysTrpIle***CysAsnLeuGluLeuGly

   SerAlaLysMetThrSerTyrGlnLysGluGlnLeuLysValLeuSerAsnProAspLeu
    ProGlnLys***ProLeuIleLysArgSerAsn***ArgTyrSerLeuIleLeuThrCys
  IleArgLysAsnAspLeuLeuSerLysGlyAlaIleLysGlyThrLeu***Ser***Pro
   TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG     300
       .    :    .    :    .    :    .    :    .    :    .    :
   AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
   GlyCysPheHisGlyArgIleLeuLeuLeuLeu***LeuTyrGluArgIleArgValGln
    ArgLeuPheSerArgLysAspPheProAlaIleLeuProValArg***AspGlnGlyThr
  AspAlaPheIleValGlu******PheSerCysAsnPheThrSerGluLeuGlySerArg


? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
   1 Typed on the keyboard
   2 Read from a feature table
X  3 Open reading frames
? 1,2,3 =1
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X  1 + strand only
   2 - strand only
   3 Both strands
? 1,2,3 =
? (y/n) (y) Number ends of lines N
Translate
? From (0-300) (0) =101
? To (1-300) (300) =300
Translate
? From (0-300) (0) =102
? To (1-300) (300) =200
Translate
? From (0-300) (0) =


   AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT
           10        20        30        40        50        60

                                            M  V  K  L  N  L  L
                                             W  S  N  *  I  Y
   ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT
           70        80        90       100       110       120

     V  R  R  I  G  N  Q  L  L  H  G  M  K  L  P  D  T  V  L  *
   S  F  A  E  L  G  I  N  C  Y  M  E  *  N  F  Q  T  P  Y  F
   CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA
          130       140       150       160       170       180

     L  H  I  *  N  M  L  S  Y  S  T  R  F  S  N  *  A  L  S  H
   S  C  I  F  K  T  C
   GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA
          190       200       210       220       230       240

     P  Q  K  *  P  L  I  K  R  S  N  *  R  Y  S  L  I  L  T  C
   TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG
          250       260       270       280       290       300


? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
   1 Typed on the keyboard
   2 Read from a feature table
X  3 Open reading frames
? 1,2,3 =2
? Embl feature table file=1.FT
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X  1 + strand only
   2 - strand only
   3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines


    N  A  T  T  I  S  R  I  D  A  T  F  S  A  R  A  P  N  E  N
   AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT      60
       .    :    .    :    .    :    .    :    .    :    .    :
   TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
                                        *  S  A  G  W  I  F  I
      A  V  V  I  L  L  I  S  A  V  K  E  A  R  A  G  F  S  F

    I  A  K  Q  V  I  D  H  L  R  N  V  S  N  G  Q  T  K  S  T
        L  N  R  L  L  T  I  C  E  M  Y  L  M  V  K  L  N  L  L
   ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT     120
       .    :    .    :    .    :    .    :    .    :    .    :
   TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
    Y  S  F  L  N  N  V  M  Q  S  I  Y  R  I  T  L  S  F  R  S
   I  A  L  C  T  I  S  W  K  R  F  T  D  L  P  *  V  L  D  V

    R  S  Q  N  W  E  S  T  V  T  W  N  E  T  S  R  H  R  T  L
     V  R  R  I  G  N  Q  L  L  H  G  M  K  L  P  D  T  V  L  *
   CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA     180
       .    :    .    :    .    :    .    :    .    :    .    :
   GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
    T  R  L  I  P  F
   R  E  C  F  Q  S  D  V  T  V  H  F  S  V  E  L  C  R  V  K

    V  A  Y  L  K  H  V  E  L  Q  H  Q  I  Q  Q  L  S  S  K  P
   GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA     240
       .    :    .    :    .    :    .    :    .    :    .    :
   CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
   T  A  Y  K  F  C  T  S  S  C  C  W  I

    S  A  K  M  T  S  Y  Q  K  E  Q  L  K  V  L  S  N  P  D  L
   TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG     300
       .    :    .    :    .    :    .    :    .    :    .    :
   AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
                                     *  L  Y  E  R  I  R  V  Q
                        *  F  S  C  N  F  T  S  E  L  G  S  R
.end lit
.left margin1
@40. TX 5 @ Translate and write the protein sequence to disk
.LEFT MARGIN2
.para
This routine allows the user to translate sections of the sequence into
the
1 letter amino acid codes and store the resulting amino acid sequences in
a disk file.
Two modes of use are possible. Either all open reading frames of at least
some minimum length will
automatically be found and translated, or the user can specify that
particular segments be translated.
.para
Mode 1: the user chooses to translate all open reading frames.
.para
Either, or both, strands can be
translated.
 The output file is in the same format as a PIR .seq file.
Each protein segment is given an entry name that is its start base in
the DNA, and a title that includes its end position,
reading frame and strand (+ for plus, - for minus).
Each segment is terminated by * whether or not
there is a stop codon in the DNA. The file is therefore suitable for input
to FASTA, ALIGNL and ANALYSEPL.
.para
Mode 2: the user selects to identify the segments to translate.
.para
Either, or both, strands can be
translated.
If multiple coding regions
are translated each will be separated from the previous one by a gap of 5
dashes (-----).
The sections to translate can be
defined from the keyboard or by supplying the name of the appropriate
EMBL
library feature table.
.para
Typical dialogue follows.
.lit
? Menu or option number=40
 Translate and write protein sequence to disk
? (y/n) (y) Translate selected regions
? (y/n) (y) Define segments using keyboard
Translate
? From (0-1023) (0) =1
? To (1-1023) (1023) =111
? (y/n) (y) + strand
Translate
? From (0-1023) (0) =
? Output file name=1.OUT

 ? Menu or option number=40
 Translate and write protein sequence to disk
? (y/n) (y) Translate selected regions n
? Minimum open frame in amino acids (5-1000) (30) =

X 1 + strand only
  2 - strand only
  3 Both strands
? 0,1,2,3 =3
? File name for translation=1.OUT

? Menu or option number=6
Page through text files
? Name of file to read=1.OUT
>P1;    25
    135     1 +
 GAQRLLRRSCWCWRCGGRQRTQGSAGRGRRRRGGGG*
>P1;   238
    486     1 +
 IRCRDCGQRRRGIFDLVDDFHVRRHIVLARKLFEAEGTGVHFHISLMGGNIVTAEVTNVR
 VDAGADFAAVRMLALFGAVVPH*
>P1;   556
    795     1 +

 SSTQVRRASAQTSSLQLESIVAVVNVEVFLAAKHSRFYIAVLFAQFGPLLDARLDRGCGK
 GAGRRDQWRGGGVDLANGR*
>P1;   796
    987     1 +

 FGYADHAFHLRSTSRHSDNVKFDSAGRRRCCCFHLVFSLGSDEEGLLARLLVEVTTIRVV
 LRG*
>P1;     2
    163     2 +
 NSVWAWCEVPRDYCAAAAGAGGAEVVNGPRDPLDEDVDDEEEVDSALLVAGSD*
>P1;   176
    391     2 +
 PLRSGGGGVEAPETPSGWPARFAAATVANAVEGFSILWMIFTCAVILSLRVNSLKQKGQG
 YTFTFRLWEVT*
>P1;   476
    628     2 +
 SLTEPSASPSPTLLLRFSLVLTEGVPNPALRFGVLPLRPAAFNLNPSLLL*
>P1;   629
    958     2 +
 MSRYSWLLNTAGFTSPFCLPSLGRFWTRGLTVAVEKEPAGETNGVEAALTLPMGVSLGML
 TMLFTCAPPAAIPIMLSLIPLAAAAAAVSTWCFLWAAMRKACWRACSLR*
>P1;     3
    293     3 +
 IRFGLGVRCPEITAPQLLVLAVRRSSTDPGIRWTRTSTTRRRWIAHCWWLAATDLSSDHS
 DPAAEASRLPKLPVAGLLDSLPRLWPTPSRDFRSCG*
>P1;   411
    521     3 +
 CACRRGSRLCSGTYARPLWCSSPSLSPPPRPRQRCC*
>P1;  1020
     37     1 -
 EFGKYNPLTDNSSPTQDHTDGSHLNEQARQQAFLIAAQRKHQVETAAAAAASGIKLNIIG
 MAAGGAQVKSMVSIPKLTPIGKVNAASTPLVSPAGSFSTATVKPRVQKRPKLGKQNGDVK
 PAVFSSQEYLDIYNSNDGFKLKAAGLSGSTPNLSAGLGTPSVKTKLNLSSNVGEGEAEGS
 VRDYCTKEGEHTYRCKVCSRVYTHISNFCRHYVTSHKRNVKVYPCPFCFKEFTRKDNMTA
 HVKIIHKIENPSTALATVAAANLAGQPLGVSGASTPPPPDLSGQNSNQSLPATSNALSTS
 SSSSTSSSSGSLGPLTTSAPPAPAAAAQ*
>P1;   373
     -1     2 -
 AKCESVPLSLLLQRVYAQGQYDGARENHPQDRKSLDGVGHSRGSESSRPATGSFGSLDAS
 AAGSEWSELKSVAASHQQCAIHLLLVVDVLVQRIPGSVDDLRTASTSSCGAVISGHLTPS
 PNRI*
>P1;   517
    407     2 -
 QQRWRGRGGGLSEGLLHQRGRAYVPLQSLLPRLHAH*
>P1;   649
    518     2 -
 QPGIPRHLQQQRWIQVEGCWSERKHAEPECWIRNSLCQNQAES*
>P1;   853
    650     2 -
 HYRNGGWWSAGEKHGQHTQTNAHWQGQRRLHAIGLACRLLFHSHGQAARPEAAQTQTER
 RCKTGCV*
>P1;   958
    854     2 -
 SPQRAGAPTSLPHRCPEKTPGGNSSSGGGQRNQT*
>P1;   179
     78     3 -
 VVRTQISRCQPPAMRYPPPPRRRRPRPADPWVR*
>P1;   479
    363     3 -
 GTTAPKRASIRTAAKSAPASTRTLVTSAVTMLPPISEM*
>P1;   791
    666     3 -
 RPLARSTPPPRHWSRLPAPFPQPRSSRASRSGPNWANRTAM*
>P1;  1022
    819     3 -
 SNSASTTRSPTTAHPRRTTRMVVTSTSRRANKPSSSLPRENTRWKQQQRRRPAESNLTLS
 EWRLVERR*
End of file
.end lit
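.para
The following Python sketch is not part of NIP; it only illustrates
Mode 1 as described above. It assumes that an open reading frame is any
stretch between stop codons (no start codon is required), that the
reported end position includes the stop codon when one is present, and
that only the + strand is scanned. The names translate, open_frames and
pir_record are hypothetical, and the record layout is an approximation
of the listing shown above.
.lit
```python
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def translate(dna):
    """Translate the complete codons of dna into 1 letter codes."""
    dna = dna.upper()
    return "".join(CODE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def open_frames(dna, min_aa=30):
    """Yield (start, end, frame, protein) for the + strand.

    Positions are 1-based. Each protein is terminated by '*'
    whether or not a stop codon follows it in the DNA.
    """
    for frame in range(3):
        segments = translate(dna[frame:]).split("*")
        pos = 0  # codon index of the current segment start
        for k, segment in enumerate(segments):
            if len(segment) >= min_aa:
                has_stop = k < len(segments) - 1
                start = frame + 3 * pos + 1
                end = start + 3 * len(segment) - 1 + (3 if has_stop else 0)
                yield start, end, frame + 1, segment + "*"
            pos += len(segment) + 1


def pir_record(start, end, frame, protein, strand="+"):
    """Format one entry: name = start base; title = end, frame, strand."""
    return f">P1;{start:6d}\n{end:8d}{frame:6d} {strand}\n {protein}\n"
```
.end lit
For example, translate applied to the first 60 bases of the sequence
used in the listings above gives NATTISRIDATFSARAPNEN, the first line
of the frame 1 translation.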

.LEFT MARGIN1
@41. TX 5 @ Calculate and write codon table to disk
.LEFT MARGIN2
.para
This routine calculates codon usage tables
for sections of the sequence
and stores the resulting tables on disk.
The sections to translate can be
defined from the keyboard or by supplying the name of the appropriate
EMBL
library feature table.
.para
If required users can add to an existing codon table stored as a disk file.
Choose between storing observed counts or having them normalised so
that the totals for each amino acid sum to 100. Select between defining
segments at the keyboard or using an EMBL feature table. Define
segments. Signal completion with a zero start. Supply a file name. For
each segment the program will display the counts; at the end it will
display the accumulated totals.
.para
Typical dialogue follows.
.lit
? Menu or option number=41
 Calculate and write codon table to disk
? (y/n) (y) Start with empty table
? (y/n) (y) Show observed counts
? (y/n) (y) Define segments using keyboard
? Count from (0-1023) (0) =1
? Count to (1-1023) (1023) =111
? (y/n) (y) + strand

     ===========================================
     F TTT   0. S TCT   0. Y TAT   0. C TGT   0.
     F TTC   1. S TCC   1. Y TAC   0. C TGC   3.
     L TTA   1. S TCA   0. * TAA   0. * TGA   1.
     L TTG   2. S TCG   0. * TAG   0. W TGG   2.
     ===========================================
     L CTT   0. P CCT   0. H CAT   0. R CGT   2.
     L CTC   0. P CCC   0. H CAC   0. R CGC   2.
     L CTA   0. P CCA   0. Q CAA   1. R CGA   1.
     L CTG   1. P CCG   0. Q CAG   2. R CGG   2.
     ===========================================
     I ATT   0. T ACT   0. N AAT   0. S AGT   0.
     I ATC   0. T ACC   1. N AAC   0. S AGC   1.
     I ATA   0. T ACA   0. K AAA   0. R AGA   1.
     M ATG   0. T ACG   0. K AAG   0. R AGG   0.
     ===========================================
     V GTT   0. A GCT   1. D GAT   0. G GGT   3.
     V GTC   0. A GCC   1. D GAC   0. G GGC   1.
     V GTA   0. A GCA   0. E GAA   1. G GGA   4.
     V GTG   1. A GCG   0. E GAG   0. G GGG   0.
     ===========================================
? Count from (0-1023) (0) =

    Codon totals over all genes
     ===========================================
     F TTT   0. S TCT   0. Y TAT   0. C TGT   0.
     F TTC   1. S TCC   1. Y TAC   0. C TGC   3.
     L TTA   1. S TCA   0. * TAA   0. * TGA   1.
     L TTG   2. S TCG   0. * TAG   0. W TGG   2.
     ===========================================
     L CTT   0. P CCT   0. H CAT   0. R CGT   2.
     L CTC   0. P CCC   0. H CAC   0. R CGC   2.
     L CTA   0. P CCA   0. Q CAA   1. R CGA   1.
     L CTG   1. P CCG   0. Q CAG   2. R CGG   2.
     ===========================================
     I ATT   0. T ACT   0. N AAT   0. S AGT   0.
     I ATC   0. T ACC   1. N AAC   0. S AGC   1.
     I ATA   0. T ACA   0. K AAA   0. R AGA   1.
     M ATG   0. T ACG   0. K AAG   0. R AGG   0.
     ===========================================
     V GTT   0. A GCT   1. D GAT   0. G GGT   3.
     V GTC   0. A GCC   1. D GAC   0. G GGC   1.
     V GTA   0. A GCA   0. E GAA   1. G GGA   4.
     V GTG   1. A GCG   0. E GAG   0. G GGG   0.
     ===========================================
? (y/n) (y) Save table in a file n
.end lit
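.para
As a rough illustration of the two kinds of table described above
(observed counts, and counts normalised so that the codons for each
amino acid sum to 100), here is a minimal Python sketch. It is not the
program's own code; the function names and the treatment of the three
stop codons as one group are assumptions.
.lit
```python
from collections import Counter
from itertools import product

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, codons enumerated in TCAG order.
CODE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}


def codon_counts(dna, start, end):
    """Count codons in dna[start..end] (1-based, inclusive)."""
    segment = dna[start - 1:end].upper()
    counts = Counter({codon: 0 for codon in CODE})
    for i in range(0, len(segment) - 2, 3):
        codon = segment[i:i + 3]
        if codon in CODE:  # skip codons with ambiguous bases
            counts[codon] += 1
    return counts


def normalise(counts):
    """Scale counts so the codons for each amino acid sum to 100."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODE[codon]] += n
    return {codon: (100.0 * n / totals[CODE[codon]]
                    if totals[CODE[codon]] else 0.0)
            for codon, n in counts.items()}
```
.end lit
Totals over several segments can be accumulated by calling the update
method of one Counter with the counts for each segment in turn.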

.left margin1
@42. TX 6 @ Codon usage method
.LEFT MARGIN2
.para
Used to find protein coding regions. For each window-length segment of
the sequence the routine measures the closeness to an expected codon usage.
Results are plotted for each of the three reading frames. Stop and start
codons are also marked on the plots. Has the highest resolution of all
such methods, but makes the strongest assumption, i.e. that the codon
usage is known. The latest version is described in Methods in Enzymology
183, 193-211.
.para
Choose whether to use an internal standard (i.e. part of the current
sequence known to code for a protein). If so define its end points, and
those of any others. Otherwise supply the name of a disk file containing a
table of codon usage. Tables are listed. Choose between using the
observed counts, or two types of normalisation: normalised to give an
average amino acid composition; normalised to no amino acid bias. The
first normalisation is clearly often sensible, but the second removes
valuable information and is only made available for special
circumstances. The final table will be displayed, followed by the
expected scores for window lengths 21, 31 and 41 codons. The scores for
each of the three reading frames are shown (they are logarithmic values)
to help users choose a window length for the analysis. Define a window
length and plot interval. Plotting will start.
.para
The method was first described in
Staden and McLachlan, Nucl. Acids Res. 10 141-156 (1982), and the
following is a summary of the initial ideas.
The method makes the following main assumptions: the codon
preferences of all the genes in the sequence we are examining are
similar to those of the standard; the sequence is coding
throughout its whole length in only one reading frame; in the coding
frame the frequency of codon abc has a definite value Fabc.
.LEFT MARGIN2
If we select a sequence  a1b1c1a2b2c2a3b3c3,...,anbncnan+1bn+1cn+1
then the
probability of selecting it in each of the three frames is:
.left margin15
frame 1: p1=Fa1b1c1.Fa2b2c2....Fanbncn
.left margin15
frame 2: p2=Fb1c1a2.Fb2c2a3...Fbncnan+1
.left margin15
frame 3: p3=Fc1a2b2.Fc2a3b3...Fcnan+1bn+1
.LEFT MARGIN2
The probability that selection of a particular sequence was "caused" by it
being a coding sequence is:
.LEFT MARGIN2
P1=p1/(p1+p2+p3), P2=p2/(p1+p2+p3), P3=p3/(p1+p2+p3).
.LEFT MARGIN2
The program calculates these values for the given window length but
plots Log(P/(1-P)) for each of the three frames. At each position along
the sequence for which values are plotted, the program finds which of
the three values is highest and places a
single point at the 50% level for the corresponding frame. These single
points will join to form a solid line if one frame is consistently the
highest scoring. In addition stop codons are shown as short vertical lines
that bisect the 50%
level of probability. When looking for coding regions
the user should look for solid horizontal lines at the
50% level that are not interrupted by these short vertical lines.
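.para
The calculation for one window can be sketched in Python as follows.
This is an illustration under stated assumptions, not the program's
code: the function name is hypothetical, natural logarithms are used
(the base of Log is not stated here), every codon frequency is assumed
to be greater than zero, and the sums are done in log space to avoid
underflow on long windows.
.lit
```python
import math


def frame_log_odds(window, freq):
    """Return Log(P/(1-P)) for frames 1, 2 and 3 of one window.

    window: n codons plus two extra bases, so that all three
            frames contain n complete codons.
    freq:   dict mapping each codon to its frequency F (> 0)
            in the standard.
    """
    window = window.upper()
    n = (len(window) - 2) // 3
    # log p for each frame: the sum of log F over the frame's codons
    log_p = [sum(math.log(freq[window[f + 3 * i:f + 3 * i + 3]])
                 for i in range(n))
             for f in range(3)]
    scores = []
    for k in range(3):
        # P/(1-P) equals p_k divided by the sum of the other two p
        others = [v for j, v in enumerate(log_p) if j != k]
        top = max(others)
        log_others = top + math.log(sum(math.exp(v - top) for v in others))
        scores.append(log_p[k] - log_others)
    return scores
```
.end lit
The frame with the highest score is the one that would receive the
single point at the 50% level.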
.para
Changes.
 Two normalisations are offered: 1) to remove all amino acid
compositional components from the tables, hence leaving only the codon
preference components. In general this is not recommended as the amino
acid
component alone is often sufficient to choose correctly between frames,
but
may be useful in special circumstances. 2) to change the amino acid
composition components to give an average amino acid composition
rather than the one contained in the standard (this leaves the codon
preference
components unchanged). In general this should be useful as the average
amino acid composition is likely to be closer to the composition of the
genes being hunted, than is that of the standard table of codon
preferences.
The average composition
is that recently published by Argos, not the Dayhoff one that we have
used
before.
.para
Typical dialogue follows.
.lit

? Menu or option number=42
Staden and McLachlan codon usage method
Codon tables for standards may be read from disk
or calculated from parts of the current sequence
? (y/n) (y) Define internal standard
Define standard
? start (0-1023) (0) =1
? end (2-1023) (1023) =1000
     ===========================================
     F TTT  13. S TCT   1. Y TAT   1. C TGT   3.
     F TTC   4. S TCC  10. Y TAC   1. C TGC   7.
     L TTA   1. S TCA   0. * TAA   1. * TGA   4.
     L TTG   4. S TCG   1. * TAG   3. W TGG   5.
     ===========================================
     L CTT   9. P CCT   1. H CAT   3. R CGT  14.
     L CTC   7. P CCC   0. H CAC   7. R CGC  14.
     L CTA   0. P CCA   0. Q CAA   4. R CGA   9.
     L CTG  12. P CCG   1. Q CAG   9. R CGG   8.
     ===========================================
     I ATT   7. T ACT   4. N AAT   4. S AGT   1.
     I ATC   4. T ACC   5. N AAC   3. S AGC   7.
     I ATA   1. T ACA   1. K AAA   3. R AGA   2.
     M ATG   2. T ACG   1. K AAG   2. R AGG   2.
     ===========================================
     V GTT  11. A GCT  13. D GAT   6. G GGT   9.
     V GTC   5. A GCC  10. D GAC   9. G GGC  11.
     V GTA   6. A GCA   5. E GAA   6. G GGA  12.
     V GTG   8. A GCG   5. E GAG   3. G GGG   8.
     ===========================================
Define standard
? start (0-1023) (0) =
Total codons in standard=     333.
X 1 Use observed frequencies
  2 Normalize to average amino acid composition
  3 Normalize to no amino acid bias
? 0,1,2,3 =2
     ===========================================
     F TTT  19. S TCT   2. Y TAT  10. C TGT   3.
     F TTC   6. S TCC  22. Y TAC  10. C TGC   8.
     L TTA   2. S TCA   0. * TAA   0. * TGA   0.
     L TTG   7. S TCG   2. * TAG   0. W TGG   8.
     ===========================================
     L CTT  16. P CCT  16. H CAT   4. R CGT  10.
     L CTC  12. P CCC   0. H CAC  10. R CGC  10.
     L CTA   0. P CCA   0. Q CAA   8. R CGA   7.
     L CTG  21. P CCG  16. Q CAG  18. R CGG   6.
     ===========================================
     I ATT  19. T ACT  13. N AAT  16. S AGT   2.
     I ATC  11. T ACC  17. N AAC  12. S AGC  15.
     I ATA   3. T ACA   3. K AAA  22. R AGA   1.
     M ATG  15. T ACG   3. K AAG  15. R AGG   1.
     ===========================================
     V GTT  15. A GCT  21. D GAT  14. G GGT  10.
     V GTC   7. A GCC  16. D GAC  20. G GGC  13.
     V GTA   8. A GCA   8. E GAA  26. G GGA  14.
     V GTG  11. A GCG   8. E GAG  13. G GGG   9.
     ===========================================
Span length  21 expected mean values:   4.8  -5.7  -4.8
Span length  31 expected mean values:   7.1  -8.4  -7.2
Span length  41 expected mean values:   9.5 -11.1  -9.5
? odd span length (11-101) (25) =41
? plot interval (1-11) (5) =

 Missing graphics display here

.end lit

.left margin1
@43. TX 6 @ Positional base preference method.
.LEFT MARGIN2
.para
Used to find protein coding regions. For each window-length segment of
the sequence the routine measures the closeness to an expected pattern of
base frequencies. Results are plotted for each of the three reading
frames. Stop and start codons are also marked on the plots.  The method
is particularly useful for showing which reading frame is the most likely
to be coding. The latest version is described in a forthcoming issue of
Methods in Enzymology, but the original ideas were given in
Staden, R. Nucl. Acids Res. 12 551-567 (1984).
.para
If dialogue is requested the following inputs are needed, otherwise the
standard analysis is performed. Choose between a "global" standard, or a
selected one. If the global standard is selected the
expected scores are displayed and the user asked to define a span length
and a plot interval. Then users choose between plotting relative or
absolute scores, and can reset the scaling values employed for plotting.
If the global standard is not selected users must define a region of the
sequence to use as a standard, or they can read in a codon table from which
the
program will calculate one. Then they can either use the values
observed in this standard, or combine its values for the third
positions in codons with those from the global standard. Next they can
give different weightings to each of the three positions in codons.
.para
In its original form the method took advantage of the uneven
use of amino acids by proteins and the structure of the genetic code table
and assumed that there is a typical ("global") amino acid composition
and no codon preference. The typical amino acid composition is the
average composition found by Argos (see below).
This composition, together with the absence of codon preference,
determines the frequency of each of the four bases in each of the three
codon positions. This 3x4 frequency table shows unequal use of the bases
and in particular a marked use of G in position 1 and of A in position 2
(at the expense of G). The routine slides a window along the sequence and
calculates a score for each of the three reading
frames at each window position. It assumes the sequence is coding
throughout its whole length and calculates the probability that it is
coding in each of the three frames.
When tested against all the E. coli sequences in the EMBL sequence
library
it correctly identified the coding frame for 91% of window positions.
(The E. coli
sequences were chosen only for technical reasons: I have no reason to
think
the method would work less well on other organisms with roughly even
base composition.)
The routine can plot either absolute or relative values: the absolute
values are the values found by summing the scores for each frame (say
p1, p2 and p3), and the relative values are then p1/(p1+p2+p3),
p2/(p1+p2+p3) and p3/(p1+p2+p3).
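.para
The relation between absolute and relative values can be sketched in
Python as below. This is only an illustration: the per-base score is
assumed to be the entry of the 3x4 table for that base and codon
position, which may well differ from the scoring function the program
actually uses, and the names are hypothetical. The example table holds
the values printed in the first dialogue below.
.lit
```python
# Example 3x4 table: one dict per codon position (1, 2, 3).
TABLE = [
    {"T": 0.125, "C": 0.249, "A": 0.230, "G": 0.397},
    {"T": 0.298, "C": 0.245, "A": 0.292, "G": 0.165},
    {"T": 0.288, "C": 0.313, "A": 0.169, "G": 0.230},
]


def frame_scores(window, table=TABLE):
    """Return (absolute, relative) scores for frames 1, 2 and 3.

    In frame f (0, 1 or 2) the base at window index i is taken to
    occupy codon position (i - f) mod 3.
    """
    window = window.upper()
    absolute = []
    for frame in range(3):
        # ASSUMPTION: a base scores its table frequency; sum over window
        absolute.append(sum(table[(i - frame) % 3][base]
                            for i, base in enumerate(window)))
    total = sum(absolute)
    relative = [p / total for p in absolute]  # p1/(p1+p2+p3), etc.
    return absolute, relative
```
.end lit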
.para
At each position along the sequence for which values are plotted, the
program finds which of the three values is highest and places a
single point at the 50% level for the corresponding frame. These single
points will join to form a solid line if one frame is consistently the
highest scoring. In addition stop codons are shown as short vertical lines
that bisect the 50%
level of probability. When looking for coding regions
the user should look for solid horizontal lines at the
50% level that are not interrupted by these short vertical lines.

The absolute mean
values expected on the complement of
the coding strand (and in the same frame)
are 5% lower than those on the coding strand but the relative values
are the same on both strands. Although the
relative values give smoother plots and tend to emphasize the coding
frame, they cannot therefore be used to decide which strand is coding.
The absolute values plot should be used for this purpose, bearing in mind
that the differences between strands are quite small.
.para
The method has been improved in two overall ways: first it now allows
users to
define their own typical amino acid composition by selecting a standard
sequence from within the sequence they are analysing or from a codon table;
secondly it allows the inclusion of third position preferences.
Again these third position preferences are defined by the use of an
internal standard sequence. Not only can users define their own standards
but they can also give weights to each of the three positions in codons.
This allows different emphasis to be used for each of the three positions.
As an example of its use, by giving, in turn, weights of 1.0, 0.0, 0.0, and
0.0, 1.0, 0.0, and finally 0.0, 0.0, 1.0, you could see the separate
contribution made by each of the three positions. It is also possible to
use the third position preferences with the values for the first two
positions taken from the "global"  amino acid composition.
 In all cases users may choose to plot
absolute or relative values. The expected scores are displayed before
each
analysis and scales are drawn on the plots.
At present this method does not give probabilities of coding; it has only
been tested for its ability to choose the correct reading frame (see
above). It could be used to give probabilities of coding if it were
applied to
all known coding and non-coding sequences in the way that the uneven
positional base frequencies method was. It is designed to be used in
conjunction with this method. Note that the average amino acid composition
used to derive the base frequencies was changed on 17-11-1988, to be
the new average given by McCaldon and Argos in Proteins 4 99-122
(1988).
A further change is to allow users to select their own scales for
producing the plots. It can be helpful if they want to emphasise or
diminish
certain features.
.para
Typical dialogue follows.
.lit
? Menu or option number=D43
Positional base preferences method to find protein genes
Select standard source
X  1 Use global standard
   2 Use internal standard
   3 Use codon usage table
? Selection  (1-3) (1) =2
Define region for standard
? start (0-8134) (0) =3171
? end (3172-8134) (8134) =4700
Select normalisation
X  1 Use observed frequencies
   2 Combine with global standard
? Selection  (1-2) (1) =1
          T      C      A      G      Range
      1  0.125  0.249  0.230  0.397  0.272
      2  0.298  0.245  0.292  0.165  0.132
      3  0.288  0.313  0.169  0.230  0.144
? (y/n) (y) Use 1.0 for positional weights
Give weights between 0.0 and 1.0
to each of the 3 codon positions
? Position 1 (0.00-1.00) (1.00) =
? Position 2 (0.00-1.00) (1.00) =
? Position 3 (0.00-1.00) (1.00) =
Expected scores per codon in each frame
       0.136     0.122     0.123
? odd span length (31-101) (67) =
? plot interval (1-11) (5) =
? (y/n) (y) Plot relative scores
Scaling values:
   Minimum  maximum    range
    0.3121   0.3656   0.0382
? (y/n) (y) Leave scaling values unchanged

  Graphics not shown

? Menu or option number=D43
Positional base preferences method to find protein genes
Select standard source
X  1 Use global standard
   2 Use internal standard
   3 Use codon usage table
? Selection  (1-3) (1) =3
? File name of standard=atpase.cods
     ===========================================
     F TTT  21. S TCT  33. Y TAT  15. C TGT   5.
     F TTC  55. S TCC  40. Y TAC  40. C TGC   4.
     L TTA   8. S TCA   7. * TAA   8. * TGA   0.
     L TTG  19. S TCG  12. * TAG   1. W TGG  17.
     ===========================================
     L CTT  22. P CCT  17. H CAT   6. R CGT  73.
     L CTC  21. P CCC   4. H CAC  30. R CGC  23.
     L CTA   1. P CCA  10. Q CAA  19. R CGA   5.
     L CTG 168. P CCG  48. Q CAG  80. R CGG   3.
     ===========================================
     I ATT  47. T ACT  14. N AAT  17. S AGT   8.
     I ATC  98. T ACC  54. N AAC  52. S AGC  26.
     I ATA   6. T ACA   7. K AAA  85. R AGA   0.
     M ATG  75. T ACG  13. K AAG  28. R AGG   0.
     ===========================================
     V GTT  67. A GCT  56. D GAT  41. G GGT  90.
     V GTC  29. A GCC  53. D GAC  66. G GGC  66.
     V GTA  49. A GCA  59. E GAA 101. G GGA   5.
     V GTG  57. A GCG  64. E GAG  41. G GGG   8.
     ===========================================
Select normalisation
X  1 Use observed frequencies
   2 Combine with global standard
? Selection  (1-2) (1) =2
          T      C      A      G      Range
      1  0.177  0.211  0.277  0.336  0.159
      2  0.271  0.238  0.310  0.182  0.128
      3  0.242  0.301  0.168  0.289  0.132
? (y/n) (y) Use 1.0 for positional weights
Expected scores per codon in each frame
       0.785     0.736     0.736
? odd span length (31-101) (67) =
? plot interval (1-11) (5) =
? (y/n) (y) Plot relative scores
Scaling values:
   Minimum  maximum    range
    0.3219   0.3519   0.0214
? (y/n) (y) Leave scaling values unchanged

  Graphics not shown
.end lit
.left margIN1
@44. TX 6 @ Uneven positional base frequencies.
.LEFT MARGIN2
.para
Used to find regions of a sequence that might be coding for a protein. The
method looks for sections of the sequence in which the frequency at
which each of the four bases occupies the three positions in codons is
nonrandom. The level of nonrandomness is plotted on a scale that shows
the probability that the sequence is coding. At each position along a
sequence the calculation gives the same value for all six possible reading
frames, so only one value is plotted.
.para
Define the window length and plot interval.
.para
The results are plotted in a box divided by a horizontal line marked "76%".
76% of coding regions achieve values above this line and 76% of
noncoding regions achieve scores below the line.
.para
This method, first described in Staden, R. Nucl. Acids Res. 12 551-567
(1984),
looks for uneven positional
usage of bases in codons.
It looks through the sequence in one fixed
phase and counts the number of times each base appears in each of the
three
codon positions: for each window position it counts A1,A2,A3 and
C1,C2,C3
and G1,G2,G3 and T1,T2,T3 and calculates AMEAN=(A1+A2+A3)/3, and
similarly
CMEAN, GMEAN
and TMEAN; it then calculates
ADIF=abs(A1-AMEAN)+abs(A2-AMEAN)+abs(A3-AMEAN) and similarly
CDIF, GDIF and
TDIF to measure the differences between an even base usage for all
positions in the codons and the observed usage. The routine then
calculates
the sum ADIF+CDIF+GDIF+TDIF and plots this value on the following scale:
the base level is such that no known window in a coding region has a
lower
value, whereas 14% of windows in noncoding sequences score below it.
The
top of the scale is not achieved by any known noncoding
region, but is reached by 16% of known coding regions.
The bar drawn across the
plot corresponds to a level that is exceeded by 76% of windows in known
coding regions
but is reached by only 24% of windows in known noncoding regions, i.e.
76% of
coding windows score above and 76% of noncoding windows score below.
This is similar to Fickett's method but without
the probabilities and weightings from the Los Alamos sequence library: it
is therefore unbiased but may well give very similar results.
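.para
The sum ADIF+CDIF+GDIF+TDIF for one window can be computed as in the
Python sketch below. It is an illustration, not the program's code:
the function name is hypothetical, and the conversion of the raw sum
to the plotted probability scale is not shown because the scale values
are not given here.
.lit
```python
def positional_unevenness(window):
    """Return ADIF+CDIF+GDIF+TDIF for one window read in a fixed phase."""
    window = window.upper()
    total = 0.0
    for base in "ACGT":
        # counts[k] = times this base occupies codon position k+1
        counts = [0, 0, 0]
        for i, b in enumerate(window):
            if b == base:
                counts[i % 3] += 1
        mean = sum(counts) / 3  # e.g. AMEAN=(A1+A2+A3)/3
        total += sum(abs(c - mean) for c in counts)  # e.g. ADIF
    return total
```
.end lit
For the window ACGACGACG each of A, C and G occurs three times in a
single codon position, so ADIF=CDIF=GDIF=4, TDIF=0 and the sum is 12;
for AAAAAA the usage is even and the sum is 0.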
.left margin1
@45. TX 6 @ Codon improbability on base composition
.LEFT MARGIN2
.para
Used to find regions of a sequence that might code for a protein.
.para
If dialogue is requested define a window length and plot interval.
.para
The idea of the method is that, of all the sequence features
that we know, only
coding regions will give rise to codon biases well above those
expected
from the base composition.
If a region of sequence shows sufficiently strong
codon bias then we conclude that it is coding for a protein.
 Using the multinomial distribution we
have derived a function to measure the improbability of observing a
set of codons from a sequence of the given composition. Using the
Poisson
distribution we have worked out the distribution
of the improbability. The program plots the observed improbability minus
the expected improbability (the mean as calculated from the Poisson
distribution). The plots are presented against a scale of units of standard
deviation as measured from the Poisson distribution. As with the other
Staden and McLachlan method the program puts an extra point at a fixed
level for the highest of the three probabilities; for this function this
point is placed at six standard deviations above the mean expected level.
The top of each plot corresponds to 12 standard deviations above the
expected level and the bottom corresponds to the expected value.
.para
Analysis of the application
of the method to the EMBL sequence library indicates that the method
does
work for most sequences and that the levels of improbability roughly
correlate with levels of expression.
Coding regions will show high peaks in all three frames making
interpretation more difficult than for some of the other methods.
.left margin1
@46. TX 6 @ Codon improbability on amino acid composition
.LEFT MARGIN2
.para
Used to find regions of a sequence that might code for a protein.
.para
If dialogue is requested define a window length and a plot interval.
.para
The idea of the method is that, of all the sequence features
that we know, only
coding regions will give rise to codon biases such that, for each
amino acid, some codons are used far more frequently than others. The
method is independent of what the bias actually is, requiring only that it
is present.
If a region of sequence shows sufficiently strong
codon bias then we conclude that it is coding for a protein.
 Using the multinomial distribution we
have derived a function to measure the improbability of observing a
set of codons from a sequence of the given composition. Using the
Poisson
distribution we have worked out the distribution
of the improbability. The program plots the observed improbability minus
the expected improbability (the mean as calculated from the Poisson
distribution). The plots are presented against a scale of units of standard
deviation as measured from the Poisson distribution. As with the other
Staden and McLachlan method the program puts an extra point at a fixed
level for the highest of the three probabilities; for this function this
point is placed at six standard deviations above the mean expected level.
The top of each plot corresponds to 12 standard deviations above the
expected level and the bottom corresponds to the expected value.
.left margin1
@47. TX 6 @ Shepherd RNY preference method
.LEFT MARGIN2
.para
Used to find regions of a sequence that might code for a protein. Based on
the method of Shepherd
(PNAS 78 1596-1600, 1981).
.para
If dialogue is requested define a window length and plot interval.
.para
Shepherd has found that
many genes have a preference for the use of codons of the form RNY
where
R=purine, Y=pyrimidine and N=any base. He attributes this to
remnants of a primitive genetic code. The calculation is similar to that
for the Staden and McLachlan method, the p1's being simply the number of
RNY codons found in frame 1 etc and the P's being p/(p1+p2+p3).
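.para
The calculation described above can be sketched as follows. This is an
illustration only, not the program's own code: the function name is
hypothetical, the window is assumed to hold only upper case A, C, G and T,
and a window with no RNY codons at all is arbitrarily given zero in each
frame.
.lit
```python
def rny_frame_preferences(window):
    """Count RNY codons in each of the three frames of a window and
    return the counts divided by their sum: P = p/(p1+p2+p3)."""
    purines, pyrimidines = set("AG"), set("CT")
    counts = []
    for frame in range(3):
        n = 0
        # step through complete codons of this frame
        for i in range(frame, len(window) - 2, 3):
            if window[i] in purines and window[i + 2] in pyrimidines:
                n += 1
        counts.append(n)
    total = sum(counts)
    if total == 0:
        return [0.0, 0.0, 0.0]  # assumed convention for an empty count
    return [c / total for c in counts]
```
.end lit
.para
For example the window GCTGCTGCT contains three RNY codons in frame 1 and
none in frames 2 and 3.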
.left margIN1
@48. TX 6 @ Ficketts method
.LEFT MARGIN2
.para
Used to find regions of a sequence that might code for a protein. Based on
the method of Fickett
(Nucl. Acid Res. 10,
1982), but plots values for fixed window lengths rather than over the
whole of open reading frames.
.para
If dialogue is requested define a window length and plot interval. The
results are plotted in a box divided into three horizontal strips.
.para
Sections of the sequence with values plotted in the top strip of the box
are adjudged to be coding, those in the middle strip "no decision", and
those in the bottom "not coding".
.para
The program performs the following calculations: let A1 = the number of
occurrences of base A in position 1 of codons, A2 for position 2 etc.
Similarly for bases C,G and T. For each window position calculate
Apos=max(A1,A2,A3)/(min(A1,A2,A3)+1). Similarly for C,G and T to give 4
positional values. Also count the base composition for the window to
give Acomp, Ccomp etc. Fickett tested each of these 8 parameters singly as
to their ability to distinguish coding from noncoding regions and arrived
at probabilities of coding for the range of values each can take = Pcod. He
also measured their relative abilities and gave weightings to each of
the 8 parameters = Pw. To calculate the "TESTCODE" for a window we
first look up the Pcod for each of the calculated compositional and
positional values and then calculate TESTCODE=sum(Pcod*Pw). TESTCODE
is plotted relative to three levels of decision: the top division="coding",
the middle="no opinion" and the bottom division="non coding".
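.para
The eight window parameters can be sketched as below. This is only an
illustration: the function name is hypothetical, the positional value is
taken as max/(min+1), and the composition is expressed as a fraction of
the window. The Pcod and Pw lookup tables come from Fickett's paper and
are not reproduced here, so the final sum is left out.
.lit
```python
def fickett_window_parameters(window):
    """Return (position, composition) dictionaries for one window.
    Codon position 1 is assumed to be the first base of the window."""
    window = window.upper()
    if not window:
        raise ValueError("window must not be empty")
    position, composition = {}, {}
    for base in "ACGT":
        # counts of this base at codon positions 1, 2 and 3
        per_pos = [window[k::3].count(base) for k in range(3)]
        position[base] = max(per_pos) / (min(per_pos) + 1)
        composition[base] = sum(per_pos) / len(window)
    return position, composition
```
.end lit
.para
For the window ATGATGATG, base A occurs three times in position 1 and
never in positions 2 and 3, so Apos is 3/(0+1)=3.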
.left margin1
@49. TX 6 @ tRNA gene search.
.LEFT MARGIN2
.para
Used to find segments of a sequence that might code for tRNAs. Looks for
potential cloverleaf forming structures and then for the presence of
expected conserved bases. Presents results graphically or draws out the
cloverleafs.
.para
If dialogue is requested a large number of parameters need to be given
values, including some loop lengths, scores for each of the four stems,
and scores for the conserved bases.
.para
The program was first described in
Staden Nucl. Acid Res 817-825 (1980).
.para
The tRNA's that have been sequenced so far have two characteristics that
can be used to locate their genes within long DNA sequences. Firstly they
have a common secondary structure - the cloverleaf - and secondly,
particular bases almost always appear at certain positions in the
cloverleaf. The cloverleaf is composed of four base-paired stems and four
loops. Three of the stems are of fixed length but the fourth, the dhu stem
which usually has four base pairs, sometimes has only three. All of the
loops can vary in size. The following relationships between the stems in
the cloverleaf are assumed in the program: (a) there are no bases between
one end of the aminoacyl stem and the adjoining tuc stem; (b) there are
two bases between the aminoacyl stem and the dhu stem; (c) there is one
base between the dhu stem and the anticodon stem; (d) there are at least
three bases between the anticodon stem and the tuc stem.
.para
The program looks first for cloverleaf structure and then, if required,
for conserved bases. The sizes of the loops, the number of basepairs in
the stems and the required conserved bases may all be specified by the
user. The process of looking for the presence of conserved bases can
reduce the number of potential structures found considerably. The user
may also specify that an intron may be present in the anticodon loop.
.para
The user may define a minimum number of
base pairs for each stem using the scoring system G-C, A-T=2 and G-T=1,
and scores for each of the conserved bases. Recommended values for the stem
scores are given by the prompts, and the percentage conservation of the
conserved bases, as found in the paper of Gauss, Gruter and Sprinzl
(Nucl. Acid Res. 1979), is also given,
but the user must decide which bases are most
likely to be conserved for the sequence being examined.
The output shows the position of the possible gene in the sequence by a
vertical line, the height of which shows the number of basepairs made in
the stems. The cloverleaf structure is also drawn but will scroll up off
the screen. Output of the cloverleafs will look like:
.lit

       6942
                    A
                  A-U
                  A-U
                  G-C
                  A-U
                  U-A
                  A-U
                  U-A      AAU
                  U   UAUCU
          AA    A    !!!!!
            AAUG     AUAGA   A
         U  !!!!     U    UCA
         C  UUAC      U
          AA    A
                 U-AA A
                 A-U
                 A-U
                 C-G
                 U-A
                U   A
                U   A
                 GUC

 Typical dialogue follows.

? Menu or option number=D49
 tRNA search
? Maximum trna length (70-130) (92) =
? Aminoacyl stem score (0-14) (11) =
? Tu stem score (0-10) (8) =
? Anticodon stem score (0-10) (8) =
? D stem score (0-8) (3) =
? Minimum base pairing total (30-32) (32) =
? Minimum intron length (0-30) (0) =
? Minimum length for TU loop (4-12) (6) =
? Maximum length for TU loop (6-12) (9) =
? (y/n) (y) Skip search for conserved bases n
Give a score for each base, then a minimum total at the end
? Base  8, T is 100% conserved. Score (0-100) (0) =
? Base 10, G is  95% conserved. Score (0-100) (0) =
? Base 11, Y is  96% conserved. Score (0-100) (0) =
? Base 14, A is 100% conserved. Score (0-100) (0) =
? Base 15, R is 100% conserved. Score (0-100) (0) =
? Base 21, A is  97% conserved. Score (0-100) (0) =
? Base 32, Y is 100% conserved. Score (0-100) (0) =
? Base 33, T is  98% conserved. Score (0-100) (0) =
? Base 37, A is  91% conserved. Score (0-100) (0) =
? Base 48, Y is 100% conserved. Score (0-100) (0) =
? Base 53, G is 100% conserved. Score (0-100) (0) =
? Base 54, T is  95% conserved. Score (0-100) (0) =
? Base 55, T is  97% conserved. Score (0-100) (0) =
? Base 56, C is 100% conserved. Score (0-100) (0) =
? Base 57, R is 100% conserved. Score (0-100) (0) =
? Base 58, A is 100% conserved. Score (0-100) (0) =
? Base 60, Y is  92% conserved. Score (0-100) (0) =
? Base 61, C is 100% conserved. Score (0-100) (0) =
? Minimum total conserved base score (0-0) (0) =
? (y/n) (y) Plot results n

 Searching

       306
                   C
                 C-G
                 C-G
                 G-C
                 T-A
                 C-G
                 A-T
                 T+G     AT
                A   ATACA
        TTC    T    !!!!   G
           CTGT     TATGG  G
       G    ! !     T    GA
       C   TAAA      C
        GCG    C      G
                T+GA   C
                C-G C   T
                T+G  A   T
                T-A   G   T
                T-A    G   A
               G   G    G   C
               A   A     G   A
                AGC       T   C
                           A   T
                            C   T
                             A
                              C T


.end lit
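.para
The stem scoring system described above (G-C and A-T pairs score 2, G-T
pairs score 1) can be sketched as follows. The function and table names
are hypothetical, both strands are assumed to be written 5' to 3', and U
is treated as T; this is an illustration rather than the program's code.
.lit
```python
PAIR_SCORES = {
    ("G", "C"): 2, ("C", "G"): 2,
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}

def stem_score(five_prime, three_prime):
    """Score a stem. Both strands are given 5'->3', so the first base
    of one strand pairs with the last base of the other."""
    a = five_prime.upper().replace("U", "T")
    b = three_prime.upper().replace("U", "T")
    # any pairing not in the table scores zero
    return sum(PAIR_SCORES.get(pair, 0) for pair in zip(a, reversed(b)))
```
.end lit
.para
The aminoacyl stem drawn at position 306 above has six Watson-Crick pairs
and one T+G pair, so stem_score("CCGTCAT", "GTGACGG") gives 13 out of a
possible 14.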
.left margIN1
@50. TX 7 @ Plot start codons
.left margin2
.para
This function plots the positions of all start codons for each of the three
reading frames.
.left margin1
@51. TX 7 @ Plot stop codons
.left margin2
.para
This function plots the positions of all stop codons for each of the three
reading frames.
.left margIN1
@52. TX 7 @ Plot stop codons on the complementary strand
.left margin2
.para
This function plots the positions of all stop codons for each of the three
reading frames on the complementary strand.
.left margin1
@53. TX 7 @ Plot stop codons on both strands
.left margin2
.para
This function plots the positions of all stop codons for each of the three
reading frames on both strands.
.left margin1
@54. TX 5 @ Search for longest open reading frames
.left margin2
.para
This function will report the positions of the ends of
all sections of sequence that contain no stop codons. All six reading
frames are examined. Results are presented in the form of an EMBL feature
table. Hence if the results are stored in a file by use of "direct output
to disk", the file can be used to translate the
open reading frames in a sequence.
Note that in order for the file to be used as a feature table it
must include either EMBL
or GenBank headers, and a suitable "tail". The simplest header is the word
FEATURES starting in column 1 of the first line of the file. The simplest
tail is 2 empty lines at the end of the file. These lines are not included
when nip writes out results in feature table format.
.para
Define the minimum length of open reading frame to report (in amino
acids).
Choose to search either or both strands. The program displays the end
points, the reading frame and strand.
.para
Typical dialogue follows.
.lit

? Menu or option number=D54
 Find open reading frames
? Minimum open frame in amino acids (5-1000) (30) =100

X 1 + strand only
  2 - strand only
  3 Both strands
? 0,1,2,3 =3

FT   CDS           1    831       1    831
FT   CDS        1540   2853       1   1314
FT   CDS        3130   4242       1   1113
FT   CDS        5761   6114       1    354
FT   CDS        6187   6711       1    525
FT   CDS        1766   2077       2    312
FT   CDS        2078   2446       2    369
FT   CDS        4136   5500       2   1365
FT   CDS        1335   1637       3    303
FT   CDS        2844   3194       3    351
FT   CDS        6819   7238       3    420
FT   CDS        2073   1711  C    1    363
FT   CDS        2469   2149  C    1    321
FT   CDS        6542   6144  C    3    399

.end lit
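.para
The search for stop-free sections can be sketched as below for the +
strand only. This is a minimal illustration under stated assumptions, not
the program's algorithm: the names are hypothetical, the standard stop
codons TAA, TAG and TGA are assumed, the reported end excludes the stop
codon, and the frame is numbered from the start position (which agrees
with the + strand lines of the listing above, e.g. 1540 is in frame 1 and
1766 in frame 2).
.lit
```python
STOPS = {"TAA", "TAG", "TGA"}

def open_frames(seq, min_aa=30):
    """Return (start, end, frame) for stop-free stretches on the + strand.
    Positions are 1-based and inclusive; frame is 1, 2 or 3."""
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = frame
        # index just past the last complete codon of this frame
        last = frame + 3 * ((len(seq) - frame) // 3)
        for i in range(frame, last + 1, 3):
            at_end = i >= last
            if at_end or seq[i:i + 3] in STOPS:
                # length of the open stretch in codons
                if (i - start) // 3 >= min_aa:
                    found.append((start + 1, i, frame + 1))
                start = i + 3
    return found
```
.end lit
.para
For the - strand the same function could be applied to the reversed and
complemented sequence, with the positions mapped back afterwards.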
.left margin1
@55. TX 8 @ Search for E. coli promoter (general)
.LEFT MARGIN2
.para
Searches for E. coli promoter-like sequences using a standard weight
matrix. The positions of the matches are plotted. No dialogue is required.
.para
The method was first described in
Staden R. Nucl. Acid Res. 12 505-519 1984.
This search uses a weight matrix taken from the frequency tables
contained in Hawley, D. K. and McClure, W. R., Nucl. Acid Res. 11
2237-2255 (1983).
The weight matrix is
divided into 3 sections that are separated by varying sizes of gap: the
-35 region, the -10 region and the +1 region.
The algorithm first looks for a sufficiently good -35 region, then for the
best -10 region within range and then for the best +1 region within range
of the -10; each separate region must score above the lowest known
score for the corresponding section. The gap penalty is then applied and
two plots are produced: one with gap penalties, one without.
Scaling is such that no known promoter scores below the bottom level and
no known promoter scores above the top level when the weight matrix is
applied.
.para
Two other functions also look for E. coli promoters: 56 looks for sites on
the complementary strand and 57 looks for individual -35 and -10
regions and plots them on a scale such that the top is the highest known
value +10% and the bottom is the lowest known value -10%.
.LEFT MARGIN1
.lit
weights for E. coli promoters
-35 region:
P -50-49-48-47-46-45-44-43-42-41-40-39-38-37-36-35-34-33-32-31-30-29-28-27-26
  107109109110110110110110110111111110111112112112112112112112112112112112112
T  41 33 32 25 34 22 35 35 42 27 32 42 47 14 92 94 11 19 15 37 46 34 38 48 34
C  22 27 18 29 20 14 20 12 22 23 16 25 10 43  7  6 11 18 60  8 25 23 23 17 20
A  28 38 30 37 35 56 42 42 37 42 39 18 25 26  2  6  2 72 26 50 26 34 25 26 31
G  16 11 29 19 21 18 13 21  9 19 24 26 29 29 11  6 88  3 11 17 15 21 26 21 27
-10 region:
P -23-22-21-20-19-18-17-16-15-14-13-12-11-10 -9 -8 -7 -6 -5
  112112112112112112112112112112112112112112112112112112112
T  35 28 28 27 39 51 34 43 26 31 89  3 49 15 19108 31 29 21
C  34 21 24 27 12 25 20 25 20 27 10  2 16 14 22  3 13 16 30
A  20 39 33 33 39 23 29 16 23 19  2106 29 66 57  1 35 23 31
G  23 24 27 25 22 13 29 28 43 35 11  1 18 17 14  0 33 24 30
+ region:
P -2 -1  1  2  3  4  5  6  7  8  9 10
  86 88 85 88 88 88 88 88 88 88 88 88
T 16 22  2 42 27 23 20 25 27 15 16 29
C 29 49  4 25 25 13 18 22 17 17 16 17
A 20  9 45 16 24 25 28 24 24 32 35 26
G 21  8 37  5 12 27 22 17 20 24 21 16
.end lit
Notes:
E. coli promoters have been shown to contain 2 regions of conserved
sequence located about 10 and 35 bases upstream of the transcription
startsite. These are TATAAT and TTGACA with an allowed spacing of 15 to 21
bases between. The spacing with maximum efficiency was 17 bases and all
but 12 of the 112 sequences could be aligned with a separation of 17 +/-1
bases. The standard promoter has spacing 7 and 17 bases between the
startsite and the -10 region, and the -10 and -35 regions, respectively.
The spacing between the -10 region and the startsite is usually 6 or 7
bases but varies between 4 and 8 bases.
There is an AT rich region of 8 to 10 bases upstream of the -35 region.
Initiation with a purine is highly preferred, with G being used if A is
not present.
.lit
Gap penalties:
	15 0.02   (only exists as mutant)
	16 0.2
	17 1.0
	18 0.2
	19 0.05   (guess)
	20 0.02   (guess)
	21 0.01   (guess)
.end lit
.left margin1
@56. TX  8 @ Search for E. coli promoter (general)
strand
.LEFT MARGIN2
.para
This function searches for E. coli promoters on the complementary strand
of the sequence. See the notes on option 55.
.left margin1
@57. TX 8 @ Search for E. coli promoter sequences. (-35 and -10)
.LEFT MARGIN2
.para
This function searches separately for the -35 and -10 sequences of an E.
coli promoter. See the notes on option 55.
.left margIN1
@58. TX 8 @ Search for procaryotic ribosome binding sites
.LEFT MARGIN2
.para
This function searches for the 5' ends of prokaryotic genes using an
unusual weight matrix. The search is relatively slow because the matrix
is 101 bases in length. No dialogue is required.
.para
The method was first described in
Staden Nucl. Acid Res. 12 505-519 1984. The search actually looks for
more than a ribosome binding site, as is explained below. It uses the
weight matrix w101 of Stormo and
Schneider (NAR 10 2971-3024, 1982),
which with a cutoff value of 2 finds all gene starts in their library.
.LEFT MARGIN1
.lit
 P-60-59-58-57-56-55-54-53-52-51-50-49-48-47-46-45-44-43-42-41-40-39-38-37-36
 T  5  1 -3  9-14  7 15 -5  3-16-17  4 18  5 -3 -1  2  4  5 -5  7  8 -5-15  6
 C-21 -6-11-21  0  8 -7-12 -1  1  0-19 12 -3 -1 10  2 -8 -5-11  8  1 23  6 -5
 A  7 -2 13 -2 -8-13-18  5  0 -5 13  8-15  9 -4 -7  9  0 -8-11-10 -6 -7 -5 -6
 G -6 -9 -7  0  8-16 -4 -2-16  1 -4  8-14  5 11-13-24  3  7 22-11 -9-15 10 -4

 P-35-34-33-32-31-30-29-28-27-26-25-24-23-22-21-20-19-18-17-16-15-14-13-12-11
 T  3  4 16 -4  7 11 -4 -1 12  8 10 -1  1  8  2-10-16 11  1 -3 16 -3-36 -8-27
 C  2-14 -3 -8-10-21  2  0 -2 -1-11 -3 -1  5-11 -4  7  0-14  6 -8-20 -7-36-44
 A-12 -1-27 -3 -6  0-12 -3 -4 -7 14 -2 -4 -6  0 12  5 -9  0-11-11 10  8  2  8
 G  4 -5 -6 -3 -1 -4 -1 -4-15  0-14  3 10-19 -3-10 -7 -7  7  1 -8 -6 15 21 42

 P-10 -9 -8 -7 -6 -5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
 T-53-27-26-23  2 -7-14-40-28  0-53 75-62-20-40-10-35 -5-12 -1  4 14-23  7 -2
 C-15-50-43-35-38-29-29  1 -9  1-87-55-64-45 11-22-14-20-15-15-10-22 -5  2  6
 A  0 -3 -5  4-20-11  5  6 -2-15 66-69-52 -5 -4  6  8-24 -7-10 -7 13 14 -9-18
 G 35 22 16 -6 -5-15-25-33-28-53-36-50107 -5-37-44-27-15-23-16-29-47-17-29-15

 P 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
 T-26  1  4 -7  3 -4  0-10  8-18  7-22-21  8  4 -3 -6  7 -8  1 -5-16-16  7 -6
 C  6 -8 19 -7  9 -3 17 -2  3 -9  5 22 22  8 -1  1 18  6 11-10 -8  7 10  0  7
 A 14-12-42  1 -5 -4-32 12-10 20 -6 -1  3 -4  4-10 -1 -2-14 11 14 -3  2-13  5
 G-23 -7 -1 -6-17 -4  0-15-14 -4-17-10 -5-13 -8 10-13-13  9 -4 -3 10  2  4 -8

 P 40
 T  0
 C 14
 A  5
 G-21
.END LIT
These come from w101 of Stormo, Schneider, Gold and Ehrenfeucht Nucl.
Acid Res. 10 2997-3011, 1982. They report that this matrix gives a score
of at least 2 for all
gene starts in their library whereas all other sequences score 1 or less.
.left margin1
@59. TX 1 @ Reverse and complement the sequence
.LEFT MARGIN2
.para
Reverses and complements the current active region of the sequence.
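.para
The operation can be sketched as below. This is an illustration only: the
function name is hypothetical and only the four DNA bases, in either
case, are handled; other symbols are passed through unchanged.
.lit
```python
# translation table mapping each base to its complement
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the order of the bases."""
    return seq.translate(_COMPLEMENT)[::-1]
```
.end lit
.para
Applying the function twice returns the original sequence.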
.left margin1
@60. TX 7 @ Search using a dinucleotide weight matrix
.LEFT MARGIN2
.para
This function performs searches for short sequence
motifs using an appropriate  dinucleotide weight matrix. In addition it
can be used to create or modify weight matrices. In order to perform a
search the only input
required is the name of the file containing the weight matrix.
The results can be presented graphically or listed. The graphical
presentation will draw a line at the position of any matches found; the
height of the line is proportional to the score. The method is identical to
that using weight matrices derived from nucleotide frequencies, except
that here we use the frequencies of dinucleotides.
.para
For a search, select "use weight matrix", supply the name of the file
containing the weight matrix, and choose between having results plotted
or listed. If dialogue is requested when the function is selected users can
alter the cutoff score employed.
.para
To create a weight matrix several steps are involved. A file containing an
alignment of known motifs is required. (This file must be created before
the current option is selected. The format is as follows: each sequence is
written on a separate line with at least one space at the beginning; each
sequence is terminated by a space character, and can be followed by a
name. The sequences must be aligned.) Supply the name of the file of
aligned sequences. The program reads and displays the sequences. Choose
between "summing logs of weights" or summing weights (i.e. whether to
multiply or add weights). If logs are used all scores will be negative.
Choose if all positions in the set of aligned sequences should be used or
if a mask should be applied. If so selected, define a mask as a string of
symbols, in which symbol - means ignore and any other symbol means
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
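.para
The way a mask string selects positions can be sketched as follows; the
function name is hypothetical and the sketch is an illustration of the
rule stated above rather than the program's code.
.lit
```python
def mask_positions(mask):
    """Return the 1-based positions selected by a mask string.
    The symbol - means ignore; any other symbol means use."""
    return [i for i, symbol in enumerate(mask, start=1) if symbol != "-"]
```
.end lit
.para
With this rule the mask xx-x--abc selects positions 1, 2, 4, 7, 8 and 9.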
.para
The program will calculate weights as the frequencies of the
dinucleotides at each unmasked position in the set of aligned sequences.
These weights are then applied to the set of aligned sequences to give a
range  of "observed" scores. The mean and standard deviation of these
scores is displayed. The user is asked to supply several values to be used
when the weight matrix is applied to other sequences: a cutoff score (by
default, the mean minus 3 standard deviations), a top score for scaling
graphical results (by default, the mean plus 3 standard deviations), and a
position to identify (this means that if a particular base within the
motif is used as a "landmark", such as the A of the AG in splice acceptor
sites, then its position will be marked in plots). All these values are
stored along with the weight matrix. Finally supply the name of a file to
contain the weight matrix.
.para
Weight matrices can be "rescaled" using a set of aligned sequences in
much the same way as a matrix is created. The purpose is to redefine
the cutoff scores, and rescaling does not alter any other values in the
weight matrix file.
.para
The methods have always had to deal with the problem of zeroes in the
matrices. The current versions
employ "Laplace's Law of Succession" in which 1 is
added to each term.
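.para
Creating and applying a dinucleotide weight matrix by summing weights can
be sketched as below. This is a sketch under several assumptions and will
not necessarily reproduce the program's scores: the names are
hypothetical, raw counts plus 1 are used as the weights, a dinucleotide is
used when the mask symbol at its first base is not -, and the sequences
are assumed to contain only upper case A, C, G and T.
.lit
```python
from itertools import product

DINUCS = ["".join(p) for p in product("ACGT", repeat=2)]

def make_dinucleotide_matrix(aligned, mask=None):
    """Count dinucleotides at each position of the aligned motifs,
    starting every count at 1 so that no weight is zero."""
    length = len(aligned[0]) - 1  # number of dinucleotide positions
    use = [mask is None or (pos < len(mask) and mask[pos] != "-")
           for pos in range(length)]
    matrix = []
    for pos in range(length):
        counts = {d: 1 for d in DINUCS}
        for seq in aligned:
            counts[seq[pos:pos + 2]] += 1
        # masked positions are stored as None and skipped when scoring
        matrix.append(counts if use[pos] else None)
    return matrix

def score(matrix, segment):
    """Sum the weights of the dinucleotides at the unmasked positions."""
    return sum(w[segment[i:i + 2]]
               for i, w in enumerate(matrix) if w is not None)
```
.end lit
.para
A search would apply score to every segment of the sequence and report
those positions that reach the cutoff score.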

.lit
Typical dialogue follows.

? Menu or option number=D60

 Motif search using dinucleotide weight matrix
X 1 Use weight matrix
  2 Make weight matrix
  3 Rescale weight matrix
? 0,1,2,3 = 2
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ


     1 AGCGTGACTCTTCCCGGAA HIS1
     2 GAGGTGACTCACTTGGAAG HIS1
     3 CGGATGACTCTTTTTTTTT HIS3
     4 ACAGTGACTCACGTTTTTT HIS4
     5 GTCGTGACTCATATGCTTT ARG3
     6 TGAATGACTCACTTTTTGG ARG4
     7 TTCTTGACTCGTCTTTTCT CPA1
     8 CGAATGACTCTTATTGATG CPA2
     9 AGAATGACTAATTTTACTA TRP5
    10 TCGTTGACTCATTCTAATC TRP3
    11 TTGCTGACTCATTACGATT TRP2
    12 GAGATGACTCTTTTTCTTT IV1
    13 GCGATGATTCATTTCTCTG IV2
    14 TAGATGACTCAGTTTAGTC LEU1
    15 TAAGTGACTCAGTTCTTTC LEU4
    16 ATGATGACTCTTAAGCATG ILS1
Length of motif    18
? (y/n) (y) Sum logs of weights n
? (y/n) (y) Use all motif positions n
x means use, - means ignore
e.g. xx-x---x-x means use positions 1,2,4,8,10
? Mask=----XXXXXXXX--------
 Applying weights to input sequences
   1       89.000 AGCGTGACTCTTCCCGGA
   2       91.000 GAGGTGACTCACTTGGAA
   3       93.000 CGGATGACTCTTTTTTTT
   4       90.000 ACAGTGACTCACGTTTTT
   5       94.000 GTCGTGACTCATATGCTT
   6       91.000 TGAATGACTCACTTTTTG
   7       81.000 TTCTTGACTCGTCTTTTC
   8       90.000 CGAATGACTCTTATTGAT
   9       75.000 AGAATGACTAATTTTACT
  10       97.000 TCGTTGACTCATTCTAAT
  11       97.000 TTGCTGACTCATTACGAT
  12       93.000 GAGATGACTCTTTTTCTT
  13       69.000 GCGATGATTCATTTCTCT
  14       90.000 TAGATGACTCAGTTTAGT
  15       90.000 TAAGTGACTCAGTTCTTT
  16       90.000 ATGATGACTCTTAAGCAT
Top score      97.000  Bottom score      69.000
Mean      88.750  Standard deviation       7.319
Mean minus 3.sd      66.794  Mean plus 3.sd     110.706
? Cutoff score (-999.00-9999.00) (66.79) =
? Top score for scaling plots (66.79-999.00) (110.71) =
? Position to identify (0-18) (1) =
? Title=GCN4 DI WTS
? Name for new weight matrix file=3.WTS

? Menu or option number=D60
 Motif search using dinucleotide weight matrix
X 1 Use weight matrix
  2 Make weight matrix
  3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=3.WTS
 GCN4 DI WTS
? Cutoff score (-9999.00-9999.00) (66.79) =40
? (y/n) (y) Plot results n
     15     42.00 CAACCCGCTCACCGACAA
     29     42.00 ACAACAGCTCACCCACGC
     93     46.00 AGCCTTCCTCATCGCTGC
    153     40.00 CAGCGGAATCAAACTTAA
    408     42.00 CGATGGATTCAAGTTGAA
    469     47.00 TTAGGAACTCCCTCTGTC
    493     60.00 AAGCTGAATCTTAGCAGC
    530     43.00 CGGAGGGCTCAGTGAGGG
    542     47.00 TGAGGGACTACTGCACCA
    678     41.00 CTTCTGCTTCAAAGAGTT
    709     47.00 AATATGACGGCGCACGTG
    848     54.00 GTCAGAACTCAAATCAGT
    940     49.00 CCGTTGACGACCTCCGCA
    992     42.00 TGGGCACCTCACACCAAG


.end lit
.left margIN1
@61. TX 8 @ Search for eukaryotic ribosome binding sites
.LEFT MARGIN2
.para
Searches for eukaryotic ribosome binding sites using weightings derived
from Sargan, Gregory and Butterworth (FEBS Lett. 147 133-136, 1982). No
dialogue is required. First described in Staden Nucl. Acid Res. 12
505-519 1984.

.LEFT MARGIN1
.lit
mRNA WTS FOR EUKARYOTES SARGAN,GREGORY,BUTTERWORTH FEBS LET
147 133-136 1982
P  -7 -6 -5 -4 -3 -2 -1  1  2  3
  102102102102102102102102102102
T  19 24 31 12  0 18  5  0102  0
C  20 15 32 65  5 42 52  0  0  0
A  50 27 27 19 86 36 34102  0  0
G   6 29 12  6 11  6 11  0  0102
VIRAL ONLY
P  -7 -6 -5 -4 -3 -2 -1  1  2  3
   41 41 41 41 41 41 41 41 41 41
T  14 12 16  4  2 13  9  0 41  0
C   7  3 13 17  7  9 14  0  0  0
A  15 10  6 10 27 15  9 41  0  0
G   5 16  6 10  5  4  9  0  0 41
.END LIT
The Sargan et al. paper puts forward the hypothesis that there is an
interaction between some mRNA leader sequences and a highly conserved
structure in the 18S rRNA of eukaryotic ribosomes. The attempt to
substantiate the hypothesis includes a table of base frequencies for
sequences immediately 5' to start codons. They examined 102 sequences and
I have used the base frequencies they found as a weight matrix for
searching for eukaryotic gene starts. I don't yet know how good this
method is. The viral sequences were found to be slightly different but
the separate table shown here is not used in the program.
.left margin1
@62. TX 8 @ Search for splice junctions
.LEFT MARGIN2
.para
Used to search for mRNA splice junctions using a weight matrix. The
default weight matrix is still that derived from the paper of Mount (Nucl.
Acids Res. 10, 459-472). However users may employ their own tables.
By default the positions of possible junctions will
be plotted rather than listed.
 The diagram splits the donor plot into 3 horizontal boxes
 so that all the
sites marked in any box are from the same reading frame. The acceptor
plot appears above the donor plot and is split in an equivalent way. So
sites marked as donors and acceptors in equivalent boxes are compatible.
i.e. donors from donor box 1 are compatible with acceptors from acceptor box
1, etc. Of course it is the combination of reading frame and splice sites
that really matters, and donors from box 1 can be compatible with acceptors
in box 3 if the reading frame switches.
.para
If dialogue is selected users can employ their own file of weights (see
below for the format), can change the cutoff scores, and can elect to have
the results listed rather than plotted. Listed results show the position
(of the last or first base in the exon), the frame and the matching sequence.
The frequency table shown below is used as a default
weight matrix and AG and GT are obligatory at the appropriate positions.
The plots are scaled so that the top of the scale is the highest value
achieved by a junction sequence in the set used to compile the frequency
table, and the bottom of the scale is the lowest value achieved by a
junction sequence in the same set.
.para
In the light of current knowledge it would be sensible for users to use
the weight matrix search option (20)
to create matrices that define  more specific splice junctions. If so it is
important that the positions "marked" are the last base in the donor exon and
the first base in the acceptor exon. To make a weight matrix suitable for
use with this function follow the instructions for option 20 and create
files for both donor and acceptor sites. Then concatenate the two matrix files
with the donor file first.
Note that any positions in the weight matrix that are
100% conserved will be made obligatory (normally the AG and GT).
.LEFT MARGIN1
.lit

 Mount donors redone 16-4-91
     12     3   -16.085    -7.500
 P  -2  -1   0   1   2   3   4   5   6   7   8   9
 N 136 136 136 136 136 136 136 136 136 136 136 136
 T  28   8  15  17   0 136   9  16   7  84  30  36
 C  41  60  16   7   0   0   3  13   3  17  28  39
 A  40  56  89  12   0   0  83  91  12  23  53  33
 G  27  12  16 100 136   0  41  16 114  12  25  28
 Mount acceptors redone 16-4-91
     18    15   -26.142   -14.400
 P -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
 N 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113
 T  58  50  57  59  67  56  58  49  47  66  64  31  34   0   0  11  41  31
 C  21  28  34  25  29  33  35  32  42  40  33  25  74   0   0  23  28  41
 A  17  11  11  18   7  17  12  23  15   3  10  29   5 113   0  24  21  21
 G  17  24  11  11  10   7   8   9   9   4   6  28   0   0 113  55  23  20
.END LIT

.left margIN1
@63. TX 7 @ Search using a weight matrix (complementary)
.LEFT MARGIN2
.para
This function searches the complementary strand of the sequence using
a weight matrix. Many
motifs can bind to either strand of the DNA and this function allows
users to search the complementary strand without having to change the
orientation of the sequence. See option 20 for more details.
.left margin1
@64. TX 3 @ Plot observed-expected word frequencies
.LEFT MARGIN2
.PARA
This option is designed to examine the abundances of short
words in a sequence to see if particular ones are either under or over
represented. It compares the observed and expected frequencies and
plots the difference along the sequence. There has been some work on the
relative amounts of CG dinucleotides in eukaryotic sequences (e.g. Bird,
Nature 321, 209-213 (1986)) and this routine can be used to examine such
biases, or any others that might be interesting.
.para
The user selects a word (say CG), a window length, and a maximum and
minimum scale for plotting the results. The
program examines each successive window length along the sequence,
with each window overlapping the previous one by windowlength-1.
The program counts the base frequencies in each window, and the number
of occurrences of the chosen word within the window. Using the base
frequencies it calculates an expected number of occurrences for the
chosen word (simply by multiplying the relevant frequencies). It plots
observed-expected, and hence will show regions that are rich or depleted
in the chosen word. The longest allowed word is 9 characters, but the
calculation of the expected frequencies becomes less appropriate as the
word length increases above 2.
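.para
The calculation for each window can be sketched as below. The function
name is hypothetical and the sketch is an illustration, not the program's
code. The expected count is taken as the product of the base frequencies
times the window length; this is an assumption, chosen because for a span
of 101 and equal base frequencies it gives 101/16 = 6.31, the default plot
limit shown in the typical dialogue.
.lit
```python
def observed_minus_expected(seq, word, window):
    """Return one observed-expected value for each window along seq,
    moving the window one base at a time."""
    seq, word = seq.upper(), word.upper()
    values = []
    for start in range(len(seq) - window + 1):
        segment = seq[start:start + window]
        # overlapping occurrences of the word within this window
        observed = sum(segment.startswith(word, i)
                       for i in range(window - len(word) + 1))
        # probability of the word from the window's own base frequencies
        probability = 1.0
        for base in word:
            probability *= segment.count(base) / window
        values.append(observed - probability * window)
    return values
```
.end lit
.para
For the window CGCG and the word CG the observed count is 2 and the
expected count is 0.5*0.5*4 = 1, so the plotted value is 1.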
.para
Typical dialogue follows.
.lit

? Menu or option number=D64
Plot composition differences (obs-exp))
Default String=CG
? String=
? odd span length (3-401) (101) =
? plot interval (1-20) (5) =
? Maximum plot value (-6.31-25.25) (6.31) =
? Minimum plot value (-25.25-6.31) (-6.31) =

 Missing graphics display here

.end lit
.left margIN1
@65. TX 9 @ Search for polya sites
.LEFT MARGIN2
.para
Simply searches for the sequence AATAAA
(Proudfoot and Brownlee Nature 263, 211-214,
1976) and marks it with a short vertical line.
.left margin1
@66. TX 1 @ Interconvert t and u
.LEFT MARGIN2
.para
This function interconverts T and U characters in the active sequence,
i.e. between DNA and RNA.
.LEFT MARGIN1
@67. TX 7 @ Search for patterns of motifs
.left margin2
.para
This option searches for patterns of motifs. Patterns can be defined
interactively or read from files. Results can be displayed in several ways
in both graphical and textual form. The option can also be used to create
pattern files for searching libraries. It is extremely flexible and
consequently the following documentation is quite lengthy. However the
routine is capable of searching for almost any known pattern. In addition
the flexibility does not necessitate difficulty of use, and the user
interface has been simplified considerably since the methods were first
published.
.para
Users should refer to the "typical dialogue" shown below for the most
helpful information on using the program.
.para
There are currently
four ways to display the matching patterns: 1=each individual
motif and its position is listed; 2=all the sequence between, and
including, the two outermost motifs is listed; 3=graphical, with a
vertical line marking the position
of the leftmost motif; 4=EMBL feature table format, where the KEYNAM
field is the motif name, the FROM and TO fields denote the ends of the
match, and the DESCRIPTION field is "Program".
.para
When it is defined for the first time a pattern must be entered
interactively at the keyboard, but the pattern description
can be saved to a file.
This file can be used for all subsequent searches.
.para
When defining a pattern interactively
select a motif class and the program will request the required inputs.
.para
The program gives each motif an identifying name and number.
For motifs other than the first, a range of allowed positions must be
defined. (Note that sets of motifs included using the OR operator will all
be given the same range, and so the program will only request range
values for the first motif in any such set.)
To specify the allowed range for a motif the user must supply the
following: the identifying number of the motif relative to which the
current motif's positions are to be defined (termed the "reference
motif"); a "relative start position"; and a range. The relative start
position can be negative or positive.
A negative start position means that although the reference motif
is searched for first, the current motif can be found to its left.
A zero relative start position means their left ends are superimposed. The
default start position is to butt-join the motif to the right-hand end of
the "reference motif". The range is "the number of extra positions" that
the motif can take.
.para
The program will display the probability of finding each motif. These
values are presented in the following form: .1234E-5 means 0.1234 times
10
to the power -5.
.para
After the pattern has been defined, the program will type a description
of
it on the screen. It will then allow the user to give an overall cutoff
score and overall probability cutoff.
.para
Typical dialogue for all the different motif classes is displayed below.
.lit

? Menu or option number=67
  Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =
? Motif name=Ematch
? String=AA
Probability of score     2.0000 = 0.595E-01
X 1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =2
? Motif name=AAA
X 1 And
  2 Or
  3 Not
? 0,1,2,3 =
? Number of reference motif (1-1) (1) =
? Relative start position (-1000-1000) (3) =
? Number of extra positions (0-1000) (0) =
? string=AAA
? Minimum matches (1.00-3.00) (3.00) =2
Probability of score     2.0000 = 0.149E+00
  1 Exact match
X 2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =3
? Motif name=T'S
X 1 And
  2 Or
  3 Not
? 0,1,2,3 =
? Number of reference motif (1-2) (2) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? String=TTT
? Minimum score (0.00-108.00) (108.00) =72
Probability of score    72.0000 = 0.258E+00
  1 Exact match
  2 Percentage match
X 3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =4
? Motif name=GCN4
X 1 And
  2 Or
  3 Not
? 0,1,2,3 =
? Number of reference motif (1-3) (3) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? Weight matrix file name=GCN4
 GCN4 FROM WEIGHTS 17-11-87
Probability of score   -22.0020 = 0.139E-02
  1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
X 4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =5
? Motif name=GCN4
X 1 And
  2 Or
  3 Not
? 0,1,2,3 =
? Number of reference motif (1-4) (4) =
? Relative start position (-1000-1000) (20) =
? Number of extra positions (0-1000) (0) =
? Weight matrix file name=GCN4
 GCN4 FROM WEIGHTS 17-11-87
Probability of score   -22.0020 = 0.606E-03
  1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
X 5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =6
? Motif name=LOOP
X 1 And
  2 Or
  3 Not
? 0,1,2,3 =
? Number of reference motif (1-5) (5) =
? Relative start position (-1000-1000) (20) =
? Number of extra positions (0-1000) (0) =
? Stem length (1-60) (6) =
? Minimum loop length (-6-60) (0) =
? Maximum loop length (0-60) (0) =5
? Minimum score (1.00-12.00) (12.00) =10
Probability of score    10.0000 = 0.598E-02
  1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
X 6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =7
? Motif name=Tstep
X 1 And
  2 Or
  3 Not
? 0,1,2,3 =
? Number of reference motif (1-6) (6) =
? (y/n) (y) Relative to 5 prime end
? Relative start position (-1000-1000) (1) =
? Number of extra positions (0-1000) (0) =
? String=TTT
? Step (1-20) (3) =
Probability of score     3.0000 = 0.367E-01
  1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
X 7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =8
? Motif name=REPEAT
X 1 And
  2 Or
  3 Not
? 0,1,2,3 =
? Number of reference motif (1-7) (7) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =2
? Repeat length (1-60) (6) =
? Minimum gap (0-60) (0) =
? Maximum gap (0-60) (0) =4
? Minimum score (1.00-6.00) (6.00) =5
Probability of score     5.0000 = 0.554E-02
  1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
X 8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =9
? (y/n) (y) Save pattern in a file N

Pattern description

Motif  1 named Ematch   is of class    1
Which is an exact match to the string
AA
Motif  2 named AAA      is of class    2
which is a match of score     2. to the string
AAA
and the 5 prime base can take positions      3 to       3
relative to the 5 prime end of motif   1
It is anded with the previous motif.
Motif  3 named T'S      is of class    3
which is a match of score    72. to the string
TTT
and the 5 prime base can take positions      4 to       4
relative to the 5 prime end of motif   2
It is anded with the previous motif.
Motif  4 named GCN4     is of class    4
Which is a match to a weight matrix with score -22.002
and the 5 prime base can take positions      4 to       4
relative to the 5 prime end of motif   3
It is anded with the previous motif.
Motif  5 named GCN4     is of class    5
Which is a match to the complement of a weight matrix with score -22.002
and the 5 prime base can take positions     20 to      20
relative to the 5 prime end of motif   4
It is anded with the previous motif.
Motif  6 named LOOP     is of class    6
Which is a stem-loop structure with stem length    6 and score    10.
The loop can have sizes      0 to      5
and the 5 prime base can take positions     20 to      20
relative to the 5 prime end of motif   5
It is anded with the previous motif.
Motif  7 named Tstep    is of class    7
Which is an exact match to the string
TTT
with a step size of     3
and the 5 prime base can take positions      1 to       1
relative to the 5 prime end of motif   6
It is anded with the previous motif.
Motif  8 named REPEAT   is of class    8
Which is a repeat with repeat length    6 and score     5.
The loop-out can have sizes      0 to      4
and the 5 prime base can take positions      4 to       6
relative to the 5 prime end of motif   7
It is anded with the previous motif.
Probability of finding pattern = 0.2348E-14
Expected number of matches  = 0.5100E-09
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
 Select display mode
X 1 Motif by motif
  2 Inclusive
  3 Graphical
  4 EMBL feature table
? 0,1,2,3,4 =4
 Searching


Total matches found      0

Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structures and repeats
m5 = Translation and codons
m6 = Gene search by content
m7 = Prokaryotic signal search
m8 = Eukaryotic signal search
 ? = Help
 ! = Quit
? Menu or option number=67
  Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =
? Motif name=Arun
? String=AAAAAA
Probability of score     6.0000 = 0.210E-03
X 1 Exact match
  2 Percentage match
  3 Cut-off score and score matrix
  4 Cut-off score and weight matrix
  5 Complement of weight matrix
  6 Inverted repeat or stem-loop
  7 Exact match, defined step
  8 Direct repeat
  9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =9
? (y/n) (y) Save pattern in a file N

Pattern description

Motif  1 named Arun     is of class    1
Which is an exact match to the string
AAAAAA
Probability of finding pattern = 0.2103E-03
Expected number of matches  = 0.1522E+01
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
 Select display mode
X 1 Motif by motif
  2 Inclusive
  3 Graphical
  4 EMBL feature table
? 0,1,2,3,4 =4
 Searching


FT   Arun       1582   1587       Program
FT   Arun       3160   3165       Program
FT   Arun       4204   4209       Program
FT   Arun       5691   5696       Program
FT   Arun       6710   6715       Program
Total matches found      5
Minimum and maximum observed scores        6.00        6.00

.end lit
.para
These methods allow users to define and search for
complex patterns of motifs defined as single objects.
The programs allow individual DNA motifs to be defined in eight
different
ways, and protein motifs in six. Motifs are combined, using the logical
operators AND, OR and NOT, to describe a pattern. The pattern also
specifies the ranges of allowed relative separations of the individual
motifs.
.para
First some definitions.
.para
A MOTIF is a contiguous subsequence of fixed length.
At its simplest
it could be a single definite base or amino acid; a more complex motif
might be better represented as a consensus or a weight matrix;
two more-abstract types of
motif are direct and inverted repeats.
.para
A PATTERN is a higher order of structure defined by a list of motifs. The
motifs in a pattern are combined using the logical operators AND, OR and
NOT. The list also defines the allowed relative separations of the
motifs. In the current versions of the programs up
 to 50 motifs can be combined into a single pattern. So using these
definitions there are two
differences between motifs and patterns: 1) the distances between all
elements of a motif are fixed, but
the separations of parts of patterns can vary;
 2) all characters in a motif are defined
using the same method (class), but different parts of a pattern can be
defined in completely different ways.
.para
Each motif
can be represented in 9 ways (known as the motif class):
.sk1
.lit
           MOTIF CLASSES
CLASS           DESCRIPTION
 1       Exact match to a short defined sequence. The IUB symbols
         can be used for DNA sequences.
 2       Percentage match to a defined short sequence. In nucleic acids,
         the IUB symbols can be used.
 3       Match to a defined sequence, using a score matrix and cutoff
         score. The DNA matrix (see option 18) gives scores to IUB symbols
         depending on their level of redundancy. MDM78 is used for proteins.
 4       Match to a weight matrix with cutoff score.
 5       As class 4 but on the complementary strand.
 6       Inverted repeat or stem-loop. Fixed stem length, range of
         loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
 7       Exact match to short sequence but with a defined step size.
 8       Direct repeat. Fixed repeat length, range of loop-out sizes,
         cutoff score, and score matrix (for protein sequences MDM78 and
         for nucleic acids an identity matrix).
 9       Membership of a set. A list of sets of allowed amino acids for
         each position in the motif. The sets are separated by commas(,).
         For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
         acids in which one of I,V or L must be found in the first position,
         then anything in the next two positions, D,E,K or R in the fourth
         position and F,Y,W,I,L,V or M in the fifth. This class only applies
         to protein sequences because for nucleic acids "membership of a
         set" can be achieved using IUB symbols.

    Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
    nucleic acids.

.end lit
.para
Class 1: exact match.
.para
The motif is defined by a short sequence, which for nucleic acids,
 may include IUB symbols. All symbols must match.
.para
Class 2: percentage match
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. The minimum number of matching characters
must
also be specified.
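.para
Classes 1 and 2 can be illustrated with a short sketch. It is not the
program's own code: the function names are invented for this example, and
only the standard IUB nucleotide symbols are assumed. Leaving the minimum
number of matches unset requires every symbol to match (class 1); giving a
smaller number behaves like class 2.
.lit

```python
# IUB ambiguity symbols: each motif symbol stands for a set of bases.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_matches(motif, window):
    """Count positions where the sequence base is allowed by the motif symbol."""
    return sum(1 for m, b in zip(motif.upper(), window.upper())
               if b in IUB.get(m, ""))


def find_motif(seq, motif, min_matches=None):
    """Return 1-based start positions with at least min_matches matching symbols.

    min_matches=None means all symbols must match (class 1 behaviour).
    """
    n = len(motif)
    if min_matches is None:
        min_matches = n
    return [i + 1
            for i in range(len(seq) - n + 1)
            if count_matches(motif, seq[i:i + n]) >= min_matches]


if __name__ == "__main__":
    print(find_motif("GAAACAAG", "AAA"))      # exact match
    print(find_motif("GAAACAAG", "AAA", 2))   # at least 2 of 3 must match
    print(find_motif("ACGT", "RY"))           # purine followed by pyrimidine
```

.end lit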
.para
Class 3: match using a score matrix
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. The motif is not compared directly with the
sequence to count the number of matching characters. Instead a matrix is
used to provide a score for all possible pairs of characters. The motif
score for
any position along the sequence is the sum of the scores found by
looking-up the scores for each pair of aligned characters. A match is
declared if some minimum score is achieved.
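.para
The class 3 calculation can be sketched as below. The pair scores used
here (2 for identity, 1 for a transition, 0 otherwise) are a toy matrix
chosen for illustration only; they are not the program's DNA matrix or
MDM78, and the function names are hypothetical.
.lit

```python
def toy_score(a, b):
    """Hypothetical pair score: 2 for identity, 1 for a transition, else 0."""
    if a == b:
        return 2
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 1
    return 0


def score_matrix_hits(seq, motif, cutoff, pair_score=toy_score):
    """Return (1-based position, score) wherever the summed score >= cutoff."""
    n = len(motif)
    hits = []
    for i in range(len(seq) - n + 1):
        # Sum the looked-up score for each aligned pair of characters.
        s = sum(pair_score(m, b) for m, b in zip(motif, seq[i:i + n]))
        if s >= cutoff:
            hits.append((i + 1, s))
    return hits


if __name__ == "__main__":
    print(score_matrix_hits("TTCTAG", "TTT", 5))
```

.end lit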
.para
Class 4: weight matrix
.para
The motif is defined by a table of values (called weights or scores). The
table gives a score for finding each possible character at each position
along the length of the motif. It therefore
has dimension motif-length x character-set-size, and allows us to give
different scores for each character at each position. It is equivalent to
having a different score matrix for each position along the motif, and
provides the most flexible and specific method of defining motifs. The
weight matrices are created by program NIP option 20 and
stored as files. The file contains the values
for each position, as well as an overall minimum score.
There are two ways in which these values can be used to calculate an
overall
score for any section of the sequence. The simplest way is to add the
values in the file. (This means that the highest possible score
can be calculated by adding the top value at each column
position, and the lowest
by adding the bottom value.)
 The normal way of using the values in the file is as
follows.
First the programs divide the values in each column by the column total
so
that they sum to 1.0.
Then the natural
logs of these values are used as scores. When the matrix is applied to a
sequence these logarithmic values are summed (which is of course
equivalent
to multiplying the frequencies).
Note that using the natural logs of the frequencies as
weights and
adding them means that the overall cutoff score must be less than zero,
whereas if the original
values in the weight matrix file are added, the cutoff score will be
greater than zero. The search routines therefore decide whether the user
wants to add values or multiply frequencies
by examining the value of the cutoff score: they will add the values in
the file if the cutoff is
greater than zero, and add the logs of the frequencies if it is less than zero.
 Hence we effectively get two
motif classes in one. The program NIP, when creating weight matrix
files, will ask the user whether the scores should be added or multiplied.
 If the values in the table have been defined
without using a set of aligned sequences
it is easier for the user to
choose a cutoff score if the values are added.
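.para
The two ways of using a weight matrix can be sketched as follows. This is
an illustration under stated assumptions, not the program's code: the
table is held as one dictionary per motif position, every value is assumed
to be greater than zero so that its log exists, and a cutoff of exactly
zero (a case the text does not cover) is treated here as the log mode.
.lit

```python
import math


def log_weights(columns):
    """Divide each column by its total and take natural logs."""
    out = []
    for col in columns:
        total = sum(col.values())
        out.append({base: math.log(v / total) for base, v in col.items()})
    return out


def window_score(window, columns):
    """Sum the table value for the character found at each position."""
    return sum(col[base] for col, base in zip(columns, window))


def weight_matrix_hits(seq, columns, cutoff):
    """cutoff > 0: add the raw values; otherwise add logs of frequencies."""
    table = columns if cutoff > 0 else log_weights(columns)
    n = len(columns)
    hits = []
    for i in range(len(seq) - n + 1):
        s = window_score(seq[i:i + n], table)
        if s >= cutoff:
            hits.append((i + 1, s))
    return hits


if __name__ == "__main__":
    # A made-up two-position table of counts.
    table = [{"A": 8, "C": 1, "G": 1, "T": 2},
             {"A": 1, "C": 1, "G": 1, "T": 9}]
    print(weight_matrix_hits("ATGA", table, 15))     # values added
    print(weight_matrix_hits("ATGA", table, -1.0))   # logs added
```

.end lit
.para
In the second call the score for AT is ln(8/12) + ln(9/12), which is the
log of the product of the two frequencies.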
.para
Class 5: complement of weight matrix
.para
The motif is defined by a weight matrix, but the program searches for its
complement.
.para
Class 6: inverted repeat, or stem-loop
.para
The motif is defined by a stem length, a minimum score
and a range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0.
The loop sizes are defined by a minimum
and maximum distance from the 3' end of the stem.
For a stem-loop these will be positive numbers. For example to
define a stem of length 8 and loop sizes varying from 3 to 5, the stem
would be set to 8, the minimum start distance to 3 and the maximum
to 5. To define an
inverted repeat the minimum distance will be negative. For example stem
length=9,
minimum distance=-9, and maximum distance=-8 will find
inverted repeats of lengths 9 and 10.
E.g. AAAAATTTT and AAAAATTTTT would be found, the first having a base
at
its centre, the second having none.
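.para
The stem-loop scoring can be sketched as below, using the pair scores
given above. The sketch handles only loop sizes of zero or more; the
negative distances used for inverted repeats are deliberately left out,
and the function names are invented for this example.
.lit

```python
# Pair scores from the text: A-T=2, G-C=2, G-T=1, anything else 0.
PAIR_SCORE = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}


def stem_score(seq, start, stem, loop):
    """Score a stem whose 5' arm begins at 0-based index start.

    The 3' arm begins loop bases after the end of the 5' arm.
    Returns None if the structure would run past the end of seq.
    """
    end = start + 2 * stem + loop
    if end > len(seq):
        return None
    total = 0
    for k in range(stem):
        # The k-th base of the 5' arm pairs with the k-th base from the 3' end.
        total += PAIR_SCORE.get((seq[start + k], seq[end - 1 - k]), 0)
    return total


def stem_loop_hits(seq, stem, min_loop, max_loop, cutoff):
    """Return (1-based start, loop size, score) for stems scoring >= cutoff."""
    if min_loop < 0:
        raise ValueError("this sketch handles non-negative loop sizes only")
    hits = []
    for start in range(len(seq)):
        for loop in range(min_loop, max_loop + 1):
            s = stem_score(seq, start, stem, loop)
            if s is not None and s >= cutoff:
                hits.append((start + 1, loop, s))
    return hits


if __name__ == "__main__":
    print(stem_loop_hits("GGGCAAAAGCCC", 4, 3, 5, 8))
```

.end lit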
.para
Class 7: exact match, defined step size.
.para
The motif is defined by a short sequence, which for nucleic acids,
 may include IUB symbols. All symbols must match. The class differs
from
class 1 in that searches will move in steps of some given size. For
example
we could search for a certain codon and use a step size of 3 and hence
 keep in a
single reading frame.
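.para
A stepped search can be sketched as follows. For brevity the sketch
compares characters directly and does not expand IUB symbols; the function
name and the "start" argument are assumptions made for the example.
.lit

```python
def stepped_hits(seq, motif, start=1, step=3):
    """Return 1-based positions of motif, testing only start, start+step, ..."""
    n = len(motif)
    return [p
            for p in range(start, len(seq) - n + 2, step)
            if seq[p - 1:p - 1 + n] == motif]


if __name__ == "__main__":
    seq = "ATGTTTATTTGA"
    print(stepped_hits(seq, "TTT", start=1, step=3))  # frame of position 1
    print(stepped_hits(seq, "TTT", start=2, step=3))  # frame of position 2
```

.end lit
.para
The sequence contains TTT at positions 4 and 8, but each call reports only
the occurrence that lies in its own reading frame.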
.para
Class 8: direct repeat
.para
The motif is defined by a repeat length, a minimum score
and a range of loop-out sizes. The scores are defined using MDM78 for protein
sequences and an identity matrix for nucleic acids.
The loop-out sizes are defined by a minimum
and maximum distance (gap) from the 3' end of the first copy of the repeat.
.para
Class 9: membership of a set
.para
This motif class is for protein sequences. It is defined by lists of
allowed amino acids for each position in the motif, and a cut-off score.
Positions at which any amino acid can occur are left blank.
All allowed amino acids for each position give a score of 1.
The motifs can be defined in two ways: either typed at the keyboard or
read
in as a weight-matrix-like file.
When the motif is defined at the keyboard the sets of allowed amino
acids
are separated by commas(,).
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
acids in which one of I,V or L must be found in the first position,
then anything in the next two positions, D,E,K or R in the fourth
position and F,Y,W,I,L,V or M in the fifth. To specify that the
whole motif must match, a score of 3 would be required (i.e. one of the
allowed amino acids must be found for each of the three defined
positions).
If the motif is read from a file the file must have been written by
program
NIP, or have been saved by the pattern searching routines. If the
user
elects to save a pattern, and it includes class 9 motifs typed at the
keyboard, then the program will save the class 9 motifs as weight matrix
files. Therefore it will request file names for each motif of this class.
If the motif given above as an example were saved the weight matrix file
would have 5 columns.
The first column
would contain zeroes except for the I, V and L rows
which would be set to 1; the next two columns would all be zero; the next
would be zero except for the D,E,K and R rows which would be 1; the final
column would contain 1's in rows F,Y,W,I,L,V and M, with
the rest zero.
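.para
The keyboard form of a class 9 motif, and the table of 0s and 1s
described above, can be sketched as follows. The function names and the
order of the amino acid rows are assumptions made for this example; the
layout of the real weight matrix file is not reproduced.
.lit

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # row order chosen for this sketch only


def parse_sets(spec):
    """'IVL,,,DEKR,FYWILVM' -> one set per position; empty set = anything."""
    return [set(part) for part in spec.split(",")]


def set_score(window, sets):
    """Each position holding an allowed amino acid scores 1."""
    return sum(1 for aa, allowed in zip(window, sets) if aa in allowed)


def membership_hits(seq, spec, cutoff):
    """Return 1-based positions where the set-membership score >= cutoff."""
    sets = parse_sets(spec)
    n = len(sets)
    return [i + 1
            for i in range(len(seq) - n + 1)
            if set_score(seq[i:i + n], sets) >= cutoff]


def as_columns(spec):
    """One column per position: 1 for allowed amino acids, 0 for the rest."""
    return [{aa: int(aa in allowed) for aa in AMINO_ACIDS}
            for allowed in parse_sets(spec)]


if __name__ == "__main__":
    spec = "IVL,,,DEKR,FYWILVM"
    print(membership_hits("MVAGKFAA", spec, 3))
    print([sum(col.values()) for col in as_columns(spec)])
```

.end lit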
.para

The logical operator (AND, OR or NOT) used to add each motif to the
pattern
is specified by preceding
the class number by the letters A, O or N. A = AND, O = OR, N = NOT.
The default is A, so N2 means include, using the NOT operator, a class 2
motif; O2 means include, using the OR operator, a class 2 motif; both A2
and
2 mean include, using the AND operator, a class 2 motif.

.para
Range setting.
.para
The motifs in a pattern are numbered according to their order in the list.
Apart from the first motif in a pattern all motifs are given a range
of allowed positions relative to a motif further up the list.
For example
suppose we have a pattern defined by A AND B AND C AND D.
Motif A can occur anywhere, but B must have its range of allowed
positions defined relative to the position of motif A, and C's positions
can be defined relative to either A or B, depending on which is most
convenient, and likewise D's positions can be relative to A or B or C.
.para
Notice that the positions of motifs can be defined relative to more than
one motif. Suppose we have a pattern consisting of
motifs A, B and C, and that B occurs 5-10 residues right of A, C occurs
5-10 residues right of B, and also C is never more than 15 residues from A.
Then
it is quite consistent with the methods to include motif C into the
pattern
twice using the AND operator: once relative to A and once relative to B.
This will define the relative spacing and the ORDER of the motifs in the
pattern. (If we simply defined the position of C relative to A it could be
found to the left of B).
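.para
The effect of including C twice can be sketched as an intersection of two
ranges. The sketch assumes that offsets are measured between the 5' ends
of the motifs and are simply added to the position of the reference motif;
the exact origin the program uses is not taken from its code. The offsets
for C relative to A (10 to 15) follow from the example: B is at least 5
right of A and C at least 5 right of B.
.lit

```python
def allowed_starts(ref_pos, rel_start, extra):
    """Positions ref_pos+rel_start .. ref_pos+rel_start+extra inclusive."""
    first = ref_pos + rel_start
    return range(first, first + extra + 1)


def c_positions(a_pos, b_pos):
    """Positions for C that satisfy both of its range definitions."""
    relative_to_a = set(allowed_starts(a_pos, 10, 5))  # 10-15 right of A
    relative_to_b = set(allowed_starts(b_pos, 5, 5))   # 5-10 right of B
    return sorted(relative_to_a & relative_to_b)


if __name__ == "__main__":
    print(c_positions(100, 105))  # B close to A: C has several positions
    print(c_positions(100, 110))  # B far from A: only one position is left
```

.end lit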
.para
Motifs combined together using the OR operator are all given the same
range. For example suppose we had a pattern A AND (B OR C) AND (D OR E),
 then B and C each have the same range, and D and E also have
the same range as one another. The range for D and E can be relative to
A or to B.
.para
Motifs cannot have their ranges defined relative to motifs that are
included using the NOT operator. For example if we had the pattern A NOT
B
AND C, then the range for C can only be defined relative to motif A.
.para
Speed can be gained by arranging the order
of the motifs so that those higher up the list are of types that can be
searched for rapidly and that are also unlikely to be found.
.para
Motifs combined by the OR operator are alternatives: if any one of a set
of motifs
combined by the OR operator is found, then a match is declared. All
alternatives will be reported. For example if we had a pattern defined by
A
AND (B OR C), then all places where A occurs and B is found within range,
and all places where A is found and C is found within range will be
reported. A typical use would be where we might allow a motif to appear
on
either strand of the DNA sequence. For example a weight matrix
representing
the heatshock element could be used in a pattern which included
heatshock
as a motif class 4 combined using the OR operator
with heatshock as a motif class 5.
.para
The probability calculations are performed for each motif as it is
defined.
If an overall probability cut-off is given the calculation is repeated for
each match found. To achieve maximum searching speed do not give an
overall
probability cut-off. Overall cut-off scores should only be used if the
motif
classes used are compatible.
.para
There are currently
several ways to display the matches: 1 = each
motif and its position is listed; 2 = all the sequence between the two
outermost motifs is listed; 3 = graphical, with a spike marking the
position
of the leftmost motif. The library versions also give entry names, and a
one
line title; in addition they can be used to produce aligned families of
sequences. When this mode of output is selected the program will write a
separate file for each match. The files will be called ENTRYNAME.DAT
where
ENTRYNAME is the name of the entry in the library. The matching
sequence
will be written out so that the spacing between motifs is constant, and
set to the maximum allowed by the pattern definition. Any gaps will be
filled with dashes (-). If the individual sequences were subsequently
written one above the other
they should line up so that all motifs are in register. There are two types of
output of this sort: one, option 4, writes out whole sequences, the other,
option 5, writes out only the sequences between the two outermost
motifs.
Note that for option 4 users are asked to type the position of the
first motif, and the reason for
this is explained below.
Consider a pattern found in several sequences. Consider only
the first motif in
the pattern and suppose that it was found in different positions in these
sequences.
Say that of these positions the one furthest from the left end was
position 100. Then, in order to ensure that all the sequences would align,
we must specify that motif 1 must start at position 100.
Any sequences in which motif 1 started
nearer to the left end than position 100 would be padded accordingly.
These modes of output
should only be used when the position of each motif is defined relative to
its
immediate neighbour.
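.para
The left padding used for option 4 can be sketched as below. This is an
illustration only: the function name is invented, positions are counted
from 1, and the padding of gaps between motifs is not shown.
.lit

```python
def pad_to_position(seq, motif_start, target_start):
    """Pad seq on the left with dashes so motif 1 starts at target_start.

    motif_start is the 1-based position of motif 1 in seq; target_start is
    the position typed by the user (the largest start among all matches).
    """
    if motif_start > target_start:
        raise ValueError("target position is left of the motif")
    return "-" * (target_start - motif_start) + seq


if __name__ == "__main__":
    # Motif 1 (TATA) starts at position 5 in one sequence and 8 in another.
    print(pad_to_position("ACGTTATAGG", 5, 8))
    print(pad_to_position("ACGACGTTATACC", 8, 8))
```

.end lit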
.para
The pattern descriptions can be saved to files. These files
can be used instead of typing definitions again at the keyboard. As the
files are annotated,
they can easily
be changed using system editors, and the modified versions used to
define the variant patterns for the programs.
.para
Use of lists of entry names
.para
The two programs that operate on libraries have the ability to
restrict their searches to subsets of the libraries. This does not require
sublibraries to be created but instead is achieved by using files
containing a list of the entry names of sequences. The user may choose to
search only those entries on the list or, alternatively to search all but
those on the list (i.e. in the latter case
the list contains the names of those to be excluded).
 The programs can search libraries that have indexes and those that
do not.
 If a list of names for inclusion is used,
then the search will be faster if the index is present. In all other
circumstances the whole library will be read.
The list must be in library order except when it is used
to include entries, and an index is available.
The list must contain each entry name on a separate line, with the name
starting in column 1 of the line, i.e. there must be no spaces at the start
of the line.
The list of entry names
can be produced by the keyword searches of nip, pip, etc as long
as the listings produced have a space character separating the entry name
from the entry description. This will depend on how well the library
reformatting programs work. For example swissprot entry names tend to run
into the beginning of the descriptions, but other libraries are generally
OK.
.para
One use of the programs is to look for patterns that we already know
about, but in new sequences. However it is hoped that they will also be
useful for finding new motifs. For example
several known control regions in
nucleic acid
sequences consist of particular direct or inverted repeats;
the inclusion of
direct and inverted repeats as motif classes
makes it possible to
find previously unknown
motifs of these types.
Using these new programs we can
ask questions like: "are there any inverted or direct repeats near to
sections of sequence that contain both a
CCAAT box and a TATA box?"; and to search for such things throughout
the
libraries. In addition, the mode of output in which all the sequence
between
the two outermost motifs found is printed out, allows us to extract
sequences and examine them in more detail for further common
subsequences.
For example we might want to collect together all the sequences
between
putative CCAAT and TATA boxes.
.para
A further use of the inverted repeat motif class is the following. If a
regulatory sequence in DNA is poorly defined but also an inverted repeat,
then it might be an advantage to specify it both as a consensus sequence
and
a superimposed inverted repeat. In this way two weak definitions can be
combined to produce a stronger pattern.
.para
Given only a few examples of a motif it
should be possible to perform initial searches using a
class 3 motif, and then, using plausible matching sequences, create a
more
specific weight matrix for the same motif.
.para
If motifs are combined with the first motif using the OR operator
they will be ignored until all
permutations that include the first motif have been looked for.
The whole search will then be repeated, in
turn, for each of
those motifs that are combined with the first motif using the OR
operator.
An interesting consequence of this is that the program can be used,
without
change, to compare any newly determined sequence with all known
individual
motifs. We achieve this by having a pattern in which all known relevant
motifs are combined using the OR operator.
If we ask to use this pattern with
a sequence, the program will automatically compare each individual
motif in
the pattern with the whole length of the
sequence. As the number of known
motifs grows this should become an increasingly useful standard
procedure.
.para
The NOT operator is obviously
useful for making sure particular motifs are not present, but it can also
be used to bracket the levels of matches found. We may want a degree of
match that lies between two limits - binding should occur, but not too
strongly; or base-pairs should form, but not too many. We can specify
this
by asking for a match with a low score, in combination with a match with
a
high score, both for the same motif, but with the high-scoring match included
using
the NOT operator.
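.para
Bracketing a score between two limits can be sketched as follows. The
score used here is a plain count of identical characters, chosen only to
keep the example short, and the function names are hypothetical.
.lit

```python
def match_count(motif, window):
    """Number of positions at which motif and window are identical."""
    return sum(m == b for m, b in zip(motif, window))


def bracketed_hits(seq, motif, low, high):
    """Positions scoring at least low AND NOT at least high."""
    n = len(motif)
    hits = []
    for i in range(len(seq) - n + 1):
        s = match_count(motif, seq[i:i + n])
        if s >= low and not s >= high:
            hits.append((i + 1, s))
    return hits


if __name__ == "__main__":
    # The perfect match at position 1 is rejected by the NOT condition.
    print(bracketed_hits("TATAAATATTAA", "TATAAA", 4, 6))
```

.end lit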
.para
The algorithm is designed to find all sections of a sequence that satisfy
the pattern rather than only the best match.
Particularly if some of the motifs in a pattern are less well defined than
others, this can often result in the same region of a sequence being
reported as having several matches, but which only vary in the
positions of the weakest motifs.
.para
General remarks on motif searching
.para
Generally motifs are short subsequences that are thought to be
associated with
particular functions in some known sequences. Often
we search for them to try to
understand or interpret other sequences. Sometimes we search for
motifs and
patterns to
test a hypothesis about their role: are they found in the expected
positions in the expected sequences? In doing so we should remember
that, in both proteins and nucleic acids,
 what we are really looking for is a particular
three dimensional structure with certain affinities for other structures,
and that we are assuming that the sequence of the motif alone
defines the 3D structure we are searching for.
 The overall structure
may be completely different to those in which the motif is functional,
and
hence the motif may have a different shape or be inaccessible.
We should be aware of the
importance of the context in which a motif is found. Where does it lie
relative to the overall structure, is it accessible, is the three
dimensional spacing between
it and other motifs correct? For example, is it on the same side of the
double helix, and the correct distance from some other motif? How does
context affect our assessment of the significance of finding a motif?
Finding false mammalian mRNA splice junctions in non-coding sequences
is
far less important than finding false sites in pre-mRNA sequences, but
finding them in the correct places is most important! In other words, it
is
often the case that when we are searching for a motif that is known to
be
necessary for some function, then a positive result in the form of a
match
in the required position, is more important than a high background of
matches in the wrong positions. Being
 able to write
down the probability of finding a motif in a random sequence tells us how
well it is defined.
In nucleic
acids the DNA may contain many superimposed types of information such
as
those concerned with histone phasing, protein coding or mRNA secondary
structure. These overlapping "codes" may interfere with one another
causing
matches to motifs to be poorer than expected.
In general we will only have a limited number of examples of the
motif and we do not know how representative they are.
.para
Sequences have superimposed functions: some parts may be of general
structural
importance and give rise to an overall framework, and other parts give
specificity and hence are not common; we may want to use a set of
aligned
sequences to define a motif, but want to use only the framework
positions.
 Alternatively we may want to pick out
only those parts of a set of aligned sequences that give a particular
property, and to ignore other similarities that are due to some other
property
and which could obscure the pattern
we are interested in.
It is possible to apply a mask to a set of aligned sequences in
order to give weight to selected positions only.
 The ability to define a mask allows certain positions
to be used in the motif and others to be ignored, and yet still permits the
use of a set of aligned sequences to calculate weights. The mask is
requested and applied
by the program and results in the masked positions being zero
in
the weight matrix. The mask is defined in the following way.
Suppose we had a motif of length 15, then the mask
x--x--xx-x will give zero weights to positions 2,3,5,6 and 9 (note it is
the dashes (-) that are significant and that positions
1,4,7,8,10,11,12,13,14 and 15
will be non-zero). Of course
the same set of sequences could be used with several alternative masks
in
order to extract different features and create corresponding weight
matrices.
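.para
The masking rule can be sketched as below, using the mask from the
example above. The weight table is held as one dictionary per position,
which is an assumption of this sketch and not the program's file format.
.lit

```python
def masked_positions(mask):
    """1-based positions that the mask sets to zero (the dashes)."""
    return [i + 1 for i, ch in enumerate(mask) if ch == "-"]


def apply_mask(columns, mask):
    """Zero every column under a dash; positions past the mask are kept."""
    out = []
    for i, col in enumerate(columns):
        if i < len(mask) and mask[i] == "-":
            out.append({key: 0 for key in col})
        else:
            out.append(dict(col))
    return out


if __name__ == "__main__":
    mask = "x--x--xx-x"
    print(masked_positions(mask))
    # A made-up motif of length 15 with the same weights at every position.
    table = [{"A": 3, "C": 1, "G": 1, "T": 2} for _ in range(15)]
    masked = apply_mask(table, mask)
    print([i + 1 for i, col in enumerate(masked) if any(col.values())])
```

.end lit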
.para
The programs are described in Staden,R.
CABIOS 4, 53-60, 1988; Staden,R.
 CABIOS 5, 89-96, 1989, and Methods in Enzymology 183, 193-211 (1990).
.left margin1
@ end of help