@-1. TX 0 @General
@-2. T 0 @Screen control
@-2. X 0 @Screen
@-3. T 0 @Statistical analysis of content
@-3. X 0 @Statistics
@-4. T 0 @Structures and repeats
@-4. X 0 @Structures
@-5. TX 0 @Translation and codons
@-6. TX 0 @Gene search by content
@-7. TX 0 @General signals
@-8. TX 0 @Specific signals
@0. TX -1 @NIP
This is a program for analysing individual nucleotide
sequences. It can read sequences stored in many of the most commonly
used formats, and performs all of the usual simple analyses. However
the main purpose of the program is to provide methods for finding
the function of each section of a sequence. In general no single
method can give an unequivocal interpretation of a sequence so we
need to use many techniques together and to combine their results.
For this reason the program presents many of its results
graphically.
General information is contained in the user interface. Online
documentation for any function follows a consistent pattern: summary,
list of inputs, list of outputs, details, example.
@1. TX 0 @ Help
This option gives online help. The user should select option
numbers and the current documentation will be given. Note that
option 0 gives an introduction to the program, and that ? will get
help from anywhere in the program. The following functions are
included:
@2. TX 0 @ Quit
This function stops the program.
@3. TX 1 @ Read a new sequence
This option allows users to read in new sequences, browse
through annotations, or search sequence libraries for keywords.
Sequences can be read from "personal" sequence files or from
sequence libraries. These are referred to as the sequence "source".
Personal files can be stored in several formats: Staden, PIR, EMBL,
GENBANK and GCG. At LMB we use "Staden" format for sequencing and
all the libraries are stored in their original formats. Note,
however, that libraries such as EMBL or GenBank that are divided
into several files (e.g. GenBank has 13 separate files) are indexed as
a whole. This means that users do not need to know which file
contains an entry, only which library. When the user selects to
read in a sequence the program first asks for the sequence "source".
If the user selects "personal" the program will ask for the
format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
the file. For PIR format the user will also be required to know the
entry name of the sequence as the file can contain several. For the
other formats only a single entry is expected. The file will be
read, its length and composition will be displayed and the option
left.
If the user selects "library" as the sequence source the
program will display a list of available libraries. The programs are
capable of handling all current libraries but which ones are
available will vary from site to site. At LMB we have several
libraries and also weekly updates of data gathered between releases.
The program will ask users to select a library and then give a list
of options:
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
If "Get a sequence" or "Get annotations" is selected users will be asked
to type the entry name. The option will be left when a sequence is
selected or ! is typed. The composition and length will be
displayed.
The text index contains all words from feature tables,
reference titles, definition lines, keywords lists and comments, so
the text index search is most useful. It is also the fastest. Up to
5 words can be searched for at once. The words should be typed
separated by spaces, for example
? Keywords=P53 mouse murine tumo
will search for all entries that contain words starting with p53,
mouse, murine and tumo. Only the unique entries that contain ALL
words will be listed. Before listing the matching entries the
program will show the number of 'hits' for each word and ring the
bell. Escape is possible at this point, or after each screenful of
entries. In addition to the entry names the text search displays
the primary accession number, the sequence length and up to 80
characters of description. (The search of 'titles' is now redundant
because the full text index contains all the title words and the
search is much faster. It will probably be removed from the
program.) All searches are independent of case. Where possible the
program will offer default entry names.
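The matching rule described above (each word is matched as a prefix,
case is ignored, and only entries containing ALL words are listed) can
be sketched in Python. This is an illustration only, not the program's
own code: the function name, the dictionary standing in for the text
index and the toy entries are assumptions, and the hit count is taken
here to be the number of entries matched by each word.

```python
def search_text_index(index, query, max_words=5):
    """Return (hits per word, sorted entries containing ALL words).

    index is a hypothetical stand-in for the text index: a dict that
    maps an upper-case word to the set of entry names containing it.
    """
    words = query.upper().split()          # searches are independent of case
    if len(words) > max_words:
        raise ValueError("at most %d words can be searched at once" % max_words)
    matched = {}
    for word in words:
        entries = set()
        for indexed_word, names in index.items():
            if indexed_word.startswith(word):   # match words STARTING with the key
                entries |= names
        matched[word] = entries
    if not matched:
        return {}, []
    common = set.intersection(*matched.values())  # entries with ALL words
    return {w: len(e) for w, e in matched.items()}, sorted(common)


# Toy index, invented for illustration only.
toy_index = {
    "P53": {"MMP53", "HSP53"},
    "MOUSE": {"MMP53", "MMLYN"},
    "TUMOUR": {"MMANT01", "MMP53"},
}
hits, entries = search_text_index(toy_index, "p53 mouse")
print(hits, entries)
```

Intersecting the per-word sets is what makes a common word such as
"mouse" cheap to combine with a rare one such as "p53".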
Typical dialogue follows.
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =
Select sequence file format
X 1 Staden
2 EMBL
3 GenBank
4 PIR
5 GCG
? Selection (1-5) (1) =
? Sequence file name=M13MP7.SEQ
Contig title removed
Sequence length= 7238
Sequence composition
T C A G -
2405. 1539. 1765. 1527. 2.
33.2% 21.3% 24.4% 21.1% 0.0%
.
.
.
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =2
Select a library
X 1 EMBL 29 nucleotide library Dec 91
2 SWISSPROT 20 protein library Nov 91
3 PIR 31 protein library Dec 91
4 NRL3D 58 From Brookhaven protein library Dec 91
5 GenBank
? Selection (1-5) (1) =
Library is in EMBL format with indexes
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =5
Search for keywords
? Keywords=P53 mouse
P53 hits 68
MOUSE hits 8180
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
22 different entries found
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =4
Search for keywords
? Keywords=alpha
Searching for alpha
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =3
? Accession number=v00636
Entry name LAMBDA
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =2
Default Entry name=LAMBDA
? Entry name=
ID LAMBDA standard; DNA; PHG; 48502 BP.
XX
AC V00636; J02459; M17233; X00906;
XX
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
DT 09-JUN-1982 (Rel. 1, Created)
XX
DE Genome of the bacteriophage lambda (Styloviridae).
XX
KW circular; coat protein; DNA binding protein; genome;
KW origin of replication.
XX
OS Bacteriophage lambda
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
XX
RN [1]
RP 1-48502
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
RT "Nucleotide sequence of bacteriophage lambda DNA";
RL J. Mol. Biol. 162:729-773(1982).
XX
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =
Default Entry name=LAMBDA
? Entry name=
DE Genome of the bacteriophage lambda (Styloviridae).
Sequence length 48502
Sequence composition
T C A G -
11988. 11360. 12336. 12818. 0.
24.7% 23.4% 25.4% 26.4% 0.0%
@4. TX 1 @ Define active region
For its analytic functions the program always works on a
region of the sequence called the "active region". This function
allows the start and end points of the active region to be reset.
Define the required start and end points.
When a new sequence is read into the program the active region
is automatically set to start at the beginning of the sequence and
extend to the maximum the program can handle. On most machines this
will be to the end of the sequence. The positions are shown on the
screen. Note that for convenience, in the listing and translation
functions, the user is given access to regions outside the active
region.
@5. TX 1 @ List a sequence
The sequence can be listed single or double stranded with line
lengths from 10 to 120 in multiples of 10.
Define the region to list, the line length required and choose
between a single or double stranded display. The output looks like:
GTTAATGTAG CTTAATAACA AAGCAAAGCA CTGAAAATGC TTAGATGGAT
CAATTACATC GAATTATTGT TTCGTTTCGT GACTTTTACG AATCTACCTA
10 20 30 40 50
AATTGTATCC CATAAACACA AAGGTTTGGT CCTGGCCTTA TAATTAATTA
TTAACATAGG GTATTTGTGT TTCCAAACCA GGACCGGAAT ATTAATTAAT
60 70 80 90 100
GAGGTAAAAT TACACATGCA AACCTCCATA GACCGGTGTA AAATCCCTTA
CTCCATTTTA ATGTGTACGT TTGGAGGTAT CTGGCCACAT TTTAGGGAAT
110 120 130 140 150
AACATTTACT TAAAATTTAA GGAGAGGGTA TCAAGCACAT TAAAATAGCT
TTGTAAATGA ATTTTAAATT CCTCTCCCAT AGTTCGTGTA ATTTTATCGA
160 170 180 190 200
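
The layout of the double stranded listing above can be reproduced with
a short Python sketch. It is not the program's code: the function name
and the exact spacing of the numbering line are assumptions. Note that
the second strand is the complement written base for base under the
first strand; it is not reversed.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def list_sequence(seq, line_length=50, double_stranded=True):
    """Return listing lines: blocks of 10 bases, optional complement, numbering."""
    if line_length % 10 or not 10 <= line_length <= 120:
        raise ValueError("line length must be a multiple of 10 from 10 to 120")
    lines = []
    for start in range(0, len(seq), line_length):
        chunk = seq[start:start + line_length]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        lines.append(" ".join(blocks))
        if double_stranded:
            # complement each block in place, keeping the same orientation
            lines.append(" ".join(b.translate(COMPLEMENT) for b in blocks))
        # number the last base of every complete block of 10 (assumed layout)
        numbers = [str(start + i + 10).rjust(10)
                   for i in range(0, len(chunk) - 9, 10)]
        lines.append(" ".join(numbers))
    return lines


for line in list_sequence("GTTAATGTAGCTTAATAACAAAGCAAAGCA", line_length=30):
    print(line)
```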
@6. TX 1 @ List a text file.
Allows the user to have a text file displayed on the screen.
It will appear one page at a time.
Supply the name of the file to be displayed.
@7. TX 1 @ Direct output to disk
Used to direct output that would normally appear on the screen
to a file.
Select redirection of either text or graphics, and supply the
name of the file that the output should be written to.
The results from the next options selected will not appear on
the screen but will be written to the file. When option 7 is
selected again the file will be closed and output will again appear
on the screen.
@8. TX 1 @ Write active region to disk
Used to write the current active section of sequence to a disk
file in "Staden format".
Supply a file name and an optional title.
The program has the capability of reading sequences stored in
several formats and so, in conjunction with this option, can be used
to reformat them.
@9. TX 1 @ Edit the sequence
Used to edit sequences or any other files by giving access to
the computer's system editor. For editing sequences the input file
should have already been created using one of the listing functions
such as "list sequence", "list translation" or "list restriction
sites above the sequence".
Supply the name of the file to edit. Wait while the system
editor is made ready (this can take a while on a VAX). Use the editor.
Exit from the editor. If a sequence has been edited, and you want to
process it, affirm that the sequence should be "made active". The
edited sequence will replace the original sequence.
This editing method is designed to give users access to an
editor with which they are familiar - i.e. the one on their machine,
and yet to allow them to edit a sequence which contains all the
landmarks they need in order to know where they are. Users can
create files containing simple listings (single stranded) with
numbering, using "list the sequence", and then edit them with their
system editor, using the numbering to know where they are within the
sequence. When the edits are complete they exit from the editor and
the program "analyses" the edited file to extract only the sequence
characters. Similarly a file containing a three phase translation
can be edited, or a file containing a sequence plus its three phase
translation, plus its restriction sites marked above the sequence.
In order to be able to "analyse" such complicated listings and
correctly extract the sequence the following simple rule is used:
all lines in the file that contain a character that is not A,C,T,G
or U are deleted. It is obviously important to be aware of this rule
and its implications.
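
The rule can be sketched in Python as follows. This is an illustration
under stated assumptions, not the program's code: the function name is
invented, and blanks (which separate the blocks of 10 bases in a
listing) and letter case are assumed to be ignored when a line is
tested.

```python
SEQUENCE_CHARS = set("ACGTU")


def extract_sequence(listing_text):
    """Keep only lines made up entirely of A, C, G, T or U; join what is left."""
    kept = []
    for line in listing_text.splitlines():
        symbols = "".join(line.split()).upper()   # assumption: blanks ignored
        if symbols and set(symbols) <= SEQUENCE_CHARS:
            kept.append(symbols)
        # any other line (numbering, enzyme names, most translation
        # lines) contains some other character and is dropped
    return "".join(kept)


listing = """GTTAATGTAG CTTAATAACA
        10         20
AATTGTATCC
        30
"""
print(extract_sequence(listing))
```

One implication is that an annotation line which happens to contain only
these letters, for example a one letter translation line reading
"A C G T", would be kept as sequence. Another is that both strands of a
double stranded listing would be kept, which is why single stranded
listings are used for editing.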
@10. TX 2 @ Clear graphics
Clears graphics from the screen.
@11. TX 2 @ Clear text
Clears text from the screen.
@12. TX 2 @ Draw a ruler
This option allows the user to draw a ruler or scale along the
x axis of the screen to help identify the coordinates of points of
interest. The user can define the position of the first base to be
marked (for example if the active region is 1501 to 8000, the user
might wish to mark every 1000th base starting at either 1501 or 2000
- it depends if the user wishes to treat the active region as an
independent unit with its own numbering starting at its left edge,
or as part of the whole sequence). The user can also define the
separation of the ticks on the scale and their height. If required
the labelling routine can be used to add numbers to the ticks.
@13. TX 2 @ Use crosshair
This function puts a steerable cross on the screen that can be
used to find the coordinates of points in the sequence. The user can
move the cross around using the directional keys; when he hits the
space bar the program will print out the coordinates of the cross in
sequence units and the option will be exited.
If instead, you hit a , the position will be displayed but the
cross will remain on the screen.
If a letter s is hit the program will display the sequence
around the crosshair position, and leave the cross on the screen.
@14. TX 2 @ Reposition plots
The position of each plot is defined relative to a
user's drawing board which has size 1-10,000 in x and 1-10,000 in y.
Plots for each option are drawn in a window defined by x0,y0 and
xlength,ylength, where x0,y0 is the position of the bottom left hand
corner of the window, xlength is the width of the window and
ylength is the height of the window.
--------------------------------------------------------- 10,000
1 1
1 -------------------------------------- ^ 1
1 1 1 1 1
1 1 1 1 1
1 1 1 ylength 1
1 1 1 1 1
1 1 1 1 1
1 -------------------------------------- v 1
1 x0,y0^ 1
1 <---------------xlength--------------> 1
--------------------------------------------------------- 1
1 10,000
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "NIPMARG" when the
program is started. Users can have their own file if required. As
all the plots start at the same position in x and have the same
width, x0 and xlength are the same for all options. Generally users
will only want to change the start level of the window y0 and its
height ylength. This option allows users to change window positions
whilst running the program. The routine prompts first for the
number of the option that the user wishes to reposition; then for
the y start and height; then for the x start and length. Note that
changes to the x values affect all options. If the user types only
carriage return for any value it will remain unchanged. The cross-
hair can be used to choose suitable heights.
@15. TX 2 @ Label a diagram
This routine allows users to label any diagrams they have
produced. They are asked to type in a label. When the user types
carriage return to finish typing the label the cross-hair appears on
the screen. The user can position it anywhere on the screen. If the
user types R (for right justify) the label will be written on the
diagram with its right end at the cross-hair position. If the user
types L (for left justify) the label will be written on the diagram
with its left end at the cross hair position. The cross-hair will
then immediately reappear. The user may put the same label on
another part of the diagram as before or if he hits the space bar he
will be asked if he wishes to type in another label.
Typical dialogue follows.
? Menu or option number=15
Type label then drive cross hair to left or right end
of label position then hit "L" to write label left
justified or "R" to write label right justified or
the space bar to quit
? Label=delta gene
missing graphics
? Label=
@16. TX 2 @Display a map
This draws a map of any sequence features selected by the
user. These features may be protein coding regions (CDS), tRNA
genes (TRNA), promoter positions (PRM), etc. Users may define their
own feature table key names. For example I find it convenient to
split CDS lines into CDS1, CDS2 and CDS3 each of which contains only
those sequences that code in the reading frames 1, 2 or 3. Then I
can plot them at different heights on the screen ( suitable heights
can be determined by using the cross-hair).
The coordinates must be stored in a file in the format of an
EMBL or GenBank feature table. Note that this means that the file
must include either EMBL or GenBank headers, and a suitable "tail".
The simplest header is the word FEATURES starting in column 1 of the
first line of the file. The simplest tail is 2 empty lines at the
end of the file. These lines are not included when nip writes out
results in feature table format.
Typical dialogue follows.
? Menu or option number=16
Display a map using an EMBL feature table file
? map file name=hsegl1.ft
? feature code(e.g. CDS) =CDS
X 1 + strand
2 - strand
3 both strands
? 0,1,2,3 =
? level (0-9480) (256) =4000
missing graphics
? feature code(e.g. CDS) =
@17. TX 1 @ Search for restriction enzymes
This routine is used to search for short sequences, like
restriction enzyme recognition sequences, and can either list the
results or present them graphically. Listings can take several forms
and can include the sequence and its translation. Examples are given
below. The program will also display the names of enzymes that cut
the sequence infrequently. Users can select from sets of enzymes
stored in files or can enter them from the keyboard.
The short sequences (strings) and their names need to be
arranged in a particular way. See below. Select to search, list an
enzyme file or clear the screen. Choose either a file of enzymes or
to enter their recognition sequences at the keyboard. Choose to
search for all the enzymes in the list or to select from the list.
Select a mode of output. Define the sequence as circular or linear.
Select to search for "definite" or "possible" matches. The search
starts, and after the results have been displayed, further searches
can be performed.
When the enzymes and their recognition sequences are stored in
a file they must be defined in the following way. We call the
recognition sequences "strings". The format is as follows: each
string or set of strings must be preceded by a name, each string
must be preceded and terminated with a slash (/), and each set of
strings by 2 slashes. For example AATII/GACGT'C// defines the name
AATII, its recognition sequence GACGTC and its cut site with the '
symbol; ACCI/GT'MKAC// defines the name ACCI and its recognition
sequence includes IUB symbols for incompletely defined symbols in
nucleic acid sequences; BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
defines the name BBVI and this time two recognition sequences and
cut sites are specified in order to correctly show the cutting
position relative to the recognition sequence. If no cut site is
included the cut is displayed on the 5' side of the first base of
the recognition sequence.
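
The file format described above can be read with a few lines of Python.
This is a minimal sketch, not the program's own parser: the function
name and the returned structure (recognition sequence plus the offset
of the ' symbol, or None when no cut site is given) are assumptions
made for illustration.

```python
def parse_enzyme_line(line):
    """Parse one entry such as AATII/GACGT'C// into (name, sites).

    sites is a list of (recognition_sequence, cut_offset) pairs, where
    cut_offset is the number of symbols to the left of the ' mark, or
    None if the string carries no cut site.
    """
    line = line.strip()
    if not line.endswith("//"):
        raise ValueError("each set of strings must end with two slashes")
    name, _, rest = line[:-2].partition("/")
    sites = []
    for string in rest.split("/"):
        if not string:
            continue
        cut = string.find("'")
        recognition = string.replace("'", "")
        sites.append((recognition, cut if cut >= 0 else None))
    return name, sites


print(parse_enzyme_line("AATII/GACGT'C//"))
print(parse_enzyme_line("BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//"))
```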
These collections of strings and their names can be read from
disk or entered from the keyboard. When names and strings are
entered from the keyboard the program will ask for the name and then
the string(s). If more than one string is typed per name they must
be separated by slash (/) characters. See the "Typical dialogue"
below. Three files containing restriction enzyme recognition
sequences are currently available. The "all enzymes" file contains
the Rich Roberts REBASE restriction enzyme database, which is
updated monthly.
The user can select strings by name from these collections. If
so the program will prompt for the names, one at a time. The user
can continue to select names until a blank name is entered (by the
user typing only return).
Listed output can be displayed in several ways: it can be
ordered enzyme by enzyme, or on cut positions, or with enzyme names
written above a listing of the sequence. This last listing can also
include a three phase translation of the sequence. In addition the
program will display only infrequent cutters (the user defines the
maximum number of cuts), or can plot the positions of matches.
Listings sorted "enzyme by enzyme" have the following form:
Matches found= 1
Name Sequence Position Fragment lengths
1 AATII GACGT'C 112 111 111
912 912
Matches found= 2
Name Sequence Position Fragment lengths
1 ACCI GT'CGAC 112 111 111
2 ACCI GT'AGAC 420 308 308
604 604
Matches found= 2
Name Sequence Position Fragment lengths
1 AHAII GA'CGTC 109 108 90
2 AHAII GG'CGTC 199 90 108
825 825
Matches found= 2
Name Sequence Position Fragment lengths
1 AVAII G'GACC 84 83 51
2 AVAII G'GTCC 973 889 83
51 889
Matches found= 1
Name Sequence Position Fragment lengths
1 BALI TGG'CCA 258 257 257
766 766
Matches found= 1
Name Sequence Position Fragment lengths
1 BAMHI G'GATCC 92 91 91
...... etc
Listings sorted on cut position have the following form:
Searching
Name Sequence Position Fragment lengths
1 ECORI G'AATTC 2 1
2 BANI G'GTGCC 26 24
3 BSP1286 GTGCC'C 31 5
4 BBVI 'TACTGCGCCGCAGCTGC 38 7
5 NSPBII CAG'CTG 51 13
6 PVUII CAG'CTG 51 0
7 BBVI GCAGCTGCTGGTG' 60 9
8 HINCII GTC'AAC 80 20
9 AVAII G'GACC 84 4
10 BINI 'CCAGGGATCC 87 3
11 BSTNI CC'AGG 89 2
12 BAMHI G'GATCC 92 3
13 XHOII G'GATCC 92 0
14 NSPBII CCG'CTG 98 6
15 BINI GGATCCGCT' 100 2
16 AHAII GA'CGTC 109 9
17 SALI G'TCGAC 111 2
18 AATII GACGT'C 112 1
19 ACCI GT'CGAC 112 0
20 HINCII GTC'GAC 113 1
21 BBVI GCAGCGACTGATT' 166 53
22 BINI 'ACTCAGATCC 178 12
23 XHOII A'GATCC 183 5
24 HGAI 'GGCGGCGGAGGCGTC 188 5
.....etc
Lists of infrequent cutters have the following form:
0 AFLII
0 AFLIII
0 APAI
0 APALI
0 ASUII
0 AVAI
0 AVRII
0 BCLI
0 BGLI
0 BGLII
0 BSMI
0 BSPMII
0 BSTEII
...... etc
Listings showing names above the sequence, and a translation have the
following form:
ECORI BANI BSP1286
. . . BBVI NSPBII
. . . . PVUII BBVI
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
10 20 30 40 50 60
E F G L G L V * G A Q R L L R R S C W C
N S V W A W C E V P R D Y C A A A A G A
I R F G L G V R C P E I T A P Q L L V L
HINCII
. AVAII
. . BINI
. . . BSTNI
. . . . BAMHI
. . . . XHOII NSPBII
. . . . . . BINI AHAII
. . . . . . . . SALI
. . . . . . . . .AATII
. . . . . . . . .ACCI
. . . . . . . . ..HINCII
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
70 80 90 100 110 120
W R C G G R Q R T Q G S A G R G R R R R
G G A E V V N G P R D P L D E D V D D E
A V R R S S T D P G I R W T R T S T T R
BBVI BINI
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
130 140 150 160 170 180
G G G G * R I A G G W Q R L I * V L T T
E E V D S A L L V A G S D * F E F * P L
R R W I A H C W W L A A T D L S S D H S
XHOII
. HGAI AHAII PFIMI
. . . . BBVI
CAGATCCGGCGGCGGAGGCGTCGAGGCTCCCGAAACTCCCAGTGGCTGGCCTGCTAGATT
190 200 210 220 230 240
Q I R R R R R R G S R N S Q W L A C * I
R S G G G G V E A P E T P S G W P A R F
D P A A E A S R L P K L P V A G L L D S
.........etc
The terms "possible" and "definite" matches are important only
for back translations of protein into DNA, which include IUB
redundancy codes. The matches that the program terms "definite
matches" are ones in which the specification of the recognition
sequence corresponds exactly to that of the back translation, and
which are consequently definitely in the DNA sequence. The program will
also find what it terms 'possible matches' which are ones that
depend on the particular codons chosen for each amino acid. These
are sites at which recognition sequences could be engineered to
produce a cut in the DNA without changing the amino acid, but which
are not necessarily found in the original sequence.
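
One way to read the distinction is in terms of the sets of bases that
each IUB symbol stands for. The Python sketch below follows that
reading and is an assumption, not a description of the program's
algorithm: a window of the back translation is called "definite" if
every base it allows is accepted by the recognition sequence, and
"possible" if at least one choice of bases at every position is
accepted. The function name is invented for illustration.

```python
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def classify_match(recognition, window):
    """Return 'definite', 'possible' or None for one window of the sequence."""
    if len(recognition) != len(window):
        raise ValueError("window must be as long as the recognition sequence")
    definite = True
    for pattern_symbol, sequence_symbol in zip(recognition.upper(), window.upper()):
        allowed = set(IUB[pattern_symbol])
        present = set(IUB[sequence_symbol])
        if not allowed & present:
            return None            # no choice of codons can give a site here
        if not present <= allowed:
            definite = False       # a site only for some choices of codon
    return "definite" if definite else "possible"


# Glu-Phe back translates to GARTTY; the site GAATTC needs R=A and Y=C.
print(classify_match("GAATTC", "GARTTY"))
print(classify_match("GAATTC", "GAATTC"))
```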
The routine will handle both linear and circular sequences,
and so finds cutsites spanning the "ends" of circular sequences.
The program will only find cutsites spanning the ends of sequences
if the sequence is declared as circular. This includes sites for
recognition sequences containing leading or trailing N symbols, in
which the actual recognition sequence does not span the join. For
example if the recognition sequence was 'NNNNACGT and the first 4
characters in the sequence were ACGT, then the match would only be
found if the sequence was declared as circular. If the sequence is
linear then the first fragment starts at base number 1, and the last
ends at the last base. If the sequence is circular then the length
of the first fragment is the clockwise distance from the last cut to
the first.
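
The fragment lengths shown in the listings follow from the cut
positions as sketched below in Python. The function name is an
assumption, and the sequence length of 1023 used in the example is not
stated in the listings; it is inferred from the fact that the fragment
lengths printed for each enzyme sum to 1023.

```python
def fragment_lengths(cut_positions, sequence_length, circular=False):
    """Return fragment lengths in order along the sequence."""
    cuts = sorted(cut_positions)
    if not cuts:
        return [sequence_length]
    between = [b - a for a, b in zip(cuts, cuts[1:])]
    if circular:
        # first fragment: clockwise distance from the last cut to the first
        wrap = sequence_length - cuts[-1] + cuts[0]
        return [wrap] + between
    first = cuts[0] - 1                       # starts at base number 1
    last = sequence_length - cuts[-1] + 1     # ends at the last base
    return [first] + between + [last]


print(fragment_lengths([112, 420], 1023))          # [111, 308, 604], as for ACCI
print(sorted(fragment_lengths([84, 973], 1023)))   # [51, 83, 889], as for AVAII
```

Sorting the result gives the final column of the "enzyme by enzyme"
listing, in which the same lengths are shown in order of size.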
Graphical output marks the position of each string by a short
vertical line and gives the name of the enzyme at the left end of
the line. If the top of the screen is reached the program gives the
user the opportunity to take a hard copy and then will clear the
screen and restart plotting results at the original start position.
Below is an edited piece of dialogue from use of the search
option:
? Menu or option number=17
Search for restriction enzyme sites
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 = 2
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
AATII/GACGT'C//
ACCI/GT'MKAC//
AFLII/C'TTAAG//
AFLIII/A'CRYGT//
AHAII/GR'CGYC//
APAI/GGGCC'C//
APALI/G'TGCAC//
ASUII/TT'CGAA//
AVAI/C'YCGRG//
AVAII/G'GWCC//
AVRII/C'CTAGG//
BALI/TGG'CCA//
BAMHI/G'GATCC//
BANI/G'GYRCC//
BANII/GRGCY'C//
BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
BCLI/T'GATCA//
BGLI/GCCNNNN'NGGC//
BGLII/A'GATCT//
BINI/GGATCNNNN'/'NNNNNGATCC//
BSMI/GAATGCN'/NG'CATTC//
BSP1286/GDGCH'C//
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
Matches found= 1
Name Sequence Position Fragment lengths
1 AATII GACGT'C 112 111 111
912 912
Matches found= 2
Name Sequence Position Fragment lengths
1 ACCI GT'CGAC 112 111 111
2 ACCI GT'AGAC 420 308 308
604 604
Matches found= 2
Name Sequence Position Fragment lengths
1 AHAII GA'CGTC 109 108 90
2 AHAII GG'CGTC 199 90 108
825 825
Matches found= 2
Name Sequence Position Fragment lengths
1 AVAII G'GACC 84 83 51
2 AVAII G'GTCC 973 889 83
51 889
Matches found= 1
Name Sequence Position Fragment lengths
1 BALI TGG'CCA 258 257 257
766 766
Matches found= 1
Name Sequence Position Fragment lengths
1 BAMHI G'GATCC 92 91 91
932 932
Matches found= 1
Name Sequence Position Fragment lengths
1 BANI G'GTGCC 26 25 25
998 998
Matches found= 1
Name Sequence Position Fragment lengths
1 BANII GAGCC'C 490 489 489
534 534
Matches found= 11
Name Sequence Position Fragment lengths
1 BBVI 'TACTGCGCCGCAGCTGC 38 37 3
2 BBVI GCAGCTGCTGGTG' 60 22 22
3 BBVI GCAGCGACTGATT' 166 106 28
4 BBVI 'CCTGCTAGATTCGCTGC 230 64 37
5 BBVI GCAGCGGTACGTA' 452 222 50
6 BBVI 'CTCGCCAACGTTGCTGC 502 50 55
7 BBVI GCAGCCTTCAACT' 606 104 64
8 BBVI 'GAGGTATTCCTGGCTGC 634 28 97
9 BBVI 'CTGGCCGCCGCCGCTGC 869 235 104
10 BBVI 'GCCGCCGCCGCTGCTGC 872 3 106
11 BBVI GCAGCGATGAGGA' 927 55 222
....etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 = 2
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
Name Sequence Position Fragment lengths
1 ECORI G'AATTC 2 1
2 BANI G'GTGCC 26 24
3 BSP1286 GTGCC'C 31 5
4 BBVI 'TACTGCGCCGCAGCTGC 38 7
5 NSPBII CAG'CTG 51 13
6 PVUII CAG'CTG 51 0
7 BBVI GCAGCTGCTGGTG' 60 9
8 HINCII GTC'AAC 80 20
9 AVAII G'GACC 84 4
10 BINI 'CCAGGGATCC 87 3
11 BSTNI CC'AGG 89 2
12 BAMHI G'GATCC 92 3
13 XHOII G'GATCC 92 0
14 NSPBII CCG'CTG 98 6
15 BINI GGATCCGCT' 100 2
16 AHAII GA'CGTC 109 9
17 SALI G'TCGAC 111 2
18 AATII GACGT'C 112 1
19 ACCI GT'CGAC 112 0
20 HINCII GTC'GAC 113 1
.....etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
1 Order results enzyme by enzyme
X 2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =3
? Maximum number of cuts (0-100) (0) =
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
0 AFLII
0 AFLIII
0 APAI
0 APALI
0 ASUII
0 AVAI
0 AVRII
0 BCLI
0 BGLI
0 BGLII
0 BSMI
0 BSPMII
0 BSTEII
0 CLAI
0 DRAI
0 DRAII
0 ECOB
0 ECOK
0 ECORV
0 ESPI
......etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
1 Order results enzyme by enzyme
2 Order results by position
X 3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =4
? (y/n) (y) Hide translation n
? (y/n) (y) Use 1 letter codes
? Line length (30-90) (60) =
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
ECORI BANI BSP1286
. . . BBVI NSPBII
. . . . PVUII BBVI
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
10 20 30 40 50 60
E F G L G L V * G A Q R L L R R S C W C
N S V W A W C E V P R D Y C A A A A G A
I R F G L G V R C P E I T A P Q L L V L
HINCII
. AVAII
. . BINI
. . . BSTNI
. . . . BAMHI
. . . . XHOII NSPBII
. . . . . . BINI AHAII
. . . . . . . . SALI
. . . . . . . . .AATII
. . . . . . . . .ACCI
. . . . . . . . ..HINCII
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
70 80 90 100 110 120
W R C G G R Q R T Q G S A G R G R R R R
G G A E V V N G P R D P L D E D V D D E
A V R R S S T D P G I R W T R T S T T R
BBVI BINI
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
130 140 150 160 170 180
G G G G * R I A G G W Q R L I * V L T T
E E V D S A L L V A G S D * F E F * P L
R R W I A H C W W L A A T D L S S D H S
.......etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =5
Define search strings by typing a string name
followed by the string(s)
? Name=FRED
? String(s)=AAAAAA/TTTTTT
? Name=MARY
? String(s)=CCCC/GGGG/GCGCT
? Name=
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
Matches found= 9
Name Sequence Position Fragment lengths
1 FRED 'TTTTTT 1557 1556 1
2 FRED 'TTTTTT 1558 1 1
3 FRED 'TTTTTT 1559 1 1
4 FRED 'TTTTTT 1560 1 22
5 FRED 'AAAAAA 1582 22 529
6 FRED 'AAAAAA 3160 1578 1019
7 FRED 'AAAAAA 4204 1044 1044
8 FRED 'AAAAAA 5691 1487 1487
9 FRED 'AAAAAA 6710 1019 1556
529 1578
Matches found= 36
Name Sequence Position Fragment lengths
1 MARY 'CCCC 47 46 1
2 MARY 'GGGG 486 439 1
3 MARY 'GGGG 487 1 1
4 MARY 'CCCC 557 70 1
5 MARY 'CCCC 558 1 1
6 MARY 'GCGCT 1177 619 1
... etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =5
Define search strings by typing a string name
followed by the string(s)
? Name=JANE
? String(s)=A'TTTT/CC'GGG
? Name=
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
Matches found= 30
Name Sequence Position Fragment lengths
1 JANE A'TTTT 437 436 6
2 JANE A'TTTT 546 109 33
3 JANE A'TTTT 597 51 43
4 JANE A'TTTT 777 180 51
5 JANE A'TTTT 1274 497 60
6 JANE A'TTTT 1571 297 62
7 JANE CC'GGG 1926 355 75
8 JANE A'TTTT 2403 477 81
9 JANE A'TTTT 2586 183 82
10 JANE A'TTTT 2731 145 101
11 JANE A'TTTT 2812 81 103
... etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =!
@18. TX 1 7 @ Compare a short sequence
This routine slides a short sequence along the current
sequence and finds all positions at which a given percentage of the
bases match. Output is in both graphical and listed forms.
If users call for dialogue when the routine is selected they
will be given the choice of keyboard or file input. Define the
string, select the "sense" to use and the percentage match. Matches
will be plotted out and then the user can select to have them
listed. Then the routine cycles around.
The routine slides the search string along the sequence and
marks the positions at which a minimum percentage score is reached.
The graphical output draws a vertical line at the match position;
the height of the line represents the percentage score, so that if
the line reaches the top of the box the score is 100%. The NC-IUB
symbols may be used in the search sequence to encode uncertain
characters. Any other symbols will not match.
NC-IUB SYMBOLS
A,C,G,T
R (A,G) 'puRine'
Y (T,C) 'pYrimidine'
W (A,T) 'Weak'
S (C,G) 'Strong'
M (A,C) 'aMino'
K (G,T) 'Keto'
H (A,T,C) 'not G'
B (G,C,T) 'not A'
V (G,A,C) 'not T'
D (G,A,T) 'not C'
N (G,A,C,T) 'aNy'
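The sliding comparison described above can be sketched in Python as follows. This is only an illustration, not the program's own code: the names `IUB` and `percent_matches` are hypothetical, and the way the percentage is compared with the threshold (no rounding) is an assumption.

```python
# IUB symbol -> the bases it stands for (taken from the table above).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}

def percent_matches(seq, probe, min_percent):
    """Return (1-based position, matching bases) for each window that
    reaches min_percent. Threshold handling is an assumption."""
    hits = []
    n = len(probe)
    for start in range(len(seq) - n + 1):
        window = seq[start:start + n]
        # A probe symbol matches when the sequence base is in its IUB set;
        # any other probe symbol matches nothing.
        matches = sum(1 for p, b in zip(probe, window) if b in IUB.get(p, ""))
        if 100.0 * matches / n >= min_percent:
            hits.append((start + 1, matches))
    return hits
```

For example, `percent_matches("GGACATTTCGCGG", "AAATTTCCC", 70)` reports one window starting at position 3 in which 7 of the 9 bases match.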
Typical dialogue is shown below.
? Menu or option number=18
Find percentage matches
? (y/n) (y) Keep picture
? String=AAATTTCCC
STRING=AAATTTCCC
? (y/n) (y) This sense
? Percent match (1.00-100.00) (70.00) =
Missing graphics display here
Total scoring positions above 70.000 percent = 7
Scores 7 6 6 6 6 6 6
Positions 365 212 213 292 311 358 627
? Display (0-7) (0) =3
365
ACATTTCGC
* ***** *
AAATTTCCC
1
212
GAAACTCCC
** ****
AAATTTCCC
1
213
AAACTCCCA
*** * **
AAATTTCCC
1
? (y/n) (y) Keep picture
Default String=AAATTTCCC
? String=
STRING=AAATTTCCC
? (y/n) (y) This sense n
STRING=GGGAAATTT
? Percent match (1.00-100.00) (70.00) =
Missing graphics display here
Total scoring positions above 70.000 percent = 7
Scores 6 6 6 6 6 6 6
Positions 269 270 271 288 354 624 853
? Display (0-7) (0) =3
269
GAGGGATTT
* * ****
GGGAAATTT
1
270
AGGGATTTT
** * ***
GGGAAATTT
1
271
GGGATTTTC
**** **
GGGAAATTT
1
? (y/n) (y) Keep picture !
@19. TX 7 @ Compare a short sequence using a score matrix
This routine slides a short sequence along the current
sequence and finds all positions at which a given level of
similarity (a cutoff score) is reached. The score is defined by use
of a score matrix. Output is in both graphical and listed forms.
If users call for dialogue when the routine is selected they
will be given the choice of keyboard or file input. Define the
string, select the "sense" to use and the cutoff score. Matches will
be plotted out and then the user can select to have them listed.
Then the routine cycles around.
The routine slides the search string along the sequence and
marks the positions at which the cutoff score is achieved. The
graphical output draws a vertical line at the match position; the
height of the line represents the score, so that if the line
reaches the top of the box the score is the maximum possible. The
NC-IUB symbols may be used in the search sequence to encode
uncertain characters.
The score matrix reflects the level of redundancy in the probe
sequence and hence will put more emphasis on those characters that
are better defined. The score matrix is:
DNA SCORE MATRIX USING IUB SYMBOLS
T C A G - R Y W S M K H B V D N ?
T 36 0 0 0 9 0 18 18 0 0 18 12 12 0 12 9 0
C 0 36 0 0 9 0 18 0 18 18 0 12 12 12 0 9 0
A 0 0 36 0 9 18 0 18 0 18 0 12 0 12 12 9 0
G 0 0 0 36 9 18 0 0 18 0 18 0 12 12 12 9 0
- 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
R 0 0 18 18 18 36 0 9 9 9 9 6 6 12 12 18 0
Y 18 18 0 0 18 0 36 9 9 9 9 12 12 6 6 18 0
W 18 0 18 0 18 9 9 36 0 9 9 12 6 6 12 18 0
S 0 18 0 18 18 9 9 0 36 9 9 6 12 12 6 18 0
M 0 18 18 0 18 9 9 9 9 36 0 12 6 12 6 18 0
K 18 0 0 18 18 9 9 9 9 0 36 6 12 6 12 18 0
H 12 12 12 0 27 6 12 12 6 12 6 36 8 8 8 27 0
B 12 12 0 12 27 6 12 6 12 6 12 8 36 8 8 27 0
V 0 12 12 12 27 12 6 6 12 12 6 8 8 36 8 27 0
D 12 0 12 12 27 12 6 12 6 6 12 8 8 8 36 27 0
N 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
? is any unrecognised character.
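As a rough illustration of how such a matrix is applied, the Python sketch below scores a probe against a window of the sequence. It copies only the T, C, A and G columns of the matrix above (enough when the sequence itself contains no ambiguity codes). The names `ROWS`, `matrix_score` and `matrix_hits` are hypothetical, and treating the cutoff as "greater than or equal" is an assumption.

```python
COLS = "TCAG"  # column order of the matrix above
# Probe symbol -> scores against sequence bases T, C, A, G.
ROWS = {
    "T": (36, 0, 0, 0), "C": (0, 36, 0, 0),
    "A": (0, 0, 36, 0), "G": (0, 0, 0, 36),
    "R": (0, 0, 18, 18), "Y": (18, 18, 0, 0),
    "W": (18, 0, 18, 0), "S": (0, 18, 0, 18),
    "M": (0, 18, 18, 0), "K": (18, 0, 0, 18),
    "H": (12, 12, 12, 0), "B": (12, 12, 0, 12),
    "V": (0, 12, 12, 12), "D": (12, 0, 12, 12),
    "N": (9, 9, 9, 9),
}

def matrix_score(probe, window):
    """Sum the matrix entries for each aligned probe/sequence pair."""
    total = 0
    for p, b in zip(probe, window):
        if p in ROWS and b in COLS:   # unrecognised characters score 0
            total += ROWS[p][COLS.index(b)]
    return total

def matrix_hits(seq, probe, cutoff):
    """Return (1-based position, score) for windows scoring >= cutoff."""
    n = len(probe)
    hits = []
    for start in range(len(seq) - n + 1):
        s = matrix_score(probe, seq[start:start + n])
        if s >= cutoff:
            hits.append((start + 1, s))
    return hits
```

With this sketch a perfect match of a nine-base probe of definite characters scores 9 x 36 = 324, and seven matching bases score 252, which agrees with the figures in the dialogue below.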
Typical dialogue is shown below.
? Menu or option number=19
Find matches using a score matrix
? (y/n) (y) Keep picture
? String=AAATTTCCC
STRING=AAATTTCCC
? (y/n) (y) This sense
Minimum score= 0 Maximum score= 324
? Score (0-324) (280) =250
Missing graphics display here
For score 250 the number of matches= 1
Scores 252
Positions 365
? Display (0-1) (0) =1
365
ACATTTCGC
* ***** *
AAATTTCCC
1
? (y/n) (y) Keep picture
Default String=AAATTTCCC
? String=
STRING=AAATTTCCC
? (y/n) (y) This sense n
STRING=GGGAAATTT
Minimum score= 0 Maximum score= 324
? Score (0-324) (222) = 200
Missing graphics display here
For score 200 the number of matches= 7
Scores 216 216 216 216 216 216 216
Positions 269 270 271 288 354 624 853
? Display (0-7) (0) =3
269
GAGGGATTT
* * ****
GGGAAATTT
1
270
AGGGATTTT
** * ***
GGGAAATTT
1
271
GGGATTTTC
**** **
GGGAAATTT
1
? (y/n) (y) Keep picture !
@20. TX 7 @ Search for a motif using a weight matrix
This function performs searches for short sequence motifs
using an appropriate weight matrix. In addition it can be used to
create or modify weight matrices. In order to perform a search the
only input required is the name of the file containing the weight
matrix. The results can be presented graphically or listed. The
graphical presentation will draw a line at the position of any matches
found; the height of the line is proportional to the score.
For a search, select "use weight matrix", supply the name of
the file containing the weight matrix, and choose between having
results plotted or listed. If dialogue is requested when the
function is selected users can alter the cutoff score employed.
To create a weight matrix several steps are involved. A file
containing an alignment of known motifs is required. (This file must
be created before the current option is selected. The format is as
follows: each sequence is written on a separate line with at least
one space at the beginning; each sequence is terminated by a space
character, and can be followed by a name. The sequences must be
aligned.) Supply the name of the file of aligned sequences. The
program reads and displays the sequences. Choose between "summing
logs of weights" or summing weights (i.e. whether to multiply or add
weights). If logs are used all scores will be negative. Choose if
all positions in the set of aligned sequences should be used or if a
mask should be applied. If so selected, define a mask as a string of
symbols, in which symbol - means ignore and any other symbol means
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
The program will calculate weights as the frequencies of each
base at each unmasked position in the set of aligned sequences.
These weights are then applied to the set of aligned sequences to
give a range of "observed" scores. The mean and standard deviation
of these scores are displayed. The user is asked to supply several
values to be used when the weight matrix is applied to other
sequences: a cutoff score (by default, the mean minus 3 standard
deviations), a top score for scaling graphical results (by default,
the mean plus 3 standard deviations), and a position to identify
(this means that if a particular base within the motif is used as a
"landmark", such as the A of the AG in splice acceptor sites, then
its position will be marked in plots). All these values are stored
along with the weight matrix. Finally supply the name of a file to
contain the weight matrix.
Weight matrices can be "rescaled" using a set of aligned
sequences in much the same way as a matrix is created. The purpose
is to redefine the cutoff scores, and rescaling does not alter any
other values in the weight matrix file.
The methods have changed considerably but were first outlined
in Staden, R. Nucl. Acids Res. 12 505-519 1984, and Staden, R.
Genetic engineering: principles and methods vol 7, Edited by J.K.
Setlow and A. Hollaender, Plenum publishing corp., 1985.
The methods have always had to deal with the problem of zeroes
in the matrices. The current versions employ "Laplace's Law of
Succession" in which 1 is added to each term.
It is now possible to apply a mask to a set of aligned
sequences in order to give weight to selected positions only.
Sequences have superimposed functions: some parts may be of general
structural importance and give rise to an overall framework, and
other parts give specificity and hence are not common; we may want
to use a set of aligned sequences to define a motif, but want to use
only the framework positions. Alternatively we may want to pick out
only those parts of a set of aligned sequences that give a
particular property, and to ignore other similarities that are due
to some other property and which could obscure the pattern we are
interested in. The ability to define a mask allows certain positions
to be used in the motif and others to be ignored, and yet still
permits the use of a set of aligned sequences to calculate weights.
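The steps described above (masking, adding 1 to each term, and summing weights or their logs) can be sketched in Python. This is a minimal sketch under stated assumptions, not the program's implementation: the names `make_weights` and `score` are hypothetical, the normalisation of counts to frequencies and the use of natural logs are assumptions, and positions beyond the end of a short mask are assumed to be used.

```python
import math

BASES = "TCAG"

def make_weights(aligned, mask=None):
    """Build per-column base frequencies from aligned sequences.
    Columns masked with '-' are stored as None (ignored when scoring)."""
    length = len(aligned[0])
    use = [True] * length
    if mask:
        for i, ch in enumerate(mask[:length]):
            use[i] = ch != "-"
    weights = []
    for col in range(length):
        if not use[col]:
            weights.append(None)
            continue
        counts = {b: 1 for b in BASES}   # add 1 to each term (no zeroes)
        for s in aligned:
            if s[col] in counts:
                counts[s[col]] += 1
        total = sum(counts.values())
        weights.append({b: counts[b] / total for b in BASES})
    return weights

def score(weights, window, use_logs=True):
    """Sum the weights (or their logs) over the unmasked positions."""
    total = 0.0
    for w, b in zip(weights, window):
        if w is None or b not in w:
            continue
        total += math.log(w[b]) if use_logs else w[b]
    return total
```

Because every frequency is below 1, every log term is negative, which is why scores obtained by summing logs are all negative.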
Typical dialogue is shown below.
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =2
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
1 AGCGTGACTCTTCCCGGAA HIS1
2 GAGGTGACTCACTTGGAAG HIS1
3 CGGATGACTCTTTTTTTTT HIS3
4 ACAGTGACTCACGTTTTTT HIS4
5 GTCGTGACTCATATGCTTT ARG3
6 TGAATGACTCACTTTTTGG ARG4
7 TTCTTGACTCGTCTTTTCT CPA1
8 CGAATGACTCTTATTGATG CPA2
9 AGAATGACTAATTTTACTA TRP5
10 TCGTTGACTCATTCTAATC TRP3
11 TTGCTGACTCATTACGATT TRP2
12 GAGATGACTCTTTTTCTTT IV1
13 GCGATGATTCATTTCTCTG IV2
14 TAGATGACTCAGTTTAGTC LEU1
15 TAAGTGACTCAGTTCTTTC LEU4
16 ATGATGACTCTTAAGCATG ILS1
Length of motif 19
? (y/n) (y) Sum logs of weights
? (y/n) (y) Use all motif positions n
x means use, - means ignore
e.g. xx-x---x-x means use positions 1,2,4,8,10
? Mask=----XXXXXXXX
Applying weights to input sequences
1 -27.979 AGCGTGACTCTTCCCGGAA
2 -24.543 GAGGTGACTCACTTGGAAG
3 -20.890 CGGATGACTCTTTTTTTTT
4 -23.087 ACAGTGACTCACGTTTTTT
5 -22.771 GTCGTGACTCATATGCTTT
6 -23.408 TGAATGACTCACTTTTTGG
7 -25.159 TTCTTGACTCGTCTTTTCT
8 -22.679 CGAATGACTCTTATTGATG
9 -24.751 AGAATGACTAATTTTACTA
10 -23.157 TCGTTGACTCATTCTAATC
11 -23.067 TTGCTGACTCATTACGATT
12 -21.449 GAGATGACTCTTTTTCTTT
13 -24.191 GCGATGATTCATTTCTCTG
14 -23.770 TAGATGACTCAGTTTAGTC
15 -22.923 TAAGTGACTCAGTTCTTTC
16 -25.285 ATGATGACTCTTAAGCATG
Top score -20.890 Bottom score -27.979
Mean -23.694 Standard deviation 1.613
Mean minus 3.sd -28.534 Mean plus 3.sd -18.854
? Cutoff score (-999.00-9999.00) (-28.53) =
? Top score for scaling plots (-28.53-999.00) (-18.85) =
? Position to identify (0-19) (1) =
? Title=GCN4 SEQUENCES
? Name for new weight matrix file=1.WTS
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =3
? Name of existing weight matrix file=1.WTS
GCN4 SEQUENCES
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
Length of motif 19
? (y/n) (y) Sum logs of weights n
? (y/n) (y) Use all motif positions
Applying weights to input sequences
1 128.000 AGCGTGACTCTTCCCGGAA
2 148.000 GAGGTGACTCACTTGGAAG
3 172.000 CGGATGACTCTTTTTTTTT
4 160.000 ACAGTGACTCACGTTTTTT
5 161.000 GTCGTGACTCATATGCTTT
6 157.000 TGAATGACTCACTTTTTGG
7 149.000 TTCTTGACTCGTCTTTTCT
8 160.000 CGAATGACTCTTATTGATG
9 151.000 AGAATGACTAATTTTACTA
10 159.000 TCGTTGACTCATTCTAATC
11 158.000 TTGCTGACTCATTACGATT
12 169.000 GAGATGACTCTTTTTCTTT
13 152.000 GCGATGATTCATTTCTCTG
14 157.000 TAGATGACTCAGTTTAGTC
15 160.000 TAAGTGACTCAGTTCTTTC
16 143.000 ATGATGACTCTTAAGCATG
Top score 172.000 Bottom score 128.000
Mean 155.250 Standard deviation 10.034
Mean minus 3.sd 125.147 Mean plus 3.sd 185.353
? Cutoff score (-999.00-9999.00) (125.15) =
? Top score for scaling plots (125.15-999.00) (185.35) =
? Position to identify (0-19) (1) =
? Title=GCN4 SEQUENCES
? Name for new weight matrix file=2.WTS
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=1.WTS
GCN4 SEQUENCES
? (y/n) (y) Plot results n
153 -22.61 GCAGCGACTGATTTGAGTT
169 -28.53 GTTCTGACCACTCAGATCC
172 -27.27 CTGACCACTCAGATCCGGC
219 -27.35 CCAGTGGCTGGCCTGCTAG
268 -27.82 CGAGGGATTTTCGATCTTG
274 -26.99 ATTTTCGATCTTGTGGATG
283 -25.79 CTTGTGGATGATTTTCACG
287 -27.50 TGGATGATTTTCACGTGCG
298 -28.17 CACGTGCGCCGTCATATTG
332 -28.27 TCTTTGAAGCAGAAGGGAC
351 -28.27 AGGGGTACACTTTCACATT
357 -25.05 ACACTTTCACATTTCGCTT
364 -28.51 CACATTTCGCTTATGGGAG
400 -23.77 GAAGTTACTAATGTGCGTG
451 -26.22 ATGCTCGCCCTCTTTGGTG
476 -28.00 TCCCTCACTGAGCCCTCCG
480 -28.33 TCACTGAGCCCTCCGCCTC
517 -23.46 GCTAAGATTCAGCTTGGTT
556 -27.27 TCCAGCACTCAGGTTCGGC
602 -27.01 AACTTGAATCCATCGTTGC
648 -28.45 TGCTAAACACAGCCGGTTT
679 -28.18 CTGTTTGCCCAGTTTGGGC
691 -28.51 TTTGGGCCGCTTCTGGACG
713 -27.67 GGCTTGACCGTGGCTGTGG
803 -25.47 ATGCTGACCATGCTTTTCA
848 -28.11 ATAATGTTAAGTTTGATTC
857 -25.97 AGTTTGATTCCGCTGGCCG
879 -27.85 CCGCTGCTGCTGTTTCCAC
917 -27.77 GCGATGAGGAAGGCTTGTT
931 -27.81 TTGTTGGCGCGCCTGCTCG
952 -23.52 GAGGTGACTACCATCCGTG
977 -28.40 TGCGTGGGTGAGCTGTTGT
? Menu or option number=6
Page through text files
? Name of file to read=1.WTS
GCN4 SEQUENCES
19 1 -28.534 -18.854
P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
N 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
T 0 0 0 0 16 0 0 1 16 0 5 11 10 12 9 6 7 12 6
C 0 0 0 0 0 0 0 15 0 15 0 3 2 2 4 3 2 1 3
A 0 0 0 0 0 0 16 0 0 1 10 0 3 2 0 3 5 2 2
G 0 0 0 0 0 16 0 0 0 0 1 2 1 0 3 4 2 1 5
End of file
@21. TX 3 @ Count base composition
This routine calculates the base composition of the active
region of the sequence as both totals and percentages.
@22. TX 3 @ Count dinucleotide frequencies
This routine simply counts dinucleotide frequencies for the
currently active region of the sequence. It also calculates an
expected distribution based on the base composition. The output
looks like:
T C A G
obs expected obs expected obs expected obs expected
T 8.44 8.25 6.67 7.01 10.35 9.92 3.27 3.54
C 7.49 7.01 6.76 5.95 8.39 8.43 1.76 3.01
A 10.13 9.92 7.78 8.43 11.74 11.93 4.89 4.26
G 2.67 3.54 3.19 3.01 4.06 4.26 2.42 1.52
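The expected values in the table above are consistent with taking the product of the two base frequencies (for example, the expected TT and AA values of 8.25 and 11.93 imply an expected TA of about 9.92). A small Python sketch of that calculation follows; the function name `dinucleotide_table` is hypothetical and the sketch assumes the sequence contains only T, C, A and G.

```python
from collections import Counter

def dinucleotide_table(seq):
    """Return {pair: (observed %, expected %)} for all 16 dinucleotides.
    Expected % is the product of the two base frequencies."""
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two bases")
    bases = "TCAG"
    base_count = Counter(seq)
    pairs = Counter(seq[i:i + 2] for i in range(n - 1))
    table = {}
    for a in bases:
        for b in bases:
            obs = 100.0 * pairs[a + b] / (n - 1)
            exp = 100.0 * (base_count[a] / n) * (base_count[b] / n)
            table[a + b] = (obs, exp)
    return table
```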
@23. TX 3 5 @ Count codons and amino acids
This function counts codons and calculates amino acid composition,
protein molecular weight, and base composition. Users select the
segments of the sequence that the program should analyse.
Choose between being shown observed counts or counts
normalised so that the totals for each amino acid sum to 100. Select
to define segments using either the keyboard or an EMBL feature
table. Define the segments to count over. Select strand for each
segment. Stop selecting segments by typing a zero for "Count from
()". The results are displayed a screenful at a time, and the bell
is sounded to show there is more to come. A zero start position, or
the end of an EMBL feature table, signals the routine to print out
totals for all values.
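The basic counting step for one segment on the + strand can be sketched in Python as below. The name `count_codons` is hypothetical, and dropping an incomplete codon at the end of the segment is an assumption (it is consistent with 1000 bases giving 333 codons in the dialogue further down).

```python
from collections import Counter

def count_codons(seq, start, end):
    """Count codons in seq[start..end] (1-based, inclusive), + strand.
    Any incomplete trailing codon is ignored (an assumption)."""
    segment = seq[start - 1:end]
    usable = len(segment) - len(segment) % 3
    return Counter(segment[i:i + 3] for i in range(0, usable, 3))
```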
The counts are broken down into several figures. Base
composition by position in codon expressed as a percentage of each
base's own frequency; base composition by position in codon
expressed as a percentage of the overall base composition of the
section; base composition expected for this amino acid composition
if there was no codon preference; percentage deviations of the
observed amino acid composition from an average amino acid
composition.
The output looks like:
===========================================
F TTT 1. S TCT 2. Y TAT 2. C TGT 1.
F TTC 1. S TCC 1. Y TAC 3. C TGC 2.
L TTA 7. S TCA 4. * TAA 9. * TGA 1.
L TTG 2. S TCG 1. * TAG 2. W TGG 2.
===========================================
L CTT 3. P CCT 2. H CAT 4. R CGT 1.
L CTC 2. P CCC 3. H CAC 1. R CGC 0.
L CTA 3. P CCA 2. Q CAA 4. R CGA 0.
L CTG 2. P CCG 2. Q CAG 1. R CGG 2.
===========================================
I ATT 9. T ACT 1. N AAT 7. S AGT 3.
I ATC 2. T ACC 2. N AAC 4. S AGC 2.
I ATA 4. T ACA 5. K AAA 13. R AGA 5.
M ATG 1. T ACG 2. K AAG 4. R AGG 1.
===========================================
V GTT 2. A GCT 2. D GAT 1. G GGT 3.
V GTC 2. A GCC 2. D GAC 1. G GGC 1.
V GTA 4. A GCA 3. E GAA 2. G GGA 1.
V GTG 2. A GCG 0. E GAG 1. G GGG 1.
===========================================
total codons= 166.
T C A G
1 31.06 33.68 34.03 35.00
2 35.61 35.79 30.89 32.50
3 33.33 30.53 35.08 32.50
1 24.70 19.28 39.16 16.87
2 28.31 20.48 35.54 15.66
3 26.51 17.47 40.36 15.66
% 26.51 19.08 38.35 16.06 observed, overall totals
% 25.00 22.26 33.10 19.65 expected, even codons per acid
A C D E F G H I K L
7. 3. 2. 3. 2. 6. 5. 15. 17. 19.
o-e % -47. -33. -76. -68. -64. -54. 62. 116. 67. 67.
M N P Q R S T V W Y
1. 11. 9. 5. 9. 13. 10. 10. 2. 5.
o-e % -62. 66. 12. -17. 19. 21. 6. -2. 0. -5.
total acids= 154. molecular weight= 17421.
Typical dialogue follows.
? Menu or option number=23
Calculate codon usage, base composition
and amino acid composition
? (y/n) (y) Show observed counts
? (y/n) (y) Define segments using keyboard
? Count from (0-1023) (0) =1
? Count to (1-1023) (1023) =1000
? (y/n) (y) + strand
===========================================
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
===========================================
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
===========================================
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
===========================================
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
===========================================
Total codons= 333.
T C A G
1 23.32 37.69 28.99 40.06
2 37.15 22.31 38.46 36.59
3 39.53 40.00 32.54 23.34
----- ----- ----- -----
= 100% 100% 100% 100%
1 17.72 29.43 14.71 38.14 = 100%
2 28.23 17.42 19.52 34.83 = 100%
3 30.03 31.23 16.52 22.22 = 100%
% 25.33 26.03 16.92 31.73 Observed, overall totals
% 24.44 22.31 20.90 32.35 Expected, even codons per acid
A C D E F G H I K L
33. 10. 15. 9. 17. 40. 10. 12. 5. 33.
O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13.
M N P Q R S T V W Y
2. 7. 2. 13. 49. 20. 11. 30. 5. 2.
O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81.
Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8
? Count from (0-1023) (0) =
Codon totals over all genes
===========================================
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
===========================================
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
===========================================
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
===========================================
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
===========================================
Total codons= 333.
T C A G
1 23.32 37.69 28.99 40.06
2 37.15 22.31 38.46 36.59
3 39.53 40.00 32.54 23.34
----- ----- ----- -----
= 100% 100% 100% 100%
1 17.72 29.43 14.71 38.14 = 100%
2 28.23 17.42 19.52 34.83 = 100%
3 30.03 31.23 16.52 22.22 = 100%
% 25.33 26.03 16.92 31.73 Observed, overall totals
% 24.44 22.31 20.90 32.35 Expected, even codons per acid
A C D E F G H I K L
33. 10. 15. 9. 17. 40. 10. 12. 5. 33.
O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13.
M N P Q R S T V W Y
2. 7. 2. 13. 49. 20. 11. 30. 5. 2.
O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81.
Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8
@24. TX 3 @ Plot base composition
This option plots the base composition of the sequence. The
counts for any combination of bases can be plotted.
If dialogue is requested the user is presented with a check
box for selecting which bases should be counted, and then allowed to
define a window length, and a "plot interval". Otherwise, the AT
composition is plotted with a window of 101 and a plot interval of
5.
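A windowed composition of this kind can be sketched in Python as follows. The name `composition_plot` is hypothetical, the defaults mirror the values quoted above, and how the program treats the ends of the sequence is not stated here, so this sketch simply skips windows that would run off either end.

```python
def composition_plot(seq, bases="AT", window=101, interval=5):
    """Return (1-based centre position, fraction of chosen bases) points
    for an odd-length window moved along the sequence."""
    half = window // 2
    points = []
    for centre in range(half, len(seq) - half, interval):
        win = seq[centre - half:centre + half + 1]
        frac = sum(win.count(b) for b in bases) / window
        points.append((centre + 1, frac))
    return points
```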
Typical dialogue follows.
? Menu or option number=d24
Plot base composition
checkbox: those set are marked X
X 1 T
2 C
X 3 A
4 G
? 0,1,2,3,4 =1
checkbox: those set are marked X
1 T
2 C
X 3 A
4 G
? 0,1,2,3,4 =3
checkbox: those set are marked X
1 T
2 C
3 A
4 G
? 0,1,2,3,4 =2
checkbox: those set are marked X
1 T
X 2 C
3 A
4 G
? 0,1,2,3,4 =4
checkbox: those set are marked X
1 T
X 2 C
3 A
X 4 G
? 0,1,2,3,4 =
? odd span length (1-201) (31) =
? plot interval (1-11) (5) =
missing graphics
@25. TX 3 @ Plot local deviations in base composition
The "local deviation" routines are designed to indicate the
similarity of the compositions of different parts of the sequence.
The composition of every segment of the sequence is compared with a
standard composition. The levels of similarity are plotted as chi
squared values. The standard can be the composition of the whole
sequence, or alternatively that of a small segment defined by the
user.
If dialogue is forced, define the standard region, the window
length and the plot interval. Otherwise the composition of the whole
sequence is taken as a standard. The maximum and minimum observed
values of the chi squared calculation are displayed, and plots will
always exactly fill the available box. Any unusual regions will show
as peaks.
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the
observed composition and exp is the expected composition (the
composition of the standard). The calculation is performed once to
find out the range of values and is then repeated and plotted so
that the plot exactly fills the allocated screen space.
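The measure can be sketched in Python as below. The bracketing of the formula above is ambiguous, so this sketch assumes each term is (obs-exp)*(obs-exp)/(exp*exp), summed over the four bases; the names `composition` and `local_deviation` are hypothetical.

```python
def composition(seq, bases="TCAG"):
    """Fraction of each base in seq."""
    n = max(len(seq), 1)
    return {b: seq.count(b) / n for b in bases}

def local_deviation(seq, window, interval, standard=None):
    """Return (1-based window start, deviation) along the sequence.
    The standard defaults to the composition of the whole sequence."""
    exp = composition(standard if standard is not None else seq)
    values = []
    for start in range(0, len(seq) - window + 1, interval):
        obs = composition(seq[start:start + window])
        # Assumed reading of the formula: each term divided by exp*exp.
        dev = sum((obs[b] - exp[b]) ** 2 / (exp[b] * exp[b])
                  for b in exp if exp[b] > 0)
        values.append((start + 1, dev))
    return values
```

The minimum and maximum of the returned values give the range used to scale the plot so that it fills the box.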
@26. TX 3 @ Plot local deviations from dinucleotide composition
The "local deviation" routines are designed to indicate the
similarity of the compositions of different parts of the sequence.
The dinucleotide composition of every segment of the sequence is
compared with a standard composition. The levels of similarity are
plotted as chi squared values. The standard can be the composition
of the whole sequence, or alternatively that of a small segment
defined by the user.
If dialogue is forced, define the standard region, the window
length and the plot interval. Otherwise the composition of the whole
sequence is taken as a standard. The maximum and minimum observed
values of the chi squared calculation are displayed, and plots will
always exactly fill the available box. Any unusual regions will show
as peaks.
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the
observed composition and exp is the expected composition (the
composition of the standard). The calculation is performed once to
find out the range of values and is then repeated and plotted so
that the plot exactly fills the allocated screen space.
@27. TX 3 @ Plot local deviations from trinucleotide composition
The "local deviation" routines are designed to indicate the
similarity of the compositions of different parts of the sequence.
The trinucleotide composition of every segment of the sequence is
compared with a standard composition. The levels of similarity are
plotted as chi squared values. The standard can be the composition
of the whole sequence, or alternatively that of a small segment
defined by the user.
If dialogue is forced, define the standard region, the window
length and the plot interval. Otherwise the composition of the whole
sequence is taken as a standard. The maximum and minimum observed
values of the chi squared calculation are displayed, and plots will
always exactly fill the available box. Any unusual regions will show
as peaks.
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp)) where obs is the
observed composition and exp is the expected composition (the
composition of the standard). The calculation is performed once to
find out the range of values and is then repeated and plotted so
that the plot exactly fills the allocated screen space.
@28. TX 5 @ Calculate codon constraint
The purpose of this option (which is somewhat specialised) is
to measure the level of constraint imposed on the sequence by coding
for a protein of the observed composition. It measures the strength
of the codon bias averaged over windows of 99 codons and displays
the values observed.
Select between defining segments at the keyboard or using an
EMBL feature table. Finish selecting segments by typing a zero
start. The value for each segment is displayed:
Mean (W-EW) / EWD, window 99 10.5
The codon constraint is the difference between the observed
codon improbability and the mean improbability for a sequence of the
same composition. See McLachlan, Staden and Boswell Nucl. Acids
Res. 1984.
@29. TX 3 @ Plot negentropy
This routine is designed to show regions of the sequence that
differ in composition from others, and hence is like the "plot
deviation.." routines.
Negentropy or information is defined in the following way: let
Pi be the probability of observing base i, where i = A,C,G or T,
then the average information per base is I=-sum(Pi.Log(Pi)) (sum
over all i). This routine calculates Pi by calculating the overall
composition for the sequence and then plots I for windows of length
defined by the user.
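The quantity I=-sum(Pi.Log(Pi)) can be sketched in Python as below. This sketch estimates Pi from the base counts inside each window and uses natural logarithms; both choices are assumptions made for illustration, and the name `information` is hypothetical.

```python
import math

def information(seq, window, interval=1):
    """Return (1-based window start, I) with I = -sum(Pi * log(Pi)),
    where Pi is the frequency of base i within the window (assumed)."""
    values = []
    for start in range(0, len(seq) - window + 1, interval):
        win = seq[start:start + window]
        total = 0.0
        for b in "ACGT":
            p = win.count(b) / window
            if p > 0:                 # 0 * log(0) is taken as 0
                total -= p * math.log(p)
        values.append((start + 1, total))
    return values
```

A window containing a single base type gives I = 0, and a window with all four bases equally represented gives the maximum, log(4).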
@30. TX 4 @ Search for hairpin loops
Used to find simple inverted repeats or potential hairpin
loops. The loops are defined by a range of sizes for the loop and a
minimum number of consecutive base pairs in the stem. The results
can be presented graphically or listed. A-T, G-C and G-T basepairs
are counted.
Define the range of loop sizes and the minimum number of
consecutive basepairs required. Choose between plotted or listed
results.
The loops found are plotted as blips on a horizontal line that
represents the sequence; the heights of the blips are proportional
to the number of basepairs in the stems. Note that only
uninterrupted stems are found - i.e. all basepairs must be made. To
look for stems with some unpaired bases (or for palindromes) use the
inverted repeat motif class in the pattern searching option.
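A search of this kind can be sketched in Python as follows. It is a simplified illustration rather than the program's algorithm: the names `PAIRS` and `find_hairpins` are hypothetical, and the sketch reports only the longest uninterrupted stem for each loop position and loop size.

```python
# Base pairs that are counted: A-T, G-C and G-T, in either orientation.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"),
         ("G", "T"), ("T", "G")}

def find_hairpins(seq, min_loop=1, max_loop=3, min_stem=6):
    """Return (1-based loop start, loop size, stem length) for each
    hairpin with at least min_stem consecutive base pairs."""
    found = []
    n = len(seq)
    for loop in range(min_loop, max_loop + 1):
        for left in range(0, n - loop - 1):
            # left = last stem base before the loop, right = first after.
            right = left + loop + 1
            stem = 0
            while (left - stem >= 0 and right + stem < n
                   and (seq[left - stem], seq[right + stem]) in PAIRS):
                stem += 1
            if stem >= min_stem:
                found.append((left + 2, loop, stem))
    return found
```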
Typical dialogue follows.
? Menu or option number=30
Search for hairpin loops
Define the range of loop sizes
? Minimum loop size (1-30) (1) =
? Maximum loop size (3-60) (3) =
? Minimum number of basepairs (2-20) (6) =
? (y/n) (y) Plot results n
Searching
T.G
G-C
G.T
T.G
C-G
G-C
T.G
C-G
G.T
GCCGCA GCGGAGG
49
G
G-C
T.G
C-G
G.T
T.G
G-C
CTGCTG GGAGGTC
56
G
T.G
G-C
G.T
T.G
C-G
G-C
T-A
T.G
AGCGCA CGACTGA
139
A C
G.T
C-G
G.T
C-G
C-G
G-C
TTCGCT CAACGCC
244
@31. TX 4 @ Search for long range inverted repeats
Searches for inverted repeats. The repeats found are exact
matches of at least 6 consecutive bases. Results can be presented
graphically or listed. Plotted results show the end points of
repeats joined by rectangular lines.
If dialogue is not requested the defaults will be taken.
Otherwise choose between plotted or listed results. If required
select to analyse a restricted segment of the currently active
region. Choose a repeat length.
Typical dialogue follows.
? Menu or option number=D31
Plot long-range inverted repeats
? (y/n) (y) Plot results n
Define restricted region
? start (1-1023) (1) =
? end (2-1023) (1023) =
? Minimum inverted repeat (6-30) (12) =10
Searching
27 909 10 TGCCCAGAGA
@32. TX 4 @ Search for repeats
Searches for direct repeats. The repeats found are exact
matches of at least 6 consecutive bases. Results can be presented
graphically or listed. Plotted results show the end points of
repeats joined by rectangular lines.
If dialogue is not requested the defaults will be taken.
Otherwise choose between plotted or listed results. If required
select to analyse a restricted segment of the currently active
region. Choose a repeat length.
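One way to find exact direct repeats is to index every word of the minimum length and then extend each matching pair, as in the Python sketch below. The name `direct_repeats` is hypothetical and this is not necessarily how the program works; in particular the sketch reports only maximal repeats, whereas the listing below also shows shorter repeats contained within longer ones.

```python
from collections import defaultdict

def direct_repeats(seq, min_len=12):
    """Return sorted (pos1, pos2, length, text) for maximal exact
    direct repeats of at least min_len bases (1-based positions)."""
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        index[seq[i:i + min_len]].append(i)
    repeats = []
    for positions in index.values():
        for k, a in enumerate(positions):
            for b in positions[k + 1:]:
                # Skip pairs that only continue a repeat starting earlier.
                if a > 0 and seq[a - 1] == seq[b - 1]:
                    continue
                length = min_len
                while (b + length < len(seq)
                       and seq[a + length] == seq[b + length]):
                    length += 1
                repeats.append((a + 1, b + 1, length, seq[a:a + length]))
    return sorted(repeats)
```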
Typical dialogue follows.
? Menu or option number=D32
Plot repeats
? (y/n) (y) Plot results n
Define restricted region
? start (1-1023) (1) =
? end (2-1023) (1023) =
? Minimum repeat (6-30) (12) =8
Searching
619 988 8 GCTGTTGT
514 646 8 GCTGCTAA
94 865 8 TCCGCTGG
146 222 9 GTGGCTGGC
455 497 8 TCGCCCTC
454 496 9 CTCGCCCTC
872 875 8 GCCGCCGC
510 615 8 CGTTGCTG
152 913 8 GGCAGCGA
199 265 8 CGTCGAGG
689 794 8 AGTTTGGG
147 223 8 TGGCTGGC
101 116 8 GACGAGGA
8 690 8 GTTTGGGC
52 141 8 TGCTGGTG
@33. TX 4 @ Search for z dna (total ry, yr)
Searches for segments of the sequence that might form Z DNA. A
window length is chosen and the number of RY and YR dinucleotides
within each window is plotted. The top of the box corresponds to all
RY or YR, the bottom to zero RY or YR.
If dialogue is requested, select a window length and plot
interval. Otherwise the defaults will be used.
The program contains three separate ways of doing this
(options 33,34,35).
@34. TX 4 @ Search for z dna (runs of ry, yr)
Searches for segments of the sequence that might form Z DNA.
Results are plotted.
If dialogue is requested define a window length and plot
interval. Otherwise the defaults will be used. The routine counts
the number of R in positions 1,3,5 etc =R1, the number of Y in
positions 2,4,6 etc =Y1, the number of Y in positions 1,3,5 etc =Y2
and the number of R in positions 2,4,6 etc =R2 for a window length.
It plots the maximum of R1+Y1 and R2+Y2 relative to a minimum of
(window length)/2 and a maximum of (window length). (see 33,35).
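The counts R1, Y1, Y2 and R2 for a single window can be sketched in Python as below; the function name `phased_ry_score` is hypothetical.

```python
PURINES = set("AG")
PYRIMIDINES = set("CT")

def phased_ry_score(window):
    """Return max(R1+Y1, R2+Y2) for one window of sequence."""
    odd = window[0::2]     # positions 1,3,5,...
    even = window[1::2]    # positions 2,4,6,...
    r1 = sum(b in PURINES for b in odd)
    y1 = sum(b in PYRIMIDINES for b in even)
    y2 = sum(b in PYRIMIDINES for b in odd)
    r2 = sum(b in PURINES for b in even)
    return max(r1 + y1, r2 + y2)
```

A perfectly alternating window such as GCGCGC scores the full window length, while a run of purines scores half the window length, matching the plot limits described above.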
@35. TX 4 @ Search for z dna (best phased value)
Searches for segments of the sequence that might form Z DNA.
Results are plotted.
If dialogue is requested define a window length and a plot
interval. Otherwise the default values will be used.
The routine counts the number of consecutive RY or YR
dinucleotides in phase. It moves through the sequence counting the
number of RY or YR dinucleotides; when the next dinucleotide is not
of the correct type the score is set back to zero and the search
restarted using the current base to set the phase. The plots are
done relative to a minimum of zero and a maximum defined by the
user. (See 33,34).
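One possible reading of this counting rule is sketched below in
Python. It is an interpretation of the description above, not the
program's code: the names are invented, the dinucleotides are taken
in steps of two bases, and "the correct type" is assumed to mean the
type (RY or YR) of the dinucleotide that started the current run.

```python
def base_class(base):
    """Classify a base as purine (R), pyrimidine (Y) or other (N)."""
    if base in "AG":
        return "R"
    if base in "CT":
        return "Y"
    return "N"

def phased_run_scores(seq):
    """Return the running score, recorded at the end of each counted
    dinucleotide (all other positions are left at zero)."""
    seq = seq.upper()
    scores = [0] * len(seq)
    score = 0
    kind = None          # "RY" or "YR" once a run has started
    i = 0
    while i + 1 < len(seq):
        pair = base_class(seq[i]) + base_class(seq[i + 1])
        alternating = pair in ("RY", "YR")
        if alternating and (kind is None or pair == kind):
            kind = pair
            score += 1
            scores[i + 1] = score
            i += 2       # stay in phase
        else:
            score = 0    # run broken: reset and restart here
            kind = None
            if not alternating:
                i += 1
    return scores
```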
@36. TX 4 @ Local similarity or complementarity search
This function is designed to find segments of local similarity
or complementarity. It is therefore like performing a DIAGON plot
that is restricted to regions near the main diagonal. Results can
be presented graphically or listed.
Users define a region to search through, a span length, a
range for searching through and a cut-off score. The program takes
all sections of sequence of length span within the defined region
and compares them to all other sequences within the region and range
specified. If a match above the cutoff is found we need to show the
position of the two sections of sequence and the score, and we do it
in the following way. If we have a 70% match between a sequence
that starts at p1 and a sequence that starts at p2 the program draws
a diagonal line that starts at p1 with height 70% of the box and
which finishes at p2 with height 0. The matches can also be listed.
Here I define the terms range, region, and span and what is
compared. Suppose we have a defined region j1 to j2, a range of i1
to i2 and a span of s; the program will take, in turn, all sections
of sequence of length s within j1 and j2 and compare them to all
sequences that start a distance i1+s-1 to i2+s-1 away from them.
First it will take the sequence of length s starting at j1 and
compare it with the sequence of length s starting at j1+s-1+i1, then
j1+s-1+i1+1, etc up to j1+s-1+i2; then it will take the sequence of
length s starting at j1+1 and compare it with the sequence starting
at j1+s-1+1+i1 etc. This continues until we hit the right hand end
of the sequence as defined by j2. Note 1) that sequences are not
compared with themselves: the nearest sequence compared to a span s
starting at j starts at j+s; 2) ranges i1 and i2 are ranges of start
positions; 3) by choosing a range greater than the length of the
sequence this routine will do a full DIAGON analysis except for
those points within a distance span of the main diagonal (see note
1).
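The comparison scheme defined above can be sketched in Python as
follows. This is an illustration only: the function name is
invented, only direct similarity is scored (complementarity is not
handled), and a plain percentage of identical bases is assumed as
the score.

```python
def local_similarity(seq, span, j1, j2, i1, i2, min_percent):
    """Return (p1, p2, percent) for each pair of spans that match.

    All positions are 1-based. The span starting at p1 is compared
    with the spans starting at p1+span-1+i for i from i1 to i2,
    provided they lie within the region j1 to j2.
    """
    seq = seq.upper()
    hits = []
    for p1 in range(j1, j2 - span + 2):
        a = seq[p1 - 1:p1 - 1 + span]
        for i in range(i1, i2 + 1):
            p2 = p1 + span - 1 + i
            if p2 + span - 1 > j2:
                break                     # ran past the region end
            b = seq[p2 - 1:p2 - 1 + span]
            matches = sum(x == y for x, y in zip(a, b))
            percent = 100.0 * matches / span
            if percent >= min_percent:
                hits.append((p1, p2, percent))
    return hits
```

With i1=1 the first comparison is with the span starting at p1+span,
in agreement with note 1.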
Typical dialogue follows.
? Menu or option number=36
Search for local similarity or complementarity
? (y/n) (y) Find direct repeats
? (y/n) (y) Keep picture n
? Span (5-200) (15) =
Define restricted region
? start (0-1023) (1) =
? end (2-1023) (1023) =
? Percent match (1.00-100.00) (70.00) =
? Range start (1-50) (1) =
? Range end (1-50) (1) =5
? (y/n) (y) Plot results n
Working
118 128
CGAGGAGGAG GTGGA
** ***** ** **
GGACGAGGAC GTCGA
100 110
119 129
GAGGAGGAGG TGGAT
** ***** * * **
GACGAGGACG TCGAC
101 111
? (y/n) (y) Find direct repeats n
? (y/n) (y) Keep picture
? Span (5-200) (15) =
Define restricted region
? start (0-1023) (1) =
? end (2-1023) (1023) =
? Percent match (1.00-100.00) (70.00) =
? Range start (1-50) (1) =
? Range end (1-50) (5) =8
? (y/n) (y) List results
Working
178 188
ACTCAGATCC GGCGG
***** *** * **
ACTCAAATCA GTCGC
156 166
177 187
CACTCAGATC CGGCG
***** *** * **
AACTCAAATC AGTCG
157 167
? (y/n) (y) Find inverted repeats !
@37. TX 5 @ Set genetic code
This function allows the user to change the current active
genetic code for all the options. The user may select: the standard
code, the mammalian mitochondrial code, the yeast mitochondrial code
or a personal code (define your own).
Select code. If personal, define a codon and select an amino
acid. When all codons have been reset define a blank codon.
The code differences are:
Mammalian Yeast
Codon Mitochondrial Mitochondrial Standard
UGA W W STOP
AUA M M I
CUA L T L
AGA STOP R R
AGG STOP R R
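The table above can be held as a small lookup, as in the following
Python sketch. The names are invented for this illustration; only
the five codons listed in the table are covered.

```python
# codon: (mammalian mitochondrial, yeast mitochondrial, standard)
# "*" stands for STOP, as in the translation listings.
CODE_DIFFERENCES = {
    "UGA": ("W", "W", "*"),
    "AUA": ("M", "M", "I"),
    "CUA": ("L", "T", "L"),
    "AGA": ("*", "R", "R"),
    "AGG": ("*", "R", "R"),
}
COLUMN = {"mammalian": 0, "yeast": 1, "standard": 2}

def overrides(code_name):
    """Return the codons whose meaning differs from the standard code."""
    col = COLUMN[code_name]
    return {codon: row[col]
            for codon, row in CODE_DIFFERENCES.items()
            if row[col] != row[2]}
```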
Typical dialogue follows.
? Menu or option number=37
X 1 Standard code
2 Mammalian mitochondrial code
3 Yeast mitochondrial code
4 Personal code
? 0,1,2,3,4 =2
? Menu or option number=37
X 1 Standard code
2 Mammalian mitochondrial code
3 Yeast mitochondrial code
4 Personal code
? 0,1,2,3,4 =4
Define genetic code by typing a codon
followed by a 1 letter amino acid symbol
? Codon=TTT
Default Amino acid symbol=F
? Amino acid symbol=W
? Codon=
@38. T 3 4 @ Examine repeats
This function can be used to examine the frequencies of
repeated words within a sequence. It finds all words that occur more
than once. The user selects a minimum word length and the program
finds all words of that length that occur more than once; then it
"follows" each repeated word until it becomes unique. For each word
length it can report the number of different repeated words, the
number of occurrences of each word, and their actual positions and
sequences.
It is possible that the algorithm may run out of memory,
particularly if a short minimum word length is chosen, or if the
sequence is very long or very repetitive. If this occurs the longest
reported word length will not necessarily be the longest in the
sequence: the memory will have been consumed before the longest word
is found.
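The idea of finding repeated words and lengthening them until they
become unique can be sketched in Python as follows. This is not the
program's algorithm: for simplicity the sketch recounts all words at
each length rather than "following" each repeat, and the function
name is invented.

```python
from collections import defaultdict

def repeated_words(seq, min_len):
    """Return {length: {word: [positions]}} for repeated words.

    Positions are 1-based. Lengths are tried from min_len upwards
    until no word of that length occurs more than once.
    """
    seq = seq.lower()
    result = {}
    length = min_len
    while length <= len(seq):
        positions = defaultdict(list)
        for i in range(len(seq) - length + 1):
            positions[seq[i:i + length]].append(i + 1)
        repeats = {w: p for w, p in positions.items() if len(p) > 1}
        if not repeats:
            break
        result[length] = repeats
        length += 1
    return result
```

For each length, len(result[length]) gives the number of different
repeated words, and the lists give where each one occurs.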
Typical dialogue and output are shown below.
Expected length of longest repeat 14
? Minumim word length (1-6) (6) =6
Working
? Show repeat frequencies for words of at least length (6-15) (15) =10
For length 10 the number of different repeated words is 2035
For length 11 the number of different repeated words is 613
For length 12 the number of different repeated words is 161
For length 13 the number of different repeated words is 37
For length 14 the number of different repeated words is 10
For length 15 the number of different repeated words is 1
? Show repeats for words of length (6-15) (15) =14
? Show repeats for words occuring with frequency (2-9999) (2) =2
ggtgctcatgccca
occurs at 21611
occurs at 21851
ttatccggtgatga
occurs at 4604
occurs at 8806
agcaccacgctgac
occurs at 5954
occurs at 9486
catgacggaggatg
occurs at 10480
occurs at 19925
aaagacgggaaaat
occurs at 11820
occurs at 43157
tacaaaaccaattt
occurs at 26797
occurs at 31369
cgagaaagagtgcg
occurs at 4260
occurs at 44305
gccggatgatggcg
occurs at 7893
occurs at 16638
atgacggaggatga
occurs at 10481
occurs at 19926
gcggcgaacgaggc
occurs at 11352
occurs at 18718
? Show repeats for words of length (6-15) (15) =!
Example of not enough memory
----------------------------
Expected length of longest repeat 14
? Minumim word length (1-6) (6) =1
Working
Not enough memory
Memory used in bytes 1125996. Length of longest repeat 5
? Show repeat frequencies for words of at least length (1-5) (5) =!
@39. TX 5 @ Translate and list in upto six phases
This is a general listing function that will perform
translations and produce several forms of output. The possibilities
are:
1) no translation, list one or two strands, two ways of numbering the
sequence.
2) translation, one or two strands, one or three letter codes.
Positions defined by:
a) open reading frames of some minimum length l, l can be 0, hence giving
a complete six phase translation.
b) positions typed on keyboard, again 1 to 6 phases, translations appearing
above and below the dna.
c) positions read from a feature table.
It should be used in preference to option 5. For publication
without a translation, the option to number ends of lines is more compact
than option 5. Some examples and typical dialogue are given below. Note that
the option is entered as D39 so that the dialogue is offered.
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
1 Typed on the keyboard
2 Read from a feature table
X 3 Open reading frames
? 1,2,3 =
? Minimum open frame in amino acids (0-7238) (30) =
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines
N A T T I S R I D A T F S A R A P N E N
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
. : . : . : . : . : . :
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
* S A G W I F I
A V V I L L I S A V K E A R A G F S F
I A K Q V I D H L R N V S N G Q T K S T
L N R L L T I C E M Y L M V K L N L L
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
. : . : . : . : . : . :
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
Y S F L N N V M Q S I Y R I T L S F R S
I A L C T I S W K R F T D L P * V L D V
R S Q N W E S T V T W N E T S R H R T L
V R R I G N Q L L H G M K L P D T V L *
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
. : . : . : . : . : . :
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
T R L I P F
R E C F Q S D V T V H F S V E L C R V K
V A Y L K H V E L Q H Q I Q Q L S S K P
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
. : . : . : . : . : . :
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
T A Y K F C T S S C C W I
S A K M T S Y Q K E Q L K V L S N P D L
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
. : . : . : . : . : . :
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation N
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =
? (y/n) (y) Number ends of lines
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
1 Typed on the keyboard
2 Read from a feature table
X 3 Open reading frames
? 1,2,3 =
? Minimum open frame in amino acids (0-7238) (30) =0
? (y/n) (y) Use 1 letter codes N
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines
AsnAlaThrThrIleSerArgIleAspAlaThrPheSerAlaArgAlaProAsnGluAsn
ThrLeuLeuLeuLeuValGluLeuMetProProPheGlnLeuAlaProGlnMetLysIle
ArgTyrTyrTyr******Asn***CysHisLeuPheSerSerArgProLys***Lys
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
. : . : . : . : . : . :
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
ValSerSerSerAsnThrSerAsnIleGlyGlyLys***SerAlaGlyTrpIlePheIle
Arg************TyrPheGlnHisTrpArgLysLeuGluArgGlyLeuHisPheTyr
AlaValValIleLeuLeuIleSerAlaValLysGluAlaArgAlaGlyPheSerPhe
IleAlaLysGlnValIleAspHisLeuArgAsnValSerAsnGlyGlnThrLysSerThr
***LeuAsnArgLeuLeuThrIleCysGluMetTyrLeuMetValLysLeuAsnLeuLeu
TyrSer***ThrGlyTyr***ProPheAlaLysCysIle***TrpSerAsn***IleTyr
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
. : . : . : . : . : . :
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
TyrSerPheLeuAsnAsnValMetGlnSerIleTyrArgIleThrLeuSerPheArgSer
Leu***ValPro***GlnGlyAsnAlaPheHisIle***HisAspPhe***Ile***Glu
IleAlaLeuCysThrIleSerTrpLysArgPheThrAspLeuPro***ValLeuAspVal
ArgSerGlnAsnTrpGluSerThrValThrTrpAsnGluThrSerArgHisArgThrLeu
ValArgArgIleGlyAsnGlnLeuLeuHisGlyMetLysLeuProAspThrValLeu***
SerPheAlaGluLeuGlyIleAsnCysTyrMetGlu***AsnPheGlnThrProTyrPhe
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
. : . : . : . : . : . :
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
ThrArgLeuIleProPhe***SerAsnCysProIlePheSerGlySerValThrSer***
AsnAlaSerAsnProIleLeuGln***MetSerHisPheLysTrpValGlyTyrLysLeu
ArgGluCysPheGlnSerAspValThrValHisPheSerValGluLeuCysArgValLys
ValAlaTyrLeuLysHisValGluLeuGlnHisGlnIleGlnGlnLeuSerSerLysPro
LeuHisIle***AsnMetLeuSerTyrSerThrArgPheSerAsn***AlaLeuSerHis
SerCysIlePheLysThrCys***AlaThrAlaProAspSerAlaIleLysLeu***Ala
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
. : . : . : . : . : . :
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
AsnCysIle***PheMetAsnLeu***LeuValLeuAsnLeuLeu***AlaArgLeuTrp
GlnMetAsnLeuValHisGlnAlaValAlaGlySerGluAlaIleLeuSer***AlaMet
ThrAlaTyrLysPheCysThrSerSerCysCysTrpIle***CysAsnLeuGluLeuGly
SerAlaLysMetThrSerTyrGlnLysGluGlnLeuLysValLeuSerAsnProAspLeu
ProGlnLys***ProLeuIleLysArgSerAsn***ArgTyrSerLeuIleLeuThrCys
IleArgLysAsnAspLeuLeuSerLysGlyAlaIleLysGlyThrLeu***Ser***Pro
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
. : . : . : . : . : . :
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
GlyCysPheHisGlyArgIleLeuLeuLeuLeu***LeuTyrGluArgIleArgValGln
ArgLeuPheSerArgLysAspPheProAlaIleLeuProValArg***AspGlnGlyThr
AspAlaPheIleValGlu******PheSerCysAsnPheThrSerGluLeuGlySerArg
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
1 Typed on the keyboard
2 Read from a feature table
X 3 Open reading frames
? 1,2,3 =1
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =
? (y/n) (y) Number ends of lines N
Translate
? From (0-300) (0) =101
? To (1-300) (300) =300
Translate
? From (0-300) (0) =102
? To (1-300) (300) =200
Translate
? From (0-300) (0) =
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT
10 20 30 40 50 60
M V K L N L L
W S N * I Y
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT
70 80 90 100 110 120
V R R I G N Q L L H G M K L P D T V L *
S F A E L G I N C Y M E * N F Q T P Y F
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA
130 140 150 160 170 180
L H I * N M L S Y S T R F S N * A L S H
S C I F K T C
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA
190 200 210 220 230 240
P Q K * P L I K R S N * R Y S L I L T C
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG
250 260 270 280 290 300
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
1 Typed on the keyboard
2 Read from a feature table
X 3 Open reading frames
? 1,2,3 =2
? Embl feature table file=1.FT
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines
N A T T I S R I D A T F S A R A P N E N
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
. : . : . : . : . : . :
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
* S A G W I F I
A V V I L L I S A V K E A R A G F S F
I A K Q V I D H L R N V S N G Q T K S T
L N R L L T I C E M Y L M V K L N L L
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
. : . : . : . : . : . :
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
Y S F L N N V M Q S I Y R I T L S F R S
I A L C T I S W K R F T D L P * V L D V
R S Q N W E S T V T W N E T S R H R T L
V R R I G N Q L L H G M K L P D T V L *
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
. : . : . : . : . : . :
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
T R L I P F
R E C F Q S D V T V H F S V E L C R V K
V A Y L K H V E L Q H Q I Q Q L S S K P
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
. : . : . : . : . : . :
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
T A Y K F C T S S C C W I
S A K M T S Y Q K E Q L K V L S N P D L
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
. : . : . : . : . : . :
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
* L Y E R I R V Q
* F S C N F T S E L G S R
@40. TX 5 @ Translate and write the protein sequence to disk
This routine allows the user to translate sections of the
sequence into the 1 letter amino acid codes and store the resulting
amino acid sequences in a disk file. Two modes of use are possible.
Either all open reading frames of at least some minimum length will
automatically be found and translated, or the user can specify that
particular segments be translated.
Mode 1: the user selects to translate all open reading
frames.
Either, or both, strands can be translated. The output file
is in the same format as a PIR .seq file. Each protein segment is
given an entry name that is its start base in the DNA, and a title
that includes its end position, reading frame and strand (+ for
plus, - for minus). Each segment is terminated by * whether or not
there is a stop codon in the DNA. The file is therefore suitable for
input to FASTA, ALIGNL and ANALYSEPL.
Mode 2: the user selects to identify the segments to
translate.
Either, or both, strands can be translated. If multiple
coding regions are translated each will be separated from the
previous one by a gap of 5 dashes (-----). The sections to
translate can be defined from the keyboard or by supplying the name
of the appropriate EMBL library feature table.
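Mode 1 can be sketched in Python as follows for the plus strand.
This is an illustration under stated assumptions, not the program's
code: the names are invented, an open frame is taken to be any
stretch between stop codons, the reported end position includes the
stop codon when there is one, and codons containing symbols other
than A, C, G and T are written as X.

```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codon -> 1 letter symbol ("*" is stop).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def open_frames(seq, min_aa):
    """Return (start, end, frame, protein) for each open frame of at
    least min_aa amino acids; positions are 1-based."""
    seq = seq.upper().replace("U", "T")
    found = []
    for frame in range(3):
        start = frame
        protein = []
        for i in range(frame, len(seq) - 2, 3):
            aa = CODE.get(seq[i:i + 3], "X")
            if aa == "*":
                if len(protein) >= min_aa:
                    found.append((start + 1, i + 3, frame + 1,
                                  "".join(protein) + "*"))
                protein = []
                start = i + 3
            else:
                protein.append(aa)
        if len(protein) >= min_aa:      # frame still open at the end
            end = start + 3 * len(protein)
            found.append((start + 1, end, frame + 1,
                          "".join(protein) + "*"))
    return found
```

Every segment is terminated by * whether or not a stop codon was
found, as described above.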
Typical dialogue follows.
? Menu or option number=40
Translate and write protein sequence to disk
? (y/n) (y) Translate selected regions
? (y/n) (y) Define segments using keyboard
Translate
? From (0-1023) (0) =1
? To (1-1023) (1023) =111
? (y/n) (y) + strand
Translate
? From (0-1023) (0) =
? Output file name=1.OUT
? Menu or option number=40
Translate and write protein sequence to disk
? (y/n) (y) Translate selected regions n
? Minimum open frame in amino acids (5-1000) (30) =
X 1 + strand only
2 - strand only
3 Both strands
? 0,1,2,3 =3
? File name for translation=1.OUT
? Menu or option number=6
Page through text files
? Name of file to read=1.OUT
>P1; 25
135 1 +
GAQRLLRRSCWCWRCGGRQRTQGSAGRGRRRRGGGG*
>P1; 238
486 1 +
IRCRDCGQRRRGIFDLVDDFHVRRHIVLARKLFEAEGTGVHFHISLMGGNIVTAEVTNVR
VDAGADFAAVRMLALFGAVVPH*
>P1; 556
795 1 +
SSTQVRRASAQTSSLQLESIVAVVNVEVFLAAKHSRFYIAVLFAQFGPLLDARLDRGCGK
GAGRRDQWRGGGVDLANGR*
>P1; 796
987 1 +
FGYADHAFHLRSTSRHSDNVKFDSAGRRRCCCFHLVFSLGSDEEGLLARLLVEVTTIRVV
LRG*
>P1; 2
163 2 +
NSVWAWCEVPRDYCAAAAGAGGAEVVNGPRDPLDEDVDDEEEVDSALLVAGSD*
>P1; 176
391 2 +
PLRSGGGGVEAPETPSGWPARFAAATVANAVEGFSILWMIFTCAVILSLRVNSLKQKGQG
YTFTFRLWEVT*
>P1; 476
628 2 +
SLTEPSASPSPTLLLRFSLVLTEGVPNPALRFGVLPLRPAAFNLNPSLLL*
>P1; 629
958 2 +
MSRYSWLLNTAGFTSPFCLPSLGRFWTRGLTVAVEKEPAGETNGVEAALTLPMGVSLGML
TMLFTCAPPAAIPIMLSLIPLAAAAAAVSTWCFLWAAMRKACWRACSLR*
>P1; 3
293 3 +
IRFGLGVRCPEITAPQLLVLAVRRSSTDPGIRWTRTSTTRRRWIAHCWWLAATDLSSDHS
DPAAEASRLPKLPVAGLLDSLPRLWPTPSRDFRSCG*
>P1; 411
521 3 +
CACRRGSRLCSGTYARPLWCSSPSLSPPPRPRQRCC*
>P1; 1020
37 1 -
EFGKYNPLTDNSSPTQDHTDGSHLNEQARQQAFLIAAQRKHQVETAAAAAASGIKLNIIG
MAAGGAQVKSMVSIPKLTPIGKVNAASTPLVSPAGSFSTATVKPRVQKRPKLGKQNGDVK
PAVFSSQEYLDIYNSNDGFKLKAAGLSGSTPNLSAGLGTPSVKTKLNLSSNVGEGEAEGS
VRDYCTKEGEHTYRCKVCSRVYTHISNFCRHYVTSHKRNVKVYPCPFCFKEFTRKDNMTA
HVKIIHKIENPSTALATVAAANLAGQPLGVSGASTPPPPDLSGQNSNQSLPATSNALSTS
SSSSTSSSSGSLGPLTTSAPPAPAAAAQ*
>P1; 373
-1 2 -
AKCESVPLSLLLQRVYAQGQYDGARENHPQDRKSLDGVGHSRGSESSRPATGSFGSLDAS
AAGSEWSELKSVAASHQQCAIHLLLVVDVLVQRIPGSVDDLRTASTSSCGAVISGHLTPS
PNRI*
>P1; 517
407 2 -
QQRWRGRGGGLSEGLLHQRGRAYVPLQSLLPRLHAH*
>P1; 649
518 2 -
QPGIPRHLQQQRWIQVEGCWSERKHAEPECWIRNSLCQNQAES*
>P1; 853
650 2 -
HYRNGGWWSAGEKHGQHTQTNAHWQGQRRLHAIGLACRLLFHSHGQAARPEAAQTQTER
RCKTGCV*
>P1; 958
854 2 -
SPQRAGAPTSLPHRCPEKTPGGNSSSGGGQRNQT*
>P1; 179
78 3 -
VVRTQISRCQPPAMRYPPPPRRRRPRPADPWVR*
>P1; 479
363 3 -
GTTAPKRASIRTAAKSAPASTRTLVTSAVTMLPPISEM*
>P1; 791
666 3 -
RPLARSTPPPRHWSRLPAPFPQPRSSRASRSGPNWANRTAM*
>P1; 1022
819 3 -
SNSASTTRSPTTAHPRRTTRMVVTSTSRRANKPSSSLPRENTRWKQQQRRRPAESNLTLS
EWRLVERR*
End of file
@41. TX 5 @ Calculate and write codon table to disk
This routine calculates codon usage tables for sections of the
sequence and stores the resulting tables on disk. The sections to
translate can be defined from the keyboard or by supplying the name
of the appropriate EMBL library feature table.
If required users can add to an existing codon table stored as
a disk file. Choose between storing observed counts or having them
normalised so that the totals for each amino acid sum to 100. Select
between defining segments at the keyboard or using an EMBL feature
table. Define segments. Signal completion with a zero start. Supply
a file name. For each segment the program will display the counts;
at the end it will display the accumulated totals.
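The counting and the normalisation can be sketched in Python as
follows. The names are invented for this illustration, the segment
is assumed to contain only A, C, G and T, and the stop codons are
treated as one further group when normalising (an assumption).

```python
from collections import Counter

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codon -> 1 letter symbol ("*" is stop).
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def codon_counts(seq, start, end):
    """Count codons from start to end (1-based, inclusive)."""
    segment = seq.upper()[start - 1:end]
    usable = len(segment) - len(segment) % 3
    return Counter(segment[i:i + 3] for i in range(0, usable, 3))

def normalise(counts):
    """Scale the counts so the codons for each amino acid sum to 100."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODE[codon]] += n
    return {codon: 100.0 * n / totals[CODE[codon]]
            for codon, n in counts.items()}
```

Tables for several segments can be accumulated by adding the
Counter objects together before normalising.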
Typical dialogue follows.
? Menu or option number=41
Calculate and write codon table to disk
? (y/n) (y) Start with empty table
? (y/n) (y) Show observed counts
? (y/n) (y) Define segments using keyboard
? Count from (0-1023) (0) =1
? Count to (1-1023) (1023) =111
? (y/n) (y) + strand
===========================================
F TTT 0. S TCT 0. Y TAT 0. C TGT 0.
F TTC 1. S TCC 1. Y TAC 0. C TGC 3.
L TTA 1. S TCA 0. * TAA 0. * TGA 1.
L TTG 2. S TCG 0. * TAG 0. W TGG 2.
===========================================
L CTT 0. P CCT 0. H CAT 0. R CGT 2.
L CTC 0. P CCC 0. H CAC 0. R CGC 2.
L CTA 0. P CCA 0. Q CAA 1. R CGA 1.
L CTG 1. P CCG 0. Q CAG 2. R CGG 2.
===========================================
I ATT 0. T ACT 0. N AAT 0. S AGT 0.
I ATC 0. T ACC 1. N AAC 0. S AGC 1.
I ATA 0. T ACA 0. K AAA 0. R AGA 1.
M ATG 0. T ACG 0. K AAG 0. R AGG 0.
===========================================
V GTT 0. A GCT 1. D GAT 0. G GGT 3.
V GTC 0. A GCC 1. D GAC 0. G GGC 1.
V GTA 0. A GCA 0. E GAA 1. G GGA 4.
V GTG 1. A GCG 0. E GAG 0. G GGG 0.
===========================================
? Count from (0-1023) (0) =
Codon totals over all genes
===========================================
F TTT 0. S TCT 0. Y TAT 0. C TGT 0.
F TTC 1. S TCC 1. Y TAC 0. C TGC 3.
L TTA 1. S TCA 0. * TAA 0. * TGA 1.
L TTG 2. S TCG 0. * TAG 0. W TGG 2.
===========================================
L CTT 0. P CCT 0. H CAT 0. R CGT 2.
L CTC 0. P CCC 0. H CAC 0. R CGC 2.
L CTA 0. P CCA 0. Q CAA 1. R CGA 1.
L CTG 1. P CCG 0. Q CAG 2. R CGG 2.
===========================================
I ATT 0. T ACT 0. N AAT 0. S AGT 0.
I ATC 0. T ACC 1. N AAC 0. S AGC 1.
I ATA 0. T ACA 0. K AAA 0. R AGA 1.
M ATG 0. T ACG 0. K AAG 0. R AGG 0.
===========================================
V GTT 0. A GCT 1. D GAT 0. G GGT 3.
V GTC 0. A GCC 1. D GAC 0. G GGC 1.
V GTA 0. A GCA 0. E GAA 1. G GGA 4.
V GTG 1. A GCG 0. E GAG 0. G GGG 0.
===========================================
? (y/n) (y) Save table in a file n
@42. TX 6 @ Codon usage method
Used to find protein coding regions. For each window length of
the sequence the routine measures the closeness to an expected codon
usage. Results are plotted for each of the three reading frames.
Stop and start codons are also marked on the plots. Has the highest
resolution of all such methods, but makes the strongest assumption,
i.e. that the codon usage is known. The latest version is described
in Methods in Enzymology 183, 193-211.
Choose whether to use an internal standard (i.e. part of the
current sequence known to code for a protein). If so define its end
points, and those of any others. Otherwise supply the name of a disk
file containing a table of codon usage. Tables are listed. Choose
between using the observed counts, or two types of normalisation:
normalised to give an average amino acid composition; normalised to
no amino acid bias. The first normalisation is clearly often
sensible, but the second removes valuable information and is only
made available for special circumstances. The final table will be
displayed, followed by the expected scores for window lengths 21, 31
and 41 codons. The scores for each of the three reading frames are
shown (they are logarithmic values) to help users choose a window
length for the analysis. Define a window length and plot interval.
Plotting will start.
The method was first described in Staden and McLachlan Nucl.
Acid Res. 10 141-156 (1982) and the following is a summary of the
initial ideas. The method makes the following main assumptions: the
codon preferences of all the genes in the sequence we are examining
are similar to those of the standard; the sequence is coding
throughout its whole length in only one reading frame; in the coding
frame the frequency of codon abc has a definite value Fabc.
If we select a sequence a1b1c1a2b2c2a3b3c3,...,anbncnan+1bn+1cn+1
then the probability of selecting it in each of the three frames is:
frame 1: p1=Fa1b1c1.Fa2b2c2....Fanbncn
frame 2: p2=Fb1c1a2.Fb2c2a3...Fbncnan+1
frame 3: p3=Fc1a2b2.Fc2a3b3...Fcnan+1bn+1
The probability that selection of a particular sequence was "caused"
by it being a coding sequence is:
P1=p1/(p1+p2+p3), P2=p2/(p1+p2+p3), P3=p3/(p1+p2+p3).
The program calculates these values for the given window length but
plots Log(P/(1-P)) for each of the three frames. At each point along
the sequence that the program has a point to plot it finds which of
the three values is highest and places a single point at the 50%
level for the corresponding frame. These single points will join to
form a solid line if one frame is consistently the highest scoring.
In addition stop codons are shown as short vertical lines that
bisect the 50% level of probability. When looking for coding regions
the user should look for solid horizontal lines at the 50% level
that are not interrupted by these short vertical lines.
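The calculation of Log(P/(1-P)) for one window can be sketched in
Python as follows. The names are invented for this illustration. It
is assumed that the table gives a positive frequency for every
codon, so a table containing zeros would first need a small
pseudo-count; whether the program does this is not stated here.

```python
import math

ALL_CODONS = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]

def frame_log_odds(window, freq):
    """Return Log(P/(1-P)) for frames 1, 2 and 3.

    window is a string of 3n+2 bases; freq maps each codon to its
    frequency F in the standard. Sums of logarithms are used to
    avoid underflow in the products p1, p2 and p3.
    """
    n = (len(window) - 2) // 3
    log_p = []
    for offset in range(3):
        log_p.append(sum(
            math.log(freq[window[offset + 3 * k:offset + 3 * k + 3]])
            for k in range(n)))
    top = max(log_p)
    p = [math.exp(x - top) for x in log_p]   # rescaled p1, p2, p3
    scores = []
    for k in range(3):
        others = sum(p) - p[k]               # P/(1-P) = p/others
        if others <= 0.0:
            scores.append(math.inf)
        elif p[k] == 0.0:
            scores.append(-math.inf)
        else:
            scores.append(math.log(p[k] / others))
    return scores
```

With a table in which all codons are equally frequent each frame
has P=1/3, so all three values equal Log(0.5).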
Changes. Two normalisations are offered: 1) to remove all
amino acid compositional components from the tables, hence leaving
only the codon preference components. In general this is not
recommended as the amino acid component alone is often sufficient to
choose correctly between frames, but may be useful in special
circumstances. 2) to change the amino acid composition components to
give an average amino acid composition rather than the one contained
in the standard (this leaves the codon preference components
unchanged). In general this should be useful as the average amino
acid composition is likely to be closer to the composition of the
genes being hunted, than is that of the standard table of codon
preferences. The average composition is that recently published by
Argos, not the Dayhoff one that we have used before.
Typical dialogue follows.
? Menu or option number=42
Staden and McLachlan codon usage method
Codon tables for standards may be read from disk
or calculated from parts of the current sequence
? (y/n) (y) Define internal standard
Define standard
? start (0-1023) (0) =1
? end (2-1023) (1023) =1000
===========================================
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
===========================================
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
===========================================
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
===========================================
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
===========================================
Define standard
? start (0-1023) (0) =
Total codons in standard= 333.
X 1 Use observed frequencies
2 Normalize to average amino acid composition
3 Normalize to no amino acid bias
? 0,1,2,3 =2
===========================================
F TTT 19. S TCT 2. Y TAT 10. C TGT 3.
F TTC 6. S TCC 22. Y TAC 10. C TGC 8.
L TTA 2. S TCA 0. * TAA 0. * TGA 0.
L TTG 7. S TCG 2. * TAG 0. W TGG 8.
===========================================
L CTT 16. P CCT 16. H CAT 4. R CGT 10.
L CTC 12. P CCC 0. H CAC 10. R CGC 10.
L CTA 0. P CCA 0. Q CAA 8. R CGA 7.
L CTG 21. P CCG 16. Q CAG 18. R CGG 6.
===========================================
I ATT 19. T ACT 13. N AAT 16. S AGT 2.
I ATC 11. T ACC 17. N AAC 12. S AGC 15.
I ATA 3. T ACA 3. K AAA 22. R AGA 1.
M ATG 15. T ACG 3. K AAG 15. R AGG 1.
===========================================
V GTT 15. A GCT 21. D GAT 14. G GGT 10.
V GTC 7. A GCC 16. D GAC 20. G GGC 13.
V GTA 8. A GCA 8. E GAA 26. G GGA 14.
V GTG 11. A GCG 8. E GAG 13. G GGG 9.
===========================================
Span length 21 expected mean values: 4.8 -5.7 -4.8
Span length 31 expected mean values: 7.1 -8.4 -7.2
Span length 41 expected mean values: 9.5 -11.1 -9.5
? odd span length (11-101) (25) =41
? plot interval (1-11) (5) =
Missing graphics display here
@43. TX 6 @ Positional base preference method.
Used to find protein coding regions. For each window length of
the sequence the routine measures the closeness to an expected
pattern of base frequencies. Results are plotted for each of the
three reading frames. Stop and start codons are also marked on the
plots. The method is particularly useful for showing which reading
frame is the most likely to be coding. The latest version is
described in a forthcoming issue of Methods in Enzymology, but the
original ideas were given in Staden, R. Nucl. Acid Res. 12 551-567
(1984).
If dialogue is requested the following inputs are needed,
otherwise the standard analysis is performed. Choose between a
"global" standard, or a selected one. If the global standard is
selected the expected scores are displayed and the user asked to
define a span length and a plot interval. Then users choose between
plotting relative or absolute scores, and can reset the scaling
values employed for plotting. If the global standard is not
selected users must define a region of the sequence to use as a
standard, or they can read in a codon table from which the program
will calculate one. Then they can either use the values observed in
this standard, or they can combine its values for the third
positions in codons, with those from the global standard. Next they
can give different weightings to each of the three positions in
codons.
In its original form the method took advantage of the uneven
use of amino acids by proteins and the structure of the genetic code
table and assumed that there is a typical ("global") amino acid
composition and no codon preference. The typical amino acid
composition is the average composition found by Argos (see below).
This composition and no codon preference determines the frequency of
each of the four bases in each of the three codon positions. This
3x4 frequency table shows unequal use of the bases and in particular
a marked use of G in position 1 and of A in position 2 (at the
expense of G). The routine slides a window along the sequence and
calculates a score for each of the three reading frames at each
window position. It assumes the sequence is coding throughout its
whole length and calculates the probability that it is coding in
each of the three frames. When tested against all the E. coli
sequences in the EMBL sequence library it correctly identified the
coding frame for 91% of window positions. (The E. coli sequences
were chosen only for technical reasons: I have no reason to think
the method would work less well on other organisms with roughly even
base composition.) The routine can plot either absolute or relative
values: i.e. absolute values are the values found by summing the
scores for each frame (say p1, p2 and p3), and the relative values
are then p1/(p1+p2+p3), p2/(p1+p2+p3) and p3/(p1+p2+p3).
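The absolute and relative values for one window can be sketched in
Python as follows. The numbers in the 3x4 table are invented purely
for illustration (they merely favour G in position 1 and A in
position 2) and are not the program's values; the names are also
invented, and the window is assumed to contain only A, C, G and T.

```python
# Hypothetical scores for each base in each codon position.
TABLE = [
    {"A": 0.27, "C": 0.20, "G": 0.35, "T": 0.18},   # position 1
    {"A": 0.32, "C": 0.23, "G": 0.17, "T": 0.28},   # position 2
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},   # position 3
]

def frame_scores(window, table=TABLE):
    """Return (absolute, relative) scores for frames 1, 2 and 3."""
    window = window.upper()
    absolute = []
    for frame in range(3):
        # Base i falls in codon position (i - frame) mod 3.
        absolute.append(sum(table[(i - frame) % 3][b]
                            for i, b in enumerate(window)))
    total = sum(absolute)
    relative = [a / total for a in absolute]
    return absolute, relative
```

The relative values always sum to 1, which is why they cannot
distinguish the two strands when both score in the same proportions.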
At each point along the sequence that the program has a point
to plot it finds which of the three values is highest and places a
single point at the 50% level for the corresponding frame. These
single points will join to form a solid line if one frame is
consistently the highest scoring. In addition stop codons are shown
as short vertical lines that bisect the 50% level of probability.
When looking for coding regions the user should look for solid
horizontal lines at the 50% level that are not interrupted by these
short vertical lines. The absolute mean values expected on the
complement of the coding strand (and in the same frame) are 5% lower
than those on the coding strand but the relative values are the same
on both strands. Although the relative values give smoother plots
and tend to emphasize the coding frame, they cannot, therefore, be
used to decide which strand is coding. The absolute values plot
should be used for this purpose, bearing in mind that the
differences between strands are quite small.
The method has been improved in two overall ways: first it now
allows users to define their own typical amino acid composition by
selecting a standard sequence from within the sequence they are
analysing or from a codon table; secondly it allows the inclusion of
third position preferences. Again these third position preferences
are defined by the use of an internal standard sequence. Not only
can users define their own standards but they can also give weights
to each of the three positions in codons. This allows different
emphasis to be used for each of the three positions. As an example
of its use, by giving, in turn, weights of 1.0, 0.0, 0.0, and 0.0,
1.0, 0.0, and finally 0.0, 0.0, 1.0, you could see the separate
contribution made by each of the three positions. It is also
possible to use the third position preferences with the values for
the first two positions taken from the "global" amino acid
composition. In all cases users may choose to plot absolute or
relative values. The expected scores are displayed before each
analysis and scales are drawn on the plots. At present this method
does not give probabilities of coding; it has only been tested for
its ability to choose the correct reading frame (see above). It
could be used to give probabilities of coding if it was applied to all
known coding and non-coding sequences in the way that the uneven
positional base frequencies method was. It is designed to be used in
conjunction with this method. Note that the average amino acid
composition used to derive the base frequencies was changed on
17-11-1988, to be the new average given by McCaldon and Argos in
Proteins 4 99-122 (1988). A further change is to allow users to
select their own scales for producing the plots. It can be helpful
if they want to emphasise or diminish certain features.
Typical dialogue follows.
? Menu or option number=D43
Positional base preferences method to find protein genes
Select standard source
X 1 Use global standard
2 Use internal standard
3 Use codon usage table
? Selection (1-3) (1) =2
Define region for standard
? start (0-8134) (0) =3171
? end (3172-8134) (8134) =4700
Select normalisation
X 1 Use observed frequencies
2 Combine with global standard
? Selection (1-2) (1) =1
T C A G Range
1 0.125 0.249 0.230 0.397 0.272
2 0.298 0.245 0.292 0.165 0.132
3 0.288 0.313 0.169 0.230 0.144
? (y/n) (y) Use 1.0 for positional weights
Give weights between 0.0 and 1.0
to each of the 3 codon positions
? Position 1 (0.00-1.00) (1.00) =
? Position 2 (0.00-1.00) (1.00) =
? Position 3 (0.00-1.00) (1.00) =
Expected scores per codon in each frame
0.136 0.122 0.123
? odd span length (31-101) (67) =
? plot interval (1-11) (5) =
? (y/n) (y) Plot relative scores
Scaling values:
Minimum maximum range
0.3121 0.3656 0.0382
? (y/n) (y) Leave scaling values unchanged
Graphics not shown
? Menu or option number=D43
Positional base preferences method to find protein genes
Select standard source
X 1 Use global standard
2 Use internal standard
3 Use codon usage table
? Selection (1-3) (1) =3
? File name of standard=atpase.cods
===========================================
F TTT 21. S TCT 33. Y TAT 15. C TGT 5.
F TTC 55. S TCC 40. Y TAC 40. C TGC 4.
L TTA 8. S TCA 7. * TAA 8. * TGA 0.
L TTG 19. S TCG 12. * TAG 1. W TGG 17.
===========================================
L CTT 22. P CCT 17. H CAT 6. R CGT 73.
L CTC 21. P CCC 4. H CAC 30. R CGC 23.
L CTA 1. P CCA 10. Q CAA 19. R CGA 5.
L CTG 168. P CCG 48. Q CAG 80. R CGG 3.
===========================================
I ATT 47. T ACT 14. N AAT 17. S AGT 8.
I ATC 98. T ACC 54. N AAC 52. S AGC 26.
I ATA 6. T ACA 7. K AAA 85. R AGA 0.
M ATG 75. T ACG 13. K AAG 28. R AGG 0.
===========================================
V GTT 67. A GCT 56. D GAT 41. G GGT 90.
V GTC 29. A GCC 53. D GAC 66. G GGC 66.
V GTA 49. A GCA 59. E GAA 101. G GGA 5.
V GTG 57. A GCG 64. E GAG 41. G GGG 8.
===========================================
Select normalisation
X 1 Use observed frequencies
2 Combine with global standard
? Selection (1-2) (1) =2
T C A G Range
1 0.177 0.211 0.277 0.336 0.159
2 0.271 0.238 0.310 0.182 0.128
3 0.242 0.301 0.168 0.289 0.132
? (y/n) (y) Use 1.0 for positional weights
Expected scores per codon in each frame
0.785 0.736 0.736
? odd span length (31-101) (67) =
? plot interval (1-11) (5) =
? (y/n) (y) Plot relative scores
Scaling values:
Minimum maximum range
0.3219 0.3519 0.0214
? (y/n) (y) Leave scaling values unchanged
Graphics not shown
@44. TX 6 @ Uneven positional base frequencies.
Used to find regions of a sequence that might be coding for a
protein. The method looks for sections of the sequence in which the
frequency at which each of the four bases occupies the three
positions in codons is nonrandom. The level of nonrandomness is
plotted on a scale that shows the probability that the sequence is
coding. At each position along a sequence the calculation gives the
same value for all six possible reading frames, so only one value is
plotted.
Define the window length and plot interval.
The results are plotted in a box divided by a horizontal line
marked "76%". 76% of coding regions achieve values above this line
and 76% of noncoding regions achieve scores below the line.
This method, first described in Staden R. Nucl. Acid Res. 12
551-567 1984, looks for uneven positional usage of bases in codons.
It looks through the sequence in one fixed phase and counts the
number of times each base appears in each of the three codon
positions: for each window position it counts A1,A2,A3 and C1,C2,C3
and G1,G2,G3 and T1,T2,T3 and calculates AMEAN=(A1+A2+A3)/3, and
similarly CMEAN, GMEAN and TMEAN; it then calculates ADIF=abs(A1-
AMEAN)+abs(A2-AMEAN)+abs(A3-AMEAN) and similarly CDIF, GDIF and TDIF
to measure the differences between an even base usage for all
positions in the codons and the observed usage. The routine then
calculates the sum ADIF+CDIF+GDIF+TDIF and plots this value on the
following scale: the base level is such that no known window in a
coding region has a lower value, whereas 14% of windows in noncoding
sequences score below it. The top of the scale is not achieved by
any known noncoding region, but is reached by 16% of known coding
regions. The bar drawn across the plot corresponds to a level that
is exceeded by 76% of windows in known coding regions but is reached
by only 24% of windows in known noncoding regions, i.e. 76% of coding
windows score above and 76% of noncoding windows score below. This
is similar to Fickett's method but without the probabilities and
weightings from the Los Alamos sequence library: it is therefore
unbiased but may well give very similar results.
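The calculation of ADIF+CDIF+GDIF+TDIF for a single window can be sketched in Python as below. The function name positional_unevenness is hypothetical, and the sketch returns only the raw sum; the scaling of the plot against known coding and noncoding windows is not reproduced.

```python
def positional_unevenness(window):
    """Return ADIF + CDIF + GDIF + TDIF for one window.

    The window is read in one fixed phase: index 0 is taken as codon
    position 1, index 1 as position 2, index 2 as position 3, and so on.
    """
    total = 0.0
    for base in "ACGT":
        # counts[k] = number of times this base occupies position k+1
        counts = [0, 0, 0]
        for i, b in enumerate(window):
            if b == base:
                counts[i % 3] += 1
        mean = sum(counts) / 3  # e.g. AMEAN = (A1 + A2 + A3) / 3
        total += sum(abs(c - mean) for c in counts)  # e.g. ADIF
    return total


print(positional_unevenness("ACGACGACG"))  # strongly position-dependent
print(positional_unevenness("AAAAAAAAA"))  # even use across positions
```

Changing the phase only permutes the three counts for each base, so the sum is unchanged, which is why a single value is plotted for all reading frames.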
@45. TX 6 @ Codon improbability on base composition
Used to find regions of a sequence that might code for a
protein.
If dialogue is requested define a window length and plot
interval.
The idea of the method is that, of all sequence features that
we know, it is only coding regions that will give rise to codon
biases well above those expected from the base composition. If a
region of sequence shows sufficiently strong codon bias then we
conclude that it is coding for a protein. Using the multinomial
distribution we have derived a function to measure the improbability
of observing a set of codons from a sequence of the given
composition. Using the Poisson distribution we have worked out the
distribution of the improbability. The program plots the observed
improbability minus the expected improbability (the mean as
calculated from the Poisson distribution). The plots are presented
against a scale of units of standard deviation as measured from the
Poisson distribution. As with the other Staden and McLachlan method
the program puts an extra point at a fixed level for the highest of
the three probabilities; for this function this point is placed at
six standard deviations above the mean expected level. The top of
each plot corresponds to 12 standard deviations above the expected
level and the bottom corresponds to the expected value.
Analysis of the application of the method to the EMBL sequence
library indicates that the method does work for most sequences and
that the levels of improbability roughly correlate with levels of
expression. Coding regions will show high peaks in all three frames
making interpretation more difficult than for some of the other
methods.
@46. TX 6 @ Codon improbability on amino acid composition
Used to find regions of a sequence that might code for a
protein.
If dialogue is requested define a window length and a plot
interval.
The idea of the method is that, of all sequence features that
we know, it is only coding regions that will give rise to codon
biases such that, for each amino acid, some codons are used far more
frequently than others. The method is independent of what the bias
actually is, requiring only that it is present. If a region of
sequence shows sufficiently strong codon bias then we conclude that
it is coding for a protein. Using the multinomial distribution we
have derived a function to measure the improbability of observing a
set of codons from a sequence of the given composition. Using the
Poisson distribution we have worked out the distribution of the
improbability. The program plots the observed improbability minus
the expected improbability (the mean as calculated from the Poisson
distribution). The plots are presented against a scale of units of
standard deviation as measured from the Poisson distribution. As
with the other Staden and McLachlan method the program puts an extra
point at a fixed level for the highest of the three probabilities;
for this function this point is placed at six standard deviations
above the mean expected level. The top of each plot corresponds to
12 standard deviations above the expected level and the bottom
corresponds to the expected value.
@47. TX 6 @ Shepherd RNY preference method
Used to find regions of a sequence that might code for a
protein. Based on the method of Shepherd (PNAS 78 1596-1600, 1981).
If dialogue is requested define a window length and plot
interval.
Shepherd has found that many genes have a preference for the
use of codons of the form RNY where R=purine, Y=pyrimidine and N=any
base. He has attributed this to remnants of a primitive
genetic code. The calculation is similar to that for the Staden and
McLachlan method, p1 being simply the number of RNY codons found in
frame 1, etc., and the relative values being p1/(p1+p2+p3), etc.
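A minimal Python sketch of the RNY count for one window is given below. The function names are hypothetical, and the handling of a window with no RNY codons at all (all relative values set to zero) is an assumption made only to avoid dividing by zero.

```python
PURINES = set("AG")      # R
PYRIMIDINES = set("CT")  # Y


def rny_counts(window):
    """Return (p1, p2, p3): the number of RNY codons in each frame."""
    counts = [0, 0, 0]
    for frame in range(3):
        for i in range(frame, len(window) - 2, 3):
            if window[i] in PURINES and window[i + 2] in PYRIMIDINES:
                counts[frame] += 1
    return tuple(counts)


def rny_relative(counts):
    """Return p1/(p1+p2+p3), p2/(p1+p2+p3), p3/(p1+p2+p3)."""
    total = sum(counts)
    if total == 0:
        return (0.0, 0.0, 0.0)  # assumption: no RNY codons in any frame
    return tuple(c / total for c in counts)


counts = rny_counts("GCTAACGGT")  # GCT, AAC and GGT are all RNY
print(counts, rny_relative(counts))
```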
@48. TX 6 @ Ficketts method
Used to find regions of a sequence that might code for a
protein. Based on the method of Fickett (Nucl. Acid Res. 10 1982),
but plots values for fixed window lengths rather than over the whole
of open reading frames.
If dialogue is requested define a window length and plot
interval. The results are plotted in a box divided into three
horizontal strips.
Sections of the sequence with values plotted in the top strip
of the box are adjudged to be coding, those in the middle strip "no
decision", and those in the bottom "not coding".
The program performs the following calculations: let A1 = the
number of occurrences of base A in position 1 of codons, A2 for
position 2 etc. Similarly for bases C,G and T. For each window
position calculate Apos=max(A1,A2,A3)/(min(A1,A2,A3)+1). Similarly
for C,G and T to give 4 positional values. Also count the base
composition for the window to give Acomp, Ccomp etc. Fickett tested
each of these 8 parameters singly as to their ability to distinguish
coding from noncoding regions and arrived at probabilities of coding
for the range of values each can take = Pcod. He also measured their
relative abilities and gave weightings to each of the 8 parameters
= Pw. To calculate the "TESTCODE" for a window we first look up the
Pcod for each of the calculated compositional and positional values
and then calculate TESTCODE=sum(Pcod*Pw). TESTCODE is plotted
relative to three levels of decision: the top division="coding", the
middle="no opinion" and the bottom division="non coding".
@49. TX 6 @ tRNA gene search.
Used to find segments of a sequence that might code for tRNAs.
Looks for potential cloverleaf forming structures and then for the
presence of expected conserved bases. Presents results graphically
or draws out the cloverleafs.
If dialogue is requested a large number of parameters need to
be given values, including some loop lengths, scores for each of the
four stems, and scores for the conserved bases.
The program was first described in Staden Nucl. Acid Res
817-825 (1980). The tRNAs that have been sequenced so far have
two characteristics that can be used to locate their genes within
long DNA sequences. Firstly they have a common secondary
structure - the cloverleaf - and secondly, particular bases
almost always appear at certain positions in the cloverleaf. The
cloverleaf is composed of four base-paired stems and four loops.
Three of the stems are of fixed length but the fourth, the
dhu stem which usually has four base pairs, sometimes has only
three. All of the loops can vary in size. The following
relationships between the stems in the cloverleaf are assumed in the
program: (a) there are no bases between one end of the aminoacyl
stem and the adjoining tuc stem; (b) there are two bases between
the aminoacyl stem and the dhu stem; (c) there is one base between
the dhu stem and the anticodon stem; (d) there are at least three
bases between the anticodon stem and the tuc stem. The program
looks first for cloverleaf structure and then, if required, for
conserved bases. The sizes of the loops, the number of basepairs in
the stems and the required conserved bases may all be specified
by the user. The process of looking for the presence of conserved
bases can reduce the number of potential structures found
considerably. The user may also specify that an intron may be
present in the anticodon loop.
The user may define a minimum number of base pairs for each
stem using the scoring system G-C, A-T=2 and G-T=1 and scores for
each of the conserved bases. Recommended values for the stem scores
are given by the prompts and the percentage conservation of the
conserved bases as found in the Nucl. Acid Res 1979 paper Gauss,
Gruter and Sprinzl are also given, but the user must decide which
bases are most likely to be conserved for the sequence being
examined. The output shows the position of the possible gene in the
sequence by a vertical line the height of which shows the number of
basepairs made in the stems. The cloverleaf structure is also drawn
but will scroll up off the screen. Output of the cloverleafs will
look like:
6942
A
A-U
A-U
G-C
A-U
U-A
A-U
U-A AAU
U UAUCU
AA A !!!!!
AAUG AUAGA A
U !!!! U UCA
C UUAC U
AA A
U-AA A
A-U
A-U
C-G
U-A
U A
U A
GUC
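The stem scoring described above (G-C and A-T pairs score 2, G-T pairs score 1) can be sketched in Python as below. The function name stem_score is hypothetical, and the two strands are assumed to be given so that the bases at the same index face each other. U is treated as T so that stems copied from the cloverleaf drawings can be scored directly.

```python
PAIR_SCORES = {
    frozenset("GC"): 2,
    frozenset("AT"): 2,
    frozenset("GT"): 1,
}


def stem_score(strand_a, strand_b):
    """Score a stem; strand_a[i] is the base facing strand_b[i]."""
    a = strand_a.upper().replace("U", "T")
    b = strand_b.upper().replace("U", "T")
    score = 0
    for x, y in zip(a, b):
        # Any pair not listed in PAIR_SCORES scores zero.
        score += PAIR_SCORES.get(frozenset((x, y)), 0)
    return score


# Aminoacyl stem of the cloverleaf drawn above: seven A-U/G-C pairs.
print(stem_score("AAGAUAU", "UUCUAUA"))
# A stem with six standard pairs and one G-T pair.
print(stem_score("CCGTCAT", "GGCAGTG"))
```

With this scoring a perfect seven-pair aminoacyl stem scores 14 and a perfect five-pair stem scores 10, which agrees with the upper limits offered by the stem score prompts.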
Typical dialogue follows.
? Menu or option number=D49
tRNA search
? Maximum trna length (70-130) (92) =
? Aminoacyl stem score (0-14) (11) =
? Tu stem score (0-10) (8) =
? Anticodon stem score (0-10) (8) =
? D stem score (0-8) (3) =
? Minimum base pairing total (30-32) (32) =
? Minimum intron length (0-30) (0) =
? Minimum length for TU loop (4-12) (6) =
? Maximum length for TU loop (6-12) (9) =
? (y/n) (y) Skip search for conserved bases n
Give a score for each base, then a minimum total at the end
? Base 8, T is 100% conserved. Score (0-100) (0) =
? Base 10, G is 95% conserved. Score (0-100) (0) =
? Base 11, Y is 96% conserved. Score (0-100) (0) =
? Base 14, A is 100% conserved. Score (0-100) (0) =
? Base 15, R is 100% conserved. Score (0-100) (0) =
? Base 21, A is 97% conserved. Score (0-100) (0) =
? Base 32, Y is 100% conserved. Score (0-100) (0) =
? Base 33, T is 98% conserved. Score (0-100) (0) =
? Base 37, A is 91% conserved. Score (0-100) (0) =
? Base 48, Y is 100% conserved. Score (0-100) (0) =
? Base 53, G is 100% conserved. Score (0-100) (0) =
? Base 54, T is 95% conserved. Score (0-100) (0) =
? Base 55, T is 97% conserved. Score (0-100) (0) =
? Base 56, C is 100% conserved. Score (0-100) (0) =
? Base 57, R is 100% conserved. Score (0-100) (0) =
? Base 58, A is 100% conserved. Score (0-100) (0) =
? Base 60, Y is 92% conserved. Score (0-100) (0) =
? Base 61, C is 100% conserved. Score (0-100) (0) =
? Minimum total conserved base score (0-0) (0) =
? (y/n) (y) Plot results n
Searching
306
C
C-G
C-G
G-C
T-A
C-G
A-T
T+G AT
A ATACA
TTC T !!!! G
CTGT TATGG G
G ! ! T GA
C TAAA C
GCG C G
T+GA C
C-G C T
T+G A T
T-A G T
T-A G A
G G G C
A A G A
AGC T C
A T
C T
A
C T
@50. TX 7 @ Plot start codons
This function plots the positions of all start codons for each
of the three reading frames.
@51. TX 7 @ Plot stop codons
This function plots the positions of all stop codons for each
of the three reading frames.
@52. TX 7 @ Plot stop codons on the complementary strand
This function plots the positions of all stop codons for each
of the three reading frames on the complementary strand.
@53. TX 7 @ Plot stop codons on both strands
This function plots the positions of all stop codons for each
of the three reading frames on both strands.
@54. TX 5 @ Search for longest open reading frames
This function will report the positions of the ends of all
sections of sequence that contain no stop codons. All six reading
frames are examined. Results are presented in the form of an EMBL
feature table. Hence if the results are stored in a file by use of
"direct output to disk", the file can be used to translate the open
reading frames in a sequence. Note that in order for the file to be
used as a feature table it must include either EMBL or GenBank
headers, and a suitable "tail". The simplest header is the word
FEATURES starting in column 1 of the first line of the file. The
simplest tail is 2 empty lines at the end of the file. These lines
are not included when nip writes out results in feature table
format.
Define the minimum length of open reading frame to report (in
amino acids). Choose to search either or both strands. The program
displays the end points, the reading frame and strand.
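The search for stop-free stretches can be sketched in Python as follows. This is an illustration under stated assumptions, not the program's code: it handles the + strand only (the - strand could be treated by applying the same function to the reverse complement and converting the coordinates), it uses the standard stop codons TAA, TAG and TGA, and it reports 1-based inclusive end points that exclude the stop codon itself. The name open_frames is hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}


def open_frames(seq, min_aa=30):
    """Return (start, end, frame) for stop-free stretches, + strand.

    Positions are 1-based and inclusive. A stretch is reported only if
    it holds at least min_aa complete codons.
    """
    seq = seq.upper()
    results = []
    for frame in range(3):
        start = frame  # 0-based start of the current open stretch
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if i > start and (i - start) // 3 >= min_aa:
                    results.append((start + 1, i, frame + 1))
                start = i + 3
        # Stretch running to the last complete codon of the sequence.
        end = frame + ((len(seq) - frame) // 3) * 3
        if end > start and (end - start) // 3 >= min_aa:
            results.append((start + 1, end, frame + 1))
    return results


seq = "ATG" + "GCT" * 5 + "TAA" + "GCT" * 2
found = open_frames(seq, min_aa=5)
for start, end, frame in found:
    print(start, end, frame, end - start + 1)
```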
Typical dialogue follows.
? Menu or option number=D54
Find open reading frames
? Minimum open frame in amino acids (5-1000) (30) =100
X 1 + strand only
2 - strand only
3 Both strands
? 0,1,2,3 =3
FT CDS 1 831 1 831
FT CDS 1540 2853 1 1314
FT CDS 3130 4242 1 1113
FT CDS 5761 6114 1 354
FT CDS 6187 6711 1 525
FT CDS 1766 2077 2 312
FT CDS 2078 2446 2 369
FT CDS 4136 5500 2 1365
FT CDS 1335 1637 3 303
FT CDS 2844 3194 3 351
FT CDS 6819 7238 3 420
FT CDS 2073 1711 C 1 363
FT CDS 2469 2149 C 1 321
FT CDS 6542 6144 C 3 399
@55. TX 8 @ Search for E. coli promoter (general)
Searches for E. coli promoter-like sequences using a standard
weight matrix. The positions of the matches are plotted. No dialogue
is required.
The method was first described in Staden R. Nucl. Acid Res. 12
505-519 1984. This search uses a weight matrix taken from the
frequency tables contained in Hawley, D. K. and McClure, R., Nucl.
Acid Res. 11 2237-2255 (1983). The weight matrix is divided into 3
sections that are separated by varying sizes of gap: the -35 region,
the -10 and
the +1 region. The algorithm first looks for a sufficiently good
-35 region, then for the best -10 region within range and then for
the best +1 region within range of the -10; each separate region
must score above the lowest known score for the corresponding
section. The gap penalty is then applied and two plots produced: one
with gap penalties, one without. Scaling is such that no known
promoter scores below the bottom level and no known promoter scores
above the top level when the weight matrix is applied.
Two other functions also look for E. coli promoters: 56 looks
for sites on the complementary strand and 57 looks for individual
-35 and -10 regions and plots them on a scale such that the top is
the highest known value +10% and the bottom is the lowest known -10%.
Weights for E. coli promoters
-35 region:
P -50-49-48-47-46-45-44-43-42-41-40-39-38-37-36-35-34-33-32-31-30-29-28-27-26
107109109110110110110110110111111110111112112112112112112112112112112112112
T 41 33 32 25 34 22 35 35 42 27 32 42 47 14 92 94 11 19 15 37 46 34 38 48 34
C 22 27 18 29 20 14 20 12 22 23 16 25 10 43 7 6 11 18 60 8 25 23 23 17 20
A 28 38 30 37 35 56 42 42 37 42 39 18 25 26 2 6 2 72 26 50 26 34 25 26 31
G 16 11 29 19 21 18 13 21 9 19 24 26 29 29 11 6 88 3 11 17 15 21 26 21 27
-10 region:
P -23-22-21-20-19-18-17-16-15-14-13-12-11-10 -9 -8 -7 -6 -5
112112112112112112112112112112112112112112112112112112112
T 35 28 28 27 39 51 34 43 26 31 89 3 49 15 19108 31 29 21
C 34 21 24 27 12 25 20 25 20 27 10 2 16 14 22 3 13 16 30
A 20 39 33 33 39 23 29 16 23 19 2106 29 66 57 1 35 23 31
G 23 24 27 25 22 13 29 28 43 35 11 1 18 17 14 0 33 24 30
+1 region:
P -2 -1 1 2 3 4 5 6 7 8 9 10
86 88 85 88 88 88 88 88 88 88 88 88
T 16 22 2 42 27 23 20 25 27 15 16 29
C 29 49 4 25 25 13 18 22 17 17 16 17
A 20 9 45 16 24 25 28 24 24 32 35 26
G 21 8 37 5 12 27 22 17 20 24 21 16
Notes: E. coli promoters have been shown to contain 2 regions of
conserved sequence located about 10 and 35 bases upstream of the
transcription startsite. These are TATAAT and TTGACA with an allowed
spacing of 15 to 21 bases between. The spacing with maximum
efficiency was 17 bases and all but 12 of the 112 sequences could be
aligned with a separation of 17 +/- 1 bases. The standard promoter
has spacing 7 and 17 bases between the startsite and the -10 region,
and the -10 and -35 regions, respectively. The spacing between the
-10 region and the startsite is usually 6 or 7 bases but varies
between 4 and 8 bases. There is an AT rich region of 8 to 10 bases
upstream of the -35 region. Initiation with a purine is highly
preferred with G being used if A is not present.
Gap penalties:
15 0.02 (only exists as mutant)
16 0.2
17 1.0
18 0.2
19 0.05 (guess)
20 0.02 (guess)
21 0.01 (guess)
@56. TX 8 @ Search for E. coli promoter (general) strand
This function searches for E. coli promoters on the
complementary strand of the sequence. See the notes on option 55.
@57. TX 8 @ Search for E. coli promoter sequences. (-35 and -10)
This function searches separately for the -35 and -10
sequences of an E. coli promoter. See the notes on option 55.
@58. TX 8 @ Search for procaryotic ribosome binding sites
This function searches for the 5' ends of prokaryotic genes
using an unusual weight matrix. The search is relatively slow
because the matrix is 101 bases in length. No dialogue is required.
The method was first described in Staden Nucl. Acid Res. 12
505-519 1984. It actually looks for more than a ribosome binding
site, as is explained below. It uses the weight matrix w101 of
Stormo and Schneider (NAR 10 2971-3024, 1982) which, with a cutoff
score of 2, finds all gene starts in their library.
P-60-59-58-57-56-55-54-53-52-51-50-49-48-47-46-45-44-43-42-41-40-39-38-37-36
T 5 1 -3 9-14 7 15 -5 3-16-17 4 18 5 -3 -1 2 4 5 -5 7 8 -5-15 6
C-21 -6-11-21 0 8 -7-12 -1 1 0-19 12 -3 -1 10 2 -8 -5-11 8 1 23 6 -5
A 7 -2 13 -2 -8-13-18 5 0 -5 13 8-15 9 -4 -7 9 0 -8-11-10 -6 -7 -5 -6
G -6 -9 -7 0 8-16 -4 -2-16 1 -4 8-14 5 11-13-24 3 7 22-11 -9-15 10 -4
P-35-34-33-32-31-30-29-28-27-26-25-24-23-22-21-20-19-18-17-16-15-14-13-12-11
T 3 4 16 -4 7 11 -4 -1 12 8 10 -1 1 8 2-10-16 11 1 -3 16 -3-36 -8-27
C 2-14 -3 -8-10-21 2 0 -2 -1-11 -3 -1 5-11 -4 7 0-14 6 -8-20 -7-36-44
A-12 -1-27 -3 -6 0-12 -3 -4 -7 14 -2 -4 -6 0 12 5 -9 0-11-11 10 8 2 8
G 4 -5 -6 -3 -1 -4 -1 -4-15 0-14 3 10-19 -3-10 -7 -7 7 1 -8 -6 15 21 42
P-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
T-53-27-26-23 2 -7-14-40-28 0-53 75-62-20-40-10-35 -5-12 -1 4 14-23 7 -2
C-15-50-43-35-38-29-29 1 -9 1-87-55-64-45 11-22-14-20-15-15-10-22 -5 2 6
A 0 -3 -5 4-20-11 5 6 -2-15 66-69-52 -5 -4 6 8-24 -7-10 -7 13 14 -9-18
G 35 22 16 -6 -5-15-25-33-28-53-36-50107 -5-37-44-27-15-23-16-29-47-17-29-15
P 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
T-26 1 4 -7 3 -4 0-10 8-18 7-22-21 8 4 -3 -6 7 -8 1 -5-16-16 7 -6
C 6 -8 19 -7 9 -3 17 -2 3 -9 5 22 22 8 -1 1 18 6 11-10 -8 7 10 0 7
A 14-12-42 1 -5 -4-32 12-10 20 -6 -1 3 -4 4-10 -1 -2-14 11 14 -3 2-13 5
G-23 -7 -1 -6-17 -4 0-15-14 -4-17-10 -5-13 -8 10-13-13 9 -4 -3 10 2 4 -8
P 40
T 0
C 14
A 5
G-21
These come from w101 of Stormo, Schneider, Gold and Ehrenfeucht Nucl.
Acid Res. 10 2997- 3011, 1982. They report that this matrix gives a
score of at least 2 for all gene starts in their library whereas all
other sequences score 1 or less.
@29. TX 1 @ Reverse and complement the sequence
Reverses and complements the current active region of the
sequence.
@60. TX 7 @ Search using a dinucleotide weight matrix
This function performs searches for short sequence motifs
using an appropriate dinucleotide weight matrix. In addition it can
be used to create or modify weight matrices. In order to perform a
search the only input required is the name of the file containing
the weight matrix. The results can be presented graphically or
listed. The graphical presentation will draw a line at the position of
any matches found; the height of the line is proportional to the
score. The method is identical to that using weight matrices derived
from nucleotide frequencies, except that here we use the frequencies
of dinucleotides.
For a search, select "use weight matrix", supply the name of
the file containing the weight matrix, and choose between having
results plotted or listed. If dialogue is requested when the
function is selected users can alter the cutoff score employed.
To create a weight matrix several steps are involved. A file
containing an alignment of known motifs is required. (This file must
be created before the current option is selected. The format is as
follows: each sequence is written on a separate line with at least
one space at the beginning; each sequence is terminated by a space
character, and can be followed by a name. The sequences must be
aligned.) Supply the name of the file of aligned sequences. The
program reads and displays the sequences. Choose between "summing
logs of weights" or summing weights (i.e. whether to multiply or add
weights). If logs are used all scores will be negative. Choose if
all positions in the set of aligned sequences should be used or if a
mask should be applied. If so selected, define a mask as a string of
symbols, in which symbol - means ignore and any other symbol means
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
The program will calculate weights as the frequencies of the
dinucleotides at each unmasked position in the set of aligned
sequences. These weights are then applied to the set of aligned
sequences to give a range of "observed" scores. The mean and
standard deviation of these scores is displayed. The user is asked
to supply several values to be used when the weight matrix is
applied to other sequences: a cutoff score (by default, the mean
minus 3 standard deviations), a top score for scaling graphical
results (by default, the mean plus 3 standard deviations), and a
position to identify (this means that if a particular base within
the motif is used as a "landmark", such as the A of the AG in splice
acceptor sites, then its position will be marked in plots). All
these values are stored along with the weight matrix. Finally supply
the name of a file to contain the weight matrix.
Weight matrices can be "rescaled" using a set of aligned
sequences in much the same ways as a matrix is created. The purpose
is to redefine the cutoff scores, and rescaling does not alter any
other values in the weight matrix file.
The methods have always had to deal with the problem of zeroes
in the matrices. The current versions employ "Laplace's Law of
Succession" in which 1 is added to each term.
Typical dialogue follows.
? Menu or option number=D60
Motif search using dinucleotide weight matrix
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 = 2
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
1 AGCGTGACTCTTCCCGGAA HIS1
2 GAGGTGACTCACTTGGAAG HIS1
3 CGGATGACTCTTTTTTTTT HIS3
4 ACAGTGACTCACGTTTTTT HIS4
5 GTCGTGACTCATATGCTTT ARG3
6 TGAATGACTCACTTTTTGG ARG4
7 TTCTTGACTCGTCTTTTCT CPA1
8 CGAATGACTCTTATTGATG CPA2
9 AGAATGACTAATTTTACTA TRP5
10 TCGTTGACTCATTCTAATC TRP3
11 TTGCTGACTCATTACGATT TRP2
12 GAGATGACTCTTTTTCTTT IV1
13 GCGATGATTCATTTCTCTG IV2
14 TAGATGACTCAGTTTAGTC LEU1
15 TAAGTGACTCAGTTCTTTC LEU4
16 ATGATGACTCTTAAGCATG ILS1
Length of motif 18
? (y/n) (y) Sum logs of weights n
? (y/n) (y) Use all motif positions n
x means use, - means ignore
e.g. xx-x---x-x means use positions 1,2,4,8,10
? Mask=----XXXXXXXX--------
Applying weights to input sequences
1 89.000 AGCGTGACTCTTCCCGGA
2 91.000 GAGGTGACTCACTTGGAA
3 93.000 CGGATGACTCTTTTTTTT
4 90.000 ACAGTGACTCACGTTTTT
5 94.000 GTCGTGACTCATATGCTT
6 91.000 TGAATGACTCACTTTTTG
7 81.000 TTCTTGACTCGTCTTTTC
8 90.000 CGAATGACTCTTATTGAT
9 75.000 AGAATGACTAATTTTACT
10 97.000 TCGTTGACTCATTCTAAT
11 97.000 TTGCTGACTCATTACGAT
12 93.000 GAGATGACTCTTTTTCTT
13 69.000 GCGATGATTCATTTCTCT
14 90.000 TAGATGACTCAGTTTAGT
15 90.000 TAAGTGACTCAGTTCTTT
16 90.000 ATGATGACTCTTAAGCAT
Top score 97.000 Bottom score 69.000
Mean 88.750 Standard deviation 7.319
Mean minus 3.sd 66.794 Mean plus 3.sd 110.706
? Cutoff score (-999.00-9999.00) (66.79) =
? Top score for scaling plots (66.79-999.00) (110.71) =
? Position to identify (0-18) (1) =
? Title=GCN4 DI WTS
? Name for new weight matrix file=3.WTS
? Menu or option number=D60
Motif search using dinucleotide weight matrix
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=3.WTS
GCN4 DI WTS
? Cutoff score (-9999.00-9999.00) (66.79) =40
? (y/n) (y) Plot results n
15 42.00 CAACCCGCTCACCGACAA
29 42.00 ACAACAGCTCACCCACGC
93 46.00 AGCCTTCCTCATCGCTGC
153 40.00 CAGCGGAATCAAACTTAA
408 42.00 CGATGGATTCAAGTTGAA
469 47.00 TTAGGAACTCCCTCTGTC
493 60.00 AAGCTGAATCTTAGCAGC
530 43.00 CGGAGGGCTCAGTGAGGG
542 47.00 TGAGGGACTACTGCACCA
678 41.00 CTTCTGCTTCAAAGAGTT
709 47.00 AATATGACGGCGCACGTG
848 54.00 GTCAGAACTCAAATCAGT
940 49.00 CCGTTGACGACCTCCGCA
992 42.00 TGGGCACCTCACACCAAG
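The steps for making a dinucleotide weight matrix can be sketched in Python using the GCN4 alignment and mask from the typical dialogue above. This is an illustration, not the program's code, and the function names are hypothetical. It takes the weights as raw dinucleotide counts and sums them (the "sum weights" choice rather than "sum logs"). Scored this way the sketch gives the same values as the dialogue lists for the 16 input sequences, with the standard deviation taken over the whole set (population form). The +1 correction mentioned in the text is not applied here.

```python
from collections import Counter
from statistics import mean, pstdev

ALIGNED = [
    "AGCGTGACTCTTCCCGGAA", "GAGGTGACTCACTTGGAAG", "CGGATGACTCTTTTTTTTT",
    "ACAGTGACTCACGTTTTTT", "GTCGTGACTCATATGCTTT", "TGAATGACTCACTTTTTGG",
    "TTCTTGACTCGTCTTTTCT", "CGAATGACTCTTATTGATG", "AGAATGACTAATTTTACTA",
    "TCGTTGACTCATTCTAATC", "TTGCTGACTCATTACGATT", "GAGATGACTCTTTTTCTTT",
    "GCGATGATTCATTTCTCTG", "TAGATGACTCAGTTTAGTC", "TAAGTGACTCAGTTCTTTC",
    "ATGATGACTCTTAAGCATG",
]
MASK = "----XXXXXXXX--------"  # "-" means ignore, anything else means use


def make_matrix(seqs, mask):
    """Count dinucleotides at each unmasked position of the alignment."""
    n_pos = len(seqs[0]) - 1  # number of dinucleotide positions
    matrix = []
    for pos in range(n_pos):
        if mask[pos] == "-":
            matrix.append(None)
        else:
            matrix.append(Counter(s[pos:pos + 2] for s in seqs))
    return matrix


def score(matrix, segment):
    """Sum the weights of the dinucleotides found in a segment."""
    total = 0
    for pos, weights in enumerate(matrix):
        if weights is not None:
            total += weights[segment[pos:pos + 2]]  # 0 if never seen
    return total


matrix = make_matrix(ALIGNED, MASK)
scores = [score(matrix, s) for s in ALIGNED]
m, sd = mean(scores), pstdev(scores)
cutoff, top = m - 3 * sd, m + 3 * sd  # default cutoff and plot top
print(scores)
print(round(m, 3), round(sd, 3), round(cutoff, 3), round(top, 3))
```

A search would then slide a window of the motif length along the sequence and report each position whose score reaches the cutoff.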
@61. TX 8 @ Search for eukaryotic ribosome binding sites
Searches for eukaryotic ribosome binding sites using
weightings derived from Sargan, Gregory, Butterworth FEBS Lett. 147
133-136 1982. No dialogue is required. First described in Staden
Nucl. Acid Res. 12 505-519 1984.
mRNA WTS FOR EUKARYOTES SARGAN,GREGORY,BUTTERWORTH FEBS LET
147 133-136 1982
P -7 -6 -5 -4 -3 -2 -1 1 2 3
102102102102102102102102102102
T 19 24 31 12 0 18 5 0102 0
C 20 15 32 65 5 42 52 0 0 0
A 50 27 27 19 86 36 34102 0 0
G 6 29 12 6 11 6 11 0 0102
VIRAL ONLY
P -7 -6 -5 -4 -3 -2 -1 1 2 3
41 41 41 41 41 41 41 41 41 41
T 14 12 16 4 2 13 9 0 41 0
C 7 3 13 17 7 9 14 0 0 0
A 15 10 6 10 27 15 9 41 0 0
G 5 16 6 10 5 4 9 0 0 41
The Sargan et al paper puts forward the hypothesis that there is an
interaction between some mRNA leader sequences and a highly conserved
structure in the 18S rRNA of eukaryotic ribosomes. The attempt to
substantiate the hypothesis includes a table of base frequencies for
sequences immediately 5' to start codons. They examined 102
sequences and I have used the base frequencies they found as a weight
matrix for searching for eukaryotic gene starts. I don't yet know how
good this method is. The viral sequences were found to be slightly
different but the separate table shown here is not used in the
program.
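Using a frequency table as a weight matrix can be sketched in Python with the 102-sequence table shown above. The sketch assumes that the score of a 10-base window is the plain sum of the table entries for the bases found at positions -7 to 3; how the program itself combines the weights and chooses its cutoff is not stated here. The names SARGAN, score and scan, and the cutoff used in the example, are hypothetical.

```python
# Base frequencies at positions -7 .. -1, 1, 2, 3 (the ATG occupies 1-3).
SARGAN = {
    "T": [19, 24, 31, 12, 0, 18, 5, 0, 102, 0],
    "C": [20, 15, 32, 65, 5, 42, 52, 0, 0, 0],
    "A": [50, 27, 27, 19, 86, 36, 34, 102, 0, 0],
    "G": [6, 29, 12, 6, 11, 6, 11, 0, 0, 102],
}
WIDTH = 10


def score(window):
    """Sum the table entries for the bases in a 10-base window."""
    return sum(SARGAN[base][i] for i, base in enumerate(window))


def scan(seq, cutoff):
    """Return (position of the A of ATG, score) for windows >= cutoff.

    Positions are 1-based. The cutoff is chosen by the caller.
    """
    hits = []
    for start in range(len(seq) - WIDTH + 1):
        s = score(seq[start:start + WIDTH])
        if s >= cutoff:
            hits.append((start + 8, s))  # window index 7 holds position 1
    return hits


print(score("AGCCACCATG"))
print(scan("TTAGCCACCATGGC", cutoff=600))
```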
@62. TX 8 @ Search for splice junctions
Used to search for mRNA splice junctions using a weight
matrix. The default weight matrix is still that derived from the
paper of Mount (Nucl. Acids Res. 10, 459-472). However users may
employ their own tables. By default the positions of possible
junctions will be plotted rather than listed. The diagram splits
the donor plot into 3 horizontal boxes so that all the sites marked
in any box are from the same reading frame. The acceptor plot
appears above the donor plot and is split in an equivalent way. So
sites marked as donors and acceptors in equivalent boxes are
compatible, i.e. donors from donor box 1 are compatible with
acceptors from acceptor box 1, etc. Of course it is the combination
of reading frame and splice sites that really matters, and donors
from box 1 can be compatible with acceptors in box 3 if the reading
frame switches.
If dialogue is selected users can employ their own file of
weights (see below for the format), can change the cutoff scores,
and can elect to have the results listed rather than plotted. Listed
results show the position (of the last or first base in the exon),
the frame and the matching sequence. The frequency table shown
below is used as a default weight matrix and AG and GT are
obligatory at the appropriate positions. The plots are scaled so
that the top of scale is the highest value achieved by a junction
sequence in the set used to compile the frequency table, and the
bottom of the scale is the lowest value achieved by a junction
sequence in the set used to compile the frequency table.
In the light of current knowledge it would be sensible for
users to use the weight matrix search option (20) to create matrices
that define more specific splice junctions. If so it is important
that the positions "marked" are the last base in the donor exon and
the first base in the acceptor exon. To make a weight matrix
suitable for use with this function follow the instructions for
option 20 and create files for both donor and acceptor sites. Then
concatenate the two matrix files with the donor file first. Note
that any positions in the weight matrix that are 100% conserved will
be made obligatory (normally the AG and GT).
Mount donors redone 16-4-91
12 3 -16.085 -7.500
P   -2  -1   0   1   2   3   4   5   6   7   8   9
N  136 136 136 136 136 136 136 136 136 136 136 136
T   28   8  15  17   0 136   9  16   7  84  30  36
C   41  60  16   7   0   0   3  13   3  17  28  39
A   40  56  89  12   0   0  83  91  12  23  53  33
G   27  12  16 100 136   0  41  16 114  12  25  28
Mount acceptors redone 16-4-91
18 15 -26.142 -14.400
P  -14 -13 -12 -11 -10  -9  -8  -7  -6  -5  -4  -3  -2  -1   0   1   2   3
N  113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113
T   58  50  57  59  67  56  58  49  47  66  64  31  34   0   0  11  41  31
C   21  28  34  25  29  33  35  32  42  40  33  25  74   0   0  23  28  41
A   17  11  11  18   7  17  12  23  15   3  10  29   5 113   0  24  21  21
G   17  24  11  11  10   7   8   9   9   4   6  28   0   0 113  55  23  20
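The rule that 100% conserved positions are made obligatory can be
illustrated with the donor counts shown above. The following Python
sketch is not the program's own code: the function name and the
dictionary layout are assumptions made only for illustration.

```python
# Counts copied from the "Mount donors" table (positions -2..9, N = 136).
POSITIONS = [-2, -1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
DONOR_COUNTS = {
    "T": [28, 8, 15, 17, 0, 136, 9, 16, 7, 84, 30, 36],
    "C": [41, 60, 16, 7, 0, 0, 3, 13, 3, 17, 28, 39],
    "A": [40, 56, 89, 12, 0, 0, 83, 91, 12, 23, 53, 33],
    "G": [27, 12, 16, 100, 136, 0, 41, 16, 114, 12, 25, 28],
}
TOTAL = 136


def obligatory_positions(counts, positions, total):
    """Return {position: base} for columns where one base has every count."""
    found = {}
    for base, row in counts.items():
        for pos, n in zip(positions, row):
            if n == total:
                found[pos] = base
    return found


print(obligatory_positions(DONOR_COUNTS, POSITIONS, TOTAL))
```

For the donor table this picks out G at position 2 and T at position 3,
i.e. the obligatory GT.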
@63. TX 7 @ Search using a weight matrix (complementary)
This function searches the complementary strand of the
sequence using a weight matrix. Many motifs can bind to either
strand of the DNA and this function allows users to search the
complementary strand without having to change the orientation of the
sequence. See option 20 for more details.
@64. TX 3 @ Plot observed-expected word frequencies
This option is designed to examine the abundances of short
words in a sequence to see if particular ones are either under or
over represented. It compares the observed and expected frequencies
and plots them along the sequence. There has been some work on the
relative amounts of CG dinucleotides in eukaryotic sequences (eg
Bird, Nature 321, 209-213 (1986)) and this new routine can be used
to examine such biases, or any others that might be interesting.
The user selects a word - say CG -, a window length, and a
maximum and minimum scale for plotting the results. The program
examines each successive window length along the sequence, with each
window overlapping the previous one by windowlength-1. The program
counts the base frequencies in each window, and the number of
occurrences of the chosen word within the window. Using the base
frequencies it calculates an expected number of occurrences for the
chosen word (simply by multiplying the relevant frequencies). It
plots observed-expected, and hence will show regions that are rich
or depleted in the chosen word. The longest allowed word is 9
characters, but the calculation of the expected frequencies becomes
less appropriate as the word length increases above 2.
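The calculation can be sketched in Python as follows. This is an
illustration, not the program's code. The help text only says that the
relevant base frequencies are multiplied; scaling that product by the
number of places at which the word could start in the window is an
assumption made here, and the function name is hypothetical.

```python
def obs_minus_exp(seq, word, span):
    """Observed minus expected count of `word` in each window of length `span`.

    Expected count = (product of the window's base frequencies for the
    letters of `word`) * (number of possible start positions in the window).
    The second factor is an assumption of this sketch.
    """
    seq, word = seq.upper(), word.upper()
    n_starts = span - len(word) + 1
    results = []
    # Successive windows overlap the previous one by span-1 bases.
    for left in range(len(seq) - span + 1):
        window = seq[left:left + span]
        observed = sum(window.startswith(word, i) for i in range(n_starts))
        expected = float(n_starts)
        for ch in word:
            expected *= window.count(ch) / span
        results.append(observed - expected)
    return results
```

Positive values mark windows rich in the word, negative values mark
windows depleted in it.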
Typical dialogue follows.
? Menu or option number=D64
Plot composition differences (obs-exp))
Default String=CG
? String=
? odd span length (3-401) (101) =
? plot interval (1-20) (5) =
? Maximum plot value (-6.31-25.25) (6.31) =
? Minimum plot value (-25.25-6.31) (-6.31) =
Missing graphics display here
@65. TX 9 @ Search for polya sites
Simply searches for the sequence AATAAA (Proudfoot and
Brownlee, Nature 263, 211-214, 1976) and marks it with a short
vertical line.
@66. TX 1 @ Interconvert t and u
This function interconverts T and U characters in the active
sequence, i.e. between DNA and RNA.
@67. TX 7 @ Search for patterns of motifs
This option searches for patterns of motifs. Patterns can be
defined interactively or read from files. Results can be displayed
in several ways in both graphical and textual form. It is also used
to create pattern files for searching libraries. The option is extremely
flexible and consequently the following documentation is quite
lengthy. However the routine is capable of searching for almost any
known pattern. In addition the flexibility does not necessitate
difficulty of use, and the user interface has been simplified
considerably since the methods were first published.
Users should refer to the "typical dialogue" shown below for
the most helpful information on using the program.
There are currently four ways to display the matching
patterns: 1=each individual motif and its position is listed; 2=all
the sequence between, and including the two outermost motifs is
listed; 3=graphical, with a vertical line marking the position of
the leftmost motif; 4 = EMBL feature table format, where the KEYNAM
field is the motif name, the FROM and TO fields denote the ends of
the match, and the DESCRIPTION field is "Program".
When it is defined for the first time a pattern must be
entered interactively at the keyboard, but the pattern description
can be saved to a file. This file can be used for all subsequent
searches.
When defining a pattern interactively select a motif class and
the program will request the required inputs.
The program gives each motif an identifying name and number.
For motifs other than the first, a range of allowed positions must
be defined (Note that sets of motifs included using the OR operator
will all be given the same range, and so the program will only
request range values for the first motif in any such set). To
specify the allowed range for a motif the user must supply the
following: the identifying number of the motif, relative to which
the current motif's positions are to be defined (termed the
"reference motif"); a "relative start position" and a range. The
relative start position can be negative or positive. A negative
start position means that although the reference motif is searched
for first, the current motif can be found to its left. A zero
relative start position means their left ends are superimposed. The
default start position is to butt-join the motif to the right-hand end
of the "reference motif". The range is "the number of extra
positions" that the motif can take.
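As a small illustration of how the start position and the range
combine, the Python sketch below lists the relative positions allowed
for the 5' base of a motif. It is not the program's code and the
function name is hypothetical. It follows the pattern description in
the typical dialogue below, where a relative start of 4 with 2 extra
positions is reported as "positions 4 to 6".

```python
def allowed_relative_positions(relative_start, extra_positions):
    """Relative positions the 5' base of the current motif may take."""
    if extra_positions < 0:
        raise ValueError("extra_positions must be >= 0")
    return list(range(relative_start, relative_start + extra_positions + 1))


print(allowed_relative_positions(4, 2))   # relative start 4, 2 extra positions
```

With zero extra positions the motif has exactly one allowed position.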
The program will display the probability of finding each
motif. These values are presented in the following form: .1234E-5
means 0.1234 times 10 to the power -5.
After the pattern has been defined, the program will type a
description of it on the screen. It will then allow the user to give
an overall cutoff score and overall probability cutoff.
Typical dialogue for all the different motif classes is
displayed below.
? Menu or option number=67
Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =
? Motif name=Ematch
? String=AA
Probability of score 2.0000 = 0.595E-01
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =2
? Motif name=AAA
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-1) (1) =
? Relative start position (-1000-1000) (3) =
? Number of extra positions (0-1000) (0) =
? string=AAA
? Minimum matches (1.00-3.00) (3.00) =2
Probability of score 2.0000 = 0.149E+00
1 Exact match
X 2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =3
? Motif name=T'S
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-2) (2) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? String=TTT
? Minimum score (0.00-108.00) (108.00) =72
Probability of score 72.0000 = 0.258E+00
1 Exact match
2 Percentage match
X 3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =4
? Motif name=GCN4
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-3) (3) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? Weight matrix file name=GCN4
GCN4 FROM WEIGHTS 17-11-87
Probability of score -22.0020 = 0.139E-02
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
X 4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =5
? Motif name=GCN4
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-4) (4) =
? Relative start position (-1000-1000) (20) =
? Number of extra positions (0-1000) (0) =
? Weight matrix file name=GCN4
GCN4 FROM WEIGHTS 17-11-87
Probability of score -22.0020 = 0.606E-03
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
X 5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =6
? Motif name=LOOP
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-5) (5) =
? Relative start position (-1000-1000) (20) =
? Number of extra positions (0-1000) (0) =
? Stem length (1-60) (6) =
? Minimum loop length (-6-60) (0) =
? Maximum loop length (0-60) (0) =5
? Minimum score (1.00-12.00) (12.00) =10
Probability of score 10.0000 = 0.598E-02
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
X 6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =7
? Motif name=Tstep
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-6) (6) =
? (y/n) (y) Relative to 5 prime end
? Relative start position (-1000-1000) (1) =
? Number of extra positions (0-1000) (0) =
? String=TTT
? Step (1-20) (3) =
Probability of score 3.0000 = 0.367E-01
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
X 7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =8
? Motif name=REPEAT
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-7) (7) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =2
? Repeat length (1-60) (6) =
? Minimum gap (0-60) (0) =
? Maximum gap (0-60) (0) =4
? Minimum score (1.00-6.00) (6.00) =5
Probability of score 5.0000 = 0.554E-02
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
X 8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =9
? (y/n) (y) Save pattern in a file N
Pattern description
Motif 1 named Ematch is of class 1
Which is an exact match to the string
AA
Motif 2 named AAA is of class 2
which is a match of score 2. to the string
AAA
and the 5 prime base can take positions 3 to 3
relative to the 5 prime end of motif 1
It is anded with the previous motif.
Motif 3 named T'S is of class 3
which is a match of score 72. to the string
TTT
and the 5 prime base can take positions 4 to 4
relative to the 5 prime end of motif 2
It is anded with the previous motif.
Motif 4 named GCN4 is of class 4
Which is a match to a weight matrix with score -22.002
and the 5 prime base can take positions 4 to 4
relative to the 5 prime end of motif 3
It is anded with the previous motif.
Motif 5 named GCN4 is of class 5
Which is a match to the complement of a weight matrix with score -22.002
and the 5 prime base can take positions 20 to 20
relative to the 5 prime end of motif 4
It is anded with the previous motif.
Motif 6 named LOOP is of class 6
Which is a stem-loop structure with stem length 6 and score 10.
The loop can have sizes 0 to 5
and the 5 prime base can take positions 20 to 20
relative to the 5 prime end of motif 5
It is anded with the previous motif.
Motif 7 named Tstep is of class 7
Which is an exact match to the string
TTT
with a step size of 3
and the 5 prime base can take positions 1 to 1
relative to the 5 prime end of motif 6
It is anded with the previous motif.
Motif 8 named REPEAT is of class 8
Which is a repeat with repeat length 6 and score 5.
The loop-out can have sizes 0 to 4
and the 5 prime base can take positions 4 to 6
relative to the 5 prime end of motif 7
It is anded with the previous motif.
Probability of finding pattern = 0.2348E-14
Expected number of matches = 0.5100E-09
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
Select display mode
X 1 Motif by motif
2 Inclusive
3 Graphical
4 EMBL feature table
? 0,1,2,3,4 =4
Searching
Total matches found 0
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structures and repeats
m5 = Translation and codons
m6 = Gene search by content
m7 = Prokaryotic signal search
m8 = Eukaryotic signal search
? = Help
! = Quit
? Menu or option number=67
Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =
? Motif name=Arun
? String=AAAAAA
Probability of score 6.0000 = 0.210E-03
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =9
? (y/n) (y) Save pattern in a file N
Pattern description
Motif 1 named Arun is of class 1
Which is an exact match to the string
AAAAAA
Probability of finding pattern = 0.2103E-03
Expected number of matches = 0.1522E+01
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
Select display mode
X 1 Motif by motif
2 Inclusive
3 Graphical
4 EMBL feature table
? 0,1,2,3,4 =4
Searching
FT Arun 1582 1587 Program
FT Arun 3160 3165 Program
FT Arun 4204 4209 Program
FT Arun 5691 5696 Program
FT Arun 6710 6715 Program
Total matches found 5
Minimum and maximum observed scores 6.00 6.00
These methods allow users to define and search for complex
patterns of motifs defined as single objects. The programs allow
individual DNA motifs to be defined in eight different ways, and
protein motifs in six. Motifs are combined, using the logical
operators AND, OR and NOT, to describe a pattern. The pattern also
specifies the ranges of allowed relative separations of the
individual motifs.
First some definitions.
A MOTIF is a contiguous subsequence of fixed length. At its
simplest it could be a single definite base or amino acid; a more
complex motif might be better represented as a consensus or a weight
matrix; two more-abstract types of motif are direct and inverted
repeats.
A PATTERN is a higher order of structure defined by a list of
motifs. The motifs in a pattern are combined using the logical
operators AND, OR and NOT. The list also defines the allowed
relative separations of the motifs. In the current versions of the
programs up to 50 motifs can be combined into a single pattern. So
using these definitions there are two differences between motifs and
patterns: 1) the distances between all elements of a motif are
fixed, but the separations of parts of patterns can vary; 2) all
characters in a motif are defined using the same method (class), but
different parts of a pattern can be defined in completely different
ways.
Each motif can be represented in 9 ways (known as the motif
class):
MOTIF CLASSES
CLASS DESCRIPTION
1 Exact match to a short defined sequence. The IUB symbols
can be used for DNA sequences.
2 Percentage match to a defined short sequence. In nucleic acids,
the IUB symbols can be used.
3 Match to a defined sequence, using a score matrix and cutoff
score. The DNA matrix (see option 18) gives scores to IUB symbols
depending on their level of redundancy. MDM78 is used for proteins.
4 Match to a weight matrix with cutoff score.
5 As class 4 but on the complementary strand.
6 Inverted repeat or stem-loop. Fixed stem length, range of
loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
7 Exact match to short sequence but with a defined step size.
8 Direct repeat. Fixed repeat length, range of loop-out sizes,
cutoff score, and score matrix (for protein sequences MDM78 and
for nucleic acids an identity matrix).
9 Membership of a set. A list of sets of allowed amino acids for
each position in the motif. The sets are separated by commas(,).
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
acids in which one of I,V or L must be found in the first position,
then anything in the next two positions, D,E,K or R in the fourth
position and F,Y,W,I,L,V or M in the fifth. This class only applies
to protein sequences because for nucleic acids "membership of a
set" can be achieved using IUB symbols.
Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
nucleic acids.
Class 1: exact match.
The motif is defined by a short sequence, which for nucleic
acids, may include IUB symbols. All symbols must match.
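A minimal Python sketch of an exact match with IUB symbols in the
motif follows. It is an illustration only: the function name is
hypothetical, the table is the standard IUB nucleotide code, and it is
assumed here that redundancy symbols appear in the motif but not in the
sequence.

```python
# Standard IUB nucleotide codes: each symbol names a set of bases.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}


def exact_matches(seq, motif):
    """1-based start positions where every motif symbol matches."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        if all(seq[i + k] in IUB[m] for k, m in enumerate(motif)):
            hits.append(i + 1)
    return hits
```

For example, the motif RY matches any purine followed by any
pyrimidine.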
Class 2: percentage match
The motif is defined by a short sequence, which for nucleic
acids, may include IUB symbols. The minimum number of matching
characters must also be specified.
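The same idea with a minimum number of matches can be sketched as
below. This is a simplified illustration, not the program's code: the
function name is hypothetical and, to keep it short, characters are
compared by simple identity (IUB symbols are not expanded).

```python
def percentage_matches(seq, motif, min_matches):
    """(1-based position, match count) for windows reaching min_matches."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        score = sum(a == b for a, b in zip(window, motif))
        if score >= min_matches:
            hits.append((i + 1, score))
    return hits
```

Setting min_matches to the motif length reduces this to a class 1
exact match.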
Class 3: match using a score matrix
The motif is defined by a short sequence, which for nucleic
acids, may include IUB symbols. The motif is not compared directly
with the sequence to count the number of matching characters.
Instead a matrix is used to provide a score for all possible pairs
of characters. The motif score for any position along the sequence
is the sum of the scores found by looking-up the scores for each
pair of aligned characters. A match is declared if some minimum
score is achieved.
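A sketch of scoring with a score matrix is given below in Python. The
pair scores used are a toy assumption, not the program's DNA matrix:
identical bases score 36 and everything else scores 0. The value 36 is
chosen only because the typical dialogue offers a maximum score of 108
for the three-character string TTT. The function names are
hypothetical.

```python
def toy_score(a, b):
    """Hypothetical pair score: 36 for identical bases, 0 otherwise."""
    return 36 if a == b else 0


def score_matrix_matches(seq, motif, min_score, pair_score=toy_score):
    """(1-based position, score) for windows scoring at least min_score."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        # zip stops at the end of the motif, so only the window is scored.
        total = sum(pair_score(s, m) for s, m in zip(seq[i:], motif))
        if total >= min_score:
            hits.append((i + 1, total))
    return hits
```

A real score matrix would also give intermediate scores to pairs that
involve redundancy symbols; any function of two characters can be
passed as pair_score.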
Class 4: weight matrix
The motif is defined by a table of values (called weights or
scores). The table gives a score for finding each possible character
at each position along the length of the motif. It therefore has
dimension motif-length x character-set-size, and allows us to give
different scores for each character at each position. It is
equivalent to having a different score matrix for each position
along the motif, and provides the most flexible and specific method
of defining motifs. The weight matrices are created by program NIP
option 20 and stored as files. The file contains the values for each
position, as well as an overall minimum score. There are two ways in
which these values can be used to calculate an overall score for any
section of the sequence. The simplest way is to add the values in
the file. (This means that the highest possible score can be
calculated by adding the top value at each column position, and the
lowest by adding the bottom value.) The normal way of using the
values in the file is as follows. First the programs divide the
values in each column by the column total so that they sum to 1.0.
Then the natural logs of these values are used as scores. When the
matrix is applied to a sequence these logarithmic values are summed
(which is of course equivalent to multiplying the frequencies).
Note that using the natural logs of the frequencies as weights and
adding them means that the overall cutoff score must be less than
zero, whereas if the original values in the weight matrix file are
added, the cutoff score will be greater than zero. The search
routines therefore decide whether the user wants to add values or
multiply frequencies by examining the value of the cutoff score: it
will add if the cutoff is greater than zero and add logs of
frequencies if it is less than zero. Hence we effectively get two
motif classes in one. The program NIP, when creating weight matrix
files, will ask the user whether the scores should be added or
multiplied. If the values in the table have been defined without
using a set of aligned sequences it is easier for the user to choose
a cutoff score if the values are added.
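The two ways of using the values can be sketched in Python as follows.
This is an illustration under stated assumptions, not the program's
code: the function names and the list-of-dictionaries layout are
hypothetical, and a zero count is given a log score of minus infinity
because the help text does not say how zeroes are treated.

```python
import math


def log_weights(counts):
    """Convert per-position count dictionaries to natural-log frequencies."""
    weights = []
    for column in counts:
        total = sum(column.values())
        weights.append({
            # Assumption: a zero count can never match (log 0 = -infinity).
            base: math.log(n / total) if n > 0 else float("-inf")
            for base, n in column.items()
        })
    return weights


def window_score(window, counts, cutoff):
    """Score a window; the sign of the cutoff selects the method."""
    if cutoff > 0:
        # Positive cutoff: add the raw values from the file.
        return sum(col[base] for base, col in zip(window, counts))
    # Otherwise: add natural logs of the column frequencies.
    weights = log_weights(counts)
    return sum(col[base] for base, col in zip(window, weights))
```

A window is reported as a match when its score is at least the cutoff.
A cutoff of exactly zero is not covered by the description above; this
sketch sends it to the logarithmic branch.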
Class 5: complement of weight matrix
The motif is defined by a weight matrix, but the program
searches for its complement.
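One way to picture this is to build the matrix for the other strand
and then search with it in the usual way. The Python sketch below
assumes that "complement" means the reverse complement (column order
reversed and each base swapped for its partner). That reading, the
function name and the data layout are assumptions of this example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def complement_matrix(columns):
    """Reverse the column order and swap each base for its complement."""
    return [
        {COMPLEMENT[base]: value for base, value in column.items()}
        for column in reversed(columns)
    ]
```

Applying the function twice returns the original matrix.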
Class 6: inverted repeat, or stem-loop
The motif is defined by a repeat length, a minimum score and a
range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0.
The loop sizes are defined by a minimum and maximum distance from
the 3' end of the stem. For a stem-loop these will be positive
numbers. For example to define a stem of length 8 and loop sizes
varying from 3 to 5, the stem would be set to 8, the minimum start
distance to 3 and the maximum to 5. To define an inverted repeat the
minimum distance will be negative. For example stem length=9,
minimum distance=-9, and maximum distance=-8 will find inverted
repeats of lengths 9 and 10. E.g. AAAAATTTT and AAAAATTTTT would be
found, the first having a base at its centre, the second having
none.
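The scoring of a single candidate can be sketched in Python as below.
This is an illustration rather than the program's code: the function
name and the 0-based start index are assumptions. The pair scores are
those given above, and the distance argument is the offset of the 3'
arm from the 3' end of the 5' arm, so a negative value makes the two
arms overlap.

```python
PAIR_SCORES = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}


def stem_loop_score(seq, start, stem, distance):
    """Score a stem whose 5' arm begins at 0-based index `start`."""
    right = start + stem + distance          # first base of the 3' arm
    if start < 0 or right < 0 or max(start, right) + stem > len(seq):
        raise ValueError("stem does not fit in the sequence")
    total = 0
    for k in range(stem):
        five = seq[start + k]
        three = seq[right + stem - 1 - k]    # pair outermost bases first
        total += PAIR_SCORES.get((five, three), 0)
    return total


print(stem_loop_score("GCGCAAAGCGC", 0, 4, 3))    # stem 4, loop 3
print(stem_loop_score("AAAAATTTTT", 0, 9, -8))    # inverted repeat, length 10
```

In this sketch the maximum score is twice the stem length. With stem
length 9 and distance -9 the central base is paired with itself and
contributes nothing, so AAAAATTTT scores 16 rather than 18.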
Class 7: exact match, defined step size.
The motif is defined by a short sequence, which for nucleic
acids, may include IUB symbols. All symbols must match. The class
differs from class 1 in that searches will move in steps of some
given size. For example we could search for a certain codon and use
a step size of 3 and hence keep in a single reading frame.
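A short Python sketch of a stepped search follows. It is illustrative
only: the function name and arguments are hypothetical, and for brevity
the motif is compared by simple identity.

```python
def stepped_matches(seq, motif, first, step):
    """1-based positions of exact matches, testing first, first+step, ..."""
    seq, motif = seq.upper(), motif.upper()
    hits = []
    for pos in range(first, len(seq) - len(motif) + 2, step):
        if seq[pos - 1:pos - 1 + len(motif)] == motif:
            hits.append(pos)
    return hits
```

Searching ATGTAAGCTTAATAG for TAA from position 1 with a step of 3
examines positions 1, 4, 7, 10 and 13 only, so a TAA lying in another
reading frame is not reported.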
Class 8: direct repeat
The motif is defined by a repeat length, a minimum score and a
range of loop sizes. The scores are defined using MDM78 for protein
sequences and an identity matrix for nucleic acids. The loop-out sizes
are defined by a minimum and maximum distance from the 3' end of the
first copy of the repeat.
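For nucleic acids the identity matrix gives one point per matching
base, so a direct repeat can be scored as in the Python sketch below.
The function names and 0-based start index are assumptions made for
this example; it is not the program's code.

```python
def direct_repeat_score(seq, start, length, gap):
    """Identity score between the copy at `start` and the copy that
    begins `gap` bases after the first copy ends (0-based start)."""
    second = start + length + gap
    if start < 0 or gap < 0 or second + length > len(seq):
        raise ValueError("repeat does not fit in the sequence")
    first_copy = seq[start:start + length]
    second_copy = seq[second:second + length]
    return sum(a == b for a, b in zip(first_copy, second_copy))


def best_direct_repeat(seq, start, length, min_gap, max_gap):
    """Best (score, gap) over the allowed gap range, or None if none fit."""
    best = None
    for gap in range(min_gap, max_gap + 1):
        if start + 2 * length + gap > len(seq):
            break
        score = direct_repeat_score(seq, start, length, gap)
        if best is None or score > best[0]:
            best = (score, gap)
    return best
```

A match would be declared when the best score reaches the minimum score
requested, for example 5 out of a possible 6 for a repeat of length 6.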
Class 9: membership of a set
This motif class is for protein sequences. It is defined by
lists of allowed amino acids for each position in the motif, and a
cut-off score. Positions at which any amino acid can occur are left
blank. All allowed amino acids for each position give a score of 1.
The motifs can be defined in two ways: either typed at the keyboard
or read in as a weight-matrix-like file. When the motif is defined
at the keyboard the sets of allowed amino acids are separated by
commas(,). For example IVL,,,DEKR,FYWILVM defines a motif of length
5 amino acids in which one of I,V or L must be found in the first
position, then anything in the next two positions, D,E,K or R in the
fourth position and F,Y,W,I,L,V or M in the fifth. To specify that
the whole motif must match, a score of 3 would be required (i.e. one
of the allowed amino acids must be found for each of the three
defined positions). If the motif is read from a file the file must
have been written by program NIP, or have been saved by the pattern
searching routines. If the user elects to save a pattern, and it
includes class 9 motifs typed at the keyboard, then the program will
save the class 9 motifs as weight matrix files. Therefore it will
request file names for each motif of this class. If the motif given
above as an example were saved the weight matrix file would have 5
columns. The first column would contain zeroes except for the I, V
and L rows which would be set to 1; the next two columns would all
be zero; the next would be zero except for the D,E,K and R rows
which would be 1; the final column would contain 1's in rows
F,Y,W,I,L,V and M, with the rest zero.
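The keyboard form and its weight-matrix equivalent can be sketched in
Python as follows. This is an illustration only: the function names
are hypothetical, and the order of the amino acid rows in a real
weight matrix file is not specified here, so an alphabetical one-letter
alphabet is assumed.

```python
def parse_sets(definition):
    """Split 'IVL,,,DEKR,FYWILVM' into one set per motif position."""
    return [set(field.strip().upper()) for field in definition.split(",")]


def membership_score(window, sets):
    """One point for each defined position whose residue is allowed."""
    return sum(1 for residue, allowed in zip(window, sets)
               if allowed and residue in allowed)


def to_columns(sets, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Weight-matrix-like columns: 1 for allowed residues, else 0."""
    return [{aa: int(aa in allowed) for aa in alphabet} for allowed in sets]


sets = parse_sets("IVL,,,DEKR,FYWILVM")
print(membership_score("VGGKF", sets))   # all three defined positions match
```

Blank positions contribute nothing to the score, which is why a score
of 3 is the maximum for this five-residue motif.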
The logical operator (AND, OR or NOT) used to add each motif
to the pattern is specified by preceding the class number by the
letters A, O or N. A = AND, O = OR, N = NOT. The default is A, so
N2 means include, using the NOT operator, a class 2 motif; O2 means
include, using the OR operator, a class 2 motif; both A2 and 2 mean
include, using the AND operator, a class 2 motif.
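The convention can be expressed as a small parser. The Python sketch
below is only an illustration of the rule just described; the function
name is hypothetical and the program's own input handling may differ.

```python
OPERATORS = {"A": "AND", "O": "OR", "N": "NOT"}


def parse_motif_choice(text):
    """Split a reply such as 'N2' into (operator, class number)."""
    text = text.strip().upper()
    if text and text[0] in OPERATORS:
        operator, digits = OPERATORS[text[0]], text[1:]
    else:
        operator, digits = "AND", text     # AND is the default operator
    if not digits.isdigit():
        raise ValueError(f"not a motif class: {text!r}")
    return operator, int(digits)
```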
Range setting.
The motifs in a pattern are numbered according to their order
in the list. Apart from the first motif in a pattern all motifs are
given a range of allowed positions relative to a motif further up
the list. For example suppose we have a pattern defined by A AND B
AND C AND D. Motif A can occur anywhere, but B must have its range
of allowed positions defined relative to the position of motif A,
and C's positions can be defined relative to either A or B,
depending on which is most convenient, and likewise D's positions
can be relative to A or B or C.
Notice that the positions of motifs can be defined relative to
more than one motif. Suppose we have a pattern consisting of motifs
A, B and C, and that B occurs 5-10 residues right of A, C occurs
5-10 residues right of B, and also C is never more than 15 residues
from A. Then it is quite consistent with the methods to include
motif C into the pattern twice using the AND operator: once relative
to A and once relative to B. This will define the relative spacing
and the ORDER of the motifs in the pattern. (If we simply defined
the position of C relative to A it could be found to the left of B).
Motifs combined together using the OR operator are all given
the same range. For example suppose we had a pattern A AND (B OR C)
AND (D OR E), then B and C each have the same range, and D and E
also have the same range as one another. The range for D and E can
be relative to A or to B.
Motifs cannot have their ranges defined relative to motifs
that are included using the NOT operator. For example if we had the
pattern A NOT B AND C, then the range for C can only be defined
relative to motif A.
Speed can be gained by arranging the order of the motifs so
that those higher up the list are of types that can be searched for
rapidly and that are also unlikely to be found.
Motifs combined by the OR operator are alternatives: if any
one of a set of motifs combined by the OR operator is found, then a
match is declared. All alternatives will be reported. For example if
we had a pattern defined by A AND (B OR C), then all places where A
occurs and B is found within range, and all places where A is found
and C is found within range will be reported. A typical use would be
where we might allow a motif to appear on either strand of the DNA
sequence. For example a weight matrix representing the heatshock
element could be used in a pattern which included heatshock as a
motif class 4 combined using the OR operator with heatshock as a
motif class 5.
The probability calculations are performed for each motif as
it is defined. If an overall probability cut-off is given the
calculation is repeated for each match found. To achieve maximum
searching speed do not give an overall probability cut-off. Overall
cut-off scores should only be used if the motif classes used are
compatible.
There are currently several ways to display the matches: 1 =
each motif and its position is listed; 2 = all the sequence between
the two outermost motifs is listed; 3 = graphical, with a spike
marking the position of the leftmost motif. The library versions
also give entry names, and a one line title; in addition they can be
used to produce aligned families of sequences. When this mode of
output is selected the program will write a separate file for each
match. The files will be called ENTRYNAME.DAT where ENTRYNAME is the
name of the entry in the library. The matching sequence will be
written out so that the spacing between motifs is constant, and set
to the maximum allowed by the pattern definition. Any gaps will be
filled with dashes (-). If the individual sequences were
subsequently written one above the other they should line up so that
all motifs are in register. There are two types of output of this sort:
one, option 4, writes out whole sequences, the other, option 5,
writes out only the sequences between the two outermost motifs.
Note that for option 4 users are
asked to type the position of the first motif, and the reason for
this is explained below. Consider a pattern found in several
sequences. Consider only the first motif in the pattern and suppose
that it was found in different positions in these sequences. Say
that of these positions the one furthest from the left end was
position 100. Then, in order to ensure that all the sequences would
align, we must specify that motif 1 must start at position 100. Any
sequences in which motif 1 started nearer to the left end than
position 100 would be padded accordingly. These modes of output
should only be used when the position of each motif is defined
relative to its immediate neighbour.
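The padding at the left end can be pictured with the following Python
sketch. It covers only the left-end padding described above, not the
gaps between motifs. The function name is hypothetical, and the use of
dashes for the left-end padding is an assumption.

```python
def pad_to_motif_start(seq, motif_start, target_start, pad="-"):
    """Left-pad `seq` so that 1-based `motif_start` moves to `target_start`."""
    if motif_start > target_start:
        raise ValueError("target_start must be >= every motif start")
    return pad * (target_start - motif_start) + seq


# Two hypothetical sequences with motif 1 (TATA) at positions 3 and 5.
for seq, start in [("GGTATA", 3), ("CCCCTATA", 5)]:
    print(pad_to_motif_start(seq, start, 5))
```

The target position must be at least as large as the largest observed
start of motif 1, which is why the user is asked to supply it.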
The pattern descriptions can be saved to files. These files
can be used instead of typing definitions again at the keyboard. As
the files are annotated, they can easily be changed using system
editors, and the modified versions used to define the variant
patterns for the programs.
Use of lists of entry names
The two programs that operate on libraries have the ability to
restrict their searches to subsets of the libraries. This does not
require sublibraries to be created but instead is achieved by using
files containing a list of the entry names of sequences. The user
may choose to search only those entries on the list or,
alternatively to search all but those on the list (i.e. in the
latter case the list contains the names of those to be excluded).
The programs can search libraries that have indexes and those that
do not. If a list of names for inclusion is used, then the search
will be faster if the index is present. In all other circumstances
the whole library will be read. The list must be in library order
except when it is used to include entries, and an index is
available. The list must contain each entry name on a separate
line, with the name starting in column 1 of the line, i.e. there must
be no spaces at the start of the line. The list of entry names can
be produced by the keyword searches of nip, pip, etc as long as the
listings produced have a space character separating the entry name
from the entry description. This will depend on how well the library
reformatting programs work. For example swissprot entry names tend
to run into the beginning of the descriptions, but other libraries
are generally OK.
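The list format and the include/exclude choice can be sketched in
Python as below. This is an illustration only: the function names and
the entry names in the example are hypothetical, and the sketch takes
the first space-separated word on each line as the entry name.

```python
def read_entry_names(lines):
    """Entry names: first word of each line, which must start in column 1."""
    names = []
    for number, line in enumerate(lines, start=1):
        if not line.strip():
            continue                       # ignore blank lines
        if line[0].isspace():
            raise ValueError(f"line {number}: name must start in column 1")
        names.append(line.split()[0])
    return names


def select_entries(library_names, listed, include=True):
    """Keep listed entries (include) or everything except them (exclude)."""
    listed = set(listed)
    return [name for name in library_names if (name in listed) == include]
```

Because select_entries walks the library in its own order, the result
is always in library order whichever mode is chosen.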
One use of the programs is to look for patterns that we
already know about, but in new sequences. However it is hoped that
they will also be useful for finding new motifs. For example several
known control regions in nucleic acid sequences consist of
particular direct or inverted repeats; the inclusion of direct and
inverted repeats as motif classes makes it possible to find
previously unknown motifs of these types. Using these new programs
we can ask questions like: "are there any inverted or direct repeats
near to sections of sequence that contain both a CCAAT box and a
TATA box?"; and to search for such things throughout the libraries.
In addition, the mode of output in which all the sequence between
the two outermost motifs found is printed out, allows us to extract
sequences and examine them in more detail for further common
subsequences. For example we might want to collect together all the
sequences between putative CCAAT and TATA boxes.
A further use of the inverted repeat motif class is the
following. If a regulatory sequence in DNA is poorly defined but
is also an inverted repeat, then it might be an advantage to specify
it both as a consensus sequence and as a superimposed inverted
repeat. In
this way two weak definitions can be combined to produce a stronger
pattern.
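As a rough illustration of combining two weak definitions, the Python
sketch below accepts a window only if it agrees with a consensus at
enough positions and also pairs with itself as an inverted repeat.
The scoring scheme, the function names and the cutoffs are all
assumptions made for the example; the programs' own motif classes are
more general.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def consensus_matches(window, consensus):
    """Count positions where window agrees with consensus (N matches any base)."""
    return sum(1 for b, c in zip(window, consensus) if c == "N" or b == c)


def self_pairs(window):
    """Count base pairs formed when the window is paired with itself,
    first base against last, second against last but one, and so on."""
    half = len(window) // 2
    return sum(1 for i in range(half)
               if COMPLEMENT.get(window[i]) == window[-1 - i])


def combined_hits(sequence, consensus, min_consensus, min_pairs):
    """Return 1-based start positions satisfying both weak definitions."""
    n = len(consensus)
    hits = []
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        if (consensus_matches(window, consensus) >= min_consensus
                and self_pairs(window) >= min_pairs):
            hits.append(i + 1)
    return hits
```

For the sequence CCGAATTCGGGAATGCAA and the consensus GAATTC, asking
for at least 5 consensus matches alone (min_pairs=0) reports positions
3 and 11; also requiring 3 self-pairs leaves only position 3.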
Given only a few examples of a motif it should be possible to
perform initial searches using a class 3 motif, and then, using
plausible matching sequences, create a more specific weight matrix
for the same motif.
If motifs are combined with the first motif using the OR
operator they will be ignored until all permutations that include
the first motif have been looked for. The whole search will then be
repeated, in turn, for each of those motifs that are combined with
the first motif using the OR operator. An interesting consequence
of this is that the program can be used, without change, to compare
any newly determined sequence with all known individual motifs. We
achieve this by having a pattern in which all known relevant motifs
are combined using the OR operator. If we ask to use this pattern
with a sequence, the program will automatically compare each
individual motif in the pattern with the whole length of the
sequence. As the number of known motifs grows this should become an
increasingly useful standard procedure.
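The idea of a pattern made entirely of motifs joined by OR can be
sketched in Python as below. This is only an illustration: it handles
exact-match motifs, whereas the programs support several motif
classes, and the function names and motif table are assumptions.

```python
def find_exact(sequence, motif):
    """Return the 1-based position of every occurrence of motif
    (overlapping occurrences included)."""
    hits = []
    start = sequence.find(motif)
    while start != -1:
        hits.append(start + 1)
        start = sequence.find(motif, start + 1)
    return hits


def scan_known_motifs(sequence, motifs):
    """Compare each motif in turn with the whole sequence.

    motifs maps a motif name to its exact sequence, as if every motif
    in the pattern were combined with the others using OR.
    """
    return {name: find_exact(sequence, motif)
            for name, motif in motifs.items()}
```

For example, scanning GGCCAATCTATAAAGCCAAT with the two motifs
CCAAT and TATAAA reports CCAAT at positions 3 and 16 and TATAAA at
position 9.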
The NOT operator is obviously useful for making sure
particular motifs are not present, but it can also be used to
bracket the levels of matches found. We may want a degree of match
that lies between two limits - binding should occur, but not too
strongly; or base-pairs should form, but not too many. We can
specify this by asking for a match with a low score, in combination
with a match with a high score, both for the same motif, but with
the high-score match included using the NOT operator.
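Bracketing a score between two limits can be sketched in Python as
follows. The weight matrix representation (one dictionary of base
weights per position), the function names and the cutoffs are
assumptions for illustration, and for simplicity both matches are
tested at the same position.

```python
def score_motif(window, weights):
    """Sum the weights of the bases in window (one dict per position)."""
    return sum(column.get(base, 0) for base, column in zip(window, weights))


def bracketed_matches(sequence, weights, low, high):
    """Return 1-based start positions where low <= score < high.

    This is "match at cutoff low" AND NOT "match at cutoff high".
    """
    n = len(weights)
    hits = []
    for i in range(len(sequence) - n + 1):
        score = score_motif(sequence[i:i + n], weights)
        matches_low = score >= low
        matches_high = score >= high
        if matches_low and not matches_high:
            hits.append(i + 1)
    return hits
```

With a unit weight for each base of TATA and the sequence TATAGGTACA,
cutoffs low=3 and high=4 report only position 7 (score 3); the
perfect match at position 1 (score 4) is rejected by the NOT.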
The algorithm is designed to find all sections of a sequence
that satisfy the pattern rather than only the best match.
Particularly if some of the motifs in a pattern are less well
defined than others, this can often result in the same region of a
sequence being reported as having several matches that vary only in
the positions of the weakest motifs.
General remarks on motif searching
Generally motifs are short subsequences that are thought to be
associated with particular functions in some known sequences. Often
we search for them to try to understand or interpret other
sequences. Sometimes we search for motifs and patterns to test a
hypothesis about their role: are they found in the expected
positions in the expected sequences? In doing so we should remember
that, in both proteins and nucleic acids, what we are really looking
for is a particular three dimensional structure with certain
affinities for other structures, and that we are assuming that the
sequence of the motif alone defines the 3D structure we are searching
for. The overall structure may be completely different to those in
which the motif is functional, and hence the motif may have a
different shape or be inaccessible. We should be aware of the
importance of the context in which a motif is found. Where does it
lie relative to the overall structure, is it accessible, is the
three dimensional spacing between it and other motifs correct? For
example, is it on the same side of the double helix, and the correct
distance from some other motif? How does context affect our
assessment of the significance of finding a motif? Finding false
mammalian mRNA splice junctions in non-coding sequences is far less
important than finding false sites in pre-mRNA sequences, but
finding them in the correct places is most important! In other
words, it is often the case that when we are searching for a motif
that is known to be necessary for some function, then a positive
result in the form of a match in the required position is more
important than a high background of matches in the wrong positions.
Being able to write down the probability of finding a motif in a
random sequence tells us how well it is defined. In nucleic acids
the DNA may contain many superimposed types of information such as
those concerned with histone phasing, protein coding or mRNA
secondary structure. These overlapping "codes" may interfere with
one another causing matches to motifs to be poorer than expected.
In general we will only have a limited number of examples of the
motif and we do not know how representative they are.
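For the simplest case, an exactly defined motif, the probability of
finding it in a random sequence can be written down directly. The
Python sketch below assumes independent bases with fixed frequencies
and counts one strand only; the function names are hypothetical and
this is not the calculation performed by the programs.

```python
def match_probability(motif, base_freqs):
    """Probability that a random window exactly matches motif,
    assuming independent bases with the given frequencies."""
    p = 1.0
    for base in motif:
        p *= base_freqs[base]
    return p


def expected_matches(motif, sequence_length, base_freqs):
    """Expected number of exact matches on one strand of a random
    sequence of the given length."""
    windows = max(sequence_length - len(motif) + 1, 0)
    return windows * match_probability(motif, base_freqs)
```

With equal base frequencies a six-base motif such as TATAAA has
probability (1/4)**6 = 1/4096 per position, so about one chance match
is expected in every 4100 bases. The longer and more specific the
motif, the lower this background.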
Sequences have superimposed functions: some parts may be of
general structural importance and give rise to an overall framework,
and other parts give specificity and hence are not common; we may
want to use a set of aligned sequences to define a motif, but want
to use only the framework positions. Alternatively we may want to
pick out only those parts of a set of aligned sequences that give a
particular property, and to ignore other similarities that are due
to some other property and which could obscure the pattern we are
interested in. It is possible to apply a mask to a set of aligned
sequences in order to give weight to selected positions only. The
ability to define a mask allows certain positions to be used in the
motif and others to be ignored, and yet still permits the use of a
set of aligned sequences to calculate weights. The mask is requested
and applied by the program and results in the masked positions being
zero in the weight matrix. The mask is defined in the following way.
Suppose we had a motif of length 15; then the mask x--x--xx-x will
give zero weights to positions 2, 3, 5, 6 and 9. Note that it is the
dashes (-) that are significant, and that positions 1, 4, 7, 8 and
10 to 15 will be non-zero, including those that lie beyond the end
of the mask. Of course the same set of sequences could
be used with several alternative masks in order to extract different
features and create corresponding weight matrices.
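The effect of a mask on a weight matrix can be sketched in Python as
below. The representation of the matrix (a list with one dictionary
of base weights per motif position) and the function name are
assumptions made for the example.

```python
def apply_mask(weight_matrix, mask):
    """Return a copy of weight_matrix with masked positions set to zero.

    A dash (-) in mask zeroes that position. Any other character, or
    a position beyond the end of the mask, leaves the weights as
    they were.
    """
    masked = []
    for i, column in enumerate(weight_matrix):
        if i < len(mask) and mask[i] == "-":
            masked.append({base: 0 for base in column})
        else:
            masked.append(dict(column))
    return masked
```

Applied to a 15-position matrix with the mask x--x--xx-x, this zeroes
positions 2, 3, 5, 6 and 9 and leaves the other ten positions
unchanged, as in the example above.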
The programs are described in Staden,R. CABIOS 4, 53-60 (1988);
CABIOS 5, 89-96 (1989); and Methods in Enzymology 183, 193-211
(1990).
@ end of help