@-1. TX 0 @General
@-2. T 0 @Screen control
@-2. X 0 @Screen
@-3. T 0 @Statistical analysis of content
@-3. X 0 @Statistics
@-4. T 0 @Structures and repeats
@-4. X 0 @Structures
@-5. TX 0 @Search
@0. TX -1 @PIP
This is a program for analysing individual protein sequences.
It can read sequences stored in many of the most commonly used
formats, and performs all of the usual simple analyses. In addition
it has very flexible search procedures and presents many of its
results graphically.
The following analyses (preceded by their option numbers) are
included:
? = Help
! = Quit
3 = read a new sequence
4 = define active region
5 = list the sequence
6 = list a text file
7 = direct output to disk
8 = write active sequence to disk
9 = edit the sequence
10 = clear graphics screen
11 = clear text screen
12 = draw a ruler
13 = use cross hair
14 = reposition plots
15 = label diagram
16 = display a map
17 = search for short sequences
18 = compare a sequence
19 = compare a sequence using a score matrix
20 = search for a sequence using a weight matrix
21 = calculate amino acid composition
22 = plot hydrophobicity
23 = plot charge
24 = plot Robson prediction
25 = plot hydrophobic moment
26 = draw helix wheel
27 = back translate
28 = search for patterns of motifs
Some of these methods produce graphical results and so the
program is generally used from a graphics terminal (a vdu on which
lines and points can be drawn as well as characters).
For users of VT640's or their equivalents the terminal must be
set nowrap (type NOWRAP) prior to running the program.
The position of each plot is defined relative to a user's
drawing board, which has size 1-10,000 in x and 1-10,000 in y. Plots
for each option are drawn in a window defined by x0,y0 and
xlength,ylength, where x0,y0 is the position of the bottom left hand
corner of the window, xlength is the width of the window and
ylength is the height of the window.
--------------------------------------------------------- 10,000
1 1
1 -------------------------------------- ^ 1
1 1 1 1 1
1 1 1 1 1
1 1 1 ylength 1
1 1 1 1 1
1 1 1 1 1
1 -------------------------------------- v 1
1 x0,y0^ 1
1 <---------------xlength--------------> 1
--------------------------------------------------------- 1
1 10,000
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "ANALPMRG" when
the program is started. Users can have their own file if required.
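The window geometry described above can be illustrated with a short
sketch. It is not taken from the program: the function names are
hypothetical, and the linear mapping from sequence position and
plotted value to drawing board units is an assumption.

```python
BOARD_MIN, BOARD_MAX = 1, 10000  # drawing board range in both x and y


def window_corners(x0, y0, xlength, ylength):
    """Return (left, bottom, right, top) of a plot window in board units."""
    right, top = x0 + xlength, y0 + ylength
    for value in (x0, y0, right, top):
        if not BOARD_MIN <= value <= BOARD_MAX:
            raise ValueError("window extends beyond the drawing board")
    return x0, y0, right, top


def to_board(position, value, seq_range, value_range, window):
    """Map a sequence position and a plotted value to board x,y.

    Assumes a simple linear mapping; the scaling used by the program
    itself is not documented here.
    """
    seq_start, seq_end = seq_range
    vmin, vmax = value_range
    x0, y0, xlength, ylength = window
    x = x0 + xlength * (position - seq_start) / (seq_end - seq_start)
    y = y0 + ylength * (value - vmin) / (vmax - vmin)
    return x, y
```

For example, a window with x0,y0 = 1000,2000 and xlength,ylength =
8000,3000 has its top right hand corner at 9000,5000.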
The program can handle sequences stored in several formats:
Staden, EMBL, GENBANK, PIR (also known as NBRF) and GCG. They are
described in the help for 'READ NEW SEQUENCE'.
The options for the program are accessed from 5 main menus:
general, screen control, statistical analysis of content, structure,
search. Both menus and options are selected by number.
@1. TX 0 @Help
This option gives online help. The user should select option
numbers and the current documentation will be given. Note that
option 0 gives an introduction to the program, and that ? will get
help from anywhere in the program. The following analyses (preceded
by their option numbers) are included:
@2. TX 0 @Quit
This function stops the program.
@3. TX 1 @Read a new sequence
This option allows users to read in new sequences, browse
through annotations, or search sequence libraries for keywords.
Sequences can be read from "personal" sequence files or from
sequence libraries. These are referred to as the sequence "source".
Personal files can be stored in several formats: Staden, PIR, EMBL,
GENBANK and GCG. At LMB we use "Staden" format for sequencing and
all the libraries are stored in their original formats. Note,
however, that libraries such as EMBL or GenBank that are divided
into several files (e.g. GenBank has 13 separate files) are indexed as
a whole. This means that users do not need to know which file
contains an entry, only which library. When the user selects to
read in a sequence the program first asks for the sequence "source".
If the user selects "personal" the program will ask for the
format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
the file. For PIR format the user will also be required to know the
entry name of the sequence as the file can contain several. For the
other formats only a single entry is expected. The file will be
read, its length and composition will be displayed and the option
left.
If the user selects "library" as the sequence source the
program will display a list of available libraries. The programs are
capable of handling all current libraries but which ones are
available will vary from site to site. At LMB we have several
libraries and also weekly updates of data gathered between releases.
The program will ask users to select a library and then give a list
of options:
X 1 Get a sequence
2 Get annotations
3 Get entrynames from accession numbers
4 Search titles for keywords
5 Search text index for keywords
If "Get a sequence" or "Get annotations" is selected, users will be
asked to type the entry name. The option will be left when a sequence
is selected or ! is typed. The composition and length will be
displayed.
The text index contains all words from feature tables,
reference titles, definition lines, keywords lists and comments, so
the text index search is most useful. It is also the fastest. Up to
5 words can be searched for at once. The words should be typed
separated by spaces, for example
? Keywords=P53 mouse murine tumo
will search for all entries that contain words starting with p53,
mouse, murine and tumo. Only the unique entries that contain ALL
words will be listed. Before listing the matching entries the
program will show the number of 'hits' for each word and ring the
bell. Escape is possible at this point, or after each screenful of
entries. In addition to the entry names the text search displays
the primary accession number, the sequence length and up to 80
characters of description. (The search of 'titles' is now redundant
because the full text index contains all the title words and the
search is much faster. It will probably be removed from the
program.) All searches are independent of case. Where possible the
program will offer default entry names.
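The matching rules of the text index search (words matched from
their start, independent of case, and only entries containing ALL
words listed) can be sketched as follows. This is an illustration
only: the in-memory index, the function name and the decision to
ignore any words beyond the fifth are assumptions, and the real
index files are not read this way.

```python
def search_text_index(index, query, max_words=5):
    """Search a toy text index.

    index: dict mapping entry name -> iterable of words in that entry
    (a hypothetical stand-in for the library text index).
    Returns (hits per query word, sorted names matching ALL words).
    """
    words = query.lower().split()[:max_words]  # assumed: extra words ignored
    hits = {}
    for word in words:
        # A query word matches any indexed word that starts with it.
        hits[word] = {
            name
            for name, entry_words in index.items()
            if any(w.lower().startswith(word) for w in entry_words)
        }
    matching = set.intersection(*hits.values()) if hits else set()
    counts = {word: len(names) for word, names in hits.items()}
    return counts, sorted(matching)
```

With a three entry index in which two entries mention p53 and two
mention mouse, the query "P53 mouse" reports two hits for each word
but lists only the entry that contains both.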
Typical dialogue follows.
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =
Select sequence file format
X 1 Staden
2 EMBL
3 GenBank
4 PIR
5 GCG
? Selection (1-5) (1) =
? Sequence file name=M13MP7.SEQ
Contig title removed
Sequence length= 7238
Sequence composition
T C A G -
2405. 1539. 1765. 1527. 2.
33.2% 21.3% 24.4% 21.1% 0.0%
.
.
.
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =2
Select a library
X 1 EMBL 29 nucleotide library Dec 91
2 SWISSPROT 20 protein library Nov 91
3 PIR 31 protein library Dec 91
4 NRL3D 58 From Brookhaven protein library Dec 91
5 GenBank
? Selection (1-5) (1) =
Library is in EMBL format with indexes
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =5
Search for keywords
? Keywords=P53 mouse
P53 hits 68
MOUSE hits 8180
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
22 different entries found
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =4
Search for keywords
? Keywords=alpha
Searching for alpha
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =3
? Accession number=v00636
Entry name LAMBDA
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =2
Default Entry name=LAMBDA
? Entry name=
ID LAMBDA standard; DNA; PHG; 48502 BP.
XX
AC V00636; J02459; M17233; X00906;
XX
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
DT 09-JUN-1982 (Rel. 1, Created)
XX
DE Genome of the bacteriophage lambda (Styloviridae).
XX
KW circular; coat protein; DNA binding protein; genome;
KW origin of replication.
XX
OS Bacteriophage lambda
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
XX
RN [1]
RP 1-48502
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
RT "Nucleotide sequence of bacteriophage lambda DNA";
RL J. Mol. Biol. 162:729-773(1982).
XX
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =
Default Entry name=LAMBDA
? Entry name=
DE Genome of the bacteriophage lambda (Styloviridae).
Sequence length 48502
Sequence composition
T C A G -
11988. 11360. 12336. 12818. 0.
24.7% 23.4% 25.4% 26.4% 0.0%
@4. TX 1 @Redefine active region
For its analytic functions the program always works on a
region of the sequence called the active region. When a new sequence
is read into the program the active region is automatically set to
start at the beginning of the sequence and extend up to the maximum
size of active region that this version of the program can handle.
The positions are shown on the screen. On most machines this will
be to the end of the sequence. This option allows the user to define a
different region. Note that for convenience in the listing and
translation functions the user is given access to regions outside
the active region.
@5. TX 1 @List a sequence
The sequence can be listed with line lengths from 10 to 120 in
multiples of 10. Output can be directed to a disk file by first
selecting disk output. The output looks like:
10 20 30 40 50 60
MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
70 80 90 100 110 120
AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
130 140 150 160 170 180
LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
190 200 210 220 230 240
DAIINQRDSG IKCIYVAIGQ KASTISNVVR KLEEHGALAN TIVVVATASE SAALQYLARM
250 260 270 280 290 300
PVALMGEYFR DRGEDALIIY DDLSKQAVAY RQISLLLRRP PGREAFPGDV FYLHSRLLER
310 320 330 340 350 360
AARVNAEYVE AFTKGEVKGK TGSLTALPII ETQAGDVSAF VPTNVISITD GQIFLETNLF
370 380 390 400 410 420
NAGIRPAVNP GISVSRVGGA AQTKIMKKLS GGIRTALAQY RELAAFSQFA SDLDDATRKQ
430 440 450 460 470 480
LDHGQKVTEL LKQKQYAPMS VAQQSLVLFA AERGYLADVE LSKIGSFEAA LLAYVDRDHA
490 500 510 520 530 540
PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
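The layout of the listing above (blocks of 10 residues, with the
position of the last residue of each block written above it) can be
reproduced approximately by the following sketch. The function name
is hypothetical and the exact column spacing used by the program is
an assumption.

```python
def list_sequence(seq, line_length=60):
    """Format a sequence in numbered blocks of 10 residues per column."""
    if line_length % 10 != 0 or not 10 <= line_length <= 120:
        raise ValueError("line length must be a multiple of 10 from 10 to 120")
    lines = []
    for start in range(0, len(seq), line_length):
        chunk = seq[start:start + line_length]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        # Every column of the line is numbered, even past the sequence end,
        # as in the last line of the listing shown above.
        labels = [str(start + 10 * (k + 1)).rjust(10)
                  for k in range(line_length // 10)]
        lines.append(" ".join(labels))
        lines.append(" ".join(blocks))
    return "\n".join(lines)
```

Calling list_sequence(sequence, 60) gives six numbered blocks per
line, as in the example output.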
@6. TX 1 @List a text file
Allows the user to have a text file displayed on the screen.
It will appear one page at a time.
@7. TX 1 @Direct output to disk
Used to direct output that would normally appear on the screen
to a file.
Select redirection of either text or graphics, and supply the
name of the file that the output should be written to.
The results from the next options selected will not appear on
the screen but will be written to the file. When option 7 is
selected again the file will be closed and output will again appear
on the screen.
@8. TX 1 @Write active region to disk
The program has the capability of reading in EMBL, GENBANK,
NBRF, GCG and Staden formats and of reversing and complementing
sequences. This option allows users to write the current active
sequence to a disk file in Staden format. Hence it allows format
conversion and crude sequence cutting.
@9. TX 1 @Edit the sequence
Used to edit sequences or any other files by giving access to
the computer's system editor. For editing sequences the input file
should have already been created using the listing function "list
sequence".
Supply the name of the file to edit. Wait while the system
editor is made ready (this can take a while on a VAX). Use the editor.
Exit from the editor. If a sequence has been edited, and you want to
process it, affirm that the sequence should be "made active". The
edited sequence will replace the original sequence.
This editing method is designed to give users access to an
editor with which they are familiar - i.e. the one on their machine,
and yet to allow them to edit a sequence which contains the
landmarks they need in order to know where they are. Users can
create files containing simple listings with numbering, using "list
the sequence", and then edit them with their system editor, using
the numbering to know where they are within the sequence. When the
edits are complete they exit from the editor and the program
"analyses" the edited file to extract only the sequence characters.
The permitted set of characters is defined to be:
ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz. All permitted
characters found in the file become part of the sequence; all
others are removed.
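The extraction step can be sketched as a simple filter over the
edited file. The permitted set is the one given above; the function
name is hypothetical.

```python
# Permitted characters, copied from the definition above.
PERMITTED = set("ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz")


def extract_sequence(text):
    """Keep only permitted sequence characters from an edited listing."""
    return "".join(ch for ch in text if ch in PERMITTED)
```

Numbering, spaces and line ends are dropped, so a numbered listing
collapses back to the bare sequence. Note that any permitted letter
typed elsewhere in the file (in a comment, for example) would also
be taken as sequence.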
@10. TX 2 @Clear graphics
Clears the screen of both text and graphics.
@11. TX 2 @Clear text
Clears only text from the screen.
@12. TX 2 @Draw a ruler
This option allows the user to draw a ruler or scale along the
x axis of the screen to help identify the coordinates of points of
interest. The user can define the position of the first amino acid
to be marked (for example if the active region is 1501 to 8000, the
user might wish to mark every 1000th amino acid starting at either
1501 or 2000 - it depends if the user wishes to treat the active
region as an independent unit with its own numbering starting at its
left edge, or as part of the whole sequence). The user can also
define the separation of the ticks on the scale and their height. If
required the labelling routine can be used to add numbers to the
ticks.
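The choice of first marked position can be illustrated with a small
sketch that lists the tick positions falling inside the active
region. The function and parameter names are hypothetical.

```python
def ruler_ticks(region_start, region_end, first_tick, separation):
    """Return the sequence positions at which ruler ticks are drawn."""
    if separation <= 0:
        raise ValueError("tick separation must be positive")
    return [pos for pos in range(first_tick, region_end + 1, separation)
            if pos >= region_start]
```

For an active region of 1501 to 8000 and a separation of 1000,
starting at 2000 gives ticks at 2000, 3000, ... 8000, whereas
starting at 1501 gives ticks at 1501, 2501, ... 7501.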
@13. TX 2 @Use cross hair
This function puts a steerable cross on the screen that can be
used to find the coordinates of points in the sequence. The user can
move the cross around using the directional keys; when the space bar
is hit the program will print out the coordinates of the cross in
sequence units and the option will be exited.
If instead, you hit a , the position will be displayed but the
cross will remain on the screen.
If a letter s is hit the sequence around the cross hair is
displayed and the cross remains on the screen.
@14. TX 2 @Reset margins
The position of each plot is defined relative to a user's
drawing board, which has size 1-10,000 in x and 1-10,000 in y.
Plots for each option are drawn in a window defined by x0,y0 and
xlength,ylength, where x0,y0 is the position of the bottom left hand
corner of the window, xlength is the width of the window and
ylength is the height of the window.
--------------------------------------------------------- 10,000
1 1
1 -------------------------------------- ^ 1
1 1 1 1 1
1 1 1 1 1
1 1 1 ylength 1
1 1 1 1 1
1 1 1 1 1
1 -------------------------------------- v 1
1 x0,y0^ 1
1 <---------------xlength--------------> 1
--------------------------------------------------------- 1
1 10,000
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "ANALMARG" when
the program is started. Users can have their own file if required.
As all the plots start at the same position in x and have the same
width, x0 and xlength are the same for all options. Generally users
will only want to change the start level of the window y0 and its
height ylength. This option allows users to change window positions
whilst running the program. The routine prompts first for the
number of the option that the user wishes to reposition; then for
the y start and height; then for the x start and length. Note that
changes to the x values affect all options. If the user types only
carriage return for any value it will remain unchanged. The cross-
hair can be used to choose suitable heights.
@15. TX 2 @Label a diagram
This routine allows users to label any diagrams they have
produced. They are asked to type in a label. When the user types
carriage return to finish typing the label the cross-hair appears on
the screen. The user can position it anywhere on the screen. If the
user types R (for right justify) the label will be written on the
diagram with its right end at the cross-hair position. If the user
types L (for left justify) the label will be written on the diagram
with its left end at the cross hair position. The cross-hair will
then immediately reappear. The user may put the same label on
another part of the diagram as before or, on hitting the space bar,
will be asked whether to type in another label.
@16. TX 2 @Display a map
It is often convenient to plot a map alongside graphed
analysis in order to indicate features within the sequence. This
function allows users to draw maps using files arranged in the form
of EMBL feature tables. Of course the EMBL tables are usually only
used for nucleic acid sequence annotation but, as long as the
features are written in the correct format, they can be employed by
this routine. The map is composed of a line representing the
sequence and then further lines denoting the endpoints of each
feature the user identifies. The user is asked to define the height
at which the line representing the sequence should be drawn; then
for the feature height; then for the features to plot.
@17. TX 1 5 @Short sequence search
This routine is used to search for exact matches to short
sequences. It is equivalent to the restriction enzyme search in
program NIP. It can either list matches or present the results
graphically.
Select from searching, screen clearing or file listing. Choose
a file of strings and the mode of output required.
The files of short sequences (strings) and their names need to
be arranged in a particular way. For example
ACID/D/E//
BASIC/R/K/H//
HYDRO/F/L/I/V/Y//
GLYCO/N-S/N-T//
+/R/K/H//
-/D/E//
defines various groups of amino acids. Each string or set of
strings must be preceded by a name, each string must be preceded and
terminated by a slash (/), and each set of strings must be terminated
by 2 slashes.
These collections of strings and their names can be read from disk
or entered from the keyboard. Two files containing sequences are
currently available. One contains named groups of amino acids. The
other simply contains the names of all amino acids and gives a
convenient way of producing a plot of the positions of all the
different amino acids in the sequence. The user can select strings
by name from these collections. Results can be displayed name by
name or all together. Strings entered from the keyboard need to be
separated by slash characters (/). For the name by name search the
output looks like:
MATCHES= 12
NAME SEQUENCE POSITION FRAGMENT LENGTHS
ACID E 7 7 1
ACID E 10 3 1
ACID E 24 14 1
ACID E 28 4 1
ACID D 36 8 1
ACID D 46 10 2
ACID E 51 5 2
ACID E 67 16 2
ACID D 69 2 2
ACID D 81 12 2
ACID E 84 3 2
ACID E 96 12 3
MATCHES= 10
NAME SEQUENCE POSITION FRAGMENT LENGTHS
BASIC K 13 13 1
BASIC R 15 2 1
BASIC H 26 11 1
BASIC R 40 14 1
BASIC H 42 2 2
BASIC R 59 17 2
BASIC R 68 9 2
BASIC K 87 19 2
BASIC K 89 2 2
BASIC R 93 4 2
MATCHES= 1
NAME SEQUENCE POSITION FRAGMENT LENGTHS
GLYCO NST 4 4 3
or when the results are ordered only on position the output looks like:
NAME SEQUENCE POSITION FRAGMENT LENGTHS
GLYCO NST 4 3
ACID E 7 3
ACID E 10 3
BASIC K 13 3
BASIC R 15 2
ACID E 24 9
BASIC H 26 2
ACID E 28 2
ACID D 36 8
BASIC R 40 4
BASIC H 42 2
ACID D 46 4
ACID E 51 5
BASIC R 59 8
Graphical output marks the position of each string by a short
vertical line and gives its name at the left end of the line. If the
top of the screen is reached the program gives the user the
opportunity to take a hard copy and then will clear the screen and
restart plotting results at the original start position. Note that
any character in the string that is not a recognisable protein
symbol will be treated as a wild card character and will match with
all characters in the searched sequence.
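The file format and the wild card rule can be sketched as follows.
This is an illustration, not the program's own code: the function
names are hypothetical, and the set of "recognisable protein
symbols" is assumed here to be the 20 standard one-letter codes.

```python
AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")  # assumed recognisable symbols


def parse_strings(text):
    """Parse lines such as 'BASIC/R/K/H//' into {name: [strings]}."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.endswith("//"):
            continue  # skip blank or incomplete lines
        name, *strings = line[:-2].split("/")
        groups[name] = [s.upper() for s in strings if s]
    return groups


def matches_at(seq, pos, string):
    """True if string matches seq at pos; other symbols are wild cards."""
    return all(c not in AMINO_ACIDS or c == seq[pos + i]
               for i, c in enumerate(string))


def search(seq, groups, names):
    """Return (position, name, matched text) tuples ordered by position."""
    seq = seq.upper()
    results = []
    for name in names:
        for string in groups[name]:
            for pos in range(len(seq) - len(string) + 1):
                if matches_at(seq, pos, string):
                    # Positions are reported counting from 1.
                    results.append((pos + 1, name, seq[pos:pos + len(string)]))
    return sorted(results)
```

Applied to the first 20 residues of the sequence listed under
"List a sequence", the groups ACID, BASIC and GLYCO give matches at
positions 4 (NST), 7, 10, 13 and 15, in agreement with the start of
the position-ordered listing above.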
Typical dialogue follows.
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=17
Search for short sequences
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =2
1 All acids
X 2 Named groups
3 Personal file
4 Keyboard
? 0,1,2,3,4 =
ACID/D/E//
BASIC/R/K/H//
HYDRO/F/L/I/V/Y//
GLYCO/N-S/N-T//
+/R/K/H//
-/D/E//
DIBASIC/RR/KK/RK/KR//
TURN/N/D/G/P/S//
BLOCK/A/Q/E/I/L/M/F/W/V//
INDIF/R/C/H/K/T/Y//
End of file
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All acids
X 2 Named groups
3 Personal file
4 Keyboard
? 0,1,2,3,4 =
? (y/n) (y) All names n
? Name=acid
? Name=basic
? Name=glyco
? Name=
? (y/n) (y) Show results name by name
? (y/n) (y) List matches
searching
matches= 59
NAME SEQUENCE POSITION FRAGMENT LENGTHS
ACID E 7 7 1
ACID E 10 3 1
ACID E 24 14 1
ACID E 28 4 1
ACID D 36 8 1
ACID D 46 10 2
ACID E 51 5 2
ACID E 67 16 2
ACID D 69 2 2
ACID D 81 12 2
ACID E 84 3 2
ACID E 96 12 3
ACID D 116 20 3
matches= 61
NAME SEQUENCE POSITION FRAGMENT LENGTHS
BASIC K 13 13 1
BASIC R 15 2 1
BASIC H 26 11 1
BASIC R 40 14 1
BASIC H 42 2 2
BASIC R 59 17 2
...etc
matches= 2
NAME SEQUENCE POSITION FRAGMENT LENGTHS
GLYCO NST 4 4 3
GLYCO NQT 487 483 28
28 483
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All acids
X 2 Named groups
3 Personal file
4 Keyboard
? 0,1,2,3,4 =
? (y/n) (y) Selected names
? Name=basic
? Name=glyco
? Name=
? (y/n) (y) Show results name by name n
? (y/n) (y) List matches
searching
NAME SEQUENCE POSITION FRAGMENT LENGTHS
GLYCO NST 4 3
BASIC K 13 9
BASIC R 15 2
BASIC H 26 11
BASIC R 40 14
BASIC H 42 2
BASIC R 59 17
BASIC R 68 9
BASIC K 87 19
...etc
BASIC R 477 14
BASIC H 479 2
GLYCO NQT 487 8
BASIC K 499 12
BASIC K 501 2
BASIC K 508 7
7
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All acids
X 2 Named groups
3 Personal file
4 Keyboard
? 0,1,2,3,4 =4
Define search strings by typing a string name
followed by the string(s)
? Name=MARY
? String(s)=AL/VI
? Name=
? (y/n) (y) All names
? (y/n) (y) Show results name by name
? (y/n) (y) List matches
searching
matches= 12
NAME SEQUENCE POSITION FRAGMENT LENGTHS
MARY VI 38 38 10
MARY AL 63 25 13
MARY VI 136 73 16
MARY AL 177 41 19
MARY AL 217 40 25
MARY AL 233 16 37
MARY AL 243 10 40
MARY AL 256 13 41
MARY AL 326 70 45
MARY VI 345 19 51
MARY AL 396 51 70
MARY AL 470 74 73
@18. TX 1 5 @Compare a sequence
This routine slides a short sequence along the current
sequence and finds all positions at which a given percentage of the
amino acids match. Output is in both graphical and listed forms.
If users call for dialogue when the routine is selected they
will be given the choice of keyboard or file input. Define the
string, and the percentage match. Matches will be plotted out and
then the user can select to have them listed. Then the routine
cycles around.
The routine slides the search string along the sequence and
marks the positions at which a minimum percentage score is reached.
The graphical output draws a vertical line at the match position;
the height of the line represents the percentage score, so that if
the line reaches the top of the box the score is 100%.
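The sliding comparison can be sketched as below. The function name
is hypothetical. How the program converts the percentage into a
required number of identical residues is not stated; the dialogue
that follows reports scores of 2 for a string of length 3 at 70
percent, so this sketch assumes the required count is truncated to a
whole number.

```python
def percent_matches(seq, string, min_percent=70.0):
    """Return (position, score) for windows reaching the percentage match.

    score is the number of identical residues; positions count from 1.
    """
    seq, string = seq.upper(), string.upper()
    n = len(string)
    # Assumed rule: truncate the required number of identities.
    required = int(n * min_percent / 100)
    hits = []
    for pos in range(len(seq) - n + 1):
        identical = sum(a == b for a, b in zip(seq[pos:pos + n], string))
        if identical >= required:
            hits.append((pos + 1, identical))
    return hits
```

In a plot, a hit would then be drawn with height proportional to
100 * score / len(string).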
Typical dialogue follows.
? Menu or option number=18
Find percentage matches
? (y/n) (y) Keep picture
? String=aaa
? Percent match (1.00-100.00) (70.00) =
missing graphics
Total scoring positions above 70.000 percent = 19
Scores 2 2 2 2 2 2 2 2 2 2
Positions 61 131 177 217 226 231 232 267 300 301
? Number to list (0-19) (0) =3
61
AIA
* *
aaa
1
131
AIA
* *
aaa
1
177
ALA
* *
aaa
1
? (y/n) (y) Keep picture n
Default String=aaa
? String=!
@19. TX 1 5 @Compare a sequence using a score matrix
This routine slides a short sequence along the current
sequence and finds all positions at which a given level of
similarity (a cutoff score) is reached. The score is defined by use
of a score matrix (MDM78). Output is in both graphical and listed
forms.
If users call for dialogue when the routine is selected they
will be given the choice of keyboard or file input. Define the
string and the cutoff score. Matches will be plotted out and then
the user can select to have them listed. Then the routine cycles
around.
The routine slides the search string along the sequence and
marks the positions at which the cutoff score is achieved. The
graphical output draws a vertical line at the match position; the
height of the line represents the score, so that if the line
reaches the top of the box the score is the maximum possible.
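The same sliding scheme with a score matrix can be sketched as
follows. The matrix used here is a small invented one covering four
residues only; it is NOT MDM78 and its values are placeholders. The
function name is also hypothetical.

```python
# Toy score matrix (hypothetical values, not MDM78): 3 for identity, else 1.
TOY_MATRIX = {(a, b): (3 if a == b else 1) for a in "AGST" for b in "AGST"}


def score_matches(seq, string, matrix, cutoff):
    """Return (position, score) for windows scoring at least cutoff.

    Results are ordered with the highest scores first, as in the
    listing given by the program; positions count from 1.
    """
    seq, string = seq.upper(), string.upper()
    n = len(string)
    hits = []
    for pos in range(len(seq) - n + 1):
        window = seq[pos:pos + n]
        score = sum(matrix[(a, b)] for a, b in zip(string, window))
        if score >= cutoff:
            hits.append((pos + 1, score))
    return sorted(hits, key=lambda hit: -hit[1])
```

With the toy matrix, searching AAAGST for AAA at a cutoff of 7 finds
position 1 (score 9) and position 2 (score 7).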
Typical dialogue follows.
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=19
Find matches using a score matrix
? (y/n) (y) Keep picture
? String=aaa
Minimum score= 12 Maximum score= 36
? Score (12-36) (36) =
missing graphics
For score 24 the number of matches= 507
scores 35 35 35 34 34 34 34 34 34 34
positions 226 231 379 112 133 202 227 267 378
380
? Number to list (0-507) (0) =3
226
ATA
* *
aaa
1
231
SAA
**
aaa
1
379
GAA
**
aaa
1
? (y/n) (y) Keep picture n
Default String=aaa
? String=!
@20. TX 5 @Search for a motif using a weight matrix
This function performs searches for short sequence motifs
using an appropriate weight matrix. In addition it can be used to
create or modify weight matrices. In order to perform a search the
only input required is the name of the file containing the weight
matrix. The results can be presented graphically or listed. The
graphical presentation will draw a line at the position of any matches
found; the height of the line is proportional to the score.
For a search, select "use weight matrix", supply the name of
the file containing the weight matrix, and choose between having
results plotted or listed. If dialogue is requested when the
function is selected users can alter the cutoff score employed.
To create a weight matrix several steps are involved. A file
containing an alignment of known motifs is required. (This file must
be created before the current option is selected. The format is as
follows: each sequence is written on a separate line with at least
one space at the beginning; each sequence is terminated by a space
character, and can be followed by a name. The sequences must be
aligned.) Supply the name of the file of aligned sequences. The
program reads and displays the sequences. Choose between "summing
logs of weights" or summing weights (i.e. whether to multiply or add
weights). If logs are used all scores will be negative. Choose if
all positions in the set of aligned sequences should be used or if a
mask should be applied. If so selected, define a mask as a string of
symbols, in which symbol - means ignore and any other symbol means
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
The program will calculate weights as the frequencies of each
amino acid at each unmasked position in the set of aligned
sequences. These weights are then applied to the set of aligned
sequences to give a range of "observed" scores. The mean and
standard deviation of these scores are displayed. The user is asked
to supply several values to be used when the weight matrix is
applied to other sequences: a cutoff score (by default, the mean
minus 3 standard deviations), a top score for scaling graphical
results (by default, the mean plus 3 standard deviations), and a
position to identify (this means that if a particular amino acid
within the motif is used as a "landmark", such as the G of the
helix-turn-helix motif, then its position will be marked in plots).
All these values are stored along with the weight matrix. Finally
supply the name of a file to contain the weight matrix.
Weight matrices can be "rescaled" using a set of aligned
sequences in much the same ways as a matrix is created. The purpose
is to redefine the cutoff scores, and rescaling does not alter any
other values in the weight matrix file.
The methods have changed considerably but were first outlined
in Staden, R. Nucl. Acids Res. 12, 505-519 (1984), and Staden, R.
Genetic Engineering: Principles and Methods, vol. 7, edited by J.K.
Setlow and A. Hollaender, Plenum Publishing Corp., 1985.
The methods have always had to deal with the problem of zeroes
in the matrices. The current versions employ "Laplace's Law of
Succession" in which 1 is added to each term.
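The calculation described above (frequencies with 1 added to each
term, scores formed by summing logs of weights or the weights
themselves, and default limits at the mean plus or minus 3 standard
deviations) can be sketched as follows. This is an illustration
under stated assumptions, not the program's code: the function
names are hypothetical, the alphabet is taken to be the 20 standard
amino acids, the weights are normalised as (count + 1) / (N + 20),
and the population standard deviation is used.

```python
import math
from statistics import mean, pstdev

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # assumed: 20 standard amino acids


def make_weights(aligned):
    """Build one {residue: weight} table per motif position."""
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must be aligned to the same length")
    total = len(aligned) + len(ALPHABET)  # add-one correction (assumed form)
    return [{aa: (column.count(aa) + 1) / total for aa in ALPHABET}
            for column in zip(*aligned)]


def score(weights, segment, use_logs=True):
    """Sum the logs of the weights (always negative) or the weights."""
    values = [w[aa] for w, aa in zip(weights, segment)]
    return sum(math.log(v) for v in values) if use_logs else sum(values)


def default_limits(weights, aligned, use_logs=True):
    """Return (mean - 3 sd, mean + 3 sd) of the scores of the input set."""
    scores = [score(weights, s, use_logs) for s in aligned]
    m, sd = mean(scores), pstdev(scores)
    return m - 3 * sd, m + 3 * sd


def scan(weights, seq, cutoff, use_logs=True):
    """Return (position, score) for windows scoring at least cutoff."""
    n = len(weights)
    hits = []
    for pos in range(len(seq) - n + 1):
        s = score(weights, seq[pos:pos + n], use_logs)
        if s >= cutoff:
            hits.append((pos + 1, s))
    return hits
```

Because every weight is below 1, each log is negative, which is why
all scores are negative when logs are summed. Residues outside the
assumed alphabet are not handled by this sketch.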
It is now possible to apply a mask to a set of aligned
sequences in order to give weight to selected positions only.
Sequences have superimposed functions: some parts may be of general
structural importance and give rise to an overall framework, and
other parts give specificity and hence are not common; we may want
to use a set of aligned sequences to define a motif, but want to use
only the framework positions. Alternatively we may want to pick out
only those parts of a set of aligned sequences that give a
particular property, and to ignore other similarities that are due
to some other property and which could obscure the pattern we are
interested in. The ability to define a mask allows certain positions
to be used in the motif and others to be ignored, and yet still
permits the use of a set of aligned sequences to calculate weights.
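The mask convention (symbol - means ignore, any other symbol means
use) can be sketched as follows; the function names are
hypothetical.

```python
def mask_positions(mask):
    """Return the 0-based indices of the positions a mask keeps."""
    return [i for i, symbol in enumerate(mask) if symbol != "-"]


def apply_mask(segment, mask):
    """Keep only the residues at unmasked positions."""
    return "".join(segment[i] for i in mask_positions(mask))
```

For example, the mask xx-x--abc keeps positions 1, 2, 4, 7, 8 and 9
(counting from 1), and weights would then be calculated for those
positions only.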
Typical dialogue is shown below.
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =2
? Name of aligned sequences file=[rs.motifs]hth.seq
1 QESVADKMGMGQSGVGALFN LAMBDA.REP
2 QTKTAKDLGVYQSAINKAIH LAMBDA.CRO
3 QAALGKMVGVSNVAISQWQR P22.REP
4 QRAVAKALGISDAAVSQWKE P22.CRO
5 QAELAQKVGTTQQSIEQLEN 434.REP
6 QTELATKAGVKQQSIQLIEA 434.CRO
7 RQEIGQIVGCSRETVGRILK CAP
8 RGDIGNYLGLTVETISRLLG Fnr
9 LYDVAEYAGVSYQTVSRVVN LAC.R
10 IKDVARLAGVSVATVSRVIN GAL.R
11 TEKTAEAVGVDKSQISRWKR LAMBDA.CII
12 QRKVADALGINESQISRWKG P22.CI
13 KEEVAKKCGITPLQVRVWCN MAT.ALPHA
14 TRKLAQKLGVEQPTLYWHVK TETR.TN10
15 TRRLAERLGVQQPALYWHFK TETR.pSC1
16 QRELKNELGAGIATITRGSN TRP.REP
17 RQQLAIIFGIGVSTLYRYFP H-INVERSN
18 ATEIAHQLSIARSTVYKILE TN3.RESOL
19 ASHISKTMNIARSTVYKVIN GD.RESOLV
20 IASVAQHVCLSPSRLSHLFR ARA.C
21 RAEIAQRLGFRSPNAAEEHL LEX.R
Length of motif 20
? (y/n) (y) Sum logs of weights
? (y/n) (y) Use all motif positions n
x means use, - means ignore
e.g. xx-x---x-x means use positions 1,2,4,8,10
? Mask=--xxxxxxxxxxxx------
Applying weights to input sequences
1 -57.143 QESVADKMGMGQSGVGALFN
2 -55.087 QTKTAKDLGVYQSAINKAIH
3 -58.079 QAALGKMVGVSNVAISQWQR
4 -54.986 QRAVAKALGISDAAVSQWKE
5 -55.181 QAELAQKVGTTQQSIEQLEN
6 -55.874 QTELATKAGVKQQSIQLIEA
7 -56.692 RQEIGQIVGCSRETVGRILK
8 -57.722 RGDIGNYLGLTVETISRLLG
9 -55.363 LYDVAEYAGVSYQTVSRVVN
10 -55.769 IKDVARLAGVSVATVSRVIN
11 -56.786 TEKTAEAVGVDKSQISRWKR
12 -55.833 QRKVADALGINESQISRWKG
13 -56.279 KEEVAKKCGITPLQVRVWCN
14 -53.125 TRKLAQKLGVEQPTLYWHVK
15 -55.833 TRRLAERLGVQQPALYWHFK
16 -58.651 QRELKNELGAGIATITRGSN
17 -56.749 RQQLAIIFGIGVSTLYRYFP
18 -56.986 ATEIAHQLSIARSTVYKILE
19 -60.618 ASHISKTMNIARSTVYKVIN
20 -58.988 IASVAQHVCLSPSRLSHLFR
21 -58.002 RAEIAQRLGFRSPNAAEEHL
Top score -53.125 Bottom score -60.618
Mean -56.655 Standard deviation 1.617
Mean minus 3.sd -61.505 Mean plus 3.sd -51.804
? Cutoff score (-999.00-9999.00) (-61.51) =
? Top score for scaling plots (-61.51-999.00) (-51.80) =
? Position to identify (0-20) (1) =9
? Title=hth
? Name for new weight matrix file=1.wts
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=1.wts
hth
? (y/n) (y) Use frequencies as weights
? (y/n) (y) Plot results n
5 -61.46 STEISELIKQRIAQFNVVSE
13 -58.93 KQRIAQFNVVSEAHNEGTIV
21 -60.42 VVSEAHNEGTIVSVSDGVIR
57 -59.39 GNRYAIALNLERDSVGAVVM
59 -61.47 RYAIALNLERDSVGAVVMGP
79 -59.90 YADLAEGMKVKCTGRILEVP
88 -61.41 VKCTGRILEVPVGRGLLGRV
104 -60.38 LGRVVNTLGAPIDGKGPLDH
127 -60.13 SAVEAIAPGVIERQSVDQPV
129 -59.91 VEAIAPGVIERQSVDQPVQT
133 -60.79 APGVIERQSVDQPVQTGYKA
139 -61.12 RQSVDQPVQTGYKAVDSMIP
175 -58.90 KTALAIDAIINQRDSGIKCI
191 -60.95 IKCIYVAIGQKASTISNVVR
195 -60.94 YVAIGQKASTISNVVRKLEE
215 -60.66 HGALANTIVVVATASESAAL
254 -60.56 EDALIIYDDLSKQAVAYRQI
260 -60.08 YDDLSKQAVAYRQISLLLRR
297 -61.00 LLERAARVNAEYVEAFTKGE
314 -61.29 KGEVKGKTGSLTALPIIETQ
330 -60.49 IETQAGDVSAFVPTNVISIT
363 -57.63 GIRPAVNPGISVSRVGGAAQ
365 -61.48 RPAVNPGISVSRVGGAAQTK
371 -61.02 GISVSRVGGAAQTKIMKKLS
382 -57.90 QTKIMKKLSGGIRTALAQYR
394 -60.07 RTALAQYRELAAFSQFASDL
424 -59.95 GQKVTELLKQKQYAPMSVAQ
430 -58.89 LLKQKQYAPMSVAQQSLVLF
432 -61.14 KQKQYAPMSVAQQSLVLFAA
438 -58.58 PMSVAQQSLVLFAAERGYLA
458 -61.06 DVELSKIGSFEAALLAYVDR
466 -61.00 SFEAALLAYVDRDHAPLMQE
483 -60.48 MQEINQTGGYNDEIEGKLKG
494 -60.61 DEIEGKLKGILDSFKATQSW
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=d20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=1.wts
hth
? (y/n) (y) Use frequencies as weights
? Cutoff score (-9999.00-9999.00) (-61.51) =-56.
? (y/n) (y) Plot results n
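The default cutoff offered when the matrix was made is the mean score
of the aligned sequences minus three standard deviations. The Python
sketch below recomputes those figures from the 21 scores listed in
the dialogue. Dividing by n (a population standard deviation) is an
assumption, chosen because it agrees with the listed value of 1.617;
the last decimal place of the cutoffs can differ from the listing
because the scores shown are already rounded.

```python
import math

# The 21 scores listed under "Applying weights to input sequences"
scores = [-57.143, -55.087, -58.079, -54.986, -55.181, -55.874, -56.692,
          -57.722, -55.363, -55.769, -56.786, -55.833, -56.279, -53.125,
          -55.833, -58.651, -56.749, -56.986, -60.618, -58.988, -58.002]

mean = sum(scores) / len(scores)
# Population standard deviation: divide by n, not n - 1
sd = math.sqrt(sum((s - mean) ** 2 for s in scores) / len(scores))
low_cutoff = mean - 3 * sd
high_cutoff = mean + 3 * sd

print(f"Mean {mean:.3f}  Standard deviation {sd:.3f}")
print(f"Mean minus 3.sd {low_cutoff:.3f}  Mean plus 3.sd {high_cutoff:.3f}")
```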
@21. TX 3 @Calculate amino acid composition
This function calculates the amino acid composition and
molecular weight for the active region.
? Menu or option number=21
Sequence composition
A    C     S     T     P     A     G     N     D     E     Q     B     Z     H
N    3.   32.   23.   18.   57.   47.   16.   28.   31.   28.    0.    0.    7.
%   0.6   6.2   4.5   3.5  11.1   9.1   3.1   5.4   6.0   5.4   0.0   0.0   1.4
W  309. 2786. 2325. 1748. 4051. 2682. 1826. 3222. 4003. 3588.    0.    0.  960.
A    R     K     M     I     L     V     F     Y     W     -     X     ?
N   30.   24.   11.   40.   47.   41.   14.   15.    1.    0.    0.    0.    1.
%   5.8   4.7   2.1   7.8   9.1   8.0   2.7   2.9   0.2   0.0   0.0   0.0   0.2
W 4686. 3076. 1443. 4527. 5319. 4065. 2060. 2448.  186.    0.    0.    0.    0.
Total molecular weight= 55328.
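In the listing above the W rows sum to 55310 and the total is 55328,
which is consistent with one water (18) being added for the chain
ends. A minimal Python sketch of such a calculation follows. The
residue masses are standard average values supplied for the example,
not values read from the program, and the function name is
hypothetical.

```python
from collections import Counter

# Average residue masses in daltons (standard values, assumed here)
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}
WATER = 18.02  # one water per chain for the free N- and C-termini


def composition(seq):
    """Return {residue: (count, percent, mass)} and the total weight."""
    counts = Counter(seq.upper())
    total = sum(counts.values())
    rows = {}
    for aa, n in counts.items():
        mass = n * RESIDUE_MASS.get(aa, 0.0)  # unknown symbols weigh 0
        rows[aa] = (n, 100.0 * n / total, mass)
    weight = sum(r[2] for r in rows.values()) + WATER
    return rows, weight


rows, weight = composition("GA")
print(rows["G"][:2], round(weight, 2))   # (1, 50.0) 146.15
```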
@22. TX 3 4 @Plot hydrophobicity
This routine plots the hydrophobicity of each section of the
sequence using the hydrophobicity values of Kyte and Doolittle (J.
Mol. Biol. 157, 105-132 (1982)). A window of size span is slid
along the sequence and a sum calculated for each position.
If dialogue is requested select a span length and a plot
interval.
The diagrams are on the same scale as Fig. 6 of the Kyte and
Doolittle paper and values of + and - 50 could be assigned to the
top and bottom of the diagram with corresponding values in between
(-40,-20,0,20,40 are shown in the paper).
? Menu or option number=d22
Plot hydrophobicity
? odd span length (1-101) (11) =
? plot interval (1-101) (3) =
missing graphics
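The sliding-window sum can be sketched in Python as below, using the
Kyte and Doolittle values. Reporting each sum at the centre of its
window is an assumption, as are the function and parameter names.
With the default span of 11 the largest possible sum is 11 x 4.5 =
49.5, which matches the scale of + and - 50 mentioned above.

```python
# Kyte and Doolittle (1982) hydropathy values
KD = {"A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5, "Q": -3.5,
      "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5, "L": 3.8, "K": -3.9,
      "M": 1.9, "F": 2.8, "P": -1.6, "S": -0.8, "T": -0.7, "W": -0.9,
      "Y": -1.3, "V": 4.2}


def window_sums(seq, values, span=11, interval=3):
    """Return (1-based centre position, window sum) pairs."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    half = span // 2
    points = []
    for centre in range(half, len(seq) - half, interval):
        window = seq[centre - half:centre + half + 1]
        total = sum(values.get(aa, 0.0) for aa in window.upper())
        points.append((centre + 1, total))
    return points


print(window_sums("I" * 11, KD))   # [(6, 49.5)]
```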
@23. TX 3 4 @Plot charge
This routine plots the charge of each section of the sequence.
A window of size span is slid along the sequence and a sum
calculated for each position. Amino acids are assigned charges of 1,
-1 or 0.
If dialogue is requested select a span length and a plot
interval.
Typical dialogue follows.
? Menu or option number=d23
Plot charge
? odd span length (1-101) (11) =
? plot interval (1-101) (3) =
missing graphics
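A minimal Python sketch of the charge calculation follows. The text
does not say which amino acids receive which charge, so the
assignment used here (K and R = 1, D and E = -1, all others 0) is an
assumption, as is reporting each sum at the window centre.

```python
# Assumed assignment: K and R = +1, D and E = -1, all others 0
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}


def charge_profile(seq, span=11, interval=3):
    """Return (1-based centre position, summed charge) pairs."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    half = span // 2
    profile = []
    for centre in range(half, len(seq) - half, interval):
        window = seq[centre - half:centre + half + 1].upper()
        profile.append((centre + 1, sum(CHARGE.get(aa, 0) for aa in window)))
    return profile


print(charge_profile("KKRADEDEAKK", span=5, interval=2))
# [(3, 2), (5, -2), (7, -4), (9, 0)]
```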
@24. TX 4 @Plot Robson prediction
This routine uses the method of Garnier J, Osguthorpe D J, and
Robson B. (1978) J. Mol. Biol. 120, 97-120 to predict secondary
structures. The method divides protein secondary structures into 4
classes: helix, extended (usually referred to as sheet), turn and
coil. The routine calculates the likelihood that each segment of the
sequence lies in each of these classes. Results are presented
graphically or listed.
If dialogue is requested choose between plotted or listed
output.
Each residue has a certain probability of being found in each
of the 4 classes. This probability depends both on its own amino
acid type and on the 8 amino acids found on either side along the
protein chain. Four tables of weights, each of 20 by 17 elements, are
used to calculate the likelihood that each residue along the chain
falls into one of the four classes of structure. The most likely
structure at each point is the one with the highest score. The four
values are plotted in strips labelled H, E, T and C. Below, a strip
labelled D for decision is divided into four levels, each
corresponding to one of the four structure types. Their top to
bottom order is the same as that for the strips above, i.e. C, T, E,
and H. For each residue the program determines which of the four
likelihoods is highest. It places a single dot at the mid-point of
the corresponding strip, and also at the appropriate level in the
strip labelled D.
It should be noted that the method, when tested by Kabsch W
and Sander C, (1983) FEBS Lett. 155, 179-182, although one of the
better ones, was correct for only about 56% of residues.
Typical dialogue follows.
? Menu or option number=d24
Plot Robson secondary structure predictions
? (y/n) (y) Plot results n
9 S 217 -7 -39 15
10 E 226 5 -27 -39
11 L 233 -7 -26 -15
12 I 229 -23 9 4
13 K 214 -8 10 -8
14 Q 178 42 19 5
15 R 131 54 16 3
16 I 86 42 -31 -23
17 A 55 52 -30 -15
18 Q 15 67 4 25
19 F -34 86 47 74
20 N -41 74 17 106
21 V -16 118 -5 100
22 V 64 88 5 115
23 S 96 38 26 155
24 E 133 -25 13 96
25 A 118 -98 25 100
26 H 110 -150 37 86
27 N 57 -201 37 66
28 E 51 -140 11 -4
29 G 2 -77 37 9
30 T 2 28 28 7
31 I -11 117 -21 22
32 V -23 178 -55 5
33 S -54 193 -14 35
34 V -46 123 5 30
35 S -54 53 51 80
36 D -60 1 86 55
37 G -66 8 57 49
38 V -1 128 -30 -5
39 I 11 212 -56 -33
40 R 16 204 -44 -57
...etc
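The decision step, picking the class with the highest of the four
values, can be sketched in Python. The column order H, E, T, C is
assumed from the order of the plotted strips; the listing itself
does not label its columns. The rows are copied from the listing
above.

```python
CLASSES = ("H", "E", "T", "C")  # assumed column order of the listing


def decide(scores):
    """Return the class letter with the highest of the four scores."""
    best = max(range(4), key=lambda i: scores[i])
    return CLASSES[best]


rows = [(9, "S", (217, -7, -39, 15)),
        (23, "S", (96, 38, 26, 155)),
        (32, "V", (-23, 178, -55, 5)),
        (36, "D", (-60, 1, 86, 55))]
for pos, aa, scores in rows:
    print(pos, aa, decide(scores))
# 9 S H
# 23 S C
# 32 V E
# 36 D T
```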
@26. TX 4 @Draw a helix wheel
A helical representation of segments of the sequence is shown.
The display includes a schematic of the helix showing the links
between residues, with each vertex numbered according to position;
the sequence element at each vertex; a symbol denoting a
classification as hydrophobic(.), positively charged(+), negatively
charged(-), or otherwise( ). The residue number of the first
sequence element in the current window is displayed at the top-
left-hand corner of the diagram. Also at the top-left corner the
sequence in the current window is listed. Below this is the total
hydrophobicity and hydrophobic moment for the window calculated
according to Eisenberg et al J. Mol. Biol. 179, 125-142 (1984).
If dialogue is requested the user is asked for the angle to
define the turn between residues as seen looking along the helix,
and a window length. The window length can be up to 60, with default
18, and the angle has a default of 100 degrees. Note that 18 x 100
is 5 turns. When the option is selected the first segment in the
current active region is displayed then the bell rings. If the user
types only return, the display will click on by one residue; if
another number is typed, say N, then the display will click forwards
(or backwards if N is negative) by N residues. If the wheel runs off
either end of the sequence the option will be exited.
Typical dialogue follows.
? Menu or option number=d26
? Angle (1-130) (100) =
? Window (1-60) (18) =
missing graphics
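The geometry of the wheel can be sketched in Python: residue i of
the window is placed at i times the angle, measured around the helix
axis. The function names and the unit radius are assumptions for the
example. With the defaults, 18 residues at 100 degrees make exactly
5 turns and occupy 18 distinct vertices.

```python
import math


def wheel_vertices(window, angle=100.0, radius=1.0):
    """Return (residue, angle in degrees, x, y) for each residue."""
    vertices = []
    for i, aa in enumerate(window):
        theta = (i * angle) % 360.0
        rad = math.radians(theta)
        vertices.append((aa, theta, radius * math.cos(rad),
                         radius * math.sin(rad)))
    return vertices


def whole_turns(window_length, angle=100.0):
    """Number of helical turns covered by the window."""
    return window_length * angle / 360.0


print(whole_turns(18))                               # 5.0
angles = [v[1] for v in wheel_vertices("STEISELIKQRIAQFNVV")]
print(len(set(angles)))                              # 18
```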
@25. TX 3 4 @Plot hydrophobic moment
This routine plots hydrophobic moment and hydrophobicity
according to Eisenberg et al J. Mol. Biol. 179, 125-142 (1984). The
mean hydrophobicity per residue in the window is plotted on a scale
-1.0 to 1.5, and the mean hydrophobic moment per residue on a scale
0.0 to 1.5. The hydrophobicity is shown in the top frame with the
hydrophobic moment below. The plot is arranged so that the value
shown at position x represents the mean value for residues
x-window+1 to x, where window is the window length.
If dialogue is requested the user can select a window length,
and the angle used for the hydrophobic moment calculation.
Note that according to Eisenberg et al, in transmembrane
proteins an "initiator" is required. This is either a very
hydrophobic single helix with <H> >=0.68, or a moderately
hydrophobic pair of helices whose <H> sum to >= 1.1. Other helices
are then accepted as transmembrane if their <H> >= 0.42.
The following rules are claimed: if <H> < 0.51 and points lie
below the line <M> = -0.392 + 0.603x <H> they are "globular", if
they lie above this line they are "surface". If <H> > 0.51 and they
lie above the line <M> = 0.6 - 0.342x<H> they are "monomeric", if
above "multimeric".
Typical dialogue follows.
? Menu or option number=d25
? Angle (1-130) (100) =
? Window (1-60) (18) =
? Plot interval (1-101) (3) =
missing graphics
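The quantities plotted can be sketched in Python as below. The
moment is taken as the length of the vector sum of the residue
hydrophobicities, each rotated by the chosen angle from its
predecessor, which is the usual statement of the Eisenberg et al.
measure. The hydrophobicity scale is passed in by the caller; the
two-letter scale in the example is a toy stand-in and not the
published scale, and the function name is hypothetical.

```python
import math


def moment_profile(seq, scale, window=18, angle=100.0, interval=3):
    """Return (x, mean hydrophobicity, mean moment) for each window,
    where x is the 1-based position of the window's last residue."""
    delta = math.radians(angle)
    points = []
    for end in range(window, len(seq) + 1, interval):
        segment = seq[end - window:end]          # residues x-window+1 .. x
        h = [scale[aa] for aa in segment]
        sin_sum = sum(v * math.sin(n * delta) for n, v in enumerate(h))
        cos_sum = sum(v * math.cos(n * delta) for n, v in enumerate(h))
        mean_h = sum(h) / window
        mean_m = math.hypot(sin_sum, cos_sum) / window
        points.append((end, mean_h, mean_m))
    return points


toy_scale = {"L": 1.0, "K": -1.0}   # hypothetical values for illustration
x, mean_h, mean_m = moment_profile("L" * 18, toy_scale)[0]
print(x, mean_h, round(mean_m, 6))
```

A window of 18 identical residues at 100 degrees has a moment of
zero apart from rounding error, because the 18 vectors are spread
evenly over 5 complete turns.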
@27. TX 1 @Back translate to DNA
This routine back translates protein sequences into DNA using
the standard genetic code. The level of redundancy can be plotted
and the backtranslation saved to a file.
The translation can use either the IUB symbols shown below, or
a set of codon preferences. If a set of codon preferences is used
it must conform to the format of codon tables produced by the
nucleotide analysis program, and the back translation will contain
the favoured codons. If there is no favoured codon the IUB symbols
will be employed. The window length for plotting the redundancy is
in codons.
The program will plot the redundancy along the sequence and
hence can be used to find the best sequences to use as primers. Note
that the program plots the inverse, and so the higher the plot the
LESS redundant the sequence. For primers look for peaks rather than
troughs.
The DNA sequence can be saved to a file and analysed using the
nucleotide analysis program. Depending on the application it is
often useful to produce two back translations, one using a table of
codon preferences and one using the IUB symbols. This is because the
restriction enzyme search program can distinguish between definite
and possible cuts in the sequence. What the program terms
"definite matches" are ones in which the specification of the
recognition sequence corresponds exactly to that of the back
translation. The program will also find what it terms "possible
matches", which are ones that depend on the particular codons chosen
for each amino acid. These are sites at which recognition sequences
could be engineered to produce a cut in the DNA without changing the
amino acid, but which are not necessarily found in the original
sequence.
NC-IUB SYMBOLS
A,C,G,T
R (A,G) 'puRine'
Y (T,C) 'pYrimidine'
W (A,T) 'Weak'
S (C,G) 'Strong'
M (A,C) 'aMino'
K (G,T) 'Keto'
H (A,T,C) 'not G'
B (G,C,T) 'not A'
V (G,A,C) 'not T'
D (G,A,T) 'not C'
N (G,A,C,T) 'aNy'
Typical dialogue follows.
? Menu or option number=d27
Back translate
? (y/n) (y) No codon preference
? (y/n) (y) Plot redundancy n
? (y/n) (y) Save DNA to disk
? File name for DNA sequence=tt:
ATGCARYTNAAYWSNACNGARATHWSNGARYTNATHAARCARMGNATHGCNCARTTYAAY
GTNGTNWSNGARGCNCAYAAYGARGGNACNATHGTNWSNGTNWSNGAYGGNGTNATHMGN
ATHCAYGGNYTNGCNGAYTGYATGCARGGNGARATGATHWSNYTNCCNGGNAAYMGNTAY
GCNATHGCNYTNAAYYTNGARMGNGAYWSNGTNGGNGCNGTNGTNATGGGNCCNTAYGCN
GAYYTNGCNGARGGNATGAARGTNAARTGYACNGGNMGNATHYTNGARGTNCCNGTNGGN
MGNGGNYTNYTNGGNMGNGTNGTNAAYACNYTNGGNGCNCCNATHGAYGGNAARGGNCCN
YTNGAYCAYGAYGGNTTYWSNGCNGTNGARGCNATHGCNCCNGGNGTNATHGARMGNCAR
WSNGTNGAYCARCCNGTNCARACNGGNTAYAARGCNGTNGAYWSNATGATHCCNATHGGN
MGNGGNCARMGNGARYTNATHATHGGNGAYMGNCARACNGGNAARACNGCNYTNGCNATH
GAYGCNATHATHAAYCARMGNGAYWSNGGNATHAARTGYATHTAYGTNGCNATHGGNCAR
AARGCNWSNACNATHWSNAAYGTNGTNMGNAARYTNGARGARCAYGGNGCNYTNGCNAAY
ACNATHGTNGTNGTNGCNACNGCNWSNGARWSNGCNGCNYTNCARTAYYTNGCNMGNATG
CCNGTNGCNYTNATGGGNGARTAYTTYMGNGAYMGNGGNGARGAYGCNYTNATHATHTAY
GAYGAYYTNWSNAARCARGCNGTNGCNTAYMGNCARATHWSNYTNYTNYTNMGNMGNCCN
CCNGGNMGNGARGCNTTYCCNGGNGAYGTNTTYTAYYTNCAYWSNMGNYTNYTNGARMGN
GCNGCNMGNGTNAAYGCNGARTAYGTNGARGCNTTYACNAARGGNGARGTNAARGGNAAR
ACNGGNWSNYTNACNGCNYTNCCNATHATHGARACNCARGCNGGNGAYGTNWSNGCNTTY
GTNCCNACNAAYGTNATHWSNATHACNGAYGGNCARATHTTYYTNGARACNAAYYTNTTY
AAYGCNGGNATHMGNCCNGCNGTNAAYCCNGGNATHWSNGTNWSNMGNGTNGGNGGNGCN
GCNCARACNAARATHATGAARAARYTNWSNGGNGGNATHMGNACNGCNYTNGCNCARTAY
MGNGARYTNGCNGCNTTYWSNCARTTYGCNWSNGAYYTNGAYGAYGCNACNMGNAARCAR
YTNGAYCAYGGNCARAARGTNACNGARYTNYTNAARCARAARCARTAYGCNCCNATGWSN
GTNGCNCARCARWSNYTNGTNYTNTTYGCNGCNGARMGNGGNTAYYTNGCNGAYGTNGAR
YTNWSNAARATHGGNWSNTTYGARGCNGCNYTNYTNGCNTAYGTNGAYMGNGAYCAYGCN
CCNYTNATGCARGARATHAAYCARACNGGNGGNTAYAAYGAYGARATHGARGGNAARYTN
AARGGNATHYTNGAYWSNTTYAARGCNACNCARWSNTGG---
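The IUB back translation shown above can be reproduced with a simple
lookup, sketched here in Python. The codon table is read off the
dialogue output. Writing "---" for any other symbol mirrors the end
of the output but is an assumption, and the way the program averages
the redundancy over a window is not documented here, so
inverse_redundancy is only one plausible reading. All names are
hypothetical.

```python
# Degenerate codon per amino acid, as they appear in the output above
IUB_CODON = {
    "A": "GCN", "C": "TGY", "D": "GAY", "E": "GAR", "F": "TTY",
    "G": "GGN", "H": "CAY", "I": "ATH", "K": "AAR", "L": "YTN",
    "M": "ATG", "N": "AAY", "P": "CCN", "Q": "CAR", "R": "MGN",
    "S": "WSN", "T": "ACN", "V": "GTN", "W": "TGG", "Y": "TAY",
}
# Number of codons for each amino acid in the standard genetic code
N_CODONS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2,
            "I": 3, "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2,
            "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2}


def back_translate(protein):
    """Return the degenerate DNA; unknown symbols become '---'."""
    return "".join(IUB_CODON.get(aa, "---") for aa in protein.upper())


def inverse_redundancy(protein, window=5):
    """Mean of 1/(codons per residue) over each window of codons.
    Higher values mean LESS redundancy, as in the plot."""
    inv = [1.0 / N_CODONS[aa] for aa in protein.upper()]
    return [sum(inv[i:i + window]) / window
            for i in range(len(inv) - window + 1)]


print(back_translate("MQLNSTEIS"))   # ATGCARYTNAAYWSNACNGARATHWSN
print(inverse_redundancy("MW", window=2))   # [1.0]
```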
@28. TX 5 @Search for patterns of motifs
This option searches for patterns of motifs. Patterns can be
defined interactively or read from files. Results can be displayed
in several ways in both graphical and textual form. The option can
also be used to create pattern files for searching libraries. It is
extremely flexible and consequently the following documentation is
quite lengthy. However the routine is capable of searching for almost
any known pattern. In addition the flexibility does not necessitate
difficulty of use, and the user interface has been simplified
considerably since the methods were first published.
Users should refer to the "typical dialogue" shown below for
the most helpful information on using the program.
There are currently four ways to display the matching
patterns: 1=each individual motif and its position is listed; 2=all
the sequence between, and including, the two outermost motifs is
listed; 3=graphical, with a vertical line marking the position of
the leftmost motif; 4=EMBL feature table format, where the KEYNAM
field is the motif name, the FROM and TO fields denote the ends of
the match, and the DESCRIPTION field is "Program".
When it is defined for the first time a pattern must be
entered interactively at the keyboard, but the pattern description
can be saved to a file. This file can be used for all subsequent
searches.
When defining a pattern interactively select a motif class and
the program will request the required inputs.
The program gives each motif an identifying name and number.
For motifs other than the first, a range of allowed positions must
be defined (Note that sets of motifs included using the OR operator
will all be given the same range, and so the program will only
request range values for the first motif in any such set). To
specify the allowed range for a motif the user must supply the
following: the identifying number of the motif relative to which
the current motif's positions are to be defined (termed the
"reference motif"); a "relative start position"; and a range. The
relative start position can be negative or positive. A negative
start position means that although the reference motif is searched
for first, the current motif can be found to its left. A zero
relative start position means their left ends are superimposed. The
default start position is to butt-join the motif to the right-hand
end of the "reference motif". The range is "the number of extra
positions" that the motif can take.
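How a relative start position and a number of extra positions
combine into a range can be sketched in Python. The function name is
hypothetical. The two calls use values from the typical dialogue
below, where the pattern description reports them as "positions 21
to 24" and "positions 3 to 3".

```python
def relative_range(relative_start, extra_positions):
    """Return (first, last) relative positions the motif may take."""
    if extra_positions < 0:
        raise ValueError("number of extra positions cannot be negative")
    return relative_start, relative_start + extra_positions


print(relative_range(21, 3))   # (21, 24)
print(relative_range(3, 0))    # (3, 3)
```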
The program will display the probability of finding each
motif. These values are presented in the following form: .1234E-5
means 0.1234 times 10 to the power -5.
After the pattern has been defined, the program will type a
description of it on the screen. It will then allow the user to give
an overall cutoff score and overall probability cutoff.
Typical dialogue for all the different motif classes is
displayed below.
? Menu or option number=28
Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =
? Motif name=aa
? String=aa
Probability of score 2.0000 = 0.123E-01
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =2
? Motif name=pmatch
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-1) (1) =
? Relative start position (-1000-1000) (3) =
? Number of extra positions (0-1000) (0) =
? String=qqq
? Minimum matches (1.00-3.00) (3.00) =2
Probability of score 2.0000 = 0.858E-02
1 Exact match
X 2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =3
? Motif name=sm
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-2) (2) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? String=wqa
? Minimum score (11.00-53.00) (53.00) =36
Probability of score 36.0000 = 0.531E-02
1 Exact match
2 Percentage match
X 3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =4
? Motif name=hth
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-3) (3) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? Weight matrix file name=hth
HELIX TURN HELIX PABO SAUER WEIGHTS 17-11-87
Probability of score -51.5860 = 0.230E-04
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
X 4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =5
? Motif name=repeat
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-4) (4) =
? Relative start position (-1000-1000) (21) =
? Number of extra positions (0-1000) (0) =3
? Repeat length (1-60) (6) =3
? Minimum gap (0-60) (0) =
? Maximum gap (0-60) (0) =2
? Minimum score (11.00-60.00) (36.00) =
Probability of score 36.0000 = 0.445E-01
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
X 5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =6
? Motif name=mset
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-5) (5) =
? Relative start position (-1000-1000) (1) =
? Number of extra positions (0-1000) (0) =
X 1 Keyboard input
2 File input
? 0,1,2 =
Separate sets with commas
? String=AVL,AST,,WYRF
? Minimum matches (1.00-4.00) (4.00) =3
Probability of score 3.0000 = 0.718E-02
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
X 6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =7
? (y/n) (y) Save pattern in a file
? Pattern definition file=EXAM.PAT
Motif 6 needs a file name to store set as a weight matrix
? Weight matrix file name=DEMO.WTS
Weight matrix needs a title
? Title=Demonstration class 6 weight matrix
Pattern description
Motif 1 named aa is of class 1
Which is an exact match to the string
aa
Motif 2 named pmatch is of class 2
which is a match of score 2. to the string
qqq
and the N-terminal residue can take positions 3 to 3
relative to the N-terminal end of motif 1
It is anded with the previous motif.
Motif 3 named sm is of class 3
which is a match of score 36. to the string
wqa
and the N-terminal residue can take positions 4 to 4
relative to the N-terminal end of motif 2
It is anded with the previous motif.
Motif 4 named hth is of class 4
Which is a match to a weight matrix with score -51.586
and the N-terminal residue can take positions 4 to 4
relative to the N-terminal end of motif 3
It is anded with the previous motif.
Motif 5 named repeat is of class 5
Which is a repeat with repeat length 3 and score 36.
The loop-out can have sizes 0 to 2
and the N-terminal residue can take positions 21 to 24
relative to the N-terminal end of motif 4
It is anded with the previous motif.
Motif 6 named mset is of class 6
Which is membership of a set with score 3.000
It is anded with the previous motif.
Probability of finding pattern = 0.4109E-14
Expected number of matches = 0.2539E-10
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
Select display mode
X 1 Motif by motif
2 Inclusive
3 Graphical
4 EMBL feature table
? 0,1,2,3,4 =
Searching
Total matches found 0
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=6
Page through text files
? Name of file to read=exam.pat
A1 aa Class
aa
@ End of string
A2 pmatch Class
1 Relative motif
3 Relative start position
0 Number of extra positions
qqq
@ End of string
2.00000 Cutoff
A3 sm Class
2 Relative motif
4 Relative start position
0 Number of extra positions
wqa
@ End of string
36.00000 Cutoff
A4 hth Class
3 Relative motif
4 Relative start position
0 Number of extra positions
hth File name
A5 repeat Class
4 Relative motif
21 Relative start position
3 Number of extra positions
3 Length
0 Minimum loop
2 Maximum loop
36.00000 Cutoff
A6 mset Class
5 Relative motif
1 Relative start position
0 Number of extra positions
DEMO.WTS File name
End of file
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=6
Page through text files
? Name of file to read=demo.wts
Demonstration class 6 weight matrix
4 0 3.000 4.000
P 1 2 3 4
N 0 0 0 0
C 0 0 0 0
S 0 1 0 0
T 0 1 0 0
P 0 0 0 0
A 1 1 0 0
G 0 0 0 0
N 0 0 0 0
D 0 0 0 0
E 0 0 0 0
Q 0 0 0 0
B 0 0 0 0
Z 0 0 0 0
H 0 0 0 0
R 0 0 0 1
K 0 0 0 0
M 0 0 0 0
I 0 0 0 0
L 1 0 0 0
V 1 0 0 0
F 0 0 0 1
Y 0 0 0 1
W 0 0 0 1
End of file
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=28
Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =2
? Motif name=avlst
? String=avlst
? Minimum matches (1.00-5.00) (5.00) =3
Probability of score 3.0000 = 0.394E-02
1 Exact match
X 2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =7
? (y/n) (y) Save pattern in a file n
Pattern description
Motif 1 named avlst is of class 2
which is a match of score 3. to the string
avlst
Probability of finding pattern = 0.3941E-02
Expected number of matches = 0.2030E+01
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
Select display mode
X 1 Motif by motif
2 Inclusive
3 Graphical
4 EMBL feature table
? 0,1,2,3,4 =4
Searching
FT avlst 152 156 Program
Total matches found 1
Minimum and maximum observed scores 3.00 3.00
General notes
These methods allow users to define and search for complex
patterns of motifs defined as single objects. The programs allow
individual DNA motifs to be defined in eight different ways, and
protein motifs in six. Motifs are combined, using the logical
operators AND, OR and NOT, to describe a pattern. The pattern also
specifies the ranges of allowed relative separations of the
individual motifs.
First some definitions.
A MOTIF is a contiguous subsequence of fixed length. At its
simplest it could be a single definite base or amino acid; a more
complex motif might be better represented as a consensus or a weight
matrix; two more-abstract types of motif are direct and inverted
repeats.
A PATTERN is a higher order of structure defined by a list of
motifs. The motifs in a pattern are combined using the logical
operators AND, OR and NOT. The list also defines the allowed
relative separations of the motifs. In the current versions of the
programs up to 50 motifs can be combined into a single pattern. So
using these definitions there are two differences between motifs and
patterns: 1) the distances between all elements of a motif are
fixed, but the separations of parts of patterns can vary; 2) all
characters in a motif are defined using the same method (class), but
different parts of a pattern can be defined in completely different
ways.
Each motif can be represented in 9 ways (known as the motif
class):
MOTIF CLASSES
CLASS DESCRIPTION
1 Exact match to a short defined sequence. The IUB symbols
can be used for DNA sequences.
2 Percentage match to a defined short sequence. In nucleic acids,
the IUB symbols can be used.
3 Match to a defined sequence, using a score matrix and cutoff
score. The DNA matrix (see option 18) gives scores to IUB symbols
depending on their level of redundancy. MDM78 is used for proteins.
4 Match to a weight matrix with cutoff score.
5 As class 4 but on the complementary strand.
6 Inverted repeat or stem-loop. Fixed stem length, range of
loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
7 Exact match to short sequence but with a defined step size.
8 Direct repeat. Fixed repeat length, range of loop-out sizes,
cutoff score, and score matrix (for protein sequences MDM78 and
for nucleic acids an identity matrix).
9 Membership of a set. A list of sets of allowed amino acids for
each position in the motif. The sets are separated by commas(,).
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
acids in which one of I,V or L must be found in the first position,
then anything in the next two positions, D,E,K or R in the fourth
position and F,Y,W,I,L,V or M in the fifth. This class only applies
to protein sequences because for nucleic acids "membership of a
set" can be achieved using IUB symbols.
Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
nucleic acids.
Class 1: exact match.
The motif is defined by a short sequence, which for nucleic
acids, may include IUB symbols. All symbols must match.
Class 2: percentage match
The motif is defined by a short sequence, which for nucleic
acids, may include IUB symbols. The minimum number of matching
characters must also be specified.
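A class 2 search can be sketched in Python as follows. The sketch
handles plain characters only (no IUB symbols), and the function
name and test sequence are made up for the example.

```python
def percentage_matches(seq, motif, min_matches):
    """Yield (1-based start, matches) wherever enough characters agree."""
    seq, motif = seq.upper(), motif.upper()
    for start in range(len(seq) - len(motif) + 1):
        # zip stops at the end of the motif, so only the window is compared
        n = sum(a == b for a, b in zip(seq[start:], motif))
        if n >= min_matches:
            yield start + 1, n


print(list(percentage_matches("GGAVLSTKAALSA", "avlst", 3)))
# [(3, 5), (9, 3)]
```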
Class 3: match using a score matrix
The motif is defined by a short sequence, which for nucleic
acids, may include IUB symbols. The motif is not compared directly
with the sequence to count the number of matching characters.
Instead a matrix is used to provide a score for all possible pairs
of characters. The motif score for any position along the sequence
is the sum of the scores found by looking-up the scores for each
pair of aligned characters. A match is declared if some minimum
score is achieved.
Class 4: weight matrix
The motif is defined by a table of values (called weights or
scores). The table gives a score for finding each possible character
at each position along the length of the motif. It therefore has
dimension motif-length x character-set-size, and allows us to give
different scores for each character at each position. It is
equivalent to having a different score matrix for each position
along the motif, and provides the most flexible and specific method
of defining motifs. The weight matrices are created by program PIP
option 20 and stored as files. The file contains the values for each
position, as well as an overall minimum score. There are two ways in
which these values can be used to calculate an overall score for any
section of the sequence. The simplest way is to add the values in
the file. (This means that the highest possible score can be
calculated by adding the top value at each column position, and the
lowest by adding the bottom value.) The normal way of using the
values in the file is as follows. First the programs divide the
values in each column by the column total so that they sum to 1.0.
Then the natural logs of these values are used as scores. When the
matrix is applied to a sequence these logarithmic values are summed
(which is of course equivalent to multiplying the frequencies).
Note that using the natural logs of the frequencies as weights and
adding them means that the overall cutoff score must be less than
zero, whereas if the original values in the weight matrix file are
added, the cutoff score will be greater than zero. The search
routines therefore decide whether the user wants to add values or
multiply frequencies by examining the value of the cutoff score: they
add the stored values if the cutoff is greater than zero, and add the
logs of the frequencies if it is less than zero. Hence we effectively get two
motif classes in one. The program PIP, when creating weight matrix
files, will ask the user whether the scores should be added or
multiplied. If the values in the table have been defined without
using a set of aligned sequences it is easier for the user to choose
a cutoff score if the values are added.
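The two ways of using the values, selected by the sign of the cutoff,
can be sketched in Python. The matrix layout (one dictionary per
motif position) and the function name are assumptions for the
example, and the tiny two-letter matrix is invented. The log mode
requires every value to be greater than zero.

```python
import math


def score_segment(matrix, segment, cutoff):
    """matrix: one dict per motif position mapping residue -> value."""
    if cutoff < 0:
        # log mode: normalise each column to sum to 1.0, then sum the
        # natural logs (equivalent to multiplying the frequencies)
        total = 0.0
        for column, aa in zip(matrix, segment):
            total += math.log(column[aa] / sum(column.values()))
        return total
    # additive mode: sum the stored values directly
    return sum(column[aa] for column, aa in zip(matrix, segment))


matrix = [{"A": 3, "G": 1}, {"A": 1, "G": 3}]
log_score = score_segment(matrix, "AG", cutoff=-5.0)
add_score = score_segment(matrix, "AG", cutoff=1.0)
print(round(log_score, 4), add_score)   # -0.5754 6
```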
Class 5: complement of weight matrix
The motif is defined by a weight matrix, but the program
searches for its complement.
Class 6: inverted repeat, or stem-loop
The motif is defined by a repeat length, a minimum score and a
range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0.
The loop sizes are defined by a minimum and maximum distance from
the 3' end of the stem. For a stem-loop these will be positive
numbers. For example to define a stem of length 8 and loop sizes
varying from 3 to 5, the stem would be set to 8, the minimum start
distance to 3 and the maximum to 5. To define an inverted repeat the
minimum distance will be negative. For example stem length=9,
minimum distance=-9, and maximum distance=-8 will find inverted
repeats of lengths 9 and 10. E.g. AAAAATTTT and AAAAATTTTT would be
found, the first having a base at its centre, the second having
none.
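One reading of the stem and distance parameters is sketched below in
Python: the second arm starts the given distance after the 3' end of
the first arm, and base i of the first arm pairs with base stem-1-i
of the second. This interpretation, the 0-based indexing and the
function name are assumptions. Under it the two example sequences
above score 16 (the centre base pairs with itself and scores 0) and
18.

```python
PAIR_SCORE = {("A", "T"): 2, ("T", "A"): 2, ("G", "C"): 2, ("C", "G"): 2,
              ("G", "T"): 1, ("T", "G"): 1}   # everything else scores 0


def stem_score(seq, start, stem, distance):
    """Score a stem whose first arm starts at 0-based index `start`.
    Returns None if the second arm would fall outside the sequence."""
    second = start + stem + distance      # start of the second arm
    if second < 0 or second + stem > len(seq):
        return None
    score = 0
    for i in range(stem):
        a = seq[start + i]
        b = seq[second + stem - 1 - i]    # partner, read from the far end
        score += PAIR_SCORE.get((a, b), 0)
    return score


print(stem_score("AAAAATTTT", 0, 9, -9))     # 16
print(stem_score("AAAAATTTTT", 0, 9, -8))    # 18
print(stem_score("GGGGAAACCCC", 0, 4, 3))    # 8
```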
Class 7: exact match, defined step size.
The motif is defined by a short sequence, which for nucleic
acids, may include IUB symbols. All symbols must match. The class
differs from class 1 in that searches will move in steps of some
given size. For example we could search for a certain codon and use
a step size of 3 and hence keep in a single reading frame.
Class 8: direct repeat
The motif is defined by a repeat length, a minimum score and a
range of loop sizes. The scores are defined using MDM78 for protein
sequences and an identity matrix for nucleic acids. The loop sizes
are defined by a minimum and maximum distance from the 3' end of the
stem.
Class 9: membership of a set
This motif class is for protein sequences. It is defined by
lists of allowed amino acids for each position in the motif, and a
cut-off score. Positions at which any amino acid can occur are left
blank. All allowed amino acids for each position give a score of 1.
The motifs can be defined in two ways: either typed at the keyboard
or read in as a weight-matrix-like file. When the motif is defined
at the keyboard the sets of allowed amino acids are separated by
commas (,). For example IVL,,,DEKR,FYWILVM defines a motif of length
5 amino acids in which one of I,V or L must be found in the first
position, then anything in the next two positions, D,E,K or R in the
fourth position and F,Y,W,I,L,V or M in the fifth. To specify that
the whole motif must match, a score of 3 would be required (i.e. one
of the allowed amino acids must be found for each of the three
defined positions). If the motif is read from a file the file must
have been written by program PIP, or have been saved by the pattern
searching routines. If the user elects to save a pattern, and it
includes class 9 motifs typed at the keyboard, then the program will
save the class 9 motifs as weight matrix files. Therefore it will
request file names for each motif of this class. If the motif given
above as an example were saved the weight matrix file would have 5
columns. The first column would contain zeroes except for the I, V
and L rows which would be set to 1; the next two columns would all
be zero; the next would be zero except for the D,E,K and R rows
which would be 1; the final column would contain 1's in rows
F,Y,W,I,L,V and M, with the rest zero.
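The keyboard form of a class 9 motif, and the 0/1 matrix it
corresponds to, can be sketched in Python as follows. This is only
an illustration: the function names, the order of the amino acid
rows and the 0-based positions are assumptions, and the real weight
matrix file format is not reproduced.

```python
# Illustrative sketch only; row order and names are assumptions.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def parse_set_motif(text):
    """Split e.g. 'IVL,,,DEKR,FYWILVM' into one set per position.
    An empty set means any amino acid is allowed (and scores 0)."""
    return [set(field) for field in text.split(",")]


def set_motif_matrix(sets):
    """One row per amino acid, one column per motif position;
    allowed amino acids score 1, everything else 0."""
    return {aa: [1 if aa in allowed else 0 for allowed in sets]
            for aa in AMINO_ACIDS}


def set_motif_score(sets, window):
    """Count the positions at which the window holds an allowed residue."""
    return sum(1 for allowed, aa in zip(sets, window) if aa in allowed)


def search_set_motif(seq, text, cutoff):
    """Return 0-based start positions whose score reaches the cutoff."""
    sets = parse_set_motif(text)
    n = len(sets)
    return [i for i in range(len(seq) - n + 1)
            if set_motif_score(sets, seq[i:i + n]) >= cutoff]
```

For the example motif a cut-off of 3 demands a match at all three
defined positions, whereas a cut-off of 2 would tolerate one
mismatch.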
The logical operator (AND, OR or NOT) used to add each motif
to the pattern is specified by preceding the class number by the
letters A, O or N. A = AND, O = OR, N = NOT. The default is A, so
N2 means include, using the NOT operator, a class 2 motif; O2 means
include, using the OR operator, a class 2 motif; both A2 and 2 mean
include, using the AND operator, a class 2 motif.
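The rule for reading an operator letter and class number can be
written down as a small Python sketch. The function name is
hypothetical, and the program's own input handling may differ (for
example in how it treats lower case or invalid input).

```python
# Illustrative sketch only; not the program's own parser.
OPERATORS = {"A": "AND", "O": "OR", "N": "NOT"}


def parse_motif_entry(entry):
    """Turn 'N2', 'O2', 'A2' or '2' into (operator, class number).
    A missing letter means AND, the default."""
    entry = entry.strip().upper()
    if entry and entry[0] in OPERATORS:
        return OPERATORS[entry[0]], int(entry[1:])
    return "AND", int(entry)
```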
Range setting.
The motifs in a pattern are numbered according to their order
in the list. Apart from the first motif in a pattern all motifs are
given a range of allowed positions relative to a motif further up
the list. For example suppose we have a pattern defined by A AND B
AND C AND D. Motif A can occur anywhere, but B must have its range
of allowed positions defined relative to the position of motif A,
and C's positions can be defined relative to either A or B,
depending on which is most convenient, and likewise D's positions
can be relative to A or B or C.
Notice that the positions of motifs can be defined relative to
more than one motif. Suppose we have a pattern consisting of motifs
A, B and C, and that B occurs 5-10 residues right of A, C occurs 5-
10 residues right of B, and also C is never more than 15 residues
from A. Then it is quite consistent with the methods to include
motif C into the pattern twice using the AND operator: once relative
to A and once relative to B. This will define the relative spacing
and the ORDER of the motifs in the pattern. (If we simply defined
the position of C relative to A it could be found to the left of B).
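The effect of giving motif C two ranges can be checked with a small
Python sketch. It assumes, purely for illustration, that each range
is an allowed difference between motif start positions, and it reads
"never more than 15 residues from A" as an offset of 0 to 15; the
program's own convention for measuring distances may differ.

```python
# Illustrative sketch only; the offset convention is an assumption.
def satisfies(positions, constraints):
    """positions: motif name -> start position.
    constraints: (motif, reference, min_offset, max_offset) tuples,
    each requiring min <= pos[motif] - pos[reference] <= max."""
    return all(lo <= positions[motif] - positions[ref] <= hi
               for motif, ref, lo, hi in constraints)


# C is entered twice: once relative to B and once relative to A.
both_ranges = [("B", "A", 5, 10), ("C", "B", 5, 10), ("C", "A", 0, 15)]

# C is entered once only, relative to A.
a_only = [("B", "A", 5, 10), ("C", "A", 0, 15)]
```

With positions A=100, B=110 and C=105 the second list is satisfied
even though C lies to the left of B, but the first list is not.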
Motifs combined together using the OR operator are all given
the same range. For example suppose we had a pattern A AND (B OR C)
AND (D OR E), then B and C each have the same range, and D and E
also have the same range as one another. The range for D and E can
be relative to A or to B.
Motifs cannot have their ranges defined relative to motifs
that are included using the NOT operator. For example if we had the
pattern A NOT B AND C, then the range for C can only be defined
relative to motif A.
Speed can be gained by arranging the order of the motifs so
that those higher up the list are of types that can be searched for
rapidly and that are also unlikely to be found.
Motifs combined by the OR operator are alternatives: if any
one of a set of motifs combined by the OR operator is found, then a
match is declared. All alternatives will be reported. For example if
we had a pattern defined by A AND (B OR C), then all places where A
occurs and B is found within range, and all places where A is found
and C is found within range will be reported. A typical use would be
where we might allow a motif to appear on either strand of the DNA
sequence. For example a weight matrix representing the heatshock
element could be used in a pattern which included heatshock as a
motif class 4 combined using the OR operator with heatshock as a
motif class 5.
The probability calculations are performed for each motif as
it is defined. If an overall probability cut-off is given the
calculation is repeated for each match found. To achieve maximum
searching speed do not give an overall probability cut-off. Overall
cut-off scores should only be used if the motif classes used are
compatible.
There are currently several ways to display the matches: 1 =
each motif and its position is listed; 2 = all the sequence between
the two outermost motifs is listed; 3 = graphical, with a spike
marking the position of the leftmost motif. The library versions
also give entry names, and a one line title; in addition they can be
used to produce aligned families of sequences. When this mode of
output is selected the program will write a separate file for each
match. The files will be called ENTRYNAME.DAT where ENTRYNAME is the
name of the entry in the library. The matching sequence will be
written out so that the spacing between motifs is constant, and set
to the maximum allowed by the pattern definition. Any gaps will be
filled with dashes (-). If the individual sequences were
subsequently written one above the other they should line up so that
all motifs are in register. There are two types of output of this
sort: one, option 4, writes out whole sequences, the other, option 5,
writes out only the sequences between the two outermost motifs. Note
that for option 4 users are
asked to type the position of the first motif, and the reason for
this is explained below. Consider a pattern found in several
sequences. Consider only the first motif in the pattern and suppose
that it was found in different positions in these sequences. Say
that of these positions the one furthest from the left end was
position 100. Then, in order to ensure that all the sequences would
align, we must specify that motif 1 must start at position 100. Any
sequences in which motif 1 started nearer to the left end than
position 100 would be padded accordingly. These modes of output
should only be used when the position of each motif is defined
relative to its immediate neighbour.
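The left-hand padding used for option 4 can be pictured with the
following Python sketch. It assumes 1-based positions and that the
padding character is a dash, as used for the gaps between motifs;
the function name is hypothetical and the padding of gaps between
motifs is not shown.

```python
# Illustrative sketch only; padding character is an assumption.
def pad_to_motif_start(seq, motif_start, target_start):
    """Left-pad `seq` so that motif 1, found at 1-based position
    `motif_start`, begins at 1-based position `target_start`."""
    if motif_start > target_start:
        raise ValueError("target must be at least the largest motif start")
    return "-" * (target_start - motif_start) + seq
```

If the furthest-right occurrence of motif 1 among the matches were
at position 100, each sequence would be padded with target_start set
to 100.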
The pattern descriptions can be saved to files. These files
can be used instead of typing definitions again at the keyboard. As
the files are annotated, they can easily be changed using system
editors, and the modified versions used to define the variant
patterns for the programs.
Use of lists of entry names
The two programs that operate on libraries have the ability to
restrict their searches to subsets of the libraries. This does not
require sublibraries to be created but instead is achieved by using
files containing a list of the entry names of sequences. The user
may choose to search only those entries on the list or,
alternatively, to search all but those on the list (i.e. in the
latter case the list contains the names of those to be excluded).
The programs can search libraries that have indexes and those that
do not. If a list of names for inclusion is used, then the search
will be faster if the index is present. In all other circumstances
the whole library will be read. The list must be in library order
except when it is used to include entries, and an index is
available. The list must contain each entry name on a separate
line, with the name starting in column 1 of the line, i.e. there must
be no spaces at the start of the line. The list of entry names can
be produced by the keyword searches of nip, pip, etc as long as the
listings produced have a space character separating the entry name
from the entry description. This will depend on how well the library
reformatting programs work. For example swissprot entry names tend
to run into the beginning of the descriptions, but other libraries
are generally OK.
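The use of a list of entry names to include or exclude entries can
be sketched in Python as below. The function names are hypothetical,
the entry names in the usage note are made up, and the sketch
ignores library indexes and the requirement that the list be in
library order.

```python
# Illustrative sketch only; not the programs' own file handling.
def read_entry_names(lines):
    """Take one entry name per line, starting in column 1.
    Anything after the first space (e.g. a description) is dropped."""
    names = []
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        if line[0].isspace():
            raise ValueError("entry name must start in column 1")
        names.append(line.split()[0])
    return names


def select_entries(library_names, listed, exclude=False):
    """Keep only the listed entries, or all but those listed."""
    listed = set(listed)
    return [name for name in library_names
            if (name in listed) != exclude]
```

For example, with a library holding entries E1, E2 and E3 and a list
naming E2, the include mode selects E2 and the exclude mode selects
E1 and E3.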
One use of the programs is to look for patterns that we
already know about, but in new sequences. However it is hoped that
they will also be useful for finding new motifs. For example several
known control regions in nucleic acid sequences consist of
particular direct or inverted repeats; the inclusion of direct and
inverted repeats as motif classes makes it possible to find
previously unknown motifs of these types. Using these new programs
we can ask questions like: "are there any inverted or direct repeats
near to sections of sequence that contain both a CCAAT box and a
TATA box?"; and to search for such things throughout the libraries.
In addition, the mode of output in which all the sequence between
the two outermost motifs found is printed out, allows us to extract
sequences and examine them in more detail for further common
subsequences. For example we might want to collect together all the
sequences between putative CCAAT and TATA boxes.
A further use of the inverted repeat motif class is the
following. If a regulatory sequence in DNA is poorly defined but
also an inverted repeat, then it might be an advantage to specify it
both as a consensus sequence and a superimposed inverted repeat. In
this way two weak definitions can be combined to produce a stronger
pattern.
Given only a few examples of a motif it should be possible to
perform initial searches using a class 3 motif, and then, using
plausible matching sequences, create a more specific weight matrix
for the same motif.
If motifs are combined with the first motif using the OR
operator they will be ignored until all permutations that include
the first motif have been looked for. The whole search will then be
repeated, in turn, for each of those motifs that are combined with
the first motif using the OR operator. An interesting consequence
of this is that the program can be used, without change, to compare
any newly determined sequence with all known individual motifs. We
achieve this by having a pattern in which all known relevant motifs
are combined using the OR operator. If we ask to use this pattern
with a sequence, the program will automatically compare each
individual motif in the pattern with the whole length of the
sequence. As the number of known motifs grows this should become an
increasingly useful standard procedure.
The NOT operator is obviously useful for making sure
particular motifs are not present, but it can also be used to
bracket the levels of matches found. We may want a degree of match
that lies between two limits - binding should occur, but not too
strongly; or base-pairs should form, but not too many. We can
specify this by asking for a match with a low score, in combination
with a match with a high score, both for the same motif, but with the
high score included using the NOT operator.
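The bracketing idea amounts to "reaches the low cut-off AND NOT
reaches the high cut-off". A minimal Python sketch, assuming we
already have one motif score per sequence position (the function
name and the list of scores are hypothetical):

```python
# Illustrative sketch only; scores are assumed to be precomputed.
def bracketed_matches(scores, low, high):
    """Return 0-based positions whose score reaches `low`
    but does not reach `high`."""
    return [pos for pos, score in enumerate(scores)
            if score >= low and not score >= high]
```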
The algorithm is designed to find all sections of a sequence
that satisfy the pattern rather than only the best match.
Particularly if some of the motifs in a pattern are less well
defined than others, this can often result in the same region of a
sequence being reported as having several matches, but which only
vary in the positions of the weakest motifs.
General remarks on motif searching
Generally motifs are short subsequences that are thought to be
associated with particular functions in some known sequences. Often
we search for them to try to understand or interpret other
sequences. Sometimes we search for motifs and patterns to test a
hypothesis about their role: are they found in the expected
positions in the expected sequences? In doing so we should remember
that, in both proteins and nucleic acids, what we are really looking
for is a particular three dimensional structure with certain
affinities for other structures, and that we are assuming that the
sequence of the motif alone defines the 3D structure we are searching
for. The overall structure may be completely different to those in
which the motif is functional, and hence the motif may have a
different shape or be inaccessible. We should be aware of the
importance of the context in which a motif is found. Where does it
lie relative to the overall structure, is it accessible, is the
three dimensional spacing between it and other motifs correct? For
example, is it on the same side of the double helix, and the correct
distance from some other motif? How does context affect our
assessment of the significance of finding a motif? Finding false
mammalian mRNA splice junctions in non-coding sequences is far less
important than finding false sites in pre-mRNA sequences, but
finding them in the correct places is most important! In other
words, it is often the case that when we are searching for a motif
that is known to be necessary for some function, then a positive
result in the form of a match in the required position, is more
important than a high background of matches in the wrong positions.
Being able to write down the probability of finding a motif in a
random sequence tells us how well it is defined. In nucleic acids
the DNA may contain many superimposed types of information such as
those concerned with histone phasing, protein coding or mRNA
secondary structure. These overlapping "codes" may interfere with
one another causing matches to motifs to be poorer than expected.
In general we will only have a limited number of examples of the
motif and we do not know how representative they are.
Sequences have superimposed functions: some parts may be of
general structural importance and give rise to an overall framework,
and other parts give specificity and hence are not common; we may
want to use a set of aligned sequences to define a motif, but want
to use only the framework positions. Alternatively we may want to
pick out only those parts of a set of aligned sequences that give a
particular property, and to ignore other similarities that are due
to some other property and which could obscure the pattern we are
interested in. It is possible to apply a mask to a set of aligned
sequences in order to give weight to selected positions only. The
ability to define a mask allows certain positions to be used in the
motif and others to be ignored, and yet still permits the use of a
set of aligned sequences to calculate weights. The mask is requested
and applied by the program and results in the masked positions being
zero in the weight matrix. The mask is defined in the following way.
Suppose we had a motif of length 15, then the mask x--x--xx-x will
give zero weights to positions 2,3,5,6 and 9 (note it is the dashes
(-) that are significant and that positions 1,4,7,8,10,11,12,13,14
and 15 will be non-zero). Of course the same set of sequences could
be used with several alternative masks in order to extract different
features and create corresponding weight matrices.
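The action of a mask on a weight matrix can be sketched in Python.
The representation of the matrix as a list of columns and the
function names are assumptions made for illustration; positions
beyond the end of the mask are left unchanged, as in the example
above.

```python
# Illustrative sketch only; the matrix layout is an assumption.
def masked_positions(mask):
    """Return the 1-based positions that a mask sets to zero."""
    return [i + 1 for i, ch in enumerate(mask) if ch == "-"]


def apply_mask(columns, mask):
    """columns: one list of weights per motif position.
    Columns under a dash (-) become all zero; others are copied."""
    return [[0] * len(col) if i < len(mask) and mask[i] == "-"
            else list(col)
            for i, col in enumerate(columns)]
```

For the mask x--x--xx-x, masked_positions gives 2, 3, 5, 6 and 9, and
applying it to a motif of length 15 leaves the other ten positions
with their original weights.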
The programs are described in Staden,R. CABIOS 4, 53-60, 1988;
Staden,R. CABIOS 5, 89-96, 1989, and a forthcoming Methods in
Enzymology.
@ end of help