.NPA
.SP 1
.left margin1
@-1. TX 0 @General
.sp
@-2. T 0 @Screen control
.sp
@-2. X 0 @Screen
.sp
@-3. T 0 @Statistical analysis of content
.sp
@-3. X 0 @Statistics
.sp
@-4. T 0 @Structures and repeats
.sp
@-4. X 0 @Structures
.sp
@-5. TX 0 @Search
.sp
@0. TX -1 @PIP
.para
This is a program for analysing individual protein sequences. It can read
sequences stored in many of the most commonly used formats, and
performs all of the usual simple analyses. In addition it has very flexible
search procedures and presents many of its results graphically.
.PARA
The following analyses (preceded by their option numbers) are included:
.lit
? = Help
! = Quit
3 = read a new sequence
4 = define active region
5 = list the sequence
6 = list a text file
7 = direct output to disk
8 = write active sequence to disk
9 = edit the sequence
10 = clear graphics screen
11 = clear text screen
12 = draw a ruler
13 = use cross hair
14 = reposition plots
15 = label diagram
16 = display a map
17 = search for short sequences
18 = compare a sequence
19 = compare a sequence using a score matrix
20 = search for a sequence using a weight matrix
21 = calculate amino acid composition
22 = plot hydrophobicity
23 = plot charge
24 = plot Robson prediction
25 = plot hydrophobic moment
26 = draw helix wheel
27 = back translate
28 = search for patterns of motifs
.end lit
.para
Some of these methods produce graphical results, so the
program is generally used from a graphics terminal (a VDU on which lines
and points can be drawn as well as characters).
.para
For users of VT640's or their equivalents the
terminal must be set nowrap (type NOWRAP) prior to running the program.
.LEFT MARGIN2
The position of each plot is defined relative to a user's drawing
board which has size 1-10,000 in x and 1-10,000 in y.
Plots for
each option are drawn in a window defined by x0,y0 and xlength,ylength,
where x0,y0 is the position of the bottom left hand corner of the window,
xlength is the width of the window and ylength is its
height.
.lit
--------------------------------------------------------- 10,000
1 1
1 -------------------------------------- ^ 1
1 1 1 1 1
1 1 1 1 1
1 1 1 ylength 1
1 1 1 1 1
1 1 1 1 1
1 -------------------------------------- v 1
1 x0,y0^ 1
1 <---------------xlength--------------> 1
--------------------------------------------------------- 1
1 10,000
.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "ANALPMRG" when the
program is started. Users can have their own file if required.
.para
The program can handle sequences stored in several formats:
Staden, EMBL, GENBANK, PIR (also known as NBRF) and GCG. These are described
in
the help for 'READ NEW SEQUENCE'.
.para
The options for the program are accessed from 5 main menus: general,
screen control, statistical analysis of content, structure, search.
Both menus and options are selected by number.
.LEFT MARGIN1
@1. TX 0 @Help
.LEFT MARGIN2
.para
This option gives online help. The user should select option numbers and
the current documentation will be given. Note that option 0 gives an
introduction to the program, and that ? will get help from anywhere in
the
program.
.sp
.left margin1
@2. TX 0 @Quit
.left margin2
.para
This function stops the program.
.left margin1
@3. TX 1 @Read a new sequence
.LEFT MARGIN2
.para
This option allows users to read in new sequences, browse through annotations,
or search sequence
libraries for keywords. Sequences can be read from "personal"
sequence files or from sequence libraries. These are referred to as the
sequence "source". Personal files can be stored in several formats:
Staden, PIR, EMBL, GENBANK and GCG.
At LMB we use "Staden" format for sequencing and all
the
libraries are stored in their original formats. Note, however, that libraries
such as EMBL or GenBank that are divided into several files (e.g. GenBank has
13 separate files) are indexed as a whole. This means that users do not need
to know which file contains an entry, only which library.
When the user selects to read in a sequence the program first asks for the
sequence "source".
.para
If the user selects "personal" the program will ask for
the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
the file. For PIR format the user will also be required to know the entry
name of the sequence as the file can contain several. For the other formats
only a single entry is expected. The file will be read, its length and
composition will be displayed and the option left.
.para
If the user selects "library" as the sequence source the program will display a
list of available libraries. The programs are capable of handling all current
libraries but which ones are available will vary from site to site. At LMB we
have several libraries and also weekly updates of data gathered between releases.
The program will ask users to select a library and then give a list of options:
.lit
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
.end lit
If "Get a sequence" or "Get annotations" is selected users will be asked to
type the entry name. The option will be left when a sequence is selected or
! is typed. The composition and length will be displayed.
.para
The text index contains all words from feature tables, reference titles,
definition lines, keywords lists and comments, so the text index search
is most useful. It is also the fastest. Up to 5 words can be searched for
at once. The words should be typed separated by spaces, for example
.lit
? Keywords=P53 mouse murine tumo
.end lit
will search for all entries that contain words starting with p53, mouse,
murine and tumo. Only the unique entries that contain ALL words will be
listed. Before listing the matching entries
the program will show the number of 'hits' for each word and ring the bell.
Escape is possible at this point, or after each screenful of entries.
In addition to the entry names the text search displays the primary accession
number, the sequence length and up to 80 characters of description.
(The search of 'titles' is now redundant because the full text index
contains all the title words and the search is much faster. It will probably
be removed from the program.)
All searches are independent of case. Where
possible the program will offer default entry names.
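.para
The matching rule described above (each word is matched as a prefix, case is
ignored, and only entries containing ALL words are listed) can be illustrated
with a short Python sketch. This is not the program's own code: the function
name, the in-memory index and the sample entries are invented for the example.
.lit
```python
def search_text_index(index, keywords, max_words=5):
    """Return the entry names that contain a word starting with every keyword.

    index    -- hypothetical dict: entry name -> list of words for that entry
    keywords -- the words as typed, separated by spaces
    """
    words = keywords.lower().split()[:max_words]
    hits = []
    for entry, entry_words in index.items():
        lowered = [w.lower() for w in entry_words]
        # every keyword must be the prefix of at least one word in the entry
        if all(any(w.startswith(k) for w in lowered) for k in words):
            hits.append(entry)
    return sorted(hits)


# Invented sample data, for illustration only
index = {
    "ENTRY1": ["Mouse", "p53", "mRNA"],
    "ENTRY2": ["Murine", "tumour", "antigen", "p53"],
    "ENTRY3": ["Mouse", "lyn", "protein"],
}
print(search_text_index(index, "P53 mouse"))   # ['ENTRY1']
```
.end lit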
.para
Typical dialogue follows.
.lit
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =
Select sequence file format
X 1 Staden
2 EMBL
3 GenBank
4 PIR
5 GCG
? Selection (1-5) (1) =
? Sequence file name=M13MP7.SEQ
Contig title removed
Sequence length= 7238
Sequence composition
T C A G -
2405. 1539. 1765. 1527. 2.
33.2% 21.3% 24.4% 21.1% 0.0%
.
.
.
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =2
Select a library
X 1 EMBL 29 nucleotide library Dec 91
2 SWISSPROT 20 protein library Nov 91
3 PIR 31 protein library Dec 91
4 NRL3D 58 From Brookhaven protein library Dec 91
5 GenBank
? Selection (1-5) (1) =
Library is in EMBL format with indexes
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =5
Search for keywords
? Keywords=P53 mouse
P53 hits 68
MOUSE hits 8180
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
22 different entries found
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =4
Search for keywords
? Keywords=alpha
Searching for alpha
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =3
? Accession number=v00636
Entry name LAMBDA
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =2
Default Entry name=LAMBDA
? Entry name=
ID LAMBDA standard; DNA; PHG; 48502 BP.
XX
AC V00636; J02459; M17233; X00906;
XX
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
DT 09-JUN-1982 (Rel. 1, Created)
XX
DE Genome of the bacteriophage lambda (Styloviridae).
XX
KW circular; coat protein; DNA binding protein; genome;
KW origin of replication.
XX
OS Bacteriophage lambda
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
XX
RN [1]
RP 1-48502
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
RT "Nucleotide sequence of bacteriophage lambda DNA";
RL J. Mol. Biol. 162:729-773(1982).
XX
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =
Default Entry name=LAMBDA
? Entry name=
DE Genome of the bacteriophage lambda (Styloviridae).
Sequence length 48502
Sequence composition
T C A G -
11988. 11360. 12336. 12818. 0.
24.7% 23.4% 25.4% 26.4% 0.0%
.end lit
.left margin1
@4. TX 1 @Redefine active region
.LEFT MARGIN2
.para
For its analytic functions
the program always works on a region of the sequence called the active
region. When a new sequence is read into the program the active region is
automatically set to start at the beginning of the sequence and go
up to the
maximum allowed size of active region the version of the program can
handle. The positions are shown on the screen.
On most machines this will be to the end of the sequence.
This option allows the user to define a different region. Note that for
convenience in the
listing and translation functions the user is given access to regions
outside the active region.
.left margin1
@5. TX 1 @List a sequence
.LEFT MARGIN2
.para
The sequence can be listed with line lengths from
10 to 120 in multiples of 10. Output can be directed to a disk file by
first selecting disk output. The output looks like:
.lit
10 20 30 40 50 60
MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
70 80 90 100 110 120
AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
130 140 150 160 170 180
LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
190 200 210 220 230 240
DAIINQRDSG IKCIYVAIGQ KASTISNVVR KLEEHGALAN TIVVVATASE SAALQYLARM
250 260 270 280 290 300
PVALMGEYFR DRGEDALIIY DDLSKQAVAY RQISLLLRRP PGREAFPGDV FYLHSRLLER
310 320 330 340 350 360
AARVNAEYVE AFTKGEVKGK TGSLTALPII ETQAGDVSAF VPTNVISITD GQIFLETNLF
370 380 390 400 410 420
NAGIRPAVNP GISVSRVGGA AQTKIMKKLS GGIRTALAQY RELAAFSQFA SDLDDATRKQ
430 440 450 460 470 480
LDHGQKVTEL LKQKQYAPMS VAQQSLVLFA AERGYLADVE LSKIGSFEAA LLAYVDRDHA
490 500 510 520 530 540
PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
.end lit
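.para
The layout shown above (blocks of ten residues with the position of the last
residue of each block printed above it) could be produced along the following
lines. This is only a sketch in Python; the function name and the exact column
alignment are assumptions, not taken from the program.
.lit
```python
def list_sequence(seq, line_length=60):
    """Format seq in blocks of 10 with position numbers above each block."""
    if line_length % 10 or not 10 <= line_length <= 120:
        raise ValueError("line length must be 10 to 120 in multiples of 10")
    out = []
    for start in range(0, len(seq), line_length):
        chunk = seq[start:start + line_length]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        # one number per block column, right-aligned over the block's last residue
        numbers = [str(start + 10 * (n + 1)).rjust(10)
                   for n in range(line_length // 10)]
        out.append(" ".join(numbers))
        out.append(" ".join(blocks))
    return "\n".join(out)


print(list_sequence("MQLNSTEISELIKQRIAQFNVVSEAHNEGT", line_length=20))
```
.end lit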
.left margin1
@6. TX 1 @List a text file
.LEFT MARGIN2
.para
Allows the user to have a text file displayed on the screen. It will appear
one page at a time.
.left margin1
@7. TX 1 @Direct output to disk
.LEFT MARGIN2
.para
Used to direct output that would normally appear on the screen to a file.
.para
Select redirection of either text or graphics, and
supply the name of the file that the output should be written to.
.para
The results from the next options selected will not appear on the screen
but will be written to the file. When option 7 is selected again
the file will be
closed and output will again appear on the screen.
.left margin1
@8. TX 1 @Write active region to disk
.LEFT MARGIN2
.para
The program has the capability of reading in EMBL, GENBANK, NBRF, GCG
and Staden formats
and of reversing and complementing sequences. This option allows users
to
write the current active sequence to a disk file in Staden format. Hence
it
allows format conversion and crude sequence cutting.
.left margin1
@9. TX 1 @Edit the sequence
.LEFT MARGIN2
.para
Used to edit sequences or any other files by giving access to the
computer's system editor. For editing sequences the input file should
have already been created using the listing function "list
sequence".
.para
Supply the name of the file to edit. Wait while the system editor is made
ready (can take a while on a VAX). Use the editor. Exit from the editor. If a
sequence has been edited, and you want to process it, affirm that the
sequence should be "made active". The edited sequence will replace the
original sequence.
.para
This editing method is designed to give users access to an editor with
which they are familiar - i.e. the one on their machine, and yet to allow
them to edit a sequence which contains the landmarks they need in
order to know where they are. Users can create files containing simple
listings with numbering, using "list the sequence", and
then edit them with their system editor, using the numbering to know
where they are within the sequence. When the edits are complete they
exit from the editor and the program "analyses" the edited file to extract
only the sequence characters. The permitted set of characters is:
ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz. All permitted characters
found in the file will become part of the sequence; all others are removed.
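.para
The extraction step can be pictured with a few lines of Python. The character
set is the one given above; the function name and the sample listing are
illustrative only.
.lit
```python
PERMITTED = set("ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz")


def extract_sequence(text):
    """Keep only permitted sequence characters; drop numbers, spaces, etc."""
    return "".join(ch for ch in text if ch in PERMITTED)


listing = "        10         20\nMQLNSTEISE LIKQRIAQFN\n"
print(extract_sequence(listing))   # MQLNSTEISELIKQRIAQFN
```
.end lit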
.left margin1
@10. TX 2 @Clear graphics
.LEFT MARGIN2
.para
Clears the screen of both text and graphics.
.left margin1
@11. TX 2 @Clear text
.LEFT MARGIN2
.para
Clears only text from the screen.
.left margin1
@12. TX 2 @Draw a ruler
.LEFT MARGIN2
.para
This option
allows the user to draw a ruler or scale along the x axis of the screen to
help identify the coordinates of points of interest. The user can define
the position of the first amino acid to be marked (for example if the
active
region is 1501 to 8000, the user might wish to mark every 1000th amino
acid
starting at either 1501 or 2000 - it depends if the user wishes to treat
the active region as an independent unit with its own numbering starting
at
its left edge, or as part of the whole sequence). The user can also define
the separation of the ticks on the scale and their height. If required the
labelling routine can be used to add numbers to the ticks.
.left margin1
@13. TX 2 @Use cross hair
.LEFT MARGIN2
.para
This function puts
a steerable cross on the screen that can be used to find the
coordinates of points in the sequence. The user can move the cross
around using the directional keys; when he hits the space bar the
program will print out the coordinates of the cross in sequence units and
the option will be exited.
.para
If instead,
you hit a , the position will be displayed but the cross will remain on
the screen.
.para
If a letter s is hit the sequence around the cross hair is displayed and
the cross remains on the screen.
.left margin1
@14. TX 2 @Reset margins
.LEFT MARGIN2
.para
The position of each plot is defined relative to a user's drawing
board which has size 1-10,000 in x and 1-10,000 in y.
Plots for
each option are drawn in a window defined by x0,y0 and xlength,ylength,
where x0,y0 is the position of the bottom left hand corner of the window,
xlength is the width of the window and ylength is its
height.
.lit
--------------------------------------------------------- 10,000
1 1
1 -------------------------------------- ^ 1
1 1 1 1 1
1 1 1 1 1
1 1 1 ylength 1
1 1 1 1 1
1 1 1 1 1
1 -------------------------------------- v 1
1 x0,y0^ 1
1 <---------------xlength--------------> 1
--------------------------------------------------------- 1
1 10,000
.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "ANALMARG" when the
program is started. Users can have their own file if required.
As all the plots start
at the same position in x and have the same width, x0 and xlength are the
same for all options. Generally users will only want to change the start
level of the window y0 and its height ylength.
This option
allows users to change window positions whilst running the program.
The routine prompts first for the number of the option that the user
wishes
to reposition; then for the y start and height; then for the x start and
length. Note that changes to the x values affect all options. If the user
types only carriage return for any value it will remain unchanged.
The cross-hair can be used to choose suitable heights.
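.para
As an illustration of the window arithmetic described above, the Python sketch
below maps a sequence position onto the x axis of a window and checks that a
window lies on the drawing board. The linear mapping and both function names
are assumptions made for the example; the program's own routines are not
described in this help text.
.lit
```python
def to_board_x(position, region_start, region_end, x0, xlength):
    """Map a sequence position to an x coordinate in drawing board units."""
    span = max(region_end - region_start, 1)   # avoid dividing by zero
    return x0 + xlength * (position - region_start) / span


def window_fits(x0, y0, xlength, ylength, board=10_000):
    """True if the window lies entirely within the 1-10,000 drawing board."""
    return (x0 >= 1 and y0 >= 1
            and x0 + xlength <= board and y0 + ylength <= board)


# Active region 1..101 drawn in a window starting at x0=1000, 8000 units wide
print(to_board_x(51, 1, 101, 1000, 8000))    # 5000.0
```
.end lit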
.LEFT MARGIN1
@15. TX 2 @Label a diagram
.LEFT MARGIN2
.para
This routine allows users to label any diagrams they have produced. They
are asked to type in a label. When the user types carriage return to finish
typing the label the cross-hair appears on the screen. The user can
position it anywhere on the screen. If the user types R (for right justify)
the label will be
written on the diagram with its right end at the cross-hair position.
If the user types L (for left justify) the label will be written on the
diagram with its left end at the cross hair position.
The
cross-hair will then immediately reappear. The user may put the same
label
on another part of the diagram in the same way; on hitting the space bar
the user will be asked whether to type in another label.
.left margin1
@16. TX 2 @Display a map
.LEFT MARGIN2
.para
It is often convenient to plot a map alongside graphed analysis in order
to
indicate features within the sequence. This function allows users to
draw
maps using files arranged in the form of EMBL feature tables. Of course
the
EMBL tables are usually only used for nucleic acid sequence annotation
but,
as long as the features are written in the correct format, they can be
employed by this routine. The map is composed of a line representing the
sequence and then further lines denoting the endpoints of each feature
the
user identifies. The user is asked to define the height at which the line
representing the sequence should be drawn; then for the feature height;
then for the features to plot.
.left margin1
@17. TX 1 5 @Short sequence search
.LEFT MARGIN2
.para
This routine is used to search for exact matches to short sequences. It is
equivalent to the restriction enzyme search in program NIP. It can
either list matches
or present the results graphically.
.PARA
Select from searching, screen clearing or file listing. Choose a file of
strings and the mode of output required.
.para
The files of short
sequences (strings) and their names
need to be arranged in a particular way. For example
.lit
ACID/D/E//
BASIC/R/K/H//
HYDRO/F/L/I/V/Y//
GLYCO/N-S/N-T//
+/R/K/H//
-/D/E//
.end lit
defines various groups of amino acids.
Each string or set of strings must be
preceded by a name, each string must be preceded and
terminated by a slash (/), and
each set of strings must end with 2 slashes. These collections of strings and their
names can be read from disk or entered from the keyboard. Two files
containing sequences are currently
available. One contains named groups of amino acids. The other simply
contains the names of all amino acids and gives a convenient way of
producing a plot of the positions of all the different
amino acids in the sequence.
The user can select strings
by name from these collections. Results can be displayed name by name
or all
together.
Strings entered from the keyboard need to be separated by slash
characters (/).
For the name by name search the output looks like:
.lit
MATCHES= 12
NAME SEQUENCE POSITION FRAGMENT LENGTHS
ACID E 7 7 1
ACID E 10 3 1
ACID E 24 14 1
ACID E 28 4 1
ACID D 36 8 1
ACID D 46 10 2
ACID E 51 5 2
ACID E 67 16 2
ACID D 69 2 2
ACID D 81 12 2
ACID E 84 3 2
ACID E 96 12 3
MATCHES= 10
NAME SEQUENCE POSITION FRAGMENT LENGTHS
BASIC K 13 13 1
BASIC R 15 2 1
BASIC H 26 11 1
BASIC R 40 14 1
BASIC H 42 2 2
BASIC R 59 17 2
BASIC R 68 9 2
BASIC K 87 19 2
BASIC K 89 2 2
BASIC R 93 4 2
MATCHES= 1
NAME SEQUENCE POSITION FRAGMENT LENGTHS
GLYCO NST 4 4 3
.end lit
When the results are ordered only on position the output looks like:
.lit
NAME SEQUENCE POSITION FRAGMENT LENGTHS
GLYCO NST 4 3
ACID E 7 3
ACID E 10 3
BASIC K 13 3
BASIC R 15 2
ACID E 24 9
BASIC H 26 2
ACID E 28 2
ACID D 36 8
BASIC R 40 4
BASIC H 42 2
ACID D 46 4
ACID E 51 5
BASIC R 59 8
.end lit
.LEFT MARGIN2
Graphical output marks the position of each string by a
short vertical line and gives its name at the left end of the
line. If the top of the screen is reached the program gives the user the
opportunity to take a hard copy and then will clear the screen and restart
plotting results at the original start position.
Note that any character in the string
that is not a recognisable protein symbol will be treated as a
wild card character and will match all
characters in the searched sequence.
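.para
The string file format and the wild card rule can be sketched in Python as
follows. The set of "recognisable protein symbols" used here and both function
names are assumptions made for illustration; the program may recognise a
different alphabet.
.lit
```python
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWXYZ")   # assumed alphabet


def parse_string_file(lines):
    """Parse lines such as 'ACID/D/E//' into {'ACID': ['D', 'E']}."""
    groups = {}
    for line in lines:
        line = line.strip()
        if not line.endswith("//"):
            continue                      # not a complete set of strings
        name, *strings = line[:-2].split("/")
        groups[name] = [s for s in strings if s]
    return groups


def find_matches(sequence, string):
    """Return 1-based positions where string matches exactly.

    Any character of string that is not a protein symbol is a wild card.
    """
    sequence, string = sequence.upper(), string.upper()
    hits = []
    for i in range(len(sequence) - len(string) + 1):
        window = sequence[i:i + len(string)]
        if all(s not in PROTEIN_SYMBOLS or s == w
               for s, w in zip(string, window)):
            hits.append(i + 1)
    return hits


groups = parse_string_file(["ACID/D/E//", "GLYCO/N-S/N-T//"])
print(find_matches("MQLNSTEISE", "N-T"))   # [4]
```
.end lit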
.para
Typical dialogue follows.
.lit
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=17
Search for short sequences
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =2
1 All acids
X 2 Named groups
3 Personal file
4 Keyboard
? 0,1,2,3,4 =
ACID/D/E//
BASIC/R/K/H//
HYDRO/F/L/I/V/Y//
GLYCO/N-S/N-T//
+/R/K/H//
-/D/E//
DIBASIC/RR/KK/RK/KR//
TURN/N/D/G/P/S//
BLOCK/A/Q/E/I/L/M/F/W/V//
INDIF/R/C/H/K/T/Y//
End of file
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All acids
X 2 Named groups
3 Personal file
4 Keyboard
? 0,1,2,3,4 =
? (y/n) (y) All names n
? Name=acid
? Name=basic
? Name=glyco
? Name=
? (y/n) (y) Show results name by name
? (y/n) (y) List matches
searching
matches= 59
NAME SEQUENCE POSITION FRAGMENT LENGTHS
ACID E 7 7 1
ACID E 10 3 1
ACID E 24 14 1
ACID E 28 4 1
ACID D 36 8 1
ACID D 46 10 2
ACID E 51 5 2
ACID E 67 16 2
ACID D 69 2 2
ACID D 81 12 2
ACID E 84 3 2
ACID E 96 12 3
ACID D 116 20 3
... etc
matches= 61
NAME SEQUENCE POSITION FRAGMENT LENGTHS
BASIC K 13 13 1
BASIC R 15 2 1
BASIC H 26 11 1
BASIC R 40 14 1
BASIC H 42 2 2
BASIC R 59 17 2
...etc
matches= 2
NAME SEQUENCE POSITION FRAGMENT LENGTHS
GLYCO NST 4 4 3
GLYCO NQT 487 483 28
28 483
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All acids
X 2 Named groups
3 Personal file
4 Keyboard
? 0,1,2,3,4 =
? (y/n) (y) Selected names
? Name=basic
? Name=glyco
? Name=
? (y/n) (y) Show results name by name n
? (y/n) (y) List matches
searching
NAME SEQUENCE POSITION FRAGMENT LENGTHS
GLYCO NST 4 3
BASIC K 13 9
BASIC R 15 2
BASIC H 26 11
BASIC R 40 14
BASIC H 42 2
BASIC R 59 17
BASIC R 68 9
BASIC K 87 19
...etc
BASIC R 477 14
BASIC H 479 2
GLYCO NQT 487 8
BASIC K 499 12
BASIC K 501 2
BASIC K 508 7
7
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All acids
X 2 Named groups
3 Personal file
4 Keyboard
? 0,1,2,3,4 =4
Define search strings by typing a string name
followed by the string(s)
? Name=MARY
? String(s)=AL/VI
? Name=
? (y/n) (y) All names
? (y/n) (y) Show results name by name
? (y/n) (y) List matches
searching
matches= 12
NAME SEQUENCE POSITION FRAGMENT LENGTHS
MARY VI 38 38 10
MARY AL 63 25 13
MARY VI 136 73 16
MARY AL 177 41 19
MARY AL 217 40 25
MARY AL 233 16 37
MARY AL 243 10 40
MARY AL 256 13 41
MARY AL 326 70 45
MARY VI 345 19 51
MARY AL 396 51 70
MARY AL 470 74 73
.END LIT
.left margin1
@18. TX 1 5 @Compare a sequence
.LEFT MARGIN2
.para
This routine slides a short sequence along the current sequence and finds
all positions at which a given percentage of the amino acids match.
Output is in both graphical and listed forms.
.para
If users call for dialogue when the routine is selected they will be given
the choice of keyboard or file input. Define the string, and the percentage
match. Matches will be plotted out and then the user can select to have
them listed. Then the routine cycles around.
.para
The routine slides the search string
along the sequence and marks the positions at which a minimum
percentage score is reached. The graphical output draws a vertical line at
the match position; the height of the line represents the percentage
score,
so that if the line reaches the top of the box the score is 100%.
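.para
The sliding comparison can be sketched in Python as below. The function name is
invented, and the way the program rounds the percentage when deciding whether a
position reaches the cutoff is not stated here, so the simple comparison used in
the sketch is an assumption.
.lit
```python
def percent_matches(sequence, string, min_percent=70.0):
    """Slide string along sequence; report (position, identities, percent)
    for every 1-based position reaching min_percent identical residues."""
    sequence, string = sequence.upper(), string.upper()
    n = len(string)
    if n == 0:
        return []
    hits = []
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        identical = sum(a == b for a, b in zip(window, string))
        percent = 100.0 * identical / n
        if percent >= min_percent:
            hits.append((i + 1, identical, percent))
    return hits


for position, identical, percent in percent_matches("AIALA", "aaa", 60.0):
    print(position, identical, round(percent, 1))
```
.end lit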
.para
Typical dialogue follows.
.lit
? Menu or option number=18
Find percentage matches
? (y/n) (y) Keep picture
? String=aaa
? Percent match (1.00-100.00) (70.00) =
missing graphics
Total scoring positions above 70.000 percent = 19
Scores 2 2 2 2 2 2 2 2 2 2
Positions 61 131 177 217 226 231 232 267 300 301
? Number to list (0-19) (0) =3
61
AIA
* *
aaa
1
131
AIA
* *
aaa
1
177
ALA
* *
aaa
1
? (y/n) (y) Keep picture n
Default String=aaa
? String=!
.end lit
.left margin1
@19. TX 1 5 @Compare a sequence using a score matrix
.LEFT MARGIN2
.para
This routine slides a short sequence along the current sequence and finds
all positions at which a given level of similarity (a cutoff score) is
reached. The score is defined by use of a score matrix (MDM78). Output is
in both graphical and listed forms.
.para
If users call for dialogue when the routine is selected they will be given
the choice of keyboard or file input. Define the string and the cutoff
score. Matches will be plotted out and then the user can select to have
them listed. Then the routine cycles around.
.para
The routine slides the search string
along the sequence and marks the positions at which the cutoff score
is achieved. The graphical output draws a vertical line at
the match position; the height of the line represents the score,
so that if the line reaches the top of the box the score is the maximum
possible.
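.para
The same sliding scheme with a score matrix can be sketched in Python as
follows. The MDM78 values are not reproduced in this help text, so the tiny
table below is invented purely to make the example run; the function names are
also hypothetical.
.lit
```python
# Invented toy scores, NOT the MDM78 matrix
TOY_SCORES = {("A", "A"): 12, ("A", "S"): 11, ("A", "G"): 11, ("A", "T"): 11}


def toy_score(a, b):
    """Symmetric lookup with an arbitrary default for pairs not listed."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), 4))


def matrix_matches(sequence, string, score, cutoff):
    """Return (position, total score) for each 1-based position whose
    summed pairwise score reaches the cutoff."""
    sequence, string = sequence.upper(), string.upper()
    n = len(string)
    hits = []
    for i in range(len(sequence) - n + 1):
        window = sequence[i:i + n]
        total = sum(score(a, b) for a, b in zip(window, string))
        if total >= cutoff:
            hits.append((i + 1, total))
    return hits


print(matrix_matches("GAASL", "aaa", toy_score, 34))   # [(1, 35), (2, 35)]
```
.end lit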
.para
Typical dialogue follows.
.lit
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=19
Find matches using a score matrix
? (y/n) (y) Keep picture
? String=aaa
Minimum score= 12 Maximum score= 36
? Score (12-36) (36) =
missing graphics
For score 24 the number of matches= 507
scores 35 35 35 34 34 34 34 34 34 34
positions 226 231 379 112 133 202 227 267 378
380
? Number to list (0-507) (0) =3
226
ATA
* *
aaa
1
231
SAA
**
aaa
1
379
GAA
**
aaa
1
? (y/n) (y) Keep picture n
Default String=aaa
? String=!
.end lit
.left margin1
@20. TX 5 @Search for a motif using a weight matrix
.LEFT MARGIN2
.para
This function performs searches for short sequence
motifs using an appropriate weight matrix. In addition it can be used to
create or modify weight matrices. In order to perform a search the only
input
required is the name of the file containing the weight matrix.
The results can be presented graphically or listed. The graphical
presentation will draw a line at the position of any matches found; the
height of the line is proportional to the score.
.para
For a search, select "use weight matrix", supply the name of the file
containing the weight matrix, and choose between having results plotted
or listed. If dialogue is requested when the function is selected users can
alter the cutoff score employed.
.para
To create a weight matrix several steps are involved. A file containing an
alignment of known motifs is required. (This file must be created before
the current option is selected. The format is as follows: each sequence is
written on a separate line with at least one space at the beginning; each
sequence is terminated by a space character, and can be followed by a
name. The sequences must be aligned.) Supply the name of the file of
aligned sequences. The program reads and displays the sequences. Choose
between "summing logs of weights" or summing weights (i.e. whether to
multiply or add weights). If logs are used all scores will be negative.
Choose if all positions in the set of aligned sequences should be used or
if a mask should be applied. If so selected, define a mask as a string of
symbols, in which symbol - means ignore and any other symbol means
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
.para
The program will calculate weights as the frequencies of each amino
acid at each unmasked position in the set of aligned sequences. These
weights are then applied to the set of aligned sequences to give a range
of "observed" scores. The mean and standard deviation of these scores are
displayed. The user is asked to supply several values to be used when the
weight matrix is applied to other sequences: a cutoff score (by default,
the mean minus 3 standard deviations), a top score for scaling graphical
results (by default, the mean plus 3 standard deviations), and a position
to identify (this means that if a particular amino acid within the motif
is used as a "landmark", such as the G of the helix-turn-helix motif, then
its position will be marked in plots). All these values are stored along
with the weight matrix. Finally supply the name of a file to contain the
weight matrix.
.para
Weight matrices can be "rescaled" using a set of aligned sequences in
much the same way as a matrix is created. The purpose is to redefine
the cutoff scores, and rescaling does not alter any other values in the
weight matrix file.
.para
The methods have changed considerably but were first outlined in
Staden, R. Nucl. Acids Res. 12 505-519 1984, and
Staden, R. Genetic
engineering: principles and methods vol 7, Edited by J.K. Setlow and A.
Hollaender, Plenum publishing corp., 1985.
.para
The methods have always had to deal with the problem of zeroes in the
matrices. The current versions
employ "Laplace's Law of Succession" in which 1 is
added to each term.
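.para
The calculation described in this section (frequencies at each unmasked
position, 1 added to each term, and scores formed by summing weights or their
logs) can be sketched in Python as below. The alphabet, the function names, the
use of natural logs and the treatment of unknown symbols are all assumptions
made for the example; the program's own scaling may differ.
.lit
```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"   # assumed 20-letter amino acid alphabet


def mask_positions(mask):
    """0-based positions to use: '-' means ignore, anything else means use."""
    return [i for i, symbol in enumerate(mask) if symbol != "-"]


def make_weights(aligned, mask=None):
    """Frequency of each amino acid at each used position, with 1 added
    to every count so that no weight is zero."""
    length = len(aligned[0])
    used = mask_positions(mask) if mask else list(range(length))
    weights = {}
    for pos in used:
        counts = {aa: 1 for aa in ALPHABET}
        for seq in aligned:
            if seq[pos] in counts:
                counts[seq[pos]] += 1
        total = sum(counts.values())
        weights[pos] = {aa: n / total for aa, n in counts.items()}
    return weights


def score_window(window, weights, sum_logs=True):
    """Score one window of the same length as the motif."""
    total = 0.0
    for pos, column in weights.items():
        # symbols outside the assumed alphabet get the smallest weight
        w = column.get(window[pos], min(column.values()))
        total += math.log(w) if sum_logs else w
    return total


weights = make_weights(["AC", "AC", "AD"])
print(score_window("AC", weights) < 0)   # True: summed logs are negative
```
.end lit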
.para
It is now possible to apply a mask to a set of aligned sequences in
order to give weight to selected positions only.
Sequences have superimposed functions: some parts may be of general
structural
importance and give rise to an overall framework, and other parts give
specificity and hence are not common; we may want to use a set of
aligned
sequences to define a motif, but want to use only the framework
positions.
Alternatively we may want to pick out
only those parts of a set of aligned sequences that give a particular
property, and to ignore other similarities that are due to some other
property
and which could obscure the pattern
we are interested in. The ability to define a mask allows certain
positions
to be used in the motif and others to be ignored, and yet still permits the
use of a set of aligned sequences to calculate weights.
.para
Typical dialogue is shown below.
.lit
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =2
? Name of aligned sequences file=[rs.motifs]hth.seq
1 QESVADKMGMGQSGVGALFN LAMBDA.REP
2 QTKTAKDLGVYQSAINKAIH LAMBDA.CRO
3 QAALGKMVGVSNVAISQWQR P22.REP
4 QRAVAKALGISDAAVSQWKE P22.CRO
5 QAELAQKVGTTQQSIEQLEN 434.REP
6 QTELATKAGVKQQSIQLIEA 434.CRO
7 RQEIGQIVGCSRETVGRILK CAP
8 RGDIGNYLGLTVETISRLLG Fnr
9 LYDVAEYAGVSYQTVSRVVN LAC.R
10 IKDVARLAGVSVATVSRVIN GAL.R
11 TEKTAEAVGVDKSQISRWKR LAMBDA.CII
12 QRKVADALGINESQISRWKG P22.CI
13 KEEVAKKCGITPLQVRVWCN MAT.ALPHA
14 TRKLAQKLGVEQPTLYWHVK TETR.TN10
15 TRRLAERLGVQQPALYWHFK TETR.pSC1
16 QRELKNELGAGIATITRGSN TRP.REP
17 RQQLAIIFGIGVSTLYRYFP H-INVERSN
18 ATEIAHQLSIARSTVYKILE TN3.RESOL
19 ASHISKTMNIARSTVYKVIN GD.RESOLV
20 IASVAQHVCLSPSRLSHLFR ARA.C
21 RAEIAQRLGFRSPNAAEEHL LEX.R
Length of motif 20
? (y/n) (y) Sum logs of weights
? (y/n) (y) Use all motif positions n
x means use, - means ignore
e.g. xx-x---x-x means use positions 1,2,4,8,10
? Mask=--xxxxxxxxxxxx------
Applying weights to input sequences
1 -57.143 QESVADKMGMGQSGVGALFN
2 -55.087 QTKTAKDLGVYQSAINKAIH
3 -58.079 QAALGKMVGVSNVAISQWQR
4 -54.986 QRAVAKALGISDAAVSQWKE
5 -55.181 QAELAQKVGTTQQSIEQLEN
6 -55.874 QTELATKAGVKQQSIQLIEA
7 -56.692 RQEIGQIVGCSRETVGRILK
8 -57.722 RGDIGNYLGLTVETISRLLG
9 -55.363 LYDVAEYAGVSYQTVSRVVN
10 -55.769 IKDVARLAGVSVATVSRVIN
11 -56.786 TEKTAEAVGVDKSQISRWKR
12 -55.833 QRKVADALGINESQISRWKG
13 -56.279 KEEVAKKCGITPLQVRVWCN
14 -53.125 TRKLAQKLGVEQPTLYWHVK
15 -55.833 TRRLAERLGVQQPALYWHFK
16 -58.651 QRELKNELGAGIATITRGSN
17 -56.749 RQQLAIIFGIGVSTLYRYFP
18 -56.986 ATEIAHQLSIARSTVYKILE
19 -60.618 ASHISKTMNIARSTVYKVIN
20 -58.988 IASVAQHVCLSPSRLSHLFR
21 -58.002 RAEIAQRLGFRSPNAAEEHL
Top score -53.125 Bottom score -60.618
Mean -56.655 Standard deviation 1.617
Mean minus 3.sd -61.505 Mean plus 3.sd -51.804
? Cutoff score (-999.00-9999.00) (-61.51) =
? Top score for scaling plots (-61.51-999.00) (-51.80) =
? Position to identify (0-20) (1) =9
? Title=hth
? Name for new weight matrix file=1.wts
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=1.wts
hth
? (y/n) (y) Use frequencies as weights
? (y/n) (y) Plot results n
5 -61.46 STEISELIKQRIAQFNVVSE
13 -58.93 KQRIAQFNVVSEAHNEGTIV
21 -60.42 VVSEAHNEGTIVSVSDGVIR
57 -59.39 GNRYAIALNLERDSVGAVVM
59 -61.47 RYAIALNLERDSVGAVVMGP
79 -59.90 YADLAEGMKVKCTGRILEVP
88 -61.41 VKCTGRILEVPVGRGLLGRV
104 -60.38 LGRVVNTLGAPIDGKGPLDH
127 -60.13 SAVEAIAPGVIERQSVDQPV
129 -59.91 VEAIAPGVIERQSVDQPVQT
133 -60.79 APGVIERQSVDQPVQTGYKA
139 -61.12 RQSVDQPVQTGYKAVDSMIP
175 -58.90 KTALAIDAIINQRDSGIKCI
191 -60.95 IKCIYVAIGQKASTISNVVR
195 -60.94 YVAIGQKASTISNVVRKLEE
215 -60.66 HGALANTIVVVATASESAAL
254 -60.56 EDALIIYDDLSKQAVAYRQI
260 -60.08 YDDLSKQAVAYRQISLLLRR
297 -61.00 LLERAARVNAEYVEAFTKGE
314 -61.29 KGEVKGKTGSLTALPIIETQ
330 -60.49 IETQAGDVSAFVPTNVISIT
363 -57.63 GIRPAVNPGISVSRVGGAAQ
365 -61.48 RPAVNPGISVSRVGGAAQTK
371 -61.02 GISVSRVGGAAQTKIMKKLS
382 -57.90 QTKIMKKLSGGIRTALAQYR
394 -60.07 RTALAQYRELAAFSQFASDL
424 -59.95 GQKVTELLKQKQYAPMSVAQ
430 -58.89 LLKQKQYAPMSVAQQSLVLF
432 -61.14 KQKQYAPMSVAQQSLVLFAA
438 -58.58 PMSVAQQSLVLFAAERGYLA
458 -61.06 DVELSKIGSFEAALLAYVDR
466 -61.00 SFEAALLAYVDRDHAPLMQE
483 -60.48 MQEINQTGGYNDEIEGKLKG
494 -60.61 DEIEGKLKGILDSFKATQSW
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=d20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=1.wts
hth
? (y/n) (y) Use frequencies as weights
? Cutoff score (-9999.00-9999.00) (-61.51) =-56.
? (y/n) (y) Plot results n
.end lit
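.para
The search step shown in the dialogue above can be sketched in Python as
a window slid along the sequence. The function and its parameters are
hypothetical, the toy weights are invented, and the score given to a
residue missing from a column is an assumption. As in the listing above,
the position reported first is the start of the window; the landmark
position is returned separately.
.lit
```python
def scan(sequence, columns, cutoff, landmark=1, missing=0.0):
    """Return (start, landmark_position, score) for windows scoring >= cutoff.

    columns  -- one dict per motif position mapping residue -> weight
    landmark -- 1-based position within the motif to identify
    missing  -- weight for a residue absent from a column (assumption)
    """
    width = len(columns)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, missing) for col, res in zip(columns, window))
        if score >= cutoff:
            # Positions are reported 1-based, as in the program's listings.
            hits.append((start + 1, start + landmark, score))
    return hits


toy = [{"G": 2, "A": 1}, {"V": 2, "I": 1}]  # hypothetical additive weights
print(scan("MGVAIGV", toy, cutoff=2, landmark=2))
```
.end lit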
.left margin1
@21. TX 3 @Calculate amino acid composition
.LEFT MARGIN2
.para
This function calculates the amino acid composition and molecular
weight
for the active region.
.lit
? Menu or option number=21
Sequence composition
A     C     S     T     P     A     G     N     D     E     Q     B     Z     H
N    3.   32.   23.   18.   57.   47.   16.   28.   31.   28.    0.    0.    7.
%   0.6   6.2   4.5   3.5  11.1   9.1   3.1   5.4   6.0   5.4   0.0   0.0   1.4
W  309. 2786. 2325. 1748. 4051. 2682. 1826. 3222. 4003. 3588.    0.    0.  960.
A     R     K     M     I     L     V     F     Y     W     -     X     ?
N   30.   24.   11.   40.   47.   41.   14.   15.    1.    0.    0.    0.    1.
%   5.8   4.7   2.1   7.8   9.1   8.0   2.7   2.9   0.2   0.0   0.0   0.0   0.2
W 4686. 3076. 1443. 4527. 5319. 4065. 2060. 2448.  186.    0.    0.    0.    0.
Total molecular weight= 55328.
.end lit
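.para
The calculation can be sketched in Python as follows. The mass table is
deliberately limited to four residues with well known average residue
masses, and the names are invented for the example. In the listing above
the W columns sum to 55310 and the reported total is 55328, which would be
consistent with adding one water molecule to the summed residue masses;
the sketch makes that assumption.
.lit
```python
from collections import Counter

# Average residue masses (amino acid minus water) for four residues only;
# a full table would cover the whole character set.
RESIDUE_MASS = {"G": 57.05, "A": 71.08, "S": 87.08, "C": 103.14}
WATER = 18.02


def composition(sequence):
    """Return {residue: (count, percent, summed residue mass)}."""
    counts = Counter(sequence)
    total = len(sequence)
    return {
        res: (n, 100.0 * n / total, n * RESIDUE_MASS[res])
        for res, n in counts.items()
    }


def molecular_weight(sequence):
    """Sum of residue masses plus one water (assumed, see the text above)."""
    return sum(RESIDUE_MASS[res] for res in sequence) + WATER


for res, (n, pct, mass) in sorted(composition("GGAS").items()):
    print(f"{res} {n:3d} {pct:5.1f} {mass:8.2f}")
print(f"Total molecular weight= {molecular_weight('GGAS'):.2f}")
```
.end lit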
.left margin1
@22. TX 3 4 @Plot hydrophobicity
.LEFT MARGIN2
.para
This routine plots the hydrophobicity of each section of the sequence
using
the hydrophobicity
values of Kyte and Doolittle (J. Mol. Biol. 157, 105-132 (1982)).
A window of size span is slid along the sequence and a sum calculated
for
each position.
.para
If dialogue is requested select a span length and a plot interval.
.para
The diagrams are on the same scale as Fig. 6 of the Kyte and Doolittle
paper and values of + and - 50 could be assigned to the top and bottom of
the diagram with corresponding values in between (-40,-20,0,20,40 are
shown
in the paper).
.lit
? Menu or option number=d22
Plot hydrophobicity
? odd span length (1-101) (11) =
? plot interval (1-101) (3) =
missing graphics
.end lit
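.para
A Python sketch of the sliding window sum is given below. The table holds
the published Kyte and Doolittle values. Treating the plot interval as the
step between plotted points, and assigning each sum to the centre of its
window, are assumptions made for the example.
.lit
```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def hydropathy_profile(sequence, span=11, interval=3):
    """Return (centre_position, window_sum) pairs; positions are 1-based."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    half = span // 2
    profile = []
    for centre in range(half, len(sequence) - half, interval):
        window = sequence[centre - half:centre + half + 1]
        profile.append((centre + 1, sum(KYTE_DOOLITTLE[r] for r in window)))
    return profile


print(hydropathy_profile("MQLNSTEISELIKQRIAQFN"))
```
.end lit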
.LEFT MARGIN1
@23. TX 3 4 @Plot charge
.LEFT MARGIN2
.para
This routine plots the charge of each section of the sequence.
A window of size span is slid along the sequence and a sum calculated
for
each position. Amino acids are assigned charges of 1, -1 or 0.
.para
If dialogue is requested select a span length and a plot interval.
.para
Typical dialogue follows.
.lit
? Menu or option number=d23
Plot charge
? odd span length (1-101) (11) =
? plot interval (1-101) (3) =
missing graphics
.end lit
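.para
The same windowing can be sketched for charge. The text above states only
that residues are assigned 1, -1 or 0; which residues receive which value
is an assumption here (K and R positive, D and E negative, all others
zero), as are the function name and the use of the window centre.
.lit
```python
CHARGE = {"K": 1, "R": 1, "D": -1, "E": -1}  # assumed; all others count 0


def charge_profile(sequence, span=11, interval=3):
    """Return (centre_position, summed_charge) pairs; positions are 1-based."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    half = span // 2
    profile = []
    for centre in range(half, len(sequence) - half, interval):
        window = sequence[centre - half:centre + half + 1]
        profile.append((centre + 1, sum(CHARGE.get(r, 0) for r in window)))
    return profile


print(charge_profile("MQLNSTEISELIKQRIAQFN", span=5, interval=1))
```
.end lit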
.LEFT MARGIN1
@24. TX 4 @Plot Robson prediction
.LEFT MARGIN2
.para
This routine uses the method of Garnier J, Osguthorpe D J, and Robson B.
(1978) J. Mol. Biol. 120, 97-120 to predict secondary structures. The
method divides protein secondary structures into 4 classes: helix,
extended
(usually referred to as sheet), turn and coil. The routine calculates the
likelihood that each segment of the sequence lies in each of these
classes. Results are presented graphically or listed.
.para
If dialogue is requested choose between plotted or listed output.
.para
Each residue has a certain probability of being found in each of the 4
classes. This probability depends both on its own amino acid type and on
the types of the 8 amino acids found on either side of it along the
protein chain. Four tables of weights, each of 20 by 17 elements, are used
to calculate the likelihood that each residue along the chain falls into
each of the four classes of structure. The most likely structure at each
point is the one with the highest score.
The four values are plotted in strips labelled H, E, T and C.
Below, a strip labelled D for decision is divided into four levels, each
corresponding to one of the four structure types. Their top to bottom
order is the same as that for the strips above, i.e. C, T, E, and H. For
each residue the program determines which of the four likelihoods is
highest. It places a single dot at the mid-point of the corresponding
strip, and also at the appropriate level in the strip labelled D.
.PARA
It should be noted that the method, when tested by Kabsch W and Sander C
(1983) FEBS Lett. 155, 179-182, although one of the better ones, was
correct for only about 56% of residues.
.para
Typical dialogue follows.
.lit
? Menu or option number=d24
Plot Robson secondary structure predictions
? (y/n) (y) Plot results n
9 S 217 -7 -39 15
10 E 226 5 -27 -39
11 L 233 -7 -26 -15
12 I 229 -23 9 4
13 K 214 -8 10 -8
14 Q 178 42 19 5
15 R 131 54 16 3
16 I 86 42 -31 -23
17 A 55 52 -30 -15
18 Q 15 67 4 25
19 F -34 86 47 74
20 N -41 74 17 106
21 V -16 118 -5 100
22 V 64 88 5 115
23 S 96 38 26 155
24 E 133 -25 13 96
25 A 118 -98 25 100
26 H 110 -150 37 86
27 N 57 -201 37 66
28 E 51 -140 11 -4
29 G 2 -77 37 9
30 T 2 28 28 7
31 I -11 117 -21 22
32 V -23 178 -55 5
33 S -54 193 -14 35
34 V -46 123 5 30
35 S -54 53 51 80
36 D -60 1 86 55
37 G -66 8 57 49
38 V -1 128 -30 -5
39 I 11 212 -56 -33
40 R 16 204 -44 -57
...etc
.end lit
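.para
The decision step, taking the class with the highest of the four values,
can be sketched in Python using rows from the listing above. The column
order H, E, T, C is an assumption based on the order in which the strips
are named in the text; the listing itself carries no column headings.
.lit
```python
STATES = "HETC"  # assumed column order of the four scores in the listing


def decide(scores):
    """Return the class with the highest score (the first wins on a tie)."""
    best = max(range(len(STATES)), key=lambda i: scores[i])
    return STATES[best]


# (position, residue, four scores) copied from the listing above.
rows = [
    (9, "S", (217, -7, -39, 15)),
    (19, "F", (-34, 86, 47, 74)),
    (20, "N", (-41, 74, 17, 106)),
    (36, "D", (-60, 1, 86, 55)),
]
for position, residue, scores in rows:
    print(position, residue, decide(scores))
```
.end lit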
.LEFT MARGIN1
@26. TX 4 @Draw a helix wheel
.LEFT MARGIN2
.para
A helical representation of segments of the sequence is shown. The
display
includes a schematic of the helix showing the links between residues,
with
each vertex numbered according to position; the sequence element at
each
vertex; a symbol denoting a classification as hydrophobic(.), positively
charged(+), negatively charged(-), or otherwise( ). The
residue number of the first sequence element in
the current window is displayed at the top-left-hand
corner of the diagram. Also at the top-left corner the sequence in the
current window is listed. Below this is the total hydrophobicity and
hydrophobic moment for the window calculated according to Eisenberg et
al
J. Mol. Biol. 179, 125-142 (1984).
.para
If dialogue is requested the user is asked for the angle to define the turn
between residues as seen
looking along the helix, and a window length. The window length can be up
to 60, with default 18, and the angle has a default of 100 degrees. Note
that 18 x 100 is 5 turns. When the option is selected the first segment in
the current active region is displayed then the bell rings. If the user
types only return, the display will click on by one residue; if another
number is typed, say N, then the display will click forwards (or
backwards
if N is negative) by N residues. If the wheel runs off either end of the
sequence the option will be exited.
.para
Typical dialogue follows.
.lit
? Menu or option number=d26
? Angle (1-130) (100) =
? Window (1-60) (18) =
missing graphics
.end lit
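.para
The geometry of the wheel can be checked with a few lines of Python. The
function names are invented for the example. With the defaults, 18
residues at 100 degrees make exactly 5 turns, and every residue falls on
its own spoke of the wheel.
.lit
```python
def wheel_angles(window=18, angle=100):
    """Angle in degrees of each residue as seen looking along the helix."""
    return [(i * angle) % 360 for i in range(window)]


def turns(window=18, angle=100):
    """Number of complete turns made by `window` residues."""
    return window * angle / 360


print(turns())                     # 5.0
print(wheel_angles(window=5))      # [0, 100, 200, 300, 40]
print(len(set(wheel_angles())))    # 18 distinct spokes
```
.end lit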
.left margin1
@25. TX 3 4 @Plot hydrophobic moment
.LEFT MARGIN2
.para
This routine plots hydrophobic moment and hydrophobicity according to
Eisenberg et al
J. Mol. Biol. 179, 125-142 (1984). The mean hydrophobicity per residue in
the window is plotted on a scale -1.0 to 1.5, and the mean hydrophobic
moment per residue on a scale 0.0 to 1.5.
The hydrophobicity is shown in the top frame with the
hydrophobic moment below.
The plot is arranged so that the
value shown at position x represents the mean value for residues
x-window+1 to x, where window is the window length.
.para
If dialogue is requested the user can select a window
length, and the angle used for the hydrophobic moment
calculation.
.para
Note that according to Eisenberg et al, in transmembrane proteins an
"initiator" is required. This is either a very hydrophobic single helix
with <H> >= 0.68, or a moderately hydrophobic pair of helices whose <H>
values sum to >= 1.1. Other helices are then accepted as transmembrane if
their <H> >= 0.42.
.para
The following rules are claimed: if <H> < 0.51 and points lie below the
line <M> = -0.392 + 0.603 x <H> they are "globular", if they lie above this
line they are "surface". If <H> > 0.51 and they lie below the line <M> =
0.6 - 0.342 x <H> they are "monomeric", if above "multimeric".
.para
Typical dialogue follows.
.lit
? Menu or option number=d25
? Angle (1-130) (100) =
? Window (1-60) (18) =
? Plot interval (1-101) (3) =
missing graphics
.end lit
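.para
A Python sketch of the window calculation follows. The hydrophobic moment
is taken as the length of the vector sum of the residue hydrophobicities,
each placed at its angle around the helix, divided by the window length.
The per-residue values are passed in as a list so that no particular scale
is assumed, and the function names are invented for the example. Values
are reported at the last residue of each window, as described above.
.lit
```python
import math


def window_means(hydro, start, window=18, angle=100):
    """Mean hydrophobicity <H> and mean moment <M> for one window.

    hydro -- per-residue hydrophobicity values; start is a 0-based index.
    """
    values = hydro[start:start + window]
    delta = math.radians(angle)
    s = sum(h * math.sin(i * delta) for i, h in enumerate(values))
    c = sum(h * math.cos(i * delta) for i, h in enumerate(values))
    n = len(values)
    return sum(values) / n, math.hypot(s, c) / n


def moment_profile(hydro, window=18, angle=100):
    """Triples (x, <H>, <M>), x being the 1-based last residue of a window."""
    return [
        (start + window, *window_means(hydro, start, window, angle))
        for start in range(len(hydro) - window + 1)
    ]


# A uniformly hydrophobic stretch has no preferred face: <M> is close to 0.
uniform = moment_profile([1.0] * 18)
# Values that follow the helical period give a large moment.
periodic = moment_profile([math.cos(math.radians(100 * i)) for i in range(18)])
print(uniform, periodic)
```
.end lit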
.left margin1
@27. TX 1 @Back translate to DNA
.LEFT MARGIN2
.para
This routine back translates protein sequences into DNA using the
standard
genetic code. The level of redundancy can be plotted and the
backtranslation saved to a file.
.para
The translation can use either the IUB symbols shown below, or a set of
codon preferences. If a set of codon preferences is used it must conform
to the format of codon tables produced by the nucleotide analysis
program, and the back translation
will contain the favoured codons. If there is no favoured codon
the IUB symbols will be employed. The window length for
plotting the redundancy is in codons.
.para
The program will plot the redundancy along the sequence and hence can
be
used to find the best sequences to use as primers. Note that the program
plots the inverse, and so the higher the
plot the LESS redundant the sequence. For primers look for peaks rather
than
troughs.
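.para
The exact quantity plotted is not specified here, so the following Python
sketch shows only one plausible measure: the mean, over a window of
codons, of 1 divided by the number of codons for each amino acid in the
standard genetic code. Higher values then mean less redundancy, in line
with the note above. The function name and the choice of measure are
assumptions.
.lit
```python
# Number of codons per amino acid in the standard genetic code.
CODONS = {
    "L": 6, "S": 6, "R": 6,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "I": 3,
    "C": 2, "D": 2, "E": 2, "F": 2, "H": 2, "K": 2, "N": 2, "Q": 2, "Y": 2,
    "M": 1, "W": 1,
}


def inverse_redundancy(protein, window=5):
    """Return (start, value) pairs; higher values mean less redundancy."""
    values = [1.0 / CODONS[res] for res in protein]
    return [
        (start + 1, sum(values[start:start + window]) / window)
        for start in range(len(values) - window + 1)
    ]


for start, value in inverse_redundancy("MWLLSR", window=2):
    print(start, round(value, 3))
```
.end lit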
.para
The DNA sequence can be saved to a file and analysed using the nucleotide
analysis program.
Depending on the application it is often useful to produce two back
translations, one using a table of codon preferences and one using the IUB
symbols. This is because the restriction enzyme search program can
distinguish between definite and possible cuts in the sequence.
"Definite matches" are ones in which the specification of the recognition
sequence corresponds exactly to that of the back translation. "Possible
matches" are ones that depend on the particular codons chosen for each
amino acid. These are sites at which recognition sequences could be
engineered to produce a cut in the DNA without changing the amino acid,
but which are not necessarily found in the original sequence.
.LIT
NC-IUB SYMBOLS
A,C,G,T
R (A,G) 'puRine'
Y (T,C) 'pYrimidine'
W (A,T) 'Weak'
S (C,G) 'Strong'
M (A,C) 'aMino'
K (G,T) 'Keto'
H (A,T,C) 'not G'
B (G,C,T) 'not A'
V (G,A,C) 'not T'
D (G,A,T) 'not C'
N (G,A,C,T) 'aNy'
.end lit
.para
Typical dialogue follows.
.lit
? Menu or option number=d27
Back translate
? (y/n) (y) No codon preference
? (y/n) (y) Plot redundancy n
? (y/n) (y) Save DNA to disk
? File name for DNA sequence=tt:
ATGCARYTNAAYWSNACNGARATHWSNGARYTNATHAARCARMGNATHGCNCARTTYAAY
GTNGTNWSNGARGCNCAYAAYGARGGNACNATHGTNWSNGTNWSNGAYGGNGTNATHMGN
ATHCAYGGNYTNGCNGAYTGYATGCARGGNGARATGATHWSNYTNCCNGGNAAYMGNTAY
GCNATHGCNYTNAAYYTNGARMGNGAYWSNGTNGGNGCNGTNGTNATGGGNCCNTAYGCN
GAYYTNGCNGARGGNATGAARGTNAARTGYACNGGNMGNATHYTNGARGTNCCNGTNGGN
MGNGGNYTNYTNGGNMGNGTNGTNAAYACNYTNGGNGCNCCNATHGAYGGNAARGGNCCN
YTNGAYCAYGAYGGNTTYWSNGCNGTNGARGCNATHGCNCCNGGNGTNATHGARMGNCAR
WSNGTNGAYCARCCNGTNCARACNGGNTAYAARGCNGTNGAYWSNATGATHCCNATHGGN
MGNGGNCARMGNGARYTNATHATHGGNGAYMGNCARACNGGNAARACNGCNYTNGCNATH
GAYGCNATHATHAAYCARMGNGAYWSNGGNATHAARTGYATHTAYGTNGCNATHGGNCAR
AARGCNWSNACNATHWSNAAYGTNGTNMGNAARYTNGARGARCAYGGNGCNYTNGCNAAY
ACNATHGTNGTNGTNGCNACNGCNWSNGARWSNGCNGCNYTNCARTAYYTNGCNMGNATG
CCNGTNGCNYTNATGGGNGARTAYTTYMGNGAYMGNGGNGARGAYGCNYTNATHATHTAY
GAYGAYYTNWSNAARCARGCNGTNGCNTAYMGNCARATHWSNYTNYTNYTNMGNMGNCCN
CCNGGNMGNGARGCNTTYCCNGGNGAYGTNTTYTAYYTNCAYWSNMGNYTNYTNGARMGN
GCNGCNMGNGTNAAYGCNGARTAYGTNGARGCNTTYACNAARGGNGARGTNAARGGNAAR
ACNGGNWSNYTNACNGCNYTNCCNATHATHGARACNCARGCNGGNGAYGTNWSNGCNTTY
GTNCCNACNAAYGTNATHWSNATHACNGAYGGNCARATHTTYYTNGARACNAAYYTNTTY
AAYGCNGGNATHMGNCCNGCNGTNAAYCCNGGNATHWSNGTNWSNMGNGTNGGNGGNGCN
GCNCARACNAARATHATGAARAARYTNWSNGGNGGNATHMGNACNGCNYTNGCNCARTAY
MGNGARYTNGCNGCNTTYWSNCARTTYGCNWSNGAYYTNGAYGAYGCNACNMGNAARCAR
YTNGAYCAYGGNCARAARGTNACNGARYTNYTNAARCARAARCARTAYGCNCCNATGWSN
GTNGCNCARCARWSNYTNGTNYTNTTYGCNGCNGARMGNGGNTAYYTNGCNGAYGTNGAR
YTNWSNAARATHGGNWSNTTYGARGCNGCNYTNYTNGCNTAYGTNGAYMGNGAYCAYGCN
CCNYTNATGCARGARATHAAYCARACNGGNGGNTAYAAYGAYGARATHGARGGNAARYTN
AARGGNATHYTNGAYWSNTTYAARGCNACNCARWSNTGG---
.end lit
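.para
Back translation with IUB symbols amounts to a table lookup, sketched
below in Python. The codon table is read off from the example output
above rather than taken from the program source, and the treatment of
unrecognised symbols as --- follows the end of that output. The function
name is invented for the example.
.lit
```python
# Degenerate codon for each amino acid, as seen in the example output above.
IUB_CODON = {
    "A": "GCN", "C": "TGY", "D": "GAY", "E": "GAR", "F": "TTY",
    "G": "GGN", "H": "CAY", "I": "ATH", "K": "AAR", "L": "YTN",
    "M": "ATG", "N": "AAY", "P": "CCN", "Q": "CAR", "R": "MGN",
    "S": "WSN", "T": "ACN", "V": "GTN", "W": "TGG", "Y": "TAY",
}


def back_translate(protein, unknown="---"):
    """Return the IUB back translation of a protein sequence."""
    return "".join(IUB_CODON.get(res, unknown) for res in protein.upper())


# The first 20 residues of the example sequence.
print(back_translate("MQLNSTEISELIKQRIAQFN"))
```
.end lit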
.LEFT MARGIN1
@28. TX 5 @Search for patterns of motifs
.left margin2
.para
This option searches for patterns of motifs. Patterns can be defined
interactively or read from files. Results can be displayed in several ways
in both graphical and textual form. The option can also be used to create
pattern files for searching libraries. It is extremely flexible and
consequently the following documentation is quite lengthy. However the
routine is capable of searching for almost any known pattern. In addition
the flexibility does not necessitate difficulty of use, and the user
interface has been simplified considerably since the methods were first
published.
.para
Users should refer to the "typical dialogue" shown below for the most
helpful information on using the program.
.para
There are currently
four ways to display the matching patterns: 1=each individual
motif and its position is listed; 2=all the sequence between, and
including the two
outermost motifs is listed; 3=graphical, with a vertical line marking the
position
of the leftmost motif; 4 = EMBL feature table format, where the KEYNAM
field is the motif name, the FROM and TO fields denote the ends of the
match, and the DESCRIPTION field is "Program".
.para
When it is defined for the first time a pattern must be entered
interactively at the keyboard, but the pattern description
can be saved to a file.
This file can be used for all subsequent searches.
.para
When defining a pattern interactively
select a motif class and the program will request the required inputs.
.para
The program gives each motif an identifying name and number.
For motifs other than the first, a range of allowed positions must be
defined (Note that sets of motifs included using the OR operator will all
be given the same range, and so the program will only request range
values
for the first motif in any such set).
To specify the allowed range for a motif the user must supply the
following: the identifying number of the motif relative to which the
current motif's positions are to be defined (termed the "reference
motif"); a "relative start position"; and a range. The relative start
position can be negative or positive.
A negative start position means that although the reference motif
is searched for first, the current motif can be found to its left.
A zero relative start position means their left ends are superimposed. The
default start position is to butt-joint the motif to the right-hand end of
the "reference motif". The range is "the number of extra positions" that
the motif can take.
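.para
The arithmetic of the allowed range can be sketched in Python. The
function names are invented for the example. The default shown is the one
offered in the typical dialogue below, where it equals the length of the
reference motif plus one (for example 21 after the 20 residue hth motif);
that rule is read from the dialogue rather than stated by the program.
.lit
```python
def allowed_offsets(relative_start, extra_positions):
    """Offsets, relative to the reference motif, that the motif may take."""
    return range(relative_start, relative_start + extra_positions + 1)


def default_relative_start(reference_length):
    """Default offered in the dialogue: butt-joint to the reference motif."""
    return reference_length + 1


# Motif 5 of the typical dialogue: relative start 21, 3 extra positions.
print(list(allowed_offsets(21, 3)))   # [21, 22, 23, 24]
print(default_relative_start(20))     # 21
```
.end lit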
.para
The program will display the probability of finding each motif. These
values are presented in the following form: .1234E-5 means 0.1234 times
10
to the power -5.
.para
After the pattern has been defined, the program will type a description
of
it on the screen. It will then allow the user to give an overall cutoff
score and overall probability cutoff.
.para
Typical dialogue for all the different motif classes is displayed below.
.lit
? Menu or option number=28
Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =
? Motif name=aa
? String=aa
Probability of score 2.0000 = 0.123E-01
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =2
? Motif name=pmatch
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-1) (1) =
? Relative start position (-1000-1000) (3) =
? Number of extra positions (0-1000) (0) =
? String=qqq
? Minimum matches (1.00-3.00) (3.00) =2
Probability of score 2.0000 = 0.858E-02
1 Exact match
X 2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =3
? Motif name=sm
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-2) (2) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? String=wqa
? Minimum score (11.00-53.00) (53.00) =36
Probability of score 36.0000 = 0.531E-02
1 Exact match
2 Percentage match
X 3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =4
? Motif name=hth
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-3) (3) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? Weight matrix file name=hth
HELIX TURN HELIX PABO SAUER WEIGHTS 17-11-87
Probability of score -51.5860 = 0.230E-04
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
X 4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =5
? Motif name=repeat
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-4) (4) =
? Relative start position (-1000-1000) (21) =
? Number of extra positions (0-1000) (0) =3
? Repeat length (1-60) (6) =3
? Minimum gap (0-60) (0) =
? Maximum gap (0-60) (0) =2
? Minimum score (11.00-60.00) (36.00) =
Probability of score 36.0000 = 0.445E-01
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
X 5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =6
? Motif name=mset
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-5) (5) =
? Relative start position (-1000-1000) (1) =
? Number of extra positions (0-1000) (0) =
X 1 Keyboard input
2 File input
? 0,1,2 =
Separate sets with commas
? String=AVL,AST,,WYRF
? Minimum matches (1.00-4.00) (4.00) =3
Probability of score 3.0000 = 0.718E-02
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
X 6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =7
? (y/n) (y) Save pattern in a file
? Pattern definition file=EXAM.PAT
Motif 6 needs a file name to store set as a weight matrix
? Weight matrix file name=DEMO.WTS
Weight matrix needs a title
? Title=Demonstration class 6 weight matrix
Pattern description
Motif 1 named aa is of class 1
Which is an exact match to the string
aa
Motif 2 named pmatch is of class 2
which is a match of score 2. to the string
qqq
and the N-terminal residue can take positions 3 to 3
relative to the N-terminal end of motif 1
It is anded with the previous motif.
Motif 3 named sm is of class 3
which is a match of score 36. to the string
wqa
and the N-terminal residue can take positions 4 to 4
relative to the N-terminal end of motif 2
It is anded with the previous motif.
Motif 4 named hth is of class 4
Which is a match to a weight matrix with score -51.586
and the N-terminal residue can take positions 4 to 4
relative to the N-terminal end of motif 3
It is anded with the previous motif.
Motif 5 named repeat is of class 5
Which is a repeat with repeat length 3 and score 36.
The loop-out can have sizes 0 to 2
and the N-terminal residue can take positions 21 to 24
relative to the N-terminal end of motif 4
It is anded with the previous motif.
Motif 6 named mset is of class 6
Which is membership of a set with score 3.000
It is anded with the previous motif.
Probability of finding pattern = 0.4109E-14
Expected number of matches = 0.2539E-10
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
Select display mode
X 1 Motif by motif
2 Inclusive
3 Graphical
4 EMBL feature table
? 0,1,2,3,4 =
Searching
Total matches found 0
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=6
Page through text files
? Name of file to read=exam.pat
A1 aa Class
aa
@ End of string
A2 pmatch Class
1 Relative motif
3 Relative start position
0 Number of extra positions
qqq
@ End of string
2.00000 Cutoff
A3 sm Class
2 Relative motif
4 Relative start position
0 Number of extra positions
wqa
@ End of string
36.00000 Cutoff
A4 hth Class
3 Relative motif
4 Relative start position
0 Number of extra positions
hth File name
A5 repeat Class
4 Relative motif
21 Relative start position
3 Number of extra positions
3 Length
0 Minimum loop
2 Maximum loop
36.00000 Cutoff
A6 mset Class
5 Relative motif
1 Relative start position
0 Number of extra positions
DEMO.WTS File name
End of file
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=6
Page through text files
? Name of file to read=demo.wts
Demonstration class 6 weight matrix
4 0 3.000 4.000
P 1 2 3 4
N 0 0 0 0
C 0 0 0 0
S 0 1 0 0
T 0 1 0 0
P 0 0 0 0
A 1 1 0 0
G 0 0 0 0
N 0 0 0 0
D 0 0 0 0
E 0 0 0 0
Q 0 0 0 0
B 0 0 0 0
Z 0 0 0 0
H 0 0 0 0
R 0 0 0 1
K 0 0 0 0
M 0 0 0 0
I 0 0 0 0
L 1 0 0 0
V 1 0 0 0
F 0 0 0 1
Y 0 0 0 1
W 0 0 0 1
End of file
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structure
m5 = Search
? = Help
! = Quit
? Menu or option number=28
Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =2
? Motif name=avlst
? String=avlst
? Minimum matches (1.00-5.00) (5.00) =3
Probability of score 3.0000 = 0.394E-02
1 Exact match
X 2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Direct repeat
6 Membership of set
7 Pattern complete
? 0,1,2,3,4,5,6,7 =7
? (y/n) (y) Save pattern in a file n
Pattern description
Motif 1 named avlst is of class 2
which is a match of score 3. to the string
avlst
Probability of finding pattern = 0.3941E-02
Expected number of matches = 0.2030E+01
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
Select display mode
X 1 Motif by motif
2 Inclusive
3 Graphical
4 EMBL feature table
? 0,1,2,3,4 =4
Searching
FT avlst 152 156 Program
Total matches found 1
Minimum and maximum observed scores 3.00 3.00
.end lit
.para
General notes
.para
These methods allow users to define and search for
complex patterns of motifs defined as single objects.
The programs allow individual DNA motifs to be defined in eight
different
ways, and protein motifs in six. Motifs are combined, using the logical
operators AND, OR and NOT, to describe a pattern. The pattern also
specifies the ranges of allowed relative separations of the individual
motifs.
.para
First some definitions.
.para
A MOTIF is a contiguous subsequence of fixed length.
At its simplest
it could be a single definite base or amino acid; a more complex motif
might be better represented as a consensus or a weight matrix;
two more-abstract types of
motif are direct and inverted repeats.
.para
A PATTERN is a higher order of structure defined by a list of motifs. The
motifs in a pattern are combined using the logical operators AND, OR and
NOT. The list also defines the allowed relative separations of the
motifs. In the current versions of the programs up
to 50 motifs can be combined into a single pattern. So using these
definitions there are two
differences between motifs and patterns: 1) the distances between all
elements of a motif are fixed, but
the separations of parts of patterns can vary;
2) all characters in a motif are defined
using the same method (class), but different parts of a pattern can be
defined in completely different ways.
.para
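Each motif
can be represented in one of 9 ways (known as the motif class). The
numbering covers both nucleic acid and protein sequences; in the PIP menu
shown in the dialogue above, direct repeat and membership of a set appear
as options 5 and 6.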
.sk1
.lit
MOTIF CLASSES
CLASS DESCRIPTION
1 Exact match to a short defined sequence. The IUB symbols
can be used for DNA sequences.
2 Percentage match to a defined short sequence. In nucleic acids,
the IUB symbols can be used.
3 Match to a defined sequence, using a score matrix and cutoff
score. The DNA matrix (see option 18) gives scores to IUB symbols
depending on their level of redundancy. MDM78 is used for proteins.
4 Match to a weight matrix with cutoff score.
5 As class 4 but on the complementary strand.
6 Inverted repeat or stem-loop. Fixed stem length, range of
loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
7 Exact match to short sequence but with a defined step size.
8 Direct repeat. Fixed repeat length, range of loop-out sizes,
cutoff score, and score matrix (for protein sequences MDM78 and
for nucleic acids an identity matrix).
9 Membership of a set. A list of sets of allowed amino acids for
each position in the motif. The sets are separated by commas(,).
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
acids in which one of I,V or L must be found in the first position,
then anything in the next two positions, D,E,K or R in the fourth
position and F,Y,W,I,L,V or M in the fifth. This class only applies
to protein sequences because for nucleic acids "membership of a set"
can be achieved using IUB symbols.
Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
nucleic acids.
.end lit
.para
Class 1: exact match.
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. All symbols must match.
.para
Class 2: percentage match
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. The minimum number of matching characters
must
also be specified.
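.para
A percentage match can be sketched in Python as a count of identical
characters in each window. The function name is invented for the example,
and the sketch compares characters literally, so it does not expand IUB
symbols for nucleic acids.
.lit
```python
def percentage_matches(sequence, motif, minimum):
    """Return (start, matches) for windows with at least `minimum` matches.

    Start positions are 1-based.
    """
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        # zip stops at the end of the motif, so only one window is compared.
        matched = sum(a == b for a, b in zip(sequence[start:], motif))
        if matched >= minimum:
            hits.append((start + 1, matched))
    return hits


print(percentage_matches("GGAVLSTKK", "avlst", 3))  # [(3, 5)]
print(percentage_matches("AVQSTAA", "avlst", 3))    # [(1, 4)]
```
.end lit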
.para
Class 3: match using a score matrix
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. The motif is not compared directly with the
sequence to count the number of matching characters. Instead a matrix is
used to provide a score for all possible pairs of characters. The motif
score for
any position along the sequence is the sum of the scores found by
looking-up the scores for each pair of aligned characters. A match is
declared if some minimum score is achieved.
.para
Class 4: weight matrix
.para
The motif is defined by a table of values (called weights or scores). The
table gives a score for finding each possible character at each position
along the length of the motif. It therefore
has dimension motif-length x character-set-size, and allows us to give
different scores for each character at each position. It is equivalent to
having a different score matrix for each position along the motif, and
provides the most flexible and specific method of defining motifs. The
weight matrices are created by program PIP option 20 and
stored as files. The file contains the values
for each position, as well as an overall minimum score.
There are two ways in which these values can be used to calculate an
overall
score for any section of the sequence. The simplest way is to add the
values in the file. (This means that the highest possible score
can be calculated by adding the top value at each column
position, and the lowest
by adding the bottom value.)
The normal way of using the values in the file is as
follows.
First the programs divide the values in each column by the column total
so that they sum to 1.0.
Then the natural
logs of these values are used as scores. When the matrix is applied to a
sequence these logarithmic values are summed (which is of course
equivalent
to multiplying the frequencies).
Note that using the natural logs of the frequencies as
weights and
adding them means that the overall cutoff score must be less than zero,
whereas if the original
values in the weight matrix file are added, the cutoff score will be
greater than zero. The search routines therefore decide whether the user
wants to add values or multiply frequencies
by examining the value of the cutoff score: they will add the values if the
cutoff is greater than zero, and add the logs of the frequencies if it is
less than zero.
Hence we effectively get two
motif classes in one. The program PIP, when creating weight matrix
files, will ask the user whether the scores should be added or multiplied.
If the values in the table have been defined
without using a set of aligned sequences
it is easier for the user to
choose a cutoff score if the values are added.
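.para
The two ways of scoring can be sketched as follows. This is only an
illustration in Python, not the code used by the programs: the function
name, the layout of the table (one row per character, one column per motif
position) and the treatment of zero entries in the logarithmic case are
all assumptions.
.lit
```python
import math

def score_window(matrix, alphabet, window, cutoff):
    """Score one window of sequence against a weight matrix.

    matrix[r][p] is the value for character alphabet[r] at motif
    position p (an assumed layout). The sign of the cutoff selects
    the mode, as described in the text above.
    """
    rows = {ch: i for i, ch in enumerate(alphabet)}
    total = 0.0
    for pos, ch in enumerate(window):
        value = matrix[rows[ch]][pos]
        if cutoff > 0:
            # Positive cutoff: add the values from the file.
            total += value
        else:
            # Negative cutoff: add natural logs of column frequencies.
            column_total = sum(row[pos] for row in matrix)
            if value == 0:
                # Assumption: a zero frequency can never match.
                return float("-inf")
            total += math.log(value / column_total)
    return total

table = [[8, 1], [1, 1], [1, 1], [2, 9]]   # rows A, C, G, T
print(score_window(table, "ACGT", "AT", 10))    # 17.0
print(score_window(table, "ACGT", "AT", -1.0))  # log(8/12 * 9/12)
```
.end lit
.para
In the second call the two frequencies are 8/12 and 9/12, so the score is
the natural log of 0.5, which is less than zero as the text requires.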
.para
Class 5: complement of weight matrix
.para
The motif is defined by a weight matrix, but the program searches for its
complement.
.para
Class 6: inverted repeat, or stem-loop
.para
The motif is defined by a repeat length, a minimum score
and a range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0.
The loop sizes are defined by a minimum
and maximum distance from the 3' end of the stem.
For a stem-loop these will be positive numbers. For example to
define a stem of length 8 and loop sizes varying from 3 to 5, the stem
would be set to 8, the minimum start distance to 3 and the maximum
to 5. To define an
inverted repeat the minimum distance will be negative. For example stem
length=9,
minimum distance=-9, and maximum distance=-8 will find
inverted repeats of lengths 9 and 10.
E.g. AAAAATTTT and AAAAATTTTT would be found, the first having a base
at
its centre, the second having none.
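.para
One reading of the distance convention that reproduces the examples above
is sketched below in Python. It is an interpretation, not the program's
own code: the function name, the 0-based start position and the rule that
the second arm begins the given distance after the 3' end of the stem are
assumptions.
.lit
```python
# Pair scores quoted in the text: A-T=2, G-C=2, G-T=1, else=0.
PAIR_SCORES = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}

def stem_score(seq, start, stem_len, distance):
    """Score a stem against the arm lying `distance` bases 3' of it.

    start is a 0-based index. A negative distance makes the arm
    overlap the stem, which gives an inverted repeat.
    Returns None if the arm does not fit inside the sequence.
    """
    arm_start = start + stem_len + distance
    if arm_start < 0 or arm_start + stem_len > len(seq):
        return None
    total = 0
    for i in range(stem_len):
        left = seq[start + i]
        # The arm is read backwards so the pairing is antiparallel.
        right = seq[arm_start + stem_len - 1 - i]
        total += PAIR_SCORES.get((left, right), 0)
    return total

print(stem_score("AAAAATTTT", 0, 9, -9))            # 16
print(stem_score("AAAAATTTTT", 0, 9, -8))           # 18
print(stem_score("GGGGAAAACCCTTTTCCCC", 0, 8, 3))   # 16
```
.end lit
.para
With distance -9 the central base is paired with itself and scores zero,
which is why the first example scores 16 rather than 18.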
.para
Class 7: exact match, defined step size.
.para
The motif is defined by a short sequence which, for nucleic acids,
may include IUB symbols. All symbols must match. The class differs
from
class 1 in that searches will move in steps of some given size. For
example
we could search for a certain codon and use a step size of 3 and hence
keep in a
single reading frame.
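.para
The idea of a stepped search can be sketched in Python as follows. The
function name and the 0-based positions are assumptions, and for
simplicity the sketch compares characters exactly and does not interpret
IUB symbols.
.lit
```python
def stepped_matches(seq, motif, step, start=0):
    """Return 0-based positions of exact matches to motif,
    testing only every step-th position from start."""
    hits = []
    for pos in range(start, len(seq) - len(motif) + 1, step):
        if seq[pos:pos + len(motif)] == motif:
            hits.append(pos)
    return hits

seq = "ATGCCCATGAATGA"
print(stepped_matches(seq, "ATG", 3))   # [0, 6]
print(stepped_matches(seq, "ATG", 1))   # [0, 6, 10]
```
.end lit
.para
With a step of 3 the ATG at position 10 is not reported because it lies in
a different reading frame from the starting position.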
.para
Class 8: direct repeat
.para
The motif is defined by a repeat length, a minimum score
and a range of loop sizes. The scores are defined using MDM78 for protein
sequences and an identity matrix for nucleic acids.
The loop sizes are defined by a minimum
and maximum distance from the 3' end of the stem.
.para
Class 9: membership of a set
.para
This motif class is for protein sequences. It is defined by lists of
allowed amino acids for each position in the motif, and a cut-off score.
Positions at which any amino acid can occur are left blank.
All allowed amino acids for each position give a score of 1.
The motifs can be defined in two ways: either typed at the keyboard or
read
in as a weight-matrix-like file.
When the motif is defined at the keyboard the sets of allowed amino
acids
are separated by commas (,).
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
acids in which one of I,V or L must be found in the first position,
then anything in the next two positions, D,E,K or R in the fourth
position and F,Y,W,I,L,V or M in the fifth. To specify that the
whole motif must match, a score of 3 would be required (i.e. one of the
allowed amino acids must be found for each of the three defined
positions).
If the motif is read from a file the file must have been written by
program
PIP, or have been saved by the pattern searching routines. If the
user
elects to save a pattern, and it includes class 9 motifs typed at the
keyboard, then the program will save the class 9 motifs as weight matrix
files. Therefore it will request file names for each motif of this class.
If the motif given above as an example were saved the weight matrix file
would have 5 columns.
The first column
would contain zeroes except for the I, V and L rows
which would be set to 1; the next two columns would all be zero; the next
would be zero except for the D,E,K and R rows which would be 1; the final
column would contain 1's in rows F,Y,W,I,L,V and M, with
the rest zero.
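.para
The example motif can be worked through with the short Python sketch
below. The function names and the order of the amino acid rows are
assumptions made for illustration; the real file layout is whatever PIP
writes.
.lit
```python
def parse_sets(text):
    """Split a keyboard definition into one set per position.
    A blank field gives an empty set (any amino acid, score 0)."""
    return [set(field) for field in text.split(",")]

def membership_score(sets, window):
    """Score 1 for each position whose residue is in its set."""
    return sum(1 for allowed, residue in zip(sets, window)
               if residue in allowed)

def to_matrix(sets, alphabet):
    """Build the 0/1 table: one row per residue, one column
    per motif position."""
    return [[1 if residue in allowed else 0 for allowed in sets]
            for residue in alphabet]

sets = parse_sets("IVL,,,DEKR,FYWILVM")
print(len(sets))                        # 5
print(membership_score(sets, "IAADF"))  # 3
print(membership_score(sets, "AAADF"))  # 2

alphabet = "ACDEFGHIKLMNPQRSTVWY"       # assumed row order
matrix = to_matrix(sets, alphabet)
print(matrix[alphabet.index("I")])      # [1, 0, 0, 0, 1]
```
.end lit
.para
A cut-off score of 3 accepts the first window and rejects the second.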
.para
The logical operator (AND, OR or NOT) used to add each motif to the
pattern
is specified by preceding
the class number by the letters A, O or N. A = AND, O = OR, N = NOT.
The default is A, so N2 means include, using the NOT operator, a class 2
motif; O2 means include, using the OR operator, a class 2 motif; both A2
and
2 mean include, using the AND operator, a class 2 motif.
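.para
The rule for reading the operator prefix can be expressed as the following
Python sketch. The function name and the returned values are hypothetical;
only the letters A, O and N and the default of AND come from the text.
.lit
```python
OPERATORS = {"A": "AND", "O": "OR", "N": "NOT"}

def parse_motif_spec(text):
    """Split an entry such as N2 into (operator, class number).
    With no letter the operator defaults to AND."""
    text = text.strip().upper()
    if text[:1] in OPERATORS:
        return OPERATORS[text[0]], int(text[1:])
    return "AND", int(text)

print(parse_motif_spec("N2"))   # ('NOT', 2)
print(parse_motif_spec("O2"))   # ('OR', 2)
print(parse_motif_spec("2"))    # ('AND', 2)
```
.end lit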
.para
Range setting.
.para
The motifs in a pattern are numbered according to their order in the list.
Apart from the first motif in a pattern all motifs are given a range
of allowed positions relative to a motif further up the list.
For example
suppose we have a pattern defined by A AND B AND C AND D.
Motif A can occur anywhere, but B must have its range of allowed
positions defined relative to the position of motif A, and C's positions
can be defined relative to either A or B, depending on which is most
convenient, and likewise D's positions can be relative to A or B or C.
.para
Notice that the positions of motifs can be defined relative to more than
one motif. Suppose we have a pattern consisting of
motifs A, B and C, and that B occurs 5-10 residues right of A, C occurs
5-10 residues right of B, and also C is never more than 15 residues from A.
Then
it is quite consistent with the methods to include motif C into the
pattern
twice using the AND operator: once relative to A and once relative to B.
This will define the relative spacing and the ORDER of the motifs in the
pattern. (If we simply defined the position of C relative to A it could be
found to the left of B).
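.para
The effect of including C twice can be sketched in Python as below. The
function names are hypothetical, and distances are assumed to be measured
between the start positions of the motifs, which the text does not state.
.lit
```python
def in_range(ref_pos, pos, min_dist, max_dist):
    """True if pos lies min_dist..max_dist to the right of ref_pos."""
    return min_dist <= pos - ref_pos <= max_dist

def pattern_ok(a, b, c):
    """A AND B AND C AND C, with C included twice."""
    return (in_range(a, b, 5, 10)        # B relative to A
            and in_range(b, c, 5, 10)    # C relative to B
            and in_range(a, c, -15, 15)) # C relative to A

print(pattern_ok(0, 5, 10))    # True
print(pattern_ok(0, 10, 20))   # False: C is 20 from A
print(pattern_ok(0, 8, 6))     # False: C is left of B
```
.end lit
.para
In the last case C is within 15 residues of A, so a range relative to A
alone would have accepted it; the range relative to B is what fixes the
order.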
.para
Motifs combined together using the OR operator are all given the same
range. For example suppose we had a pattern A AND (B OR C) AND (D OR E),
then B and C each have the same range, and D and E also have
the same range as one another. The range for D and E can be relative to
A or to B.
.para
Motifs cannot have their ranges defined relative to motifs that are
included using the NOT operator. For example if we had the pattern A NOT
B
AND C, then the range for C can only be defined relative to motif A.
.para
Speed can be gained by arranging the order
of the motifs so that those higher up the list are of types that can be
searched for rapidly and that are also unlikely to be found.
.para
Motifs combined by the OR operator are alternatives: if any one of a set
of motifs
combined by the OR operator is found, then a match is declared. All
alternatives will be reported. For example if we had a pattern defined by
A
AND (B OR C), then all places where A occurs and B is found within range,
and all places where A is found and C is found within range will be
reported. A typical use would be where we might allow a motif to appear
on
either strand of the DNA sequence. For example a weight matrix
representing
the heatshock element could be used in a pattern which included
heatshock
as a motif class 4 combined using the OR operator
with heatshock as a motif class 5.
.para
The probability calculations are performed for each motif as it is
defined.
If an overall probability cut-off is given the calculation is repeated for
each match found. To achieve maximum searching speed do not give an
overall
probability cut-off. Overall cut-off scores should only be used if the
motif
classes used are compatible.
.para
There are currently
several ways to display the matches: 1 = each
motif and its position is listed; 2 = all the sequence between the two
outermost motifs is listed; 3 = graphical, with a spike marking the
position
of the leftmost motif. The library versions also give entry names, and a
one
line title; in addition they can be used to produce aligned families of
sequences. When this mode of output is selected the program will write a
separate file for each match. The files will be called ENTRYNAME.DAT
where
ENTRYNAME is the name of the entry in the library. The matching
sequence
will be written out so that the spacing between motifs is constant, and
set to the maximum allowed by the pattern definition. Any gaps will be
filled with dashes (-). If the individual sequences were subsequently
written one above the other
they should line up so that all motifs are in register. There are two types of
output of this sort: one, option 4, writes out whole sequences, the other,
option 5, writes out only the sequences between the two outermost
motifs.
Note that for option 4 users are asked to type the position of the
first motif, and the reason for
this is explained below.
Consider a pattern found in several sequences. Consider only
the first motif in
the pattern and suppose that it was found in different positions in these
sequences.
Say that of these positions the one furthest from the left end was
position 100. Then, in order to ensure that all the sequences would align,
we must specify that motif 1 must start at position 100.
Any sequences in which motif 1 started
nearer to the left end than position 100 would be padded accordingly.
These modes of output
should only be used when the position of each motif is defined relative to
its
immediate neighbour.
.para
The pattern descriptions can be saved to files. These files
can be used instead of typing definitions again at the keyboard. As the
files are annotated,
they can easily
be changed using system editors, and the modified versions used to
define the variant patterns for the programs.
.para
Use of lists of entry names
.para
The two programs that operate on libraries have the ability to
restrict their searches to subsets of the libraries. This does not require
sublibraries to be created but instead is achieved by using files
containing a list of the entry names of sequences. The user may choose to
search only those entries on the list or, alternatively, to search all but
those on the list (i.e. in the latter case
the list contains the names of those to be excluded).
The programs can search libraries that have indexes and those that
do not.
If a list of names for inclusion is used,
then the search will be faster if the index is present. In all other
circumstances the whole library will be read.
The list must be in library order except when it is used
to include entries, and an index is available.
The list must contain each entry name on a separate line, with the name
starting in column 1 of the line, i.e. there must be no spaces at the start
of the line.
The list of entry names
can be produced by the keyword searches of nip, pip, etc as long
as the listings produced have a space character separating the entry name
from the entry description. This will depend on how well the library
reformatting programs work. For example swissprot entry names tend to run
into the beginning of the descriptions, but other libraries are generally
OK.
.para
One use of the programs is to look for patterns that we already know
about, but in new sequences. However it is hoped that they will also be
useful for finding new motifs. For example
several known control regions in
nucleic acid
sequences consist of particular direct or inverted repeats;
the inclusion of
direct and inverted repeats as motif classes
makes it possible to
find previously unknown
motifs of these types.
Using these new programs we can
ask questions like: "are there any inverted or direct repeats near to
sections of sequence that contain both a
CCAAT box and a TATA box?"; and to search for such things throughout
the
libraries. In addition, the mode of output in which all the sequence
between
the two outermost motifs found is printed out, allows us to extract
sequences and examine them in more detail for further common
subsequences.
For example we might want to collect together all the sequences
between
putative CCAAT and TATA boxes.
.para
A further use of the inverted repeat motif class is the following. If a
regulatory sequence in DNA is poorly defined but also an inverted repeat,
then it might be an advantage to specify it both as a consensus sequence
and
a superimposed inverted repeat. In this way two weak definitions can be
combined to produce a stronger pattern.
.para
Given only a few examples of a motif it
should be possible to perform initial searches using a
class 3 motif, and then, using plausible matching sequences, create a
more
specific weight matrix for the same motif.
.para
If motifs are combined with the first motif using the OR operator
they will be ignored until all
permutations that include the first motif have been looked for.
The whole search will then be repeated, in
turn, for each of
those motifs that are combined with the first motif using the OR
operator.
An interesting consequence of this is that the program can be used,
without
change, to compare any newly determined sequence with all known
individual
motifs. We achieve this by having a pattern in which all known relevant
motifs are combined using the OR operator.
If we ask to use this pattern with
a sequence, the program will automatically compare each individual
motif in
the pattern with the whole length of the
sequence. As the number of known
motifs grows this should become an increasingly useful standard
procedure.
.para
The NOT operator is obviously
useful for making sure particular motifs are not present, but it can also
be used to bracket the levels of matches found. We may want a degree of
match that lies between two limits - binding should occur, but not too
strongly; or base-pairs should form, but not too many. We can specify
this
by asking for a match with a low score, in combination with a match with
a
high score, both for the same motif, but with the high score included
using
the NOT operator.
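.para
The bracketing idea amounts to the following test, sketched in Python. The
function name is hypothetical, and it is assumed that a motif matches when
its score is greater than or equal to its cutoff.
.lit
```python
def bracketed(score, low_cutoff, high_cutoff):
    """Motif with the low cutoff AND NOT motif with the high cutoff."""
    matches_low = score >= low_cutoff
    matches_high = score >= high_cutoff
    return matches_low and not matches_high

print(bracketed(12, 10, 15))   # True: between the two limits
print(bracketed(16, 10, 15))   # False: match is too strong
print(bracketed(8, 10, 15))    # False: match is too weak
```
.end lit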
.para
The algorithm is designed to find all sections of a sequence that satisfy
the pattern rather than only the best match.
Particularly if some of the motifs in a pattern are less well defined than
others, this can often result in the same region of a sequence being
reported as having several matches, but which only vary in the
positions of the weakest motifs.
.para
General remarks on motif searching
.para
Generally motifs are short subsequences that are thought to be
associated with
particular functions in some known sequences. Often
we search for them to try to
understand or interpret other sequences. Sometimes we search for
motifs and
patterns to
test a hypothesis about their role: are they found in the expected
positions in the expected sequences? In doing so we should remember
that, in both proteins and nucleic acids,
what we are really looking for is a particular
three dimensional structure with certain affinities for other structures,
and that we are assuming that the sequence of the motif alone
defines the 3D structure we are searching for.
The overall structure
may be completely different to those in which the motif is functional,
and
hence the motif may have a different shape or be inaccessible.
We should be aware of the
importance of the context in which a motif is found. Where does it lie
relative to the overall structure, is it accessible, is the three
dimensional spacing between
it and other motifs correct? For example, is it on the same side of the
double helix, and the correct distance from some other motif? How does
context affect our assessment of the significance of finding a motif?
Finding false mammalian mRNA splice junctions in non-coding sequences
is
far less important than finding false sites in pre-mRNA sequences, but
finding them in the correct places is most important! In other words, it
is
often the case that when we are searching for a motif that is known to
be
necessary for some function, then a positive result in the form of a
match
in the required position, is more important than a high background of
matches in the wrong positions. Being
able to write
down the probability of finding a motif in a random sequence tells us how
well it is defined.
In nucleic
acids the DNA may contain many superimposed types of information such
as
those concerned with histone phasing, protein coding or mRNA secondary
structure. These overlapping "codes" may interfere with one another
causing
matches to motifs to be poorer than expected.
In general we will only have a limited number of examples of the
motif and we do not know how representative they are.
.para
Sequences have superimposed functions: some parts may be of general
structural
importance and give rise to an overall framework, and other parts give
specificity and hence are not common; we may want to use a set of
aligned
sequences to define a motif, but want to use only the framework
positions.
Alternatively we may want to pick out
only those parts of a set of aligned sequences that give a particular
property, and to ignore other similarities that are due to some other
property
and which could obscure the pattern
we are interested in.
It is possible to apply a mask to a set of aligned sequences in
order to give weight to selected positions only.
The ability to define a mask allows certain positions
to be used in the motif and others to be ignored, and yet still permits the
use of a set of aligned sequences to calculate weights. The mask is
requested and applied
by the program and results in the masked positions being zero
in
the weight matrix. The mask is defined in the following way.
Suppose we had a motif of length 15, then the mask
x--x--xx-x will give zero weights to positions 2,3,5,6 and 9 (note it is
the dashes (-) that are significant and that positions
1,4,7,8,10,11,12,13,14 and 15
will be non-zero). Of course
the same set of sequences could be used with several alternative masks
in
order to extract different features and create corresponding weight
matrices.
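.para
The effect of the example mask can be sketched in Python as follows. The
function names and the table layout (one row per character, one column per
position) are assumptions; as in the example, positions beyond the end of
the mask are left unchanged.
.lit
```python
def zeroed_positions(mask):
    """1-based positions that the mask sets to zero."""
    return [pos + 1 for pos, symbol in enumerate(mask)
            if symbol == "-"]

def apply_mask(matrix, mask):
    """Return a copy of the table with masked columns zeroed.
    The mask may be shorter than the motif."""
    masked = [row[:] for row in matrix]
    for pos, symbol in enumerate(mask):
        if symbol == "-":
            for row in masked:
                row[pos] = 0
    return masked

mask = "x--x--xx-x"
print(zeroed_positions(mask))       # [2, 3, 5, 6, 9]

table = [[1] * 15 for _ in range(4)]    # dummy weights, length 15
masked = apply_mask(table, mask)
print(masked[0])
# [1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1]
```
.end lit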
.para
The programs are described in Staden,R.
CABIOS 4, 53-60, 1988; Staden,R.
CABIOS 5, 89-96, 1989, and a forthcoming Methods in
Enzymology.
.left margin1
@ end of help