.NPA
|
|
.SP 1
|
|
.left margin1
|
|
@-1. TX 0 @General
|
|
.sp
|
|
@-2. T 0 @Screen control
|
|
.sp
|
|
@-2. X 0 @Screen
|
|
.sp
|
|
@-3. T 0 @Statistical analysis of content
|
|
.sp
|
|
@-3. X 0 @Statistics
|
|
.sp
|
|
@-4. T 0 @Structures and repeats
|
|
.sp
|
|
@-4. X 0 @Structures
|
|
.sp
|
|
@-5. TX 0 @Search
|
|
.sp
|
|
@0. TX -1 @PIP
|
|
.para
|
|
This is a program for analysing individual protein sequences. It can read
|
|
sequences stored in many of the most commonly used formats, and
|
|
performs all of the usual simple analyses. In addition it has very flexible
|
|
search procedures and presents many of its results graphically.
|
|
.PARA
|
|
The following analyses (preceded by their option numbers) are included:
|
|
.lit
|
|
? = Help
|
|
! = Quit
|
|
3 = read a new sequence
|
|
4 = define active region
|
|
5 = list the sequence
|
|
6 = list a text file
|
|
7 = direct output to disk
|
|
8 = write active sequence to disk
|
|
9 = edit the sequence
|
|
10 = clear graphics screen
|
|
11 = clear text screen
|
|
12 = draw a ruler
|
|
13 = use cross hair
|
|
14 = reposition plots
|
|
15 = label diagram
|
|
16 = display a map
|
|
17 = search for short sequences
|
|
18 = compare a sequence
|
|
19 = compare a sequence using a score matrix
|
|
20 = search for a sequence using a weight matrix
|
|
21 = calculate amino acid composition
|
|
22 = plot hydrophobicity
|
|
23 = plot charge
|
|
24 = plot Robson prediction
|
|
25 = plot hydrophobic moment
|
|
26 = draw helix wheel
|
|
27 = back translate
|
|
28 = search for patterns of motifs
|
|
.end lit
|
|
.para
|
|
Some of these methods produce graphical results, and so the
program is generally used from a graphics terminal (a VDU on which lines
and points can be drawn as well as characters).
|
|
.para
|
|
For users of VT640s or their equivalents, the terminal must be set to
nowrap (type NOWRAP) before running the program.
|
|
.LEFT MARGIN2
|
|
The position of each plot is defined relative to a user's drawing
board, which has size 1-10,000 in x and 1-10,000 in y.
Plots for each option are drawn in a window defined by x0,y0 and
xlength,ylength, where x0,y0 is the position of the bottom left hand
corner of the window, xlength is the width of the window and ylength
is its height.
|
|
.lit
|
|
--------------------------------------------------------- 10,000
1                                                       1
1      --------------------------------------   ^       1
1      1                                    1   1       1
1      1                                    1   1       1
1      1                                    1 ylength   1
1      1                                    1   1       1
1      1                                    1   1       1
1      --------------------------------------   v       1
1 x0,y0^                                                1
1      <---------------xlength-------------->           1
--------------------------------------------------------- 1
1                                                  10,000
|
|
|
|
.end lit
|
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
|
The default window positions are read from a file "ANALPMRG" when the
|
|
program is started. Users can have their own file if required.
|
|
.para
|
|
The program can handle sequences stored in several formats:
Staden, EMBL, GENBANK, PIR (also known as NBRF) and GCG. They are described
in the help for 'READ NEW SEQUENCE'.
|
|
.para
|
|
The options for the program are accessed from 5 main menus: general,
|
|
screen control, statistical analysis of content, structure, search.
|
|
Both menus and options are selected by number.
|
|
.LEFT MARGIN1
|
|
@1. TX 0 @Help
|
|
.LEFT MARGIN2
|
|
.para
|
|
This option gives online help. Select an option number and the current
documentation for that option will be displayed. Note that option 0 gives an
introduction to the program, and that typing ? gets help from anywhere in
the program.
|
|
The following analyses (preceded by their option numbers) are included:
|
|
.sp
|
|
.left margin1
|
|
@2. TX 0 @Quit
|
|
.left margin2
|
|
.para
|
|
This function stops the program.
|
|
.left margin1
|
|
@3. TX 1 @Read a new sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
This option allows users to read in new sequences, browse through annotations,
|
|
or search sequence
|
|
libraries for keywords. Sequences can be read from "personal"
|
|
sequence files or from sequence libraries. These are referred to as the
|
|
sequence "source". Personal files can be stored in several formats:
|
|
Staden, PIR, EMBL, GENBANK and GCG.
|
|
At LMB we use "Staden" format for sequencing and all
|
|
the
|
|
libraries are stored in their original formats. Note, however, that libraries
|
|
such as EMBL or GenBank that are divided into several files (e.g. GenBank has
|
|
13 separate files) are indexed as a whole. This means that users do not need
|
|
to know which file contains an entry, only which library.
|
|
When the user chooses to read in a sequence, the program first asks for the
|
|
sequence "source".
|
|
.para
|
|
If the user selects "personal" the program will ask for
|
|
the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
|
|
the file. For PIR format the user will also be required to know the entry
|
|
name of the sequence as the file can contain several. For the other formats
|
|
only a single entry is expected. The file will be read, its length and
|
|
composition will be displayed and the option left.
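.para
The length and composition display mentioned above can be illustrated
with a short Python sketch. It is not the program's own code: the function
name composition and its return layout are assumptions made for this
example. It counts each symbol and reports it as a percentage of the
sequence length.
.lit
```python
from collections import Counter


def composition(sequence):
    """Return {symbol: (count, percent)} for a sequence string.

    Hypothetical helper: the real program prints a table instead.
    """
    counts = Counter(sequence.upper())
    total = len(sequence)
    # An empty sequence gives an empty Counter, so no division takes place.
    return {sym: (n, 100.0 * n / total) for sym, n in counts.items()}


if __name__ == "__main__":
    seq = "MQLNSTEISE"
    print("Sequence length=", len(seq))
    for sym, (n, pct) in sorted(composition(seq).items()):
        print(f"{sym} {n:5d} {pct:5.1f}%")
```
.end lit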
|
|
.para
|
|
If the user selects "library" as the sequence source the program will display a
|
|
list of available libraries. The programs are capable of handling all current
|
|
libraries but which ones are available will vary from site to site. At LMB we
|
|
have several libraries and also weekly updates of data gathered between releases.
|
|
The program will ask users to select a library and then give a list of options:
|
|
.lit
|
|
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entrynames from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
|
|
.end lit
|
|
If "Get a sequence" or "Get annotations" is selected, users will be asked to
type the entry name. The option will be left when a sequence is selected or
! is typed. The composition and length will be displayed.
|
|
.para
|
|
The text index contains all words from feature tables, reference titles,
|
|
definition lines, keywords lists and comments, so the text index search
|
|
is most useful. It is also the fastest. Up to 5 words can be searched for
|
|
at once. The words should be typed separated by spaces, for example
|
|
.lit
|
|
? Keywords=P53 mouse murine tumo
|
|
|
|
.end lit
|
|
will search for all entries that contain words starting with p53, mouse,
|
|
murine and tumo. Only the unique entries that contain ALL words will be
|
|
listed. Before listing the matching entries
|
|
the program will show the number of 'hits' for each word and ring the bell.
|
|
Escape is possible at this point, or after each screenful of entries.
|
|
In addition to the entry names the text search displays the primary accession
|
|
number, the sequence length and up to 80 characters of description.
|
|
(The search of 'titles' is now redundant because the full text index
|
|
contains all the title words and the search is much faster. It will probably
|
|
be removed from the program.)
|
|
All searches are independent of case. Where
|
|
possible the program will offer default entry names.
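.para
The behaviour of the text index search can be sketched in Python. This is
only an illustration, not the program's own code: the function name
search_text_index, the dictionary standing in for the index and the sample
entries are assumptions. It follows the rules stated above: each word is
matched as a prefix, case is ignored, at most 5 words are accepted, and
only entries that contain ALL the words are listed.
.lit
```python
def search_text_index(index, query, max_words=5):
    """Search a toy text index.

    index -- dict mapping entry name to the words found in its annotation
             (a stand-in for the real index files, whose layout is not
             described here)
    query -- words separated by spaces, e.g. "P53 mouse"
    Returns (hits per word, sorted list of entries containing all words).
    """
    words = query.lower().split()
    if not 1 <= len(words) <= max_words:
        raise ValueError(f"supply between 1 and {max_words} words")
    hits = {w: set() for w in words}
    for entry, entry_words in index.items():
        lowered = [ew.lower() for ew in entry_words]
        for w in words:
            # A query word matches any indexed word that starts with it.
            if any(ew.startswith(w) for ew in lowered):
                hits[w].add(entry)
    counts = {w: len(found) for w, found in hits.items()}
    matching = sorted(set.intersection(*hits.values()))
    return counts, matching


if __name__ == "__main__":
    toy_index = {
        "MMP53": ["Mouse", "mRNA", "p53"],
        "MMLYN": ["Mouse", "lyn", "protein"],
        "HSP53": ["Human", "p53", "tumour"],
    }
    print(search_text_index(toy_index, "P53 mouse"))
```
.end lit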
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
Select sequence source
|
|
X 1 Personal file
|
|
2 Sequence library
|
|
? Selection (1-2) (1) =
|
|
Select sequence file format
|
|
X 1 Staden
|
|
2 EMBL
|
|
3 GenBank
|
|
4 PIR
|
|
5 GCG
|
|
? Selection (1-5) (1) =
|
|
? Sequence file name=M13MP7.SEQ
|
|
Contig title removed
|
|
Sequence length= 7238
|
|
Sequence composition
|
|
T C A G -
|
|
2405. 1539. 1765. 1527. 2.
|
|
33.2% 21.3% 24.4% 21.1% 0.0%
|
|
.
|
|
.
|
|
.
|
|
|
|
|
|
Select sequence source
|
|
X 1 Personal file
|
|
2 Sequence library
|
|
? Selection (1-2) (1) =2
|
|
Select a library
|
|
X 1 EMBL 29 nucleotide library Dec 91
|
|
2 SWISSPROT 20 protein library Nov 91
|
|
3 PIR 31 protein library Dec 91
|
|
4 NRL3D 58 From Brookhaven protein library Dec 91
|
|
5 GenBank
|
|
? Selection (1-5) (1) =
|
|
Library is in EMBL format with indexes
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =5
|
|
Search for keywords
|
|
? Keywords=P53 mouse
|
|
P53 hits 68
|
|
MOUSE hits 8180
|
|
|
|
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
|
|
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
|
|
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
|
|
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
|
|
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
|
|
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
|
|
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
|
|
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
|
|
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
|
|
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
|
|
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
|
|
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
|
|
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
|
|
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
|
|
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
|
|
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
|
|
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
|
|
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
|
|
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
|
|
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
|
|
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
|
|
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
|
|
22 different entries found
|
|
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =4
|
|
Search for keywords
|
|
? Keywords=alpha
|
|
Searching for alpha
|
|
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
|
|
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
|
|
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
|
|
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
|
|
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
|
|
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
|
|
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
|
|
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
|
|
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
|
|
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
|
|
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
|
|
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
|
|
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
|
|
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
|
|
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
|
|
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
|
|
!
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =3
|
|
? Accession number=v00636
|
|
Entry name LAMBDA
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =2
|
|
Default Entry name=LAMBDA
|
|
? Entry name=
|
|
ID LAMBDA standard; DNA; PHG; 48502 BP.
|
|
XX
|
|
AC V00636; J02459; M17233; X00906;
|
|
XX
|
|
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
|
|
DT 09-JUN-1982 (Rel. 1, Created)
|
|
XX
|
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
|
XX
|
|
KW circular; coat protein; DNA binding protein; genome;
|
|
KW origin of replication.
|
|
XX
|
|
OS Bacteriophage lambda
|
|
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
|
|
XX
|
|
RN [1]
|
|
RP 1-48502
|
|
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
|
|
RT "Nucleotide sequence of bacteriophage lambda DNA";
|
|
RL J. Mol. Biol. 162:729-773(1982).
|
|
XX
|
|
!
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =
|
|
Default Entry name=LAMBDA
|
|
? Entry name=
|
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
|
Sequence length 48502
|
|
Sequence composition
|
|
T C A G -
|
|
11988. 11360. 12336. 12818. 0.
|
|
24.7% 23.4% 25.4% 26.4% 0.0%
|
|
|
|
.end lit
|
|
.left margin1
|
|
@4. TX 1 @Redefine active region
|
|
.LEFT MARGIN2
|
|
.para
|
|
For its analytic functions
|
|
the program always works on a region of the sequence called the active
|
|
region. When a new sequence is read into the program the active region is
|
|
automatically set to start at the beginning of the sequence and go
|
|
up to the
|
|
maximum allowed size of active region the version of the program can
|
|
handle. The positions are shown on the screen.
|
|
On most machines this will be to the end of the sequence.
|
|
This option allows the user to define a different region. Note that for
|
|
convenience in the
|
|
listing and translation functions the user is given access to regions
|
|
outside the active region.
|
|
.left margin1
|
|
@5. TX 1 @List a sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
The sequence can be listed with line lengths from
|
|
10 to 120 in multiples of 10. Output can be directed to a disk file by
|
|
first selecting disk output. The output looks like:
|
|
.lit
|
|
|
|
10 20 30 40 50 60
|
|
MQLNSTEISE LIKQRIAQFN VVSEAHNEGT IVSVSDGVIR IHGLADCMQG EMISLPGNRY
|
|
|
|
70 80 90 100 110 120
|
|
AIALNLERDS VGAVVMGPYA DLAEGMKVKC TGRILEVPVG RGLLGRVVNT LGAPIDGKGP
|
|
|
|
130 140 150 160 170 180
|
|
LDHDGFSAVE AIAPGVIERQ SVDQPVQTGY KAVDSMIPIG RGQRELIIGD RQTGKTALAI
|
|
|
|
190 200 210 220 230 240
|
|
DAIINQRDSG IKCIYVAIGQ KASTISNVVR KLEEHGALAN TIVVVATASE SAALQYLARM
|
|
|
|
250 260 270 280 290 300
|
|
PVALMGEYFR DRGEDALIIY DDLSKQAVAY RQISLLLRRP PGREAFPGDV FYLHSRLLER
|
|
|
|
310 320 330 340 350 360
|
|
AARVNAEYVE AFTKGEVKGK TGSLTALPII ETQAGDVSAF VPTNVISITD GQIFLETNLF
|
|
|
|
370 380 390 400 410 420
|
|
NAGIRPAVNP GISVSRVGGA AQTKIMKKLS GGIRTALAQY RELAAFSQFA SDLDDATRKQ
|
|
|
|
430 440 450 460 470 480
|
|
LDHGQKVTEL LKQKQYAPMS VAQQSLVLFA AERGYLADVE LSKIGSFEAA LLAYVDRDHA
|
|
|
|
490 500 510 520 530 540
|
|
PLMQEINQTG GYNDEIEGKL KGILDSFKAT QSW*
|
|
|
|
.end lit
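.para
A listing in the layout shown above can be produced by a short Python
sketch. The function name list_sequence is an assumption for this example
and the code is not taken from the program. It writes the residues in
blocks of 10, with a line of numbers above whose last digit sits over the
tenth residue of each block, and accepts line lengths from 10 to 120 in
multiples of 10.
.lit
```python
def list_sequence(sequence, line_length=60):
    """Format a sequence in numbered blocks of 10 residues."""
    if line_length % 10 != 0 or not 10 <= line_length <= 120:
        raise ValueError("line length must be 10 to 120 in multiples of 10")
    blocks_per_line = line_length // 10
    out = []
    for start in range(0, len(sequence), line_length):
        chunk = sequence[start:start + line_length]
        # Numbers are right-justified over each block of 10; the full
        # ruler is written even when the last line is short.
        numbers = "".join(
            f"{start + 10 * (i + 1):>10} " for i in range(blocks_per_line)
        ).rstrip()
        residues = " ".join(chunk[i:i + 10] for i in range(0, len(chunk), 10))
        out.extend([numbers, residues, ""])
    return "\n".join(out)


if __name__ == "__main__":
    print(list_sequence("MQLNSTEISELIKQRIAQFNVVSEAHNEGT", 20))
```
.end lit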
|
|
.left margin1
|
|
@6. TX 1 @List a text file
|
|
.LEFT MARGIN2
|
|
.para
|
|
Allows the user to have a text file displayed on the screen. It will appear
|
|
one page at a time.
|
|
.left margin1
|
|
@7. TX 1 @Direct output to disk
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to direct output that would normally appear on the screen to a file.
|
|
.para
|
|
Select redirection of either text or graphics, and
|
|
supply the name of the file that the output should be written to.
|
|
.para
|
|
The results from the next options selected will not appear on the screen
|
|
but will be written to the file. When option 7 is selected again
|
|
the file will be
|
|
closed and output will again appear on the screen.
|
|
.left margin1
|
|
@8. TX 1 @Write active region to disk
|
|
.LEFT MARGIN2
|
|
.para
|
|
The program has the capability of reading in EMBL, GENBANK, NBRF, GCG
|
|
and Staden formats
|
|
and of reversing and complementing sequences. This option allows users
|
|
to
|
|
write the current active sequence to a disk file in Staden format. Hence
|
|
it
|
|
allows format conversion and crude sequence cutting.
|
|
.left margin1
|
|
@9. TX 1 @Edit the sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to edit sequences or any other files by giving access to the
|
|
computer's system editor. For editing sequences the input file should
|
|
have already been created using the listing function "list
|
|
sequence".
|
|
.para
|
|
Supply the name of the file to edit. Wait while the system editor is made
|
|
ready (this can take a while on a VAX). Use the editor. Exit from the editor. If a
|
|
sequence has been edited, and you want to process it, affirm that the
|
|
sequence should be "made active". The edited sequence will replace the
|
|
original sequence.
|
|
.para
|
|
This editing method is designed to give users access to an editor with
|
|
which they are familiar - i.e. the one on their machine, and yet to allow
|
|
them to edit a sequence which contains the landmarks they need in
|
|
order to know where they are. Users can create files containing simple
|
|
listings with numbering, using "list the sequence", and
|
|
then edit them with their system editor, using the numbering to know
|
|
where they are within the sequence. When the edits are complete they
|
|
exit from the editor and the program "analyses" the edited file to extract
|
|
only the sequence characters. The permitted set of characters is:
ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz. All permitted characters
found in the file will become part of the sequence; all others are removed.
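.para
The extraction step can be sketched in Python as follows. The function
name extract_sequence is an assumption; only the permitted character set
is taken from the text above. Every character of the edited file is
tested against the set, so numbering, spaces and line ends drop out.
.lit
```python
# Permitted characters exactly as given in the text above.
PERMITTED = set("ACDEFGHIKLMNPQRSTVWXYZ-acdefghiklmnpqrstvwxyz")


def extract_sequence(edited_text):
    """Keep only permitted sequence characters from an edited listing."""
    return "".join(ch for ch in edited_text if ch in PERMITTED)


if __name__ == "__main__":
    listing = (
        "        10         20\n"
        "MQLNSTEISE LIKQRIAQFN\n"
    )
    print(extract_sequence(listing))
```
.end lit
Note that any permitted letter left in the file is kept, so the edited
file should contain nothing but the listing itself.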
|
|
.left margin1
|
|
@10. TX 2 @Clear graphics
|
|
.LEFT MARGIN2
|
|
.para
|
|
Clears the screen of both text and graphics.
|
|
.left margin1
|
|
@11. TX 2 @Clear text
|
|
.LEFT MARGIN2
|
|
.para
|
|
Clears only text from the screen.
|
|
.left margin1
|
|
@12. TX 2 @Draw a ruler
|
|
.LEFT MARGIN2
|
|
.para
|
|
This option
|
|
allows the user to draw a ruler or scale along the x axis of the screen to
|
|
help identify the coordinates of points of interest. The user can define
|
|
the position of the first amino acid to be marked (for example if the
|
|
active
|
|
region is 1501 to 8000, the user might wish to mark every 1000th amino
|
|
acid
|
|
starting at either 1501 or 2000 - it depends if the user wishes to treat
|
|
the active region as an independent unit with its own numbering starting
|
|
at
|
|
its left edge, or as part of the whole sequence). The user can also define
|
|
the separation of the ticks on the scale and their height. If required the
|
|
labelling routine can be used to add numbers to the ticks.
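.para
The choice of first tick in the example above can be checked with a small
Python sketch. The function name ruler_ticks and its arguments are
assumptions made for illustration; the program itself asks for these
values interactively.
.lit
```python
def ruler_ticks(region_start, region_end, first_tick, separation):
    """Return the sequence positions at which ruler ticks would fall."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return [
        pos
        for pos in range(first_tick, region_end + 1, separation)
        if pos >= region_start
    ]


if __name__ == "__main__":
    # Active region 1501 to 8000, a tick every 1000th amino acid.
    print(ruler_ticks(1501, 8000, 1501, 1000))  # numbering from the left edge
    print(ruler_ticks(1501, 8000, 2000, 1000))  # numbering of the whole sequence
```
.end lit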
|
|
.left margin1
|
|
@13. TX 2 @Use cross hair
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function puts
|
|
a steerable cross on the screen that can be used to find the
|
|
coordinates of points in the sequence. The user can move the cross
|
|
around using the directional keys; when the space bar is hit the
|
|
program will print out the coordinates of the cross in sequence units and
|
|
the option will be exited.
|
|
.para
|
|
If instead,
|
|
you hit a , the position will be displayed but the cross will remain on
|
|
the screen.
|
|
.para
|
|
If a letter s is hit the sequence around the cross hair is displayed and
|
|
the cross remains on the screen.
|
|
.left margin1
|
|
@14. TX 2 @Reset margins
|
|
.LEFT MARGIN2
|
|
.para
|
|
The position of each plot is defined relative to a user's drawing
board, which has size 1-10,000 in x and 1-10,000 in y.
Plots for each option are drawn in a window defined by x0,y0 and
xlength,ylength, where x0,y0 is the position of the bottom left hand
corner of the window, xlength is the width of the window and ylength
is its height.
|
|
.lit
|
|
--------------------------------------------------------- 10,000
1                                                       1
1      --------------------------------------   ^       1
1      1                                    1   1       1
1      1                                    1   1       1
1      1                                    1 ylength   1
1      1                                    1   1       1
1      1                                    1   1       1
1      --------------------------------------   v       1
1 x0,y0^                                                1
1      <---------------xlength-------------->           1
--------------------------------------------------------- 1
1                                                  10,000
|
|
|
|
.end lit
|
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
|
The default window positions are read from a file "ANALMARG" when the
|
|
program is started. Users can have their own file if required.
|
|
As all the plots start
|
|
at the same position in x and have the same width, x0 and xlength are the
|
|
same for all options. Generally users will only want to change the start
|
|
level of the window y0 and its height ylength.
|
|
This option
|
|
allows users to change window positions whilst running the program.
|
|
The routine prompts first for the number of the option that the user wishes
to reposition; then for the y start and height; then for the x start and
|
|
length. Note that changes to the x values affect all options. If the user
|
|
types only carriage return for any value it will remain unchanged.
|
|
The cross-hair can be used to choose suitable heights.
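.para
The relationship between a window and the drawing board can be
illustrated with a Python sketch. The text does not say how the program
scales a plot into its window, so the linear mapping below, the class name
Window and the method to_board are all assumptions made for illustration.
.lit
```python
from dataclasses import dataclass


@dataclass
class Window:
    """A plot window in drawing board units (1-10,000 in x and y)."""
    x0: float        # x of the bottom left hand corner
    y0: float        # y of the bottom left hand corner
    xlength: float   # width of the window
    ylength: float   # height of the window

    def to_board(self, position, value, region_start, region_end, vmin, vmax):
        """Map a sequence position and a plotted value to board units.

        Assumes a simple linear scaling of the active region onto the
        window width and of vmin..vmax onto the window height.
        """
        fx = (position - region_start) / (region_end - region_start)
        fy = (value - vmin) / (vmax - vmin)
        return self.x0 + fx * self.xlength, self.y0 + fy * self.ylength


if __name__ == "__main__":
    window = Window(x0=1000, y0=2000, xlength=8000, ylength=3000)
    print(window.to_board(51, 5, 1, 101, 0, 10))
```
.end lit
Changing only y0 and ylength moves a plot up or down and alters its
height while leaving its horizontal extent aligned with the other plots.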
|
|
.LEFT MARGIN1
|
|
@15. TX 2 @Label a diagram
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine allows users to label any diagrams they have produced. They
|
|
are asked to type in a label. When the user types carriage return to finish
|
|
typing the label the cross-hair appears on the screen. The user can
|
|
position it anywhere on the screen. If the user types R (for right justify)
|
|
the label will be
|
|
written on the diagram with its right end at the cross-hair position.
|
|
If the user types L (for left justify) the label will be written on the
|
|
diagram with its left end at the cross hair position.
|
|
The cross-hair will then immediately reappear. The user may put the same
label on another part of the diagram as before or, on hitting the space
bar, will be asked whether to type in another label.
|
|
.left margin1
|
|
@16. TX 2 @Display a map
|
|
.LEFT MARGIN2
|
|
.para
|
|
It is often convenient to plot a map alongside graphed analysis in order
|
|
to
|
|
indicate features within the sequence. This function allows users to
|
|
draw
|
|
maps using files arranged in the form of EMBL feature tables. Of course the
EMBL tables are usually only used for nucleic acid sequence annotation but,
as long as the features are written in the correct format, they can be
|
|
employed by this routine. The map is composed of a line representing the
|
|
sequence and then further lines denoting the endpoints of each feature
|
|
the
|
|
user identifies. The user is asked to define the height at which the line
|
|
representing the sequence should be drawn; then for the feature height;
|
|
then for the features to plot.
|
|
.left margin1
|
|
@17. TX 1 5 @Short sequence search
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine is used to search for exact matches to short sequences. It is
|
|
equivalent to the restriction enzyme search in program NIP. It can
|
|
either list matches
|
|
or present the results graphically.
|
|
.PARA
|
|
Select from searching, screen clearing or file listing. Choose a file of
|
|
strings and the mode of output required.
|
|
.para
|
|
The files of short
|
|
sequences (strings) and their names
|
|
need to be arranged in a particular way. For example
|
|
.lit
|
|
ACID/D/E//
|
|
BASIC/R/K/H//
|
|
HYDRO/F/L/I/V/Y//
|
|
GLYCO/N-S/N-T//
|
|
+/R/K/H//
|
|
-/D/E//
|
|
.end lit
|
|
defines various groups of amino acids.
|
|
Each string or set of strings must be
|
|
preceded by a name, each string must be preceded and
|
|
terminated with a slash (/), and
|
|
each set of strings must end with 2 slashes. These collections of strings and their
|
|
names can be read from disk or entered from the keyboard. Two files
|
|
containing sequences are currently
|
|
available. One contains named groups of amino acids. The other simply
|
|
contains the names of all amino acids and gives a convenient way of
|
|
producing a plot of the positions of all the different
|
|
amino acids in the sequence.
|
|
The user can select strings
|
|
by name from these collections. Results can be displayed name by name
|
|
or all
|
|
together.
|
|
Strings entered from the keyboard need to be separated by slash
|
|
characters (/).
|
|
For the name by name search the output looks like:
|
|
.lit
|
|
MATCHES= 12
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
ACID E 7 7 1
|
|
ACID E 10 3 1
|
|
ACID E 24 14 1
|
|
ACID E 28 4 1
|
|
ACID D 36 8 1
|
|
ACID D 46 10 2
|
|
ACID E 51 5 2
|
|
ACID E 67 16 2
|
|
ACID D 69 2 2
|
|
ACID D 81 12 2
|
|
ACID E 84 3 2
|
|
ACID E 96 12 3
|
|
MATCHES= 10
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
BASIC K 13 13 1
|
|
BASIC R 15 2 1
|
|
BASIC H 26 11 1
|
|
BASIC R 40 14 1
|
|
BASIC H 42 2 2
|
|
BASIC R 59 17 2
|
|
BASIC R 68 9 2
|
|
BASIC K 87 19 2
|
|
BASIC K 89 2 2
|
|
BASIC R 93 4 2
|
|
MATCHES= 1
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
GLYCO NST 4 4 3
|
|
|
|
or when the results are ordered only on position the output looks like:
|
|
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
GLYCO NST 4 3
|
|
ACID E 7 3
|
|
ACID E 10 3
|
|
BASIC K 13 3
|
|
BASIC R 15 2
|
|
ACID E 24 9
|
|
BASIC H 26 2
|
|
ACID E 28 2
|
|
ACID D 36 8
|
|
BASIC R 40 4
|
|
BASIC H 42 2
|
|
ACID D 46 4
|
|
ACID E 51 5
|
|
BASIC R 59 8
|
|
.end lit
|
|
.LEFT MARGIN2
|
|
Graphical output marks the position of each string by a
|
|
short vertical line and gives its name at the left end of the
|
|
line. If the top of the screen is reached the program gives the user the
|
|
opportunity to take a hard copy and then will clear the screen and restart
|
|
plotting results at the original start position.
|
|
Note that any character in the string
that is not a recognisable protein symbol will be treated as a
wild card character and will match all
characters in the searched sequence.
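.para
The file layout and the wild card rule can be sketched in Python. This is
an illustration only: the function names parse_string_file and
find_matches are assumptions, and so is the set taken here as the
recognisable protein symbols. The example sequence is the start of the
one listed under "List a sequence".
.lit
```python
# Assumed set of recognisable protein symbols; anything else in a search
# string (for example "-") is treated as a wild card.
PROTEIN_SYMBOLS = set("ACDEFGHIKLMNPQRSTVWXYZ")


def parse_string_file(text):
    """Parse lines such as 'GLYCO/N-S/N-T//' into {name: [strings]}."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # The name is everything before the first slash (it may be "-").
        name, _, rest = line.partition("/")
        groups[name] = [s for s in rest.split("/") if s]
    return groups


def find_matches(sequence, strings):
    """Return (position, matched text) pairs, positions counted from 1."""
    sequence = sequence.upper()
    found = []
    for pos in range(len(sequence)):
        for s in strings:
            window = sequence[pos:pos + len(s)]
            if len(window) == len(s) and all(
                c.upper() not in PROTEIN_SYMBOLS or c.upper() == w
                for c, w in zip(s, window)
            ):
                found.append((pos + 1, window))
    return found


if __name__ == "__main__":
    groups = parse_string_file("ACID/D/E//\nBASIC/R/K/H//\nGLYCO/N-S/N-T//\n")
    seq = "MQLNSTEISELIKQRIAQFNVVSEAHNEGT"
    for name, strings in groups.items():
        print(name, find_matches(seq, strings))
```
.end lit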
|
|
.para
|
|
Typical dialogue follows.
.lit
|
|
|
|
Menus and their numbers are
|
|
m0 = This menu
|
|
m1 = General
|
|
m2 = Screen control
|
|
m3 = Statistical analysis of content
|
|
m4 = Structure
|
|
m5 = Search
|
|
? = Help
|
|
! = Quit
|
|
? Menu or option number=17
|
|
Search for short sequences
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =2
|
|
1 All acids
|
|
X 2 Named groups
|
|
3 Personal file
|
|
4 Keyboard
|
|
? 0,1,2,3,4 =
|
|
|
|
ACID/D/E//
|
|
BASIC/R/K/H//
|
|
HYDRO/F/L/I/V/Y//
|
|
GLYCO/N-S/N-T//
|
|
+/R/K/H//
|
|
-/D/E//
|
|
DIBASIC/RR/KK/RK/KR//
|
|
TURN/N/D/G/P/S//
|
|
BLOCK/A/Q/E/I/L/M/F/W/V//
|
|
INDIF/R/C/H/K/T/Y//
|
|
End of file
|
|
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
|
|
1 All acids
|
|
X 2 Named groups
|
|
3 Personal file
|
|
4 Keyboard
|
|
? 0,1,2,3,4 =
|
|
|
|
? (y/n) (y) All names n
|
|
? Name=acid
|
|
? Name=basic
|
|
? Name=glyco
|
|
? Name=
|
|
|
|
? (y/n) (y) Show results name by name
|
|
? (y/n) (y) List matches
|
|
|
|
searching
|
|
matches= 59
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
ACID E 7 7 1
|
|
ACID E 10 3 1
|
|
ACID E 24 14 1
|
|
ACID E 28 4 1
|
|
ACID D 36 8 1
|
|
ACID D 46 10 2
|
|
ACID E 51 5 2
|
|
ACID E 67 16 2
|
|
ACID D 69 2 2
|
|
ACID D 81 12 2
|
|
ACID E 84 3 2
|
|
ACID E 96 12 3
|
|
ACID D 116 20 3
|
|
... etc
|
|
matches= 61
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
BASIC K 13 13 1
|
|
BASIC R 15 2 1
|
|
BASIC H 26 11 1
|
|
BASIC R 40 14 1
|
|
BASIC H 42 2 2
|
|
BASIC R 59 17 2
|
|
...etc
|
|
matches= 2
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
GLYCO NST 4 4 3
|
|
GLYCO NQT 487 483 28
|
|
28 483
|
|
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
|
|
1 All acids
|
|
X 2 Named groups
|
|
3 Personal file
|
|
4 Keyboard
|
|
? 0,1,2,3,4 =
|
|
|
|
? (y/n) (y) Selected names
|
|
|
|
? Name=basic
|
|
? Name=glyco
|
|
? Name=
|
|
|
|
? (y/n) (y) Show results name by name n
|
|
? (y/n) (y) List matches
|
|
|
|
searching
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
GLYCO NST 4 3
|
|
BASIC K 13 9
|
|
BASIC R 15 2
|
|
BASIC H 26 11
|
|
BASIC R 40 14
|
|
BASIC H 42 2
|
|
BASIC R 59 17
|
|
BASIC R 68 9
|
|
BASIC K 87 19
|
|
...etc
|
|
BASIC R 477 14
|
|
BASIC H 479 2
|
|
GLYCO NQT 487 8
|
|
BASIC K 499 12
|
|
BASIC K 501 2
|
|
BASIC K 508 7
|
|
7
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
1 All acids
|
|
X 2 Named groups
|
|
3 Personal file
|
|
4 Keyboard
|
|
? 0,1,2,3,4 =4
|
|
Define search strings by typing a string name
|
|
followed by the string(s)
|
|
? Name=MARY
|
|
? String(s)=AL/VI
|
|
? Name=
|
|
? (y/n) (y) All names
|
|
? (y/n) (y) Show results name by name
|
|
? (y/n) (y) List matches
|
|
|
|
searching
|
|
matches= 12
|
|
NAME SEQUENCE POSITION FRAGMENT LENGTHS
|
|
MARY VI 38 38 10
|
|
MARY AL 63 25 13
|
|
MARY VI 136 73 16
|
|
MARY AL 177 41 19
|
|
MARY AL 217 40 25
|
|
MARY AL 233 16 37
|
|
MARY AL 243 10 40
|
|
MARY AL 256 13 41
|
|
MARY AL 326 70 45
|
|
MARY VI 345 19 51
|
|
MARY AL 396 51 70
|
|
MARY AL 470 74 73
|
|
|
|
|
|
.END LIT
|
|
|
|
.left margin1
|
|
@18. TX 1 5 @Compare a sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine slides a short sequence along the current sequence and finds
|
|
all positions at which a given percentage of the amino acids match.
|
|
Output is in both graphical and listed forms.
|
|
.para
|
|
If users call for dialogue when the routine is selected they will be given
|
|
the choice of keyboard or file input. Define the string, and the percentage
|
|
match. Matches will be plotted out and then the user can select to have
|
|
them listed. Then the routine cycles around.
|
|
.para
|
|
The routine slides the search string
|
|
along the sequence and marks the positions at which a minimum
|
|
percentage score is reached. The graphical output draws a vertical line at
|
|
the match position; the height of the line represents the percentage
|
|
score,
|
|
so that if the line reaches the top of the box the score is 100%.
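.para
The sliding comparison can be sketched in Python. The function name
percent_matches and the exact form of the percentage test are assumptions
made for illustration; how the program itself rounds or compares the
percentage is not stated here.
.lit
```python
def percent_matches(sequence, string, min_percent):
    """Slide string along sequence and report windows that match well.

    Returns (position, identical residues, percent) tuples, with
    positions counted from 1.
    """
    sequence, string = sequence.upper(), string.upper()
    n = len(string)
    results = []
    for pos in range(len(sequence) - n + 1):
        window = sequence[pos:pos + n]
        identical = sum(a == b for a, b in zip(window, string))
        percent = 100.0 * identical / n
        if percent >= min_percent:
            results.append((pos + 1, identical, percent))
    return results


if __name__ == "__main__":
    for pos, identical, percent in percent_matches("MAIAGALA", "aaa", 60):
        print(pos, identical, f"{percent:.1f}%")
```
.end lit
In a plot, the percent value would give the height of the line drawn at
each reported position.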
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
|
|
? Menu or option number=18
|
|
Find percentage matches
|
|
? (y/n) (y) Keep picture
|
|
|
|
? String=aaa
|
|
? Percent match (1.00-100.00) (70.00) =
|
|
|
|
missing graphics
|
|
|
|
Total scoring positions above 70.000 percent = 19
|
|
Scores 2 2 2 2 2 2 2 2 2 2
|
|
Positions 61 131 177 217 226 231 232 267 300 301
|
|
|
|
? Number to list (0-19) (0) =3
|
|
|
|
61
|
|
AIA
|
|
* *
|
|
aaa
|
|
1
|
|
|
|
131
|
|
AIA
|
|
* *
|
|
aaa
|
|
1
|
|
|
|
177
|
|
ALA
|
|
* *
|
|
aaa
|
|
1
|
|
? (y/n) (y) Keep picture n
|
|
|
|
Default String=aaa
|
|
? String=!
|
|
|
|
.end lit
|
|
|
|
.left margin1
|
|
@19. TX 1 5 @Compare a sequence using a score matrix
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine slides a short sequence along the current sequence and finds
|
|
all positions at which a given level of similarity (a cutoff score) is
|
|
reached. The score is defined by use of a score matrix (MDM78). Output is
|
|
in both graphical and listed forms.
|
|
.para
|
|
If users call for dialogue when the routine is selected they will be given
|
|
the choice of keyboard or file input. Define the string and the cutoff
|
|
score. Matches will be plotted out and then the user can select to have
|
|
them listed. Then the routine cycles around.
|
|
.para
|
|
The routine slides the search string
|
|
along the sequence and marks the positions at which the cutoff score
|
|
is achieved. The graphical output draws a vertical line at
|
|
the match position; the height of the line represents the score,
|
|
so that if the line reaches the top of the box the score is the maximum
|
|
possible.
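.para
The same sliding search with a score matrix can be sketched in Python.
The matrix values below are invented for the example and are NOT those of
MDM78; the function names window_score and score_matches are also
assumptions. Each aligned pair of amino acids contributes its matrix
value, and positions whose total reaches the cutoff are reported, best
score first.
.lit
```python
def window_score(window, string, matrix, default=0):
    """Sum the matrix values of the aligned pairs (matrix is symmetric)."""
    return sum(
        matrix.get((a, b), matrix.get((b, a), default))
        for a, b in zip(window, string)
    )


def score_matches(sequence, string, matrix, cutoff, default=0):
    """Return (position, score) pairs with score >= cutoff, best first."""
    sequence, string = sequence.upper(), string.upper()
    n = len(string)
    hits = []
    for pos in range(len(sequence) - n + 1):
        score = window_score(sequence[pos:pos + n], string, matrix, default)
        if score >= cutoff:
            hits.append((pos + 1, score))
    # sorted() is stable, so equal scores stay in order of position.
    return sorted(hits, key=lambda hit: -hit[1])


if __name__ == "__main__":
    # Toy values for illustration only.
    toy = {("A", "A"): 3, ("A", "S"): 2, ("A", "G"): 1}
    print("Maximum score=", window_score("AAA", "AAA", toy))
    print(score_matches("GAASAAA", "aaa", toy, cutoff=7))
```
.end lit
The maximum possible score is that of the string compared with itself,
which corresponds to a line reaching the top of the box.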
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
|
|
Menus and their numbers are
|
|
m0 = This menu
|
|
m1 = General
|
|
m2 = Screen control
|
|
m3 = Statistical analysis of content
|
|
m4 = Structure
|
|
m5 = Search
|
|
? = Help
|
|
! = Quit
|
|
? Menu or option number=19
|
|
Find matches using a score matrix
|
|
? (y/n) (y) Keep picture
|
|
|
|
? String=aaa
|
|
Minimum score= 12 Maximum score= 36
|
|
? Score (12-36) (36) =
|
|
|
|
missing graphics
|
|
|
|
For score 24 the number of matches= 507
|
|
scores 35 35 35 34 34 34 34 34 34 34
|
|
positions 226 231 379 112 133 202 227 267 378 380
|
|
|
|
? Number to list (0-507) (0) =3
|
|
|
|
226
|
|
ATA
|
|
* *
|
|
aaa
|
|
1
|
|
|
|
231
|
|
SAA
|
|
**
|
|
aaa
|
|
1
|
|
|
|
379
|
|
GAA
|
|
**
|
|
aaa
|
|
1
|
|
? (y/n) (y) Keep picture n
|
|
|
|
Default String=aaa
|
|
? String=!
|
|
.end lit
|
|
.left margin1
|
|
@20. TX 5 @Search for a motif using a weight matrix
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function performs searches for short sequence
|
|
motifs using an appropriate weight matrix. In addition it can be used to
|
|
create or modify weight matrices. In order to perform a search the only
|
|
input
|
|
required is the name of the file containing the weight matrix.
|
|
The results can be presented graphically or listed. The graphical
|
|
presentation will draw a line at the position of any matches found; the
|
|
height of the line is proportional to the score.
|
|
.para
|
|
For a search, select "use weight matrix", supply the name of the file
|
|
containing the weight matrix, and choose between having results plotted
|
|
or listed. If dialogue is requested when the function is selected users can
|
|
alter the cutoff score employed.
|
|
.para
|
|
To create a weight matrix several steps are involved. A file containing an
|
|
alignment of known motifs is required. (This file must be created before
|
|
the current option is selected. The format is as follows: each sequence is
|
|
written on a separate line with at least one space at the beginning; each
|
|
sequence is terminated by a space character, and can be followed by a
|
|
name. The sequences must be aligned.) Supply the name of the file of
|
|
aligned sequences. The program reads and displays the sequences. Choose
|
|
between "summing logs of weights" or summing weights (i.e. whether to
|
|
multiply or add weights). If logs are used all scores will be negative.
|
|
Choose if all positions in the set of aligned sequences should be used or
|
|
if a mask should be applied. If so selected, define a mask as a string of
|
|
symbols, in which symbol - means ignore and any other symbol means
|
|
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
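.para
The mask convention can be illustrated with a short sketch. The function
name mask_positions is hypothetical; the rule it applies (the symbol -
means ignore, any other symbol means use) is the one given above.
.lit
```python
def mask_positions(mask):
    """Return the 1-based motif positions that a mask string selects."""
    return [i for i, symbol in enumerate(mask, start=1) if symbol != "-"]


used = mask_positions("xx-x--abc")
ignored = [i for i in range(1, 10) if i not in used]
```
.end lit
.para
Here used holds positions 1, 2, 4, 7, 8 and 9, and ignored holds
positions 3, 5 and 6.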
|
|
.para
|
|
The program will calculate weights as the frequencies of each amino
|
|
acid at each unmasked position in the set of aligned sequences. These
|
|
weights are then applied to the set of aligned sequences to give a range
|
|
of "observed" scores. The mean and standard deviation of these scores are
|
|
displayed. The user is asked to supply several values to be used when the
|
|
weight matrix is applied to other sequences: a cutoff score (by default,
|
|
the mean minus 3 standard deviations), a top score for scaling graphical
|
|
results (by default, the mean plus 3 standard deviations), and a position
|
|
to identify (this means that if a particular amino acid within the motif
|
|
is used as a "landmark", such as the G of the helix-turn-helix motif, then
|
|
its position will be marked in plots). All these values are stored along
|
|
with the weight matrix. Finally supply the name of a file to contain the
|
|
weight matrix.
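.para
The steps above can be sketched as follows, using the first four
sequences of the helix-turn-helix alignment shown in the typical
dialogue. This is an illustration under stated assumptions, not the
program's own arithmetic: the add-one counts, the normalisation of
frequencies, the use of natural logs and the use of the population
standard deviation are all assumed, so the scores will not reproduce
those in the dialogue. The names build_log_weights and score are
hypothetical.
.lit
```python
import math
from statistics import mean, pstdev

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"


def build_log_weights(aligned, mask=None):
    """Return one {residue: log frequency} table per unmasked position."""
    length = len(aligned[0])
    use = [True] * length if mask is None else [m != "-" for m in mask]
    weights = []
    for pos in range(length):
        if not use[pos]:
            weights.append(None)  # masked position contributes nothing
            continue
        counts = {aa: 1 for aa in ALPHABET}  # assumed add-one counts
        for seq in aligned:
            counts[seq[pos]] += 1
        total = sum(counts.values())
        weights.append({aa: math.log(n / total) for aa, n in counts.items()})
    return weights


def score(window, weights):
    """Sum the log weights over the unmasked positions of one window."""
    return sum(w[aa] for aa, w in zip(window, weights) if w is not None)


aligned = [
    "QESVADKMGMGQSGVGALFN",
    "QTKTAKDLGVYQSAINKAIH",
    "QAALGKMVGVSNVAISQWQR",
    "QRAVAKALGISDAAVSQWKE",
]
mask = "--" + "x" * 12 + "-" * 6  # use positions 3 to 14 only
weights = build_log_weights(aligned, mask)
scores = [score(seq, weights) for seq in aligned]
centre, spread = mean(scores), pstdev(scores)
cutoff = centre - 3 * spread  # default cutoff score
top = centre + 3 * spread     # default top score for scaling plots
```
.end lit
.para
Because every weight is the log of a frequency below 1, all scores are
negative, as noted above for the "summing logs of weights" choice.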
|
|
.para
|
|
Weight matrices can be "rescaled" using a set of aligned sequences in
|
|
much the same way as a matrix is created. The purpose is to redefine
|
|
the cutoff scores, and rescaling does not alter any other values in the
|
|
weight matrix file.
|
|
.para
|
|
The methods have changed considerably but were first outlined in
|
|
Staden, R. Nucl. Acids Res. 12, 505-519 (1984), and
|
|
Staden, R. Genetic
|
|
engineering: principles and methods vol 7, Edited by J.K. Setlow and A.
|
|
Hollaender, Plenum publishing corp., 1985.
|
|
.para
|
|
The methods have always had to deal with the problem of zeroes in the
|
|
matrices. The current versions
|
|
employ "Laplace's Law of Succession" in which 1 is
|
|
added to each term.
|
|
.para
|
|
It is now possible to apply a mask to a set of aligned sequences in
|
|
order to give weight to selected positions only.
|
|
Sequences have superimposed functions: some parts may be of general
|
|
structural
|
|
importance and give rise to an overall framework, and other parts give
|
|
specificity and hence are not common; we may want to use a set of
|
|
aligned
|
|
sequences to define a motif, but want to use only the framework
|
|
positions.
|
|
Alternatively we may want to pick out
|
|
only those parts of a set of aligned sequences that give a particular
|
|
property, and to ignore other similarities that are due to some other
|
|
property
|
|
and which could obscure the pattern
|
|
we are interested in. The ability to define a mask allows certain
|
|
positions
|
|
to be used in the motif and others to be ignored, and yet still permits the
|
|
use of a set of aligned sequences to calculate weights.
|
|
.para
|
|
Typical dialogue is shown below.
|
|
.lit
|
|
? Menu or option number=20
|
|
X 1 Use weight matrix
|
|
2 Make weight matrix
|
|
3 Rescale weight matrix
|
|
? 0,1,2,3 =2
|
|
? Name of aligned sequences file=[rs.motifs]hth.seq
|
|
1 QESVADKMGMGQSGVGALFN LAMBDA.REP
|
|
2 QTKTAKDLGVYQSAINKAIH LAMBDA.CRO
|
|
3 QAALGKMVGVSNVAISQWQR P22.REP
|
|
4 QRAVAKALGISDAAVSQWKE P22.CRO
|
|
5 QAELAQKVGTTQQSIEQLEN 434.REP
|
|
6 QTELATKAGVKQQSIQLIEA 434.CRO
|
|
7 RQEIGQIVGCSRETVGRILK CAP
|
|
8 RGDIGNYLGLTVETISRLLG Fnr
|
|
9 LYDVAEYAGVSYQTVSRVVN LAC.R
|
|
10 IKDVARLAGVSVATVSRVIN GAL.R
|
|
11 TEKTAEAVGVDKSQISRWKR LAMBDA.CII
|
|
12 QRKVADALGINESQISRWKG P22.CI
|
|
13 KEEVAKKCGITPLQVRVWCN MAT.ALPHA
|
|
14 TRKLAQKLGVEQPTLYWHVK TETR.TN10
|
|
15 TRRLAERLGVQQPALYWHFK TETR.pSC1
|
|
16 QRELKNELGAGIATITRGSN TRP.REP
|
|
17 RQQLAIIFGIGVSTLYRYFP H-INVERSN
|
|
18 ATEIAHQLSIARSTVYKILE TN3.RESOL
|
|
19 ASHISKTMNIARSTVYKVIN GD.RESOLV
|
|
20 IASVAQHVCLSPSRLSHLFR ARA.C
|
|
21 RAEIAQRLGFRSPNAAEEHL LEX.R
|
|
Length of motif 20
|
|
? (y/n) (y) Sum logs of weights
|
|
? (y/n) (y) Use all motif positions n
|
|
x means use, - means ignore
|
|
e.g. xx-x---x-x means use positions 1,2,4,8,10
|
|
? Mask=--xxxxxxxxxxxx------
|
|
Applying weights to input sequences
|
|
1 -57.143 QESVADKMGMGQSGVGALFN
|
|
2 -55.087 QTKTAKDLGVYQSAINKAIH
|
|
3 -58.079 QAALGKMVGVSNVAISQWQR
|
|
4 -54.986 QRAVAKALGISDAAVSQWKE
|
|
5 -55.181 QAELAQKVGTTQQSIEQLEN
|
|
6 -55.874 QTELATKAGVKQQSIQLIEA
|
|
7 -56.692 RQEIGQIVGCSRETVGRILK
|
|
8 -57.722 RGDIGNYLGLTVETISRLLG
|
|
9 -55.363 LYDVAEYAGVSYQTVSRVVN
|
|
10 -55.769 IKDVARLAGVSVATVSRVIN
|
|
11 -56.786 TEKTAEAVGVDKSQISRWKR
|
|
12 -55.833 QRKVADALGINESQISRWKG
|
|
13 -56.279 KEEVAKKCGITPLQVRVWCN
|
|
14 -53.125 TRKLAQKLGVEQPTLYWHVK
|
|
15 -55.833 TRRLAERLGVQQPALYWHFK
|
|
16 -58.651 QRELKNELGAGIATITRGSN
|
|
17 -56.749 RQQLAIIFGIGVSTLYRYFP
|
|
18 -56.986 ATEIAHQLSIARSTVYKILE
|
|
19 -60.618 ASHISKTMNIARSTVYKVIN
|
|
20 -58.988 IASVAQHVCLSPSRLSHLFR
|
|
21 -58.002 RAEIAQRLGFRSPNAAEEHL
|
|
Top score -53.125 Bottom score -60.618
|
|
Mean -56.655 Standard deviation 1.617
|
|
Mean minus 3.sd -61.505 Mean plus 3.sd -51.804
|
|
? Cutoff score (-999.00-9999.00) (-61.51) =
|
|
? Top score for scaling plots (-61.51-999.00) (-51.80) =
|
|
? Position to identify (0-20) (1) =9
|
|
? Title=hth
|
|
? Name for new weight matrix file=1.wts
|
|
|
|
Menus and their numbers are
|
|
m0 = This menu
|
|
m1 = General
|
|
m2 = Screen control
|
|
m3 = Statistical analysis of content
|
|
m4 = Structure
|
|
m5 = Search
|
|
? = Help
|
|
! = Quit
|
|
? Menu or option number=20
|
|
X 1 Use weight matrix
|
|
2 Make weight matrix
|
|
3 Rescale weight matrix
|
|
? 0,1,2,3 =
|
|
|
|
? Motif weight matrix file=1.wts
|
|
hth
|
|
? (y/n) (y) Use frequencies as weights
|
|
? (y/n) (y) Plot results n
|
|
5 -61.46 STEISELIKQRIAQFNVVSE
|
|
13 -58.93 KQRIAQFNVVSEAHNEGTIV
|
|
21 -60.42 VVSEAHNEGTIVSVSDGVIR
|
|
57 -59.39 GNRYAIALNLERDSVGAVVM
|
|
59 -61.47 RYAIALNLERDSVGAVVMGP
|
|
79 -59.90 YADLAEGMKVKCTGRILEVP
|
|
88 -61.41 VKCTGRILEVPVGRGLLGRV
|
|
104 -60.38 LGRVVNTLGAPIDGKGPLDH
|
|
127 -60.13 SAVEAIAPGVIERQSVDQPV
|
|
129 -59.91 VEAIAPGVIERQSVDQPVQT
|
|
133 -60.79 APGVIERQSVDQPVQTGYKA
|
|
139 -61.12 RQSVDQPVQTGYKAVDSMIP
|
|
175 -58.90 KTALAIDAIINQRDSGIKCI
|
|
191 -60.95 IKCIYVAIGQKASTISNVVR
|
|
195 -60.94 YVAIGQKASTISNVVRKLEE
|
|
215 -60.66 HGALANTIVVVATASESAAL
|
|
254 -60.56 EDALIIYDDLSKQAVAYRQI
|
|
260 -60.08 YDDLSKQAVAYRQISLLLRR
|
|
297 -61.00 LLERAARVNAEYVEAFTKGE
|
|
314 -61.29 KGEVKGKTGSLTALPIIETQ
|
|
330 -60.49 IETQAGDVSAFVPTNVISIT
|
|
363 -57.63 GIRPAVNPGISVSRVGGAAQ
|
|
365 -61.48 RPAVNPGISVSRVGGAAQTK
|
|
371 -61.02 GISVSRVGGAAQTKIMKKLS
|
|
382 -57.90 QTKIMKKLSGGIRTALAQYR
|
|
394 -60.07 RTALAQYRELAAFSQFASDL
|
|
424 -59.95 GQKVTELLKQKQYAPMSVAQ
|
|
430 -58.89 LLKQKQYAPMSVAQQSLVLF
|
|
432 -61.14 KQKQYAPMSVAQQSLVLFAA
|
|
438 -58.58 PMSVAQQSLVLFAAERGYLA
|
|
458 -61.06 DVELSKIGSFEAALLAYVDR
|
|
466 -61.00 SFEAALLAYVDRDHAPLMQE
|
|
483 -60.48 MQEINQTGGYNDEIEGKLKG
|
|
494 -60.61 DEIEGKLKGILDSFKATQSW
|
|
|
|
Menus and their numbers are
|
|
m0 = This menu
|
|
m1 = General
|
|
m2 = Screen control
|
|
m3 = Statistical analysis of content
|
|
m4 = Structure
|
|
m5 = Search
|
|
? = Help
|
|
! = Quit
|
|
? Menu or option number=d20
|
|
X 1 Use weight matrix
|
|
2 Make weight matrix
|
|
3 Rescale weight matrix
|
|
? 0,1,2,3 =
|
|
|
|
? Motif weight matrix file=1.wts
|
|
hth
|
|
? (y/n) (y) Use frequencies as weights
|
|
? Cutoff score (-9999.00-9999.00) (-61.51) =-56.
|
|
? (y/n) (y) Plot results n
|
|
|
|
|
|
.end lit
|
|
.left margin1
|
|
@21. TX 3 @Calculate amino acid composition
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function calculates the amino acid composition and molecular
|
|
weight
|
|
for the active region.
|
|
.lit
|
|
? Menu or option number=21
|
|
Sequence composition
|
|
|
|
A C S T P A G N D E Q B Z H
|
|
N 3. 32. 23. 18. 57. 47. 16. 28. 31. 28. 0. 0. 7.
|
|
% 0.6 6.2 4.5 3.5 11.1 9.1 3.1 5.4 6.0 5.4 0.0 0.0 1.4
|
|
W 309. 2786. 2325. 1748. 4051. 2682. 1826. 3222. 4003. 3588. 0. 0. 960.
|
|
|
|
A R K M I L V F Y W - X ?
|
|
N 30. 24. 11. 40. 47. 41. 14. 15. 1. 0. 0. 0. 1.
|
|
% 5.8 4.7 2.1 7.8 9.1 8.0 2.7 2.9 0.2 0.0 0.0 0.0 0.2
|
|
W 4686. 3076. 1443. 4527. 5319. 4065. 2060. 2448. 186. 0. 0. 0. 0.
|
|
Total molecular weight= 55328.
|
|
|
|
.end lit
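.para
The three rows of the listing (N, % and W) can be sketched as below.
The residue masses are standard average values and appear consistent
with the W row above (for example 3 cysteines giving 309), but whether
the program adds a water molecule to the total is not stated, so none is
added here. The function name composition is hypothetical.
.lit
```python
from collections import Counter

# Average residue masses (amino acid minus water), in daltons.
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}


def composition(sequence):
    """Return ({residue: (count, percent, weight)}, total weight)."""
    counts = Counter(sequence.upper())
    total = sum(counts.values())
    rows = {}
    for residue, n in counts.items():
        # Symbols with no known mass (e.g. X or ?) contribute zero weight.
        rows[residue] = (n, 100.0 * n / total, n * RESIDUE_MASS.get(residue, 0.0))
    return rows, sum(row[2] for row in rows.values())


rows, weight = composition("AAC")
```
.end lit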
|
|
.left margin1
|
|
@22. TX 3 4 @Plot hydrophobicity
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine plots the hydrophobicity of each section of the sequence
|
|
using
|
|
the hydrophobicity
|
|
values of Kyte and Doolittle (J. Mol. Biol. 157, 105-132 (1982)).
|
|
A window of size span is slid along the sequence and a sum calculated
|
|
for
|
|
each position.
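.para
A minimal sketch of the sliding-window sum follows. The table holds the
Kyte and Doolittle hydropathy values; attaching each sum to the centre
of its window is an assumption, and the name window_sums is
hypothetical.
.lit
```python
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def window_sums(sequence, values, span=11, interval=3):
    """Return (position, sum) for an odd-length window slid along the sequence."""
    half = span // 2
    results = []
    for centre in range(half, len(sequence) - half, interval):
        window = sequence[centre - half:centre + half + 1]
        results.append((centre + 1, sum(values[aa] for aa in window)))
    return results


profile = window_sums("AAAAA", KYTE_DOOLITTLE, span=3, interval=1)
```
.end lit
.para
With the default span of 11 the largest possible sum is 11 x 4.5 = 49.5
(eleven isoleucines).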
|
|
.para
|
|
If dialogue is requested select a span length and a plot interval.
|
|
.para
|
|
The diagrams are on the same scale as Fig. 6 of the Kyte and Doolittle
|
|
paper and values of + and - 50 could be assigned to the top and bottom of
|
|
the diagram with corresponding values in between (-40,-20,0,20,40 are
|
|
shown
|
|
in the paper).
|
|
.lit
|
|
? Menu or option number=d22
|
|
Plot hydrophobicity
|
|
? odd span length (1-101) (11) =
|
|
? plot interval (1-101) (3) =
|
|
|
|
missing graphics
|
|
.end lit
|
|
.LEFT MARGIN1
|
|
@23. TX 3 4 @Plot charge
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine plots the charge of each section of the sequence.
|
|
A window of size span is slid along the sequence and a sum calculated
|
|
for
|
|
each position. Amino acids are assigned charges of 1, -1 or 0.
|
|
.para
|
|
If dialogue is requested select a span length and a plot interval.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
|
|
? Menu or option number=d23
|
|
Plot charge
|
|
? odd span length (1-101) (11) =
|
|
? plot interval (1-101) (3) =
|
|
|
|
missing graphics
|
|
|
|
.end lit
|
|
.LEFT MARGIN1
|
|
@24. TX 4 @Plot Robson prediction
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine uses the method of Garnier J, Osguthorpe D J, and Robson B.
|
|
(1978) J. Mol. Biol. 120, 97-120 to predict secondary structures. The
|
|
method divides protein secondary structures into 4 classes: helix,
|
|
extended
|
|
(usually referred to as sheet), turn and coil. The routine calculates the
|
|
likelihood that each segment of the sequence lies in each of these
|
|
classes. Results are presented graphically or listed.
|
|
.para
|
|
If dialogue is requested choose between plotted or listed output.
|
|
.para
|
|
Each residue
|
|
has a
|
|
certain probability of being found in each of the 4 classes. This
|
|
probability
|
|
depends both on its own amino acid type and also the 8
|
|
amino acids found to either side along the protein chain. Four tables of
|
|
weights, each 20 by 17 elements are used to calculate the likelihood that
|
|
each residue along the chain falls into one of the four classes of
|
|
structure. The most likely structure at each point
|
|
is the one with the highest score.
|
|
The four values are plotted in strips labelled H, E, T and C.
|
|
Below, a strip labelled D for decision is divided into four levels, each
|
|
corresponding to one of the four structure types. Their top to bottom
|
|
order
|
|
is the same as that for the strips above, i.e. C, T, E, and H. For each
|
|
residue the program determines which of the four likelihoods is highest. It
|
|
places a single dot at the
|
|
mid-point of the corresponding strip, and
|
|
also at the
|
|
appropriate level in the strip labelled D.
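.para
The decision step can be sketched as taking the class with the highest
of the four values. The rows used here are copied from the listing in
the typical dialogue for this option; reading the four columns as helix,
extended, turn and coil, in that order, is an assumption, as is the
function name decide.
.lit
```python
CLASSES = ("H", "E", "T", "C")  # assumed column order of the listing


def decide(scores):
    """Return the class label whose likelihood value is highest."""
    best = max(range(len(CLASSES)), key=lambda i: scores[i])
    return CLASSES[best]


# (position, residue, four likelihood values) taken from the listed output.
rows = [
    (9, "S", (217, -7, -39, 15)),
    (19, "F", (-34, 86, 47, 74)),
    (23, "S", (96, 38, 26, 155)),
    (36, "D", (-60, 1, 86, 55)),
]
decisions = [(position, decide(scores)) for position, _, scores in rows]
```
.end lit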
|
|
.PARA
|
|
It should be noted that the method, when tested by Kabsch W and Sander
|
|
C,
|
|
(1983) FEBS Lett. 155, 179-182, although one of the better ones, was
|
|
correct for only about 56% of residues.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=d24
|
|
Plot Robson secondary structure predictions
|
|
? (y/n) (y) Plot results n
|
|
|
|
9 S 217 -7 -39 15
|
|
10 E 226 5 -27 -39
|
|
11 L 233 -7 -26 -15
|
|
12 I 229 -23 9 4
|
|
13 K 214 -8 10 -8
|
|
14 Q 178 42 19 5
|
|
15 R 131 54 16 3
|
|
16 I 86 42 -31 -23
|
|
17 A 55 52 -30 -15
|
|
18 Q 15 67 4 25
|
|
19 F -34 86 47 74
|
|
20 N -41 74 17 106
|
|
21 V -16 118 -5 100
|
|
22 V 64 88 5 115
|
|
23 S 96 38 26 155
|
|
24 E 133 -25 13 96
|
|
25 A 118 -98 25 100
|
|
26 H 110 -150 37 86
|
|
27 N 57 -201 37 66
|
|
28 E 51 -140 11 -4
|
|
29 G 2 -77 37 9
|
|
30 T 2 28 28 7
|
|
31 I -11 117 -21 22
|
|
32 V -23 178 -55 5
|
|
33 S -54 193 -14 35
|
|
34 V -46 123 5 30
|
|
35 S -54 53 51 80
|
|
36 D -60 1 86 55
|
|
37 G -66 8 57 49
|
|
38 V -1 128 -30 -5
|
|
39 I 11 212 -56 -33
|
|
40 R 16 204 -44 -57
|
|
...etc
|
|
|
|
.end lit
|
|
.LEFT MARGIN1
|
|
@26. TX 4 @Draw a helix wheel
|
|
.LEFT MARGIN2
|
|
.para
|
|
A helical representation of segments of the sequence is shown. The
|
|
display
|
|
includes a schematic of the helix showing the links between residues,
|
|
with
|
|
each vertex numbered according to position; the sequence element at
|
|
each
|
|
vertex; a symbol denoting a classification as hydrophobic(.), positively
|
|
charged(+), negatively charged(-), or otherwise( ). The
|
|
residue number of the first sequence element in
|
|
the current window is displayed at the top-left-hand
|
|
corner of the diagram. Also at the top-left corner the sequence in the
|
|
current window is listed. Below this is the total hydrophobicity and
|
|
hydrophobic moment for the window calculated according to Eisenberg et
|
|
al.,
|
|
J. Mol. Biol. 179, 125-142 (1984).
|
|
.para
|
|
If dialogue is requested the user is asked for the angle to define the turn
|
|
between residues as seen
|
|
looking along the helix, and a window length. The window length can be up
|
|
to 60, with default 18, and the angle has a default of 100 degrees. Note
|
|
that 18 x 100 is 5 turns. When the option is selected the first segment in
|
|
the current active region is displayed then the bell rings. If the user
|
|
types only return, the display will click on by one residue; if another
|
|
number is typed, say N, then the display will click forwards (or
|
|
backwards
|
|
if N is negative) by N residues. If the wheel runs off either end of the
|
|
sequence the option will be exited.
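.para
The geometry of the wheel can be sketched as below: each residue is
turned by the chosen angle relative to the one before it, as seen
looking along the helix. The function name wheel_positions and the
choice of a unit circle are assumptions; the 18-residue window is the
start of a segment shown elsewhere in this document.
.lit
```python
import math


def wheel_positions(window, angle_degrees=100.0):
    """Return (index, residue, angle, x, y) for each residue on a unit wheel."""
    positions = []
    for index, residue in enumerate(window):
        angle = (index * angle_degrees) % 360.0
        x = math.sin(math.radians(angle))
        y = math.cos(math.radians(angle))
        positions.append((index + 1, residue, angle, x, y))
    return positions


wheel = wheel_positions("STEISELIKQRIAQFNVV")
turns = len(wheel) * 100.0 / 360.0  # 18 x 100 degrees
```
.end lit
.para
With the defaults, the 18 residues complete 5 turns and each falls on a
different spoke of the wheel, 20 degrees apart.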
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=d26
|
|
? Angle (1-130) (100) =
|
|
? Window (1-60) (18) =
|
|
|
|
missing graphics
|
|
|
|
.end lit
|
|
.left margin1
|
|
@25. TX 3 4 @Plot hydrophobic moment
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine plots hydrophobic moment and hydrophobicity according to
|
|
Eisenberg et al.,
|
|
J. Mol. Biol. 179, 125-142 (1984). The mean hydrophobicity per residue in
|
|
the window is plotted on a scale -1.0 to 1.5, and the mean hydrophobic
|
|
moment per residue on a scale 0.0 to 1.5.
|
|
The hydrophobicity is shown in the top frame with the
|
|
hydrophobic moment below.
|
|
The plot is arranged so that the
|
|
value shown at position x represents the mean value for residues x-window+1
|
|
to x, where window is the window length.
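.para
A sketch of the calculation for one window, and of the way values are
attached to the last residue of each window, is given below. The moment
is taken as the length of the vector sum of the per-residue
hydrophobicities, each turned by the chosen angle, divided by the window
length. The function names are hypothetical, and the hydrophobicity
values are supplied by the caller rather than taken from any particular
scale.
.lit
```python
import math


def mean_h_and_moment(values, angle_degrees=100.0):
    """Return (mean hydrophobicity, mean hydrophobic moment) for one window."""
    n = len(values)
    delta = math.radians(angle_degrees)
    sin_sum = sum(h * math.sin(i * delta) for i, h in enumerate(values))
    cos_sum = sum(h * math.cos(i * delta) for i, h in enumerate(values))
    return sum(values) / n, math.hypot(sin_sum, cos_sum) / n


def moment_profile(values, window=18, angle_degrees=100.0):
    """Return (x, <H>, <M>) where x is the last residue of each window."""
    profile = []
    for x in range(window, len(values) + 1):  # x is 1-based
        h, m = mean_h_and_moment(values[x - window:x], angle_degrees)
        profile.append((x, h, m))
    return profile


# Twenty identical (hypothetical) hydrophobicity values.
profile = moment_profile([1.0] * 20)
```
.end lit
.para
A window of identical residues has no preferred side, so its moment is
zero while its mean hydrophobicity equals the per-residue value.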
|
|
.para
|
|
If dialogue is requested the user can select a window
|
|
length, and the angle used for the hydrophobic moment
|
|
calculation.
|
|
.para
|
|
Note that according to Eisenberg et al., in transmembrane proteins an
|
|
"initiator" is required. This is either a very hydrophobic single helix
|
|
with <H> >=0.68, or a moderately hydrophobic pair of helices whose <H>
|
|
sum
|
|
to >= 1.1. Other helices are then accepted as transmembrane if their <H>
>= 0.42.
|
|
.para
|
|
The following rules are claimed: if <H> < 0.51 and points lie below the
|
|
line <M> = -0.392 + 0.603 x <H> they are "globular", if they lie above this
|
|
line they are "surface". If <H> > 0.51 and they lie below the line <M> =
|
|
0.6 - 0.342 x <H> they are "monomeric", if above "multimeric".
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
|
|
? Menu or option number=d25
|
|
? Angle (1-130) (100) =
|
|
? Window (1-60) (18) =
|
|
? Plot interval (1-101) (3) =
|
|
|
|
missing graphics
|
|
|
|
|
|
.end lit
|
|
.left margin1
|
|
@27. TX 1 @Back translate to DNA
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine back translates protein sequences into DNA using the
|
|
standard
|
|
genetic code. The level of redundancy can be plotted and the
|
|
backtranslation saved to a file.
|
|
.para
|
|
The translation can use either the IUB symbols shown below, or a set of
|
|
codon
|
|
preferences. If a set of codon preferences is used it must conform to
|
|
the format of codon tables produced by the nucleotide analysis
|
|
program, and the back
|
|
translation
|
|
will contain the favoured codons. If there is no favoured codon
|
|
the IUB symbols will be employed. The window length for
|
|
plotting the redundancy is in codons.
|
|
.para
|
|
The program will plot the redundancy along the sequence and hence can
|
|
be
|
|
used to find the best sequences to use as primers. Note that the program
|
|
plots the inverse, and so the higher the
|
|
plot the LESS redundant the sequence. For primers look for peaks rather
|
|
than
|
|
troughs.
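.para
Back translation with IUB symbols, and an inverse redundancy measure,
can be sketched as below. The symbol for each amino acid is read off the
example output in the typical dialogue for this option; the codon counts
are those of the standard genetic code. How the program combines
redundancy over a window is not stated, so taking 1 over the product of
the codon counts is an assumption, as are the function names.
.lit
```python
IUB_CODON = {
    "A": "GCN", "R": "MGN", "N": "AAY", "D": "GAY", "C": "TGY",
    "Q": "CAR", "E": "GAR", "G": "GGN", "H": "CAY", "I": "ATH",
    "L": "YTN", "K": "AAR", "M": "ATG", "F": "TTY", "P": "CCN",
    "S": "WSN", "T": "ACN", "W": "TGG", "Y": "TAY", "V": "GTN",
}

# Number of codons for each amino acid in the standard genetic code.
CODON_COUNT = {
    "M": 1, "W": 1,
    "F": 2, "Y": 2, "H": 2, "Q": 2, "N": 2, "K": 2, "D": 2, "E": 2, "C": 2,
    "I": 3,
    "A": 4, "G": 4, "P": 4, "T": 4, "V": 4,
    "L": 6, "S": 6, "R": 6,
}


def back_translate(protein):
    """Return the IUB-symbol DNA sequence for a protein sequence."""
    return "".join(IUB_CODON[aa] for aa in protein.upper())


def inverse_redundancy(window):
    """Return 1 / (number of DNA sequences coding for the window)."""
    combinations = 1
    for aa in window.upper():
        combinations *= CODON_COUNT[aa]
    return 1.0 / combinations


dna = back_translate("MQLNSTEIS")
```
.end lit
.para
A window containing only methionine and tryptophan gives the highest
possible value of 1.0, which is why peaks rather than troughs mark the
best primer sites.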
|
|
.para
|
|
The DNA sequence can be saved to a file and analysed using the nucleotide
|
|
analysis program.
|
|
Depending on the application it is often useful to produce a back
|
|
translation using both a table of codon preferences and one using the IUB
|
|
symbols. This is because the restriction enzyme search program can
|
|
distinguish between definite and possible cuts in the sequence.
|
|
"Definite matches", as the program terms them, are ones in
|
|
which the specification of the recognition sequence corresponds
|
|
exactly to that of the back translation. The program will also find what
|
|
it
|
|
terms "possible matches" which are ones that depend on the particular
|
|
codons
|
|
chosen for each amino acid.
|
|
These are sites at which recognition
|
|
sequences could be engineered to produce a cut in the DNA
|
|
without changing the amino
|
|
acid, but which are not
|
|
necessarily found in the original sequence.
|
|
.LIT
|
|
|
|
|
|
NC-IUB SYMBOLS
|
|
|
|
A,C,G,T
|
|
R (A,G) 'puRine'
|
|
Y (T,C) 'pYrimidine'
|
|
W (A,T) 'Weak'
|
|
S (C,G) 'Strong'
|
|
M (A,C) 'aMino'
|
|
K (G,T) 'Keto'
|
|
H (A,T,C) 'not G'
|
|
B (G,C,T) 'not A'
|
|
V (G,A,C) 'not T'
|
|
D (G,A,T) 'not C'
|
|
N (G,A,C,T) 'aNy'
|
|
|
|
Typical dialogue follows.
|
|
|
|
? Menu or option number=d27
|
|
Back translate
|
|
? (y/n) (y) No codon preference
|
|
? (y/n) (y) Plot redundancy n
|
|
? (y/n) (y) Save DNA to disk
|
|
? File name for DNA sequence=tt:
|
|
ATGCARYTNAAYWSNACNGARATHWSNGARYTNATHAARCARMGNATHGCNCARTTYAAY
|
|
GTNGTNWSNGARGCNCAYAAYGARGGNACNATHGTNWSNGTNWSNGAYGGNGTNATHMGN
|
|
ATHCAYGGNYTNGCNGAYTGYATGCARGGNGARATGATHWSNYTNCCNGGNAAYMGNTAY
|
|
GCNATHGCNYTNAAYYTNGARMGNGAYWSNGTNGGNGCNGTNGTNATGGGNCCNTAYGCN
|
|
GAYYTNGCNGARGGNATGAARGTNAARTGYACNGGNMGNATHYTNGARGTNCCNGTNGGN
|
|
MGNGGNYTNYTNGGNMGNGTNGTNAAYACNYTNGGNGCNCCNATHGAYGGNAARGGNCCN
|
|
YTNGAYCAYGAYGGNTTYWSNGCNGTNGARGCNATHGCNCCNGGNGTNATHGARMGNCAR
|
|
WSNGTNGAYCARCCNGTNCARACNGGNTAYAARGCNGTNGAYWSNATGATHCCNATHGGN
|
|
MGNGGNCARMGNGARYTNATHATHGGNGAYMGNCARACNGGNAARACNGCNYTNGCNATH
|
|
GAYGCNATHATHAAYCARMGNGAYWSNGGNATHAARTGYATHTAYGTNGCNATHGGNCAR
|
|
AARGCNWSNACNATHWSNAAYGTNGTNMGNAARYTNGARGARCAYGGNGCNYTNGCNAAY
|
|
ACNATHGTNGTNGTNGCNACNGCNWSNGARWSNGCNGCNYTNCARTAYYTNGCNMGNATG
|
|
CCNGTNGCNYTNATGGGNGARTAYTTYMGNGAYMGNGGNGARGAYGCNYTNATHATHTAY
|
|
GAYGAYYTNWSNAARCARGCNGTNGCNTAYMGNCARATHWSNYTNYTNYTNMGNMGNCCN
|
|
CCNGGNMGNGARGCNTTYCCNGGNGAYGTNTTYTAYYTNCAYWSNMGNYTNYTNGARMGN
|
|
GCNGCNMGNGTNAAYGCNGARTAYGTNGARGCNTTYACNAARGGNGARGTNAARGGNAAR
|
|
ACNGGNWSNYTNACNGCNYTNCCNATHATHGARACNCARGCNGGNGAYGTNWSNGCNTTY
|
|
GTNCCNACNAAYGTNATHWSNATHACNGAYGGNCARATHTTYYTNGARACNAAYYTNTTY
|
|
AAYGCNGGNATHMGNCCNGCNGTNAAYCCNGGNATHWSNGTNWSNMGNGTNGGNGGNGCN
|
|
GCNCARACNAARATHATGAARAARYTNWSNGGNGGNATHMGNACNGCNYTNGCNCARTAY
|
|
MGNGARYTNGCNGCNTTYWSNCARTTYGCNWSNGAYYTNGAYGAYGCNACNMGNAARCAR
|
|
YTNGAYCAYGGNCARAARGTNACNGARYTNYTNAARCARAARCARTAYGCNCCNATGWSN
|
|
GTNGCNCARCARWSNYTNGTNYTNTTYGCNGCNGARMGNGGNTAYYTNGCNGAYGTNGAR
|
|
YTNWSNAARATHGGNWSNTTYGARGCNGCNYTNYTNGCNTAYGTNGAYMGNGAYCAYGCN
|
|
CCNYTNATGCARGARATHAAYCARACNGGNGGNTAYAAYGAYGARATHGARGGNAARYTN
|
|
AARGGNATHYTNGAYWSNTTYAARGCNACNCARWSNTGG---
|
|
|
|
|
|
.end lit
|
|
|
|
.LEFT MARGIN1
|
|
@28. TX 5 @Search for patterns of motifs
|
|
.left margin2
|
|
.para
|
|
This option searches for patterns of motifs. Patterns can be defined
|
|
interactively or read from files. Results can be displayed in several ways
|
|
in both graphical and textual form. The option can also be used to
create pattern files for
|
|
searching libraries. The option is extremely flexible and consequently the
|
|
following documentation is quite lengthy. However the routine is capable
|
|
of searching for almost any known pattern. In addition the flexibility
|
|
does not necessitate difficulty of use, and the user interface has been
|
|
simplified considerably since the methods were first published.
|
|
.para
|
|
Users should refer to the "typical dialogue" shown below for the most
|
|
helpful information on using the program.
|
|
.para
|
|
There are currently
|
|
four ways to display the matching patterns: 1=each individual
|
|
motif and its position is listed; 2=all the sequence between, and
|
|
including the two
|
|
outermost motifs is listed; 3=graphical, with a vertical line marking the
|
|
position
|
|
of the leftmost motif; 4 = EMBL feature table format, where the KEYNAM
|
|
field is the motif name, the FROM and TO fields denote the ends of the
|
|
match, and the DESCRIPTION field is "Program".
|
|
.para
|
|
When it is defined for the first time a pattern must be entered
|
|
interactively at the keyboard, but the pattern description
|
|
can be saved to a file.
|
|
This file can be used for all subsequent searches.
|
|
.para
|
|
When defining a pattern interactively
|
|
select a motif class and the program will request the required inputs.
|
|
.para
|
|
The program gives each motif an identifying name and number.
|
|
For motifs other than the first, a range of allowed positions must be
|
|
defined (Note that sets of motifs included using the OR operator will all
|
|
be given the same range, and so the program will only request range
|
|
values
|
|
for the first motif in any such set).
|
|
To specify the allowed range for a motif the user must supply the
|
|
following: the
|
|
identifying number of the motif, relative to which the current motif's
|
|
positions are to be defined (termed the "reference motif"); a "relative start
|
|
position" and a range. The relative start position can be negative or positive.
|
|
A negative start position means that although the reference motif
|
|
is searched for first, the current motif can be found to its left.
|
|
A zero relative start position means their left ends are superimposed. The
|
|
default start position is to butt-joint the motif to the right-hand end of the
|
|
"reference motif". The range is "the number of extra positions" that the
|
|
motif can take.
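.para
The range rule can be sketched as follows, taking positions relative to
the left end of the reference motif so that a relative start of zero
superimposes the two left ends, as described above. The function name
allowed_starts and the reference position of 100 are hypothetical.
.lit
```python
def allowed_starts(reference_start, relative_start, extra_positions):
    """Return the sequence positions at which the current motif may start."""
    first = reference_start + relative_start
    return range(first, first + extra_positions + 1)


# Reference motif found at position 100; relative start 21, 3 extra positions.
starts = list(allowed_starts(100, 21, 3))
# A negative relative start places the motif to the left of the reference.
left = list(allowed_starts(100, -5, 2))
```
.end lit
.para
A relative start position of 21 with 3 extra positions therefore allows
relative positions 21 to 24, which is how the pattern description in the
typical dialogue reports such a motif.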
|
|
.para
|
|
The program will display the probability of finding each motif. These
|
|
values are presented in the following form: .1234E-5 means 0.1234 times
|
|
10
|
|
to the power -5.
|
|
.para
|
|
After the pattern has been defined, the program will type a description
|
|
of
|
|
it on the screen. It will then allow the user to give an overall cutoff
|
|
score and overall probability cutoff.
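.para
The figures in the typical dialogue below are consistent with the
pattern probability being the product of the individual motif
probabilities. The sketch checks this using the six rounded values
printed in that dialogue; it is an observation about those numbers, not
a statement of how the program performs the calculation.
.lit
```python
import math

# Motif probabilities as printed in the dialogue for motifs 1 to 6.
motif_probabilities = [0.123e-01, 0.858e-02, 0.531e-02,
                       0.230e-04, 0.445e-01, 0.718e-02]

pattern_probability = math.prod(motif_probabilities)
reported = 0.4109e-14  # "Probability of finding pattern" in the dialogue
relative_difference = abs(pattern_probability - reported) / reported
```
.end lit
.para
The small remaining difference is what would be expected from
multiplying values that were rounded to three figures when printed.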
|
|
.para
|
|
Typical dialogue for all the different motif classes is displayed below.
|
|
.lit
|
|
|
|
? Menu or option number=28
|
|
Pattern searcher
|
|
? (y/n) (y) Read pattern from keyboard
|
|
X 1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Direct repeat
|
|
6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =
|
|
? Motif name=aa
|
|
? String=aa
|
|
Probability of score 2.0000 = 0.123E-01
|
|
X 1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Direct repeat
|
|
6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =2
|
|
? Motif name=pmatch
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-1) (1) =
|
|
? Relative start position (-1000-1000) (3) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? String=qqq
|
|
? Minimum matches (1.00-3.00) (3.00) =2
|
|
Probability of score 2.0000 = 0.858E-02
|
|
1 Exact match
|
|
X 2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Direct repeat
|
|
6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =3
|
|
? Motif name=sm
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-2) (2) =
|
|
? Relative start position (-1000-1000) (4) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? String=wqa
|
|
? Minimum score (11.00-53.00) (53.00) =36
|
|
Probability of score 36.0000 = 0.531E-02
|
|
1 Exact match
|
|
2 Percentage match
|
|
X 3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Direct repeat
|
|
6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =4
|
|
? Motif name=hth
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-3) (3) =
|
|
? Relative start position (-1000-1000) (4) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? Weight matrix file name=hth
|
|
HELIX TURN HELIX PABO SAUER WEIGHTS 17-11-87
|
|
Probability of score -51.5860 = 0.230E-04
|
|
1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
X 4 Cut-off score and weight matrix
|
|
5 Direct repeat
|
|
6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =5
|
|
? Motif name=repeat
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-4) (4) =
|
|
? Relative start position (-1000-1000) (21) =
|
|
? Number of extra positions (0-1000) (0) =3
|
|
? Repeat length (1-60) (6) =3
|
|
? Minimum gap (0-60) (0) =
|
|
? Maximum gap (0-60) (0) =2
|
|
? Minimum score (11.00-60.00) (36.00) =
|
|
Probability of score 36.0000 = 0.445E-01
|
|
1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
X 5 Direct repeat
|
|
6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =6
|
|
? Motif name=mset
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-5) (5) =
|
|
? Relative start position (-1000-1000) (1) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
X 1 Keyboard input
|
|
2 File input
|
|
? 0,1,2 =
|
|
Separate sets with commas
|
|
? String=AVL,AST,,WYRF
|
|
? Minimum matches (1.00-4.00) (4.00) =3
|
|
Probability of score 3.0000 = 0.718E-02
|
|
1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Direct repeat
|
|
X 6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =7
|
|
? (y/n) (y) Save pattern in a file
|
|
? Pattern definition file=EXAM.PAT
|
|
Motif 6 needs a file name to store set as a weight matrix
|
|
? Weight matrix file name=DEMO.WTS
|
|
Weight matrix needs a title
|
|
? Title=Demonstration class 6 weight matrix
|
|
|
|
Pattern description
|
|
|
|
Motif 1 named aa is of class 1
|
|
Which is an exact match to the string
|
|
aa
|
|
Motif 2 named pmatch is of class 2
|
|
which is a match of score 2. to the string
|
|
qqq
|
|
and the N-terminal residue can take positions 3 to 3
|
|
relative to the N-terminal end of motif 1
|
|
It is anded with the previous motif.
|
|
Motif 3 named sm is of class 3
|
|
which is a match of score 36. to the string
|
|
wqa
|
|
and the N-terminal residue can take positions 4 to 4
|
|
relative to the N-terminal end of motif 2
|
|
It is anded with the previous motif.
|
|
Motif 4 named hth is of class 4
|
|
Which is a match to a weight matrix with score -51.586
|
|
and the N-terminal residue can take positions 4 to 4
|
|
relative to the N-terminal end of motif 3
|
|
It is anded with the previous motif.
|
|
Motif 5 named repeat is of class 5
|
|
Which is a repeat with repeat length 3 and score 36.
|
|
The loop-out can have sizes 0 to 2
|
|
and the N-terminal residue can take positions 21 to 24
|
|
relative to the N-terminal end of motif 4
|
|
It is anded with the previous motif.
|
|
Motif 6 named mset is of class 6
|
|
Which is membership of a set with score 3.000
|
|
It is anded with the previous motif.
|
|
Probability of finding pattern = 0.4109E-14
|
|
Expected number of matches = 0.2539E-10
|
|
? Maximum pattern probability (0.00-1.00) (1.00) =
|
|
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
|
|
Select display mode
|
|
X 1 Motif by motif
|
|
2 Inclusive
|
|
3 Graphical
|
|
4 EMBL feature table
|
|
? 0,1,2,3,4 =
|
|
Searching
|
|
|
|
Total matches found 0
|
|
Menus and their numbers are
|
|
m0 = This menu
|
|
m1 = General
|
|
m2 = Screen control
|
|
m3 = Statistical analysis of content
|
|
m4 = Structure
|
|
m5 = Search
|
|
? = Help
|
|
! = Quit
|
|
? Menu or option number=6
|
|
Page through text files
|
|
? Name of file to read=exam.pat
|
|
A1 aa Class
|
|
aa
|
|
@ End of string
|
|
A2 pmatch Class
|
|
1 Relative motif
|
|
3 Relative start position
|
|
0 Number of extra positions
|
|
qqq
|
|
@ End of string
|
|
2.00000 Cutoff
|
|
A3 sm Class
|
|
2 Relative motif
|
|
4 Relative start position
|
|
0 Number of extra positions
|
|
wqa
|
|
@ End of string
|
|
36.00000 Cutoff
|
|
A4 hth Class
|
|
3 Relative motif
|
|
4 Relative start position
|
|
0 Number of extra positions
|
|
hth File name
|
|
A5 repeat Class
|
|
4 Relative motif
|
|
21 Relative start position
|
|
3 Number of extra positions
|
|
3 Length
|
|
0 Minimum loop
|
|
2 Maximum loop
|
|
36.00000 Cutoff
|
|
A6 mset Class
|
|
5 Relative motif
|
|
1 Relative start position
|
|
0 Number of extra positions
|
|
DEMO.WTS File name
|
|
End of file
|
|
Menus and their numbers are
|
|
m0 = This menu
|
|
m1 = General
|
|
m2 = Screen control
|
|
m3 = Statistical analysis of content
|
|
m4 = Structure
|
|
m5 = Search
|
|
? = Help
|
|
! = Quit
|
|
? Menu or option number=6
|
|
Page through text files
|
|
? Name of file to read=demo.wts
|
|
Demonstration class 6 weight matrix
|
|
4 0 3.000 4.000
|
|
P 1 2 3 4
|
|
N 0 0 0 0
|
|
C 0 0 0 0
|
|
S 0 1 0 0
|
|
T 0 1 0 0
|
|
P 0 0 0 0
|
|
A 1 1 0 0
|
|
G 0 0 0 0
|
|
N 0 0 0 0
|
|
D 0 0 0 0
|
|
E 0 0 0 0
|
|
Q 0 0 0 0
|
|
B 0 0 0 0
|
|
Z 0 0 0 0
|
|
H 0 0 0 0
|
|
R 0 0 0 1
|
|
K 0 0 0 0
|
|
M 0 0 0 0
|
|
I 0 0 0 0
|
|
L 1 0 0 0
|
|
V 1 0 0 0
|
|
F 0 0 0 1
|
|
Y 0 0 0 1
|
|
W 0 0 0 1
|
|
End of file
|
|
Menus and their numbers are
|
|
m0 = This menu
|
|
m1 = General
|
|
m2 = Screen control
|
|
m3 = Statistical analysis of content
|
|
m4 = Structure
|
|
m5 = Search
|
|
? = Help
|
|
! = Quit
|
|
? Menu or option number=28
|
|
Pattern searcher
|
|
? (y/n) (y) Read pattern from keyboard
|
|
X 1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Direct repeat
|
|
6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =2
|
|
? Motif name=avlst
|
|
? String=avlst
|
|
? Minimum matches (1.00-5.00) (5.00) =3
|
|
Probability of score 3.0000 = 0.394E-02
|
|
1 Exact match
|
|
X 2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Direct repeat
|
|
6 Membership of set
|
|
7 Pattern complete
|
|
? 0,1,2,3,4,5,6,7 =7
|
|
? (y/n) (y) Save pattern in a file n
|
|
|
|
Pattern description
|
|
|
|
Motif 1 named avlst is of class 2
|
|
which is a match of score 3. to the string
|
|
avlst
|
|
Probability of finding pattern = 0.3941E-02
|
|
Expected number of matches = 0.2030E+01
|
|
? Maximum pattern probability (0.00-1.00) (1.00) =
|
|
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
|
|
Select display mode
|
|
X 1 Motif by motif
|
|
2 Inclusive
|
|
3 Graphical
|
|
4 EMBL feature table
|
|
? 0,1,2,3,4 =4
|
|
Searching
|
|
|
|
FT avlst 152 156 Program
|
|
Total matches found 1
|
|
Minimum and maximum observed scores 3.00 3.00
|
|
|
|
.end lit
|
|
.para
|
|
General notes
|
|
.para
|
|
These methods allow users to define and search for
|
|
complex patterns of motifs defined as single objects.
|
|
The programs allow individual DNA motifs to be defined in eight
|
|
different
|
|
ways, and protein motifs in six. Motifs are combined, using the logical
|
|
operators AND, OR and NOT, to describe a pattern. The pattern also
|
|
specifies the ranges of allowed relative separations of the individual
|
|
motifs.
|
|
.para
|
|
First some definitions.
|
|
.para
|
|
A MOTIF is a contiguous subsequence of fixed length.
|
|
At its simplest
|
|
it could be a single definite base or amino acid; a more complex motif
|
|
might be better represented as a consensus or a weight matrix;
|
|
two more-abstract types of
|
|
motif are direct and inverted repeats.
|
|
.para
|
|
A PATTERN is a higher order of structure defined by a list of motifs. The
|
|
motifs in a pattern are combined using the logical operators AND, OR and
|
|
NOT. The list also defines the allowed relative separations of the
|
|
motifs. In the current versions of the programs up
|
|
to 50 motifs can be combined into a single pattern. So using these
|
|
definitions there are two
|
|
differences between motifs and patterns: 1) the distances between all
|
|
elements of a motif are fixed, but
|
|
the separations of parts of patterns can vary;
|
|
2) all characters in a motif are defined
|
|
using the same method (class), but different parts of a pattern can be
|
|
defined in completely different ways.
|
|
.para
|
|
Each motif
|
|
can be represented in 9 ways (known as the motif class):
|
|
.sk1
|
|
.lit
|
|
MOTIF CLASSES
|
|
CLASS DESCRIPTION
|
|
1 Exact match to a short defined sequence. The IUB symbols
|
|
can be used for DNA sequences.
|
|
2 Percentage match to a defined short sequence. In nucleic acids,
|
|
the IUB symbols can be used.
|
|
3 Match to a defined sequence, using a score matrix and cutoff
|
|
score. The DNA matrix (see option 18) gives scores to IUB symbols
|
|
depending on their level of redundancy. MDM78 is used for proteins.
|
|
4 Match to a weight matrix with cutoff score.
|
|
5 As class 4 but on the complementary strand.
|
|
6 Inverted repeat or stem-loop. Fixed stem length, range of
|
|
loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
|
|
7 Exact match to short sequence but with a defined step size.
|
|
8 Direct repeat. Fixed repeat length, range of loop-out sizes,
|
|
cutoff score, and score matrix (for protein sequences MDM78 and
|
|
for nucleic acids an identity matrix).
|
|
9 Membership of a set. A list of sets of allowed amino acids for
|
|
each position in the motif. The sets are separated by commas(,).
|
|
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
|
|
acids in which one of I,V or L must be found in the first position,
|
|
then anything in the next two positions, D,E,K or R in the fourth
|
|
position and F,Y,W,I,L,V or M in the fifth. This class only applies
|
|
to protein sequences because for nucleic acids "membership of a
|
|
set"
|
|
can be achieved using IUB symbols.
|
|
|
|
Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
|
|
nucleic acids.
|
|
|
|
.end lit
|
|
.para
|
|
Class 1: exact match.
|
|
.para
|
|
The motif is defined by a short sequence, which for nucleic acids,
|
|
may include IUB symbols. All symbols must match.
|
|
.para
|
|
Class 2: percentage match
|
|
.para
|
|
The motif is defined by a short sequence, which for nucleic acids,
|
|
may include IUB symbols. The minimum number of matching characters
|
|
must
|
|
also be specified.
|
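.para
As an illustration only, a percentage match can be sketched in Python as
below. The function name and the toy sequence are hypothetical, IUB symbols
are not handled, and this is not the program's own code.
.lit
```python
def percentage_matches(sequence, motif, min_matches):
    """Return (1-based start, identities) for windows with enough matches."""
    sequence, motif = sequence.upper(), motif.upper()
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        # Count positions at which the window and the motif agree.
        identities = sum(1 for a, b in zip(window, motif) if a == b)
        if identities >= min_matches:
            hits.append((start + 1, identities))
    return hits


# A toy sequence (made up for this sketch) searched with the motif "avlst".
print(percentage_matches("MKAVLSTGGAALTT", "avlst", 3))
```
.end lit
.para
Setting the minimum number of matches equal to the motif length makes this
behave like a class 1 exact match.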
|
.para
|
|
Class 3: match using a score matrix
|
|
.para
|
|
The motif is defined by a short sequence, which for nucleic acids,
|
|
may include IUB symbols. The motif is not compared directly with the
|
|
sequence to count the number of matching characters. Instead a matrix is
|
|
used to provide a score for all possible pairs of characters. The motif
|
|
score for
|
|
any position along the sequence is the sum of the scores found by
|
|
looking-up the scores for each pair of aligned characters. A match is
|
|
declared if some minimum score is achieved.
|
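.para
The scoring just described can be sketched as follows. The small score table
is invented for the example and is NOT MDM78; the function names and the
default score for unlisted pairs are likewise assumptions.
.lit
```python
# Illustrative pair scores only; the real programs use MDM78 for proteins.
TOY_SCORES = {
    ("A", "A"): 4, ("V", "V"): 4, ("L", "L"): 4,
    ("A", "V"): 2, ("V", "L"): 3, ("A", "L"): 1,
}


def pair_score(a, b, default=-1):
    """Look up a symmetric pair score; unlisted pairs get the default."""
    return TOY_SCORES.get((a, b), TOY_SCORES.get((b, a), default))


def score_matrix_matches(sequence, motif, cutoff):
    """Return (1-based start, score) where the summed pair score >= cutoff."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        window = sequence[start:start + len(motif)]
        total = sum(pair_score(a, b) for a, b in zip(window, motif))
        if total >= cutoff:
            hits.append((start + 1, total))
    return hits


print(score_matrix_matches("GAVLG", "AVL", 9))
```
.end lit
.para
Lowering the cutoff admits windows that contain only similar, rather than
identical, residues.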
|
.para
|
|
Class 4: weight matrix
|
|
.para
|
|
The motif is defined by a table of values (called weights or scores). The
|
|
table gives a score for finding each possible character at each position
|
|
along the length of the motif. It therefore
|
|
has dimension motif-length x character-set-size, and allows us to give
|
|
different scores for each character at each position. It is equivalent to
|
|
having a different score matrix for each position along the motif, and
|
|
provides the most flexible and specific method of defining motifs. The
|
|
weight matrices are created by program PIP option 20 and
|
|
stored as files. The file contains the values
|
|
for each position, as well as an overall minimum score.
|
|
There are two ways in which these values can be used to calculate an
|
|
overall
|
|
score for any section of the sequence. The simplest way is to add the
|
|
values in the file. (This means that the highest possible score
|
|
can be calculated by adding the top value at each column
|
|
position, and the lowest
|
|
by adding the bottom value.)
|
|
The normal way of using the values in the file is as
|
|
follows.
|
|
First the programs divide the values in each column by the column total
|
|
so
|
|
that they sum to 1.0.
|
|
Then the natural
|
|
logs of these values are used as scores. When the matrix is applied to a
|
|
sequence these logarithmic values are summed (which is of course
|
|
equivalent
|
|
to multiplying the frequencies).
|
|
Note that using the natural logs of the frequencies as
|
|
weights and
|
|
adding them means that the overall cutoff score must be less than zero,
|
|
whereas if the original
|
|
values in the weight matrix file are added, the cutoff score will be
|
|
greater than zero. The search routines therefore decide whether the user
|
|
wants to add values or multiply frequencies
|
|
by examining the value of the cutoff score: they add the values if the cutoff is greater than zero, and add the logs of the frequencies if it is less than zero.
|
|
Hence we effectively get two
|
|
motif classes in one. The program PIP, when creating weight matrix
|
|
files, will ask the user whether the scores should be added or multiplied.
|
|
If the values in the table have been defined
|
|
without using a set of aligned sequences
|
|
it is easier for the user to
|
|
choose a cutoff score if the values are added.
|
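.para
The two ways of using a weight matrix can be sketched as below. This is a
minimal sketch under stated assumptions: the matrix is held as a dictionary
of rows, the names are hypothetical, a cutoff of exactly zero is treated as
the logarithmic case, and a zero frequency is scored as minus infinity.
None of these details is taken from the program itself.
.lit
```python
import math


def weight_matrix_score(window, weights, cutoff):
    """Score one window; weights maps character -> values per position."""
    if cutoff > 0:
        # Additive mode: sum the raw values stored in the file.
        return sum(weights[ch][i] for i, ch in enumerate(window))
    # Logarithmic mode: normalise each column to sum to 1.0, then add logs.
    total = 0.0
    for i, ch in enumerate(window):
        column_total = sum(row[i] for row in weights.values())
        frequency = weights[ch][i] / column_total
        if frequency == 0:
            return -math.inf  # assumption: an unseen character cannot match
        total += math.log(frequency)
    return total


def is_match(window, weights, cutoff):
    """A match is declared when the score reaches the cutoff."""
    return weight_matrix_score(window, weights, cutoff) >= cutoff


# A made-up two-position matrix for a four-letter alphabet.
demo = {"A": [8, 1], "C": [1, 1], "G": [1, 8], "T": [0, 0]}
print(weight_matrix_score("AG", demo, 10))    # additive mode
print(weight_matrix_score("AG", demo, -1.0))  # sum of log frequencies
```
.end lit
.para
Because the log of a frequency is never positive, scores in the second mode
are at most zero, which is why the sign of the cutoff can select the mode.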
|
.para
|
|
Class 5: complement of weight matrix
|
|
.para
|
|
The motif is defined by a weight matrix, but the program searches for its
|
|
complement.
|
|
.para
|
|
Class 6: inverted repeat, or stem-loop
|
|
.para
|
|
The motif is defined by a repeat length, a minimum score
|
|
and a range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0.
|
|
The loop sizes are defined by a minimum
|
|
and maximum distance from the 3' end of the stem.
|
|
For a stem-loop these will be positive numbers. For example to
|
|
define a stem of length 8 and loop sizes varying from 3 to 5, the stem
|
|
would be set to 8, the minimum start distance to 3 and the maximum
|
|
to 5. To define an
|
|
inverted repeat the minimum distance will be negative. For example stem
|
|
length=9,
|
|
minimum distance=-9, and maximum distance=-8 will find
|
|
inverted repeats of lengths 9 and 10.
|
|
E.g. AAAAATTTT and AAAAATTTTT would be found, the first having a base
|
|
at
|
|
its centre, the second having none.
|
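.para
A stem-loop search using the scores above might be sketched like this. The
sketch assumes that the loop size is the number of unpaired bases between
the two arms of the stem, and it covers positive loop sizes only; the
negative distances used for inverted repeats are not handled. All names are
hypothetical.
.lit
```python
# Pair scores quoted in the text: A-T=2, G-C=2, G-T=1, anything else 0.
PAIR_SCORES = {frozenset("AT"): 2, frozenset("GC"): 2, frozenset("GT"): 1}


def stem_score(seq, start, stem, loop):
    """Score the stem whose 5' arm begins at 0-based index start."""
    left = seq[start:start + stem]
    right = seq[start + stem + loop:start + 2 * stem + loop]
    if len(right) < stem:
        return None  # the 3' arm would run off the end of the sequence
    # The first base of the 5' arm pairs with the last base of the 3' arm.
    return sum(PAIR_SCORES.get(frozenset((a, b)), 0)
               for a, b in zip(left, reversed(right)))


def find_stem_loops(seq, stem, min_loop, max_loop, cutoff):
    """Return (1-based start, loop size, score) for stems reaching cutoff."""
    hits = []
    for start in range(len(seq)):
        for loop in range(min_loop, max_loop + 1):
            score = stem_score(seq, start, stem, loop)
            if score is not None and score >= cutoff:
                hits.append((start + 1, loop, score))
    return hits


print(find_stem_loops("GGGCAAAAGCCC", 4, 3, 5, 8))
```
.end lit
.para
In the toy sequence the arms GGGC and GCCC pair perfectly around a loop of
four bases, giving the maximum score of 8 for a stem of length 4.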
|
.para
|
|
Class 7: exact match, defined step size.
|
|
.para
|
|
The motif is defined by a short sequence, which for nucleic acids,
|
|
may include IUB symbols. All symbols must match. The class differs
|
|
from
|
|
class 1 in that searches will move in steps of some given size. For
|
|
example
|
|
we could search for a certain codon and use a step size of 3 and hence
|
|
keep in a
|
|
single reading frame.
|
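.para
The effect of a step size can be sketched as follows. The function and its
parameter names are hypothetical, and IUB symbols are not handled.
.lit
```python
def stepped_exact_matches(sequence, motif, step, first=1):
    """Return 1-based starts of exact matches, testing every step-th position.

    first is the 1-based position at which the search begins, which fixes
    the reading frame when step is 3.
    """
    hits = []
    last_start = len(sequence) - len(motif)
    for start in range(first - 1, last_start + 1, step):
        if sequence[start:start + len(motif)] == motif:
            hits.append(start + 1)
    return hits


dna = "ATGTAAATGATAGTAA"
print(stepped_exact_matches(dna, "TAA", 3))  # frame 1 only
print(stepped_exact_matches(dna, "TAA", 1))  # every position, as in class 1
```
.end lit
.para
With a step of 3 the second TAA in the toy sequence is skipped because it
lies in a different reading frame.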
|
.para
|
|
Class 8: direct repeat
|
|
.para
|
|
The motif is defined by a repeat length, a minimum score
|
|
and a range of loop sizes. The scores are defined using MDM78 for protein
|
|
sequences and an identity matrix for nucleic acids.
|
|
The loop sizes are defined by a minimum
|
|
and maximum distance from the end of the first copy of the repeat.
|
|
.para
|
|
Class 9: membership of a set
|
|
.para
|
|
This motif class is for protein sequences. It is defined by lists of
|
|
allowed amino acids for each position in the motif, and a cut-off score.
|
|
Positions at which any amino acid can occur are left blank.
|
|
All allowed amino acids for each position give a score of 1.
|
|
The motifs can be defined in two ways: either typed at the keyboard or
|
|
read
|
|
in as a weight-matrix-like file.
|
|
When the motif is defined at the keyboard the sets of allowed amino
|
|
acids
|
|
are separated by commas(,).
|
|
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
|
|
acids in which one of I,V or L must be found in the first position,
|
|
then anything in the next two positions, D,E,K or R in the fourth
|
|
position and F,Y,W,I,L,V or M in the fifth. To specify that the
|
|
whole motif must match a score of 3 would be required (i.e. one of the
|
|
allowed amino acids must be found for each of the three defined
|
|
positions).
|
|
If the motif is read from a file the file must have been written by
|
|
program
|
|
PIP, or have been saved by the pattern searching routines. If the
|
|
user
|
|
elects to save a pattern, and it includes class 9 motifs typed at the
|
|
keyboard, then the program will save the class 9 motifs as weight matrix
|
|
files. Therefore it will request file names for each motif of this class.
|
|
If the motif given above as an example were saved the weight matrix file
|
|
would have 5 columns.
|
|
The first column
|
|
would contain zeroes except for the I, V and L rows
|
|
which would be set to 1; the next two columns would all be zero; the next
|
|
would be zero except for the D,E,K and R rows which would be 1; the final
|
|
column would contain 1's in rows F,Y,W,I,L,V and M, with
|
|
the rest zero.
|
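.para
The example motif IVL,,,DEKR,FYWILVM can be worked through with the sketch
below. The function names are hypothetical, and the 20-letter row order is
an assumption; files written by the program may order the rows differently
and include extra symbols.
.lit
```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # assumed row order for this sketch


def parse_sets(definition):
    """Split a comma-separated definition into one set per position."""
    return [set(field.upper()) for field in definition.split(",")]


def set_score(window, sets):
    """Score 1 for each defined position holding an allowed amino acid."""
    return sum(1 for ch, allowed in zip(window, sets) if ch in allowed)


def as_weight_rows(sets):
    """Build weight-matrix-like rows: 1 where allowed, 0 elsewhere."""
    return {aa: [1 if aa in allowed else 0 for allowed in sets]
            for aa in AMINO_ACIDS}


sets = parse_sets("IVL,,,DEKR,FYWILVM")
print(len(sets))                 # motif length
print(set_score("IAAKF", sets))  # all three defined positions satisfied
print(as_weight_rows(sets)["I"])
```
.end lit
.para
Blank positions become empty sets and so never contribute to the score,
which is why a cutoff of 3 requires all three defined positions to match.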
|
.para
|
|
|
|
The logical operator (AND, OR or NOT) used to add each motif to the
|
|
pattern
|
|
is specified by preceding
|
|
the class number by the letters A, O or N. A = AND, O = OR, N = NOT.
|
|
The default is A, so N2 means include, using the NOT operator, a class 2
|
|
motif; O2 means include, using the OR operator, a class 2 motif; both A2
|
|
and
|
|
2 mean include, using the AND operator, a class 2 motif.
|
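.para
The prefix convention can be summarised by a short parsing sketch. The
function name is hypothetical and no check is made that the class number is
valid.
.lit
```python
OPERATORS = {"A": "AND", "O": "OR", "N": "NOT"}


def parse_class_code(code):
    """Split an entry such as 'N2' into (operator, class number)."""
    code = code.strip().upper()
    if code and code[0] in OPERATORS:
        return OPERATORS[code[0]], int(code[1:])
    # No letter given: AND is the default operator.
    return "AND", int(code)


for entry in ("N2", "O2", "A2", "2"):
    print(entry, parse_class_code(entry))
```
.end lit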
|
|
|
.para
|
|
Range setting.
|
|
.para
|
|
The motifs in a pattern are numbered according to their order in the list.
|
|
Apart from the first motif in a pattern all motifs are given a range
|
|
of allowed positions relative to a motif further up the list.
|
|
For example
|
|
suppose we have a pattern defined by A AND B AND C AND D.
|
|
Motif A can occur anywhere, but B must have its range of allowed
|
|
positions defined relative to the position of motif A, and C's positions
|
|
can be defined relative to either A or B, depending on which is most
|
|
convenient, and likewise D's positions can be relative to A or B or C.
|
|
.para
|
|
Notice that the positions of motifs can be defined relative to more than
|
|
one motif. Suppose we have a pattern consisting of
|
|
motifs A, B and C, and that B occurs 5-10 residues right of A, C occurs 5-10
|
|
residues right of B, and also C is never more than 15 residues from A.
|
|
Then
|
|
it is quite consistent with the methods to include motif C into the
|
|
pattern
|
|
twice using the AND operator: once relative to A and once relative to B.
|
|
This will define the relative spacing and the ORDER of the motifs in the
|
|
pattern. (If we simply defined the position of C relative to A it could be
|
|
found to the left of B).
|
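.para
The A, B, C example can be checked with the brute-force sketch below. It
assumes the positions of the individual motif matches are already known, and
it tests every combination, which is not how the programs search. The names
and the range of 1 to 15 used for C relative to A are assumptions made for
the illustration.
.lit
```python
from itertools import product


def satisfies(positions, constraints):
    """Check every (motif, reference, min offset, max offset) constraint."""
    return all(low <= positions[motif] - positions[ref] <= high
               for motif, ref, low, high in constraints)


def pattern_matches(found, constraints):
    """Return each combination of motif positions that meets all ranges."""
    names = sorted(found)
    matches = []
    for combo in product(*(found[name] for name in names)):
        positions = dict(zip(names, combo))
        if satisfies(positions, constraints):
            matches.append(positions)
    return matches


found = {"A": [10], "B": [18], "C": [24, 27]}
constraints = [
    ("B", "A", 5, 10),  # B lies 5-10 residues right of A
    ("C", "B", 5, 10),  # C lies 5-10 residues right of B
    ("C", "A", 1, 15),  # C is included again, relative to A
]
print(pattern_matches(found, constraints))
```
.end lit
.para
Dropping the third constraint would also accept C at position 27, which is
17 residues from A.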
|
.para
|
|
Motifs combined together using the OR operator are all given the same
|
|
range. For example suppose we had a pattern A AND (B OR C) AND (D OR E),
|
|
then B and C each have the same range, and D and E also have
|
|
the same range as one another. The range for D and E can be relative to
|
|
A or to B.
|
|
.para
|
|
Motifs cannot have their ranges defined relative to motifs that are
|
|
included using the NOT operator. For example if we had the pattern A NOT
|
|
B
|
|
AND C, then the range for C can only be defined relative to motif A.
|
|
.para
|
|
Speed can be gained by arranging the order
|
|
of the motifs so that those higher up the list are of types that can be
|
|
searched for rapidly and that are also unlikely to be found.
|
|
.para
|
|
Motifs combined by the OR operator are alternatives: if any one of a set
|
|
of motifs
|
|
combined by the OR operator is found, then a match is declared. All
|
|
alternatives will be reported. For example if we had a pattern defined by
|
|
A
|
|
AND (B OR C), then all places where A occurs and B is found within range,
|
|
and all places where A is found and C is found within range will be
|
|
reported. A typical use would be where we might allow a motif to appear
|
|
on
|
|
either strand of the DNA sequence. For example a weight matrix
|
|
representing
|
|
the heatshock element could be used in a pattern which included
|
|
heatshock
|
|
as a motif class 4 combined using the OR operator
|
|
with heatshock as a motif class 5.
|
|
.para
|
|
The probability calculations are performed for each motif as it is
|
|
defined.
|
|
If an overall probability cut-off is given the calculation is repeated for
|
|
each match found. To achieve maximum searching speed do not give an
|
|
overall
|
|
probability cut-off. Overall cut-off scores should only be used if the
|
|
motif
|
|
classes used are compatible.
|
|
.para
|
|
There are currently
|
|
several ways to display the matches: 1 = each
|
|
motif and its position is listed; 2 = all the sequence between the two
|
|
outermost motifs is listed; 3 = graphical, with a spike marking the
|
|
position
|
|
of the leftmost motif. The library versions also give entry names, and a
|
|
one
|
|
line title; in addition they can be used to produce aligned families of
|
|
sequences. When this mode of output is selected the program will write a
|
|
separate file for each match. The files will be called ENTRYNAME.DAT
|
|
where
|
|
ENTRYNAME is the name of the entry in the library. The matching
|
|
sequence
|
|
will be written out so that the spacing between motifs is constant, and
|
|
set to the maximum allowed by the pattern definition. Any gaps will be
|
|
filled with dashes (-). If the individual sequences were subsequently
|
|
written one above the other
|
|
they should line up so that all motifs are in register. There are two types of
|
|
output of this sort: one, option 4, writes out whole sequences, the other,
|
|
option 5, writes out only the sequences between the two outermost
|
|
motifs.
|
|
Note that for option 4 users are asked to type the position of the
|
|
first motif, and the reason for
|
|
this is explained below.
|
|
Consider a pattern found in several sequences. Consider only
|
|
the first motif in
|
|
the pattern and suppose that it was found in different positions in these
|
|
sequences.
|
|
Say that of these positions the one furthest from the left end was
|
|
position 100. Then, in order to ensure that all the sequences would align,
|
|
we must specify that motif 1 must start at position 100.
|
|
Any sequences in which motif 1 started
|
|
nearer to the left end than position 100 would be padded accordingly.
|
|
These modes of output
|
|
should only be used when the position of each motif is defined relative to
|
|
its
|
|
immediate neighbour.
|
|
.para
|
|
The pattern descriptions can be saved to files. These files
|
|
can be used instead of typing definitions again at the keyboard. As the
|
|
files are annotated,
|
|
they can easily
|
|
be changed using system editors, and the modified versions used to
|
|
define the variant patterns for the programs.
|
|
.para
|
|
.para
|
|
Use of lists of entry names
|
|
.para
|
|
The two programs that operate on libraries have the ability to
|
|
restrict their searches to subsets of the libraries. This does not require
|
|
sublibraries to be created but instead is achieved by using files
|
|
containing a list of the entry names of sequences. The user may choose to
|
|
search only those entries on the list or, alternatively, to search all but
|
|
those on the list (i.e. in the latter case
|
|
the list contains the names of those to be excluded).
|
|
The programs can search libraries that have indexes and those that
|
|
do not.
|
|
If a list of names for inclusion is used,
|
|
then the search will be faster if the index is present. In all other
|
|
circumstances the whole library will be read.
|
|
The list must be in library order except when it is used
|
|
to include entries, and an index is available.
|
|
The list must contain each entry name on a separate line, with the name
|
|
starting in column 1 of the line, i.e. there must be no spaces at the start
|
|
of the line.
|
|
The list of entry names
|
|
can be produced by the keyword searches of nip, pip, etc as long
|
|
as the listings produced have a space character separating the entry name
|
|
from the entry description. This will depend on how well the library
|
|
reformatting programs work. For example swissprot entry names tend to run
|
|
into the beginning of the descriptions, but other libraries are generally
|
|
OK.
|
|
|
|
.para
|
|
One use of the programs is to look for patterns that we already know
|
|
about, but in new sequences. However it is hoped that they will also be
|
|
useful for finding new motifs. For example
|
|
several known control regions in
|
|
nucleic acid
|
|
sequences consist of particular direct or inverted repeats;
|
|
the inclusion of
|
|
direct and inverted repeats as motif classes
|
|
makes it possible to
|
|
find previously unknown
|
|
motifs of these types.
|
|
Using these new programs we can
|
|
ask questions like: "are there any inverted or direct repeats near to
|
|
sections of sequence that contain both a
|
|
CCAAT box and a TATA box?"; and to search for such things throughout
|
|
the
|
|
libraries. In addition, the mode of output in which all the sequence
|
|
between
|
|
the two outermost motifs found is printed out, allows us to extract
|
|
sequences and examine them in more detail for further common
|
|
subsequences.
|
|
For example we might want to collect together all the sequences
|
|
between
|
|
putative CCAAT and TATA boxes.
|
|
.para
|
|
A further use of the inverted repeat motif class is the following. If a
|
|
regulatory sequence in DNA is poorly defined but also an inverted repeat,
|
|
then it might be an advantage to specify it both as a consensus sequence
|
|
and
|
|
a superimposed inverted repeat. In this way two weak definitions can be
|
|
combined to produce a stronger pattern.
|
|
.para
|
|
Given only a few examples of a motif it
|
|
should be possible to perform initial searches using a
|
|
class 3 motif, and then, using plausible matching sequences, create a
|
|
more
|
|
specific weight matrix for the same motif.
|
|
.para
|
|
If motifs are combined with the first motif using the OR operator
|
|
they will be ignored until all
|
|
permutations that include the first motif have been looked for.
|
|
The whole search will then be repeated, in
|
|
turn, for each of
|
|
those motifs that are combined with the first motif using the OR
|
|
operator.
|
|
An interesting consequence of this is that the program can be used,
|
|
without
|
|
change, to compare any newly determined sequence with all known
|
|
individual
|
|
motifs. We achieve this by having a pattern in which all known relevant
|
|
motifs are combined using the OR operator.
|
|
If we ask to use this pattern with
|
|
a sequence, the program will automatically compare each individual
|
|
motif in
|
|
the pattern with the whole length of the
|
|
sequence. As the number of known
|
|
motifs grows this should become an increasingly useful standard
|
|
procedure.
|
|
.para
|
|
The NOT operator is obviously
|
|
useful for making sure particular motifs are not present, but it can also
|
|
be used to bracket the levels of matches found. We may want a degree of
|
|
match that lies between two limits - binding should occur, but not too
|
|
strongly; or base-pairs should form, but not too many. We can specify
|
|
this
|
|
by asking for a match with a low score, in combination with a match with a high score, both for the same motif, but with the high score included
|
|
using
|
|
the NOT operator.
|
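.para
Bracketing with NOT can be sketched as below, using a simple count of
identities as the score. The function names are hypothetical, and the real
programs would apply whichever motif class the user selected.
.lit
```python
def identities(window, motif):
    """Count positions at which the window and the motif agree."""
    return sum(1 for a, b in zip(window, motif) if a == b)


def bracketed_matches(sequence, motif, low, high):
    """Starts (1-based) that match at score low AND NOT at score high."""
    hits = []
    for start in range(len(sequence) - len(motif) + 1):
        score = identities(sequence[start:start + len(motif)], motif)
        matches_low = score >= low
        matches_high = score >= high
        if matches_low and not matches_high:
            hits.append(start + 1)
    return hits


print(bracketed_matches("AVLSTAVGGG", "AVLST", 2, 5))
```
.end lit
.para
The perfect copy at position 1 is rejected because it also satisfies the
high-score motif; only the weaker match at position 6 is kept.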
|
.para
|
|
The algorithm is designed to find all sections of a sequence that satisfy
|
|
the pattern rather than only the best match.
|
|
Particularly if some of the motifs in a pattern are less well defined than
|
|
others, this can often result in the same region of a sequence being
|
|
reported as having several matches, but which only vary in the
|
|
positions of the weakest motifs.
|
|
.para
|
|
General remarks on motif searching
|
|
.para
|
|
Generally motifs are short subsequences that are thought to be
|
|
associated with
|
|
particular functions in some known sequences. Often
|
|
we search for them to try to
|
|
understand or interpret other sequences. Sometimes we search for
|
|
motifs and
|
|
patterns to
|
|
test a hypothesis about their role: are they found in the expected
|
|
positions in the expected sequences? In doing so we should remember
|
|
that, in both proteins and nucleic acids,
|
|
what we are really looking for is a particular
|
|
three dimensional structure with certain affinities for other structures,
|
|
and that we are assuming that the sequence of the motif alone
|
|
defines the 3D structure we are searching for.
|
|
The overall structure
|
|
may be completely different to those in which the motif is functional,
|
|
and
|
|
hence the motif may have a different shape or be inaccessible.
|
|
We should be aware of the
|
|
importance of the context in which a motif is found. Where does it lie
|
|
relative to the overall structure, is it accessible, is the three
|
|
dimensional spacing between
|
|
it and other motifs correct? For example, is it on the same side of the
|
|
double helix, and the correct distance from some other motif? How does
|
|
context affect our assessment of the significance of finding a motif?
|
|
Finding false mammalian mRNA splice junctions in non-coding sequences
|
|
is
|
|
far less important than finding false sites in pre-mRNA sequences, but
|
|
finding them in the correct places is most important! In other words, it
|
|
is
|
|
often the case that when we are searching for a motif that is known to
|
|
be
|
|
necessary for some function, then a positive result in the form of a
|
|
match
|
|
in the required position, is more important than a high background of
|
|
matches in the wrong positions. Being
|
|
able to write
|
|
down the probability of finding a motif in a random sequence tells us how
|
|
well it is defined.
|
|
In nucleic
|
|
acids the DNA may contain many superimposed types of information such
|
|
as
|
|
those concerned with histone phasing, protein coding or mRNA secondary
|
|
structure. These overlapping "codes" may interfere with one another
|
|
causing
|
|
matches to motifs to be poorer than expected.
|
|
In general we will only have a limited number of examples of the
|
|
motif and we do not know how representative they are.
|
|
.para
|
|
Sequences have superimposed functions: some parts may be of general
|
|
structural
|
|
importance and give rise to an overall framework, and other parts give
|
|
specificity and hence are not common; we may want to use a set of
|
|
aligned
|
|
sequences to define a motif, but want to use only the framework
|
|
positions.
|
|
Alternatively we may want to pick out
|
|
only those parts of a set of aligned sequences that give a particular
|
|
property, and to ignore other similarities that are due to some other
|
|
property
|
|
and which could obscure the pattern
|
|
we are interested in.
|
|
It is possible to apply a mask to a set of aligned sequences in
|
|
order to give weight to selected positions only.
|
|
The ability to define a mask allows certain positions
|
|
to be used in the motif and others to be ignored, and yet still permits the
|
|
use of a set of aligned sequences to calculate weights. The mask is
|
|
requested and applied
|
|
by the program and results in the masked positions being zero
|
|
in
|
|
the weight matrix. The mask is defined in the following way.
|
|
Suppose we had a motif of length 15, then the mask
|
|
x--x--xx-x will give zero weights to positions 2,3,5,6 and 9 (note it is
|
|
the dashes (-) that are significant and that positions
|
|
1,4,7,8,10,11,12,13,14 and 15
|
|
will be non-zero). Of course
|
|
the same set of sequences could be used with several alternative masks
|
|
in
|
|
order to extract different features and create corresponding weight
|
|
matrices.
|
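.para
The mask example can be reproduced with the following sketch. The names are
hypothetical, and the matrix is held as a dictionary of rows purely for
illustration. Positions beyond the end of the mask are left unchanged, as in
the example above.
.lit
```python
def zeroed_positions(mask):
    """Return the 1-based positions that the mask sets to zero."""
    return [i + 1 for i, symbol in enumerate(mask) if symbol == "-"]


def apply_mask(weights, mask):
    """Zero every column of the matrix that lies under a dash."""
    masked = {i for i, symbol in enumerate(mask) if symbol == "-"}
    return {ch: [0 if i in masked else value for i, value in enumerate(row)]
            for ch, row in weights.items()}


print(zeroed_positions("x--x--xx-x"))

# A made-up single row of 15 weights, all equal to 1 before masking.
print(apply_mask({"A": [1] * 15}, "x--x--xx-x")["A"])
```
.end lit
.para
Applying several masks to the same set of weights gives one matrix per
feature of interest, without realigning the sequences.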
|
.para
|
|
The programs are described in Staden,R.
|
|
CABIOS 4, 53-60, 1988; Staden,R.
|
|
CABIOS 5, 89-96, 1989, and a forthcoming Methods in
|
|
Enzymology.
|
|
.left margin1
|
|
@ end of help
|