.NPA
.SP 1
.left margin1
@-1. TX 0 @General
.sp
@-2. T 0 @Screen control
.sp
@-2. X 0 @Screen
.sp
@-3. T 0 @Statistical analysis of content
.sp
@-3. X 0 @Statistics
.sp
@-4. T 0 @Structures and repeats
.sp
@-4. X 0 @Structures
.sp
@-5. TX 0 @Translation and codons
.sp
@-6. TX 0 @Gene search by content
.sp
@-7. TX 0 @General signals
.sp
@-8. TX 0 @Specific signals
.sp
@0. TX -1 @NIP
.PARA
.para
This is a program for analysing individual nucleotide sequences. It can
read sequences stored in many of the most commonly used formats, and
performs all of the usual simple analyses. However the main purpose of
the program is to provide methods for finding the function of each
section of a sequence. In general no single method can give an
unequivocal interpretation of a sequence, so we need to use many
techniques together and to combine their results. For this reason the
program presents many of its results graphically.
.para
General information is contained in the user interface. Online
documentation for any function follows a consistent pattern: summary,
list of inputs, list of outputs, details, example.
.LEFT MARGIN1
@1. TX 0 @ Help
.LEFT MARGIN2
.para
This option gives online help. The user selects an option number and
the documentation for that option is displayed. Note that option 0 gives an
introduction to the program, and that ? will get help from anywhere in
the program.
The following functions are included:
.left margin1
@2. TX 0 @ Quit
.left margin2
.para
This function stops the program.
.left margin1
@3. TX 1 @ Read a new sequence
.LEFT MARGIN2
.para
This option allows users to read in new sequences, browse through annotations,
or search sequence
libraries for keywords. Sequences can be read from "personal"
sequence files or from sequence libraries. These are referred to as the
sequence "source". Personal files can be stored in several formats:
Staden, PIR, EMBL, GENBANK and GCG.
At LMB we use "Staden" format for sequencing and all
the
libraries are stored in their original formats. Note, however, that libraries
such as EMBL or GenBank that are divided into several files (e.g. GenBank has
13 separate files) are indexed as a whole. This means that users do not need
to know which file contains an entry, only which library.
When the user selects to read in a sequence the program first asks for the
sequence "source".
.para
If the user selects "personal" the program will ask for
the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
the file. For PIR format the user will also be required to know the entry
name of the sequence as the file can contain several. For the other formats
only a single entry is expected. The file will be read, its length and
composition will be displayed and the option left.
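.para
As an illustration only, the composition display can be reproduced with the
following Python sketch. It is not the program's own code. The names
composition and as_percentages are hypothetical, and it is assumed that the
"-" column counts every symbol other than T, C, A and G.
.lit
```python
from collections import Counter

def composition(seq):
    """Count T, C, A and G; any other symbol is pooled under '-' (an assumption)."""
    counts = Counter(seq.upper())
    table = {base: counts[base] for base in "TCAG"}
    table["-"] = len(seq) - sum(table.values())
    return table

def as_percentages(table):
    """Convert the counts to percentages of the total length."""
    total = sum(table.values())
    return {key: (100.0 * n / total if total else 0.0) for key, n in table.items()}

seq = "GTTAATGTAGCTTAATAACA"   # a short example sequence
table = composition(seq)
print("Sequence length=", len(seq))
print(" ".join(f"{key:>7}" for key in table))
print(" ".join(f"{n:>6}." for n in table.values()))
print(" ".join(f"{p:>6.1f}%" for p in as_percentages(table).values()))
```
.end lit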
.para
If the user selects "library" as the sequence source the program will display a
list of available libraries. The programs are capable of handling all current
libraries but which ones are available will vary from site to site. At LMB we
have several libraries and also weekly updates of data gathered between releases.
The program will ask users to select a library and then give a list of options:
.lit
X 1 Get a sequence
2 Get annotations
3 Get entrynames from accession numbers
4 Search titles for keywords
5 Search text index for keywords
.end lit
If "Get a sequence" or "Get annotations" is selected users will be asked to
type the entry name. The option will be left when a sequence is selected or
! is typed. The composition and length of a selected sequence will be displayed.
.para
The text index contains all words from feature tables, reference titles,
definition lines, keyword lists and comments, so the text index search
is the most useful. It is also the fastest. Up to 5 words can be searched for
at once. The words should be typed separated by spaces, for example
.lit
? Keywords=P53 mouse murine tumo
.end lit
will search for all entries that contain words starting with p53, mouse,
murine and tumo. Only the unique entries that contain ALL the words will be
listed. Before listing the matching entries
the program will show the number of 'hits' for each word and ring the bell.
Escape is possible at this point, or after each screenful of entries.
In addition to the entry names the text search displays the primary accession
number, the sequence length and up to 80 characters of description.
(The search of 'titles' is now redundant because the full text index
contains all the title words and the search is much faster. It will probably
be removed from the program.)
All searches are independent of case. Where
possible the program will offer default entry names.
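.para
The behaviour of the text index search described above (case independent,
matching on the start of words, at most 5 words, and listing only entries
that contain ALL the words) can be sketched in Python as follows. This is a
minimal sketch and not the program's implementation: the function name
search_text_index and the in-memory index (a dictionary from word to a set
of entry names) are assumptions made for illustration, and the index
contents are toy data.
.lit
```python
def search_text_index(index, query, max_words=5):
    """Return (hits per word, sorted entries containing all words).

    index is a hypothetical dict mapping a lower-case word to the set of
    entry names in which it occurs.
    """
    words = query.lower().split()[:max_words]
    matches = {}
    for word in words:
        found = set()
        for indexed_word, entries in index.items():
            if indexed_word.startswith(word):   # match on the start of a word
                found |= entries
        matches[word] = found
    common = set.intersection(*matches.values()) if matches else set()
    hits = {word: len(found) for word, found in matches.items()}
    return hits, sorted(common)

# Toy index, invented for this example only.
index = {
    "p53": {"ENTRY1", "ENTRY2"},
    "mouse": {"ENTRY1", "ENTRY3"},
    "tumour": {"ENTRY1", "ENTRY4"},
    "tumor": {"ENTRY2"},
}
hits, entries = search_text_index(index, "P53 mouse tumo")
print(hits)
print(entries)
```
.end lit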
.para
Typical dialogue follows.
.lit
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =
Select sequence file format
X 1 Staden
2 EMBL
3 GenBank
4 PIR
5 GCG
? Selection (1-5) (1) =
? Sequence file name=M13MP7.SEQ
Contig title removed
Sequence length= 7238
Sequence composition
        T        C        A        G        -
    2405.    1539.    1765.    1527.       2.
    33.2%    21.3%    24.4%    21.1%     0.0%
.
.
.
Select sequence source
X 1 Personal file
2 Sequence library
? Selection (1-2) (1) =2
Select a library
X 1 EMBL 29 nucleotide library Dec 91
2 SWISSPROT 20 protein library Nov 91
3 PIR 31 protein library Dec 91
4 NRL3D 58 From Brookhaven protein library Dec 91
5 GenBank
? Selection (1-5) (1) =
Library is in EMBL format with indexes
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =5
Search for keywords
? Keywords=P53 mouse
P53 hits 68
MOUSE hits 8180
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
22 different entries found
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =4
Search for keywords
? Keywords=alpha
Searching for alpha
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =3
? Accession number=v00636
Entry name LAMBDA
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =2
Default Entry name=LAMBDA
? Entry name=
ID LAMBDA standard; DNA; PHG; 48502 BP.
XX
AC V00636; J02459; M17233; X00906;
XX
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
DT 09-JUN-1982 (Rel. 1, Created)
XX
DE Genome of the bacteriophage lambda (Styloviridae).
XX
KW circular; coat protein; DNA binding protein; genome;
KW origin of replication.
XX
OS Bacteriophage lambda
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
XX
RN [1]
RP 1-48502
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
RT "Nucleotide sequence of bacteriophage lambda DNA";
RL J. Mol. Biol. 162:729-773(1982).
XX
!
Select a task
X 1 Get a sequence
2 Get annotations
3 Get entry names from accession numbers
4 Search titles for keywords
5 Search text index for keywords
? Selection (1-5) (1) =
Default Entry name=LAMBDA
? Entry name=
DE Genome of the bacteriophage lambda (Styloviridae).
Sequence length 48502
Sequence composition
        T        C        A        G        -
   11988.   11360.   12336.   12818.       0.
    24.7%    23.4%    25.4%    26.4%     0.0%
.end lit
.left margin1
@4. TX 1 @ Define active region
.LEFT MARGIN2
.para
For its analytic functions
the program always works on a region of the sequence called the "active
region". This function allows the start and end points of the active region
to be reset.
.para
Define the required start and end points.
.para
When a new sequence is read into the program the active region is
automatically set to start at the beginning of the sequence and extend to
the
maximum the program can
handle. On most machines this will be to the end of the sequence. The
positions are shown on the screen.
Note that for
convenience, in the
listing and translation functions, the user is given access to regions
outside the active region.
.left margin1
@5. TX 1 @ List a sequence
.LEFT MARGIN2
.para
The sequence can be listed single or double stranded with line lengths
from
10 to 120 in multiples of 10.
.para
Define the region to list, the line length required and choose between a
single or double stranded display.
The output looks like:
.lit
GTTAATGTAG CTTAATAACA AAGCAAAGCA CTGAAAATGC TTAGATGGAT
CAATTACATC GAATTATTGT TTCGTTTCGT GACTTTTACG AATCTACCTA
        10         20         30         40         50
AATTGTATCC CATAAACACA AAGGTTTGGT CCTGGCCTTA TAATTAATTA
TTAACATAGG GTATTTGTGT TTCCAAACCA GGACCGGAAT ATTAATTAAT
        60         70         80         90        100
GAGGTAAAAT TACACATGCA AACCTCCATA GACCGGTGTA AAATCCCTTA
CTCCATTTTA ATGTGTACGT TTGGAGGTAT CTGGCCACAT TTTAGGGAAT
       110        120        130        140        150
AACATTTACT TAAAATTTAA GGAGAGGGTA TCAAGCACAT TAAAATAGCT
TTGTAAATGA ATTTTAAATT CCTCTCCCAT AGTTCGTGTA ATTTTATCGA
       160        170        180        190        200
.end lit
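.para
A listing of this kind can be produced by a few lines of code. The Python
sketch below is an illustration under stated assumptions and not the
program's own routine: the name list_sequence is hypothetical, the second
strand is written as the base-by-base complement beneath the first, and each
number is assumed to be right justified under the last base of its block of 10.
.lit
```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def list_sequence(seq, line_length=50, double_stranded=True):
    """Return the listing as a list of text lines."""
    if line_length % 10 or not 10 <= line_length <= 120:
        raise ValueError("line length must be a multiple of 10 from 10 to 120")
    lines = []
    for start in range(0, len(seq), line_length):
        chunk = seq[start:start + line_length]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        lines.append(" ".join(blocks))
        if double_stranded:
            lines.append(" ".join(b.translate(COMPLEMENT) for b in blocks))
        # One number under the end of each complete block of 10 bases.
        numbers = [f"{start + 10 * (n + 1):>10}"
                   for n, b in enumerate(blocks) if len(b) == 10]
        lines.append(" ".join(numbers))
    return lines

for line in list_sequence("GTTAATGTAGCTTAATAACA", line_length=20):
    print(line)
```
.end lit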
.left margin1
@6. TX 1 @ List a text file
.LEFT MARGIN2
.para
Allows the user to have a text file displayed on the screen. It will appear
one page at a time.
.para
Supply the name of the file to be displayed.
.left margin1
@7. TX 1 @ Direct output to disk
.LEFT MARGIN2
.para
Used to direct output that would normally appear on the screen to a file.
.para
Select redirection of either text or graphics, and
supply the name of the file that the output should be written to.
.para
The results from the next options selected will not appear on the screen
but will be written to the file. When option 7 is selected again
the file will be
closed and output will again appear on the screen.
.left margin1
@8. TX 1 @ Write active region to disk
.LEFT MARGIN2
.para
Used to write the current active section of sequence to a disk file in
"Staden format".
.para
Supply a file name and an optional title.
.para
The program has the capability of reading sequences stored in several
formats and so, in conjunction with this option, can be used to reformat
them.
.left margin1
@9. TX 1 @ Edit the sequence
.LEFT MARGIN2
.para
Used to edit sequences or any other files by giving access to the
computer's system editor. For editing sequences the input file should
have already been created using one of the listing functions such as "list
sequence", "list translation" or "list restriction sites above the
sequence".
.para
Supply the name of the file to edit. Wait while the system editor is made
ready (this can take a while on a VAX). Use the editor. Exit from the editor. If a
sequence has been edited, and you want to process it, affirm that the
sequence should be "made active". The edited sequence will replace the
original sequence.
.para
This editing method is designed to give users access to an editor with
which they are familiar - i.e. the one on their machine, and yet to allow
them to edit a sequence which contains all the landmarks they need in
order to know where they are. Users can create files containing simple
listings (single stranded) with numbering, using "list the sequence", and
then edit them with their system editor, using the numbering to know
where they are within the sequence. When the edits are complete they
exit from the editor and the program "analyses" the edited file to extract
only the sequence characters. Similarly a file containing a three phase
translation can be edited, or a file containing a sequence plus its three
phase translation, plus its restriction sites marked above the sequence.
In order to be able to "analyse" such complicated listings and correctly
extract the sequence the following simple rule is used: all lines in the
file that contain a character that is not A,C,T,G or U are deleted. It is
obviously important to be aware of this rule and its implications.
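.para
The rule can be sketched in Python as follows. This is an illustration, not
the program's code: the name extract_sequence is hypothetical, and it is
assumed that blanks are ignored and that the test is independent of case
(otherwise the spaces between blocks of 10 would cause every line of a
listing to be deleted).
.lit
```python
def extract_sequence(listing_text):
    """Keep only lines made up entirely of A, C, G, T or U and join them."""
    allowed = set("ACGTU")
    kept = []
    for line in listing_text.splitlines():
        symbols = "".join(line.split()).upper()   # assumption: blanks ignored
        if symbols and set(symbols) <= allowed:
            kept.append(symbols)
    return "".join(kept)

listing = (
    "GTTAATGTAG CTTAATAACA\n"
    "        10         20\n"
    "AAGCAAAGCA\n"
    "        30\n"
)
print(extract_sequence(listing))
```
.end lit
.para
One implication, under the same assumptions, is that a line of one letter
amino acid codes consisting only of G, A, C and T (glycine, alanine,
cysteine and threonine) would be kept as if it were sequence.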
.left margin1
@10. TX 2 @ Clear graphics
.LEFT MARGIN1
.para
Clears graphics from the screen.
.left margin1
@11. TX 2 @ Clear text
.LEFT MARGIN1
.para
Clears text from the screen.
.left margin1
@12. TX 2 @ Draw a ruler
.LEFT MARGIN2
.para
This option
allows the user to draw a ruler or scale along the x axis of the screen to
help identify the coordinates of points of interest. The user can define
the position of the first base to be marked (for example if the active
region is 1501 to 8000, the user might wish to mark every 1000th base
starting at either 1501 or 2000, depending on whether the user wishes to treat
the active region as an independent unit with its own numbering starting
at
its left edge, or as part of the whole sequence). The user can also define
the separation of the ticks on the scale and their height. If required the
labelling routine can be used to add numbers to the ticks.
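.para
The two choices in the example above can be made concrete with a short
Python sketch. The function name ruler_ticks is hypothetical and the code
only computes tick positions; it does not draw anything.
.lit
```python
def ruler_ticks(first_tick, separation, region_end):
    """Positions of the ticks from first_tick up to the end of the region."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    return list(range(first_tick, region_end + 1, separation))

# Active region 1501 to 8000, a tick every 1000 bases.
print(ruler_ticks(1501, 1000, 8000))   # numbering relative to the region
print(ruler_ticks(2000, 1000, 8000))   # numbering of the whole sequence
```
.end lit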
.left margin1
@13. TX 2 @ Use crosshair
.LEFT MARGIN2
.para
This function puts
a steerable cross on the screen that can be used to find the
coordinates of points in the sequence. The user can move the cross
around using the directional keys; when he hits the space bar the
program will print out the coordinates of the cross in sequence units and
the option will be exited.
.PARA
If instead,
you hit a , the position will be displayed but the cross will remain on
the screen.
.PARA
If a letter s is hit the program will display the sequence around the
crosshair
position, and leave the cross on the screen.
.left margin1
@14. TX 2 @ Reposition plots
.LEFT MARGIN2
.para
The position of each plot is defined relative to a user's drawing
board, which has size 1-10,000 in x and 1-10,000 in y.
Plots for
each option are drawn in a window defined by x0,y0 and xlength,ylength,
where x0,y0 is the position of the bottom left hand corner of the window,
xlength is the width of the window and ylength is its
height.
.lit
---------------------------------------------------------  10,000
1                                                       1
1       --------------------------------------    ^     1
1       1                                    1    1     1
1       1                                    1    1     1
1       1                                    1 ylength  1
1       1                                    1    1     1
1       1                                    1    1     1
1       --------------------------------------    v     1
1  x0,y0^                                               1
1       <---------------xlength-------------->          1
---------------------------------------------------------  1
1                                                  10,000
.end lit
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
The default window positions are read from a file "NIPMARG" when the
program is started. Users can have their own file if required.
As all the plots start
at the same position in x and have the same width, x0 and xlength are the
same for all options. Generally users will only want to change the start
level of the window y0 and its height ylength.
This option
allows users to change window positions whilst running the program.
The routine prompts first for the number of the option that the user
wishes
to reposition; then for the y start and height; then for the x start and
length. Note that changes to the x values affect all options. If the user
types only carriage return for any value it will remain unchanged.
The cross-hair can be used to choose suitable heights.
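.para
The way a window is redefined, with an empty reply leaving a value
unchanged, can be sketched in Python. Everything named here (Window,
parse_reply, reposition) is hypothetical, the numbers are invented, and the
check that the window stays on the drawing board is an assumption rather
than documented behaviour.
.lit
```python
from dataclasses import dataclass, replace

BOARD_MIN, BOARD_MAX = 1, 10_000   # drawing board units

@dataclass(frozen=True)
class Window:
    x0: int
    y0: int
    xlength: int
    ylength: int

def parse_reply(reply, current):
    """An empty reply (carriage return only) keeps the current value."""
    reply = reply.strip()
    return int(reply) if reply else current

def reposition(window, y0_reply="", ylength_reply=""):
    """Return a new window with the y start and height updated."""
    new = replace(window,
                  y0=parse_reply(y0_reply, window.y0),
                  ylength=parse_reply(ylength_reply, window.ylength))
    # Assumed sanity check: the window must lie on the drawing board.
    if new.ylength <= 0 or new.y0 < BOARD_MIN or new.y0 + new.ylength > BOARD_MAX:
        raise ValueError("window does not fit on the drawing board")
    return new

w = Window(x0=1000, y0=256, xlength=8000, ylength=2000)
print(reposition(w, "4000", ""))   # move the window up, keep its height
```
.end lit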
.LEFT MARGIN1
@15. TX 2 @ Label a diagram
.LEFT MARGIN2
.para
This routine allows users to label any diagrams they have produced. They
are asked to type in a label. When the user types carriage return to finish
typing the label the cross-hair appears on the screen. The user can
position it anywhere on the screen. If the user types R (for right justify)
the label will be
written on the diagram with its right end at the cross-hair position.
If the user types L (for left justify) the label will be written on the
diagram with its left end at the cross-hair position.
The
cross-hair will then immediately reappear. The user may place the same
label
on another part of the diagram in the same way, or hit the space bar,
in which case the program asks whether another label is to be typed in.
.para
Typical dialogue follows.
.lit
? Menu or option number=15
Type label then drive cross hair to left or right end
of label position then hit "L" to write label left
justified or "R" to write label right justified or
the space bar to quit
? Label=delta gene
missing graphics
? Label=
.end lit
.left margin1
@16. TX 2 @Display a map
.LEFT MARGIN2
.para
This draws a map
of any sequence features selected by the user.
These features may be protein coding regions (CDS), tRNA genes (TRNA),
promoter positions (PRM), etc. Users may define their own feature table
key
names. For example I find it convenient to split CDS lines into CDS1,
CDS2
and CDS3 each of which contains only those sequences that code in the
reading frames 1, 2 or 3. Then I can plot them at different heights on
the screen (suitable heights can be determined by using the cross-hair).
.para
The coordinates must be stored in a file in the format of an EMBL or GenBank
feature table. Note that this means that the file must include either EMBL
or GenBank headers, and a suitable "tail". The simplest header is the word
FEATURES starting in column 1 of the first line of the file. The simplest
tail is 2 empty lines at the end of the file. These lines are not included
when nip writes out results in feature table format.
.para
Typical dialogue follows.
.lit
? Menu or option number=16
Display a map using an EMBL feature table file
? map file name=hsegl1.ft
? feature code(e.g. CDS) =CDS
X 1 + strand
2 - strand
3 both strands
? 0,1,2,3 =
? level (0-9480) (256) =4000
missing graphics
? feature code(e.g. CDS) =
.end lit
.left margin1
@17. TX 1 @ Search for restriction enzymes
.LEFT MARGIN2
.para
This routine is used to search for short sequences, like restriction
enzyme
recognition sequences,
and can either list the results or present them graphically. Listings can
take several forms and can include the sequence and its translation.
Examples are given below. The program will also display the names of
enzymes that cut the sequence infrequently. Users can select from sets
of enzymes stored in files or can enter them from the keyboard.
.para
The short
sequences (strings) and their names need to be arranged in a particular
way, which is described below. Select to search, list an enzyme file or clear the screen.
Choose either a file of enzymes or to enter their recognition sequences at the
keyboard. Choose to search for all the enzymes in the list or to select
from the list. Select a mode of output. Define the sequence as circular or
linear. Select to search for "definite" or "possible" matches. The search
starts, and after the results have been displayed, further searches can be
performed.
.para
When the enzymes and their recognition sequences are stored in a file
they must be defined in the following way. We
call the recognition sequences "strings".
Each string or set of strings must be
preceded by a name, each string must be preceded and
terminated by a slash (/), and
each set of strings must be terminated by 2 slashes (//).
For example,
AATII/GACGT'C// defines the name AATII, its recognition sequence
GACGTC,
and its cut site, which is marked by the ' symbol. ACCI/GT'MKAC// defines the name
ACCI,
and its recognition sequence includes IUB symbols for incompletely
defined
bases (here M is A or C, and K is G or T).
BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
defines the name BBVI, and this time two recognition sequences and cut
sites
are specified in order to show correctly the cutting position relative to
the recognition sequence. If no cut site is included, the cut is displayed
immediately before the first base of the recognition sequence, so that the
first base is shown as being on the 3' side of the cut.
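.para
The file format and the use of IUB symbols can be illustrated with a short
Python sketch. It is a sketch under stated assumptions and not the program's
own code: the names parse_entry and find_sites are hypothetical, the
sequence is treated as linear, and the reported position is taken to be the
number of the base immediately on the 3' side of the cut, which is what the
example listings suggest.
.lit
```python
# IUB symbols and the bases each one stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT", "N": "ACGT",
}

def parse_entry(entry):
    """Split NAME/STRING/.../STRING// into (name, [(string, cut_offset)])."""
    entry = entry.strip()
    if not entry.endswith("//"):
        raise ValueError("an entry must end with //")
    name, *strings = entry[:-2].split("/")
    sites = []
    for s in strings:
        cut = s.find("'")                 # -1 when no cut site is given
        sites.append((s.replace("'", "").upper(), max(cut, 0)))
    return name, sites

def find_sites(seq, string, cut_offset):
    """1-based positions of the base on the 3' side of each cut (linear)."""
    seq = seq.upper().replace("U", "T")
    positions = []
    for i in range(len(seq) - len(string) + 1):
        if all(seq[i + j] in IUB[sym] for j, sym in enumerate(string)):
            positions.append(i + cut_offset + 1)
    return positions

name, sites = parse_entry("ACCI/GT'MKAC//")
for string, cut in sites:
    print(name, string, find_sites("AAGTCGACTTGTAGACAA", string, cut))
```
.end lit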
.para
These collections of strings and their
names can be read from disk or entered from the keyboard.
When names and strings are entered from the keyboard the program will ask
for the name and then the string(s). If more than one string is typed per
name they must be separated by slash (/) characters. See the "Typical
dialogue" below.
Three files
containing restriction enzyme recognition sequences are currently
available. The "all enzymes" file contains the Rich Roberts REBASE
restriction enzyme database, which is updated monthly.
.para
The user can select strings
by name from these collections. If so the program will prompt for the
names, one at a time. The user can continue to select names until a blank
name is entered (by the user typing only return).
.para
Listed output can be displayed in several ways: it
can be ordered enzyme by enzyme, or on cut positions, or with enzyme
names
written above a listing of the sequence. This last listing can also include
a three phase translation of the sequence. In addition the program can
display only infrequent cutters (the user defines the maximum number of
cuts), or can plot the positions of matches.
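.para
For readers unfamiliar with the term, a three phase translation is simply
the translation of the sequence in each of its three reading frames. The
Python sketch below shows the idea. It assumes the standard genetic code
and one letter amino acid codes with * for stop; the names CODON_TABLE,
translate and three_phase are hypothetical, and a codon containing an
unrecognised symbol is shown as ? by assumption.
.lit
```python
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons ordered T, C, A, G at each position.
CODON_TABLE = {a + b + c: AMINO[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def translate(seq, frame=1):
    """Translate seq in reading frame 1, 2 or 3."""
    seq = seq.upper().replace("U", "T")
    codons = (seq[i:i + 3] for i in range(frame - 1, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "?") for codon in codons)

def three_phase(seq):
    """Translations in frames 1, 2 and 3."""
    return [translate(seq, frame) for frame in (1, 2, 3)]

for line in three_phase("GAATTCGGTTTGGGCTTGGTGTGAGG"):
    print(" ".join(line))
```
.end lit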
.para
Listings sorted "enzyme by enzyme" have the following form:
.lit
Matches found= 1
Name Sequence Position Fragment lengths
1 AATII GACGT'C 112 111 111
912 912
Matches found= 2
Name Sequence Position Fragment lengths
1 ACCI GT'CGAC 112 111 111
2 ACCI GT'AGAC 420 308 308
604 604
Matches found= 2
Name Sequence Position Fragment lengths
1 AHAII GA'CGTC 109 108 90
2 AHAII GG'CGTC 199 90 108
825 825
Matches found= 2
Name Sequence Position Fragment lengths
1 AVAII G'GACC 84 83 51
2 AVAII G'GTCC 973 889 83
51 889
Matches found= 1
Name Sequence Position Fragment lengths
1 BALI TGG'CCA 258 257 257
766 766
Matches found= 1
Name Sequence Position Fragment lengths
1 BAMHI G'GATCC 92 91 91
...... etc
Listings sorted on cut position have the following form:
Searching
Name Sequence Position Fragment lengths
1 ECORI G'AATTC 2 1
2 BANI G'GTGCC 26 24
3 BSP1286 GTGCC'C 31 5
4 BBVI 'TACTGCGCCGCAGCTGC 38 7
5 NSPBII CAG'CTG 51 13
6 PVUII CAG'CTG 51 0
7 BBVI GCAGCTGCTGGTG' 60 9
8 HINCII GTC'AAC 80 20
9 AVAII G'GACC 84 4
10 BINI 'CCAGGGATCC 87 3
11 BSTNI CC'AGG 89 2
12 BAMHI G'GATCC 92 3
13 XHOII G'GATCC 92 0
14 NSPBII CCG'CTG 98 6
15 BINI GGATCCGCT' 100 2
16 AHAII GA'CGTC 109 9
17 SALI G'TCGAC 111 2
18 AATII GACGT'C 112 1
19 ACCI GT'CGAC 112 0
20 HINCII GTC'GAC 113 1
21 BBVI GCAGCGACTGATT' 166 53
22 BINI 'ACTCAGATCC 178 12
23 XHOII A'GATCC 183 5
24 HGAI 'GGCGGCGGAGGCGTC 188 5
.....etc
Lists of infrequent cutters have the following form:
0 AFLII
0 AFLIII
0 APAI
0 APALI
0 ASUII
0 AVAI
0 AVRII
0 BCLI
0 BGLI
0 BGLII
0 BSMI
0 BSPMII
0 BSTEII
...... etc
Listings showing names above the sequence, and a translation have the
following form:
ECORI BANI BSP1286
. . . BBVI NSPBII
. . . . PVUII BBVI
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
10 20 30 40 50 60
E F G L G L V * G A Q R L L R R S C W C
N S V W A W C E V P R D Y C A A A A G A
I R F G L G V R C P E I T A P Q L L V L
HINCII
. AVAII
. . BINI
. . . BSTNI
. . . . BAMHI
. . . . XHOII NSPBII
. . . . . . BINI AHAII
. . . . . . . . SALI
. . . . . . . . .AATII
. . . . . . . . .ACCI
. . . . . . . . ..HINCII
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
70 80 90 100 110 120
W R C G G R Q R T Q G S A G R G R R R R
G G A E V V N G P R D P L D E D V D D E
A V R R S S T D P G I R W T R T S T T R
BBVI BINI
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
130 140 150 160 170 180
G G G G * R I A G G W Q R L I * V L T T
E E V D S A L L V A G S D * F E F * P L
R R W I A H C W W L A A T D L S S D H S
XHOII
. HGAI AHAII PFIMI
. . . . BBVI
CAGATCCGGCGGCGGAGGCGTCGAGGCTCCCGAAACTCCCAGTGGCTGGCCTGCTAGATT
190 200 210 220 230 240
Q I R R R R R R G S R N S Q W L A C * I
R S G G G G V E A P E T P S G W P A R F
D P A A E A S R L P K L P V A G L L D S
.........etc
.end lit
.para
The terms "possible" and "definite" matches are important only for
sequences that are back translations of protein into DNA, and which
therefore include IUB redundancy codes.
The matches that the program terms "definite matches" are ones in
which the specification of the recognition sequence corresponds
exactly to that of the back translation, and which consequently are definitely in
the DNA sequence. The program will also find what it
terms "possible matches", which are ones that depend on the particular
codons
chosen for each amino acid.
These are sites at which recognition
sequences could be engineered to produce a cut in the DNA
without changing the amino
acid, but which are not
necessarily found in the original sequence.
.para
The routine will handle both linear and circular sequences.
Cut sites spanning the "ends" of a sequence are found only
if the sequence is declared as circular.
This includes sites for
recognition sequences containing leading or trailing N symbols, in which
the actual recognition sequence does not span the join. For example, if the
recognition sequence was 'NNNNACGT and the first 4 characters in the
sequence were ACGT, then the match would only be found if the sequence
was
declared as circular. If the sequence is linear then the first fragment
starts at base number 1, and the last ends at the last base. If the
sequence is circular then the length of the first fragment is the
clockwise
distance from the last cut to the first.
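.para
The fragment lengths in the listings can be reproduced from the cut
positions with the Python sketch below. The name fragment_lengths is
hypothetical. A cut position is taken to be the number of the first base
after the cut, and the sequence length of 1023 used in the example is
inferred from the example listings (111 + 912) rather than stated anywhere.
.lit
```python
def fragment_lengths(cut_positions, seq_length, circular=False):
    """Fragment lengths in order along the sequence."""
    cuts = sorted(cut_positions)
    if not cuts:
        return [seq_length]
    inner = [b - a for a, b in zip(cuts, cuts[1:])]
    head = cuts[0] - 1                   # base 1 up to the first cut
    tail = seq_length - cuts[-1] + 1     # last cut up to the last base
    if circular:
        # The first fragment runs clockwise from the last cut to the first.
        return [tail + head] + inner
    return [head] + inner + [tail]

linear = fragment_lengths([112, 420], 1023)
print(linear, sorted(linear))
print(fragment_lengths([112, 420], 1023, circular=True))
```
.end lit
.para
For the linear case the first list corresponds to the fragments in order
and the sorted list to the second column of lengths in the "enzyme by
enzyme" listing.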
.para
Graphical output marks the position of each string by a
short vertical line and gives the name of the enzyme at the left end of
the
line. If the top of the screen is reached the program gives the user the
opportunity to take a hard copy and then will clear the screen and restart
plotting results at the original start position.
.para
Below is an edited piece of dialogue from use of the search option:
.lit
? Menu or option number=17
Search for restriction enzyme sites
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 = 2
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
AATII/GACGT'C//
ACCI/GT'MKAC//
AFLII/C'TTAAG//
AFLIII/A'CRYGT//
AHAII/GR'CGYC//
APAI/GGGCC'C//
APALI/G'TGCAC//
ASUII/TT'CGAA//
AVAI/C'YCGRG//
AVAII/G'GWCC//
AVRII/C'CTAGG//
BALI/TGG'CCA//
BAMHI/G'GATCC//
BANI/G'GYRCC//
BANII/GRGCY'C//
BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
BCLI/T'GATCA//
BGLI/GCCNNNN'NGGC//
BGLII/A'GATCT//
BINI/GGATCNNNN'/'NNNNNGATCC//
BSMI/GAATGCN'/NG'CATTC//
BSP1286/GDGCH'C//
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
Matches found= 1
Name Sequence Position Fragment lengths
1 AATII GACGT'C 112 111 111
912 912
Matches found= 2
Name Sequence Position Fragment lengths
1 ACCI GT'CGAC 112 111 111
2 ACCI GT'AGAC 420 308 308
604 604
Matches found= 2
Name Sequence Position Fragment lengths
1 AHAII GA'CGTC 109 108 90
2 AHAII GG'CGTC 199 90 108
825 825
Matches found= 2
Name Sequence Position Fragment lengths
1 AVAII G'GACC 84 83 51
2 AVAII G'GTCC 973 889 83
51 889
Matches found= 1
Name Sequence Position Fragment lengths
1 BALI TGG'CCA 258 257 257
766 766
Matches found= 1
Name Sequence Position Fragment lengths
1 BAMHI G'GATCC 92 91 91
932 932
Matches found= 1
Name Sequence Position Fragment lengths
1 BANI G'GTGCC 26 25 25
998 998
Matches found= 1
Name Sequence Position Fragment lengths
1 BANII GAGCC'C 490 489 489
534 534
Matches found= 11
Name Sequence Position Fragment lengths
1 BBVI 'TACTGCGCCGCAGCTGC 38 37 3
2 BBVI GCAGCTGCTGGTG' 60 22 22
3 BBVI GCAGCGACTGATT' 166 106 28
4 BBVI 'CCTGCTAGATTCGCTGC 230 64 37
5 BBVI GCAGCGGTACGTA' 452 222 50
6 BBVI 'CTCGCCAACGTTGCTGC 502 50 55
7 BBVI GCAGCCTTCAACT' 606 104 64
8 BBVI 'GAGGTATTCCTGGCTGC 634 28 97
9 BBVI 'CTGGCCGCCGCCGCTGC 869 235 104
10 BBVI 'GCCGCCGCCGCTGCTGC 872 3 106
11 BBVI GCAGCGATGAGGA' 927 55 222
....etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 = 2
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
Name Sequence Position Fragment lengths
1 ECORI G'AATTC 2 1
2 BANI G'GTGCC 26 24
3 BSP1286 GTGCC'C 31 5
4 BBVI 'TACTGCGCCGCAGCTGC 38 7
5 NSPBII CAG'CTG 51 13
6 PVUII CAG'CTG 51 0
7 BBVI GCAGCTGCTGGTG' 60 9
8 HINCII GTC'AAC 80 20
9 AVAII G'GACC 84 4
10 BINI 'CCAGGGATCC 87 3
11 BSTNI CC'AGG 89 2
12 BAMHI G'GATCC 92 3
13 XHOII G'GATCC 92 0
14 NSPBII CCG'CTG 98 6
15 BINI GGATCCGCT' 100 2
16 AHAII GA'CGTC 109 9
17 SALI G'TCGAC 111 2
18 AATII GACGT'C 112 1
19 ACCI GT'CGAC 112 0
20 HINCII GTC'GAC 113 1
.....etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
1 Order results enzyme by enzyme
X 2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =3
? Maximum number of cuts (0-100) (0) =
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
0 AFLII
0 AFLIII
0 APAI
0 APALI
0 ASUII
0 AVAI
0 AVRII
0 BCLI
0 BGLI
0 BGLII
0 BSMI
0 BSPMII
0 BSTEII
0 CLAI
0 DRAI
0 DRAII
0 ECOB
0 ECOK
0 ECORV
0 ESPI
......etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =
? (y/n) (y) Search for all names
1 Order results enzyme by enzyme
2 Order results by position
X 3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =4
? (y/n) (y) Hide translation n
? (y/n) (y) Use 1 letter codes
? Line length (30-90) (60) =
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
ECORI BANI BSP1286
. . . BBVI NSPBII
. . . . PVUII BBVI
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
10 20 30 40 50 60
E F G L G L V * G A Q R L L R R S C W C
N S V W A W C E V P R D Y C A A A A G A
I R F G L G V R C P E I T A P Q L L V L
HINCII
. AVAII
. . BINI
. . . BSTNI
. . . . BAMHI
. . . . XHOII NSPBII
. . . . . . BINI AHAII
. . . . . . . . SALI
. . . . . . . . .AATII
. . . . . . . . .ACCI
. . . . . . . . ..HINCII
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
70 80 90 100 110 120
W R C G G R Q R T Q G S A G R G R R R R
G G A E V V N G P R D P L D E D V D D E
A V R R S S T D P G I R W T R T S T T R
BBVI BINI
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
130 140 150 160 170 180
G G G G * R I A G G W Q R L I * V L T T
E E V D S A L L V A G S D * F E F * P L
R R W I A H C W W L A A T D L S S D H S
.......etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =5
Define search strings by typing a string name
followed by the string(s)
? Name=FRED
? String(s)=AAAAAA/TTTTTT
? Name=MARY
? String(s)=CCCC/GGGG/GCGCT
? Name=
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
Matches found= 9
Name Sequence Position Fragment lengths
1 FRED 'TTTTTT 1557 1556 1
2 FRED 'TTTTTT 1558 1 1
3 FRED 'TTTTTT 1559 1 1
4 FRED 'TTTTTT 1560 1 22
5 FRED 'AAAAAA 1582 22 529
6 FRED 'AAAAAA 3160 1578 1019
7 FRED 'AAAAAA 4204 1044 1044
8 FRED 'AAAAAA 5691 1487 1487
9 FRED 'AAAAAA 6710 1019 1556
529 1578
Matches found= 36
Name Sequence Position Fragment lengths
1 MARY 'CCCC 47 46 1
2 MARY 'GGGG 486 439 1
3 MARY 'GGGG 487 1 1
4 MARY 'CCCC 557 70 1
5 MARY 'CCCC 558 1 1
6 MARY 'GCGCT 1177 619 1
... etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =
1 All enzymes
X 2 Six cutters
3 Four cutters
4 Personal file
5 Keyboard
? 0,1,2,3,4,5 =5
Define search strings by typing a string name
followed by the string(s)
? Name=JANE
? String(s)=A'TTTT/CC'GGG
? Name=
? (y/n) (y) Search for all names
X 1 Order results enzyme by enzyme
2 Order results by position
3 Show only infrequent cutters
4 Show names above the sequence
? 0,1,2,3,4 =
? (y/n) (y) List matches
? (y/n) (y) The sequence is linear
? (y/n) (y) Search for definite matches
Searching
Matches found= 30
Name Sequence Position Fragment lengths
1 JANE A'TTTT 437 436 6
2 JANE A'TTTT 546 109 33
3 JANE A'TTTT 597 51 43
4 JANE A'TTTT 777 180 51
5 JANE A'TTTT 1274 497 60
6 JANE A'TTTT 1571 297 62
7 JANE CC'GGG 1926 355 75
8 JANE A'TTTT 2403 477 81
9 JANE A'TTTT 2586 183 82
10 JANE A'TTTT 2731 145 101
11 JANE A'TTTT 2812 81 103
... etc
X 1 Search
2 List enzyme file
3 Clear text
4 Clear graphics
? 0,1,2,3,4 =!
.end lit
.left margin1
@18. TX 1 7 @ Compare a short sequence
.LEFT MARGIN2
.para
This routine slides a short sequence along the current sequence and finds
all positions at which a given percentage of the bases match.
Output is in both graphical and listed forms.
.para
If users call for dialogue when the routine is selected they will be given
the choice of keyboard or file input. Define the string, select the "sense"
to use and the percentage match. Matches will be plotted out and then the
user can select to have them listed. Then the routine cycles around.
.para
The routine slides the search string
along the sequence and marks the positions at which a minimum
percentage score is reached. The graphical output draws a vertical line at
the match position; the height of the line represents the percentage
score,
so that if the line reaches the top of the box the score is 100%.
The NC-IUB symbols may be used in the search sequence to encode
uncertain
characters. Any other symbols will not match.
.LIT
NC-IUB SYMBOLS
A,C,G,T
R (A,G) 'puRine'
Y (T,C) 'pYrimidine'
W (A,T) 'Weak'
S (C,G) 'Strong'
M (A,C) 'aMino'
K (G,T) 'Keto'
H (A,T,C) 'not G'
B (G,C,T) 'not A'
V (G,A,C) 'not T'
D (G,A,T) 'not C'
N (G,A,C,T) 'aNy'
Typical dialogue is shown below.
? Menu or option number=18
Find percentage matches
? (y/n) (y) Keep picture
? String=AAATTTCCC
STRING=AAATTTCCC
? (y/n) (y) This sense
? Percent match (1.00-100.00) (70.00) =
Missing graphics display here
Total scoring positions above 70.000 percent = 7
Scores 7 6 6 6 6 6 6
Positions 365 212 213 292 311 358 627
? Display (0-7) (0) =3
365
ACATTTCGC
* ***** *
AAATTTCCC
1
212
GAAACTCCC
** ****
AAATTTCCC
1
213
AAACTCCCA
*** * **
AAATTTCCC
1
? (y/n) (y) Keep picture
Default String=AAATTTCCC
? String=
STRING=AAATTTCCC
? (y/n) (y) This sense n
STRING=GGGAAATTT
? Percent match (1.00-100.00) (70.00) =
Missing graphics display here
Total scoring positions above 70.000 percent = 7
Scores 6 6 6 6 6 6 6
Positions 269 270 271 288 354 624 853
? Display (0-7) (0) =3
269
GAGGGATTT
* * ****
GGGAAATTT
1
270
AGGGATTTT
** * ***
GGGAAATTT
1
271
GGGATTTTC
**** **
GGGAAATTT
1
? (y/n) (y) Keep picture !
.end lit
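.para
The sliding comparison described above can be sketched in a few lines of
Python. This is an illustration only, not the program's own code: the
names IUB and percent_matches are invented here, and the rule for turning
the percentage into a required number of matches (rounding down) is an
assumption suggested by the example output, where scores of 6 out of 9
are listed at a 70 percent cutoff.
.lit
```python
# NC-IUB symbol -> the bases it stands for (taken from the table above).
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}


def percent_matches(sequence, probe, min_percent=70.0):
    """Slide probe along sequence; return (position, matches) pairs.

    Positions are 1-based window starts. A window is reported when its
    number of matching characters reaches the required count.
    """
    n = len(probe)
    # Assumption: the required count is the percentage of the probe
    # length, rounded down (70 percent of 9 gives 6).
    needed = int(n * min_percent / 100.0)
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        # Any probe symbol that is not in the table matches nothing.
        matches = sum(1 for base, symbol in zip(window, probe)
                      if base in IUB.get(symbol, ""))
        if matches >= needed:
            hits.append((start + 1, matches))
    return hits
```
.end lit
.para
Applied to the aligned segments shown in the dialogue, the sketch gives 7
matches for ACATTTCGC and 6 for GAAACTCCC against the string AAATTTCCC.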
.left margin1
@19. TX 7 @ Compare a short sequence using a score matrix
.LEFT MARGIN2
.para
This routine slides a short sequence along the current sequence and finds
all positions at which a given level of similarity (a cutoff score) is
reached. The score is defined by use of a score matrix. Output is in both
graphical and listed forms.
.para
If users call for dialogue when the routine is selected they will be given
the choice of keyboard or file input. Define the string, select the "sense"
to use and the cutoff score. Matches will be plotted out and then the user
can select to have them listed. Then the routine cycles around.
.para
The routine slides the search string
along the sequence and marks the positions at which the cutoff score
is achieved. The graphical output draws a vertical line at
the match position; the height of the line represents the score,
so that if the line reaches the top of the box the score is the maximum
possible.
The NC-IUB symbols may be used in the search sequence to encode
uncertain
characters.
.para
The score matrix reflects the level of
redundancy in the probe sequence and hence will put more emphasis on
those
characters that are better defined. The score matrix is:
.lit
DNA SCORE MATRIX USING IUB SYMBOLS
T C A G - R Y W S M K H B V D N ?
T 36 0 0 0 9 0 18 18 0 0 18 12 12 0 12 9 0
C 0 36 0 0 9 0 18 0 18 18 0 12 12 12 0 9 0
A 0 0 36 0 9 18 0 18 0 18 0 12 0 12 12 9 0
G 0 0 0 36 9 18 0 0 18 0 18 0 12 12 12 9 0
- 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
R 0 0 18 18 18 36 0 9 9 9 9 6 6 12 12 18 0
Y 18 18 0 0 18 0 36 9 9 9 9 12 12 6 6 18 0
W 18 0 18 0 18 9 9 36 0 9 9 12 6 6 12 18 0
S 0 18 0 18 18 9 9 0 36 9 9 6 12 12 6 18 0
M 0 18 18 0 18 9 9 9 9 36 0 12 6 12 6 18 0
K 18 0 0 18 18 9 9 9 9 0 36 6 12 6 12 18 0
H 12 12 12 0 27 6 12 12 6 12 6 36 8 8 8 27 0
B 12 12 0 12 27 6 12 6 12 6 12 8 36 8 8 27 0
V 0 12 12 12 27 12 6 6 12 12 6 8 8 36 8 27 0
D 12 0 12 12 27 12 6 12 6 6 12 8 8 8 36 27 0
N 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
? is any unrecognised character.
Typical dialogue is shown below.
? Menu or option number=19
Find matches using a score matrix
? (y/n) (y) Keep picture
? String=AAATTTCCC
STRING=AAATTTCCC
? (y/n) (y) This sense
Minimum score= 0 Maximum score= 324
? Score (0-324) (280) =250
Missing graphics display here
For score 250 the number of matches= 1
Scores 252
Positions 365
? Display (0-1) (0) =1
365
ACATTTCGC
* ***** *
AAATTTCCC
1
? (y/n) (y) Keep picture
Default String=AAATTTCCC
? String=
STRING=AAATTTCCC
? (y/n) (y) This sense n
STRING=GGGAAATTT
Minimum score= 0 Maximum score= 324
? Score (0-324) (222) = 200
Missing graphics display here
For score 200 the number of matches= 7
Scores 216 216 216 216 216 216 216
Positions 269 270 271 288 354 624 853
? Display (0-7) (0) =3
269
GAGGGATTT
* * ****
GGGAAATTT
1
270
AGGGATTTT
** * ***
GGGAAATTT
1
271
GGGATTTTC
**** **
GGGAAATTT
1
? (y/n) (y) Keep picture !
.end lit
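.para
For a sequence that contains only A, C, G and T, the entries of the score
matrix above follow a simple pattern: a base scores 36 divided by the
number of bases the probe symbol stands for, and 0 if it is not one of
them. The following Python sketch uses that observation instead of
storing the whole table. It is an illustration under that restriction
(ambiguous characters in the sequence itself are not handled), and the
names IUB, symbol_score and matrix_matches are invented here.
.lit
```python
# NC-IUB symbol -> the bases it stands for.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}


def symbol_score(base, symbol):
    """Score one unambiguous sequence base against one probe symbol."""
    members = IUB.get(symbol, "")
    # 36 for an exact symbol, 18 for a two-base symbol, 12 for three, 9 for N.
    return 36 // len(members) if base in members else 0


def matrix_matches(sequence, probe, cutoff):
    """Return (position, score) for every 1-based window start whose
    summed score is at least cutoff."""
    n = len(probe)
    hits = []
    for start in range(len(sequence) - n + 1):
        window = sequence[start:start + n]
        score = sum(symbol_score(b, s) for b, s in zip(window, probe))
        if score >= cutoff:
            hits.append((start + 1, score))
    return hits
```
.end lit
.para
For the string AAATTTCCC the maximum is 9 x 36 = 324, and the segment
ACATTTCGC with its 7 exact matches scores 252, as in the dialogue above.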
.left margin1
@20. TX 7 @ Search for a motif using a weight matrix
.LEFT MARGIN2
.para
This function performs searches for short sequence
motifs using an appropriate weight matrix. In addition it can be used to
create or modify weight matrices. In order to perform a search the only
input
required is the name of the file containing the weight matrix.
The results can be presented graphically or listed. The graphical
presentation draws a line at the position of each match found; the
height of the line is proportional to the score.
.para
For a search, select "use weight matrix", supply the name of the file
containing the weight matrix, and choose between having results plotted
or listed. If dialogue is requested when the function is selected users can
alter the cutoff score employed.
.para
To create a weight matrix several steps are involved. A file containing an
alignment of known motifs is required. (This file must be created before
the current option is selected. The format is as follows: each sequence is
written on a separate line with at least one space at the beginning; each
sequence is terminated by a space character, and can be followed by a
name. The sequences must be aligned.) Supply the name of the file of
aligned sequences. The program reads and displays the sequences. Choose
between "summing logs of weights" or summing weights (i.e. whether to
multiply or add weights). If logs are used all scores will be negative.
Choose if all positions in the set of aligned sequences should be used or
if a mask should be applied. If so selected, define a mask as a string of
symbols, in which symbol - means ignore and any other symbol means
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
.para
The program will calculate weights as the frequencies of each base at
each unmasked position in the set of aligned sequences. These weights
are then applied to the set of aligned sequences to give a range of
"observed" scores. The mean and standard deviation of these scores are
displayed. The user is asked to supply several values to be used when the
weight matrix is applied to other sequences: a cutoff score (by default,
the mean minus 3 standard deviations), a top score for scaling graphical
results (by default, the mean plus 3 standard deviations), and a position
to identify (this means that if a particular base within the motif is used
as a "landmark", such as the A of the AG in splice acceptor sites, then its
position will be marked in plots). All these values are stored along with
the weight matrix. Finally supply the name of a file to contain the weight
matrix.
.para
Weight matrices can be "rescaled" using a set of aligned sequences in
much the same way as a matrix is created. The purpose is to redefine
the cutoff scores, and rescaling does not alter any other values in the
weight matrix file.
.para
The methods have changed considerably but were first outlined in
Staden, R. Nucl. Acids Res. 12 505-519 1984, and
Staden, R. Genetic
engineering: principles and methods vol 7, Edited by J.K. Setlow and A.
Hollaender, Plenum publishing corp., 1985.
.para
The methods have always had to deal with the problem of zeroes in the
matrices. The current versions
employ "Laplace's Law of Succession" in which 1 is
added to each term.
.para
It is now possible to apply a mask to a set of aligned sequences in
order to give weight to selected positions only.
Sequences have superimposed functions: some parts may be of general
structural
importance and give rise to an overall framework, and other parts give
specificity and hence are not common; we may want to use a set of
aligned
sequences to define a motif, but want to use only the framework
positions.
Alternatively we may want to pick out
only those parts of a set of aligned sequences that give a particular
property, and to ignore other similarities that are due to some other
property
and which could obscure the pattern
we are interested in. The ability to define a mask allows certain
positions
to be used in the motif and others to be ignored, and yet still permits the
use of a set of aligned sequences to calculate weights.
.para
Typical dialogue is shown below.
.lit
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =2
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
1 AGCGTGACTCTTCCCGGAA HIS1
2 GAGGTGACTCACTTGGAAG HIS1
3 CGGATGACTCTTTTTTTTT HIS3
4 ACAGTGACTCACGTTTTTT HIS4
5 GTCGTGACTCATATGCTTT ARG3
6 TGAATGACTCACTTTTTGG ARG4
7 TTCTTGACTCGTCTTTTCT CPA1
8 CGAATGACTCTTATTGATG CPA2
9 AGAATGACTAATTTTACTA TRP5
10 TCGTTGACTCATTCTAATC TRP3
11 TTGCTGACTCATTACGATT TRP2
12 GAGATGACTCTTTTTCTTT IV1
13 GCGATGATTCATTTCTCTG IV2
14 TAGATGACTCAGTTTAGTC LEU1
15 TAAGTGACTCAGTTCTTTC LEU4
16 ATGATGACTCTTAAGCATG ILS1
Length of motif 19
? (y/n) (y) Sum logs of weights
? (y/n) (y) Use all motif positions n
x means use, - means ignore
e.g. xx-x---x-x means use positions 1,2,4,8,10
? Mask=----XXXXXXXX
Applying weights to input sequences
1 -27.979 AGCGTGACTCTTCCCGGAA
2 -24.543 GAGGTGACTCACTTGGAAG
3 -20.890 CGGATGACTCTTTTTTTTT
4 -23.087 ACAGTGACTCACGTTTTTT
5 -22.771 GTCGTGACTCATATGCTTT
6 -23.408 TGAATGACTCACTTTTTGG
7 -25.159 TTCTTGACTCGTCTTTTCT
8 -22.679 CGAATGACTCTTATTGATG
9 -24.751 AGAATGACTAATTTTACTA
10 -23.157 TCGTTGACTCATTCTAATC
11 -23.067 TTGCTGACTCATTACGATT
12 -21.449 GAGATGACTCTTTTTCTTT
13 -24.191 GCGATGATTCATTTCTCTG
14 -23.770 TAGATGACTCAGTTTAGTC
15 -22.923 TAAGTGACTCAGTTCTTTC
16 -25.285 ATGATGACTCTTAAGCATG
Top score -20.890 Bottom score -27.979
Mean -23.694 Standard deviation 1.613
Mean minus 3.sd -28.534 Mean plus 3.sd -18.854
? Cutoff score (-999.00-9999.00) (-28.53) =
? Top score for scaling plots (-28.53-999.00) (-18.85) =
? Position to identify (0-19) (1) =
? Title=GCN4 SEQUENCES
? Name for new weight matrix file=1.WTS
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =3
? Name of existing weight matrix file=1.WTS
GCN4 SEQUENCES
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
Length of motif 19
? (y/n) (y) Sum logs of weights n
? (y/n) (y) Use all motif positions
Applying weights to input sequences
1 128.000 AGCGTGACTCTTCCCGGAA
2 148.000 GAGGTGACTCACTTGGAAG
3 172.000 CGGATGACTCTTTTTTTTT
4 160.000 ACAGTGACTCACGTTTTTT
5 161.000 GTCGTGACTCATATGCTTT
6 157.000 TGAATGACTCACTTTTTGG
7 149.000 TTCTTGACTCGTCTTTTCT
8 160.000 CGAATGACTCTTATTGATG
9 151.000 AGAATGACTAATTTTACTA
10 159.000 TCGTTGACTCATTCTAATC
11 158.000 TTGCTGACTCATTACGATT
12 169.000 GAGATGACTCTTTTTCTTT
13 152.000 GCGATGATTCATTTCTCTG
14 157.000 TAGATGACTCAGTTTAGTC
15 160.000 TAAGTGACTCAGTTCTTTC
16 143.000 ATGATGACTCTTAAGCATG
Top score 172.000 Bottom score 128.000
Mean 155.250 Standard deviation 10.034
Mean minus 3.sd 125.147 Mean plus 3.sd 185.353
? Cutoff score (-999.00-9999.00) (125.15) =
? Top score for scaling plots (125.15-999.00) (185.35) =
? Position to identify (0-19) (1) =
? Title=GCN4 SEQUENCES
? Name for new weight matrix file=2.WTS
? Menu or option number=20
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=1.WTS
GCN4 SEQUENCES
? (y/n) (y) Plot results n
153 -22.61 GCAGCGACTGATTTGAGTT
169 -28.53 GTTCTGACCACTCAGATCC
172 -27.27 CTGACCACTCAGATCCGGC
219 -27.35 CCAGTGGCTGGCCTGCTAG
268 -27.82 CGAGGGATTTTCGATCTTG
274 -26.99 ATTTTCGATCTTGTGGATG
283 -25.79 CTTGTGGATGATTTTCACG
287 -27.50 TGGATGATTTTCACGTGCG
298 -28.17 CACGTGCGCCGTCATATTG
332 -28.27 TCTTTGAAGCAGAAGGGAC
351 -28.27 AGGGGTACACTTTCACATT
357 -25.05 ACACTTTCACATTTCGCTT
364 -28.51 CACATTTCGCTTATGGGAG
400 -23.77 GAAGTTACTAATGTGCGTG
451 -26.22 ATGCTCGCCCTCTTTGGTG
476 -28.00 TCCCTCACTGAGCCCTCCG
480 -28.33 TCACTGAGCCCTCCGCCTC
517 -23.46 GCTAAGATTCAGCTTGGTT
556 -27.27 TCCAGCACTCAGGTTCGGC
602 -27.01 AACTTGAATCCATCGTTGC
648 -28.45 TGCTAAACACAGCCGGTTT
679 -28.18 CTGTTTGCCCAGTTTGGGC
691 -28.51 TTTGGGCCGCTTCTGGACG
713 -27.67 GGCTTGACCGTGGCTGTGG
803 -25.47 ATGCTGACCATGCTTTTCA
848 -28.11 ATAATGTTAAGTTTGATTC
857 -25.97 AGTTTGATTCCGCTGGCCG
879 -27.85 CCGCTGCTGCTGTTTCCAC
917 -27.77 GCGATGAGGAAGGCTTGTT
931 -27.81 TTGTTGGCGCGCCTGCTCG
952 -23.52 GAGGTGACTACCATCCGTG
977 -28.40 TGCGTGGGTGAGCTGTTGT
? Menu or option number=6
Page through text files
? Name of file to read=1.WTS
GCN4 SEQUENCES
19 1 -28.534 -18.854
P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
N 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
T 0 0 0 0 16 0 0 1 16 0 5 11 10 12 9 6 7 12 6
C 0 0 0 0 0 0 0 15 0 15 0 3 2 2 4 3 2 1 3
A 0 0 0 0 0 0 16 0 0 1 10 0 3 2 0 3 5 2 2
G 0 0 0 0 0 16 0 0 0 0 1 2 1 0 3 4 2 1 5
End of file
.end lit
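.para
The two scoring modes can be illustrated with the counts listed in 1.WTS
above. The Python sketch below is not the program's code. Summing the raw
counts reproduces the "sum weights" scores in the rescaling example (172
and 128 for sequences 3 and 1). For the "sum logs" mode, adding 1 to each
count and dividing by 21 reproduces the listed scores of -20.890 and
-27.979; the divisor 21 (the 16 sequences plus 5) is inferred from those
printed values and is an assumption, as are the names COUNTS,
sum_of_weights and sum_of_logs.
.lit
```python
import math

# Base counts per motif position, copied from the 1.WTS listing.
# Positions 0-3 were masked out, so their counts are all zero.
COUNTS = {
    "T": [0, 0, 0, 0, 16, 0, 0, 1, 16, 0, 5, 11, 10, 12, 9, 6, 7, 12, 6],
    "C": [0, 0, 0, 0, 0, 0, 0, 15, 0, 15, 0, 3, 2, 2, 4, 3, 2, 1, 3],
    "A": [0, 0, 0, 0, 0, 0, 16, 0, 0, 1, 10, 0, 3, 2, 0, 3, 5, 2, 2],
    "G": [0, 0, 0, 0, 0, 16, 0, 0, 0, 0, 1, 2, 1, 0, 3, 4, 2, 1, 5],
}
N_SEQUENCES = 16


def sum_of_weights(window):
    """Score a 19-base window by adding the count of each observed base."""
    return sum(COUNTS[base][i] for i, base in enumerate(window))


def sum_of_logs(window, extra=5):
    """Score a window by adding natural logs of (count + 1) / (N + extra).

    Adding 1 to every count avoids log(0). The value of extra is an
    assumption chosen to agree with the scores printed in the dialogue.
    """
    denominator = N_SEQUENCES + extra
    return sum(math.log((COUNTS[base][i] + 1) / denominator)
               for i, base in enumerate(window))
```
.end lit
.para
A search would apply one of these functions to every window of the
sequence and report those reaching the cutoff score stored in the file.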
.left margin1
@21. TX 3 @ Count base composition
.LEFT MARGIN2
.para
This routine
calculates the base composition of the
active region of the sequence as both totals and percentages.
.left margin1
@22. TX 3 @ Count dinucleotide frequencies
.LEFT MARGIN2
.para
This routine simply counts dinucleotide frequencies for the currently
active region of the sequence. It also calculates an expected distribution
based on the base composition.
The output looks like:
.LIT
T C A G
obs expected obs expected obs expected obs expected
T 8.44 8.25 6.67 7.01 10.35 9.92 3.27 3.54
C 7.49 7.01 6.76 5.95 8.39 8.43 1.76 3.01
A 10.13 9.92 7.78 8.43 11.74 11.93 4.89 4.26
G 2.67 3.54 3.19 3.01 4.06 4.26 2.42 1.52
.END LIT
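.para
The expected values in the table above are consistent with the product of
the two base frequencies, expressed as a percentage (for example TA is
expected at 9.92 and AA at 11.93). A minimal Python sketch of such a
calculation follows. The function name dinucleotide_table is invented
here, and the treatment of characters other than T, C, A and G (they are
skipped) is an assumption.
.lit
```python
from collections import Counter


def dinucleotide_table(seq, bases="TCAG"):
    """Return {dinucleotide: (observed %, expected %)} for a sequence."""
    seq = seq.upper()
    base_counts = Counter(b for b in seq if b in bases)
    total_bases = sum(base_counts.values())
    # Overlapping dinucleotides; pairs with an unknown character are skipped.
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1)
                    if seq[i] in bases and seq[i + 1] in bases)
    total_pairs = sum(pairs.values())
    table = {}
    for a in bases:
        for b in bases:
            observed = 100.0 * pairs[a + b] / total_pairs if total_pairs else 0.0
            # Expected percentage if neighbouring bases were independent.
            expected = (100.0 * base_counts[a] * base_counts[b]
                        / (total_bases * total_bases)) if total_bases else 0.0
            table[a + b] = (observed, expected)
    return table
```
.end lit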
.left margin1
@23. TX 3 5 @ Count codons and amino acids
.LEFT MARGIN2
.para
This function counts codons and calculates amino acid composition, protein
molecular weights, and base composition. Users select the segments of the
sequence that the program should analyse.
.para
Choose between being shown observed counts or counts normalised so
that the totals for each amino acid sum to 100. Select to define
segments using either the keyboard or an EMBL feature table.
Define the segments to count over. Select strand for each segment. Stop
selecting segments by typing a zero for "Count from ()". The results are
displayed a screenful at a time, and the bell is sounded to show there is
more to come. A zero start position, or the end of an EMBL feature table,
signals
the routine to print out totals for all values.
.para
The counts are broken down into several figures:
base composition by position in codon expressed as a percentage of each
base's own frequency; base composition by position in codon expressed as a
percentage of the overall base composition of the section; base
composition expected for this amino acid composition if there were no
codon preference; and percentage deviations of the observed amino acid
composition from an average amino acid composition.
.para
The output looks like:
.LIT
===========================================
F TTT 1. S TCT 2. Y TAT 2. C TGT 1.
F TTC 1. S TCC 1. Y TAC 3. C TGC 2.
L TTA 7. S TCA 4. * TAA 9. * TGA 1.
L TTG 2. S TCG 1. * TAG 2. W TGG 2.
===========================================
L CTT 3. P CCT 2. H CAT 4. R CGT 1.
L CTC 2. P CCC 3. H CAC 1. R CGC 0.
L CTA 3. P CCA 2. Q CAA 4. R CGA 0.
L CTG 2. P CCG 2. Q CAG 1. R CGG 2.
===========================================
I ATT 9. T ACT 1. N AAT 7. S AGT 3.
I ATC 2. T ACC 2. N AAC 4. S AGC 2.
I ATA 4. T ACA 5. K AAA 13. R AGA 5.
M ATG 1. T ACG 2. K AAG 4. R AGG 1.
===========================================
V GTT 2. A GCT 2. D GAT 1. G GGT 3.
V GTC 2. A GCC 2. D GAC 1. G GGC 1.
V GTA 4. A GCA 3. E GAA 2. G GGA 1.
V GTG 2. A GCG 0. E GAG 1. G GGG 1.
===========================================
total codons= 166.
T C A G
1 31.06 33.68 34.03 35.00
2 35.61 35.79 30.89 32.50
3 33.33 30.53 35.08 32.50
1 24.70 19.28 39.16 16.87
2 28.31 20.48 35.54 15.66
3 26.51 17.47 40.36 15.66
% 26.51 19.08 38.35 16.06 observed, overall totals
% 25.00 22.26 33.10 19.65 expected, even codons per acid
A C D E F G H I K L
7. 3. 2. 3. 2. 6. 5. 15. 17. 19.
o-e % -47. -33. -76. -68. -64. -54. 62. 116. 67. 67.
M N P Q R S T V W Y
1. 11. 9. 5. 9. 13. 10. 10. 2. 5.
o-e % -62. 66. 12. -17. 19. 21. 6. -2. 0. -5.
total acids= 154. molecular weight= 17421.
Typical dialogue follows.
? Menu or option number=23
Calculate codon usage, base composition
and amino acid composition
? (y/n) (y) Show observed counts
? (y/n) (y) Define segments using keyboard
? Count from (0-1023) (0) =1
? Count to (1-1023) (1023) =1000
? (y/n) (y) + strand
===========================================
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
===========================================
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
===========================================
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
===========================================
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
===========================================
Total codons= 333.
T C A G
1 23.32 37.69 28.99 40.06
2 37.15 22.31 38.46 36.59
3 39.53 40.00 32.54 23.34
----- ----- ----- -----
= 100% 100% 100% 100%
1 17.72 29.43 14.71 38.14 = 100%
2 28.23 17.42 19.52 34.83 = 100%
3 30.03 31.23 16.52 22.22 = 100%
% 25.33 26.03 16.92 31.73 Observed, overall totals
% 24.44 22.31 20.90 32.35 Expected, even codons per acid
A C D E F G H I K L
33. 10. 15. 9. 17. 40. 10. 12. 5. 33.
O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13.
M N P Q R S T V W Y
2. 7. 2. 13. 49. 20. 11. 30. 5. 2.
O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81.
Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8
? Count from (0-1023) (0) =
Codon totals over all genes
===========================================
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
===========================================
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
===========================================
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
===========================================
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
===========================================
Total codons= 333.
T C A G
1 23.32 37.69 28.99 40.06
2 37.15 22.31 38.46 36.59
3 39.53 40.00 32.54 23.34
----- ----- ----- -----
= 100% 100% 100% 100%
1 17.72 29.43 14.71 38.14 = 100%
2 28.23 17.42 19.52 34.83 = 100%
3 30.03 31.23 16.52 22.22 = 100%
% 25.33 26.03 16.92 31.73 Observed, overall totals
% 24.44 22.31 20.90 32.35 Expected, even codons per acid
A C D E F G H I K L
33. 10. 15. 9. 17. 40. 10. 12. 5. 33.
O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13.
M N P Q R S T V W Y
2. 7. 2. 13. 49. 20. 11. 30. 5. 2.
O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81.
Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8
.END LIT
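.para
The codon table and the two "base composition by position in codon"
tables can be illustrated with the Python sketch below. It is not the
program's code: the function names are invented, the segment is assumed
to start in frame, and any incomplete codon at the end is ignored. In the
first returned table each column (base) sums to 100, in the second each
row (codon position) sums to 100, matching the "= 100%" labels in the
output above.
.lit
```python
from collections import Counter


def codon_usage(seq, start=1, end=None):
    """Count codons in seq[start..end] (1-based, inclusive), reading
    in frame from start."""
    seq = seq.upper()
    end = len(seq) if end is None else end
    segment = seq[start - 1:end]
    usable = len(segment) - len(segment) % 3   # drop a trailing partial codon
    return Counter(segment[i:i + 3] for i in range(0, usable, 3))


def composition_by_position(codons, bases="TCAG"):
    """Return (by_base, by_position), each a list of three dicts."""
    counts = [dict.fromkeys(bases, 0) for _ in range(3)]
    for codon, n in codons.items():
        for pos, base in enumerate(codon):
            if base in counts[pos]:
                counts[pos][base] += n
    base_totals = {b: sum(c[b] for c in counts) for b in bases}
    by_base, by_position = [], []
    for pos in range(3):
        row_total = sum(counts[pos].values())
        # Percentage of each base's own total found at this position.
        by_base.append({b: 100.0 * counts[pos][b] / base_totals[b]
                        if base_totals[b] else 0.0 for b in bases})
        # Percentage of this position occupied by each base.
        by_position.append({b: 100.0 * counts[pos][b] / row_total
                            if row_total else 0.0 for b in bases})
    return by_base, by_position
```
.end lit
.para
With a range of 1 to 1000, as in the dialogue above, 333 complete codons
are counted and the last base is left over.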
.LEFT MARGIN1
@24. TX 3 @ Plot base composition
.LEFT MARGIN2
.para
This option plots the base composition of the sequence. The counts for
any combination of bases can be plotted.
.para
If dialogue is requested the user is presented with a check box for
selecting which bases should be counted, and then allowed to define a
window length, and a "plot interval". Otherwise, the AT composition is
plotted with a window of 101 and a plot interval of 5.
.para
Typical dialogue follows.
.lit
? Menu or option number=d24
Plot base composition
checkbox: those set are marked X
X 1 T
2 C
X 3 A
4 G
? 0,1,2,3,4 =1
checkbox: those set are marked X
1 T
2 C
X 3 A
4 G
? 0,1,2,3,4 =3
checkbox: those set are marked X
1 T
2 C
3 A
4 G
? 0,1,2,3,4 =2
checkbox: those set are marked X
1 T
X 2 C
3 A
4 G
? 0,1,2,3,4 =4
checkbox: those set are marked X
1 T
X 2 C
3 A
X 4 G
? 0,1,2,3,4 =
? odd span length (1-201) (31) =
? plot interval (1-11) (5) =
missing graphics
.end lit
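.para
A windowed composition count of this kind can be sketched in Python as
follows. The sketch is illustrative only. The name composition_profile
is invented, and centring each odd-length window on the plotted position
is an assumption about how the span is applied.
.lit
```python
def composition_profile(seq, bases="AT", span=101, interval=5):
    """Return (position, count) pairs, where count is the number of the
    selected bases in an odd-length window centred on the 1-based position."""
    if span % 2 == 0:
        raise ValueError("span must be odd")
    seq = seq.upper()
    half = span // 2
    points = []
    # Only positions where the whole window fits inside the sequence.
    for centre in range(half, len(seq) - half, interval):
        window = seq[centre - half:centre + half + 1]
        count = sum(1 for base in window if base in bases)
        points.append((centre + 1, count))
    return points
```
.end lit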
.left margIN1
@25. TX 3 @ Plot local deviations in base composition
.LEFT MARGIN2
.para
The "local deviation" routines are designed to indicate the similarity of
the compositions of different parts of the sequence. The composition of
every segment of the sequence is compared with a standard composition.
The levels of similarity are plotted as chi squared values. The standard
can be the composition of the whole sequence, or alternatively that of a
small segment defined by the user.
.para
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as a standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
.para
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
where obs is the observed composition
and exp is the expected composition (the composition of the standard).
The calculation is performed once to find out the range of values and is
then repeated and
plotted so that the plot exactly fills the allocated screen space.
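.para
The two-pass calculation can be illustrated with the Python sketch below.
It is not the program's code, and it rests on several assumptions:
compositions are taken as fractions, the measure is read as the sum over
bases of (obs-exp)*(obs-exp)/(exp*exp), bases absent from the standard
are skipped, and the window and interval defaults are arbitrary. Note
that the conventional chi squared statistic divides by exp rather than by
exp*exp. The function names are invented here.
.lit
```python
from collections import Counter


def composition(segment, bases="TCAG"):
    """Fractional base composition of a segment."""
    counts = Counter(b for b in segment if b in bases)
    total = sum(counts.values())
    return {b: counts[b] / total if total else 0.0 for b in bases}


def local_deviation(seq, window=31, interval=5, standard=None):
    """First pass: deviation of each window from the standard composition.

    standard is an optional segment; by default the whole sequence is used.
    Returns (1-based window start, deviation) pairs.
    """
    seq = seq.upper()
    exp = composition(standard.upper() if standard else seq)
    values = []
    for start in range(0, len(seq) - window + 1, interval):
        obs = composition(seq[start:start + window])
        deviation = sum((obs[b] - exp[b]) ** 2 / (exp[b] ** 2)
                        for b in exp if exp[b] > 0)
        values.append((start + 1, deviation))
    return values


def scale_to_box(values):
    """Second pass: rescale so the smallest value is 0 and the largest 1."""
    low = min(v for _, v in values)
    high = max(v for _, v in values)
    spread = (high - low) or 1.0   # avoid dividing by zero on a flat plot
    return [(pos, (v - low) / spread) for pos, v in values]
```
.end lit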
.left margIN1
@26. TX 3 @ Plot local deviations from dinucleotide composition
.LEFT MARGIN2
.para
The "local deviation" routines are designed to indicate the similarity of
the compositions of different parts of the sequence. The dinucleotide
composition of every segment of the sequence is compared with a
standard composition. The levels of similarity are plotted as chi
squared values. The standard can be the composition of the whole
sequence, or alternatively that of a small segment defined by the user.
.para
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as a standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
.para
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
where obs is the observed composition
and exp is the expected composition (the composition of the standard).
The calculation is performed once to find out the range of values and is
then repeated and
plotted so that the plot exactly fills the allocated screen space.
.left margin1
@27. TX 3 @ Plot local deviations from trinucleotide composition
.LEFT MARGIN2
.para
The "local deviation" routines are designed to indicate the similarity of
the compositions of different parts of the sequence. The trinucleotide
composition of every segment of the sequence is compared with a
standard composition. The levels of similarity are plotted as chi
squared values. The standard can be the composition of the whole
sequence, or alternatively that of a small segment defined by the user.
.para
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as a standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
.para
The following measure is used: for each window position
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
where obs is the observed composition
and exp is the expected composition (the composition of the standard).
The calculation is performed once to find out the range of values and is
then repeated and
plotted so that the plot exactly fills the allocated screen space.
.left margin1
@28. TX 5 @ Calculate codon constraint
.left margin2
.para
The purpose of this option (which is somewhat specialised) is to measure
the level of constraint imposed on the sequence by coding for a protein of
the observed composition. It measures the strength of the codon bias
averaged over windows of 99 codons and displays the values observed.
.para
Select between defining segments at the keyboard or using an EMBL
feature table. Finish selecting segments by typing a zero start. The value
for each segment is displayed:
.para
Mean (W-EW) / EWD, window 99 10.5
.para
The codon constraint is the
difference between the observed codon improbability and the mean
improbability for
a sequence of the same composition. See McLachlan, Staden and Boswell,
Nucl. Acids Res. 1984.
.left margin1
@59. TX 3 @ Plot negentropy
.LEFT MARGIN2
.para
This routine is designed to show regions of the sequence that differ in
composition from others, and hence is like the "plot deviation.." routines.
.para
Negentropy or information is defined in the following way: let Pi be the
probability of observing base i, where i = A,C,G or T, then the average
information per base is
I=-sum(Pi.Log(Pi)) (sum over all i). This routine calculates Pi by
calculating the overall composition for the sequence and then plots I for
windows of length defined by the user.
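.para
The definition can be illustrated with the Python sketch below, which is
not the program's code. The function information applies the formula
directly. The function windowed_information is one possible reading of
the windowed plot: Pi is taken from the whole sequence and -Log(Pi) is
averaged over the bases in each window. That reading, the use of natural
logarithms, and all function names are assumptions.
.lit
```python
import math
from collections import Counter


def base_probabilities(seq, bases="ACGT"):
    """Pi for each base that occurs in the sequence."""
    counts = Counter(b for b in seq.upper() if b in bases)
    total = sum(counts.values())
    return {b: counts[b] / total for b in bases if counts[b]}


def information(probabilities):
    """Average information per base: I = -sum(Pi * log(Pi))."""
    return -sum(p * math.log(p) for p in probabilities.values() if p > 0)


def windowed_information(seq, window, interval=1):
    """Return (1-based window start, mean of -log(Pi) over the window),
    with Pi taken from the composition of the whole sequence."""
    seq = seq.upper()
    p = base_probabilities(seq)
    values = []
    for start in range(0, len(seq) - window + 1, interval):
        known = [b for b in seq[start:start + window] if b in p]
        mean = -sum(math.log(p[b]) for b in known) / len(known) if known else 0.0
        values.append((start + 1, mean))
    return values
```
.end lit
.para
For a sequence with equal amounts of the four bases, information returns
log(4), the largest possible value.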
.left margin1
@30. TX 4 @ Search for hairpin loops
.LEFT MARGIN2
.para
Used to find simple inverted repeats or potential hairpin loops.
The loops are defined by a range of sizes for
the loop and a minimum number of consecutive base pairs in the stem.
The results can be presented graphically or listed.
A-T, G-C and G-T basepairs are counted.
.para
Define the range of loop sizes and the minimum number of consecutive
basepairs required. Choose between plotted or listed results.
.para
The loops found are plotted as blips on a
horizontal line that represents the sequence; the height of each blip is
proportional to the number of basepairs in the stem. Note that only
uninterrupted stems are found - i.e. all basepairs must be made. To look
for stems with some unpaired bases (or for palindromes) use the inverted
repeat motif class in the pattern searching option.
.para
Typical dialogue follows.
.lit
? Menu or option number=30
Search for hairpin loops
Define the range of loop sizes
? Minimum loop size (1-30) (1) =
? Maximum loop size (3-60) (3) =
? Minimum number of basepairs (2-20) (6) =
? (y/n) (y) Plot results n
Searching
T.G
G-C
G.T
T.G
C-G
G-C
T.G
C-G
G.T
GCCGCA GCGGAGG
49
G
G-C
T.G
C-G
G.T
T.G
G-C
CTGCTG GGAGGTC
56
G
T.G
G-C
G.T
T.G
C-G
G-C
T-A
T.G
AGCGCA CGACTGA
139
A C
G.T
C-G
G.T
C-G
C-G
G-C
TTCGCT CAACGCC
244
.end lit
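.para
A search of this kind can be sketched in Python as below. The sketch is
illustrative and is not the program's algorithm. It tries every loop
start and every permitted loop size, then counts uninterrupted A-T, G-C
and G-T pairs outwards from the loop. The same stem may therefore be
reported with more than one loop size. The names PAIRS and find_hairpins
are invented here.
.lit
```python
# Base pairs counted in a stem: A-T, G-C and G-T, in either orientation.
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}


def find_hairpins(seq, min_loop=1, max_loop=3, min_pairs=6):
    """Return (loop start, loop size, stem pairs) for each hairpin found.

    The loop start is the 1-based position of the first unpaired base.
    """
    seq = seq.upper()
    hits = []
    for loop_start in range(1, len(seq)):
        for loop_size in range(min_loop, max_loop + 1):
            left = loop_start - 1            # last base before the loop
            right = loop_start + loop_size   # first base after the loop
            pairs = 0
            # Extend the stem outwards while consecutive bases still pair.
            while left >= 0 and right < len(seq) and (seq[left], seq[right]) in PAIRS:
                pairs += 1
                left -= 1
                right += 1
            if pairs >= min_pairs:
                hits.append((loop_start + 1, loop_size, pairs))
    return hits
```
.end lit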
.LEFT MARGIN1
@31. TX 4 @ Search for long range inverted repeats
.LEFT MARGIN2
.para
Searches for inverted repeats. The repeats found are exact matches of at
least 6 consecutive bases. Results can be presented graphically or listed.
Plotted results show the end points of repeats joined by rectangular
lines.
.para
If dialogue is not requested the defaults will be taken. Otherwise choose
between plotted or listed results. If required select to analyse a
restricted segment of the currently active region. Choose a repeat length.
.para
Typical dialogue follows.
.lit
? Menu or option number=D31
Plot long-range inverted repeats
? (y/n) (y) Plot results n
Define restricted region
? start (1-1023) (1) =
? end (2-1023) (1023) =
? Minimum inverted repeat (6-30) (12) =10
Searching
27 909 10 TGCCCAGAGA
.end lit
.LEFT MARGIN1
@32. TX 4 @ Search for repeats
.LEFT MARGIN2
.para
Searches for direct repeats. The repeats found are exact matches of at
least 6 consecutive bases. Results can be presented graphically or listed.
Plotted results show the end points of repeats joined by rectangular
lines.
.para
If dialogue is not requested the defaults will be taken. Otherwise choose
between plotted or listed results. If required select to analyse a
restricted segment of the currently active region. Choose a repeat length.
.para
Typical dialogue follows.
.lit
? Menu or option number=D32
Plot repeats
? (y/n) (y) Plot results n
Define restricted region
? start (1-1023) (1) =
? end (2-1023) (1023) =
? Minimum repeat (6-30) (12) =8
Searching
619 988 8 GCTGTTGT
514 646 8 GCTGCTAA
94 865 8 TCCGCTGG
146 222 9 GTGGCTGGC
455 497 8 TCGCCCTC
454 496 9 CTCGCCCTC
872 875 8 GCCGCCGC
510 615 8 CGTTGCTG
152 913 8 GGCAGCGA
199 265 8 CGTCGAGG
689 794 8 AGTTTGGG
147 223 8 TGGCTGGC
101 116 8 GACGAGGA
8 690 8 GTTTGGGC
52 141 8 TGCTGGTG
.end lit
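.para
A minimal sketch of an exact direct repeat search is given below. It is
written in Python for illustration and is not the program's own code.
The function name is hypothetical, and the choice to report only the
leftmost start of each matching pair (extended to the right as far as
the match continues) is an assumption made to keep the listing short;
the listing above shows that the program itself may report overlapping
matches separately.
.lit
```python
from collections import defaultdict


def find_direct_repeats(seq, min_len=12):
    """List (pos1, pos2, length, word) for exact direct repeats."""
    seq = seq.upper()
    n = len(seq)
    index = defaultdict(list)
    # Index every word of length min_len by its 0-based start.
    for i in range(n - min_len + 1):
        index[seq[i:i + min_len]].append(i)
    hits = []
    for starts in index.values():
        for a in range(len(starts)):
            for b in range(a + 1, len(starts)):
                p, q = starts[a], starts[b]
                # Skip pairs that are the continuation of a match
                # starting one base further left.
                if p > 0 and seq[p - 1] == seq[q - 1]:
                    continue
                length = min_len
                # Extend to the right while the two copies still agree.
                while q + length < n and seq[p + length] == seq[q + length]:
                    length += 1
                hits.append((p + 1, q + 1, length, seq[p:p + length]))
    return sorted(hits)
```
.end lit
.para
The four values in each hit correspond to the columns of the listing:
the two start positions (numbered from 1), the repeat length and the
repeated sequence.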
.left margin1
@33. TX 4 @ Search for z dna (total ry, yr)
.LEFT MARGIN2
.para
Searches for segments of the sequence that might form Z DNA. A window
length is chosen and the number of RY and YR dinucleotides within each
window is plotted. The top of the box corresponds to all RY or YR, the
bottom to zero RY or YR.
.para
If dialogue is requested, select a window length and plot interval.
Otherwise the defaults will be used.
.para
The program contains three
separate ways of doing this (options 33,34,35).
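.para
The window count can be sketched as follows. This Python fragment is an
illustration only, with a hypothetical function name. It assumes that
every overlapping pair of adjacent bases is treated as a dinucleotide,
so that a window of length w scores between 0 and w-1, and that any
base other than A or G is a pyrimidine.
.lit
```python
PURINES = set("AG")


def ry_yr_counts(seq, window=12, step=1):
    """Return (start, count) of RY or YR dinucleotides for each window."""
    seq = seq.upper()
    is_r = [base in PURINES for base in seq]
    # True wherever a purine is next to a pyrimidine (RY or YR).
    alternates = [is_r[i] != is_r[i + 1] for i in range(len(seq) - 1)]
    counts = []
    for start in range(0, len(seq) - window + 1, step):
        # A window of `window` bases contains window - 1 adjacent pairs.
        total = sum(alternates[start:start + window - 1])
        counts.append((start + 1, total))
    return counts
```
.end lit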
.left margin1
@34. TX 4 @ Search for z dna (runs of ry, yr)
.LEFT MARGIN2
.para
Searches for segments of the sequence that might form Z DNA. Results
are plotted.
.para
If dialogue is requested define a window length and plot interval.
Otherwise the defaults will be used.
The routine
counts the number of R in positions 1,3,5 etc =R1, the number of Y in
positions 2,4,6 etc =Y1, the number of Y in positions 1,3,5 etc =Y2 and
the
number of R in positions 2,4,6 etc =R2 for a window length. It plots the
maximum of R1+Y1 and R2+Y2 relative to a minimum of (window
length)/2 and a
maximum of (window length). (see 33,35).
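.para
The calculation for one window can be sketched as follows. The Python
below is an illustration with hypothetical function names, and it
assumes that positions are counted from the start of each window and
that any base other than A or G is a pyrimidine. Because the larger of
the two sums is taken, counting positions from the start of the
sequence instead would give the same value.
.lit
```python
PURINES = set("AG")


def phased_ry_score(window):
    """Return max(R1+Y1, R2+Y2) for one window of sequence."""
    odd = window[0::2]    # positions 1,3,5 etc
    even = window[1::2]   # positions 2,4,6 etc
    r1 = sum(base in PURINES for base in odd)
    y1 = sum(base not in PURINES for base in even)
    y2 = len(odd) - r1    # pyrimidines in odd positions
    r2 = len(even) - y1   # purines in even positions
    return max(r1 + y1, r2 + y2)


def z_dna_profile(seq, window=12, step=1):
    """Return (start, score) for each window position, numbered from 1."""
    seq = seq.upper()
    return [(s + 1, phased_ry_score(seq[s:s + window]))
            for s in range(0, len(seq) - window + 1, step)]
```
.end lit
.para
Since R1+Y1 and R2+Y2 together account for every base in the window,
the larger of the two cannot fall below half the window length, which
is why the plot minimum is (window length)/2.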
.LEFT MARGIN1
@35. TX 4 @ Search for z dna (best phased value)
.LEFT MARGIN2
.para
Searches for segments of the sequence that might form Z DNA. Results
are plotted.
.para
If dialogue is requested define a window length and a plot interval.
Otherwise the default values will be used.
.para
The routine
counts the number of consecutive RY or YR dinucleotides in phase. It
moves
through the sequence counting the number of RY or YR dinucleotides; when
the next dinucleotide is not of the correct type the score is set back to
zero and the search restarted using the current base to set the phase. The
plots are done relative to a minimum of zero and a maximum defined by
the
user. (See 33,34).
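.para
One possible reading of this counting rule is sketched below in Python.
It is an interpretation, not the program's own code: the function name
is hypothetical, any base other than A or G is treated as a pyrimidine,
and the use of the window length in the real routine is not modelled.
A run is taken to be a series of in-phase dinucleotides that are all RY
or all YR.
.lit
```python
PURINES = set("AG")


def best_phased_scores(seq):
    """Return, per base, the running count of in-phase RY (or YR) pairs."""
    is_r = [base in PURINES for base in seq.upper()]
    scores = [0] * len(is_r)
    i, score, kind = 0, 0, None
    while i + 1 < len(is_r):
        pair = (is_r[i], is_r[i + 1])
        if pair[0] == pair[1]:
            # RR or YY: the run is broken, restart from the next base.
            score, kind = 0, None
            i += 1
        elif kind is None or pair == kind:
            # First pair of a run, or another pair of the same type.
            kind = pair
            score += 1
            scores[i + 1] = score
            i += 2  # step by two to stay in phase
        else:
            # Alternating but of the other type: reset, and let the
            # current base set the phase on the next pass of the loop.
            score, kind = 0, None
    return scores
```
.end lit
.para
The largest value in the returned list is the length, in dinucleotides,
of the longest in-phase run.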
.LEFT MARGIN1
@36. TX 4 @ Local similarity or complementarity search
.LEFT MARGIN2
.PARA
This function is designed to find segments of
local similarity or complementarity. It is therefore like performing
a DIAGON
plot that is
restricted to regions near the main diagonal. Results can be presented
graphically or listed.
.para
Users define
a region to search through,
a span length, a range for searching through and a cut-off score. The
program takes all sections of sequence
of length span within the defined region
and compares them to
all other sequences within the region and
range specified.
If a match above the cutoff is found we
need to show the position
of the two sections of sequence and the score, and we do it in the
following way.
If we have a 70%
match between
a sequence that starts at p1 and a sequence that starts at p2
the program draws a
diagonal line that starts at p1 with height 70% of the box and which
finishes at p2 with
height 0.
The matches can also be listed.
.para
Here I define the terms range, region, and span and what is compared.
Suppose we have a defined region j1 to j2, a range of i1 to i2 and a span
of
s; the program will take, in turn, all sections of sequence of length s
within j1 and j2 and compare them to all sequences that start a distance
i1+s-1
to i2+s-1 away from them. First it will take the sequence of length s
starting
at j1 and compare it
with the sequence of length s starting at
j1+s-1+i1, then j1+s-1+i1+1, etc up to j1+s-1+i2; then it will take the
sequence of length s starting at j1+1 and compare it with the sequence
starting at j1+s-1+1+i1 etc. This continues until we hit
the right hand end of the
sequence as defined by j2. Note 1) that sequences are not compared with
themselves: the nearest sequence compared to a span s starting at j
starts
at j+s; 2) ranges i1 and i2 are ranges of start positions; 3) by choosing a
range greater than the length of the sequence this routine will do a full
DIAGON analysis except for those points within a distance span of
the main diagonal (see note 1).
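.para
The way region, range and span combine can be sketched as follows. The
Python below is an illustration with a hypothetical function name; it
covers direct similarity only (not complementarity) and scores a pair
of spans simply as the percentage of identical bases. For a span
starting at j the second span starts at j+s-1+i, so with a span of 15
a span starting at 100 is paired with one starting at 118 when i is 4.
.lit
```python
def local_similarity(seq, j1, j2, i1, i2, span, min_percent):
    """Return (start1, start2, percent) for span pairs above the cutoff.

    j1..j2 is the region, i1..i2 the range, all numbered from 1.
    """
    seq = seq.upper()
    hits = []
    for j in range(j1, j2 - span + 2):          # start of the first span
        first = seq[j - 1:j - 1 + span]
        for i in range(i1, i2 + 1):
            k = j + span - 1 + i                # start of the second span
            if k + span - 1 > j2:
                break                           # second span leaves the region
            second = seq[k - 1:k - 1 + span]
            matches = sum(a == b for a, b in zip(first, second))
            percent = 100.0 * matches / span
            if percent >= min_percent:
                hits.append((j, k, percent))
    return hits
```
.end lit
.para
With i1 set to 1 the nearest span compared starts at j+s, so a span is
never compared with itself or with any span that overlaps it.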
.para
Typical dialogue follows.
.lit
? Menu or option number=36
Search for local similarity or complementarity
? (y/n) (y) Find direct repeats
? (y/n) (y) Keep picture n
? Span (5-200) (15) =
Define restricted region
? start (0-1023) (1) =
? end (2-1023) (1023) =
? Percent match (1.00-100.00) (70.00) =
? Range start (1-50) (1) =
? Range end (1-50) (1) =5
? (y/n) (y) Plot results n
Working
118 128
CGAGGAGGAG GTGGA
** ***** ** **
GGACGAGGAC GTCGA
100 110
119 129
GAGGAGGAGG TGGAT
** ***** * * **
GACGAGGACG TCGAC
101 111
? (y/n) (y) Find direct repeats n
? (y/n) (y) Keep picture
? Span (5-200) (15) =
Define restricted region
? start (0-1023) (1) =
? end (2-1023) (1023) =
? Percent match (1.00-100.00) (70.00) =
? Range start (1-50) (1) =
? Range end (1-50) (5) =8
? (y/n) (y) List results
Working
178 188
ACTCAGATCC GGCGG
***** *** * **
ACTCAAATCA GTCGC
156 166
177 187
CACTCAGATC CGGCG
***** *** * **
AACTCAAATC AGTCG
157 167
? (y/n) (y) Find inverted repeats !
.end lit
.left margin1
@37. TX 5 @ Set genetic code
.LEFT MARGIN2
.para
This function allows the user to change the current active genetic code
for
all the options. The user may select: the standard code, the mammalian
mitochondrial code, the yeast mitochondrial code or a personal code
(define
your own).
.para
Select code. If personal, define a codon and select an amino acid. When all
codons have been reset define a blank codon.
.para
The code differences are:
.lit
Mammalian Yeast
Codon Mitochondrial Mitochondrial Standard
UGA W W STOP
AUA M M I
CUA L T L
AGA STOP R R
AGG STOP R R
.END LIT
.para
Typical dialogue follows.
.lit
? Menu or option number=37
X 1 Standard code
2 Mammalian mitochondrial code
3 Yeast mitochondrial code
4 Personal code
? 0,1,2,3,4 =2
? Menu or option number=37
X 1 Standard code
2 Mammalian mitochondrial code
3 Yeast mitochondrial code
4 Personal code
? 0,1,2,3,4 =4
Define genetic code by typing a codon
followed by a 1 letter amino acid symbol
? Codon=TTT
Default Amino acid symbol=F
? Amino acid symbol=W
? Codon=
.end lit
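.para
The table of code differences can be expressed as a set of changes
applied to the standard code, as in the Python sketch below. The names
are hypothetical and the fragment is not part of the program. Codons
are written with T rather than U, as in the dialogue, and only the five
differences listed in the table above are applied.
.lit
```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Standard code: codon -> one letter amino acid symbol, * for stop.
STANDARD = {a + b + c: AMINO[16 * i + 4 * j + k]
            for i, a in enumerate(BASES)
            for j, b in enumerate(BASES)
            for k, c in enumerate(BASES)}

# Differences from the standard code, taken from the table above.
DIFFERENCES = {
    "mammalian mitochondrial": {"TGA": "W", "ATA": "M",
                                "AGA": "*", "AGG": "*"},
    "yeast mitochondrial": {"TGA": "W", "ATA": "M", "CTA": "T"},
}


def genetic_code(name="standard", personal=None):
    """Return a codon table; `personal` holds user defined changes."""
    code = dict(STANDARD)
    code.update(DIFFERENCES.get(name, {}))
    code.update(personal or {})
    return code
```
.end lit
.para
A personal code such as the one defined in the dialogue above
corresponds to genetic_code(personal={"TTT": "W"}).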
.left margin1
@38. T 3 4 @ Examine repeats
.left margin2
.para
This function can be used to examine the frequencies of repeated words
within a sequence. It finds all words that occur more than once. The
user selects a minimum word length and the program finds all words of that
length that occur more than once; then it "follows" each repeated word until it
becomes unique. For each word length it can report the number of different
repeated words, the number of occurrences of each word, and their actual
positions and sequences.
.para
It is possible that the algorithm may run out of memory, particularly if a short
minimum word length is chosen, or if the sequence is very long or very
repetitive. If this occurs the longest reported word length will not
necessarily be the longest in the sequence: the memory will have been consumed
before the longest word is found.
.para
Typical dialogue and output are shown below.
.lit
Expected length of longest repeat 14
? Minumim word length (1-6) (6) =6
Working
? Show repeat frequencies for words of at least length (6-15) (15) =10
For length 10 the number of different repeated words is 2035
For length 11 the number of different repeated words is 613
For length 12 the number of different repeated words is 161
For length 13 the number of different repeated words is 37
For length 14 the number of different repeated words is 10
For length 15 the number of different repeated words is 1
? Show repeats for words of length (6-15) (15) =14
? Show repeats for words occuring with frequency (2-9999) (2) =2
ggtgctcatgccca
occurs at 21611
occurs at 21851
ttatccggtgatga
occurs at 4604
occurs at 8806
agcaccacgctgac
occurs at 5954
occurs at 9486
catgacggaggatg
occurs at 10480
occurs at 19925
aaagacgggaaaat
occurs at 11820
occurs at 43157
tacaaaaccaattt
occurs at 26797
occurs at 31369
cgagaaagagtgcg
occurs at 4260
occurs at 44305
gccggatgatggcg
occurs at 7893
occurs at 16638
atgacggaggatga
occurs at 10481
occurs at 19926
gcggcgaacgaggc
occurs at 11352
occurs at 18718
? Show repeats for words of length (6-15) (15) =!
Example of not enough memory
----------------------------
Expected length of longest repeat 14
? Minumim word length (1-6) (6) =1
Working
Not enough memory
Memory used in bytes 1125996. Length of longest repeat 5
? Show repeat frequencies for words of at least length (1-5) (5) =!
.end lit
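.para
The counts reported by this function can be illustrated with the Python
sketch below. The function names are hypothetical. For simplicity the
sketch rebuilds the word table for every length instead of "following"
each repeated word, so it is a statement of what is counted rather than
of how the program does it, and it takes no account of memory limits.
.lit
```python
from collections import defaultdict


def repeated_words(seq, length):
    """Map each word of this length that occurs more than once
    to its list of start positions (numbered from 1)."""
    positions = defaultdict(list)
    for i in range(len(seq) - length + 1):
        positions[seq[i:i + length]].append(i + 1)
    return {word: where for word, where in positions.items()
            if len(where) > 1}


def repeat_frequencies(seq, min_length):
    """Return {length: repeated words} from min_length upwards,
    stopping at the first length with no repeated word."""
    table = {}
    length = min_length
    while True:
        words = repeated_words(seq, length)
        if not words:
            break
        table[length] = words
        length += 1
    return table
```
.end lit
.para
For each length, the number of entries gives the number of different
repeated words, and the length of each position list gives the
frequency of that word.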
.left margin1
@39. TX 5 @ Translate and list in upto six phases
.LEFT MARGIN2
.para
This is a general listing function that will perform translations and
produce several forms of output. The possibilities are:
.lit
1) no translation, list one or two strands, two ways of numbering the
sequence.
2) translation, one or two strands, one or three letter codes.
Positions defined by:
a) open reading frames of some minimum length l, l can be 0, hence giving
a complete six phase translation.
b) positions typed on keyboard, again 1 to 6 phases, translations appearing
above and below the dna.
c) positions read from a feature table.
It should be used in preference to option 5. For publication
without a translation, the option to number ends of lines is more compact
than option 5. Some examples and typical dialogue are given below. Note the
requirement for d39.
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
1 Typed on the keyboard
2 Read from a feature table
X 3 Open reading frames
? 1,2,3 =
? Minimum open frame in amino acids (0-7238) (30) =
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines
N A T T I S R I D A T F S A R A P N E N
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
. : . : . : . : . : . :
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
* S A G W I F I
A V V I L L I S A V K E A R A G F S F
I A K Q V I D H L R N V S N G Q T K S T
L N R L L T I C E M Y L M V K L N L L
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
. : . : . : . : . : . :
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
Y S F L N N V M Q S I Y R I T L S F R S
I A L C T I S W K R F T D L P * V L D V
R S Q N W E S T V T W N E T S R H R T L
V R R I G N Q L L H G M K L P D T V L *
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
. : . : . : . : . : . :
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
T R L I P F
R E C F Q S D V T V H F S V E L C R V K
V A Y L K H V E L Q H Q I Q Q L S S K P
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
. : . : . : . : . : . :
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
T A Y K F C T S S C C W I
S A K M T S Y Q K E Q L K V L S N P D L
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
. : . : . : . : . : . :
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation N
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =
? (y/n) (y) Number ends of lines
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
1 Typed on the keyboard
2 Read from a feature table
X 3 Open reading frames
? 1,2,3 =
? Minimum open frame in amino acids (0-7238) (30) =0
? (y/n) (y) Use 1 letter codes N
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines
AsnAlaThrThrIleSerArgIleAspAlaThrPheSerAlaArgAlaProAsnGluAsn
ThrLeuLeuLeuLeuValGluLeuMetProProPheGlnLeuAlaProGlnMetLysIle
ArgTyrTyrTyr******Asn***CysHisLeuPheSerSerArgProLys***Lys
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
. : . : . : . : . : . :
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
ValSerSerSerAsnThrSerAsnIleGlyGlyLys***SerAlaGlyTrpIlePheIle
Arg************TyrPheGlnHisTrpArgLysLeuGluArgGlyLeuHisPheTyr
AlaValValIleLeuLeuIleSerAlaValLysGluAlaArgAlaGlyPheSerPhe
IleAlaLysGlnValIleAspHisLeuArgAsnValSerAsnGlyGlnThrLysSerThr
***LeuAsnArgLeuLeuThrIleCysGluMetTyrLeuMetValLysLeuAsnLeuLeu
TyrSer***ThrGlyTyr***ProPheAlaLysCysIle***TrpSerAsn***IleTyr
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
. : . : . : . : . : . :
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
TyrSerPheLeuAsnAsnValMetGlnSerIleTyrArgIleThrLeuSerPheArgSer
Leu***ValPro***GlnGlyAsnAlaPheHisIle***HisAspPhe***Ile***Glu
IleAlaLeuCysThrIleSerTrpLysArgPheThrAspLeuPro***ValLeuAspVal
ArgSerGlnAsnTrpGluSerThrValThrTrpAsnGluThrSerArgHisArgThrLeu
ValArgArgIleGlyAsnGlnLeuLeuHisGlyMetLysLeuProAspThrValLeu***
SerPheAlaGluLeuGlyIleAsnCysTyrMetGlu***AsnPheGlnThrProTyrPhe
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
. : . : . : . : . : . :
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
ThrArgLeuIleProPhe***SerAsnCysProIlePheSerGlySerValThrSer***
AsnAlaSerAsnProIleLeuGln***MetSerHisPheLysTrpValGlyTyrLysLeu
ArgGluCysPheGlnSerAspValThrValHisPheSerValGluLeuCysArgValLys
ValAlaTyrLeuLysHisValGluLeuGlnHisGlnIleGlnGlnLeuSerSerLysPro
LeuHisIle***AsnMetLeuSerTyrSerThrArgPheSerAsn***AlaLeuSerHis
SerCysIlePheLysThrCys***AlaThrAlaProAspSerAlaIleLysLeu***Ala
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
. : . : . : . : . : . :
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
AsnCysIle***PheMetAsnLeu***LeuValLeuAsnLeuLeu***AlaArgLeuTrp
GlnMetAsnLeuValHisGlnAlaValAlaGlySerGluAlaIleLeuSer***AlaMet
ThrAlaTyrLysPheCysThrSerSerCysCysTrpIle***CysAsnLeuGluLeuGly
SerAlaLysMetThrSerTyrGlnLysGluGlnLeuLysValLeuSerAsnProAspLeu
ProGlnLys***ProLeuIleLysArgSerAsn***ArgTyrSerLeuIleLeuThrCys
IleArgLysAsnAspLeuLeuSerLysGlyAlaIleLysGlyThrLeu***Ser***Pro
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
. : . : . : . : . : . :
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
GlyCysPheHisGlyArgIleLeuLeuLeuLeu***LeuTyrGluArgIleArgValGln
ArgLeuPheSerArgLysAspPheProAlaIleLeuProValArg***AspGlnGlyThr
AspAlaPheIleValGlu******PheSerCysAsnPheThrSerGluLeuGlySerArg
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
1 Typed on the keyboard
2 Read from a feature table
X 3 Open reading frames
? 1,2,3 =1
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =
? (y/n) (y) Number ends of lines N
Translate
? From (0-300) (0) =101
? To (1-300) (300) =300
Translate
? From (0-300) (0) =102
? To (1-300) (300) =200
Translate
? From (0-300) (0) =
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT
10 20 30 40 50 60
M V K L N L L
W S N * I Y
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT
70 80 90 100 110 120
V R R I G N Q L L H G M K L P D T V L *
S F A E L G I N C Y M E * N F Q T P Y F
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA
130 140 150 160 170 180
L H I * N M L S Y S T R F S N * A L S H
S C I F K T C
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA
190 200 210 220 230 240
P Q K * P L I K R S N * R Y S L I L T C
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG
250 260 270 280 290 300
? Menu or option number=D39
Find open reading frames, translate and list
? (y/n) (y) Show translation
The segments to translate can be
1 Typed on the keyboard
2 Read from a feature table
X 3 Open reading frames
? 1,2,3 =2
? Embl feature table file=1.FT
? (y/n) (y) Use 1 letter codes
Define section of DNA to display
? start (1-7238) (1) =
? end (2-7238) (7238) =300
? Line length (30-120) (60) =
Which strands should be shown
X 1 + strand only
2 - strand only
3 Both strands
? 1,2,3 =3
? (y/n) (y) Number ends of lines
N A T T I S R I D A T F S A R A P N E N
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
. : . : . : . : . : . :
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
* S A G W I F I
A V V I L L I S A V K E A R A G F S F
I A K Q V I D H L R N V S N G Q T K S T
L N R L L T I C E M Y L M V K L N L L
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
. : . : . : . : . : . :
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
Y S F L N N V M Q S I Y R I T L S F R S
I A L C T I S W K R F T D L P * V L D V
R S Q N W E S T V T W N E T S R H R T L
V R R I G N Q L L H G M K L P D T V L *
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
. : . : . : . : . : . :
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
T R L I P F
R E C F Q S D V T V H F S V E L C R V K
V A Y L K H V E L Q H Q I Q Q L S S K P
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
. : . : . : . : . : . :
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
T A Y K F C T S S C C W I
S A K M T S Y Q K E Q L K V L S N P D L
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
. : . : . : . : . : . :
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
* L Y E R I R V Q
* F S C N F T S E L G S R
.end lit
.left margin1
@40. TX 5 @ Translate and write the protein sequence to disk
.LEFT MARGIN2
.para
This routine allows the user to translate sections of the sequence into
the
1 letter amino acid codes and store the resulting amino acid sequences in
a disk file.
Two modes of use are possible. Either all open reading frames of at least
some minimum length will
automatically be found and translated, or the user can specify that
particular segments be translated.
.para
Mode 1: the user selects to translate all open reading frames.
.para
Either, or both, strands can be
translated.
The output file is in the same format as a PIR .seq file.
Each protein segment is given an entry name that is its start base in
the DNA, and a title that includes its end position,
reading frame and strand (+ for plus, - for minus).
Each segment is terminated by * whether or not
there is a stop codon in the DNA. The file is therefore suitable for input
to FASTA, ALIGNL and ANALYSEPL.
.para
Mode 2: the user selects to identify the segments to translate.
.para
Either, or both, strands can be
translated.
If multiple coding regions
are translated each will be separated from the previous one by a gap of 5
dashes (-----).
The sections to translate can be
defined from the keyboard or by supplying the name of the appropriate
EMBL
library feature table.
.para
Typical dialogue follows.
.lit
? Menu or option number=40
Translate and write protein sequence to disk
? (y/n) (y) Translate selected regions
? (y/n) (y) Define segments using keyboard
Translate
? From (0-1023) (0) =1
? To (1-1023) (1023) =111
? (y/n) (y) + strand
Translate
? From (0-1023) (0) =
? Output file name=1.OUT
? Menu or option number=40
Translate and write protein sequence to disk
? (y/n) (y) Translate selected regions n
? Minimum open frame in amino acids (5-1000) (30) =
X 1 + strand only
2 - strand only
3 Both strands
? 0,1,2,3 =3
? File name for translation=1.OUT
? Menu or option number=6
Page through text files
? Name of file to read=1.OUT
>P1; 25
135 1 +
GAQRLLRRSCWCWRCGGRQRTQGSAGRGRRRRGGGG*
>P1; 238
486 1 +
IRCRDCGQRRRGIFDLVDDFHVRRHIVLARKLFEAEGTGVHFHISLMGGNIVTAEVTNVR
VDAGADFAAVRMLALFGAVVPH*
>P1; 556
795 1 +
SSTQVRRASAQTSSLQLESIVAVVNVEVFLAAKHSRFYIAVLFAQFGPLLDARLDRGCGK
GAGRRDQWRGGGVDLANGR*
>P1; 796
987 1 +
FGYADHAFHLRSTSRHSDNVKFDSAGRRRCCCFHLVFSLGSDEEGLLARLLVEVTTIRVV
LRG*
>P1; 2
163 2 +
NSVWAWCEVPRDYCAAAAGAGGAEVVNGPRDPLDEDVDDEEEVDSALLVAGSD*
>P1; 176
391 2 +
PLRSGGGGVEAPETPSGWPARFAAATVANAVEGFSILWMIFTCAVILSLRVNSLKQKGQG
YTFTFRLWEVT*
>P1; 476
628 2 +
SLTEPSASPSPTLLLRFSLVLTEGVPNPALRFGVLPLRPAAFNLNPSLLL*
>P1; 629
958 2 +
MSRYSWLLNTAGFTSPFCLPSLGRFWTRGLTVAVEKEPAGETNGVEAALTLPMGVSLGML
TMLFTCAPPAAIPIMLSLIPLAAAAAAVSTWCFLWAAMRKACWRACSLR*
>P1; 3
293 3 +
IRFGLGVRCPEITAPQLLVLAVRRSSTDPGIRWTRTSTTRRRWIAHCWWLAATDLSSDHS
DPAAEASRLPKLPVAGLLDSLPRLWPTPSRDFRSCG*
>P1; 411
521 3 +
CACRRGSRLCSGTYARPLWCSSPSLSPPPRPRQRCC*
>P1; 1020
37 1 -
EFGKYNPLTDNSSPTQDHTDGSHLNEQARQQAFLIAAQRKHQVETAAAAAASGIKLNIIG
MAAGGAQVKSMVSIPKLTPIGKVNAASTPLVSPAGSFSTATVKPRVQKRPKLGKQNGDVK
PAVFSSQEYLDIYNSNDGFKLKAAGLSGSTPNLSAGLGTPSVKTKLNLSSNVGEGEAEGS
VRDYCTKEGEHTYRCKVCSRVYTHISNFCRHYVTSHKRNVKVYPCPFCFKEFTRKDNMTA
HVKIIHKIENPSTALATVAAANLAGQPLGVSGASTPPPPDLSGQNSNQSLPATSNALSTS
SSSSTSSSSGSLGPLTTSAPPAPAAAAQ*
>P1; 373
-1 2 -
AKCESVPLSLLLQRVYAQGQYDGARENHPQDRKSLDGVGHSRGSESSRPATGSFGSLDAS
AAGSEWSELKSVAASHQQCAIHLLLVVDVLVQRIPGSVDDLRTASTSSCGAVISGHLTPS
PNRI*
>P1; 517
407 2 -
QQRWRGRGGGLSEGLLHQRGRAYVPLQSLLPRLHAH*
>P1; 649
518 2 -
QPGIPRHLQQQRWIQVEGCWSERKHAEPECWIRNSLCQNQAES*
>P1; 853
650 2 -
HYRNGGWWSAGEKHGQHTQTNAHWQGQRRLHAIGLACRLLFHSHGQAARPEAAQTQTER
RCKTGCV*
>P1; 958
854 2 -
SPQRAGAPTSLPHRCPEKTPGGNSSSGGGQRNQT*
>P1; 179
78 3 -
VVRTQISRCQPPAMRYPPPPRRRRPRPADPWVR*
>P1; 479
363 3 -
GTTAPKRASIRTAAKSAPASTRTLVTSAVTMLPPISEM*
>P1; 791
666 3 -
RPLARSTPPPRHWSRLPAPFPQPRSSRASRSGPNWANRTAM*
>P1; 1022
819 3 -
SNSASTTRSPTTAHPRRTTRMVVTSTSRRANKPSSSLPRENTRWKQQQRRRPAESNLTLS
EWRLVERR*
End of file
.end lit
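.para
Mode 1 can be sketched for the plus strand as follows. The Python below
is an illustration with hypothetical function names and is not the
program's own code. It assumes the standard genetic code, treats an
open reading frame as any stretch between stop codons, and reports as
the end position the last base of the last translated codon; the
program's own conventions may differ. The layout of the entries follows
the example output above.
.lit
```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def open_frames(seq, min_aa=30):
    """Return (start, end, frame, protein) for plus strand open frames."""
    seq = seq.upper()
    found = []
    for frame in (1, 2, 3):
        start, protein = None, []
        for pos in range(frame - 1, len(seq) - 2, 3):
            aa = CODE.get(seq[pos:pos + 3], "X")  # X for unknown codons
            if aa == "*":
                if protein and len(protein) >= min_aa:
                    found.append((start, pos, frame, "".join(protein)))
                start, protein = None, []
            else:
                if start is None:
                    start = pos + 1               # positions count from 1
                protein.append(aa)
        # A frame still open at the end of the sequence is also kept.
        if protein and len(protein) >= min_aa:
            found.append((start, pos + 3, frame, "".join(protein)))
    return found


def pir_entries(frames, strand="+"):
    """Format frames as entries: name line, title line, sequence plus *."""
    lines = []
    for start, end, frame, protein in frames:
        lines.append(">P1; %d" % start)
        lines.append("%d %d %s" % (end, frame, strand))
        text = protein + "*"                      # always terminated by *
        for i in range(0, len(text), 60):
            lines.append(text[i:i + 60])
    return "\n".join(lines)
```
.end lit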
.LEFT MARGIN1
@41. TX 5 @ Calculate and write codon table to disk
.LEFT MARGIN2
.para
This routine calculates codon usage tables
for sections of the sequence
and stores the resulting tables on disk.
The sections to translate can be
defined from the keyboard or by supplying the name of the appropriate
EMBL
library feature table.
.para
If required users can add to an existing codon table stored as a disk file.
Choose between storing observed counts or having them normalised so
that the totals for each amino acid sum to 100. Select between defining
segments at the keyboard or using an EMBL feature table. Define
segments. Signal completion with a zero start. Supply a file name. For
each segment the program will display the counts, at the end it will
display the accumulated totals.
.para
Typical dialogue follows.
.lit
? Menu or option number=41
Calculate and write codon table to disk
? (y/n) (y) Start with empty table
? (y/n) (y) Show observed counts
? (y/n) (y) Define segments using keyboard
? Count from (0-1023) (0) =1
? Count to (1-1023) (1023) =111
? (y/n) (y) + strand
===========================================
F TTT 0. S TCT 0. Y TAT 0. C TGT 0.
F TTC 1. S TCC 1. Y TAC 0. C TGC 3.
L TTA 1. S TCA 0. * TAA 0. * TGA 1.
L TTG 2. S TCG 0. * TAG 0. W TGG 2.
===========================================
L CTT 0. P CCT 0. H CAT 0. R CGT 2.
L CTC 0. P CCC 0. H CAC 0. R CGC 2.
L CTA 0. P CCA 0. Q CAA 1. R CGA 1.
L CTG 1. P CCG 0. Q CAG 2. R CGG 2.
===========================================
I ATT 0. T ACT 0. N AAT 0. S AGT 0.
I ATC 0. T ACC 1. N AAC 0. S AGC 1.
I ATA 0. T ACA 0. K AAA 0. R AGA 1.
M ATG 0. T ACG 0. K AAG 0. R AGG 0.
===========================================
V GTT 0. A GCT 1. D GAT 0. G GGT 3.
V GTC 0. A GCC 1. D GAC 0. G GGC 1.
V GTA 0. A GCA 0. E GAA 1. G GGA 4.
V GTG 1. A GCG 0. E GAG 0. G GGG 0.
===========================================
? Count from (0-1023) (0) =
Codon totals over all genes
===========================================
F TTT 0. S TCT 0. Y TAT 0. C TGT 0.
F TTC 1. S TCC 1. Y TAC 0. C TGC 3.
L TTA 1. S TCA 0. * TAA 0. * TGA 1.
L TTG 2. S TCG 0. * TAG 0. W TGG 2.
===========================================
L CTT 0. P CCT 0. H CAT 0. R CGT 2.
L CTC 0. P CCC 0. H CAC 0. R CGC 2.
L CTA 0. P CCA 0. Q CAA 1. R CGA 1.
L CTG 1. P CCG 0. Q CAG 2. R CGG 2.
===========================================
I ATT 0. T ACT 0. N AAT 0. S AGT 0.
I ATC 0. T ACC 1. N AAC 0. S AGC 1.
I ATA 0. T ACA 0. K AAA 0. R AGA 1.
M ATG 0. T ACG 0. K AAG 0. R AGG 0.
===========================================
V GTT 0. A GCT 1. D GAT 0. G GGT 3.
V GTC 0. A GCC 1. D GAC 0. G GGC 1.
V GTA 0. A GCA 0. E GAA 1. G GGA 4.
V GTG 1. A GCG 0. E GAG 0. G GGG 0.
===========================================
? (y/n) (y) Save table in a file n
.end lit
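.para
Counting and normalisation can be sketched as follows. The Python below
is an illustration with hypothetical function names. It assumes the
standard genetic code and a segment on the plus strand, and it ignores
any triplet containing a symbol other than A, C, G or T.
.lit
```python
from collections import Counter

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def count_codons(seq, start, end):
    """Count codons in seq[start..end], positions numbered from 1."""
    segment = seq.upper()[start - 1:end]
    codons = (segment[i:i + 3] for i in range(0, len(segment) - 2, 3))
    return Counter(codon for codon in codons if codon in CODE)


def normalise(counts):
    """Scale counts so the codons of each amino acid sum to 100."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODE[codon]] += n
    return {codon: 100.0 * n / totals[CODE[codon]]
            for codon, n in counts.items()}
```
.end lit
.para
Counts from several segments can be accumulated by adding the Counter
objects together before normalising.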
.left margin1
@42. TX 6 @ Codon usage method
.LEFT MARGIN2
.para
Used to find protein coding regions. For each window position along the
sequence the routine measures the closeness to an expected codon usage.
Results are plotted for each of the three reading frames. Stop and start
codons are also marked on the plots. Has the highest resolution of all
such methods, but makes the strongest assumption, i.e. that the codon
usage is known. The latest version is described in Methods in Enzymology
183, 193-211.
.para
Choose whether to use an internal standard (i.e. part of the current
sequence known to code for a protein). If so define its end points, and
those of any others. Otherwise supply the name of a disk file containing a
table of codon usage. Tables are listed. Choose between using the
observed counts, or two types of normalisation: normalised to give an
average amino acid composition; normalised to no amino acid bias. The
first normalisation is clearly often sensible, but the second removes
valuable information and is only made available for special
circumstances. The final table will be displayed, followed by the
expected scores for window lengths 21, 31 and 41 codons. The scores for
each of the three reading frames are shown (they are logarithmic values)
to help users choose a window length for the analysis. Define a window
length and plot interval. Plotting will start.
.para
The method was first described in
Staden and McLachlan Nucl. Acids Res. 10 141-156 (1982) and the
following is a summary of the initial ideas.
The method makes the following main assumptions: the codon
preferences
of all the
genes in the sequence we are examining are similar to those of the
standard;
the sequence is coding
throughout its whole length in only one reading frame; in the coding
frame
the frequency of codon abc has a definite value Fabc.
.LEFT MARGIN2
If we select a sequence a1b1c1a2b2c2a3b3c3,...,anbncnan+1bn+1cn+1
then the
probability of selecting it in each of the three frames is:
.left margin15
frame 1: p1=Fa1b1c1.Fa2b2c2....Fanbncn
.left margin15
frame 2: p2=Fb1c1a2.Fb2c2a3...Fbncnan+1
.left margin15
frame 3: p3=Fc1a2b2.Fc2a3b3...Fcnan+1bn+1
.LEFT MARGIN2
The probability that selection of a particular sequence was "caused" by it
being a coding sequence is:
.LEFT MARGIN2
P1=p1/(p1+p2+p3), P2=p2/(p1+p2+p3), P3=p3/(p1+p2+p3).
.LEFT MARGIN2
The program calculates these values for the given window length but
plots
Log(P/(1-P)) for each of the three frames. At each point along the
sequence
that the program has a
point to plot it finds which of the three values is highest and places a
single point at the 50% level for the corresponding frame. These single
points will join to form a solid line if one frame is consistently the
highest scoring. In addition stop codons are shown as short vertical lines
that bisect the 50%
level of probability. When looking for coding regions
the user should look for solid horizontal lines at the
50% level that are not interrupted by these short vertical lines.
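.para
The calculation for a single window can be sketched as follows. The
Python below is an illustration with hypothetical function names and is
not the program's own code. It assumes that every codon frequency in
the table is greater than zero (how the program treats codons with a
zero count is not stated here), and it works with logarithms throughout
so that the products p1, p2 and p3 do not underflow for long windows.
.lit
```python
import math


def frame_log_probabilities(window, freq):
    """Return [log p1, log p2, log p3] for a window of 3n+2 bases.

    freq maps each codon to its frequency F, which must be > 0.
    """
    n = (len(window) - 2) // 3
    logp = []
    for offset in range(3):                     # frames 1, 2 and 3
        total = 0.0
        for k in range(n):
            codon = window[offset + 3 * k:offset + 3 * k + 3]
            total += math.log(freq[codon])
        logp.append(total)
    return logp


def frame_log_odds(window, freq):
    """Return log(P/(1-P)) for each frame, where P = p/(p1+p2+p3)."""
    logp = frame_log_probabilities(window, freq)
    scores = []
    for i, value in enumerate(logp):
        others = [v for j, v in enumerate(logp) if j != i]
        top = max(others)
        # P/(1-P) is p_i divided by the sum of the other two p values.
        log_rest = top + math.log(sum(math.exp(v - top) for v in others))
        scores.append(value - log_rest)
    return scores
```
.end lit
.para
The frame with the largest value is the one that would be marked at the
50% level for that window position.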
.para
Changes.
Two normalisations are offered: 1) to remove all amino acid
compositional components from the tables, hence leaving only the codon
preference components. In general this is not recommended as the amino
acid
component alone is often sufficient to choose correctly between frames,
but
may be useful in special circumstances. 2) to change the amino acid
composition components to give an average amino acid composition
rather than
the one contained in the standard (this leaves the codon preference
components unchanged). In general this should be useful as the average
amino acid composition is likely to be closer to the composition of the
genes being hunted, than is that of the standard table of codon
preferences.
The average composition
is that recently published by Argos, not the Dayhoff one that we have
used
before.
.para
Typical dialogue follows.
.lit
? Menu or option number=42
Staden and McLachlan codon usage method
Codon tables for standards may be read from disk
or calculated from parts of the current sequence
? (y/n) (y) Define internal standard
Define standard
? start (0-1023) (0) =1
? end (2-1023) (1023) =1000
===========================================
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
===========================================
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
===========================================
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
===========================================
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
===========================================
Define standard
? start (0-1023) (0) =
Total codons in standard= 333.
X 1 Use observed frequencies
2 Normalize to average amino acid composition
3 Normalize to no amino acid bias
? 0,1,2,3 =2
===========================================
F TTT 19. S TCT 2. Y TAT 10. C TGT 3.
F TTC 6. S TCC 22. Y TAC 10. C TGC 8.
L TTA 2. S TCA 0. * TAA 0. * TGA 0.
L TTG 7. S TCG 2. * TAG 0. W TGG 8.
===========================================
L CTT 16. P CCT 16. H CAT 4. R CGT 10.
L CTC 12. P CCC 0. H CAC 10. R CGC 10.
L CTA 0. P CCA 0. Q CAA 8. R CGA 7.
L CTG 21. P CCG 16. Q CAG 18. R CGG 6.
===========================================
I ATT 19. T ACT 13. N AAT 16. S AGT 2.
I ATC 11. T ACC 17. N AAC 12. S AGC 15.
I ATA 3. T ACA 3. K AAA 22. R AGA 1.
M ATG 15. T ACG 3. K AAG 15. R AGG 1.
===========================================
V GTT 15. A GCT 21. D GAT 14. G GGT 10.
V GTC 7. A GCC 16. D GAC 20. G GGC 13.
V GTA 8. A GCA 8. E GAA 26. G GGA 14.
V GTG 11. A GCG 8. E GAG 13. G GGG 9.
===========================================
Span length 21 expected mean values: 4.8 -5.7 -4.8
Span length 31 expected mean values: 7.1 -8.4 -7.2
Span length 41 expected mean values: 9.5 -11.1 -9.5
? odd span length (11-101) (25) =41
? plot interval (1-11) (5) =
Missing graphics display here
.end lit
.left margin1
@43. TX 6 @ Positional base preference method.
.LEFT MARGIN2
.para
Used to find protein coding regions. For each window length of the
sequence the routine measures the closeness to an expected pattern of
base frequencies. Results are plotted for each of the three reading
frames. Stop and start codons are also marked on the plots. The method
is particularly useful for showing which reading frame is the most likely
to be coding. The latest version is described in a forthcoming issue of
Methods in Enzymology, but the original ideas were given in
Staden, R. Nucl. Acids Res. 12 551-567 (1984).
.para
If dialogue is requested the following inputs are needed, otherwise the
standard analysis is performed. Choose between a "global" standard, or a
selected one. If the global standard is selected the
expected scores are displayed and the user asked to define a span length
and a plot interval. Then users choose between plotting relative or
absolute scores, and can reset the scaling values employed for plotting.
If the global standard is not selected users must define a region of the
sequence to use as a standard, or they can read in a codon table from which
the
program will calculate one. Then they can either use the values
observed in this standard, or they can combine its values for the third
positions in codons, with those from the global standard. Next they can
give different weightings to each of the three positions in codons.
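.para
As a rough illustration of how a standard might be derived from a codon
table, the following Python sketch tallies how often each base occupies
each of the three codon positions. It is not the program's own code: the
function names, the toy table and the decision to count every codon in
the table (stop codons included) are assumptions made for the example.
The "Range" column shown in the dialogue below appears to be the largest
minus the smallest value in each row, and the sketch computes it on that
assumption.
.lit
```python
from collections import defaultdict

BASES = "TCAG"  # column order used in the dialogue tables

def positional_frequencies(codon_counts):
    """Return a 3x4 table: rows are codon positions 1-3, columns follow BASES."""
    totals = [defaultdict(float) for _ in range(3)]
    n = 0.0
    for codon, count in codon_counts.items():
        n += count
        for pos, base in enumerate(codon):
            totals[pos][base] += count
    if n == 0:
        raise ValueError("codon table is empty")
    return [[totals[pos][b] / n for b in BASES] for pos in range(3)]

def row_range(row):
    """Largest minus smallest frequency in one row (assumed meaning of Range)."""
    return max(row) - min(row)

# Hypothetical toy table, not taken from any real gene.
toy = {"GCT": 3, "GAA": 2, "ATG": 1}
table = positional_frequencies(toy)
for pos, row in enumerate(table, start=1):
    values = " ".join(f"{v:.3f}" for v in row)
    print(pos, values, f"{row_range(row):.3f}")
```
.end lit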
.para
In its original form the method
took advantage of the
uneven
use of amino acids by proteins and the structure of the genetic code table
and assumed that there is a typical ("global")
amino acid composition
and no codon preference. The typical amino acid composition is the
average
composition found by Argos (see below).
This composition and no codon preference
determines the frequency of each of the four bases in each of the three
codon positions. This 3x4 frequency table shows unequal use of the bases
and in particular a marked use of G in position 1 and of A in position 2
(at the expense of G). The routine slides a window along the sequence and
calculates a score for each of the three reading
frames at each window position. It assumes the sequence is coding
throughout its whole length and calculates the probability that it is
coding in each of the three frames.
When tested against all the E. coli sequences in the EMBL sequence
library
it correctly identified the coding frame for 91% of window positions.
(The E. coli
sequences were chosen only for technical reasons: I have no reason to
think
the method would work less well on other organisms with roughly even
base composition.)
The routine can plot either absolute or relative values: the absolute values
are found by summing the scores for each frame (say p1, p2
and
p3), and the relative values are then p1/(p1+p2+p3), p2/(p1+p2+p3) and
p3/(p1+p2+p3).
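.para
The scoring described above can be sketched in Python as follows. This
is an illustration under stated assumptions rather than the program's
code: it assumes that the score of a codon is simply the sum of the
table values for the bases seen at its three positions (optionally
multiplied by the positional weights), and the function names are
invented for the example. The 3x4 table is copied from the second
example dialogue later in this section. Under this additive assumption
the expected scores per codon come out close to the 0.785, 0.736 and
0.736 printed in that dialogue, which suggests the assumption is
reasonable, but it should not be taken as confirmed.
.lit
```python
BASES = "TCAG"

# Table printed in the second example dialogue (rows: codon positions 1-3).
TABLE = [
    [0.177, 0.211, 0.277, 0.336],
    [0.271, 0.238, 0.310, 0.182],
    [0.242, 0.301, 0.168, 0.289],
]

def frame_scores(seq, table, start, span_codons, weights=(1.0, 1.0, 1.0)):
    """Return absolute scores (p1, p2, p3) for the frames at start, start+1, start+2."""
    scores = []
    for frame in range(3):
        total = 0.0
        for c in range(span_codons):
            codon_start = start + frame + 3 * c
            codon = seq[codon_start:codon_start + 3]
            if len(codon) < 3:
                break  # window runs off the end of the sequence
            for k, base in enumerate(codon):
                if base in BASES:
                    total += weights[k] * table[k][BASES.index(base)]
        scores.append(total)
    return tuple(scores)

def relative(scores):
    """Convert absolute scores to p1/(p1+p2+p3) etc."""
    s = sum(scores)
    return tuple(p / s for p in scores) if s else (1 / 3, 1 / 3, 1 / 3)

def expected_per_codon(table, shift):
    """Mean score per codon when the true frame is offset by shift positions."""
    return sum(table[k][j] * table[(k + shift) % 3][j]
               for k in range(3) for j in range(4))

p = frame_scores("GCTGAAGCTGAAGCT", TABLE, 0, 4)
print(p, relative(p))
print([round(expected_per_codon(TABLE, s), 3) for s in range(3)])
```
.end lit
.para
Setting the weights to (1.0, 0.0, 0.0), (0.0, 1.0, 0.0) and
(0.0, 0.0, 1.0) in turn shows the contribution of each codon position
separately.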
.para
At each position along the sequence
where the program has a
point to plot, it finds which of the three values is highest and places a
single point at the 50% level for the corresponding frame. These single
points will join to form a solid line if one frame is consistently the
highest scoring. In addition stop codons are shown as short vertical lines
that bisect the 50%
level of probability. When looking for coding regions
the user should look for solid horizontal lines at the
50% level that are not interrupted by these short vertical lines.
The absolute mean
values expected on the complement of
the coding strand (and in the same frame)
are 5% lower than those on the coding strand but the relative values
are the same on both strands. Although the
relative values give smoother plots and tend to emphasize the coding
frame,
they therefore cannot be used to decide which strand is coding. The
absolute values plot should be used for this purpose, bearing in mind
that the differences between strands are quite small.
.para
The method has been improved in two overall ways: first it now allows
users to
define their own typical amino acid composition by selecting a standard
sequence from within the sequence they are analysing or from a codon table;
secondly it allows the inclusion of third position preferences.
Again these third position preferences are defined by the use of an
internal standard sequence. Not only can users define their own standards
but they can also give weights to each of the three positions in codons.
This allows different emphasis to be used for each of the three positions.
As an example of its use, by giving, in turn, weights of 1.0, 0.0, 0.0, and
0.0, 1.0, 0.0, and finally 0.0, 0.0, 1.0, you could see the separate
contribution made by each of the three positions. It is also possible to
use the third position preferences with the values for the first two
positions taken from the "global" amino acid composition.
In all cases users may choose to plot
absolute or relative values. The expected scores are displayed before
each
analysis and scales are drawn on the plots.
At present this method does not give probabilities of coding; it has only
been tested for its ability to choose the correct reading frame (see
above). It could be used to give probabilities of coding if it was applied to
all known coding and non-coding sequences in the way that the uneven
positional base frequencies method was. It is designed to be used in
conjunction with this method. Note that the average amino acid composition
used
to derive the base frequencies was changed on 17-11-1988, to be
the new average given by McCaldon and Argos in Proteins 4 99-122
(1988).
A further change is to allow users to select their own scales for
producing the plots. It can be helpful if they want to emphasise or
diminish
certain features.
.para
Typical dialogue follows.
.lit
? Menu or option number=D43
Positional base preferences method to find protein genes
Select standard source
X 1 Use global standard
2 Use internal standard
3 Use codon usage table
? Selection (1-3) (1) =2
Define region for standard
? start (0-8134) (0) =3171
? end (3172-8134) (8134) =4700
Select normalisation
X 1 Use observed frequencies
2 Combine with global standard
? Selection (1-2) (1) =1
T C A G Range
1 0.125 0.249 0.230 0.397 0.272
2 0.298 0.245 0.292 0.165 0.132
3 0.288 0.313 0.169 0.230 0.144
? (y/n) (y) Use 1.0 for positional weights
Give weights between 0.0 and 1.0
to each of the 3 codon positions
? Position 1 (0.00-1.00) (1.00) =
? Position 2 (0.00-1.00) (1.00) =
? Position 3 (0.00-1.00) (1.00) =
Expected scores per codon in each frame
0.136 0.122 0.123
? odd span length (31-101) (67) =
? plot interval (1-11) (5) =
? (y/n) (y) Plot relative scores
Scaling values:
Minimum maximum range
0.3121 0.3656 0.0382
? (y/n) (y) Leave scaling values unchanged
Graphics not shown
? Menu or option number=D43
Positional base preferences method to find protein genes
Select standard source
X 1 Use global standard
2 Use internal standard
3 Use codon usage table
? Selection (1-3) (1) =3
? File name of standard=atpase.cods
===========================================
F TTT 21. S TCT 33. Y TAT 15. C TGT 5.
F TTC 55. S TCC 40. Y TAC 40. C TGC 4.
L TTA 8. S TCA 7. * TAA 8. * TGA 0.
L TTG 19. S TCG 12. * TAG 1. W TGG 17.
===========================================
L CTT 22. P CCT 17. H CAT 6. R CGT 73.
L CTC 21. P CCC 4. H CAC 30. R CGC 23.
L CTA 1. P CCA 10. Q CAA 19. R CGA 5.
L CTG 168. P CCG 48. Q CAG 80. R CGG 3.
===========================================
I ATT 47. T ACT 14. N AAT 17. S AGT 8.
I ATC 98. T ACC 54. N AAC 52. S AGC 26.
I ATA 6. T ACA 7. K AAA 85. R AGA 0.
M ATG 75. T ACG 13. K AAG 28. R AGG 0.
===========================================
V GTT 67. A GCT 56. D GAT 41. G GGT 90.
V GTC 29. A GCC 53. D GAC 66. G GGC 66.
V GTA 49. A GCA 59. E GAA 101. G GGA 5.
V GTG 57. A GCG 64. E GAG 41. G GGG 8.
===========================================
Select normalisation
X 1 Use observed frequencies
2 Combine with global standard
? Selection (1-2) (1) =2
T C A G Range
1 0.177 0.211 0.277 0.336 0.159
2 0.271 0.238 0.310 0.182 0.128
3 0.242 0.301 0.168 0.289 0.132
? (y/n) (y) Use 1.0 for positional weights
Expected scores per codon in each frame
0.785 0.736 0.736
? odd span length (31-101) (67) =
? plot interval (1-11) (5) =
? (y/n) (y) Plot relative scores
Scaling values:
Minimum maximum range
0.3219 0.3519 0.0214
? (y/n) (y) Leave scaling values unchanged
Graphics not shown
.end lit
.left margIN1
@44. TX 6 @ Uneven positional base frequencies.
.LEFT MARGIN2
.para
Used to find regions of a sequence that might be coding for a protein. The
method looks for sections of the sequence in which the frequency at
which each of the four bases occupies the three positions in codons is
nonrandom. The level of nonrandomness is plotted on a scale that shows
the probability that the sequence is coding. At each position along a
sequence the calculation gives the same value for all six possible reading
frames, so only one value is plotted.
.para
Define the window length and plot interval.
.para
The results are plotted in a box divided by a horizontal line marked "76%".
76% of coding regions achieve values above this line and 76% of
noncoding regions achieve scores below the line.
.para
This method, first described in Staden R. Nucl. Acids Res. 12 551-567
1984,
looks for uneven positional
usage of bases in codons.
It looks through the sequence in one fixed
phase and counts the number of times each base appears in each of the
three
codon positions: for each window position it counts A1,A2,A3 and
C1,C2,C3
and G1,G2,G3 and T1,T2,T3 and calculates AMEAN=(A1+A2+A3)/3, and
similarly
CMEAN, GMEAN
and TMEAN; it then calculates
ADIF=abs(A1-AMEAN)+abs(A2-AMEAN)+abs(A3-AMEAN) and similarly
CDIF, GDIF and
TDIF to measure the differences between an even base usage for all
positions in the codons and the observed usage. The routine then
calculates
the sum ADIF+CDIF+GDIF+TDIF and plots this value on the following scale:
the base level is such that no known window in a coding region has a
lower
value, whereas 14% of windows in noncoding sequences score below it.
The
top of the scale is not achieved by any known noncoding
region, but is reached by 16% of known coding regions.
The bar drawn across the
plot corresponds to a level that is exceeded by 76% of windows in known
coding regions
but is reached by only 24% of windows in known noncoding regions, i.e.
76% of
coding windows score above and 76% of noncoding windows score below.
This is similar to Fickett's method but without
the probabilities and weightings from the Los Alamos sequence library: it
is therefore unbiased but may well give very similar results.
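.para
The calculation of ADIF+CDIF+GDIF+TDIF for a single window can be
sketched in Python as below. The function name is an assumption made
for the example, and the sketch returns only the raw sum; the
conversion of that sum to the plotted probability scale depends on the
program's calibration against known sequences and is not reproduced
here.
.lit
```python
def positional_unevenness(window):
    """Return ADIF+CDIF+GDIF+TDIF for a window read in one fixed phase."""
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(window):
        if base in counts:
            counts[base][i % 3] += 1      # tally A1,A2,A3, C1,C2,C3, ...
    total = 0.0
    for per_position in counts.values():
        mean = sum(per_position) / 3       # AMEAN, CMEAN, GMEAN, TMEAN
        total += sum(abs(n - mean) for n in per_position)
    return total

# A strongly periodic window scores high; a uniform one scores zero.
print(positional_unevenness("ACG" * 10))
print(positional_unevenness("A" * 30))
```
.end lit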
.left margin1
@45. TX 6 @ Codon improbability on base composition
.LEFT MARGIN2
.para
Used to find regions of a sequence that might code for a protein.
.para
If dialogue is requested define a window length and plot interval.
.para
The idea of the method is that, of all the sequence features
that we know, it is only
coding regions that will give rise to codon biases well above those
expected
from the base composition.
If a region of sequence shows sufficiently strong
codon bias then we conclude that it is coding for a protein.
Using the multinomial distribution we
have derived a function to measure the improbability of observing a
set of codons from a sequence of the given composition. Using the
Poisson
distribution we have worked out the distribution
of the improbability. The program plots the observed improbability minus
the expected improbability (the mean as calculated from the Poisson
distribution). The plots are presented against a scale of units of standard
deviation as measured from the Poisson distribution. As with the other
Staden and McLachlan method the program puts an extra point at a fixed
level for the highest of the three probabilities; for this function this
point is placed at six standard deviations above the mean expected level.
The top of each plot corresponds to 12 standard deviations above the
expected level and the bottom corresponds to the expected value.
.para
Analysis of the application
of the method to the EMBL sequence library indicates that the method
does
work for most sequences and that the levels of improbability roughly
correlate with levels of expression.
Coding regions will show high peaks in all three frames making
interpretation more difficult than for some of the other methods.
.left margin1
@46. TX 6 @ Codon improbability on amino acid composition
.LEFT MARGIN2
.para
Used to find regions of a sequence that might code for a protein.
.para
If dialogue is requested define a window length and a plot interval.
.para
The idea of the method is that, of all the sequence features
that we know, it is only
coding regions that will give rise to codon biases such that, for each
amino acid, some codons are used far more frequently than others. The
method is independent of what the bias actually is, requiring only that it
is present.
If a region of sequence shows sufficiently strong
codon bias then we conclude that it is coding for a protein.
Using the multinomial distribution we
have derived a function to measure the improbability of observing a
set of codons from a sequence of the given composition. Using the
Poisson
distribution we have worked out the distribution
of the improbability. The program plots the observed improbability minus
the expected improbability (the mean as calculated from the Poisson
distribution). The plots are presented against a scale of units of standard
deviation as measured from the Poisson distribution. As with the other
Staden and McLachlan method the program puts an extra point at a fixed
level for the highest of the three probabilities; for this function this
point is placed at six standard deviations above the mean expected level.
The top of each plot corresponds to 12 standard deviations above the
expected level and the bottom corresponds to the expected value.
.left margin1
@47. TX 6 @ Shepherd RNY preference method
.LEFT MARGIN2
.para
Used to find regions of a sequence that might code for a protein. Based on
the method of Shepherd
(PNAS 78 1596-1600, 1981).
.para
If dialogue is requested define a window length and plot interval.
.para
Shepherd has found that
many genes have a preference for the use of codons of the form RNY
where
R=purine, Y=pyrimidine and N=any base. He has attributed this to being
due
to remnants of a primitive genetic code. The calculation is similar to that
for the Staden and McLachlan method, p1 being simply the number of
RNY codons found in frame 1, etc., and the P's being p/(p1+p2+p3).
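.para
A minimal Python sketch of the RNY count for one window follows. The
function names are assumptions made for the example; only the counting
rule (purine in position 1, any base in position 2, pyrimidine in
position 3) and the relative values p/(p1+p2+p3) are taken from the
description above.
.lit
```python
PURINES = set("AG")
PYRIMIDINES = set("CT")

def rny_counts(window):
    """Return [p1, p2, p3]: the number of RNY codons in each reading frame."""
    counts = [0, 0, 0]
    for frame in range(3):
        for i in range(frame, len(window) - 2, 3):
            if window[i] in PURINES and window[i + 2] in PYRIMIDINES:
                counts[frame] += 1
    return counts

def rny_relative(window):
    """Return p/(p1+p2+p3) for each frame, or zeros if no RNY codon is found."""
    p = rny_counts(window)
    total = sum(p)
    return [x / total if total else 0.0 for x in p]

print(rny_counts("GCT" * 5), rny_relative("GCT" * 5))
```
.end lit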
.left margIN1
@48. TX 6 @ Ficketts method
.LEFT MARGIN2
.para
Used to find regions of a sequence that might code for a protein. Based on
the method of Fickett
(Nucl. Acids Res. 10,
1982), but plots values for fixed window lengths rather than over the
whole of open reading frames.
.para
If dialogue is requested define a window length and plot interval. The
results are plotted in a box divided into three horizontal strips.
.para
Sections of the sequence with values plotted in the top strip of the box
are adjudged to be coding, those in the middle strip "no decision", and
those in the bottom "not coding".
.para
The program performs the following calculations: let A1 = the number of
occurrences of base A in position 1 of codons, A2 for position 2 etc.
Similarly for bases C,G and T. For each window position calculate
Apos=max(A1,A2,A3)/(min(A1,A2,A3)+1). Similarly for C,G and T to give 4
positional values. Also count the base composition for the window to
give
Acomp, Ccomp etc. Fickett tested each of these 8 parameters singly as
to
their ability to distinguish coding from noncoding regions and arrived at
probabilities of coding for the range of values each can take = Pcod. He
also measured their relative abilities and gave weightings to each of
the 8 parameters = Pw. To calculate the "TESTCODE" for a window we
first lookup the Pcod for each of the calculated compositional and
positional values and then calculate TESTCODE=sum(Pcod*Pw). TESTCODE
is
plotted relative to three levels of decision: the top division="coding",
the middle="no opinion" and the bottom division="non coding".
.left margin1
@49. TX 6 @ tRNA gene search.
.LEFT MARGIN2
.para
Used to find segments of a sequence that might code for tRNAs. Looks for
potential cloverleaf forming structures and then for the presence of
expected conserved bases. Presents results graphically or draws out the
cloverleafs.
.para
If dialogue is requested a large number of parameters need to be given
values, including some loop lengths, scores for each of the four stems,
and scores for the conserved bases.
.para
The program was first described in
Staden Nucl. Acids Res. 8 817-825 (1980).
The tRNAs that have
been sequenced so far have two characteristics that can be used
to
locate their genes within long DNA sequences. Firstly they have a
common secondary structure - the cloverleaf - and secondly,
particular bases almost always appear at certain positions in
the
cloverleaf. The cloverleaf is composed of four base-paired
stems
and four loops. Three of the stems are of fixed length but the
fourth, the dhu stem which usually has four base pairs,
sometimes
has only three. All of the loops can vary in size. The following
relationships between the stems in the cloverleaf are assumed in
the
program: (a) there are no bases between one end of the
aminoacyl
stem and the adjoining tuc stem; (b) there are two bases
between
the aminoacyl stem and the dhu stem; (c) there is one base
between
the dhu stem and the anticodon stem; (d) there are at least three
bases between the anticodon stem and the tuc stem.
The program looks first for cloverleaf structure and then, if
required, for conserved bases. The sizes of the loops, the number
of basepairs in the stems and the required conserved bases may
all
be specified by the user. The process of looking for the presence
of conserved bases can reduce the number of potential
structures
found considerably.
The
user may also specify that an intron may be present in the
anticodon
loop.
.para
The user may define a minimum number of
base pairs for each stem using the scoring system G-C, A-T=2 and G-T=1
and
scores for each of the conserved bases. Recommended values for the stem
scores are given by the prompts and the percentage conservation of the
conserved bases as found in the Nucl. Acids Res. 1979 paper of Gauss, Gruter
and Sprinzl are also given,
but the user must decide which bases are most
likely to be conserved for the sequence being examined.
The output shows the position of the possible gene in the sequence by a
vertical line the height of which shows the number of basepairs made in
the
stems. The cloverleaf structure is also drawn but will scroll up off the
screen. Output of the cloverleafs will look like:
.lit
6942
A
A-U
A-U
G-C
A-U
U-A
A-U
U-A AAU
U UAUCU
AA A !!!!!
AAUG AUAGA A
U !!!! U UCA
C UUAC U
AA A
U-AA A
A-U
A-U
C-G
U-A
U A
U A
GUC
Typical dialogue follows.
? Menu or option number=D49
tRNA search
? Maximum trna length (70-130) (92) =
? Aminoacyl stem score (0-14) (11) =
? Tu stem score (0-10) (8) =
? Anticodon stem score (0-10) (8) =
? D stem score (0-8) (3) =
? Minimum base pairing total (30-32) (32) =
? Minimum intron length (0-30) (0) =
? Minimum length for TU loop (4-12) (6) =
? Maximum length for TU loop (6-12) (9) =
? (y/n) (y) Skip search for conserved bases n
Give a score for each base, then a minimum total at the end
? Base 8, T is 100% conserved. Score (0-100) (0) =
? Base 10, G is 95% conserved. Score (0-100) (0) =
? Base 11, Y is 96% conserved. Score (0-100) (0) =
? Base 14, A is 100% conserved. Score (0-100) (0) =
? Base 15, R is 100% conserved. Score (0-100) (0) =
? Base 21, A is 97% conserved. Score (0-100) (0) =
? Base 32, Y is 100% conserved. Score (0-100) (0) =
? Base 33, T is 98% conserved. Score (0-100) (0) =
? Base 37, A is 91% conserved. Score (0-100) (0) =
? Base 48, Y is 100% conserved. Score (0-100) (0) =
? Base 53, G is 100% conserved. Score (0-100) (0) =
? Base 54, T is 95% conserved. Score (0-100) (0) =
? Base 55, T is 97% conserved. Score (0-100) (0) =
? Base 56, C is 100% conserved. Score (0-100) (0) =
? Base 57, R is 100% conserved. Score (0-100) (0) =
? Base 58, A is 100% conserved. Score (0-100) (0) =
? Base 60, Y is 92% conserved. Score (0-100) (0) =
? Base 61, C is 100% conserved. Score (0-100) (0) =
? Minimum total conserved base score (0-0) (0) =
? (y/n) (y) Plot results n
Searching
306
C
C-G
C-G
G-C
T-A
C-G
A-T
T+G AT
A ATACA
TTC T !!!! G
CTGT TATGG G
G ! ! T GA
C TAAA C
GCG C G
T+GA C
C-G C T
T+G A T
T-A G T
T-A G A
G G G C
A A G A
AGC T C
A T
C T
A
C T
.end lit
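.para
The stem scoring system described above (G-C and A-T score 2, G-T
scores 1) can be sketched in Python as below. The function name and the
convention of passing both arms of the stem written 5' to 3' are
assumptions made for the example. With this scoring a fully paired
seven base pair stem scores 14, a five base pair stem 10 and a four
base pair stem 8, which agrees with the upper limits offered by the
prompts in the dialogue.
.lit
```python
PAIR_SCORES = {
    ("G", "C"): 2, ("C", "G"): 2,
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}

def stem_score(five_prime, three_prime):
    """Score a stem; both arms are written 5'->3', so one is read in reverse."""
    if len(five_prime) != len(three_prime):
        raise ValueError("stem arms must have equal length")
    return sum(PAIR_SCORES.get((a, b), 0)
               for a, b in zip(five_prime, reversed(three_prime)))

print(stem_score("GGGGGGG", "CCCCCCC"))   # seven G-C pairs
print(stem_score("GT", "GT"))             # two G-T pairs
```
.end lit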
.left margIN1
@50. TX 7 @ Plot start codons
.left margin2
.para
This function plots the positions of all start codons for each of the three
reading frames.
.left margin1
@51. TX 7 @ Plot stop codons
.left margin2
.para
This function plots the positions of all stop codons for each of the three
reading frames.
.left margIN1
@52. TX 7 @ Plot stop codons on the complementary strand
.left margin2
.para
This function plots the positions of all stop codons for each of the three
reading frames on the complementary strand.
.left margin1
@53. TX 7 @ Plot stop codons on both strands
.left margin2
.para
This function plots the positions of all stop codons for each of the three
reading frames on both strands.
.left margin1
@54. TX 5 @ Search for longest open reading frames
.left margin2
.para
This function will report the positions of the ends of
all sections of sequence that contain no stop codons. All six reading
frames are examined. Results are presented in the form of an EMBL feature
table. Hence if the results are stored in a file by use of "direct output
to disk", the file
can be used to translate the
open reading frames in a sequence.
Note that in order for the file to be used as a feature table it
must include either EMBL
or GenBank headers, and a suitable "tail". The simplest header is the word
FEATURES starting in column 1 of the first line of the file. The simplest
tail is 2 empty lines at the end of the file. These lines are not included
when nip writes out results in feature table format.
.para
Define the minimum length of open reading frame to report (in amino
acids).
Choose to search either or both strands. The program displays the end
points, the reading frame and strand.
.para
Typical dialogue follows.
.lit
? Menu or option number=D54
Find open reading frames
? Minimum open frame in amino acids (5-1000) (30) =100
X 1 + strand only
2 - strand only
3 Both strands
? 0,1,2,3 =3
FT CDS 1 831 1 831
FT CDS 1540 2853 1 1314
FT CDS 3130 4242 1 1113
FT CDS 5761 6114 1 354
FT CDS 6187 6711 1 525
FT CDS 1766 2077 2 312
FT CDS 2078 2446 2 369
FT CDS 4136 5500 2 1365
FT CDS 1335 1637 3 303
FT CDS 2844 3194 3 351
FT CDS 6819 7238 3 420
FT CDS 2073 1711 C 1 363
FT CDS 2469 2149 C 1 321
FT CDS 6542 6144 C 3 399
.end lit
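.para
The search for stop-free stretches can be sketched in Python as follows.
This is an illustration only: the function name is invented, only the
three frames of the strand as given are scanned, and the reported end
point is assumed to exclude the terminating stop codon, which this
document does not state. The complementary strand could be handled by
applying the same function to the reversed and complemented sequence
and converting the coordinates back.
.lit
```python
STOPS = {"TAA", "TAG", "TGA"}

def open_frames(seq, min_codons=30):
    """Yield (start, end, frame) for stop-free stretches, 1-based inclusive."""
    seq = seq.upper()
    for frame in range(3):
        run_start = frame
        pos = frame
        while pos + 3 <= len(seq):
            if seq[pos:pos + 3] in STOPS:
                if (pos - run_start) // 3 >= min_codons:
                    yield (run_start + 1, pos, frame + 1)
                run_start = pos + 3        # next stretch begins after the stop
            pos += 3
        if (pos - run_start) // 3 >= min_codons:
            yield (run_start + 1, pos, frame + 1)

example = "ATG" + "GCT" * 4 + "TAA" + "GCTGCT"
for start, end, frame in open_frames(example, min_codons=3):
    print(start, end, frame, end - start + 1)
```
.end lit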
.left margin1
@55. TX 8 @ Search for E. coli promoter (general)
.LEFT MARGIN2
.para
Searches for E. coli promoter-like sequences using a standard weight
matrix. The positions of the matches are plotted. No dialogue is required.
.para
The method was first described in
Staden R. Nucl. Acids Res. 12 505-519 1984.
This search uses a weight matrix taken from the frequency tables
contained
in Hawley, D. K. and McClure, R., Nucl. Acids Res. 11 2237-2255 (1983).
The weight matrix is
divided into 3 sections that are separated by varying sizes of gap: the
-35 region, the -10 region and the +1 region.
The algorithm first looks for a sufficiently good -35 region, then for the
best -10 region within range and then for the best +1 region within range
of the -10; each separate region must score above the lowest known
score
for the corresponding section. The gap penalty is then applied and two
plots
produced: one with gap penalties, one without.
Scaling is such that no
known promoter scores below the bottom level and no known promoter
scores
above the top level when the weight matrix is applied.
.para
Two other functions also look for E. coli promoters: 56 looks for sites on
the complementary strand and 57 looks for individual -35 and -10
regions
and plots them on a scale such that the top is the highest known value +10%
and
the bottom is the lowest known -10%.
.LEFT MARGIN1
.lit
weights for E. coli promoters
-35 region:
P -50-49-48-47-46-45-44-43-42-41-40-39-38-37-36-35-34-33-32-31-30-29-28-27-26
107109109110110110110110110111111110111112112112112112112112112112112112112
T 41 33 32 25 34 22 35 35 42 27 32 42 47 14 92 94 11 19 15 37 46 34 38 48 34
C 22 27 18 29 20 14 20 12 22 23 16 25 10 43 7 6 11 18 60 8 25 23 23 17 20
A 28 38 30 37 35 56 42 42 37 42 39 18 25 26 2 6 2 72 26 50 26 34 25 26 31
G 16 11 29 19 21 18 13 21 9 19 24 26 29 29 11 6 88 3 11 17 15 21 26 21 27
-10 region:
P -23-22-21-20-19-18-17-16-15-14-13-12-11-10 -9 -8 -7 -6 -5
112112112112112112112112112112112112112112112112112112112
T 35 28 28 27 39 51 34 43 26 31 89 3 49 15 19108 31 29 21
C 34 21 24 27 12 25 20 25 20 27 10 2 16 14 22 3 13 16 30
A 20 39 33 33 39 23 29 16 23 19 2106 29 66 57 1 35 23 31
G 23 24 27 25 22 13 29 28 43 35 11 1 18 17 14 0 33 24 30
+ region:
P -2 -1 1 2 3 4 5 6 7 8 9 10
86 88 85 88 88 88 88 88 88 88 88 88
T 16 22 2 42 27 23 20 25 27 15 16 29
C 29 49 4 25 25 13 18 22 17 17 16 17
A 20 9 45 16 24 25 28 24 24 32 35 26
G 21 8 37 5 12 27 22 17 20 24 21 16
.end lit
Notes:
E. coli promoters have been shown to contain 2 regions of conserved
sequence
located about 10 and 35 bases upstream of the transcription startsite.
These
are TATAAT and TTGACA with an allowed spacing of 15 to 21 bases
between. The
spacing with maximum efficiency was 17 bases and all but 12 of the 112
sequences could be aligned with a separation of 17 +/- 1 bases. The
standard
promoter has spacing 7 and 17 bases between the startsite and the -10
region,
and the -10 and -35 regions, respectively. The spacing between the -10
region
and the startsite is usually 6 or 7 bases but varies between 4 and 8
bases.
There is an AT rich region of 8 to 10 bases upstream of the -35 region.
Initiation with a purine is highly preferred with G being used if A is not
present.
.lit
Gap penalties:
15 0.02 (only exists as mutant)
16 0.2
17 1.0
18 0.2
19 0.05 (guess)
20 0.02 (guess)
21 0.01 (guess)
.end lit
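.para
The use of the gap penalty table can be sketched in Python as below.
How the program combines the region scores and applies the penalty is
not stated in this document, so the sketch makes two assumptions that
should be treated as such: the -35 and -10 scores are added, and the
penalty multiplies the combined score. As in the description above, the
best -10 region is chosen first and the penalty applied afterwards,
giving one value without and one with the gap penalty. The function
name and its arguments are hypothetical.
.lit
```python
# Gap penalties for the spacing between the -35 and -10 regions (from the table).
GAP_PENALTY = {15: 0.02, 16: 0.2, 17: 1.0, 18: 0.2, 19: 0.05, 20: 0.02, 21: 0.01}

def score_promoter(score35, scores10_by_gap):
    """Return (gap, score without penalty, score with penalty), or None.

    scores10_by_gap maps a spacing in bases to the -10 region score
    found at that spacing; spacings outside 15-21 are ignored.
    """
    allowed = {g: s for g, s in scores10_by_gap.items() if g in GAP_PENALTY}
    if not allowed:
        return None
    gap = max(allowed, key=allowed.get)          # best -10 region within range
    raw = score35 + allowed[gap]                 # assumed: region scores add
    return gap, raw, raw * GAP_PENALTY[gap]      # assumed: penalty multiplies

# Invented scores, for illustration only.
print(score_promoter(10.0, {16: 8.0, 17: 7.0, 25: 20.0}))
```
.end lit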
.left margin1
@56. TX 8 @ Search for E. coli promoter (general)
strand
.LEFT MARGIN2
.para
This function searches for E. coli promoters on the complementary strand
of
the sequence. See the notes on option 55.
.left margin1
@57. TX 8 @ Search for E. coli promoter sequences. (-35 and -10)
.LEFT MARGIN2
.para
This function searches separately for the -35 and -10 sequences of an E.
coli promoter. See the notes on option 55.
.left margIN1
@58. TX 8 @ Search for procaryotic ribosome binding sites
.LEFT MARGIN2
.para
This function searches for the 5' ends of prokaryotic genes using an
unusual weight matrix. The search is relatively slow because the matrix
is 101 bases in length. No dialogue is required.
.para
The method was first described in
Staden Nucl. Acids Res. 12 505-519 1984. This actually looks for more
than
a ribosome binding site as is explained below. It uses the weight
matrix w101 of Stormo and
Schneider (NAR 10 2971-3024, 1982)
which, with a cutoff value of 2, finds all gene starts in their library.
.LEFT MARGIN1
.lit
P-60-59-58-57-56-55-54-53-52-51-50-49-48-47-46-45-44-43-42-41-40-39-38-37-36
T 5 1 -3 9-14 7 15 -5 3-16-17 4 18 5 -3 -1 2 4 5 -5 7 8 -5-15 6
C-21 -6-11-21 0 8 -7-12 -1 1 0-19 12 -3 -1 10 2 -8 -5-11 8 1 23 6 -5
A 7 -2 13 -2 -8-13-18 5 0 -5 13 8-15 9 -4 -7 9 0 -8-11-10 -6 -7 -5 -6
G -6 -9 -7 0 8-16 -4 -2-16 1 -4 8-14 5 11-13-24 3 7 22-11 -9-15 10 -4
P-35-34-33-32-31-30-29-28-27-26-25-24-23-22-21-20-19-18-17-16-15-14-13-12-11
T 3 4 16 -4 7 11 -4 -1 12 8 10 -1 1 8 2-10-16 11 1 -3 16 -3-36 -8-27
C 2-14 -3 -8-10-21 2 0 -2 -1-11 -3 -1 5-11 -4 7 0-14 6 -8-20 -7-36-44
A-12 -1-27 -3 -6 0-12 -3 -4 -7 14 -2 -4 -6 0 12 5 -9 0-11-11 10 8 2 8
G 4 -5 -6 -3 -1 -4 -1 -4-15 0-14 3 10-19 -3-10 -7 -7 7 1 -8 -6 15 21 42
P-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
T-53-27-26-23 2 -7-14-40-28 0-53 75-62-20-40-10-35 -5-12 -1 4 14-23 7 -2
C-15-50-43-35-38-29-29 1 -9 1-87-55-64-45 11-22-14-20-15-15-10-22 -5 2 6
A 0 -3 -5 4-20-11 5 6 -2-15 66-69-52 -5 -4 6 8-24 -7-10 -7 13 14 -9-18
G 35 22 16 -6 -5-15-25-33-28-53-36-50107 -5-37-44-27-15-23-16-29-47-17-29-15
P 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
T-26 1 4 -7 3 -4 0-10 8-18 7-22-21 8 4 -3 -6 7 -8 1 -5-16-16 7 -6
C 6 -8 19 -7 9 -3 17 -2 3 -9 5 22 22 8 -1 1 18 6 11-10 -8 7 10 0 7
A 14-12-42 1 -5 -4-32 12-10 20 -6 -1 3 -4 4-10 -1 -2-14 11 14 -3 2-13 5
G-23 -7 -1 -6-17 -4 0-15-14 -4-17-10 -5-13 -8 10-13-13 9 -4 -3 10 2 4 -8
P 40
T 0
C 14
A 5
G-21
.END LIT
These come from w101 of Stormo, Schneider, Gold and Ehrenfeucht Nucl.
Acids Res. 10 2997-3011, 1982.
They report that this matrix gives a score of at least 2 for
all
gene starts in their library whereas all other sequences score 1 or less.
.left margin1
@29. TX 1 @ Reverse and complement the sequence
.LEFT MARGIN2
.para
Reverses and complements the current active region of the sequence.
.left margin1
@60. TX 7 @ Search using a dinucleotide weight matrix
.LEFT MARGIN2
.para
This function performs searches for short sequence
motifs using an appropriate dinucleotide weight matrix. In addition it
can be used to create or modify weight matrices. In order to perform a
search the only input
required is the name of the file containing the weight matrix.
The results can be presented graphically or listed. The graphical
presentation will draw a line at the position of any matches found; the
height of the line is proportional to the score. The method is identical to
that using weight matrices derived from nucleotide frequencies, except
that here we use the frequencies of dinucleotides.
.para
For a search, select "use weight matrix", supply the name of the file
containing the weight matrix, and choose between having results plotted
or listed. If dialogue is requested when the function is selected users can
alter the cutoff score employed.
.para
To create a weight matrix several steps are involved. A file containing an
alignment of known motifs is required. (This file must be created before
the current option is selected. The format is as follows: each sequence is
written on a separate line with at least one space at the beginning; each
sequence is terminated by a space character, and can be followed by a
name. The sequences must be aligned.) Supply the name of the file of
aligned sequences. The program reads and displays the sequences. Choose
between "summing logs of weights" or summing weights (i.e. whether to
multiply or add weights). If logs are used all scores will be negative.
Choose if all positions in the set of aligned sequences should be used or
if a mask should be applied. If so selected, define a mask as a string of
symbols, in which symbol - means ignore and any other symbol means
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
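.para
The mask convention can be illustrated with a short Python sketch. The
function name is an assumption made for the example; the rule (the
symbol - means ignore, any other symbol means use) is as described
above.
.lit
```python
def mask_positions(mask):
    """Return the 1-based positions to use; '-' means ignore, anything else means use."""
    return [i for i, symbol in enumerate(mask, start=1) if symbol != "-"]

print(mask_positions("xx-x--abc"))    # every position except 3, 5 and 6
print(mask_positions("xx-x---x-x"))
```
.end lit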
.para
The program will calculate weights as the frequencies of the
dinucleotides at each unmasked position in the set of aligned sequences.
These weights are then applied to the set of aligned sequences to give a
range of "observed" scores. The mean and standard deviation of these
scores is displayed. The user is asked to supply several values to be used
when the weight matrix is applied to other sequences: a cutoff score (by
default, the mean minus 3 standard deviations), a top score for scaling
graphical results (by default, the mean plus 3 standard deviations), and a
position to identify (this means that if a particular base within the
motif is used as a "landmark", such as the A of the AG in splice acceptor
sites, then its position will be marked in plots). All these values are
stored along with the weight matrix. Finally supply the name of a file to
contain the weight matrix.
.para
Weight matrices can be "rescaled" using a set of aligned sequences in
much the same way as a matrix is created. The purpose is to redefine
the cutoff scores, and rescaling does not alter any other values in the
weight matrix file.
.para
The methods have always had to deal with the problem of zeroes in the
matrices. The current versions
employ "Laplace's Law of Succession" in which 1 is
added to each term.
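.para
The construction and use of a dinucleotide weight matrix can be
sketched in Python as follows, using the GCN4 alignment from the
dialogue that follows. The function names and the pseudocount argument
are assumptions made for the example, and only the "summing weights"
option is shown. With pseudocount=1 the sketch adds 1 to each term as
described above; with pseudocount=0 it reproduces the scores printed in
the dialogue (for example 97 for sequence 10 and 69 for sequence 13),
so the dialogue may have been produced without the added term.
.lit
```python
from collections import Counter

GCN4 = [
    "AGCGTGACTCTTCCCGGAA", "GAGGTGACTCACTTGGAAG", "CGGATGACTCTTTTTTTTT",
    "ACAGTGACTCACGTTTTTT", "GTCGTGACTCATATGCTTT", "TGAATGACTCACTTTTTGG",
    "TTCTTGACTCGTCTTTTCT", "CGAATGACTCTTATTGATG", "AGAATGACTAATTTTACTA",
    "TCGTTGACTCATTCTAATC", "TTGCTGACTCATTACGATT", "GAGATGACTCTTTTTCTTT",
    "GCGATGATTCATTTCTCTG", "TAGATGACTCAGTTTAGTC", "TAAGTGACTCAGTTCTTTC",
    "ATGATGACTCTTAAGCATG",
]

def make_weights(aligned, mask=None):
    """Count dinucleotides at each position; masked positions become None."""
    n_pos = min(len(s) for s in aligned) - 1    # number of dinucleotide positions
    weights = []
    for i in range(n_pos):
        used = mask is None or (i < len(mask) and mask[i] != "-")
        weights.append(Counter(s[i:i + 2] for s in aligned) if used else None)
    return weights

def score(weights, segment, pseudocount=1):
    """Sum the weights of the dinucleotides in segment at the used positions."""
    return sum(w[segment[i:i + 2]] + pseudocount
               for i, w in enumerate(weights) if w is not None)

weights = make_weights(GCN4, "----XXXXXXXX--------")
for number in (10, 13):
    seq = GCN4[number - 1]
    print(number, score(weights, seq, pseudocount=0), score(weights, seq))
```
.end lit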
.lit
Typical dialogue follows.
? Menu or option number=D60
Motif search using dinucleotide weight matrix
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 = 2
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
1 AGCGTGACTCTTCCCGGAA HIS1
2 GAGGTGACTCACTTGGAAG HIS1
3 CGGATGACTCTTTTTTTTT HIS3
4 ACAGTGACTCACGTTTTTT HIS4
5 GTCGTGACTCATATGCTTT ARG3
6 TGAATGACTCACTTTTTGG ARG4
7 TTCTTGACTCGTCTTTTCT CPA1
8 CGAATGACTCTTATTGATG CPA2
9 AGAATGACTAATTTTACTA TRP5
10 TCGTTGACTCATTCTAATC TRP3
11 TTGCTGACTCATTACGATT TRP2
12 GAGATGACTCTTTTTCTTT IV1
13 GCGATGATTCATTTCTCTG IV2
14 TAGATGACTCAGTTTAGTC LEU1
15 TAAGTGACTCAGTTCTTTC LEU4
16 ATGATGACTCTTAAGCATG ILS1
Length of motif 18
? (y/n) (y) Sum logs of weights n
? (y/n) (y) Use all motif positions n
x means use, - means ignore
e.g. xx-x---x-x means use positions 1,2,4,8,10
? Mask=----XXXXXXXX--------
Applying weights to input sequences
1 89.000 AGCGTGACTCTTCCCGGA
2 91.000 GAGGTGACTCACTTGGAA
3 93.000 CGGATGACTCTTTTTTTT
4 90.000 ACAGTGACTCACGTTTTT
5 94.000 GTCGTGACTCATATGCTT
6 91.000 TGAATGACTCACTTTTTG
7 81.000 TTCTTGACTCGTCTTTTC
8 90.000 CGAATGACTCTTATTGAT
9 75.000 AGAATGACTAATTTTACT
10 97.000 TCGTTGACTCATTCTAAT
11 97.000 TTGCTGACTCATTACGAT
12 93.000 GAGATGACTCTTTTTCTT
13 69.000 GCGATGATTCATTTCTCT
14 90.000 TAGATGACTCAGTTTAGT
15 90.000 TAAGTGACTCAGTTCTTT
16 90.000 ATGATGACTCTTAAGCAT
Top score 97.000 Bottom score 69.000
Mean 88.750 Standard deviation 7.319
Mean minus 3.sd 66.794 Mean plus 3.sd 110.706
? Cutoff score (-999.00-9999.00) (66.79) =
? Top score for scaling plots (66.79-999.00) (110.71) =
? Position to identify (0-18) (1) =
? Title=GCN4 DI WTS
? Name for new weight matrix file=3.WTS
? Menu or option number=D60
Motif search using dinucleotide weight matrix
X 1 Use weight matrix
2 Make weight matrix
3 Rescale weight matrix
? 0,1,2,3 =
? Motif weight matrix file=3.WTS
GCN4 DI WTS
? Cutoff score (-9999.00-9999.00) (66.79) =40
? (y/n) (y) Plot results n
15 42.00 CAACCCGCTCACCGACAA
29 42.00 ACAACAGCTCACCCACGC
93 46.00 AGCCTTCCTCATCGCTGC
153 40.00 CAGCGGAATCAAACTTAA
408 42.00 CGATGGATTCAAGTTGAA
469 47.00 TTAGGAACTCCCTCTGTC
493 60.00 AAGCTGAATCTTAGCAGC
530 43.00 CGGAGGGCTCAGTGAGGG
542 47.00 TGAGGGACTACTGCACCA
678 41.00 CTTCTGCTTCAAAGAGTT
709 47.00 AATATGACGGCGCACGTG
848 54.00 GTCAGAACTCAAATCAGT
940 49.00 CCGTTGACGACCTCCGCA
992 42.00 TGGGCACCTCACACCAAG
.end lit
.left margIN1
@61. TX 8 @ Search for eukaryotic ribosome binding sites
.LEFT MARGIN2
.para
Searches for eukaryotic ribosome binding sites using weightings derived
from Sargan, Gregory and Butterworth (FEBS Lett. 147, 133-136, 1982). No
dialogue is required. The method was first described in Staden (Nucl.
Acids Res. 12, 505-519, 1984).
.LEFT MARGIN1
.lit
mRNA WTS FOR EUKARYOTES SARGAN,GREGORY,BUTTERWORTH FEBS LET
147 133-136 1982
P  -7  -6  -5  -4  -3  -2  -1   1   2   3
  102 102 102 102 102 102 102 102 102 102
T  19  24  31  12   0  18   5   0 102   0
C  20  15  32  65   5  42  52   0   0   0
A  50  27  27  19  86  36  34 102   0   0
G   6  29  12   6  11   6  11   0   0 102
VIRAL ONLY
P  -7  -6  -5  -4  -3  -2  -1   1   2   3
   41  41  41  41  41  41  41  41  41  41
T  14  12  16   4   2  13   9   0  41   0
C   7   3  13  17   7   9  14   0   0   0
A  15  10   6  10  27  15   9  41   0   0
G   5  16   6  10   5   4   9   0   0  41
.END LIT
The Sargan et al. paper puts forward the hypothesis that there is an
interaction between some mRNA leader sequences and a highly conserved
structure in the 18S rRNA of eukaryotic ribosomes. The attempt to
substantiate the hypothesis includes a table of base frequencies for
sequences immediately 5' to start codons. They examined 102 sequences and
I have used the base frequencies they found as a weight matrix for
searching for eukaryotic gene starts. I don't yet know how good this
method is. The viral sequences were found to be slightly different but the
separate table shown here is not used in the program.
.left margin1
@62. TX 8 @ Search for splice junctions
.LEFT MARGIN2
.para
This function searches for mRNA splice junctions using a weight matrix. The
default weight matrix is still that derived from the paper of Mount (Nucl.
Acids Res. 10, 459-472, 1982). However users may employ their own tables.
By default the positions of possible junctions will
be plotted rather than listed.
The diagram splits the donor plot into 3 horizontal boxes
so that all the
sites marked in any box are from the same reading frame. The acceptor
plot appears above the donor plot and is split in an equivalent way. So
sites marked as donors and acceptors in equivalent boxes are compatible,
i.e. donors from donor box 1 are compatible with acceptors from acceptor box
1, etc. Of course it is the combination of reading frame and splice sites
that really matters, and donors from box 1 can be compatible with acceptors
in box 3 if the reading frame switches.
.para
If dialogue is selected users can employ their own file of weights (see
below for the format), can change the cutoff scores, and can elect to have
the results listed rather than plotted. Listed results show the position
(of the last or first base in the exon), the frame and the matching sequence.
The frequency table shown below is used as a default
weight matrix and AG and GT are obligatory at the appropriate positions.
The plots are scaled so that the top of scale is the highest value achieved
by
a junction sequence in the set used to compile the frequency table, and
the
bottom of the scale is the lowest value achieved by a junction sequence
in
the set used to compile the frequency table.
.para
In the light of current knowledge it would be sensible for users to use
the weight matrix search option (20)
to create matrices that define more specific splice junctions. If so it is
important that the positions "marked" are the last base in the donor exon and
the first base in the acceptor exon. To make a weight matrix suitable for
use with this function follow the instructions for option 20 and create
files for both donor and acceptor sites. Then concatenate the two matrix files
with the donor file first.
Note that any positions in the weight matrix that are
100% conserved will be made obligatory (normally the AG and GT).
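.para
The rule that 100% conserved positions become obligatory can be
illustrated with the Mount donor counts listed in this section. This is a
sketch in Python, not the program's own code; the function name
obligatory_columns is hypothetical and columns are numbered from 1 at the
left of the table rather than by the P labels.
.lit
```python
# Mount donor counts copied from the table in this section (rows T, C, A, G).
DONOR = {
    "T": [28, 8, 15, 17, 0, 136, 9, 16, 7, 84, 30, 36],
    "C": [41, 60, 16, 7, 0, 0, 3, 13, 3, 17, 28, 39],
    "A": [40, 56, 89, 12, 0, 0, 83, 91, 12, 23, 53, 33],
    "G": [27, 12, 16, 100, 136, 0, 41, 16, 114, 12, 25, 28],
}

def obligatory_columns(matrix):
    """Return {1-based column: base} for columns where one base has every count."""
    length = len(next(iter(matrix.values())))
    found = {}
    for col in range(length):
        total = sum(row[col] for row in matrix.values())
        for base, row in matrix.items():
            if total > 0 and row[col] == total:
                found[col + 1] = base
    return found

print(obligatory_columns(DONOR))
```
.end lit
.para
For the donor table this picks out two adjacent columns holding G and T,
which is the obligatory GT mentioned above.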
.LEFT MARGIN1
.lit
Mount donors redone 16-4-91
12 3 -16.085 -7.500
P -2 -1 0 1 2 3 4 5 6 7 8 9
N 136 136 136 136 136 136 136 136 136 136 136 136
T 28 8 15 17 0 136 9 16 7 84 30 36
C 41 60 16 7 0 0 3 13 3 17 28 39
A 40 56 89 12 0 0 83 91 12 23 53 33
G 27 12 16 100 136 0 41 16 114 12 25 28
Mount acceptors redone 16-4-91
18 15 -26.142 -14.400
P -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3
N 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113
T 58 50 57 59 67 56 58 49 47 66 64 31 34 0 0 11 41 31
C 21 28 34 25 29 33 35 32 42 40 33 25 74 0 0 23 28 41
A 17 11 11 18 7 17 12 23 15 3 10 29 5 113 0 24 21 21
G 17 24 11 11 10 7 8 9 9 4 6 28 0 0 113 55 23 20
.END LIT
.left margIN1
@63. TX 7 @ Search using a weight matrix (complementary)
.LEFT MARGIN2
.para
This function searches the complementary strand of the sequence using
a weight matrix. Many
motifs can bind to either strand of the DNA and this function allows
users to
search the complementary strand without having to change the
orientation of the sequence. See option 20 for more details.
.left margin1
@64. TX 3 @ Plot observed-expected word frequencies
.LEFT MARGIN2
.PARA
This option is designed to examine the abundances of short
words in a sequence to see if particular ones are either under or over
represented. It compares the observed and expected frequencies and
plots them along the sequence. There has been some work on the relative
amounts of CG dinucleotides in eukaryotic sequences (eg Bird, Nature
321,
209-213 (1986)) and this new routine can be used to examine such
biases, or
any others that might be interesting.
.para
The user selects a word (say CG), a window length, and a maximum and
minimum scale for plotting the results. The program examines each
successive window of that length along the sequence, with each window
overlapping the previous one by windowlength-1.
The program counts the base frequencies in each window, and the number
of
occurrences of the chosen word within the window. Using the base
frequencies it calculates an expected number of occurrences for the
chosen
word (simply by multiplying the relevant frequencies). It plots
observed-expected, and hence will show regions that are rich or depleted
in
the chosen word. The longest allowed word is 9 characters, but the
calculation of the expected frequencies becomes less appropriate as the
word
length increases above 2.
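.para
The calculation can be sketched as follows (Python, not part of NIP). The
sketch assumes that the expected count is the window length multiplied by
the product of the frequencies of the bases in the word; the exact
normalisation used by the program is not stated here. The function name
obs_minus_exp is hypothetical.
.lit
```python
def obs_minus_exp(seq, word, span):
    """Observed minus expected count of word in every window of length span."""
    results = []
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        observed = sum(
            1 for i in range(span - len(word) + 1)
            if window[i:i + len(word)] == word
        )
        expected = span
        for base in word:
            expected *= window.count(base) / span   # base frequency in this window
        results.append(observed - expected)
    return results

print(obs_minus_exp("CGCGAT", "CG", 4))
```
.end lit
.para
Positive values mark windows that are rich in the word and negative values
mark windows that are depleted.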
.para
Typical dialogue follows.
.lit
? Menu or option number=D64
Plot composition differences (obs-exp))
Default String=CG
? String=
? odd span length (3-401) (101) =
? plot interval (1-20) (5) =
? Maximum plot value (-6.31-25.25) (6.31) =
? Minimum plot value (-25.25-6.31) (-6.31) =
Missing graphics display here
.end lit
.left margIN1
@65. TX 9 @ Search for polya sites
.LEFT MARGIN2
.para
Simply searches for the sequence AATAAA
(Proudfoot and Brownlee, Nature 263, 211-214,
1976) and marks each occurrence with a short vertical line.
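.para
An equivalent search is easy to sketch (Python, not part of NIP). The
function name find_all is hypothetical and positions are numbered from 1.
.lit
```python
def find_all(seq, motif="AATAAA"):
    """Return the 1-based start positions of every (possibly overlapping) occurrence."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start + 1)
        start = seq.find(motif, start + 1)
    return positions

print(find_all("GGAATAAACCAATAAA"))
```
.end lit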
.left margin1
@66. TX 1 @ Interconvert t and u
.LEFT MARGIN2
.para
This function interconverts T and U characters in the active sequence, i.e.
between DNA and RNA.
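.para
As a sketch (Python, not part of NIP), the interconversion amounts to
swapping the two characters. Whether the program swaps both letters in one
pass, as assumed here, is not stated. The name interconvert_t_u is
hypothetical.
.lit
```python
SWAP_T_U = str.maketrans("TUtu", "UTut")

def interconvert_t_u(seq):
    """Replace every T by U and every U by T, preserving case."""
    return seq.translate(SWAP_T_U)

print(interconvert_t_u("ATGCTT"))
```
.end lit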
.LEFT MARGIN1
@67. TX 7 @ Search for patterns of motifs
.left margin2
.para
This option searches for patterns of motifs. Patterns can be defined
interactively or read from files. Results can be displayed in several ways
in both graphical and textual form. The option is also used to create
pattern files for searching libraries. It is extremely flexible and
consequently the following documentation is quite lengthy. However the
routine is capable of searching for almost any known pattern. In addition
the flexibility does not necessitate difficulty of use, and the user
interface has been simplified considerably since the methods were first
published.
.para
Users should refer to the "typical dialogue" shown below for the most
helpful information on using the program.
.para
There are currently
four ways to display the matching patterns: 1 = each individual
motif and its position is listed; 2 = all the sequence between, and
including, the two outermost motifs is listed; 3 = graphical, with a
vertical line marking the position of the leftmost motif; 4 = EMBL feature
table format, where the KEYNAM field is the motif name, the FROM and TO
fields denote the ends of the match, and the DESCRIPTION field is
"Program".
.para
When it is defined for the first time a pattern must be entered
interactively at the keyboard, but the pattern description
can be saved to a file.
This file can be used for all subsequent searches.
.para
When defining a pattern interactively
select a motif class and the program will request the required inputs.
.para
The program gives each motif an identifying name and number.
For motifs other than the first, a range of allowed positions must be
defined (Note that sets of motifs included using the OR operator will all
be given the same range, and so the program will only request range
values
for the first motif in any such set).
To specify the allowed range for a motif the user must supply the
following: the identifying number of the motif relative to which the
current motif's positions are to be defined (termed the "reference
motif"); a "relative start position"; and a range. The relative start
position can be negative or positive. A negative start position means that
although the reference motif is searched for first, the current motif can
be found to its left. A zero relative start position means their left ends
are superimposed. The default start position is to butt-join the motif to
the right-hand end of the "reference motif". The range is "the number of
extra positions" that the motif can take.
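.para
The relationship between the relative start position and the number of
extra positions can be sketched as follows (Python, not part of NIP). The
function name is hypothetical, and the conversion of relative positions to
sequence coordinates is deliberately left out because it is not fully
specified here.
.lit
```python
def allowed_relative_positions(relative_start, extra_positions):
    """Relative positions the 5' base may take: the start plus each extra position."""
    return list(range(relative_start, relative_start + extra_positions + 1))

print(allowed_relative_positions(4, 2))
```
.end lit
.para
A start of 4 with 2 extra positions gives relative positions 4 to 6, which
is how the pattern description in the typical dialogue reports the direct
repeat motif.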
.para
The program will display the probability of finding each motif. These
values are presented in the following form: .1234E-5 means 0.1234 times
10
to the power -5.
.para
After the pattern has been defined, the program will type a description
of
it on the screen. It will then allow the user to give an overall cutoff
score and overall probability cutoff.
.para
Typical dialogue for all the different motif classes is displayed below.
.lit
? Menu or option number=67
Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =
? Motif name=Ematch
? String=AA
Probability of score 2.0000 = 0.595E-01
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =2
? Motif name=AAA
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-1) (1) =
? Relative start position (-1000-1000) (3) =
? Number of extra positions (0-1000) (0) =
? string=AAA
? Minimum matches (1.00-3.00) (3.00) =2
Probability of score 2.0000 = 0.149E+00
1 Exact match
X 2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =3
? Motif name=T'S
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-2) (2) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? String=TTT
? Minimum score (0.00-108.00) (108.00) =72
Probability of score 72.0000 = 0.258E+00
1 Exact match
2 Percentage match
X 3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =4
? Motif name=GCN4
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-3) (3) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =
? Weight matrix file name=GCN4
GCN4 FROM WEIGHTS 17-11-87
Probability of score -22.0020 = 0.139E-02
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
X 4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =5
? Motif name=GCN4
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-4) (4) =
? Relative start position (-1000-1000) (20) =
? Number of extra positions (0-1000) (0) =
? Weight matrix file name=GCN4
GCN4 FROM WEIGHTS 17-11-87
Probability of score -22.0020 = 0.606E-03
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
X 5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =6
? Motif name=LOOP
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-5) (5) =
? Relative start position (-1000-1000) (20) =
? Number of extra positions (0-1000) (0) =
? Stem length (1-60) (6) =
? Minimum loop length (-6-60) (0) =
? Maximum loop length (0-60) (0) =5
? Minimum score (1.00-12.00) (12.00) =10
Probability of score 10.0000 = 0.598E-02
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
X 6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =7
? Motif name=Tstep
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-6) (6) =
? (y/n) (y) Relative to 5 prime end
? Relative start position (-1000-1000) (1) =
? Number of extra positions (0-1000) (0) =
? String=TTT
? Step (1-20) (3) =
Probability of score 3.0000 = 0.367E-01
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
X 7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =8
? Motif name=REPEAT
X 1 And
2 Or
3 Not
? 0,1,2,3 =
? Number of reference motif (1-7) (7) =
? Relative start position (-1000-1000) (4) =
? Number of extra positions (0-1000) (0) =2
? Repeat length (1-60) (6) =
? Minimum gap (0-60) (0) =
? Maximum gap (0-60) (0) =4
? Minimum score (1.00-6.00) (6.00) =5
Probability of score 5.0000 = 0.554E-02
1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
X 8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =9
? (y/n) (y) Save pattern in a file N
Pattern description
Motif 1 named Ematch is of class 1
Which is an exact match to the string
AA
Motif 2 named AAA is of class 2
which is a match of score 2. to the string
AAA
and the 5 prime base can take positions 3 to 3
relative to the 5 prime end of motif 1
It is anded with the previous motif.
Motif 3 named T'S is of class 3
which is a match of score 72. to the string
TTT
and the 5 prime base can take positions 4 to 4
relative to the 5 prime end of motif 2
It is anded with the previous motif.
Motif 4 named GCN4 is of class 4
Which is a match to a weight matrix with score -22.002
and the 5 prime base can take positions 4 to 4
relative to the 5 prime end of motif 3
It is anded with the previous motif.
Motif 5 named GCN4 is of class 5
Which is a match to the complement of a weight matrix with score -22.002
and the 5 prime base can take positions 20 to 20
relative to the 5 prime end of motif 4
It is anded with the previous motif.
Motif 6 named LOOP is of class 6
Which is a stem-loop structure with stem length 6 and score 10.
The loop can have sizes 0 to 5
and the 5 prime base can take positions 20 to 20
relative to the 5 prime end of motif 5
It is anded with the previous motif.
Motif 7 named Tstep is of class 7
Which is an exact match to the string
TTT
with a step size of 3
and the 5 prime base can take positions 1 to 1
relative to the 5 prime end of motif 6
It is anded with the previous motif.
Motif 8 named REPEAT is of class 8
Which is a repeat with repeat length 6 and score 5.
The loop-out can have sizes 0 to 4
and the 5 prime base can take positions 4 to 6
relative to the 5 prime end of motif 7
It is anded with the previous motif.
Probability of finding pattern = 0.2348E-14
Expected number of matches = 0.5100E-09
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
Select display mode
X 1 Motif by motif
2 Inclusive
3 Graphical
4 EMBL feature table
? 0,1,2,3,4 =4
Searching
Total matches found 0
Menus and their numbers are
m0 = This menu
m1 = General
m2 = Screen control
m3 = Statistical analysis of content
m4 = Structures and repeats
m5 = Translation and codons
m6 = Gene search by content
m7 = Prokaryotic signal search
m8 = Eukaryotic signal search
? = Help
! = Quit
? Menu or option number=67
Pattern searcher
? (y/n) (y) Read pattern from keyboard
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =
? Motif name=Arun
? String=AAAAAA
Probability of score 6.0000 = 0.210E-03
X 1 Exact match
2 Percentage match
3 Cut-off score and score matrix
4 Cut-off score and weight matrix
5 Complement of weight matrix
6 Inverted repeat or stem-loop
7 Exact match, defined step
8 Direct repeat
9 Pattern complete
? 0,1,2,3,4,5,6,7,8,9 =9
? (y/n) (y) Save pattern in a file N
Pattern description
Motif 1 named Arun is of class 1
Which is an exact match to the string
AAAAAA
Probability of finding pattern = 0.2103E-03
Expected number of matches = 0.1522E+01
? Maximum pattern probability (0.00-1.00) (1.00) =
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
Select display mode
X 1 Motif by motif
2 Inclusive
3 Graphical
4 EMBL feature table
? 0,1,2,3,4 =4
Searching
FT Arun 1582 1587 Program
FT Arun 3160 3165 Program
FT Arun 4204 4209 Program
FT Arun 5691 5696 Program
FT Arun 6710 6715 Program
Total matches found 5
Minimum and maximum observed scores 6.00 6.00
.end lit
.para
These methods allow users to define and search for
complex patterns of motifs defined as single objects.
The programs allow individual DNA motifs to be defined in eight
different
ways, and protein motifs in six. Motifs are combined, using the logical
operators AND, OR and NOT, to describe a pattern. The pattern also
specifies the ranges of allowed relative separations of the individual
motifs.
.para
First some definitions.
.para
A MOTIF is a contiguous subsequence of fixed length.
At its simplest
it could be a single definite base or amino acid; a more complex motif
might be better represented as a consensus or a weight matrix;
two more-abstract types of
motif are direct and inverted repeats.
.para
A PATTERN is a higher order of structure defined by a list of motifs. The
motifs in a pattern are combined using the logical operators AND, OR and
NOT. The list also defines the allowed relative separations of the
motifs. In the current versions of the programs up
to 50 motifs can be combined into a single pattern. So using these
definitions there are two
differences between motifs and patterns: 1) the distances between all
elements of a motif are fixed, but
the separations of parts of patterns can vary;
2) all characters in a motif are defined
using the same method (class), but different parts of a pattern can be
defined in completely different ways.
.para
Each motif
can be represented in 9 ways (known as the motif class):
.sk1
.lit
MOTIF CLASSES
CLASS DESCRIPTION
1 Exact match to a short defined sequence. The IUB symbols
can be used for DNA sequences.
2 Percentage match to a defined short sequence. In nucleic acids,
the IUB symbols can be used.
3 Match to a defined sequence, using a score matrix and cutoff
score. The DNA matrix (see option 18) gives scores to IUB symbols
depending on their level of redundancy. MDM78 is used for proteins.
4 Match to a weight matrix with cutoff score.
5 As class 4 but on the complementary strand.
6 Inverted repeat or stem-loop. Fixed stem length, range of
loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
7 Exact match to short sequence but with a defined step size.
8 Direct repeat. Fixed repeat length, range of loop-out sizes,
cutoff score, and score matrix (for protein sequences MDM78 and
for nucleic acids an identity matrix).
9 Membership of a set. A list of sets of allowed amino acids for
each position in the motif. The sets are separated by commas(,).
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
acids in which one of I,V or L must be found in the first position,
then anything in the next two positions, D,E,K or R in the fourth
position and F,Y,W,I,L,V or M in the fifth. This class only applies
to protein sequences because for nucleic acids "membership of a
set"
can be achieved using IUB symbols.
Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
nucleic acids.
.end lit
.para
Class 1: exact match.
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. All symbols must match.
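.para
A class 1 search with IUB symbols can be sketched as follows (Python, not
part of NIP). The table holds the standard IUB ambiguity codes; the
function name exact_matches is hypothetical and the sequence itself is
assumed to contain only A, C, G and T.
.lit
```python
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}

def exact_matches(seq, motif):
    """1-based positions where every motif symbol matches the sequence base."""
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        if all(seq[start + i] in IUB[sym] for i, sym in enumerate(motif)):
            hits.append(start + 1)
    return hits

print(exact_matches("ATGACTCAT", "TGASTCA"))
```
.end lit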
.para
Class 2: percentage match
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. The minimum number of matching characters
must
also be specified.
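.para
A class 2 search can be sketched by counting matching characters at each
position (Python, not part of NIP). IUB symbols are left out to keep the
sketch short and the function name percentage_matches is hypothetical.
.lit
```python
def percentage_matches(seq, motif, min_matches):
    """Return (1-based position, matches) wherever at least min_matches characters agree."""
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        segment = seq[start:start + len(motif)]
        matches = sum(a == b for a, b in zip(segment, motif))
        if matches >= min_matches:
            hits.append((start + 1, matches))
    return hits

print(percentage_matches("GCAAGT", "AAA", 2))
```
.end lit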
.para
Class 3: match using a score matrix
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. The motif is not compared directly with the
sequence to count the number of matching characters. Instead a matrix is
used to provide a score for all possible pairs of characters. The motif
score for
any position along the sequence is the sum of the scores found by
looking-up the scores for each pair of aligned characters. A match is
declared if some minimum score is achieved.
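.para
The score matrix method can be sketched as follows (Python, not part of
NIP). The scores in pair_score are hypothetical values chosen for
illustration and are not the DNA matrix of option 18; only the identity
score of 36 is suggested by the typical dialogue, where TTT has a maximum
score of 108.
.lit
```python
def pair_score(a, b):
    """Hypothetical scores: identity 36, transition (A-G or C-T) 12, otherwise 0."""
    if a == b:
        return 36
    if {a, b} in ({"A", "G"}, {"C", "T"}):
        return 12
    return 0

def score_matrix_matches(seq, motif, min_score):
    """Return (1-based position, score) wherever the summed pair scores reach min_score."""
    hits = []
    for start in range(len(seq) - len(motif) + 1):
        segment = seq[start:start + len(motif)]
        score = sum(pair_score(s, m) for s, m in zip(segment, motif))
        if score >= min_score:
            hits.append((start + 1, score))
    return hits

print(score_matrix_matches("ATTCTT", "TTT", 80))
```
.end lit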
.para
Class 4: weight matrix
.para
The motif is defined by a table of values (called weights or scores). The
table gives a score for finding each possible character at each position
along the length of the motif. It therefore
has dimension motif-length x character-set-size, and allows us to give
different scores for each character at each position. It is equivalent to
having a different score matrix for each position along the motif, and
provides the most flexible and specific method of defining motifs. The
weight matrices are created by program NIP option 20 and
stored as files. The file contains the values
for each position, as well as an overall minimum score.
There are two ways in which these values can be used to calculate an
overall
score for any section of the sequence. The simplest way is to add the
values in the file. (This means that the highest possible score
can be calculated by adding the top value at each column
position, and the lowest
by adding the bottom value.)
The normal way of using the values in the file is as
follows.
First the programs divide the values in each column by the column total
so that they sum to 1.0. Then the natural logs of these values are used as
scores. When the matrix is applied to a sequence these logarithmic values
are summed (which is of course equivalent to multiplying the
frequencies).
Note that using the natural logs of the frequencies as
weights and
adding them means that the overall cutoff score must be less than zero,
whereas if the original
values in the weight matrix file are added, the cutoff score will be
greater than zero. The search routines therefore decide whether the user
wants to add values or multiply frequencies
by examining the value of the cutoff score: it will add if the cutoff
is
greater than zero and add logs of frequencies if it is less than zero.
Hence we effectively get two
motif classes in one. The program NIP, when creating weight matrix
files, will ask the user whether the scores should be added or multiplied.
If the values in the table have been defined
without using a set of aligned sequences
it is easier for the user to
choose a cutoff score if the values are added.
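.para
The two ways of scoring, and the use of the sign of the cutoff to choose
between them, can be sketched as follows (Python, not part of NIP). The
matrix is a small made-up example, the name weight_matrix_score is
hypothetical, and all values are assumed to be non-zero so that the logs
are defined.
.lit
```python
import math

BASES = "TCAG"

def weight_matrix_score(matrix, segment, cutoff):
    """Score one segment against a weight matrix (one {base: value} dict per column).

    A cutoff above zero means the raw values are added. Otherwise each column
    is normalised to sum to 1.0 and the natural logs are added.
    """
    if cutoff > 0:
        return sum(column[base] for column, base in zip(matrix, segment))
    score = 0.0
    for column, base in zip(matrix, segment):
        total = sum(column[b] for b in BASES)
        score += math.log(column[base] / total)
    return score

# Made-up two-column matrix used only for this illustration.
toy = [{"T": 1, "C": 1, "A": 7, "G": 1}, {"T": 7, "C": 1, "A": 1, "G": 1}]
print(weight_matrix_score(toy, "AT", 10))
print(weight_matrix_score(toy, "AT", -1.0))
```
.end lit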
.para
Class 5: complement of weight matrix
.para
The motif is defined by a weight matrix, but the program searches for its
complement.
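.para
One way to picture the complement of a weight matrix is to reverse the
order of the columns and exchange the rows for complementary bases. The
sketch below (Python, not part of NIP) does this; whether the program
transforms the matrix or the sequence internally is not stated, and the
name complement_matrix is hypothetical.
.lit
```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def complement_matrix(matrix):
    """Reverse the column order and swap each base with its complement."""
    return [
        {COMPLEMENT[base]: value for base, value in column.items()}
        for column in reversed(matrix)
    ]

# Made-up two-column matrix used only for this illustration.
example = [{"T": 1, "C": 2, "A": 7, "G": 0}, {"T": 6, "C": 1, "A": 1, "G": 2}]
print(complement_matrix(example))
```
.end lit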
.para
Class 6: inverted repeat, or stem-loop
.para
The motif is defined by a stem length, a minimum score
and a range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0.
The loop sizes are defined by a minimum
and maximum distance from the 3' end of the stem.
For a stem-loop these will be positive numbers. For example to
define a stem of length 8 and loop sizes varying from 3 to 5, the stem
would be set to 8, the minimum start distance to 3 and the maximum
to 5. To define an
inverted repeat the minimum distance will be negative. For example stem
length=9,
minimum distance=-9, and maximum distance=-8 will find
inverted repeats of lengths 9 and 10.
E.g. AAAAATTTT and AAAAATTTTT would be found, the first having a base
at
its centre, the second having none.
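.para
The stem scoring can be sketched as follows (Python, not part of NIP).
Only stem-loops with a loop size of zero or more are handled; the negative
distances used for inverted repeats are not covered. The name stem_score
is hypothetical and start is counted from 0.
.lit
```python
PAIR_SCORES = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}

def stem_score(seq, start, stem, loop):
    """Score a stem of the given length at 0-based start with a loop of the given size.

    Base i of the 5' arm pairs with base (stem - 1 - i) of the 3' arm.
    Returns None if the structure runs off the end of the sequence.
    """
    five = seq[start:start + stem]
    three = seq[start + stem + loop:start + 2 * stem + loop]
    if len(three) < stem:
        return None
    return sum(PAIR_SCORES.get((a, b), 0) for a, b in zip(five, reversed(three)))

print(stem_score("GCGCAAAGCGC", 0, 4, 3))
```
.end lit
.para
The maximum score is twice the stem length, which agrees with the range
offered in the typical dialogue (1 to 12 for a stem length of 6).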
.para
Class 7: exact match, defined step size.
.para
The motif is defined by a short sequence, which for nucleic acids,
may include IUB symbols. All symbols must match. The class differs
from
class 1 in that searches will move in steps of some given size. For
example
we could search for a certain codon and use a step size of 3 and hence
keep in a
single reading frame.
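.para
A stepped search can be sketched as follows (Python, not part of NIP).
The function name stepped_matches and the parameter first (the 1-based
position at which stepping begins) are hypothetical.
.lit
```python
def stepped_matches(seq, motif, first, step):
    """1-based positions first, first+step, ... at which the motif matches exactly."""
    hits = []
    for pos in range(first, len(seq) - len(motif) + 2, step):
        if seq[pos - 1:pos - 1 + len(motif)] == motif:
            hits.append(pos)
    return hits

# TAA occurs at positions 4, 10 and 14, but only 4 and 10 share a frame with position 1.
print(stepped_matches("ATGTAACCCTAAGTAA", "TAA", 1, 3))
```
.end lit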
.para
Class 8: direct repeat
.para
The motif is defined by a repeat length, a minimum score
and a range of loop-out sizes. The scores are defined using MDM78 for protein
sequences and an identity matrix for nucleic acids.
The loop-out sizes are defined by a minimum
and maximum distance from the 3' end of the first copy of the repeat.
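.para
For nucleic acids the direct repeat score can be sketched with an identity
matrix as follows (Python, not part of NIP). MDM78 scoring for proteins is
not covered, the name direct_repeat_score is hypothetical, and start is
counted from 0.
.lit
```python
def direct_repeat_score(seq, start, length, gap):
    """Identity score between two copies of the given length separated by gap bases.

    Returns None if the second copy runs off the end of the sequence.
    """
    first = seq[start:start + length]
    second = seq[start + length + gap:start + 2 * length + gap]
    if len(second) < length:
        return None
    return sum(a == b for a, b in zip(first, second))

print(direct_repeat_score("ACGTACGGACGTAC", 0, 6, 2))
```
.end lit
.para
A search over a range of gaps would call this for each gap from the
minimum to the maximum and keep those that reach the minimum score.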
.para
Class 9: membership of a set
.para
This motif class is for protein sequences. It is defined by lists of
allowed amino acids for each position in the motif, and a cut-off score.
Positions at which any amino acid can occur are left blank.
All allowed amino acids for each position give a score of 1.
The motifs can be defined in two ways: either typed at the keyboard or
read
in as a weight-matrix-like file.
When the motif is defined at the keyboard the sets of allowed amino
acids
are separated by commas(,).
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
acids in which one of I,V or L must be found in the first position,
then anything in the next two positions, D,E,K or R in the fourth
position and F,Y,W,I,L,V or M in the fifth. To specify that the
whole motif must match, a score of 3 would be required (i.e. one of the
allowed amino acids must be found for each of the three defined
positions).
If the motif is read from a file the file must have been written by
program
NIP, or have been saved by the pattern searching routines. If the
user
elects to save a pattern, and it includes class 9 motifs typed at the
keyboard, then the program will save the class 9 motifs as weight matrix
files. Therefore it will request file names for each motif of this class.
If the motif given above as an example were saved the weight matrix file
would have 5 columns.
The first column
would contain zeroes except for the I, V and L rows
which would be set to 1; the next two columns would all be zero; the next
would be zero except for the D,E,K and R rows which would be 1; the final
column would contain 1's in rows F,Y,W,I,L,V and M, with
the rest zero.
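.para
The keyboard form of a class 9 motif and its weight-matrix-like
equivalent can be sketched as follows (Python, not part of NIP). The
function names and the order of the amino acid rows are assumptions made
for illustration; the layout of the real file is not described here.
.lit
```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def parse_sets(definition):
    """Split a definition such as IVL,,,DEKR,FYWILVM into one set per position."""
    return [set(field) for field in definition.split(",")]

def set_score(sets, segment):
    """One point for each position whose residue is in the allowed set."""
    return sum(1 for allowed, residue in zip(sets, segment) if residue in allowed)

def as_columns(sets):
    """One 0/1 column per position; blank positions give a column of zeroes."""
    return [{aa: int(aa in allowed) for aa in AMINO_ACIDS} for allowed in sets]

sets = parse_sets("IVL,,,DEKR,FYWILVM")
print(set_score(sets, "IAAKF"))
```
.end lit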
.para
The logical operator (AND, OR or NOT) used to add each motif to the
pattern
is specified by preceding
the class number by the letters A, O or N. A = AND, O = OR, N = NOT.
The default is A, so N2 means include, using the NOT operator, a class 2
motif; O2 means include, using the OR operator, a class 2 motif; both A2
and
2 mean include, using the AND operator, a class 2 motif.
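.para
The prefix convention can be sketched as a small parser (Python, not part
of NIP). The function name parse_motif_code is hypothetical.
.lit
```python
OPERATORS = {"A": "AND", "O": "OR", "N": "NOT"}

def parse_motif_code(code):
    """Split a code such as 'N2' into (operator, class number); the default is AND."""
    code = code.strip().upper()
    if code[0] in OPERATORS:
        return OPERATORS[code[0]], int(code[1:])
    return "AND", int(code)

print(parse_motif_code("N2"))
```
.end lit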
.para
Range setting.
.para
The motifs in a pattern are numbered according to their order in the list.
Apart from the first motif in a pattern all motifs are given a range
of allowed positions relative to a motif further up the list.
For example
suppose we have a pattern defined by A AND B AND C AND D.
Motif A can occur anywhere, but B must have its range of allowed
positions defined relative to the position of motif A, and C's positions
can be defined relative to either A or B, depending on which is most
convenient, and likewise D's positions can be relative to A or B or C.
.para
Notice that the positions of motifs can be defined relative to more than
one motif. Suppose we have a pattern consisting of
motifs A, B and C, and that B occurs 5-10 residues right of A, C occurs
5-10 residues right of B, and also C is never more than 15 residues from
A. Then it is quite consistent with the methods to include motif C in the
pattern twice using the AND operator: once relative to A and once relative
to B.
This will define the relative spacing and the ORDER of the motifs in the
pattern. (If we simply defined the position of C relative to A it could be
found to the left of B).
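.para
The effect of giving motif C two ranges can be sketched as a check on
three start positions (Python, not part of NIP). As a simplification the
distances are measured between the start positions of the motifs and the
motif lengths are ignored; the function name pattern_ok is hypothetical.
.lit
```python
def pattern_ok(a, b, c):
    """Check start positions a, b, c against the three range conditions in the example."""
    b_after_a = 5 <= b - a <= 10     # B is 5-10 residues right of A
    c_after_b = 5 <= c - b <= 10     # C is 5-10 residues right of B
    c_near_a = 0 <= c - a <= 15      # C is never more than 15 residues from A
    return b_after_a and c_after_b and c_near_a

print(pattern_ok(100, 105, 110), pattern_ok(100, 110, 120))
```
.end lit
.para
The second case satisfies both ranges relative to the neighbouring motif
but fails the range of C relative to A, so including C twice is what
rejects it.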
.para
Motifs combined together using the OR operator are all given the same
range. For example suppose we had a pattern A AND (B OR C) AND (D OR E),
then B and C each have the same range, and D and E also have
the same range as one another. The range for D and E can be relative to
A or to B.
.para
Motifs cannot have their ranges defined relative to motifs that are
included using the NOT operator. For example if we had the pattern A NOT
B
AND C, then the range for C can only be defined relative to motif A.
.para
Speed can be gained by arranging the order
of the motifs so that those higher up the list are of types that can be
searched for rapidly and that are also unlikely to be found.
.para
Motifs combined by the OR operator are alternatives: if any one of a set
of motifs
combined by the OR operator is found, then a match is declared. All
alternatives will be reported. For example if we had a pattern defined by
A
AND (B OR C), then all places where A occurs and B is found within range,
and all places where A is found and C is found within range will be
reported. A typical use would be where we might allow a motif to appear
on
either strand of the DNA sequence. For example a weight matrix
representing
the heatshock element could be used in a pattern which included
heatshock
as a motif class 4 combined using the OR operator
with heatshock as a motif class 5.
.para
The probability calculations are performed for each motif as it is
defined.
If an overall probability cut-off is given the calculation is repeated for
each match found. To achieve maximum searching speed do not give an
overall
probability cut-off. Overall cut-off scores should only be used if the
motif
classes used are compatible.
.para
There are currently
several ways to display the matches: 1 = each
motif and its position is listed; 2 = all the sequence between the two
outermost motifs is listed; 3 = graphical, with a spike marking the
position
of the leftmost motif. The library versions also give entry names, and a
one
line title; in addition they can be used to produce aligned families of
sequences. When this mode of output is selected the program will write a
separate file for each match. The files will be called ENTRYNAME.DAT
where
ENTRYNAME is the name of the entry in the library. The matching
sequence
will be written out so that the spacing between motifs is constant, and
set to the maximum allowed by the pattern definition. Any gaps will be
filled with dashes (-). If the individual sequences were subsequently
written one above the other
they should line up so that all motifs are in register. There are two types of
output of this sort: one, option 4, writes out whole sequences, the other,
option 5, writes out only the sequences between the two outermost
motifs.
Note that for option 4 users are asked to type the position of the
first motif, and the reason for
this is explained below.
Consider a pattern found in several sequences. Consider only
the first motif in
the pattern and suppose that it was found in different positions in these
sequences.
Say that of these positions the one furthest from the left end was
position 100. Then, in order to ensure that all the sequences would align,
we must specify that motif 1 must start at position 100.
Any sequences in which motif 1 started
nearer to the left end than position 100 would be padded accordingly.
These modes of output
should only be used when the position of each motif is defined relative to
its
immediate neighbour.
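.para
The padding used for the aligned output can be sketched as follows. This
is a minimal Python illustration of the idea for the whole-sequence case,
not the code of the programs. The function name pad_match, its arguments,
and the choice of placing the dashes at the right-hand end of each gap
are assumptions; the text above says only that gaps are filled with
dashes and that the spacing is set to the maximum allowed by the pattern.

```python
def pad_match(seq, motifs, max_gaps, first_pos):
    """Write one matching sequence so that its motifs fall in fixed columns.

    seq       -- the whole sequence
    motifs    -- list of (start, length) for each motif found, left to
                 right, with 1-based start positions
    max_gaps  -- for each adjacent pair of motifs, the largest number of
                 bases the pattern allows between them
    first_pos -- 1-based column in which motif 1 must start
    """
    start1 = motifs[0][0]
    left = seq[:start1 - 1]
    if len(left) > first_pos - 1:
        raise ValueError("motif 1 lies to the right of first_pos")
    # Pad the left end so that motif 1 starts in column first_pos.
    parts = ["-" * (first_pos - 1 - len(left)), left]
    for i, (start, length) in enumerate(motifs):
        end = start - 1 + length          # 0-based index just past the motif
        parts.append(seq[start - 1:end])
        if i + 1 < len(motifs):
            gap = seq[end:motifs[i + 1][0] - 1]
            if len(gap) > max_gaps[i]:
                raise ValueError("gap exceeds the maximum allowed")
            # Assumption: the dashes go at the right-hand end of the gap.
            parts.append(gap + "-" * (max_gaps[i] - len(gap)))
        else:
            parts.append(seq[end:])       # right flank, written as it is
    return "".join(parts)


if __name__ == "__main__":
    print(pad_match("GGCCAATACGTATAATT", [(3, 5), (11, 5)], [5], 5))
    print(pad_match("AAAACCAATACGTCTATAAG", [(5, 5), (15, 5)], [5], 5))
```

.para
In this made-up example the first motif is furthest from the left end in
the second sequence, at position 5, so 5 is the position we would type
for motif 1; the first sequence is then padded on the left to match.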
.para
The pattern descriptions can be saved to files. These files
can be used instead of typing definitions again at the keyboard. As the
files are annotated,
they can easily
be changed using system editors, and the modified versions used to
define the variant patterns for the programs.
.para
Use of lists of entry names
.para
The two programs that operate on libraries have the ability to
restrict their searches to subsets of the libraries. This does not require
sublibraries to be created but instead is achieved by using files
containing a list of the entry names of sequences. The user may choose to
search only those entries on the list or, alternatively, to search all but
those on the list (i.e. in the latter case
the list contains the names of those to be excluded).
The programs can search libraries that have indexes and those that
do not.
If a list of names for inclusion is used,
then the search will be faster if the index is present. In all other
circumstances the whole library will be read.
The list must be in library order except when it is used
to include entries, and an index is available.
The list must contain each entry name on a separate line, with the name
starting in column 1 of the line, i.e. there must be no spaces at the start
of the line.
The list of entry names
can be produced by the keyword searches of nip, pip, etc as long
as the listings produced have a space character separating the entry name
from the entry description. This will depend on how well the library
reformatting programs work. For example swissprot entry names tend to run
into the beginning of the descriptions, but other libraries are generally
OK.
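.para
The rules for the list of entry names can be sketched in Python as shown
below. This is only an illustration of the file layout and of the
include/exclude choice: the function names and the entry names in the
example are hypothetical, and treating a leading space as an error is an
assumption (the text above says only that there must be none).

```python
def read_entry_names(lines):
    """Return the entry names from the lines of a list file.

    Each name must start in column 1. Anything after the first space is
    taken to be a description, as in a keyword search listing, and is
    ignored.
    """
    names = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue                      # skip empty lines
        if line[0] == " ":
            raise ValueError("entry name must start in column 1")
        names.append(line.split()[0])
    return names


def select_entries(library_names, names, exclude=False):
    """Keep the entries on the list, or all but those on it.

    library_names -- entry names in library order
    names         -- names read from the list file
    exclude       -- False: search only the listed entries
                     True:  search all except the listed entries
    """
    wanted = set(names)
    return [e for e in library_names if (e in wanted) != exclude]


if __name__ == "__main__":
    listing = ["HSACT1 Human actin\n", "ECLACZ\n"]
    names = read_entry_names(listing)
    library = ["AAAAA1", "HSACT1", "BBBBB1", "ECLACZ"]
    print(select_entries(library, names))
    print(select_entries(library, names, exclude=True))
```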
.para
One use of the programs is to look for patterns that we already know
about, but in new sequences. However it is hoped that they will also be
useful for finding new motifs. For example
several known control regions in
nucleic acid
sequences consist of particular direct or inverted repeats;
the inclusion of
direct and inverted repeats as motif classes
makes it possible to
find previously unknown
motifs of these types.
Using these new programs we can
ask questions like: "are there any inverted or direct repeats near to
sections of sequence that contain both a
CCAAT box and a TATA box?"; and to search for such things throughout
the
libraries. In addition, the mode of output in which all the sequence
between
the two outermost motifs found is printed out, allows us to extract
sequences and examine them in more detail for further common
subsequences.
For example we might want to collect together all the sequences
between
putative CCAAT and TATA boxes.
.para
A further use of the inverted repeat motif class is the following. If a
regulatory sequence in DNA is poorly defined but also an inverted repeat,
then it might be an advantage to specify it both as a consensus sequence
and
a superimposed inverted repeat. In this way two weak definitions can be
combined to produce a stronger pattern.
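.para
The idea of superimposing two weak definitions can be sketched in Python.
The sketch is an assumption-laden illustration and not the method used by
the programs: the consensus is scored by counting identical bases, the
inverted repeat is taken to be a simple palindrome with no loop, and the
names consensus_score, pair_score and find_sites are invented for the
example.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def consensus_score(window, consensus):
    """Number of positions agreeing with the consensus (N matches any base)."""
    return sum(1 for w, c in zip(window, consensus) if c == "N" or w == c)


def pair_score(window):
    """Number of base pairs formed when the window is folded on its centre."""
    half = len(window) // 2
    return sum(1 for i in range(half)
               if COMPLEMENT.get(window[i]) == window[-1 - i])


def find_sites(seq, consensus, min_consensus, min_pairs):
    """0-based positions satisfying both weak definitions at once."""
    n = len(consensus)
    return [i for i in range(len(seq) - n + 1)
            if consensus_score(seq[i:i + n], consensus) >= min_consensus
            and pair_score(seq[i:i + n]) >= min_pairs]


if __name__ == "__main__":
    seq = "TTGAATTCAAGACTTCAA"
    # Consensus alone, allowing one mismatch:
    print(find_sites(seq, "GAATTC", 5, 0))
    # Consensus plus the inverted repeat requirement:
    print(find_sites(seq, "GAATTC", 5, 3))
```

.para
In the example the relaxed consensus alone accepts two places, and adding
the inverted repeat requirement removes the one that is not symmetrical.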
.para
Given only a few examples of a motif it
should be possible to perform initial searches using a
class 3 motif, and then, using plausible matching sequences, create a
more
specific weight matrix for the same motif.
.para
If motifs are combined with the first motif using the OR operator
they will be ignored until all
permutations that include the first motif have been looked for.
The whole search will then be repeated, in
turn, for each of
those motifs that are combined with the first motif using the OR
operator.
An interesting consequence of this is that the program can be used,
without
change, to compare any newly determined sequence with all known
individual
motifs. We achieve this by having a pattern in which all known relevant
motifs are combined using the OR operator.
If we ask to use this pattern with
a sequence, the program will automatically compare each individual
motif in
the pattern with the whole length of the
sequence. As the number of known
motifs grows this should become an increasingly useful standard
procedure.
.para
The NOT operator is obviously
useful for making sure particular motifs are not present, but it can also
be used to bracket the levels of matches found. We may want a degree of
match that lies between two limits - binding should occur, but not too
strongly; or base-pairs should form, but not too many. We can specify
this
by asking for a match with a low score, in combination with a match with
a high score, both for the same motif, but with the high score included
using the NOT operator.
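.para
Bracketing with the NOT operator can be sketched as follows. The Python
below is an illustration only: the scoring (a count of identical bases
against a word) stands in for whatever motif class is really used, and
the names window_scores and bracketed_matches are assumptions.

```python
def window_scores(seq, word):
    """Score every window of seq as the number of bases identical to word."""
    n = len(word)
    return [sum(1 for a, b in zip(seq[i:i + n], word) if a == b)
            for i in range(len(seq) - n + 1)]


def bracketed_matches(scores, low, high):
    """Positions that match at the low cut-off AND NOT at the high cut-off."""
    return [i for i, s in enumerate(scores) if s >= low and not s >= high]


if __name__ == "__main__":
    scores = window_scores("ACGTACGAACCT", "ACGT")
    # Keep places that score at least 3 but do not reach 4.
    print(bracketed_matches(scores, 3, 4))
```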
.para
The algorithm is designed to find all sections of a sequence that satisfy
the pattern rather than only the best match.
Particularly if some of the motifs in a pattern are less well defined than
others, this can often result in the same region of a sequence being
reported as several matches that differ only in the
positions of the weakest motifs.
.para
General remarks on motif searching
.para
Generally motifs are short subsequences that are thought to be
associated with
particular functions in some known sequences. Often
we search for them to try to
understand or interpret other sequences. Sometimes we search for
motifs and
patterns to
test a hypothesis about their role: are they found in the expected
positions in the expected sequences? In doing so we should remember
that, in both proteins and nucleic acids,
what we are really looking for is a particular
three dimensional structure with certain affinities for other structures,
and that we are assuming that the sequence of the motif alone
defines the 3D structure we are searching for.
The overall structure
may be completely different to those in which the motif is functional,
and
hence the motif may have a different shape or be inaccessible.
We should be aware of the
importance of the context in which a motif is found. Where does it lie
relative to the overall structure, is it accessible, is the three
dimensional spacing between
it and other motifs correct? For example, is it on the same side of the
double helix, and the correct distance from some other motif? How does
context affect our assessment of the significance of finding a motif?
Finding false mammalian mRNA splice junctions in non-coding sequences
is
far less important than finding false sites in pre-mRNA sequences, but
finding them in the correct places is most important! In other words, it
is
often the case that when we are searching for a motif that is known to
be
necessary for some function, then a positive result in the form of a
match
in the required position, is more important than a high background of
matches in the wrong positions. Being
able to write
down the probability of finding a motif in a random sequence tells us how
well it is defined.
In nucleic
acids the DNA may contain many superimposed types of information such
as
those concerned with histone phasing, protein coding or mRNA secondary
structure. These overlapping "codes" may interfere with one another
causing
matches to motifs to be poorer than expected.
In general we will only have a limited number of examples of the
motif and we do not know how representative they are.
.para
Sequences have superimposed functions: some parts may be of general
structural
importance and give rise to an overall framework, and other parts give
specificity and hence are not common; we may want to use a set of
aligned
sequences to define a motif, but want to use only the framework
positions.
Alternatively we may want to pick out
only those parts of a set of aligned sequences that give a particular
property, and to ignore other similarities that are due to some other
property
and which could obscure the pattern
we are interested in.
It is possible to apply a mask to a set of aligned sequences in
order to give weight to selected positions only.
The ability to define a mask allows certain positions
to be used in the motif and others to be ignored, and yet still permits the
use of a set of aligned sequences to calculate weights. The mask is
requested and applied
by the program and results in the masked positions being zero
in
the weight matrix. The mask is defined in the following way.
Suppose we had a motif of length 15, then the mask
x--x--xx-x will give zero weights to positions 2,3,5,6 and 9 (note it is
the dashes (-) that are significant and that positions
1,4,7,8,10,11,12,13,14 and 15
will be non-zero). Of course
the same set of sequences could be used with several alternative masks
in
order to extract different features and create corresponding weight
matrices.
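.para
The mask example above can be checked with a short Python sketch. It is
an illustration under stated assumptions and not the program's code: the
weights are taken to be simple base counts from the aligned sequences,
and the function names count_matrix, zeroed_positions and apply_mask are
invented for the example. As in the text, a dash (-) zeroes a position
and positions beyond the end of the mask are left unchanged.

```python
def count_matrix(aligned):
    """Base counts for each column of a set of aligned sequences.

    Assumption: raw counts are used as the weights.
    """
    length = len(aligned[0])
    matrix = [{b: 0 for b in "ACGT"} for _ in range(length)]
    for s in aligned:
        for i, base in enumerate(s):
            matrix[i][base] += 1
    return matrix


def zeroed_positions(mask, motif_length):
    """1-based positions given zero weight: those marked by a dash."""
    return [i + 1 for i, c in enumerate(mask[:motif_length]) if c == "-"]


def apply_mask(matrix, mask):
    """Return a copy of the matrix with the masked columns set to zero."""
    zero = set(zeroed_positions(mask, len(matrix)))
    return [{b: 0 for b in col} if i + 1 in zero else dict(col)
            for i, col in enumerate(matrix)]


if __name__ == "__main__":
    aligned = ["ACGTACGTACGTACG", "ACGTACGTACGTACG"]   # motif of length 15
    masked = apply_mask(count_matrix(aligned), "x--x--xx-x")
    print(zeroed_positions("x--x--xx-x", 15))
    print([i + 1 for i, col in enumerate(masked) if any(col.values())])
```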
.para
The programs are described in Staden,R.
CABIOS 4, 53-60 (1988); Staden,R.
CABIOS 5, 89-96 (1989), and Methods in Enzymology 183, 193-211 (1990).
.left margin1
@ end of help