.NPA
|
|
.SP 1
|
|
.left margin1
|
|
@-1. TX 0 @General
|
|
.sp
|
|
@-2. T 0 @Screen control
|
|
.sp
|
|
@-2. X 0 @Screen
|
|
.sp
|
|
@-3. T 0 @Statistical analysis of content
|
|
.sp
|
|
@-3. X 0 @Statistics
|
|
.sp
|
|
@-4. T 0 @Structures and repeats
|
|
.sp
|
|
@-4. X 0 @Structures
|
|
.sp
|
|
@-5. TX 0 @Translation and codons
|
|
.sp
|
|
@-6. TX 0 @Gene search by content
|
|
.sp
|
|
@-7. TX 0 @General signals
|
|
.sp
|
|
@-8. TX 0 @Specific signals
|
|
.sp
|
|
@0. TX -1 @NIP
|
|
.PARA
|
|
.para
|
|
This is a program for analysing individual nucleotide sequences. It can
|
|
read sequences stored in many of the most commonly used formats, and
|
|
performs all of the usual simple analyses. However, the main purpose of
|
|
the program is to provide methods for finding the function of each
|
|
section of a sequence. In general no single method can give an
|
|
unequivocal interpretation of a sequence, so we need to use many
|
|
techniques together and to combine their results. For this reason the
|
|
program presents many of its results graphically.
|
|
.para
|
|
General information is contained in the user interface. Online
|
|
documentation for any function follows a consistent pattern: summary,
|
|
list of inputs, list of outputs, details, example.
|
|
.LEFT MARGIN1
|
|
@1. TX 0 @ Help
|
|
.LEFT MARGIN2
|
|
.para
|
|
This option gives online help. The user should select option numbers and
|
|
the current documentation will be given. Note that option 0 gives an
|
|
introduction to the program, and that ? will get help from anywhere in
|
|
the
|
|
program.
|
|
The following functions are included:
|
|
.left margin1
|
|
@2. TX 0 @ Quit
|
|
.left margin2
|
|
.para
|
|
This function stops the program.
|
|
.left margin1
|
|
@3. TX 1 @ Read a new sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
This option allows users to read in new sequences, browse through annotations,
|
|
or search sequence
|
|
libraries for keywords. Sequences can be read from "personal"
|
|
sequence files or from sequence libraries. These are referred to as the
|
|
sequence "source". Personal files can be stored in several formats:
|
|
Staden, PIR, EMBL, GENBANK and GCG.
|
|
At LMB we use "Staden" format for sequencing and all
|
|
the
|
|
libraries are stored in their original formats. Note, however, that libraries
|
|
such as EMBL or GenBank that are divided into several files (e.g. GenBank has
|
|
13 separate files) are indexed as a whole. This means that users do not need
|
|
to know which file contains an entry, only which library.
|
|
When the user selects to read in a sequence the program first asks for the
|
|
sequence "source".
|
|
.para
|
|
If the user selects "personal" the program will ask for
|
|
the format (Staden, PIR, EMBL, GENBANK or GCG), and then for the name of
|
|
the file. For PIR format the user will also be required to know the entry
|
|
name of the sequence as the file can contain several. For the other formats
|
|
only a single entry is expected. The file will be read, its length and
|
|
composition will be displayed and the option left.
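.para
The length and composition report can be sketched in a few lines of Python.
This is only an illustration, not the program's own code: the function name
composition is hypothetical, and it assumes that only the symbols T, C, A, G
and - are tallied and that percentages are taken relative to the full
sequence length, which agrees with the figures shown in the typical dialogue.

```python
from collections import Counter

SYMBOLS = "TCAG-"  # column order used in the composition display


def composition(sequence):
    """Return (length, counts, percentages) for T, C, A, G and '-'.

    Assumption: case is ignored and any other symbol contributes to the
    length but is not tallied in its own column.
    """
    seq = sequence.upper()
    length = len(seq)
    tally = Counter(seq)
    counts = {s: tally.get(s, 0) for s in SYMBOLS}
    percent = {s: (100.0 * counts[s] / length if length else 0.0)
               for s in SYMBOLS}
    return length, counts, percent


length, counts, percent = composition("ttcag-GATTACA")
print("Sequence length=", length)
print("  ".join(f"{s:>6}" for s in SYMBOLS))
print("  ".join(f"{counts[s]:>5}." for s in SYMBOLS))
print("  ".join(f"{percent[s]:>5.1f}%" for s in SYMBOLS))
```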
|
|
.para
|
|
If the user selects "library" as the sequence source the program will display a
|
|
list of available libraries. The programs are capable of handling all current
|
|
libraries but which ones are available will vary from site to site. At LMB we
|
|
have several libraries and also weekly updates of data gathered between releases.
|
|
The program will ask users to select a library and then give a list of options:
|
|
.lit
|
|
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entrynames from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
|
|
.end lit
|
|
If "Get a sequence" or "Get annotations" is selected, users will be asked to
|
|
type the entry name. The option will be left when a sequence is selected or
|
|
! is typed. The composition and length will be displayed.
|
|
.para
|
|
The text index contains all words from feature tables, reference titles,
|
|
definition lines, keywords lists and comments, so the text index search
|
|
is most useful. It is also the fastest. Up to 5 words can be searched for
|
|
at once. The words should be typed separated by spaces, for example
|
|
.lit
|
|
? Keywords=P53 mouse murine tumo
|
|
|
|
.end lit
|
|
will search for all entries that contain words starting with p53, mouse,
|
|
murine and tumo. Only the unique entries that contain ALL words will be
|
|
listed. Before listing the matching entries
|
|
the program will show the number of 'hits' for each word and ring the bell.
|
|
Escape is possible at this point, or after each screenful of entries.
|
|
In addition to the entry names the text search displays the primary accession
|
|
number, the sequence length and up to 80 characters of description.
|
|
(The search of 'titles' is now redundant because the full text index
|
|
contains all the title words and the search is much faster. It will probably
|
|
be removed from the program.)
|
|
All searches are independent of case. Where
|
|
possible the program will offer default entry names.
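.para
The matching rule described above (case-independent, words matched by their
leading characters, and only entries containing ALL words listed) can be
sketched as follows. The in-memory dictionary index, the entry names ENTRY1
to ENTRY3 and the function name search_text_index are hypothetical and serve
only to illustrate the rule; the real text index is a file on disk whose
layout is not described here.

```python
def search_text_index(index, query, max_words=5):
    """Return (hits per word, sorted entry names containing all words).

    index maps an entry name to the words taken from its annotation.
    A query word matches any indexed word that starts with it.
    """
    words = query.lower().split()
    if not words:
        return {}, []
    if len(words) > max_words:
        raise ValueError(f"at most {max_words} words can be searched at once")
    hits = {}
    for word in words:
        hits[word] = {
            name
            for name, entry_words in index.items()
            if any(w.lower().startswith(word) for w in entry_words)
        }
    matching = set.intersection(*hits.values())
    return {w: len(h) for w, h in hits.items()}, sorted(matching)


# Toy index with made-up entries, for illustration only.
index = {
    "ENTRY1": ["Mouse", "p53", "mRNA"],
    "ENTRY2": ["Murine", "tumour", "antigen", "p53"],
    "ENTRY3": ["Mouse", "lyn", "protein"],
}
counts, names = search_text_index(index, "P53 mouse")
for word, n in counts.items():
    print(word.upper(), "hits", n)
print(names)
```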
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
Select sequence source
|
|
X 1 Personal file
|
|
2 Sequence library
|
|
? Selection (1-2) (1) =
|
|
Select sequence file format
|
|
X 1 Staden
|
|
2 EMBL
|
|
3 GenBank
|
|
4 PIR
|
|
5 GCG
|
|
? Selection (1-5) (1) =
|
|
? Sequence file name=M13MP7.SEQ
|
|
Contig title removed
|
|
Sequence length= 7238
|
|
Sequence composition
|
|
T C A G -
|
|
2405. 1539. 1765. 1527. 2.
|
|
33.2% 21.3% 24.4% 21.1% 0.0%
|
|
.
|
|
.
|
|
.
|
|
|
|
|
|
Select sequence source
|
|
X 1 Personal file
|
|
2 Sequence library
|
|
? Selection (1-2) (1) =2
|
|
Select a library
|
|
X 1 EMBL 29 nucleotide library Dec 91
|
|
2 SWISSPROT 20 protein library Nov 91
|
|
3 PIR 31 protein library Dec 91
|
|
4 NRL3D 58 From Brookhaven protein library Dec 91
|
|
5 GenBank
|
|
? Selection (1-5) (1) =
|
|
Library is in EMBL format with indexes
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =5
|
|
Search for keywords
|
|
? Keywords=P53 mouse
|
|
P53 hits 68
|
|
MOUSE hits 8180
|
|
|
|
MMANT01 X00875 536 Murine gene fragment for cellular tumour antigen
|
|
MMANT02 X00876 83 Murine gene fragment for cellular tumour antigen
|
|
MMANT03 X00877 21 Murine gene fragment for cellular tumour antigen
|
|
MMANT04 X00878 261 Murine gene fragment for cellular tumour antigen
|
|
MMANT05 X00879 184 Murine gene fragment for cellular tumour antigen
|
|
MMANT06 X00880 113 Murine gene fragment for cellular tumour antigen
|
|
MMANT07 X00881 110 Murine gene fragment for cellular tumour antigen
|
|
MMANT08 X00882 137 Murine gene fragment for cellular tumour antigen
|
|
MMANT09 X00883 74 Murine gene fragment for cellular tumour antigen
|
|
MMANT10 X00884 107 Murine gene for cellular tumour antigen p53 (exon
|
|
MMANT11 X00885 562 Murine p53 gene 3' region with exon 11
|
|
MMANTP53 M26862 536 Mouse tumor antigen p53 gene, 5' end.
|
|
MMLYN M64608 2044 Mouse lyn protein mRNA, complete cds.
|
|
MMP53 X00741 1377 Mouse mRNA for transformation associated protein
|
|
MMP53A M13872 1285 Mouse p53 mRNA, complete cds, clone pcD53.
|
|
MMP53B M13873 1241 Mouse p53 mRNA, complete cds, clone p53-m11.
|
|
MMP53C M13874 1322 Mouse p53 mRNA, complete cds, clone p53-m8.
|
|
MMP53G1 X01235 554 Mouse genomic DNA for 5' region of cellular tumou
|
|
MMP53IN4 X60470 729 M.musculus p53 gene for p53 protein, intron 4
|
|
MMP53P X01236 2132 Mouse pseudogene for cellular tumour antigen p53
|
|
MMP53R X01237 1773 Mouse mRNA for cellular tumour antigen p53
|
|
MMRSB2P5 M64597 196 Mouse B2 repeat in the 3' flank of protein 53 (p5
|
|
22 different entries found
|
|
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =4
|
|
Search for keywords
|
|
? Keywords=alpha
|
|
Searching for alpha
|
|
AAGHA 623 a.anguilla mrna for glycoprotein hormone alpha subunit precu
|
|
AAMALI 3338 a.aegypti mali gene encoding alpha 1-4 glucosidase, complete
|
|
AAMALIA 1659 a.aegypti maltase-like i (mali) gene encoding alpha-1,4-gluc
|
|
AAMALIB 1832 a.aegypti maltase-like i (mali) mrna encoding alpha-1,4-gluc
|
|
ACA13GT 371 alouatta caraya alpha-1,3gt gene, 3' flank.
|
|
ADHBADA1 102 duck alpha-d-globin gene, exon 1.
|
|
ADHBADA2 1145 duck alpha-a-globin gene and 5' flank
|
|
ADHBADWP 513 duck (white pekin) alpha ii (minor) globin mrna, complete co
|
|
AEACOXABC 5279 a.eutrophus protein x (acox), acetoin:dcpip oxidoreductase-a
|
|
AGA13GT 371 ateles geoffroyi alpha-1,3gt gene, 3' flank.
|
|
AGAAAGFP 282 c.tetragonoloba alpha-amylase/alpha-galactosidase fusion pro
|
|
AGAABL 138 b.subtilis alpha-amylase signal peptide gene e.coli beta-lac
|
|
AGAFAMYA 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFAMYB 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFAMYC 57 synthetic b.stearothermophilus alpha amylase/s.cerevisiae ma
|
|
AGAFCOXA 98 synthetic alpha-factor/cox iv fusion gene signal peptide.
|
|
AGAGABA 7876 synthetic gossypium hirsutum (cotton) alpha globulin a and b
|
|
AGAMYLS 120 synthetic alpha-amylase gene, 5' end.
|
|
AGANPS 95 synthetic gene (jcnf-1) encoding alpha-factor pro-region/han
|
|
!
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =3
|
|
? Accession number=v00636
|
|
Entry name LAMBDA
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =2
|
|
Default Entry name=LAMBDA
|
|
? Entry name=
|
|
ID LAMBDA standard; DNA; PHG; 48502 BP.
|
|
XX
|
|
AC V00636; J02459; M17233; X00906;
|
|
XX
|
|
DT 03-JUL-1991 (Rel. 28, Last updated, Version 3)
|
|
DT 09-JUN-1982 (Rel. 1, Created)
|
|
XX
|
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
|
XX
|
|
KW circular; coat protein; DNA binding protein; genome;
|
|
KW origin of replication.
|
|
XX
|
|
OS Bacteriophage lambda
|
|
OC Viridae; ds-DNA nonenveloped viruses; Siphoviridae.
|
|
XX
|
|
RN [1]
|
|
RP 1-48502
|
|
RA Sanger F., Coulson A.R., Hong G.F., Hill D.F., Petersen G.B.;
|
|
RT "Nucleotide sequence of bacteriophage lambda DNA";
|
|
RL J. Mol. Biol. 162:729-773(1982).
|
|
XX
|
|
!
|
|
Select a task
|
|
X 1 Get a sequence
|
|
2 Get annotations
|
|
3 Get entry names from accession numbers
|
|
4 Search titles for keywords
|
|
5 Search text index for keywords
|
|
? Selection (1-5) (1) =
|
|
Default Entry name=LAMBDA
|
|
? Entry name=
|
|
DE Genome of the bacteriophage lambda (Styloviridae).
|
|
Sequence length 48502
|
|
Sequence composition
|
|
T C A G -
|
|
11988. 11360. 12336. 12818. 0.
|
|
24.7% 23.4% 25.4% 26.4% 0.0%
|
|
|
|
.end lit
|
|
.left margin1
|
|
@4. TX 1 @ Define active region
|
|
.LEFT MARGIN2
|
|
.para
|
|
For its analytic functions
|
|
the program always works on a region of the sequence called the "active
|
|
region". This function allows the start and end points of the active region
|
|
to be reset.
|
|
.para
|
|
Define the required start and end points.
|
|
.para
|
|
When a new sequence is read into the program the active region is
|
|
automatically set to start at the beginning of the sequence and extend to
|
|
the
|
|
maximum the program can
|
|
handle. On most machines this will be to the end of the sequence. The
|
|
positions are shown on the screen.
|
|
Note that for
|
|
convenience, in the
|
|
listing and translation functions, the user is given access to regions
|
|
outside the active region.
|
|
.left margin1
|
|
@5. TX 1 @ List a sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
The sequence can be listed single or double stranded with line lengths
|
|
from
|
|
10 to 120 in multiples of 10.
|
|
.para
|
|
Define the region to list, the line length required and choose between a
|
|
single or double stranded display.
|
|
The output looks like:
|
|
.lit
|
|
|
|
GTTAATGTAG CTTAATAACA AAGCAAAGCA CTGAAAATGC TTAGATGGAT
|
|
CAATTACATC GAATTATTGT TTCGTTTCGT GACTTTTACG AATCTACCTA
|
|
10 20 30 40 50
|
|
|
|
AATTGTATCC CATAAACACA AAGGTTTGGT CCTGGCCTTA TAATTAATTA
|
|
TTAACATAGG GTATTTGTGT TTCCAAACCA GGACCGGAAT ATTAATTAAT
|
|
60 70 80 90 100
|
|
|
|
GAGGTAAAAT TACACATGCA AACCTCCATA GACCGGTGTA AAATCCCTTA
|
|
CTCCATTTTA ATGTGTACGT TTGGAGGTAT CTGGCCACAT TTTAGGGAAT
|
|
110 120 130 140 150
|
|
|
|
AACATTTACT TAAAATTTAA GGAGAGGGTA TCAAGCACAT TAAAATAGCT
|
|
TTGTAAATGA ATTTTAAATT CCTCTCCCAT AGTTCGTGTA ATTTTATCGA
|
|
160 170 180 190 200
|
|
|
|
.end lit
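.para
A listing of this kind can be approximated with the sketch below. The
function name list_sequence is hypothetical, and the exact spacing of the
real output is an assumption: bases are grouped in blocks of ten, the second
strand is the base-by-base complement written under the first, and each
number is right-aligned under the last base of its block.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def list_sequence(seq, line_length=50, double_stranded=False):
    """Format seq in blocks of 10 with position numbers underneath."""
    if line_length % 10 or not 10 <= line_length <= 120:
        raise ValueError("line length must be 10 to 120 in multiples of 10")
    lines = []
    for start in range(0, len(seq), line_length):
        chunk = seq[start:start + line_length]
        blocks = [chunk[i:i + 10] for i in range(0, len(chunk), 10)]
        lines.append(" ".join(blocks))
        if double_stranded:
            lines.append(" ".join(b.translate(COMPLEMENT) for b in blocks))
        # One number per complete block of ten bases on this line.
        numbers = [str(start + i + 10).rjust(10)
                   for i in range(0, len(chunk) - 9, 10)]
        lines.append(" ".join(numbers))
        lines.append("")
    return "\n".join(lines)


print(list_sequence("GTTAATGTAGCTTAATAACA", line_length=20,
                    double_stranded=True))
```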
|
|
.left margin1
|
|
@6. TX 1 @ List a text file.
|
|
.LEFT MARGIN2
|
|
.para
|
|
Allows the user to have a text file displayed on the screen. It will appear
|
|
one page at a time.
|
|
.para
|
|
Supply the name of the file to be displayed.
|
|
.left margin1
|
|
@7. TX 1 @ Direct output to disk
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to direct output that would normally appear on the screen to a file.
|
|
.para
|
|
Select redirection of either text or graphics, and
|
|
supply the name of the file that the output should be written to.
|
|
.para
|
|
The results from the next options selected will not appear on the screen
|
|
but will be written to the file. When option 7 is selected again
|
|
the file will be
|
|
closed and output will again appear on the screen.
|
|
.left margin1
|
|
@8. TX 1 @ Write active region to disk
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to write the current active section of sequence to a disk file in
|
|
"Staden format".
|
|
.para
|
|
Supply a file name and an optional title.
|
|
.para
|
|
The program has the capability of reading sequences stored in several
|
|
formats and so, in conjunction with this option, can be used to reformat
|
|
them.
|
|
.left margin1
|
|
@9. TX 1 @ Edit the sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to edit sequences or any other files by giving access to the
|
|
computer's system editor. For editing sequences the input file should
|
|
have already been created using one of the listing functions such as "list
|
|
sequence", "list translation" or "list restriction sites above the
|
|
sequence".
|
|
.para
|
|
Supply the name of the file to edit. Wait while the system editor is made
|
|
ready (this can take a while on a VAX). Use the editor. Exit from the editor. If a
|
|
sequence has been edited, and you want to process it, affirm that the
|
|
sequence should be "made active". The edited sequence will replace the
|
|
original sequence.
|
|
.para
|
|
This editing method is designed to give users access to an editor with
|
|
which they are familiar - i.e. the one on their machine, and yet to allow
|
|
them to edit a sequence which contains all the landmarks they need in
|
|
order to know where they are. Users can create files containing simple
|
|
listings (single stranded) with numbering, using "list the sequence", and
|
|
then edit them with their system editor, using the numbering to know
|
|
where they are within the sequence. When the edits are complete they
|
|
exit from the editor and the program "analyses" the edited file to extract
|
|
only the sequence characters. Similarly a file containing a three phase
|
|
translation can be edited, or a file containing a sequence plus its three
|
|
phase translation, plus its restriction sites marked above the sequence.
|
|
In order to be able to "analyse" such complicated listings and correctly
|
|
extract the sequence the following simple rule is used: all lines in the
|
|
file that contain a character that is not A,C,T,G or U are deleted. It is
|
|
obviously important to be aware of this rule and its implications.
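.para
The rule can be sketched as follows. This is an illustration under stated
assumptions rather than the program's own code: the function name
extract_sequence is hypothetical, blanks within a line are ignored, and
upper and lower case are treated alike.

```python
ALLOWED = set("ACGTU")


def extract_sequence(listing_text):
    """Keep only lines made up entirely of A, C, G, T or U."""
    kept = []
    for line in listing_text.splitlines():
        symbols = "".join(line.split()).upper()  # drop blanks (assumption)
        if symbols and set(symbols) <= ALLOWED:
            kept.append(symbols)
    return "".join(kept)


listing = """\
GTTAATGTAG CTTAATAACA
        10         20
E F G L
AATTGTATCC
"""
print(extract_sequence(listing))
```

One implication is visible directly in the sketch: a translation line that
happens to contain only the amino acid letters A, C, G and T would be kept
as if it were sequence.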
|
|
.left margin1
|
|
@10. TX 2 @ Clear graphics
|
|
.LEFT MARGIN1
|
|
.para
|
|
Clears graphics from the screen.
|
|
.left margin1
|
|
@11. TX 2 @ Clear text
|
|
.LEFT MARGIN1
|
|
.para
|
|
Clears text from the screen.
|
|
.left margin1
|
|
@12. TX 2 @ Draw a ruler
|
|
.LEFT MARGIN2
|
|
.para
|
|
This option
|
|
allows the user to draw a ruler or scale along the x axis of the screen to
|
|
help identify the coordinates of points of interest. The user can define
|
|
the position of the first base to be marked (for example if the active
|
|
region is 1501 to 8000, the user might wish to mark every 1000th base
|
|
starting at either 1501 or 2000 - it depends if the user wishes to treat
|
|
the active region as an independent unit with its own numbering starting
|
|
at
|
|
its left edge, or as part of the whole sequence). The user can also define
|
|
the separation of the ticks on the scale and their height. If required the
|
|
labelling routine can be used to add numbers to the ticks.
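.para
The choice of first marked base and tick separation can be illustrated with
a short sketch. The function name ruler_ticks is hypothetical; it simply
lists the base positions that would receive a tick for the active region
used in the example above.

```python
def ruler_ticks(region_start, region_end, first_mark, separation):
    """Return the base positions marked on the ruler."""
    if separation <= 0:
        raise ValueError("separation must be positive")
    if not region_start <= first_mark <= region_end:
        raise ValueError("first mark must lie inside the active region")
    return list(range(first_mark, region_end + 1, separation))


# Active region 1501 to 8000, marking every 1000th base.
print(ruler_ticks(1501, 8000, 2000, 1000))  # numbering of the whole sequence
print(ruler_ticks(1501, 8000, 1501, 1000))  # numbering from the left edge
```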
|
|
.left margin1
|
|
@13. TX 2 @ Use crosshair
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function puts
|
|
a steerable cross on the screen that can be used to find the
|
|
coordinates of points in the sequence. The user can move the cross
|
|
around using the directional keys; when he hits the space bar the
|
|
program will print out the coordinates of the cross in sequence units and
|
|
the option will be exited.
|
|
.PARA
|
|
If instead,
|
|
you hit a , the position will be displayed but the cross will remain on
|
|
the screen.
|
|
.PARA
|
|
If a letter s is hit the program will display the sequence around the
|
|
crosshair
|
|
position, and leave the cross on the screen.
|
|
.left margin1
|
|
@14. TX 2 @ Reposition plots
|
|
.LEFT MARGIN2
|
|
.para
|
|
The position of each of the plots is defined relative to a user's drawing
|
|
board which has size 1-10,000 in x and 1-10,000 in y.
|
|
Plots for
|
|
each option are drawn in a window defined by x0,y0 and xlength,ylength.
|
|
Where x0,y0 is the position of the bottom left hand corner of the window,
|
|
and xlength is the width of the window and ylength the
|
|
height of the window.
|
|
.lit
|
|
--------------------------------------------------------- 10,000
|
|
1 1
|
|
1 -------------------------------------- ^ 1
|
|
1 1 1 1 1
|
|
1 1 1 1 1
|
|
1 1 1 ylength 1
|
|
1 1 1 1 1
|
|
1 1 1 1 1
|
|
1 -------------------------------------- v 1
|
|
1 x0,y0^ 1
|
|
1 <---------------xlength--------------> 1
|
|
--------------------------------------------------------- 1
|
|
1 10,000
|
|
|
|
.end lit
|
|
All values are in drawing board units (i.e. 1-10,000, 1-10,000).
|
|
The default window positions are read from a file "NIPMARG" when the
|
|
program is started. Users can have their own file if required.
|
|
As all the plots start
|
|
at the same position in x and have the same width, x0 and xlength are the
|
|
same for all options. Generally users will only want to change the start
|
|
level of the window y0 and its height ylength.
|
|
This option
|
|
allows users to change window positions whilst running the program.
|
|
The routine prompts first for the number of the option that the user
wishes
|
|
to reposition; then for the y start and height; then for the x start and
|
|
length. Note that changes to the x values affect all options. If the user
|
|
types only carriage return for any value it will remain unchanged.
|
|
The cross-hair can be used to choose suitable heights.
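.para
The window geometry can be modelled with a small sketch. The class name
Window and its methods are hypothetical, and the straight-line mapping from
a base position in the active region to an x coordinate is an assumption
made for illustration; the text above defines only x0, y0, xlength, ylength
and the 1-10,000 drawing board.

```python
from dataclasses import dataclass

BOARD_MAX = 10000  # the drawing board runs from 1 to 10,000 in x and in y


@dataclass
class Window:
    x0: int        # left edge, in drawing board units
    y0: int        # bottom edge, in drawing board units
    xlength: int   # width of the window
    ylength: int   # height of the window

    def fits_on_board(self):
        """True if the whole window lies on the drawing board."""
        return (self.x0 >= 1 and self.y0 >= 1
                and self.xlength > 0 and self.ylength > 0
                and self.x0 + self.xlength <= BOARD_MAX
                and self.y0 + self.ylength <= BOARD_MAX)

    def x_for_base(self, position, region_start, region_end):
        """Map a base position to x, assuming a linear scale."""
        if region_end <= region_start:
            raise ValueError("active region must contain at least two bases")
        fraction = (position - region_start) / (region_end - region_start)
        return self.x0 + fraction * self.xlength


w = Window(x0=1000, y0=4000, xlength=8000, ylength=2000)
print(w.fits_on_board(), w.x_for_base(1501, 1501, 8000))
```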
|
|
.LEFT MARGIN1
|
|
@15. TX 2 @ Label a diagram
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine allows users to label any diagrams they have produced. They
|
|
are asked to type in a label. When the user types carriage return to finish
|
|
typing the label the cross-hair appears on the screen. The user can
|
|
position it anywhere on the screen. If the user types R (for right justify)
|
|
the label will be
|
|
written on the diagram with its right end at the cross-hair position.
|
|
If the user types L (for left justify) the label will be written on the
|
|
diagram with its left end at the cross-hair position.
|
|
The
|
|
cross-hair will then immediately reappear. The user may put the same
|
|
label
|
|
on another part of the diagram as before or if he hits the space bar he
|
|
will be asked if he wishes to type in another label.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=15
|
|
Type label then drive cross hair to left or right end
|
|
of label position then hit "L" to write label left
|
|
justified or "R" to write label right justified or
|
|
the space bar to quit
|
|
|
|
|
|
? Label=delta gene
|
|
|
|
missing graphics
|
|
|
|
? Label=
|
|
|
|
.end lit
|
|
.left margin1
|
|
@16. TX 2 @Display a map
|
|
.LEFT MARGIN2
|
|
.para
|
|
This draws a map
|
|
of any sequence features selected by the user.
|
|
These features may be protein coding regions (CDS), tRNA genes (TRNA),
|
|
promoter positions (PRM), etc. Users may define their own feature table
|
|
key
|
|
names. For example I find it convenient to split CDS lines into CDS1,
|
|
CDS2
|
|
and CDS3 each of which contains only those sequences that code in the
|
|
reading frames 1, 2 or 3. Then I can plot them at different heights on
|
|
the screen (suitable heights can be determined by using the cross-hair).
|
|
.para
|
|
The coordinates must be stored in a file in the format of an EMBL or GenBank
|
|
feature table. Note that this means that the file must include either EMBL
|
|
or GenBank headers, and a suitable "tail". The simplest header is the word
|
|
FEATURES starting in column 1 of the first line of the file. The simplest
|
|
tail is 2 empty lines at the end of the file. These lines are not included
|
|
when nip writes out results in feature table format.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=16
|
|
Display a map using an EMBL feature table file
|
|
? map file name=hsegl1.ft
|
|
? feature code(e.g. CDS) =CDS
|
|
X 1 + strand
|
|
2 - strand
|
|
3 both strands
|
|
? 0,1,2,3 =
|
|
? level (0-9480) (256) =4000
|
|
|
|
missing graphics
|
|
|
|
? feature code(e.g. CDS) =
|
|
|
|
.end lit
|
|
.left margin1
|
|
@17. TX 1 @ Search for restriction enzymes
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine is used to search for short sequences, like restriction
|
|
enzyme
|
|
recognition sequences,
|
|
and can either list the results or present them graphically. Listings can
|
|
take several forms and can include the sequence and its translation.
|
|
Examples are given below. The program will also display the names of
|
|
enzymes that cut the sequence infrequently. Users can select from sets
|
|
of enzymes stored in files or can enter them from the keyboard.
|
|
.para
|
|
The short
|
|
sequences (strings) and their names need to be arranged in a particular
|
|
way. See below. Select to search, list an enzyme file or clear the screen.
|
|
Choose either a file of enzymes or to enter their recognition sequences at the
|
|
keyboard. Choose to search for all the enzymes in the list or to select
|
|
from the list. Select a mode of output. Define the sequence as circular or
|
|
linear. Select to search for "definite" or "possible" matches. The search
|
|
starts, and after the results have been displayed, further searches can be
|
|
performed.
|
|
.para
|
|
When the enzymes and their recognition sequences are stored in a file
|
|
they must be defined in the following way. We
|
|
call the recognition sequences "strings".
|
|
The format is as follows: each string or set of strings must be
|
|
preceded by a name, each string must be preceded and
|
|
terminated with a slash (/), and
|
|
each set of strings by 2 slashes.
|
|
For example
|
|
AATII/GACGT'C// defines the name AATII, its recognition sequence
|
|
GACGTC
|
|
and its cut site with the ' symbol; ACCI/GT'MKAC// defines the name
|
|
ACCI
|
|
and its recognition sequence includes IUB symbols for incompletely
|
|
defined
|
|
symbols in nucleic acid sequences;
|
|
BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
|
|
defines the name BBVI and this time two recognition sequences and cut
|
|
sites
|
|
are specified in order to correctly show the cutting position relative to
|
|
the recognition sequence. If no cut site is included the first base of the
|
|
recognition sequence is displayed as being on the 3' side of the
|
|
recognition sequence.
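.para
The file format just described can be read with a few lines of Python. This
is a sketch, not the program's parser: the function name parse_enzyme_entry
is hypothetical, and the cut site is returned as the number of bases to the
left of the ' symbol, or None when no cut site is given.

```python
def parse_enzyme_entry(entry):
    """Split 'NAME/string/string//' into (name, [(sequence, cut), ...])."""
    entry = entry.strip()
    if not entry.endswith("//"):
        raise ValueError("each set of strings must end with two slashes")
    name, *strings = entry[:-2].split("/")
    if not name or not strings:
        raise ValueError("expected a name followed by at least one string")
    sites = []
    for string in strings:
        cut = string.find("'")
        sequence = string.replace("'", "")
        sites.append((sequence, cut if cut >= 0 else None))
    return name, sites


for line in ("AATII/GACGT'C//",
             "ACCI/GT'MKAC//",
             "BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//"):
    print(parse_enzyme_entry(line))
```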
|
|
.para
|
|
These collections of strings and their
|
|
names can be read from disk or entered from the keyboard.
|
|
When names and strings are entered from the keyboard the program will ask
|
|
for the name and then the string(s). If more than one string is typed per
|
|
name they must be separated by slash (/) characters. See the "Typical
|
|
dialogue" below.
|
|
Three files
|
|
containing restriction enzyme recognition sequences are currently
|
|
available. The "all enzymes" file contains the Rich Roberts REBASE
|
|
restriction enzyme database, which is updated monthly.
|
|
.para
|
|
The user can select strings
|
|
by name from these collections. If so the program will prompt for the
|
|
names, one at a time. The user can continue to select names until a blank
|
|
name is entered (by the user typing only return).
|
|
.para
|
|
Listed output can be displayed in several ways: it
|
|
can be ordered enzyme by enzyme, or on cut positions, or with enzyme
|
|
names
|
|
written above a listing of the sequence. This last listing can also include
|
|
a three phase translation of the sequence. In addition the program will
|
|
display only infrequent cutters (the user defines the minimum number of
|
|
cuts), or can plot the positions of matches.
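.para
A three phase translation can be sketched as below. The names CODON_TABLE
and translate are hypothetical, the standard genetic code is assumed, and
codons containing any symbol other than A, C, G, T or U are shown as ?,
which is a choice made for this illustration. The test sequence is the
first 60 bases of the example listing in this section.

```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")

# Standard genetic code, built from the conventional TCAG ordering.
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq, frame):
    """Translate seq starting at offset frame (0, 1 or 2)."""
    seq = seq.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(seq[i:i + 3], "?")
                   for i in range(frame, len(seq) - 2, 3))


seq = "GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC"
for frame in range(3):
    print(" " * frame + "  ".join(translate(seq, frame)))
```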
|
|
.para
|
|
Listings sorted "enzyme by enzyme" have the following form:
|
|
.lit
|
|
|
|
Matches found= 1
|
|
Name Sequence Position Fragment lengths
|
|
1 AATII GACGT'C 112 111 111
|
|
912 912
|
|
Matches found= 2
|
|
Name Sequence Position Fragment lengths
|
|
1 ACCI GT'CGAC 112 111 111
|
|
2 ACCI GT'AGAC 420 308 308
|
|
604 604
|
|
Matches found= 2
|
|
Name Sequence Position Fragment lengths
|
|
1 AHAII GA'CGTC 109 108 90
|
|
2 AHAII GG'CGTC 199 90 108
|
|
825 825
|
|
Matches found= 2
|
|
Name Sequence Position Fragment lengths
|
|
1 AVAII G'GACC 84 83 51
|
|
2 AVAII G'GTCC 973 889 83
|
|
51 889
|
|
Matches found= 1
|
|
Name Sequence Position Fragment lengths
|
|
1 BALI TGG'CCA 258 257 257
|
|
766 766
|
|
Matches found= 1
|
|
Name Sequence Position Fragment lengths
|
|
1 BAMHI G'GATCC 92 91 91
|
|
|
|
...... etc
|
|
|
|
Listings sorted on cut position have the following form:
|
|
|
|
Searching
|
|
Name Sequence Position Fragment lengths
|
|
1 ECORI G'AATTC 2 1
|
|
2 BANI G'GTGCC 26 24
|
|
3 BSP1286 GTGCC'C 31 5
|
|
4 BBVI 'TACTGCGCCGCAGCTGC 38 7
|
|
5 NSPBII CAG'CTG 51 13
|
|
6 PVUII CAG'CTG 51 0
|
|
7 BBVI GCAGCTGCTGGTG' 60 9
|
|
8 HINCII GTC'AAC 80 20
|
|
9 AVAII G'GACC 84 4
|
|
10 BINI 'CCAGGGATCC 87 3
|
|
11 BSTNI CC'AGG 89 2
|
|
12 BAMHI G'GATCC 92 3
|
|
13 XHOII G'GATCC 92 0
|
|
14 NSPBII CCG'CTG 98 6
|
|
15 BINI GGATCCGCT' 100 2
|
|
16 AHAII GA'CGTC 109 9
|
|
17 SALI G'TCGAC 111 2
|
|
18 AATII GACGT'C 112 1
|
|
19 ACCI GT'CGAC 112 0
|
|
20 HINCII GTC'GAC 113 1
|
|
21 BBVI GCAGCGACTGATT' 166 53
|
|
22 BINI 'ACTCAGATCC 178 12
|
|
23 XHOII A'GATCC 183 5
|
|
24 HGAI 'GGCGGCGGAGGCGTC 188 5
|
|
|
|
.....etc
|
|
|
|
Lists of infrequent cutters have the following form:
|
|
|
|
0 AFLII
|
|
0 AFLIII
|
|
0 APAI
|
|
0 APALI
|
|
0 ASUII
|
|
0 AVAI
|
|
0 AVRII
|
|
0 BCLI
|
|
0 BGLI
|
|
0 BGLII
|
|
0 BSMI
|
|
0 BSPMII
|
|
0 BSTEII
|
|
...... etc
|
|
|
|
Listings showing names above the sequence, and a translation have the
|
|
following form:
|
|
|
|
|
|
ECORI BANI BSP1286
|
|
. . . BBVI NSPBII
|
|
. . . . PVUII BBVI
|
|
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
|
|
10 20 30 40 50 60
|
|
E F G L G L V * G A Q R L L R R S C W C
|
|
N S V W A W C E V P R D Y C A A A A G A
|
|
I R F G L G V R C P E I T A P Q L L V L
|
|
|
|
HINCII
|
|
. AVAII
|
|
. . BINI
|
|
. . . BSTNI
|
|
. . . . BAMHI
|
|
. . . . XHOII NSPBII
|
|
. . . . . . BINI AHAII
|
|
. . . . . . . . SALI
|
|
. . . . . . . . .AATII
|
|
. . . . . . . . .ACCI
|
|
. . . . . . . . ..HINCII
|
|
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
|
|
70 80 90 100 110 120
|
|
W R C G G R Q R T Q G S A G R G R R R R
|
|
G G A E V V N G P R D P L D E D V D D E
|
|
A V R R S S T D P G I R W T R T S T T R
|
|
|
|
BBVI BINI
|
|
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
|
|
130 140 150 160 170 180
|
|
G G G G * R I A G G W Q R L I * V L T T
|
|
E E V D S A L L V A G S D * F E F * P L
|
|
R R W I A H C W W L A A T D L S S D H S
|
|
|
|
XHOII
|
|
. HGAI AHAII PFIMI
|
|
. . . . BBVI
|
|
CAGATCCGGCGGCGGAGGCGTCGAGGCTCCCGAAACTCCCAGTGGCTGGCCTGCTAGATT
|
|
190 200 210 220 230 240
|
|
Q I R R R R R R G S R N S Q W L A C * I
|
|
R S G G G G V E A P E T P S G W P A R F
|
|
D P A A E A S R L P K L P V A G L L D S
|
|
|
|
.........etc
|
|
|
|
.end lit
|
|
.para
|
|
The terms "possible" and "definite" matches are important only for back
|
|
translations of protein into DNA, which include IUB redundancy codes.
|
|
The matches that the program terms "definite matches" are ones in
|
|
which the specification of the recognition sequence corresponds
|
|
exactly to that of the back translation, and consequently are definitely in
|
|
the DNA sequence. The program will also find what it
|
|
terms 'possible matches' which are ones that depend on the particular
|
|
codons
|
|
chosen for each amino acid.
|
|
These are sites at which recognition
|
|
sequences could be engineered to produce a cut in the DNA
|
|
without changing the amino
|
|
acid, but which are not
|
|
necessarily found in the original sequence.
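.para
One way to read the distinction is sketched below; it is an interpretation
of the text above, not the program's algorithm. Each IUB symbol is treated
as a set of bases. A match is called definite when, at every position, all
bases allowed by the back translation are accepted by the recognition
sequence, and possible when at least one allowed base is accepted at every
position. The function name classify_match and the string GGNCC used in the
example are illustrative only.

```python
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
IUB_SETS = {symbol: set(bases) for symbol, bases in IUB.items()}


def classify_match(back_translation, recognition):
    """Return 'definite', 'possible' or None for two equal-length strings."""
    if len(back_translation) != len(recognition):
        raise ValueError("strings must have the same length")
    definite = True
    for s, r in zip(back_translation.upper(), recognition.upper()):
        allowed, accepted = IUB_SETS[s], IUB_SETS[r]
        if not allowed & accepted:
            return None            # no choice of codon can give a match
        if not allowed <= accepted:
            definite = False       # a match depends on the codon chosen
    return "definite" if definite else "possible"


# Gly-Pro back translates to GGNCCN.
print(classify_match("GGNCCN", "GGGCCC"))  # the APAI recognition sequence
print(classify_match("GGNCC", "GGNCC"))    # illustrative string GGNCC
print(classify_match("GGNCCN", "GAATTC"))
```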
|
|
.para
|
|
The routine will handle both linear and circular sequences, and
|
|
so finds cutsites spanning the "ends" of circular sequences.
|
|
The program will only find cutsites spanning the
|
|
ends of sequences if the sequence is declared as circular.
|
|
This includes sites for
|
|
recognition sequences containing leading or trailing N symbols, in which
|
|
the actual recognition sequence does not span the join. For example if the
|
|
recognition sequence was 'NNNNACGT and the first 4 characters in the
|
|
sequence were ACGT, then the match would only be found if the sequence
|
|
was
|
|
declared as circular. If the sequence is linear then the first fragment
|
|
starts at base number 1, and the last ends at the last base. If the
|
|
sequence is circular then the length of the first fragment is the
|
|
clockwise
|
|
distance from the last cut to the first.
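.para
The fragment length rule can be checked against the example listings with a
short sketch. The function name fragment_lengths is hypothetical, and the
sequence length of 1023 is inferred from the listings (for example 111 + 912
for AATII) rather than stated in the text.

```python
def fragment_lengths(cut_positions, seq_length, circular=False):
    """Return fragment lengths in order along the sequence.

    Linear: the first fragment starts at base 1 and the last ends at the
    last base. Circular: the first fragment runs clockwise from the last
    cut, through the join, to the first cut.
    """
    cuts = sorted(cut_positions)
    if not cuts:
        return [seq_length]
    inner = [b - a for a, b in zip(cuts, cuts[1:])]
    first = cuts[0] - 1
    last = seq_length - cuts[-1] + 1
    if circular:
        return [last + first] + inner
    return [first] + inner + [last]


accI = fragment_lengths([112, 420], 1023)
ahaII = fragment_lengths([109, 199], 1023)
print(accI, sorted(accI))
print(ahaII, sorted(ahaII))
print(fragment_lengths([112, 420], 1023, circular=True))
```

The second column of fragment lengths in the listings corresponds to the
sorted list printed by this sketch.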
|
|
.para
|
|
Graphical output marks the position of each string by a
|
|
short vertical line and gives the name of the enzyme at the left end of
|
|
the
|
|
line. If the top of the screen is reached the program gives the user the
|
|
opportunity to take a hard copy and then will clear the screen and restart
|
|
plotting results at the original start position.
|
|
.para
|
|
Below is an edited piece of dialogue from use of the search option:
|
|
.lit
|
|
? Menu or option number=17
|
|
|
|
Search for restriction enzyme sites
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 = 2
|
|
|
|
1 All enzymes
|
|
X 2 Six cutters
|
|
3 Four cutters
|
|
4 Personal file
|
|
5 Keyboard
|
|
? 0,1,2,3,4,5 =
|
|
|
|
AATII/GACGT'C//
|
|
ACCI/GT'MKAC//
|
|
AFLII/C'TTAAG//
|
|
AFLIII/A'CRYGT//
|
|
AHAII/GR'CGYC//
|
|
APAI/GGGCC'C//
|
|
APALI/G'TGCAC//
|
|
ASUII/TT'CGAA//
|
|
AVAI/C'YCGRG//
|
|
AVAII/G'GWCC//
|
|
AVRII/C'CTAGG//
|
|
BALI/TGG'CCA//
|
|
BAMHI/G'GATCC//
|
|
BANI/G'GYRCC//
|
|
BANII/GRGCY'C//
|
|
BBVI/GCAGCNNNNNNNN'/'NNNNNNNNNNNNGCTGC//
|
|
BCLI/T'GATCA//
|
|
BGLI/GCCNNNN'NGGC//
|
|
BGLII/A'GATCT//
|
|
BINI/GGATCNNNN'/'NNNNNGATCC//
|
|
BSMI/GAATGCN'/NG'CATTC//
|
|
BSP1286/GDGCH'C//
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
1 All enzymes
|
|
X 2 Six cutters
|
|
3 Four cutters
|
|
4 Personal file
|
|
5 Keyboard
|
|
? 0,1,2,3,4,5 =
|
|
? (y/n) (y) Search for all names
|
|
X 1 Order results enzyme by enzyme
|
|
2 Order results by position
|
|
3 Show only infrequent cutters
|
|
4 Show names above the sequence
|
|
? 0,1,2,3,4 =
|
|
? (y/n) (y) List matches
|
|
? (y/n) (y) The sequence is linear
|
|
? (y/n) (y) Search for definite matches
|
|
|
|
Searching
|
|
Matches found= 1
|
|
Name Sequence Position Fragment lengths
|
|
1 AATII GACGT'C 112 111 111
|
|
912 912
|
|
Matches found= 2
|
|
Name Sequence Position Fragment lengths
|
|
1 ACCI GT'CGAC 112 111 111
|
|
2 ACCI GT'AGAC 420 308 308
|
|
604 604
|
|
Matches found= 2
|
|
Name Sequence Position Fragment lengths
|
|
1 AHAII GA'CGTC 109 108 90
|
|
2 AHAII GG'CGTC 199 90 108
|
|
825 825
|
|
Matches found= 2
|
|
Name Sequence Position Fragment lengths
|
|
1 AVAII G'GACC 84 83 51
|
|
2 AVAII G'GTCC 973 889 83
|
|
51 889
|
|
Matches found= 1
|
|
Name Sequence Position Fragment lengths
|
|
1 BALI TGG'CCA 258 257 257
|
|
766 766
|
|
Matches found= 1
|
|
Name Sequence Position Fragment lengths
|
|
1 BAMHI G'GATCC 92 91 91
|
|
932 932
|
|
Matches found= 1
|
|
Name Sequence Position Fragment lengths
|
|
1 BANI G'GTGCC 26 25 25
|
|
998 998
|
|
Matches found= 1
|
|
Name Sequence Position Fragment lengths
|
|
1 BANII GAGCC'C 490 489 489
|
|
534 534
|
|
Matches found= 11
|
|
Name Sequence Position Fragment lengths
|
|
1 BBVI 'TACTGCGCCGCAGCTGC 38 37 3
|
|
2 BBVI GCAGCTGCTGGTG' 60 22 22
|
|
3 BBVI GCAGCGACTGATT' 166 106 28
|
|
4 BBVI 'CCTGCTAGATTCGCTGC 230 64 37
|
|
5 BBVI GCAGCGGTACGTA' 452 222 50
|
|
6 BBVI 'CTCGCCAACGTTGCTGC 502 50 55
|
|
7 BBVI GCAGCCTTCAACT' 606 104 64
|
|
8 BBVI 'GAGGTATTCCTGGCTGC 634 28 97
|
|
9 BBVI 'CTGGCCGCCGCCGCTGC 869 235 104
|
|
10 BBVI 'GCCGCCGCCGCTGCTGC 872 3 106
|
|
11 BBVI GCAGCGATGAGGA' 927 55 222
|
|
|
|
....etc
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
|
|
1 All enzymes
|
|
X 2 Six cutters
|
|
3 Four cutters
|
|
4 Personal file
|
|
5 Keyboard
|
|
? 0,1,2,3,4,5 =
|
|
|
|
? (y/n) (y) Search for all names
|
|
|
|
X 1 Order results enzyme by enzyme
|
|
2 Order results by position
|
|
3 Show only infrequent cutters
|
|
4 Show names above the sequence
|
|
? 0,1,2,3,4 = 2
|
|
|
|
? (y/n) (y) List matches
|
|
? (y/n) (y) The sequence is linear
|
|
? (y/n) (y) Search for definite matches
|
|
|
|
Searching
|
|
Name Sequence Position Fragment lengths
|
|
1 ECORI G'AATTC 2 1
|
|
2 BANI G'GTGCC 26 24
|
|
3 BSP1286 GTGCC'C 31 5
|
|
4 BBVI 'TACTGCGCCGCAGCTGC 38 7
|
|
5 NSPBII CAG'CTG 51 13
|
|
6 PVUII CAG'CTG 51 0
|
|
7 BBVI GCAGCTGCTGGTG' 60 9
|
|
8 HINCII GTC'AAC 80 20
|
|
9 AVAII G'GACC 84 4
|
|
10 BINI 'CCAGGGATCC 87 3
|
|
11 BSTNI CC'AGG 89 2
|
|
12 BAMHI G'GATCC 92 3
|
|
13 XHOII G'GATCC 92 0
|
|
14 NSPBII CCG'CTG 98 6
|
|
15 BINI GGATCCGCT' 100 2
|
|
16 AHAII GA'CGTC 109 9
|
|
17 SALI G'TCGAC 111 2
|
|
18 AATII GACGT'C 112 1
|
|
19 ACCI GT'CGAC 112 0
|
|
20 HINCII GTC'GAC 113 1
|
|
|
|
.....etc
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
|
|
1 All enzymes
|
|
X 2 Six cutters
|
|
3 Four cutters
|
|
4 Personal file
|
|
5 Keyboard
|
|
? 0,1,2,3,4,5 =
|
|
|
|
? (y/n) (y) Search for all names
|
|
|
|
1 Order results enzyme by enzyme
|
|
X 2 Order results by position
|
|
3 Show only infrequent cutters
|
|
4 Show names above the sequence
|
|
? 0,1,2,3,4 =3
|
|
? Maximum number of cuts (0-100) (0) =
|
|
? (y/n) (y) The sequence is linear
|
|
? (y/n) (y) Search for definite matches
|
|
|
|
Searching
|
|
0 AFLII
|
|
0 AFLIII
|
|
0 APAI
|
|
0 APALI
|
|
0 ASUII
|
|
0 AVAI
|
|
0 AVRII
|
|
0 BCLI
|
|
0 BGLI
|
|
0 BGLII
|
|
0 BSMI
|
|
0 BSPMII
|
|
0 BSTEII
|
|
0 CLAI
|
|
0 DRAI
|
|
0 DRAII
|
|
0 ECOB
|
|
0 ECOK
|
|
0 ECORV
|
|
0 ESPI
|
|
|
|
......etc
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
|
|
1 All enzymes
|
|
X 2 Six cutters
|
|
3 Four cutters
|
|
4 Personal file
|
|
5 Keyboard
|
|
? 0,1,2,3,4,5 =
|
|
|
|
? (y/n) (y) Search for all names
|
|
|
|
1 Order results enzyme by enzyme
|
|
2 Order results by position
|
|
X 3 Show only infrequent cutters
|
|
4 Show names above the sequence
|
|
? 0,1,2,3,4 =4
|
|
? (y/n) (y) Hide translation n
|
|
? (y/n) (y) Use 1 letter codes
|
|
? Line length (30-90) (60) =
|
|
? (y/n) (y) The sequence is linear
|
|
? (y/n) (y) Search for definite matches
|
|
|
|
Searching
|
|
ECORI BANI BSP1286
|
|
. . . BBVI NSPBII
|
|
. . . . PVUII BBVI
|
|
GAATTCGGTTTGGGCTTGGTGTGAGGTGCCCAGAGATTACTGCGCCGCAGCTGCTGGTGC
|
|
10 20 30 40 50 60
|
|
E F G L G L V * G A Q R L L R R S C W C
|
|
N S V W A W C E V P R D Y C A A A A G A
|
|
I R F G L G V R C P E I T A P Q L L V L
|
|
|
|
HINCII
|
|
. AVAII
|
|
. . BINI
|
|
. . . BSTNI
|
|
. . . . BAMHI
|
|
. . . . XHOII NSPBII
|
|
. . . . . . BINI AHAII
|
|
. . . . . . . . SALI
|
|
. . . . . . . . .AATII
|
|
. . . . . . . . .ACCI
|
|
. . . . . . . . ..HINCII
|
|
TGGCGGTGCGGAGGTCGTCAACGGACCCAGGGATCCGCTGGACGAGGACGTCGACGACGA
|
|
70 80 90 100 110 120
|
|
W R C G G R Q R T Q G S A G R G R R R R
|
|
G G A E V V N G P R D P L D E D V D D E
|
|
A V R R S S T D P G I R W T R T S T T R
|
|
|
|
BBVI BINI
|
|
GGAGGAGGTGGATAGCGCATTGCTGGTGGCTGGCAGCGACTGATTTGAGTTCTGACCACT
|
|
130 140 150 160 170 180
|
|
G G G G * R I A G G W Q R L I * V L T T
|
|
E E V D S A L L V A G S D * F E F * P L
|
|
R R W I A H C W W L A A T D L S S D H S
|
|
|
|
.......etc
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
|
|
1 All enzymes
|
|
X 2 Six cutters
|
|
3 Four cutters
|
|
4 Personal file
|
|
5 Keyboard
|
|
? 0,1,2,3,4,5 =5
|
|
Define search strings by typing a string name
|
|
followed by the string(s)
|
|
? Name=FRED
|
|
? String(s)=AAAAAA/TTTTTT
|
|
? Name=MARY
|
|
? String(s)=CCCC/GGGG/GCGCT
|
|
? Name=
|
|
? (y/n) (y) Search for all names
|
|
X 1 Order results enzyme by enzyme
|
|
2 Order results by position
|
|
3 Show only infrequent cutters
|
|
4 Show names above the sequence
|
|
? 0,1,2,3,4 =
|
|
? (y/n) (y) List matches
|
|
? (y/n) (y) The sequence is linear
|
|
? (y/n) (y) Search for definite matches
|
|
Searching
|
|
Matches found= 9
|
|
Name Sequence Position Fragment lengths
|
|
1 FRED 'TTTTTT 1557 1556 1
|
|
2 FRED 'TTTTTT 1558 1 1
|
|
3 FRED 'TTTTTT 1559 1 1
|
|
4 FRED 'TTTTTT 1560 1 22
|
|
5 FRED 'AAAAAA 1582 22 529
|
|
6 FRED 'AAAAAA 3160 1578 1019
|
|
7 FRED 'AAAAAA 4204 1044 1044
|
|
8 FRED 'AAAAAA 5691 1487 1487
|
|
9 FRED 'AAAAAA 6710 1019 1556
|
|
529 1578
|
|
Matches found= 36
|
|
Name Sequence Position Fragment lengths
|
|
1 MARY 'CCCC 47 46 1
|
|
2 MARY 'GGGG 486 439 1
|
|
3 MARY 'GGGG 487 1 1
|
|
4 MARY 'CCCC 557 70 1
|
|
5 MARY 'CCCC 558 1 1
|
|
6 MARY 'GCGCT 1177 619 1
|
|
|
|
... etc
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =
|
|
1 All enzymes
|
|
X 2 Six cutters
|
|
3 Four cutters
|
|
4 Personal file
|
|
5 Keyboard
|
|
? 0,1,2,3,4,5 =5
|
|
Define search strings by typing a string name
|
|
followed by the string(s)
|
|
? Name=JANE
|
|
? String(s)=A'TTTT/CC'GGG
|
|
? Name=
|
|
? (y/n) (y) Search for all names
|
|
X 1 Order results enzyme by enzyme
|
|
2 Order results by position
|
|
3 Show only infrequent cutters
|
|
4 Show names above the sequence
|
|
? 0,1,2,3,4 =
|
|
? (y/n) (y) List matches
|
|
? (y/n) (y) The sequence is linear
|
|
? (y/n) (y) Search for definite matches
|
|
Searching
|
|
Matches found= 30
|
|
Name Sequence Position Fragment lengths
|
|
1 JANE A'TTTT 437 436 6
|
|
2 JANE A'TTTT 546 109 33
|
|
3 JANE A'TTTT 597 51 43
|
|
4 JANE A'TTTT 777 180 51
|
|
5 JANE A'TTTT 1274 497 60
|
|
6 JANE A'TTTT 1571 297 62
|
|
7 JANE CC'GGG 1926 355 75
|
|
8 JANE A'TTTT 2403 477 81
|
|
9 JANE A'TTTT 2586 183 82
|
|
10 JANE A'TTTT 2731 145 101
|
|
11 JANE A'TTTT 2812 81 103
|
|
|
|
... etc
|
|
|
|
|
|
X 1 Search
|
|
2 List enzyme file
|
|
3 Clear text
|
|
4 Clear graphics
|
|
? 0,1,2,3,4 =!
|
|
.end lit
|
|
|
|
.left margin1
|
|
@18. TX 1 7 @ Compare a short sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine slides a short sequence along the current sequence and finds
|
|
all positions at which a given percentage of the bases match.
|
|
Output is in both graphical and listed forms.
|
|
.para
|
|
If users call for dialogue when the routine is selected they will be given
|
|
the choice of keyboard or file input. Define the string, select the "sense"
|
|
to use and the percentage match. Matches will be plotted out and then the
|
|
user can select to have them listed. Then the routine cycles around.
|
|
.para
|
|
The routine slides the search string
|
|
along the sequence and marks the positions at which a minimum
|
|
percentage score is reached. The graphical output draws a vertical line at
|
|
the match position; the height of the line represents the percentage
|
|
score,
|
|
so that if the line reaches the top of the box the score is 100%.
|
|
The NC-IUB symbols may be used in the search sequence to encode
|
|
uncertain
|
|
characters. Any other symbols will not match.
|
|
.LIT
|
|
|
|
|
|
NC-IUB SYMBOLS
|
|
|
|
A,C,G,T
|
|
R (A,G) 'puRine'
|
|
Y (T,C) 'pYrimidine'
|
|
W (A,T) 'Weak'
|
|
S (C,G) 'Strong'
|
|
M (A,C) 'aMino'
|
|
K (G,T) 'Keto'
|
|
H (A,T,C) 'not G'
|
|
B (G,C,T) 'not A'
|
|
V (G,A,C) 'not T'
|
|
D (G,A,T) 'not C'
|
|
N (G,A,C,T) 'aNy'
|
|
|
|
Typical dialogue is shown below.
|
|
|
|
|
|
? Menu or option number=18
|
|
Find percentage matches
|
|
? (y/n) (y) Keep picture
|
|
? String=AAATTTCCC
|
|
STRING=AAATTTCCC
|
|
? (y/n) (y) This sense
|
|
? Percent match (1.00-100.00) (70.00) =
|
|
|
|
Missing graphics display here
|
|
|
|
Total scoring positions above 70.000 percent = 7
|
|
Scores 7 6 6 6 6 6 6
|
|
Positions 365 212 213 292 311 358 627
|
|
? Display (0-7) (0) =3
|
|
|
|
365
|
|
ACATTTCGC
|
|
* ***** *
|
|
AAATTTCCC
|
|
1
|
|
|
|
212
|
|
GAAACTCCC
|
|
** ****
|
|
AAATTTCCC
|
|
1
|
|
|
|
213
|
|
AAACTCCCA
|
|
*** * **
|
|
AAATTTCCC
|
|
1
|
|
? (y/n) (y) Keep picture
|
|
Default String=AAATTTCCC
|
|
? String=
|
|
STRING=AAATTTCCC
|
|
? (y/n) (y) This sense n
|
|
STRING=GGGAAATTT
|
|
? Percent match (1.00-100.00) (70.00) =
|
|
|
|
Missing graphics display here
|
|
|
|
Total scoring positions above 70.000 percent = 7
|
|
Scores 6 6 6 6 6 6 6
|
|
Positions 269 270 271 288 354 624 853
|
|
? Display (0-7) (0) =3
|
|
|
|
269
|
|
GAGGGATTT
|
|
* * ****
|
|
GGGAAATTT
|
|
1
|
|
|
|
270
|
|
AGGGATTTT
|
|
** * ***
|
|
GGGAAATTT
|
|
1
|
|
|
|
271
|
|
GGGATTTTC
|
|
**** **
|
|
GGGAAATTT
|
|
1
|
|
? (y/n) (y) Keep picture !
|
|
|
|
.end lit
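.para
The percentage search described above can be sketched as follows. The
names are hypothetical and this is not the program's own code. The
sample dialogue reports scores of 6 out of 9 at a 70 percent threshold,
which suggests the required number of matching bases is rounded down;
the sketch assumes that. It also assumes that only the search string,
not the sequence, contains NC-IUB symbols.
.lit
```python
# Bases accepted by each NC-IUB symbol in the search string.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "TC", "W": "AT", "S": "CG", "M": "AC", "K": "GT",
    "H": "ATC", "B": "GCT", "V": "GAC", "D": "GAT", "N": "GACT",
}
# Complement of each symbol, used for the other "sense".
COMPLEMENT = dict(zip("ACGTRYWSMKHBVDN", "TGCAYRWSKMDVBHN"))


def reverse_complement(probe):
    return "".join(COMPLEMENT[p] for p in reversed(probe))


def percent_matches(seq, probe, percent=70.0):
    """Return (position, score) for each window reaching the threshold."""
    n = len(probe)
    need = int(n * percent / 100.0)  # assumption: rounded down
    hits = []
    for i in range(len(seq) - n + 1):
        window = seq[i:i + n]
        # Symbols outside the table match nothing.
        score = sum(1 for p, s in zip(probe, window) if s in IUB.get(p, ""))
        if score >= need:
            hits.append((i + 1, score))  # positions are numbered from 1
    return hits


hits = percent_matches("GGACATTTCGCGG", "AAATTTCCC", 70.0)
other_sense = reverse_complement("AAATTTCCC")
print(hits, other_sense)
```
.end lit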
|
|
.left margin1
|
|
@19. TX 7 @ Compare a short sequence using a score matrix
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine slides a short sequence along the current sequence and finds
|
|
all positions at which a given level of similarity (a cutoff score) is
|
|
reached. The score is defined by use of a score matrix. Output is in both
|
|
graphical and listed forms.
|
|
.para
|
|
If users call for dialogue when the routine is selected they will be given
|
|
the choice of keyboard or file input. Define the string, select the "sense"
|
|
to use and the cutoff score. Matches will be plotted out and then the user
|
|
can select to have them listed. Then the routine cycles around.
|
|
.para
|
|
The routine slides the search string
|
|
along the sequence and marks the positions at which the cutoff score
|
|
is achieved. The graphical output draws a vertical line at
|
|
the match position; the height of the line represents the score,
|
|
so that if the line reaches the top of the box the score is the maximum
|
|
possible.
|
|
The NC-IUB symbols may be used in the search sequence to encode
|
|
uncertain
|
|
characters.
|
|
.para
|
|
The score matrix reflects the level of
|
|
redundancy in the probe sequence and hence will put more emphasis on
|
|
those
|
|
characters that are better defined. The score matrix is:
|
|
.lit
|
|
DNA SCORE MATRIX USING IUB SYMBOLS
|
|
|
|
T C A G - R Y W S M K H B V D N ?
|
|
|
|
T 36 0 0 0 9 0 18 18 0 0 18 12 12 0 12 9 0
|
|
C 0 36 0 0 9 0 18 0 18 18 0 12 12 12 0 9 0
|
|
A 0 0 36 0 9 18 0 18 0 18 0 12 0 12 12 9 0
|
|
G 0 0 0 36 9 18 0 0 18 0 18 0 12 12 12 9 0
|
|
- 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
|
|
R 0 0 18 18 18 36 0 9 9 9 9 6 6 12 12 18 0
|
|
Y 18 18 0 0 18 0 36 9 9 9 9 12 12 6 6 18 0
|
|
W 18 0 18 0 18 9 9 36 0 9 9 12 6 6 12 18 0
|
|
S 0 18 0 18 18 9 9 0 36 9 9 6 12 12 6 18 0
|
|
M 0 18 18 0 18 9 9 9 9 36 0 12 6 12 6 18 0
|
|
K 18 0 0 18 18 9 9 9 9 0 36 6 12 6 12 18 0
|
|
H 12 12 12 0 27 6 12 12 6 12 6 36 8 8 8 27 0
|
|
B 12 12 0 12 27 6 12 6 12 6 12 8 36 8 8 27 0
|
|
V 0 12 12 12 27 12 6 6 12 12 6 8 8 36 8 27 0
|
|
D 12 0 12 12 27 12 6 12 6 6 12 8 8 8 36 27 0
|
|
N 9 9 9 9 36 18 18 18 18 18 18 27 27 27 27 36 0
|
|
? 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
|
|
|
|
? is any unrecognised character.
|
|
|
|
Typical dialogue is shown below.
|
|
|
|
? Menu or option number=19
|
|
Find matches using a score matrix
|
|
? (y/n) (y) Keep picture
|
|
? String=AAATTTCCC
|
|
STRING=AAATTTCCC
|
|
? (y/n) (y) This sense
|
|
Minimum score= 0 Maximum score= 324
|
|
? Score (0-324) (280) =250
|
|
|
|
Missing graphics display here
|
|
|
|
For score 250 the number of matches= 1
|
|
Scores 252
|
|
Positions 365
|
|
? Display (0-1) (0) =1
|
|
|
|
365
|
|
ACATTTCGC
|
|
* ***** *
|
|
AAATTTCCC
|
|
1
|
|
? (y/n) (y) Keep picture
|
|
Default String=AAATTTCCC
|
|
? String=
|
|
STRING=AAATTTCCC
|
|
? (y/n) (y) This sense n
|
|
STRING=GGGAAATTT
|
|
Minimum score= 0 Maximum score= 324
|
|
? Score (0-324) (222) = 200
|
|
|
|
Missing graphics display here
|
|
|
|
For score 200 the number of matches= 7
|
|
Scores 216 216 216 216 216 216 216
|
|
Positions 269 270 271 288 354 624 853
|
|
? Display (0-7) (0) =3
|
|
|
|
269
|
|
GAGGGATTT
|
|
* * ****
|
|
GGGAAATTT
|
|
1
|
|
|
|
270
|
|
AGGGATTTT
|
|
** * ***
|
|
GGGAAATTT
|
|
1
|
|
|
|
271
|
|
GGGATTTTC
|
|
**** **
|
|
GGGAAATTT
|
|
1
|
|
? (y/n) (y) Keep picture !
|
|
|
|
.end lit
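.para
Scoring with the matrix can be sketched as follows. The names are
hypothetical and this is not the program's own code. Only the T, C, A
and G columns of the matrix listed above are copied, on the assumption
that the sequence being searched contains only those four symbols; any
other symbol scores zero, as in the ? row and column.
.lit
```python
# Scores against T, C, A, G for each symbol in the search string,
# copied from the first four columns of the matrix above.
SCORES = {
    "T": (36, 0, 0, 0),   "C": (0, 36, 0, 0),
    "A": (0, 0, 36, 0),   "G": (0, 0, 0, 36),
    "-": (9, 9, 9, 9),    "N": (9, 9, 9, 9),
    "R": (0, 0, 18, 18),  "Y": (18, 18, 0, 0),
    "W": (18, 0, 18, 0),  "S": (0, 18, 0, 18),
    "M": (0, 18, 18, 0),  "K": (18, 0, 0, 18),
    "H": (12, 12, 12, 0), "B": (12, 12, 0, 12),
    "V": (0, 12, 12, 12), "D": (12, 0, 12, 12),
}
COLUMN = {"T": 0, "C": 1, "A": 2, "G": 3}


def matrix_score(probe, window):
    """Sum the matrix entries for each aligned pair of symbols."""
    total = 0
    for p, s in zip(probe, window):
        row = SCORES.get(p)
        col = COLUMN.get(s)
        if row is not None and col is not None:
            total += row[col]
    return total


def matrix_matches(seq, probe, cutoff):
    """Return (position, score) for each window reaching the cutoff."""
    n = len(probe)
    hits = []
    for i in range(len(seq) - n + 1):
        score = matrix_score(probe, seq[i:i + n])
        if score >= cutoff:
            hits.append((i + 1, score))
    return hits


best = matrix_score("AAATTTCCC", "AAATTTCCC")
hits = matrix_matches("GGACATTTCGCGG", "AAATTTCCC", 250)
print(best, hits)
```
.end lit
.para
For the window ACATTTCGC shown in the dialogue, seven of the nine bases
match exactly, giving 7 x 36 = 252, the score the program reports.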
|
|
.left margin1
|
|
@20. TX 7 @ Search for a motif using a weight matrix
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function performs searches for short sequence
|
|
motifs using an appropriate weight matrix. In addition it can be used to
|
|
create or modify weight matrices. In order to perform a search the only
|
|
input
|
|
required is the name of the file containing the weight matrix.
|
|
The results can be presented graphically or listed. The graphical
|
|
presentation will draw a line at the position of any match found; the
|
|
height of the line is proportional to the score.
|
|
.para
|
|
For a search, select "use weight matrix", supply the name of the file
|
|
containing the weight matrix, and choose between having results plotted
|
|
or listed. If dialogue is requested when the function is selected users can
|
|
alter the cutoff score employed.
|
|
.para
|
|
To create a weight matrix several steps are involved. A file containing an
|
|
alignment of known motifs is required. (This file must be created before
|
|
the current option is selected. The format is as follows: each sequence is
|
|
written on a separate line with at least one space at the beginning; each
|
|
sequence is terminated by a space character, and can be followed by a
|
|
name. The sequences must be aligned.) Supply the name of the file of
|
|
aligned sequences. The program reads and displays the sequences. Choose
|
|
between "summing logs of weights" or summing weights (i.e. whether to
|
|
multiply or add weights). If logs are used all scores will be negative.
|
|
Choose if all positions in the set of aligned sequences should be used or
|
|
if a mask should be applied. If so selected, define a mask as a string of
|
|
symbols, in which symbol - means ignore and any other symbol means
|
|
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
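.para
The mask convention can be sketched as follows. The function name is
hypothetical; the rule itself (the symbol - means ignore, any other
symbol means use) is the one stated above.
.lit
```python
def used_positions(mask):
    """Return the 1-based motif positions that a mask string keeps."""
    return [i for i, symbol in enumerate(mask, start=1) if symbol != "-"]


kept = used_positions("xx-x--abc")
print(kept)
```
.end lit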
|
|
.para
|
|
The program will calculate weights as the frequencies of each base at
|
|
each unmasked position in the set of aligned sequences. These weights
|
|
are then applied to the set of aligned sequences to give a range of
|
|
"observed" scores. The mean and standard deviation of these scores are
|
|
displayed. The user is asked to supply several values to be used when the
|
|
weight matrix is applied to other sequences: a cutoff score (by default,
|
|
the mean minus 3 standard deviations), a top score for scaling graphical
|
|
results (by default, the mean plus 3 standard deviations), and a position
|
|
to identify (this means that if a particular base within the motif is used
|
|
as a "landmark", such as the A of the AG in splice acceptor sites, then its
|
|
position will be marked in plots). All these values are stored along with
|
|
the weight matrix. Finally supply the name of a file to contain the weight
|
|
matrix.
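.para
The steps above can be sketched for the case where weights are summed
rather than their logs. The names are hypothetical and this is not the
program's own code. The sketch uses the GCN4 alignment from the
dialogue in this section with the first four positions masked out, and
takes the raw base counts as the weights, which is an assumption based
on the scores shown in that dialogue. The standard deviation is assumed
to be the population form.
.lit
```python
from statistics import mean, pstdev

ALIGNED = [
    "AGCGTGACTCTTCCCGGAA", "GAGGTGACTCACTTGGAAG", "CGGATGACTCTTTTTTTTT",
    "ACAGTGACTCACGTTTTTT", "GTCGTGACTCATATGCTTT", "TGAATGACTCACTTTTTGG",
    "TTCTTGACTCGTCTTTTCT", "CGAATGACTCTTATTGATG", "AGAATGACTAATTTTACTA",
    "TCGTTGACTCATTCTAATC", "TTGCTGACTCATTACGATT", "GAGATGACTCTTTTTCTTT",
    "GCGATGATTCATTTCTCTG", "TAGATGACTCAGTTTAGTC", "TAAGTGACTCAGTTCTTTC",
    "ATGATGACTCTTAAGCATG",
]
MASK = "----" + "x" * 15  # ignore the first four positions


def count_matrix(seqs, mask):
    """Count each base at each unmasked position of the alignment."""
    counts = []
    for pos, symbol in enumerate(mask):
        column = {"T": 0, "C": 0, "A": 0, "G": 0}
        if symbol != "-":
            for s in seqs:
                column[s[pos]] += 1
        counts.append(column)
    return counts


def sum_score(counts, window):
    """Score a window by adding the count of its base at each position."""
    return sum(col.get(base, 0) for col, base in zip(counts, window))


counts = count_matrix(ALIGNED, MASK)
scores = [sum_score(counts, s) for s in ALIGNED]
avg, sd = mean(scores), pstdev(scores)
cutoff, top = avg - 3 * sd, avg + 3 * sd  # the default values offered
print(scores[0], scores[2], round(cutoff, 3), round(top, 3))
```
.end lit
.para
Under these assumptions the first and third sequences score 128 and
172, and the defaults come to 125.147 and 185.353, in agreement with
the values listed in the dialogue for summed weights.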
|
|
.para
|
|
Weight matrices can be "rescaled" using a set of aligned sequences in
|
|
much the same way as a matrix is created. The purpose is to redefine
|
|
the cutoff scores, and rescaling does not alter any other values in the
|
|
weight matrix file.
|
|
.para
|
|
The methods have changed considerably but were first outlined in
|
|
Staden, R. Nucl. Acids Res. 12 505-519 1984, and
|
|
Staden, R. Genetic
|
|
engineering: principles and methods vol 7, Edited by J.K. Setlow and A.
|
|
Hollaender, Plenum publishing corp., 1985.
|
|
.para
|
|
The methods have always had to deal with the problem of zeroes in the
|
|
matrices. The current versions
|
|
employ "Laplace's Law of Succession" in which 1 is
|
|
added to each term.
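.para
The effect of adding 1 to each term can be sketched as follows for the
case where logs of weights are summed. The names are hypothetical. The
normalisation by the column total plus 4 and the use of natural logs
are assumptions, so the sketch is not expected to reproduce the
program's exact scores; it only shows that a base never seen at a
position gives a finite, strongly negative contribution instead of the
log of zero.
.lit
```python
import math


def log_score(counts, window, pseudocount=1):
    """Sum log weights, with a pseudocount added to every base count."""
    total = 0.0
    for column, base in zip(counts, window):
        n = sum(column.values())
        # Assumed normalisation: (count + 1) / (n + number of bases).
        weight = (column.get(base, 0) + pseudocount) / (n + pseudocount * len(column))
        total += math.log(weight)
    return total


# Hypothetical two-position count matrix built from four sequences.
counts = [
    {"T": 0, "C": 0, "A": 4, "G": 0},
    {"T": 3, "C": 1, "A": 0, "G": 0},
]
good = log_score(counts, "AT")
poor = log_score(counts, "GG")
print(good, poor)
```
.end lit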
|
|
.para
|
|
It is now possible to apply a mask to a set of aligned sequences in
|
|
order to give weight to selected positions only.
|
|
Sequences have superimposed functions: some parts may be of general
|
|
structural
|
|
importance and give rise to an overall framework, and other parts give
|
|
specificity and hence are not common; we may want to use a set of
|
|
aligned
|
|
sequences to define a motif, but want to use only the framework
|
|
positions.
|
|
Alternatively we may want to pick out
|
|
only those parts of a set of aligned sequences that give a particular
|
|
property, and to ignore other similarities that are due to some other
|
|
property
|
|
and which could obscure the pattern
|
|
we are interested in. The ability to define a mask allows certain
|
|
positions
|
|
to be used in the motif and others to be ignored, and yet still permits the
|
|
use of a set of aligned sequences to calculate weights.
|
|
.para
|
|
Typical dialogue is shown below.
|
|
.lit
|
|
|
|
? Menu or option number=20
|
|
X 1 Use weight matrix
|
|
2 Make weight matrix
|
|
3 Rescale weight matrix
|
|
? 0,1,2,3 =2
|
|
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
|
|
|
|
1 AGCGTGACTCTTCCCGGAA HIS1
|
|
2 GAGGTGACTCACTTGGAAG HIS1
|
|
3 CGGATGACTCTTTTTTTTT HIS3
|
|
4 ACAGTGACTCACGTTTTTT HIS4
|
|
5 GTCGTGACTCATATGCTTT ARG3
|
|
6 TGAATGACTCACTTTTTGG ARG4
|
|
7 TTCTTGACTCGTCTTTTCT CPA1
|
|
8 CGAATGACTCTTATTGATG CPA2
|
|
9 AGAATGACTAATTTTACTA TRP5
|
|
10 TCGTTGACTCATTCTAATC TRP3
|
|
11 TTGCTGACTCATTACGATT TRP2
|
|
12 GAGATGACTCTTTTTCTTT IV1
|
|
13 GCGATGATTCATTTCTCTG IV2
|
|
14 TAGATGACTCAGTTTAGTC LEU1
|
|
15 TAAGTGACTCAGTTCTTTC LEU4
|
|
16 ATGATGACTCTTAAGCATG ILS1
|
|
Length of motif 19
|
|
? (y/n) (y) Sum logs of weights
|
|
|
|
? (y/n) (y) Use all motif positions n
|
|
x means use, - means ignore
|
|
e.g. xx-x---x-x means use positions 1,2,4,8,10
|
|
? Mask=----XXXXXXXX
|
|
Applying weights to input sequences
|
|
1 -27.979 AGCGTGACTCTTCCCGGAA
|
|
2 -24.543 GAGGTGACTCACTTGGAAG
|
|
3 -20.890 CGGATGACTCTTTTTTTTT
|
|
4 -23.087 ACAGTGACTCACGTTTTTT
|
|
5 -22.771 GTCGTGACTCATATGCTTT
|
|
6 -23.408 TGAATGACTCACTTTTTGG
|
|
7 -25.159 TTCTTGACTCGTCTTTTCT
|
|
8 -22.679 CGAATGACTCTTATTGATG
|
|
9 -24.751 AGAATGACTAATTTTACTA
|
|
10 -23.157 TCGTTGACTCATTCTAATC
|
|
11 -23.067 TTGCTGACTCATTACGATT
|
|
12 -21.449 GAGATGACTCTTTTTCTTT
|
|
13 -24.191 GCGATGATTCATTTCTCTG
|
|
14 -23.770 TAGATGACTCAGTTTAGTC
|
|
15 -22.923 TAAGTGACTCAGTTCTTTC
|
|
16 -25.285 ATGATGACTCTTAAGCATG
|
|
Top score -20.890 Bottom score -27.979
|
|
Mean -23.694 Standard deviation 1.613
|
|
Mean minus 3.sd -28.534 Mean plus 3.sd -18.854
|
|
? Cutoff score (-999.00-9999.00) (-28.53) =
|
|
? Top score for scaling plots (-28.53-999.00) (-18.85) =
|
|
? Position to identify (0-19) (1) =
|
|
? Title=GCN4 SEQUENCES
|
|
? Name for new weight matrix file=1.WTS
|
|
|
|
|
|
? Menu or option number=20
|
|
X 1 Use weight matrix
|
|
2 Make weight matrix
|
|
3 Rescale weight matrix
|
|
? 0,1,2,3 =3
|
|
? Name of existing weight matrix file=1.WTS
|
|
GCN4 SEQUENCES
|
|
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
|
|
Length of motif 19
|
|
? (y/n) (y) Sum logs of weights n
|
|
? (y/n) (y) Use all motif positions
|
|
|
|
Applying weights to input sequences
|
|
1 128.000 AGCGTGACTCTTCCCGGAA
|
|
2 148.000 GAGGTGACTCACTTGGAAG
|
|
3 172.000 CGGATGACTCTTTTTTTTT
|
|
4 160.000 ACAGTGACTCACGTTTTTT
|
|
5 161.000 GTCGTGACTCATATGCTTT
|
|
6 157.000 TGAATGACTCACTTTTTGG
|
|
7 149.000 TTCTTGACTCGTCTTTTCT
|
|
8 160.000 CGAATGACTCTTATTGATG
|
|
9 151.000 AGAATGACTAATTTTACTA
|
|
10 159.000 TCGTTGACTCATTCTAATC
|
|
11 158.000 TTGCTGACTCATTACGATT
|
|
12 169.000 GAGATGACTCTTTTTCTTT
|
|
13 152.000 GCGATGATTCATTTCTCTG
|
|
14 157.000 TAGATGACTCAGTTTAGTC
|
|
15 160.000 TAAGTGACTCAGTTCTTTC
|
|
16 143.000 ATGATGACTCTTAAGCATG
|
|
Top score 172.000 Bottom score 128.000
|
|
Mean 155.250 Standard deviation 10.034
|
|
Mean minus 3.sd 125.147 Mean plus 3.sd 185.353
|
|
? Cutoff score (-999.00-9999.00) (125.15) =
|
|
? Top score for scaling plots (125.15-999.00) (185.35) =
|
|
? Position to identify (0-19) (1) =
|
|
? Title=GCN4 SEQUENCES
|
|
? Name for new weight matrix file=2.WTS
|
|
|
|
|
|
? Menu or option number=20
|
|
X 1 Use weight matrix
|
|
2 Make weight matrix
|
|
3 Rescale weight matrix
|
|
? 0,1,2,3 =
|
|
? Motif weight matrix file=1.WTS
|
|
GCN4 SEQUENCES
|
|
? (y/n) (y) Plot results n
|
|
|
|
153 -22.61 GCAGCGACTGATTTGAGTT
|
|
169 -28.53 GTTCTGACCACTCAGATCC
|
|
172 -27.27 CTGACCACTCAGATCCGGC
|
|
219 -27.35 CCAGTGGCTGGCCTGCTAG
|
|
268 -27.82 CGAGGGATTTTCGATCTTG
|
|
274 -26.99 ATTTTCGATCTTGTGGATG
|
|
283 -25.79 CTTGTGGATGATTTTCACG
|
|
287 -27.50 TGGATGATTTTCACGTGCG
|
|
298 -28.17 CACGTGCGCCGTCATATTG
|
|
332 -28.27 TCTTTGAAGCAGAAGGGAC
|
|
351 -28.27 AGGGGTACACTTTCACATT
|
|
357 -25.05 ACACTTTCACATTTCGCTT
|
|
364 -28.51 CACATTTCGCTTATGGGAG
|
|
400 -23.77 GAAGTTACTAATGTGCGTG
|
|
451 -26.22 ATGCTCGCCCTCTTTGGTG
|
|
476 -28.00 TCCCTCACTGAGCCCTCCG
|
|
480 -28.33 TCACTGAGCCCTCCGCCTC
|
|
517 -23.46 GCTAAGATTCAGCTTGGTT
|
|
556 -27.27 TCCAGCACTCAGGTTCGGC
|
|
602 -27.01 AACTTGAATCCATCGTTGC
|
|
648 -28.45 TGCTAAACACAGCCGGTTT
|
|
679 -28.18 CTGTTTGCCCAGTTTGGGC
|
|
691 -28.51 TTTGGGCCGCTTCTGGACG
|
|
713 -27.67 GGCTTGACCGTGGCTGTGG
|
|
803 -25.47 ATGCTGACCATGCTTTTCA
|
|
848 -28.11 ATAATGTTAAGTTTGATTC
|
|
857 -25.97 AGTTTGATTCCGCTGGCCG
|
|
879 -27.85 CCGCTGCTGCTGTTTCCAC
|
|
917 -27.77 GCGATGAGGAAGGCTTGTT
|
|
931 -27.81 TTGTTGGCGCGCCTGCTCG
|
|
952 -23.52 GAGGTGACTACCATCCGTG
|
|
977 -28.40 TGCGTGGGTGAGCTGTTGT
|
|
|
|
|
|
|
|
|
|
? Menu or option number=6
|
|
Page through text files
|
|
? Name of file to read=1.WTS
|
|
GCN4 SEQUENCES
|
|
19 1 -28.534 -18.854
|
|
P 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
|
|
N 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16
|
|
T 0 0 0 0 16 0 0 1 16 0 5 11 10 12 9 6 7 12 6
|
|
C 0 0 0 0 0 0 0 15 0 15 0 3 2 2 4 3 2 1 3
|
|
A 0 0 0 0 0 0 16 0 0 1 10 0 3 2 0 3 5 2 2
|
|
G 0 0 0 0 0 16 0 0 0 0 1 2 1 0 3 4 2 1 5
|
|
End of file
|
|
|
|
.end lit
|
|
|
|
|
|
.left margin1
|
|
@21. TX 3 @ Count base composition
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine
|
|
calculates the base composition of the
|
|
active region of the sequence as both totals and percentages.
|
|
.left margin1
|
|
@22. TX 3 @ Count dinucleotide frequencies
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine simply counts dinucleotide frequencies for the currently
|
|
active region of the sequence. It also calculates an expected distribution
|
|
based on the base composition.
|
|
The output looks like:
|
|
.LIT
|
|
T C A G
|
|
obs expected obs expected obs expected obs expected
|
|
|
|
T 8.44 8.25 6.67 7.01 10.35 9.92 3.27 3.54
|
|
C 7.49 7.01 6.76 5.95 8.39 8.43 1.76 3.01
|
|
A 10.13 9.92 7.78 8.43 11.74 11.93 4.89 4.26
|
|
G 2.67 3.54 3.19 3.01 4.06 4.26 2.42 1.52
|
|
|
|
.END LIT
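.para
The calculation can be sketched as follows. The names are hypothetical
and this is not the program's own code. The expected value for a
dinucleotide is taken to be the product of the frequencies of its two
bases, expressed as a percentage; this is an assumption, but it agrees
with the table above, where the expected values for TT (8.25) and CC
(5.95) imply the listed expected value for TC (7.01).
.lit
```python
from collections import Counter

BASES = "TCAG"


def dinucleotide_table(seq):
    """Return {pair: (observed %, expected %)} for the 16 dinucleotides."""
    seq = seq.upper()
    pairs = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    bases = Counter(seq)
    n_pairs = sum(pairs[x + y] for x in BASES for y in BASES)
    n_bases = sum(bases[b] for b in BASES)
    if n_pairs == 0:
        return {}
    table = {}
    for x in BASES:
        for y in BASES:
            observed = 100.0 * pairs[x + y] / n_pairs
            # Expected from base composition alone.
            expected = 100.0 * (bases[x] / n_bases) * (bases[y] / n_bases)
            table[x + y] = (observed, expected)
    return table


table = dinucleotide_table("AACGT")
print(table["AA"])
```
.end lit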
|
|
.left margin1
|
|
@23. TX 3 5 @ Count codons and amino acids
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function
|
|
counts codons, amino acid composition, protein molecular weights, and
|
|
base
|
|
composition. Users select the segments of the sequence that the program
|
|
should analyse.
|
|
.para
|
|
Choose between being shown observed counts or counts normalised so
|
|
that the totals for each amino acid sum to 100. Select to define
|
|
segments using either the keyboard or an EMBL feature table.
|
|
Define the segments to count over. Select strand for each segment. Stop
|
|
selecting segments by typing a zero for "Count from ()". The results are
|
|
displayed a screenful at a time, and the bell is sounded to show there is
|
|
more to come. A zero start position, or the end of an EMBL feature table,
|
|
signals
|
|
the routine to print out totals for all values.
|
|
|
|
.para
|
|
The counts are broken down into several figures.
|
|
Base
|
|
composition by position in codon expressed as a percentage of each base's
|
|
own frequency; base composition by position in codon expressed as a
|
|
percentage of the overall base composition of the section; base
|
|
composition
|
|
expected for this amino acid composition if there was no codon
|
|
preference;
|
|
percentage deviations of the observed amino acid composition from an
|
|
average amino acid composition.
|
|
.para
|
|
The output looks like:
|
|
.LIT
|
|
|
|
===========================================
|
|
F TTT 1. S TCT 2. Y TAT 2. C TGT 1.
|
|
F TTC 1. S TCC 1. Y TAC 3. C TGC 2.
|
|
L TTA 7. S TCA 4. * TAA 9. * TGA 1.
|
|
L TTG 2. S TCG 1. * TAG 2. W TGG 2.
|
|
===========================================
|
|
L CTT 3. P CCT 2. H CAT 4. R CGT 1.
|
|
L CTC 2. P CCC 3. H CAC 1. R CGC 0.
|
|
L CTA 3. P CCA 2. Q CAA 4. R CGA 0.
|
|
L CTG 2. P CCG 2. Q CAG 1. R CGG 2.
|
|
===========================================
|
|
I ATT 9. T ACT 1. N AAT 7. S AGT 3.
|
|
I ATC 2. T ACC 2. N AAC 4. S AGC 2.
|
|
I ATA 4. T ACA 5. K AAA 13. R AGA 5.
|
|
M ATG 1. T ACG 2. K AAG 4. R AGG 1.
|
|
===========================================
|
|
V GTT 2. A GCT 2. D GAT 1. G GGT 3.
|
|
V GTC 2. A GCC 2. D GAC 1. G GGC 1.
|
|
V GTA 4. A GCA 3. E GAA 2. G GGA 1.
|
|
V GTG 2. A GCG 0. E GAG 1. G GGG 1.
|
|
===========================================
|
|
total codons= 166.
|
|
T C A G
|
|
|
|
1 31.06 33.68 34.03 35.00
|
|
2 35.61 35.79 30.89 32.50
|
|
3 33.33 30.53 35.08 32.50
|
|
|
|
1 24.70 19.28 39.16 16.87
|
|
2 28.31 20.48 35.54 15.66
|
|
3 26.51 17.47 40.36 15.66
|
|
% 26.51 19.08 38.35 16.06 observed, overall totals
|
|
% 25.00 22.26 33.10 19.65 expected, even codons per acid
|
|
|
|
A C D E F G H I K L
|
|
7. 3. 2. 3. 2. 6. 5. 15. 17. 19.
|
|
o-e % -47. -33. -76. -68. -64. -54. 62. 116. 67. 67.
|
|
|
|
M N P Q R S T V W Y
|
|
1. 11. 9. 5. 9. 13. 10. 10. 2. 5.
|
|
o-e % -62. 66. 12. -17. 19. 21. 6. -2. 0. -5.
|
|
total acids= 154. molecular weight= 17421.
|
|
|
|
Typical dialogue follows.
|
|
|
|
? Menu or option number=23
|
|
Calculate codon usage, base composition
|
|
and amino acid composition
|
|
? (y/n) (y) Show observed counts
|
|
? (y/n) (y) Define segments using keyboard
|
|
? Count from (0-1023) (0) =1
|
|
? Count to (1-1023) (1023) =1000
|
|
? (y/n) (y) + strand
|
|
|
|
===========================================
|
|
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
|
|
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
|
|
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
|
|
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
|
|
===========================================
|
|
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
|
|
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
|
|
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
|
|
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
|
|
===========================================
|
|
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
|
|
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
|
|
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
|
|
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
|
|
===========================================
|
|
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
|
|
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
|
|
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
|
|
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
|
|
===========================================
|
|
|
|
|
|
Total codons= 333.
|
|
T C A G
|
|
|
|
1 23.32 37.69 28.99 40.06
|
|
2 37.15 22.31 38.46 36.59
|
|
3 39.53 40.00 32.54 23.34
|
|
----- ----- ----- -----
|
|
= 100% 100% 100% 100%
|
|
|
|
1 17.72 29.43 14.71 38.14 = 100%
|
|
2 28.23 17.42 19.52 34.83 = 100%
|
|
3 30.03 31.23 16.52 22.22 = 100%
|
|
% 25.33 26.03 16.92 31.73 Observed, overall totals
|
|
% 24.44 22.31 20.90 32.35 Expected, even codons per acid
|
|
|
|
A C D E F G H I K L
|
|
33. 10. 15. 9. 17. 40. 10. 12. 5. 33.
|
|
O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13.
|
|
|
|
M N P Q R S T V W Y
|
|
2. 7. 2. 13. 49. 20. 11. 30. 5. 2.
|
|
O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81.
|
|
Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8
|
|
|
|
|
|
? Count from (0-1023) (0) =
|
|
|
|
Codon totals over all genes
|
|
===========================================
|
|
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
|
|
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
|
|
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
|
|
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
|
|
===========================================
|
|
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
|
|
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
|
|
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
|
|
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
|
|
===========================================
|
|
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
|
|
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
|
|
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
|
|
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
|
|
===========================================
|
|
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
|
|
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
|
|
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
|
|
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
|
|
===========================================
|
|
|
|
|
|
Total codons= 333.
|
|
T C A G
|
|
|
|
1 23.32 37.69 28.99 40.06
|
|
2 37.15 22.31 38.46 36.59
|
|
3 39.53 40.00 32.54 23.34
|
|
----- ----- ----- -----
|
|
= 100% 100% 100% 100%
|
|
|
|
1 17.72 29.43 14.71 38.14 = 100%
|
|
2 28.23 17.42 19.52 34.83 = 100%
|
|
3 30.03 31.23 16.52 22.22 = 100%
|
|
% 25.33 26.03 16.92 31.73 Observed, overall totals
|
|
% 24.44 22.31 20.90 32.35 Expected, even codons per acid
|
|
|
|
A C D E F G H I K L
|
|
33. 10. 15. 9. 17. 40. 10. 12. 5. 33.
|
|
O-E % 22. 81. -13. -55. 34. 71. 40. -29. -73. 13.
|
|
|
|
M N P Q R S T V W Y
|
|
2. 7. 2. 13. 49. 20. 11. 30. 5. 2.
|
|
O-E % -74. -51. -88. 0. 165. -11. -42. 40. 18. -81.
|
|
Total acids= 325. Molecular weight= 35831. Hydrophobicity= -17.8
|
|
|
|
.END LIT
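.para
The codon counts and the two base composition tables can be sketched
as follows. The names are hypothetical and this is not the program's
own code. The sketch assumes a single segment on the + strand, read in
the frame that starts at its first base, with any incomplete final
codon dropped.
.lit
```python
from collections import Counter

BASES = "TCAG"


def codon_counts(seq, start, end):
    """Count codons in bases start..end (numbered from 1, inclusive)."""
    segment = seq[start - 1:end].upper()
    usable = len(segment) - len(segment) % 3
    return Counter(segment[i:i + 3] for i in range(0, usable, 3))


def position_composition(codons):
    """Return (per_base, per_position) percentage tables, rows 1-3, columns T C A G."""
    by_pos = [Counter() for _ in range(3)]
    for codon, n in codons.items():
        for pos, base in enumerate(codon):
            by_pos[pos][base] += n
    base_total = {b: sum(by_pos[p][b] for p in range(3)) for b in BASES}
    per_base, per_position = [], []
    for pos in range(3):
        row_total = sum(by_pos[pos][b] for b in BASES)
        # Each column sums to 100: share of a base's own occurrences.
        per_base.append([100.0 * by_pos[pos][b] / base_total[b] if base_total[b] else 0.0
                         for b in BASES])
        # Each row sums to 100: composition of one codon position.
        per_position.append([100.0 * by_pos[pos][b] / row_total if row_total else 0.0
                             for b in BASES])
    return per_base, per_position


codons = codon_counts("ATGAAATAA", 1, 9)
per_base, per_position = position_composition(codons)
print(dict(codons), per_base[0], per_position[0])
```
.end lit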
|
|
.LEFT MARGIN1
|
|
@24. TX 3 @ Plot base composition
|
|
.LEFT MARGIN2
|
|
.para
|
|
This option plots the base composition of the sequence. The counts for
|
|
any combination of bases can be plotted.
|
|
.para
|
|
If dialogue is requested the user is presented with a check box for
|
|
selecting which bases should be counted, and then allowed to define a
|
|
window length, and a "plot interval". Otherwise, the AT composition is
|
|
plotted with a window of 101 and a plot interval of 5.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=d24
|
|
Plot base composition
|
|
|
|
checkbox: those set are marked X
|
|
X 1 T
|
|
2 C
|
|
X 3 A
|
|
4 G
|
|
? 0,1,2,3,4 =1
|
|
|
|
checkbox: those set are marked X
|
|
1 T
|
|
2 C
|
|
X 3 A
|
|
4 G
|
|
? 0,1,2,3,4 =3
|
|
|
|
checkbox: those set are marked X
|
|
1 T
|
|
2 C
|
|
3 A
|
|
4 G
|
|
? 0,1,2,3,4 =2
|
|
|
|
checkbox: those set are marked X
|
|
1 T
|
|
X 2 C
|
|
3 A
|
|
4 G
|
|
? 0,1,2,3,4 =4
|
|
|
|
checkbox: those set are marked X
|
|
1 T
|
|
X 2 C
|
|
3 A
|
|
X 4 G
|
|
? 0,1,2,3,4 =
|
|
|
|
? odd span length (1-201) (31) =
|
|
? plot interval (1-11) (5) =
|
|
|
|
missing graphics
|
|
|
|
|
|
|
|
.end lit
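.para
The windowed count can be sketched as follows. The names are
hypothetical and this is not the program's own code. The defaults are
those given above for use without dialogue (AT composition, a window of
101 and a plot interval of 5); expressing each point as a percentage of
the window is an assumption.
.lit
```python
def window_composition(seq, bases="AT", span=101, interval=5):
    """Return (position, percent) for each plotted window centre."""
    seq = seq.upper()
    half = span // 2  # span is odd, so the window is centred on one base
    points = []
    for centre in range(half, len(seq) - half, interval):
        window = seq[centre - half:centre + half + 1]
        hits = sum(1 for b in window if b in bases)
        points.append((centre + 1, 100.0 * hits / len(window)))
    return points


points = window_composition("AATTGGCC", bases="AT", span=3, interval=1)
print(points)
```
.end lit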
|
|
.left margIN1
|
|
@25. TX 3 @ Plot local deviations in base composition
|
|
.LEFT MARGIN2
|
|
.para
|
|
The "local deviation" routines are designed to indicate the similarity of
|
|
the compositions of different parts of the sequence. The composition of
|
|
every segment of the sequence is compared with a standard composition.
|
|
The levels of similarity are plotted as chi squared values. The standard
|
|
can be the composition of the whole sequence, or alternatively that of a
|
|
small segment defined by the user.
|
|
.para
|
|
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as the standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
|
|
.para
|
|
The following measure is used: for each window position
|
|
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
|
|
where obs is the observed composition
|
|
and exp is the expected composition (the composition of the standard).
|
|
The calculation is performed once to find out the range of values and is
|
|
then repeated and
|
|
plotted so that the plot exactly fills the allocated screen space.
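.para
A minimal sketch of this measure and of the two pass scaling is given
here. It assumes that the division by exp*exp is applied to each base
inside the sum, which is one reading of the expression above; all names
(composition, deviation, local_deviation) are hypothetical and the code
is not taken from the program.
.lit
```python
from collections import Counter

BASES = "TCAG"

def composition(segment):
    """Fraction of each base in a segment (other symbols are ignored)."""
    counts = Counter(base for base in segment.upper() if base in BASES)
    total = sum(counts.values()) or 1
    return [counts[base] / total for base in BASES]

def deviation(obs, exp):
    """Sum over bases of (obs-exp)^2 / exp^2, skipping bases absent from the standard."""
    return sum((o - e) * (o - e) / (e * e) for o, e in zip(obs, exp) if e > 0)

def local_deviation(seq, window=31, interval=5, standard=None):
    """Return the measure for each window, rescaled to the range 0..1."""
    exp = composition(standard if standard is not None else seq)
    values = [deviation(composition(seq[i:i + window]), exp)
              for i in range(0, len(seq) - window + 1, interval)]
    if not values:
        return []
    # First pass found the values; now rescale so the plot fills the box.
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0
    return [(v - lo) / span for v in values]
```
.end lit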
|
|
.left margIN1
|
|
@26. TX 3 @ Plot local deviations from dinucleotide composition
|
|
.LEFT MARGIN2
|
|
.para
|
|
The "local deviation" routines are designed to indicate the similarity of
|
|
the compositions of different parts of the sequence. The dinucleotide
|
|
composition of every segment of the sequence is compared with a
|
|
standard composition. The levels of similarity are plotted as chi
|
|
squared values. The standard can be the composition of the whole
|
|
sequence, or alternatively that of a small segment defined by the user.
|
|
.para
|
|
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as the standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
|
|
.para
|
|
The following measure is used: for each window position
|
|
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
|
|
where obs is the observed composition
|
|
and exp is the expected composition (the composition of the standard).
|
|
The calculation is performed once to find out the range of values and is
|
|
then repeated and
|
|
plotted so that the plot exactly fills the allocated screen space.
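.para
For words longer than one base the same measure can be applied to word
frequencies. The sketch below is an assumption about how that might be
done, with hypothetical names; it counts overlapping words of length k
(k=2 for dinucleotides, k=3 for trinucleotides) and sums only over words
present in the standard.
.lit
```python
from collections import Counter

def word_composition(segment, k=2):
    """Relative frequency of every overlapping word of length k."""
    segment = segment.upper()
    counts = Counter(segment[i:i + k] for i in range(len(segment) - k + 1))
    total = sum(counts.values()) or 1
    return {word: n / total for word, n in counts.items()}

def word_deviation(window, standard, k=2):
    """Compare a window with the standard over the words seen in the standard."""
    exp = word_composition(standard, k)
    obs = word_composition(window, k)
    return sum((obs.get(w, 0.0) - e) ** 2 / (e * e) for w, e in exp.items())
```
.end lit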
|
|
.left margin1
|
|
@27. TX 3 @ Plot local deviations from trinucleotide composition
|
|
.LEFT MARGIN2
|
|
.para
|
|
The "local deviation" routines are designed to indicate the similarity of
|
|
the compositions of different parts of the sequence. The trinucleotide
|
|
composition of every segment of the sequence is compared with a
|
|
standard composition. The levels of similarity are plotted as chi
|
|
squared values. The standard can be the composition of the whole
|
|
sequence, or alternatively that of a small segment defined by the user.
|
|
.para
|
|
If dialogue is forced, define the standard region, the window length and
the plot interval. Otherwise the composition of the whole sequence is
taken as the standard. The maximum and minimum observed values of the chi
squared calculation are displayed, and plots will always exactly fill the
available box. Any unusual regions will show as peaks.
|
|
.para
|
|
The following measure is used: for each window position
|
|
calculate (sum((obs-exp)*(obs-exp))/(exp*exp))
|
|
where obs is the observed composition
|
|
and exp is the expected composition (the composition of the standard).
|
|
The calculation is performed once to find out the range of values and is
|
|
then repeated and
|
|
plotted so that the plot exactly fills the allocated screen space.
|
|
.left margin1
|
|
@28. TX 5 @ Calculate codon constraint
|
|
.left margin2
|
|
.para
|
|
The purpose of this option (which is somewhat specialised) is to measure
|
|
the level of constraint imposed on the sequence by coding for a protein of
|
|
the observed composition. It measures the strength of the codon bias
|
|
averaged over windows of 99 codons and displays the values observed.
|
|
.para
|
|
Select between defining segments at the keyboard or using an EMBL
|
|
feature table. Finish selecting segments by typing a zero start. The value
|
|
for each segment is displayed:
|
|
.para
|
|
Mean (W-EW) / EWD, window 99 10.5
|
|
.para
|
|
The codon constraint is the difference between the observed codon
improbability and the mean improbability for a sequence of the same
composition. See McLachlan, Staden and Boswell, Nucl. Acids Res. 1984.
|
|
|
|
.left margin1
|
|
@29. TX 3 @ Plot negentropy
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine is designed to show regions of the sequence that differ in
|
|
composition from others, and hence is like the "plot deviation.." routines.
|
|
.para
|
|
Negentropy or information is defined in the following way: let Pi be the
|
|
probability of observing base i, where i = A,C,G or T, then the average
|
|
information per base is
|
|
I=-sum(Pi.Log(Pi)) (sum over all i). This routine calculates Pi by
|
|
calculating the overall composition for the sequence and then plots I for
|
|
windows of length defined by the user.
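.para
The expression for I can be illustrated with a short sketch. Two points
are assumptions made here and are not statements about the program: the
natural logarithm is used, and Pi is estimated from the bases inside each
window. The function names are hypothetical.
.lit
```python
import math
from collections import Counter

def information(segment, bases="ACGT"):
    """Return -sum(Pi * log(Pi)) over the bases present in the segment."""
    counts = Counter(b for b in segment.upper() if b in bases)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log(n / total) for n in counts.values())

def information_profile(seq, window=31, interval=5):
    """Evaluate the information for each window along the sequence."""
    return [information(seq[i:i + window])
            for i in range(0, len(seq) - window + 1, interval)]
```
.end lit
A window containing a single base type gives zero, and a window with all
four bases in equal amounts gives the largest value, log(4).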
|
|
.left margin1
|
|
@30. TX 4 @ Search for hairpin loops
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find simple inverted repeats or potential hairpin loops.
|
|
The loops are defined by a range of sizes for
|
|
the loop and a minimum number of consecutive base pairs in the stem.
|
|
The results can be presented graphically or listed.
|
|
A-T, G-C and G-T basepairs are counted.
|
|
.para
|
|
Define the range of loop sizes and the minimum number of consecutive
|
|
basepairs required. Choose between plotted or listed results.
|
|
.para
|
|
The loops found are plotted as blips on a
|
|
horizontal line that represents the sequence; the heights of the lines are
|
|
proportional to the number of basepairs in the stems. Note that only
|
|
uninterrupted stems are found - i.e. all basepairs must be made. To look
|
|
for stems with some unpaired bases (or for palindromes) use the inverted
|
|
repeat motif class in the pattern searching option.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=30
|
|
Search for hairpin loops
|
|
Define the range of loop sizes
|
|
? Minimum loop size (1-30) (1) =
|
|
? Maximum loop size (3-60) (3) =
|
|
? Minimum number of basepairs (2-20) (6) =
|
|
? (y/n) (y) Plot results n
|
|
Searching
|
|
|
|
T.G
|
|
G-C
|
|
G.T
|
|
T.G
|
|
C-G
|
|
G-C
|
|
T.G
|
|
C-G
|
|
G.T
|
|
GCCGCA GCGGAGG
|
|
49
|
|
|
|
G
|
|
G-C
|
|
T.G
|
|
C-G
|
|
G.T
|
|
T.G
|
|
G-C
|
|
CTGCTG GGAGGTC
|
|
56
|
|
|
|
|
|
G
|
|
T.G
|
|
G-C
|
|
G.T
|
|
T.G
|
|
C-G
|
|
G-C
|
|
T-A
|
|
T.G
|
|
AGCGCA CGACTGA
|
|
139
|
|
|
|
A C
|
|
G.T
|
|
C-G
|
|
G.T
|
|
C-G
|
|
C-G
|
|
G-C
|
|
TTCGCT CAACGCC
|
|
244
|
|
|
|
.end lit
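.para
The search for uninterrupted stems can be sketched as below. This is an
illustrative reading of the description and not the program's code: the
name find_hairpins and its return format are hypothetical, and the
defaults copy the values shown in the dialogue. A-T, G-C and G-T pairs are
accepted, as stated above.
.lit
```python
PAIRS = {("A", "T"), ("T", "A"), ("G", "C"), ("C", "G"), ("G", "T"), ("T", "G")}

def find_hairpins(seq, min_loop=1, max_loop=3, min_stem=6):
    """Return (stem_start, loop_size, stem_length) tuples; stem_start is 1-based."""
    seq = seq.upper()
    hits = []
    for loop_start in range(1, len(seq)):
        for loop in range(min_loop, max_loop + 1):
            left, right = loop_start - 1, loop_start + loop
            stem = 0
            # Extend the stem outwards while consecutive bases can pair.
            while left >= 0 and right < len(seq) and (seq[left], seq[right]) in PAIRS:
                stem += 1
                left -= 1
                right += 1
            if stem >= min_stem:
                hits.append((left + 2, loop, stem))
    return hits
```
.end lit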
|
|
.LEFT MARGIN1
|
|
@31. TX 4 @ Search for long range inverted repeats
|
|
.LEFT MARGIN2
|
|
.para
|
|
Searches for inverted repeats. The repeats found are exact matches of at
|
|
least 6 consecutive bases. Results can be presented graphically or listed.
|
|
Plotted results show the end points of repeats joined by rectangular
|
|
lines.
|
|
.para
|
|
If dialogue is not requested the defaults will be taken. Otherwise choose
|
|
between plotted or listed results. If required select to analyse a
|
|
restricted segment of the currently active region. Choose a repeat length.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=D31
|
|
Plot long-range inverted repeats
|
|
? (y/n) (y) Plot results n
|
|
Define restricted region
|
|
? start (1-1023) (1) =
|
|
? end (2-1023) (1023) =
|
|
? Minimum inverted repeat (6-30) (12) =10
|
|
Searching
|
|
27 909 10 TGCCCAGAGA
|
|
|
|
.end lit
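.para
A simple way to locate exact inverted repeats is to index every word of
the minimum length and look up its reverse complement. The sketch below
does only that: it reports words of exactly the minimum length and makes
no attempt to extend them. The names are hypothetical and the method is
not claimed to be the one used by the program.
.lit
```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an upper case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def inverted_repeats(seq, min_len=12):
    """Return (pos1, pos2, word): word at pos1 recurs reverse complemented at pos2."""
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - min_len + 1):
        index[seq[i:i + min_len]].append(i)
    hits = []
    for i in range(len(seq) - min_len + 1):
        word = seq[i:i + min_len]
        for j in index.get(reverse_complement(word), []):
            if j > i:  # report each pair once, left copy first
                hits.append((i + 1, j + 1, word))
    return hits
```
.end lit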
|
|
.LEFT MARGIN1
|
|
@32. TX 4 @ Search for repeats
|
|
.LEFT MARGIN2
|
|
.para
|
|
Searches for direct repeats. The repeats found are exact matches of at
|
|
least 6 consecutive bases. Results can be presented graphically or listed.
|
|
Plotted results show the end points of repeats joined by rectangular
|
|
lines.
|
|
.para
|
|
If dialogue is not requested the defaults will be taken. Otherwise choose
|
|
between plotted or listed results. If required select to analyse a
|
|
restricted segment of the currently active region. Choose a repeat length.
|
|
.para
|
|
Typical dialogue follows.
|
|
|
|
.lit
|
|
? Menu or option number=D32
|
|
Plot repeats
|
|
? (y/n) (y) Plot results n
|
|
Define restricted region
|
|
? start (1-1023) (1) =
|
|
? end (2-1023) (1023) =
|
|
? Minimum repeat (6-30) (12) =8
|
|
Searching
|
|
619 988 8 GCTGTTGT
|
|
514 646 8 GCTGCTAA
|
|
94 865 8 TCCGCTGG
|
|
146 222 9 GTGGCTGGC
|
|
455 497 8 TCGCCCTC
|
|
454 496 9 CTCGCCCTC
|
|
872 875 8 GCCGCCGC
|
|
510 615 8 CGTTGCTG
|
|
152 913 8 GGCAGCGA
|
|
199 265 8 CGTCGAGG
|
|
689 794 8 AGTTTGGG
|
|
147 223 8 TGGCTGGC
|
|
101 116 8 GACGAGGA
|
|
8 690 8 GTTTGGGC
|
|
52 141 8 TGCTGGTG
|
|
|
|
.end lit
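.para
Exact direct repeats can be found in a similar way by indexing words of
the minimum length and extending each matching pair to the right. The
sketch below is an assumption about one possible approach, with
hypothetical names; unlike the listing above it suppresses pairs that only
continue a repeat starting further to the left.
.lit
```python
from collections import defaultdict

def direct_repeats(seq, min_len=12):
    """Return (pos1, pos2, length, word) for exact repeats, with 1-based positions."""
    seq = seq.upper()
    n = len(seq)
    index = defaultdict(list)
    for i in range(n - min_len + 1):
        index[seq[i:i + min_len]].append(i)
    hits = []
    for positions in index.values():
        for a in range(len(positions)):
            for b in range(a + 1, len(positions)):
                i, j = positions[a], positions[b]
                # Skip pairs that merely continue a repeat starting further left.
                if i > 0 and seq[i - 1] == seq[j - 1]:
                    continue
                length = min_len
                while j + length < n and seq[i + length] == seq[j + length]:
                    length += 1
                hits.append((i + 1, j + 1, length, seq[i:i + length]))
    return sorted(hits)
```
.end lit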
|
|
.left margin1
|
|
@33. TX 4 @ Search for z dna (total ry, yr)
|
|
.LEFT MARGIN2
|
|
.para
|
|
Searches for segments of the sequence that might form Z DNA. A window
|
|
length is chosen and the number of RY and YR dinucleotides within each
|
|
window is plotted. The top of the box corresponds to all RY or YR, the
|
|
bottom to zero RY or YR.
|
|
.para
|
|
If dialogue is requested, select a window length and plot interval.
|
|
Otherwise the defaults will be used.
|
|
.para
|
|
The program contains three
|
|
separate ways of doing this (options 33,34,35).
|
|
.left margin1
|
|
@34. TX 4 @ Search for z dna (runs of ry, yr)
|
|
.LEFT MARGIN2
|
|
.para
|
|
Searches for segments of the sequence that might form Z DNA. Results
|
|
are plotted.
|
|
.para
|
|
If dialogue is requested, define a window length and plot interval.
|
|
Otherwise the defaults will be used.
|
|
The routine
|
|
counts the number of R in positions 1,3,5 etc =R1, the number of Y in
|
|
positions 2,4,6 etc =Y1, the number of Y in positions 1,3,5 etc =Y2 and
|
|
the
|
|
number of R in positions 2,4,6 etc =R2 for a window length. It plots the
|
|
maximum of R1+Y1 and R2+Y2 relative to a minimum of (window
|
|
length)/2 and a
|
|
maximum of (window length). (see 33,35).
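.para
The count for a single window can be sketched as follows, using the
symbols R1, Y1, R2 and Y2 from the description. The function name is
hypothetical and the code is only an illustration of the stated rule.
.lit
```python
PURINES = set("AG")
PYRIMIDINES = set("CT")

def alternation_score(window):
    """Return max(R1+Y1, R2+Y2) for one window."""
    window = window.upper()
    odd, even = window[0::2], window[1::2]   # positions 1,3,5.. and 2,4,6..
    r1 = sum(b in PURINES for b in odd)
    y1 = sum(b in PYRIMIDINES for b in even)
    y2 = sum(b in PYRIMIDINES for b in odd)
    r2 = sum(b in PURINES for b in even)
    return max(r1 + y1, r2 + y2)
```
.end lit
A perfectly alternating window scores its full length, and a window of
purines only scores half its length, which matches the plotted range.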
|
|
.LEFT MARGIN1
|
|
@35. TX 4 @ Search for z dna (best phased value)
|
|
.LEFT MARGIN2
|
|
.para
|
|
Searches for segments of the sequence that might form Z DNA. Results
|
|
are plotted.
|
|
.para
|
|
If dialogue is requested, define a window length and a plot interval.
Otherwise the default values will be used.
|
|
.para
|
|
The routine
|
|
counts the number of consecutive RY or YR dinucleotides in phase. It
|
|
moves
|
|
through the sequence counting the number of RY or YR dinucleotides; when
|
|
the next dinucleotide is not of the correct type the score is set back to
|
|
zero and the search restarted using the current base to set the phase. The
|
|
plots are done relative to a minimum of zero and a maximum defined by
|
|
the
|
|
user. (See 33,34).
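.para
One simplified reading of this counting rule is sketched below. It scores
each base with the number of consecutive purine/pyrimidine alternations
that end there and resets to zero when the alternation breaks. This is an
assumption made for illustration; the names are hypothetical and any
symbol other than A or G is treated as a pyrimidine.
.lit
```python
def purine(base):
    """True for A or G."""
    return base in "AG"

def phased_runs(seq):
    """For each base, the number of consecutive alternating R/Y steps ending there."""
    seq = seq.upper()
    scores = [0] * len(seq)
    for i in range(1, len(seq)):
        if purine(seq[i]) != purine(seq[i - 1]):
            scores[i] = scores[i - 1] + 1   # still alternating: an RY or YR step
        else:
            scores[i] = 0                   # broken: restart from this base
    return scores
```
.end lit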
|
|
.LEFT MARGIN1
|
|
@36. TX 4 @ Local similarity or complementarity search
|
|
.LEFT MARGIN2
|
|
.PARA
|
|
This function is designed to find segments of
|
|
local similarity or complementarity. It is therefore like performing
|
|
a DIAGON
|
|
plot that is
|
|
restricted to regions near the main diagonal. Results can be presented
|
|
graphically or listed.
|
|
.para
|
|
Users define
|
|
a region to search through,
|
|
a span length, a range for searching through and a cut-off score. The
|
|
program takes all sections of sequence
|
|
of length span within the defined region
|
|
and compares them to
|
|
all other sequences within the region and
|
|
range specified.
|
|
If a match above the cutoff is found we
|
|
need to show the position
|
|
of the two sections of sequence and the score, and we do it in the
|
|
following way.
|
|
If we have a 70%
|
|
match between
|
|
a sequence that starts at p1 and a sequence that starts at p2
|
|
the program draws a
|
|
diagonal line that starts at p1 with height 70% of the box and which
|
|
finishes at p2 with
|
|
height 0.
|
|
The matches can also be listed.
|
|
.para
|
|
Here I define the terms range, region, and span and what is compared.
|
|
Suppose we have a defined region j1 to j2, a range of i1 to i2 and a span
|
|
of
|
|
s; the program will take, in turn, all sections of sequence of length s
|
|
within j1 and j2 and compare them to all sequences that start a distance
|
|
i1+s-1
|
|
to i2+s-1 away from them. First it will take the sequence of length s
|
|
starting
|
|
at j1 and compare it
|
|
with the sequence of length s starting at
|
|
j1+s-1+i1, then j1+s-1+i1+1, etc up to j1+s-1+i2; then it will take the
|
|
sequence of length s starting at j1+1 and compare it with the sequence
|
|
starting at j1+s-1+1+i1 etc. This continues until we hit
|
|
the right hand end of the
|
|
sequence as defined by j2. Note 1) that sequences are not compared with
|
|
themselves: the nearest sequence compared to a span s starting at j
|
|
starts
|
|
at j+s; 2) ranges i1 and i2 are ranges of start positions; 3) by choosing a
|
|
range greater than the length of the sequence this routine will do a full
|
|
DIAGON analysis except for those points within a distance span of
|
|
the main diagonal (see note 1).
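.para
The indexing described above can be written out as a short sketch. It
covers direct similarity only (complementarity is not handled), assumes
that j2 does not exceed the sequence length, and uses hypothetical names;
the variables j1, j2, i1, i2 and span correspond to j1, j2, i1, i2 and s
in the text.
.lit
```python
def percent_match(a, b):
    """Percentage of identical positions between two equal-length strings."""
    return 100.0 * sum(x == y for x, y in zip(a, b)) / len(a)

def local_similarity(seq, j1, j2, i1, i2, span, cutoff=70.0):
    """Return (p1, p2, score) tuples using 1-based start positions."""
    seq = seq.upper()
    hits = []
    for p1 in range(j1, j2 - span + 2):          # spans starting at j1, j1+1, ...
        first = seq[p1 - 1:p1 - 1 + span]
        for offset in range(i1, i2 + 1):
            p2 = p1 + span - 1 + offset          # j+s-1+i1 ... j+s-1+i2
            if p2 + span - 1 > j2:               # second span must end by j2
                break
            second = seq[p2 - 1:p2 - 1 + span]
            score = percent_match(first, second)
            if score >= cutoff:
                hits.append((p1, p2, score))
    return hits
```
.end lit
With i1 set to 1 the nearest span compared with one starting at j starts
at j+s, in agreement with note 1.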
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
|
|
? Menu or option number=36
|
|
Search for local similarity or complementarity
|
|
? (y/n) (y) Find direct repeats
|
|
? (y/n) (y) Keep picture n
|
|
? Span (5-200) (15) =
|
|
Define restricted region
|
|
? start (0-1023) (1) =
|
|
? end (2-1023) (1023) =
|
|
? Percent match (1.00-100.00) (70.00) =
|
|
? Range start (1-50) (1) =
|
|
? Range end (1-50) (1) =5
|
|
? (y/n) (y) Plot results n
|
|
Working
|
|
|
|
|
|
118 128
|
|
CGAGGAGGAG GTGGA
|
|
** ***** ** **
|
|
GGACGAGGAC GTCGA
|
|
100 110
|
|
|
|
|
|
119 129
|
|
GAGGAGGAGG TGGAT
|
|
** ***** * * **
|
|
GACGAGGACG TCGAC
|
|
101 111
|
|
? (y/n) (y) Find direct repeats n
|
|
? (y/n) (y) Keep picture
|
|
? Span (5-200) (15) =
|
|
Define restricted region
|
|
? start (0-1023) (1) =
|
|
? end (2-1023) (1023) =
|
|
? Percent match (1.00-100.00) (70.00) =
|
|
? Range start (1-50) (1) =
|
|
? Range end (1-50) (5) =8
|
|
? (y/n) (y) List results
|
|
|
|
Working
|
|
|
|
|
|
178 188
|
|
ACTCAGATCC GGCGG
|
|
***** *** * **
|
|
ACTCAAATCA GTCGC
|
|
156 166
|
|
|
|
|
|
177 187
|
|
CACTCAGATC CGGCG
|
|
***** *** * **
|
|
AACTCAAATC AGTCG
|
|
157 167
|
|
? (y/n) (y) Find inverted repeats !
|
|
.end lit
|
|
|
|
.left margin1
|
|
@37. TX 5 @ Set genetic code
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function allows the user to change the current active genetic code
|
|
for
|
|
all the options. The user may select: the standard code, the mammalian
|
|
mitochondrial code, the yeast mitochondrial code or a personal code
|
|
(define
|
|
your own).
|
|
.para
|
|
Select code. If personal, define a codon and select an amino acid. When all
|
|
codons have been reset define a blank codon.
|
|
.para
|
|
The code differences are:
|
|
.lit
|
|
               Mammalian      Yeast
    Codon    Mitochondrial  Mitochondrial  Standard
     UGA          W              W           STOP
     AUA          M              M            I
     CUA          L              T            L
     AGA         STOP            R            R
     AGG         STOP            R            R
|
|
.END LIT
|
|
.para
|
|
Typical dialogue follows.
|
|
|
|
.lit
|
|
? Menu or option number=37
|
|
X 1 Standard code
|
|
2 Mammalian mitochondrial code
|
|
3 Yeast mitochondrial code
|
|
4 Personal code
|
|
? 0,1,2,3,4 =2
|
|
|
|
? Menu or option number=37
|
|
X 1 Standard code
|
|
2 Mammalian mitochondrial code
|
|
3 Yeast mitochondrial code
|
|
4 Personal code
|
|
? 0,1,2,3,4 =4
|
|
Define genetic code by typing a codon
|
|
followed by a 1 letter amino acid symbol
|
|
? Codon=TTT
|
|
Default Amino acid symbol=F
|
|
? Amino acid symbol=W
|
|
? Codon=
|
|
.end lit
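.para
The code differences listed above can be represented as small overrides
applied to the standard code. The sketch below is an illustration with
hypothetical names, not the program's data structure; codons are written
with T in place of U, and a personal code is passed as a dictionary of
reassigned codons.
.lit
```python
from itertools import product

BASES = "TCAG"
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

def standard_code():
    """Map each DNA codon to a 1 letter amino acid symbol ('*' = stop)."""
    codons = ("".join(c) for c in product(BASES, repeat=3))
    return dict(zip(codons, STANDARD_AAS))

# Differences from the standard code, as listed in the help text.
DIFFERENCES = {
    "mammalian mitochondrial": {"TGA": "W", "ATA": "M", "AGA": "*", "AGG": "*"},
    "yeast mitochondrial": {"TGA": "W", "ATA": "M", "CTA": "T"},
}

def genetic_code(name="standard", personal=None):
    """Return the chosen code, with any personal reassignments applied last."""
    code = standard_code()
    code.update(DIFFERENCES.get(name, {}))
    code.update(personal or {})      # e.g. {"TTT": "W"} as in the dialogue
    return code
```
.end lit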
|
|
|
|
.left margin1
|
|
@38. T 3 4 @ Examine repeats
|
|
.left margin2
|
|
.para
|
|
This function can be used to examine the frequencies of repeated words
|
|
within a sequence. It finds all words that occur more than once. The
|
|
user selects a minimum word length and the program finds all words of that
|
|
length that occur more than once; then it "follows" each repeated word until it
|
|
becomes unique. For each word length it can report the number of different
|
|
repeated words, the number of occurrences of each word, and their actual
|
|
positions and sequences.
|
|
.para
|
|
It is possible that the algorithm may run out of memory, particularly if a short
minimum word length is chosen, or if the sequence is very long or very
repetitive. If this occurs the longest reported word length will not
necessarily be the longest in the sequence: the memory will have been consumed
before the longest word is found.
|
|
.lit
|
|
Typical dialogue and output is shown below.
|
|
|
|
Expected length of longest repeat 14
|
|
? Minumim word length (1-6) (6) =6
|
|
Working
|
|
? Show repeat frequencies for words of at least length (6-15) (15) =10
|
|
For length 10 the number of different repeated words is 2035
|
|
For length 11 the number of different repeated words is 613
|
|
For length 12 the number of different repeated words is 161
|
|
For length 13 the number of different repeated words is 37
|
|
For length 14 the number of different repeated words is 10
|
|
For length 15 the number of different repeated words is 1
|
|
? Show repeats for words of length (6-15) (15) =14
|
|
? Show repeats for words occuring with frequency (2-9999) (2) =2
|
|
|
|
ggtgctcatgccca
|
|
occurs at 21611
|
|
occurs at 21851
|
|
ttatccggtgatga
|
|
occurs at 4604
|
|
occurs at 8806
|
|
agcaccacgctgac
|
|
occurs at 5954
|
|
occurs at 9486
|
|
catgacggaggatg
|
|
occurs at 10480
|
|
occurs at 19925
|
|
aaagacgggaaaat
|
|
occurs at 11820
|
|
occurs at 43157
|
|
tacaaaaccaattt
|
|
occurs at 26797
|
|
occurs at 31369
|
|
cgagaaagagtgcg
|
|
occurs at 4260
|
|
occurs at 44305
|
|
gccggatgatggcg
|
|
occurs at 7893
|
|
occurs at 16638
|
|
atgacggaggatga
|
|
occurs at 10481
|
|
occurs at 19926
|
|
gcggcgaacgaggc
|
|
occurs at 11352
|
|
occurs at 18718
|
|
? Show repeats for words of length (6-15) (15) =!
|
|
|
|
Example of not enough memory
|
|
----------------------------
|
|
|
|
Expected length of longest repeat 14
|
|
? Minumim word length (1-6) (6) =1
|
|
Working
|
|
Not enough memory
|
|
Memory used in bytes 1125996. Length of longest repeat 5
|
|
? Show repeat frequencies for words of at least length (1-5) (5) =!
|
|
|
|
.end lit
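.para
The counts of different repeated words can be reproduced in outline with
the sketch below. It recounts the words for each length in turn, which is
simpler than "following" each repeated word and is not claimed to be the
program's algorithm; the function names are hypothetical.
.lit
```python
from collections import defaultdict

def repeated_words(seq, length):
    """Map each word of this length that occurs more than once to its 1-based positions."""
    seq = seq.lower()
    positions = defaultdict(list)
    for i in range(len(seq) - length + 1):
        positions[seq[i:i + length]].append(i + 1)
    return {word: where for word, where in positions.items() if len(where) > 1}

def repeat_frequencies(seq, min_length):
    """Count the different repeated words for each length until none remain."""
    counts = {}
    length = min_length
    while True:
        found = repeated_words(seq, length)
        if not found:
            return counts
        counts[length] = len(found)
        length += 1
```
.end lit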
|
|
.left margin1
|
|
@39. TX 5 @ Translate and list in up to six phases
|
|
.LEFT MARGIN2
|
|
.para
|
|
This is a general listing function that will perform translations and
|
|
produce several forms of output. The possibilities are:
|
|
.lit
|
|
1) no translation, list one or two strands, two ways of numbering the
|
|
sequence.
|
|
2) translation, one or two strands, one or three letter codes.
|
|
Positions defined by:
|
|
a) open reading frames of some minimum length l, l can be 0, hence giving
|
|
a complete six phase translation.
|
|
b) positions typed on keyboard, again 1 to 6 phases, translations appearing
|
|
above and below the dna.
|
|
c) positions read from a feature table.
|
|
|
|
It should be used in preference to option 5. For publication
|
|
without a translation, the option to number ends of lines is more compact
|
|
than option 5. Some examples and typical dialogue are given below. Note the
|
|
requirement for d39.
|
|
|
|
? Menu or option number=D39
|
|
Find open reading frames, translate and list
|
|
? (y/n) (y) Show translation
|
|
|
|
The segments to translate can be
|
|
1 Typed on the keyboard
|
|
2 Read from a feature table
|
|
X 3 Open reading frames
|
|
? 1,2,3 =
|
|
? Minimum open frame in amino acids (0-7238) (30) =
|
|
? (y/n) (y) Use 1 letter codes
|
|
Define section of DNA to display
|
|
? start (1-7238) (1) =
|
|
? end (2-7238) (7238) =300
|
|
? Line length (30-120) (60) =
|
|
Which strands should be shown
|
|
X 1 + strand only
|
|
2 - strand only
|
|
3 Both strands
|
|
? 1,2,3 =3
|
|
? (y/n) (y) Number ends of lines
|
|
|
|
|
|
N A T T I S R I D A T F S A R A P N E N
|
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
|
|
. : . : . : . : . : . :
|
|
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
|
|
* S A G W I F I
|
|
A V V I L L I S A V K E A R A G F S F
|
|
|
|
I A K Q V I D H L R N V S N G Q T K S T
|
|
L N R L L T I C E M Y L M V K L N L L
|
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
|
|
. : . : . : . : . : . :
|
|
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
|
|
Y S F L N N V M Q S I Y R I T L S F R S
|
|
I A L C T I S W K R F T D L P * V L D V
|
|
|
|
R S Q N W E S T V T W N E T S R H R T L
|
|
V R R I G N Q L L H G M K L P D T V L *
|
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
|
|
. : . : . : . : . : . :
|
|
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
|
|
T R L I P F
|
|
R E C F Q S D V T V H F S V E L C R V K
|
|
|
|
V A Y L K H V E L Q H Q I Q Q L S S K P
|
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
|
|
. : . : . : . : . : . :
|
|
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
|
|
T A Y K F C T S S C C W I
|
|
|
|
S A K M T S Y Q K E Q L K V L S N P D L
|
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
|
|
. : . : . : . : . : . :
|
|
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
|
|
|
|
|
|
? Menu or option number=D39
|
|
Find open reading frames, translate and list
|
|
? (y/n) (y) Show translation N
|
|
Define section of DNA to display
|
|
? start (1-7238) (1) =
|
|
? end (2-7238) (7238) =300
|
|
? Line length (30-120) (60) =
|
|
Which strands should be shown
|
|
X 1 + strand only
|
|
2 - strand only
|
|
3 Both strands
|
|
? 1,2,3 =
|
|
? (y/n) (y) Number ends of lines
|
|
|
|
|
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
|
|
|
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
|
|
|
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
|
|
|
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
|
|
|
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
|
|
|
|
|
|
? Menu or option number=D39
|
|
Find open reading frames, translate and list
|
|
? (y/n) (y) Show translation
|
|
The segments to translate can be
|
|
1 Typed on the keyboard
|
|
2 Read from a feature table
|
|
X 3 Open reading frames
|
|
? 1,2,3 =
|
|
? Minimum open frame in amino acids (0-7238) (30) =0
|
|
? (y/n) (y) Use 1 letter codes N
|
|
Define section of DNA to display
|
|
? start (1-7238) (1) =
|
|
? end (2-7238) (7238) =300
|
|
? Line length (30-120) (60) =
|
|
Which strands should be shown
|
|
X 1 + strand only
|
|
2 - strand only
|
|
3 Both strands
|
|
? 1,2,3 =3
|
|
? (y/n) (y) Number ends of lines
|
|
|
|
|
|
AsnAlaThrThrIleSerArgIleAspAlaThrPheSerAlaArgAlaProAsnGluAsn
|
|
ThrLeuLeuLeuLeuValGluLeuMetProProPheGlnLeuAlaProGlnMetLysIle
|
|
ArgTyrTyrTyr******Asn***CysHisLeuPheSerSerArgProLys***Lys
|
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
|
|
. : . : . : . : . : . :
|
|
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
|
|
ValSerSerSerAsnThrSerAsnIleGlyGlyLys***SerAlaGlyTrpIlePheIle
|
|
Arg************TyrPheGlnHisTrpArgLysLeuGluArgGlyLeuHisPheTyr
|
|
AlaValValIleLeuLeuIleSerAlaValLysGluAlaArgAlaGlyPheSerPhe
|
|
|
|
IleAlaLysGlnValIleAspHisLeuArgAsnValSerAsnGlyGlnThrLysSerThr
|
|
***LeuAsnArgLeuLeuThrIleCysGluMetTyrLeuMetValLysLeuAsnLeuLeu
|
|
TyrSer***ThrGlyTyr***ProPheAlaLysCysIle***TrpSerAsn***IleTyr
|
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
|
|
. : . : . : . : . : . :
|
|
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
|
|
TyrSerPheLeuAsnAsnValMetGlnSerIleTyrArgIleThrLeuSerPheArgSer
|
|
Leu***ValPro***GlnGlyAsnAlaPheHisIle***HisAspPhe***Ile***Glu
|
|
IleAlaLeuCysThrIleSerTrpLysArgPheThrAspLeuPro***ValLeuAspVal
|
|
|
|
ArgSerGlnAsnTrpGluSerThrValThrTrpAsnGluThrSerArgHisArgThrLeu
|
|
ValArgArgIleGlyAsnGlnLeuLeuHisGlyMetLysLeuProAspThrValLeu***
|
|
SerPheAlaGluLeuGlyIleAsnCysTyrMetGlu***AsnPheGlnThrProTyrPhe
|
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
|
|
. : . : . : . : . : . :
|
|
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
|
|
ThrArgLeuIleProPhe***SerAsnCysProIlePheSerGlySerValThrSer***
|
|
AsnAlaSerAsnProIleLeuGln***MetSerHisPheLysTrpValGlyTyrLysLeu
|
|
ArgGluCysPheGlnSerAspValThrValHisPheSerValGluLeuCysArgValLys
|
|
|
|
ValAlaTyrLeuLysHisValGluLeuGlnHisGlnIleGlnGlnLeuSerSerLysPro
|
|
LeuHisIle***AsnMetLeuSerTyrSerThrArgPheSerAsn***AlaLeuSerHis
|
|
SerCysIlePheLysThrCys***AlaThrAlaProAspSerAlaIleLysLeu***Ala
|
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
|
|
. : . : . : . : . : . :
|
|
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
|
|
AsnCysIle***PheMetAsnLeu***LeuValLeuAsnLeuLeu***AlaArgLeuTrp
|
|
GlnMetAsnLeuValHisGlnAlaValAlaGlySerGluAlaIleLeuSer***AlaMet
|
|
ThrAlaTyrLysPheCysThrSerSerCysCysTrpIle***CysAsnLeuGluLeuGly
|
|
|
|
SerAlaLysMetThrSerTyrGlnLysGluGlnLeuLysValLeuSerAsnProAspLeu
|
|
ProGlnLys***ProLeuIleLysArgSerAsn***ArgTyrSerLeuIleLeuThrCys
|
|
IleArgLysAsnAspLeuLeuSerLysGlyAlaIleLysGlyThrLeu***Ser***Pro
|
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
|
|
. : . : . : . : . : . :
|
|
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
|
|
GlyCysPheHisGlyArgIleLeuLeuLeuLeu***LeuTyrGluArgIleArgValGln
|
|
ArgLeuPheSerArgLysAspPheProAlaIleLeuProValArg***AspGlnGlyThr
|
|
AspAlaPheIleValGlu******PheSerCysAsnPheThrSerGluLeuGlySerArg
|
|
|
|
|
|
? Menu or option number=D39
|
|
Find open reading frames, translate and list
|
|
? (y/n) (y) Show translation
|
|
The segments to translate can be
|
|
1 Typed on the keyboard
|
|
2 Read from a feature table
|
|
X 3 Open reading frames
|
|
? 1,2,3 =1
|
|
? (y/n) (y) Use 1 letter codes
|
|
Define section of DNA to display
|
|
? start (1-7238) (1) =
|
|
? end (2-7238) (7238) =300
|
|
? Line length (30-120) (60) =
|
|
Which strands should be shown
|
|
X 1 + strand only
|
|
2 - strand only
|
|
3 Both strands
|
|
? 1,2,3 =
|
|
? (y/n) (y) Number ends of lines N
|
|
Translate
|
|
? From (0-300) (0) =101
|
|
? To (1-300) (300) =300
|
|
Translate
|
|
? From (0-300) (0) =102
|
|
? To (1-300) (300) =200
|
|
Translate
|
|
? From (0-300) (0) =
|
|
|
|
|
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT
|
|
10 20 30 40 50 60
|
|
|
|
M V K L N L L
|
|
W S N * I Y
|
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT
|
|
70 80 90 100 110 120
|
|
|
|
V R R I G N Q L L H G M K L P D T V L *
|
|
S F A E L G I N C Y M E * N F Q T P Y F
|
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA
|
|
130 140 150 160 170 180
|
|
|
|
L H I * N M L S Y S T R F S N * A L S H
|
|
S C I F K T C
|
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA
|
|
190 200 210 220 230 240
|
|
|
|
P Q K * P L I K R S N * R Y S L I L T C
|
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG
|
|
250 260 270 280 290 300
|
|
|
|
|
|
? Menu or option number=D39
|
|
Find open reading frames, translate and list
|
|
? (y/n) (y) Show translation
|
|
The segments to translate can be
|
|
1 Typed on the keyboard
|
|
2 Read from a feature table
|
|
X 3 Open reading frames
|
|
? 1,2,3 =2
|
|
? Embl feature table file=1.FT
|
|
? (y/n) (y) Use 1 letter codes
|
|
Define section of DNA to display
|
|
? start (1-7238) (1) =
|
|
? end (2-7238) (7238) =300
|
|
? Line length (30-120) (60) =
|
|
Which strands should be shown
|
|
X 1 + strand only
|
|
2 - strand only
|
|
3 Both strands
|
|
? 1,2,3 =3
|
|
? (y/n) (y) Number ends of lines
|
|
|
|
|
|
N A T T I S R I D A T F S A R A P N E N
|
|
AACGCTACTACTATTAGTAGAATTGATGCCACCTTTTCAGCTCGCGCCCCAAATGAAAAT 60
|
|
. : . : . : . : . : . :
|
|
TTGCGATGATGATAATCATCTTAACTACGGTGGAAAAGTCGAGCGCGGGGTTTACTTTTA
|
|
* S A G W I F I
|
|
A V V I L L I S A V K E A R A G F S F
|
|
|
|
I A K Q V I D H L R N V S N G Q T K S T
|
|
L N R L L T I C E M Y L M V K L N L L
|
|
ATAGCTAAACAGGTTATTGACCATTTGCGAAATGTATCTAATGGTCAAACTAAATCTACT 120
|
|
. : . : . : . : . : . :
|
|
TATCGATTTGTCCAATAACTGGTAAACGCTTTACATAGATTACCAGTTTGATTTAGATGA
|
|
Y S F L N N V M Q S I Y R I T L S F R S
|
|
I A L C T I S W K R F T D L P * V L D V
|
|
|
|
R S Q N W E S T V T W N E T S R H R T L
|
|
V R R I G N Q L L H G M K L P D T V L *
|
|
CGTTCGCAGAATTGGGAATCAACTGTTACATGGAATGAAACTTCCAGACACCGTACTTTA 180
|
|
. : . : . : . : . : . :
|
|
GCAAGCGTCTTAACCCTTAGTTGACAATGTACCTTACTTTGAAGGTCTGTGGCATGAAAT
|
|
T R L I P F
|
|
R E C F Q S D V T V H F S V E L C R V K
|
|
|
|
V A Y L K H V E L Q H Q I Q Q L S S K P
|
|
GTTGCATATTTAAAACATGTTGAGCTACAGCACCAGATTCAGCAATTAAGCTCTAAGCCA 240
|
|
. : . : . : . : . : . :
|
|
CAACGTATAAATTTTGTACAACTCGATGTCGTGGTCTAAGTCGTTAATTCGAGATTCGGT
|
|
T A Y K F C T S S C C W I
|
|
|
|
S A K M T S Y Q K E Q L K V L S N P D L
|
|
TCCGCAAAAATGACCTCTTATCAAAAGGAGCAATTAAAGGTACTCTCTAATCCTGACCTG 300
|
|
. : . : . : . : . : . :
|
|
AGGCGTTTTTACTGGAGAATAGTTTTCCTCGTTAATTTCCATGAGAGATTAGGACTGGAC
|
|
* L Y E R I R V Q
|
|
* F S C N F T S E L G S R
|
|
.end lit
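.para
A six phase translation with 1 letter codes can be sketched as follows.
The sketch uses the standard genetic code only, marks incomplete or
unrecognised codons with X (an assumption made here), and uses
hypothetical names; it does not reproduce the listing layout shown above.
.lit
```python
from itertools import product

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = dict(zip(("".join(c) for c in product(BASES, repeat=3)), AMINO_ACIDS))
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna, frame=1):
    """Translate one strand from frame 1, 2 or 3 using 1 letter codes ('*' = stop)."""
    dna = dna.upper()
    start = frame - 1
    codons = (dna[i:i + 3] for i in range(start, len(dna) - 2, 3))
    return "".join(CODE.get(codon, "X") for codon in codons)

def six_phase(dna):
    """Return the three + strand and three - strand translations."""
    minus = dna.upper().translate(COMPLEMENT)[::-1]
    phases = {}
    for strand, text in (("+", dna), ("-", minus)):
        for frame in (1, 2, 3):
            phases[(strand, frame)] = translate(text, frame)
    return phases
```
.end lit
Applied to the first 30 bases of the example sequence, frame 1 gives
NATTISRIDA, which agrees with the start of the first listing.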
|
|
.left margin1
|
|
@40. TX 5 @ Translate and write the protein sequence to disk
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine allows the user to translate sections of the sequence into
|
|
the
|
|
1 letter amino acid codes and store the resulting amino acid sequences in
|
|
a disk file.
|
|
Two modes of use are possible. Either all open reading frames of at least
|
|
some minimum length will
|
|
automatically be found and translated, or the user can specify that
|
|
particular segments be translated.
|
|
.para
|
|
Mode 1: the user selects to translate all open reading frames.
|
|
.para
|
|
Either, or both, strands can be
|
|
translated.
|
|
The output file is in the same format as a PIR .seq file.
|
|
Each protein segment is given an entry name that is its start base in
|
|
the DNA, and a title that includes its end position,
|
|
reading frame and strand (+ for plus, - for minus).
|
|
Each segment is terminated by * whether or not
|
|
there is a stop codon in the DNA. The file is therefore suitable for input
|
|
to FASTA, ALIGNL and ANALYSEPL.
|
|
.para
|
|
Mode 2: the user selects to identify the segments to translate.
|
|
.para
|
|
Either, or both, strands can be
|
|
translated.
|
|
If multiple coding regions
|
|
are translated each will be separated from the previous one by a gap of 5
|
|
dashes (-----).
|
|
The sections to translate can be
|
|
defined from the keyboard or by supplying the name of the appropriate
|
|
EMBL
|
|
library feature table.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=40
|
|
Translate and write protein sequence to disk
|
|
? (y/n) (y) Translate selected regions
|
|
? (y/n) (y) Define segments using keyboard
|
|
Translate
|
|
? From (0-1023) (0) =1
|
|
? To (1-1023) (1023) =111
|
|
? (y/n) (y) + strand
|
|
Translate
|
|
? From (0-1023) (0) =
|
|
? Output file name=1.OUT
|
|
|
|
? Menu or option number=40
|
|
Translate and write protein sequence to disk
|
|
? (y/n) (y) Translate selected regions n
|
|
? Minimum open frame in amino acids (5-1000) (30) =
|
|
|
|
X 1 + strand only
|
|
2 - strand only
|
|
3 Both strands
|
|
? 0,1,2,3 =3
|
|
? File name for translation=1.OUT
|
|
|
|
? Menu or option number=6
|
|
Page through text files
|
|
? Name of file to read=1.OUT
|
|
>P1; 25
|
|
135 1 +
|
|
GAQRLLRRSCWCWRCGGRQRTQGSAGRGRRRRGGGG*
|
|
>P1; 238
|
|
486 1 +
|
|
IRCRDCGQRRRGIFDLVDDFHVRRHIVLARKLFEAEGTGVHFHISLMGGNIVTAEVTNVR
|
|
VDAGADFAAVRMLALFGAVVPH*
|
|
>P1; 556
|
|
795 1 +
|
|
|
|
SSTQVRRASAQTSSLQLESIVAVVNVEVFLAAKHSRFYIAVLFAQFGPLLDARLDRGCGK
|
|
GAGRRDQWRGGGVDLANGR*
|
|
>P1; 796
|
|
987 1 +
|
|
|
|
FGYADHAFHLRSTSRHSDNVKFDSAGRRRCCCFHLVFSLGSDEEGLLARLLVEVTTIRVV
|
|
LRG*
|
|
>P1; 2
|
|
163 2 +
|
|
NSVWAWCEVPRDYCAAAAGAGGAEVVNGPRDPLDEDVDDEEEVDSALLVAGSD*
|
|
>P1; 176
|
|
391 2 +
|
|
PLRSGGGGVEAPETPSGWPARFAAATVANAVEGFSILWMIFTCAVILSLRVNSLKQKGQG
|
|
YTFTFRLWEVT*
|
|
>P1; 476
|
|
628 2 +
|
|
SLTEPSASPSPTLLLRFSLVLTEGVPNPALRFGVLPLRPAAFNLNPSLLL*
|
|
>P1; 629
|
|
958 2 +
|
|
MSRYSWLLNTAGFTSPFCLPSLGRFWTRGLTVAVEKEPAGETNGVEAALTLPMGVSLGML
|
|
TMLFTCAPPAAIPIMLSLIPLAAAAAAVSTWCFLWAAMRKACWRACSLR*
|
|
>P1; 3
|
|
293 3 +
|
|
IRFGLGVRCPEITAPQLLVLAVRRSSTDPGIRWTRTSTTRRRWIAHCWWLAATDLSSDHS
|
|
DPAAEASRLPKLPVAGLLDSLPRLWPTPSRDFRSCG*
|
|
>P1; 411
|
|
521 3 +
|
|
CACRRGSRLCSGTYARPLWCSSPSLSPPPRPRQRCC*
|
|
>P1; 1020
|
|
37 1 -
|
|
EFGKYNPLTDNSSPTQDHTDGSHLNEQARQQAFLIAAQRKHQVETAAAAAASGIKLNIIG
|
|
MAAGGAQVKSMVSIPKLTPIGKVNAASTPLVSPAGSFSTATVKPRVQKRPKLGKQNGDVK
|
|
PAVFSSQEYLDIYNSNDGFKLKAAGLSGSTPNLSAGLGTPSVKTKLNLSSNVGEGEAEGS
|
|
VRDYCTKEGEHTYRCKVCSRVYTHISNFCRHYVTSHKRNVKVYPCPFCFKEFTRKDNMTA
|
|
HVKIIHKIENPSTALATVAAANLAGQPLGVSGASTPPPPDLSGQNSNQSLPATSNALSTS
|
|
SSSSTSSSSGSLGPLTTSAPPAPAAAAQ*
|
|
>P1; 373
|
|
-1 2 -
|
|
AKCESVPLSLLLQRVYAQGQYDGARENHPQDRKSLDGVGHSRGSESSRPATGSFGSLDAS
|
|
AAGSEWSELKSVAASHQQCAIHLLLVVDVLVQRIPGSVDDLRTASTSSCGAVISGHLTPS
|
|
PNRI*
|
|
>P1; 517
|
|
407 2 -
|
|
QQRWRGRGGGLSEGLLHQRGRAYVPLQSLLPRLHAH*
|
|
>P1; 649
|
|
518 2 -
|
|
QPGIPRHLQQQRWIQVEGCWSERKHAEPECWIRNSLCQNQAES*
|
|
>P1; 853
|
|
650 2 -
|
|
HYRNGGWWSAGEKHGQHTQTNAHWQGQRRLHAIGLACRLLFHSHGQAARPEAAQTQTER
|
|
RCKTGCV*
|
|
>P1; 958
|
|
854 2 -
|
|
SPQRAGAPTSLPHRCPEKTPGGNSSSGGGQRNQT*
|
|
>P1; 179
|
|
78 3 -
|
|
VVRTQISRCQPPAMRYPPPPRRRRPRPADPWVR*
|
|
>P1; 479
|
|
363 3 -
|
|
GTTAPKRASIRTAAKSAPASTRTLVTSAVTMLPPISEM*
|
|
>P1; 791
|
|
666 3 -
|
|
RPLARSTPPPRHWSRLPAPFPQPRSSRASRSGPNWANRTAM*
|
|
>P1; 1022
|
|
819 3 -
|
|
SNSASTTRSPTTAHPRRTTRMVVTSTSRRANKPSSSLPRENTRWKQQQRRRPAESNLTLS
|
|
EWRLVERR*
|
|
End of file
|
|
.end lit
|
|
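.para
The open frame translation shown in the second dialogue can be
sketched in Python as follows. This is only an illustration, not the
program's own code: the names CODE and open_frames are hypothetical,
only the + strand is handled, and it is assumed that the reported end
position includes the stop codon when one is present.
.lit
```python
BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
# Standard genetic code, in the same T, C, A, G order as the codon tables.
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def open_frames(seq, min_aa=30):
    """Yield (start, end, frame, protein) for + strand open frames.

    Positions are 1-based. Every protein is terminated by '*',
    whether or not a stop codon is present in the DNA.
    """
    seq = seq.upper()
    for frame in range(3):
        start = frame          # 0-based start of the current open frame
        protein = []
        for i in range(frame, len(seq) - 2, 3):
            aa = CODE.get(seq[i:i + 3], "X")   # 'X' for ambiguous codons
            if aa == "*":
                if len(protein) >= min_aa:
                    yield (start + 1, i + 3, frame + 1,
                           "".join(protein) + "*")
                start = i + 3
                protein = []
            else:
                protein.append(aa)
        # Open frame running off the end of the sequence.
        if len(protein) >= min_aa:
            yield (start + 1, start + 3 * len(protein), frame + 1,
                   "".join(protein) + "*")
```
.end lit
.para
Calling list(open_frames(sequence, 30)) gives one tuple per segment,
which could then be written out under a ">P1;" style header.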
|
|
.LEFT MARGIN1
|
|
@41. TX 5 @ Calculate and write codon table to disk
|
|
.LEFT MARGIN2
|
|
.para
|
|
This routine calculates codon usage tables
|
|
for sections of the sequence
|
|
and stores the resulting tables on disk.
|
|
The sections to translate can be
|
|
defined from the keyboard or by supplying the name of the appropriate
|
|
EMBL
|
|
library feature table.
|
|
.para
|
|
If required users can add to an existing codon table stored as a disk file.
|
|
Choose between storing observed counts or having them normalised so
|
|
that the totals for each amino acid sum to 100. Select between defining
|
|
segments at the keyboard or using an EMBL feature table. Define
|
|
segments. Signal completion with a zero start. Supply a file name. For
|
|
each segment the program will display the counts, at the end it will
|
|
display the accumulated totals.
|
|
.lit
|
|
|
|
Typical dialogue follows.
|
|
? Menu or option number=41
|
|
Calculate and write codon table to disk
|
|
? (y/n) (y) Start with empty table
|
|
? (y/n) (y) Show observed counts
|
|
? (y/n) (y) Define segments using keyboard
|
|
? Count from (0-1023) (0) =1
|
|
? Count to (1-1023) (1023) =111
|
|
? (y/n) (y) + strand
|
|
|
|
===========================================
|
|
F TTT 0. S TCT 0. Y TAT 0. C TGT 0.
|
|
F TTC 1. S TCC 1. Y TAC 0. C TGC 3.
|
|
L TTA 1. S TCA 0. * TAA 0. * TGA 1.
|
|
L TTG 2. S TCG 0. * TAG 0. W TGG 2.
|
|
===========================================
|
|
L CTT 0. P CCT 0. H CAT 0. R CGT 2.
|
|
L CTC 0. P CCC 0. H CAC 0. R CGC 2.
|
|
L CTA 0. P CCA 0. Q CAA 1. R CGA 1.
|
|
L CTG 1. P CCG 0. Q CAG 2. R CGG 2.
|
|
===========================================
|
|
I ATT 0. T ACT 0. N AAT 0. S AGT 0.
|
|
I ATC 0. T ACC 1. N AAC 0. S AGC 1.
|
|
I ATA 0. T ACA 0. K AAA 0. R AGA 1.
|
|
M ATG 0. T ACG 0. K AAG 0. R AGG 0.
|
|
===========================================
|
|
V GTT 0. A GCT 1. D GAT 0. G GGT 3.
|
|
V GTC 0. A GCC 1. D GAC 0. G GGC 1.
|
|
V GTA 0. A GCA 0. E GAA 1. G GGA 4.
|
|
V GTG 1. A GCG 0. E GAG 0. G GGG 0.
|
|
===========================================
|
|
? Count from (0-1023) (0) =
|
|
|
|
Codon totals over all genes
|
|
===========================================
|
|
F TTT 0. S TCT 0. Y TAT 0. C TGT 0.
|
|
F TTC 1. S TCC 1. Y TAC 0. C TGC 3.
|
|
L TTA 1. S TCA 0. * TAA 0. * TGA 1.
|
|
L TTG 2. S TCG 0. * TAG 0. W TGG 2.
|
|
===========================================
|
|
L CTT 0. P CCT 0. H CAT 0. R CGT 2.
|
|
L CTC 0. P CCC 0. H CAC 0. R CGC 2.
|
|
L CTA 0. P CCA 0. Q CAA 1. R CGA 1.
|
|
L CTG 1. P CCG 0. Q CAG 2. R CGG 2.
|
|
===========================================
|
|
I ATT 0. T ACT 0. N AAT 0. S AGT 0.
|
|
I ATC 0. T ACC 1. N AAC 0. S AGC 1.
|
|
I ATA 0. T ACA 0. K AAA 0. R AGA 1.
|
|
M ATG 0. T ACG 0. K AAG 0. R AGG 0.
|
|
===========================================
|
|
V GTT 0. A GCT 1. D GAT 0. G GGT 3.
|
|
V GTC 0. A GCC 1. D GAC 0. G GGC 1.
|
|
V GTA 0. A GCA 0. E GAA 1. G GGA 4.
|
|
V GTG 1. A GCG 0. E GAG 0. G GGG 0.
|
|
===========================================
|
|
? (y/n) (y) Save table in a file n
|
|
.end lit
|
|
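.para
The two ways of storing a table (observed counts, or counts normalised
so that the codons for each amino acid sum to 100) can be sketched in
Python as below. The function names are hypothetical, segments are
assumed to be given as 1-based inclusive positions on the + strand, and
only the bases T, C, A and G are handled.
.lit
```python
from collections import Counter

BASES = "TCAG"
AMINO = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
         "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}


def count_codons(seq, start, end):
    """Count the codons in seq[start..end] (1-based, inclusive)."""
    region = seq[start - 1:end].upper()
    return Counter(region[i:i + 3] for i in range(0, len(region) - 2, 3))


def normalise_per_amino_acid(counts):
    """Rescale counts so the codons of each amino acid sum to 100."""
    totals = Counter()
    for codon, n in counts.items():
        totals[CODE[codon]] += n
    return {codon: 100.0 * n / totals[CODE[codon]]
            for codon, n in counts.items()}
```
.end lit
.para
Counts from several segments can be accumulated by adding the Counter
objects together before normalising.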
|
|
.left margin1
|
|
@42. TX 6 @ Codon usage method
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find protein coding regions. For each window length of the
|
|
sequence the routine measures the closeness to an expected codon usage.
|
|
Results are plotted for each of the three reading frames. Stop and start
|
|
codons are also marked on the plots. Has the highest resolution of all
|
|
such methods, but makes the strongest assumption, i.e. that the codon
|
|
usage is known. The latest version is described in Methods in Enzymology
|
|
183, 193-211.
|
|
.para
|
|
Choose whether to use an internal standard (i.e. part of the current
|
|
sequence known to code for a protein). If so define its end points, and
|
|
those of any others. Otherwise supply the name of a disk file containing a
|
|
table of codon usage. Tables are listed. Choose between using the
|
|
observed counts, or two types of normalisation: normalised to give an
|
|
average amino acid composition; normalised to no amino acid bias. The
|
|
first normalisation is clearly often sensible, but the second removes
|
|
valuable information and is only made available for special
|
|
circumstances. The final table will be displayed, followed by the
|
|
expected scores for window lengths 21, 31 and 41 codons. The scores for
|
|
each of the three reading frames are shown (they are logarithmic values)
|
|
to help users choose a window length for the analysis. Define a window
|
|
length and plot interval. Plotting will start.
|
|
.para
|
|
The method was first described in
|
|
Staden and McLachlan Nucl. Acid Res. 10 141-156 (1982) and the
|
|
following is a summary of the initial ideas.
|
|
The method makes the following main assumptions: the codon
|
|
preferences
|
|
of all the
|
|
genes in the sequence we are examining are similar to those of the
|
|
standard;
|
|
the sequence is coding
|
|
throughout its whole length in only one reading frame; in the coding
|
|
frame
|
|
the frequency of codon abc has a definite value Fabc.
|
|
.LEFT MARGIN2
|
|
If we select a sequence a1b1c1a2b2c2a3b3c3,...,anbncnan+1bn+1cn+1
|
|
then the
|
|
probability of selecting it in each of the three frames is:
|
|
.left margin15
|
|
frame 1: p1=Fa1b1c1.Fa2b2c2....Fanbncn
|
|
.left margin15
|
|
frame 2: p2=Fb1c1a2.Fb2c2a3...Fbncnan+1
|
|
.left margin15
|
|
frame 3: p3=Fc1a2b2.Fc2a3b3...Fcnan+1bn+1
|
|
.LEFT MARGIN2
|
|
The probability that selection of a particular sequence was "caused" by it
|
|
being a coding sequence is:
|
|
.LEFT MARGIN2
|
|
P1=p1/(p1+p2+p3), P2=p2/(p1+p2+p3), P3=p3/(p1+p2+p3).
|
|
.LEFT MARGIN2
|
|
The program calculates these values for the given window length but
|
|
plots
|
|
Log(P/(1-P)) for each of the three frames. At each point along the
|
|
sequence
|
|
that the program has a
|
|
point to plot it finds which of the three values is highest and places a
|
|
single point at the 50% level for the corresponding frame. These single
|
|
points will join to form a solid line if one frame is consistently the
|
|
highest scoring. In addition stop codons are shown as short vertical lines
|
|
that bisect the 50%
|
|
level of probability. When looking for coding regions
|
|
the user should look for solid horizontal lines at the
|
|
50% level that are not interrupted by these short vertical lines.
|
|
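.para
The calculation of P1, P2 and P3 for one window, and of the plotted
value Log(P/(1-P)), can be sketched in Python as below. This is an
illustration under stated assumptions, not the program's code: the
function name is hypothetical, natural logarithms are used, the window
holds n codons plus two extra bases, and codons missing from the table
are given a small floor frequency so that the logarithm is defined.
.lit
```python
import math


def frame_log_odds(window, freq, floor=1e-6):
    """Return log(P/(1-P)) for each of the three frames of a window.

    window: string of length 3n+2 (n codons plus two extra bases).
    freq:   dict mapping codon to its frequency in the standard.
    floor:  assumed minimum frequency for unseen codons.
    """
    n = (len(window) - 2) // 3
    logp = []
    for offset in range(3):
        total = 0.0
        for i in range(n):
            codon = window[offset + 3 * i:offset + 3 * i + 3]
            total += math.log(max(freq.get(codon, 0.0), floor))
        logp.append(total)   # log of p1, p2 or p3

    result = []
    for k in range(3):
        # P_k/(1-P_k) = p_k / (sum of the other two p values),
        # evaluated in log space to avoid underflow on long windows.
        others = [logp[j] for j in range(3) if j != k]
        m = max(others)
        log_others = m + math.log(sum(math.exp(x - m) for x in others))
        result.append(logp[k] - log_others)
    return result
```
.end lit
.para
The frame with the largest value is the one that would receive the
single point at the 50% level for that window position.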
.para
|
|
Changes.
|
|
Two normalisations are offered: 1) to remove all amino acid
|
|
compositional components from the tables, hence leaving only the codon
|
|
preference components. In general this is not recommended as the amino
|
|
acid
|
|
component alone is often sufficient to choose correctly between frames,
|
|
but
|
|
may be useful in special circumstances. 2) to change the amino acid
|
|
composition components to give an average amino acid composition
|
|
rather than
|
|
the one contained in the standard (this leaves the codon preference
|
|
components unchanged). In general this should be useful as the average
|
|
amino acid composition is likely to be closer to the composition of the
|
|
genes being hunted, than is that of the standard table of codon
|
|
preferences.
|
|
The average composition
|
|
is that recently published by Argos, not the Dayhoff one that we have
|
|
used
|
|
before.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
|
|
? Menu or option number=42
|
|
Staden and McLachlan codon usage method
|
|
Codon tables for standards may be read from disk
|
|
or calculated from parts of the current sequence
|
|
? (y/n) (y) Define internal standard
|
|
Define standard
|
|
? start (0-1023) (0) =1
|
|
? end (2-1023) (1023) =1000
|
|
===========================================
|
|
F TTT 13. S TCT 1. Y TAT 1. C TGT 3.
|
|
F TTC 4. S TCC 10. Y TAC 1. C TGC 7.
|
|
L TTA 1. S TCA 0. * TAA 1. * TGA 4.
|
|
L TTG 4. S TCG 1. * TAG 3. W TGG 5.
|
|
===========================================
|
|
L CTT 9. P CCT 1. H CAT 3. R CGT 14.
|
|
L CTC 7. P CCC 0. H CAC 7. R CGC 14.
|
|
L CTA 0. P CCA 0. Q CAA 4. R CGA 9.
|
|
L CTG 12. P CCG 1. Q CAG 9. R CGG 8.
|
|
===========================================
|
|
I ATT 7. T ACT 4. N AAT 4. S AGT 1.
|
|
I ATC 4. T ACC 5. N AAC 3. S AGC 7.
|
|
I ATA 1. T ACA 1. K AAA 3. R AGA 2.
|
|
M ATG 2. T ACG 1. K AAG 2. R AGG 2.
|
|
===========================================
|
|
V GTT 11. A GCT 13. D GAT 6. G GGT 9.
|
|
V GTC 5. A GCC 10. D GAC 9. G GGC 11.
|
|
V GTA 6. A GCA 5. E GAA 6. G GGA 12.
|
|
V GTG 8. A GCG 5. E GAG 3. G GGG 8.
|
|
===========================================
|
|
Define standard
|
|
? start (0-1023) (0) =
|
|
Total codons in standard= 333.
|
|
X 1 Use observed frequencies
|
|
2 Normalize to average amino acid composition
|
|
3 Normalize to no amino acid bias
|
|
? 0,1,2,3 =2
|
|
===========================================
|
|
F TTT 19. S TCT 2. Y TAT 10. C TGT 3.
|
|
F TTC 6. S TCC 22. Y TAC 10. C TGC 8.
|
|
L TTA 2. S TCA 0. * TAA 0. * TGA 0.
|
|
L TTG 7. S TCG 2. * TAG 0. W TGG 8.
|
|
===========================================
|
|
L CTT 16. P CCT 16. H CAT 4. R CGT 10.
|
|
L CTC 12. P CCC 0. H CAC 10. R CGC 10.
|
|
L CTA 0. P CCA 0. Q CAA 8. R CGA 7.
|
|
L CTG 21. P CCG 16. Q CAG 18. R CGG 6.
|
|
===========================================
|
|
I ATT 19. T ACT 13. N AAT 16. S AGT 2.
|
|
I ATC 11. T ACC 17. N AAC 12. S AGC 15.
|
|
I ATA 3. T ACA 3. K AAA 22. R AGA 1.
|
|
M ATG 15. T ACG 3. K AAG 15. R AGG 1.
|
|
===========================================
|
|
V GTT 15. A GCT 21. D GAT 14. G GGT 10.
|
|
V GTC 7. A GCC 16. D GAC 20. G GGC 13.
|
|
V GTA 8. A GCA 8. E GAA 26. G GGA 14.
|
|
V GTG 11. A GCG 8. E GAG 13. G GGG 9.
|
|
===========================================
|
|
Span length 21 expected mean values: 4.8 -5.7 -4.8
|
|
Span length 31 expected mean values: 7.1 -8.4 -7.2
|
|
Span length 41 expected mean values: 9.5 -11.1 -9.5
|
|
? odd span length (11-101) (25) =41
|
|
? plot interval (1-11) (5) =
|
|
|
|
Missing graphics display here
|
|
|
|
.end lit
|
|
|
|
.left margin1
|
|
@43. TX 6 @ Positional base preference method.
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find protein coding regions. For each window length of the
|
|
sequence the routine measures the closeness to an expected pattern of
|
|
base frequencies. Results are plotted for each of the three reading
|
|
frames. Stop and start codons are also marked on the plots. The method
|
|
is particularly useful for showing which reading frame is the most likely
|
|
to be coding. The latest version is described in a forthcoming issue of
|
|
Methods in Enzymology, but the original ideas were given in
|
|
Staden, R. Nucl. Acid Res. 12 551-567 (1984).
|
|
.para
|
|
If dialogue is requested the following inputs are needed, otherwise the
|
|
standard analysis is performed. Choose between a "global" standard, or a
|
|
selected one. If the global standard is selected the
|
|
expected scores are displayed and the user asked to define a span length
|
|
and a plot interval. Then users choose between plotting relative or
|
|
absolute scores, and can reset the scaling values employed for plotting.
|
|
If the global standard is not selected users must define a region of the
|
|
sequence to use as a standard, or they can read in a codon table from which
|
|
the
|
|
program will calculate one. Then they can either use the values
|
|
observed in this standard, or they can combine its values for the third
|
|
positions in codons, with those from the global standard. Next they can
|
|
give different weightings to each of the three positions in codons.
|
|
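.para
Deriving the table of positional base frequencies from a codon table
can be sketched in Python as below. The function name is hypothetical
and the treatment of stop codons is an assumption: the sketch simply
uses whatever codons it is given, so stop codons should be removed
first if they are not wanted. The "Range" value is taken to be the
difference between the largest and smallest frequency in each row.
.lit
```python
def positional_base_table(codon_counts):
    """Return (table, ranges) from a dict of codon counts.

    table[pos][base] is the frequency of base at codon position
    pos (0, 1 or 2); ranges[pos] is max minus min for that row.
    """
    total = float(sum(codon_counts.values()))
    if total == 0:
        raise ValueError("codon table is empty")
    table = [dict.fromkeys("TCAG", 0.0) for _ in range(3)]
    for codon, n in codon_counts.items():
        for pos, base in enumerate(codon):
            table[pos][base] += n / total
    ranges = [max(row.values()) - min(row.values()) for row in table]
    return table, ranges
```
.end lit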
.para
|
|
In its original form the method
|
|
took advantage of the
|
|
uneven
|
|
use of amino acids by proteins and the structure of the genetic code table
|
|
and assumed that there is a typical ("global")
|
|
amino acid composition
|
|
and no codon preference. The typical amino acid composition is the
|
|
average
|
|
composition found by Argos (see below).
|
|
This composition and no codon preference
|
|
determines the frequency of each of the four bases in each of the three
|
|
codon positions. This 3x4 frequency table shows unequal use of the bases
|
|
and in particular a marked use of G in position 1 and of A in position 2
|
|
(at the expense of G). The routine slides a window along the sequence and
|
|
calculates a score for each of the three reading
|
|
frames at each window position. It assumes the sequence is coding
|
|
throughout its whole length and calculates the probability that it is
|
|
coding in each of the three frames.
|
|
When tested against all the E. coli sequences in the EMBL sequence
|
|
library
|
|
it correctly identified the coding frame for 91% of window positions.
|
|
(The E. coli
|
|
sequences were chosen only for technical reasons: I have no reason to
|
|
think
|
|
the method would work less well on other organisms with roughly even
|
|
base composition.)
|
|
The routine can plot either absolute or relative values: i.e. absolute values
|
|
are the values found by summing the scores for each frame (say p1, p2
|
|
and
|
|
p3), and the relative values are then p1/(p1+p2+p3), p2/(p1+p2+p3) and
|
|
p3/(p1+p2+p3).
|
|
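.para
The absolute and relative scores for one window can be sketched in
Python as below. This is an illustration only: the function name is
hypothetical, and it is assumed that the score of a frame is the sum,
over every base in the window, of the table frequency for that base at
its codon position, multiplied by the weight for that position.
.lit
```python
def frame_scores(window, table, weights=(1.0, 1.0, 1.0)):
    """Return (absolute, relative) scores for the three frames.

    window:  string of length 3n+2 (n codons plus two extra bases).
    table:   table[pos][base] frequencies for codon positions 0..2.
    weights: one weight for each codon position.
    """
    n = (len(window) - 2) // 3
    absolute = []
    for offset in range(3):
        score = 0.0
        for i in range(n):
            for pos in range(3):
                base = window[offset + 3 * i + pos]
                score += weights[pos] * table[pos].get(base, 0.0)
        absolute.append(score)
    total = sum(absolute)
    if total > 0:
        relative = [s / total for s in absolute]
    else:
        relative = [1.0 / 3.0] * 3   # no information in this window
    return absolute, relative
```
.end lit
.para
Setting the weights to (1.0, 0.0, 0.0) and so on shows the separate
contribution of each codon position.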
.para
|
|
At each point along the sequence
|
|
that the program has a
|
|
point to plot it finds which of the three values is highest and places a
|
|
single point at the 50% level for the corresponding frame. These single
|
|
points will join to form a solid line if one frame is consistently the
|
|
highest scoring. In addition stop codons are shown as short vertical lines
|
|
that bisect the 50%
|
|
level of probability. When looking for coding regions
|
|
the user should look for solid horizontal lines at the
|
|
50% level that are not interrupted by these short vertical lines.
|
|
|
|
The absolute mean
|
|
values expected on the complement of
|
|
the coding strand (and in the same frame)
|
|
are 5% lower than those on the coding strand but the relative values
|
|
are the same on both strands. Although the
|
|
relative values give smoother plots and tend to emphasize the coding
|
|
frame
|
|
they therefore cannot be used to decide which strand is coding. The
|
|
absolute values plot should be used for this purpose but bearing in mind
|
|
the fact that the differences between strands are quite small.
|
|
.para
|
|
The method has been improved in two overall ways: first it now allows
|
|
users to
|
|
define their own typical amino acid composition by selecting a standard
|
|
sequence from within the sequence they are analysing or from a codon table;
|
|
secondly it allows the inclusion of third position preferences.
|
|
Again these third position preferences are defined by the use of an
|
|
internal standard sequence. Not only can users define their own standards
|
|
but they can also give weights to each of the three positions in codons.
|
|
This allows different emphasis to be used for each of the three positions.
|
|
As an example of its use, by giving, in turn, weights of 1.0, 0.0, 0.0, and
|
|
0.0, 1.0, 0.0, and finally 0.0, 0.0, 1.0, you could see the separate
|
|
contribution made by each of the three positions. It is also possible to
|
|
use the third position preferences with the values for the first two
|
|
positions taken from the "global" amino acid composition.
|
|
In all cases users may choose to plot
|
|
absolute or relative values. The expected scores are displayed before
|
|
each
|
|
analysis and scales are drawn on the plots.
|
|
At present this method does not give probabilities of coding; it has only
|
|
been tested for its ability to choose the correct reading frame (see
|
|
above). It could be used to give probabilities of coding if it was applied to
|
|
all known coding and non-coding sequences in the way that the uneven
|
|
positional base frequencies method was. It is designed to be used in
|
|
conjunction with that method. Note that the average amino acid composition
|
|
used
|
|
to derive the base frequencies was changed on 17-11-1988, to be
|
|
the new average given by McCaldon and Argos in Proteins 4 99-122
|
|
(1988).
|
|
A further change is to allow users to select their own scales for
|
|
producing the plots. It can be helpful if they want to emphasise or
|
|
diminish
|
|
certain features.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
? Menu or option number=D43
|
|
Positional base preferences method to find protein genes
|
|
Select standard source
|
|
X 1 Use global standard
|
|
2 Use internal standard
|
|
3 Use codon usage table
|
|
? Selection (1-3) (1) =2
|
|
Define region for standard
|
|
? start (0-8134) (0) =3171
|
|
? end (3172-8134) (8134) =4700
|
|
Select normalisation
|
|
X 1 Use observed frequencies
|
|
2 Combine with global standard
|
|
? Selection (1-2) (1) =1
|
|
T C A G Range
|
|
1 0.125 0.249 0.230 0.397 0.272
|
|
2 0.298 0.245 0.292 0.165 0.132
|
|
3 0.288 0.313 0.169 0.230 0.144
|
|
? (y/n) (y) Use 1.0 for positional weights
|
|
Give weights between 0.0 and 1.0
|
|
to each of the 3 codon positions
|
|
? Position 1 (0.00-1.00) (1.00) =
|
|
? Position 2 (0.00-1.00) (1.00) =
|
|
? Position 3 (0.00-1.00) (1.00) =
|
|
Expected scores per codon in each frame
|
|
0.136 0.122 0.123
|
|
? odd span length (31-101) (67) =
|
|
? plot interval (1-11) (5) =
|
|
? (y/n) (y) Plot relative scores
|
|
Scaling values:
|
|
Minimum maximum range
|
|
0.3121 0.3656 0.0382
|
|
? (y/n) (y) Leave scaling values unchanged
|
|
|
|
Graphics not shown
|
|
|
|
? Menu or option number=D43
|
|
Positional base preferences method to find protein genes
|
|
Select standard source
|
|
X 1 Use global standard
|
|
2 Use internal standard
|
|
3 Use codon usage table
|
|
? Selection (1-3) (1) =3
|
|
? File name of standard=atpase.cods
|
|
===========================================
|
|
F TTT 21. S TCT 33. Y TAT 15. C TGT 5.
|
|
F TTC 55. S TCC 40. Y TAC 40. C TGC 4.
|
|
L TTA 8. S TCA 7. * TAA 8. * TGA 0.
|
|
L TTG 19. S TCG 12. * TAG 1. W TGG 17.
|
|
===========================================
|
|
L CTT 22. P CCT 17. H CAT 6. R CGT 73.
|
|
L CTC 21. P CCC 4. H CAC 30. R CGC 23.
|
|
L CTA 1. P CCA 10. Q CAA 19. R CGA 5.
|
|
L CTG 168. P CCG 48. Q CAG 80. R CGG 3.
|
|
===========================================
|
|
I ATT 47. T ACT 14. N AAT 17. S AGT 8.
|
|
I ATC 98. T ACC 54. N AAC 52. S AGC 26.
|
|
I ATA 6. T ACA 7. K AAA 85. R AGA 0.
|
|
M ATG 75. T ACG 13. K AAG 28. R AGG 0.
|
|
===========================================
|
|
V GTT 67. A GCT 56. D GAT 41. G GGT 90.
|
|
V GTC 29. A GCC 53. D GAC 66. G GGC 66.
|
|
V GTA 49. A GCA 59. E GAA 101. G GGA 5.
|
|
V GTG 57. A GCG 64. E GAG 41. G GGG 8.
|
|
===========================================
|
|
Select normalisation
|
|
X 1 Use observed frequencies
|
|
2 Combine with global standard
|
|
? Selection (1-2) (1) =2
|
|
T C A G Range
|
|
1 0.177 0.211 0.277 0.336 0.159
|
|
2 0.271 0.238 0.310 0.182 0.128
|
|
3 0.242 0.301 0.168 0.289 0.132
|
|
? (y/n) (y) Use 1.0 for positional weights
|
|
Expected scores per codon in each frame
|
|
0.785 0.736 0.736
|
|
? odd span length (31-101) (67) =
|
|
? plot interval (1-11) (5) =
|
|
? (y/n) (y) Plot relative scores
|
|
Scaling values:
|
|
Minimum maximum range
|
|
0.3219 0.3519 0.0214
|
|
? (y/n) (y) Leave scaling values unchanged
|
|
|
|
Graphics not shown
|
|
.end lit
|
|
.left margIN1
|
|
@44. TX 6 @ Uneven positional base frequencies.
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find regions of a sequence that might be coding for a protein. The
|
|
method looks for sections of the sequence in which the frequency at
|
|
which each of the four bases occupies the three positions in codons is
|
|
nonrandom. The level of nonrandomness is plotted on a scale that shows
|
|
the probability that the sequence is coding. At each position along a
|
|
sequence the calculation gives the same value for all six possible reading
|
|
frames, so only one value is plotted.
|
|
.para
|
|
Define the window length and plot interval.
|
|
.para
|
|
The results are plotted in a box divided by a horizontal line marked "76%".
|
|
76% of coding regions achieve values above this line and 76% of
|
|
noncoding regions achieve scores below the line.
|
|
.para
|
|
This method, first described in Staden R. Nucl. Acid Res. 12 551-567
|
|
1984,
|
|
looks for uneven positional
|
|
usage of bases in codons.
|
|
It looks through the sequence in one fixed
|
|
phase and counts the number of times each base appears in each of the
|
|
three
|
|
codon positions: for each window position it counts A1,A2,A3 and
|
|
C1,C2,C3
|
|
and G1,G2,G3 and T1,T2,T3 and calculates AMEAN=(A1+A2+A3)/3, and
|
|
similarly
|
|
CMEAN, GMEAN
|
|
and TMEAN; it then calculates
|
|
ADIF=abs(A1-AMEAN)+abs(A2-AMEAN)+abs(A3-AMEAN) and similarly
|
|
CDIF, GDIF and
|
|
TDIF to measure the differences between an even base usage for all
|
|
positions in the codons and the observed usage. The routine then
|
|
calculates
|
|
the sum ADIF+CDIF+GDIF+TDIF and plots this value on the following scale:
|
|
the base level is such that no known window in a coding region has a
|
|
lower
|
|
value, whereas 14% of windows in noncoding sequences score below it.
|
|
The
|
|
top of the scale is not achieved by any known noncoding
|
|
region, but is reached by 16% of known coding regions.
|
|
The bar drawn across the
|
|
plot corresponds to a level that is exceeded by 76% of windows in known
|
|
coding regions
|
|
but is reached by only 24% of windows in known noncoding regions. i.e.
|
|
76% of
|
|
coding windows score above and 76% of noncoding windows score below.
|
|
This is similar to Fickett's method but without
|
|
the probabilities and weightings from the Los Alamos sequence library: it
|
|
is therefore unbiased but may well give very similar results.
|
|
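.para
The sum ADIF+CDIF+GDIF+TDIF for one window can be sketched in Python
as below. The function name is hypothetical, and the conversion of the
sum to the plotted probability scale, which rests on the library
statistics quoted above, is not reproduced.
.lit
```python
def positional_unevenness(window):
    """Return ADIF+CDIF+GDIF+TDIF for a window read in one fixed phase."""
    counts = {base: [0, 0, 0] for base in "ACGT"}
    for i, base in enumerate(window.upper()):
        if base in counts:              # skip ambiguity codes
            counts[base][i % 3] += 1    # codon position 1, 2 or 3

    total = 0.0
    for per_position in counts.values():
        mean = sum(per_position) / 3.0  # e.g. AMEAN=(A1+A2+A3)/3
        total += sum(abs(x - mean) for x in per_position)
    return total
```
.end lit
.para
A window with perfectly even positional usage scores zero; the more
each base is confined to one codon position, the higher the value.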
.left margin1
|
|
@45. TX 6 @ Codon improbability on base composition
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find regions of a sequence that might code for a protein.
|
|
.para
|
|
If dialogue is requested define a window length and plot interval.
|
|
.para
|
|
The idea of the method is, that of all sequence features
|
|
that we know, it is only
|
|
coding regions that will give rise to codon biases well above those
|
|
expected
|
|
from the base composition.
|
|
If a region of sequence shows sufficiently strong
|
|
codon bias then we conclude that it is coding for a protein.
|
|
Using the multinomial distribution we
|
|
have derived a function to measure the improbability of observing a
|
|
set of codons from a sequence of the given composition. Using the
|
|
Poisson
|
|
distribution we have worked out the distribution
|
|
of the improbability. The program plots the observed improbability minus
|
|
the expected improbability (the mean as calculated from the Poisson
|
|
distribution). The plots are presented against a scale of units of standard
|
|
deviation as measured from the Poisson distribution. As with the other
|
|
Staden and McLachlan method the program puts an extra point at a fixed
|
|
level for the highest of the three probabilities; for this function this
|
|
point is placed at six standard deviations above the mean expected level.
|
|
The top of each plot corresponds to 12 standard deviations above the
|
|
expected level and the bottom corresponds to the expected value.
|
|
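.para
The function derived by the authors is not given here. As a rough
illustration of the idea only, the sketch below computes the
multinomial log probability of the codon counts observed in a window
when each codon's probability is predicted from the base composition
of that window alone. The function name is hypothetical, the choice of
the window's own composition is an assumption, and the expected value
and standard deviation used for the plotted scale are not reproduced.
.lit
```python
import math
from collections import Counter


def codon_improbability(window):
    """Return minus the log multinomial probability of the codon counts.

    Codon probabilities are the products of the base frequencies
    observed in the window. Only A, C, G and T are handled.
    """
    window = window.upper()
    codons = [window[i:i + 3] for i in range(0, len(window) - 2, 3)]
    base_freq = {b: window.count(b) / len(window) for b in "ACGT"}
    counts = Counter(codons)

    # log of n! / (k1! k2! ...) * q1^k1 * q2^k2 * ...
    logp = math.lgamma(len(codons) + 1)
    for codon, k in counts.items():
        q = base_freq[codon[0]] * base_freq[codon[1]] * base_freq[codon[2]]
        logp += k * math.log(q) - math.lgamma(k + 1)
    return -logp
```
.end lit
.para
Larger values mean that the observed set of codons is less likely to
have arisen from the base composition alone.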
.para
|
|
Analysis of the application
|
|
of the method to the EMBL sequence library indicates that the method
|
|
does
|
|
work for most sequences and that the levels of improbability roughly
|
|
correlate with levels of expression.
|
|
Coding regions will show high peaks in all three frames making
|
|
interpretation more difficult than for some of the other methods.
|
|
.left margin1
|
|
@46. TX 6 @ Codon improbability on amino acid composition
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find regions of a sequence that might code for a protein.
|
|
.para
|
|
If dialogue is requested define a window length and a plot interval.
|
|
.para
|
|
The idea of the method is, that of all sequence features
|
|
that we know, it is only
|
|
coding regions that will give rise to codon biases such that, for each
|
|
amino acid, some codons are used far more frequently than others. The
|
|
method is independent of what the bias actually is, requiring only that it
|
|
is present.
|
|
If a region of sequence shows sufficiently strong
|
|
codon bias then we conclude that it is coding for a protein.
|
|
Using the multinomial distribution we
|
|
have derived a function to measure the improbability of observing a
|
|
set of codons from a sequence of the given composition. Using the
|
|
Poisson
|
|
distribution we have worked out the distribution
|
|
of the improbability. The program plots the observed improbability minus
|
|
the expected improbability (the mean as calculated from the Poisson
|
|
distribution). The plots are presented against a scale of units of standard
|
|
deviation as measured from the Poisson distribution. As with the other
|
|
Staden and McLachlan method the program puts an extra point at a fixed
|
|
level for the highest of the three probabilities; for this function this
|
|
point is placed at six standard deviations above the mean expected level.
|
|
The top of each plot corresponds to 12 standard deviations above the
|
|
expected level and the bottom corresponds to the expected value.
|
|
.left margin1
|
|
@47. TX 6 @ Shepherd RNY preference method
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find regions of a sequence that might code for a protein. Based on
|
|
the method of Shepherd
|
|
(PNAS 78 1596-1600, 1981).
|
|
.para
|
|
If dialogue is requested define a window length and plot interval.
|
|
.para
|
|
Shepherd has found that
|
|
many genes have a preference for the use of codons of the form RNY
|
|
where
|
|
R=purine, Y=pyrimidine and N=any base. He has attributed this to being
|
|
due
|
|
to remnants of a primitive genetic code. The calculation is similar to that
|
|
for the Staden and McLachlan method, the p1's being simply the number of
|
|
RNY codons found in frame 1 etc and the P's being p/(p1+p2+p3).
|
|
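.para
Counting RNY codons in the three frames of one window can be sketched
in Python as below. The function name is hypothetical, and the window
is assumed to hold n codons plus two extra bases so that all three
frames contain the same number of codons.
.lit
```python
PURINES = "AG"
PYRIMIDINES = "CT"


def rny_frame_fractions(window):
    """Return (counts, fractions) of RNY codons for the three frames."""
    window = window.upper()
    n = (len(window) - 2) // 3
    counts = []
    for offset in range(3):
        found = 0
        for i in range(n):
            codon = window[offset + 3 * i:offset + 3 * i + 3]
            # R = purine first, N = any base, Y = pyrimidine last.
            if codon[0] in PURINES and codon[2] in PYRIMIDINES:
                found += 1
        counts.append(found)      # p1, p2, p3
    total = sum(counts)
    fractions = [c / total for c in counts] if total else [0.0] * 3
    return counts, fractions
```
.end lit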
.left margIN1
|
|
@48. TX 6 @ Ficketts method
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find regions of a sequence that might code for a protein. Based on
|
|
the method of Fickett
|
|
(Nucl. Acid Res. 10
|
|
1982), but plots values for fixed window lengths rather than over the
|
|
whole of open reading frames.
|
|
.para
|
|
If dialogue is requested define a window length and plot interval. The
|
|
results are plotted in a box divided into three horizontal strips.
|
|
.para
|
|
Sections of the sequence with values plotted in the top strip of the box
|
|
are adjudged to be coding, those in the middle strip "no decision", and
|
|
those in the bottom "not coding".
|
|
.para
|
|
The program performs the following calculations: let A1 = the number of
|
|
occurrences of base A in position 1 of codons, A2 for position 2 etc.
|
|
Similarly for bases C,G and T. For each window position calculate
|
|
Apos=max(A1,A2,A3)/(min(A1,A2,A3)+1). Similarly for C,G and T to give 4
|
|
positional values. Also count the base composition for the window to
|
|
give
|
|
Acomp, Ccomp etc. Fickett tested each of these 8 parameters singly as
|
|
to
|
|
their ability to distinguish coding from noncoding regions and arrived at
|
|
probabilities of coding for the range of values each can take = Pcod. He
|
|
also measured their relative abilities and gave weightings to each of
|
|
the 8 parameters = Pw. To calculate the "TESTCODE" for a window we
|
|
first lookup the Pcod for each of the calculated compositional and
|
|
positional values and then calculate TESTCODE=sum(Pcod*Pw). TESTCODE
|
|
is
|
|
plotted relative to three levels of decision: the top division="coding",
|
|
the middle="no opinion" and the bottom division="non coding".
|
|
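.para
The eight parameters for one window can be sketched in Python as
below. The function name is hypothetical. The positional value is read
as max/(min+1), which keeps it defined when a base is absent from one
codon position. Fickett's lookup tables for Pcod and the weights Pw are
not reproduced, so the sketch stops short of the TESTCODE sum.
.lit
```python
def fickett_parameters(window):
    """Return (position, composition) dicts for bases A, C, G and T."""
    window = window.upper()
    position = {}
    composition = {}
    for base in "ACGT":
        # Occurrences of the base at codon positions 1, 2 and 3.
        per_position = [window[p::3].count(base) for p in range(3)]
        position[base] = max(per_position) / (min(per_position) + 1)
        composition[base] = (window.count(base) / len(window)
                             if window else 0.0)
    return position, composition
```
.end lit
.para
Each of the eight values would then be looked up to give its Pcod, and
the weighted sum formed to give TESTCODE.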
.left margin1
|
|
@49. TX 6 @ tRNA gene search.
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to find segments of a sequence that might code for tRNAs. Looks for
|
|
potential cloverleaf forming structures and then for the presence of
|
|
expected conserved bases. Presents results graphically or draws out the
|
|
cloverleafs.
|
|
.para
|
|
If dialogue is requested a large number of parameters need to be given
|
|
values, including some loop lengths, scores for each of the four stems,
|
|
and scores for the conserved bases.
|
|
.para
|
|
The program was first described in
|
|
Staden Nucl. Acid Res 817-825 (1980).
|
|
The tRNA's that have
|
|
been sequenced so far have two characteristics that can be used
|
|
to
|
|
locate their genes within long DNA sequences. Firstly they have a
|
|
common secondary structure - the cloverleaf - and secondly,
|
|
particular bases almost always appear at certain positions in
|
|
the
|
|
cloverleaf. The cloverleaf is composed of four base-paired
|
|
stems
|
|
and four loops. Three of the stems are of fixed length but the
|
|
fourth, the dhu stem which usually has four base pairs,
|
|
sometimes
|
|
has only three. All of the loops can vary in size. The following
|
|
relationships between the stems in the cloverleaf are assumed in
|
|
the
|
|
program: (a) there are no bases between one end of the
|
|
aminoacyl
|
|
stem and the adjoining tuc stem; (b) there are two bases
|
|
between
|
|
the aminoacyl stem and the dhu stem; (c) there is one base
|
|
between
|
|
the dhu stem and the anticodon stem; (d) there are at least three
|
|
bases between the anticodon stem and the tuc stem.
|
|
The program looks first for cloverleaf structure and then, if
|
|
required, for conserved bases. The sizes of the loops, the number
|
|
of basepairs in the stems and the required conserved bases may
|
|
all
|
|
be specified by the user. The process of looking for the presence
|
|
of conserved bases can reduce the number of potential
|
|
structures
|
|
found considerably.
|
|
The
|
|
user may also specify that an intron may be present in the
|
|
anticodon
|
|
loop.
|
|
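.para
Stems are judged by a base pairing score in which G-C and A-T pairs
count 2 and G-T pairs count 1. Scoring a single candidate stem can be
sketched in Python as below. The names are hypothetical, and both arms
are assumed to be written 5' to 3' as they appear in the DNA sequence,
so the first base of one arm pairs with the last base of the other.
.lit
```python
PAIR_SCORE = {
    ("G", "C"): 2, ("C", "G"): 2,
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}


def stem_score(five_prime_arm, three_prime_arm):
    """Return the base pairing score of two arms of equal length."""
    if len(five_prime_arm) != len(three_prime_arm):
        raise ValueError("stem arms must have the same length")
    pairs = zip(five_prime_arm.upper(), reversed(three_prime_arm.upper()))
    return sum(PAIR_SCORE.get(pair, 0) for pair in pairs)
```
.end lit
.para
A fully paired seven base pair stem therefore scores at most 14, and a
five base pair stem at most 10.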
.para
|
|
The user may define a minimum number of
|
|
base pairs for each stem using the scoring system G-C, A-T=2 and G-T=1
|
|
and
|
|
scores for each of the conserved bases. Recommended values for the stem
|
|
scores are given by the prompts and the percentage conservation of the
|
|
conserved bases as found in the Nucl. Acid Res 1979 paper by Gauss, Gruter
|
|
and Sprinzl are also given,
|
|
but the user must decide which bases are most
|
|
likely to be conserved for the sequence being examined.
|
|
The output shows the position of the possible gene in the sequence by a
|
|
vertical line the height of which shows the number of basepairs made in
|
|
the
|
|
stems. The cloverleaf structure is also drawn but will scroll up off the
|
|
screen. The cloverleaf output looks like this:
|
|
.lit
|
|
|
|
6942
|
|
A
|
|
A-U
|
|
A-U
|
|
G-C
|
|
A-U
|
|
U-A
|
|
A-U
|
|
U-A AAU
|
|
U UAUCU
|
|
AA A !!!!!
|
|
AAUG AUAGA A
|
|
U !!!! U UCA
|
|
C UUAC U
|
|
AA A
|
|
U-AA A
|
|
A-U
|
|
A-U
|
|
C-G
|
|
U-A
|
|
U A
|
|
U A
|
|
GUC
|
|
|
|
Typical dialogue follows.
|
|
|
|
? Menu or option number=D49
|
|
tRNA search
|
|
? Maximum trna length (70-130) (92) =
|
|
? Aminoacyl stem score (0-14) (11) =
|
|
? Tu stem score (0-10) (8) =
|
|
? Anticodon stem score (0-10) (8) =
|
|
? D stem score (0-8) (3) =
|
|
? Minimum base pairing total (30-32) (32) =
|
|
? Minimum intron length (0-30) (0) =
|
|
? Minimum length for TU loop (4-12) (6) =
|
|
? Maximum length for TU loop (6-12) (9) =
|
|
? (y/n) (y) Skip search for conserved bases n
|
|
Give a score for each base, then a minimum total at the end
|
|
? Base 8, T is 100% conserved. Score (0-100) (0) =
|
|
? Base 10, G is 95% conserved. Score (0-100) (0) =
|
|
? Base 11, Y is 96% conserved. Score (0-100) (0) =
|
|
? Base 14, A is 100% conserved. Score (0-100) (0) =
|
|
? Base 15, R is 100% conserved. Score (0-100) (0) =
|
|
? Base 21, A is 97% conserved. Score (0-100) (0) =
|
|
? Base 32, Y is 100% conserved. Score (0-100) (0) =
|
|
? Base 33, T is 98% conserved. Score (0-100) (0) =
|
|
? Base 37, A is 91% conserved. Score (0-100) (0) =
|
|
? Base 48, Y is 100% conserved. Score (0-100) (0) =
|
|
? Base 53, G is 100% conserved. Score (0-100) (0) =
|
|
? Base 54, T is 95% conserved. Score (0-100) (0) =
|
|
? Base 55, T is 97% conserved. Score (0-100) (0) =
|
|
? Base 56, C is 100% conserved. Score (0-100) (0) =
|
|
? Base 57, R is 100% conserved. Score (0-100) (0) =
|
|
? Base 58, A is 100% conserved. Score (0-100) (0) =
|
|
? Base 60, Y is 92% conserved. Score (0-100) (0) =
|
|
? Base 61, C is 100% conserved. Score (0-100) (0) =
|
|
? Minimum total conserved base score (0-0) (0) =
|
|
? (y/n) (y) Plot results n
|
|
|
|
Searching
|
|
|
|
306
|
|
C
|
|
C-G
|
|
C-G
|
|
G-C
|
|
T-A
|
|
C-G
|
|
A-T
|
|
T+G AT
|
|
A ATACA
|
|
TTC T !!!! G
|
|
CTGT TATGG G
|
|
G ! ! T GA
|
|
C TAAA C
|
|
GCG C G
|
|
T+GA C
|
|
C-G C T
|
|
T+G A T
|
|
T-A G T
|
|
T-A G A
|
|
G G G C
|
|
A A G A
|
|
AGC T C
|
|
A T
|
|
C T
|
|
A
|
|
C T
|
|
|
|
|
|
.end lit
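.para
The stem scoring described above can be sketched as follows. This is an
illustrative sketch only, not the program's own code: the function name
stem_score and the convention that both arms are written 5' to 3' are
assumptions. The pair scores (G-C and A-T = 2, G-T = 1) are those given in the
text.
.lit
```python
# Pair scores as stated in the text: G-C and A-T score 2, G-T scores 1.
PAIR_SCORES = {
    ("G", "C"): 2, ("C", "G"): 2,
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}


def stem_score(five_prime_arm, three_prime_arm):
    """Score a stem whose two arms are both written 5' to 3' (hypothetical helper)."""
    if len(five_prime_arm) != len(three_prime_arm):
        raise ValueError("stem arms must have equal length")
    # Base i of the first arm pairs with base i counted from the end of the second.
    pairs = zip(five_prime_arm.upper(), reversed(three_prime_arm.upper()))
    return sum(PAIR_SCORES.get(pair, 0) for pair in pairs)
```
.end lit
.para
With this scoring a perfectly paired seven base pair stem scores 14, which is
the upper limit offered by the aminoacyl stem prompt in the dialogue above.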
|
|
.left margIN1
|
|
@50. TX 7 @ Plot start codons
|
|
.left margin2
|
|
.para
|
|
This function plots the positions of all start codons for each of the three
|
|
reading frames.
|
|
.left margin1
|
|
@51. TX 7 @ Plot stop codons
|
|
.left margin2
|
|
.para
|
|
This function plots the positions of all stop codons for each of the three
|
|
reading frames.
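.para
As a rough illustration of what is plotted, the sketch below collects stop
codon positions for the three reading frames. It assumes the standard stop
codons TAA, TAG and TGA and numbers frames by the position of the first base
of the codon; the function name is hypothetical and the program's own
conventions may differ.
.lit
```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # assumption: standard genetic code


def stop_codon_positions(seq):
    """Return {frame: [1-based position of the first base of each stop codon]}."""
    seq = seq.upper().replace("U", "T")
    positions = {1: [], 2: [], 3: []}
    for i in range(len(seq) - 2):
        if seq[i:i + 3] in STOP_CODONS:
            positions[i % 3 + 1].append(i + 1)
    return positions
```
.end lit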
|
|
.left margIN1
|
|
@52. TX 7 @ Plot stop codons on the complementary strand
|
|
.left margin2
|
|
.para
|
|
This function plots the positions of all stop codons for each of the three
|
|
reading frames on the complementary strand.
|
|
.left margin1
|
|
@53. TX 7 @ Plot stop codons on both strands
|
|
.left margin2
|
|
.para
|
|
This function plots the positions of all stop codons for each of the three
|
|
reading frames on both strands.
|
|
.left margin1
|
|
@54. TX 5 @ Search for longest open reading frames
|
|
.left margin2
|
|
.para
|
|
This function will report the positions of the ends of
|
|
all sections of sequence that contain no stop codons. All six reading
|
|
frames are examined. Results are presented in the form of an EMBL feature
|
|
table. Hence if the results are stored in a file by use of "direct output
|
|
to disk", the file
|
|
can be used to translate the
|
|
open reading frames in a sequence.
|
|
Note that in order for the file to be used as a feature table it
|
|
must include either EMBL
|
|
or GenBank headers, and a suitable "tail". The simplest header is the word
|
|
FEATURES starting in column 1 of the first line of the file. The simplest
|
|
tail is 2 empty lines at the end of the file. These lines are not included
|
|
when nip writes out results in feature table format.
|
|
.para
|
|
Define the minimum length of open reading frame to report (in amino
|
|
acids).
|
|
Choose to search either or both strands. The program displays the end
|
|
points, the reading frame and strand.
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
|
|
? Menu or option number=D54
|
|
Find open reading frames
|
|
? Minimum open frame in amino acids (5-1000) (30) =100
|
|
|
|
X 1 + strand only
|
|
2 - strand only
|
|
3 Both strands
|
|
? 0,1,2,3 =3
|
|
|
|
FT CDS 1 831 1 831
|
|
FT CDS 1540 2853 1 1314
|
|
FT CDS 3130 4242 1 1113
|
|
FT CDS 5761 6114 1 354
|
|
FT CDS 6187 6711 1 525
|
|
FT CDS 1766 2077 2 312
|
|
FT CDS 2078 2446 2 369
|
|
FT CDS 4136 5500 2 1365
|
|
FT CDS 1335 1637 3 303
|
|
FT CDS 2844 3194 3 351
|
|
FT CDS 6819 7238 3 420
|
|
FT CDS 2073 1711 C 1 363
|
|
FT CDS 2469 2149 C 1 321
|
|
FT CDS 6542 6144 C 3 399
|
|
|
|
.end lit
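.para
The search can be sketched as below. This is a minimal sketch under stated
assumptions, not the program's code: the function names are hypothetical, the
standard stop codons are assumed, a reported stretch excludes the stop codon
that ends it, and minus strand results are mapped back to plus strand
coordinates so that the start is greater than the end, as in the listing
above.
.lit
```python
STOP_CODONS = {"TAA", "TAG", "TGA"}        # assumption: standard genetic code
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def open_frames(seq, min_codons=30):
    """Return (start, end, frame) for each stop-free run of at least min_codons codons.

    Positions are 1-based and refer to the strand that was passed in.
    """
    seq = seq.upper()
    found = []
    for frame in range(3):
        run_start = frame
        # Walk the frame codon by codon; a stop codon closes the current run.
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOP_CODONS:
                if (i - run_start) // 3 >= min_codons:
                    found.append((run_start + 1, i, frame + 1))
                run_start = i + 3
        # The end of the last whole codon closes the final run.
        last = frame + ((len(seq) - frame) // 3) * 3
        if (last - run_start) // 3 >= min_codons:
            found.append((run_start + 1, last, frame + 1))
    return found


def open_frames_both_strands(seq, min_codons=30):
    """Search both strands; minus strand hits use plus strand coordinates."""
    n = len(seq)
    plus = [(s, e, f, "+") for s, e, f in open_frames(seq, min_codons)]
    reverse = seq.upper().translate(COMPLEMENT)[::-1]
    minus = [(n - s + 1, n - e + 1, f, "-")
             for s, e, f in open_frames(reverse, min_codons)]
    return plus + minus
```
.end lit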
|
|
.left margin1
|
|
@55. TX 8 @ Search for E. coli promoter (general)
|
|
.LEFT MARGIN2
|
|
.para
|
|
Searches for E. coli promoter-like sequences using a standard weight
|
|
matrix. The positions of the matches are plotted. No dialogue is required.
|
|
.para
|
|
The method was first described in
|
|
Staden R. Nucl. Acid Res. 12 505-519 1984.
|
|
This search uses a weight matrix taken from the frequency tables
|
|
contained
|
|
in Hawley, D. K. and McClure, W. R., Nucl. Acids Res. 11, 2237-2255 (1983).
|
|
The weight matrix is
|
|
divided into 3 sections that are separated by gaps of varying size: the -35
region, the -10 region and the +1 region.
|
|
The algorithm first looks for a sufficiently good -35 region, then for the
|
|
best -10 region within range and then for the best +1 region within range
|
|
of the -10; each separate region must score above the lowest known
|
|
score
|
|
for the corresponding section. The gap penalty is then applied and two
|
|
plots
|
|
produced: one with gap penalties, one without.
|
|
Scaling is such that no
|
|
known promoter scores below the bottom level and no known promoter
|
|
scores
|
|
above the top level when the weight matrix is applied.
|
|
.para
|
|
Two other functions also look for E. coli promoters: 56 looks for sites on
the complementary strand and 57 looks for individual -35 and -10 regions
and plots them on a scale such that the top is the highest known value +10%
and the bottom is the lowest known value -10%.
|
|
.LEFT MARGIN1
|
|
.lit
|
|
weights for E. coli promoters
|
|
-35 region:
|
|
P -50-49-48-47-46-45-44-43-42-41-40-39-38-37-36-35-34-33-32-31-30-29-28-27-26
|
|
|
|
107109109110110110110110110111111110111112112112112112112112112112112112112
|
|
T 41 33 32 25 34 22 35 35 42 27 32 42 47 14 92 94 11 19 15 37 46 34 38 48 34
|
|
C 22 27 18 29 20 14 20 12 22 23 16 25 10 43 7 6 11 18 60 8 25 23 23 17 20
|
|
A 28 38 30 37 35 56 42 42 37 42 39 18 25 26 2 6 2 72 26 50 26 34 25 26 31
|
|
G 16 11 29 19 21 18 13 21 9 19 24 26 29 29 11 6 88 3 11 17 15 21 26 21 27
|
|
-10 region:
|
|
P -23-22-21-20-19-18-17-16-15-14-13-12-11-10 -9 -8 -7 -6 -5
|
|
112112112112112112112112112112112112112112112112112112112
|
|
T 35 28 28 27 39 51 34 43 26 31 89 3 49 15 19108 31 29 21
|
|
C 34 21 24 27 12 25 20 25 20 27 10 2 16 14 22 3 13 16 30
|
|
A 20 39 33 33 39 23 29 16 23 19 2106 29 66 57 1 35 23 31
|
|
G 23 24 27 25 22 13 29 28 43 35 11 1 18 17 14 0 33 24 30
|
|
+1 region:
|
|
P -2 -1 1 2 3 4 5 6 7 8 9 10
|
|
86 88 85 88 88 88 88 88 88 88 88 88
|
|
T 16 22 2 42 27 23 20 25 27 15 16 29
|
|
C 29 49 4 25 25 13 18 22 17 17 16 17
|
|
A 20 9 45 16 24 25 28 24 24 32 35 26
|
|
G 21 8 37 5 12 27 22 17 20 24 21 16
|
|
.end lit
|
|
Notes:
|
|
E. coli promoters have been shown to contain 2 regions of conserved sequence
located about 10 and 35 bases upstream of the transcription start site. These
are TATAAT and TTGACA, with an allowed spacing of 15 to 21 bases between them.
The spacing with maximum efficiency was 17 bases, and all but 12 of the 112
sequences could be aligned with a separation of 17 +/- 1 bases. The standard
promoter has spacings of 7 and 17 bases between the start site and the -10
region, and between the -10 and -35 regions, respectively. The spacing between
the -10 region and the start site is usually 6 or 7 bases but varies between
4 and 8 bases. There is an AT-rich region of 8 to 10 bases upstream of the -35
region. Initiation with a purine is strongly preferred, with G being used if A
is not present.
|
|
.lit
|
|
Gap penalties:
|
|
15 0.02 (only exists as mutant)
|
|
16 0.2
|
|
17 1.0
|
|
18 0.2
|
|
19 0.05 (guess)
|
|
20 0.02 (guess)
|
|
21 0.01 (guess)
|
|
.end lit
|
|
.left margin1
|
|
@56. TX 8 @ Search for E. coli promoter (general), complementary strand
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function searches for E. coli promoters on the complementary strand
|
|
of
|
|
the sequence. See the notes on option 55.
|
|
.left margin1
|
|
@57. TX 8 @ Search for E. coli promoter sequences. (-35 and -10)
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function searches separately for the -35 and -10 sequences of an E.
|
|
coli promoter. See the notes on option 55.
|
|
.left margIN1
|
|
@58. TX 8 @ Search for procaryotic ribosome binding sites
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function searches for the 5' ends of prokaryotic genes using an
|
|
unusual weight matrix. The search is relatively slow because the matrix
|
|
is 101 bases in length. No dialogue is required.
|
|
.para
|
|
The method was first described in
|
|
Staden Nucl. Acid Res. 12 505-519 1984. This actually looks for more
|
|
than
|
|
a ribosome binding site, as is explained below. It uses the weight
matrix w101 of Stormo, Schneider, Gold and Ehrenfeucht
(Nucl. Acids Res. 10, 2997-3011, 1982),
which with a threshold of 2 finds all gene starts in their library.
|
|
.LEFT MARGIN1
|
|
.lit
|
|
P-60-59-58-57-56-55-54-53-52-51-50-49-48-47-46-45-44-43-42-41-40-39-38-37-36
|
|
T 5 1 -3 9-14 7 15 -5 3-16-17 4 18 5 -3 -1 2 4 5 -5 7 8 -5-15 6
|
|
C-21 -6-11-21 0 8 -7-12 -1 1 0-19 12 -3 -1 10 2 -8 -5-11 8 1 23 6 -5
|
|
A 7 -2 13 -2 -8-13-18 5 0 -5 13 8-15 9 -4 -7 9 0 -8-11-10 -6 -7 -5 -6
|
|
G -6 -9 -7 0 8-16 -4 -2-16 1 -4 8-14 5 11-13-24 3 7 22-11 -9-15 10 -4
|
|
|
|
P-35-34-33-32-31-30-29-28-27-26-25-24-23-22-21-20-19-18-17-16-15-14-13-12-11
|
|
T 3 4 16 -4 7 11 -4 -1 12 8 10 -1 1 8 2-10-16 11 1 -3 16 -3-36 -8-27
|
|
C 2-14 -3 -8-10-21 2 0 -2 -1-11 -3 -1 5-11 -4 7 0-14 6 -8-20 -7-36-44
|
|
A-12 -1-27 -3 -6 0-12 -3 -4 -7 14 -2 -4 -6 0 12 5 -9 0-11-11 10 8 2 8
|
|
G 4 -5 -6 -3 -1 -4 -1 -4-15 0-14 3 10-19 -3-10 -7 -7 7 1 -8 -6 15 21 42
|
|
|
|
P-10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14
|
|
T-53-27-26-23 2 -7-14-40-28 0-53 75-62-20-40-10-35 -5-12 -1 4 14-23 7 -2
|
|
C-15-50-43-35-38-29-29 1 -9 1-87-55-64-45 11-22-14-20-15-15-10-22 -5 2 6
|
|
A 0 -3 -5 4-20-11 5 6 -2-15 66-69-52 -5 -4 6 8-24 -7-10 -7 13 14 -9-18
|
|
G 35 22 16 -6 -5-15-25-33-28-53-36-50107 -5-37-44-27-15-23-16-29-47-17-29-15
|
|
|
|
P 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39
|
|
T-26 1 4 -7 3 -4 0-10 8-18 7-22-21 8 4 -3 -6 7 -8 1 -5-16-16 7 -6
|
|
C 6 -8 19 -7 9 -3 17 -2 3 -9 5 22 22 8 -1 1 18 6 11-10 -8 7 10 0 7
|
|
A 14-12-42 1 -5 -4-32 12-10 20 -6 -1 3 -4 4-10 -1 -2-14 11 14 -3 2-13 5
|
|
G-23 -7 -1 -6-17 -4 0-15-14 -4-17-10 -5-13 -8 10-13-13 9 -4 -3 10 2 4 -8
|
|
|
|
P 40
|
|
T 0
|
|
C 14
|
|
A 5
|
|
G-21
|
|
.END LIT
|
|
These weights come from w101 of Stormo, Schneider, Gold and Ehrenfeucht
(Nucl. Acids Res. 10, 2997-3011, 1982). They report that this matrix gives a
score of at least 2 for all gene starts in their library, whereas all other
sequences score 1 or less.
|
|
.left margin1
|
|
@59. TX 1 @ Reverse and complement the sequence
|
|
.LEFT MARGIN2
|
|
.para
|
|
Reverses and complements the current active region of the sequence.
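.para
The operation itself is simple and can be sketched as follows. The function
name is illustrative, and the sketch handles only the characters A, C, G and T
in either case; how the program treats other symbols is not covered here.
.lit
```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Complement each base, then reverse the order of the bases."""
    return seq.translate(COMPLEMENT)[::-1]
```
.end lit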
|
|
.left margin1
|
|
@60. TX 7 @ Search using a dinucleotide weight matrix
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function performs searches for short sequence
|
|
motifs using an appropriate dinucleotide weight matrix. In addition it
|
|
can be used to create or modify weight matrices. In order to perform a
|
|
search the only input
|
|
required is the name of the file containing the weight matrix.
|
|
The results can be presented graphically or listed. The graphical
|
|
presentation will draw a line at the position of any matches found; the
|
|
height of the line is proportional to the score. The method is identical to
|
|
that using weight matrices derived from nucleotide frequencies, except
|
|
that here we use the frequencies of dinucleotides.
|
|
.para
|
|
For a search, select "use weight matrix", supply the name of the file
|
|
containing the weight matrix, and choose between having results plotted
|
|
or listed. If dialogue is requested when the function is selected users can
|
|
alter the cutoff score employed.
|
|
.para
|
|
To create a weight matrix several steps are involved. A file containing an
|
|
alignment of known motifs is required. (This file must be created before
|
|
the current option is selected. The format is as follows: each sequence is
|
|
written on a separate line with at least one space at the beginning; each
|
|
sequence is terminated by a space character, and can be followed by a
|
|
name. The sequences must be aligned.) Supply the name of the file of
|
|
aligned sequences. The program reads and displays the sequences. Choose
|
|
between "summing logs of weights" or summing weights (i.e. whether to
|
|
multiply or add weights). If logs are used all scores will be negative.
|
|
Choose if all positions in the set of aligned sequences should be used or
|
|
if a mask should be applied. If so selected, define a mask as a string of
|
|
symbols, in which symbol - means ignore and any other symbol means
|
|
use. E.g. xx-x--abc means use all positions except 3,5 and 6.
|
|
.para
|
|
The program will calculate weights as the frequencies of the
|
|
dinucleotides at each unmasked position in the set of aligned sequences.
|
|
These weights are then applied to the set of aligned sequences to give a
|
|
range of "observed" scores. The mean and standard deviation of these
|
|
scores are displayed. The user is asked to supply several values to be used
|
|
when the weight matrix is applied to other sequences: a cutoff score (by
|
|
default, the mean minus 3 standard deviations), a top score for scaling
|
|
graphical results (by default, the mean plus 3 standard deviations), and a
|
|
position to identify (this means that if a particular base within the
|
|
motif is used as a "landmark", such as the A of the AG in splice acceptor
|
|
sites, then its position will be marked in plots). All these values are
|
|
stored along with the weight matrix. Finally supply the name of a file to
|
|
contain the weight matrix.
|
|
.para
|
|
Weight matrices can be "rescaled" using a set of aligned sequences in
|
|
much the same way as a matrix is created. The purpose is to redefine
|
|
the cutoff scores, and rescaling does not alter any other values in the
|
|
weight matrix file.
|
|
.para
|
|
The methods have always had to deal with the problem of zeroes in the
|
|
matrices. The current versions
|
|
employ "Laplace's Law of Succession" in which 1 is
|
|
added to each term.
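.para
The steps described above (counting dinucleotides at each unmasked position,
adding 1 to every term, scoring by summing weights or logs, and taking default
limits of the mean plus or minus 3 standard deviations) can be sketched as
follows. This is a sketch under assumptions, not the program's code: all
function names are hypothetical, the weights are raw counts with no further
scaling, and sequences are assumed to contain only A, C, G and T.
.lit
```python
from math import log
from statistics import mean, pstdev

BASES = "ACGT"
DINUCLEOTIDES = [a + b for a in BASES for b in BASES]


def make_matrix(aligned, mask=None):
    """Count dinucleotides at each unmasked position of the aligned motifs."""
    length = len(aligned[0]) - 1               # dinucleotide positions per motif
    if mask is None:
        mask = "x" * length
    if len(mask) < length:
        raise ValueError("mask is shorter than the motif")
    matrix = []
    for pos in range(length):
        if mask[pos] == "-":
            matrix.append(None)                # masked position: ignored when scoring
            continue
        counts = {d: 1 for d in DINUCLEOTIDES}  # add 1 to every term
        for motif in aligned:
            counts[motif[pos:pos + 2].upper()] += 1
        matrix.append(counts)
    return matrix


def score(matrix, window, use_logs=False):
    """Sum the weights, or the logs of the relative frequencies, over a window."""
    total = 0.0
    for pos, counts in enumerate(matrix):
        if counts is None:
            continue
        weight = counts[window[pos:pos + 2].upper()]
        total += log(weight / sum(counts.values())) if use_logs else weight
    return total


def default_limits(scores):
    """Return (mean - 3 sd, mean + 3 sd) using the population standard deviation."""
    m, sd = mean(scores), pstdev(scores)
    return m - 3 * sd, m + 3 * sd


def search(matrix, seq, cutoff, use_logs=False):
    """Return (1-based position, score) for every window scoring at least cutoff."""
    width = len(matrix) + 1
    hits = []
    for i in range(len(seq) - width + 1):
        s = score(matrix, seq[i:i + width], use_logs)
        if s >= cutoff:
            hits.append((i + 1, s))
    return hits
```
.end lit
.para
Applied to the sixteen observed scores listed in the dialogue below,
default_limits gives 66.794 and 110.706, the default cutoff and top score
that the program offers there.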
|
|
|
|
.lit
|
|
Typical dialogue follows.
|
|
|
|
? Menu or option number=D60
|
|
|
|
Motif search using dinucleotide weight matrix
|
|
X 1 Use weight matrix
|
|
2 Make weight matrix
|
|
3 Rescale weight matrix
|
|
? 0,1,2,3 = 2
|
|
? Name of aligned sequences file=[RS.MOTIFS]GCN4.SEQ
|
|
|
|
|
|
1 AGCGTGACTCTTCCCGGAA HIS1
|
|
2 GAGGTGACTCACTTGGAAG HIS1
|
|
3 CGGATGACTCTTTTTTTTT HIS3
|
|
4 ACAGTGACTCACGTTTTTT HIS4
|
|
5 GTCGTGACTCATATGCTTT ARG3
|
|
6 TGAATGACTCACTTTTTGG ARG4
|
|
7 TTCTTGACTCGTCTTTTCT CPA1
|
|
8 CGAATGACTCTTATTGATG CPA2
|
|
9 AGAATGACTAATTTTACTA TRP5
|
|
10 TCGTTGACTCATTCTAATC TRP3
|
|
11 TTGCTGACTCATTACGATT TRP2
|
|
12 GAGATGACTCTTTTTCTTT IV1
|
|
13 GCGATGATTCATTTCTCTG IV2
|
|
14 TAGATGACTCAGTTTAGTC LEU1
|
|
15 TAAGTGACTCAGTTCTTTC LEU4
|
|
16 ATGATGACTCTTAAGCATG ILS1
|
|
Length of motif 18
|
|
? (y/n) (y) Sum logs of weights n
|
|
? (y/n) (y) Use all motif positions n
|
|
x means use, - means ignore
|
|
e.g. xx-x---x-x means use positions 1,2,4,8,10
|
|
? Mask=----XXXXXXXX--------
|
|
Applying weights to input sequences
|
|
1 89.000 AGCGTGACTCTTCCCGGA
|
|
2 91.000 GAGGTGACTCACTTGGAA
|
|
3 93.000 CGGATGACTCTTTTTTTT
|
|
4 90.000 ACAGTGACTCACGTTTTT
|
|
5 94.000 GTCGTGACTCATATGCTT
|
|
6 91.000 TGAATGACTCACTTTTTG
|
|
7 81.000 TTCTTGACTCGTCTTTTC
|
|
8 90.000 CGAATGACTCTTATTGAT
|
|
9 75.000 AGAATGACTAATTTTACT
|
|
10 97.000 TCGTTGACTCATTCTAAT
|
|
11 97.000 TTGCTGACTCATTACGAT
|
|
12 93.000 GAGATGACTCTTTTTCTT
|
|
13 69.000 GCGATGATTCATTTCTCT
|
|
14 90.000 TAGATGACTCAGTTTAGT
|
|
15 90.000 TAAGTGACTCAGTTCTTT
|
|
16 90.000 ATGATGACTCTTAAGCAT
|
|
Top score 97.000 Bottom score 69.000
|
|
Mean 88.750 Standard deviation 7.319
|
|
Mean minus 3.sd 66.794 Mean plus 3.sd 110.706
|
|
? Cutoff score (-999.00-9999.00) (66.79) =
|
|
? Top score for scaling plots (66.79-999.00) (110.71) =
|
|
? Position to identify (0-18) (1) =
|
|
? Title=GCN4 DI WTS
|
|
? Name for new weight matrix file=3.WTS
|
|
|
|
? Menu or option number=D60
|
|
Motif search using dinucleotide weight matrix
|
|
X 1 Use weight matrix
|
|
2 Make weight matrix
|
|
3 Rescale weight matrix
|
|
? 0,1,2,3 =
|
|
? Motif weight matrix file=3.WTS
|
|
GCN4 DI WTS
|
|
? Cutoff score (-9999.00-9999.00) (66.79) =40
|
|
? (y/n) (y) Plot results n
|
|
15 42.00 CAACCCGCTCACCGACAA
|
|
29 42.00 ACAACAGCTCACCCACGC
|
|
93 46.00 AGCCTTCCTCATCGCTGC
|
|
153 40.00 CAGCGGAATCAAACTTAA
|
|
408 42.00 CGATGGATTCAAGTTGAA
|
|
469 47.00 TTAGGAACTCCCTCTGTC
|
|
493 60.00 AAGCTGAATCTTAGCAGC
|
|
530 43.00 CGGAGGGCTCAGTGAGGG
|
|
542 47.00 TGAGGGACTACTGCACCA
|
|
678 41.00 CTTCTGCTTCAAAGAGTT
|
|
709 47.00 AATATGACGGCGCACGTG
|
|
848 54.00 GTCAGAACTCAAATCAGT
|
|
940 49.00 CCGTTGACGACCTCCGCA
|
|
992 42.00 TGGGCACCTCACACCAAG
|
|
|
|
|
|
.end lit
|
|
.left margIN1
|
|
@61. TX 8 @ Search for eukaryotic ribosome binding sites
|
|
.LEFT MARGIN2
|
|
.para
|
|
Searches for eukaryotic ribosome binding sites using weightings derived
from Sargan, Gregory and Butterworth (FEBS Lett. 147, 133-136, 1982). No
dialogue is required. The method was first described in Staden, Nucl. Acids
Res. 12, 505-519 (1984).
|
|
|
|
.LEFT MARGIN1
|
|
.lit
|
|
mRNA WTS FOR EUKARYOTES SARGAN,GREGORY,BUTTERWORTH FEBS LET
|
|
147 133-136 1982
|
|
P -7 -6 -5 -4 -3 -2 -1 1 2 3
|
|
102102102102102102102102102102
|
|
T 19 24 31 12 0 18 5 0102 0
|
|
C 20 15 32 65 5 42 52 0 0 0
|
|
A 50 27 27 19 86 36 34102 0 0
|
|
G 6 29 12 6 11 6 11 0 0102
|
|
VIRAL ONLY
|
|
P -7 -6 -5 -4 -3 -2 -1 1 2 3
|
|
41 41 41 41 41 41 41 41 41 41
|
|
T 14 12 16 4 2 13 9 0 41 0
|
|
C 7 3 13 17 7 9 14 0 0 0
|
|
A 15 10 6 10 27 15 9 41 0 0
|
|
G 5 16 6 10 5 4 9 0 0 41
|
|
.END LIT
|
|
The Sargan et al paper puts forward the hypothesis that there is an
|
|
interaction between
|
|
some mRNA leader sequences and a highly conserved structure in the 18S
|
|
rRNA
|
|
of eukaryotic ribosomes. The attempt to substantiate the hypothesis
|
|
includes
|
|
a table of base frequencies for sequences immediately 5' to start codons.
|
|
They examined 102 sequences and I have used the base frequencies they
|
|
found
|
|
as a weight matrix for searching for eukaryotic gene starts. I don't yet
|
|
know how good this method is. The viral sequences were found to be
|
|
slightly
|
|
different but the separate table shown here is not used in the program.
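.para
To illustrate how a frequency table can serve as a weight matrix, the sketch
below scores every 10 base window against the first (102 sequence) table
shown above. The counts are copied from that table. Summing the counts
directly is an assumption about how the weights are combined, and the function
names are hypothetical. Sequences are assumed to contain only A, C, G and T.
.lit
```python
# Counts for positions -7..-1 and 1..3, copied from the 102-sequence table.
FREQ = {
    "T": [19, 24, 31, 12, 0, 18, 5, 0, 102, 0],
    "C": [20, 15, 32, 65, 5, 42, 52, 0, 0, 0],
    "A": [50, 27, 27, 19, 86, 36, 34, 102, 0, 0],
    "G": [6, 29, 12, 6, 11, 6, 11, 0, 0, 102],
}
WIDTH = 10


def window_score(window):
    """Sum the table entry for the base found at each position of the window."""
    return sum(FREQ[base][i] for i, base in enumerate(window.upper()))


def best_sites(seq, top=3):
    """Return the top (score, position) pairs; position is the 1-based
    location of the base aligned with position 1 of the table."""
    scored = [(window_score(seq[i:i + WIDTH]), i + 8)
              for i in range(len(seq) - WIDTH + 1)]
    return sorted(scored, reverse=True)[:top]
```
.end lit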
|
|
.left margin1
|
|
@62. TX 8 @ Search for splice junctions
|
|
.LEFT MARGIN2
|
|
.para
|
|
Used to search for mRNA splice junctions using a weight matrix. The
|
|
default weight matrix is still that derived from the paper of Mount (Nucl.
|
|
Acids Res. 10, 459-472). However users may employ their own tables.
|
|
By default the positions of possible junctions will
|
|
be plotted rather than listed.
|
|
The diagram splits the donor plot into 3 horizontal boxes
|
|
so that all the
|
|
sites marked in any box are from the same reading frame. The acceptor
|
|
plot appears above the donor plot and is split in an equivalent way. So
|
|
sites marked as donors and acceptors in equivalent boxes are compatible.
|
|
i.e. donors from donor box 1 are compatible with acceptors from acceptor box
|
|
1, etc. Of course it is the combination of reading frame and splice sites
|
|
that really matters, and donors from box 1 can be compatible with acceptors
|
|
in box 3 if the reading frame switches.
|
|
.para
|
|
If dialogue is selected users can employ their own file of weights (see
|
|
below for the format), can change the cutoff scores, and can elect to have
|
|
the results listed rather than plotted. Listed results show the position
|
|
(of the last or first base in the exon), the frame and the matching sequence.
|
|
The frequency table shown below is used as a default
|
|
weight matrix and AG and GT are obligatory at the appropriate positions.
|
|
The plots are scaled so that the top of scale is the highest value achieved
|
|
by
|
|
a junction sequence in the set used to compile the frequency table, and
|
|
the
|
|
bottom of the scale is the lowest value achieved by a junction sequence
|
|
in
|
|
the set used to compile the frequency table.
|
|
.para
|
|
In the light of current knowledge it would be sensible for users to use
|
|
the weight matrix search option (20)
|
|
to create matrices that define more specific splice junctions. If so it is
|
|
important that the positions "marked" are the last base in the donor exon and
|
|
the first base in the acceptor exon. To make a weight matrix suitable for
|
|
use with this function follow the instructions for option 20 and create
|
|
files for both donor and acceptor sites. Then concatenate the two matrix files
|
|
with the donor file first.
|
|
Note that any positions in the weight matrix that are
|
|
100% conserved will be made obligatory (normally the AG and GT).
|
|
.LEFT MARGIN1
|
|
.lit
|
|
|
|
Mount donors redone 16-4-91
|
|
12 3 -16.085 -7.500
|
|
P -2 -1 0 1 2 3 4 5 6 7 8 9
|
|
N 136 136 136 136 136 136 136 136 136 136 136 136
|
|
T 28 8 15 17 0 136 9 16 7 84 30 36
|
|
C 41 60 16 7 0 0 3 13 3 17 28 39
|
|
A 40 56 89 12 0 0 83 91 12 23 53 33
|
|
G 27 12 16 100 136 0 41 16 114 12 25 28
|
|
Mount acceptors redone 16-4-91
|
|
18 15 -26.142 -14.400
|
|
P -14 -13 -12 -11 -10 -9 -8 -7 -6 -5 -4 -3 -2 -1 0 1 2 3
|
|
N 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113 113
|
|
T 58 50 57 59 67 56 58 49 47 66 64 31 34 0 0 11 41 31
|
|
C 21 28 34 25 29 33 35 32 42 40 33 25 74 0 0 23 28 41
|
|
A 17 11 11 18 7 17 12 23 15 3 10 29 5 113 0 24 21 21
|
|
G 17 24 11 11 10 7 8 9 9 4 6 28 0 0 113 55 23 20
|
|
.END LIT
|
|
|
|
.left margIN1
|
|
@63. TX 7 @ Search using a weight matrix (complementary)
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function searches the complementary strand of the sequence using
|
|
a weight matrix. Many
|
|
motifs can bind to either strand of the DNA and this function allows
|
|
users to
|
|
search the complementary strand without having to change the
|
|
orientation of the sequence. See option 20 for more details.
|
|
.left margin1
|
|
@64. TX 3 @ Plot observed-expected word frequencies
|
|
.LEFT MARGIN2
|
|
.PARA
|
|
This option is designed to examine the abundances of short
|
|
words in a sequence to see if particular ones are either under or over
|
|
represented. It compares the observed and expected frequencies and
|
|
plots them along the sequence. There has been some work on the relative
|
|
amounts of CG dinucleotides in eukaryotic sequences (eg Bird, Nature
|
|
321,
|
|
209-213 (1986)) and this new routine can be used to examine such
|
|
biases, or
|
|
any others that might be interesting.
|
|
.para
|
|
The user selects a word - say CG -, a window length, and a maximum and
|
|
minimum scale for plotting the results. The
program examines each successive window along the sequence, with each
window overlapping the previous one by window length - 1 bases.
|
|
The program counts the base frequencies in each window, and the number
|
|
of
|
|
occurrences of the chosen word within the window. Using the base
|
|
frequencies it calculates an expected number of occurrences for the
|
|
chosen
|
|
word (simply by multiplying the relevant frequencies). It plots
|
|
observed-expected, and hence will show regions that are rich or depleted
|
|
in
|
|
the chosen word. The longest allowed word is 9 characters, but the
|
|
calculation of the expected frequencies becomes less appropriate as the
|
|
word
|
|
length increases above 2.
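.para
The calculation can be sketched as follows. This is a sketch under stated
assumptions rather than the program's own code: the expected count is taken
to be the number of word positions in the window multiplied by the product of
the window's base frequencies, overlapping occurrences are counted, and each
value is reported at the centre of its window. The function name is
hypothetical.
.lit
```python
def observed_minus_expected(seq, word="CG", span=101):
    """Return (1-based window centre, observed - expected) for each window."""
    seq, word = seq.upper(), word.upper()
    places = span - len(word) + 1              # word positions in one window
    results = []
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        observed = sum(window.startswith(word, i) for i in range(places))
        expected = float(places)
        for base in word:
            expected *= window.count(base) / span  # base frequency in this window
        results.append((start + span // 2 + 1, observed - expected))
    return results
```
.end lit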
|
|
.para
|
|
Typical dialogue follows.
|
|
.lit
|
|
|
|
? Menu or option number=D64
|
|
Plot composition differences (obs-exp))
|
|
Default String=CG
|
|
? String=
|
|
? odd span length (3-401) (101) =
|
|
? plot interval (1-20) (5) =
|
|
? Maximum plot value (-6.31-25.25) (6.31) =
|
|
? Minimum plot value (-25.25-6.31) (-6.31) =
|
|
|
|
Missing graphics display here
|
|
|
|
.end lit
|
|
.left margIN1
|
|
@65. TX 9 @ Search for polyA sites
|
|
.LEFT MARGIN2
|
|
.para
|
|
Simply searches for the sequence AATAAA
|
|
(Proudfoot and Brownlee, Nature 263, 211-214,
1976) and marks each occurrence with a short vertical line.
|
|
.left margin1
|
|
@66. TX 1 @ Interconvert t and u
|
|
.LEFT MARGIN2
|
|
.para
|
|
This function interconverts T and U characters in the active sequence, i.e.
it converts between DNA and RNA.
|
|
.LEFT MARGIN1
|
|
@67. TX 7 @ Search for patterns of motifs
|
|
.left margin2
|
|
.para
|
|
This option searches for patterns of motifs. Patterns can be defined
|
|
interactively or read from files. Results can be displayed in several ways
|
|
in both graphical and textual form. It can also be used to create pattern files for
|
|
searching libraries. The option is extremely flexible and consequently the
|
|
following documentation is quite lengthy. However the routine is capable
|
|
of searching for almost any known pattern. In addition the flexibility
|
|
does not necessitate difficulty of use, and the user interface has been
|
|
simplified considerably since the methods were first published.
|
|
.para
|
|
Users should refer to the "typical dialogue" shown below for the most
|
|
helpful information on using the program.
|
|
.para
|
|
There are currently
|
|
four ways to display the matching patterns: 1=each individual
|
|
motif and its position is listed; 2=all the sequence between, and
|
|
including the two
|
|
outermost motifs is listed; 3=graphical, with a vertical line marking the
|
|
position
|
|
of the leftmost motif; 4=EMBL feature table format, where the KEYNAM
field is the motif name, the FROM and TO fields denote the ends of the
match, and the DESCRIPTION field is "Program".
|
|
.para
|
|
When it is defined for the first time a pattern must be entered
|
|
interactively at the keyboard, but the pattern description
|
|
can be saved to a file.
|
|
This file can be used for all subsequent searches.
|
|
.para
|
|
When defining a pattern interactively
|
|
select a motif class and the program will request the required inputs.
|
|
.para
|
|
The program gives each motif an identifying name and number.
|
|
For motifs other than the first, a range of allowed positions must be
|
|
defined (Note that sets of motifs included using the OR operator will all
|
|
be given the same range, and so the program will only request range
|
|
values
|
|
for the first motif in any such set).
|
|
To specify the allowed range for a motif the user must supply the
|
|
following: the
|
|
identifying number of the motif relative to which the current motif's
|
|
positions are to be defined (termed the "reference motif"); a "relative start
|
|
position" and a range. The relative start position can be negative or positive.
|
|
A negative start position means that although the reference motif
|
|
is searched for first, the current motif can be found to its left.
|
|
A zero relative start position means their left ends are superimposed. The
|
|
default start position is to butt-join the motif to the right-hand end of the
|
|
"reference motif". The range is "the number of extra positions" that the
|
|
motif can take.
|
|
.para
|
|
The program will display the probability of finding each motif. These
|
|
values are presented in the following form: .1234E-5 means 0.1234 times
|
|
10
|
|
to the power -5.
|
|
.para
|
|
After the pattern has been defined, the program will type a description
|
|
of
|
|
it on the screen. It will then allow the user to give an overall cutoff
|
|
score and overall probability cutoff.
|
|
.para
|
|
Typical dialogue for all the different motif classes is displayed below.
|
|
.lit
|
|
|
|
? Menu or option number=67
|
|
Pattern searcher
|
|
? (y/n) (y) Read pattern from keyboard
|
|
X 1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =
|
|
? Motif name=Ematch
|
|
? String=AA
|
|
Probability of score 2.0000 = 0.595E-01
|
|
X 1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =2
|
|
? Motif name=AAA
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-1) (1) =
|
|
? Relative start position (-1000-1000) (3) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? string=AAA
|
|
? Minimum matches (1.00-3.00) (3.00) =2
|
|
Probability of score 2.0000 = 0.149E+00
|
|
1 Exact match
|
|
X 2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =3
|
|
? Motif name=T'S
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-2) (2) =
|
|
? Relative start position (-1000-1000) (4) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? String=TTT
|
|
? Minimum score (0.00-108.00) (108.00) =72
|
|
Probability of score 72.0000 = 0.258E+00
|
|
1 Exact match
|
|
2 Percentage match
|
|
X 3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =4
|
|
? Motif name=GCN4
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-3) (3) =
|
|
? Relative start position (-1000-1000) (4) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? Weight matrix file name=GCN4
|
|
GCN4 FROM WEIGHTS 17-11-87
|
|
Probability of score -22.0020 = 0.139E-02
|
|
1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
X 4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =5
|
|
? Motif name=GCN4
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-4) (4) =
|
|
? Relative start position (-1000-1000) (20) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? Weight matrix file name=GCN4
|
|
GCN4 FROM WEIGHTS 17-11-87
|
|
Probability of score -22.0020 = 0.606E-03
|
|
1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
X 5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =6
|
|
? Motif name=LOOP
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-5) (5) =
|
|
? Relative start position (-1000-1000) (20) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? Stem length (1-60) (6) =
|
|
? Minimum loop length (-6-60) (0) =
|
|
? Maximum loop length (0-60) (0) =5
|
|
? Minimum score (1.00-12.00) (12.00) =10
|
|
Probability of score 10.0000 = 0.598E-02
|
|
1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
X 6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =7
|
|
? Motif name=Tstep
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-6) (6) =
|
|
? (y/n) (y) Relative to 5 prime end
|
|
? Relative start position (-1000-1000) (1) =
|
|
? Number of extra positions (0-1000) (0) =
|
|
? String=TTT
|
|
? Step (1-20) (3) =
|
|
Probability of score 3.0000 = 0.367E-01
|
|
1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
X 7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =8
|
|
? Motif name=REPEAT
|
|
X 1 And
|
|
2 Or
|
|
3 Not
|
|
? 0,1,2,3 =
|
|
? Number of reference motif (1-7) (7) =
|
|
? Relative start position (-1000-1000) (4) =
|
|
? Number of extra positions (0-1000) (0) =2
|
|
? Repeat length (1-60) (6) =
|
|
? Minimum gap (0-60) (0) =
|
|
? Maximum gap (0-60) (0) =4
|
|
? Minimum score (1.00-6.00) (6.00) =5
|
|
Probability of score 5.0000 = 0.554E-02
|
|
1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
X 8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =9
|
|
? (y/n) (y) Save pattern in a file N
|
|
|
|
Pattern description
|
|
|
|
Motif 1 named Ematch is of class 1
|
|
Which is an exact match to the string
|
|
AA
|
|
Motif 2 named AAA is of class 2
|
|
which is a match of score 2. to the string
|
|
AAA
|
|
and the 5 prime base can take positions 3 to 3
|
|
relative to the 5 prime end of motif 1
|
|
It is anded with the previous motif.
|
|
Motif 3 named T'S is of class 3
|
|
which is a match of score 72. to the string
|
|
TTT
|
|
and the 5 prime base can take positions 4 to 4
|
|
relative to the 5 prime end of motif 2
|
|
It is anded with the previous motif.
|
|
Motif 4 named GCN4 is of class 4
|
|
Which is a match to a weight matrix with score -22.002
|
|
and the 5 prime base can take positions 4 to 4
|
|
relative to the 5 prime end of motif 3
|
|
It is anded with the previous motif.
|
|
Motif 5 named GCN4 is of class 5
|
|
Which is a match to the complement of a weight matrix with score -22.002
|
|
and the 5 prime base can take positions 20 to 20
|
|
relative to the 5 prime end of motif 4
|
|
It is anded with the previous motif.
|
|
Motif 6 named LOOP is of class 6
|
|
Which is a stem-loop structure with stem length 6 and score 10.
|
|
The loop can have sizes 0 to 5
|
|
and the 5 prime base can take positions 20 to 20
|
|
relative to the 5 prime end of motif 5
|
|
It is anded with the previous motif.
|
|
Motif 7 named Tstep is of class 7
|
|
Which is an exact match to the string
|
|
TTT
|
|
with a step size of 3
|
|
and the 5 prime base can take positions 1 to 1
|
|
relative to the 5 prime end of motif 6
|
|
It is anded with the previous motif.
|
|
Motif 8 named REPEAT is of class 8
|
|
Which is a repeat with repeat length 6 and score 5.
|
|
The loop-out can have sizes 0 to 4
|
|
and the 5 prime base can take positions 4 to 6
|
|
relative to the 5 prime end of motif 7
|
|
It is anded with the previous motif.
|
|
Probability of finding pattern = 0.2348E-14
|
|
Expected number of matches = 0.5100E-09
|
|
? Maximum pattern probability (0.00-1.00) (1.00) =
|
|
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
|
|
Select display mode
|
|
X 1 Motif by motif
|
|
2 Inclusive
|
|
3 Graphical
|
|
4 EMBL feature table
|
|
? 0,1,2,3,4 =4
|
|
Searching
|
|
|
|
|
|
Total matches found 0
|
|
|
|
Menus and their numbers are
|
|
m0 = This menu
|
|
m1 = General
|
|
m2 = Screen control
|
|
m3 = Statistical analysis of content
|
|
m4 = Structures and repeats
|
|
m5 = Translation and codons
|
|
m6 = Gene search by content
|
|
m7 = Prokaryotic signal search
|
|
m8 = Eukaryotic signal search
|
|
? = Help
|
|
! = Quit
|
|
? Menu or option number=67
|
|
Pattern searcher
|
|
? (y/n) (y) Read pattern from keyboard
|
|
X 1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =
|
|
? Motif name=Arun
|
|
? String=AAAAAA
|
|
Probability of score 6.0000 = 0.210E-03
|
|
X 1 Exact match
|
|
2 Percentage match
|
|
3 Cut-off score and score matrix
|
|
4 Cut-off score and weight matrix
|
|
5 Complement of weight matrix
|
|
6 Inverted repeat or stem-loop
|
|
7 Exact match, defined step
|
|
8 Direct repeat
|
|
9 Pattern complete
|
|
? 0,1,2,3,4,5,6,7,8,9 =9
|
|
? (y/n) (y) Save pattern in a file N
|
|
|
|
Pattern description
|
|
|
|
Motif 1 named Arun is of class 1
|
|
Which is an exact match to the string
|
|
AAAAAA
|
|
Probability of finding pattern = 0.2103E-03
|
|
Expected number of matches = 0.1522E+01
|
|
? Maximum pattern probability (0.00-1.00) (1.00) =
|
|
? Minimum pattern score (-9999.00-9999.00) (-9999.00) =
|
|
Select display mode
|
|
X 1 Motif by motif
|
|
2 Inclusive
|
|
3 Graphical
|
|
4 EMBL feature table
|
|
? 0,1,2,3,4 =4
|
|
Searching
|
|
|
|
|
|
FT Arun 1582 1587 Program
|
|
FT Arun 3160 3165 Program
|
|
FT Arun 4204 4209 Program
|
|
FT Arun 5691 5696 Program
|
|
FT Arun 6710 6715 Program
|
|
Total matches found 5
|
|
Minimum and maximum observed scores 6.00 6.00
|
|
|
|
.end lit
|
|
.para
|
|
These methods allow users to define and search for
|
|
complex patterns of motifs defined as single objects.
|
|
The programs allow individual DNA motifs to be defined in eight
|
|
different
|
|
ways, and protein motifs in six. Motifs are combined, using the logical
|
|
operators AND, OR and NOT, to describe a pattern. The pattern also
|
|
specifies the ranges of allowed relative separations of the individual
|
|
motifs.
|
|
.para
|
|
First some definitions.
|
|
.para
|
|
A MOTIF is a contiguous subsequence of fixed length.
|
|
At its simplest
|
|
it could be a single definite base or amino acid; a more complex motif
|
|
might be better represented as a consensus or a weight matrix;
|
|
two more-abstract types of
|
|
motif are direct and inverted repeats.
|
|
.para
|
|
A PATTERN is a higher order of structure defined by a list of motifs. The
|
|
motifs in a pattern are combined using the logical operators AND, OR and
|
|
NOT. The list also defines the allowed relative separations of the
|
|
motifs. In the current versions of the programs up
|
|
to 50 motifs can be combined into a single pattern. So using these
|
|
definitions there are two
|
|
differences between motifs and patterns: 1) the distances between all
|
|
elements of a motif are fixed, but
|
|
the separations of parts of patterns can vary;
|
|
2) all characters in a motif are defined
|
|
using the same method (class), but different parts of a pattern can be
|
|
defined in completely different ways.
|
|
.para
|
|
Each motif
|
|
can be represented in 9 ways (known as the motif class):
|
|
.sk1
|
|
.lit
|
|
MOTIF CLASSES
|
|
CLASS DESCRIPTION
|
|
1 Exact match to a short defined sequence. The IUB symbols
|
|
can be used for DNA sequences.
|
|
2 Percentage match to a defined short sequence. In nucleic acids,
|
|
the IUB symbols can be used.
|
|
3 Match to a defined sequence, using a score matrix and cutoff
|
|
score. The DNA matrix (see option 18) gives scores to IUB symbols
|
|
depending on their level of redundancy. MDM78 is used for proteins.
|
|
4 Match to a weight matrix with cutoff score.
|
|
5 As class 4 but on the complementary strand.
|
|
6 Inverted repeat or stem-loop. Fixed stem length, range of
|
|
loop sizes, and cutoff score using A-T, G-C=2; G-T=1.
|
|
7 Exact match to short sequence but with a defined step size.
|
|
8 Direct repeat. Fixed repeat length, range of loop-out sizes,
|
|
cutoff score, and score matrix (for protein sequences MDM78 and
|
|
for nucleic acids an identity matrix).
|
|
9 Membership of a set. A list of sets of allowed amino acids for
|
|
each position in the motif. The sets are separated by commas(,).
|
|
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
|
|
acids in which one of I,V or L must be found in the first position,
|
|
then anything in the next two positions, D,E,K or R in the fourth
|
|
position and F,Y,W,I,L,V or M in the fifth. This class only applies
|
|
to protein sequences because for nucleic acids "membership of a
|
|
set"
|
|
can be achieved using IUB symbols.
|
|
|
|
Classes 1 - 4, 8 and 9 apply to protein sequences, and classes 1-8 to
|
|
nucleic acids.
|
|
|
|
.end lit
|
|
.para
|
|
Class 1: exact match.
|
|
.para
|
|
The motif is defined by a short sequence which, for nucleic acids,
|
|
may include IUB symbols. All symbols must match.
|
|
.para
|
|
Class 2: percentage match
|
|
.para
|
|
The motif is defined by a short sequence which, for nucleic acids,
|
|
may include IUB symbols. The minimum number of matching characters
|
|
must
|
|
also be specified.
|
|
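.para
Classes 1 and 2 can be sketched in a few lines. This is only an
illustration of the description above, not the program's own code: the
function names are hypothetical, and the table is the standard IUB
nucleotide code set. Setting the minimum to the motif length gives the
class 1 (exact match) behaviour.

```python
# IUB nucleotide symbols mapped to the definite bases they allow.
IUB = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT", "S": "CG", "W": "AT",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def count_matches(motif, window):
    """Count positions where the sequence base is allowed by the motif symbol."""
    return sum(base in IUB[sym] for sym, base in zip(motif, window))


def percentage_match(seq, motif, min_matches):
    """Return (1-based position, matches) for each window with enough matches."""
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        n = count_matches(motif, seq[i:i + len(motif)])
        if n >= min_matches:
            hits.append((i + 1, n))
    return hits
```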
.para
|
|
Class 3: match using a score matrix
|
|
.para
|
|
The motif is defined by a short sequence which, for nucleic acids,
|
|
may include IUB symbols. The motif is not compared directly with the
|
|
sequence to count the number of matching characters. Instead a matrix is
|
|
used to provide a score for all possible pairs of characters. The motif
|
|
score for
|
|
any position along the sequence is the sum of the scores found by
|
|
looking-up the scores for each pair of aligned characters. A match is
|
|
declared if some minimum score is achieved.
|
|
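.para
A minimal sketch of the class 3 calculation follows. The scoring
function is a hypothetical stand-in for the real score matrix: the value
of 36 per identical base is inferred from the maximum of 108 offered for
the string TTT in the dialogue above, and IUB symbols are not handled.

```python
def toy_score(a, b):
    # Hypothetical stand-in for the score matrix: 36 for identical
    # definite bases, 0 for everything else.
    return 36 if a == b else 0


def score_matrix_hits(seq, motif, cutoff, score=toy_score):
    """Return (1-based position, score) for each window reaching the cutoff."""
    hits = []
    for i in range(len(seq) - len(motif) + 1):
        window = seq[i:i + len(motif)]
        # The motif score is the sum of the looked-up pair scores.
        total = sum(score(m, s) for m, s in zip(motif, window))
        if total >= cutoff:
            hits.append((i + 1, total))
    return hits
```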
.para
|
|
Class 4: weight matrix
|
|
.para
|
|
The motif is defined by a table of values (called weights or scores). The
|
|
table gives a score for finding each possible character at each position
|
|
along the length of the motif. It therefore
|
|
has dimension motif-length x character-set-size, and allows us to give
|
|
different scores for each character at each position. It is equivalent to
|
|
having a different score matrix for each position along the motif, and
|
|
provides the most flexible and specific method of defining motifs. The
|
|
weight matrices are created by program NIP option 20 and
|
|
stored as files. The file contains the values
|
|
for each position, as well as an overall minimum score.
|
|
There are two ways in which these values can be used to calculate an
|
|
overall
|
|
score for any section of the sequence. The simplest way is to add the
|
|
values in the file. (This means that the highest possible score
|
|
can be calculated by adding the top value at each column
|
|
position, and the lowest
|
|
by adding the bottom value.)
|
|
The normal way of using the values in the file is as
|
|
follows.
|
|
First the programs divide the values in each column by the column total
|
|
so
|
|
that they sum to 1.0.
|
|
Then the natural
|
|
logs of these values are used as scores. When the matrix is applied to a
|
|
sequence these logarithmic values are summed (which is of course
|
|
equivalent
|
|
to multiplying the frequencies).
|
|
Note that using the natural logs of the frequencies as
|
|
weights and
|
|
adding them means that the overall cutoff score must be less than zero,
|
|
whereas if the original
|
|
values in the weight matrix file are added, the cutoff score will be
|
|
greater than zero. The search routines therefore decide whether the user
|
|
wants to add values or multiply frequencies
|
|
by examining the value of the cutoff score: they add the file values if the cutoff
|
|
is
|
|
greater than zero and add logs of frequencies if it is less than zero.
|
|
Hence we effectively get two
|
|
motif classes in one. The program NIP, when creating weight matrix
|
|
files, will ask the user whether the scores should be added or multiplied.
|
|
If the values in the table have been defined
|
|
without using a set of aligned sequences
|
|
it is easier for the user to
|
|
choose a cutoff score if the values are added.
|
|
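.para
The two ways of using a weight matrix can be sketched as below. This is
an illustration under stated assumptions, not the program's code: the
row order TCAG, the function names, and the requirement that every count
be positive (so that the logarithm is defined) are all assumptions. A
cutoff of exactly zero is not covered by the text; the sketch treats it
as the logarithmic case.

```python
import math

BASES = "TCAG"  # assumed row order for this sketch


def column_log_freqs(column):
    """Divide a column by its total and take natural logs (counts must be > 0)."""
    total = sum(column)
    return [math.log(v / total) for v in column]


def score_window(matrix, window, cutoff):
    """matrix is a list of columns; each column holds one value per base."""
    if cutoff > 0:
        # Positive cutoff: add the values as stored in the file.
        return sum(col[BASES.index(b)] for col, b in zip(matrix, window))
    # Otherwise: add logs of frequencies (i.e. multiply frequencies).
    return sum(column_log_freqs(col)[BASES.index(b)]
               for col, b in zip(matrix, window))


def weight_matrix_hits(seq, matrix, cutoff):
    """Return (1-based position, score) for each window reaching the cutoff."""
    n = len(matrix)
    hits = []
    for i in range(len(seq) - n + 1):
        s = score_window(matrix, seq[i:i + n], cutoff)
        if s >= cutoff:
            hits.append((i + 1, s))
    return hits


# Toy three-column matrix favouring A, then T, then G.
TOY = [[1, 1, 7, 1], [7, 1, 1, 1], [1, 1, 1, 7]]
```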
.para
|
|
Class 5: complement of weight matrix
|
|
.para
|
|
The motif is defined by a weight matrix, but the program searches for its
|
|
complement.
|
|
.para
|
|
Class 6: inverted repeat, or stem-loop
|
|
.para
|
|
The motif is defined by a repeat length, a minimum score
|
|
and a range of loop sizes. The scores are A-T=2, G-C=2, G-T=1, else=0.
|
|
The loop sizes are defined by a minimum
|
|
and maximum distance from the 3' end of the stem.
|
|
For a stem-loop these will be positive numbers. For example to
|
|
define a stem of length 8 and loop sizes varying from 3 to 5, the stem
|
|
would be set to 8, the minimum start distance to 3 and the maximum
|
|
to 5. To define an
|
|
inverted repeat the minimum distance will be negative. For example stem
|
|
length=9,
|
|
minimum distance=-9, and maximum distance=-8 will find
|
|
inverted repeats of lengths 9 and 10.
|
|
E.g. AAAAATTTT and AAAAATTTTT would be found, the first having a base
|
|
at
|
|
its centre, the second having none.
|
|
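.para
The stem scoring can be sketched as follows. The sketch covers only the
stem-loop case (loop sizes of zero or more) and leaves out the
negative distances used for inverted repeats; the function names and the
way hits are reported are assumptions made for illustration.

```python
# Pair scores from the text: A-T=2, G-C=2, G-T=1, anything else 0.
PAIR_SCORE = {
    ("A", "T"): 2, ("T", "A"): 2,
    ("G", "C"): 2, ("C", "G"): 2,
    ("G", "T"): 1, ("T", "G"): 1,
}


def stem_score(seq, start, stem, loop):
    """Score the stem whose 5' arm starts at 0-based index start."""
    right_end = start + 2 * stem + loop  # one past the last base of the 3' arm
    return sum(
        PAIR_SCORE.get((seq[start + k], seq[right_end - 1 - k]), 0)
        for k in range(stem)
    )


def stem_loop_hits(seq, stem, min_loop, max_loop, cutoff):
    """Return (1-based start, loop size, score) for stems reaching the cutoff."""
    hits = []
    for start in range(len(seq)):
        for loop in range(min_loop, max_loop + 1):
            if start + 2 * stem + loop > len(seq):
                break  # larger loops would run off the end as well
            s = stem_score(seq, start, stem, loop)
            if s >= cutoff:
                hits.append((start + 1, loop, s))
    return hits
```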
.para
|
|
Class 7: exact match, defined step size.
|
|
.para
|
|
The motif is defined by a short sequence which, for nucleic acids,
|
|
may include IUB symbols. All symbols must match. The class differs
|
|
from
|
|
class 1 in that searches will move in steps of some given size. For
|
|
example
|
|
we could search for a certain codon and use a step size of 3 and hence
|
|
keep in a
|
|
single reading frame.
|
|
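.para
A minimal sketch of a stepped search is given below. The function name
and the "first" argument (the 1-based position at which stepping begins)
are hypothetical; within a pattern the starting point would come from
the range defined relative to an earlier motif.

```python
def stepped_exact_hits(seq, motif, step, first=1):
    """Return 1-based positions of exact matches, testing every step-th position."""
    hits = []
    for i in range(first - 1, len(seq) - len(motif) + 1, step):
        if seq[i:i + len(motif)] == motif:
            hits.append(i + 1)
    return hits
```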
.para
|
|
Class 8: direct repeat
|
|
.para
|
|
The motif is defined by a repeat length, a minimum score
|
|
and a range of loop-out sizes. The scores are defined using MDM78 for protein
|
|
sequences and an identity matrix for nucleic acids.
|
|
The loop-out sizes are defined by a minimum
|
|
and maximum distance from the 3' end of the first copy of the repeat.
|
|
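.para
For nucleic acids the direct repeat search can be sketched with an
identity score (1 for identical bases, 0 otherwise). The function name
and the form of the results are assumptions; the protein case, which
uses MDM78, is not shown.

```python
def direct_repeat_hits(seq, length, min_gap, max_gap, cutoff):
    """Return (1-based start, gap, score) for repeats reaching the cutoff."""
    hits = []
    for start in range(len(seq)):
        for gap in range(min_gap, max_gap + 1):
            second = start + length + gap  # start of the second copy
            if second + length > len(seq):
                break  # larger gaps would run off the end as well
            # Identity scoring: count positions where the two copies agree.
            score = sum(seq[start + k] == seq[second + k] for k in range(length))
            if score >= cutoff:
                hits.append((start + 1, gap, score))
    return hits
```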
.para
|
|
Class 9: membership of a set
|
|
.para
|
|
This motif class is for protein sequences. It is defined by lists of
|
|
allowed amino acids for each position in the motif, and a cut-off score.
|
|
Positions at which any amino acid can occur are left blank.
|
|
All allowed amino acids for each position give a score of 1.
|
|
The motifs can be defined in two ways: either typed at the keyboard or
|
|
read
|
|
in as a weight-matrix-like file.
|
|
When the motif is defined at the keyboard the sets of allowed amino
|
|
acids
|
|
are separated by commas(,).
|
|
For example IVL,,,DEKR,FYWILVM defines a motif of length 5 amino
|
|
acids in which one of I,V or L must be found in the first position,
|
|
then anything in the next two positions, D,E,K or R in the fourth
|
|
position and F,Y,W,I,L,V or M in the fifth. To specify that the
|
|
whole motif must match, a score of 3 would be required (i.e. one of the
|
|
allowed amino acids must be found for each of the three defined
|
|
positions).
|
|
If the motif is read from a file the file must have been written by
|
|
program
|
|
NIP, or have been saved by the pattern searching routines. If the
|
|
user
|
|
elects to save a pattern, and it includes class 9 motifs typed at the
|
|
keyboard, then the program will save the class 9 motifs as weight matrix
|
|
files. Therefore it will request file names for each motif of this class.
|
|
If the motif given above as an example were saved the weight matrix file
|
|
would have 5 columns.
|
|
The first column
|
|
would contain zeroes except for the I, V and L rows
|
|
which would be set to 1; the next two columns would all be zero; the next
|
|
would be zero except for the D,E,K and R rows which would be 1; the final
|
|
column would contain 1's in rows F,Y,W,I,L,V and M, with
|
|
the rest zero.
|
|
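.para
The keyboard form of a class 9 motif can be parsed and scored as in the
sketch below, which uses the example from the text. The function names
are hypothetical. Blank positions become empty sets and so never add to
the score, which mirrors the all-zero columns of the saved weight matrix.

```python
def parse_sets(spec):
    """Split 'IVL,,,DEKR,FYWILVM' into one set of allowed residues per position."""
    return [set(part) for part in spec.split(",")]


def set_score(sets, window):
    """Each position whose residue is in its allowed set scores 1."""
    return sum(res in allowed for allowed, res in zip(sets, window))


def set_hits(seq, spec, cutoff):
    """Return (1-based position, score) for each window reaching the cutoff."""
    sets = parse_sets(spec)
    hits = []
    for i in range(len(seq) - len(sets) + 1):
        s = set_score(sets, seq[i:i + len(sets)])
        if s >= cutoff:
            hits.append((i + 1, s))
    return hits
```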
.para
|
|
|
|
The logical operator (AND, OR or NOT) used to add each motif to the
|
|
pattern
|
|
is specified by preceding
|
|
the class number by the letters A, O or N. A = AND, O = OR, N = NOT.
|
|
The default is A, so N2 means include, using the NOT operator, a class 2
|
|
motif; O2 means include, using the OR operator, a class 2 motif; both A2
|
|
and
|
|
2 mean include, using the AND operator, a class 2 motif.
|
|
|
|
.para
|
|
Range setting.
|
|
.para
|
|
The motifs in a pattern are numbered according to their order in the list.
|
|
Apart from the first motif in a pattern all motifs are given a range
|
|
of allowed positions relative to a motif further up the list.
|
|
For example
|
|
suppose we have a pattern defined by A AND B AND C AND D.
|
|
Motif A can occur anywhere, but B must have its range of allowed
|
|
positions defined relative to the position of motif A, and C's positions
|
|
can be defined relative to either A or B, depending on which is most
|
|
convenient, and likewise D's positions can be relative to A or B or C.
|
|
.para
|
|
Notice that the positions of motifs can be defined relative to more than
|
|
one motif. Suppose we have a pattern consisting of
|
|
motifs A, B and C, and that B occurs 5-10 residues right of A, C occurs 5-10
|
|
residues right of B, and also C is never more than 15 residues from A.
|
|
Then
|
|
it is quite consistent with the methods to include motif C into the
|
|
pattern
|
|
twice using the AND operator: once relative to A and once relative to B.
|
|
This will define the relative spacing and the ORDER of the motifs in the
|
|
pattern. (If we simply defined the position of C relative to A it could be
|
|
found to the left of B).
|
|
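.para
The A, B, C example can be sketched as a filter over lists of motif
positions. This is an illustration only: the function name is
hypothetical, positions are taken to be the 5' bases of each motif, and
a separation is treated as the simple difference between two start
positions, which may not be how the program counts.

```python
def pattern_hits(a_pos, b_pos, c_pos):
    """Return (a, b, c) triples with B 5-10 right of A, C 5-10 right of B,
    and C no more than 15 from A (C is included twice, once per reference)."""
    b_set, c_set = set(b_pos), set(c_pos)
    hits = []
    for a in a_pos:
        for b in range(a + 5, a + 11):
            if b not in b_set:
                continue
            for c in range(b + 5, b + 11):
                # The second condition is the range of C relative to A.
                if c in c_set and c - a <= 15:
                    hits.append((a, b, c))
    return hits
```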
.para
|
|
Motifs combined together using the OR operator are all given the same
|
|
range. For example suppose we had a pattern A AND (B OR C) AND (D OR E),
|
|
then B and C each have the same range, and D and E also have
|
|
the same range as one another. The range for D and E can be relative to
|
|
A or to B.
|
|
.para
|
|
Motifs cannot have their ranges defined relative to motifs that are
|
|
included using the NOT operator. For example if we had the pattern A NOT
|
|
B
|
|
AND C, then the range for C can only be defined relative to motif A.
|
|
.para
|
|
Speed can be gained by arranging the order
|
|
of the motifs so that those higher up the list are of types that can be
|
|
searched for rapidly and that are also unlikely to be found.
|
|
.para
|
|
Motifs combined by the OR operator are alternatives: if any one of a set
|
|
of motifs
|
|
combined by the OR operator is found, then a match is declared. All
|
|
alternatives will be reported. For example if we had a pattern defined by
|
|
A
|
|
AND (B OR C), then all places where A occurs and B is found within range,
|
|
and all places where A is found and C is found within range will be
|
|
reported. A typical use would be where we might allow a motif to appear
|
|
on
|
|
either strand of the DNA sequence. For example a weight matrix
|
|
representing
|
|
the heatshock element could be used in a pattern which included
|
|
heatshock
|
|
as a motif class 4 combined using the OR operator
|
|
with heatshock as a motif class 5.
|
|
.para
|
|
The probability calculations are performed for each motif as it is
|
|
defined.
|
|
If an overall probability cut-off is given the calculation is repeated for
|
|
each match found. To achieve maximum searching speed do not give an
|
|
overall
|
|
probability cut-off. Overall cut-off scores should only be used if the
|
|
motif
|
|
classes used are compatible.
|
|
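.para
The text does not say how the probabilities are calculated. As a rough
guide only, the sketch below assumes that bases occur independently with
fixed frequencies, so that the probability of an exact match is the
product of the base frequencies and the expected number of matches is
that probability times the number of possible start positions. The
function names and the composition passed in are assumptions, and the
figures printed by the program may be derived differently.

```python
def exact_match_probability(motif, freqs):
    """Probability of the motif at one position, assuming independent bases."""
    p = 1.0
    for base in motif:
        p *= freqs[base]
    return p


def expected_matches(p, seq_len, motif_len):
    """Expected count over all start positions in a sequence of seq_len bases."""
    return p * (seq_len - motif_len + 1)
```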
.para
|
|
There are currently
|
|
several ways to display the matches: 1 = each
|
|
motif and its position is listed; 2 = all the sequence between the two
|
|
outermost motifs is listed; 3 = graphical, with a spike marking the
|
|
position
|
|
of the leftmost motif. The library versions also give entry names, and a
|
|
one
|
|
line title; in addition they can be used to produce aligned families of
|
|
sequences. When this mode of output is selected the program will write a
|
|
separate file for each match. The files will be called ENTRYNAME.DAT
|
|
where
|
|
ENTRYNAME is the name of the entry in the library. The matching
|
|
sequence
|
|
will be written out so that the spacing between motifs is constant, and
|
|
set to the maximum allowed by the pattern definition. Any gaps will be
|
|
filled with dashes (-). If the individual sequences were subsequently
|
|
written one above the other
|
|
they should line up so that all motifs are in register. There are two types of
|
|
output of this sort: one, option 4, writes out whole sequences, the other,
|
|
option 5, writes out only the sequences between the two outermost
|
|
motifs.
|
|
Note that for option 4 users are asked to type the position of the
|
|
first motif, and the reason for
|
|
this is explained below.
|
|
Consider a pattern found in several sequences. Consider only
|
|
the first motif in
|
|
the pattern and suppose that it was found in different positions in these
|
|
sequences.
|
|
Say that of these positions the one furthest from the left end was
|
|
position 100. Then, in order to ensure that all the sequences would align,
|
|
we must specify that motif 1 must start at position 100.
|
|
Any sequences in which motif 1 started
|
|
nearer to the left end than position 100 would be padded accordingly.
|
|
These modes of output
|
|
should only be used when the position of each motif is defined relative to
|
|
its
|
|
immediate neighbour.
|
|
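.para
The left padding described for option 4 can be sketched as follows. The
function name is hypothetical, and the use of dashes for the left
padding is an assumption carried over from the way gaps between motifs
are filled.

```python
def align_on_first_motif(entries, target=None):
    """entries holds (sequence, 1-based start of motif 1) pairs.

    Each sequence is padded on the left so that motif 1 starts at target,
    which defaults to the start furthest from the left end.
    """
    if target is None:
        target = max(start for _, start in entries)
    return ["-" * (target - start) + seq for seq, start in entries]
```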
.para
|
|
The pattern descriptions can be saved to files. These files
|
|
can be used instead of typing definitions again at the keyboard. As the
|
|
files are annotated,
|
|
they can easily
|
|
be changed using system editors, and the modified versions used to
|
|
define the variant patterns for the programs.
|
|
.para
|
|
Use of lists of entry names
|
|
.para
|
|
The two programs that operate on libraries have the ability to
|
|
restrict their searches to subsets of the libraries. This does not require
|
|
sublibraries to be created but instead is achieved by using files
|
|
containing a list of the entry names of sequences. The user may choose to
|
|
search only those entries on the list or, alternatively to search all but
|
|
those on the list (i.e. in the latter case
|
|
the list contains the names of those to be excluded).
|
|
The programs can search libraries that have indexes and those that
|
|
do not.
|
|
If a list of names for inclusion is used,
|
|
then the search will be faster if the index is present. In all other
|
|
circumstances the whole library will be read.
|
|
The list must be in library order except when it is used
|
|
to include entries, and an index is available.
|
|
The list must contain each entry name on a separate line, with the name
|
|
starting in column 1 of the line, i.e. there must be no spaces at the start
|
|
of the line.
|
|
The list of entry names
|
|
can be produced by the keyword searches of NIP, PIP, etc., as long
|
|
as the listings produced have a space character separating the entry name
|
|
from the entry description. This will depend on how well the library
|
|
reformatting programs work. For example SWISS-PROT entry names tend to run
|
|
into the beginning of the descriptions, but other libraries are generally
|
|
OK.
|
|
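.para
A list file in the required form could be derived from a keyword search
listing along these lines. The sketch assumes one entry per line with
the name starting in column 1 and separated from the description by at
least one space; the function name and the sample entry names used to
test it are made up.

```python
def entry_names_from_listing(lines):
    """Return the first word of every line that starts in column 1."""
    names = []
    for line in lines:
        # Skip blank lines and lines that begin with white space.
        if not line or line[0].isspace():
            continue
        names.append(line.split()[0])
    return names
```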
.para
|
|
One use of the programs is to look for patterns that we already know
|
|
about, but in new sequences. However it is hoped that they will also be
|
|
useful for finding new motifs. For example
|
|
several known control regions in
|
|
nucleic acid
|
|
sequences consist of particular direct or inverted repeats;
|
|
the inclusion of
|
|
direct and inverted repeats as motif classes
|
|
makes it possible to
|
|
find previously unknown
|
|
motifs of these types.
|
|
Using these new programs we can
|
|
ask questions like: "are there any inverted or direct repeats near to
|
|
sections of sequence that contain both a
|
|
CCAAT box and a TATA box?"; and to search for such things throughout
|
|
the
|
|
libraries. In addition, the mode of output in which all the sequence
|
|
between
|
|
the two outermost motifs found is printed out, allows us to extract
|
|
sequences and examine them in more detail for further common
|
|
subsequences.
|
|
For example we might want to collect together all the sequences
|
|
between
|
|
putative CCAAT and TATA boxes.
|
|
.para
|
|
A further use of the inverted repeat motif class is the following. If a
|
|
regulatory sequence in DNA is poorly defined but also an inverted repeat,
|
|
then it might be an advantage to specify it both as a consensus sequence
|
|
and
|
|
a superimposed inverted repeat. In this way two weak definitions can be
|
|
combined to produce a stronger pattern.
|
|
.para
|
|
Given only a few examples of a motif it
|
|
should be possible to perform initial searches using a
|
|
class 3 motif, and then, using plausible matching sequences, create a
|
|
more
|
|
specific weight matrix for the same motif.
|
|
.para
|
|
If motifs are combined with the first motif using the OR operator
|
|
they will be ignored until all
|
|
permutations that include the first motif have been looked for.
|
|
The whole search will then be repeated, in
|
|
turn, for each of
|
|
those motifs that are combined with the first motif using the OR
|
|
operator.
|
|
An interesting consequence of this is that the program can be used,
|
|
without
|
|
change, to compare any newly determined sequence with all known
|
|
individual
|
|
motifs. We achieve this by having a pattern in which all known relevant
|
|
motifs are combined using the OR operator.
|
|
If we ask to use this pattern with
|
|
a sequence, the program will automatically compare each individual
|
|
motif in
|
|
the pattern with the whole length of the
|
|
sequence. As the number of known
|
|
motifs grows this should become an increasingly useful standard
|
|
procedure.
|
|
.para
|
|
The NOT operator is obviously
|
|
useful for making sure particular motifs are not present, but it can also
|
|
be used to bracket the levels of matches found. We may want a degree of
|
|
match that lies between two limits - binding should occur, but not too
|
|
strongly; or base-pairs should form, but not too many. We can specify
|
|
this
|
|
by asking for a match with a low score, in combination with a match with
|
|
a
|
|
high score, both for the same motif, but with the high score included
|
|
using
|
|
the NOT operator.
|
|
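.para
The bracketing idea amounts to "reaches the low cutoff AND NOT reaches
the high cutoff". A minimal sketch, with a hypothetical function name
and with the scores supplied as a position-to-score table:

```python
def bracketed_hits(scores, low, high):
    """Keep positions whose score reaches low but does not reach high."""
    return [pos for pos, s in sorted(scores.items())
            if s >= low and not s >= high]
```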
.para
|
|
The algorithm is designed to find all sections of a sequence that satisfy
|
|
the pattern rather than only the best match.
|
|
Particularly if some of the motifs in a pattern are less well defined than
|
|
others, this can often result in the same region of a sequence being
|
|
reported as having several matches that vary only in the
|
|
positions of the weakest motifs.
|
|
.para
|
|
General remarks on motif searching
|
|
.para
|
|
Generally motifs are short subsequences that are thought to be
|
|
associated with
|
|
particular functions in some known sequences. Often
|
|
we search for them to try to
|
|
understand or interpret other sequences. Sometimes we search for
|
|
motifs and
|
|
patterns to
|
|
test a hypothesis about their role: are they found in the expected
|
|
positions in the expected sequences? In doing so we should remember
|
|
that, in both proteins and nucleic acids,
|
|
what we are really looking for is a particular
|
|
three dimensional structure with certain affinities for other structures,
|
|
and that we are assuming that the sequence of the motif alone
|
|
defines the 3D structure we are searching for.
|
|
The overall structure
|
|
may be completely different to those in which the motif is functional,
|
|
and
|
|
hence the motif may have a different shape or be inaccessible.
|
|
We should be aware of the
|
|
importance of the context in which a motif is found. Where does it lie
|
|
relative to the overall structure, is it accessible, is the three
|
|
dimensional spacing between
|
|
it and other motifs correct? For example, is it on the same side of the
|
|
double helix, and the correct distance from some other motif? How does
|
|
context affect our assessment of the significance of finding a motif?
|
|
Finding false mammalian mRNA splice junctions in non-coding sequences
|
|
is
|
|
far less important than finding false sites in pre-mRNA sequences, but
|
|
finding them in the correct places is most important! In other words, it
|
|
is
|
|
often the case that when we are searching for a motif that is known to
|
|
be
|
|
necessary for some function, then a positive result in the form of a
|
|
match
|
|
in the required position, is more important than a high background of
|
|
matches in the wrong positions. Being
|
|
able to write
|
|
down the probability of finding a motif in a random sequence tells us how
|
|
well it is defined.
|
|
In nucleic
|
|
acids the DNA may contain many superimposed types of information such
|
|
as
|
|
those concerned with histone phasing, protein coding or mRNA secondary
|
|
structure. These overlapping "codes" may interfere with one another
|
|
causing
|
|
matches to motifs to be poorer than expected.
|
|
In general we will only have a limited number of examples of the
|
|
motif and we do not know how representative they are.
|
|
.para
|
|
Sequences have superimposed functions: some parts may be of general
|
|
structural
|
|
importance and give rise to an overall framework, and other parts give
|
|
specificity and hence are not common; we may want to use a set of
|
|
aligned
|
|
sequences to define a motif, but want to use only the framework
|
|
positions.
|
|
Alternatively we may want to pick out
|
|
only those parts of a set of aligned sequences that give a particular
|
|
property, and to ignore other similarities that are due to some other
|
|
property
|
|
and which could obscure the pattern
|
|
we are interested in.
|
|
It is possible to apply a mask to a set of aligned sequences in
|
|
order to give weight to selected positions only.
|
|
The ability to define a mask allows certain positions
|
|
to be used in the motif and others to be ignored, and yet still permits the
|
|
use of a set of aligned sequences to calculate weights. The mask is
|
|
requested and applied
|
|
by the program and results in the masked positions being zero
|
|
in
|
|
the weight matrix. The mask is defined in the following way.
|
|
Suppose we had a motif of length 15, then the mask
|
|
x--x--xx-x will give zero weights to positions 2,3,5,6 and 9 (note it is
|
|
the dashes (-) that are significant and that positions
|
|
1,4,7,8,10,11,12,13,14 and 15
|
|
will be non-zero). Of course
|
|
the same set of sequences could be used with several alternative masks
|
|
in
|
|
order to extract different features and create corresponding weight
|
|
matrices.
|
|
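.para
The effect of a mask can be sketched as below, using the example mask
from the text. The function names and the list-of-columns layout are
assumptions. Positions beyond the end of the mask are left unchanged,
which is why positions 11 to 15 stay non-zero in the example.

```python
def zeroed_positions(mask):
    """1-based positions given zero weight: those marked with a dash."""
    return [i + 1 for i, ch in enumerate(mask) if ch == "-"]


def apply_mask(matrix, mask):
    """Return a copy of the matrix (a list of columns) with masked columns zeroed."""
    masked = []
    for i, column in enumerate(matrix):
        if i < len(mask) and mask[i] == "-":
            masked.append([0] * len(column))
        else:
            masked.append(list(column))
    return masked
```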
.para
|
|
The programs are described in Staden,R.
|
|
CABIOS 4, 53-60, 1988; Staden,R.
|
|
CABIOS 5, 89-96, 1989, and Methods in Enzymology 183, 193-211 (1990).
|
|
.left margin1
|
|
@ end of help
|